Notes ITSC
Notes ITSC
Notes ITSC
Computing
Cindy Orozco
Summer 2016
Abstract
These notes are designed for the course CME 108: Introduction to Scientific Com-
puting (MATH 114) during Summer 2016. In addition of lectures notes from previ-
ous editions of the course (given by Pr. Eric Dunham), this material follows these
references:
2 Root-finding algorithms 15
2.1 Closed Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.1.1 Bisection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.1.2 False Position . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.2 Open methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2.1 Fixed Point iteration . . . . . . . . . . . . . . . . . . . . . . . 19
2.2.2 Newton’s method . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.2.3 Secant method . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.3 Minimizing a function in 1D . . . . . . . . . . . . . . . . . . . . . . . 23
2.3.1 Golden Section . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.3.2 Parabolic Interpolation . . . . . . . . . . . . . . . . . . . . . 25
1
3.4 Linear Least Squares . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
5 Interpolation 60
5.1 Polynomial Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.1.1 Monomial Interpolation . . . . . . . . . . . . . . . . . . . . . 61
5.1.2 Lagrange interpolation . . . . . . . . . . . . . . . . . . . . . . 62
5.1.3 Newton Interpolation . . . . . . . . . . . . . . . . . . . . . . 64
5.2 Piecewise Polynomial Interpolation . . . . . . . . . . . . . . . . . . . 66
5.2.1 Runge’s Phenomenon . . . . . . . . . . . . . . . . . . . . . . 66
5.2.2 Piecewise linear interpolation . . . . . . . . . . . . . . . . . . 68
5.2.3 Cubic Spline Interpolation . . . . . . . . . . . . . . . . . . . . 69
5.3 Some additional comments . . . . . . . . . . . . . . . . . . . . . . . 70
6 Numerical Integration 72
6.1 Newton-Cotes Formulas . . . . . . . . . . . . . . . . . . . . . . . . . 73
6.1.1 Composite Newton-Cotes formula . . . . . . . . . . . . . . . 74
6.1.2 Error of Newton-Cotes formulas . . . . . . . . . . . . . . . . 75
6.2 Gaussian Quadrature . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
6.3 Some additional comments . . . . . . . . . . . . . . . . . . . . . . . 78
7 Numerical Differentiation 79
7.1 Numerical Differentiation using Taylor series . . . . . . . . . . . . . 80
7.2 Numerical Differentiation using Interpolation . . . . . . . . . . . . . 82
7.3 Richardson Extrapolation . . . . . . . . . . . . . . . . . . . . . . . . 84
2
9.2 Finite Differences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
9.2.1 Other types of Boundary Conditions . . . . . . . . . . . . . . 101
3
Chapter 1
Sometimes there is the misconception that the most reliable alternative to solve a
problem is to use math and computers. The principal issue is that if we do not keep
track of the sources of error, things can go terrible wrong. For example, in 1996,
the European Space Agency failed launching Ariane 5 rocket. The reason: 16-bits
integer overflow.
In order to keep track of what can go wrong solving a problem, first we need
to understand all the decisions made along the way, and more important all the
implications and limitations we are subject to. To illustrate the thinking process,
let us introduce the following example:
Well, if the carpentry classes were true, you would definitely know how much
wood you would need for the case, but probably you would not have any idea about
the appropriate chord length that achieves the precision of the pendulum. Therefore
let us try to model the clock
p as a simple pendulum. Given the drag coefficient b and
a natural frequency ω = L/g, where L is the arm length and g is the gravity , the
angle of oscillation θ made with the vertical in time t can be defined by the second
order differential equation:
4
dθ d2 θ
where we use θ̇ = dt and θ̈ = dt2
.
Notice that this is a nonlinear equation in θ, because of the term ω 2 sin(θ). If you
try a couple of minutes, you will notice it is really difficult to solve it analytically.
Therefore it is preferable to discretize it. This means that we will choose a finite
number of t where we will know the solution of θ. Using these values, we approximate
the derivatives using sums and divisions:
KΘ + ω 2 sin Θ = 0 (1.3)
Notice that solving this simple problem we used a lot different techniques: fi-
nite differences, numerical optimization and numerical linear algebra. Also this
illustrates the way problems are solved using Scientific Computing.
In Figure 1.1 we can see the process starting form a problem in the real world
to implementation. Related with each step, there is a different type of error (notice
the red text). There are three levels of relevance of this errors:
• Error relative to each field: The tolerance of our method needs to include
measurement errors and the consequences of assuming certain physical model.
5
Figure 1.1: Different steps in Scientific Computing
• Implementation errors: First we will always find bugs in the code. And second,
we cannot represent exactly numbers in the computer. This type of errors are
called roundoff errors.
Since we can not avoid any of this three type of errors, we will start by discussing
how to identify and quantify them. From now on we will use
|p̃ − p|
Absolute error: |p̃ − p| Relative error: (1.5)
p
Computers represent numbers using binary system. Each bit can take values
{0, 1}, similar to a light bulb that goes on/off. A set of 8 bits creates a byte, and
usually every language or architecture measure the amount of memory in bytes. We
cannot represent any number because we have a finite amount of memory. Addi-
tionally, in the case that memory was not a problem, we have finite time to make
6
computations. More used memory means more communication time and more com-
putation time.
• A sign ±
• An exponent e
7
• Relative error: If x can be represented in the system then
|x − f l(x)| ≤ |x|machine (1.9)
The propagated error is related with how the operations change the initial
error, i.e. if the error gets magnified or if we get catastrophic cancellation. For
example if x >> y then f l(x) ⊕ f l(y) = f l(x), and if x ≈ y then f l(x) f l(y) = 0.
8
You always need a trade-off between admissible loads (Robustness)
and cost (efficiency). Old buildings are designed to admit large loads
during a seismic event without any visible deformation. Nevertheless
their cost is 5 times more in comparison with a new building. This does
not mean new buildings will collapse in any earthquake, but they have
more probability to have visible damage
Notice that stability does not only depend on the method, but also depends on
the mathematical problem. If the problem does not behave “nicely”, then it does
not matter how stable is the method, we can not ensure to get a desirable solution.
Think when we are solving Ay = b and A is a singular matrix. Depending on b
we can have infinite solutions or zero solutions. How can the algorithm select the
solution we are looking for?.
9
To avoid these issues, it is better to work with well condition problems: “prob-
lems that behave nicely”.
• The solution depends continuously on the data. Small changes on the initial
condition represent small changes on the data.
lim pn = p (1.12)
n→∞
Once we know the algorithm converges to the correct result, then we want to
know the speed of convergence. For this we have two measures:
For α = 1 we talk about linear convergence and for α = 2 we talk about quadratic
convergence.
Notice that O(βn ) comes from asymptotic notation. Some useful definitions are
10
Definition 10. Let f, g functions in R (or N), and a defined limit L = ∞ or 0 or ....
We informally say:
Instead of working with any function, we know how to work with polynomials.
And instead of solving for infinite number of points, it is enough in practice to solve
for certain points. Of course, these two techniques introduce errors, but once we
know how their magnitude depends on the approximation, we can refine it until we
get acceptable results.
11
Theorem 1. Taylor’s Theorem If f ∈ C n+1 [a, b] (This means that f and all its
derivatives up to n + 1-th derivative are continuous in the interval [a, b]). then for
any x and x0 in [a, b], there exists ξ(x) between x and x0 such that:
n
X f (k) (x0 ) f (n+1) (ξ(x))
f (x) = (x − x0 )k + (x − x0 )n+1 (1.15)
k! (n + 1)!
k=0 | {z }
| {z }
Rn : Remainder
n-th Taylor Polynomial
Example 1. For example, what is the rate of convergence of (sin x)/x → 1 when
x → 0?
12
√
Assuming that xn → a when n → ∞ then
√
|xn+1 − a| 1 1
lim √ 2 = lim = √
n→∞ |xn − a| n→∞ 2xn 2 a
Therefore the order of convergence is quadratic. Notice that
√ 0 α=1
|xn+1 − a| 1
lim √ = 2 √a α = 2
n→∞ |xn − a|α
∞ α>2
Theorem 3. Mean Value Suppose f ∈ C[a, b] and f ∈ C 1 (a, b). There exists
c ∈ (a, b) such that
f (b) − f (a)
f 0 (c) =
b−a
.
Theorem 4. Rolle’s Apply the mean value theorem for the case f (a) = f (b).
Therefore there exists c ∈ (a, b) such that
f 0 (c) = 0
.
13
Figure 1.3: Mean value theorem
In almost every proof in Numerical Analysis we use one or many of this theorems.
Always the key step is related with integrating by parts, or approximating a function
with a Taylor’s series. Therefore, you would really want to remember these theorems,
not only for the course, but for sure for your career.
14
Chapter 2
Root-finding algorithms
Suppose that we want to find x such that f (x) = 0. There are two ways of solving
it:
Closed methods are robust, because we ensure that we are always reducing the level
of uncertainty, therefore they always converge. In contrast, open methods are not
that robust but they are efficient. Since they approximate the function, they pre-
serve more information of the model. If they converge, they converge faster than
closed methods. The problem is that we cannot bound the error at each iteration,
and that explains why they can diverge.
15
2.1 Closed Methods
2.1.1 Bisection
Notice that in general, our function can have more than one root. Therefore the
first step, regardless the method we are using, is to select an appropriate interval
[a, b] where the desired root is. We can do this plotting the function.
Also if we select this interval such that f (a) and f (b) have different sign, using
the Intermediate Value theorem there exists c ∈ [a, b] s.t. f (c) = 0. Therefore the
simplest method is to reduce the interval [a, b] such that we always preserve that
f (a) and f (b) have different signs. If we are only using the sign of f , the healthiest
option is to guess that xn = (an + bn )/2 the midpoint. Depending on the sign of
f (c) we update the interval, an we ensure that the error is at most the length of the
interval.
Set xl = a, xr = b;
while |xr − xl | > tol do
x = xr +x
2
l
if f (x) = 0 then
return x
end
if f (x)f (xl ) < 0 then
xr = x
else
xl = x
end
end
return x
Algorithm 1: Bisection Pseudo-code
Notice that at each iteration the maximum error is reduced by a half, therefore
if {xn } is the set of iterates and x is the real root, we know that x, xn ∈ [an , bn ]
then:
1 1 1
|xn − x| ≤ |bn − an | = 2 |bn−1 − an−1 | = ... = n+1 |b − a|
2 2 2
Therefore the rate of convergence is O(2n ). Additionally we can estimate an upper
bound for the maximum number of iterations we would take given a tolerance |xn −
x| < tol. We have that
b−a
n ≤ log2 −1 (2.1)
tol
16
Figure 2.2: Bisection for x2 + 3x
17
2.1.2 False Position
Notice that in bisection, we only use the sign of the function. What would happen if
we approximate the function with a line joining both extreme points?. This method
is called False Position. Therefore we update the interval, guessing xn is the root
of the straight line between f (a) and f (b)
Set xl = a, xr = b;
while |xr − xl | > tol do
−xl
x = xr − f (xr ) f (xxrr)−f (xl )
if f (x) = 0 then
return x
end
if f (x)f (xl ) < 0 then
xr = x
else
xl = x
end
end
return x
Algorithm 2: False Position Pseudo-code
Although that this looks like a better alternative, we cannot ensure we are re-
ducing the interval as much as 1/2. Therefore depending on the function, its con-
vergence can be faster or slower than Bisection. For example consider a function
that is almost flat and negative, and suddenly goes straight up near one of the end
points. In such a case, the interval reduction using false position is less than using
bisection.
Figure 2.4: When Bisection is not always worse than False Position
18
2.2 Open methods
2.2.1 Fixed Point iteration
Open methods are based in expressing the next guess in terms of the previous ones.
In general we have something of the form
xn+1 = g(xn ) (2.2)
Notice that if we reach the solution of f (x) = 0, we would like to stay there, i.e.
x = g(x) (2.3)
This means that the solution x is a fixed point of g.
Set p0 ;
p1 = g(p0 ); n = 1
while |pn − pn−1 | > tol and n < max iterations do
pn+1 = g(pn );
n = n + 1;
end
return pn
Algorithm 3: Fixed Point Iteration Pseudo-code
19
It is not easy to decide if any scheme and initial condition converges or not.
Nevertheless, we have a theorem that helps us in a large number of cases.
Theorem 7. Fixed Point Theorem If g ∈ C[a, b] and g ∈ C 1 (a, b) such that:
1. g : [a, b] → [a, b] “The range goes into the domain”
2. ∃k < 1 : |g 0 (x)| ≤ k < 1 for all x ∈ (a, b) “The function is not steep”
Then g has a unique fixed point p ∈ [a, b] and the sequence {pn = g(pn−1 )} starting
from any p0 ∈ [a, b] converges to p such that |pn − p| = O(k n )
Notice that if we have a function g that satisfies the two conditions of this the-
orem, we can say it converges because of the theorem. But of course there are
functions that does not satisfy this theorem and they also converge.
If |g 0 (p)| =
6 0, then we have linear order of convergence.
20
2.2.2 Newton’s method
Notice that if f 0 (xn ) 6= 0, we can approximate our function with the tangent line at
the point (xn , f (xn )). And we can approximate the root as the zero of the function
f (xn )
xn+1 = xn − (2.5)
f 0 (xn )
21
Figure 2.5: Newton for x2 + 3x
22
2.3 Minimizing a function in 1D
Up to now, we have talked about solving nonlinear functions in 1D. This means
that given a problem where we need to find x ∈ R such that it satisfies an equation,
we can solve it if we express it as f (x) = 0. We have talked about two strategies:
enclosing the root (using bisection) or approximating the model (using a fixed point
iteration, such as Newton). We have seen that the first type of method is robust,
whereas the second type is faster if it converges to the solution. And depending on
whether or not we know more information of the function (sign or value or deriva-
tive), our approximation has more properties or not.
Now let us talk about optimizing a function in 1D. In general, this means that
we need to find x̂ such that minimizes φ(x).
Recall that when we were finding the root of a function, we could identify if we were
approaching the solution, or at least we could measure the error based on evaluating
our guess xn in f (x). Since we were looking for x such that f (x) = 0, then we could
say that a point xn+1 was better than xn if |f (xn+1 )| < |f (xn )|, and moreover we
can stop if |f (xn )| is sufficiently small.
But when we are minimizing a function, certainly we can measure if f (xn+1 ) <
f (xn ) but we cannot define a stop criteria only based in the function value, since
probably we do not know if there exists another point with even smaller evaluation.
Therefore we need to evaluate the derivative at the point.
φ0 (x∗ ) = 0 (2.9)
Using the Taylor series we can approximate φ(x) using a quadratic model
φ00 (x∗ )
φ(x∗ + ∆x) = φ(x∗ ) + φ0 (x∗ )∆x + ∆x2 + O(∆x3 ) (2.10)
2
Therefore for sufficiently small ∆x, we can see that the approximation has a mini-
mum or a maximum depending on φ00 (x∗ ):
23
Definition 12. Given a critical point x∗
Therefore if we call f (x) = φ0 (x). To minimize φ(x) not only we need to find x̂
such that f (x̂) = 0 but also f 0 (x̂) > 0. Notice also that finding a global minimizer x̃
is much more difficult because we also require that φ(x) > φ(x̃) ∀x ∈ [a, b]. Therefore
unless we know our function is of certain type (i.e. convex) , we only look for local
minimizers.
One way is to solve f (x) = 0 for f (x) = φ0 (x) using either closed or open methods
and once we find x∗ such that f (x∗ ) ≈ 0, then we verify if f 0 (x∗ ) = φ00 (x∗ ) > 0.
24
Figure 2.7: “high, low, high” pattern, but how can we reduce the interval to choose
either the red or the green model.
Suppose we have the interval [an , bn ] and we have the internal points cn , dn . First
we would like to be symmetric, i.e.
cn − an = dn − bn
Second, you would like to reuse the point that is enclosed by the subinterval we
choose. Without loss of generality, let us assume we choose the subinterval [an , dn ]
and cn is the internal point. Therefore we would like
cn − an dn − an
=
dn − an bn − an
α(bn − an ) (1 − α)(bn − an )
=
(1 − α)(bn − an ) bn − an
α = (1 − α)2
0 = α2 − 3α + 1
√
1− 5 1
α=1+ =1−
2 ϕ
√
−1 + 5 1
1−α= =
2 ϕ
25
Set a, b
α = 1 − 1/ϕ
c = a + α ∗ (b − a); φc = φ(c)
d = a + (1 − α)(b − a); φd = φ(d)
while |b − a| > tol do
if φc < φd then
b=d
d = c; φd = φc
c = a + α(b − a); φc = φ(c)
else
a=c
c = d; φc = φd
d = a + (1 − α)(b − a); φd = φ(d)
end
end
return c
Algorithm 4: Golden Section Pseudo-code
the next iteration. When we were finding roots of the function, we introduce the
method of false position to estimate the next iterate approximating the function
with a straight line. Now, we can use a similar approach, an use three points to
estimate a quadratic model for the point.
Between 3 different points there exists a unique parabola passing through them.
If the general equation for the parabola is:
ψ(x) = αx2 + βx + γ
We want to find α, β, γ that defines the parabola between (xn−2 , f (xn−2 )), (xn−1 , f (xn−1 )),
(xn , f (xn )). Therefore it should satisfy:
f (xn−2 ) = αx2n−2 + βxn−2 + γ
f (xn−1 ) = αx2n−1 + βxn−1 + γ
f (xn ) = αx2n + βxn + γ
That if we write it in matrix form, we have
2
xn−2 xn−2 1 α f (xn−2 )
x2n−1 xn−1 1 β = f (xn−1 ) (2.11)
x2n xn 1 γ f (xn )
Solving this linear system, we can find the coefficients of the quadratic model, and
moreover, the minimum of it
ψ(x)0 = 2αx + β
If α > 0, we take xn+1 = −β/(2α), which is the minimum of the quadratic model.
26
Chapter 3
Each of the equations represents a hyperplane in Rm and the solution is the intersec-
tion of all of them. Notice that depending on the equations, there can be a unique
solution (only 1 point), infinite solutions ( an affine subspace of dimension bigger
than 1) or no solutions. We can also express the system as
a11 a12 ... a1m x1 b1
a21 a22 ... a2m x2 b2
Ax =
...
= =b (3.1)
... ... ... ... ...
an1 an2 ... anm xm bn
And using the geometric intuition, now we know that not every system Ax = b
where A ∈ Rn×m and b ∈ Rn has a unique solution in Rm .
27
We can also see the system as a linear combination of column vectors of A:
a11 a12 a1m
m
X a21 a22 a2m
aj xj =
... x1 + ... x2 + ... + ... xm = b
(3.2)
j=1
an1 an2 anm
range(A) = {y ∈ Rn |∃x : Ax = y}
null(A) = {x ∈ Rm |Ax = 0}
• If b ∈
/ range(A), i.e. b is not a linear combination of columns of A, then the
system does not have a solution.
• dim(null(A)) = 0
• range(A) = Rn
• det(A) 6= 0
28
We know that if A is invertible, we will have a unique solution. Otherwise if A
singular we will have either infinite solution or none.
Now the problem is that there exists matrices that are invertible, but are nearly to
be singular. For example consider
1+ 1
A=
3 3
But when = 0, A is singular and therefore we cannot longer compute A−1 . Small
perturbations, i.e. changes in (for example due to roundoff errors) produce huge
changes in the solution. This type of problems are ill conditioned problems (recall
the discussion in Chapter 1 about well conditioned problems and stability).
• Symmetric At = A
• Semi-Positive definite ∀x 6= 0, xT Ax ≥ 0
• Sparse matrix: most of its elements are zero. Instead of having O(n2 ) entries,
usually it has O(n).
29
Name Shape Explanation Properties
QR QR Q orthogonal Ax = b =⇒ Rx = QT b
R Upper triangular (using backward substitution)
30
3.1.4 Eigenvalues
In problems where we want to find stable solutions for physical systems, usually we
need to evaluate Ak . If A is dense, each matrix-matrix product cost O(n3 ) opera-
tions. Therefore computing A...A k times is too expensive (O(kn3 )), and in addition
we deal with roundoff errors.
Notice that to find eigenvalues of A is equivalent to find the roots of the charac-
teristic polynomial ρA (λ).
Ax = λx =⇒ ρA (λ) = det(A − λI) = 0
And we define to types of multiplicity for an eigenvalue:
• Algebraic: Multiplicity of root in ρA (λ)
• Geometric: dim(Eλ ) = {x ∈ Rn |Ax = λx}
And we know that Algebraic multiplicity ≥ Geometric multiplicity. If we have equal-
ity for all the eigenvalues, then A is diagonalizable.
31
Depending on A this system can be really easy or really hard to solve. For example
of A is diagonal, we only need to divide bj by ajj to find xj .
Another relatively easy type of matrices are the lower and upper triangular ma-
trices. For example with an upper triangular matrix in Rn×n , every variable xk
only depends on the following xk+1 , ..., xn−1 , xn . Therefore we can move backward,
starting from xn , then xn−1 , and so on. This method is called Backward Substitution.
In general, the goal of Gauss Elimination is first reduce any matrix A into an
upper triangular form and then solve the system using backward substitution. To
do this we use the following three rules:
−7/3
x3 = =7
−1/3
1 2
x2 = (2 − (−1)x3 ) = (2 + 7) = 6
3/2 3
1 1
x1 = (2 − (1)x2 − (0)x3 ) = (2 − 6) = −2
2 2
32
Notice that we can express the set of transformations as multiplying A by the
matrices
1 0 0 1 0 0
L1 = 1/2 1 0 L2 = 0 1 0
3/2 0 1 0 −11/3 1
Therefore
2 1 0
L2 L1 A = U = 0 3/2 −1
0 0 −1/3
Then
A = L−1 L−1 U = LU
| 1 {z 2 }
L
Where
1 0 0 2 1 0
L = −1/2 1 0 U = 0 3/2 −1
−3/2 11/3 1 0 0 −1/3
Notice that for the first step you perform Forward substitution and for the second
one Backward substitution.
33
for k = 1 : n − 1 do
for i = k + 1 : n do
lik = aakk
ik
for j = k + 1 : n do
aij = aij − lik akj
end
bi = bi − lik bk
end
end
Algorithm 5: Gauss Elimination Pseudo-code
for k = n : −1 : 1 do
sum = 0
for j = k + 1 : n do
sum = sum + akj xj
end
xk = bk −sum
akk
end
Algorithm 6: Backward substitution Pseudo-code
Notice that we can find the inverse of a matrix applying Gauss elimination to the
system
A I ∼ ... ∼ I A−1
Imagine that we do not have roundoff errors, therefore why don’t we just compute
the inverse of a matrix instead of computing the LU factorization. The answer is
because of the number of operations (flops). Let us compute the number of flops
for different methods. We count the sums and the products, and each of the sums
are the iterations inside the loops. Gauss Jordan refers to the method that instead
34
of reducing the matrix to a triangular form, leaves it as a diagonal.
n−1
X X n Xn
f lops(Gauss Elimination) = 3 + 2
k=1 i=k+1 j=k+1
n−1 n
!
X X
= (3 + 2(n − k))
k=1 i=k+1
n−1
X
= (n − k)(3 + 2(n − k))
k=1
n−1
X 3
=2 k2 + k
2
k=1
2
= n3 + O(n2 )
3
n
X n
X
f lops(Backward substitution) = 2 + 2
k=1 j=k+1
n
X
=2 (1 + n − k)
k=1
Xn
=2 k
k=1
= n2 + O(n)
35
Xn X Xn
f lops(Gauss Jordan) =
3+ 2
k=1 i = 1 : n j=k+1
i 6= k
Xn X
=
(3 + 2(n − k))
k=1
i=1:n
i 6= k
n
X
= (n − 1)(3 + 2(n − k))
k=1
n
X 3
= 2(n − 1) k+
2
k=1
= n + O(n2 )
3
n−1
X n
X n
X
f lops(Inverse) = 1 + 2n + 2
k=1 i=k+1 j=k+1
n−1 n
!
X X
= (1 + 2n + 2(n − k))
k=1 i=k+1
n−1
X
= (n − k)(1 + 2n + 2(n − k))
k=1
n−1
X
=2 k 2 + nk + k
k=1
5
= n3 + O(n2 )
3
As you can see, finding the Inverse or finding LU factorization are both O(n3 ). The
difference is in the coefficient in front. Finding the inverse is almost three times as
expensive as finding the LU.
Also notice that backward substitution (and therefore Forward substitution) are
O(n2 ). Therefore once we have the LU, solving (3.4) is really cheap. Think about
what is the number of flops of solving a diagonal or quasi-diagonal system (i.e. tridi-
36
agonal)!
3.2.2 Pivoting
Does this method work for any invertible matrix?
P A = LU
P AQ = LU
In the case of solving the system using this permutation matrices, we need to modify
x and b appropriately.
A = LDŨ
where Ũ is upper triangular, with diagonal entries equal to one, and we can get it
dividing each row Ui by the diagonal entry uii . Notice that if A is symmetric, we
can have a symmetric decomposition
A = LDLT
37
And moreover, if all the entries of D are > 0 we can take the square root and
A = LD1/2 D1/2 LT
A = GGT
where G is lower triangular matrix with positive entries in the diagonal. This
is called Cholesky factorization. If the matrix is far from being singular and
moreover it is symmetric positive definite (i.e. all the eigenvalues are positive and
far from zero), then the Cholesky factorization for A exists and the method is stable.
3.3 Conditioning
3.3.1 Vector and Matrix Norms
Definition 15. A norm is a function || · || : V → R such that:
• ||x|| ≥ 0; ||x|| = 0 ⇐⇒ x = 0
• ||αx|| = |α| ||x||
• ||x + y|| ≤ ||x|| + ||y|| ∀x, y ∈ Rn
Some typical vector norms defined in Rn are
1/2
√ Xn
||x||2 = xt x = |xj |2
j=1
n
X
||x||1 = |xj |
j=1
1/p
n
X
||x||p = |xj |p
j=1
38
And as a nice fact, we can even define an Inner Product between matrices based on
this norm:
hA, Bi = tr(AT B)
On the other hand we can see a Matrix as operator : Rn → Rm . This means we can
analyze how the matrix transforms the vectors from one space to the other. In this
way we can define the Induced norm:
||Ax||p
||A||p = sup for 1 ≤ p ≤ ∞ (3.6)
x6=0 ||x||p
Some of the typical induced norms are:
X X
||A||1 = max |aij | ||A||∞ = max |aij | ||A||2 = max |σi |
j i
i j
where σi corresponds to the singular value, that are the same as the diagonal terms
in Σ in the SVD, singular value decomposition. Also
q
||A||2 = ρ (AAT ) where ρ(M ) = max |λi | is the spectral radious
λi ∈Λ(M )
In the case of linear systems, let us understand where the error comes from.
First let us define the residual (that we can measure in reality, without knowing the
real solution) as:
r = b − Ax̃ = Ax − Ax̃ = A(x − x̃)
then the absolute error is:
||x − x̃|| = ||A−1 r|| ≤ ||A−1 || ||r||
If we prefer to consider the relative error, and using that ||b|| ≤ ||A|| ||x||, we have
||x − x̃|| ||r||
≤ ||A|| ||A−1 || (3.7)
||x|| ||b||
Here we see two sources from error:
39
• The method and how the residual can change from method to method ||r||
• And the problem itself regardless of the method we use.
In this setting, choosing a specific norm, we define the condition number of a
matrix as
κ(A) = ||A|| ||A−1 || (3.8)
Notice that we have the following properties:
• Using the sub-multiplicative property 1 = ||I|| = ||AA−1 || ≤ ||A|| ||A−1 ||
• Using that for orthogonal matrices QT = Q−1 and ||Q|| = ||QT || then 1 =
||Q||||Q−1 ||
• And using the SVD decomposition
σmax
k2 =
σmin
Notice that from (3.7), if a matrix has a large condition number, the bound in the
relative error of Ax = b is bigger, therefore we would need a robust method. If you
want to explore more about condition number I recommend to read Ascher section
5.8 or Bradie section 3.4, or for deeper understanding a Numerical Linear Algebra
book such as the one from Trefethen and Bau.
40
Figure 3.1: Data Fitting problem: Let us suppose we have the data series
{(ti , pi )} and we want to see how the model f (t) = α0 + α1 t + α22 t2 approximates
the price.
Therefore in linear least squares, we are looking for x such that minimizes
1
min ||b − Ax||22
x∈Rm 2
Let us call
1
ψ(x) = ||b − Ax||22
2
1
= (b − Ax)T (b − Ax)
2
1 T
b b − 2bT Ax + xT AT Ax
=
2
1 1
= xT AT Ax − (AT b)T x + bT b
2 2
41
In this case to have a minimum, as in 1D we need a critical point, (i.e. ∇ψ = 0)
and a second derivative requirement (∇2 ψ is positive definite).
If it is not easy to see how are the derivatives from ψ let us see it by components:
n m
!!2
1X X
ψ(x) = bi − aik xk
2
i=1 k=1
42
Since AT A is positive definite, then now we can solve the least squares problem as
finding x such that
AT Ax = AT b (3.12)
And they are called the normal equations, because they satisfy that
AT (Ax − b) = AT r = 0
Therefore we can also define a condition number for rectangular matrices A ∈ Rn×m
and in 2-norm is equal to
σ1
κ(A) =
σm
Therefore if we solve (3.12) using Gauss elimination, or in this case Cholesky Fac-
torization (since AT A is symmetric positive definite), then the error will not depend
on κ(A) but on κ(AT A). And from the SVD we can see that
σ12
κ(AT A) = 2
= (κ(A))2
σm
Therefore if the condition number of A is large, when we solve using normal equa-
tions we duplicate, therefore it is more ill conditioned.
43
Therefore the only way to reduce the residual is to solve exactly for the first
m equations of R. And the condition number of R is comparable with the
one of A. As an additional comment when we have A ∈ Rn×m n! = m
full column rank, and b ∈ Rn , if we write in MATLAB A\b, it returns the
least squares solution using QR factorization. Check the last part of http:
//www.mathworks.com/help/matlab/ref/mldivide.html .
As we can always expect, as the method gets more robust, the cost of computing it
increases. Therefore depending on the problem and the conditioning of A, we would
like to use a different method.
44
Chapter 4
The first remark we should do is that in general we can not generalize closed meth-
ods in this multidimensional setting, because we do not have a graphic picture of
the function neither basic conditions on the neighborhood of a root that should been
satisfied. Therefore we only have open methods. Here we will talk about Newton’s
method.
Definition 16. Let x = [x1 , ..., xm ]T ∈ Rm and F = [f1 , ..., fn ]T with bounded
derivatives up to order at least 2, then for any direction p = [p1 , ..., pm ]T ∈ Rm the
45
Taylor expansion of F is
Let us start for the case when n = m. Then starting at x(0) we want to find x∗
such that F(x∗ ) = 0. We can start approximating F with the Taylor series to find
x(1) . Let x(1) = x(0) + p(0) then
But since we would like F(x(1) ) = 0 then we can choose p0 such that
Notice that we get a linear system of equations that we can solve using any of the
methods we discussed in Chapter 3.
As stopping criteria you can choose the absolute error between iterations, the relative
error, the norm of the function value, ....
Example 5. (See Ascher Section 9.1. Example 9.1) Consider the intersection be-
tween the circle centered at the origin with radius of 1 x21 + x22 = 1 and the parabola
x2 = (x1 − 1)2 . We want to find the points such that the 2 curves intersect.
Solution. Notice that we can express the problem as:
x21 + x22 − 1
x1 2
Find x = ∈ R s.t. F(x) = 0 where F = (4.7)
x2 (x1 − 1)2 − x2
46
Set x(0) ; k = 0;
while abs error(x(k) ) > tol and k < max iterations do
(k) (k) (k) (k)
Solve for p in J x p = −F x ;
x(k+1) = x(k) + p(k) ;
k = k + 1;
end
return x(k)
Algorithm 7: Multidimensional Newton Pseudo-code
(0) (0)
And therefore starting from x(0) = (x1 , x2 ) we have that
x21 + x22 − 1
(k) 2x1 2x2 (k)
x(k+1) = x(k) + p where p =−
2(x1 − 1) 1 (x1 − 1)2 − x2
As in the 1D case, we need to be near the root to have convergence. And now we do
not have any other method to combine with to ensure convergence. Nevertheless we
can use p(k) as a direction instead of just a solution and find the best α such that
is a good guess. These are called Line-search methods and they are a keystone
in Numerical Optimization.
On the other hand, notice that at each step we need to compute the Jacobian
matrix at x(k) , i.e. J x(k) , and depending on F this can be a very expensive
process. Therefore, we would like to approximate J with a matrix B (k) that we
(k)
update at each iteration in order to be near the value of J x . These are called
Quasi-Newton methods. One of them is the Broyden’s method:
47
4.2 Non-Linear Least Squares
In section 3.4, we talked about minimizing a specific type of problems: linear least
squares problems, where we had a matrix A ∈ Rn×m with n > m and we want to
find
1
minimize ||Ax − b||22
x 2
In most of the cases Ax represents a model linear in the parameters to approximate
b, i.e. if x = (α1 , ..., αm ) then the model is of the shape
Now let us consider the case where the model is not linear in the parameters. For
example the logistic model for population growth:
N0 Kert
N (t) =
K + N0 (ert − 1)
In the logistic model, the parameters are: N0 the initial population, r the growth
rate, K the carrying capacity. And it is clear that N (t) is not linear in the pa-
rameters. Let us define x = (x1 , x2 , x3 ) = (N0 , r, K), and let us suppose that we
have 1 ≤ i ≤ n measurements of bacteria population (ti , Ni ). Therefore we have the
system of equations
x1 x3 ex2 t1
= N1
x3 + x1 (ex2 t1 − 1)
x1 x3 ex2 t2
= N2
x3 + x1 (ex2 t2 − 1)
... = ...
x1 x3 ex2 tn
= Nn
x3 + x1 (ex2 tn − 1) |{z}
| {z } b
g(x)
48
Following the same steps we did in the linear case, let us call
1
ψ(x) = ||g(x) − b||22
2
1
= (g(x) − b)T (g(x) − b)
2
1
= g(x)T g(x) − g(x)T b + bT b
2
And in component-wise
n
1X
ψ(x) = (gi (x) − bi )2
2
i=1
where A(x) is the Jacobian of g(x). Notice that the Hessian of ψ component-wise
is given by:
n
∂2ψ ∂ X ∂gi
= (gi (x) − bi )
∂xl ∂xj ∂xl ∂xj
i=1
n
X ∂gi ∂gi ∂ 2 gi
= + (gi (x) − bi )
∂x ∂xl ∂xl ∂xj
i=1 | j{z } | {z }
AT (x)A(x) L(x)
Therefore
49
Notice that to solve (4.9) we need to solve a non-linear system of equations. If we
call φ(x) = AT (x)(g(x) − b), then we want to find x such that φ(x) = 0. Therefore
we can solve it using Newton, starting from an initial guess, and finding a new p
such that φ(x + p) ≈ 0. We use the Taylor series of φ(x) and we find
Jφ (x0 )p = −φ(x0 )
∇2 ψ(x0 ) = −∇ψ(x0 )
AT (x0 )A(x0 ) + L(x0 ) p = −AT (x0 )(g(x0 ) − b)
Notice that this equation is a bit complicated because it involves the function g with
its first (in A(x)) and second derivatives (in L(x)).
• Since there is always a difference between Newton and Gauss-Newton (i.e L),
then Gauss-Newton does not have quadratic convergence.
50
Set x(0) ; k = 0;
while abs error(x(k) ) > tol and k < max iterations do
Solve the linear least squares problem
1
minimize ||A(x(k) )p + (g(x(k) ) − b)||22
p∈R m 2
in finding x∗ such that there exists δ > 0 that for any y with ||y − x|| < δ then
φ(x∗ ) ≤ φ(y)
This is a univariate function in α and we can find its minimum using any of the
methods described in chapter 2. Once we find α∗ we can update our guess as
(1) (1)
x(2) = (α∗ , x2 , ..., xn ) and now we can optimize in the second component x2 and
so on. Once we reach xn , we start again in x1 and we repeat the process until we
cannot minimize any more. This method is called Coordinate descent.
Consider the two functions in Figure 4.1. They are exactly the same function and
the same initial guess. The only difference is that we rotated our coordinate system.
For the fist one we get the minimum in one iteration whereas in the second one we
get in two. But probably if we selected in a better way our search direction, we
could find the minimum in both cases in the same number of iterations.
To choose this direction, let us consider the Taylor series for a function φ : Rn →
R
51
Figure 4.1: Minimize φ(x, y) = x2 + y 2 choosing one direction at a time for different
initial conditions. How can we just do everything in one step?
52
all directions p such that ||p|| = 1 the biggest descent is
p = −∇φ(x) (4.16)
normalized. When we choose the direction p that satisfies (4.16) then the method
is called steepest descent or gradient descent method.
Now if we try in our functions of Figure 4.1, we take only one iteration if we use
steepest descent direction. But now let us consider Figure 4.2. We have the same
function and we are choosing two different initial conditions, both at the same level
curve of φ. We optimize choosing the steepest descent direction. In the first case
we get the minimum in only one iteration, whereas in the second case we start to
zig-zag towards the minimum. Why is this happening?
Figure 4.2: Minimize φ(x, y) = x2 +4y 2 choosing one direction at a time for different
initial conditions. How can we just do everything in one step?
∇φ(x) = 0 (4.18)
53
Definition 19. Given a critical point x∗ , then
Therefore we just need to find p∗ solving a linear system of equations (4.19). With
this direction, we only take one iteration in both function in Figure (4.2).
Notice that if we consider the Taylor series of the gradient of φ, let G = ∇φ using
(4.2) we have that for x, p ∈ Rn if φ is smooth enough then:
which is indeed the same equation we got from approximating our function with a
quadratic model from Taylor’s series.
54
Notice that if we assume that ∇φ(x(k) ) is positive definite, we can solve (4.19)
using a Cholesky Factorization. Therefore one of the methods to solve (4.19) when
we do not know if ∇φ(x(k) ) is to modify its diagonal until a Cholesky factorization
exists. This method is called Modified Cholesky and it is one of the most used
methods.
2. Minimize your function in the direction p(k) . Consider f (α) = φ(x(k) + αp(k) ).
This is the restriction of your function to the direction p(k) , and f : R → R.
Therefore we can apply any of the techniques of chapter 2 to find α(k) such
that minimizes f (α). Then we update the guess as
x(k+1) = x(k) + α(k) p(k)
With this method we construct a sequence of points {x(1) , x(2) , x(3) , ...} such that
φ(x(1) ) ≥ φ(x(2) ) ≥ φ(x(3) ) ≥ ... and it converges to a local minimum of φ.
Any minimization method based on this two steps is called a Line Search
method. And as you can expect, there is always a trade off between finding a de-
scent direction and optimizing the function in that direction.
For example, steepest descent has linear rate of convergence if we solve exactly
the 1D minimization problem at each step, whereas Newton’s method has quadratic
rate of convergence if we always take α(k) = 1 and we are sufficiently close to the real
minimum. Nevertheless, steepest descent only requires to know the first derivative
of the function and Newton’s method requires first and second derivative for every
direction.
55
Example 6. Consider the function φ(x1 , x2 ) = (x1 + x2 )2 + 4(x1 − x2 )2 . How is
the first iteration with each of the methods if x(0) = (−1, −2)?
Solution. Let us compute the gradient and the Hessian matrix of φ:
f (α) = φ((−1, −2) − α(2, 0)) = ((−1 − 2α) − 2)2 + 4((−1 − 2α) + 2)2
56
• Newton’s method We have that:
10 −6 (0) −2
p = −∇φ(−1, −2) =
−6 10 14
57
Therefore if we want Bk+1 being a nice approximation of ∇2 φ(x(k+1) ) then it
should satisfy (4.21) i.e.
Bk+1 sk = yk (4.22)
Bk sk sTk Bk yk ykT
Bk+1 = Bk − + (4.24)
sTk Bk sk ykT sk
– DFP method It applies the same update as the BFGS to the inverse
of the Hessian instead of the Hessian itself. This means that if we call
Bk−1 = Hk then DFP update is
Hk yk ykT Hk sk sTk
Hk+1 = Hk − + (4.25)
ykT Hk yk sTk yk
(yk − Bk sk )(yk − Bk sk )T
Bk+1 = Bk + (4.26)
(yk − Bk sk )T yk
How much effort should we invest in finding the minimum of f (α) = φ(x(k) + αp(k) )
if we are far from x∗ ? Do we have an initial guess for α? Notice that if we are
considerably far from the minimum, it is not worthy to put a lot of time in finding
the minimum in certain direction if that does not bring me sufficiently closer to x∗ .
This means that most of the times, the methods used in finding the minimum are
not the most sophisticated. For example we can use backtracking, i.e. start with
αmax and iterate using αi = (1/2)i αmax until we find a reasonable minimum. Other
58
methods use bisection or parabolic interpolation with loose tolerances.
Also notice that for steepest descent and coordinate decent we do not have a
clear idea of the size of α because it comes from a linear model where the optimal
α is ∞. In contrast, Newton’s method comes from a quadratic model, where the
optimal α is 1. Therefore we have an initial guess for Newton’s method, and also
Quasi-Newton methods, but not for steepest descent or coordinate descent. In more
general cases, when the problem also has constraints (i.e. another set of conditions
that the solution should satisfy), then αmax is designed in order to satisfy all the
constraints.
• Goldstein Condition To avoid small α and also divergence: For c ∈ (0, 1/2)
φ(x(k) )+(1−c) α ∇φ(x(k) )T p(k) ≤ φ(x(k) +αp(k) ) ≤ φ(x(k) )+c α ∇φ(x(k) )T p(k)
For Newton’s method, we can take c = 1e − 4, but for steepest descent it depends
on the problem. Sometimes it should be around c = 0.1.
59
Chapter 5
Interpolation
In general, when we talked about interpolation, we are looking for a function of the
form
such that for our given set of data (x0 , y0 ), ..., (xn , yn ) satisfies that
60
If we fixed the basis functions φi , in order to know the coefficients ci we need to
solve a linear system of n equations and n unknowns:
φ0 (x0 ) φ1 (x0 ) . . . φn (x0 ) c0 y0
φ0 (x1 ) φ1 (x1 ) . . . φn (x1 ) c1 y1
= (5.4)
... ... ... ... ... ...
φ0 (xn ) φ1 (xn ) . . . φn (xn ) cn yn
When we talk about polynomial interpolation, we choose the basis functions to be
polynomials. But since the polynomials are a Vector space, we can take different
set of basis functions to represent the same polynomial interpolation. For example:
• Monomial Interpolation if we choose φi (x) = xi
x−x
• Lagrange Interpolation if we choose φi (x) = j6=i xi −xjj
Q
Q
• Newton Interpolation if we choose φi (x) = j<i (x − xj )
Of course, with each choose of basis, the coefficients of the polynomial will change,
but the result will be the same.
1 xn . . . xnn cn yn
| {z }
Vandermonde matrix
And therefore if all the x entries are different, then the determinant is different
from zero and X is invertible. Then there is a unique solution for the coefficients.
This means there is a unique polynomial of order n that passes exactly through
n + 1 points with different abscissas. Therefore it does not matter which basis we
are working with, we will get the same polynomial. The only difference is that the
representation in some basis is easier to understand than in others.
61
Which are the drawbacks of the monomial representation
• X is often ill-conditioned, and solving the system with Gauss elimination takes
2/3 n3
• The coefficients do not give a lot of insight of the behavior of the interpolant.
For example notice that it is easier to imagine
f (x) = (x − 3)3
than
f (x) = x3 − 9x2 + 27x − 27
Recall that we are looking for an interpolation that given (x0 , y0 ), ..., (xn , yn ), satis-
fies
yi = f (xi ) = c0 φ0 (xi ) + c1 φ1 (xi ) + · · · + cn φn (xi )
For example if we consider the basis functions φi (x) = Li (x) such that
(
1 i=j
Li (xj ) = (5.7)
0 i 6= j
yi = c0 L0 (xi ) + c1 L1 (xi ) + · · · + ci−1 Li−1 (xi−1 ) + ci Li (xi ) + ci+1 Li+1 (xi+1 ) + · · · + cn Ln (xi )
= c0 ∗ 0 + c1 ∗ 0 + · · · + ci−1 ∗ 0 + ci ∗ 1 + ci+1 ∗ 0 + · · · + cn ∗ 0
= ci
Let us think the case when n = 2, then we have (x0 , y0 ), (x1 , y1 ), (x2 , y2 ). To con-
struct L0 (x) we need:
• L0 (x0 ) = 1
62
That satisfies the first condition. For the second condition, we need to find the value
of a such that L0 (x0 ) = 1. Therefore
1
1 = L0 (x0 ) = a(x0 − x1 )(x0 − x2 ) =⇒ a =
(x0 − x1 )(x0 − x2 )
And similarly we can solve for L1 (x) and L2 (x) to get:
(x − x1 )(x − x2 ) (x − x0 )(x − x2 ) (x − x0 )(x − x1 )
L0 (x) = L1 (x) = L2 (x) =
(x0 − x1 )(x0 − x2 ) (x1 − x0 )(x1 − x2 ) (x2 − x0 )(x2 − x1 )
Now for a general n we can define the Lagrange polynomials Li (x) as
Y (x − xj )
Li (x) = (5.8)
(xi − xj )
j6=i
There is another expression for (5.9) that makes it easier to compute using the
barycentric weights wi . Define
Y 1
ρi = (xi − xj ) wi = (5.10)
ρi
j6=i
63
5.1.3 Newton Interpolation
What if we want to an adaptive algorithm where we can add point
one by one?
Sometimes, all the interpolation points are not available when we start to interpolate
them. Imagine an experiment where we collect few points, we see how is the result,
and we measure more points to add to the interpolation. When we compute the
coefficients for the new polynomial, if we use either Lagrangian or monomial basis,
we need to recompute all the values, we cannot reuse the information we had before.
This means
φ0 = 1, φ1 = (x − x0 ), φ2 = (x − x0 )(x − x1 ), φ3 = (x − x0 )(x − x1 )(x − x2 ), . . .
Therefore we will look for the coefficients ci such that
yi = pn (xi ) = c0 φ0 (xi ) + c1 φ1 (xi ) + · · · + cn φn (xi )
Therefore notice that if pn interpolates {(x0 , y0 ), . . . , (xn , yn )}, then pn+1 that in-
terpolates all the previous n points plus an extra one (xn+1 , yn+1 ) is given by:
pn+1 (x) = pn (x) + cn+1 (x − x0 )(x − x1 )...(x − xn )
How can we compute the coefficients? Divided Differences
Since the definition of the polynomial is recursive, we can expect that the definition
of the coefficients is recursive. Notice that
1. f (x0 ) = c0
2. f (x1 ) = c0 + c1 (x1 − x0 ). Therefore:
f (x1 ) − f (x0 )
c1 =
x1 − x0
3. f (x2 ) = c0 + c1 (x2 − x0 ) + c2 (x2 − x0 )(x2 − x1 ). Therefore:
f (x1 ) − f (x0 )
f (x2 ) = f (x0 ) + (x2 − x0 ) + c2 (x2 − x0 )(x2 − x1 )
x1 − x0
f (x1 ) − f (x0 )
f (x2 ) − f (x1 ) = f (x0 ) − f (x1 ) + (x2 − x0 ) + c2 (x2 − x0 )(x2 − x1 )
x1 − x0
f (x1 ) − f (x0 )
f (x2 ) − f (x1 ) = (x2 − x1 ) + c2 (x2 − x0 )(x2 − x1 )
x1 − x0
f (x2 )−f (x1 ) f (x1 )−f (x0 )
x2 −x1 − x1 −x0
= c2
x2 − x0
64
4. . . .
So in general, we can define the ith coefficient as the ith divided difference where:
c0 = f [x0 ] = f (x0 )
f (x1 ) − f (x0 ) f [x1 ] − f [x0 ]
c1 = f [x0 , x1 ] = =
x1 − x0 x1 − x0
f (x2 )−f (x1 ) f (x1 )−f (x0 )
x2 −x1 − x1 −x0 f [x1 , x2 ] − f [x0 , x1 ]
c2 = f [x0 , x1 , x2 ] = =
x2 − x0 x2 − x0
... = ...
f [x1 , . . . , xi ] − f [x0 , . . . , xi−1 ]
ci = f [x0 , x1 , . . . , xi ] = (5.13)
xi − x0
Therefore we can construct a table i, xi , f [xi ], f [xi , xi+1 ], . . . and so on, and each
time we add a new value to interpolate, we just add a new extra row to the table.
How can we estimate the error of the interpolation?
Notice that the divided differences look like approximations of higher derivatives,
indeed we have the following theorem
Theorem 8. Let f a function with k bounded derivatives in the interval [a, b] and
let z0 , ..., zk k + 1 distinct points, then there is ξ ∈ [a, b] such that
f (k) (ξ)
f [x0 , ..., xk ] = (5.14)
k!
So now let us consider the points {(x0 , f (x0 )), ..., (xn , f (xn ))} and pn (x) the
interpolant. If we want to compute the error at x:
en (x) = f (x) − pn (x) (5.15)
We can construct an interpolant for the points {(x0 , f (x0 )), ..., (xn , f (xn ))}∪{(x, f (x)))}
given by
n
Y
pn+1 (y) = pn (y) + f [x0 , ..., xn , x] (y − xi )
i=0
65
Name Basis Construction cost Properties
Monomial {xi }ni=0 o 2/3n3 Intuitive, Prove existence
x−xj n
nQ
Lagrange j6=i xi −xj n2 Stable, ci = yi
nQ oi=0
n
i−1
Newton j=0 x − xj 3/2n2 Adaptive, compute the error
i=0
As you can see, as we increase n, the interpolant has more and more oscillations
at the tails of the interval, and the error is concentrated there. Adding more points
only makes the situation worse.
66
Now let us use a different set of x-coordinates called the Chebyshev points.
This coordinates correspond to the roots of the Chebyshev polynomials in the in-
terval [−1, 1] and they are given by:
2k − 1
xk = cos π (5.18)
2n
In order to scale them to the interval [−5, 5] we multiply them by 5, and in this case
we also include the beginning and the last point of the interval.
Now when we plot the interpolant of this new set of {(xk , f (xk ))}, we do not
have as many oscillations as before and the error keeps bounded.
Why does this happen?
This is called the Runge’s phenomenon and illustrates why we usually do not
consider polynomial interpolations of high degree unless the set of x coordinates is
really special. If we recall the equation (5.17), the error depends on the maximum
of the k-th derivative of the function, and also, on the distribution of the points, i.e.
n
Y
max (s − xi ) (5.19)
a≤s≤b
i=0
67
In particular, the Chebyshev points properly scaled minimizes (5.19), and that ex-
plains the different behavior between Figures 5.1 and 5.2.
In the previous section, we saw that interpolating a small set of data ends up in a
small order polynomial, and everything was well behaved. Therefore the idea now
is to split the data into small chunks, and interpolate each piece separately.
In one hand, notice that in the constant interpolation we took 1 point, in the
linear interpolation we took 2 points, and therefore we can increase the order of
the polynomial taking more points to define 1 curve (bigger chunks). For exam-
ple if we want to construct piecewise quadratic interpolation, we can take
x2i−1 , x2i , x2i+1 , and construct a parabola between them. The only problem with
68
this approach is what happens in the break points. In the case of the parabola, it
is C 2 in (x2i−1 , x2i +1 ) but the derivative is not continuous in x2i−1 and x2i+1 .
The second approach is to work with the same chunks (only two points defining
each polynomial) and not only impose continuity in each xi but also continuity of
the derivatives at each xi .
69
Therefore if we have n + 1 points, we have n splines, each one with 4 unknowns,
and up to now we have 2n + 2(n − 1) equations. Therefore we need to impose 2
additional conditions that can be:
• Free boundary or Natural spline:
φ00 (x0 ) = φ00 (xn ) = 0
• Not-a-knot
φ000 (x0 ) = φ000 (x1 ), φ000 (xn−1 ) = φ000 (xn )
As you can see, in this case all the 4n equations are coupled, therefore the interpo-
lation although it looks like local, it is not in reality. To solve for the coefficients,
we need to construct a linear system Ax = b, where A is a tridiagonal matrix. Once
we find the coefficients, we get a spline that is C 2 in all the interval (x0 , xn ).
In the case of piecewise-linear functions, the basis functions are called hat func-
tions, and they are really famous in Finite Elements method to solve differential
equations. As the Lagrange polynomials they satisfy φj (xi ) = δi,j , but the differ-
ence is that they have compact support (i.e. they are different from zero in a small
interval). They are defined by
x−xj−1
xj −xj−1 xj−1 ≤ x ≤ xj
x−xj+1
φj (xi ) = xj −xj+1 xj ≤ x ≤ xj+1 (5.23)
0 otherwise
The splines also have related basis functions called B-splines, and they are the basis
of Computer-aided Design (CAD), and sometimes they are also used for solving
differential equations.
70
What if my curve is too complicated or I want to interpolate points
in more dimensions?
In those we use parametric curves. Therefore if we have the data {(xi , yi )}, we
include a new variable τi that parametrize the curve. Then we solve two separate
interpolation problems with the data {(τi , xi )} and {(τi , yi )}, and then the resulting
curve will be (x(τ ), y(τ )).
There is not a simple answer in this case. As we saw in chapter 4, we can formulate
any model, and we can fit the data solving a least squares problem. In such a case,
we can choose the order of the polynomial that we want, and fit as many points
we have, and there is not a correspondence between points and polynomial degree.
The problem (or sometimes the advantage) of this approach is that the model is
not ensure to pass exactly through all the points, but, the residual error is reduced.
Therefore if we have some experiment data, that we know it has measurement er-
rors, the best option is to go for least squares, and it is up to you which model to
select.
But then, why does interpolation exist? Well, not every model wants to fit
experimental data. Sometimes you are sure your function has to satisfy certain
points. And moreover, you are interested in the integral or the derivative of your
function. Think in the case of a really complicated function, that you want to
estimate its integral in certain interval. Then you would like to take as many points
as you can in order to approximate that interval. Therefore you can use a global
interpolation using “special” points, such as Chebyshev, or you can use piecewise
interpolation, such as constant, linear or cubic.
71
Chapter 6
Numerical Integration
The idea behind numerical integration is to remember that if f is nice enough, the
integral of f in an interval is the limit of Riemann sums. Therefore if we want to
approximate an integral, it sounds reasonable to propose the following:
Z b n
X
I= f (x)dx ≈ ωi f (xi ) (6.1)
a i=0
Where xi are called the abscissas, and ωi the weights. This form is called a quadra-
ture rule.
When we were interpolating points {(xi , f (xi )}ni=0 , we chose certain basis functions
such that
Xn
f (x) ≈ pn (x) = f (xi )φi (x)
i=0
For example in the case of polynomial interpolation φi (x) = Li (x) the Lagrange
polynomials, and in the case of piecewise linear interpolation φi (x) were the hat
functions.
72
6.1 Newton-Cotes Formulas
To introduce the different Newton-Cotes Formulas, let us follow the same steps from
Polynomial Interpolation. First we have a function, and let us choose a set of n + 1
points in the interval [a, b]. Once we have the points {(xi , f (xi ))}ni=0 , then we can
find a unique polynomial of order n that passes through all the points.
b+a
For example, if n = 0 and we choose x0 = 2 ,
then
b+a
p0 (x) = f (x0 ) = f
2
And the corresponding quadrature formula is the Midpoint rule
Z b
b+a
I≈ p0 (x)dx = f (b − a)
a 2
Now, if n = 1 and we choose x0 = a, and x1 = b then
(x − x1 ) (x − x0 ) f (b)(x − a) − f (a)(x − b)
p1 (x) = f (x0 ) + f (x1 ) =
x0 − x1 x1 − x0 b−a
And the corresponding quadrature formula is the Trapezoidal Rule
Z b
b−a
I≈ p1 (x)dx = (f (a) + f (b))
a 2
a+b
And if n = 2 and we choose x0 = a, x1 = 2 , x2 = b then
73
But following the same reasoning as in Interpolation, you can imagine that using
equidistant points with a large n is not the most stable algorithm. Therefore we
have two choices: split the interval into small chunks (which is equivalent to do a
piecewise interpolation) or choose an optimal set of points for the quadrature rule.
The first approach is called Composite Newton-Cotes formulas and the second
one is called Gauss quadrature
In the case of integration, we can split the interval into different segments, and
compute the integral separately in each of them. Therefore, we can use an interpo-
lation scheme that can have discontinuous derivatives or even discontinuities in the
function at finite points.
n Z
X bi n Z
X bi Z b
(i)
I= f (x)dx ≈ p (x)dx = φ(x)dx (6.2)
ai ai
i=1
|
i=1
{z } | a {z }
Piecewise Interpolant
Integrate each segment
If all of the points are equally spaced, i.e. xi = a + ih where h = (b − a)/n then
n Z bi n−1
!
X h X
I≈ p1 (x)dx = f (a) + 2 f (xi ) + f (b) (6.4)
ai 2
i=1 i=1
74
6.1.2 Error of Newton-Cotes formulas
Since we are using a polynomial to interpolate f (x) and then integrate, we can
express the error based on the interpolation error. This means that if we call:
Z b Z b
I= f (x)dx In = φn (x)dx (6.6)
a a
For example if φn (x) is the polynomial that interpolates the points {(xi , f (xi )} then
we know from (5.16) that
Z b Z b n
Y
e = I − In = f (x) − φn (x)dx = f [x0 , x1 , ..., xn , x] (x − xi )dx (6.8)
a a i=0
• Midpoint rule
f 00 (ξ1 )
e= (b − a)3
24
• Trapezoidal rule
f 00 (ξ2 )
e=− (b − a)3
12
• Simpson’s rule
5
f (4) (ξ3 )
b−a
e=−
90 2
75
So for example, how can we get the error of the Midpoint rule? Using Taylor series
around (b + a)/2:
Z b
b+a
e= f (x) − f dx
a 2
b+a 2
Z b
0 b+a b+a 1 00
= f x− + f (ξ) x − dx
a 2 2 2 2
f 00 (ξ1 )
= (b − a)3
24
Definition 20. We call precision of a quadrature formula the largest integer p
such that the quadrature error is zero (e = 0)
Solution. Since all the errors are expressed using derivatives, it is easy to see that:
With the Newton-Cotes approach, we saw that using n+1 points we can get at most
p = n + 1 precision, this means we can interpolate exactly polynomials of degree at
most p = n + 1. But if we see, in (6.1) we have 2(n + 1) choices, therefore something
tells us that we can optimize the choice of weights and abscissas to get a better
precision with the same number of points.
Let us suppose we want to find a quadrature rule that exactly integrates all the
polynomials of degree at most 3 in the interval [−1, 1] using the minimum number
76
of points. In particular it should integrate {1, x, x2 , x3 }. Let us suppose we use 2
abscissas, they should satisfy:
Z 1
1dx = ω1 (1) + ω2 (1)
−1
Z 1
xdx = ω1 (x1 ) + ω2 (x2 )
−1
Z 1
x2 dx = ω1 (x1 )2 + ω2 (x2 )2
−1
Z 1
x3 dx = ω1 (x1 )3 + ω2 (x2 )3
−1
77
6.3 Some additional comments
What does it mean to intepolate (integrate) using “...” polynomials?
For example using Legendre polynomials?
It means that if we want to use n + 1 points, use the roots of the “...” polynomial
of order n + 1 as abscissas.
where w(x) is a weight function that is positive and integrable. For each weight
function, we can find an orthogonal family of polynomials which roots are the opti-
mal choice for the abscissas. Some examples are
For example consider a simple domain Ω ∈ Rm , then the integral of the function
over the domain can be decomposed into iterated integrals for each variable
Z Z b(1) Z b(2) Z b(m)
I= f (x)dx = ... f (x1 , x2 , ..., xm )dx1 dx2 ...dxm (6.13)
Ω a(1) a(2) a(n)
An then we can apply a quadrature rule in each integral. For example if we use
a quadrature rule of n + 1 points in each direction, we will end up with a grid of
(n + 1)m points to evaluate the whole integral.
78
Chapter 7
Numerical Differentiation
When we started the course, one of the principal goals in developing all the methods
was to solve numerically differential equations. Therefore, first we learned how to
solve systems of equations (linear and non-linear) and we applied them to solve un-
constrained optimization problems. Some of those optimization problems involved
data fitting using linear and non-linear least squares. At the heart of all this meth-
ods was to approximate the function with a Taylor series expansion around certain
point x0
Later, we switched a bit from subject, and we look for methods to fit polynomials
that exactly satisfy some data, i.e. that interpolates it. So given certain points
(xi , yi ), we proposed a model
n
X
φ(x) = cj φj (x)
j=1
i.e. a linear combination of basis functions, and we looked for coefficients cj such
that yi = φ(xi ). First we proposed global polynomials, and using different basis we
proved existence and error bounds. Then to avoid oscillations, we proposed local
polynomials (piecewise functions), and we enforced continuity at the break points.
But if the idea is to solve differential equations, what about numerical differen-
tiation?. In this chapter we will show two methods to find numerical differentiation
79
formulas. The first one is intuitive using Taylor series and the second one is more
general using Lagrange Interpolation.
80
Now to approximate the second derivative, we can use again the Taylor series of
x0 + h and x0 − h of order 3:
h2 00 h3 000 h4 (4)
f (x0 + h) = f (x0 ) + hf 0 (x0 ) + f (x0 ) + f (x0 ) + f (ξ1 )
2 3! 4!
h2 h3 000 h4 (4)
f (x0 − h) = f (x0 ) − hf 0 (x0 ) + f 00 (x0 ) − f (x0 ) + f (ξ2 )
2 3! 4!
Adding both equations we get the centered formula for the second derivative:
h4 (4)
f (x0 + h) + f (x0 − h) = 2f (x0 ) + h2 f 00 (x0 ) + 2 f (ξ)
4!
f (x0 + h) − 2f (x0 ) + f (x0 − h) h2 (4)
f 00 (x0 ) = − f (ξ) (7.5)
h2 12
And equation (7.5) is second order accurate.
and find the coefficients for which we should multiply each Taylor expansion.
Nevertheless, with this approach we do not have enough insight what does it
mean each of the coefficients, and how does the underlined approximation behave.
81
7.2 Numerical Differentiation using Interpolation
Now let us go back to the case of approximating the first derivative at x0 using
5 points: x0 − 2h, x0 − h, x0 , x0 + h. In this case let us consider the Lagrange
polynomial that interpolates the 5 points:
2 2
X Y (x − (x0 + jh))
φ(x) = f (x0 + ih)Li (x) where Li (x) =
((x0 + ih) − (x0 + jh))
i=−2
j = −2
j 6= i
Therefore
2
X
φ0 (x) = f (x0 + ih)L0i (x)
i=−2
2 2
X 1 Y (x − (x0 + jh))
where L0i (x) =
((x0 + ih) − (x0 + kh)) ((x0 + ih) − (x0 + jh))
k = −2 j = −2
k 6= i j 6= i, k
Therefore
2
1 X 1 1 1 1 1
L00 (x0 ) = − = + + + =0
h k −2 −1 1 2
k = −2
k 6= 0
2
1 Y (−j)
L0i (x0 ) =
ih (i − j)
j = −2
j 6= i, 0
82
Then
1 (1) (−1) (−2) 1
L0−2 (x0 ) =− =
2h (−2 + 1) (−2 − 1) (−2 − 2) 12h
0 1 (2) (−1) (−2) 8
L−1 (x0 ) = − =−
1h (−1 + 2) (−1 − 1) (−1 − 2) 12h
1 (2) (1) (−2) 8
L01 (x0 ) = =
1h (1 + 2) (1 + 1) (1 − 2) 12h
1 (2) (1) (−1) 1
L02 (x0 ) = =−
2h (2 + 2) (2 + 1) (2 − 1) 12h
Therefore using the error formula for interpolation we have that
2
!
∂ Y
f 0 (x0 ) = φ0 (x0 ) + f [x0 − 2h, x0 − h, x0 , x0 + h, x0 + 2h, x] (x − (x0 + ih))
∂x
i=−2 x=x0
2
X 2
Y
= f (x0 + ih)L0i (x0 ) + f [x0 − 2h, x0 − h, x0 , x0 + h, x0 + 2h, x0 ] (ih)
i=−2
i = −2
i 6= 0
1 h4
= (f (x0 − 2h) − 8f (x0 − h) + 8f (x0 + h) − f (x0 + 2h)) + f (5) (ξ)
12h 30
Which is exactly the same result as solving the linear system given by the Taylor
expansions around x0 . The difference is that now we understand that the deriva-
tive comes from fitting a polynomial of order 4 in the points x0 −2h, x0 −h, x0 , x0 +h.
On the other hand, notice that regardless of the method we use to approximate
the derivatives, we can write
ik
X
0
f (xi ) ≈ aij f (xj ) = (Af )i (7.8)
j=i1
This means that we can find a matrix A such that transforms the vector f =
[f (x0 ), ..., f (xn )] into f 0 = [f 0 (x0 ), ..., f 0 (xn )]. This matrix is called differentiation
83
matrix. Therefore once we construct the matrix, we can find the derivative for all
the points in just ”one operation”.
84
Chapter 8
In this part of the course we will start to work with the simplest model that includes
a derivative. We will look for a function y(t) such that
dy
= f (t, y), f or a ≤ t ≤ b (8.1)
dt
We call t the independent variable and y the dependent variable. What makes
it difficult to solve is that f depends on the unknown variable.
Also notice that (8.1) has infinite number of solutions. For example if f (t, y) = 0,
then y(t) = constant and it can take any value. Therefore we need to specify
additional conditions in order to have a unique solution. In this case we will specify
the value of the function at t = a
y(a) = α (8.2)
And together (8.1) plus (8.2) are called Initial Value problem. Also to have a
unique solution, we need f (t, y) to be Lipschitz continuous (See Bradie Section 7.1,
LeVeque Section 5.2)
85
For each of this t’s we have:
y(t0 ) = α
dy
(ti ) = f (ti , y(ti )) (8.4)
dt
How can we get rid of the derivative?
Let us call yi the numerical approximation of y(ti ). We can use any of the approxi-
mations form numerical differentiation to approximate the derivative. For example
if we use Forward differences we transform (8.4) into
y0 = α
yi+1 − yi
= f (ti , yi ) =⇒ yi+1 = yi + hf (ti , yi ) (8.5)
h
This is the so called Euler’s method. Notice that the advantage of (8.5) is that
yi+1 is explicitly defined in terms of yi . This type of method is called an explicit
method.
Notice that we could also use Backward differences to approximate the deriva-
tive and get:
y0 = α
yi − yi−1
= f (ti , yi ) =⇒ yi+1 = yi + hf (ti+1 , yi+1 ) (8.6)
h
This is the so called Backward Euler’s method. The shape between Euler and
backward Euler is the same, but the substantial difference is that that yi+1 de-
pends implicitly on itself (notice the f (ti+1 , yi+1 ) term). This is called an implicit
method. The difficulty of implicit methods is that in general we need to solve a
nonlinear system of equations. Nevertheless, later we will se that they have advan-
tages.
Let us remember the discussion about scientific computing we had at the beginning
of the quarter. When we talked about algorithms, there were 2 important proper-
ties an algorithm should have: Robustness and Efficiency. The first one was related
with the stability of the method (if I perturb the input, how much does the output
change?). The second one was related with convergence of the method (how many
steps should I take in order to get a small error?).
86
This two concepts are again present in the solution of differential equations.
More properly we will look for a one step method:
y(ti+1 ) − y(ti )
τi = − f (ti , y(ti )) (8.8)
h
Using Taylor series around yti we have:
• Stable: How can we relate local truncation error and global discretization
error?
Let us suppose that we want to solve the ODE
dy
= λy + g(t), 0 ≤ t ≤ T
dt
y(0) = y0
Integrating in both sides (or using Duhamel’s principle), the solution is given
by
Z t
λt
y(t) = y0 e + eλ(t−s) g(s)ds
i=0
87
On the other hand using Forward Euler we have that
yi+1 = yi + λhyi + hg(ti ) = (1 + λh)yi + hg(ti )
If we evaluate Forward Euler with the exact solution, and using the local
truncation error, we get:
yi+1 = (1 + λh)yi + hg(ti )
y(ti+1 ) = (1 + λh)y(ti ) + hg(ti ) + hτi
ei+1 = (1 + λh)ei − hτi (8.10)
Using the recurrence relation, we have
Xn
en = −h (1 + λh)n−i τi−1
i=1
Since |1+λh| < e|λ|h (this is called Zero-Stability) we can bound |(1+λh)n−i | ≤
eT |λ| and since we have n terms then nh = T . Therefore if τ = [τ0 , τ1 , ...τn−1 ],
then
|en | ≤= T eT |λ| ||τ ||∞
Since Euler is consistent then ||τ ||∞ → 0 and therefore |en | → 0, i.e. the
method is convergent.
But notice that this is not enough. If we go back to (8.10), at each time step,
the previous error is amplified by a factor of (1 + λh). If λ < 0, the solution is
characterized by an exponential decay. Therefore, if the method is absolutely
stable, we expect that the errors will also decay. This requires that:
2
|1 + λh| < 1 ⇐⇒ −1 < 1 − |λ|h < 1 ⇐⇒ h < (8.11)
|λ|
Therefore we need to choose h sufficiently small and λ < 0 to have a nice
behavior in Euler.
88
Then
n n−i
X 1
en = −h τi−1
(1 − λh)
i=1
∂f
Therefore λ is usually related with ∂y around the points of the solution. For
example we can take
∂f
λ = max
t,y∈Ω ∂y (t,y)
89
• Backward Differences : Backward Euler’s method
dy yi − yi−1
(ti , yi ) = + O(h)
dx h
• Centered Differences : Multi-step method
dy yi+1 − yi−1
(ti , yi ) = + O(h2 )
dx 2h
Therefore replacing in the ODE we get
yi+1 − yi−1
= f (ti , yi )
2h
yi+1 = yi−1 + 2hf (ti , yi ) (8.12)
And it is a multi-step method because yi+1 depends on 2 time-steps:
yi and yi−1 . This type of methods are more complicated because we
need to specify more initial conditions and the analysis of global error
and stability is more sensitive. Nevertheless they have a higher order of
accuracy.
2. Taylor’s Series Recall the Theorem1, and let us use it for y(ti+1 ) around
y(ti ) then we have
h2 h3
y(ti+1 ) = y(ti ) + y 0 (ti )h + y 00 (ti ) + y 000 (ti ) + ... (8.13)
2! 3!
And using the ODE we have that
y 0 (ti ) = f (ti , yi )
d
y 00 (ti ) = (f (ti , yi ))
dt
∂ ∂ dy
= f (ti , yi ) + f (ti , yi )
∂t ∂y dt
∂ ∂
= f (ti , yi ) + f (ti , yi ) f (ti , yi )
∂t ∂y
000
y (ti ) = . . .
Therefore second order scheme, we can ignore all the terms of third order or
higher and we can just replace the equivalences given by the ODE:
2
∂ ∂ h
y(ti+1 ) = y(ti ) + f (ti , yi )h + f (ti , yi ) + f (ti , yi ) f (ti , yi ) (8.14)
∂t ∂y 2!
This is called a Taylor method, and it is explicit, high order accurate one-
step method. The only drawback of this family of methods is that we need to
compute the derivatives of f , and as the order increases, handling them starts
to be complicated. It is a nice theoretical method, but not useful in practice.
90
3. Numerical Integration If we go back to (8.1), integrating in both sides we
have that
Z ti+1
y(ti+1 ) = y(ti ) + f (t, y(t))dt (8.15)
ti
91
• For the Trapezoidal method, we can approximate
yi+1 ≈ yi + hf (ti , yi )
(1) h
yi+1/2 = yi + f (ti , yi ) using Forward Euler
2
(2) h (1)
yi+1/2 = yi + f (ti+1/2 , yi+1/2 ) Using Backward differences
2
(1) (2)
yi+1 = yi + hf (ti+1/2 , yi+1/2 ) Using Modified Euler
Therefore when we use Simpson’s formula we have the famous 4-th order
Runge-Kutta
h (1)
yi+1 = yi + f (ti , yi ) + 2f (ti + 1/2h, yi+1/2 )
6
(2) (1)
+2f (ti + 1/2h, yi+1/2 ) + f (ti + h, yi+1 ) (8.18)
with c1 = 0
The principal characteristic among them is that the points in time are not equally
spaced (i.e. hi is not constant). Moreover, at each iteration, they adjust the position
92
of the new ti+1 = ti +h depending on the local error tolerance they want to have. To
have an estimation for the local error, they solve the value of yi+1 using 2 methods
of order q and q + 1, and then they compute the difference between the results. If
ŷi+1 is the estimation for order q and yi+1 the estimation for order q + 1, then we
accept the value of yi+1 if
|ŷi+1 − yi+1 | ≤ h tol
. If not we compute a new h̄
1/q
h tol
h̄ = h µ
|ŷi+1 − yi+1 |
where µ is a factor between 0 and 1 (we can take µ = 0.9).
93
And in general, even if we are working with only one differential equation, most of
the times it involves higher order derivatives. For example, the pendulum equation
we had at the beginning of the course:
d2 θ dθ
2
+ b + ω 2 sin θ = 0
dt dt
dθ
We can call φ = dt and we have the following system:
dθ
=φ
dt
dφ
= −bφ − ω 2 sin θ
dt
In general, when we talk about an Initial Value problem for a ODE system, we want
to solve for u(t) = [u1 (t), u2 (t)..., un (t)]0 vector of unknowns such that
(
du
dt = f (t, u1 (t), u2 (t), ..., un ) (8.21)
u(0) = [α1 , α2 , ..., αn ] known
And as we can expect, all the numerical methods that we developed previously (Eu-
ler, Backward Euler, Runge-Kutta) extend naturally to solve systems of ODEs. The
only difference is that we need to reconsider the stability properties of the methods,
and how that affects the performance of it.
The simplest non-trivial case is when f (t, u1 (t), u2 (t), ..., un ) is linear in u, i.e.
(
du
dt = Au (8.22)
u(0) = [α1 , α2 , ..., αn ] known
Let us assume all λj < 0. Therefore each λj is associated with the rate of decay of
uj . To use Euler to solve the system we need that each spacing hj < 2/|λj |, where
(i+1) (i)
uj − uj (i)
= λj uj
hj
94
If we solve the system as a whole, to have stability we need h ≤ minj hj <
2/ maxj |λj |. But in the other hand, if I am interesting in a large scale behav-
ior, the variables that will dominate are those with the smallest rate of decay, i.e.
with the smallest values of |λi |.
This means that in order to have stability, we need to choose h according to the
”modes” or |λi | that we are less interested on, making the computation really slow.
If the gap between the largest λi and the smallest one is considerably big, then we
say that the problem is stiff.
Usually stiff problems are characterized by a separation of time scale, and when
we are using explicit methods, we required a really small time step to preserve the
stability of the method avoiding effects of the small scale. This is a very common
situation in reaction-diffusion and transport equations.
In a more general case, if f (t, u1 , u2 , ..., un ) is not linear, we can always think
in its linear approximation using the Taylor series. In such a case A would be the
Jacobian matrix of f at (t, u1 , u2 , ..., un ).
95
Chapter 9
Since it is a first order, we only require one extra condition to have a unique solution,
y(t0 ) = y0 . But what about higher order ODE’s?. For example in the homework we
have
and we can solve it if θ and θ0 are specified at the initial point t = 0, i.e. if it is
an initial value problem. But what about if we have other type of conditions? for
example if we specify values θ(0) = θ0 and θ(T ) = θT ?
In the case where the extra conditions are specified in more than one point,
generally at both endpoints (t = t0 &t = tf ) we say we have a boundary value
problem. And now we can have different types of conditions:
• Periodic Condition We identify initial point with end point as the same
point, i.e. θ(0) = θ(T ) and θ0 (0) = θ0 (T )
96
In this chapter we will talk about two different methods to solve a boundary value
problem: Shooting problem and finite differences. The first one will use the tools we
developed in the previous chapter to redefine the problem as a Initial Value problem,
whereas the second approach will be extended to Partial differential equations, and
solves all the discretization points at once.
The only problem is that we do not know a priori φ0 . Therefore in the shooting
method we “try” different values of φ0 until we get θ(T ) = θT .
Does this setting look familiar?
Indeed! We are solving a root finding problem. This means that we want to find φ0
such that θ as a function of its initial values, gives us θ(T ; φ0 ) = θT . Therefore we
should use any of the methods in Chapter 2 to find a result. At each iteration we
will solve an initial value problem for a guess of φ0 . If |θ(T ; φ0 ) − θT | > tol, then we
update φ0 using a root finding algorithm.
Since each iteration is really expensive, we would like to start with an appropriate
first guess, and use a fast convergence algorithm, such as Newton. The problem of
Newton’s method for root finding is that if we call g(φ0 ) = θ(T ; φ0 ), it requires the
derivative of g with respect to φ0 , which is not easy to compute. Instead we prefer
to use Secant method and we update φ0 as
(i) (i−1)
(i+1) (i)
(i)
φ − φ0
φ0 = φ0 − g φ0 0 (9.3)
(i) (i−1)
g φ0 − g φ0
97
If the ODE is linear (i.e. θ00 = p(t)θ0 + q(t)θ + r(t)) then we can start assuming that
φ0 (0) = (θ(T ) − θ(0))/T . If the ODE is nonlinear, it is more difficult to predict how
the method behaves, and moreover it could exists more than one solution. Therefore
just try with different initial guesses, to see how it performs.
With linear equations, the superposition principle holds. It means that we can
decompose the extra conditions (either boundary or initial conditions) into homo-
geneous and nonhomogenous part, solve them separately and add the two solutions
as the final result. Following this approach we can consider the following 2 ODE
systems:
00 0
u = p(t)u + q(t)u + r(t) t0 ≤ t ≤ tf
u(t0 ) = y0 (9.5)
0
u (t0 ) = 0
00 0 t0 ≤ t ≤ tf
w = p(t)w + q(t)w
w(t0 ) = 0 (9.6)
0
w (t0 ) = 1
And assume that y = u + cw. Therefore in order to satisfy (9.4), we need y(tf ) =
u(tf ) + cw(tf ). Solving (9.5) and (9.6) numerically, we can approximate c as:
yf − u(tf )
c= (9.7)
w(tf )
Since the superposition principle does not hold in nonlinear equations, we cannot
use this approach to solve them.
98
For the Linear case, instead of solving (9.5) and (9.6), we can consider the
following set of Initial Value Problems:
00 0
u = p(t)u + q(t)u t0 ≤ t ≤ tf
u(t0 ) = 1 (9.8)
0
u (t0 ) = 0
00 0 t0 ≤ t ≤ tf
w = p(t)w + q(t)w
w(t0 ) = 0 (9.9)
0
w (t0 ) = 1
00 0 t0 ≤ t ≤ tf
z = p(t)z + q(t)z + r(t)
z(t0 ) = 0 (9.10)
0
z (tf ) = 0
And we assume that y(t) = α1 u(t) + α2 w(t) + α3 z(t), and we solve for the boundary
conditions.
In general, for the nonlinear case, we continue solving a root finding problem.
If the conditions are Neumann (y 0 (t0 ) is given), then we start with a guess of the
function value y(t0 ) = α, and we iterate until the boundary conditions are satisfied.
If the conditions are Robin, we can decide either to guess y 0 (t0 ) or y(t0 ), and the
other term is given by the Boundary conditions. If the problem is nonlinear, we
cannot ensure the stability of the algorithm beforehand. Therefore if it is possible,
it is always good to try both from right to left and left to right, and solving for y(t0 )
or for y 0 (t0 )
Before, we solved this problem based on an initial value problem, using a guess for
the derivate of the function θ0 at zero, and then applying root finding techniques to
converges to the exact solution.
But now if we just analyze the problem, we can discretize both θ00 and θ0 using
finite differences and replace them in (9.11). Notice that we would like to have the
99
same truncation error in both approximations, otherwise only one of them would be
responsible of the error. Therefore let us use centered differences for the first and
second derivative, and for every θi ≈ θ(ti ), where ti = t0 + ih = iT /N , we have
θ0 = α
θi+1 −2θi +θi−1 −θi−1
h2
+ b θi+12h + ω 2 sin(θi ) = 0 1 ≤ i ≤ N − 1 (9.12)
θN = β
On the other hand, Finite Differences gives us a general stencil for all the points
in the grid, and to solve it, we should solve a system of equations involving all the
points in the grid. Since the matrix of coefficients is usually banded (for example
tridiagonal), There are efficient ways to solve all the points at once. Nevertheless,
the cost of solving the system increases with the number of total points in the grid.
100
9.2.1 Other types of Boundary Conditions
Let us consider now the following problem
00 0 2 for 0 ≤ t ≤ T
θ + bθ + ω sin(θ) = 0
α0 θ(0) + β0 θ0 (0) = γ0 (9.16)
αT θ(T ) + βT θ0 (T ) = γT
One way is to approximate the Robin conditions using finite differences. We could
use forward differences for θ0 (0) and backward for θ0 (T ). But we would break the
second order approximation, and the error would be concentrated in the endpoints.
Another option is to have a second order approximation for the derivative. Since
t0 , and tN are at the boundary, the only way to approximate the derivatives using
the existing points is to take t0 , t1 & t2 and on the other hand tN −2 , tN −1 & tN .
The problem of this approach is that the tridiagonal structure of (9.13) is lost.
If t0 and tN were not boundary points, we could approximate the derivative using
centered differences, and the tridiagonal pattern would not be lost. Therefore the
idea of this approach is to include fictitious points t−1 and tN +1 , and approximate
the Robin conditions with centered differences.
But adding 2 additional unknowns that don’t exist does not look like
the best answer
Well if we combine the Robin conditions with the numerical equation at θ0 we get:
θ1 − θ−1
α0 θ0 + β0 = γ0 (9.17)
2h
θ1 − 2θ0 + θ−1 θ1 − θ−1
2
+b + ω 2 sin(θ0 ) = 0 (9.18)
h 2h
From (9.17), if β0 6= 0, we have that:
θ1 − θ−1 1
= (γ0 − α0 θ0 )
2h β0
2h
θ−1 = θ1 − (γ0 − α0 θ0 )
β0
101
Replacing in (9.18) we get:
θ1 − 2θ0 1 2h 1
2
+ 2 θ1 − (γ0 − α0 θ0 ) + b (γ0 − α0 θ0 ) + ω 2 sin(θ0 ) = 0
h h β0 β0
2θ1 − 2θ0 γ0 − α0 θ0 2
+ b − + ω 2 sin(θ0 ) = 0
h2 β0 h
102
Chapter 10
Let us say u = u(x, y), this means u is a function of x and y. Then first order
semilinear PDE is given by:
∂u ∂u
a(x, y) + b(x, y) = c(x, y) (10.1)
∂x ∂y
This equation defines the directional derivative of u with respect to the vector field
V = (a(x, y), b(x, y). This means that given certain initial conditions, the solution
of the equation moves following characteristic curves defined by the vector field.
In other words, if we know the solution in the characteristic curves, we know the
solution in the whole domain.
103
When we want to solve numerically this equation, we really need to understand how
the solution should behave. For example if in (10.2), a > 0 it means that the char-
acteristics are straight lines with positive slope in the t v.s. x plane. Therefore if we
use a forward discretization in x, we won’t capture the behavior of the characteristic
curves, and, as a result, the numerical method would be unstable.
For this reason solving PDEs numerically is not only a matter of approximating
the derivatives (for example using finite differences), but also being sure that the
numerical approximation follows the same physical principles that the original PDE
does.
∂2u ∂2u
∆u = + 2 =0 (10.5)
∂x2 ∂y
or the Poisson Equation
∂2u ∂2u
∆u = + 2 = f (x, y) (10.6)
∂x2 ∂y
both related with the steady-state solution of the Heat equation, where f (x)
is a source term.
∂2u ∂2u
c2 − 2 =0 (10.7)
∂x2 ∂t
Since they can be factorize as 2 first order PDEs, its behavior is similar to
the advection equation. This means that the solution at a point is given by
the characteristic curve that passes through it, and it is not affected by the
solution of the whole domain.
∂u ∂2u
=k 2 (10.8)
∂t ∂x
104
And the principal difference with the wave equation is that the speed of propagation
is infinity, this means that any change in the function in any point of the domain
affects the whole domain. Think in how is the behavior of a heated pan versus the
waves generated by throwing a stone into a lake. Therefore this two phenomenons
are summarize as:
• Diffusion It has an instantaneous smoothing effect of the initial conditions.
For example Elliptic and Parabolic equations
• Advection or Transport It carries the initial conditions along the domain
using characteristic curves. Wave equation: First order PDE and Hyperbolic
equations
In a more general setting, we could have equations that combine parabolic and
hyperbolic behaviors, for example the advection-diffusion-reaction equation:
∂u ∂2u ∂u
= k 2 − a + R(x) (10.9)
∂t | ∂x ∂x | {z }
reaction
{z } |{z}
dif f usion advection
For this type of equations, we should develop specific numerical methods that can
capture the effects of both systems, dealing with the stability issues of a stiff problem.
∆u = f (x) (10.12)
Notice that in 1D, (10.12) is a second order ODE u00 = f (x), and we can solve it
using the methods of the previous chapter. In 2D we get:
∂2u ∂2u
+ 2 = f (x, y) (10.13)
∂x2 ∂y
105
Additional to (10.13), we should specify boundary conditions, and these can be
Dirichlet, Newman, Robin or more general conditions. As you can assume, dis-
cretizing (10.13) using finite differences is not really complicated. The challenge
now is the different type of domains we could have.
Then we want to approximate the function at the grid points as u(xi , yj ) ≈ ui,j .
Therefore using centered finite difference to approximate the second order derivatives
in (10.13), we get:
ui+1,j − 2ui,j + ui−1,j ui,j+1 − 2ui,j + ui,j−1
2
+ = f (xi , yi ) (10.14)
hx h2x
If hx = hy = h we get:
1
(ui+1,j + ui−1,j + ui,j+1 + ui,j−1 − 4ui,j ) = f (xi , yi ) (10.15)
h2
This is the 5-points stencil for the Laplacian operator.
And notice that if we solve for ui,j we get that
The maximum principle is one of the most important properties of the Laplace
equation, because it explains why solutions are smooth inside the domain.
106
Figure 10.1: Assembly of the matrix to solve the Poisson equation in 2D with equally
spaced points and Dirichlet Boundary conditions. Notice that we can remove the
boundary terms from the system, just passing their values to the right hand side of
the system of equations. Having other type of boundary conditions will only change
the rows corresponding to the boundary nodes.
107
optimal choice in order to solve the system. A lot of people have worked in finding
optimal configurations for this matrix.
On the other hand, dealing with different types of boundary conditions is exactly
the same as in the Boundary Value problems with ODE’s.
Notice that the right hand side only involves partial derivatives with respect to
space, therefore for each t we can discretize it using methods for elliptic equations.
Once we do that, we will have a linear system that depends on u(x, t) and its neigh-
bors. We can call this approximation ∆ ˜ h u.
∂u ˜ h u(t) t > 0
= D∆ (10.17)
∂t
And for each x, we can this Initial Value ODE problem in time, using any of the
methods derived in Chapter 8, for example Euler, Backward Euler or any other
Runge-Kutta or multistep method. And of course, as we could expect from the
stability analysis of ODE methods, there will be a tight relationship between the
discretization in space and the discretization in time.
Let us work with the simplest case to see how this procedure works. If Ω is just
one line segment, then we will be working in one dimension. The heat equation for
108
1D with Dirichlet Boundary conditions is
∂u ∂2u
∂t = D ∂x2 x0 ≤ x ≤ xf , t > 0
u(x , t) = α(t)
0
(10.18)
u(x f , t) = β(t)
u(x, 0) = v(x)
Notice that if we consider the vector U (t) = [u1 (t), u2 (t), ..., uN −1 (t)] then (10.19)
is an initial value ODE of the form
(
dU (t) D
dt = − h2 (AU (t) + b(t))
U (0) = V
where
2 −1 −α(t)
v(x1 )
−1 2 −1 0 v(x2 )
−1 2 −1 0 v(x3 )
A= b(t) = V =
... ... ...
...
...
−1 2 −1 0 v(xN −2 )
−1 2 −β(t) v(xN −1 )
And now we can solve (10.19) using Euler, Backward Euler or Trapezoidal method.
If we define t(n) = n ∗ k where k is the time-step, then we will approximate our
(n)
solution as u(xi , t(n) ) = ui . Therefore applying the different time schemes we have
• Euler⇒FTCS Forward in Time, Centered in Space
(n+1) (n) (n) (n) (n)
ui − ui ui+1 − 2ui + ui−1
=D (10.20)
k h2
109
• Trapezoidal⇒Crank-Nicolson
(n) (n) (n) (n+1) (n+1) (n+1)
(n+1) (n)
!
ui − ui D ui+1 − 2ui + ui−1 ui+1 − 2ui + ui−1
= +
k 2 h2 h2
(10.22)
110
Chapter 11
Finite Elements method appeared around the 50’s, when engineers were solving solid
elasticity problems. Since then, it has been one of the most well known methods to
solve partial differential equations because of its generality of applications and its
solid mathematical background. In the previous chapters, we have seen that finite
differences are really simple to implement and they work in a broad set of cases.
Nevertheless:
• Not always our solutions are completely smooth (continuous). Consider having
a chord, fixed at both ends, and we apply a point force in the middle. We
would like to model its displacement using the linear elasticity formula:
d d
− E(x)A(x) u(x) = f (x) x0 < x < xf ; u(x0 ) =u(xf ) = 0 (11.1)
dx dx
where E(x) is the elasticity modulus, A(x) is the cross-section area, and f (x)
is the source term. By experience we know that once we apply a point force,
the chord creates a triangular shape, with the lowest vertex being the point of
application of the force. The problem with equation (11.1) is that we cannot
represent as a function f (x) a point force, because it is zero everywhere except
at one point where it is infinity. And also, the solution u(x) is a linear piecewise
that it is not continuously differentiable at the point source, so what does it
mean to take first and second derivative at it?
• Not always the domain is suitable for finite differences. Solving elliptic equa-
tions in a rectangular domain was really simple. We just put our 5-point
stencil along the domain, and we got an equally spaced discretization. But
what would happen if instead of a plate, we would like to model the helmet
of a professional cyclist. In such a case, it is more difficult to fit a finite dif-
ferences scheme into the object, and even if we could do it with non-equally
spaced points, it would be a nightmare if we would like to refine (add more
points) to the mesh.
111
The advantage of Finite Elements (FEM) is that it is general enough that in the
nicest cases (for example a rectangular plate) where we can use finite differences, we
can use also FEM and get the same results; but also in the most complicated cases,
where we do not know how to adapt efficiently finite differences, applying FEM is
straightforward.
Notice that when → 0, this function tends to behave as a point force, because
its support (the values where it is different from zero) gets smaller, and at x = xk
the function value increases. The point force we call it delta distribution δxk ,
and it has the property that for any sufficiently nice function φ (i.e. it is infinitely
differentiable and it has compact support), then
Z ∞
δxk (x)φ(x)dx = φ(xk ) (11.3)
−∞
And although imaging the delta distribution as a function is a bit difficult, when we
multiply it by a nice function and integrate, it is really simple to see its value.
The most important technique is integration by parts. For example, let us con-
sider the following piecewise function:
(
−1 x < 0
ψ(x) = (11.4)
1 x>0
112
From here we can apply integration by parts
Z b Z b
u0 (x)v(x)dx = u(x)v(x)|ba − u(x)v 0 (x)dx
a a
Therefore:
Z ∞ Z ∞
dψ(x) ∞ dφ(x)
φ(x) dx = φ(x)ψ(x)|−∞ − ψ(x)dx
−∞ dx | {z } −∞ dx
=0, φ is nice
Z 0 Z ∞
dφ(x) dφ(x)
= dx − dx ; apply definition of ψ
−∞ dx 0 dx
∞
= φ(x)|0−∞ − φ(x)|0
= (φ(0) − 0) − (0 − φ(0))
Z ∞
= 2φ(0) = 2δ0 φ(x)
−∞
Now we can say that dψdx = 2δ, and therefore we can generalize the derivatives to
Generalized functions that are not continuously differentiable functions.
To deal more formally with these concepts, some ideas from functional analysis
are required. An introductory course in functional analysis or an advanced course
in PDEs would do the job. A nice explanation about distributions is found in Vasy
(2015). Nevertheless, from engineering perspective, not always it is needed to un-
derstand the mathematical theory behind, but it is useful to understand when the
method works and how we can explain the errors.
In the previous section we saw how we could make sense of ”functions” and
derivatives that don’t exist in the proper definition of a function. Therefore, we
113
can expect that applying the technique of the previous section, we would get more
solutions to problems that before we could not solve. So therefore we have 2 different
formulations of a problem:
• Strong form which only make sense when the functions involved are suffi-
ciently smooth and their derivatives exists. For example:
• Weak form that we obtain when we multiply by a very nice function v(x)
(i.e. it is infinitely differentiable and it has compact support) we integrate,
and we apply integration by parts:
Z xf Z xf
d d
− E(x)A(x) u(x) v(x)dx = f (x)v(x)dx
x0 dx dx x0
Z xf Z xf
d d
E(x)A(x) u(x) v(x)dx = f (x)v(x)dx
x0 dx dx x0
xf
d
− E(x)A(x) u(x)v(x)
dx x0
| {z }
u(x0 )=u(xf )=0
Z xf Z xf
du dv
E(x)A(x) dx = f (x)v(x)dx
x0 dx dx x0
Notice that once we have the integral form v does not have to be a really
dv
smooth function, but it is enough that v(x) and dx make sense once we multiply
them by a function and we integrate them. Therefore we can generalize the
integral form that we have before as :
In this case U is the space of possible solutions, we call it trial functions and V is
the space for which it has to be satisfied, we call it test functions
114
For people more interested in the mathematical details, if we define
Z
L2 = {f functions ∈ (x0 , xf ) : f 2 dx < ∞}
and
H 1 = {f ∈ L2 : f 0 ∈ L2 }
then
U = V = {f ∈ H 1 |u(x0 ) = u(xf ) = 0}
11.4 Implementation
So our system of equations derived from the weak formulation is given by
n Z xf Z xf
X dφi dφj
uj E(x)A(x) dx = f (x)φi (x)dx (11.5)
x0 dx dx x0
j=1 | {z } | {z }
aji bi
AU = B
115