
Linear Least Squares Optimization

MA 348 Optimization
Kurt Bryan
Linear Least-Squares

Consider an objective function of the special form


f(x) = \frac{1}{2} \sum_{i=1}^{m} f_i^2(x)                                          (1)

where x \in \mathbb{R}^n and each function f_i is LINEAR, that is, of the form f_i(x) = a_i^T x - b_i for
some a_i \in \mathbb{R}^n and some scalar b_i. Such functions occur frequently when fitting linear models
to data; usually m is larger than n, sometimes much larger. For example, one might try to
fit a model φ(x) = a + bx + cx^2 to the (x, y) data points (1.0, 2.2), (1.6, 2.8), (2.3, 3.9), (3.4, 4.4),
and (4.1, 5.2), by adjusting a, b, and c to minimize the squared error
f(a, b, c) = \frac{1}{2}\left[ (φ(1.0) - 2.2)^2 + (φ(1.6) - 2.8)^2 + (φ(2.3) - 3.9)^2 + (φ(3.4) - 4.4)^2 + (φ(4.1) - 5.2)^2 \right].
In this case x = (a, b, c), f_1(a, b, c) = a + b + c - 2.2, f_2(a, b, c) = a + 1.6b + (1.6)^2 c - 2.8,
etc. The factor of 1/2 above and in equation (1) is there only to make some formulas that
appear later prettier; it doesn't really change the problem.
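
Here's a short NumPy sketch of this fit (the array names and the use of np.linalg.lstsq are just one reasonable choice): each row of the matrix holds (1, x_i, x_i^2), so the squared error above is half the squared norm of the residual vector.

import numpy as np

# The five (x, y) data points from the text.
xs = np.array([1.0, 1.6, 2.3, 3.4, 4.1])
ys = np.array([2.2, 2.8, 3.9, 4.4, 5.2])

# Row i of A is (1, x_i, x_i^2), so A @ (a, b, c) stacks the model values phi(x_i).
A = np.column_stack([np.ones_like(xs), xs, xs**2])

# Minimize (1/2)||A p - y||^2 over p = (a, b, c).
p, residual, rank, sv = np.linalg.lstsq(A, ys, rcond=None)
a, b, c = p
print("a, b, c =", a, b, c)
print("squared error f(a, b, c) =", 0.5 * np.sum((A @ p - ys) ** 2))
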
Another typical situation in which such objective functions arise is that of solving Ax =
b when there are more equations than unknowns (or more generally when the system is
inconsistent). In the case that Ax = b has no solution, we might instead seek that x which
does the best job of solving the equations, in the sense that x minimizes
f(x) = \frac{1}{2} \|Ax - b\|^2                                                      (2)

where \|p\|^2 = \sum_i p_i^2 is the usual Euclidean 2-norm. This fits into the mold of equation (1),
where a_i in (1) is the ith row of A. Thus the problem of minimizing a function of the form
in (2) can be cast in the form of minimizing f in equation (1).
In fact, the converse is also true: minimization of the function f in (1) can be cast into
the form (2) by taking A to have the vectors a_i^T as rows and b = (b_1, ..., b_m)^T. The problems
of minimizing the functions in equations (1) and (2) are thus totally equivalent.
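
As a quick sanity check of this equivalence, the following sketch (with made-up random data) evaluates f both ways, once as the sum of squares in (1) and once as the squared norm in (2):

import numpy as np

rng = np.random.default_rng(3)
m, n = 6, 2
A = rng.standard_normal((m, n))   # row i of A is a_i^T
b = rng.standard_normal(m)
x = rng.standard_normal(n)

# f as in (1): half the sum of the squared linear residuals f_i(x) = a_i^T x - b_i ...
f1 = 0.5 * sum((A[i] @ x - b[i]) ** 2 for i in range(m))
# ... equals f as in (2): half the squared 2-norm of Ax - b.
f2 = 0.5 * np.linalg.norm(A @ x - b) ** 2
print(f1, f2)   # identical up to round-off
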
Of course you could attack this optimization problem with any general nonlinear algo-
rithm, e.g., conjugate gradients or quasi-Newton methods. But the objective function (1) has
a very special structure: setting the gradient to zero leads to a linear system of equations
which can be solved quite efficiently.

The Normal Equations

The objective function in (2) can be written out in gory detail as


f(x) = \frac{1}{2} \sum_{i=1}^{m} \Big( \sum_{j=1}^{n} A_{ij} x_j - b_i \Big)^2      (3)

Differentiate with respect to x_k to find

\frac{\partial f}{\partial x_k} = \sum_{i=1}^{m} \sum_{j=1}^{n} A_{ij} A_{ik} x_j - \sum_{i=1}^{m} A_{ik} b_i.      (4)

We can do the double sum in any order we like. You can easily check that \sum_{i=1}^{m} A_{ik} b_i is just
the kth component of A^T b, which I'll write as (A^T b)_k. Also, \sum_{i=1}^{m} A_{ij} A_{ik} is just (A^T A)_{jk}
or (A^T A)_{kj} (since A^T A is symmetric), so that \sum_{j=1}^{n} \big( \sum_{i=1}^{m} A_{ij} A_{ik} \big) x_j is just (A^T A x)_k.
If we arrange \nabla f as a column vector (stack components k = 1 to k = n) we find that
\nabla f(x) = A^T A x - A^T b, and the condition for a critical point is that \nabla f = 0, or

A^T A x = A^T b.                                                                     (5)

These are the so-called normal equations. They ought to look familiar from DE I.
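
A small numerical check of (5), sketched in NumPy with random data: solving the normal equations should zero the gradient A^T A x - A^T b and agree with a library least-squares routine.

import numpy as np

rng = np.random.default_rng(0)
m, n = 10, 3
A = rng.standard_normal((m, n))   # a random tall matrix; full column rank with probability 1
b = rng.standard_normal(m)

# Solve the normal equations A^T A x = A^T b directly.
x = np.linalg.solve(A.T @ A, A.T @ b)

# The gradient of f(x) = (1/2)||Ax - b||^2 is A^T A x - A^T b; it should vanish at x.
grad = A.T @ (A @ x - b)
print("||grad|| =", np.linalg.norm(grad))        # essentially zero

# Same minimizer as a library least-squares solve.
x_ls = np.linalg.lstsq(A, b, rcond=None)[0]
print("max difference vs lstsq:", np.max(np.abs(x - x_ls)))
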

Solving the Normal Equations

The normal equations are just an n by n system of linear equations for x. The matrix
A^T A is symmetric and positive semi-definite. In fact this matrix is very likely positive
definite (exactly when A has full column rank), so we can try solving by using Cholesky
Factorization, which buys a factor of 2 in efficiency over LU. This is a reasonable way to solve
the problem, although you can sometimes run into trouble for certain matrices A.
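
Here is one way the Cholesky route might be organized, sketched with SciPy's cho_factor/cho_solve (the helper name normal_equations_cholesky is made up, and the sketch assumes A has full column rank):

import numpy as np
from scipy.linalg import cho_factor, cho_solve

def normal_equations_cholesky(A, b):
    """Solve min (1/2)||Ax - b||^2 via a Cholesky factorization of A^T A.

    Assumes A has full column rank, so A^T A is positive definite.
    """
    G = A.T @ A                 # n x n, symmetric positive definite
    rhs = A.T @ b
    c, lower = cho_factor(G)    # Cholesky factorization of the SPD matrix G
    return cho_solve((c, lower), rhs)

# Quick check on a random well-conditioned problem.
rng = np.random.default_rng(1)
A = rng.standard_normal((8, 3))
b = rng.standard_normal(8)
print(normal_equations_cholesky(A, b))
print(np.linalg.lstsq(A, b, rcond=None)[0])   # should agree closely
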
As an example, consider the case in which

A = \begin{pmatrix} 1 & 1 \\ \epsilon & -2\epsilon \\ 0 & 0 \end{pmatrix}, \qquad
b = \begin{pmatrix} 1 \\ 0 \\ 1 \end{pmatrix}

with ε = 10^{-5}. The matrix A is close to being rank 1, but it seems reasonable that
if we work in 10 digit arithmetic we shouldn't have any trouble. The EXACT solution
to the normal equations (5) in this case is x_1 = 2/3, x_2 = 1/3, which you can compute
symbolically, with no rounding error. But if you solve the normal equations numerically
using either Cholesky or LU factorization with 10 digit arithmetic you get x_1 = 0.5 and
x_2 = 0.5, which is horrible. If you work in anything LESS than 10 digits you find that A^T A
is outright singular.
The problem is that forming the product A^T A encourages round-off error, essentially
halving the number of significant figures we have at our disposal. If you want to know more,
take numerical analysis. This isn't a problem if A is far enough from singular (though you
still lose significant figures). What we'd like is a way to minimize f(x) in equation (2) but
without forming A^T A.
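
The sketch below illustrates the same phenomenon. NumPy works in double precision (roughly 16 digits) rather than 10 digit arithmetic, so a smaller ε, here 10^{-8} (my choice, not the value used above), plays the role that 10^{-5} plays in the text; forming A^T A still ruins the answer, while a QR-based solve (the subject of the next sections) does not.

import numpy as np

eps = 1e-8
A = np.array([[1.0, 1.0],
              [eps, -2.0 * eps],
              [0.0, 0.0]])
b = np.array([1.0, 0.0, 1.0])
# Exact least-squares solution: x1 = 2/3, x2 = 1/3.

# Normal equations: forming A^T A squares the condition number.
x_ne = np.linalg.solve(A.T @ A, A.T @ b)

# QR route: never form A^T A.
Q, R = np.linalg.qr(A)                 # reduced QR: Q is 3x2, R is 2x2
x_qr = np.linalg.solve(R, Q.T @ b)     # R is upper triangular, so this is a backsubstitution

print("normal equations:", x_ne)       # far from the true (2/3, 1/3)
print("QR:             ", x_qr)        # close to (0.6666..., 0.3333...)

s = np.linalg.svd(A, compute_uv=False)
print("cond(A)     =", s[0] / s[-1])
print("cond(A^T A) =", np.linalg.cond(A.T @ A))   # roughly cond(A)**2
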

QR Factorization

Instead of using a Cholesky factorization A^T A = LDL^T and backsubstitution to solve
the normal equations, we'll use a QR factorization. It's a fact that any m by n matrix A can be factored as A = QR
where Q is an m by m orthogonal matrix (meaning Q^T Q = I) and R is an m by n upper
triangular matrix. This can be used to minimize the function in equation (2).

Example

Let

A = \begin{pmatrix} 1 & 1 \\ 0 & 1 \\ 2 & 2 \end{pmatrix}.

Then you can check that A = QR where

Q = \begin{pmatrix} \sqrt{5}/5 & 0 & -2\sqrt{5}/5 \\ 0 & 1 & 0 \\ 2\sqrt{5}/5 & 0 & \sqrt{5}/5 \end{pmatrix}, \qquad
R = \begin{pmatrix} \sqrt{5} & \sqrt{5} \\ 0 & 1 \\ 0 & 0 \end{pmatrix}.
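
You can confirm this with a library QR routine; keep in mind that the factorization is only determined up to sign choices, so the computed Q and R may differ from those above by the sign of a column of Q and the matching row of R.

import numpy as np

A = np.array([[1.0, 1.0],
              [0.0, 1.0],
              [2.0, 2.0]])

# Full (3x3) QR, matching the Q and R displayed above up to sign flips.
Q, R = np.linalg.qr(A, mode="complete")
print(Q)
print(R)
print("Q^T Q = I?", np.allclose(Q.T @ Q, np.eye(3)))
print("A = QR?  ", np.allclose(Q @ R, A))
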

Computing the QR Decomposition

The QR decomposition is computed by selectively transforming each column of A into
the required form, from left to right. We'll multiply A by an orthogonal matrix Q_1 which
will zero out all entries in the first column of A except the top entry. We'll then multiply
Q_1 A by an orthogonal matrix Q_2 which will zero out all but the top two elements in column
2, but without messing up column 1. We continue like this, so at the kth stage we multiply
Q_{k-1} \cdots Q_1 A by an orthogonal matrix Q_k which eliminates all but the top k elements in
column k, without messing up the previous columns. If we do this for k = 1 to k = n we
obtain something like

Q_n Q_{n-1} \cdots Q_1 A = R                                                         (6)

where R is upper triangular. We then have QA = R for some orthogonal matrix Q =
Q_n Q_{n-1} \cdots Q_1 (since the product of orthogonal matrices is orthogonal). From this we can
obtain A = QR (exercise).

Exercise

Show that the product of orthogonal matrices is orthogonal.

Prove A = QR using equation (6).

The trick is to find an orthogonal matrix Q that selectively zeroes out all but the top
k elements of a given vector (since each column of A is processed one at a time, we can
consider them as vectors).

A useful fact to note is this: \|Qx\| = \|x\| for any orthogonal Q and vector x, so multi-
plication by Q preserves Euclidean length. To see this just note that

\|Qx\|^2 = (Qx)^T (Qx) = x^T Q^T Q x = x^T x = \|x\|^2.

Exercise

True or False: The converse of the above holds, i.e. if \|Px\| = \|x\| for all x then P is
orthogonal.

Here's how to construct the magic Q_i matrices. Let v be a vector in \mathbb{R}^n. Form the
matrix

H = I - 2 \frac{v v^T}{v^T v},                                                       (7)

an n by n matrix. The matrix H is orthogonal, for

H^T H = \Big(I - 2\frac{vv^T}{v^T v}\Big)^T \Big(I - 2\frac{vv^T}{v^T v}\Big)
      = \Big(I - 2\frac{vv^T}{v^T v}\Big)\Big(I - 2\frac{vv^T}{v^T v}\Big)
      = I - 4\frac{vv^T}{v^T v} + 4\frac{vv^T vv^T}{(v^T v)(v^T v)}
      = I - 4\frac{vv^T}{v^T v} + 4\frac{vv^T}{v^T v}
      = I,

since the middle v^T v in the last term on the third line is just a scalar and cancels with
one of the denominator copies. Matrices of the form of equation (7) are called
Householder matrices.
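
A tiny sketch (the helper name householder_matrix is made up) that builds H from equation (7) and confirms it is symmetric and orthogonal for an arbitrary nonzero v:

import numpy as np

def householder_matrix(v):
    """Return H = I - 2 v v^T / (v^T v), the Householder matrix of equation (7)."""
    v = np.asarray(v, dtype=float)
    return np.eye(v.size) - 2.0 * np.outer(v, v) / (v @ v)

v = np.array([3.0, -1.0, 2.0, 0.5])
H = householder_matrix(v)
print(np.allclose(H, H.T))              # symmetric
print(np.allclose(H.T @ H, np.eye(4)))  # orthogonal: H^T H = I
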
Let a be a vector in \mathbb{R}^n. Consider the chore of choosing v so that H as defined by
equation (7) has the effect that forming Ha zeros out all but the first element of a, i.e.,

Ha = \begin{pmatrix} \alpha \\ 0 \\ \vdots \\ 0 \end{pmatrix} = \alpha e_1.

Since H is length-preserving we must have |\alpha| = \|a\|. How should we choose v? If you
write out Ha explicitly you obtain

Ha = \Big(I - 2\frac{vv^T}{v^T v}\Big) a = a - 2 v \frac{v^T a}{v^T v} = \alpha e_1

or, if we solve for v,

v = (a - \alpha e_1) \frac{v^T v}{2 v^T a}.                                          (8)

Now the quantity \frac{v^T v}{2 v^T a} is just a scalar, and if you examine the definition of H you see that
multiplying v by a scalar doesn't change anything. We then might just as well scale v so
that \frac{v^T v}{2 v^T a} = 1 and so take

v = a \pm \|a\| e_1                                                                  (9)

where I've used |\alpha| = \|a\|. It's conventional to take the plus sign in equation (9) if a_1 \ge 0
and the minus sign if a_1 < 0, to avoid cancellation errors.

Example:

Let a = [1, 2, 4, 2]^T. Then \|a\| = 5 and we obtain v = a + [5, 0, 0, 0]^T = [6, 2, 4, 2]^T.
The matrix H is

H = \begin{pmatrix} -1/5 & -2/5 & -4/5 & -2/5 \\ -2/5 & 13/15 & -4/15 & -2/15 \\ -4/5 & -4/15 & 7/15 & -4/15 \\ -2/5 & -2/15 & -4/15 & 13/15 \end{pmatrix}.

You can check that Ha = [-5, 0, 0, 0]^T.
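
A minimal sketch of the choice (9) (the helper name householder_vector is made up), checked against the example just worked:

import numpy as np

def householder_vector(a):
    """Return v = a ± ||a|| e1, with the sign of equation (9): + if a1 >= 0, - otherwise."""
    a = np.asarray(a, dtype=float)
    sign = 1.0 if a[0] >= 0 else -1.0
    v = a.copy()
    v[0] += sign * np.linalg.norm(a)
    return v

# The worked example: a = (1, 2, 4, 2), so ||a|| = 5 and v = (6, 2, 4, 2).
a = np.array([1.0, 2.0, 4.0, 2.0])
v = householder_vector(a)
H = np.eye(4) - 2.0 * np.outer(v, v) / (v @ v)
print(v)       # [6. 2. 4. 2.]
print(H @ a)   # approximately [-5, 0, 0, 0]
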

To zero out all but the top k elements of a vector a, partition a into two pieces

a = \begin{pmatrix} a_1 \\ a_2 \end{pmatrix}

where a_1 is in \mathbb{R}^{k-1} and a_2 is in \mathbb{R}^{n-k+1}. We can find a vector v_2 \in \mathbb{R}^{n-k+1} and corresponding
(n-k+1) by (n-k+1) Householder matrix H_2 with the property that

H_2 a_2 = \begin{pmatrix} \beta \\ 0 \\ \vdots \\ 0 \end{pmatrix}.

Now let v be defined by

v = \begin{pmatrix} 0 \\ v_2 \end{pmatrix}.

The corresponding Householder matrix looks like

H = \begin{pmatrix} I & 0 \\ 0 & H_2 \end{pmatrix}

and has the desired effect of zeroing out the bottom n - k elements of a.
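
In code, the block form above amounts to padding v_2 with k - 1 leading zeros; here is a sketch (the helper zero_below_k and its test vector are made up, with k counted from 1 as in the text):

import numpy as np

def zero_below_k(a, k):
    """Build the n x n Householder matrix H = [[I, 0], [0, H2]] that zeros
    entries k+1, ..., n of a (1-based k), leaving the first k-1 entries alone."""
    a = np.asarray(a, dtype=float)
    n = a.size
    a2 = a[k - 1:]                                   # the trailing n-k+1 entries
    sign = 1.0 if a2[0] >= 0 else -1.0
    v2 = a2.copy()
    v2[0] += sign * np.linalg.norm(a2)               # v2 = a2 ± ||a2|| e1
    v = np.concatenate([np.zeros(k - 1), v2])        # v = (0, v2)
    return np.eye(n) - 2.0 * np.outer(v, v) / (v @ v)

a = np.array([3.0, 1.0, 4.0, 1.0, 5.0])
H = zero_below_k(a, 3)
print(H @ a)   # first two entries of a untouched, last two approximately zero
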

Now that we have a recipe for constructing orthogonal matrices which zero all but the
first k elements of a vector a, we can apply the recipe leading to equation (6). The Q_i will
be appropriate Householder matrices. There are certain efficiency and numerical issues we
haven't addressed, but the sequence of Householder transformations as presented above is
the basic technique used by most software for computing the QR factorization. And the QR
factorization, properly implemented, is very stable numerically.
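
Here is a bare-bones sketch of the whole procedure of equation (6) (the function name householder_qr is made up; real library implementations never form the Q_k explicitly and are far more careful about efficiency):

import numpy as np

def householder_qr(A):
    """Compute A = QR by the sequence of Householder reflections in equation (6).

    A minimal, readability-first sketch; assumes nothing beyond A being m x n with m >= n.
    """
    A = np.asarray(A, dtype=float)
    m, n = A.shape
    R = A.copy()
    Q = np.eye(m)
    for k in range(n):
        a2 = R[k:, k]                        # trailing part of column k
        norm_a2 = np.linalg.norm(a2)
        if norm_a2 == 0.0:
            continue                         # column already zero below the diagonal
        v2 = a2.copy()
        v2[0] += (1.0 if a2[0] >= 0 else -1.0) * norm_a2
        Hk = np.eye(m)
        Hk[k:, k:] -= 2.0 * np.outer(v2, v2) / (v2 @ v2)
        R = Hk @ R                           # Q_k ... Q_1 A
        Q = Q @ Hk                           # accumulate Q = Q_1 Q_2 ... Q_n (each Q_k is its own transpose)
    return Q, R

# Check on the small example from earlier.
A = np.array([[1.0, 1.0], [0.0, 1.0], [2.0, 2.0]])
Q, R = householder_qr(A)
print(np.allclose(Q @ R, A), np.allclose(Q.T @ Q, np.eye(3)))
print(np.round(R, 10))   # upper triangular
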

Solving Least-Squares Problems with QR Decomposition

Let's return to Ax = b. We factor A = QR to obtain

QRx = b.

Multiply both sides by Q^T to find

Rx = Q^T b.
Now for simplicity let's consider the case where R is of full column rank. This means that
R has non-zero elements in positions R_{kk} for k = 1 to n. This is certainly the generic case.
In this case we can write

R = \begin{pmatrix} R_1 \\ 0 \end{pmatrix}

where R_1 is an n by n invertible upper triangular matrix and 0 means an (m - n) by n matrix
of zeros. We can write Rx = Q^T b as

\begin{pmatrix} R_1 \\ 0 \end{pmatrix} x = \begin{pmatrix} c_1 \\ c_2 \end{pmatrix}  (10)

where c_1 is the first n components of Q^T b and c_2 the last m - n components.
This matrix
equation is really two equations, namely R_1 x = c_1 and 0x = c_2. We can't do anything with
the second equation: no matter what the value of x, the equation 0x = c_2 will never be
satisfied (unless we're so lucky that c_2 = 0). But we can solve the first equation R_1 x = c_1
exactly (since R_1 is invertible), by doing a simple backsubstitution. Moreover, since Q is
orthogonal we have (for any choice of x) that

\|Ax - b\|^2 = \|QRx - b\|^2
            = \|Rx - Q^T b\|^2
            = \|R_1 x - c_1\|^2 + \|c_2\|^2.                                         (11)

We clearly minimize the right side of equation (11) (and hence the value of \|Ax - b\|) by
taking x so that R_1 x = c_1, which yields the least-squares solution.
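
A sketch of this recipe (the helper least_squares_qr is made up, and it assumes m \ge n with A of full column rank): factor A, split Q^T b into c_1 and c_2, backsubstitute R_1 x = c_1, and read the residual norm \|c_2\| off of (11).

import numpy as np
from scipy.linalg import solve_triangular

def least_squares_qr(A, b):
    """Minimize (1/2)||Ax - b||^2 using a full QR factorization, as in (10)-(11).

    Assumes A (m x n, m >= n) has full column rank, so R_1 is invertible.
    """
    m, n = A.shape
    Q, R = np.linalg.qr(A, mode="complete")   # Q: m x m, R: m x n
    c = Q.T @ b
    c1, c2 = c[:n], c[n:]
    R1 = R[:n, :n]
    x = solve_triangular(R1, c1)              # backsubstitution: R_1 x = c_1
    residual_norm = np.linalg.norm(c2)        # = ||Ax - b|| at the minimizer, by (11)
    return x, residual_norm

rng = np.random.default_rng(2)
A = rng.standard_normal((7, 3))
b = rng.standard_normal(7)
x, r = least_squares_qr(A, b)
print(x, r)
print(np.linalg.lstsq(A, b, rcond=None)[0])   # should match x
print(np.linalg.norm(A @ x - b))              # should match r
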

Example:

Let

A = \begin{pmatrix} 1 & 2 \\ 0 & 1 \\ 1 & 1 \end{pmatrix}, \qquad b = \begin{pmatrix} 1 \\ 1 \\ 2 \end{pmatrix}.

Then A = QR with

Q = \begin{pmatrix} \sqrt{2}/2 & \sqrt{6}/6 & -\sqrt{3}/3 \\ 0 & \sqrt{6}/3 & \sqrt{3}/3 \\ \sqrt{2}/2 & -\sqrt{6}/6 & \sqrt{3}/3 \end{pmatrix}, \qquad
R = \begin{pmatrix} \sqrt{2} & 3\sqrt{2}/2 \\ 0 & \sqrt{6}/2 \\ 0 & 0 \end{pmatrix}.

Then Rx = Q^T b becomes

\begin{pmatrix} \sqrt{2} & 3\sqrt{2}/2 \\ 0 & \sqrt{6}/2 \\ 0 & 0 \end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} =
\begin{pmatrix} 3\sqrt{2}/2 \\ \sqrt{6}/6 \\ 2\sqrt{3}/3 \end{pmatrix}.

This yields the equation (\sqrt{6}/2) x_2 = \sqrt{6}/6, from which we obtain x_2 = 1/3. The first equation then
yields x_1 = 1. This is the least-squares solution to Ax = b.
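
The same example checked numerically, using a library QR in reduced form rather than the hand computation above:

import numpy as np

A = np.array([[1.0, 2.0],
              [0.0, 1.0],
              [1.0, 1.0]])
b = np.array([1.0, 1.0, 2.0])

# Reduced QR is enough here: Q is 3x2 and R is the invertible 2x2 block R_1.
Q, R = np.linalg.qr(A)
x = np.linalg.solve(R, Q.T @ b)
print(x)                                     # approximately [1.0, 0.33333333]
print(np.linalg.lstsq(A, b, rcond=None)[0])  # same answer
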

Stability

The QR decomposition is a more stable way to solve the linear least squares problem.
We won't do a detailed analysis, but let's reconsider the problem from above in which

A = \begin{pmatrix} 1 & 1 \\ \epsilon & -2\epsilon \\ 0 & 0 \end{pmatrix}, \qquad b = \begin{pmatrix} 1 \\ 0 \\ 1 \end{pmatrix}

with ε = 0.00001. If you QR factor A you obtain (using 10 significant figure arithmetic)

Q = \begin{pmatrix} 1.0 & 0.00001 & 0.0 \\ 0.00001 & -1.0 & 0.0 \\ 0.0 & 0.0 & 1.0 \end{pmatrix}, \qquad
R = \begin{pmatrix} 1.0 & 1.0 \\ 0.0 & 0.00003 \\ 0.0 & 0.0 \end{pmatrix},

and then form Rx = Q^T b, which leads to the equations 1.0 x_1 + 0.9999999997 x_2 = 1.0 and
0.00003 x_2 = 0.00001, with solutions x_1 = 0.6666666668 and x_2 = 0.3333333333, which are
correct to 9 places.
