Lecture 7: Least-Squares Problem: Convex Optimization
Recap
So far we have studied several topics: (1) the concept of convex optimization problems; (2) a categorization of such problems, as summarized in Fig. 1; (3) why convex optimization problems are tractable; (4) a number of important examples which can be translated into LPs or solved via LP relaxation; (5) the simplex algorithm for LPs; (6) the CVX implementation for LPs.
Figure 1: Categorization of convex optimization problems: least-squares (1800s), Linear Program (LP, 1939), Quadratic Program (QP, 1956). In this lecture we embark on least-squares.
Today’s lecture
Today we are going to move onto the second instance of convex optimization problems: the least-squares problem. Specifically, in this lecture we will cover three topics: (1) we will review what the least-squares problem is; (2) we will provide a geometric insight that helps us understand what the least-squares solution means and therefore why the problem is important; (3) we will study one very important application in machine learning: the classification problem.
But I never explained why (3) enables us to obtain the solution. Now remember what we learned about unconstrained convex minimization in Lecture 3: if f(x) is convex and differentiable at every point in dom f, then ∇f(x∗) = 0 is the optimality condition (i.e., the sufficient and necessary condition for x∗ to be optimal). Applying this to the least-squares problem, we see that (3) is indeed the optimality condition. This is because the objective function ‖Ax − b‖^2 is convex in x (why? check this in PS3).
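To make the optimality condition concrete, here is a minimal numerical sketch (not from the lecture; it assumes NumPy and a small randomly generated A and b): since ∇f(x) = 2Aᵀ(Ax − b) for f(x) = ‖Ax − b‖^2, the condition ∇f(x∗) = 0 is exactly the linear system AᵀA x∗ = Aᵀb, which can be solved directly and cross-checked against a generic least-squares routine.

```python
# A minimal sketch (assumes NumPy): solve the least-squares problem by setting
# the gradient of f(x) = ||Ax - b||^2 to zero, i.e., the normal equation
# A^T A x* = A^T b, and compare with a generic solver.
import numpy as np

rng = np.random.default_rng(0)
m, d = 8, 3                      # tall A: more equations than unknowns
A = rng.standard_normal((m, d))
b = rng.standard_normal(m)       # generic b, almost surely not in range(A)

# Optimality condition: grad f(x*) = 2 A^T (A x* - b) = 0  =>  A^T A x* = A^T b
x_star = np.linalg.solve(A.T @ A, A.T @ b)

# Sanity checks
print(np.allclose(A.T @ (A @ x_star - b), 0))                     # gradient vanishes
print(np.allclose(x_star, np.linalg.lstsq(A, b, rcond=None)[0]))  # matches lstsq
```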
Dimensions of (x, A, b)
What about dimensions of (x, A, b)? Let d be the dimension of the optimization variable x.
Then, x ∈ Rd . Let A := [a1 · · · ad ] ∈ Rm×d . Then, b ∈ Rm . We now have two cases depending
on the values of m and d.
One case is m < d (A is a wide matrix¹). Suppose all the row vectors of A are linearly independent, i.e., rank(A) = m, which is the typical case in practice. Here the number d of unknowns is larger than the number m of linear equations in Ax − b = 0. This implies that there are infinitely many solutions satisfying Ax − b = 0. So in this case the optimal value is p∗ = 0, which is definitely not an interesting scenario.
The second case is m ≥ d (A is a tall matrix). Suppose that b does not lie in the range space of A, range(A) (the space spanned by all the column vectors of A), which is the typical case in practice. In this case, obviously there is no solution satisfying Ax − b = 0. What we can say, however, is that there is a solution that minimizes ‖Ax − b‖^2, and this forms the basic idea behind the least-squares problem (which Gauss brought up in the 1800s). So what we are interested in is the second case: m ≥ d.
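As a quick numerical illustration of the two cases (a hypothetical sketch, not part of the lecture, assuming NumPy), one can check that a wide full-row-rank A achieves p∗ = 0, while a tall A with a generic b leaves a strictly positive residual:

```python
# A small sketch (assumes NumPy) contrasting the two cases of the least-squares problem.
import numpy as np

rng = np.random.default_rng(1)

# Case 1: m < d (wide A, full row rank) -> Ax = b is solvable, so p* = 0.
A_wide = rng.standard_normal((3, 5))
b_wide = rng.standard_normal(3)
x_wide = np.linalg.lstsq(A_wide, b_wide, rcond=None)[0]
print(np.linalg.norm(A_wide @ x_wide - b_wide)**2)   # ~0: optimal value p* = 0

# Case 2: m >= d (tall A, generic b not in range(A)) -> no exact solution,
# but a residual-minimizing x* exists.
A_tall = rng.standard_normal((5, 3))
b_tall = rng.standard_normal(5)
x_tall = np.linalg.lstsq(A_tall, b_tall, rcond=None)[0]
print(np.linalg.norm(A_tall @ x_tall - b_tall)**2)   # strictly positive in general
```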
Geometric insight
Now let me give you a geometric insight behind the least-squares problem. From this, you will see what the least-squares solution means, and accordingly why the problem is important.
Figure 2: Geometric picture of the least-squares problem for (a) d = 1 and (b) d = 2; the marked distance is the length of Ax − b.
Let us first consider the simplest setting, in which d = 1. In this case, A is simply a column vector and x is a scalar. Suppose we have two vectors a_1 and b, as illustrated in Fig. 2(a), in which b is not aligned with a_1 due to our earlier assumption: b ∉ range(A). Notice that ‖a_1 x − b‖^2 is minimized when the vector a_1 x − b (marked by the blue thick line) is perpendicular to the direction of a_1. From this, one can interpret the least-squares solution as the distance-minimizing solution. The distance-minimizing solution is obviously what we want. So it is sort of a good solution, one that matches well with our natural demand.
¹ In fact, a majority of people use a different terminology, such as a short matrix. But Prof. Stephen Boyd at Stanford strongly recommended that I use the terminology of a wide matrix. His rationale was that a wide matrix carries a sort of positive nuance, while a short matrix sounds negative. I respect his attitude of advocating positive aspects, so I use the "wide matrix" terminology.
We can have the same interpretation for a slightly more general case, say d = 2. In this case, A = [a_1 a_2] ∈ R^{m×2}. The vector Ax now lies in a plane, which is the range space of A: Ax ∈ range(A); see Fig. 2(b). Similarly, ‖Ax − b‖^2 is minimized when the vector Ax − b (marked by the blue thick line) is perpendicular to the plane, as illustrated in Fig. 2(b). So again one can interpret x∗ as the distance-minimizing solution (Ax∗ is the point in range(A) closest to b).
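The perpendicularity in Fig. 2 can also be checked numerically: at the least-squares solution, the residual Ax∗ − b is orthogonal to every column of A, i.e., Aᵀ(Ax∗ − b) = 0. Below is a minimal sketch assuming NumPy; the dimensions are arbitrary.

```python
# Sketch (assumes NumPy): the least-squares residual A x* - b is orthogonal to range(A).
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 2))          # d = 2 case: range(A) is a plane in R^6
b = rng.standard_normal(6)

x_star = np.linalg.lstsq(A, b, rcond=None)[0]
residual = A @ x_star - b

# The residual is perpendicular to both column vectors a_1 and a_2.
print(np.allclose(A.T @ residual, 0))    # True
```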
Least-squares classifier
Figure 3: Block diagram of the least-squares classifier: the input x_i is linearly projected onto the weight vector w, yielding the output x_i^T w, which is compared against the label y_i.
To see what the idea means, let us consider a block diagram of the classifier, illustrated in Fig. 3. The least-squares classifier is parameterized by a weight vector, say w ∈ R^d. Given an input x_i, it computes a linear projection onto the space spanned by the weight vector w; hence it outputs x_i^T w. The way to design w is as follows. Using the corresponding label y_i, we first compute the squared error (x_i^T w − y_i)^2. Next, we compute the aggregate squared error over all of the m given data points. Finally, we formulate an optimization problem which minimizes this aggregate:
    min_{w ∈ R^d}  sum_{i=1}^m (x_i^T w − y_i)^2.        (4)
Notice that the objective function is very similar to the one that we saw in Lecture 1: it is the objective function that Gauss came up with while addressing the astronomy problem. So we can apply the same simplification trick that Gauss used, thus obtaining:
    sum_{i=1}^m (x_i^T w − y_i)^2 = ‖ [x_1^T w − y_1; · · · ; x_m^T w − y_m] ‖^2
                                  = ‖ [x_1^T; · · · ; x_m^T] w − [y_1; · · · ; y_m] ‖^2.        (5)
Letting A := [x_1^T; · · · ; x_m^T] and b := [y_1; · · · ; y_m], we can then rewrite the optimization problem as:

    min_{w ∈ R^d}  ‖Aw − b‖^2.
(Here and below, 1{·} denotes an indicator function which outputs 1 when the event {·} is true and 0 otherwise.)
On a side note, the data employed for training a classifier are called the training data. In this case, the training data are:
    A := [x_1^T; · · · ; x_m^T],    b := [y_1; · · · ; y_m].
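As a rough sketch of how this training step might look in code (hypothetical synthetic data, assuming NumPy and labels y_i ∈ {−1, +1} as in the classification setting), one stacks the data points into A and b, solves the least-squares problem for w∗, and predicts with sign(x_i^T w∗):

```python
# A minimal sketch of the least-squares classifier (hypothetical synthetic data;
# assumes NumPy and labels y_i in {-1, +1}).
import numpy as np

rng = np.random.default_rng(3)
m, d = 100, 5
X = rng.standard_normal((m, d))                           # rows are x_i^T
w_true = rng.standard_normal(d)
y = np.sign(X @ w_true + 0.3 * rng.standard_normal(m))    # noisy +/-1 labels

A, b = X, y                                    # A := [x_1^T; ...; x_m^T], b := [y_1; ...; y_m]
w_star = np.linalg.lstsq(A, b, rcond=None)[0]  # minimizes ||A w - b||^2

y_hat = np.sign(X @ w_star)                    # classifier output: sign(x_i^T w*)
train_error = np.mean(y_hat != y)
print(train_error)
```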
Example: Evaluation on a test dataset (m_test = 1139 + 127)
Here is an example that demonstrates the test error performance of a trained classifier evaluated on a test dataset. Suppose that the test dataset consists of two parts: (1) a legitimate dataset containing 1139 test data points; (2) a spam dataset containing 127 test data points. Applying the trained classifier, we get 1120 correct and 19 wrong answers for the 1139 legitimate emails. For the 127 spam emails, we get 95 correct and 32 wrong answers. See Fig. 4.
Figure 4: Classification results on the test dataset.
    legitimate (1139): 1120 true positive, 19 misdetection
    spam (127): 32 false positive (false alarm), 95 correct
Notice that the two errors are highly imbalanced; one is much smaller than the other. Actually, if you think about it, this is sort of a desired situation. In reality, it is crucial not to miss legitimate emails. In other words, we should be able to declare legitimate emails as legitimate whenever that is the case, meaning that we should reduce the misdetection error as much as possible. In this case it is around 1.7%, which is more or less okay.
On the other hand, we are sort of okay with declaring emails as legitimate when they are actually spam (i.e., letting some spam through), meaning that a moderate value of the false positive error is acceptable in reality. In this case it is around 25%, which is more or less okay.
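To make the two error rates concrete, here is a tiny sketch (plain Python; the counts are the ones from Fig. 4) that computes the misdetection rate 19/1139 ≈ 1.7% and the false positive rate 32/127 ≈ 25%:

```python
# Sketch: error rates computed from the counts in Fig. 4.
n_legit, n_spam = 1139, 127
correct_legit, missed_legit = 1120, 19       # legitimate emails: true positives / misdetections
false_alarm, correct_spam = 32, 95           # spam emails: false positives (false alarms) / correct

misdetection_rate = missed_legit / n_legit   # ~0.017 (about 1.7%)
false_positive_rate = false_alarm / n_spam   # ~0.252 (about 25%)
print(f"misdetection: {misdetection_rate:.1%}, false positive: {false_positive_rate:.1%}")
```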
Since we now know about the test error, which is a proper performance measure, we are ready to compare the performance of the two classifiers: the margin-based linear classifier and the least-squares classifier. To this end, we first gather a training dataset {(x_i, y_i)}_{i=1}^m. We then use it to design the margin-based linear classifier (using LP) as well as the least-squares classifier (using (7)). Next, we test the classifiers on a test dataset {(x_test,i, y_test,i)}_{i=1}^{m_test} to compute TestError_linear and TestError_LeastSquares. This is how we compare performance. Now you may wonder which one is better in terms of the test error measure. Don't worry. You will have a chance to check this in PS.
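A sketch of this comparison step is given below (hypothetical; it assumes NumPy, ±1 labels, and that the two weight vectors w_linear and w_ls have already been designed, with the test data stored in arrays X_test and y_test):

```python
# Sketch (assumes NumPy): comparing two classifiers by their test error.
import numpy as np

def test_error(w, X_test, y_test):
    """Fraction of test points where sign(x^T w) disagrees with the label."""
    return np.mean(np.sign(X_test @ w) != y_test)

# Hypothetical usage, once w_linear (margin-based LP) and w_ls (least-squares) are trained:
# err_linear = test_error(w_linear, X_test, y_test)
# err_ls     = test_error(w_ls, X_test, y_test)
```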
Regularization technique
Actually, there is an issue in applying the least-squares classifier without any modification. In reality, a data point, say x, contains some noise. Data points are usually obtained from measurements made by humans or sensors, and humans and sensors are not perfect, so x inevitably contains some error. This error raises an issue: large values of w∗ (designed as per (7)) can boost up such noise.
To avoid this, we want to make those values small somehow. One way to implement this is to minimize ‖w‖^2. But at the same time, we obviously want to make ‖Aw − y‖^2 small; otherwise, w∗ would always be 0, which is obviously not what we want. This motivates a natural idea: balance the two objectives at the same time:

    min_{w ∈ R^d}  ‖Aw − y‖^2 + λ‖w‖^2,
where λ ≥ 0. Notice that for one extreme case, λ = 0, we obtain the conventional least-squares solution, while for the other extreme case, λ = ∞, we get w∗ = 0, in which case we declare spam emails randomly with probability 0.5. This technique is called regularization, and λ is called the regularization factor.
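For a concrete picture, here is a minimal sketch (assuming NumPy and synthetic data; it uses the standard closed form w∗(λ) = (AᵀA + λI)⁻¹Aᵀy of the regularized problem, obtained by setting the gradient to zero, although that derivation is not spelled out here):

```python
# Sketch (assumes NumPy): regularized least squares, w*(lambda) = (A^T A + lambda I)^{-1} A^T y.
import numpy as np

def ridge_solution(A, y, lam):
    d = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ y)

rng = np.random.default_rng(4)
A = rng.standard_normal((50, 5))
y = np.sign(rng.standard_normal(50))

print(np.linalg.norm(ridge_solution(A, y, 0.0)))     # lambda = 0: plain least-squares solution
print(np.linalg.norm(ridge_solution(A, y, 1e6)))     # very large lambda: w* shrinks toward 0
```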
Figure 5: Training error and test error as functions of the regularization factor λ.
Now you may wonder how the performance varies with the regularization factor λ. First consider the training error, which is defined as:
    TrainError = (1/m) sum_{i=1}^m 1{ŷ_i ≠ y_i},        (11)
where ŷ_i = sign(x_i^T w∗). Obviously, the training error is minimized at λ = 0, because in that case we focus only on the error induced by the training dataset. And it monotonically increases with λ, since the higher λ is, the more we regularize, placing less emphasis on fitting the training data. So we will get something like the blue curve plotted in Fig. 5.
On the other hand, the situation is different for the test error, defined as:
    TestError = (1/m_test) sum_{i=1}^{m_test} 1{ŷ_test,i ≠ y_test,i}.        (12)
When λ = 0, the test error would be small, but larger than the training error, because the λ = 0 case focuses only on the training error. Increasing λ, we get a regularization effect, so we expect a smaller test error. But if λ is too big, then the classifier would be close to w∗ = 0, in which case the test error would obviously be very large. So one can expect a sweet spot of λ that minimizes the test error. Hence, we may obtain something like the red curve plotted in Fig. 5. Actually, this is indeed the case. Again, you will have a chance to check this in PS.
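The sweet spot can be searched for numerically. Below is a rough sketch (synthetic data, assuming NumPy and ±1 labels; the closed form from the previous sketch is reused) that sweeps λ and reports the training and test errors:

```python
# Sketch (assumes NumPy, +/-1 labels, synthetic data): sweep the regularization factor
# lambda and record training / test errors, as in Fig. 5.
import numpy as np

rng = np.random.default_rng(5)
d, m, m_test = 20, 40, 200
w_true = rng.standard_normal(d)

def make_data(n):
    X = rng.standard_normal((n, d))
    y = np.sign(X @ w_true + rng.standard_normal(n))   # noisy labels
    return X, y

X_tr, y_tr = make_data(m)
X_te, y_te = make_data(m_test)

def error(w, X, y):
    return np.mean(np.sign(X @ w) != y)

for lam in [0.0, 0.1, 1.0, 10.0, 100.0, 1000.0]:
    w = np.linalg.solve(X_tr.T @ X_tr + lam * np.eye(d), X_tr.T @ y_tr)
    print(f"lambda={lam:8.1f}  train={error(w, X_tr, y_tr):.3f}  test={error(w, X_te, y_te):.3f}")
```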
Look ahead
Next time, we will study another application in which the least-squares problem has played a
crucial role in a different field, the medical field: Computed Tomography (CT).