
EE523 Convex Optimization March 21, 2019

KAIST, Spring 2019

Lecture 7: Least-squares Problem

Recap
So far we have studied several topics: (1) the concept of convex optimization problems; (2) a categorization of such problems, as summarized in Fig. 1; (3) why convex optimization problems are tractable; (4) a number of important examples which can be translated into LPs or which can be solved via LP relaxation; (5) the simplex algorithm for LPs; (6) CVX implementation for LPs.

[Figure: the hierarchy of convex optimization problem classes with representative years — Least-squares (1800s), Linear Program (LP, 1939), Quadratic Program (QP, 1956), Second-Order Cone Program (SOCP, 1994), Semi-Definite Program (SDP, 1994), and so on. This lecture embarks on least-squares.]
Figure 1: Hierarchy of convex optimization problems

Today’s lecture
Today we are going to move on to the second instance of convex optimization problems: the least-squares problem. Specifically, in this lecture we are going to cover three topics: (1) review what the least-squares problem is; (2) provide a geometric insight that helps us understand what the least-squares solution means and therefore why the problem is important; (3) study one very important application in machine learning: the classification problem.

Review: Least-squares problem


Let us start by reviewing the least-squares problem that we studied in Lecture 1. The problem is formulated as:

min_x ‖Ax − b‖².   (1)

As mentioned earlier, one of the most important benefits of this problem is that it has a closed-form solution:

x* = (AᵀA)⁻¹Aᵀb.   (2)

Actually, in Lecture 1 we obtained this solution simply by looking for a point that makes the gradient w.r.t. x equal to 0:

∇‖Ax* − b‖² = 0.   (3)

But I never explained why (3) enables us to obtain the solution. Now recall what we learned about unconstrained convex minimization in Lecture 3: if f(x) is convex and differentiable at every point in dom f, then ∇f(x*) = 0 is the optimality condition (i.e., the necessary and sufficient condition for x* to be optimal). Applying this to the least-squares problem, we see that (3) is indeed the optimality condition, because the objective function ‖Ax − b‖² is convex in x (why? check this in PS3).
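
To make this concrete, here is a small numerical check (not part of the original lecture, just a sketch using numpy): it computes the closed-form solution (2) on random data, compares it against a generic least-squares solver, and verifies that the gradient condition (3) holds.

```python
import numpy as np

# Sketch: verify x* = (A^T A)^{-1} A^T b on random data.
rng = np.random.default_rng(0)
m, d = 100, 5                      # tall matrix: more equations than unknowns
A = rng.standard_normal((m, d))
b = rng.standard_normal(m)

x_closed_form = np.linalg.inv(A.T @ A) @ (A.T @ b)   # equation (2)
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)      # library least-squares solver

# The gradient of ||Ax - b||^2 at x* is 2 A^T (A x* - b); it should vanish (eq. (3)).
grad = 2 * A.T @ (A @ x_closed_form - b)
print(np.allclose(x_closed_form, x_lstsq))   # True
print(np.allclose(grad, 0))                  # True (up to numerical precision)
```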

Dimensions of (x, A, b)
What about the dimensions of (x, A, b)? Let d be the dimension of the optimization variable x, so x ∈ R^d. Let A := [a_1 · · · a_d] ∈ R^{m×d}. Then b ∈ R^m. We now have two cases depending on the values of m and d.
One is m < d (A is a wide matrix¹). Suppose all the row vectors of A are linearly independent, i.e., rank(A) = m, which is a typical case in practice. In this case, the number d of unknowns is larger than the number m of linear equations in Ax − b = 0. This implies that there are infinitely many solutions satisfying Ax − b = 0, so the optimal value is p* = 0, which is definitely not an interesting scenario.
The second case is m ≥ d (A is a tall matrix). Suppose that b does not lie in the range space of A, range(A) (the space spanned by the column vectors of A), which is a typical case in practice. In this case, obviously there is no solution satisfying Ax − b = 0. What we can say, however, is that there is a solution that minimizes ‖Ax − b‖², and this forms the basic idea behind the least-squares problem (which Gauss brought up in the 1800s). So what we are interested in is the second case: m ≥ d.
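
As a quick illustration of the two cases, here is a small sketch on random data (the matrix sizes are arbitrary assumptions, chosen only for the demonstration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Case 1: m < d (wide A, rank m). Ax = b has infinitely many exact solutions,
# so the optimal value of min ||Ax - b||^2 is 0.
A_wide = rng.standard_normal((3, 6))
b_wide = rng.standard_normal(3)
x_exact, *_ = np.linalg.lstsq(A_wide, b_wide, rcond=None)
print(np.allclose(A_wide @ x_exact, b_wide))        # True: zero residual

# Case 2: m >= d (tall A) with b not in range(A). No exact solution exists,
# but the least-squares solution minimizes the distance ||Ax - b||.
A_tall = rng.standard_normal((6, 3))
b_tall = rng.standard_normal(6)
x_ls, *_ = np.linalg.lstsq(A_tall, b_tall, rcond=None)
print(np.linalg.norm(A_tall @ x_ls - b_tall) > 0)   # True: p* > 0 in general
```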

Geometric insight
Now let me give you a geometric insight into the least-squares problem. From this, you will see what the least-squares solution means, and accordingly why the problem is important.

Figure 2: Geometric insight into the least-squares solution: (a) d = 1, (b) d = 2. In each panel, "distance" marks the length of the residual vector Ax − b.

Let us first consider the simplest setting, in which d = 1. In this case, A is simply a column vector and x is a scalar. Suppose we have two vectors a_1 and b, as illustrated in Fig. 2(a), in which b is not aligned with a_1 due to our earlier assumption b ∉ range(A). Notice that ‖a_1 x − b‖² is minimized when the vector a_1 x − b (marked by the thick blue line) is perpendicular to the direction of a_1. So one can interpret the least-squares solution as the distance-minimizing solution. The distance-minimizing solution is obviously what we want, so it is a good solution that matches our natural demand.

¹In fact, a majority of people use a term like "short matrix" instead. But Prof. Stephen Boyd at Stanford strongly recommended that I use a different term: "wide matrix". His rationale was that "wide matrix" carries a somewhat positive nuance, while "short matrix" sounds negative. I respect his attitude of advocating the positive aspects, so I use the "wide matrix" terminology.
We can have the same interpretation for a slightly more general case, say d = 2. In this case, A = [a_1 a_2] ∈ R^{m×2}. The vector Ax now lies in a plane, namely the range space of A: Ax ∈ range(A); see Fig. 2(b). Similarly, ‖Ax − b‖² is minimized when the vector Ax − b (marked by the thick blue line) is perpendicular to the plane, as illustrated in Fig. 2(b). So again one can interpret Ax* as the point in range(A) closest to b; in other words, x* is the distance-minimizing solution.
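
The orthogonality picture can be checked numerically. The following sketch (random data, illustrative sizes) verifies that the residual Ax* − b is orthogonal to every column of A, i.e., to range(A):

```python
import numpy as np

rng = np.random.default_rng(2)
m, d = 50, 2
A = rng.standard_normal((m, d))          # columns a1, a2 span a plane in R^m
b = rng.standard_normal(m)               # b is (almost surely) not in range(A)

x_star = np.linalg.inv(A.T @ A) @ (A.T @ b)
residual = A @ x_star - b

# A^T (A x* - b) = 0: the residual is perpendicular to the plane range(A).
print(np.allclose(A.T @ residual, 0))    # True
```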

An application: Classification problem


As mentioned earlier, the least-squares problem is a very popular and powerful problem which has played a significant role in the optimization field since its birth in the 1800s. It has been employed to address many important problems arising in a wide variety of applications.
In this lecture, I would like to put a particular emphasis on one important application that arises in machine learning and that we have already investigated earlier: the classification problem.
Remember the classification example that we studied in Lecture 5: legitimate-vs-spam email classification, in which we are given m data points {(x_i, y_i)}_{i=1}^m. Here x_i denotes a feature vector. In the example, we considered a two-dimensional case where x_i := (x_{i1}, x_{i2}), and (x_{i1}, x_{i2}) denote the frequencies of keywords 1 and 2 that appear in the ith email, respectively. Here y_i is a label indicating the identity of the ith email: y_i = +1 (legitimate email), y_i = −1 (spam email).
For this setting, we considered linear classifiers. Specifically, for the separable case, we formulated an LP which intends to find a line that separates the two datasets (legitimate vs. spam). For the non-separable case (which is the typical case in practice), we formulated a slightly different LP which finds a line that minimizes the aggregated margin.
In this lecture, we will consider a different classifier which is based on the least-squares problem and is therefore called the least-squares classifier. The idea of the least-squares classifier is to find a linear projection that minimizes the aggregated squared error.

Least-squares classifier

Figure 3: Block diagram of the least-squares classifier: the input x_i passes through a linear projection onto the weight vector w, and the output x_iᵀw is compared against the label y_i.

To see what this idea means, let us consider the block diagram for the classifier illustrated in Fig. 3. The least-squares classifier is parameterized by a weight vector, say w ∈ R^d. Given an input x_i, it computes a linear projection onto the space spanned by the weight vector w; hence, it outputs x_iᵀw. The way to design w is as follows. Using the corresponding label y_i, we first compute the squared error ‖x_iᵀw − y_i‖². Next we compute the aggregated squared error over all of the m given data points. Finally we formulate an optimization problem which minimizes the aggregation:

min_{w ∈ R^d} Σ_{i=1}^{m} ‖x_iᵀw − y_i‖².   (4)

Notice that the objective function is very similar to the one we saw in Lecture 1. Yes, that is the objective function that Gauss came up with while addressing the astronomy problem. So we can use the same simplification trick that Gauss did, thus obtaining:

Σ_{i=1}^{m} ‖x_iᵀw − y_i‖² = ‖[x_1ᵀw − y_1; · · · ; x_mᵀw − y_m]‖²
                           = ‖[x_1ᵀ; · · · ; x_mᵀ] w − [y_1; · · · ; y_m]‖².   (5)

Letting A := [x_1ᵀ; · · · ; x_mᵀ] and b := [y_1; · · · ; y_m], we can then rewrite the optimization problem as:

min_w ‖Aw − b‖²,   (6)

which is the least-squares problem. So we obtain w* as:

w* = (AᵀA)⁻¹Aᵀb.   (7)
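
As an illustration, here is a minimal sketch of the least-squares classifier on synthetic data; the feature dimension, sample count, and noise level below are arbitrary assumptions, not the lecture's dataset:

```python
import numpy as np

rng = np.random.default_rng(3)
m, d = 200, 2
X = rng.standard_normal((m, d))                           # rows are feature vectors x_i
w_true = np.array([1.0, -2.0])
y = np.sign(X @ w_true + 0.3 * rng.standard_normal(m))    # labels in {+1, -1}

# Design w* via (7): A = [x_1^T; ...; x_m^T], b = [y_1; ...; y_m].
A, b = X, y
w_star = np.linalg.inv(A.T @ A) @ (A.T @ b)

# Classify by taking the sign of the linear projection x^T w*.
y_hat = np.sign(X @ w_star)
print("training error:", np.mean(y_hat != y))
```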

Testing classifiers on new unseen data


Here a natural question arises: can the least-squares classifier (7) perform better than the linear classifier that we developed earlier using LP? To answer this question, first of all we need to know what a proper performance measure is. One very popular and conventional approach in machine learning is to employ the so-called test error. The test error is computed by testing a classifier on new unseen data, called the test data.
To see what this means, consider the following setup. Given the classifier w* designed as per (7), we feed in an unseen test data point, say x_test. The output is then x_testᵀw*. Next we declare a legitimate email if x_testᵀw* ≥ 0; otherwise we declare a spam email. In other words, we take the sign of the output to obtain ŷ_test = sign(x_testᵀw*). Comparing this to the ground-truth test label y_test, we check whether the classifier gives a correct answer (0) or not (1). Considering many such test data points, say m_test of them, we compute the test error as the average of these indicators:
TestError = (1/m_test) Σ_{i=1}^{m_test} 1{ŷ_test,i ≠ y_test,i}   (8)

where 1{·} denotes the indicator function, which outputs 1 when the event {·} is true and 0 otherwise.
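
In code, the test-error computation (8) might look as follows (a sketch; w_star, X_test, and y_test are assumed to be given as numpy arrays):

```python
import numpy as np

def test_error(w_star, X_test, y_test):
    y_hat = np.sign(X_test @ w_star)      # declare +1 (legitimate) or -1 (spam)
    return np.mean(y_hat != y_test)       # average of the indicator 1{y_hat != y}
```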
On a side note, the seen data employed for training a classifier are called the training data. In this case, the training data are:

A := [x_1ᵀ; · · · ; x_mᵀ],   b := [y_1; · · · ; y_m].

Example: Evaluation on a test dataset (m_test = 1139 + 127)
Here is an example that demonstrates the test-error performance of a trained classifier evaluated on a test dataset. Suppose that the test dataset consists of two subsets: (1) a legitimate subset containing 1139 test data points; (2) a spam subset containing 127 test data points. Applying the trained classifier, we get 1120 correct and 19 wrong answers for the 1139 legitimate emails. For the 127 spam emails, we get 95 correct and 32 wrong answers. See Fig. 4.

                       declared legitimate (ŷ = +1)         declared spam (ŷ = −1)
legitimate (y = +1)    1120 (true positive)                  19 (misdetection)
spam (y = −1)          32 (false positive / false alarm)     95

Figure 4: Test-error performance of a trained classifier on a test dataset.

Then, the test error is computed as:

TestError = (19 + 32) / (1139 + 127) ≈ 4%.   (9)
Actually this error can be categorized into two types depending on what the ground truth is, and a different emphasis should be put on the two types of errors depending on the application. To explain what this means, let us first introduce some terminology needed to understand the two types of errors. The first is the true positive case, which indicates the event {ŷ = +1 | y = +1}. The second is the misdetection case, indicating {ŷ = −1 | y = +1}. The third is the false positive (or false alarm) case, which refers to {ŷ = +1 | y = −1}.
Now the first type of error concerns the misdetection case and is therefore called the misdetection error: Pr{ŷ = −1 | y = +1}. The second type of error refers to the false positive case, so it is called the false positive error: Pr{ŷ = +1 | y = −1}. In the example of Fig. 4, the two types of error can be computed as:
MisdetectionError = 19 / (19 + 1120) ≈ 1.7%;
FalsePositiveError = 32 / (32 + 95) ≈ 25%.
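
A quick sanity check of these numbers (plain arithmetic on the counts from Fig. 4):

```python
# Counts from Fig. 4.
true_positive, misdetection = 1120, 19          # legitimate emails: correct / wrong
false_positive, true_negative = 32, 95          # spam emails: wrong / correct

test_error      = (misdetection + false_positive) / (1139 + 127)      # ~ 4%
misdetect_error = misdetection / (misdetection + true_positive)       # ~ 1.7%
false_pos_error = false_positive / (false_positive + true_negative)   # ~ 25%
print(round(test_error, 3), round(misdetect_error, 3), round(false_pos_error, 3))
```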

Notice that the two errors are highly imbalanced; one is much smaller than the other. Actually, if you think about it, this is a desirable situation. In practice, it is crucial not to miss legitimate emails. In other words, we should reliably declare emails as legitimate when they indeed are, meaning that we should reduce the misdetection error as much as possible. In this case it is around 1.7%, which is more or less okay.
On the other hand, we can tolerate declaring emails legitimate when they are actually spam (i.e., letting some spam through), meaning that a moderate value of the false positive error is acceptable in practice. In this case it is around 25%, which is more or less okay.

Linear classifier vs. least-squares classifier

Now that we know about the test error, which is a proper performance measure, we are ready to compare the performance of the two classifiers: the linear classifier and the least-squares classifier. To this end, we first gather a training dataset {(x_i, y_i)}_{i=1}^m. We then use it to design the margin-based linear classifier (using LP) as well as the least-squares classifier (using (7)). Next we test the classifiers on a test dataset {(x_test,i, y_test,i)}_{i=1}^{m_test} to compute TestError_linear and TestError_LeastSquares. This is how we compare performance. Now you may wonder which is better in terms of the test-error measure. Don't worry. You will have a chance to check this in PS.

Regularization technique
Actually there is an issue in applying the least-squares classifier without any modification. In reality, a data point, say x, contains some noise. Data points are usually obtained from measurements made by humans or sensors, and humans and sensors are not perfect, so x inevitably contains some error. This error incurs an issue: large entries of w* (designed as per (7)) can amplify such noise.
To avoid this, we want to keep those entries small. One way to implement this is to minimize ‖w‖². But at the same time, we obviously want to make ‖Aw − b‖² small; otherwise w* would always be 0, which is obviously not what we want. This motivates a natural idea: regulate the two objectives at the same time:

min_{w ∈ R^d} ‖Aw − b‖² + λ‖w‖²   (10)

where λ ≥ 0. Notice that for one extreme case, λ = 0, we obtain the conventional least-squares solution, while for the other extreme case, λ = ∞, we get w* = 0, in which case we declare spam emails randomly with probability 0.5. This technique is called regularization, and λ is called the regularization factor.
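
As a sketch, the regularized problem (10) can be written almost verbatim in cvxpy (the Python analogue of the CVX package mentioned earlier in the course); the data and the value of λ below are placeholders, not the lecture's dataset:

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(4)
m, d = 200, 2
A = rng.standard_normal((m, d))           # training features, rows x_i^T
b = np.sign(rng.standard_normal(m))       # training labels in {+1, -1}
lam = 0.1                                 # regularization factor (illustrative value)

w = cp.Variable(d)
objective = cp.Minimize(cp.sum_squares(A @ w - b) + lam * cp.sum_squares(w))
cp.Problem(objective).solve()
print(w.value)                            # regularized least-squares classifier weights
```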

Figure 5: Training error and test error as functions of the regularization factor λ.

Now you may wonder how the performance varies with the regularization factor λ. First consider the training error, which is defined as:

TrainError = (1/m) Σ_{i=1}^{m} 1{ŷ_i ≠ y_i}   (11)

where ŷ_i = sign(x_iᵀw*). Obviously the training error is minimized at λ = 0, because in that case we focus only on the error induced by the training dataset. And it monotonically increases with λ: the higher λ, the more we regularize and the more the training-error term is penalized. So we get something like the blue curve plotted in Fig. 5.
On the other hand, the situation is different for the test error, defined as:

TestError = (1/m_test) Σ_{i=1}^{m_test} 1{ŷ_test,i ≠ y_test,i}.   (12)

When λ = 0, the test error would be small but larger than the training error, because the λ = 0 case focuses only on the training error. Increasing λ, we get a regularization effect, so we expect a smaller test error. But if λ is too large, then the classifier would be close to w* = 0, for which the test error would obviously be very large. So one can expect a sweet spot in λ that minimizes the test error. Hence, we may obtain something like the red curve plotted in Fig. 5. Actually this is indeed the case. Again, you will have a chance to check this in PS.

How to solve the regularized problem?


Going back to the regularized least-squares problem (10), how can we solve it? If you think about it, this is nothing but another least-squares problem. Why? Again applying Gauss's simplification trick, we obtain:

‖Aw − b‖² + λ‖w‖² = ‖[A; √λ·I] w − [b; 0]‖² = ‖A′w − b′‖²

where A′ := [A; √λ·I] ∈ R^{(m+d)×d} (I being the d×d identity matrix) and b′ := [b; 0]. Hence, we get:

min_{w ∈ R^d} ‖A′w − b′‖².   (13)

So the solution is:

w* = (A′ᵀA′)⁻¹A′ᵀb′.   (14)
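
The following sketch (random data, arbitrary λ) solves the stacked problem (13)–(14) and checks it against the equivalent direct formula w* = (AᵀA + λI)⁻¹Aᵀb, which follows from A′ᵀA′ = AᵀA + λI and A′ᵀb′ = Aᵀb:

```python
import numpy as np

rng = np.random.default_rng(5)
m, d = 100, 3
A = rng.standard_normal((m, d))
b = rng.standard_normal(m)
lam = 0.5

# Stack A' = [A; sqrt(lambda) I] and b' = [b; 0], then apply equation (14).
A_prime = np.vstack([A, np.sqrt(lam) * np.eye(d)])
b_prime = np.concatenate([b, np.zeros(d)])
w_stacked = np.linalg.inv(A_prime.T @ A_prime) @ (A_prime.T @ b_prime)

# Direct (ridge) formula for comparison.
w_direct = np.linalg.inv(A.T @ A + lam * np.eye(d)) @ (A.T @ b)
print(np.allclose(w_stacked, w_direct))   # True
```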

Look ahead
Next time, we will study another application in which the least-squares problem has played a crucial role, this time in the medical field: computed tomography (CT).
