In this section we describe two very widely known and used special subclasses of convex optimization: least-squares and linear programming. (A complete technical treatment of these problems will be given in chapter 4.)

1.2.1 Least-squares problems

A least-squares problem is an optimization problem with no constraints (i.e., $m = 0$) and an objective which is a sum of squares of terms of the form $a_i^T x - b_i$:

    minimize   $f_0(x) = \|Ax - b\|_2^2 = \sum_{i=1}^{k} (a_i^T x - b_i)^2$.        (1.4)

Here $A \in \mathbf{R}^{k \times n}$ (with $k \geq n$), $a_i^T$ are the rows of $A$, and the vector $x \in \mathbf{R}^n$ is the optimization variable.
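As a small numerical aside (not from the text), the two expressions for the objective in (1.4), the squared norm and the sum of squares, can be checked against each other; the data $A$, $b$, and $x$ below are arbitrary examples:

```python
import numpy as np

# A small instance of problem (1.4): k = 4 terms, n = 2 variables.
A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0],
              [7.0, 8.0]])
b = np.array([1.0, 2.0, 3.0, 4.0])
x = np.array([0.5, -0.25])

# The objective written as a squared norm ...
f0_norm = np.linalg.norm(A @ x - b) ** 2
# ... and as a sum of squares of the terms a_i^T x - b_i.
f0_sum = sum((A[i] @ x - b[i]) ** 2 for i in range(A.shape[0]))

assert np.isclose(f0_norm, f0_sum)
```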
Solving least-squares problems

The solution of a least-squares problem (1.4) can be reduced to solving a set of linear equations, $(A^T A)x = A^T b$, so we have the analytical solution $x = (A^T A)^{-1} A^T b$. For least-squares problems we have good algorithms (and software implementations) for solving the problem to high accuracy, with very high reliability. The least-squares problem can be solved in a time approximately proportional to $n^2 k$, with a known constant. A current desktop computer can solve a least-squares problem with hundreds of variables, and thousands of terms, in a few seconds; more powerful computers, of course, can solve larger problems, or the same size problems, faster. (Moreover, these solution times will decrease exponentially in the future, according to Moore's law.) Algorithms and software for solving least-squares problems are reliable enough for embedded optimization.
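The following sketch (using NumPy, on randomly generated data in the "hundreds of variables, thousands of terms" regime mentioned above) illustrates the two routes just described: the normal equations, and a standard library least-squares routine:

```python
import numpy as np

rng = np.random.default_rng(0)
k, n = 2000, 200                      # thousands of terms, hundreds of variables
A = rng.standard_normal((k, n))
b = rng.standard_normal(k)

# Analytical solution via the normal equations (A^T A) x = A^T b.
x_normal = np.linalg.solve(A.T @ A, A.T @ b)

# In practice one calls a library least-squares routine, which uses a more
# numerically stable factorization (QR or SVD) rather than forming A^T A.
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

print(np.max(np.abs(x_normal - x_lstsq)))   # the two solutions agree closely
```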
In many cases we can solve even larger least-squares problems, by exploiting some special structure in the coefficient matrix $A$. Suppose, for example, that the matrix $A$ is sparse, which means that it has far fewer than $kn$ nonzero entries. By exploiting sparsity, we can usually solve the least-squares problem much faster than order $n^2 k$. A current desktop computer can solve a sparse least-squares problem with tens of thousands of variables, and hundreds of thousands of terms, in around a minute (although this depends on the particular sparsity pattern).
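As one possible illustration (a sketch, not a prescription of any particular solver), an iterative method such as SciPy's LSQR exploits sparsity because it only requires matrix-vector products with $A$ and $A^T$ and never forms a dense matrix; the data below are randomly generated:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import lsqr

rng = np.random.default_rng(0)
k, n, nnz = 200_000, 20_000, 1_000_000    # far fewer than k*n nonzero entries

# Build a random sparse A in compressed sparse row (CSR) format.
rows = rng.integers(0, k, size=nnz)
cols = rng.integers(0, n, size=nnz)
vals = rng.standard_normal(nnz)
A = sp.csr_matrix((vals, (rows, cols)), shape=(k, n))
b = rng.standard_normal(k)

# LSQR only needs products with A and A^T, so it exploits sparsity
# and never forms the dense matrix A^T A.
x = lsqr(A, b, atol=1e-8, btol=1e-8)[0]
```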
For extremely large problems (say, with millions of variables), or for problems with exacting real-time computing requirements, solving a least-squares problem can be a challenge. But in the vast majority of cases, we can say that existing methods are very effective, and extremely reliable. Indeed, we can say that solving least-squares problems (that are not on the boundary of what is currently achievable) is a (mature) technology, that can be reliably used by many people who do not know, and do not need to know, the details.

Using least-squares

The least-squares problem is the basis for regression analysis, optimal control, and many parameter estimation and data fitting methods. It has a number of statistical interpretations, e.g., as maximum likelihood estimation of a vector $x$, given linear measurements corrupted by Gaussian measurement errors.
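To spell out the maximum likelihood interpretation: if the measurements are $b_i = a_i^T x + v_i$ with independent noise $v_i \sim \mathcal{N}(0, \sigma^2)$, the log-likelihood of $x$ is

    $\ell(x) = -\frac{k}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{k}(a_i^T x - b_i)^2,$

so maximizing $\ell$ over $x$ is the same as minimizing $\sum_{i=1}^{k}(a_i^T x - b_i)^2 = \|Ax - b\|_2^2$, i.e., the least-squares problem (1.4).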
Recognizing an optimization problem as a least-squares problem is straightforward; we only need to verify that the objective is a quadratic function (and then test whether the associated quadratic form is positive semidefinite). While the basic least-squares problem has a simple fixed form, several standard techniques are used to increase its flexibility in applications.
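For instance, the positive semidefiniteness test mentioned above can be carried out numerically on the matrix of the quadratic form; the matrix P below is a hypothetical example:

```python
import numpy as np

# Hypothetical quadratic objective f(x) = x^T P x + 2 q^T x + r.
P = np.array([[2.0, 0.5],
              [0.5, 1.0]])

# The quadratic form is positive semidefinite exactly when all eigenvalues
# of the symmetric part of P are nonnegative (up to numerical tolerance).
sym = (P + P.T) / 2
eigvals = np.linalg.eigvalsh(sym)
is_psd = bool(np.all(eigvals >= -1e-10))
print(is_psd)
```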
In weighted least-squares, the weighted least-squares cost

    $\sum_{i=1}^{k} w_i (a_i^T x - b_i)^2$,

where $w_1, \ldots, w_k$ are positive, is minimized. (This problem is readily cast and solved as a standard least-squares problem.) Here the weights $w_i$ are chosen to reflect differing levels of concern about the sizes of the terms $a_i^T x - b_i$, or simply to influence the solution. In a statistical setting, weighted least-squares arises in estimation of a vector $x$, given linear measurements corrupted by errors with unequal variances.
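One way to see how the weighted problem is cast as a standard least-squares problem (a sketch; the data below are random) is to scale each row of $A$ and each entry of $b$ by the square root of its weight:

```python
import numpy as np

rng = np.random.default_rng(0)
k, n = 100, 5
A = rng.standard_normal((k, n))
b = rng.standard_normal(k)
w = rng.uniform(0.5, 2.0, size=k)        # positive weights w_1, ..., w_k

# Minimizing sum_i w_i (a_i^T x - b_i)^2 is the same ordinary least-squares
# problem with rows scaled by sqrt(w_i):  || diag(sqrt(w)) (A x - b) ||_2^2.
sw = np.sqrt(w)
x, *_ = np.linalg.lstsq(sw[:, None] * A, sw * b, rcond=None)
```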
Another technique in least-squares is regularization, in which extra terms are added to the cost function. In the simplest case, a positive multiple of the sum of squares of the variables is added to the cost function:

    $\sum_{i=1}^{k} (a_i^T x - b_i)^2 + \rho \sum_{i=1}^{n} x_i^2$,

where $\rho > 0$. (This problem too can be formulated as a standard least-squares problem.) The extra terms penalize large values of $x$, and result in a sensible solution in cases when minimizing the first sum only does not. The parameter $\rho$ is chosen by the user to give the right trade-off between making the original objective function $\sum_{i=1}^{k} (a_i^T x - b_i)^2$ small, while keeping $\sum_{i=1}^{n} x_i^2$ not too big. Regularization comes up in statistical estimation when the vector $x$ to be estimated is given a prior distribution.
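A sketch of how the regularized problem can be formulated as a standard least-squares problem (the data here are random, and $\rho = 0.1$ is an arbitrary choice) is to stack $\sqrt{\rho}\, I$ under $A$ and zeros under $b$:

```python
import numpy as np

rng = np.random.default_rng(0)
k, n = 100, 5
A = rng.standard_normal((k, n))
b = rng.standard_normal(k)
rho = 0.1                                 # regularization parameter, rho > 0

# Minimizing ||Ax - b||_2^2 + rho * ||x||_2^2 is an ordinary least-squares
# problem with the stacked data [A; sqrt(rho) I] and [b; 0].
A_reg = np.vstack([A, np.sqrt(rho) * np.eye(n)])
b_reg = np.concatenate([b, np.zeros(n)])
x, *_ = np.linalg.lstsq(A_reg, b_reg, rcond=None)

# Equivalently, x solves the normal equations (A^T A + rho I) x = A^T b.
x_check = np.linalg.solve(A.T @ A + rho * np.eye(n), A.T @ b)
assert np.allclose(x, x_check)
```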