
Stanford University, Management Science & Engineering (and ICME)

MS&E 318 (CME 338) Large-Scale Numerical Optimization


Instructor: Michael Saunders Spring 2015
Notes 9: Augmented Lagrangian Methods

1 Origins
Large-scale solvers such as CONOPT [7, 1], LANCELOT [4, 5, 12], MINOS [13, 14], and
SNOPT [8, 9] are designed to solve constrained optimization problems in the following fairly
general form:

NCO    minimize    φ(x)
       x ∈ Rⁿ
                          ⎛  x   ⎞
       subject to   ℓ ≤   ⎜  Ax  ⎟   ≤ u,
                          ⎝ c(x) ⎠

where A is a sparse matrix, c(x) is a vector of smooth nonlinear functions, and the bounds
ℓ and u are likely to exist as shown. (As always, some bounds could be infinite and some
could imply equality constraints: ℓj = uj.)
In the early days of optimization, it was a challenge to optimize a nonlinear function φ(x)
with no constraints. Bounds on the variables (ℓj ≤ xj ≤ uj) could be accommodated with
reasonable ease by active-set strategies, but more general constraints tended to be imposed
via penalty terms and barrier terms in the objective.
In many cases it is general enough to deal with nonlinear equality constraints and simple
bounds:

NCB    minimize    φ(x)
       x ∈ Rⁿ
       subject to  c(x) = 0,   ℓ ≤ x ≤ u,

where c(x) ∈ Rᵐ may include linear functions, and x includes slack variables to deal with
inequality constraints.
Some applications require all iterates to be feasible. FSQP is one solver that achieves this.
The Generalized Reduced Gradient (GRG) approach exemplified by CONOPT is slightly
different. Assuming the current point is feasible, it generates the next search direction
by assuming c(x) is linear. It may evaluate c(x + αp) at infeasible points, but it restores
feasibility at each stage before “moving on”.
Otherwise, most methods are content to satisfy c(x) = 0 only in the limit. The main
approaches of interest are penalty methods, augmented Lagrangian methods (which solve se-
quences of optimization subproblems with linear constraints) and nonlinear interior methods
(which solve sequences of nonlinear equations).

1.1 The Lagrangian


A fundamental function associated with problem NCB (with or without the bound con-
straints) is the Lagrangian:
                     L(x, y) = φ(x) − yᵀc(x).                           (1)

For a given vector y, its derivatives ∇x L and ∇²xx L are

                     gL(x, y) = g0(x) − J(x)ᵀy,                         (2)
                     HL(x, y) = H0(x) − Σi yi Hi(x),                    (3)
where g0(x) = ∇φ(x) is the objective gradient, H0(x) = ∇²φ(x) is the objective Hessian,
J(x) = ∇c(x) is the matrix of constraint gradients (the Jacobian), and Hi(x) = ∇²ci(x) is
the Hessian of the ith constraint function. In general, y will be an approximation to the
dual variables associated with the constraints c(x) = 0.
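To fix notation, here is a minimal numpy sketch of these quantities (the toy problem and all names are ours, not part of the notes), with φ(x) = x₁² + x₂² and a single constraint c(x) = x₁x₂ − 1:

    import numpy as np

    # Toy problem data (ours): phi(x) = x1^2 + x2^2, constraint x1*x2 - 1 = 0.
    phi = lambda x: x[0]**2 + x[1]**2
    g0  = lambda x: np.array([2.0*x[0], 2.0*x[1]])           # gradient of phi
    H0  = lambda x: np.diag([2.0, 2.0])                      # Hessian of phi
    c   = lambda x: np.array([x[0]*x[1] - 1.0])              # constraint values
    J   = lambda x: np.array([[x[1], x[0]]])                 # Jacobian, m x n
    Hi  = lambda x, i: np.array([[0.0, 1.0], [1.0, 0.0]])    # Hessian of c_i

    L  = lambda x, y: phi(x) - y @ c(x)                                   # (1)
    gL = lambda x, y: g0(x) - J(x).T @ y                                  # (2)
    HL = lambda x, y: H0(x) - sum(y[i]*Hi(x, i) for i in range(len(y)))   # (3)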


2 Penalty methods
For this section we assume there are only equality constraints ci (x) = 0, i = 1 : m:

NEC    minimize    φ(x)
       x ∈ Rⁿ
       subject to  c(x) = 0.

A locally optimal solution (x∗ , y ∗ ) satisfies the first-order KKT conditions for NEC:
                     c(x) = 0,                                          (4)
                     J(x)ᵀy = g0(x).                                    (5)
In real life we are always finding a balance between what is desirable (our objective function)
and what is legally achievable (the constraints that prevent infinite profit!). This multiple
objective point of view suggests solving the unconstrained problem

PP(ρ)    minimize    P(x, ρ) = φ(x) + ½ρ‖c(x)‖²
             x

with some penalty parameter ρ > 0. We call P (x, ρ) the quadratic penalty function.
We envisage a trajectory of points xρ that solve PP(ρ) for an increasing sequence of ρ
values. We must let ρ become large to achieve near feasibility, but at least the penalty func-
tion is smooth. We may therefore apply Newton or quasi-Newton methods for unconstrained
optimization. The derivatives of the penalty function are
                     g(x, ρ) ≡ ∇P(x, ρ) = gL(x, y),
                     H(x, ρ) ≡ ∇²P(x, ρ) = HL(x, y) + ρJ(x)ᵀJ(x),

where gL and HL are the Lagrangian derivatives (2)–(3) with y = −ρc(x) in this case.
Note that g(x, ρ) = 0 at xρ (an unconstrained minimizer of the penalty function). Defining
yρ = −ρc(xρ ), we see that (xρ , yρ ) is the exact solution of a perturbed form of problem
NEC:
NECρ     minimize    φ(x)
             x
         subject to  c(x) = c(xρ).

Also, if the Jacobian J(x∗ ) has full row rank and x∗ is a unique local minimizer for NEC
(i.e., the reduced Hessian for NEC is positive definite), we can show that the full Hessian
H(x, ρ) is positive definite at (xρ , yρ ) for sufficiently large ρ. Thus, the penalty function is
convex for large ρ, and the minimizer xρ exists.
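For illustration, here is a small sketch of this continuation strategy (our own toy problem and ρ schedule, assuming scipy is available); each minimization is warm-started from the previous xρ:

    import numpy as np
    from scipy.optimize import minimize

    # Toy instance of NEC:  min x1^2 + x2^2  subject to  x1*x2 = 1.
    phi = lambda x: x[0]**2 + x[1]**2
    c   = lambda x: np.array([x[0]*x[1] - 1.0])
    P   = lambda x, rho: phi(x) + 0.5 * rho * np.sum(c(x)**2)  # penalty function

    x = np.array([2.0, 1.0])                    # starting point
    for rho in [1.0, 10.0, 100.0, 1000.0]:
        x = minimize(lambda v: P(v, rho), x, method='BFGS').x  # x_rho
        y = -rho * c(x)                         # multiplier estimate y_rho
        print(rho, x, np.linalg.norm(c(x)), y)  # feasibility improves with rho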
At first glance, Newton’s method for minimizing PP(ρ) would obtain a search direction
p by solving H(x, ρ)p = −g(x, ρ), i.e., the system
                     (HL + ρJᵀJ)p = −(g0 + ρJᵀc),                       (6)
where HL is defined with y = −ρc. System (6) is ill-conditioned for large ρ (assuming
m < n). This is one reason why the early unconstrained optimizers proved unsuccessful
with quadratic penalty functions. It was some time before a cure for the ill-conditioning
was recognized, but indeed there is one. Following Gould [10] we define q = −ρ(c + Jp) at
the current x. The Newton system is then equivalent to
                     ⎡ HL       Jᵀ    ⎤ ⎡  p ⎤       ⎡ g0 ⎤
                     ⎢                ⎥ ⎢    ⎥  = −  ⎢    ⎥ ,           (7)
                     ⎣ J     −(1/ρ)I  ⎦ ⎣ −q ⎦       ⎣ c  ⎦
which contains no large numbers and may be preferable for sparsity reasons anyway. If
(x, y) is close to a local optimum (x∗ , y ∗ ) and if ρ is large, any ill-conditioning in (7) reflects
the sensitivity of (x∗ , y ∗ ) to perturbations in the data.
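The effect is easy to observe numerically. The following sketch (made-up data, ours) forms both systems and confirms that (7) produces the same step p as (6) while staying well conditioned:

    import numpy as np

    n, m, rho = 5, 2, 1.0e8
    rng = np.random.default_rng(0)
    HL = np.eye(n)                           # toy Lagrangian Hessian
    J  = rng.standard_normal((m, n))         # full-row-rank Jacobian
    g0 = rng.standard_normal(n)
    c  = rng.standard_normal(m)

    # Condensed system (6): ill-conditioned for large rho.
    K6 = HL + rho * (J.T @ J)
    p6 = np.linalg.solve(K6, -(g0 + rho * (J.T @ c)))

    # Equivalent system (7): no large entries, with q = -rho(c + J p).
    K7 = np.block([[HL, J.T], [J, -np.eye(m)/rho]])
    sol = np.linalg.solve(K7, -np.concatenate([g0, c]))
    p7 = sol[:n]                             # the trailing block of sol is -q

    print(np.linalg.cond(K6), np.linalg.cond(K7))   # huge vs. modest
    print(np.linalg.norm(p6 - p7))                  # same step, up to roundoff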
Unfortunately, although p can be computed reliably when ρ is large, this doesn’t save
the quadratic penalty method. When c is not very small, p leads away from the linearization
of c(x) = 0 at the current x, and Newton’s method is likely to be too slow.

3 Equality constraints
We continue to study problem NEC, bearing in mind the difficulties encountered with the
quadratic penalty method when ρ becomes very large. Again let (x∗, y ∗ ) be a local minimizer,
and assume that the Jacobian J(x) = ∇c(x) has full row rank at x = x∗ . The first-order
optimality conditions that (x∗, y ∗ ) must satisfy are

                     c(x) = 0,                                          (8)
                     g0(x) − J(x)ᵀy = 0.                                (9)

The Lagrangian associated with problem NEC is L(x, y) = φ(x) − yᵀc(x), as in (1). We see
that the Lagrangian gradient ∇x L must be zero at (x∗, y∗). The required solution is a
stationary point of the Lagrangian. However, in general we cannot find x∗ by minimizing
L(x, y) as a function of x, even if we set y = y∗. The problem min x₁² + x₂² subject to
x₁² + x₂² = 1 illustrates what goes wrong no matter what the value of y: here
L(x, y) = (1 − y)(x₁² + x₂²) + y, whose minimizer in x is the infeasible point x = 0 when
y < 1, while L is unbounded below when y > 1 and constant when y = y∗ = 1.
The second-order optimality condition for (x∗, y∗) to be an isolated local minimizer is
that the Lagrangian Hessian HL(x∗, y∗) ≡ ∇²xx L(x∗, y∗) should be positive definite within
the null space of J(x∗). That is, the Hessian should satisfy zᵀHL(x∗, y∗)z > 0 for all
nonzero vectors z satisfying J(x∗)z = 0.
The following result on quadratic forms is relevant.

Theorem 1 (Debreu [6]) Let H be an n × n symmetric matrix and J an m × n matrix
with m ≤ n. If zᵀHz > 0 for every nonzero z satisfying Jz = 0, then for all ρ sufficiently
large, H + ρJᵀJ is positive definite.
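A quick numerical check of the theorem, with a toy H and J of our own choosing:

    import numpy as np

    H = np.diag([1.0, -1.0])            # indefinite on R^2 ...
    J = np.array([[0.0, 1.0]])          # ... but z'Hz > 0 on null(J) = span{e1}

    for rho in [0.5, 1.0, 2.0]:
        print(rho, np.linalg.eigvalsh(H + rho * (J.T @ J)))
    # H + rho J'J = diag(1, rho - 1): positive definite exactly when rho > 1.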

This suggests that we should add to the Lagrangian a term whose Hessian is ρJ(x)ᵀJ(x).
We already know such a function: it appears in the quadratic penalty method.

3.1 The augmented Lagrangian


The augmented Lagrangian associated with problem NEC is

                     L(x, y, ρ) = φ(x) − yᵀc(x) + ½ρ‖c(x)‖².            (10)

It may be thought of as a modification to the Lagrangian, or as a shifted quadratic penalty
function. For a given y and ρ, its derivatives ∇x L and ∇²xx L are

                     gL(x, y, ρ) = g0(x) − J(x)ᵀŷ,                      (11)
                     HL(x, y, ρ) = H0(x) − Σi ŷi Hi(x) + ρJ(x)ᵀJ(x),    (12)
                     ŷ ≡ y − ρc(x).                                     (13)

The augmented Lagrangian method for solving problem NEC proceeds by choosing y and
ρ judiciously and then minimizing L(x, y, ρ) as a function of x. The resulting x is used to
choose a new y and ρ, and the process repeats. The auxiliary vector ŷ simplifies the above
notation and proves to be useful in its own right.
Note that if ρ is reasonably large, minimizing L will tend to make ‖c(x)‖ small (as for
the penalty method), even if y is somewhat arbitrary. Also, HL will tend to have positive
curvature in the right space, and a minimizer is likely to exist.
On the other hand, if y is close to y∗, then since minimizing L makes ‖gL‖ small, the
pair (x, y) almost satisfies (9). If it also happens that ‖c(x)‖ is small (because ρ is large
enough), (x, y) will almost satisfy (8) as well.
The strategy is to check that ‖c(x)‖ is suitably small after each (approximate) mini-
mization of L. If so, y is updated to ŷ = y − ρc(x). If not, ρ is increased and y remains the
same. Under favorable conditions, (x, y) → (x∗, y∗) before ρ becomes too large.
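A minimal sketch of this strategy (the method of multipliers) on the same kind of toy problem; the tolerances and update factors are our own choices:

    import numpy as np
    from scipy.optimize import minimize

    # Toy instance of NEC:  min x1^2 + x2^2  subject to  x1*x2 = 1.
    phi = lambda x: x[0]**2 + x[1]**2
    c   = lambda x: np.array([x[0]*x[1] - 1.0])

    def AL(x, y, rho):                        # augmented Lagrangian (10)
        cx = c(x)
        return phi(x) - y @ cx + 0.5 * rho * (cx @ cx)

    x, y, rho, eta = np.array([2.0, 1.0]), np.zeros(1), 10.0, 0.1
    for k in range(30):
        x = minimize(lambda v: AL(v, y, rho), x, method='BFGS').x
        if np.linalg.norm(c(x)) <= eta:
            y = y - rho * c(x)                # successful: y <- yhat = y - rho c(x)
            eta *= 0.5                        # ask for more feasibility next time
        else:
            rho *= 10.0                       # unsuccessful: increase the penalty
        if np.linalg.norm(c(x)) <= 1e-10:
            break
    print(x, y, rho)                          # -> x near (1, 1), y near y* = 2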

4 LANCELOT
We now return to the general optimization problem

NCB    minimize    φ(x)
       x ∈ Rⁿ
       subject to  c(x) = 0,   ℓ ≤ x ≤ u,

where c(x) = 0 includes linear and nonlinear constraints, and x includes slack variables
where necessary to deal with inequalities. The large-scale solver LANCELOT [4, 5, 12] treats
NCB by applying an augmented Lagrangian method to a sequence of bound-constrained
subproblems of the form

(BCk)    minimize    L(x, yk, ρk) = φ(x) − ykᵀc(x) + ½ρk‖c(x)‖²
             x
         subject to  ℓ ≤ x ≤ u.

A large-scale trust-region method is applied to each BCL subproblem. It takes care of the
bound constraints directly. (They are not part of the augmented Lagrangian.)
We call the LANCELOT approach a bound-constrained Lagrangian method, in antici-
pation of other methods that minimize the augmented Lagrangian subject to additional
constraints (which are likely to be linear).

4.1 The BCL algorithm


The LANCELOT BCL method is summarized in Algorithm 1. Each subproblem (BCk) is
solved with a specified optimality tolerance ωk, generating an iterate x∗k and the associated
Lagrangian gradient z∗k ≡ ∇L(x∗k, yk, ρk). If ‖c(x∗k)‖ is sufficiently small, the iteration is
regarded as “successful” and an update to yk is computed from x∗k. Otherwise, yk is not
altered but ρk is increased.
Key properties are that the subproblems are solved inexactly, the penalty parameter is
increased only finitely often, and the multiplier estimates yk need not be assumed bounded.
Under certain conditions, all iterations are eventually successful, the ρk ’s remain constant,
the iterates converge superlinearly, and the algorithm terminates in a finite number of
iterations.

4.2 Solving the BCL subproblems


LANCELOT contains a solver SBMIN for minimizing a nonlinear objective function subject
to bounds. SBMIN is used to obtain an approximate solution for each BCL subproblem
(BCk ). It employs an elaborate active-set strategy (involving generalized Cauchy points and
a trust-region constraint) to determine a set of active bounds. It then applies a modified
Newton or quasi-Newton method to optimize the augmented Lagrangian objective L(x, y, ρ)
with respect to the moving variables.
In Matlab notation, most of the sparse-matrix computation is performed in solving
linear systems of the form
                     HL(M, M) pM = −gL(M),
where M is an index set denoting rows and columns corresponding to moving variables.
(Reduced Hessians ZᵀHZ and reduced gradients Zᵀg are easy to form when the constraints
are simple bounds!)
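In numpy terms the step might be sketched as follows (M an integer index array; the helper name is ours):

    import numpy as np

    def newton_step_moving(HL, gL, M):
        # Solve HL(M,M) pM = -gL(M); variables at active bounds do not move.
        pM = np.linalg.solve(HL[np.ix_(M, M)], -gL[M])
        p = np.zeros_like(gL)
        p[M] = pM
        return p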
The sparse-matrix package MA27 is used to factorize H(M, M) or H(M, M) + E, where
E is a diagonal modification to make the system positive definite. As M changes, Schur-complement
updates are made in order to re-use the factors.
For very large problems, the conjugate-gradient method (CG) is used to compute pM,
with assistance from a range of preconditioning options. If HL(M, M) is not positive
definite, PCG may terminate with a direction of infinite descent or a direction of negative
curvature.

Algorithm 1: BCL (Bound-Constrained Lagrangian Method).


Input: x0, y0, z0
Output: x∗, y∗, z∗
Set penalty parameter ρ0 > 0 and scale factor τ > 1.
Set positive convergence tolerances η∗, ω∗ ≪ 1 and infeasibility tolerance η0 > η∗.
Set constants α, β > 0 with α < 1.
k ← 0
converged ← false
repeat
    Choose optimality tolerance ωk > 0 such that lim_{k→∞} ωk ≤ ω∗.
    Find a point (x∗k, z∗k) that solves (BCk) within tolerance ωk.
    if ‖c(x∗k)‖ ≤ max(η∗, ηk) then
        y∗k ← yk − ρk c(x∗k)
        xk+1 ← x∗k,  yk+1 ← y∗k,  zk+1 ← z∗k        [update solution estimates]
        if (xk+1, yk+1, zk+1) solves NCB then converged ← true
        ρk+1 ← ρk                                   [keep ρk]
        ηk+1 ← ηk/(1 + (ρk+1)^β)                    [decrease ηk]
    else
        xk+1 ← xk,  yk+1 ← yk,  zk+1 ← zk           [keep solution estimates]
        ρk+1 ← τ ρk                                 [increase ρk]
        ηk+1 ← η0/(1 + (ρk+1)^α)                    [may increase or decrease ηk]
    end
    k ← k + 1
until converged
x∗ ← xk,  y∗ ← yk,  z∗ ← zk
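For concreteness, here is one way the algorithm might be rendered in Python, with scipy's bound-constrained L-BFGS-B standing in for SBMIN; the constants, the ωk schedule, and the (deliberately crude) convergence test are our own choices, so this is a sketch rather than LANCELOT's actual logic:

    import numpy as np
    from scipy.optimize import minimize

    def bcl(phi, c, x, y, bounds, rho=10.0, tau=10.0, alpha=0.1, beta=0.9,
            eta0=0.5, eta_star=1e-8, omega_star=1e-9):
        # Bound-constrained Lagrangian method, loosely following Algorithm 1.
        eta, omega = eta0, 0.1
        for k in range(100):
            # Solve (BC_k): minimize the augmented Lagrangian within the bounds.
            L = lambda v, y=y, rho=rho: (phi(v) - y @ c(v)
                                         + 0.5 * rho * (c(v) @ c(v)))
            xk = minimize(L, x, method='L-BFGS-B', bounds=bounds,
                          options={'gtol': max(omega, omega_star)}).x
            if np.linalg.norm(c(xk)) <= max(eta_star, eta):
                x, y = xk, y - rho * c(xk)          # successful: update estimates
                if np.linalg.norm(c(x)) <= eta_star and omega <= omega_star:
                    break                           # crude convergence test
                eta = eta / (1.0 + rho**beta)       # decrease eta_k
            else:
                rho = tau * rho                     # unsuccessful: increase rho_k
                eta = eta0 / (1.0 + rho**alpha)
            omega = max(0.1 * omega, omega_star)    # tighten subproblem tolerance
        return x, y

For problem NEC with no bounds one can pass bounds=None, in which case L-BFGS-B reduces to an unconstrained quasi-Newton method.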

A feature of LANCELOT and SBMIN is their provision for nonlinear functions that are
group partially separable. This facilitates the formation of gL and HL. It is also useful for
computing the matrix-vector products HL(M, M)v required by PCG.
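In a matrix-free setting such products are all an iterative solver needs. A sketch using scipy's LinearOperator (a generic hessvec function stands in for the group-partially-separable machinery):

    import numpy as np
    from scipy.sparse.linalg import LinearOperator, cg

    def solve_moving_cg(hessvec, gL, M):
        # Solve HL(M,M) pM = -gL(M) given only v -> HL v products (hessvec).
        n, nM = len(gL), len(M)
        def matvec(vM):
            v = np.zeros(n)
            v[M] = vM                     # scatter: embed v_M into R^n
            return hessvec(v)[M]          # multiply by HL, keep moving rows
        A = LinearOperator((nM, nM), matvec=matvec)
        pM, info = cg(A, -gL[M])          # info = 0 signals convergence
        return pM, info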

5 Partial augmented Lagrangians


Subproblem (BCk ) above treats the bound constraints directly. Similar theory applies if
other constraints from c(x) = 0 are imposed directly in the subproblem constraints rather
than via the augmented Lagrangian objective.
For example, Conn et al. [3] show how to treat general linear constraints directly as well
as bounds. Probably this was inspired by the strangely poor performance of LANCELOT
on linear programs. (However, the approach has not been implemented.)
An important application of this approach is to nonlinear network problems with side
constraints. Network constraints are well suited to the reduced-gradient method because
the null-space operator Z is extremely sparse and can be formed explicitly. General linear
and nonlinear constraints are best included within the augmented Lagrangian objective.

6 Further reading
The augmented Lagrangian method for handling equality constraints was originally called
the method of multipliers (Hestenes [11], Powell [16]). Powell viewed the augmented La-
grangian as a shifted quadratic penalty function. An important reference for augmented
Lagrangian methods is Bertsekas [2]. A good overview is Chapter 17 of Nocedal and Wright
[15].

Exercises
These are concerned with problem NEC.

1. Derive Newton’s method for solving the nonlinear equations (8)–(9) for (x, y). Show
that the search direction (p, q) satisfies the KKT-type system

                     ⎡ HL(x, y)   Jᵀ ⎤ ⎡  p ⎤       ⎡ g0 − Jᵀy ⎤
                     ⎢               ⎥ ⎢    ⎥  = −  ⎢          ⎥ ,      (14)
                     ⎣ J             ⎦ ⎣ −q ⎦       ⎣ c        ⎦

where HL(x, y) defines the Hessian of the Lagrangian as in (3).


2. Derive Newton’s method for minimizing the augmented Lagrangian (10) with respect
   to x. Show that the search direction p satisfies the KKT-type system

                     ⎡ HL(x, ŷ, 0)    Jᵀ    ⎤ ⎡  p ⎤       ⎡ g0 − Jᵀy ⎤
                     ⎢                      ⎥ ⎢    ⎥  = −  ⎢          ⎥ ,      (15)
                     ⎣ J           −(1/ρ)I  ⎦ ⎣ −q ⎦       ⎣ c        ⎦

   where HL(x, y, ρ) and ŷ = y − ρc define the Hessian of the augmented Lagrangian as
   in (12)–(13) and q = −ρ(c + Jp).
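The equivalence asserted in (15) is easy to sanity-check numerically (random data; a check of the linear algebra, not a solution of the exercise):

    import numpy as np

    n, m, rho = 6, 2, 1.0e4
    rng = np.random.default_rng(1)
    HLhat = np.eye(n)                        # stand-in for HL(x, yhat, 0)
    J  = rng.standard_normal((m, n))
    g0 = rng.standard_normal(n)
    y  = rng.standard_normal(m)
    c  = rng.standard_normal(m)

    # Newton step for the augmented Lagrangian: (HLhat + rho J'J) p = -gL.
    gL = g0 - J.T @ (y - rho * c)            # gradient (11) with yhat = y - rho c
    p_direct = np.linalg.solve(HLhat + rho * (J.T @ J), -gL)

    # The same step from the KKT-type system (15).
    K = np.block([[HLhat, J.T], [J, -np.eye(m) / rho]])
    p_kkt = np.linalg.solve(K, -np.concatenate([g0 - J.T @ y, c]))[:n]

    print(np.linalg.norm(p_direct - p_kkt))  # agrees up to roundoff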

References
[1] ARKI Consulting & Development A/S. http://www.conopt.com.
[2] D. P. Bertsekas. Constrained Optimization and Lagrange Multiplier Methods. Academic Press, New
York, 1982.
[3] A. R. Conn, N. I. M. Gould, A. Sartenaer, and Ph. L. Toint. Convergence properties of an augmented
Lagrangian algorithm for optimization with a combination of general equality and linear constraints.
SIAM J. Optim., 6:674–703, 1996.
[4] A. R. Conn, N. I. M. Gould, and Ph. L. Toint. A globally convergent augmented Lagrangian algorithm
for optimization with general constraints and simple bounds. SIAM J. Numer. Anal., 28:545–572, 1991.
[5] A. R. Conn, N. I. M. Gould, and Ph. L. Toint. LANCELOT: A Fortran Package for Large-scale
Nonlinear Optimization (Release A). Lecture Notes in Computational Mathematics 17. Springer Verlag,
Berlin, Heidelberg, New York, London, Paris and Tokyo, 1992.
[6] G. Debreu. Definite and semidefinite quadratic forms. Econometrica, 20:295–300, 1952.
[7] A. Drud. CONOPT: A GRG code for large sparse dynamic nonlinear optimization problems. Math.
Program., 31:153–191, 1985.
[8] P. E. Gill, W. Murray, and M. A. Saunders. SNOPT: An SQP algorithm for large-scale constrained
optimization. SIAM Review, 47(1):99–131, 2005. SIGEST article.
[9] P. E. Gill, W. Murray, and M. A. Saunders. User’s guide for SNOPT 7: Software for large-scale nonlinear
programming. http://www.scicomp.ucsd.edu/~peg (see Software), 2006.
[10] N. I. M. Gould. On the accurate determination of search directions for simple differentiable penalty
functions. IMA J. Numer. Anal., 6:357–372, 1986.
[11] M. R. Hestenes. Multiplier and gradient methods. J. Optim. Theory and Applics., 4:303–320, 1969.
[12] LANCELOT optimization software. http://www.numerical.rl.ac.uk/lancelot/blurb.html.
[13] B. A. Murtagh and M. A. Saunders. Large-scale linearly constrained optimization. Math. Program.,
14:41–72, 1978.
[14] B. A. Murtagh and M. A. Saunders. A projected Lagrangian algorithm and its implementation for
sparse nonlinear constraints. Math. Program. Study, 16:84–117, 1982.
[15] J. Nocedal and S. J. Wright. Numerical Optimization. Springer Series in Operations Research. Springer
Verlag, New York, second edition, 2006.
[16] M. J. D. Powell. A method for nonlinear constraints in minimization problems. In R. Fletcher, editor,
Optimization, pages 283–298. Academic Press, New York, NY, 1969.
