Convergence analysis of a proximal Gauss-Newton method
Saverio Salzo1
Silvia Villa2
March 3, 2011
1 DISI, Università di Genova, Via Dodecaneso 35, 16146 Genova, Italy, [email protected]
2 DIMA & DISI, Università di Genova, Via Dodecaneso 35, 16146 Genova, Italy, [email protected]
Abstract
An extension of the Gauss-Newton algorithm is proposed to find local minimizers of penalized nonlinear least squares problems, under generalized Lipschitz assumptions. Convergence results of local type are obtained, as well as an estimate of the radius of the convergence
ball. Some applications for solving constrained nonlinear equations are discussed and the numerical performance of the method is assessed on some significant test problems.
Keywords: Gauss-Newton method · Penalized nonlinear least squares · Proximity operator · Lipschitz conditions with L average
Mathematics Subject Classification (2000): 65J15 · 90C30 · 47J25
1 Introduction
Given Hilbert spaces X and Y , a Fréchet differentiable nonlinear operator F : X → Y , and a convex
lower semicontinuous penalty functional J : X → R∪{+∞}, we consider the optimization problem
min_{x∈X}  (1/2)‖F(x) − y‖² + J(x) =: Φ(x).    (P)
Problem (P) is in general a nonconvex and nonsmooth problem, having on the other hand a
particular structure: it is in fact the sum of a nonconvex, smooth term and a convex and possibly
nonsmooth one. The aim of the paper is to find an algorithm converging towards a local minimizer
of Φ, assuming that one exists. Motivated by several applications [13, 40, 16], problem (P) is
receiving increasing attention. In particular, for J = 0, (P) is a classical nonlinear least squares
problem [12, 49]. Problems of this kind can be solved by general optimization methods, but
are typically handled by more efficient ad hoc methods. In many cases these achieve better than linear
convergence, sometimes even quadratic, even though they do not require the computation of second
derivatives. Among the various approaches, one of the most popular is the Gauss-Newton method,
introduced in [7]:
x_{n+1} = x_n − [F′(x_n)∗ F′(x_n)]⁻¹ F′(x_n)∗ (F(x_n) − y).    (1)
Under suitable assumptions, such a procedure is convergent to a stationary point of x ↦ (1/2)‖F(x) − y‖²,
namely to a point x̄ such that F′(x̄)∗ (F(x̄) − y) = 0. Moreover, one can easily show that the point
x_{n+1} defined in (1) is the minimizer of the “linearized” functional

x ↦ (1/2)‖F(x_n) + F′(x_n)(x − x_n) − y‖².    (2)
There is a wide literature devoted to the study of convergence results for the Gauss-Newton method
under different perspectives. In particular we can distinguish two main streams of research: the
papers devoted to a local analysis, and the ones devoted to semilocal results. The first class of
studies [12, 1, 2, 31] assumes the existence of a local minimizer and actually determines
a region of attraction around that point, meaning that if the starting point is chosen inside that
region the iterative process is guaranteed to converge towards the minimizer. On the contrary, the
semilocal results, also known as Kantorovich-type theorems, do not assume the existence of a
local minimizer; they only establish sufficient conditions on the starting point ensuring that the
iterative procedure converges towards a point that is then proved to be a local minimizer [3, 14, 18, 27].
In this paper we propose a generalization of the Gauss-Newton algorithm to the case J ≠ 0, which
reads as follows:

x_{n+1} = prox_J^{H(x_n)} ( x_n − [F′(x_n)∗ F′(x_n)]⁻¹ F′(x_n)∗ (F(x_n) − y) ),    (3)

where prox_J^{H(x_n)} is the proximity operator associated with J (see [33, 34, 35]), with respect to the
metric defined by the operator H(x_n) := F′(x_n)∗ F′(x_n). The algorithmic framework in (3) follows
the same line as (2), i.e. we linearize the operator F at the point x_n and compute the minimizer of
the corresponding “linearized” functional

x ↦ (1/2)‖F(x_n) + F′(x_n)(x − x_n) − y‖² + J(x).
This approach is the common way to deal with generalizations of the Gauss-Newton method, as
we explain in more detail in the next section. The convergence results we obtain are of local type, and
they are comparable to those available for the classical Gauss-Newton method. In particular, we
get linear convergence in the general case, and quadratic convergence for zero residual problems.
Furthermore, we are able to give an estimate of the radius of the convergence ball around a local
minimizer. It should be noted that the computation of the proximity operator is in general not
straightforward and may itself require an iterative algorithm, since a closed form is usually
not available. We could equally have defined x_{n+1} simply as the minimizer of the
generalized version of (2); the formulation in terms of proximity operators, however, allows us to exploit the well
developed theory on this kind of operators and, in our opinion, highlights the connections and the
differences with other first order methods that have recently been proposed to solve problem (P)
(see the next section for further details).
The paper is organized as follows: we start with an analysis of the state-of-the-art literature
on related problems in Section 2, and then in Section 3 we review the basic concepts that will be
used: generalized Lipschitz conditions, generalized inverses and proximity operators. In Section
4 the minimization problem is precisely stated and some necessary conditions satisfied by local
minimizers are presented. The main result of the paper, Theorem 1, is discussed in Section 5
and proved in Section 6. Section 7 gives an application to the problem of constrained nonlinear
equations and, finally, in Section 8 numerical tests are set up and analyzed.
2 Comparison with related work
We review the available algorithms for solving this problem, highlighting the connections and the
differences with our approach.
Convex composite optimization   Problem (P) can be cast, in principle, as a composite optimization problem of the form

min_{x∈X} h(c(x)),    (4)

by setting h : Y × X → R ∪ {+∞} and c : X → Y × X as follows:

h(s, v) = ‖s‖² + J(v),      c(x) = (F(x) − y, x).    (5)
Problem (4) has been deeply studied in [10, 26, 28, 29, 47, 48] from different points of view,
mainly in the case X = R^n and Y = R^m, but the hypotheses differ significantly, as do the
obtained results. More specifically, in [26] the assumptions are too general to capture the features
of problem (P), and only allow for convergence results much weaker than the ones obtained
in Theorem 1. As a matter of fact, all the remaining papers treat the following special case
of inclusion problem:

c(x) ∈ C,      C = argmin h.    (6)

In particular, the existence of an x such that c(x) ∈ argmin h is always assumed. That hypothesis is
of course reasonable if we think of h as a kind of norm, but if we take it as in (5), then C = {(s, v) :
s = 0, v ∈ argmin J} and we are led to the condition

∃ x ∈ X,      F(x) = y and x ∈ argmin J,

which is too demanding for our original problem (P).
Nonlinear inverse problems with regularization   Here the problem is to solve the nonlinear
equation

F(x) = y    (7)
in the ill-posed case. Typically a solution is found by introducing a regularization term weighted
with a positive parameter. There are two possible approaches. The first one employs iterative
methods which deal directly with problem (7), see [4]. In this case an iterative process is set up by
minimizing at each step a simplified regularized problem (generally linear) having the structure of
(P), with a weight for J varying at each iteration. Within this class of methods, one popular
choice is the iteratively regularized Gauss-Newton method, see [8, 20, 21, 25]. We remark, however,
that, despite the name, that algorithm differs from any of the Gauss-Newton optimization
methods above, since it is not designed to “optimize” any objective functional Φ; rather, it
directly looks for an exact solution of the equation (7).
An alternative approach is the classical Tikhonov method [13], which replaces (7) by the minimization of the associated Tikhonov functional. A problem of the same type as (P) arises, with J
weighted by a properly chosen parameter. To the best of our knowledge, besides our method, the papers
by Ramlau and Teschke, e.g. [37, 38], are the only ones providing algorithms for the minimization of this type of functional. In addition, the scheme they propose is different from ours and in
general it converges only weakly. On the other hand, our algorithm requires the derivative F′(x)
to be injective with closed range, a hypothesis not suitable for handling the ill-posed case.
3 Preliminaries and notations
We start with some notations. In the whole paper X and Y denote Hilbert spaces and Ω is a
nonempty open subset of X. F : Ω ⊆ X → Y is an operator, in general nonlinear, and F′ denotes
its derivative (Fréchet or Gâteaux). L(X ,Y ) is the space of bounded linear operators from X to Y .
If A ∈ L(X ,Y ), R(A) and N(A) shall denote respectively its range and kernel and A∗ its adjoint.
3.1 Generalized Lipschitz conditions
Convergence results for the Gauss-Newton algorithm are typically obtained requiring Lipschitz
continuity of the operator F ′ [12]. In [44, 45], Wang introduced some weaker notions of Lipschitz
continuity in the context of Newton’s method. Recently, also convergence of the Gauss-Newton
method has been proved under such generalized Lipschitz conditions, see e.g. [31, 1, 2]. In
this section we recall the definitions and review some basic properties of generalized Lipschitz
continuous functions.
First of all recall that a set U ⊂ X is called star shaped with respect to some of its points x∗ ∈ U
if the segment [x∗ , x] is contained in U for every x ∈ U .
Definition 1. Let f : Ω ⊆ X → Y and U ⊆ Ω be a starshaped set with respect to x∗ ∈ U . Fix
R ∈ (0, +∞] such that supx∈U kx − x∗ k ≤ R and let L : [0, R) → R be a positive and continuous
function. The mapping f is said to satisfy the radius Lipschitz condition of center x∗ with L
average on U if
‖f(x) − f(x∗ + t(x − x∗))‖ ≤ ∫_{t‖x−x∗‖}^{‖x−x∗‖} L(u) du    (8)

for all t ∈ [0, 1] and x ∈ U.
Note that, denoting by Γ : [0, R) → R a primitive function of L, e.g. Γ(u) = ∫_0^u L(v) dv, inequality (8) can be written as

‖f(x) − f(x∗ + t(x − x∗))‖ ≤ Γ(‖x − x∗‖) − Γ(t‖x − x∗‖).
By definition Γ is absolutely continuous and differentiable, with Γ′ (u) = L(u). Since L ≥ 0, Γ is
monotone increasing. Assuming L to be increasing, we get that Γ is convex.
Definition 2. Assume the hypotheses of Definition 1. We say that f : Ω ⊆ X → Y satisfies the
center Lipschitz condition of center x∗ with L average on U if it verifies
‖f(x) − f(x∗)‖ ≤ ∫_0^{‖x−x∗‖} L(u) du    (9)

for every x ∈ U.
The original definitions of Wang [44] do not require the continuity of the function L but just an
integrability assumption. Our choice simplifies the proofs, but we remark that this requirement is
not essential. Indeed, the proofs can be modified to handle also the more general case of Wang, at
the cost of slight technical complications. In addition, the most well-known examples of Lipschitz
averages are continuous.
Remark 1. If f is Lipschitz continuous on a convex set U with constant L then it satisfies the
radius and center Lipschitz condition of center x∗ with average constantly equal to L on U , for
every x∗ ∈ U . Vice versa, in the definitions above, if we take L constant we obtain two intermediate
concepts of Lipschitz continuity, called radius and center Lipschitz continuity, which are still in
general weaker than the classical notion, being centered at a specific point.
Note also that the center Lipschitz condition is weaker than the corresponding radius one.
Let F : Ω ⊆ X → Y be a Fréchet differentiable operator. It is well-known, see e.g. [36], that if
F′ is Lipschitz continuous with constant L then the following inequality holds for every x, x∗ ∈ X:

‖F(x∗) − F(x) − F′(x)(x∗ − x)‖ ≤ (L/2)‖x∗ − x‖².    (10)
We are going to show that under the weaker Lipschitz conditions introduced above it is still possible to prove two estimates which are similar to (10). To this aim we need the following three
propositions. In the first one we prove two key inequalities, and in the subsequent ones we rewrite
them in a form closer to (10). The following result is implicitly contained in several recent
papers providing local results for the Gauss-Newton method, see e.g. [29, 31].
Proposition 1. Let F : Ω → Y be a Gâteaux differentiable operator. Then:
(i) if F ′ satisfies the radius Lipschitz condition of center x∗ with L average on U ⊆ Ω (with L
and U as in Definition 1), then for all x ∈ U
‖F(x∗) − F(x) − F′(x)(x∗ − x)‖ ≤ ∫_0^{‖x−x∗‖} L(u) u du.    (11)
(ii) if F ′ satisfies the center Lipschitz condition with L average at x∗ on U ⊆ Ω (with L and U
as in Definition 2), then for all x ∈ U
‖F(x∗) − F(x) − F′(x)(x∗ − x)‖ ≤ ∫_0^{‖x−x∗‖} (2‖x − x∗‖ − u) L(u) du.    (12)
Proof. Let x ∈ U and define φ : [0, 1] → Y by setting φ (t) = F(x∗ + t(x − x∗ )). Clearly φ is
differentiable and φ ′ (t) = F ′ (x∗ + t(x − x∗ ))(x − x∗ ). Then
F(x) − F(x∗) = φ(1) − φ(0) = ∫_0^1 F′(x∗ + t(x − x∗))(x − x∗) dt.
Therefore

F(x∗) − F(x) − F′(x)(x∗ − x) = ∫_0^1 [F′(x∗ + t(x − x∗)) − F′(x)](x∗ − x) dt
and hence

‖F(x∗) − F(x) − F′(x)(x∗ − x)‖ ≤ ∫_0^1 ‖F′(x∗ + t(x − x∗)) − F′(x)‖ ‖x − x∗‖ dt.
Let us first prove (i). By (8), we get

∫_0^1 ‖F′(x∗ + t(x − x∗)) − F′(x)‖ ‖x − x∗‖ dt ≤ ‖x − x∗‖ ∫_0^1 ∫_{t‖x−x∗‖}^{‖x−x∗‖} L(u) du dt.    (13)

Note that in general, setting Γ(u) = ∫_0^u L(v) dv, it follows

∫_0^1 ( ∫_{tρ}^{ρ} L(u) du ) ρ dt = ∫_0^1 ( ∫_0^{ρ} L(u) du − ∫_0^{tρ} L(u) du ) ρ dt
                                 = ρ Γ(ρ) − ∫_0^1 Γ(tρ) ρ dt
                                 = [u Γ(u)]_0^{ρ} − ∫_0^{ρ} Γ(u) du
                                 = ∫_0^{ρ} u Γ′(u) du
                                 = ∫_0^{ρ} L(u) u du,    (14)

where we used the change of variables u = tρ and an integration by parts. Writing the equality
obtained in (14) for ρ = ‖x − x∗‖, (13) becomes

‖F(x∗) − F(x) − F′(x)(x∗ − x)‖ ≤ ∫_0^{‖x−x∗‖} L(u) u du,

so that (i) is proved.
To show that (ii) holds, observe that the center Lipschitz condition (9) implies
‖F′(x∗ + t(x − x∗)) − F′(x)‖ ≤ ‖F′(x∗ + t(x − x∗)) − F′(x∗)‖ + ‖F′(x∗) − F′(x)‖
                             ≤ ∫_0^{t‖x−x∗‖} L(u) du + ∫_0^{‖x−x∗‖} L(u) du.

Reasoning as in the previous case and using the same notations it follows

‖F(x∗) − F(x) − F′(x)(x∗ − x)‖ ≤ ∫_0^1 Γ(tρ) ρ dt + Γ(ρ) ρ
                               = ∫_0^{ρ} Γ(u) du + Γ(ρ) ρ
                               = ∫_0^{ρ} Γ(u) du + [(2ρ − u) Γ(u)]_0^{ρ}
                               = ∫_0^{ρ} (2ρ − u) Γ′(u) du
                               = ∫_0^{‖x−x∗‖} (2‖x − x∗‖ − u) L(u) du.
The following Propositions are a direct consequence of Lemma 2.2 in [30], and have already
been stated in a slightly different form in [31]. We include the proofs for the sake of completeness.
Proposition 2. Given L : [0, R) → R a continuous, positive and increasing function, the function
γ0 : [0, R) → R defined by setting

γλ(r) = (1/r^{1+λ}) ∫_0^r u^λ L(u) du   if r ∈ (0, R),      γλ(0) = L(0)/(1 + λ),    (15)

is well-defined, continuous, positive and increasing for all λ ≥ 0. Moreover γλ is constant and
equal to L/(1 + λ) if L is constant, and it is strictly increasing if L is strictly increasing. Finally
the following inequality holds for every r ∈ [0, R):

(1 + λ) γλ(r) ≤ L(r).    (16)
Proof. Clearly γλ is differentiable on (0, R), since it is a product of differentiable functions, and by
definition

r^{1+λ} γλ(r) = ∫_0^r u^λ L(u) du.    (17)

Differentiating both members of (17), it follows

(1 + λ) r^λ γλ(r) + r^{1+λ} γλ′(r) = r^λ L(r),

therefore

r γλ′(r) = L(r) − (1 + λ) γλ(r).    (18)
Thus, if we prove (16) we also get that γλ (r) is increasing. To this aim, taking into account that L
is increasing, we have
r^{1+λ} γλ(r) = ∫_0^r u^λ L(u) du ≤ ∫_0^r u^λ L(r) du = (r^{1+λ}/(1 + λ)) L(r),    (19)
from which (16) follows. Note that if L is strictly increasing the inequality in (19) is strict, therefore in this case, recalling (18), γλ′ (r) > 0 on (0, R). On the other hand, if L is constant the
inequality in (19) is indeed an equality and γλ′ (r) = 0 by (18) implying that γλ is constant on
(0, R). The continuity of γλ at 0 follows by L’Hospital’s rule. In fact, using that L is continuous at
0:
lim_{r→0} γλ(r) = lim_{r→0} (1/r^{1+λ}) ∫_0^r u^λ L(u) du = lim_{r→0} (r^λ L(r))/((1 + λ) r^λ) = L(0)/(1 + λ).
Using the function γ0 introduced in Proposition 2, the center Lipschitz condition with L average
can be written in the following form, resembling the classical definition of Lipschitz continuity:

‖f(x) − f(x∗)‖ ≤ γ0(‖x − x∗‖) ‖x − x∗‖.
Proposition 3. Under the assumptions of Proposition 2, the function γ^c : [0, R) → R defined by

γ^c(r) = (1/r²) ∫_0^r (2r − u) L(u) du   if r ∈ (0, R),      γ^c(0) = 3L(0)/2,    (20)

is well-defined, continuous, positive and increasing.
Proof. The definition of γ^c immediately implies

r² γ^c(r) = ∫_0^r (2r − u) L(u) du = 2r ∫_0^r L(u) du − ∫_0^r u L(u) du    (21)

and differentiating both members of (21)

2r γ^c(r) + r² (γ^c)′(r) = 2 ∫_0^r L(u) du + r L(r).

Dividing by r, and using the notations of Proposition 2, we obtain

r (γ^c)′(r) = 2(γ0(r) − γ^c(r)) + L(r).

Therefore, in order to prove that γ^c is increasing we just need to show that

2γ^c(r) ≤ 2γ0(r) + L(r).    (22)

In fact, using the definitions of the functions γ^c and γ0 and the monotonicity of L we have:

2r² γ^c(r) = 2 ∫_0^r r L(u) du + 2 ∫_0^r (r − u) L(u) du
           ≤ 2r² γ0(r) + 2 L(r) ∫_0^r (r − u) du
           = 2r² γ0(r) + r² L(r),

which clearly implies (22). The continuity of γ^c at 0 can be deduced as follows:

lim_{r→0} γ^c(r) = lim_{r→0} [2γ0(r) − γ1(r)] = 2L(0) − L(0)/2 = (3/2) L(0),

relying on the continuity of γ0 and γ1 proved in Proposition 2.
Remark 2. Using the functions γ0, γ1 and γ^c introduced in (15) and in (20), the inequality (9)
written for F′ becomes

‖F′(x) − F′(x∗)‖ ≤ γ0(‖x − x∗‖) ‖x − x∗‖.

The inequalities (11) and (12) can be written respectively as

‖F(x∗) − F(x) − F′(x)(x∗ − x)‖ ≤ γ1(‖x − x∗‖) ‖x − x∗‖²    (23)

and

‖F(x∗) − F(x) − F′(x)(x∗ − x)‖ ≤ γ^c(‖x − x∗‖) ‖x − x∗‖²,    (24)

that generalize the inequality (10).
Note moreover that the functions rγ0(r), r²γ1(r) and r²γ^c(r) are always strictly increasing and,
if L is a constant function equal to L, then

γ0(r) = L,      γ1(r) = L/2,      γ^c(r) = (3/2) L.
Remark 3. It is worth noting that, even though only Gâteaux differentiability has been required,
the previous inequalities together with the hypotheses on the function L imply Fréchet differentiability at x∗.
3.2 Generalized inverses
In this section we collect some well-known results regarding the Moore-Penrose generalized inverse (also known as pseudoinverse) A† of a linear operator A. They will be useful in the rest of
the paper. For the definition and a comprehensive analysis of the properties of the Moore-Penrose
inverse we refer the reader to [17].
Assume that A ∈ L(X,Y) has a closed range. The pseudoinverse of A is the linear operator
A† ∈ L(Y, X) defined by means of the four “Moore-Penrose equations”

A A† A = A,      A† A A† = A†,      (A A†)∗ = A A†,      (A† A)∗ = A† A.    (25)
Denoting by P_{N(A)} and P_{R(A)} the orthogonal projectors onto the kernel and the range of A respectively, from the definition it is clear that

A† A = I − P_{N(A)},      A A† = P_{R(A)}.    (26)
In case A is injective, N(A) = {0} and A† A = I, that is A† is a left inverse of A. Furthermore, for
each A ∈ L(X ,Y ) the following statements are equivalent:
• A is injective and the range of A is closed;
• A∗A is invertible in L(X, X);
and if one of these equivalent conditions holds, then A† = (A∗A)⁻¹A∗ and ‖A†‖² = ‖(A∗A)⁻¹‖.
The following lemma gives a perturbation bound for the Moore-Penrose pseudoinverse, see
[42, 46].
Lemma 1. Let A, B ∈ L(X,Y) with A injective and R(A) closed. If ‖(B − A)A†‖ < 1, then B is
injective, R(B) is closed and

‖B†‖ ≤ ‖A†‖ / (1 − ‖(B − A)A†‖).

Moreover

‖B† − A†‖ ≤ √2 ‖A†‖ ‖B†‖ ‖B − A‖,

and therefore

‖B† − A†‖ ≤ √2 ‖A†‖² ‖B − A‖ / (1 − ‖A†‖ ‖B − A‖).
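The following small numerical experiment (our own illustration, not part of the paper) checks the last bound of Lemma 1 in the finite-dimensional case, using NumPy's pinv for the Moore-Penrose pseudoinverse; the matrix sizes and the perturbation scale are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# A: a generic injective (full column rank) 6x4 matrix; E: a small perturbation.
A = rng.standard_normal((6, 4))
E = 1e-3 * rng.standard_normal((6, 4))
B = A + E

A_pinv = np.linalg.pinv(A)
B_pinv = np.linalg.pinv(B)

norm = lambda M: np.linalg.norm(M, 2)        # spectral (operator) norm
assert norm((B - A) @ A_pinv) < 1            # hypothesis of Lemma 1

lhs = norm(B_pinv - A_pinv)
rhs = np.sqrt(2) * norm(A_pinv) ** 2 * norm(B - A) / (1 - norm(A_pinv) * norm(B - A))
print(f"||B+ - A+|| = {lhs:.3e}  <=  bound = {rhs:.3e}")
```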
3.3 The proximity operator
This section provides an introduction to proximity operators, which were first introduced by
Moreau in [33], and further investigated in [34, 35] as a generalization of the notion of convex
projection operator. Let H : X → X be a continuous, positive and selfadjoint operator, bounded
from below, and therefore invertible. Then we can define a new scalar product on X by setting
⟨x, z⟩_H = ⟨x, Hz⟩. The corresponding induced norm ‖·‖_H is equivalent to the given norm on X,
since the following inequalities hold true:

(1/‖H⁻¹‖) ‖x‖² ≤ ‖x‖²_H ≤ ‖H‖ ‖x‖².    (27)
The Moreau-Yosida approximation of a convex and lower semicontinuous function ϕ : X →
R ∪ {+∞} with respect to the scalar product induced by H is the function Mϕ : X → R defined by
setting

Mϕ(z) = inf_{x∈X} ϕ(x) + (1/2)‖x − z‖²_H.    (28)

For every z ∈ X, the infimum in equation (28) is attained at a unique point, denoted prox_ϕ^H(z).
In this way, an operator

prox_ϕ^H : X → X

is defined, which is called the proximity operator associated to ϕ w.r.t. H. In case H = I is the
identity, the proximity operator is denoted simply by prox_ϕ. Writing the first order optimality
conditions for (28), we get

p = prox_ϕ^H(z)  ⇐⇒  0 ∈ ∂ϕ(p) + H(p − z)  ⇐⇒  Hz ∈ (∂ϕ + H)(p),    (29)

which gives

prox_ϕ^H(z) = (H + ∂ϕ)⁻¹(Hz).

We remark that the map (H + ∂ϕ)⁻¹ (in principle multi-valued) is single-valued, since we
know that the minimum is attained at a unique point.
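As a concrete illustration (ours, not taken from the paper), the following Python sketch evaluates two proximity operators that admit a closed form in the Euclidean metric (H = I): soft-thresholding for ϕ = τ‖·‖₁ and the projection onto a box for ϕ = ι_C; the choice of ϕ and the parameter names are assumptions for the example. For a general metric H = A∗A, Proposition 4 below reduces the computation to a Euclidean prox of ϕ∘A†.

```python
import numpy as np

def prox_l1(z, tau):
    """prox of phi(x) = tau * ||x||_1 w.r.t. the Euclidean metric: soft-thresholding."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def prox_box(z, a, b):
    """prox of the indicator of the box [a, b] (componentwise): the projection onto it."""
    return np.clip(z, a, b)

z = np.array([1.5, -0.2, 0.7])
print(prox_l1(z, tau=0.5))        # approximately [ 1.  -0.   0.2]
print(prox_box(z, a=0.0, b=1.0))  # approximately [ 1.   0.   0.7]
```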
Lemma 2. The proximity operator prox_ϕ^H : X → X is Lipschitz with constant √(‖H‖‖H⁻¹‖) with
respect to ‖·‖, namely

‖prox_ϕ^H(z1) − prox_ϕ^H(z2)‖ ≤ √(‖H‖‖H⁻¹‖) ‖z1 − z2‖.    (30)
Proof. Since the proximity operator is firmly nonexpansive with respect to the scalar product induced by H (see e.g. Lemma 2.4 in [11]), we have

‖prox_ϕ^H(z1) − prox_ϕ^H(z2)‖_H ≤ ‖z1 − z2‖_H.

Using the inequalities in (27) relating ‖·‖ and ‖·‖_H we get the desired result.
It is also possible to show that in some cases the computation of the proximity operator with
respect to the scalar product induced by H can be brought back to the computation of the proximity
operator with respect to the original norm. In particular, the following proposition holds.
Proposition 4. Let A ∈ L(X,Y) and suppose A is injective with closed range. Set H = A∗A
and let ϕ : X → R ∪ {+∞} be a proper, convex and lower semicontinuous functional. Then

prox_ϕ^H = A† ∘ prox_{ϕ∘A†} ∘ A.
Proof. Since A is injective, H is positive and invertible, thus prox_ϕ^H : X → X with

prox_ϕ^H(x) = argmin_{z∈X} { ϕ(z) + (1/2)‖z − x‖²_H },

where

‖z − x‖²_H = ⟨A∗A(z − x), z − x⟩ = ⟨A(z − x), A(z − x)⟩ = ‖A(z − x)‖².

Therefore

prox_ϕ^H(x) = argmin_{z∈X} { ϕ(z) + (1/2)‖A(z − x)‖² }.    (31)

On the other hand, since ϕ∘A† : Y → R ∪ {+∞} is convex and lower semicontinuous, the corresponding proximity operator with respect to ‖·‖ is well-defined, prox_{ϕ∘A†} : Y → Y with

prox_{ϕ∘A†}(y) = argmin_{t∈Y} { ϕ(A†t) + (1/2)‖t − y‖² }.

If we set x̄ = prox_ϕ^H(x), by (31) we obtain

ϕ(x̄) + (1/2)‖Ax̄ − Ax‖² ≤ ϕ(z) + (1/2)‖Az − Ax‖²      ∀z ∈ X.

Moreover, setting ȳ = Ax̄ and taking t ∈ Y with t = t1 + t2 such that t1 ∈ R(A) and t2 ∈ R(A)⊥, and
z = A†t ∈ X, we have

ϕ(A†t) + (1/2)‖t − Ax‖² = ϕ(A†t) + (1/2)‖t1 + t2 − Ax‖²
                        = ϕ(A†t) + (1/2)‖P_{R(A)}t − Ax‖² + (1/2)‖t2‖²
                        ≥ ϕ(A†t) + (1/2)‖AA†t − Ax‖²
                        = ϕ(z) + (1/2)‖Az − Ax‖²
                        ≥ ϕ(x̄) + (1/2)‖Ax̄ − Ax‖²
                        = ϕ(A†ȳ) + (1/2)‖ȳ − Ax‖².

We finally get

ȳ = argmin_{t∈Y} { ϕ(A†t) + (1/2)‖t − Ax‖² } = prox_{ϕ∘A†}(Ax),

and thus, using (26),

prox_ϕ^H(x) = x̄ = A†Ax̄ = A†ȳ = A† prox_{ϕ∘A†}(Ax).
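As a quick finite-dimensional sanity check of Proposition 4 (again our own sketch, not part of the paper), one can take the simple choice ϕ(v) = (1/2)‖v‖², for which both proximity operators reduce to linear solves, and verify the identity numerically.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 3))     # generically injective (full column rank)
H = A.T @ A
x = rng.standard_normal(3)

# phi(v) = 0.5*||v||^2, so prox_phi^H(x) = (H + I)^{-1} H x ...
prox_H = np.linalg.solve(H + np.eye(3), H @ x)

# ... and prox_{phi o A^+}(y) = (I + (A^+)^T A^+)^{-1} y, applied at y = A x.
A_pinv = np.linalg.pinv(A)
prox_comp = A_pinv @ np.linalg.solve(np.eye(5) + A_pinv.T @ A_pinv, A @ x)

print(np.allclose(prox_H, prox_comp))   # True: the two expressions coincide
```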
Since in the sequel the proximity operators will be computed with respect to a variable norm
‖·‖_H, we are interested in the behavior of the proximity operator when H varies.
Lemma 3. Let H1 and H2 be two continuous positive selfadjoint operators on X, both bounded from
below. It holds

‖prox_ϕ^{H1}(z) − prox_ϕ^{H2}(z)‖ ≤ ‖H1⁻¹‖ ‖(H1 − H2)(z − prox_ϕ^{H2}(z))‖.    (32)
Proof. By (29) it follows

H_i (z − prox_ϕ^{H_i}(z)) ∈ ∂ϕ(prox_ϕ^{H_i}(z)),      i = 1, 2.

Then, by definition of subdifferential, the following inequalities hold true:

ϕ(prox_ϕ^{H2}(z)) ≥ ϕ(prox_ϕ^{H1}(z)) + ⟨H1(z − prox_ϕ^{H1}(z)), prox_ϕ^{H2}(z) − prox_ϕ^{H1}(z)⟩,
ϕ(prox_ϕ^{H1}(z)) ≥ ϕ(prox_ϕ^{H2}(z)) + ⟨H2(z − prox_ϕ^{H2}(z)), prox_ϕ^{H1}(z) − prox_ϕ^{H2}(z)⟩.

Summing them up, we obtain

0 ≥ ⟨H2(z − prox_ϕ^{H2}(z)) − H1(z − prox_ϕ^{H1}(z)), prox_ϕ^{H1}(z) − prox_ϕ^{H2}(z)⟩,

and equivalently

⟨H1 prox_ϕ^{H1}(z) − H2 prox_ϕ^{H2}(z), prox_ϕ^{H1}(z) − prox_ϕ^{H2}(z)⟩ ≤ ⟨(H1 − H2)z, prox_ϕ^{H1}(z) − prox_ϕ^{H2}(z)⟩.

Adding and subtracting the same term, the previous inequality can also be written as

⟨H1(prox_ϕ^{H1}(z) − prox_ϕ^{H2}(z)), prox_ϕ^{H1}(z) − prox_ϕ^{H2}(z)⟩ ≤ ⟨(H1 − H2)(z − prox_ϕ^{H2}(z)), prox_ϕ^{H1}(z) − prox_ϕ^{H2}(z)⟩,

from which (32) follows.
Note that in the previous lemma H1 and H2 play a symmetric role, so that they can be interchanged.
Remark 4. Combining (30) and (32), we get:

‖prox_J^{H1}(z1) − prox_J^{H2}(z2)‖ ≤ ‖prox_J^{H1}(z1) − prox_J^{H1}(z2)‖ + ‖prox_J^{H1}(z2) − prox_J^{H2}(z2)‖
                                    ≤ (‖H1‖‖H1⁻¹‖)^{1/2} ‖z1 − z2‖ + ‖H1⁻¹‖ ‖(H1 − H2)(z2 − prox_J^{H2}(z2))‖,    (33)

for every z1, z2 ∈ X and H1, H2 continuous and positive selfadjoint operators on X, bounded from
below.
4 Setting the minimization problem
In this section we collect some basic properties of the solutions of problem (P). The following
will be standing hypotheses throughout the paper.
F : Ω ⊆ X → Y is Gâteaux differentiable,
J : X → R ∪ {+∞} is proper, lower semicontinuous and convex.    (SH)
Recall that J proper means the effective domain dom(J) := {x ∈ X : J(x) < +∞} is nonempty.
Without loss of generality, we shall assume y = 0 in problem (P), since the general case
can be recovered just by replacing F with F − y. Thus, hereafter, the following optimization
problem will be considered:

min_{x∈X}  (1/2)‖F(x)‖² + J(x) =: Φ0(x).    (P0)
The functional Φ0 is in general nonconvex and searching for global minimizers turns out to be a
challenging task. Therefore, the focus of this paper is on local minimizers of Φ0 , whose existence
shall be assumed from now on. Generally speaking, Gauss-Newton methods have a local character
and allow one to find a local minimizer. As is well known, a local minimizer x∗ of Φ0 is a point
such that x∗ ∈ dom(J) ∩ Ω and there exists a neighborhood U of x∗ such that Φ0(x∗) ≤ Φ0(x) for
all x ∈ U.
We note that hypotheses (SH) are not enough to guarantee the existence of a global minimizer of problem (P0). Such existence can be proved relying on the Weierstrass theorem
as soon as we require F to be weak-to-weak continuous and Φ0 to be (weakly) coercive, namely
lim_{‖x‖→+∞} Φ0(x) = +∞.
We start by providing first order conditions for local minimizers.
Proposition 5. Suppose (SH) are satisfied and let x∗ ∈ Ω be a local minimizer of Φ0. Then the
following stationary condition holds:

−F′(x∗)∗ F(x∗) ∈ ∂J(x∗).

Moreover, if F′(x∗) is injective and R(F′(x∗)) is closed, then x∗ satisfies the fixed point equation

x∗ = prox_J^{H(x∗)} (x∗ − F′(x∗)† F(x∗)),

with H(x∗) := F′(x∗)∗ F′(x∗).
Proof. Suppose that x∗ is a local minimizer of Φ0. Denoting by Φ0′(x∗, v) the directional derivative
of Φ0 at x∗ in the direction v ∈ X, which exists thanks to (SH), the first order optimality condition
for x∗ implies

Φ0′(x∗, v) ≥ 0      ∀v ∈ X.    (34)

As a consequence of the differentiability of F and the convexity of J, (34) can be rewritten as

⟨−F′(x∗)∗ F(x∗), v⟩ ≤ J′(x∗, v)      ∀v ∈ X,

and consequently, by Proposition 3.1.6 in [9], also as

−F′(x∗)∗ F(x∗) ∈ ∂J(x∗),    (35)

which is the stationary condition of the thesis. To prove that x∗ satisfies the fixed point equation,
note that adding H(x∗)x∗ to both members of (35) we have

H(x∗)x∗ − F′(x∗)∗ F(x∗) ∈ (H(x∗) + ∂J)(x∗).

Since H(x∗) is invertible, the previous relation can also be rewritten as

H(x∗)(x∗ − H(x∗)⁻¹ F′(x∗)∗ F(x∗)) ∈ (H(x∗) + ∂J)(x∗).

Recalling equation (29) and the properties enjoyed by the pseudoinverse we obtain the second
assertion.
5 The algorithm - convergence analysis
In this section we state the main result of the paper, consisting in the study of the convergence of
a generalized Gauss-Newton method for solving problem (P0 ). The flavor is similar to the most
recent results concerning the standard Gauss-Newton method, proved in [31]. We start describing
some basic properties of the proposed algorithmic framework.
Fix x_0 ∈ dom(J), and then, given x_n, define x_{n+1} by setting

x_{n+1} = argmin_{x∈X}  (1/2)‖F(x_n) + F′(x_n)(x − x_n)‖² + J(x).    (36)
Note that, since the quantity inside the norm has been linearized, this problem is convex and can be
solved, for instance, using first order methods for the minimization of nonsmooth convex functions,
such as bundle methods or forward-backward methods (see [19, 11]). Writing down the first order
optimality conditions, we get a formula similar to the one in Proposition 5 for a minimizer x∗.
Proposition 6. Suppose F′(x_n) is injective with closed range and set H(x_n) = F′(x_n)∗ F′(x_n).
Then, the formula (36) defining x_{n+1} is equivalent to

x_{n+1} = prox_J^{H(x_n)} (x_n − F′(x_n)† F(x_n)).    (37)

Proof. Thanks to the assumptions made on F′(x_n), the operator H(x_n) is invertible. Writing the
first order necessary conditions satisfied by x_{n+1}, we obtain

0 ∈ F′(x_n)∗ [F(x_n) + F′(x_n)(x_{n+1} − x_n)] + ∂J(x_{n+1})
  ⇐⇒  F′(x_n)∗ F′(x_n) x_n − F′(x_n)∗ F(x_n) ∈ (F′(x_n)∗ F′(x_n) + ∂J)(x_{n+1})
  ⇐⇒  x_{n+1} = (F′(x_n)∗ F′(x_n) + ∂J)⁻¹ (F′(x_n)∗ F′(x_n) x_n − F′(x_n)∗ F(x_n))
  ⇐⇒  x_{n+1} = (F′(x_n)∗ F′(x_n) + ∂J)⁻¹ F′(x_n)∗ F′(x_n) (x_n − F′(x_n)† F(x_n))
  ⇐⇒  x_{n+1} = prox_J^{H(x_n)} (x_n − F′(x_n)† F(x_n)).
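To make the iteration (37) concrete, here is a minimal finite-dimensional sketch of one proximal Gauss-Newton step in Python. It is our own illustration, not the authors' code: the residual F and the Jacobian Jac are user-supplied callables, the pseudoinverse step is realized by a least-squares solve, and prox_metric stands for any routine computing prox_J in the metric H = Jacᵀ Jac (for J = ι_C and a box C it is the inner projection loop of Section 7).

```python
import numpy as np

def prox_gauss_newton_step(x, F, Jac, prox_metric):
    """One step of iteration (37): x+ = prox_J^{H(x)}(x - F'(x)^+ F(x)).

    F(x)   -> residual vector of shape (m,)
    Jac(x) -> Jacobian matrix of shape (m, n), assumed injective (full column rank)
    prox_metric(z, H) -> prox of J at z in the metric <., H .> (user supplied)
    """
    J = Jac(x)
    # Gauss-Newton point x - J^+ F(x), via a least-squares solve instead of an explicit pinv.
    gn_point = x - np.linalg.lstsq(J, F(x), rcond=None)[0]
    H = J.T @ J
    return prox_metric(gn_point, H)
```

For J = 0 one has prox_metric(z, H) = z, and the step reduces to the classical Gauss-Newton update (1).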
In the next theorem we provide a local convergence analysis of the proximal Gauss-Newton
method, under the generalized Lipschitz conditions on F ′ introduced in Section 3.1. The proof is
postponed to Section 6.
Theorem 1. Suppose that (SH) are satisfied. Let U ⊆ Ω be an open starshaped set with respect to
x∗, where x∗ ∈ dom(J) ∩ U is a local minimizer of Φ0. Moreover assume

1. F′(x∗) is injective with closed range;

2. F′ : Ω ⊆ X → L(X,Y) is center Lipschitz continuous of center x∗ with L average on U (with L as
in Definition 2 and increasing);

3. [(1+√2)κ + 1] αβ² L(0) < 1, where α = ‖F(x∗)‖, β = ‖F′(x∗)†‖ and κ = ‖F′(x∗)†‖‖F′(x∗)‖ is
the condition number of F′(x∗).

Define R̄ and q : [0, R̄) → R+ by setting R̄ = sup{r ∈ (0, R) : γ0(r)r < 1/β} and

q(r) = β/(1 − βγ0(r)r) · [ (βγ0(r)γ^c(r)r² + κγ^c(r)r)/(1 − βγ0(r)r) + (1+√2)αβ²γ0(r)²r/(1 − βγ0(r)r) + ((1+√2)κ + 1)αβγ0(r)/(1 − βγ0(r)r) ].

The function q is continuous and strictly increasing. If we define

r̄ = sup{r ∈ (0, R̄] : q(r) < 1},    (38)

and we fix r ∈ R, with 0 < r ≤ r̄, such that B_r(x∗) ⊆ U, we get that the sequence defined by

x_0 ∈ B_r(x∗),      x_{n+1} = prox_J^{H(x_n)} (x_n − F′(x_n)† F(x_n)),

with H(x_n) := F′(x_n)∗ F′(x_n), is well-defined, i.e. x_n ∈ B_r(x∗) and F′(x_n) is injective with closed
range, and it holds

‖x_n − x∗‖ ≤ q_0^n ‖x_0 − x∗‖,

where q_0 := q(‖x_0 − x∗‖) < 1.
More precisely, the following inequality is true:

‖x_{n+1} − x∗‖ ≤ C2 ‖x_n − x∗‖² + C1 ‖x_n − x∗‖,

for constants C1 ≥ 0 and C2 > 0 defined as

C1 = ((1+√2)κ + 1) αβ² γ0(ρ_{x_0}) / (1 − βγ0(ρ_{x_0})ρ_{x_0})²;

C2 = (κβ γ^c(ρ_{x_0}) + (1+√2) αβ³ γ0(ρ_{x_0})² + β² γ0(ρ_{x_0}) γ^c(ρ_{x_0}) ρ_{x_0}) / (1 − βγ0(ρ_{x_0})ρ_{x_0})²,

with ρ_{x_0} = ‖x_0 − x∗‖.
Since r̄ is chosen as the largest value ensuring q(r) < 1 (a sufficient condition making the
Gauss-Newton sequence convergent), r̄ can be thought of as the radius of the basin of attraction
around the local minimum point x∗, even though in general we cannot prove the optimality of this
value.
Remark 5. An analogous theorem holds if in assumption 2 we suppose F′ to satisfy the
radius Lipschitz condition. All the statements remain true, just replacing γ^c with γ1. For instance
the expression of q(r) becomes

q(r) = β/(1 − βγ0(r)r) · [ (βγ0(r)γ1(r)r² + κγ1(r)r)/(1 − βγ0(r)r) + (1+√2)αβ²γ0(r)²r/(1 − βγ0(r)r) + ((1+√2)κ + 1)αβγ0(r)/(1 − βγ0(r)r) ].
Remark 6. The hypotheses we impose are in line with the state-of-the-art literature on the classical
Gauss-Newton method (J = 0), see [31]. It is worth noting that the expression of r̄ is not affected
by the choice of J. On the other hand, the presence of the function J reduces the radius of convergence of the Gauss-Newton method. Indeed, the expression for r̄ obtained in (38), which is
valid also in the case J = 0, is always smaller than the maximum radius of convergence that can be
derived from equation (3.4) in [31], namely

r0 = sup{r ∈ (0, R̄) : q0(r) < 1},

with

q0(r) = β/(1 − βγ0(r)r) · (γ^c(r)r + √2 βα γ0(r)).

The reason is that the bound (30) we use is not sharp in the case J = 0, and this causes an additional
term in the expression of q(r).
Conditions ensuring quadratic convergence. As in the classical case, quadratic convergence holds
for zero residual problems also in the presence of the additional term J. In fact, from the expression of
C1, we see that C1 = 0 if α = 0, i.e. F(x∗) = 0.
5.1 The case of constant average L
In case the function L is constant, we can also derive an explicit expression for the maximum radius
of convergence r̄.
Corollary 1. Let the assumptions of Theorem 1 be satisfied and moreover assume F′ to be
center Lipschitz continuous of center x∗ with constant average L on U. Define q : [0, 1/(βL)) → R+ as

q(r) = β/(1 − βLr) · [ 3(βL²r² + κLr)/(2(1 − βLr)) + (1+√2)αβ²L²r/(1 − βLr) + ((1+√2)κ + 1)αβL/(1 − βLr) ],    (39)

which is continuous and strictly increasing in its domain. If we define

h = ((1+√2)κ + 1) αβ²L  (< 1),

r̄ = (1/(βL)) [ −(2 + (3/2)κ + (1+√2)αβ²L) + √( (2 + (3/2)κ + (1+√2)αβ²L)² + 2(1 − h) ) ],

and we fix r ∈ R with 0 < r ≤ r̄ such that B_r(x∗) ⊆ U, the conclusions of Theorem 1 hold with

C1 = ((1+√2)κ + 1) αβ²L / (1 − βLρ_{x_0})²;

C2 = β (3κL + 2(1+√2)αβ²L² + 3βL²ρ_{x_0}) / (2(1 − βLρ_{x_0})²).
Proof. Using the expressions of γ0 and γ^c found in Remark 2, one can easily show that R̄ = 1/(βL)
and q can be written as in (39).
As before, q is continuous and strictly increasing on the interval [0, 1/(βL)), and

q(0) = h < 1,      lim_{r→1/(βL)} q(r) = +∞.

Therefore r̄ defined in (38) is the unique solution in (0, 1/(βL)) of the equation q(r) = 1. We are
going to find that point explicitly by solving the equation q(r) = 1. The latter is equivalent to the
following quadratic equation

z² + (4 + 3κ + 2(1+√2)αβ²L) z − 2(1 − h) = 0,

with z ∈ [0, 1), z = βLr. That equation has two distinct solutions

z = −(2 + (3/2)κ + (1+√2)αβ²L) ± √( (2 + (3/2)κ + (1+√2)αβ²L)² + 2(1 − h) ).

Of course, we discard the negative solution and keep the one with the plus sign, which can be
easily checked to belong to (0, 1).
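For quick experimentation, the closed-form radius of Corollary 1 can be evaluated directly; the following small helper (our own, with arbitrary example values for α, β, κ and L) simply implements the formulas for h and r̄ above.

```python
import math

def convergence_radius(alpha, beta, kappa, L):
    """Radius r_bar of Corollary 1 (constant center-Lipschitz average L)."""
    h = ((1 + math.sqrt(2)) * kappa + 1) * alpha * beta**2 * L
    if h >= 1:
        raise ValueError("hypothesis 3 of Theorem 1 is violated (h >= 1)")
    b = 2 + 1.5 * kappa + (1 + math.sqrt(2)) * alpha * beta**2 * L
    z = -b + math.sqrt(b**2 + 2 * (1 - h))   # positive root of the quadratic in z = beta*L*r
    return z / (beta * L)

# Example with illustrative values: small residual alpha, moderately conditioned derivative.
print(convergence_radius(alpha=0.05, beta=1.0, kappa=2.0, L=0.5))
```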
Along the same line, a similar result concerning the case of radius Lipschitz continuity can be
proved.
Corollary 2. Let the assumptions of Theorem 1 be satisfied and moreover assume F′ to be
radius Lipschitz continuous of center x∗ with constant average L on U. Define q : [0, 1/(βL)) → R+ as

q(r) = β/(1 − βLr) · [ (βL²r² + κLr)/(2(1 − βLr)) + (1+√2)αβ²L²r/(1 − βLr) + ((1+√2)κ + 1)αβL/(1 − βLr) ],

which is continuous and strictly increasing in its domain. If we define

h = ((1+√2)κ + 1) αβ²L  (< 1),

r̄ = (1/(βL)) [ (2 + κ/2 + (1+√2)αβ²L) − √( (2 + κ/2 + (1+√2)αβ²L)² − 2(1 − h) ) ],

and we fix r ∈ R with 0 < r ≤ r̄ such that B_r(x∗) ⊆ U, the conclusions of Theorem 1 hold with

C1 = ((1+√2)κ + 1) αβ²L / (1 − βLρ_{x_0})²;

C2 = β (κL + 2(1+√2)αβ²L² + βL²ρ_{x_0}) / (2(1 − βLρ_{x_0})²),

for r < 1/(βL).
6 Proof of Theorem 1
We are going to state some auxiliary results for proving convergence of the algorithm discussed
in the previous section. The following proposition will be one of the building blocks to show that
convergence holds.
Proposition 7. Let G : D ⊆ X → X be a mapping and x∗ ∈ D a fixed point of G. Let U ⊆ D be an
open starshaped set with respect to x∗ ∈ U. Assume G satisfies the inequality

‖G(x) − G(x∗)‖ ≤ q(‖x − x∗‖) ‖x − x∗‖      for all x ∈ U,    (40)

for a given increasing function q : [0, R) → [0, +∞), continuous at 0 and such that q(0) < 1. Define

r̄ = sup{r ∈ (0, R) : q(r) < 1}.

Then r̄ > 0 and, given r ∈ R with 0 < r ≤ r̄ and B_r(x∗) ⊆ U, it follows that G(B_r(x∗)) ⊆ B_r(x∗); thus,
given x_0 ∈ B_r(x∗), the sequence defined by setting x_{n+1} = G(x_n) is well-defined. Moreover, denoting
q_0 = q(‖x_0 − x∗‖), it holds q_0 < 1 and

‖x_{n+1} − x∗‖ ≤ q_0^n ‖x_0 − x∗‖.
Proof. First note that r̄ = sup{r ∈ (0, R) : q(r) < 1} > 0, since q(0) < 1 and q is continuous at 0.
Fix r ∈ R, 0 < r ≤ r̄, such that B_r(x∗) ⊆ U, and take x ∈ B_r(x∗). Then, by definition of r̄, q(‖x − x∗‖) < 1
and therefore (40) implies that

‖G(x) − x∗‖ ≤ ‖x − x∗‖ < r,

i.e. G(x) ∈ B_r(x∗). Thus G(B_r(x∗)) ⊆ B_r(x∗) and a sequence can be defined in B_r(x∗) by choosing
x_0 ∈ B_r(x∗) and setting x_{n+1} = G(x_n). Since ‖x_n − x∗‖ < r ≤ r̄, again from the definition of r̄ we have
q(‖x_n − x∗‖) < 1, and from (40) we get

‖x_{n+1} − x∗‖ = ‖G(x_n) − x∗‖ ≤ q(‖x_n − x∗‖) ‖x_n − x∗‖ < ‖x_n − x∗‖.

This implies that q(‖x_{n+1} − x∗‖) ≤ q(‖x_n − x∗‖), since q is increasing. Therefore, denoting q(‖x_0 − x∗‖) =: q_0,
we get q(‖x_n − x∗‖) ≤ q_0 < 1 for all n ∈ N and

‖x_{n+1} − x∗‖ ≤ q(‖x_n − x∗‖) ‖x_n − x∗‖ ≤ q_0 ‖x_n − x∗‖ ≤ … ≤ q_0^n ‖x_0 − x∗‖.
We now introduce some notations, allowing us to rewrite the conditions described in Proposition 5
for a local minimizer of Φ0. Define G and G̃ by setting

G(x) = x − F′(x)† F(x)      and      G̃(x) = prox_J^{H(x)}(G(x)),    (41)

where H(x) = F′(x)∗ F′(x). The domain of G and G̃ is the subset D of Ω defined as

D = {x ∈ Ω : F′(x) is injective and R(F′(x)) is closed}.    (42)

If x∗ ∈ D is a local minimizer of (P0), the fixed point equation of Proposition 5 can be restated
by saying that x∗ is a fixed point of G̃, namely

x∗ = G̃(x∗).    (43)
Proposition 8. Assume (SH) and let x∗ be a local minimizer of Φ0 belonging to D. Suppose that
U ⊆ Ω is open and starshaped with respect to x∗. Moreover assume

i) F′ : Ω ⊆ X → L(X,Y) is center Lipschitz continuous of center x∗ with L : [0, R) → R+
average on U ⊆ Ω;

ii) α = ‖F(x∗)‖, β = ‖F′(x∗)†‖ and κ = ‖F′(x∗)†‖‖F′(x∗)‖.

Then, defining R̄ by setting

R̄ = sup{r ∈ (0, R) : γ0(r)r < 1/β},

it follows that for all r ∈ R with 0 < r ≤ R̄ and B_r(x∗) ⊆ U, G̃ satisfies

‖G̃(x) − x∗‖ ≤ q(‖x − x∗‖) ‖x − x∗‖

for all x ∈ B_r(x∗), where q : [0, R̄) → R+ is defined as

q(r) = β/(1 − βγ0(r)r) · [ (βγ0(r)γ^c(r)r² + κγ^c(r)r)/(1 − βγ0(r)r) + (1+√2)αβ²γ0(r)²r/(1 − βγ0(r)r) + ((1+√2)κ + 1)αβγ0(r)/(1 − βγ0(r)r) ]

and it is continuous and strictly increasing.
Proof. Since x∗ is a local minimizer of Φ0 and x∗ ∈ D, with D defined as in (42), x∗ is a fixed point
of G̃ (see (43)), therefore

G(x∗) − prox_J^{H(x∗)}(G(x∗)) = G(x∗) − x∗ = −F′(x∗)† F(x∗).    (44)

Fix r ∈ R, with 0 < r ≤ R̄, such that B_r(x∗) ⊆ U, and take x ∈ B_r(x∗). Adopting the notations of
Proposition 2, as noted in Remark 2, from the center Lipschitz hypothesis we get

‖F′(x) − F′(x∗)‖ ‖F′(x∗)†‖ ≤ γ0(‖x − x∗‖) ‖x − x∗‖ β.

Recalling that Remark 2 ensures that the function ρ ↦ ργ0(ρ) is continuous, strictly increasing
and takes value 0 at 0, we have that R̄ > 0 and

‖F′(x) − F′(x∗)‖ ‖F′(x∗)†‖ < β γ0(r) r ≤ 1,

and thus, applying Lemma 1, F′(x) is injective, with closed range, and

‖F′(x)†‖ ≤ β / (1 − β γ0(ρ_x) ρ_x),      where ρ_x = ‖x − x∗‖.    (45)
Applying inequality (33) with H1 = H(x), H2 = H(x∗), z1 = G(x) and z2 = G(x∗), and taking into
account (44), we get

‖G̃(x) − x∗‖ = ‖prox_J^{H(x)}(G(x)) − prox_J^{H(x∗)}(G(x∗))‖
            ≤ (‖H(x)‖‖H(x)⁻¹‖)^{1/2} ‖G(x) − G(x∗)‖ + ‖H(x)⁻¹‖ ‖(H(x) − H(x∗))(G(x∗) − prox_J^{H(x∗)}(G(x∗)))‖
            = (‖H(x)‖‖H(x)⁻¹‖)^{1/2} ‖G(x) − G(x∗)‖ + ‖H(x)⁻¹‖ ‖(H(x) − H(x∗)) F′(x∗)† F(x∗)‖.    (46)

Moreover

‖H(x)‖ = ‖F′(x)∗ F′(x)‖ = ‖F′(x)‖²,      ‖H(x)⁻¹‖ = ‖[F′(x)∗ F′(x)]⁻¹‖ = ‖F′(x)†‖².

Recalling the properties of the Moore-Penrose generalized inverse in (26) and Lemma 1, we get

(H(x) − H(x∗)) F′(x∗)† = (F′(x)∗ F′(x) − F′(x∗)∗ F′(x∗)) F′(x∗)†
                       = F′(x)∗ F′(x) F′(x∗)† − F′(x∗)∗ F′(x∗) F′(x∗)†
                       = F′(x)∗ F′(x) F′(x∗)† − F′(x)∗ P_{R(F′(x∗))} + F′(x)∗ P_{R(F′(x∗))} − F′(x∗)∗ P_{R(F′(x∗))}
                       = F′(x)∗ (F′(x) − F′(x∗)) F′(x∗)† + (F′(x) − F′(x∗))∗ P_{R(F′(x∗))},

therefore

‖(H(x) − H(x∗)) F′(x∗)†‖ ≤ (‖F′(x)‖ ‖F′(x∗)†‖ + 1) ‖F′(x) − F′(x∗)‖.    (47)

Hence, substituting in (46) the bound derived in (47), we obtain

‖G̃(x) − x∗‖ ≤ ‖F′(x)‖ ‖F′(x)†‖ ‖G(x) − G(x∗)‖ + ‖F′(x)†‖² (‖F′(x)‖ ‖F′(x∗)†‖ + 1) ‖F′(x) − F′(x∗)‖ ‖F(x∗)‖.    (48)

On the other hand, thanks to the properties of the Moore-Penrose pseudoinverse reported in (26)
and the injectivity of F′(x),

G(x) − G(x∗) = x − x∗ − F′(x)† F(x) + F′(x∗)† F(x∗)
             = F′(x)† [F′(x)(x − x∗) − F(x) + F(x∗)] + (F′(x∗)† − F′(x)†) F(x∗),

and thus, using Lemma 1,

‖G(x) − G(x∗)‖ ≤ ‖F′(x)†‖ ‖F(x∗) − F(x) − F′(x)(x∗ − x)‖ + ‖F′(x∗)† − F′(x)†‖ ‖F(x∗)‖
               ≤ ‖F′(x)†‖ ‖F(x∗) − F(x) − F′(x)(x∗ − x)‖ + √2 ‖F′(x)†‖ ‖F′(x∗)†‖ ‖F′(x) − F′(x∗)‖ ‖F(x∗)‖
               = ‖F′(x)†‖ [ ‖F(x∗) − F(x) − F′(x)(x∗ − x)‖ + √2 ‖F′(x∗)†‖ ‖F(x∗)‖ ‖F′(x) − F′(x∗)‖ ].    (49)

Substituting (49) in (48),

‖G̃(x) − x∗‖ ≤ ‖F′(x)†‖² { ‖F′(x)‖ [ ‖F(x∗) − F(x) − F′(x)(x∗ − x)‖ + (1+√2) ‖F′(x∗)†‖ ‖F(x∗)‖ ‖F′(x) − F′(x∗)‖ ] + ‖F′(x) − F′(x∗)‖ ‖F(x∗)‖ }.    (50)
Taking into account Remark 2, we can rewrite inequality (50) as

‖G̃(x) − x∗‖ ≤ ‖F′(x)†‖² { ‖F′(x)‖ [ γ^c(ρ_x) ρ_x² + (1+√2) ‖F′(x∗)†‖ ‖F(x∗)‖ γ0(ρ_x) ρ_x ] + ‖F(x∗)‖ γ0(ρ_x) ρ_x }.

To find a bound for the quantity ‖F′(x)‖, recall that κ is the condition number of F′(x∗),
i.e. κ := ‖F′(x∗)†‖ ‖F′(x∗)‖, and by the triangle inequality and Remark 2 we get ‖F′(x)‖ ≤
‖F′(x) − F′(x∗)‖ + ‖F′(x∗)‖ ≤ γ0(ρ_x) ρ_x + κ/β. Thus, recalling (45), we finally obtain

‖G̃(x) − x∗‖ ≤ β²/(1 − βγ0(ρ_x)ρ_x)² { (γ0(ρ_x)ρ_x + κ/β) [ γ^c(ρ_x)ρ_x² + (1+√2)αβγ0(ρ_x)ρ_x ] + αγ0(ρ_x)ρ_x },

or, equivalently,

‖G̃(x) − x∗‖ ≤ β/(1 − βγ0(ρ_x)ρ_x) [ (βγ0(ρ_x)γ^c(ρ_x)ρ_x² + κγ^c(ρ_x)ρ_x)/(1 − βγ0(ρ_x)ρ_x) + (1+√2)αβ²γ0(ρ_x)²ρ_x/(1 − βγ0(ρ_x)ρ_x) + ((1+√2)κ + 1)αβγ0(ρ_x)/(1 − βγ0(ρ_x)ρ_x) ] ρ_x
            = q(‖x − x∗‖) ‖x − x∗‖,

where we set

q(r) = β/(1 − βγ0(r)r) · [ (βγ0(r)γ^c(r)r² + κγ^c(r)r)/(1 − βγ0(r)r) + (1+√2)αβ²γ0(r)²r/(1 − βγ0(r)r) + ((1+√2)κ + 1)αβγ0(r)/(1 − βγ0(r)r) ].
Finally, it is easy to prove that q is continuous and strictly increasing relying on Remark 2.
Proof of Theorem 1
Let G and G̃ be defined as in (41) and define R̄ and q as in Proposition 8. Now fix r, with 0 < r ≤ r̄, such
that B_r(x∗) ⊆ U. Then, thanks to hypothesis 3, it is possible to apply Proposition 7 and get the
first part of the thesis. Finally, relying on the structure of the function q shown in Proposition 8,
and denoting ρ_{x_0} = ‖x_0 − x∗‖, the expression of the constants C1, C2 easily follows.
7 Applications
Algorithm (37) is a two-step algorithm, consisting of the classical Gauss-Newton step followed
by a “J-projection” in a variable metric. In this section the general framework shall be specialized
to solve constrained nonlinear systems of equations in the least squares sense. We remark that, due
to the hypotheses of Theorem 1, we are dealing with regular problems in the sense of Bakushinskiı̆
and Kokurin [5]. In the finite dimensional case this implies that the number of equations is greater
than or equal to the number of unknowns. This subject has been studied in [6, 24, 22, 43] (see also
references therein). Denoting by C a closed and convex subset of X, we consider the problem

min_{x∈C} ‖F(x)‖²,    (51)
which can be cast in our framework by setting J(x) = ι_C(x), where ι_C denotes the indicator function
of the set C, i.e. ι_C(x) = 0 for x ∈ C and ι_C(x) = +∞ otherwise. The proximity operator of ι_C with
respect to H(x_n) = F′(x_n)∗ F′(x_n) turns out to be the projection onto C w.r.t. the metric defined by
H(x_n), and therefore algorithm (37) reads as follows:

x_{n+1} = P_C^{H(x_n)} (x_n − F′(x_n)† F(x_n)).    (52)
Since in general a closed form of the projection operator is not available, a further algorithm
is needed for solving the projection task, which adds an inner iteration to the main procedure.
We choose the forward-backward algorithm [11]. Though there are many other methods for that
purpose, we do not carry out any comparison among them, because this is beyond the scope of the
present paper. By definition of proximity operator, given H = A∗ A, A injective with closed range,
we have

P_C^H(z) = prox_{ι_C}^H(z) = argmin_{v∈X} { ι_C(v) + (1/2)‖v − z‖²_H } = argmin_{v∈X} { ι_C(v) + (1/2)‖Av − Az‖² }.
If P_C denotes the projection onto the convex set C, now with respect to the original metric of the
space X, the sequence defined by

v_0 ∈ X,      v_{k+1} = P_C(v_k − σ H(v_k − z)),

with σ ≤ 2/‖A‖², is strongly convergent to the point P_C^H(z) [39]. Eventually, the full algorithm is

x_0 ∈ C,
z_n = x_n − F′(x_n)† F(x_n),
v_{0,n} ∈ C,   σ_n ≤ 1/(2‖F′(x_n)‖²),
v_{k+1,n} = P_C(v_{k,n} − σ_n F′(x_n)∗ F′(x_n)(v_{k,n} − z_n)),
x_{n+1} = lim_k v_{k,n}.    (53)
It is worth noting that the inner iteration is not required when z_n belongs to C. Indeed, in that case,
the projection leaves z_n untouched and the full step of the algorithm (52) reduces to the classical
Gauss-Newton step. Such a situation asymptotically occurs when x∗ is internal to C.
Algorithm (53) requires the explicit evaluation of the projection P_C, which can be done for simple
sets, like spheres, boxes, etc. Particularly relevant from the point of view of the applications (see
[6] and references therein) is the case of box constraints in R^n. We point out that when P_C can
be computed explicitly the algorithm generates a sequence of feasible points, no matter when the
inner iteration is stopped. This feature can be useful in forcing the sequence of iterates to remain
in regions where the function is well-behaved, preventing the Gauss-Newton step from leading to
regions where the derivative is ill-conditioned.
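The following Python sketch (ours, not the authors' MatLab implementation) realizes algorithm (53) for box constraints C = [a, b], using np.clip as the projection P_C and a capped number of inner forward-backward iterations; the function names, tolerances and iteration caps are illustrative choices.

```python
import numpy as np

def projected_gauss_newton(F, Jac, x0, a, b, n_outer=50, n_inner=200, tol=1e-12):
    """Proximal (projected) Gauss-Newton method (53) for min ||F(x)||^2 s.t. a <= x <= b."""
    x = np.clip(np.asarray(x0, dtype=float), a, b)
    for _ in range(n_outer):
        J = Jac(x)
        # Gauss-Newton point z_n = x_n - F'(x_n)^+ F(x_n), via a least-squares solve.
        z = x - np.linalg.lstsq(J, F(x), rcond=None)[0]
        # Inner forward-backward loop: projection onto the box in the metric J^T J.
        sigma = 1.0 / (2.0 * np.linalg.norm(J, 2) ** 2)
        v = x.copy()
        for _ in range(n_inner):
            v_new = np.clip(v - sigma * (J.T @ (J @ (v - z))), a, b)
            if np.linalg.norm(v_new - v) < tol:
                v = v_new
                break
            v = v_new
        if np.linalg.norm(v - x) < tol:
            return v
        x = v
    return x

# Example: box-constrained Rosenbrock residuals (one of the test problems of Section 8).
F = lambda x: np.array([10.0 * (x[1] - x[0] ** 2), 1.0 - x[0]])
Jac = lambda x: np.array([[-20.0 * x[0], 10.0], [-1.0, 0.0]])
print(projected_gauss_newton(F, Jac, x0=[-1.2, 1.0], a=-3.0, b=3.0))
```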
8 Numerical experiments
This section summarizes the results of the numerical experiments we carried out in order to verify
the effectiveness of algorithm (53) for solving real-life constrained nonlinear least-squares problems. In particular, we consider the case of box constraints, namely
min_{x∈R^n} ‖F(x) − y‖²,      a ≤ x ≤ b,

where a and b are in R^n, a ≤ b, C = ∏_{i=1}^n [a_i, b_i] and F : Ω ⊇ C → R^m.
The aim of the tests is to illustrate the behavior of our algorithm on some representative examples and show that it can be successfully applied to real problems. The algorithm is implemented
in MatLab, and the convergence tests

‖x_{n+1} − x_n‖ < ε,      ‖v_{k+1,n} − v_{k,n}‖ < ε

are used with precision ε = 10⁻¹².
We remark that the implementation of the algorithm (53) computes the projection only approximately, meaning that the internal iteration is stopped either because the required precision
has been attained or because an a priori fixed maximum number of iterations has been reached.
For this reason, the projection step depends on the algorithm selected to that aim and the forwardbackward algorithm is just one choice among several possibilities. Furthermore, the number of
evaluations of F and F ′ depends only on the number of outer iterations. Therefore we provide just
the number of outer iterations needed to reach the target precision as a measure of our method’s
performance. Yet, the number of inner iterations does affect the number of outer iterations. In
fact, in our experiments we observed that, even though the algorithm is quite robust with respect
to errors in the computation of the projection, the number of outer iterations can increase if the
required inner precision is not attained (inner iterations reach the maximum allowed).
The experiments are performed on some standard small residual test problems. One group of
them is taken from [15] and a second group comes from the extensive library NLE [41], which is
accessible through the web site: www.polymath-software.com/library. We considered only
problems for which the solution (or a good estimate of it) is known in advance and for which
the Gauss-Newton method is known to be effective, since our proposal in fact extends the
classical one. The problems we selected in the first group are Rosenbrock, Osborne1, Osborne2
[32] and Kowalik [23]. They are actually unconstrained problems, to which we added some box
constraints set up in order to make the provided solution fall on the boundary of the box (faces,
edges, vertices, etc.). Moreover, on Rosenbrock's example we also tried our method in the case
where the global minimizer lies outside the fixed box.
The remaining problems, Twoeq6 and Teneq1b, come as truly constrained and are labeled as
“higher difficulty level” in the NLE library. Unlike the first group, here the constrained minimizers
of Twoeq6 and Teneq1b lie in the interior of the feasible set. Observe in addition that the given
constraints define a convex set that is not closed. More specifically, in the Teneq1b example, the
feasible region is the positive orthant of R^10, excluding four coordinate hyperplanes where the
first derivative is undefined. We overcome this difficulty by slightly shrinking the feasible region
by a small amount δ,

C_δ = { (x_1, …, x_10) : x_i ≥ δ for i = 1, …, 4,  x_i ≥ 0 for i ≥ 5 },

and solving the problem in the closed and convex set C_δ with δ = 0.0001. The same trick has
also been used for solving problem Twoeq6.
The experimental protocol is different for the two test groups: in the former group, 20 points
belonging to the box are randomly chosen as initial guesses and the average number of required
iterations is provided; whereas in the latter one the algorithm is fed with two critical initializations
reported in the NLE’s problem description. The results are collected in Table 1.
In all tests the algorithm reached the solution up to the required precision, and we did not detect
any significant variation in the number of iterations depending on the starting points. Along the
path we checked the condition number of F′, observing that it always remains bounded from above,
so that the problem stays well-conditioned. In the case of Osborne2, we saw that the classical
Gauss-Newton method does not converge for some initializations, due to ill-conditioning of the
derivative F′. We were able to correct this behavior by setting the constraints properly around the
known minimizer.
9 Conclusions
This paper shows that the local theory on the convergence of the Gauss-Newton method can be
extended to deal with the more general case of least squares problems with a convex penalty.
The main theoretical result demonstrates that, under weak Lipschitz conditions on the derivative,
convergence rates analogous to those existing for the standard case can be derived. An explicit
formula for the radius of the convergence ball is also provided. A valuable application we propose
concerns nonlinear equations with constraints. Our algorithm has been found effective and robust
in solving such problems, as shown in several numerical tests. Both the case of solutions on the
boundary of the feasible set and that of solutions in its interior have been treated successfully.
Acknowledgements We are grateful to Alessandro Verri for his constant support and advice. We
further thank Curzio Basso for carefully reading our paper.
References
[1] Argyros, I., Hilout, S.: On the local convergence of the Gauss-Newton method. Punjab Univ.
J. Math. (Lahore) 41, 23–33 (2009)
[2] Argyros, I., Hilout, S.: On the Gauss-Newton method. Journal of Applied Mathematics and
Computing pp. 1–14 (2010)
[3] Argyros, I.K.: On the semilocal convergence of the Gauss-Newton method. Adv. Nonlinear
Var. Inequal. 8(2), 93–99 (2005)
[4] Bachmayr, M., Burger, M.: Iterative total variation schemes for nonlinear inverse problems.
Inverse Problems 25(10), 105,004, 26 (2009)
[5] Bakushinsky, A.B., Kokurin, M.Y.: Iterative methods for approximate solution of inverse
problems, Mathematics and Its Applications (New York), vol. 577. Springer, Dordrecht
(2004)
[6] Bellavia, S., Macconi, M., Morini, B.: STRSCNE: a scaled trust-region solver for constrained nonlinear equations. Comput. Optim. Appl. 28(1), 31–50 (2004)
[7] Ben-Israel, A.: A modified Newton-Raphson method for the solution of systems of equations. Israel J. Math. 3, 94–98 (1965)
[8] Blaschke, B., Neubauer, A., Scherzer, O.: On convergence rates for the iteratively regularized
Gauss-Newton method. IMA J. Numer. Anal. 17(3), 421–436 (1997)
[9] Borwein, J.M., Lewis, A.S.: Convex Analysis and Nonlinear Optimization. Advanced Books
in Mathematics. Canadian Mathematical Society (2000)
[10] Burke, J.V., Ferris, M.C.: A Gauss-Newton method for convex composite optimization.
Math. Programming 71(2, Ser. A), 179–194 (1995)
[11] Combettes, P.L., Wajs, V.R.: Signal recovery by proximal forward-backward splitting. Multiscale Model. Simul. 4(4), 1168–1200 (electronic) (2005)
[12] Dennis Jr., J.E., Schnabel, R.B.: Numerical methods for unconstrained optimization and
nonlinear equations, Classics in Applied Mathematics, vol. 16. Society for Industrial and
Applied Mathematics (SIAM), Philadelphia, PA (1996). Corrected reprint of the 1983 original
[13] Engl, H.W., Hanke, M., Neubauer, A.: Regularization of inverse problems. Mathematics and
its Applications 375 (1996)
[14] Ferreira, O.P., Svaiter, B.F.: Kantorovich’s majorants principle for Newton’s method. Comput. Optim. Appl. 42(2), 213–229 (2009)
[15] Fletcher, R.: Practical methods of optimization. John Wiley and sons, West Sussex, England
(1987)
[16] Floudas, C.A., Pardalos, P.M.: A collection of test problems for constrained global optimization algorithms, Lecture Notes in Computer Science, vol. 455. Springer-Verlag, Berlin
(1990)
[17] Groetsch, C.W.: Generalized inverses of linear operators: representation and approximation.
Marcel Dekker Inc., New York (1977). Monographs and Textbooks in Pure and Applied
Mathematics, No. 37
[18] Häussler, W.M.: A Kantorovich-type convergence analysis for the Gauss-Newton-method.
Numer. Math. 48(1), 119–125 (1986)
[19] Hiriart-Urruty, J.B., Lemaréchal, C.: Convex analysis and minimization algorithms. II,
Grundlehren der Mathematischen Wissenschaften [Fundamental Principles of Mathematical Sciences], vol. 306. Springer-Verlag, Berlin (1993)
[20] Jin, Q.: A convergence analysis of the iteratively regularized Gauss-Newton method under
the Lipschitz condition. Inverse Problems 24(4), 045,002, 16 (2008)
[21] Kaltenbacher, B., Hofmann, B.: Convergence rates for the iteratively regularized Gauss-Newton method in Banach spaces. Inverse Problems 26(3), 035,007, 21 (2010)
[22] Kanzow, C.: An active set-type Newton method for constrained nonlinear systems. Appl.
Optim. 50, 179–200 (2001)
[23] Kowalik, J., Osborne, M.R.: Methods for unconstrained Optimization Problems. American
Elsevier (1969)
[24] Kozakevich, D.N., Martı́nez, J.M., Santos, S.A.: Solving nonlinear systems of equations
with simple constraints. Mat. Apl. Comput. 16(3), 215–235 (1997)
[25] Langer, S.: Investigation of preconditioning techniques for the iteratively regularized Gauss-Newton method for exponentially ill-posed problems. SIAM J. Sci. Comput. 32(5), 2543–
2559 (2010)
[26] Lewis, A., Wright, S.J.: A proximal method for composite optimization (2008). URL
http://arxiv.org/abs/0812.0423
[27] Li, C., Hu, N., Wang, J.: Convergence behavior of Gauss-Newton’s method and extensions
of the Smale point estimate theory. J. Complexity 26(3), 268–295 (2010)
[28] Li, C., Ng, K.F.: Majorizing functions and convergence of the Gauss-Newton method for
convex composite optimization. SIAM J. Optim. 18(2), 613–642 (electronic) (2007)
[29] Li, C., Wang, X.: On convergence of the Gauss-Newton method for convex composite optimization. Math. Programming 91(2, Ser. A), 349–356 (2002)
[30] Li, C., Wang, X.: Convergence of Newton’s method and uniqueness of the solution of equations in Banach spaces. II. Acta Math. Sin. (Engl. Ser.) 19(2), 405–412 (2003)
[31] Li, C., Zhang, W., Jin, X.: Convergence and uniqueness properties of Gauss-Newton’s
method. Comput. Math. Appl. 47(6-7), 1057–1067 (2004)
[32] Lootsma, F. (ed.): Numerical methods for nonlinear optimization. Academic Press Inc., U.S.
(1972)
[33] Moreau, J.J.: Fonctions convexes duales et points proximaux dans un espace hilbertien. C.
R. Acad. Sci. Paris 255, 2897–2899 (1962)
[34] Moreau, J.J.: Propriétés des applications “prox”. C. R. Acad. Sci. Paris 256, 1069–1071
(1963)
[35] Moreau, J.J.: Proximité et dualité dans un espace hilbertien. Bull. Soc. Math. France 93,
273–299 (1965)
[36] Polyak, B.T.: Introduction to optimization. Translations Series in Mathematics and Engineering. Optimization Software Inc. Publications Division, New York (1987). Translated
from the Russian, With a foreword by Dimitri P. Bertsekas
[37] Ramlau, R., Teschke, G.: Tikhonov replacement functionals for iteratively solving nonlinear
operator equations. Inverse Problems 21(5), 1571–1592 (2005)
[38] Ramlau, R., Teschke, G.: A Tikhonov-based projection iteration for nonlinear ill-posed problems with sparsity constraints. Numer. Math. 104(2), 177–203 (2006)
[39] Rosasco, L., Mosci, S., Santoro, M., Verri, A., Villa, S.: Proximal methods for structured
sparsity regularization. LNAI, Springer (to appear) (2010)
[40] Scherzer, O., Grasmair, M., Grossauer, H., Haltmeier, M., Lenzen, F.: Variational methods
in imaging, Applied Mathematical Sciences, vol. 167. Springer, New York (2009)
[41] Shacham, M., Brauner, N., Cutlip, M.: A web-based library for testing performance of numerical software for solving nonlinear algebraic equations. Comp & Chem. Eng. 26, 547–
554 (2002)
[42] Stewart, G.W.: On the continuity of the generalized inverse. SIAM J. Appl. Math. 17, 33–45
(1969)
[43] Ulbrich, M.: Nonmonotone trust-region methods for bound-constrained semismooth equations with applications to nonlinear mixed complementarity problems. SIAM J. Optim.
11(4), 889–917 (2001)
[44] Wang, X.: Convergence of Newton’s method and inverse function theorem in Banach space.
Math. Comp. 68(225), 169–186 (1999)
[45] Wang, X.: Convergence of Newton’s method and uniqueness of the solution of equations in
Banach space. IMA J. Numer. Anal. 20(1), 123–134 (2000)
[46] Wedin, P.: Perturbation theory for pseudo-inverses. Nordisk Tidskr. Informationsbehandling
(BIT) 13, 217–232 (1973)
[47] Womersley, R.S.: Local properties of algorithms for minimizing nonsmooth composite functions. Math. Programming 32(1), 69–89 (1985)
[48] Womersley, R.S., Fletcher, R.: An algorithm for composite nonsmooth optimization problems. J. Optim. Theory Appl. 48(3), 493–523 (1986)
[49] Xu, C.: Nonlinear least squares problems. In: C.A. Floudas, P.M. Pardalos (eds.) Encyclopedia of Optimization, pp. 2626–2630. Springer US (2009)
Table 1: Results of numerical experiments on the test problems specified in the first column for
the box constraints given by a and b with starting point x0 . The point x∗ is the detected minimizer,
which we show with five digits precision for conciseness.
Function | n | m | a | b | x0 | x∗ | avg. n. iter.
(entries for the test problems Rosenbrock, Kowalik, Osborne1, Osborne2, Twoeq6 and Teneq1b)