A Proof of Local Convergence For The Adam Optimizer

Download as pdf or txt
Download as pdf or txt
You are on page 1of 8

IJCNN 2019. International Joint Conference on Neural Networks. Budapest, Hungary.

14-19 July 2019

A Proof of Local Convergence for the Adam


Optimizer
Sebastian Bock Martin Weiß
Faculty of Computer Science and Mathematics Faculty of Computer Science and Mathematics
OTH Regensburg OTH Regensburg
Regensburg, Germany Regensburg, Germany
[email protected] [email protected]

Abstract—Adaptive Moment Estimation (Adam) is a very pop- of a family of algorithms which compute moments of first
ular training algorithm for deep neural networks, implemented order, that is approximate descent directions based on previous
in many machine learning frameworks. To the best of the authors iterates of the gradient like the initial momentum method [1],
knowledge no complete convergence analysis exists for Adam. The
contribution of this paper is a method for the local convergence as well as second order moments to control the componentwise
analysis in batch mode for a deterministic fixed training set, scaling and / or to adapt the learning rate in AdaGrad [3]
which gives necessary conditions for the hyperparameters of the and Adam [4]. More algorithms exist with variants like: batch
Adam algorithm. Due to the local nature of the arguments the mode vs. online or incremental mode – using ∇f (wt ) in
objective function can be non-convex but must be at least twice iteration t vs. ∇fk (wt ) where k iterates in a cyclic fashion over
continuously differentiable.
Index Terms—Non-convex optimization, Adam optimizer, con- 1, . . . , N , or deterministic vs. stochastic choice of the index
vergence, momentum method, dynamical system, fixed point k for ∇fk (wt ), stochastic assumptions for the observation of
∇f (wt ) or ∇fk (wt ), and so on.
I. I NTRODUCTION However for most of these algorithms only partial con-
Many problems in machine learning lead to a minimization vergence results are known. The original proof of [4] is
problem in the weights of a neural network: Consider e.g. wrong as has been noted by several authors, see [5], [6].
training data (x1 , y1 ), . . . , (xN , yN ) consisting of inputs xi Modifying the algorithm to AMSGrad, [7] establishes bounds
and outputs yi , and the task to determine a neural network on k∇f (wt )k, similar to the results in [8] for a class called
that has learned the relationship between inputs and outputs. Incremental Generalized Adam. Though none of the results
This corresponds to a function y = F (w, x), parametrized by shows convergence of the sequence {wt }t∈N0 . Also the proofs
the weights w, which minimizes the average loss function are lengthy and hardly reuse results from each other, giving
not much insight. General results from optimization cannot be
N N
1 X 1 X used for several reasons: First, the moments usually cannot
f (w) = l(xi , yi , w) =: fi (w) be proven to be a descent direction. Second, the learning
N i=1 N i=1
rate cannot be shown to be a step size valid for the Wolfe
over the training data. Typically the loss is built us- conditions for a line search, see [2]. The algorithm for the
ing some norm for regression problems, e.g. l(x, y, w) = step taken in iteration t may explicitly contain the variable t
1 2
2 ky − F (w, x)k2 , or using cross entropy for classification. in much more complicated ways than 1t in the Robbins-Monro
Optimization algorithms construct a sequence {wt }t∈N0 start- approach [9].
ing from an initial value w0 , which under appropriate as- The contribution of this paper is a generally applicable
sumptions converges to some local minimum w? for general method, based on the theory of discrete time dynamical
non-convex f . The most simple optimization algorithm for systems, which proves local convergence of Adam. The results
differentiable f is gradient descent with the update wt+1 = are purely qualitative because the results hold for learning
wt − α∇f (wt ) and a learning rate α > 0. For convex rates sufficiently small, where ”sufficiently small” is defined
f conditions on the Lipschitz constant L of ∇f guarantee in terms of the eigenvalues of the Hessian in the unknown
convergence and give estimates for the rate of convergence, minimum w? .
see [1]. However L is hard to get in practice, and choosing α
too big leads to oscillatory behaviour. Besides it is well known II. F IXED P OINT A NALYSIS UNDER P ERTURBATION
from optimization that the gradient is not the only optimum A. Notation
descent direction for f , see [2], but computation of the With Matn we denote the set of all real n-by-n matrices.
Hessian is usually prohibitive. This has led to the development The symbol ⊥ denotes the transpose of a vector or matrix.
With ⊗, ⊕ and we denote the component-wise multipli-
This paper presents results of the project ”LeaP – Learning Poses” sup-
ported by the Bavarian Ministry of Science and Art under Kap. 15 49 TG cation and division of vectors, as well as component-wise
78. addition of vectors and scalar. For f : Rn → R the gradient

978-1-7281-2009-6/$31.00 ©2019 IEEE paper N-20268.pdf


Personal use is permitted, but republication/distribution requires IEEE permission.
See http://www.ieee.org/publications standards/publications/rights/index.html for more information.
IJCNN 2019. International Joint Conference on Neural Networks. Budapest, Hungary. 14-19 July 2019

and Hessian are written as ∇f and ∇2 f , provided they exist.


Throughout this paper we assume f : Rn → Rn at least
continuously differentiable, twice continuously differentiable
for some results. The open ball with radius r around x ∈ Rn is
denoted by Br (x) = {y ∈ Rn : ky − xk < r} and kxk is any
norm. We denote ρ (A) = max{|λ||λ eigenvalue of A} the
spectral radius of a matrix A and diag (v) ∈ M atn describes
a matrix with components of v ∈ Rn on the diagonal.
B. Related Work
Stochastic gradient descent (SGD) becomes an effective
method for optimizing noisy tasks. Especially in the area of
neural networks SGD variants are partly responsible for big
successes in the last years, see e.g. [10] or [11].
Popular first-order SGD methods are AdaGrad [3] and
RMSProp [12]. Kingma and Ba combine the advantages
of these two methods and introduce the Adaptive Moment
Estimation (Adam) in [4] (see Algorithm 1). Unfortunately,

Algorithm 1 Adam Optimizer


Require: α ∈ R+ ,  ∈ R β1 , β2 ∈ (0, 1), w0 ∈ Rn and the
function f (w) ∈ C 2 (Rn , R)
1: m0 = 0, v0 = 0, t = 0 Fig. 1. Training the cifar-10 dataset ([15]) with the two different Adam
2: while w not converged do methods
3: mt+1 = β1 mt + (1 − β1 )∇w f (wt )
4: √ − βt+1
vt+1 = β2 vt + (1 2 )∇w f (wt ) ⊗ ∇w f (wt )
1−β2 √ [5], [16] or [6]. Reddi et. al. in [7] present even a counter
5: wt+1 = wt − α 1−β t+1 mt+1 vt+1 ⊕ 
( 1 ) example and also introduce an improved method called
6: t=t+1 AMSGrad. However in experiments AMSGrad does not show
7: end while this improvements. On the contrary, in some cases it ends up
with worse accuracy than Adam. Chen et al. [8]  are showing

the Adam optimizer is not√ always defined the same way. 2
for non-convex f mint=1 E[k∇f (xt )k ] = O ss12 (T
T )
(T ) . With
Kingma and Ba [4] use v ⊕  and the bias correction
the assumption of s1 (T ) growing slower than s2 (T ), one
in m̂ and v̂. The authors in [7] and [8] do not use an  2
reaches a minimum of E[k∇f (xt )k ] but without a guarantee
as well as [13], but the latter initialize v0 = . All three
of staying there. [17] use a similar interpretation of the Adam
apply the bias correction in the learning rate αt . We √use the
optimizer with a dynamical system viewpoint. But Barakat
bias correction as described in [4, Section 2] and v ⊕ 
and Bianchi uses a non-autonomous ordinary differential
as used in [14]. The differences between the two possible
equation without Lipschitz condition because of the term
usages of  are minimal (see figure 11 ) especially in the √
v ⊕  instead of our non-autonomous system of difference
area around v ≈ 0 by choosing  as the square root of 
equations. In our opinion, this choice makes the proof longer
from [4]. Some differences in figure 1 are due to the dropout
and more complicated.
layer, which randomly eliminates neurons in the training to
avoid overfitting. The main aim by the introduction √ of  –
In the current work we present a convergence proof of
avoiding division by 0 – holds in both variants, but v ⊕ 
the Adam optimizer [4] in a complete batch mode. This
gives the additional advantage of making the right √ hand side
means the batch size is equal to the amount of training data.
continuously differentiable for v ∈ [0, ∞) whereas v ⊕  is
With this assumption we can guarantee, that we are searching
not differentiable at v = 0. Differentiability will be essential
the same minimum w? in each time step t. Due to the local
in our proof.
nature our proof does not assume the convexity of f (w),
thus we can guarantee local convergence even for non-convex
During the last years the Adam Optimizer has become
settings. The hyperparameter setting is only bounded by
one of the most used optimization methods for training neural
networks. Even if it is apparently working, there is, to the α maxni=1 (µi )
best of our knowledge, still no convergence proof for Adam. √ (1 − β1 ) < 2β1 + 2 (1)

The proof in the original paper [4] was shown wrong, see
1 Theexperiment is programmed with Keras 2.2.4, Tensorflow 1.11.0 and with µi the i-th eigenvalue of the Hessian ∇2 f (w? ). The
Python 3.6. counter example in [7] does not affect our convergence proof,

-2- paper N-20268.pdf


A Proof of Local Convergence for the Adam Optimizer

because we consider batch mode only; the incremental func- Autonomous systems constitute the special case where T
tion in [7] becomes a linear function in batch mode. does not depend on t, so we can abbreviate to T̄ : M → M
and write
C. Idea
We consider the learning algorithm from the standpoint xt+1 = T̄ (xt ), t ∈ N0 , x0 ∈ M (3)
of dynamical systems and define a common state vector x
consisting of the moments – like m and v for Adam – and the A point x? ∈ M is called equilibrium or fixed point if
weights, so we have x = (m, v, w). Then the optimization T (t, x? ) = x? for all t ∈ N0 , so the constant function xt = x?
can be written as an iteration xt+1 = T (t, xt ) for some for all t ∈ N0 is a solution of (2). In the following the asterisk
function T : N0 × X → X with X ⊂ Rp , which defines will always denote equilibria or their components. Consider a
a non-autonomous dynamical system. The function f to be solution x = x(·; x0 ) of (2). x is called stable, if for each ε > 0
minimized in the learning process, or rather its gradient, there exists δ = δ(ε) such that any solution x̃ = x̃(·; x̃0 ) of
becomes a part of T . If f is at least continuously differentiable (2) with kx̃0 − x0 k < δ fulfills kx̃t − xt k < ε for all t ∈ N0 .
a local minimum gives the necessary condition ∇f (w? ) = 0. x is called attractive if there exists δ > 0 such that any
We show that this condition leads to a fixed point x? of T , solution x̃ with kx̃0 − x0 k < δ fulfills lim kx̃t − xt k = 0. x
t→∞
where the moments are all zero. We analyse the stability is called asymptotically stable if it is stable and attractive.
properties of this fixed point and prove local asymptotic Recall that a contraction is a self-mapping on some set
stability. This is done by considering a time-variant iteration with Lipschitz constant L < 1, i.e. a mapping T̄ : M → M ,
T as the perturbation of a time-invariant iteration T̄ where M ⊂ Rn with T̄ (x) − T̄ (y) ≤ L kx − yk for all x, y ∈ M .
Banach-like fixed point arguments can be applied. We use the If M is complete, i.e. all Cauchy sequences converge, then a
second method of Lyapunov for stability analysis where the unique fixed point x? ∈ M of T̄ exists by the Banach fixed
vanishing moments simplify the computation and estimates for point theorem.
the eigenvalues. Asymptotic stability however is equivalent
Theorem II.1. Linearized asymptotic stability implies local
to convergence of the iteration defined T to x? for all x0
nonlinear stability Consider T̄ : M → M with a fixed point
sufficiently close to x? . The conditions needed for the fixed
x? and T̄ continuously differentiable in an open neighbour-
point analysis and stability results require the learning rate to
hood Br (x? ) ⊂ M of x? . Denote the Jacobian by DT̄x? , and
be sufficiently small. Note that these results cannot be obtained
assume DT̄x? < 1 for some norm on Matn . Then there
directly from standard fixed point theorems for autonomous
exists 0 < ε ≤ r and 0 ≤ c < 1 such that for all x0 with
systems, because the iteration index t enters the dynamics.
kx0 − x? k < ε
Therefore also estimates of the eigenvalues depend on the
iteration t, and even a bound on the spectral radius uniform kx(t; x0 ) − x? k ≤ ct kx0 − x? k ∀t ∈ N0 .
in t does not give the convergence results presented here:
It is well known that ρ(A) < 1 implies the existence of a i.e. x? is locally exponentially and asymptotically stable.
vector norm with induced matrix norm such that kAk < 1,
but this norm depends on A. So ρ(At ) ≤ c < 1 for some c The theorem is the core of the first method of Lyapunov
for all t ∈ N0 does not imply the existence of a single norm for discrete time systems. Unfortunately, we could not find a
such that kAt k < 1 for all t. We emphasize that the result is proof in any english textbook like [20], [18]. For a proof see
purely qualitative, giving no explicit guidance to the choice the preprint [21, Theorem 3.3] or the German textbook [22,
of the learning rates. The main advantage of our approach is Theorem 5.4]
the clearness of the proof, only computation of eigenvalues is
needed once the iteration has been written in terms of T and III. C ONVERGENCE P ROOF
T̄ . These calculations are much more simple than the lengthy Let w ∈ Rn be the weights of the function f (w) ∈
estimates in [4], [7] and [8]. C 2 (Rn , R), which has to be minimized. We also define
D. Preliminaries g (w) := ∇f (w) ∈ Rn as the gradient of f and the state
variable of our dynamical system x = (m, v, w). With these
We recall some standard definitions and facts from the definitions we can rewrite the Adam-Optimizer as a system of
theory of difference equations and discrete time dynamical the form (2).
systems, see e.g. [18, Definition 5.4.1] or [19]. Consider
T : N0 × M → M with M ⊂ Rn which defines a non- mt+1 := β1 mt + (1 − β1 ) g (wt ) ∈ Rn
autonomous dynamical system by the iteration
vt+1 := β2 vt + (1 − β2 ) g (wt ) ⊗ g (wt ) ∈ Rn (4)
xt+1 = T (t, xt ), t ∈ N0 , x0 ∈ M (2)
q
t+1
1 − β2
 mt+1 vt+1 ⊕  ∈ Rn
p 
with solutions x : N0 → M , t 7→ xt depending on the initial wt+1 := wt − α t+1
1 − β1
value x0 . We use the notations xt = x(t; x0 ) and x = x(·; x0 )
to emphasize the dependence of solutions on the initial value So the Adam optimizer can be written as the iteration of a
if necessary. We always use the initial time t0 = 0. time-variant dynamical system xt+1 = [mt+1 , vt+1 , wt+1 ]⊥ =

-3- paper N-20268.pdf


IJCNN 2019. International Joint Conference on Neural Networks. Budapest, Hungary. 14-19 July 2019

T (t, x) = T t, [mt , vt , wt ]⊥ ∈ R3n . We split the system in



and all other parameters are defined as in Algorithm 1. Then
an autonomous and a non-autonomous part JT̄ (0, 0, w? ) has the eigenvalues, for i = 1, . . . , n:
xt+1 = T (t, xt ) = T̄ (xt ) + Θ (t, xt ) (5) λ1,i = β2
r  
with 2
(β1 + 1) + (β1 + 1) − 4 β1 − αµi (β1 −1)

  
β1 mt + (1 − β1 ) g (wt ) λ2,i =
T̄ (xt ) = β2 vt + (1 − β2 ) g (wt ) ⊗ g (wt ) (6) 2
√ r  
wt − α (mt+1 vt+1 ⊕ ) 2
(β1 + 1) − (β1 + 1) − 4 β1 − αµi (β1 −1)


and λ3,i =
  2
0
In the combination of Theorem III.2 and II.1 we still have
Θ (t, xt ) =   √ 0
 (7) to show, that |λj,i | < 1 holds, then the spectral radius for the
 
1−β2t+1
 √
−α 1−β t+1 − 1 (mt+1 vt+1 ⊕ ) Jacobian is smaller than 1 and we prove the local convergence.
1

To avoid lengthy expressions we use mt+1 and vt+1 as Theorem III.3. Let the parameters be defined as in Theorem
an abbreviation for the updated terms instead of the filters III.2 and inequality (1) holds, then ρ (JT̄ (0, 0, w? )) < 1.
depending on mt , g(wt ) and vt . The autonomous system is Corollary III.4. Let the parameters be defined as in Theorem
Adam without bias correction, the disturbance term Θ adds α maxn
i=1 (µi )
III.2 and such that √

(1 − β1 ) < 2β1 + 2 holds
bias correction which leads to a non-autonomous system. The
for i ∈ {1, . . . , n}, then Algorithm 1 converges locally with
Jacobian matrix of the autonomous system (6) is
  exponential rate of convergence.
β1 I 0 (1 − β1 ) ∇w g (wt )
∂v Proof. Consider the non-autonomous system (5) with T̄ (xt )
JT̄ (mt , vt , wt ) =  0 β2 I ∂w

∂w ∂w ∂w and Θ(t, xt ) as defined in equations (6) and (7). The Hes-
∂m ∂v ∂w sian of f is continuous, so the gradient of f is locally
with Lipschitz with some constant L > 0, kg(w1 ) − g(w2 )k ≤
∂v L kw1 − w2 k for all w1 , w2 in some neighbourhood of w? .
=2 (1 − β2 ) diag(g (wt ))∇w g (wt ) Let all other parameters be defined as in Theorem III.2,
∂w α maxn
i=1 (µi )
∂w p  especially √

(1 − β1 ) < 2β1 + 2. Using m? = 0
= − α diag β1 vt+1 ⊕  and g(w? ) = 0 we estimate
∂m
∂w α  3

= diag mt+1 (vt+1 ⊕ ) 2 q
t+1

∂v 2β2 1 − β2


∂w  1
kΘ (t, x)k =α
t+1 − 1

=I − α (1 − β1 ) diag(vt+1 ⊕ )− 2 1 − β1
∂w
kβ1 m + (1 − β1 )g(w)k
3

− diag(mt+1 ⊗ (vt+1 ⊕ )− 2 ⊗ g (wt ) ∇w g (wt ) ·p
β2 v + (1 − β2 )g(w) ⊗ g(w) ⊕ 
We have the following simple observation: q
t+1

t+1
α 1 − β2 − 1 − β1


Lemma III.1. Consider a critical point w? for f , ∇f (w? ) = ≤√ t+1

 1 − β1
0. Then x? = (0, 0, w? )⊥ is a fixed point for (6) and (4).

Proof. We start the iteration with w0 = w? , v0 = 0 and m0 = · kβ1 m + (1 − β1 )g(w)k



0, i.e. x0 = (0, 0, w? ). Then (6) gives x1 = T (x0 ) = x0 , and t+1 t+1 2

α 1 − β2 − 1 − β1
inductively xt = x0 for all t. The same holds for (4). ≤√ q
 (1 − β1 ) 1 − β t+1 + 1 − β t+1 
2 1
Now we investigate the stability of this equilibrium with the
goal of asymptotic stability for local minima w? . The analysis · (β1 kmk + (1 − β1 ) kg(w)k)
is simplified because the m and v components of x? are 0. So C 2
≤ 1 − β2t+1 − 1 − β1t+1

we reach the following Jacobian: 4
· (β1 km − m? k + (1 − β1 ) kg(w) − g(w? k)
β1 I 0 (1 − β1 ) ∇w g (w? )
 
C
2(t+1)
JT̄ (0, 0, w? ) =  0 β2 I 0  ≤ −β2t+1 − 2β1t+1 + β1
4

−αβ α(1−β1 )
√ 1I 0 I− √ ∇w g (w? )
  · (β1 km − m? k + (1 − β1 )L kw − w? k)
Theorem III.2. Let JT̄ (m, v, w) ∈ Mat3n be the Jacobian of ≤Cβ t+1 (β1 km − m? k + (1 − β1 )L kw − w? k)
system (6) and w? ∈ Rn a minimum of f with positive definite
Hessian ∇2w f (w? ) = ∇w g(w? ). Denote µi ∈ R with i ∈ where we have used the Lipschitz continuity of g, and set
{1, · · · , n} the i-th eigenvalue of ∇w g (w? ), ϕi := α (1 − β1 ) β = max{β1 , β2 , β12 }, C := √(1−β ) √4α . The
1 ( 1−β2 +(1−β1 ))

-4- paper N-20268.pdf


A Proof of Local Convergence for the Adam Optimizer

term β1 km − m? k + (1 − β1 )L kw − w? k corresponds to a TABLE I


norm C OLOUR DESCRIPTION FOR THE CONVERGENCE INVESTIGATIONS

k(m̃, w̃)k∗ := β1 km̃k + (1 − β1 )L kw̃k , m̃, w̃ ∈ Rn Inequality (9) Inequality (1) Adam finds
satisfied satisfied solution
on R2n (which does not depend on w? ). By the equiva- green yes yes yes
lence of norms in finite dimensional spaces we can estimate blue no yes yes
yellow yes no yes
k(m̃, w̃)k∗ ≤ C̃ k(m̃, w̃)k for some C̃ > 0. We continue the white no no yes
estimate: black yes yes no
cyan no yes no
≤Cβ t+1 C̃ k(m − m? , w − w? )k magenta yes no no
red no no no
≤(Cβ C̃)β t kx − x? k =: C̄β t kx − x? k

for some C̄ > 0. With this estimate and Theorem V.1, it is IV. E XPERIMENTS
sufficient to prove exponential stability of a fixed point of T̄ .
To compare our requirements for convergence to the re-
By Theorem III.3 we get ρ (JT̄ (0, 0, w? )) < 1. Thus with
quirements taken by [7] or [4], we make some empirical
Theorem II.1 the fixed point (0, 0, w? ) corresponding to the
experiments. First, we look at the different requirements to
minimum w? is locally exponentially stable, and Theorem V.1
the hyperparameter.
gives local exponential convergence of the non-autonomous p
system T (t, x), i.e. the Adam algorithm. β1 < β2 (8)
2
p
β 1 < β2 (9)
The following corollary is a combination of our results with
the results in [23] to show global convergence in the strictly Inequality (1) describes the needed requirement presented
convex case. The idea is: The iteration reaches an -bounded in this paper. Problematically in this estimation is, that we
gradient k∇f (wt̃ )k <  in some iteration t̃ for suitable need the maximum eigenvalue of (1 − β1 ) ∇w g (w? ) and
choice of hyperparameters according to [23]. The arguments consequently w? . Therefore our estimation is an a posteriori
of [23] do not imply that k∇f (wt )k remains bounded, nor estimation. But with (1) we learn something about the rela-
limt→∞ ∇f (wt ) = 0, nor that limt→∞ wt exists. tionship between the hyperparameters. √α has to be very small
At this point we use our results to show that for  small to fullfill inequality 1. With a small α or a big  we always
enough the condition k∇f (wt )k <  implies that wt is in the make the weight change smaller and so we do not jump over
domain of local convergence. w? . Inequality (8) was presented in [7] and inequality (9) was
originally presented in [4]. Both are a priori estimations for
Corollary III.5. Let f : Rn → R strictly convex with
the hyperparameters.
minimum w? ∈ Rn . Assume f ∈ C 2 and ∇2 f (w? ) positive
To show the behaviour of all estimations we set up the
definite. Assume that the conditions of Theorem 2.2 in [23]
following experiments. In Experiment 1 and 2 we want to
hold (boundedness of k∇f k, conditions on hyperparameters
minimize f (w) := w4 + w3 with the minimum w? =
of Adam). Then Adam converges globally for the minimum w? .
− 43 . In Experiment 3 we minimize the multidimensional
2 2 2
Proof. Denote x? = (m? , v? , w? ) = (0, 0, w? ) as in Theorem function f (w1 , w2 ) := (w1 + 2) (w2 + 1) + (w1 + 2) +
2
III.2. Fix α > 0 such that the assumptions of Corollary 0.1 (w2 + 1) with the minimum w? = (−2, −1). We run the
III.4 hold. Choose ε̃ > 0 small enough such that for all Adam optimizer 10000 times in every hyperparameter setting
x0 = (m0 , v0 , w0 ) ∈ Bε̃ (x? ) the Adam algorithm converges and if the last five iterations wend ∈ R5 are near enough to the
according to Corollary III.4. known solution w? the attempt is declared as convergent. Near
Theorem 2.2 in [23] shows that for suitable choice of enough in this setting means that all components of wend are
parameters, for any ε > 0 we have k∇f (wt )k < ε for some contained in the interval [w? − 10−2 , w? + 10−2 ]. The colour
t ∈ {0, . . . , T } independent of w0 ∈ Rn where T depends on coding of our experiments can be found in Table I. To keep
ε. Fix this index t. By Lemma V.3 there exist C > 0, ¯ > 0 the clarity of our results we only compare the original Adam
such that k∇f (x)k ≥ C kx − x? k for all x ∈ B¯(x? ). inequality with our inequality. With inequality (8) we obtain
Choosing x0 with kx0 − x? k small enough leads to conver- similar figures.
gence to the minimum according to Corollary III.4.
Experiment 1
First, we iterate over  ∈ 10−8 , . . . , 10−6 and β1 ∈

Note that the strict convexity was only used to guarantee
uniqueness of a minimum, and that the only critical point is {0.01, . . . , 0.99}. The other hyperparamteres are fixed α =
this minimum. Otherwise k∇f (wt )k <  might hold near a 0.001, β2 = 0.1. This setting leads us to figure 2. The only
maximum or saddle point where Corollary III.4 does not apply. area, where the Adam optimizer is not finding a solution
Of course other results which guarantee -boundedness of (red dots), is inside the white area. So both inequalities are
the gradient at some iteration t for Adam in batch mode can not satisfied and the convergence is not given. The white
also be combined with our approach. area – Adam converge but no inequality is satisfied – is

-5- paper N-20268.pdf


IJCNN 2019. International Joint Conference on Neural Networks. Budapest, Hungary. 14-19 July 2019

Fig. 2. Iterating over  and β1 Fig. 4. Iterating over α and β1 with x0 = −0.750000001

Fig. 3. Iterating over α and β1 Fig. 5. Experiment 3 :Iterating over  and β1 .

formed because we only talk about estimation and not clear first at all. We also give an a posteriori boundary for the
boundaries. The blue and yellow area can be made larger or hyperparameters and show, that the choice of β2 does not
smaller by changing β2 or α. matter for the convergence near a minimum.
Experiment 2 However the proof is based on the vanishing gradient
condition ∇f (w? ) = 0 andPcannot be used for an incremental
In the second experiment we iterate over α ∈ N
algorithm for f (w) = N1 i=1 fi (w) where different compo-
{0.001, . . . , 0.1} and β1 ∈ {0.01, . . . , 0.99}. β2 = 0.2 and
nent gradients gt = ∇fit (wt ) are used in the iterations for the
 = 10−4 are fixed. With the starting point x0 = −2 we reach
moments. Clearly ∇f (w? ) = 0 does not imply ∇fi (w? ) = 0
figure 3. In the magenta and the cyan area the Adam method
for all components. We are investigating how the incremental
is not reaching the solution, although inequality (9) or (1) is
dynamical system can be related to the batch system.
satisfied. The Adam is oscillating around the solution but do
The analysis applies to any local minimum with positive
not reach them. The big difference is that the non-convergence
definite Hessian and therefore does not require overall con-
in the cyan area is attributable to the fact that our proof only
vexity. In order to show global convergence of Adam-like
shows local convergence. By starting in x0 = −0.750000001
algorithms other methods have to be applied.
the cyan area is almost complete blue (see figure 4). In
contrary the magenta area does not change that much. R EFERENCES
Experiment 3 [1] J. E. Nesterov, Introductory lectures on convex optimiza-
In the last experiment we use the same hyperparameters tion: A basic course, ser. Applied optimization. Boston,
as in experiment 1. Therefore we reach a similar looking Mass.: Kluwer Acad. Publ, 2004, vol. APOP 87.
figure 5 by iterating over the parameters. The reason for [2] J. Nocedal and S. J. Wright, Numerical Optimization,
the enlargement of the blue and green area is the different 2nd ed. New York: Springer, 2006.
function f (x), thus different eigenvalues in inequality (1). By [3] J. Duchi, E. Hazan, and Y. Singer, “Adaptive subgra-
observing the convergence behaviour from each of the four dient methods for online learning and stochastic opti-
differently coloured areas in figure 5, we can not spot big mization,” J. Mach. Learn. Res., vol. 12, pp. 2121–2159,
differences. Jul. 2011. [Online]. Available: http://dl.acm.org/citation.
cfm?id=1953048.2021068.
V. C ONCLUSION AND D ISCUSSION [4] D. P. Kingma and J. L. Ba, “Adam: A Method for
In this paper we introduce a local convergence proof of Stochastic Optimization,” CoRR, vol. abs/1412.6980,
the Adam method and to the best of our knowledge it is the 2014.

-6- paper N-20268.pdf


A Proof of Local Convergence for the Adam Optimizer

[5] S. Bock, “Rotationsermittlung von Bauteilen basierend [18] R. P. Agarwal, Z. Nashed, and E. Taft, Difference Equa-
auf neuronalen Netzen (unpublished),” Ostbayerische tions and Inequalities: Theory, Methods, and Applica-
Technische Hochschule Regensburg, M.Sc. thesis, 2017. tions, 2nd ed. Boca Raton: Chapman and Hall/CRC,
[6] D. M. Rubio, “Convergence Analysis of an Adaptive 2000.
Method of Gradient Descent,” University of Oxford, [19] W. G. Kelley and A. C. Peterson, Difference equations:
Oxford, M.Sc. thesis, 2017. An introduction with applications, 2. ed. San Diego,
[7] J. S. Reddi, S. Kale, and S. Kumar, “On the Conver- Calif.: Harcourt/Academic Press, 2001.
gence of Adam and Beyond,” in International Con- [20] H. Freeman, Discrete-time systems: An introduction to
ference on Learning Representations, 2018. [Online]. the theory. New York: J. Wiley, 1965.
Available: https://openreview.net/forum?id=ryQu7f-RZ. [21] N. Bof, R. Carli, and L. Schenato, Lyapunov Theory
[8] X. Chen, S. Liu, R. Sun, and M. Hong, On the Con- for Discrete Time Systems. [Online]. Available: https:
vergence of A Class of Adam-Type Algorithms for Non- //arxiv.org/abs/1809.05289.
Convex Optimization, 2018. [Online]. Available: https: [22] U. Krause and T. Nesemann, Differenzengleichungen
//arxiv.org/abs/1808.02941. und diskrete dynamische Systeme: Eine Einführung in
[9] H. Robbins and S. Monro, “A Stochastic Approxima- Theorie und Anwendungen, 1. Aufl., ser. De Gruyter
tion Method,” The Annals of Mathematical Statistics, Studium. s.l.: Walter de Gruyter GmbH Co.KG, 2012.
vol. 22, no. 3, pp. 400–407, 1951. [23] S. De, A. Mukherjee, and E. Ullah, Convergence guar-
[10] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Im- antees for RMSProp and ADAM in non-convex op-
ageNet Classification with Deep Convolutional Neural timization and an empirical comparison to Nesterov
Networks,” in Advances in Neural Information Process- acceleration, 2018. [Online]. Available: http : / / arxiv.
ing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, org/pdf/1807.06766v3.
and K. Q. Weinberger, Eds., Curran Associates, Inc, [24] J. R. Silvester, Determinants of Block Matrices, London,
2012, pp. 1097–1105. [Online]. Available: http://papers. 1999. [Online]. Available: http://www.ee.iisc.ac.in/new/
nips.cc/paper/4824-imagenet-classification-with-deep- people/faculty/prasantg/downloads/blocks.pdf.
convolutional-neural-networks.pdf.
[11] G. E. Hinton, N. Srivastava, and K. Swersky, Overview A PPENDIX
of mini-batch gradient descent (unpublished), Toronto,
2012. [Online]. Available: http://www.cs.toronto.edu/ Proof. (Theorem III.2)
∼tijmen/csc321/slides/lecture slides lec6.pdf. We see that JT̄ (0, 0, w? ) has the n-fold eigenvalue β2 . So we
[12] G. Hinton, L. Deng, D. Yu, G. Dahl, A.-r. Mohamed, N. can drop second block row and column of JT̄ and investigate
Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. Sainath, the eigenvalues of
and B. Kingsbury, “Deep Neural Networks for Acoustic    
Modeling in Speech Recognition: The Shared Views β1 I (1 − β1 ) ∇w g (w? ) A B
=:
of Four Research Groups,” IEEE Signal Processing −sβ1 I I − s (1 − β1 ) ∇w g (w? ) C D
Magazine, vol. 29, no. 6, pp. 82–97, 2012.
[13] F. Zou, L. Shen, Z. Jie, W. Zhang, and W. Liu, A where we use the abbreviation s := √α . B and D are
Sufficient Condition for Convergences of Adam and symmetric since ∇w g (w? ) is the Hessian of f . By the spectral
RMSProp, 2018. [Online]. Available: https://arxiv.org/ theorem we can diagonalize B as B = QΛQ⊥ with an
abs/1811.09358. orthogonal matrix Q and a diagonal matrix of eigenvalues
[14] A. B. da Silva and M. Gazeau, A general system Λ. Analogously holds D = I − QΛQ⊥ = Q (I  − Λ) Q⊥ .
of differential equations to model first order adaptive Q 0
We make a similarity transformation with Q̃ := ∈
algorithms, 2018. [Online]. Available: https://arxiv.org/ 0 Q
Mat2n . This leaves the eigenvalues unchanged and gives
pdf/1810.13108.
[15] A. Krizhevsky, Learning Multiple Layers of Features 
A B
 
β1 I (1 − β1 ) µi I

from Tiny Images, Toronto, 2009. [Online]. Available: Q̃⊥ Q̃ =
C D −sβ1 I I − s (1 − β1 ) µi
https : / / www. cs . toronto . edu / ∼kriz / learning - features -
2009-TR.pdf. with µi the i-th eigenvalue of the Hessian. Eigenvalues does
[16] S. Bock, M. Weiß, and J. Goppold, An improvement of not change in similarity transformations, so we can also
the convergence proof of the ADAM-Optimizer. Clus- calculate the eigenvalues of our new block matrix with four
terkonferenz 2018, 2018. [Online]. Available: http : / / diagonal sub matrices.
arxiv.org/abs/1804.10587.  
[17] A. Barakat and P. Bianchi, Convergence of the ADAM (β1 − λ) I (1 − β1 ) µi I
det
algorithm from a Dynamical System Viewpoint, Paris, −sβ1 I (1 − s (1 − β1 ) µi − λ) I
2018. [Online]. Available: https://arxiv.org/abs/1810. = det ((β1 − λ) (1 − s (1 − β1 ) µi − λ) I + (1 − β1 ) sβ1 µi I)
02263. !
=0

-7- paper N-20268.pdf


IJCNN 2019. International Joint Conference on Neural Networks. Budapest, Hungary. 14-19 July 2019

Therefore the matrix is a diagonal matrix, we can conclude: continuous with L < 1, x? ∈ M the unique fixed point of
n
Y T̄ . Assume Br (x? ) ⊂ M for some r > 0. Recall that the
det (JT̄ (0, 0, w? ) − λI) = (β1 − λ) (1 − s (1 − β1 ) µi − λ) non-autonomous system (5) is defined by
i=1
x̃t+1 = T (x̃t ) := T̄ (x̃t ) + Θ(t, x̃t )
+ (1 − β1 ) sβ1 µi
for Θ : N0 × M → Rn with the bound kΘ(t, x̃)k ≤
Each factor can be written as Cβ t kx̃ − x? k for all x̃t ∈ M , t ∈ N0 for some C ≥ 0
!
λ2 − (1 − s (1 − β1 ) µi + β1 ) λ + β1 = 0 and 0 < β < 1. Then there exists ε > 0 such that for all
x̃0 ∈ M with kx̃0 − x? k < ε the iteration defined by (5) is
and following the statement well-defined, i.e. stays in M , and converges to x? .
λ23,i =0.5 (1 − s (1 − β1 ) µi + β1 Proof. Let x = x(·, x̃0 ) be the solution of the undisturbed

iteration xt+1 = T̄ (xt ) with initial condition x̃0 , x̃ = x̃(·, x̃0 )
q
2
± (1 − s (1 − β1 ) µi + β1 ) − 4β1
the corresponding solution of (5). We define et := kx̃t − x? k,
and estimate using the assumptions
is true.
et+1 = T̄ (x̃t ) + Θ(t, x̃t ) − x?
Proof. (Theorem III.3)
We already have calculate the eigenvalues of the Jacobian in = T̄ (x̃t ) − T̄ (x? ) + Θ(t, x̃t )

Theorem III.2. With these we can easily see, that |λ1 | = |β2 | < ≤ T̄ (x̃t ) − T̄ (x? ) + kΘ(t, x̃t )k
1 is satisfied per the requirements of algorithm 1. Therefore ≤ L kx̃t − x? k + Cβ t kx̃t − x? k
we only have to look at:
q = (L + Cβ t )et
2
1 + β1 − ϕi ± (1 + β1 − ϕi ) − 4β1 Choosing t large enough, we get 0 < L + Cβ t ≤ L̃ < 1 for
λ23,i =
2 all t ≥ K because β, L < 1. Then
We define ϕi := αµ √ i (1 − β1 ) and simplify the eigenvalue.
Q 
K
 et ≤ k=1 (L + Cβ k
) L̃t−K e0 =: C̃ L̃t−K e0


1 q
2
with C̃ independent of x̃0 . So et converges to 0 exponentially.
|λ23,i | = (1 + β1 − ϕi ) ± (1 + β1 − ϕi ) − 4β1

2 The arguments so far have only been valid if x̃t ∈ M ,
i.e. the iteration is well defined. But choosing x̃0 such that
| {z }
¬
e0 = kx̃0 − x? k < C̃r small enough that we can achieve et ≤
First we look at upper bound of the eigenvalues. For this we
take term ¬ combined with the regrets for ϕi : C̃e0 < r.
q
2
q
2
Theorem V.2. Determinants of Block Matrices [24]
(1 + β1 − ϕi ) − 4β1 < (1 + β1 ) − 4β1 = ± (1 − β1 )

AB
Let M = ∈ Mat2n be a block matrix with
CD
So if we put this in λ23,i we have the inequality |λ23,i | <
1 A, B, C, D ∈ Matn . If C and D commute, then det (M ) =
2 |1 + β1 − ϕi ± (1 − β1 )|. Easy to see are the two cases: det (AD − BC) holds.
|λ23,i | < 1 with + Lemma V.3. Let f ∈ C 2 (Rn , R), x? ∈ Rn with ∇f (x? ) = 0
|λ23,i | < β1 < 1 with - and ∇2 f (x? ) invertible. Then there exist  > 0 and C > 0
with k∇f (x)k ≥ C kx − x? k for all x ∈ B (x? ).
In both cases we see that the eigenvalues are smaller than 1
in absolute value. To show the lower bound λ23,i > −1, we Proof. As f is C 2 we have ∇f (x) − ∇f (x? ) − ∇2 f (x? ) =
look again at term ¬. o(kx
− x? k). So for each δ > 0 there exists  > 0 with
q q ∇f (x) − ∇f (x? ) − ∇2 f (x? ) ≤ δ kx − x? k for all x ∈
2 2
(1 + β1 − ϕi ) − 4β1 = i 4β1 − (1 + β1 − ϕi ) 1
B (x? ). Assume w.l.o.g. δ < k∇2 f (x −1 k . Then we have
| {z } ?)

∈C\R
2
k∇f (x)k ≥ ∇ f (x? )(x − x? )
Then we can write: − ∇f (x) − ∇f (x? ) − ∇2 f (x? )(x − x? )

1
q
2 2 1
|λ23,i | = (1 + β1 − ϕi ) + 4β1 − (1 + β1 − ϕi ) ≥ kx − x? k − δ kx − x? k
2
p k∇2 f (x ?)
−1 k

= β1 < 1 =: C kx − x? k
The last inequality is given by the requirements of Theorem with C = 1
− δ > 0 by choice of δ.
k∇2 f (x? )−1 k
III.3 and so we proved the whole Theorem.
Theorem V.1. Convergence to fixed point with perturbation
Let M ⊂ Rn be a complete set, T̄ : M → M Lipschitz

-8- paper N-20268.pdf

You might also like