Line Search For Convex Minimization
Abstract
Golden-section search and bisection search are the two main principled algorithms for 1d minimization of
quasiconvex (unimodal) functions. The first one only uses function queries, while the second one also uses
gradient queries. Other algorithms exist under much stronger assumptions, such as Newton’s method for
twice-differentiable, strongly convex functions with Lipschitz gradients. However, to the best of our knowl-
edge, there is no principled exact line search algorithm for general convex functions — including piecewise-
linear and max-compositions of convex functions — that takes advantage of convexity. We propose two such
algorithms: ∆-Bisection is a variant of bisection search that uses (sub)gradient information and convexity to
speed up convergence, while ∆-Secant is a variant of golden-section search and uses only function queries.
Both algorithms are based on a refined definition of the optimality region ∆ containing the minimum
point, for general convex functions.
While bisection search reduces the x interval by a factor 2 at every iteration, ∆-Bisection reduces the
(sometimes much) smaller x∗ -gap ∆x (the x coordinates of ∆) by at least a factor 2 at every iteration.
Similarly, ∆-Secant also reduces the x∗ -gap by at least a factor 2 every second function query.
Moreover, and possibly more importantly, the y ∗ -gap ∆y (the y coordinates of ∆) also provides a refined
stopping criterion, which can also be used with other algorithms.
Experiments on a few convex functions confirm that our algorithms are always faster than their quasicon-
vex counterparts, often by more than a factor 2.
We further design a quasi-exact line search algorithm based on ∆-Secant. It can be used with gradient
descent as a replacement for backtracking line search, for which some parameters can be finicky to tune —
and we provide examples to this effect, on strongly-convex and smooth functions. We provide convergence
guarantees, and confirm the efficiency of quasi-exact line search on a few single- and multivariate convex
functions.1
1 Introduction
We consider the problem of exact function minimization in one dimension for general convex functions that are
expensive to evaluate (Boyd and Vandenberghe, 2004). This can occur for example when looking for a step size
for gradient descent or Frank-Wolfe (Frank and Wolfe, 1956) for logistic regression with millions of parameters.
While a backtracking line search is often used for this problem (Armijo, 1966; Boyd and Vandenberghe,
2004), such algorithms are quite sensitive to their parameters, and there is no universal good choice. In the past,
exact line searches have been used, but they were deemed too computationally costly. In this work we propose
a quasi-exact line search for convex functions, which is both computationally efficient and has tight guarantees
while being easy to tune — that is, the default value of its single parameter appears to be a universally good
choice.
When gradient information is available and computationally cheap enough, a classic algorithm for minimizing
a 1-d quasi-convex function is bisection search. It ensures that the x interval is halved at each iteration of the
search. We improve bisection search for 1-d convex functions by taking advantage of convexity to refine the
interval in which the optimum lies, which we call the optimality region ∆. We call this algorithm ∆-Bisection
and we show that it is provably faster than bisection search: the size of the optimality interval |∆x | on the x
axis (which we call the x∗ -gap) reduces by more than a factor 2 at each iteration, and often significantly more.
Moreover, for general convex functions we can use the optimality interval |∆y | (called the y ∗ -gap) to design a
1 One implementation of ∆-Secant can be found at https://github.com/deepmind/levintreesearch_cm/blob/main/lts-cm/
delta-secant.rkt
1
stopping criterion. The y ∗ -gap is almost always bounded for convex functions, while it is almost always infinite
for general quasi-convex functions.
When gradient information is not available or is too expensive to compute, another classic algorithm to
minimize 1-d quasi-convex functions is golden-section search (GSS). It reduces the x interval by a factor
(1 + √5)/2 ≈ 1.62 at each iteration. Again, we take advantage of convexity to refine the optimality region ∆(P ),
for a set P of function evaluations. We design an algorithm called ∆-Secant that uses this information. It
is guaranteed to reduce the x∗ -gap by a factor at least 2 every second iteration. Note that the x∗ -gap can
be significantly smaller than the x interval, so the guarantees of GSS and ∆-Secant are not straightforwardly
comparable.
Experiments on a few convex functions confirm that our algorithms are significantly faster than their quasi-
convex counterparts.
Finally, we use ∆-Secant to design a quasi-exact line search algorithm to be combined with gradient descent-
like algorithms for minimization of multivariate convex functions. We prove a tight guarantee for strongly
convex and smooth functions, which is only a factor 2 worse than the guarantee for exact line search, while being
significantly faster in practice. We validate on some experiments that our quasi-exact line search is competitive
with a well-tuned backtracking line search, while being more robust and does not require to precisely tune a
parameter. Moreover, we demonstrate that backtracking line search has failure cases even for smooth convex
functions, making it non-straightforward to tune its parameters.
For four points p_i, p_j, p_k, p_l, define p^{ij,kl} = (x^{ij,kl}, y^{ij,kl}) to be the intersection point of the line going through
p_i and p_j with the line going through p_k and p_l. It can be checked that it can be expressed as

x^{ij,kl} = (f^{ij}(0) − f^{kl}(0)) / (a^{kl} − a^{ij}) ,   y^{ij,kl} = f^{ij}(x^{ij,kl}) = f^{kl}(x^{ij,kl}) = (a^{kl} f^{ij}(0) − a^{ij} f^{kl}(0)) / (a^{kl} − a^{ij}) .   (1)

There are many ways to express this point, but we choose this one as it is simple and is more closely related to
the numerical expression we use for stability (see Appendix B).
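As an illustration, Eq. (1) can be computed directly from the four points. The following minimal Python sketch (the function name is ours) recovers both coordinates from the slopes and intercepts of the two lines:

```python
def intersect(pi, pj, pk, pl):
    """Intersection p^{ij,kl} of the line through pi, pj with the line
    through pk, pl, as in Eq. (1). Each point is a pair (x, y)."""
    (xi, yi), (xj, yj) = pi, pj
    (xk, yk), (xl, yl) = pk, pl
    a_ij = (yi - yj) / (xi - xj)        # slope of f^{ij}
    a_kl = (yk - yl) / (xk - xl)        # slope of f^{kl}
    f_ij0 = yi - a_ij * xi              # f^{ij}(0), the intercept
    f_kl0 = yk - a_kl * xk              # f^{kl}(0)
    x = (f_ij0 - f_kl0) / (a_kl - a_ij)
    y = (a_kl * f_ij0 - a_ij * f_kl0) / (a_kl - a_ij)
    return x, y
```

For example, the line through (0, 0) and (1, 1) meets the line through (0, 1) and (1, 0) at (0.5, 0.5).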
Let P be a set of points; then we define p^low(P) = argmin_{(x,y)∈P} y, and we write (x^low(P), y^low(P)) = p^low(P), and similarly for p^high(P).
We assume that the user provides a convex function f : [xleft , xright ] → R ∪ {∞} and the interval [xleft , xright ].
We write x∗ ∈ argminx∈[xleft ,xright ] f (x) for some arbitrary minimum x∗ of f within the given bounds, and
y ∗ = f (x∗ ).
Given a convex function f : [xleft , xright ] → R, we say that a set of points P is (f, xleft , xright )-convex if for all
(x, y) ∈ P , y = f (x), and argminx∈[xleft ,xright ] f (x) ⊆ [xlow (P ), xhigh (P )]. (The last condition is easy to fulfill by
including the points at xleft and xright in P , but, practically, it may be useful to remove unnecessary points in
P .)
We also assume that the user provides a tolerance y tol and that the algorithms considered later return a point
(x, y) only when it can be proven that y − y ∗ ≤ y tol .
Recall that a function f : R^d → R is m-strongly convex if, for all x, x′ ∈ R^d:

f(x′) ≥ f(x) + ⟨x′ − x, ∇f(x)⟩ + (m/2)∥x′ − x∥²₂ ,

and f is M-smooth if for all x, x′ ∈ R^d:

f(x′) ≤ f(x) + ⟨x′ − x, ∇f(x)⟩ + (M/2)∥x′ − x∥²₂ .
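As a quick numerical sanity check of these two definitions (our own illustration, not from the paper): for f(x) = ½∥x∥²₂ we have ∇f(x) = x, and both inequalities hold with equality for m = M = 1.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: 0.5 * float(x @ x)     # gradient of f is x; here m = M = 1

for _ in range(100):
    x, xp = rng.normal(size=5), rng.normal(size=5)
    # right-hand side shared by both bounds, with m = M = 1
    quad = f(x) + float((xp - x) @ x) + 0.5 * float((xp - x) @ (xp - x))
    assert abs(f(xp) - quad) < 1e-9  # lower and upper bound coincide here
```

For a general f, the m-bound lies below f and the M-bound above it; this quadratic is the degenerate case where they meet.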
3.1 ∆-Bisection: Convex line search with gradient information
It is well-known that 1-d convex minimization can be performed using the bisection method to guarantee a
reduction of the x interval by a factor 1/2 at each iteration. Quoting Boyd and Vandenberghe (2004):
The volume reduction factor is the best it can be: it is always exactly 1/2.
For general quasi-convex functions, only the sign of the gradient is informative — and provides exactly 1 bit
of information. However, more information is available for convex functions, namely, the magnitude of the
gradient. Thus, it is almost always possible to reduce the interval by a factor less than 1/2, using nothing
more than gradient information and convexity, and without performing more function or gradient queries than
the original bisection algorithm. For this purpose, we propose the ∆-Bisection algorithm. To show how this is
achieved, we need to consider a refinement of the interval where the optimum x∗ can be located. See Fig. 1.
Figure 1: ∆-Bisection: An example where the x∗-gap ∆^x = [x−, x+] ≈ [0.2, 1.85] is significantly smaller (almost
by a factor 3 here) than the x interval [x^0, x^1] = [0.2, 5], after querying the function f and the gradients at x^0 and
x^1. x+ is the x value of the intersection of the tangent at x^1 with the horizontal axis at min{y^0, y^1}. Similarly,
x− is the x value of the intersection of the tangent at x^0 with the horizontal axis at min{y^0, y^1} which, here, is
x^0. Due to convexity, we know that the minimum x∗ cannot be outside [x−, x+]. The original bisection search
algorithm would query at the middle of [x^0, x^1] at x^q′ = 2.6 — which is outside [x−, x+] — which, while reducing
the [x^0, x^1] interval by a factor 2, would reduce the ∆^x interval only marginally (using the tangent information
at x^q′ — not shown). By contrast, the ∆-Bisection algorithm queries at x^q in the middle of ∆^x, ensuring a
reduction of ∆^x by at least a factor 2: once the gradient at x^q becomes known, we will know that x∗ ∉ [x^q, x+]
on this figure, which gives a reduction of a factor 2, and the tangent at x^q will be used to calculate the new
values of x− and x+.
At each iteration t, the algorithm maintains two points, p^0_t = (x^0_t, y^0_t) and p^1_t = (x^1_t, y^1_t), with x^0_t ≤ x^1_t,
and the gradients of f at these points, f′(x^0_t) and f′(x^1_t). Define x^high_t ∈ {x^0_t, x^1_t} such that y^high_t = f(x^high_t) =
max{f(x^0_t), f(x^1_t)}, and x^low_t ∈ {x^0_t, x^1_t} \ {x^high_t} with y^low_t = f(x^low_t).
Define f^a_b(x) = f(x^a_b) + (x − x^a_b) f′(x^a_b) to be the tangent function of f at x^a_b. Then f^0_t(·) and f^1_t(·) are the
two tangents at the two points x^0_t and x^1_t. Then by convexity we know that f(x∗) ≥ f^0_t(x∗) and f(x∗) ≥ f^1_t(x∗),
and we also know that f(x∗) ≤ y^low_t. We use these constraints to define the optimality region ∆_t at iteration t:
We also define the x∗-gap ∆^x_t and the y∗-gap ∆^y_t as the projections of ∆_t on the x axis and on the y axis, respectively.
Then necessarily (x∗, f(x∗)) ∈ ∆_t for all t, and in particular x∗ ∈ ∆^x_t and f(x∗) ∈ ∆^y_t.
Define x−_t = inf ∆^x_t and x+_t = sup ∆^x_t. Then x−_t (resp. x+_t) is the intersection of f^0_t(·) (resp. f^1_t(·)) with the
horizontal line at y^low_t. The algorithm queries at the middle of the x∗-gap:

x^q_t = (x−_t + x+_t) / 2 .
As for bisection search, if f′(x^q_t) < 0, then x∗ > x^q_t and we set x^0_{t+1} = x^q_t and x^1_{t+1} = x^1_t; otherwise we set
x^0_{t+1} = x^0_t and x^1_{t+1} = x^q_t (the case illustrated in Figure 1). The pseudocode is given in Algorithm 1.
The y∗-gap is used to stop the algorithm when |∆^y_t| ≤ y^tol for some user-specified tolerance y^tol. ∆^y_t is simply
the y interval between the intersection Y_t of the two tangents, f^0_t and f^1_t, and the horizontal line at y^low_t:

∆^y_t = [Y_t, y^low_t] ,   Y_t = f^0_t(x) for x s.t. f^0_t(x) = f^1_t(x) ,

that is,

Y_t = (y^1_t f′(x^0_t) − y^0_t f′(x^1_t) + f′(x^0_t) f′(x^1_t) (x^0_t − x^1_t)) / (f′(x^0_t) − f′(x^1_t)) .   (3)
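Putting the query rule and the ∆^y stopping rule together, a minimal Python sketch of ∆-Bisection could look as follows (our own illustration of the formulas above, not Algorithm 1 verbatim; it assumes f′ is available at the query points):

```python
def delta_bisection(f, fprime, x0, x1, y_tol=1e-10, max_iter=100):
    """Sketch of Delta-Bisection: bisect the x*-gap [x-, x+] instead of
    the full interval [x0, x1], for a convex differentiable f."""
    y0, y1 = f(x0), f(x1)
    g0, g1 = fprime(x0), fprime(x1)
    for _ in range(max_iter):
        if g0 >= 0:                        # minimum at the left boundary
            return x0, y0
        if g1 <= 0:                        # minimum at the right boundary
            return x1, y1
        y_low = min(y0, y1)
        # tangents t0(x) = y0 + g0 (x - x0) and t1(x) = y1 + g1 (x - x1),
        # intersected with the horizontal line at y_low
        x_minus = x0 + (y_low - y0) / g0
        x_plus = x1 + (y_low - y1) / g1
        # y*-gap: y_low minus the height Y of the tangents' crossing (Eq. (3))
        Y = (y1 * g0 - y0 * g1 + g0 * g1 * (x0 - x1)) / (g0 - g1)
        if y_low - Y <= y_tol:
            return (x0, y0) if y0 <= y1 else (x1, y1)
        xq = 0.5 * (x_minus + x_plus)      # query the middle of the x*-gap
        yq, gq = f(xq), fprime(xq)
        if gq < 0:                         # minimum is to the right of xq
            x0, y0, g0 = xq, yq, gq
        else:
            x1, y1, g1 = xq, yq, gq
    return (x0, y0) if y0 <= y1 else (x1, y1)
```

For instance, on f(x) = (x − 3)² over [−5, 55] this converges to x ≈ 3 with the proven guarantee y − y∗ ≤ y^tol.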
Some experimental results can be found in Table 1. They show that ∆-Bisection always outperforms the
original bisection algorithm, often by a factor 2 or more.
Theorem 1 (∆-Bisection gap-reduction). For Algorithm 1, assuming ∆^y_{t+1} > y^tol, at each iteration t ≥ 2,
the x∗-gap ∆^x_t = [x−_t, x+_t] reduces at least by a factor of 2:

|∆^x_{t+1}| / |∆^x_t| = (1/2) · (y^low_{t+1} − z_t) / (max{y^q_t, y^low_t} − z_t) ≤ 1/2 ,   z_t = f^high_{t+1}(x^low_{t+1}) ,

where f^high_{t+1} is the tangent of f at x^high_{t+1}. ⋄

Proof. For the inequality, observe that max{y^q_t, y^low_t} ≥ y^low_t ≥ y^low_{t+1}, therefore |∆^x_{t+1}|/|∆^x_t| ≤ 1/2, since y^low_{t+1} > z_t.
For the equality, first of all, if f′(x^left) ≥ 0 or f′(x^right) ≤ 0, then the algorithm terminates immediately since
the y∗-gap ∆^y_t is zero. Hence, in what follows we can safely assume that f′(x^left) < 0 and f′(x^right) > 0, which
implies that f′(x^0_t) ≤ 0 and f′(x^1_t) ≥ 0 for all t.
We focus on the case y^0_t ≤ y^1_t. The case y^0_t > y^1_t is handled similarly.
The result is shown by considering 3 separate cases. The first case, f′(x^q_t) > 0 and y^q_t < y^0_t, is considered in
Fig. 2 (see caption). The second case, f′(x^q_t) < 0 and y^q_t < y^0_t, is considered in Fig. 3. The last case, f′(x^q_t) > 0
and y^0_t < y^q_t < y^1_t, is considered in Fig. 4. All of these cases follow by using proportionality, and each figure shows
that the result holds for each particular case.
Finally, the case y^q_t > y^1_t is impossible by convexity, as is the case f′(x^q_t) < 0 and y^q_t > y^0_t.
Remark 2 (Root finding). ∆-Bisection can be adapted to improve upon bisection search for root finding of
monotone increasing functions, if integral information is available. ⋄
Figure 2: ∆-Bisection Case 1: f′(x^q_t) > 0 and y^q_t < y^0_t, that is, x∗ ∈ [x^0_t, x^q_t] or more precisely
x∗ ∈ [x−_{t+1}, x^q_t]. The query point (x^q_t, y^q_t) is denoted p^q_t on the picture. Then x^1_{t+1} = x^q_t = x+_{t+1} and
x^0_t = x^0_{t+1}. By proportionality, |∆^x_{t+1}| / (½|∆^x_t|) = A/B = (y^q_t − z)/(y^0_t − z) = (y^low_{t+1} − z)/(y^low_t − z)
with z = f^0_t(x^q_t) = f^high_{t+1}(x^low_{t+1}).
Figure 3: ∆-Bisection Case 2: f′(x^q_t) < 0 and y^q_t < y^0_t, that is, x∗ ∈ [x^q_t, x^1_t] or more precisely
x∗ ∈ [x^q_t, x+_{t+1}]. Then x^0_{t+1} = x^q_t = x−_{t+1} and x^1_t = x^1_{t+1}. By proportionality,
|∆^x_{t+1}| / (½|∆^x_t|) = A/B = (y^q_t − z)/(y^0_t − z) = (y^low_{t+1} − z)/(y^low_t − z)
with z = f^1_t(x^q_t) = f^high_{t+1}(x^low_{t+1}).
Figure 4: ∆-Bisection Case 3: f′(x^q_t) > 0 and y^0_t < y^q_t < y^1_t, that is, x∗ ∈ [x^0_t, x^q_t] or more precisely
x∗ ∈ [x^0_t, x+_{t+1}]. Then x^1_{t+1} = x^q_t and x^0_t = x^0_{t+1} = x−_{t+1} = x^low_{t+1}. Also note that
y^low_t = y^low_{t+1} = y^0_t. By proportionality,
|∆^x_{t+1}| / (½|∆^x_t|) = A/B = (y^0_t − z)/(y^q_t − z) = (y^low_{t+1} − z)/(y^q_t − z)
with z = f^q_t(x^0_t) = f^high_{t+1}(x^low_{t+1}).
Table 1: Number of iterations (not counting the initial queries at the boundaries) required by
bisection search and ∆-Bisection for several convex functions, until provably reducing the y ∗ -gap ∆yt
to at most a tolerance of y tol = 10−10 . Observe that when the minimum is at a boundary, ∆y0 = 0 and the
algorithms terminate immediately. ∆-Bisection is a little lucky for f (x) = |x| because the x∗ -gap is reduced to
0 on the second iteration, due to the symmetry of the function.
f(x)                    x^0     x^1     w/ gradient queries         no gradient queries
                                        Bisection   ∆-Bisection     ∆-Secant    GSS
−x                      -20     7       4           4               3           3
|x|                     -20     7       78          6               7           50
max{−x, 2x}             -20     7       78          42              23          50
max{−x, 2x}             -0.01   100     78          32              18          52
|x|^1.1                 -20     7       72          26              28          50
x^2                     -20     7       42          28              27          33
√(1 + x^2)              -1000   900     58          42              23          41
x log x − x             0.001   20      44          28              23          32
max{x^2, (x − 3)^2}     -5      55      80          22              18          60
max{x^2, (x/2 − 3)^2}   -5      55      82          46              26          58
x^4                     -20     7       26          20              18          21
1/x^2 + x^2             0.001   100     50          40              31          36
Table 2: Number of queries, including the initial queries at the boundaries required by different
algorithms until provably reaching a y ∗ -gap of |∆y (Pt )| ≤ y tol = 10−10 , for several convex functions. For
gradient-based algorithms, a function+gradient query counts as two queries. Note that our version of the
golden-section search algorithm (GSS) also uses the 5-point y ∗ -gap ∆y (Pt ) as stopping criterion.
Algorithm 1 (∆-Bisection) Line search for convex functions with gradient information.
at the same point seems wasteful: after the first one of two close queries, some information is already gained and
the x∗ -gap has likely already reduced, so why then query at almost the same point again?
We propose a no-gradient line search algorithm ∆-Secant for general convex functions that is also guaranteed
to reduce the x∗ -gap by a factor at least 2 at every second iteration, but is likely to reduce it by more than this.
This is validated on experiments, which also show that ∆-Secant is often significantly faster than golden-section
search, and even than ∆-Bisection when counting a gradient query as a separate query.
First we need to define a version of the optimality region ∆(P) based only on a set of points P, without
gradient information. The constraints are that f(x∗) ≤ y^low(P) and, by convexity, that f(x∗) ≥ f^{ij}(x∗) for any pair p_i, p_j ∈ P whenever x∗ does not lie strictly between x_i and x_j:

∆(P) = { (x, y) : x ∈ [x^low(P), x^high(P)], y ≤ y^low(P), and y ≥ f^{ij}(x) for all p_i, p_j ∈ P with x ∉ (x_i, x_j) } .   (4)
Example 3. Take f(x) = x, and P = {(0, 0), (1/2, 1/2), (1, 1)}; then ∆(P) = {(0, 0)}. ⋄
Example 4. Take f (x) = |x|, and P = {(−10, 10), (−5, 5), (−1, 1), (1, 1), (5, 5), (10, 10)} then ∆x (P ) = [−1, 1]
and ∆y (P ) = [0, 1]. ⋄
Theorem 5 (Optimality region). Let P be a (f, x^left, x^right)-convex set of points. Then (x∗, f(x∗)) ∈ ∆(P), and in particular x∗ ∈ ∆^x(P) and f(x∗) ∈ ∆^y(P). ⋄

Proof. First, by definition of x∗ and P, we must have f(x∗) ≤ y^low(P). Additionally, for all (x^1, y^1) ∈ P and
(x^2, y^2) ∈ P, if x∗ ∉ (x^1, x^2) then by convexity of f we have that f(x∗) ≥ f^{12}(x∗). Therefore, by definition of ∆
in Eq. (4), it follows that (x∗, f(x∗)) ∈ ∆(P).
If only two points are available, that is, |P | = 2, then ∆x (P ) = [xlow (P ), xhigh (P )], and ∆y (P ) = (−∞, y low (P )].
For |P | ≥ 3, ∆y (P ) is (usually) not infinite anymore, and the optimality region ∆(P ) consists of at most two
triangles, while both ∆x (P ) and ∆y (P ) are bounded intervals — see Fig. 5. Also, it turns out that having more
than 5 points does not help, because the constraints generated by the points further away from plow (P ) provide
only redundant information.
Theorem 6 (Five points are enough). Let P = {p_i}_{i∈[|P|]} be a (f, x^left, x^right)-convex set of points such that
x^1 ≤ x^2 ≤ · · · ≤ x^{|P|}. Let k be such that p_k = p^low(P), and assume² that 3 ≤ k ≤ |P| − 2. Define P′ to be the subset
of P containing five points centered around the minimum, that is, P′ = {p_{k−2}, p_{k−1}, p_k, p_{k+1}, p_{k+2}}. Then P′ is (f, x^left, x^right)-convex and

∆(P′) = ∆(P) . ⋄
2 This assumption can always be made to hold by adding specific virtual points to P .
Proof. Observe that necessarily ∆^x(P) ⊆ [x^{k−1}, x^{k+1}]. Recall that f^{kj} is the line going through p_k and p_j.
First, we show that p_{k−3} (if it exists) is superfluous.
For all i ≥ k + 1 and all x ∈ ∆^x(P), necessarily x ∈ [x^{k−3}, x^i], and thus the constraint on f^{k−3,i} in
Eq. (4) is vacuous. Thus, all constraints in Eq. (4) involving f^{k−3,i} can be removed.
For all i ≤ k − 1, i ≠ k − 3, by convexity we have f^{i,k−3}(x^{k−1}) ≤ y^{k−1} and thus for all x ≥ x^{k−1}, we have
f^{i,k−3}(x) ≤ f^{i,k−1}(x). Hence, for all x ≥ x^{k−1} and all y, f^{i,k−1}(x) ≤ y implies that f^{i,k−3}(x) ≤ y. Thus all
constraints involving f^{i,k−3} in Eq. (4) for all i ≤ k − 1, i ≠ k − 3 are superfluous.
Similarly, the constraint involving f^{k−3,k} is made redundant by the constraint involving f^{k−1,k}.
Therefore, all constraints in Eq. (4) involving p_{k−3} are superfluous, and thus ∆(P \ {p_{k−3}}) = ∆(P).
Recursively, we can thus eliminate from P all points to the left of p_{k−2}, and similarly we can eliminate all
points to the right of p_{k+2}. Finally, there remain only the points of P′.
The fact that P′ is (f, x^left, x^right)-convex is now straightforward.
For simplicity, in what follows we consider the set Pt of all points queried up to step t, but an algorithm
should instead only maintain the 5 points around the minimum — as per Theorem 6.
Example 7 (4-point vs 3-point y ∗ -gap). Take f (x) = |x|, and P = {(−a, a), (ε, ε), (1, 1)} for ε ∈ (0, 1) and
a > 0, and take Q = P ∪ {(−(a + 1), a + 1)}. Note that the additional point in Q is further away from the
minimum 0 of f than the points in P . Then it is easy to check that |∆y (P )| = a + ε, while |∆y (Q)| = ε. Hence,
a 4-point y∗-gap can be arbitrarily more accurate than a 3-point one. Also note that more points only constrain
the optimality region more, that is, ∆(P ∪ {p}) ⊆ ∆(P) for any point p as long as P ∪ {p} satisfies the assumptions
of Theorem 6. ⋄
The optimality region ∆(·) can be used with existing line-search algorithms, such as golden-section search
for example, assuming the function to minimize is convex, stopping the line search when |∆y (Pt )| ≤ y tol .
For a given set P = {p_i}_{i∈{0,1,2,3,4}} of five points, assuming p^low(P) = p_2, similarly to Eqs. (2) and (3) we
have

x− = inf ∆^x(P) = x^1 + (y^2 − y^1)/a^{01} ,   a^{01} = (y^0 − y^1)/(x^0 − x^1) ,
x+ = sup ∆^x(P) = x^3 + (y^2 − y^3)/a^{34} ,   a^{34} = (y^3 − y^4)/(x^3 − x^4) ,
|∆^x(P)| = x+ − x− ,
|∆^y(P)| = y^2 − min{y^{01,23}, y^{12,34}} ,   (6)

where x− is the intersection of f^{01} with the horizontal line at y^2 = y^low(P), and x+ is the intersection of f^{34} with
y^2. See Eq. (1) for the definition of y^{ij,kl} and Appendix B for numerical stability considerations.
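These formulas translate directly into code. Here is a small Python sketch (our naming, no input validation) that computes x−, x+ and the y∗-gap from five such points:

```python
def line(p, q):
    """Slope and intercept of the secant through points p and q."""
    a = (p[1] - q[1]) / (p[0] - q[0])
    return a, p[1] - a * p[0]

def five_point_gaps(p0, p1, p2, p3, p4):
    """x*-gap endpoints and y*-gap size for five points sorted by x,
    with p2 = p_low, following Eq. (6)."""
    y2 = p2[1]
    a01, b01 = line(p0, p1)
    a34, b34 = line(p3, p4)
    x_minus = (y2 - b01) / a01           # solve f01(x) = y2
    x_plus = (y2 - b34) / a34            # solve f34(x) = y2

    def cross(aL, bL, aR, bR):           # height where two secants meet
        x = (bR - bL) / (aL - aR)
        return aL * x + bL

    a12, b12 = line(p1, p2)
    a23, b23 = line(p2, p3)
    y_gap = y2 - min(cross(a01, b01, a23, b23), cross(a12, b12, a34, b34))
    return x_minus, x_plus, y_gap
```

For example, for five points on f(x) = x² at x = −2, −1, 0, 1, 2, the x∗-gap is [−2/3, 2/3] and the y∗-gap has size 1/2.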
Remark 8 (Missing points). If there are fewer than 2 points to the left and right of p^low(P), virtual points can
be added to restore this property. Assume P = {p_2, p_3, . . . , p_{n−2}} with x^2 ≤ x^3 ≤ · · · ≤ x^{n−2} is (f, x^left, x^right)-convex.
Define Q = P ∪ {p_0, p_1, p_{n−1}, p_n} where x^0 = x^2 − 2ε, x^1 = x^2 − ε, x^{n−1} = x^{n−2} + ε, x^n = x^{n−2} + 2ε, and
y^0 = y^1 = y^{n−1} = y^n = ∞ for some infinitesimally small ε. Then necessarily p^low(Q) is surrounded by 2 points
on each side, while ∆(Q) = ∆(P) since no information regarding the location of the minimum is introduced,
since by assumption x∗ ∈ ∆^x(P). Technically, Q is not (f, x^left, x^right)-convex, but instead
(f̃, x^low(P), x^high(P))-convex where f̃ = f except on the additional points of Q. ⋄
3.2.2 ∆-Secant
The ∆-Secant algorithm is an adaptation of ∆-Bisection, using ∆(Pt ) instead of ∆t . See Algorithm 2. It may
happen in rather specific situations that xlow (Pt ) is exactly the midpoint of ∆x (Pt ), which means that the query
xqt would coincide with xlow (Pt ). This poses no issue with the theory below as we can simply take the limit
toward the midpoint, in which case the secant of these two infinitesimally close points would give a gradient
information at xlow (Pt ). However, it does pose numerical issues as the algorithm would instead loop forever.
This can be easily circumvented by moving the query slightly away³ from x^low(P_t) by some factor ε of |∆^x(P_t)|.
³ Still for numerical stability, this 'repulsion' should even be applied whenever x^low(P_t) is within ε of the midpoint of ∆^x(P_t). Querying the gradient
at x^low(P_t), if available, would be even better.
Algorithm 2 (∆-Secant line search algorithm for 1-d convex minimization) The improvement of
Remark 9 is not implemented. We advise to take ε = 2⁻⁷. Note that in practice at most 5 points around p^low(P_t)
actually need to be maintained in P_t, as per Theorem 6. We decompose ∆-Secant into two functions because we
will re-use ∆-Secant_core elsewhere.
def ∆-Secant(f, xleft , xright , y_tol):
def stop_when(P ): return |∆y (P )| ≤ y_tol
P = {(xleft , f (xleft )), (xright , f (xright ))}
return ∆-Secant_core(f, P , stop_when)
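For concreteness, here is a self-contained Python sketch in the spirit of Algorithm 2 (our own simplified rendering, not ∆-Secant_core itself): it keeps the sorted list of queried points, computes the gap endpoints as in Eq. (6) — treating missing outer neighbours in the spirit of Remark 8 — and queries the midpoint of ∆^x(P), nudged away from x^low(P) as discussed above.

```python
def delta_secant(f, x_left, x_right, y_tol=1e-10, eps=2**-7, max_iter=200):
    """Sketch of a Delta-Secant-style search for a convex f on
    [x_left, x_right]: query the midpoint of the x*-gap Delta^x(P)."""
    P = sorted([(x_left, f(x_left)), (x_right, f(x_right))])

    def slope(p, q):
        return (p[1] - q[1]) / (p[0] - q[0])

    def cross_y(p, q, r, s):
        # height of the intersection of line(p, q) with line(r, s), cf. Eq. (1)
        a, c = slope(p, q), slope(r, s)
        b, d = p[1] - a * p[0], r[1] - c * r[0]
        x = (d - b) / (a - c)
        return a * x + b

    x_low, y_low = min(P, key=lambda p: p[1])
    for _ in range(max_iter):
        k = min(range(len(P)), key=lambda i: P[i][1])
        x_low, y_low = P[k]
        # x*-gap endpoints as in Eq. (6); with missing outer neighbours the
        # endpoint degenerates to the nearest known point
        if k >= 2:
            x_minus = P[k-1][0] + (y_low - P[k-1][1]) / slope(P[k-2], P[k-1])
        else:
            x_minus = P[k-1][0] if k == 1 else x_low
        if k <= len(P) - 3:
            x_plus = P[k+1][0] + (y_low - P[k+1][1]) / slope(P[k+1], P[k+2])
        else:
            x_plus = P[k+1][0] if k == len(P) - 2 else x_low
        # y*-gap test needs two points on each side of the lowest point
        if 2 <= k <= len(P) - 3:
            bound = min(cross_y(P[k-2], P[k-1], P[k], P[k+1]),
                        cross_y(P[k-1], P[k], P[k+1], P[k+2]))
            if y_low - bound <= y_tol:
                return x_low, y_low
        width = x_plus - x_minus
        if width <= 0:
            return x_low, y_low
        xq = 0.5 * (x_minus + x_plus)
        if abs(xq - x_low) < eps * width:    # 'repulsion' away from x_low
            xq += eps * width
        if any(px == xq for px, _ in P):     # never re-query the same x
            xq += eps * width
        P.append((xq, f(xq)))
        P.sort()
    return x_low, y_low
```

When the minimum is interior, the search terminates via the y∗-gap test; when it is at a boundary, this sketch simply runs until max_iter, whereas the real algorithm reuses points across calls.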
Remark 9 (Saving one query). One early function query can sometimes be saved: given the initial bounds
x^left and x^right, after obtaining f(x^left) and f(x^right), a query at x^q_1 = (x^left + x^right)/2 is always required since
the algorithm cannot terminate with only 2 points, since |∆^y(·)| = ∞ (by contrast to ∆-Bisection). However, if
we query at x^q_1 before querying at x^right and if it happens that f(x^left) ≤ f(x^q_1), then by convexity it is not useful
to query at x^right. ⋄
Theorem 10 (∆-Secant ∆^x reduction). For ∆-Secant in Algorithm 2 where we take ε → 0, for any iteration
t ∈ N₁ such that |∆^y(P_{t+1})| > y^tol (i.e., the algorithm does not terminate before t + 2), the x∗-gap |∆^x(·)| reduces
by at least a factor 2 every second iteration: for every t ∈ N₁, |∆^x(P_{t+2})| ≤ ½|∆^x(P_t)|. ⋄
Proof. First, observe that for all t ∈ N₁, ∆(P_{t+1}) ⊆ ∆(P_t) since, by definition of ∆ in Eq. (4), P_{t+1} is more
constrained than P_t. Also observe that p^low(P_t) ∈ ∆(P_t) (see Fig. 5).
Let [x−_t, x+_t] = ∆^x(P_t) and x^q_t = ½(x−_t + x+_t) for all t.
Case 1) If y^q_t ≥ y^low(P_t), then necessarily either x∗ ∈ [x−_t, x^q_t] or x∗ ∈ [x^q_t, x+_t] — depending on which
segment contains x^low(P_{t+1}) = x^low(P_t) — and this information is reflected in ∆(P_{t+1}). Thus |∆^x(P_{t+1})| ≤
max{x^q_t − x−_t, x+_t − x^q_t} = ½|∆^x(P_t)|. Hence |∆^x(P_{t+2})| ≤ |∆^x(P_{t+1})| ≤ ½|∆^x(P_t)|.
Case 2) Assume instead that y^q_t ≤ y^low(P_t).
Case 2a) At t + 1, if y^q_{t+1} ≥ y^low(P_{t+1}) = y^q_t, then by case 1) we have |∆^x(P_{t+2})| ≤ ½|∆^x(P_{t+1})|, and thus
|∆^x(P_{t+2})| ≤ ½|∆^x(P_t)|.
Case 2b) If instead y^q_{t+1} ≤ y^low(P_{t+1}) = y^q_t, then necessarily either x∗ ∈ [x−_t, x^q_t] or x∗ ∈ [x^q_t, x+_t] —
depending on which segment contains x^low(P_{t+2}) = x^q_{t+1} — and this information is reflected in ∆(P_{t+2}). Then
again |∆^x(P_{t+2})| ≤ max{x^q_t − x−_t, x+_t − x^q_t} = ½|∆^x(P_t)|.
Remark 11 (Golden-section search). Let P^GSS_t be the set of all points queried by golden-section search
on a convex function f up to step t. Then, as mentioned earlier, golden-section search is only guaranteed to
reduce the x interval [x^{k−1}_t, x^{k+1}_t] around the current minimum x^k_t = x^low(P^GSS_t) (with x^{k−1}_t, x^k_t, x^{k+1}_t ∈ P^GSS_t)
by a factor (1 + √5)/2 ≈ 1.62 at every iteration, but this does not necessarily lead to a strict reduction of the
x∗-gap ∆^x(P^GSS_t) at every (second) iteration. In fact, this may not even reduce ∆^x(P^GSS_t) for many iterations.
Consider for example the function f(x) = √(1 + x²) for [x^left, x^right] = [−1000, 900]. This function behaves like
|x| outside of [−1, 1], and like x² inside [−1, 1]. After 4 iterations of ∆-Secant, the x∗-gap is reduced to less than
2, while it takes 13 iterations for GSS to reach the same ∆^x precision. ⋄
Figure 5: (∆-Secant) The optimality region ∆ based on 5 points P = {p0 , . . . p4 } on a convex function f ,
without gradient information. The middle of ∆x (P ) corresponds to the query performed by ∆-Secant.
It is known that there exists an α′ > 0 such that for all α ∈ (0, α′], f̃(α) < f̃(0) if the function is smooth (see
Section 2). The quasi-exact line search runs ∆-Secant until the y∗-gap (which is an upper bound on y^low(P_t) − y∗)
is smaller than a constant fraction of the current improvement from the first queried point y left (Pt ) − y low (Pt ):
y left (Pt ) − y low (Pt ) ≥ c|∆y (Pt )| (7)
for the algorithm parameter c > 0. See quasi_exact_line_search in Algorithm 3. Moreover, if the point
returned by ∆-Secant is at the end of the original interval, the interval size is increased by a factor 4 and
∆-Secant is run again, and so on until the returned point is not at the boundary. This scheme assumes
that f˜(α) eventually increases with α, which applies in particular to strongly-convex functions (see Section 2).
quasi_exact_line_search can also use an initial query αprev which defaults to 1. We also propose a version
of GD+Backtracking that also increases the interval size as required, see Algorithm 4 — also see Unbounded
Backtracking (Truong and Nguyen, 2021).
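The interval-growing scheme can be sketched as follows (our own simplified rendering: bounded_min is a plain ternary-search stand-in for ∆-Secant with the Eq. (7) stopping rule, and the names are ours):

```python
def bounded_min(f, lo, hi, iters=200):
    """Stand-in 1-d minimizer (plain ternary search on a convex f);
    in the paper this role is played by Delta-Secant."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if f(m1) <= f(m2):
            hi = m2          # minimum lies in [lo, m2]
        else:
            lo = m1          # minimum lies in [m1, hi]
    return 0.5 * (lo + hi)

def growing_line_search(f1d, alpha_prev=1.0, max_outer=60):
    """Sketch of the interval-growing wrapper of Algorithm 3: rerun the
    bounded search on [0, alpha_prev * 4^i] while the minimizer sits at
    the right boundary, so the smoothness constant M need not be known."""
    alpha = alpha_prev
    for i in range(max_outer):
        hi = alpha_prev * 4**i
        alpha = bounded_min(f1d, 0.0, hi)
        if alpha < hi * (1.0 - 1e-6):    # strictly inside the interval: done
            return alpha
    return alpha
```

For f̃(α) = (α − 5)² with alpha_prev = 1, the intervals [0, 1] and [0, 4] both end at the boundary, and the third interval [0, 16] yields α ≈ 5.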
We show in Theorem 12 that the terminating condition Eq. (7) incurs a slowdown factor of c/(c+1) compared
to an exact line search (c = ∞), in terms of the number of gradient descent steps. Note that the conditions for
Theorem 12 are weaker than smoothness and strong-convexity.
Theorem 12 (Quasi-exact line search). Let f̃ : [0, ∞) → R be a convex function. Run quasi_exact_line_search
(Algorithm 3) on f̃(·) and c > 0 and collect the return value α_c. Let α_∞ ∈ argmin_{α≥0} f̃(α) and assume α_∞ < ∞
and that f̃(∞) > f̃(α_∞). Then

f̃(0) − f̃(α_c) ≥ (c/(c + 1)) (f̃(0) − f̃(α_∞)) . ⋄
Proof. First, since α_∞ < ∞ and f̃(∞) > f̃(α_∞), there exists an iteration i of quasi_exact_line_search
at which ∆-Secant_core returns with x^low(P_i) < x^right_i.
Let P be the points returned by ∆-Secant at the last iteration of quasi_exact_line_search. From Theorem 5
we know that |∆^y(P)| ≥ y^low(P) − f̃(α_∞) = f̃(α_c) − f̃(α_∞). Thus by Eq. (7),

f̃(0) − f̃(α_c) = y^left(P) − y^low(P) ≥ c|∆^y(P)| ≥ c (f̃(α_c) − f̃(α_∞)) ,

and therefore f̃(0) − f̃(α_∞) = (f̃(0) − f̃(α_c)) + (f̃(α_c) − f̃(α_∞)) ≤ (1 + 1/c)(f̃(0) − f̃(α_c)), which gives the result.
Algorithm 3 (Gradient descent with quasi-exact line search) for strongly convex and smooth functions
on unbounded domains. The line search is repeated with an increased range as long as the minimum is at the
right boundary. This makes it possible to adapt to the smoothness parameter M without having to know it, even if M < 1.
Note that ∆-Secant is very fast when the minimum is at a boundary, in particular since points of the previous
iteration are reused. The previously found learning rate α_prev is also used as a starting point for the line search.
See Appendix B for numerical stability.
P₀ = {(0, f̃(0))}
for i = 0, 1, 2, ...:
    x_right = α_prev · 4^i
    # Re-use previous points and add the new upper bound,
    # which forces ∆^x(P′_i) > ∆^x(P_i) by Eq. (4)
    P′_i = P_i ∪ {(x_right, f(x_right))}
α_c = quasi_exact_line_search(f̃, c, α_prev)
x_{k+1} = x_k − α_c ∇f(x_k)
α_prev ← α_c
Algorithm 4 (Gradient descent with backtracking line search) for strongly convex and smooth functions
on unbounded domains. The line search is repeated with an increased range as long as the minimum is at the
right boundary. This makes it possible to adapt to the smoothness parameter M without having to know it, even if M < 1.
The previously found learning rate α_prev is also used as a starting point for the line search. See also Truong and
Nguyen (2021).
Theorem 13 (GD with quasi-exact line search). Consider an m-strongly convex and M-smooth function
f : R^d → R with (unique) minimizer x∗ ∈ R^d. Starting at x_0 ∈ R^d, after k iterations of gradient descent with
quasi-exact line search (Algorithm 3), we have

f(x_{k+1}) − f(x∗) ≤ (f(x_1) − f(x∗)) (1 − (c/(c + 1)) (m/M))^k . ⋄
Proof. First, recall that strong convexity implies that the minimum x∗ is unique. At iteration k, let α^∞_k =
argmin_{α≥0} f(x_k − α∇f(x_k)), and let x^∞_k = x_k − α^∞_k ∇f(x_k). From Boyd and Vandenberghe (2004), Chapter 9.3,
at some gradient iteration k:

f(x^∞_k) ≤ f(x_k) − (1/(2M)) ∥∇f(x_k)∥² .

Together with Theorem 12,

f(x_k) − f(x_{k+1}) ≥ (c/(c + 1)) (f(x_k) − f(x^∞_k)) ≥ (c/(c + 1)) (1/(2M)) ∥∇f(x_k)∥² .

Rearranging and subtracting f(x∗) on both sides,

f(x_{k+1}) − f(x∗) ≤ f(x_k) − f(x∗) − (c/(c + 1)) (1/(2M)) ∥∇f(x_k)∥² .

Following Boyd and Vandenberghe (2004) again, we combine with ∥∇f(x_k)∥² ≥ 2m(f(x_k) − f(x∗)):

f(x_{k+1}) − f(x∗) ≤ (f(x_k) − f(x∗)) (1 − (c/(c + 1)) (m/M))
                   ≤ . . .
                   ≤ (f(x_1) − f(x∗)) (1 − (c/(c + 1)) (m/M))^k .
Note that this implies

f(x_{k+1}) − f(x∗) ≤ (f(x_1) − f(x∗)) exp(−k (c/(c + 1)) (m/M)) ,
which suggests that with the default value c = 1, the quasi-exact line search is at most a factor 2 slower (in the
number of iterations k) than using an exact line search, while being significantly more efficient in the number of
function queries.
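As a small numerical illustration of this rate (our own, using an exact rather than quasi-exact line search, i.e., the c → ∞ limit): on a quadratic f(x) = ½xᵀAx the exact step size has the closed form α = (gᵀg)/(gᵀAg) with g = ∇f(x), and the suboptimality gap indeed decays at least as fast as (1 − m/M)^k.

```python
import numpy as np

A = np.diag([1.0, 10.0])           # f(x) = 0.5 x^T A x, so m = 1, M = 10
f = lambda x: 0.5 * float(x @ A @ x)

x = np.array([1.0, 1.0])
f0 = f(x)                          # note f(x*) = 0 for this quadratic
for _ in range(50):
    g = A @ x                      # gradient of f
    alpha = float(g @ g) / float(g @ A @ g)   # exact line-search step
    x = x - alpha * g
assert f(x) <= f0 * (1.0 - 1.0 / 10.0) ** 50  # the k = 50 bound above
```

In practice the decay is much faster than the bound; the bound is what the theorem guarantees in the worst case.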
Remark 14 (Initial gradient). Algorithm 3 can also be modified to make use of the known initial gradient,
by adding two virtual points at −1 and −2, for example. The gradient information tells us that
f(x − α∇f(x)) ≥ f(x) − α∥∇f(x)∥²₂, that is, f̃(α) ≥ f̃(0) − α∥∇f(x)∥²₂. Hence, at gradient descent iteration k,
we can augment the initial set of points P₀ of quasi_exact_line_search in Algorithm 3 with
{(−2, f(x_k) + 2∥∇f(x_k)∥²₂), (−1, f(x_k) + ∥∇f(x_k)∥²₂)}. See Appendix B, however. ⋄
and the condition must also not be satisfied at t − 1:
The previous example shows that taking a small value for ε can lead to slow progress. The next example
shows exactly the opposite: that large values of ε can also lead to slower progress than necessary — see also the
example from Equation (9.20) from Boyd and Vandenberghe (2004).
Example 16 (Armijo’s snowplow skier). Consider f (x) = eax for a ≥ 1. Define g(x) = ∇f (x) = aeax , then
the backtracking line search stops at step t∗ such that (see Eq. (8))
See Table 3 for some experimental results, comparing backtracking line search and quasi-exact line search.
Many-dimensional optimization. We also compare the algorithms on an artificial loss function of the form:
$$L(x) = \left[\prod_{i=1}^{N} \frac{\exp(a_{i,n_i} x_{n_i})}{\sum_{j\in[d]} \exp(a_{i,j} x_j)}\right]^{-1} \qquad (9)$$
with the parameters $x_0$ initialized uniformly at random in $[-20, 20]^d$, $a_{i,j} \in [-1, 1]$ drawn uniformly at random,
$N = 10$, $d = 100$, and each $n_i \in [d]$ sampled uniformly. We use gradient descent with line search and stop
after 2000 (function+gradient) queries. Fig. 6 shows the sensitivity of GD with backtracking to the ε parameter,
while GD with quasi-exact line search performs well over a wide range of the parameter c. Indeed, the exponential
curvature makes it difficult for Armijo’s condition with a large parameter ε to return a value within only a few
iterations.
[Figure 6 plot: loss (log-scale) on the y-axis versus cumulative function+gradient queries on the x-axis; curves for quasi-exact line search with c ∈ {2, 1, 0.1} and backtracking line search with ε ∈ {0.5, 0.1, 0.01}.]
Figure 6: (Loss per function+gradient query for gradient descent with quasi-exact line search
(Algorithm 3) and backtracking line search (Algorithm 4) for the loss function of Eq. (9)) The x-
axis is strongly correlated with computation time. This figure illustrates a typical example. Due to exponentials,
ε ≥ 0.1 is too conservative for backtracking line search (see Example 16). By contrast, quasi-exact line search is
robust to various choices of c, and the default c = 1 appears to be a safe choice.
                          f(x) = 3.95x^2        f(x) = e^{3x} + e^{-3x}   f(x, y) = x^4 + y^4
Line search type          GD iter.   Queries    GD iter.   Queries        GD iter.       Queries
quasi-exact c = 100       4          45         6          954            4              57
quasi-exact c = 10        4          35         7          957            4              49
quasi-exact c = 4         4          30         8          959            4              44
quasi-exact c = 2         4          25         8          955            5              44
quasi-exact c = 1         4          25         10         965            5              42
quasi-exact c = 0.5       4          25         10         965            5              38
quasi-exact c = 0.1       4          25         44         1090           7              43
quasi-exact c = 0.01      754        2265       44         1089           7              44
backtracking ε = 0.8      67         476        898        193817         74 (17 700)    388 (35 662)
backtracking ε = 0.5      4          25         229        48266          24 (17 674)    118 (35 416)
backtracking ε = 0.3      4          25         147        31486          10 (17 669)    68 (35 376)
backtracking ε = 0.1      4          25         47         9613           5 (17 655)     39 (35 327)
backtracking ε = 0.01     754        3020       13         1376           8 (17 662)     54 (35 344)
backtracking ε = 0.001    754        3020       10         867            8 (17 662)     54 (35 344)
Table 3: (Experimental results comparing backtracking line search and quasi-exact line search)
Number of GD iterations and number of function+gradient queries for quasi-exact line search and backtracking
line search (τ = 1/2), for various values of the termination-condition parameters c and ε, until f(x_t) − f(x*) ≤
10^{−10}, for different functions. For f(x) = 3.95x^2 with x_0 = 1000, small values of ε make backtracking line
search slow (see Example 15). For f(x) = e^{3x} + e^{−3x} with x_0 = 100, large values of ε make backtracking
line search slow (see Example 16). By contrast, quasi-exact line search is fast for a wide range of values. For
f(x, y) = x^4 + y^4 with (x_0, y_0) = (0.1, 15), for which the global minimum is degenerate (Example 3.7 of Truong
and Nguyen (2021)), standard backtracking line search (numbers in parentheses) gets stuck with a far too small
learning rate, impeding progress. Backtracking line search as in Algorithm 4 and quasi-exact line search as in
Algorithm 3 avoid this particular issue. See also Truong and Nguyen (2021) for other variants. In line with
Theorem 12, taking c = 0.01 for quasi-exact line search reduces the guaranteed per-iteration progress by a factor
c/(c + 1) ≈ 0.01 compared to an exact line search, and we can indeed observe on the first function that performance
is poor. The table shows that there is no good default setting for backtracking line search. By contrast,
quasi-exact line search is robust across a wide range of parameter values. We recommend taking c = 1 to balance
practical speed and tight theoretical guarantees.
4 Conclusion
∆-Bisection and ∆-Secant are two algorithms that perform exact convex line search, the former using gradients
while the latter uses only function queries. These algorithms come with a guarantee on the optimality/duality
gap. We use this guarantee as a stopping criterion to develop a quasi-exact line search, and prove convergence
guarantees that are within a factor c/(c + 1) of the convergence speed of an exact line search (for which c = ∞),
while the slack given by c allows the search to terminate well before reaching numerical precision.
We recommend using ∆-Secant for quasi-exact line search with, for example, c = 1, to balance practical speed
and theoretical guarantees.
Our algorithms compare favorably with bisection search, golden-section search and backtracking line search
in their respective domains.
References
Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, Cambridge, Eng-
land, 2004.
Marguerite Frank and Philip Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quar-
terly, 3(1-2):95–110, 1956.
Larry Armijo. Minimization of functions having Lipschitz continuous first partial derivatives. Pacific Journal of
Mathematics, 16(1):1–3, 1966.
Tuyen Truong and Hang-Tuan Nguyen. Backtracking gradient descent method and some applications in large
scale optimisation. Part 2: Algorithms and experiments. Applied Mathematics & Optimization, 84:1–30, 2021.
Martin Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In Proceedings of the 30th
International Conference on Machine Learning, volume 28(1) of Proceedings of Machine Learning Research,
pages 427–435, Atlanta, Georgia, USA, 17–19 Jun 2013. PMLR.
Appendix
A A bad case for tangent intersection selection
It is tempting to consider a variant of Algorithm 1 where at each iteration we query at the intersection of the
tangents of the two external points, that is, $x_t^q = \operatorname{argmin}_{(x,y)\in\Delta_t} y$. This algorithm can be very slow, however.
Consider for example the function $f(x) = \max\{-x/100,\ e^{bx}\}$ for $b > 0$, with $x_1^0 = -100$, $x_1^1 = 100$. It can be
verified that, for this algorithm, for approximately $100/b$ steps $t$, $x_t^0 = x_1^0$ while $x_t^1 \approx 100 - t/b$. That is, only
linear progress is made on the $x$ axis.
By contrast, ∆-Secant takes fewer than 15 iterations for $b = 10$, for example.
B Numerical stability
B.1 Line-line intersection
Some care with numerical stability is needed when implementing the line-line intersection function from 4
points used to calculate the y-gap.
Empirically, we found that the following formula is more stable than the determinant method. For 4 points
$p^1, p^2, p^3, p^4$, with $y^1 \ge y^2$ and $y^3 \le y^4$, define the slopes
$$a^{12} = \frac{y^2 - y^1}{x^2 - x^1}, \qquad a^{34} = \frac{y^3 - y^4}{x^3 - x^4},$$
and the lines $f^{12}(x) = a^{12}(x - x^2) + y^2$ and $f^{34}(x) = a^{34}(x - x^3) + y^3$.
Then, for any $\hat x \in \mathbb{R}$, the intersection of $f^{12}$ and $f^{34}$ is at
$$x^c = \hat x + \frac{f^{12}(\hat x) - f^{34}(\hat x)}{a^{34} - a^{12}} .$$
First we pick $\hat x = x^2$, then we pick $\hat x = x^c$ and repeat the latter one more time to obtain a more accurate value.
Finally, we return the last $x^c$, and we take $y^c = \min\{f^{12}(x^c), f^{34}(x^c)\}$ conservatively (for function minimization).
The quantity $|f^{12}(x^c) - f^{34}(x^c)|$ is the approximation error.
Additionally, (possibly numerical) zeros and infinities must be taken care of in a conservative manner (always
underestimating y c ).
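A minimal Python sketch of this re-substitution scheme (the function name and calling convention are ours; the conservative handling of zeros and infinities mentioned above is omitted):

```python
def secant_intersection(p1, p2, p3, p4, refinements=2):
    """Intersection of the line f12 through p1, p2 and the line f34 through
    p3, p4, computed as in Appendix B.1.

    Expects y1 >= y2 and y3 <= y4.  Returns (xc, yc, err), where err is the
    approximation error |f12(xc) - f34(xc)| and yc is taken conservatively
    as the smaller of the two line values (for function minimization).
    """
    (x1, y1), (x2, y2), (x3, y3), (x4, y4) = p1, p2, p3, p4
    a12 = (y2 - y1) / (x2 - x1)
    a34 = (y3 - y4) / (x3 - x4)
    f12 = lambda x: a12 * (x - x2) + y2
    f34 = lambda x: a34 * (x - x3) + y3
    xc = x2                           # first guess: x-hat = x2
    for _ in range(1 + refinements):  # then re-substitute xc twice more
        xc = xc + (f12(xc) - f34(xc)) / (a34 - a12)
    yc = min(f12(xc), f34(xc))        # conservative for minimization
    return xc, yc, abs(f12(xc) - f34(xc))
```

For example, the lines through $(0, 3), (1, 1)$ and through $(0, 0), (2, 1)$ intersect at $(1.2,\ 0.6)$, which the scheme recovers with negligible error.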
C Table of notation
$f$ : a convex function to minimize on $[x^{left}, x^{right}]$
$f'$ : derivative of $f$
$p_t^a = (x_t^a, y_t^a)$ : a point labelled $a$ at iteration $t$
$f_t^a(x)$ : tangent of $f$ at $x_t^a$
$f_t^{ab}(x)$ : line going through $p_t^a$ and $p_t^b$
$[x^{left}, x^{right}]$ : initial interval in which to start the search
$y^{left}, y^{right}$ : $f(x^{left})$, $f(x^{right})$
$x^{low}, y^{low}$
$x^{high}, y^{high}$
$y^{tol}$ : user-specified tolerance used to terminate the line search
$[x_t^0, x_t^1]$ : interval maintained by the gradient-based algorithms
$x_t^q$ : query coordinate of an algorithm at iteration $t$
$\Delta_t$ : optimality region for ∆-Bisection at step $t$; contains $(x^*, f(x^*))$
$\Delta_t^x, \Delta_t^y$ : $x$ and $y$ coordinates of $\Delta_t$
$P_t$ : $(f, x^{left}, x^{right})$-convex set of points maintained by the ∆-Secant algorithm
$\Delta(P)$ : optimality region for a $(f, \cdot, \cdot)$-convex set of points $P$
$\Delta^x(P_t), \Delta^y(P_t)$ : sets of $x$ and $y$ coordinates of $\Delta(P_t)$
$x^-, x^+$ : $\inf \Delta^x(P)$, $\sup \Delta^x(P)$
$p^{ij,kl} = (x^{ij,kl}, y^{ij,kl})$ : intersection point of $f^{ij}$ and $f^{kl}$