A Unified Convergence Analysis of Block Successive Minimization Methods For Nonsmooth Optimization

Download as pdf or txt
Download as pdf or txt
You are on page 1of 34

1

A Unified Convergence Analysis of Block


Successive Minimization Methods for
Nonsmooth Optimization
arXiv:1209.2385v1 [math.OC] 11 Sep 2012

Meisam Razaviyayn, Mingyi Hong and Zhi-Quan Luo

Abstract

The block coordinate descent (BCD) method is widely used for minimizing a continuous function f
of several block variables. At each iteration of this method, a single block of variables is optimized, while
the remaining variables are held fixed. To ensure the convergence of the BCD method, the subproblem
to be optimized in each iteration needs to be solved exactly to its unique optimal solution. Unfortunately,
these requirements are often too restrictive for many practical scenarios. In this paper, we study an
alternative inexact BCD approach which updates the variable blocks by successively minimizing a
sequence of approximations of f which are either locally tight upper bounds of f or strictly convex
local approximations of f . We focus on characterizing the convergence properties for a fairly wide class
of such methods, especially for the cases where the objective functions are either non-differentiable or
nonconvex. Our results unify and extend the existing convergence results for many classical algorithms
such as the BCD method, the difference of convex functions (DC) method, the expectation maximization
(EM) algorithm, as well as the alternating proximal minimization algorithm.

Index Terms

Block Coordinate Descent, Block Successive Upper-bound Minimization, Successive Convex Ap-
proximation, Successive Inner Approximation

The authors are with the Department of Electrical and Computer Engineering, University of Minnesota, 200 Union Street SE,
Minneapolis, MN 55455. Emails: {meisam,mhong,luozq}@ece.umn.edu.

September 12, 2012 DRAFT


2

I. I NTRODUCTION

Consider the following optimization problem

min f (x1 , . . . , xn )

s.t. xi ∈ Xi , i = 1, 2, . . . , n,
Qn
where Xi ⊆ Rmi is a closed convex set, and f : i=1 Xi → R is a continuous function. A popular
approach for solving the above optimization problem is the block coordinate descent method (BCD),
which is also known as the Gauss-Seidel method. At each iteration of this method, the function is
minimized with respect to a single block of variables while the rest of the blocks are held fixed. More
specifically, at iteration r of the algorithm, the block variable xi is updated by solving the following
subproblem

r−1
xri = arg min f (xr1 , . . . , xri−1 , yi , xi+1 , . . . , xnr−1 ), i = 1, 2, . . . , n. (1)
yi ∈Xi

Let us use {xr } to denote the sequence of iterates generated by this algorithm, where xr , (xr1 , . . . , xrn ).
Due to its particular simple implementation, the BCD method has been widely used for solving problems
such as power allocation in wireless communication systems [28], clustering [14], image denoising and
image reconstruction [7] and dynamic programming [17].

Convergence of the BCD method typically requires the uniqueness of the minimizer at each step or the
quasi-convexity of the objective function (see [30] and the references therein). Without these assumptions,
it is possible that the BCD iterates do not get close to any of the stationary points of the problem (see
Powell [24] for examples). Unfortunately, these requirements can be quite restrictive in some important
practical problems such the tensor decomposition problem (see [19] and the application section in this
work) and the sum rate maximization problem in wireless networks. In fact, for the latter case, even
solving the per block subproblem (1) is difficult due to the non-convexity and non-differentiability of the
objective function.
To overcome such difficulties, one can modify the BCD algorithm by optimizing a well-chosen
approximate version of the objective function at each iteration. The classical gradient descent method,
for example, can be viewed as an implementation of such strategy. To illustrate, recall that the update
rule of the gradient descent method is given by

xr+1 = xr − αr+1 ∇f (xr ).

September 12, 2012 DRAFT


3

This update rule is equivalent to solving the following problem

xr+1 = arg min g(x, xr ),


x

where
1
g(x, xr ) , f (xr ) + ∇f (xr )(x − xr ) + kx − xr k2 .
2αr+1
Clearly, the function g(x, xr ) is an approximation of f (·) around the point xr . In fact, as we will see
later in this paper, successively optimizing an approximate version of the original objective is the key
idea of many important algorithms such as the concave-convex procedure [33], the EM algorithm [10],
the proximal minimization algorithm [2], to name a few. Furthermore, this idea can be used to simplify
the computation and to guarantee the convergence of the original BCD algorithm with the Gauss-Seidel
update rule (e.g. [31], [12], [32]). However, despite its wide applicability, there appears to be no general
unifying convergence analysis for this class of algorithms.
In this paper, we provide a unified convergence analysis for a general class of inexact BCD methods
in which a sequence of approximate versions of the original problem are solved successively. Our focus
will be on problems with nonsmooth and nonconvex objective functions. Two types of approximations
are considered: one being a locally tight upper bound for the original objective function, the other being
a convex local approximation of the objective function. We provide convergence analysis for both of
these successive approximation strategies as well as for various types of updating rules, including the
cyclic updating rule, the Gauss-Southwell update rule or the overlapping essentially cyclic update rule.
By allowing inexact solution of subproblems, our work unifies and extends several existing algorithms
and their convergence analysis, including the difference of convex functions (DC) method, the expectation
maximization (EM) algorithm, as well as the alternating proximal minimization algorithm.

II. T ECHNICAL P RELIMINARIES

Throughout the paper, we adopt the following notations. We use Rm to denote the space of m
dimensional real valued vectors, which is also represented as the Cartesian product of n smaller real
valued vector spaces, i.e.,
Rm = Rm1 × Rm2 × . . . × Rmn ,
Pn
where i=1 mi = m. We use the notation (0, . . . , dk , . . . , 0) to denote the vector of all zeros except the
k-th block, with dk ∈ Rmk . The following concepts/definitions are adopted in our paper:

September 12, 2012 DRAFT


4

• Distance of a point from a set: Let S ⊆ Rm be a set and x be a point in Rm , the distance of the
point x from the set S is defined as

d(x, S) = inf kx − sk,


s∈S

where k · k denotes the 2-norm in Rm .


• Directional derivative: Let f : D → R be a function where D ⊆ Rm is a convex set. The directional
derivative of f at point x in direction d is defined by
f (x + λd) − f (x)
f ′ (x; d) , lim inf .
λ↓0 λ
• Stationary points of a function: Let f : D → R be a function where D ⊆ Rm is a convex set.
The point x is a stationary point of f (·) if f ′ (x; d) ≥ 0 for all d such that x + d ∈ D . In this paper
we use the notation X ∗ to denote the set of stationary points of a function.
• Regularity of a function at a point: The function f : Rm → R is regular at the point z ∈ domf
with respect to the coordinates m1 , m2 , . . . , mn , m1 + m2 + . . . + mn = m, if f ′ (z; d) ≥ 0 for
all d = (d1 , d2 , . . . , dn ) with f ′ (z; d0k ) ≥ 0, where d0k = (0 . . . , dk , . . . , 0) and dk ∈ Rmk , ∀ k. For
detailed discussion on the regularity of a function, the readers are referred to [30, Lemma 3.1].
• Quasi-convex function: The function f is quasi-convex if

f (θx + (1 − θ)y) ≤ max{f (x), f (y)}, ∀ θ ∈ (0, 1), ∀ x, y ∈ dom f

• Coordinatewise minimum of a function: z ∈ dom f ⊆ Rm is coordinatewise minimum of f with


respect to the coordinates in ℜm1 , ℜm2 , . . . , ℜmn , m1 + . . . + mk = m if

f (z + (0, . . . , dk , . . . , 0)) ≥ f (z), ∀ dk ∈ Rmk with z + (0, . . . , dk , . . . , 0) ∈ dom f.

III. S UCCESSIVE U PPER - BOUND M INIMIZATION (SUM)

To gain some insights to the general inexact BCD method, let us first consider a simple Successive
Upper-bound Minimization (SUM) approach in which all the variables are grouped into a single block.
Although simple in form, the SUM algorithm is the key to many important algorithms such as the DC
programming [33] and the EM algorithm [4].
Consider the following optimization problem

min f (x)
(2)
s.t. x ∈ X,

September 12, 2012 DRAFT


5

1 Find a feasible point x0 ∈ X and set r = 0

2 repeat
3 r =r+1
4 Let X r = arg minx∈X u(x, xr−1 )
5 Set xr to be an arbitrary element in X r

6 until some convergence criterion is met

Fig. 1: Pseudo code of the SUM algorithm

where X is a closed convex set. Without loss of generality, we can assume that dom f = X . When
the objective function f (·) is non-convex and/or nonsmooth, solving (2) directly may not be easy. The
SUM algorithm circumvents such difficulty by optimizing a sequence of approximate objective functions
instead. More specifically, starting from a feasible point x0 , the algorithm generates a sequence {xr }
according to the following update rule

xr ∈ arg min u(x, xr−1 ) (3)


x∈X

where xr−1 is the point generated by the algorithm at iteration r − 1 and u(x, xr−1 ) is an approximation
of f (x) at the r -th iteration. Typically the approximate function u(·, ·) needs to be chosen such that the
subproblem (3) is easy to solve. Moreover, to ensure the convergence of the SUM algorithm, certain
regularity conditions on u(·, ·) is required (which will be discussed shortly). Among others, u(x, xr−1 )
needs to be a global upper bound for f (x), hence the name of the algorithm. The main steps of the SUM
algorithm are presented in Fig. 1.
We remark that the proposed SUM algorithm is in many ways similar to the inner approximation
algorithm (IAA) developed in [21], with the following key differences:
• The IAA algorithm approximates both the objective functions and the feasible sets. On the contrary,
the SUM algorithm only approximates the objective function.
• The the IAA algorithm is only applicable for problems with smooth objectives, while the SUM
algorithm is able to handle nonsmooth objectives as well.
It is worth mentioning that the existing convergence result for the IAA algorithm is quite weak. In
particular, [21, Theorem 1] states that if the whole sequence converges, then the algorithm should converge

September 12, 2012 DRAFT


6

to a stationary point. In the following, we show that the SUM algorithm provides stronger convergence
guarantees as long as the approximation function u(·, ·) satisfies certain mild assumptions1 which we
outline below.

Assumption 1 Let the approximation function u(·, ·) satisfy the following

u(y, y) = f (y), ∀y∈X (A1)

u(x, y) ≥ f (x), ∀ x, y ∈ X (A2)

u′ (x, y; d) = f ′ (y; d), ∀ d with y + d ∈ X (A3)


x=y

u(x, y) is continuous in (x, y) (A4)

The assumptions (A1) and (A2) imply that the approximate function u(·, xr−1 ) in (3) is a tight upper
bound of the original function. The assumption (A3) guarantees that the first order behavior of u(·, xr−1 )
is the same as f (·) locally (note that the directional derivative u′ (x, y; d) is only with respect to the
variable x). Although directly checking (A3) may not be easy, the following proposition provides a
sufficient condition under which (A3) holds true automatically.

Proposition 1 Assume f (x) = f0 (x) + f1 (x), where f0(·) is continuously differentiable and the direc-
tional derivative of f1 (·) exists at every point x ∈ X . Consider u(x, y) = u0 (x, y)+f1 (x), where u0 (x, y)
is a continuously differentiable function satisfying the following conditions

u0 (y, y) = f0 (y), ∀y∈X (4)

u0 (x, y) ≥ f0 (x), ∀ x, y ∈ X . (5)

Then, (A1), (A2) and (A3) hold for u(·, ·).

Proof: First of all, (4) and (5) imply (A1) and (A2) immediately. Now we prove (A3) by contradiction.
Assume the contrary so that there exist a y ∈ X and a d ∈ Rm so that

f ′ (y; d) 6= u′ (x, y; d) and y + d ∈ X. (6)


x=y

This further implies that


f0′ (y; d) 6= u′0 (x, y; d) .
x=y

1
These assumptions are weaker than those made to ensure the convergence of the IAA algorithm.

September 12, 2012 DRAFT


7

Furthermore, since f0 (·) and u0 (·, ·) are continuously differentiable, there exists a α > 0 such that for
z = y + αd,
f0′ (z; d) 6= u′0 (x, z; d) . (7)
x=z

The assumptions (4) and (5) imply that


u0 (z + λd, z) − u0 (z, z)
u′0 (x, z; d) = lim
x=z λ↓0 λ
(8)
f0 (z + λd) − f0 (z)
≥ lim = f ′ (z; d).
λ↓0 λ
On the other hand, the differentiability of f0 (·), u0 (·, ·) and using (4), (5) imply
u0 (z, z) − u0 (z − λd, z)
u′0 (x, z; d) = lim
x=z λ↓0 λ
(9)
f0 (z) − f0 (z − λd)
≤ lim = f ′ (z; d).
λ↓0 λ

Clearly, (8) and (9) imply that u′0 (x, z; d) = f ′ (z; d) which contradicts (7).
x=z
The following theorem establishes the convergence for the SUM algorithm.

Theorem 1 Assume that Assumption 1 is satisfied. Then every limit point of the iterates generated by
the SUM algorithm is a stationary point of the problem (2).

Proof: Firstly, we observe the following series of inequalities


(i) (ii)
f (xr+1 ) ≤ u(xr+1 , xr ) ≤ u(xr , xr ) = f (xr ), ∀ r = 0, 1, 2, . . . (10)

where step (i) is due to (A1), step (ii) follows from the optimality of xt+1 (cf. step 4 and 5 in Fig.1),
and the last equality is due to (A2). A straightforward consequence of (10) is that the sequence of the
objective function values are non-increasing, that is

f (x0 ) ≥ f (x1 ) ≥ f (x2 ) ≥ . . . (11)

Assume that there exists a subsequence {xrj } converging to a limit point z . Then Assumptions (A1),
(A2) together with (11) imply that

u(xrj+1 , xrj+1 ) = f (xrj+1 ) ≤ f (xrj +1 ) ≤ u(xrj +1 , xrj ) ≤ u(x, xrj ), ∀x∈X

Letting j → ∞, we obtain
u(z, z) ≤ u(x, z), ∀ x ∈ X,

September 12, 2012 DRAFT


8

which implies
u′ (x, z; d) ≥ 0, ∀ d ∈ Rm with z + d ∈ X .
x=z

Combining with (A3), we obtain

f ′ (z; d) ≥ 0, ∀ d ∈ Rm with z + d ∈ X ,

implying that z is a stationary point of f (·).

Corollary 1 Assume that the level set X 0 = {x | f (x) ≤ f (x0 )} is compact and Assumption 1 holds.
Then, the sequence of iterates {xr } generated by the SUM algorithm satisfy

lim d(xr , X ∗ ) = 0,
r→∞

where X ∗ is the set of stationary points of (2).

Proof: We prove the claim by contradiction. Suppose on the contrary that there exists a subsequence
{xrj } such that d(xrj , X ∗ ) ≥ γ for some γ > 0. Since the sequence {xrj } lies in the compact set X 0 ,
it has a limit point z . By further restricting the indices of the subsequence, we obtain

d(z, X ∗ ) = lim d(xrj , X ∗ ) ≥ γ,


j→∞

which contradicts the fact that z ∈ X ∗ due to Theorem 1.


The above results show that under Assumption 1, the SUM algorithm is globally convergent. In the
rest of this work, we derive similar results for a family of more general inexact BCD algorithms.

IV. T HE B LOCK S UCCESSIVE U PPER - BOUND M INIMIZATION A LGORITHM

In many practical applications, the optimization variables can be decomposed into independent blocks.
Such block structure, when judiciously exploited, can lead to low-complexity algorithms that are dis-
tributedly implementable. In this section, we introduce the Block Successive Upper-bound Minimization
(BSUM) algorithm, which effectively takes such block structure into consideration.
Let us assume that the feasible set X is the cartesian product of n closed convex sets: X = X1 ×. . .×Xn ,
P
with Xi ⊆ Rmi and i mi = m. Accordingly, the optimization variable x ∈ Rm can be decomposed as:
x = (x1 , x2 , . . . , xn ), with xi ∈ Xi , i = 1, · · · , M . We are interested in solving the problem

min f (x)
(12)
s.t. x ∈ X.

September 12, 2012 DRAFT


9

1 Find a feasible point x0 ∈ X and set r = 0

2 repeat
3 r = r + 1, i = (r mod n) + 1
4 Let X r = arg minxi ∈Xi ui (xi , xr−1 )
5 Set xri to be an arbitrary element in X r
6 Set xrk = xkr−1 , ∀ k 6= i

7 until some convergence criterion is met

Fig. 2: Pseudo code of the BSUM algorithm

Different from the SUM algorithm, the BSUM algorithm only updates a single block of variables in
each iteration. More precisely, at iteration r , the selected block (say block i) is computed by solving the
following subproblem

min ui (xi , xr−1 )


xi
(13)
s.t. xi ∈ X i ,
where ui (·, xr−1 ) is again an approximation (in fact, a global upper-bound) of the original objective f (·)
at the point xr−1 . Fig. 2 summarizes the main steps of the BSUM algorithm. Note that although the
blocks are updated following a simple cyclic rule, the algorithm and its convergence results can be easily
extended to the (more general) essentially cyclic update rule as well. This point will be further elaborated
in Section VII.
Now we are ready to study the convergence behavior of the BSUM algorithm. To this end, the following
regularity conditions on the function ui (·, ·) are needed.

Assumption 2

ui (yi , y) = f (y), ∀ y ∈ X,∀ i (B1)

ui (xi , y) ≥ f (y1 , . . . , yi−1 , xi , yi+1 , . . . , yn ), ∀ xi ∈ X i , ∀ y ∈ X , ∀ i (B2)

u′i (xi , y; di ) = f ′ (y; d), ∀ d = (0, . . . , di , . . . , 0) s.t. yi + di ∈ Xi , ∀ i (B3)


xi =yi

ui (xi , y) is continuous in (xi , y), ∀i (B4)

September 12, 2012 DRAFT


10

Similar to Proposition 1, we can identify a sufficient condition to ensure (B3).

Proposition 2 Assume f (x) = f0 (x) + f1 (x), where f0(·) is continuously differentiable and the direc-
tional derivative of f1 (·) exists at every point x ∈ X . Consider ui (xi , y) = u0,i (xi , y) + f1 (x), where
u0,i (xi , y) satisfies the following assumptions

u0,i (xi , x) = f0 (x), ∀ x ∈ X, ∀i

u0,i (xi , y) ≥ f0 (y1 , . . . , yi−1 , xi , yi+1 , . . . , yn ), ∀ x, y ∈ X ∀ i.

Then, (B1), (B2), and (B3) hold.

Proof: The proof is exactly the same as the proof in Proposition 1.


The convergence results regarding to the BSUM algorithm consist of two parts. In the first part, a
quasi-convexity of the objective function is assumed, which guarantees the existence of the limit points.
This is in the same spirit of the classical proof of convergence for the BCD method in [2]. However,
if we know that the iterates lie in a compact set, then a stronger result can be proved. Indeed, in the
second part of the theorem, the convergence is obtained by relaxing the quasi-convexity assumption while
imposing the compactness assumption of level sets.

Theorem 2
(a) Suppose that the function ui (xi , y) is quasi-convex in xi and Assumption 2 holds. Furthermore,
assume that the subproblem (13) has a unique solution for any point xr−1 ∈ X . Then, every limit
point z of the iterates generated by the BSUM algorithm is a coordinatewise minimum of (12). In
addition, if f (·) is regular at z , then z is a stationary point of (12).
(b) Suppose the level set X 0 = {x | f (x) ≤ f (x0 )} is compact and Assumption 2 holds. Furthermore,
assume that the subproblem (13) has a unique solution for any point xr−1 ∈ X for at least n − 1
blocks. If f (·) is regular at every point in the set of stationary points X ∗ with respect to the
coordinates x1 , . . . , xn . Then, the iterates generated by the BSUM algorithm converge to the set of
stationary points, i.e.,
lim d(xr , X ∗ ) = 0.
r→∞

Proof: The proof of part (a) is similar to the one in [2] for block coordinate descent approach. First
of all, since a locally tight upper bound of f (·) is minimized at each iteration, we have

f (x0 ) ≥ f (x1 ) ≥ f (x2 ) ≥ . . . . (14)

September 12, 2012 DRAFT


11

Therefore, the continuity of f (·) implies

lim f (xr ) = f (z). (15)


r→∞

Let us consider the subsequence {xrj } converging to the limit point z . Since the number of blocks
is finite, there exists a block which is updated infinitely often in the subsequence {rj }. Without loss of
generality, we assume that block n is updated infinitely often. Thus, by further restricting to a subsequence,
we can write
xrnj = arg min un (xn , xrj −1 ).
xn
r +1
Now we prove that xrj +1 → z , in other words, we will show that x1j → z1 . Assume the contrary that
r +1
x1j does not converge to z1 . Therefore by further restricting to a subsequence, there exists γ̄ > 0 such
that
r +1 r
γ̄ ≤ γ rj = kx1j − x1j k, ∀ rj .
r r +1
Let us normalize the difference between x1j and x1j , i.e.,
r +1 r
x1j − x1j
s rj , .
γ rj
Notice that ksrj k = 1, thus srj belongs to a compact set and it has a limit point s̄. By further restricting
to a subsequence that converges to s̄, using (B1) and (B2), we obtain
r +1
f (xrj +1 ) ≤ u1 (x1j , xr j ) (16)
r
= u1 (x1j + γ rj srj , xrj ) (17)
r
≤ u1 (x1j + ǫγ̄srj , xrj ), ∀ ǫ ∈ [0, 1] (18)
r
≤ u1 (x1j , xrj ) (19)

= f (xrj ), (20)

where (16) and (20) hold due to (B1) and (B2). The inequalities (18) and (19) are the result of quasi-
convexity of u(·, xrj ). Letting j → ∞ and combining (16), (18), (15), and (20) imply

f (z) ≤ u1 (z1 + ǫγ̄s̄, z) ≤ f (z), ∀ ǫ ∈ [0, 1],

or equivalently
f (z) = u1 (z1 + ǫγ̄s̄, z), ∀ ǫ ∈ [0, 1]. (21)

September 12, 2012 DRAFT


12

Furthermore,
r
u1 (x1j+1 , xrj+1 ) = f (xrj+1 ) ≤ f (xrj +1 )
r +1
≤ u1 (x1j , xrj ) ≤ u1 (x1 , xrj ), ∀ x1 ∈ X 1 .

Letting j → ∞, we obtain
u1 (z1 , z) ≤ u1 (x1 , z), ∀ x1 ∈ X 1 ,

which further implies that z1 is the minimizer of u1 (·, z). On the other hand, we assume that the minimizer
is unique, which contradicts (21). Therefore, the contrary assumption is not true, i.e., xrj +1 → z .

r +1
Since x1j = arg minx1 ∈X1 u1 (x1 , xrj ), we get
r +1
u1 (x1j , xrj ) ≤ u1 (x1 , xrj ) ∀ x1 ∈ X1 .

Taking the limit j → ∞ implies

u1 (z1 , z) ≤ u1 (x1 , z) ∀ x1 ∈ X 1 ,

which further implies

u′1 (x1 , z; d1 ) ≥ 0, ∀ d1 ∈ Rm1 with z1 + d1 ∈ X1 .


x1 =z1
Similarly, by repeating the above argument for the other blocks, we obtain

u′k (xk , z; dk ) ≥ 0, ∀ dk ∈ Rmk with dk + zk ∈ Xk , ∀ k = 1, . . . , n. (22)


xk =zk
Combining (B3) and (22) implies

f ′ (z; d) ≥ 0, ∀ d = (0, . . . , dk , . . . , 0) s.t. d + z ∈ X, ∀ k

in other words, z is the coordinatewise minimum of f (·).

Now we prove part (b) of the theorem. Without loss of generality, let us assume that (13) has a unique
solution at every point xr−1 for i = 1, 2, . . . , n − 1. Since the iterates lie in a compact set, we only need
to show that every limit point of the iterates is a stationary point of f (·). To do so, let us consider a
subsequence {xrj } which converges to a limit point z ∈ X 0 ⊆ X . Since the number of blocks is finite,
there exists a block i which is updated infinitely often in the subsequence {xrj }. By further restricting
to a subsequence, we can assume that
r
xi j ∈ arg min ui (xi , xrj −1 ).
xi

September 12, 2012 DRAFT


13

Since all the iterates lie in a compact set, we can further restrict to a subsequence such that

lim xrj −i+k = z k , ∀ k = 0, 1, . . . , n,


j→∞

where z k ∈ X 0 ⊆ X and z i = z . Moreover, due to the update rule in the algorithm, we have
r −i+k
uk (xkj , xrj −i+k−1 ) ≤ uk (xk , xrj −i+k−1 ), ∀ xk ∈ X k , k = 1, 2, . . . , n.

Taking the limit j → ∞, we obtain

uk (zkk , z k−1 ) ≤ uk (xk , z k−1 ), ∀ xk ∈ X k , k = 1, 2, . . . , n. (23)

Combining (23), (B1) and (B2) implies

f (z k ) ≤ uk (zkk , z k−1 ) ≤ uk (zkk−1 , z k−1 ) = f (z k−1 ), k = 1, . . . , n. (24)

On the other hand, the objective function is non-increasing in the algorithm and it has a limit. Thus,

f (z 0 ) = f (z 1 ) = . . . = f (z n ). (25)

Using (24), (25), and (23), we obtain

f (z) = uk (zkk , z k−1 ) ≤ uk (xk , z k−1 ), ∀ xk ∈ Xk , k = 1, 2, . . . , n. (26)

Furthermore, f (z) = f (z k−1 ) = uk (zkk−1 , z k−1 ) and therefore,

uk (zkk−1 , z k−1 ) ≤ uk (xk , z k−1 ), ∀ xk ∈ Xk , k = 1, 2, . . . , n. (27)

The inequalities (26) and (27) imply that zkk−1 and zkk are both the minimizer of uk (·, z k−1 ). However,
according to our assumption, the minimizer is unique for k = 1, 2, . . . , n − 1 and therefore,

z 0 = z 1 = z 2 = . . . = z n−1 = z

Plugging the above relation in (23) implies

uk (zk , z) ≤ uk (xk , z), ∀ xk ∈ Xk , k = 1, 2, . . . , n − 1. (28)

Moreover, by setting k = n in (27), we obtain

un (zn , z) ≤ un (xn , z), ∀ xn ∈ X n . (29)

The inequalities (28) and (29) imply that

u′k (xk , z; dk ) ≥ 0, ∀ dk ∈ Rmk with zk + dk ∈ Xk , k = 1, 2, . . . , n.


xk =zk

September 12, 2012 DRAFT


14

Combining this with (B3) yields

f ′ (z; d) ≥ 0, ∀ d = (0, . . . , dk , . . . , 0) with zk + dk ∈ Xk , k = 1, 2, . . . , n,

which implies the stationarity of the point z due to the regularity of f (·).
The above result extends the existing result of block coordinate descent method [2] and [30] to the
BSUM case where only an approximation of the objective function is minimized at each iteration. As
we will see in Section VIII, our result implies the global convergence of several existing algorithms
including the EM algorithm or the DC method when the Gauss-Seidel update rule is used.

V. T HE M AXIMUM I MPROVEMENT S UCCESSIVE U PPER - BOUND M INIMIZATION A LGORITHM

A key assumption for the BSUM algorithm is the uniqueness of the minimizer of the subproblem. This
assumption is necessary even for the simple BCD method [2]. In general, by removing such assumption,
the convergence is not guaranteed (see [24] for examples) unless we assume pseudo convexity in pairs
of the variables [34], [30]. In this section, we explore the possibility of removing such uniqueness
assumption.
Recently, Chen et al. [1] have proposed a related Maximum Block Improvement (MBI) algorithm,
which differs from the conventional BCD algorithm only by its update schedule. More specifically, only
the block that provides the maximum improvement is updated at each step. Remarkably, by utilizing
such modified updating rule (which is similar to the well known Gauss-Southwell update rule), the
per-block subproblems are allowed to have multiple solutions. Inspired by this recent development, we
propose to modify the BSUM algorithm similarly by simply updating the block that gives the maximum
improvement. We name the resulting algorithm the Maximum Improvement Successive Upper-bound
Minimization (MISUM) algorithm, and list its main steps in Fig. 3.
Clearly the MISUM algorithm is more general than the MBI method proposed in [1], since only an
approximate version of the subproblem is solved at each iteration. Theorem 3 states the convergence
result for the proposed MISUM algorithm.

Theorem 3 Suppose that Assumption 2 is satisfied. Then, every limit point z of the iterates generated
by the MISUM algorithm is a coordinatewise minimum of (12). In addition, if f (·) is regular at z , then
z is a stationary point of (12).

September 12, 2012 DRAFT


15

1 Find a feasible point x0 ∈ X and set r = 0

2 repeat
3 r =r+1
4 Let k = arg mini minxi ui (xi , xr−1 )
5 Let X r = arg minxk ∈Xk uk (xk , xr−1 )
6 Set xrk to be an arbitrary element in X r
7 Set xri = xir−1 , ∀ i 6= k

8 until some convergence criterion is met

Fig. 3: Pseudo code of the MISUM algorithm

Proof: Let us define Ri (y) to be the minimum objective value of the i-th subproblem at a point y ,
i.e.,
Ri (y) , min ui (xi , y).
xi

Using a similar argument as in Theorem 2, we can show that the sequence of the objective function
values are non-increasing, that is

f (xr ) = ui (xri , xr ) ≥ Ri (xr ) ≥ f (xr+1 ).

Let {xrj } be the subsequence converging to a limit point z . For every fixed block index i = 1, 2, . . . , n
and every xi ∈ Xi , we have the following series of inequalities

ui (xi , xrj ) ≥ Ri (xrj )


r +1
≥ uk (xkj , xr j )

≥ f (xrj +1 )

≥ f (xrj+1 )
r
= ui (xi j+1 , xrj+1 ),

where we use k to index the block that provides the maximum improvement at iteration rj + 1. The
first and the second inequalities are due to the definition of the function Ri (·) and the MISUM update

September 12, 2012 DRAFT


16

rule, respectively. The third inequality is implied by the upper bound assumption (B2), while the last
inequality is due to the non-increasing property of the objective values.
Letting j → ∞, we obtain

ui (xi , z) ≥ ui (zi , z), ∀ xi ∈ X i , i = 1, 2, . . . , n.

The first order optimality condition implies

u′i (xi , z; di ) ≥ 0, ∀ di with zi + di ∈ Xi , ∀ i = 1, 2, . . . , n.


xi =zi

Combining this with (B3) yields

f ′ (z; d) ≥ 0, ∀ d = (0, . . . , di , . . . , 0) with zi + di ∈ Xi , i = 1, 2, . . . , n.

In other words, z is the coordinatewise minimum of f (·).


The main advantage of the MISUM algorithm over the BSUM algorithm is that its convergence does
not rely on the uniqueness of the minimizer for the subproblems. On the other hand, each iteration of
MISUM algorithm is more expensive than the BSUM since the minimization needs to be performed for
all the blocks. Nevertheless, the MISUM algorithm is more suitable when parallel processing units are
available, since the minimizations with respect to all the blocks can be carried out simultaneously.

VI. S UCCESSIVE C ONVEX A PPROXIMATION OF A S MOOTH F UNCTION

In the previous sections, we have demonstrated that the stationary solutions of the problems (2) and (12)
can be obtained by successively minimizing a sequence of upper-bounds of f (·). However, in practice,
unless the objective f (·) possesses certain convexity/concavity structure, those upper-bounds may not be
easily identifiable. In this section, we extend the BSUM algorithm by further relaxing the requirement
that the approximation functions {ui (xi , y)} must be the global upper-bounds of the original objective
f.
Throughout this section, we use hi (., .) to denote the convex approximation function for the ith block.
Suppose that hi (xi , x) is no longer a global upper-bound of f (x), but only a first order approximation
of f (x) at each point, i.e.,

h′i (yi , x; di ) = f ′ (x; d), ∀ d = (0, . . . , di , . . . , 0) with xi + di ∈ Xi . (30)


yi =xi

In this case, simply optimizing the approximate functions in each step may not even decrease the objective
function. Nevertheless, the minimizer obtained in each step can still be used to construct a good search

September 12, 2012 DRAFT


17

1 Find a feasible point x0 ∈ X and set r = 0

2 repeat
3 r = r + 1, i = (r mod n) + 1
4 Let X r = arg minxi ∈Xi hi (xi , xr−1 )
5 Set yir to be an arbitrary element in X r and set ykr = xrk , ∀ k 6= i
6 Set dr = y r − xr and choose σ ∈ (0, 1)
7 Armijo stepsize rule: Choose αinit > 0 and β ∈ (0, 1). Let αr be the largest
element in {αinit β j }j=0,1,... satisfying:
f (xr ) − f (xr + αr dr ) ≥ −σαr f ′ (xr ; dr )
8 Set xr = xr−1 + αr (y r − xr−1 )

9 until some convergence criterion is met

Fig. 4: Pseudo code of the BSCA algorithm

direction, which, when combined with a proper step size selection rule, can yield a sufficient decrease
of the objective value.
Suppose that at iteration r , the i-th block needs to be updated. Let yir ∈ minyi ∈Xi hi (yi , xr−1 ) denote
the optimal solution for optimizing the i-th approximation function at the point xr−1 . We propose to use
yir − xir−1 as the search direction, and adopt the Armijo rule to guide the step size selection process. We
name the resulting algorithm the Block Successive Convex Approximation (BSCA) algorithm. Its main
steps are given in Figure 4.
Note that for dr = (0, . . . , dri , . . . , 0) with dri = yir − xri , we have
hi (xri + λdri , xr ) − hi (xri , xr )
f ′ (xr ; dr ) = h′i (xi , xr ; dri ) = lim ≤ 0, (31)
xi =xri λ↓0 λ
where the inequality is due to the fact that hi (·) is convex and yir = xri +dri is the minimizer at iteration r .
Moreover, there holds

f (xr ) − f (xr + αdr ) = −αf ′ (xr ; dr ) + o(α), ∀ α > 0.

Hence the Armijo step size selection rule in Figure 4 is well defined when f ′ (xr ; dr ) 6= 0, and there
exists j ∈ {0, 1, . . .} such that for αr = αinit β j ,

f (xr ) − f (xr + αr dr ) ≥ −σαr f ′ (xr ; dr ). (32)

September 12, 2012 DRAFT


18

The following theorem states the convergence result of the proposed algorithm.

Theorem 4 Suppose that f (·) is continuously differentiable and that Assumption (30) holds. Furthermore,
assume that h(x, y) is strictly convex in x and continuous in (x, y). Then every limit point of the iterates
generated by the BSCA algorithm is a stationary point of (2).

Proof: First of all, due to the use of Armijo step size selection rule, we have

f (xr ) − f (xr+1 ) ≥ −σαr f ′ (xr ; dr ) ≥ 0,

which implies
lim αr f ′ (xr ; dr ) = 0. (33)
r→∞

Consider a limit point z and a subsequence {xrj }j converging to z . Since {f (xr )} is a monotonically
decreasing sequence, it follows that
lim f (xr ) = f (z).
r→∞

By further restricting to a subsequence if necessary, we can assume without loss of generality that in
the subsequence {xrj }j the first block is updated. We first claim that we can further restrict to a further
subsequence such that
lim drj = 0. (34)
j→∞

We prove this by contradiction. Let us assume the contrary so that there exists δ, 0 < δ < 1 and
ℓ ∈ {1, 2, . . .}
kdrj k ≥ δ, ∀ j ≥ ℓ. (35)

drj
Defining prj = kdrj k , the equation (33) implies αrj kdrj kf ′ (xrj ; prj ) → 0. Thus, we have the following
two cases:
Case A: f ′ (xrj ; prj ) → 0 along a subsequence of {xrj }. Let us restrict ourselves to that subsequence.
Since kprj k = 1, there exists a limit point p̄. By further restricting to a subsequence and using the
smoothness of f (·), we obtain
f ′ (z; p̄) = 0. (36)

Furthermore, due to the strict convexity of h1 (·, z),

h1 (z1 + δp̄1 , z) > h1 (z1 , z) + δh′1 (x1 , z; p̄1 ) ≥ h1 (z1 , z), (37)
x1 =z1

September 12, 2012 DRAFT


19

where p̄1 is the first block of p̄ and the last step is due to (36) and (30). On the other hand, since
r r r r
x1j + δp1j lies between x1j and y1j , we have
r r r
h1 (x1j + δp1j , xrj ) ≤ h1 (x1j , xrj ).

Letting j → ∞ along the subsequence, we obtain

h1 (z1 + δp̄1 , z) ≤ h1 (z1 , z), (38)

which contradicts (37).


Case B: αrj kdrj k → 0 along a subsequence. Let us restrict ourselves to that subsequence. Due to the
contrary assumption (35),
lim αrj = 0,
j→∞

which further implies that there exists j0 ∈ {1, 2, . . .} such that


αrj rj αrj ′ rj rj
f (xrj + d ) − f (xrj ) > σ f (x ; d ), ∀ j ≥ j0 .
β β
Rearranging the terms, we obtain
αrj
f (xrj + rj
β kd kp ) −
rj f (xrj )
αrj
> σf ′ (xrj ; prj ), ∀ j ≤ j0 .
rj
β kd k

Letting j → ∞ along the subsequence that prj → p̄, we obtain

f ′ (z; p̄) ≥ σf ′ (z; p̄),

which implies f (z; p̄) ≥ 0 since σ < 1. Therefore, using an argument similar to the previous case, (37)
and (38) hold, which is a contradiction. Thus, the assumption (35) must be false and the condition (34)
r
must hold. On the other hand, y1j is the minimizer of h1 (·, xrj ); thus,
r
h1 (y1j , xrj ) ≤ h1 (x1 , xrj ), ∀ x1 ∈ X 1 . (39)
r r r
Note that y1j = x1j + d1j . Combining (34) and (39) and letting j → ∞ yield

h1 (z1 , z) ≤ h1 (x1 , z), ∀ x1 ∈ X 1 .

The first order optimality condition and assumption (30) imply

f ′ (z; d) ≥ 0, ∀ d = (d1 , 0, . . . , 0) with z1 + d1 ∈ X1 .

On the other hand, since drj → 0, it follows that

lim xrj +1 = z.
j→∞

September 12, 2012 DRAFT


20

Therefore, by restricting ourselves to the subsequence that drj → 0 and repeating the above argument n
times, we obtain

f ′ (z; d) ≥ 0, ∀ d = (0, . . . , dk , . . . , 0) with zk + dk ∈ Xk ; k = 1, . . . , n.

Using the regularity of f (·) at point z completes the proof.


We remark that the proposed BSCA method is related to the coordinate gradient descent method
[31], in which a strictly convex second order approximation of the objective function is minimized at
each iteration. It is important to note that the convergence results of these two algorithm do not imply
each other. The BSCA algorithm, although more general in the sense that the approximation function
could take the form of any strictly convex function that satisfies (30), only covers the case when the
objective function is smooth. Nevertheless, the freedom provided by the BSCA to choose a more general
approximation function allows one to better approximate the original function at each iteration.

VII. OVERLAPPING E SSENTIALLY C YCLIC RULE

In both the BSUM and the BSCA algorithms considered in the previous sections, variable blocks are
updated in a simple cyclic manner. In this section, we consider a very general block scheduling rule
named the overlapping essentially cyclic rule and show they still ensure the convergence of the BSUM
and the BSCA algorithms.
In the so called overlapping essentially cyclic rule, at each iteration r , a group ϑr of the variables is
chosen to be updated where
ϑr ⊆ {1, 2, . . . , n} and ϑr 6= ∅.

Furthermore, we assume that the update rule is essentially cyclic with period T , i.e.,
T
[
ϑr+i = {1, 2, . . . , n}, ∀ r.
i=1

Notice that this update rule is more general than the essentially cyclic rule since the blocks are allowed
to have overlaps. Using the overlapping essentially cyclic update rule, almost all the convergence results
presented so far still hold. For example, the following corollary extends the convergence of BSUM to
the overlapping essentially cyclic case.

Corollary 2
(a) Assume that the function ui (xi , y) is quasi-convex in xi and Assumption 2 is satisfied. Furthermore,
assume that the overlapping essentially cyclic update rule is used and the subproblem (13) has a

September 12, 2012 DRAFT


21

unique solution for every block ϑr . Then, every limit point z of the iterates generated by the BSUM
algorithm is a coordinatewise minimum of (12). In addition, if f (·) is regular at z with respect to
the updated blocks, then z is a stationary point of (12).
(b) Assume the level set X 0 = {x | f (x) ≤ f (x0 )} is compact and Assumption 2 is satisfied. Further-
more, assume that the overlapping essentially cyclic update rule is used and the subproblem (13)
has a unique solution for every block ϑr . If f (·) is regular (with respect to the updated blocks) at
every point in the set of stationary points X ∗ , then the iterates generated by the BSUM algorithm
converges to the set of stationary points, i.e.,

lim d(xr , X ∗ ) = 0.
r→∞

Proof: The proof of both cases are similar to the proof of the BSUM algorithm with the simple
cyclic update rule. Here we only present the proof for case (a). The proof of part (b) is similar.
Let {xrj } be a convergent subsequence whose limit is denoted by z . Consider every T updating
cycle along the subsequence {xrj }, namely, {(xrj , xrj +1 , . . . , xrj +T −1 )}. Since the number of different
subblocks ϑr is finite, there must exist a (fixed) T tuple of variable blocks, say (ϑ0 , ϑ1 , . . . , ϑT −1 ), that
has been updated in infinitely many T updating cycles. By restricting to the corresponding subsequence
of {xrj }, we have
r +i+1
xϑji = arg min uϑi (xϑi , xrj +i ), ∀ i = 0, 1, 2, . . . , T − 1.
xϑ i

The rest of the proof is the same as the proof of part (a) in Theorem 2. The only difference is that the
steps of the proof need to be repeated for the blocks (ϑ0 , ϑ1 , . . . , ϑT −1 ) instead of (1, . . . , n).
In the proof of Corollary 2, we first restrict ourselves to a fixed set of T variable blocks that have
been updated in infinitely many consecutive T update cycles. Then, we use the same approach as in the
proof of the convergence of cyclic update rule. Using the same technique, we can extend the results in
Theorem 4 to the overlapping essentially cyclic update rule. More specifically, we have the following
corollary.

Corollary 3 Assume f (·) is smooth and the condition (30) is satisfied. Furthermore, assume that h(x, y)
is strictly convex in x and the overlapping essentially cyclic update rule is used in the BSCA algorithm.
Then every limit point of the iterates generated by the BSCA algorithm is a stationary point of (2).

Notice that the overlapping essentially cyclic rule is not applicable to the MISUM algorithm in which
the update order of the variables is given by the amount of improvement. However, one can simply check
that the proof of Theorem 3 still applies to the case when the blocks are allowed to have overlaps.

September 12, 2012 DRAFT


22

VIII. A PPLICATIONS

In this section, we provide several applications of the algorithms proposed in the previous sections.

A. Linear Transceiver Design in Cellular Networks

Consider a K -cell wireless network where each base station k serves a set Ik of users (see Fig. 5 for
an illustration). Let ik denote the i-th receiver in cell k. For simplicity, suppose that the users and the
base stations are all equipped with N antennas. Let us define the set of all users as I = {ik | 1 ≤ k ≤
K, i ∈ Ik }. Let dik denote the number of data symbols transmitted simultaneously to user ik .

Fig. 5: The cellular network model considered in Section VIII-A. The solid lines represent the direct channels,
while the dotted lines represent the interfering channels.

When linear transceivers are used at the base stations and the users, user ik ’s received signal vector,
denoted as yik ∈ CN , can be written as
Ik K Ij
X X X
yik = Hik k Vik sik + Hik k Vℓk sℓk + Hik j Vℓj sℓj + nik , ∀ ik ∈ I,
| {z }
desired signal ℓ6=i,ℓ=1 j6=k,j=1 ℓ=1
| {z } | {z }
intracell interference intercell interference plus noise

where Vik ∈ CM ×dik is the linear transmit beamformer used by base station k for user ik ; sik ∈ Cdik ×1
is user ik ’s data signal. The matrix Hik j represents the channel from transmitter j to receiver ik , and nik
denotes the complex additive white Gaussian noise with distribution CN (0, σi2k I). User ik estimates the
intended message using a linear beamformer Uik ∈ CM ×dik : ŝik = UH
ik yik .
Treating interference as noise, the rate of user ik is given by
  X −1 
Rik (V) = log det I + Hik k Vik (Vik ) Hik k σi2k I +
H H
Hik j Vℓj (Vℓj ) H
HH
ik j .
(j,ℓ)6=(k,i)

September 12, 2012 DRAFT


23

We are interested in finding the beamformers V such that the sum of the users’ rates are optimized
K X
X Ik
max Rik (V)
{Vik }ik ∈I
k=1 i=1
(40)
Ik
X
s.t. Tr(Vik VH
ik ) ≤ P̄k , ∀ k ∈ K.
i=1
Note that we have included a transmit power constraint for each base station. It has been shown in
[20] that solving (40) is NP-hard. Therefore, we try to obtain the stationary solution for this problem.
Furthermore, we can no longer straightforwardly apply the BSUM algorithm that updates Vik ’s cyclically.
This is due to the fact that the users in the set Ik share a common power constraint. Thus the requirement
for the separability of the constraints for different block components in (12) is not satisfied.
To devise an efficient and low complexity algorithm for problem (40), we will first transform this

problem to a more suitable form. We first introduce the function fik (Uik , V) , log det E−1
ik , where
Eik is the mean square error (MSE) matrix given as
X
H H 2
Eik , (I − UH H H
ik Hik k Vik )(I − Uik Hik k Vik ) + UH H
ik Hik j Vℓj Vℓj Hik j Uik + σik Uik Uik .
(ℓ,j)6=(i,k)

In the subsequent presentation we will occasionally use the notation Eik (Uik , V) to make the dependency
of the MSE matrix and the transceivers explicit.
Taking the derivative of fik (Uik , V) with respect to Uik and checking the first order optimality
condition, we have
 X −1
arg max fik (Uik , V) = σi2k I + Hik j Vℓj VH H
ℓj Hik j Hik k .Vik
Uik
(j,ℓ)

Plugging in the optimal value of Uik in fik (·), we obtain maxUik fik (Uik , V) = Rik (V). Thus, we can
rewrite the optimization problem equivalently (40) as2
K X
X Ik
min log det(Eik )
V,U
k=1 i=1
(41)
Ik
X
s.t. Tr (Vik VH
ik ) ≤ Pk , ∀ k ∈ K.
i=1
Notice the fact that the function log det(·) is a concave function on its argument (see, e.g., [5]), then
b ik , we have
for any feasible Eik , E

b ik ) + Tr[▽Ei (log det(E


log det(Eik ) ≤ log det(E b ik ))(Eik − E
b ik )]
k

b ik ) + Tr[E
= log det(E b −1 (Eik − E
b ik )] , uik (Eik , E
b ik ) (42)
ik

2
Such equivalence is in the sense of one-to-one correspondence of both local and global optimal solutions. See [28] for a
detailed argument.

September 12, 2012 DRAFT


24

Utilizing the above transformation and the upper bound, we can again apply the BSUM algorithm. Let
V and U be two block variables. Define
  X K X
Ik
b b
uv V, (V, U) , b ik ), Eik (V
uik (Eik (V, U b ik ))
b, U
k=1 i=1
  XK X
Ik
b b
uu U, (V, U) , b , Uik ), E(V
uik (Eik (V b ik ))
b, U
k=1 i=1

In iteration 2r + 1, the algorithm solves the following problem



min uv V, (V2r , U2r )
V
Ik
X (43)
Tr (Vik VH
ik ) ≤ Pk , ∀ k ∈ K.
i=1

In iteration 2r + 2, the algorithm solves the following (unconstrained) problem



min uu U, (V2r+1 , U2r ) (44)
U

The above BSUM algorithm for solving (40) is called WMMSE algorithm in the reference [28].
Due to (42), we must have that
K X
Ik
 X
uv V, (V2r , U2r ) ≥ log det(Eik (V, U2r
ik )), for all feasible V, ∀ ik
k=1 i=1
K X
Ik
 X
uu U, (V2r+1 , U2r ) ≥ log det(Eik (V2r+1 , Uik )) for all Uik , ∀ ik
k=1 i=1

Moreover, other conditions in Assumption 2 are also satisfied for uv (·) and uu (·). Thus the convergence
of the WMMSE algorithm to a stationary solution of problem (41) follows directly from Theorem 2.
We briefly mention here that the main benefit of using the BSUM approach for solving problem (41)
is that in each step, the problem (43) can be decomposed into K independent convex subproblems, one
for each base station k ∈ K. Moreover, the solutions for these K subproblems can be simply obtained
in closed form (subject to an efficient bisection search). For more details on this algorithm, we refer the
readers to [28] and [25].
The BSUM approach has been extensively used for resource allocation in wireless networks, for
example [8], [16], [18], [23], [27], and [26]. However, the convergence of most of the algorithms was
not rigorously established.

September 12, 2012 DRAFT


25

B. Proximal Minimization Algorithm

The classical proximal minimization algorithm (see, e.g., [3, Section 3.4.3]) obtains a solution of the
problem minx∈X f (x) by solving an equivalent problem
1
min f (x) + kx − yk22 , (45)
x∈X ,y∈X 2c
where f (·) is a convex function, X is a closed convex set, and c > 0 is a scalar parameter. The equivalent
problem (45) is attractive in that it is strongly convex in both x and y (but not jointly) so long as f (x)
is convex. This problem can be solved by performing the following two steps in an alternating fashion
 
r+1 1 r 2
x = arg min f (x) + kx − y k2 (46)
x∈X 2c

yr+1 = xr+1 . (47)

1
Equivalently, let u(x; xr ) , f (x) + 2c kx − xr k22 , then the iteration (46)–(47) can be written as

xr+1 = arg min u(x, xr ). (48)


x∈X

It can be straightforwardly checked that for all x, xr ∈ X , the function u(x, xr ) serves as an upper
bound for the function f (x). Moreover, the conditions listed in Assumption 1 are all satisfied. Clearly, the
iteration (48) corresponds to the SUM algorithm discussed in Section III. Consequently, the convergence
of the proximal minimization procedure can be obtained from Theorem 1.
The proximal minimization algorithm can be generalized in the following way. Consider the problem

min f (x1 , · · · , xn ) (49)


x

s.t. xi ∈ Xi , i = 1, · · · , n,

where {Xi }ni=1 are closed convex sets, f (·) is convex in each of its block components, but not necessarily
strictly convex. A straightforward application of the BCD procedure may fail to find a stationary solution
for this problem, as the per-block subproblems may contain multiple solutions. Alternatively, we can
consider an alternating proximal minimization algorithm [12], in each iteration of which the following
subproblem is solved
1
min f (xr1 , . . . , xri−1 , xi , xri+1 , . . . , xrn ) + kxi − xri k22 (50)
xi 2c
s.t. xi ∈ Xi .

It is not hard to see that this subproblem always admits a unique solution, as the objective is a strictly
1
convex function of xi . Let ui (xi , xr ) , f (xr1 , · · · , xi , · · · xrn ) + 2c kxi − xri k22 . Again for each xi ∈ Xi

September 12, 2012 DRAFT


26

Q
and xr ∈ j Xj , the function ui (xi , xr ) is an upper bound of the original objective f (x). Moreover,
all the conditions in Assumption 2 are satisfied. Utilizing Theorem 2, we conclude that the alternating
proximal minimization algorithm must converge to a stationary solution of the problem (49). Moreover,
our result extends those in [12] to the case of nonsmooth objective function as well as the case with
iteration-dependent coefficient c. The latter case, which was also studied in the contemporary work [32],
will be demonstrated in an example for tensor decomposition shortly.

C. Proximal Splitting Algorithm

The proximal splitting algorithm (see, e.g., [9]) for nonsmooth optimization is also a special case of
the BSUM algorithm. Consider the following problem

min f1 (x) + f2 (x) (51)


x∈X

where X is a closed and convex set. Furthermore, f1 is convex and lower semicontinuous; f2 is convex
and has Lipschitz continuous gradient, i.e., k∇f2 (x) − ∇f2 (y)k ≤ βkx − yk, ∀ x, y ∈ X and for some
β > 0.
Define the proximity operator proxfi : X → X as
1
proxfi (x) = arg min fi (y) + kx − yk2 . (52)
y∈X 2
The following forward-backward splitting iteration can be used to obtain a solution for problem (51) [9]:

xr+1 = proxγf (xr − γ∇f2 (xr )) (53)


| {z }1 | {z }
backward step forward step

where γ ∈ [ǫ, 2/β − ǫ] with ǫ ∈]0, min{1, 1/β}[. Define


1
u(x, xr ) , f1 (x) + kx − xr k2 + hx − xr , ∇f2 (xr )i + f2 (xr ). (54)

We first show that the iteration (53) is equivalent to the following iteration

xr+1 = arg min u(x, xr ). (55)


x∈X

From the definition of the prox operation, we have


1
proxγf1 (xr − γ∇f2 (xr )) = arg min γf1 (x) + kx − xr + γ∇f2 (xr )k22
x∈X 2
1
= arg min f1 (x) + kx − xr k22 + hx − xr , ∇f2 (xr )i
x∈X 2γ
= arg min u(x, xr ).
x∈X

September 12, 2012 DRAFT


27

We then show that u(x, xr ) is an upper bound of the original function f1 (x)+f2 (x), for all x, xr ∈ X .
Note that from the well known Descent Lemma [2, Proposition A.32], we have that
β
f2 (x) ≤ f2 (xr ) + kx − xr k2 + hx − xr , ∇f2 (xr )i
2
1
≤ f2 (xr ) + kx − xr k2 + hx − xr , ∇f2 (xr )i

where the second inequality is from the definition of γ . This result implies that u(x, y) ≥ f1 (x) +
f2 (x), ∀ x, y ∈ X . Moreover, we can again verify that all the other conditions in Assumption 1 is true.
Consequently, we conclude that the forward-backward splitting algorithm is a special case of the SUM
algorithm.
Similar to the previous example, we can generalize the forward-backward splitting algorithm to the
problem with multiple block components. Consider the following problem
n
X
min fi (xi ) + fn+1 (x1 , · · · , xn ) (56)
i=1

s.t. xi ∈ Xi , i = 1, · · · , n

where {Xi }ni=1 are a closed and convex sets. Each function fi (·), i = 1, · · · n is convex and lower
semicontinuous w.r.t. xi ; fn+1 (·) is convex and has Lipschitz continuous gradient w.r.t. each of the
component xi , i.e., k∇fn+1 (x) − ∇fn+1 (y)k ≤ βi kxi − yi k, ∀ xi , yi ∈ Xi , i = 1, · · · , n. Then the
following block forward-backward splitting algorithm can be shown as a special case of the BSUM
algorithm, and consequently converges to a stationary solution of the problem (56)

xr+1
i = proxγfi (xri − γ∇xi fn+1 (xr )), i = 1, 2, ..., n,

where γ ∈ [ǫi , 2/βi − ǫi ] with ǫi ∈]0, min{1, 1/βi }[.

D. CANDECOMP/PARAFAC Decomposition of Tensors

Another application of the proposed method is in CANDECOMP/PARAFAC (CP) decomposition of


tensors. Given a tensor X ∈ Rm1 ×m2 ×...×mn of order n, the idea of CP decomposition is to write the
tensor as the sum of rank-one tensors:
R
X
X= Xr ,
r=1

where Xr = a1r ◦ a2r ◦ . . . ◦ anr and air ∈ Rmi . Here the notation “ ◦ ” denotes the outer product.

September 12, 2012 DRAFT


28

In general, finding the CP decomposition of a given tensor is NP-hard [15]. In practice, one of the
most widely accepted algorithms for computing the CP decomposition of a tensor is the Alternating
Least Squares (ALS) algorithm [11], [19], [29]. The ALS algorithm proposed in [6], [13] is in essence a
BCD method. For ease of presentation, we will present the ALS algorithm only for tensors of order three.

Let X ∈ RI×J×K be a third order tensor. Let (A; B; C) represent the following decomposition
R
X
(A; B; C) , ar ◦ br ◦ cr ,
r=1
where ar (resp. br and cr ) is the r -th column of A (resp. B and C ). The ALS algorithm minimizes the
difference between the original and the reconstructed tensors

min kX − (A; B; C)k, (57)


A,B,C

where A ∈ RI×R , B ∈ RJ×R , C ∈ RK×R , and R is the rank of the tensor.

The ALS approach is a special case of the BCD algorithm in which the three blocks of variables A, B,
and C are cyclically updated. In each step of the computation when two blocks of variables are held
fixed, the subproblem becomes the quadratic least squares problem and admits closed form updates (see
[19]).

One of the well-known drawbacks of the ALS algorithm is the swamp effect where the objective
value remains almost constant for many iterations before starting to decrease again. Navasca et al. in
[22] observed that adding a proximal term in the algorithm could help reducing the swamp effect. More
specifically, at each iteration r the algorithm proposed in [22] solves the following problem for updating
the variables:
kX − (A; B; C)k2 + λkA − Ar k2 + λkB − B r k2 + λkC − C r k2 , (58)

where λ ∈ R is a positive constant. As discussed before, this proximal term has been considered in
different optimization contexts and its convergence has been already showed in [12]. An interesting
numerical observation in [22] is that decreasing the value of λ during the algorithm can noticeably
improve the convergence of the algorithm. Such iterative decrease of λ can be accomplished in a number
of different ways. Our numerical experiments show that the following simple approach to update λ can
significantly improve the convergence of the ALS algorithm and substantially reduce the swamp effect:
kX − (Ar ; B r ; C r )k
λr = λ0 + λ1 , (59)
kXk

September 12, 2012 DRAFT


29

where λr is the proximal coefficient λ at iteration r . Theorem 2 implies the convergence is guaranteed
even with this update rule of λ, whereas the convergence result of [12] does not apply in this case since
the proximal coefficient is changing during the iterations.

Figure 6 shows the performance of different algorithms for the example given in [22] where the
tensor X is obtained from the decomposition
 √   
  3 2 cos θ 0 1 0 0
1 cos θ 0    
   
A= , B= 0 sin θ 1 , C =  0 1 0 .
0 sin θ 1    
0 sin θ 0 0 0 1
The vertical axis is the value of the objective function where the horizontal axis is the iteration number.
In this plot, ALS is the classical alternating least squares algorithm. The curve for Constant Proximal
shows the performance of the BSUM algorithm when we use the objective function in (58) with λ = 0.1.
The curve for Diminishing Proximal shows the performance of block coordinate descent method on (58)
where the weight λ decreases iteratively according to (59) with λ0 = 10−7 , λ1 = 0.1. The other two
curves MBI and MISUM correspond to the maximum block improvement algorithm and the MISUM
algorithm. In the implementation of the MISUM algorithm, the proximal term is of the form in (58) and
the weight λ is updated based on (59).

1
10
ALS
0
Constant Proximal
10 Dimnishing Proximal
MBI
−1 MISUM
10

−2
10
Error

−3
10

−4
10

−5
10

−6
10
0 50 100 150 200 250 300 350 400
Iterates

Fig. 6: Convergence of Different Algorithms

Table I represents the average number of iterations required to get an objective value less than ǫ = 10−5
for different algorithms. The average is taken over 1000 Monte-Carlo runs over different initializations.

September 12, 2012 DRAFT


30

Algorithm Average number of iterations for convergence


ALS 277
Constant Proximal 140
Diminishing Proximal 78
MBI 572
MISUM 175

TABLE I: Average number of iterations for convergence

The initial points are generated randomly where the components of the variables A, B, and C are drawn
independently from the uniform distribution over the unit interval [0, 1]. As it can be seen, adding a
diminishing proximal term significantly improves the convergence speed of the ALS algorithm.

E. Expectation Maximization Algorithm

The expectation maximization algorithm (EM) in [10] is an iterative procedure for maximum likelihood
estimation when some of the random variables are unobserved/hidden. Let w be the observed random
vector which is used for estimating the value of θ . The maximum likelihood estimate of θ can be given
as
θ̂ML = arg max ln p(w|θ). (60)
θ

Let the random vector z be the hidden/unobserved variable. The EM algorithm starts from an initial
estimate θ 0 and generates a sequence {θ r } by repeating the following steps:

• E-Step: Calculate g(θ, θ r ) , Ez|w,θr {ln p(w, z|θ)}


• M-Step: θ r+1 = arg maxθ g(θ, θ r )

The EM-algorithm can be viewed as a special case of SUM algorithm [4]. In fact, we are interested in
solving the following optimization problem

min − ln p(w|θ).
θ

September 12, 2012 DRAFT


31

The objective function could be written as

− ln p(w|θ) = − ln Ez|θ p(w|z, θ)


 
p(z|w, θ r )p(w|z, θ)
= − ln Ez|θ
p(z|w, θ r )
 
p(z|θ)p(w|z, θ)
= − ln Ez|w,θr
p(z|w, θ r )
 
p(z|θ)p(w|z, θ)
≤ −Ez|w,θr ln
p(z|w, θ r )

= −Ez|w,θr ln p(w, z|θ) + Ez|w,θr ln p(z|w, θ r )

, u(θ, θ r ),

where the inequality is due to the Jensen’s inequality and the third equality follows from a simple change
of the order of integration for the expectation. Since Ez|w,θr ln p(z|w, θ r ) is not a function of θ , the
M-step in the EM-algorithm can be written as

θ r+1 = arg max u(θ, θ r ).


θ

Furthermore, it is not hard to see that u(θ r , θ r ) = − ln p(w|θ r ). Therefore, under the smoothness
assumption, Proposition 1 implies that Assumption 1 is satisfied. As an immediate consequence, the EM-
algorithm is a special case of the SUM algorithm. Therefore, our result implies not only the convergence
of the EM-algorithm, but also the convergence of the EM-algorithm with Gauss-Seidel/coordinatewise
update rule (under the assumptions of Theorem 2). In fact in the block coordinate EM-algorithm (BEM),
at each M-step, only one block is updated. More specifically, let θ = (θ1 , . . . , θn ) be the unknown
parameter. Assume w is the observed vector and z is the hidden/unobserved variable as before. The
BEM algorithm starts from an initial point θ 0 = (θ10, . . . , θn0 ) and generates a sequence {θ r } according
to the algorithm in Figure 7.
The motivation behind using the BEM algorithm instead of the EM algorithm could be the difficulties
in solving the M-step of EM for the entire set of variables, while solving the same problem per block of
variables is easy. To the best of our knowledge, the BEM algorithm and its convergence behavior have
not been analyzed before.

F. Concave-Convex Procedure/Difference of Convex Functions

A popular algorithm for solving unconstrained problems, which also belongs to the class of successive
upper-bound minimization, is the Concave-Convex Procedure (CCCP) introduced in [33]. In CCCP, also

September 12, 2012 DRAFT


32

1 Initialize with θ 0 and set r = 0

2 repeat
3 r = r + 1, i = r mod n + 1
4 r , θ , θ r , . . . , θ r )}
E-Step: gi (θi , θ r ) = Ez|w,θr {ln p(w, z|θ1r , . . . , θi−1 i i+1 n

5 M-Step: θir+1 = arg maxθi gi (θi , θ r )

6 until some convergence criterion is met

Fig. 7: Pseudo code of the BEM algorithm

known as the difference of convex functions (DC) programming, we consider the unconstrained problem

min f (x),
x∈Rm

where f (x) = fcve (x) + fcvx (x), ∀ x ∈ Rm ; where fcve (·) is a concave function and fcvx (·) is convex.
The CCCP generates a sequence {xr } by solving the following equation:

∇fcvx (xr+1 ) = −∇fcve (xr ),

which is equivalent to
xr+1 = arg min g(x, xr ), (61)
x

where g(x, xr ) , fcvx (x)+(x−xr )T ∇fcve (xr )+fcve (xr ). Clearly, g(x, xr ) is a tight convex upper-bound
of f (x) and hence CCCP is a special case of the SUM algorithm and its convergence is guaranteed by
Theorem 1 under certain assumptions. Furthermore, if the updates are done in a block coordinate manner,
the algorithm becomes a special case of BSUM whose convergence is guaranteed by Theorem 2. To the
best of our knowledge, the block coordinate version of CCCP algorithm and its convergence has not been
studied before.

R EFERENCES

[1] Z. L. B. Chen, Simai He and S. Zhang, “Maximum block improvement and polynomial optimization,” SIAM Journal on
Optimization, vol. 22, pp. 87–107, 2012.
[2] D. P. Bertsekas, Nonlinear Programming, 2nd ed. Athena-Scientific, 1999.

September 12, 2012 DRAFT


33

[3] D. P. Bertsekas and J. N. Tsitsiklis, Parallel and Distributed Computation: Numerical Methods, 2nd ed. Athena-Scientific,
1999.
[4] S. Borman, “The expectation maximization algorithm - a short tutorial,” Unpublished paper. [Online]. Available:
http://ftp.csd.uwo.ca/faculty/olga/Courses/Fall2006/Papers/EM algorithm.pdf
[5] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2004.
[6] J. D. Carroll and J. J. Chang, “Analysis of individual differences in multidimensional scaling via an n-way generalization
of “Eckart-Young” decomposition,” Psychometrika, vol. 35, pp. 283–319, 1970.
[7] Y. Censor and S. A. Zenios, Parallel Optimization: Theory, Algorithm, and Applications. Oxford University Press, Oxford,
United Kingdom, 1997.
[8] M. Chiang, C. W. Tan, D. P. Palomar, D. O’Neill, and D. Julian, “Power control by geometric programming,” IEEE
Transactions on Wireless Communications, vol. 6, pp. 2640–2651, 2007.
[9] P. L. Combettes and J.-C. Pesquet, “Proximal splitting methods in signal processing,” 2009, available online at: arxiv.org.
[10] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” Journal
of the Royal Statistical Society Series B, vol. 39, pp. 1–38, 1977.
[11] N. K. M. Faber, R. Bro, and P. K. Hopke, “Recent developments in CANDECOMP/PARAFAC algorithms: A critical
review,” Chemometrics and Intelligent Laboratory Systems, vol. 65, pp. 119–137, 2003.
[12] L. Grippo and M. Sciandrone, “On the convergence of the block nonlinear Gauss-Seidel method under convex constraints,”
Operations Research Letters, vol. 26, pp. 127–136, 2000.
[13] R. A. Harshman, “Foundations of the parafac procedure: Models and conditions for an explanatory” multi-modal factor
analysis,” UCLA working papers in phonetics, vol. 16, pp. 1–84, 1970.
[14] J. A. Hartigan and M. A. Wong, “K-means clustering algorithm,” Journal of the Royal Statistical Society, Series C (Applied
Statistics), vol. 28, pp. 100–108, 1979.
[15] J. Hastad, “Tensor rank is NP-complete,” Journal of Algorithms, vol. 11, pp. 644–654, 1990.
[16] M. Hong and Z.-Q. Luo, “Joint linear precoder optimization and base station selection for an uplink mimo network: A
game theoretic approach,” in the Proceedings of the IEEE ICASSP, 2012.
[17] H. R. Howson and N. G. F. Sancho, “A new algorithm for the solution of multistate dynamic programming problems,”
Mathematical Programming, vol. 8, pp. 104–116, 1975.
[18] S. J. Kim and G. B. Giannakis, “Optimal resource allocation for mimo ad-hoc cognitive radio networks,” Proceedings of
Allerton Conference on Communication, Control, and Computing, 2008.
[19] T. G. Kolda and B. W. Bader, “Tensor decompositions and applications,” SIAM Review, vol. 51, pp. 455–500, 2009.
[20] Z.-Q. Luo and S. Zhang, “Dynamic spectrum management: Complexity and duality,” IEEE Journal of Selected Topics in
Signal Processing, vol. 2, no. 1, pp. 57–73, 2008.
[21] B. R. Marks and G. P. Wright, “A general inner approximation algorithm for nonconvex mathematical programs,” Operations
Research, vol. 26, pp. 681–683, 1978.
[22] C. Navasca, L. D. Lathauwer, and S. Kindermann, “Swamp reducing technique for tensor decomposition,” Proc. 16th
European Signal Processing Conference (EUSIPCO), August 2008.
[23] C. T. K. Ng and H. Huang, “Linear precoding in cooperative MIMO cellular networks with limited coordination clusters,”
IEEE Journal on Selected Areas in Communications, vol. 28, no. 9, pp. 1446 –1454, december 2010.
[24] M. J. D. Powell, “On search directions for minimization algorithms,” Mathematical Programming, vol. 4, pp. 193–201,
1973.

September 12, 2012 DRAFT


34

[25] M. Razaviyayn, H. Baligh, A. Callard, and Z.-Q. Luo, “Joint transceiver design and user grouping in a MIMO interfering
broadcast channel,” 45th Annual Conference on Information Sciences and Systems (CISS), pp. 1–6, 2011.
[26] C. Shi, R. A. Berry, and M. L. Honig, “Local interference pricing for distributed beamforming in MIMO networks,”
Proceedings of the 28th IEEE conference on Military communications (MILCOM), pp. 1001–1006, 2009.
[27] ——, “Monotonic convergence of distributed interference pricing in wireless networks,” in Proceedings of the 2009 IEEE
international conference on Symposium on Information Theory - Volume 3, ser. ISIT’09, 2009, pp. 1619–1623.
[28] Q. Shi, M. Razaviyayn, Z.-Q. Luo, and C. He, “An iteratively weighted mmse approach to distributed sum-utility
maximization for a MIMO interfering broadcast channel,” IEEE Transactions on Signal Processing, vol. 59, pp. 4331–4340,
2011.
[29] G. Tomasi and R. Bro, “A comparison of algorithms for fitting the parafac model,” Computational Statistics and Data
Analysis, vol. 50, p. 17001734, April 2006.
[30] P. Tseng, “Convergence of a block coordinate descent method for nondifferentiable minimization,” Journal of Optimization
Theory and Applications, vol. 109, pp. 475–494, 2001.
[31] P. Tseng and S. Yun, “A coordinate gradient descent method for nonsmooth separable minimization,” Mathematical
Programming, vol. 117, pp. 387–423, 2009.
[32] Y. Xu and W. Yin, “A block coordinate descent method for multi-convex optimization with applications to nonnegative
tensor factorization and completion,” http://www.caam.rice.edu/∼optimization/BCD/, 2012.
[33] A. L. Yuille and A. Rangarajan, “The concave-convex procedure,” Neural Computation, vol. 15, pp. 915–936, 2003.
[34] N. Zadeh, “A note on the cyclic coordinate ascent method,” Management Science, vol. 16, pp. 642–644, 1970.

September 12, 2012 DRAFT

You might also like