Quadratic Decomposable Submodular Function Minimization: Theory and Practice
Pan Li [email protected]
Department of Computer Science
Purdue University
West Lafayette, IN 47907, USA
Niao He [email protected]
Department of Industrial and Enterprise Systems Engineering
University of Illinois at Urbana-Champaign
Champaign, IL 61820, USA
Olgica Milenkovic [email protected]
Department of Electrical and Computer Engineering
University of Illinois at Urbana-Champaign
Champaign, IL 61820, USA
Abstract
We introduce a new convex optimization problem, termed quadratic decomposable submod-
ular function minimization (QDSFM), which allows one to model a number of learning tasks
on graphs and hypergraphs. The problem exhibits close ties to decomposable submodu-
lar function minimization (DSFM) yet is much more challenging to solve. We approach
the problem via a new dual strategy and formulate an objective that can be optimized
through a number of double-loop algorithms. The outer-loop uses either random coordi-
nate descent (RCD) or alternative projection (AP) methods, for both of which we prove
linear convergence rates. The inner-loop computes projections onto cones generated by base
polytopes of the submodular functions via the modified min-norm-point or Frank-Wolfe al-
gorithms. We also describe two new applications of QDSFM: hypergraph-adapted PageR-
ank and semi-supervised learning. The proposed hypergraph-based PageRank algorithm
can be used for local hypergraph partitioning and comes with provable performance guaran-
tees. For hypergraph-adapted semi-supervised learning, we provide numerical experiments
demonstrating the efficiency of our QDSFM solvers and their significant improvements on
prediction accuracy when compared to state-of-the-art methods.
Keywords: Submodular functions, Lovász extensions, Random coordinate descent,
Frank-Wolfe method, PageRank.
1. Introduction
Given [N] = {1, 2, ..., N}, a submodular function F : 2^{[N]} → R is a set function that for
any S1 , S2 ⊆ [N ] satisfies F (S1 ) + F (S2 ) ≥ F (S1 ∪ S2 ) + F (S1 ∩ S2 ) (Edmonds, 2003). Sub-
modular functions are ubiquitous in machine learning as they capture rich combinatorial
properties of set functions and provide useful regularizers for supervised and unsupervised
learning objectives (Bach, 2013). Submodular functions are endowed with convex Lovász
extensions (Lovász, 1983), which provide the means to establish connections between com-
binatorial and continuous optimization problems (Fujishige, 2005).
Due to their versatility, submodular functions and their Lovász extensions are frequently
used in applications such as learning on directed/undirected graphs and hypergraphs (Zhu
et al., 2003a; Zhou et al., 2004; Hein et al., 2013), image denoising via total variation
regularization (Osher et al., 2005; Chambolle and Darbon, 2009), and maximum a posteriori
probability (MAP) inference in high-order Markov random fields (Kumar and Hebert, 2003).
In many optimization settings involving submodular functions, one encounters the convex
program
min_x \sum_{i∈[N]} (x_i − a_i)^2 + \sum_{r∈[R]} [f_r(x)]^p,   (1)
where a ∈ RN , p ∈ {1, 2}, and for all r in some index set [R], fr is the Lovász extension
of a submodular function Fr that describes some combinatorial structure over the set [N ].
For example, in image denoising, the parameter ai may correspond to the observed value
of pixel i, while the functions [fr (x)]p may be used to impose smoothness constraints on
the pixel neighborhoods. One of the main difficulties in solving the above optimization
problem comes from the nondifferentiability of the second term: a direct application of
subgradient methods leads to convergence rates as slow as 1/\sqrt{k}, where k denotes the
number of iterations (Shor et al., 1985).
In recent years, the above optimization problem for p = 1, referred to as decompos-
able submodular function minimization (DSFM) (Stobbe and Krause, 2010), has received
significant attention. The motivation for studying this particular setup is two-fold: first,
solving the convex optimization problem directly recovers the combinatorial solution to the
submodular min-cut problem (Fujishige, 2005) which takes the form minS⊆[N ] F (S), where
F(S) = \sum_{r∈[R]} F_r(S) − 2 \sum_{i∈S} a_i.
it has not received the same level of attention as DSFM, neither from the theoretical nor
algorithmic perspective. To our best knowledge, only a few reported works (Hein et al.,
2013; Zhang et al., 2017) provided solutions for specific instances of QDSFM with sublin-
ear convergence guarantees. For the general QDSFM problem, no analytical results are
currently known.
Our work takes a substantial step towards solving the QDSFM problem in its most
general form by developing a family of algorithms with linear convergence rates and small
iteration cost, including the randomized coordinate descent (RCD) and alternative projec-
tion (AP) algorithms. Our contributions are as follows.
1. First, we derive a new dual formulation for the QDSFM problem since an analogue of
the dual transformation for the DSFM problem is not applicable. Interestingly, the
dual QDSFM problem requires one to find the best approximation of a hyperplane
via a product of cones as opposed to a product of polytopes, encountered in the dual
DSFM problem.
2. Second, we develop linearly convergent RCD and AP algorithms for solving the dual
QDSFM problem. Because of the special underlying conic structure, a new analytic
approach is needed to prove the weak strong-convexity of the dual QDSFM problem,
which essentially guarantees linear convergence.
2. Problem Formulations
We first introduce some terminology and notation useful for our subsequent exposition.
Lovász extension and base polytope. For a submodular function F defined over the
ground set [N ], the Lovász extension is a convex function f : RN → R, defined for all
x ∈ RN according to
f(x) = \sum_{k=1}^{N−1} F({i_1, ..., i_k})(x_{i_k} − x_{i_{k+1}}) + F([N]) x_{i_N},   (2)
where x_{i_1} ≥ x_{i_2} ≥ ··· ≥ x_{i_N}, and i_1, ..., i_N are distinct indices in [N]. For a vector x ∈ R^N
and a set S ⊆ [N], define x(S) = \sum_{i∈S} x_i, where x_i stands for the value of the i-th dimension
of x. The base polytope of F, denoted by B, is defined as
B = {y ∈ R^N : y(S) ≤ F(S) for all S ⊆ [N], y([N]) = F([N])}.
Using the base polytope, the Lovász extension can also be written as f(x) = max_{y∈B} ⟨y, x⟩.
We say that an element i ∈ [N] is incident to F if there exists an S ⊂ [N] such that
F(S) ≠ F(S ∪ {i}).
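To make these definitions concrete, the following minimal Python sketch evaluates the Lovász extension with Edmonds' greedy procedure, which simultaneously produces the base-polytope point attaining f(x) = max_{y∈B} ⟨y, x⟩. The hyperedge-cut function used at the end is only an illustration and is not part of the formal development.

```python
import numpy as np

def lovasz_extension(F, x):
    """Evaluate the Lovasz extension f(x) of a submodular set function F.

    F : callable mapping a frozenset of indices in [N] to a real, with F(empty set) = 0.
    x : 1-D numpy array of length N.
    Returns (f(x), y), where y is the greedy base-polytope point with f(x) = <y, x>.
    """
    N = len(x)
    order = np.argsort(-x)          # indices sorted so that x[i1] >= x[i2] >= ...
    y = np.zeros(N)
    prefix, prev_val = frozenset(), 0.0
    for idx in order:
        prefix = prefix | {idx}
        val = F(prefix)
        y[idx] = val - prev_val      # marginal gain defines the greedy base point
        prev_val = val
    return float(np.dot(y, x)), y

# Example: cut function of one undirected hyperedge S_r = {0, 1, 2} in a ground set of size 4:
# F_r(S) = 1 if S separates the hyperedge, and 0 otherwise.
S_r = {0, 1, 2}
F_r = lambda S: float(0 < len(S & S_r) < len(S_r))
x = np.array([0.9, 0.1, -0.3, 0.5])
f_val, y = lovasz_extension(F_r, x)
print(f_val)   # equals max_{i in S_r} x_i - min_{j in S_r} x_j = 0.9 - (-0.3) = 1.2
```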
Notations. Given a positive diagonal matrix W ∈ R^{N×N} and a vector x ∈ R^N, we define
the W-norm according to \|x\|_W = \sqrt{\sum_{i=1}^N W_{ii} x_i^2}, and simply use \|·\| when W = I, the
identity matrix. For an index set [R], we denote the R-fold product of N-dimensional Euclidean
spaces by ⊗_{r∈[R]} R^N. A vector y ∈ ⊗_{r∈[R]} R^N is written as (y_1, y_2, ..., y_R), where y_r ∈ R^N
for all r ∈ [R]. The W-norm induced on ⊗_{r∈[R]} R^N equals \|y\|_{I(W)} = \sqrt{\sum_{r=1}^R \|y_r\|_W^2}. We
reserve the symbol ρ for max_{y_r∈B_r, ∀r} \sqrt{\sum_{r∈[R]} \|y_r\|_1^2}. Furthermore, we use (x)_+ to denote
the function max{x, 0}.
QDSFM:   min_{x∈R^N} \|x − a\|^2_W + \sum_{r∈[R]} [f_r(x)]^2,   (4)
[f_r(x)]^2 = max_{φ_r≥0} φ_r f_r(x) − φ_r^2/4 = max_{φ_r≥0} max_{y_r∈φ_rB_r} ⟨y_r, x⟩ − φ_r^2/4.   (5)
Let y = (y1 , y2 , ..., yR ) and φ = (φ1 , φ2 , ..., φR ). For each r, we define a convex cone
Cr = {(yr , φr )|φr ≥ 0, yr ∈ φr Br }
which represents the feasible set of the variables (yr , φr ). Furthermore, we denote the
product of cones by
C = ⊗_{r∈[R]} C_r := {(y, φ) : φ_r ≥ 0, y_r ∈ φ_r B_r, ∀ r ∈ [R]}.
Invoking equation (5), we arrive at the following two dual formulations for the original
QDSFM problem in Lemma 1.
By introducing Λ = (λ_r) ∈ ⊗_{r∈[R]} R^N, the previous optimization problem takes the form
min_{y,φ,Λ} \sum_{r∈[R]} \|y_r − λ_r/\sqrt{R}\|^2_{W^{-1}} + φ_r^2,   s.t. (y, φ) ∈ C,  \sum_{r∈[R]} λ_r = 2Wa.   (7)
The proof of Lemma 1 is presented in Section 7.1.1. The dual formulations for the
DSFM problem were described in Lemma 2 of (Jegelka et al., 2013). Similarly to the
DSFM problem, the dual formulations (6) and (7) are simple, nearly separable in the dual
variables, and may be solved by algorithms such as those used for solving dual DSFM,
including the Douglas-Rachford splitting method (DR) (Jegelka et al., 2013), the alternative
projection method (AP) (Nishihara et al., 2014), and the random coordinate descent method
(RCD) (Ene and Nguyen, 2015).
However, there are also some notable differences. Our dual formulations for the QDSFM
problem are defined on a product of cones constructed from the base polytopes of the sub-
modular functions. The optimization problem (7) essentially asks for the best approxima-
tion of an affine space in terms of the product of cones, whereas in the DSFM problem one
seeks an approximation in terms of a product of polytopes. In the next section, we propose
to solve the dual problem (6) using the random coordinate descent method (RCD), and to
solve (7) using the alternative projection method (AP), both tailored to the conic structure.
The conic structures make the convergence analysis and the computations of projections
for these algorithms challenging.
Before proceeding with the description of our algorithms, we first provide a relevant
geometric property of the product of cones ⊗_{r∈[R]} C_r, as stated in the following lemma.
This property is essential in establishing the linear convergence of RCD and AP algorithms.
Lemma 2 Consider a feasible solution of the problem (y, φ) ∈ ⊗_{r∈[R]} C_r and a nonnegative
vector φ′ = (φ′_r) ∈ ⊗_{r∈[R]} R_{≥0}. Let s be an arbitrary point in the base polytope of
\sum_{r∈[R]} φ′_r F_r, and let W^{(1)}, W^{(2)} be two positive diagonal matrices. Then, there exists a
y′ ∈ ⊗_{r∈[R]} φ′_r B_r satisfying \sum_{r∈[R]} y′_r = s and
\|y − y′\|^2_{I(W^{(1)})} + \|φ − φ′\|^2 ≤ μ(W^{(1)}, W^{(2)}) [ \|\sum_{r∈[R]} y_r − s\|^2_{W^{(2)}} + \|φ − φ′\|^2 ],
where
μ(W^{(1)}, W^{(2)}) = max{ \sum_{i∈[N]} W^{(1)}_{ii} \sum_{j∈[N]} 1/W^{(2)}_{jj},  (9/4) ρ^2 \sum_{i∈[N]} W^{(1)}_{ii} + 1 },   (8)
and ρ = max_{y_r∈B_r, ∀r∈[R]} \sqrt{\sum_{r∈[R]} \|y_r\|_1^2}.
Throughout the remainder of this section, we treat the projections as provided by an oracle.
Later in Section 4, we provide some efficient methods for computing the projections.
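For intuition, the following Python sketch shows the shape of the random coordinate descent outer loop implied by the dual objective (6), with the conic projection of Section 4 abstracted as an oracle `project_cone`. The function name, its interface, and the recovery of the primal iterate are assumptions made for illustration; the precise bookkeeping of Algorithm 1 is given in the paper.

```python
import numpy as np

def rcd_qdsfm(a, W_diag, project_cone, R, num_iters, rng=None):
    """Schematic random coordinate descent for the dual QDSFM problem (6).

    a            : length-N vector from the primal objective.
    W_diag       : length-N positive diagonal of W.
    project_cone : oracle; project_cone(r, b, W_diag) returns the pair (y_r, phi_r) in the
                   cone C_r closest to (b, 0) in the metric ||y_r - b||_{W^{-1}}^2 + phi_r^2.
    R            : number of submodular components.
    """
    rng = rng or np.random.default_rng(0)
    N = len(a)
    y = np.zeros((R, N))
    phi = np.zeros(R)
    for _ in range(num_iters):
        r = rng.integers(R)                              # sample one block uniformly at random
        residual = 2.0 * W_diag * a - (y.sum(axis=0) - y[r])
        y[r], phi[r] = project_cone(r, residual, W_diag)  # exact block minimization of (6)
    # primal estimate from the dual iterate, using sum_r y_r = 2W(a - x)
    x = a - 0.5 * y.sum(axis=0) / W_diag
    return y, phi, x
```

The last step uses the primal-dual relation \sum_{r∈[R]} y_r = 2W(a − x) discussed in Section 7.2.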
Note that the optimal φr , as defined in (9), is closely related to gauge functions of poly-
topes (Bach, 2013). This representation is a consequence of the relationship between the
optimal primal and dual solutions stated in Lemma 1. For convenience, we denote the
optimal value of the objective function by g^* = g(y^*, φ^*) for any (y^*, φ^*) ∈ Ξ, and also
define a distance function
d((y, φ), Ξ) = min_{(y′,φ′)∈Ξ} \sqrt{ \|y − y′\|^2_{I(W^{-1})} + \|φ − φ′\|^2 }.
Lemma 3 (Weak Strong Convexity) Suppose that (y, φ) ∈ ⊗_{r∈[R]} C_r and that (y^*, φ^*) ∈ Ξ.
Theorem 4 asserts that at most O(Rμ(W^{-1}, W^{-1}) log(1/ε)) iterations are required to obtain
an ε-optimal solution in expectation for the QDSFM problem. The proofs of Lemma 3 and
Theorem 4 are postponed to Section 7.2.
Here, Ψ ∈ R^{N×N} is a positive diagonal matrix in which Ψ_ii = |{r ∈ [R] | i ∈ S_r}| equals the
number of submodular functions that i is incident to.
Lemma 5 Problem (11) is equivalent to problem (6).
The proof of Lemma 5 is presented in Section 7.3.1. The AP method for solving (11) is
listed in Algorithm 2. Observe that Step 5 is a projection onto cones defined based on
the positive diagonal matrix ΨW −1 which differs from the one used in the RCD method.
Note that compared to the RCD algorithm, AP requires one to compute projections onto
all cones C_r in each iteration. Thus, in most cases, AP incurs a larger computational cost
than the RCD algorithm. On the other hand, AP naturally lends itself to parallelization
since the projections can be decoupled and implemented very efficiently.
Before moving to the convergence analysis, we first prove the uniqueness of optimal φ
in the following lemma.
Lemma 6 The optimal value of φ to problem (11) is unique.
Proof We know that (11) is equivalent to (6). Suppose there are two optimal solutions (ȳ, φ̄) and
(ỹ, φ̃). As the optimal x^* is unique, \sum_{r∈[R]} ȳ_r = \sum_{r∈[R]} ỹ_r = 2W(a − x^*) = \sum_{r∈[R]} (ȳ_r + ỹ_r)/2.
If φ̄ ≠ φ̃, we have \sum_{r∈[R]} ((φ̄_r + φ̃_r)/2)^2 < \sum_{r∈[R]} φ̄_r^2 = \sum_{r∈[R]} φ̃_r^2, which makes
g((ȳ + ỹ)/2, (φ̄ + φ̃)/2) smaller than g(ȳ, φ̄) = g(ỹ, φ̃) and thus yields a contradiction.
To determine the convergence rate of the AP method, we adapt the result of (Nishihara
et al., 2014) on the convergence rate of APs between two convex bodies. In our setting, the
two convex bodies of interest are the cone C and the hyperplane
Z = {(y, φ) | \sum_{r∈[R]} y_r = 2W(a − x^*), φ_r = φ_r^*, y_{r,i} = 0, ∀ i ∉ S_r},
Lemma 7 (Nishihara et al. (2014)) Let Ξ be as defined in (9). In addition, define the
distance function
d_{ΨW^{-1}}((y, φ), Ξ) = min_{(y′,φ′)∈Ξ} \sqrt{ \|y − y′\|^2_{I(ΨW^{-1})} + \|φ − φ′\|^2 }.
Then,
d_{ΨW^{-1}}((y^{(k)}, φ^{(k)}), Ξ) ≤ 2 d_{ΨW^{-1}}((y^{(0)}, φ^{(0)}), Ξ) (1 − 1/κ_*^2)^k,
where
κ_* = sup_{(y,φ)∈C∪Z\Ξ}  d_{ΨW^{-1}}((y, φ), Ξ) / max{ d_{ΨW^{-1}}((y, φ), C), d_{ΨW^{-1}}((y, φ), Z) }.
The next lemma establishes a finite upper bound on κ_*. This guarantees linear convergence
rates for the AP algorithm.
Recall that Ψ_ii = |{r ∈ [R] | i ∈ S_r}|. Then, Algorithm 1 (RCD) requires an expected number
of
O( N^2 R max{1, 9β^{-1}} max_{i,j∈[N]} (D_ii/D_jj) log(1/ε) )
iterations to return a solution (y, φ) that satisfies d((y, φ), Ξ) ≤ ε. Algorithm 2 (AP) requires
a number of
O( N^2 R max{1, 9β^{-1}} max_{i,j∈[N]} (Ψ_jj D_ii)/(R D_jj) log(1/ε) )
iterations to return a solution (y, φ) that satisfies d_{ΨW^{-1}}((y, φ), Ξ) ≤ ε.
The proof of Corollary 9 is provided in Section 7.4. Note that the term N 2 R also appears
in the expression for the complexity of the RCD and the AP methods for solving the DSFM
problem (Ene et al., 2017). The term max{1, 9β −1 } implies that whenever β is small, the
convergence rate is slow. In Section 5.1, we provide a case-specific analysis showing that β
is related to the mixing time of the underlying Markov process in the PageRank application.
Depending on the application domain, other interpretations of β could exist, but will not
be discussed here to avoid clutter. It is also worth pointing out that the variable D in
Corollary 9 is closely related to the notion of degree in hypergraphs (Li and Milenkovic,
2018b; Yoshida, 2019).
Let h∗ and (y ∗ , φ∗ ) be the optimal value of the objective function and the optimal solution,
respectively. When performing projections, one only needs to consider the variables incident
to F , and set all other variables to zero. For simplicity, we assume that all variables in [N ]
are incident to F. The results extend easily to the case where the incidences have a general form.
Unlike the QDSFM problem, solving the DSFM involves the computation of projec-
tions onto the base polytopes of submodular functions. Two algorithms, the Frank-Wolfe
(FW) method (Frank and Wolfe, 1956) and the Fujishige-Wolfe minimum norm algorithm
(MNP) (Fujishige and Isotani, 2011), are used for this purpose. Both methods assume in-
expensive linear minimization oracles on polytopes and guarantee a 1/k-convergence rate.
The MNP algorithm is more sophisticated yet empirically more efficient. Nonetheless, nei-
ther of these methods can be applied directly to conic projections. To this end, we modify
these two methods by adjusting them to the conic structure in (12) and show that they still
achieve a 1/k-convergence rate. We refer to the procedures as the conic FW method and
the conic MNP method, respectively.
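The sketch below illustrates one possible rendering of the conic FW method, under the assumption that the projection problem (12) has the form min_{φ≥0, y∈φB} \|y − a\|^2_{W̃} + φ^2 (consistent with the bound stated in Theorem 11). The atom is chosen by Edmonds' greedy linear minimization oracle over B, and the update reduces to the two-variable quadratic program mentioned later in this section; function names and the termination rule are illustrative, not the paper's exact pseudocode.

```python
import numpy as np

def greedy_min_base(F, w):
    """LMO: argmin_{q in B} <w, q> over the base polytope B of F (Edmonds' greedy,
    with the coordinates of w sorted in increasing order)."""
    order = np.argsort(w)
    q = np.zeros(len(w))
    prefix, prev = frozenset(), 0.0
    for idx in order:
        prefix = prefix | {idx}
        val = F(prefix)
        q[idx] = val - prev
        prev = val
    return q

def conic_fw_projection(F, a, Wt_diag, num_iters=200):
    """Sketch of a conic FW method for  min_{phi >= 0, y in phi*B} ||y - a||_Wt^2 + phi^2.
    Atoms of the cone are pairs (q, 1) with q an extreme point of B."""
    N = len(a)
    y, phi = np.zeros(N), 0.0
    for _ in range(num_iters):
        grad_y = 2.0 * Wt_diag * (y - a)            # gradient of ||y - a||_Wt^2
        q = greedy_min_base(F, grad_y)              # best aligned atom of B
        # Re-optimize over theta*(y, phi) + gamma*(q, 1) with theta, gamma >= 0:
        # a two-variable non-negative quadratic program in extended coordinates.
        u, v = np.array([*y, phi]), np.array([*q, 1.0])
        Wext = np.array([*Wt_diag, 1.0])
        target = np.array([*a, 0.0])

        def objective(theta, gamma):
            return np.sum(Wext * (theta * u + gamma * v - target) ** 2)

        G = np.array([[np.sum(Wext * u * u), np.sum(Wext * u * v)],
                      [np.sum(Wext * u * v), np.sum(Wext * v * v)]])
        b = np.array([np.sum(Wext * u * target), np.sum(Wext * v * target)])
        candidates = []
        if np.linalg.det(G) > 1e-12:
            sol = np.linalg.solve(G, b)
            if sol[0] >= 0 and sol[1] >= 0:
                candidates.append(sol)
        candidates.append(np.array([0.0, max(b[1] / G[1, 1], 0.0)]))   # theta = 0 boundary
        if G[0, 0] > 1e-12:
            candidates.append(np.array([max(b[0] / G[0, 0], 0.0), 0.0]))  # gamma = 0 boundary
        theta, gamma = min(candidates, key=lambda c: objective(c[0], c[1]))
        y, phi = theta * y + gamma * q, theta * phi + gamma
    return y, phi
```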
careful modification of the arguments in (Chakrabarty et al., 2014) to handle the conic
structure. We provide a detailed proof in Section 7.5.
h(y^{(k)}, φ^{(k)}) ≤ h^* + 2\|a\|^2_{W̃} Q^2 / (k + 2).
The proof of Theorem 11 is deferred to Section 7.6.
program with only two variables and consequently introduces a complexity term O(N ).
Therefore, the complexity of the FW algorithm is dominated by Step 2, which requires
O(N log N + N × EO) operations.
Below we discuss the distinctions of our method from several relevant methods for com-
puting projections on cones. Compared to the linearly convergent away-steps Frank-Wolfe
algorithm (Wolfe, 1970) and its generalized versions for the conic case, such as the away-
steps/pairwise non-negative matching pursuit (AMP/PWMP) techniques recently proposed
in (Locatello et al., 2017), the size of the active set S in the (conic) MNP algorithm is al-
ways bounded by N instead of the (increasing) number of iterations. Particularly, when
applied to the QDSFM problems, the size of the active set S in the (conic) MNP algorithm
can be even smaller in practice, as it is bounded by the number of elements that affect
the value of the submodular function, which is usually smaller than N (Li and Milenkovic,
2018a). Moreover, in terms of base polytopes of submodular functions, AMP/PWMP may
not perform as well as MNP because the steepest ascent over the active set may not be
executed as efficiently as the descent over polytopes obtained via the greedy approach (Step
2, which generally appears in FW-based methods).
Another benefit of the (conic) MNP method is that it can compute exact projections
for the problem (12) in finitely many steps. Because of the structure of the cones generated
from polytopes, the exact projections are expressed as linear combinations of the extreme
points of the polytopes. As the number of such combinations is finite, MNP can find the
exact projections in finitely many iterations, thereby inheriting a similar property of the
MNP method pertaining to polytopes (Fujishige and Isotani, 2011). While this property
also holds for the fully corrective non-negative matching pursuit (FCMP) (Locatello et al.,
2017), FCMP requires solving an additional nonnegative quadratic program in the inner
loop, and is thus much less efficient; MNP does not require an explicit solution of the
nonnegative quadratic program as it allows for an efficient update of the set of atoms, S (k) ,
described in Algorithm 3, through proper leveraging of the problem structure (minor loops
of Algorithm 3). In contrast, the AMP/PWMP methods in (Locatello et al., 2017) as well
as our conic FW method introduced in Section 4.2 may not ensure exact projections. As the
proposed algorithms for QDSFM depend highly on precise projections for the problem (12),
using MNP as the subroutine for computing projections has the potential to yield better
accuracy than other methods discussed above. This is further evidenced in our experiments,
to be discussed below.
We illustrate the convergence speeds of the RCD algorithm when using the conic MNP and
the conic FW methods as subroutines for computing the conic projections, referred to as
QRCD-MNP and QRCD-FW, respectively. We also compare them with two other methods,
QRCD-AMP and QRCD-PWMP, where the subroutine for computing the conic projection
is replaced by AMP and PWMP (Locatello et al., 2017), respectively. For this purpose, we
construct a synthetic QDSFM problem (4) by fixing N = 100, R = 100, and Wii = 1, for
all i ∈ [N ]. We generate each incidence set Sr by choosing uniformly at random a subset of
[N ] of cardinality 10, and then set the entries of a to be i.i.d. standard Gaussian.
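For concreteness, the incidence structure and the vector a described above can be generated as in the following sketch; the submodular functions F_r of (13), which depend on θ, are attached to these incidence sets and are not reproduced here.

```python
import numpy as np

# Synthetic QDSFM instance matching the setup described above.
rng = np.random.default_rng(0)
N, R, cardinality = 100, 100, 10
incidence_sets = [set(rng.choice(N, size=cardinality, replace=False).tolist())
                  for _ in range(R)]          # incidence sets S_r, |S_r| = 10
a = rng.standard_normal(N)                    # entries of a are i.i.d. standard Gaussian
W_diag = np.ones(N)                           # W_ii = 1 for all i in [N]
```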
[Figure 1: three panels, for θ = 0.25, θ = 0.5, and θ = 1; vertical axis log10(gap); curves for
QRCD-MNP, QRCD-FW, QRCD-PWMP, and QRCD-AMP.]
Figure 1: Convergence results (log10 (gap): mean ± std) for the QRCD-MNP, QRCD-FW,
QRCD-AMP, QRCD-PWMP algorithms for general submodular functions.
The number of RCD iterations is set to 300R = 3 × 104 . For the projections performed
in the inner loop, as only the conic MNP method is guaranteed to find the exact solution in
finite time, we constrain the number of iterations used by competing methods to ensure a
fair comparison. To this end, note that for the AMP and PWMP methods (Locatello et al.,
2017), a parameter δ that controls the precision is computed as −γ\|d_t\|^2 (here we use the
notation from Algorithm 3 of (Locatello et al., 2017)). For all algorithms, we terminate
the inner loop by setting δ = 10−12 or a maximum number of iterations |Sr |3 = 1000.
All algorithms were executed on a 3.2GHz Intel Core i5 machine. For each algorithm, the
average and variance of the primal gap are obtained via 100 independent tests. The con-
vergence results are shown in Figure 1, which demonstrates that QRCD-MNP outperforms
the other three methods. The performance of QRCD-PWMP is satisfactory when θ = 1, but
it deteriorates for smaller values of θ, which may be a consequence of increasing the num-
ber of extreme points of cones/polytopes corresponding to the submodular functions (13).
QRCD-FW and QRCD-AMP exhibit similar convergence behaviors but fail to provide high-
precision solutions for reasonable CPU times.
[Figure 2: log10(cputime(ms)+1) versus s for the MNP, FW, PWMP, and AMP methods.]
Figure 2: The required CPU time for different methods used to compute the projection (12)
versus the scale of the submodular function s.
The corresponding results are reported in Figure 2. All methods terminate when
(h(y^{(k)}, φ^{(k)}) − h^*)/\|a\|^2_{W̃} ≤ 10^{-5} over independent runs.
As can be seen from Figure 2, the complexities of all methods increase as the submodular
function scales up, while the conic MNP method tends to be the most robust with respect
to scale changes. PWMP appears to be the most sensitive to scaling due to the growing
size of the active set with increasing s.
5. Applications to PageRank
In what follows, we introduce one important application of the QDSFM problem: PageRank
(PR). Our treatment of the PR process relies on diffusion processes over hypergraphs (Chan
et al., 2017, 2018). We demonstrate that what is known as a PR vector can be efficiently
computed using our proposed QDSFM solver. We also show that the newly introduced PR
retains important properties of the standard PR over graphs, pertaining both to mixing
and local partitioning properties (Andersen et al., 2006).
Table 1: The term ⟨x, L(x)⟩ for different combinatorial structures. In the third column,
whenever the stated conditions are not satisfied, it is assumed that F_r = 0. For directed
hypergraphs, H_r and T_r are subsets of S_r, which we subsequently term the head and the tail
set. For H_r = T_r = S_r, one recovers the setting for undirected hypergraphs.
extension reads as f_r(x) = max_{(i,j)∈H_r×T_r} (x_i − x_j)_+. If S_r contains only two vertices for all
values r ∈ [R], the hypergraph clearly reduces to a graph; similarly, if H_r = T_r = S_r,
directed hypergraphs reduce to undirected hypergraphs. We also define the degree of a
vertex i as
d_i = \sum_{r: i∈S_r} w_r,
and let D denote the diagonal matrix of degree values. The volume of a set S ⊆ V is defined
as
vol(S) = \sum_{i∈S} d_i.
Furthermore, setting m = \sum_{i∈V} d_i = vol(V), we define the boundary of a set S and its
volume as
∂S = {r ∈ [R] | S ∩ H_r ≠ ∅, S̄ ∩ T_r ≠ ∅},   vol(∂S) = \sum_{r∈∂S} w_r = \sum_{r∈[R]} w_r F_r(S).
We also define a distribution π_S over V according to (π_S)_i = d_i / vol(S). Furthermore, using the
boundary and volume, we also define the conductance of the set S as
Φ(S) = vol(∂S) / min{vol(S), vol(S̄)}.   (15)
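A small Python sketch of these quantities for a directed hypergraph is given below. It takes S_r = H_r ∪ T_r, which is an assumption made only to keep the example short.

```python
import numpy as np

def conductance(S, hyperedges, weights, num_vertices):
    """Conductance Phi(S) of a vertex set S in a directed hypergraph.
    hyperedges : list of (H_r, T_r) pairs of vertex sets (head and tail sets).
    weights    : list of hyperedge weights w_r.
    """
    S = set(S)
    comp = set(range(num_vertices)) - S
    deg = np.zeros(num_vertices)
    for (H, T), w in zip(hyperedges, weights):
        for i in set(H) | set(T):        # S_r taken as H_r union T_r (assumption)
            deg[i] += w
    vol_S, vol_comp = deg[list(S)].sum(), deg[list(comp)].sum()
    # boundary: hyperedges whose head meets S and whose tail meets the complement
    vol_boundary = sum(w for (H, T), w in zip(hyperedges, weights)
                       if set(H) & S and set(T) & comp)
    return vol_boundary / min(vol_S, vol_comp)

# toy example: one directed hyperedge with head {0, 1}, tail {1, 2}, weight 1
print(conductance({0}, [({0, 1}, {1, 2})], [1.0], 3))
```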
The boundary, volume and conductance are all standard and well-studied graph-theoretic
concepts (Bollobás, 2013), generalized above to accommodate hypergraphs. The hypergraph
conductance is defined as the minimal value of the conductance over all possible subsets of
V, that is,
min_{S⊆V, S≠∅, S≠V} Φ(S).   (16)
The hypergraph conductance is an important objective function used for detecting commu-
nities of good quality in networks with higher-order functional units (Benson et al., 2016).
Solving the optimization (16) naturally yields a two-way partition of V so that the corre-
sponding procedure is also referred to as hypergraph partitioning.
1. For each active hyperedge S_r and for each pair of vertices (i, j) ∈ S_r^↑(x) × S_r^↓(x), we
introduce a non-negative scalar a_r^{(ij)} such that \sum_{(i,j)∈S_r^↑(x)×S_r^↓(x)} a_r^{(ij)} = w_r. For pairs
of vertices (i, j) ∉ S_r^↑(x) × S_r^↓(x), or belonging to inactive hyperedges, we set a_r^{(ij)} = 0.
2. Using the above described pairwise weights, we create a symmetric matrix A such
that A_{ij} = \sum_{r∈[R]} a_r^{(ij)} and A_{ii} = d_i − \sum_{j∈V, j≠i} A_{ij}.
For undirected graphs for which |Sr | = |Tr | = |Hr | = 2 for all r ∈ [R], the matrix A is an
adjacency matrix of a graph. Using the terminology of (Chan et al., 2018), we define the
Markov operator M(p) based on A according to
M(p) = Ax = AD^{-1} p.
The Markov operator M(p) is closely related to the Lovász extension f_r, which may be seen
as follows. Let ∇f_r(x) be the subdifferential set of f_r at x. We once again recall that, as
F_r is submodular, ∇f_r(x) = arg max_{y∈B_r} ⟨y, x⟩ (Bach, 2013). This gives rise to the following
result.
Lemma 12 For any p ∈ R^N and x = D^{-1} p, it holds that p − M(p) ∈ \sum_{r∈[R]} w_r f_r(x) ∇f_r(x).
where 1) holds due to the definitions of a_r^{(ij)} and f_r(x). It only remains to show that, for an
active hyperedge S_r, r ∈ [R], the vector v with entries
v_i = \sum_{j∈S_r^↓} a_r^{(ij)}/w_r for i ∈ S_r^↑,   v_i = −\sum_{j∈S_r^↑} a_r^{(ij)}/w_r for i ∈ S_r^↓,   v_i = 0 otherwise,
satisfies v ∈ ∇f_r(x). Since \sum_{(i,j)∈S_r^↑×S_r^↓} a_r^{(ij)} = w_r and a_r^{(ij)} ≥ 0, we know that v ∈ B_r.
Moreover, ⟨v, x⟩ = f_r(x), and hence v ∈ ∇f_r(x), so the claim clearly holds.
The statement of Lemma 12 describes a definition of a Laplacian operator for submodular
hypergraphs consistent with previous definitions (Li and Milenkovic, 2018b). We write the
underlying Laplacian operator as L(x) = (D − A)x = p − M(p). Note that when p = π_V, the
components of x = D^{-1} p are all equal to 1/vol(V), and hence M(p) = AD^{-1} p = (1/vol(V)) A1 =
(1/vol(V)) D1 = p. Therefore, π_V may be viewed as an analogue of a stationary distribution.
Given an initial potential p_0 ∈ R^N and an α ∈ (0, 1], the PR process is described by the
following ordinary differential equation:
dp_t/dt = α(p_0 − p_t) + (1 − α)(M(p_t) − p_t).   (17)
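For the graph special case, where the Markov operator is simply M(p) = AD^{-1}p, the dynamics (17) can be integrated directly. The sketch below uses a plain Euler discretization and is only an illustration: for hypergraphs M is nonlinear, and each step would additionally require resolving the active diffusion pattern (see footnote 2).

```python
import numpy as np

def simulate_pagerank_ode(A, p0, alpha, step=0.01, num_steps=5000):
    """Euler integration of the PR dynamics (17) for the graph case, M(p) = A D^{-1} p."""
    deg = A.sum(axis=1)
    p = p0.astype(float).copy()
    for _ in range(num_steps):
        M_p = A @ (p / deg)
        p += step * (alpha * (p0 - p) + (1.0 - alpha) * (M_p - p))
    return p    # approximates the PR vector pr(alpha, p0)

# small triangle graph, personalized at vertex 0
A = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)
p0 = np.array([1.0, 0.0, 0.0])
print(simulate_pagerank_ode(A, p0, alpha=0.2))
```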
The choice of α = 0 reduces the PR process to the DP defined in (Chan et al., 2017, 2018).
Tracking the PR process at every time point t, i.e., specifying the weights a_r^{(ij)}, is as difficult
as tracking the DP, which in turn requires solving a densest subset problem (Chan et al., 2017, 2018).²
However, we only need to examine and evaluate the stationary point, which corresponds to
the PR vector. The PR vector pr(α, p_0) satisfies the equation
α(p_0 − pr(α, p_0)) + (1 − α)(M(pr(α, p_0)) − pr(α, p_0)) = 0.   (18)
From Lemma 12, we know that pr(α, p_0) can be obtained by solving a QDSFM problem. Let
x_pr = D^{-1} pr(α, p_0). Then
α(p_0 − D x_pr) − (1 − α) \sum_{r∈[R]} w_r f_r(x_pr) ∇f_r(x_pr) ∋ 0
⇔ x_pr = arg min_x \|x − x_0\|^2_W + \sum_{r∈[R]} [f_r′(x)]^2,   (19)
where x_0 = D^{-1} p_0, W = (α/(1−α)) D, and f_r′ = \sqrt{w_r} f_r.
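In the graph special case, each [f_r′(x)]^2 in (19) is the quadratic w_r(x_i − x_j)^2 of an edge, so the minimization reduces to a linear system involving the graph Laplacian. The sketch below uses this reduction as a sanity check of the equivalence with (18); for general (un)directed hypergraphs the squared Lovász extensions are not quadratic forms, and one would instead call a QDSFM solver (e.g., the RCD method of Section 3) on (19).

```python
import numpy as np

def pagerank_via_qdsfm_graph(A, p0, alpha):
    """Graph special case of (19): with f_r(x) = x_i - x_j for an edge r = (i, j),
    the stationarity condition becomes (alpha/(1-alpha)) D (x - x_0) + L x = 0 with
    L = D - A, and pr(alpha, p_0) = D x_pr."""
    deg = A.sum(axis=1)
    D = np.diag(deg)
    L = D - A
    beta = alpha / (1.0 - alpha)
    x0 = p0 / deg                                    # x_0 = D^{-1} p_0
    x_pr = np.linalg.solve(beta * D + L, beta * D @ x0)
    return D @ x_pr                                  # the PR vector pr(alpha, p_0)

A = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)
p0 = np.array([1.0, 0.0, 0.0])
print(pagerank_via_qdsfm_graph(A, p0, alpha=0.2))
```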
2. To track the diffusion of p(t) (17), one has to be careful in terms of deciding which components of p(t)
are allowed to change within an infinitesimal time. The selection procedure should guarantee that after
an infinitesimal time p(t) → p(t) + dp(t), M (p(t)) and M (p(t) + dp(t)) satisfy M (p(t)) = M ′ p(t) and
M (p(t) + dp(t)) = M ′ p(t) + M ′ dp(t) for some shared choice of the matrix M ′ . Here, M ′ captures the
active diffusion network that remains unchanged in [t, t + dt). In general, M ′ will transition through
discrete states in a continuous-time domain. Determining which components of p(t) are allowed to change
is equivalent to solving a densest set problem.
where SH (γ) = {i|i ∈ H, bi ≥ γ} and ST (δ) = {j|j ∈ T, bj ≤ δ}. The optimal values of γ
and δ are required to simultaneously satisfy △γ = 0 and △δ = 0. Algorithm 5 is designed
to find such values of γ and δ. The “search” for (γ, δ) starts from (maxi∈H bi , minj∈T bj ),
and one gradually decreases γ and increases δ while ensuring △γ = −△δ (see Steps 5-10 in
Algorithm 5). The reason for imposing the constraint △γ = −△δ during the procedure is to
decrease the number of tuning parameters. As a result, the constraint implicitly transforms
the procedure into a simple line search. The complexity of Algorithm 5 is dominated by the
sorting step (Step 1), which requires O(|Sr | log |Sr |) operations, as wH , wT , △γ , and △δ
can all be efficiently tracked within the inner loops.
Combining the projection method in Algorithm 5 and Corollary 9 with the parameter choice
β = α/(1 − α), we arrive at the following result summarizing the overall complexity of running
the PageRank process on (un)directed hypergraphs.
Corollary 13 The PageRank vector on (un)directed hypergraphs can be obtained by solving (19).
A combination of Algorithm 1 (RCD) and Algorithm 5 returns an ε-optimal solution in expectation
with total computational complexity
O( N^2 max{1, 9(1−α)/α} max_{i,j∈[N]} (D_ii/D_jj) ( \sum_{r∈[R]} |S_r| log |S_r| ) log(1/ε) ).
x_{i_1} ≥ x_{i_2} ≥ ··· ≥ x_{i_N}.   (20)
3. Recall that personalized PageRank is a PR process with initial distribution p0 = 1i , for some vertex
i∈V.
Let S_j^p = {i_1, i_2, ..., i_j}. Recall the definition of the conductance Φ(·) of a set (15) and
evaluate
j^* = arg min_{j∈[N]} Φ(S_j^p).
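The sweep-cut selection of j^* can be organized as in the following sketch; the conductance oracle is left abstract so that the same routine applies to graphs and to the hypergraph conductance (15), for instance via the routine sketched in Section 5.

```python
import numpy as np

def best_sweep_cut(x, conductance_fn):
    """Sweep cut: order vertices by decreasing x, evaluate Phi of every prefix set
    S_j = {i_1, ..., i_j} (cf. (20)), and return the prefix with minimum conductance."""
    order = np.argsort(-x)
    best_set, best_phi = None, np.inf
    prefix = set()
    for v in order[:-1]:                  # skip the full vertex set
        prefix = prefix | {int(v)}
        phi = conductance_fn(prefix)
        if phi < best_phi:
            best_set, best_phi = set(prefix), phi
    return best_set, best_phi
```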
Note that the Lovász-Simonovits curve of π_V contains only one segment by its very defini-
tion. It is easy to check that I_p(z) is continuous and concave in z. Moreover, for any set
S ⊆ [N], we have p(S) ≤ I_p(vol(S)). We further write V_p = {j : x_{i_j} > x_{i_{j+1}}} and refer
to {vol(S_j^p)}_{j∈V_p} as the (exact) break points of I_p. The relevance of the Lovász-Simonovits
curve in the aforementioned context may be understood as follows. Although the Lovász-
Simonovits curve is piecewise linear, the changes of the slopes at the break points may be
used to describe its “curvatures”. A large (small) curvature corresponds to the case that
the distribution of the potential is far from (close to) the stationary distribution, which
further implies a small (large) conductance of the graphs/hypergraphs (see Theorem 16).
Therefore, analyzing the Lovász-Simonovits curvature is of crucial importance in the pro-
cess of bounding the conductance. The first step of this analysis is presented in Lemma 14,
which establishes the “curvatures” at the break points.
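The break points of I_p and its piecewise-linear evaluation can be computed directly from p and the degrees, as in the sketch below (ties in x only add collinear break points and do not change the curve).

```python
import numpy as np

def lovasz_simonovits_curve(p, deg):
    """Break points (vol(S_j^p), p(S_j^p)) of the Lovasz-Simonovits curve of p:
    vertices are sorted by decreasing x_i = p_i / d_i, and the curve is the
    piecewise-linear interpolation of these points, starting from (0, 0)."""
    order = np.argsort(-p / deg)
    vols = np.concatenate(([0.0], np.cumsum(deg[order])))
    mass = np.concatenate(([0.0], np.cumsum(p[order])))
    return vols, mass

def I_p(z, vols, mass):
    """Evaluate the curve at z in [0, vol(V)] by linear interpolation."""
    return np.interp(z, vols, mass)

# example: p concentrated on vertex 0 of a triangle graph with unit edge weights (d_i = 2)
p = np.array([0.8, 0.15, 0.05])
deg = np.array([2.0, 2.0, 2.0])
vols, mass = lovasz_simonovits_curve(p, deg)
print(I_p(3.0, vols, mass))   # 0.8 + 0.5*0.15 = 0.875
```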
I_p(vol(S_j^p)) ≤ (α/(2−α)) p_0(S_j^p) + ((1−α)/(2−α)) [ I_p(vol(S_j^p) − vol(∂S_j^p)) + I_p(vol(S_j^p) + vol(∂S_j^p)) ].
Furthermore, for k ∈ [0, m],
Remark 15 In comparison with the undirected graph case (Lemma 5 (Andersen et al.,
2006)), where the result holds for arbitrary S ⊆ V , our claim is true only for Sjp for which
j ∈ Vp . However, this result suffices to establish all relevant PR results.
Using the upper bound on the break points in Ip , we can now construct a curve that
uniformly bounds Ip using Φp .
Theorem 16 Let p = pr(α, p_0) be the previously defined PR vector, and let Φ_p be the
minimal conductance achieved by a sweep cut. For any integer t ≥ 0 and any k ∈ [0, m],
the following bound holds:
I_p(k) ≤ k/m + (α/(2−α)) t + \sqrt{ min{k, m−k} / min_{i:(p_0)_i>0} d_i } (1 − Φ_p^2/8)^t.
For graphs (Andersen et al., 2006), Theorem 16 may be used to characterize the graph
partitioning property of sweep cuts induced by a personalized PageRank vector. Similarly,
for general directed hypergraphs, we establish a similar result in Theorem 17.
Theorem 17 Let S be a set of vertices such that vol(S) ≤ m/2 and Φ(S) ≤ α/c, for some
constants α, c. If there exists a distribution P for sampling vertices v ∈ S such that
E_{i∼P}[pr(α, 1_i)(S̄)] ≤ (c/8) pr(α, π_S)(S̄), then, with probability at least 1/2,
Φ_{pr(α,1_i)} = O( \sqrt{ α log( vol(S)/d_i ) } ),
Remark 18 Note that in Theorem 17, we require that the sampling distribution P satisfy
E_{i∼P}[pr(α, 1_i)(S̄)] ≤ (c/8) pr(α, π_S)(S̄). For graphs, when we sample a vertex i with probability
proportional to its degree d_i, this condition is naturally satisfied, with c = 8. However, for
general (un)directed hypergraphs, the sampling procedure is non-trivial and harder to handle
due to the non-linearity of the random-walk operator M (·). We relegate a more in-depth
study of this topic to a companion paper.
where a^{(k)} describes the information provided by the known labels, i.e., a_i^{(k)} = 1 if y_i = k,
and 0 otherwise, β denotes a hyperparameter, and Ω is a smoothness regularizer. The labels
of the data points are estimated as
ŷ_i = arg max_k { x̂_i^{(k)} }.
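Operationally, this estimation rule amounts to one regularized QDSFM solve per class followed by a coordinate-wise argmax. A minimal sketch, with the solver left abstract (its interface here is an assumption made for illustration):

```python
import numpy as np

def predict_labels(solve_qdsfm, observed_labels, num_classes, num_vertices, beta):
    """One-vs-all semi-supervised prediction: build the indicator vector a^(k) from the
    observed labels for every class k, solve with a QDSFM solver (passed in as a callable
    returning \hat{x}^{(k)}), and assign each vertex the argmax class.
    observed_labels : dict {vertex index: class label} for the labeled points."""
    scores = np.zeros((num_classes, num_vertices))
    for k in range(num_classes):
        a_k = np.zeros(num_vertices)
        for i, y in observed_labels.items():
            a_k[i] = 1.0 if y == k else 0.0
        scores[k] = solve_qdsfm(a_k, beta)
    return scores.argmax(axis=0)            # \hat{y}_i = argmax_k \hat{x}_i^{(k)}
```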
reporting the results, we use the primal gap (“gap”) 4 to characterize the convergence
properties of different solvers.
bound on the primal gap. Therefore, the achieved value of the objective has high accuracy and thus can
be used as the optimal primal value.
Figure 3: Convergence results on synthetic datasets: gap vs CPU time of different QDSFM
solvers (with an average ± standard deviation).
2.93 2.55 2.23 0.00 1.47 0.00 0.78 0.00 6.81 6.04 5.71 5.41
QAP-SPE 14.9 15.0 12.6 13.2 7.33 8.10 4.07 3.80 9.51 9.21 8.14 7.09 2.7 × 102 0.85
PDHG 9.05 9.65 4.56 4.05 3.02 2.55 1.74 0.95 8.64 7.32 6.81 6.11 3.0 × 102 0.83
SGD 5.79 4.15 4.30 3.30 3.94 2.90 3.41 2.10 8.22 7.11 7.01 6.53 1.5 × 104 0.86
DRCD 44.7 44.2 46.1 45.3 43.4 44.2 45.3 44.6 9.97 9.97 9.96 9.97 3.8 × 106 0.85
InvLap 8.17 7.30 3.27 3.00 1.91 1.60 0.89 0.70 8.89 7.11 6.18 5.60 — 0.07
Table 2: Prediction results on synthetic datasets: classification error rates & Average 100
c(Sj ∗ ) for different solvers (MN: mean, MD: median).
Figure 4: Parameter sensitivity: the rate of a primal gap of QRCD after 2 × 105 iterations
with respect to different choices of the parameters β & maxi,j∈[N ] Wii /Wjj .
[Figure 5: three panels, for the Mushroom, Covertype45, and Covertype67 datasets; log10(gap)
versus CPU time for QRCD-SPE, QAP-SPE, PDHG, and SGD.]
Figure 5: Convergence of different solvers for QDFSM over three different real datasets.
by using QDSFM as the objective instead of DSFM or InvLap (Hein et al., 2013). In our
experiment, we focused on comparing the convergence of different solvers for QDSFM. We
set β = 100 and Wii = 1, for all i, and set the number of observed labels to 100, which is
a proper setting according to (Hein et al., 2013). Figure 5 shows the results. Again, the
proposed QRCD-SPE and QAP-SPE methods both converge faster than PDHG and SGD,
while QRCD-SPE performs the best. Note that we did not plot the results for QRCD-
MNP and QRCD-FW as these methods converge extremely slowly due to the large sizes
of the hyperedges. InvLap requires 22, 114 and 1802 seconds to run on the Mushroom,
Covertype45, and Covertype67 datasets, respectively. Hence, the method does not scale
well with the problem size.
7. Detailed Proofs
In all derivations that follow, we may exchange the order of minimization and maximization
(i.e., min max = max min) due to Proposition 2.2 of Ekeland and Temam (1999). Plugging
min_{φ_r≥0, y_r∈φ_rB_r, λ_r} max_λ  \sum_{r∈[R]} [ \|y_r − λ_r/\sqrt{R}\|^2_{W^{-1}} + φ_r^2 ] + ⟨λ, \sum_{r∈[R]} λ_r − 2Wa⟩
= min_{φ_r≥0, y_r∈φ_rB_r} max_λ min_{λ_r}  \sum_{r∈[R]} [ \|y_r − λ_r/\sqrt{R}\|^2_{W^{-1}} + φ_r^2 ] + ⟨λ, \sum_{r∈[R]} λ_r − 2Wa⟩
= min_{φ_r≥0, y_r∈φ_rB_r} max_λ  \sum_{r∈[R]} [ (1/4)\|λ\|^2_W + φ_r^2 ] + ⟨λ, \sqrt{R} \sum_{r∈[R]} (y_r − (1/2)Wλ) − 2Wa⟩
= min_{φ_r≥0, y_r∈φ_rB_r} max_λ  −(R/4)\|λ\|^2_W + ⟨λ, \sqrt{R} \sum_{r∈[R]} y_r − 2Wa⟩ + \sum_{r∈[R]} φ_r^2
= min_{φ_r≥0, y_r∈φ_rB_r}  \| \sum_{r∈[R]} y_r − 2Wa \|^2_{W^{-1}} + \sum_{r∈[R]} φ_r^2,
\| \sum_{r∈[R]} y_r − s \|_1 + (ρ/2) \|φ′ − φ\| ≥ \| \sum_{r∈[R]} (φ′_r/φ_r) y_r − s \|_1.
Proof For all r, let ỹ_r = y_r/φ_r ∈ B_r, and define a function that depends on φ,
h(φ) = \| \sum_{r∈[R]} y_r − s \|_1 = \| \sum_{r∈[R]} φ_r ỹ_r − s \|_1.
Then
h(φ′) = h(φ) + \int_{t=0}^{1} ⟨∇h|_{φ+t(φ′−φ)}, t(φ′ − φ)⟩ dt
≥ h(φ) − ( max_{t∈[0,1]} \|∇h|_{φ+t(φ′−φ)}\| / 2 ) \|φ′ − φ\|
≥ h(φ) − ( max_{ỹ_r∈B_r, ∀r} \sqrt{ \sum_{r∈[R]} \|ỹ_r\|_1^2 } / 2 ) \|φ′ − φ\| = h(φ) − (ρ/2) \|φ′ − φ\|.
Combining Lemma 20 with Lemma 19, we can establish the claim of Lemma 2.
First, let ρ(W^{(1)}) = max_{y∈⊗_{r∈[R]} B_r} \sqrt{ \sum_{r∈[R]} \|y_r\|^2_{W^{(1)}} }. Suppose that y′ ∈ ⊗_{r∈[R]} φ′_r B_r
is such that \sum_{r∈[R]} y′_r = s and it minimizes \sum_{r∈[R]} \| (φ′_r/φ_r) y_r − y′_r \|^2_{W^{(1)}}. As s lies in the base
polytope of \sum_{r∈[R]} φ′_r F_r, we have
\|y − y′\|_{I(W^{(1)})} ≤ \sum_{r∈[R]} \|y′_r − (φ′_r/φ_r) y_r\|_{W^{(1)}} + \sum_{r∈[R]} \|y_r − (φ′_r/φ_r) y_r\|_{W^{(1)}}
≤^{1)} (1/2) \sqrt{\sum_{i∈[N]} W^{(1)}_{ii}} \| \sum_{r∈[R]} (φ′_r/φ_r) y_r − s \|_1 + ρ(W^{(1)}) \|φ′ − φ\|
≤^{2)} (1/2) \sqrt{\sum_{i∈[N]} W^{(1)}_{ii}} [ \| \sum_{r∈[R]} y_r − s \|_1 + (ρ/2) \|φ′ − φ\| ] + ρ(W^{(1)}) \|φ′ − φ\|
≤ (1/2) \sqrt{\sum_{i∈[N]} W^{(1)}_{ii} \sum_{j∈[N]} 1/W^{(2)}_{jj}} \| \sum_{r∈[R]} y_r − s \|_{W^{(2)}} + ( (ρ/4) \sqrt{\sum_{i∈[N]} W^{(1)}_{ii}} + ρ(W^{(1)}) ) \|φ′ − φ\|
≤^{3)} (1/2) \sqrt{\sum_{i∈[N]} W^{(1)}_{ii} \sum_{j∈[N]} 1/W^{(2)}_{jj}} \| \sum_{r∈[R]} y_r − s \|_{W^{(2)}} + (3/2) \sqrt{\sum_{i∈[N]} W^{(1)}_{ii}} ρ \|φ′ − φ\|,
where 1) follows from Lemma 19 and the definition of ρ(W^{(1)}), 2) is a consequence of
Lemma 20, and 3) holds because
\sum_{i∈[N]} W^{(1)}_{ii} \sum_{r∈[R]} \|y_r\|_1^2 ≥ \sum_{r∈[R]} \|y_r\|^2_{W^{(1)}}.
7.2 Proof for the Linear Convergence Rate of the RCD Algorithm
7.2.1 Proof of Lemma 3
If (y^*, φ^*) is the optimal solution, then it must hold that \sum_{r∈[R]} y_r^* = 2W(a − x^*) because of
the duality between the primal and dual variables. Moreover, we also know that there must
exist a nonempty collection Y of points y′ ∈ ⊗_{r∈[R]} φ_r^* B_r such that \sum_{r∈[R]} y′_r = \sum_{r∈[R]} y_r^*.
The following lemma is a direct consequence of the optimality of y_r^{(k+1)} under the projection
Π_{C_r}.
Lemma 22 One has
⟨∇_r g((y^{(k)}, φ^{(k)})), (y_r^{(k+1)} − y_r^*, φ_r^{(k+1)} − φ_r^*)⟩
≤ 2⟨y_r^{(k)} − y_r^{(k+1)}, y_r^{(k+1)} − y_r^*⟩_{W^{-1}} + 2⟨φ_r^{(k)} − φ_r^{(k+1)}, φ_r^{(k+1)} − φ_r^*⟩.
The following lemma follows from a simple manipulation of the Euclidean norm.
Lemma 23 It holds that
\|y_r^{(k+1)} − y_r^{(k)}\|^2_{W^{-1}} + (φ_r^{(k+1)} − φ_r^{(k)})^2
Our next task is to determine how the objective function decreases in each iteration.
The following expectation is with respect to uniformly sampled values of r ∈ [R] in the k-th
iteration:
E[ g(y^{(k+1)}, φ^{(k+1)}) − g(y^{(k)}, φ^{(k)}) ]
= E[ ⟨∇_r g(y^{(k)}, φ^{(k)}), (y_r^{(k+1)} − y_r^{(k)}, φ_r^{(k+1)} − φ_r^{(k)})⟩ + \|y_r^{(k+1)} − y_r^{(k)}\|^2_{W^{-1}} + (φ_r^{(k+1)} − φ_r^{(k)})^2 ]
= E[ ⟨∇_r g(y^{(k)}, φ^{(k)}), (y_r^* − y_r^{(k)}, φ_r^* − φ_r^{(k)})⟩ + ⟨∇_r g(y^{(k)}, φ^{(k)}), (y_r^{(k+1)} − y_r^*, φ_r^{(k+1)} − φ_r^*)⟩
  + \|y_r^{(k+1)} − y_r^{(k)}\|^2_{W^{-1}} + (φ_r^{(k+1)} − φ_r^{(k)})^2 ]
≤^{1)} E[ ⟨∇_r g(y^{(k)}, φ^{(k)}), (y_r^* − y_r^{(k)}, φ_r^* − φ_r^{(k)})⟩ − \|y_r^{(k+1)} − y_r^*\|^2_{W^{-1}} + \|y_r^* − y_r^{(k)}\|^2_{W^{-1}}
  − (φ_r^{(k+1)} − φ_r^*)^2 + (φ_r^* − φ_r^{(k)})^2 ]
≤^{2)} (1/R) ⟨∇g(y^{(k)}, φ^{(k)}), (y^* − y^{(k)}, φ^* − φ^{(k)})⟩ − E[ \|y^{(k+1)} − y^*\|^2_{I(W^{-1})} + \|φ^{(k+1)} − φ^*\|^2 ]
  + \|y^{(k)} − y^*\|^2_{I(W^{-1})} + \|φ^{(k)} − φ^*\|^2   (26)
≤^{3)} (2/((μ+1)R)) [ g(y^*, φ^*) − g(y^{(k)}, φ^{(k)}) ] + (1 − 2/((μ+1)R)) [ \|y^{(k)} − y^*\|^2_{I(W^{-1})} + \|φ^{(k)} − φ^*\|^2 ]
  − E[ \|y^{(k+1)} − y^*\|^2_{I(W^{-1})} + \|φ^{(k+1)} − φ^*\|^2 ].   (27)
Consequently,
E[ g(y^{(k+1)}, φ^{(k+1)}) − g(y^*, φ^*) + d^2((y^{(k+1)}, φ^{(k+1)}), Ξ) ]
≤ E[ g(y^{(k+1)}, φ^{(k+1)}) − g(y^*, φ^*) + \|y^{(k+1)} − y^*\|^2_{I(W^{-1})} + \|φ^{(k+1)} − φ^*\|^2 ]
≤ (1 − 2/((μ+1)R)) E[ g(y^{(k)}, φ^{(k)}) − g(y^*, φ^*) + d^2((y^{(k)}, φ^{(k)}), Ξ) ].
The claimed result follows by iterating the above derivation for all values of k.
By eliminating λr , we arrive at
[d_{ΨW^{-1}}((y, φ), Z)]^2 = \| \sum_r y_r − 2W(a − x^*) \|^2_{W^{-1}} + \sum_r (φ_r − φ_r^*)^2.
As φ_r^* is the unique optimum, it follows that φ_r^* = φ′_r. Also, \sum_r y′_r = 2W(a − x^*). Moreover,
as
\|y − y′\|^2_{ΨW^{-1}} + \sum_r (φ_r − φ′_r)^2 ≥ [d_{ΨW^{-1}}((y, φ), Ξ)]^2,
Again, due to the definition of the distance d_{ΨW^{-1}}((y, φ), Ξ), we have
Moreover, because of the way we chose (y′, φ′) and due to the fact that C is convex, we have
where 1) follows from the Cauchy-Schwarz inequality over the entries y_{r,i}, i ∈ S_r.
Finally, we set φ = φ′′ = φ^* and combine (28)-(31) to obtain
ρ^2 ≤ 4 \sum_{i∈[N]} D_ii.
Proof For each r, consider y_r ∈ B_r. Sort the entries of y_r in descending order; without
loss of generality, assume that the ordering reads y_{r,i_1} ≥ y_{r,i_2} ≥ ··· ≥ y_{r,i_N}. As F_r([N]) =
\sum_{j=1}^N y_{r,i_j} ≥ 0, we have y_{r,i_1} ≥ 0. If y_{r,i_N} ≥ 0, then \|y_r\|_1 = \sum_{k=1}^N y_{r,i_k} = F_r([N]) ≤
max_{S⊆[N]} F_r(S). If y_{r,i_N} < 0, there exists a k′ such that y_{r,i_{k′}} ≥ 0 and y_{r,i_{k′+1}} < 0. Given
the definition of B_r, we have
\sum_{k=k′+1}^N |y_{r,i_k}| ≤ \sum_{k=1}^{k′} |y_{r,i_k}| ≤ F_r({i_1, i_2, ..., i_{k′}}) ≤ max_{S⊆[N]} F_r(S),
and thus \|y_r\|_1 ≤ 2 max_{S⊆[N]} F_r(S). Moreover, as each variable in [N] is incident to at least
one submodular function, we have
\sum_{r∈[R]} max_{S⊆[N]} [F_r(S)]^2 ≤ \sum_{i∈[N]} \sum_{r: i∈S_r} max_{S⊆[N]} [F_r(S)]^2 ≤ \sum_{i∈[N]} D_ii.
\sum_{i∈[N]} W_ii \sum_{j∈[N]} 1/W_jj ≤ N^2 max_{i,j} W_ii/W_jj = N^2 max_{i,j} D_ii/D_jj,
and
ρ^2 \sum_{j∈[N]} 1/W_jj ≤^{1)} 4 \sum_{i∈[N]} D_ii \sum_{j∈[N]} 1/W_jj ≤ (4/β) N^2 max_{i,j} D_ii/D_jj,
where 1) follows from Lemma 24. According to the definition of μ(W^{-1}, W^{-1}) (see (8)),
μ(W^{-1}, W^{-1}) ≤ N^2 max{1, 9β^{-1}} max_{i,j} D_ii/D_jj.
Similarly, we have
μ(ΨW^{-1}, W^{-1}) ≤ N^2 max{1, 9β^{-1}} max_{i,j} (Ψ_jj D_ii)/D_jj.
If the coefficients λ (λ_i ∈ R, ∀i) minimize h̃(S, λ), we call the corresponding triple (y, S, λ)
a good triple. Given a triple (y, S, λ), we also define
△(y, q) = −⟨y − a, q⟩ − \sum_{q_i∈S} λ_i,
△(y) = max_{q∈B} △(y, q) = −min_{q∈B} ⟨y − a, q⟩ − \sum_{q_i∈S} λ_i,
and
err(y) = h(y, \sum_{q_i∈S} λ_i) − h^*.
Lemma 27 If (y, S, λ) is a good triple, then △(y) ≥ err(y)/\|a\|_{W̃}, where we recall that
err(y) = h(y, \sum_{q_i∈S} λ_i) − h^*.
moreover, if (y (k) , φ(k) ) triggers a MAJOR loop, the claimed inequality is strict.
The following lemma characterizes the pair (y, φ) at the point when the MNP method
terminates.
Lemma 29 In the MAJOR loop at iteration k, if ⟨y^{(k)} − a, q^{(k)}⟩_{W̃} + φ^{(k)} ≥ −δ, then
h(y^{(k)}, φ^{(k)}) ≤ h^* + \|a\|_{W̃} δ.
Lemma 30 If Algorithm 2 does not terminate, then for any ε > 0, one can guarantee that
after O(N \|a\|^2_{W̃} max{Q^2, 1}/ε) iterations, Algorithm 2 generates a pair (y, φ) that satisfies
err(y) ≤ ε.
The proofs of Lemma 28 and Lemma 29 are fairly straightforward, while the proof of
Lemma 30 is significantly more involved and postponed to the next section.
Proof of Lemma 28. Suppose that (y (k) , φ(k) ) starts a MAJOR loop. As
Consider next the relationship between (z^{(k)}, \sum_{q_i∈S^{(k)}∪{q^{(k)}}} α_i) and (y^{(k)}, φ^{(k)}). Because of
Step 6, we know that
h(z^{(k+1)}, \sum_{q_i∈S^{(k)}∪{q^{(k)}}} α_i) = h̃(S^{(k)} ∪ {q^{(k)}}) ≤ h(y^{(k)} + εq^{(k)}, φ^{(k)} + ε) < h(y^{(k)}, φ^{(k)}),
which naturally satisfies the claimed condition. If (y^{(k+1)}, φ^{(k+1)}) is generated in some MINOR
loop, then (y^{(k+1)}, φ^{(k+1)}) lies strictly within the segment between (z^{(k+1)}, \sum_{q_i∈S^{(k)}∪{q^{(k)}}} α_i)
and (y^{(k)}, φ^{(k)}) (because θ > 0). Therefore, we also have h(y^{(k+1)}, φ^{(k+1)}) < h(y^{(k)}, φ^{(k)}). If
(y^{(k)}, φ^{(k)}) starts a MINOR loop, then we have
h(z^{(k+1)}, \sum_{q_i∈S^{(k)}} α_i) = h̃(S^{(k)}) ≤ h(y^{(k)}, φ^{(k)}),
once again due to Step 6. As (y^{(k+1)}, φ^{(k+1)}) still lies within the segment between
(z^{(k+1)}, \sum_{q_i∈S^{(k)}} α_i) and (y^{(k)}, φ^{(k)}), we have h(y^{(k+1)}, φ^{(k+1)}) ≤ h(y^{(k)}, φ^{(k)}).
Proof of Lemma 29. Lemma 29 is a corollary of Lemma 27. To see why this is the
case, observe that in a MAJOR loop, (y (k) , S (k) , λ(k) ) is always a good triple. Since
where 1) follows from the Cauchy-Schwarz inequality and 2) is due to Lemma 25. Since
△(y, q) > 0, letting φ → ∞ reduces equation (32) to △^2(y, q) / (\|q\|^2_{W̃} + 1).
Next, using Lemma 33, we may characterize the decrease of the objective function for
one MAJOR loop with no MINOR loop. As (y (k+1) , S (k+1) , λ(k+1) ) is a good triple and y (k)
also lies in lin(S), we have the following result.
Next, we characterize the decrease of the objective function for one MAJOR loop with
one single MINOR loop.
Lemma 35 Consider some MAJOR loop k with only one MINOR loop. Then
err(y^{(k)}) − err(y^{(k+2)}) ≥ △^2(y^{(k)}) / (Q^2 + 1).
Proof Suppose that the active sets associated with y^{(k)}, y^{(k+1)}, y^{(k+2)} are S^{(k)}, S^{(k+1)}, S^{(k+2)},
respectively. We know that within the MINOR loop, (z^{(k)}, S^{(k)} ∪ {q^{(k)}}, α) is a good triple
and y^{(k+1)} = θ y^{(k)} + (1 − θ) z^{(k)}, for some θ ∈ [0, 1]. Let
A = \|y^{(k)} − a\|^2_{W̃} + ( \sum_{q_i∈S^{(k)}} λ_i^{(k)} )^2 − \|z^{(k)} − a\|^2_{W̃} − ( \sum_{q_i∈S^{(k)}∪{q^{(k)}}} α_i )^2.   (33)
Note that both S^{(k)} and S^{(k+1)} are subsets of S^{(k)} ∪ {q^{(k)}}. As (z^{(k)}, S^{(k)} ∪ {q^{(k)}}, α) is a
good triple, using Lemma 25, we obtain
⟨z^{(k)} − a, z^{(k)} − y^{(k)}⟩_{W̃} + ⟨ \sum_{q_i∈S^{(k)}∪{q^{(k)}}} α_i, \sum_{q_i∈S^{(k)}∪{q^{(k)}}} (α_i − λ_i^{(k)}) ⟩ = 0.
Furthermore, as y^{(k+1)} = θy^{(k)} + (1−θ)z^{(k)} = z^{(k)} − θ(z^{(k)} − y^{(k)}) and λ^{(k+1)} = α − θ(α − λ^{(k)}),
we have
\|y^{(k)} − a\|^2_{W̃} + ( \sum_{i∈S^{(k)}} λ_i^{(k)} )^2 − \|y^{(k+1)} − a\|^2_{W̃} − ( \sum_{i∈S^{(k+1)}} λ_i^{(k+1)} )^2 = (1 − θ^2) A.
Moreover, we have
△(y^{(k+1)}, q^{(k)}) = θ△(y^{(k)}, q^{(k)}) + (1 − θ)△(z^{(k)}, q^{(k)}) = θ△(y^{(k)}, q^{(k)}) = θ△(y^{(k)}),   (35)
which holds because △(y, q) is linear in y and Lemma 25 implies △(z^{(k)}, q^{(k)}) = 0.
Since, according to Lemma 28, h(y^{(k)}, φ^{(k)}) > h(y^{(k+1)}, φ^{(k+1)}), Lemma 4 in (Chakrabarty
et al., 2014) also holds in our case, and thus q^{(k)} ∈ S^{(k+1)}. To obtain y^{(k+2)} and S^{(k+2)},
one needs to remove active points with zero coefficients from S^{(k+1)}, so that y^{(k+2)} once
again belongs to a good triple with the corresponding S^{(k+2)}. Based on (33) and equation (35),
we have the following result:
\|y^{(k+1)} − a\|^2_{W̃} + ( \sum_{q_i∈S^{(k+1)}} λ_i^{(k+1)} )^2 − \|y^{(k+2)} − a\|^2_{W̃} − ( \sum_{q_i∈S^{(k+2)}} λ_i^{(k+2)} )^2   (36)
Combining Lemma 34, Lemma 35 and Lemma 27 establishes Theorem 31.
2p(S_j^p) − \sum_{r∈[R]} w_r f_r(x) ⟨∇f_r(x), 1_{S_j^p}⟩
= 2p(S_j^p) − \sum_{r∈[R]} w_r F_r(S_j^p) max_{(i,j)∈S_r^↑×S_r^↓} (x_i − x_j)
= I_p( vol(S_j^p) − vol(∂S_j^p) ) + I_p( vol(S_j^p) + vol(∂S_j^p) ).   (39)
where 1) is due to Lemma 12 and 2) is due to equation (39). This proves the first inequality.
By using the concavity of I_p(·), we also have
I^{(t)}(k) = k/m + (α/(2−α)) t + \sqrt{ k̄ / d_min } (1 − Φ_p^2/8)^t.
When t = 0, I_p(k) ≤ I_{p_0}(k) holds due to Lemma 14. As I^{(0)}(m) = I^{(0)}(d_min) = 1, for
k ∈ [d_min, m], we have
Since for k ∈ [0, d_min], I_{p_0}(k) is linear, we also have I_{p_0}(0) = 0 ≤ I^{(t)}(0). Hence, for
k ∈ [0, d_min], it also holds that I_{p_0}(k) ≤ I^{(0)}(k).
Next, suppose that for step t, I_p(k) ≤ I^{(t)}(k) holds. We then consider the case t + 1.
For k = vol(S_j^p), Lemma 14 indicates that
I_p(k) ≤ (α/(2−α)) I_{p_0}(k) + ((1−α)/(2−α)) [ I_p(k − k̄Φ(S_j^p)) + I_p(k + k̄Φ(S_j^p)) ]
≤^{1)} (α/(2−α)) I_{p_0}(k) + ((1−α)/(2−α)) [ 2k/m + (2α/(2−α)) t
  + ( \sqrt{(k − k̄Φ(S_j^p))/d_min} + \sqrt{(k + k̄Φ(S_j^p))/d_min} ) (1 − Φ_p^2/8)^t ]
≤^{2)} k/m + (α/(2−α)) (t + 1) + ((1−α)/(2−α)) ( \sqrt{(k − k̄Φ(S_j^p))/d_min} + \sqrt{(k + k̄Φ(S_j^p))/d_min} ) (1 − Φ_p^2/8)^t
≤^{3)} k/m + (α/(2−α)) (t + 1) + ((1−α)/(2−α)) ( \sqrt{(k̄ − k̄Φ(S_j^p))/d_min} + \sqrt{(k̄ + k̄Φ(S_j^p))/d_min} ) (1 − Φ_p^2/8)^t
≤^{4)} k/m + (α/(2−α)) (t + 1) + ((2−2α)/(2−α)) \sqrt{k̄/d_min} (1 − Φ_p^2/8)^{t+1} ≤ I^{(t+1)}(k),
where 1) follows from Lemma 14; 2) is due to I_{p_0}(k) ≤ 1; 3) can be verified by considering
two cases separately, namely k ≤ m/2 and k ≥ m/2; and 4) follows from Φ(S_j^p) ≥ Φ_p and the
Taylor-series expansion of \sqrt{x ± Φx}.
At this point, we have shown that Ip (k) ≤ I (t) (k) for all k = vol(Sjp ). Since {vol(Sjp )}j∈Vp
covers all break points of Ip and since I (t) is concave, we have that for all k ∈ [0, m], it holds
that Ip (k) ≤ I (t) (k). This concludes the proof.
Therefore,
Φ_{pr(α,1_i)} ≤ \sqrt{ 32α \ln( 100 vol(S)/d_i ) }.
μ(β^{-1}W^{-1}, β^{-1}W^{-1}) =^{1)} max{ (N^2/2) max_{i,j∈[N]} W_ii/W_jj + 1, (9/4) ρ^2 N β^{-1} max_{j∈[N]} 1/W_jj }
=^{2)} max{ (N^2/2) max_{i,j∈[N]} W_ii/W_jj + 1, (9/2) β^{-1} N R max_{j∈[N]} 1/W_jj }
=^{3)} max{ (N^2/2) max_{i,j∈[N]} W_ii/W_jj + 1, (9/2) β^{-1} N^2 max_{i,j∈[N]} W_ii/W_jj },   (41)
where 1) holds because half of the values of Wii are set to 1 and the other half to a value in
{1, 0.1, 0.01, 0.001}, 2) follows from (40) and 3) is due to the particular setting N = R and
maxi∈[N ] Wii = 1. Equation (41) may consequently be rewritten as
8. Acknowledgement
The authors would like to thank the reviewers for their insightful suggestions to improve
the quality of the manuscript. The authors also gratefully acknowledge funding from the
NSF CIF 1527636 and the NSF Science and Technology Center (STC) at Purdue University,
Emerging Frontiers of Science of Information, 0939370.
References
Reid Andersen, Fan Chung, and Kevin Lang. Local graph partitioning using pagerank
vectors. In Proceedings of the 47th Annual Symposium on Foundations of Computer
Science, pages 475–486. IEEE Computer Society, 2006.
Reid Andersen, Fan Chung, and Kevin Lang. Local partitioning for directed graphs using
pagerank. Internet Mathematics, 5(1-2):3–22, 2008.
Austin R Benson, David F Gleich, and Jure Leskovec. Higher-order organization of complex
networks. Science, 353(6295):163–166, 2016.
Béla Bollobás. Modern graph theory, volume 184. Springer Science & Business Media, 2013.
Deeparnab Chakrabarty, Prateek Jain, and Pravesh Kothari. Provable submodular mini-
mization using Wolfe’s algorithm. In Advances in Neural Information Processing Systems,
pages 802–809, 2014.
Antonin Chambolle and Jérôme Darbon. On total variation minimization and surface
evolution using parametric maximum flows. International Journal of Computer Vision,
84(3):288, 2009.
Antonin Chambolle and Thomas Pock. A first-order primal-dual algorithm for convex
problems with applications to imaging. Journal of Mathematical Imaging and Vision, 40
(1):120–145, 2011.
T-H Hubert Chan, Zhihao Gavin Tang, Xiaowei Wu, and Chenzi Zhang. Diffusion operator
and spectral analysis for directed hypergraph laplacian. arXiv preprint arXiv:1711.01560,
2017.
T-H Hubert Chan, Anand Louis, Zhihao Gavin Tang, and Chenzi Zhang. Spectral properties
of hypergraph laplacian and approximation algorithms. Journal of the ACM (JACM), 65
(3):15, 2018.
Ivar Ekeland and Roger Temam. Convex Analysis and Variational Problems, volume 28.
Siam, 1999.
Alina Ene and Huy Nguyen. Random coordinate descent methods for minimizing decompos-
able submodular functions. In Proceedings of the International Conference on Machine
Learning, pages 787–795, 2015.
Alina Ene, Huy Nguyen, and László A Végh. Decomposable submodular function minimiza-
tion: discrete and continuous. In Advances in Neural Information Processing Systems,
pages 2874–2884, 2017.
Marguerite Frank and Philip Wolfe. An algorithm for quadratic programming. Naval
Research Logistics, 3(1-2):95–110, 1956.
Satoru Fujishige. Submodular functions and optimization, volume 58. Elsevier, 2005.
Satoru Fujishige and Shigueo Isotani. A submodular function minimization algorithm based
on the minimum-norm base. Pacific Journal of Optimization, 7(1):3–17, 2011.
David F Gleich, Lek-Heng Lim, and Yongyang Yu. Multilinear pagerank. SIAM Journal
on Matrix Analysis and Applications, 36(4):1507–1541, 2015.
Zaid Harchaoui, Anatoli Juditsky, and Arkadi Nemirovski. Conditional gradient algorithms
for norm-regularized smooth convex optimization. Mathematical Programming, 152(1-2):
75–112, 2015.
Matthias Hein, Simon Setzer, Leonardo Jost, and Syama Sundar Rangapuram. The to-
tal variation on hypergraphs-learning on hypergraphs revisited. In Advances in Neural
Information Processing Systems, pages 2427–2435, 2013.
Stefanie Jegelka, Francis Bach, and Suvrit Sra. Reflection methods for user-friendly sub-
modular optimization. In Advances in Neural Information Processing Systems, pages
1313–1321, 2013.
Rie Johnson and Tong Zhang. On the effectiveness of laplacian normalization for graph
semi-supervised learning. Journal of Machine Learning Research, 8(Jul):1489–1517, 2007.
Sanjiv Kumar and Martial Hebert. Discriminative random fields: A discriminative frame-
work for contextual interaction in classification. In Proceedings of IEEE International
Conference on Computer Vision, pages 1150–1157. IEEE, 2003.
Pan Li, Hoang Dau, Gregory Puleo, and Olgica Milenkovic. Motif clustering and overlapping
clustering for social network analysis. In INFOCOM 2017-IEEE Conference on Computer
Communications, IEEE, pages 1–9. IEEE, 2017.
Francesco Locatello, Michael Tschannen, Gunnar Rätsch, and Martin Jaggi. Greedy algo-
rithms for cone constrained optimization with convergence guarantees. In Advances in
Neural Information Processing Systems, pages 773–784, 2017.
László Lovász and Miklós Simonovits. The mixing rate of markov chains, an isoperimetric
inequality, and computing the volume. In Proceedings of the 31st Annual Symposium on
Foundations of Computer Science, page 1. IEEE Computer Society, 1990.
Robert Nishihara, Stefanie Jegelka, and Michael I Jordan. On the convergence rate of
decomposable submodular function minimization. In Advances in Neural Information
Processing Systems, pages 640–648, 2014.
Stanley Osher, Martin Burger, Donald Goldfarb, Jinjun Xu, and Wotao Yin. An iterative
regularization method for total variation-based image restoration. Multiscale Modeling &
Simulation, 4(2):460–489, 2005.
Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The pagerank citation
ranking: Bringing order to the web. Technical report, Stanford InfoLab, 1999.
N. Z. Shor, Krzysztof C. Kiwiel, and Andrzej Ruszcayǹski. Minimization Methods for Non-
differentiable Functions. Springer-Verlag, Berlin, Heidelberg, 1985. ISBN 0-387-12763-1.
Philip Wolfe. Convergence theory in nonlinear programming. Integer and nonlinear pro-
gramming, pages 1–36, 1970.
Hao Yin, Austin R Benson, Jure Leskovec, and David F Gleich. Local higher-order graph
clustering. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowl-
edge Discovery and Data Mining, pages 555–564. ACM, 2017.
Chenzi Zhang, Shuguang Hu, Zhihao Gavin Tang, and TH Hubert Chan. Re-revisiting
learning on hypergraphs: confidence interval and subgradient method. In Proceedings of
the International Conference on Machine Learning, pages 4026–4034, 2017.
Dawei Zhou, Si Zhang, Mehmet Yigit Yildirim, Scott Alcorn, Hanghang Tong, Hasan
Davulcu, and Jingrui He. A local algorithm for structure-preserving graph cut. In Pro-
ceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery
and Data Mining, pages 655–664. ACM, 2017.
Denny Zhou, Olivier Bousquet, Thomas N Lal, Jason Weston, and Bernhard Schölkopf.
Learning with local and global consistency. In Advances in Neural Information Processing
Systems, pages 321–328, 2004.
Denny Zhou, Jiayuan Huang, and Bernhard Schölkopf. Learning with hypergraphs: Clus-
tering, classification, and embedding. In Advances in Neural Information Processing
Systems, pages 1601–1608, 2007.
Xiaojin Zhu, Zoubin Ghahramani, and John D. Lafferty. Semi-supervised learning using
gaussian fields and harmonic functions. In Proceedings of the 20th International Confer-
ence on Machine Learning, pages 912–919, 2003a.
Xiaojin Zhu, John Lafferty, and Zoubin Ghahramani. Combining active learning and semi-
supervised learning using gaussian fields and harmonic functions. In ICML 2003 workshop
on the continuum from labeled to unlabeled data in machine learning and data mining,
volume 3, 2003b.