Quadratic Decomposable Submodular Function Minimization: Theory and Practice


Pan Li [email protected]
Department of Computer Science
Purdue University
West Lafayette, IN 47907, USA

Niao He [email protected]
Department of Industrial and Enterprise Systems Engineering
University of Illinois at Urbana-Champaign
Champaign, IL 61820, USA
Olgica Milenkovic [email protected]
Department of Electrical and Computer Engineering
University of Illinois at Urbana-Champaign
Champaign, IL 61820, USA

Abstract
We introduce a new convex optimization problem, termed quadratic decomposable submodular function minimization (QDSFM), which allows one to model a number of learning tasks
on graphs and hypergraphs. The problem exhibits close ties to decomposable submodu-
lar function minimization (DSFM) yet is much more challenging to solve. We approach
the problem via a new dual strategy and formulate an objective that can be optimized
through a number of double-loop algorithms. The outer loop uses either random coordinate descent (RCD) or alternative projection (AP) methods, for both of which we prove linear convergence rates. The inner loop computes projections onto cones generated by base
polytopes of the submodular functions via the modified min-norm-point or Frank-Wolfe al-
gorithms. We also describe two new applications of QDSFM: hypergraph-adapted PageR-
ank and semi-supervised learning. The proposed hypergraph-based PageRank algorithm
can be used for local hypergraph partitioning and comes with provable performance guaran-
tees. For hypergraph-adapted semi-supervised learning, we provide numerical experiments
demonstrating the efficiency of our QDSFM solvers and their significant improvements in prediction accuracy when compared to state-of-the-art methods.
Keywords: Submodular functions, Lovász extensions, Random coordinate descent,
Frank-Wolfe method, PageRank.

1. Introduction
Given [N ] = {1, 2, ..., N }, a submodular function F : 2[N ] → R is a set function that for
any S1 , S2 ⊆ [N ] satisfies F (S1 ) + F (S2 ) ≥ F (S1 ∪ S2 ) + F (S1 ∩ S2 ) (Edmonds, 2003). Sub-
modular functions are ubiquitous in machine learning as they capture rich combinatorial
properties of set functions and provide useful regularizers for supervised and unsupervised
learning objectives (Bach, 2013). Submodular functions are endowed with convex Lovász


extensions (Lovász, 1983), which provide the means to establish connections between com-
binatorial and continuous optimization problems (Fujishige, 2005).
Due to their versatility, submodular functions and their Lovász extensions are frequently
used in applications such as learning on directed/undirected graphs and hypergraphs (Zhu
et al., 2003a; Zhou et al., 2004; Hein et al., 2013), image denoising via total variation
regularization (Osher et al., 2005; Chambolle and Darbon, 2009), and maximum a posteriori
probability (MAP) inference in high-order Markov random fields (Kumar and Hebert, 2003).
In many optimization settings involving submodular functions, one encounters the convex
program

min_x ∑_{i∈[N]} (x_i − a_i)² + ∑_{r∈[R]} [f_r(x)]^p,    (1)

where a ∈ RN , p ∈ {1, 2}, and for all r in some index set [R], fr is the Lovász extension
of a submodular function Fr that describes some combinatorial structure over the set [N ].
For example, in image denoising, the parameter ai may correspond to the observed value
of pixel i, while the functions [fr (x)]p may be used to impose smoothness constraints on
the pixel neighborhoods. One of the main difficulties in solving the above optimization
problem comes from the nondifferentiability of the second term: a direct application of
subgradient methods leads to convergence rates as slow as 1/√k, where k denotes the
number of iterations (Shor et al., 1985).
In recent years, the above optimization problem for p = 1, referred to as decompos-
able submodular function minimization (DSFM) (Stobbe and Krause, 2010), has received
significant attention. The motivation for studying this particular setup is two-fold: first,
solving the convex optimization problem directly recovers the combinatorial solution to the
submodular min-cut problem (Fujishige, 2005), which takes the form min_{S⊆[N]} F(S), where

F(S) = ∑_{r∈[R]} F_r(S) − 2 ∑_{i∈S} a_i.

Second, minimizing a submodular function decomposed into a sum of simpler components F_r, r ∈ [R], is significantly easier than minimizing an unrestricted submodular function
F over a large set [N ]. There are several milestone results for solving the DSFM prob-
lem: Jegelka et al. (2013) first tackled the problem by considering its dual and proposed a
solver based on Douglas-Rachford splitting. Nishihara et al. (2014) established linear con-
vergence rates of alternating projection methods for solving the dual problem. Ene and
Nguyen (2015); Ene et al. (2017) presented linear convergence rates of coordinate descent
methods and subsequently tightened known results via the use of submodular flows. The
work in Li and Milenkovic (2018a) improved previous methods by leveraging incidence
relations between the arguments of submodular functions and their constituent parts.
We focus on the case p = 2 and refer to the underlying optimization problem as quadratic
DSFM (QDSFM). QDSFM appears naturally in a wide spectrum of applications, including
learning on graphs and hypergraphs, and in particular, semi-supervised learning and PageR-
ank analysis, to be described in more detail later. Moreover, it has been demonstrated both
theoretically (Johnson and Zhang, 2007) and empirically (Zhou et al., 2004; Hein et al.,
2013) that using quadratic regularizers in (1) improves the predictive performance for semi-
supervised learning when compared to the case p = 1. Despite the importance of QDSFM,


it has not received the same level of attention as DSFM, neither from the theoretical nor
algorithmic perspective. To our best knowledge, only a few reported works (Hein et al.,
2013; Zhang et al., 2017) provided solutions for specific instances of QDSFM with sublin-
ear convergence guarantees. For the general QDSFM problem, no analytical results are
currently known.
Our work takes a substantial step towards solving the QDSFM problem in its most
general form by developing a family of algorithms with linear convergence rates and small
iteration cost, including the randomized coordinate descent (RCD) and alternative projec-
tion (AP) algorithms. Our contributions are as follows.

1. First, we derive a new dual formulation for the QDSFM problem since an analogue of
the dual transformation for the DSFM problem is not applicable. Interestingly, the
dual QDSFM problem requires one to find the best approximation of a hyperplane
via a product of cones as opposed to a product of polytopes, encountered in the dual
DSFM problem.

2. Second, we develop linearly convergent RCD and AP algorithms for solving the dual
QDSFM problem. Because of the special underlying conic structure, a new analytic
approach is needed to prove the weak strong-convexity of the dual QDSFM problem,
which essentially guarantees linear convergence.

3. Third, we develop generalized Frank-Wolfe (FW) and min-norm-point methods for efficiently computing the conic projection required in each step of RCD and AP and
provide a 1/k-rate convergence analysis. These FW-type algorithms and their corre-
sponding analyses do not rely on the submodularity assumption and apply to general
conic projections, and are therefore of independent interest.

4. Fourth, we introduce a novel application of the QDSFM problem, related to the PageRank (PR) process for hypergraphs. The main observation is that PR state
vectors can be efficiently computed by solving QDSFM problems. We also show that
many important properties of the PR process for graphs carry over to the hypergraph
setting, including mixing and local hypergraph partitioning results.

5. Finally, we evaluate our methods on semi-supervised learning over hypergraphs using synthetic and real datasets, and demonstrate superior performance both in conver-
gence rate and prediction accuracy compared to existing general-purpose methods.

2. Problem Formulations
We first introduce some terminology and notation useful for our subsequent exposition.
Lovász extension and base polytope. For a submodular function F defined over the ground set [N], the Lovász extension is a convex function f : ℝ^N → ℝ, defined for all x ∈ ℝ^N according to

f(x) = ∑_{k=1}^{N−1} F({i_1, ..., i_k})(x_{i_k} − x_{i_{k+1}}) + F([N]) x_{i_N},    (2)

where x_{i_1} ≥ x_{i_2} ≥ · · · ≥ x_{i_N}, and i_1, ..., i_N are distinct indices in [N]. For a vector x ∈ ℝ^N and a set S ⊆ [N], define x(S) = ∑_{i∈S} x_i, where x_i stands for the value of the i-th dimension of x. The base polytope of F, denoted by B, is defined as

B = {y ∈ ℝ^N | y(S) ≤ F(S), ∀S ⊂ [N], y([N]) = F([N])}.    (3)

Using the base polytope, the Lovász extension can also be written as f(x) = max_{y∈B} ⟨y, x⟩. We say that an element i ∈ [N] is incident to F if there exists an S ⊂ [N] such that F(S) ≠ F(S ∪ {i}).
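For illustration, a minimal Python sketch (our own, not part of the formal development) of how (2) can be evaluated is given below: sort the coordinates of x in decreasing order and accumulate marginal gains of F over the prefix sets; the resulting greedy vector q lies in the base polytope B and attains f(x) = ⟨q, x⟩.

    import numpy as np

    def lovasz_extension(F, x):
        """F: callable on frozensets of indices with F(frozenset()) = 0; x: 1-D array.
        Returns (f(x), greedy point q in the base polytope)."""
        N = len(x)
        order = np.argsort(-x)              # i_1, ..., i_N with x[i_1] >= ... >= x[i_N]
        q = np.zeros(N)
        prev, prefix = 0.0, set()
        for i in order:
            prefix.add(int(i))
            cur = F(frozenset(prefix))
            q[i] = cur - prev               # marginal gain F(S_k) - F(S_{k-1})
            prev = cur
        return float(q @ x), q

    # Example: cut function of a single undirected hyperedge {0, 1, 2} in a ground set of size 4.
    F = lambda S: 1.0 if 0 < len(S & {0, 1, 2}) < 3 else 0.0
    val, q = lovasz_extension(F, np.array([0.7, -0.2, 0.1, 0.5]))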
Notations. Given a positive diagonal matrix W ∈ ℝ^{N×N} and a vector x ∈ ℝ^N, we define the W-norm according to ‖x‖_W = ( ∑_{i=1}^N W_ii x_i² )^{1/2}, and simply use ‖·‖ when W = I, the identity matrix. For an index set [R], we denote the R-fold product of N-dimensional Euclidean spaces by ⊗_{r∈[R]} ℝ^N. A vector y ∈ ⊗_{r∈[R]} ℝ^N is written as (y_1, y_2, ..., y_R), where y_r ∈ ℝ^N for all r ∈ [R]. The W-norm induced on ⊗_{r∈[R]} ℝ^N equals ‖y‖_{I(W)} = ( ∑_{r=1}^R ‖y_r‖²_W )^{1/2}. We reserve the symbol ρ for max_{y_r∈B_r, ∀r} ( ∑_{r∈[R]} ‖y_r‖₁² )^{1/2}. Furthermore, we use (x)_+ to denote the function max{x, 0}.

2.1 The QDSFM Problem


Consider a collection of submodular functions {Fr }r∈[R] defined over the ground set [N ], and
denote their Lovász extensions and base polytopes by {fr }r∈[R] and {Br }r∈[R] , respectively.
Let Sr ⊆ [N ] denote the set of variables incident to Fr and assume that the functions
Fr are normalized and nonnegative, i.e., that Fr (∅) = 0 and Fr ≥ 0. These two mild
constraints are satisfied by almost all submodular functions that arise in practice. Note
that we do not impose monotonicity constraints which are frequently assumed and used in
submodular function optimization; nevertheless, our setting naturally applies to the special
case of monotonically increasing submodular functions.
The QDSFM problem is formally stated as follows:

QDSFM:   min_{x∈ℝ^N} ‖x − a‖²_W + ∑_{r∈[R]} [f_r(x)]²,    (4)

where a ∈ ℝ^N is a given vector and W ∈ ℝ^{N×N} is a positive diagonal matrix. The QDSFM problem is convex because the Lovász extensions f_r are convex and nonnegative. In fact,
it is also strongly convex, implying that there exists a unique optimal solution, denoted
by x∗ . The quadratic form [fr (x)]2 may be generalized to q(fr (x)), where the mapping
q(·) : R≥0 → R≥0 satisfies the monotonicity, strong convexity and smoothness conditions.
In this case, subsequent results such as those pertaining to the transformation (5) may
be obtained by using the convex conjugate of q(·), leading to similar conclusions as those
described in Section 3 for the quadratic form. Our focus is on quadratic forms as they
correspond to important applications described in Sections 5 and 6.
As a remark, in contrast to the DSFM problem, we are not aware of any discrete
optimization problems that directly correspond to the QDSFM problem. However, as we will
show in Section 5, QDSFM can be used to compute the PageRank vectors for hypergraphs


which itself is a relevant discrete optimization question. Moreover, as will be demonstrated in subsequent sections, the obtained PageRank vectors are helpful for approximating the
hypergraph conductance, another important discrete optimization objective.

2.2 Dual Formulation


Despite being convex, the objective is in general nondifferentiable. This implies that only
sublinear convergence can be obtained when directly applying the subgradient method. To
address this issue, we resort to the dual formulation of the QDSFM problem. A natural idea
is to mimic the approach used for DSFM by exploiting the characterization of the Lovász
extension, f_r(x) = max_{y_r∈B_r} ⟨y_r, x⟩, ∀r. However, this leads to a semidefinite programming
problem for the dual variables {yr }r∈[R] , which is too expensive to solve for large problems.
Instead, we establish a new dual formulation that overcomes this obstacle.
The dual formulation hinges upon the following simple yet key observation:

[f_r(x)]² = max_{φ_r≥0} { φ_r f_r(x) − φ_r²/4 } = max_{φ_r≥0} max_{y_r∈φ_r B_r} { ⟨y_r, x⟩ − φ_r²/4 },    (5)

where the first maximum is attained at φ_r = 2f_r(x), since max_{φ≥0}(φt − φ²/4) = t² for any t ≥ 0.

Let y = (y1 , y2 , ..., yR ) and φ = (φ1 , φ2 , ..., φR ). For each r, we define a convex cone

Cr = {(yr , φr )|φr ≥ 0, yr ∈ φr Br }

which represents the feasible set of the variables (yr , φr ). Furthermore, we denote the
product of cones by

C = ⊗_{r∈[R]} C_r := {(y, φ) : φ_r ≥ 0, y_r ∈ φ_r B_r, ∀r ∈ [R]}.

Invoking equation (5), we arrive at the following two dual formulations for the original
QDSFM problem in Lemma 1.

Lemma 1 The following optimization problem is dual to (4):

min_{y,φ} g(y, φ) := ‖ ∑_{r∈[R]} y_r − 2Wa ‖²_{W^{-1}} + ∑_{r∈[R]} φ_r²,   s.t. (y, φ) ∈ C.    (6)

By introducing Λ = (λ_r) ∈ ⊗_{r∈[R]} ℝ^N, the previous optimization problem takes the form

min_{y,φ,Λ} ∑_{r∈[R]} ( ‖y_r − λ_r/√R‖²_{W^{-1}} + φ_r² ),   s.t. (y, φ) ∈ C, ∑_{r∈[R]} λ_r = 2Wa.    (7)

The primal variables in both cases may be computed as x = a − (1/2) W^{-1} ∑_{r∈[R]} y_r.

The proof of Lemma 1 is presented in Section 7.1.1. The dual formulations for the
DSFM problem were described in Lemma 2 of (Jegelka et al., 2013). Similarly to the
DSFM problem, the dual formulations (6) and (7) are simple, nearly separable in the dual
variables, and may be solved by algorithms such as those used for solving dual DSFM,
including the Douglas-Rachford splitting method (DR) (Jegelka et al., 2013), the alternative


projection method (AP) (Nishihara et al., 2014), and the random coordinate descent method
(RCD) (Ene and Nguyen, 2015).
However, there are also some notable differences. Our dual formulations for the QDSFM
problem are defined on a product of cones constructed from the base polytopes of the sub-
modular functions. The optimization problem (7) essentially asks for the best approxima-
tion of an affine space in terms of the product of cones, whereas in the DSFM problem one
seeks an approximation in terms of a product of polytopes. In the next section, we propose
to solve the dual problem (6) using the random coordinate descent method (RCD), and to
solve (7) using the alternative projection method (AP), both tailored to the conic structure.
The conic structures make the convergence analysis and the computations of projections
for these algorithms challenging.
Before proceeding with the description of our algorithms, we first provide a relevant geometric property of the product of cones ⊗_{r∈[R]} C_r, as stated in the following lemma. This property is essential in establishing the linear convergence of the RCD and AP algorithms.

Lemma 2 Consider a feasible solution of the problem (y, φ) ∈ ⊗_{r∈[R]} C_r and a nonnegative vector φ′ = (φ′_r) ∈ ⊗_{r∈[R]} ℝ_{≥0}. Let s be an arbitrary point in the base polytope of ∑_{r∈[R]} φ′_r F_r, and let W^{(1)}, W^{(2)} be two positive diagonal matrices. Then, there exists a y′ ∈ ⊗_{r∈[R]} φ′_r B_r satisfying ∑_{r∈[R]} y′_r = s and

‖y − y′‖²_{I(W^{(1)})} + ‖φ − φ′‖² ≤ µ(W^{(1)}, W^{(2)}) ( ‖ ∑_{r∈[R]} y_r − s ‖²_{W^{(2)}} + ‖φ − φ′‖² ),

where

µ(W^{(1)}, W^{(2)}) = max{ ∑_{i∈[N]} W^{(1)}_{ii} ∑_{j∈[N]} 1/W^{(2)}_{jj},  (9/4) ρ² ∑_{i∈[N]} W^{(1)}_{ii} + 1 },    (8)

and ρ = max_{y_r∈B_r, ∀r∈[R]} ( ∑_{r∈[R]} ‖y_r‖₁² )^{1/2}.

The proof of Lemma 2 is relegated to Section 7.1.2.

3. Linearly Convergent Algorithms for Solving the QDSFM Problem


Next, we introduce and analyze the random coordinate descent method (RCD) for solving
the dual problem (6), and the alternative projection method (AP) for solving (7). Both
methods exploit the separable structure of the feasible set. It is worth mentioning that
our results may be easily extended to the Douglas-Rachford splitting method as well as
accelerated and parallel variants of the RCD method used to solve DSFM problems (Jegelka
et al., 2013; Ene and Nguyen, 2015; Li and Milenkovic, 2018a).
We define the projection Π onto a convex cone C_r as follows: for a point b in ℝ^N and a positive diagonal matrix W′ in ℝ^{N×N}, we set

Π_{C_r}(b; W′) = arg min_{(y_r, φ_r)∈C_r} ‖y_r − b‖²_{W′} + φ_r².

Throughout the remainder of this section, we treat the projections as provided by an oracle.
Later in Section 4, we provide some efficient methods for computing the projections.


Algorithm 1: The RCD Method for Solving (6)

0: For all r, initialize y_r^(0) ← 0, φ_r^(0) ← 0, and k ← 0
1: In iteration k:
2:   Uniformly at random pick an r ∈ [R].
3:   (y_r^(k+1), φ_r^(k+1)) ← Π_{C_r}(2Wa − ∑_{r′≠r} y_{r′}^(k); W^{-1})
4:   Set y_{r′}^(k+1) ← y_{r′}^(k) for r′ ≠ r

3.1 The Random Coordinate Descent (RCD) Algorithm


Consider the dual formulation (6). For each coordinate r, optimizing over the dual variables
(yr , φr ) is equivalent to computing a projection onto the cone Cr . This gives rise to the
RCD method of Algorithm 1.
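A schematic Python rendering of Algorithm 1 is sketched below (illustrative only; the function names, and in particular the projection oracle project_cone, are our own assumptions, e.g. supplied by the routines of Section 4):

    import numpy as np

    def rcd_qdsfm(a, W_diag, project_cone, R, num_iters, rng=None):
        """a: vector in R^N, W_diag: diagonal of W;
        project_cone(r, b, W_inv_diag) -> (y_r, phi_r) projects b onto the cone C_r in the W^{-1}-norm."""
        rng = rng or np.random.default_rng(0)
        N = len(a)
        y = np.zeros((R, N))                      # dual variables y_r
        phi = np.zeros(R)
        target = 2.0 * W_diag * a                 # the vector 2Wa in (6)
        W_inv_diag = 1.0 / W_diag
        for _ in range(num_iters):
            r = rng.integers(R)                   # pick a block uniformly at random
            rest = target - (y.sum(axis=0) - y[r])            # 2Wa - sum_{r' != r} y_{r'}
            y[r], phi[r] = project_cone(r, rest, W_inv_diag)  # Step 3 of Algorithm 1
        x = a - 0.5 * W_inv_diag * y.sum(axis=0)  # primal recovery from Lemma 1
        return x, y, phi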
The objective g(y, φ) described in (6) is not strongly convex in general. However, with
some additional work, Lemma 2 can be used to establish the weak strong convexity of
g(y, φ); this essentially guarantees a linear convergence rate of the RCD algorithm. To
proceed, we need some additional notation. Denote the set of solutions of (6) by

Ξ = {(y, φ) | ∑_{r∈[R]} y_r = 2W(a − x∗), φ_r = inf{θ : y_r ∈ θB_r}, ∀r}.    (9)

Note that the optimal φr , as defined in (9), is closely related to gauge functions of poly-
topes (Bach, 2013). This representation is a consequence of the relationship between the
optimal primal and dual solutions stated in Lemma 1. For convenience, we denote the
optimal value of the objective function of interest over (y, φ) ∈ Ξ by g∗ = g(y, φ), and also
define a distance function

d((y, φ), Ξ) = min_{(y′,φ′)∈Ξ} ( ‖y − y′‖²_{I(W^{-1})} + ‖φ − φ′‖² )^{1/2}.

Lemma 3 (Weak Strong Convexity) Suppose that (y, φ) ∈ ⊗_{r∈[R]} C_r and that (y∗, φ∗) ∈ Ξ minimizes ‖y − y′‖²_{I(W^{-1})} + ‖φ − φ′‖² over (y′, φ′) ∈ Ξ, i.e., (y∗, φ∗) is the projection of (y, φ) onto Ξ. Then,

‖ ∑_{r∈[R]} (y_r − y_r∗) ‖²_{W^{-1}} + ‖φ − φ∗‖² ≥ d²((y, φ), Ξ) / µ(W^{-1}, W^{-1}).    (10)

Lemma 3 can be proved by applying Lemma 2 with φ′ = φ∗, s = 2W(a − x∗), W^{(1)} = W^{(2)} = W^{-1}, and the definition of y∗. Note that the claim in (10) is significantly weaker than the usual strong convexity condition. We show that such a condition is sufficient to ensure linear convergence of the RCD algorithm, as stated in the following theorem.
Theorem 4 (Linear Convergence of RCD) Running k iterations of Algorithm 1 produces a pair (y^(k), φ^(k)) that satisfies

E[ g(y^(k), φ^(k)) − g∗ + d²((y^(k), φ^(k)), Ξ) ]
   ≤ ( 1 − 2 / (R[1 + µ(W^{-1}, W^{-1})]) )^k [ g(y^(0), φ^(0)) − g∗ + d²((y^(0), φ^(0)), Ξ) ].


Algorithm 2: The AP Method for Solving (11)

0: For all r, initialize y_r^(0) ← 0, φ_r^(0) ← 0, and k ← 0
1: In iteration k:
2:   α^(k+1) ← 2W^{-1} ∑_r y_r^(k) − 4a.
3:   For all r ∈ [R]:
4:     λ_{r,i}^(k+1) ← y_{r,i}^(k) − (1/2)(Ψ^{-1} W α^(k+1))_i for i ∈ S_r
5:     (y_r^(k+1), φ_r^(k+1)) ← Π_{C_r}(λ_r^(k+1); ΨW^{-1})

Theorem 4 asserts that at most O(Rµ(W^{-1}, W^{-1}) log(1/ε)) iterations are required to obtain an ε-optimal solution in expectation for the QDSFM problem. The proofs of Lemma 3 and Theorem 4 are postponed to Section 7.2.

3.2 The Alternative Projection (AP) Algorithm


The AP method can be used to solve the dual problem (7), which takes the form of a best-approximation problem, by alternately performing projections between the product of cones and a hyperplane. Furthermore, for some incidence relations, S_r may be a proper subset of [N], which consequently requires the i-th component of y_r, i.e., y_{r,i}, to be zero if i ∉ S_r. Enforcing λ_{r,i} = 0 for i ∉ S_r allows the AP method to avoid redundant computations and achieve better convergence rates. This phenomenon has also been observed for
the DSFM problem in (Djolonga and Krause, 2015; Li and Milenkovic, 2018a). To avoid
redundant computations, we use the AP approach to solve the following dual problem:

min_{y,φ,Λ} ∑_{r∈[R]} ( ‖y_r − λ_r‖²_{ΨW^{-1}} + φ_r² ),    (11)
s.t. (y, φ) ∈ C,  ∑_{r∈[R]} λ_r = 2Wa,  and λ_{r,i} = 0 for all i ∉ S_r.

Here, Ψ ∈ RN ×N is a positive diagonal matrix in which Ψii = |{r ∈ [R]|i ∈ Sr }| equals the
number of submodular functions that i is incident to.
Lemma 5 Problem (11) is equivalent to problem (6).
The proof of Lemma 5 is presented in Section 7.3.1. The AP method for solving (11) is
listed in Algorithm 2. Observe that Step 5 is a projection onto cones defined based on
the positive diagonal matrix ΨW −1 which differs from the one used in the RCD method.
Note that compared to the RCD algorithm, AP requires one to compute projections onto
all cones C_r in each iteration. Thus, in most cases, AP incurs a larger computation cost
than the RCD algorithm. On the other hand, AP naturally lends itself to parallelization
since the projections can be decoupled and implemented very efficiently.
Before moving to the convergence analysis, we first prove the uniqueness of the optimal φ in the following lemma.
Lemma 6 The optimal value of φ for problem (11) is unique.
Proof We know that (11) is equivalent to (6). Suppose there are two optimal solutions (ȳ, φ̄) and (ỹ, φ̃). As the optimal x∗ is unique, ∑_{r∈[R]} ȳ_r = ∑_{r∈[R]} ỹ_r = 2W(a − x∗) = ∑_{r∈[R]} (ȳ_r + ỹ_r)/2. If φ̄ ≠ φ̃, we have ∑_{r∈[R]} ((φ̄_r + φ̃_r)/2)² < ∑_{r∈[R]} φ̄_r² = ∑_{r∈[R]} φ̃_r², which makes g((ȳ+ỹ)/2, (φ̄+φ̃)/2) smaller than g(ȳ, φ̄) = g(ỹ, φ̃) and thus causes a contradiction.

To determine the convergence rate of the AP method, we adapt the result of (Nishihara et al., 2014) on the convergence rate of APs between two convex bodies. In our setting, the two convex bodies of interest are the cone C and the hyperplane

Z = {(y, φ) | ∑_{r∈[R]} y_r = 2W(a − x∗), φ_r = φ_r∗, y_{r,i} = 0, ∀ i ∉ S_r},

where φ∗ = (φ_r∗)_{r∈[R]} is the unique optimal solution of (11).

Lemma 7 (Nishihara et al. (2014)) Let Ξ be as defined in (9). In addition, define the distance function

d_{ΨW^{-1}}((y, φ), Ξ) = min_{(y′,φ′)∈Ξ} ( ‖y − y′‖²_{I(ΨW^{-1})} + ‖φ − φ′‖² )^{1/2}.

In the k-th iteration of Algorithm 2, the pair (y^(k), φ^(k)) satisfies

d_{ΨW^{-1}}((y^(k), φ^(k)), Ξ) ≤ 2 d_{ΨW^{-1}}((y^(0), φ^(0)), Ξ) (1 − 1/κ∗²)^k,

where

κ∗ = sup_{(y,φ)∈(C∪Z)\Ξ}  d_{ΨW^{-1}}((y, φ), Ξ) / max{ d_{ΨW^{-1}}((y, φ), C), d_{ΨW^{-1}}((y, φ), Z) }.

The next lemma establishes a finite upper bound on κ∗, which guarantees linear convergence rates for the AP algorithm.

Lemma 8 One has κ∗² ≤ 1 + µ(ΨW^{-1}, W^{-1}).

The proof of Lemma 8 may be found in Section 7.3.2.


Lemma 8 implies that running the AP algorithm for O(µ(ΨW^{-1}, W^{-1}) log(1/ε)) iterations guarantees that d_{ΨW^{-1}}((y^(k), φ^(k)), Ξ) ≤ ε. Moreover, as Ψ_ii ≤ R, the iteration complexity
of the AP method is smaller than that of the RCD algorithm. However, each iteration of
the AP solver requires performing projections on all cones Cr , r ∈ [R], while each iteration
of the RCD method requires computing only one projection onto a single cone. Other
methods such as the DR and primal-dual hybrid gradient descent (PDHG) (Chambolle and
Pock, 2011) used in semi-supervised learning on hypergraphs, which is a specific QDSFM
problem (Hein et al., 2013), also require computing a total of R projections during each
iteration. Thus, from the perspective of iteration cost, RCD is significantly more efficient,
especially when R is large and computing Π(·) is costly. This phenomenon is also observed
in practice, as illustrated by the experiments in Section 6.1.
The following corollary summarizes the previous discussion and will be used in our
subsequent derivations in Section 5.1.


Corollary 9 Suppose that W = βD, where β is a hyper-parameter and D is a diagonal matrix such that

D_ii = ∑_{r∈[R]: i∈S_r} max_{S⊆V} [F_r(S)]².

Recall that Ψ_ii = |{r ∈ [R] | i ∈ S_r}|. Then, Algorithm 1 (RCD) requires an expected number of

O( N² R max{1, 9β^{-1}} max_{i,j∈[N]} (D_ii/D_jj) log(1/ε) )

iterations to return a solution (y, φ) that satisfies d((y, φ), Ξ) ≤ ε. Algorithm 2 (AP) requires a number of

O( N² R max{1, 9β^{-1}} max_{i,j∈[N]} (Ψ_jj D_ii)/(R D_jj) log(1/ε) )

iterations to return a solution (y, φ) that satisfies d_{ΨW^{-1}}((y, φ), Ξ) ≤ ε.
The proof of Corollary 9 is provided in Section 7.4. Note that the term N 2 R also appears
in the expression for the complexity of the RCD and the AP methods for solving the DSFM
problem (Ene et al., 2017). The term max{1, 9β −1 } implies that whenever β is small, the
convergence rate is slow. In Section 5.1, we provide a case-specific analysis showing that β
is related to the mixing time of the underlying Markov process in the PageRank application.
Depending on the application domain, other interpretations of β could exist, but will not
be discussed here to avoid clutter. It is also worth pointing out that the variable D in
Corollary 9 is closely related to the notion of degree in hypergraphs (Li and Milenkovic,
2018b; Yoshida, 2019).

4. Computing the Conic Projections


In this section, we describe efficient routines for computing a projection onto the conic
set ΠCr (·). As the procedure works for all values of r ∈ [R], we drop the subscript r for
simplicity of notation. Let C = {(y, φ)|y ∈ φB, φ ≥ 0}, where B denotes the base polytope
of the submodular function F . Recall that the conic projection onto C is defined as

Π_C(a; W̃) = arg min_{(y,φ)} h(y, φ) := ‖y − a‖²_{W̃} + φ²,   s.t. (y, φ) ∈ C.    (12)

Let h∗ and (y ∗ , φ∗ ) be the optimal value of the objective function and the optimal solution,
respectively. When performing projections, one only needs to consider the variables incident
to F , and set all other variables to zero. For simplicity, we assume that all variables in [N ]
are incident to F. The results easily extend to the case in which the incidence relations take a general form.
Unlike the QDSFM problem, solving the DSFM involves the computation of projec-
tions onto the base polytopes of submodular functions. Two algorithms, the Frank-Wolfe
(FW) method (Frank and Wolfe, 1956) and the Fujishige-Wolfe minimum norm algorithm
(MNP) (Fujishige and Isotani, 2011), are used for this purpose. Both methods assume in-
expensive linear minimization oracles on polytopes and guarantee a 1/k-convergence rate.
The MNP algorithm is more sophisticated yet empirically more efficient. Nonetheless, nei-
ther of these methods can be applied directly to conic projections. To this end, we modify these two methods by adjusting them to the conic structure in (12) and show that they still achieve a 1/k-convergence rate. We refer to the procedures as the conic FW method and the conic MNP method, respectively.


Algorithm 3: The Conic MNP Method for Solving (12)

Input: W̃, a, B, and a small positive constant δ. Maintain φ^(k) = ∑_{q_i∈S^(k)} λ_i^(k).
Choose an arbitrary q_1 ∈ B. Set S^(0) ← {q_1}, λ_1^(0) ← ⟨a, q_1⟩_{W̃}/(1 + ‖q_1‖²_{W̃}), y^(0) ← λ_1^(0) q_1, k ← 0
1. Iteratively execute (MAJOR LOOP):
2.   q^(k) ← arg min_{q∈B} ⟨∇_y h(y^(k), φ^(k)), q⟩_{W̃}
3.   If ⟨y^(k) − a, q^(k)⟩_{W̃} + φ^(k) ≥ −δ, break;
4.   Else S^(k) ← S^(k) ∪ {q^(k)}.
5.   Iteratively execute (MINOR LOOP):
6.     α ← arg min_α ‖ ∑_{q_i∈S^(k)} α_i q_i − a ‖²_{W̃} + ( ∑_{q_i∈S^(k)} α_i )²,
7.     z^(k) ← ∑_{q_i∈S^(k)} α_i q_i
8.     If α_i ≥ 0 for all i, break;
9.     Else θ ← min_{i:α_i<0} λ_i^(k)/(λ_i^(k) − α_i), λ_i^(k+1) ← θα_i + (1 − θ)λ_i^(k),
10.    y^(k+1) ← θz^(k) + (1 − θ)y^(k), S^(k+1) ← {i : λ_i^(k+1) > 0}, k ← k + 1
11.  y^(k+1) ← z^(k), λ^(k+1) ← α, S^(k+1) ← {i : λ_i^(k+1) > 0}, k ← k + 1


4.1 The Conic MNP Algorithm


Detailed steps of the conic MNP method are presented in Algorithm 3. The conic MNP algorithm keeps track of an active set S = {q_1, q_2, ...} and searches for the best solution in its conic hull. Denote the cone of an active set S by cone(S) = { ∑_{q_i∈S} α_i q_i | α_i ≥ 0 } and its linear span by lin(S) = { ∑_{q_i∈S} α_i q_i | α_i ∈ ℝ }. Similarly to the original MNP algorithm, Algorithm 3 also includes two levels of loops: the MAJOR and the MINOR loop. In the MAJOR loop, one greedily adds a new active point q^(k) to the set S, obtained from the linear minimization oracle w.r.t. the base polytope (Step 2). By the end of the MAJOR loop, we obtain y^(k+1) that minimizes h(y, φ) over cone(S) (Steps 3-11). The MINOR loop is activated when lin(S) contains some point z that guarantees a smaller value of the objective function than the optimal point in cone(S), provided that some active points from S may be removed. It is important to point out that, compared to the original MNP method, Steps 2 and 6 as well as the termination Step 3 are specialized for the conic structure.
Theorem 10 below shows that the conic MNP algorithm preserves the 1/k-convergence
rate of the original MNP method.
Theorem 10 Let B be an arbitrary polytope in RN and let C = {(y, φ)|y ∈ φB, φ ≥ 0}
be the cone induced by the polytope. For any positive diagonal matrix W̃ , define Q =
maxq∈B kqkW̃ . Algorithm 3 produces a sequence of pairs (y (k) , φ(k) )k=1,2,... such that h(y (k) , φ(k) )
decreases monotonically. It terminates when k = O(N kakW̃ max{Q2 , 1}/δ), with
h(y (k) , φ(k) ) ≤ h∗ + δkakW̃ .
Note that the result is essentially independent of the submodularity assumption and applies to general cones induced by arbitrary polytopes. Proving Theorem 10 requires a careful modification of the arguments in (Chakrabarty et al., 2014) to handle the conic structure. We provide a detailed proof in Section 7.5.


Algorithm 4: The Conic FW Algorithm for Solving (12)

Input: W̃, a, B, and a small positive δ
Initialize y^(0) ← 0, φ^(0) ← 0, and k ← 0
1. Iteratively execute the following steps:
2.   q^(k) ← arg min_{q∈B} ⟨∇_y h(y^(k), φ^(k)), q⟩
3.   If ⟨y^(k) − a, q^(k)⟩_{W̃} + φ^(k) ≥ −δ, break.
4.   Else: (γ_1^(k), γ_2^(k)) ← arg min_{γ_1≥0, γ_2≥0} h(γ_1 y^(k) + γ_2 q^(k), γ_1 φ^(k) + γ_2)
5.   y^(k+1) ← γ_1^(k) y^(k) + γ_2^(k) q^(k), φ^(k+1) ← γ_1^(k) φ^(k) + γ_2^(k), k ← k + 1.


4.2 The Conic FW Algorithm


We introduce next the conic FW method, summarized in Algorithm 4. Note that Steps
2, 4 are specialized for cones. The main difference between the FW and MNP methods
is the size of the active set: FW only maintains two active points, while MNP may maintain
as many as N points. A similar idea was proposed in (Harchaoui et al., 2015) to deal
with the conic constraints, with the following minor difference: our method directly finds
the variable y^(k+1) from cone({y^(k), q^(k)}) (Steps 4, 5), while (Harchaoui et al., 2015)
first determines finite upper bounds (γ̃1 , γ̃2 ) on the coefficients (γ1 , γ2 ) (see the notation
in Steps 4, 5) and then searches for the update y (k+1) in conv({γ̃1 y (k) , γ̃2 q (k) , 0}). Our
method simplifies this procedure by avoiding computations of the upper bounds (γ̃1 , γ̃2 )
and removing the constraints γ1 ≤ γ̃1 , γ2 ≤ γ̃2 in Step 4.
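A small Python sketch of the conic FW update follows (illustrative only; here the polytope B is assumed to be given explicitly through its vertex list, so the linear minimization oracle is a plain argmin over vertices, whereas for base polytopes of submodular functions one would use the greedy oracle instead):

    import numpy as np

    def h(y, phi, a, Wt):                       # objective in (12): ||y - a||_Wt^2 + phi^2
        return np.sum(Wt * (y - a) ** 2) + phi ** 2

    def _best_nonneg_pair(P, c):
        """Minimize g^T P g + 2 c^T g over g >= 0 for a 2x2 PSD matrix P."""
        cands = [np.zeros(2)]
        if P[0, 0] > 0:
            cands.append(np.array([max(0.0, -c[0] / P[0, 0]), 0.0]))
        if P[1, 1] > 0:
            cands.append(np.array([0.0, max(0.0, -c[1] / P[1, 1])]))
        g_unc, *_ = np.linalg.lstsq(P, -c, rcond=None)
        if np.all(g_unc >= 0):
            cands.append(g_unc)
        return min(cands, key=lambda g: g @ P @ g + 2 * c @ g)

    def conic_fw(a, Wt, vertices, delta=1e-8, max_iter=1000):
        y, phi = np.zeros(len(a)), 0.0
        for _ in range(max_iter):
            scores = vertices @ (Wt * (y - a))            # <y - a, q>_Wt for every vertex q
            q = vertices[np.argmin(scores)]
            if scores.min() + phi >= -delta:              # termination test (Step 3)
                break
            P = np.array([[y @ (Wt * y) + phi ** 2, y @ (Wt * q) + phi],
                          [y @ (Wt * q) + phi,      q @ (Wt * q) + 1.0]])
            c = np.array([-a @ (Wt * y), -a @ (Wt * q)])
            g1, g2 = _best_nonneg_pair(P, c)              # two-variable line search of Step 4
            y, phi = g1 * y + g2 * q, g1 * phi + g2
        return y, phi

The two-variable nonnegative quadratic program in Step 4 is solved here by enumerating the interior and boundary candidates, which suffices for a 2x2 convex quadratic.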
For completeness, we establish a 1/k-convergence rate for the conic FW algorithm in
Theorem 11.

Theorem 11 Let B be an arbitrary polytope in ℝ^N and let C = {(y, φ) | y ∈ φB, φ ≥ 0} denote its corresponding cone. For some positive diagonal matrix W̃, define Q = max_{q∈B} ‖q‖_{W̃}. Then, the point (y^(k), φ^(k)) generated by Algorithm 4 satisfies

h(y^(k), φ^(k)) ≤ h∗ + 2‖a‖²_{W̃} Q² / (k + 2).
The proof of Theorem 11 is deferred to Section 7.6.

4.3 Analysis of Time Complexity and Further Discussion


Let EO stand for the time needed to query the function value of F (·). The linear pro-
gram in Step 2 of both the MNP and FW algorithms requires O(N log N + N × EO)
operations if implemented as a greedy method. Step 6 of the MNP method requires solv-
ing a quadratic program with no constraints, and the solution may be obtained in closed
form using O(N |S|2 ) operations. The remaining operations in the MNP method intro-
duce a O(N ) complexity term. Hence, the execution of each MAJOR or MINOR loop
requires O(N log N + N × EO + N |S|2 ) operations. In the FW method, the optimization
problem in Step 4 is relatively simple, since it reduces to solving a nonnegative quadratic


program with only two variables and consequently introduces a complexity term O(N ).
Therefore, the complexity of the FW algorithm is dominated by Step 2, which requires
O(N log N + N × EO) operations.
Below we discuss the distinctions of our method from several relevant methods for com-
puting projections on cones. Compared to the linearly convergent away-steps Frank-Wolfe
algorithm (Wolfe, 1970) and its generalized versions for the conic case, such as the away-
steps/pairwise non-negative matching pursuit (AMP/PWMP) techniques recently proposed
in (Locatello et al., 2017), the size of the active set S in the (conic) MNP algorithm is al-
ways bounded by N instead of the (increasing) number of iterations. Particularly, when
applied to the QDSFM problems, the size of the active set S in the (conic) MNP algorithm
can be even smaller in practice as it is bounded by the number of elements that affects
the value of the submodular function, which is usually smaller than N (Li and Milenkovic,
2018a). Moreover, in terms of base polytopes of submodular functions, AMP/PWMP may
not perform as well as MNP because the steepest ascent over the active set may not be
executed as efficiently as the descent over polytopes obtained via the greedy approach (Step
2, which generally appears in FW-based methods).
Another benefit of the (conic) MNP method is that it can compute exact projections
for the problem (12) in finitely many steps. Because of the structure of the cones generated
from polytopes, the exact projections are expressed as linear combinations of the extreme
points of the polytopes. As the number of such combinations is finite, MNP can find the
exact projections in finitely many iterations, thereby inheriting a similar property of the
MNP method pertaining to polytopes (Fujishige and Isotani, 2011). While this property
also holds for the fully corrective non-negative matching pursuit (FCMP) (Locatello et al.,
2017), FCMP requires solving an additional nonnegative quadratic program in the inner
loop, and is thus much less efficient; MNP does not require an explicit solution of the
nonnegative quadratic program as it allows for an efficient update of the set of atoms, S (k) ,
described in Algorithm 3, through proper leveraging of the problem structure (minor loops
of Algorithm 3). In contrast, the AMP/PWMP methods in (Locatello et al., 2017) as well
as our conic FW method introduced in Section 4.2 may not ensure exact projections. As the
proposed algorithms for QDSFM depend highly on precise projections for the problem (12),
using MNP as the subroutine for computing projections has the potential to yield better
accuracy than other methods discussed above. This is further evidenced in our experiments,
to be discussed below.

4.4 Numerical Illustration I: Convergence of QRCD-MNP and QRCD-FW

We illustrate the convergence speeds of the RCD algorithm when using the conic MNP and
the conic FW methods as subroutines for computing the conic projections, referred to as
QRCD-MNP and QRCD-FW, respectively. We also compare them with two other methods,
QRCD-AMP and QRCD-PWMP, where the subroutine for computing the conic projection
is replaced by AMP and PWMP (Locatello et al., 2017), respectively. For this purpose, we
construct a synthetic QDSFM problem (4) by fixing N = 100, R = 100, and Wii = 1, for
all i ∈ [N ]. We generate each incidence set Sr by choosing uniformly at random a subset of
[N ] of cardinality 10, and then set the entries of a to be i.i.d. standard Gaussian.


Figure 1: Convergence results (log10(gap): mean ± std) for the QRCD-MNP, QRCD-FW, QRCD-AMP, QRCD-PWMP algorithms for general submodular functions. [Panels: θ = 0.25, 0.5, 1; axes: log10(gap) versus log10(cputime(s)+1).]

We consider the following submodular function: for r ∈ [R] and S ⊆ S_r,

F_r(S) = min{|S|, |S_r \ S|}^θ / (|S_r|/2)^θ,   θ ∈ {0.25, 0.5, 1}.    (13)

The number of RCD iterations is set to 300R = 3 × 104 . For the projections performed
in the inner loop, as only the conic MNP method is guaranteed to find the exact solution in
finite time, we constrain the number of iterations used by competing methods to ensure a
fair comparison. To this end, note that for the AMP and PWMP methods (Locatello et al., 2017), a parameter δ that controls the precision is computed as −γ‖d_t‖² (here we use the notation from Algorithm 3 of (Locatello et al., 2017)). For all algorithms, we terminate the inner loop by setting δ = 10^{-12} or upon reaching a maximum number of iterations |S_r|³ = 1000.
All algorithms were executed on a 3.2GHz Intel Core i5 machine. For each algorithm, the
average and variance of the primal gap are obtained via 100 independent tests. The con-
vergence results are shown in Figure 1, which demonstrates that QRCD-MNP outperforms the other three methods. The performance of QRCD-PWMP is satisfactory when θ = 1, but it deteriorates for smaller values of θ, which may be a consequence of the increasing number of extreme points of the cones/polytopes corresponding to the submodular functions (13). QRCD-FW and QRCD-AMP exhibit similar convergence behaviors but fail to provide high-precision solutions within reasonable CPU time.

4.5 Numerical Illustration II: Dependency on the Scale of the Submodular Functions
As shown in Theorems 10 and 11, the complexity of the conic MNP/FW methods depends
on the scale of the submodular functions F , specifically, (h(y (k) , φ(k) ) − h∗ )/kak2W̃ ∼ O(F 2 ).
We would like to evaluate such dependency. Since we are only concerned with the compu-
tations of projection, we focus on a single submodular function F (·) that once again takes
the form (13) with parameter θ = 0.25 and a ground set of cardinality N = |Sr | = 100.
The submodular function is further scaled as s ∗ F (·), where s is a constant in the set
{1, 2, 5, 10, 20}. We generate the entries of a in (12) at the same scale as the submodular
function, i.e., av ∼ N (0, s2 ). The proposed MNP, FW, and the AMP/PWMP methods
from (Locatello et al., 2017) were tested, and their CPU execution times are reported in Figure 2.


Figure 2: The required CPU time for different methods used to compute the projection (12) versus the scale of the submodular function s. [Axes: log10(cputime(ms)+1) versus s; methods: MNP, FW, PWMP, AMP.]

All methods terminate when (h(y^(k), φ^(k)) − h∗)/‖a‖²_{W̃} ≤ 10^{-5} over independent
runs.
As can be seen from Figure 2, the complexities of all methods increase as the submodular
function scales up, while the conic MNP method tends to be the most robust with respect
to scale changes. PWMP appears to be the most sensitive to scaling due to the growing
size of the active set with increasing s.

5. Applications to PageRank
In what follows, we introduce one important application of the QDSFM problem - PageRank
(PR). Our treatment of the PR process relies on diffusion processes over hypergraphs (Chan
et al., 2017, 2018). We demonstrate that what is known as a PR vector can be efficiently
computed using our proposed QDSFM solver. We also show that the newly introduced PR
retains important properties of the standard PR over graphs, pertaining both to mixing
and local partitioning properties (Andersen et al., 2006).

5.1 Background: PageRank (PR)


PR is a well-known learning algorithm over graphs, first introduced in the context of web-
page search (Page et al., 1999). Given a graph G = (V, E), with V = [N ] and E denoting
the edge set over V , we let A and D denote the adjacency matrix and diagonal degree
matrix of G, respectively. PR essentially reduces to finding a fixed point p ∈ RN of the
iterative process
p(t+1) = αp(0) + (1 − α)AD −1 p(t) ,
where p(0) ∈ RN is a fixed vector and α ∈ (0, 1). By using different exploration vectors
p^(0), PR can be adapted to different graph-based applications. For example, for webpage ranking (Page et al., 1999), p^(0) may correspond to a vertex chosen uniformly at random. For finding a local graph partition around a vertex i (Andersen et al., 2006), p^(0) is typically set to e_i, i.e., a zero-one vector with a single 1 in its i-th component.
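For concreteness, a small numpy illustration of this iteration (our own sketch, not part of the original development) is given below; its fixed point is the PR vector characterized in (14).

    import numpy as np

    def pagerank_graph(A, p0, alpha, num_iters=200):
        """A: symmetric adjacency matrix, p0: seed/exploration vector, alpha in (0, 1)."""
        d = A.sum(axis=1)                  # vertex degrees
        p = p0.copy()
        for _ in range(num_iters):
            p = alpha * p0 + (1 - alpha) * (A @ (p / d))   # p <- alpha p0 + (1 - alpha) A D^{-1} p
        return p

    # Personalized PR seeded at vertex 0 of a 4-cycle.
    A = np.array([[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]], dtype=float)
    p = pagerank_graph(A, p0=np.eye(4)[0], alpha=0.15)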


It is easy to verify that the fixed point p is a solution to the problem

min_p (α/(1−α)) ‖p − p^(0)‖²_{D^{-1}} + (D^{-1}p)^T (D − A)(D^{-1}p) = ‖x − a‖²_W + ⟨x, L(x)⟩,    (14)

where x = D^{-1}p, a = D^{-1}p^(0), W = (α/(1−α)) D, and L = D − A.
Recent work exploring the role of high-order functional relations among vertices in
networks (Benson et al., 2016; Li and Milenkovic, 2017; Li et al., 2017) brought forward
the need for studying PR-type problems for hypergraphs (Zhou et al., 2017; Yin et al.,
2017).1 There exist several definitions of the PR algorithm for hypergraphs. For example,
the approach in (Zhou et al., 2017) relies on what is termed a multilinear PR (Gleich et al.,
2015) and is limited to the setting of uniform hypergraphs, in which all hyperedges have the
same cardinality. Computing a multilinear PR requires tensor multiplication and the latter
method has complexity exponential in the size of the hyperedges. The PR formulation
in (Yin et al., 2017) requires one to first project hypergraphs onto graphs and then run
standard graph-based PR. The projection step usually causes large distortions when the
hyperedges are large and leads to low-quality hypergraph partitionings (Hein et al., 2013;
Li and Milenkovic, 2018b).
Using equation (14), we introduce a new formulation for PR on hypergraphs that lever-
ages nonlinear Laplacian operators L(·) for (un)directed hypergraphs (Chan et al., 2017,
2018) and submodular hypergraphs (Li and Milenkovic, 2018b; Yoshida, 2019). The new
PR may be applied to both uniform and non-uniform hypergraphs with arbitrarily large
hyperedges. The stationary point is the solution to a QDSFM problem, where the term
hx, L(x)i in (14) is represented by a sum of Lovász extensions of submodular functions
defined over hyperedges (See Table 1).
To demonstrate the utility of this particular form of PageRank, we show that the PR
vector obtained as a solution of the QDSFM can be used to find a cut of a directed hyper-
graph (as well as undirected hypergraph) with small conductance, which is an important
invariant for local partitioning and community detection (Zhou et al., 2017; Yin et al., 2017).
The results essentially generalize the mixing results for PR over graphs derived in (Andersen
et al., 2006) and (Andersen et al., 2008). Our main technical contribution is handling new
technical challenges that arise from the non-linearity of the operators involved. In what
follows, we mainly focus on directed hypergraphs. Tighter and more general results may
be obtained for submodular hypergraphs (Li and Milenkovic, 2018b; Yoshida, 2019), which
will be reported in a companion paper.

5.2 Terminology: Boundary, Volume, Conductance


Let G = (V, E) be a hypergraph with V = [N] and hyperedges in E equated with the incidence sets {S_r}_{r∈[R]}. More precisely, each hyperedge represents an incidence set S_r
associated with a triplet (wr , Hr , Tr ), where wr is a scalar weight and Hr and Tr ⊆ Sr are
the head and tail sets defined in Table 1. The pair (Hr , Tr ) defines a submodular function
Fr with incidence set Sr according to Fr (S) = 1 if |S ∩ Hr | ≥ 1 and |S̄ ∩ Tr | ≥ 1 (where
S̄ stands for the complement of S), and Fr (S) = 0 otherwise. The corresponding Lovász
1. Hypergraphs are natural extensions of graphs in which edges are replaced by hyperedges, representing
subsets of vertices in V of size ≥ 2.


Single component of ⟨x, L(x)⟩              | Combinatorial structure                                            | Submodular function
w_r (x_i − x_j)², S_r = {i, j}             | A graph (Zhou et al., 2004; Zhu et al., 2003b)                     | F_r(S) = w_{ij} if |S ∩ {i, j}| = 1
w_r max_{i,j∈S_r} (x_i − x_j)²             | A hypergraph (Hein et al., 2013)                                   | F_r(S) = w_r if |S ∩ S_r| ∈ [1, |S_r| − 1]
w_r max_{(i,j)∈H_r×T_r} (x_i − x_j)²_+     | A directed hypergraph (Zhang et al., 2017)                         | F_r(S) = w_r if |S ∩ H_r| ≥ 1, |([N]\S) ∩ T_r| ≥ 1
General [f_r(x)]²                          | A submodular hypergraph (Li and Milenkovic, 2018b; Yoshida, 2019)  | Any symmetric submodular function

Table 1: The term ⟨x, L(x)⟩ for different combinatorial structures. In the third column, whenever the stated conditions are not satisfied, it is assumed that F_r = 0. For directed hypergraphs, H_r and T_r are subsets of S_r which we subsequently term the head and the tail set. For H_r = T_r = S_r, one recovers the setting for undirected hypergraphs.

extension reads as fr (x) = max(i,j)∈Hr ×Tr (xi − xj )+ . If Sr contains only two vertices for all
values r ∈ [R], the hypergraph clearly reduces to a graph; and similarly, if Hr = Tr = Sr ,
directed hypergraphs reduce to undirected hypergraphs. We also define the degree of a vertex i as

d_i = ∑_{r: i∈S_r} w_r,

and let D denote the diagonal matrix of degree values. The volume of a set S ⊆ V is defined as

vol(S) = ∑_{i∈S} d_i.

Furthermore, setting m = ∑_{i∈V} d_i = vol(V), we define the boundary of a set S and its volume as

∂S = {r ∈ [R] | S ∩ H_r ≠ ∅, S̄ ∩ T_r ≠ ∅},   vol(∂S) = ∑_{r∈∂S} w_r = ∑_{r∈[R]} w_r F_r(S).

We also define a distribution π_S over V according to (π_S)_i = d_i / vol(S). Furthermore, using the boundary and volume, we also define the conductance of the set S as

Φ(S) = vol(∂S) / min{vol(S), vol(S̄)}.    (15)

The boundary, volume and conductance are all standard and well-studied graph-theoretic
concepts (Bollobás, 2013), generalized above to accommodate hypergraphs. The hypergraph


conductance is defined as the minimal value of the conductance over all possible subsets of V, that is,

Φ∗ = min_{S⊂V, S≠∅,V} Φ(S).    (16)

The hypergraph conductance is an important objective function used for detecting commu-
nities of good quality in networks with higher-order functional units (Benson et al., 2016).
Solving the optimization (16) naturally yields a two-way partition of V so that the corre-
sponding procedure is also referred to as hypergraph partitioning.
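The quantities above are simple to compute in practice; the following Python helper (our own illustrative code, which assumes S_r = H_r ∪ T_r) evaluates degrees, volumes, the boundary, and Φ(S) from (15) for a directed hypergraph given as a list of triplets (w_r, H_r, T_r):

    def conductance(S, hyperedges, N):
        degrees = [0.0] * N
        for w, H, T in hyperedges:
            for i in set(H) | set(T):          # S_r is taken to be the union of head and tail sets
                degrees[i] += w
        S = set(S)
        vol_S = sum(degrees[i] for i in S)
        vol_rest = sum(degrees) - vol_S
        # vol(boundary of S): hyperedges with a head vertex inside S and a tail vertex outside S
        vol_boundary = sum(w for w, H, T in hyperedges if set(H) & S and set(T) - S)
        return vol_boundary / min(vol_S, vol_rest)

    # Example: two directed hyperedges over 4 vertices.
    edges = [(1.0, {0, 1}, {2}), (2.0, {2}, {1, 3})]
    phi = conductance({0, 1}, edges, N=4)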

5.3 The PageRank Process as QDSFM Optimization


Next, we formally define the PR process for directed hypergraphs and show how it relates
to the QDSFM problem. We borrow the definition of a diffusion process (DP) over directed
hypergraphs proposed in (Chan et al., 2017, 2018) to describe our PageRank (PR) process.
To this end, we first need to define a Markov operator as follows.
Suppose that p ∈ ℝ^N is a potential over [N] and let x = D^{-1}p. For all hyperedges S_r, we let S_r^↓(x) = arg min_{i∈T_r} x_i and S_r^↑(x) = arg max_{j∈H_r} x_j. A hyperedge S_r is said to be active if max_{j∈H_r} x_j > min_{i∈T_r} x_i, and inactive otherwise.

1. For each active hyperedge S_r and for each pair of vertices (i, j) ∈ S_r^↑(x) × S_r^↓(x), we introduce a non-negative scalar a_r^{(ij)} such that ∑_{(i,j)∈S_r^↑(x)×S_r^↓(x)} a_r^{(ij)} = w_r. For pairs of vertices (i, j) ∉ S_r^↑(x) × S_r^↓(x) or belonging to inactive hyperedges, we set a_r^{(ij)} = 0.

2. Using the above described pairwise weights, we create a symmetric matrix A such that A_ij = ∑_{r∈[R]} a_r^{(ij)} and A_ii = d_i − ∑_{j∈V, j≠i} A_ij.

For undirected graphs for which |Sr | = |Tr | = |Hr | = 2 for all r ∈ [R], the matrix A is an
adjacency matrix of a graph. Using the terminology of (Chan et al., 2018), we define the
Markov operator M (p) based on A according to

M (p) = Ax = AD −1 p.
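One admissible construction of A can be sketched in a few lines of Python (illustrative only; the uniform split of w_r over the pairs in S_r^↑(x) × S_r^↓(x) is just one valid choice of the weights a_r^{(ij)}, and S_r is again assumed to equal H_r ∪ T_r):

    import numpy as np

    def markov_matrix(x, hyperedges, tol=1e-12):
        """Build one valid matrix A for the operator M(p) = A D^{-1} p."""
        N = len(x)
        A = np.zeros((N, N))
        deg = np.zeros(N)
        for w, H, T in hyperedges:
            for i in set(H) | set(T):
                deg[i] += w
            top, bot = max(x[i] for i in H), min(x[j] for j in T)
            if top <= bot + tol:                        # inactive hyperedge: no off-diagonal contribution
                continue
            up = [i for i in H if x[i] >= top - tol]    # S_r^up(x)
            dn = [j for j in T if x[j] <= bot + tol]    # S_r^down(x)
            a = w / (len(up) * len(dn))                 # uniform split of w_r over the pairs
            for i in up:
                for j in dn:
                    A[i, j] += a
                    A[j, i] += a
        A[np.diag_indices(N)] = deg - (A.sum(axis=1) - np.diag(A))
        return A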

The Markov operator M(p) is closely related to the Lovász extension f_r, which may be seen as follows. Let ∇f_r(x) be the subdifferential set of f_r at x. We once again recall that, as F_r is submodular, f_r(x) = max_{y∈B_r} ⟨y, x⟩ (Bach, 2013). This gives rise to the following result.

Lemma 12 For any p ∈ ℝ^N and x = D^{-1}p, it holds that p − M(p) ∈ ∑_{r∈[R]} w_r f_r(x) ∇f_r(x).

Proof Consider a vertex i ∈ V. Then,

(p − M(p))_i = p_i − A_ii (p_i/d_i) − ∑_{j∈V, j≠i} A_ij (p_j/d_j) = ∑_{j∈V, j≠i} A_ij (x_i − x_j) = ∑_{r∈[R]} ∑_{j∈V, j≠i} a_r^{(ij)} (x_i − x_j)

   (1)=  ∑_{r∈[R], S_r active, i∈S_r^↑}  w_r f_r(x) ∑_{j∈S_r^↓} a_r^{(ij)}/w_r  +  ∑_{r∈[R], S_r active, i∈S_r^↓}  w_r f_r(x) ∑_{j∈S_r^↑} (−a_r^{(ji)})/w_r,

where 1) holds due to the definitions of a_r^{(ij)} and f_r(x). It only remains to show that for r ∈ [R] with active S_r, the vector v with components

v_i = ∑_{j∈S_r^↓} a_r^{(ij)}/w_r      if i ∈ S_r^↑,
v_i = −∑_{j∈S_r^↑} a_r^{(ji)}/w_r     if i ∈ S_r^↓,
v_i = 0                               otherwise,

satisfies v ∈ ∇f_r(x). Since ∑_{(i,j)∈S_r^↑×S_r^↓} a_r^{(ij)} = w_r and a_r^{(ij)} ≥ 0, we know that v ∈ B_r. Moreover, ⟨v, x⟩ = f_r(x), so v ∈ ∇f_r(x), and thus the claim clearly holds.
The statement of Lemma 12 describes a definition of a Laplacian operator for submodular hypergraphs consistent with previous definitions (Li and Milenkovic, 2018b). We write the underlying Laplacian operator as L(x) = (D − A)x = p − M(p). Note that when p = π_V, the components of x = D^{-1}p are all equal and hence M(p) = AD^{-1}p = (1/vol(V)) A1 = (1/vol(V)) D1 = p. Therefore, π_V may be viewed as an analogue of a stationary distribution.
Given an initial potential p_0 ∈ ℝ^N and an α ∈ (0, 1], the PR process is described by the following ordinary differential equation:

The PR process:   dp_t/dt = α(p_0 − p_t) + (1 − α)(M(p_t) − p_t).    (17)

The choice of α = 0 reduces the PR process to the DP defined in (Chan et al., 2017, 2018). Tracking the PR process at every time point t, i.e., specifying the a_r^{(ij)}, is as difficult as tracking the DP, which in turn requires solving a densest subset problem.2 However, we only need to examine and evaluate the stationary point, which corresponds to the PR vector. The PR vector pr(α, p_0) satisfies the equation

α(p_0 − pr(α, p_0)) + (1 − α)(M(pr(α, p_0)) − pr(α, p_0)) = 0.    (18)

From Lemma 12, we know that pr(α, p_0) can be obtained by solving a QDSFM problem. Let x_pr = D^{-1} pr(α, p_0). Then

α(p_0 − D x_pr) − (1 − α) ∑_{r∈[R]} w_r f_r(x_pr) ∇f_r(x_pr) ∋ 0
⇔ x_pr = arg min_x ‖x − x_0‖²_W + ∑_{r∈[R]} [f′_r(x)]²,    (19)

where x_0 = D^{-1}p_0, W = (α/(1−α)) D, and f′_r = √(w_r) f_r.

2. To track the diffusion of p(t) (17), one has to be careful in terms of deciding which components of p(t)
are allowed to change within an infinitesimal time. The selection procedure should guarantee that after
an infinitesimal time p(t) → p(t) + dp(t), M (p(t)) and M (p(t) + dp(t)) satisfy M (p(t)) = M ′ p(t) and
M (p(t) + dp(t)) = M ′ p(t) + M ′ dp(t) for some shared choice of the matrix M ′ . Here, M ′ captures the
active diffusion network that remains unchanged in [t, t + dt). In general, M ′ will transition through
discrete states in a continuous-time domain. Determining which components of p(t) are allowed to change
is equivalent to solving a densest set problem.


5.4 Complexity of Computing PageRank over Directed Hypergraphs


We now analyze the computational complexity of computing PageRank over directed hypergraphs using QDSFM solvers. First, due to the specific structure of the Lovász extension
fr′ in (19), we devise an algorithm of complexity O(|Sr | log |Sr |) that exactly computes the
required projections onto the induced cones. This projection algorithm is much more effi-
cient than the conic FW and MNP methods designed for general polytopes. We focus on
projections with respect to one particular value r; since the scaling wr does not affect the
analysis, we henceforth omit the subscript r and the superscript ′ and simply use f instead
of fr′ .
The key idea is to revert the projection Π_C(a; W̃) back to its primal form. In the primal setting, optimization becomes straightforward and only requires careful evaluation of the gradient values. First, following a similar strategy as described in Lemma 1, it is easy to show that

min_z (1/2)‖z − b‖²_W + (1/2)[f(z)]²

is the dual of the problem (12), where W = W̃^{-1}, b = (1/2) W^{-1} a, and f is the Lovász extension corresponding to the base polytope B. Then, one has y = a − 2Wz, φ = 2⟨y, z⟩_{W^{-1}}.
Next, recall that for a directed hyperedge, we introduced head and tail sets H, T, respectively, and f(z) = (max_{i∈H} z_i − min_{j∈T} z_j)_+. Clearly, when H and T both equal the corresponding incidence set, the directed hypergraph reduces to an undirected hypergraph. To solve this optimization problem, define two intermediate variables γ = max_{i∈H} z_i and δ = min_{j∈T} z_j. Denote the derivatives with respect to γ and δ by △_γ and △_δ, respectively. Hence, △_γ and △_δ satisfy

△_γ = (γ − δ)_+ + ∑_{i∈S_H(γ)} W_ii (γ − b_i),   △_δ = −(γ − δ)_+ + ∑_{j∈S_T(δ)} W_jj (δ − b_j),

where S_H(γ) = {i | i ∈ H, b_i ≥ γ} and S_T(δ) = {j | j ∈ T, b_j ≤ δ}. The optimal values of γ and δ are required to simultaneously satisfy △_γ = 0 and △_δ = 0. Algorithm 5 is designed to find such values of γ and δ. The "search" for (γ, δ) starts from (max_{i∈H} b_i, min_{j∈T} b_j), and one gradually decreases γ and increases δ while ensuring △_γ = −△_δ (see Steps 5-10 in Algorithm 5). The reason for imposing the constraint △_γ = −△_δ during the procedure is to decrease the number of tuning parameters. As a result, the constraint implicitly transforms the procedure into a simple line search. The complexity of Algorithm 5 is dominated by the sorting step (Step 1), which requires O(|S_r| log |S_r|) operations, as w_H, w_T, △_γ, and △_δ can all be efficiently tracked within the inner loops.
Algorithm 5: An Exact Projection Algorithm for a Directed Hyperedge

Input: W, b, H, T
1. Sort {b_i}_{i∈H} and {b_j}_{j∈T}.
2. Initialize γ ← max_{i∈H} b_i and δ ← min_{j∈T} b_j.
3. If γ ≤ δ, return z = b.
4. Iteratively execute:
5.   w_H ← ∑_{i∈S_H(γ)} W_ii,  w_T ← ∑_{j∈S_T(δ)} W_jj
6.   γ_1 ← max_{i∈H\S_H(γ)} b_i,  δ_1 ← δ + (γ − γ_1) w_H/w_T
7.   δ_2 ← min_{j∈T\S_T(δ)} b_j,  γ_2 ← γ − (δ_2 − δ) w_T/w_H
8.   k∗ ← arg min_{k∈{1,2}} δ_k
9.   If γ_{k∗} ≤ δ_{k∗} or △_γ evaluated at (γ_{k∗}, δ_{k∗}) is ≤ 0, break
10.  (γ, δ) ← (γ_{k∗}, δ_{k∗})
11. (γ, δ) ← (γ, δ) + (−w_T, w_H) △_γ/(w_T w_H + w_T + w_H)
12. Set z_i to γ if i ∈ S_H(γ), to δ if i ∈ S_T(δ), and to b_i otherwise.

Combining the projection method in Algorithm 5 and Corollary 9 with the parameter choice β = α/(1−α), we arrive at the following result summarizing the overall complexity of running the PageRank process on (un)directed hypergraphs.

Corollary 13 The PageRank vector for (un)directed hypergraphs can be obtained by solving (19). A combination of Algorithm 1 (RCD) and Algorithm 5 returns an ε-optimal solution in expectation with total computation complexity

O( N² max{1, 9(1−α)/α} max_{i,j∈[N]} (D_ii/D_jj) ( ∑_{r∈[R]} |S_r| log |S_r| ) log(1/ε) ).

A combination of Algorithm 2 (AP) and Algorithm 5 returns an ε-optimal solution with total computation complexity

O( N² max{1, 9(1−α)/α} max_{i,j∈[N]} (Ψ_jj D_ii/D_jj) ( ∑_{r∈[R]} |S_r| log |S_r| ) log(1/ε) ).
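As a sanity check on implementations of Algorithm 5, the primal form of the projection derived above can also be minimized directly with a generic solver on small instances; the following Python snippet (our own illustrative code, not the O(|S_r| log |S_r|) routine itself) does exactly that:

    import numpy as np
    from scipy.optimize import minimize

    def primal_objective(z, W_diag, b, H, T):
        """0.5*||z - b||_W^2 + 0.5*f(z)^2 with f(z) = (max_H z - min_T z)_+."""
        fz = max(z[list(H)].max() - z[list(T)].min(), 0.0)
        return 0.5 * np.sum(W_diag * (z - b) ** 2) + 0.5 * fz ** 2

    def project_directed_hyperedge_bruteforce(W_diag, b, H, T):
        # Nelder-Mead avoids relying on gradients of the piecewise-smooth objective;
        # adequate only for small |H| + |T| and intended purely for testing.
        res = minimize(primal_objective, x0=b.copy(), args=(W_diag, b, H, T),
                       method="Nelder-Mead",
                       options={"xatol": 1e-10, "fatol": 1e-12, "maxiter": 20000})
        return res.x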

5.5 Analysis of Partitions Based on PageRank


We are now ready to analyze hypergraph partitioning procedures (16) based on our def-
inition of PageRank. The optimization problem (16) is NP-hard even for graphs so we
only seek approximate solutions. Consequently, an important application of our version of
PageRank is to provide an approximate solution for (16). We start by proving that our
version of PageRank, similarly to the standard PageRank over graphs (Andersen et al.,
2006), satisfies a mixing result derived based on so-called Lovász-Simonovits curves (Lovasz
and Simonovits, 1990). This result may be further used to prove that Personalized PageR-
ank leads to partitions with small conductance3 . The main components of our proof are
inspired by the proofs for PageRank over graphs (Andersen et al., 2006). The novelty of our
approach is that we need several specialized steps to handle the nonlinearity of the Markov
operator M (·), which in the standard graph setting is linear. We postpone the proofs of all
our results to Section 7.7.
We first introduce relevant concepts and terminology necessary for presenting our results.
Sweep cuts are used to partition hypergraphs based on the order of components of a
distribution p and are defined as follows. Set x = D^{-1}p and sort the components of x so that

x_{i_1} ≥ x_{i_2} ≥ · · · ≥ x_{i_N}.    (20)

3. Recall that personalized PageRank is a PR process with initial distribution p0 = 1i , for some vertex
i∈V.


Let Sjp = {xi1 , xi2 , ..., xij }. Recall the definition of the conductance of a set Φ(·) (15) and
evaluate
j ∗ = arg min Φ(Sjp ).
j∈[N ]

The pair is referred to as a sweep cut. Furthermore, define Φp = Φ(Sjp∗ ); in words,


(Sjp∗ , S̄jp∗ )
Φp is the smallest conductance that may be achieved by sweep cuts of a potential p. Our
target is to show that the conductance based on a Personalized PR vector, i.e., Φpr(α,1v ) ,
may approximate Φ∗ .
Intuitively, the conductance of a hypergraph for a given p0 and α captures how “close”
a PR vector pr(α, p0 ) is to the stationary distribution πV : a large conductance results
in mixing the potentials of different vertices and thus makes pr(α, p0 ) closer to πV . To
characterize the degree of “closeness”, we introduce the Lovász-Simonovits curve. Given
a potential vector p ∈ RN and x = D −1 p, suppose that the order of the components in x
follows equation (20). Let vol(S0p ) = 0. Define a piecewise linear function Ip (·) : [0, m] →
[0, 1] according to
p
p z − vol(Sj−1 ) p
Ip (z) = p(Sj−1 ) + pvj , for vol(Sj−1 ) ≤ k ≤ vol(Sjp ), j ∈ [N ].
dvj

Note that the Lovász-Simonovits curve of πV contains only one segment by its very defini-
tion. It is easy to check that Ip (z) is continuous and concave in z. Moreover, for any set
S ⊆ [N ], we have p(S) ≤ Ip (vol(S)). We further write Vp = {j : xij > xij+1 } and refer
to {vol(Sjp )}j∈Vp as the (exact) break points of Ip . The relevance of the Lovász-Simonovits
curve in the aforementioned context may be understood as follows. Although the Lovász-
Simonovits curve is piecewise linear, the changes of the slopes at the break points may be
used to describe its “curvatures”. A large (small) curvature corresponds to the case that
the distribution of the potential is far from (close to) the stationary distribution, which
further implies a small (large) conductance of the graphs/hypergraphs (see Theorem 16).
Therefore, analyzing the Lovász-Simonovits curvature is of crucial importance in the pro-
cess of bounding the conductance. The first step of this analysis is presented in Lemma 14,
which establishes the “curvatures” at the break points.

Lemma 14 Let p = pr(α, p0 ), x = D −1 p and j ∈ Vp . Then,

α 1−α
Ip (vol(Sjp )) ≤ p0 (Sjp ) + [Ip (vol(Sjp ) − vol(∂Sjp )) + Ip (vol(Sjp ) + vol(∂Sjp )].
2−α 2−α
Furthermore, for k ∈ [0, m],

Ip (k) ≤ p0 (k) ≤ Ip0 (k).

Remark 15 In comparison with the undirected graph case (Lemma 5 (Andersen et al.,
2006)), where the result holds for arbitrary S ⊆ V , our claim is true only for Sjp for which
j ∈ Vp . However, this result suffices to establish all relevant PR results.

Using the upper bound on the break points in Ip , we can now construct a curve that
uniformly bounds Ip using Φp .

22
Quadratic Decomposable Submodular Function Minimization

Theorem 16 Let p = pr(α, p0 ) be the previously defined PR vector, and let Φp be the
minimal conductance achieved by a sweep cut. For any integer t ≥ 0 and any k ∈ [0, m],
the following bound holds:
s !t
k α min{k, m − k} Φ2p
Ip (k) ≤ + t+ 1− .
m 2−α mini:(p0 )i >0 di 8

For graphs (Andersen et al., 2006), Theorem 16 may be used to characterize the graph
partitioning property of sweep cuts induced by a personalized PageRank vector. Similarly,
for general directed hypergraphs, we establish a similar result in Theorem 17.

Theorem 17 Let S be a set of vertices such that vol(S) ≤ 21 M and Φ(S) ≤ αc , for some
constants α, c. If there exists a distribution P for sampling vertices v ∈ S such that
Ei∼P [pr(α, 1i )(S̄)] ≤ 8c pr(α, πS )(S̄), then with probability at least 12 ,
s
vol(S)
Φpr(α,1i ) = O( α log ),
di

where i is sampled according to P .

Remark 18 Note that in Theorem 17, we require that the sampling distribution P satisfy
Ei∼P [pr(α, 1i )(S̄)] ≤ 8c pr(α, πS )(S̄). For graphs, when we sample a vertex i with probability
proportional to its degree di , this condition is naturally satisfied, with c = 8. However, for
general (un)directed hypergraphs, the sampling procedure is non-trivial and harder to handle
due to the non-linearity of the random-walk operator M (·). We relegate a more in-depth
study of this topic to a companion paper.

6. Applications to Semi-supervised Learning and Numerical Experiments


Another important application of QDSFM is semi-supervised learning (SSL). SSL is a learn-
ing paradigm that allows one to utilize the underlying structure or distribution of unlabeled
samples whenever the information provided by labeled samples is insufficient for learning
an inductive predictor (Gammerman et al., 1998; Joachims, 2003). A standard setup for a
K-class transductive learner is as follows: given N data points {zi }i∈[N ] , along with labels
for the first l (≪ N ) samples {yi |yi ∈ [K] }i∈[l] , the learner is asked to infer the labels of all
the remaining data points i ∈ [N ]/[l]. The widely-used SSL problem with least square loss
requires one to solve K regularized problems. For each class k ∈ [K], one sets the scores of
data points within the class to

x̂(k) = arg min βkx(k) − a(k) k2 + Ω(x(k) ),


x(k)

(k)
where a(k) describes the information provided by the known labels, i.e., ai = 1 if yi = k,
and 0 otherwise, β denotes a hyperparameter and Ω is a smoothness regularizer. The labels
of the data points are estimated as
(k)
ŷi = arg max{x̂i }.
k

23
Li, He, and Milenkovic

In typical graph and hypergraph learning problems, Ω is chosen to be a Laplacian regularizer


constructed using {zi }i∈[N ] ; this regularization term ties the above learning problems to
PageRank (again, refer to Table 1). With the Laplacian regularization, each edge/hyperedge
corresponds to one decomposed part fr in the QDSFM problem. The variables themselves
may be normalized with respect to their degrees, in which case the normalized Laplacian
is used √instead. For
p example, in graph learning, one of the terms in Ω assumes the form
wij (xi / di − xj / dj )2 , where di and dj correspond to the degrees of the vertex variables
i and j, respectively. Using some simple algebra, it can be shown that the normalization
term is embedded into the matrix W used in the definition of the QDSFM problem (4).

6.1 Experimental Setup


We numerically evaluate our SSL learning framework for hypergraphs on both real and syn-
thetic datasets. For the particular problem at hand, the QDSFM problem can be formulated
as follows:
X xi xj 2
min βkx − ak2 + max ( √ −p ) , (21)
x∈RN i,j∈Sr Wii Wjj
r∈[R]

where ai ∈ {−1, 0, 1} indicates if the corresponding variable i has a negative, missing, or


positive label, respectively, β is a parameter that balances out the influence of observations
and the regularization term, {Wii }i∈[N ] defines a positive measure over the variables and
may be chosen to be the degree matrix D, with Dii = |{r ∈ [R] : i ∈ Sr }|. Each part in
the decomposition corresponds to one hyperedge. We compare eight different solvers falling
into three categories: (a) our proposed general QDSFM solvers, QRCD-SPE, QRCD-MNP,
QRCD-FW and QAP-SPE ; (b) alternative specialized solvers for the given problem (21),
including PDHG (Hein et al., 2013) and SGD (Zhang et al., 2017); (c) SSL solvers that
do not use QDSFM as the objective, including DRCD (Ene and Nguyen, 2015) and In-
vLap (Zhou et al., 2007). The first three methods all have outer-loops that execute RCD,
but with different inner-loop projections computed via the exact projection algorithm for
undirected hyperedges (see Algorithm 5 in Section 5.4), or the generic MNP and FW. As it
may be time consuming to find the precise projections via MNP and FW, we always fix the
number of MAJOR loops of the MNP and the number of iterations of the FW method to
100|Sr | and 100|Sr |2 , respectively. Empirically, these choices provide an acceptable trade-
off between accuracy and complexity. The QAP-SPE method uses AP in the outer-loop
and exact inner-loop projections (see Algorithm 5 in Section 5.4). PDHG and SGD are
the only known solvers for the specific objective (21). PDHG and SGD depend on certain
1
parameters that we choose in standard fashion: for PDHG, we set σ = τ = √1+max D
and
i ii
1
for SGD, we set ηk = kβ max i Wii
.
DRCD is a state-of-the-art solver for DSFM and also uses a combination of outer-loop
RCD and inner-loop projections. InvLap first transforms hyperedges into cliques and then
solves a Laplacian-based linear problem. All the aforementioned methods, except InvLap,
are implemented in C++ in a nonparallel fashion. InvLap is executed via matrix inversion
operations in Matlab, and may be parallelized. The CPU times of all methods are recorded
on a 3.2GHz Intel Core i5. The results are summarized for 100 independent tests. When

24
Quadratic Decomposable Submodular Function Minimization

reporting the results, we use the primal gap (“gap”) 4 to characterize the convergence
properties of different solvers.

6.2 Synthetic Data


We generated a hypergraph with N = 1000 vertices that belong to two equal-sized clusters.
We uniformly at random generated 500 hyperedges within each cluster and 1000 hyperedges
across the two clusters. Note that in higher-order clustering, we do not need to have
many hyperedges within each cluster to obtain good clustering results. Each hyperedge
includes 20 vertices. We also uniformly at random picked l = 1, 2, 3, 4 vertices from each
cluster to represent labeled datapoints. With the vector x obtained by solving (21), we
classified the variables based on the Cheeger cut rule (Hein et al., 2013): suppose that
x
√ xi1 ≥ √ xi2 ≥ · · · ≥ √ iN , and let Sj = {i1 , i2 , ..., ij }. We partitioned [N ] into two
W i1 i1 W i2 i2 W iN iN

sets (Sj ∗ , S̄j ∗ ), where


|S ∩ Sj 6= ∅, Sr ∩ S̄j 6= ∅}|
j ∗ = arg min Φ(Sj ) , Pr P .
j∈[N ] max{ r∈[R] |Sr ∩ Sj |, r∈[R] |Sr ∩ S̄j |}

The classification error is defined as (# of incorrectly classified vertices)/N . In the experi-


ment, we used Wii = Dii , ∀ i, and tuned β to be nearly optimal for different objectives with
respect to the classification error rates: for QDSFM as the objective, using QRCD-SPE,
QAP-SPE, PDHG, and SGD as the methods of choice, we set β = 0.02; for DSFM as the
objective, including the DRCD method, we set β = 1; for InvLap, we set β = 0.001.
The left subfigure of Figure 3 shows that QRCD-SPE converges much faster than all
other methods when solving the problem (21) according to the gap metric (we only plotted
the curve for l = 3 as all other values of l produce similar patterns). To avoid clutter, we
moved the results for QRCD-MNP and QRCD-FW to the right two subfigures in Figure 3,
as these methods are typically 100 to 1000 times slower than QRCD-SPE. In Table 2, we
described the performance of different methods with comparable CPUtimes. Note that
when QRCD-SPE converges (with primal gap 10−9 achieved after 0.83s), the obtained x
leads to a much smaller classification error than other methods. QAP-SPE, PDHG and
SGD all have large classification errors as they do not converge within short CPU time-
frames. QAP-SPE and PDHG perform only a small number of iterations, but each iteration
computes the projections for all the hyperedges, which is more time-consuming. The poor
performance of DRCD implies that the DFSM is not a good objective for SSL. InvLap
achieves moderate classification errors, but still does not match the performance of QRCD-
SPE. Furthermore, note that InvLap uses Matlab, which is optimized for matrix operations,
thus fairly efficient. However, for experiments on real datasets with fewer but significantly
larger hyperedges, InvLap does not offer as good performance as the one for synthetic data.
The column “Average 100c(Sj ∗ )” also illustrates that the QDSFM objective is a good choice
for finding approximate balanced cuts of hypergraphs.
4. We place a high accuracy requirement for QDSFM-MNP by ignoring the CPU time needed to guarantee
that it achieves a duality gap ias low as 1e − 14. Note that the duality gap is computed according to
h 2
(k)
kx(k) − ak2W + r∈[R] fr (x(k) ) − g(y (k) , φ(k) )/4, where x(k) = a − 12 W −1 r∈[R] yr , which is an upper
P P

bound on the primal gap. Therefore, the achieved value of the objective has high accuracy and thus can
be used as the optimal primal value.

25
Li, He, and Milenkovic

l=3 l=3 l=3


0 0 0
QRCDM-SPE QRCDM-MNP QRCDM-FW
-1 -1
QAP-SPE
PDHG -2
-2 -2
SGD
-3 -3
-4
log10 (gap)

log10 (gap)

log10 (gap)
-4 -4
-5 -6 -5
-6 -6
-8
-7 -7
-8 -8
-10
-9 -9
-10 -12 -10
0 0.2 0.4 0.6 0.8 -20 0 20 40 60 80 100 -200 0 200 400 600 800 1000
cputime(s) cputime(s) cputime(s)

Figure 3: Convergence results on synthetic datasets: gap vs CPU time of different QDSFM
solvers (with an average ± standard deviation).

Classification error rate (%) Average 100c(Sj ∗ )


Obj. Solvers l=1 l=2 l=3 l=4 #iterations cputime(s)
l=1 l=2 l=3 l=4
MN MD MN MD MN MD MN MD
QRCD-SPE 4.8 × 105 0.83
Oth. QDSFM

2.93 2.55 2.23 0.00 1.47 0.00 0.78 0.00 6.81 6.04 5.71 5.41
QAP-SPE 14.9 15.0 12.6 13.2 7.33 8.10 4.07 3.80 9.51 9.21 8.14 7.09 2.7 × 102 0.85
PDHG 9.05 9.65 4.56 4.05 3.02 2.55 1.74 0.95 8.64 7.32 6.81 6.11 3.0 × 102 0.83
SGD 5.79 4.15 4.30 3.30 3.94 2.90 3.41 2.10 8.22 7.11 7.01 6.53 1.5 × 104 0.86
DRCD 44.7 44.2 46.1 45.3 43.4 44.2 45.3 44.6 9.97 9.97 9.96 9.97 3.8 × 106 0.85
InvLap 8.17 7.30 3.27 3.00 1.91 1.60 0.89 0.70 8.89 7.11 6.18 5.60 — 0.07

Table 2: Prediction results on synthetic datasets: classification error rates & Average 100
c(Sj ∗ ) for different solvers (MN: mean, MD: median).

We also evaluated the influence of parameter choices on the convergence of QRCD


methods. According to Theorem 4, the required number of RCD iterations for achieving
an ǫ-optimal solution for (21) is roughly O(N 2 R max(1, 9/(2β)) max i,j∈[N ] Wii /Wjj log 1/ǫ)
(see Section 7.8). We mainly focus on testing the dependence on the parameters β and
maxi,j∈[N ] Wii /Wjj , as the term N 2 R is also included in the iteration complexity of DSFM
and was shown to be necessary given certain submodular structures (Li and Milenkovic,
2018a). To test the effect of β, we fix Wii = 1 for all i, and vary β ∈ [10−3 , 103 ]. To test the
effect of W , we fix β = 1 and randomly choose half of the vertices and set their Wii values to
lie in {1, 0.1, 0.01, 0.001}, and set the remaining ones to 1. Figure 4 shows our results. The
logarithm of gap ratios is proportional to log β −1 for small β, and log maxi,j∈[N ] Wii /Wjj ,
which is not as sensitive as predicted by Theorem 4. Moreover, when β is relatively large
(> 1), the dependence on β levels out.

6.3 Real Data


We also evaluated the proposed algorithms on three UCI datasets: Mushroom, Covertype45,
Covertype67, used as standard test datasets for SSL on hypergraphs (Zhou et al., 2007;
Hein et al., 2013; Zhang et al., 2017). Each dataset corresponds to a hypergraph model
described in (Hein et al., 2013): Entries correspond to vertices, while each categorical feature
is modeled as one hyperedge; numerical features are first quantized into 10 bins of equal
size, and then mapped to hyperedges. Compared to synthetic data, in the real datasets,
the size of most hyperedges is significantly larger (≥ 1000) while the number of hyperedges
is small (≈ 100). See Table 3 for a detailed description of the parameters of the generated
hypergraphs. Previous work has shown that smaller classification errors can be achieved

26
Quadratic Decomposable Submodular Function Minimization

-2 -2
-3 -3
-4

log10 (gap (2e5) /gap(0))


-4

log10 (gap (2e5) /gap(0))


-5 -5
-6 -6
-7 -7
-8 -8
-9 -9
-10 -10
-11 -11
-12 -12
-4 -2 0 2 4 -1 0 1 2 3 4
log10 (β-1 ) log10 (Wmax /W min )

Figure 4: Parameter sensitivity: the rate of a primal gap of QRCD after 2 × 105 iterations
with respect to different choices of the parameters β & maxi,j∈[N ] Wii /Wjj .

mushroom covertype67
0 covertype45 0
0
QRCDM-SPE QRCDM-SPE
QRCDM-SPE -1 QAP-SPE
QAP-SPE -1
QAP-SPE
-2 PDHG PDHG
-2 PDHG -2
SGD SGD
SGD
-4 -3 -3

log10 (gap)
log10 (gap)

log10 (gap)

-4 -4
-6 -5
-5
-6
-8 -6
-7
-7
-8
-10
-9 -8

-12 -10 -9
0 0.5 1 1.5 2 0 0.2 0.4 0.6 0.8 0 0.5 1 1.5 2 2.5 3
cputime(s) cputime(s) cputime(s)

Figure 5: Convergence of different solvers for QDFSM over three different real datasets.

by using QDSFM as the objective instead of DSFM or InvLap (Hein et al., 2013). In our
experiment, we focused on comparing the convergence of different solvers for QDSFM. We
set β = 100 and Wii = 1, for all i, and set the number of observed labels to 100, which is
a proper setting according to (Hein et al., 2013). Figure 5 shows the results. Again, the
proposed QRCD-SPE and QAP-SPE methods both converge faster than PDHG and SGD,
while QRCD-SPE performs the best. Note that we did not plot the results for QRCD-
MNP and QRCD-FW as these methods converge extremely slowly due to the large sizes
of the hyperedges. InvLap requires 22, 114 and 1802 seconds to run on the Mushroom,
Covertype45, and Covertype67 datasets, respectively. Hence, the method does not scale
well with the problem size.

7. Detailed Proofs

7.1 Proofs of the Results Pertaining to the Dual Formulation

7.1.1 Proof of Lemma 1

In all derivations that follow, we may exchange the order of minimization and maximization
(i.e., min max = max min) due to Proposition 2.2 (Ekeland and Temam, 1999). Plugging

27
Li, He, and Milenkovic

Dataset Mushroom Covertype45 Covertype67


N 8124 12240 37877
P R 112 127 136
r∈[R] |S r | 170604 145999 451529

Table 3: The UCI datasets used for experimental testing.

equation (5) into (4), we obtain


X
min [fr (x)]2 + kx − ak2W
x
r∈[R]
X  φ2r

= min max hyr , xi − + kx − ak2W
x φr ≥0,yr ∈φr Br 4
r∈[R]
X  φ2r

= max min hyr , xi − + kx − ak2W
φr ≥0,yr ∈φr Br x 4
r∈[R]
1 X 1X 2
= max − k yr − 2W ak2W −1 − φ + kak2W .
φr ≥0,yr ∈φr Br 4 4 r r
r∈[R]

By eliminating some constants, one obtains the dual (6).


Next, we prove that the problem (7) is equivalent to (6) when removing λr .
First, (7) is equivalent to

X  * X +
λr 2 2
min max kyr − √ kW −1 + φr + λ, λr − 2W a
φr ≥0,yr ∈φr Br ,λr λ
r∈[R]
R r∈[R]
X  * X +
λr 2 2
= min max min kyr − √ kW −1 + φr + λ, λr − 2W a
φr ≥0,yr ∈φr Br λ λr
r∈[R]
R r∈[R]
 * +
X 1
2 2
√ X 1
= min max kλkW + φr + λ, R (yr − W λr ) − 2W a
φr ≥0,yr ∈φr Br λ 4 2
r∈[R] r∈[R]
* +
R 2
√ X X
= min max − kλkW + R λ, yr − 2W a + φ2r
φr ≥0,yr ∈φr Br λ 4
r∈[R] r∈[R]
X X
2 2
= min k yr − 2W akW −1 + φr ,
φr ≥0,yr ∈φr Br
r∈[R] r∈[R]

which is equivalent to (6).

7.1.2 Proof of Lemma 2


We start by recalling the following lemma from (Li and Milenkovic, 2018a) that characterizes
the geometric structure of the product base polytope.

28
Quadratic Decomposable Submodular Function Minimization

Lemma 19 (Li and Milenkovic (2018a)) Assume that W ∈ RN ×N is a positive diag-



N
onal matrix. Let y ∈ r∈[R] φr Br and let s be N
in the base polytope of the submodular
function r φr Fr . Then, there exists a point y ∈ r∈[R] φ′r Br such that r∈[R] yr′ = s and
P ′ ′
P
q PN
i=1 Wii
ky ′ − ykI(W ) ≤
P
2 k r∈[R] yr − sk1 .

Lemma 19 cannotN be used directly


Nto prove Lemma 2, since y, y ′ are in different product
base polytopes, r∈[R] φr Br and r∈[R] φ′r Br , respectively. However, the following lemma
shows that one can transform y to lie in the same base polytopes that contains y ′ .

and a nonnegative vector φ′ =


N
Lemma N20 For a given feasible point (y, φ) ∈ r∈[R] Cr ,
(φr ) ∈ r∈[R] R≥0 , one has

X ρ X φ′
k yr − sk1 + kφ′ − φk ≥ k r
yr − sk1 .
2 φr
r∈[R] r∈[R]

Proof For all r, let ỹr = yr /φr ∈ Br , and define a function that depends on φ,

X X
h(φ) = k yr − sk1 = k φr ỹr − sk1 .
r∈[R] r∈[R]

For all φ and r, |∇φr h(φ)| ≤ kỹr k1 . Therefore,

Z 1

h(φ ) = h(φ) + h∇h|φ+t(φ′ −φ) , t(φ′ − φ)idt
t=0
maxt∈[0,1] k∇h|φ+t(φ′ −φ) k ′
≥ h(φ) − kφ − φk
2q
2
P
maxỹr ∈Br ,∀r r∈[R] kỹr k1 ρ
≥ h(φ) − kφ′ − φk = h(φ) − kφ′ − φk.
2 2

Combining Lemma 20 with Lemma 19, we can establish the claim of Lemma 2.
qP
First, let ρ(W (1) ) = maxy∈Nr∈[R] Br 2 ′ ′
N
r∈[R] kyr kWr . Suppose that y ∈ r∈[R] φr Br

is such that r∈[R] yr′ = s and it minimizes r∈[R] k φφrr yr − yr′ k2W (1) . As s lies in the base
P P

29
Li, He, and Milenkovic

′ we know that such an y ′ exists. Moreover, we have


P
polytope of r∈[R] φr Fr ,

X φ′r X φ′
ky − y ′ kI(W (1) ) ≤ kyr′ − yr kW (1) + kyr − r yr kW (1)
φr φr
r∈[R] r∈[R]
s
P (1)
i∈[N ] Wii
1) X φ′
≤ k r
yr − sk1 + ρ(W (1) )kφ′ − φk
2 φr
r∈[R]
s  
P (1)
2) i∈[N ] Wii ρ
X
≤ k yr − sk1 + kφ′ − φk + ρ(W (1) )kφ′ − φk
2 2
r∈[R]
s
P (1) P (2)
i∈[N ] Wii j∈[N ] 1/Wjj
X
= k yr − skW (2)
2
r∈[R]
s 
P (1)
i∈[N ] Wii ρ
+ + ρ(W (1) ) kφ′ − φk
 
2 2
s s
P (1) P (2) P (1)
3) i∈[N ] Wii j∈[N ] 1/Wjj 3 i∈[N ] Wii
X
≤ k yr − akW (2) + ρkφ′ − φk,
2 2 2
r∈[R]

where 1) follows from Lemma 19 and the definition of ρ(W (1) ), 2) is a consequence of
Lemma 20 and 3) holds because

(1)
X X X
Wii kyr k21 ≥ kyr k2Wr .
i∈[N ] r∈[R] r∈[R]

7.2 Proof for the Linear Convergence Rate of the RCD Algorithm
7.2.1 Proof of Lemma 3
If (y ∗ , φ∗ ) is the optimal solution, then it must hold that r∈[R] yr∗ = 2W (a − x∗ ) because of
P
the duality between the primal and dual variables. Moreover, we alsoP know that there
P must
exist a nonempty collection Y of points y ′ ∈ r∈[R] φ∗r Br such that r∈[R] yr′ = r∈[R] yr∗ .
N

Using Lemma 2, and setting φ′ = φ∗ , s = 2W (a − x∗ ), W (1) , W (2) = W −1 , we can show


that there exists some y ′ ∈ Y such that
 
X
ky − y ′ k2I(W −1 ) + kφ − φ∗ k2 ≤ µ(W −1 , W −1 ) k (yr − yr′ )k2W −1 + kφ − φ∗ k2  .
r∈[R]

According to the definition of y ∗ , one has ky − y ∗ k2I(W −1 ) ≤ ky − y ′ k2I(W −1 ) for y ′ ∈ Y. This


concludes the proof.

30
Quadratic Decomposable Submodular Function Minimization

7.2.2 Proof of Theorem 4


First, suppose that (y ∗ , φ∗ ) = arg min(y′ ,φ′ )∈Ξ ky (k) − y ′ k2I(W −1 ) + kφ(k) − φ′ k2 . Throughout
the proof, for simplicity, we use µ to denote µ(W −1 , W −1 ). We start by establishing the
following results.
Lemma 21 It can be shown that the following inequalities hold:
h∇g(y (k) , φ(k) ), (y ∗ − y (k) , φ∗ − φ(k) )i
1) 1  (k) 
≤ g(y ∗ , φ∗ ) − g(y (k) , φ(k) ) − ky − y ∗ k2I(W −1 ) + kφ(k) − φ∗ k2
µ
2) 2 h i
≤ g(y ∗ , φ∗ ) − g(y (k) , φ(k) ) − ky (k) − y ∗ k2I(W −1 ) − kφ(k) − φ∗ k2 . (22)
µ+1
Proof From Lemma 3, we can infer that
X 1h i
k (yr − yr∗ )k2W −1 + kφ − φ∗ k2 ≥ ky − y ∗ k2I(W −1 ) + kφ − φ∗ k2 ⇒
µ
r∈[R]

g(y ∗ , φ∗ ) ≥ g(y (k) , φ(k) ) + h∇g(y (k) , φ(k) ), (y ∗ − y (k) , φ∗ − φ(k) )i


1h i
+ ky − y ∗ k2I(W −1 ) + kφ − φ∗ k2 , (23)
µ
g(y (k) , φ(k) ) ≥ g(y ∗ , φ∗ ) + h∇g(y ∗ , φ∗ ), (y (k) − y ∗ , φ(k) − φ∗ )i
1h i
+ ky − y ∗ k2I(W −1 ) + kφ − φ∗ k2 . (24)
µ
As h∇g(y ∗ , φ∗ ), (y (k) − y ∗ , φ(k) − φ∗ )i ≥ 0, (24) gives
1h i
g(y ∗ , φ∗ ) − g(y (k) , φ(k) ) ≤ − ky − y ∗ k2I(W −1 ) + kφ(k) − φ∗ k2 . (25)
µ
Inequality (23) establishes Claim 1) in (22). Claim 2) in (22) follows from (25).

(k+1)
The following lemma is a direct consequence of the optimality of yr as the projection
ΠCr .
Lemma 22 One has
h∇r g((y (k) , φ(k) )), (yr(k+1) − yr∗ , φ(k+1)
r − φ∗r )i
≤ 2hyr(k) − yr(k+1) , yr(k+1) − yr∗ iW −1 + 2hφ(k) (k+1) (k+1)
r − φr , φr − φ∗r i.
The following lemma follows from a simple manipulation of the Euclidean norm.
Lemma 23 It holds that
kyr(k+1) − yr(k) k2W −1 + (φ(k+1)
r − φ(k)
r )
2

= kyr(k+1) − yr∗ k2W −1 + (φ(k+1)


r − φ∗r )2 + kyr(k) − yr∗ k2W −1 + (φ(k) ∗ 2
r − φr )
+ 2hyr(k+1) − yr∗ , yr∗ − yr(k) iW −1 + 2hφ(k+1)
r − φ∗r , φ∗r − φ(k)
r i
= −kyr(k+1) − yr∗ k2W −1 − (φ(k+1)
r − φ∗r )2 + kyr(k) − yr∗ k2W −1 + (φ(k) ∗ 2
r − φr )
+ 2hyr(k+1) − yr∗ , yr(k+1) − yr(k) iW −1 + 2hφ(k+1)
r − φ∗r , φ(k+1)
r − φ(k)
r i.

31
Li, He, and Milenkovic

Our next task is to determine how the objective function decreases in each iteration.
The following expectation is with respect to uniformly sampled values of r ∈ [R] in the k-th
iteration:

h i
E g(y (k+1) , φ(k+1) ) − g(y (k) , φ(k) )
h i
= E h∇r g(y (k) , φ(k) ), (yr(k+1) − yr(k) , φ(k+1)
r − φ(k)
r )i + kyr
(k+1)
− yr(k) k2W −1 + (φ(k+1)
r − φ(k)
r )
2

h
= E h∇r g(y (k) , φ(k) ), (yr∗ − yr(k) , φ∗r − φ(k)
r )i + h∇r g(y , φ ), (yr(k+1) − yr∗ , φ(k+1)
(k) (k)
r − φ∗r )i
i
(k+1) (k) 2 (k+1) (k) 2
+kyr − yr kW −1 + (φr − φr )
1) h
≤ E h∇r g(y (k) , φ(k) ), (yr∗ − yr(k) , φ∗r − φ(k)
r )i − kyr
(k+1)
− yr∗ k2W −1 + kyr∗ − yr(k) k2W −1
i
∗ 2 ∗
−(φ(k+1)
r − φr ) + (φr − φ (k) 2
r )
2) 1 h i
≤ h∇g(y (k) , φ(k) ), (y ∗ − y (k) , φ∗ − φ(k) )i − E ky (k+1) − y ∗ k2I(W −1 ) + kφ(k+1) − φ∗ k2
R
+ ky (k) − y ∗ k2I(W −1 ) + kφ(k) − φ∗ k2 (26)
3)
 
2 h i 2 h i
≤ g(y ∗ , φ∗ ) − g(y (k) , φ(k) ) + 1 − ky (k) − y ∗ k2I(W −1 ) + kφ(k) − φ∗ k2
(µ + 1)R (µ + 1)R
h i
∗ 2
− E ky (k+1)
− y kI(W −1 ) + kφ (k+1)
− φ∗ k2 . (27)

(k+1) (k) (k+1)


Here, 1) is a consequence of Lemma 22 and Lemma 23, 2) is due to yr′ = yr′ , φr′ =
(k)
φr′ for r ′ 6= r, and 3) may be established from (22).

Equation (35) further leads to

h i
E g(y (k+1) , φk+1 ) − g(y ∗ , φ∗ ) + d2 ((y (k+1) , φ(k+1) ), Ξ)
h i
≤ E g(y (k+1) , φk+1 ) − g(y ∗ , φ∗ ) + ky (k+1) − y ∗ k2I(W −1 ) + kφ(k+1) − φ∗ k2I(W −1 )
  h
2 i
≤ 1− E g(y (k) , φk ) − g(y ∗ , φ∗ ) + d2 ((y (k) , r (k) ), Ξ) .
(µ + 1)R

The claimed proof follows by iterating the above derivations for all values of k.

32
Quadratic Decomposable Submodular Function Minimization

7.3 Convergence Analysis of the AP Algorithm


7.3.1 Proof of Lemma 5
First, for r ∈ [R], we define a diagonal matrix Ar ∈ RN ×N : (Ar )ii = 1, if i ∈ Sr , and 0
otherwise. Start with a Lagrangian dual of (11), and transform it according to
X X
kyr − λr k2ΨW −1 + φ2r + hα,

min
N min max λr − 2W ai
(y,φ)∈ r∈[R] Cr Λ:λr,i =0, ∀(i,r)6∈S α∈RN
r∈[R] r∈[R]
X X
λr k2ΨW −1 φ2r

= min
N max min kyr − + + hα, λr − 2W ai
(y,φ)∈ r∈[R] Cr α∈RN Λ:λr,i =0, ∀(i,r)6∈S
r∈[R] r∈[R]
1) X 1 
= min
N max kAr Ψ−1 W αk2ΨW −1 + φ2r
(y,φ)∈ r∈[R] Cr α∈RN 4
r∈[R]
* +
X 1

−1
+ α, yr − Ar Ψ W α − 2W a
2
r∈[R]
* +
2) 1 2
X X
= min
N max − kαkW + α, yr − 2W a + φ2r
(y,φ)∈ r∈[R] Cr α∈RN 4
r∈[R] r∈[R]
X X
2 2
= min
N k yr − 2W akW −1 + φr .
(y,φ)∈ r∈[R] Cr
r∈[R] r∈[R]

Here, 1) is due to λr = yr − 12 Ar Ψ−1 W α while 2) is based on the fact that Ψ =


P
r∈[R] Ar .
This establishes the claimed result.

7.3.2 Proof of Lemma 8


Suppose that (y, φ) ∈ C/Ξ. Then,
X
[dΨW −1 ((y, φ), Z)]2 = kyr − λr k2ΨW −1 + (φr − φ∗r )2

min
λr ,∀r∈[R]
r
X
s.t. λr = 2W (a − x∗ ), λr,i = 0, ∀r ∈ [R], i 6∈ Sr .
r∈[R]

By eliminating λr , we arrive at
X X
[dΨW −1 ((y, φ), Z)]2 = k yr − 2W (a − x∗ )k2W −1 + (φr − φ∗r )2 .
r r

Based on Lemma 2, we know that there exists a (y ′ , φ′ ) ∈ Ξ such that


" #
X X X
−1 −1 ′ 2
µ(ΨW , W ) k (yr − yr )kW −1 + (φr − φr ) ≥ ky − y ′ k2ΨW −1 +
′ 2
(φr − φ′r )2 .
r r r

As φ∗r is the unique optima, it follows that φ∗r = φ′r . Also, r yr′ = 2W (a − x∗ ). Moreover,
P
as X
ky − y ′ k2ΨW −1 + (φr − φ′r )2 ≥ [dΨW −1 ((y, φ), Ξ)]2 ,
r

33
Li, He, and Milenkovic

according to the above definition, we have

[dΨW −1 ((y, φ), Ξ)]2


≤ µ(ΨW −1 , W −1 ).
[dΨW −1 ((y, φ), Z)]2

Next, suppose that (y, φ) ∈ Z/Ξ and that

(y ′ , φ′ ) = arg min ky − zk2I(ΨW −1 ) + kφ − ψk2 ,


(z,ψ)∈C

(y ′′ , φ′′ ) = arg min ky ′ − zk2I(ΨW −1 ) + kφ′ − ψk2 .


(z,ψ)∈Ξ

Again, due to the definition of the distance dΨW −1 ((y, φ), Ξ), we have

[dΨW −1 ((y, φ), Ξ)]2 ≤ ky − y ′′ k2I(ΨW −1 ) + kφ − φ′′ k2 . (28)

Moreover, because of the way we chose (y ′ , φ′ ) and due to the fact that C is convex, we have

ky − y ′′ k2I(ΨW −1 ) + kφ − φ′′ k2 ≤ ky − y ′ k2I(ΨW −1 ) + kφ − φ′ k2


+ ky ′ − y ′′ k2I(ΨW −1 ) + kφ′ − φ′′ k2 . (29)

Using Lemma 2, we obtain


X
ky ′ − y ′′ k2I(ΨW −1 ) + kφ′ − φ′′ k2 ≤ µ(ΨW −1 , W −1 )(k (yr′ − yr′′ )k2W −1 + kφ′ − φ′′ k2 ), (30)
r

and we also have


X X
k (yr′ − yr′′ )k2W −1 = k yr′ − 2W (a − x∗ )k2W −1
r r
X 1)
=k (yr′ − yr )k2W −1 ≤ k(y ′ − y)k2I(ΨW −1 ) , (31)
r

where 1) follows from the Cauchy-Schwarz inequality over the entries yr,i , i ∈ Sr .
Finally, we set φ = φ′′ = φ∗ and combine (28)-(31) to obtain

[dΨW −1 ((y, φ), Ξ)]2 ≤ (1 + µ(ΨW −1 , W −1 ))(ky − y ′ k2I(ΨW −1 ) + kφ − φ′ k2 )


= (1 + µ(ΨW −1 , W −1 ))[dΨW −1 ((y, φ), C)]2 ,

which concludes the proof.

7.4 Proof of Corollary 9


First, we establish an upper bound on ρ.

Lemma 24 Suppose that Dii = r:r∈[R],i∈Cr maxS⊆V [Fr (S)]2 . Then


P

X
ρ2 ≤ 4 Dii .
i∈[N ]

34
Quadratic Decomposable Submodular Function Minimization

Proof For each r, consider yr ∈ Br . Sort the entries of yr in descending order. Without
loss of generality, assume that the ordering reads as yr,i1 ≥ yr,i2 ≥ · · · ≥ yr,iN . As Fr ([N ]) =
PN PN
j=1 yr,ij ≥ 0, we have yr,i1 ≥ 0. If yr,iN ≥ 0, then kyr k1 = k=1 yr,ik = Fr ([N ]) ≤
maxS⊆[N ] Fr (S). If yr,iN < 0, there exists a k′ such that yr,ik′ ≥ 0 and yr,ik′ +1 < 0. Given
the definition of Br , we have

N k ′
X X
|yr,ik | ≤ |yr,ik | ≤ Fr ({i1 , i2 , ..., ik′ }) ≤ max Fr (S),
S⊆[N ]
k=k ′ +1 k=1

and thus kyr k1 ≤ 2 maxS⊆[N ] Fr (S). Moreover, as each variable in [N ] is incident to at least
one submodular function, we have
X X X X
max [Fr (S)]2 ≤ max [Fr (S)]2 ≤ Dii .
S⊆[N ] S⊆[N ]
r∈[R] i∈[N ] r:i∈Sr i∈[N ]

Combining all of the above results, we obtain


X X X
ρ2 = max kyr k21 ≤ 4 max [Fr (S)]2 ≤ 4 Dii .
yr ∈Br S⊆[N ]
r∈[R] r∈[R] i∈[N ]

When W = βD, we have

X X Wii Dii
Wii 1/Wjj ≤ N 2 max = N 2 max ,
i,j Wjj i,j Djj
i∈[N ] j∈[N ]

and

X 1) X X 4 2 Dii
ρ2 1/Wjj ≤ 4 Dii 1/Wjj ≤ N max ,
β i,j Djj
j∈[N ] i∈[N ] j∈[N ]

where 1) follows from Lemma 24. According to the definition of µ(W −1 , W −1 ) (see (8)),

Dii
µ(W −1 , W −1 ) ≤ N 2 max{1, 9β −1 } max .
i,j Djj

Similarly, we have

Ψjj Dii
µ(ΨW −1 , W −1 ) ≤ N 2 max{1, 9β −1 } max .
i,j Djj

This concludes the proof.

35
Li, He, and Milenkovic

7.5 Convergence Analysis of the Conic MNP Algorithm


7.5.1 Preliminary Notation and Lemmas

P an active set S = {q1 , q2 , ...}, and a collection of coefficients λ = {λ1 , λ2 , ...}, if


Given
y = qi ∈S λi qi , we simply refer to (y, S, λ) as a triple.
Define the following functions that depend on S
X X
h̃(S, λ) , h( λi q i , λi ),
qi ∈S qi ∈S

h̃(S) , min h̃(S, λ),


λ:λi ∈R,∀i

h̃+ (S) , min h̃(S, λ).


λ:λi ≥0,∀i

If the coefficients λ (λi ∈ R, ∀i) minimize h̃(S, λ), we call the corresponding triple (y, S, λ)
a good triple. Given a triple (y, S, λ), we also define
X
△(y, q) = −hy − a, qi − λi ,
qi ∈S
X
△(y) = max △(y) = − minhy − a, qi − λi ,
q∈B q∈B
qi ∈S

and X
err(y) = h(y, λi ) − h∗ .
qi ∈S

The following lemma establishes the optimality of a good triple.


Lemma 25 Given an active set S, consider the good triple (y ′ , S, λ′ ) and an arbitrary triple
(y, S, λ). Then,
X X X X
hy ′ − a, yiW̃ + h λ′i , λi i = hy ′ − a, y ′ − yiW̃ + h λ′i , (λ′i − λi )i = 0.
qi ∈S qi ∈S qi ∈S qi ∈S

Proof Without loss of generality, assume that hy ′ − a, yiW̃ + h qi ∈S λ′i , qi ∈S λi i < 0.


P P
Then, for any ǫ > 0, (y ′ + ǫy, S, λ′ + ǫλ) is also a triple. For ǫ sufficiently small, we have
h̃(S, λ′ + ǫλ) < h̃(S, λ′ ), which contradicts the optimality of (y ′ , S, λ′ ). Hence,
X X
hy ′ − a, yiW̃ + h λ′i , λi i = 0.
qi ∈S qi ∈S

As (y ′− y, S, λ′ − λ) is also a triple, repeating the above procedure we obtain the claimed


equality.

Lemma 26 For any ŷ ∈ B,


kakW̃
arg min h(φŷ, φ) ≤ .
φ≥0 2
kakW̃
Moreover, φ∗ ≤ 2 .

36
Quadratic Decomposable Submodular Function Minimization

Proof Given ŷ, the optimal value of φ satisfies

ha, ŷiW̃ kakW̃ kakW̃


φ= ≤ ≤ .
1 + kŷk2W̃ kŷkW̃ + kŷk1 2

This establishes the claimed result.

err(y)
Lemma 27 If (y, S, λ) is a good triple, then △(y) ≥ kakW̃ , where we recall that err(y) =
h(y, qi ∈S λi ) − h∗ .
P

Proof Recall that (y ∗ , φ∗ ) denotes the optimal solution. As y ∗ /φ∗ ∈ B, we have


X
φ∗ △(y) ≥ −hy − a, y ∗ iW̃ − hφ∗ , λi i
qi ∈S
1) X X
= −hy − a, y ∗ iW̃ − hφ∗ , λi i + hy − a, yiW̃ + ( λi )2
qi ∈S qi ∈S
X X
= −hy − a, y ∗ − aiW̃ − hφ∗ , λi i + hy − a, y − aiW̃ + ( λi )2
qi ∈S qi ∈S
 
2) 1 X
≥ ky − ak2W̃ + ( λi )2 − ky ∗ − ak2W̃ + (φ∗ )2 
2
qi ∈S
1
= err(y),
2
where 1) follows from Lemma 25, while 2) is a consequence of the Cauchy-Schwarz inequal-
ity. By using the bound for φ∗ described in Lemma 26, we arrive at the desired conclusion.

7.5.2 Proof of Theorem 10


We only need to prove the following three lemmas which immediately give rise to Theo-
rem 10. Lemma 28 corresponds to the first statement of Theorem 10. Combining Lemma 29
and Lemma 30, we can establish the second statement of Theorem 10. This follows as we
may choose ǫ = δkakW̃ .
If Algorithm 2 terminates after less than O(N kak2W̃ max{Q2 , 1}/ǫ) iterations, then the
condition of Lemma 29 is satisfied and thus err(y (k) ) ≤ ǫ = δkakW̃ . If Algorithm 2 does not
terminate after O(N kak2W̃ max{Q2 , 1}/ǫ) iterations, Lemma 30 guarantees err(y (k) ) ≤ ǫ =
δkakW̃ .

Lemma 28 At any point before Algorithm 2 terminates, one has

h(y (k) , φ(k) ) ≥ h(y (k+1) , φ(k+1) );

moreover, if (y (k) , φ(k) ) triggers a MAJOR loop, the claimed inequality is strict.

37
Li, He, and Milenkovic

The following lemma characterizes the pair (y, φ) at the point when the MNP method
terminates.

Lemma 29 In the MAJOR loop at iteration k, if hy (k) − a, q (k) iW̃ + φ(k) ≥ −δ, then
h(y (k) , φ(k) ) ≤ h∗ + kakW̃ δ.

Lemma 30 If Algorithm 2 does not terminate, then for any ǫ > 0, one can guarantee that
after O(N kak2W̃ max{Q2 , 1}/ǫ) iterations, Algorithm 2 generates a pair (y, φ) that satisfies
err(y) ≤ ǫ.

The proofs of Lemma 28 and Lemma 29 are fairly straightforward, while the proof of
Lemma 30 is significantly more involved and postponed to the next section.
Proof of Lemma 28. Suppose that (y (k) , φ(k) ) starts a MAJOR loop. As

hy (k) − a, q (k) iW̃ + φ(k) < −δ,

we know that there exists some small ε such that

h(y (k) + εq (k) , φ(k) + ε) < h(y (k) , φ(k) ).

Consider next the relationship between (z (k) , qi ∈S (k) ∪{q(k) } αi ) and (y (k) , φ(k) ). Because of
P
Step 6, we know that
X
h(z (k+1) , αi ) = h̃(S (k) ∪ {q (k) }) ≤ h(y (k) + εq (k) , φ(k) + ε) < h(y (k) , φ(k) ).
qi ∈S (k) ∪{q (k) }

If (y (k+1) , φ(k+1) ) is generated in some MAJOR loop, then


X
(y (k+1) , φ(k+1) ) = (z (k+1) , αi ),
qi ∈S (k) ∪{q (k) }

which naturally satisfies the claimed condition. If (y (k+1) , φ(k+1) ) is generated P in some MI-
NOR loop, then (y (k+1) ,φ (k+1) ) lies strictly within the segment between (z (k+1) , qi ∈S (k) ∪{q(k) } αi )
(k) (k)
and (y , φ ) (because θ > 0). Therefore, we also have h(y (k+1) ,φ(k+1) ) < h(y (k) , φ(k) ). If
(k) (k)
(y , φ ) starts a MINOR loop, then we have
X
h(z (k+1) , αi ) = h̃(S (k) ) ≤ h(y (k) , φ(k) ),
qi ∈S (k) ∪{q (k) }

once again (k+1) (k+1) ) still lies within the segment between
(k+1)
P due to Step 6. As (y (k) , φ
(z , qi ∈S (k) ∪{q(k) } αi ) and (y , φ(k) ), we have h(y (k+1) , φ(k+1) ) ≤ h(y (k) , φ(k) ).
Proof of Lemma 29. Lemma 29 is a corollary of Lemma 27. To see why this is the
case, observe that in a MAJOR loop, (y (k) , S (k) , λ(k) ) is always a good triple. Since

△(y (k) ) = −(hy (k) − a, q (k) iW̃ + φ(k) ) ≤ δ,

we have err(y) ≤ δkakW̃ .

38
Quadratic Decomposable Submodular Function Minimization

7.5.3 Proof of Lemma 30


The outline of the proof is similar to that of the standard case described in (Chakrabarty
et al., 2014), and some results therein can be directly reused. The key step is to show that
in every MAJOR loop k with no more than one MINOR loop, the objective achieved by
y (k) decreases sufficiently, as precisely described in Theorem 31.
Theorem 31 For a MAJOR loop with no more than one MINOR loop, if the starting point
is y, the starting point y ′ of the next MAJOR loop satisfies
 
′ err(y)
err(y ) ≤ err(y) 1 − .
kakW̃ (Q2 + 1)
Based on this theorem, it is easy to establish the result in Lemma 30 by using the next
lemma and the same approach as described in (Chakrabarty et al., 2014).
Lemma 32 (Lemma 1 (Chakrabarty et al., 2014)) In any consecutive 3N + 1 itera-
tons, there exists at least one MAJOR loop with not more than one MINOR loop.
We now focus on the proof of Theorem 31. The next geometric lemma is the counterpart
of Lemma 2 (Chakrabarty et al., 2014) for the conic case.
Lemma 33 Given an active set S, consider a good triple (y ′ , S, λ′ ) and an arbitrary triple
(y, S, λ). Then, for any q ∈ lin(S) such that △(y, q) > 0, we have
X X △2 (y, q)
ky − ak2W̃ + ( λi )2 − ky ′ − ak2W̃ − ( λ′i )2 ≥ .
kqk2W̃ + 1
qi ∈S qi ∈S

Proof First, we have


X X
ky − ak2W̃ + ( λi )2 − ky ′ − ak2W̃ − ( λ′i )2
qi ∈S qi ∈S
X X X
=ky − y ′ k2W̃ + [ (λi − λ′i )]2 + 2hy − y ′ , y ′ − aiW̃ + 2h (λi − λ′i ), λ′i i
qi ∈S qi ∈S qi ∈S
1) X
=ky − y ′ k2W̃ + [ (λi − λ′i )]2 ,
qi ∈S

where 1) follows from Lemma 25. Next, for any φ ≥ 0,


X
ky − y ′ k2W̃ + [ (λi − λ′i )]2
qi ∈S
h i2
hy − y ′ , y − φqiW̃ + h qi ∈S (λi − λ′i ), qi ∈S λi − φi
P P
1)

ky − φqk2W̃ + ( qi ∈S λi − φ)2
P
h i2
hy − a, y − φqiW̃ + h qi ∈S λi , qi ∈S λi − φi − hy ′ − a, y − φqiW̃ − h qi ∈S λ′i , qi ∈S λi − φi
P P P P
=
ky − rqk2W̃ + ( qi ∈S λi − φ)2
P
h P P i2
2)
hy − a, y − φqiW̃ + h λ
qi ∈S i , λ
qi ∈S i − φi
= 2 , (32)
ky − φqkW̃ + ( qi ∈S λi − φ)2
P

39
Li, He, and Milenkovic

where 1) follows from the Cauchy-Schwarz inequality and 2) is due to Lemma 25. Since
△2 (y,q)
△(y, q) > 0, letting φ → ∞ reduces equation (32) to kqk2 +1 .

Next, using Lemma 33, we may characterize the decrease of the objective function for
one MAJOR loop with no MINOR loop. As (y (k+1) , S (k+1) , λ(k+1) ) is a good triple and y (k)
also lies in lin(S), we have the following result.

Lemma 34 Consider some MAJOR loop k without MINOR loops. Then

△2 (y (k) , q (k) ) △2 (y (k) )


err(y (k) ) − err(y (k+1) ) ≥ = .
Q2 + 1 Q2 + 1

Next, we characterize the decrease of the objective function for one MAJOR loop with
one single MINOR loop.

Lemma 35 Consider some MAJOR loop k with only one MINOR loop. Then

△2 (y (k) )
err(y (k) ) − err(y (k+2) ) ≥ .
Q2 + 1

Proof Suppose that the active sets associated with y (k) , y (k+1) , y (k+2) are S (k) , S (k+1) , S (k+2) ,
respectively. We know that within the MINOR loop, (z (k) , S (k) ∪ {q (k) }, α) is a good triple
and y (k+1) = θy (k) + (1 − θ)z (k) , for some θ ∈ [0, 1]. Let
X (k) X
A = ky (k) − ak2W̃ + ( λi )2 − kz (k) − ak2W̃ − ( αi )2 . (33)
qi ∈S (k) qi ∈S (k) ∪{q (k) }

From Lemma 33, we have

△2 (y (k) , q (k) ) △2 (y (k) )


A≥ ≥ . (34)
kq (k) k2W̃ + 1 Q2 + 1

Note that both S (k) and S (k+1) are subsets of S (k) ∪ {q (k) }. As (z (k) , S (k) ∪ {q (k) }, α) is a
good triple, using Lemma 25, we obtain
(k)
X X
hz (k) − a, z (k) − y (k) iW̃ + h αi , (αi − λi )i = 0.
qi ∈S (k) ∪{q (k) } qi ∈S (k) ∪{q (k) }

Furthermore, as y (k+1) = θy (k) +(1−θ)z (k) = z (k) −θ(z (k) −y (k) ) and λ(k+1) = α−θ(α−λ(k)),
we have
X (k) X (k+1) 2
ky (k) − ak2W̃ + ( λi )2 − ky (k+1) − ak2W̃ − ( λi ) = (1 − θ 2 )A.
i∈S (k) i∈S (k+1)

Moreover, we have

△(y (k+1) , q (k) ) = θ△(y (k) , q (k) ) + (1 − θ)△(z (k) , q (k) ) = θ△(y (k) , q (k) ) = θ△(y (k) ), (35)

40
Quadratic Decomposable Submodular Function Minimization

which holds because △(y, q) is linear in y and Lemma 25 implies △(z (k) , q (k) ) = 0.
Since according to Lemma 28, h(y (k) , φ(k) ) > h(y (k+1) , φ(k+1) ), Lemma 4 in (Chakrabarty
et al., 2014) also holds in our case, and thus q (k) ∈ S (k+1) . To obtain y (k+2) and S (k+2) ,
one needs to remove active points with a zero coefficients from S (k+1) , so that y (k+2) once
again belongs to a good triple with corresponding S (k+1) . Based on 33 and equation (35),
we have the following result.
(k+1) 2 (k+2) 2
X X
ky (k+1) − ak2W̃ + ( λi ) −ky (k+2) − ak2W̃ − ( λi ) (36)
qi ∈S (k+1) qi ∈S (k+2)

△2 (y (k+1) , q (k) ) θ 2 △2 (y (k) )


≥ = . (37)
Q2 + 1 Q2 + 1
Consequently, combining equations (33), (34) and (36), we arrive at
△2 (y (k) )
err(y (k) ) − err(y (k+2) ) ≥ .
Q2 + 1

And, combining Lemma 34, Lemma 35 and Lemma 27 establishes Theorem 31.

7.6 Convergence Analysis of the Conic Frank-Wolfe Algorithm


7.6.1 Proof of Theorem 11
Using the same strategy as in the proof of Lemma 26, we may prove the following lemma.
(k) (k)
It hinges on the optimality assumption for γ1 φ(k) + γ2 in Step 3 of Algorithm 3.
kakW̃
Lemma 36 In Algorithm 3, for all k, φ(k+1) ≤ 2 .
Now, we prove Theorem 11. We write y = φŷ, where ŷ ∈ B, so that
h(y, φ) = h(φŷ, φ) ≥ h(y (k) , φ(k) ) + 2hy (k) − a, φŷ − y (k) iW̃ + φ2 − (φ(k) )2 .
kak
For both sides, minimize (ŷ, φ) over B × [0, 2W̃ ], which contains (y ∗ , φ∗ ). Since q (k) =
arg minq∈B (y (k) − a, qiW̃ , we know that the optimal solutions satisfy ŷ = q (k) and φ =
kakW̃
φ̃(k) = min{max{0, −hy (k) − a, q (k) i}, 2 } of the RHS
h∗ = h(y ∗ , φ∗ ) ≥ h(y (k) , φ(k) ) + 2hy (k) − a, φ̃(k) q (k) − y (k) iW̃ + (φ̃(k) )2 − (φ(k) )2 . (38)
(k) (k)
Moreover, because of the optimality of (γ1 × γ2 ) ∈ R2≥0 , for arbitrary γ ∈ [0, 1] we have
(k) (k) (k) (k)
h(γ1 y (k) + γ2 q (k) , γ1 φ(k) + γ2 )
≤ h((1 − γ)y (k) + γ φ̃(k) q (k) , (1 − γ)φ(k) + γ φ̃(k) )
= h(y (k) , φ(k) ) + 2γhy (k) − a, φ̃(k) q (k) − y (k) iW̃
+ γ[(φ̃(k) )2 − (φ(k) )2 ] + γ 2 kφ̃(k) q (k) − y (k) k2W̃ + (γ 2 − γ)(φ̃(k) − φ(k) )2
1)
≤ h(y (k) , φ(k) ) + γ(h∗ − h(y (k) , φ(k) )) + γ 2 kφ̃(k) q (k) − y (k) k2W̃
2)
≤ h(y (k) , φ(k) ) + γ(h∗ − h(y (k) , φ(k) )) + γ 2 kak2W̃ Q2 ,

41
Li, He, and Milenkovic

where 1) follows from (38) and γ 2 − γ ≤ 0, and 2) follows from


kak2W̃
kφ̃(k) q (k) − y (k) k2W̃ ≤ 4 max kqk2W̃ = kak2W̃ Q2 .
4 q∈B
hŷ ∗ ,aiW̃
The claimed result now follows by induction. First, let ŷ ∗ = y ∗ /φ∗ , where φ∗ = 1+kŷ ∗ k2
.

Then,
hŷ ∗ , ai2W̃
h(y (0) , φ(0) ) − h∗ ≤ 2hy ∗ , ai − (y ∗ )2 − (φ∗ )2 = ≤ kak2W̃ Q2 .
1 + kŷ ∗ k2W̃
2kak2 Q2
Suppose that h(y (k) , r (k) ) − h∗ ≤ k+2 .

In this case, for all γ ∈ [0, 1], we have
h(y (t+1) , φ(t+1) ) − h∗ ≤ (1 − γ)[h(y (k) , φ(k) ) − h∗ ] + γ 2 kak2W̃ Q2 .
1 2kak2 Q2
By choosing γ = k+2 , we obtain h(y (k+1) , φ(k+1) ) − h∗ ≤ k+3 ,

which concludes the
proof.

7.7 Proofs for the Partitioning Properties of PageRank


7.7.1 Proof of Lemma 14
Based on the definitions of Sjp and j ∈ Vp , we have h∇fr (x), 1Sjp i = Fr (Sjp ). Consequently,

2p(Sjp ) −
X
wr fr (x)h∇fr (x), 1Sjp i
r∈[R]

=2p(Sjp ) wr fr (x)Fr (Sjp )


X

r∈[R]

=2p(Sjp ) wr Fr (Sjp )
X
− max (xi − xj )
r∈[R] (i,j)∈Sr↑ ×Sr↓
   

= Ip (vol(Sjp )) − wr Fr (Sjp ) max xi  + Ip (vol(Sjp )) + wr Fr (Sjp ) min xi 


X X

r∈[R] i∈Sr↑ r∈[R] i∈Sr↓


   

≤Ip vol(Sjp ) − wr Fr (Sjp ) + Ip vol(Sjp ) + wr Fr (Sjp )


X X

r∈[R] r∈[R]
   
=Ip vol(Sjp ) − vol(∂Sjp ) + Ip vol(Sjp ) + vol(∂Sjp ). (39)

Using equation (18), we have


 
α 2 − 2α 1
p(Sjp ) = p0 (Sjp ) + [M (p) − p] (Sjp ) + p(Sjp )
2−α 2−α 2
 
1) α 2 − 2α  1
p0 (Sjp ) + fr (x)h∇fr (x), 1Sjp i + p(Sjp )
X
= −
2−α 2−α 2
r∈[R]
2) α 1−αh    i
≤ p0 (Sjp ) + Ip vol(Sjp ) − vol(∂Sjp ) + Ip vol(Sjp ) + vol(∂Sjp ) ,
2−α 2−α

42
Quadratic Decomposable Submodular Function Minimization

where 1) is due to Lemma 12 and 2) is due to equation (39). This proves the first inequality.
By using the concavity of Ip (·), we also have

Ip (vol(Sjp )) ≤ p0 (Sjp ) ≤ Ip0 (vol(Sjp )).

Moreover, as Ip is piecewise linear, the proof follows.

7.7.2 Proof of Theorem 16


This result can be proved in a similar way as the corresponding case for graphs (Andersen
et al., 2006), by using induction.
Define k̄ = min{k, m − k}, dmin = mini:(p0 )i >0 di and

2 t
s !
k α k̄ Φ p
I (t) (k) = + t+ 1− .
m 2−α dmin 8

When t = 0, Ip (k) ≤ Ip0 (k) holds due to Lemma 14. As I (0) (m) = I (0) (dmin ) = 1, for
k ∈ [dmin , m], we have

Ip0 (k) ≤ 1 ≤ I (0) (k).

Since for k ∈ [0, dmin ], Ip0 (k) is linear, we also have Ip0 (0) = 0 ≤ I (t) (0). Hence, for
k ∈ [0, dmin ], it also holds that Ip0 (k) ≤ I (0) (k).
Next, suppose that for step t, Ip (k) ≤ I (t) (k) holds. We then consider the case t + 1.
For k = vol(Sjp ), Lemma 14 indicates that

α 1−α
Ip (k) ≤ Ip0 (k) + [Ip (k − k̄Φ(Sjp )) + Ip (k + k̄Φ(Sjp ))]
2−α 2−α
 s s  !t 
p p 2
1) α 1 − α  2k 2α k − k̄Φ(Sj ) k + k̄Φ(Sj ) Φ
≤ Ip0 (k) + + t+ +  1− p 
2−α 2−α m 2−α dmin dmin 8
 s s  !t
p
2) k α 1 − α  k − k̄Φ(Sj ) k + k̄Φ(Sjp ) Φ2p
≤ + (t + 1) + +  1−
m 2−α 2−α dmin dmin 8
s s  !t
p
3) k α 1 − α  k̄ − k̄Φ(Sj ) k̄ + k̄Φ(Sjp ) Φ2p
≤ + (t + 1) + +  1−
m 2−α 2−α dmin dmin 8
s !t+1
4) k α 2 − 2α k̄ Φ2p
≤ + (t + 1) + 1− ≤ I (t+1) (k),
m 2−α 2 − α dmin 8

where 1) follows from Lemma 14; 2) is due to Ip0 (k) ≤ 1; 3) can be verified by considering
p
two cases separately, namely k ≤ m m
2 and k ≥ 2 ; and 4) follows from Φ(Sj ) ≥ Φp and the

Taylor-series expansion of x ± φx.
At this point, we have shown that Ip (k) ≤ I (t) (k) for all k = vol(Sjp ). Since {vol(Sjp )}j∈Vp
covers all break points of Ip and since I (t) is concave, we have that for all k ∈ [0, m], it holds
that Ip (k) ≤ I (t) (k). This concludes the proof.

43
Li, He, and Milenkovic

7.7.3 Proof of Theorem 17


First, we prove that under the assumptions for the vertex-sampling probability P in the
statement of the theorem, pr(α, 1i )(S) can be lower bounded as follows.
Lemma 37 If a vertex v ∈ S is sampled according to a distribution P such that
Ei∼P [pr(α, 1i )(S̄)] ≤ c pr(α, πS )(S̄),
where c is a constant, then with probability at least 12 , one has
c Φ(S)
pr(α, 1i )(S) ≥ 1 − .

Proof Let p = pr(α, πS ). Then,
1)
αp(S̄) = απS (S̄) + (1 − α) [M (p) − p] (S̄)
2) X
=− wr fr (x)h∇fr (x), 1S̄ i
r∈[R]
3) X X
≤ wr fr (x)F (S) ≤ wr F (S) max xi
r∈[R] r∈[R] i∈Sr↑
X
≤ Ip ( wr F (S)) = Ip (vol(∂S))
r∈[R]
4)
≤ IπS (vol(∂S)) = Φ(S),
where 1) is a consequence of Equation (18); 2) follows from Lemma 12; 3) holds because
for any x ∈ RN , h∇fr (x), 1S̄ i ≥ −F (S); and 4) is a consequence of Lemma 14. Hence,
c c Φ(S)
pr(α, πS )(S̄) ≤
Ev∼P [pr(α, 1i )(S̄)] ≤ .
8 8 α
Moreover, sampling according to v ∼ P and using Markov’s inequality, we have
 
c Φ(S) Ev∼P [pr(α, 1i )(S̄)] 1
P pr(α, 1i )(S̄) ≥ ≤ ≤ ,
4 α c Φ(S) 2
4 α
which concludes the proof.
As Φ(S) ≤ αc , we have P pr(α, 1i )(S) ≥ 43 ≥ 21 . By combining this lower bound on
 

pr(α, 1i )(S) with the upper bound of Theorem 16, we have


s !t
3 1 α vol(S) Φ2pr(α,1i )
≤ pr(α, 1i )(S) ≤ Ipr(α,1i ) (vol(S)) ≤ + t+ 1− .
4 2 2−α di 8
q
Next, we choose t = ⌈ Φ2 8 ln 8 vol(S)di ⌉. Then, the above inequality may be transformed
pr(α,1i )
into
s s
3 1 α 8 vol(S) 1 5 α 8 vol(S)
≤ + ⌈ 2 ln 8 ⌉+ ≤ + 2 ln 10 .
4 2 2 − α Φpr(α,1i ) di 8 8 2 − α Φpr(α,1i ) di

44
Quadratic Decomposable Submodular Function Minimization

Therefore,
s
100vol(S)
Φpr(α,1i ) ≤ 32α ln .
di

7.8 Analysis of the Parameter Choices for Experimental Verification


Let x′ = W −1/2 x and a′ = W −1/2 a. We can transform the objective (21) into a standard
QDSFM problem,
X
βkx′ − a′ k2W + max (x′i − x′j )2 .
i,j∈Sr
r∈[R]

From Theorem 4, to achieve an ǫ-optimal solution, one requires O(Rµ(β −1 W −1 , β −1 W −1 ) log 1ǫ )


iterations in the RCD algorithm (Algorithm 1). According to the particular settings for the
experiment (undirected unweighted hypergraphs), we have
X X
ρ2 = max kyr k21 = max 2 = 2R. (40)
yr ∈Br ,∀r yr ∈Br ,∀r
r∈[R] r∈[R]

From the definition of µ (8), we may rewrite µ(β −1 W −1 , β −1 W −1 ) as

N2
   
−1 −1 −1 −1 1) Wii 9 2 −1 1
µ(β W ,β W ) = max max + 1 , ρ N β max
2 i,j∈[N ] Wjj 4 j∈[N ] Wjj

N2
   
2) Wii 9 −1 1
= max max + 1 , β N R max
2 i,j∈[N ] Wjj 2 j∈[N ] Wjj

N2
   
3) Wii 9 −1 2 Wii
= max max + 1 , β N max , (41)
2 i,j∈[N ] Wjj 2 i,j∈[N ] Wjj

where 1) holds because half of the values of Wii are set to 1 and the other half to a value in
{1, 0.1, 0.01, 0.001}, 2) follows from (40) and 3) is due to the particular setting N = R and
maxi∈[N ] Wii = 1. Equation (41) may consequently be rewritten as

O(N 2 max(1, 9/(2β)) max Wii /Wjj ),


i,j∈[N ]

which establishes the claimed statement.

8. Acknowledgement
The authors would like to thank the reviewers for their insightful suggestions to improve
the quality of the manuscript. The authors also gratefully acknowledge funding from the
NSF CIF 1527636 and the NSF Science and Technology Center (STC) at Purdue University,
Emerging Frontiers of Science of Information, 0939370.

45
Li, He, and Milenkovic

References
Reid Andersen, Fan Chung, and Kevin Lang. Local graph partitioning using pagerank
vectors. In Proceedings of the 47th Annual Symposium on Foundations of Computer
Science, pages 475–486. IEEE Computer Society, 2006.

Reid Andersen, Fan Chung, and Kevin Lang. Local partitioning for directed graphs using
pagerank. Internet Mathematics, 5(1-2):3–22, 2008.

Francis Bach. Learning with submodular functions: A convex optimization perspective.


Foundations and Trends® in Machine Learning, 6(2-3):145–373, 2013.

Austin R Benson, David F Gleich, and Jure Leskovec. Higher-order organization of complex
networks. Science, 353(6295):163–166, 2016.

Béla Bollobás. Modern graph theory, volume 184. Springer Science & Business Media, 2013.

Deeparnab Chakrabarty, Prateek Jain, and Pravesh Kothari. Provable submodular mini-
mization using Wolfe’s algorithm. In Advances in Neural Information Processing Systems,
pages 802–809, 2014.

Antonin Chambolle and Jérôme Darbon. On total variation minimization and surface
evolution using parametric maximum flows. International Journal of Computer Vision,
84(3):288, 2009.

Antonin Chambolle and Thomas Pock. A first-order primal-dual algorithm for convex
problems with applications to imaging. Journal of Mathematical Imaging and Vision, 40
(1):120–145, 2011.

T-H Hubert Chan, Zhihao Gavin Tang, Xiaowei Wu, and Chenzi Zhang. Diffusion operator
and spectral analysis for directed hypergraph laplacian. arXiv preprint arXiv:1711.01560,
2017.

T-H Hubert Chan, Anand Louis, Zhihao Gavin Tang, and Chenzi Zhang. Spectral properties
of hypergraph laplacian and approximation algorithms. Journal of the ACM (JACM), 65
(3):15, 2018.

Josip Djolonga and Andreas Krause. Scalable variational inference in log-supermodular


models. In ICML, pages 1804–1813, 2015.

Jack Edmonds. Submodular functions, matroids, and certain polyhedra. In Combinatorial


Optimization?Eureka, You Shrink!, pages 11–26. Springer, 2003.

Ivar Ekeland and Roger Temam. Convex Analysis and Variational Problems, volume 28.
Siam, 1999.

Alina Ene and Huy Nguyen. Random coordinate descent methods for minimizing decompos-
able submodular functions. In Proceedings of the International Conference on Machine
Learning, pages 787–795, 2015.

46
Quadratic Decomposable Submodular Function Minimization

Alina Ene, Huy Nguyen, and László A Végh. Decomposable submodular function minimiza-
tion: discrete and continuous. In Advances in Neural Information Processing Systems,
pages 2874–2884, 2017.

Marguerite Frank and Philip Wolfe. An algorithm for quadratic programming. Naval
Research Logistics, 3(1-2):95–110, 1956.

Satoru Fujishige. Submodular functions and optimization, volume 58. Elsevier, 2005.

Satoru Fujishige and Shigueo Isotani. A submodular function minimization algorithm based
on the minimum-norm base. Pacific Journal of Optimization, 7(1):3–17, 2011.

Alexander Gammerman, Volodya Vovk, and Vladimir Vapnik. Learning by transduction. In


Proceedings of the Fourteenth conference on Uncertainty in artificial intelligence, pages
148–155. Morgan Kaufmann Publishers Inc., 1998.

David F Gleich, Lek-Heng Lim, and Yongyang Yu. Multilinear pagerank. SIAM Journal
on Matrix Analysis and Applications, 36(4):1507–1541, 2015.

Zaid Harchaoui, Anatoli Juditsky, and Arkadi Nemirovski. Conditional gradient algorithms
for norm-regularized smooth convex optimization. Mathematical Programming, 152(1-2):
75–112, 2015.

Matthias Hein, Simon Setzer, Leonardo Jost, and Syama Sundar Rangapuram. The to-
tal variation on hypergraphs-learning on hypergraphs revisited. In Advances in Neural
Information Processing Systems, pages 2427–2435, 2013.

Stefanie Jegelka, Francis Bach, and Suvrit Sra. Reflection methods for user-friendly sub-
modular optimization. In Advances in Neural Information Processing Systems, pages
1313–1321, 2013.

Thorsten Joachims. Transductive learning via spectral graph partitioning. In Proceedings


of the 20th International Conference on Machine Learning, pages 290–297, 2003.

Rie Johnson and Tong Zhang. On the effectiveness of laplacian normalization for graph
semi-supervised learning. Journal of Machine Learning Research, 8(Jul):1489–1517, 2007.

Sanjiv Kumar and Martial Hebert. Discriminative random fields: A discriminative frame-
work for contextual interaction in classification. In Proceedings of IEEE International
Conference on Computer Vision, pages 1150–1157. IEEE, 2003.

Pan Li and Olgica Milenkovic. Inhomogeneous hypergraph clustering with applications. In


Advances in Neural Information Processing Systems, pages 2305–2315, 2017.

Pan Li and Olgica Milenkovic. Revisiting decomposable submodular function minimization


with incidence relations. In Advances in Neural Information Processing Systems, 2018a.

Pan Li and Olgica Milenkovic. Submodular hypergraphs: p-laplacians, cheeger inequali-


ties and spectral clustering. In Proceedings of the International Conference on Machine
Learning, 2018b.

47
Li, He, and Milenkovic

Pan Li, Hoang Dau, Gregory Puleo, and Olgica Milenkovic. Motif clustering and overlapping
clustering for social network analysis. In INFOCOM 2017-IEEE Conference on Computer
Communications, IEEE, pages 1–9. IEEE, 2017.

Francesco Locatello, Michael Tschannen, Gunnar Rätsch, and Martin Jaggi. Greedy algo-
rithms for cone constrained optimization with convergence guarantees. In Advances in
Neural Information Processing Systems, pages 773–784, 2017.

László Lovász. Submodular functions and convexity. In Mathematical Programming The


State of the Art, pages 235–257. Springer, 1983.

László Lovasz and Miklós Simonovits. The mixing rate of markov chains, an isoperimetric
inequality, and computing the volume. In Proceedings of the 31st Annual Symposium on
Foundations of Computer Science, page 1. IEEE Computer Society, 1990.

Robert Nishihara, Stefanie Jegelka, and Michael I Jordan. On the convergence rate of
decomposable submodular function minimization. In Advances in Neural Information
Processing Systems, pages 640–648, 2014.

Stanley Osher, Martin Burger, Donald Goldfarb, Jinjun Xu, and Wotao Yin. An iterative
regularization method for total variation-based image restoration. Multiscale Modeling &
Simulation, 4(2):460–489, 2005.

Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The pagerank citation
ranking: Bringing order to the web. Technical report, Stanford InfoLab, 1999.

N. Z. Shor, Krzysztof C. Kiwiel, and Andrzej Ruszcayǹski. Minimization Methods for Non-
differentiable Functions. Springer-Verlag, Berlin, Heidelberg, 1985. ISBN 0-387-12763-1.

Peter Stobbe and Andreas Krause. Efficient minimization of decomposable submodular


functions. In Advances in Neural Information Processing Systems, pages 2208–2216,
2010.

Philip Wolfe. Convergence theory in nonlinear programming. Integer and nonlinear pro-
gramming, pages 1–36, 1970.

Hao Yin, Austin R Benson, Jure Leskovec, and David F Gleich. Local higher-order graph
clustering. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowl-
edge Discovery and Data Mining, pages 555–564. ACM, 2017.

Yuichi Yoshida. Cheeger inequalities for submodular transformations. In Proceedings of


the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 2582–2601.
Society for Industrial and Applied Mathematics, 2019.

Chenzi Zhang, Shuguang Hu, Zhihao Gavin Tang, and TH Hubert Chan. Re-revisiting
learning on hypergraphs: confidence interval and subgradient method. In Proceedings of
the International Conference on Machine Learning, pages 4026–4034, 2017.

48
Quadratic Decomposable Submodular Function Minimization

Dawei Zhou, Si Zhang, Mehmet Yigit Yildirim, Scott Alcorn, Hanghang Tong, Hasan
Davulcu, and Jingrui He. A local algorithm for structure-preserving graph cut. In Pro-
ceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery
and Data Mining, pages 655–664. ACM, 2017.

Denny Zhou, Olivier Bousquet, Thomas N Lal, Jason Weston, and Bernhard Schölkopf.
Learning with local and global consistency. In Advances in Neural Information Processing
Systems, pages 321–328, 2004.

Denny Zhou, Jiayuan Huang, and Bernhard Schölkopf. Learning with hypergraphs: Clus-
tering, classification, and embedding. In Advances in Neural Information Processing
Systems, pages 1601–1608, 2007.

Xiaojin Zhu, Zoubin Ghahramani, and John D. Lafferty. Semi-supervised learning using
gaussian fields and harmonic functions. In Proceedings of the 20th International Confer-
ence on Machine Learning, pages 912–919, 2003a.

Xiaojin Zhu, John Lafferty, and Zoubin Ghahramani. Combining active learning and semi-
supervised learning using gaussian fields and harmonic functions. In ICML 2003 workshop
on the continuum from labeled to unlabeled data in machine learning and data mining,
volume 3, 2003b.

49

You might also like