Sparse projections onto the simplex
Anastasios Kyrillidis
Stephen Becker
Volkan Cevher
Christoph Koch
Abstract
Most learning methods with rank or sparsity constraints use convex relaxations, which lead to optimization with the nuclear norm or the ℓ1-norm. However, several important learning applications cannot benefit from this approach as they feature these convex norms as constraints in addition to the non-convex rank and sparsity constraints. In this setting, we derive efficient sparse projections onto the simplex and its extension, and illustrate how to use them to solve high-dimensional learning problems in quantum tomography, sparse density estimation and portfolio selection with non-convex constraints.
1. Introduction
We study the following sparse Euclidean projections:
Problem 1. (Simplex) Given w ∈ R^p, find a Euclidean projection of w onto the intersection of k-sparse vectors Σ_k = {β ∈ R^p : |{i : β_i ≠ 0}| ≤ k} and the simplex Δ_λ^+ = {β ∈ R^p : β_i ≥ 0, Σ_i β_i = λ}:

P(w) ∈ argmin_{β ∈ Σ_k ∩ Δ_λ^+} ‖β − w‖_2.   (1)

Problem 2. (Hyperplane) Replace Δ_λ^+ in (1) with the hyperplane constraint Δ_λ = {β ∈ R^p : Σ_i β_i = λ}.
We prove that it is possible to compute such projections
in quasilinear time via simple greedy algorithms.
Our motivation with these projectors is to address important learning applications where the standard sparsity/low-rank heuristics based on the ℓ1/nuclear norm are either given as a constraint or conflict with the problem constraints. For concreteness, we highlight quantum tomography, density learning, and Markowitz portfolio design problems as running examples. We then illustrate provable non-convex solutions to minimize quadratic loss functions

f(β) := ‖y − A(β)‖_2   (2)

subject to the constraints in Problems 1 and 2 with our projectors. In (2), we assume that y ∈ R^m is given and the (known) operator A : R^p → R^m is linear.

For simplicity of analysis, our minimization approach is based on the projected gradient descent algorithm:
β^{i+1} = P(β^i − μ_i ∇f(β^i)),   (3)

where β^i is the i-th iterate, ∇f(·) is the gradient of the loss function, μ_i is a step size, and P(·) is based on Problem 1 or 2. When the linear map A in (2) provides a bi-Lipschitz embedding for the constraint sets, we can derive rigorous approximation guarantees for the algorithm (3); cf., (Garg & Khandekar, 2009).¹
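To make the discussion concrete, here is a minimal sketch of iteration (3) for the quadratic loss in (2), assuming A is given as an explicit matrix and the projector P is supplied as a callable (for instance, one of the GSSP/GSHP routines sketched later). The function name and defaults are ours, not part of the paper.

```python
import numpy as np

def projected_gradient(A, y, project, n_iter=100, step=None):
    """Sketch of iteration (3): beta <- P(beta - mu * grad f(beta))
    for the quadratic loss f(beta) = ||y - A @ beta||_2^2."""
    if step is None:
        # conservative constant step: 1 / Lipschitz constant of the gradient
        step = 1.0 / (2.0 * np.linalg.norm(A, 2) ** 2)
    beta = project(np.zeros(A.shape[1]))     # feasible starting point
    for _ in range(n_iter):
        grad = 2.0 * A.T @ (A @ beta - y)    # gradient of the quadratic loss
        beta = project(beta - step * grad)   # projection onto Problem 1 or 2
    return beta
```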
To the best of our knowledge, explicitly sparse Euclidean projections onto the simplex and hyperplane
constraints have not been considered before. The closest work to ours is the paper (Kyrillidis & Cevher,
2012). In (Kyrillidis & Cevher, 2012), the authors propose an alternating projection approach in regression
where the true vector is already sparse and within a
convex norm-ball constraint. In contrast, we consider
the problem of projecting an arbitrary given vector
onto convex-based and sparse constraints jointly.
At the time of this submission, we became aware of (Pilanci et al., 2012), which considers cardinality-regularized loss function minimization subject to simplex constraints. Their convexified approach relies on solving a lower bound to the objective function and has O(p^4) complexity, which is not scalable. We also note that regularizing with the cardinality constraints is generally easier: e.g., our projectors become simpler.
¹Surprisingly, a recent analysis of this algorithm under similar assumptions indicates that rigorous guarantees can be obtained for the minimization of general loss functions other than the quadratic (Bahmani et al., 2011).
Notation: Plain and boldface lowercase letters represent scalars and vectors, respectively. The i-th entry of a vector w is w_i, and [w_i]_+ = max(w_i, 0), while β^i is the model estimate at the i-th iteration of an algorithm. Given a set S ⊆ N = {1, ..., p}, the complement S^c is defined with respect to N, and the cardinality is |S|. The support set of w is supp(w) = {i : w_i ≠ 0}. Given a vector w ∈ R^p, w_S is the projection (in R^p) of w onto S, i.e., (w_S)_{S^c} = 0, whereas w_S ∈ R^{|S|} is w limited to the entries in S. The all-ones column vector is 1, with dimensions apparent from the context. We define Σ_k as the set of all k-sparse subsets of N, and we sometimes write β ∈ Σ_k to mean supp(β) ∈ Σ_k. The trace of a matrix X is written tr(X).
2. Preliminaries
Basic definitions: Without loss of generality, assume w is sorted in descending order, so w_1 is the largest element. We denote by P_λ^+ the (convex) Euclidean projector onto the standard simplex Δ_λ^+, and by P_λ its extension to Δ_λ. The (non-convex) Euclidean projector onto the set Σ_k is P_{Σ_k}, which retains the k largest-in-magnitude elements. In contrast to P_λ, the projection P_{Σ_k} need not be unique.

Definition 2.1 (Operator P_{L_k}). We define P_{L_k}(w) as the operator that keeps the k largest entries of w (not in magnitude) and sets the rest to zero. This operation can be computed in O(p·min(k, log p)) time.
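A minimal NumPy realization of Definition 2.1 (the function name is ours); argpartition gives the O(p) selection rather than a full sort.

```python
import numpy as np

def P_Lk(w, k):
    """Keep the k largest entries of w (by value, not magnitude); zero out the rest."""
    out = np.zeros_like(w, dtype=float)
    idx = np.argpartition(w, -k)[-k:]   # indices of the k largest values
    out[idx] = w[idx]
    return out
```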
Definition 2.2 (Euclidean projection P_λ^+). The projector onto the simplex is given by

(P_λ^+(w))_i = [w_i − τ]_+,  where  τ := (1/ρ) (Σ_{i=1}^ρ w_i − λ)

for ρ := max{ j : w_j > (1/j) (Σ_{i=1}^j w_i − λ) }.
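The threshold form of Definition 2.2 translates directly into a short O(p log p) routine; the sketch below (our naming) sorts once and scans for ρ.

```python
import numpy as np

def P_simplex(w, lam=1.0):
    """Euclidean projection onto {beta : beta_i >= 0, sum(beta) = lam} (Definition 2.2)."""
    u = np.sort(w)[::-1]                                   # descending order
    cssv = np.cumsum(u) - lam                              # cumulative sums minus lambda
    rho = np.nonzero(u > cssv / np.arange(1, len(w) + 1))[0][-1]
    tau = cssv[rho] / (rho + 1.0)                          # rho is 0-based here
    return np.maximum(w - tau, 0.0)
```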
Definition 2.3 (Euclidean projection P_λ). The projector onto the extended simplex is given by

(P_λ(w))_i = w_i − τ,  where  τ = (1/p) (Σ_{i=1}^p w_i − λ).
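For completeness, Definition 2.3 is a one-line shift (again a sketch, with our naming):

```python
import numpy as np

def P_hyperplane(w, lam=1.0):
    """Euclidean projection onto {beta : sum(beta) = lam} (Definition 2.3)."""
    return w - (np.sum(w) - lam) / len(w)
```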
Definition 2.4 (Restricted isometry property (RIP) (Candès et al., 2006)). A linear operator A : R^p → R^m satisfies the k-RIP with constant δ_k ∈ (0, 1) if

1 − δ_k ≤ ‖A(β)‖_2² / ‖β‖_2² ≤ 1 + δ_k,  ∀β ∈ Σ_k.   (4)

Guarantees of the gradient scheme (3): Let y = A(β⋆) + ε ∈ R^m (m ≪ p) be a generative model, where ε is an additive perturbation term and β⋆ is the k-sparse true model generating y. If the RIP assumption (4) is satisfied, then the projected gradient descent algorithm in (3) features the following invariant on the objective (Garg & Khandekar, 2009):

f(β^{i+1}) ≤ (2δ_{2k} / (1 − δ_{2k})) f(β^i) + c_1 ‖ε‖_2,   (5)

for c_1 > 0 and step size μ_i = 1/(1 + δ_{2k}). Hence, for δ_{2k} < 1/3, the iterations of the algorithm are contractive and (3) obtains a good approximation of the loss function. In addition, (Foucart, 2010) shows that we can guarantee approximation of the true model via

‖β^{i+1} − β⋆‖_2 ≤ 2δ_{3k} ‖β^i − β⋆‖_2 + c_2 ‖ε‖_2,   (6)

for c_2 > 0 and μ_i = 1. Similarly, when δ_{3k} < 1/2, the iterations of the algorithm are contractive. Different step-size strategies for μ_i result in different guarantees; cf., (Foucart, 2010; Garg & Khandekar, 2009; Kyrillidis & Cevher, 2011) for a more detailed discussion. Note that, to satisfy a given RIP constant δ, random matrices with sub-Gaussian entries require m = O(k log(p/k)/δ²). In the low-rank matrix case, similar RIP conditions for (3) can be derived with approximation guarantees; cf., (Meka et al., 2010).
3. Underlying discrete problems
Let β⋆ be a projection of w onto Σ_k ∩ Δ_λ^+ or Σ_k ∩ Δ_λ. We now make the following elementary observation:

Remark 1. The Problem 1 and 2 statements can be equivalently transformed into the following nested minimization problem:

{S⋆, β⋆} = argmin_{S ∈ Σ_k} [ argmin_{β : β_S ∈ Δ_λ^+ (or Δ_λ), β_{S^c} = 0} ( ‖(β − w)_S‖_2² + ‖w_{S^c}‖_2² ) ],

where supp(β⋆) = S⋆ and β⋆ ∈ Δ_λ^+ or Δ_λ.
Therefore, given S⋆ = supp(β⋆), we can find β⋆ by projecting w_{S⋆} onto Δ_λ^+ or Δ_λ within the k-dimensional space. Thus, the difficulty is finding S⋆. Hence, we split the problem into the task of finding the support and then finding the values on the support.
3.1. Problem 1
Given any support S, the unique corresponding estimator is β̂_S = P_λ^+(w_S). We conclude that β⋆ satisfies β⋆_{(S⋆)^c} = 0 and β⋆_{S⋆} = P_λ^+(w_{S⋆}), and

S⋆ ∈ argmin_{S ∈ Σ_k} { ‖P_λ^+(w_S) − w_S‖_2² + ‖w_{S^c}‖_2² } = argmax_{S ∈ Σ_k} F_+(S),   (7)

where F_+(S) := Σ_{i∈S} [ w_i² − ((P_λ^+(w_S))_i − w_i)² ].
This set function can be simplified to

F_+(S) = Σ_{i∈S} (w_i² − τ²),   (8)

where τ (which depends on S) is as in Lemma 1.

Lemma 1. Let β = P_λ^+(w), where β_i = [w_i − τ]_+. Then, w_i ≥ τ for all i ∈ S = supp(β). Furthermore, τ = (1/|S|)(Σ_{i∈S} w_i − λ).

Proof. Directly from the definition of τ in Definition 2.2. The intuition is quite simple: the "threshold" τ should be smaller than the smallest entry in the selected support, or we unnecessarily shrink the coefficients that are larger without introducing any new support to the solution. The same arguments apply to inflating the coefficients to meet the simplex budget.
3.2. Problem 2
Similar to above, we conclude that β⋆ satisfies β⋆_{S⋆} = P_λ(w_{S⋆}) and β⋆_{(S⋆)^c} = 0, where

S⋆ ∈ argmin_{S ∈ Σ_k} ‖z − w‖_2 = argmax_{S ∈ Σ_k} F(S),   (9)

where z ∈ R^p with z_S = P_λ(w_S) and z_{S^c} = 0, and F(S) := Σ_{i∈S} w_i² − (1/|S|)(Σ_{i∈S} w_i − λ)².
4. Sparse projections onto Δ_λ^+ and Δ_λ
Algorithm 1 below suggests an obvious greedy approach for the projection onto Σ_k ∩ Δ_λ^+. We select the set S⋆ by naively projecting w as P_{L_k}(w). Remarkably, this gives the correct support set for Problem 1, as we prove in Section 5.1. We call this algorithm the greedy selector and simplex projector (GSSP). The overall complexity of GSSP is dominated by the sort operation in p dimensions.

Algorithm 1 GSSP
1: S⋆ = supp(P_{L_k}(w))                              {Select support}
2: β_{S⋆} = P_λ^+(w_{S⋆}), β_{(S⋆)^c} = 0             {Final projection}
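Algorithm 1 is two lines of NumPy on top of the simplex projector sketched after Definition 2.2 (the helper `P_simplex` and the routine name are ours):

```python
import numpy as np

def gssp(w, k, lam=1.0):
    """Greedy selector and simplex projector (Algorithm 1)."""
    beta = np.zeros_like(w, dtype=float)
    S = np.argpartition(w, -k)[-k:]        # support of P_Lk(w): the k largest entries
    beta[S] = P_simplex(w[S], lam)         # final projection (Definition 2.2)
    return beta
```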
Unfortunately, GSSP fails for Problem 2. As a result, we propose Algorithm 2 for the Σ_k ∩ Δ_λ case, which is non-obvious. The algorithm first selects the index of the largest element that has the same sign as λ. It then grows the index set one element at a time by finding the element farthest from the current mean, as adjusted by λ. Surprisingly, the algorithm finds the correct support set, as we prove in Section 5.2. We call this algorithm the greedy selector and hyperplane projector (GSHP), whose overall complexity is similar to that of GSSP.

Algorithm 2 GSHP
1: ℓ = 1, S = {j}, j ∈ argmax_i [λ w_i]                                                    {Initialize}
2: Repeat: ℓ ← ℓ + 1, S ← S ∪ {j}, where j ∈ argmax_{i ∈ N\S} | w_i − (Σ_{j∈S} w_j − λ)/(ℓ − 1) |   {Grow}
3: Until ℓ = k, set S⋆ ← S                                                                 {Terminate}
4: β_{S⋆} = P_λ(w_{S⋆}), β_{(S⋆)^c} = 0                                                    {Final projection}
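A sketch of Algorithm 2 in the same style; the grow step follows the "farthest from the λ-adjusted running mean" rule described above, and the final step is the hyperplane projection of Definition 2.3. As with GSSP, the naming is ours.

```python
import numpy as np

def gshp(w, k, lam):
    """Greedy selector and hyperplane projector (Algorithm 2)."""
    S = [int(np.argmax(lam * w))]                     # initialize with argmax_i lambda * w_i
    while len(S) < k:                                 # grow the support one index at a time
        center = (np.sum(w[S]) - lam) / len(S)        # lambda-adjusted mean of selected entries
        scores = np.abs(w - center)
        scores[S] = -np.inf                           # exclude already-selected indices
        S.append(int(np.argmax(scores)))
    beta = np.zeros_like(w, dtype=float)
    beta[S] = w[S] - (np.sum(w[S]) - lam) / k         # P_lambda on the selected entries
    return beta
```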
5. Main results

Remark 2. When the symbol S is used as S = supp(β̄) for any β̄, then if |S| < k, we enlarge S until it has k elements by taking the first k − |S| elements that are not already in S, and setting β̄ = 0 on these elements. The lexicographic approach is used to break ties when there are multiple solutions.

5.1. Correctness of GSSP

Theorem 1. Algorithm 1 exactly solves Problem 1.

Proof. Intuitively, the k largest coordinates should be in the solution. To see this, suppose that u is the projection of w. Let w_i be one of the k most positive coordinates of w with u_i = 0, and let w_j < w_i, i ≠ j, be such that u_j > 0. We can then construct a new vector u′ with u′_j = u_i = 0 and u′_i = u_j. Then u′ satisfies the constraints and is closer to w, i.e., ‖w − u‖_2² − ‖w − u′‖_2² = 2u_j(w_i − w_j) > 0. Hence, u cannot be the projection.

To complete the proof, we also need to show that the cardinality-k solutions are as good as any other solution with cardinality less than k. Suppose there exists a solution u with support |S| < k. Now add any elements to S to form Ŝ of size k. Then consider w restricted to Ŝ, and let û be its projection onto the simplex. Because this is a projection, ‖û_Ŝ − w_Ŝ‖ ≤ ‖u_Ŝ − w_Ŝ‖, hence ‖û − w‖ ≤ ‖u − w‖.
5.2. Correctness of GSHP
Theorem 2. Algorithm 2 exactly solves Problem 2.
Proof. To motivate the support selection of GSHP, we now identify a key relation that holds for any b ∈ R^k:

Σ_{i=1}^k b_i² − (Σ_{i=1}^k b_i − λ)²/k = λ(2b_1 − λ) + Σ_{j=2}^k [(j−1)/j] · ( b_j − (Σ_{i=1}^{j−1} b_i − λ)/(j−1) )².   (10)

By its left-hand side, this relation is invariant under permutation of b. Moreover, the summands in the sum
over k are certainly non-negative for k ≥ 2, so without
loss of generality the solution sparsity of the original
problem is ||β ⋆ ||0 = k. For k = 1, F is maximized by
picking an index i that maximizes λwi , which is what
the algorithm does.
For the sake of clarity (and space), we first describe the proof of the case k ≥ 2 for λ = 0 and then explain how it generalizes for λ ≠ 0. In the sequel, let us use the shortcut avg(S) = (1/|S|) Σ_{j∈S} w_j.
Let S be an optimal solution index set and let I be the result computed by the algorithm. For a proof (of the case k ≥ 2, λ = 0) by contradiction, assume that I and S differ. Let e be the first element of I\S in the order of insertion into I by the algorithm. Let e′ be the element of S\I_0 that lies closest to e. Without loss of generality, we may assume that w_e ≠ w_{e′}, otherwise we could have chosen S\{e′} ∪ {e} rather than S as solution in the first place. Let I_0 ⊆ I ∩ S be the indices added to I by the algorithm before e. Assume that I_0 is nonempty. We will later see how to ensure this.

Let a := avg(I_0) and a′ := avg(S\{e′}). There are three ways in which w_e, w_{e′} and a′ can be ordered relative to each other:
1. e′ lies between e and a′, thus |w_{e′} − a′| < |w_e − a′| since w_e ≠ w_{e′}.

2. a′ lies between e and e′. But then, since there are no elements of S between e and e′, S\I_0 moves the average a′ beyond a, away from e towards e′, so |w_{e′} − a′| < |w_{e′} − a| and |w_e − a| < |w_e − a′|. But we know that |w_{e′} − a| < |w_e − a|, since e = argmax_{i ∉ I_0} |w_i − a| by the choice of the greedy algorithm and w_e ≠ w_{e′}. Thus |w_{e′} − a′| < |w_e − a′|.

3. |w_e − a′| < |w_{e′} − a′|, i.e., e lies between a′ and e′. But this case is impossible: compared to a, a′ averages over additional values that are closer to a than e, and e′ is one of them. So a′ must be on the same side as e′ relative to e, not the opposite side.
So |w_{e′} − a′| < |w_e − a′| is assured in all cases. Note in particular that if |S| ≥ 1 and |w_i − avg(S)| θ |w_j − avg(S)|, then

F(S ∪ {i}) = F(S) + ((k−1)/k) (w_i − avg(S))²  θ  F(S) + ((k−1)/k) (w_j − avg(S))² = F(S ∪ {j}),   (11)

where θ is either '=' or '<' and k = |S| + 1. By inequality (11), F(S) < F((S\{e′}) ∪ {e}). But this means that S is not a solution: contradiction.
We have assumed that I_0 is nonempty; this is ensured because any solution S must contain at least an index i ∈ argmax_j w_j. Otherwise, we could replace a maximal index w.r.t. w in S by this i and get, by (11), a larger F value. This would be a contradiction with our assumption that S is a solution. Note that this maximal index is also picked (first) by the algorithm.
This completes the proof for the case λ = 0. Let us
now consider the general case where λ is unrestricted.
We reduce the general problem to the case that λ = 0.
Let us write F_{w,λ} to make the parameters w and λ explicit when talking of F. Let w′_{i*} := w_{i*} − λ for one i* for which λw_{i*} is maximal, and let w′_i := w_i for all other i. We use the fact that, by the definition of F,

F_{w,λ}(S) = 2λ w′_{i*} + λ² + F_{w′,0}(S)
when S contains such an element i* ∈ argmax_j (λw_j). Clearly, i* is an extremal element w.r.t. w and w_{i*} has maximum distance from −λ, so

i* ∈ argmax_j | w_j − (Σ_{i≠j} w_i − λ)/(j − 1) |.
By (10), i* must be in the optimal solution for F_{w,λ}. Also, F_{w′,0}(S) and 2λw′_{i*} + λ² + F_{w′,0}(S) are maximized by the same index sets S when i* ∈ S is required. Thus,

argmax_S F_{w,λ}(S) = argmax_{S : i* ∈ S} F_{w′,0}(S).

Now observe that our previous proof for the case λ = 0 also works if one adds a constraint that one or more indices be part of the solution: if the algorithm computes these elements as part of its result I, they are in I_0 = I ∩ S. But this is what the algorithm does on input (w, λ); it chooses i* in its first step and then proceeds as if maximizing F_{w′,0}. Thus we have established the algorithm's correctness.
6. Application: Quantum tomography
Problem: In quantum tomography (QT), we aim to learn a density matrix X⋆ ∈ C^{d×d}, which is Hermitian (i.e., (X⋆)^H = X⋆), positive semi-definite (i.e., X⋆ ⪰ 0), and has rank(X⋆) = r and tr(X⋆) = 1. The QT measurements are y = A(X⋆) + η, where (A(X⋆))_i = tr(E_i X⋆) and η_i is zero-mean Gaussian noise. The operators E_i are tensor products of the 2×2 Pauli matrices (Liu, 2011).
Recently, (Liu, 2011) showed that almost all such tensor constructions of O(rd log^6 d) Pauli measurements satisfy the so-called rank-r restricted isometry property (RIP) for all X ∈ {X ∈ C^{d×d} : X ⪰ 0, rank(X) ≤ r, ‖X‖_* ≤ √r ‖X‖_F}:

(1 − δ_r) ‖X‖_F² ≤ ‖A(X)‖_F² ≤ (1 + δ_r) ‖X‖_F²,   (12)

where ‖·‖_* is the nuclear norm (i.e., the sum of singular values), which reduces to tr(X) since X ⪰ 0. This key observation enables us to leverage the recent theoretical and algorithmic advances in low-rank matrix recovery from a few affine measurements.
The standard matrix-completion based approach to recover X⋆ from y is the following convex relaxation:

minimize_{X ⪰ 0}  ‖A(X) − y‖_F² + λ‖X‖_*.   (13)

This convex approach is both tractable and amenable to theoretical analysis (Gross et al., 2010; Liu, 2011). As a result, we can provably reduce the number of samples m from O(d²) to Õ(rd) (Liu, 2011). Unfortunately, this convex approach fails to account for the physical constraint ‖X‖_* = 1. To overcome this difficulty, the relaxation parameter λ is tuned to obtain solutions with the desired rank, followed by normalization to heuristically meet the trace constraint.

In this section, we demonstrate that one can do significantly better via the non-convex algorithm based on (3). A key ingredient then is the following projection:

B̂ ∈ argmin_{B ⪰ 0} ‖B − W‖_F²   s.t.  rank(B) = r, tr(B) = 1,   (14)

for a given Hermitian matrix W ∈ R^{n×n}. Since the RIP assumption holds here, we can obtain rigorous guarantees based on a similar analysis to (Foucart, 2010; Garg & Khandekar, 2009; Meka et al., 2010).

To obtain the solution, we compute the eigenvalue decomposition W = U Λ_W U^H and then use the unitary invariance of the problem to solve D⋆ ∈ argmin_D ‖D − Λ_W‖_F subject to ‖D‖_* ≤ 1 and rank(D) ≤ r, and from D⋆ form U D⋆ U^H to obtain a solution. In fact, we can constrain D to be diagonal, and thus reduce the matrix problem to the vector version for D = diag(d), where the projector in Problem 1 applies. This reduction follows from the well-known result:

Proposition 6.1 ((Mirsky, 1960)). Let A, B ∈ R^{m×n} and q = min{m, n}. Let σ_i(A) be the singular values of A in descending order (similarly for B). Then,

Σ_{i=1}^q ( σ_i(A) − σ_i(B) )² ≤ ‖A − B‖_F².

Equation (14) has a solution if r ≥ 1 since the constraint set is non-empty and compact (Weierstrass's theorem). As the vector reduction achieves the lower bound, it is an optimal projection.
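Under the reduction above, the projection (14) becomes a thin wrapper around the Problem 1 projector applied to the eigenvalues (with k = r and λ = 1); a sketch, reusing the `gssp` routine from Section 4 and assuming W is Hermitian:

```python
import numpy as np

def project_rank_trace(W, r):
    """Sketch of projection (14): eigendecompose W and project its spectrum
    with the Problem 1 projector (k = r, lambda = 1)."""
    evals, U = np.linalg.eigh(W)              # real eigenvalues of a Hermitian W
    d = gssp(evals, k=r, lam=1.0)             # r-sparse, non-negative, sums to one
    return (U * d) @ U.conj().T               # U diag(d) U^H
```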
Figure 1. Quantum tomography with 8 qubits and 30 dB
SNR: Each point is the median over 10 random realizations.
Convex approach 1 refers to (13) and approach 2 is (15).
Figure 2. Same as Figure 1 but with 7 qubits, no noise.
Numerical experiments: We numerically demonstrate that the ability to project onto trace and rank
constraints jointly can radically improve recovery even
with the simple gradient descent algorithm as in (3).
We follow the same approach in (Gross et al., 2010):
we generate random Pauli measurements and add
white noise. The experiments that follow use an 8-qubit system (d = 2^8) with noise SNR at 30 dB (so the absolute noise level changes depending on the number of measurements), and a 7-qubit noiseless system.
The measurements are generated using a random real-valued matrix X⋆ with rank 2, although the algorithms also work with complex-valued matrices. A d × d rank-r real-valued matrix has dr − r(r − 1)/2 ≈ dr degrees of freedom, hence we need at least 2dr measurements to recover X⋆ from noiseless measurements
(due to the null-space of the linear map). To test the
various approaches, we vary the number of measurements between 2dr and 5dr. We assume r is known,
though other computational experience suggests that
estimates of r return good answers as well.
The convex problem (13) depends on a parameter λ.
We solve the problem for different λ in a bracketing
search until we find the first λ that provides a solution
with numerical rank r. Like (Flammia et al., 2012), we
normalize the final estimate to ensure the trace is 1.
Additionally, we test the following convex approach:

minimize_{X ⪰ 0, ‖X‖_* ≤ 1}  ‖A(X) − y‖_F².   (15)
Compared to (13), no parameters are needed since we
exploit prior knowledge of the trace, but there is no
guarantee on the rank. Both convex approaches can
be solved with proximal gradient descent; we use the
TFOCS package (Becker et al., 2011) since it uses a
sophisticated line search and Nesterov acceleration.
To illustrate the power of the combinatorial projections, we solve the following non-convex formulation:

minimize_{X ⪰ 0, ‖X‖_* ≤ 1, rank(X) = r}  ‖A(X) − y‖_F².   (16)
Within the projected gradient algorithm (3), we use the GSSP algorithm as described above. The step size is μ_i = 3/‖A‖², where ‖·‖ is the operator norm; we can also apply Nesterov acceleration to speed convergence, but we use (3) for simplicity. Due to the non-convexity, the algorithm depends on the starting value X_0. We try two strategies: (i) random initialization, and (ii) initializing with the solution from (15). Both initializations often lead to the same stationary point.
Figure 1 shows the relative error ‖X − X⋆‖_F/‖X⋆‖_F
of the different approaches. All approaches do poorly
when there are only 2dr measurements since this is
near the noiseless information-theoretic limit. For
higher numbers of measurements, the non-convex approach substantially outperforms both convex approaches. For 2.4dr measurements, it helps to start
X0 with the convex solution, but otherwise the two
non-convex approaches are nearly identical.
Between the two convex solutions, (13) outperforms
(15) since it tunes λ to achieve a rank r solution.
Neither convex approach is competitive with the nonconvex approaches since they do not take advantage of
the prior knowledge on trace and rank.
Figure 2 shows more results on a 7-qubit problem without noise. Again, the non-convex approach gives better results, particularly when there are fewer measurements. As expected, both the convex and the non-convex approaches tend toward perfect recovery as the number of measurements increases.
Approach       mean time    standard deviation
convex         0.294 s.     0.030 s.
non-convex     0.192 s.     0.019 s.

Table 1. Time per iteration of convex and non-convex approaches for quantum state tomography with 8 qubits.
Here we highlight another key benefit of the non-convex approach: since the number of eigenvectors needed in the partial eigenvalue decomposition is at most r, it is quite scalable. In general, the convex approach has intermediate iterates that require eigenvalue decompositions close to the full dimension, especially during the first few iterations. Table 1 shows the average time per iteration for the convex and non-convex approaches (overall time is more complicated, since the number of iterations depends on the line search and other factors). Even using Matlab's dense eigenvalue solver eig, the iterations of the non-convex approach are faster; problems that used an iterative Krylov subspace solver would show an even larger discrepancy.²

²Quantum state tomography does not easily benefit from iterative eigenvalue solvers, since the range of A∗ is not sparse.
7. Application: Measure learning
Problem: We study the kernel density learning setting: let x^(1), x^(2), ..., x^(n) ∈ R^p be an n-size corpus of p-dimensional samples, drawn from an unknown probability density function (pdf) μ(x). Here, we will form an estimator μ̂(x) := Σ_{i=1}^n β_i κ_σ(x, x^(i)), where κ_σ(x, y) is a Gaussian kernel with parameter σ. Let us choose μ̂(x) to minimize the integrated squared error criterion: ISE = E‖μ̂(x) − μ(x)‖_2². As a result, we can introduce a density learning problem as estimating a weight vector β⋆ ∈ Δ_1^+. The objective can then be written as follows (Bunea et al., 2010; Kim, 1995):

β⋆ ∈ argmin_{β ∈ Δ_1^+} { βᵀ Σ β − cᵀ β },   (17)

where Σ ∈ R^{n×n} with Σ_ij = κ_{√2σ}(x^(i), x^(j)), and

c_i = (1/(n−1)) Σ_{j≠i} κ_σ(x^(i), x^(j)), ∀i.   (18)
While the combination of the −cᵀβ term and the non-negativity constraint induces some sparsity, it may not be enough. To avoid overfitting or to obtain interpretable results, one might control the level of solution sparsity (Bunea et al., 2010). In this context, we extend (17) to include cardinality constraints, i.e., β⋆ ∈ Δ_1^+ ∩ Σ_k.
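A sketch of the resulting estimator: projected gradient (3) on the quadratic objective (17) with GSSP (k-sparse, Δ_1^+) as the projector. The kernel normalization, step size and iteration count below are our choices for illustration, not the paper's; the routine reuses `gssp` from Section 4.

```python
import numpy as np

def sparse_density_weights(X, k, sigma=1.0, n_iter=200):
    """Cardinality-constrained version of (17) via projected gradient with GSSP.
    X is an (n, p) array of samples; returns the weight vector beta."""
    n, p = X.shape
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)      # pairwise squared distances
    kernel = lambda s: np.exp(-sq / (2 * s**2)) / (np.sqrt(2 * np.pi) * s) ** p
    Sigma = kernel(np.sqrt(2) * sigma)                              # Sigma_ij = kappa_{sqrt(2) sigma}
    K = kernel(sigma)
    c = (K.sum(axis=1) - K.diagonal()) / (n - 1)                    # c_i as in (18)
    beta = gssp(np.full(n, 1.0 / n), k, lam=1.0)                    # feasible k-sparse start
    step = 1.0 / (2 * np.linalg.norm(Sigma, 2))
    for _ in range(n_iter):
        beta = gssp(beta - step * (2 * Sigma @ beta - c), k, lam=1.0)
    return beta
```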
Figure 3. Density estimation results using the Parzen method (left), the quadratic program (17) (left and middle-top), and our approach (middle-bottom and right).
Figure 4. Estimation results for different k (panels: (a) k = 3, (b) k = 8, (c) k = 10, (d) k = 15): red spikes depict the estimated kernel means as well as their relative contribution to the Gaussian mixture. As k increases, the additional nonzero coefficients in β⋆ tend to have small weights.
Numerical experiments: We consider the following Gaussian mixture: μ(x) = (1/5) Σ_{i=1}^5 κ_{σ_i}(β_i, x), where σ_i = (7/9)^i and β_i = 14(σ_i − 1). A sample of 1000 points is drawn from μ(x). We compare the density estimation performance of: (i) the Parzen method (Parzen, 1962), (ii) the quadratic programming formulation in (17), and (iii) our cardinality-constrained version of (17) using GSSP. While μ(x) is constructed by kernels with various widths, we assume a constant width during the kernel estimation. In practice, the width is not known a priori but can be found using cross-validation techniques; for simplicity, we assume kernels with width σ = 1.
Figure 3(left) depicts the true pdf and the estimated
densities using the Parzen method and the quadratic
programming approach. Moreover, the figure also includes a scaled plot of 1/σi , indicating the height of
the individual Gaussian mixtures. By default, the
Parzen window method estimation interpolates 1000
Gaussian kernels with centers around the sampled
points to compute the estimate µ̂(x); unfortunately,
neither the quadratic programming approach (as Figure 3 (middle-top) illustrates) nor the Parzen window
estimator results are easily interpretable even though
both approaches provide a good approximation of the
true pdf.
Using our cardinality-constrained approach, we can
significantly enhance the interpretability. This is because in the sparsity-constrained approach, we can
control the number of estimated Gaussian components. Hence, if the model order is known a priori,
the non-convex approach can be extremely useful.
To see this, we first show the coefficient profile of the
sparsity based approach for k = 5 in Figure 3 (middlebottom). Figure 3 (right) shows the estimated pdf
for k = 5 along with the positions of weight coefficients obtained by our sparsity enforcing approach.
Note that most of the weights obtained concentrate
around the true means, fully exploiting our prior information about the ingredients of µ(x)—this happens
with rather high frequency in the experiments we conducted. Figure 4 illustrates further estimated pdf’s
using our approach for various k. Surprisingly, the resulting solutions are still approximately 5-sparse even
if k > 5, as the over-estimated coefficients are extremely small, and hence the sparse estimator is reasonably robust to inaccurate estimates of k.
8. Application: Portfolio optimization
Problem: Given a sample covariance matrix Σ and expected mean μ, the return-adjusted Markowitz mean-variance (MV) framework selects a portfolio β⋆ such that

β⋆ ∈ argmin_{β ∈ Δ_1^+} { βᵀ Σ β − τ μᵀ β },

where Δ_1^+ encodes the normalized capital constraint, and τ trades off risk and return (Brodie et al., 2009; DeMiguel et al., 2009). The solution β⋆ ∈ Δ_1^+ is the distribution of investments over the p available assets.
While such solutions construct portfolios from scratch, a more realistic scenario is to incrementally adjust an existing portfolio as the market changes. Due to costs per transaction, we can naturally introduce cardinality constraints. In mathematical terms, let β̄ ∈ R^p be the current portfolio selection. Given β̄, we seek to adjust the current selection to β = β̄ + δ_β such that ‖δ_β‖_0 ≤ k. This leads to the following optimization problem:

δ⋆_β ∈ argmin_{δ_β ∈ Σ_k ∩ Δ_λ} (β̄ + δ_β)ᵀ Σ (β̄ + δ_β) − τ μᵀ (β̄ + δ_β),

where λ is the level of the update, and k controls the transaction costs. During an update, λ = 0 would keep the portfolio value constant, while λ > 0 would increase it.
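The update problem above is again a projected gradient recursion (3), now with GSHP as the projector; a sketch under our naming, with Σ, μ and β̄ given, and τ, the step size and iteration count chosen only for illustration (reusing `gshp` from Section 4):

```python
import numpy as np

def portfolio_update(Sigma, mu, beta_bar, k, lam, tau=1.0, n_iter=200):
    """Cardinality-constrained portfolio update via projected gradient with GSHP."""
    delta = gshp(np.zeros_like(beta_bar, dtype=float), k, lam)    # feasible k-sparse start
    step = 1.0 / (2 * np.linalg.norm(Sigma, 2))
    for _ in range(n_iter):
        grad = 2 * Sigma @ (beta_bar + delta) - tau * mu          # gradient w.r.t. delta
        delta = gshp(delta - step * grad, k, lam)
    return beta_bar + delta
```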
Numerical experiments: To clearly highlight the
impact of the non-convex projector, we create a synthetic portfolio update problem, where we know the
solution. As in (Brodie et al., 2009), we cast this problem as a regression problem and synthetically generate
y = Xβ⋆ where p = 1000 such that β⋆ ∈ Δ_λ (λ is chosen randomly), and ‖β⋆‖_0 = k for k = 100.
Since in general we do not expect RIP assumptions to
hold in portfolio optimization, our goal here is to refine
the sparse solution of a state-of-the-art convex solver
via (3) in order to accommodate the strict sparsity and
budget constraints. Hence, we first consider the basis
pursuit criterion and solve it using SPGL1 (van den
Berg & Friedlander, 2008):
minimize_β ‖β‖_1   s.t.   [ X ; 1ᵀ/√p ] β = [ y ; λ/√p ].   (19)
The normalization by 1/√p in the last equality gives the constraint matrix a better condition number, since otherwise it is too ill-conditioned for a first-order solver.
Almost none of the solutions to (19) return a k-sparse
solution. Hence, we initialize (3) with the SPGL1 solution to meet the constraints. The update step in (3)
uses the GSHP algorithm.
Figure 5 shows the resulting relative errors ‖β̂ − β⋆‖_2/‖β⋆‖_2. We see that not only does (3) return a
k-sparse solution, but that this solution is also closer
to β ⋆ , particularly when the sample size is small. As
the sample size increases, the knowledge that β ⋆ is
k-sparse makes up a smaller percentage of what we
know about the signal, so the gap between (19) and
(3) diminishes.
Figure 5. Relative error ‖β̂ − β⋆‖_2/‖β⋆‖_2 comparison as a function of m: Approach 1 is the non-convex approach (3), and Approach 2 is (19). Each point corresponds to the median value of 30 Monte Carlo realizations.
9. Conclusions
While non-convexity in learning algorithms is undesirable according to conventional wisdom, avoiding it
might be difficult in many problems. In this setting, we
show how to efficiently obtain exact sparse projections
onto the positive simplex and its hyperplane extension. We empirically demonstrate that our projectors
provide substantial accuracy benefits in quantum tomography from fewer measurements and enable us to
exploit prior non-convex knowledge in density learning. Moreover, we also illustrate that we can refine
the solution of well-established state-of-the-art convex
sparse recovery algorithms to enforce non-convex constraints in sparse portfolio updates. The quantum tomography example in particular illustrates that the
non-convex solutions can be extremely useful; here, the
non-convexity appears milder, since a fixed-rank matrix still has extra degrees of freedom from the choice
of its eigenvectors.
10. Acknowledgements
VC and AK’s work was supported in part by the European Commission under Grant MIRG-268398, ERC
Future Proof and SNF 200021-132548. SRB is supported by the Fondation Sciences Mathématiques de
Paris. CK’s research is funded by ERC ALGILE.
References
Bahmani, S., Boufounos, P., and Raj, B. Greedy sparsity-constrained optimization. In Signals, Systems and Computers (ASILOMAR), 2011 Conference Record of the Forty Fifth Asilomar Conference on, pp. 1148–1152. IEEE, 2011.
Becker, Stephen, Candès, Emmanuel, and Grant, Michael. Templates for convex cone problems with applications to sparse signal recovery. Mathematical Programming Computation, pp. 1–54, 2011. ISSN 1867-2949. URL http://dx.doi.org/10.1007/s12532-011-0029-5.
Brodie, J., Daubechies, I., De Mol, C., Giannone, D.,
and Loris, I. Sparse and stable Markowitz portfolios.
Proceedings of the National Academy of Sciences,
106(30):12267–12272, 2009.
Bunea, F., Tsybakov, A.B., Wegkamp, M.H., and
Barbu, A. SPADES and mixture models. The Annals of Statistics, 38(4):2525–2558, 2010.
Candès, E., Romberg, J., and Tao, T. Robust uncertainty principles: Exact signal reconstruction from
highly incomplete frequency information. IEEE
Trans. on Information Theory, 52(2):489 – 509,
February 2006.
DeMiguel, V., Garlappi, L., Nogales, F.J., and Uppal,
R. A generalized approach to portfolio optimization: Improving performance by constraining portfolio norms. Management Science, 55(5):798–812,
2009.
Flammia, S.T., Gross, D., Liu, Y.K., and Eisert,
J. Quantum tomography via compressed sensing:
error bounds, sample complexity and efficient estimators. New Journal of Physics, 14(9):095022,
2012. URL http://stacks.iop.org/1367-2630/
14/i=9/a=095022.
Foucart, S. Sparse recovery algorithms: sufficient conditions in terms of restricted isometry constants.
Proceedings of the 13th International Conference on
Approximation Theory, 2010.
Garg, R. and Khandekar, R. Gradient descent with
sparsification: an iterative algorithm for sparse recovery with restricted isometry property. In ICML.
ACM, 2009.
Gross, D., Liu, Y.-K., Flammia, S. T., Becker, S.,
and Eisert, J. Quantum state tomography via compressed sensing. Phys. Rev. Lett., 105(15):150401,
Oct 2010. doi: 10.1103/PhysRevLett.105.150401.
Kim, D. Least squares mixture decomposition estimation. PhD thesis, 1995.
Kyrillidis, A. and Cevher, V. Recipes on hard thresholding methods. Dec. 2011.
Kyrillidis, A. and Cevher, V. Combinatorial selection
and least absolute shrinkage via the Clash algorithm. In IEEE International Symposium on Information Theory, July 2012.
Liu, Y.K. Universal low-rank matrix recovery from
Pauli measurements. In NIPS, pp. 1638–1646, 2011.
Meka, Raghu, Jain, Prateek, and Dhillon, Inderjit S.
Guaranteed rank minimization via singular value
projection. In NIPS Workshop on Discrete Optimization in Machine Learning, 2010.
Mirsky, L. Symmetric gauge functions and unitarily
invariant norms. The quarterly journal of mathematics, 11(1):50–59, 1960.
Nemhauser, G.L. and Wolsey, L.A. Integer and combinatorial optimization, volume 18. Wiley New York,
1988.
Parzen, E. On estimation of a probability density function and mode. The annals of mathematical statistics, 33(3):1065–1076, 1962.
Pilanci, M., El Ghaoui, L., and Chandrasekaran, V.
Recovery of sparse probability measures via convex
programming. In Advances in Neural Information
Processing Systems 25, pp. 2429–2437, 2012.
van den Berg, E. and Friedlander, M. P. Probing the
Pareto frontier for basis pursuit solutions. SIAM J.
Sci. Comput., 31(2):890–912, 2008.