
Sparse projections onto the simplex

2012

Abstract: The past decade has seen the rise of $\ell_1$-relaxation methods that promote sparsity for better interpretability and generalization of learning results. However, several important learning applications, such as Markowitz portfolio selection and sparse mixture density estimation, feature simplex constraints, which preclude the application of the standard $\ell_1$-penalty. In this setting, we show how to efficiently compute sparse projections onto the positive and the general simplex.

Anastasios Kyrillidis, Stephen Becker, Volkan Cevher, Christoph Koch
arXiv:1206.1529v5 [cs.LG] 10 Apr 2013

Proceedings of the 30th International Conference on Machine Learning, Atlanta, Georgia, USA, 2013. JMLR: W&CP volume 28. Copyright 2013 by the author(s).

Abstract

Most learning methods with rank or sparsity constraints use convex relaxations, which lead to optimization with the nuclear norm or the $\ell_1$-norm. However, several important learning applications cannot benefit from this approach as they feature these convex norms as constraints in addition to the non-convex rank and sparsity constraints. In this setting, we derive efficient sparse projections onto the simplex and its extension, and illustrate how to use them to solve high-dimensional learning problems in quantum tomography, sparse density estimation and portfolio selection with non-convex constraints.

1. Introduction

We study the following sparse Euclidean projections:

Problem 1. (Simplex) Given $w \in \mathbb{R}^p$, find a Euclidean projection of $w$ onto the intersection of the set of $k$-sparse vectors $\Sigma_k = \{\beta \in \mathbb{R}^p : |\{i : \beta_i \neq 0\}| \leq k\}$ and the simplex $\Delta_\lambda^+ = \{\beta \in \mathbb{R}^p : \beta_i \geq 0, \sum_i \beta_i = \lambda\}$:
$$\mathcal{P}(w) \in \operatorname*{argmin}_{\beta \in \Sigma_k \cap \Delta_\lambda^+} \|\beta - w\|_2. \quad (1)$$

Problem 2. (Hyperplane) Replace $\Delta_\lambda^+$ in (1) with the hyperplane constraint $\Delta_\lambda = \{\beta \in \mathbb{R}^p : \sum_i \beta_i = \lambda\}$.

We prove that it is possible to compute such projections in quasilinear time via simple greedy algorithms. Our motivation with these projectors is to address important learning applications where the standard sparsity/low-rank heuristics based on the $\ell_1$/nuclear norm either appear as constraints or conflict with the problem constraints. For concreteness, we highlight quantum tomography, density learning, and Markowitz portfolio design problems as running examples. We then illustrate provable non-convex solutions to minimize quadratic loss functions
$$f(\beta) := \|y - A(\beta)\|_2^2 \quad (2)$$
subject to the constraints in Problems 1 and 2 with our projectors. In (2), we assume that $y \in \mathbb{R}^m$ is given and the (known) operator $A : \mathbb{R}^p \to \mathbb{R}^m$ is linear. For simplicity of analysis, our minimization approach is based on the projected gradient descent algorithm
$$\beta^{i+1} = \mathcal{P}(\beta^i - \mu_i \nabla f(\beta^i)), \quad (3)$$
where $\beta^i$ is the $i$-th iterate, $\nabla f(\cdot)$ is the gradient of the loss function, $\mu_i$ is a step size, and $\mathcal{P}(\cdot)$ is based on Problem 1 or 2. When the linear map $A$ in (2) provides a bi-Lipschitz embedding for the constraint sets, we can derive rigorous approximation guarantees for the algorithm (3); cf. (Garg & Khandekar, 2009).[1]

To the best of our knowledge, explicitly sparse Euclidean projections onto the simplex and hyperplane constraints have not been considered before. The closest work to ours is (Kyrillidis & Cevher, 2012), where the authors propose an alternating projection approach for regression when the true vector is already sparse and lies within a convex norm-ball constraint. In contrast, we consider the problem of projecting an arbitrary given vector onto convex and sparse constraints jointly. At the time of this submission, we became aware of (Pilanci et al., 2012), which considers cardinality-regularized loss function minimization subject to simplex constraints. Their convexified approach relies on solving a lower bound to the objective function and has $O(p^4)$ complexity, which is not scalable. We also note that regularizing with cardinality constraints is generally easier: e.g., our projectors become simpler.
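For concreteness, the iteration (3) admits a very compact implementation. The following NumPy sketch is an editorial illustration, not the authors' reference code: it assumes the quadratic loss (2) with an explicit matrix $A$, a fixed (conservative) step size based on the spectral norm of $A$ rather than the RIP-dependent choices discussed later, and a pluggable projector such as GSSP or GSHP from Section 4.

```python
import numpy as np

def projected_gradient(y, A, project, k, lam, steps=500, mu=None):
    """Minimal sketch of iteration (3) for the quadratic loss in (2).

    `project(v, k, lam)` is any projector onto the constraint set of Problem 1
    or 2 (e.g., GSSP or GSHP from Section 4); `mu` is a fixed step size.
    """
    p = A.shape[1]
    if mu is None:
        # Conservative fixed step from the spectral norm of A (an illustrative
        # choice, not the RIP-based step sizes analyzed in Section 2).
        mu = 1.0 / (2.0 * np.linalg.norm(A, 2) ** 2)
    beta = np.zeros(p)
    for _ in range(steps):
        grad = 2.0 * A.T @ (A @ beta - y)          # gradient of ||y - A(beta)||_2^2
        beta = project(beta - mu * grad, k, lam)   # sparse projection step
    return beta
```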
[1] Surprisingly, a recent analysis of this algorithm under similar assumptions indicates that rigorous guarantees can be obtained for the minimization of general loss functions other than the quadratic (Bahmani et al., 2011).

Notation: Plain and boldface lowercase letters represent scalars and vectors, respectively. The $i$-th entry of a vector $w$ is $w_i$, and $[w_i]_+ = \max(w_i, 0)$, while $\beta^i$ is the model estimate at the $i$-th iteration of an algorithm. Given a set $S \subseteq N = \{1, \dots, p\}$, the complement $S^c$ is defined with respect to $N$, and the cardinality is $|S|$. The support set of $w$ is $\mathrm{supp}(w) = \{i : w_i \neq 0\}$. Given a vector $w \in \mathbb{R}^p$, $w_S$ is the projection (in $\mathbb{R}^p$) of $w$ onto $S$, i.e., $(w_S)_{S^c} = 0$, whereas $\boldsymbol{w}_S \in \mathbb{R}^{|S|}$ is $w$ restricted to the entries in $S$. The all-ones column vector is $1$, with dimensions apparent from the context. We define $\Sigma_k$ as the set of all $k$-sparse subsets of $N$, and we sometimes write $\beta \in \Sigma_k$ to mean $\mathrm{supp}(\beta) \in \Sigma_k$. The trace of a matrix $X$ is written $\mathrm{tr}(X)$.

2. Preliminaries

Basic definitions: Without loss of generality, assume $w$ is sorted in descending order, so $w_1$ is the largest element. We denote by $P_\lambda^+$ the (convex) Euclidean projector onto the standard simplex $\Delta_\lambda^+$, and by $P_\lambda$ its extension to $\Delta_\lambda$. The (non-convex) Euclidean projector onto the set $\Sigma_k$ is $P_{\Sigma_k}$, which retains the $k$ largest-in-magnitude elements. In contrast to $P_\lambda$, the projection $P_{\Sigma_k}$ need not be unique.

Definition 2.1 (Operator $P_{L_k}$). We define $P_{L_k}(w)$ as the operator that keeps the $k$ largest entries of $w$ (not in magnitude) and sets the rest to zero. This operation can be computed in $O(p \min(k, \log p))$ time.

Definition 2.2 (Euclidean projection $P_\lambda^+$). The projector onto the simplex is given by
$$(P_\lambda^+(w))_i = [w_i - \tau]_+, \quad \text{where } \tau := \frac{1}{\rho}\Big(\sum_{i=1}^{\rho} w_i - \lambda\Big) \ \text{ and } \ \rho := \max\Big\{j : w_j > \frac{1}{j}\Big(\sum_{i=1}^{j} w_i - \lambda\Big)\Big\}.$$

Definition 2.3 (Euclidean projection $P_\lambda$). The projector onto the extended simplex is given by
$$(P_\lambda(w))_i = w_i - \tau, \quad \text{where } \tau = \frac{1}{p}\Big(\sum_{i=1}^{p} w_i - \lambda\Big).$$

Definition 2.4 (Restricted isometry property (RIP) (Candès et al., 2006)). A linear operator $A : \mathbb{R}^p \to \mathbb{R}^m$ satisfies the $k$-RIP with constant $\delta_k \in (0,1)$ if
$$1 - \delta_k \leq \|A(\beta)\|_2^2 / \|\beta\|_2^2 \leq 1 + \delta_k, \quad \forall \beta \in \Sigma_k. \quad (4)$$

Guarantees of the gradient scheme (3): Let $y = A(\beta^\star) + \varepsilon \in \mathbb{R}^m$ ($m \ll p$) be a generative model, where $\varepsilon$ is an additive perturbation term and $\beta^\star$ is the $k$-sparse true model generating $y$. If the RIP assumption (4) is satisfied, then the projected gradient descent algorithm (3) features the following invariant on the objective (Garg & Khandekar, 2009):
$$f(\beta^{i+1}) \leq \frac{2\delta_{2k}}{1 - \delta_{2k}} f(\beta^i) + c_1 \|\varepsilon\|_2^2, \quad (5)$$
for $c_1 > 0$ and step size $\mu_i = \frac{1}{1 + \delta_{2k}}$. Hence, for $\delta_{2k} < 1/3$, the iterations of the algorithm are contractive and (3) obtains a good approximation of the loss function. In addition, (Foucart, 2010) shows that we can guarantee approximation of the true model via
$$\|\beta^{i+1} - \beta^\star\|_2 \leq 2\delta_{3k} \|\beta^i - \beta^\star\|_2 + c_2 \|\varepsilon\|_2, \quad (6)$$
for $c_2 > 0$ and $\mu_i = 1$. Similarly, when $\delta_{3k} < 1/2$, the iterations of the algorithm are contractive. Different step-size strategies $\mu_i$ result in different guarantees; cf. (Foucart, 2010; Garg & Khandekar, 2009; Kyrillidis & Cevher, 2011) for a more detailed discussion. Note that to satisfy a given RIP constant $\delta$, random matrices with sub-Gaussian entries require $m = O\big(k \log(p/k)/\delta^2\big)$. In the low-rank matrix case, similar RIP conditions for (3) can be derived with approximation guarantees; cf. (Meka et al., 2010).
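For reference, Definitions 2.2 and 2.3 translate directly into code. The following NumPy sketch is an editorial illustration of the two (convex) projectors, using a standard sort-based routine and assuming $\lambda > 0$ in the simplex case; it is not the authors' implementation.

```python
import numpy as np

def project_simplex(w, lam=1.0):
    """Euclidean projection onto {beta : beta_i >= 0, sum_i beta_i = lam},
    i.e., the operator P_lambda^+ of Definition 2.2 (assumes lam > 0)."""
    w = np.asarray(w, dtype=float)
    u = np.sort(w)[::-1]                      # sort in descending order
    css = np.cumsum(u) - lam                  # partial sums minus the budget
    j = np.arange(1, w.size + 1)
    rho = np.max(j[u > css / j])              # rho as in Definition 2.2
    tau = css[rho - 1] / rho
    return np.maximum(w - tau, 0.0)

def project_hyperplane(w, lam=1.0):
    """Euclidean projection onto {beta : sum_i beta_i = lam} (Definition 2.3)."""
    w = np.asarray(w, dtype=float)
    return w - (w.sum() - lam) / w.size
```

The sort makes `project_simplex` an $O(p \log p)$ routine; `project_hyperplane` is a single shift and costs $O(p)$.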
3. Underlying discrete problems

Let $\beta^\star$ be a projection of $w$ onto $\Sigma_k \cap \Delta_\lambda^+$ or $\Sigma_k \cap \Delta_\lambda$. We now make the following elementary observation:

Remark 1. The statements of Problems 1 and 2 can be equivalently transformed into the following nested minimization problem:
$$\{S^\star, \beta^\star\} = \operatorname*{argmin}_{S \in \Sigma_k} \Big[ \operatorname*{argmin}_{\beta :\, \boldsymbol{\beta}_S \in \Delta_\lambda^+ \text{ or } \Delta_\lambda,\ \beta_{S^c} = 0} \|(\beta - w)_S\|_2^2 + \|w_{S^c}\|_2^2 \Big],$$
where $\mathrm{supp}(\beta^\star) = S^\star$ and $\beta^\star \in \Delta_\lambda^+$ or $\Delta_\lambda$.

Therefore, given $S^\star = \mathrm{supp}(\beta^\star)$, we can find $\beta^\star$ by projecting $\boldsymbol{w}_{S^\star}$ onto $\Delta_\lambda^+$ or $\Delta_\lambda$ within the $k$-dimensional space. Thus, the difficulty is finding $S^\star$. Hence, we split the problem into the task of finding the support and then finding the values on the support.

3.1. Problem 1

Given any support $S$, the unique corresponding estimator is $\hat{\boldsymbol{\beta}}_S = P_\lambda^+(\boldsymbol{w}_S)$. We conclude that $\beta^\star$ satisfies $\beta^\star_{(S^\star)^c} = 0$ and $\boldsymbol{\beta}^\star_{S^\star} = P_\lambda^+(\boldsymbol{w}_{S^\star})$, and
$$S^\star \in \operatorname*{argmin}_{S \in \Sigma_k} \|P_\lambda^+(\boldsymbol{w}_S) - \boldsymbol{w}_S\|_2^2 + \|w_{S^c}\|_2^2 = \operatorname*{argmax}_{S \in \Sigma_k} F_+(S), \quad (7)$$
where $F_+(S) := \sum_{i \in S} \big( w_i^2 - ((P_\lambda^+(\boldsymbol{w}_S))_i - w_i)^2 \big)$. This set function can be simplified to
$$F_+(S) = \sum_{i \in S} (w_i^2 - \tau^2), \quad (8)$$
where $\tau$ (which depends on $S$) is as in Lemma 1.

Lemma 1. Let $\beta = P_\lambda^+(w)$, where $\beta_i = [w_i - \tau]_+$. Then $w_i \geq \tau$ for all $i \in S = \mathrm{supp}(\beta)$. Furthermore, $\tau = \frac{1}{|S|}\big(\sum_{i \in S} w_i - \lambda\big)$.

Proof. Directly from the definition of $\tau$ in Definition 2.2.

The intuition is quite simple: the "threshold" $\tau$ should be smaller than the smallest entry in the selected support, or we unnecessarily shrink coefficients that are larger without introducing any new support into the solution. The same arguments apply to inflating the coefficients to meet the simplex budget.

3.2. Problem 2

Similarly to the above, we conclude that $\beta^\star$ satisfies $\boldsymbol{\beta}^\star_{S^\star} = P_\lambda(\boldsymbol{w}_{S^\star})$ and $\beta^\star_{(S^\star)^c} = 0$, where
$$S^\star \in \operatorname*{argmin}_{S \in \Sigma_k} \|z - w\|_2 = \operatorname*{argmax}_{S \in \Sigma_k} F(S), \quad (9)$$
where $z \in \mathbb{R}^p$ with $\boldsymbol{z}_S = P_\lambda(\boldsymbol{w}_S)$ and $z_{S^c} = 0$, and $F(S) := \sum_{i \in S} w_i^2 - \frac{1}{|S|}\big(\sum_{i \in S} w_i - \lambda\big)^2$.

4. Sparse projections onto $\Delta_\lambda^+$ and $\Delta_\lambda$

Algorithm 1 below suggests an obvious greedy approach for the projection onto $\Sigma_k \cap \Delta_\lambda^+$: we select the set $S^\star$ by naively projecting $w$ as $P_{L_k}(w)$. Remarkably, this gives the correct support set for Problem 1, as we prove in Section 5.1. We call this algorithm the greedy selector and simplex projector (GSSP). The overall complexity of GSSP is dominated by the sort operation in $p$ dimensions.

Unfortunately, GSSP fails for Problem 2. As a result, we propose Algorithm 2 for the $\Sigma_k \cap \Delta_\lambda$ case, which is non-obvious. The algorithm first selects the index of the largest element that has the same sign as $\lambda$. It then grows the index set one element at a time by finding the farthest element from the current mean, as adjusted by $\lambda$. Surprisingly, the algorithm finds the correct support set, as we prove in Section 5.2. We call this algorithm the greedy selector and hyperplane projector (GSHP), whose overall complexity is similar to that of GSSP.

Algorithm 1 GSSP
1: $S^\star = \mathrm{supp}(P_{L_k}(w))$  {Select support}
2: $\boldsymbol{\beta}_{S^\star} = P_\lambda^+(\boldsymbol{w}_{S^\star})$, $\beta_{(S^\star)^c} = 0$  {Final projection}

Algorithm 2 GSHP
1: $\ell = 1$, $S = \{j\}$, $j \in \operatorname{argmax}_i [\lambda w_i]$  {Initialize}
2: Repeat: $\ell \leftarrow \ell + 1$, $S \leftarrow S \cup \{j\}$, where $j \in \operatorname{argmax}_{i \in N \setminus S} \big| w_i - \frac{\sum_{j \in S} w_j - \lambda}{\ell - 1} \big|$  {Grow}
3: Until $\ell = k$; set $S^\star \leftarrow S$  {Terminate}
4: $\boldsymbol{\beta}_{S^\star} = P_\lambda(\boldsymbol{w}_{S^\star})$, $\beta_{(S^\star)^c} = 0$  {Final projection}
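A compact NumPy rendering of Algorithms 1 and 2 follows. It is an editorial sketch, not the authors' implementation; it reuses `project_simplex` and `project_hyperplane` from the earlier snippet and assumes $\lambda \neq 0$ for GSHP's initialization.

```python
import numpy as np

def gssp(w, k, lam=1.0):
    """Greedy selector and simplex projector (Algorithm 1): keep the k largest
    entries of w (by value, not magnitude), then project them onto the simplex."""
    w = np.asarray(w, dtype=float)
    S = np.argsort(w)[::-1][:k]                    # support of P_Lk(w)
    beta = np.zeros_like(w)
    beta[S] = project_simplex(w[S], lam)           # Definition 2.2 (earlier sketch)
    return beta

def gshp(w, k, lam):
    """Greedy selector and hyperplane projector (Algorithm 2); assumes lam != 0."""
    w = np.asarray(w, dtype=float)
    S = [int(np.argmax(lam * w))]                  # largest element with the sign of lam
    while len(S) < k:
        mean = (w[S].sum() - lam) / len(S)         # current lam-adjusted mean
        rest = np.setdiff1d(np.arange(w.size), S)  # candidate indices outside S
        S.append(int(rest[np.argmax(np.abs(w[rest] - mean))]))
    beta = np.zeros_like(w)
    beta[S] = project_hyperplane(w[S], lam)        # Definition 2.3 (earlier sketch)
    return beta
```

For instance, `gssp(w, 5, 1.0)` returns a 5-sparse point on the unit simplex, while `gshp(w, 5, 0.3)` returns a 5-sparse vector whose entries sum to 0.3.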
5. Main results

Remark 2. When the symbol $S$ is used as $S = \mathrm{supp}(\bar\beta)$ for some $\bar\beta$ and $|S| < k$, we enlarge $S$ until it has $k$ elements by taking the first $k - |S|$ elements that are not already in $S$, and setting $\bar\beta = 0$ on these elements. This lexicographic approach is used to break ties when there are multiple solutions.

5.1. Correctness of GSSP

Theorem 1. Algorithm 1 exactly solves Problem 1.

Proof. Intuitively, the $k$ largest coordinates should be in the solution. To see this, suppose that $u$ is the projection of $w$. Let $w_i$ be one of the $k$ most positive coordinates of $w$ with $u_i = 0$, and let $w_j < w_i$, $i \neq j$, be such that $u_j > 0$. We can then construct a new vector $u'$ with $u'_j = u_i = 0$ and $u'_i = u_j$. Then $u'$ satisfies the constraints and is closer to $w$, i.e.,
$$\|w - u\|_2^2 - \|w - u'\|_2^2 = 2 u_j (w_i - w_j) > 0.$$
Hence, $u$ cannot be the projection. To complete the proof, we also need to show that cardinality-$k$ solutions are as good as any solution of cardinality less than $k$. Suppose there exists a solution $u$ with support $|S| < k$. Add any elements to $S$ to form $\hat S$ of size $k$. Then consider $w$ restricted to $\hat S$, and let $\hat u$ be its projection onto the simplex. Because this is a projection, $\|\hat{\boldsymbol{u}}_{\hat S} - \boldsymbol{w}_{\hat S}\| \leq \|\boldsymbol{u}_{\hat S} - \boldsymbol{w}_{\hat S}\|$, hence $\|\hat u - w\| \leq \|u - w\|$.

5.2. Correctness of GSHP

Theorem 2. Algorithm 2 exactly solves Problem 2.

Proof. To motivate the support selection of GSHP, we first identify a key relation that holds for any $b \in \mathbb{R}^k$:
$$\sum_{i=1}^{k} b_i^2 - \frac{\big(\sum_{i=1}^{k} b_i - \lambda\big)^2}{k} = \lambda(2 b_1 - \lambda) + \sum_{j=2}^{k} \frac{j-1}{j} \Big( b_j - \frac{\sum_{i=1}^{j-1} b_i - \lambda}{j-1} \Big)^2. \quad (10)$$
By its left-hand side, this relation is invariant under permutations of $b$. Moreover, the summands in the sum over $j$ are non-negative for $k \geq 2$, so without loss of generality the solution sparsity of the original problem is $\|\beta^\star\|_0 = k$. For $k = 1$, $F$ is maximized by picking an index $i$ that maximizes $\lambda w_i$, which is exactly what the algorithm does.

For the sake of clarity (and space), we first describe the proof of the case $k \geq 2$ for $\lambda = 0$ and then explain how it generalizes to $\lambda \neq 0$. In the sequel, we use the shortcut $\mathrm{avg}(S) = \frac{1}{|S|} \sum_{j \in S} w_j$.

Let $S$ be an optimal solution index set and let $I$ be the result computed by the algorithm. For a proof (of the case $k \geq 2$, $\lambda = 0$) by contradiction, assume that $I$ and $S$ differ. Let $e$ be the first element of $I \setminus S$ in the order of insertion into $I$ by the algorithm, and let $I_0 \subseteq I \cap S$ be the indices added to $I$ by the algorithm before $e$. Assume for now that $I_0$ is nonempty; we will later see how to ensure this. Let $e'$ be the element of $S \setminus I_0$ that lies closest to $e$. Without loss of generality, we may assume that $w_e \neq w_{e'}$; otherwise we could have chosen $(S \setminus \{e'\}) \cup \{e\}$ rather than $S$ as the solution in the first place. Let $a := \mathrm{avg}(I_0)$ and $a' := \mathrm{avg}(S \setminus \{e'\})$. There are three ways in which $w_e$, $w_{e'}$, and $a'$ can be ordered relative to each other:

1. $e'$ lies between $e$ and $a'$; thus $|w_{e'} - a'| < |w_e - a'|$ since $w_e \neq w_{e'}$.

2. $a'$ lies between $e$ and $e'$. But then, since there are no elements of $S$ between $e$ and $e'$, $S \setminus I_0$ moves the average $a'$ beyond $a$, away from $e$ and towards $e'$, so $|w_{e'} - a'| < |w_{e'} - a|$ and $|w_e - a| < |w_e - a'|$. But we know that $|w_{e'} - a| < |w_e - a|$, since $e = \operatorname{argmax}_{i \notin I_0} |w_i - a|$ by the choice of the greedy algorithm and $w_e \neq w_{e'}$. Thus $|w_{e'} - a'| < |w_e - a'|$.

3. $|w_e - a'| < |w_{e'} - a'|$, i.e., $e$ lies between $a'$ and $e'$. But this case is impossible: compared to $a$, $a'$ averages over additional values that are closer to $a$ than $e$, and $e'$ is one of them. So $a'$ must be on the same side as $e'$ relative to $e$, not the opposite side.

So $|w_{e'} - a'| < |w_e - a'|$ is assured in all cases. Note in particular that if $|S| \geq 1$ and $|w_i - \mathrm{avg}(S)| \ \theta \ |w_j - \mathrm{avg}(S)|$, where $\theta$ is either '=' or '<', then
$$F(S \cup \{i\}) = F(S) + \frac{k-1}{k} \big( w_i - \mathrm{avg}(S) \big)^2 \ \ \theta \ \ F(S) + \frac{k-1}{k} \big( w_j - \mathrm{avg}(S) \big)^2 = F(S \cup \{j\}), \quad (11)$$
with $k = |S| + 1$. By inequality (11), $F(S) < F((S \setminus \{e'\}) \cup \{e\})$. But this means that $S$ is not a solution: a contradiction.

We have assumed that $I_0$ is nonempty; this is ensured because any solution $S$ must contain an index $i \in \operatorname{argmax}_j w_j$. Otherwise, we could replace a maximal index (w.r.t. $w$) in $S$ by this $i$ and obtain, by (11), a larger value of $F$, contradicting the assumption that $S$ is a solution. Note that this maximal index is also picked (first) by the algorithm. This completes the proof for the case $\lambda = 0$.

Let us now consider the general case where $\lambda$ is unrestricted; we reduce it to the case $\lambda = 0$. Write $F_{w,\lambda}$ to make the parameters $w$ and $\lambda$ explicit when talking of $F$. Let $w'_{i^*} := w_{i^*} - \lambda$ for one $i^*$ for which $\lambda w_{i^*}$ is maximal, and let $w'_i := w_i$ for all other $i$. We use the fact that, by the definition of $F$,
$$F_{w,\lambda}(S) = 2\lambda w'_{i^*} + \lambda^2 + F_{w',0}(S)$$
when $S$ contains such an element $i^* \in \operatorname{argmax}_j (\lambda w_j)$. Clearly, $i^*$ is an extremal element w.r.t. $w$ and $w_{i^*}$ has maximum distance from $-\lambda$, so
$$i^* \in \operatorname*{argmax}_j \Big| w_j - \frac{\sum_{i \neq j} w_i - \lambda}{j - 1} \Big|.$$
By (10), $i^*$ must be in the optimal solution for $F_{w,\lambda}$. Also, $F_{w',0}(S)$ and $2\lambda w'_{i^*} + \lambda^2 + F_{w',0}(S)$ are maximized by the same index sets $S$ when $i^* \in S$ is required. Thus,
$$\operatorname*{argmax}_{S} F_{w,\lambda}(S) = \operatorname*{argmax}_{S : i^* \in S} F_{w',0}(S).$$
Now observe that our previous proof for the case $\lambda = 0$ also works if one adds the constraint that one or more indices be part of the solution: if the algorithm computes these elements as part of its result $I$, they are in $I_0 = I \cap S$. But this is exactly what the algorithm does on input $(w, \lambda)$: it chooses $i^*$ in its first step and then proceeds as if maximizing $F_{w',0}$. This establishes the algorithm's correctness.
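As a quick numerical sanity check of the key relation (10) (an editorial addition, not part of the paper), the following snippet verifies the identity on a random draw with $k = 6$ and $\lambda = 0.7$:

```python
import numpy as np

rng = np.random.default_rng(0)
b, lam = rng.standard_normal(6), 0.7
k = b.size

lhs = np.sum(b ** 2) - (np.sum(b) - lam) ** 2 / k
rhs = lam * (2 * b[0] - lam) + sum(
    (j - 1) / j * (b[j - 1] - (np.sum(b[: j - 1]) - lam) / (j - 1)) ** 2
    for j in range(2, k + 1)
)
print(np.isclose(lhs, rhs))   # True: identity (10) holds for this draw
```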
6. Application: Quantum tomography

Problem: In quantum tomography (QT), we aim to learn a density matrix $X^\star \in \mathbb{C}^{d \times d}$, which is Hermitian (i.e., $(X^\star)^H = X^\star$), positive semi-definite (i.e., $X^\star \succeq 0$), and has $\mathrm{rank}(X^\star) = r$ and $\mathrm{tr}(X^\star) = 1$. The QT measurements are $y = A(X^\star) + \eta$ with $(A(X^\star))_i = \mathrm{tr}(E_i X^\star)$, where $\eta_i$ is zero-mean Gaussian noise. The operators $E_i$ are tensor products of the $2 \times 2$ Pauli matrices (Liu, 2011). Recently, (Liu, 2011) showed that almost all such tensor constructions of $O(r d \log^6 d)$ Pauli measurements satisfy the so-called rank-$r$ restricted isometry property (RIP) for all $X \in \{X \in \mathbb{C}^{d \times d} : X \succeq 0,\ \mathrm{rank}(X) \leq r,\ \|X\|_* \leq \sqrt{r}\,\|X\|_F\}$:
$$(1 - \delta_r)\, \|X\|_F^2 \leq \|A(X)\|_F^2 \leq (1 + \delta_r)\, \|X\|_F^2, \quad (12)$$
where $\|\cdot\|_*$ is the nuclear norm (i.e., the sum of singular values), which reduces to $\mathrm{tr}(X)$ since $X \succeq 0$. This key observation enables us to leverage recent theoretical and algorithmic advances in low-rank matrix recovery from a few affine measurements.

The standard matrix-completion-based approach to recover $X^\star$ from $y$ is the following convex relaxation:
$$\operatorname*{minimize}_{X \succeq 0} \; \|A(X) - y\|_F^2 + \lambda \|X\|_*. \quad (13)$$
This convex approach is both tractable and amenable to theoretical analysis (Gross et al., 2010; Liu, 2011). As a result, we can provably reduce the number of samples $m$ from $O(d^2)$ to $\tilde{O}(rd)$ (Liu, 2011). Unfortunately, this convex approach fails to account for the physical constraint $\|X\|_* = 1$. To overcome this difficulty, the relaxation parameter $\lambda$ is tuned to obtain solutions with the desired rank, followed by a normalization to heuristically meet the trace constraint.

In this section, we demonstrate that one can do significantly better via the non-convex algorithm based on (3). A key ingredient is then the following projection:
$$\hat{B} \in \operatorname*{argmin}_{B \succeq 0} \|B - W\|_F^2 \quad \text{s.t.} \quad \mathrm{rank}(B) = r, \; \mathrm{tr}(B) = 1, \quad (14)$$
for a given Hermitian matrix $W$. Since the RIP assumption holds here, we can obtain rigorous guarantees based on an analysis similar to (Foucart, 2010; Garg & Khandekar, 2009; Meka et al., 2010). To obtain the solution, we compute the eigenvalue decomposition $W = U \Lambda_W U^H$ and then use the unitary invariance of the problem to solve $D^\star \in \operatorname{argmin}_D \|D - \Lambda_W\|_F$ subject to $\|D\|_* \leq 1$ and $\mathrm{rank}(D) \leq r$; from $D^\star$ we form $U D^\star U^H$ to obtain a solution. In fact, we can constrain $D$ to be diagonal, and thus reduce the matrix problem to the vector version for $D = \mathrm{diag}(d)$, where the projector of Problem 1 applies. This reduction follows from the well-known result:

Proposition 6.1 ((Mirsky, 1960)). Let $A, B \in \mathbb{R}^{m \times n}$ and $q = \min\{m, n\}$. Let $\sigma_i(A)$ denote the singular values of $A$ in descending order (similarly for $B$). Then,
$$\sum_{i=1}^{q} \big( \sigma_i(A) - \sigma_i(B) \big)^2 \leq \|A - B\|_F^2.$$

Equation (14) has a solution if $r \geq 1$, since the constraint set is non-empty and compact (Weierstrass's theorem). As the vector reduction achieves the lower bound, it is an optimal projection.
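In code, the projection (14) amounts to an eigendecomposition followed by the vector projector of Problem 1. The sketch below is an editorial illustration of this reduction (reusing `gssp` from the earlier snippet); it returns a matrix of rank at most $r$ and unit trace, which coincides with an exact rank-$r$ solution whenever the $r$ retained eigenvalues remain positive after the simplex shift.

```python
import numpy as np

def project_density_matrix(W, r):
    """Sketch of projection (14): nearest PSD matrix with rank <= r and trace 1,
    obtained by projecting the eigenvalues with GSSP (Problem 1 projector)."""
    W = 0.5 * (W + W.conj().T)              # symmetrize for numerical safety
    evals, U = np.linalg.eigh(W)            # W = U diag(evals) U^H
    d = gssp(evals, k=r, lam=1.0)           # r-sparse projection onto the simplex
    return (U * d) @ U.conj().T             # rebuild U diag(d) U^H
```

The only spectral information needed is the top-$r$ part of the spectrum, which is what makes the non-convex iteration scalable in the experiments below.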
Figure 1. Quantum tomography with 8 qubits and 30 dB SNR (relative error in Frobenius norm vs. number of measurements divided by $d \cdot r$): each point is the median over 10 random realizations. Convex approach 1 refers to (13) and approach 2 to (15); the non-convex curves use either a random $X_0$ or the convex solution as $X_0$.

Figure 2. Same as Figure 1, but with 7 qubits and no noise.

Numerical experiments: We numerically demonstrate that the ability to project onto the trace and rank constraints jointly can radically improve recovery, even with the simple gradient descent algorithm (3). We follow the same approach as (Gross et al., 2010): we generate random Pauli measurements and add white noise. The experiments that follow use an 8-qubit system ($d = 2^8$) with noise at 30 dB SNR (so the absolute noise level changes with the number of measurements), and a 7-qubit noiseless system. The measurements are generated using a random real-valued matrix $X^\star$ with rank 2, although the algorithms also work with complex-valued matrices. A $d \times d$ rank-$r$ real-valued matrix has $dr - r(r-1)/2 \approx dr$ degrees of freedom, hence we need at least $2dr$ measurements to recover $X^\star$ from noiseless measurements (due to the null space of the linear map). To test the various approaches, we vary the number of measurements between $2dr$ and $5dr$. We assume $r$ is known, though other computational experience suggests that estimates of $r$ return good answers as well.

The convex problem (13) depends on a parameter $\lambda$. We solve the problem for different $\lambda$ in a bracketing search until we find the first $\lambda$ that provides a solution with numerical rank $r$. Like (Flammia et al., 2012), we normalize the final estimate to ensure the trace is 1. Additionally, we test the following convex approach:
$$\operatorname*{minimize}_{X \succeq 0, \|X\|_* \leq 1} \; \|A(X) - y\|_F^2. \quad (15)$$
Compared to (13), no parameters are needed, since we exploit prior knowledge of the trace, but there is no guarantee on the rank. Both convex approaches can be solved with proximal gradient descent; we use the TFOCS package (Becker et al., 2011), since it uses a sophisticated line search and Nesterov acceleration.
To illustrate the power of the combinatorial projections, we solve the following non-convex formulation:
$$\operatorname*{minimize}_{X \succeq 0,\ \|X\|_* \leq 1,\ \mathrm{rank}(X) = r} \; \|A(X) - y\|_F^2. \quad (16)$$
Within the projected gradient algorithm (3), we use the GSSP algorithm as described above. The step size is $\mu_i = 3/\|A\|^2$, where $\|\cdot\|$ is the operator norm; we could also apply Nesterov acceleration to speed up convergence, but we use (3) for simplicity. Due to the non-convexity, the algorithm depends on the starting value $X_0$. We try two strategies: (i) random initialization, and (ii) initialization with the solution of (15). Both initializations often lead to the same stationary point.

Figure 1 shows the relative error $\|X - X^\star\|_F / \|X^\star\|_F$ of the different approaches. All approaches do poorly when there are only $2dr$ measurements, since this is near the noiseless information-theoretic limit. For higher numbers of measurements, the non-convex approach substantially outperforms both convex approaches. For $2.4dr$ measurements, it helps to start $X_0$ from the convex solution, but otherwise the two non-convex approaches are nearly identical. Between the two convex solutions, (13) outperforms (15) since it tunes $\lambda$ to achieve a rank-$r$ solution. Neither convex approach is competitive with the non-convex approaches, since they do not take advantage of the prior knowledge on trace and rank. Figure 2 shows more results on a 7-qubit problem without noise. Again, the non-convex approach gives better results, particularly when there are fewer measurements. As expected, both approaches approach perfect recovery as the number of measurements increases.

Table 1. Time per iteration of the convex and non-convex approaches for quantum state tomography with 8 qubits.

  Approach      mean time   standard deviation
  convex        0.294 s     0.030 s
  non-convex    0.192 s     0.019 s

Here we highlight another key benefit of the non-convex approach: since the number of eigenvectors needed in the partial eigenvalue decomposition is at most $r$, it is quite scalable. In general, the convex approach has intermediate iterates which require eigenvalue decompositions of nearly the full dimension, especially during the first few iterations. Table 1 shows the average time per iteration for the convex and non-convex approaches (the overall time is more complicated, since the number of iterations depends on the line search and other factors). Even using Matlab's dense eigenvalue solver eig, the iterations of the non-convex approach are faster; problems that used an iterative Krylov subspace solver would show an even larger discrepancy.[2]

[2] Quantum state tomography does not easily benefit from iterative eigenvalue solvers, since the range of $A^*$ is not sparse.

7. Application: Measure learning

Problem: We study the kernel density learning setting. Let $x^{(1)}, x^{(2)}, \dots, x^{(n)} \in \mathbb{R}^p$ be an $n$-size corpus of $p$-dimensional samples drawn from an unknown probability density function (pdf) $\mu(x)$. We form an estimator $\hat\mu(x) := \sum_{i=1}^{n} \beta_i \kappa_\sigma(x, x^{(i)})$, where $\kappa_\sigma(x, y)$ is a Gaussian kernel with parameter $\sigma$. We choose $\hat\mu(x)$ to minimize the integrated squared error criterion $\mathrm{ISE} = \mathbb{E}\|\hat\mu(x) - \mu(x)\|_2^2$. As a result, we can pose density learning as estimating a weight vector $\beta^\star \in \Delta_1^+$. The objective can then be written as follows (Bunea et al., 2010; Kim, 1995):
$$\beta^\star \in \operatorname*{argmin}_{\beta \in \Delta_1^+} \big\{ \beta^T \Sigma \beta - c^T \beta \big\}, \quad (17)$$
where $\Sigma \in \mathbb{R}^{n \times n}$ with $\Sigma_{ij} = \kappa_{\sqrt{2}\sigma}(x^{(i)}, x^{(j)})$, and
$$c_i = \frac{1}{n-1} \sum_{j \neq i} \kappa_\sigma(x^{(i)}, x^{(j)}), \quad \forall i. \quad (18)$$
While the combination of the $-c^T\beta$ term and the non-negativity constraint induces some sparsity, it may not be enough. To avoid overfitting or to obtain interpretable results, one might want to control the level of solution sparsity (Bunea et al., 2010). In this context, we extend (17) to include cardinality constraints, i.e., $\beta^\star \in \Delta_1^+ \cap \Sigma_k$.
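A sketch of this cardinality-constrained estimator follows: it builds $\Sigma$ and $c$ as in (18) for a one-dimensional sample, then runs the projected gradient iteration (3) with GSSP. The kernel normalization, step size, and iteration count are illustrative assumptions, and `gssp` comes from the earlier snippet; this is not the authors' code.

```python
import numpy as np

def gaussian_kernel(x, y, sigma):
    # Normalized 1-D Gaussian kernel kappa_sigma(x, y) (our normalization choice).
    return np.exp(-(x - y) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

def sparse_density_weights(samples, sigma=1.0, k=5, steps=300):
    """Cardinality-constrained version of (17): projected gradient with GSSP."""
    x = np.asarray(samples, dtype=float)
    n = x.size
    Sigma = gaussian_kernel(x[:, None], x[None, :], np.sqrt(2) * sigma)
    K = gaussian_kernel(x[:, None], x[None, :], sigma)
    c = (K.sum(axis=1) - np.diag(K)) / (n - 1)     # leave-one-out averages, as in (18)
    beta = np.full(n, 1.0 / n)
    mu = 1.0 / (2.0 * np.linalg.norm(Sigma, 2))    # conservative step size (assumption)
    for _ in range(steps):
        grad = 2.0 * Sigma @ beta - c              # gradient of beta^T Sigma beta - c^T beta
        beta = gssp(beta - mu * grad, k, lam=1.0)  # project onto Sigma_k ∩ Delta_1^+
    return beta
```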
Figure 3. Density estimation results using the Parzen method (left), the quadratic program (17) (left and middle-top), and our approach (middle-bottom and right). The panels show the true pdf, the estimated pdf for $k = 5$, the true and estimated kernel means, and the weight profiles over the sample point indices.

Figure 4. Estimation results for different $k$ ((a) $k = 3$, (b) $k = 8$, (c) $k = 10$, (d) $k = 15$): red spikes depict the estimated kernel means as well as their relative contribution to the Gaussian mixture. As $k$ increases, the additional nonzero coefficients in $\beta^\star$ tend to have small weights.

Numerical experiments: We consider the following Gaussian mixture: $\mu(x) = \frac{1}{5}\sum_{i=1}^{5} \kappa_{\sigma_i}(\beta_i, x)$, where $\sigma_i = (7/9)^i$ and $\beta_i = 14(\sigma_i - 1)$. A sample of 1000 points is drawn from $\mu(x)$. We compare the density estimation performance of (i) the Parzen method (Parzen, 1962), (ii) the quadratic programming formulation (17), and (iii) our cardinality-constrained version of (17) using GSSP. While $\mu(x)$ is constructed from kernels with various widths, we assume a constant width during kernel estimation. In practice, the width is not known a priori but can be found using cross-validation techniques; for simplicity, we assume kernels of width $\sigma = 1$.

Figure 3 (left) depicts the true pdf and the estimated densities using the Parzen method and the quadratic programming approach. Moreover, the figure also includes a scaled plot of $1/\sigma_i$, indicating the heights of the individual Gaussian mixture components. By default, the Parzen window method interpolates 1000 Gaussian kernels centered at the sampled points to compute the estimate $\hat\mu(x)$; unfortunately, neither the quadratic programming approach (as Figure 3 (middle-top) illustrates) nor the Parzen window estimator is easily interpretable, even though both provide a good approximation of the true pdf.

Using our cardinality-constrained approach, we can significantly enhance interpretability, because the sparsity-constrained formulation lets us control the number of estimated Gaussian components. Hence, if the model order is known a priori, the non-convex approach can be extremely useful. To see this, we first show the coefficient profile of the sparsity-based approach for $k = 5$ in Figure 3 (middle-bottom). Figure 3 (right) shows the estimated pdf for $k = 5$ along with the positions of the weight coefficients obtained by our sparsity-enforcing approach.
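The flavor of this experiment can be reproduced in outline as follows; the sampling code is an editorial sketch under our own assumptions (component means and widths taken from the description above, equal mixture weights) and reuses `sparse_density_weights` from the previous snippet.

```python
import numpy as np

rng = np.random.default_rng(1)
sigmas = (7.0 / 9.0) ** np.arange(1, 6)
means = 14.0 * (sigmas - 1.0)

# Draw 1000 points from the mixture (1/5) * sum_i kappa_{sigma_i}(means_i, x).
comp = rng.integers(0, 5, size=1000)
samples = rng.normal(means[comp], sigmas[comp])

beta = sparse_density_weights(samples, sigma=1.0, k=5)
S = np.flatnonzero(beta)
print(samples[S], beta[S])   # selected kernel centres and their weights
```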
Note that most of the weights obtained concentrate around the true means, fully exploiting our prior information about the components of $\mu(x)$; this happens with rather high frequency in the experiments we conducted. Figure 4 illustrates further estimated pdfs using our approach for various $k$. Surprisingly, the resulting solutions are still approximately 5-sparse even when $k > 5$, as the extra nonzero coefficients are extremely small; hence the sparse estimator is reasonably robust to inaccurate estimates of $k$.

8. Application: Portfolio optimization

Problem: Given a sample covariance matrix $\Sigma$ and expected mean $\mu$, the return-adjusted Markowitz mean-variance (MV) framework selects a portfolio $\beta^\star$ such that
$$\beta^\star \in \operatorname*{argmin}_{\beta \in \Delta_1^+} \big\{ \beta^T \Sigma \beta - \tau \mu^T \beta \big\},$$
where $\Delta_1^+$ encodes the normalized capital constraint and $\tau$ trades off risk and return (Brodie et al., 2009; DeMiguel et al., 2009). The solution $\beta^\star \in \Delta_1^+$ is the distribution of investments over the $p$ available assets.

While such solutions construct portfolios from scratch, a more realistic scenario is to incrementally adjust an existing portfolio as the market changes. Due to per-transaction costs, we can naturally introduce cardinality constraints. In mathematical terms, let $\bar\beta \in \mathbb{R}^p$ be the current portfolio selection. Given $\bar\beta$, we seek to adjust the current selection to $\beta = \bar\beta + \delta_\beta$ such that $\|\delta_\beta\|_0 \leq k$. This leads to the following optimization problem:
$$\delta_\beta^\star \in \operatorname*{argmin}_{\delta_\beta \in \Sigma_k \cap \Delta_\lambda} (\bar\beta + \delta_\beta)^T \Sigma (\bar\beta + \delta_\beta) - \tau \mu^T (\bar\beta + \delta_\beta),$$
where $\lambda$ is the level of the update and $k$ controls the transaction costs. During an update, $\lambda = 0$ would keep the portfolio value constant, while $\lambda > 0$ would increase it.

Numerical experiments: To clearly highlight the impact of the non-convex projector, we create a synthetic portfolio update problem where we know the solution. As in (Brodie et al., 2009), we cast this problem as a regression problem and synthetically generate $y = X\beta^\star$ with $p = 1000$, such that $\beta^\star \in \Delta_\lambda$ ($\lambda$ is chosen randomly) and $\|\beta^\star\|_0 = k$ for $k = 100$. Since in general we do not expect RIP assumptions to hold in portfolio optimization, our goal here is to refine the sparse solution of a state-of-the-art convex solver via (3) in order to meet the strict sparsity and budget constraints. Hence, we first consider the basis pursuit criterion and solve it using SPGL1 (van den Berg & Friedlander, 2008):
$$\operatorname*{minimize}_{\beta} \; \|\beta\|_1 \quad \text{s.t.} \quad \begin{bmatrix} X \\ 1^T/\sqrt{p} \end{bmatrix} \beta = \begin{bmatrix} y \\ \lambda/\sqrt{p} \end{bmatrix}. \quad (19)$$
The normalization by $1/\sqrt{p}$ in the last equality gives the constraint matrix a better condition number, since otherwise it is too ill-conditioned for a first-order solver. Almost none of the solutions to (19) is $k$-sparse. Hence, we initialize (3) with the SPGL1 solution to meet the constraints. The update step in (3) uses the GSHP algorithm.

Figure 5 shows the resulting relative errors $\|\hat\beta - \beta^\star\|_2 / \|\beta^\star\|_2$. We see that not only does (3) return a $k$-sparse solution, but this solution is also closer to $\beta^\star$, particularly when the sample size is small. As the sample size increases, the knowledge that $\beta^\star$ is $k$-sparse makes up a smaller percentage of what we know about the signal, so the gap between (19) and (3) diminishes.

Figure 5. Relative error $\|\hat\beta - \beta^\star\|_2 / \|\beta^\star\|_2$ as a function of $m$: approach 1 is the non-convex approach (3), and approach 2 is (19). Each point corresponds to the median value over 30 Monte Carlo realizations.
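The refinement step just described can be sketched as follows. The routine below is an editorial illustration: `beta0` stands for any dense initial estimate (in the paper, the SPGL1 basis-pursuit solution of (19)), the fixed step size is our own conservative choice, and `gshp` comes from the earlier snippet.

```python
import numpy as np

def refine_portfolio(X, y, beta0, k, lam, steps=200):
    """Refine a dense estimate onto Sigma_k ∩ Delta_lambda via iteration (3) with GSHP."""
    beta = gshp(np.asarray(beta0, dtype=float), k, lam)   # feasible starting point
    mu = 1.0 / (2.0 * np.linalg.norm(X, 2) ** 2)          # conservative step size (assumption)
    for _ in range(steps):
        grad = 2.0 * X.T @ (X @ beta - y)                 # gradient of ||y - X beta||_2^2
        beta = gshp(beta - mu * grad, k, lam)
    return beta
```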
9. Conclusions

While non-convexity in learning algorithms is undesirable according to conventional wisdom, avoiding it might be difficult in many problems. In this setting, we show how to efficiently obtain exact sparse projections onto the positive simplex and its hyperplane extension. We empirically demonstrate that our projectors provide substantial accuracy benefits in quantum tomography from fewer measurements and enable us to exploit prior non-convex knowledge in density learning. Moreover, we also illustrate that we can refine the solution of well-established state-of-the-art convex sparse recovery algorithms to enforce non-convex constraints in sparse portfolio updates. The quantum tomography example in particular illustrates that the non-convex solutions can be extremely useful; here, the non-convexity appears milder, since a fixed-rank matrix still has extra degrees of freedom from the choice of its eigenvectors.

10. Acknowledgements

VC and AK's work was supported in part by the European Commission under Grant MIRG-268398, ERC Future Proof and SNF 200021-132548. SRB is supported by the Fondation Sciences Mathématiques de Paris. CK's research is funded by ERC ALGILE.

References

Bahmani, S., Boufounos, P., and Raj, B. Greedy sparsity-constrained optimization. In Conference Record of the Forty-Fifth Asilomar Conference on Signals, Systems and Computers (ASILOMAR), pp. 1148–1152. IEEE, 2011.

Becker, S., Candès, E., and Grant, M. Templates for convex cone problems with applications to sparse signal recovery. Mathematical Programming Computation, pp. 1–54, 2011. URL http://dx.doi.org/10.1007/s12532-011-0029-5.

Brodie, J., Daubechies, I., De Mol, C., Giannone, D., and Loris, I. Sparse and stable Markowitz portfolios. Proceedings of the National Academy of Sciences, 106(30):12267–12272, 2009.

Bunea, F., Tsybakov, A. B., Wegkamp, M. H., and Barbu, A. SPADES and mixture models. The Annals of Statistics, 38(4):2525–2558, 2010.

Candès, E., Romberg, J., and Tao, T. Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Trans. on Information Theory, 52(2):489–509, February 2006.

DeMiguel, V., Garlappi, L., Nogales, F. J., and Uppal, R. A generalized approach to portfolio optimization: Improving performance by constraining portfolio norms. Management Science, 55(5):798–812, 2009.

Flammia, S. T., Gross, D., Liu, Y.-K., and Eisert, J. Quantum tomography via compressed sensing: error bounds, sample complexity and efficient estimators. New Journal of Physics, 14(9):095022, 2012. URL http://stacks.iop.org/1367-2630/14/i=9/a=095022.

Foucart, S. Sparse recovery algorithms: sufficient conditions in terms of restricted isometry constants. Proceedings of the 13th International Conference on Approximation Theory, 2010.

Garg, R. and Khandekar, R. Gradient descent with sparsification: an iterative algorithm for sparse recovery with restricted isometry property. In ICML. ACM, 2009.

Gross, D., Liu, Y.-K., Flammia, S. T., Becker, S., and Eisert, J. Quantum state tomography via compressed sensing. Phys. Rev. Lett., 105(15):150401, Oct 2010. doi: 10.1103/PhysRevLett.105.150401.

Kim, D. Least squares mixture decomposition estimation. PhD thesis, 1995.

Kyrillidis, A. and Cevher, V. Recipes on hard thresholding methods. Dec. 2011.

Kyrillidis, A. and Cevher, V. Combinatorial selection and least absolute shrinkage via the Clash algorithm. In IEEE International Symposium on Information Theory, July 2012.
Liu, Y.-K. Universal low-rank matrix recovery from Pauli measurements. In NIPS, pp. 1638–1646, 2011.

Meka, R., Jain, P., and Dhillon, I. S. Guaranteed rank minimization via singular value projection. In NIPS Workshop on Discrete Optimization in Machine Learning, 2010.

Mirsky, L. Symmetric gauge functions and unitarily invariant norms. The Quarterly Journal of Mathematics, 11(1):50–59, 1960.

Nemhauser, G. L. and Wolsey, L. A. Integer and Combinatorial Optimization, volume 18. Wiley, New York, 1988.

Parzen, E. On estimation of a probability density function and mode. The Annals of Mathematical Statistics, 33(3):1065–1076, 1962.

Pilanci, M., El Ghaoui, L., and Chandrasekaran, V. Recovery of sparse probability measures via convex programming. In Advances in Neural Information Processing Systems 25, pp. 2429–2437, 2012.

van den Berg, E. and Friedlander, M. P. Probing the Pareto frontier for basis pursuit solutions. SIAM J. Sci. Comput., 31(2):890–912, 2008.