FLAG MANIFOLDS FOR SUBSPACE ICA PROBLEMS

Yasunori Nishimori, Shotaro Akaho
Neuroscience Research Institute, AIST
1-1-1 Umezono, Tsukuba, Ibaraki 305-8568, Japan

Samer Abdallah, Mark D. Plumbley
Department of Electrical Engineering, Queen Mary, University of London
Mile End Road, London E1 4NS, UK

(This work is partly supported by JSPS Grant-in-Aid for Exploratory Research 16650050, MEXT Grant-in-Aid for Scientific Research on Priority Areas 17022033, and EPSRC Grant GR/S82213/01.)

ABSTRACT

We investigate the use of the Riemannian optimization method over the flag manifold in subspace ICA problems such as independent subspace analysis (ISA) and complex ICA. In the ISA experiment, we use the Riemannian approach over the flag manifold together with an MCMC method to overcome the problem of local minima of the ISA cost function. Experiments demonstrate the effectiveness of both Riemannian methods – simple geodesic gradient descent and hybrid geodesic gradient descent – compared with the ordinary gradient method.

Index Terms— Independent subspace analysis, complex ICA, natural gradient, geodesic, flag manifolds, Riemannian optimization method

1. INTRODUCTION

Several signal processing tasks, such as those related to independent component analysis (ICA), can be approached by optimization over a manifold. Examples of such manifolds include the orthogonal group O(n), corresponding to the set of orthonormal n × n matrices; the Stiefel manifold St(n, p; R), corresponding to the set of orthonormal n × p matrices,

    St(n, p; R) = { W = (w_1, ..., w_p) ∈ R^{n×p} | W^T W = I_p, n ≥ p };

and the Grassmann manifold Gr(n, p; R) of unoriented p-planes, corresponding to the subspaces spanned by n × p full-rank matrices. Stiefel manifolds have been used in ICA and PCA in the case where the number of extracted components is less than the number of mixed signals [8], while Grassmann manifolds have been utilized for invariant subspace computation and subspace tracking [1].

One-unit ICA extracts one independent component at a time, while ordinary ICA extracts several components simultaneously by optimization over the Stiefel manifold. A single subspace can be represented as a point on the Grassmann manifold, which can be used for subspace analysis. The question then arises: what manifold is needed for extracting several subspaces simultaneously? This leads us to the concept of the flag manifold, the manifold consisting of orthogonal subspaces. It is closely related to the Stiefel manifold and includes the Grassmann manifold as a special case.

We extend the Riemannian optimization method to the flag manifold by deriving the formulas for the natural gradient and geodesics on the manifold. We show how the flag manifold method can be applied to subspace ICA problems such as independent subspace analysis [5] and complex ICA. We show that we can replace an optimization over the complex Stiefel manifold St(n, p; C) of n × p complex unitary matrices (n ≥ p) with an optimization over the real generalized flag manifold of p 2-dimensional subspaces in R^{2n}. Thus, for example, the complex ICA problem of separating p complex independent sources from a sequence of n > p complex observations can also be tackled using a generalized flag manifold. We also consider the problem of local minima in ISA and propose a hybrid geodesic gradient-MCMC method to tackle the problem. This algorithm takes geodesic gradient descent steps in the flag manifold interleaved with random interchanges of basis vectors between subspaces. These swaps prevent the system from becoming trapped in local minima of the ISA cost function, while the Riemannian optimization method accelerates convergence between swaps.
2. FLAG MANIFOLD

Manifolds that frequently arise in signal processing tasks are Lie groups and homogeneous spaces of Lie groups. A homogeneous space M is defined to be a manifold on which a Lie group G acts transitively; it is expressed as the quotient space G/H of G by its isotropy subgroup H, where the isotropy subgroup H of a point p ∈ M consists of the stabilizers of p: H = {x ∈ G | x · p = p}. It is also worth mentioning that G is a fiber bundle over G/H whose fiber is isomorphic to H. The formula for geodesics over the flag manifold is derived by regarding the Stiefel manifold as a fiber bundle over the flag manifold. Previous use of homogeneous spaces has mainly concentrated on the Stiefel manifold St(n, p; R), which is diffeomorphic to O(n)/O(n − p), and the Grassmann manifold Gr(n, p; R), which is diffeomorphic to O(n)/O(p) × O(n − p).

Extending the concept of the Grassmann manifold, we introduce a new class of manifold – the flag manifold [6]. Given an ordered sequence (n_1, ..., n_r) of nonnegative integers with n_1 + ··· + n_r ≤ n, the flag manifold Fl(n_1, n_2, ..., n_r) is defined to be the set of sequences of vector spaces V_1 ⊂ ··· ⊂ V_r ⊂ R^n with dim V_i = n_1 + ··· + n_i, i = 1, 2, ..., r. Fl(n_1, n_2, ..., n_r) is a smooth, connected, compact manifold.

We need a different expression of the flag manifold to tackle the subspace ICA problems; we consider the set of orthogonal direct sums of vector spaces

    V = V_1 ⊕ V_2 ⊕ ··· ⊕ V_r ⊂ R^n,   dim V_i = d_i,   dim V = Σ_{i=1}^r d_i = p ≤ n.

With the mapping V_i → ⊕_{j=1}^i V_j we can see that the set of all such V forms a manifold diffeomorphic to the original definition, so this is also a flag manifold, which we denote by Fl(n, d), where d = (d_1, ..., d_r). We represent a point on this manifold by an n × p orthogonal matrix W, i.e. W^T W = I_p, decomposed as W = [W_1, W_2, ..., W_r], W_i = (w_1^i, w_2^i, ..., w_{d_i}^i), where the columns w_k^i ∈ R^n, k = 1, ..., d_i, form an orthogonal basis of V_i. The orthogonal group O(n) acts transitively on the flag manifold Fl(n, d) by matrix multiplication:

    O(n) × Fl(n, d) ∋ (R, W) → RW ∈ Fl(n, d).

It is easily seen that the isotropy subgroup of O(n) at W is

    { [W, W_⊥] diag[R_1, R_2, ..., R_r, R_{r+1}] [W, W_⊥]^T },

where R_k ∈ O(d_k) (1 ≤ k ≤ r), R_{r+1} ∈ O(n − p), and W_⊥ is an arbitrary n × (n − p) matrix satisfying [W, W_⊥] ∈ O(n). Therefore

    Fl(n, d; R) ≅ O(n)/O(d_1) × ··· × O(d_r) × O(n − p).   (1)

Fl(n, d; R) is locally isomorphic¹ to St(n, p; R) as a homogeneous space when all d_i = 1 (1 ≤ i ≤ r), and it reduces to a Grassmann manifold if r = 1. It may well be said that the flag manifold is a generalization of the Stiefel and Grassmann manifolds in this sense.

(¹ A homogeneous space G/H is locally isomorphic to G′/H′ when the Lie algebras of G, H are isomorphic to those of G′, H′, respectively.)

We can obtain the Riemannian optimization method over a manifold by adapting ordinary optimization methods over Euclidean space to the manifold: first, the update direction is replaced by its Riemannian counterpart, which is geometric in the sense that it does not depend on the parametrization of the manifold; second, the current point is updated along a geodesic on the manifold, so that the updated points always stay on the manifold, which guarantees stability against deviation from the manifold:

    W_{k+1} = ϕ_M(W_k, −grad_W f(W_k), η),

where ϕ_M(W, V, t) denotes the geodesic on the manifold M starting from W ∈ M in the direction of V ∈ T_W M, such that ϕ_M(W, V, 0) = W and ϕ′_M(W, V, 0) = V. We derived in [8] the equation of a geodesic on a flag manifold with respect to the normal metric

    g_W^{Fl(n,d;R)}(V_1, V_2) = tr V_1^T (I − ½ W W^T) V_2,   (2)

where V_1, V_2 ∈ T_W Fl(n, d; R):

    ϕ^{Fl(n,d;R)}(W, V, t) = exp(t(D W^T − W D^T)) W,   where D = (I − ½ W W^T) V.

The natural gradient V = (V_1, ..., V_r) of a function f on Fl(n, d; R) at W with respect to g^{Fl(n,d;R)} is

    V_i = X_i − (W_i W_i^T X_i + Σ_{j≠i} W_j X_j^T W_i),

where X = (X_1, ..., X_r) denotes the Euclidean gradient of f at W, partitioned into the same blocks as W.
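To make the update concrete, the following Python sketch (our own illustration using NumPy/SciPy; the function name and the block-splitting details are not from the paper) carries out a single geodesic descent step on Fl(n, d; R), given the Euclidean gradient X of the cost f at W:

    import numpy as np
    from scipy.linalg import expm

    def flag_geodesic_step(W, X, dims, eta):
        """One geodesic gradient descent step on Fl(n, d; R).

        W    : n x p matrix with orthonormal columns, W = [W_1, ..., W_r]
        X    : n x p Euclidean gradient of the cost f at W, same block layout
        dims : (d_1, ..., d_r) subspace dimensions, sum(dims) = p
        eta  : step size
        """
        n, p = W.shape
        idx = np.cumsum((0,) + tuple(dims))
        Wb = [W[:, idx[i]:idx[i + 1]] for i in range(len(dims))]
        Xb = [X[:, idx[i]:idx[i + 1]] for i in range(len(dims))]

        # Natural gradient blocks: V_i = X_i - (W_i W_i^T X_i + sum_{j != i} W_j X_j^T W_i)
        Vb = []
        for i, (Wi, Xi) in enumerate(zip(Wb, Xb)):
            cross = sum(Wj @ Xj.T @ Wi
                        for j, (Wj, Xj) in enumerate(zip(Wb, Xb)) if j != i)
            Vb.append(Xi - (Wi @ Wi.T @ Xi + cross))
        V = np.hstack(Vb)

        # Geodesic step in the descent direction -V:
        # phi(W, V, t) = expm(t (D W^T - W D^T)) W, with D = (I - 0.5 W W^T) V
        D = (np.eye(n) - 0.5 * W @ W.T) @ (-V)
        return expm(eta * (D @ W.T - W @ D.T)) @ W

Since D W^T − W D^T is skew-symmetric, its matrix exponential is orthogonal, so the iterate stays on the manifold up to numerical round-off.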
3. COMPLEX ICA

We illustrate how a special class of optimization problems over the complex Stiefel manifold can be transformed into optimization problems over the flag manifold, thereby making the Riemannian optimization method over the flag manifold applicable to the complex ICA problem. Let us consider an optimization problem over the complex Stiefel manifold:

    F : St(n, p; C) → R,

where St(n, p; C) = { W = (w_1, ..., w_p) = W_ℜ + i W_ℑ ∈ C^{n×p} | W^H W = I_p } (H denotes the Hermitian transpose). We assume F is invariant under the transformation

    W = (w_1, ..., w_p) → (e^{iθ_1} w_1, ..., e^{iθ_p} w_p),   (3)

which is satisfied by the cost functions of signal processing tasks including complex ICA. Because the cost function F is real-valued, St(n, p; C) should be regarded as a real manifold rather than a complex manifold, for which we embed St(n, p; C) into R^{2n×2p} by the following map:

    τ : W = (w_1^ℜ + i w_1^ℑ, ..., w_p^ℜ + i w_p^ℑ) → W̃ =
        [ w_1^ℜ  −w_1^ℑ  w_2^ℜ  −w_2^ℑ  ···  w_p^ℜ  −w_p^ℑ
          w_1^ℑ   w_1^ℜ  w_2^ℑ   w_2^ℜ  ···  w_p^ℑ   w_p^ℜ ].   (4)

It turns out that the embedded manifold N = τ(St(n, p; C)) coincides with St(2n, 2p; R) ∩ T, where T = τ(C^{n×p}) is a subspace of R^{2n×2p}. Thus minimizing F over St(n, p; C) is transformed into minimizing f over N, where f(W̃) := F(W). Furthermore, the assumption on F gives N additional structure. The transformation (3) on St(n, p; C) induces the following transformation on N:

    W̃ → W̃ diag(R(θ_1), R(θ_2), ..., R(θ_p)),   (5)

where R(θ_i) = [cos θ_i  −sin θ_i; sin θ_i  cos θ_i], so the function f is also invariant under the transformation (5). Therefore, f can be interpreted as a function over a submanifold of the flag manifold²: N′ = Fl(2n, 2; R) ∩ T, where 2 = (2, ..., 2). In fact, N′ is a totally geodesic submanifold of Fl(2n, 2; R), that is, a geodesic on Fl(2n, 2; R) emanating from W̃ ∈ N′ in the direction of Ṽ ∈ T_{W̃} N′ is always contained in N′. This allows us to consider just Fl(2n, 2; R) instead of its submanifold N′. To summarize, minimizing F over St(n, p; C) can be solved by minimizing the function f over the submanifold N′ of Fl(2n, 2; R); to minimize f on N′, we have only to apply the Riemannian optimization method for Fl(2n, 2; R) to f.

(² Strictly speaking, Fl(2n, 2; R) should be replaced with SO(2n)/SO(2) × ··· × SO(2) × SO(2n − 2p) – the universal cover of Fl(2n, 2; R) – yet the two are locally isomorphic as homogeneous spaces, and we use Fl(2n, 2; R) by abuse of notation.)
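To make the embedding (4) concrete, here is a small Python sketch (our own illustration, not code from the paper) of the map τ, together with a check that it sends a point of St(n, p; C) to a point of St(2n, 2p; R):

    import numpy as np

    def tau(W):
        """Embed a complex n x p matrix into R^{2n x 2p} as in (4): each column
        w_k = w_k_re + i w_k_im becomes the pair of columns
        (w_k_re; w_k_im) and (-w_k_im; w_k_re)."""
        n, p = W.shape
        Wt = np.zeros((2 * n, 2 * p))
        for k in range(p):
            re, im = W[:, k].real, W[:, k].imag
            Wt[:n, 2 * k], Wt[n:, 2 * k] = re, im
            Wt[:n, 2 * k + 1], Wt[n:, 2 * k + 1] = -im, re
        return Wt

    # A random point on St(5, 3; C) maps to a point on St(10, 6; R).
    rng = np.random.default_rng(0)
    A = rng.standard_normal((5, 3)) + 1j * rng.standard_normal((5, 3))
    W, _ = np.linalg.qr(A)                    # W^H W = I_3
    assert np.allclose(tau(W).T @ tau(W), np.eye(6))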
To explore the behavior of the Riemannian geodesic gradient descent method on the complex Stiefel manifold, we performed a numerical complex ICA experiment. Let us assume we are given 9 signals x ∈ C^9 (Fig. 1(b)) which are complex-valued instantaneous linear mixtures of two independent QAM16 signals, two QAM4 signals, two PSK8 signals, and three complex-valued Gaussian noise signals. We assume we know in advance the number of noise signals. The task of complex ICA under this assumption is to recover only the non-noise signals y = (y_1, ..., y_4)^T so that y = W^T x. As a preprocessing stage, we first center the data and then whiten it by SVD. Thus the n × p demixing matrix W can be regarded as a point on the complex Stiefel manifold St(n, p; C), namely W^H W = I_p. As an objective function, we use a kurtosis-like higher-order statistic: F(W) = Σ_{i=1}^4 E[||y_i(t)||^4]. Then, by minimizing F(W) over St(n, p; C), we can solve the task.

We compared two algorithms. One is the Riemannian optimization method,

    W̃_{k+1} = ϕ^{Fl(2n,2;R)}(W̃_k, −grad_{W̃_k} f(W̃_k), η_k),

and the other is the standard gradient descent method followed by projection, W_{s+1} = pro(W_s − μ_s ∂f/∂W), where ∂f/∂W denotes ∂f/∂W_ℜ + i ∂f/∂W_ℑ, and pro denotes the projection onto St(n, p; C) by complex SVD. Both grad_{W̃_k} f(W̃_k) and ∂f/∂W are computed by substituting

    ∂||y_i||^4/∂w_i^ℜ = 2||y_i||^2 (y_i^* x + y_i x^*)   and   ∂||y_i||^4/∂w_i^ℑ = 2i||y_i||^2 (y_i^* x − y_i x^*).

Recall that we map St(n, p; C) to Fl(2n, 2; R) and update the matrices on Fl(2n, 2; R) using the correspondence (4) between W and W̃. After W̃ converges to W̃_∞, it is pulled back to St(n, p; C) to give a demixing matrix W_∞. The learning constants η_k, μ_s were chosen at each iteration based on the Armijo rule [10].

The separation result is shown in Fig. 1(c). The constellations of the source signals were recovered up to phase shifts. Both algorithms were tested over 100 trials. On each trial, a random nonsingular matrix was used to generate the data, a random unitary matrix was chosen as the initial demixing matrix, and we iterated for 200 steps. The plots in Fig. 1(d) show the average behavior of the two algorithms over the 100 trials. We observed that the Riemannian optimization method decreased the cost more rapidly than the standard gradient descent method followed by projection, particularly in the early stages of learning.

[Fig. 1. Complex ICA experiment: (a) source signals, (b) mixed signals, (c) recovered signals, (d) average value of the cost function versus number of iterations for the ordinary gradient method and the geodesic method.]
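For reference, here is a minimal sketch (our own; the variable names are assumptions, the data generation, whitening and Armijo line search are omitted, and we write the demixing with the Hermitian transpose) of the kurtosis-like contrast and of the SVD-based projection "pro" used by the baseline method:

    import numpy as np

    def contrast(W, X):
        """Kurtosis-like contrast F(W) = sum_i E[ |y_i|^4 ] on whitened data.

        W : n x p complex demixing matrix (a point on St(n, p; C))
        X : n x T matrix of whitened complex observations (one sample per column)
        """
        Y = W.conj().T @ X                       # recovered signals
        return np.sum(np.mean(np.abs(Y) ** 4, axis=1))

    def pro(A):
        """Project a full-rank complex matrix onto St(n, p; C) via the complex SVD:
        the closest matrix with orthonormal columns is U V^H."""
        U, _, Vh = np.linalg.svd(A, full_matrices=False)
        return U @ Vh

    # Baseline update: W <- pro(W - mu * dF/dW), with dF/dW assembled from the
    # partial derivatives given above and mu chosen by the Armijo rule.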
4. ISA AND SUBSPACE IDENTIFICATION

In this section we examine the issue of local minima in ISA and the use of Markov chain Monte Carlo (MCMC) methods as a possible solution. These local minima are best observed in artificial problems where the correct solution is known and suboptimal solutions are easily detected. Hence, the experiments described here were conducted using data drawn from independent subspaces with spherically symmetric multivariate Student's t distributions. The multivariate t distribution is a convenient choice because it is easy to sample from and analytically tractable. The degrees-of-freedom parameter was set to 3, producing a moderately heavy-tailed distribution.

A maximum-likelihood (ML) estimator for the ISA system was constructed using the multivariate Student's t distribution as the prior within each subspace and the (negative) log-likelihood of the resulting model as the cost function; this was minimized using the geodesic gradient descent method. The algorithm was tested in multiple runs on a number of 12-dimensional problems composed of, respectively, 2-, 3-, 4-, and 6-dimensional subspaces. For each run, a random sample was drawn from the source distribution, the mixing matrix was set to the identity, and the algorithm was initialized with a random orthogonal matrix. The correct product of multivariate Student's t distributions was used as the prior; that is, the number and dimensionality of the subspaces were correct in each run.

It was observed that in many cases the system would become stuck at a non-optimal, non-subspace-separating solution, an example of which is shown in Fig. 2(a). In these cases, the output components are mixtures of sources from different source subspaces. One solution to this problem is to allow the system to swap basis vectors at random with a probability related to the resulting change in the likelihood. Specifically, we use a Metropolis-Hastings methodology: a swap is proposed by selecting two columns of the basis matrix at random; this is accepted unconditionally if the likelihood is increased, and with probability e^{−β(t)ΔL} otherwise, where ΔL is the decrease in the likelihood and β(t) is a time-varying inverse temperature parameter. Since we are aiming for an ML solution, the temperature is gradually decreased as the algorithm progresses; that is, β is increased linearly from 20 to 60 over 200 steps.

These MCMC updates to the basis matrix are interspersed with geodesic gradient descent steps, which quickly drive the system to a local minimum between swaps. The gradient updates are disabled once the change in the likelihood drops below a small threshold (10^{−8} in these experiments), and then re-enabled as soon as a swap occurs. This reduces the amount of computation expended on relatively ineffectual gradient steps, and explains why some of the traces in Fig. 2(c) terminate at fewer than 200 iterations: in these cases, fewer than 200 gradient steps were required to reach the correct solution. The results are summarized in Fig. 2(b), which shows that the hybrid geodesic gradient-MCMC algorithm is more likely to reach the correct solution to the problem than the purely deterministic gradient method.

[Fig. 2. (a) Example of a sub-optimal basis matrix reached by gradient descent in a problem with 3 subspaces of dimension 4 (white = zero, black = ±1). (b) Distribution of final Amari error over 100 runs of deterministic gradient descent (black) and 500 runs of the MCMC-gradient hybrid method (grey). (c) Evolution of the cost function for multiple runs of the deterministic (black lines) and MCMC (grey lines) algorithms in a problem with 3 × 4-dimensional subspaces.]
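The swap step can be summarized in a few lines. The sketch below is our own illustration: log_likelihood stands for the ISA model's log-likelihood under the multivariate Student-t priors, and we take ΔL to be the decrease in log-likelihood, which is the usual Metropolis choice.

    import numpy as np

    def mh_swap(W, log_likelihood, beta, rng):
        """Propose swapping two random columns of the basis matrix W; accept
        unconditionally if the likelihood increases, otherwise with
        probability exp(-beta * decrease)."""
        i, j = rng.choice(W.shape[1], size=2, replace=False)
        W_prop = W.copy()
        W_prop[:, [i, j]] = W_prop[:, [j, i]]                 # exchange basis vectors
        delta = log_likelihood(W) - log_likelihood(W_prop)    # decrease in log-likelihood
        if delta <= 0 or rng.random() < np.exp(-beta * delta):
            return W_prop, True
        return W, False

    def beta_schedule(t, n_steps=200, beta_start=20.0, beta_end=60.0):
        """Inverse temperature increased linearly from 20 to 60 over 200 steps."""
        return beta_start + (beta_end - beta_start) * min(t, n_steps) / n_steps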
5. CONCLUSIONS

We have demonstrated that the flag manifold is useful for tackling subspace ICA problems. The aim of this paper was not to pursue the best learning algorithm for a particular subspace ICA problem; rather, the emphasis was to illustrate how the flag manifold naturally arises from subspace ICA problems, and how we can exploit the geometric structure of the flag manifold to adapt existing optimization algorithms to these problems. Though we have concentrated on the gradient descent method in this paper, other optimization methods, such as the fixed-point method, could also be formulated over the flag manifold.

6. REFERENCES

[1] P.-A. Absil, R. Mahony, and R. Sepulchre, Riemannian geometry of Grassmann manifolds with a view on algorithmic computation, Acta Applicandae Mathematicae, 80(2), pp. 199-220, 2004.
[2] S. Amari, Natural gradient works efficiently in learning, Neural Computation, 10, pp. 251-276, 1998.
[3] A. Edelman, T. A. Arias, and S. T. Smith, The geometry of algorithms with orthogonality constraints, SIAM Journal on Matrix Analysis and Applications, 20(2), pp. 303-353, 1998.
[4] S. Fiori, Quasi-geodesic neural learning algorithms over the orthogonal group: a tutorial, Journal of Machine Learning Research, 6, pp. 743-781, 2005.
[5] A. Hyvärinen and P. O. Hoyer, Emergence of phase and shift invariant features by decomposition of natural images into independent feature subspaces, Neural Computation, 12(7), pp. 1705-1720, 2000.
[6] L. J. Mason and N. M. J. Woodhouse, Integrability, Self-duality, and Twistor Theory, Oxford University Press, 1996.
[7] Y. Nishimori, Learning algorithm for independent component analysis by geodesic flows on orthogonal group, Proceedings of the International Joint Conference on Neural Networks (IJCNN 1999), 2, pp. 1625-1647, 1999.
[8] Y. Nishimori and S. Akaho, Learning algorithms utilizing quasi-geodesic flows on the Stiefel manifold, Neurocomputing, 67, pp. 106-135, 2005.
[9] Y. Nishimori, S. Akaho, and M. D. Plumbley, Riemannian optimization method on the flag manifold for independent subspace analysis, Proceedings of ICA 2006, pp. 295-302, 2006.
[10] E. Polak, Optimization: Algorithms and Consistent Approximations, Springer-Verlag, 1997.