SIGNAL
PROCESSING
ELSEVIER
Signal Processing 36 (1994) 287-314
Independent component analysis, A new concept?*
Pierre Comon
THOMSON-SINTRA, Parc Sophia Antipolis, BP 138, F-06561 Valbonne Cedex, France
Received 24 August 1992
Abstract
The independent component analysis (ICA) of a random vector consists of searching for a linear transformation that
minimizes the statistical dependence between its components. In order to define suitable search criteria, the expansion of
mutual information is utilized as a function of cumulants of increasing orders. An efficient algorithm is proposed, which
allows the computation of the ICA of a data matrix within a polynomial time. The concept of ICA may actually be seen as
an extension of the principal component analysis (PCA), which can only impose independence up to the second order
and, consequently, defines directions that are orthogonal. Potential applications of ICA include data analysis and
compression, Bayesian detection, localization of sources, and blind identification and deconvolution.
*This work was supported in part by the DRET.
Key words: Statistical independence; Random variables; Mixture; Entropy; Information; High-order statistics; Source
separation; Data analysis; Noise reduction; Blind identification; Array processing; Principal components; Independent
components
1. Introduction
1.1. Problem description

This paper attempts to provide a precise definition of ICA within an applicable mathematical framework. It is envisaged that this definition will provide a baseline for further development and application of the ICA concept. Assume the following linear statistical model:

y = Mx + v,   (1.1a)

where x, y and v are random vectors with values in ℝ or ℂ and with zero mean and finite covariance, M is a rectangular matrix with at most as many columns as rows, and vector x has statistically independent components. The problem set by ICA may be summarized as follows. Given T realizations of y, it is desired to estimate both the matrix M and the corresponding realizations of x. However, because of the presence of the noise v, it is in general impossible to recover x exactly. Since the noise v is assumed here to have an unknown distribution, it can only be treated as a nuisance, and the ICA cannot be devised for the noisy model above. Instead, it will be assumed that

y = Fz,   (1.1b)

where z is a random vector whose components maximize a 'contrast' function. This terminology is consistent with [31], where Gassiat introduced contrast functions in the scalar case for purposes of blind deconvolution. As defined subsequently, the contrast of a vector z is maximum when its components are statistically independent. The qualifiers 'blind' or 'myopic' are often used when only the outputs of the system considered are observed; in this framework, we are thus dealing with the problem of blind identification of a linear static system.

1.2. Organization of the paper

In this section, related works, possible applications and preliminary observations regarding the problem statement are surveyed. In Section 2, general results on statistical independence are stated, and are then utilized in Section 3 to derive optimization criteria. The properties of these criteria are also investigated in Section 3. Section 4 is dedicated to the design of a practical algorithm, delivering a solution within a number of steps that is a polynomial function of the dimension of the vector y. Simulation results are then presented in Section 5. For the sake of clarity, most proofs, as well as a discussion about complex data, have been deferred to the appendix at the end of the paper.
1.3. Related works
The calculation of ICA was discussed in several
recent papers [8, 16, 30, 36, 37, 61], where the
problem was given various names. For instance, the
terminology 'sources separation problem' has often
been coined. Investigations reveal that the problem
of 'independent component analysis' was actually
first proposed and so named by Herault and Jutten
around 1986 because of its similarities with principal component analysis (PCA). This terminology is
retained in the paper.
Herault and Jutten seem to be the first (around
1983) to have addressed the problem of ICA. Several papers by these authors propose an iterative
real-time algorithm based on a neuro-mimetic
architecture. The authors deserve merit in their
achievement of an algorithmic solution when no
theoretical explanation was available at that time.
Nevertheless, their solution can show lack of convergence in a number of cases. Refer to [37] and
other papers in the same issue, and to [27]. In their
framework, high-order statistics were not introduced explicitly. It is less well known that Bar-Ness [2]
independently proposed another approach at the
same time that presented rather similar qualities
and drawbacks.
Giannakis et al. [35] addressed the issue of
identifiability of ICA in 1987 in a somewhat different framework, using third-order cumulants. However, the resulting algorithm required an exhaustive
search. Lacoume and Ruiz [42] also sketched
a mathematical approach to the problem using
high-order statistics; in [29, 30], Gaeta and
Lacoume proposed to estimate the mixing matrix
M by maximum likelihood approach, where an
exhaustive search was also necessary to determine
the absolute maximum of an objective function of
many variables. Thus, from a practical point of
view, this was realistic only in the two-dimensional
case.
Cardoso focused on the algebraic properties of
the fourth-order cumulants, and interpreted them
as linear operators acting on matrices. A simple
case is the action on the identity yielding a cumulant matrix whose diagonalization gives an estimate of ICA [8]. When the action is defined on
a set of matrices, one obtains several cumulant
matrices whose joint diagonalization provides more
robust estimates [55]. This is equivalent to optimizing a cumulant-based criterion [55], and is then
similar in spirit to the approach presented herein.
Other algebraic approaches, using only fourth-order cumulants, have also been investigated [9, 10].
In [36], Inouye proposed a solution for the separation of two sources, whereas at the same time
Comon [13] proposed another solution for N ≥ 2.
Together with Cardoso's solution, these were
among the first direct (within polynomial time)
solutions to the ICA problem. In [61], Inouye and
his colleagues derived identifiability conditions for
the ICA problem. Their Theorem 2 may be seen to
have connections to earlier works [19]. On the
other hand, our Theorem 11 (also presented in [15,
16]) only requires pairwise independence, which is
generally weaker than the conditions required by
[61].

In [24], Fety addressed the problem of identifying the dynamic model of the form y(t) = Fz(t), where t is a time index. In general, the identification of these models can be completed with the help of second-order moments only, and will be called the signal separation problem. In some cases, identifiability conditions are not fulfilled, e.g. when the processes z_i(t) have spectra proportional to each other, so that the signal separation problem degenerates into the ICA problem, where the time coherency is ignored. Independently, the signal separation problem has also been addressed more recently by Tong et al. [60].

This paper is an extended version of the conference paper presented at the Chamrousse workshop in July 1991 [16]. Although most of the results were already tackled in [16], the proofs were considerably shortened, or not stated at all, for reasons of space. They are now detailed here, within the same framework. Furthermore, some new results are stated, and complementary simulations are presented. Among other things, it is sought to give a sound justification to the choice of the objective function to be maximized. The results obtained turn out to be consistent with what was heuristically proposed in [42]. Independently of [29, 30], the author proposed to approximate the probability density by its Edgeworth expansion [16], which has advantages over the Gram-Charlier expansion. Contrary to [29, 30], the hypothesis that odd cumulants vanish is not assumed here; Gaeta emphasized in [29] the consistency of the theoretical results obtained in both approaches when third-order cumulants are null.

It may be considered, however, that a key contribution of this paper consists of providing a practical algorithm that does not require an exhaustive search, but can be executed within a polynomial time, even in the presence of non-Gaussian noise. The complexity aspect is of great interest when the dimension of y is large (at least greater than 2). On the other hand, the robustness in the presence of non-Gaussian noise has apparently not been previously investigated. Various implementations of this algorithm are discussed in [12, 14, 15]; the present improved version should, however, be preferred. A real-time version can also be implemented on a parallel architecture [14].
1.4. Applications
Let us spend a few lines on related applications.
If x is a vector formed of N successive time samples
of a white process, and if M is Toeplitz triangular,
then model (1.1) represents nothing else but a deconvolution problem. The first column of M contains the successive samples of the impulse response
of the corresponding causal filter. In the FIR case,
M is also banded. Such blind identification and/or
deconvolution problems have been addressed in
various ways in [5, 21, 45, 46, 50, 52, 62, 64]. Note
that if M is not triangular, the filter is allowed to be
non-causal. Moreover, if M is not Toeplitz, the
filter is allowed to be non-stationary. Thus, blind
deconvolution may be viewed as a constrained ICA
problem, which is not dealt with in this paper. In
fact, in the present study, no structure for matrix
M will be assumed, so that blind deconvolution or
other constrained ICAs cannot be efficiently carried out with the help of the algorithm described in
Section 4, at least in its present form.
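To make the deconvolution remark above concrete, here is a minimal Python sketch (added for illustration, not from the paper) showing that a Toeplitz, lower-triangular and banded mixing matrix M built from a causal FIR impulse response turns the static model y = Mx into a convolution of the white input samples; the impulse response h used here is a hypothetical example.

```python
import numpy as np

# Hypothetical causal FIR impulse response (length 3).
h = np.array([1.0, 0.5, -0.2])
N = 8  # number of successive time samples gathered in x

# Toeplitz, lower-triangular, banded mixing matrix whose first column is the impulse response.
M = np.zeros((N, N))
for k, hk in enumerate(h):
    M += hk * np.eye(N, k=-k)   # place h[k] on the k-th subdiagonal

x = np.random.uniform(-np.sqrt(3), np.sqrt(3), N)  # white, non-Gaussian input samples
y = M @ x

# y = Mx coincides with the causal convolution of x by h, truncated to N samples.
assert np.allclose(y, np.convolve(h, x)[:N])
```

With a non-triangular or non-Toeplitz M, the same construction would correspond to a non-causal or non-stationary filter, as stated in the text.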
In antenna array processing, ICA might be utilized in at least two instances: firstly, the estimation of
radiating sources from unknown arrays (necessarily
without localization). It has already been used in
radar experimentation for example [20]. In the
same spirit, the use of ICA may also have some
interesting features for jammer rejection or noise
reduction. Secondly, ICA might be useful in a twostage localization procedure, if arrays are perturbed
or ill-calibrated. Additional experiments need,
however, to be completed to confirm the advantages of ICA under such circumstances. On the
other hand, the use of high-order statistics (HOS) in
this area is of course not restricted to ICA. Regarding methods for source localization, some references may be found in [8, 9, 26, 32, 47, 53, 59]. It
can be expected that a two-stage procedure, consisting of first computing an ICA, and using the
knowledge of the array manifold to perform a localization in the second stage, would be more robust
against calibration uncertainties. But this needs to
be experimented with.
Further, ICA can be utilized in the identification
of multichannel ARMA processes when the input is
not observed and, in particular, for estimating the
first coefficient of the model [12, 15]. In general, the
first matrix coefficient is in fact either assumed to be
known [33, 56, 62, 63], or of known form, e.g.
triangular [35]. Note that the authors of [35] were
the first to worry about the estimation of the first
coefficient of non-monic multichannel models.
A survey paper is in preparation on this subject
[58].
On the other hand, ICA can be used as a data
preprocessing tool before Bayesian detection and
classification. In fact, by a change of coordinates,
the density of multichannel data may be approximated by a product of marginal densities, allowing
a density estimation with much shorter observations [17]. Other possible applications of ICA can
be found in areas where PCA is of interest. The
application to the analysis of chaos is perhaps one
of the most exotic [25].
1.5. Preliminary observations
Suppose x is a non-degenerate (i.e. non-deterministic) Gaussian random vector of dimension
p with statistically independent components, and
z is a vector of dimension N ≥ p defined as z = Cx, where C is an N × p full rank matrix. Then, if z has independent components, matrix C can easily be shown (see Appendix A.1) to be of the form

C = ΛQΔ,   (1.2)

where both Λ and Δ are real square diagonal matrices, and Q is N × p and satisfies Q*Q = I. If N = p, Λ is invertible and Q is unitary; this demonstrates that if both x and z have a unit covariance matrix, then C may be any unitary matrix. Theorem 11 shows that when x has at most one Gaussian component, this indetermination reduces to a matrix of the form DP, where D is a diagonal matrix with entries of unit modulus and P is a permutation. This latter indetermination cannot be reduced further without additional assumptions.
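As a simple numerical illustration of this observation (added here; not part of the original text), the following Python sketch checks that rotating two independent Gaussian variables leaves their components independent, as seen through a fourth-order cross-cumulant that stays at zero, whereas the same rotation applied to independent uniform variables creates higher-order dependence; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 200_000
theta = np.pi / 6
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # orthogonal mixing

def cross_cumulant_1122(z1, z2):
    """Cum{z1, z1, z2, z2} for zero-mean, unit-variance, uncorrelated components."""
    return np.mean(z1**2 * z2**2) - 1.0

x_gauss = rng.standard_normal((2, T))
x_unif  = rng.uniform(-np.sqrt(3), np.sqrt(3), (2, T))   # unit variance, kurtosis -1.2

for name, x in [("Gaussian", x_gauss), ("uniform", x_unif)]:
    z = Q @ x
    print(name, round(cross_cumulant_1122(z[0], z[1]), 3))
# Gaussian sources: ~0 (components remain independent, so ICA is undetermined).
# Uniform sources: clearly non-zero (the rotation destroys independence beyond order 2).
```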
DEFINITION 1. The ICA of a random vector y of size N with finite covariance V_y is a pair {F, Δ} of matrices such that
(a) the covariance factorizes into V_y = FΔ²F*, where Δ is diagonal real positive and F is full column rank p;
(b) the observation can be written as y = Fz, where z is a p × 1 random vector with covariance Δ² and whose components are 'the most independent possible', in the sense of the maximization of a given 'contrast function' that will be subsequently defined.

In this definition, the asterisk denotes transposition, and complex conjugation whenever the quantity is complex (mainly the real case will be considered in this paper). The purpose of the next section will be to define such contrast functions. Since multiplying random variables by non-zero scalar factors or changing their order does not affect their statistical independence, ICA actually defines an equivalence class of decompositions rather than a single one, so that the property below holds. The definition of contrast functions will thus have to take this indeterminacy into account.

PROPERTY 2. If a pair {F, Δ} is an ICA of a random vector y, then so is the pair {F', Δ'} with

F' = FΛDP  and  Δ' = P*Λ⁻¹ΔP,

where Λ is a p × p invertible diagonal real positive scaling matrix, D is a p × p diagonal matrix with entries of unit modulus, and P is a p × p permutation. For conciseness, we shall sometimes refer to the product ΛD as the diagonal invertible 'scale' matrix.

For the sake of clarity, let us also recall the definition of PCA below.

DEFINITION 3. The PCA of a random vector y of size N with finite covariance V_y is a pair {F, Δ} of matrices such that
(a) the covariance factorizes into V_y = FΔ²F*, where Δ is diagonal real positive and F is full column rank p;
(b) F is an N × p matrix whose columns are orthogonal to each other (i.e. F*F is diagonal).

Note that two different pairs can be PCAs of the same random variable y, but they are also related by Property 2. Thus, PCA has exactly the same inherent indeterminations as ICA, so that we may assume the same additional arbitrary constraints [15]. In fact, for computational purposes, it is convenient to define a unique representative of these equivalence classes.

DEFINITION 4. The uniqueness of ICA as well as of PCA requires imposing three additional constraints. We have chosen to impose the following:
(c) the columns of F have unit norm;
(d) the entries of Δ are sorted in decreasing order;
(e) the entry of largest modulus in each column of F is positive real.

Each of the constraints in Definition 4 removes one of the degrees of freedom exhibited in Property 2. More precisely, (c), (d) and (e) determine Λ, P and D, respectively. With constraint (c), the matrix F defined in the PCA decomposition (Definition 3) is unitary. As a consequence, ICA (as well as PCA) defines a so-called p × 1 'source vector', z, satisfying

y = Fz.   (1.3)

As we shall see with Theorem 11, the uniqueness of ICA requires z to have at most one Gaussian component.

2. Statements related to statistical independence

In this section, an appropriate contrast criterion is proposed first. Then two results are stated. It is proved that ICA is uniquely defined if at most one component of x is Gaussian, and then it is shown why pairwise independence is a sufficient measure of statistical independence in our problem.

The results presented in this paper hold true either for real or complex variables. However, some derivations would become more complicated if carried out in the complex case. Therefore, only real variables will be considered most of the time for the sake of clarity, and extensions to the complex case will be mainly addressed in the appendix. In the remaining, plain lowercase (resp. uppercase) letters denote, in general, scalar quantities (resp. tables with at least two indices, namely matrices or tensors), whereas boldface lowercase letters denote column vectors with values in ℝᴺ.
2.1. Mutual information
Let x be a random variable with values in ℝᴺ and denote by p_x(u) its probability density function (pdf). Vector x has mutually independent components if and only if

p_x(u) = ∏_{i=1}^{N} p_{x_i}(u_i).   (2.1)

So a natural way of checking whether x has independent components is to measure a distance between both sides of (2.1):

δ(p_x, ∏_i p_{x_i}).   (2.2)

In statistics, the large class of f-divergences is of key importance among the possible distance measures available [4]. In these measures the roles played by both densities are not always symmetric, so that we are not dealing with proper distances. For instance, the Kullback divergence is defined as

δ(p_x, p_z) = ∫ p_x(u) log [p_x(u)/p_z(u)] du.   (2.3)

Recall that the Kullback divergence satisfies

δ(p_x, p_z) ≥ 0,   (2.4)

with equality if and only if p_x(u) = p_z(u) almost everywhere. This property is due to the convexity of the logarithm [6]. Now, if we look at the form of the Kullback divergence of (2.2), we obtain precisely the average mutual information of x:

I(p_x) = ∫ p_x(u) log [p_x(u) / ∏_i p_{x_i}(u_i)] du.   (2.5)

From (2.4), the mutual information vanishes if and only if the variables x_i are mutually independent, and is strictly positive otherwise. We shall see that this will be a good candidate as a contrast function.
2.2. Standardization
Now denote by Fᴺ the space of random variables with values in ℝᴺ, by Eᵣᴺ the Euclidean subspace of Fᴺ spanned by variables with finite moments up to order r, for any r ≥ 2, provided with the scalar product ⟨x, y⟩ = E[x*y], and by Ẽ₂ᴺ the subset of E₂ᴺ of variables having an invertible covariance. For instance, we have E₂ᴺ ⊂ Fᴺ.

The goal of the standardization is to transform a random vector z of E₂ᴺ into another, z̃, that has a unit covariance. But if the covariance V of vector z is not invertible, it is necessary to also perform a projection of z onto the range space of V. The standardization procedure proposed here fulfils these two tasks, and may be viewed as a mere PCA.

Let z be a zero-mean random variable of E₂ᴺ, V its covariance matrix and L a matrix such that V = LL*. Matrix L could be defined by any square-root decomposition, such as the Cholesky factorization, or a decomposition based on the eigenvalue decomposition (EVD), for instance. But our preference goes to the EVD because it allows one to handle easily the case of singular or ill-conditioned covariance matrices by projection. More precisely, if

V = UΔ²U*   (2.6)

denotes the EVD of V, where Δ is full rank and U possibly rectangular, we define the square root of V as L = UΔ. Then we can define the standardized variable associated with z:

z̃ = Δ⁻¹U*z.   (2.7)

Note that (z̃)_i ≠ z̃_i: the latter is merely the variable z_i normalized by its variance. In the following, we shall only talk about (z̃)_i, the ith component of the standardized variable z̃, so avoiding any possible confusion between these two similar-looking notations.

Projection and standardization are thus performed all at once if z is not in Ẽ₂ᴺ. Algorithm 18 recommends resorting to the singular value decomposition (SVD) instead of the EVD, in order to avoid the explicit calculation of V and to improve numerical accuracy.

Without restricting the generality, we may thus consider in the remaining that the observed variable belongs to Ẽ₂ᴺ; we additionally assume that N > 1, since otherwise the observation is scalar and the problem does not exist.
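A minimal numerical sketch of this standardization step (illustrative Python following (2.6)-(2.7) and the SVD variant mentioned above; it is not the author's code, and names are assumptions):

```python
import numpy as np

def standardize(Y, tol=1e-10):
    """Project a zero-mean N x T data matrix onto the range of its covariance and whiten it,
    following (2.6)-(2.7). The SVD of Y replaces the EVD of V, as recommended for Algorithm 18."""
    N, T = Y.shape
    U, s, _ = np.linalg.svd(Y, full_matrices=False)   # Y = U diag(s) Vt
    p = int(np.sum(s > tol * s[0]))                   # numerical rank
    U, s = U[:, :p], s[:p]
    d = s / np.sqrt(T)                                # diagonal of Delta in (2.6)
    Z = (U / d).T @ Y                                 # standardized data, p x T
    L = U * d                                         # square root of V, so that V ~= L L*
    return Z, L

# Example with a rank-deficient covariance (third row of M is the sum of the first two).
rng = np.random.default_rng(1)
X = rng.uniform(-np.sqrt(3), np.sqrt(3), (2, 10_000))
M = np.array([[1.0, 0.5], [0.3, 2.0], [1.3, 2.5]])
Z, L = standardize(M @ X)
print(np.round(Z @ Z.T / X.shape[1], 2))              # ~ identity of size p = 2
```

Projection and whitening are obtained in one pass, exactly as described in the text.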
2.3. Negentropy, a measure of distance to normality
Define the differential entropy of x as

S(p_x) = −∫ p_x(u) log p_x(u) du.   (2.8)
Recall that differential entropy is not the
limit of Shannon's entropy defined for discrete
variables; it is not invariant by an invertible
change of coordinates as the entropy was, but only
by orthogonal transforms. Yet, it is the usual practice to still call it entropy, in short. Entropy enjoys
very privileged properties as emphasized in [54],
and we shall point out in Section 2.3 that information (2.5) may also be written as a difference of
entropies.
As a first example, if z is in E₂ᴺ, the entropies of z and z̃ are related by [16]

S(p_z̃) = S(p_z) − ½ log det V.   (2.9)
If x is zero-mean Gaussian, its pdf will be referred to by the notation φ_x(u), with

φ_x(u) = (2π)^{−N/2} |V|^{−1/2} exp{−u*V⁻¹u/2}.   (2.10)
Among the densities of E₂ᴺ having a given covariance matrix V, the Gaussian density is the one which has the largest entropy. This well-known proposition says that

S(φ_x) ≥ S(p_y),  ∀x, y ∈ E₂ᴺ with the same covariance V,   (2.11)

with equality iff φ_x(u) = p_y(u) almost everywhere. The entropy obtained in the case of equality is

S(φ_x) = ½ [N + N log(2π) + log det V].   (2.12)
For densities in E₂ᴺ, one defines the negentropy as

J(p_x) = S(φ_x) − S(p_x),   (2.13)

where φ_x stands for the Gaussian density with the same mean and variance as p_x. As shown in Appendix A.2, negentropy is always positive, is invariant by any linear invertible change of coordinates, and vanishes if and only if p_x is Gaussian. From (2.5) and (2.13), the mutual information may be written as

I(p_x) = J(p_x) − Σ_{i=1}^{N} J(p_{x_i}) + ½ log (∏_{i=1}^{N} V_ii / det V),   (2.14)

where V denotes the variance of x (the proof is deferred to Appendix A.3). This relation gives a means of approximating the mutual information, provided we are able to approximate the negentropy about zero, which amounts to expanding the density p_x in the neighbourhood of φ_x. This will be the starting point of Section 3.

For the sake of convenience, the transform F defined in Definition 1 will be sought in two steps. In the first step, a transform will cancel the last term of (2.14). It can be shown that this is equivalent to standardizing the data (see Appendix A.4). This step will be performed by resorting only to second-order moments of y. Then, in the second step, an orthogonal transform will be computed so that the second term on the right-hand side of (2.14) is minimized while the two others remain constant. In this step, resorting to higher-order cumulants is necessary.
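To make (2.14) concrete, note that for a Gaussian vector the negentropies vanish and the mutual information reduces to the last term alone. The short check below (an illustration added here, not from the paper) verifies this against the classical closed form −½ log(1 − ρ²) for a bivariate Gaussian with correlation ρ.

```python
import numpy as np

rho = 0.7
V = np.array([[1.0, rho],
              [rho, 1.0]])                 # covariance of a bivariate Gaussian

# Last term of (2.14): (1/2) log( prod_i V_ii / det V ).
mi_from_214 = 0.5 * np.log(np.prod(np.diag(V)) / np.linalg.det(V))

# Classical mutual information of a bivariate Gaussian with correlation rho.
mi_closed_form = -0.5 * np.log(1.0 - rho**2)

print(mi_from_214, mi_closed_form)         # both print the same value (about 0.34 nats)
```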
2.4. Measures of statistical dependence
We have shown that both the Gaussian feature and the mutual independence can be characterized with the help of negentropy. Yet, these remarks justify only in part the use of (2.14) as an optimization criterion in our problem. In fact, from Property 2, this criterion should meet the requirements given below.

DEFINITION 5. A contrast is a mapping Ψ from the set of densities {p_x, x ∈ Fᴺ} to ℝ satisfying the following three requirements:
- Ψ(p_x) does not change if the components x_i are permuted: Ψ(p_{Px}) = Ψ(p_x), ∀P permutation;
- Ψ is invariant by 'scale' change, that is, Ψ(p_{Λx}) = Ψ(p_x), ∀Λ diagonal invertible;
- if x has independent components, then Ψ(p_{Ax}) ≤ Ψ(p_x), ∀A invertible.
DEFINITION 6. A contrast will be said to be discriminating over a set E if the equality Ψ(p_{Ax}) = Ψ(p_x) holds only when A is of the form ΛP, as defined in Property 2, when x is a random vector of E having independent components.
Now we are in a position to propose a contrast
criterion.
THEOREM 7. The following mapping is a contrast over E₂ᴺ:

Ψ(p_x) = −I(p_x).
See the appendix for a proof. The criterion proposed in Theorem 7 is consequently admissible for
ICA computation. This theoretical criterion, involving a generally unknown density, will be made
usable by approximations in Section 3. Regarding
computational loads, the calculation of ICA may
still be very heavy even after approximations, and
we now turn to a theorem that theoretically explains why the practical algorithm designed in
Section 4, that proceeds pairwise, indeed works.
The following two lemmas are utilized in the
proof of Theorem 10, and are reproduced below
because of their interesting content. Refer to
[18, 22] for details regarding Lemmas 8 and 9,
respectively.
LEMMA 8 (Lemma of Marcinkiewicz-Dugué (1951)). The function φ(u) = e^{P(u)}, where P(u) is a polynomial of degree m, can be a characteristic function only if m ≤ 2.

LEMMA 9 (Cramér's lemma (1936)). If {x_i, 1 ≤ i ≤ N} are independent random variables and if X = Σ_{i=1}^{N} a_i x_i is Gaussian, then all the variables x_i for which a_i ≠ 0 are Gaussian.
Utilizing these lemmas, it is possible to obtain
the following theorem, which can be seen to be
a variant of Darmois's theorem (see Appendix A.7).
T H E O R E M 10. Let x and z be two random vectors
such that z = Bx, B being a given rectangular
matrix. Suppose additionally that x has independent
components and that z has pairwise independent corn-
ponents. I f B has two non-zero entries in the same
column j, then xj is either Gaussian or deterministic.
Now we are in a position to state a theorem, from
which two important corollaries can be deduced
(see Appendices A.7-A.9 for proofs).
T H E O R E M 11. Let x be a vector with independent
components, of which at most one is Gaussian, and
whose densities are not reduced to a point-like mass.
Let C be an orthogonal N x N matrix and z the
vector z = Cx. Then the following three properties
are equivalent:
(i) The components zi are pairwise independent.
(ii) The components zi are mutually independent.
(iii) C = ΛP, Λ diagonal, P permutation.
C O R O L L A R Y 12. The contrast in Theorem 7 is
discriminating over the set of random vectors having
at most one Gaussian component, in the sense of
Definition 6.
COROLLARY 13 (Identifiability). Let no noise be present in model (1.1), and define y = Mx and y = Fz, x being a random variable in some set E of E₂ᴺ satisfying the requirements of Theorem 11. Then, if Ψ is discriminating over E, Ψ(p_z) = Ψ(p_x) if and only if F = MΛP, where Λ is an invertible diagonal matrix and P a permutation.
This last corollary is actually stating identifiability conditions of the noiseless problem. It shows in
particular that for discriminating contrasts, the indeterminacy is minimal, i.e. is reduced to Property
2; and from Corollary 12, this applies to the contrast in Theorem 7. We recognize in Corollary 13
some statements already emphasized in blind deconvolution problems [21, 52, 64], namely that
a non-Gaussian random process can be recovered
after convolution by a linear time-invariant stable
unknown filter only up to a constant delay and
a constant phase shift, which may be seen as the
analogues of our indeterminations D and P in
Property 2, respectively. For processes with unit
variance, we find indeed that M = F D P in the corollary above.
3. Optimization criteria
3.1. Approximation of the mutual information
Suppose that we observe ỹ and that we look for an orthogonal matrix Q maximizing the contrast

Ψ(p_z̃) = −I(p_z̃),  where z̃ = Qỹ.   (3.1)
In practice, the densities p~ and py are not known, so
that criterion (3.1) cannot be directly utilized. The
aim of this section is to express the contrast in
Theorem 7 as a function of the standardized cumulants (of orders 3 and 4), which are quantities more
easily accessible. The expression of entropy and
negentropy in the scalar case will be first briefly
derived. We start with the Edgeworth expansion of
type A of a density. A central limit theorem says
that if z is a sum of m independent random variables with finite cumulants, then the ith-order
cumulant of z is of order κ_i ~ m^{(2−i)/2}.
This theorem can be traced back to 1928 and is
attributed to Cramér [65]. Referring to [39, p. 176,
formula 6.49] or to [1], the Edgeworth expansion
of the pdf of z up to order 4 about its best Gaussian
approximate (here with zero-mean and unit variance) is given by
p_z(u)/φ(u) − 1 = (1/3!) κ₃ h₃(u) + (1/4!) κ₄ h₄(u) + (10/6!) κ₃² h₆(u)
  + (1/5!) κ₅ h₅(u) + (35/7!) κ₃κ₄ h₇(u) + (280/9!) κ₃³ h₉(u)
  + (1/6!) κ₆ h₆(u) + (56/8!) κ₃κ₅ h₈(u) + (35/8!) κ₄² h₈(u)
  + (2100/10!) κ₃²κ₄ h₁₀(u) + (15400/12!) κ₃⁴ h₁₂(u) + o(m⁻²).   (3.2)

In this expression, κ_i denotes the cumulant of order i of the standardized scalar variable considered (this is the notation of Kendall and Stuart, not assumed subsequently in the multichannel case), and h_i(u) is the Hermite polynomial of degree i, defined by the recursion

h₀(u) = 1,  h₁(u) = u,  h_{k+1}(u) = u h_k(u) − d h_k(u)/du.   (3.3)

The key advantage of using the Edgeworth expansion instead of the Gram-Charlier expansion lies in the ordering of its terms according to their decreasing significance as a function of m^{−1/2}. In fact, in the Gram-Charlier expansion, it is impossible to enquire into the magnitude of the various terms. See [39, 40] for general remarks on pdf expansions.

Now let us turn to the expansion of the negentropy defined in (2.13). The negentropy itself can be approximated by an Edgeworth-based expansion, as the theorem below shows.

THEOREM 14. For a standardized scalar variable z,

J(p_z) = (1/12) κ₃² + (1/48) κ₄² + (7/48) κ₃⁴ − (1/8) κ₃²κ₄ + o(m⁻²).   (3.4)

The proof is sketched in the appendix. Next, from (2.14), the calculus of the mutual information of a standardized vector z̃ needs not only the marginal negentropy of each component z̃_i but also the joint negentropy of z̃. Now it can be noticed that (see the appendix)

J(p_z̃) = J(p_ỹ),   (3.5)

since differential entropy is invariant by an orthogonal change of coordinates. Denote by K_{ij...q} the cumulants Cum{z̃_i, z̃_j, ..., z̃_q}. Then, using (3.4) and (3.5), the mutual information of z̃ takes the form, up to O(m⁻²) terms,

I(p_z̃) ≈ J(p_ỹ) − (1/48) Σ_{i=1}^{N} {4 K²_{iii} + K²_{iiii} + 7 K⁴_{iii} − 6 K²_{iii} K_{iiii}}.   (3.6)
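A small sketch (added for illustration, not from the paper) of how the scalar approximation in Theorem 14 can be evaluated from sample statistics of a standardized variable; the cumulant estimators used here are the simple moment-based ones, not the k-statistics mentioned later in Algorithm 18.

```python
import numpy as np

def negentropy_edgeworth(z):
    """Approximate negentropy of a standardized sample, following Theorem 14:
    J ~ k3^2/12 + k4^2/48 + 7 k3^4/48 - k3^2 k4 / 8."""
    z = (z - z.mean()) / z.std()
    k3 = np.mean(z**3)            # third-order cumulant (skewness)
    k4 = np.mean(z**4) - 3.0      # fourth-order cumulant (excess kurtosis)
    return k3**2 / 12 + k4**2 / 48 + 7 * k3**4 / 48 - k3**2 * k4 / 8

rng = np.random.default_rng(0)
print(negentropy_edgeworth(rng.standard_normal(100_000)))   # ~0 for a Gaussian variable
print(negentropy_edgeworth(rng.uniform(-1, 1, 100_000)))    # > 0 for a uniform variable
print(negentropy_edgeworth(rng.exponential(size=100_000)))  # larger, strongly skewed variable
```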
Yet, cumulants satisfy a multilinearity property [7], which allows them to be called tensors [43, 44]. Denote by Γ the family of cumulants of ỹ, and by K those of z̃ = Qỹ. Then this property can be written at orders 3 and 4 as

K_{ijk} = Σ_{pqr} Q_{ip} Q_{jq} Q_{kr} Γ_{pqr},   (3.7)

K_{ijkl} = Σ_{pqrs} Q_{ip} Q_{jq} Q_{kr} Q_{ls} Γ_{pqrs}.   (3.8)
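The multilinearity relations (3.7)-(3.8) are plain tensor contractions and can be written compactly, for instance with numpy's einsum; the code below is an illustration added here, not part of the paper.

```python
import numpy as np

def rotate_cumulants(Gamma3, Gamma4, Q):
    """Transform third- and fourth-order cumulant tensors under z = Q y,
    i.e. relations (3.7) and (3.8)."""
    K3 = np.einsum('ip,jq,kr,pqr->ijk', Q, Q, Q, Gamma3)
    K4 = np.einsum('ip,jq,kr,ls,pqrs->ijkl', Q, Q, Q, Q, Gamma4)
    return K3, K4
```

For an orthogonal Q, these contractions preserve the sum of the squares of all entries of each tensor, which is the invariance exploited below in (3.10)-(3.11).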
On the other hand, J(p_ỹ) does not depend on Q, so that criterion (3.1) can be reduced, up to order m⁻², to the maximization of a functional ψ:

ψ(Q) = Σ_{i=1}^{N} 4 K²_{iii} + K²_{iiii} + 7 K⁴_{iii} − 6 K²_{iii} K_{iiii}   (3.9)

with respect to Q, bearing in mind that the tensors K depend on Q through relations (3.7) and (3.8). Even if the mutual information has been proved to be a contrast, we are not sure that expression (3.9) is a contrast itself, since it only approximates the mutual information. Other criteria that are contrasts are subsequently proposed, and consist of simplifications of (3.9).

If the K_{iii} are large enough, the expansion of J(p_z̃) can be performed only up to order O(m^{−3/2}), yielding ψ(Q) = 4 Σ K²_{iii}. On the other hand, if the K_{iii} are all null, (3.9) reduces to ψ(Q) = Σ K²_{iiii}. It turns out that these two particular forms of ψ(Q) are contrasts, as shown in the next section. Of course, they can be used even if the K_{iii} are neither all large nor all null, but then they would not approximate the contrast in Theorem 7 any more.
3.2. Simpler criteria

The function ψ(Q) is actually a complicated rational function in N(N − 1)/2 variables, if indices vary in {1, ..., N}. The goal of the remainder of the paper is to avoid exhaustive search and save computational time in the optimization problem, as well as to propose a family of contrast functions in Theorem 16, of simpler use. The justification of these criteria is argued and is not subject to the validity of the Edgeworth expansion; in other words, it is not absolutely necessary that they approximate the mutual information.

LEMMA 15. Denote by Q^(r) the matrix obtained by raising each entry of an orthogonal matrix Q to the power r. Then we have ||Q^(r) u|| ≤ ||u||.

THEOREM 16. The functional

ψ(Q) = Σ_{i=1}^{N} K²_{ii...i},

where the K_{ii...i} are marginal standardized cumulants of order r, is a contrast for any r ≥ 2. Moreover, for any r ≥ 3, this contrast is discriminating over the set of random vectors having at most one null marginal cumulant of order r. Recall that the cumulants are functions of Q via the multilinearity relations, e.g. (3.7) and (3.8).

The proofs are given in the appendix (see also [15] for complementary details). These contrast functions are generally less discriminating than (3.1) (i.e. discriminating over a smaller subset of random vectors). In fact, if two components have a zero cumulant of order r, the contrast in Theorem 16 fails to separate them (this is the same behaviour as for Gaussian components). However, as in Theorem 11, at most one source component is allowed to have a null cumulant.

Now, it is pertinent to notice that the quantity

Ω_r = Σ_{i₁...i_r} K²_{i₁i₂...i_r}   (3.10)

is invariant under linear and invertible transformations (cf. the appendix). This result gives, in fact, an important practical significance to contrast functions such as in Theorem 16. Indeed, the maximization of ψ(Q) is equivalent to the minimization of Ω_r − ψ(Q), which is eventually the same as minimizing the sum of the squares of all cross-cumulants of order r, and these cumulants are precisely the measure of statistical dependence at order r. Another consequence of this invariance is that if ψ(y) = ψ(x), where x has uncorrelated components at the orders involved in the definition of ψ, then y also has independent components in the same sense. A similar interpretation can be given for
expression (3.9), since

Ω_{3,4} = Σ_{ijklmnqr} 4 K²_{ijk} + K²_{ijkl} + 3 K_{ijk} K_{ijn} K_{kqr} K_{nqr} + 4 K_{ijk} K_{imn} K_{jmr} K_{knr} − 6 K_{ijk} K_{ilm} K_{jklm}   (3.11)

is also invariant under linear and invertible transformations [16] (see the appendix for a proof). However, on the other hand, the minimization of (3.11) does not imply the individual minimization of cross-cumulants any more, due to the presence of a minus sign, among others.
The interest in maximizing the contrast in
Theorem 16 rather than minimizing the cross-cumulants lies essentially in the fact that only N cumulants are involved instead of O(N^r). Thus,
this spares us a lot of computations when estimating the cumulants from the data. A first analysis of
complexity was given in [15], and will be addressed
in Section 4.
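To make the preceding remark concrete, the sketch below (illustrative code, not from the paper) computes, for whitened data, the sum of squared marginal fourth-order cumulants used in Theorem 16 together with the invariant total (3.10); their difference is the sum of squared cross-cumulants that the contrast implicitly drives towards zero.

```python
import numpy as np

def fourth_order_cumulants(Z):
    """Full fourth-order cumulant tensor of zero-mean, whitened data Z (p x T):
    C_ijkl = E[z_i z_j z_k z_l] - d_ij d_kl - d_ik d_jl - d_il d_jk."""
    p, T = Z.shape
    M4 = np.einsum('it,jt,kt,lt->ijkl', Z, Z, Z, Z) / T
    I = np.eye(p)
    return M4 - (np.einsum('ij,kl->ijkl', I, I)
                 + np.einsum('ik,jl->ijkl', I, I)
                 + np.einsum('il,jk->ijkl', I, I))

def contrast_and_total(Z):
    C = fourth_order_cumulants(Z)
    psi = np.sum(np.einsum('iiii->i', C) ** 2)   # squared marginal cumulants (Theorem 16, r = 4)
    omega = np.sum(C ** 2)                        # invariant total as in (3.10)
    return psi, omega                             # omega - psi = squared cross-cumulants
```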
3.3. Link with blind identification and deconvolution
Criterion (3.9) and that in Theorem 16 may be
connected with other criteria recently proposed in
the literature for the blind deconvolution problem
[3, 5, 21, 31, 48, 52, 64]. For instance, the criterion
proposed in [52] may be seen to be equivalent to maximizing Σ_i K²_{iiii}. In [21], one of the optimization criteria proposed amounts to minimizing Σ_i S(p_{z_i}), which is consistent with (2.13), (2.14) and (3.1). In
[5], the family of criteria proposed contains (3.1).
On the other hand, identification techniques
presented in [35, 45, 46, 57, 63] solve a system of
equations obtained by equation error matching.
Although they work quite well in general, these
approaches may seem arbitrary in their selection of
particular equations rather than others. Moreover,
their robustness is questioned in the presence of
measurement noise, especially non-Gaussian, as
well as for short data records. In [62] a matching in
the least-squares (LS) sense is proposed; in [12] the
use of many more equations than unknowns improves the robustness for short data records.
A more general approach is developed in [28],
where a weighted LS matching is suggested. Our
equation (3.9) gives a possible justification to the
process of selecting particular cumulants, by showing their dominance over the others. In this context,
some simplifications would occur when developing
(3.9) as a function of Q since the components yi are
generally assumed to be identically distributed (this
is not assumed in our derivations). However, the
truncation in the expansion has removed the optimality character of criterion (3.1), whereas in the
recent paper [34] asymptotic optimality is addressed.
4. A practical algorithm
4.1. Pairwise processing
As suggested by Theorem 11, in order to maximize (3.9) it is necessary and sufficient to consider only pairwise cumulants of z̃. The proof of Theorem 11 did not involve the contrast expression, but was valid only in the case where the observation y was effectively stemming linearly from a random variable x with independent components (model (1.1) in the noiseless case). It turns out that it is true for any ỹ and any z̃ = Qỹ, and for any contrast of polynomial (and, by extension, analytic) form in the marginal cumulants of z̃, as is the case in (3.9) or Theorem 16.
THEOREM 17. Let ψ(Q) be a polynomial in the marginal cumulants denoted by K. Then the equation dψ = 0 imposes conditions only on components of K having two distinct indices. The same statement holds true for the condition d²ψ < 0.
The proof resorts to differentiation tools borrowed from classical analysis, and not to statistical
considerations (see the appendix). The theorem
says that pairwise independence is sufficient, provided a contrast of polynomial form is utilized. This
motivated the algorithm developed in this section.
Given an N × T data matrix, Y = {y(t), 1 ≤ t ≤ T}, the proposed algorithm processes each
pair in turn (similarly to the Jacobi algorithm in the
diagonalization of symmetric real matrices); Zi: will
denote the ith row of a matrix Z.
ALGORITHM 18.
(1) Compute the SVD of the data matrix as Y = VΣU*, where V is N × p, Σ is p × p and U* is p × T, and p denotes the rank of Y. Use therefore the algorithm proposed in [11], since T is often much larger than N. Then Z = √T U* and L = VΣ/√T.
(2) Initialize F = L.
(3) Begin a loop on the sweeps: k = 1, 2, ..., kmax, with kmax ≤ 1 + √p.
(4) Sweep the p(p − 1)/2 pairs (i, j) according to a fixed ordering. For each pair:
(a) estimate the required cumulants of (Z_{i:}, Z_{j:}), by resorting to k-statistics for instance [39, 44];
(b) find the angle α maximizing ψ(Q^{(i,j)}), where Q^{(i,j)} is the plane rotation of angle α, α ∈ ]−π/4, π/4], in the plane defined by components {i, j};
(c) accumulate F := F Q^{(i,j)*};
(d) update Z := Q^{(i,j)} Z.
(5) End of the loop on k if k = kmax or if all estimated angles are very small (compared to 1/T).
(6) Compute the norms of the columns of F: Δ_ii = ||F_{:i}||.
(7) Sort the entries of Δ in decreasing order: Δ := PΔP* and F := FP*.
(8) Normalize F by the transform F := FΔ⁻¹.
(9) Fix the phase (sign) of each column of F according to Definition 4(e). This yields F := FD.

The core of the algorithm is actually step 4(b), and as shown in [15] it can be carried out in various ways. This step becomes particularly simple when a contrast of the type given by Theorem 16 is used. Let us take for example the contrast in Theorem 16 at order 4 (a similar and less complicated reasoning can be carried out at order 3).

Step 4(b) of the algorithm maximizes the contrast in dimension 2 with respect to orthogonal transformations. Denote by Q the Givens rotation

Q = (1 + θ²)^{−1/2} [ 1  θ ; −θ  1 ].   (4.1)

Then the contrast takes the same value at θ and −1/θ. Yet, it is a rational function of θ, so that it can be expressed as a function of the variable ξ = θ − 1/θ only. More precisely, we have

ψ(ξ) = (ξ² + 4)⁻² Σ_{k=0}^{4} b_k ξ^k,   (4.2)

where the coefficients b_k are given in the appendix. Here, we have somewhat abused the notation and written ψ as a function of ξ instead of Q itself. However, there is no ambiguity since θ characterizes Q completely, and ξ determines two equivalent solutions in θ via the equation

θ² − ξθ − 1 = 0.   (4.3)

Expression (4.2) extends to the case of complex observations, as demonstrated in the appendix. The maximization of (4.2) with respect to the variable ξ leads to a polynomial rooting which does not raise any difficulty, since it is of low degree (there even exists an analytical solution since the degree is strictly smaller than 5):

ω(ξ) = Σ_{k=0}^{4} c_k ξ^k = 0.   (4.4)

The exact values of the coefficients c_k are also given in the appendix. Thus, step 4(b) of the algorithm consists simply of the following four steps:
- compute the roots of ω(ξ) by using (A.39);
- compute the corresponding values of ψ(ξ) by using (A.38);
- retain the absolute maximum;
- root Eq. (4.3) in order to get the solution θ₀ in the interval ]−1, 1].

The solution θ₀ corresponds to the tangent of the rotation angle. Since it is sufficient to have one representative of the equivalence class in Property 2, the solution in ]−1, 1] suffices. Obvious real-time implementations are described in [14]. Additional details on this algorithm may also be found in [15]. In particular, in the noiseless case the polynomial (4.4) degenerates and we end up with the rooting of a polynomial of degree 2 only. However, it is wiser to resort to the latter simpler algorithm only when N = 2 [15].
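The following Python sketch mimics the structure of Algorithm 18 and of step 4(b), but replaces the polynomial rooting in ξ (whose coefficients b_k, c_k are only given in the appendix) by a direct numerical search of the rotation angle over ]−π/4, π/4]; it is an illustration under that simplification, not the author's implementation, and all names are assumptions.

```python
import numpy as np

def pair_contrast(z1, z2):
    """Sum of squared marginal fourth-order cumulants of a standardized pair (Theorem 16, r = 4)."""
    k1 = np.mean(z1**4) - 3.0
    k2 = np.mean(z2**4) - 3.0
    return k1**2 + k2**2

def ica_jacobi(Y, sweeps=None):
    """Whiten the N x T data matrix, then run cyclic-by-rows Jacobi-like sweeps of plane rotations,
    each angle being found by a grid search instead of the rooting of (4.4)."""
    N, T = Y.shape
    Y = Y - Y.mean(axis=1, keepdims=True)
    U, s, _ = np.linalg.svd(Y, full_matrices=False)
    Z = (U / (s / np.sqrt(T))).T @ Y               # standardized data (step 1)
    F = U * (s / np.sqrt(T))                       # initial F = L (step 2)
    p = Z.shape[0]
    sweeps = sweeps or int(1 + np.sqrt(p))
    angles = np.linspace(-np.pi / 4, np.pi / 4, 181)
    for _ in range(sweeps):                        # step 3
        for i in range(p - 1):                     # step 4: sweep the pairs
            for j in range(i + 1, p):
                best = max(angles, key=lambda a: pair_contrast(
                    np.cos(a) * Z[i] + np.sin(a) * Z[j],
                    -np.sin(a) * Z[i] + np.cos(a) * Z[j]))
                c, s_ = np.cos(best), np.sin(best)
                G = np.array([[c, s_], [-s_, c]])
                Z[[i, j]] = G @ Z[[i, j]]          # step 4(d)
                F[:, [i, j]] = F[:, [i, j]] @ G.T  # step 4(c), so that Y ~= F Z throughout
    return F, Z
```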
4.2. Convergence
It is easy to show [15] that Algorithm 18 is such
that the global contrast function ~O(Q) monotonically increases as more and more iterations are run.
Since this real positive sequence is bounded above,
it must converge to a maximum. How many sweeps
(kmax) should be run to reach this maximum depends on the data. The value of kmax has been
chosen on the basis of extensive simulations: the value kmax = 1 + √p seems to give satisfactory
results. In fact, when the number of sweeps, k,
reaches this value, it has been observed that the
angles of all plane rotations within the last sweep
were very small and that the contrast function had
reached its stationary value. However, iterations
can obviously be stopped earlier.
In [52], the deconvolution algorithm proposed
has been shown to suffer from spurious maxima
in the noisy case. Moreover, as pointed out in
[64], even in the noiseless case, spurious maxima
may exist for data records of limited length.
This phenomenon has not been observed
when running ICA, but could a priori also
happen. In fact, nothing ensures the absence of
local maxima, at least based on the results we
have presented up to now. Algorithm 18 finds
a sequence of absolute maxima with respect to
each plane rotation, but no argument has
been found that ensures reaching the global absolute maximum with respect to the cumulated
rotation. So it is only conjectured that this
search is sufficient over the group of orthogonal
matrices.
4.3. Computational complexity
When evaluating the complexity of Algorithm
18, we shall count the number of floating point
operations (flops). A flop corresponds to a multiplication followed by an addition. Note that the complexity is often assimilated to the number of multiplications, since most of the time there are more
multiplications than additions. In order to make
the analysis easier, we shall assume that N and
T are large, and only retain the terms of highest
orders.
In step 1, the SVD of a matrix of size N × T is to be calculated. Since T will necessarily be larger than N, it is appropriate to resort to a special technique proposed by Chan [11], consisting of running a QR factorization first. The complexity is then of 3TN² − 2N³/3 flops. Making explicit the matrix L requires the multiplication of an N × p matrix by a diagonal matrix of size p. This requires an additional Np flops, which is negligible with respect to TN².
In step 4(a), the five pairwise cumulants of order 4 must be calculated. If z_{i:} and z_{j:} are the two row vectors concerned, this may be done as follows for reasonable values of T (T > 100):

ζ_{ii} = z_{i:} · z_{i:},  termwise product, T flops
ζ_{jj} = z_{j:} · z_{j:},  termwise product, T flops
ζ_{ij} = z_{i:} · z_{j:},  termwise product, T flops
Γ_{iiii} = ζ_{ii}* ζ_{ii} / T − 3,  scalar product, T + 1 flops
Γ_{iiij} = ζ_{ii}* ζ_{ij} / T,  scalar product, T + 1 flops
Γ_{iijj} = ζ_{ii}* ζ_{jj} / T − 1,  scalar product, T + 1 flops
Γ_{ijjj} = ζ_{ij}* ζ_{jj} / T,  scalar product, T + 1 flops
Γ_{jjjj} = ζ_{jj}* ζ_{jj} / T − 3,  scalar product, T + 1 flops   (4.6)
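A direct transcription of (4.6) in Python (added for illustration; it assumes the two rows are already standardized, i.e. zero-mean, unit-variance and mutually uncorrelated, as produced by step 1):

```python
import numpy as np

def pairwise_cumulants(zi, zj):
    """Five fourth-order cumulants of a standardized pair, following (4.6)."""
    T = zi.size
    zii, zjj, zij = zi * zi, zj * zj, zi * zj   # three termwise products, T flops each
    G_iiii = zii @ zii / T - 3.0
    G_iiij = zii @ zij / T
    G_iijj = zii @ zjj / T - 1.0
    G_ijjj = zij @ zjj / T
    G_jjjj = zjj @ zjj / T - 3.0
    return G_iiii, G_iiij, G_iijj, G_ijjj, G_jjjj
```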
So step 4(a) requires O(8T) flops. The polynomial rooting of step 4(b) requires O(1) flops, as does the selection of the absolute maximum. The accumulation of step 4(c) needs 4N flops, because only two columns of F are changed. The update of step 4(d) is eventually calculated within 4T flops.

If the loop on k is executed K times, then the complexity above is repeated O(Kp²/2) times. Since p ≤ N, we have a complexity bounded by (12T + 4N)KN²/2. The remaining normalization steps require a global amount of O(Np) flops. As a conclusion, if T ≫ N, the execution of Algorithm 18 has a complexity of

O(6KN²T) flops.   (4.7)

In order to compare this with usual algorithms in numerical analysis, let us take realistic examples. If T = O(N²) and K = O(N^{1/2}), the global complexity is O(N^{9/2}), and if T = O(N^{3/2}), we get O(N⁴) flops. These orders of magnitude may be compared to the usual complexities in O(N³) of most matrix factorizations.
The other algorithm at our disposal to compute
the ICA within a polynomial time is the one proposed by Cardoso (refer to [10] and the references
therein). The fastest version requires O(N 5) flops, if
all cumulants of the observations are available and
if N sources are present with a high signal to noise
ratio. On the other hand, there are O(N4/24) different cumulants in this 'supersymmetric' tensor.
Since estimating one cumulant needs O(~T) flops,
1 < c~~< 2, we have to take into account an additional complexity of O(aN4T/24). In this operatorbased approach, the computation of the cumulant
tensor is the task that is the most computationally
heavy. For T = O(N 2) for instance, the global
complexity of this approach is roughly O(N6).
Even for T = O(N), which is too small, we get
a complexity of O(NS). Note that this is always
larger than (4.9). The reason for this is that only
pairwise cumulants are utilized in our algorithm.
Lastly, one could think of using the multilinearity relation (3.8) to calculate the five cumulants
required in step 4(a). This indeed requires little
computational means, but needs all input cumulants to be known, exactly as in Cardoso's method
above. The computational burden is then dominated by the computation of input cumulants and
represents O(TN⁴) flops.
5. Simulation results
5.1. Performance criterion
In this section, the behaviour of ICA in the presence of non-Gaussian noise is investigated by
means of simulations. We first need to define a distance between matrices modulo a multiplicative
factor of the form ΛP, where Λ is diagonal invertible and P is a permutation. Let A and Â be two invertible matrices, and define the matrices with unit-norm columns

Ā = AΔ⁻¹,  B̄ = ÂΔ̂⁻¹,  with Δ_kk = ||A_{:k}||,  Δ̂_kk = ||Â_{:k}||.

The gap ε(A, Â) is built from the matrix D = Ā⁻¹B̄ as

ε(A, Â) = Σ_i ( Σ_j |D_{ij}|² − 1 ) + Σ_j ( Σ_i |D_{ij}|² − 1 ).   (5.1)
It is shown in the appendix that this measure of distance is indeed invariant by postmultiplication by a matrix of the form ΛP:

ε(AΛP, Â) = ε(A, ÂΛP) = ε(A, Â).   (5.2)

Moreover, ε(A, Â) = 0 if and only if A and Â can be deduced from each other by multiplication by a factor ΛP:

ε(A, Â) = 0  ⟺  Â = AΛP.   (5.3)
5.2. Behaviour in the presence of non-Gaussian noise when N = 2
Consider the observations in dimension N for realization t:

y(t) = (1 − μ) M x(t) + μ β w(t),   (5.4)

where 1 ≤ t ≤ T, x and w are zero-mean standardized random variables uniformly distributed in [−√3, √3] (and thus have a kurtosis of −1.2), μ is a positive real parameter in [0, 1] and β = ||M||, ||·|| denoting the L² matrix norm (the largest singular value). Then, when μ ranges from 0 to 1, the signal to noise kurtosis ratio decreases from infinity to zero. Since x and w have the same statistics, the signal to noise ratio can be indifferently defined at orders 2 or 4. One may assume the following as a definition of the signal to noise ratio (see Fig. 9):

SNR_dB = 10 log₁₀ [(1 − μ)/μ].   (5.5)

In this section we assume N = 2 and a fixed 2 × 2 mixing matrix M.
We represent in Figs. 2-4 the behaviour of Algorithm 18 when the contrast in Theorem 16 is used at order r = 4 (which also coincides with (3.9), since the densities are symmetrically distributed in this simulation). The mean of the error ε defined in (5.1) has been plotted as a function of μ in Fig. 2, and as a function of T in Fig. 3. The mean was calculated over a number v of trials (indicated in Fig. 2) chosen so that the product vT is constant: vT ≈ 24000. The curves presented in Fig. 2 have been calculated for μ ranging from 0 to 1 by steps of 0.02. Fig. 4 shows the standard deviations of ε obtained with the same data. A similar simulation was presented in [16], where signal and noise had different kurtosis.

Fig. 1. Identification block diagram. We have in particular y = Fz = LQ*(P*Δ⁻¹D)z. The matrix (P*Δ⁻¹D) is optional and serves to determine uniquely a representative of the equivalence class of solutions.

Fig. 2. Average values of the gap ε between the true mixing matrix M and the estimated one, F, as a function of the noise rate and for various data lengths (T = 100, 140, 200, 500 with v = 240, 171, 120, 48 trials, respectively). Here the dimension is N = 2.

Fig. 3. Average values of the gap ε as a function of the data length T and for various noise rates μ. The dimension is N = 2.

Fig. 4. Standard deviations of the gap ε for the same input and output data as for Fig. 2.
M o s t surprising in these simulations in the excellent behaviour of ICA even in the close neighbourh o o d of/a = 0.5 (viz. 0 dB); keep in mind that the
#-scale is z o o m i n g about 0 dB (according to Fig. 9,
an S N R of 1 dB corresponds to /~ = 0.44 for
example). The exact value of p for which the ICA
remains interesting depends on the integration
length T. In fact, the shorter the T, the larger the
apparent noise, since estimation errors are also
seen by the algorithm as extraneous noise. In Fig. 3,
for instance, it can be noticed that the calculation of
ICA with T = I00 in the noiseless case is about as
g o o d as with T = 300 when # = 0.4. For larger
P. Comon / Signal Processing 36 (1994) 287-314
302
values of/~, the ICA algorithm considers the vector
M x as a noise and the vector I~flw as a source
vector. Because w has independent components, it
could be checked out conversely that e(I, F) tends
to zero as p tends to one, showing that matrix F is
approaching AP.
5.3. Behaviour in the noiseless case when N > 2

Another interesting experiment is to look at the influence of the choice of the sweeping strategy on the convergence of the algorithm. For this purpose, consider now 10 independent sources. In Algorithm 18, the pairs are swept according to a standard cyclic-by-rows ordering, but any other ordering may be utilized. To see the influence of this ordering without modifying the code, we have changed the order of the source kurtosis. In ordering 1, the source kurtosis are

[1 −1 1 −1 1.5 −1.5 2 −2 1 −1],   (5.6)

whereas in orderings 2 and 3 they are, respectively,

[1 1 1.5 2 1 −1 ⋯]  and  [1 2 1.5 1 1 −1 ⋯ −2],   (5.7)

i.e. permutations of the same ten values. The presence of cumulants of opposite signs does not raise any obstacle, nor does the presence of at most one null cumulant, as shown in [16]. The mixing matrix, M, is Toeplitz circulant with the vector

[−3 0 2 1 −1 1 0 1 −1 −2]   (5.8)

as first row.

Fig. 5. Gap between the true mixing matrix M and the estimated one, F, as a function of the number of iterations (pairs processed). These are asymptotic results, i.e. the data length may be considered infinite. Solid and dashed lines correspond to three different orderings. The dimension is N = 10.
No noise is present. This simulation is performed
directly from cumulants, using unavoidably the
procedure described in the last paragraph of Section 4.3, so that the performances are those that
would be obtained for T = ~ with real-world signals (see Appendix A.21). The contrast utilized is
that in Theorem 16 with r = 4. The evolution of
gap e between the original matrix, M, and the
estimated matrix, F, is plotted in Fig. 5 for the three
orderings, as a function of the number of Givens
rotations computed. Observe that the gap .e is not
necessarily monotonically decreasing, whereas the
contrast always is, by construction of the algo-
0
20
40
60
80
100
120
140
160
Number of pairs processed
Fig. 6. Evolution of the contrast function as more and more
iterations are performed, in the same problem as shown in Fig. 5,
with the same three ordefings.
rithm. For convenience, the contrast is also represented in Fig. 6. The contrast reaches a stationary
value of 18.5 around the 90th iteration, that is at the
end of the second sweep. It can be noticed that this
corresponds precisely to the sum of the squares of
source cumulants (5.6), showing that the upper
bound is indeed reached. Readers desiring to program the ICA should pay attention to the fact that
303
P. Comon / Signal Processing 36 (1994) 287 314
Mean o! gap
the speed of convergence can depend on the standardization code (e.g. changing the sign of a singular vector can affect the behaviour of the convergence).

It is clear that the values of the source kurtosis can have an influence on the convergence speed, as in the Jacobi algorithm (provided they are not all equal, of course). In the latter algorithm, it is well known that the convergence can be speeded up by processing the pair for which the non-diagonal terms are the largest. Here, the difference is that the cross-terms are not explicitly given. Consequently, even if this strategy could also be applied to the diagonalization of the cumulant tensor, the computation of all pairwise cross-cumulants at each iteration would be necessary. We would eventually lose more in complexity than we would gain in convergence speed.
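Because the cross-terms are not explicitly available, such a pivoting strategy would require estimating the pairwise cross-cumulants from the data at each iteration. A minimal sketch of this estimation is given below (Python/NumPy; zero-mean data assumed, names illustrative). It relies on the standard expression of a fourth-order cumulant of zero-mean variables, cum(a, b, c, d) = E[abcd] − E[ab]E[cd] − E[ac]E[bd] − E[ad]E[bc]:

import numpy as np

def cum4(a, b, c, d):
    # Sample fourth-order cumulant of four zero-mean scalar sequences.
    E = np.mean
    return (E(a * b * c * d) - E(a * b) * E(c * d)
            - E(a * c) * E(b * d) - E(a * d) * E(b * c))

def pair_cross_cumulants(zi, zj):
    # Cross-cumulants Gamma_1112, Gamma_1122, Gamma_1222 of the pair (zi, zj).
    return (cum4(zi, zi, zi, zj),
            cum4(zi, zi, zj, zj),
            cum4(zi, zj, zj, zj))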
5.4. Behaviour in the presence of noise when N > 2

Now, it is interesting to perform the same experiment as in Section 5.2, but for values of N larger than 2. We again take the experiment described by (5.4). The components of x and w are again independent random variables identically and uniformly distributed in [−√3, √3], as in Section 5.2. In Fig. 7, we report the gap e that we obtain for N = 10. The matrix M was in this example again the Toeplitz circulant matrix of Section 5.3, with first row

[3  0  2  1  −1  1  0  1  −1  −2].
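For illustration, data for this experiment could be generated as in the following sketch (Python/NumPy). It assumes the mixture model y = (1 − μ)Mx + μβw used in Section 5.2, with β a fixed scale factor; all names are placeholders and this is not the paper's MATLAB code:

import numpy as np

def circulant(first_row):
    # N x N Toeplitz circulant matrix built from its first row.
    r = np.asarray(first_row, dtype=float)
    return np.array([np.roll(r, k) for k in range(len(r))])

def generate_data(T, mu, beta=1.0, seed=0):
    rng = np.random.default_rng(seed)
    M = circulant([3, 0, 2, 1, -1, 1, 0, 1, -1, -2])
    N = M.shape[0]
    # Independent sources and noise, uniform on [-sqrt(3), sqrt(3)] (unit variance).
    x = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(N, T))
    w = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(N, T))
    y = (1 - mu) * (M @ x) + mu * beta * w
    return y, M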
It may be observed in Fig. 7 that an integration length of T = 100 is insufficient when N = 10 (even in the noiseless case), whereas T = 500 is acceptable. For T = 100, the gap e showed a very high variance, so that the top solid curve is subject to large errors (see the standard deviations in Fig. 8); the number v of trials was indeed only 50 for reasons of computational complexity. As in the case N = 2, and for larger values of T, the increase of e is quite sudden when μ increases. With a logarithmic scale (in dB), the sudden increase would be even much steeper. After inspection, the ICA can be computed with Algorithm 18 up to signal-to-noise ratios of about +2 dB (μ = 0.35) for integration lengths of the order of 1000 samples.

Fig. 7. Averages of the gap e between the true mixing matrix M and the estimated one, F, as a function of the noise rate μ and for various data lengths T (T = 100, v = 50 trials; T = 500, v = 10; T = 1000, v = 10; T = 5000, v = 2). Here the dimension is N = 10. Notice the gap is small below a 2 dB kurtosis signal-to-noise ratio (rate μ = 0.35) for a sufficient data length. The bottom solid curve corresponds to the ultimate performances with infinite integration length.

Fig. 8. Standard deviations of the gap e obtained for integration length T and computed over v trials, for the same data as for Fig. 7.

The continuous bottom solid curve in Fig. 7 shows the performances that would be obtained for an infinite integration length. In order to compute the ICA in such a case, the validation algorithm described in Appendix A.21 has been utilized, as in Section 5.3. It can be inferred from this curve that, for N = 10, an integration length of T = 5000 reaches performances very close to the ultimate ones.

Fig. 9. SNR as a function of the noise rate μ.
Source MATLAB codes of these experiments can
be obtained from the author upon request. Other
simulation results related to ICA are also reported
in [15].
6. Conclusions
The definition of ICA given within the framework of this paper depends on a contrast function that serves as a rotation selection criterion. One of the contrasts proposed is built from the mutual information of standardized observations. For practical purposes this contrast is approximated by the Edgeworth expansion of the mutual information, and consists of a combination of third- and fourth-order marginal cumulants.
A particular contrast of interest is the sum of squares of the marginal cumulants of order 4. An efficient algorithm is proposed that maximizes this criterion by rooting a sequence of polynomials of degree 4. The overall complexity of the algorithm depends on the integration length, but it is reasonable to say that N^4 to N^5 floating point operations are required to compute an ICA in dimension N.
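To make the structure of such a procedure concrete, a compact sketch of a pairwise algorithm of this kind is given below (Python/NumPy). It is only an illustration under simplifying assumptions: the rotation angle of each pair is found by a coarse search instead of the polynomial rooting used in the paper, and all names are placeholders rather than the paper's code.

import numpy as np

def cum4(a, b, c, d):
    # Fourth-order cumulant of zero-mean sequences.
    E = np.mean
    return (E(a*b*c*d) - E(a*b)*E(c*d) - E(a*c)*E(b*d) - E(a*d)*E(b*c))

def whiten(y):
    # Standardization (PCA step): zero mean and identity covariance.
    y = y - y.mean(axis=1, keepdims=True)
    lam, U = np.linalg.eigh(np.cov(y))
    W = U @ np.diag(1.0 / np.sqrt(lam)) @ U.T
    return W @ y, W

def ica_pairwise(y, n_sweeps=None, n_angles=181):
    # Jacobi-like ICA: maximize the sum of squared fourth-order marginal cumulants.
    z, W = whiten(y)
    N = z.shape[0]
    if n_sweeps is None:
        n_sweeps = int(1 + np.sqrt(N))
    Q = np.eye(N)
    angles = np.linspace(-np.pi / 4, np.pi / 4, n_angles)
    for _ in range(n_sweeps):
        for i in range(N - 1):
            for j in range(i + 1, N):          # cyclic-by-rows sweep of the pairs
                zi, zj = z[i].copy(), z[j].copy()
                best_val, best_a = -np.inf, 0.0
                for a in angles:               # coarse search over the pair angle
                    c, s = np.cos(a), np.sin(a)
                    u, v = c * zi + s * zj, -s * zi + c * zj
                    val = cum4(u, u, u, u) ** 2 + cum4(v, v, v, v) ** 2
                    if val > best_val:
                        best_val, best_a = val, a
                c, s = np.cos(best_a), np.sin(best_a)
                z[i], z[j] = c * zi + s * zj, -s * zi + c * zj
                G = np.eye(N)
                G[i, i] = G[j, j] = c
                G[i, j], G[j, i] = s, -s
                Q = G @ Q
    return z, Q @ W                            # estimated sources, separating matrix

The coarse angle search is of course slower and less accurate than rooting the degree-4 polynomial of Section 4, which is how the complexity figures above are obtained; the sketch is only meant to show the overall structure of the sweeps.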
Simulation results demonstrate the convergence and robustness of the algorithm, even in the presence of non-Gaussian noise (up to 0 dB for a noise having the same statistics as the sources) and with limited integration lengths. The influences of finite integration lengths and of non-Gaussian noise are recent subjects of investigation in the area of high-order statistics.
It is envisaged that this tool might be useful in antenna array processing (separation and recognition of sources from unknown arrays, localization from perturbed arrays, jammer rejection, etc.), in Bayesian classification and in data compression, as well as in many areas where PCA has already been utilized for many years, such as detection, linear whitening or exploratory analysis. However, it has not been proved that the proposed algorithm cannot get stuck at some local maximum (although this has never been observed to happen). The algorithm is indeed guaranteed to reach the global maximum only in the case of two sources in the presence of non-Gaussian noise.
7. Appendix A
7.A.1. Proof of relation (1.2)
Since x is non-degenerate, it admits an invertible p × p covariance matrix, which we may denote by Λ² for convenience, where the matrix Λ is diagonal, invertible and positive real. Denote by an asterisk transposition and complex conjugation. The covariance of z can be written as CΛ⁻²C*, and must be diagonal since z has independent components. Thus, denote this covariance matrix by Δ², where Δ is diagonal and positive real (since C is of full column rank, Δ has exactly N − p null entries). Thus, we have CΛ⁻²C* = Δ². Let P be a permutation matrix such that the first p diagonal entries of Δ' = PΔP* are non-zero. Denote by Δ_p the p × p block formed of these entries. Then the relation (PCΛ⁻¹)(PCΛ⁻¹)* = Δ'² shows that the last N − p rows of the matrix A = PCΛ⁻¹ are null. Denote by A_p the block formed of the first p rows. Then there exists a unitary matrix Q_p such that A_p = Δ_p Q_p. This is equivalent to PCΛ⁻¹ = Δ'Q', where Q' denotes the N × p block matrix whose first p rows form Q_p and whose last N − p rows are null. By construction, we consequently have C = P*Δ'Q'Λ, which can be written as C = ΔQΛ, with Δ = P*Δ'P, Q = P*Q' and Q'*Q' = I. □
7.A.2. Properties of (2.13)

Adding and subtracting a term in the definition of negentropy gives

J(p_x) = ∫ log p_x(u) p_x(u) du − ∫ log φ_x(u) p_x(u) du + ∫ log φ_x(u) p_x(u) du − ∫ log φ_x(u) φ_x(u) du,

which can be written as

J(p_x) = ∫ p_x(u) log [p_x(u)/φ_x(u)] du + ∫ log φ_x(u) [p_x(u) − φ_x(u)] du.   (A.1)

Now, by definition, φ_x(u) and p_x(u) have the same first- and second-order moments. Yet, since log φ_x(u) is a polynomial of degree 2, we have

∫ log φ_x(u) φ_x(u) du = ∫ log φ_x(u) p_x(u) du.   (A.2)

Combining the two last equations shows that the negentropy (2.13) may be written in another manner, as a Kullback divergence:

J(p_x) = ∫ p_x(u) log [p_x(u)/φ_x(u)] du.   (A.3)

This proves, referring to (2.4), that

J(p_x) ≥ 0,   (A.4)

with equality iff φ_x = p_x almost everywhere. Again, as a Kullback divergence between two densities defined on the same space, negentropy is invariant under an invertible change of coordinates.

7.A.3. Proof of (2.14)

From (2.13) we start with

J(p_x) − Σ_i J(p_{x_i}) = S(φ_x) − S(p_x) − Σ_i S(φ_{x_i}) + Σ_i S(p_{x_i}).

Using now (2.5),

J(p_x) − Σ_i J(p_{x_i}) = I(p_x) + S(φ_x) − Σ_i S(φ_{x_i}).   (A.5)

Eq. (2.14) is then obtained by replacing the Gaussian entropies by their values given in (2.12). □

7.A.4. Necessary and sufficient condition for the last term of (2.14) to be zero

The last term of (2.14) is zero when

det V = Π_{i=1}^{N} V_ii.   (A.6)

It is obvious that if V is diagonal, (A.6) is satisfied. Let us prove the reverse, and assume that (A.6) holds. Then write the Cholesky factorization of the covariance V as V = LL*. This yields

V_ii = Σ_{k=1}^{i} L_ik²,   (A.7)

and after substitution in (A.6)

Π_{i=1}^{N} L_ii² = Π_{i=1}^{N} (L_ii² + Σ_{k<i} L_ik²).   (A.8)

This relation is possible only if either some row of L is null or all L_ik, k < i, are null. In other words, if V is invertible, it must be diagonal. □

7.A.5. Proof of Theorem 7
The second requirement of Definition 5 is satisfied because the random variable is standardized.
Regarding the third one, note first that from (2.4) or
(2.5) the mutual information can cancel only when
all components are mutually independent. Again because of (2.4), the information is always positive. Thus, we indeed have Ψ(p_Ax) ≤ 0, with equality if and only if the components are independent.

At this stage a comment is relevant. In scalar blind deconvolution problems [21], in order to prove that the minimization of entropy was an acceptable criterion, it was necessary to resort to the property

S(Σ_i a_i x_i) − S(x) ≥ log(Σ_i a_i²)/2,   (A.9)

satisfied as soon as the random variables x_i are independently and identically distributed (i.i.d.) according to the same distribution as x. Directions for the proof of (A.9) may be found in [6]. In our framework, the proof was somewhat simpler and property (A.9) was not necessary. This stems from the fact that we are dealing with multivariate random variables, and thus with multivariate entropy. In return, the components of x are not required to be i.i.d. □
7.A.6. Proofs of Lemmas 8 and 9

Lemma 8 was first proved by Marcinkiewicz in 1940, but Dugué gave a shorter proof in 1951 [22]. Extensions of these lemmas may be found in Kagan et al. [38], but have the inconvenience of being more difficult to prove. Lemma 9 is due to Cramér and may be found in his book of 1937 [18]. See also [38, p. 21] for extensions. Some general comments on these lemmas can also be found in [19]. □

7.A.7. Remarks on Theorem 10

This theorem may be seen to be a direct consequence of a theorem due to Darmois [19]. This was unnoticed in [15].

THEOREM 19 (Darmois' theorem (1953)). Define the two random variables X_1 and X_2 as

X_1 = Σ_{i=1}^{N} a_i x_i,   X_2 = Σ_{i=1}^{N} b_i x_i,

where the x_i are independent random variables. Then if X_1 and X_2 are independent, all variables x_j for which a_j b_j ≠ 0 are Gaussian.

This theorem can also be found in [19; 49, p. 218; 38, p. 89]. Its proof needs Lemmas 8 and 9, and also the following lemma published in 1953 by Darmois [19].

LEMMA 20 (Darmois' lemma). Let f_1, f_2, g_1 and g_2 be continuous functions in an open set U, and null at the origin. If

f_1(au + cv) + f_2(bu + dv) = g_1(u) + g_2(v),   ∀u, v ∈ U,

where a, b, c, d are non-zero constants such that ad ≠ bc, then the f_i(x) are polynomials of degree ≤ 2.

The proof of the lemma makes use of successive finite differences [38; 19, p. 5]. More general results of this kind can be found in Kagan et al. [38].

7.A.8. Proof of Theorem 11

Implications (iii) ⇒ (ii) and (ii) ⇒ (i) are quite obvious. We shall prove the last one, namely (i) ⇒ (iii). Assume z has pairwise independent components, and suppose C is not of the form ΛP. Since C is orthogonal, it necessarily has two non-zero entries in at least two different columns. Then by applying Theorem 10 twice, x has at least two Gaussian components, which is contrary to our hypothesis. □

7.A.9. Proofs of Corollaries 12 and 13

To prove Corollary 12, assume that Ψ(p_Ax) ≥ Ψ(p_x), where x has independent components. Then Ψ(p_x) = 0 because of (2.4), which yields Ψ(p_Ax) = 0. Next, again because of (2.4), Ax has independent components. Now Theorem 11 implies that A = ΛP, which was to be proved. On the other hand, the reverse implication is immediate since Ψ(p_ΛPx) = Ψ(p_x) is satisfied by any contrast; see Definition 5 and Theorem 7.
To prove Corollary 13, consider y = Mx and y = Fz. From Section 2.2, we may assume M is full rank, and since it has more rows than columns by hypothesis, as pointed out after (1.1), there exists a square invertible matrix, A, such that x = Az. In addition, since x and z have regular covariances, it also holds that MA = F. Then, if Ψ is discriminating, it implies by Definition 6 that A = ΛP, where Λ is a scaling diagonal matrix and P is a permutation. Premultiplying this last equality by M eventually gives F = MΛP. □
7.A.10. Sketch of the proof for (3.4)

Start with the expansion

(1 + v) log(1 + v) = v + v²/2 − v³/6 + v⁴/12 + o(v⁴).   (A.10)

Let p_x(x) = φ(x)(1 + v(x)). Then v(x) is obviously defined through relation (3.2), and the negentropy (2.13) can be written as

J(p_x) = ∫ φ(x)(1 + v(x)) log(1 + v(x)) dx.   (A.11)

Inserting (A.10) into (A.11), and then replacing v(x) by its value given by (3.2), leads to a long expression. In this expression, a number of simplifications can be performed by resorting to the following properties of Hermite polynomials:
- orthogonality with respect to the Gaussian kernel:

∫ φ(u) h_p(u) h_q(u) du = p! δ_pq;   (A.12)

- other less known properties that can be proved by successive integrations by parts:

∫ φ(u) h_3²(u) h_4(u) du = 3!³,   (A.13)

∫ φ(u) h_3²(u) h_6(u) du = 6!,   (A.14)

∫ φ(u) h_3⁴(u) du = 93 · 3!².   (A.15)

Then retaining only the terms at least of order O(m⁻²), according to (3.1), finally yields (3.4). □

7.A.11. Proof of (3.5)

If z = Ay, then

S(p_z) = − ∫ p_y(Au) log{|A| p_y(Au)} |A| du,   (A.16)

which yields S(p_z) = S(p_y) − log|A|. When A is orthogonal, |A| = 1 and S(p_z) = S(p_y), as well as S(φ_z) = S(φ_y). □

7.A.12. Proof of Lemma 15

Lemma 15 and Theorem 16 also hold for unitary matrices. Since the proof has the same complexity, we shall derive it in this case. Denote by Q̃ the matrix obtained from Q by replacing all its entries by their modulus, and by ʳQ̃ the matrix with entries Q̃_ij^r. Consider a unitary matrix Q; then the matrix ²Q̃ is bistochastic (i.e. the sum of its entries in any row or column is equal to 1). Yet, from the Birkhoff theorem, the set of bistochastic matrices is a convex polyhedron whose vertices are permutations. Thus, ²Q̃ can be decomposed as

²Q̃ = Σ_s α_s P_s,   α_s ≥ 0,   Σ_s α_s = 1,   (A.17)

where the P_s are permutation matrices. Denote by ũ the vector of components |u_i|. Then we have the inequality

||²Q̃ u|| ≤ ||²Q̃ ũ|| ≤ Σ_s α_s ||P_s ũ|| = ||ũ||.   (A.18)   □

7.A.13. Proof of Theorem 16

Since Q is orthogonal, we have |Q_ij| ≤ 1 and, consequently, |Q_ij|^r ≤ |Q_ij|² as soon as r ≥ 2, which can also be written as ʳQ̃_ij ≤ ²Q̃_ij. Applying the triangular inequality gives

Σ_{i,j,k} ʳQ̃_ij ʳQ̃_ik ũ_j ũ_k ≤ Σ_{i,j,k} ²Q̃_ij ²Q̃_ik ũ_j ũ_k.   (A.19)

Then using Lemma 15 we get

||ʳQ̃ ũ||² ≤ ||²Q̃ ũ||² ≤ ||ũ||² = ||u||².   (A.20)
Now consider a standardized random vector x̃ having zero cross-cumulants of order r, and denote by u_i its marginal cumulants of order r. Because of the multilinearity of cumulants (e.g. (3.7) or (3.8)), the very left-hand side of (A.20) is nothing else but Ψ(Qx̃). So we have just proved with (A.20) that Ψ(Qx̃) ≤ Ψ(x̃) for any orthogonal matrix Q. This shows that the criterion of Theorem 16 is indeed a contrast as soon as r ≥ 2.

Note that, for r = 2, the equality ||²Q̃u|| = ||u|| always holds, so that the contrast is actually degenerate. It remains to prove that Theorem 16 is discriminating for any r > 2.

Assume equality in (A.20). Then, in particular,

||²Q̃ ũ||² − ||ʳQ̃ ũ||² = 0   (A.21)

for some vector u having at most one null component. But because of (A.19), this implies that all the differences involved in (A.21) individually vanish:

(²Q̃ ũ)_i − (ʳQ̃ ũ)_i = 0,   ∀i.   (A.22)

Again, because ũ_j, ʳQ̃_ij and ²Q̃_ij are positive, this yields

(²Q̃_ij − ʳQ̃_ij) ũ_j = 0,   ∀i, j.   (A.23)

Consequently, for all values of i, and for at least p − 1 values of j, we have ʳQ̃_ij = ²Q̃_ij. Since 0 ≤ Q̃_ij ≤ 1, ²Q̃_ij is necessarily either zero or one for those pairs (i, j). Now, a p × p doubly stochastic matrix ²Q̃ containing only zeros or ones in a p × (p − 1) submatrix is a permutation. The matrix Q is thus equal to a permutation up to multiplication of its entries by numbers of unit modulus. This may be written as Q = DP, where D need only be diagonal with unit-modulus entries. This proves the second part of the theorem.

This result also holds in the complex case, as demonstrated in [15], provided the following definition is used in Theorem 16: Ψ(x̃) = Σ_i |u_i|². □

7.A.14. Proof of (3.10)

Since only standardized cumulants are involved, we only need consider orthogonal transforms. Again from the multilinearity of cumulants, we have

Σ_{i_1 ... i_r} K²_{i_1 ... i_r} = Σ_{i_1 ... i_r} Σ_{p_1 ... p_r} Σ_{q_1 ... q_r} Q_{i_1 p_1} ··· Q_{i_r p_r} Q_{i_1 q_1} ··· Q_{i_r q_r} Γ_{p_1 ... p_r} Γ_{q_1 ... q_r}.   (A.24)

Now put the first sum inside the two others, and remember that, because Q is orthogonal, we have Σ_i Q_ip Q_iq = δ_pq. Then (A.24) simplifies into

Σ_{p_1 ... p_r} Σ_{q_1 ... q_r} δ_{p_1 q_1} ··· δ_{p_r q_r} Γ_{p_1 ... p_r} Γ_{q_1 ... q_r},   (A.25)

which is eventually nothing else but the sum Σ_{p_1 ... p_r} Γ²_{p_1 ... p_r}, which does not depend on the matrix Q. These results also hold in the complex case, as shown in [15]. □

7.A.15. Proof of (3.11)
Ω_{3,4} can be shown to be invariant in exactly the same way as the proof of (3.10) was derived. In fact, replace the cumulants K as functions of the cumulants Γ and of the matrix Q, using the multilinearity relations (3.7) and (3.8). Note that if a row index of Q, say i, is present in a monomial of (3.11), then it appears twice. The sum over i can be pulled out and calculated first. Then the two terms, say Q_ia and Q_ib, disappear from (3.11) since, by orthogonality of Q,

Σ_i Q_ia Q_ib = δ_ab.   (A.26)

Then repeat the procedure as long as entries of Q are left in the expression of Ω_{3,4}. This finally gives the expression
Ω = Σ_{abcd} Γ²_abcd + 3 Σ_{abcdef} Γ_abc Γ_abd Γ_cef Γ_def + 4 Σ_{abcdef} Γ_abc Γ_ade Γ_bdf Γ_cef − 6 Σ_{abcde} Γ_abc Γ_ade Γ_bcde,   (A.27)

which proves the invariance of Ω_{3,4}. Note that this would also be true if Q were complex unitary, if the squares were replaced by squared moduli. □

7.A.16. Proof of Theorem 17

It is sufficient to look at the derivatives of a monomial. So define

ψ(Q) = Σ_i Π_{α∈S} K_{α*i},   (A.28)

where the notation α*i stands for a repetition of the index i α times and S is a finite set of integers. The differentiation yields

dψ(Q) = Σ_i Σ_{α∈S} dK_{α*i} Π_{β≠α} K_{β*i}.   (A.29)

Yet, we have the two relations below:

dK_{α*i} = α Σ_j dQ_ij K_{j,(α−1)*i},   (A.30)

dQ = dA Q,   (A.31)

for some skew-symmetric matrix dA. Stationary points of ψ(Q) just cancel the derivative for all perturbations dQ, and thus from (A.31) for all skew-symmetric matrices dA. Write any skew-symmetric matrix as a linear combination of the matrices A(p, q) defined by

A(p, q)_ij = δ_pi δ_qj − δ_pj δ_qi.

Consequently, the cancellation of (A.29) is equivalent to

Σ_{α∈S} α (K_{q,(α−1)*p} Π_{β≠α} K_{β*p} − K_{p,(α−1)*q} Π_{β≠α} K_{β*q}) = 0,   ∀p, q.   (A.32)

Reasoning in the same way for d²ψ(Q) shows that, at any stationary point, d²ψ contains only terms of the form

K_{α*i},   K_{j,(α−1)*i}   and   K_{2*i,(α−2)*j},   (A.33)

in which still only two distinct indices enter. In fact, let us give just the expression of d²ψ at stationary points:

d²ψ = Σ_{α∈S} α [ (α − 2) K_{2*q,(α−2)*p} Π_{β≠α} K_{β*p} + K_{q,(α−1)*p} Σ_{β≠α} β K_{q,(β−1)*p} Π_{γ≠α,β} K_{γ*p} − (α − 1) Π_{β} K_{β*p}
+ (α − 2) K_{2*p,(α−2)*q} Π_{β≠α} K_{β*q} + K_{p,(α−1)*q} Σ_{β≠α} β K_{p,(β−1)*q} Π_{γ≠α,β} K_{γ*q} − (α − 1) Π_{β} K_{β*q} ].   (A.34)

As a conclusion, dψ as well as d²ψ involve only pairwise cumulants. This proof would actually extend to dᴺψ. Let us end with an example. Suppose ψ(Q) = Σ_i K_{iiii} K²_{iii}; then S = {4, 3, 3} and (A.32) gives

4(K_pppq K²_ppp − K_pqqq K²_qqq) + 6(K_ppq K_ppp K_pppp − K_pqq K_qqq K_qqqq) = 0,   ∀p, q.   □

7.A.17. Proof of the contrast expression (4.2)

The proof is a little tedious but raises no difficulty. Assume that the pair of components to be processed is (1, 2) without restricting generality, and start with the definition of the contrast:

Ψ(p_z) = K²_1111 + K²_2222.   (A.35)

Next substitute back into (A.35) the expressions of the K_iiii as functions of θ, taking (3.8) and (4.1) into account:

(1 + θ²)² K_1111 = Γ_2222 θ⁴ + 4Γ_1222 θ³ + 6Γ_1122 θ² + 4Γ_1112 θ + Γ_1111,

(1 + θ²)² K_2222 = Γ_1111 θ⁴ − 4Γ_1112 θ³ + 6Γ_1122 θ² − 4Γ_1222 θ + Γ_2222,   (A.36)

and group the terms so that the expression depends only on ξ = θ − 1/θ. For the sake of simplicity, denote

a_0 = Γ_1111,   a_1 = 4Γ_1112,   a_2 = 6Γ_1122,   a_3 = 4Γ_1222,   a_4 = Γ_2222.   (A.37)
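For illustration, the relations (A.36) can be evaluated directly. The following sketch (Python, names illustrative) computes the pair contrast K²_1111 + K²_2222 of (A.35) as a function of θ = tan α, given estimates of the five cumulants Γ of the current pair; this is the quantity maximized at each Givens step:

import numpy as np

def pair_contrast(theta, G1111, G1112, G1122, G1222, G2222):
    # Contrast (A.35) of the rotated pair, using the relations (A.36).
    d = (1.0 + theta ** 2) ** 2
    K1111 = (G2222 * theta**4 + 4 * G1222 * theta**3 + 6 * G1122 * theta**2
             + 4 * G1112 * theta + G1111) / d
    K2222 = (G1111 * theta**4 - 4 * G1112 * theta**3 + 6 * G1122 * theta**2
             - 4 * G1222 * theta + G2222) / d
    return K1111 ** 2 + K2222 ** 2

# Coarse search over alpha in (-pi/4, pi/4], with theta = tan(alpha):
gammas = (1.0, 0.1, -0.2, 0.05, -1.0)   # example values of Gamma_1111, ..., Gamma_2222
alphas = np.linspace(-np.pi / 4 + 1e-3, np.pi / 4, 400)
best_alpha = max(alphas, key=lambda a: pair_contrast(np.tan(a), *gammas))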
Then the contrast (A.35) is of the form (4.2) if we denote the coefficients as follows:

b_4 = a_0² + a_4²,
b_3 = 2(a_3 a_4 − a_0 a_1),
b_2 = 4a_0² + 4a_4² + a_1² + a_3² + 2a_0 a_2 + 2a_2 a_4,
b_1 = 2(−3a_0 a_1 + 3a_3 a_4 + a_1 a_4 − a_0 a_3 + a_2 a_3 − a_1 a_2),
b_0 = 2(a_0² + a_1² + a_2² + a_3² + a_4² + 2a_0 a_2 + 2a_0 a_4 + 2a_1 a_3 + 2a_2 a_4).   (A.38)   □

7.A.18. Exact form of Eq. (4.4) defining the stationary points

Taking the derivative of ψ(ξ) given in (4.2) with respect to ξ yields directly (4.4) if we denote

c_4 = Γ_1111 Γ_1112 − Γ_2222 Γ_1222,
c_3 = v − 4(Γ²_1112 + Γ²_1222) − 3Γ_1122(Γ_1111 + Γ_2222),
c_2 = 3v,
c_1 = 3v − 2Γ_1111 Γ_2222 − 32Γ_1112 Γ_1222 − 36Γ²_1122,
c_0 = −4(v + 4c_4),   (A.39)

with

v = (Γ_1111 + Γ_2222 − 6Γ_1122)(Γ_1222 − Γ_1112).

7.A.19. Extension to the complex case

In the complex case, the characteristic function can be defined in the standard way, i.e. φ_y(u) = E{exp[i Re(y*u)]}, but cumulants can take various forms. Here we use Γ_ijkl = cum{y_i, y_j*, y_k*, y_l} as the definition of the cumulants. We consider plane rotations with real cosine and complex sine:

θ = s/c,   c² + |s|² = 1.   (A.40)

The output cumulants, given in the real case in (A.36), now take the following form [15]:

K_1111/c⁴ = Γ_2222 θ²θ*² + 2θθ*{Γ_2122 θ + Γ_1222 θ*} + 4Γ_1122 θθ* + {Γ_2112 θ² + Γ_1221 θ*²} + 2{Γ_1112 θ + Γ_1121 θ*} + Γ_1111,

K_2222/c⁴ = Γ_1111 θ²θ*² − 2θθ*{Γ_1112 θ + Γ_1121 θ*} + 4Γ_1122 θθ* + {Γ_2112 θ² + Γ_1221 θ*²} − 2{Γ_2122 θ + Γ_1222 θ*} + Γ_2222.   (A.41)

Then the contrast function, which is real by construction, can be calculated as a function of the variable ξ = θ − 1/θ*, since ψ is a rational function satisfying ψ(1/θ*) = ψ(θ). It can be shown that the precise form of ψ(ξ) is

ψ(ξ) = Σ_{i=−2}^{2} Σ_{j=−2}^{2},  0 ≤ i + j ≤ 4,   b_ij ξ^i ξ*^j / [ξξ* + 4]²,   (A.42)

where the coefficients b_ij are defined as follows. Denote for the sake of simplicity

a_0 = Γ_1111,   a_1 = 4Γ_1112,   a_2 = Γ_1122,   a_3 = 4Γ_1222,   a_4 = Γ_2222,   ã_2 = 2Γ_2112.   (A.43)

Then

b_22 = a_0² + a_4²,
b_21 = a_4 a_3* − a_0 a_1,   b_12 = b_21*,
b_11 = 4(a_0² + a_4²) + (a_1 a_3* + a_3 a_1*)/2 + 2a_2(a_0 + a_4),
b_20 = (a_1² + a_3*²)/4 + ã_2(a_0 + a_4),   b_02 = b_20*,
b_10 = a_2(a_1* − a_1) + ã_2(a_3 − a_1*)/2 + 3(a_4 a_3* − a_0 a_1) + (a_4 a_1 − a_0 a_3*),   b_01 = b_10*,
b_2,−1 = ã_2(a_3* − a_1)/2,   b_−1,2 = b_2,−1*,
b_1,−1 = 2a_2 ã_2 + (a_1² + a_3²)/2 + a_1 a_3* + 2a_2(a_0 + a_4),   b_−1,1 = b_1,−1*,
b_2,−2 = ã_2²/2,   b_−2,2 = b_2,−2*,
b_00 = 2a_0² + a_1 a_1* + a_2 ã_2 + 2a_2² + a_3 a_3* + 2a_4² + 4a_0 a_4 + a_1 a_3 + a_1* a_3* + 4a_2(a_0 + a_4).   (A.44)

The other coefficients are null. It may be checked that this expression of the contrast coincides with (A.38) when all the quantities involved are real.

Next, the stationary points of ψ can be calculated in various ways. One possibility is to resort to the equation

K_1111 K_1112 − K_1222 K_2222 = 0,   (A.45)

which is satisfied at any stationary point of the contrast (A.35) (relation (A.32) indeed extends to the complex case). The other possibility is to differentiate (A.42) with respect to the modulus ρ and the argument φ of ξ, respectively, and to set the derivatives to zero. We obtain a system of two polynomials in two variables, whose degrees in ρ are equal to 4 and 3, respectively. The exact form of the system obtained is not given here for reasons of space. It can be solved by first performing an elimination of ρ using successive greatest common divisors of coefficients (which are polynomials in e^{iφ}). Next, rooting the polynomial in e^{iφ} alone gives admissible values of φ. Then substitution in the original system allows one to obtain the corresponding admissible values of ρ. As in the real case, the pairs (ρ, φ) corresponding to maxima are selected by replacing ξ by ρe^{iφ} in the expression of the contrast.

7.A.20. Proof of (5.2) and (5.3)

Assume Â = AΛP. Then by definition the gap is a function of the matrix D' = P*D if we denote D = A⁻¹Â (in fact, A⁻¹AΛP = ΛP). However, a glance at (5.1) suffices to see that the order in which the rows of D are set does not change the expression of the gap e(A, Â).

Let us prove the converse implication. If e(A, Â) = 0, then each term in (5.1) must vanish; this yields

(a) Σ_j |D_ij| = 1,   (b) Σ_i |D_ij| = 1,   (c) Σ_j |D_ij|² = 1,   (d) Σ_i |D_ij|² = 1.   (A.46)
The first two relations show that the matrix D̃ of the moduli |D_ij| is bistochastic. Yet, we have shown in Appendix A.13 that, for any bistochastic matrix D̃, if there exists a vector ũ with real positive entries and at most one null component such that

||²D̃ ũ|| = ||ũ||,   (A.47)

then D̃ is a permutation. Here the result we have to prove is weaker. In fact, choose as vector ũ the vector formed of ones. From (A.46) we have

D̃ ũ = ²D̃ ũ = ũ,   (A.48)

which of course implies (A.47). Thus, D̃ is a permutation, and Â is of the expected form. □
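A small numerical check of this characterization can be sketched as follows (Python, names illustrative): given the matrix D compared in (5.1), the four conditions of (A.46) hold simultaneously exactly when every row and column of D contains a single non-zero entry of unit modulus, i.e. when D is a permutation scaled by unit-modulus factors.

import numpy as np

def is_scaled_permutation(D, tol=1e-10):
    # Conditions (a)-(d) of (A.46) on the moduli and squared moduli of D.
    A = np.abs(D)
    return (np.allclose(A.sum(axis=0), 1.0, atol=tol)
            and np.allclose(A.sum(axis=1), 1.0, atol=tol)
            and np.allclose((A ** 2).sum(axis=0), 1.0, atol=tol)
            and np.allclose((A ** 2).sum(axis=1), 1.0, atol=tol))

# Example: D = Lambda P with a unit-modulus diagonal Lambda passes the test.
P = np.eye(4)[[2, 0, 3, 1]]
Lam = np.diag([1.0, -1.0, 1.0, -1.0])
assert is_scaled_permutation(Lam @ P)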
7.A.21. Description of the 'validation' algorithm

This algorithm aims to provide the ultimate performances when the statistics are perfectly known. It is supposed that M is known, as well as the vector of source and noise cumulants, denoted by κ_pppp in this appendix, 1 ≤ p ≤ 2N. The algorithm differs from Algorithm 18 only in that the moments are deduced from the current transformation and the source cumulants.

ALGORITHM 21 (Validation).
(1) Compute the SVD of the matrix AA = [(1 − μ)M   μβI] as AA = VΣU*, where V is N × p, Σ is p × p and U* is p × 2N, and p denotes the rank of AA. Set L = VΣ.
(2) Initialize F = L.
(3) Begin a loop on the sweeps: k = 1, 2, ..., kmax; kmax ≤ 1 + √p.
(4) Sweep the p(p − 1)/2 pairs (i, j), according to
a fixed ordering. For each pair:
(a) Use the multilinearity relation (3.8) in order to compute the required cumulants Γ(i, j) as functions of the cumulants κ_pppp and of the matrices AA and F.
(b) Find the angle α maximizing Ψ(Q^(i,j)), where Q^(i,j) is the plane rotation of angle α, α ∈ ]−π/4, π/4], in the plane defined by the components {i, j}. Use the expressions given in Appendices A.17 and A.18 for this purpose.
(c) Accumulate F := F Q^(i,j)*.
(5) End the loop on k if k = kmax or if all the estimated angles are small (compared to machine precision).
(6) Compute the norms of the columns of F: Λ_ii = ||F_.i||.
(7) Sort the entries of Λ in decreasing order: Λ := PΛP* and F := FP*.
(8) Normalize F by the transform F := FΛ⁻¹.
(9) Fix the phase (sign) of each column of F according to Definition 4(e). This yields F := FD.
Source MATLAB codes can be obtained from the
author upon request. []
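Step 4(a) relies only on the multilinearity of cumulants. A minimal sketch is given below (Python/NumPy, names illustrative): if the current components are z = Gs, where s gathers the 2N independent source and noise variables with marginal fourth-order cumulants κ_pppp and all cross-cumulants of s are zero, then the fourth-order cumulants of z follow from (3.8):

import numpy as np

def fourth_order_cumulants(G, kappa):
    # K_ijkl(z) for z = G s, where s has independent components with marginal
    # fourth-order cumulants kappa[p] (all cross-cumulants of s are zero).
    return np.einsum('ip,jp,kp,lp,p->ijkl', G, G, G, G, kappa)

# The pair cumulants Gamma(i, j) needed in step 4(a) are then read off this array,
# e.g. Gamma_1112 = K[i, i, i, j] for the pair (i, j).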
8. Acknowledgments
The author wishes to thank Albert Groenenboom, Serge Prosperi and Peter Bunn for their
proofreading of the paper, which unquestionably
improved its clarity. References [44], indicated by
Ananthram Swami, and [6], indicated by Albert
Benveniste some years ago, were also helpful. Lastly, the author is indebted to the reviewers, who
detected a couple of mistakes in the original submission, and permitted a definite improvement in
the paper.
9. References
[1] M. Abramowitz and I.A. Stegun, Handbook of Mathematical Functions, Dover, New York, 1965, 1972.
[2] Y. Bar-Ness, "Bootstrapping adaptive interference cancelers: Some practical limitations", The Globecom Conf., Miami, November 1982, Paper F3.7, pp. 1251-1255.
[3] S. Bellini and R. Rocca, "Asymptotically efficient blind deconvolution", Signal Processing, Vol. 20, No. 3, July 1990, pp. 193-209.
[4] M. Basseville, "Distance measures for signal processing and pattern recognition", Signal Processing, Vol. 18, No. 4, December 1989, pp. 349-369.
[5] A. Benveniste, M. Goursat and G. Ruget, "Robust identification of a nonminimum phase system", IEEE Trans. Automat. Control, Vol. 25, No. 3, 1980, pp. 385-399.
[6] R.E. Blahut, Principles and Practice of Information Theory, Addison-Wesley, Reading, MA, 1987.
[7] D.R. Brillinger, Time Series, Data Analysis and Theory, Holden-Day, San Francisco, CA, 1981.
[8] J.F. Cardoso, "Sources separation using higher order moments", Proc. Internat. Conf. Acoust. Speech Signal Process., Glasgow, 1989, pp. 2109-2112.
[9] J.F. Cardoso, "Super-symmetric decomposition of the fourth-order cumulant tensor, blind identification of more sources than sensors", Proc. Internat. Conf. Acoust. Speech Signal Process. 91, 14-17 May 1991.
[10] J.F. Cardoso, "Iterative techniques for blind sources separation using only fourth order cumulants", Conf. EUSIPCO, 1992, pp. 739-742.
[11] T.F. Chan, "An improved algorithm for computing the singular value decomposition", ACM Trans. Math. Software, Vol. 8, March 1982, pp. 72-83.
[12] P. Comon, "Separation of sources using high-order cumulants", SPIE Conf. on Advanced Algorithms and Architectures for Signal Processing, Vol. Real-Time Signal Processing XII, San Diego, 8-10 August 1989, pp. 170-181.
[13] P. Comon, "Separation of stochastic processes", Proc. Workshop Higher-Order Spectral Analysis, Vail, Colorado, June 1989, pp. 174-179.
[14] P. Comon, "Process and device for real-time signals separation", Patent registered for Thomson-Sintra, No. 9000436, December 1989.
[15] P. Comon, "Analyse en Composantes Indépendantes et Identification Aveugle", Traitement du Signal, Vol. 7, No. 5, December 1990, pp. 435-450.
[16] P. Comon, "Independent component analysis", Internat. Signal Processing Workshop on High-Order Statistics, Chamrousse, France, 10-12 July 1991, pp. 111-120; republished in J.L. Lacoume, ed., Higher-Order Statistics, Elsevier, Amsterdam, 1992, pp. 29-38.
[17] P. Comon, "Kernel classification: The case of short samples and large dimensions", Thomson ASM Report, 93/S/EGS/NC/228.
[18] H. Cramér, Random Variables and Probability Distributions, Cambridge Univ. Press, Cambridge, 1937.
[19] G. Darmois, "Analyse Générale des Liaisons Stochastiques", Rev. Inst. Internat. Stat., Vol. 21, 1953, pp. 2-8.
[20] G. Desodt and D. Muller, "Complex independent component analysis applied to the separation of radar signals", in: Torres, Masgrau and Lagunas, eds., Proc. EUSIPCO Conf., Barcelona, Elsevier, Amsterdam, 1990, pp. 665-668.
[21] D.L. Donoho, "On minimum entropy deconvolution", Proc. 2nd Appl. Time Series Symp., Tulsa, 1980; reprinted in Applied Time Series Analysis II, Academic Press, New York, 1981, pp. 565-608.
[22] D. Dugué, "Analycité et convexité des fonctions caractéristiques", Annales de l'Institut Henri Poincaré, Vol. XII, 1951, pp. 45-56.
[23] P. Duvaut, "Principes de méthodes de séparation fondées sur les moments d'ordre supérieur", Traitement du Signal, Vol. 7, No. 5, December 1990, pp. 407-418.
[24] L. Fety, Méthodes de traitement d'antenne adaptées aux radiocommunications, Doctorate Thesis, ENST, 1988.
[25] P. Flandrin and O. Michel, "Chaotic signal analysis and higher order statistics", Conf. EUSIPCO, Brussels, 24-27 August 1992, pp. 179-182.
[26] P. Forster, "Bearing estimation in the bispectrum domain", IEEE Trans. Signal Processing, Vol. 39, No. 9, September 1991, pp. 1994-2006.
[27] J.C. Fort, "Stability of the JH sources separation algorithm", Traitement du Signal, Vol. 8, No. 1, 1991, pp. 35-42.
[28] B. Friedlander and B. Porat, "Asymptotically optimal estimation of MA and ARMA parameters of non-Gaussian processes from high-order moments", IEEE Trans. Automat. Control, Vol. 35, January 1990, pp. 27-35.
[29] M. Gaeta, Les Statistiques d'Ordre Supérieur Appliquées à la Séparation de Sources, Ph.D. Thesis, Université de Grenoble, July 1991.
[30] M. Gaeta and J.L. Lacoume, "Source separation without a priori knowledge: The maximum likelihood solution", in: Torres, Masgrau and Lagunas, eds., Proc. EUSIPCO Conf., Barcelona, Elsevier, Amsterdam, 1990, pp. 621-624.
[31] E. Gassiat, Déconvolution Aveugle, Ph.D. Thesis, Université de Paris Sud, January 1988.
[32] G. Giannakis and S. Shamsunder, "Modeling of non-Gaussian array data using cumulants: DOA estimation with less sensors than sources", Proc. 25th Ann. Conf. on Information Sciences and Systems, Baltimore, 20-22 March 1991, pp. 600-606.
[33] G. Giannakis and A. Swami, "On estimating noncausal nonminimum phase ARMA models of non-Gaussian processes", IEEE Trans. Acoust. Speech Signal Process., Vol. 38, No. 3, March 1990, pp. 478-495.
[34] G. Giannakis and M.K. Tsatsanis, "A unifying maximum-likelihood view of cumulant and polyspectral measures for non-Gaussian signal classification and estimation", IEEE Trans. Inform. Theory, Vol. 38, No. 2, March 1992, pp. 386-406.
[35] G. Giannakis, Y. Inouye and J.M. Mendel, "Cumulant-based identification of multichannel moving average models", IEEE Trans. Automat. Control, Vol. 34, July 1989, pp. 783-787.
[36] Y. Inouye and T. Matsui, "Cumulant based parameter estimation of linear systems", Proc. Workshop Higher-Order Spectral Analysis, Vail, Colorado, June 1989, pp. 180-185.
[37] C. Jutten and J. Herault, "Blind separation of sources, Part I: An adaptive algorithm based on neuromimetic architecture", Signal Processing, Vol. 24, No. 1, July 1991, pp. 1-10.
[38] A.M. Kagan, Y.V. Linnik and C.R. Rao, Characterization Problems in Mathematical Statistics, Wiley, New York, 1973.
[39] M. Kendall and A. Stuart, The Advanced Theory of Statistics, Vol. 1, New York, 1977.
[40] S. Kotz and N.L. Johnson, Encyclopedia of Statistical Sciences, Wiley, New York, 1982.
[41] J.L. Lacoume, "Tensor-based approach of random variables", Talk given at ENST Paris, 7 February 1991.
[42] J.L. Lacoume and P. Ruiz, "Extraction of independent components from correlated inputs, A solution based on cumulants", Proc. Workshop Higher-Order Spectral Analysis, Vail, Colorado, June 1989, pp. 146-151.
[43] P. McCullagh, "Tensor notation and cumulants", Biometrika, Vol. 71, 1984, pp. 461-476.
[44] P. McCullagh, Tensor Methods in Statistics, Chapman and Hall, London, 1987.
[45] C.L. Nikias, "ARMA bispectrum approach to nonminimum phase system identification", IEEE Trans. Acoust. Speech Signal Process., Vol. 36, April 1988, pp. 513-524.
[46] C.L. Nikias and M.R. Raghuveer, "Bispectrum estimation: A digital signal processing framework", Proc. IEEE, Vol. 75, No. 7, July 1987, pp. 869-891.
[47] B. Porat and B. Friedlander, "Direction finding algorithms based on higher-order statistics", IEEE Trans. Signal Processing, Vol. 39, No. 9, September 1991, pp. 2016-2024.
[48] J.G. Proakis and C.L. Nikias, "Blind equalization", in: Adaptive Signal Processing, Proc. SPIE, Vol. 1565, 1991, pp. 76-85.
[49] C.R. Rao, Linear Statistical Inference and its Applications, Wiley, New York, 1973.
[50] P.A. Regalia and P. Loubaton, "Rational subspace estimation using adaptive lossless filters", IEEE Trans. Acoust. Speech Signal Process., July 1991, submitted.
[51] G. Scarano, "Cumulant series expansion of hybrid nonlinear moments of complex random variables", IEEE Trans. Signal Processing, Vol. 39, No. 4, April 1991, pp. 1001-1003.
[52] O. Shalvi and E. Weinstein, "New criteria for blind deconvolution of nonminimum phase systems", IEEE Trans. Inform. Theory, Vol. 36, No. 2, March 1990, pp. 312-321.
[53] S. Shamsunder and G.B. Giannakis, "Modeling of non-Gaussian array data using cumulants: DOA estimation of more sources with less sensors", Signal Processing, Vol. 30, No. 3, February 1993, pp. 279-297.
[54] J.E. Shore and R.W. Johnson, "Axiomatic derivation of the principle of maximum entropy", IEEE Trans. Inform. Theory, Vol. 26, January 1980, pp. 26-37.
[55] A. Souloumiac and J.F. Cardoso, "Comparaisons de Méthodes de Séparation de Sources", XIIIth Coll. GRETSI, September 1991, pp. 661-664.
[56] A. Swami and J.M. Mendel, "ARMA systems excited by non-Gaussian processes are not always identifiable", IEEE Trans. Automat. Control, Vol. 34, No. 5, May 1989, pp. 572-573.
[57] A. Swami and J.M. Mendel, "ARMA parameter estimation using only output cumulants", IEEE Trans. Acoust. Speech Signal Process., Vol. 38, July 1990, pp. 1257-1265.
[58] A. Swami, G.B. Giannakis and S. Shamsunder, "A unified approach to modeling multichannel ARMA processes using cumulants", IEEE Trans. Signal Process., submitted.
[59] L. Swindlehurst and T. Kailath, "Detection and estimation using the third moment matrix", Internat. Conf. Acoust. Speech Signal Process., Glasgow, 1989, pp. 2325-2328.
[60] L. Tong, V.C. Soon and R. Liu, "AMUSE: A new blind identification algorithm", Proc. Internat. Conf. Acoust. Speech Signal Process., 1990, pp. 1783-1787.
[61] L. Tong et al., "A necessary and sufficient condition of blind identification", Internat. Signal Processing Workshop on High-Order Statistics, Chamrousse, France, 10-12 July 1991, pp. 261-264.
[62] J.K. Tugnait, "Identification of linear stochastic systems via second and fourth order cumulant matching", IEEE Trans. Inform. Theory, Vol. 33, May 1987, pp. 393-407.
[63] J.K. Tugnait, "Approaches to FIR system identification with noise data using higher-order statistics", Proc. Workshop Higher-Order Spectral Analysis, Vail, June 1989, pp. 13-18.
[64] J.K. Tugnait, "Comments on new criteria for blind deconvolution of nonminimum phase systems", IEEE Trans. Inform. Theory, Vol. 38, No. 1, January 1992, pp. 210-213.
[65] D.L. Wallace, "Asymptotic approximations to distributions", Ann. Math. Statist., Vol. 29, 1958, pp. 635-654.