An Alternative Natural Gradient Approach For ICA B

Download as pdf or txt
Download as pdf or txt
You are on page 1of 5

See discussions, stats, and author profiles for this publication at: https://www.researchgate.

net/publication/235683846

An alternative Natural Gradient Approach for ICA based learning algorithms


in Blind Source Separation

Conference Paper · September 2004

CITATIONS READS

7 77

3 authors, including:

Stefano Squartini
Università Politecnica delle Marche
202 PUBLICATIONS   1,480 CITATIONS   

SEE PROFILE

Some of the authors of this publication are also working on these related projects:

IEEE Italy Section Consumer Electronics Society Chapter View project

Computational Energy Management View project

All content following this page was uploaded by Stefano Squartini on 31 May 2014.

The user has requested enhancement of the downloaded file.


AN ALTERNATIVE NATURAL GRADIENT APPROACH FOR ICA BASED
LEARNING ALGORITHMS IN BLIND SOURCE SEPARATION
Andrea Arcangeli, Stefano Squartini, Francesco Piazza
DEIT, Università Politecnica delle Marche
Via Brecce Bianche 60131 Ancona, Italy (Europe)
Phone: +39 0712204453, fax: +390712204835, email: [email protected]

ABSTRACT ferent algorithms) as well. Such rules have relevant proper-


In this paper a new formula for natural gradient based learn- ties, like Newton-like performances at gradient cost, and
ing in blind source separation (BSS) problem is derived. This equivariance [4]. These also occur if relative gradient, de-
represents a different gradient from the usual one in [1], but fined by Cardoso and Laheld [5], is applied.
can still considered natural since it comes from the definition The same has been done in other relevant cases, as blind de-
of a Riemannian metric in the matrix space of parameters. convolution problem and adaptation of multilayer neural
The new natural gradient consists on left multiplying the network. Moreover, it has also been shown that natural gra-
standard gradient for an adequate term depending on the pa- dient online learning is asymptotically Fisher efficient [1].
rameter matrix to adapt, whereas the other one considers a In all these considerations the Riemannian metric has been
right multiplication. The two natural gradients have been given without considering the chance of defining different
employed in two ICA based learning algorithms for BSS and ones that could lead to derivation of different natural gradi-
it resulted they have identical behavior. ents. This is what the authors have tried to show in the pre-
sent paper, addressing the BSS problem. Moving from the
1. INTRODUCTION same idea developed by Amari in [1] to define the Rieman-
nian metric, a new tensor G ( ω ) is formulated and accord-
Many optimization tasks in literature are based on the mini-
mization (or maximization) of a cost function l, properly ingly a new natural gradient formula. The usual one and the
defined in relation to the addressed problem. The most used novel one have been implemented in two learning ap-
technique is the gradient descent (or ascent), that searches the proaches for BSS and interesting results obtained.
global critical points of l by employing the gradient to the
surface determined by l. A relevant improvement has been 2. BLIND SOURCE SEPARATION
recently introduced in literature to speed such a searching up: As addressed in literature, BSS [6], [7] is the problem of re-
it consists on substituting the standard gradient with a new covering the original m-sources vector u(k ) mixed by a non-
one, namely the natural gradient. This is justified by the
singular m-by-n matrix A , the mixing matrix, when both are
property of being Riemannian of the space of parameter in- unknown and the only available information is the mixed n-
terested to optimization, as shown in [1], [2]. Therefore a
signals vector x = Au , a part from hypotheses of statistical
Riemannian metric must be known and in dependence of that
independence and non-gaussianity of input sources and
the direction of gradient can be adjusted to make conver-
n > m . This is achieved by determining an n-by-m ma-
gence faster. The measure of such an adjustment has been
derived by Amari [1] in his steepest descent theorem, accord- trix W , the demixing matrix, such that the resulting output
ing to which, given a generic Riemannian space y = Wx is equal to u(k ) up to permutation and scaling ma-
S = {ω ∈ R n } and a cost function l defined on it, the steepest trices, P and D respectively. That can be expressed as:
descent direction is: y = Wx = WAu = DPu (2)
 l ( ω ) = G −1 ( ω ) ∇l ( ω )
∇ (1) The overall structure of BSS is depicted in Figure 1, where
the contribute of noise v (k ) is also taken into account. We
where ∇l ( ω ) is the conventional gradient, ∇
 l ( ω ) is the
shall fix v (k ) to zero in the following.
here-defined natural gradient, and G −1 ( ω ) is the inverse of Different methods have been proposed in literature to solve
the metric tensor. It is obvious that derivation of suitable this kind of problem, generally borrowed from Information
formulas for natural gradient is strictly linked to the problem Theory. We are going to consider here two ICA based ap-
under study, and it necessarily passes through the calculation proaches: one is based on minimization of mutual informa-
of the metric. tion whereas the other, proved to be equivalent to maximum
Here blind source separation (BSS) is considered, and likelihood approach, on maximization of output entropy (the
G −1 ( ω ) has been already derived in literature [1], [3] and Infomax approach). It has been shown that they yield the
same solution under ideal condition of perfect reconstruction.
corresponding natural gradient learning rules (relative to dif-

593
Both of them consist on adapting the de-mixing matrix by usual matrix product as operation of multiplication between
using gradient information deriving from derivatives of suit- two elements of the group.
able cost function l ( x, W ) respect with the elements of W . We are going to show two different but related ways to de-
fine a Riemannian metric on W , i.e. the tensor field
The learning rule can be written as follows:
G ( W ) determining a W -point wise inner product for the
∆W = ±η∇l ( x, W ) (3)
tangent space TW W [8]. Both of them will get started from
where the sign indeterminacy allows to consider both maxi- definition of such an inner product by means of a two-step
mization and minimization of l ( x, W ) . Equation (3) obvi- procedure:
1) Definition of the inner product relative to a point in
ously assumes different forms in dependence of the cost W (the identity matrix I , the neutral element of the
function chosen. The first ICA based method addressed
group).
minimizes the Kullback-Leibler divergence D f f ( W ) be- 2) Imposing that the W -point wise inner product be in-
tween two proper distributions: the probability density func- variant to translations in W .
tion f y ( y, W ) parameterized by W , and the corresponding Such a translation operation is nothing but a function that
m
allows to move in the parameter space. It is defined as fol-
factorial distribution f y ( y , W ) = ∏ f yi ( yi , W ) , that is the lows, taking into account that W is a non-commutative
i =1 group:
product of all marginal probability density functions of out-
TV : W → W T:W → W
put y . As derived in [6], [7], the final formula for the stan- V

dard gradient based learning rule is the following: TV ( W ) = W ⋅ V V T(W) = V ⋅ W ∀V ∈ W


Right Translation Left Translation
∆W = η  I − ϕ ( y ) y T  W −T (4)
Hence, given a curve γ (t ) :[−1,1] → W , γ (0) = V , we can
where ϕ ( y ) is the activation function and T stands for trans- also apply the translation operation to the curve γ (t ) and to
position. Infomax maximizes the entropy H ( z ) where its derivative:
z = g ( y ) and g is the final nonlinearity whose shape de- d γ (t )
γ (0) = ∈ TV W
pends on the knowledge of the prior distribution of sources. dt t = 0
Under this hypothesis, the learning rule is:
Canonical inner product has been chosen as inner product at
∆W = η  W −T + (1 − 2z ) xT  (5) the “starting point” I , with A = γ (0), B = γ '(0) tangent
vectors belonging to TI W :
as derived in [6], [7], always valid in case of standard gradi-
ent. We are going to consider square dimensions for the in- m m

volved matrices in the following. As already done, the time A, B I


= ∑∑ aij bij
i =1 j =1
step will be omitted to simplify notation.
Now, we can proceed with the second step of aforementioned
procedure: it requires to distinguish between the two possible
v(k) translation modes.
u(k) x(k) y(k)
3.1. Right Translation
A Σ W
Here we require that:
,B
A  = TW ( A), TW (B) =
W
Figure 1. Structure of blind source separation problem.
W
(6)
A ⋅ W, B ⋅ W W
 A, B I

3. NATURAL GRADIENTS IN MATRIX SPACES The bi-linearity property of inner product and easy calcula-
tions let the following hold:
As aforementioned, the natural gradient, as defined in (1),
represents the steepest descent direction of cost function ,B
A 
W
= ∑ a b G (W )
ij kl ij , kl (7)
l ( x, W ) to maximize (or minimize) in a Riemannian space.
i , j , k ,l

Consequently, this links the natural gradient definition to that  (W) = E , E


where G and the only Eij entry not
ij , kl ij kl W
one of Riemannian metric over the parameter space. Such a
space is the space W of invertible matrix Wm× m on R and equal to zero is in location ( i, j ) . Observing that (6) must be
it satisfies all properties of Lie groups, if we consider the valid also for W -element like Eij , the following equation

594
can be derived to describe the tensor defining the Rieman- where H W = ( hijW ) = ( W −1 ) W −1 and the new tensor of the
T

nian metric:  
 ( W ) = E ⋅ W −1 , E ⋅ W −1 new metric is Gij ,kl ( W ) = W −1 ⋅ Eij , W −1 ⋅ Ekl .
Gij , kl ij kl =  I
I
Now we can proceed as done before in right natural gradient
m (8)
= δ ik ∑ W −1 jc W −1lc = δ ik hjlW case, observing first that:
c =1
( )
m
= ∑ ( Ac ) H W Bc
T
A, B
( )
= hijW = W −1 ( W )     
W
 W −1 T
where H . c =1

A part from different notation, what just described is nothing and then applying the steepest descent theorem separately for
but what Amari derived in [1], as (8) states. This let us an- each column of the gradient ∇l ( W ) calculated at W .
ticipate the final formula for natural gradient associated with
Hence, we can move from ∇ c l ( W ) = ( H W ) ∇ c l ( W ) and
−1
the aforementioned metric, namely right natural gradient:  
 l ( W ) = ∇l ( W ) W T W
∇ (9) calculate ( H ) = WW , to derive finally the expression
W −1 T


of left natural gradient:
Derivation of (9) gets started from substituting (8) in (7) and
operating as follows: ∇l ( W ) = WWT ∇l ( W ) (11)

m m m m m m m
,B
A  = ∑∑∑∑ aij bkl δ ik hjlW = ∑∑∑ aij bkl hjlW 4. EXPERIMENTAL RESULTS
W
i =1 j =1 k =1 l =1 r =1 j =1 l =1
In this section performances of the two ICA based learning
that can be further reduced as algorithms are analyzed both when natural gradients and
standard gradient are applied. Application of (9) and (11) to
(( )
m
,B
A 
W
=∑ A
 H
r
 W (B
 )T .
r ) (4) and (5) leads to the following natural gradient based
r =1 learning rules:
This authorizes us to apply the theorem of steepest descent
∆W = η  I − ϕ ( y ) y T  W right
(1) separately for each row of the gradient ∇l ( W ) calculated (12)
∆W = η W I − WT ϕ ( y ) xT  left
at W . In such a way we can identify the inverse of G ( W )
 W and write:
as the inverse of H ∆W = η  I + (1 − 2z ) y T  W right
(13)
( ∇ l ( W ) ) = ( H ) ( ∇ l ( W ) ) W −1
T
∆W = η W I + WT (1 − 2z ) xT 
T
r r (10) left

 W = WWT by
It can be straightforward derived that H Looking at (12) and (13), it seems that that the equivariance

(W ) = (( W ) ) (( W ) ) W .
−1 −1 T property is not satisfied by the new natural gradient. Indeed,
−1
(W )
−1 T −1 T
(W ) −1 −1
= −1 −1
being C ( k ) = W ( k ) A the global matrix describing the
Substituting this result in (10) and addressing all gradient overall system, we can not specify the updating rule for the
rows, it results that ( ∇
 l ( W ) )T = WT W ( ∇l ( W ) )T and con- global matrix only in function of itself, as conversely it does
in case of right natural gradient [3], [7]:
sequently (9).
∆C = η C − WWT ϕ ( y ) xT A  ICA based method 1
3.2. Left Translation
∆C = η C + WW (1 − 2z ) x A 
T T
ICA based method 2
Here we require that:
However, such property is satisfied again if we describe the
A, B W = WT ( A), WT (B) W = W ⋅ A, W ⋅ B W  A, B I system in Figure 1, interpreting u(k ) as a row-vector, in-
 
stead of a column vector, as done till now. Equation (2) can
The following equalities can be derived through similar con- be now written as:
siderations as before:
y = xW = uAW = uDP
A, B
  W
= ∑ a b G (W ) ij kl ij , kl
i , j , k ,l while, the learning rules (4) and (5) become:
m
Gij , kl ( W ) = δ jl ∑ W −1ic W −1ck = δ jl hikW ∆W = η W −T  I − y T ϕ ( y )  ICA based method 1
 c =1 
∆W = η  W −T + xT (1 − 2z )  ICA based method 2

595
allowing us to derive the following ones in global matrix 5. CONCLUSION
notation (now C ( k ) = AW ( k ) ), in case of left natural gra- A mathematical demonstration for derivation of a new natu-
dient: ral gradient based learning rule for BSS problem has been
provided. The definition of a different Riemannian metric
∆C = η C  I − y T ϕ ( y )  ICA based method 1 from the usual one allowed to get an original natural gradient
that behaves as well as that one proposed in [1], when ap-
∆C = η C  I + y (1 − 2z )  ICA based method 2
T
plied to ICA and Infomax learning algorithms. This fact has
The example dealt with considers a system involving relevant implications, since we can assume to derive other
different natural gradients once we will be able to define
u1 = sin(400k)cos 2 (30K) u2 = sign(sin(150k+15cos(30k)))
proper Riemannian metrics in the parameter space. Conse-
as the two independent sources, while the mixing matrix is quently, we could be interested to compare and rate their
A = [0.56,0.79;-0.75,0.65] . Simulations have been per- performances when applied to BSS learning algorithms.
formed by using the batch version of considered learning These aspects are actually under study. Another point to in-
algorithms, leaving unchanged the parameter values: number vestigate should be the calculation of original natural gradi-
of iterations, learning rate η , number of signal samples and ents in other spaces, as those ones of perceptrons and dy-
epoch size. Figures 2-3 and Table 1 show how the employ- namical linear systems.
ment of natural gradient allows to get a relevant improve-
ment respect with the standard gradient, while right and left 0.2
BSS error 1

translation based versions have basically identical behaviour.


0.15 standard
right
left
S/N ratio
0.1
[dB]

Learning algo- st
1 channel S/N 2nd channel S/N 0.05

rithms [dB] [dB] 0


0 20 40 60 80 100 120 140 160 180 200
iterations
Standard gradient Infomax 12.94 9.89 BSS error 2
0.4
Right natural gradient
93.39 94.34
Infomax 0.3 standard
right
Left natural gradient In- left
S/N ratio

96.20 89.77
fomax 0.2
[dB]

Standard gradient ICA 14.52 10.22 0.1

0
Right natural gradient ICA 118.93 120.41 0 20 40 60 80 100 120 140 160 180 200
iterations

Left natural gradient ICA 124.20 102.04


Figure 3. Decaying shapes of S/N ratios of two output signals
for three cases under study.
Table 1. S/N ratios in all cases addressed. They are relative to
the number of iterations at which natural gradient based algo-
rithms get convergence. REFERENCES
[1] S.I. Amari, “Natural Gradient works efficiently in learn-
1
Source 1
8
Source Pdf 1
1.5
Mixed 1
5
Out 1 (Gain=3.99) ing”, Neural computation, vol. 10, pp. 251-276,1998.
7
1
[2] S. Douglas and S. Amari, ''Natural-gradient adaptation'',
0.5 6

5
0.5 Chapter 2 (pp. 13-61) in S. Haykin (ed.), Unsupervised
Adaptive Filtering, vol. I, Wiley 2000.
dens ity
am plitude

0 4 0 0

-0.5
3

2
-0.5 [3] A. Cichocki, and S.I. Amari, Adaptive Blind Signal and
1
-1 Image Processing, Wiley&Sons, England, 2002.
-1
0 5000 10000
0
-1 0 1
-1.5
0 5000 10000
-5
0 5000 10000 [4] J.F. Cardoso, “Learning in Manifolds: the case of Source
samples amplitude
Separation”, Proceedings of IEEE SSAP, USA, 1998.
Source 2 Source Pdf 2 Mixed 2 Out 2 (Gain=-3.85)
1 7 1.5 4
[5] J.F. Cardoso and B.H. Laheld, “Equivariant adaptive
source separation”, IEEE Trans. Signal Processing, vol.44,
6 3
1
0.5 2
5
0.5
1 pp. 3017-3030, December 1996.
am plitude

4
dens ity

0
3
0 0

-1
[6] T.W.Lee, M. Girolami, A. J. Bell, and T.J.Sejnowski, “A
unifying information-theoretic framework for indipendent
-0.5
2
-0.5 -2
-1

-1
1

0 -1.5
-3

-4
component analysis”, International journal of computers and
0 5000
samples
10000 -1 0
amplitude
1 0 5000 10000 0 5000 10000
mathemathics with applications, 1999.
[7] S.Haykin, Neural Networks – A comprehensive Founda-
Figure 2. Signals involved in BSS problem: sources, mixed, tion. Prentice Hall, February 1999.
and recovered. They are valid for all addressed algorithms (at [8] W. Boothby, An Introduction to Differentiable Manifolds
convergence point). and Riemannian Geometry. Academic Press, 2nd ed., 1986.

596
View publication stats

You might also like