Both of them consist of adapting the de-mixing matrix using gradient information derived from the derivatives of a suitable cost function l(x, W) with respect to the elements of W. The learning rule can be written as follows:

∆W = ±η ∇l(x, W)    (3)

where the sign indeterminacy allows one to consider both maximization and minimization of l(x, W). Equation (3) obviously assumes different forms depending on the cost function chosen. The first ICA based method addressed minimizes the Kullback-Leibler divergence D_{f f̃}(W) between two proper distributions: the probability density function f_y(y, W) parameterized by W, and the corresponding factorial distribution f̃_y(y, W) = ∏_{i=1}^{m} f_{y_i}(y_i, W), that is, the product of all marginal probability density functions of the output y. As derived in [6], [7], the final formulas for the standard gradient learning rules of the two methods (the second one being the Infomax algorithm) are:

∆W = η [W^{-T} − ϕ(y) x^T]    ICA based method 1    (4)

∆W = η [W^{-T} + (1 − 2z) x^T]    ICA based method 2    (5)
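As a concrete illustration, a minimal NumPy sketch of the two standard gradient batch updates is given below; it follows the forms of (4) and (5) above, with the score function ϕ(·) = tanh(·) for method 1 and z the logistic sigmoid of y for method 2 taken as assumptions, and with illustrative names and defaults.

    import numpy as np

    def standard_gradient_updates(W, x, eta=0.01):
        # W: (m, m) de-mixing matrix, x: (m, N) batch of mixed signals
        N = x.shape[1]
        y = W @ x                          # separated outputs
        phi = np.tanh(y)                   # assumed score function for method 1
        z = 1.0 / (1.0 + np.exp(-y))       # logistic nonlinearity for method 2 (Infomax)
        W_invT = np.linalg.inv(W).T
        dW1 = eta * (W_invT - (phi @ x.T) / N)               # eq. (4), ICA based method 1
        dW2 = eta * (W_invT + ((1.0 - 2.0 * z) @ x.T) / N)   # eq. (5), ICA based method 2
        return dW1, dW2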
3. NATURAL GRADIENTS IN MATRIX SPACES

As aforementioned, the natural gradient, as defined in (1), represents the steepest descent direction of the cost function l(x, W) to be maximized (or minimized) in a Riemannian space. The parameter space W of non-singular de-mixing matrices is a group, with the usual matrix product as the operation of multiplication between two elements of the group.
We are going to show two different but related ways to define a Riemannian metric on W, i.e. the tensor field G(W) determining a W-pointwise inner product for the tangent space T_W W [8]. Both of them start from the definition of such an inner product by means of a two-step procedure:
1) definition of the inner product relative to a point in W (the identity matrix I, the neutral element of the group);
2) imposition that the W-pointwise inner product be invariant to translations in W.
Such a translation operation is nothing but a function that allows one to move in the parameter space. It is defined as follows, taking into account that W is a non-commutative group:

T_V : W → W,  T_V(W) = W V    (right translation)
T_V : W → W,  T_V(W) = V W    (left translation)

The bi-linearity property of the inner product and some easy calculations let the following hold:

⟨A, B⟩_W = Σ_{i,j,k,l} a_{ij} b_{kl} G_{ij,kl}(W)    (7)
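Spelling out the invariance requirement explicitly (our rephrasing, consistent with (8) below and with Section 3.2), the two choices of translation amount to the two conditions

⟨A, B⟩_W = ⟨A W^{-1}, B W^{-1}⟩_I    (right translation)
⟨A, B⟩_W = ⟨W^{-1} A, W^{-1} B⟩_I    (left translation)

so that in both cases the metric at a generic W is obtained by transporting tangent vectors back to the identity and using the inner product defined there.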
3.1. Right Translation

Imposing invariance with respect to the right translation, the following expression can be derived to describe the tensor defining the Riemannian metric:

G_{ij,kl}(W) = ⟨E_{ij} ⋅ W^{-1}, E_{kl} ⋅ W^{-1}⟩_I = δ_{ik} Σ_{c=1}^{m} (W^{-1})_{jc} (W^{-1})_{lc} = δ_{ik} h^W_{jl}    (8)

where H^W = (h^W_{ij}) = W^{-1} (W^{-1})^T.
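A quick numerical sanity check of (8) can be carried out as below; it is a sketch of ours and assumes that the inner product at the identity is the Euclidean (Frobenius) one, ⟨A, B⟩_I = Σ_{ij} a_{ij} b_{ij}.

    import numpy as np

    rng = np.random.default_rng(0)
    m = 3
    W = rng.standard_normal((m, m)) + m * np.eye(m)    # well-conditioned, invertible
    W_inv = np.linalg.inv(W)
    H = W_inv @ W_inv.T                                # H^W = W^{-1} (W^{-1})^T

    def E(i, j):
        M = np.zeros((m, m))
        M[i, j] = 1.0
        return M

    for i in range(m):
        for j in range(m):
            for k in range(m):
                for l in range(m):
                    lhs = np.sum((E(i, j) @ W_inv) * (E(k, l) @ W_inv))  # <E_ij W^-1, E_kl W^-1>_I
                    rhs = float(i == k) * H[j, l]                        # delta_ik * h^W_jl
                    assert np.isclose(lhs, rhs)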
Apart from the different notation, the metric just described is nothing but the one Amari derived in [1], as (8) states. This lets us anticipate the final formula for the natural gradient associated with the aforementioned metric, namely the right natural gradient:

∇̃l(W) = ∇l(W) W^T W    (9)

The derivation of (9) starts from substituting (8) into (7) and operating as follows:

⟨A, B⟩_W = Σ_{i=1}^{m} Σ_{j=1}^{m} Σ_{k=1}^{m} Σ_{l=1}^{m} a_{ij} b_{kl} δ_{ik} h^W_{jl} = Σ_{r=1}^{m} Σ_{j=1}^{m} Σ_{l=1}^{m} a_{rj} b_{rl} h^W_{jl}

which can be further reduced to

⟨A, B⟩_W = Σ_{r=1}^{m} A_r H^W (B_r)^T

where A_r and B_r denote the r-th rows of A and B.
This authorizes us to apply the theorem of steepest descent (1) separately for each row of the gradient ∇l(W) calculated at W. In such a way we can identify the inverse of G(W) with the inverse of H^W and write:

(∇̃_r l(W))^T = (H^W)^{-1} (∇_r l(W))^T    (10)

It can be straightforwardly derived that (H^W)^{-1} = W^T W, since (H^W)^{-1} = (W^{-1} (W^{-1})^T)^{-1} = ((W^{-1})^T)^{-1} (W^{-1})^{-1} = W^T W. Substituting this result into (10) and addressing all gradient rows, it results that (∇̃l(W))^T = W^T W (∇l(W))^T and consequently (9).
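Both the identity (H^W)^{-1} = W^T W and the equivalence between the row-wise update (10) and the closed form (9) can be verified numerically; the following sketch is our illustration, not part of the original derivation.

    import numpy as np

    rng = np.random.default_rng(1)
    m = 4
    W = rng.standard_normal((m, m)) + m * np.eye(m)
    grad = rng.standard_normal((m, m))              # stands for the Euclidean gradient of l at W

    W_inv = np.linalg.inv(W)
    H = W_inv @ W_inv.T                             # H^W
    H_inv = np.linalg.inv(H)
    assert np.allclose(H_inv, W.T @ W)              # (H^W)^{-1} = W^T W

    nat_rowwise = np.vstack([(H_inv @ grad[r, :].reshape(-1, 1)).ravel()
                             for r in range(m)])    # eq. (10), applied row by row
    assert np.allclose(nat_rowwise, grad @ W.T @ W) # eq. (9): right natural gradient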
3.2. Left Translation

Here we require that:

⟨T_W(A), T_W(B)⟩_W = ⟨W ⋅ A, W ⋅ B⟩_W = ⟨A, B⟩_I

The following equalities can be derived through considerations similar to those above:

⟨A, B⟩_W = Σ_{i,j,k,l} a_{ij} b_{kl} G_{ij,kl}(W)

G_{ij,kl}(W) = δ_{jl} Σ_{c=1}^{m} (W^{-1})_{ci} (W^{-1})_{ck} = δ_{jl} h^W_{ik}

where H^W = (h^W_{ij}) = (W^{-1})^T W^{-1} and the tensor of the new metric is G_{ij,kl}(W) = ⟨W^{-1} ⋅ E_{ij}, W^{-1} ⋅ E_{kl}⟩_I. Now we can proceed as done before in the right natural gradient case, observing first that:

⟨A, B⟩_W = Σ_{c=1}^{m} (A^c)^T H^W B^c

where A^c and B^c denote the c-th columns of A and B, and then applying the steepest descent theorem separately for each column of the gradient ∇l(W) calculated at W. Hence, we can move from ∇̃^c l(W) = (H^W)^{-1} ∇^c l(W), calculate (H^W)^{-1} = W W^T, and finally derive the expression of the left natural gradient:

∇̃l(W) = W W^T ∇l(W)    (11)
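In code, the two natural gradients (9) and (11) are one-line transformations of the Euclidean gradient; a minimal sketch with illustrative naming:

    import numpy as np

    def right_natural_gradient(grad, W):
        # eq. (9): gradient under the right-translation-invariant metric
        return grad @ W.T @ W

    def left_natural_gradient(grad, W):
        # eq. (11): gradient under the left-translation-invariant metric
        return W @ W.T @ grad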
4. EXPERIMENTAL RESULTS

In this section the performances of the two ICA based learning algorithms are analyzed both when the natural gradients and when the standard gradient are applied. Application of (9) and (11) to (4) and (5) leads to the following natural gradient based learning rules:

∆W = η [I − ϕ(y) y^T] W    right
∆W = η W [I − W^T ϕ(y) x^T]    left    (12)

∆W = η [I + (1 − 2z) y^T] W    right
∆W = η W [I + W^T (1 − 2z) x^T]    left    (13)
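For completeness, the algebra behind the first pair can be spelled out (our own step, using ∇l(W) = W^{-T} − ϕ(y) x^T from (4) and y = W x):

∇l(W) W^T W = [W^{-T} − ϕ(y) x^T] W^T W = [I − ϕ(y) y^T] W
W W^T ∇l(W) = W W^T [W^{-T} − ϕ(y) x^T] = W [I − W^T ϕ(y) x^T]

which are exactly the right and left forms in (12); the pair in (13) follows in the same way from (5).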
Looking at (12) and (13), it seems that the equivariance property is not satisfied by the new natural gradient. Indeed, with C(k) = W(k) A the global matrix describing the overall system, we cannot specify the updating rule for the global matrix as a function of itself only, as instead happens in the case of the right natural gradient [3], [7]; for the left natural gradient we obtain:

∆C = η [C − W W^T ϕ(y) x^T A]    ICA based method 1
∆C = η [C + W W^T (1 − 2z) x^T A]    ICA based method 2

However, such a property is satisfied again if we describe the system in Figure 1 interpreting u(k) as a row vector, instead of a column vector as done till now.
Equation (2) can now be written as:

y = x W = u A W = u D P

while the learning rules (4) and (5) become:

∆W = η W^{-T} [I − y^T ϕ(y)]    ICA based method 1
∆W = η [W^{-T} + x^T (1 − 2z)]    ICA based method 2

allowing us to derive the following ones in global matrix notation (now C(k) = A W(k)), in the case of the left natural gradient:

∆C = η C [I − y^T ϕ(y)]    ICA based method 1
∆C = η C [I + y^T (1 − 2z)]    ICA based method 2

The example dealt with considers a system involving

u1 = sin(400k) cos²(30k),    u2 = sign(sin(150k + 15 cos(30k)))

as the two independent sources, while the mixing matrix is A = [0.56, 0.79; −0.75, 0.65]. Simulations have been performed using the batch version of the considered learning algorithms, leaving the parameter values unchanged across algorithms: number of iterations, learning rate η, number of signal samples and epoch size. Figures 2-3 and Table 1 show how the employment of the natural gradient yields a relevant improvement with respect to the standard gradient, while the right and left natural gradients behave comparably.
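A compact NumPy sketch of the experiment is given below for illustration; the sample count, learning rate, iteration number and the cubic score ϕ(y) = y³ (a common choice for sub-Gaussian sources such as these) are our own assumptions, not values taken from the original simulations.

    import numpy as np

    N = 10000
    k = np.arange(N)
    u = np.vstack([np.sin(400 * k) * np.cos(30 * k) ** 2,
                   np.sign(np.sin(150 * k + 15 * np.cos(30 * k)))])   # the two sources
    A = np.array([[0.56, 0.79], [-0.75, 0.65]])                       # mixing matrix
    x = A @ u                                                         # observed mixtures

    W = np.eye(2)
    eta, n_iter = 0.05, 200
    for _ in range(n_iter):                    # batch right natural gradient ICA, eq. (12)
        y = W @ x
        phi = y ** 3                           # assumed score function
        W = W + eta * (np.eye(2) - (phi @ y.T) / N) @ W

    print(W @ A)   # global matrix C = W A: close to a scaled permutation D P when separation succeeds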
[Figure 3: BSS error versus iterations (0-200) for the addressed algorithms.]

Table 1. Output S/N per channel [dB].

Learning algorithm            1st channel S/N [dB]    2nd channel S/N [dB]
… Infomax                     96.20                   89.77
Right natural gradient ICA    118.93                  120.41
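The S/N figures of Table 1 can be computed, for a given source/estimate pair, with a measure of the following kind (one common definition, adopted here as an assumption; the estimate is scale- and sign-aligned to the source before comparison):

    import numpy as np

    def channel_snr_db(u_true, y_est):
        # Align the estimate to the true source (scale and sign are inherent ICA ambiguities),
        # then return 10*log10(signal power / residual power).
        g = np.dot(u_true, y_est) / np.dot(y_est, y_est)   # least-squares gain
        residual = u_true - g * y_est
        return 10.0 * np.log10(np.dot(u_true, u_true) / np.dot(residual, residual))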
[Figure 2: per-channel panels showing the source signal (Source), its probability density (Source Pdf), the mixture (Mixed) and the recovered output (Out, Gain = −3.85 for channel 2), plotted as amplitude/density versus samples.]

Figure 2. Signals involved in the BSS problem: sources, mixed, and recovered. They are valid for all addressed algorithms (at the convergence point).

5. CONCLUSION

A mathematical demonstration of the derivation of a new natural gradient based learning rule for the BSS problem has been provided. The definition of a Riemannian metric different from the usual one allowed us to obtain an original natural gradient that behaves as well as the one proposed in [1] when applied to ICA and Infomax learning algorithms. This fact has relevant implications, since we can expect to derive other natural gradients once we are able to define further proper Riemannian metrics in the parameter space. Consequently, we could be interested in comparing and rating their performances when applied to BSS learning algorithms. These aspects are currently under study. Another point to investigate is the calculation of original natural gradients in other spaces, such as those of perceptrons and dynamical linear systems.

REFERENCES

[2] …, Chapter 2 (pp. 13-61) in S. Haykin (ed.), Unsupervised Adaptive Filtering, vol. I, Wiley, 2000.
[3] A. Cichocki and S.I. Amari, Adaptive Blind Signal and Image Processing, Wiley & Sons, England, 2002.
[4] J.F. Cardoso, "Learning in Manifolds: the case of Source Separation", Proceedings of IEEE SSAP, USA, 1998.
[5] J.F. Cardoso and B.H. Laheld, "Equivariant adaptive source separation", IEEE Trans. Signal Processing, vol. 44, pp. 3017-3030, December 1996.
[6] T.W. Lee, M. Girolami, A.J. Bell, and T.J. Sejnowski, "A unifying information-theoretic framework for independent component analysis", International Journal of Computers and Mathematics with Applications, 1999.
[7] S. Haykin, Neural Networks - A Comprehensive Foundation, Prentice Hall, February 1999.
[8] W. Boothby, An Introduction to Differentiable Manifolds and Riemannian Geometry, Academic Press, 2nd ed., 1986.