
New Correlation Coefficient for Data Analysis

2012

The proposed correlation coefficient better characterizes the statistical independence of two random variables that are a linear mixture of two independent sources. This correlation coefficient can be calculated with analytical relations or with the known algorithms of independent component analysis (ICA). The value of the correlation coefficient is zero when the random variables are statistically independent and one when they are fully dependent.

Scientific Papers Series “Management, Economic Engineering in Agriculture and Rural Development”
Vol. 12, Issue 4, 2012, Print ISSN 2284-7995, ISSN-L 2247-3572, E-ISSN 2285-3952

Dragos FALIE1, Livia DAVID2

1 Politechnical University of Bucharest, Romania, 1-3 Iuliu Maniu, 030791, Bucharest, Romania, E-mail: [email protected]
2 University of Agricultural Sciences and Veterinary Medicine, Bucharest, 59 Marasti, sector 1, 011464, Bucharest, Romania, E-mail: [email protected], Phone: +40722602064

Corresponding author email: [email protected]

Keywords: blind separation of sources, correlation, ICA, statistical independence

INTRODUCTION

The dependence between two random variables $x_1$ and $x_2$ is generally represented by a correlation relation, and the most commonly used one is the Pearson correlation coefficient [1]:

$$\rho(x_1, x_2) = \frac{E\left[(x_1 - E[x_1]) \cdot (x_2 - E[x_2])\right]}{\sigma(x_1) \cdot \sigma(x_2)} \qquad (1)$$

where $E$ is the expectation operator and $\sigma$ is the standard deviation of a random variable $x$:

$$\sigma(x) = \sqrt{E\left[(x - E[x])^2\right]} \qquad (2)$$

The correlation coefficient (1) takes a simpler form:

$$\rho(x_1, x_2) = E[x_1 \cdot x_2] \qquad (3)$$

if the random variables $x_1$ and $x_2$ are normalized:

$$x_1 \leftarrow \frac{x_1 - E[x_1]}{\sigma(x_1)}, \quad x_2 \leftarrow \frac{x_2 - E[x_2]}{\sigma(x_2)} \qquad (4)$$

The simplest situation is when $x_1$, $x_2$ are a linear mixture of two statistically independent normalized random variables $s_1$, $s_2$, named sources:

$$x_1 = a_{11} \cdot s_1 + a_{12} \cdot s_2, \quad x_2 = a_{21} \cdot s_1 + a_{22} \cdot s_2 \qquad (5)$$

Because $x_1$, $x_2$ are normalized, $a_{11}^2 + a_{12}^2 = 1$ and $a_{21}^2 + a_{22}^2 = 1$. In this case the correlation coefficient between $x_1$, $x_2$ is:

$$\rho(x_1, x_2) = a_{11} \cdot a_{21} + a_{12} \cdot a_{22} \qquad (6)$$

Assuming that the unit vectors along the x, y axes correspond to $s_1$, $s_2$ and that $x_1$, $x_2$ are given by (5), then $x_1$ can be represented in the space $\mathbb{R}^2$ by the vector $[a_{11}, a_{12}]$ and $x_2$ by the vector $[a_{21}, a_{22}]$. With this representation the correlation coefficient (6) can be interpreted geometrically as the scalar product between $x_1$ and $x_2$.

Fig. 1. The dependence of x1, x2 on s1, s2. On the x axis, which corresponds to s1, the coefficients a11, a12 are represented. On the y axis, which corresponds to s2, the coefficients a21, a22 are represented.

In this case the relations (5) can be rewritten as:

$$x_1 = s_1 \cos\alpha + s_2 \sin\alpha, \quad x_2 = s_1 \cos\beta + s_2 \sin\beta \qquad (7)$$

where $\alpha$, $\beta$ are the angles formed by $x_1$, $x_2$ with the x axis, respectively. Using (7), the correlation coefficient takes a very simple trigonometric form:

$$\rho(x_1, x_2) = \cos(\alpha - \beta) \qquad (8)$$

When both $x_1$, $x_2$ have a Gaussian distribution, or when any one of the coefficients $a_{11}$, $a_{12}$, $a_{21}$ and $a_{22}$ equals zero, the absolute value of the correlation coefficient measures the statistical dependence between the random variables $x_1$, $x_2$. In this case, if the correlation coefficient is zero then $x_1$, $x_2$ are statistically independent. In the other cases the correlation coefficient may not correctly show the statistical dependence between $x_1$ and $x_2$. For example, the Pearson correlation coefficient expressed by Eq. (8) is zero when the random variables are “orthogonal”:

$$\alpha - \beta = \frac{\pi}{2} + k \cdot \pi, \quad k \in \mathbb{Z} \qquad (9)$$

In this case the variables $x_1$, $x_2$ are not statistically independent if in (8) $\alpha \neq k\pi/2$, and they are strongly dependent in the particular case when $\alpha = \pi/4$ and $\beta = \alpha + \pi/2$:

$$x_1 = \frac{\sqrt{2}}{2}(s_1 + s_2), \quad x_2 = \frac{\sqrt{2}}{2}(-s_1 + s_2) \qquad (10)$$

The random variables $x_1$, $x_2$ given by (5) are independent when $a_{11} a_{21} = 0$ and $a_{12} a_{22} = 0$, that is, when they share no source. In this case, but not only in this case, the Pearson correlation coefficient (6) is zero. It would therefore be useful to provide an indicator that is different from zero when the variables $x_1$ and $x_2$ are dependent but the Pearson coefficient is zero.

MATERIAL AND METHOD

The new correlation coefficient that we propose is defined by the following formula:

$$R(x_1, x_2) = \left|a_{11} \cdot a_{21}\right| + \left|a_{12} \cdot a_{22}\right| \qquad (11)$$

The value of $R$ is zero only when the random variables (5) are statistically independent, and one when they are fully dependent. It is denoted by the Latin letter R, by analogy with the correlation coefficient, which is usually denoted by the Greek letter $\rho$. By using (7), the correlation coefficient $R$ can be expressed as:

$$R(x_1, x_2) = \max\left\{\left|\cos(\alpha - \beta)\right|, \left|\cos(\alpha + \beta)\right|\right\} \qquad (12)$$

The correlation coefficient $R$ can be calculated with Eq. (11) if $x_1$, $x_2$ are separated into independent components by using an independent component analysis (ICA) algorithm [2-4]. $R$ may also be calculated with Eq. (12), in which case only $\cos(\alpha + \beta)$ needs to be evaluated, $\cos(\alpha - \beta)$ being known via (8). To compute $R$ with (12) it is necessary to know the value of:

$$r(x_1, x_2) = \cos(\alpha + \beta) \qquad (13)$$

Fig. 2 represents the particular case when $x_1$, $x_2$ are orthogonal, $\beta = \alpha + (2k+1)\pi/2$. In this case the Pearson correlation coefficient (8) is zero, but $R$ may vary from 0 to 1:

$$R(x_1, x_2) = \left|\sin(2 \cdot \alpha)\right|, \quad \beta = \alpha \pm (2k+1)\frac{\pi}{2}, \quad k \in \mathbb{Z} \qquad (14)$$

Fig. 2. The dependence of two orthogonal random variables x1, x2 on s1, s2. The x, y axes correspond to s1, s2.

When $\alpha = \pi/4$ then $R = 1$ and $x_1$, $x_2$ are in the most dependent situation. Other particular cases are $R = 1/2$ when $\alpha = \pi/12$ and $R = \sqrt{3}/2$ when $\alpha = \pi/6$. The random variables $x_1$, $x_2$ are fully dependent and $R = 1$ when:

$$R(x_1, x_2) = 1, \quad \beta = \pm\alpha + k \cdot \pi, \quad k \in \mathbb{Z} \qquad (15)$$

It can be noticed that the Pearson correlation coefficient expressed as in Eq. (8) coincides, in absolute value, with $R$ when:

$$\left|\cos(\alpha - \beta)\right| > \left|\cos(\alpha + \beta)\right| \qquad (16)$$

The analytical solution for $r(x_1, x_2)$ is:

$$\cos^2(t) = \frac{k_{40} - 2k_{22} + k_{04}}{k_{40} + 2k_{22} + k_{04}} \qquad (17)$$

where:

$$k_{40} = E\left[o_1^4\right] - 3, \quad k_{04} = E\left[o_2^4\right] - 3, \quad k_{22} = E\left[o_1^2 o_2^2\right] - 1 \qquad (18)$$

Eq. (17) can be obtained only when the following condition is fulfilled:

$$E\left[\left(o_1^2 + o_2^2\right)^2\right] - 8 = E\left[s_1^4\right] + E\left[s_2^4\right] - 6 \neq 0 \qquad (19)$$

For a Gaussian source $E[s^4] = 3$, so if both sources $s_1$, $s_2$ are Gaussian, (19) is not fulfilled. If for one of the sources $E[s_1^4] < 3$ and for the other $E[s_2^4] > 3$ in such a way that (19) is not fulfilled, then the solution cannot be calculated with (17). Likewise, when one of the sources is a mixture of two random variables such that $E[s_1^4] = 3$ and the other source, Gaussian or not, also has $E[s_2^4] = 3$, the solution cannot be computed with (17).
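The relations between the mixing coefficients, the Pearson coefficient and R can be checked numerically. The following is a minimal sketch, not the authors' implementation: it assumes that Eq. (11) reads R = |a11·a21| + |a12·a22| (the form that agrees with Eq. (12)), and the function names `mixing_row`, `pearson_rho`, `coefficient_R` and `coefficient_R_from_angles` are hypothetical helpers introduced only for illustration.

```python
import math


def mixing_row(angle):
    """Return the unit-norm coefficients (cos(angle), sin(angle)) used in Eq. (7)."""
    return (math.cos(angle), math.sin(angle))


def pearson_rho(row1, row2):
    """Eq. (6): Pearson coefficient of two normalized mixtures of s1, s2."""
    a11, a12 = row1
    a21, a22 = row2
    return a11 * a21 + a12 * a22


def coefficient_R(row1, row2):
    """Eq. (11), read here as R = |a11*a21| + |a12*a22| (an assumption)."""
    a11, a12 = row1
    a21, a22 = row2
    return abs(a11 * a21) + abs(a12 * a22)


def coefficient_R_from_angles(alpha, beta):
    """Eq. (12): R = max(|cos(alpha - beta)|, |cos(alpha + beta)|)."""
    return max(abs(math.cos(alpha - beta)), abs(math.cos(alpha + beta)))


# Orthogonal mixtures (beta = alpha + pi/2): rho stays at zero (up to rounding)
# while R follows |sin(2*alpha)|, as stated in Eq. (14).
for alpha in (0.0, math.pi / 12, math.pi / 6, math.pi / 4):
    beta = alpha + math.pi / 2
    x1, x2 = mixing_row(alpha), mixing_row(beta)
    print(
        f"alpha={alpha:.4f}  rho={pearson_rho(x1, x2):+.3f}  "
        f"R(11)={coefficient_R(x1, x2):.3f}  "
        f"R(12)={coefficient_R_from_angles(alpha, beta):.3f}"
    )
```

Both forms of R agree for any pair of angles because |cos α cos β| + |sin α sin β| equals the larger of |cos(α − β)| and |cos(α + β)|; the sketch only covers the case where the mixing coefficients are already known, not their estimation from data.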
When (19) is fulfilled, $R$ can be calculated knowing $\tan(t)$, obtained with Comon's relation [5] or with the alternative Comon's formula (ACF) [8,3,14]:

$$\tan(t) = \frac{2k_{22}}{k_{31} - k_{13}} \qquad (20)$$

where:

$$k_{31} = E\left[o_1^3 o_2\right], \quad k_{13} = E\left[o_1 o_2^3\right] \qquad (21)$$

The best results are obtained with the following relation:

$$\tan(2t) = \frac{4\left(k_{31} - k_{13}\right)}{k_{40} - 6k_{22} + k_{04}} \qquad (22)$$

The above relation is known as the approximate maximum likelihood (AML) estimator [10-12,5]. This relation can also be obtained by combining $E[o_1 o_2^3]$ and $E[o_1^3 o_2]$. Additionally, the condition (19) needs to be fulfilled.

$R$ corrects the Pearson correlation coefficient only when all the coefficients $a_{11}$, $a_{12}$, $a_{21}$ and $a_{22}$ in (5) are different from zero. If one of these coefficients equals zero, the system (5) reduces to:

$$x_1 = s_1, \quad x_2 = a_{21} \cdot s_1 + a_{22} \cdot s_2 \qquad (23)$$

and the two correlation coefficients give the same result.

RESULTS AND DISCUSSIONS

The correlation matrices ρ and R between the exchange rates of different currencies are presented in Tables 1 and 2, respectively. The difference between ρ and R is presented in Table 3. A general remark is that there are enough cases where ρ has been corrected by R to justify the use of the new correlation coefficient. In this example the corrected correlation R has a greater value than ρ.

TABLE 1. CORRELATION MATRIX

         gold    $ USA   €       £ UK    f Sw    $ Ca    $ Au
gold     0.996   0.953   0.230   0.854   0.860   0.940   0.916
$ USA    0.953   0.996   0.196   0.833   0.820   0.964   0.877
€        0.230   0.196   0.996   0.562   0.479   0.085   0.187
£ UK     0.854   0.833   0.562   0.996   0.915   0.765   0.801
f Sw     0.860   0.820   0.479   0.915   0.996   0.785   0.878
$ Ca     0.940   0.964   0.085   0.765   0.785   0.996   0.927
$ Au     0.916   0.877   0.187   0.801   0.878   0.927   0.996

TABLE 2. CORRECTED CORRELATION MATRIX

         gold    $ USA   €       £ UK    f Sw    $ Ca    $ Au
gold     0.996   0.983   0.286   0.999   0.906   0.940   0.954
$ USA    0.983   0.996   0.279   1.000   0.996   0.964   0.977
€        0.286   0.279   0.996   0.562   0.479   0.085   0.249
£ UK     0.999   1.000   0.562   0.996   0.993   0.984   1.000
f Sw     0.906   0.996   0.479   0.993   0.996   0.999   0.969
$ Ca     0.940   0.964   0.085   0.984   0.999   0.996   1.000
$ Au     0.954   0.977   0.249   1.000   0.969   1.000   0.996

TABLE 3. THE DIFFERENCE BETWEEN THE TWO CORRELATION COEFFICIENTS

         gold    $ USA   €       £ UK    f Sw    $ Ca    $ Au
gold     0.000   0.030   0.056   0.145   0.046   0.000   0.039
$ USA    0.030   0.000   0.083   0.167   0.176   0.000   0.100
€        0.056   0.083   0.000   0.000   0.000   0.000   0.062
£ UK     0.145   0.167   0.000   0.000   0.078   0.219   0.199
f Sw     0.046   0.176   0.000   0.078   0.000   0.214   0.090
$ Ca     0.000   0.000   0.000   0.219   0.214   0.000   0.073
$ Au     0.039   0.100   0.062   0.199   0.090   0.073   0.000

As expected, there are also many cases where the data structure has the simple form (23), in which case the two correlation coefficients give the same or very close results. For example, in the Canadian $ row of Table 3, the dependence of the Canadian $ on gold, the USA $ and the € has a simple structure, but the dependence on the £ UK and the Swiss franc is complex and requires the use of the new correlation coefficient.

Fig. 3. The time dependence of the correlation coefficients ρ and R between € and the gold price and $ USA.

The time dependence of the correlation coefficients is represented in Fig. 3. One can observe that if the data of the gold price and the USA $ are shifted into the past by 0…6 days, the values of the two correlation coefficients are the same. This indicates that the data structure in these cases has the simple form (23). The data structure changes if the same data (gold price and USA $) are shifted into the future by 1…8 days. This example shows that if the data structure is not known a priori, it is better to use the new correlation coefficient.

CONCLUSIONS

The proposed correlation coefficient R corrects the Pearson relation and shows the statistical dependence between two random variables that are a linear mixture of two independent sources. R can be calculated with analytical relations or obtained with ICA algorithms. Besides the known relations for calculating R, a new analytical relation (17) has been proposed.

There are situations when the random variables do not satisfy the condition (19), and R then needs to be calculated using ICA algorithms [7].

Even if computing R is slightly more complicated than calculating the correlation coefficient, the advantages of knowing it are considerable. In many cases the random variables do not have the simple structure of (23) and the Pearson correlation coefficient may give inaccurate values. [...] the stock market [9] and other fields.

In economics, the dependences between different random variables cannot always be correctly evaluated with the correlation coefficient, but R can easily clarify the problem.

REFERENCES

[1] Back, A.D. and A.S. Weigend, A first application of independent component analysis to extracting structure from stock returns, International Journal of Neural Systems, vol. 8, pp. 473-484, Aug 1997.
[2] Comon, P., Separation of Stochastic Processes, in Proceedings Workshop on Higher-Order Spectral Analysis, 1989, pp. 174-179.
[3] Comon, P., Separation of sources using higher-order cumulants. Vol. 1152, 1989.
[4] De la Rosa, J.J.G. et al., Higher order statistics and independent component analysis for spectral characterization of acoustic emission signals in steel pipes, IEEE Transactions on Instrumentation and Measurement, vol. 56, pp. 2312-2321, Dec 2007.
[5] Harroy, F. and J.-L. Lacoume, Maximum likelihood estimators and Cramer-Rao bounds in source separation, Signal Processing, vol. 55, pp. 167-177, 1996.
[6] Hyvarinen, A., Fast and robust fixed-point algorithms for independent component analysis, IEEE Transactions on Neural Networks, vol. 10, pp. 626-634, May 1999.
[7] Hyvarinen, A., The fixed-point algorithm and maximum likelihood estimation for independent component analysis, Neural Processing Letters, vol. 10, pp. 1-5, Aug 1999.
[8] Lacoume, J.L. et al., Statistiques d'Ordre Supérieur pour le Traitement du Signal. Paris: Masson, 1997.
[9] Stuart, A. and J.K. Ord, Kendall's Advanced Theory of Statistics, sixth ed., vol. I. London: Edward Arnold, 1994.
[10] Zarzoso, V. et al., Optimal pairwise fourth-order independent component analysis, IEEE Transactions on Signal Processing, vol. 54, pp. 3049-3063, Aug 2006.
[11] Zarzoso, V. and A.K. Nandi, Closed-form estimators for blind separation of sources - part II: Complex mixtures, Wireless Personal Communications, vol. 21, pp. 29-48, Apr 2002.
[12] Zarzoso, V. and A.K. Nandi, Closed-form estimators for blind separation of sources - part I: Real mixtures, Wireless Personal Communications, vol. 21, pp. 5-28, Apr 2002.
[13] Zarzoso, V. et al., A contrast for independent component analysis with priors on the source kurtosis signs, IEEE Signal Processing Letters, vol. 15, pp. 501-504, 2008.
[14] Zhang, Z. et al., Blind source separation by combining independent component analysis with complex discrete wavelet transform, 2007 International Conference on Wavelet Analysis and Pattern Recognition, Vols 1-4, Proceedings, pp. 549-554, 1924, 2007.
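As a closing illustration of the comparison drawn in the conclusions, the sketch below estimates the Pearson coefficient and R from synthetic data using the fourth-order moments of Eqs. (17)-(19). It rests on several assumptions that are not stated in the paper: the variables o1, o2 are taken to be the standardized sum and difference of the standardized x1, x2 (one possible whitening, under which cos²(t) equals cos²(α + β)); R is then taken from Eq. (12) as max(|ρ|, √cos²(t)); the sources are uniform; and the names `standardize`, `pearson`, `cos2_t`, `estimate_R` and the threshold `min_denominator` are hypothetical.

```python
import math
import random


def standardize(values):
    """Shift to zero mean and scale to unit variance, as in Eq. (4)."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std == 0.0:
        raise ValueError("constant series cannot be standardized")
    return [(v - mean) / std for v in values]


def pearson(x1, x2):
    """Eq. (3): mean product of the standardized series."""
    z1, z2 = standardize(x1), standardize(x2)
    return sum(a * b for a, b in zip(z1, z2)) / len(z1)


def cos2_t(x1, x2, min_denominator=0.1):
    """Eq. (17) from sample moments; o1, o2 come from an ASSUMED whitening."""
    z1, z2 = standardize(x1), standardize(x2)
    # Assumption: the sum and difference of standardized series are uncorrelated,
    # so standardizing them gives whitened variables o1, o2.
    o1 = standardize([a + b for a, b in zip(z1, z2)])
    o2 = standardize([a - b for a, b in zip(z1, z2)])
    n = len(o1)
    k40 = sum(a ** 4 for a in o1) / n - 3.0
    k04 = sum(b ** 4 for b in o2) / n - 3.0
    k22 = sum(a * a * b * b for a, b in zip(o1, o2)) / n - 1.0
    denominator = k40 + 2.0 * k22 + k04  # left-hand side of condition (19)
    if abs(denominator) < min_denominator:
        raise ValueError("condition (19) is nearly violated; use ICA instead")
    # Clamp to [0, 1] because sampling noise can push the ratio slightly outside.
    return min(1.0, max(0.0, (k40 - 2.0 * k22 + k04) / denominator))


def estimate_R(x1, x2):
    """Eq. (12) with |cos(alpha - beta)| = |rho| and |cos(alpha + beta)| = sqrt(cos2_t)."""
    return max(abs(pearson(x1, x2)), math.sqrt(cos2_t(x1, x2)))


# Synthetic check: orthogonal mixtures of two uniform, unit-variance sources.
rng = random.Random(0)
half_width = math.sqrt(3.0)  # uniform on [-sqrt(3), sqrt(3)] has unit variance
s1 = [rng.uniform(-half_width, half_width) for _ in range(100_000)]
s2 = [rng.uniform(-half_width, half_width) for _ in range(100_000)]
alpha = math.pi / 6
beta = alpha + math.pi / 2
x1 = [a * math.cos(alpha) + b * math.sin(alpha) for a, b in zip(s1, s2)]
x2 = [a * math.cos(beta) + b * math.sin(beta) for a, b in zip(s1, s2)]
rho_hat = pearson(x1, x2)
R_hat = estimate_R(x1, x2)
print(f"rho = {rho_hat:+.3f}, R = {R_hat:.3f}, |sin(2*alpha)| = {abs(math.sin(2 * alpha)):.3f}")
```

For this orthogonal mixture the Pearson estimate should stay near zero while the estimate of R should approach |sin(2α)|, in line with Eq. (14). When both sources are Gaussian, or their excess kurtoses cancel, the guard on the denominator fires, which corresponds to the situations where the paper recommends ICA algorithms instead.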