
Research in Mathematics

ISSN: (Print) (Online) Journal homepage: https://www.tandfonline.com/loi/oama23

The behaviour of rank correlation coefficients for incomplete data

Cahyo Crysdian

To cite this article: Cahyo Crysdian (2022) The behaviour of rank correlation coefficients for incomplete data, Research in Mathematics, 9:1, 2107793, DOI: 10.1080/27684830.2022.2107793

To link to this article: https://doi.org/10.1080/27684830.2022.2107793

© 2022 The Author(s). This open access article is distributed under a Creative Commons Attribution (CC-BY) 4.0 license.

Published online: 07 Aug 2022.


Full Terms & Conditions of access and use can be found at https://www.tandfonline.com/action/journalInformation?journalCode=oama23

COMPUTATIONAL SCIENCE | RESEARCH ARTICLE

The behaviour of rank correlation coefficients for incomplete data


Cahyo Crysdian
Computer Science Department, Universitas Islam Negeri Maulana Malik Ibrahim, Malang, Indonesia

ABSTRACT

This paper analyses the behaviour of rank correlation coefficients under complete and incomplete data conditions. The main concern of this research is to deal with missing data by preserving the originality of the data pair rather than resorting to data deletion or imputation. The paper introduces a variability function, developed for each correlation coefficient, in order to disclose the mean and the variance over every possible data sequence. Comparisons between Kendall, Spearman, and the absolute distance measure for index ranking demonstrate the use of the variability function under both complete and incomplete data, where it becomes a useful tool to describe the mechanism by which each coefficient processes a set of possible data sequences. The analysis proves that the Kendall coefficient is the better method compared to Spearman and the absolute distance measure for three reasons: the ability to preserve the zero mean of the variability distribution in complete data, the ability to survive missing data, and the ability to gain a higher rate of convergence in the incomplete condition. Meanwhile, Spearman fails to preserve the original data pair under the incomplete condition because it measures rank distances directly.

ARTICLE HISTORY
Received 13 February 2021
Accepted 27 July 2022

KEYWORDS
Rank correlation coefficient; index ranking; incomplete data; missing data; variability function; variance estimate

Introduction

Rank correlation coefficients aim to measure the association between a pair of ordinal variables representing the ranking of different items obtained from an observation, or the ranking of the same item from different observations. Two famous classical methods that are still widely used today are Kendall τ and Spearman ρ (Alvo & Cabilio, 1995; Alvo & Park, 2002; Kendall, 1938; Spearman, 1904; Szmidt & Kacprzyk; Xu et al., 2010). These methods are formulated by

$$\tau = \frac{n_c - n_d}{\frac{1}{2} n(n-1)} \tag{1}$$

$$\rho = 1 - \frac{6 \sum d_i^2}{n(n^2-1)} \tag{2}$$

in which $n_c$ and $n_d$ denote the number of concordant and discordant pairs, respectively; $d$ is the distance between a pair of data ranks, and $n$ is the number of data. The methods produce a score in the range $[-1, 1]$, running from a perfect opposite correlation to a perfect correlation, while a score of 0 means that the data pair is independent of each other. Unfortunately, these classical rank correlation coefficients were not designed to deal with the missing data that is often found in practical observations. As shown by Equations 1 and 2, both Kendall and Spearman are influenced only by a single number of data $n$. This condition implicitly presents the need to achieve complete data. Let $(A, B) = (a_1 \dots a_n; b_1 \dots b_n)$ be a pair of distinct variables storing the result of an observation. Data completeness is achieved when $\forall a_i \notin \emptyset$ and $\forall b_i \notin \emptyset$, $i = 1 \dots n$, where $\emptyset$ denotes the missing data; hence, $|A| = |B| = n$. If, without loss of generality, $\exists a_i \in \emptyset$, $|A| < n$ and $\forall b_i \notin \emptyset$, $i = 1 \dots n$, $|B| = n$, thus $|A| \neq |B|$, then the data are incomplete. The incompleteness is mostly due to unmeasured objects in an observation.

However, it is difficult to always achieve data completeness in practical situations due to varying conditions, such as hardware and time constraints, that restrict object measurements. Therefore, in order to achieve complete data, unobserved objects must often be removed prior to applying a rank correlation coefficient, through list-wise or pair-wise deletion, as noted by Alvo and Cabilio (1995), Alvo and Park (2002), and Raykov et al. (2014). When a large number of objects is affected by the deletion process, the data suffer significant losses that potentially jeopardize their characteristics. This condition is undesired by most researchers. Therefore, Alvo and

CONTACT Cahyo Crysdian [email protected] Computer Science, Universitas Islam Negeri Maulana Malik Ibrahim, Indonesia
Reviewing editor: Lishan Liu Qufu Normal University, Shandong, CHINA

Cabilio (1995) introduced an imputation approach based on distance metrics to extend the classical rank correlation coefficients of Spearman and Kendall to deal with missing data. The process was developed by associating and relabeling a set of items composing the original incomplete data based on the compatibility between the incomplete and complete data. Later, Cabilio and Tilley (1999) and Alvo and Park (2002) extended this approach based on a multivariate statistical test. Kidwell et al. (2008) even employed Alvo and Cabilio's approach for visualization purposes. Different techniques were presented by Albers and Teulings (1996), who introduced a correlation estimate by incorporating additional information from further observations. This effort increased the size of the data to $n + m_1 + m_2$, where $n$ is the size of the original variable that might contain missing data, while $m_1$ and $m_2$ are the sizes of the additional observations obtained from the first and second variables being correlated, respectively. Meanwhile, Raykov et al. (2014) built a correlation estimate by developing a set of predictive rankings for the missing data under the assumption of missing at random. This effort was extended by Eekhout et al. (2015) to include an auxiliary variable in terms of item score information. Recently, various methods have been introduced to predict missing data: Kim and Im (2018), Emmanuel et al. (2021), and Mirzaei et al. (2022) develop multiple imputation approaches, Yan et al. (2021) predict missing attributes and restore big data by using K-means and Neural Network Backpropagation, and Rejeb et al. () estimate missing values using a Kohonen map.

Despite the progress made in dealing with incomplete data, the original observed variables remain the most appropriate representation of the system characteristics or phenomenon being investigated, regardless of whether they contain some missing data. Altering the original variables risks changing or even diminishing data characteristics, as shown by Zidan et al. (2017), Kim et al. (2020), Abdel-Aty et al. (2020), and Mirzaei et al. (2022). Conducting imputation for a rank of indices jeopardizes the unique features of an index ranking. It is important to note that each index represents a different entity associated with an object in a sequence. Therefore, it would not be appropriate to replace an index with its neighbors or to modify an index ranking, since the action would change the original variable. This statement is the foundation of this study. Here, we assume that an entity corresponds to only one index, and that there is no repeated index in a ranking. Hence, any sequence of data in this study is recognized as an index ranking.

Meanwhile, a measure of similarity based on the absolute distance between a pair of index rankings $(A, B) = (a_1 \dots a_n; b_1 \dots b_m)$ is designed for incomplete data (Crysdian, 2018) as formulated by

$$c = \frac{1}{\min(n, m)} \sum_{i=1, j=1}^{n, m} \frac{\alpha}{|i - j| + 1} \tag{3}$$

with

$$\alpha = \begin{cases} 1 & \text{if } a_i = b_j \\ 0 & \text{otherwise} \end{cases} \tag{4}$$

Equation 3 produces a score in the range $[0, 1]$, running from independent to tightly correlated data. This approach enables the similarity measure between a pair of index rankings in whatever condition they might have, and therefore it preserves the originality of the observed variables. For complete data, in which $n = m$, reaching $c = 0$ is not possible because $\forall a_i \in B$ and $\forall b_j \in A$. A different condition is found for incomplete data, in which $n \neq m$: Equation 3 is capable of reaching $c = 0$ due to the possibility that $\forall a_i \notin B$ or $\forall b_j \notin A$.

From this point, the difference in behaviour between the rank correlation coefficients, i.e. Kendall, Spearman, and the absolute distance measure, is noticeable. As noted by Xu et al. (2010), a problem of mathematical tractability arises from this issue due to the complexity of the functions needed to describe the unique mechanism of each coefficient. Hence, it is interesting to reveal how these coefficients behave for $n \to \infty$ under the complete and incomplete conditions. The study aims to discuss how rank correlation coefficients adapt to complete and incomplete data by disclosing their behaviour for $n \to \infty$ through the application of the variability function introduced in Section Material and methods, sub section Coefficient's behaviour for complete data. Here we show how the variability function discloses the internal mechanism of each rank correlation coefficient. The rest of the paper is organized as follows. Section 2 discusses the complexity of the complete and incomplete conditions and introduces the variability function to analyze the behaviour of rank correlation coefficients through the following materials: the complexity of data sequences under the complete and incomplete conditions in Sub-Section 2.1, the behaviour of each rank correlation coefficient for complete data, formulated through the variability function, in Sub-Section 2.2, and the extension of the variability function to the incomplete condition in Sub-Section 2.3. Section 3 discusses the characteristics of rank correlation coefficients by disclosing their behaviours under the

condition of complete and incomplete data through the application of the variability function. The study is concluded in Section 4.

Material and methods

The complexity of complete and incomplete data

For the case of complete data $(A, B) = (a_1 \dots a_n; b_1 \dots b_n)$, the correlation between A and B is a permutation of $n$ data items. Hence, the correlation score is distributed over a number of possible data sequences $N$ defined by

$$N(n) = n! \tag{5}$$

A different condition is found for incomplete data $(A, B) = (a_1 \dots a_n; b_1 \dots b_m)$, in which $n \neq m$. Without loss of generality, this assumption states that the incompleteness is suffered by B. Therefore, the possible data sequences of B are the permutations of the result of a union operation between the power set of A and the missing data, which is represented by "0". We use "0" to denote missing data since a data rank is a positive integer, and the number of "0" entries in B represents the number of missing data. For instance, let $A = \{1, 2, 3\}$; then B can be any permutation of the following combinations: $\{1, 2, 3\}$, $\{1, 2, 0\}$, $\{1, 3, 0\}$, $\{2, 3, 0\}$, $\{1, 0, 0\}$, $\{2, 0, 0\}$, $\{3, 0, 0\}$, and $\{0, 0, 0\}$. $B = \{1, 0, 0\}$ contains two missing data, so it can be rewritten as $B = \{1, 0_1, 0_2\}$. There are 3! data sequences for this combination: $\{1, 0_1, 0_2\}$; $\{1, 0_2, 0_1\}$; $\{0_1, 1, 0_2\}$; $\{0_2, 1, 0_1\}$; $\{0_1, 0_2, 1\}$; $\{0_2, 0_1, 1\}$. The number of possible data sequences grows to become

$$N(n) = (2^n - 1)\, n! + 1 \tag{6}$$

It is important to note that the term "+1" at the end of Equation 6 accommodates $\{0, 0, 0\}$. Here, we prefer to exclude it and count only the possible sequences that carry a component of the original data. Hence, the total number of possible data sequences is reduced to $N(n) = (2^n - 1)\, n!$. Moreover, it is possible to reduce the number of sequences further by removing the duplicate permutations of combinations with repeated items, such as $\{1, 0, 0\}$, which arises from both $\{1, 0_1, 0_2\}$ and $\{1, 0_2, 0_1\}$. Then, the total number of possible data sequences in Equation 6 is reduced to

$$N(n) = \sum_{i=1}^{n} C_{n,i}\, P_{n,i} \tag{7}$$

with $C_{n,i} = \frac{n!}{(n-i)!\, i!}$ and $P_{n,i} = \frac{n!}{(n-i)!}$.

The comparison between the number of data sequences for complete and incomplete data for any $n$, given in Table 1, shows the exponential growth of possible data sequences. It is difficult to obtain the mean and variance of rank correlation coefficients for large $n$ since the computation involves a large number of possible data sequences. This problem restricts the effort to present the functional behaviour of rank correlation coefficients, particularly under the incomplete condition. Therefore, it is crucial to define a variability function that carries the smallest component of a rank correlation coefficient. The mean and the variance of the correlation coefficients can then be computed using the predefined variability function, as elaborated in the next sub-section.

Table 1. Number of possible data sequences

n       Complete data    Incomplete data
1       1                1
2       2                6
3       6                33
4       24               208
5       120              1545
6       720              13,326
7       5040             130,921
8       40,320           1,441,728
9       362,880          17,572,113
10      3,628,800        234,662,230
...     ...              ...
n       N(n) = n!        N(n) = Σ_{i=1}^{n} C_{n,i} P_{n,i}

Coefficient's behaviour for complete data

For a pair of complete data $(A, B) = (a_1 \dots a_n; b_1 \dots b_n)$, it is not possible to have a repeated index due to the nature of index ranking as stated previously; hence $a_i \neq a_j$ and $b_i \neq b_j$, $i \neq j$, for $i, j = 1 \dots n$. The Kendall coefficient in Equation 1 is computed from the number of concordant and discordant pairs as follows:

$$n_c = \sum_{i=1, j=1, i \neq j}^{n, n} \mathrm{con}(i, j) \tag{8}$$

$$n_d = \sum_{i=1, j=1, i \neq j}^{n, n} \mathrm{dis}(i, j) \tag{9}$$

with

$$\mathrm{con}(i, j) = \begin{cases} 1 & \text{if } ((a_i > a_j) \text{ and } (b_i > b_j)) \text{ or } ((a_i < a_j) \text{ and } (b_i < b_j)) \\ 0 & \text{otherwise} \end{cases} \tag{10}$$

$$\mathrm{dis}(i, j) = \begin{cases} 1 & \text{if } ((a_i > a_j) \text{ and } (b_i < b_j)) \text{ or } ((a_i < a_j) \text{ and } (b_i > b_j)) \\ 0 & \text{otherwise} \end{cases} \tag{11}$$
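The counts in Table 1 and the concordant and discordant counts defined above can be reproduced with a short Python sketch. This is an illustration only, not code from the paper: the function names are hypothetical, and each unordered pair {i, j} is assumed to be counted once so that n_c + n_d = n(n-1)/2 holds for complete rankings.

```python
from itertools import combinations, permutations
from math import comb, factorial, perm


def n_complete(n: int) -> int:
    """Equation 5: number of possible sequences for complete data."""
    return factorial(n)


def n_incomplete(n: int) -> int:
    """Equation 7: sum over i = 1..n of C(n, i) * P(n, i)."""
    return sum(comb(n, i) * perm(n, i) for i in range(1, n + 1))


def kendall_counts(a, b):
    """Concordant and discordant counts following Equations 10 and 11.

    Assumption: every unordered pair of positions is visited once.
    """
    nc = nd = 0
    for i, j in combinations(range(len(a)), 2):
        sign = (a[i] - a[j]) * (b[i] - b[j])
        if sign > 0:      # both rankings order the pair the same way
            nc += 1
        elif sign < 0:    # the rankings order the pair in opposite ways
            nd += 1
    return nc, nd


def kendall_tau(a, b):
    """Equation 1 for a pair of complete rankings without repeated indices."""
    n = len(a)
    nc, nd = kendall_counts(a, b)
    return (nc - nd) / (n * (n - 1) / 2)


if __name__ == "__main__":
    # First rows of Table 1: n, complete count, incomplete count.
    for n in range(1, 6):
        print(n, n_complete(n), n_incomplete(n))

    # Sum of (n_c - n_d) over all 4! sequences of B.
    a = (1, 2, 3, 4)
    print(sum(nc - nd for nc, nd in (kendall_counts(a, p) for p in permutations(a))))
```

Enumerating all n! sequences is only feasible for small n, which is the practical difficulty the variability function is meant to avoid.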

Hence, the variability function of the Kendall coefficient $v_K$ for any $n$ is also defined by the number of concordant and discordant pairs as follows:

$$v_K = n_c - n_d = \tau\, n(n-1)/2 \tag{12}$$

where $\tau$ is Kendall's score in the range $[-1, 1]$, hence $-1 \le \tau \le 1$, and therefore

$$-n(n-1)/2 \le v_K \le n(n-1)/2 \tag{13}$$

Thus, the variability function that satisfies Equation 13 is

$$v_K = \frac{1}{2} n(n-1) - 2i \tag{14}$$

or

$$v_K = -\frac{1}{2} n(n-1) + 2i \tag{15}$$

for $i = 0 \dots n(n-1)/2$.

Proof:
The variability of Kendall's score is governed by $\frac{1}{2} n(n-1)$ for an index ranking of $n$ items, as stated by Equation 1; this is the total number of $n_c$ and $n_d$, hence

$$n_c + n_d = \frac{1}{2} n(n-1) \tag{16}$$

Thus, $n_c$ varies within the range $[0, n(n-1)/2]$. The same condition applies to $n_d$. To compute $v_K = f(n_c, n_d)$ in Equation 12, we need to state the extreme maximum or the extreme minimum that can be reached through Equation 16, i.e. $\frac{1}{2} n(n-1)$ or $-\frac{1}{2} n(n-1)$, respectively. We can then visit all possible $v_K$ in Equation 13 by using an ordered sequence of the variable $i$ in the range $[0, n(n-1)/2]$. QED

Based on (14), $\mu_K = 0$ for any $n$.

Proof:

$$\mu_K = \frac{\sum_{i=1}^{N(n)} \tau_i}{N(n)} = \frac{1}{n!} \sum_{i=1}^{n!} \frac{n_{c_i} - n_{d_i}}{n(n-1)/2} = \frac{2}{n(n-1)\, n!} \sum_{i=1}^{n!} (n_{c_i} - n_{d_i}) \tag{17}$$

Since $\sum_{i=1}^{n!} (n_{c_i} - n_{d_i})$ includes all possible sequences that create the variability of Kendall's function as stated in Equation 12, hence

$$\frac{1}{n!} \sum_{i=1}^{n!} (n_{c_i} - n_{d_i}) \cong \frac{1}{\frac{n(n-1)}{2} + 1} \sum_{i=0}^{n(n-1)/2} v_{K_i} \tag{18}$$

In Equation 18, we include the denominator since it is the source of the numerator as stated in Equation 17. Therefore, replacing the numerator requires replacing the denominator in order to have a fair computation. It is important to note that the "+1" term in the denominator on the right side of Equation 18 reflects that the index $i$ starts from zero. The mean of the Kendall coefficient can then be computed by inserting Equation 18 into Equation 17 as follows:

$$\begin{aligned}
\mu_K &= \frac{2}{n(n-1)} \cdot \frac{1}{\frac{n(n-1)}{2} + 1} \sum_{i=0}^{n(n-1)/2} v_{K_i} \\
&= \frac{2}{n(n-1)} \cdot \frac{1}{\frac{n(n-1)}{2} + 1} \sum_{i=0}^{n(n-1)/2} \left( \frac{1}{2} n(n-1) - 2i \right) \\
&= \frac{2}{n(n-1)} \cdot \frac{1}{\frac{n(n-1)}{2} + 1} \cdot 0 \\
&= 0 \quad \text{QED}
\end{aligned} \tag{19}$$

Even though the denominator does not influence the computation due to the numerator's symmetry that

produces zero mean in Equation 19, the statement of the denominator in Equation 18 is vital for the calculation of the variance. Hence, the variance of the Kendall coefficient is obtained by

$$\begin{aligned}
\sigma_K^2 &= \frac{1}{N(n)} \sum_{i=1}^{N(n)} (\tau_i - \mu)^2 \\
&= \frac{1}{n!} \sum_{i=1}^{n!} \tau_i^2 \\
&= \frac{1}{n!} \sum_{i=1}^{n!} \left( \frac{n_{c_i} - n_{d_i}}{\frac{1}{2} n(n-1)} \right)^2 \\
&= \frac{4}{n^2 (n-1)^2\, n!} \sum_{i=1}^{n!} (n_{c_i} - n_{d_i})^2
\end{aligned} \tag{20}$$

The variance estimate of Kendall can then be computed in terms of $n$ by inserting Equation 18 into Equation 20 as follows:

$$\begin{aligned}
\hat{\sigma}_K^2 &\cong \frac{4}{n^2 (n-1)^2 \left( \frac{n(n-1)}{2} + 1 \right)} \sum_{i=0}^{n(n-1)/2} v_{K_i}^2 \\
&\cong \frac{8}{n^2 (n-1)^2 \left( n(n-1) + 2 \right)} \sum_{i=0}^{n(n-1)/2} \left( \frac{1}{2} n(n-1) - 2i \right)^2
\end{aligned} \tag{21}$$

Meanwhile, the variability of the Spearman coefficient in Equation 2 can be defined by the total squared distance of a pair of index rankings as follows:

$$v_S = \sum d_i^2 = \frac{n(n^2-1)(1 - \rho)}{6} \tag{22}$$

Since the score of Spearman is in the range $[-1, 1]$, hence $-1 \le \rho \le 1$, and therefore

$$0 \le v_S \le n(n^2-1)/3 \tag{23}$$

For any $n$, the extreme minimum and maximum of $v_S$ in Equation 23 can be visited by using an ordered sequence of the variable $i$ in the range $[0 \dots n(n^2-1)/3]$. Thus, the variability function that satisfies Equation 23 is

$$v_S = i \tag{24}$$

for $i = 0 \dots n(n^2-1)/3$.

Based on Equation 24, $\mu_S = 0$ for any $n$.

Proof:

$$\mu_S = \frac{1}{n!} \sum_{i=1}^{n!} \rho_i = \frac{1}{n!} \sum_{i=1}^{n!} \left( 1 - \frac{6 \sum d_i^2}{n(n^2-1)} \right) \tag{25}$$

Since $\sum_{i=1}^{n!} \left( 1 - \frac{6 \sum d_i^2}{n(n^2-1)} \right)$ includes all possible sequences that create the variability function of Spearman, hence

$$\frac{1}{n!} \sum_{i=1}^{n!} \left( 1 - \frac{6 \sum d_i^2}{n(n^2-1)} \right) \cong \frac{1}{(n(n^2-1)/3) + 1} \sum_{i=0}^{n(n^2-1)/3} \left( 1 - \frac{6 v_S}{n(n^2-1)} \right) \tag{26}$$

The justification of Equation 26 is similar to that of Equation 18 for Kendall, which includes all possible variability and their sources. The mean of the Spearman coefficient is computed by inserting Equation 26 into Equation 25 as follows:

$$\mu_S = \frac{1}{(n(n^2-1)/3) + 1} \sum_{i=0}^{n(n^2-1)/3} \left( 1 - \frac{6i}{n(n^2-1)} \right) = 0 \quad \text{QED} \tag{27}$$

The result of Equation 25 is consistent with the symmetry of the Spearman score. Hence, Spearman's variance for complete data is defined by

$$\begin{aligned}
\sigma_S^2 &= \frac{1}{n!} \sum_{i=1}^{n!} (\rho_i - \mu_S)^2 \\
&= \frac{1}{n!} \sum_{i=1}^{n!} \rho_i^2 \\
&= \frac{1}{n!} \sum_{i=1}^{n!} \left( 1 - \frac{6 \sum d_i^2}{n(n^2-1)} \right)^2
\end{aligned} \tag{28}$$

By inserting Equation 26 into Equation 28, we obtain the variance estimate of the Spearman coefficient as follows:

$$\hat{\sigma}_S^2 \cong \frac{1}{(n(n^2-1)/3) + 1} \sum_{i=0}^{n(n^2-1)/3} \left( 1 - \frac{6i}{n(n^2-1)} \right)^2 \tag{29}$$

The variability function of the absolute distance measure in Equation 3 is described by the total distance of a pair of index rankings as computed by

$$v_c = \sum_{i, j = 1}^{n} \frac{1}{1 + |i - j|} = \frac{e_1}{1} + \frac{e_2}{2} + \dots + \frac{e_n}{n} \tag{30}$$

where $e_i$ is the number of absolute distances obtained by the correlation coefficient for $i = 1 \dots n$. For now, we leave $e_i$ as it is, since it is difficult to compute these quantities for a single data pair. However, later in this section, we obtain the pattern of the parameter $e$ over the whole set of data sequences. Since there exist scores $c_1 \dots c_{n!}$ obtained from the $n!$ data sequences for complete data, the mean of the correlation score is computed by

$$\begin{aligned}
\mu_c &= \frac{\sum_{i=1}^{n!} c_i}{n!} \\
&= \frac{\frac{1}{n} \left( \frac{e_{1,1}}{1} + \frac{e_{2,1}}{2} + \dots + \frac{e_{n,1}}{n} \right) + \dots + \frac{1}{n} \left( \frac{e_{1,n!}}{1} + \frac{e_{2,n!}}{2} + \dots + \frac{e_{n,n!}}{n} \right)}{n!} \\
&= \frac{\frac{\sum_{i=1}^{n!} e_{1,i}}{1} + \frac{\sum_{i=1}^{n!} e_{2,i}}{2} + \dots + \frac{\sum_{i=1}^{n!} e_{n,i}}{n}}{n\, n!}
\end{aligned} \tag{31}$$

By scrutinizing data sequences for small $n$, we find that the parameters $e$ in Equation 31 are described by

$$\begin{aligned}
\sum_{i=1}^{n!} e_{1,i} &= n! = n (n-1)! = n N(n-1) \\
\sum_{i=1}^{n!} e_{2,i} &= 2(n-1)(n-1)! = 2(n-1) N(n-1) \\
\sum_{i=1}^{n!} e_{3,i} &= 2(n-2)(n-1)! = 2(n-2) N(n-1) \\
&\;\;\vdots \\
\sum_{i=1}^{n!} e_{n,i} &= 2(n-(n-1))(n-1)! = 2(n-(n-1)) N(n-1)
\end{aligned} \tag{32}$$

Inserting Equation 32 into Equation 31 delivers the following result:

$$\hat{\mu}_c = \frac{1}{n} \left( 1 + \frac{2}{n} \sum_{i=1}^{n-1} \frac{n-i}{i+1} \right) \tag{33}$$

We can rewrite Equation 33 in a longer form to disclose the pattern of the fractional units composing the mean for each $n$ as follows:

$$\begin{aligned}
\hat{\mu}_c &= \frac{1}{n} \left( 1 + \frac{2(n-1)}{n(1+1)} + \frac{2(n-2)}{n(2+1)} + \dots + \frac{2(n-(n-1))}{n((n-1)+1)} \right) \\
&= \frac{1}{n} (X_1 + X_2 + \dots + X_n)
\end{aligned} \tag{34}$$

The variance for large $n$ is then computed by

$$\sigma_c^2 = \frac{\sum_{i=1}^{n!} (c_i - \mu_c)^2}{n!} \tag{35}$$

In this case, $c_i \cong X_j$ for $i = 1 \dots n!$; $j = 1 \dots n$, with $n!$ and $n$ being the sources of $c_i$ and $X_j$, respectively. Hence, the variance estimate is obtained by inserting Equation 34 into Equation 35 as follows:

$$\begin{aligned}
\hat{\sigma}_c^2 &= \frac{\sum_{j=1}^{n} (X_j - \mu_c)^2}{n} \\
&= \frac{(1 - \mu_c)^2 + \left( \frac{2(n-1)}{n(1+1)} - \mu_c \right)^2 + \left( \frac{2(n-2)}{n(2+1)} - \mu_c \right)^2 + \dots + \left( \frac{2(n-(n-1))}{n((n-1)+1)} - \mu_c \right)^2}{n} \\
&= \frac{(1 - \mu_c)^2 + \sum_{i=1}^{n-1} \left( \frac{2(n-i)}{n(i+1)} - \mu_c \right)^2}{n}
\end{aligned} \tag{36}$$

Incomplete condition

In order to adapt to incomplete data $(A, B) = (a_1 \dots a_n; b_1 \dots b_m)$ in which $\exists a_i \notin B$ or $\exists b_j \notin A$, hence $n \neq m$, it is vital to preserve both $n$ and $m$ as the domain of correlation. Altering either $n$ or $m$ by deletion or by adding more observations would make the correlation run in a different environment. Some researchers or statisticians might be interested in the cause of the missing data, such as missing completely at random (MCAR) or missing at random (MAR), in order to build the most suitable distribution for prediction or imputation; however, this issue is beyond the scope of this paper, which is concerned with the original data pair. Disclosing the survivability of rank correlation coefficients under incomplete data is the aim here. Therefore, the analysis focuses on observing the effect of the missing data on the mean and the variance for $n \to \infty$.

For the Kendall coefficient, measurement can take place only for $a_i \in B$ and $b_j \in A$; hence, it is necessary to introduce $\alpha$ in order to compute $n_c$ and $n_d$ while excluding $a_i \notin B$ and $b_j \notin A$ as follows:

$$n_c = \sum_{i=1}^{k} \alpha\, \mathrm{con}(i, j) \tag{37}$$

$$n_d = \sum_{i=1}^{k} \alpha\, \mathrm{dis}(i, j)$$

where $\alpha$ is defined in Equation 4. Due to Equation 37, the variability function of the Kendall coefficient in Equation 12 is expanded to become:

$$\begin{aligned}
v_K &= n_c - n_d \\
&= C_{n-h,2} - 2i \\
&= \frac{1}{2} (n-h)(n-(h+1)) - 2i
\end{aligned} \tag{38}$$

where $h$ and $i$ are ranges of integers that are functions of $n$, with $h = 0 \dots (n-2)$ and $i = 0 \dots (n-h)(n-(h+1))/2$.

Based on Equation 38, $\mu_K = 0$ for incomplete data.

Proof:

$$\begin{aligned}
\mu_K &= \frac{\sum_{i=1}^{N(n)+1} \tau_i}{N(n)+1} \\
&= \frac{2}{n(n-1)(N(n)+1)} \sum_{i=1}^{N(n)+1} (n_{c_i} - n_{d_i}) \\
&= \frac{2}{n(n-1)(N(n)+1)} \sum_{h=0}^{n-2} \sum_{i=0}^{(n-h)(n-(h+1))/2} v_K \\
&= \frac{2}{n(n-1)(N(n)+1)} \sum_{h=0}^{n-2} \sum_{i=0}^{(n-h)(n-(h+1))/2} \left( \frac{1}{2} (n-h)(n-(h+1)) - 2i \right) = 0 \quad \text{QED}
\end{aligned} \tag{39}$$

Therefore, the variance of the Kendall coefficient for incomplete data is described by

$$\begin{aligned}
\sigma_K^2 &= \frac{1}{N(n)+1} \sum_{i=1}^{N(n)+1} (\tau_i - \mu)^2 \\
&= \frac{1}{N(n)+1} \sum_{i=1}^{N(n)+1} \tau_i^2 \\
&= \frac{4}{n^2 (n-1)^2 (N(n)+1)} \sum_{i=1}^{N(n)+1} (n_{c_i} - n_{d_i})^2 \\
&\cong \frac{4}{n^2 (n-1)^2\, T_v} \sum_{h=0}^{n-2} \sum_{i=0}^{(n-h)(n-(h+1))/2} v_K^2
\end{aligned} \tag{40}$$

where $T_v$ is the total number of variations generated by $v_K$, in which

$$T_v = \sum_{h=0}^{n-2} \frac{(n-h)(n-(h+1))}{2} \tag{41}$$

Thus, the variance estimate is computed by

$$\hat{\sigma}_K^2 = \frac{4}{n^2 (n-1)^2\, T_v} \sum_{h=0}^{n-2} \sum_{i=0}^{(n-h)(n-(h+1))/2} \left( \frac{1}{2} (n-h)(n-(h+1)) - 2i \right)^2 \tag{42}$$

Meanwhile, it is not possible to define any variability function for Spearman under incomplete data without violating the correlation principle, because the distance $d$ is undefined for $a_i \notin B$ or $b_j \notin A$. Referring to $v_S$ in Equation 22, $d$ is the core of the variability function for Spearman. In this case, defining $d \ge n - 1$ as the maximum distance for $a_i \notin B$ or $b_j \notin A$ violates the score range $[-1, 1]$, while defining $d = 0$ produces an anomalous measurement since $\rho(\forall a_i \notin B \text{ or } \forall b_j \notin A) = \rho(A = B) = 1$. Therefore, Spearman fails to present its variability due to the failure to define the distance for the missing items. Here, we do not argue that Spearman completely fails to deal with missing data, since pairwise deletion or data imputation could be employed to achieve data completeness. However, it is vital to obey the principle stated at the beginning of this section, namely to preserve the originality of the data pair. Taking any action to alter the data, either by deletion or by imputation, violates the foundation of this study.

Since the absolute distance measure in Equation 3 is intentionally designed to adapt to incomplete data, the variability function defined in Equation 30 for complete data also complies with the incomplete condition. Hence, it suffices to insert the missing data into its variability function in order to describe its mechanism under the incomplete condition as follows. The mean of this coefficient is computed by

$$\mu_c = \frac{\sum_{i=1}^{N(n)+1} c_i}{N(n)+1} \tag{43}$$

where $N(n)$ is defined in Equation 7. Here, the "+1" term in the denominator accommodates the case in which all items are missing in either A or B. Inserting Equation 30 into Equation 43, we obtain

$$\begin{aligned}
\mu_C &= \frac{\sum_{i=1}^{N(n)+1} (v_c / n)_i}{N(n)+1} \\
&= \frac{\frac{\sum_{i=1}^{N(n)+1} e_{1,i}}{1} + \frac{\sum_{i=1}^{N(n)+1} e_{2,i}}{2} + \dots + \frac{\sum_{i=1}^{N(n)+1} e_{n,i}}{n}}{n (N(n)+1)}
\end{aligned} \tag{44}$$

To compute Equation 44, we need to modify Equation 32 for incomplete data by inserting the number of possible data sequences from Equation 7. Hence, we obtain

$$\begin{aligned}
\sum_{i=1}^{N(n)+1} e_{1,i} &= n N(n-1) = n \sum_{i=0}^{n-1} C_{n-1,i} P_{n-1,i} \\
\sum_{i=1}^{N(n)+1} e_{2,i} &= 2(n-1) N(n-1) = 2(n-1) \sum_{i=0}^{n-1} C_{n-1,i} P_{n-1,i} \\
\sum_{i=1}^{N(n)+1} e_{3,i} &= 2(n-2) N(n-1) = 2(n-2) \sum_{i=0}^{n-1} C_{n-1,i} P_{n-1,i} \\
&\;\;\vdots \\
\sum_{i=1}^{N(n)+1} e_{n,i} &= 2(n-(n-1)) N(n-1) = 2(n-(n-1)) \sum_{i=0}^{n-1} C_{n-1,i} P_{n-1,i}
\end{aligned} \tag{45}$$
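Equation 45 reuses the complete-data pattern of Equation 32, so it may help to check that pattern numerically first. The following Python sketch is an illustration under stated assumptions, not code from the paper: the function names are hypothetical, rankings are plain tuples, and both rankings are assumed to be non-empty. It computes the absolute distance score of Equation 3 with exact fractions and compares the enumerated mean over all n! complete sequences with the closed form of Equation 33.

```python
from fractions import Fraction
from itertools import permutations


def abs_distance_score(a, b):
    """Equation 3 with alpha from Equation 4, using exact fractions.

    Items that occur in only one ranking contribute nothing,
    because alpha is 0 for every pair that involves them.
    """
    total = Fraction(0)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            if ai == bj:  # alpha = 1
                total += Fraction(1, abs(i - j) + 1)
    return total / min(len(a), len(b))


def mean_by_enumeration(n):
    """Mean of c over all n! complete sequences of B (Equation 31)."""
    a = tuple(range(1, n + 1))
    scores = [abs_distance_score(a, b) for b in permutations(a)]
    return sum(scores, Fraction(0)) / len(scores)


def mean_closed_form(n):
    """Equation 33."""
    s = sum(Fraction(n - i, i + 1) for i in range(1, n))
    return Fraction(1, n) * (1 + Fraction(2, n) * s)


if __name__ == "__main__":
    for n in range(1, 7):
        print(n, mean_by_enumeration(n), mean_closed_form(n))
```

The same `abs_distance_score` function accepts rankings of different lengths, for example `abs_distance_score((1, 2, 3), (4, 5))`, which returns 0 because no item is shared. The sketch does not attempt to verify the incomplete-data expressions themselves.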

Inserting Equation 45 into Equation 44, we obtain

$$\hat{\mu}_C = \frac{1}{n} \left( n + 2 \sum_{i=1}^{n-1} \frac{n-i}{i+1} \right) \left( \frac{\sum_{i=0}^{n-1} C_{n-1,i} P_{n-1,i}}{1 + \sum_{i=1}^{n} C_{n,i} P_{n,i}} \right) \tag{46}$$

To simplify Equation 46, we can write $\beta = \frac{\sum_{i=0}^{n-1} C_{n-1,i} P_{n-1,i}}{1 + \sum_{i=1}^{n} C_{n,i} P_{n,i}}$, hence

$$\begin{aligned}
\hat{\mu}_C &= \frac{\beta}{n} \left( n + 2 \sum_{i=1}^{n-1} \frac{n-i}{i+1} \right) \\
&= \frac{1}{n} \left( n\beta + \frac{2\beta(n-1)}{1+1} + \frac{2\beta(n-2)}{2+1} + \dots + \frac{2\beta(n-(n-1))}{(n-1)+1} \right) \\
&= \frac{1}{n} (X_1 + X_2 + \dots + X_n)
\end{aligned} \tag{47}$$

The variance is then computed by

$$\sigma_c^2 = \frac{\sum_{i=1}^{N(n)+1} (c_i - \mu_c)^2}{N(n)+1} \cong \frac{\sum_{i=1}^{n} (X_i - \mu_c)^2}{n} \tag{48}$$

The variance estimate for the absolute distance measure is obtained by inserting Equation 47 into Equation 48 as follows:

$$\begin{aligned}
\hat{\sigma}_c^2 &= \frac{1}{n} \left( (n\beta - \mu_c)^2 + \left( \frac{2\beta(n-1)}{2} - \mu_c \right)^2 + \left( \frac{2\beta(n-2)}{3} - \mu_c \right)^2 + \dots + \left( \frac{2\beta(n-(n-1))}{(n-1)+1} - \mu_c \right)^2 \right) \\
&= \frac{1}{n} \left( (n\beta - \mu_c)^2 + \sum_{i=1}^{n-1} \left( \frac{2\beta(n-i)}{i+1} - \mu_c \right)^2 \right)
\end{aligned} \tag{49}$$

Result and discussion

The behaviours of Kendall, Spearman and the absolute distance measure are asymptotically normal under complete data with $N(0, \hat{\sigma}_K^2)$, $N(0, \hat{\sigma}_S^2)$, and $N(\hat{\mu}_c, \hat{\sigma}_c^2)$, respectively. In this case, $\hat{\sigma}_K^2$ is defined by Equation 21, $\hat{\sigma}_S^2$ is defined by Equation 29, while $\hat{\mu}_c$ and $\hat{\sigma}_c^2$ are defined by Equations 33 and 36. It is worth noting that $\hat{\sigma}_K^2$, $\hat{\sigma}_S^2$, $\hat{\mu}_c$, and $\hat{\sigma}_c^2$ are unbiased since all parameters are perfectly described by their variability functions, i.e. $v_K$ for Kendall in Equations 14 and 15, $v_S$ for Spearman in Equation 24, and $v_c$ for the absolute distance measure in Equation 30, respectively. Graphing the variance estimates of Kendall, Spearman and the absolute distance measure as defined by Equations 21, 29, and 36, respectively, produces Figure 1, which discloses the following phenomenon: Kendall and Spearman share similar characteristics for large $n$ and differ significantly from the absolute distance measure. In this case, Kendall and Spearman lead directly to convergence, with Spearman converging faster than Kendall as $n$ grows. Meanwhile, the absolute distance measure presents a ripple at $n \le 4$ before leading to convergence at $n > 4$. Therefore, sorting the rate of convergence from the fastest to the slowest gives Spearman, Kendall and the absolute distance measure. The higher convergence rate of Spearman compared to the other methods is caused by the larger size of the Spearman variability, which grows exponentially with $n$ as shown by Figure 2. This condition also explains the slower convergence rate of the absolute distance measure compared to the other methods, which is due to its smaller number of variability values as $n$ grows.

Meanwhile, graphing the variability function of each rank correlation coefficient, as defined by $v_K$ in Equations 14 and 15, $v_S$ in Equation 24, and $v_c$ in Equation 30 for Kendall, Spearman, and the absolute distance measure, respectively, reveals the distribution of the variability function as $n$ grows, as shown by Figure 3A-C. Figure 3A depicts the Kendall variability function, which is distributed perfectly around the zero mean. This differs from the variability function of Spearman, which is distributed above zero and grows exponentially with $n$ as shown by Figure 3B, and from the absolute distance measure, which is distributed in the range $[0, 1]$ as shown by Figure 3C. It shows that Kendall is a well-designed rank correlation coefficient that even preserves the zero mean for any $n$ through its variability function.

For incomplete data, only Kendall and the absolute distance measure survive. This is due to their flexibility to adapt to missing data by accepting $\alpha$. It is important to note that the mechanism of employing $\alpha$ is different from the deletion mechanism, since the former preserves the originality of the data pair as the domain of correlation. In this case, Kendall and the absolute distance measure are asymptotically normal with $N(0, \hat{\sigma}_K^2)$ and $N(\hat{\mu}_c, \hat{\sigma}_c^2)$, respectively, in which $\hat{\sigma}_K^2$, $\hat{\mu}_c$ and $\hat{\sigma}_c^2$ are unbiased as defined by Equations 42, 46, and 49, respectively. Comparing the variance estimates of Kendall against the absolute distance

Figure 1. The variance estimates for Kendall, Spearman, and the absolute distance measure under the complete data condition based on the growth of n.

measure for incomplete data produces the graph in Figure 4, which shows Kendall demonstrating a higher rate of convergence than the absolute distance measure. This phenomenon is caused by the variability size of Kendall in incomplete data, which grows exponentially with $n$, a condition similar to Spearman for complete data, while the variability size of the absolute distance measure is the same as in the complete condition, as shown by Figure 5. In this case, the absolute distance measure presents a ripple for small $n$, i.e. $n \le 6$. Meanwhile, Kendall still presents significantly different characteristics from the absolute distance measure for large $n$, with $\hat{\sigma}_K^2 > \hat{\sigma}_c^2$, due to the wider range of Kendall's score in $[-1, 1]$ compared to the range of the absolute distance measure in $[0, 1]$. This condition is disclosed by graphing the variability function of Kendall for incomplete

Figure 2. The variability size of Kendall, Spearman, and the absolute distance measure based on the growth of n for complete data.

Figure 3. The variability distribution for (A) Kendall, (B) Spearman, and (C) The absolute distance measure.

Figure 4. Variance estimates for Kendall and the absolute distance measure under the incomplete condition based on the growth of n.

Figure 5. The exponential growth of Kendall's variability size for incomplete data compared to the original Kendall and Spearman in complete data and the absolute distance measure.

data, as shown by Figure 6, which shows Kendall still preserving a distribution of the variability function similar to that of the complete data, even though the mean moves away from zero as n grows due to the incomplete condition. Meanwhile, the variability function of the absolute distance measure for the incomplete data is the same as for the complete data presented in Figure 3C, occupying the range $[0, 1]$.

Meanwhile, the comparisons between the variance obtained from the complete data and from the incomplete condition for both Kendall and the absolute distance measure, given in Figures 7A and 7B, respectively, disclose that the incomplete condition produces a smaller variance than its counterpart. This condition is due to the larger number of possible data sequences for incomplete data, as shown by Table 1, occupying the same range

Figure 6. The variability function of Kendall for incomplete data.

Figure 7. Variance estimates for complete vs incomplete condition for (A) Kendall and (B) The absolute distance measure.

of variability as the complete data. For the case of Kendall, the complete data demonstrate a higher rate of convergence than the incomplete condition, as shown in Figure 7A. This phenomenon can be explained by contrasting the variability function of Kendall in the complete and incomplete conditions, as shown in Figures 3A and 6. In this case, even though the distribution of the variability function is similar in both conditions, the growth of data sequences in the incomplete condition, as described in Section 2A, forces the mean to move away from zero as n grows. This condition delays the convergence of Kendall for incomplete data.

For the case of the absolute distance measure, both the complete and incomplete data show a similar rate of convergence, as shown by Figure 7B. This condition is due to the same distribution of the variability function, as shown by Figure 3C, and the same variability size, as shown by Figures 2 and 5 for complete and incomplete data, respectively. In this case, the lower slope of the variance estimate for incomplete data shown in Figure 7B is due to the larger number of incomplete data sequences compared to the complete condition, as described in Section 2A.

Conclusion

The variability function of rank correlation coefficients enables the expression of the mean and variance under complete and incomplete data. It helps to address the mathematical tractability of the variance noted by Xu et al. (2010). Here, we find that Kendall is a better method than Spearman and the absolute distance measure for the following reasons. Under the complete data condition, Kendall presents a variability function that is perfectly distributed around the zero mean, unlike the other methods, which fail to preserve the zero mean. In this case, Spearman shows a distribution of the variability function above zero, while the absolute distance measure lies in $[0, 1]$. Although Spearman attains a higher convergence rate of variance, the distribution of the variability function shows that Kendall is a well-designed method for rank correlation on complete data. Under the incomplete condition, Kendall and the absolute distance measure exhibit the capacity to adapt to missing data. In this case, Kendall needs to extend the definition of concordant and discordant pairs by accepting alpha in order to deal with the missing data, while the absolute distance measure has this mechanism in its original design. However, Kendall shows a higher convergence rate than the absolute distance measure under the incomplete condition, and is able to preserve the zero mean of the variability distribution to some extent of n. Meanwhile, Spearman fails to deal with the incomplete data. The lesson from Spearman's inability to deal with missing data is that its metric relies only on direct measurement, without any provision for the unmeasured condition. Therefore, it is crucial to develop a normalization approach inside the metric in order to adapt to varying input that potentially ranges from the normal to the extreme or beyond the normal condition. On the other hand, Kendall's indirect measurement, which computes each data position in a ranking through the number of concordant and discordant pairs, allows the coefficient to survive the incomplete condition.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Funding

The author received no direct funding for this research.

Notes on contributor

Cahyo Crysdian obtained a bachelor's degree in Electrical Engineering from Brawijaya University, Indonesia, in 1997, and completed both his master's and PhD studies in Computer Science at Universiti Teknologi Malaysia in 2003 and 2006, respectively. Currently, he works as Head of the Master Program in Computer Science, Islamic State University Maulana Malik Ibrahim Malang, Indonesia. His research interests include Computer Vision and Intelligent Systems.

ORCID

Cahyo Crysdian http://orcid.org/0000-0002-7488-6217

References

Abdel-Aty, A.-H., Kadry, H., Zidan, M., Zanaty, E. A., Abdel-Aty, M., & Abdel-Aty, M. (2020). A quantum classification algorithm for classification incomplete patterns based on entanglement measure. Journal of Intelligent and Fuzzy Systems, 38(3), 2817–2822. https://doi.org/10.3233/JIFS-179566
Albers, W., & Teulings, M. F. (1996). A simple estimator for correlation in incomplete data. Statistics & Probability Letters, 31(1), 51–57. https://doi.org/10.1016/S0167-7152(96)00013-2
Alvo, M., & Cabilio, P. (1995). Rank correlation methods for missing data. The Canadian Journal of Statistics, La Revue Canadienne de Statistique, 23(4), 345–358. https://doi.org/10.2307/3315379
Alvo, M., & Park, J. (2002). Multivariate non-parametric tests of trend when the data are incomplete. Statistics &
Probability Letters, 57(3), 281–290. https://doi.org/10.1016/S0167-7152(02)00062-7
Cabilio, P., & Tilley, J. (1999). Power calculations for tests of trend with missing observations. Environmetrics, 10(6), 803–816. https://doi.org/10.1002/(SICI)1099-095X(199911/12)10:6<803::AID-ENV397>3.0.CO;2-O
Crysdian, C. (2018). Performance measurement without ground truth to achieve optimal edge. International Journal of Image and Data Fusion, 9(2), 170–193. https://doi.org/10.1080/19479832.2017.1384764
Eekhout, I., Enders, C. K., Twisk, J. W. R., de Boer, M. R., de Vet, H. C. W., & Heymans, M. W. (2015). Analyzing incomplete item scores in longitudinal data by including item score information as auxiliary variables. Structural Equation Modeling: A Multidisciplinary Journal, 22(4), 588–602. https://doi.org/10.1080/10705511.2014.937670
Emmanuel, T., Maupong, T., Mpoeleng, D., Semong, T., Mphago, B., & Tabona, O. (2021). A survey on missing data in machine learning. Journal of Big Data, 8(140), 1–37. https://doi.org/10.1186/s40537-021-00516-9
Kendall, M. G. (1938, June). A new measure of rank correlation. Biometrika, 30(1–2), 81–93. https://doi.org/10.1093/biomet/30.1-2.81
Kidwell, P., Lebanon, G., & Cleveland, W. S. (2008, December). Visualizing incomplete and partially ranked data. IEEE Transactions on Visualization and Computer Graphics, 14(6), 1356–1363. https://doi.org/10.1109/TVCG.2008.181
Kim, J., & Im, J. (2018). Proposing a missing data method for hospitality research on online customer reviews: An application of imputation approach. International Journal of Contemporary Hospitality Management, 30(11), 3250–3267. https://doi.org/10.1108/IJCHM-10-2017-0708
Kim, J., Tae, D., & Seok, J. (2020). A survey of missing data imputation using generative adversarial networks. 2020 International Conference on Artificial Intelligence in Information and Communication (ICAIIC), pp. 454–456. https://doi.org/10.1109/ICAIIC48513.2020.9065044
Mirzaei, A., Carter, S. R., Patanwala, A. E., & Schneider, C. R. (2022). Missing data in surveys: Key concepts, approaches, and applications. Research in Social and Administrative Pharmacy, 18(2), 2308–2316. https://doi.org/10.1016/j.sapharm.2021.03.009
Raykov, T., Schneider, B. C., Marcoulides, G. A., & Lichtenberg, P. A. (2014). Examining measure correlations with incomplete data sets. Structural Equation Modeling: A Multidisciplinary Journal, 21(2), 318–324. https://doi.org/10.1080/10705511.2014.882696
Rejeb, S., Duveau, C., & Rebafka, T. (2022, February). Self-organizing maps for exploration of partially observed data and imputation of missing values. Cornell University. arXiv:2202.07963. https://doi.org/10.48550/arXiv.2202.07963
Spearman, C. (1904, January). The proof and measurement of association between two things. The American Journal of Psychology, 15(1), 72–101. https://doi.org/10.2307/1412159
Szmidt, E., & Kacprzyk, J. (2011). The Spearman and Kendall rank correlation coefficients between intuitionistic fuzzy sets. EUSFLAT-LFA 2011. Atlantis Press.
Xu, W., Hung, Y. S., Niranjan, M., & Shen, M. (2010, February). Asymptotic mean and variance of Gini correlation for bivariate normal samples. IEEE Transactions on Signal Processing, 58(2), 522–534. https://doi.org/10.1109/TSP.2009.2032448
Yan, A., Wang, W., Ren, Y., & Geng, H. W. (2021, June). A clustering algorithm for multi-modal heterogeneous big data with abnormal data. Frontiers in Neurorobotics, 15, 1–9. https://doi.org/10.3389/fnbot.2021.680613
Zidan, M., Abdel-Aty, A.-H., El-Sadek, A., Zanaty, E. A., & Abdel-Aty, M. (2017). Low-cost autonomous perceptron neural network inspired by quantum computation. AIP Conference Proceedings, 1905, 020005. https://doi.org/10.1063/1.5012145
