The Two-Dimensional Kolmogorov-Smirnov Test

Raul H.C. Lopes∗
Ivan Reid
School of Engineering & Design, Brunel University
E-mail: [email protected]
Peter R. Hobson
School of Engineering & Design, Brunel University
E-mail: [email protected]
Goodness-of-fit statistics measure the compatibility of random samples against some theoretical
probability distribution function. The classical one-dimensional Kolmogorov-Smirnov test is a
non-parametric statistic for comparing two empirical distributions, which defines the largest
absolute difference between the two cumulative distribution functions as a measure of disagreement.
Adapting this test to more than one dimension is a challenge because there are 2^d − 1 independent
ways of defining a cumulative distribution function when d dimensions are involved. In this paper
three variations on the Kolmogorov-Smirnov test for multi-dimensional data sets are surveyed:
Peacock’s test [1], which computes in O(n^3); Fasano and Franceschini’s test [2], which computes
in O(n^2); and Cooke’s test, which computes in O(n^2).
We prove that Cooke’s algorithm runs in O(n^2), contrary to his claim that it runs in O(n lg n).
We also compare these algorithms with ROOT’s version of the Kolmogorov-Smirnov test.
∗ Speaker.
1. Introduction
Let X and Y be two independent stochastic variables whose cumulative distribution functions
F and G are unknown. A classical two-sample problem consists of testing the null hypothesis
H0 : F = G against the alternative that the two distributions differ.
When comparing two samples with cumulative distribution functions F(x) and G(x) the statis-
tic is defined as
D_KS = max_x |F(x) − G(x)|
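
For concreteness, the two-sample statistic can be computed by evaluating both empirical cumulative distribution functions at every pooled sample point. A minimal sketch in Python with NumPy (the function and variable names are ours, for illustration only):

import numpy as np

def ks_distance_1d(xs, ys):
    # empirical CDFs evaluated at every pooled sample point
    xs, ys = np.sort(xs), np.sort(ys)
    grid = np.concatenate([xs, ys])
    F = np.searchsorted(xs, grid, side='right') / xs.size  # F(t) = #{x <= t}/n
    G = np.searchsorted(ys, grid, side='right') / ys.size  # G(t) = #{y <= t}/m
    return np.abs(F - G).max()                             # D_KS

Since both empirical CDFs only jump at sample points, the maximum over the pooled grid equals the supremum over all x.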
Extending the K-S statistic to multi-dimensional space is a challenge. In one-dimensional
space the statistic is independent of the direction of ordering of the data because P(> x) = 1 − P(< x).
In a d-dimensional space, however, there are 2^d − 1 independent ways of defining a cumulative
distribution function. In [1], Peacock introduced the idea of making the statistic independent of any
particular ordering by finding the largest difference between the cumulative distribution functions
under any possible ordering. Given n points in a two-dimensional space, that amounts to calculating
the cumulative distribution functions in the 4n^2 quadrants of the plane defined by all pairs (X_i, Y_j),
X_i and Y_j being coordinates of any pair of points in the given samples.
Peacock’s test demands partitioning the n points over the 4n^2 quadrants and then computing the
maximum absolute difference between the cumulative distribution functions in all quadrants. The
counting step can be performed by a brute-force algorithm that, for each point in a sample, sweeps
through each quadrant, deciding whether the point is in it. That gives n outer steps, one for each
point, each of complexity Θ(n^2), for a final complexity of Θ(n^3).
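
The brute-force procedure can be written down directly. The sketch below is our own naming and one possible reading of Peacock’s construction: a quadrant is anchored at every pair of pooled coordinates and, for each of the four orientations, the fractions of each sample falling inside are compared, which is O(n^3) overall.

import numpy as np

def peacock_distance(a, b):
    # a, b: (n, 2) arrays of 2-D points from the two samples
    xs = np.concatenate([a[:, 0], b[:, 0]])  # pooled x coordinates
    ys = np.concatenate([a[:, 1], b[:, 1]])  # pooled y coordinates
    d = 0.0
    for x in xs:                             # 2n choices of x
        for y in ys:                         # 2n choices of y
            for ox, oy in ((1, 1), (1, -1), (-1, 1), (-1, -1)):
                # fraction of each sample in the quadrant of this orientation
                fa = np.mean((ox * a[:, 0] <= ox * x) & (oy * a[:, 1] <= oy * y))
                fb = np.mean((ox * b[:, 0] <= ox * x) & (oy * b[:, 1] <= oy * y))
                d = max(d, abs(fa - fb))
    return d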
The counting step can be dramatically improved by using a range-counting algorithm. A
set of n points in two dimensions can be indexed by a range-counting tree in O(n lg n) time [6].
Given such a tree, any two-sided range-counting query can be answered in O(lg n) time.
An efficient algorithm for Peacock’s test indexes the two samples of points in two range-counting
trees, and then performs one two-sided query for each quadrant defined by the test, giving a time
upper-bound of O(n^2 lg n).
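
One way to realize such an index, sketched below, is a "merge sort tree": points sorted by x, a segment tree over that order, and each node holding the y values of its range in sorted order. This simple version answers a quadrant query in O(lg^2 n) rather than the O(lg n) quoted above, which needs the refinements of [6]; the class and method names here are ours.

import bisect, heapq

class RangeCountTree:
    """Counts points with x <= qx and y <= qy in O(lg^2 n) per query."""
    def __init__(self, points):
        pts = sorted(points)                       # sort by x (ties by y)
        self.xs = [p[0] for p in pts]
        self.n = len(pts)
        self.tree = [None] * (4 * max(self.n, 1))
        if self.n:
            self._build(1, 0, self.n - 1, pts)

    def _build(self, node, lo, hi, pts):
        if lo == hi:
            self.tree[node] = [pts[lo][1]]
            return
        mid = (lo + hi) // 2
        self._build(2 * node, lo, mid, pts)
        self._build(2 * node + 1, mid + 1, hi, pts)
        # linear merge keeps the build at O(n lg n)
        self.tree[node] = list(heapq.merge(self.tree[2 * node],
                                           self.tree[2 * node + 1]))

    def count(self, qx, qy):
        r = bisect.bisect_right(self.xs, qx) - 1   # x-prefix [0..r]
        return self._query(1, 0, self.n - 1, r, qy) if self.n else 0

    def _query(self, node, lo, hi, r, qy):
        if r < lo:
            return 0
        if hi <= r:                                # node fully inside prefix
            return bisect.bisect_right(self.tree[node], qy)
        mid = (lo + hi) // 2
        return (self._query(2 * node, lo, mid, r, qy)
                + self._query(2 * node + 1, mid + 1, hi, r, qy))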
Fact: The lower bound for finding the maximum of Θ(n^2) quantities is Ω(n^2) [7].
Claim: The lower bound for performing Peacock’s test on n points is dominated by its 4n^2
quadrants, and it is Ω(n^2 lg n) [8].
Peacock’s test is very demanding. Performing the test on 2^18 points with the brute-force
algorithm running on a 4 GHz processor would demand several days. Even the algorithm based on
the range-counting tree would demand days to perform the test on a sample with a million points;
see Table 1.
Fasano and Franceschini introduced in [2] a variation on Peacock’s test that greatly reduces
the lower bound for its computation. Their heuristic consists in considering quadrants centred on
each point of the given samples, so that a sample of n points defines 4n quadrants. A brute-force
algorithm performs, for each given point, a sweep through all quadrants, deciding whether the
point lies in each of them and must be counted. This algorithm is presented, for example, in [9].
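
A sketch of that brute force (again with our own naming; how ties and the centre point itself are counted varies between presentations, and closed quadrants are used here):

import numpy as np

def ff_distance(a, b):
    # Fasano-Franceschini style: quadrants centred on the sample points
    d = 0.0
    for cx, cy in np.vstack([a, b]):
        for ox, oy in ((1, 1), (1, -1), (-1, 1), (-1, -1)):
            fa = np.mean((ox * a[:, 0] <= ox * cx) & (oy * a[:, 1] <= oy * cy))
            fb = np.mean((ox * b[:, 0] <= ox * cx) & (oy * b[:, 1] <= oy * cy))
            d = max(d, abs(fa - fb))
    return d

Two samples of n points give 2n centres, and each centre needs an O(n) count, so the whole sweep is O(n^2).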
An algorithm based on a range-counting tree can index the n points in O(n lg n) time. After that,
n two-sided range queries of O(lg n) cost each can be used to compute the cumulative distribution
function differences in all quadrants.
Claim: The lower bound for computing the Fasano and Franceschini test is the lower bound
for sorting n points in a two-dimensional plane, which is Ω(n lg n) [8].
Table 1 compares upper bounds for performing these two tests using both brute-force and
optimal algorithms. The identifications of the columns are:
• BFFF: brute-force algorithm for the Fasano and Franceschini test, as presented in [9].
The upper-bound columns present the asymptotic upper bounds. The time columns present the
number of hours needed to compute in the hypothetical world where we have a 4 GHz processor
that can perform lg n range queries in lg n cycles.¹
The table indicates that for sets with up to a thousand (2^10) points any of the algorithms is
fast enough. For more than one million points (2^20), only the Fasano and Franceschini test
implemented on top of a range-counting tree can give times in the range of minutes.
4. Cooke’s test
In [5], Cooke introduces an efficient implementation for Peacock’s test. He describes the
algorithm, and presents implementations for it in both Python and C. Chan [10] discusses a
parallel implementation of Cooke’s algorithm.
Cooke’s algorithm is evaluated here because it is the fastest of all tests based on the 1-D
Kolmogorov test, even if, as we show below, the claim that the test runs in O(n lg n) is false. The
¹ In isolation each of these assumptions is clearly false. Together they might resist Moore’s law for a few years.
algorithm uses two binary trees, each containing all points from both samples, each point marked
to identify its source. Both trees are ordered by x coordinate. In the first tree, points are inserted in
increasing order of their y coordinate; as a result, points with lower y coordinates are allocated
next to the root. By sweeping the tree from leaves to root, the algorithm performs a sweep from
top to bottom through all quadrants defined by the tree. The second tree reverses the order of
insertion, locating points with higher y coordinates next to the root; that produces a sweep from
bottom to top through the quadrants when the tree is swept from leaves to root. The number of
points in each quadrant is updated during the sweeps, by counting the number of points in each
subtree and updating the maximum difference.
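
A schematic sketch of how we read that description follows; this is our reconstruction, not Cooke’s code from [5]. Nodes are keyed on x and tagged with their source sample, and a post-order sweep carries per-sample counts from the leaves towards the root while tracking the largest difference.

class Node:
    def __init__(self, x, src):
        self.x, self.src = x, src        # src: 0 or 1, the sample of origin
        self.left = self.right = None

def insert(root, x, src):
    # plain, unbalanced BST insertion keyed on x
    if root is None:
        return Node(x, src)
    if x < root.x:
        root.left = insert(root.left, x, src)
    else:
        root.right = insert(root.right, x, src)
    return root

def sweep(node, n0, n1, best=0.0):
    # post-order: subtree counts flow from the leaves towards the root
    if node is None:
        return (0, 0), best
    (l0, l1), best = sweep(node.left, n0, n1, best)
    (r0, r1), best = sweep(node.right, n0, n1, best)
    c0, c1 = l0 + r0, l1 + r1
    if node.src == 0:
        c0 += 1
    else:
        c1 += 1
    best = max(best, abs(c0 / n0 - c1 / n1))   # CDF difference at this quadrant
    return (c0, c1), best

Points are inserted in increasing y order before the first sweep and in decreasing y order before the second, as described above.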
Cooke’s unproved assumptions about his algorithm are:
• The dominating time is in the construction of the tree, which he believes is done in O(n lg n).
• That by sweeping all quadrants defined by the nodes from top to bottom and in reverse, he is
computing over all 4n^2 quadrants defined by Peacock’s test.
The first assumption is clearly false. The algorithm uses an unbalanced binary tree, and the
upper bound to build such a tree is O(n^2) [11, 12]. Moreover, the algorithm depends on ordering
the tree levels by y coordinate, so using a balanced tree instead would not help: the balancing
process would disturb the ordering of the levels. The algorithm is still efficient in practice: the
expected time to construct an unbalanced binary tree from randomly ordered input is O(n lg n).
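
The quadratic worst case is easy to exhibit: whenever the y insertion order happens to agree with the x order (strongly correlated data, for instance), the unbalanced tree degenerates into a chain. A self-contained sketch:

import random

class N:
    def __init__(self, x):
        self.x, self.left, self.right = x, None, None

def insert_depth(root, x):
    # iterative BST insert that also reports how deep the new node lands
    depth, parent, node = 0, None, root
    while node is not None:
        parent, depth = node, depth + 1
        node = node.left if x < node.x else node.right
    leaf = N(x)
    if parent is None:
        return leaf, 0
    if x < parent.x:
        parent.left = leaf
    else:
        parent.right = leaf
    return root, depth

pts = sorted(random.random() for _ in range(1000))  # y order == x order
root, comparisons = None, 0
for x in pts:
    root, d = insert_depth(root, x)
    comparisons += d
print(comparisons)   # ~ n^2 / 2, not n lg n: the tree is a linear chain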
His second assumption is also false: his algorithm does not implement Peacock’s test. Indeed,
it is not possible to select the maximum of 4n^2 unknown quantities, something prescribed by
Peacock’s test, in less than 4n^2 comparisons [7]. If Cooke’s algorithm computed Peacock’s test,
one of the following would be true:
the following would be true:
• Cooke would have invented a new algorithm for selecting the maximum of a random se-
quence of numbers,
• or some of the quadrants defined by Peacock are redundant, which would make sweeping
them all unnecessary.
Tests performed with a few data sets clearly show that Cooke’s test is not Peacock’s test. Table 2
shows such a set of tests. Each file Fi contains 1024 simulated muon events, all sampled from the
same distribution. The table shows that Cooke’s test is definitely different from Peacock’s.
5. ROOT’s test

ROOT [13] implements a 2-D Kolmogorov-Smirnov test using an extension of its 1-D
Kolmogorov-Smirnov test. The two-dimensional test suffers from at least two serious limitations:
it uses binned data, a limitation already found in the one-dimensional test, and it computes the
statistic as an average of two one-dimensional statistics.
ROOT’s 2-D Kolmogorov-Smirnov test computes two one-dimensional Kolmogorov-Smirnov
tests over the two given histograms. The Kolmogorov-Smirnov test is applied to find the probability
that the x coordinates in the two histograms come from the same distribution; then a similar test
compares the y coordinates of the two samples.
Table 2: Distances computed by Cooke’s and Peacock’s tests.

File      Cooke distance   Peacock distance
F0 x F0   0.00195312       0
F1 x F1   0.00195312       0
F2 x F2   0.00195312       0
F3 x F3   0.00195312       0
F0 x F1   0.0595703        0.0595703
F1 x F2   0.0957031        0.0957031
ROOT then computes the average of those two probabilities. It is quite easy to define
distributions that will completely fool this test. For example, two samples that are equal in the x
coordinate, and whose y coordinates are almost permutations of each other, will fool ROOT into
reporting a great chance that the samples are equal.
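
A sketch of such a pair of samples (using exact permutations rather than "almost", for simplicity; the names are ours): both marginal 1-D distances are exactly zero, so any average of them is zero, even though one sample is perfectly correlated and the other is not.

import numpy as np

rng = np.random.default_rng(0)
x = rng.random(1000)
a = np.column_stack([x, x])                    # y fully correlated with x
b = np.column_stack([x, rng.permutation(x)])   # same marginals, no correlation

def ks_1d(u, v):
    u, v = np.sort(u), np.sort(v)
    g = np.concatenate([u, v])
    return np.abs(np.searchsorted(u, g, side='right') / u.size
                  - np.searchsorted(v, g, side='right') / v.size).max()

print(ks_1d(a[:, 0], b[:, 0]))   # 0.0: identical x marginals
print(ks_1d(a[:, 1], b[:, 1]))   # 0.0: identical y marginals
# ...yet a genuine 2-D statistic (Peacock, Fasano-Franceschini) is large here.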
In addition, ROOT’s 1-D Kolmogorov-Smirnov test takes as input a pair of histograms with
binned data. This goes against the most fundamental principle of the Kolmogorov-Smirnov test, a
test defined to be applied to continuous, unbinned data [14, 9]. Indeed, ROOT seems to be the only
package to implement the Kolmogorov-Smirnov test over binned data; see for example [15, 16].
Table 3 shows results for two-sample comparisons under different binnings. The files are all
different and have 4096 points each. Results report the probability that the files are different, with
values close to one indicating that the compared samples are probably different. Notice that the
third test shows an increase in the reported probability as the number of bins increases; the last one
shows exactly the opposite, a decrease in the reported probability as the number of bins increases.
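
The instability is easy to reproduce with any binned distance. The sketch below (ours, not ROOT’s internal procedure) bins two fixed, hypothetical samples at 40, 400 and 4000 bins and takes the maximum difference of the cumulative histograms; the reported distance changes with the binning alone:

import numpy as np

rng = np.random.default_rng(1)
a = rng.normal(0.0, 1.0, 4096)   # hypothetical samples, 4096 points each
b = rng.normal(0.1, 1.2, 4096)

lo, hi = min(a.min(), b.min()), max(a.max(), b.max())
for bins in (40, 400, 4000):
    ha, _ = np.histogram(a, bins=bins, range=(lo, hi))
    hb, _ = np.histogram(b, bins=bins, range=(lo, hi))
    d = np.abs(np.cumsum(ha) / a.size - np.cumsum(hb) / b.size).max()
    print(bins, d)               # the same pair, three different distances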
6. Experimental results
We performed tests with simulated data from the Compact Muon Solenoid (CMS) experiment at
CERN to evaluate the quality of the four tests we discussed. Two sets of data were used, representing
reconstructed muon tracks from simulated Z-particle decays within the CMS experiment. The data
pairs chosen were the reduced-χ^2 goodness-of-fit parameter of the track fit and Φ, the azimuthal
angle of the reconstructed track around the beam-line axis. One set of data was produced using
the ideal detector positions, the second after the introduction of small misalignments representing
the probable errors in the positioning of the detectors when CMS first starts operation [17].
We used the bootstrap method for hypothesis testing, algorithm 16.1 of [18]. As described
earlier in this paper, we worked with two-sample tests. Given two samples, our target is to find out
if the null hypothesis (that they come from the same distribution) is true. The bootstrap method here
consists in running the statistic being tested once on the original samples to obtain one value, say θ̂,
and then running it again on resamplings of the original sets.
In our tests, θ̂ was obtained for the original samples of sizes m and n. Then fifty tests were
performed over samples obtained by resampling with replacement from the union of the original
samples.
It is important to keep in mind that the Peacock, Fasano and Franceschini, and Cooke tests
compute distances between samples. Lower distances indicate greater chances that the samples
come from the same population. For these statistics, we are computing

Prob(θ̂* ≥ θ̂)

ROOT’s results show probabilities that the compared samples do not come from the same
distribution. In ROOT’s case too, we are computing

Prob(θ̂* ≥ θ̂)
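
Written out, the procedure looks as follows: a sketch of algorithm 16.1 of [18] as we apply it, where statistic can be any two-sample distance, such as the ff_distance sketch above.

import numpy as np

def bootstrap_asl(a, b, statistic, iterations=50, seed=0):
    # achieved significance level: Prob(theta_star >= theta_hat)
    rng = np.random.default_rng(seed)
    theta_hat = statistic(a, b)
    pooled = np.concatenate([a, b])
    hits = 0
    for _ in range(iterations):
        idx = rng.integers(0, len(pooled), size=len(pooled))  # with replacement
        star = pooled[idx]
        theta_star = statistic(star[:len(a)], star[len(a):])
        if theta_star >= theta_hat:
            hits += 1
    return hits / iterations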
Tables 4, 5, and 6 show results of runs for the Peacock, Fasano and Franceschini, and Cooke tests,
using pairs of samples from the same distribution. Tables 7, 8, and 9 show results for comparing
samples from the same distribution using ROOT with 40, 400 and 4000 bins respectively. In
each table, the first column identifies the samples; the others give the distance θ̂ computed by the
respective test and the significance level obtained. Files F0 through F3 have 1024 points each; files
F4 through F7 have 2048 points each. It should be noted that samples from the same distribution
are being compared, so small distances between pairs of samples are expected.
Table 10: Mean and standard error of the ASL for each test

Table 10 shows the means and standard errors of the achieved significance level (ASL) obtained
for all tests. The least standard error is for Peacock’s test, followed by Fasano and Franceschini’s.
Cooke’s test and ROOT’s test both present standard errors more than a hundred times higher than
the other two tests.
Table 11 shows results of tests comparing samples of aligned against misaligned data. The
first column identifies the files. Then come the distances obtained by each of the tests: P for
Peacock, FF for Fasano and Franceschini, C for Cooke, R40 for ROOT with 40 bins, and R4000
for ROOT with 4000 bins. Files Gi hold data from the simulation of misaligned tracks; files G0
through G3 have 1024 points each, and files G4 through G7 have 2048 points each.
ROOT’s results point to much larger differences between the samples than the other tests do,
results that seem unlikely given that we are comparing aligned against misaligned data and the
differences are not expected to be so large. In addition, ROOT shows large differences in results
for the same pair of samples when the binning is changed.
7. Conclusion
In this paper, we tested four variations of the Kolmogorov-Smirnov test for two-dimensional
data sets. We compared Peacock’s, Fasano and Franceschini’s, Cooke’s, and ROOT’s tests. We
established precise computing bounds for the first three of them. We have shown that Cooke’s
test, contrary to what is stated in [5], is not an implementation of Peacock’s test. Tests comparing
samples from the same distribution indicate that Peacock’s and Fasano and Franceschini’s tests
are more stable than the others. Experiments with ROOT have shown results with large discrepancies
when the size of the bins is changed. We are now extending these tests to incorporate three
new statistics for multi-dimensional tests, including statistics based on minimum energy and graph
matching.
Acknowledgement
We would like to acknowledge financial support from the Science and Technology Facilities
Council, UK.
References
[2] G. Fasano and A. Franceschini, A multidimensional version of the Kolmogorov-Smirnov test,
Monthly Notices of the Royal Astronomical Society 225 (1987) 155–170.
[3] D. N. Spergel, T. Piran, A. Loeb, J. Goodman, and J. N. Bahcall, A simple model for neutrino cooling
of the Large Magellanic Cloud supernova, Science 237 (1987) 1471–1473.
[4] D. W. Scott, Multivariate Density Estimation: Theory, Practice, and Visualisation. John Wiley &
Sons Inc, 2007.