The document discusses the G-means algorithm, which extends k-means by assuming that the data points within each cluster follow a multidimensional Gaussian distribution and testing that assumption. It introduces a statistical test to determine whether the data assigned to a center are sampled from a Gaussian. The test projects the data onto the direction that k-means identifies as important for separating clusters, then applies the Anderson-Darling statistic to the projected one-dimensional data. If the fit is Gaussian, the original center is kept; otherwise, the center is split into two sub-clusters.
If the data assigned to a center do not appear to be Gaussian, then we want to use multiple centers to model the data properly. The algorithm will run
k-means multiple times (up to k times when finding k centers), so the time complexity is at most O(k) times that of k-means.

The k-means algorithm implicitly assumes that the datapoints in each cluster are spherically distributed around the center. Less restrictively, the Gaussian expectation-maximization algorithm assumes that the datapoints in each cluster have a multidimensional Gaussian distribution with a covariance matrix that may or may not be fixed or shared. The Gaussian distribution test that we present below is valid for either covariance matrix assumption. The test also accounts for the number of datapoints n tested by incorporating n in the calculation of the critical value of the test (see Equation 2). This prevents the G-means algorithm from making bad decisions about clusters with few datapoints.

2.1 Testing clusters for Gaussian fit

To specify the G-means algorithm fully we need a test to detect whether the data assigned to a center are sampled from a Gaussian. The null and alternative hypotheses are

• H0: The data around the center are sampled from a Gaussian.
• H1: The data around the center are not sampled from a Gaussian.

If we accept the null hypothesis H0, then we believe that the one center is sufficient to model its data, and we should not split the cluster into two sub-clusters. If we reject H0 and accept H1, then we want to split the cluster.

The test we use is based on the Anderson-Darling statistic. This one-dimensional test has been shown empirically to be the most powerful normality test that is based on the empirical cumulative distribution function (ECDF). Given a list of values $x_i$ that have been converted to mean 0 and variance 1, let $x_{(i)}$ be the $i$-th ordered value. Let $z_i = F(x_{(i)})$, where $F$ is the $N(0, 1)$ cumulative distribution function. Then the statistic is

$$A^2(Z) = -\frac{1}{n} \sum_{i=1}^{n} (2i - 1)\left[\log(z_i) + \log(1 - z_{n+1-i})\right] - n \qquad (1)$$

Stephens [17] showed that for the case where $\mu$ and $\sigma$ are estimated from the data (as in clustering), we must correct the statistic according to

$$A^2_*(Z) = A^2(Z)\left(1 + \frac{4}{n} - \frac{25}{n^2}\right) \qquad (2)$$

Given a subset of data X in d dimensions that belongs to center c, the hypothesis test proceeds as follows:

1. Choose a significance level α for the test.
2. Initialize two centers, called "children" of c. See the text for good ways to do this.
3. Run k-means on these two centers in X. This can be run to completion, or to some early stopping point if desired. Let c1, c2 be the child centers chosen by k-means.
4. Let v = c1 − c2 be a d-dimensional vector that connects the two centers. This is the direction that k-means believes to be important for clustering. Then project X onto v: $x'_i = \langle x_i, v \rangle / \|v\|^2$. X' is a 1-dimensional representation of the data projected onto v. Transform X' so that it has mean 0 and variance 1.
5. Let $z_i = F(x'_{(i)})$. If $A^2_*(Z)$ is in the range of non-critical values at significance level α, then accept H0, keep the original center, and discard {c1, c2}. Otherwise, reject H0 and keep {c1, c2} in place of the original center.

A primary contribution of this work is simplifying the test for Gaussian fit by projecting the data to one dimension, where the test is simple to apply. The authors of [5] also use this approach for online dimensionality reduction during clustering. The one-dimensional representation allows us to consider only the data along the direction that k-means has found to be important for separating the data.
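For concreteness, here is a minimal sketch of steps 4 and 5 in Python, assuming NumPy and SciPy are available. The function names and the log-clipping guard are our own; the default critical value 0.752 is the standard tabulated threshold for α = 0.05 when both μ and σ are estimated from the data (per Stephens), and other significance levels require other tabulated values.

```python
import numpy as np
from scipy.stats import norm

def anderson_darling(x):
    """Corrected Anderson-Darling statistic A^2_*(Z), Equations (1) and (2)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    x = (x - x.mean()) / x.std()          # transform to mean 0, variance 1
    z = norm.cdf(np.sort(x))              # z_i = F(x_(i)), F the N(0,1) CDF
    z = np.clip(z, 1e-12, 1 - 1e-12)      # guard against log(0) for extreme points
    i = np.arange(1, n + 1)
    # Equation (1): A^2(Z) = -(1/n) sum (2i-1)[log z_i + log(1 - z_{n+1-i})] - n
    a2 = -np.sum((2 * i - 1) * (np.log(z) + np.log(1.0 - z[::-1]))) / n - n
    # Equation (2): Stephens' correction for mu and sigma estimated from the data
    return a2 * (1.0 + 4.0 / n - 25.0 / n ** 2)

def should_split(X, c1, c2, critical=0.752):
    """Project the cluster's points X onto the vector connecting the child
    centers c1, c2 (step 4) and test the projection for Gaussian fit (step 5).
    Returns True when H0 is rejected, i.e. the cluster should be split."""
    v = c1 - c2                           # direction k-means found important
    x_proj = X @ v / np.dot(v, v)         # x'_i = <x_i, v> / ||v||^2
    return anderson_darling(x_proj) > critical
```

Note that the division by $\|v\|^2$ follows the paper's projection formula; since the projected data are then standardized inside the test, the result is invariant to the overall scale of the projection.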
The projection approach is related to the problem of projection pursuit [7]: here, k-means searches for a direction in which the data appear non-Gaussian.
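As a hypothetical illustration of this projection-pursuit view, the sketch below applies the functions above to synthetic data, using stand-in child centers (in G-means these would come from step 3's k-means run):

```python
rng = np.random.default_rng(0)
single = rng.normal(size=(1000, 2))                            # one Gaussian blob
double = np.vstack([single, rng.normal(6.0, 1.0, (1000, 2))])  # two separated blobs

# Stand-in child centers chosen by hand for illustration only.
print(should_split(single, np.array([-0.5, 0.0]), np.array([0.5, 0.0])))
# -> False with high probability: the projection is Gaussian, keep one center
print(should_split(double, np.array([0.0, 0.0]), np.array([6.0, 6.0])))
# -> True: the projected data are strongly bimodal, so split the center
```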