If the data assigned to a center do not appear to be Gaussian, then we want to use multiple centers to model the data properly.

The algorithm will run k-means multiple times (up to k times when finding k centers), so the time complexity is at most O(k) times that of k-means. The k-means algorithm implicitly assumes that the datapoints in each cluster are
spherically distributed around the center. Less restrictively, the Gaussian expectation-maximization
algorithm assumes that the datapoints in each cluster have a multidimensional Gaussian distribution
with a covariance matrix that may or may not be fixed, or shared. The Gaussian distribution test that we
present below is valid for either covariance matrix assumption. The test also accounts for the number
of datapoints n tested by incorporating n in the calculation of the critical value of the test (see Equation
2). This prevents the G-means algorithm from making bad decisions about clusters with few datapoints.
2.1 Testing clusters for Gaussian fit

To specify the G-means algorithm fully we need a test to detect whether the data assigned to a center are sampled from a Gaussian. The alternative hypotheses are

• H0: The data around the center are sampled from a Gaussian.
• H1: The data around the center are not sampled from a Gaussian.

If we accept the null hypothesis H0, then we believe that the one center is sufficient to model its data, and we should not split the cluster into two sub-clusters. If we reject H0 and accept H1, then we want to split the cluster.

The test we use is based on the Anderson-Darling statistic. This one-dimensional test has been shown empirically to be the most powerful normality test that is based on the empirical cumulative distribution function (ECDF). Given a list of values x_i that have been converted to mean 0 and variance 1, let x_{(i)} be the ith ordered value. Let z_i = F(x_{(i)}), where F is the N(0, 1) cumulative distribution function. Then the statistic is

    A^2(Z) = -\frac{1}{n} \sum_{i=1}^{n} (2i - 1)\left[\log(z_i) + \log(1 - z_{n+1-i})\right] - n    (1)

Stephens [17] showed that for the case where \mu and \sigma are estimated from the data (as in clustering), we must correct the statistic according to

    A_*^2(Z) = A^2(Z)\left(1 + \frac{4}{n} - \frac{25}{n^2}\right)    (2)
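For concreteness, the statistic can be computed in a few lines of NumPy/SciPy. This is a minimal sketch, not code from the paper: the function name anderson_darling and the clipping of z away from 0 and 1 (to keep the logarithms finite) are our own choices.

```python
import numpy as np
from scipy.stats import norm

def anderson_darling(x):
    """Corrected Anderson-Darling statistic A_*^2(Z) of Equations (1)-(2).

    x: one-dimensional sample already transformed to mean 0 and variance 1.
    """
    n = len(x)
    # z_i = F(x_(i)), where F is the N(0, 1) CDF and x_(i) is the ith ordered value.
    z = norm.cdf(np.sort(x))
    z = np.clip(z, 1e-12, 1.0 - 1e-12)  # keep log() finite in the tails
    i = np.arange(1, n + 1)
    # Equation (1): z[::-1] supplies z_{n+1-i} when the sum is indexed by i.
    a2 = -np.sum((2 * i - 1) * (np.log(z) + np.log(1.0 - z[::-1]))) / n - n
    # Equation (2): Stephens' correction for mu and sigma estimated from the data.
    return a2 * (1.0 + 4.0 / n - 25.0 / n ** 2)
```

The corrected value A_*^2(Z) is then compared against the critical value for the chosen significance level \alpha, as in step 5 of the test below.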
Given a subset of data X in d dimensions that belongs to center c, the hypothesis test proceeds as follows (a code sketch of these steps follows the list):

1. Choose a significance level \alpha for the test.
2. Initialize two centers, called “children” of c. See the text for good ways to do this.
3. Run k-means on these two centers in X. This can be run to completion, or to some early stopping point if desired. Let c1, c2 be the child centers chosen by k-means.
4. Let v = c1 - c2 be a d-dimensional vector that connects the two centers. This is the direction that k-means believes to be important for clustering. Then project X onto v: x'_i = \langle x_i, v \rangle / \|v\|^2. X' is a 1-dimensional representation of the data projected onto v. Transform X' so that it has mean 0 and variance 1.
5. Let z_i = F(x'_{(i)}). If A_*^2(Z) is in the range of non-critical values at confidence level \alpha, then accept H0, keep the original center, and discard {c1, c2}. Otherwise, reject H0 and keep {c1, c2} in place of the original center.
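Putting the five steps together, one possible implementation is sketched below. It is written under assumptions, not as the paper's reference code: scikit-learn's KMeans stands in for steps 2-3 (the paper's recommended child initialization is not reproduced in this excerpt), anderson_darling is the function sketched after Equation (2), and critical_value is the Anderson-Darling critical value corresponding to the chosen \alpha.

```python
import numpy as np
from sklearn.cluster import KMeans
# anderson_darling: see the sketch after Equation (2).

def gaussian_split_test(X, critical_value, random_state=0):
    """Decide whether the points X (n x d) assigned to one center should be split.

    Returns (should_split, (c1, c2)), where c1 and c2 are the refined child centers.
    """
    # Steps 2-3: initialize two child centers and refine them by running 2-means on X.
    km = KMeans(n_clusters=2, n_init=10, random_state=random_state).fit(X)
    c1, c2 = km.cluster_centers_

    # Step 4: project X onto v = c1 - c2 via x'_i = <x_i, v> / ||v||^2,
    # then transform the projection to mean 0 and variance 1.
    v = c1 - c2
    proj = X @ v / np.dot(v, v)
    proj = (proj - proj.mean()) / proj.std()

    # Step 5: reject H0 (and therefore split) only if the corrected statistic
    # exceeds the critical value for the chosen significance level.
    return anderson_darling(proj) > critical_value, (c1, c2)
```

If the test rejects H0, the caller replaces the original center with c1 and c2; otherwise the children are discarded and the single center is kept, exactly as described in step 5.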
A primary contribution of this work is simplifying the test for Gaussian fit by projecting the data to one
dimension where the test is simple to apply. The authors of [5] also use this approach for online
dimensionality reduction during clustering. The one-dimensional representation of the data allows us to
consider only the data along the direction that k-means has found to be important for separating the data. This is related to the problem of projection pursuit [7], except that here k-means searches for a direction in which the data appear non-Gaussian.
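This excerpt does not include the paper's full algorithm listing, but the earlier remark that the algorithm runs k-means up to k times when finding k centers suggests an outer loop of roughly the following shape. Everything here is our assumption for illustration only: the starting point of a single center, the max_centers and min_points guards, the function name g_means_like, and the reuse of KMeans and gaussian_split_test from the sketches above.

```python
import numpy as np
from sklearn.cluster import KMeans
# gaussian_split_test: see the sketch after the hypothesis-test steps.

def g_means_like(X, critical_value, max_centers=64, min_points=8, random_state=0):
    """Hypothetical wrapper: grow the set of centers by splitting every cluster
    whose projected data reject the Gaussian hypothesis, then re-run k-means."""
    centers = X.mean(axis=0, keepdims=True)  # start from a single center
    while len(centers) < max_centers:
        km = KMeans(n_clusters=len(centers), init=centers, n_init=1,
                    random_state=random_state).fit(X)
        centers, labels = km.cluster_centers_, km.labels_
        new_centers = []
        for j, c in enumerate(centers):
            Xj = X[labels == j]
            if len(Xj) < min_points:  # too few points to test reliably; keep the center
                new_centers.append(c)
                continue
            split, children = gaussian_split_test(Xj, critical_value, random_state)
            new_centers.extend(children if split else [c])
        if len(new_centers) == len(centers):  # every cluster accepted H0; stop growing
            return centers
        centers = np.array(new_centers)
    return centers
```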
