Abstract
In the online version of the classic k-means clustering problem, the points of a dataset u_1, u_2, . . . arrive one after another in an arbitrary order. When the algorithm sees a point, it should either add it to the set of centers, or let go of the point. Once added, a center cannot be removed. The goal is to end up with a set of roughly k centers, while competing in k-means objective value with the best set of k centers in hindsight.
Online versions of k-means and other clustering problems have received significant attention in the literature. The key idea in many algorithms is that of adaptive sampling: when a new point arrives, it is added to the set of centers with a probability that depends on its distance to the centers chosen so far. Our contributions are as follows:
1. We give a modified adaptive sampling procedure that obtains a better approximation ratio (im-
proving it from logarithmic to constant).
2. Our main result is to show how to perform adaptive sampling when data has outliers (≫ k points that are potentially arbitrarily far from the actual data, thus rendering distance-based sampling prone to picking the outliers).
3. We also discuss lower bounds for k-means clustering in an online setting.
Keywords: Online algorithms, k-means clustering, Robust algorithms
1. Introduction
Clustering data is one of the fundamental subroutines in the analysis of large data. The general goal
is to partition the data into clusters so that points within a cluster are “more similar” to one another,
compared to points in different clusters. This intuitive requirement can be formalized in many
ways. Some popular notions include k-means, k-center, facility location, hierarchical clustering,
correlation clustering, etc. Each formulation has its merits, and the choice between them largely
depends on the application and the data itself. (E.g., see Dasgupta, 2016; Hastie et al., 2009).
The subject of this paper is k-means clustering. Here, we are given an integer k, and the goal is
to partition the data into k clusters, so as to minimize the sum of squared distances from the points
to the corresponding cluster centers. In the case of points in a Euclidean space, the cluster center is
simply the (empirical) mean of the points in a cluster. In general metric spaces, we need to designate
one point in a cluster to be the center. k-means clustering is well-studied from an algorithmic and
a complexity perspective. It is known to be APX-hard (i.e., hard to approximate to within some factor > 1 assuming P ≠ NP) (Bahmani et al., 2012; Krishnaswamy et al., 2018). The first constant factor
approximation is due to (Kanungo et al., 2004), who gave an algorithm based on local search. The
best known approximation algorithm is due to (Ahmadian et al., 2017).
On the practical side, there are many heuristics developed for the k-means problem. Lloyd’s
algorithm (Lloyd, 1982) (popularly known simply as the “k-means algorithm”) is an iterative EM-
style procedure that works quite well in practice. A better initialization for Lloyd’s algorithm was
proposed by (Arthur and Vassilvitskii, 2007), based on adaptive sampling (an idea which will fea-
ture prominently in our results). In order to deal with much larger datasets, distributed algorithms
for k-means have been extensively studied (see Ene et al., 2011; Balcan et al., 2013; Bateni et al.,
2014, and references therein).
Another well-studied setting for clustering problems is the data streaming model. Here, the
points of a dataset arrive one after another in arbitrary order. The algorithm must, in the end, arrive
at an approximately optimal solution to the k-means problem on the full dataset. The trivial solution
is to store all the points, and use an approximation algorithm in the end. The goal in streaming
algorithms is to do much better than this, in terms of both memory and time complexity. Specifically,
the goal is to use only roughly O(k) memory, use as little computation as possible at each step (when
a point arrives), and in the end, obtain a constant factor approximation. Surprisingly, this turns out
to be possible, using a little more than k memory (see Guha et al., 2001; Ene et al., 2011).
Online model. We consider the online model of computation, which is more restrictive than
streaming. Here, points arrive one after another in an arbitrary order. Once a point arrives, the
algorithm must decide to keep it, or let it go (forever). Once a point is kept, it cannot be removed
(thus the decision making is done in a one-shot manner). The goal is similar: we wish to maintain
roughly k centers, whilst competing with the clustering objective for the best k centers in hindsight.
The online model is well studied for problems such as hiring (the so-called secretary problem, see
Babaioff et al., 2009), as well as submodular optimization (see Buchbinder et al., 2019). In the
context of clustering and facility location, online algorithms were first studied in the classic work
of (Meyerson, 2001). The motivation here was to open facilities so as to serve demands that ar-
rive online. In this context, it is also natural that once facilities/clusters are opened, they cannot be
closed/moved without a cost. Later works of (Charikar et al., 2004) and (Liberty et al., 2016) studied
online clustering from an approximation perspective.
The work of (Liberty et al., 2016), which is inspired by the adaptive sampling ideas of (Mey-
erson, 2001) (see also Ostrovsky et al., 2013; Arthur and Vassilvitskii, 2007), shows that assuming
one has an estimate of the optimum k-means error (up to a constant factor), going over the points
and sampling them with probability proportional to the squared-distance to the points chosen so
far (scaled by the guess for the optimum) results in (a) choosing only O(k log n) points (assuming
that the guess for the optimum error was close), and (b) achieving an objective value that is at most
O(log n) · OPT.
Our contributions improve the understanding of the online k-means problem along many axes.
First, we improve the approximation factor of the k-means objective cost and study intrinsic lim-
itations in the online model. Next, we consider the problem in the presence of outliers (formally
defined below). Suppose we have z ≫ k log n points that are outliers. Now, adaptive sampling can
choose these points with high probability, resulting in Θ(z) points being chosen. We show, via a
carefully designed algorithm, how we can mitigate the effect of outliers.
Theorem 1 Suppose the points V ⊂ R^d of a dataset arrive in an online manner. Let k ≥ 1, ξ > 0, and ε ∈ (0, 1) be given parameters; k is the desired number of clusters and ξ is the desired clustering cost. Then there exists an algorithm that, after seeing u_i, decides to either ignore it or add it to a selection set C (which starts off empty), with the following properties:
Thus we obtain a bi-criteria approximation (use k′ > k centers), while obtaining a constant approximation to the objective. As stated earlier, (Liberty et al., 2016) obtain a similar result, albeit with an O(log n)(OPT + ξ) bound on the objective. We obtain this improvement via an idea introduced
in the recent work (Bhaskara et al., 2019) on online PCA. Adapted to our setting, it allows us to
analyze clustering cost in a novel way: when bounding the distance from a point u to the centers C,
we take into account not just the centers that have been chosen so far, but also the ones that are to
appear later.
We also note that the value of ξ plays an important role in the guarantees. As such, it is the “desired” objective value (one might think of it as approaching zero). However, if ξ is too small, we pay an additional log(OPT/ξ) in the bound for |C|. As long as we pick it to be at least OPT/poly(n), this will only add a logarithmic term, and we thus do not expect it to be significant in practice. More formally, ξ can be viewed as the permissible “additive” error in the clustering cost.
Our next (and main) result deals with k-means clustering, when the input also contains a certain
number of outliers. In this case, the goal is to discard some points as outliers, and minimize the k-
means objective on the remaining points. The problem, which we now define formally, has received
a lot of attention in the offline case.
Formulating clustering with outliers. Let KM-OPT(X) denote the optimum value of the k-means objective on a set of points X. Given a set of points V and a bound on the number of outliers z, the goal in clustering with outliers is to partition V = Vin ∪ Vout, such that |Vout| ≤ z, and KM-OPT(Vin) is minimized.
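As a small illustration (not part of the paper's algorithms, and the function name below is ours), the objective value for a fixed candidate center set C can be evaluated by discarding the z points farthest from C; the clustering-with-outliers problem then asks to minimize this quantity over the choice of C.

import numpy as np

def kmeans_cost_with_outliers(V, C, z):
    # V: (n, d) array of points, C: (k, d) array of candidate centers,
    # z: number of points that may be discarded as outliers.
    # Squared distance from every point to its nearest candidate center.
    d2 = ((V[:, None, :] - C[None, :, :]) ** 2).sum(axis=-1).min(axis=1)
    # Keep the n - z smallest distances (the inliers) and sum them.
    return np.sort(d2)[: len(V) - z].sum()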
The early work of (Charikar et al., 2001) studied the problem of k-median and facility location
in a similar setup. The recent work of (Gupta et al., 2017) gave a local search based algorithm
for the problem. Both these algorithms give bi-criteria approximations (where more than z points are discarded as outliers). In practice, this corresponds to declaring a small number of the inliers
as outliers. In applications where the true clusters are sufficiently large, these guarantees are still
valuable. The recent result of (Krishnaswamy et al., 2018) (and the earlier result of Chen, 2008,
for k-median) made considerable progress, giving constant factor (∼ 50) approximation algorithms
without violating the constraint on z. These algorithms are intricate and are based on LP relaxations.
Our main result shows that one can obtain approximation guarantees for k-means with outliers
even in an online model, via a slight variant of adaptive sampling. Intuitively, this is challenging because if we encounter a point that is far from all the previously stored points, we are not sure if it represents the start of a new cluster, or if it is simply an outlier. Our result is the following:
Theorem 2 Suppose the points V ⊂ R^d of a dataset arrive in an online manner. Let k ≥ 1, ξ > 0, L > 1, and ε ∈ (0, 1) be given parameters. Let Vin be the set of inlier points in V, k the desired number of clusters, and ξ the desired clustering cost over the inlier points in V. Then there exists an algorithm that, after seeing u_i, decides to either ignore it, or add it to a selection set C (which starts off empty), or add it to the outlier set M, with the following properties:
The algorithm has a parameter L, which features in the bounds. If we have an estimate of log n (an estimate up to constant factors suffices) and set L accordingly, the (log n)/L terms reduce to O(1). Note also that the algorithm discards more than z points as outliers. Assuming that L and ξ are good guesses for log n and OPT respectively, the number of points discarded is O(z).
Lower bounds. Finally, it is natural to ask if our algorithms’ dependence on the various parame-
ters, k, z, etc. can be improved. Lower bounds have been extensively studied for online clustering
and facility location problems (Meyerson, 2001; Fotakis, 2007; Liberty et al., 2016). For k-means
clustering, the first question is if we can obtain an online algorithm that achieves a constant factor
approximation to the objective, while choosing only O(k) points in the end. It is easy to see that this
is impossible if we do not know either the number of points or an estimate of the optimum objective value in advance: consider points in one dimension, where the i-th point is c^i, for i = 0, 1, . . ., with c > 2 and k = 1. The optimum solution is to choose the last point, while an online algorithm, to remain competitive, must end up keeping every point. (See Liberty et al., 2016, for a formalization of this.)
The recent work of (Moshkovitz, 2019) also established similar guarantees.
Our main result is that even when the optimum objective value is known up to a constant factor
(as is the case in our algorithm), we cannot avoid an overhead of log n in the number of centers. This
shows that the dependence in our algorithm is essentially tight. Formally, we show the following:
Theorem 3 For any constant C and parameter r ∈ N+, there is no randomized online algorithm which, with no prior knowledge of the number of points, can obtain an expected approximation ratio of C²/16 for the objective using fewer than r centers. Moreover, this holds even when the instance size n is in the range [C^{8r}, C^{16r}] (in which case, (log n)/(8 log C) ≥ r ≥ (log n)/(16 log C)).
In the instances we use for the lower bound, the optimum value is always Θ(r), thus we can
assume that we know it up to a constant; however, the instance size (total number of points) is not
known exactly.
Description of Algorithm 1. The algorithm processes the points in phases. In every phase, the
algorithm builds a set Ccur (chosen centers in the current phase), by adding each point u to Ccur with
probability pu that depends on the distance from u to the set Cpre (which is the set of centers we
have accumulated until the beginning of the phase). Once the sum of the probabilities pu exceeds
1, the phase ends, and Ccur is added to Cpre .
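A minimal Python sketch of this phase structure is given below. It assumes the selection probability p_u = min(1, k · d²(u, Cpre)/ξ) that is consistent with the bounds used in the analysis, and the bootstrapping of the very first point is our own choice; it is a sketch of the structure rather than the paper's exact Algorithm 1.

import numpy as np

def online_kmeans_sketch(points, k, xi, rng=None):
    # Phase-based adaptive sampling (sketch of Algorithm 1's structure).
    # points: iterable of d-dimensional numpy arrays, k: target number of
    # clusters, xi: guess for the optimum clustering cost.
    rng = rng or np.random.default_rng()
    C_pre = []        # centers accumulated up to the start of the phase
    C_cur = []        # centers chosen in the current phase
    prob_sum = 0.0    # running sum of selection probabilities in the phase
    for u in points:
        if not C_pre:                 # bootstrap: keep the very first point
            C_pre = [u]
            continue
        d2 = min(float(np.sum((u - c) ** 2)) for c in C_pre)
        p = min(1.0, k * d2 / xi)     # adaptive sampling probability
        if rng.random() < p:
            C_cur.append(u)
        prob_sum += p
        if prob_sum >= 1.0:           # phase ends once probabilities sum to >= 1
            C_pre, C_cur, prob_sum = C_pre + C_cur, [], 0.0
    return C_pre + C_cur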
Theorem 4 Suppose the points of V arrive in an online manner, and suppose that the guess ξ satisfies ξ ≥ ∑_{v∈V} d²(v, C*), where C* is the optimal set of cluster centers. Then, with probability at least 4/5, the algorithm satisfies:
1. The number of phases, and consequently the number of chosen centers, is ≤ O(k log n).
2. The k-means objective cost for the output centers C is ≤ ξ · O(log n).
Definition 5 Let {u_i}_{i=1}^r be the points in a phase and let Ccur be the set of selected points. A phase is said to be successful if:
1. At least one point is selected, i.e., |Ccur| ≥ 1.
2. For the optimal set of centers C*, we have ∑_{u_i ∈ Ccur} (1/p_{u_i}) d²(u_i, C*) < 4 ∑_{i∈[r]} d²(u_i, C*).
3. No point u_i with selection probability p_{u_i} < 1/n² is selected.
Lemma 6 A phase is successful with probability at least 1/4 (over the choice of Ccur).
Proof The proof is by simply bounding the probabilities of (1), (2) and (3) in the definition of a successful phase. The details are deferred to § D.1.
Lemma 7 The number of successful phases is at most O(k log n).
Proof We will prove the lemma via contradiction. Suppose we have t successful phases, and suppose we choose one point from each successful phase. Let these points be v_1, v_2, . . . , v_t.
By definition, the probability values p_{v_i} ∈ (1/n², 1). Thus, by dividing (1/n², 1) into 2 log n intervals of the form (1/2^{i+1}, 1/2^i], we obtain that there exists some q such that t/(2 log n) of the v_i lie in the interval (q, 2q]. Let I denote the indices of these points.
Now, for any i, j ∈ I, we claim that d(v_i, v_j)² > ξq/k. This claim holds because of the following: suppose i < j. Then in the j-th phase, v_i was already present in Cpre. Thus, since p_{v_j} ≤ k·d²(v_j, Cpre)/ξ, we have that p_{v_j} ≤ k·d(v_i, v_j)²/ξ. Since p_{v_j} > q, the claim follows.
Now, consider balls of squared-radius ξq/(4k) around each of the points v_i (for i ∈ I), and denote these balls by B_i (respectively). The claim above implies that for any i, j ∈ I with i ≠ j, B_i ∩ B_j = ∅.
Finally, we claim that |I| ≤ 33k. Suppose, for the sake of contradiction, that |I| > 33k. Then, since the balls are all disjoint, we must have that for at least 32k of the balls B_i, C* ∩ B_i = ∅ (where C* is the optimal set of centers). Let J denote the set of i for which this holds. This means that for any i ∈ J, d²(v_i, C*) > ξq/(4k).
Now, let us denote by S_i the set of original points u that are in phase i. Since i is a successful phase, we have (by condition (2) of the definition) that
4 ∑_{u∈S_i} d²(u, C*) ≥ (1/p_{v_i}) d²(v_i, C*) > (1/(2q)) · (ξq/(4k)) = ξ/(8k).
Thus, if we have |J| ≥ 32k, this would let us conclude that the sum of the squared distances to the optimal centers from the points in those phases is > ξ, a contradiction to the assumption that the optimum objective value is ≤ ξ.
This implies that |J| ≤ 32k, and subsequently that |I| ≤ 33k, implying that the number of phases t satisfies t ≤ 66k log n.
The two lemmas above now let us bound the number of phases in the algorithm.
Lemma 8 For any δ > 0, the total number of phases is O(k log n + log(1/δ)), with probability at least 1 − δ.
The proof is intuitively easy, but needs some care since the points chosen in a phase depend on the
previously chosen points. This is deferred to Appendix E.
We can now complete the proof of Theorem 4.
Proof [of Theorem 4] First, we bound the size of C, the selected points, using the bound on the
number of phases. Suppose the number of phases is r. In each phase, the selection probabilities add
up to a quantity between 1 and 2. Thus, we have that Pr[|C| > 20r] ≤ 1/10. Setting δ = 1/10
in Lemma 8, we get that r ≤ O(k log n) with probability ≥ 9/10. Combining these, we have that
|C| ≤ O(k log n) with probability ≥ 4/5.
Next, we need to bound the total cost ∑_u d²(u, C), where the sum ranges over all the points. To bound this, we will bound the sum in each phase. To this end, let u_1, . . . , u_r be the points in a phase, and let Cpre be the set of chosen points at the start of the phase. We consider two cases.
Case 1: there is no i such that d²(u_i, Cpre) > ξ/k. In this case,
∑_{i∈[r]} d²(u_i, C) ≤ ∑_{i∈[r]} d²(u_i, Cpre) ≤ ∑_{i∈[r]} p_{u_i} · (ξ/k) ≤ 2ξ/k.
The last inequality follows because the sum of the selection probabilities for points in any phase is ≤ 2.
Case 2: there is an i such that d²(u_i, Cpre) > ξ/k. In this case, i must be r (as the phase ends at that i). Furthermore, the point u_r is certainly included in Ccur, and hence is also in C. Thus, we have
∑_{i∈[r]} d²(u_i, C) = ∑_{i∈[r−1]} d²(u_i, C),
which can then be bounded exactly as before. Thus in both cases, we have that ∑_{i∈[r]} d²(u_i, C) ≤ 2ξ/k. Combined with the bound on the number of phases, this completes the proof.
3.3. Analysis
We show the following theorem about the algorithm.
Theorem 9 Suppose the points of V arrive in an online manner and suppose that the guess ξ satisfies ξ ≥ ∑_{v∈V} d²(v, C*). Then for any ε > 0, the algorithm satisfies:
1. w.p. at least 7/10, the number of chosen centers is ≤ O((k/ε) log(n)).
2. The k-means objective cost for the output centers C is ≤ O(1) · OPT + εξ, where OPT is the optimum k-means objective value for the full instance.
The main trick in the analysis is to move to a slightly different algorithm, one that can be viewed
as a “two-pass” procedure, and analyze that instead. Observe that in line 2 of the algorithm, we do
not use T in any way, and thus it proceeds precisely as in Algorithm 1. Let Cfirst denote the final
value (after we have seen all the points) of Cpre ∪ Ccur. Now, consider replacing the probability of adding v to T (line 2) with min(1, 40k·d²(v, Cfirst)/(εξ)). This is no longer an online algorithm, but it is a two-pass algorithm. The key observation is that the probability of adding u to T in our algorithm is at least as large as the probability of adding u to T in the two-pass algorithm. Thus, it suffices to bound the error of the two-pass algorithm.
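As a concrete (hedged) illustration of this two-pass view, the second pass amounts to one more round of independent sampling against the first-pass centers; the 40k/(εξ) scaling below follows the probability above, and in the actual online algorithm the distance is measured to the current Cpre rather than to Cfirst.

import numpy as np

def second_pass_sample(points, C_first, k, xi, eps, rng=None):
    # Second sampling pass of the two-pass view (a sketch, not Algorithm 2 itself).
    # Each point is kept independently with probability
    # min(1, 40*k*d^2(v, C_first) / (eps*xi)).
    rng = rng or np.random.default_rng()
    T = []
    for v in points:
        d2 = min(float(np.sum((v - c) ** 2)) for c in C_first)
        p = min(1.0, 40.0 * k * d2 / (eps * xi))
        if rng.random() < p:
            T.append(v)
    return T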
This is done by using the following lemma.
Lemma 10 Let S be any set of points in our metric space, and define Z := ∑_{v∈V} d²(v, S). Suppose T is formed by adding every element v ∈ V independently with probability p_v that satisfies p_v ≥ ck · d²(v, S)/Z. Then we have
E[ ∑_{v∈V} d²(v, S ∪ T) ] ≤ 16 ∑_{v∈V} d²(v, C*) + 4Z/c.    (1)
Remark. The lemma is reminiscent of the classic lemma of (Frieze et al., 2004) on “norm sam-
pling” (for matrix low rank approximation). While the statement for clustering is similar in spirit,
we are not aware of its proof in the literature (nor does it follow from the result of Frieze et al.,
2004, directly).
The proof of the lemma is somewhat involved, and is deferred to Section B. In analyzing the two-pass algorithm, we apply Lemma 10 with S = Cfirst. Then we show that the condition p_v ≥ ck·d²(v, S)/Z holds with c = 40Z/(εξ). To see this, note that for this choice of c,
ck · d²(v, Cfirst)/Z = 40k · d²(v, Cfirst)/(εξ),
which is exactly the sampling probability used for v in the two-pass algorithm (up to the capping at 1). This results in
∑_{v∈V} d²(v, Cfirst ∪ T) ≤ 160 ∑_{v∈V} d²(v, C*) + εξ.
Next, consider the expected size of the final set. This time, we need to analyze Algorithm 2 and not the two-pass algorithm above. We have that
E[|T|] = ∑_{v∈V} 40k·d²(v, Cpre)/(εξ).
By the bound on the total error from Theorem 4, we have that with probability ≥ 4/5,
∑_{v∈V} d²(v, Cpre) ≤ O(ξ log(n)).
Thus, by the above together with Markov's inequality, we have that with probability ≥ 7/10, |T| ≤ O((k/ε) log(n)). This completes the proof of Theorem 9.
Theorem 11 Suppose the points of V arrive in an online manner and let Vin be the set of inliers in V. Suppose that the guess ξ satisfies ξ ≥ ∑_{v∈Vin} d²(v, C*), where C* is the optimal set of cluster centers. Then the algorithm, on seeing a point v_i, either adds it to C, or ignores it, or marks it as an outlier, such that in the end we have, w.p. at least 4/5 and for any parameter L ≥ 1:
4. the objective cost for the inlier points (ones not marked as outliers) is ≤ ξ · O(log n + L).
Definition 12 (Type A, B) First, we say that a point u is type A if k·d²(u, Cpre)/ξ < kL/z, and we say that it is type B otherwise.
Next, a phase is said to be type A if it terminates because α ≥ 1. Else, if it terminates because of β ≥ 1, we say that it is type B. (The very last phase can be classified as type A.)
Outline of the argument. At an intuitive level, the algorithm labels points as type B (and handles
them differently) as long as they are sufficiently far from the ones chosen so far. This is certainly
a reasonable way to handle outliers (in fact, the threshold is chosen precisely to ensure that the
number of type B phases, assuming only outliers are type B points, is roughly kL). The problem
is now that inliers (e.g., when a cluster is starting to be formed) can be type B points. We need to
ensure that there is still a sufficiently large probability that they will be chosen. This is done via
a “win-win” argument. If a cluster has more than z/k points (and assume for the sake of intuition
that they appear consecutively), then once z/(kL) of them are seen, there is a sufficient probability
of picking one of the points, and thus the cluster is covered. For the clusters with < z/k points, the
total number of points in these clusters is < z. Thus classifying the points as outliers only results
in a factor 2 increase in the number of outliers. (This is the intuitive reason we obtain a bi-criteria
bound.)
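The sketch below illustrates one way to implement this Type A / Type B split, following Definition 12 above. It is a plausible reading rather than the paper's exact Algorithm 3; in particular, sending unselected Type B points to the outlier set M is an assumption made here for illustration.

import numpy as np

def online_kmeans_outliers_sketch(points, k, xi, z, L, rng=None):
    # Adaptive sampling with a Type A / Type B split (cf. Definition 12).
    rng = rng or np.random.default_rng()
    C_pre, C_cur, T_cur, M = [], [], [], []   # centers, current picks, outliers
    alpha, beta = 0.0, 0.0                    # sums of Type A / Type B probabilities
    for u in points:
        if not C_pre:
            C_pre = [u]                       # bootstrap with the first point
            continue
        d2 = min(float(np.sum((u - c) ** 2)) for c in C_pre)
        p = min(1.0, k * d2 / xi)
        if p < k * L / z:                     # Type A point: usual adaptive sampling
            if rng.random() < p:
                C_cur.append(u)
            alpha += p
        else:                                 # Type B point: capped probability kL/z
            if rng.random() < k * L / z:
                T_cur.append(u)               # kept as a (tentative) center
            else:
                M.append(u)                   # marked as an outlier (assumption)
            beta += k * L / z
        if alpha >= 1.0 or beta >= 1.0:       # phase ends (Type A or Type B phase)
            C_pre, C_cur, T_cur = C_pre + C_cur + T_cur, [], []
            alpha, beta = 0.0, 0.0
    return C_pre + C_cur + T_cur, M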
To carry this plan forth formally, we require carefully arguing that the number of phases of each
type is small. We also need to define the notion of successful phases more carefully (depending on
whether they are type A or B). The details are deferred to Appendix A.
1. We produce an instance where points have multiplicities. This can easily be removed by a small perturbation.
References
Sara Ahmadian, Ashkan Norouzi-Fard, Ola Svensson, and Justin Ward. Better guarantees for k-
means and euclidean k-median by primal-dual algorithms. 2017 IEEE 58th Annual Symposium
on Foundations of Computer Science (FOCS), pages 61–72, 2017.
David Arthur and Sergei Vassilvitskii. K-means++: The advantages of careful seeding. In Proceed-
ings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’07, pages
1027–1035, Philadelphia, PA, USA, 2007. Society for Industrial and Applied Mathematics.
Moshe Babaioff, Michael Dinitz, Anupam Gupta, Nicole Immorlica, and Kunal Talwar. Secretary
problems: Weights and discounts. In Proceedings of the Twentieth Annual ACM-SIAM Sym-
posium on Discrete Algorithms, SODA ’09, pages 1245–1254, Philadelphia, PA, USA, 2009.
Society for Industrial and Applied Mathematics.
Bahman Bahmani, Benjamin Moseley, Andrea Vattani, Ravi Kumar, and Sergei Vassilvitskii. Scal-
able k-means++. PVLDB, 5(7):622–633, 2012.
Maria-Florina Balcan, Steven Ehrlich, and Yingyu Liang. Distributed clustering on graphs. In
Neural Information Processing Systems (NIPS), 2013.
MohammadHossein Bateni, Aditya Bhaskara, Silvio Lattanzi, and Vahab S. Mirrokni. Distributed
balanced clustering via mapping coresets. In Advances in Neural Information Processing Systems
27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014,
Montreal, Quebec, Canada, pages 2591–2599, 2014.
Aditya Bhaskara, Silvio Lattanzi, Sergei Vassilvitskii, and Morteza Zadimoghaddam. Residual
based sampling for online low rank approximation. In Proceedings of the 60th Annual IEEE
Symposium on Foundations of Computer Science (FOCS), 2019.
Niv Buchbinder, Moran Feldman, Yuval Filmus, and Mohit Garg. Online submodular maximiza-
tion: Beating 1/2 made simple. In Andrea Lodi and Viswanath Nagarajan, editors, Integer Pro-
gramming and Combinatorial Optimization, pages 101–114, Cham, 2019. Springer International
Publishing.
M. Charikar, C. Chekuri, T. Feder, and R. Motwani. Incremental clustering and dynamic in-
formation retrieval. SIAM Journal on Computing, 33(6):1417–1440, 2004. doi: 10.1137/
S0097539702418498. URL https://doi.org/10.1137/S0097539702418498.
Moses Charikar, Samir Khuller, David M. Mount, and Giri Narasimhan. Algorithms for facility
location problems with outliers. In Proceedings of the Twelfth Annual ACM-SIAM Symposium
on Discrete Algorithms, SODA ’01, pages 642–651, Philadelphia, PA, USA, 2001. Society for
Industrial and Applied Mathematics. ISBN 0-89871-490-7. URL http://dl.acm.org/
citation.cfm?id=365411.365555.
Ke Chen. A constant factor approximation algorithm for k-median clustering with outliers. In Pro-
ceedings of the Nineteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’08,
pages 826–835, Philadelphia, PA, USA, 2008. Society for Industrial and Applied Mathematics.
URL http://dl.acm.org/citation.cfm?id=1347082.1347173.
Sanjoy Dasgupta. A cost function for similarity-based hierarchical clustering. In Proceedings of the
Forty-eighth Annual ACM Symposium on Theory of Computing, STOC ’16, pages 118–127, New
York, NY, USA, 2016. ACM.
Alina Ene, Sungjin Im, and Benjamin Moseley. Fast clustering using mapreduce. In Proceedings
of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,
KDD ’11, pages 681–689, New York, NY, USA, 2011. ACM.
Dimitris Fotakis. On the competitive ratio for online facility location. Algorithmica, 50(1):1–57,
2007. doi: 10.1007/s00453-007-9049-y.
Alan Frieze, Ravi Kannan, and Santosh Vempala. Fast monte-carlo algorithms for finding low-rank
approximations. Journal of the ACM, 51(6):1025–1041, November 2004.
Michel Goemans. Chernoff bounds, and some applications, 2015. URL http://math.mit.edu/~goemans/18310S15/chernoff-notes.pdf.
S. Guha, N. Mishra, R. Motwani, and L. O’Callaghan. Clustering data streams. STOC, 2001.
Shalmoli Gupta, Ravi Kumar, Kefu Lu, Benjamin Moseley, and Sergei Vassilvitskii. Local
search methods for k-means with outliers. Proc. VLDB Endow., 10(7):757–768, March 2017.
ISSN 2150-8097. doi: 10.14778/3067421.3067425. URL https://doi.org/10.14778/
3067421.3067425.
Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The elements of statistical learning: data
mining, inference and prediction. Springer, 2 edition, 2009.
Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman, and
Angela Y. Wu. A local search approximation algorithm for k-means clustering. Computational
Geometry, 28(2):89 – 112, 2004. Special Issue on the 18th Annual Symposium on Computational
Geometry - SoCG2002.
Ravishankar Krishnaswamy, Shi Li, and Sai Sandeep. Constant approximation for k-median and
k-means with outliers via iterative rounding. In Proceedings of the 50th Annual ACM SIGACT
Symposium on Theory of Computing, STOC 2018, pages 646–659, New York, NY, USA, 2018.
ACM.
Edo Liberty, Ram Sriharsha, and Maxim Sviridenko. An Algorithm for Online K-Means Clustering,
pages 81–89. SIAM, 2016.
Stuart P. Lloyd. Least squares quantization in pcm. IEEE Trans. Information Theory, 28:129–136,
1982.
A. Meyerson. Online facility location. In Proceedings of the 42nd IEEE Symposium on Foundations
of Computer Science, FOCS ’01, pages 426–, Washington, DC, USA, 2001. IEEE Computer
Society.
Michal Moshkovitz. Unexpected effects of online k-means clustering, 2019.
Rafail Ostrovsky, Yuval Rabani, Leonard J. Schulman, and Chaitanya Swamy. The effectiveness of
lloyd-type methods for the k-means problem. J. ACM, 59(6):28:1–28:22, January 2013.
Definition 13 (Majority outlier/inlier) A phase is said to be a majority outlier phase if one of the following conditions holds:
1. The phase is Type A and the following inequality holds (where {u_i}_{i=1}^r are the Type A points in the phase):
∑_{i∈[r] ∧ u_i∈V\Vin} k·d²(u_i, Cpre)/ξ ≥ 1/2.
2. The phase is Type B and the following holds (where {u_i}_{i=1}^r are the Type B points in the phase):
∑_{i∈[r] ∧ u_i∈V\Vin} kL/z ≥ 1/2.
Otherwise, the phase is said to be a majority inlier phase.
2. (Type A phases with majority inliers) Let the phase be a majority inlier phase of Type A. Let {u_i}_{i=1}^r be the Type A points in the phase and let Ccur be the set of selected Type A points. The phase is said to be successful if:
(a) At least one inlier point of Type A is selected, i.e., |Ccur ∩ Vin| ≥ 1.
(b) For the optimal set of clusters C*, we have
∑_{u_i ∈ Ccur ∩ Vin} (1/p_{u_i}) d²(u_i, C*) < 4 ∑_{i∈[r] ∧ u_i∈Vin} d²(u_i, C*).
(c) No point u_i with selection probability p_{u_i} < 1/n³ is selected.
3. (Type B phases with majority inliers) Let the phase be a majority inlier phase of Type B. Let {u_i}_{i=1}^r be the Type B points in the phase and let Tcur be the set of selected Type B points. The phase is said to be successful if:
(a) At least one inlier point of Type B is selected, i.e., |Tcur ∩ Vin| ≥ 1.
(b) For the optimal set of clusters C*, we have
∑_{u_i ∈ Tcur ∩ Vin} (1/p_{u_i}) d²(u_i, C*) < 4 ∑_{i∈[r] ∧ u_i∈Vin} d²(u_i, C*).
Lemma 15 A Type A phase with majority inliers is successful with probability ≥ 1/8 (over the choice of Ccur).
Proof The proof is again by bounding the probabilities of (a), (b) and (c) in the definition of a
successful Type A phase with majority inliers.
First off, the probability that Ccur ∩ Vin = ∅ is at most
∏_{i∈[r]∧u_i∈Vin} (1 − p_{u_i}) ≤ exp(−∑_{i∈[r]∧u_i∈Vin} p_{u_i}) ≤ 1/e^{1/2},
since by the definition of a majority inlier Type A phase, the probabilities over the inliers add up to a quantity ≥ 1/2.
Next, let Y_i be an indicator random variable that is 1 if u_i ∈ Ccur and 0 otherwise. Thus Pr[Y_i = 1] = p_{u_i}. Define
X = ∑_{i∈[r]∧u_i∈Vin} Y_i · (1/p_{u_i}) d²(u_i, C*).    (2)
Since E[X] = ∑_{i∈[r]∧u_i∈Vin} d²(u_i, C*), part (b) of the definition of a successful phase holds with probability ≥ 3/4, by Markov's inequality. For part (c), note that the probability of picking a point u with probability < 1/n³ is at most 1/n² (as there are at most n points in total).
Thus all the conditions hold with probability ≥ 1 − 1/e^{1/2} − 1/4 − 1/n² > 1/8, for n > 8.
Lemma 16 A Type B phase with majority inliers is successful with probability ≥ 1/8 (over the choice of Tcur).
Proof The proof is by simply bounding the probabilities of (a), and (b) in the definition of a
successful Type B phase with majority inliers.
First off, the probability that Tcur ∩ Vin = ∅ is at most
∏_{i∈[r]∧u_i∈Vin} (1 − p_{u_i}) ≤ exp(−∑_{i∈[r]∧u_i∈Vin} p_{u_i}) ≤ 1/e^{1/2},
since by the definition of a majority inlier Type B phase, the probabilities over the inliers add up to a quantity ≥ 1/2.
Next, let Y_i be an indicator random variable that is 1 if u_i ∈ Tcur and 0 otherwise. Thus Pr[Y_i = 1] = p_{u_i}. Define
X = ∑_{i∈[r]∧u_i∈Vin} Y_i · (1/p_{u_i}) d²(u_i, C*).    (3)
Since E[X] = ∑_{i∈[r]∧u_i∈Vin} d²(u_i, C*), part (b) of the definition of a successful phase holds with probability ≥ 3/4, by Markov's inequality.
Thus all the conditions hold with probability ≥ 1 − 1/e^{1/2} − 1/4 > 1/8.
There are two main steps in the analysis. First, we need to bound the number of phases in the algorithm. This determines the expected number of selected centers. Next, we need to bound the total error. The following lemmas capture these statements.
Lemma 17 The number of majority outlier phases is O(kL).
Proof We will prove the lemma via contradiction. Suppose we have t majority outlier phases.
Consider one of these phases. If the phase is of Type B, and the Type B points in that phase are v_1, v_2, . . . , v_r, then by definition we would have ∑_{i∈[r]∧v_i∈V\Vin} kL/z ≥ 1/2. If the phase is of Type A, and the Type A points in that phase are v_1, v_2, . . . , v_r, we would have ∑_{i∈[r]∧v_i∈V\Vin} k·d²(v_i, Cpre)/ξ ≥ 1/2, and since k·d²(v_i, Cpre)/ξ < kL/z for Type A points, we can see that ∑_{i∈[r]∧v_i∈V\Vin} kL/z > 1/2. Therefore, in either case, the number of outlier points in the phase is > z/(2kL).
Now, we claim that t ≤ 2kL. Suppose, for the sake of contradiction, that t > 2kL. Then, since in each such phase the number of outlier points is > z/(2kL), we would have that the total number of outliers is > z, a contradiction to the assumed upper bound on the number of outliers. This implies t ≤ 2kL, implying that the number of majority outlier phases is O(kL).
Lemma 18 The number of successful majority inlier phases of Type A is O(k log n).
Proof We will prove the lemma via contradiction. Suppose we have t successful Type A phases, and suppose we choose one Type A inlier point from each successful phase. Let these points be v_1, v_2, . . . , v_t. By definition, the probability values p_{v_i} ∈ (1/n³, kL/z). Thus, by dividing (1/n³, kL/z) into 3 log n + log(kL/z) intervals of the form (1/2^{i+1}, 1/2^i], we obtain that there exists some q such that t/(3 log n + log(kL/z)) of the v_i lie in the interval (q, 2q]. Let I denote the indices of these points.
Now, for any i, j ∈ I, we claim that d(v_i, v_j)² > ξq/k. This claim holds because of the following: suppose i < j. Then in the j-th phase, v_i was already present in Cpre. Thus, since p_{v_j} ≤ k·d²(v_j, Cpre)/ξ, we have that p_{v_j} ≤ k·d(v_i, v_j)²/ξ. Since p_{v_j} > q, the claim follows.
Now, consider balls of squared-radius ξq/(4k) around each of the points v_i (for i ∈ I), and denote these balls by B_i (respectively). The claim above implies that for any i, j ∈ I with i ≠ j, B_i ∩ B_j = ∅.
Finally, we claim that |I| ≤ 33k. Suppose, for the sake of contradiction, that |I| > 33k. Then, since the balls are all disjoint, we must have that for at least 32k of the balls B_i, C* ∩ B_i = ∅ (where C* is the optimal set of centers). Let J denote the set of i for which this holds. This means that for any i ∈ J, d²(v_i, C*) > ξq/(4k).
Now, let us denote by S_i the set of original points u that are in phase i. Since i is a successful phase, we have (by condition (b) of the definition of a Type A successful phase) that
4 ∑_{u∈S_i ∧ u∈Vin} d²(u, C*) ≥ (1/p_{v_i}) d²(v_i, C*) > (1/(2q)) · (ξq/(4k)) = ξ/(8k).
Thus, if we have |J| ≥ 32k, this would let us conclude that the sum of the squared distances to the optimal centers from the inlier points in those phases is > ξ, a contradiction to the assumption that the optimum objective value is ≤ ξ.
This implies that |J| ≤ 32k, and subsequently that |I| ≤ 33k, implying that the number of phases t satisfies t ≤ 99k log n (since kL/z is assumed to be < 1, we can drop that term from the bound).
Lemma 19 The number of successful majority inlier phases of Type B is O(k).
Proof We will prove the lemma via contradiction. Suppose we have t successful Type B phases, and suppose we choose one Type B inlier point from each successful phase. Let these points be v_1, v_2, . . . , v_t.
By definition, the probability values p_{v_i} = kL/z and d²(v_i, Cpre) ≥ ξL/z.
Now, for any i, j ∈ [t], we claim that d(v_i, v_j)² > ξL/(2z). This claim holds because of the following: suppose i < j. Then in the j-th phase, v_i was already present in Cpre. Thus, since p_{v_j} ≤ k·d²(v_j, Cpre)/ξ, we have that p_{v_j} ≤ k·d(v_i, v_j)²/ξ. Since p_{v_j} > kL/(2z), the claim follows.
Now, consider balls of squared-radius ξL/(8z) around each of the points v_i (for i ∈ [t]), and denote these balls by B_i (respectively). The claim above implies that for any i, j ∈ [t] with i ≠ j, B_i ∩ B_j = ∅.
Finally, we claim that t ≤ 33k. Suppose, for the sake of contradiction, that t > 33k. Then, since the balls are all disjoint, we must have that for at least 32k of the balls B_i, C* ∩ B_i = ∅ (where C* is the optimal set of centers). Let J denote the set of i for which this holds. This means that for any i ∈ J, d²(v_i, C*) > ξL/(8z).
Now, let us denote by S_i the set of original points u that are in phase i. Since i is a successful phase, we have (by condition (b) of the definition of a Type B successful phase) that
4 ∑_{u∈S_i ∧ u∈Vin} d²(u, C*) ≥ (1/p_{v_i}) d²(v_i, C*) > (z/(kL)) · (ξL/(8z)) = ξ/(8k).
Thus, if we have |J| ≥ 32k, this would let us conclude that the sum of the squared distances to the optimal centers from the inlier points in those phases is > ξ, a contradiction to the assumption that the optimum objective value is ≤ ξ. This implies that |J| ≤ 32k, and subsequently that t ≤ 33k.
Lemma 20 For any δ > 0, the total number of phases is O(k log n + kL + log(1/δ)), with
probability at least 1 − δ.
The proof can be divided into two parts: first, for Type A and Type B phases separately, we argue that the corresponding bound holds with probability 1 − δ/2, using Appendix E. By a union bound, both bounds hold simultaneously with probability at least 1 − δ, and thus w.p. at least 1 − δ the total number of phases is O(k log n + kL + log(1/δ)).
We can now complete the proof of Theorem 11.
Proof [of Theorem 11] First, we bound the size of C, the selected points, using the bound on the
number of phases. Suppose the number of phases is r. In each phase, the selection probabilities add
up to a quantity between 1 and 2. Thus, we have that Pr[|C| > 20r] ≤ 1/10. Setting δ = 1/10 in
Lemma 20, we get that r ≤ O(k log n + kL) with probability ≥ 9/10. Combining these, we have
that |C| ≤ O(k log n + kL) with probability ≥ 4/5.
Next, we bound the number of points marked as outliers. To bound this, we will bound the number of points marked as outliers in each phase. Let the set of points marked as outliers be M. We consider two cases.
Case 1: the phase is of Type A. Then
∑_{i∈[r] ∧ u_i∈M} kL/z ≤ 1.
The inequality follows because, by definition, if the phase is of Type A then β < 1.
Case 2: the phase is of Type B. Then
∑_{i∈[r] ∧ u_i∈M} kL/z ≤ 2.
The inequality follows because, by definition, if the phase is of Type B then 1 ≤ β ≤ 2.
Therefore, in either case, the number of points marked as outliers in the phase is |{i ∈ [r] : u_i ∈ M}| ≤ 2z/(kL). Combined with the bound on the number of phases, this shows that |M| ≤ (2z/(kL)) · O(k log n + kL) ≤ O(z((log n)/L + 1)).
Next, we need to bound the total cost ∑_{u∈Vin\M} d²(u, C), where the sum ranges over the inlier points that are not marked as outliers. To bound this, we will bound the sum in each phase. To this end, let u_1, . . . , u_r be the points in a phase, and let Cpre be the set of chosen points at the start of the phase. We now consider two cases.
Case 1: the phase is of Type A. Then
∑_{i∈[r] ∧ u_i∈Vin\M} d²(u_i, C) ≤ ∑_{i∈[r] ∧ u_i∈Vin\M} d²(u_i, Cpre) ≤ ∑_{i∈[r] ∧ u_i∈Vin\M} p_{u_i} · (ξ/k) ≤ 2ξ/k.
The last inequality follows because the sum (α) of the selection probabilities of the Type A points in the phase is ≤ 2 when the phase is of Type A.
Case 2: the phase is of Type B. Then
∑_{i∈[r] ∧ u_i∈Vin\M} d²(u_i, C) ≤ ∑_{i∈[r] ∧ u_i∈Vin\M} d²(u_i, Cpre) ≤ ∑_{i∈[r] ∧ u_i∈Vin\M} p_{u_i} · (ξ/k) < ξ/k.
The last inequality follows because the sum (α) of the selection probabilities of the Type A points in the phase is < 1 when the phase is of Type B.
This implies that in either case, ∑_{i∈[r] ∧ u_i∈Vin} d²(u_i, C) ≤ 2ξ/k. Combined with the bound on the number of phases, this shows that the total cost over the inlier points (not marked as outliers) is ≤ ξ · O(log n + L).
Algorithm 3 performs well when the number of outliers is sufficiently large (z ≫ k log n). However, even in the case where z is small, we can see that by controlling the parameter L we can still get the algorithm to work well (at the cost of marking O(z · (log n)/L) many points as outliers).
Theorem 21 Suppose the set of points V arrive in an online manner. Let k ≥ 1 be an integer, and let ξ be a (given) parameter that satisfies ∑_{v∈Vin} d²(v, C*) ≤ ξ. Then for any ε > 0, the algorithm satisfies:
1. w.p. at least 7/10, the number of chosen centers is ≤ O((k/ε)(log(n) + L)).
2. The k-means objective cost for the output centers C, over the points not marked as outliers, is ≤ O(1) · OPT + εξ, where OPT is the optimum k-means objective value over the inlier points.
A.2. Analysis
Once again, the idea is to view the algorithm as a “two-pass” procedure and analyze that instead. Once again, we observe that in line 4 of the algorithm, we do not use T in any way, and thus it proceeds precisely as in Algorithm 3. Let Cfirst denote the final value (after we have seen all the points) of Cpre ∪ Ccur ∪ Tcur. Now, let us replace the probability of adding v to T (line 4) with min(1, 40k·d²(v, Cfirst)/(εξ)). This is no longer an online algorithm, but it is a two-pass algorithm. The key observation is that the probability of adding u to T in our algorithm is at least as large as the probability of adding u to T in the two-pass algorithm. Thus, it suffices to bound the error of the two-pass algorithm.
This is done by using Lemma 10. In this case we only take into account the set Vin \ M, which contains the points that are not marked as outliers. (The set of points marked as outliers is denoted by M.)
In analyzing the two-pass algorithm, we apply Lemma 10 with S = Cfirst. Then we show that the condition p_v ≥ ck·d²(v, S)/Z holds with c = 40Z/(εξ). To see this, note that for this choice of c,
ck · d²(v, Cfirst)/Z = 40k · d²(v, Cfirst)/(εξ),
which is exactly the sampling probability used for v in the two-pass algorithm (up to the capping at 1).
Next, consider the expected size of the final set. This time, we need to analyze Algorithm 2 and not the two-pass algorithm above. We have that
E[|T|] = ∑_{v∈Vin\M} 40k·d²(v, Cpre)/(εξ).
By the bound on the total error from Theorem 11, we have that with probability ≥ 4/5,
∑_{v∈Vin\M} d²(v, Cpre) ≤ O(ξ(log(n) + L)).
Thus, by the above together with Markov's inequality, we have that with probability ≥ 7/10,
|T| ≤ O((k/ε)(log(n) + L)).
This completes the proof of Theorem 21.
We will use the following simple fact.
Lemma 22 For any α > 0 and any nonnegative reals δ_1, . . . , δ_k,
∑_{i=1}^{k} e^{−αkδ_i} · δ_i ≤ 1/α.
We will also use the so-called parallel axis theorem, which can be stated as follows.
Lemma 23 Let u_1, u_2, . . . , u_r be a set of points and let µ be their mean. Let x be any other point. Then we have
∑_i d²(u_i, x) = ∑_i d²(u_i, µ) + r · d²(x, µ).
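For completeness, the direct computation is as follows (assuming squared Euclidean distances, so that d²(u, x) = ‖u − x‖²):

\sum_i \|u_i - x\|^2
  = \sum_i \|(u_i - \mu) + (\mu - x)\|^2
  = \sum_i \|u_i - \mu\|^2
    + 2\Big\langle \sum_i (u_i - \mu),\ \mu - x \Big\rangle
    + r\,\|\mu - x\|^2 ,

and the cross term vanishes because µ is the mean of the u_i, i.e., ∑_i (u_i − µ) = 0.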
We are now ready to prove Lemma 10.
Proof [of Lemma 10] Let Z = ∑_{v∈V} d²(v, S) as in the statement. Let C_1, C_2, . . . , C_k be the optimal clusters and let µ_1, µ_2, . . . , µ_k be their centers, respectively. Define σ_i² to be the average squared distance from the points in cluster i to its center, i.e.,
σ_i² = (1/|C_i|) ∑_{u∈C_i} d²(u, µ_i).
The rough idea of the proof is as follows. Our goal will be to argue that for every cluster, the
sampling procedure is quite likely to pick a point that is close to the center of the cluster. We then
argue that this leads to the value of d2 (u, S ∪ T ) being relatively small (compared to the average
radius of the cluster). To this end, for every i, define
Di = {u ∈ Ci : d2 (u, µi ) ≤ 4σi2 }.
Fix any u′ ∈ D_i; note that, by Markov's inequality, |C_i \ D_i| ≤ |C_i|/4, so D_i is nonempty. Next, for any u ∈ C_i \ D_i, we have d(u, S) ≤ d(u, u′) + d(u′, S), and thus d²(u, S) ≤ 2(d²(u, u′) + d²(u′, S)). Similarly, we can use d(u, u′) ≤ d(u, µ_i) + d(u′, µ_i) to conclude that d²(u, u′) ≤ 2(d²(u, µ_i) + d²(u′, µ_i)). Putting the two together and summing over all u ∈ C_i \ D_i, we have
∑_{u∈C_i\D_i} d²(u, S) ≤ 2|C_i \ D_i| (d²(u′, S) + 2d²(u′, µ_i)) + 4 ∑_{u∈C_i\D_i} d²(u, µ_i).
Now, since u′ ∈ D_i, we have d²(u′, µ_i) ≤ 4σ_i². Also, ∑_{u∈C_i} d²(u, µ_i) = |C_i|σ_i² by definition, so we can use this as an upper bound for the last term above. Using these, together with |C_i \ D_i| ≤ |C_i|/4, we get:
∑_{u∈C_i\D_i} d²(u, S) ≤ (|C_i|/2) d²(u′, S) + 4|C_i|σ_i² + 4|C_i|σ_i².
Next, call a cluster good if ∑_{u∈C_i} d²(u, S) ≤ 16|C_i|σ_i², and bad otherwise. If a cluster is good, we trivially have ∑_{u∈C_i} d²(u, S ∪ T) ≤ 16|C_i|σ_i². Thus, let us only focus on bad clusters. The important observation is that, by (6), if C_i is a bad cluster, then
∑_{u∈D_i} d²(u, S) ≥ (3/10) ∑_{u∈C_i} d²(u, S).
Intuitively, this means that if one were to sample points in C_i using weights proportional to d²(u, S), there is at least a 3/10 probability of picking a point “close” to the center.
Let us now use this to reason about E[ ∑_{u∈C_i} d²(u, S ∪ T) ]. We can bound it as follows. If the sample T contains some point x ∈ D_i, then by the parallel axis theorem,
∑_{u∈C_i} d²(u, S ∪ T) ≤ ∑_{u∈C_i} d²(u, x) ≤ |C_i|σ_i² + 4|C_i|σ_i² = 5|C_i|σ_i².
The probability that no point from D_i is chosen is at most ∏_{u∈D_i} (1 − p_u) ≤ exp(−∑_{u∈D_i} p_u). By the condition we have on p_u, this can be bounded by
exp( −(ck/Z) ∑_{u∈D_i} d²(u, S) ) ≤ exp( −(3ck/(10Z)) ∑_{u∈C_i} d²(u, S) ).
For notational convenience, define δ_i = (1/Z) ∑_{u∈C_i} d²(u, S) (thus ∑_i δ_i = 1). The above computation shows that the probability that T contains no point of D_i is at most exp(−3ckδ_i/10). Thus we can bound
E[ ∑_{u∈C_i} d²(u, S ∪ T) ] ≤ 5|C_i|σ_i² + Zδ_i · exp(−3ckδ_i/10).
Now, summing this over all the clusters (using the trivial bound for the good ones), we can bound
E[ ∑_{u∈V} d²(u, S ∪ T) ] ≤ 16 ∑_{u∈V} d²(u, C*) + ∑_i Zδ_i · exp(−3ckδ_i/10).
Now, setting α = 3c/10 in Lemma 22, we can bound the second term by 10Z/(3c). This gives the desired result.
Lemma 24 The total number of phases of Algorithm 5 is O(k log n · (1 + log(OPT/ξ0))).
Proof Define OPT = ∑_{v∈V} d²(v, C*) as before. Now, if ξ0 ≥ OPT, then the algorithm would never double its guess, and thus would only result in O(k log n) phases.² Consider the case where ξ0 < OPT. Since we stop and double the ξ value after ck log n phases, the total number of phases is clearly bounded above by O(k log n · log(OPT/ξ0)). Together, these cases imply the desired lemma.
The next lemma is more interesting: it shows that the error bound only increases by a constant factor (and not by a log(OPT/ξ0) factor).
Lemma 25 The error of Algorithm 5 (the total squared distance from the points u to the points chosen in the end) is ≤ O(OPT · log n).
Proof The algorithm doubles the guess ξ after ck log n phases. The total objective value incurred for each value of ξ is thus at most 2cξ log n. After the i-th time ξ0 is doubled, we will have ξ = 2^i ξ0. Let t = log(OPT/ξ0) be the bound on the number of times we double ξ. Then we have a bound on the objective value of
2c log n · (ξ0 + 2ξ0 + · · · + 2^t ξ0) = O(c · OPT · log n).
This completes the proof.
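A minimal sketch of this doubling strategy, building on the phase-based sampler sketched earlier (the constant c is a parameter, n is assumed known for the c·k·log n phase budget, and the bookkeeping details are our assumptions rather than the paper's exact Algorithm 5):

import numpy as np

def online_kmeans_doubling(points, n, k, xi0, c=1, rng=None):
    # Doubling wrapper sketch (cf. Algorithm 5): run the phase-based sampler
    # with guess xi, and double xi once c*k*log(n) phases have elapsed.
    rng = rng or np.random.default_rng()
    xi = xi0
    C_pre, C_cur = [], []       # accumulated centers / current-phase centers
    prob_sum, phases = 0.0, 0
    for u in points:
        if not C_pre:
            C_pre = [u]         # bootstrap with the first point
            continue
        d2 = min(float(np.sum((u - v) ** 2)) for v in C_pre)
        p = min(1.0, k * d2 / xi)
        if rng.random() < p:
            C_cur.append(u)
        prob_sum += p
        if prob_sum >= 1.0:     # a phase ends
            C_pre, C_cur, prob_sum = C_pre + C_cur, [], 0.0
            phases += 1
            if phases >= c * k * np.log(n):   # too many phases: double the guess
                xi *= 2
                phases = 0
    return C_pre + C_cur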
2. Technically, this is not always true: there is a failure probability of 1/10. However, we can carry forth this failure probability into our procedure's guarantee as well.
Thus part (2) of the definition of a successful phase holds with probability ≥ 3/4, by Markov's inequality. For part (3), note that the probability of picking a point u with probability < 1/n² is at most 1/n (as there are at most n points in total).
Thus all the conditions hold with probability ≥ 1 − 1/e − 1/4 − 1/n > 1/4, for n > 10.
Lemma 26 We toss a coin n times. The tosses are independent of each other and in each toss, the probability of seeing a head is at least p. Let H_m and T_m denote the number of heads and tails we observe in the first m ≤ n coin tosses. With probability 1 − δ, we have H_m ≥ pm/4 − ⌈8 ln(2/δ)/p⌉ for any 1 ≤ m ≤ n. We note that although the claim is about the conjunction of all these n events, the probability bound does not depend on n.
Proof We denote the expected number of heads in the first m tosses by µ, which is at least pm. Applying the lower tail inequality of Theorem 4 in (Goemans, 2015) implies
Pr[H_m < (1 − 1/2)µ] ≤ e^{−µ/8} ≤ e^{−pm/8}.
The error probability e^{−pm/8} is at most δ/2 for m ≥ m₀ = ⌈8 ln(2/δ)/p⌉. Instead of summing up the error bound for all values of m, we focus on the smaller, geometrically growing sequence M = {2^l m₀ : l ∈ Z≥0 and 2^l m₀ ≤ n}. Having the lower bound on H_m for every m ∈ M helps us achieve a universal lower bound for any 1 ≤ m ≤ n as follows. For any m ≤ m₀, the bound H_m ≥ pm/4 − m₀ holds trivially. For any other m ≤ n, there exists an m′ ∈ M such that m′ ≤ m ≤ 2m′. By definition, H_m is at least H_{m′}. Assuming H_{m′} ≥ pm′/2 implies H_m is at least pm/4, which proves the claim of the lemma. So we focus on bounding the error probabilities only for values in the set M. For m₀, the error probability is at most δ/2. The next value in M is 2m₀, so given the exponential form of the error, it is at most (δ/2)². Using a union bound, the aggregate error probability for the set M does not exceed
δ/2 + (δ/2)² + (δ/2)³ + · · · ≤ (δ/2)/(1 − δ/2) ≤ δ.
Therefore, with probability at least 1 − δ, we have H_m ≥ pm/2 for every m ∈ M, and consequently H_m ≥ pm/4 − m₀ for every 1 ≤ m ≤ n, which finishes the proof.
The next lemma allows us to go from a bound on the number of phases to a bound on the number
of points chosen.
Lemma 27 Consider any of Algorithms 1, 2, 3, 4. If the number of phases is ≤ r, then with probability ≥ 9/10, the number of points selected is ≤ 20r.
Proof Assume we have p phases, where p ≤ r. Then Z = ∑_{i∈[p]} ∑_{v_j ∈ Phase_i} Y_j is the number of points selected. Then E[Z] = E[ ∑_{i∈[p]} ∑_{v_j ∈ Phase_i} Y_j ] = ∑_{i∈[p]} ∑_{v_j ∈ Phase_i} E[Y_j] ≤ ∑_{i∈[p]} 2 (since the expected number of points selected in a single phase is ≤ 2). Therefore E[Z] ≤ 2p ≤ 2r, and applying Markov's inequality we can see that Pr[Z ≥ 20r] ≤ Pr[Z ≥ 10·E[Z]] ≤ 1/10. Thus, w.p. ≥ 9/10, Z ≤ 20r.