Performance guarantees for hierarchical clustering

Sanjoy Dasgupta, University of California, San Diego
Philip Long, Genomics Institute of Singapore

Hierarchical clustering
Recursive partitioning of a data set

[Figure: a dendrogram over points 1–5, read at every level from the 1-clustering down to the 5-clustering]

Popular form of data analysis
• No need to specify number of clusters
• Can view data at many levels of granularity, all at the same time
• Simple heuristics for constructing hierarchical clusterings

Applications
• Has long been used by biologists and social scientists
• A standard part of the statistician’s toolbox since the 1960s or 70s
• Recently: a common tool for analyzing gene-expression data

Performance guarantees
There are many simple greedy schemes for constructing hierarchical clusterings.

But are the resulting clusterings any good, or are they fairly arbitrary?

One basic problem
In fact, the whole enterprise of hierarchical clustering could use some more justification. For example:

An existence question
Must there always exist a hierarchical clustering that is close to optimal at every level of granularity, simultaneously?
[That is, one in which, for all k, the induced k-clustering is close to the best k-clustering.]

What is the best k-clustering?
The k-clustering problem.

Input: data points in a metric space; k
Output: a partition of the points into k clusters C_1, …, C_k with centers µ_1, …, µ_k
Goal: minimize cost of the clustering

Cost functions for clustering
Two cost functions are commonly used:

Maximum radius (k-center): max { d(x, µ_i) : i = 1…k, x ∈ C_i }
Average radius (k-median): avg { d(x, µ_i) : i = 1…k, x ∈ C_i }

Both yield NP-hard optimization problems, but both have constant-factor approximation algorithms.
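
As a concrete illustration, here is a minimal Python sketch of the two costs. It assumes Euclidean distances and hypothetical arrays X (the points), centers (the k centers), and labels (each point's center index); all names are illustrative only.

```python
import numpy as np

def k_center_cost(X, centers, labels):
    """Maximum radius: the largest distance from any point to its assigned center."""
    return max(np.linalg.norm(x - centers[c]) for x, c in zip(X, labels))

def k_median_cost(X, centers, labels):
    """Average radius: the mean distance from the points to their assigned centers."""
    return float(np.mean([np.linalg.norm(x - centers[c]) for x, c in zip(X, labels)]))
```
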
Maximum-radius cost function
Our main result

Adopt the maximum-radius cost function. Our algorithm returns a hierarchical clustering such that, for every k, the induced k-clustering is guaranteed to be within a factor of eight of optimal.

Standard heuristics
• The standard heuristics for hierarchical clustering are greedy and work bottom-up:
  single-linkage, average-linkage, complete-linkage.
• Their k-clusterings can be off by a factor of:
  -- at least log_2 k (average-, complete-linkage);
  -- at least k (single-linkage).
• Our algorithm is similar in efficiency and simplicity, but works top-down.

A heuristic for k-clustering
[Hochbaum and Shmoys, 1985]
E.g., k = 4.
[Figure: a 4-clustering of the example points, with radius R]

This 4-clustering has cost R ≤ 2·OPT_4.
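
A sketch of this heuristic as the following slides develop it, assuming Euclidean distances and an n×d NumPy array X (illustrative names only):

```python
import numpy as np

def greedy_k_centers(X, k):
    """Pick k centers greedily: start anywhere, then repeatedly add the point
    farthest from the centers chosen so far.  The resulting k-clustering has
    radius R <= 2 * OPT_k."""
    centers = [0]                                    # arbitrary first center
    dist = np.linalg.norm(X - X[0], axis=1)          # distance to nearest chosen center
    for _ in range(k - 1):
        i = int(np.argmax(dist))                     # farthest remaining point
        centers.append(i)
        dist = np.minimum(dist, np.linalg.norm(X - X[i], axis=1))
    labels = np.argmin(np.linalg.norm(X[:, None, :] - X[centers][None, :, :], axis=2), axis=1)
    return centers, labels, float(dist.max())        # dist.max() is the radius R
```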


Algorithm: step one
Number all points by farthest-first traversal.
[Figure: ten points numbered 1–10 by farthest-first traversal, with the distances R_2, …, R_6 marked]

For all k, the k-clustering defined by centers {1, 2, …, k} has radius R_{k+1} ≤ 2·OPT_k. (Note: R_2 ≥ R_3 ≥ … ≥ R_n.)
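
A sketch of step one under the same assumptions (Euclidean distances, an n×d array X). Indices here are 0-based, so R[1] plays the role of the slides' R_2.

```python
import numpy as np

def farthest_first_numbering(X):
    """Number all points by farthest-first traversal.
    Returns the ordering and the values R_i = d(i, {1, ..., i-1}),
    i.e. each point's distance to the set of previously numbered points."""
    order = [0]                                      # start from an arbitrary point
    dist = np.linalg.norm(X - X[0], axis=1)          # distance to the numbered set
    R = [float("inf")]                               # R_1 is undefined; use infinity
    for _ in range(len(X) - 1):
        i = int(np.argmax(dist))                     # farthest point from the numbered set
        order.append(i)
        R.append(float(dist[i]))                     # R_k = d(k, {1, ..., k-1})
        dist = np.minimum(dist, np.linalg.norm(X - X[i], axis=1))
    return order, R
```

For any k, the first k entries of order are exactly the centers the greedy heuristic above would pick, and the induced k-clustering has radius R[k] (the slides' R_{k+1}).
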
A possible hierarchical clustering
[Figure: a tree on the points 1–10, rooted at point 1; the edge from each point j to its parent π(j) has length R_j]

Hierarchical clustering specified by the parent function
π(j) = closest point to j in {1, 2, …, j−1}.
Note: R_k = d(k, π(k)).
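
A sketch of this parent function, reusing the ordering produced by the farthest_first_numbering sketch above:

```python
import numpy as np

def parent_function(X, order):
    """pi(j) = closest point to j among the points numbered before j.
    The resulting tree is one possible hierarchical clustering,
    and d(j, pi(j)) equals R_j."""
    parent = {}
    for pos in range(1, len(order)):
        j, earlier = order[pos], order[:pos]
        parent[j] = min(earlier, key=lambda i: np.linalg.norm(X[j] - X[i]))
    return parent
```
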
Algorithm: step two
Divide points into levels of granularity.
Set R = R_2, and fix some β > 1.
The jth level has points {i : R/β^j ≥ R_i > R/β^(j+1)}.
[Figure: the ten numbered points, grouped into levels of granularity]

Algorithm: step two, cont’d
[Figure: the points arranged by level of granularity: 1; then 2, 3, 4; then 5, 6, 7, 8, 9; then 10]

Different parent function:
π*(j) = closest point to j at a lower level of granularity.

Algorithm: summary
• Number the points by farthest-first traversal; note the values R_i = d(i, {1, 2, …, i−1}).
• Choose R = α·R_2.
• L(0) = {1}; for j > 0, L(j) = {i : R/β^(j−1) ≥ R_i > R/β^j}.
• If point i is in L(j), then π*(i) = closest point to i in L(0), …, L(j−1).

Theorem: Fix α = 1, β = 2. If the data points lie in a metric space, then for all k simultaneously, the induced k-clustering is within a factor of eight of optimal.
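
A sketch of the summary, reusing order and R from the farthest_first_numbering sketch above (0-based, so R[1] is the slides' R_2); the level assignment is just the closed form of the L(j) definition.

```python
import numpy as np

def granularity_parents(X, order, R, alpha=1.0, beta=2.0):
    """Assign each point a level of granularity from its R_i value, then set
    pi*(i) to the closest point at a strictly lower level.  With alpha = 1 and
    beta = 2, every induced k-clustering is within a factor of eight of the
    optimal maximum radius."""
    Rbar = alpha * R[1]                        # R = alpha * R_2
    level = {order[0]: 0}                      # L(0) = {first point}
    for pos in range(1, len(order)):
        # R_i in (Rbar/beta^j, Rbar/beta^(j-1)]  <=>  j = floor(log_beta(Rbar/R_i)) + 1
        level[order[pos]] = int(np.floor(np.log(Rbar / R[pos]) / np.log(beta))) + 1
    parent = {}                                # pi*(i), defined for every point except the first
    for pos in range(1, len(order)):
        i = order[pos]
        lower = [p for p in order if level[p] < level[i]]
        parent[i] = min(lower, key=lambda p: np.linalg.norm(X[i] - X[p]))
    return level, parent
```

One way to read off the induced k-clustering: take the first k points of order as the centers, and assign every other point by following its parent pointers until one of those centers is reached.
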
Randomization trick
Pick α from the distribution β^U[0,1] (that is, α = β^u with u uniform in [0,1]), and set β = e.

Then for all k, the induced k-clustering has expected cost at most 2e ≈ 5.44 times optimal.

Thanks to Rajeev Motwani for suggesting this.
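
A short way to implement the trick, assuming the granularity_parents sketch above and hypothetical inputs X, order, R:

```python
import numpy as np

rng = np.random.default_rng()
beta = np.e
alpha = beta ** rng.uniform(0.0, 1.0)     # alpha ~ beta^Uniform[0,1], so alpha lies in [1, e]
level, parent = granularity_parents(X, order, R, alpha=alpha, beta=beta)
```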


What does a constant-factor approximation mean?
Prevent the worst.

Standard agglomerative heuristics
1. Initially each point is its own cluster.
2. Repeatedly merge the two “closest” clusters.

Need to define distance between clusters…

Single-linkage: distance between closest pair of points
Average-linkage: distance between centroids
Complete-linkage: distance between farthest pair
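
A compact sketch of this generic bottom-up scheme, with the three cluster-distance rules as defined above (Euclidean distances, hypothetical n×d array X):

```python
import numpy as np
from itertools import combinations

def agglomerate(X, k, linkage="single"):
    """Start with singleton clusters and repeatedly merge the two closest,
    until k clusters remain.  'single' = closest pair of points,
    'average' = distance between centroids, 'complete' = farthest pair."""
    clusters = [[i] for i in range(len(X))]

    def dist(a, b):
        if linkage == "single":
            return min(np.linalg.norm(X[p] - X[q]) for p in a for q in b)
        if linkage == "complete":
            return max(np.linalg.norm(X[p] - X[q]) for p in a for q in b)
        return np.linalg.norm(X[a].mean(axis=0) - X[b].mean(axis=0))   # centroid distance

    while len(clusters) > k:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]))
        clusters[i].extend(clusters.pop(j))      # merge cluster j into cluster i (i < j)
    return clusters
```
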
Single-linkage clustering
Chaining effect.
[Figure: points 1, 2, …, n on a line; the gap between points j and j+1 is 1 − jδ]

The k-clustering will have diameter about n − k, instead of n/k.
Therefore: off by a factor of k.
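
A quick numerical check of the chaining effect, using SciPy’s single-linkage routine; n, k, and δ here are illustrative choices.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

n, k, delta = 40, 4, 1e-3
gaps = 1.0 - delta * np.arange(1, n)              # gap between points j and j+1 is 1 - j*delta
X = np.concatenate(([0.0], np.cumsum(gaps))).reshape(-1, 1)

Z = linkage(X, method="single")                   # single-linkage dendrogram
labels = fcluster(Z, t=k, criterion="maxclust")   # cut it into k clusters

diam = max(X[labels == c].max() - X[labels == c].min() for c in np.unique(labels))
print(diam)   # roughly n - k (about 35 here), versus roughly n/k for a balanced split
```
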
Average-linkage clustering
Points in d-dimensional space, d = log_2 k, under an ℓ_1 metric.

The final radius should be 1, but is instead d.
Therefore: off by a factor of log_2 k.

Complete-linkage clustering
A similarly bad case can be constructed…
Off by a factor of at least log_2 k.

Summary
There is a basic existence question about hierarchical clustering which needs to be addressed:
must there always exist a hierarchical clustering in which, for each k, the induced k-clustering is close to optimal?

It turns out the answer is yes.


Summary, cont’d
In fact, there is a simple, fast algorithm to construct such hierarchical clusterings.

Meanwhile, the standard agglomerative heuristics do not always produce close-to-optimal clusterings.

Where next?
• Reduce the approximation factor.
• Other cost functions for clustering.
• For average- and complete-linkage, is the log k lower bound also an upper bound?
• Local improvement procedures for hierarchical clustering?
