Data Mining: Concepts and Techniques (3rd ed.)
Chapter 10
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign &
Simon Fraser University
© 2013 Han, Kamber & Pei. All rights reserved.
Chapter 10 Outline:
- Partitioning Methods
- Hierarchical Methods
- Density-Based Methods
- Grid-Based Methods
- Evaluation of Clustering
- Summary
Cluster analysis involves several steps and design choices: data reduction and feature selection, the choice of a proximity (dissimilarity/similarity) measure, the clustering criterion, and the choice of clustering algorithm.

Considerations for cluster analysis:
- Partitioning criteria
- Separation of clusters
- Similarity measure
- Clustering space

Requirements and challenges:
- Scalability: clustering all the data instead of only samples
- Ability to deal with different types of attributes: numerical, binary, categorical, ordinal, linked, and mixtures of these
- Constraint-based clustering
Major clustering approaches (I):
- Partitioning approach: construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors. Typical methods: k-means, k-medoids, CLARANS
- Hierarchical approach: create a hierarchical decomposition of the set of data (or objects) using some criterion. Typical methods: DIANA, AGNES, BIRCH, CHAMELEON
- Density-based approach: based on connectivity and density functions. Typical methods: DBSCAN, OPTICS, DenClue
- Grid-based approach: based on a multiple-level granularity structure. Typical methods: STING, WaveCluster, CLIQUE
Major clustering approaches (II):
- Model-based: a model is hypothesized for each of the clusters; the goal is to find the best fit of the data to the hypothesized models. Typical methods: EM, SOM, COBWEB
- Frequent pattern-based: based on the analysis of frequent patterns. Typical methods: p-Cluster
- User-guided or constraint-based: clustering by considering user-specified or application-specific constraints. Typical methods: COD (obstacles), constrained clustering
- Link-based clustering: objects are often linked together in various ways; massive links can be used to cluster objects. Typical methods: SimRank, LinkClus
Partitioning Methods
Partitioning criterion: given k, partition the n objects so as to minimize the sum of squared errors,

$E = \sum_{i=1}^{k} \sum_{p \in C_i} \big(d(p, c_i)\big)^2$

where $c_i$ is the centroid (or medoid) of cluster $C_i$.
The k-means iteration:
1. Arbitrarily partition the objects into k groups and compute the initial cluster centroids
2. Reassign each object to the cluster whose centroid is nearest
3. Update the cluster centroids
4. Repeat steps 2-3 (loop if needed) until no change
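A minimal NumPy sketch of this loop (the function name kmeans, the random initialization from the data points, and the convergence test are illustrative assumptions, not the book's reference implementation):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal k-means: X is an (n, d) array, k the number of clusters."""
    rng = np.random.default_rng(seed)
    # Arbitrarily choose k objects as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(max_iter):
        # Reassign each object to the cluster with the nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update the cluster centroids
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):            # keep the old centroid if a cluster goes empty
                new_centroids[j] = members.mean(axis=0)
        # Until no change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```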
Weaknesses of k-means: applicable only when a mean is defined; the number of clusters k must be specified in advance; sensitive to noisy data and outliers; not suitable for discovering clusters with non-convex shapes. Variations of k-means differ mainly in the selection of the initial k means, the dissimilarity calculations, and the strategies used to calculate cluster means.
PAM (a typical k-medoids algorithm), illustrated with K = 2:
1. Arbitrarily choose k objects as the initial medoids
2. Assign each remaining object to the nearest medoid (total cost = 20 in the example)
3. Randomly select a non-medoid object, O_random
4. Compute the total cost of swapping a medoid O with O_random (total cost = 26 in the example)
5. Swap O and O_random if the quality is improved
6. Do loop until no change
[Figure: scatter plots on a 10 x 10 grid illustrating each step.]
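A naive Python sketch of the swap loop above; the greedy swap strategy (accepting any improving swap immediately) and the function name pam are illustrative simplifications of the algorithm as described:

```python
import numpy as np

def pam(X, k, max_iter=50, seed=0):
    """Naive k-medoids: swap medoids with non-medoids while the total cost drops."""
    rng = np.random.default_rng(seed)
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    medoids = list(rng.choice(n, size=k, replace=False))          # arbitrary initial medoids

    def total_cost(meds):
        # each object contributes its distance to the nearest medoid
        return dist[:, meds].min(axis=1).sum()

    cost = total_cost(medoids)
    for _ in range(max_iter):
        improved = False
        for i in range(k):                        # try replacing each current medoid...
            for o in range(n):                    # ...with each non-medoid object
                if o in medoids:
                    continue
                candidate = medoids[:i] + [o] + medoids[i + 1:]
                c = total_cost(candidate)
                if c < cost:                      # swap only if the quality is improved
                    medoids, cost, improved = candidate, c, True
        if not improved:                          # until no change
            break
    labels = dist[:, medoids].argmin(axis=1)
    return labels, medoids
```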
PAM works effectively for small data sets but does not scale well to large data sets, due to its computational complexity (O(k(n-k)^2) per iteration).
Hierarchical Methods
Hierarchical Clustering: uses a distance matrix as the clustering criterion; it does not require the number of clusters k as an input, but needs a termination condition.
[Figure: objects a, b, c, d, e merged step by step into {a, b}, {d, e}, {c, d, e}, and finally {a, b, c, d, e}; reading the steps forward (Step 1 to Step 4) is agglomerative (AGNES), reading them backward is divisive (DIANA).]
AGNES (Agglomerative Nesting):
- Introduced in Kaufmann and Rousseeuw (1990)
- Merges the nodes that have the least dissimilarity and goes on in a non-descending fashion
- Eventually all nodes belong to the same cluster
[Figure: scatter plots showing the objects being merged step by step into larger clusters.]
DIANA (Divisive Analysis):
- Introduced in Kaufmann and Rousseeuw (1990); the inverse order of AGNES
- Eventually each node forms a cluster on its own
[Figure: scatter plots showing one cluster being split step by step into smaller clusters.]
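For reference, a minimal sketch of agglomerative (single-link) clustering using SciPy; the use of scipy.cluster.hierarchy, the toy data, and the cut into two clusters are illustrative assumptions:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy 2-d data: two visually separated groups
X = np.array([[1, 1], [1.2, 0.9], [0.8, 1.1],
              [5, 5], [5.1, 4.8], [4.9, 5.2]])

Z = linkage(X, method='single')                   # AGNES-style merging with single-link distance
labels = fcluster(Z, t=2, criterion='maxclust')   # cut the dendrogram into 2 clusters
print(labels)                                     # e.g., [1 1 1 2 2 2]
```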
Distance between clusters: for example, the distance between the centroids or medoids of two clusters.

Centroid, radius, and diameter of a cluster (for numerical data sets):
- Centroid: the "middle" of a cluster, $C_m = \frac{\sum_{i=1}^{N} t_{ip}}{N}$
- Radius: square root of the average distance from any point of the cluster to its centroid, $R_m = \sqrt{\frac{\sum_{i=1}^{N} (t_{ip} - c_m)^2}{N}}$
- Diameter: square root of the average mean squared distance between all pairs of points in the cluster
Extensions to Hierarchical Clustering:
- Major weakness of agglomerative clustering methods: they can never undo what was done previously and do not scale well
- Integration of hierarchical and distance-based clustering: BIRCH, CHAMELEON

BIRCH clustering feature (CF) vector: CF = (N, LS, SS), where N is the number of data points, $LS = \sum_{i=1}^{N} X_i$ is the linear sum of the N points, and $SS = \sum_{i=1}^{N} X_i^2$ is the square sum of the N points.

Example: the five points (3,4), (2,6), (4,5), (4,7), (3,8) have CF = (5, (16,30), (54,190)).
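A quick NumPy check of the CF vector for the five example points (the variable names are illustrative):

```python
import numpy as np

# Clustering feature of a subcluster: CF = (N, LS, SS)
points = np.array([(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)])
N  = len(points)                    # number of points in the subcluster
LS = points.sum(axis=0)             # linear sum of the points
SS = (points ** 2).sum(axis=0)      # square sum of the points
print(N, LS.tolist(), SS.tolist())  # prints: 5 [16, 30] [54, 190]

# CFs are additive: merging two subclusters simply adds their (N, LS, SS) entries,
# and summary statistics such as the centroid can be derived from the CF alone.
centroid = LS / N
```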
CF-Tree in BIRCH:
- Clustering feature: summary of the statistics for a given subcluster, i.e., the 0th, 1st, and 2nd moments of the subcluster from the statistical point of view
- A CF tree is a height-balanced tree that stores the clustering features for a hierarchical clustering
- A non-leaf node stores the CFs of its children (CF1, CF2, ..., CF6 in the figure, with branching factor L = 6) together with child pointers (child1, child2, ...), thus summarizing its children's subclusters
- Leaf nodes store CF entries and are chained together with prev/next pointers
[Figure: a CF tree with non-leaf nodes holding CF entries and child pointers, and leaf nodes linked in a doubly linked list.]
Cluster diameter: $D = \sqrt{\frac{1}{n(n-1)} \sum_{i \ne j} (x_i - x_j)^2}$
CHAMELEON merges sub-clusters based on two measures:
- Relative interconnectivity: the connectivity of c1 and c2 over the internal connectivity of the two clusters
- Relative closeness:
  $RC(C_i, C_j) = \frac{\bar{S}_{EC}(C_i, C_j)}{\frac{|C_i|}{|C_i| + |C_j|}\bar{S}_{EC}(C_i) + \frac{|C_j|}{|C_i| + |C_j|}\bar{S}_{EC}(C_j)}$
  where $\bar{S}_{EC}(C_i)$ and $\bar{S}_{EC}(C_j)$ are the average weights of the edges that belong in the min-cut bisector of clusters Ci and Cj, respectively, and $\bar{S}_{EC}(C_i, C_j)$ is the average weight of the edges that connect vertices in Ci to vertices in Cj

Overall framework of CHAMELEON:
1. Construct a sparse k-NN graph from the data set (p and q are connected if q is among the top k closest neighbors of p)
2. Partition the graph into a large number of small sub-clusters
3. Merge the partitions with high relative interconnectivity and relative closeness to obtain the final clusters
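A small NumPy sketch of the k-NN graph construction used in step 1 above (the function name knn_graph and the brute-force distance computation are illustrative assumptions, not CHAMELEON's actual implementation):

```python
import numpy as np

def knn_graph(X, k):
    """Sparse k-NN graph: p and q are connected if q is among the k closest neighbors of p."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)                  # a point is not its own neighbor
    nbrs = np.argsort(d, axis=1)[:, :k]          # indices of the k nearest neighbors of each point
    edges = {(i, int(j)) for i in range(len(X)) for j in nbrs[i]}
    return edges
```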
Probabilistic Hierarchical Clustering:
- Uses probabilistic models to measure the distance between clusters
- Generative model: regard the data objects to be clustered as a sample from an underlying data generation mechanism, e.g., points generated by a Gaussian distribution
Gaussian Distribution
[Figure: a bean machine (balls dropped through pins accumulate into a bell-shaped pile), together with plots of a 1-d Gaussian and a 2-d Gaussian density.]
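A small illustration of the generative-model idea above: fitting a 1-d Gaussian by maximum likelihood and computing the data log-likelihood (the sample data and the reuse of the log-likelihood as a merge criterion are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=500)   # points assumed generated by a 1-d Gaussian

mu, sigma = x.mean(), x.std()                  # maximum-likelihood estimates of the parameters
log_lik = np.sum(-0.5 * np.log(2 * np.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2))
# In probabilistic hierarchical clustering, the change in such a log-likelihood when two
# clusters are merged can serve as the (probabilistic) distance between the clusters.
print(mu, sigma, log_lik)
```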
Density-Based Methods
Density-Based Clustering Methods: clustering based on density (a local cluster criterion), such as density-connected points. Major features: discover clusters of arbitrary shape, handle noise, need density parameters as a termination condition.

Basic concepts rely on two parameters:
- Eps: the maximum radius of the neighbourhood
- MinPts: the minimum number of points in an Eps-neighbourhood of a point
The Eps-neighbourhood of q is N_Eps(q) = {p in D | dist(p, q) <= Eps}.

Directly density-reachable: a point p is directly density-reachable from a point q w.r.t. Eps, MinPts if p belongs to N_Eps(q) and q satisfies the core point condition |N_Eps(q)| >= MinPts (the figures use MinPts = 5 and Eps = 1 cm).

Density-reachable: a point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points p1, ..., pn with p1 = q and pn = p such that p_{i+1} is directly density-reachable from p_i.

Density-connected: a point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts.

[Figures: p directly density-reachable from core point q; p density-reachable from q through p1; p and q density-connected through o; core, border, and noise points for MinPts = 5.]
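A minimal Python sketch of DBSCAN built directly on the definitions above (the brute-force pairwise-distance matrix and the dbscan function name are illustrative assumptions; real implementations use spatial indexes):

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Minimal DBSCAN sketch: labels points with a cluster id, or -1 for noise."""
    n = len(X)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = [np.flatnonzero(d[i] <= eps) for i in range(n)]  # Eps-neighbourhoods
    labels = np.full(n, -1)          # -1 = noise / not yet assigned
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or len(neighbors[i]) < min_pts:
            continue                 # already assigned, or not a core point
        # Grow a new cluster from core point i by density-reachability
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster
                if len(neighbors[j]) >= min_pts:   # j is a core point: expand further
                    queue.extend(neighbors[j])
        cluster += 1
    return labels
```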
OPTICS core-distance and reachability-distance:
- Core-distance of an object p: undefined if p is not a core object; MinPts-distance(p) otherwise, i.e., the smallest radius that makes p a core object
- Reachability-distance of object p from core object q: the minimum radius value that makes p density-reachable from q:
  reachability-distance_MinPts(p, q) = undefined, if q is not a core object; max(core-distance(q), dist(q, p)), otherwise
[Figure: the OPTICS reachability plot, with reachability-distance (undefined for some objects) plotted against the cluster ordering of the objects.]
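A small sketch of these two definitions (the helper names, the eps cutoff, and returning None for "undefined" are illustrative assumptions; whether a point's own zero distance counts toward MinPts varies by convention):

```python
import numpy as np

def core_distance(d_row, min_pts, eps):
    """Core-distance of a point, given its distances d_row to all points (itself included)."""
    within = np.sort(d_row[d_row <= eps])
    # None (undefined) if the point is not a core object w.r.t. eps and min_pts
    return within[min_pts - 1] if len(within) >= min_pts else None

def reachability_distance(d_qp, core_dist_q):
    """Reachability-distance of p from q: undefined unless q is a core object."""
    return None if core_dist_q is None else max(core_dist_q, d_qp)
```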
DENCLUE: using statistical density functions.
- Influence of y on x: $f_{Gaussian}(x, y) = e^{-\frac{d(x, y)^2}{2\sigma^2}}$
- Total influence on x: $f^{D}_{Gaussian}(x) = \sum_{i=1}^{N} e^{-\frac{d(x, x_i)^2}{2\sigma^2}}$
- Gradient of x in the direction of x_i: $\nabla f^{D}_{Gaussian}(x, x_i) = \sum_{i=1}^{N} (x_i - x)\, e^{-\frac{d(x, x_i)^2}{2\sigma^2}}$
Major features: solid mathematical foundation; good for data sets with large amounts of noise; allows a compact mathematical description of arbitrarily shaped clusters in high-dimensional data sets; but needs careful selection of parameters.
DENCLUE: technical essence.
- Uses grid cells, but only keeps information about grid cells that actually contain data points, and manages these cells in a tree-based access structure
- Influence function: describes the impact of a data point within its neighborhood
- The overall density of the data space can be calculated as the sum of the influence functions of all data points
- Clusters can be determined mathematically by identifying density attractors; density attractors are local maxima of the overall density function
- Center-defined clusters: assign to each density attractor the points density-attracted to it
- Arbitrary-shaped clusters: merge density attractors that are connected through paths of high density (> threshold)
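A NumPy sketch of the Gaussian density and its gradient as defined above, plus a naive hill-climbing step toward a density attractor (the function names, step size, and stopping tolerance are illustrative assumptions):

```python
import numpy as np

def gaussian_density(x, data, sigma=1.0):
    """Overall density at x: sum of the Gaussian influence functions of all data points."""
    d2 = np.sum((data - x) ** 2, axis=1)
    return np.exp(-d2 / (2 * sigma ** 2)).sum()

def density_gradient(x, data, sigma=1.0):
    """Gradient of the density function at x, used for hill-climbing to a density attractor."""
    d2 = np.sum((data - x) ** 2, axis=1)
    w = np.exp(-d2 / (2 * sigma ** 2))
    return ((data - x) * w[:, None]).sum(axis=0)

def hill_climb(x, data, sigma=1.0, step=0.1, iters=100):
    """Follow the density gradient from x toward a density attractor (a local maximum)."""
    x = x.astype(float).copy()
    for _ in range(iters):
        g = density_gradient(x, data, sigma)
        if np.linalg.norm(g) < 1e-6:
            break
        x += step * g / np.linalg.norm(g)
    return x
```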
Density Attractor
[Figure: the overall density function with its local maxima (density attractors); center-defined clusters, and an arbitrary-shaped cluster formed by merging attractors connected through paths of high density.]
Grid-Based Methods
CLIQUE example: identify clusters.
[Figure: salary ($10,000) vs. age and vacation (weeks) vs. age plots with age 20-60 on the x-axis; dense units are identified in each 2-d subspace using density threshold τ = 3, and their intersection yields candidate clusters in the 3-d (age, salary, vacation) space.]
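A toy sketch of the first CLIQUE pass suggested by the figure: counting points per grid cell in each single dimension and keeping the dense units (the bin count, the threshold name tau, and the function name dense_units are illustrative assumptions; full CLIQUE then combines dense units apriori-style into higher-dimensional subspaces):

```python
import numpy as np

def dense_units(X, n_bins=10, tau=3):
    """1-D dense units per dimension: grid cells containing at least tau points."""
    units = []
    for dim in range(X.shape[1]):
        edges = np.linspace(X[:, dim].min(), X[:, dim].max(), n_bins + 1)
        counts, _ = np.histogram(X[:, dim], bins=edges)
        units.append([(dim, b) for b, c in enumerate(counts) if c >= tau])
    return units
```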
Strengths of CLIQUE:
- Automatically finds subspaces of the highest dimensionality such that high-density clusters exist in those subspaces
- Insensitive to the order of records in the input and does not presume any canonical data distribution
- Scales linearly with the size of the input and has good scalability as the number of dimensions in the data increases
Weakness:
- The accuracy of the clustering result may be degraded at the expense of the simplicity of the method
Evaluation of Clustering
Determining the number of clusters:
- Empirical method: # of clusters k ≈ √(n/2) for a dataset of n points, e.g., n = 200 gives k = 10
- Elbow method: use the turning point in the curve of the sum of within-cluster variance w.r.t. the # of clusters
- Cross-validation method:
  - Divide a given data set into m parts; use m - 1 parts to obtain a clustering model and the remaining part to test its quality
  - E.g., for each point in the test set, find the closest centroid, and use the sum of squared distances between all points in the test set and their closest centroids to measure how well the model fits the test set
  - For any k > 0, repeat it m times, compare the overall quality measure w.r.t. different k's, and find the # of clusters that fits the data best
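A short sketch of the elbow method described above; it reuses the illustrative kmeans() sketch given earlier in this document, and the function names and k_max default are also assumptions:

```python
import numpy as np

def within_cluster_ss(X, labels, centroids):
    """Sum of squared distances from each point to its cluster centroid."""
    return sum(((X[labels == j] - c) ** 2).sum() for j, c in enumerate(centroids))

def elbow_curve(X, k_max=10):
    """Within-cluster variance for k = 1..k_max; look for the turning point (the 'elbow')."""
    curve = []
    for k in range(1, k_max + 1):
        labels, centroids = kmeans(X, k)   # kmeans() is the earlier illustrative sketch
        curve.append(within_cluster_ss(X, labels, centroids))
    return curve
```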
External measures of clustering quality (comparing a clustering C against the ground-truth partitioning T):
- Matching-based measures: purity, maximum matching, F-measure
- Entropy-based measures: conditional entropy, normalized mutual information (NMI), variation of information
- Pairwise measures: four possibilities for each pair of points, true positive (TP), false negative (FN), false positive (FP), true negative (TN); Jaccard coefficient, Rand statistic, Fowlkes-Mallows measure
- Correlation measures: discretized Hubert statistic, normalized discretized Hubert statistic
Entropy-based measure: conditional entropy. Let clustering C have clusters C_1, ..., C_r with sizes n_1, ..., n_r, let the ground-truth partitioning T have partitions T_1, ..., T_k with sizes m_1, ..., m_k, and let n_{ij} be the number of points of cluster C_i that fall in partition T_j.
- Entropy of clustering C: $H(C) = -\sum_{i=1}^{r} \frac{n_i}{n} \log \frac{n_i}{n}$
- Entropy of partitioning T: $H(T) = -\sum_{j=1}^{k} \frac{m_j}{n} \log \frac{m_j}{n}$
- Entropy of T w.r.t. cluster Ci: $H(T \mid C_i) = -\sum_{j=1}^{k} \frac{n_{ij}}{n_i} \log \frac{n_{ij}}{n_i}$
- Conditional entropy of T w.r.t. clustering C: $H(T \mid C) = \sum_{i=1}^{r} \frac{n_i}{n} H(T \mid C_i)$
The more a cluster's members are split into different partitions, the higher the conditional entropy. For a perfect clustering, the conditional entropy value is 0, while the worst possible conditional entropy value is log k.
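A small Python sketch of the conditional entropy H(T | C) defined above (the function name, the use of log base 2, and the toy example are illustrative assumptions):

```python
import numpy as np
from collections import Counter

def conditional_entropy(cluster_labels, truth_labels):
    """H(T | C): how much each cluster's members are split across ground-truth partitions."""
    n = len(cluster_labels)
    h = 0.0
    for c in set(cluster_labels):
        members = [t for cl, t in zip(cluster_labels, truth_labels) if cl == c]
        n_i = len(members)
        # entropy of the ground-truth labels restricted to cluster c
        h_i = sum((m / n_i) * np.log2(n_i / m) for m in Counter(members).values())
        h += (n_i / n) * h_i
    return h

print(conditional_entropy([0, 0, 1, 1], ['a', 'a', 'b', 'b']))   # prints 0.0 (perfect clustering)
```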
Summary
CS512 (Spring 2011): An Introduction
Coverage:
- BK2: Chapter 9
- Partial coverage: Mark Newman, Networks: An Introduction, Oxford University Press, 2010
- Scattered coverage: Easley and Kleinberg, Networks, Crowds, and Markets: Reasoning About a Highly Connected World, Cambridge University Press, 2010
- Recent research papers
Requirements
CLARA (Clustering LARge Applications): a sampling-based method; it draws multiple samples of the data set, applies PAM on each sample, and returns the best resulting clustering.
Strength: deals with larger data sets than PAM.
Weakness: efficiency depends on the sample size, and a good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample is biased.
Similarity measures for categorical (transaction) data:
Example: two clusters of transactions
- C1. <a, b, c, d, e>: {a, b, c}, {a, b, d}, {a, b, e}, {a, c, d}, {a, c, e}, {a, d, e}, {b, c, d}, {b, c, e}, {b, d, e}, {c, d, e}
- C2. <a, b, f, g>: {a, b, f}, {a, b, g}, {a, f, g}, {b, f, g}
Jaccard coefficient-based similarity function: $Sim(T_1, T_2) = \frac{|T_1 \cap T_2|}{|T_1 \cup T_2|}$
Ex. Let T1 = {a, b, c} and T2 = {c, d, e}: $Sim(T_1, T_2) = \frac{|\{c\}|}{|\{a, b, c, d, e\}|} = \frac{1}{5} = 0.2$
The Jaccard coefficient may lead to a wrong clustering result:
- Within C1, similarity ranges from 0.2 ({a, b, c}, {b, d, e}) to 0.5 ({a, b, c}, {a, b, d})
- Across C1 and C2, similarity could be as high as 0.5 ({a, b, c}, {a, b, f})
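A two-line check of the Jaccard computation above (the function name jaccard is an illustrative assumption):

```python
def jaccard(t1, t2):
    """Jaccard coefficient between two transactions viewed as item sets."""
    t1, t2 = set(t1), set(t2)
    return len(t1 & t2) / len(t1 | t2)

print(jaccard({'a', 'b', 'c'}, {'c', 'd', 'e'}))   # prints 0.2, matching the example above
```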
Link measure (for the same clusters C1 and C2 as above): two transactions are neighbors if their similarity exceeds a specified threshold, and the number of common neighbors (links) of two transactions can then be used in place of their direct similarity.
Assessing clustering tendency with the Hopkins statistic:
- Sample n points, p1, ..., pn, uniformly from the data space. For each pi, find its nearest neighbor in D: xi = min{dist(pi, v)} where v is in D
- Sample n points, q1, ..., qn, uniformly from D. For each qi, find its nearest neighbor in D - {qi}: yi = min{dist(qi, v)} where v is in D and v != qi
- Calculate the Hopkins statistic: $H = \frac{\sum_{i=1}^{n} x_i}{\sum_{i=1}^{n} x_i + \sum_{i=1}^{n} y_i}$
- If D is uniformly distributed, the xi and yi will be close to each other and H is close to 0.5. If D is clustered, H is close to 1
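A NumPy sketch of the Hopkins statistic as described above (the function name, the sample size n = 50, and sampling the data space by its bounding box are illustrative assumptions; n must not exceed the number of points in D):

```python
import numpy as np

def hopkins_statistic(D, n=50, seed=0):
    """Hopkins statistic: about 0.5 for uniformly distributed data, near 1 for clustered data."""
    rng = np.random.default_rng(seed)
    lo, hi = D.min(axis=0), D.max(axis=0)

    # x_i: nearest-neighbor distances of n points sampled uniformly from the data space
    P = rng.uniform(lo, hi, size=(n, D.shape[1]))
    x = np.array([np.linalg.norm(D - p, axis=1).min() for p in P])

    # y_i: nearest-neighbor distances (excluding the point itself) of n points sampled from D
    idx = rng.choice(len(D), size=n, replace=False)
    y = []
    for i in idx:
        dists = np.linalg.norm(D - D[i], axis=1)
        dists[i] = np.inf                  # nearest neighbor in D - {q_i}
        y.append(dists.min())
    y = np.array(y)

    return x.sum() / (x.sum() + y.sum())
```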