Clustering Part4
Clustering Part4
Clustering Part4
• Numeric
• Binary
• Categorical
• Text
• Time series
Part I:
Numeric data
Distance measures
Definition of distance metric
A distance function is metric if the following conditions
are met for all data points x, y, z:
• All distances are non-negative:
d(x, y) ≥ 0
• Distance to point itself is zero:
d(x, x) = 0
• All distances are symmetric:
d(x, y) = d(y, x)
• Triangular inequality:
d(x, y) d(x, z) + d(z, y)
Common distance metrics
Xj = (xj1, xj2, …, xjp)
dij = ?
q q q
d (i, j ) q xi1 x j1 xi 2 x j 2 ... xip x jp
• Euclidean distance
q=2
2 2 2
d (i, j ) xi1 x j1 xi 2 x j 2 ... xip x jp
• Manhattan distance
q=1
d (i, j ) xi1 x j1 xi 2 x j 2 ... xip x jp
Distance metrics example
10
2D example
X1 = (2,8) x1 = (2,8)
x2 = (6,3)
5 Euclidean distance
5 2 2
d (1,2) 2 6 8 3 41
X2 = (6,3)
Manhattan distance
d (1,2) 2 6 8 3 9
0 5 10
4
Chebyshev distance
In case of q , the distance equals to the maximum
difference of the attributes. Useful if the worst case must
be avoided:
1
n
q
q
d ( X , Y ) lim xi yi
q
i 1
max x1 y1 , x2 y2 , , xn yn
Example:
d (2,8), (6,3) max2 6 , 8 3 max4,5 5
Hierarchical clustering
Cost functions
Single link: the smallest distance between vectors in
clusters i and j:
d (Ci , C j ) min
xi Ci , x j C j ,
d ( xi , x j )
1.50
1.45
1.40
1.35 Soft
1.30 convergence
1.25
0 1 2 3 4 5 6 7 8 9 10
Iterations
Distance and distortion
Distortion function:
N
E X , C N1 Ld xi , c pi
i 1
Distortion for binary data
1.00
q=10,r=90
0.90
q=20,r=80
0.80
q=30,r=70
c
0.70
0.60
q=40,r=60
0.50
6.0 5.5 5.0 4.5 4.0 3.5 3.0 2.5 2.0 1.5 1.0
d
Centroid location
d = ( = 0) c = 0 .5
d = 3 ( = 0 .5 ) c = 0 .3 7
d = 2 ( = 1) c = 0 .2 5
d= 1 c= 0
0 .0 1 .0
q = 75 r = 25
Part III:
Categorical data
Categorical clustering
Three attributes
Categorical clustering
Sample 2-d data: color and shape
Model A Model B
Model C
Categorical clustering
• k-modes
• k-medoids
• k-distributions
• k-histograms
• k-populations
• k-representatives
Entropy-based cost functions
Category utility:
D 2/3
F 1/3
K-distributions
Cost function with ε addition
Example of cluster allocation
Change of entropy
Problem of non-convergence
Non-convergence
Results with Census dataset
Part IV:
Text data
Applications of text clustering
• Query relaxation
• Spell-checking
• Automatic categorization
• Document similarity
Query relaxation
Current solution Alternate solution
Matching suffixes from database From semantic clustering
Spell-checking
Category by
clustering
HE og
Go
LP le
! ! ! t ra n
T h sla
is tio
is n …
String clustering
Query errors
Similarity functions:
Limited knowledge about data
Typos Edit distance
Limited input device (cell phone) input
Data errors
Q-gram
Typos Cosine
Web data
OCR
Edit distance
Levenhstein distance
t m p
0 1 2 3
t 1 0 1 2
e 2 1 2 2
m 3 2 1 2
p 4 3 2 1
Dynamic programming:
m[i][j] = min{ m[i-1][j]+1, m[i][j-1]+1, m[i-1][j-1]+d(i,j)}
d(i,j) =0 if i=j, d(i,j)=1 else
Q-grams
b i n g o n 2-grams
• Implementation:
Cosine similarity = (Common Terms) / (sqrt(Number
of terms in String1) + sqrt(Number of terms in
String2))
Cosine similarity
antonym
damp anhydrous
Example of WordNet
object
artifact
instrumentality article
Inanimate-object (17 %)
Natural-object (1.6%)
Geological-formation (0.17%)
Can be solved by dynamic programming.
Complexity is exponential to the number of
time-series in a cluster.
Averaging heuristic
Calculate the medoid sequence
Calculate warping paths from the medoid to all
other time series in the cluster
New prototype is the average sequence over
warping paths
Local search heuristics
Example of the three methods
Swim Walking
hall street
Market
Science place
park
Homes
Shop of users
Image segmentation