Clustering methods: Part 4

Clustering Cost function


Pasi Fränti
29.4.2014

Speech and Image Processing Unit


School of Computing
University of Eastern Finland
Data types

• Numeric
• Binary
• Categorical
• Text
• Time series
Part I:
Numeric data
Distance measures
Definition of distance metric
A distance function is metric if the following conditions
are met for all data points x, y, z:
• All distances are non-negative:
d(x, y) ≥ 0
• Distance to the point itself is zero:
d(x, x) = 0
• All distances are symmetric:
d(x, y) = d(y, x)
• Triangle inequality:
d(x, y) ≤ d(x, z) + d(z, y)
Common distance metrics
Between two points Xi = (xi1, xi2, …, xip) and Xj = (xj1, xj2, …, xjp):

• Minkowski distance (one term per dimension, from the 1st to the pth):

d(i, j) = \left( |x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \ldots + |x_{ip} - x_{jp}|^q \right)^{1/q}

• Euclidean distance (q = 2):

d(i, j) = \sqrt{ |x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \ldots + |x_{ip} - x_{jp}|^2 }

• Manhattan distance (q = 1):

d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \ldots + |x_{ip} - x_{jp}|
Distance metrics example
2D example: x1 = (2, 8), x2 = (6, 3)

Euclidean distance:
d(1, 2) = \sqrt{ |2 - 6|^2 + |8 - 3|^2 } = \sqrt{41}

Manhattan distance:
d(1, 2) = |2 - 6| + |8 - 3| = 9
Chebyshev distance
In the case q → ∞, the distance equals the maximum difference of the attributes. Useful if the worst case must be avoided:

d_\infty(X, Y) = \lim_{q \to \infty} \left( \sum_{i=1}^{n} |x_i - y_i|^q \right)^{1/q} = \max\left( |x_1 - y_1|, |x_2 - y_2|, \ldots, |x_n - y_n| \right)

Example:
d_\infty((2, 8), (6, 3)) = \max(|2 - 6|, |8 - 3|) = \max(4, 5) = 5
Hierarchical clustering
Cost functions
Single link: the smallest distance between vectors in
clusters i and j:
d (Ci , C j )  min
xi Ci , x j C j ,
d ( xi , x j )

Complete-link: the largest distance between vectors


in clusters i and j:
d (Ci , C j )  max
xi Ci , x j C j ,
d ( xi , x j )

Average link: the average distance between vectors in


clusters i and j:
1
d (C i , C j )   
C i C j xi Ci x j C j
d ( xi ,x j )
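A small Python sketch (mine, not from the slides) of the three linkage costs, assuming Euclidean distance between points; the toy clusters are illustrative only.

```python
import math

def dist(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single_link(ci, cj):
    return min(dist(a, b) for a in ci for b in cj)

def complete_link(ci, cj):
    return max(dist(a, b) for a in ci for b in cj)

def average_link(ci, cj):
    return sum(dist(a, b) for a in ci for b in cj) / (len(ci) * len(cj))

ci = [(0.0,), (1.0,)]
cj = [(3.0,), (5.0,)]
print(single_link(ci, cj), complete_link(ci, cj), average_link(ci, cj))
# 2.0 5.0 3.5
```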
Single Link
Complete Link
Average Link
Cost function example
[Theodoridis, Koutroumbas, 2006]

[Figure: seven points x1…x7 on a line with gaps 1, 1.1, 1.2, 1.3, 1.4, 1.5; the resulting single-link and complete-link dendrograms differ.]
Part II:
Binary data
Hamming Distance
(Binary and categorical data)
• Number of differing attribute values.
• Distance of (1011101) and (1001001) is 2.
• Distance of (2143896) and (2233796) is 3.
• Distance between (toned) and (roses) is 3.

3-bit binary cube: 100 → 011 has distance 3 (red path), 010 → 111 has distance 2 (blue path).
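A minimal sketch (function name mine) that reproduces the Hamming distance examples above.

```python
def hamming(a, b):
    """Number of positions where two equal-length sequences differ."""
    assert len(a) == len(b)
    return sum(1 for x, y in zip(a, b) if x != y)

print(hamming("1011101", "1001001"))  # 2
print(hamming("2143896", "2233796"))  # 3
print(hamming("toned", "roses"))      # 3
print(hamming("100", "011"))          # 3
```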
Hard thresholding of centroid
Example: (0.40, 0.60, 0.75, 0.20, 0.45, 0.25) is thresholded to (0, 1, 1, 0, 0, 0).
Hard and soft centroids
Bridge (binary version)
[Figure: binary distortion as a function of iterations (0 to 10) on Bridge (binary version); hard (binary) centroids converge to a clearly higher distortion than soft centroids.]
Distance and distortion

General distance function:

L_d(x_i, c_j) = \left( \sum_{k=1}^{K} |x_{ik} - c_{jk}|^d \right)^{1/d}

Distortion function:

E(X, C) = \frac{1}{N} \sum_{i=1}^{N} L_d(x_i, c_{p_i})
Distortion for binary data

Cost of a single attribute:

D_{jk} = q_{jk} \cdot c_{jk}^{\,d} + r_{jk} \cdot (1 - c_{jk})^{d}

where q_{jk} is the number of zeroes, r_{jk} the number of ones, and c_{jk} the current centroid value for variable k of group j.
Optimal centroid position

The optimal centroid position depends on the metric. Given the parameter

\alpha = 1 / (d - 1)

the optimal position is:

c_{jk} = \frac{r_{jk}^{\,\alpha}}{q_{jk}^{\,\alpha} + r_{jk}^{\,\alpha}}
Example of centroid location

[Figure: optimal centroid value c (between 0.5 and 1.0) as a function of d (from 6.0 down to 1.0) for the mixes q=10/r=90, q=20/r=80, q=30/r=70 and q=40/r=60.]
Centroid location

Example with q = 75, r = 25:

d = ∞ (α = 0):    c = 0.5
d = 3 (α = 0.5):  c = 0.37
d = 2 (α = 1):    c = 0.25
d = 1:            c = 0
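A short sketch of the optimal soft-centroid value for one binary attribute, assuming the formula c_jk = r^α / (q^α + r^α) with α = 1/(d − 1) given above; it reproduces the q = 75, r = 25 values. The function name is mine.

```python
def optimal_centroid(q, r, d):
    """q = number of zeroes, r = number of ones, d = distance exponent."""
    if d == 1:
        # Degenerate case: hard (majority-vote) centroid.
        return 1.0 if r > q else 0.0
    a = 1.0 / (d - 1)
    return r ** a / (q ** a + r ** a)

for d in (2, 3, 10**6):            # a very large d approximates d -> infinity
    print(d, round(optimal_centroid(75, 25, d), 2))
# d=2 -> 0.25, d=3 -> 0.37, d->infinity -> 0.5
```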
Part III:
Categorical data
Categorical clustering
Three attributes
Categorical clustering
Sample 2-d data: color and shape

[Figure: the sample data clustered according to three alternative models A, B and C.]
Categorical clustering

Methods (including histogram-based methods):

• k-modes
• k-medoids
• k-distributions
• k-histograms
• k-populations
• k-representatives
Entropy-based cost functions
Category utility:

Entropy of data set:

Entropies of the clusters relative to the data:


Iterative algorithms
K-modes clustering
Distance function
K-modes clustering
Prototype of cluster
K-medoids clustering
Prototype of cluster

Vector with the minimal total distance to all other vectors in the cluster.

Example with Hamming distance and vectors v1 = (A, C, E), v2 = (B, C, F), v3 = (B, D, G):
total distance of v1: 2 + 3 = 5
total distance of v2: 2 + 2 = 4
total distance of v3: 3 + 2 = 5
Medoid: v2 = (B, C, F)
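A sketch of medoid selection for categorical data with Hamming distance, reproducing the small example above; the vector labels and function names are mine.

```python
def hamming(a, b):
    """Number of differing attribute values between two vectors."""
    return sum(1 for x, y in zip(a, b) if x != y)

def medoid(cluster):
    """Vector with the minimal total distance to all other vectors."""
    return min(cluster, key=lambda v: sum(hamming(v, u) for u in cluster))

cluster = [("A", "C", "E"), ("B", "C", "F"), ("B", "D", "G")]
print(medoid(cluster))   # ('B', 'C', 'F'), total distance 2 + 2 = 4
```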
K-medoids
Example
K-medoids
Calculation
K-histograms

[Example: histogram of one attribute in a cluster with values D (frequency 2/3) and F (frequency 1/3).]
K-distributions
Cost function with ε addition
Example of cluster allocation
Change of entropy
Problem of non-convergence
Non-convergence
Results with Census dataset
Part IV:
Text data
Applications of text clustering

• Query relaxation
• Spell-checking
• Automatic categorization
• Document similarity
Query relaxation
Current solution: matching suffixes from the database. Alternative solution: semantic clustering.
Spell-checking

The word kahvila (café) appears with one correct and two incorrect spellings.
Automatic categorization

Category obtained by clustering.

[Screenshot: "HELP!!! This is Google translation…"]
String clustering

• The similarity between every string pair is calculated as a basis for determining the clusters.
• A similarity measure is required to calculate the similarity between two strings.

Two approaches: approximate string matching and semantic similarity.
Document clustering
Motivation:
– Group related documents based on their content
– No predefined training set (taxonomy)
– Generate a taxonomy at runtime
Clustering process:
– Data preprocessing: stop-word removal, stemming, feature extraction and lexical analysis
– Define the cost function
– Perform clustering
Exact string matching

• Given a text string T of length n and a pattern string P of length m, the exact string matching problem is to find all occurrences of P in T.
• Example: T = “AGCTTGA”, P = “GCT”
• Applications:
– Searching keywords in a file
– Search engines (such as Google)
– Database searching
Approximate string matching
Determine if a text string T of length n and a pattern string P of length m partially match.
– Consider the string “approximate”. Which of these are partial matches?
aproximate approximately appropriate proximate approx approximat apropos approxximate
– A partial match can be thought of as one that has k differences from the string, where k is some small integer (for instance 1 or 2).
– A difference occurs if string1.charAt(j) != string2.charAt(j), or if string1.charAt(j) does not appear in string2 (or vice versa).
 The former case is known as a revise difference, the latter is a delete or insert difference.
 What about two characters that appear out of position? For instance, approximate vs. apporximate?
Approximate string matching
Example names in the data: Keanu Reeves, Samuel Jackson, Arnold Schwarzenegger, H. Norman Schwarfkopf, Bernard Schwartz; erroneous query: "Schwarrzenger".

Query errors:
• Typos
• Limited input device (cell phone) input
• Limited knowledge about the data
Data errors:
• Typos
• Web data
• OCR

Similarity functions: edit distance, q-gram, cosine.
Edit distance
Levenshtein distance

• Given two strings T and P, the edit distance is the minimum number of substitutions, insertions and deletions that transform the string T into P.
• Time complexity by dynamic programming: O(nm)
Edit distance
(dynamic programming, 1974)

Example: distance between "temp" (rows) and "tmp" (columns):

        ""  t  m  p
    ""   0  1  2  3
    t    1  0  1  2
    e    2  1  1  2
    m    3  2  1  2
    p    4  3  2  1

Dynamic programming:
m[i][j] = min{ m[i-1][j] + 1, m[i][j-1] + 1, m[i-1][j-1] + d(i, j) }
d(i, j) = 0 if the characters match, d(i, j) = 1 otherwise
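A plain O(nm) dynamic-programming sketch matching the recurrence above; variable names are mine.

```python
def edit_distance(t, p):
    """Minimum number of substitutions, insertions and deletions T -> P."""
    n, m = len(t), len(p)
    dp = [[0] * (m + 1) for _ in range(n + 1)]   # dp[i][j] = ed(t[:i], p[:j])
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if t[i - 1] == p[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution / match
    return dp[n][m]

print(edit_distance("temp", "tmp"))   # 1
```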
Q-grams

• Q-grams: substrings of fixed length q (example: 2-grams of "bingo").
• If ed(T, P) ≤ k, then
  (number of common q-grams) ≥ (number of q-grams of T) − k · q
Q-grams
T = “bingo”, P = “going”
gram1 = {#b, bi, in, ng, go, o#}
gram2 = {#g, go, oi, in, ng, g#}
Total(gram1, gram2) = {#b, bi, in, ng, go, o#, #g, go, oi, in, ng, g#}
|common terms difference| = sum{1,1,0,0,0,1,1,0,1,0,0,1} = 6
gram1.length = (T.length + (q − 1) · 2 + 1) − q = 6
gram2.length = (P.length + (q − 1) · 2 + 1) − q = 6
L = gram1.length + gram2.length = 12
Similarity = (L − |common terms difference|) / L = (12 − 6) / 12 = 0.5
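A sketch of this q-gram similarity, assuming one '#' pad character per side for q = 2 as in the gram sets above; it reproduces the 0.5 result. The helper names are mine.

```python
def qgrams(s, q=2, pad="#"):
    """Split a padded string into overlapping q-grams."""
    s = pad * (q - 1) + s + pad * (q - 1)
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def qgram_similarity(t, p, q=2):
    g1, g2 = qgrams(t, q), qgrams(p, q)
    # For every gram, count 1 if it has no counterpart in the other string.
    diff = sum(1 for g in g1 if g not in g2) + sum(1 for g in g2 if g not in g1)
    total = len(g1) + len(g2)
    return (total - diff) / total

print(qgram_similarity("bingo", "going"))   # 0.5
```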
Cosine similarity

• For two vectors A and B, the cosine of the angle θ between them is computed from the dot product and magnitudes:

\cos\theta = \frac{A \cdot B}{\|A\|\,\|B\|}

• Implementation for strings:
Cosine similarity = (Common terms) / (sqrt(Number of terms in String1) × sqrt(Number of terms in String2))
Cosine similarity

T = “bingo right”, P = “going right”
T1 = {bingo, right}, P1 = {going, right}
L1 = unique(T1).length = 2
L2 = unique(P1).length = 2
Unique(T1 ∪ P1) = {bingo, right, going}
L3 = unique(T1 ∪ P1).length = 3
Common terms = (L1 + L2) − L3 = 1
Similarity = Common terms / (sqrt(L1) · sqrt(L2)) = 1 / (√2 · √2) = 0.5
Dice coefficient

• Similar to cosine similarity.
• Dice's coefficient = (2 × Common terms) / (Number of terms in String1 + Number of terms in String2)
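A small sketch of both measures over word sets, reproducing the "bingo right" / "going right" example above; the function names are mine.

```python
import math

def cosine_sim(t, p):
    """Cosine similarity over the sets of words in the two strings."""
    a, b = set(t.split()), set(p.split())
    common = len(a & b)
    return common / (math.sqrt(len(a)) * math.sqrt(len(b)))

def dice_sim(t, p):
    """Dice's coefficient over the sets of words in the two strings."""
    a, b = set(t.split()), set(p.split())
    return 2 * len(a & b) / (len(a) + len(b))

print(cosine_sim("bingo right", "going right"))  # 0.5
print(dice_sim("bingo right", "going right"))    # 0.5
```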
Similarities for sample data
Compared strings                             Edit       Q-gram   Q-gram   Q-gram   Cosine
                                             distance   Q=2      Q=3      Q=4      distance
Pizza Express Café /
Pizza Express                                72%        79%      74%      70%      82%
Lounasravintola Pinja Ky – ravintoloita /
Lounasravintola Pinja                        54%        68%      67%      65%      63%
Kioski Piirakkapaja /
Kioski Marttakahvio (different)              47%        45%      33%      32%      50%
Kauppa Kulta Keidas /
Kauppa Kulta Nalle (different)               68%        67%      63%      60%      67%
Ravintola Beer Stop Pub /
Baari, Beer Stop R-kylä                      39%        42%      36%      31%      50%
Ravintola Foxie s /
Bar Foxie Karsikko                           31%        25%      15%      12%      24%
Play baari /
Ravintola Bar Play – Ravintoloita            21%        31%      17%      8%       32%
Thesaurus-based WordNet
WordNet
An extensive lexical network for the English language
• Contains over 138,838 words.
• Several graphs, one for each part-of-speech.
• Synsets (synonym sets), each defining a semantic sense.
• Relationship information (antonym, hyponym, meronym …)
• Downloadable for free (UNIX, Windows)
• Expanding to other languages (Global WordNet Association)
• Funded >$3 million, mainly government (translation interest)
• Founder George Miller, National Medal of Science, 1991.
[Synonym/antonym graph: "wet" linked by synonymy to watery, moist, damp; "dry" linked to parched, arid, anhydrous; wet and dry linked by antonymy.]
Example of WordNet
object
  artifact
    instrumentality
      conveyance, transport
        vehicle
          wheeled vehicle
            automotive, motor
              car, auto
              truck
            bike, bicycle
    article
      ware
        table ware
          cutlery, eating utensil
            fork
Examples of probabilities
Entity (40%)
  Inanimate-object (17%)
    Natural-object (1.6%)
      Geological-formation (0.17%)
        Natural-elevation (0.011%)
          Hill (0.0019%)
        Shore (0.008%)
          Coast (0.002%)
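To illustrate the information-content measures used in the table two slides below, a hedged sketch of Lin's measure, sim(c1, c2) = 2·log p(lcs) / (log p(c1) + log p(c2)), using the example probabilities above and assuming Geological-formation is the lowest common subsumer of Hill and Coast; the hard-coded dictionary is for illustration only.

```python
import math

# Example probabilities from the hierarchy above (percent converted to fractions).
p = {
    "entity": 0.40,
    "inanimate-object": 0.17,
    "natural-object": 0.016,
    "geological-formation": 0.0017,
    "natural-elevation": 0.00011,
    "shore": 0.00008,
    "hill": 0.000019,
    "coast": 0.00002,
}

def lin(c1, c2, lcs):
    """Lin similarity given the lowest common subsumer (lcs) of c1 and c2."""
    return 2 * math.log(p[lcs]) / (math.log(p[c1]) + math.log(p[c2]))

print(round(lin("hill", "coast", "geological-formation"), 2))   # ~0.59
```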


Hierarchical clustering by WordNet
Performance of WordNet
                        Human      Edge counting based    Information content based
Word pair               judgment   Path       WUP         LIN        Jiang&Conrath
Car / Automobile         3.92       1          1           1          1
Gem / Jewel              3.84       1          1           1          1
Journey / Voyage         3.84       0.97       0.92        0.84       0.88
Boy / Lad                3.76       0.97       0.93        0.86       0.88
Coast / Shore            3.70       0.97       0.91        0.98       0.99
Asylum / Madhouse        3.61       0.97       0.94        0.97       0.97
Magician / Wizard        3.50       1          1           1          1
Midday / Noon            3.42       1          1           1          1
Furnace / Stove          3.11       0.81       0.46        0.23       0.39
Food / Fruit             3.08       0.81       0.22        0.13       0.63
Bird / Cock              3.05       0.97       0.94        0.6        0.73
Bird / Crane             2.97       0.92       0.84        0.6        0.73
Part V:
Time series
Clustering of time-series
Dynamic Time Warping
Align two time series by minimizing the distance of the aligned observations. Solved by dynamic programming!
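A minimal DTW sketch (distance only, no warping-path recovery), assuming absolute difference as the local cost; the names are mine, not from the slides.

```python
def dtw(s, t):
    """Dynamic time warping distance between two numeric sequences."""
    n, m = len(s), len(t)
    inf = float("inf")
    # dp[i][j] = cost of the best alignment of s[:i] and t[:j]
    dp = [[inf] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(s[i - 1] - t[j - 1])
            dp[i][j] = cost + min(dp[i - 1][j],      # repeat observation of t
                                  dp[i][j - 1],      # repeat observation of s
                                  dp[i - 1][j - 1])  # advance both series
    return dp[n][m]

print(dtw([1, 2, 3, 4], [1, 2, 2, 3, 4]))   # 0.0: the series align perfectly
```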
Example of DTW
Prototype of a cluster

• The sequence c that minimizes E(S_j, c) is called a Steiner sequence.
• A good approximation to the Steiner problem is to use the medoid of the cluster (discrete median).
• The medoid is the time series in the cluster that minimizes E(S_j, c).
Calculating the prototype


• Can be solved by dynamic programming.
• Complexity is exponential in the number of time series in a cluster.
Averaging heuristic


• Calculate the medoid sequence.
• Calculate warping paths from the medoid to all other time series in the cluster.
• The new prototype is the average sequence over the warping paths.
Local search heuristics
Example of the three methods

[Figure: prototypes produced by the three methods, with Steiner costs E(S) = 159, E(S) = 138 and E(S) = 118.]

• Local search (LS) provides a better fit in terms of the Steiner cost function.
• It cannot modify the sequence length during the iterations. In datasets with varying lengths it might provide a better fit, but prototypes that are not sensible.
Experiments
Part VI:
Other clustering problems
Clustering of GPS trajectories
Density clusters

[Map figure: density clusters of GPS locations such as a swimming hall, a walking street, a market place, a science park, a shop and the homes of users.]
Image segmentation

Objects of different colors


Literature
1. S. Theodoridis and K. Koutroumbas, Pattern Recognition, Academic Press, 2nd
edition, 2006.
2. P. Fränti and T. Kaukoranta, "Binary vector quantizer design using soft centroids",
Signal Processing: Image Communication, 14 (9), 677‑681, 1999.
3. I. Kärkkäinen and P. Fränti, "Variable metric for binary vector quantization", IEEE
Int. Conf. on Image Processing (ICIP’04), Singapore, vol. 3, 3499-3502, October
2004.
Literature
Modified k-modes + k-histograms:
M. Ng, M.J. Li, J.Z. Huang and Z. He, "On the Impact of Dissimilarity Measure in k-Modes Clustering Algorithm", IEEE Trans. on Pattern Analysis and Machine Intelligence, 29 (3), 503-507, March 2007.
ACE:
K. Chen and L. Liu, "The 'Best k' for entropy-based categorical data clustering", Int. Conf. on Scientific and Statistical Database Management (SSDBM 2005), pp. 253-262, Berkeley, USA, 2005.
ROCK:
S. Guha, R. Rastogi and K. Shim, "ROCK: A robust clustering algorithm for categorical attributes", Information Systems, 25 (5), 345-366, 2000.
K-medoids:
L. Kaufman and P.J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, John Wiley & Sons, New York, 1990.
K-modes:
Z. Huang, "Extensions to the k-means algorithm for clustering large data sets with categorical values", Data Mining and Knowledge Discovery, 2 (3), 283-304, 1998.
K-distributions:
Z. Cai, D. Wang and L. Jiang, "K-Distributions: A new algorithm for clustering categorical data", Int. Conf. on Intelligent Computing (ICIC 2007), pp. 436-443, Qingdao, China, 2007.
K-histograms:
Z. He, X. Xu, S. Deng and B. Dong, "K-Histograms: An efficient clustering algorithm for categorical dataset", CoRR, abs/cs/0509033, http://arxiv.org/abs/cs/0509033, 2005.
