Clustering methods: Part 4

Clustering Cost function


Pasi Fränti
29.4.2014

Speech and Image Processing Unit


School of Computing
University of Eastern Finland
Data types

• Numeric
• Binary
• Categorical
• Text
• Time series
Part I:
Numeric data
Distance measures
Definition of distance metric
A distance function is metric if the following conditions
are met for all data points x, y, z:
• All distances are non-negative:
d(x, y) ≥ 0
• Distance to the point itself is zero:
d(x, x) = 0
• All distances are symmetric:
d(x, y) = d(y, x)
• Triangle inequality:
d(x, y) ≤ d(x, z) + d(z, y)
Common distance metrics
Between two points Xi = (xi1, xi2, …, xip) and Xj = (xj1, xj2, …, xjp):

• Minkowski distance (one term per dimension, from the 1st to the pth):

d(i, j) = \left( |x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \ldots + |x_{ip} - x_{jp}|^q \right)^{1/q}

• Euclidean distance (q = 2):

d(i, j) = \sqrt{ |x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \ldots + |x_{ip} - x_{jp}|^2 }

• Manhattan distance (q = 1):

d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \ldots + |x_{ip} - x_{jp}|
Distance metrics example
2D example: x1 = (2, 8), x2 = (6, 3)

Euclidean distance:
d(1, 2) = \sqrt{ |2 - 6|^2 + |8 - 3|^2 } = \sqrt{41}

Manhattan distance:
d(1, 2) = |2 - 6| + |8 - 3| = 9
Chebyshev distance
In the case q → ∞, the distance equals the maximum difference of the attributes. Useful if the worst case must be avoided:

d_\infty(X, Y) = \lim_{q \to \infty} \left( \sum_{i=1}^{n} |x_i - y_i|^q \right)^{1/q} = \max\left( |x_1 - y_1|, |x_2 - y_2|, \ldots, |x_n - y_n| \right)

Example:
d_\infty((2, 8), (6, 3)) = \max(|2 - 6|, |8 - 3|) = \max(4, 5) = 5
Hierarchical clustering
Cost functions
Single link: the smallest distance between vectors in
clusters i and j:
d (Ci , C j )  min
xi Ci , x j C j ,
d ( xi , x j )

Complete-link: the largest distance between vectors


in clusters i and j:
d (Ci , C j )  max
xi Ci , x j C j ,
d ( xi , x j )

Average link: the average distance between vectors in


clusters i and j:
1
d (C i , C j )   
C i C j xi Ci x j C j
d ( xi ,x j )
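A small Python sketch (mine, not from the slides) of the three linkage costs, assuming Euclidean distance between points; the toy clusters are illustrative only.

```python
import math

def dist(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single_link(ci, cj):
    return min(dist(a, b) for a in ci for b in cj)

def complete_link(ci, cj):
    return max(dist(a, b) for a in ci for b in cj)

def average_link(ci, cj):
    return sum(dist(a, b) for a in ci for b in cj) / (len(ci) * len(cj))

ci = [(0.0,), (1.0,)]
cj = [(3.0,), (5.0,)]
print(single_link(ci, cj), complete_link(ci, cj), average_link(ci, cj))
# 2.0 5.0 3.5
```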
Single Link
Complete Link
Average Link
Cost function example
[Theodoridis, Koutroumbas, 2006]

[Figure: seven points x1…x7 on a line with gaps 1, 1.1, 1.2, 1.3, 1.4, 1.5; the resulting single-link and complete-link dendrograms differ.]
Part II:
Binary data
Hamming Distance
(Binary and categorical data)
• Number of differing attribute values.
• Distance of (1011101) and (1001001) is 2.
• Distance of (2143896) and (2233796) is 3.
• Distance between (toned) and (roses) is 3.

3-bit binary cube: 100 → 011 has distance 3 (red path), 010 → 111 has distance 2 (blue path).
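A minimal sketch (function name mine) that reproduces the Hamming distance examples above.

```python
def hamming(a, b):
    """Number of positions where two equal-length sequences differ."""
    assert len(a) == len(b)
    return sum(1 for x, y in zip(a, b) if x != y)

print(hamming("1011101", "1001001"))  # 2
print(hamming("2143896", "2233796"))  # 3
print(hamming("toned", "roses"))      # 3
print(hamming("100", "011"))          # 3
```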
Hard thresholding of centroid
Example: (0.40, 0.60, 0.75, 0.20, 0.45, 0.25) is thresholded to (0, 1, 1, 0, 0, 0).
Hard and soft centroids
Bridge (binary version)
[Figure: binary distortion as a function of iterations (0 to 10) on Bridge (binary version); hard (binary) centroids converge to a clearly higher distortion than soft centroids.]
Distance and distortion

General distance function:

L_d(x_i, c_j) = \left( \sum_{k=1}^{K} |x_{ik} - c_{jk}|^d \right)^{1/d}

Distortion function:

E(X, C) = \frac{1}{N} \sum_{i=1}^{N} L_d(x_i, c_{p_i})
Distortion for binary data

Cost of a single attribute:

D_{jk} = q_{jk} \cdot c_{jk}^{\,d} + r_{jk} \cdot (1 - c_{jk})^{d}

where q_{jk} is the number of zeroes, r_{jk} the number of ones, and c_{jk} the current centroid value for variable k of group j.
Optimal centroid position

The optimal centroid position depends on the metric. Given the parameter

\alpha = 1 / (d - 1)

the optimal position is:

c_{jk} = \frac{r_{jk}^{\,\alpha}}{q_{jk}^{\,\alpha} + r_{jk}^{\,\alpha}}
Example of centroid location

[Figure: optimal centroid value c (between 0.5 and 1.0) as a function of d (from 6.0 down to 1.0) for the mixes q=10/r=90, q=20/r=80, q=30/r=70 and q=40/r=60.]
Centroid location

Example with q = 75, r = 25:

d = ∞ (α = 0):    c = 0.5
d = 3 (α = 0.5):  c = 0.37
d = 2 (α = 1):    c = 0.25
d = 1:            c = 0
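A short sketch of the optimal soft-centroid value for one binary attribute, assuming the formula c_jk = r^α / (q^α + r^α) with α = 1/(d − 1) given above; it reproduces the q = 75, r = 25 values. The function name is mine.

```python
def optimal_centroid(q, r, d):
    """q = number of zeroes, r = number of ones, d = distance exponent."""
    if d == 1:
        # Degenerate case: hard (majority-vote) centroid.
        return 1.0 if r > q else 0.0
    a = 1.0 / (d - 1)
    return r ** a / (q ** a + r ** a)

for d in (2, 3, 10**6):            # a very large d approximates d -> infinity
    print(d, round(optimal_centroid(75, 25, d), 2))
# d=2 -> 0.25, d=3 -> 0.37, d->infinity -> 0.5
```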
Part III:
Categorical data
Categorical clustering
Three attributes
Categorical clustering
Sample 2-d data: color and shape

[Figure: the sample data clustered according to three alternative models A, B and C.]
Categorical clustering

Methods (including histogram-based methods):

• k-modes
• k-medoids
• k-distributions
• k-histograms
• k-populations
• k-representatives
Entropy-based cost functions
Category utility:

Entropy of data set:

Entropies of the clusters relative to the data:


Iterative algorithms
K-modes clustering
Distance function
K-modes clustering
Prototype of cluster
K-medoids clustering
Prototype of cluster

Vector with the minimal total distance to all other vectors in the cluster.

Example with Hamming distance and vectors v1 = (A, C, E), v2 = (B, C, F), v3 = (B, D, G):
total distance of v1: 2 + 3 = 5
total distance of v2: 2 + 2 = 4
total distance of v3: 3 + 2 = 5
Medoid: v2 = (B, C, F)
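A sketch of medoid selection for categorical data with Hamming distance, reproducing the small example above; the vector labels and function names are mine.

```python
def hamming(a, b):
    """Number of differing attribute values between two vectors."""
    return sum(1 for x, y in zip(a, b) if x != y)

def medoid(cluster):
    """Vector with the minimal total distance to all other vectors."""
    return min(cluster, key=lambda v: sum(hamming(v, u) for u in cluster))

cluster = [("A", "C", "E"), ("B", "C", "F"), ("B", "D", "G")]
print(medoid(cluster))   # ('B', 'C', 'F'), total distance 2 + 2 = 4
```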
K-medoids
Example
K-medoids
Calculation
K-histograms

[Example: histogram of one attribute in a cluster with values D (frequency 2/3) and F (frequency 1/3).]
K-distributions
Cost function with ε addition
Example of cluster allocation
Change of entropy
Problem of non-convergence
Non-convergence
Results with Census dataset
Part IV:
Text data
Applications of text clustering

• Query relaxation
• Spell-checking
• Automatic categorization
• Document similarity
Query relaxation
Current solution: matching suffixes from the database. Alternative solution: semantic clustering.
Spell-checking

The word kahvila (café) appears with one correct and two incorrect spellings.
Automatic categorization

Category obtained by clustering.

[Screenshot: "HELP!!! This is Google translation…"]
String clustering

• The similarity between every string pair is calculated as a basis for determining the clusters.
• A similarity measure is required to calculate the similarity between two strings.

Two approaches: approximate string matching and semantic similarity.
Document clustering
Motivation:
– Group related documents based on their content
– No predefined training set (taxonomy)
– Generate a taxonomy at runtime
Clustering process:
– Data preprocessing: stop-word removal, stemming, feature extraction and lexical analysis
– Define the cost function
– Perform clustering
Exact string matching

• Given a text string T of length n and a pattern string P of length m, the exact string matching problem is to find all occurrences of P in T.
• Example: T = “AGCTTGA”, P = “GCT”
• Applications:
– Searching keywords in a file
– Search engines (such as Google)
– Database searching
Approximate string matching
Determine if a text string T of length n and a pattern string P of length m partially match.
– Consider the string “approximate”. Which of these are partial matches?
aproximate approximately appropriate proximate approx approximat apropos approxximate
– A partial match can be thought of as one that has k differences from the string, where k is some small integer (for instance 1 or 2).
– A difference occurs if string1.charAt(j) != string2.charAt(j), or if string1.charAt(j) does not appear in string2 (or vice versa).
 The former case is known as a revise difference, the latter is a delete or insert difference.
 What about two characters that appear out of position? For instance, approximate vs. apporximate?
Approximate string matching
Example names in the data: Keanu Reeves, Samuel Jackson, Arnold Schwarzenegger, H. Norman Schwarfkopf, Bernard Schwartz; erroneous query: "Schwarrzenger".

Query errors:
• Typos
• Limited input device (cell phone) input
• Limited knowledge about the data
Data errors:
• Typos
• Web data
• OCR

Similarity functions: edit distance, q-gram, cosine.
Edit distance
Levenshtein distance

• Given two strings T and P, the edit distance is the minimum number of substitutions, insertions and deletions that transform the string T into P.
• Time complexity by dynamic programming: O(nm)
Edit distance
(dynamic programming, 1974)

Example: distance between "temp" (rows) and "tmp" (columns):

        ""  t  m  p
    ""   0  1  2  3
    t    1  0  1  2
    e    2  1  1  2
    m    3  2  1  2
    p    4  3  2  1

Dynamic programming:
m[i][j] = min{ m[i-1][j] + 1, m[i][j-1] + 1, m[i-1][j-1] + d(i, j) }
d(i, j) = 0 if the characters match, d(i, j) = 1 otherwise
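A plain O(nm) dynamic-programming sketch matching the recurrence above; variable names are mine.

```python
def edit_distance(t, p):
    """Minimum number of substitutions, insertions and deletions T -> P."""
    n, m = len(t), len(p)
    dp = [[0] * (m + 1) for _ in range(n + 1)]   # dp[i][j] = ed(t[:i], p[:j])
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if t[i - 1] == p[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution / match
    return dp[n][m]

print(edit_distance("temp", "tmp"))   # 1
```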
Q-grams

• Q-grams: substrings of fixed length q (example: 2-grams of "bingo").
• If ed(T, P) ≤ k, then
  (number of common q-grams) ≥ (number of q-grams of T) − k · q
Q-grams
T = “bingo”, P = “going”
gram1 = {#b, bi, in, ng, go, o#}
gram2 = {#g, go, oi, in, ng, g#}
Total(gram1, gram2) = {#b, bi, in, ng, go, o#, #g, go, oi, in, ng, g#}
|common terms difference| = sum{1,1,0,0,0,1,1,0,1,0,0,1} = 6
gram1.length = (T.length + (q − 1) · 2 + 1) − q = 6
gram2.length = (P.length + (q − 1) · 2 + 1) − q = 6
L = gram1.length + gram2.length = 12
Similarity = (L − |common terms difference|) / L = (12 − 6) / 12 = 0.5
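A sketch of this q-gram similarity, assuming one '#' pad character per side for q = 2 as in the gram sets above; it reproduces the 0.5 result. The helper names are mine.

```python
def qgrams(s, q=2, pad="#"):
    """Split a padded string into overlapping q-grams."""
    s = pad * (q - 1) + s + pad * (q - 1)
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def qgram_similarity(t, p, q=2):
    g1, g2 = qgrams(t, q), qgrams(p, q)
    # For every gram, count 1 if it has no counterpart in the other string.
    diff = sum(1 for g in g1 if g not in g2) + sum(1 for g in g2 if g not in g1)
    total = len(g1) + len(g2)
    return (total - diff) / total

print(qgram_similarity("bingo", "going"))   # 0.5
```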
Cosine similarity

• For two vectors A and B, the cosine of the angle θ between them is computed from the dot product and magnitudes:

\cos\theta = \frac{A \cdot B}{\|A\|\,\|B\|}

• Implementation for strings:
Cosine similarity = (Common terms) / (sqrt(Number of terms in String1) × sqrt(Number of terms in String2))
Cosine similarity

T = “bingo right”, P = “going right”
T1 = {bingo, right}, P1 = {going, right}
L1 = unique(T1).length = 2
L2 = unique(P1).length = 2
Unique(T1 ∪ P1) = {bingo, right, going}
L3 = unique(T1 ∪ P1).length = 3
Common terms = (L1 + L2) − L3 = 1
Similarity = Common terms / (sqrt(L1) · sqrt(L2)) = 1 / (√2 · √2) = 0.5
Dice coefficient

• Similar to cosine similarity.
• Dice's coefficient = (2 × Common terms) / (Number of terms in String1 + Number of terms in String2)
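A small sketch of both measures over word sets, reproducing the "bingo right" / "going right" example above; the function names are mine.

```python
import math

def cosine_sim(t, p):
    """Cosine similarity over the sets of words in the two strings."""
    a, b = set(t.split()), set(p.split())
    common = len(a & b)
    return common / (math.sqrt(len(a)) * math.sqrt(len(b)))

def dice_sim(t, p):
    """Dice's coefficient over the sets of words in the two strings."""
    a, b = set(t.split()), set(p.split())
    return 2 * len(a & b) / (len(a) + len(b))

print(cosine_sim("bingo right", "going right"))  # 0.5
print(dice_sim("bingo right", "going right"))    # 0.5
```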
Similarities for sample data
Compared strings                             Edit       Q-gram   Q-gram   Q-gram   Cosine
                                             distance   Q=2      Q=3      Q=4      distance
Pizza Express Café /
Pizza Express                                72%        79%      74%      70%      82%
Lounasravintola Pinja Ky – ravintoloita /
Lounasravintola Pinja                        54%        68%      67%      65%      63%
Kioski Piirakkapaja /
Kioski Marttakahvio (different)              47%        45%      33%      32%      50%
Kauppa Kulta Keidas /
Kauppa Kulta Nalle (different)               68%        67%      63%      60%      67%
Ravintola Beer Stop Pub /
Baari, Beer Stop R-kylä                      39%        42%      36%      31%      50%
Ravintola Foxie s /
Bar Foxie Karsikko                           31%        25%      15%      12%      24%
Play baari /
Ravintola Bar Play – Ravintoloita            21%        31%      17%      8%       32%
Thesaurus-based WordNet
WordNet
An extensive lexical network for the English language
• Contains over 138,838 words.
• Several graphs, one for each part-of-speech.
• Synsets (synonym sets), each defining a semantic sense.
• Relationship information (antonym, hyponym, meronym …)
• Downloadable for free (UNIX, Windows)
• Expanding to other languages (Global WordNet Association)
• Funded >$3 million, mainly government (translation interest)
• Founder George Miller, National Medal of Science, 1991.
[Synonym/antonym graph: "wet" linked by synonymy to watery, moist, damp; "dry" linked to parched, arid, anhydrous; wet and dry linked by antonymy.]
Example of WordNet
object
  artifact
    instrumentality
      conveyance, transport
        vehicle
          wheeled vehicle
            automotive, motor
              car, auto
              truck
            bike, bicycle
    article
      ware
        table ware
          cutlery, eating utensil
            fork
Examples of probabilities
Entity (40%)
  Inanimate-object (17%)
    Natural-object (1.6%)
      Geological-formation (0.17%)
        Natural-elevation (0.011%)
          Hill (0.0019%)
        Shore (0.008%)
          Coast (0.002%)
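To illustrate the information-content measures used in the table two slides below, a hedged sketch of Lin's measure, sim(c1, c2) = 2·log p(lcs) / (log p(c1) + log p(c2)), using the example probabilities above and assuming Geological-formation is the lowest common subsumer of Hill and Coast; the hard-coded dictionary is for illustration only.

```python
import math

# Example probabilities from the hierarchy above (percent converted to fractions).
p = {
    "entity": 0.40,
    "inanimate-object": 0.17,
    "natural-object": 0.016,
    "geological-formation": 0.0017,
    "natural-elevation": 0.00011,
    "shore": 0.00008,
    "hill": 0.000019,
    "coast": 0.00002,
}

def lin(c1, c2, lcs):
    """Lin similarity given the lowest common subsumer (lcs) of c1 and c2."""
    return 2 * math.log(p[lcs]) / (math.log(p[c1]) + math.log(p[c2]))

print(round(lin("hill", "coast", "geological-formation"), 2))   # ~0.59
```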


Hierarchical clustering by WordNet
Performance of WordNet
                        Human      Edge counting based    Information content based
Word pair               judgment   Path       WUP         LIN        Jiang&Conrath
Car / Automobile         3.92       1          1           1          1
Gem / Jewel              3.84       1          1           1          1
Journey / Voyage         3.84       0.97       0.92        0.84       0.88
Boy / Lad                3.76       0.97       0.93        0.86       0.88
Coast / Shore            3.70       0.97       0.91        0.98       0.99
Asylum / Madhouse        3.61       0.97       0.94        0.97       0.97
Magician / Wizard        3.50       1          1           1          1
Midday / Noon            3.42       1          1           1          1
Furnace / Stove          3.11       0.81       0.46        0.23       0.39
Food / Fruit             3.08       0.81       0.22        0.13       0.63
Bird / Cock              3.05       0.97       0.94        0.6        0.73
Bird / Crane             2.97       0.92       0.84        0.6        0.73
Part V:
Time series
Clustering of time-series
Dynamic Time Warping
Align two time series by minimizing the distance of the aligned observations. Solved by dynamic programming!
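A minimal DTW sketch (distance only, no warping-path recovery), assuming absolute difference as the local cost; the names are mine, not from the slides.

```python
def dtw(s, t):
    """Dynamic time warping distance between two numeric sequences."""
    n, m = len(s), len(t)
    inf = float("inf")
    # dp[i][j] = cost of the best alignment of s[:i] and t[:j]
    dp = [[inf] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(s[i - 1] - t[j - 1])
            dp[i][j] = cost + min(dp[i - 1][j],      # repeat observation of t
                                  dp[i][j - 1],      # repeat observation of s
                                  dp[i - 1][j - 1])  # advance both series
    return dp[n][m]

print(dtw([1, 2, 3, 4], [1, 2, 2, 3, 4]))   # 0.0: the series align perfectly
```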
Example of DTW
Prototype of a cluster

• The sequence c that minimizes E(S_j, c) is called a Steiner sequence.
• A good approximation to the Steiner problem is to use the medoid of the cluster (discrete median).
• The medoid is the time series in the cluster that minimizes E(S_j, c).
Calculating the prototype


• Can be solved by dynamic programming.
• Complexity is exponential in the number of time series in a cluster.
Averaging heuristic


• Calculate the medoid sequence.
• Calculate warping paths from the medoid to all other time series in the cluster.
• The new prototype is the average sequence over the warping paths.
Local search heuristics
Example of the three methods

[Figure: prototypes produced by the three methods, with Steiner costs E(S) = 159, E(S) = 138 and E(S) = 118.]

• Local search (LS) provides a better fit in terms of the Steiner cost function.
• It cannot modify the sequence length during the iterations. In datasets with varying lengths it might provide a better fit, but prototypes that are not sensible.
Experiments
Part VI:
Other clustering problems
Clustering of GPS trajectories
Density clusters

[Map figure: density clusters of GPS locations such as a swimming hall, a walking street, a market place, a science park, a shop and the homes of users.]
Image segmentation

Objects of different colors


Literature
1. S. Theodoridis and K. Koutroumbas, Pattern Recognition, Academic Press, 2nd
edition, 2006.
2. P. Fränti and T. Kaukoranta, "Binary vector quantizer design using soft centroids",
Signal Processing: Image Communication, 14 (9), 677‑681, 1999.
3. I. Kärkkäinen and P. Fränti, "Variable metric for binary vector quantization", IEEE
Int. Conf. on Image Processing (ICIP’04), Singapore, vol. 3, 3499-3502, October
2004.
Literature
Modified k-modes + k-histograms:
M. Ng, M.J. Li, J.Z. Huang and Z. He, "On the Impact of Dissimilarity Measure in k-Modes Clustering Algorithm", IEEE Trans. on Pattern Analysis and Machine Intelligence, 29 (3), 503-507, March 2007.
ACE:
K. Chen and L. Liu, "The 'Best k' for entropy-based categorical data clustering", Int. Conf. on Scientific and Statistical Database Management (SSDBM 2005), pp. 253-262, Berkeley, USA, 2005.
ROCK:
S. Guha, R. Rastogi and K. Shim, "ROCK: A robust clustering algorithm for categorical attributes", Information Systems, 25 (5), 345-366, 2000.
K-medoids:
L. Kaufman and P.J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, John Wiley & Sons, New York, 1990.
K-modes:
Z. Huang, "Extensions to the k-means algorithm for clustering large data sets with categorical values", Data Mining and Knowledge Discovery, 2 (3), 283-304, 1998.
K-distributions:
Z. Cai, D. Wang and L. Jiang, "K-Distributions: A new algorithm for clustering categorical data", Int. Conf. on Intelligent Computing (ICIC 2007), pp. 436-443, Qingdao, China, 2007.
K-histograms:
Z. He, X. Xu, S. Deng and B. Dong, "K-Histograms: An efficient clustering algorithm for categorical dataset", CoRR, abs/cs/0509033, http://arxiv.org/abs/cs/0509033, 2005.
