2 ADA Cluster Analysis


What is Cluster Analysis?

• Cluster: a collection of data objects
– Similar to one another within the same cluster
– Dissimilar to the objects in other clusters

• Cluster analysis
– Grouping a set of data objects into natural clusters

• Clustering is unsupervised classification: no predefined classes
Objective of clustering
algorithms for categorical data
• Partition the objects into groups.
– Objects with similar categorical attribute
values are placed in the same group.
– Objects in different groups contain
dissimilar categorical attribute values.
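A minimal Python sketch of this objective, using the simple matching
dissimilarity (the fraction of attributes on which two objects disagree);
the attribute values below are made up for illustration.

# Simple matching dissimilarity for categorical objects:
# the fraction of attributes on which two objects disagree.
def simple_matching_dissimilarity(a, b):
    assert len(a) == len(b)
    mismatches = sum(1 for x, y in zip(a, b) if x != y)
    return mismatches / len(a)

obj1 = ("red", "small", "round")
obj2 = ("red", "large", "round")
obj3 = ("blue", "large", "square")

print(simple_matching_dissimilarity(obj1, obj2))  # 0.33 -> likely same group
print(simple_matching_dissimilarity(obj1, obj3))  # 1.00 -> different groups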
General Applications of Clustering
• Pattern Recognition
• Spatial Data Analysis
– create thematic maps in GIS by clustering feature
spaces
– detect spatial clusters and explain them in spatial data
mining
• Image Processing
• Economic Science (especially market research)
• WWW
– Document classification
– Cluster Weblog data to discover groups of similar
access patterns
What Is Good Clustering?
• A good clustering method will produce high quality clusters with
– high intra-class similarity
– low inter-class similarity

• The quality of a clustering depends on:
– The appropriateness of the method for the dataset
– The (dis)similarity measure used
– Its implementation

• The quality of a clustering method is also measured by its ability to
discover some or all of the hidden patterns.
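A small sketch of the intra- vs. inter-class idea, assuming Euclidean distance;
the two toy clusters below are made up. A good clustering keeps the mean
within-cluster distance small and the mean between-cluster distance large.

# Compare mean intra-cluster distance with mean inter-cluster distance
# for a toy 2-D clustering (points and labels are illustrative only).
import numpy as np

points = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],   # cluster 0
                   [5.0, 5.0], [5.1, 4.8], [4.9, 5.2]])  # cluster 1
labels = np.array([0, 0, 0, 1, 1, 1])

dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

intra = dist[labels[:, None] == labels[None, :]]   # same-cluster pairs
inter = dist[labels[:, None] != labels[None, :]]   # different-cluster pairs

print("mean intra-cluster distance:", intra[intra > 0].mean())  # small
print("mean inter-cluster distance:", inter.mean())             # large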
Data Structures
• Data matrix (n objects, p attributes):

      | x_11 ... x_1f ... x_1p |
      | ...  ... ...  ... ...  |
  X = | x_i1 ... x_if ... x_ip |
      | ...  ... ...  ... ...  |
      | x_n1 ... x_nf ... x_np |

• Dissimilarity matrix (d(i,j) is the dissimilarity between objects i and j;
  only the lower triangle is stored):

      | 0                           |
      | d(2,1)  0                   |
  D = | d(3,1)  d(3,2)  0           |
      | :       :       :           |
      | d(n,1)  d(n,2)  ...  ...  0 |
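A minimal sketch, assuming Euclidean distance and made-up values, of how the
n-by-n dissimilarity matrix is derived from an n-by-p data matrix:

# Build the lower-triangular dissimilarity matrix D from data matrix X.
import numpy as np

X = np.array([[1.0, 2.0],    # object 1
              [2.0, 4.0],    # object 2
              [5.0, 1.0]])   # object 3

n = X.shape[0]
D = np.zeros((n, n))
for i in range(n):
    for j in range(i):                            # lower triangle only
        D[i, j] = np.linalg.norm(X[i] - X[j])     # d(i,j) = d(j,i), d(i,i) = 0

print(D)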
Similarity and Dissimilarity
Between Objects
• Distances are normally used to measure the similarity or
dissimilarity between two data objects
• Some popular ones include: Minkowski distance:
d(i, j) = ( |x_i1 - x_j1|^q + |x_i2 - x_j2|^q + ... + |x_ip - x_jp|^q )^(1/q)

where i = (x_i1, x_i2, ..., x_ip) and j = (x_j1, x_j2, ..., x_jp) are two
p-dimensional data objects, and q is a positive integer
• If q = 1, d is Manhattan distance
d (i, j) | x  x |  | x  x | ... | x  x |
i1 j1 i2 j 2 ip jp
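A small sketch of the Minkowski distance for two p-dimensional objects; q = 1
gives the Manhattan distance and q = 2 the Euclidean distance (the example
vectors are made up).

def minkowski(x, y, q):
    # (sum over attributes of |x_f - y_f|^q) ^ (1/q)
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

i = (1.0, 2.0, 3.0)
j = (4.0, 0.0, 3.0)

print(minkowski(i, j, 1))   # Manhattan: |1-4| + |2-0| + |3-3| = 5.0
print(minkowski(i, j, 2))   # Euclidean: sqrt(9 + 4 + 0) ~= 3.61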
Similarity and Dissimilarity
Between Objects (Cont.)
• If q = 2, d is Euclidean distance:
d(i, j) = sqrt( |x_i1 - x_j1|^2 + |x_i2 - x_j2|^2 + ... + |x_ip - x_jp|^2 )
– Properties
• d(i,j) ≥ 0
• d(i,i) = 0
• d(i,j) = d(j,i)
• d(i,j) ≤ d(i,k) + d(k,j)
• One can also use weighted distance, parametric
Pearson product-moment correlation, or other
dissimilarity measures.
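A sketch that checks the four metric properties numerically on a few made-up
points and shows a weighted Euclidean variant (the weight vector is an
illustrative assumption, not part of the slides):

import itertools
import numpy as np

def euclidean(x, y, w=None):
    # Optionally weight each attribute's squared difference.
    w = np.ones_like(x) if w is None else np.asarray(w)
    return float(np.sqrt(np.sum(w * (np.asarray(x) - np.asarray(y)) ** 2)))

pts = [np.array([0.0, 0.0]), np.array([3.0, 4.0]), np.array([6.0, 0.0])]

for a, b, c in itertools.permutations(pts, 3):
    assert euclidean(a, b) >= 0                                   # non-negativity
    assert euclidean(a, a) == 0                                   # identity
    assert euclidean(a, b) == euclidean(b, a)                     # symmetry
    assert euclidean(a, c) <= euclidean(a, b) + euclidean(b, c)   # triangle inequality

print(euclidean(pts[0], pts[1]))                 # 5.0
print(euclidean(pts[0], pts[1], w=[1.0, 0.25]))  # down-weight the 2nd attribute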
Clustering of genomic data sets
Hierarchical Clustering
• Uses the distance matrix as the clustering criterion. This
method does not require the number of clusters
k as an input, but needs a termination condition
[Figure: AGNES (agglomerative nesting) proceeds from Step 0 to Step 4, merging
objects a, b, c, d, e bottom-up (a, b -> ab; d, e -> de; c + de -> cde;
ab + cde -> abcde); DIANA (divisive analysis) builds the same hierarchy
top-down, splitting from Step 4 back to Step 0.]
AGNES (Agglomerative
Nesting)
• Introduced in Kaufmann and Rousseeuw (1990)
• Merge nodes that have the least dissimilarity
• Go on in a non-descending fashion
• Eventually all nodes belong to the same cluster

[Figure: three scatter plots (axes 0–10) showing the data objects being merged
into progressively larger clusters as AGNES proceeds.]
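One way to reproduce AGNES-style behaviour is SciPy's agglomerative linkage;
a minimal sketch with made-up 2-D points (single linkage is an assumption here,
other linkage criteria work the same way):

# Agglomerative clustering: repeatedly merge the pair of clusters with the
# least dissimilarity until all objects belong to one cluster.
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

X = np.array([[1.0, 1.0], [1.5, 1.0], [5.0, 5.0],
              [5.5, 5.5], [9.0, 1.0]])

Z = linkage(pdist(X), method="single")   # merge closest clusters first
print(Z)  # each row: the two clusters merged, their distance, new cluster size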
A Dendrogram Shows How the
Clusters are Merged Hierarchically

Decompose data objects into several levels of nested


partitioning (tree of clusters), called a dendrogram.

A clustering of the data objects is obtained by cutting the


dendrogram at the desired level, then each connected
component forms a cluster.
Dendrogram
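A minimal sketch, continuing the SciPy example above, of cutting the dendrogram
at a chosen height (the cut level of 2.0 is an arbitrary illustration):

# Cut the dendrogram at height 2.0; every connected component below the cut
# becomes one cluster.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

X = np.array([[1.0, 1.0], [1.5, 1.0], [5.0, 5.0],
              [5.5, 5.5], [9.0, 1.0]])
Z = linkage(pdist(X), method="single")

labels = fcluster(Z, t=2.0, criterion="distance")
print(labels)   # e.g. [1 1 2 2 3] -> three clusters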
