Clustering
[Figure: Indian cities to be grouped into clusters: Jaipur, Indore, Aurangabad, Jabalpur, Bangalore, Jodhpur, Bhopal, Kalyan-Dombivli, Gwalior, Kota, Hubli–Dharwad, Mysore, Ludhiana, Navi Mumbai, Amritsar, Solapur, Chennai, Vasai-Virar, Coimbatore, Madurai]
Initially, number of clusters: 21
The required data for cluster analysis
• Typically, cluster analysis is performed on a table of raw data, where each row represents an object and the columns represent quantitative characteristics of the objects.
• These quantitative characteristics are called clustering variables.
Outputs of Cluster analysis
• The main output from cluster analysis is a table showing the mean values of each cluster on the clustering variables.
• The second output shows which object has been classified into which cluster.
Example
• For example, in the table there are 18 objects and two clustering variables, x and y.
• Cluster analysis can also be performed using data in a distance matrix.
Outputs of Example data
Table of means
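A minimal sketch of these two outputs in pandas; the 18-object table from the slide is not reproduced here, so the rows, values, and cluster labels below are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the example table: objects, clustering variables
# x and y, and the cluster each object was assigned to by the analysis.
df = pd.DataFrame({
    "object":  ["A", "B", "C", "D", "E", "F"],
    "x":       [1.0, 1.5, 1.2, 8.0, 8.4, 7.9],
    "y":       [2.0, 1.8, 2.2, 9.0, 8.7, 9.3],
    "cluster": [1, 1, 1, 2, 2, 2],
})

# Output 1: table of means of the clustering variables for each cluster
print(df.groupby("cluster")[["x", "y"]].mean())

# Output 2: which object has been classified into which cluster
print(df[["object", "cluster"]])
```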
Measures of distance (similarity)
• The distance between two clusters is computed as the length of the straight line drawn from one cluster to the other.
• This is commonly referred to as the Euclidean distance.
• Many other distance metrics have been developed.
Distance or Similarity Measures
I. Euclidean distance: d(x, y) = √( Σᵢ (xᵢ − yᵢ)² )
II. Manhattan distance: d(x, y) = Σᵢ |xᵢ − yᵢ|
III. Mahalanobis distance: d(x, y) = √( (x − y)ᵀ S⁻¹ (x − y) ), where S is the covariance matrix of the data
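The three measures can be written as a short NumPy sketch; the vectors x and y and the small data set used to estimate the covariance matrix are hypothetical.

```python
import numpy as np

def euclidean(x, y):
    # straight-line distance: sqrt(sum_i (x_i - y_i)^2)
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan(x, y):
    # city-block distance: sum_i |x_i - y_i|
    return np.sum(np.abs(x - y))

def mahalanobis(x, y, cov):
    # scales and de-correlates the variables: sqrt((x - y)^T cov^-1 (x - y))
    diff = x - y
    return np.sqrt(diff @ np.linalg.inv(cov) @ diff)

x = np.array([2.0, 3.0])
y = np.array([5.0, 7.0])
data = np.array([[2, 3], [5, 7], [4, 4], [6, 8]], dtype=float)  # hypothetical sample
cov = np.cov(data, rowvar=False)                                # covariance of the variables

print(euclidean(x, y))          # 5.0
print(manhattan(x, y))          # 7.0
print(mahalanobis(x, y, cov))
```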
Major Clustering Approaches
• Partitioning approach:
  – Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors
  – Typical methods: k-means, k-medoids, CLARANS
• Hierarchical approach:
  – Create a hierarchical decomposition of the set of data (or objects) using some criterion
  – Typical methods: DIANA, AGNES, BIRCH, CHAMELEON
• Density-based approach:
  – Based on connectivity and density functions
  – Typical methods: DBSCAN, OPTICS, DenClue
• Grid-based approach:
  – Based on a multiple-level granularity structure
  – Typical methods: STING, WaveCluster, CLIQUE
Partitioning Algorithms: Basic Concept
• Given a value of k, partition the n objects into k clusters so that a chosen criterion (e.g., the sum of squared errors) is optimized.
The K-Means Clustering Method
[Figure: K-means clustering of sample points with K = 2]
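A from-scratch sketch of the K-means procedure with K = 2, as in the figure; the sample points are hypothetical.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. pick k data points at random as the initial cluster centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # 2. assign each point to its nearest center (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. recompute each center as the mean of the points assigned to it
        #    (keep the old center if a cluster ends up empty)
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break  # converged: the centers no longer move
        centers = new_centers
    return labels, centers

# Two well-separated groups of hypothetical 2-D points
X = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 0.5],
              [8.0, 8.0], [8.5, 9.0], [9.0, 8.0]])
labels, centers = kmeans(X, k=2)
print(labels)    # e.g. [0 0 0 1 1 1]
print(centers)
```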
Hierarchical Clustering
[Figure: agglomerative vs. divisive hierarchical clustering]
Exercise
Types of Hierarchical Clustering
• Two main types of hierarchical clustering:
  – Agglomerative:
    • Start with the points as individual clusters
    • At each step, merge the closest pair of clusters until only one cluster (or k clusters) is left
  – Divisive:
    • Start with one, all-inclusive cluster
    • At each step, split a cluster until each cluster contains a single point (or there are k clusters)
Agglomerative versus divisive algorithms
1. Linkage Methods
  • Single Linkage
  • Complete Linkage
  • Average Linkage
2. Variance Method (Ward's Method)
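The linkage criteria differ only in how they measure the distance between two clusters; the sketch below shows single, complete, and average linkage on two hypothetical clusters (Ward's method instead merges the pair of clusters that gives the smallest increase in total within-cluster variance).

```python
import numpy as np

def pairwise(A, B):
    # matrix of Euclidean distances between every point of A and every point of B
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

def single_link(A, B):    # minimum distance between any two points
    return pairwise(A, B).min()

def complete_link(A, B):  # maximum distance between any two points
    return pairwise(A, B).max()

def average_link(A, B):   # average of all pairwise distances
    return pairwise(A, B).mean()

A = np.array([[1.0, 1.0], [2.0, 1.0]])   # hypothetical cluster A
B = np.array([[4.0, 4.0], [5.0, 5.0]])   # hypothetical cluster B
print(single_link(A, B), complete_link(A, B), average_link(A, B))
```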
Single Linkage
• In single linkage, the distance between two clusters is the minimum distance between any two points, one from each cluster,
i.e. d(Cᵢ, Cⱼ) = min { dist(a, b) : a ∈ Cᵢ, b ∈ Cⱼ }
Single Linkage Example
• Let's take 6 simple vectors.
Step 1: compute the distance between the points
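A sketch of this step (and the full single-linkage merge sequence) using SciPy; the 6 vectors below are hypothetical, since the slide's actual values are not reproduced here.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage

# Hypothetical 2-D vectors standing in for the slide's 6 points
X = np.array([[1.0, 1.0], [1.5, 1.5], [5.0, 5.0],
              [3.0, 4.0], [4.0, 4.0], [3.0, 3.5]])

# Step 1: pairwise Euclidean distance matrix between the points
D = squareform(pdist(X, metric="euclidean"))
print(np.round(D, 2))

# Remaining steps: repeatedly merge the two clusters whose closest members
# are nearest to each other (single linkage) until one cluster remains.
Z = linkage(X, method="single", metric="euclidean")
print(Z)  # each row: indices merged, merge distance, size of the new cluster
```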
Density-Based Clustering: Basic Concepts
• Two parameters:
  – Eps: maximum radius of the neighbourhood
  – MinPts: minimum number of points in an Eps-neighbourhood of that point
• N_Eps(p): {q belongs to D | dist(p, q) ≤ Eps}
• Directly density-reachable: a point p is directly density-reachable from a point q w.r.t. Eps, MinPts if
  – p belongs to N_Eps(q)
  – core point condition: |N_Eps(q)| ≥ MinPts
[Figure: p directly density-reachable from q, with MinPts = 5 and Eps = 1 cm]
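These two definitions can be sketched directly in NumPy; the points, Eps, and MinPts below are hypothetical.

```python
import numpy as np

def eps_neighbourhood(D, p, eps):
    # N_Eps(p): indices q with dist(p, q) <= Eps (includes p itself)
    dists = np.linalg.norm(D - D[p], axis=1)
    return np.where(dists <= eps)[0]

def is_core(D, p, eps, min_pts):
    # core point condition: |N_Eps(p)| >= MinPts
    return len(eps_neighbourhood(D, p, eps)) >= min_pts

D = np.array([[0.0, 0.0], [0.5, 0.0], [0.0, 0.5], [0.4, 0.4], [5.0, 5.0]])
print(eps_neighbourhood(D, 0, eps=1.0))   # [0 1 2 3]
print(is_core(D, 0, eps=1.0, min_pts=4))  # True: 4 points within Eps of point 0
print(is_core(D, 4, eps=1.0, min_pts=4))  # False: the isolated point
```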
Density-Reachable and Density-Connected
• Density-reachable:
  – A point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points p1, …, pn with p1 = q and pn = p such that pi+1 is directly density-reachable from pi
• Density-connected:
  – A point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts
[Figure: chains of points illustrating density-reachability and density-connectivity]
Core, Border and Outlier
• Core point: has at least MinPts points within its Eps-neighbourhood
• Border point: not a core point itself, but lies within the Eps-neighbourhood of a core point
• Outlier (noise point): neither a core point nor a border point
Example
DBSCAN: The Algorithm
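The concepts above combine into the DBSCAN procedure. The sketch below is an illustrative implementation, not the exact pseudocode from the slide: it grows a cluster from each unvisited core point by following density-reachability, and labels everything left over as noise (label -1).

```python
import numpy as np

def dbscan(X, eps, min_pts):
    n = len(X)
    labels = np.full(n, -1)              # -1 = noise until assigned to a cluster
    visited = np.zeros(n, dtype=bool)
    cluster_id = 0

    def region_query(p):
        # indices of all points within Eps of point p (its Eps-neighbourhood)
        return np.where(np.linalg.norm(X - X[p], axis=1) <= eps)[0]

    for p in range(n):
        if visited[p]:
            continue
        visited[p] = True
        neighbours = list(region_query(p))
        if len(neighbours) < min_pts:
            continue                     # p is not a core point (border or noise)
        labels[p] = cluster_id           # start a new cluster from core point p
        i = 0
        while i < len(neighbours):       # expand via density-reachable points
            q = neighbours[i]
            if not visited[q]:
                visited[q] = True
                q_neighbours = region_query(q)
                if len(q_neighbours) >= min_pts:   # q is also a core point
                    neighbours.extend(q_neighbours)
            if labels[q] == -1:
                labels[q] = cluster_id
            i += 1
        cluster_id += 1
    return labels

# Two dense hypothetical groups plus one obvious outlier
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)),
               rng.normal(5, 0.3, (20, 2)),
               [[10.0, 10.0]]])
print(dbscan(X, eps=1.0, min_pts=5))     # two clusters (0, 1) and one -1 (noise)
```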
DBSCAN: Sensitive to Parameters