Clustering


Clustering

[Slide figure: a scatter of Indian city names — Jaipur, Indore, Aurangabad, Jabalpur, Bangalore, Jodhpur, Bhopal, Kalyan-Dombivli, Gwalior, Kota, Hubli–Dharwad, Mysore, Ludhiana, Navi Mumbai, Amritsar, Solapur, Chennai, Vasai-Virar, Coimbatore, Madurai — posed as a question: how should these cities be grouped into clusters?]
Initially, no. of clusters: 21

Cluster 1: Jaipur, Jodhpur

No. of clusters: 20

Cluster 1: Jaipur, Jodhpur, Kota

No. of clusters: 19
Cluster 2: Indore, Jabalpur
Visualizing Cluster Analysis
• Cluster analysis refers to algorithms that
group similar objects into groups
called clusters.
• The endpoint of cluster analysis is a set of
clusters, where each cluster is distinct from
each other cluster, and the objects within each
cluster are broadly similar to each other. 
• Example: values A: 8, V: 5, F: 7, G: 8, H: 9, W: 9, D: 6.98, R: 5
• Gr 1 (values of about 7 and above): A, G, H, W
• Gr 2 (values below 7): V, F, R
Example of Cluster

In this scatterplot, two clusters are shown: one by filled circles and one by unfilled circles.
Types of Clustering

Roughly, clustering can be divided into two subgroups:
• Hard Clustering
• In hard clustering, each data point either belongs to a cluster completely or not.
• Strict partitioning clustering: each object belongs to exactly one cluster.
• Soft Clustering
• Instead of putting each data point into exactly one cluster, a probability of that data point belonging to each cluster is assigned.
• Each object belongs to each cluster to a certain degree.
What Is Good Clustering?
• A good clustering method will produce high-quality clusters with
• High intra-class similarity.
• Low inter-class similarity.
• The quality of a clustering result depends on both the similarity measure used by the method and its implementation.
The required data for cluster analysis
• Typically, cluster analysis is performed on a table of raw data, where each row represents an object and the columns represent quantitative characteristics of the objects.
• These quantitative characteristics are called clustering variables.
Outputs of Cluster analysis
• The main output from cluster analysis is a
table showing the mean values of each cluster
on the clustering variables.
• The second output shows which object has
been classified into which cluster.
Example
• For example, in the table there are
18 objects, and there are two
clustering variables, x and y.
• Cluster analysis can also be performed using data in a distance matrix.
Outputs of Example data
• Object–cluster classification [table on slide]
• Table of means [table on slide]
Measures of distance (similarity)
• The distance between two clusters can be computed based on the length of the straight line drawn from one cluster to another.
• This is commonly referred to as the Euclidean distance.
• Many other distance metrics have been developed.
Distance Or Similarity Measures

• In order to group similar items, we need a way to measure the distance between objects.
• The different measures of distance or similarity are:

I. Euclidean Distance: d(a, b) = sqrt( Σi (ai − bi)² )

II. Manhattan Distance: d(a, b) = Σi |ai − bi|

III. Mahalanobis Distance: d(a, b) = sqrt( (a − b)T S⁻¹ (a − b) ), where S is the covariance matrix.
Euclidean Distance Example
• Find the distance between points P(3, 2) and Q(4, 1).
• d(P, Q) = sqrt( (4 − 3)² + (1 − 2)² ) = sqrt(2) ≈ 1.41
Manhattan Distance Example
Mahalanobis Distance Example
• How far is v = (66, 640, 44) from the mean of the dataset, (68.0, 600.0, 40.0)?
• d = sqrt( (a − b)T S⁻¹ (a − b) )
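The three measures above can be checked with a short sketch (not from the original slides); NumPy and SciPy are assumed to be available, and since the slide does not give the covariance matrix S for the Mahalanobis example, the S below is only a placeholder.

import numpy as np
from scipy.spatial import distance

# Euclidean and Manhattan distances for the worked example P(3, 2), Q(4, 1).
p, q = np.array([3, 2]), np.array([4, 1])
print(distance.euclidean(p, q))   # sqrt((4-3)^2 + (1-2)^2) = sqrt(2)
print(distance.cityblock(p, q))   # |4-3| + |1-2| = 2

# Mahalanobis distance of v = (66, 640, 44) from the mean (68.0, 600.0, 40.0).
v = np.array([66.0, 640.0, 44.0])
mean = np.array([68.0, 600.0, 40.0])
S = np.diag([9.0, 1600.0, 16.0])          # placeholder covariance matrix (not given on the slide)
d = np.sqrt((v - mean) @ np.linalg.inv(S) @ (v - mean))
print(d)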
Distance Or Similarity Matrix

• Based on the distance or similarity measure, we can construct a symmetric matrix of distance or similarity values.
• Let the (i, j) entry in the matrix be the distance or similarity between items i and j.
• Since d(i, j) = d(j, i), the matrix is symmetric, so we only need the lower triangular part of the matrix.
• The diagonal is all 1's (similarity) or all 0's (distance).
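A minimal sketch of building such a symmetric distance matrix with SciPy; the five 2-D objects below are placeholders, not data from the slides.

import numpy as np
from scipy.spatial.distance import pdist, squareform

# Placeholder 2-D objects (not from the slides).
items = np.array([[1, 2], [3, 4], [5, 1], [0, 0], [2, 6]], dtype=float)

# pdist returns only one triangle (condensed form), which is all that is needed
# because d(i, j) = d(j, i); squareform expands it to the full symmetric matrix
# whose diagonal is all 0's for a distance measure.
condensed = pdist(items, metric="euclidean")
D = squareform(condensed)
print(np.round(D, 2))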
Major Clustering Approaches

• Partitioning approach:
– Construct various partitions and then evaluate them by some criterion,
e.g., minimizing the sum of square errors
– Typical methods: k-means, k-medoids, CLARANS
• Hierarchical approach:
– Create a hierarchical decomposition of the set of data (or objects) using
some criterion
– Typical methods: Diana, Agnes, BIRCH, CAMELEON
• Density-based approach:
– Based on connectivity and density functions
– Typical methods: DBSCAN, OPTICS, DenClue
• Grid-based approach:
– based on a multiple-level granularity structure
– Typical methods: STING, WaveCluster, CLIQUE
Partitioning Algorithms: Basic Concept

• Partitioning method: partitioning a database D of n objects into a set of k clusters, such that the sum of squared distances is minimized (where ci is the centroid or medoid of cluster Ci):

E = Σ (i = 1..k) Σ (p ∈ Ci) d(p, ci)²

• Given k, find a partition of k clusters that optimizes the chosen partitioning criterion
– Global optimum: exhaustively enumerate all partitions
– Heuristic methods: k-means and k-medoids algorithms
– k-means (MacQueen'67, Lloyd'57/'82): each cluster is represented by the center of the cluster
– k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw'87): each cluster is represented by one of the objects in the cluster
The K-Means Clustering Method

• Given k, the k-means algorithm is implemented in four steps:
– Partition objects into k nonempty subsets
– Compute seed points as the centroids of the clusters of the current partitioning (the centroid is the center, i.e., mean point, of the cluster)
– Assign each object to the cluster with the nearest seed point
– Go back to step 2; stop when the assignment does not change
An Example of K-Means Clustering

K = 2

[Slide figure: the initial data set is arbitrarily partitioned into k groups, the cluster centroids are updated, and objects are reassigned to their nearest centroid, looping if needed.]

 Partition objects into k nonempty subsets
 Repeat
 Compute the centroid (i.e., mean point) for each partition
 Assign each object to the cluster of its nearest centroid
 Until no change
Comments on the K-Means Method

• Strength: efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t << n.
• Comparing: PAM: O(k(n−k)²), CLARA: O(ks² + k(n−k))
• Comment: often terminates at a local optimum.
• Weakness
– Applicable only to objects in a continuous n-dimensional space
• Using the k-modes method for categorical data
• In comparison, k-medoids can be applied to a wide range of data
– Need to specify k, the number of clusters, in advance (there are ways to automatically determine the best k; see Hastie et al., 2009)
– Sensitive to noisy data and outliers
– Not suitable for discovering clusters with non-convex shapes
The K-Medoid Clustering Method

• K-Medoids Clustering: find representative objects (medoids) in clusters
– PAM (Partitioning Around Medoids, Kaufmann & Rousseeuw 1987)
• Starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if it improves the total distance of the resulting clustering
• PAM works effectively for small data sets, but does not scale well to large data sets (due to the computational complexity)
• Efficiency improvements on PAM
– CLARA (Kaufmann & Rousseeuw, 1990): PAM on samples
– CLARANS (Ng & Han, 1994): randomized re-sampling
K-Means Clustering Algorithm
• The K-Means Clustering Algorithm involves the following steps (a minimal Python sketch of these steps appears after the list):
1. Choose the number of clusters K.
2. Randomly select any K data points as cluster centers. Select cluster centers in such a way that they are as far apart from each other as possible.
3. Calculate the distance between each data point and each cluster center. The distance may be calculated either by using a given distance function or by using the Euclidean distance formula.
4. Assign each data point to some cluster. A data point is assigned to the cluster whose center is nearest to that data point.
5. Re-compute the centers of the newly formed clusters. The center of a cluster is computed by taking the mean of all the data points contained in that cluster.
6. Keep repeating steps 3 to 5 until any of the following stopping criteria is met:
• Centers of newly formed clusters do not change
• Data points remain in the same cluster
• Maximum number of iterations is reached
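As referenced in the list above, here is a minimal NumPy sketch of steps 1–6, assuming Euclidean distance and purely random initialization (it does not try to spread the initial centers apart); the function name and the toy data are illustrative only.

import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: randomly pick K data points as the initial cluster centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Steps 3-4: assign each point to the nearest center (Euclidean distance).
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Step 5: re-compute each center as the mean of its assigned points
        # (keep the old center if a cluster happens to be empty).
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        # Step 6: stop when the centers no longer change.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Illustrative toy data (not from the slides).
X = np.array([[1, 1], [1.5, 2], [8, 8], [9, 9], [1, 0.6], [9, 11]])
labels, centers = k_means(X, k=2)
print(labels)
print(centers)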
Exercise
• Cluster the following 8 points (with (x, y) coordinates) into K = 3 clusters: A1(2, 10), A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A7(1, 2), A8(4, 9).
• The initial cluster centers are C1 = A1(2, 10), C2 = A4(5, 8) and C3 = A7(1, 2).
• Use the Manhattan distance |x2 – x1| + |y2 – y1| as the distance function.
Solution
Iteration-01:
Calculating Distance Between A1(2, 10) and C1(2, 10)-
 Ρ(A1, C1)
= |x2 – x1| + |y2 – y1|
= |2 – 2| + |10 – 10|
=0
 Calculating Distance Between A1(2, 10) and C2(5, 8)-
 Ρ(A1, C2)
= |x2 – x1| + |y2 – y1|
= |5 – 2| + |8 – 10|
=3+2
=5
 Calculating Distance Between A1(2, 10) and C3(1, 2)-
 Ρ(A1, C3)
= |x2 – x1| + |y2 – y1|
= |1 – 2| + |2 – 10|
=1+8
=9
 
From here, New clusters are-
 
Cluster-01: First cluster contains points-
• A1(2, 10)
 
Cluster-02: Second cluster contains points-
• A3(8, 4)
• A4(5, 8)
• A5(7, 5)
• A6(6, 4)
• A8(4, 9)
 
Cluster-03: Third cluster contains points-
• A2(2, 5)
• A7(1, 2)
Now, re-compute the new cluster centers.
• The new cluster center is computed by taking the mean of all the points contained in that cluster.
For Cluster-01:
 
• We have only one point A1(2, 10) in Cluster-01.
• So, cluster center remains the same.
 
For Cluster-02:
 
Center of Cluster-02
= ((8 + 5 + 7 + 6 + 4)/5, (4 + 8 + 5 + 4 + 9)/5)
= (6, 6)
 
For Cluster-03:
 
Center of Cluster-03
= ((2 + 1)/2, (5 + 2)/2)
= (1.5, 3.5)
 
This completes Iteration-01.
Iteration-02:
 
The following shows the calculation of the distance between point A1(2, 10) and each of the centers of the three clusters:
 
Calculating Distance Between A1(2, 10) and C1(2, 10)-
Ρ(A1, C1)
= |x2 – x1| + |y2 – y1|= |2 – 2| + |10 – 10|= 0
 
Calculating Distance Between A1(2, 10) and C2(6, 6)-
 Ρ(A1, C2)
= |x2 – x1| + |y2 – y1|= |6 – 2| + |6 – 10|= 4 + 4= 8
 
Calculating Distance Between A1(2, 10) and C3(1.5, 3.5)-
 
Ρ(A1, C3)
= |x2 – x1| + |y2 – y1|= |1.5 – 2| + |3.5 – 10|= 0.5 + 6.5= 7
 
In a similar manner, we calculate the distances of the other points from each of the centers of the three clusters.
From here, New clusters are-
 
Cluster-01:
 First cluster contains points-
• A1(2, 10)
• A8(4, 9)
 
Cluster-02:
 Second cluster contains points-
• A3(8, 4)
• A4(5, 8)
• A5(7, 5)
• A6(6, 4)
 
Cluster-03:
 Third cluster contains points-
• A2(2, 5)
• A7(1, 2)
 
Now, re-compute the new cluster centers.
• The new cluster center is computed by taking the mean of all the points contained in that cluster.
For Cluster-01:
 
Center of Cluster-01
= ((2 + 4)/2, (10 + 9)/2)= (3, 9.5)
 
For Cluster-02:
 
Center of Cluster-02
= ((8 + 5 + 7 + 6)/4, (4 + 8 + 5 + 4)/4)= (6.5, 5.25)
 
For Cluster-03:
 
Center of Cluster-03
= ((2 + 1)/2, (5 + 2)/2)= (1.5, 3.5)
 
This completes Iteration-02.
 
After the second iteration, the centers of the three clusters are:
• C1(3, 9.5)
• C2(6.5, 5.25)
• C3(1.5, 3.5)
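The two iterations above can be checked with a short NumPy sketch that uses the same eight points, the same initial centers, and the Manhattan distance (no cluster becomes empty in this example, so each mean is well defined).

import numpy as np

# Points A1..A8 and initial centers C1, C2, C3 from the exercise above.
points = np.array([(2, 10), (2, 5), (8, 4), (5, 8),
                   (7, 5), (6, 4), (1, 2), (4, 9)], dtype=float)
centers = np.array([(2, 10), (5, 8), (1, 2)], dtype=float)

for it in range(1, 3):
    # Manhattan distance from every point to every current center.
    dist = np.abs(points[:, None, :] - centers[None, :, :]).sum(axis=2)
    labels = dist.argmin(axis=1)          # nearest-center assignment
    # Re-compute each center as the mean of its assigned points.
    centers = np.array([points[labels == k].mean(axis=0) for k in range(3)])
    print(f"Iteration {it}: centers = {centers.tolist()}")

# Iteration 2 should give C1(3, 9.5), C2(6.5, 5.25), C3(1.5, 3.5).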
Types of Clustering Algorithms
Clustering
• Hierarchical
– Agglomerative
– Divisive
• Non-hierarchical
– Partitioning
– Density-based
– Model-based
Hierarchical Clustering

• In hierarchical clustering the goal is to produce a hierarchical series of nested clusters, ranging from clusters of individual points at the bottom to an all-inclusive cluster at the top.
• A diagram called a dendrogram graphically represents this hierarchy.
• One of the attractions of hierarchical techniques is that they correspond to taxonomies that are very common in the biological sciences, e.g., kingdom, phylum, genus, species.
Dendrogram

• A dendrogram is a diagram representing a tree which shows the hierarchical relationship between the clusters.
• In hierarchical clustering, it illustrates the arrangement of the clusters produced by the corresponding analysis.
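A hedged sketch of producing a dendrogram with SciPy's hierarchical-clustering utilities; the six 2-D observations and labels below are placeholders, not data from the slides.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Placeholder observations (not from the slides).
X = np.array([[1, 1], [2, 1], [8, 8], [9, 8], [5, 5], [5.5, 5.5]])

# Build the hierarchy of nested clusters and draw the tree.
Z = linkage(X, method="single", metric="euclidean")
dendrogram(Z, labels=["A", "B", "C", "D", "E", "F"])
plt.title("Dendrogram (single linkage)")
plt.show()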
Exercise

Design the Dendrogram for the following diagram


Example
Types of Hierarchical Clustering
• A hierarchical method can be classified as
being either
• Agglomerative
• Divisive

Types of Hierarchical Clustering
• Two main types of hierarchical clustering
– Agglomerative:
• Start with the points as individual clusters
• At each step, merge the closest pair of clusters until
only one cluster (or k clusters) left
– Divisive:
• Start with one, all-inclusive cluster
• At each step, split a cluster until each cluster contains a
point (or there are k clusters)
Agglomerative versus divisive algorithms

• Hierarchical clustering typically works by sequentially merging similar clusters, as shown above. This is known as agglomerative hierarchical clustering.

• In theory, it can also be done by initially grouping all the observations into one cluster, and then successively splitting these clusters. This is known as divisive hierarchical clustering. Divisive clustering is rarely done in practice.
Agglomerative Hierarchical Clustering
• Start with the points as individual clusters and, at each step, merge the closest pair of clusters.
• Agglomerative hierarchical clustering is further classified into:

1. Linkage Methods
• Single Linkage
• Complete Linkage
• Average Linkage
2. Variance Method or Ward's Method
Single Linkage
• For single linkage, the distance between two clusters is the minimum distance between any two points in the different clusters,
i.e. d(Ci, Cj) = min { d(x, y) : x ∈ Ci, y ∈ Cj }
Single Linkage Example
Single Linkage Example
• Let's take 6 simple vectors.
Step 1: compute distance between points

• Using the Euclidean distance, let's compute the distance matrix.
Euclidean Distance = sqrt( (x2 − x1)² + (y2 − y1)² )
• Example: distance between A and B
= sqrt( (18 − 22)² + (0 − 0)² )
= sqrt( 16 + 0 )
= sqrt(16) = 4
Distance Matrix
Single Linkage Detailed Example
• Single Link Clustering: takes the minimum of the two distances. Leads to larger, more diverse clusters.
• Distance Matrix: Diagonals will be 0 and
values will be symmetric.
Single Linkage Detailed Example
• Step a: The shortest distance in the matrix is 1 and the
vectors associated with that are C & D
• So the first cluster is C — D
• Distance between other vectors and CD
• A to CD = min(A->C, A->D)
=min(25,24)=24
• B to CD = min(B->C, B->D)
=min(21,20) = 20
and similarly find for E & F
Single Linkage Detailed Example
• Step b : Now 2 is the shortest distance and the
vectors associated with that are E & F
• Second cluster is E — F
• A to EF = min(A->E, A->F) = min(9,7) = 7
• CD to EF = min(CD->E, CD->F) = min(15,17)
= 15
Single Linkage Detailed Example
• Step c : Next shortest is 3, and the associated
vectors are B & EF
• Third cluster is B — EF
• A to BEF = min(A->B, A->EF) = min(4,7) = 4
CD to BEF = min(CD->B, CD->EF) =
min(20,15) = 15
Single Linkage Detailed Example
• Step d : Next shortest is 4, and the associated
vectors are A& BEF
• Fourth cluster is A — BEF
• CD to ABEF = min(CD->A, CD->BEF) =
min(24,15) = 15
Single Linkage Detailed Example
• Step e : Last cluster is CD — ABEF
• The Dendrogram for the single link cluster is
as follows:
Complete Linkage
• For complete linkage, the distance between two clusters is the maximum distance between any two points in the different clusters,
i.e. d(Ci, Cj) = max { d(x, y) : x ∈ Ci, y ∈ Cj }
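Both linkage definitions reduce to taking the minimum (single link) or the maximum (complete link) pairwise distance across two clusters. A minimal sketch, using the A/B-to-C/D distances quoted in the single-linkage walkthrough above:

def single_link(D, ci, cj):
    # Single linkage: minimum distance over all cross-cluster pairs.
    return min(D[a][b] for a in ci for b in cj)

def complete_link(D, ci, cj):
    # Complete linkage: maximum distance over all cross-cluster pairs.
    return max(D[a][b] for a in ci for b in cj)

# Pairwise distances quoted in the walkthrough (A->C=25, A->D=24, B->C=21, B->D=20).
D = {"A": {"C": 25, "D": 24}, "B": {"C": 21, "D": 20}}
print(single_link(D, ["A"], ["C", "D"]))    # min(25, 24) = 24
print(complete_link(D, ["A"], ["C", "D"]))  # max(25, 24) = 25
print(single_link(D, ["B"], ["C", "D"]))    # min(21, 20) = 20
print(complete_link(D, ["B"], ["C", "D"]))  # max(21, 20) = 21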
Example of Complete Linkage
Distance Matrix
• Consider the following distance matrix:
Complete Linkage Detailed Example
• Complete Link Clustering: considers the maximum of all distances. Leads to many small clusters.
• Distance Matrix: Diagonals will be 0 and
values will be symmetric.
Complete Linkage Detailed Example
• Step a: The shortest distance in the matrix is 1 and
the vectors associated with that are C & D
• So the first cluster is C — D
• Distance between other vectors and CD
• A to CD = max(A->C, A->D) = max(25,24) = 25
B to CD = max(B->C, B->D) = max(21,20) = 21
• and similarly find for E -> CD & F -> CD
Complete Linkage Detailed Example
• Step b : Now 2 is the shortest distance and the
vectors associated with that are E & F
• Second cluster is E — F
• A to EF = max(A->E, A->F) = max(9,7) = 9
CD to EF = max(CD->E, CD->F) =
max(15,17) = 17
Complete Linkage Detailed Example
• Step c : Now 4 is the shortest distance and vectors
associated are A & B
• Third cluster is A — B
• CD to AB = max(CD -> A, CD ->B) = max(25,21) =
25
EF to AB = max(EF -> A, EF ->B) = max(9,5) = 9
Complete Linkage Detailed Example
• Step d : Now 9 is the shortest distance and
vectors associated are AB and EF
• Fourth cluster is AB — EF
• CD to ABEF = max(CD->AB, CD->EF) = max(25, 17) = 25
Complete Linkage Detailed Example

• Step e : Last cluster is CD — ABEF
• Let's look at the dendrogram for the complete link cluster.
Exercise
• Consider your mobile number as raw data; make the data pairs from left to right.
• For example: for the mobile number 9415736072,
• data points set DP = {(9,4), (1,5), (7,3), (6,0), and (7,2)}
• Case 1: if you are not able to get five different data pairs from your mobile number, then add one to both numbers in the second occurrence of the similar data pair and create a new data pair. For example: mobile number 9415736073
• so the original data points set DP = {(9,4), (1,5), (7,3), (6,0), and (7,3)}
• modified data points set DP = {(9,4), (1,5), (7,3), (6,0), and (8,4)}
• Case 2: you can repeat this process if you get three identical data pairs.
• For example: mobile number 9415737373
• modified data points set DP = {(9,4), (1,5), (7,3), (8,4), and (9,5)}
• During distance computation, ignore the decimal part. For example: if the computed distance is 6.4, then consider only 6.
• Now design the dendrogram for DP using the following methods (see the sketch after this list):
– Single linkage method
– Complete linkage method
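A hedged SciPy sketch for the example number 9415736072 (DP as given above); the "ignore the decimal part" rule is approximated by flooring the pairwise Euclidean distances before running the linkage.

import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, dendrogram

# Data pairs for the example mobile number 9415736072.
DP = np.array([(9, 4), (1, 5), (7, 3), (6, 0), (7, 2)], dtype=float)

# Euclidean distances with the decimal part dropped, as the exercise asks.
d = np.floor(pdist(DP, metric="euclidean"))

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, method in zip(axes, ["single", "complete"]):
    dendrogram(linkage(d, method=method), ax=ax,
               labels=["(9,4)", "(1,5)", "(7,3)", "(6,0)", "(7,2)"])
    ax.set_title(f"{method} linkage")
plt.tight_layout()
plt.show()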
Hierarchical Clustering: Comparison
Density-Based Clustering Methods
• Clustering based on density (local cluster criterion), such as
density-connected points
• Major features:
– Discover clusters of arbitrary shape
– Handle noise
– One scan
– Need density parameters as termination condition
• Several interesting studies:
– DBSCAN: Ester, et al. (KDD’96)
– OPTICS: Ankerst, et al (SIGMOD’99).
– DENCLUE: Hinneburg & D. Keim (KDD’98)
– CLIQUE: Agrawal, et al. (SIGMOD’98) (more grid-based)

Density-Based Clustering: Basic Concepts
• Two parameters:
– Eps: maximum radius of the neighbourhood
– MinPts: minimum number of points in an Eps-neighbourhood of that point
• NEps(p): {q belongs to D | dist(p, q) ≤ Eps}
• Directly density-reachable: A point p is directly density-reachable from a point q w.r.t. Eps, MinPts if
– p belongs to NEps(q)
– core point condition: |NEps(q)| ≥ MinPts
[Slide figure: points p and q with MinPts = 5 and Eps = 1 cm]
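These two definitions translate directly into a small helper; a minimal sketch with placeholder points (the Eps and MinPts values are illustrative, not tuned parameters).

import numpy as np

# Placeholder 2-D data set D (not from the slides).
points = np.array([[0, 0], [0.5, 0], [0.3, 0.4], [0.2, -0.3], [5, 5], [0.1, 0.2]])

def eps_neighborhood(i, points, eps):
    # N_Eps(p): indices of all q in D with dist(p, q) <= Eps (p itself included).
    d = np.linalg.norm(points - points[i], axis=1)
    return np.flatnonzero(d <= eps)

def is_core(i, points, eps, min_pts):
    # Core point condition: |N_Eps(p)| >= MinPts.
    return len(eps_neighborhood(i, points, eps)) >= min_pts

# Example check for point 0 with illustrative Eps = 1 and MinPts = 5.
print(eps_neighborhood(0, points, eps=1.0))
print(is_core(0, points, eps=1.0, min_pts=5))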
Density-Reachable and Density-Connected

• Density-reachable:
– A point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points p1, …, pn, p1 = q, pn = p such that pi+1 is directly density-reachable from pi
• Density-connected:
– A point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts
[Slide figures: a chain q = p1 → … → p illustrating density-reachability, and points p, q both reachable from o illustrating density-connectivity]
Core, Border and Outlier
Example
DBSCAN: The Algorithm

• Arbitrarily select a point p
• Retrieve all points density-reachable from p w.r.t. Eps and MinPts
• If p is a core point, a cluster is formed
• If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database
• Continue the process until all of the points have been processed
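scikit-learn's DBSCAN implements this procedure; a brief usage sketch with placeholder data and illustrative parameter values (eps corresponds to Eps and min_samples to MinPts).

import numpy as np
from sklearn.cluster import DBSCAN

# Placeholder data: two dense blobs plus one isolated point (likely noise).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, size=(20, 2)),
               rng.normal(5, 0.3, size=(20, 2)),
               [[10.0, 10.0]]])

db = DBSCAN(eps=1.0, min_samples=5).fit(X)
print(db.labels_)          # cluster index per point; -1 marks noise/outliers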
DBSCAN: Sensitive to Parameters
