Clustering


Clustering

[Slide figure: a scatter of Indian city names — Jaipur, Indore, Aurangabad, Jabalpur, Bangalore, Jodhpur, Bhopal, Kalyan-Dombivli, Gwalior, Kota, Hubli–Dharwad, Mysore, Ludhiana, Navi Mumbai, Amritsar, Solapur, Chennai, Vasai-Virar, Coimbatore, Madurai — posed as a question: how should these cities be grouped into clusters?]
Initially, no. of clusters: 21

Cluster 1: Jaipur, Jodhpur

No. of clusters: 20

Cluster 1: Jaipur, Jodhpur, Kota

No. of clusters: 19
Cluster 2: Indore, Jabalpur
Visualizing Cluster Analysis
• Cluster analysis refers to algorithms that
group similar objects into groups
called clusters.
• The endpoint of cluster analysis is a set of
clusters, where each cluster is distinct from
each other cluster, and the objects within each
cluster are broadly similar to each other. 
• Example: values A: 8, V: 5, F: 7, G: 8, H: 9, W: 9, D: 6.98, R: 5
• Gr 1 (values of about 7 and above): A, G, H, W
• Gr 2 (values below 7): V, F, R
Example of Cluster

In this scatterplot, two clusters are shown: one by filled circles and one by unfilled circles.
Types of Clustering

Roughly, clustering can be divided into two subgroups:
• Hard Clustering
• In hard clustering, each data point either belongs to a cluster completely or not.
• Strict partitioning clustering: each object belongs to exactly one cluster.
• Soft Clustering
• Instead of putting each data point into exactly one cluster, a probability of that data point belonging to each cluster is assigned.
• Each object belongs to each cluster to a certain degree.
What Is Good Clustering?
• A good clustering method will produce high-quality clusters with
• High intra-class similarity.
• Low inter-class similarity.
• The quality of a clustering result depends on both the similarity measure used by the method and its implementation.
The required data for cluster analysis
• Typically, cluster analysis is performed on a table of raw data, where each row represents an object and the columns represent quantitative characteristics of the objects.
• These quantitative characteristics are called clustering variables.
Outputs of Cluster analysis
• The main output from cluster analysis is a
table showing the mean values of each cluster
on the clustering variables.
• The second output shows which object has
been classified into which cluster.
Example
• For example, in the table there are
18 objects, and there are two
clustering variables, x and y.
• Cluster analysis can also be performed using data in a distance matrix.
Outputs of Example data
• Object–cluster classification [table on slide]
• Table of means [table on slide]
Measures of distance (similarity)
• The distance between two clusters can be computed based on the length of the straight line drawn from one cluster to another.
• This is commonly referred to as the Euclidean distance.
• Many other distance metrics have been developed.
Distance Or Similarity Measures

• In order to group similar items, we need a way to measure the distance between objects.
• The different measures of distance or similarity are:

I. Euclidean Distance: d(a, b) = sqrt( Σi (ai − bi)² )

II. Manhattan Distance: d(a, b) = Σi |ai − bi|

III. Mahalanobis Distance: d(a, b) = sqrt( (a − b)T S⁻¹ (a − b) ), where S is the covariance matrix.
Euclidean Distance Example
• Find the distance between points P(3, 2) and Q(4, 1).
• d(P, Q) = sqrt( (4 − 3)² + (1 − 2)² ) = sqrt(2) ≈ 1.41
Manhattan Distance Example
Mahalanobis Distance Example
• How far is v = (66, 640, 44) from the mean of the dataset, (68.0, 600.0, 40.0)?
• d = sqrt( (a − b)T S⁻¹ (a − b) )
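The three measures above can be checked with a short sketch (not from the original slides); NumPy and SciPy are assumed to be available, and since the slide does not give the covariance matrix S for the Mahalanobis example, the S below is only a placeholder.

import numpy as np
from scipy.spatial import distance

# Euclidean and Manhattan distances for the worked example P(3, 2), Q(4, 1).
p, q = np.array([3, 2]), np.array([4, 1])
print(distance.euclidean(p, q))   # sqrt((4-3)^2 + (1-2)^2) = sqrt(2)
print(distance.cityblock(p, q))   # |4-3| + |1-2| = 2

# Mahalanobis distance of v = (66, 640, 44) from the mean (68.0, 600.0, 40.0).
v = np.array([66.0, 640.0, 44.0])
mean = np.array([68.0, 600.0, 40.0])
S = np.diag([9.0, 1600.0, 16.0])          # placeholder covariance matrix (not given on the slide)
d = np.sqrt((v - mean) @ np.linalg.inv(S) @ (v - mean))
print(d)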
Distance Or Similarity Matrix

• Based on the distance or similarity measure, we can construct a symmetric matrix of distance or similarity values.
• Let the (i, j) entry in the matrix be the distance or similarity between items i and j.
• Since d(i, j) = d(j, i), the matrix is symmetric, so we only need the lower triangular part of the matrix.
• The diagonal is all 1's (similarity) or all 0's (distance).
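A minimal sketch of building such a symmetric distance matrix with SciPy; the five 2-D objects below are placeholders, not data from the slides.

import numpy as np
from scipy.spatial.distance import pdist, squareform

# Placeholder 2-D objects (not from the slides).
items = np.array([[1, 2], [3, 4], [5, 1], [0, 0], [2, 6]], dtype=float)

# pdist returns only one triangle (condensed form), which is all that is needed
# because d(i, j) = d(j, i); squareform expands it to the full symmetric matrix
# whose diagonal is all 0's for a distance measure.
condensed = pdist(items, metric="euclidean")
D = squareform(condensed)
print(np.round(D, 2))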
Major Clustering Approaches

• Partitioning approach:
– Construct various partitions and then evaluate them by some criterion,
e.g., minimizing the sum of square errors
– Typical methods: k-means, k-medoids, CLARANS
• Hierarchical approach:
– Create a hierarchical decomposition of the set of data (or objects) using
some criterion
– Typical methods: Diana, Agnes, BIRCH, CAMELEON
• Density-based approach:
– Based on connectivity and density functions
– Typical methods: DBSCAN, OPTICS, DenClue
• Grid-based approach:
– based on a multiple-level granularity structure
– Typical methods: STING, WaveCluster, CLIQUE
Partitioning Algorithms: Basic Concept

• Partitioning method: partitioning a database D of n objects into a set of k clusters, such that the sum of squared distances is minimized (where ci is the centroid or medoid of cluster Ci):

E = Σ (i = 1..k) Σ (p ∈ Ci) d(p, ci)²

• Given k, find a partition of k clusters that optimizes the chosen partitioning criterion
– Global optimum: exhaustively enumerate all partitions
– Heuristic methods: k-means and k-medoids algorithms
– k-means (MacQueen'67, Lloyd'57/'82): each cluster is represented by the center of the cluster
– k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw'87): each cluster is represented by one of the objects in the cluster
The K-Means Clustering Method

• Given k, the k-means algorithm is implemented in four steps:
– Partition objects into k nonempty subsets
– Compute seed points as the centroids of the clusters of the current partitioning (the centroid is the center, i.e., mean point, of the cluster)
– Assign each object to the cluster with the nearest seed point
– Go back to step 2; stop when the assignment does not change
An Example of K-Means Clustering

K = 2

[Slide figure: the initial data set is arbitrarily partitioned into k groups, the cluster centroids are updated, and objects are reassigned to their nearest centroid, looping if needed.]

 Partition objects into k nonempty subsets
 Repeat
 Compute the centroid (i.e., mean point) for each partition
 Assign each object to the cluster of its nearest centroid
 Until no change
Comments on the K-Means Method

• Strength: efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t << n.
• Comparing: PAM: O(k(n−k)²), CLARA: O(ks² + k(n−k))
• Comment: often terminates at a local optimum.
• Weakness
– Applicable only to objects in a continuous n-dimensional space
• Using the k-modes method for categorical data
• In comparison, k-medoids can be applied to a wide range of data
– Need to specify k, the number of clusters, in advance (there are ways to automatically determine the best k; see Hastie et al., 2009)
– Sensitive to noisy data and outliers
– Not suitable for discovering clusters with non-convex shapes
The K-Medoid Clustering Method

• K-Medoids Clustering: find representative objects (medoids) in clusters
– PAM (Partitioning Around Medoids, Kaufmann & Rousseeuw 1987)
• Starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if it improves the total distance of the resulting clustering
• PAM works effectively for small data sets, but does not scale well to large data sets (due to the computational complexity)
• Efficiency improvements on PAM
– CLARA (Kaufmann & Rousseeuw, 1990): PAM on samples
– CLARANS (Ng & Han, 1994): randomized re-sampling
K-Means Clustering Algorithm
• The K-Means Clustering Algorithm involves the following steps (a minimal Python sketch of these steps appears after the list):
1. Choose the number of clusters K.
2. Randomly select any K data points as cluster centers. Select cluster centers in such a way that they are as far apart from each other as possible.
3. Calculate the distance between each data point and each cluster center. The distance may be calculated either by using a given distance function or by using the Euclidean distance formula.
4. Assign each data point to some cluster. A data point is assigned to the cluster whose center is nearest to that data point.
5. Re-compute the centers of the newly formed clusters. The center of a cluster is computed by taking the mean of all the data points contained in that cluster.
6. Keep repeating steps 3 to 5 until any of the following stopping criteria is met:
• Centers of newly formed clusters do not change
• Data points remain in the same cluster
• Maximum number of iterations is reached
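As referenced in the list above, here is a minimal NumPy sketch of steps 1–6, assuming Euclidean distance and purely random initialization (it does not try to spread the initial centers apart); the function name and the toy data are illustrative only.

import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: randomly pick K data points as the initial cluster centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Steps 3-4: assign each point to the nearest center (Euclidean distance).
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Step 5: re-compute each center as the mean of its assigned points
        # (keep the old center if a cluster happens to be empty).
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        # Step 6: stop when the centers no longer change.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Illustrative toy data (not from the slides).
X = np.array([[1, 1], [1.5, 2], [8, 8], [9, 9], [1, 0.6], [9, 11]])
labels, centers = k_means(X, k=2)
print(labels)
print(centers)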
Exercise
• Cluster the following 8 points (with (x, y) coordinates) into K = 3 clusters: A1(2, 10), A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A7(1, 2), A8(4, 9).
• The initial cluster centers are C1 = A1(2, 10), C2 = A4(5, 8) and C3 = A7(1, 2).
• Use the Manhattan distance |x2 – x1| + |y2 – y1| as the distance function.
Solution
Iteration-01:
Calculating Distance Between A1(2, 10) and C1(2, 10)-
 Ρ(A1, C1)
= |x2 – x1| + |y2 – y1|
= |2 – 2| + |10 – 10|
=0
 Calculating Distance Between A1(2, 10) and C2(5, 8)-
 Ρ(A1, C2)
= |x2 – x1| + |y2 – y1|
= |5 – 2| + |8 – 10|
=3+2
=5
 Calculating Distance Between A1(2, 10) and C3(1, 2)-
 Ρ(A1, C3)
= |x2 – x1| + |y2 – y1|
= |1 – 2| + |2 – 10|
=1+8
=9
 
From here, New clusters are-
 
Cluster-01: First cluster contains points-
• A1(2, 10)
 
Cluster-02: Second cluster contains points-
• A3(8, 4)
• A4(5, 8)
• A5(7, 5)
• A6(6, 4)
• A8(4, 9)
 
Cluster-03: Third cluster contains points-
• A2(2, 5)
• A7(1, 2)
Now, re-compute the new cluster centers.
• The new cluster center is computed by taking the mean of all the points contained in that cluster.
For Cluster-01:
 
• We have only one point A1(2, 10) in Cluster-01.
• So, cluster center remains the same.
 
For Cluster-02:
 
Center of Cluster-02
= ((8 + 5 + 7 + 6 + 4)/5, (4 + 8 + 5 + 4 + 9)/5)
= (6, 6)
 
For Cluster-03:
 
Center of Cluster-03
= ((2 + 1)/2, (5 + 2)/2)
= (1.5, 3.5)
 
This completes Iteration-01.
Iteration-02:
 
The following shows the calculation of the distance between point A1(2, 10) and each of the centers of the three clusters:
 
Calculating Distance Between A1(2, 10) and C1(2, 10)-
Ρ(A1, C1)
= |x2 – x1| + |y2 – y1|= |2 – 2| + |10 – 10|= 0
 
Calculating Distance Between A1(2, 10) and C2(6, 6)-
 Ρ(A1, C2)
= |x2 – x1| + |y2 – y1|= |6 – 2| + |6 – 10|= 4 + 4= 8
 
Calculating Distance Between A1(2, 10) and C3(1.5, 3.5)-
 
Ρ(A1, C3)
= |x2 – x1| + |y2 – y1|= |1.5 – 2| + |3.5 – 10|= 0.5 + 6.5= 7
 
In a similar manner, we calculate the distances of the other points from each of the centers of the three clusters.
From here, New clusters are-
 
Cluster-01:
 First cluster contains points-
• A1(2, 10)
• A8(4, 9)
 
Cluster-02:
 Second cluster contains points-
• A3(8, 4)
• A4(5, 8)
• A5(7, 5)
• A6(6, 4)
 
Cluster-03:
 Third cluster contains points-
• A2(2, 5)
• A7(1, 2)
 
Now, re-compute the new cluster centers.
• The new cluster center is computed by taking the mean of all the points contained in that cluster.
For Cluster-01:
 
Center of Cluster-01
= ((2 + 4)/2, (10 + 9)/2)= (3, 9.5)
 
For Cluster-02:
 
Center of Cluster-02
= ((8 + 5 + 7 + 6)/4, (4 + 8 + 5 + 4)/4)= (6.5, 5.25)
 
For Cluster-03:
 
Center of Cluster-03
= ((2 + 1)/2, (5 + 2)/2)= (1.5, 3.5)
 
This completes Iteration-02.
 
After the second iteration, the centers of the three clusters are:
• C1(3, 9.5)
• C2(6.5, 5.25)
• C3(1.5, 3.5)
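The two iterations above can be checked with a short NumPy sketch that uses the same eight points, the same initial centers, and the Manhattan distance (no cluster becomes empty in this example, so each mean is well defined).

import numpy as np

# Points A1..A8 and initial centers C1, C2, C3 from the exercise above.
points = np.array([(2, 10), (2, 5), (8, 4), (5, 8),
                   (7, 5), (6, 4), (1, 2), (4, 9)], dtype=float)
centers = np.array([(2, 10), (5, 8), (1, 2)], dtype=float)

for it in range(1, 3):
    # Manhattan distance from every point to every current center.
    dist = np.abs(points[:, None, :] - centers[None, :, :]).sum(axis=2)
    labels = dist.argmin(axis=1)          # nearest-center assignment
    # Re-compute each center as the mean of its assigned points.
    centers = np.array([points[labels == k].mean(axis=0) for k in range(3)])
    print(f"Iteration {it}: centers = {centers.tolist()}")

# Iteration 2 should give C1(3, 9.5), C2(6.5, 5.25), C3(1.5, 3.5).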
Types of Clustering Algorithms
Clustering
• Hierarchical
– Agglomerative
– Divisive
• Non-hierarchical
– Partitioning
– Density-based
– Model-based
Hierarchical Clustering

• In hierarchical clustering the goal is to produce a hierarchical series of nested clusters, ranging from clusters of individual points at the bottom to an all-inclusive cluster at the top.
• A diagram called a dendrogram graphically represents this hierarchy.
• One of the attractions of hierarchical techniques is that they correspond to taxonomies that are very common in the biological sciences, e.g., kingdom, phylum, genus, species.
Dendrogram

• A dendrogram is a diagram representing a tree which shows the hierarchical relationship between the clusters.
• In hierarchical clustering, it illustrates the arrangement of the clusters produced by the corresponding analysis.
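A hedged sketch of producing a dendrogram with SciPy's hierarchical-clustering utilities; the six 2-D observations and labels below are placeholders, not data from the slides.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Placeholder observations (not from the slides).
X = np.array([[1, 1], [2, 1], [8, 8], [9, 8], [5, 5], [5.5, 5.5]])

# Build the hierarchy of nested clusters and draw the tree.
Z = linkage(X, method="single", metric="euclidean")
dendrogram(Z, labels=["A", "B", "C", "D", "E", "F"])
plt.title("Dendrogram (single linkage)")
plt.show()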
Exercise

Design the Dendrogram for the following diagram


Example
Types of Hierarchical Clustering
• A hierarchical method can be classified as
being either
• Agglomerative
• Divisive

Types of Hierarchical Clustering
• Two main types of hierarchical clustering
– Agglomerative:
• Start with the points as individual clusters
• At each step, merge the closest pair of clusters until
only one cluster (or k clusters) left
– Divisive:
• Start with one, all-inclusive cluster
• At each step, split a cluster until each cluster contains a
point (or there are k clusters)
Agglomerative versus divisive algorithms

• Hierarchical clustering typically works by sequentially merging similar clusters, as shown above. This is known as agglomerative hierarchical clustering.

• In theory, it can also be done by initially grouping all the observations into one cluster, and then successively splitting these clusters. This is known as divisive hierarchical clustering. Divisive clustering is rarely done in practice.
Agglomerative Hierarchical Clustering
• Start with the points as individual clusters and, at each step, merge the closest pair of clusters.
• Agglomerative hierarchical clustering is further classified into:

1. Linkage Methods
• Single Linkage
• Complete Linkage
• Average Linkage
2. Variance Method or Ward's Method
Single Linkage
• For single linkage, the distance between two clusters is the minimum distance between any two points in the different clusters,
i.e. d(Ci, Cj) = min { d(x, y) : x ∈ Ci, y ∈ Cj }
Single Linkage Example
Single Linkage Example
• Let's take 6 simple vectors.
Step 1: compute distance between points

• Using the Euclidean distance, let's compute the distance matrix.
Euclidean Distance = sqrt( (x2 − x1)² + (y2 − y1)² )
• Example: distance between A and B
= sqrt( (18 − 22)² + (0 − 0)² )
= sqrt( 16 + 0 )
= sqrt(16) = 4
Distance Matrix
Single Linkage Detailed Example
• Single Link Clustering: takes the minimum of the two distances. Leads to larger, more diverse clusters.
• Distance Matrix: Diagonals will be 0 and
values will be symmetric.
Single Linkage Detailed Example
• Step a: The shortest distance in the matrix is 1 and the
vectors associated with that are C & D
• So the first cluster is C — D
• Distance between other vectors and CD
• A to CD = min(A->C, A->D)
=min(25,24)=24
• B to CD = min(B->C, B->D)
=min(21,20) = 20
and similarly find for E & F
Single Linkage Detailed Example
• Step b : Now 2 is the shortest distance and the
vectors associated with that are E & F
• Second cluster is E — F
• A to EF = min(A->E, A->F) = min(9,7) = 7
• CD to EF = min(CD->E, CD->F) = min(15,17)
= 15
Single Linkage Detailed Example
• Step c : Next shortest is 3, and the associated
vectors are B & EF
• Third cluster is B — EF
• A to BEF = min(A->B, A->EF) = min(4,7) = 4
CD to BEF = min(CD->B, CD->EF) =
min(20,15) = 15
Single Linkage Detailed Example
• Step d : Next shortest is 4, and the associated
vectors are A& BEF
• Fourth cluster is A — BEF
• CD to ABEF = min(CD->A, CD->BEF) =
min(24,15) = 15
Single Linkage Detailed Example
• Step e : Last cluster is CD — ABEF
• The Dendrogram for the single link cluster is
as follows:
Complete Linkage
• For complete linkage, the distance between two clusters is the maximum distance between any two points in the different clusters,
i.e. d(Ci, Cj) = max { d(x, y) : x ∈ Ci, y ∈ Cj }
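Both linkage definitions reduce to taking the minimum (single link) or the maximum (complete link) pairwise distance across two clusters. A minimal sketch, using the A/B-to-C/D distances quoted in the single-linkage walkthrough above:

def single_link(D, ci, cj):
    # Single linkage: minimum distance over all cross-cluster pairs.
    return min(D[a][b] for a in ci for b in cj)

def complete_link(D, ci, cj):
    # Complete linkage: maximum distance over all cross-cluster pairs.
    return max(D[a][b] for a in ci for b in cj)

# Pairwise distances quoted in the walkthrough (A->C=25, A->D=24, B->C=21, B->D=20).
D = {"A": {"C": 25, "D": 24}, "B": {"C": 21, "D": 20}}
print(single_link(D, ["A"], ["C", "D"]))    # min(25, 24) = 24
print(complete_link(D, ["A"], ["C", "D"]))  # max(25, 24) = 25
print(single_link(D, ["B"], ["C", "D"]))    # min(21, 20) = 20
print(complete_link(D, ["B"], ["C", "D"]))  # max(21, 20) = 21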
Example of Complete Linkage
Distance Matrix
• Consider the following distance matrix:
Complete Linkage Detailed Example
• Complete Link Clustering: considers the maximum of all distances. Leads to many small clusters.
• Distance Matrix: Diagonals will be 0 and
values will be symmetric.
Complete Linkage Detailed Example
• Step a: The shortest distance in the matrix is 1 and
the vectors associated with that are C & D
• So the first cluster is C — D
• Distance between other vectors and CD
• A to CD = max(A->C, A->D) = max(25,24) = 25
B to CD = max(B->C, B->D) = max(21,20) = 21
• and similarly find for E -> CD & F -> CD
Complete Linkage Detailed Example
• Step b : Now 2 is the shortest distance and the
vectors associated with that are E & F
• Second cluster is E — F
• A to EF = max(A->E, A->F) = max(9,7) = 9
CD to EF = max(CD->E, CD->F) =
max(15,17) = 17
Complete Linkage Detailed Example
• Step c : Now 4 is the shortest distance and vectors
associated are A & B
• Third cluster is A — B
• CD to AB = max(CD -> A, CD ->B) = max(25,21) =
25
EF to AB = max(EF -> A, EF ->B) = max(9,5) = 9
Complete Linkage Detailed Example
• Step d : Now 9 is the shortest distance and
vectors associated are AB and EF
• Fourth cluster is AB — EF
• CD to ABEF = max(CD->AB, CD->EF) = max(25, 17) = 25
Complete Linkage Detailed Example

• Step e : Last cluster is CD — ABEF
• Let's look at the dendrogram for the complete link cluster.
Exercise
• Consider your mobile number as raw data; make the data pairs from left to right.
• For example: for the mobile number 9415736072,
• data points set DP = {(9,4), (1,5), (7,3), (6,0), and (7,2)}
• Case 1: if you are not able to get five different data pairs from your mobile number, then add one to both numbers in the second occurrence of the similar data pair and create a new data pair. For example: mobile number 9415736073
• so the original data points set DP = {(9,4), (1,5), (7,3), (6,0), and (7,3)}
• modified data points set DP = {(9,4), (1,5), (7,3), (6,0), and (8,4)}
• Case 2: you can repeat this process if you get three identical data pairs.
• For example: mobile number 9415737373
• modified data points set DP = {(9,4), (1,5), (7,3), (8,4), and (9,5)}
• During distance computation, ignore the decimal part. For example: if the computed distance is 6.4, then consider only 6.
• Now design the dendrogram for DP using the following methods (see the sketch after this list):
– Single linkage method
– Complete linkage method
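A hedged SciPy sketch for the example number 9415736072 (DP as given above); the "ignore the decimal part" rule is approximated by flooring the pairwise Euclidean distances before running the linkage.

import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, dendrogram

# Data pairs for the example mobile number 9415736072.
DP = np.array([(9, 4), (1, 5), (7, 3), (6, 0), (7, 2)], dtype=float)

# Euclidean distances with the decimal part dropped, as the exercise asks.
d = np.floor(pdist(DP, metric="euclidean"))

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, method in zip(axes, ["single", "complete"]):
    dendrogram(linkage(d, method=method), ax=ax,
               labels=["(9,4)", "(1,5)", "(7,3)", "(6,0)", "(7,2)"])
    ax.set_title(f"{method} linkage")
plt.tight_layout()
plt.show()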
Hierarchical Clustering: Comparison
Density-Based Clustering Methods
• Clustering based on density (local cluster criterion), such as
density-connected points
• Major features:
– Discover clusters of arbitrary shape
– Handle noise
– One scan
– Need density parameters as termination condition
• Several interesting studies:
– DBSCAN: Ester, et al. (KDD’96)
– OPTICS: Ankerst, et al (SIGMOD’99).
– DENCLUE: Hinneburg & D. Keim (KDD’98)
– CLIQUE: Agrawal, et al. (SIGMOD’98) (more grid-based)

Density-Based Clustering: Basic Concepts
• Two parameters:
– Eps: maximum radius of the neighbourhood
– MinPts: minimum number of points in an Eps-neighbourhood of that point
• NEps(p): {q belongs to D | dist(p, q) ≤ Eps}
• Directly density-reachable: A point p is directly density-reachable from a point q w.r.t. Eps, MinPts if
– p belongs to NEps(q)
– core point condition: |NEps(q)| ≥ MinPts
[Slide figure: points p and q with MinPts = 5 and Eps = 1 cm]
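These two definitions translate directly into a small helper; a minimal sketch with placeholder points (the Eps and MinPts values are illustrative, not tuned parameters).

import numpy as np

# Placeholder 2-D data set D (not from the slides).
points = np.array([[0, 0], [0.5, 0], [0.3, 0.4], [0.2, -0.3], [5, 5], [0.1, 0.2]])

def eps_neighborhood(i, points, eps):
    # N_Eps(p): indices of all q in D with dist(p, q) <= Eps (p itself included).
    d = np.linalg.norm(points - points[i], axis=1)
    return np.flatnonzero(d <= eps)

def is_core(i, points, eps, min_pts):
    # Core point condition: |N_Eps(p)| >= MinPts.
    return len(eps_neighborhood(i, points, eps)) >= min_pts

# Example check for point 0 with illustrative Eps = 1 and MinPts = 5.
print(eps_neighborhood(0, points, eps=1.0))
print(is_core(0, points, eps=1.0, min_pts=5))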
Density-Reachable and Density-Connected

• Density-reachable:
– A point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points p1, …, pn, p1 = q, pn = p such that pi+1 is directly density-reachable from pi
• Density-connected:
– A point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts
[Slide figures: a chain q = p1 → … → p illustrating density-reachability, and points p, q both reachable from o illustrating density-connectivity]
Core, Border and Outlier
Example
DBSCAN: The Algorithm

• Arbitrarily select a point p
• Retrieve all points density-reachable from p w.r.t. Eps and MinPts
• If p is a core point, a cluster is formed
• If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database
• Continue the process until all of the points have been processed
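scikit-learn's DBSCAN implements this procedure; a brief usage sketch with placeholder data and illustrative parameter values (eps corresponds to Eps and min_samples to MinPts).

import numpy as np
from sklearn.cluster import DBSCAN

# Placeholder data: two dense blobs plus one isolated point (likely noise).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, size=(20, 2)),
               rng.normal(5, 0.3, size=(20, 2)),
               [[10.0, 10.0]]])

db = DBSCAN(eps=1.0, min_samples=5).fit(X)
print(db.labels_)          # cluster index per point; -1 marks noise/outliers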
DBSCAN: Sensitive to Parameters
