Clustering Algorithm


Clustering Algorithms

Machine learning:
• Supervised vs Unsupervised.
– Supervised learning - an outcome variable is available to guide the
learning process.
• there must be a training data set in which the solution is already known.
– Unsupervised learning - the outcomes are unknown.
• cluster the data to reveal meaningful partitions and hierarchies.
Clustering:
• Clustering is the task of gathering samples into groups of similar samples
according to some predefined similarity or dissimilarity measure.
• Clustering is a kind of unsupervised learning: no predefined classes (i.e.,
learning by observations).

Clustering is an unsupervised machine learning
technique that divides the population or data points
into several groups (clusters) such that data points in
the same group are more similar to each other than
to data points in other groups.

•Points in the same cluster are closer to each other.

•Points in the different clusters are far apart.


The primary aim is to maximize the similarity within
clusters while minimizing the similarity between clusters.
Clustering is considered to be best if it has maximum
intercluster distance and minimum intracluster distance.

Intracluster distance: distance between two points in the
same cluster.
Intercluster distance: distance between two points in
different clusters.
Why Clustering:

Clustering is an essential task that determines the intrinsic
grouping among unlabeled data.

There are no universal criteria for a good clustering; it depends on the
user and on which criteria satisfy their need.

The clustering algorithm must make some assumptions about what
constitutes the similarity of points, and each assumption leads to
different but equally valid clusters.
What is Cluster Analysis?
Cluster: A collection of data objects
similar (or related) to one another within the same group
dissimilar (or unrelated) to the objects in other groups
Cluster analysis (or clustering, data segmentation, …)
Finding similarities between data according to the characteristics
found in the data and grouping similar data objects into clusters
Clustering Applications
Biology: taxonomy of living things: kingdom, phylum, class, order, family, genus
and species

Information retrieval: document clustering

Marketing: help marketers discover distinct groups in their customer bases,
and then use this knowledge to develop targeted marketing programs

City-planning: identifying groups of houses according to their house type,
value, and geographical location

Social Network Analysis: detecting various communities

Image Segmentation
Quality: What Is Good Clustering?

A good clustering method will produce high-quality clusters:
high intra-class similarity: cohesive within clusters
low inter-class similarity: distinctive between clusters

The quality of a clustering method depends on the similarity
measure used by the method, its implementation, and its ability
to discover some or all of the hidden patterns.
Measure the Quality of Clustering
Dissimilarity/Similarity metric
Similarity is expressed in terms of a distance function, typically a
metric: d(i, j)
The definitions of distance functions are usually rather
different for interval-scaled, boolean, categorical, ordinal,
ratio, and vector variables
Weights should be associated with different variables based on
applications and data semantics
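As an illustration, for numeric (interval-scaled) variables two commonly used
distance functions are the Euclidean and Manhattan distances; for points i and j
with attribute values x_i1, …, x_ip and x_j1, …, x_jp:

Euclidean: d(i, j) = sqrt( (x_i1 − x_j1)^2 + … + (x_ip − x_jp)^2 )
Manhattan: d(i, j) = |x_i1 − x_j1| + … + |x_ip − x_jp|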
Quality of clustering:
There is usually a separate “quality” function that measures
the “goodness” of a cluster.
It is hard to define “similar enough” or “good enough”
The answer is typically highly subjective

Major Clustering Approaches

Partitioning approach:
Construct various partitions and then evaluate them by some criterion, e.g.,
minimizing the sum of squared errors
Typical methods: k-means, k-medoids, CLARANS

Hierarchical approach:
Create a hierarchical decomposition of the set of data (or objects) using
some criterion
Typical methods: DIANA, AGNES, BIRCH, CHAMELEON

Density-based approach:
Based on connectivity and density functions
Typical methods: DBSCAN
K-MEANS CLUSTERING

● The k-means algorithm clusters n objects, based on their attributes,
into k partitions, where k < n.

● Simply speaking, k-means clustering is an algorithm to classify or group
objects based on attributes/features into k groups.
● k is a positive integer number.
● The grouping is done by minimizing the sum of
squares of distances between data points and the
corresponding cluster centroid.

● K-Means is a centroid-based clustering technique.

● Each cluster in the k-means clustering algorithm is
represented by a centroid point.
The centroid point is the point that represents its cluster. Centroid
point is the average of all the points in the set and will change in
each step and will be computed by:

For the above equation,


C_i: i'th Centroid
S_i: All points belonging to set_i with centroid as C_i
x_j: j'th point from the set
||S_i||: number of points in set_i
15
The idea of the K-Means algorithm is to find k-centroid points
and every point in the dataset will belong either of k-sets
having minimum Euclidean distance.
The K-Means Clustering Method

Given k, the k-means algorithm is implemented in four steps:

Step #1: Initialization:
The initial k centroids are randomly picked from the dataset
of points.

Step #2: Assignment:
For each point in the dataset, find the Euclidean distance
between the point and all centroids. Assign each point to
the cluster with the nearest centroid.
The K-Means Clustering Method

Step #3: Update of Centroids:

Update the value of each centroid with the new mean value.

Step #4: Repeat:

Repeat steps 2 and 3 until convergence is achieved, then stop.
Convergence refers to the condition where the previous values
of the centroids are equal to the updated values.
How K-means clustering works
K-means algorithm:
1. Given n objects, initialize k cluster centers
2. Assign each object to its closest cluster centre
3. Update the center for each cluster
4. Repeat 2 and 3 until no change in each cluster center
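A minimal NumPy sketch of these four steps (the example data, function name,
and k = 3 are assumptions for illustration):

# Minimal NumPy sketch of the four K-means steps (example data and k are assumed).
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: Initialization - pick k random points from the dataset as centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: Assignment - assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: Update - recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        # Step 4: Repeat steps 2 and 3 until the centroids stop changing
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

labels, centroids = kmeans(np.random.rand(200, 2), k=3)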

• Experiment: Pack of cards, dominoes

• Apply the K-means algorithm to the Shapley data
– Change the number of potential clusters and see how the
clustering differs
K-Means++ Clustering:

In the case of finding initial centroids for K-Means clustering, we
were using randomization: the initial k centroids were picked
randomly from the data points.

This randomization of picking the k centroid points results in the
problem of initialization sensitivity.

This problem tends to affect the final formed clusters: the final
clusters depend on how the initial centroids were picked.
There are two approaches to avoid this problem of initialization
sensitivity:

1. Repeat K-means: repeat the algorithm and the initialization of
centroids several times, and pick the run that has
small intracluster distance and large intercluster distance.

2. K-Means++: K-Means++ is a smart centroid initialization technique.
Steps of k-means++
Step 1: Randomly select the first centroid from the data
points.

Step 2: Compute the distance of all points in the dataset from
the selected centroid.

Step 3: For the subsequent centroids, use a weighted
probability distribution to choose the next centroid.

The probability of selecting a point as the next centroid is
higher if it is far away from the already chosen centroids.

This ensures that the initial centroids are well spread out
across the dataset, leading to better convergence and
potentially avoiding suboptimal solutions.

Step 4: The remaining steps of the K-means algorithm,
including point assignment and centroid update, are the
same as in the regular K-means algorithm.
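A minimal sketch of this initialization (example data and k are assumed; in
scikit-learn, KMeans already uses init='k-means++' by default):

# Minimal sketch of K-Means++ centroid initialization (example data and k are assumed).
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: randomly select the first centroid from the data points
    centroids = [X[rng.integers(len(X))]]
    while len(centroids) < k:
        # Step 2: squared distance of every point to its nearest already-chosen centroid
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centroids], axis=0)
        # Step 3: pick the next centroid with probability proportional to that distance
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    # Step 4: ordinary K-means (assignment and update) then starts from these centroids
    return np.array(centroids)

init_centroids = kmeans_pp_init(np.random.rand(200, 2), k=3)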
What Is the Problem of the K-means/K-means++
Method?

A problem with K-means and K-means++ clustering is that the final
centroids are not interpretable; in other words, the centroids are not
actual data points but the mean of the points present in that cluster.

The k-means and k-means++ algorithms are sensitive to outliers!
An object with an extremely large value may substantially
distort the distribution of the data.
K-Medoids:

Instead of taking the mean value of the objects in a cluster
as a reference point, a medoid can be used, which is the
most centrally located object in a cluster.

Python library: sklearn_extra.cluster.KMedoids

Medoids are representative objects of a data set
or a cluster within a data set whose sum of
dissimilarities to all the objects in the cluster is
minimal.

Medoids are similar in concept to means or
centroids, but medoids are always restricted to
be members of the data set.
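A brief usage sketch of the KMedoids estimator named above, from the
scikit-learn-extra package (the example data is assumed):

# Usage sketch of K-Medoids via scikit-learn-extra (example data is assumed).
import numpy as np
from sklearn_extra.cluster import KMedoids

X = np.random.rand(100, 2)            # example data (assumed)

kmedoids = KMedoids(n_clusters=3).fit(X)
labels = kmedoids.labels_             # cluster assignment of each point
medoids = kmedoids.cluster_centers_   # medoids are actual points from the dataset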
How to pick the best value of K in case of K-means/K-means++/K-
medoids - using the Elbow method

The best value of K can be computed using the Elbow method.

The cost function of the K-Means, K-Means++, and K-Medoids
techniques is to minimize intracluster distance and maximize
intercluster distance.

The cost function is known as WCSS (within-cluster sum of
squared distances).
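A minimal sketch of the Elbow method (the data array is assumed; scikit-learn
exposes the WCSS of a fitted KMeans model as inertia_):

# Minimal Elbow method sketch: plot WCSS against K and look for the "elbow" (data is assumed).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.rand(200, 2)            # example data (assumed)

k_values = range(1, 11)
wcss = []
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10).fit(X)
    wcss.append(km.inertia_)          # WCSS: within-cluster sum of squared distances

plt.plot(list(k_values), wcss, marker='o')
plt.xlabel('Number of clusters K')
plt.ylabel('WCSS')
plt.title('Elbow Method')
plt.show()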
Hierarchical Clustering Algorithms:
• It is a statistical method used to build clusters by
arranging elements at various levels.

• The clusters formed in this method form a tree-type
structure based on the hierarchy.

• New clusters are formed using the previously formed
ones. It is divided into two categories:

• Agglomerative (bottom-up approach)

• Divisive (top-down approach)
Agglomerative Clustering

We start from each data point as its own cluster and keep on
merging clusters until all the data points are included.

Finally, we end up with one single big cluster comprising
all the data points we started with.
Dendrogram:
• Each level represents a possible clustering.

• The height at which two clusters are joined in the dendrogram
shows their level of similarity.

• The closer to the bottom they are joined, the more
similar the clusters are.
Hierarchical Clustering
Use distance matrix as clustering criteria. This method does not
require the number of clusters k as an input, but needs a
termination condition
[Figure: agglomerative (AGNES) clustering merges objects a, b, c, d, e step by
step into {a, b}, {d, e}, {c, d, e}, and finally {a, b, c, d, e}; divisive (DIANA)
splits in the reverse order.]
Dendrogram: Shows How Clusters are Merged
[Figure: dendrogram; cutting it at the indicated level gives a total of 4 clusters.]
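A minimal sketch of agglomerative clustering and its dendrogram using SciPy
(the example data and the choice of 4 clusters are assumed):

# Sketch: agglomerative clustering and dendrogram with SciPy (example data is assumed).
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.random.rand(30, 2)             # example data (assumed)

Z = linkage(X, method='ward')         # bottom-up (agglomerative) merging of clusters

dendrogram(Z)                         # merge height reflects dissimilarity between clusters
plt.title('Dendrogram')
plt.show()

labels = fcluster(Z, t=4, criterion='maxclust')   # cut the tree to obtain 4 clusters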
Affinity Propagation

Affinity Propagation is a clustering algorithm that does
not require specifying the number of clusters in advance.
It discovers the number of clusters automatically based
on the data's similarities.

It operates by iteratively exchanging messages between
data points to determine the most representative
exemplars, which then define the clusters.

Affinity Propagation is effective in finding clusters with varying
sizes and shapes.

However, it can be computationally expensive, especially for
large datasets, as it requires pairwise similarity calculations and
matrix operations.
1. Similarity Matrix:

• Affinity Propagation begins by constructing a similarity matrix
that measures the pairwise similarities between data points.

• The similarity metric can be based on various distance measures
such as Euclidean distance, or similarity measures like cosine
similarity.
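As an illustration, a common choice is to use the negative squared Euclidean
distance as the similarity; a sketch with scikit-learn (the data array is assumed):

# Sketch: pairwise similarity matrix as negative squared Euclidean distances (data is assumed).
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

X = np.random.rand(50, 2)                   # example data (assumed)

S = -euclidean_distances(X, squared=True)   # S[i, k] = -||x_i - x_k||^2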
2. Availability and Responsibility Matrices:

The algorithm initializes two matrices: the availability matrix (A) and
the responsibility matrix (R), both of size N x N, where N is the
number of data points.

The availability matrix reflects the accumulated evidence of how
well-suited a data point is to be an exemplar, while the responsibility
matrix quantifies the suitability of one data point being chosen as an
exemplar relative to others.
3. Message Passing:

• The algorithm iteratively updates the availability and
responsibility matrices until convergence.

• In each iteration, the algorithm passes messages between data
points to update their availability and responsibility values.

• The responsibility update is based on the current
similarities and the maximum responsibility of other data points
for each exemplar.

• The availability update considers the current
responsibilities and the accumulated availabilities of other data
points for each exemplar.
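For reference, the standard message-passing updates in Frey and Dueck's
formulation, with s(i, k) the similarity, r(i, k) the responsibility, and
a(i, k) the availability, are roughly:

r(i, k) ← s(i, k) − max over k' ≠ k of [ a(i, k') + s(i, k') ]
a(i, k) ← min( 0, r(k, k) + Σ over i' ∉ {i, k} of max(0, r(i', k)) )   for i ≠ k
a(k, k) ← Σ over i' ≠ k of max(0, r(i', k))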
4. Exemplar Selection:

• After convergence, the algorithm determines the exemplars,
which are the most representative data points for each cluster.

• Exemplars are selected based on high availability values,
indicating their ability to represent other data points well.
5. Convergence:

• The algorithm continues updating the matrices and exchanging
messages until convergence criteria are met.

• Convergence is typically determined by a maximum number of
iterations or when the change in the matrices' values falls
below a threshold.
Python code

# X is assumed to be a NumPy array of shape (n_samples, 2) containing the data points
import matplotlib.pyplot as plt
from sklearn.cluster import AffinityPropagation

# Create an instance of the AffinityPropagation class and fit it to your data
clustering = AffinityPropagation().fit(X)

# Obtain the cluster centers and labels from the trained model
cluster_centers = clustering.cluster_centers_
labels = clustering.labels_

# The number of clusters is discovered automatically
num_clusters = len(cluster_centers)

# Plot the data points and cluster centers using a different color for each cluster
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
plt.scatter(cluster_centers[:, 0], cluster_centers[:, 1], marker='x', color='red', s=100)
plt.title('Affinity Propagation Clustering')
plt.show()

The Colormap (cmap) instance is used to map scalar data to colors.
