Clustering Algorithm
Machine learning:
• Supervised vs unsupervised.
  – Supervised learning – the outcome variable is available to guide the learning process.
    • There must be a training data set in which the solution is already known.
  – Unsupervised learning – the outcomes are unknown.
    • Cluster the data to reveal meaningful partitions and hierarchies.
Clustering:
• Clustering is the task of gathering samples into groups of similar samples according to some predefined similarity or dissimilarity measure.
• Clustering is a kind of unsupervised learning: there are no predefined classes (i.e., learning by observation).
[Figure: samples grouped into clusters]
Clustering is an unsupervised machine learning technique that divides the population or data points into several groups, or clusters, such that data points in the same group are more similar to one another than to data points in other groups.
Clustering Applications
• Biology: taxonomy of living things: kingdom, phylum, class, order, family, genus, and species
• Image segmentation
Quality: What Is Good Clustering?
Measure the Quality of Clustering
• Dissimilarity/similarity metric:
  – Similarity is expressed in terms of a distance function, typically a metric d(i, j).
  – The definitions of distance functions are usually quite different for interval-scaled, boolean, categorical, ordinal, ratio, and vector variables.
  – Weights should be associated with different variables based on the application and data semantics.
• Quality of clustering:
  – There is usually a separate "quality" function that measures the "goodness" of a cluster.
  – It is hard to define "similar enough" or "good enough"; the answer is typically highly subjective.
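To make the dependence of the distance function on variable type concrete, here is a small sketch. The helper names (`euclidean`, `simple_matching`) and the per-variable weighting scheme are illustrative assumptions, not from the slides:

```python
import math

def euclidean(x, y, weights=None):
    """Weighted Euclidean distance for interval-scaled (numeric) variables.

    The optional weights reflect application-specific variable importance.
    """
    w = weights or [1.0] * len(x)
    return math.sqrt(sum(wi * (xi - yi) ** 2 for wi, xi, yi in zip(w, x, y)))

def simple_matching(x, y):
    """Dissimilarity for categorical variables: fraction of mismatched attributes."""
    return sum(xi != yi for xi, yi in zip(x, y)) / len(x)

print(euclidean([1.0, 2.0], [4.0, 6.0]))           # 5.0
print(simple_matching(["red", "S"], ["red", "M"]))  # 0.5
```

Note that the two metrics are not on the same scale; mixed-type data usually requires normalizing each variable's contribution before combining them.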
Major Clustering Approaches
• Partitioning approach:
  – Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors.
  – Typical methods: k-means, k-medoids, CLARANS
• Hierarchical approach:
  – Create a hierarchical decomposition of the set of data (or objects) using some criterion.
  – Typical methods: DIANA, AGNES, BIRCH, CHAMELEON
• Density-based approach:
  – Based on connectivity and density functions.
  – Typical methods: DBSCAN
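To make the density-based idea concrete, here is a minimal from-scratch sketch of DBSCAN. It assumes 2-D points, and the function and parameter names (`region_query`, `eps`, `min_pts`) are illustrative, not from the slides:

```python
def region_query(points, i, eps):
    """Return indices of all points within eps of points[i] (including i itself)."""
    px, py = points[i]
    return [j for j, (qx, qy) in enumerate(points)
            if (px - qx) ** 2 + (py - qy) ** 2 <= eps ** 2]

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    labels = [None] * len(points)   # None = not yet visited
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = -1          # noise (may still become a border point later)
            continue
        cluster += 1                # i is a core point: start a new cluster
        labels[i] = cluster
        seeds = list(neighbors)
        while seeds:                # grow the cluster by density-connectivity
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                seeds.extend(j_neighbors)  # j is also a core point: keep expanding
    return labels

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (50, 50)]
print(dbscan(pts, eps=2.0, min_pts=2))  # two dense clusters plus one noise point
```

Unlike the partitioning methods, the number of clusters is not specified up front; it emerges from the density parameters `eps` and `min_pts`.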
K-MEANS CLUSTERING
The K-Means Clustering Method
• Assign each point to its nearest centroid, then update each centroid to the mean of the points assigned to it; repeat until the assignments stop changing.
• K-means++ initialization ensures that the initial centroids are well spread out across the dataset, leading to better convergence and potentially avoiding suboptimal solutions.
• A problem with k-means and k-means++ clustering is that the final centroids are not interpretable: a centroid is not an actual data point but the mean of the points in its cluster.
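The assign-then-update loop described above can be sketched from scratch as follows. This assumes 2-D points and hand-picked initial centroids rather than k-means++ seeding, and the names are illustrative:

```python
def kmeans(points, centroids, iters=10):
    """Minimal Lloyd's k-means on 2-D points with given initial centroids."""
    for _ in range(iters):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            dists = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Update step: each centroid moves to the mean of its cluster. Note the
        # new centroid is generally NOT one of the actual data points.
        centroids = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl))
            if cl else c
            for cl, c in zip(clusters, centroids)
        ]
    return centroids

pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
print(kmeans(pts, [(0, 0), (10, 10)]))  # [(0.0, 0.5), (10.0, 10.5)]
```

The final centroids (0.0, 0.5) and (10.0, 10.5) illustrate the interpretability issue: neither is a point in the data set.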
K-Medoids:
• K-medoids is like k-means, but each cluster center (medoid) must be an actual data point, which makes the centers interpretable.
Affinity Propagation:
• The algorithm initializes two matrices: the availability matrix (A) and the responsibility matrix (R), both of size N x N, where N is the number of data points.
# Assumed setup (not shown on the slide): fit Affinity Propagation on toy data
import matplotlib.pyplot as plt
from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)
clustering = AffinityPropagation(preference=-50, random_state=0).fit(X)

# Obtain the cluster centers and labels from the trained model:
cluster_centers = clustering.cluster_centers_
labels = clustering.labels_
num_clusters = len(cluster_centers)
print(num_clusters)
print(cluster_centers)

# Plot the data points and cluster centers using different colors for each cluster:
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
plt.scatter(cluster_centers[:, 0], cluster_centers[:, 1], c='red', marker='x')
plt.show()