Hierarchical Clustering - 11.3.2024 - Full
Hierarchical clustering, also known as hierarchical cluster analysis (HCA), is another unsupervised machine learning algorithm used to group unlabeled data points into clusters. The clusters are represented visually in a hierarchical tree: we develop the hierarchy of clusters in the form of a tree, and this tree-shaped structure is known as the dendrogram. A representation of a dendrogram is shown in the figure below:
Sometimes the results of K-means clustering and hierarchical clustering may look similar, but the two methods differ in how they work. The key difference is that there is no need to pre-specify the number of clusters, as we did in the K-means algorithm: instead, the dendrogram can be cut at the appropriate level to obtain the desired number of clusters.
Applications
Since we already have other clustering algorithms such as K-means, why do we need hierarchical clustering? As we have seen, K-means clustering has some challenges: the number of clusters must be predetermined, and it always tries to create clusters of the same size. To address these two challenges we can opt for the hierarchical clustering algorithm, because it does not require prior knowledge of the number of clusters.
The working of the agglomerative hierarchical clustering (AHC) algorithm can be explained using the steps below (a short code sketch follows the steps):
Step-1: Treat each data point as a single cluster. If there are N data points, the number of clusters will initially be N.
Step-2: Take the two closest data points or clusters and merge them to form one cluster. There will now be N-1 clusters.
Step-3: Again, take the two closest clusters and merge them to form one cluster. There will be N-2 clusters.
Step-4: Repeat Step-3 until only one cluster is left. Consider the images below:
Step-5: Once all the clusters are combined into one big cluster, develop the dendrogram and divide the clusters as the problem requires.
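Below is a minimal code sketch of these steps, assuming SciPy and Matplotlib are available and using a small made-up data set (the figures referenced above are not reproduced here):

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Hypothetical sample data: nine two-dimensional points.
rng = np.random.default_rng(0)
X = rng.random((9, 2))

# linkage() performs the agglomerative procedure: it starts with N singleton
# clusters and repeatedly merges the two closest clusters until one remains.
Z = linkage(X, method='single', metric='euclidean')

# Each row of Z records one merge: the two clusters merged, their distance,
# and the size of the newly formed cluster. dendrogram() draws the tree.
dendrogram(Z)
plt.show()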
Hierarchical clustering employs a measure of distance/similarity to create new clusters. The steps of agglomerative clustering can be summarized as follows:
Step 1: Compute the distance (proximity) matrix for all pairs of data points
Step 2: Let each data point start as its own cluster
Step 3: Merge the clusters based on a metric for the similarity between clusters
Step 4: Update the distance matrix
Step 5: Repeat Step 3 and Step 4 until only a single cluster remains
The first step of the algorithm is to create a distance matrix. The values of the matrix are calculated by
applying a distance function between each pair of objects. The Euclidean distance function is
commonly used for this operation. The structure of the proximity matrix will be as follows for a data
set with ‘n’ elements. Here, d(pi, pj) represents the distance between pi and pj.
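As a small illustration (a sketch only, assuming NumPy and SciPy, with made-up data), the proximity matrix can be built as follows:

import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical data set: each row is one object p_i.
X = np.random.default_rng(0).random((5, 2))

# pdist returns the condensed vector of pairwise Euclidean distances d(p_i, p_j) for i < j;
# squareform expands it into the full symmetric n x n proximity matrix with zeros on the diagonal.
D = squareform(pdist(X, metric='euclidean'))
print(np.round(D, 2))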
The main question in hierarchical clustering is how to calculate the distance between clusters and
update the proximity matrix. There are many different approaches used to answer that question. Each
approach has its advantages and disadvantages. The choice will depend on whether there is noise in
the data set, whether the shape of the clusters is circular or not, and the density of the data points.
How the distance between two clusters is measured is crucial for hierarchical clustering. There are various ways to calculate the distance between two clusters, and the chosen way decides the rule for merging. These measures are called linkage methods.
A numerical example will help illustrate the methods and choices. We'll use a small sample data set
containing just nine two-dimensional points, displayed below.
Suppose we have two clusters in the sample data set, as shown in Figure 2. There are different
approaches to calculate the distance between the clusters. Popular methods are listed below.
Min (Single) Linkage
One way to measure the distance between clusters is to find the minimum distance between points in
those clusters. That is, we can find the point in the first cluster nearest to a point in the other cluster
and calculate the distance between those points. In Figure 2, the closest points are p2 in one cluster
and p5 in the other. The distance between those points, and hence the distance between clusters, is
found as d(p2,p5)=4.
The advantage of the Min method is that it can accurately handle non-elliptical shapes. The disadvantage is that it is sensitive to noise and outliers.
Max (Complete) Linkage
Another way to measure the distance is to find the maximum distance between points in the two clusters. We can find the points in each cluster that are furthest away from each other and calculate the distance between those points. In Figure 3, the maximum distance is between p1 and p6. The distance between those two points, and hence the distance between the clusters, is found as d(p1,p6)=10.
The Max method is less sensitive to noise and outliers than the Min method. However, it can break large clusters and tends to be biased towards globular clusters.
Centroid Linkage
The Centroid method defines the distance between clusters as being the distance between their
centers/centroids. After calculating the centroid for each cluster, the distance between those centroids
is computed using a distance function.
Average Linkage
The Average method defines the distance between clusters as the average of the pairwise distances between points, taking one point from each cluster. For simplicity, only some of the lines connecting pairs of points are shown in Figure 6.
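The four linkage rules above can be expressed as small distance functions between two clusters. The sketch below is illustrative only: the clusters A and B are made-up arrays, not the points from Figures 2-6.

import numpy as np
from scipy.spatial.distance import cdist

def min_linkage(A, B):       # Min (single link): smallest pairwise distance
    return cdist(A, B).min()

def max_linkage(A, B):       # Max (complete link): largest pairwise distance
    return cdist(A, B).max()

def centroid_linkage(A, B):  # distance between the cluster centroids
    return np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))

def average_linkage(A, B):   # average of all pairwise distances
    return cdist(A, B).mean()

# Hypothetical clusters of two-dimensional points.
A = np.array([[0.0, 0.0], [1.0, 1.0], [1.0, 0.0]])
B = np.array([[4.0, 0.0], [5.0, 1.0]])
for f in (min_linkage, max_linkage, centroid_linkage, average_linkage):
    print(f.__name__, round(float(f(A, B)), 2))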
Dendrogram
The dendrogram is a tree-like structure that records each merge step performed by the hierarchical clustering algorithm. In the dendrogram plot, the y-axis shows the Euclidean distances between the data points (or clusters), and the x-axis shows all the data points of the given dataset.
The working of the dendrogram can be explained using the diagram below:
In the above diagram, the left part shows how clusters are created in agglomerative clustering, and the right part shows the corresponding dendrogram.
o As discussed above, the data points P2 and P3 combine first to form a cluster, and correspondingly a dendrogram is created which connects P2 and P3 with a rectangular shape. The height is decided according to the Euclidean distance between the data points.
o In the next step, P5 and P6 form a cluster, and the corresponding dendrogram is created. It is taller than the previous one, as the Euclidean distance between P5 and P6 is slightly greater than that between P2 and P3.
o Again, two new dendrograms are created that combine P1, P2, and P3 in one dendrogram, and P4, P5, and P6 in another.
o At last, the final dendrogram is created that combines all the data points together.
We can cut the dendrogram tree structure at any level as per our requirement.
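A sketch of how such a cut is made in practice, assuming SciPy and made-up data: fcluster can cut the hierarchy either at a chosen height or so that a desired number of clusters remains.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).random((9, 2))      # hypothetical data
Z = linkage(X, method='single')

# Cut the dendrogram so that exactly 3 clusters remain ...
labels_by_k = fcluster(Z, t=3, criterion='maxclust')

# ... or cut it at a chosen height (distance threshold) on the y-axis.
labels_by_height = fcluster(Z, t=0.25, criterion='distance')
print(labels_by_k)
print(labels_by_height)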
Example # 1
Question. Find the clusters using a single link technique. Use Euclidean distance and draw the
dendrogram.
Sample No. X Y
P1 0.40 0.53
P2 0.22 0.38
P3 0.35 0.32
P4 0.26 0.19
P5 0.08 0.41
P6 0.45 0.30
Solution:
Step 1: Compute the Euclidean distance between every pair of points to build the distance matrix. One point to note is that the matrix is symmetric about its diagonal: the distances above and below the diagonal are the same, e.g. d(P2, P5) is equal to d(P5, P2). So we only need to fill in the lower-triangular section of the matrix.
Step 2: Merge the two closest members by finding the minimum element in the distance matrix. Here the minimum value is 0.10, so we combine P3 and P6 (as 0.10 appears in the P6 row and P3 column). Now form a cluster of the elements corresponding to the minimum value and update the distance matrix. With the single link technique, the distance from the new cluster (P3, P6) to any other point is the minimum of the distances from P3 and from P6 to that point. A code sketch of the whole example is given below.
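The example can be checked with a short script (a sketch, assuming NumPy and SciPy). The printed distance matrix should have 0.10 as its smallest off-diagonal entry, in the P6 row and P3 column, and single-link agglomeration therefore merges P3 and P6 first:

import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage

# The six sample points from the question.
X = np.array([[0.40, 0.53], [0.22, 0.38], [0.35, 0.32],
              [0.26, 0.19], [0.08, 0.41], [0.45, 0.30]])

D = squareform(pdist(X))                 # full Euclidean distance matrix
print(np.round(D, 2))                    # smallest off-diagonal entry: d(P3, P6) ≈ 0.10

Z = linkage(pdist(X), method='single')   # single link technique
print(np.round(Z, 2))                    # each row: the two clusters merged, their distance, new size
# from scipy.cluster.hierarchy import dendrogram
# dendrogram(Z, labels=['P1', 'P2', 'P3', 'P4', 'P5', 'P6'])  # draws the requested dendrogram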
Divisive Clustering
The divisive method starts at the top and, at each level, recursively splits one of the existing clusters at that level into two new clusters. If there are N observations in the dataset, the divisive method will also produce N − 1 levels in the hierarchy.
The split is chosen to produce two new groups with the largest “between-group
dissimilarity”. For example, the divisive method is shown in the figure below. Each
nonterminal node has two daughter nodes. The two daughters represent the two
groups resulting from the split of the parent.
The proximity matrix and linkage function follow the same procedure as in agglomerative clustering. Because divisive clustering is rarely used in practice, there is no predefined class or function for it in the commonly used Python libraries.
Divisive clustering algorithms begin with the entire data set as a single cluster,
and recursively divide one of the existing clusters into two daughter clusters at
each iteration in a top-down fashion. To apply this procedure, we need a
separate algorithm to divide a given dataset into two clusters. The divisive
algorithm may be implemented by using the k-means algorithm with k = 2 to
perform the splits at each iteration. However, it would not necessarily produce a
splitting sequence that possesses the monotonicity property required for
dendrogram representation.
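A minimal sketch of this k-means-based splitting idea (sometimes called bisecting), assuming scikit-learn and using made-up data; here the cluster with the largest diameter is chosen for the next split, mirroring Step 6 of the procedure below.

import numpy as np
from sklearn.cluster import KMeans

def cluster_diameter(points):
    """Largest pairwise distance between any two points of a cluster."""
    if len(points) < 2:
        return 0.0
    return max(np.linalg.norm(a - b) for a in points for b in points)

def divisive_bisect(X, n_clusters):
    """Top-down clustering: repeatedly split the largest-diameter cluster with 2-means."""
    clusters = [np.arange(len(X))]               # start with one cluster holding every point
    while len(clusters) < n_clusters:
        # choose the cluster with the largest diameter and split it into two
        target = max(range(len(clusters)), key=lambda i: cluster_diameter(X[clusters[i]]))
        idx = clusters.pop(target)
        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[idx])
        clusters.append(idx[labels == 0])
        clusters.append(idx[labels == 1])
    return clusters

X = np.random.default_rng(0).random((20, 2))     # hypothetical data
for c in divisive_bisect(X, 4):
    print(c)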
The splitting procedure can be outlined in the following steps.
Step 1. Start with all objects of the cluster Cl that is to be split.
Step 2. Let Ci contain all objects of Cl and let the splinter group Cj be empty.
Step 3. For each object x in Ci, compute Dx:
(a) For the first iteration, compute the average distance of x to all other objects.
(b) For the remaining iterations, compute Dx as the average dissimilarity of x to the other objects in Ci minus the average dissimilarity of x to the objects in Cj.
Step 4.
(a) For the first iteration, move the object x with the largest Dx to Cj.
(b) For the remaining iterations, find an object x in Ci for which Dx is the largest. If Dx > 0 then move x to Cj.
Step 5. Repeat Steps 3(b) and 4(b) until all differences Dx are negative. Then Cl is split into Ci and Cj.
Step 6. Select the cluster with the largest diameter. (The diameter of a cluster is the largest dissimilarity between any two of its objects.) Then divide this cluster, following Steps 1-5.
Step 7. Repeat Step 6 until all clusters contain only a single object.
Example # 1
Question. Apply the divisive method to the five objects with the dissimilarity matrix given below.
P1 P2 P3 P4 P5
P1 0 4 4 5 7
P2 4 0 4 3 3
P3 4 4 0 5 7
P4 5 3 5 0 2
P5 7 3 7 2 0
Which point is most dissimilar? We take the average distance from each point to the other points:
1 2 3 4 5
5.00 3.50 5.00 3.75 4.75
Since points 1 and 3 are tied for the most dissimilar, we pick one of them arbitrarily; here we use point 1. We then have A = {2,3,4,5} and B = {1}.
Now we want to move into B any points that are closer to B than to (the other points in) A. So for each point x in A we compute the difference d(x, A−{x}) − d(x, B), where each term is an average distance. For example, for point 2 we compute
[ d(2,3) + d(2,4) + d(2,5) ] / 3 - d(2,1)
= 10/3 - 4 = -2/3
and for point 3 we compute
[ d(3,2) + d(3,4) + d(3,5) ] / 3 - d(3,1)
= 16/3 - 4 = 4/3
All of the differences are:
2 3 4 5
-2/3 4/3 -5/3 -3
The difference for point 3 is positive, so point 3 is closer to B than to the rest of A, and we move it to B. Now A = {2,4,5} and B = {1,3}. Recomputing the differences for the points remaining in A gives:
2 4 5
-1.0 -2.5 -4.5
All are negative (that is, the remaining points in A are closer to A than to B), so we stop this division and we have the two clusters {2,4,5} and {1,3}.
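The computations of this first split can be reproduced with a short script (a sketch, assuming NumPy; the indices are 0-based, so points 1-5 become 0-4):

import numpy as np

# Dissimilarity matrix of the five objects from the example.
D = np.array([[0, 4, 4, 5, 7],
              [4, 0, 4, 3, 3],
              [4, 4, 0, 5, 7],
              [5, 3, 5, 0, 2],
              [7, 3, 7, 2, 0]], dtype=float)

def splinter_split(D, cluster):
    """Split one cluster into A and B using the splinter procedure shown above."""
    A = list(cluster)
    # seed B with the object having the largest average dissimilarity to the others
    avg = {x: np.mean([D[x, y] for y in A if y != x]) for x in A}
    seed = max(avg, key=avg.get)
    A.remove(seed)
    B = [seed]
    while len(A) > 1:
        # D_x = mean distance to the rest of A minus mean distance to B
        diffs = {x: np.mean([D[x, y] for y in A if y != x]) - np.mean([D[x, y] for y in B])
                 for x in A}
        x, d = max(diffs.items(), key=lambda kv: kv[1])
        if d <= 0:
            break                                # everyone left is closer to A than to B
        A.remove(x)
        B.append(x)
    return A, B

A, B = splinter_split(D, range(5))
print(A, B)   # expected: [1, 3, 4] and [0, 2], i.e. the clusters {2,4,5} and {1,3}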
__________________________________________________________________________________
For the next step, we choose the cluster with the largest diameter, that is, the cluster with the greatest distance between any two of its points:
diameter({1,3}) = d(1,3) = 4
diameter({2,4,5}) = max(d(2,4), d(2,5), d(4,5)) = max(3, 3, 2) = 3
So cluster {1,3} has the largest diameter. Trivially, it is split into {1} and {3}. So now we have the clusters {2,4,5}, {1} and {3}.
At the next step, we must split the cluster {2,4,5}. As above, we find the point with the largest average distance to the rest of the points. The average distances are
2 4 5
3 5/2 5/2
so point 2 is the most dissimilar.
We split {2,4,5} into A = {4,5} and B = {2}, and check whether any additional points should be moved into B. For point 4 the difference is d(4,5) − d(4,2) = 2 − 3 = −1, and for point 5 it is d(5,4) − d(5,2) = 2 − 3 = −1. Both are negative, so no additional points are moved. We therefore split {2,4,5} into {2} and {4,5}, giving the partition of the full data set {1}, {2}, {3}, {4,5}.
Finally, we divide the last cluster to get {1}, {2}, {3}, {4}, {5}. Thus, the full hierarchy looks like this: