
Hierarchical Clustering

Hierarchical clustering is an unsupervised machine learning algorithm used to group unlabeled datasets into clusters; it is also known as hierarchical cluster analysis (HCA). The algorithm builds the hierarchy of clusters in the form of a tree, and this tree-shaped structure, which represents the clusters visually, is known as a dendrogram. A representation of a dendrogram is shown in the figure below:

Sometimes the results of K-means clustering and hierarchical clustering may look similar, but the two methods differ in how they work. In particular, hierarchical clustering has no requirement to predetermine the number of clusters, as we did in the K-means algorithm.

Hierarchical clustering has a couple of key benefits:

1. There is no need to pre-specify the number of clusters. Instead, the dendrogram can be cut at the appropriate level to obtain the desired number of clusters.

2. Data is easily summarized/organized into a hierarchy using dendrograms. Dendrograms make it easy to examine and interpret clusters.

Applications

There are many real-life applications of hierarchical clustering. They include:

● Bioinformatics: grouping animals according to their biological features to reconstruct phylogenetic trees.
● Business: dividing customers into segments or forming a hierarchy of employees based on salary.
● Image processing: grouping handwritten characters in text recognition based on the similarity of the character shapes.
● Information retrieval: categorizing search results based on the query.

Hierarchical clustering types

There are two main types of hierarchical clustering:

● Agglomerative: Agglomerative clustering is a bottom-up approach in which the algorithm starts by treating every data point as a single cluster and merges clusters until only one cluster is left. According to a particular procedure, the clusters are merged step by step until a single cluster remains; at the end of the merging process, a cluster containing all the elements is formed.
● Divisive: The divisive algorithm is the reverse of the agglomerative algorithm, as it is a top-down approach. Initially, all objects are placed in a single cluster. The division process is then performed step by step until each object forms its own cluster. The splitting procedure is carried out according to some principle, such as maximizing the distance between neighboring objects within a cluster.

Why hierarchical clustering?

Since we already have other clustering algorithms such as K-means, why do we need hierarchical clustering? As we have seen, K-means has some challenges: the number of clusters must be predetermined, and it always tries to create clusters of the same size. To address these two challenges, we can opt for the hierarchical clustering algorithm, because it does not require prior knowledge of the number of clusters.

How Does Agglomerative Hierarchical Clustering Work?

The working of the AHC algorithm can be explained using the below steps:

Step-1: Treat each data point as a single cluster. Let's say there are N data points, so the number of clusters will also be N.

Step-2: Take the two closest data points or clusters and merge them to form one cluster. So, there will now be N-1 clusters.

Step-3: Again, take the two closest clusters and merge them together to form one cluster. There will
be N-2 clusters.
Step-4: Repeat Step 3 until only one cluster is left. So, we will get the following clusters. Consider the images below:

Step-5: Once all the clusters are combined into one big cluster, develop the dendrogram to divide the
clusters as per the problem.

Working procedure of Hierarchical clustering

Hierarchical clustering employs a measure of distance/similarity to create new clusters. Steps for
Agglomerative clustering can be summarized as follows:

Step 1: Compute the proximity matrix using a particular distance metric

Step 2: Each data point is assigned to a cluster

Step 3: Merge the clusters based on a metric for the similarity between clusters

Step 4: Update the distance matrix

Step 5: Repeat Step 3 and Step 4 until only a single cluster remains
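
In practice, these steps are usually carried out by a library routine. As an illustration, the short sketch below uses scikit-learn's AgglomerativeClustering on a small made-up dataset; the points, the choice of single linkage, and the number of clusters are assumptions for demonstration only.

# Minimal agglomerative clustering sketch with scikit-learn.
# The data points, linkage choice, and cluster count are illustrative assumptions.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[1, 2], [1, 4], [2, 3],      # a small made-up 2-D dataset
              [8, 7], [8, 8], [9, 7],
              [4, 9], [5, 9], [5, 8]])

# Euclidean distance (the default) with single linkage; request 3 flat clusters
model = AgglomerativeClustering(n_clusters=3, linkage="single")
labels = model.fit_predict(X)
print(labels)   # one cluster label per data point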

Computing a proximity matrix

The first step of the algorithm is to create a distance matrix. The values of the matrix are calculated by applying a distance function to each pair of objects. The Euclidean distance function is commonly used for this operation. For a data set with 'n' elements, the structure of the proximity matrix will be as follows. Here, d(pi, pj) represents the distance between pi and pj.
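
As a small sketch of this step (the three sample points are an assumption for illustration), SciPy's pdist and squareform can be used to build the full proximity matrix:

# Computing an n x n Euclidean proximity (distance) matrix with SciPy.
# The sample points are an illustrative assumption.
import numpy as np
from scipy.spatial.distance import pdist, squareform

points = np.array([[0.40, 0.53],
                   [0.22, 0.38],
                   [0.35, 0.32]])

# pdist returns the condensed pairwise distances; squareform expands them
# into the full symmetric n x n matrix with zeros on the diagonal.
proximity = squareform(pdist(points, metric="euclidean"))
print(np.round(proximity, 2))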

Similarity between Clusters

The main question in hierarchical clustering is how to calculate the distance between clusters and
update the proximity matrix. There are many different approaches used to answer that question. Each
approach has its advantages and disadvantages. The choice will depend on whether there is noise in
the data set, whether the shape of the clusters is circular or not, and the density of the data points.
How the distance between two clusters is measured is crucial for hierarchical clustering. There are various ways to calculate the distance between two clusters, and the choice determines the rule for clustering. These measures are called linkage methods.

A numerical example will help illustrate the methods and choices. We'll use a small sample data set containing just nine two-dimensional points, displayed in the figure below.

Suppose we have two clusters in the sample data set, as shown in Figure 2. There are different approaches to calculating the distance between the clusters. Popular methods are listed below.

Min (Single) Linkage

One way to measure the distance between clusters is to find the minimum distance between points in
those clusters. That is, we can find the point in the first cluster nearest to a point in the other cluster
and calculate the distance between those points. In Figure 2, the closest points are p2 in one cluster
and p5 in the other. The distance between those points, and hence the distance between clusters, is
found as d(p2,p5)=4.
The advantage of the Min method is that it can accurately handle non-elliptical shapes. The disadvantage is that it is sensitive to noise and outliers.

Max (Complete) Linkage

Another way to measure the distance is to find the maximum distance between points in two clusters.
We can find the points in each cluster that are furthest away from each other and calculate the distance
between those points. In Figure 3, the maximum distance is between p1 and p6. The distance between those two points, and hence the distance between clusters, is found as d(p1,p6)=10.

Max is less sensitive to noise and outliers than the Min method. However, Max can break large clusters and tends to be biased towards globular clusters.

Centroid Linkage

The Centroid method defines the distance between clusters as being the distance between their
centers/centroids. After calculating the centroid for each cluster, the distance between those centroids
is computed using a distance function.

Average Linkage
The Average method defines the distance between clusters as the average pairwise distance over all pairs of points with one point taken from each cluster. For simplicity, only some of the lines connecting pairs of points are shown in Figure 6.
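
To make the comparison concrete, the sketch below runs SciPy's linkage function with each of the methods described above on a hypothetical nine-point dataset; the random points are an assumption for illustration and are not the data from the figures.

# Comparing linkage methods (Min/single, Max/complete, average, centroid) with SciPy.
# The nine random 2-D points are an illustrative assumption.
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(0)
points = rng.random((9, 2))

for method in ("single", "complete", "average", "centroid"):
    Z = linkage(points, method=method)        # Euclidean distances by default
    i, j, dist = int(Z[0, 0]), int(Z[0, 1]), Z[0, 2]
    print(f"{method:9s} first merge: points {i} and {j} at distance {dist:.3f}")

The linkage matrix Z records every merge, so the same output can also be inspected to see how the merge order differs between the methods.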

Working of the Dendrogram in Hierarchical Clustering

The dendrogram is a tree-like structure that records each merge step performed by the HC algorithm. In a dendrogram plot, the y-axis shows the Euclidean distances between the data points, and the x-axis shows all the data points of the given dataset.

The working of the dendrogram can be explained using the diagram below:

In the above diagram, the left part shows how clusters are created in agglomerative clustering, and the right part shows the corresponding dendrogram.

o As discussed above, the data points P2 and P3 first combine together and form a cluster; correspondingly, a dendrogram is created which connects P2 and P3 with a rectangular shape. The height is determined by the Euclidean distance between the data points.

o In the next step, P5 and P6 form a cluster, and the corresponding dendrogram is created. It is higher than the previous one, as the Euclidean distance between P5 and P6 is slightly greater than that between P2 and P3.

o Again, two new dendrograms are created that combine P1, P2, and P3 in one dendrogram, and P4, P5, and P6 in another dendrogram.

o At last, the final dendrogram is created that combines all the data points together.

We can cut the dendrogram tree structure at any level, as per our requirement.
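
A minimal sketch of building and cutting a dendrogram with SciPy (the six sample points and the cut height of 2.0 are assumptions chosen only for illustration):

# Building a dendrogram and cutting it at a chosen height.
# The sample points and the cut threshold are illustrative assumptions.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

points = np.array([[1.0, 1.0], [1.2, 1.1],
                   [5.0, 5.0], [5.1, 5.2],
                   [9.0, 1.0], [9.2, 1.1]])
Z = linkage(points, method="single")   # merge history used by the dendrogram

dendrogram(Z, labels=["P1", "P2", "P3", "P4", "P5", "P6"])
plt.ylabel("Euclidean distance")
plt.show()

# "Cutting" the tree at height 2.0 yields the flat clusters at that level
labels = fcluster(Z, t=2.0, criterion="distance")
print(labels)   # one cluster label per point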

Example # 1

Question. Find the clusters using a single link technique. Use Euclidean distance and draw the
dendrogram.

Sample No.    X      Y
P1            0.40   0.53
P2            0.22   0.38
P3            0.35   0.32
P4            0.26   0.19
P5            0.08   0.41
P6            0.45   0.30

Solution:

Step 1: Compute the distance matrix by finding the Euclidean distance between each pair of points:

Note that the matrix is symmetric about its diagonal: the distances above and below the diagonal are the same. For example, d(P2, P5) is equal to d(P5, P2). So we only need to fill in the lower section of the matrix.

Step 2: Find the minimum element in the distance matrix and merge the two closest clusters. Here the minimum value is 0.10, so we combine P3 and P6 (as 0.10 appears in the P6 row and P3 column). Now form a cluster of the elements corresponding to the minimum value and update the distance matrix. For example, the distance from the new cluster (P3,P6) to P1 is

d((P3,P6), P1) = min(d(P3,P1), d(P6,P1)) = min(0.22, 0.24) = 0.22
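
The distance matrix and the single-link merge order in this example can be checked with SciPy; a minimal sketch (the P1-P6 comments only map rows to the table above):

# Reproducing Example # 1: Euclidean distance matrix and single-link merges.
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist, squareform

points = np.array([[0.40, 0.53],   # P1
                   [0.22, 0.38],   # P2
                   [0.35, 0.32],   # P3
                   [0.26, 0.19],   # P4
                   [0.08, 0.41],   # P5
                   [0.45, 0.30]])  # P6

D = squareform(pdist(points))          # full 6x6 distance matrix
print(np.round(D, 2))                  # smallest off-diagonal entry is d(P3,P6) ~ 0.10

Z = linkage(points, method="single")   # single-link merge history
print(np.round(Z, 2))                  # first row merges indices 2 and 5, i.e. P3 and P6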


Divisive Clustering

Divisive clustering works in just the opposite way to agglomerative clustering. It starts by placing all the data points in one big cluster and then repeatedly splits them into smaller clusters until every data point is in its own cluster. Thus, divisive methods are good at identifying large clusters. Divisive clustering follows a top-down approach and can be more efficient than agglomerative clustering, but due to its implementation complexity it does not have a predefined implementation in any of the major machine learning frameworks.

The divisive method starts at the top and, at each level, recursively splits one of the existing clusters at that level into two new clusters. If there are N observations in the dataset, the divisive method will also produce N − 1 levels in the hierarchy.

The split is chosen to produce two new groups with the largest "between-group dissimilarity". An example of the divisive method is shown in the figure below. Each nonterminal node has two daughter nodes; the two daughters represent the two groups resulting from the split of the parent.

In the above figure,

● The data points 1, 2, ..., 6 are assigned to one large cluster.
● After calculating the proximity matrix, the points are split into separate clusters based on their dissimilarity.
● The proximity matrix is recomputed and the splitting repeats until each point is assigned to an individual cluster.

The proximity matrix and linkage function follow the same procedure as in agglomerative clustering. Because divisive clustering is not widely used, there is no predefined class/function for it in the standard Python libraries.

Divisive clustering algorithms begin with the entire data set as a single cluster,
and recursively divide one of the existing clusters into two daughter clusters at
each iteration in a top-down fashion. To apply this procedure, we need a
separate algorithm to divide a given dataset into two clusters. The divisive
algorithm may be implemented by using the k-means algorithm with k = 2 to
perform the splits at each iteration. However, it would not necessarily produce a
splitting sequence that possesses the monotonicity property required for
dendrogram representation.
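
A minimal sketch of this idea, using scikit-learn's KMeans with k = 2 to perform each split; the rule of always splitting the largest remaining cluster is a simplifying assumption made here, not part of the text above.

# Divisive (bisecting) clustering sketch: repeatedly split a cluster with k-means, k = 2.
# Always splitting the largest remaining cluster is a simplifying assumption.
import numpy as np
from sklearn.cluster import KMeans

def bisecting_split(X, n_clusters):
    clusters = [np.arange(len(X))]          # start with one cluster holding all indices
    while len(clusters) < n_clusters:
        idx = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        members = clusters.pop(idx)         # take the largest cluster
        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[members])
        clusters.append(members[labels == 0])
        clusters.append(members[labels == 1])
    return clusters

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, (10, 2)),   # two well-separated blobs (illustrative)
               rng.normal(3.0, 0.3, (10, 2))])
for c in bisecting_split(X, 2):
    print(sorted(c.tolist()))

Recent versions of scikit-learn also ship a BisectingKMeans estimator that follows a similar strategy, which can be used instead of hand-rolling the loop above.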

DIANA (Divisive Analysis) is a divisive hierarchical clustering technique. Here is an outline of the algorithm.

Step 1. Suppose that cluster Cl is going to be split into clusters Ci and Cj .

Step 2. Let Ci = Cl and Cj = ∅ .

Step 3. For each object x ∈ Ci:

(a) For the first iteration, compute the average distance of x to all other objects.

(b) For the remaining iterations, compute

Dx = average{ d(x, y) : y ∈ Ci, y ≠ x } − average{ d(x, y) : y ∈ Cj }.


Step 4. (a) For the first iteration, move the object with the maximum average
distance to Cj .

(b) For the remaining iterations, find an object x in Ci for which Dx is the
largest. If Dx > 0 then move x to Cj .

Step 5. Repeat Steps 3(b) and 4(b) until all differences Dx are negative. Then Cl
is split into Ci and Cj .

Step 6. Select the cluster with the largest diameter. (The diameter of a cluster is the largest dissimilarity between any two of its objects.) Then divide this cluster, following Steps 1-5.

Step 7. Repeat Step 6 until all clusters contain only a single object.
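
A minimal Python sketch of this outline, written directly from Steps 1-7 above; it works on a precomputed symmetric distance matrix and is meant as an illustration rather than a library-quality implementation.

# DIANA-style divisive clustering sketch, following the outline above.
# Operates on a precomputed symmetric distance matrix D (illustrative code).
import numpy as np

def avg_dist(D, x, group):
    # Average distance from object x to the objects in `group`, excluding x itself.
    others = [y for y in group if y != x]
    return D[x, others].mean() if others else 0.0

def split_cluster(D, cluster):
    # Steps 1-5: split one cluster Cl into Ci and Cj.
    Ci, Cj = list(cluster), []
    # First iteration: move the object with the largest average distance.
    seed = max(Ci, key=lambda x: avg_dist(D, x, Ci))
    Ci.remove(seed); Cj.append(seed)
    # Remaining iterations: move objects while some D_x is positive.
    while len(Ci) > 1:
        best, best_dx = None, 0.0
        for x in Ci:
            dx = avg_dist(D, x, Ci) - avg_dist(D, x, Cj)
            if dx > best_dx:
                best, best_dx = x, dx
        if best is None:
            break
        Ci.remove(best); Cj.append(best)
    return Ci, Cj

def diana(D):
    # Steps 6-7: repeatedly split the cluster with the largest diameter.
    diameter = lambda c: max(D[i, j] for i in c for j in c) if len(c) > 1 else 0.0
    clusters = [list(range(len(D)))]
    while any(len(c) > 1 for c in clusters):
        target = max(clusters, key=diameter)
        clusters.remove(target)
        clusters.extend(split_cluster(D, target))
        print([sorted(c) for c in clusters])   # show the partition after each split
    return clusters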

Example # 1

Throughout the process, we need the distance matrix.

     1   2   3   4   5
1    0   4   4   5   7
2    4   0   4   3   3
3    4   4   0   5   7
4    5   3   5   0   2
5    7   3   7   2   0

The top level cluster contains all points {1,2,3,4,5}.

Which point is most dissimilar? We take the average distance from each point to the other points.

For point 1 we compute

[ d(1,2) + d(1,3) + d(1,4) + d(1,5) ] / 4 = (4 + 4 + 5 + 7) / 4 = 5

For point 2 we compute

[ d(2,1) + d(2,3) + d(2,4) + d(2,5) ] / 4 = (4 + 4 + 3 + 3) / 4 = 3.5

The average distances for all points are:

Point:             1     2     3     4     5
Average distance:  5.00  3.50  5.00  3.75  4.75
Since points 1 and 3 are tied for the most dissimilar, we pick one of them arbitrarily; we will use point 1. So we have A = {2,3,4,5} and B = {1}.
Now we want to move into B any points that are closer to B than to (the other points in) A. So for each point x in A we compute d(x, A-x) - d(x, B). For example, for point 2 we compute

[ d(2,3) + d(2,4) + d(2,5) ] / 3 - d(2,1) = 10/3 - 4 = -2/3

and for point 3 we compute

[ d(3,2) + d(3,4) + d(3,5) ] / 3 - d(3,1) = 16/3 - 4 = 4/3

All of the differences are:

Point:  2     3    4     5
D_x:   -2/3  4/3  -5/3  -3

Only point 3 has a positive difference, so we move it to cluster B. We now have A = {2,4,5} and B = {1,3}. We check whether any additional points should be moved. Again, we compute d(x, A-x) - d(x, B) for each point in A. The differences are:

Point:  2     4     5
D_x:   -1.0  -2.5  -4.5

All are negative (that is, the remaining points in A are closer to A than to B), so we stop this division and we have the two clusters {2,4,5} and {1,3}.
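
These hand computations can be checked with a few lines of NumPy (the matrix below is simply the distance matrix of this example, with the 0-based indices in the code mapped back to the 1-based point labels in the printout):

# Checking the first DIANA split of the example with NumPy.
import numpy as np

D = np.array([[0, 4, 4, 5, 7],
              [4, 0, 4, 3, 3],
              [4, 4, 0, 5, 7],
              [5, 3, 5, 0, 2],
              [7, 3, 7, 2, 0]], dtype=float)

# Average distance of each point to the other four points
print(D.sum(axis=1) / 4)             # -> [5.   3.5  5.   3.75 4.75]

# D_x for each point in A = {2,3,4,5} against B = {1}
A, B = [1, 2, 3, 4], [0]             # 0-based indices
for x in A:
    to_A = np.mean([D[x, y] for y in A if y != x])
    to_B = np.mean([D[x, y] for y in B])
    print(f"point {x + 1}: D_x = {to_A - to_B:+.3f}")   # only point 3 is positive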
__________________________________________________________________________________

For the next step, we choose the cluster with the largest diameter, that is the
cluster with the greatest distance between two points in the cluster.

diameter({1,3}) = d(1,3) = 4

diameter({2,4,5}) = max(d(2,4), d(2,5), d(4,5))

= max(3, 3, 2) = 3

So cluster {1,3} has the largest diameter. Trivially, this will be split into {1} and
{3}. So now we have clusters {2,4,5}, {1} and {3}.

At the next step, we must split the cluster {2,4,5}. As above, we find the point with the largest average distance to the rest of the points. For example, we compute

point 2: [ d(2,4) + d(2,5) ] / 2 = (3 + 3) / 2 = 3

The averages for all three points are:

Point:             2    4     5
Average distance:  3    5/2   5/2

so point 2 is the most dissimilar.

We split {2,4,5} into A={4,5} and B={2}. We need to check if some additional
points should be moved into the set B.

d(4,5) - d(4,2) = 2 - 3 = -1

d(5,4) - d(5,2) = 2 - 3 = -1

So no additional points should be moved. We split {2,4,5} into {2} and {4,5}, giving us the partition of the full data set: {1}, {2}, {3}, {4,5}.

Finally, we divide the last cluster to get {1}, {2}, {3}, {4}, {5}. Thus, the full hierarchy looks like this:
