20 - 1 - ML - Unsup - 01 - Partition Based - Kmeans
• Python implementation
• Use cases
K-MEANS CLUSTERING
11/13/2024
SCIKIT-LEARN – CHEAT SHEET
WHAT IT IS
• K-means is an unsupervised, partition-based clustering algorithm: it divides a dataset into K clusters, each represented by the mean (centroid) of its members.
K-MEANS ALGORITHM PROPERTIES
• Every member of a cluster is closer to its own cluster's centroid than to any other cluster's centroid
THE K-MEANS ALGORITHM PROCESS
1. The dataset is partitioned into K clusters and the data points are randomly assigned to the clusters.
2. For each data point:
• Calculate the distance from the data point to each cluster's centroid.
• If the data point is closest to its own cluster, leave it where it is; otherwise, move it into the closest cluster.
3. Repeat step 2 until a complete pass through all the data points results in no data point moving from one cluster to another. At this point the clusters are stable and the clustering process ends.
• Note: the choice of initial partition can greatly affect the final clusters that result, in terms of inter-cluster and intra-cluster distances.
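The loop above can be sketched in plain Python with NumPy (a minimal illustration of the process, not scikit-learn's implementation; the function name and parameters are mine):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal k-means: random initial centroids, then alternate
    point assignment and centroid update until assignments stop changing."""
    rng = np.random.default_rng(seed)
    # Pick k distinct data points as the starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(max_iter):
        # Euclidean distance of every point to every centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # stable: no point moved to another cluster
        labels = new_labels
        # Move each centroid to the mean of its assigned points
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

# The eight points from the worked example that follows
X = np.array([[2, 10], [2, 5], [8, 4], [5, 8],
              [7, 5], [6, 4], [1, 2], [4, 9]], dtype=float)
labels, centroids = kmeans(X, k=3)
```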
PROXIMITY BETWEEN CLUSTERS
1. Single Linkage – distance between the closest pair of points, one from each cluster
2. Complete Linkage – distance between the farthest pair of points, one from each cluster
3. Average Linkage – average of the pairwise distances between points in the two clusters
DISTANCE CALCULATION
• Common distance measures for k-means include Euclidean (straight-line) distance and rectilinear (Manhattan) distance; the worked example that follows uses the rectilinear method.
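The two measures can be sketched as follows (a minimal illustration; the function names are mine):

```python
import numpy as np

def euclidean(p, q):
    # Straight-line distance: square root of the sum of squared differences
    return np.sqrt(np.sum((np.asarray(p) - np.asarray(q)) ** 2))

def rectilinear(p, q):
    # Manhattan / city-block distance: sum of absolute differences
    return np.sum(np.abs(np.asarray(p) - np.asarray(q)))

print(rectilinear((2, 10), (6, 6)))  # 8, as in the worked example
print(euclidean((2, 10), (6, 6)))
```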
HOW IT WORKS
Data Point    X    Y
A1            2    10
A2            2    5
A3            8    4
A4            5    8
A5            7    5
A6            6    4
A7            1    2
A8            4    9

• Group the data points into 3 clusters
• Randomly decide on starting center points of the 3 clusters
HOW IT WORKS
Iteration 1
Initial centroids: C1 = (2, 10), C2 = (6, 6), C3 = (1.5, 3.5)
HOW IT WORKS
Iteration 2
Centroids adjusted from the previous iteration: C1 = (2, 10), C2 = (6, 6), C3 = (1.5, 3.5)

Data Point    Distance from c1    Distance from c2    Distance from c3    Cluster Assignment
A1 (2, 10)    0                   8                   7                   c1
A2 (2, 5)     5                   5                   2                   c3
A3 (8, 4)     12                  4                   7                   c2
A4 (5, 8)     5                   3                   8                   c2
A5 (7, 5)     10                  2                   7                   c2
A6 (6, 4)     10                  2                   5                   c2
A7 (1, 2)     9                   9                   2                   c3
A8 (4, 9)     3                   5                   8                   c1

• Distance calculation using the rectilinear method; each point is assigned to the cluster with the minimum distance.
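The assignment column above can be reproduced directly (rectilinear distance of each point to the three centroids, then nearest-centroid assignment; the dictionary names are mine):

```python
# The eight points and the three centroids from the worked example
points = {"A1": (2, 10), "A2": (2, 5), "A3": (8, 4), "A4": (5, 8),
          "A5": (7, 5), "A6": (6, 4), "A7": (1, 2), "A8": (4, 9)}
centroids = {"c1": (2, 10), "c2": (6, 6), "c3": (1.5, 3.5)}

assignment = {}
for name, (x, y) in points.items():
    # Rectilinear (Manhattan) distance to each centroid
    d = {c: abs(x - cx) + abs(y - cy) for c, (cx, cy) in centroids.items()}
    # Assign to the cluster with the minimum distance
    assignment[name] = min(d, key=d.get)

print(assignment)
# A1->c1, A2->c3, A3->c2, A4->c2, A5->c2, A6->c2, A7->c3, A8->c1
```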
HOW IT WORKS
Iteration 3
Centroids adjusted from the previous iteration: C1 = (3, 9.5), C2 = (6.5, 5.25), C3 = (1.5, 3.5)

Data Point    Distance from c1    Distance from c2    Distance from c3    Previous Assignment    New Assignment
A1 (2, 10)    0                   8                   7                   c1                     c1
A2 (2, 5)     5                   5                   2                   c3                     c3
A3 (8, 4)     12                  4                   7                   c2                     c2
A4 (5, 8)     5                   3                   8                   c2                     c2
A5 (7, 5)     10                  2                   7                   c2                     c2
A6 (6, 4)     10                  2                   5                   c2                     c2
A7 (1, 2)     9                   9                   2                   c3                     c3
A8 (4, 9)     3                   5                   8                   c1                     c1

• Distance calculation using the rectilinear method; each point is assigned to the cluster with the minimum distance.
• No assignment changed from the previous iteration, so the clusters are stable and the clustering process ends.
KMEANS VISUALIZATION
• http://shabal.in/visuals/kmeans/4.html
PRACTICAL APPLICATIONS
• Customer Segmentation:
Pricing Segmentation
Loyalty
Spend Behaviour
Branch Geo
Customer Need needs, channel of preferences, service expectations.
Category Who are these customers?
Why are they behaving the way to?
Customer Service Customer Value in last 6/12/18/24 months
Customer Type – Individuals and Small Businesses
Product type (e.g. Gas, Electricity etc)
Length of Relationship
Overall consumption
Number of complains
News Article Clustering
DISADVANTAGES
• The number of clusters K must be chosen in advance
• Results depend on the choice of initial centroids
• Works best for roughly spherical, similarly sized clusters
WHEN K-MEANS CLUSTERING FAILS
• (-) The examples and illustrations we see in statistics courses are designed to reflect ideal situations that sadly almost never occur in the real world.
• (+) K-means works best when the data forms roughly "spherical" clusters, as in the toy data set.
• (+) Even so, it remains one of the most widely used unsupervised machine learning algorithms.
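This failure mode can be demonstrated with scikit-learn's toy data generators (a sketch, assuming scikit-learn is installed; make_blobs gives roughly spherical clusters, make_moons gives two interlocking half-moons that k-means cannot follow):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs, make_moons

# Roughly spherical blobs: k-means recovers the grouping well
X_blob, y_blob = make_blobs(n_samples=300, centers=3, random_state=0)
blob_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_blob)

# Two half-moons: the clusters are not spherical, so k-means
# cuts straight across the moons instead of following their shape
X_moon, y_moon = make_moons(n_samples=300, noise=0.05, random_state=0)
moon_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_moon)
```

Comparing the predicted labels against the true generator labels (e.g. with adjusted_rand_score) shows much better agreement on the blobs than on the moons.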
USE CASES
KMEANS – SCIKIT-LEARN
• Attributes:
• cluster_centers_ : array, [n_clusters, n_features], Coordinates of cluster centers. If the algorithm stops before fully converging, these will not be consistent with labels_.
• labels_ : Labels of each point
• inertia_ : float, Sum of squared distances of samples to their closest cluster center.
• n_iter_ : int, Number of iterations run.
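These attributes can be inspected after fitting; a small sketch using the eight points from the worked example (the parameter values here are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[2, 10], [2, 5], [8, 4], [5, 8],
              [7, 5], [6, 4], [1, 2], [4, 9]], dtype=float)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(km.cluster_centers_)  # (3, 2) array of centroid coordinates
print(km.labels_)           # cluster index for each of the 8 points
print(km.inertia_)          # sum of squared distances to closest center
print(km.n_iter_)           # number of iterations actually run
```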
K-MEANS++
• Options for initializing the cluster centers (the init parameter in scikit-learn):
• 'k-means++' – spreads the initial cluster centers out from one another (the default)
• 'random' – random choice of points from the samples
• user-provided points (vectors)
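A sketch of the three initialization options via scikit-learn's init parameter (the data and the seed points passed in the user-provided case are arbitrary):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[2, 10], [2, 5], [8, 4], [5, 8],
              [7, 5], [6, 4], [1, 2], [4, 9]], dtype=float)

# 'k-means++' (the default): spreads the initial centers apart
km_pp = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0).fit(X)

# 'random': initial centers chosen at random from the samples
km_rand = KMeans(n_clusters=3, init="random", n_init=10, random_state=0).fit(X)

# User-provided points: an ndarray of shape (n_clusters, n_features);
# n_init=1 because the starting centers are fixed
seeds = np.array([[2, 10], [5, 8], [1, 2]], dtype=float)
km_user = KMeans(n_clusters=3, init=seeds, n_init=1).fit(X)
```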