20 - 1 - ML - Unsup - 01 - Partition Based - Kmeans


K-MEANS CLUSTERING

• What it is
• Python implementation
• Use cases

SCIKIT-LEARN – CHEAT SHEET

WHAT IT IS

• K-means clustering is a simple unsupervised learning algorithm used to group/cluster n objects into k partitions, where k < n, based on their attributes.

• It follows a simple procedure: a given data set is partitioned into a number of clusters, denoted by the letter "k", which is fixed beforehand.

• It groups data using a "top-down" approach, since it starts with a predefined number of clusters and assigns every observation to one of them.

• There are no overlaps between the groups; each observation is assigned to exactly one cluster.

• The approach is computationally faster and can handle larger numbers of observations than agglomerative hierarchical clustering.

• K-means clustering has uses in search engines, market segmentation, statistics and even astronomy.
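As a quick taste of the Python implementation covered later in this deck, here is a minimal scikit-learn sketch; the six sample points are invented for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0],      # one loose group of points
              [10, 2], [10, 4], [10, 0]])  # another loose group

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster index assigned to each point
print(kmeans.cluster_centers_)  # coordinates of the 2 cluster centers
```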
EXAMPLE

Different movie genres appeal to people of different personalities. To confirm this, construct a plot of movie titles along personality dimensions. There appear to be 3 clusters:

• Red: Conscientious extraverts who like action and romance genres

• Blue: Anxious and open people who like avant-garde and fantasy genres

• Yellow: Introverts with social anxieties who like Japanese animation (otaku culture)

• Movies in the center seem to be general household favorites.

K-MEANS ALGORITHM PROPERTIES

• There are always K clusters.

• There is always at least one item in each cluster.

• The clusters are non-hierarchical and they do not overlap.

• Every member of a cluster is closer to its own cluster's center than to any other cluster's center.

THE K-MEANS ALGORITHM PROCESS

1. The data set is partitioned into K clusters and the data points are randomly assigned to the clusters.
2. For each data point:
   • Calculate the distance from the data point to each cluster center.
   • If the data point is closest to its own cluster, leave it where it is; otherwise, move it into the closest cluster.
3. Repeat step 2 until a complete pass through all the data points results in no data point moving from one cluster to another. At this point the clusters are stable and the clustering process ends.

• The choice of initial partition can greatly affect the final clusters that result, in terms of inter-cluster and intra-cluster distances. A minimal Python sketch of this loop follows below.
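The sketch below implements the process above in the standard centers-based (Lloyd) formulation: assign every point to its nearest center, re-average the centers, and stop when nothing moves. The function name and random initialization are illustrative choices, not part of the slides:

```python
import numpy as np

def kmeans(points, k, rng=np.random.default_rng(0), max_iter=100):
    """Assign points to nearest center, re-average centers, repeat until stable."""
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Squared Euclidean distance from every point to every center.
        d = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # Re-average each cluster; assumes no cluster goes empty (fine for a sketch).
        new_centers = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):   # no point moved: clusters are stable
            break
        centers = new_centers
    return labels, centers
```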

PROXIMITY BETWEEN CLUSTERS

1. Single linkage – distance between the two nearest data patterns in the clusters

2. Complete linkage – distance between the two farthest data patterns

3. Average linkage – average of the distances between all pairs of patterns, one from each cluster

4. Average group (centroid) linkage – distance between the centroids (means) of the two clusters

Python sketches of these four measures follow below.
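A sketch of the four proximity measures between two clusters of points, using Euclidean distance via SciPy; the two point sets are invented for illustration:

```python
import numpy as np
from scipy.spatial.distance import cdist

def linkages(a, b):
    """Four cluster-to-cluster proximity measures for point sets a and b."""
    d = cdist(a, b)                                  # pairwise Euclidean distances
    return {
        "single":   d.min(),                         # nearest pair across clusters
        "complete": d.max(),                         # farthest pair across clusters
        "average":  d.mean(),                        # mean over all cross-cluster pairs
        "centroid": np.linalg.norm(a.mean(axis=0) - b.mean(axis=0)),
    }

print(linkages(np.array([[0., 0.], [1., 0.]]),
               np.array([[4., 3.], [5., 3.]])))
```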

DISTANCE CALCULATION

• The Euclidean distance function measures the 'as-the-crow-flies' distance. The formula for this distance between a point X (X1, X2, etc.) and a point Y (Y1, Y2, etc.) is:

d(X, Y) = sqrt( (X1 - Y1)^2 + (X2 - Y2)^2 + ... + (Xn - Yn)^2 )
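The same formula in Python, checked on two of the example points used in the next slides:

```python
import math

def euclidean(x, y):
    """'As-the-crow-flies' distance between equal-length coordinate sequences."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

print(euclidean((2, 10), (5, 8)))  # sqrt(3^2 + 2^2), approximately 3.61
```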

HOW IT WORKS

Group the following data points into 3 clusters. Randomly decide on the starting center points of the 3 clusters.

Data point     x    y
A1             2   10
A2             2    5
A3             8    4
A4             5    8
A5             7    5
A6             6    4
A7             1    2
A8             4    9
HOW IT WORKS
Iteration 1

Initial centroids, assigned randomly: C1 = (2, 10), C2 = (5, 8), C3 = (1, 2)

Distances are calculated with the rectilinear (Manhattan) method and each point is assigned to the cluster at minimum distance:

Data point    Dist. to C1   Dist. to C2   Dist. to C3   Assigned
A1 (2, 10)         0             5             9            C1
A2 (2, 5)          5             6             4            C3
A3 (8, 4)         12             7             9            C2
A4 (5, 8)          5             0            10            C2
A5 (7, 5)         10             5             9            C2
A6 (6, 4)         10             5             7            C2
A7 (1, 2)          9            10             0            C3
A8 (4, 9)          3             2            10            C2

New centroid for each cluster, calculated by averaging its data points:
C1 = {A1}                     → (2, 10)
C2 = {A3, A4, A5, A6, A8}     → (6, 6)
C3 = {A2, A7}                 → (1.5, 3.5)
HOW IT WORKS
Iteration 2

Centroids adjusted from the previous iteration: C1 = (2, 10), C2 = (6, 6), C3 = (1.5, 3.5)

Distances are again calculated with the rectilinear method and each point is assigned to the cluster at minimum distance:

Data point    Dist. to C1   Dist. to C2   Dist. to C3   Assigned
A1 (2, 10)         0             8             7            C1
A2 (2, 5)          5             5             2            C3
A3 (8, 4)         12             4             7            C2
A4 (5, 8)          5             3             8            C2
A5 (7, 5)         10             2             7            C2
A6 (6, 4)         10             2             5            C2
A7 (1, 2)          9             9             2            C3
A8 (4, 9)          3             5             8            C1

New centroid for each cluster, calculated by averaging its data points:
C1 = {A1, A8}             → (3, 9.5)
C2 = {A3, A4, A5, A6}     → (6.5, 5.25)
C3 = {A2, A7}             → (1.5, 3.5)

HOW IT WORKS
Iteration 3

Centroids adjusted from the previous iteration: C1 = (3, 9.5), C2 = (6.5, 5.25), C3 = (1.5, 3.5)

Recomputing the rectilinear distances against the new centroids:

Data point    Dist. to C1   Dist. to C2   Dist. to C3   Previous   Assigned
A1 (2, 10)        1.5           9.25           7            C1          C1
A2 (2, 5)         5.5           4.75           2            C3          C3
A3 (8, 4)        10.5           2.75           7            C2          C2
A4 (5, 8)         3.5           4.25           8            C2          C1
A5 (7, 5)         8.5           0.75           7            C2          C2
A6 (6, 4)         8.5           1.75           5            C2          C2
A7 (1, 2)         9.5           8.75           2            C3          C3
A8 (4, 9)         1.5           6.25           8            C1          C1

A4 moves from C2 to C1, so the centroids are recalculated once more:
C1 = {A1, A4, A8}     → (3.67, 9)
C2 = {A3, A5, A6}     → (7, 4.33)
C3 = {A2, A7}         → (1.5, 3.5)

A fourth pass over the data points produces no change in the cluster assignment, hence STOP.
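The whole worked example can be reproduced with a few lines of NumPy. The sketch below uses the rectilinear (Manhattan) distance and the initial centers A1, A4, A7 from slide 9, and stops when the centroids stop changing:

```python
import numpy as np

# Data points A1..A8 from the worked example above.
points = np.array([[2, 10], [2, 5], [8, 4], [5, 8],
                   [7, 5], [6, 4], [1, 2], [4, 9]], dtype=float)
centers = points[[0, 3, 6]].copy()   # initial centers: A1, A4, A7

for it in range(1, 10):
    # Rectilinear (Manhattan) distance from every point to every center.
    d = np.abs(points[:, None, :] - centers[None, :, :]).sum(axis=2)
    labels = d.argmin(axis=1)        # assign each point to the nearest center
    new_centers = np.array([points[labels == k].mean(axis=0) for k in range(3)])
    if np.allclose(new_centers, centers):
        print(f"stable after pass {it}")
        break
    centers = new_centers

print(labels)    # cluster index (0, 1, 2) for A1..A8
print(centers)   # final centroids
```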

KMEANS VISUALIZATION

• http://shabal.in/visuals/kmeans/4.html

PRACTICAL APPLICATIONS

• Customer segmentation, along dimensions such as:
  • Pricing segmentation
  • Loyalty
  • Spend behaviour
  • Branch geography
  • Customer needs: needs, channel preferences, service expectations
  • Category: who are these customers? why are they behaving the way they do?
  • Customer service: customer value in the last 6/12/18/24 months
  • Customer type: individuals and small businesses
  • Product type (e.g. gas, electricity, etc.)
  • Length of relationship
  • Overall consumption
  • Number of complaints

• News article clustering

DISADVANTAGES

• A fixed number of clusters can make it difficult to predict what K should be.

• It does not work well with non-globular clusters. (A globular cluster is very tightly bound, which gives it its spherical shape and relatively high density toward its center.)

• Different initial partitions can result in different final clusters.

• Even outliers become part of some cluster!

WHEN K-MEANS CLUSTERING FAILS

• (-) Real-life data is almost always messy.

• (-) The examples and illustrations we see in our statistics courses are designed to reflect ideal situations that sadly almost never occur in the real world.

• (+) K-means works best when your data is roughly "spherical," as in a toy data set.

• (+) It is still one of the most widely used unsupervised machine learning algorithms.

USE CASES

Behavioral segmentation
• Segment by purchase history
• Segment by activities on application, website, or platform
• Define personas based on interests
• Create profiles based on activity monitoring

Inventory categorization
• Group inventory by sales activity
• Group inventory by manufacturing metrics

Sorting sensor measurements
• Detect activity types in motion sensors
• Group images
• Separate audio
• Identify groups in health monitoring

Detecting bots or anomalies
• Separate valid activity groups from bots
• Group valid activity to clean up outlier detection

KMEANS – SCIKIT-LEARN

• KMeans(n_clusters=8, init='k-means++', n_init=10, max_iter=300, tol=0.0001, precompute_distances='auto', verbose=0, random_state=None, copy_x=True, n_jobs=None, algorithm='auto')

• Attributes:
  • cluster_centers_ : array, [n_clusters, n_features]. Coordinates of cluster centers. If the algorithm stops before fully converging (see tol and max_iter), these will not be consistent with labels_.
  • labels_ : Labels of each point
  • inertia_ : float. Sum of squared distances of samples to their closest cluster center.
  • n_iter_ : int. Number of iterations run.
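A short usage sketch of these attributes on synthetic data (the make_blobs parameters are illustrative choices):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data: 300 points scattered around 3 centers.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0).fit(X)

print(km.cluster_centers_)   # (3, 2) array of center coordinates
print(km.labels_[:10])       # cluster index of the first 10 samples
print(km.inertia_)           # sum of squared distances to closest centers
print(km.n_iter_)            # number of iterations run
```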

K-MEANS++

• How does k-means++ work? (Research it yourself!)

• Essentially there are 3 methods to assign initial centers:
  • k-means++
  • Random choice of points from the samples
  • User-provided points (vectors)

• k-means++:
  • Say we need to select 2 cluster centers. Instead of selecting them both randomly (as in simple k-means), we select the first one randomly, then find the points that are farthest from the first center.
  • These points most probably do not belong to the first cluster, since they are far from its center, so we place the second cluster center near those far points.

A sketch of this seeding step follows below.
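Below is a minimal sketch of the actual k-means++ seeding rule, which formalizes the intuition above: each subsequent center is sampled with probability proportional to the squared distance to the nearest center already chosen, so far-away points are the most likely picks. The function name is illustrative:

```python
import numpy as np

def kmeans_pp_init(points, k, rng=np.random.default_rng(0)):
    """k-means++ seeding: D^2 sampling of the initial centers."""
    centers = [points[rng.integers(len(points))]]      # first center: uniform pick
    for _ in range(k - 1):
        # Squared distance from each point to its nearest already-chosen center.
        d2 = np.min(((points[:, None, :] - np.array(centers)[None, :, :]) ** 2)
                    .sum(axis=2), axis=1)
        probs = d2 / d2.sum()                          # far points are more likely
        centers.append(points[rng.choice(len(points), p=probs)])
    return np.array(centers)
```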
SPECIFYING A DIFFERENT DISTANCE CALCULATION METHOD

• The scikit-learn implementation only supports the Euclidean distance measure.
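One common workaround, sketched below, is to run the assignment step by hand with scipy.spatial.distance.cdist and any metric it supports, plugged into the from-scratch loop shown earlier. Note this is a heuristic: the re-averaging update step is only exactly optimal for squared Euclidean distance (k-medoids is the principled alternative for other metrics):

```python
import numpy as np
from scipy.spatial.distance import cdist

points = np.array([[2., 10.], [2., 5.], [8., 4.], [5., 8.]])   # invented data
centers = np.array([[2., 10.], [5., 8.]])

# Assignment step with a non-Euclidean metric (Manhattan / "cityblock").
labels = cdist(points, centers, metric="cityblock").argmin(axis=1)
print(labels)
```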

