K Means

K-means Clustering
Ke Chen
Reading: [7.3, EA], [9.1, CMB]
COMP24111 Machine Learning

Outline
• Introduction
• K-means Algorithm
• Example
• How K-means partitions?
• K-means Demo
• Relevant Issues
• Application: Cell Neulei Detection
• Summary
2
Introduction
• Partitioning Clustering Approach
– a typical clustering analysis approach via iteratively partitioning
training data set to learn a partition of the given data space
– learning a partition on a data set to produce several non-empty
clusters (usually, the number of clusters given in advance)
– in principle, optimal partition achieved via minimising the sum
of squared distance to its “representative object” in each cluster
E    xCk d (x, mk )
K
k 1
2
e.g., Euclidean distance d 2

( x , m k )   n kn
( x  m
n 1
) 2
3
Introduction
• Given a K, find a partition of K clusters to optimise the

chosen partitioning criterion (cost function)
o global optimum: exhaustively search all partitions
• The K-means algorithm: a heuristic method
o K-means algorithm (MacQueen’67): each cluster is
represented by the centre of the cluster and the
algorithm converges to stable centriods of clusters.
o K-means algorithm is the simplest partitioning method
for clustering analysis and widely used in data mining
applications.
4
K-means Algorithm
• Given the cluster number K, the K-means algorithm is
carried out in three steps after initialisation:
Initialisation: set seed points (randomly)

1) Assign each object to the cluster of the nearest seed
point measured with a specific distance metric
2) Compute new seed points as the centroids of the
clusters of the current partition (the centroid is the
centre, i.e., mean point, of the cluster)
3) Go back to Step 1), stop when no more new assignment
(i.e., membership in each cluster no longer changes)
5
Example
• Problem
Suppose we have 4 types of medicines and each has two attributes (pH and
weight index). Our goal is to group these objects into K=2 group of medicine.
D
Medicine Weight pH-
Index C
A 1 1
B 2 1
A B
C 4 3
D 5 4
6
Example
• Step 1: Use initial seed points for partitioning
c1  A, c2  B
D
Euclidean distance
C
d( D, c1 )  ( 5  1)2  ( 4  1)2  5
A B
d( D, c2 )  ( 5  2)2  ( 4  1)2  4.24
Assign each object to the cluster

with the nearest seed point
7
Example
• Step 2: Compute new centroids of the current partition
Knowing the members of each

cluster, now we compute the new
centroid of each group based on
these new memberships.
c1  (1, 1)
 2  4  5 1 3  4
c2   , 
 3 3 
11 8
( , )
3 3
8
Example
• Step 2: Renew membership based on new centroids
Compute the distance of all

objects to the new centroids
Assign the membership to objects
9
Example
• Step 3: Repeat the first two steps until its convergence
Knowing the members of each

cluster, now we compute the new
centroid of each group based on
these new memberships.
 1 2 11 1
c1   ,   (1 , 1)
 2 2  2
45 34 1 1
c2   ,   (4 , 3 )
 2 2  2 2
10
Example
• Step 3: Repeat the first two steps until its convergence
Compute the distance of all objects

to the new centroids
Stop due to no new assignment

Membership in each cluster no
longer change
11
Exercise
For the medicine data set, use K-means with the Manhattan distance
metric for clustering analysis by setting K=2 and initialising seeds as
C1 = A and C2 = C. Answer three questions as follows:
1. How many steps are required for convergence?
2. What are memberships of two clusters after convergence?
3. What are centroids of two clusters after convergence?
Medicine Weight pH- D
Index
C
A 1 1
B 2 1 A B
C 4 3
D 5 4
12
How K-means partitions?
When K centroids are set/fixed,

they partition the whole data
space into K mutually exclusive
subspaces to form a partition.
A partition amounts to a
Voronoi Diagram
Changing positions of centroids

leads to a new partitioning.
13
K-means Demo
K-means Demo
14
Relevant Issues
• Computational complexity
– O(tKn), where n is number of objects, K is number of clusters,
and t is number of iterations. Normally, K, t << n.
• Local optimum
– sensitive to initial seed points
– converge to a local optimum: maybe an unwanted solution
• Other problems
– Need to specify K, the number of clusters, in advance
– Unable to handle noisy data and outliers (K-Medoids algorithm)
– Not suitable for discovering clusters with non-convex shapes
– Applicable only when mean is defined, then what about
categorical data? (K-mode algorithm)
– how to evaluate the K-mean performance?
15
Application
• Colour-Based Image Segmentation Using K-means
Step 1: Loading a colour image of tissue stained with hemotoxylin and
eosin (H&E)
16
Application

Step 2: Convert the image from RGB colour space to L*a*b*
colour space
• Unlike the RGB colour model, L*a*b* colour is
designed to approximate human vision.
• There is a complicated transformation between RGB
and L*a*b*.
(L*, a*, b*) = T(R, G, B).
(R, G, B) = T’(L*, a*, b*).
17
Application

Step 3: Undertake clustering analysis in the (a*, b*) colour
space with the K-means algorithm
• In the L*a*b* colour space, each pixel has a
properties or feature vector: (L*, a*, b*).
• Like feature selection, L* feature is discarded. As a
result, each pixel has a feature vector (a*, b*).
• Applying the K-means algorithm to the image in the
a*b* feature space where K = 3 by applying the
domain knowledge.
18
Application
Step 4: Label every pixel in the image using the results from
K-means clustering (indicated by three different grey levels)
19
Application
Step 5: Create Images that Segment the H&E Image by Colour
• Apply the label and the colour information of each pixel to achieve
separate colour images corresponding to three clusters.
“blue” pixels “white” pixels “pink” pixels
20
Application
Step 6: Segment the nuclei into a separate image with the L* feature
• In cluster 1, there are dark and light blue objects (pixels). The dark blue
objects (pixels) correspond to nuclei (with the domain knowledge).
• L* feature specifies the brightness values of each colour.
• With a threshold for L*, we achieve an image containing the nuclei only.
21
Summary
• K-means algorithm is a simple yet popular method for
clustering analysis
• Its performance is determined by initialisation and
appropriate distance measure
• There are several variants of K-means to overcome its
weaknesses
– K-Medoids: resistance to noise and/or outliers
– K-Modes: extension to categorical data clustering analysis
– CLARA: extension to deal with large data sets
– Mixture models (EM algorithm): handling uncertainty of clusters
Online tutorial: how to use the K-means function in Matlab
https://www.youtube.com/watch?v=aYzjenNNOcc
22

K Means

Uploaded by

Copyright:

Available Formats

K Means

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

K Means

Uploaded by

Copyright:

Available Formats

K-means Clustering

Reading: [7.3, EA], [9.1, CMB]

COMP24111 Machine Learning

• How K-means partitions?

• Application: Cell Neulei Detection

e.g., Euclidean distance d 2

• Given a K, find a partition of K clusters to optimise the

Initialisation: set seed points (randomly)

Assign each object to the cluster

Knowing the members of each

Compute the distance of all

Assign the membership to objects

Knowing the members of each

Compute the distance of all objects

Stop due to no new assignment

When K centroids are set/fixed,

Changing positions of centroids

• Colour-Based Image Segmentation Using K-means

• Colour-Based Image Segmentation Using K-means

“blue” pixels “white” pixels “pink” pixels

You might also like