K Means
K Means
K Means
Ke Chen
• K-means Algorithm
• Example
• K-means Demo
• Relevant Issues
• Summary
2
COMP24111 Machine Learning
Introduction
• Partitioning Clustering Approach
– a typical clustering analysis approach via iteratively partitioning
training data set to learn a partition of the given data space
– learning a partition on a data set to produce several non-empty
clusters (usually, the number of clusters given in advance)
– in principle, optimal partition achieved via minimising the sum
of squared distance to its “representative object” in each cluster
E xCk d (x, mk )
K
k 1
2
3
COMP24111 Machine Learning
Introduction
4
COMP24111 Machine Learning
K-means Algorithm
• Given the cluster number K, the K-means algorithm is
carried out in three steps after initialisation:
D
Medicine Weight pH-
Index C
A 1 1
B 2 1
A B
C 4 3
D 5 4
6
COMP24111 Machine Learning
Example
• Step 1: Use initial seed points for partitioning
c1 A, c2 B
D
Euclidean distance
C
d( D, c1 ) ( 5 1)2 ( 4 1)2 5
A B
d( D, c2 ) ( 5 2)2 ( 4 1)2 4.24
7
COMP24111 Machine Learning
Example
• Step 2: Compute new centroids of the current partition
c1 (1, 1)
2 4 5 1 3 4
c2 ,
3 3
11 8
( , )
3 3
8
COMP24111 Machine Learning
Example
• Step 2: Renew membership based on new centroids
9
COMP24111 Machine Learning
Example
• Step 3: Repeat the first two steps until its convergence
1 2 11 1
c1 , (1 , 1)
2 2 2
45 34 1 1
c2 , (4 , 3 )
2 2 2 2
10
COMP24111 Machine Learning
Example
• Step 3: Repeat the first two steps until its convergence
11
COMP24111 Machine Learning
Exercise
For the medicine data set, use K-means with the Manhattan distance
metric for clustering analysis by setting K=2 and initialising seeds as
C1 = A and C2 = C. Answer three questions as follows:
1. How many steps are required for convergence?
2. What are memberships of two clusters after convergence?
3. What are centroids of two clusters after convergence?
Medicine Weight pH- D
Index
C
A 1 1
B 2 1 A B
C 4 3
D 5 4
12
COMP24111 Machine Learning
How K-means partitions?
A partition amounts to a
Voronoi Diagram
13
COMP24111 Machine Learning
K-means Demo
K-means Demo
14
COMP24111 Machine Learning
Relevant Issues
• Computational complexity
– O(tKn), where n is number of objects, K is number of clusters,
and t is number of iterations. Normally, K, t << n.
• Local optimum
– sensitive to initial seed points
– converge to a local optimum: maybe an unwanted solution
• Other problems
– Need to specify K, the number of clusters, in advance
– Unable to handle noisy data and outliers (K-Medoids algorithm)
– Not suitable for discovering clusters with non-convex shapes
– Applicable only when mean is defined, then what about
categorical data? (K-mode algorithm)
– how to evaluate the K-mean performance?
15
COMP24111 Machine Learning
Application
• Colour-Based Image Segmentation Using K-means
Step 1: Loading a colour image of tissue stained with hemotoxylin and
eosin (H&E)
16
COMP24111 Machine Learning
Application
17
COMP24111 Machine Learning
Application
18
COMP24111 Machine Learning
Application
• Colour-Based Image Segmentation Using K-means
Step 4: Label every pixel in the image using the results from
K-means clustering (indicated by three different grey levels)
19
COMP24111 Machine Learning
Application
• Colour-Based Image Segmentation Using K-means
Step 5: Create Images that Segment the H&E Image by Colour
• Apply the label and the colour information of each pixel to achieve
separate colour images corresponding to three clusters.
20
COMP24111 Machine Learning
Application
• Colour-Based Image Segmentation Using K-means
Step 6: Segment the nuclei into a separate image with the L* feature
• In cluster 1, there are dark and light blue objects (pixels). The dark blue
objects (pixels) correspond to nuclei (with the domain knowledge).
• L* feature specifies the brightness values of each colour.
• With a threshold for L*, we achieve an image containing the nuclei only.
21
COMP24111 Machine Learning
Summary
• K-means algorithm is a simple yet popular method for
clustering analysis
• Its performance is determined by initialisation and
appropriate distance measure
• There are several variants of K-means to overcome its
weaknesses
– K-Medoids: resistance to noise and/or outliers
– K-Modes: extension to categorical data clustering analysis
– CLARA: extension to deal with large data sets
– Mixture models (EM algorithm): handling uncertainty of clusters
Online tutorial: how to use the K-means function in Matlab
https://www.youtube.com/watch?v=aYzjenNNOcc
22
COMP24111 Machine Learning