Lecture 19


Data Science

CSE-4075
(k-means)

Partitional Clustering
• Output a single partition of the data into clusters

• Good for large data sets

• Determining the number of clusters is a major challenge
K-Means

• Predetermined number of clusters

• Start with seed clusters of one element

1. Seeds
2. Assign instances to clusters
3. Find new centroids
4. New clusters
Example

Repeat for the next iteration and check whether the clusters remain the same or not.
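The loop above (seed, assign instances, find new centroids, repeat until the clusters stay the same) can be sketched in Python with NumPy; the function name, the random seeding strategy, and the empty-cluster guard are illustrative choices, not from the slides:

```python
import numpy as np

def k_means(points, k, max_iters=100, seed=0):
    """Minimal k-means sketch: seed, assign, recompute, repeat."""
    rng = np.random.default_rng(seed)
    # Seeds: start from k randomly chosen data points (one-element clusters).
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    labels = np.full(len(points), -1)
    for _ in range(max_iters):
        # Assign each instance to its nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Stop when the clusters remain the same as in the last iteration.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Find new centroids as the mean of each cluster
        # (keeping the old centroid if a cluster goes empty).
        centroids = np.array([points[labels == j].mean(axis=0)
                              if np.any(labels == j) else centroids[j]
                              for j in range(k)])
    return labels, centroids
```

On two well-separated groups of points, the assignments stop changing after a couple of iterations and the loop exits early.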
Choosing the Value of k
There is no easy way to choose the value of k.
One way is the elbow method:

• First, compute the sum of squared errors (SSE) for some values of k (for instance 2, 4, 6, 8, etc.)

• The SSE is defined as the sum of the squared distances between each member of the cluster and its centroid
If you plot k against SSE, you will see that the error decreases as k gets larger; this is because when the number of clusters increases, the clusters should be smaller, so the distortion is also smaller.

The idea of the elbow method is to choose the k at which the SSE decreases abruptly.
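As a sketch of the SSE computation above, the snippet below evaluates the error for increasing k; the data points and the hand-picked centroids are hypothetical, standing in for the centroids k-means would return:

```python
import numpy as np

def sse(points, centroids):
    """SSE: assign each point to its nearest centroid, then sum the
    squared distances between each member and that centroid."""
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return float((dists.min(axis=1) ** 2).sum())

# Hypothetical data with three natural groups along the x-axis.
pts = np.array([[0., 0.], [1., 0.], [10., 0.], [11., 0.], [20., 0.], [21., 0.]])

# SSE for increasing k (centroids picked by hand to mimic k-means output).
errors = {
    1: sse(pts, np.array([[10.5, 0.]])),
    2: sse(pts, np.array([[0.5, 0.], [15.5, 0.]])),
    3: sse(pts, np.array([[0.5, 0.], [10.5, 0.], [20.5, 0.]])),
}
# The error drops sharply up to k = 3 (the "elbow"), then would level off.
```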
Choosing the Value of k
Strengths of k-means
• Strengths:
– Simple: easy to understand and to implement
– Efficient: time complexity is O(tkn),
where n is the number of data points,
k is the number of clusters, and
t is the number of iterations.
– Since both k and t are small, k-means is considered a linear
algorithm.
• K-means is the most popular clustering algorithm.
• Note that it terminates at a local optimum if SSE is used.
The global optimum is hard to find due to complexity.
Weaknesses of k-means
• The algorithm is only applicable if the mean is
defined.
– For categorical data, use k-modes: the centroid is
represented by the most frequent values.
• The user needs to specify k.
• The algorithm is sensitive to outliers.
– Outliers are data points that are very far away from
other data points.
– Outliers could be errors in the data recording or
special data points with very different values.
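The k-modes idea mentioned above, taking the most frequent value of each attribute as the centroid, can be sketched as follows (the helper name and the toy records are illustrative):

```python
from collections import Counter

def mode_centroid(rows):
    """k-modes-style centroid: the most frequent value of each attribute."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))

# A hypothetical cluster of categorical records (color, size).
cluster = [("red", "S"), ("red", "M"), ("blue", "M")]
# mode_centroid(cluster) → ("red", "M")
```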
Weaknesses of k-means: Problems with outliers
Weaknesses of k-means: To deal with outliers
• One method is to remove, during the clustering process, data
points that are much farther away from the centroids than
other data points.
– To be safe, we may want to monitor these possible outliers over
a few iterations and then decide whether to remove them.
• Another method is to perform random sampling. Since in
sampling we only choose a small subset of the data
points, the chance of selecting an outlier is very small.
– Assign the rest of the data points to the clusters by distance or
similarity comparison, or by classification.
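The first method above, dropping points that are much farther from their centroid than the rest, might be sketched like this; the `factor` threshold is a made-up heuristic, not something the slides prescribe:

```python
import numpy as np

def filter_outliers(points, labels, centroids, factor=3.0):
    """Keep points whose distance to their own centroid is at most
    `factor` times the mean such distance within the cluster; points
    beyond that are flagged as possible outliers (heuristic cutoff)."""
    dists = np.linalg.norm(points - centroids[labels], axis=1)
    keep = np.ones(len(points), dtype=bool)
    for j in np.unique(labels):
        in_j = labels == j
        keep[in_j] = dists[in_j] <= factor * dists[in_j].mean()
    return keep
```

In practice one would re-run the assignment step after dropping flagged points and, as the slide suggests, monitor candidates over a few iterations before actually removing them.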
Weaknesses of k-means: Sensitivity to initial seeds
Weaknesses of k-means: Can’t handle all types of data
