UNIT 5


16CS63 : Machine Learning

Course Content

UNIT 1 Introduction to Machine Learning


(a) Machine Learning Techniques & Types
(b) Familiarization of Data Analytics and Intelligent Learning Algorithms
UNIT 2 Regression & Classification Methods
Regression Methods - (a) Linear Regression
(b) Non-linear Regression
Classification Methods - (a) Naive Bayes.
(b) Support Vector Machine (SVM).
(c) K – Nearest Neighbor
UNIT 3 Ensemble Methods
Ensemble Methods - (a) Decision Tree.
(b) Random Forest.
(c) AdaBoost
UNIT 4 Dimensionality Reduction & Parameter Estimation
Dimensionality Reduction - (a) Principal Component Analysis (PCA).
(b) Linear Discriminant Analysis (LDA).
Parameter Estimation – (a) Maximum Likelihood Estimate (MLE).
UNIT 5 Clustering Methods
Clustering Methods - (a) K – Means Clustering
(b) Hierarchical Clustering.
UNIT 6 Reinforcement Methods
Reinforcement Methods – (a) Markov Decision Process
(b) Q - Learning
UNIT 7 Perceptron
Perceptron - (a) Perceptron.
(b) Multilayer Perceptron.
UNIT 8 Artificial Neural Network
Artificial Neural Network - (a) Multi – layer Artificial Neural network.
(b) Back Propagation Learning.
(c) RBF Network
UNIT 5 Clustering Methods
Clustering Methods - (a) K – Means Clustering
(b) Hierarchical Clustering.
What is Cluster Analysis?

• Finding groups of objects such that the objects in a group will
be similar (or related) to one another and different from (or
unrelated to) the objects in other groups

• Intra-cluster distances are minimized; inter-cluster distances are maximized
Applications of Cluster Analysis

• Understanding
• Group related documents for browsing,
group genes and proteins that have similar
functionality, or group stocks with similar
price fluctuations

• Summarization
• Reduce the size of large data sets

(Figure: clustering of precipitation in Australia.)
What is not Cluster Analysis?

• Supervised classification
• Have class label information

• Simple segmentation
• Dividing students into different registration groups alphabetically,
by last name

• Results of a query
• Groupings are a result of an external specification

• Graph partitioning
• Some mutual relevance and synergy, but areas are not identical
Notion of a Cluster can be Ambiguous

(Figure: the same points grouped as two, four, or six clusters — how many clusters?)


Types of Clusterings

• A clustering is a set of clusters


• Important distinction between hierarchical and
partitional sets of clusters
• Partitional Clustering
• A division of data objects into non-overlapping subsets (clusters) such
that each data object is in exactly one subset

• Hierarchical clustering
• A set of nested clusters organized as a hierarchical tree
Partitional Clustering

(Figure: original points and a partitional clustering of them.)


Hierarchical Clustering

(Figure: a traditional hierarchical clustering with its dendrogram, and a
non-traditional hierarchical clustering with its dendrogram.)
Types of Clusters
• Well-separated clusters

• Center-based clusters

• Contiguous clusters

• Density-based clusters

• Property or Conceptual

• Described by an Objective Function


Types of Clusters: Well-Separated

• Well-Separated Clusters:
• A cluster is a set of points such that any point in a cluster is closer
(or more similar) to every other point in the cluster than to any
point not in the cluster.

3 well-separated clusters
Types of Clusters: Center-Based

• Center-based
• A cluster is a set of objects such that an object in a cluster is closer
(more similar) to the “center” of a cluster, than to the center of any
other cluster
• The center of a cluster is often a centroid, the average of all the
points in the cluster, or a medoid, the most “representative” point
of a cluster

4 center-based clusters
Types of Clusters: Contiguity-Based

• Contiguous Cluster (Nearest neighbor or Transitive)


• A cluster is a set of points such that a point in a cluster is closer (or
more similar) to one or more other points in the cluster than to any
point not in the cluster.

8 contiguous clusters
Types of Clusters: Density-Based

• Density-based
• A cluster is a dense region of points, separated by low-density regions
from other regions of high density.
• Used when the clusters are irregular or intertwined, and when noise
and outliers are present.

6 density-based clusters
Types of Clusters: Conceptual Clusters

• Shared Property or Conceptual Clusters


• Finds clusters that share some common property or represent a
particular concept.

2 Overlapping Circles
Types of Clusters: Objective Function

• Clusters Defined by an Objective Function


• Finds clusters that minimize or maximize an objective function.
• Enumerate all possible ways of dividing the points into clusters and
evaluate the ‘goodness’ of each potential set of clusters by using the
given objective function. (NP-hard)
• Can have global or local objectives.
• Hierarchical clustering algorithms typically have local objectives
• Partitional algorithms typically have global objectives
• A variation of the global objective function approach is to fit the data to a
parameterized model.
• Parameters for the model are determined from the data.
• Mixture models assume that the data is a ‘mixture’ of a number of statistical
distributions.
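To make the mixture-model idea concrete, here is a minimal sketch (assuming scikit-learn and NumPy are available; the synthetic data and the choice of two components are illustrative assumptions, not part of the course material):

# Sketch: fitting a parameterized mixture model to data (illustrative only).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic 2-D data drawn from two Gaussian blobs.
X = np.vstack([rng.normal(0.0, 0.5, size=(100, 2)),
               rng.normal(3.0, 0.8, size=(100, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)  # EM estimates the parameters from the data
labels = gmm.predict(X)   # each point assigned to its most likely component
print(gmm.means_)         # estimated component centers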
Types of Clusters: Objective Function …

• Map the clustering problem to a different domain and solve a related
problem in that domain
• Proximity matrix defines a weighted graph, where the nodes are the points
being clustered, and the weighted edges represent the proximities between
points

• Clustering is equivalent to breaking the graph into connected components,
one for each cluster.

• Want to minimize the edge weight between clusters and maximize the edge
weight within clusters
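A minimal sketch of this graph view, assuming SciPy and NumPy; the toy points and the distance threshold of 1.0 are made-up choices for illustration:

# Sketch: clusters as connected components of a thresholded proximity graph.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from scipy.spatial.distance import cdist

points = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
dist = cdist(points, points)            # proximity (distance) matrix
adjacency = csr_matrix(dist < 1.0)      # keep an edge wherever points are close enough
n_clusters, labels = connected_components(adjacency, directed=False)
print(n_clusters, labels)               # expected: 2 components -> [0 0 1 1]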
Characteristics of the Input Data Are Important

• Type of proximity or density measure


• This is a derived measure, but central to clustering
• Sparseness
• Dictates type of similarity
• Adds to efficiency
• Attribute type
• Dictates type of similarity
• Type of Data
• Dictates type of similarity
• Other characteristics, e.g., autocorrelation
• Dimensionality
• Noise and Outliers
• Type of Distribution
Clustering Algorithms
• K-means and its variants

• Hierarchical clustering
K-means Clustering

• Partitional clustering approach


• Each cluster is associated with a centroid (center point)
• Each point is assigned to the cluster with the closest centroid
• Number of clusters, K, must be specified
• The basic algorithm is very simple
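As a rough illustration of how simple the basic algorithm is, here is a from-scratch sketch in NumPy (not the course's reference implementation; the stopping test and random initialization are assumed choices, and empty clusters are not handled):

# Sketch of the basic K-means loop; X is an (n, d) NumPy array.
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)]   # random initial centroids
    for _ in range(n_iters):
        # Assign each point to the cluster with the closest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of the points assigned to it.
        new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        if np.allclose(new_centroids, centroids):               # stop when centroids no longer move
            break
        centroids = new_centroids
    return labels, centroids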
K-means Clustering – Details
• Initial centroids are often chosen randomly.
• Clusters produced vary from one run to another.
• The centroid is (typically) the mean of the points in the cluster.
• ‘Closeness’ is measured by Euclidean distance, cosine similarity, correlation, etc.
• K-means will converge for common similarity measures mentioned above.
• Most of the convergence happens in the first few iterations.
• Often the stopping condition is changed to ‘Until relatively few points change clusters’
• Complexity is O( n * K * I * d )
• n = number of points, K = number of clusters,
I = number of iterations, d = number of attributes
Two different K-means Clusterings

(Figure: the same original points clustered two ways — an optimal clustering
and a sub-optimal clustering.)
Importance of Choosing Initial Centroids

(Figure: centroid positions over iterations 1–6 of K-means for one choice of
initial centroids.)
Importance of Choosing Initial Centroids

(Figure: cluster assignments and centroids at iterations 1–6, shown as six
separate panels.)
Evaluating K-means Clusters
• Most common measure is Sum of Squared Error (SSE)
• For each point, the error is the distance to the nearest cluster
• To get SSE, we square these errors and sum them:

  SSE = Σ_{i=1..K} Σ_{x ∈ Ci} dist²(mi, x)

• x is a data point in cluster Ci and mi is the representative point for cluster Ci
• It can be shown that mi corresponds to the center (mean) of the cluster
• Given two clusterings, we can choose the one with the smallest error
• One easy way to reduce SSE is to increase K, the number of clusters
• A good clustering with smaller K can have a lower SSE than a poor clustering with higher K
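A short sketch of the SSE computation, assuming NumPy arrays X (the points), labels, and centroids such as those returned by a K-means run (the names are illustrative). Comparing SSE across runs with the same K is one way to prefer the optimal over the sub-optimal clustering shown earlier:

# Sketch: Sum of Squared Error for a given clustering.
import numpy as np

def sse(X, labels, centroids):
    # Squared distance of every point to its own cluster's centroid, summed.
    diffs = X - centroids[labels]
    return float(np.sum(diffs ** 2))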
Problems with Selecting Initial Points
• If there are K ‘real’ clusters then the chance of selecting one centroid from each cluster is small.
• Chance is relatively small when K is large
• If clusters are the same size, n, then

  P(one centroid from each cluster) = (K! · n^K) / (Kn)^K = K! / K^K

• For example, if K = 10, then probability = 10!/10^10 ≈ 0.00036
• Sometimes the initial centroids will readjust themselves in the ‘right’ way, and sometimes they don’t
• Consider an example of five pairs of clusters
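The quoted number can be checked directly under the equal-cluster-size assumption; a one-off computation:

# Check of the K = 10 example: chance that K random centroids land one per cluster.
import math
K = 10
print(math.factorial(K) / K**K)   # ≈ 0.00036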
Limitations of K-means
• K-means has problems when clusters are of differing
• Sizes
• Densities
• Non-globular shapes

• K-means has problems when the data contains outliers.


Limitations of K-means: Differing Sizes

(Figure: original points and the K-means result with K = 3.)


Limitations of K-means: Differing Density

(Figure: original points and the K-means result with K = 3.)


Hierarchical Clustering
• Produces a set of nested clusters organized as a hierarchical tree
• Can be visualized as a dendrogram
• A tree like diagram that records the sequences of merges or splits

(Figure: a dendrogram over six points, with merge heights on the vertical axis.)
Strengths of Hierarchical Clustering
• Do not have to assume any particular number of clusters
• Any desired number of clusters can be obtained by ‘cutting’ the dendrogram at
the proper level

• They may correspond to meaningful taxonomies


• Example in biological sciences (e.g., animal kingdom, phylogeny
reconstruction, …)
Hierarchical Clustering
• Two main types of hierarchical clustering
• Agglomerative:
• Start with the points as individual clusters
• At each step, merge the closest pair of clusters until only one cluster (or k clusters) left

• Divisive:
• Start with one, all-inclusive cluster
• At each step, split a cluster until each cluster contains a point (or there are k clusters)

• Traditional hierarchical algorithms use a similarity or distance matrix


• Merge or split one cluster at a time
Agglomerative Clustering Algorithm

• More popular hierarchical clustering technique

• Basic algorithm is straightforward


1. Compute the proximity matrix
2. Let each data point be a cluster
3. Repeat
4. Merge the two closest clusters
5. Update the proximity matrix
6. Until only a single cluster remains

• Key operation is the computation of the proximity of two clusters
• Different approaches to defining the distance between clusters distinguish
the different algorithms
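A minimal sketch of steps 1–6 using single-link (MIN) proximity, written for clarity rather than efficiency; it stops at k clusters instead of one, and the distance function and data layout are assumptions, not part of the course algorithm:

# Sketch of the basic agglomerative algorithm with single-link proximity; X is an (n, d) array.
import numpy as np
from scipy.spatial.distance import cdist

def agglomerative_single_link(X, k):
    dist = cdist(X, X)                           # 1. proximity matrix between points
    clusters = [[i] for i in range(len(X))]      # 2. each data point starts as its own cluster
    while len(clusters) > k:                     # 3. repeat ...
        # 4. find the two closest clusters (MIN over pairs of points in different clusters)
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = dist[np.ix_(clusters[a], clusters[b])].min()
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])          # merge the two closest clusters
        del clusters[b]                          # 5./6. proximities are recomputed on the fly above
    return clusters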
Starting Situation
• Start with clusters of individual points and a proximity matrix

(Figure: individual points p1, p2, …, p5, … and the proximity matrix between them.)
Intermediate Situation
• After some merging steps, we have some clusters

(Figure: clusters C1–C5 and the proximity matrix between them.)
Intermediate Situation
• We want to merge the two closest clusters (C2 and C5) and update the
proximity matrix

(Figure: clusters C1–C5, with C2 and C5 about to be merged, and the current
proximity matrix.)
After Merging
• The question is “How do we update the proximity matrix?”

(Figure: clusters C1, C2 ∪ C5, C3, C4; the proximity-matrix entries involving the
newly merged cluster C2 ∪ C5 are marked “?”.)
How to Define Inter-Cluster Similarity

(Figure: two groups of points with the question “Similarity?” between them, and
the proximity matrix over points p1, p2, ….)

• MIN
• MAX
• Group Average
• Distance Between Centroids
• Other methods driven by an objective function
  – Ward’s Method uses squared error
Cluster Similarity: MIN or Single Link
• Similarity of two clusters is based on the two most similar (closest)
points in the different clusters
• Determined by one pair of points, i.e., by one link in the proximity graph.

Hierarchical Clustering: MIN

(Figure: the nested MIN (single-link) clusters over six points and the
corresponding dendrogram.)


Strength of MIN

(Figure: original points and the two clusters found by MIN.)

• Can handle non-elliptical shapes


Limitations of MIN

(Figure: original points and the two clusters found by MIN.)

• Sensitive to noise and outliers


Cluster Similarity: MAX or Complete Linkage

• Similarity of two clusters is based on the two least similar (most distant)
points in the different clusters
• Determined by all pairs of points in the two clusters

Hierarchical Clustering: MAX

(Figure: the nested MAX (complete-link) clusters over six points and the
corresponding dendrogram.)


Strength of MAX

(Figure: original points and the two clusters found by MAX.)

• Less susceptible to noise and outliers


Limitations of MAX

(Figure: original points and the two clusters found by MAX.)

• Tends to break large clusters

• Biased towards globular clusters
Cluster Similarity: Group Average
• Proximity of two clusters is the average of pairwise proximity
between points in the two clusters.

• Need to use average connectivity for scalability since total proximity
favors large clusters

Hierarchical Clustering: Group Average

(Figure: the nested group-average clusters over six points and the corresponding
dendrogram.)


Hierarchical Clustering: Group Average
• Compromise between Single and Complete Link

• Strengths
• Less susceptible to noise and outliers

• Limitations
• Biased towards globular clusters
Cluster Similarity: Ward’s Method
• Similarity of two clusters is based on the increase in squared error when
two clusters are merged
• Similar to group average if distance between points is distance squared

• Less susceptible to noise and outliers

• Biased towards globular clusters

• Hierarchical analogue of K-means


• Can be used to initialize K-means
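All four of the proximity definitions compared on the next slide are available in SciPy; a hedged sketch on synthetic data (the toy data and the choice of three clusters are illustrative assumptions):

# Sketch: single, complete, average, and Ward linkage via SciPy's hierarchy module.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))                          # toy data

for method in ["single", "complete", "average", "ward"]:
    Z = linkage(X, method=method)                     # full merge hierarchy (dendrogram)
    labels = fcluster(Z, t=3, criterion="maxclust")   # 'cut' the dendrogram into 3 clusters
    print(method, np.bincount(labels)[1:])            # cluster sizes under each linkage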
Hierarchical Clustering: Comparison

(Figure: the nested clusters produced by MIN, MAX, Group Average, and Ward’s
Method on the same six points.)
Hierarchical Clustering: Time and Space Requirements

• O(N²) space, since it uses the proximity matrix
• N is the number of points

• O(N³) time in many cases
• There are N steps, and at each step the proximity matrix, of size N², must be
updated and searched
• Complexity can be reduced to O(N² log N) time for some approaches
Hierarchical Clustering: Problems and Limitations

• Once a decision is made to combine two clusters, it cannot be undone

• No objective function is directly minimized

• Different schemes have problems with one or more of the following:


• Sensitivity to noise and outliers
• Difficulty handling different sized clusters and convex shapes
• Breaking large clusters
Problems on K-Means and Hierarchical Clustering
