UNIT 5


16CS63 : Machine Learning

Course Content

UNIT 1 Introduction to Machine Learning


(a) Machine Learning Techniques & Types
(b) Familiarization of Data Analytics and Intelligent Learning Algorithms
UNIT 2 Regression & Classification Methods
Regression Methods - (a) Linear Regression
(b) Non-linear Regression
Classification Methods - (a) Naive Bayes.
(b) Support Vector Machine (SVM).
(c) K – Nearest Neighbor
UNIT 3 Ensemble Methods
Ensemble Methods - (a) Decision Tree.
(b) Random Forest.
(c) AdaBoost
UNIT 4 Dimensionality Reduction & Parameter Estimation
Dimensionality Reduction - (a) Principal Component Analysis (PCA).
(b) Linear Discriminant Analysis (LDA).
Parameter Estimation – (a) Maximum Likelihood Estimate (MLE).
UNIT 5 Clustering Methods
Clustering Methods - (a) K – Means Clustering
(b) Hierarchical Clustering.
UNIT 6 Reinforcement Methods
Reinforcement Methods – (a) Markov Decision Process
(b) Q - Learning
UNIT 7 Perceptron
Perceptron - (a) Perceptron.
(b) Multilayer Perceptron.
UNIT 8 Artificial Neural Network
Artificial Neural Network - (a) Multi – layer Artificial Neural network.
(b) Back Propagation Learning.
(c) RBF Network
UNIT 5 Clustering Methods
Clustering Methods - (a) K – Means Clustering
(b) Hierarchical Clustering.
What is Cluster Analysis?

• Finding groups of objects such that the objects in a group will
be similar (or related) to one another and different from (or
unrelated to) the objects in other groups

• Intra-cluster distances are minimized; inter-cluster distances are maximized
Applications of Cluster Analysis

• Understanding
• Group related documents for browsing,
group genes and proteins that have similar
functionality, or group stocks with similar
price fluctuations

• Summarization
• Reduce the size of large data sets

(Figure: clustering of precipitation in Australia.)
What is not Cluster Analysis?

• Supervised classification
• Have class label information

• Simple segmentation
• Dividing students into different registration groups alphabetically,
by last name

• Results of a query
• Groupings are a result of an external specification

• Graph partitioning
• Some mutual relevance and synergy, but areas are not identical
Notion of a Cluster can be Ambiguous

(Figure: the same points grouped as two, four, or six clusters — how many clusters?)


Types of Clusterings

• A clustering is a set of clusters


• Important distinction between hierarchical and
partitional sets of clusters
• Partitional Clustering
• A division of data objects into non-overlapping subsets (clusters) such
that each data object is in exactly one subset

• Hierarchical clustering
• A set of nested clusters organized as a hierarchical tree
Partitional Clustering

(Figure: original points and a partitional clustering of them.)


Hierarchical Clustering

(Figure: a traditional hierarchical clustering with its dendrogram, and a
non-traditional hierarchical clustering with its dendrogram.)
Types of Clusters
• Well-separated clusters

• Center-based clusters

• Contiguous clusters

• Density-based clusters

• Property or Conceptual

• Described by an Objective Function


Types of Clusters: Well-Separated

• Well-Separated Clusters:
• A cluster is a set of points such that any point in a cluster is closer
(or more similar) to every other point in the cluster than to any
point not in the cluster.

3 well-separated clusters
Types of Clusters: Center-Based

• Center-based
• A cluster is a set of objects such that an object in a cluster is closer
(more similar) to the “center” of a cluster, than to the center of any
other cluster
• The center of a cluster is often a centroid, the average of all the
points in the cluster, or a medoid, the most “representative” point
of a cluster

4 center-based clusters
Types of Clusters: Contiguity-Based

• Contiguous Cluster (Nearest neighbor or Transitive)


• A cluster is a set of points such that a point in a cluster is closer (or
more similar) to one or more other points in the cluster than to any
point not in the cluster.

8 contiguous clusters
Types of Clusters: Density-Based

• Density-based
• A cluster is a dense region of points, separated by low-density regions
from other regions of high density.
• Used when the clusters are irregular or intertwined, and when noise
and outliers are present.

6 density-based clusters
Types of Clusters: Conceptual Clusters

• Shared Property or Conceptual Clusters


• Finds clusters that share some common property or represent a
particular concept.

2 Overlapping Circles
Types of Clusters: Objective Function

• Clusters Defined by an Objective Function


• Finds clusters that minimize or maximize an objective function.
• Enumerate all possible ways of dividing the points into clusters and
evaluate the ‘goodness’ of each potential set of clusters by using the
given objective function. (NP-hard)
• Can have global or local objectives.
• Hierarchical clustering algorithms typically have local objectives
• Partitional algorithms typically have global objectives
• A variation of the global objective function approach is to fit the data to a
parameterized model.
• Parameters for the model are determined from the data.
• Mixture models assume that the data is a ‘mixture’ of a number of statistical
distributions.
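To make the mixture-model idea concrete, here is a minimal sketch (assuming scikit-learn and NumPy are available; the synthetic data and the choice of two components are illustrative assumptions, not part of the course material):

# Sketch: fitting a parameterized mixture model to data (illustrative only).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic 2-D data drawn from two Gaussian blobs.
X = np.vstack([rng.normal(0.0, 0.5, size=(100, 2)),
               rng.normal(3.0, 0.8, size=(100, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)  # EM estimates the parameters from the data
labels = gmm.predict(X)   # each point assigned to its most likely component
print(gmm.means_)         # estimated component centers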
Types of Clusters: Objective Function …

• Map the clustering problem to a different domain and solve a related
problem in that domain
• Proximity matrix defines a weighted graph, where the nodes are the points
being clustered, and the weighted edges represent the proximities between
points

• Clustering is equivalent to breaking the graph into connected components,
one for each cluster.

• Want to minimize the edge weight between clusters and maximize the edge
weight within clusters
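A minimal sketch of this graph view, assuming SciPy and NumPy; the toy points and the distance threshold of 1.0 are made-up choices for illustration:

# Sketch: clusters as connected components of a thresholded proximity graph.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from scipy.spatial.distance import cdist

points = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
dist = cdist(points, points)            # proximity (distance) matrix
adjacency = csr_matrix(dist < 1.0)      # keep an edge wherever points are close enough
n_clusters, labels = connected_components(adjacency, directed=False)
print(n_clusters, labels)               # expected: 2 components -> [0 0 1 1]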
Characteristics of the Input Data Are Important

• Type of proximity or density measure


• This is a derived measure, but central to clustering
• Sparseness
• Dictates type of similarity
• Adds to efficiency
• Attribute type
• Dictates type of similarity
• Type of Data
• Dictates type of similarity
• Other characteristics, e.g., autocorrelation
• Dimensionality
• Noise and Outliers
• Type of Distribution
Clustering Algorithms
• K-means and its variants

• Hierarchical clustering
K-means Clustering

• Partitional clustering approach


• Each cluster is associated with a centroid (center point)
• Each point is assigned to the cluster with the closest centroid
• Number of clusters, K, must be specified
• The basic algorithm is very simple
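As a rough illustration of how simple the basic algorithm is, here is a from-scratch sketch in NumPy (not the course's reference implementation; the stopping test and random initialization are assumed choices, and empty clusters are not handled):

# Sketch of the basic K-means loop; X is an (n, d) NumPy array.
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)]   # random initial centroids
    for _ in range(n_iters):
        # Assign each point to the cluster with the closest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of the points assigned to it.
        new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        if np.allclose(new_centroids, centroids):               # stop when centroids no longer move
            break
        centroids = new_centroids
    return labels, centroids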
K-means Clustering – Details
• Initial centroids are often chosen randomly.
• Clusters produced vary from one run to another.
• The centroid is (typically) the mean of the points in the cluster.
• ‘Closeness’ is measured by Euclidean distance, cosine similarity, correlation, etc.
• K-means will converge for common similarity measures mentioned above.
• Most of the convergence happens in the first few iterations.
• Often the stopping condition is changed to ‘Until relatively few points change clusters’
• Complexity is O( n * K * I * d )
• n = number of points, K = number of clusters,
I = number of iterations, d = number of attributes
Two different K-means Clusterings

(Figure: the same original points clustered two ways — an optimal clustering
and a sub-optimal clustering.)
Importance of Choosing Initial Centroids

(Figure: centroid positions over iterations 1–6 of K-means for one choice of
initial centroids.)
Importance of Choosing Initial Centroids

(Figure: cluster assignments and centroids at iterations 1–6, shown as six
separate panels.)
Evaluating K-means Clusters
• Most common measure is Sum of Squared Error (SSE)
• For each point, the error is the distance to the nearest cluster
• To get SSE, we square these errors and sum them:

  SSE = Σ_{i=1..K} Σ_{x ∈ Ci} dist²(mi, x)

• x is a data point in cluster Ci and mi is the representative point for cluster Ci
• It can be shown that mi corresponds to the center (mean) of the cluster
• Given two clusterings, we can choose the one with the smallest error
• One easy way to reduce SSE is to increase K, the number of clusters
• A good clustering with smaller K can have a lower SSE than a poor clustering with higher K
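A short sketch of the SSE computation, assuming NumPy arrays X (the points), labels, and centroids such as those returned by a K-means run (the names are illustrative). Comparing SSE across runs with the same K is one way to prefer the optimal over the sub-optimal clustering shown earlier:

# Sketch: Sum of Squared Error for a given clustering.
import numpy as np

def sse(X, labels, centroids):
    # Squared distance of every point to its own cluster's centroid, summed.
    diffs = X - centroids[labels]
    return float(np.sum(diffs ** 2))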
Problems with Selecting Initial Points
• If there are K ‘real’ clusters then the chance of selecting one centroid from each cluster is small.
• Chance is relatively small when K is large
• If clusters are the same size, n, then

  P(one centroid from each cluster) = (K! · n^K) / (Kn)^K = K! / K^K

• For example, if K = 10, then probability = 10!/10^10 ≈ 0.00036
• Sometimes the initial centroids will readjust themselves in the ‘right’ way, and sometimes they don’t
• Consider an example of five pairs of clusters
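The quoted number can be checked directly under the equal-cluster-size assumption; a one-off computation:

# Check of the K = 10 example: chance that K random centroids land one per cluster.
import math
K = 10
print(math.factorial(K) / K**K)   # ≈ 0.00036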
Limitations of K-means
• K-means has problems when clusters are of differing
• Sizes
• Densities
• Non-globular shapes

• K-means has problems when the data contains outliers.


Limitations of K-means: Differing Sizes

(Figure: original points and the K-means result with K = 3.)


Limitations of K-means: Differing Density

(Figure: original points and the K-means result with K = 3.)


Hierarchical Clustering
• Produces a set of nested clusters organized as a hierarchical tree
• Can be visualized as a dendrogram
• A tree like diagram that records the sequences of merges or splits

(Figure: a dendrogram over six points, with merge heights on the vertical axis.)
Strengths of Hierarchical Clustering
• Do not have to assume any particular number of clusters
• Any desired number of clusters can be obtained by ‘cutting’ the dendrogram at
the proper level

• They may correspond to meaningful taxonomies


• Example in biological sciences (e.g., animal kingdom, phylogeny
reconstruction, …)
Hierarchical Clustering
• Two main types of hierarchical clustering
• Agglomerative:
• Start with the points as individual clusters
• At each step, merge the closest pair of clusters until only one cluster (or k clusters) left

• Divisive:
• Start with one, all-inclusive cluster
• At each step, split a cluster until each cluster contains a point (or there are k clusters)

• Traditional hierarchical algorithms use a similarity or distance matrix


• Merge or split one cluster at a time
Agglomerative Clustering Algorithm

• More popular hierarchical clustering technique

• Basic algorithm is straightforward


1. Compute the proximity matrix
2. Let each data point be a cluster
3. Repeat
4. Merge the two closest clusters
5. Update the proximity matrix
6. Until only a single cluster remains

• Key operation is the computation of the proximity of two clusters
• Different approaches to defining the distance between clusters distinguish
the different algorithms
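A minimal sketch of steps 1–6 using single-link (MIN) proximity, written for clarity rather than efficiency; it stops at k clusters instead of one, and the distance function and data layout are assumptions, not part of the course algorithm:

# Sketch of the basic agglomerative algorithm with single-link proximity; X is an (n, d) array.
import numpy as np
from scipy.spatial.distance import cdist

def agglomerative_single_link(X, k):
    dist = cdist(X, X)                           # 1. proximity matrix between points
    clusters = [[i] for i in range(len(X))]      # 2. each data point starts as its own cluster
    while len(clusters) > k:                     # 3. repeat ...
        # 4. find the two closest clusters (MIN over pairs of points in different clusters)
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = dist[np.ix_(clusters[a], clusters[b])].min()
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])          # merge the two closest clusters
        del clusters[b]                          # 5./6. proximities are recomputed on the fly above
    return clusters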
Starting Situation
• Start with clusters of individual points and a proximity matrix

(Figure: individual points p1, p2, …, p5, … and the proximity matrix between them.)
Intermediate Situation
• After some merging steps, we have some clusters

(Figure: clusters C1–C5 and the proximity matrix between them.)
Intermediate Situation
• We want to merge the two closest clusters (C2 and C5) and update the
proximity matrix

(Figure: clusters C1–C5, with C2 and C5 about to be merged, and the current
proximity matrix.)
After Merging
• The question is “How do we update the proximity matrix?”

(Figure: clusters C1, C2 ∪ C5, C3, C4; the proximity-matrix entries involving the
newly merged cluster C2 ∪ C5 are marked “?”.)
How to Define Inter-Cluster Similarity

(Figure: two groups of points with the question “Similarity?” between them, and
the proximity matrix over points p1, p2, ….)

• MIN
• MAX
• Group Average
• Distance Between Centroids
• Other methods driven by an objective function
  – Ward’s Method uses squared error
Cluster Similarity: MIN or Single Link
• Similarity of two clusters is based on the two most similar (closest)
points in the different clusters
• Determined by one pair of points, i.e., by one link in the proximity graph.

Hierarchical Clustering: MIN

(Figure: the nested MIN (single-link) clusters over six points and the
corresponding dendrogram.)


Strength of MIN

(Figure: original points and the two clusters found by MIN.)

• Can handle non-elliptical shapes


Limitations of MIN

(Figure: original points and the two clusters found by MIN.)

• Sensitive to noise and outliers


Cluster Similarity: MAX or Complete Linkage

• Similarity of two clusters is based on the two least similar (most distant)
points in the different clusters
• Determined by all pairs of points in the two clusters

Hierarchical Clustering: MAX

(Figure: the nested MAX (complete-link) clusters over six points and the
corresponding dendrogram.)


Strength of MAX

(Figure: original points and the two clusters found by MAX.)

• Less susceptible to noise and outliers


Limitations of MAX

(Figure: original points and the two clusters found by MAX.)

• Tends to break large clusters

• Biased towards globular clusters
Cluster Similarity: Group Average
• Proximity of two clusters is the average of pairwise proximity
between points in the two clusters.

• Need to use average connectivity for scalability since total proximity
favors large clusters

Hierarchical Clustering: Group Average

(Figure: the nested group-average clusters over six points and the corresponding
dendrogram.)


Hierarchical Clustering: Group Average
• Compromise between Single and Complete Link

• Strengths
• Less susceptible to noise and outliers

• Limitations
• Biased towards globular clusters
Cluster Similarity: Ward’s Method
• Similarity of two clusters is based on the increase in squared error when
two clusters are merged
• Similar to group average if distance between points is distance squared

• Less susceptible to noise and outliers

• Biased towards globular clusters

• Hierarchical analogue of K-means


• Can be used to initialize K-means
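All four of the proximity definitions compared on the next slide are available in SciPy; a hedged sketch on synthetic data (the toy data and the choice of three clusters are illustrative assumptions):

# Sketch: single, complete, average, and Ward linkage via SciPy's hierarchy module.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))                          # toy data

for method in ["single", "complete", "average", "ward"]:
    Z = linkage(X, method=method)                     # full merge hierarchy (dendrogram)
    labels = fcluster(Z, t=3, criterion="maxclust")   # 'cut' the dendrogram into 3 clusters
    print(method, np.bincount(labels)[1:])            # cluster sizes under each linkage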
Hierarchical Clustering: Comparison

(Figure: the nested clusters produced by MIN, MAX, Group Average, and Ward’s
Method on the same six points.)
Hierarchical Clustering: Time and Space Requirements

• O(N²) space, since it uses the proximity matrix
• N is the number of points

• O(N³) time in many cases
• There are N steps, and at each step the proximity matrix, of size N², must be
updated and searched
• Complexity can be reduced to O(N² log N) time for some approaches
Hierarchical Clustering: Problems and Limitations

• Once a decision is made to combine two clusters, it cannot be undone

• No objective function is directly minimized

• Different schemes have problems with one or more of the following:


• Sensitivity to noise and outliers
• Difficulty handling different sized clusters and convex shapes
• Breaking large clusters
Problems on K-Means and Hierarchical Clustering
