NVT SDS Unit Vi Final PDF

Download as pdf or txt
Download as pdf or txt
You are on page 1of 71

STATISTICS FOR DATA SCIENCE

UNIT-VI
Session 2022-2023
V Semester (Honors)

Dr.Nileshsingh V. Thakur
(Ph.D. Computer Sci. Engg.)
Professor
Department of Computer Technology
Yeshwantrao Chavan College of Engineering
Nagpur, India.

NileshsinghVThakur
Support Vector Machines

Here we approach the two-class classification problem in a


direct way:

We try and find a plane that separates the classes in


feature space.
If we cannot, we get creative in two ways:
• We soften what we mean by “separates”, and
• We enrich and enlarge the feature space so that separation
is possible.

NileshsinghVThakur
1 / 21
What is a Hyperplane?

• A hyperplane in p dimensions is a flat affine subspace of


dimension p − 1.
• In general the equation for a hyperplane has the form

β0 + β1 X1 + β2 X2 + . . . + βp Xp = 0
• In p = 2 dimensions a hyperplane is a line.
• If β0 = 0, the hyperplane goes through the origin,
otherwise not.
• The vector β = (β1 , β2 , · · · , βp ) is called the normal vector
— it points in a direction orthogonal to the surface of a
hyperplane.

NileshsinghVThakur
2 / 21
Hyperplane in 2 Dimensions

10
8 β=(β1,β2)

β1X1+β2X2−6=1.6
6

β1X1+β2X2−6=0

4
X2

β1X1+β2X2−6=−4
2


0
−2

β1 = 0.8
β2 = 0.6

−2 0 2 4 6 8 10

X1
NileshsinghVThakur
3 / 21
Separating Hyperplanes
3

3
2

2
X2

X2
1

1
0

0
−1

−1
−1 0 1 2 3 −1 0 1 2 3

X1 X1

• If f (X) = β0 + β1 X1 + · · · + βp Xp , then f (X) > 0 for points on


one side of the hyperplane, and f (X) < 0 for points on the other.
• If we code the colored points as Yi = +1 for blue, say, and
Yi = −1 for mauve, then if Yi · f (Xi ) > 0 for all i, f (X) = 0
defines a separating hyperplane.
NileshsinghVThakur
4 / 21
Maximal Margin Classifier
Among all separating hyperplanes, find the one that makes the
biggest gap or margin between the two classes.

Constrained optimization problem


3

maximize M
2

β0 ,β1 ,...,βp
p
X2

X
βj2 = 1,
1

subject to
j=1
0

yi (β0 + β1 xi1 + . . . + βp xip ) ≥ M


for all i = 1, . . . , N.
−1

−1 0 1 2 3

X1

This can be rephrased as a convex quadratic program, and


solved efficiently. The function svm() in package e1071 solves
this problem efficiently NileshsinghVThakur
5 / 21
Non-separable Data

The data on the left are


2.0

not separable by a linear


1.5

boundary.
1.0
X2
0.5

This is often the case,


unless N < p.
0.0
−0.5
−1.0

0 1 2 3

X1

NileshsinghVThakur
6 / 21
Noisy Data
3

3
2

2
X2

X2
1

1
0

0
−1

−1
−1 0 1 2 3 −1 0 1 2 3

X1 X1

Sometimes the data are separable, but noisy. This can lead to a
poor solution for the maximal-margin classifier.

The support vector classifier maximizes a soft margin.

NileshsinghVThakur
7 / 21
Support Vector Classifier

10 10
4

4
7 7
3

3
11
9 9
2

2
8 8
X2

X2
1

1
1 1
12
3 3
0

0
4 5 4 5
2 2
−1

−1
6 6

−0.5 0.0 0.5 1.0 1.5 2.0 2.5 −0.5 0.0 0.5 1.0 1.5 2.0 2.5

X1 X1

p
X
maximize M subject to βj2 = 1,
β0 ,β1 ,...,βp ,1 ,...,n
j=1
yi (β0 + β1 xi1 + β2 xi2 + . . . + βp xip ) ≥ M (1 − i ),
X n
i ≥ 0, i ≤ C,
i=1

NileshsinghVThakur
8 / 21
C is a regularization parameter

3
2

2
1

1
X2

X2
0

0
−1

−1
−2

−2
−3

−3
−1 0 1 2 −1 0 1 2

X1 X1
3

3
2

2
1

1
X2

X2
0

0
−1

−1
−2

−2
−3

−3

−1 0 1 2 −1 0 1 2

X1 X1

NileshsinghVThakur
9 / 21
Linear boundary can fail

Sometime a linear bound-


4

ary simply won’t work,


no matter what value of
2

C.
X2
0

The example on the left


is such a case.
−2

What to do?
−4

4 −4 −2 0 2 4

X1

NileshsinghVThakur
10 / 21
Feature Expansion

• Enlarge the space of features by including transformations;


e.g. X12 , X13 , X1 X2 , X1 X22 ,. . .. Hence go from a
p-dimensional space to a M > p dimensional space.
• Fit a support-vector classifier in the enlarged space.
• This results in non-linear decision boundaries in the
original space.
Example: Suppose we use (X1 , X2 , X12 , X22 , X1 X2 ) instead of
just (X1 , X2 ). Then the decision boundary would be of the form

β0 + β1 X1 + β2 X2 + β3 X12 + β4 X22 + β5 X1 X2 = 0

This leads to nonlinear decision boundaries in the original space


(quadratic conic sections).

NileshsinghVThakur
11 / 21
Cubic Polynomials
Here we use a basis
expansion of cubic poly-

4
nomials
From 2 variables to 9

2
X2

X2
The support-vector clas-

0
sifier in the enlarged
space solves the problem

−2

−2
in the lower-dimensional
space
−4

−4
−4 −2 0 2 4

X1

β0 +β1 X1 +β2 X2 +β3 X12 +β4 X22 +β5 X1 X2 +β6 X13 +β7 X23 +β8 X1 X22 +β9 X12 X2 = 0

NileshsinghVThakur
12 / 21
Nonlinearities and Kernels

• Polynomials (especially high-dimensional ones) get wild


rather fast.
• There is a more elegant and controlled way to introduce
nonlinearities in support-vector classifiers — through the
use of kernels.
• Before we discuss these, we must understand the role of
inner products in support-vector classifiers.

NileshsinghVThakur
13 / 21
Inner products and support vectors
p
X
hxi , x i =
i0 xij xi0 j — inner product between vectors
j=1

• The linear support vector classifier can be represented as


n
X
f (x) = β0 + αi hx, xi i — n parameters
i=1

• To estimate the parameters α1 , . . . , αn and β0 , all we need


are the n2 inner products hxi , xi0 i between all pairs of


training observations.
It turns out that most of the α̂i can be zero:
X
f (x) = β0 + α̂i hx, xi i
i∈S
S is the support set of indices i such that α̂i > 0. [see slide 8]
NileshsinghVThakur
14 / 21
Kernels and Support Vector Machines
• If we can compute inner-products between observations, we
can fit a SV classifier. Can be quite abstract!
• Some special kernel functions can do this for us. E.g.
 d
p
X
K(xi , xi0 ) = 1 + xij xi0 j 
j=1

computes the inner-products needed for d dimensional


polynomials — p+d

d basis functions!
Try it for p = 2 and d = 2.
• The solution has the form
X
f (x) = β0 + α̂i K(x, xi ).
i∈S

NileshsinghVThakur
15 / 21
Radial Kernel
p
X
K(xi , xi0 ) = exp(−γ (xij − xi0 j )2 ).
j=1
4

X
f (x) = β0 + α̂i K(x, xi )
i∈S
2

Implicit feature space;


X2

very high dimensional.


0

Controls variance by
−2

squashing down most


dimensions severely
−4

4 −4 −2 0 2 4

X1 NileshsinghVThakur
16 / 21
Example: Heart Data
1.0

1.0
0.8

0.8
True positive rate

True positive rate


0.6

0.6
0.4

0.4
0.2

0.2
Support Vector Classifier
SVM: γ=10−3
Support Vector Classifier SVM: γ=10−2
SVM: γ=10−1
0.0

0.0
LDA

0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0

False positive rate False positive rate

ROC curve is obtained by changing the threshold 0 to threshold


t in fˆ(X) > t, and recording false positive and true positive
rates as t varies. Here we see ROC curves on training data.

NileshsinghVThakur
17 / 21
Example continued: Heart Test Data
1.0

1.0
0.8

0.8
True positive rate

True positive rate


0.6

0.6
0.4

0.4
0.2

0.2
Support Vector Classifier
SVM: γ=10−3
Support Vector Classifier SVM: γ=10−2
SVM: γ=10−1
0.0

0.0
LDA

0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0

False positive rate False positive rate

NileshsinghVThakur
18 / 21
SVMs: more than 2 classes?

The SVM as defined works for K = 2 classes. What do we do if


we have K > 2 classes?
OVA One versus All. Fit K different 2-class SVM
classifiers fˆk (x), k = 1, . . . , K; each class versus
the rest. Classify x∗ to the class for which fˆk (x∗ )
is largest.
OVO One versus One. Fit all K2 pairwise classifiers


fˆk` (x). Classify x∗ to the class that wins the most


pairwise competitions.
Which to choose? If K is not too large, use OVO.

NileshsinghVThakur
19 / 21
Support Vector versus Logistic Regression?
With f (X) = β0 + β1 X1 + . . . + βp Xp can rephrase
support-vector classifier optimization as
 
Xn p
X 
minimize max [0, 1 − yi f (xi )] + λ βj2
β0 ,β1 ,...,βp  
i=1 j=1
8

SVM Loss
Logistic Regression Loss This has the form
loss plus penalty.
6

The loss is known as the


Loss

hinge loss.
4

Very similar to “loss” in


2

logistic regression (negative


log-likelihood).
0

−6 −4 −2 0 2

yi(β0 + β1xi1 + . . . + βpxip)


NileshsinghVThakur
20 / 21
Which to use: SVM or Logistic Regression

• When classes are (nearly) separable, SVM does better than


LR. So does LDA.
• When not, LR (with ridge penalty) and SVM very similar.
• If you wish to estimate probabilities, LR is the choice.
• For nonlinear boundaries, kernel SVMs are popular. Can
use kernels with LR and LDA as well, but computations
are more expensive.

NileshsinghVThakur
21 / 21
Unsupervised Learning

Unsupervised vs Supervised Learning:


• Most of this course focuses on supervised learning methods
such as regression and classification.
• In that setting we observe both a set of features
X1 , X2 , . . . , Xp for each object, as well as a response or
outcome variable Y . The goal is then to predict Y using
X1 , X2 , . . . , Xp .
• Here we instead focus on unsupervised learning, we where
observe only the features X1 , X2 , . . . , Xp . We are not
interested in prediction, because we do not have an
associated response variable Y .

NileshsinghVThakur
1 / 50
The Goals of Unsupervised Learning

• The goal is to discover interesting things about the


measurements: is there an informative way to visualize the
data? Can we discover subgroups among the variables or
among the observations?
• We discuss two methods:
• principal components analysis, a tool used for data
visualization or data pre-processing before supervised
techniques are applied, and
• clustering, a broad class of methods for discovering
unknown subgroups in data.

NileshsinghVThakur
2 / 50
The Challenge of Unsupervised Learning

• Unsupervised learning is more subjective than supervised


learning, as there is no simple goal for the analysis, such as
prediction of a response.
• But techniques for unsupervised learning are of growing
importance in a number of fields:
• subgroups of breast cancer patients grouped by their gene
expression measurements,
• groups of shoppers characterized by their browsing and
purchase histories,
• movies grouped by the ratings assigned by movie viewers.

NileshsinghVThakur
3 / 50
Another advantage

• It is often easier to obtain unlabeled data — from a lab


instrument or a computer — than labeled data, which can
require human intervention.
• For example it is difficult to automatically assess the
overall sentiment of a movie review: is it favorable or not?

NileshsinghVThakur
4 / 50
Principal Components Analysis

• PCA produces a low-dimensional representation of a


dataset. It finds a sequence of linear combinations of the
variables that have maximal variance, and are mutually
uncorrelated.
• Apart from producing derived variables for use in
supervised learning problems, PCA also serves as a tool for
data visualization.

NileshsinghVThakur
5 / 50
Principal Components Analysis: details
• The first principal component of a set of features
X1 , X2 , . . . , Xp is the normalized linear combination of the
features

Z1 = φ11 X1 + φ21 X2 + . . . + φp1 Xp

that
Pp has2 the largest variance. By normalized, we mean that
j=1 φj1 = 1.
• We refer to the elements φ11 , . . . , φp1 as the loadings of the
first principal component; together, the loadings make up
the principal component loading vector,
φ1 = (φ11 φ21 . . . φp1 )T .
• We constrain the loadings so that their sum of squares is
equal to one, since otherwise setting these elements to be
arbitrarily large in absolute value could result in an
arbitrarily large variance.
NileshsinghVThakur
6 / 50
PCA: example

35
30
25
Ad Spending
20
15
10
5
0

10 20 30 40 50 60 70

Population

The population size (pop) and ad spending (ad) for 100 different
cities are shown as purple circles. The green solid line indicates
the first principal component direction, and the blue dashed
line indicates the second principal component direction.

NileshsinghVThakur
7 / 50
Computation of Principal Components

• Suppose we have a n × p data set X. Since we are only


interested in variance, we assume that each of the variables
in X has been centered to have mean zero (that is, the
column means of X are zero).
• We then look for the linear combination of the sample
feature values of the form

zi1 = φ11 xi1 + φ21 xi2 + . . . + φp1 xip (1)

for i = 1, . . . , n thatP
has largest sample variance, subject to
the constraint that pj=1 φ2j1 = 1.
• Since each of the xij has mean zero, then so does zi1 (for
any values of φj1 ). Hence the sample variance of the zi1
can be written as n1 ni=1 zi1
2.
P

NileshsinghVThakur
8 / 50
Computation: continued

• Plugging in (1) the first principal component loading vector


solves the optimization problem
 2
n p p
1 X X X
maximize φj1 xij  subject to φ2j1 = 1.
φ11 ,...,φp1 n
i=1 j=1 j=1

• This problem can be solved via a singular-value


decomposition of the matrix X, a standard technique in
linear algebra.
• We refer to Z1 as the first principal component, with
realized values z11 , . . . , zn1

NileshsinghVThakur
9 / 50
Geometry of PCA

• The loading vector φ1 with elements φ11 , φ21 , . . . , φp1


defines a direction in feature space along which the data
vary the most.
• If we project the n data points x1 , . . . , xn onto this
direction, the projected values are the principal component
scores z11 , . . . , zn1 themselves.

NileshsinghVThakur
10 / 50
Further principal components

• The second principal component is the linear combination


of X1 , . . . , Xp that has maximal variance among all linear
combinations that are uncorrelated with Z1 .
• The second principal component scores z12 , z22 , . . . , zn2
take the form

zi2 = φ12 xi1 + φ22 xi2 + . . . + φp2 xip ,

where φ2 is the second principal component loading vector,


with elements φ12 , φ22 , . . . , φp2 .

NileshsinghVThakur
11 / 50
Further principal components: continued

• It turns out that constraining Z2 to be uncorrelated with


Z1 is equivalent to constraining the direction φ2 to be
orthogonal (perpendicular) to the direction φ1 . And so on.
• The principal component directions φ1 , φ2 , φ3 , . . . are the
ordered sequence of right singular vectors of the matrix X,
and the variances of the components are n1 times the
squares of the singular values. There are at most
min(n − 1, p) principal components.

NileshsinghVThakur
12 / 50
Illustration

• USAarrests data: For each of the fifty states in the United


States, the data set contains the number of arrests per
100, 000 residents for each of three crimes: Assault, Murder,
and Rape. We also record UrbanPop (the percent of the
population in each state living in urban areas).
• The principal component score vectors have length n = 50,
and the principal component loading vectors have length
p = 4.
• PCA was performed after standardizing each variable to
have mean zero and standard deviation one.

NileshsinghVThakur
13 / 50
USAarrests data: PCA plot
−0.5 0.0 0.5

UrbanPop

3
2

0.5
Hawaii California
Rhode Island
Massachusetts
Utah New Jersey
Second Principal Component

Connecticut
1

Washington Colorado
New York Nevada
Minnesota Pennsylvania
Wisconsin
Ohio IllinoisArizona
Oregon Rape
Texas
Delaware Missouri
Oklahoma
Kansas
Nebraska Indiana Michigan
Iowa

0.0
New Hampshire
0

Florida
Idaho Virginia New Mexico
Maine Wyoming
Maryland
North Dakota Montana
Assault
South Dakota Tennessee
Louisiana
Kentucky
−1

Arkansas Alabama Alaska


Georgia
VermontWest Virginia Murder

−0.5
South Carolina
−2

North Carolina
Mississippi
−3

−3 −2 −1 0 1 2 3

First Principal Component

NileshsinghVThakur
14 / 50
Figure details

The first two principal components for the USArrests data.


• The blue state names represent the scores for the first two
principal components.
• The orange arrows indicate the first two principal
component loading vectors (with axes on the top and
right). For example, the loading for Rape on the first
component is 0.54, and its loading on the second principal
component 0.17 [the word Rape is centered at the point
(0.54, 0.17)].
• This figure is known as a biplot, because it displays both
the principal component scores and the principal
component loadings.

NileshsinghVThakur
15 / 50
PCA loadings

PC1 PC2
Murder 0.5358995 -0.4181809
Assault 0.5831836 -0.1879856
UrbanPop 0.2781909 0.8728062
Rape 0.5434321 0.1673186

NileshsinghVThakur
16 / 50
Another Interpretation of Principal Components

• •

1.0
• • • ••
• • • •• • •• • ••
• •

Second principal component


0.5
• •
•• •

• • • •
• • ••• • •

0.0
•• •
• •• •
• ••• •• • • •
•• • •
•• • • • •
•• • •

−0.5
• •
• •••
• • • ••
• • • •
• •
−1.0

−1.0 −0.5 0.0 0.5 1.0


First principal component

NileshsinghVThakur
17 / 50
PCA find the hyperplane closest to the observations

• The first principal component loading vector has a very


special property: it defines the line in p-dimensional space
that is closest to the n observations (using average squared
Euclidean distance as a measure of closeness)
• The notion of principal components as the dimensions that
are closest to the n observations extends beyond just the
first principal component.
• For instance, the first two principal components of a data
set span the plane that is closest to the n observations, in
terms of average squared Euclidean distance.

NileshsinghVThakur
18 / 50
Scaling of the variables matters
• If the variables are in different units, scaling each to have
standard deviation equal to one is recommended.
• If they are in the same units, you might or might not scale
the variables.

Scaled Unscaled
−0.5 0.0 0.5 −0.5 0.0 0.5 1.0

1.0
UrbanPop
3

UrbanPop

150
2
Second Principal Component

Second Principal Component


0.5

100
** ** *

0.5
*
*
1

* *
** *

50
* * * * Rape
* * * Rape
* * ** * *
0.0

** * * ** * *
0

* * ** * * ** **
* * ** * *

0.0
* * * * * *
* * **
0
* * * *Murder * * Assault
* * Assault * ** * * * * *
* * * *
* * * * * *
−1

* * * *
* *
−50
* * *
Murder
−0.5

−0.5
−2

*
−100

*
*
−3

−3 −2 −1 0 1 2 3 −100 −50 0 50 100 150

First Principal Component First Principal Component

NileshsinghVThakur
19 / 50
Proportion Variance Explained
• To understand the strength of each component, we are
interested in knowing the proportion of variance explained
(PVE) by each one.
• The total variance present in a data set (assuming that the
variables have been centered to have mean zero) is defined
as
p p n
X X 1X 2
Var(Xj ) = xij ,
n
j=1 j=1 i=1

and the variance explained by the mth principal


component is
n
1X 2
Var(Zm ) = zim .
n
i=1
Pp PM
• It can be shown that j=1 Var(Xj ) = m=1 Var(Zm ),
with M = min(n − 1, p).
NileshsinghVThakur
20 / 50
Proportion Variance Explained: continued
• Therefore, the PVE of the mth principal component is
given by the positive quantity between 0 and 1
Pn
z2
Pp i=1Pn im 2 .
j=1 i=1 xij
• The PVEs sum to one. We sometimes display the
cumulative PVEs.
1.0

1.0
Cumulative Prop. Variance Explained
0.8

0.8
Prop. Variance Explained

0.6

0.6
0.4

0.4
0.2

0.2
0.0

0.0

1.0 1.5 2.0 2.5 3.0 3.5 4.0 1.0 1.5 2.0 2.5 3.0 3.5 4.0

Principal Component Principal Component

NileshsinghVThakur
21 / 50
How many principal components should we use?

If we use principal components as a summary of our data, how


many components are sufficient?
• No simple answer to this question, as cross-validation is not
available for this purpose.
• Why not?
• When could we use cross-validation to select the number of
components?
• the “scree plot” on the previous slide can be used as a
guide: we look for an “elbow”.

NileshsinghVThakur
22 / 50
Clustering

• Clustering refers to a very broad set of techniques for


finding subgroups, or clusters, in a data set.
• We seek a partition of the data into distinct groups so that
the observations within each group are quite similar to
each other,
• It make this concrete, we must define what it means for
two or more observations to be similar or different.
• Indeed, this is often a domain-specific consideration that
must be made based on knowledge of the data being
studied.

NileshsinghVThakur
23 / 50
PCA vs Clustering

• PCA looks for a low-dimensional representation of the


observations that explains a good fraction of the variance.
• Clustering looks for homogeneous subgroups among the
observations.

NileshsinghVThakur
24 / 50
Clustering for Market Segmentation

• Suppose we have access to a large number of measurements


(e.g. median household income, occupation, distance from
nearest urban area, and so forth) for a large number of
people.
• Our goal is to perform market segmentation by identifying
subgroups of people who might be more receptive to a
particular form of advertising, or more likely to purchase a
particular product.
• The task of performing market segmentation amounts to
clustering the people in the data set.

NileshsinghVThakur
25 / 50
Two clustering methods

• In K-means clustering, we seek to partition the


observations into a pre-specified number of clusters.
• In hierarchical clustering, we do not know in advance how
many clusters we want; in fact, we end up with a tree-like
visual representation of the observations, called a
dendrogram, that allows us to view at once the clusterings
obtained for each possible number of clusters, from 1 to n.

NileshsinghVThakur
26 / 50
K-means clustering
K=2 K=3 K=4

A simulated data set with 150 observations in 2-dimensional


space. Panels show the results of applying K-means clustering
with different values of K, the number of clusters. The color of
each observation indicates the cluster to which it was assigned
using the K-means clustering algorithm. Note that there is no
ordering of the clusters, so the cluster coloring is arbitrary.
These cluster labels were not used in clustering; instead, they
are the outputs of the clustering procedure.
NileshsinghVThakur
27 / 50
Details of K-means clustering

Let C1 , . . . , CK denote sets containing the indices of the


observations in each cluster. These sets satisfy two properties:
1. C1 ∪ C2 ∪ . . . ∪ CK = {1, . . . , n}. In other words, each
observation belongs to at least one of the K clusters.
2. Ck ∩ Ck0 = ∅ for all k 6= k 0 . In other words, the clusters are
non-overlapping: no observation belongs to more than one
cluster.
For instance, if the ith observation is in the kth cluster, then
i ∈ Ck .

NileshsinghVThakur
28 / 50
Details of K-means clustering: continued
• The idea behind K-means clustering is that a good
clustering is one for which the within-cluster variation is as
small as possible.
• The within-cluster variation for cluster Ck is a measure
WCV(Ck ) of the amount by which the observations within
a cluster differ from each other.
• Hence we want to solve the problem
(K )
X
minimize WCV(Ck ) . (2)
C1 ,...,CK
k=1

• In words, this formula says that we want to partition the


observations into K clusters such that the total
within-cluster variation, summed over all K clusters, is as
small as possible.
NileshsinghVThakur
29 / 50
How to define within-cluster variation?

• Typically we use Euclidean distance


p
1 X X
WCV(Ck ) = (xij − xi0 j )2 , (3)
|Ck | 0
i,i ∈Ck j=1

where |Ck | denotes the number of observations in the kth


cluster.
• Combining (2) and (3) gives the optimization problem that
defines K-means clustering,
 
K p
X 1 X X 
minimize (xij − xi0 j )2 . (4)
C1 ,...,CK  |Ck | 0 
k=1 i,i ∈Ck j=1

NileshsinghVThakur
30 / 50
K-Means Clustering Algorithm

1. Randomly assign a number, from 1 to K, to each of the


observations. These serve as initial cluster assignments for
the observations.
2. Iterate until the cluster assignments stop changing:
2.1 For each of the K clusters, compute the cluster centroid.
The kth cluster centroid is the vector of the p feature means
for the observations in the kth cluster.
2.2 Assign each observation to the cluster whose centroid is
closest (where closest is defined using Euclidean distance).

NileshsinghVThakur
31 / 50
Properties of the Algorithm

• This algorithm is guaranteed to decrease the value of the


objective (4) at each step. Why? Note that
p p
1 X X 2
XX
(xij − xi0 j ) = 2 (xij − x̄kj )2 ,
|Ck | 0
i,i ∈Ck j=1 i∈Ck j=1

1 P
where x̄kj = |Ck | i∈Ck xij is the mean for feature j in
cluster Ck .
• however it is not guaranteed to give the global minimum.
Why not?

NileshsinghVThakur
32 / 50
Example
Data Step 1 Iteration 1, Step 2a

Iteration 1, Step 2b Iteration 2, Step 2a Final Results

NileshsinghVThakur
33 / 50
Details of Previous Figure
The progress of the K-means algorithm with K=3.
• Top left: The observations are shown.
• Top center: In Step 1 of the algorithm, each observation is
randomly assigned to a cluster.
• Top right: In Step 2(a), the cluster centroids are computed.
These are shown as large colored disks. Initially the
centroids are almost completely overlapping because the
initial cluster assignments were chosen at random.
• Bottom left: In Step 2(b), each observation is assigned to
the nearest centroid.
• Bottom center: Step 2(a) is once again performed, leading
to new cluster centroids.
• Bottom right: The results obtained after 10 iterations.

NileshsinghVThakur
34 / 50
Example: different starting values
320.9 235.8 235.8

235.8 235.8 310.9

NileshsinghVThakur
35 / 50
Details of Previous Figure

K-means clustering performed six times on the data from


previous figure with K = 3, each time with a different random
assignment of the observations in Step 1 of the K-means
algorithm.
Above each plot is the value of the objective (4).
Three different local optima were obtained, one of which
resulted in a smaller value of the objective and provides better
separation between the clusters.
Those labeled in red all achieved the same best solution, with
an objective value of 235.8

NileshsinghVThakur
36 / 50
Hierarchical Clustering

• K-means clustering requires us to pre-specify the number


of clusters K. This can be a disadvantage (later we discuss
strategies for choosing K)
• Hierarchical clustering is an alternative approach which
does not require that we commit to a particular choice of
K.
• In this section, we describe bottom-up or agglomerative
clustering. This is the most common type of hierarchical
clustering, and refers to the fact that a dendrogram is built
starting from the leaves and combining clusters up to the
trunk.

NileshsinghVThakur
37 / 50
Hierarchical Clustering: the idea
Builds a hierarchy in a “bottom-up” fashion...

A B A B

C C

NileshsinghVThakur
38 / 50
Hierarchical Clustering Algorithm
The approach in words:
• Start with each point in its own cluster.
• Identify the closest two clusters and merge them.
• Repeat.
• Ends when all points are in a single cluster.
Dendrogram

4
3
D
E
A B 2
C
1
0

D
E
B
A
C
NileshsinghVThakur
39 / 50
An Example

4
X2
2
0
−2

−6 −4 −2 0 2

X1

45 observations generated in 2-dimensional space. In reality


there are three distinct classes, shown in separate colors.
However, we will treat these class labels as unknown and will
seek to cluster the observations in order to discover the classes
from the data.
NileshsinghVThakur
40 / 50
Application of hierarchical clustering
10

10

10
8

8
6

6
4

4
2

2
0

NileshsinghVThakur
41 / 50
Details of previous figure

• Left: Dendrogram obtained from hierarchically clustering


the data from previous slide, with complete linkage and
Euclidean distance.
• Center: The dendrogram from the left-hand panel, cut at a
height of 9 (indicated by the dashed line). This cut results
in two distinct clusters, shown in different colors.
• Right: The dendrogram from the left-hand panel, now cut
at a height of 5. This cut results in three distinct clusters,
shown in different colors. Note that the colors were not
used in clustering, but are simply used for display purposes
in this figure

NileshsinghVThakur
42 / 50
Types of Linkage
Linkage Description
Maximal inter-cluster dissimilarity. Compute all pairwise
dissimilarities between the observations in cluster A and
Complete
the observations in cluster B, and record the largest of
these dissimilarities.
Minimal inter-cluster dissimilarity. Compute all pairwise
Single dissimilarities between the observations in cluster A and
the observations in cluster B, and record the smallest of
these dissimilarities.
Mean inter-cluster dissimilarity. Compute all pairwise
dissimilarities between the observations in cluster A and
Average
the observations in cluster B, and record the average of
these dissimilarities.
Dissimilarity between the centroid for cluster A (a mean
Centroid vector of length p) and the centroid for cluster B. Cen-
troid linkage can result in undesirable inversions.

NileshsinghVThakur
43 / 50
Choice of Dissimilarity Measure
• So far have used Euclidean distance.
• An alternative is correlation-based distance which considers
two observations to be similar if their features are highly
correlated.
• This is an unusual use of correlation, which is normally
computed between variables; here it is computed between
the observation profiles for each pair of observations.
20

Observation 1
Observation 2
Observation 3
15
10

2
5

3
1
0

5 10 15 20

Variable Index
NileshsinghVThakur
45 / 50
Practical issues

• Scaling of the variables matters!. Should the observations


or features first be standardized in some way? For
instance, maybe the variables should be centered to have
mean zero and scaled to have standard deviation one.
• In the case of hierarchical clustering,
• What dissimilarity measure should be used?
• What type of linkage should be used?
• How many clusters to choose? (in both K-means or
hierarchical clustering). Difficult problem. No agreed-upon
method. See Elements of Statistical Learning, chapter 13
for more details.
• Which features should we use to drive the clustering?

NileshsinghVThakur
46 / 50
Example: breast cancer microarray study

• “Repeated observation of breast tumor subtypes in


independent gene expression data sets;” Sorlie at el, PNAS
2003
• Gene expression measurements for about ∼ 8000 genes, for
each of 88 breast cancer patients.
• Average linkage, correlation metric
• Clustered samples using 500 intrinsic genes: each woman
was measured before and after chemotherapy. Intrinsic
genes have smallest within/between variation.

NileshsinghVThakur
47 / 50
Fig. 1. Hierarchical clustering of 115 tumor tissues and 7 nonmalignant tissues using the ‘‘intrinsic’’ gene set. (A) A scaled-down representation of the entire cluster
NileshsinghVThakur
48 / 50
of 534 genes and 122 tissue samples based on similarities in gene expression. (B) Experimental dendrogram showing the clustering of the tumors into five subgroups.
among our p
regarding di
and patient
patients had
ESR1 receiv
from van’t
adjuvant tre
The obser
with a basal
nosis for th
usually high
expression o
familial canc
in several s
confounding

Molecular M
subtypes, it
vised analysi
were separa
prognostic in
of the 231 m
Fig. 5. Kaplan–Meier analysis of disease outcome in two patient cohorts. (A)
Time to development of distant metastasis in the 97 sporadic cases from van’t
cohort, we fo
Veer et al. Patients were stratified according to the subtypes as shown in Fig. 2B. recurrences
(B) Overall survival for 72 patients with locally advanced breast cancer in the part be due
Norway cohort. The normal-like tumor subgroups were omitted from both data discussed ab
sets in this analysis. Both van’

8422 兩 www.pnas.org兾cgi兾doi兾10.1073兾pnas.0932692100 NileshsinghVThakur


49 / 50
Conclusions

• Unsupervised learning is important for understanding the


variation and grouping structure of a set of unlabeled data,
and can be a useful pre-processor for supervised learning
• It is intrinsically more difficult than supervised learning
because there is no gold standard (like an outcome
variable) and no single objective (like test set accuracy).
• It is an active field of research, with many recently
developed tools such as self-organizing maps, independent
components analysis and spectral clustering.
See The Elements of Statistical Learning, chapter 14.

NileshsinghVThakur
50 / 50

You might also like