UNIT-VI
Session 2022-2023
V Semester (Honors)
Dr. Nileshsingh V. Thakur
(Ph.D. Computer Sci. Engg.)
Professor
Department of Computer Technology
Yeshwantrao Chavan College of Engineering
Nagpur, India.
Support Vector Machines
What is a Hyperplane?
A hyperplane in p dimensions is a flat affine subspace of dimension p − 1, defined by an equation of the form
β0 + β1 X1 + β2 X2 + . . . + βp Xp = 0
• In p = 2 dimensions a hyperplane is a line.
• If β0 = 0, the hyperplane goes through the origin,
otherwise not.
• The vector β = (β1 , β2 , · · · , βp ) is called the normal vector
— it points in a direction orthogonal to the surface of a
hyperplane.
Hyperplane in 2 Dimensions
[Figure: the hyperplane β1 X1 + β2 X2 − 6 = 0 in the (X1, X2) plane, with β1 = 0.8 and β2 = 0.6. The normal vector β = (β1, β2) points orthogonal to the hyperplane; points off the hyperplane give β1 X1 + β2 X2 − 6 = 1.6 on one side and β1 X1 + β2 X2 − 6 = −4 on the other.]
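As a quick illustration (not part of the original slides), the sketch below evaluates f(X) = β0 + β1 X1 + β2 X2 with the figure's values β = (0.8, 0.6) and β0 = −6; the sign of f tells us on which side of the hyperplane a point lies. The example points are made up.

```python
import numpy as np

# Hyperplane from the figure: 0.8*X1 + 0.6*X2 - 6 = 0
beta = np.array([0.8, 0.6])   # normal vector (beta1, beta2); here ||beta|| = 1
beta0 = -6.0                  # intercept

points = np.array([[4.0, 8.0],   # lies on the f > 0 side
                   [2.0, 2.0]])  # lies on the f < 0 side

f = beta0 + points @ beta        # because ||beta|| = 1, f is the signed distance
sides = np.sign(f)               # +1 / -1: which side of the hyperplane
print(f, sides)                  # positive for the first point, negative for the second
```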
Separating Hyperplanes
[Figure: two panels of two-class data in the (X1, X2) plane; several different hyperplanes separate the two classes.]

Among all separating hyperplanes, the maximal margin classifier picks the one that makes the biggest gap, or margin M, between the two classes. This is a constrained optimization problem:

maximize M over β0, β1, . . . , βp
subject to ∑_{j=1}^{p} βj² = 1 and yi(β0 + β1 xi1 + . . . + βp xip) ≥ M for all i = 1, . . . , n.

[Figure: the maximal margin hyperplane, with the margin boundary on either side.]
Noisy Data
Sometimes the data are separable, but noisy. This can lead to a
poor solution for the maximal-margin classifier.
Support Vector Classifier
[Figure: two panels of numbered observations in the (X1, X2) plane for the support vector classifier; in the right panel, additional observations (11 and 12) fall on the wrong side of the margin or of the hyperplane.]
maximize M over β0, β1, . . . , βp, ε1, . . . , εn
subject to ∑_{j=1}^{p} βj² = 1,
yi(β0 + β1 xi1 + β2 xi2 + . . . + βp xip) ≥ M(1 − εi),
εi ≥ 0, ∑_{i=1}^{n} εi ≤ C,
where the εi are slack variables that allow individual observations to be on the wrong side of the margin.
C is a regularization parameter
The budget C in the constraint ∑ εi ≤ C controls the bias-variance trade-off: a larger C allows more margin violations, giving a wider margin and more support vectors, while a smaller C gives a narrower margin that fits the training data more closely.

[Figure: the support vector classifier fit to the same two-class data for four different values of C, showing how the margin width and the set of support vectors change.]
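A hedged scikit-learn sketch of fitting a linear support vector classifier while varying the regularization parameter. Note that scikit-learn's C multiplies the hinge loss (it behaves like 1/λ), so it is not the same quantity as the slack budget C in the formulation above; the data set here is synthetic and purely illustrative.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two-class toy data (illustrative only)
X, y = make_blobs(n_samples=60, centers=2, cluster_std=1.8, random_state=0)

# A LARGE scikit-learn C corresponds to a SMALL slack budget in the slide notation.
for c in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=c).fit(X, y)
    print(c, "support vectors:", clf.n_support_.sum())
```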
Linear boundary can fail
Sometimes a linear boundary simply won't work, no matter what value of C.

[Figure: two-class data in the (X1, X2) plane whose true class boundary is nonlinear; no linear classifier separates the classes well.]

What to do?
Feature Expansion
Enlarge the feature space with transformations of the original features, such as X1², X2², and X1X2, and fit a support vector classifier in the enlarged space. The resulting decision boundary is linear in the enlarged space but nonlinear in the original features; with quadratic terms it has the form

β0 + β1 X1 + β2 X2 + β3 X1² + β4 X2² + β5 X1X2 = 0
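One way to carry out such a feature expansion in practice (a sketch under the assumption that scikit-learn is available; the data are synthetic) is to generate the polynomial terms explicitly and then fit a linear SV classifier in the enlarged space:

```python
from sklearn.datasets import make_circles
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import LinearSVC

# Data with a nonlinear (circular) class boundary -- illustrative only
X, y = make_circles(n_samples=200, noise=0.1, factor=0.4, random_state=0)

# Expand (X1, X2) to degree-2 terms: X1, X2, X1^2, X2^2, X1*X2,
# then fit a linear support vector classifier in the enlarged space.
model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    LinearSVC(C=1.0),
)
print(model.fit(X, y).score(X, y))  # training accuracy
```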
Cubic Polynomials
Here we use a basis expansion of cubic polynomials, going from 2 variables to 9. The support vector classifier fit in the enlarged space solves the classification problem in the original lower-dimensional space, producing the nonlinear boundary shown.

[Figure: the nonlinear decision boundary obtained from the cubic expansion, plotted in the (X1, X2) plane.]

β0 + β1 X1 + β2 X2 + β3 X1² + β4 X2² + β5 X1X2 + β6 X1³ + β7 X2³ + β8 X1X2² + β9 X1²X2 = 0
Nonlinearities and Kernels
Explicit polynomial expansions can get unwieldy quickly; kernels provide a more elegant and controlled way to introduce nonlinearities into support vector classifiers. To see how they work, we first need the role of inner products in the support vector classifier.
Inner products and support vectors
⟨xi, xi′⟩ = ∑_{j=1}^{p} xij xi′j is the inner product between two observation vectors.

The linear support vector classifier can be written as f(x) = β0 + ∑_{i=1}^{n} αi ⟨x, xi⟩, and to estimate the parameters α1, . . . , αn and β0 we only need the inner products between all pairs of training observations.

It turns out that most of the α̂i can be zero:

f(x) = β0 + ∑_{i∈S} α̂i ⟨x, xi⟩

where S is the support set of indices i such that α̂i > 0 (see the support vector classifier optimization above).
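The sketch below illustrates this representation with scikit-learn, whose fitted SVC exposes the support vectors and their signed coefficients (the α̂i of the slides, with the class sign folded in); reconstructing the decision function by hand shows that only the observations in S contribute. The data are synthetic.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=80, centers=2, random_state=1)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

# Only the support vectors enter the fitted function:
# f(x) = beta0 + sum over i in S of (signed coefficient) * <x, x_i>.
# scikit-learn stores the signed coefficients in dual_coef_.
f_manual = clf.dual_coef_ @ (clf.support_vectors_ @ X.T) + clf.intercept_
f_sklearn = clf.decision_function(X)

print(np.allclose(f_manual.ravel(), f_sklearn))      # True
print("number of support vectors:", len(clf.support_))
```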
Kernels and Support Vector Machines
• If we can compute inner products between observations, we can fit a SV classifier. Can be quite abstract!
• Some special kernel functions can do this for us, e.g. the polynomial kernel of degree d:

K(xi, xi′) = (1 + ∑_{j=1}^{p} xij xi′j)^d

which computes the inner products needed for a degree-d polynomial basis expansion.
Radial Kernel
K(xi, xi′) = exp(−γ ∑_{j=1}^{p} (xij − xi′j)²)

The fitted function again uses only the support set S:

f(x) = β0 + ∑_{i∈S} α̂i K(x, xi)

[Figure: the nonlinear decision boundary produced by an SVM with a radial kernel.]

The implicit feature space is very high dimensional; the radial kernel controls variance by squashing down most dimensions severely.
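A minimal scikit-learn sketch (synthetic data, illustrative parameter values) of fitting an SVM with a radial kernel; gamma corresponds to the γ above, and larger values give a more local, higher-variance fit.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.25, random_state=0)

# Larger gamma -> more local, wigglier boundary (lower bias, higher variance).
for gamma in [0.1, 1.0, 10.0]:
    clf = SVC(kernel="rbf", gamma=gamma, C=1.0)
    print(gamma, cross_val_score(clf, X, y, cv=5).mean())
```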
Example: Heart Data
[Figure: ROC curves (true positive rate vs. false positive rate) on the Heart training data. One panel compares the support vector classifier with LDA; the other compares the support vector classifier with SVMs using a radial kernel with γ = 10⁻³, 10⁻², and 10⁻¹.]
Example continued: Heart Test Data
[Figure: the same ROC comparisons (support vector classifier vs. LDA, and vs. radial-kernel SVMs with γ = 10⁻³, 10⁻², 10⁻¹) evaluated on the Heart test data.]
SVMs: more than 2 classes?
The two standard extensions are one-versus-all, which fits K two-class SVMs (each separating one class from the rest) and classifies to the class whose classifier gives the largest value of fk(x), and one-versus-one, which fits all pairwise two-class SVMs and classifies to the class that wins the most pairwise comparisons.
Support Vector versus Logistic Regression?
With f (X) = β0 + β1 X1 + . . . + βp Xp can rephrase
support-vector classifier optimization as
minimize over β0, β1, . . . , βp: ∑_{i=1}^{n} max[0, 1 − yi f(xi)] + λ ∑_{j=1}^{p} βj²

This has the form loss plus penalty. The loss max[0, 1 − yi f(xi)] is known as the hinge loss, and it is very similar to the loss (negative log-likelihood) used in logistic regression.

[Figure: the SVM hinge loss and the logistic regression loss plotted as functions of yi f(xi).]
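A small numeric sketch (not from the slides) comparing the two losses at a few values of yi f(xi); both penalize points on the wrong side of the margin, but the hinge loss is exactly zero once yi f(xi) ≥ 1.

```python
import numpy as np

# Margin values y_i * f(x_i): positive = correct side, negative = wrong side
m = np.linspace(-3, 3, 7)

hinge = np.maximum(0.0, 1.0 - m)     # SVM hinge loss
logistic = np.log1p(np.exp(-m))      # logistic regression loss (negative log-likelihood)

for mi, h, l in zip(m, hinge, logistic):
    print(f"y*f(x)={mi:+.0f}  hinge={h:.2f}  logistic={l:.2f}")
```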
Unsupervised Learning
The Goals of Unsupervised Learning
The Challenge of Unsupervised Learning
Another advantage
Principal Components Analysis
Principal Components Analysis: details
• The first principal component of a set of features X1, X2, . . . , Xp is the normalized linear combination of the features
Z1 = φ11 X1 + φ21 X2 + . . . + φp1 Xp
that has the largest variance. By normalized, we mean that ∑_{j=1}^{p} φj1² = 1.
• We refer to the elements φ11 , . . . , φp1 as the loadings of the
first principal component; together, the loadings make up
the principal component loading vector,
φ1 = (φ11 φ21 . . . φp1 )T .
• We constrain the loadings so that their sum of squares is
equal to one, since otherwise setting these elements to be
arbitrarily large in absolute value could result in an
arbitrarily large variance.
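As a sketch of these definitions in code (assuming scikit-learn; the data matrix here is random and purely illustrative), the first row of PCA's components_ is the loading vector φ1, and its squared loadings sum to one:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))          # stand-in for a data matrix with p = 4 features

Z = StandardScaler().fit_transform(X)  # center (and scale) the features
pca = PCA().fit(Z)

phi1 = pca.components_[0]              # loading vector of the first principal component
print(phi1, np.sum(phi1**2))           # loadings; their squares sum to 1
```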
PCA: example
The population size (pop) and ad spending (ad) for 100 different
cities are shown as purple circles. The green solid line indicates
the first principal component direction, and the blue dashed
line indicates the second principal component direction.
Computation of Principal Components
• We look for the linear combination of the sample feature values of the form
zi1 = φ11 xi1 + φ21 xi2 + . . . + φp1 xip, for i = 1, . . . , n,
that has the largest sample variance, subject to the constraint that ∑_{j=1}^{p} φj1² = 1.
• Since each of the xij has mean zero (the variables are centered), so does zi1 (for any values of φj1). Hence the sample variance of the zi1 can be written as (1/n) ∑_{i=1}^{n} zi1².
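A minimal NumPy sketch of this computation (synthetic data): the first loading vector is the top right-singular vector of the centered data matrix, and the sample variance of the resulting scores zi1 is (1/n) ∑ zi1².

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3)) @ np.array([[2.0, 0.3, 0.1],
                                          [0.0, 1.0, 0.2],
                                          [0.0, 0.0, 0.5]])   # correlated toy data
X = X - X.mean(axis=0)                 # center each column: mean zero

# First loading vector = top right-singular vector of the centered data matrix
U, s, Vt = np.linalg.svd(X, full_matrices=False)
phi1 = Vt[0]                           # satisfies sum(phi1**2) == 1
z1 = X @ phi1                          # scores z_i1

print(np.sum(phi1**2))                 # 1.0
print(np.mean(z1**2), (s[0]**2) / len(X))  # sample variance of z_i1, two equal ways
```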
Computation: continued
Geometry of PCA
Further principal components
Further principal components: continued
Illustration
USArrests data: PCA plot
[Figure: biplot of the first two principal components of the USArrests data. Each state is plotted by its scores on the first two principal components, and the loading vectors for Murder, Assault, Rape, and UrbanPop are overlaid. Axes: First Principal Component (horizontal) and Second Principal Component (vertical).]
Figure details
PCA loadings
PC1 PC2
Murder 0.5358995 -0.4181809
Assault 0.5831836 -0.1879856
UrbanPop 0.2781909 0.8728062
Rape 0.5434321 0.1673186
Another Interpretation of Principal Components
PCA finds the hyperplane closest to the observations
Scaling of the variables matters
• If the variables are in different units, scaling each to have
standard deviation equal to one is recommended.
• If they are in the same units, you might or might not scale
the variables.
[Figure: biplots of the USArrests data with the variables scaled to unit standard deviation (left, labelled Scaled) and left in their original units (right, labelled Unscaled). The loading vectors for Murder, Assault, Rape, and UrbanPop differ markedly between the two plots.]
Proportion Variance Explained
• To understand the strength of each component, we are
interested in knowing the proportion of variance explained
(PVE) by each one.
• The total variance present in a data set (assuming that the
variables have been centered to have mean zero) is defined
as
∑_{j=1}^{p} Var(Xj) = ∑_{j=1}^{p} (1/n) ∑_{i=1}^{n} xij²,
[Figure: scree plots for the USArrests data. Left: the proportion of variance explained by each of the four principal components. Right: the cumulative proportion of variance explained.]
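A short sketch of computing PVE with scikit-learn (random stand-in data, since the USArrests table is not loaded here): explained_variance_ratio_ gives the PVE of each component, and its cumulative sum is what the right-hand scree plot displays.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 4))           # stand-in for a 50 x 4 data matrix

pca = PCA().fit(StandardScaler().fit_transform(X))

pve = pca.explained_variance_ratio_    # proportion of variance explained per component
print(np.round(pve, 3))
print(np.round(np.cumsum(pve), 3))     # cumulative PVE, as in the right-hand plot
```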
How many principal components should we use?
There is no single accepted answer; a common informal approach is to examine a scree plot like the one above and stop at the point where the proportion of variance explained by each additional component drops off (an elbow).
Clustering
PCA vs Clustering
Clustering for Market Segmentation
Two clustering methods
• In K-means clustering, we seek to partition the observations into a pre-specified number of clusters.
• In hierarchical clustering, we do not know in advance how many clusters we want; we end up with a tree-like representation of the observations, called a dendrogram.
K-means clustering
[Figure: K-means clustering of simulated data with K = 2, K = 3, and K = 4 clusters.]
Details of K-means clustering: continued
• The idea behind K-means clustering is that a good
clustering is one for which the within-cluster variation is as
small as possible.
• The within-cluster variation for cluster Ck is a measure
WCV(Ck ) of the amount by which the observations within
a cluster differ from each other.
• Hence we want to solve the problem
minimize over C1, . . . , CK: { ∑_{k=1}^{K} WCV(Ck) }. (2)
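A hedged scikit-learn sketch of this objective on synthetic data: KMeans minimizes the total within-cluster sum of squared distances to the cluster means (reported as inertia_), which is one common choice of WCV(Ck).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=3)  # illustrative data

# inertia_ = total within-cluster sum of squared distances to the cluster means,
# i.e. the objective in (2) with that particular choice of WCV.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("objective:", round(km.inertia_, 1))
print("cluster sizes:", np.bincount(km.labels_))
```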
K-Means Clustering Algorithm
1. Randomly assign a number, from 1 to K, to each of the observations. These serve as initial cluster assignments.
2. Iterate until the cluster assignments stop changing:
(a) For each of the K clusters, compute the cluster centroid: the vector of the p feature means for the observations in that cluster.
(b) Assign each observation to the cluster whose centroid is closest (in Euclidean distance).
Properties of the Algorithm
• This algorithm is guaranteed to decrease the value of the objective (2) at each step. This follows from the identity
(1/|Ck|) ∑_{i,i′∈Ck} ∑_{j=1}^{p} (xij − xi′j)² = 2 ∑_{i∈Ck} ∑_{j=1}^{p} (xij − x̄kj)²,
where x̄kj = (1/|Ck|) ∑_{i∈Ck} xij is the mean for feature j in cluster Ck.
• However, it is not guaranteed to give the global minimum. Why not?
Example
[Figure: progress of the K-means algorithm on simulated data; panels labelled Data, Step 1, Iteration 1 Step 2a, and so on. Details follow on the next slide.]
Details of Previous Figure
The progress of the K-means algorithm with K=3.
• Top left: The observations are shown.
• Top center: In Step 1 of the algorithm, each observation is
randomly assigned to a cluster.
• Top right: In Step 2(a), the cluster centroids are computed.
These are shown as large colored disks. Initially the
centroids are almost completely overlapping because the
initial cluster assignments were chosen at random.
• Bottom left: In Step 2(b), each observation is assigned to
the nearest centroid.
• Bottom center: Step 2(a) is once again performed, leading
to new cluster centroids.
• Bottom right: The results obtained after 10 iterations.
Example: different starting values
[Figure: K-means clustering performed with different random starting assignments; the labels above the panels give the final values of the objective (320.9, 235.8, 235.8, . . .), showing that different starts can converge to different local optima.]
Details of Previous Figure
Hierarchical Clustering
Hierarchical Clustering: the idea
Builds a hierarchy in a “bottom-up” fashion...
[Figure: observations A, B, C, . . . being successively merged into larger clusters.]
Hierarchical Clustering Algorithm
The approach in words:
• Start with each point in its own cluster.
• Identify the closest two clusters and merge them.
• Repeat.
• Ends when all points are in a single cluster.
Dendrogram
[Figure: the dendrogram for five points A–E; each merge is drawn at a height reflecting the dissimilarity between the two clusters being joined.]
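A small SciPy sketch of the algorithm above on synthetic data: linkage performs the bottom-up merging, fcluster cuts the resulting tree into a chosen number of clusters, and dendrogram (with matplotlib available) draws the tree.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram
from scipy.spatial.distance import pdist

rng = np.random.default_rng(4)
X = rng.normal(size=(20, 2))            # 20 observations, 2 features (illustrative)

# Bottom-up merging with complete linkage on Euclidean distances
Z = linkage(pdist(X, metric="euclidean"), method="complete")

labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 clusters
print(labels)
# dendrogram(Z)  # with matplotlib available, this draws the tree
```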
An Example
[Figure: simulated two-dimensional data (X1 vs. X2) and the dendrograms produced by hierarchically clustering these observations, with cuts at different heights yielding different numbers of clusters.]
Details of previous figure
Types of Linkage
Linkage: Description
Complete: Maximal inter-cluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the largest of these dissimilarities.
Single: Minimal inter-cluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the smallest of these dissimilarities.
Average: Mean inter-cluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the average of these dissimilarities.
Centroid: Dissimilarity between the centroid for cluster A (a mean vector of length p) and the centroid for cluster B. Centroid linkage can result in undesirable inversions.
Choice of Dissimilarity Measure
• So far have used Euclidean distance.
• An alternative is correlation-based distance which considers
two observations to be similar if their features are highly
correlated.
• This is an unusual use of correlation, which is normally
computed between variables; here it is computed between
the observation profiles for each pair of observations.
[Figure: the profiles of three observations plotted against the variable index (1–20), illustrating that two observations can be close in Euclidean distance yet weakly correlated, or far apart in Euclidean distance yet highly correlated.]
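A brief SciPy sketch (with made-up observation profiles) contrasting the two dissimilarity measures: observation 2 is a shifted and rescaled copy of observation 1, so it is far away in Euclidean distance but has essentially zero correlation-based distance.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(5)
base = rng.normal(size=20)
profiles = np.vstack([base,                  # observation 1
                      base * 3 + 10,         # observation 2: far in Euclidean terms, perfectly correlated with 1
                      rng.normal(size=20)])  # observation 3: uncorrelated with 1

print(squareform(pdist(profiles, metric="euclidean")).round(1))
# correlation-based distance: 1 - correlation between observation profiles
print(squareform(pdist(profiles, metric="correlation")).round(2))
```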
Practical issues
Example: breast cancer microarray study
Fig. 1. Hierarchical clustering of 115 tumor tissues and 7 nonmalignant tissues using the "intrinsic" gene set. (A) A scaled-down representation of the entire cluster of 534 genes and 122 tissue samples based on similarities in gene expression. (B) Experimental dendrogram showing the clustering of the tumors into five subgroups.
Fig. 5. Kaplan–Meier analysis of disease outcome in two patient cohorts. (A) Time to development of distant metastasis in the 97 sporadic cases from van't Veer et al. Patients were stratified according to the subtypes as shown in Fig. 2B. (B) Overall survival for 72 patients with locally advanced breast cancer in the Norway cohort. The normal-like tumor subgroups were omitted from both data sets in this analysis.