Genomic Signal Processing: Classification of Disease Subtype Based On Microarray Data
Genomic Signal Processing: Classification of Disease Subtype Based On Microarray Data
Genomic Signal Processing: Classification of Disease Subtype Based On Microarray Data
Lecture 2
Classication of disease subtype based on microarray data
1. Analysis of microarray data (see last 15 slides of Lecture 1)
2. Classication methods for microarray data
3. Discrimination analysis by linear discrimination
4. A case study: clasication of ALL/AML Leukemia
1
2. Classication methods for microarray data
Unsupervised learning = Learning without a teacher = Cluster analysis
Supervised learning = Learning with a teacher = Discriminant analysis
Classication is the name used mostly for supervised learning.
Examples of supervised learning:
1. You measure the spectrum of a frame a speech, and compute 24 averages
over xed subbands, and form with them the vector of 24 spectra coecients,
called a feature vector. Each frame is labeled with 1 if it was spoken by a boy,
with a 0 if it was spoken by a girl. If enough examples of frames and their
corresponding labels are available, one can nd a rule by which to guess boy
or girl, for each frame, given the vector of 24 spectral coecients.
2
2. You have a vector of 1000 gene expressions for each patient suering of
Leukemia. Leukemia has two major subtypes: ALL (acute lymphocytic
leukemia) and AML (acute myelogenous leukemia). A number of examples of
patients with ALL and AML are given, together with the gene expressions for
each patient. It is required to nd a rule by which to diagnose ALL or AML
based on the expressions of the genes.
The problem of discriminant analysis (or supervised learning):
The multivariate observations for each individual form the feature vector.
Each individual has associated a class label. In the training set all individuals
have known class labels, and the classication rules are learned (or designed)
over the training examples. In the testing set the class labels are not used for
determining the correct rule, only to estimate the performance (errors).
Evaluation of the performance
error rate = proportion of items missclassied
optimum error rate= the rate which will hold if all parameters of a statis-
tical model for the class labels and feature vectors are known.
3
apparent error rate = the rate we obtain by resubstituting the training
samples and determining the missclassications
Overall error rate versus group error rate (sometimes we dont like the
global optimum, which may be poor for some groups)
Error rates
Apparent rate: classify the training samples by the rule derived from the
training set itself. The estimator is overly optimistic if the sample size is not
much larger than the number of variables.
Leave-one-out Omit one observation, recalculate the classication rule from
all other observations, classify the deleted observation and repeat the steps
for each observation in turn. Counting the errors of missclassication yields
almost unbiased estimate of the error rate. However the variance is high (the
missclasications are correlated).
Crossvalidation is like leave-one-out, but now split the sample in k groups, use
k-1 of them to obtain the classication rule and then classify the remaining
group.
4
Variable selection
Select a small number of variables relative to the sizes of the two training sets.
Variables which are known to be highly correlated can be replaced by just one
variable.
Variables included are those whose scaled between group distances are the
largest.
In a forward program variables are included one at a time, based on which of
them decreases the error rate the most.
In a backward program one begins with the entire set and at each step drops
the variable that increases the error rate the least.
A simplistic plan: select rst variables based on their ratio BSS/WSS (see
linear discrimination part), and then perform a stepwise selection.
5
3. Discrimination analysis by linear discrimination
We discuss rst the two-class problems (only two class labels).
Let X
i
be the feature (measurement) vector of cell i, and D
i
{0, 1} be the class of cell i.
A useful characterization of the two classes, by the average over the class. Let m
0
be the
mean vector of the class 0
m
0
=
1
|{i|D
i
= 0}|
i|D
i
=0
X
i
(1)
and m
1
be the mean vector of the class 1
m
1
=
1
|{i|D
i
= 1}|
i|D
i
=1
X
i
(2)
The goal is to nd a vector a such the scalar a
T
X
i
is compared against a
T
m
0
and a
T
m
1
and
is classied according to the nearest of them.
Y
i
=
0 if |a
T
X
i
a
T
m
0
|
2
< |a
T
X
i
a
T
m
1
|
2
1 else
(3)
The discrimination equation is
(a
T
X
i
)
2
2(a
T
X
i
)(a
T
m
0
) + (a
T
m
0
)
2
< (a
T
X
i
)
2
2(a
T
X
i
)(a
T
m
1
) + (a
T
m
1
)
2
6
2(a
T
X
i
)a
T
(m
0
m
1
) < (a
T
m
1
)
2
(a
T
m
0
)
2
2(a
T
X
i
) <
(a
T
m
1
)
2
(a
T
m
0
)
2
a
T
(m
0
m
1
)
= a
T
(m
0
+ m
1
)
a
T
(X
i
m
0
+ m
1
2
) < 0 (4)
Thus the classier has the equivalent form
Y
i
=
0 if a
T
(X
i
m
0
+m
1
2
) < 0
1 else
(5)
We would certainly like to maximize the number of correct classications
max
a
Card{i|a
T
(X
i
m
0
+ m
1
2
) < 0; D
i
= 0} + Card{i|a
T
(X
i
m
0
+ m
1
2
) 0; D
i
= 1}(6)
but this problem has no closed form solution.
Since we cannot maximize directly the number of correct classications we consider the following,
easier, problem.
7
Heuristic problem: Fishers linear discriminant
We like to nd two centers, a
T
m
0
and a
T
m
1
, such that the data in class 0 is the
least scattered around a
T
m
0
, and data in class 1 is the least scattered around
a
T
m
1
, and moreover the centers are as further apart as possible. All these re-
quirements can be written in one single criterion to be maximized,
J(a) =
(a
T
m
0
a
T
m
1
)
2
2
0
+
2
1
(7)
where the variances are
2
0
=
i|D
i
=0
(a
T
X
i
a
T
m
0
)
2
=
i|D
i
=0
a
T
(X
i
m
0
)(X
i
m
0
)
T
a = a
T
S
0
a
2
1
=
i|D
i
=1
(a
T
X
i
a
T
m
1
)
2
=
i|D
i
=1
a
T
(X
i
m
1
)(X
i
m
1
)
T
a = a
T
S
1
a
and therefore
J(a) =
(a
T
m
0
a
T
m
1
)
2
2
0
+
2
1
=
a
T
(m
0
m
1
)(m
0
m
1
)
T
a
a
T
(S
0
+ S
1
)a
= (a
T
Ba)/(a
T
Wa) (8)
where the matrices B = (m
0
m
1
)(m
0
m
1
)
T
and W = S
0
+ S
1
are known.
8
Case A: Nonsingular Within Class Covariance matrix
To solve the maximization problem, suppose W = S
0
+ S
1
= W
T/2
W
1/2
is
positive denite and make the change of variable = W
1/2
a (a = W
1/2
) to
get
J(a) =
a
T
Ba
a
T
Wa
=
T
W
T/2
BW
1/2
(9)
The maximizing satises
= arg max
||||=1
T
W
T/2
BW
1/2
= arg max
,
T
Q + (1
T
) (10)
where we denote W
T/2
BW
1/2
= Q. Now the Lagrangian is minimized when
2Q 2 = 0, which shows that must be a eigenvector of Q and is its
corresponding eigenvalue. It is obvious that the maximization is realized by taking
as the (unit norm) eigenvector corresponding to the largest egenvalue
of
W
T/2
BW
1/2
= Q, when the criterion
J() =
T
Q =
(11)
9
Finally, the linear combination vector is
a
= W
1/2
= W
1/2
max eigenvector of(W
T/2
BW
1/2
) (12)
Simplications:
We may simplify the solution: W
T/2
BW
1/2
implies W
1/2
W
T/2
BW
1/2
=
W
1/2
or W
1
Ba
, therefore
W
1
(m
0
m
1
) =
(m
0
m
1
)
T
a
therefore
a
= W
1
(m
0
m
1
) (13)
or any scaled version of it.
10
Case B: Singular Within Class Covariance matrix
To solve the maximization problem when W = S
0
+ S
1
is singular consider the
over-parameterization a = sa
s
, such that a
T
s
Wa
s
= C with a constant scalar C,
which represent all possible a vectors. Observe that J(a) =
a
T
Ba
a
T
Wa
=
sa
T
s
Bsa
s
sa
T
s
Wsa
s
=
a
T
s
Ba
s
/C and construct the Lagrangian
J(a
s
, ) = a
T
s
Ba
s
+ (C a
T
s
Wa
s
) (14)
which has an extremum when 2Ba
s
2Wa
s
= 0, which tells that all Lagrangian
extremum points are at those a
s
which are scaled generalized eigenvectors of
(B, W). Recall that the generalized eigenvector decomposition of (B, W) obeys
BV = WV where is diagonal and V is a square matrix whose columns are
generalized eigenvectors, each eigenvector v
i
satisfying Bv
i
=
i
Wv
i
.
Denote a
s
any scaled generalized eigenvector satisfying Ba
s
1
Wa
s
= 0 and
C = a
s
T
Wa
s
, then J(a
s
,
1
) = a
s
T
Ba
s
, which can be evaluated by left-multiplying
Ba
s
=
1
Wa
s
with a
s
T
, to get J(a
s
,
1
) = a
s
T
Ba
s
=
1
a
s
T
Wa
s
=
1
C. There-
fore the maximum criterion J(a
s
,
1
) is obtained when
max
is the generalized
eigenvalue of (B, W) and a
s
is its corresponding eigenvector. We consider that
11
the diagonal matrix is ordered such that
1
is the maximum generalized eigen-
value, and v
1
the corresponding eigenvector. Therefore a = v
1
is the linear
discrimination vector maximizing the Fischer discrimination criterion.
12
Summary
1. Given the feature vector X
i
for each individual i, and the class label for each
individual
2. Compute the average over each class m
0
and m
1
.
3. Compute the covariance matrices
S
0
=
i|D
i
=0
(X
i
m
1
)(X
i
m
1
)
T
S
1
=
i|D
i
=1
(X
i
m
1
)(X
i
m
1
)
T
4. Construct the matrix W = S
0
+ S
1
and take the optimum discriminator as
a
= W
1
(m
0
m
1
).
5. Use the discriminator a
0 if a
T
(X
i
m
0
+m
1
2
) < 0
1 else
(15)
13
How good is one gene as a feature? A simple indicator: BSS/WSS
When the feature vector is just a scalar (it is composed of a single gene
expression) the discrimination criterion reduces to
J(a) = (a
T
Ba)/(a
T
Wa) = B/W (16)
where the scalar B = (m
0
m
1
)
2
is the spread between the centers of the
classes and W = S
0
+ S
1
=
i|D
i
=0
(X
i
m
0
)
2
+
i|D
i
=1
(X
i
m
1
)
2
is the
within class spread. J(a) is proportional to the quantity BSS/WSS dened
below.
Let denote N
0
the number of members of class 0, N
1
the number of members of class 1, m
the overall mean and N the total number of individuals. The following sums of squares can
be dened:
TSS =
i
(X
i
m)
2
BSS = N
0
(m
0
m)
2
+ N
1
(m
1
m)
2
WSS = W =
i|D
i
=0
(X
i
m
0
)
2
+
i|D
i
=1
(X
i
m
1
)
2
(17)
BSS is called between sum of squares; TSS is the total sum of squares; WSS = W = is
the within sum of squares. We have TSS = WSS + BSS.
14
Due to the identity
N
0
N
1
N
(m
0
m
1
)
2
= N
0
(m
0
m)
2
+ N
1
(m
1
m)
2
= BSS we have so
BSS/WSS = B/W
N
0
N
1
N
=
N
0
N
1
N
J(a). Thus BSS/WSS gives a good indication of the
discrimination capabilities of a scalar feature.
15
An example: MIT Leukemia data set
The data le
The le 1709.mat contains the matrix M preprocessed with process1.m (oor 100
& ceiling 16000 for each element; exclusion max/min <= 5 & (maxmin) <=
500; log10). The vector v is created such that (1 if ALL and 0 if AML).
Some statistics
min(min(M))=0; max(max(M))=4.5791; mean(mean(M))=2.8433;
median(median(M))=2.8156;
v1=M(:); median(v1)=2.8116; imagesc([ones(100,1)*v*max(max(M)) ; M])
colormap(gray)
16
10 20 30 40 50 60 70
200
400
600
800
1000
1200
1400
1600
1800
Figure 1: The dataset MitLeukemia preprocessed to keep 1709 gene expressions of 72 cell types. The bottom white/black regions
show the known classication of the cells: Cell type 1=ALL (white); Cell type 0=ALM (black).
17
V
a
l
u
e
s
Column Number
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172
0
0.5
1
1.5
2
2.5
3
3.5
4
4.5
Figure 2: Boxplot of the dataset MitLeukemia preprocessed to keep 1709 gene expressions of 72 cell types. BOX-
PLOT(X,NOTCH,SYM,VERT,WHIS) produces a box and whisker plot for each column of X. The box has lines at the lower
quartile, median, and upper quartile values. The whiskers are lines extending from each end of the box to show the extent of the
rest of the data. Outliers are data with values beyond the ends of the whiskers.
18
0 10 20 30 40 50 60 70 80
0
0.5
1
1.5
2
2.5
3
3.5
4
4.5
5
Figure 3: Mean plus/minus four times the standard deviation of the dataset MitLeukemia preprocessed to keep 1709 gene expressions
of 72 cell types.
19
1 2 3 4
0
0.1
0.2
0.3
0.4
0.5
0.6
N
u
m
b
e
r
o
f
m
i
s
s
c
l
a
s
i
f
i
e
d
t
e
s
t
c
e
l
l
s
Number of predictor genes/10
Figure 4: Boxplot of missclassications in 100 runs of 2:1 sampling scheme (48 training cells and 24 test cells). The predictor genes
are the rst in the dataset, (1:p), with p=10,20,30,40.
20
1 2 3 4
0.05
0.1
0.15
0.2
0.25
0.3
0.35
0.4
0.45
N
u
m
b
e
r
o
f
m
i
s
s
c
l
a
s
i
f
i
e
d
t
e
s
t
c
e
l
l
s
Number of predictor genes/10
Figure 5: Boxplot of missclassications in 100 runs of 2:1 sampling scheme (48 training cells and 24 test cells). The predictor genes
are taken at the position 100+(1:p) with p=10,20,30,40, to compare random choices (quite similar with the previous gure].
21
1 2 3 4
0
0.05
0.1
0.15
0.2
0.25
0.3
0.35
0.4
N
u
m
b
e
r
o
f
m
i
s
s
c
l
a
s
i
f
i
e
d
t
e
s
t
c
e
l
l
s
Number of predictor genes/10
Figure 6: A good feature selection: Boxplot of missclassications in 100 runs of 2:1 sampling scheme (48 training cells and 24 test
cells). The predictor genes are arranged in decreasing order of their BSS/WSS, and p=10,20,30,40.
22
1 2 3 4
0
0.05
0.1
0.15
0.2
0.25
0.3
0.35
0.4
0.45
0.5
N
u
m
b
e
r
o
f
m
i
s
s
c
l
a
s
i
f
i
e
d
t
e
s
t
c
e
l
l
s
Number of predictor genes/10
Figure 7: A good feature selection: another realization of the crossvalidation split. Boxplot of missclassications in 100 runs of 2:1
sampling scheme (48 training cells and 24 test cells). The predictor genes are arranged in decreasing order of their BSS/WSS, and
p=10,20,30,40.
23
1 2 3 4
0
0.05
0.1
0.15
0.2
0.25
0.3
0.35
0.4
0.45
N
u
m
b
e
r
o
f
m
i
s
s
c
l
a
s
i
f
i
e
d
t
e
s
t
c
e
l
l
s
Number of predictor genes/10
Figure 8: A very good feature selection (without the normalization to unit variance across the genes). Boxplot of missclassications
in 100 runs of 2:1 sampling scheme (48 training cells and 24 test cells). The predictor genes are arranged in decreasing order of
their BSS/WSS, and p=10,20,30,40.
24
0 10 20 30 40 50 60 70 80
60
40
20
0
20
40
60
80
100
120
Cell index i
1
0
0
(
n
c
n
m
)
/
(
n
c
+
n
m
)
Number of correct classifications minus number of missclassification [%]
100(n
c
(i)n
m
(i))/(n
c
(i)+n
m
(i))
100(n
c
(i)+n
m
(i))/N
i
ter
Figure 9: Individual rates of missclassications for each cell. Three cells are systematically missclassied.
25
1 2 3 4 5 6 7 8 9 10
0.1
0.08
0.06
0.04
0.02
0
0.02
0.04
0.06
0.08
0.1
predictor gene index i
a
i
t
h
e
l
i
n
e
a
r
d
i
s
c
r
i
m
i
n
a
t
i
o
n
c
o
e
f
f
i
c
i
e
n
t
Figure 10: The linear discriminator in 100 runs.
26