Chemometrics
Contents
1. Introduction
2. Linear algebra
3. Factor analysis
   - Principal component analysis
   - Multivariate curve resolution
4. Multivariate regression
   - Multiple linear regression
   - Principal component regression
   - Partial least squares regression
5. Classification
   - Principal component discriminant function analysis
   - Partial least squares discriminant analysis
6. Conclusion
Chemoinformatics
[Diagram: chemoinformatics spans the creation, design, analysis, dissemination, visualisation, organisation, management and use of chemical data (e.g. graph structures, 3D descriptors), and includes virtual screening, modelling, QSARs and chemometrics]
Chemometrics
Chemometrics is the science of relating measurements made on a chemical system or process to the state of the system via application of mathematical or statistical methods
Chemometrics
Advantages:
- Fast and efficient on modern computers
- Statistically valid
- Removes potential bias
- Uses all information available
Data analysis
- Identification: What chemicals are on the surface? Where are they located?
- Calibration / Quantification
Data matrix
[Figure: mass spectra of three samples, intensity vs mass (1 to 5)]

Each mass spectrum forms one row of the data matrix, with samples as rows and variables (masses) as columns:

$$X = \begin{pmatrix} 9 & 32 & 10 & 1 & 21 \\ 18 & 20 & 22 & 4 & 12 \\ 24 & 12 & 30 & 6 & 6 \end{pmatrix}$$
Vector algebra

Dot product:

$$a \cdot b = a_x b_x + a_y b_y + a_z b_z$$

For example, with $a = \begin{pmatrix} 1 \\ 2 \\ 4 \end{pmatrix}$ and $b = \begin{pmatrix} 4 \\ 2 \\ 2 \end{pmatrix}$:

$$a \cdot b = a^\top b = 1 \cdot 4 + 2 \cdot 2 + 4 \cdot 2 = 16$$

Vector length:

$$\|a\| = \sqrt{a \cdot a} = \sqrt{a^\top a} = \sqrt{1^2 + 2^2 + 4^2} = \sqrt{21}$$

Angle between vectors:

$$a \cdot b = \|a\| \|b\| \cos\theta$$

(for the example above, $\theta = 44.5°$)

Orthogonality: if $a \cdot b = 0$ the vectors are orthogonal, i.e. at right angles ($\theta = 90°$). If they are also of unit length they are orthonormal, i.e. $a^\top a = 1$ and $b^\top b = 1$.

Collinearity: if $\theta = 0°$ the vectors are collinear.

Correlation: if $0° < \theta < 90°$ the vectors are neither orthogonal nor collinear; they are correlated.
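A quick numerical check of these vector operations (a minimal numpy sketch; the arrays are the example vectors above):

```python
import numpy as np

a = np.array([1, 2, 4])
b = np.array([4, 2, 2])

print(a @ b)              # 16: the dot product a.b
print(np.sqrt(a @ a))     # 4.583...: |a| = sqrt(21)

# Angle between a and b from a.b = |a||b| cos(theta)
cos_theta = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(np.degrees(np.arccos(cos_theta)))   # ~44.5 degrees
```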
Matrix algebra
Matrix addition: $A + B = C$, with dimensions $(I \times K) + (I \times K) = (I \times K)$. For example:

$$\begin{pmatrix} 2 & 4 & 1 \\ 3 & 8 & 6 \end{pmatrix} + \begin{pmatrix} 1 & 2 & 0 \\ 0 & 1 & 2 \end{pmatrix} = \begin{pmatrix} 3 & 6 & 1 \\ 3 & 9 & 8 \end{pmatrix}$$

Matrix multiplication: $AB = C$, with dimensions $(I \times N)(N \times K) = (I \times K)$. The number of columns of A must equal the number of rows of B; the element in row i and column j of the product AB is row i of A times column j of B. For example:

$$\begin{pmatrix} 1 & 4 \\ 2 & 2 \\ 4 & 2 \end{pmatrix} \begin{pmatrix} 1 & 2 \\ 3 & 2 \end{pmatrix} = \begin{pmatrix} 1 \cdot 1 + 4 \cdot 3 & 1 \cdot 2 + 4 \cdot 2 \\ 2 \cdot 1 + 2 \cdot 3 & 2 \cdot 2 + 2 \cdot 2 \\ 4 \cdot 1 + 2 \cdot 3 & 4 \cdot 2 + 2 \cdot 2 \end{pmatrix} = \begin{pmatrix} 13 & 10 \\ 8 & 8 \\ 10 & 12 \end{pmatrix}$$
Matrix inverse
Identity matrix: diagonal of 1s.

$$I = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \qquad AI = A$$

The inverse satisfies $AA^{-1} = I$ (it only exists if the matrix is full rank). If $AB = C$ then $B = A^{-1}C$, or $B = A^{+}C$ for a rectangular matrix, where the pseudoinverse is

$$A^{+} = A^\top (AA^\top)^{-1}, \qquad \text{so that } AA^{+} = I$$
Additional properties
Simultaneous equations of any size can be solved with matrices. Rank = number of unique equations; this system is rank 2, since the third equation is twice the first:

$$x + 2y = 5$$
$$3x + 2y = 7$$
$$2x + 4y = 10$$

In matrix form:

$$\begin{pmatrix} 1 & 2 \\ 3 & 2 \\ 2 & 4 \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} 5 \\ 7 \\ 10 \end{pmatrix}$$

Keeping the two independent equations:

$$\begin{pmatrix} 1 & 2 \\ 3 & 2 \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} 5 \\ 7 \end{pmatrix}, \qquad \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} 1 & 2 \\ 3 & 2 \end{pmatrix}^{-1} \begin{pmatrix} 5 \\ 7 \end{pmatrix} = \begin{pmatrix} 1 \\ 2 \end{pmatrix}$$

Rank is the maximum number of rows or columns that are linearly independent; for an $I \times K$ matrix, $R \le \min(I, K)$. To obtain a unique solution we require number of variables $\le$ rank. A numerical check follows below.
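A minimal numpy sketch of this example (the matrices are those above; `lstsq` finds the least-squares solution, which is exact here since the system is consistent):

```python
import numpy as np

# The equations x + 2y = 5, 3x + 2y = 7, 2x + 4y = 10 as A v = b
A = np.array([[1.0, 2.0],
              [3.0, 2.0],
              [2.0, 4.0]])
b = np.array([5.0, 7.0, 10.0])

print(np.linalg.matrix_rank(A))   # 2: the third row is twice the first

v, *_ = np.linalg.lstsq(A, b, rcond=None)
print(v)                          # [1. 2.]  ->  x = 1, y = 2
```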
Matrix projections
To write a in terms of x* and y*, we find its projections on the new axes
In the original axes, $a = 2x + 3y$:

$$a = \begin{bmatrix} 2 & 3 \end{bmatrix} \begin{pmatrix} x \\ y \end{pmatrix}$$

[Figure: vector a plotted in the original axes (x, y) and in new axes (x*, y*) rotated by 30°]

In the new axes:

$$a = \begin{bmatrix} 3.2 & 1.6 \end{bmatrix} \begin{pmatrix} x^* \\ y^* \end{pmatrix}$$
Data matrix
[Figure: mass spectra of the three samples and of the two pure chemicals, intensity vs mass (1 to 5)]

The data matrix (samples x variables [mass]) is the product of the sample compositions (samples x chemicals) and the chemical spectra (chemicals x variables [mass]):

$$\underbrace{\begin{pmatrix} 9 & 32 & 10 & 1 & 21 \\ 18 & 20 & 22 & 4 & 12 \\ 24 & 12 & 30 & 6 & 6 \end{pmatrix}}_{\text{data matrix}} = \underbrace{\begin{pmatrix} 5 & 1 \\ 2 & 4 \\ 0 & 6 \end{pmatrix}}_{\text{sample composition}} \underbrace{\begin{pmatrix} 1 & 6 & 1 & 0 & 4 \\ 4 & 2 & 5 & 1 & 1 \end{pmatrix}}_{\text{chemical spectra}}$$

- Instead of x, y, z in real space, the axes are mass 1, mass 2, mass 3 etc. in variable space (also called data space)
- Without noise, rank of dataset = number of unique components
- With random, uncorrelated noise, rank of dataset = number of samples or number of variables, whichever is smaller
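The factorisation above can be checked directly (a minimal numpy sketch using the matrices from the slide):

```python
import numpy as np

# Sample compositions C (samples x chemicals) and pure spectra S
# (chemicals x variables [mass]) from the example above
C = np.array([[5, 1],
              [2, 4],
              [0, 6]])
S = np.array([[1, 6, 1, 0, 4],
              [4, 2, 5, 1, 1]])

X = C @ S
print(X)
# [[ 9 32 10  1 21]
#  [18 20 22  4 12]
#  [24 12 30  6  6]]

print(np.linalg.matrix_rank(X))  # 2: two unique components, no noise
```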
Data analysis
- Identification: What chemicals are on the surface? Where are they located?
- Calibration / Quantification
Terminology
In order to clarify existing terminology and emphasise the relationship between the different chemometrics techniques, the following terminology is adopted in this tutorial
Terms used here: factors P and projections T.
- In PCA, factors are called loadings, eigenvectors or principal components, and projections are called scores.
- In MCR, factors are the component spectra and projections are the component concentrations (scores).
Factor analysis
- Factors are linear combinations of the original variables (i.e. masses)
- Equivalent to a rotation in data space: factors are new axes
- Data are described by their projections onto the factors
- PCA is a factor analysis technique
PCA is a technique for reducing matrices of data to their lowest dimensionality by describing them using a small number of factors:

$$X = TP^\top + E = \sum_{n=1}^{N} t_n p_n^\top + E$$

where T holds the projections, P the factors, and E the experimental noise. We decompose the data X (rank R) into N simpler matrices of rank 1, where N < R. Each simple matrix is the outer product of two vectors, $t_n$ $(I \times 1)$ and $p_n^\top$ $(1 \times K)$.
PCA outline
[Flowchart: Raw Data → data selection and preprocessing → Data Matrix → matrix multiplication → Covariance Matrix → decomposition → reproduction. After Malinowski, Factor Analysis in Chemistry, John Wiley & Sons (2002)]
PCA decomposition
The covariance matrix contains information about the variances of the data points within the dataset, and is defined as

$$Z = X^\top X \qquad (K \times K) = (K \times I)(I \times K)$$

In PCA, Z is decomposed into a set of eigenvectors p and associated eigenvalues $\lambda$, such that

$$Zp = \lambda p \qquad (K \times K)(K \times 1) = (K \times 1)$$

- Eigenvalues are positive or zero
- The number of non-zero eigenvalues = rank of the data, R
- Eigenvectors are orthonormal

A sketch of this decomposition in code follows.
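A minimal numpy sketch of PCA by eigendecomposition of the covariance matrix (the function name `pca_eig` is ours; real analyses would normally use a library implementation):

```python
import numpy as np

def pca_eig(X, n_factors):
    """PCA via eigendecomposition of the covariance matrix Z = X'X."""
    Z = X.T @ X                              # (K x K) covariance matrix
    eigvals, eigvecs = np.linalg.eigh(Z)     # eigh since Z is symmetric
    order = np.argsort(eigvals)[::-1]        # sort by descending eigenvalue
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    P = eigvecs[:, :n_factors]               # factors (orthonormal columns)
    T = X @ P                                # projections of the data
    return T, P, eigvals
```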
PCA factors
Because Z is the covariance matrix, the eigenvectors of Z are special directions in data space that are optimal for describing the variance of the data; the eigenvalues are the amounts of variance described by their associated eigenvectors.

$$X = TP^\top = \sum_{n=1}^{N} t_n p_n^\top$$

These eigenvectors are the factors PCA obtains for the factor analysis equation. They are sorted by their eigenvalues: PCA factors successively capture the largest amount of variance (spread) within the dataset.

Instead of describing the data using correlated variables x and y, we transform them onto a new basis (the factors), which are uncorrelated. By removing higher factors (variances due to noise) we can reduce the dimensionality of the data: factor compression.

[Figure: data cloud in (x, y) with PCA Factor 1 along the direction of greatest variance and PCA Factor 2 orthogonal to it]
Number of factors
A data set of 8 spectra was made by mixing 3 pure compound spectra.

[Figure: eigenvalue plots vs factor number (1 to 8) for (a) no noise and (b) with noise, on a logarithmic scale]

Scree test: the eigenvalue plot levels off in a linearly decreasing manner after 3 factors.
Data reproduction
$$X = TP^\top + E = \sum_{n=1}^{N} t_n p_n^\top + E$$

$$\hat{X} = X - E = TP^\top, \qquad E = X - \hat{X} = X - \sum_{n=1}^{N} t_n p_n^\top$$

$\hat{X}$ is the reproduced data matrix:
- reproduced from the N selected factors and projections
- noise filtered, by removal of the higher factors that describe noise variations
- useful for MCR

E is the matrix of residuals:
- should contain noise only
- useful for judging the quality of the PCA model
- may show up unexpected features!

A sketch in code follows.
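Continuing the earlier `pca_eig` sketch, reproduction and residuals take one line each (X here is any preprocessed data matrix):

```python
# Reproduce the data from the first N factors and inspect the residuals
T, P, eigvals = pca_eig(X, n_factors=3)
X_hat = T @ P.T         # reproduced, noise-filtered data matrix
E = X - X_hat           # residuals: should contain noise only
print(np.abs(E).max())  # large values flag unexpected features
```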
Data preprocessing
- Enhances PCA by bringing out important variance in the dataset
- Makes assumptions about the nature of the variance in the data
- Can distort interpretation and quantification
- Includes: mass binning, peak selection, mean centering, normalisation, variance scaling, Poisson scaling, logarithmic transformation
Mean centering
Subtract the mean spectrum from each sample; PCA then describes variations from the mean. The preprocessed data for sample i and mass k are

$$\tilde{X}_{i,k} = X_{i,k} - \operatorname{mean}(X_{:,k})$$

[Figure: raw data in (y, z): the 1st factor goes from the origin and accounts for the highest variance; after mean centering, Factors 1 and 2 describe variations about the mean]
Normalisation
$$\tilde{X}_{i,k} = X_{i,k} \cdot \frac{1}{\operatorname{sum}(X_{i,:})}$$

($\tilde{X}$: preprocessed data; $X$: raw data, for sample i and mass k)

- Divide each spectrum by its total ion intensity
- Reduces effects of topography, sample charging and drift in primary ion current
- Assumes chemical variances can be described by relative changes in ion intensities
- Reduces the rank of the data by 1
Variance scaling
$$\tilde{X}_{i,k} = X_{i,k} \cdot \frac{1}{\operatorname{var}(X_{:,k})}$$

($\operatorname{var}(X_{:,k})$: variance of mass k across the dataset)

- Divide each variable by its variance in the dataset
- Equalises the importance of each variable (i.e. mass)
- Problematic for weak peaks, so usually used with peak selection
- Called auto scaling if combined with mean centering

[Figure: for each variable (mass, in a SIMS spectrum), the mean and variance under raw data, mean centering, variance scaling and auto scaling. After P. Geladi and B. Kowalski, Partial Least-Squares Regression: A Tutorial, Analytica Chimica Acta 185 (1986) 1]
Poisson scaling
$$\tilde{X}_{i,k} = X_{i,k} \cdot \frac{1}{\sqrt{\operatorname{mean}(X_{i,:})}} \cdot \frac{1}{\sqrt{\operatorname{mean}(X_{:,k})}}$$

($\operatorname{mean}(X_{i,:})$: mean intensity of sample i; $\operatorname{mean}(X_{:,k})$: mean intensity of mass k)

- SIMS data are dominated by Poisson counting noise
- Equalises the noise variance of each data point
- Provides greater noise rejection
[Figure: eigenvalue plots with no preprocessing: how many factors?]

Effect of preprocessing
- No preprocessing: the first factor goes from the origin to the mean of the data
- Mean centering: all factors describe variations from the mean
- Normalisation: equalises the total ion yield of each sample and emphasises relative changes in ion intensities
- Variance scaling: equalises the variance of every peak regardless of intensity; best with peak selection
- Poisson scaling: equalises the noise variance of each data point; provides greater noise rejection

A sketch of these preprocessing steps in code follows.
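A minimal numpy sketch of the four scalings above (function names are ours; X is a samples x masses matrix of raw counts):

```python
import numpy as np

def mean_center(X):
    # Subtract the mean spectrum from each sample (row)
    return X - X.mean(axis=0)

def normalise(X):
    # Divide each spectrum by its total ion intensity
    return X / X.sum(axis=1, keepdims=True)

def variance_scale(X):
    # Divide each variable (mass) by its variance across the dataset
    return X / X.var(axis=0)

def poisson_scale(X):
    # Divide by the square roots of the sample and mass mean intensities
    return X / np.sqrt(X.mean(axis=1, keepdims=True)) / np.sqrt(X.mean(axis=0))
```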
- Three protein compositions (100% fibrinogen, 50% fibrinogen / 50% albumin, 100% albumin) adsorbed onto poly(DTB suberate)
- The first factor (PC1) shows the relative abundance of the amino acid peaks of the two proteins
PCA Factors
- 16 different single-protein films adsorbed on mica
- Excellent classification of the proteins using only 2 factors
- Factors consistent with the total amino acid compositions of the various proteins
- 95% confidence limits provide a means for identification / classification
The datacube contains a raster of I x J pixels and K mass peaks:
- The datacube is rearranged ("unfolded") into a 2D data matrix with dimensions [(I·J) x K] prior to PCA
- PCA results are folded back to form projection images prior to interpretation

[Figure: unfolding of an I x J x K datacube (I rows, J columns, K mass peaks) into an (I·J) x K data matrix]

A sketch of unfolding and folding in code follows.
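Unfolding and folding are just reshapes (a minimal numpy sketch; the cube and the projection are made-up stand-ins):

```python
import numpy as np

I, J, K = 128, 128, 100                                  # pixels and mass peaks
cube = np.random.poisson(5.0, (I, J, K)).astype(float)   # stand-in datacube

X = cube.reshape(I * J, K)   # unfold: one row per pixel, one column per mass
t = X.mean(axis=1)           # stand-in for a projection onto one PCA factor
image = t.reshape(I, J)      # fold back into an I x J projection image
```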
[Figure: log(eigenvalue) plots and factor images (50 μm field of view, mass in u) for mean centering, normalisation and Poisson scaling]

Only 2 factors are needed: the dimensionality of the image is reduced by a factor of 20! (J. Lee, I. S. Gilmore, to be published)
PCA factors are linear combinations of the chemical components and optimally describe variance, so raw PCA results can be difficult to interpret. After Varimax rotation, distributions and characteristic peaks are obtained, simplifying interpretation of a huge dataset.

[Figure: projection images and factor spectra (Factors 1 to 5, mass in u) before and after Varimax rotation]
Multivariate curve resolution

We want to resolve the original chemical spectra, i.e. reverse the following process:

$$\underbrace{\begin{pmatrix} 9 & 32 & 10 & 1 & 21 \\ 18 & 20 & 22 & 4 & 12 \\ 24 & 12 & 30 & 6 & 6 \end{pmatrix}}_{\text{data matrix}} = \underbrace{\begin{pmatrix} 5 & 1 \\ 2 & 4 \\ 0 & 6 \end{pmatrix}}_{\text{sample composition}} \underbrace{\begin{pmatrix} 1 & 6 & 1 & 0 & 4 \\ 4 & 2 & 5 & 1 & 1 \end{pmatrix}}_{\text{chemical spectra}}$$

We use multivariate curve resolution (also called self-modelling mixture analysis):

$$X = TP^\top + E$$

where E is the experimental noise.
MCR is designed for the recovery of chemical spectra and contributions from a multicomponent mixture when little or no prior information about the composition is available. It uses an iterative least-squares algorithm to extract solutions, while applying suitable constraints, e.g. non-negativity.

[Figure: data cloud in (x, y) with MCR Factor 1 and MCR Factor 2]
1. Determine the number of factors N via the eigenvalue plot
2. Obtain the PCA reproduced data matrix for N factors
3. Obtain initial estimates of the spectra (factors) or contributions (projections):
   - random initialisation
   - PCA factors
   - Varimax rotated PCA factors
   - a pure variable detection algorithm, e.g. SIMPLISMA
4. Apply constraints:
   - non-negativity
   - equality
Outline of MCR
[Flowchart: Raw Data → Data Matrix → PCA → MCR-ALS → MCR projections T and MCR factors P]
MCR-ALS algorithm
(1) Find an estimate of T using P, applying constraints
(2) Find a new estimate of P using T, applying constraints
(3) Compute the MCR reproduced matrix
(4) Compare results and check convergence

Steps (1) to (4) are repeated until the MCR factors P and projections T are able to reconstruct the reproduced data matrix $\hat{X}$ within the acceptable error specified in the convergence criterion. The least-squares estimates use the pseudoinverse of a rectangular matrix, $A^{+} = A^\top [A A^\top]^{-1}$. A sketch in code follows.
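A minimal numpy sketch of the alternating least-squares loop, with non-negativity imposed by clipping (the function and its details are our assumptions, not the exact MCR-ALS implementation):

```python
import numpy as np

def mcr_als(X, P0, n_iter=500, tol=1e-8):
    """Alternate least-squares estimates of projections T and factors P,
    clipping negative values to enforce non-negativity."""
    P = P0.copy()                                      # initial spectra (N x K)
    for _ in range(n_iter):
        T = np.clip(X @ np.linalg.pinv(P), 0, None)    # (1) estimate T from P
        P = np.clip(np.linalg.pinv(T) @ X, 0, None)    # (2) estimate P from T
        X_hat = T @ P                                  # (3) reproduced matrix
        if np.linalg.norm(X - X_hat) < tol * np.linalg.norm(X):
            break                                      # (4) convergence check
    return T, P
```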
Rotational ambiguity
- MCR can suffer from rotational ambiguity
- The accuracy of the resolved spectra depends on selectivity, i.e. the existence of pixels or samples with a contribution from only one component
- Good initial estimates are essential
- Peaks from the intense components may appear in the spectra resolved for a weak component

[Figure: data cloud with ambiguous placements of MCR Factors 1 and 2 ("Chemical 1?", "Chemical 2?"); image, 50 μm field of view]
[Figure: MCR projection images (Projections 1 to 5) and resolved factor spectra (Factors 1 to 5), mass in u]

Distributions and characteristic peaks are obtained, in complete agreement with PCA and manual analysis by an expert.
- Three images are each assigned a SIMS spectrum (PBC, PC, PVT) and combined to form a multivariate image dataset
- Poisson noise is added to the images (on average ~50 counts per pixel)
- Projections onto the PCA factors show combinations of the original images

[Figure: PCA projections 1 to 3 and MCR projections 1 to 3 of the simulated image dataset]
Data analysis
- Identification: What chemicals are on the surface? Where are they located?
- Calibration / Quantification
Regression analysis
We can build a model to predict the properties of materials (measured properties such as molecular weight or density, or an XPS measurement) from their SIMS spectra:

$$y = f(x) + e$$
$$y = b_1 x_1 + b_2 x_2 + b_3 x_3 + \ldots + b_m x_m + e$$

where the $b_m$ are the regression coefficients.

[Figure: mass spectra (intensity vs mass) related to measured properties through a regression coefficient vector]
Multiple linear regression

$$Y = XB + E$$

(Y: dependent variables; X: SIMS data matrix; B: regression matrix; E: error)

$$B = X^{+}Y \quad \text{or} \quad B = (X^\top X)^{-1} X^\top Y$$

where $X^{+} = (X^\top X)^{-1} X^\top$ is the pseudoinverse of X. Here $X^\top X$ is the covariance matrix of X. In SIMS this is likely to be close to singular, so a well-defined inverse matrix cannot be found. This is due to the problem of collinearity, caused by linearly dependent rows or columns in the matrix.
MLR finds the least squares solution, i.e. the best $R^2$ correlation between Y and the projections of the data onto the regression vector, XB. A sketch in code follows.
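A minimal numpy sketch of MLR; `lstsq` is used instead of forming $(X^\top X)^{-1}$ explicitly, which is unstable when X is nearly singular (the collinearity problem just described):

```python
import numpy as np

def mlr_fit(X, Y):
    # Least-squares regression matrix B in Y = X B + E
    B, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return B

def mlr_predict(X, B):
    return X @ B
```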
Principal component regression

- PCA reduces the dimensionality of the data and reduces the effect of noise
- The PCA projection matrix gives the coordinates of the data points in the reduced factor space
- Hence we can use the PCA projection matrix T in our linear regression:

$$Y = TB + E \qquad (I \times M) = (I \times N)(N \times M) + (I \times M)$$

(T: PCA projection matrix; B: regression matrix; E: error)

$$B = T^{+}Y = (T^\top T)^{-1} T^\top Y$$

$T^\top T$ is now guaranteed to be invertible, since the columns of the PCA projection matrix (the projections onto orthonormal factors) are orthogonal.
With a single factor, PCR finds the correlation between Y and the projection of the data onto the first PCA factor, T; the regression vector is then a multiple of the PCA factor P. For more than one factor, PCR finds the linear combinations of the projections T that are best for predicting Y, i.e. the regression vectors are linear combinations of the PCA factors P. A sketch in code follows.

A. M. C. Davies, T. Fearn, Spectroscopy Europe 17 (2005) 28
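A minimal PCR sketch reusing the earlier `pca_eig` function (our naming); the returned regression vectors live in the original variable space:

```python
import numpy as np

def pcr_fit(X, Y, n_factors):
    T, P, _ = pca_eig(X, n_factors)            # PCA projections and factors
    B, *_ = np.linalg.lstsq(T, Y, rcond=None)  # regress Y on T (T'T invertible)
    return P @ B    # regression vectors as combinations of PCA factors

# Prediction for new spectra: Y_hat = X_new @ pcr_fit(X, Y, n_factors)
```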
- In PCR, the projections T are computed to model X only
- By choosing directions that maximise the variance in the data X, we hope to include information which relates the original variables to Y
- The first few PCA factors of X may contain only matrix effects and may have no relation to the quantities Y which we want to predict
Introducing PLS
- PLS extracts projections that are common to both X and Y
- This is done by simultaneous decomposition of X and Y using an iterative algorithm (NIPALS)
- It removes redundant information from the regression, i.e. factors describing large amounts of variance in X that do not correlate with Y
- A more viable, robust solution using a smaller number of factors
For the decomposition of the single matrix X in PCA, NIPALS calculates $t_1$ and $p_1$ alternately until convergence; the next set of factors, $t_2$ and $p_2$, is calculated by fitting the residuals (the data not explained by $p_1$). For the simultaneous decomposition of X $(I \times K)$ and Y $(I \times M)$, PLS finds a mutual set of projections common to X and Y, so that $t_X = t_Y$.

[Diagram: (1) PCA decomposition of X (I x K) into t (I x 1) and p (1 x K); (2) simultaneous PLS decomposition of X (I x K) and Y (I x M) with common projections t, x-loadings p (1 x K) and y-loadings q (1 x M). From E. Malinowski, Factor Analysis in Chemistry, John Wiley and Sons (2002)]
PLS formulation
$$X = TP^\top + E, \qquad Y = TQ^\top + F$$

(T: projections; E, F: errors)

$$Y = XB + E, \qquad B = X^{+}Y = (P^\top)^{+} Q^\top = W Q^\top$$

(B: regression vector; W: weights matrix)

- T are the PLS projections used to predict Y from X (often referred to as scores)
- W is the weights matrix, and reflects the covariance structure between X and Y
- P and Q are not orthogonal matrices, due to the constraint of finding common projections T; they are sometimes called x-loadings and y-loadings respectively
- In the literature, "latent variable" refers to the set of quantities t, p and q associated with each PLS factor

A NIPALS sketch in code follows.
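A minimal NIPALS PLS sketch in numpy (our own implementation of the textbook iteration; Y must be a 2D column matrix, and library routines should be preferred in practice):

```python
import numpy as np

def pls_nipals(X, Y, n_factors, tol=1e-10, max_iter=500):
    """Simultaneous decomposition of X and Y into common projections t,
    weights w, x-loadings p and y-loadings q, deflating after each factor."""
    E, F = X.copy(), Y.copy()            # residuals of X and Y
    T, W, P, Q = [], [], [], []
    for _ in range(n_factors):
        u = F[:, [0]]                    # initial Y projection
        t_old = 0.0
        for _ in range(max_iter):
            w = E.T @ u
            w /= np.linalg.norm(w)       # weights
            t = E @ w                    # common projection (scores)
            q = F.T @ t / (t.T @ t)      # y-loadings
            u = F @ q / (q.T @ q)
            if np.linalg.norm(t - t_old) < tol:
                break
            t_old = t
        p = E.T @ t / (t.T @ t)          # x-loadings
        E = E - t @ p.T                  # deflate X
        F = F - t @ q.T                  # deflate Y
        T.append(t); W.append(w); P.append(p); Q.append(q)
    return [np.hstack(m) for m in (T, W, P, Q)]
```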
- SIMS spectra of thin films of Irganox were compared with their thicknesses measured by XPS
- A PLS model was able to predict thicknesses for t < 6 nm
- The PLS regression vector shows the SIMS peaks most correlated with thickness

[Figure: predicted vs measured film thickness, and the PLS regression vector vs mass with prominent peaks near m/z 59, 231, 277 and 1176]
Surfaces of plasma-deposited films were characterised by SIMS. This was then related to bovine arterial endothelial cell (BAEC) growth (cell counting).

A. Chilkoti, A. E. Schmierer, V. H. Pérez-Luna, B. D. Ratner, Anal. Chem. 67 (1995) 2883
PLS validation
- PLS can be used to build predictive models (calibration)
- Validation is needed to guard against over-fitting
- Without enough data for a validation set, cross validation can be useful

[Figure: dependent variable Y vs independent variable X for a good predictive model and for an overfitted model ("Data is overfitted!")]
PLS validation
Cross validation:
- Calculate the PLS model excluding sample i
- Predict sample i
- Repeat for all samples
- Calculate the root mean square error of prediction

To decide the optimal number of factors, use the minimum of RMSECV (Root Mean Square Error of Cross Validation) or PRESS (Prediction Residual Sum of Squares). RMSEC (Root Mean Square Error of Calibration), by contrast, simply goes down with an increasing number of factors. A sketch in code follows.

[Figure: RMSECV and RMSEC vs number of factors (1 to 12)]
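A minimal leave-one-out RMSECV sketch (the `fit`/`predict` callables are placeholders for any regression, e.g. the PLS sketch above):

```python
import numpy as np

def rmsecv_loo(X, Y, n_factors, fit, predict):
    """Leave-one-out cross validation error for a given number of factors."""
    errors = []
    for i in range(X.shape[0]):
        keep = np.arange(X.shape[0]) != i          # exclude sample i
        model = fit(X[keep], Y[keep], n_factors)   # calibrate without it
        errors.append(Y[i] - predict(model, X[i:i+1]))  # predict sample i
    return float(np.sqrt(np.mean(np.square(errors))))

# Optimal number of factors: the minimum of RMSECV
# best_n = min(range(1, 13), key=lambda n: rmsecv_loo(X, Y, n, fit, predict))
```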
- If the dataset is large enough, split it into calibration and validation sets
- Rule of thumb: 2/3 calibration set, 1/3 validation set
- Validation data should be statistically independent from the calibration data, e.g. NOT repeat spectra of the same sample
- An independent validation set is essential if we want to use the model to predict new samples
Data analysis
- Identification: What chemicals are on the surface? Where are they located?
- Calibration / Quantification
PCA Factors
- 16 different single-protein films adsorbed on mica
- Excellent classification of the proteins using only 2 factors
- Factors consistent with the total amino acid compositions of the various proteins
- 95% confidence limits provide a means for identification / classification
- Octadecanethiol self-assembled monolayers on gold substrates, exposed to different allylamine plasma deposition times
- Projections of the data onto the PCA factors indicate four clusters of objects
- Magnification of the framed cluster reveals further clustering
PC-DFA
PC-DFA = Principal Component Discriminant Function Analysis. The discriminant functions maximise Fisher's ratio between groups:

$$\text{Fisher's ratio} = \frac{(\text{mean}_1 - \text{mean}_2)^2}{\text{var}_1 + \text{var}_2}$$

A sketch in code follows.
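Fisher's ratio along a candidate discriminant axis is a one-liner (a minimal numpy sketch; `scores1`/`scores2` are the PCA projections of the two groups):

```python
import numpy as np

def fishers_ratio(scores1, scores2):
    # Separation between two groups of projections along one axis
    return (scores1.mean() - scores2.mean()) ** 2 / (scores1.var() + scores2.var())
```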
PLS-DA
PLS-DA = Partial Least Squares Discriminant Analysis. We put the data in X and the categorical information in Y: PLS then finds factors that explain the variance in the data X while taking the classifications Y into account.

[Figure: class clusters in the space of PLS-DA projections 1 and 2]

A sketch of the Y encoding in code follows.
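The categorical Y is typically encoded as indicator (one-hot) columns before running ordinary PLS (a minimal sketch; the labels are made up and `pls_nipals` is the earlier sketch):

```python
import numpy as np

labels = np.array([0, 0, 1, 1, 2, 2])    # hypothetical class labels
Y = np.eye(labels.max() + 1)[labels]     # one indicator column per class
# T, W, P, Q = pls_nipals(X, Y, n_factors=2)  # projections T separate classes
```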
Other methods
- PC-DFA and PLS-DA are both supervised methods: prior knowledge about the groups is required
- There also exist unsupervised clustering methods

All these (and much more) belong to the wider field of chemoinformatics.

[Diagram: chemoinformatics activities: creation, design, analysis, retrieval, visualisation, management, dissemination, organisation and use]
Conclusion
In this tutorial we looked at:
- Identification using PCA and MCR
- Quantification using MLR, PCR and PLS
- Classification using PC-DFA and PLS-DA
- The importance of validation for predictive models
- Data preprocessing techniques and their effects
- Matrix and vector algebra
- A new set of terminology: factors P and projections T
Bibliography
General
- A. R. Leach, V. J. Gillet, An Introduction to Chemoinformatics, Kluwer Academic Publishers (2003)
- S. Wold, Chemometrics; what do we mean with it, and what do we want from it?, Chemom. Intell. Lab. Syst. 30 (1995) 109
- E. R. Malinowski, Factor Analysis in Chemistry, John Wiley and Sons (2002)
- P. Geladi, H. Grahn, Multivariate Image Analysis, John Wiley and Sons (1996)
- D. J. Graham, NESAC/BIO ToF-SIMS MVA web resource, http://nb.engr.washington.edu/nb-sims-resource/

PCA
- D. J. Graham, M. S. Wagner, D. G. Castner, Information from complexity: challenges of ToF-SIMS data interpretation, Appl. Surf. Sci. 252 (2006) 6860
- M. R. Keenan, P. G. Kotula, Accounting for Poisson noise in the multivariate analysis of ToF-SIMS spectrum images, Surf. Interface Anal. 36 (2004) 203

MCR
- N. B. Gallagher, J. M. Shaver, E. B. Martin, J. Morris, B. M. Wise, W. Windig, Curve resolution for multivariate images with applications to TOF-SIMS and Raman, Chemom. Intell. Lab. Syst. 73 (2004) 105
- J. A. Ohlhausen, M. R. Keenan, P. G. Kotula, D. E. Peebles, Multivariate statistical analysis of time-of-flight secondary ion mass spectrometry using AXSIA, Appl. Surf. Sci. 231-232 (2004) 230
- R. Tauler, A. de Juan, MCR-ALS Graphic User Friendly Interface, http://www.ub.es/gesq/mcr/mcr.htm

PLS
- P. Geladi, B. Kowalski, Partial Least-Squares Regression: A Tutorial, Analytica Chimica Acta 185 (1986) 1
- A. M. C. Davies, T. Fearn, Back to basics: observing PLS, Spectroscopy Europe 17 (2005) 28
Acknowledgements
This work is supported by the UK Department of Trade and Industry's Valid Analytical Measurements (VAM) Programme and co-funded by the UK MNT Network.
We would like to thank Dr Ian Fletcher (ICI Measurement Sciences Group) for images and expert analysis, and Dr Martin Seah (NPL) for helpful comments
For further information on Surface and Nanoanalysis at NPL, please visit http://www.npl.co.uk/nanoanalysis