Chemometrics
Contents
1. Introduction
2. Linear algebra
3. Factor analysis
   - Principal component analysis
   - Multivariate curve resolution
4. Multivariate regression
   - Multiple linear regression
   - Principal component regression
   - Partial least squares regression
5. Classification
   - Principal component discriminant function analysis
   - Partial least squares discriminant analysis
6. Conclusion
Chemoinformatics
[Diagram: chemoinformatics spans the creation, design, analysis, dissemination, visualisation, organisation, management and use of chemical data (e.g. graph structures, 3D descriptors), and includes virtual screening, modelling, QSARs and chemometrics]
Chemometrics
Chemometrics is the science of relating measurements made on a chemical system or process to the state of the system via application of mathematical or statistical methods
Chemometrics
Advantages:
- Fast and efficient on modern computers
- Statistically valid
- Removes potential bias
- Uses all information available
Data analysis
- Identification: What chemicals are on the surface? Where are they located?
- Calibration / Quantification
Data matrix
[Figure: mass spectra of three samples, intensity vs mass (1 to 5)]

Each mass spectrum forms one row of the data matrix, with samples as rows and variables (masses) as columns:

$$X = \begin{pmatrix} 9 & 32 & 10 & 1 & 21 \\ 18 & 20 & 22 & 4 & 12 \\ 24 & 12 & 30 & 6 & 6 \end{pmatrix}$$
Vector algebra

Dot product:

$$a \cdot b = a_x b_x + a_y b_y + a_z b_z$$

For example, with $a = \begin{pmatrix} 1 \\ 2 \\ 4 \end{pmatrix}$ and $b = \begin{pmatrix} 4 \\ 2 \\ 2 \end{pmatrix}$:

$$a \cdot b = a^\top b = 1 \cdot 4 + 2 \cdot 2 + 4 \cdot 2 = 16$$

Vector length:

$$\|a\| = \sqrt{a \cdot a} = \sqrt{a^\top a} = \sqrt{1^2 + 2^2 + 4^2} = \sqrt{21}$$

Angle between vectors:

$$a \cdot b = \|a\| \|b\| \cos\theta$$

(for the example above, $\theta = 44.5°$)

Orthogonality: if $a \cdot b = 0$ the vectors are orthogonal, i.e. at right angles ($\theta = 90°$). If they are also of unit length they are orthonormal, i.e. $a^\top a = 1$ and $b^\top b = 1$.

Collinearity: if $\theta = 0°$ the vectors are collinear.

Correlation: if $0° < \theta < 90°$ the vectors are neither orthogonal nor collinear; they are correlated.
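A quick numerical check of these vector operations (a minimal numpy sketch; the arrays are the example vectors above):

```python
import numpy as np

a = np.array([1, 2, 4])
b = np.array([4, 2, 2])

print(a @ b)              # 16: the dot product a.b
print(np.sqrt(a @ a))     # 4.583...: |a| = sqrt(21)

# Angle between a and b from a.b = |a||b| cos(theta)
cos_theta = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(np.degrees(np.arccos(cos_theta)))   # ~44.5 degrees
```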
Matrix algebra
Matrix addition: $A + B = C$, with dimensions $(I \times K) + (I \times K) = (I \times K)$. For example:

$$\begin{pmatrix} 2 & 4 & 1 \\ 3 & 8 & 6 \end{pmatrix} + \begin{pmatrix} 1 & 2 & 0 \\ 0 & 1 & 2 \end{pmatrix} = \begin{pmatrix} 3 & 6 & 1 \\ 3 & 9 & 8 \end{pmatrix}$$

Matrix multiplication: $AB = C$, with dimensions $(I \times N)(N \times K) = (I \times K)$. The number of columns of A must equal the number of rows of B; the element in row i and column j of the product AB is row i of A times column j of B. For example:

$$\begin{pmatrix} 1 & 4 \\ 2 & 2 \\ 4 & 2 \end{pmatrix} \begin{pmatrix} 1 & 2 \\ 3 & 2 \end{pmatrix} = \begin{pmatrix} 1 \cdot 1 + 4 \cdot 3 & 1 \cdot 2 + 4 \cdot 2 \\ 2 \cdot 1 + 2 \cdot 3 & 2 \cdot 2 + 2 \cdot 2 \\ 4 \cdot 1 + 2 \cdot 3 & 4 \cdot 2 + 2 \cdot 2 \end{pmatrix} = \begin{pmatrix} 13 & 10 \\ 8 & 8 \\ 10 & 12 \end{pmatrix}$$
Matrix inverse
Identity matrix: diagonal of 1s.

$$I = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \qquad AI = A$$

The inverse satisfies $AA^{-1} = I$ (it only exists if the matrix is full rank). If $AB = C$ then $B = A^{-1}C$, or $B = A^{+}C$ for a rectangular matrix, where the pseudoinverse is

$$A^{+} = A^\top (AA^\top)^{-1}, \qquad \text{so that } AA^{+} = I$$
Additional properties
Simultaneous equations of any size can be solved with matrices. Rank = number of unique equations; this system is rank 2, since the third equation is twice the first:

$$x + 2y = 5$$
$$3x + 2y = 7$$
$$2x + 4y = 10$$

In matrix form:

$$\begin{pmatrix} 1 & 2 \\ 3 & 2 \\ 2 & 4 \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} 5 \\ 7 \\ 10 \end{pmatrix}$$

Keeping the two independent equations:

$$\begin{pmatrix} 1 & 2 \\ 3 & 2 \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} 5 \\ 7 \end{pmatrix}, \qquad \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} 1 & 2 \\ 3 & 2 \end{pmatrix}^{-1} \begin{pmatrix} 5 \\ 7 \end{pmatrix} = \begin{pmatrix} 1 \\ 2 \end{pmatrix}$$

Rank is the maximum number of rows or columns that are linearly independent; for an $I \times K$ matrix, $R \le \min(I, K)$. To obtain a unique solution we require number of variables $\le$ rank. A numerical check follows below.
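A minimal numpy sketch of this example (the matrices are those above; `lstsq` finds the least-squares solution, which is exact here since the system is consistent):

```python
import numpy as np

# The equations x + 2y = 5, 3x + 2y = 7, 2x + 4y = 10 as A v = b
A = np.array([[1.0, 2.0],
              [3.0, 2.0],
              [2.0, 4.0]])
b = np.array([5.0, 7.0, 10.0])

print(np.linalg.matrix_rank(A))   # 2: the third row is twice the first

v, *_ = np.linalg.lstsq(A, b, rcond=None)
print(v)                          # [1. 2.]  ->  x = 1, y = 2
```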
Matrix projections
To write a in terms of x* and y*, we find its projections on the new axes
In the original axes, $a = 2x + 3y$:

$$a = \begin{bmatrix} 2 & 3 \end{bmatrix} \begin{pmatrix} x \\ y \end{pmatrix}$$

[Figure: vector a plotted in the original axes (x, y) and in new axes (x*, y*) rotated by 30°]

In the new axes:

$$a = \begin{bmatrix} 3.2 & 1.6 \end{bmatrix} \begin{pmatrix} x^* \\ y^* \end{pmatrix}$$
Data matrix
[Figure: mass spectra of the three samples and of the two pure chemicals, intensity vs mass (1 to 5)]

The data matrix (samples x variables [mass]) is the product of the sample compositions (samples x chemicals) and the chemical spectra (chemicals x variables [mass]):

$$\underbrace{\begin{pmatrix} 9 & 32 & 10 & 1 & 21 \\ 18 & 20 & 22 & 4 & 12 \\ 24 & 12 & 30 & 6 & 6 \end{pmatrix}}_{\text{data matrix}} = \underbrace{\begin{pmatrix} 5 & 1 \\ 2 & 4 \\ 0 & 6 \end{pmatrix}}_{\text{sample composition}} \underbrace{\begin{pmatrix} 1 & 6 & 1 & 0 & 4 \\ 4 & 2 & 5 & 1 & 1 \end{pmatrix}}_{\text{chemical spectra}}$$

- Instead of x, y, z in real space, the axes are mass 1, mass 2, mass 3 etc. in variable space (also called data space)
- Without noise, rank of dataset = number of unique components
- With random, uncorrelated noise, rank of dataset = number of samples or number of variables, whichever is smaller
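The factorisation above can be checked directly (a minimal numpy sketch using the matrices from the slide):

```python
import numpy as np

# Sample compositions C (samples x chemicals) and pure spectra S
# (chemicals x variables [mass]) from the example above
C = np.array([[5, 1],
              [2, 4],
              [0, 6]])
S = np.array([[1, 6, 1, 0, 4],
              [4, 2, 5, 1, 1]])

X = C @ S
print(X)
# [[ 9 32 10  1 21]
#  [18 20 22  4 12]
#  [24 12 30  6  6]]

print(np.linalg.matrix_rank(X))  # 2: two unique components, no noise
```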
Data analysis
- Identification: What chemicals are on the surface? Where are they located?
- Calibration / Quantification
Terminology
In order to clarify existing terminology and emphasise the relationship between the different chemometrics techniques, the following terminology is adopted in this tutorial
Terms used here: factors P and projections T.
- In PCA, factors are called loadings, eigenvectors or principal components, and projections are called scores.
- In MCR, factors are the component spectra and projections are the component concentrations (scores).
Factor analysis
- Factors are linear combinations of the original variables (i.e. masses)
- Equivalent to a rotation in data space: factors are new axes
- Data are described by their projections onto the factors
- PCA is a factor analysis technique
PCA is a technique for reducing matrices of data to their lowest dimensionality by describing them using a small number of factors:

$$X = TP^\top + E = \sum_{n=1}^{N} t_n p_n^\top + E$$

where T holds the projections, P the factors, and E the experimental noise. We decompose the data X (rank R) into N simpler matrices of rank 1, where N < R. Each simple matrix is the outer product of two vectors, $t_n$ $(I \times 1)$ and $p_n^\top$ $(1 \times K)$.
PCA outline
[Flowchart: Raw Data → data selection and preprocessing → Data Matrix → matrix multiplication → Covariance Matrix → decomposition → reproduction. After Malinowski, Factor Analysis in Chemistry, John Wiley & Sons (2002)]
PCA decomposition
The covariance matrix contains information about the variances of the data points within the dataset, and is defined as

$$Z = X^\top X \qquad (K \times K) = (K \times I)(I \times K)$$

In PCA, Z is decomposed into a set of eigenvectors p and associated eigenvalues $\lambda$, such that

$$Zp = \lambda p \qquad (K \times K)(K \times 1) = (K \times 1)$$

- Eigenvalues are positive or zero
- The number of non-zero eigenvalues = rank of the data, R
- Eigenvectors are orthonormal

A sketch of this decomposition in code follows.
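A minimal numpy sketch of PCA by eigendecomposition of the covariance matrix (the function name `pca_eig` is ours; real analyses would normally use a library implementation):

```python
import numpy as np

def pca_eig(X, n_factors):
    """PCA via eigendecomposition of the covariance matrix Z = X'X."""
    Z = X.T @ X                              # (K x K) covariance matrix
    eigvals, eigvecs = np.linalg.eigh(Z)     # eigh since Z is symmetric
    order = np.argsort(eigvals)[::-1]        # sort by descending eigenvalue
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    P = eigvecs[:, :n_factors]               # factors (orthonormal columns)
    T = X @ P                                # projections of the data
    return T, P, eigvals
```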
PCA factors
Because Z is the covariance matrix, the eigenvectors of Z are special directions in data space that are optimal for describing the variance of the data; the eigenvalues are the amounts of variance described by their associated eigenvectors.

$$X = TP^\top = \sum_{n=1}^{N} t_n p_n^\top$$

These eigenvectors are the factors PCA obtains for the factor analysis equation. They are sorted by their eigenvalues: PCA factors successively capture the largest amount of variance (spread) within the dataset.

Instead of describing the data using correlated variables x and y, we transform them onto a new basis (the factors), which are uncorrelated. By removing higher factors (variances due to noise) we can reduce the dimensionality of the data: factor compression.

[Figure: data cloud in (x, y) with PCA Factor 1 along the direction of greatest variance and PCA Factor 2 orthogonal to it]
Number of factors
A data set of 8 spectra was made by mixing 3 pure compound spectra.

[Figure: eigenvalue plots vs factor number (1 to 8) for (a) no noise and (b) with noise, on a logarithmic scale]

Scree test: the eigenvalue plot levels off in a linearly decreasing manner after 3 factors.
Data reproduction
$$X = TP^\top + E = \sum_{n=1}^{N} t_n p_n^\top + E$$

$$\hat{X} = X - E = TP^\top, \qquad E = X - \hat{X} = X - \sum_{n=1}^{N} t_n p_n^\top$$

$\hat{X}$ is the reproduced data matrix:
- reproduced from the N selected factors and projections
- noise filtered, by removal of the higher factors that describe noise variations
- useful for MCR

E is the matrix of residuals:
- should contain noise only
- useful for judging the quality of the PCA model
- may show up unexpected features!

A sketch in code follows.
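Continuing the earlier `pca_eig` sketch, reproduction and residuals take one line each (X here is any preprocessed data matrix):

```python
# Reproduce the data from the first N factors and inspect the residuals
T, P, eigvals = pca_eig(X, n_factors=3)
X_hat = T @ P.T         # reproduced, noise-filtered data matrix
E = X - X_hat           # residuals: should contain noise only
print(np.abs(E).max())  # large values flag unexpected features
```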
Data preprocessing
- Enhances PCA by bringing out important variance in the dataset
- Makes assumptions about the nature of the variance in the data
- Can distort interpretation and quantification
- Includes: mass binning, peak selection, mean centering, normalisation, variance scaling, Poisson scaling, logarithmic transformation
Mean centering
Subtract the mean spectrum from each sample; PCA then describes variations from the mean. The preprocessed data for sample i and mass k are

$$\tilde{X}_{i,k} = X_{i,k} - \operatorname{mean}(X_{:,k})$$

[Figure: raw data in (y, z): the 1st factor goes from the origin and accounts for the highest variance; after mean centering, Factors 1 and 2 describe variations about the mean]
Normalisation
$$\tilde{X}_{i,k} = X_{i,k} \cdot \frac{1}{\operatorname{sum}(X_{i,:})}$$

($\tilde{X}$: preprocessed data; $X$: raw data, for sample i and mass k)

- Divide each spectrum by its total ion intensity
- Reduces effects of topography, sample charging and drift in primary ion current
- Assumes chemical variances can be described by relative changes in ion intensities
- Reduces the rank of the data by 1
Variance scaling
$$\tilde{X}_{i,k} = X_{i,k} \cdot \frac{1}{\operatorname{var}(X_{:,k})}$$

($\operatorname{var}(X_{:,k})$: variance of mass k across the dataset)

- Divide each variable by its variance in the dataset
- Equalises the importance of each variable (i.e. mass)
- Problematic for weak peaks, so usually used with peak selection
- Called auto scaling if combined with mean centering

[Figure: for each variable (mass, in a SIMS spectrum), the mean and variance under raw data, mean centering, variance scaling and auto scaling. After P. Geladi and B. Kowalski, Partial Least-Squares Regression: A Tutorial, Analytica Chimica Acta 185 (1986) 1]
Poisson scaling
$$\tilde{X}_{i,k} = X_{i,k} \cdot \frac{1}{\sqrt{\operatorname{mean}(X_{i,:})}} \cdot \frac{1}{\sqrt{\operatorname{mean}(X_{:,k})}}$$

($\operatorname{mean}(X_{i,:})$: mean intensity of sample i; $\operatorname{mean}(X_{:,k})$: mean intensity of mass k)

- SIMS data are dominated by Poisson counting noise
- Equalises the noise variance of each data point
- Provides greater noise rejection
[Figure: eigenvalue plots with no preprocessing: how many factors?]

Effect of preprocessing
- No preprocessing: the first factor goes from the origin to the mean of the data
- Mean centering: all factors describe variations from the mean
- Normalisation: equalises the total ion yield of each sample and emphasises relative changes in ion intensities
- Variance scaling: equalises the variance of every peak regardless of intensity; best with peak selection
- Poisson scaling: equalises the noise variance of each data point; provides greater noise rejection

A sketch of these preprocessing steps in code follows.
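A minimal numpy sketch of the four scalings above (function names are ours; X is a samples x masses matrix of raw counts):

```python
import numpy as np

def mean_center(X):
    # Subtract the mean spectrum from each sample (row)
    return X - X.mean(axis=0)

def normalise(X):
    # Divide each spectrum by its total ion intensity
    return X / X.sum(axis=1, keepdims=True)

def variance_scale(X):
    # Divide each variable (mass) by its variance across the dataset
    return X / X.var(axis=0)

def poisson_scale(X):
    # Divide by the square roots of the sample and mass mean intensities
    return X / np.sqrt(X.mean(axis=1, keepdims=True)) / np.sqrt(X.mean(axis=0))
```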
- Three protein compositions (100% fibrinogen, 50% fibrinogen / 50% albumin, 100% albumin) adsorbed onto poly(DTB suberate)
- The first factor (PC1) shows the relative abundance of the amino acid peaks of the two proteins
PCA Factors
- 16 different single-protein films adsorbed on mica
- Excellent classification of the proteins using only 2 factors
- Factors consistent with the total amino acid compositions of the various proteins
- 95% confidence limits provide a means for identification / classification
The datacube contains a raster of I x J pixels and K mass peaks:
- The datacube is rearranged ("unfolded") into a 2D data matrix with dimensions [(I·J) x K] prior to PCA
- PCA results are folded back to form projection images prior to interpretation

[Figure: unfolding of an I x J x K datacube (I rows, J columns, K mass peaks) into an (I·J) x K data matrix]

A sketch of unfolding and folding in code follows.
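Unfolding and folding are just reshapes (a minimal numpy sketch; the cube and the projection are made-up stand-ins):

```python
import numpy as np

I, J, K = 128, 128, 100                                  # pixels and mass peaks
cube = np.random.poisson(5.0, (I, J, K)).astype(float)   # stand-in datacube

X = cube.reshape(I * J, K)   # unfold: one row per pixel, one column per mass
t = X.mean(axis=1)           # stand-in for a projection onto one PCA factor
image = t.reshape(I, J)      # fold back into an I x J projection image
```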
[Figure: log(eigenvalue) plots and factor images (50 μm field of view, mass in u) for mean centering, normalisation and Poisson scaling]

Only 2 factors are needed: the dimensionality of the image is reduced by a factor of 20! (J. Lee, I. S. Gilmore, to be published)
PCA factors are linear combinations of the chemical components and optimally describe variance, so raw PCA results can be difficult to interpret. After Varimax rotation, distributions and characteristic peaks are obtained, simplifying interpretation of a huge dataset.

[Figure: projection images and factor spectra (Factors 1 to 5, mass in u) before and after Varimax rotation]
Multivariate curve resolution

We want to resolve the original chemical spectra, i.e. reverse the following process:

$$\underbrace{\begin{pmatrix} 9 & 32 & 10 & 1 & 21 \\ 18 & 20 & 22 & 4 & 12 \\ 24 & 12 & 30 & 6 & 6 \end{pmatrix}}_{\text{data matrix}} = \underbrace{\begin{pmatrix} 5 & 1 \\ 2 & 4 \\ 0 & 6 \end{pmatrix}}_{\text{sample composition}} \underbrace{\begin{pmatrix} 1 & 6 & 1 & 0 & 4 \\ 4 & 2 & 5 & 1 & 1 \end{pmatrix}}_{\text{chemical spectra}}$$

We use multivariate curve resolution (also called self-modelling mixture analysis):

$$X = TP^\top + E$$

where E is the experimental noise.
MCR is designed for the recovery of chemical spectra and contributions from a multicomponent mixture when little or no prior information about the composition is available. It uses an iterative least-squares algorithm to extract solutions, while applying suitable constraints, e.g. non-negativity.

[Figure: data cloud in (x, y) with MCR Factor 1 and MCR Factor 2]
1. Determine the number of factors N via the eigenvalue plot
2. Obtain the PCA reproduced data matrix for N factors
3. Obtain initial estimates of the spectra (factors) or contributions (projections):
   - random initialisation
   - PCA factors
   - Varimax rotated PCA factors
   - a pure variable detection algorithm, e.g. SIMPLISMA
4. Apply constraints:
   - non-negativity
   - equality
Outline of MCR
[Flowchart: Raw Data → Data Matrix → PCA → MCR-ALS → MCR projections T and MCR factors P]
MCR-ALS algorithm
(1) Find an estimate of T using P, applying constraints
(2) Find a new estimate of P using T, applying constraints
(3) Compute the MCR reproduced matrix
(4) Compare results and check convergence

Steps (1) to (4) are repeated until the MCR factors P and projections T are able to reconstruct the reproduced data matrix $\hat{X}$ within the acceptable error specified in the convergence criterion. The least-squares estimates use the pseudoinverse of a rectangular matrix, $A^{+} = A^\top [A A^\top]^{-1}$. A sketch in code follows.
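A minimal numpy sketch of the alternating least-squares loop, with non-negativity imposed by clipping (the function and its details are our assumptions, not the exact MCR-ALS implementation):

```python
import numpy as np

def mcr_als(X, P0, n_iter=500, tol=1e-8):
    """Alternate least-squares estimates of projections T and factors P,
    clipping negative values to enforce non-negativity."""
    P = P0.copy()                                      # initial spectra (N x K)
    for _ in range(n_iter):
        T = np.clip(X @ np.linalg.pinv(P), 0, None)    # (1) estimate T from P
        P = np.clip(np.linalg.pinv(T) @ X, 0, None)    # (2) estimate P from T
        X_hat = T @ P                                  # (3) reproduced matrix
        if np.linalg.norm(X - X_hat) < tol * np.linalg.norm(X):
            break                                      # (4) convergence check
    return T, P
```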
Rotational ambiguity
- MCR can suffer from rotational ambiguity
- The accuracy of the resolved spectra depends on selectivity, i.e. the existence of pixels or samples with a contribution from only one component
- Good initial estimates are essential
- Peaks from the intense components may appear in the spectra resolved for a weak component

[Figure: data cloud with ambiguous placements of MCR Factors 1 and 2 ("Chemical 1?", "Chemical 2?"); image, 50 μm field of view]
[Figure: MCR projection images (Projections 1 to 5) and resolved factor spectra (Factors 1 to 5), mass in u]

Distributions and characteristic peaks are obtained, in complete agreement with PCA and manual analysis by an expert.
- Three images are each assigned a SIMS spectrum (PBC, PC, PVT) and combined to form a multivariate image dataset
- Poisson noise is added to the images (on average ~50 counts per pixel)
- Projections onto the PCA factors show combinations of the original images

[Figure: PCA projections 1 to 3 and MCR projections 1 to 3 of the simulated image dataset]
Data analysis
- Identification: What chemicals are on the surface? Where are they located?
- Calibration / Quantification
Regression analysis
We can build a model to predict the properties of materials (measured properties such as molecular weight or density, or an XPS measurement) from their SIMS spectra:

$$y = f(x) + e$$
$$y = b_1 x_1 + b_2 x_2 + b_3 x_3 + \ldots + b_m x_m + e$$

where the $b_m$ are the regression coefficients.

[Figure: mass spectra (intensity vs mass) related to measured properties through a regression coefficient vector]
Multiple linear regression

$$Y = XB + E$$

(Y: dependent variables; X: SIMS data matrix; B: regression matrix; E: error)

$$B = X^{+}Y \quad \text{or} \quad B = (X^\top X)^{-1} X^\top Y$$

where $X^{+} = (X^\top X)^{-1} X^\top$ is the pseudoinverse of X. Here $X^\top X$ is the covariance matrix of X. In SIMS this is likely to be close to singular, so a well-defined inverse matrix cannot be found. This is due to the problem of collinearity, caused by linearly dependent rows or columns in the matrix.
MLR finds the least squares solution, i.e. the best $R^2$ correlation between Y and the projections of the data onto the regression vector, XB. A sketch in code follows.
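A minimal numpy sketch of MLR; `lstsq` is used instead of forming $(X^\top X)^{-1}$ explicitly, which is unstable when X is nearly singular (the collinearity problem just described):

```python
import numpy as np

def mlr_fit(X, Y):
    # Least-squares regression matrix B in Y = X B + E
    B, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return B

def mlr_predict(X, B):
    return X @ B
```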
Principal component regression

- PCA reduces the dimensionality of the data and reduces the effect of noise
- The PCA projection matrix gives the coordinates of the data points in the reduced factor space
- Hence we can use the PCA projection matrix T in our linear regression:

$$Y = TB + E \qquad (I \times M) = (I \times N)(N \times M) + (I \times M)$$

(T: PCA projection matrix; B: regression matrix; E: error)

$$B = T^{+}Y = (T^\top T)^{-1} T^\top Y$$

$T^\top T$ is now guaranteed to be invertible, since the columns of the PCA projection matrix (the projections onto orthonormal factors) are orthogonal.
With a single factor, PCR finds the correlation between Y and the projection of the data onto the first PCA factor, T; the regression vector is then a multiple of the PCA factor P. For more than one factor, PCR finds the linear combinations of the projections T that are best for predicting Y, i.e. the regression vectors are linear combinations of the PCA factors P. A sketch in code follows.

A. M. C. Davies, T. Fearn, Spectroscopy Europe 17 (2005) 28
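A minimal PCR sketch reusing the earlier `pca_eig` function (our naming); the returned regression vectors live in the original variable space:

```python
import numpy as np

def pcr_fit(X, Y, n_factors):
    T, P, _ = pca_eig(X, n_factors)            # PCA projections and factors
    B, *_ = np.linalg.lstsq(T, Y, rcond=None)  # regress Y on T (T'T invertible)
    return P @ B    # regression vectors as combinations of PCA factors

# Prediction for new spectra: Y_hat = X_new @ pcr_fit(X, Y, n_factors)
```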
- In PCR, the projections T are computed to model X only
- By choosing directions that maximise the variance in the data X, we hope to include information which relates the original variables to Y
- The first few PCA factors of X may contain only matrix effects and may have no relation to the quantities Y which we want to predict
Introducing PLS
- PLS extracts projections that are common to both X and Y
- This is done by simultaneous decomposition of X and Y using an iterative algorithm (NIPALS)
- It removes redundant information from the regression, i.e. factors describing large amounts of variance in X that do not correlate with Y
- A more viable, robust solution using a smaller number of factors
For the decomposition of the single matrix X in PCA, NIPALS calculates $t_1$ and $p_1$ alternately until convergence; the next set of factors, $t_2$ and $p_2$, is calculated by fitting the residuals (the data not explained by $p_1$). For the simultaneous decomposition of X $(I \times K)$ and Y $(I \times M)$, PLS finds a mutual set of projections common to X and Y, so that $t_X = t_Y$.

[Diagram: (1) PCA decomposition of X (I x K) into t (I x 1) and p (1 x K); (2) simultaneous PLS decomposition of X (I x K) and Y (I x M) with common projections t, x-loadings p (1 x K) and y-loadings q (1 x M). From E. Malinowski, Factor Analysis in Chemistry, John Wiley and Sons (2002)]
PLS formulation
$$X = TP^\top + E, \qquad Y = TQ^\top + F$$

(T: projections; E, F: errors)

$$Y = XB + E, \qquad B = X^{+}Y = (P^\top)^{+} Q^\top = W Q^\top$$

(B: regression vector; W: weights matrix)

- T are the PLS projections used to predict Y from X (often referred to as scores)
- W is the weights matrix, and reflects the covariance structure between X and Y
- P and Q are not orthogonal matrices, due to the constraint of finding common projections T; they are sometimes called x-loadings and y-loadings respectively
- In the literature, "latent variable" refers to the set of quantities t, p and q associated with each PLS factor

A NIPALS sketch in code follows.
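A minimal NIPALS PLS sketch in numpy (our own implementation of the textbook iteration; Y must be a 2D column matrix, and library routines should be preferred in practice):

```python
import numpy as np

def pls_nipals(X, Y, n_factors, tol=1e-10, max_iter=500):
    """Simultaneous decomposition of X and Y into common projections t,
    weights w, x-loadings p and y-loadings q, deflating after each factor."""
    E, F = X.copy(), Y.copy()            # residuals of X and Y
    T, W, P, Q = [], [], [], []
    for _ in range(n_factors):
        u = F[:, [0]]                    # initial Y projection
        t_old = 0.0
        for _ in range(max_iter):
            w = E.T @ u
            w /= np.linalg.norm(w)       # weights
            t = E @ w                    # common projection (scores)
            q = F.T @ t / (t.T @ t)      # y-loadings
            u = F @ q / (q.T @ q)
            if np.linalg.norm(t - t_old) < tol:
                break
            t_old = t
        p = E.T @ t / (t.T @ t)          # x-loadings
        E = E - t @ p.T                  # deflate X
        F = F - t @ q.T                  # deflate Y
        T.append(t); W.append(w); P.append(p); Q.append(q)
    return [np.hstack(m) for m in (T, W, P, Q)]
```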
- SIMS spectra of thin films of Irganox were compared with their thicknesses measured by XPS
- A PLS model was able to predict thicknesses for t < 6 nm
- The PLS regression vector shows the SIMS peaks most correlated with thickness

[Figure: predicted vs measured film thickness, and the PLS regression vector vs mass with prominent peaks near m/z 59, 231, 277 and 1176]
Surfaces of plasma-deposited films were characterised by SIMS. This was then related to bovine arterial endothelial cell (BAEC) growth (cell counting).

A. Chilkoti, A. E. Schmierer, V. H. Pérez-Luna, B. D. Ratner, Anal. Chem. 67 (1995) 2883
PLS validation
- PLS can be used to build predictive models (calibration)
- Validation is needed to guard against over-fitting
- Without enough data for a validation set, cross validation can be useful

[Figure: dependent variable Y vs independent variable X for a good predictive model and for an overfitted model ("Data is overfitted!")]
PLS validation
Cross validation:
- Calculate the PLS model excluding sample i
- Predict sample i
- Repeat for all samples
- Calculate the root mean square error of prediction

To decide the optimal number of factors, use the minimum of RMSECV (Root Mean Square Error of Cross Validation) or PRESS (Prediction Residual Sum of Squares). RMSEC (Root Mean Square Error of Calibration), by contrast, simply goes down with an increasing number of factors. A sketch in code follows.

[Figure: RMSECV and RMSEC vs number of factors (1 to 12)]
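A minimal leave-one-out RMSECV sketch (the `fit`/`predict` callables are placeholders for any regression, e.g. the PLS sketch above):

```python
import numpy as np

def rmsecv_loo(X, Y, n_factors, fit, predict):
    """Leave-one-out cross validation error for a given number of factors."""
    errors = []
    for i in range(X.shape[0]):
        keep = np.arange(X.shape[0]) != i          # exclude sample i
        model = fit(X[keep], Y[keep], n_factors)   # calibrate without it
        errors.append(Y[i] - predict(model, X[i:i+1]))  # predict sample i
    return float(np.sqrt(np.mean(np.square(errors))))

# Optimal number of factors: the minimum of RMSECV
# best_n = min(range(1, 13), key=lambda n: rmsecv_loo(X, Y, n, fit, predict))
```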
- If the dataset is large enough, split it into calibration and validation sets
- Rule of thumb: 2/3 calibration set, 1/3 validation set
- Validation data should be statistically independent from the calibration data, e.g. NOT repeat spectra of the same sample
- An independent validation set is essential if we want to use the model to predict new samples
Data analysis
- Identification: What chemicals are on the surface? Where are they located?
- Calibration / Quantification
PCA Factors
- 16 different single-protein films adsorbed on mica
- Excellent classification of the proteins using only 2 factors
- Factors consistent with the total amino acid compositions of the various proteins
- 95% confidence limits provide a means for identification / classification
- Octadecanethiol self-assembled monolayers on gold substrates, exposed to different allylamine plasma deposition times
- Projections of the data onto the PCA factors indicate four clusters of objects
- Magnification of the framed cluster reveals further clustering
PC-DFA
PC-DFA = Principal Component Discriminant Function Analysis. The discriminant functions maximise Fisher's ratio between groups:

$$\text{Fisher's ratio} = \frac{(\text{mean}_1 - \text{mean}_2)^2}{\text{var}_1 + \text{var}_2}$$

A sketch in code follows.
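Fisher's ratio along a candidate discriminant axis is a one-liner (a minimal numpy sketch; `scores1`/`scores2` are the PCA projections of the two groups):

```python
import numpy as np

def fishers_ratio(scores1, scores2):
    # Separation between two groups of projections along one axis
    return (scores1.mean() - scores2.mean()) ** 2 / (scores1.var() + scores2.var())
```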
PLS-DA
PLS-DA = Partial Least Squares Discriminant Analysis. We put the data in X and the categorical information in Y: PLS then finds factors that explain the variance in the data X while taking the classifications Y into account.

[Figure: class clusters in the space of PLS-DA projections 1 and 2]

A sketch of the Y encoding in code follows.
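The categorical Y is typically encoded as indicator (one-hot) columns before running ordinary PLS (a minimal sketch; the labels are made up and `pls_nipals` is the earlier sketch):

```python
import numpy as np

labels = np.array([0, 0, 1, 1, 2, 2])    # hypothetical class labels
Y = np.eye(labels.max() + 1)[labels]     # one indicator column per class
# T, W, P, Q = pls_nipals(X, Y, n_factors=2)  # projections T separate classes
```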
Other methods
- PC-DFA and PLS-DA are both supervised methods: prior knowledge about the groups is required
- There also exist unsupervised clustering methods

All these (and much more) belong to the wider field of chemoinformatics.

[Diagram: chemoinformatics activities: creation, design, analysis, retrieval, visualisation, management, dissemination, organisation and use]
Conclusion
In this tutorial we looked at:
- Identification using PCA and MCR
- Quantification using MLR, PCR and PLS
- Classification using PC-DFA and PLS-DA
- The importance of validation for predictive models
- Data preprocessing techniques and their effects
- Matrix and vector algebra
- A new set of terminology: factors P and projections T
Bibliography
General
- A. R. Leach, V. J. Gillet, An Introduction to Chemoinformatics, Kluwer Academic Publishers (2003)
- S. Wold, Chemometrics; what do we mean with it, and what do we want from it?, Chemom. Intell. Lab. Syst. 30 (1995) 109
- E. R. Malinowski, Factor Analysis in Chemistry, John Wiley and Sons (2002)
- P. Geladi, H. Grahn, Multivariate Image Analysis, John Wiley and Sons (1996)
- D. J. Graham, NESAC/BIO ToF-SIMS MVA web resource, http://nb.engr.washington.edu/nb-sims-resource/

PCA
- D. J. Graham, M. S. Wagner, D. G. Castner, Information from complexity: challenges of ToF-SIMS data interpretation, Appl. Surf. Sci. 252 (2006) 6860
- M. R. Keenan, P. G. Kotula, Accounting for Poisson noise in the multivariate analysis of ToF-SIMS spectrum images, Surf. Interface Anal. 36 (2004) 203

MCR
- N. B. Gallagher, J. M. Shaver, E. B. Martin, J. Morris, B. M. Wise, W. Windig, Curve resolution for multivariate images with applications to TOF-SIMS and Raman, Chemom. Intell. Lab. Syst. 73 (2004) 105
- J. A. Ohlhausen, M. R. Keenan, P. G. Kotula, D. E. Peebles, Multivariate statistical analysis of time-of-flight secondary ion mass spectrometry using AXSIA, Appl. Surf. Sci. 231-232 (2004) 230
- R. Tauler, A. de Juan, MCR-ALS Graphic User Friendly Interface, http://www.ub.es/gesq/mcr/mcr.htm

PLS
- P. Geladi, B. Kowalski, Partial Least-Squares Regression: A Tutorial, Analytica Chimica Acta 185 (1986) 1
- A. M. C. Davies, T. Fearn, Back to basics: observing PLS, Spectroscopy Europe 17 (2005) 28
Acknowledgements
This work is supported by the UK Department of Trade and Industry's Valid Analytical Measurements (VAM) Programme and co-funded by the UK MNT Network.
We would like to thank Dr Ian Fletcher (ICI Measurement Sciences Group) for images and expert analysis, and Dr Martin Seah (NPL) for helpful comments
For further information on Surface and Nanoanalysis at NPL, please visit http://www.npl.co.uk/nanoanalysis