Tutorial on PLS and PCA
SUMMARY
A tutorial on the partial least-squares (PLS) regression method is provided. Weak points
in some other regression methods are outlined and PLS is developed as a remedy for those
weaknesses. An algorithm for a predictive PLS and some practical hints for its use are
given.
Fig. 1. Data preprocessing. The data for each variable are represented by a variance bar
and its center. (A) Most raw data look like this. (B) The result after mean-centering only.
(C) The result after variance-scaling only. (D) The result after mean-centering and variance-
scaling.
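As a small illustration of the preprocessing in Fig. 1, the following sketch (in Python with numpy, both assumed here; the numbers are invented) mean-centers and variance-scales each column of a samples-by-variables matrix:

    import numpy as np

    # hypothetical raw data: 3 samples, 2 variables measured on very different scales
    X_raw = np.array([[2.0, 110.0],
                      [3.0, 140.0],
                      [4.0, 125.0]])

    X_centered = X_raw - X_raw.mean(axis=0)           # mean-centering only (Fig. 1B)
    X_scaled = X_raw / X_raw.std(axis=0, ddof=1)      # variance-scaling only (Fig. 1C)
    X_auto = X_centered / X_raw.std(axis=0, ddof=1)   # mean-centering and variance-scaling (Fig. 1D)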
y = Σ_{j=1}^{m} b_j x_j + e (1b)

y = x'b + e (1c)
In Eqn. 1(a), the x_j are called independent variables and y is the dependent
variable, the b_j's are sensitivities and e is the error or residual. In Eqn. 1(c), y
is a scalar, b is a column vector and x' is a row vector.
Equation 1 describes multilinear dependencies for only one sample. If one
gets n samples, the y_i (i = 1, ..., n) can be written as a column vector y, b
remains the same and the vectors, x_i', form the rows of a matrix X:
y = Xb + e (2)
For a better understanding of these matrix equations, they are also given in
graphical representation:
[Graphical representation of Eqn. 2: y (n × 1) = X (n × m) b (m × 1) + e (n × 1).]
In this case, n is the number of samples and m the number of independent
variables.
It is now possible to distinguish among three cases.
(1) m > n. There are more variables than samples. In this case, there is an
infinite number of solutions for b, which all fit the equation. This is not
what is wanted.
(2) m = n. The numbers of samples and of variables are equal. This situation
may not be encountered often in practical situations. However, it gives a
unique solution for b provided that X has full rank. This allows us to write
e = y - Xb = 0 (3)
e is called the residual vector. In this case, it is a vector of zeroes: 0.
(3) m < n. There are more samples than variables. This does not allow an
exact solution for b. But one can get a solution by minimizing the length of
the residual vector e in the following equation:
e = y - Xb (4)
The most popular method for doing this is called the “least-squares method”.
The least-squares solution is
b = (X'X)^{-1}X'y (5)
(Complete explanations are available elsewhere [5, 7, 8].) Equation 5 gives a
hint towards the most frequent problem in MLR: the inverse of X'X may
not exist. Collinearity, zero determinant and singularity are all names for the
same problem. A good description of this situation is available [9].
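As an aside, a minimal numerical sketch of Eqn. 5 (in Python with numpy, both assumed here; the data are invented) shows the least-squares solution and how collinearity ruins the inversion of X'X:

    import numpy as np

    # invented example with n = 5 samples and m = 2 independent variables (m < n)
    X = np.array([[1.0, 2.0],
                  [2.0, 1.0],
                  [3.0, 4.0],
                  [4.0, 3.0],
                  [5.0, 5.0]])
    y = np.array([3.1, 2.9, 7.2, 6.8, 10.1])

    # Eqn. 5: b = (X'X)^{-1} X'y
    b = np.linalg.solve(X.T @ X, X.T @ y)

    # collinearity: if one column is a multiple of another, X'X is singular
    X_bad = np.column_stack([X[:, 0], 2.0 * X[:, 0]])
    print(np.linalg.cond(X_bad.T @ X_bad))   # enormous condition number; the inversion is useless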
At this point, it might appear that there always have to be at least as many
samples as variables, but there are other ways to formulate this problem. One
of them is to delete some variables in the case m > n. Many methods exist
for choosing which variables to delete [7, 8].
[Graphical representation of the case with more than one dependent variable: Y = XB + E, with Y (n × p), X (n × m), B (m × p) and E (n × p).]
This is the general case that will be referred to in the further text.
Summary: MLR
- For m > n, there is no unique solution unless one deletes independent
variables.
- For m = n, there is one unique solution.
- For m < n, a least-squares solution is possible. For m = n and m < n, the
matrix inversion can cause problems.
- MLR is possible with more than one dependent variable.
or in graphical representation:
[Graphical representation of the principal component decomposition X = t_1 p_1' + t_2 p_2' + ... + t_a p_a' + E.]
To illustrate what the t_h and p_h' mean, an example for two variables, in the
two-dimensional plane, is shown in Fig. 2A. Extension to more dimensions is
easy but difficult to show on paper. For the example in Fig. 2A, the principal
component is the line of best fit for the data points that are shown in Fig. 2B.
Best fit means that the sum of squares of the x_1 and x_2 residuals is minimized.
This is also the average of both regression lines. It goes from -∞ to +∞. The
p_h' is a 1 × 2 row vector. Its elements, p_1 and p_2, are the direction cosines, or
the projections of a unit vector along the principal component on the axes of
the plot. The scores vector, t_h, is an n × 1 column vector. Its elements are the
coordinates of the respective points on the principal component line (Fig. 2B).
For this example, it can easily be understood why one wants the length of
the p_h' to be one [cos²(θ_1) + cos²(θ_2) = cos²(θ_1) + sin²(θ_1) = 1]; similar
rules exist for more than two dimensions.

Fig. 2. A principal component in the case of two variables: (A) loadings are the angle
cosines of the direction vector; (B) scores are the projections of the sample points (1-6)
on the principal component direction. Note that the data are mean-centered.
Generally, what one wants is an operator that projects the columns of X
onto a single dimension and an operator that projects the rows of X onto a
single dimension (see Fig. 3). In the first case, each column of X is repre-
sented by a scalar; in the second case, each row of X is represented by a scalar.
In the rest of this section it will be shown that these operators are of a
very simple nature.
Nonlinear iterative partial least squares (NIPALS) does not calculate all
the principal components at once. It calculates t_1 and p_1' from the X matrix.
Then the outer product, t_1 p_1', is subtracted from X and the residual E_1 is
calculated. This residual can be used to calculate t_2 and p_2':

E_1 = X - t_1 p_1'     E_2 = E_1 - t_2 p_2'     ...
Fig. 3. Scores and loadings are obtained by projecting X into vectors. Loadings: each
column of X is projected into an element of the vector p'. Scores: each row of X is pro-
jected into an element of the vector t.
The NIPALS algorithm for one principal component is: (1) take a vector x_j
from X as a starting score, t_h = x_j; (2) calculate the loadings, p_h' = t_h'X/(t_h't_h)
(Eqn. 12); (3) normalize p_h' to length one, p'_new = p'_old/||p'_old|| (Eqn. 13);
(4) calculate the scores, t_h = Xp_h/(p_h'p_h) (Eqn. 14); (5) compare the t_h used
in step 2 with the t_h obtained in step 4 and, if they are equal, stop, else go to 2.
(Note that after the first component is calculated, X in steps 2 and 4 has to
be replaced by its residual.)
An explanation of how NIPALS works can be seen when one realizes that
t_h't_h in Eqn. 12, ||p_h'|| in Eqn. 13 and p_h'p_h in Eqn. 14 are scalars. These scalar
constants are best combined in one general constant C. Then one can substi-
tute Eqn. 12 into 14: t_h = Xp_h and p_h' = t_h'X give Cp_h' = (Xp_h)'X, or Cp_h' =
p_h'X'X, or (CI_m - X'X)p_h = 0; or one can substitute Eqn. 14 into 12 and get
(C'I_n - XX')t_h = 0. These are the eigenvalue/eigenvector equations for X'X
and XX' as used in the classical calculation. (I_n is the identity matrix of size
n × n; I_m is that of size m × m.) The classical eigenvector and eigenvalue
theory is well described by Strang [10].
It has been shown that on convergence, the NIPALS solution is the same
as that calculated by the eigenvector formulae. The NIPALS method is con-
venient for microcomputers; it is also necessary for a good understanding of
PLS. Otherwise, it does not matter what method one uses. In practical situa-
tions, NIPALS usually converges; in the case of nonconvergence, two or
more very similar eigenvalues exist. Then it does not matter which combina-
tion or rotation of eigenvectors is chosen. The reader is referred to Mardia
et al. [5] for a more detailed discussion of PCA.
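A minimal sketch of the NIPALS iteration for one principal component is given below (Python with numpy, both assumed; the function name, tolerance and synthetic data are illustrative only). The last two lines check the statement above that the converged loading is an eigenvector of X'X.

    import numpy as np

    def nipals_pc(X, tol=1e-10, max_iter=1000):
        """One NIPALS principal component of a mean-centered n x m matrix X."""
        t = X[:, 0].copy()                       # step 1: start from some column of X
        for _ in range(max_iter):
            p = X.T @ t / (t @ t)                # step 2: p' = t'X/(t't)      (Eqn. 12)
            p = p / np.linalg.norm(p)            # step 3: scale p to length 1 (Eqn. 13)
            t_new = X @ p / (p @ p)              # step 4: t = Xp/(p'p)        (Eqn. 14)
            if np.linalg.norm(t_new - t) < tol:  # step 5: convergence test
                return t_new, p
            t = t_new
        return t, p

    # illustrative use: two components and their residuals
    rng = np.random.default_rng(0)
    X = rng.normal(size=(10, 4))
    X = X - X.mean(axis=0)                       # mean-centering
    t1, p1 = nipals_pc(X)
    E1 = X - np.outer(t1, p1)                    # E_1 = X - t_1 p_1'
    t2, p2 = nipals_pc(E1)                       # next component from the residual

    # on convergence, p1 agrees (up to sign) with the classical eigenvector of X'X
    w_eig, v_eig = np.linalg.eigh(X.T @ X)
    print(np.abs(p1), np.abs(v_eig[:, -1]))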
Summary: PCA
- A data matrix X of rank r can be decomposed to a sum of r rank 1 matrices.
- These rank 1 matrices are outer products of vectors called scores and load-
ings.
- The scores and loadings can be calculated pair-by-pair by an iterative pro-
cedure.
The results from the section on PCA can be used to explain the principal
component transformation of a data matrix X. This is a representation of X
as its scores matrix T (where dimensions having small eigenvalues are ex-
cluded). The transformation is
T = XP (= TP'P = TI_a) (15)
or graphically:

[Graphical representation of Eqn. 15: T (n × a) = X (n × m) P (m × a).]
The variables of X are replaced by new ones that have better properties
(orthogonality) and also span the multidimensional space of X. The inversion
of T’T should give no problem because of the mutual orthogonality of the
scores. Score vectors corresponding to small eigenvalues can be left out in
order to avoid collinearity problems from influencing the solution [9].
PCR solves the collinearity problem (by guaranteeing an invertible matrix
in the calculation of B) and the ability to eliminate the lesser principal com-
ponents allows some noise (random error) reduction. However, PCR is a two-
step method and thereby has the risk that useful (predictive) information
will end up in discarded principal components and that some noise will
remain in the components used for regression.
Detailed information on PCR is given by Mardia et al. [5] and Draper and
Smith [7]. Gunst and Mason [8] give a slightly different definition of PCR.
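A minimal PCR sketch (Python with numpy, assumed; the synthetic data, the choice of a = 3 retained components and the use of an SVD instead of NIPALS for the loadings are all illustrative) shows the two steps, scores first and a regression on the scores second:

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(20, 6))                 # 20 samples, 6 variables (synthetic)
    y = X @ np.array([1.0, -2.0, 0.5, 0.0, 0.0, 0.0]) + 0.1 * rng.normal(size=20)
    X = X - X.mean(axis=0)                       # mean-centering
    y = y - y.mean()

    # step 1: principal component transformation T = XP (Eqn. 15), keeping a components
    a = 3
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    P = Vt[:a].T                                 # loadings, m x a
    T = X @ P                                    # scores,   n x a

    # step 2: regress y on the scores; T'T is diagonal, so the inversion is well conditioned
    b_T = np.linalg.solve(T.T @ T, T.T @ y)
    b = P @ b_T                                  # coefficients for the original variables
    y_fit = X @ b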
Summary: PCR
- A data matrix can be represented by its score matrix.
- A regression of the score matrix against one or several dependent variables
is possible, provided that scores corresponding to small eigenvalues are
omitted.
- This regression gives no matrix inversion problems; it is well conditioned.
Model building
The PLS model is built on the properties of the NIPALS algorithm. As
mentioned in the PCR section, it is possible to let the score matrix represent
the data matrix. A simplified model would consist of a regression between
the scores for the X and Y block. The PLS model can be considered as con-
sisting of outer relations (X and Y block individually) and an inner relation
(linking both blocks).
The outer relation for the X block (cf. PCA section) is

X = TP' + E = Σ t_h p_h' + E (16)

One can build the outer relation for the Y block in the same way:

Y = UQ' + F* = Σ u_h q_h' + F* (17)
The summations are from 1 to a. One can describe all the components and
thus make E = F* = 0 or not. How and why this is done is discussed below.
It is the intention to describe Y as well as possible and hence to make ||F*||
as small as possible and, at the same time, get a useful relation between X
and Y. The inner relation can be made by looking at a graph of the Y block
score, u, against the X block score, t, for every component (Fig. 4). The
simplest model for this relation is a linear one:
û_h = b_h t_h (19)

where b_h = u_h't_h/(t_h't_h). The b_h play the role of the regression coefficients, b, in
the MLR and PCR models.
This model, however, is not the best possible. The reason is that the prin-
cipal components are calculated for both blocks separately so that they have
a weak relation to each other. It would be better to give them information
about each other so that slightly rotated components result which lie closer
to the regression line of Fig. 4.
Over-simplified model: 2 × PCA. An over-simplified model can be written
in algorithmic form as in the NIPALS section.
For the X block: (1) take t_start = some x_j; (2) p' = t'X/(t't) (= u'X/(u'u));
(3) p'_new = p'_old/||p'_old||; (4) t = Xp/(p'p); (5) compare the t in steps 2 and 4 and if
they are equal, stop, else go to 2.
Fig. 4. The inner relation. A linear regression of u against t. Note that the data are mean-
centered.
For the Y block: (1) take u_start = some y_j; (2) q' = u'Y/(u'u) (= t'Y/(t't));
(3) q'_new = q'_old/||q'_old||; (4) u = Yq/(q'q); (5) compare the u in steps 2 and 4 and if
they are equal, stop, else go to 2.
Improving the inner relation: exchange of scores. The above relations are
written as completely separated algorithms. The way each can get informa-
tion about the other is to let t and u change place in step 2. (Note the parts
in parentheses in this step.) Thus, the two algorithms can be written in
sequence: (1) take u_start = some y_j; (2) p' = u'X/(u'u) (w' = u'X/(u'u)); (3) p'_new =
p'_old/||p'_old|| (w'_new = w'_old/||w'_old||); (4) t = Xp/(p'p) (t = Xw/(w'w)); (5) q' = t'Y/(t't);
(6) q'_new = q'_old/||q'_old||; (7) u = Yq/(q'q); (8) compare the t in step 4 with the
one in the preceding iteration step. If they are equal (within a certain round-
ing error), stop; else go to 2. (In the case for which the Y block has only one
variable, steps 5-8 can be omitted by putting q = 1.)
This algorithm usually converges very quickly to give rotated components
for X and Y block.
Obtaining orthogonal X block scores. There is still a problem; the algo-
rithm does not give orthogonal t values. The reason is that the order of calcu-
lations that was used for the PCA has been changed. Therefore, the p' are
replaced by weights w' (see the formulas in parentheses in the previous subsection).
An extra loop can be included after convergence to get orthogonal t values:

p' = t'X/(t't) (20)

With p'_new = p'_old/||p'_old||, it becomes possible to calculate the new t: t = Xp/(p'p),
but this turns out to be just a scalar multiplication with the norm of the p' in
Eqn. 20: t_new = t_old ||p'_old||. Orthogonal t values are not absolutely necessary,
but they make the comparison with PCR easier. One must give the same
rescaling to the weights, w', if the prediction is to be made without error:
w'_new = w'_old ||p'_old||. Then t can be used for the inner relation as in Eqn. 19, and
the residuals can be calculated from E_1 = X - t_1 p_1' and F_1* = Y - u_1 q_1'. In
general,

F_h = F_{h-1} - b_h t_h q_h' (23)
(It is recalled that the aim is to make ||F_h|| small.) This mixed relation ensures
the ability to use the model parameters for predicting from a test set. Further-
more, because the rank of Y is not decreased by 1 for each component, one
can go on until the rank of the X block is exhausted. The complete algorithm
is given in the Appendix together with an illustration of the matrices and
vectors.
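As an illustration of the algorithm described above, the following sketch computes one PLS component (Python with numpy, both assumed; the function name, starting column, tolerance and iteration limit are arbitrary choices, and the code follows the text rather than the exact Appendix listing). It is reused in the prediction and PRESS sketches further below.

    import numpy as np

    def pls_component(E, F, tol=1e-10, max_iter=500):
        """One PLS component from mean-centered blocks E (n x m) and F (n x p)."""
        u = F[:, 0].copy()                        # start with some column of the Y block
        t_old = None
        for _ in range(max_iter):
            w = E.T @ u / (u @ u)                 # X-block weights: w' = u'E/(u'u)
            w = w / np.linalg.norm(w)
            t = E @ w                             # X-block scores (w has length 1)
            q = F.T @ t / (t @ t)                 # Y-block loadings: q' = t'F/(t't)
            q = q / np.linalg.norm(q)
            u = F @ q                             # Y-block scores
            if t_old is not None and np.linalg.norm(t - t_old) < tol:
                break                             # compare t with the preceding iteration
            t_old = t
        p = E.T @ t / (t @ t)                     # extra loop: p' = t'E/(t't)   (Eqn. 20)
        p_norm = np.linalg.norm(p)
        p = p / p_norm                            # normalize p to length 1
        t = t * p_norm                            # rescale t with ||p'_old||
        w = w * p_norm                            # give the same rescaling to the weights
        b = (u @ t) / (t @ t)                     # inner relation coefficient   (Eqn. 19)
        E_next = E - np.outer(t, p)               # new X residual
        F_next = F - b * np.outer(t, q)           # new Y residual               (Eqn. 23)
        return t, p, w, u, q, b, E_next, F_next

Calling the function repeatedly on the returned residuals gives the components one after another; the saved w, p, q and b are what the prediction sketch below needs.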
Summary: PLS
- There are outer relations of the form X = TP’ + E and Y = UQ’ + F*.
- There is an inner relation û_h = b_h t_h.
- The mixed relation is Y = TBQ' + F, where ||F|| is to be minimized.
- In the iterative algorithm, the blocks get each other's scores; this gives a
better inner relation.
- In order to obtain orthogonal X scores, as in the PCA, it is necessary to
introduce weights.
Prediction
The important part of any regression is its use in predicting the dependent
block from the independent block. This is done by decomposing the X block
and building up the Y block. For this purpose, p’, q’, w’ and b from the cali-
bration part are saved for every PLS factor. It should be noted that the new
X block has r samples instead of n.
The independent blocks are decomposed and the dependent block is built
up. For the X block, t is estimated by multiplying X by w as in the model
building part
t̂_h = E_{h-1} w_h (24)

E_h = E_{h-1} - t̂_h p_h' (25)

For the Y block:

Y = Σ b_h t̂_h q_h' + F_h (26)

where the summation is over h for all the factors (a) one wants to include
and X = E_0, Y = F_0.
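A matching prediction sketch (Python with numpy; it assumes the hypothetical pls_component function above was used in calibration and that its w, p, q and b have been collected column-wise into W, P, Q and B):

    import numpy as np

    def pls_predict(X_new, W, P, Q, B):
        """Eqns. 24-26 for a new X block (r x m), already centered/scaled as in calibration.
        W and P are m x a, Q is p x a, B has length a (one entry per component)."""
        E = X_new.copy()                               # X = E_0
        Y_hat = np.zeros((X_new.shape[0], Q.shape[0]))
        for h in range(len(B)):
            t_hat = E @ W[:, h]                        # Eqn. 24: t_hat = E_{h-1} w_h
            E = E - np.outer(t_hat, P[:, h])           # Eqn. 25: E_h = E_{h-1} - t_hat p_h'
            Y_hat += B[h] * np.outer(t_hat, Q[:, h])   # Eqn. 26: build up the Y block
        return Y_hat

The mean (and, if used, the variance scaling) of the calibration Y block has to be added back to Y_hat afterwards.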
Number of components
If the underlying model for the relation between X and Y is a linear model,
the number of components needed to describe this model is equal to the
model dimensionality. Nonlinear models require extra components to des-
cribe nonlinearities. The number of components to be used is a very impor-
tant property of a PLS model.
Fig. 5. ||F_h|| vs. the number of PLS components. A threshold and/or a difference criterion
(see A in figure) can be used to stop the algorithm.
Fig. 6. Plot of PRESS against the number of components. This criterion evaluates the
predictive power of the model. The number of components giving a minimum PRESS is
the right number for the model that gives optimal prediction. In this example, models
with 4-8 components would be acceptable.
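A sketch of the PRESS criterion of Fig. 6, using a leave-one-out split (Python with numpy; it relies on the hypothetical pls_component and pls_predict sketches above, and X and Y are two-dimensional arrays):

    import numpy as np

    def press(X, Y, n_components):
        """Leave-one-out PRESS for a PLS model with a given number of components."""
        n = X.shape[0]
        total = 0.0
        for i in range(n):
            keep = np.arange(n) != i
            Xc, Yc = X[keep], Y[keep]
            x_mean, y_mean = Xc.mean(axis=0), Yc.mean(axis=0)
            E, F = Xc - x_mean, Yc - y_mean            # mean-centered calibration blocks
            W, P, Q, B = [], [], [], []
            for _ in range(n_components):
                t, p, w, u, q, b, E, F = pls_component(E, F)
                W.append(w); P.append(p); Q.append(q); B.append(b)
            W, P, Q = np.array(W).T, np.array(P).T, np.array(Q).T
            y_pred = pls_predict((X[i] - x_mean)[None, :], W, P, Q, np.array(B)) + y_mean
            total += np.sum((Y[i] - y_pred) ** 2)
        return total

    # the number of components is chosen at (or near) the minimum of such PRESS values,
    # e.g. press_values = [press(X, Y, a) for a in range(1, 9)]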
Statistics
From the matrices of residuals E_h and F_h, sums of squares can be calcu-
lated as follows: the total sum of squares over a matrix, the sums of squares
over rows, and the sums of squares over columns. These sums of squares can
be used to construct variance-like estimators. The statistical properties of
these estimators have not undergone a rigorous mathematical treatment yet,
but some properties can be understood intuitively.
The sum of squares of the F_h is the indicator of how good the model is
(Eqn. 23). The sum of squares of E_h is an indicator of how much of the X
block is not used in the model. In some cases, a substantial part of the X
block does not participate in the model, which means that the independent
variables have unexpected properties or large errors.
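The sums of squares mentioned above are easily written down; a minimal sketch (Python with numpy, assumed) for a residual matrix E_h or F_h:

    import numpy as np

    def residual_sums_of_squares(E):
        """Total, per-object (row) and per-variable (column) sums of squares of a residual matrix."""
        total = np.sum(E ** 2)
        per_object = np.sum(E ** 2, axis=1)      # one value per sample, cf. Fig. 8
        per_variable = np.sum(E ** 2, axis=0)    # one value per variable, cf. Fig. 7
        return total, per_object, per_variable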
Fig. 7. Statistics for the variables. The data are shown as bars representing the sum of
squares per variable (for the model building both X and Y variables; for the prediction
only X variables). After 0 PLS components, the data are in mean-centered and variance-
scaled form. As the number of PLS components increases, the information in each
variable is exhausted. The hatched bar shows the behavior of a “special” variable, one
that contributes little to the model.
Fig. 8. Statistics for the objects (samples). The data are shown as bars representing the
sum of squares per object. As the number of PLS components increases, the sum of
squares for each object decreases. The hatched bar shows the behavior of a “special”
object, probably an outlier.
Fig. 9. A graphical representation of the matrices and vectors used in PLS.
Conclusion
The topic of partial least squares is much larger than the material covered
above. Some subjects not discussed at all or not in detail are: outlier detec-
tion, treatment of missing data, F and t statistics, classification/pattern
recognition, leverage, selection of variables, data transformations, extensions
to more blocks and hierarchical models, and lack of fit.
There are also other PLS algorithms. Each may have some advantage in a
particular application. The algorithm given here is one of the most complete
and elegant ones when prediction is important. An example of its application
to simulated data is given in the next paper [11].
The authors thank Svante Wold and Harald Martens for their contributions
to the PLS method. This paper results from a number of discussions at the
Laboratory for Chemometrics. We thank all colleagues and visitors to the lab
and especially Dave Veltkamp for their support and stimulating discussions.
The Science Department of the Belgian Ministry of Foreign Affairs provided
P. Geladi with a NATO travel grant 10/B/84/BE. Graphics were done by
Louise Rose. This work was supported by a grant from the Center for Process
Analytical Chemistry, a National Science Foundation Cooperative Research
Center at the University of Washington.
REFERENCES
4 S. Wold, in H. Martens and H. Russwurm (Eds.), Food Research and Data Analysis,
Applied Science Publishers, London, 1983.
5 K. Mardia, J. Kent and J. Bibby, Multivariate Analysis, Academic Press, London, 1980.
6 S. Wold, in B. Kowalski (Ed.), Chemometrics: Mathematics and Statistics in Chemistry,
Reidel, Dordrecht, 1984.
7 N. Draper and H. Smith, Applied Regression Analysis, Wiley, New York, 1981.
8 R. Gunst and R. Mason, Regression Analysis and its Applications, M. Dekker, New
York, 1980.
9 D. Belsley, E. Kuh and R. Welsch, Regression Diagnostics: Identifying Influential Data
and Sources of Collinearity, Wiley, New York, 1980.
10 G. Strang, Linear Algebra and its Applications, Academic Press, New York, 1980.
11 P. Geladi and B. Kowalski, Anal. Chim. Acta, 185 (1986) 19.