

Analytical Methods

TUTORIAL REVIEW

Principal component analysis

Rasmus Bro^a and Age K. Smilde^ab

Cite this: Anal. Methods, 2014, 6, 2812
Received 28th October 2013
Accepted 17th February 2014
DOI: 10.1039/c3ay41907j
www.rsc.org/methods

Principal component analysis is one of the most important and powerful methods in chemometrics as well as in a wealth of other areas. This paper provides a description of how to understand, use, and interpret principal component analysis. The paper focuses on the use of principal component analysis in typical chemometric areas but the results are generally applicable.

Introductory example

To set the stage for this paper, we will start with a small example where principal component analysis (PCA) can be useful. Red wines, 44 samples, produced from the same grape (Cabernet sauvignon) were collected. Six of these were from Argentina, fifteen from Chile, twelve from Australia and eleven from South Africa. A Foss WineScan instrument was used to measure 14 characteristic parameters of the wines such as the ethanol content, pH, etc. (Table 1).

Hence, a dataset is obtained which consists of 44 samples and 14 variables. The actual measurements can be arranged in a table or a matrix of size 44 × 14. A portion of this table is shown in Fig. 1.

With 44 samples and 14 columns, it is quite complicated to get an overview of what kind of information is available in the data. A good starting point is to plot individual variables or samples. Three of the variables are shown in Fig. 2. It can be seen that total acid as well as methanol tends to be higher in samples from Australia and South Africa whereas there are less pronounced regional differences in the ethanol content.

^a Department of Food Science, University of Copenhagen, Rolighedsvej 30, DK-1958, Frederiksberg C, Denmark
^b Biosystems Data Analysis, Swammerdam Institute for Life Sciences, University of Amsterdam, Science Park 904, 1098 XH Amsterdam, The Netherlands

Rasmus Bro studied mathematics and analytical chemistry at the Technical University of Denmark and received his M.Sc. degree in 1994. In 1998 he obtained his Ph.D. in multi-way analysis from the University of Amsterdam, The Netherlands, with Age K. Smilde as one of two supervisors. He is currently Professor of chemometrics at the University of Copenhagen, Denmark. His work focusses on developing tools for analyzing data in process analytical technology, metabonomics and analytical chemistry. Rasmus Bro has published 150 papers as well as several books and has received a number of prizes and awards over the years. He heads the ten-year-old industrial consortium ODIN which is a networking facility that aims to make chemometrics as used and useful as possible in Danish industries.

Age K. Smilde is a full professor of Biosystems Data Analysis at the Swammerdam Institute for Life Sciences at the University of Amsterdam and he also works part-time at the Academic Medical Centre of the same university. As of July 1, 2013 he holds a part-time position as professor of Computational Systems Biology at the University of Copenhagen. His research interest focuses on two topics: data fusion and network inference. Data fusion concerns integrating functional genomics data and fusing data with prior biological knowledge. The network topic encompasses network reconstruction, deriving network properties from time-resolved data and computational network analysis. He has published more than 200 peer-reviewed papers and has been the Editor-Europe of the Journal of Chemometrics during the period 1994–2002. He chaired the 1996 Gordon Research Conference on Statistics in Chemistry & Chemical Engineering. In 2006, he received the Eastern Analytical Symposium Award for Achievements in Chemometrics.


Table 1  Chemical parameters determined on the wine samples (data from http://www.models.life.ku.dk/Wine_GCMS_FTIR [February, 2014]1,2)

Ethanol (vol%)
Total acid (g L⁻¹)
Volatile acid (g L⁻¹)
Malic acid (g L⁻¹)
pH
Lactic acid (g L⁻¹)
Rest sugar (Glu + Fru) (g L⁻¹)
Citric acid (mg L⁻¹)
CO2 (g L⁻¹)
Density (g mL⁻¹)
Total polyphenol index
Glycerol (g L⁻¹)
Methanol (vol%)
Tartaric acid (g L⁻¹)

Fig. 1  A subset of the wine dataset.

Even though Fig. 2 may suggest that there is little relevant regional information in ethanol, it is dangerous to rely too much on univariate analysis. In univariate analysis, any co-variation with other variables is explicitly neglected and this may lead to important features being ignored. For example, plotting ethanol versus glycerol (see Fig. 3) shows an interesting correlation between the two. This is difficult to deduce from plots of the individual variables. If glycerol and ethanol were completely correlated, it would, in fact, be possible to simply use e.g. the average or the sum of the two as one new variable that could replace the two original ones. No information would be lost as it would always be possible to go from e.g. the average to the two original variables.

This concept of using suitable linear combinations of the original variables will turn out to be essential in PCA and is explained in a bit more detail and a slightly unusual way here. The new variable, say, the average of the two original ones, can be defined as a weighted average of all 14 variables; only the other variables will have weight zero. These 14 weights are shown in Fig. 4. Rather than having the weights of ethanol and glycerol be 0.5 as they would in an ordinary average, they are chosen as 0.7 to make the whole 14-vector of weights scaled to be a unit vector. When the original variables ethanol and glycerol are taken to be of length one (unit length) then it is convenient to also have the linear combination of those be of length one. This then defines the unit on the combined variable. To achieve this it is necessary to take 0.7 (√2/2 to be exact) of ethanol and 0.7 of glycerol, as simple Pythagorean geometry shows in Fig. 5. This also carries over to more than two variables.

Using a unit weight vector has certain advantages. The most important one is that the unit vector preserves the size of the variation. Imagine there are ten variables rather than two that are being averaged. Assume, for simplicity, that all ten have the value five. Regardless of whether the average is calculated from two or ten variables, the average remains five. Using the unit vector, though, will provide a measure of the number of variables showing variation. In fact, the variance of the original variables and this newly calculated one will be the same if the original variables are all correlated. Thus, using the unit vector preserves the variation in the data and this is an attractive property.
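This unit-weight idea is easy to try out numerically. The following is a minimal numpy sketch (not from the paper): a random 44 × 14 matrix stands in for the wine data, and the column indices used for 'ethanol' and 'glycerol' are placeholders chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(44, 14))      # stand-in for the 44 x 14 wine data matrix
X[:, 11] = X[:, 0]                 # make the 'glycerol' column perfectly correlated with 'ethanol'

w = np.zeros(14)
w[[0, 11]] = np.sqrt(2) / 2        # approx. 0.7 on ethanol and glycerol, zero elsewhere
print(np.linalg.norm(w))           # 1.0 -> w is a unit vector

t = X @ w                          # the new variable: one value per sample
# The summed variance of the two original columns is preserved in t
print(X[:, [0, 11]].var(axis=0).sum(), t.var())
```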


One of the reasons is that it allows for going back and forth between the space of the original variables (say glycerol–ethanol) and the new variable. With this definition of weights, it is now possible to calculate the new variable, the 'average', for any sample, as indicated in Fig. 6.

Fig. 2  Three variables coloured according to the region.

Fig. 3  A plot of ethanol versus glycerol.

Fig. 5  The concept of a unit vector.

As mentioned above, it is possible to go back and forth between the original two variables and the new variable. Multiplying the new variable with the weights provides an estimation of the original variables (Fig. 7).

This is a powerful property: it is possible to use weights to condense several variables into one and vice versa. To generalize this, notice that the current concept only works perfectly when the two variables are completely correlated. Think of an average grade in a school system. Many particular grades can lead to the same average grade, so it is not in general possible to go back and forth. To make an intelligent new variable, it is natural to ask for a new variable that will actually provide a nice model of the data. That is, a new variable which, when multiplied with the weights, will describe as much as possible of the whole matrix (Fig. 8).

Fig. 4  Defining the weights for a variable that includes only ethanol and glycerol information.
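The 'going back and forth' of Figs. 6 and 7 can be sketched in a few numpy lines. This is an illustrative sketch under the same assumptions as before (random stand-in data, hypothetical column indices for ethanol and glycerol):

```python
import numpy as np

rng = np.random.default_rng(1)
ethanol = rng.normal(size=44)
X = np.zeros((44, 14))
X[:, 0] = ethanol                  # hypothetical ethanol column
X[:, 11] = ethanol                 # hypothetical glycerol column, perfectly correlated

w = np.zeros(14)
w[[0, 11]] = np.sqrt(2) / 2        # the unit weight vector of Figs. 4 and 5

t = X @ w                          # Fig. 6: inner product of each sample (row) with the weights
X_hat = np.outer(t, w)             # Fig. 7: multiplying the new variable with the weights
print(np.allclose(X_hat, X))       # True here, because the two variables are completely correlated
```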


Fig. 6  Using defined weights to calculate a new variable that is a scaled average of ethanol and glycerol (arbitrary numbers used here). The average is calculated as the inner product of the 14 measurements of a sample and the weight vector. Some didactical rounding has been used in the example.

Fig. 7  Using the new variable and the weights to estimate the original variables.

Fig. 8  Defining weights (w's) that will give a new variable which leads to a good model of the data.

Such a variable will be an optimal representative of the whole data in the sense that no other weighted average simultaneously describes as much of the information in the matrix.

It turns out that PCA provides a solution to this problem. Principal component analysis provides the weights needed to get the new variable that best explains the variation in the whole dataset in a certain sense. This new variable, including the defining weights, is called the first principal component.

To find the first principal component of the actual wine data, it is necessary to jump ahead a little bit and preprocess the data first. Looking at the data (Fig. 1) it is seen that some variables such as CO2 are measured in numbers that are much larger than e.g. methanol. For example, for sample three, CO2 is 513.74 [g L⁻¹] whereas methanol is 0.18 [vol%]. If this difference in scale and possibly offset is not handled, then the PCA model will only focus on variables measured in large numbers. It is desired to model all variables, and there is a preprocessing tool called autoscaling which will make each column have the same 'size' so that all variables have an equal opportunity of being modelled. Autoscaling means that from each variable, the mean value is subtracted and then the variable is divided by its standard deviation. Autoscaling will be described in more detail later, but for now it is just important to note that each variable is transformed to equal size and, in the process, each variable will have negative as well as positive values because the mean has been subtracted. Note that an average sample now corresponds to all zeroes. Hence, zero is no longer absence of a 'signal' but instead indicates an average 'signal'.

With this pre-processing of the data, PCA can be performed. The technical details of how to do that will follow, but the first principal component is shown in Fig. 9. In the lower plot, the weights are shown. Instead of the quite sparse weights in Fig. 4, these weights are non-zero for all variables. This first component does not explain all the variation, but it does explain 25% of what is happening in the data. As there are 14 variables, it would be expected that if every variable showed variation independent of the others, then each original variable would explain 100%/14 = 7% of the variation. Hence, this first component is wrapping up information which can be said to correspond to approximately 3–4 variables.

Just like the average of ethanol and glycerol or the average school grade, the new variable can be interpreted as "just a variable". The weights define how the variable is determined and how many scores each sample has of this linear combination. For example, it is seen that most of the South African samples have positive scores and hence will have fairly high values on variables that have positive weights such as, for example, methanol. This is confirmed in Fig. 2.

Principal component analysis

Taking linear combinations

It is time to introduce some more formal notation and nomenclature. The weighted average as mentioned above is more formally called a linear combination: it is a way of combining the original variables in a linear way. It is also


sometimes called a latent variable where, in contrast, the original variables are manifest.

The data are collected in a matrix X with I rows (i = 1, ..., I; usually samples/objects) and J columns (j = 1, ..., J; usually variables), hence of size I × J. The individual variables (columns) of X are denoted by x_j (j = 1, ..., J) and are all vectors in the I-dimensional space. A linear combination of those x variables can be written as t = w_1·x_1 + ... + w_J·x_J, where t is now a new vector in the same space as the x variables (because it is a linear combination of these). In matrix notation, this becomes t = Xw, with w being the vector with elements w_j (j = 1, ..., J). Since the matrix X contains variation relevant to the problem, it seems reasonable to have as much as possible of that variation also in t. If this amount of variation in t is appreciable, then it can serve as a good summary of the x variables. Hence, the fourteen variables of X can then be replaced by only one variable t retaining most of the relevant information.

The variation in t can be measured by its variance, var(t), defined in the usual way in statistics. Then the problem translates to maximizing this variance by choosing optimal weights w_1, ..., w_J. There is one caveat, however, since multiplying an optimal w with an arbitrarily large number will make the variance of t also arbitrarily large. Hence, to have a proper problem, the weights have to be normalized. This is done by requiring that their norm, i.e. the sum-of-squared values, is one (see Fig. 5). Throughout we will use the symbol ||·||² to indicate the squared Frobenius norm (sum-of-squares). Thus, the formal problem becomes

argmax_{||w||=1} var(t)   (1)

which should be read as the problem of finding the w of length one that maximizes the variance of t (note that ||w|| = 1 is the same as requiring ||w||² = 1). The function argmax is the mathematical notation for returning the argument w of the maximization function. This can be made more explicit by using the fact that t = Xw:

argmax_{||w||=1} t^T t = argmax_{||w||=1} w^T X^T X w   (2)

where it is assumed that the matrix X is mean-centered (then all linear combinations are also mean-centered). The latter problem is a standard problem in linear algebra and the optimal w is the (standardized) first eigenvector (i.e. the eigenvector with the largest eigenvalue) of the covariance matrix X^T X/(n − 1) or the corresponding cross-product matrix X^T X.
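As a sketch of eqn (2) in numpy (an illustration, not the paper's code, assuming X is already mean-centered or autoscaled): the optimal w is the eigenvector of X^T X with the largest eigenvalue, and the same direction comes out of a singular value decomposition.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(44, 14))
X = X - X.mean(axis=0)                       # mean-center so eqn (2) applies

evals, evecs = np.linalg.eigh(X.T @ X)       # eigendecomposition of the cross-product matrix
w = evecs[:, np.argmax(evals)]               # eigenvector with the largest eigenvalue
t = X @ w                                    # scores of the first principal component

# The first right singular vector from an SVD gives the same direction (up to sign)
U, s, Vt = np.linalg.svd(X, full_matrices=False)
print(np.allclose(np.abs(Vt[0]), np.abs(w)))
print(np.isclose(t.var(ddof=1), evals.max() / (X.shape[0] - 1)))  # var(t) = largest eigenvalue/(n-1)
```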

Fig. 9  The first principal component of the wine data. The lower plot shows the weights and the upper plot shows the weighted averages obtained with those weights.

Explained variation

The variance of t can now be calculated but a more meaningful assessment of the summarizing capability of t is obtained by calculating how representative t is in terms of replacing X. This can be done by projecting the columns of X on t and calculating the residuals of that projection. This is performed by regressing all variables of X on t using the ordinary regression equation

X = tp^T + E   (3)

where p is the vector of regression coefficients and E is the matrix of residuals. Interestingly, p equals w and the whole machinery of regression can be used to judge the quality of the summarizer t. Traditionally, this is done by calculating

(||X||² − ||E||²)/||X||² × 100%   (4)

which is referred to as the percentage of explained variation of t. In Fig. 10, it is illustrated how the explained variation is calculated, as also explained around eqn (4).

Note that the measures above are called variations rather than variances. In order to talk about variances, it is necessary to correct for the degrees of freedom consumed by the model and this is not a simple task. Due to the non-linear nature of the PCA model, degrees of freedom are not as simple to define as for linear models such as in linear regression or analysis of variance. Hence, throughout this paper, the magnitude of variation will simply be expressed in terms of sums of squares. For more information on this, refer to the literature.3,4
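A small numpy sketch of eqn (3) and (4) (again an illustration under the assumption of a mean-centered X): regress every column of X on the first-component scores t to get p, form the residual matrix, and express the explained variation as a percentage.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(44, 14))
X = X - X.mean(axis=0)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
t = U[:, 0] * s[0]                      # scores of the first component

p = X.T @ t / (t @ t)                   # regression of every column of X on t (eqn (3)); p equals w
E = X - np.outer(t, p)                  # residual matrix
explained = 100 * (1 - np.sum(E**2) / np.sum(X**2))   # eqn (4)
print(round(explained, 1), "% of the variation explained by one component")
```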
dened in the usual way in statistics. Then the problem trans-
lates to maximizing this variance choosing optimal weights w1,
., wJ. There is one caveat, however, since multiplying an PCA as a model
optimal w with an arbitrary large number will make the variance Eqn (3) highlights an important interpretation of PCA: it can be
of t also arbitrary large. Hence, to have a proper problem, the seen as a modelling activity. By rewriting eqn (3) as
weights have to be normalized. This is done by requiring that
their norm, i.e. the sum-of-squared values, is one (see Fig. 5). ^ + E,
X ¼ tpT + E ¼ X (5)


shows that the (outer-) product tp^T serves as a model of X (indicated with a hat). In this equation, vector t was a fixed regressor and vector p the regression coefficient to be found. It can be shown that actually both t and p can be established from such an equation5 by solving

argmin_{t,p} ||X − tp^T||²   (6)

which is also a standard problem in linear algebra and has the same solution as eqn (2). Note that the solution does not change if t is premultiplied by a ≠ 0 and simultaneously p is divided by that same value. This property is called the scaling ambiguity6 and it can be solved in different ways. In chemometrics, the vector p is normalized to length one (||p|| = 1) and in psychometrics, t is normalized to length one. The vector t is usually referred to as the score vector (or scores in shorthand) and the vector p is called the loading vector (or loadings in shorthand). The term 'principal component' is not clearly defined and can mean either the score vector or the loading vector or the combination. Since the score and loading vectors are closely tied together it seems logical to reserve the term principal component for the pair t and p.

Fig. 10  Exemplifying how explained variation is calculated using the data and the residuals.

Taking more components

If the percentage of explained variation of eqn (4) is too small, then the t, p combination is not a sufficiently good summarizer of the data. Eqn (5) suggests an extension by writing

X = TP^T + E = t_1p_1^T + ... + t_Rp_R^T + E = X̂ + E   (7)

where T = [t_1, ..., t_R] (I × R) and P = [p_1, ..., p_R] (J × R) are now matrices containing, respectively, R score vectors and R loading vectors. If R is (much) smaller than J, then T and P still amount to a considerably more parsimonious description of the variation in X. To identify the solution, P can be taken such that P^TP = I and T can be taken such that T^TT is a diagonal matrix. This corresponds to the normalisation of the loadings mentioned above. Each loading vector thus has norm one and is orthogonal to the other loading vectors (an orthogonal basis). The constraint on T implies that the score vectors are orthogonal to each other. This is the usual way to perform PCA in chemometrics. Due to the orthogonality in P, the R components have independent contributions to the overall explained variation

||X||² = ||t_1p_1^T||² + ... + ||t_Rp_R^T||² + ||E||²   (8)

and the term 'explained variation per component' can be used, similarly as in eqn (4).
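A sketch of eqn (7) and (8) in numpy (assuming a preprocessed matrix X): a truncated SVD gives orthogonal scores T and orthonormal loadings P, and the explained variation per component follows from the score norms.

```python
import numpy as np

def pca(X, R):
    """Return scores T (I x R), loadings P (J x R) and residuals E of an R-component model."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    T = U[:, :R] * s[:R]          # orthogonal score vectors carrying the variance
    P = Vt[:R].T                  # orthonormal loading vectors (P.T @ P = identity)
    E = X - T @ P.T
    return T, P, E

rng = np.random.default_rng(5)
X = rng.normal(size=(44, 14))
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # autoscale

T, P, E = pca(X, R=2)
# eqn (8): because each p_r has norm one, ||t_r p_r^T||^2 equals ||t_r||^2
per_component = 100 * np.sum(T**2, axis=0) / np.sum(X**2)
print(per_component, per_component.sum())
```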
Fig. 11  The structure of a PCA model. Note that the residuals (E) have the same structure as the data and so does the model approximation of the data (TP^T).

History of PCA

PCA has been (re-)invented several times. The earliest presentation was in terms of eqn (6).7 This interpretation stresses the modelling properties of PCA and is very much rooted in regression-thinking: variation explained by the principal components (Pearson's view). Later, in the thirties, the idea of taking linear combinations of variables was introduced8 and the variation of the principal components was stressed (eqn (1); Hotelling's view). This is a more multivariate statistical approach. Later, it was realized that the two approaches were very similar.

Similar, but not the same. There is a fundamental conceptual difference between the two approaches, which is important to understand. In the Hotelling approach, the principal components are taken seriously in their specific direction. The first component explains the most variation, the second component the second most, etc. This is called the principal axis property: the principal components define new axes which should be taken seriously and have a meaning. PCA finds these principal axes. In contrast, in the Pearson approach it is the subspace which is important, not the axes as such. The axes merely serve as a basis for this subspace. In the Hotelling approach, rotating the principal components destroys the interpretation of these components whereas in the Pearson conceptual model rotations merely generate a different basis for the (optimal) subspace.9

Visualization and interpretation

It is now time to discuss how a PCA model can be visualized. There are four parts of a PCA model: the data, the scores, the loadings and the residuals. Visualization of the actual data is often dependent on the type of data and the traditions of a given field. For continuous data such as time-series and spectra, it is


Fig. 12  Scores 1 and 2 from a PCA on the autoscaled wine data. Upper left is a line plot of the 44 score values in component 1 and lower left the 44 score values of component 2. In the right plot, the two scores are plotted against each other.

Fig. 13  Score plot of component 1 versus 2 (top) and 1 versus 13 (bottom). To the left, the plots are shown filling out the squares and to the right, they are shown preserving distances.

often feasible to plot the data as curves whereas more discrete data are often plotted in other ways such as bar plots.

Visualizing and interpreting residuals. Whatever visualization applies to the data would often also be useful for e.g. the residuals (Fig. 11). The residuals have the same structure and, for example for spectral data, the residuals would literally correspond to the residual spectra and therefore provide important chemical information as to what spectral variation has not been explained (see also Fig. 23). In short, any visualization that is useful for the data will also be useful for the residuals.

Residuals can also be plotted as histograms or e.g. normal probability plots in order to see if the residuals are normally distributed. Alternatively, the residuals can be used for


calculating the explained or unexplained variation as explained earlier.

Fig. 14  Loading one (top) and loading two (bottom).

Fig. 15  Hypothetical loading vector from a model that explains 100% in component 1 (top) and 14% in component 1 (bottom).

Fig. 16  Scatter plot of loading 1 versus loading 2.

Fig. 17  A biplot of the first two components of a PCA model of the wine data.

Visualizing and interpreting scores. It is well known that the readings of a variable can be plotted. Imagine that pH is measured on 20 samples. These 20 values can be plotted in a multitude of ways. Scores are readings in exactly the same way as any other variable and can hence be plotted and interpreted in many different ways. In Fig. 12, some visualizations of the first two components of the PCA model of the wine data are shown. If desired, they can be plotted as line plots as shown to the left in the figure. This plot of, for example, score 1, shows that the dark blue samples tend to have negative scores. This means that wines from Chile have relatively less of what this first component represents, which will be described by the loadings (see below).

Instead of plotting the scores in line plots, it is also possible to plot them in scatter plots. In Fig. 12 (right), such a scatter plot is shown and from the scatter plot it is more readily seen that there seem to be certain groupings in the data. For example, the Australian and Chilean wines seem to be almost distinctly different in this score plot. This suggests that it is possible to classify a wine using these measured variables. If a new sample ends up in the middle of the Chilean samples, it is probably not an Australian wine and vice versa. This possibility of using PCA for classification forms the basis for the classification method called SIMCA (Soft Independent Modelling of Class


Analogies).10,11 The scatter plot can be interpreted in the same way that scatter plots are normally interpreted. For example, a plot of glycerol versus ethanol (Fig. 3) is simple to interpret: samples that are close have similar glycerol and ethanol. Likewise for a scatter plot of components 1 and 2: samples that are close are similar in terms of what the components represent, which is defined by the loading vectors. Also, if (and only if) the two components represent all or almost all of the variation in the data, then e.g. two closely lying samples are similar with respect to the actual data.

Fig. 18  Scatter plot of the preprocessed values of the variables of two wines (left) and two variables (right).

Fig. 19  Scree plots for simulated rank four data with various levels of noise. Top plots show eigenvalues. Middle plots show the same but zoomed in on the y-axis to the line indicated in the top plot. Lower plots show the logarithm of the eigenvalues.

It is possible to assess similarities and differences among samples in terms of the raw data. If two components explain all or most of the variation in the data, then a score scatter plot will reflect distances in terms of the data directly if the scores are shown on the same scale. That is, the plot must be shown as original scores where the basis is the loading vector. As the loading vectors are unit vectors, they reflect the original data and if the two axes in the plot use the same scale, then distances can be read from the plots directly. If, on the other hand, the plots are not shown using the same scale on both axes, then assessing distances is not possible.

Compare the two versions of the two score plots in Fig. 13. The lower left plot has widely different scales on the two axes (because one component has much larger values numerically than the other). Hence, it is similar to plotting e.g. kilometres on one axis and meters on another. A map with such axes does not preserve distance. Consider, for example, the wine sample marked A. It seems to be closer to sample C than B in the lower left plot. The plot to the lower right preserves distances and here it is readily verified that sample A is, in fact, the closest to B in the space spanned by the two components.

There are several points worth mentioning in relation to this. Score plots are only indicative of the specific fraction of variance they explain. For example, scores that explain three percent do not imply much with respect to the raw data. To assess relative positions such as distances in a score plot, the plot needs to


preserve distances. This is mostly a problem in practice when the magnitude of the two components is widely different. The score plot that does not preserve distances is still useful. For example, the lower left score plot in Fig. 13 is much better for discerning groupings and detecting patterns than the one to the lower right.

Fig. 20  Scree plot for the autoscaled wine data. The decision lines for having eigenvalues larger than one and the broken stick are also shown.

Fig. 21  Cumulated percentage variation explained.

Fig. 22  Left: score number four of wine data. Right: score two versus score four.

Fig. 23  Example of spectral data (grey) and residual spectral information after one (left) and six (right) components.

Visualizing and interpreting loadings. Loadings define what a principal component represents. Just as the weights in Fig. 4 defined the latent variable to represent a mixture of glycerol and ethanol, the loading vector of a PCA model does exactly the same. It defines what linear combination of the variables a particular component represents.

Fig. 14 shows the loadings of the first two components. With these, it is possible to explain what the scores of the model represent. For example, wines from Chile have low (negative) scores for component 2 (Fig. 12). This implies that they have a lot of the opposite of the phenomenon represented in loading 2. Hence, these samples have variation where ethanol, total, volatile, and lactic acids are low at the same time (relatively) while e.g. malic acid is high. Also, and this is an important point, certain variables that have loadings close to zero, such as e.g. citric acid, do not follow this trend. Hence, the loading tells what the trend is and also which variables are not part of the trend.

The phenomenon reflected in the principal component is also expected to be visually apparent in the raw data, but only with respect to how much variation of the data this component describes. The first component is seen in the label in Fig. 14 to explain 24.4% of the variation whereas the second one explains 21.3%. Together that means that 45.7% of the variation is explained by these two components. If the two components had explained 100%, all information would be contained in these two components, but for this particular model, half the variation is still retained in other components, so we should remain cautious not to claim that observations from the components are fully indicative of variations in the data.

An example of the importance of this is indicated in Fig. 15. The model reflected in the top plot shows that variables 4 and 6 are perfectly oppositely correlated. The model reflected in the bottom plot does not indicate that. In contrast, the low percentage explained indicates that there are many other phenomena in the data, so the correlation between variables 4 and 6 need not be close to minus one as it will be in the first model.


Instead of looking at the loadings in line plots, it is also feasible to make scatter plots (Fig. 16). The scatter plot is often helpful for finding patterns of variation. For example, it is apparent in the plot that volatile acid and lactic acid are generally correlated in approximately 50% of the variation reflected in the two components. Residual sugar seems to be only moderately described in these two components as it is close to zero in both components. As the variables have been autoscaled, a position close to zero implies that this particular variable does not co-vary with the variation that components 1 and 2 are reflecting.

As for the score scatter plot, distances are only preserved in the loading scatter plot if the two loadings are plotted on the same scale. The basis for the loadings are the scores and these are generally not unit vectors as they carry the variance of the components. To correct for that, it is possible to simply normalize the scores and multiply the corresponding loading vectors by the inverse normalization factor; in essence, just moving the variance from the score vector to the loading vector.

Visualizing and interpreting loadings and scores together – biplots. It is possible and obvious to link the score and the loading plot. That way, it is possible to explain why e.g. a certain grouping is observed in a score plot. As hinted above, it is difficult to find a suitable basis to plot on when combining scores and loadings, especially if preserving distances is desired. The biplot aims to solve this problem, or rather, presents a suitable set of compromises to choose from. Biplots were originally developed by K. R. Gabriel, but J. C. Gower has also contributed. The reader is urged to refer to the original literature for more in-depth information.12–14

The principle behind biplots can be explained by representing the PCA model using

X = TP^T = T^(norm)SP^T   (9)

where T^(norm) is the score matrix with each column scaled to norm one just like the loadings are. The diagonal matrix S contains the norms of T on the diagonal. Above, no residuals are assumed for simplicity. Normally the scores are taken as T^(norm)S (= T), but if a distance-preserving plot of the loadings is desired, it is more reasonable to set the loadings to PS^T and thus have the scores be a normalized and orthogonal basis to base the plots on. Re-writing the PCA model as

X = T^(norm)SP^T = T^(norm)S^a S^(1−a)P^T   (10)

it is possible to obtain the two solutions by either setting a equal to one or to zero. In fact, the most common biplot takes a equal to 0.5 in order to produce a compromise plot where distances in both spaces can be approximately assessed. Hence, a = 0 represents distances for variables (loadings) preserved, a = 1 represents distances for samples (scores) preserved and a = 0.5 represents distances for both samples and variables (only) approximately preserved.

In addition to this scaling of the variance, there is often also a more trivial scaling of either the whole score matrix or the whole loading matrix to ensure that e.g. the score values are not so small compared to the loadings that they are not visible in a plot.

There are many interesting aspects of biplots and scatterplots but only a few important interpretational issues will be described here.

Two objects that are close and far from the origin have a similar response (with respect to the variation explained by the components). For example, the two samples CHI-VDA1 and CHI-SCH1 are far from the origin and close together. Hence they are expected to be correlated, but only with respect to the approximately 50% that these two components describe. The two samples are plotted against each other in Fig. 18 (left). Note that it is the preprocessed data that the PCA model reflects and hence that interpretations can be made about.

Likewise, two variables that are close and far from the origin are correlated (with respect to the variation explained by the components). An example is given in Fig. 18 (right). Note that the high correlation is apparently governed by an extreme sample – a potential outlier which will be discussed later.

The center of the plot represents the average sample – not zero – in case the data have been centered. Hence, samples with very negative scores have low values relative to the other samples and samples with high positive scores are the opposite. Again, with respect to the variation explained by the components.

The larger the projection a sample has on the vector defined by a given variable, the more that sample deviates from the average on that particular variable (see e.g. how sample SOU-HHI1 projects to the axis defined by the variable lactic acid in Fig. 17).

It is often overlooked that the above considerations for biplots apply equally well to loading plots or score plots. Just like above, when for example loadings are plotted without considering the magnitude of the scores, distances may be impossible to judge.
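A hedged numpy sketch of eqn (9) and (10): split the variance between scores and loadings with an exponent a (0, 1 or 0.5) to obtain biplot coordinates. T and P can come from any two-component PCA model; here a random matrix stands in for autoscaled data.

```python
import numpy as np

def biplot_coordinates(T, P, a=0.5):
    """Return sample coordinates T_norm S^a and variable coordinates P S^(1-a) (eqn (10))."""
    norms = np.linalg.norm(T, axis=0)        # the diagonal of S
    T_norm = T / norms                       # scores scaled to norm one, like the loadings
    sample_coords = T_norm * norms**a
    variable_coords = P * norms**(1 - a)
    return sample_coords, variable_coords

rng = np.random.default_rng(6)
X = rng.normal(size=(44, 14))
U, s, Vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
T, P = U[:, :2] * s[:2], Vt[:2].T
samples, variables = biplot_coordinates(T, P, a=0.5)   # the common compromise biplot
```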

Practical aspects

Assumptions

In its most basic form, PCA can be seen as a basis transformation. Instead of using the basis vectors u_j = (0, ..., 0, 1, 0, ..., 0)^T (with the one at place j), the basis given by p_1, ..., p_J is used. For this transformation, no assumptions are needed. Considering PCA in the form of eqn (5) and (7), where a model is assumed and least squares fitting is chosen to estimate the parameters T and P, it is not unreasonable to make some assumptions regarding the residuals as collected in E. The mildest assumption is that these residuals are independently and identically distributed (iid), without specifying more than that this distribution is symmetrical around zero. Hence, there are no systematic errors and the error is homoscedastic.

When the errors are heteroscedastic and there is a model for the error, then eqn (7) can be fitted under this error model by using maximum likelihood or weighted least squares approaches.15–17 Although this solves the problem of heteroscedasticity, certain implementations of maximum likelihood


fitting remove various aspects of the simplicity of PCA (orthogonal scores, nestedness of solutions, etc.).

Inference/validation

Since the PCA model parameters are used for interpretation and exploration, it is reasonable to ask how stable the results are. This calls for statistical inference tools. There are different routes to take in this respect. Upon assuming multivariate normality of the x-variables, statistical inference for the scores and loadings is available (see e.g. Anderson,18 pp. 468). Multivariate normality cannot always be assumed, but approximate normality of the scores – they are linear combinations – can sometimes be assumed by invoking the Central Limit Theorem. For a distribution-free approach, resampling methods can be used, e.g. bootstrapping. This is, however, not trivial and several alternatives exist.19,20

Preprocessing

Often a PCA performed on the raw data is not very meaningful. In regression analysis, often an intercept or offset is included since it is the deviation from such an offset which represents the interesting variation. In terms of the prototypical example, the absolute level of the pH is not that interesting but the variation in pH of the different Cabernets is relevant. For PCA to focus on this type of variation it is necessary to mean-center the data. This is simply performed by subtracting from every variable in X the corresponding mean level.

Sometimes it is also necessary to think about the scales of the data. In the wine example, there were measurements of concentrations and of pH. These are not on the same scales (not even in the same units) and to make the variables more comparable, the variables are scaled by dividing them by the corresponding standard deviations. The combined process of centering and scaling in this way is often called autoscaling. For a more detailed account of centering and scaling, see the literature.21,22

Centering and scaling are the two most common types of preprocessing and they normally always have to be decided upon. There are many other types of preprocessing methods available though. The appropriate preprocessing typically depends on the nature of the data investigated.23–27
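As a small sketch of the centering and autoscaling just described (assuming a data matrix X with samples in rows and variables in columns):

```python
import numpy as np

def autoscale(X):
    """Mean-center each column and divide it by its standard deviation."""
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=1)              # ddof=1: the usual sample standard deviation
    return (X - mean) / std, mean, std

X = np.random.default_rng(2).normal(loc=5.0, scale=3.0, size=(44, 14))  # stand-in data
Xa, mean, std = autoscale(X)
print(np.allclose(Xa.mean(axis=0), 0))                # each column now has mean zero
print(np.allclose(Xa.std(axis=0, ddof=1), 1))         # ...and unit standard deviation
```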

Choosing the number of components

A basic rationale in PCA is that the informative rank of the data is less than the number of original variables. Hence, it is possible to replace the original J variables with R (R ≤ J) components and gain a number of benefits. The influence of noise is minimized as the original variables are replaced with weighted averages,28 and the interpretation and visualization are greatly aided by having a simpler (fewer variables) view of all the variation. Furthermore, the compression of the variation into fewer components can yield statistical benefits in further modelling with the data. Hence, there are many good reasons to use PCA. In order to use PCA, though, it is necessary to be able to decide on how many components to use. The answer to that problem depends a little bit on the purpose of the analysis, which is why the following three sections will provide different answers to that question.

Exploratory studies. In exploratory studies, there is no quantitatively well-defined purpose with the analysis. Rather, the aim is often to just 'have a look at the data'. The short answer to how many components to use then is: "just use the first few components". A slightly more involved answer is that in exploratory studies, it is quite common not to fix the number of components very accurately. Often, the interest is in looking at the main variation and per definition, the first components provide information on that. As e.g. components one and three do not change regardless of whether component six or seven is included, it is often not too critical to establish the exact number of components. Components are looked at and interpreted from the first component and downwards. Each extra component is less and less interesting as the variation explained is smaller and smaller, so often a gradual decline of interest is attached to components. Note that this approach for assessing the importance of components is not to be taken too literally. There may well be reasons why smaller variations are important for a specific dataset.29

If outliers are to be diagnosed with appropriate statistics (see next section), then, however, it is more important to establish the number of components to use. For example, the residual will change depending on how many components are used, so in order to be able to assess residuals, a reasonable number of components must be used. There are several ad hoc approaches that can be used to determine the number of components. A selection of methods is offered below, but note that these methods seldom provide clear-cut and definitive answers. Instead, they are often used in a combined way to get an impression of the effective rank of the data.

Eigenvalues and their relation to PCA. Before the methods are described, it is necessary to explain the relation between PCA and eigenvalues. An eigenvector of a (square) matrix A is defined as a nonzero vector z with the following property:

Az = λz   (11)

where λ is called the eigenvalue. If matrix A is symmetric (semi-)positive definite, then the full eigenvalue decomposition of A becomes

A = ZΛZ^T   (12)

where Z is an orthogonal matrix and Λ is a diagonal matrix holding the eigenvalues. In chemometrics, it is customary to work with covariance or correlation matrices and these are symmetric (semi-)positive definite. Hence, eqn (12) describes their eigenvalue decomposition. Since all eigenvalues of such matrices are nonnegative, it is customary to order them from high to low and refer to the first eigenvalue as the largest one.

The singular value decomposition of X (I × J) is given by

X = USV^T   (13)

where U is an (I × J) orthogonal matrix (U^TU = I), S (J × J) is a diagonal matrix with the nonzero singular values on its

diagonal, and V is a (J × J) orthogonal matrix (V^TV = VV^T = I). This is for the case of I > J, but the other cases follow similarly. Considering X^TX and using eqn (12) and (13), it follows that

X^TX = VS^TU^TUSV^T = VS²V^T = ZΛZ^T   (14)

This shows the relationship between the singular values and the eigenvalues. The eigenvalue corresponding to a component is the same as the squared singular value, which again is the variation of the particular component.
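The relation in eqn (14) is easy to verify numerically; a short sketch with a random matrix standing in for preprocessed data:

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(44, 14))

s = np.linalg.svd(X, compute_uv=False)        # singular values of X
evals = np.linalg.eigvalsh(X.T @ X)[::-1]     # eigenvalues of X^T X, sorted high to low
print(np.allclose(s**2, evals))               # True: eigenvalues are the squared singular values
```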
Scree test. The scree test was developed by R. B. Cattell in 1966.30 It is based on the assumption that relevant information is larger than random noise and that the magnitude of the variation of random noise seems to level off quite linearly with the number of components. Traditionally, the eigenvalues of the cross-product of the preprocessed data are plotted as a function of the number of components, and when only noise is modelled, it is assumed that the eigenvalues are small and decline gradually. In practice, it may be difficult to see this in the plot of eigenvalues due to the huge first eigenvalues, and often the logarithm of the eigenvalues is plotted instead. Both are shown in Fig. 19 for a simulated dataset of rank four and with various amounts of noise added. It is seen that the eigenvalues level off after four components, but the details are difficult to see in the raw eigenvalues unless zoomed in. It is also seen that the distinction between 'real' and noise eigenvalues is difficult to discern at high noise levels.

For real data, the plots may be even more difficult to use, as also exemplified in the original publication of Cattell as well as in many others.31–33 Cattell himself admitted that: "Even a test as simple as this requires the acquisition of some art in administering it". This, in fact, is not particular to the scree test but goes for all methods for selecting the number of components.

For the wine data, it is not easy to firmly assess the number of components based on the scree test (Fig. 20). One may argue that seven or maybe nine components seem feasible, but this would imply incorporating components that explain very little variation. A more obvious choice would probably be to assess three components as suitable based on the scree plot and then be aware that further components may also contain useful information.

Eigenvalue below one. If the data are autoscaled, each variable has a variance of one. If all variables are orthogonal to each other, then every component in a PCA model would have an eigenvalue of one since the preprocessed cross-product matrix (the correlation matrix) is the identity. It is then fair to say that if a component has an eigenvalue larger than one, it explains the variation of more than one variable. This has led to the rule of selecting all components with eigenvalues exceeding one (see the red line in Fig. 20). It is sometimes also referred to as Kaiser's rule or the Kaiser–Guttman rule and many additional arguments have been provided for this method.34–36 While it remains a very ad hoc approach, it is nevertheless a useful rule-of-thumb to get an idea about the complexity of a dataset. For the wine data (Fig. 20), the rule suggests that around four or five components are reasonable. Note that for very precise data, it is perfectly possible that even components with eigenvalues far below one can be real and significant. Real phenomena can be small in variation, yet accurate.

Broken stick. A more realistic cut-off for the eigenvalues is obtained with the so-called broken stick rule.37 A line is added to the scree plot that shows the eigenvalues that would be expected for random data (the green line in Fig. 20). This line is calculated assuming that random data will follow a so-called broken stick distribution. The broken stick distribution hypothesizes how random variation will partition and uses the analogy of how the lengths of the pieces of a stick will be distributed when broken at random places into J pieces.38 It can be shown that for autoscaled data, this theoretical distribution can be calculated as

b_r = Σ_{j=r}^{J} 1/j   (15)

As seen in Fig. 20, the broken stick would seem to indicate that three to four components are reasonable.
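A sketch of the broken stick rule of eqn (15) for autoscaled data: the eigenvalues of the correlation matrix are compared with the broken stick values b_r (random stand-in data is used here purely for illustration).

```python
import numpy as np

def broken_stick(J):
    """Expected eigenvalues for autoscaled data under the broken stick rule (eqn (15))."""
    return np.array([np.sum(1.0 / np.arange(r, J + 1)) for r in range(1, J + 1)])

rng = np.random.default_rng(8)
X = rng.normal(size=(44, 14))
Xa = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)        # autoscale

eigenvalues = np.linalg.eigvalsh(np.corrcoef(Xa, rowvar=False))[::-1]
b = broken_stick(Xa.shape[1])
print(np.sum(eigenvalues > b), "components exceed the broken stick line")
```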
High fraction of variation explained. If the data measured have e.g. one percent noise, it is expected that PCA will describe all the variation down to around one percent. Hence, if a two-component model describes only 50% of the variation and is otherwise sound, it is probable that more components are needed. On the other hand, if the data are very noisy, coming e.g. from process monitoring or consumer preference mapping, and have an expected noise fraction of maybe 40%, then an otherwise sound model fitting 90% of the variation would imply overfitting and fewer components should be used. Having knowledge of the quality of the data can help in assessing the number of components. In Fig. 21, the variation explained is shown. The plot is equivalent to the eigenvalue plot except that it is cumulative and on a different scale. For the wine data, the uncertainty is different for each variable, and varies from approximately 5 and even up to 50% relative to the variation in the data. This is quite variable and makes it difficult to estimate how much variation should be explained, but most certainly less than 50% would mean that all is not explained, and explaining more than, say, 90–95% of the variation would be meaningless and just modelling of noise. Therefore, based on variation explained, it is likely that there are more than two but less than, say, seven components.

Valid interpretation. As indicated by the results, the different rules above seldom agree. This is not as big a problem as it might seem. Quite often, the only thing needed is to know the neighbourhood of how many components are needed. Using the above methods 'informally' and critically will often provide that answer. Furthermore, one of the most important strategies for selecting the number of components is to supplement such methods with interpretations of the model. For the current data, it may be questioned whether e.g. three or four components should be used.

In Fig. 22, it is shown that there is distinct structure in the scores of component four. For example, the wines from Argentina all have positive scores. Such a structure or grouping will not happen accidentally unless unfortunate confounding has occurred. Hence, as long as Argentinian wines were not measured separately on a different system or something


similar, the mere fact that component four (either scores or loadings) shows distinct behaviour is an argument in favour of including that component. This holds regardless of what other measures might indicate.

The loadings may also provide similar validation by highlighting correlations expected from a priori knowledge. In the case of continuous data such as time series or spectral data, it is also instructive to look at the shape of the residuals. An example is provided in Fig. 23. A dataset consisting of visual and near-infrared spectra of 40 beer samples is shown in grey. After one component, the residuals are still fairly big and quite structured from a spectral point of view. After six components, there is very little information left, indicating that most of the systematic variation has been modelled. Note from the title of the plot that 95% of the variation explained is quite low for this dataset whereas that would be critically high for the wine data as discussed above.

Fig. 24  A plot of RMSECV for PCA models with different numbers of components.

Cross-validation. In certain cases, it is necessary to establish the appropriate number of components more firmly than in the exploratory or casual use of PCA. For example, a PCA model may be needed to verify if the data of a new patient indicate that this patient is similar to diseased persons. This may be accomplished by checking if the sample is an outlier when projected into a PCA model (see next section on outliers). Because the outlier diagnostics depend on the number of components chosen, it is necessary to establish the number of components before the model can be used for its purpose. There are several ways to do this including the above-mentioned methods. Oftentimes, though, they are considered too ad hoc and other approaches are used. One of the more popular approaches is cross-validation. S. Wold was the first to introduce cross-validation of PCA models39 and several slightly different approaches have been developed subsequently. Only a brief description of cross-validation will be given here, but details can be found in the literature.40,41

The idea in cross-validation is to leave out part of the data and then estimate the left-out part. If this is done wisely, the prediction of the left-out part is independent of the actual left-out part. Hence, overfitting leading to too optimistic models is not possible. Conceptually, a single element (typically more than one element) of the data matrix is left out. A PCA model handling missing data42–46 can then be fitted to the dataset and, based on this PCA model, an estimate of the left-out element can be obtained. Hence, a set of residuals is obtained where there are no problems with overfitting. Taking the sum of squares of these yields the so-called Predicted REsidual Sums of Squares (PRESS)

PRESS_r = Σ_{i=1}^{I} Σ_{j=1}^{J} x_ij(r)²   (16)

where x_ij(r) is the residual of sample i and variable j after r components. From the PRESS the Root Mean Squared Error of Cross-Validation (RMSECV) is obtained as

RMSECV_r = √(PRESS_r/(IJ))   (17)
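The element-wise cross-validation behind eqn (16) and (17) can be sketched as follows. This is a simplified illustration, not the authors' exact algorithm: random segments of matrix elements are left out and predicted with a PCA model fitted by a naive iterative imputation, and X is assumed to be preprocessed already.

```python
import numpy as np

def pca_impute(X, n_comp, mask, n_iter=50):
    """Reconstruct X with an n_comp PCA model, treating entries where mask is True as missing."""
    Xw = X.copy()
    Xw[mask] = 0.0                       # crude start (zero = column mean for preprocessed data)
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(Xw, full_matrices=False)
        Xhat = U[:, :n_comp] @ np.diag(s[:n_comp]) @ Vt[:n_comp]
        Xw[mask] = Xhat[mask]            # update only the left-out entries
    return Xhat

def rmsecv(X, max_comp=7, n_segments=7, seed=0):
    """Leave out random segments of elements, predict them, and form PRESS and RMSECV."""
    I, J = X.shape
    rng = np.random.default_rng(seed)
    segment = rng.integers(0, n_segments, size=(I, J))   # assign each element to a segment
    curve = []
    for r in range(1, max_comp + 1):
        press = 0.0
        for seg in range(n_segments):
            mask = segment == seg
            Xhat = pca_impute(X, r, mask)
            press += np.sum((X[mask] - Xhat[mask]) ** 2)  # eqn (16)
        curve.append(np.sqrt(press / (I * J)))            # eqn (17)
    return curve
```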

In Fig. 24, the results of cross-validation are shown. As shown in Fig. 21, the fit to the data will trivially improve with the number of components, but the RMSECV gets worse after four components, indicating that no more than four components should be used. In fact, the improvement going from three to four components is so small that three is likely a more feasible choice from that perspective.

The cross-validated error, RMSECV, can be compared to the fitted error, the Root Mean Squared Error of Calibration, RMSEC. In order for the two to be comparable, though, the fitted residuals must be corrected for the degrees of freedom consumed by the model. Calculating these degrees of freedom is not a trivial subject, as mentioned earlier.3,4,47

When using PCA for other purposes. It is quite common to use PCA as a preprocessing step in order to get a nicely compact representation of a dataset. Instead of the original many (J) variables, the dataset can be expressed in terms of the few (R) principal components. These components can then in turn be used for many different purposes (Fig. 25).

Fig. 25  Using the scores of PCA for further modelling.

It is common practice to use, for example, cross-validation for determining the number of components and then use that number of components in further modelling. For example, the scores may be used for building a classification model using linear discriminant analysis. While this approach to selecting components is both feasible and reasonable, there is a risk that components that could help improve classification would be left out. For example, cross-validation may indicate that five components are valid, but it turns out that component seven


can reliably improve classification. In order to be certain that useful information is retained in the PCA model, it is generally advised to validate the number of components in terms of the actual goal. Instead of validating the number of components that best describe X in some sense (PCA cross-validation), it will often make more sense to use the number of components that provides the best classification results if PCA is used in conjunction with discriminant analysis.
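As a sketch of using the scores for further modelling (Fig. 25), scikit-learn's linear discriminant analysis can be fitted on the scores of an R-component model. The random data, the class labels y and the choice R = 4 are placeholders for illustration only.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(10)
X = rng.normal(size=(44, 14))              # stand-in for the autoscaled wine data
y = rng.integers(0, 4, size=44)            # stand-in for the four country labels

U, s, Vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
T = U[:, :4] * s[:4]                       # scores used instead of the 14 original variables

lda = LinearDiscriminantAnalysis().fit(T, y)
print(lda.score(T, y))                     # training accuracy; cross-validate in practice
```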
Detecting outliers

Outliers are samples that are somehow disturbing or unusual. Often, outliers are downright wrong samples. For example, in determining the height of persons, five samples are obtained ([1.78, 1.92, 1.83, 167, 1.87]). The values are in meters but, accidentally, the fourth sample has been measured in centimeters. If the sample is not either corrected or removed, the subsequent analysis is going to be detrimentally disturbed by this outlier. Outlier detection is about identifying and handling such samples. An alternative or supplement to outlier handling is the use of robust methods, which will, however, not be treated in detail here. The reader is referred to the literature for more details on robust methods.48–59

This section is mainly going to focus on identifying outliers, but understanding the outliers is really the critical aspect. Often outliers are mistakenly taken to mean 'wrong samples' and nothing could be more wrong! Outliers can be absolutely right, but e.g. just badly represented. In such a case, the solution is not to remove the outlier, but to supplement the data with more of the same type. The bottom line is that it is imperative to understand why a sample is an outlier. This section will give the tools to identify the samples and see in what way they differ. It is then up to the data analyst to decide how the outliers should be handled.

Data inspection. An often forgotten, but important, first step in data analysis is to inspect the raw data. Depending on the type of data, many kinds of plots can be relevant as already mentioned. For spectral data, line plots may be nice. For discrete data, histograms, normal probability plots, or scatter plots could be feasible. In short, any kind of visualization that will help elucidate aspects of the data can be useful. Several such plots have already been shown throughout this paper. It is also important, and frequently forgotten, to look at the preprocessed data. While the raw data are important, they actually never enter the modeling. It is the preprocessed data that will be modeled and there can be big differences in the interpretations of the raw and the preprocessed data.

Score plots. While raw and preprocessed data should always be investigated, some types of outliers will be difficult to identify from there. The PCA model itself can provide further information. There are two places where outlying behavior will show up most evidently: in the scores and in the residuals. It is appropriate to go through all selected scores and look for samples that have strange behaviour. Often, it is only components one and two that are investigated but it is necessary to look at all the relevant components.

As for the data, it is a good idea to plot the scores in many ways, using different combinations of scatter plots, line plots, histograms, etc. Also, it is often useful to go through the same plot but coloured by all the various types of additional information available. This could be any kind of information such as temperature, storage time of sample, operator or any other kind of either qualitative or quantitative information available. For the wine data model, it is seen (Fig. 26) that one sample is behaving differently from the others in score plot one versus two (upper left corner).

Fig. 26  Score plot of a four component PCA model of the wine data.

Looking at the loading plot (Fig. 16) indicates that the sample must be (relatively) high in volatile and lactic acid and low in malic acid. This should then be verified in the raw data. After removing this sample, the model is rebuilt and reevaluated. No more extreme samples are observed in the scores.

Before deciding on what to do with an outlier, it is necessary to look at how important the component is. Imagine a sample that is doing an 'excellent job' in the first seven components, but in the eighth has an outlying behaviour. If that eighth component is very small in terms of variation explained and not the most important for the overall use of the model, then it is probably not urgent to remove such a sample.

Whenever in doubt as to whether to remove an outlier or not, it is often instructive to compare the models before and after removal. If the interpretation or intended use changes dramatically, it indicates that the sample has an extreme

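A small sketch of such exploratory score plotting is given below (matplotlib assumed; scores and meta are placeholder names for the score matrix and whatever per-sample information is used for colouring).

```python
# Sketch: scatter plot of two score vectors, coloured by additional per-sample
# information (quantitative, or categorical encoded as numbers beforehand).
import matplotlib.pyplot as plt

def score_plot(scores, meta, pc_x=0, pc_y=1):
    fig, ax = plt.subplots()
    sc = ax.scatter(scores[:, pc_x], scores[:, pc_y], c=meta, cmap="viridis")
    ax.axhline(0, color="grey", lw=0.5)          # origin lines ease interpretation
    ax.axvline(0, color="grey", lw=0.5)
    ax.set_xlabel(f"PC{pc_x + 1}")
    ax.set_ylabel(f"PC{pc_y + 1}")
    fig.colorbar(sc, ax=ax, label="colouring variable")
    return ax
```

Looping such a plot over all relevant pairs of components, and over the available types of additional information, is a quick way to spot samples with strange behaviour.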

Hotelling's T2. Looking at scores is helpful, but it is only possible to look at few components at a time. If the model has many components, it can be laborious and the risk of accidentally missing something increases. In addition, in some cases, outlier detection has to be automated in order to function e.g. in an on-line process monitoring system. There are ways to do so, and a common way is to use the so-called Hotelling's T2 which was introduced in 1931.60 This diagnostic can be seen as an extension of the t-test and can also be applied to the scores of a PCA model.61 It is calculated as

T_i^2 = t_i^T (T^T T / (I - 1))^{-1} t_i   (18)

where T is the matrix of scores (I × R) from all the calibration samples and t_i is an R × 1 vector holding the R scores of the ith sample. Assuming that the scores are normally distributed, then confidence limits for T_i^2 can be assigned as

T_i^2(I,R) = \frac{R(I - 1)}{I - R} F_{R,I-R,\alpha}   (19)

In Fig. 27, an example of the 95% confidence limits is shown. This plot illustrates the somewhat deceiving effect such limits can have. Two samples are outside the confidence limit, leading the inexperienced user to suggest leaving out both. However, first of all, samples should not be left out without understanding why they are 'wrong' and, more importantly, there is nothing in what we know about the data thus far that suggests the scores would follow a multivariate normal distribution. Hence, the limit is rather arbitrary and for this particular dataset, the plot in Fig. 26 is definitely to be preferred when assessing if samples behave reasonably. In some cases, when enough samples are available and those samples really do come from the same population, the scores are approximately normally distributed. This goes back to the central limit theorem.62 Examples are, e.g. in the multivariate process control area.63 In those cases Hotelling's T2 is a particularly useful statistic.

Fig. 27 PCA score plot similar to Fig. 26 (left) but now with a 95% confidence limit shown.

The limits provided by Hotelling's T2 can be quite misleading for grouped data. As an example, Fig. 28 shows the score plot of a dataset, where the samples fall in four distinct groups (based on the geological background). The sample in the middle called "outlier?" is by no means extreme with respect to Hotelling's T2 even though the sample is relatively far from all other samples.
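A compact sketch of eqns (18) and (19) is given below (numpy and scipy assumed; variable names are placeholders, not code from this work).

```python
# Sketch: Hotelling's T2 for PCA scores, eqn (18), and the F-based limit, eqn (19).
import numpy as np
from scipy.stats import f

def hotelling_t2(T):
    """T2 for each calibration sample; T is the I x R score matrix."""
    I, R = T.shape
    cov_inv = np.linalg.inv(T.T @ T / (I - 1))      # inverse of the score covariance
    return np.einsum("ir,rs,is->i", T, cov_inv, T)  # t_i' * cov^-1 * t_i per sample

def t2_limit(I, R, alpha=0.05):
    """Confidence limit for T2 under the normality assumption, eqn (19)."""
    return R * (I - 1) / (I - R) * f.ppf(1 - alpha, R, I - R)
```

As stressed above, the limit rests on the assumption of approximately normally distributed scores and should not be applied uncritically.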

Fig. 28 PCA scores plot (1 vs. 2) for a dataset consisting of ten concentrations of trace elements in obsidian samples from four specific quarries – data from a study by Kowalski et al.64
Fig. 29 Contribution plot for sample 34 in the wine data.

Score contribution plots. When a sample has been detected as being an outlier, it is often interesting to try to investigate the reason. Extreme scores indicate that the sample has high levels of whatever the specific component reflects in its corresponding loading vector. Sometimes, it is difficult to verify directly what is going on and the so-called contribution plot can help. There are several different implementations of contribution plots65 but one common version was originally developed by Nomikos.66 The contribution for a given sample indicates what variables caused that sample to get an extreme set of scores. For a given set of components (e.g. component one and two in Fig. 29), this contribution can be calculated as

c_j^D = \sum_{r=1}^{R} \frac{t_r^{new} x_j^{new} p_{jr}}{t_r^T t_r / (I - 1)}   (20)

The vector t_r is the rth score vector from the calibration model, I the number of samples in the calibration set and t_r^{new} is the score of the sample in question. It can come from the calibration set or be a new sample. x_j^{new} is the data of the sample in question for variable j and p_{jr} is the corresponding loading element. In this case, R components are considered, but fewer components can also be considered by adjusting the summation in eqn (20).

The contribution plot indicates what variables make the selected sample have an extreme Hotelling's T2 and in Fig. 29, the most influential variables are also the ones that are visible in the raw data (not shown). Eqn (20) explains the simplest case of contribution plots with orthogonal P matrices. Generalized contributions are available for non-orthogonal cases.65 Note that if x_j^{new} is a part of the calibration set, it influences the model. A more objective measure of whether x_j^{new} fits the model can be obtained by removing it from the data and then afterwards projecting it onto the model, thereby obtaining more objective scores and residuals.

Lonely wolfs. Imagine a situation where the samples are constituted by distinct groups rather than one distribution as also exemplified in Fig. 28. Hotelling's T2 is not the most obvious choice for detecting samples that are unusually positioned but not far from the center. A way to detect such samples is to measure the distance of the sample to the nearest neighbor. This can also be generalized e.g. to the average distance to the k nearest neighbors and various distance measures can be used if so desired.

In Fig. 30, it is seen that colouring the scores by the distance to the nearest neighbour highlights that there are, in fact, several samples that are not very close to other samples. When the samples are no longer coloured by class as shown in Fig. 28, it is much less obvious that the green 'K' class is indeed a well-defined class.

Fig. 30 Score plot of Fig. 28. Samples are coloured according to the distance of the sample to the nearest neighbour.
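A minimal sketch of the nearest-neighbour distance underlying Fig. 30 is shown below (plain numpy, Euclidean distances; the function name is an assumption).

```python
# Sketch: (average) distance from each sample to its k nearest neighbours in
# score space, used to flag 'lonely wolf' samples.
import numpy as np

def nn_distance(scores, k=1):
    diff = scores[:, None, :] - scores[None, :, :]    # pairwise differences (I x I x R)
    dist = np.sqrt((diff ** 2).sum(axis=2))           # I x I Euclidean distance matrix
    np.fill_diagonal(dist, np.inf)                    # ignore the distance to self
    return np.sort(dist, axis=1)[:, :k].mean(axis=1)  # mean of the k smallest distances
```

Colouring a score plot by this distance reproduces the idea behind Fig. 30.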

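Returning to the score contributions, a sketch of eqn (20) for a single sample is given below (assuming an orthogonal loading matrix, as in the simplest case described above; variable names are placeholders).

```python
# Sketch of the score contribution of eqn (20): T (I x R) calibration scores,
# P (J x R) loadings, t_new (R,) scores and x_new (J,) preprocessed data of the
# sample under investigation.
import numpy as np

def score_contribution(T, P, t_new, x_new):
    I = T.shape[0]
    scale = np.sum(T ** 2, axis=0) / (I - 1)                    # t_r' t_r / (I - 1), per component
    return (P * x_new[:, None] * (t_new / scale)).sum(axis=1)   # sum over r, one value per variable j
```

Plotting the returned vector against the variable names gives a contribution plot such as Fig. 29.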
Fig. 31 Influence plot of wine data with a four component PCA model.

Residuals. The use of residuals has already been described in detail. For outlier detection, it is common to use the sum of squared residuals, often called the Q-statistics, of each sample to look for samples that are not well-described by the PCA model. When Q is plotted against T2, it is often referred to as an influence plot. Note that both residuals and T2 will change with the number of components, so if the number of components is not firmly defined, it may be necessary to go back and forth a bit between different numbers of components.

In the influence plot in Fig. 31, it is clear that one sample stands out with a high Hotelling's T2 in the PCA model and no samples have extraordinarily large residuals. It will hence be reasonable to check the T2 contribution plot of that sample, to see if an explanation for the extreme behavior can be obtained. The two blue lines are 95% confidence levels. Such lines are often given in software but should not normally be the focus of attention, as also described above for score plots.

Residual contribution plots. Just as contribution plots for scores can be defined, contribution plots for residual variation can be determined as well. These are simpler to define, as the contributing factor to a high residual is simply the squared residual vector itself. Hence, if a sample shows an extraordinary residual variation, the residual contribution plot (the residuals of the sample) can indicate why the sample has high residual variation. The squared residuals do not reveal the sign of the deviation and sometimes, the raw residuals are preferred to the squared ones to allow the sign to be visible.67
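A sketch of the per-sample Q-statistic and the corresponding influence plot is given below (numpy and matplotlib assumed; confidence limits are deliberately left out, in line with the remarks above). The residual contribution of a flagged sample is simply its row of the residual matrix E, optionally squared.

```python
# Sketch: Q (sum of squared residuals) per sample and an influence plot of
# Hotelling's T2 versus Q. X is the preprocessed data, T and P the scores and
# loadings of the chosen PCA model.
import numpy as np
import matplotlib.pyplot as plt

def q_statistic(X, T, P):
    E = X - T @ P.T                                   # residual matrix
    return np.sum(E ** 2, axis=1)                     # Q per sample

def influence_plot(X, T, P):
    I = T.shape[0]
    t2 = np.einsum("ir,rs,is->i", T,
                   np.linalg.inv(T.T @ T / (I - 1)), T)   # Hotelling's T2, eqn (18)
    fig, ax = plt.subplots()
    ax.scatter(t2, q_statistic(X, T, P))
    ax.set_xlabel("Hotelling's T2")
    ax.set_ylabel("Q (sum of squared residuals)")
    return ax
```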
Conclusion

Principal component analysis is a powerful and versatile method capable of providing an overview of complex multivariate data. PCA can be used e.g. for revealing relations between variables and relations between samples (e.g. clustering), detecting outliers, finding and quantifying patterns, generating new hypotheses as well as many other things. This tutorial has provided a description of the basic concepts of how to use PCA critically.

References

1 T. Skov, D. Ballabio and R. Bro, Multiblock variance partitioning: a new approach for comparing variation in multiple data blocks, Anal. Chim. Acta, 2008, 615, 18–29.
2 D. Ballabio, T. Skov, R. Leardi and R. Bro, Classification of GC-MS measurements of wines by combining data dimension reduction and variable selection techniques, J. Chemom., 2008, 22, 457–463.
3 K. Faber, Degrees of freedom for the residuals of a principal component analysis — A clarification, Chemometrics and Chemoinformatics, 2008, vol. 93, pp. 80–86.
4 H. Martens, S. Hassani, E. M. Qannari and A. Kohler, Degrees of freedom estimation in Principal Component Analysis and Consensus Principal Component Analysis, Chemom. Intell. Lab. Syst., 2012, 118, 246–259.
5 J. M. F. ten Berge, Least squares optimization in multivariate analysis, DSWO Press, Leiden, 1993.
6 A. K. Smilde, H. C. J. Hoefsloot, H. A. L. Kiers, S. Bijlsma and H. F. M. Boelens, Sufficient conditions for unique solutions within a certain class of curve resolution models, J. Chemom., 2001, 15(4), 405–411.
7 K. Pearson, On lines and planes of closest fit to points in space, Philos. Mag., 1901, 2, 559–572.
8 H. Hotelling, Analysis of a complex of statistical variables into principal components, J. Educ. Psychol., 1933, 24, 417–441.
9 J. M. F. ten Berge and H. A. L. Kiers, Are all varieties of PCA the same? A reply to Cadima & Jolliffe, Br. J. Math. Stat. Psychol., 1997, 50(2), 367–368.
10 S. Wold, C. Albano, W. J. Dunn, III, U. Edlund, K. H. Esbensen, P. Geladi, S. Hellberg, E. Johansson, W. Lindberg and M. Sjöström, Multivariate data analysis in chemistry, in Chemometrics. Mathematics and Statistics in Chemistry, ed. B. R. Kowalski, D. Reidel Publishing Company, Dordrecht, 1984, pp. 17–95.
11 I. E. Frank and S. Lanteri, Classification models: discriminant analysis, SIMCA, CART, Chemom. Intell. Lab. Syst., 1989, 5, 247–256.
12 J. C. Gower, A general theory of Biplots, in Recent Advances in Descriptive Multivariate Analysis, ed. W. J. Krzanowski, Clarendon Press, Oxford, 1995, pp. 283–303.
13 A. Carlier and P. M. Kroonenberg, Decompositions and biplots in three-way correspondence analysis, Psychometrika, 1996, 61, 355–373.
14 K. R. Gabriel, The biplot graphic display of matrices with application to principal component analysis, Biometrika, 1971, 58, 453–467.
15 R. Bro, N. D. Sidiropoulos and A. K. Smilde, Maximum likelihood fitting using simple least squares algorithms, J. Chemom., 2002, 16(8–10), 387–400.
16 D. T. Andrews and P. D. Wentzell, Applications of maximum likelihood principal component analysis: incomplete data sets and calibration transfer, Anal. Chim. Acta, 1997, 350(3), 341–352.
17 P. D. Wentzell, D. T. Andrews, D. C. Hamilton, N. M. Faber and B. R. Kowalski, Maximum likelihood principal component analysis, J. Chemom., 1997, 11(4), 339–366.
18 T. W. Anderson, An Introduction to Multivariate Statistical Analysis, Wiley, 2nd edn, 1984.
19 M. E. Timmerman, H. A. L. Kiers and A. K. Smilde, Estimating confidence intervals for principal component loadings: a comparison between the bootstrap and asymptotic results, Br. J. Math. Stat. Psychol., 2007, 60(2), 295–314.
20 H. Babamoradi, F. van den Berg and Å. Rinnan, Bootstrap based Confidence Limits in Principal Component Analysis – a case study, Chemom. Intell. Lab. Syst., 2013, 120, 97–105.
21 R. A. van den Berg, H. C. J. Hoefsloot, J. A. Westerhuis, A. K. Smilde and M. van der Werf, Centering, scaling, and transformations: improving the biological information content of metabolomics data, BMC Genomics, 2006, 7(142).
22 R. Bro and A. K. Smilde, Centering and scaling in component analysis, J. Chemom., 2003, 17(1), 16–33.
23 N. K. Afseth, V. H. Segtnan and J. P. Wold, Raman spectra of biological samples: a study of preprocessing methods, Appl. Spectrosc., 2006, 60(12), 1358–1367.
24 C. D. Brown, L. Vega-Montoto and P. D. Wentzell, Derivative preprocessing and optimal corrections for baseline drift in multivariate calibration, Appl. Spectrosc., 2000, 54(7), 1055–1068.
25 S. N. Deming, J. A. Palasota and J. M. Nocerino, The geometry of multivariate object preprocessing, J. Chemom., 1993, 7, 393–425.
26 H. Martens and E. Stark, Extended multiplicative signal correction and spectral interference subtraction: new preprocessing methods for near infrared spectroscopy, J. Pharm. Biomed. Anal., 1991, 9(8), 625–635.
27 M. Pardo, G. Niederjaufner, G. Benussi, E. Comini, G. Faglia, G. Sberveglieri, M. Holmberg and I. Lundstrom, Data preprocessing enhances the classification of different brands of Espresso coffee with an electronic nose, Sens. Actuators, B, 2000, 69(3), 397–403.
28 R. Bro, Multivariate calibration - What is in chemometrics for the analytical chemist?, Anal. Chim. Acta, 2003, 500(1–2), 185–194.
29 O. E. de Noord, The influence of data preprocessing on the robustness and parsimony of multivariate calibration models, Chemom. Intell. Lab. Syst., 1994, 23, 65–70.
30 R. B. Cattell, The scree test for the number of factors, Multivariate Behav. Res., 1966, 1, 245–276.
31 P. M. Bentler and K. H. Yuan, Test of linear trend in eigenvalues of a covariance matrix with application to data analysis, Br. J. Math. Stat. Psychol., 1996, 49(2), 299–312.
32 P. M. Bentler and K. H. Yuan, Tests for linear trend in the smallest eigenvalues of the correlation matrix, Psychometrika, 1998, 63, 131–144.
33 R. C. Henry, E. S. Park and C. Spiegelman, Comparing a new algorithm with the classic methods for estimating the number of factors, Chemom. Intell. Lab. Syst., 1999, 48, 91–97.
34 H. F. Kaiser, The Application of Electronic Computers to Factor Analysis, Educ. Psychol. Meas., 1960, 20, 141–151.
35 N. Cliff, The Eigenvalues-Greater-Than-One Rule and the Reliability of Components, Psychol. Bull., 1988, 103(2), 276–279.
36 L. Guttman, Some necessary conditions for common-factor analysis, Psychometrika, 1954, 19(2), 149–161.
37 S. Frontier, Étude de la decroissance des valeurs propers dans une analyze en composantes principales: comparison avec le modèle de baton brisé, J. Exp. Mar. Biol. Ecol., 1976, 25, 67–75.
38 R. H. MacArthur, On the relative abundance of bird species, Proc. Natl. Acad. Sci. U. S. A., 1957, 43(3), 293–295.
39 S. Wold, Cross-validatory estimation of the number of components in factor and principal components models, Technometrics, 1978, 20, 397–405.
40 W. J. Krzanowski and P. Kline, Cross-validation for choosing the number of important components in principal component analysis, Multivariate Behav. Res., 1995, 30(2), 149–165.
41 R. Bro, K. Kjeldahl, A. K. Smilde and H. A. L. Kiers, Cross-validation of component models: a critical look at current methods, Anal. Bioanal. Chem., 2008, 390, 1241–1251.
42 T. C. Gleason and R. Staelin, A proposal for handling missing data, Psychometrika, 1975, 40, 229–252.
43 P. R. C. Nelson, P. A. Taylor and J. F. MacGregor, Missing data methods in PCA and PLS: score calculations with incomplete observations, Chemom. Intell. Lab. Syst., 1996, 35(1), 45–65.
44 B. Grung and R. Manne, Missing values in principal component analysis, Chemom. Intell. Lab. Syst., 1998, 42, 125–139.
45 B. Walczak and D. L. Massart, Dealing with missing data Part I, Chemom. Intell. Lab. Syst., 2001, 58(1), 15–27.
46 E. Adams, B. Walczak, C. Vervaet, P. G. Risha and D. L. Massart, Principal component analysis of dissolution data with missing elements, Int. J. Pharm., 2002, 234(1–2), 169–178.
47 J. Mandel, The partitioning of interaction in analysis of variance, J. Res. Natl. Bur. Stand., Sect. B, 1969, 73B, 309–328.
48 S. J. Devlin, R. Gnanadesikan and J. R. Kettenring, Robust estimation of dispersion matrices and principal components, J. Am. Stat. Assoc., 1981, 76(375), 354–362.
49 O. S. Borgen and Å. Ukkelborg, Outlier detection by robust alternating regression, Anal. Chim. Acta, 1993, 277, 489–494.
50 Y. Xie, J. Wang, Y. Liang, L. Sun, X. Song and R. Yu, Robust principal component analysis by projection pursuit, J. Chemom., 1993, 7, 527–541.
51 H. Hove, Y. Liang and O. M. Kvalheim, Trimmed object projections: a nonparametric robust latent-structure decomposition method, Chemom. Intell. Lab. Syst., 1995, 27, 33–40.
52 J. G. Chen, J. A. Bandoni and J. A. Romagnoli, Robust PCA and normal region in multivariate statistical process monitoring, AIChE J., 1996, 42(12), 3563–3566.
53 W. C. Chen, H. Cui and Y. Z. Liang, A new principal component analysis method based on robust diagnosis, Anal. Lett., 1996, 29, 1647–1667.
54 A. Singh, Outliers and robust procedures in some chemometric applications, Chemom. Intell. Lab. Syst., 1996, 33, 75–100.
55 E. V. Thomas and N. X. Ge, Development of robust multivariate calibration models, Technometrics, 2000, 42(2), 168–177.
56 K. A. Hoo, K. J. Tvarlapati, M. J. Piovoso and R. Hajare, A method of robust multivariate outlier replacement, Comput. Chem. Eng., 2002, 26(1), 17–39.
57 M. Hubert, P. J. Rousseeuw and K. Vanden Branden, ROBPCA: a new approach to robust principal component analysis, Technometrics, 2005, 47(1), 64–79.
58 S. F. Møller, J. V. F. Frese and R. Bro, Robust methods for multivariate data analysis, J. Chemom., 2005, 19, 549–563.
59 P. J. Rousseeuw, M. Debruyne, S. Engelen and M. Hubert, Robustness and outlier detection in chemometrics, Crit. Rev. Anal. Chem., 2006, 36(3–4), 221–242.
60 H. Hotelling, The generalization of Student's ratio, Ann. Math. Stat., 1931, 2, 360–378.
61 J. E. Jackson, Principal components and factor analysis: part I - principal components, J. Qual. Tech., 1980, 12, 201–213.
62 A. M. Mood, F. R. Graybill and D. C. Boes, Introduction to the Theory of Statistics, McGraw-Hill, 3rd edn, 1974.
63 P. Nomikos and J. F. MacGregor, Multivariate SPC charts for monitoring batch processes, Technometrics, 1995, 37, 41–59.
64 B. R. Kowalski, T. F. Schatzki and F. H. Stross, Classification of archaeological artifacts by applying pattern recognition to trace element data, Anal. Chem., 1972, 44(13), 2176–2180.
65 J. A. Westerhuis, S. P. Gurden and A. K. Smilde, Generalized contribution plots in multivariate statistical process monitoring, Chemom. Intell. Lab. Syst., 2000, 51(1), 95–114.
66 P. Nomikos, Detection and diagnosis of abnormal batch operations based on multiway principal component analysis, ISA Trans., 1996, 35, 259–266.
67 B. M. Wise, N. L. Ricker and D. Veltkamp, Upset and Sensor Failure Detection in Multivariate Processes, AIChE 1989 Annual Meeting, Nov. 1989.