Detection of Prediction Outliers and Inliers in Multivariate Calibration
Received 14 May 1998; received in revised form 14 August 1998; accepted 25 August 1998
Abstract
After a multivariate calibration model is built and validated, it is ready to be used for the prediction of the characteristic to be determined in the new samples. Before prediction, one should make sure that the new samples are similar to the calibration samples. If not, such samples are called prediction outliers, which can be of two types, namely samples that are outside of the calibration space, and samples which are situated in a low- (or empty-) density region of the calibration space. The latter type, also called inliers, are not detected by existing methods. In this paper, potential functions are proposed for the detection of inliers and outliers in prediction. © 1999 Elsevier Science B.V. All rights reserved.
with p degrees of freedom (p being the number of variables taken into account) [12]. The Mahalanobis distance between a calibration object i and the centroid of the data set, in the original variable space, is defined as

MD_i^2 = (x_i − x̄) S^{−1} (x_i − x̄)′,   (2)

where S is the variance–covariance matrix defined as

S = (1/(n − 1)) X_c′ X_c,   (3)

where n corresponds to the number of objects in X, and X_c is the column-centred X matrix. Another measure for outlier detection, which is related to the Mahalanobis distance, is the leverage. The leverages are the elements situated on the diagonal of the prediction matrix (or hat matrix), defined as

H = X (X′X)^{−1} X′.   (4)

The leverage h_i of object i is therefore defined as

h_i = x_i (X′X)^{−1} x_i′.   (5)

There is a linear relationship between the leverage of an object and its Mahalanobis distance to the centroid:

h_i = MD_i^2/(n − 1) + 1/n.   (6)

During calibration, in order to detect possible outliers, the value of the leverage computed for a calibration object is compared to two or three times p/n. This diagnostic is also possible in prediction: the leverages of the prediction objects can be computed and compared to two or three times the average value of the leverages in the calibration set [1].
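For illustration, Eqs. (2)–(5) and the leverage check can be written as the following Python/NumPy sketch (the calibration matrix X, the new object x_new and the factor of 3 are placeholders chosen here for the example, not values taken from the original study):

import numpy as np

def md2_and_leverage(X, x_new):
    """Squared Mahalanobis distance (Eq. (2)) and leverage (Eq. (5)) of a new object."""
    n = X.shape[0]
    x_mean = X.mean(axis=0)
    Xc = X - x_mean                                   # column-centred calibration matrix
    S = Xc.T @ Xc / (n - 1)                           # variance-covariance matrix, Eq. (3)
    d = x_new - x_mean
    md2 = d @ np.linalg.solve(S, d)                   # Eq. (2)
    h = x_new @ np.linalg.solve(X.T @ X, x_new)       # Eq. (5), with the uncentred X of Eq. (4)
    return md2, h

# Leverage-based screening of a prediction object: compare h to 2-3 times the
# average calibration leverage (diagonal of the hat matrix, Eq. (4)).
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))                          # placeholder calibration scores
x_new = rng.normal(size=5)
h_cal = np.diag(X @ np.linalg.solve(X.T @ X, X.T))    # calibration leverages
md2, h = md2_and_leverage(X, x_new)
print(md2, h, h > 3 * h_cal.mean())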
2.2. The robust Mahalanobis distance

The classical Mahalanobis distance does not always allow a proper detection of calibration outliers, because these may influence the estimation of the mean and of the variance–covariance matrix. This is the reason why it is often more appropriate to use a robust estimation of the Mahalanobis distance [12], which is not influenced by outliers. In our case, we are looking for prediction outliers, which implies that the estimation of the mean and of the variance–covariance matrix of the calibration data set should not be influenced by outliers (when the model is already established, the calibration data are free of outliers). Yet, it is interesting to verify this on our data, by comparing results from both the classical and the robust Mahalanobis distance.

Robust estimators of x̄ and S can be obtained by ellipsoidal multivariate trimming [12,13], according to the following procedure (a code sketch follows the list):

1. The initial mean x̄ and variance–covariance matrix S are the classical ones (see Eq. (3)), and the squared Mahalanobis distance of each object is computed as in Eq. (2).
2. A pre-determined number of objects N (N > 50% of n), with the smallest MD^2(i), are selected, and new estimates of the mean and of the variance–covariance matrix (x̄_1 and S_1) are computed, based on these N objects only.
3. With x̄_1 and S_1, a new estimate of the Mahalanobis distance, MD_1^2(i), can be computed for each of the n objects, as

MD_1^2(i) = (x_i − x̄_1) S_1^{−1} (x_i − x̄_1)′.   (7)

4. The N objects with the smallest MD_1(i) are selected, and are used to calculate new estimates x̄_2 and S_2 of the mean and variance–covariance matrix.
5. x̄_1 and x̄_2 are compared, as well as S_1 and S_2. Steps (3)–(5) are iterated until convergence of the mean and variance–covariance matrix. When these are stable, they are used for the final calculation of the Mahalanobis distance.
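A minimal sketch of this trimming loop, assuming a retained fraction of 75% and a simple convergence test (both are illustrative choices, not values prescribed by the procedure above):

import numpy as np

def robust_mean_cov(X, keep_fraction=0.75, tol=1e-8, max_iter=100):
    """Ellipsoidal multivariate trimming: robust estimates of the mean and of S (steps 1-5)."""
    n = len(X)
    N = int(np.ceil(keep_fraction * n))                # N > 50% of n
    mean, S = X.mean(axis=0), np.cov(X, rowvar=False)  # step 1: classical estimates
    for _ in range(max_iter):
        diff = X - mean
        md2 = np.einsum('ij,ij->i', diff @ np.linalg.inv(S), diff)  # squared MD of every object
        keep = np.argsort(md2)[:N]                     # steps 2/4: N objects with smallest MD^2
        new_mean = X[keep].mean(axis=0)
        new_S = np.cov(X[keep], rowvar=False)
        if np.allclose(new_mean, mean, atol=tol) and np.allclose(new_S, S, atol=tol):
            break                                      # step 5: mean and S have converged
        mean, S = new_mean, new_S
    return mean, S                                     # inserted into Eq. (2) for the robust MD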
2.3. The root mean square error in spectral residual (RMSSR)

It was proposed [24] to identify the presence of an interferent in the new samples, which was not present during the calibration phase, by projecting each new sample on the PC-space defined in calibration, and by computing the spectral residuals and the root mean squared error in spectral residual (RMSSR). Comparison with the RMSSR values of the calibration samples can help to identify new samples containing an interferent. Let x be a spectrum (row vector) and x̂ the spectrum projected in the working PC-space; the RMSSR is then defined as

RMSSR = [ (x − x̂)(x − x̂)′ / p ]^{1/2},   (8)

where p is the number of wavelengths at which the spectrum was recorded.
The goal of the authors is different from ours, because they wanted to identify the interferent (by computing a scalar product between its spectrum and spectra contained in a library), but we can apply, for our purpose, the first part of the method (which indicates the presence of an interferent). According to [22], one can use the maximal value of RMSSR obtained in the calibration set (called RMSSR_max) to determine a cut-off value for the RMSSR obtained by prediction samples: above this value, they are outliers. However, at least seven replicate measurements of at least three calibration samples should be measured, such that "the replicate measurement should include all steps in the measurement procedure" [22]. In practice, such replicate measurements are rarely available, and one should find another way to use the RMSSR obtained by the calibration samples. We propose to work with quantiles of the RMSSR values obtained for the calibration samples. The procedure is as follows (see the sketch after the list):

- Perform a PCA of the calibration data matrix.
- Project the calibration and the prediction spectra in the optimal PC-space.
- Compute the spectral residuals of the calibration and prediction spectra (projection of the spectra in the residual PC-space).
- For each spectrum, the RMSSR can be calculated as defined in Eq. (8).
- Sort the RMSSR values of the calibration data in increasing order; the 95% quantile of these ordered data can be used as the cut-off value, above which the prediction spectra are considered extreme samples, and therefore suspected outliers.

If PLS is used to construct the final calibration model, the new spectra are projected in the calibration factor-space instead of the PC-space.
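A compact sketch of this screening (PCA via the SVD; n_comp stands for the chosen number of components and the 95% level is the quantile proposed above; all other names are illustrative):

import numpy as np

def rmssr_screening(X_cal, X_new, n_comp, level=95.0):
    """RMSSR of calibration and new spectra (Eq. (8)) and the quantile-based cut-off."""
    mean = X_cal.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_cal - mean, full_matrices=False)
    P = Vt[:n_comp].T                                  # loadings of the optimal PC-space
    def rmssr(X):
        E = (X - mean) - (X - mean) @ P @ P.T          # spectral residuals
        return np.sqrt((E ** 2).sum(axis=1) / X.shape[1])
    r_cal, r_new = rmssr(X_cal), rmssr(X_new)
    cutoff = np.percentile(r_cal, level)               # 95% quantile of the calibration RMSSR
    return r_new, cutoff, r_new > cutoff               # True = suspected (extreme) outlier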
Another method which was studied, and which also makes use of the spectral residuals, is based on the Q-value. The Q-value is equal to the sum of squared spectral residuals,

Q = (x − x̂)(x − x̂)′.   (9)

The obtained Q-value is compared with a critical Q-value, whose formula is presented, e.g., in [25].
2.4. The potential functions

The basic idea behind potential functions is the following. Each point of the calibration set creates a "potential" in space, such that the value of the potential is maximum at its location, and decreases continuously with distance from this point. By averaging all individual potentials from the calibration set, a global potential can be obtained at any place in space, which is in fact a probability density [21,26,27]; therefore, at any location in space, a global value of the potential induced by the calibration samples exists. Potential functions have been used widely in the field of pattern recognition [21]. For example, Coomans [28] has used several classification methods based on potential functions, such as ALLOC [29] (developed by Hermans et al. [30]) or CUPLOT [31]; Forina et al. [27] have proposed two types of class boundaries for building a probabilistic and distribution-free class-modelling technique from potential functions. In this work, the potential functions are used such that, if the potential value of an unknown prediction sample is 0, this sample is "far" from the bulk of the calibration set, and therefore the object is a prediction outlier.

2.4.1. Types of potential functions

There are many different types of potential functions, one of which is the Gaussian function, defined, in the univariate space, as

φ(x_i, x_j) = (1/(√(2π) s)) exp(−(x_i − x_j)^2 / (2s^2)).   (10)

φ(x_i, x_j) is the potential induced by object x_i on object x_j. The parameter s is called the smoothing parameter (also called the bandwidth, or the window width); it defines the width of the curve (in this present case a Gaussian) above each object. φ is such that

∫_{−∞}^{+∞} φ(x_i, x_j) dx_j = 1.   (11)

The global potential f is defined as

f = (1/n) Σ_{i=1}^{n} φ(x_i, x_j),   (12)

n being the number of objects in the calibration set. The global potential is an estimate of the probability density [21,26,27].

In this work, we preferred to use triangular potential functions (the reason for this choice will be explained in Section 2.4.2). In the univariate space, triangular potential functions are defined as

φ(x_i, x_j) = 1 − |x_i − x_j|/s   if |x_i − x_j|/s ≤ 1;   φ(x_i, x_j) = 0   otherwise.   (13)

In the multivariate space with A dimensions, the absolute value |x_i − x_j| is replaced by the Euclidean norm of the vector x_i − x_j in the definition of φ(x_i, x_j), and

f = (1/(n s^A)) Σ_{i=1}^{n} φ(x_i, x_j) = (1/(n s^*)) Σ_{i=1}^{n} φ(x_i, x_j),   (14)

if the smoothing parameter s is constant over all variables, which is the assumption followed in this paper. A being constant, s^A can be written s^*.

For information regarding other types of potential functions, the readers are referred, for instance, to Chapter 2 of [26].
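A direct transcription of Eqs. (13) and (14) for a constant smoothing parameter s (a minimal sketch; the function and variable names are illustrative):

import numpy as np

def global_triangular_potential(X_cal, x, s):
    """Global potential of a point x induced by the calibration set, Eqs. (13)-(14)."""
    n, A = X_cal.shape
    d = np.linalg.norm(X_cal - x, axis=1)            # Euclidean distances to the calibration objects
    phi = np.where(d / s <= 1.0, 1.0 - d / s, 0.0)   # triangular potential, Eq. (13)
    return phi.sum() / (n * s ** A)                  # Eq. (14); zero means x lies outside the calibration space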
2.4.2. Critical values

For Gaussian potential functions, two cut-off values, below which the new objects are considered outliers, were proposed by Forina et al. [27], one of which came from work performed by Derde et al. [32]. The method to find them is based on the p% sample percentile. The cut-off values are computed by first estimating the probability density of each calibration sample.

To calculate the first cut-off value, the probability densities of the objects are arranged in descending order; the cut-off p% value (e.g., p = 95%) of probability density is

f_p = f_j + (q − j)(f_{j+1} − f_j),   (15)

with q = pn/100 and j = int(q), where int(q) means the integer part of q.

To calculate the second cut-off value, the calibration samples' probability densities are arranged in increasing order; the cut-off (100 − p)% value (e.g., 5%) is

f_{100−p} = f_k + (u − k)(f_{k+1} − f_k),   (16)

with u = (100 − p)n/100 and k = int(u).

f_p is usually larger than f_{100−p}, so that using f_{100−p} to determine the cut-off value will lead to broader regions around the calibration set.

One problem when working with such quantiles is that some calibration samples may have a probability density smaller than the critical value, so that a prediction sample situated at exactly the same location as one of these calibration samples will be described as an outlier, while in fact it is not. It is in fact a compromise between the α- and β-errors.

In Fig. 2(a), the contour plot corresponding to a Gaussian potential built with a smoothing parameter of 1.4 and a cut-off value computed with f_p at the 95% level is shown. Reasonable contours are obtained, although a few points are still out of these contours and there is a relatively wide empty space in the right-hand side cluster (at high values of variable 2). In order to take the discarded objects into account, the confidence level is increased to 99%, and f_{100−p} is used (Fig. 2(b)). Now, it is clear that the contours around the two clusters of this data set are too wide. Therefore, it is not easy to decide a priori which cut-off value to use (f_{100−p} or f_p?), nor which confidence level to work with. This is the reason why we preferred to work with triangular potential functions, which have a zero cut-off value. The advantage of triangular potential functions over Gaussian potential functions is that one does not need to compute a cut-off value to delimit the calibration space, and one does not have to choose a confidence level.

Fig. 2. Contour plot with fixed potential functions: (a) smoothing = 1.4, p = 95%, and (b) smoothing = 1.4, 1%.

2.4.3. Optimisation of the smoothing parameter s

The boundaries around the calibration set can be defined by drawing the contour plot at the level indicated by the cut-off value (0 when triangular potential functions are used). The location of the iso-potential line at this level depends on the smoothing parameter s (the larger the smoothing, the wider the defined subspace). Therefore, the smoothing parameter is of crucial importance, and should be optimised carefully. If it is chosen too small, one may obtain a small positive-potential region around each
object, and a null potential in between; on the other hand, if it is chosen too large, the resulting potential may be found positive in places where it should be null.

Instead of using a constant smoothing parameter over all the calibration objects, the smoothing was optimised for each calibration object, and chosen as a function of the local density around the calibration objects. Such a potential function, optimising the smoothing locally, is called a variable potential function. The smoothing parameter s was chosen, for each calibration sample, to be equal to the k nearest neighbour (NN) distance (i.e., to the distance to the kth nearest neighbour): in order to optimise the value of s, one should therefore optimise k. If there is one isolated point in the data set, working with s = k NN distance may create a big envelope around it, while the envelope around the other points will be satisfactory. On the other hand, taking s equal to the k NN distance may yield generally satisfactory results, while locally a high density of objects may yield a too narrow envelope. In order to avoid this, the following modification was also studied: once the parameter k is optimised, s values are calculated for all objects in the data set and are ordered in increasing values. All objects for which s is smaller than, e.g., the 10% quantile are given the value of the 10% quantile s, to avoid too small s values, i.e., a too narrow envelope. All objects with s larger than the 90% quantile s are given the value of the 90% quantile s, to avoid too large s, i.e., a too wide envelope. If the density of the data points is the same in the whole calibration subspace, the contours will remain similar when applying this method, which will from now on be denoted "the 10% percentile method".
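A sketch of this variable smoothing (distance to the kth nearest neighbour, clipped at the 10% and 90% percentiles); the per-object s then replaces the constant s of Eq. (13). The normalisation of Eq. (14) is left out here, since only the zero level is needed to delimit the calibration space:

import numpy as np
from scipy.spatial.distance import cdist

def knn_smoothing(X_cal, k, low=10, high=90):
    """Per-object smoothing: kth nearest neighbour distance, clipped ('10% percentile method')."""
    D = cdist(X_cal, X_cal)                        # pairwise Euclidean distances
    D.sort(axis=1)                                 # column 0 is the zero distance of each object to itself
    s = D[:, k]
    return np.clip(s, np.percentile(s, low), np.percentile(s, high))

def variable_potential(X_cal, x, s):
    """Average triangular potential of x with object-dependent smoothing (normalisation omitted)."""
    d = np.linalg.norm(X_cal - x, axis=1)
    phi = np.where(d / s <= 1.0, 1.0 - d / s, 0.0)
    return phi.mean()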
To optimise the value of k, one starts by building a potential surface above the calibration set such that s for each object is equal to the nearest neighbour distance (k = 1). If k = 1 yields too small s values (a small value of s being a value for which the resulting potential surface consists of many regions with a positive potential around each object, and a zero potential in between, even in a non-clustered data set), k is then increased to 2, and so on, until a k value is reached for which the potential surface obtained seems reasonable. To decide when this is the case, two methods are proposed:

Centroid method. A pair of calibration samples is chosen randomly, and one checks whether their centroid has a non-null potential. This procedure is iterated a certain number of times. If in many cases the potentials of the centroids are zero, it means that many calibration objects create a potential surface which is positive in their close neighbourhood only, and that there exist many regions with zero potential between objects. Therefore, the chosen k yields too small s values, and should be increased. The smallest k for which, for any pair of chosen points, the centroid has a positive potential is defined as the optimal one.
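The centroid method can be coded on top of the previous sketch (reusing knn_smoothing and variable_potential from above; the number of random pairs and k_max are illustrative choices):

import numpy as np

def optimise_k_centroid(X_cal, k_max=10, n_pairs=200, seed=0):
    """Smallest k for which the centroids of randomly chosen pairs all have a positive potential."""
    rng = np.random.default_rng(seed)
    for k in range(1, k_max + 1):
        s = knn_smoothing(X_cal, k)                          # per-object smoothing for this k
        pairs = [rng.choice(len(X_cal), size=2, replace=False) for _ in range(n_pairs)]
        if all(variable_potential(X_cal, 0.5 * (X_cal[a] + X_cal[b]), s) > 0.0
               for a, b in pairs):
            return k                                         # no zero-potential centroid: k is large enough
    return k_max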
Cross-validation method. A leave-one-out strategy can also be applied in the optimisation of s. For
3. Experimental

The first real data set used in this study comes from the Matlab PLS Toolbox, version 1.5, from Wise [33]. The data come from a liquid-fed ceramic melter, and a multivariate calibration model should relate the temperature in the molten glass tank to the 20 tank levels (the 20 original variables). The data are already separated into a calibration set (300 samples) and a test set (200 samples). Two outliers in the PC1–PC2 space were removed from the calibration set. These were
detected with the evolution program proposed by Walczak [14,15]. This data set was used to test whether the methods found to optimise the smoothing with simulated data can also be applied to real data.

Fig. 3. Different two-dimensional data sets used for optimisation of the parameters of the potential functions: (a) sim1, (b) sim2, and (c) sim3.

A second real data set from industry was used, in particular, to show that the methods based on the Mahalanobis distance and on the RMSSR are not always applicable: NIR spectra of polyether polyol samples were measured, and related to the hydroxyl number of each sample with different linear multivariate calibration methods [34–37]. In this work, the data set was cleaned of the outliers found by Centner et al. [10], so that 84 samples out of the 87 original ones were left, whose spectra were measured over 499 wavelengths in the range 1132–2128 nm, on a Pacific Scientific 6250 scanning spectrometer (NIRSystems, Silver Spring, MD). The PC1–PC2 score plot is shown in Fig. 1. The right-hand side cluster (58 samples) is called cluster 1, and the left-hand side cluster (26 samples) is called cluster 2.

As no new prediction samples were available, new prediction samples were simulated from the real calibration samples, and separated into two new sets, NS1 and NS2. The creation of the 14 samples of NS1 was performed as follows:

for i = 1 to 14
    randomly select two objects from the calibration set, indexed a and b
    NS1(object i) = 0.5 × (object a + object b)
end

Therefore, NS1 may contain "good" prediction objects and inliers (a good prediction object being an object to predict which is neither an inlier nor an outlier).

The creation of the 40 samples of NS2 was performed as follows:

for i = 1 to 20
    randomly select two objects from the calibration set, indexed a and b
    NS2(object i) = 0.5 × (object a + object b)
end
for i = 21 to 40
    randomly select one object from the calibration set, indexed a
    draw a number α from a uniform distribution between 0.7 and 1.3
    NS2(object i) = α × object a
end

Therefore, the first 20 objects of NS2 may be good prediction objects or inliers, while the last 20 objects may be good prediction objects, inliers or outliers.

NS1 and NS2 were projected on the PC1–PC2 space of the calibration set, and are presented in Fig. 4(a).
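The construction of NS1 and NS2 described in the pseudocode above can be reproduced, for example, as follows (X_cal holds the 84 calibration spectra; the random seed is arbitrary):

import numpy as np

def make_prediction_sets(X_cal, seed=0):
    """Simulated prediction sets NS1 (14 samples) and NS2 (40 samples)."""
    rng = np.random.default_rng(seed)
    n = len(X_cal)
    # NS1: 14 midpoints of randomly chosen pairs of calibration objects
    pairs1 = [rng.choice(n, size=2, replace=False) for _ in range(14)]
    NS1 = np.array([0.5 * (X_cal[a] + X_cal[b]) for a, b in pairs1])
    # NS2: 20 midpoints, then 20 objects rescaled by a factor drawn from U(0.7, 1.3)
    pairs2 = [rng.choice(n, size=2, replace=False) for _ in range(20)]
    mids = [0.5 * (X_cal[a] + X_cal[b]) for a, b in pairs2]
    scaled = [rng.uniform(0.7, 1.3) * X_cal[rng.integers(n)] for _ in range(20)]
    NS2 = np.array(mids + scaled)
    return NS1, NS2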
3.3. Software

All computations were performed on a PC 486 DX2 50 MHz computer, with programs written in Matlab for Windows, version 4.0 (MathWorks, Natick, 1992).

4. Results and discussion

4.1. Application of the tests based on the Mahalanobis distance (MD)

Both the classical and the robust Mahalanobis distances were used with different tests to try to detect inliers and outliers. The detection of inliers/outliers based on the MD was studied on the second real case study: a PCR calibration model was built, its complexity was found by cross-validation to be 8 PCs [34], and the MD was always computed in this 8-PC space. First, the classical MD between each prediction object and the mean of the calibration set was computed, and compared to the critical value of a χ² distribution with 8 degrees of freedom. With 95% confidence, this critical value is equal to 15.5. No suspect objects were detected in NS1, while 17 such objects were detected in NS2. These are shown on a PC1–PC2 score plot in Fig. 4(b). All these points are outliers, even the five samples which seem to be inliers on a PC1–PC2 score plot, but which are in fact outliers on PC3.
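The χ² test quoted above can be checked and applied in a few lines (a sketch; md2_new stands for the squared MD of each prediction object computed with Eq. (2) in the 8-PC space):

from scipy.stats import chi2

def md_outlier_flags(md2_new, n_pcs=8, confidence=0.95):
    """Flag prediction objects whose squared MD exceeds the chi-squared critical value."""
    crit = chi2.ppf(confidence, df=n_pcs)   # about 15.5 for 8 degrees of freedom at 95% confidence
    return md2_new > crit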
As explained in Section 2, the leverage of an object corresponds to a scaled Mahalanobis distance of the object towards the centroid of the data. The difference between using the Mahalanobis distance or the leverage resides in the computation of the critical
Fig. 6. Use of the centroid method to optimise the smoothing (real objects and simulated centroids are plotted with different symbols): (a) sim1, k = 2, (b) sim1, k = 3, (c) sim1, contour plot corresponding to k = 3, (d) sim1, contour plot corresponding to k = 4, (e) sim1, contour plot corresponding to k = 3 with the 10% percentile method, (f) sim2, contour plot corresponding to k = 4, and (g) sim2, contour plot corresponding to k = 4 with the 10% percentile method.
Table 1
Results of the leave-one-out approach applied to sim1
1 NN distance 49/50
2 NN distance 14/50
3 NN distance 3/50
4 NN distance 1/50
Fig. 7. Potential functions built from sim3: (a) k = 3, K = 15, (b) contour plots corresponding to k = 3, (c) contour plot corresponding to k = 3 with the 10% percentile method, and (d) contour plot corresponding to k = 4 with the 10% percentile method.
From the three studied simulations, it appears that the leave-one-out approach is less optimistic than the centroid method, as larger k are found with the former approach. In particular, it is strongly influenced by "lonely" points. From the study of the contour plots obtained in each case, we conclude that the centroid method leads to a better compromise between the α- and β-errors.
neighbours to compute pairs can yield satisfactory results for the optimal smoothing value, this approach was adopted. With k = 3, 0.0565% of the centroids were out of the potential surface (one centroid), while with k = 4 all centroids were included in the potential surface. The corresponding contour plots are shown in Fig. 8(a) and (b), respectively, and the contour plots obtained with the percentile method are shown in Fig. 8(c) and (d), respectively.

Fig. 8(a) and (b) show that the isolated points situated at the border of the data set create wide borders around them. This can be corrected by using the percentile method, as is confirmed in Fig. 8(c) and (d), where isolated points stand alone on an isolated potential surface, which is satisfactory, as it will enable an easier detection of inliers. With the leave-one-out approach, using k = 5 leads to 1% of the objects out of the potential surface (three objects), which leads to too wide contours.

Therefore, the conclusions from the simulated case studies are confirmed by this real case study, and a proposed strategy for the detection of outliers/inliers in a data set is as follows:

- The smoothing parameter used to build the global potential function is defined, for each object, as the k nearest neighbour Euclidean distance.
- k can be optimised by the centroid method, by building, for each object, centroids with the 10 nearest neighbours (a given pair of calibration objects being used only once).
- Once k is optimal, the final potential function is built by using the 10% percentile method.

This enables narrowing the contours around isolated points situated at the extremity of a data set. One could also optimise k by using the 10% percentile method directly. The reason why this is not proposed is the following: compare, for instance, Fig. 8(a) and (c): if the percentile method had been used to optimise the smoothing, some centroids built from the isolated border points would have been found out of the potential surface (inliers), whereas they were found inside the surface when the percentile method was not considered for the optimisation of k. Therefore, this would have led to a larger value of k.

By using this strategy for the studied data set, a potential surface can therefore be built above the calibration set, with k = 3 and the 10% percentile method, and the potential of the objects from the test set is calculated. The technique is used to detect both outliers, i.e., objects which are out of the calibration space, and inliers. As defined earlier, inliers are situated within a gap of the calibration data. This can correspond to objects with a zero potential residing within the calibration space, but also to some objects with a relatively small value of potential, situated in a low-density region. To detect the first type of inliers, and distinguish them from outliers, one can resort to visual inspection, or one could compare the objects with zero potential to the objects detected with the Mahalanobis distance method (as it was seen earlier that this method enables the detection of outliers in prediction): objects detected with both methods are outliers in prediction, while objects with a zero potential, but undetected by the Mahalanobis distance method, are inliers. To detect the second type of inliers, one should define what a small potential value is. Small values of potential can be defined, e.g., as potential values smaller than the 5% percentile potential of the calibration set. Therefore, all test objects with a potential value less than this threshold will be considered as inliers, i.e., as objects situated in less dense regions of the calibration space. The inliers and outliers in prediction of the test set are shown in Fig. 8(e) (in order not to overcrowd the plot, good test objects are not displayed). It can be seen that the proposed method leads to a quite reasonable detection of outliers/inliers.
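The decision logic of this paragraph can be summarised in a short sketch (f_test and f_cal are the global potentials of the test and calibration objects, md_outlier the Boolean flags of the Mahalanobis distance test; the 5% percentile is the threshold proposed above):

import numpy as np

def classify_test_objects(f_test, f_cal, md_outlier):
    """Label each test object as 'good', 'inlier' or 'outlier' from its potential and the MD test."""
    f_test = np.asarray(f_test, dtype=float)
    md_outlier = np.asarray(md_outlier, dtype=bool)
    small = np.percentile(f_cal, 5)                    # 5% percentile of the calibration potentials
    labels = np.full(len(f_test), 'good', dtype=object)
    labels[f_test < small] = 'inlier'                  # second type of inlier: low-density region
    labels[(f_test == 0) & ~md_outlier] = 'inlier'     # zero potential, but inside the MD ellipsoid
    labels[(f_test == 0) & md_outlier] = 'outlier'     # zero potential and flagged by the MD test
    return labels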
Only the proposed strategy was applied to the second real data set. This data set is heterogeneous, as (i) it contains two main clusters (chemically explainable), although smaller subclusters can be defined [37], and (ii) in each cluster, the density is irregular. In particular, difficulties may be expected with cluster 2. Therefore, it is a challenging data set, typical of many real applications. Because of the irregular densities in the two clusters, it is certainly better to optimise the smoothing in each cluster separately, as the value of k to optimise will probably be different in each cluster. In this case, the outliers/inliers from the whole data set are:

- objects with a zero potential for the two clusters (outliers or inliers);
- objects with a small potential value in one cluster and a zero potential value in the other cluster (inliers);
Fig. 8. First real case study: contour plots with (a) k = 3, and (b) k = 4; contour plots obtained with the 10% percentile method (c) with k = 3, and (d) with k = 4; (e) detection of inliers/outliers in the test set: inliers with a positive potential, inliers with a zero potential (*), and outliers (shown with different symbols).
Table 4
Inliers/outliers in NS1
Cluster 1 Cluster 2
[31] D. Coomans, D.L. Massart, Anal. Chim. Acta 133 (1981) 225.
[32] M.P. Derde, L. Kaufman, D.L. Massart, J. Chemometrics 3 (1989) 375.
[33] B. Wise, PLS Toolbox, version 1.5, Eigenvector Research, Manson, WA, 1997.
[34] D. Jouan-Rimbaud, D.L. Massart, R. Leardi, O.E. de Noord, Anal. Chem. 67 (1995) 4295.
[35] V. Centner, D.L. Massart, O.E. de Noord, S. de Jong, B.M. Vandeginste, C. Sterna, Anal. Chem. 68 (1996) 3851.
[36] D. Jouan-Rimbaud, B. Walczak, R.J. Poppi, O.E. de Noord, D.L. Massart, Anal. Chem. 69 (1997) 4317.
[37] L. Pasti, D. Jouan-Rimbaud, D.L. Massart, O.E. de Noord, Anal. Chim. Acta 364 (1998) 253.