Detection of Prediction Outliers and Inliers in Multivariate Calibration
Received 14 May 1998; received in revised form 14 August 1998; accepted 25 August 1998
Abstract
After a multivariate calibration model is built and validated, it is ready to be used for the prediction of the characteristic to be determined in the new samples. Before prediction, one should make sure that the new samples are similar to the calibration samples. If not, such samples are called prediction outliers, which can be of two types, namely samples that are outside of the calibration space, and samples which are situated in a low- (or empty-) density region of the calibration space. The latter type, also called inliers, are not detected by existing methods. In this paper, potential functions are proposed for the detection of inliers and outliers in prediction. © 1999 Elsevier Science B.V. All rights reserved.
with p degrees of freedom (p being the number of variables taken into account) [12]. The Mahalanobis distance between a calibration object i and the centroid of the data set, in the original variable space, is defined as

MD_i^2 = (x_i − x̄) S^{−1} (x_i − x̄)′,   (2)

where S is the variance–covariance matrix defined as

S = (1/(n − 1)) X_c′ X_c,   (3)

where n corresponds to the number of objects in X, and X_c is the column-centred X matrix. Another measure for outlier detection, which is related to the Mahalanobis distance, is the leverage. The leverages are the elements situated on the diagonal of the prediction matrix (or hat matrix), defined as

H = X (X′X)^{−1} X′.   (4)

The leverage h_i of object i is therefore defined as

h_i = x_i (X′X)^{−1} x_i′.   (5)

There is a linear relationship between the leverage of an object and its Mahalanobis distance to the centroid:

h_i = MD_i^2/(n − 1) + 1/n.   (6)

During calibration, in order to detect possible outliers, the value of the leverage computed for a calibration object is compared to two or three times p/n. This diagnostic is also possible in prediction: the leverages of the prediction objects can be computed and compared to two or three times the average value of the leverages in the calibration set [1].
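For illustration, Eqs. (2)–(5) and the leverage check can be written as the following Python/NumPy sketch (the calibration matrix X, the new object x_new and the factor of 3 are placeholders chosen here for the example, not values taken from the original study):

import numpy as np

def md2_and_leverage(X, x_new):
    """Squared Mahalanobis distance (Eq. (2)) and leverage (Eq. (5)) of a new object."""
    n = X.shape[0]
    x_mean = X.mean(axis=0)
    Xc = X - x_mean                                   # column-centred calibration matrix
    S = Xc.T @ Xc / (n - 1)                           # variance-covariance matrix, Eq. (3)
    d = x_new - x_mean
    md2 = d @ np.linalg.solve(S, d)                   # Eq. (2)
    h = x_new @ np.linalg.solve(X.T @ X, x_new)       # Eq. (5), with the uncentred X of Eq. (4)
    return md2, h

# Leverage-based screening of a prediction object: compare h to 2-3 times the
# average calibration leverage (diagonal of the hat matrix, Eq. (4)).
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))                          # placeholder calibration scores
x_new = rng.normal(size=5)
h_cal = np.diag(X @ np.linalg.solve(X.T @ X, X.T))    # calibration leverages
md2, h = md2_and_leverage(X, x_new)
print(md2, h, h > 3 * h_cal.mean())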
2.2. The robust Mahalanobis distance

The classical Mahalanobis distance does not always allow a proper detection of calibration outliers, because these may influence the estimation of the mean and of the variance–covariance matrix. This is the reason why it is often more appropriate to use a robust estimation of the Mahalanobis distance [12], which is not influenced by outliers. In our case, we are looking for prediction outliers, which implies that the estimation of the mean and of the variance–covariance matrix of the calibration data set should not be influenced by outliers (when the model is already established, the calibration data are free of outliers). Yet, it is interesting to verify this on our data, by comparing results from both the classical and the robust Mahalanobis distance.

Robust estimators of x̄ and S can be obtained by ellipsoidal multivariate trimming [12,13], according to the following procedure (a code sketch follows the list):

1. The initial mean x̄ and variance–covariance matrix S are the classical ones (see Eq. (3)), and the squared Mahalanobis distance of each object is computed as in Eq. (2).
2. A pre-determined number of objects N (N > 50% of n), with the smallest MD^2(i), are selected, and new estimates of the mean and of the variance–covariance matrix (x̄_1 and S_1) are computed, based on these N objects only.
3. With x̄_1 and S_1, a new estimate of the Mahalanobis distance, MD_1^2(i), can be computed for each of the n objects, as

MD_1^2(i) = (x_i − x̄_1) S_1^{−1} (x_i − x̄_1)′.   (7)

4. The N objects with the smallest MD_1(i) are selected, and are used to calculate new estimates x̄_2 and S_2 of the mean and variance–covariance matrix.
5. x̄_1 and x̄_2 are compared, as well as S_1 and S_2. Steps (3)–(5) are iterated until convergence of the mean and variance–covariance matrix. When these are stable, they are used for the final calculation of the Mahalanobis distance.
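A minimal sketch of this trimming loop, assuming a retained fraction of 75% and a simple convergence test (both are illustrative choices, not values prescribed by the procedure above):

import numpy as np

def robust_mean_cov(X, keep_fraction=0.75, tol=1e-8, max_iter=100):
    """Ellipsoidal multivariate trimming: robust estimates of the mean and of S (steps 1-5)."""
    n = len(X)
    N = int(np.ceil(keep_fraction * n))                # N > 50% of n
    mean, S = X.mean(axis=0), np.cov(X, rowvar=False)  # step 1: classical estimates
    for _ in range(max_iter):
        diff = X - mean
        md2 = np.einsum('ij,ij->i', diff @ np.linalg.inv(S), diff)  # squared MD of every object
        keep = np.argsort(md2)[:N]                     # steps 2/4: N objects with smallest MD^2
        new_mean = X[keep].mean(axis=0)
        new_S = np.cov(X[keep], rowvar=False)
        if np.allclose(new_mean, mean, atol=tol) and np.allclose(new_S, S, atol=tol):
            break                                      # step 5: mean and S have converged
        mean, S = new_mean, new_S
    return mean, S                                     # inserted into Eq. (2) for the robust MD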
2.3. The root mean square error in spectral residual (RMSSR)

It was proposed [24] to identify the presence of an interferent in the new samples, which was not present during the calibration phase, by projecting each new sample on the PC-space defined in calibration, and by computing the spectral residuals and the root mean squared error in spectral residual (RMSSR). Comparison with the RMSSR values of the calibration samples can help to identify new samples containing an interferent. Let x be a spectrum (row vector) and x̂ the spectrum projected in the working PC-space; the RMSSR is then defined as

RMSSR = [ (x − x̂)(x − x̂)′ / p ]^{1/2},   (8)

where p is the number of wavelengths at which the spectrum was recorded.
The goal of the authors is different from ours, because they wanted to identify the interferent (by computing a scalar product between its spectrum and spectra contained in a library), but we can apply, for our purpose, the first part of the method (which indicates the presence of an interferent). According to [22], one can use the maximal value of RMSSR obtained in the calibration set (called RMSSR_max) to determine a cut-off value for the RMSSR obtained by prediction samples: above this value, they are outliers. However, at least seven replicate measurements of at least three calibration samples should be measured, such that "the replicate measurement should include all steps in the measurement procedure" [22]. In practice, such replicate measurements are rarely available, and one should find another way to use the RMSSR obtained by the calibration samples. We propose to work with quantiles of the RMSSR values obtained for the calibration samples. The procedure is as follows (see the sketch after the list):

- Perform a PCA of the calibration data matrix.
- Project the calibration and the prediction spectra in the optimal PC-space.
- Compute the spectral residuals of the calibration and prediction spectra (projection of the spectra in the residual PC-space).
- For each spectrum, the RMSSR can be calculated as defined in Eq. (8).
- Sort the RMSSR values of the calibration data in increasing order; the 95% quantile of these ordered data can be used as the cut-off value, above which the prediction spectra are considered extreme samples, and therefore suspected outliers.

If PLS is used to construct the final calibration model, the new spectra are projected in the calibration factor-space instead of the PC-space.
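A compact sketch of this screening (PCA via the SVD; n_comp stands for the chosen number of components and the 95% level is the quantile proposed above; all other names are illustrative):

import numpy as np

def rmssr_screening(X_cal, X_new, n_comp, level=95.0):
    """RMSSR of calibration and new spectra (Eq. (8)) and the quantile-based cut-off."""
    mean = X_cal.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_cal - mean, full_matrices=False)
    P = Vt[:n_comp].T                                  # loadings of the optimal PC-space
    def rmssr(X):
        E = (X - mean) - (X - mean) @ P @ P.T          # spectral residuals
        return np.sqrt((E ** 2).sum(axis=1) / X.shape[1])
    r_cal, r_new = rmssr(X_cal), rmssr(X_new)
    cutoff = np.percentile(r_cal, level)               # 95% quantile of the calibration RMSSR
    return r_new, cutoff, r_new > cutoff               # True = suspected (extreme) outlier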
Another method which was studied, and which also makes use of the spectral residuals, is based on the Q-value. The Q-value is equal to the sum of squared spectral residuals,

Q = (x − x̂)(x − x̂)′.   (9)

The obtained Q-value is compared with a critical Q-value, whose formula is presented, e.g., in [25].
2.4. The potential functions

The basic idea behind potential functions is the following. Each point of the calibration set creates a "potential" in space, such that the value of the potential is maximum at its location, and decreases continuously with distance from this point. By averaging all individual potentials from the calibration set, a global potential can be obtained at any place in space, which is in fact a probability density [21,26,27]; therefore, at any location in space, a global value of the potential induced by the calibration samples exists. Potential functions have been used widely in the field of pattern recognition [21]. For example, Coomans [28] has used several classification methods based on potential functions, such as ALLOC [29] (developed by Hermans et al. [30]) or CUPLOT [31]; Forina et al. [27] have proposed two types of class boundaries for building a probabilistic and distribution-free class-modelling technique from potential functions. In this work, the potential functions are used such that, if the potential value of an unknown prediction sample is 0, this sample is "far" from the bulk of the calibration set, and therefore the object is a prediction outlier.

2.4.1. Types of potential functions

There are many different types of potential functions, one of which is the Gaussian function, defined, in the univariate space, as

φ(x_i, x_j) = (1/(√(2π) s)) exp(−(x_i − x_j)^2 / (2s^2)).   (10)

φ(x_i, x_j) is the potential induced by object x_i on object x_j. The parameter s is called the smoothing parameter (also called the bandwidth, or the window width); it defines the width of the curve (in this present case a Gaussian) above each object. φ is such that

∫_{−∞}^{+∞} φ(x_i, x_j) dx_j = 1.   (11)

The global potential f is defined as

f = (1/n) Σ_{i=1}^{n} φ(x_i, x_j),   (12)

n being the number of objects in the calibration set. The global potential is an estimate of the probability density [21,26,27].

In this work, we preferred to use triangular potential functions (the reason for this choice will be explained in Section 2.4.2). In the univariate space, triangular potential functions are defined as

φ(x_i, x_j) = 1 − |x_i − x_j|/s   if |x_i − x_j|/s ≤ 1;   φ(x_i, x_j) = 0   otherwise.   (13)

In the multivariate space with A dimensions, the absolute value |x_i − x_j| is replaced by the Euclidean norm of the vector x_i − x_j in the definition of φ(x_i, x_j), and

f = (1/(n s^A)) Σ_{i=1}^{n} φ(x_i, x_j) = (1/(n s^*)) Σ_{i=1}^{n} φ(x_i, x_j),   (14)

if the smoothing parameter s is constant over all variables, which is the assumption followed in this paper. A being constant, s^A can be written s^*.

For information regarding other types of potential functions, the readers are referred, for instance, to Chapter 2 of [26].
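A direct transcription of Eqs. (13) and (14) for a constant smoothing parameter s (a minimal sketch; the function and variable names are illustrative):

import numpy as np

def global_triangular_potential(X_cal, x, s):
    """Global potential of a point x induced by the calibration set, Eqs. (13)-(14)."""
    n, A = X_cal.shape
    d = np.linalg.norm(X_cal - x, axis=1)            # Euclidean distances to the calibration objects
    phi = np.where(d / s <= 1.0, 1.0 - d / s, 0.0)   # triangular potential, Eq. (13)
    return phi.sum() / (n * s ** A)                  # Eq. (14); zero means x lies outside the calibration space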
2.4.2. Critical values

For Gaussian potential functions, two cut-off values, below which the new objects are considered outliers, were proposed by Forina et al. [27], one of which came from work performed by Derde et al. [32]. The method to find them is based on the p% sample percentile. The cut-off values are computed by first estimating the probability density of each calibration sample.

To calculate the first cut-off value, the probability densities of the objects are arranged in descending order; the cut-off p% value (e.g., p = 95%) of probability density is

f_p = f_j + (q − j)(f_{j+1} − f_j),   (15)

with q = pn/100 and j = int(q), where int(q) means the integer part of q.

To calculate the second cut-off value, the calibration samples' probability densities are arranged in increasing order; the cut-off (100 − p)% value (e.g., 5%) is

f_{100−p} = f_k + (u − k)(f_{k+1} − f_k),   (16)

with u = (100 − p)n/100 and k = int(u).

f_p is usually larger than f_{100−p}, so that using f_{100−p} to determine the cut-off value will lead to broader regions around the calibration set.

One problem when working with such quantiles is that some calibration samples may have a probability density smaller than the critical value, so that a prediction sample situated at exactly the same location as one of these calibration samples will be described as an outlier, while in fact it is not. It is in fact a compromise between the α- and β-errors.

In Fig. 2(a), the contour plot corresponding to a Gaussian potential built with a smoothing parameter of 1.4 and a cut-off value computed with f_p at the 95% level is shown. Reasonable contours are obtained, although a few points are still out of these contours and there is a relatively wide empty space in the right-hand side cluster (at high values of variable 2). In order to take the discarded objects into account, the confidence level is increased to 99%, and f_{100−p} is used (Fig. 2(b)). Now, it is clear that the contours around the two clusters of this data set are too wide. Therefore, it is not easy to decide a priori which cut-off value to use (f_{100−p} or f_p?), nor which confidence level to work with. This is the reason why we preferred to work with triangular potential functions, which have a zero cut-off value. The advantage of triangular potential functions over Gaussian potential functions is that one does not need to compute a cut-off value to delimit the calibration space, and one does not have to choose a confidence level.

Fig. 2. Contour plot with fixed potential functions: (a) smoothing = 1.4, p = 95%, and (b) smoothing = 1.4, 1%.

2.4.3. Optimisation of the smoothing parameter s

The boundaries around the calibration set can be defined by drawing the contour plot at the level indicated by the cut-off value (0 when triangular potential functions are used). The location of the iso-potential line at this level depends on the smoothing parameter s (the larger the smoothing, the wider the defined subspace). Therefore, the smoothing parameter is of crucial importance, and should be optimised carefully. If it is chosen too small, one may obtain a small positive-potential region around each
object, and a null potential in between; on the other hand, if it is chosen too large, the resulting potential may be found positive in places where it should be null.

Instead of using a constant smoothing parameter over all the calibration objects, the smoothing was optimised for each calibration object, and chosen as a function of the local density around the calibration objects. Such a potential function, optimising the smoothing locally, is called a variable potential function. The smoothing parameter s was chosen, for each calibration sample, to be equal to the k nearest neighbour (NN) distance (i.e., to the distance to the kth nearest neighbour): in order to optimise the value of s, one should therefore optimise k. If there is one isolated point in the data set, working with s = k NN distance may create a big envelope around it, while the envelope around the other points will be satisfactory. On the other hand, taking s equal to the k NN distance may yield generally satisfactory results, while locally a high density of objects may yield a too narrow envelope. In order to avoid this, the following modification was also studied: once the parameter k is optimised, s values are calculated for all objects in the data set and are ordered in increasing values. All objects for which s is smaller than, e.g., the 10% quantile are given the value of the 10% quantile s, to avoid too small s values, i.e., a too narrow envelope. All objects with s larger than the 90% quantile s are given the value of the 90% quantile s, to avoid too large s, i.e., a too wide envelope. If the density of the data points is the same in the whole calibration subspace, the contours will remain similar when applying this method, which will from now on be denoted "the 10% percentile method".
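A sketch of this variable smoothing (distance to the kth nearest neighbour, clipped at the 10% and 90% percentiles); the per-object s then replaces the constant s of Eq. (13). The normalisation of Eq. (14) is left out here, since only the zero level is needed to delimit the calibration space:

import numpy as np
from scipy.spatial.distance import cdist

def knn_smoothing(X_cal, k, low=10, high=90):
    """Per-object smoothing: kth nearest neighbour distance, clipped ('10% percentile method')."""
    D = cdist(X_cal, X_cal)                        # pairwise Euclidean distances
    D.sort(axis=1)                                 # column 0 is the zero distance of each object to itself
    s = D[:, k]
    return np.clip(s, np.percentile(s, low), np.percentile(s, high))

def variable_potential(X_cal, x, s):
    """Average triangular potential of x with object-dependent smoothing (normalisation omitted)."""
    d = np.linalg.norm(X_cal - x, axis=1)
    phi = np.where(d / s <= 1.0, 1.0 - d / s, 0.0)
    return phi.mean()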
To optimise the value of k, one starts by building a potential surface above the calibration set such that s for each object is equal to the nearest neighbour distance (k = 1). If k = 1 yields too small s values (a small value of s being a value for which the resulting potential surface consists of many regions with a positive potential around each object, and a zero potential in between, even in a non-clustered data set), k is then increased to 2, and so on, until a k value is reached for which the potential surface obtained seems reasonable. To decide when this is the case, two methods are proposed:

Centroid method. A pair of calibration samples is chosen randomly, and one checks whether their centroid has a non-null potential. This procedure is iterated a certain number of times. If in many cases the potentials of the centroids are zero, it means that many calibration objects create a potential surface which is positive in their close neighbourhood only, and that there exist many regions with zero potential between objects. Therefore, the chosen k yields too small s values, and should be increased. The smallest k for which, for any pair of chosen points, the centroid has a positive potential is defined as the optimal one.
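The centroid method can be coded on top of the previous sketch (reusing knn_smoothing and variable_potential from above; the number of random pairs and k_max are illustrative choices):

import numpy as np

def optimise_k_centroid(X_cal, k_max=10, n_pairs=200, seed=0):
    """Smallest k for which the centroids of randomly chosen pairs all have a positive potential."""
    rng = np.random.default_rng(seed)
    for k in range(1, k_max + 1):
        s = knn_smoothing(X_cal, k)                          # per-object smoothing for this k
        pairs = [rng.choice(len(X_cal), size=2, replace=False) for _ in range(n_pairs)]
        if all(variable_potential(X_cal, 0.5 * (X_cal[a] + X_cal[b]), s) > 0.0
               for a, b in pairs):
            return k                                         # no zero-potential centroid: k is large enough
    return k_max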
Cross-validation method. A leave-one-out strategy can also be applied in the optimisation of s. For
3. Experimental

The first real data set used in this study comes from the Matlab PLS Toolbox, version 1.5, from Wise [33]. The data come from a liquid-fed ceramic melter, and a multivariate calibration model should relate the temperature in the molten glass tank to the 20 tank levels (the 20 original variables). The data are already separated into a calibration set (300 samples) and a test set (200 samples). Two outliers in the PC1–PC2 space were removed from the calibration set. These were
detected with the evolution program proposed by Walczak [14,15]. This data set was used to test whether the methods found to optimise the smoothing with simulated data can also be applied to real data.

Fig. 3. Different two-dimensional data sets used for optimisation of the parameters of the potential functions: (a) sim1, (b) sim2, and (c) sim3.

A second real data set from industry was used, in particular, to show that the methods based on the Mahalanobis distance and on the RMSSR are not always applicable: NIR spectra of polyether polyol samples were measured, and related to the hydroxyl number of each sample with different linear multivariate calibration methods [34–37]. In this work, the data set was cleaned of the outliers found by Centner et al. [10], so that 84 samples out of the 87 original ones were left, whose spectra were measured over 499 wavelengths in the range 1132–2128 nm, on a Pacific Scientific 6250 scanning spectrometer (NIRSystems, Silver Spring, MD). The PC1–PC2 score plot is shown in Fig. 1. The right-hand side cluster (58 samples) is called cluster 1, and the left-hand side cluster (26 samples) is called cluster 2.

As no new prediction samples were available, new prediction samples were simulated from the real calibration samples, and separated into two new sets, NS1 and NS2. The creation of the 14 samples of NS1 was performed as follows:

for i = 1 to 14
    randomly select two objects from the calibration set, indexed a and b
    NS1(object i) = 0.5 × (object a + object b)
end

Therefore, NS1 may contain "good" prediction objects and inliers (a good prediction object being an object to predict which is neither an inlier nor an outlier).

The creation of the 40 samples of NS2 was performed as follows:

for i = 1 to 20
    randomly select two objects from the calibration set, indexed a and b
    NS2(object i) = 0.5 × (object a + object b)
end
for i = 21 to 40
    randomly select one object from the calibration set, indexed a
    draw a number α from a uniform distribution between 0.7 and 1.3
    NS2(object i) = α × object a
end

Therefore, the first 20 objects of NS2 may be good prediction objects or inliers, while the last 20 objects may be good prediction objects, inliers or outliers.

NS1 and NS2 were projected on the PC1–PC2 space of the calibration set, and are presented in Fig. 4(a).
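The construction of NS1 and NS2 described in the pseudocode above can be reproduced, for example, as follows (X_cal holds the 84 calibration spectra; the random seed is arbitrary):

import numpy as np

def make_prediction_sets(X_cal, seed=0):
    """Simulated prediction sets NS1 (14 samples) and NS2 (40 samples)."""
    rng = np.random.default_rng(seed)
    n = len(X_cal)
    # NS1: 14 midpoints of randomly chosen pairs of calibration objects
    pairs1 = [rng.choice(n, size=2, replace=False) for _ in range(14)]
    NS1 = np.array([0.5 * (X_cal[a] + X_cal[b]) for a, b in pairs1])
    # NS2: 20 midpoints, then 20 objects rescaled by a factor drawn from U(0.7, 1.3)
    pairs2 = [rng.choice(n, size=2, replace=False) for _ in range(20)]
    mids = [0.5 * (X_cal[a] + X_cal[b]) for a, b in pairs2]
    scaled = [rng.uniform(0.7, 1.3) * X_cal[rng.integers(n)] for _ in range(20)]
    NS2 = np.array(mids + scaled)
    return NS1, NS2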
3.3. Software

All computations were performed on a PC 486 DX2 50 MHz computer, with programs written in Matlab for Windows, version 4.0 (MathWorks, Natick, 1992).

4. Results and discussion

4.1. Application of the tests based on the Mahalanobis distance (MD)

Both the classical and the robust Mahalanobis distances were used with different tests to try to detect inliers and outliers. The detection of inliers/outliers based on the MD was studied on the second real case study: a PCR calibration model was built, its complexity was found by cross-validation to be 8 PCs [34], and the MD was always computed in this 8-PC space. First, the classical MD between each prediction object and the mean of the calibration set was computed, and compared to the critical value of a χ² distribution with 8 degrees of freedom. With 95% confidence, this critical value is equal to 15.5. No suspect objects were detected in NS1, while 17 such objects were detected in NS2. These are shown on a PC1–PC2 score plot in Fig. 4(b). All these points are outliers, even the five samples which seem to be inliers on a PC1–PC2 score plot, but which are in fact outliers on PC3.
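The χ² test quoted above can be checked and applied in a few lines (a sketch; md2_new stands for the squared MD of each prediction object computed with Eq. (2) in the 8-PC space):

from scipy.stats import chi2

def md_outlier_flags(md2_new, n_pcs=8, confidence=0.95):
    """Flag prediction objects whose squared MD exceeds the chi-squared critical value."""
    crit = chi2.ppf(confidence, df=n_pcs)   # about 15.5 for 8 degrees of freedom at 95% confidence
    return md2_new > crit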
As explained in Section 2, the leverage of an object corresponds to a scaled Mahalanobis distance of the object towards the centroid of the data. The difference between using the Mahalanobis distance or the leverage resides in the computation of the critical
Fig. 6. Use of the centroid method to optimise the smoothing (real objects and simulated centroids are plotted with different symbols): (a) sim1, k = 2, (b) sim1, k = 3, (c) sim1, contour plot corresponding to k = 3, (d) sim1, contour plot corresponding to k = 4, (e) sim1, contour plot corresponding to k = 3 with the 10% percentile method, (f) sim2, contour plot corresponding to k = 4, and (g) sim2, contour plot corresponding to k = 4 with the 10% percentile method.
Table 1
Results of the leave-one-out approach applied to sim1
1 NN distance 49/50
2 NN distance 14/50
3 NN distance 3/50
4 NN distance 1/50
Fig. 7. Potential functions built from sim3: (a) k = 3, K = 15, (b) contour plots corresponding to k = 3, (c) contour plot corresponding to k = 3 with the 10% percentile method, and (d) contour plot corresponding to k = 4 with the 10% percentile method.
From the three studied simulations, it appears that the leave-one-out approach is less optimistic than the centroid method, as larger k are found with the former approach. In particular, it is strongly influenced by "lonely" points. From the study of the contour plots obtained in each case, we conclude that the centroid method leads to a better compromise between the α- and β-errors.
neighbours to compute pairs can yield satisfactory results for the optimal smoothing value, this approach was adopted. With k = 3, 0.0565% of the centroids were out of the potential surface (one centroid), while with k = 4 all centroids were included in the potential surface. The corresponding contour plots are shown in Fig. 8(a) and (b), respectively, and the contour plots obtained with the percentile method are shown in Fig. 8(c) and (d), respectively.

Fig. 8(a) and (b) show that the isolated points situated at the border of the data set create wide borders around them. This can be corrected by using the percentile method, as is confirmed in Fig. 8(c) and (d), where isolated points stand alone on an isolated potential surface, which is satisfactory, as it will enable an easier detection of inliers. With the leave-one-out approach, using k = 5 leads to 1% of the objects out of the potential surface (three objects), which leads to too wide contours.

Therefore, the conclusions from the simulated case studies are confirmed by this real case study, and a proposed strategy for the detection of outliers/inliers in a data set is as follows:

- The smoothing parameter used to build the global potential function is defined, for each object, as the k nearest neighbour Euclidean distance.
- k can be optimised by the centroid method, by building, for each object, centroids with the 10 nearest neighbours (a given pair of calibration objects being used only once).
- Once k is optimal, the final potential function is built by using the 10% percentile method.

This enables narrowing the contours around isolated points situated at the extremity of a data set. One could also optimise k by using the 10% percentile method directly. The reason why this is not proposed is the following: compare, for instance, Fig. 8(a) and (c): if the percentile method had been used to optimise the smoothing, some centroids built from the isolated border points would have been found out of the potential surface (inliers), whereas they were found inside the surface when the percentile method was not considered for the optimisation of k. Therefore, this would have led to a larger value of k.

By using this strategy for the studied data set, a potential surface can therefore be built above the calibration set, with k = 3 and the 10% percentile method, and the potential of the objects from the test set is calculated. The technique is used to detect both outliers, i.e., objects which are out of the calibration space, and inliers. As defined earlier, inliers are situated within a gap of the calibration data. This can correspond to objects with a zero potential residing within the calibration space, but also to some objects with a relatively small value of potential, situated in a low-density region. To detect the first type of inliers, and distinguish them from outliers, one can resort to visual inspection, or one could compare the objects with zero potential to the objects detected with the Mahalanobis distance method (as it was seen earlier that this method enables the detection of outliers in prediction): objects detected with both methods are outliers in prediction, while objects with a zero potential, but undetected by the Mahalanobis distance method, are inliers. To detect the second type of inliers, one should define what a small potential value is. Small values of potential can be defined, e.g., as potential values smaller than the 5% percentile potential of the calibration set. Therefore, all test objects with a potential value less than this threshold will be considered as inliers, i.e., as objects situated in less dense regions of the calibration space. The inliers and outliers in prediction of the test set are shown in Fig. 8(e) (in order not to overcrowd the plot, good test objects are not displayed). It can be seen that the proposed method leads to a quite reasonable detection of outliers/inliers.
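The decision logic of this paragraph can be summarised in a short sketch (f_test and f_cal are the global potentials of the test and calibration objects, md_outlier the Boolean flags of the Mahalanobis distance test; the 5% percentile is the threshold proposed above):

import numpy as np

def classify_test_objects(f_test, f_cal, md_outlier):
    """Label each test object as 'good', 'inlier' or 'outlier' from its potential and the MD test."""
    f_test = np.asarray(f_test, dtype=float)
    md_outlier = np.asarray(md_outlier, dtype=bool)
    small = np.percentile(f_cal, 5)                    # 5% percentile of the calibration potentials
    labels = np.full(len(f_test), 'good', dtype=object)
    labels[f_test < small] = 'inlier'                  # second type of inlier: low-density region
    labels[(f_test == 0) & ~md_outlier] = 'inlier'     # zero potential, but inside the MD ellipsoid
    labels[(f_test == 0) & md_outlier] = 'outlier'     # zero potential and flagged by the MD test
    return labels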
Only the proposed strategy was applied to the second real data set. This data set is heterogeneous, as (i) it contains two main clusters (chemically explainable), although smaller subclusters can be defined [37], and (ii) in each cluster, the density is irregular. In particular, difficulties may be expected with cluster 2. Therefore, it is a challenging data set, typical of many real applications. Because of the irregular densities in the two clusters, it is certainly better to optimise the smoothing in each cluster separately, as the value of k to optimise will probably be different in each cluster. In this case, the outliers/inliers from the whole data set are:

- objects with a zero potential for the two clusters (outliers or inliers);
- objects with a small potential value in one cluster and a zero potential value in the other cluster (inliers);
Fig. 8. First real case study: contour plots with (a) k = 3, and (b) k = 4; contour plots obtained with the 10% percentile method (c) with k = 3, and (d) with k = 4; (e) detection of inliers/outliers in the test set: inliers with a positive potential, inliers with a zero potential (*), and outliers (shown with different symbols).
Table 4
Inliers/outliers in NS1
Cluster 1 Cluster 2
[31] D. Coomans, D.L. Massart, Anal. Chim. Acta 133 (1981) 225.
[32] M.P. Derde, L. Kaufman, D.L. Massart, J. Chemometrics 3 (1989) 375.
[33] B. Wise, PLS Toolbox, version 1.5, Eigenvector Research, Manson, WA, 1997.
[34] D. Jouan-Rimbaud, D.L. Massart, R. Leardi, O.E. de Noord, Anal. Chem. 67 (1995) 4295.
[35] V. Centner, D.L. Massart, O.E. de Noord, S. de Jong, B.M. Vandeginste, C. Sterna, Anal. Chem. 68 (1996) 3851.
[36] D. Jouan-Rimbaud, B. Walczak, R.J. Poppi, O.E. de Noord, D.L. Massart, Anal. Chem. 69 (1997) 4317.
[37] L. Pasti, D. Jouan-Rimbaud, D.L. Massart, O.E. de Noord, Anal. Chim. Acta 364 (1998) 253.