Eurachem Uncertainty of Qualitative Methods
COMMERCIAL-IN-CONFIDENCE
EURACHEM/CITAC Guide:
The Expression of Uncertainty
in Qualitative Testing
LGC/VAM/2003/048
Eurachem/CITAC conference Discussion Paper
Introduction
Part 3: Examples
Example 1: A Reported Study on Identification Certainty in Mass Spectrometry Using Database Searching
Example 2: Chance Matches When Using an IR Database
Example 3: Sample databases for assessing drug identification performance
References
Introduction
The problem of the reliability associated with qualitative testing has received relatively little
coverage in the literature compared to that afforded to measurement uncertainty. While a few
authors have addressed this area,1-3 much still remains to be done.
Uncertainty in relation to qualitative analysis is a topic of current interest to the EURACHEM
Measurement Uncertainty Working Group. This report is intended to stimulate debate within the
WG and to aid the development of policy in this area. The report is presented in two parts. The first
part consists of a general overview of the main issues while the second part explores the use of
several measures of reliability in more detail with an emphasis on the use of false response rates.
Part 1 of this paper comprises a Eurachem discussion paper first published in Accreditation and
Quality Assurance. It aims to describe the present state of the art and to give an indication of what
may reasonably be expected from laboratories, for example by accreditation bodies.
Part 2 sets out a range of existing measures of uncertainty in identification. In combination, these
two parts could form the basis for formal Eurachem guidance on the topic. Comment is accordingly
invited.
* Partial class membership is used extensively in “fuzzy logic” systems, but the relevant terminology and treatment are very rare in ordinary testing activities.
information is, as a result, typically probabilistic in nature. That is, one gives an indication of the
probability of a given classification being correct.
The most familiar and widely used form of such information is, at present, the use of false response
rates, particularly “false positive rates” and “false negative rates”.
Probably the most important alternative to simple statements of false response rates is the use of
values derived from Bayes’ theorem (a summary of Bayes’ theorem is given in reference 2).
Examples include likelihood ratio (an indication of the additional information provided by a test
result) and posterior probability, an indication of the probability of an object fitting a given category
given a test result. Bayesian estimates are particularly widely used in evaluating forensic evidence,
for example DNA matching or blood group matching. Further details can be found elsewhere [ref. 2
and references cited therein]. Bayesian estimates can be calculated by appropriate combination of
false positive and false negative rates.
Note: If normal distributions are assumed, probabilities fall off very sharply with increasing distance
between threshold levels and typical levels of response. However, very considerable caution is advisable in
extrapolating much beyond 95% confidence bounds. Due to such factors as human error, it is generally
observed in routine measurements that the probabilities of very extreme results, though still quite low, are
nonetheless many orders of magnitude higher than would be expected on the basis of the normal
distribution.
• A few sectors have started to use Bayesian probability estimates in assessing the performance of
qualitative tests; the forensic sector is probably the most advanced. Even here, direct reporting of
probability information is rare because of uncertainties in the various terms needed for the
estimate.
• Though there are publications on qualitative test failure probabilities and risks in the specialist
literature, few laboratories can be expected to have access to the wide range of journals
involved. Further, such papers tend to be written for specialists, and are accordingly not easy to
implement for a routine test lab. There is thus little detailed and accessible guidance available to
the general laboratory population.
• There is often sectoral or more general guidance on good practice in qualitative testing, and
while this may not address uncertainty directly, it typically addresses other issues associated
with quality control and assurance for the type of tests involved.
11 Future developments
There is a need to standardise the nomenclature relating to false response rates. There is an
additional need to provide accessible and consistent guidance on the study of qualitative test
performance.
EURACHEM is pursuing both these ends through the measurement uncertainty working group, and
hopes to obtain wider input from other international organisations.
12 Implications
1. It is realistic to expect that testing laboratories have qualitative test method parameters (conditions
of testing) under adequate control. Evidence of that will typically involve
• clear evidence of traceability for the values of important control parameters prescribed by the
method
• evidence that uncertainties in these parameters are sufficiently small for the purpose
2. It is important for laboratories to check at least the most critical false response rate for a
qualitative test.
3. It is reasonable to expect laboratories to be following published codes of best practice in
qualitative testing where they are available.
4. Quantitative (i.e. numerical) reports of uncertainties in qualitative test results should not generally
be expected.
13 Summary
A number of different measures of reliability for methods of qualitative analysis have been
investigated. It is evident that the nomenclature for these measures is confusing and that different
measures tell the analyst different things. Some guidance on when to use the different reliability
measures would be useful. A further point is that the large amounts of practical experimental data
required will generally be expensive to obtain, while inferences drawn from smaller data samples
will have limited reliability.
14 Introduction
In analytical science, the purpose of qualitative analysis is to classify materials. In order to do this
the materials of interest must first be detected. The ability of an analytical method to detect a target
material depends upon the amount of the material which is present in the analytical system as well as
upon the performance characteristics of the analytical method. Thus, if the aim of an analysis is to
determine whether or not a particular substance is present, it will be necessary to specify a minimum
concentration which must be capable of detection.
Just as it is possible to make an erroneous identification of a person under poor observation
conditions so too is it possible to make an erroneous identification of a material submitted for
qualitative analysis. It is hence desirable to provide users of qualitative analysis results with some
indication of the reliability of an identification.
The degree of confidence in the correctness of an identification can be expressed in a number of
ways. For a given test method, the basic properties that need to be measured are the numbers of true
positive and negative results and the numbers of false positive and negative results obtained on a
range of samples. From these numbers the fundamental measures of reliability viz. the false positive
and false negative rates can be calculated. Several other measures can also be derived from these
numbers (Table 1). The false positive and negative rates can be combined into a single figure
expressed by the Bayesian likelihood ratio. If the analyst is able to quantify his initial degree of
belief in the outcome of a test applied to a particular sample − before the test is applied − then a
further reliability measure in the form of a Bayesian posterior probability can be calculated. One
other important method parameter which needs to be determined is the limit of detection; knowing
this enables the analyst to select a method capable of satisfying the customer’s requirement relating
to minimum detectable amount.
Table 1: Measures of reliability

Sensitivity = TP / (TP + FN)
Specificity = TN / (TN + FP)
Efficiency  = (TP + TN) / (TP + TN + FP + FN)

Where: TP = number of true positives; FP = number of false positives; TN = number of true negatives; FN = number of false negatives.
In Table 1 the terms sensitivity and specificity are used in the clinical chemistry sense viz.: sensitivity
is the fraction of true positive results obtained when a test is applied to positive samples (it is the
probability that a positive sample is identified as such); specificity is the fraction of true negative
results obtained when a test is applied to negative samples (it is the probability that a negative
sample is identified as such).
For the purposes of this study, the following definitions, based on those of AOAC, apply:
True positive: Results obtained using the confirmatory technique and another
analytical technique are both positive.
True negative: Results obtained using the confirmatory technique and another
analytical technique are both negative.
False positive: Result obtained using the confirmatory technique is negative but
that obtained using another analytical technique is positive.
False negative: Result obtained using the confirmatory technique is positive but
that obtained using another analytical technique is negative.
The false positive and false negative rates referred to in these definitions are based on those defined
in the AOAC Research Institute Policies & Procedure document and commonly employed by
clinical chemists viz.:
False positive rate (%) = (false positives × 100) / total known negatives

False negative rate (%) = (false negatives × 100) / total known positives
Since false positive/negative rates are interpreted as probabilities for the purpose of calculating the
Likelihood Ratio they are expressed as fractions rather than as percentages in Table 1.
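The relationships between these measures can be sketched in a few lines of code. The counts used below are purely illustrative, not taken from any study in this paper; note that sensitivity = 1 − false negative rate and specificity = 1 − false positive rate when the rates are expressed as fractions.

```python
# Illustrative sketch: reliability measures computed from counts of
# true/false positive and negative results (hypothetical counts).

def reliability_measures(tp, fp, tn, fn):
    """Return the basic reliability measures as fractions."""
    return {
        "false_positive_rate": fp / (tn + fp),  # fraction of known negatives called positive
        "false_negative_rate": fn / (tp + fn),  # fraction of known positives called negative
        "sensitivity": tp / (tp + fn),          # P(positive result | positive sample)
        "specificity": tn / (tn + fp),          # P(negative result | negative sample)
        "efficiency": (tp + tn) / (tp + tn + fp + fn),
    }

r = reliability_measures(tp=90, fp=5, tn=95, fn=10)
print(r["sensitivity"])   # 0.9
print(r["specificity"])   # 0.95
```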
As well as depending on the cut-off value, the number of false negatives is also influenced by the
level of the analyte and the distribution of values that could be obtained at a given level. For high
levels of analyte the likelihood of false negatives will be very low and for low levels of analyte it
will be relatively higher. The false negative rate therefore depends upon the distribution of analyte
values in the population being sampled.
Table 2: Effect of cut-off level on false response rates at low levels of analyte
Table 2 illustrates the effect of cut-off level on the false response rates at low levels of analyte. The
levels are expressed in arbitrary concentration units and the blank is assumed to have a mean of 0
and a standard deviation of 1. The actual levels are assumed to have means as indicated and standard
deviations of 1. For each cut-off level and actual level, the table entries show the proportion of
results falling below the cut-off level (false negatives) and the proportion of blank results falling
above the cut-off level (false positives). It can be seen that, for a given cut-off level, the false
positive rate is constant but the false negative rate decreases, as would be expected, with increasing
analyte concentration.
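Under the normal assumptions described above (blank with mean 0 and standard deviation 1, analyte distributions with standard deviation 1), the two proportions follow directly from the standard normal cumulative distribution function. A minimal sketch; the cut-off of 3 concentration units used below is illustrative, not a value prescribed by the text:

```python
import math

def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2)))

def false_rates(cutoff, analyte_mean, blank_mean=0.0, sd=1.0):
    """False positive rate: proportion of blank results above the cut-off.
    False negative rate: proportion of analyte results below the cut-off."""
    fp = 1.0 - norm_cdf((cutoff - blank_mean) / sd)
    fn = norm_cdf((cutoff - analyte_mean) / sd)
    return fp, fn

fp, fn = false_rates(cutoff=3.0, analyte_mean=3.0)
# For a fixed cut-off the false positive rate depends only on the blank,
# while the false negative rate falls as the analyte level rises.
```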
The second mechanism corresponds to the problem of committing type 1 (false positive) and type 2
(false negative) errors and the analyst must decide on a suitable balance between the two. Raising
the cut-off level reduces the probability of obtaining false positives but increases that of obtaining
false negatives − and conversely. These ideas are illustrated in Figure 1.
Estimation of the false response rates of a method should ideally be designed into the method
validation studies. At this stage the analyte and detection technique would of course be known but a
study should ensure that an adequate range of matrices, likely to be encountered in practice, is
covered. A confirmatory detection technique will also need to be selected and a method
incorporating it validated. Given that the number of false responses should ideally be low, the
problem arises of how many samples to test to be reasonably sure of finding a non-zero number of
false responses. One way of doing this is to model the problem as a set of Bernoulli trials − see
below.
From published information (see, for example, Ferrara5) it is evident that false positive or negative
rates can be as low as 0.5% and in some cases even lower. For a range of false response
probabilities, Table 3 shows the number of samples that would need to be analysed in order to be
certain, to at least the confidence levels indicated, of finding one or more false responses.
Table 3: Number of samples needed to find one or more false responses

Probability    95% confidence    99% confidence
0.005          598               919
0.01           299               459
0.05           59                90
[Figure: two response distributions along the concentration axis, a blank centred at x1 = 0 and an analyte distribution centred at x2 > 0, with a cut-off level separating the "negative" and "positive" response regions; the tails crossing the cut-off form the false negative and false positive regions.]

Figure 1: False response rates from distributions
In attempting to determine false response rates experimentally for a new method/analyte, the analyst
is faced with a dilemma. On the one hand, since, for a given method, he does not know the false
response rate of interest − this is what he is trying to determine − he cannot decide on an appropriate
number of samples to analyse in order to be reasonably sure of detecting a false response. On the
other hand, if he simply kept on testing until the first false response occurred this would not
necessarily give a true picture of the false response rate (a false response could occur in the first
experiment and then not again for a further 500 experiments!).
To get round this problem, it is suggested that the analyst decides in advance on tolerable levels for
the two false response rates. For a chosen confidence level, he can then calculate, via a binomial
distribution, the number of experiments needed to find one or more false responses. This approach is
not guaranteed to produce an exact figure for the false response rate but it will at least place a bound
on it. For example, if the analyst decides that a 5% false positive rate is acceptable and, if after
performing 59 experiments (Table 3) covering the likely range of matrices, no false positives are
found, then he can be reasonably certain that the true false positive rate is not greater than 5%. It is
further recommended, as a quality control measure, that the samples be interspersed with blanks and
standards containing the target analyte just above and below the method detection limit. When, as is
usual, observed responses which correspond with expectation are not confirmed, the analyst should
be aware that some of them may be false responses. When not all observations are confirmed,
calculated false response rates should be treated with caution. It should always be remembered that
false response rates cannot be viewed as exact values since they depend very much upon the vagaries
of the population being sampled and also upon the method of sampling.
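The Bernoulli-trial argument reduces to a one-line calculation: the probability of seeing no false response in n independent tests is (1 − p)^n, so n is chosen to make that at most 1 − C for confidence level C. A sketch that reproduces the entries of Table 3:

```python
import math

def samples_needed(p, confidence):
    """Smallest n such that the probability of finding at least one false
    response reaches the given confidence level, treating each sample as
    an independent Bernoulli trial with false-response probability p:
    solve (1 - p)**n <= 1 - confidence for n."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p))

for p in (0.005, 0.01, 0.05):
    print(p, samples_needed(p, 0.95), samples_needed(p, 0.99))
# 0.005 598 919
# 0.01 299 459
# 0.05 59 90
```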
From Table 3, it can be seen that, for low false response rates, it may be impractical to analyse a
sufficient number of samples to detect a false response. Accordingly, if a test is cheap to operate
and/or is intended to be used with high sample numbers, e.g. as a drugs screening test, it may be
preferable to establish first that the false response rate does not exceed an upper limit, say 5%, by
experiment, and then to refine this figure in the light of experience with further samples. Where
sample numbers are likely to be relatively low and/or the test is expensive to apply, there may be
little choice but to run the test in parallel with a confirmatory test (on all results!) and, from time to
time, recalculate the false response rates based on the experimental results.
16 Additional Costs
Positive test results are routinely confirmed by an independent method wherever the analyte is
normally expected to be absent from the sample matrix. Negative test results are not usually
confirmed, since this would add to costs. Similarly, if an analyte is normally expected to be present,
a negative result would be confirmed but not a positive one. In adopting this policy analysts make
the assumption that, in the first case, the negative results found are true negatives, and, in the second
case, that the positive results found are true positives. This assumption may, on occasion, be
incorrect but the analyst will never know. The point here is that, in order to calculate, say, a false
positive rate, it is necessary to know the number of true negatives. Thus, if false response rates are to
be determined reasonably accurately then all test results must be independently confirmed and
additional costs must therefore be incurred.
18 Selectivity
Selectivity, in the sense in which this term is usually employed in analytical chemistry, refers to the
ability of a method to discriminate between different components of a sample. It is particularly
important when several components of a sample are similar with respect to the property being
measured.
19 Relevance
In many cases where qualitative analysis is performed there is a requirement for confirmatory tests
to be applied. This is particularly so when analysis is being performed for forensic purposes, for
medical diagnostic purposes or where important financial or safety-critical consequences hinge on
the result. In short, where the perceived consequences of an incorrect identification are seen as
serious, confirmatory tests will be carried out as a matter of course. In these situations the analyst
will be as certain as he can be of the reported identification and any measure of identification
certainty should be so high as to be effectively redundant. A quality metric associated with a
measurement/classification is only of use when it has the potential to influence a decision based on
the measurement/classification.
Identification certainty is of use where the correctness of a presumptive identification is not critical
but where the cost of confirmation is high. The end user of the result can see that the classification
does not purport to be completely accurate and is further provided with a quantification of the degree
of doubt attached to it. Alternatively, the analyst can use an identification certainty value [for a
classification], in conjunction with a rule or a set of criteria, to decide whether to carry out a
confirmatory analysis.
Formula               Definition
FP / Pobs             The fraction of observed positive results which are false.
FP / (Pobs + Nobs)    The fraction of all results which are false positives.
FP / (TP + FN)        The number of false positive results obtained for each true positive result.
FP / (TN + FP)        The fraction of true negative results which respond as positive.

Where: Pobs = number of observed positive results (true + false); Nobs = number of observed negative results (true + false).
There is thus ample scope for confusion when the terms discussed here are employed by different
analytical sectors. AOAC International defines the false positive/negative rates in the clinical
chemistry sense used here but the definitions are not stated in Official Methods of Analysis,7 their
publication to which analysts will most likely have access. The other main international body to
which analysts might turn for guidance is IUPAC but the IUPAC Commission on Analytical
Nomenclature in their recent 1995 report8 did not address this area of terminology at all. This may be
a topic that EURACHEM could add to their other work on the clarification of terminology.
21 Summary
The fundamental measures of reliability for methods of qualitative analysis providing presumptive
results are the false positive and false negative rates. A number of other measures can be calculated
directly from these or from their basic component parts (the numbers of true and false positive and
negative results observed in a sufficiently long series of trials).
If reliability is to be expressed in terms of false response rates then both the false positive and false
negative rates must be quoted if a true picture of method performance is to be obtained.
Alternatively, both the sensitivity and specificity could be quoted. The Efficiency and Youden
indices individually combine all of the information carried by the false positive and false negative
rates (and also by the sensitivity and specificity); thus only one of these unitary measures need be
quoted in place of one of the other associated pairs. The Likelihood Ratio also provides a single
measure of method performance.
Current practice is to subject samples to confirmatory tests only when the presumptive result is
contrary to what would be expected from a reference population. This practice is driven by economic
considerations but can lead to erroneous results being reported when an expected, and hence
unconfirmed, result is in fact wrong.
Part 3: Examples
Example 1: A Reported Study on Identification Certainty in Mass
Spectrometry Using Database Searching
The Webb and Carter study discusses earlier works by Sphon who investigated the minimum
number of ions that needed to be monitored in order to produce an unambiguous identification of
diethylstilboestrol (DES). Data relating to Sphon’s most recent study, based on a mass spectral
library containing about 270,000 entries, is presented in Table 1.
Table 1: Library matches for DES ions

Ions monitored: m/z (% RA range)             Matches
268 (1-100)                                  9995
268 (1-100), 239 (1-100)                     5536
268 (90-100), 239 (10-90)                    46
268 (90-100), 239 (50-70)                    9
268 (90-100), 239 (50-90), 145 (5-90)        15
268 (90-100), 239 (50-70), 145 (45-65)       1

RA = Relative Abundance
Table 1 shows that, when the relative abundance of each ion is considered, the number of matches
occurring in a database can be reduced dramatically.
Although Sphon and others have recommended the use of a minimum of three ions for identification,
the European Union (EU) requires a minimum of four ions when testing for veterinary drug
residues in cattle. The more stringent EU requirement has sometimes proved difficult to achieve and,
it is suspected, has led to a number of false negative results being reported.
Webb and Carter, in a similar study based on a NIST database containing 62235 spectra, confirmed
the results of Sphon and extended these through the inclusion of additional compounds of interest in
the forensic and agro-chemical fields. A subset of their results is presented in Table 2.
Using the criteria of De Ruig et al, the number of possible ions in a mass spectrum is 300 (from the
m/z range 180-480). The number of combinations of 300 objects taken 3 at a time is
300! / ((300 − 3)! 3!) = 4,455,100. Hence, for any three peaks, the chance match probability is taken
to be ~1 : 4.5 × 10^6.
The approach of De Ruig takes no account of the intensity information in a mass spectrum. However,
in the studies described above, such information has been utilised in order to reduce the number of
matches to one. The effective chance match probability is therefore very much lower than the figure
calculated.
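The combinatorial count above is a direct binomial coefficient and can be verified in one line:

```python
import math

# Number of ways to choose 3 peak positions from the 300 possible
# m/z values considered by De Ruig et al.
combinations = math.comb(300, 3)
print(combinations)   # 4455100

# Chance match probability for any particular set of three peaks,
# i.e. ~1 : 4.5 million
chance_match = 1 / combinations
```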
The use of database statistics in evaluating criteria for qualitative analysis has been investigated by
several authors. De Ruig et al [4] proposed criteria to be met by identification methods employed in
veterinary drug residue identification. The authors give indicative values of chance match
probabilities based on a simple binomial model. Ellison et al [5], commenting on this paper, noted
that a hypergeometric distribution was a more appropriate model. The latter authors focused on the
occurrence of chance matches when an infrared spectrum is compared against a spectral library.
The library used by Ellison et al was the Sadtler library containing spectra on 59,626 different
materials. A random subset of thirty compounds was selected from this library and the number of
peaks, q, in the range 500 - 1800 cm-1 noted for each compound. It was determined that the average
number of peaks per spectrum in the region 500-1800 cm-1, m, was 16. The spectral resolution
available was 4 cm-1, implying the existence of p = 1300/4 = 325 discrete peak positions in the
500-1800 cm-1 region. For each different spectrum in the chosen subset, the entire database was
searched twice − first for a minimum of three matching peaks and the second time for a minimum of
six matching peaks. The number of matches for each compound was compared with the number
predicted on the basis of a hypergeometric distribution. For n ≥ 3 the number of observed matches
was about twice the predicted number. For n ≥ 6, although the number of matches was considerably
lower, as would be expected, the observed matches exceeded the predicted by a factor of ten. Part of
the data for six peak matches is presented in Table 1.
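The hypergeometric model can be sketched as follows: with p = 325 possible peak positions, a random library spectrum carrying m = 16 peaks, and an unknown with q peaks, the number of coinciding peaks follows a hypergeometric distribution. The choice of q = 16 below is illustrative (the study recorded a different q for each compound), so no specific probability from the paper is reproduced here.

```python
import math

def p_at_least(n, q, m=16, p=325):
    """Hypergeometric model: probability that at least n of the unknown's
    q peaks coincide with the m peaks of a random library spectrum,
    given p discrete peak positions."""
    total = math.comb(p, q)
    return sum(math.comb(m, k) * math.comb(p - m, q - k)
               for k in range(n, min(q, m) + 1)) / total

# Chance of >= 3 and >= 6 coincident peaks for an unknown with 16 peaks
p3 = p_at_least(3, q=16)
p6 = p_at_least(6, q=16)
```

Multiplying such a probability by the number of entries in the library (59,626 for the Sadtler library) gives the predicted number of chance matches, which is the quantity the observed match counts were compared against.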
The calculated chance match probabilities for six peak matches were in the range 10^-8 to 10^-10. The
chance match probability for a compound when multiplied by the number of entries in the database
gives an estimate of the number of compounds fitting the search criteria. In the case of two of the
compounds in Table 1, viz. Caproic acid, isobutyl ester and 1-Bromoadamantane, the search criteria
produce a single match and hence would appear to be adequate if these compounds are suspected.
For the remaining compounds, many more matches are produced which indicates a requirement for
more stringent criteria.
As stressed in the main body of this guide, reference databases − of which spectral libraries are one
type − cannot be used to obtain information on false response rates. It is the responsibility of the
analyst to decide which, if any, of a set of matches corresponds to an unknown.
In the case of methadone, the Bayesian posterior probability is 0.988. In other words the analyst can
be over 98% certain that a positive response for methadone genuinely indicates the presence of this
drug.
Table 2 shows similar data for a different, non-immunochemical, technique. Note that the false
positive rate for cocaine by this technique is reported as zero. It is debatable however whether the
false response rates for such screening tests can truly be zero. In this case no false positives were
found but, had more samples been analysed, it is possible that one or more false positives would
have appeared.
Considering methadone again, the Bayesian posterior probability is 0.960. This is a high probability
though slightly less convincing than that produced by the EMIT test. If both tests are performed, and
a positive response obtained in each case, then the combined Bayesian probability becomes 0.999.
In this example, reliable prior probabilities are available in the form of prevalence values. Had these
not been to hand, or if the analyst had preferred not to use them, likelihood ratios could have been
used instead; the corresponding values are 246 (EMIT) and ~68 (Toxi-Lab), giving a combined
likelihood ratio of 16,830.
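These calculations follow the odds form of Bayes' theorem: posterior odds = prior odds × likelihood ratio, with independent tests multiplying their likelihood ratios. A minimal sketch; the prevalence of 0.25 used below is purely an illustrative assumption (it is broadly consistent with the posterior probabilities quoted, but the paper's actual prevalence values are not reproduced here).

```python
def posterior_probability(prior, likelihood_ratio):
    """Odds form of Bayes' theorem: posterior odds = prior odds * LR,
    converted back to a probability."""
    odds = prior / (1.0 - prior) * likelihood_ratio
    return odds / (1.0 + odds)

PREVALENCE = 0.25            # illustrative assumption, not from the paper
LR_EMIT, LR_TOXILAB = 246, 68

p_emit = posterior_probability(PREVALENCE, LR_EMIT)               # ~0.988
p_both = posterior_probability(PREVALENCE, LR_EMIT * LR_TOXILAB)  # >0.999
```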
In all cases GC-MS was used as a reference technique to establish the false response rates. The
particular database referred to here is quite comprehensive for the analytes of interest to the authors,
and has clearly been designed to permit a Bayesian analysis of the data. There are inevitably some
missing values but, as more data is added, these should be reduced in number and the accuracy of
predictions further improved.
A further advantage of a database set up to record Bayesian input data for several different
techniques is the information it provides to enable one to optimise analytical performance. For
example, by selecting a screening method with a low false positive rate this should minimise the
costs of expensive confirmatory analyses. However, other factors also need to be taken into account
such as the limit of detection of a technique, its false negative rate, and the speed of analysis.
References
1. de Ruig, W.G.; Dijkstra, G.; Stephany, R.W. Anal. Chim. Acta 1989, 223, 277-82.
3. Ellison, S.L.R.; Gregory, S.; Hardcastle, W.A. Analyst 1998, 123, 1155-61.
5. Ferrara, S.D.; Tedeschi, L.; Frison, G.; Brusini, G.; Castagna, F.; Bernadelli, B.; Soregaroli, D. J. Anal. Toxicol. 1994, 18, 278.
6. The Fitness for Purpose of Analytical Methods, 1.0 ed.; Eurachem, 1998.