IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, VOL. 7, NO. 1, FEBRUARY 2013
An Extreme Function Theory for Novelty Detection
David A. Clifton, Lei Clifton, Samuel Hugueny, David Wong, and Lionel Tarassenko
Abstract—We introduce an extreme function theory as a novel
method by which probabilistic novelty detection may be performed with functions, where the functions are represented by
time-series of (potentially multivariate) discrete observations. We
set the method within the framework of Gaussian processes (GP),
which offers a convenient means of constructing a distribution
over functions. Whereas conventional novelty detection methods
aim to identify individually extreme data points, with respect to a
model of normality constructed using examples of “normal” data
points, the proposed method aims to identify extreme functions,
with respect to a model of normality constructed using examples
of “normal” functions, where those functions are represented
by time-series of observations. The method is illustrated using
synthetic data, physiological data acquired from a large clinical
trial, and a benchmark time-series dataset.
Index Terms—Functional analysis, Gaussian processes, signal
processing algorithms.
I. INTRODUCTION
NOVELTY detection [1] is a fundamental task in anomaly detection, outlier detection, and one-class classification, in which we wish to identify whether newly-observed data are in some sense "novel" with respect to previously-observed examples. Novelty detection can be viewed as a hypothesis test, in which we wish to determine whether a previously-unseen test dataset has the same characteristics as a training set of "normal" data.
Definition I.1: Let a "training" dataset $X = \{x_i\}$, $i = 1, \ldots, N$, be a collection of examples of "normality". For the case of non-timeseries data, let the $i$th example be a $d$-dimensional point $x_i \in \mathbb{R}^d$.
Definition I.2: Let a test set be $Y = \{y_j\}$, $j = 1, \ldots, J$, where each of the test examples $y_j$ is defined as for the training data in Definition I.1.
In novelty detection, we wish to test the null hypothesis $H_0$ that the test set $Y$ has the same characteristics as the training set $X$, preferably in some probabilistic sense. A model of normality $\mathcal{M}$ is typically constructed from the training set $X$, and $H_0$ is tested by comparing the test set $Y$ to the model $\mathcal{M}$.

Manuscript received August 01, 2012; revised November 06, 2012; accepted December 01, 2012. Date of publication December 13, 2012; date of current version January 22, 2013. The work of D. Clifton was supported by the Centre of Excellence in Personalised Healthcare funded by the Wellcome Trust and EPSRC under Grant WT 088877/Z/09/Z. The work of L. Clifton and D. Wong was supported by the NIHR Biomedical Research Centre, Oxford, U.K. The work of S. Hugueny was supported by the EPSRC LSI Doctoral Training Centre, Oxford, U.K. Matlab implementations of the methods described in this paper may be downloaded from http://www.robots.ox.ac.uk/~davidc. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Ery Arias-Castro.
The authors are with the Department of Engineering Science, University of Oxford, Oxford OX3 7DQ, U.K. (e-mail: [email protected]).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/JSTSP.2012.2234081
The majority of novelty detection work in the literature uses point-wise novelty detection, in which $\mathcal{M}$ defines a decision boundary in the data space $\mathbb{R}^d$. Individual test points $y_j$ are then compared to the "normal" region of data space defined by the decision boundary, and $H_0$ is deemed to hold (and $y_j$ is classified as being "normal") if $y_j$ lies within that region. One of the most widely used of such methods is discriminative and based on the support vector machine [2], [3], in which novelty detection takes place in the high-dimensional reproducing kernel Hilbert space (RKHS) corresponding to some kernel function. Here, test points $y_j$ are classified independently from one another, by being compared to a decision boundary in the RKHS.
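To make the point-wise setting concrete, a minimal sketch of such a detector is given below. This is a hedged illustration only — a single-Gaussian density model with a quantile threshold, not the SVM of [2], [3] — and the data, threshold quantile, and test points are invented for the example:

```python
import numpy as np

# Illustrative point-wise novelty detector: fit a Gaussian "model of
# normality" to training points, then classify each test point
# independently by thresholding its density (the "decision boundary").
rng = np.random.default_rng(0)
X_train = rng.normal(0.0, 1.0, size=(500, 2))   # "normal" training data

mu = X_train.mean(axis=0)
cov = np.cov(X_train, rowvar=False)
cov_inv = np.linalg.inv(cov)

def log_density(x):
    d = x - mu
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d @ cov_inv @ d + logdet + len(mu) * np.log(2 * np.pi))

# Decision boundary: a density threshold calibrated on the training set.
threshold = np.quantile([log_density(x) for x in X_train], 0.01)

x_normal = np.array([0.1, -0.2])   # near the bulk of the training data
x_novel = np.array([6.0, 6.0])     # far outside it
print(log_density(x_normal) >= threshold)   # "normal"
print(log_density(x_novel) >= threshold)    # "novel"
```

Each test point is judged in isolation; the function-wise approach developed in this paper instead makes one decision per timeseries.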
Generalizations of this point-wise approach to novelty detection have been proposed in, for example, the states within a
hidden Markov model [4], the output of a Kalman filter [5],
or the states of a factorial switched Kalman filter [6]. More
recently, one-class classifiers using Gaussian processes (GPs)
have been proposed [7]–[10] that take a similarly point-wise approach to novelty detection, constructing a function to use as the model of normality $\mathcal{M}$, and dividing data space $\mathbb{R}^d$ into regions with high and low support, depending on whether those regions are close to those occupied by "normal" training data $X$, or not, respectively. Change-point detection has been implemented within a GP framework to identify and rectify sensor failures [11].
This paper aims to tackle the case of timeseries novelty detection, as typically occurs in the analysis of data acquired from
critical systems such as jet engines and human patients. In such
applications, the test data are a potentially multivariate timeseries that we wish to classify as being either normal or abnormal; i.e., the question may be framed “is this timeseries of
human vital signs indicative of a normal patient or a deteriorating patient?” Conventional point-wise novelty detection is
appropriate when instances of independent objects are to be
classified; e.g., the classification of different mammograms as
being “normal” or “abnormal” [12]. However, when the test data
represent a timeseries, the i.i.d. assumption typically does not
hold, and adopting a point-wise, sample-by-sample approach to
classification can result in large numbers of misclassifications,
because we are making large numbers of assumedly independent decisions (perhaps at the sampling rate of the data). Instead,
we suggest that a single decision can be taken, testing an entire
timeseries, which represents a test function. We will thus adopt
a function-wise approach to novelty detection, where the functions are represented by timeseries of discrete observations.
The GP framework offers a convenient, non-parametric method of defining a probability distribution over a Lebesgue space of functions $f: T \to \mathbb{R}$, where $T$ is Borel-measurable (often $\mathbb{R}^d$ or discretizations thereof) and where the distribution is a valid probability measure, such as a multivariate Gaussian density
1932-4553/$31.00 © 2012 IEEE
over a (potentially infinite) number of random variables (rvs).
This paper describes a principled, probabilistic approach to
functional novelty detection by considering extreme functions.
II. BACKGROUND AND NOTATION
Our work is related to that of functional data analysis (FDA),
which is an active branch of research concerning inference
where “the data are functions” [13]. Reviews of FDA [14]–[16]
identify two main approaches: (i) regularization-based methods,
in which functions are resampled (interpolated) such that observations occur on a regular sampling grid, and (ii) filtering-based
methods, in which functions are expanded onto a finite-dimensional function basis after smoothing has been applied. The
latter is the current focus of most research in FDA, where
popular tasks include using the basis coefficients for clustering
[17], principal component analysis, and linear discriminant
analysis [13], [18], often set within a GP regression framework.
However, while multiclass classification has been performed
using the latter, there is little focus on novelty detection. The closest existing work to tackling the problem considered in this paper is a hypothesis test for functional data [19], which compares two functions, analogous to a standard two-sample test between two sets of data points.
Similarly, timeseries classification has been identified as a
topic within anomaly detection [20], where many of the approaches are shared with those in the field of novelty detection
(and where the latter is most often presented as a branch of machine learning).
In common with the majority of FDA work, and in keeping
with the GP literature, we will first consider univariate timeseries (generalizing later to multivariate timeseries). Keeping to
the notation of the functional and timeseries literature, we must
redefine the notation from definitions I.1 and I.2, which were
conventional notation for the i.i.d. case, giving:
Definition II.1: Let a training set of $N$ examples of functional data (e.g., timeseries) be $X = \{x_i\}$, $i = 1, \ldots, N$. Let the $i$th example (e.g., timeseries object) be a sequence $x_i = \{x_i(t_j)\}$, $j = 1, \ldots, n_i$, of $n_i$ observations occurring at locations $\{t_j\}$. For the case of univariate timeseries, the r.v. $x \in \mathbb{R}$, and the index set of the r.v. is time.
Note that, in general, there is no requirement that each of the examples in a collection of functional data should be of the same length, and so the length of the $i$th example is denoted by $n_i$. Furthermore, in the general case, there is no requirement that the examples should consist of observations occurring at the same locations; for timeseries, this means that each example could be observed at different times.
We first consider the case of univariate GP regression to set notation, and initially define a single GP over one of the $N$ sequences of observations $x_i$. We follow the standard treatment of [21].

Let a GP prior be defined over a latent variable $f$, using, for example, a squared-exponential (SE) covariance function
$$k(t, t') = \sigma_f^2 \exp\left(-\frac{\|t - t'\|^2}{2\lambda^2}\right),$$
where $\|\cdot\|$ is the $L_2$-norm, where $\lambda$ and $\sigma_f^2$ are the length-scale in the $t$-direction and the variance of $f$, respectively, and where the mean function $m(t) = 0$. The observations $x = \{x(t_j)\}$ for this timeseries are related to the latent variables via $x(t) = f(t) + \varepsilon$, with $\varepsilon \sim \mathcal{N}(0, \sigma_n^2)$, or equivalently $x \sim \mathcal{N}(\mathbf{0}, K + \sigma_n^2 I)$, dropping the subscripts for the $i$th timeseries object, as in [21].
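The prior just described can be sketched in a few lines (a minimal Python illustration rather than the authors' Matlab; the hyperparameter values `lam`, `sf2`, and `sn2` are assumptions chosen for the example):

```python
import numpy as np

# Squared-exponential covariance with length-scale lam and signal
# variance sf2; zero mean function; i.i.d. observation noise sn2.
def se_cov(t1, t2, lam=1.0, sf2=1.0):
    d2 = (t1[:, None] - t2[None, :]) ** 2
    return sf2 * np.exp(-0.5 * d2 / lam**2)

rng = np.random.default_rng(1)
t = np.linspace(0, 5, 50)          # observation locations (times)
K = se_cov(t, t)
sn2 = 0.01                         # observation-noise variance
Ky = K + sn2 * np.eye(len(t))      # covariance of the noisy observations x

# Draw sample timeseries x ~ N(0, K + sn2 I) from the GP prior.
samples = rng.multivariate_normal(np.zeros(len(t)), Ky, size=3)
print(samples.shape)
```

Each row of `samples` is one discretely-observed realization of a function drawn from the prior.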
Following the novelty detection approach, in which a model of normality $\mathcal{M}$ is constructed from "normal" training examples, we will assume the set of training timeseries $X$ comprises "normal" timeseries. We now wish to construct a model $\mathcal{M}$ from this training set, where $\mathcal{M}$ is a GP that describes the dynamics of the whole collection of "normal" examples. For instance, each example timeseries in the training set could be a timeseries of vital-sign observations taken from a "normal" patient, as will be considered in Section VII. The hyperparameters of the GP used to represent $\mathcal{M}$ could be determined in a number of ways; this paper maximizes the joint likelihood of all timeseries in the training set, $\sum_{i=1}^{N} \log p(x_i)$ [21], an example of which will be demonstrated in Section VII. More complex examples of model construction considered in FDA include constructing a mixture of GPs [22].
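A hedged sketch of this hyperparameter selection follows: the summed log marginal likelihood over all training timeseries is maximized, here by grid search over the length-scale for brevity (gradient-based optimization is usual in practice; the grid, the fixed variance values, and the synthetic training set are assumptions for the example):

```python
import numpy as np

def se_cov(t, lam, sf2):
    d2 = (t[:, None] - t[None, :]) ** 2
    return sf2 * np.exp(-0.5 * d2 / lam**2)

# Log marginal likelihood of one timeseries under the GP (Cholesky form).
def log_ml(y, t, lam, sf2, sn2):
    Ky = se_cov(t, lam, sf2) + sn2 * np.eye(len(t))
    L = np.linalg.cholesky(Ky)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ alpha - np.log(np.diag(L)).sum()
            - 0.5 * len(y) * np.log(2 * np.pi))

rng = np.random.default_rng(2)
t = np.linspace(0, 5, 30)
Ky_true = se_cov(t, 1.0, 1.0) + 0.05 * np.eye(len(t))
train = rng.multivariate_normal(np.zeros(len(t)), Ky_true, size=20)  # N = 20

# Maximize the joint (summed) log likelihood over a small length-scale grid.
grid = [0.25, 0.5, 1.0, 2.0, 4.0]
joint = {lam: sum(log_ml(y, t, lam, 1.0, 0.05) for y in train) for lam in grid}
best_lam = max(joint, key=joint.get)
print(best_lam)
```

With enough training series, the selected length-scale recovers the value used to generate the data.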
Assuming the presence of a GP model $\mathcal{M}$, we next focus on determining how we can formulate a hypothesis test $H_0$ to classify a test timeseries $x^*$ as being "normal" (i.e., generated from $\mathcal{M}$ to some probability $P$) or otherwise "abnormal".
III. ASSIGNING PROBABILITIES TO FUNCTIONS
The formulation of our hypothesis test requires a mapping from functions to probabilities, $f \mapsto P$. A common approach used to assign a sequence of test observations $x^*$ to one of many clusters in FDA [13] is to consider the marginal likelihood given the set of inputs $t^*$ to the function:
$$p(x^* \mid t^*) = \int p(x^* \mid \mathbf{f}, t^*)\, p(\mathbf{f} \mid t^*)\, d\mathbf{f} \quad (1)$$
in which we have marginalized over the function values $\mathbf{f}$, using the GP distribution¹ over functions $p(\mathbf{f} \mid t^*)$, and where the likelihood $p(x^* \mid \mathbf{f}, t^*) = \mathcal{N}(\mathbf{f}, \sigma_n^2 I)$. The log marginal likelihood can be found in closed form as a marginalized Gaussian,
$$\log p(x^* \mid t^*) = -\tfrac{1}{2}\, x^{*\top} K_x^{-1} x^* - \tfrac{1}{2} \log |K_x| - \tfrac{n}{2} \log 2\pi \quad (2)$$
where $K_x = K + \sigma_n^2 I$. However, the resulting quantities are probability densities, not probabilities, and therefore scale with the dimensionality $n$ of the input (i.e., the number of data in the timeseries). Fig. 1 shows log marginal likelihoods for 500 sample functions drawn from an example GP model $\mathcal{M}$ with a random positive semi-definite covariance matrix, where we have evaluated the sample functions at both $n$ and $2n$ locations. It may be seen that doubling the dimensionality of the vector of function values causes a scaling of the log marginal likelihoods. Therefore, while likelihoods may be useful for some tasks (e.g., maximum likelihood approaches to parameter fitting for some fixed value of $n$), they are not

¹In keeping with the literature, we use the term distributions to refer to probability density functions (pdfs), which we will denote with lower-case letters (e.g., $p$), and which we will make distinct from distribution functions (dfs), which we will denote with upper-case letters (e.g., $F$).
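This scaling is easy to reproduce numerically (a sketch with assumed kernel and noise values; it is not the exact experiment behind Fig. 1):

```python
import numpy as np

# The log marginal likelihood (2) of functions sampled from a GP scales
# with the number of observations n, so densities evaluated at n and 2n
# locations are not directly comparable.
def se_cov(t, lam=1.0, sf2=1.0):
    d2 = (t[:, None] - t[None, :]) ** 2
    return sf2 * np.exp(-0.5 * d2 / lam**2)

def log_ml(y, Ky):
    L = np.linalg.cholesky(Ky)
    a = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return -0.5 * y @ a - np.log(np.diag(L)).sum() - 0.5 * len(y) * np.log(2 * np.pi)

rng = np.random.default_rng(3)
means = {}
for n in (50, 100):
    t = np.linspace(0, 5, n)
    Ky = se_cov(t) + 0.1 * np.eye(n)
    ys = rng.multivariate_normal(np.zeros(n), Ky, size=500)
    means[n] = float(np.mean([log_ml(y, Ky) for y in ys]))
print(means)  # magnitude grows with n
```

The mean log marginal likelihood becomes more negative as $n$ grows, illustrating why raw densities cannot be compared across functions of different length.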
Fig. 1. Histograms of densities (marginal likelihoods) for 500 sample functions generated randomly from a GP model with random covariance, evaluated at $n$ and $2n$ locations, shown in red (rightmost) and blue (leftmost), respectively.
suitable for comparing functions of arbitrary $n$, nor for direct probabilistic inference, because they are not probabilities.

Instead, we consider that the GP used as a model $\mathcal{M}$ may be used to determine the predicted values of observations $x^*$ at times $t^*$ using the GP regression framework [21]:
$$\mu^* = K_*^\top (K + \sigma_n^2 I)^{-1} x \quad (3)$$
$$\Sigma^* = K_{**} - K_*^\top (K + \sigma_n^2 I)^{-1} K_* \quad (4)$$
where $K_* = k(t, t^*)$ and $K_{**} = k(t^*, t^*)$. In keeping with our function-wise philosophy, we will consider the joint distribution over all $x^*$ and the model, to evaluate the whole test function in a single classification decision, which is the joint posterior distribution over all functions, conditioned on the model,
$$p(x^* \mid t^*, \mathcal{M}) = \mathcal{N}(\mu^*, \Sigma^*) \quad (5)$$
where the mean function and covariance matrix are given by (3) and (4), respectively.
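The predictive equations (3)–(5) can be sketched directly (a minimal single-series illustration with an assumed kernel, noise level, and synthetic data; in the paper the model is fit to a whole collection of training timeseries):

```python
import numpy as np

# GP posterior mean (3) and covariance (4) at the test locations t_star,
# conditioned on training observations (t, y).
def se_cov(a, b, lam=1.0, sf2=1.0):
    return sf2 * np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / lam**2)

rng = np.random.default_rng(4)
t = np.linspace(0, 5, 25)                     # training inputs
y = np.sin(t) + 0.1 * rng.normal(size=25)     # noisy training observations
sn2 = 0.01

t_star = np.linspace(0, 5, 40)                # test-function locations
K = se_cov(t, t) + sn2 * np.eye(25)
Ks = se_cov(t, t_star)
Kss = se_cov(t_star, t_star)

K_inv_y = np.linalg.solve(K, y)
mu_star = Ks.T @ K_inv_y                          # (3) predictive mean
Sigma_star = Kss - Ks.T @ np.linalg.solve(K, Ks)  # (4) predictive covariance
print(mu_star.shape, Sigma_star.shape)
```

Together, `mu_star` and `Sigma_star` define the multivariate Gaussian of (5) over the whole test function.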
The hypothesis test $H_0$ could initially be framed by considering the chance of obtaining the test function $x^*$ from the GP model $\mathcal{M}$. The GP offers the convenience that the joint distribution over the test data (5) is multivariate Gaussian, and so we could formulate $H_0$ by determining the probability of observing an $n$-dimensional vector from the $n$-dimensional multivariate Gaussian distribution. For low $n$ (such as $n = 1$ or 2), standard results from extreme value theory may be applied [23]. However, this is not straightforward for multivariate Gaussians of increasingly large dimensionality, as occurs in our case with functional data represented by timeseries with large $n$, and standard results cannot be used [24]. Instead, we consider forming a distribution over the densities allowed by the GP model $\mathcal{M}$.
IV. A DISTRIBUTION OVER FUNCTION DENSITIES
For some dimensionality , the GP distribution over functions is multivariate Gaussian by definition, and a test function
may be evaluated to give a density using the GP:
Definition IV.1: Let probability density be given by a pdf,
, where
, given by
(5).
We emphasize that the notation
refers to the multivariate Gaussian distribution , defined over the latent function
, drawing careful distinction between
(the multivariate Gaussian defined by the GP),
(the vector of output
values from the latent function), and (the latent function).
We note that the largest likelihood obtained from
occurs
at the mode, and has value
, where
. Therefore, densities take the range
.
Then,
Definition IV.2: Let a df
be defined over densities
, according to
(6)
where the region of integration
.
This is an integration over all those points in $\mathbb{R}^n$ that result in a density which is lower than the density $f^*$. Thus, $F(f^*)$ is the probability that a sample function generated from the GP will be "more extreme" (i.e., have lower density w.r.t. the GP) than our test function. It is important to note that $F(f^*)$ is effectively a df over level sets on the output of the multivariate Gaussian $\mathcal{G}$. That is, it is a distribution function defined over probability densities.
The integration in (6) can be evaluated in closed form by casting $\mathcal{G}$ into an equivalent function over Mahalanobis radius $r$ with respect to $\mathcal{G}$, such that
$$f^* = [(2\pi)^n |\Sigma^*|]^{-1/2} \exp(-r^2/2) \quad (7)$$
where $r^2 = (x^* - \mu^*)^\top \Sigma^{*-1} (x^* - \mu^*)$, which may be computed stably via the Cholesky decomposition of $\Sigma^*$. The integration may be performed in polar form, taking advantage of the radial symmetry of our multivariate Gaussian distribution over functions (5), which simplifies our expression for $F$:
$$F(f^*) = (2\pi)^{-n/2} \int_{\Omega} \int_{r^*}^{\infty} \exp(-r^2/2)\, r^{n-1}\, dr\, d\Omega \quad (8)$$
where $r^*$ is the Mahalanobis radius on our multivariate Gaussian distribution over functions that gives density $f^*$, via
$$r^* = \sqrt{-2 \ln\left( f^* (2\pi)^{n/2} |\Sigma^*|^{1/2} \right)} \quad (9)$$
and where the latter is obtained by rearranging (7). This explicitly shows that $F(f^*)$ is the tail mass associated with the level set defined by $f^*$ on our multivariate Gaussian $\mathcal{G}$, and is therefore the probability of observing a sample function generated from the GP of greater Mahalanobis radius than $r^*$, and hence the probability of observing a sample function with a lower probability density than $f^*$. Integration over all of the angles $\Omega$ in the radial integration (8) yields
$$F(f^*) = \frac{A_n}{(2\pi)^{n/2}} \int_{r^*}^{\infty} \exp(-r^2/2)\, r^{n-1}\, dr \quad (10)$$
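Because the squared Mahalanobis radius of an $n$-dimensional Gaussian follows a chi-square distribution with $n$ degrees of freedom, the tail mass above can be checked numerically against `scipy.stats.chi2` (a sketch with an invented covariance and test radius):

```python
import numpy as np
from scipy.stats import chi2

# Empirical check of the radial tail mass: F(f*) = P(r'^2 > r^2),
# where r'^2 ~ chi^2_n for an n-dimensional Gaussian.
rng = np.random.default_rng(5)
n = 10
A = rng.normal(size=(n, n))
Sigma = A @ A.T + n * np.eye(n)     # random positive-definite covariance
mu = np.zeros(n)

samples = rng.multivariate_normal(mu, Sigma, size=20000)
Sigma_inv = np.linalg.inv(Sigma)
r2 = np.einsum("ij,jk,ik->i", samples, Sigma_inv, samples)  # Mahalanobis^2

r_test = 4.0                         # radius of a hypothetical test function
F_closed = chi2.sf(r_test**2, df=n)  # closed-form tail mass
F_empirical = np.mean(r2 > r_test**2)
print(F_closed, F_empirical)
```

The closed-form value and the fraction of sampled functions that are "more extreme" than the test radius agree closely.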
Fig. 2. (a) Df $F$ over densities $f^*$ and resulting empirical df for densities of sample functions generated randomly from a GP, shown by red line and blue dots, respectively. Densities have been normalized by dividing through by $f_{\max}$. (b) Sample functions drawn from a GP, colored by $F(f^*)$. "Abnormal" functions are shown by dark lines. A 95% confidence region (2 standard deviations) around the mean function is shown by the shaded gray background.
where $A_n = 2\pi^{n/2} / \Gamma(n/2)$ is the total solid angle of the hypersphere in $n$ dimensions, and where $\Gamma(\cdot)$ is the Gamma function. As $r \to \infty$, the densities $f^* \to 0$, and so it will be convenient to express $F$ as a function of the Mahalanobis radius $r^*$, related to $f^*$ through (9). It will also be convenient to use the Cholesky decomposition of $\Sigma^*$ when evaluating $r^*$. After iterative integration-by-parts (see the appendix), the integral in (10) reduces to the closed form
$$F(f^*) = \frac{\Gamma(n/2,\, r^{*2}/2)}{\Gamma(n/2)} \quad (11)$$
where $\Gamma(\cdot, \cdot)$ is the upper incomplete Gamma function; the auxiliary functions arising in the iterative expansion, (12), are given in the appendix.
Fig. 2(a) shows that the predicted df $F$ over a range of densities using (11) is a close estimate of the empirical df obtained by generating sample functions randomly from an example GP. It may be seen that as densities $f^* \to 0$, the df $F(f^*) \to 0$ as required: for a given density $f^*$, the probability that a sample function generated randomly from the GP will have a lower density (i.e., be more extreme) decreases as $f^* \to 0$. Conversely, as the density tends towards its maximum value, at the mode of the Gaussian distribution over function space, $f^* \to f_{\max}$, the probability $F(f^*)$ of observing a more extreme function tends to 1.

Finally, we may use the df $F$ to assign probabilities to each function that we wish to examine, given some GP model $\mathcal{M}$.
Definition IV.3: Let a test of the null hypothesis $H_0$ be defined as the comparison of $F(f^*)$ to some threshold probability $P_{\mathrm{thresh}}$, where the null hypothesis holds if $F(f^*) \geq P_{\mathrm{thresh}}$ (thus classifying the data as being "normal"). The null hypothesis is rejected (and the test function is deemed "abnormal") when the corresponding $F(f^*) < P_{\mathrm{thresh}}$.

For example, $P_{\mathrm{thresh}} = \alpha$ would result in a hypothesis test in which functions are deemed "abnormal" if they could have been generated from the GP model $\mathcal{M}$ with a probability below significance level $\alpha$.
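The complete function-wise test of Definition IV.3 can be sketched as follows (a hedged illustration: the kernel, threshold, and the constant-offset "abnormal" function are invented for the example, and the closed form of $F$ is evaluated via the chi-square tail):

```python
import numpy as np
from scipy.stats import chi2

# A test timeseries x_star is "normal" iff F(f*) >= P_thresh, where F is
# evaluated through the Mahalanobis radius of x_star under the GP
# posterior (mu, Sigma).
def F_of_function(x_star, mu, Sigma):
    d = x_star - mu
    r2 = d @ np.linalg.solve(Sigma, d)
    return chi2.sf(r2, df=len(x_star))   # prob. of a more extreme function

rng = np.random.default_rng(6)
n = 20
t = np.linspace(0, 5, n)
Sigma = np.exp(-0.5 * (t[:, None] - t[None, :]) ** 2) + 0.05 * np.eye(n)
mu = np.zeros(n)

x_normal = rng.multivariate_normal(mu, Sigma)   # drawn from the model
x_abnormal = mu + 5.0 * np.ones(n)              # constant-offset function

P_thresh = 0.05
print(F_of_function(x_normal, mu, Sigma))
print(F_of_function(x_abnormal, mu, Sigma))
```

The entire timeseries is classified with a single decision, rather than one decision per sample.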
Fig. 2(b) shows an example in which ten sample functions have been drawn from an example GP model $\mathcal{M}$, and which are therefore all "normal", and where two other functions have been shown for comparison (black lines). Each of the sample functions has been colored according to $F(f^*)$. Those functions that lie close to the mean function take higher values of $F(f^*)$, while those that stray away from the mean function take lower values. Importantly, we note that the ten "normal" sample functions (for which the null hypothesis $H_0$ should not be rejected) take values of $F(f^*)$ well above typical thresholds, and so a hypothesis test (e.g., with $P_{\mathrm{thresh}} = \alpha$) would typically not reject $H_0$. The two functions shown by black lines, however, have $F(f^*) < P_{\mathrm{thresh}}$. Therefore, these two functions would be rejected by the hypothesis test, and the functions shown by black lines would be deemed to be "abnormal".
V. EXTREME FUNCTION DISTRIBUTIONS
Extreme value theory is a branch of statistics that considers
the distribution of extrema (such as the minimum or maximum
value observed in a set of data) in low-dimensional spaces. The
majority of the literature [25], [26] is concerned with univariate
or
have been described using
data; some extensions into
copulae to estimate the dependence between rvs. These are, by
definition, point-wise approaches, in which the single most extreme point in a set of data is considered. This point-wise approach has been extended to higher-dimensional work in [24]
and [27]. We will now consider the extension of this method to
our functional application, and show how it can be used in conjunction with
to provide identification of extreme functions, characterized by discrete observations with
,
using extreme function distributions.
Fig. 3(a) shows functions sampled randomly from the GP model $\mathcal{M}$ considered previously, where each is the most extreme of a set of $m$ sample functions, for increasing $m$, and where "most extreme" is defined as being that function with the lowest density $f^*$, as given by the $n$-dimensional multivariate Gaussian distribution in (5). With increasing $m$, as shown by the colors in Fig. 3(a), the most extreme function observed in a set of randomly-generated functions becomes increasingly more extreme, moving away from the mean function. This follows the intuition that, as we draw more and more data from the underlying distribution, we would expect the extremum of those data to become "more extreme" [24]. However, the functions generated from $\mathcal{M}$ are all "normal" functions, in that they have been generated from our GP that represents $\mathcal{M}$, and so we would like the null hypothesis $H_0$ not to be rejected. That is, they may be extreme, but they are not "abnormal" – they are extreme only because we have observed many realizations from the GP.
If we treat these functions as before, and assign probabilities to them using $F(f^*)$, we obtain the results shown in Fig. 3(b), in which most of the functions are assigned very low probabilities, $F(f^*) \approx 0$, and hence $H_0$ would be incorrectly rejected for the majority of the functions shown. However, we can adapt extreme value theory to our functional case by considering the extremes in probability density $f^*$. Given the df $F(f^*)$, which is itself univariate in densities $f^*$, we note that the most extreme sample of a set of $m$ observations from $F$ will have a density the distribution of which asymptotically converges to the Weibull df for minima [26],
$$F_e(f^*) = 1 - \exp\left\{ -\left( \frac{f^*}{c_m} \right)^{\alpha} \right\} \quad (13)$$
This is because the Fisher-Tippett theorem [28], on which extreme value theory is based, may be used to show that all non-degenerate functions² are in the domain of attraction of the generalized extreme value (GEV) distribution in their extrema, and that the limiting form of the GEV for our case, in which probability densities are truncated over the domain $(0, f_{\max}]$, is the Weibull.

The Weibull has scale and shape parameters $c_m$ and $\alpha$, respectively, for some value of $m$, which may be estimated [24] as being $c_m = F^{-1}(m^{-1})$ and $\alpha = c_m\, m\, f(c_m)$, in which $F^{-1}(m^{-1})$ is the $m^{-1}$ quantile on $F$, which may be found to arbitrary accuracy because we can use $F$ in closed form (11) via (9). Also, $f$ is the pdf associated with the df $F$, and is straightforwardly the integrand in (10), expressed over Mahalanobis radius as
$$f(r) = \frac{2^{1 - n/2}}{\Gamma(n/2)}\, r^{n-1} \exp(-r^2/2) \quad (14)$$

Fig. 3. Extreme functions generated by observing $m$ functions from a GP and selecting that with the lowest probability; (a) colored according to $m$, (b) colored according to $F(f^*)$. Plots (c) and (d) show extreme function distributions defined over densities $f^*$ and Mahalanobis radii $r$, respectively, giving predictions (blue lines) and results obtained by randomly generating extreme functions (red dots). Distributions are shown for fixed $m$, with dimensionality $n$ increasing from right to left in (c), and from left to right in (d).
Thus, using this non-standard extension of extreme value theory over a probability density, we have obtained a df $F_e$ that allows us to determine the location in density space of the densities of extreme functions, given some number of observed functions $m$. This is shown in Fig. 3(c) for an example GP, with increasing dimensionality $n$, where it may be seen that these dfs closely match the densities of extreme functions observed from random sampling of the GP.
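The idea behind (13) can be checked numerically without the Weibull approximation: the df of the extreme radius among $m$ draws is exactly the $m$th power of the single-draw df, of which the Weibull is the limiting form. A sketch, with assumed values of $n$, $m$, and the test radius:

```python
import numpy as np
from scipy.stats import chi2

# The most extreme of m functions drawn from the GP has a Mahalanobis
# radius whose df is exactly [P(r' <= r)]^m = chi2.cdf(r^2, n)^m.
rng = np.random.default_rng(7)
n, m, trials = 10, 100, 2000

# Whitened coordinates suffice: radii depend only on n, not on Sigma.
z = rng.normal(size=(trials, m, n))
r_max = np.sqrt((z**2).sum(axis=2)).max(axis=1)   # extreme radius per trial

r_grid = 5.0
exact = chi2.cdf(r_grid**2, df=n) ** m            # P(extreme radius <= r_grid)
empirical = np.mean(r_max <= r_grid)
print(exact, empirical)
```

The sampled extreme radii match the exact df closely, which is what the closed-form Weibull of (13) approximates for large $m$.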
Definition V.1: Define an extreme function distribution as being $F_e$ over Mahalanobis radii $r$ on the multivariate Gaussian posterior distribution (5), using the relation between $r$ and $f^*$ from (9). $F_e(r)$ is the probability that, if a set of $m$ functions is observed from $\mathcal{M}$, then the most extreme function of that set will have Mahalanobis radius (with respect to $\mathcal{M}$) less than $r$.

Therefore, $F_e(r)$ is the probability that a test function (which has Mahalanobis radius $r$ w.r.t. $\mathcal{M}$) is "abnormal", given a model of normality $\mathcal{M}$.
The extreme function distribution for our example $\mathcal{M}$ is shown in Fig. 3(d), again showing close agreement with the results of random sampling of functions from the GP. As required, the probability of abnormality $F_e(r)$ increases with $r$. Increasing the value of $m$ (i.e., generating more functions from $\mathcal{M}$) results in the support of $F_e$ shifting further up the $r$-axis, as expected: if we observe more functions from $\mathcal{M}$, then we are more likely to observe functions with larger Mahalanobis radius (i.e., functions that are more "extreme"), with respect to $\mathcal{M}$.
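This shift of support with $m$ is visible directly in the exact form of the extreme function distribution (used here in place of the Weibull approximation; the values of $n$ and $r$ are assumptions for the example):

```python
import numpy as np
from scipy.stats import chi2

# F_e(r | m): probability that the most extreme of m functions drawn
# from the model has Mahalanobis radius less than r.
def F_e(r, n, m):
    return chi2.cdf(r**2, df=n) ** m

n = 10
r = 4.5
for m in (1, 10, 100, 1000):
    print(m, F_e(r, n, m))   # decreases with m: the support shifts up the r-axis
```

At a fixed radius, the probability that the extremum has already exceeded that radius grows with $m$, matching the behavior described for Fig. 3(d).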
VI. HYPOTHESIS TESTING WITH EXTREME FUNCTIONS
Finally, we may return to our extreme functions previously shown in Fig. 3(b) and, instead of assigning them probabilities w.r.t. $F(f^*)$, we assign probabilities to each using the extreme function distribution for the appropriate number of functions observed in each case, $m$.
Definition VI.1: Let a hypothesis test be defined as the test comparing the extreme function distribution $F_e$ to some threshold probability $P_{\mathrm{thresh}}$, rejecting the null hypothesis $H_0$ (and thus classifying functions as "abnormal") when the probability assigned by $F_e$ falls below $P_{\mathrm{thresh}}$.

Fig. 4. (a) Extreme functions previously shown in Fig. 3, with probabilities now assigned using the extreme function distribution given the appropriate value of $m$ for each, and (b) using the extreme function distribution with fixed $m$, including two other functions shown by dark lines.

Fig. 4(a) shows the result of using the hypothesis test from Definition VI.1, in which it may be seen that each extreme function now takes probabilities well above the rejection threshold, and that the null hypothesis $H_0$ will no longer be rejected if the extreme function distribution is used (with, for example, a conventional significance level). Therefore, the sample functions will be classified as being "normal" functions w.r.t. our GP, as we would hope, because they were all generated from $\mathcal{M}$ and are thus "normal". We have therefore successfully identified extreme-but-normal functions that have arisen due to the fact that we have observed multiple functions.

Our original example in Fig. 2(b), which included two additional functions shown by dark lines, may now be revisited in light of our extreme function distributions. Setting $m$ appropriately and assigning probabilities to each function in the example using the resulting extreme function distribution gives the result shown in Fig. 4(b). Here it may be seen that the ten sample functions randomly generated from the posterior GP (shown by yellow lines) are now assigned high probabilities, indicating their obvious "normality" with respect to the GP model $\mathcal{M}$. However, one of the two functions shown by dark lines has been assigned a probability that would cause a hypothesis test based on, for example, a conventional significance level to accept $H_0$ and classify the function as being "normal". This classification is correct in this instance, because the function was generated from the GP model $\mathcal{M}$, and is the most extreme function (i.e., that with the lowest likelihood $f^*$) from a set of sample functions generated from $\mathcal{M}$.

The other function shown by a dark line is assigned a very low probability, and is therefore classified as "abnormal" by the hypothesis test based on the extreme function distribution. This is correct in this instance, because it was drawn from a GP with different hyperparameters, whereas the other functions were drawn from an otherwise-identical GP.

We have demonstrated that the proposed extension of extreme value theory to the problem considered in this paper can correctly separate functions that are extreme-but-normal (i.e., generated by a model of normality, but with a low probability) from those that are actually abnormal (i.e., those not generated by the model).

²Non-degenerate functions are those that do not assign all probability mass to a single point in their domain. The multivariate Gaussian defined by the GP is therefore non-degenerate, as is the df defined over its densities, $F$.

VII. ILLUSTRATION WITH PATIENT VITAL-SIGN DATA
We now illustrate the use of the method with data representing
patient vital-sign trajectories following upper gastro-intestinal
cancer surgery, where the patients are subsequently nursed in a
“step-down” recovery ward. This dataset comprises 154 examples of “normal” patient timeseries, in which the patients were
discharged home after a variable length-of-stay in the hospital,
and 17 examples of “abnormal” patient timeseries, in which the
patients either died or were admitted under emergency conditions to the intensive care unit (ICU). Patients in this latter category are associated with highly increased risks of mortality and
morbidity, and so the automatic determination of physiological
deterioration from timeseries of acquired data is an important
task – made more so by the fact that vital-sign observations are
taken every four hours at best [29].
Our study [30] was undertaken at the Cancer Hospital, within
the Oxford University Hospitals NHS Trust, and was granted
ethical approval by the local ethics committee.
A. Data
Fig. 5(a) shows a subset of the normal data, illustrating
15 time-series of variabilities in respiratory rate (RR) from
the “most normal” patients³, in which the differences, over a
24-hour period, between the maximum and minimum RR (as
observed by the nurses) are shown for the first 24 days after
admission to the recovery ward. It may be seen that these timeseries typically take high values immediately after discharge
from surgery, which then reduce as the patient recovers on the
ward.
For comparison, timeseries from a selection of five patients from the abnormal dataset are shown in Fig. 5(b). It may be seen that these example abnormal timeseries can take highly extreme values (such as that shown in orange, which reaches values far above the normal range); however, many examples of abnormality (such as the four shown in purple, red, light blue, and green) occupy the same range of values on the vertical axis as do the normal patients.
The dataset of 154 normal patients has a median length-of-stay of 9 days (IQR 5 days), while the dataset of 17 abnormal patients has a median length-of-stay of 5 days (IQR 4 days). There is significant overlap between the lengths-of-stay in each set, and both sets contain timeseries for all lengths-of-stay up to 24 days. The majority of the records exhibited incomplete data.

³Defined to be those closest to the median length-of-stay for normal patients.
TABLE I: MEAN (AND 1 S.D.) CLASSIFICATION ERRORS FOR PATIENT VITAL-SIGN DATA, SHOWING FALSE-POSITIVE RATE (FPR) AND FALSE-NEGATIVE RATE (FNR).
Fig. 5. Normal and abnormal patient data shown in (a) and (b), respectively, with the EFT GP model $\mathcal{M}$ shown by its mean function (in blue) and 95% confidence region (shaded pink). (a) "Normal" vital-sign trajectories. (b) ICU readmission trajectories.
B. Method

We compared five methods for performing novelty detection: (i) the popular one-class SVM formulation of [3]; (ii) the recently-proposed one-class GP classifier of [10]; (iii) a $k$-nearest neighbors ($k$-NN) approach using the Euclidean distance metric; (iv) a $k$-NN approach using a dynamic time-warping (DTW) distance [31]; and (v) our GP-based extreme function theory. We note that (i) and (ii) are point-wise approaches to novelty detection, and the Gaussian kernel was used as the default choice in each; the former is distance-based, in the RKHS of its kernel, while the latter is probabilistic. Methods (iii) and (iv) are distance-based methods, often used as benchmarks in the anomaly detection and novelty detection literature [20], [32], [33]. We denote the methods (i)-(v) from above as OC-SVM, OC-GP, $k$-NN, $k$-NN-DTW, and EFT, respectively.

Ten-fold cross-validation was used to set model parameters for each method, using a training set that comprised 75% of the available normal data (116 of 154 examples, selected randomly) and 50% of the available abnormal data (8 of the 17 examples, selected randomly). Model parameters were selected in each method to minimize the classification error⁴. The remaining 38 normal examples and 9 abnormal examples were "held out" and used as test data. The entire experiment (random selection of training data, followed by ten-fold cross-validation to set model parameters) was repeated a number of times for each classifier. Each method was trained using feature vectors comprising data up to 24 days. For those methods that require fully-specified input vectors (OC-SVM, OC-GP, $k$-NN), the data were padded with zeros where data were missing, after zero-mean, unit-variance normalization was applied (such that the missing data appear "normal", at the mean of the training set). The $k$-NN-DTW and EFT methods allow non-probabilistic and probabilistic handling of missing data, using DTW and marginalization, respectively.

C. Results

Fig. 6. (a) Normal data, corresponding to the centroid of the normal data, in the upper subplot, with the same timeseries having data removed shown in the lower subplot. Missing data are circled in red in the upper subplot. (b) DTW match of the original timeseries to the version containing missing data, where color shows DTW distance.

Results from each of the experiments, reported on
the “held out” test data in each case, are shown in Table I. It
may be seen that the best-performing techniques are those that
can adequately cope with the missing data, the
-DTW and
the proposed EFT-based method. While both of the latter exhibit lower FPR than the other methods, the proposed method
has greater success at classifying abnormal data, with a lower
FNR across all experiments. An example of DTW is shown in
Fig. 6(a), which shows one of the normal examples with data
removed for the purposes of illustration. Fig. 6(b) demonstrates
that the DTW method accurately aligns the original copy of the
normal timeseries with that containing missing data. However,
the results shown in Table I indicate that the principled treatment of missing data from the proposed method (achieved via
straightforward marginalization of the Gaussian process [21])
results in better sensitivity to “abnormal” data, with a lower
FNR.
4Classification error is defined to be false negatives (FN) + false positive
(FP). The former is an erroneous classification of an abnormal example as being
“normal”, while the latter is erroneous classification of a normal example as
being “abnormal”.
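The k-DTW baseline compared above relies on a dynamic time-warping distance between timeseries. As a minimal illustration of the standard dynamic-programming recurrence (this is a generic sketch, not the implementation of [31]; the function name is ours), and of why DTW tolerates small temporal misalignments better than a point-wise comparison:

```python
import numpy as np

def dtw_distance(a, b):
    """Classic O(n*m) dynamic-programming DTW distance between 1-D series.

    D[i, j] = |a[i] - b[j]| + min(D[i-1, j], D[i, j-1], D[i-1, j-1]).
    """
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# Identical series have zero DTW distance, and a time-shifted copy is matched
# more closely by DTW than by a point-wise (Euclidean-style) comparison,
# because the warping path may leave the diagonal.
t = np.linspace(0, 2 * np.pi, 50)
x = np.sin(t)
y = np.sin(t + 0.5)  # time-shifted copy
assert dtw_distance(x, x) == 0.0
assert dtw_distance(x, y) < np.abs(x - y).sum()
```

Note that the diagonal warping path reproduces the point-wise cost, so DTW is never larger than it; the warping effect discussed below arises precisely because the optimal path may deviate from the diagonal.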
TABLE II
MEAN (AND 1 S.D.) CLASSIFICATION ERRORS FOR UCI BENCHMARK DATA OVER THE EXPERIMENTS.
Fig. 7. Randomly selected examples from the “normal” and “abnormal”
classes in (a) and (b), respectively.
VIII. ILLUSTRATION WITH BENCHMARK DATASET
The PAMAP2 benchmark dataset [34], [35] was used from
the UCI Machine Learning Repository [36], which comprises
activity data from heart-rate and inertial measurement units,
connected to subjects who are asked to perform a protocol of
12 activities: lying, sitting, standing, ironing clothes, using a
vacuum cleaner, walking normally, “Nordic” walking, running,
cycling, ascending stairs, descending stairs, and using a skipping rope. Data were acquired for a total of 27,000 seconds, resulting in over 3.8
labeled data.
This dataset is multivariate, where we have investigated timeseries of the following features: heart-rate (at approximately 9
Hz) and average spectral power (per 5 s window of data) for
each of the three axes of a patient-worn accelerometer (at approximately 100 Hz), following [35].
The methodology from Section VII was repeated, where the
“normal” class was taken to comprise all timeseries examples
of ascending and descending stairs, and where all other activities were taken as being “abnormal” for the purposes of this investigation. Fig. 7 shows randomly-selected examples of both
the “normal” and “abnormal” classes, where it may be seen that
the former (ascending and descending stairs) exhibits particular dynamics that a novelty detector could learn, whereas the
latter seems more inconsistent (as expected, given that these are
random examples of other types of activity).
While this benchmark dataset does not suffer from the
incompleteness of data evident in the dataset considered in
Section VII, the length of each timeseries can vary. The multivariate nature of the dataset is straightforwardly used by
methods (i)-(iv). For the proposed EFT-based method, we
adopt the usual GP approach [21] of providing the kernel function with multivariate inputs.
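Providing a kernel with multivariate inputs requires no special machinery: a standard squared-exponential kernel simply consumes d-dimensional input vectors. A minimal sketch (the lengthscale, variance, and example feature values are illustrative only):

```python
import numpy as np

def se_kernel(X1, X2, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel k(x, x') = s^2 exp(-||x - x'||^2 / (2 l^2)),
    evaluated between two sets of d-dimensional inputs."""
    d2 = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

# Four 3-D inputs (e.g., heart-rate plus two accelerometer-power features).
X = np.array([[60.0, 0.1, 0.2],
              [61.0, 0.1, 0.2],
              [90.0, 0.9, 0.8],
              [60.0, 0.1, 0.2]])
K = se_kernel(X, X, lengthscale=5.0)
assert np.allclose(np.diag(K), 1.0)   # k(x, x) = variance
assert K[0, 3] == 1.0                 # identical inputs: maximal covariance
assert K[0, 2] < K[0, 1]              # dissimilar inputs: smaller covariance
```

The resulting covariance matrix is used exactly as in the univariate-input case, which is why methods built on the GP transfer directly to this multivariate dataset.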
Table II presents the results of the experiments, in which it may be seen that most methods have performed less well using this benchmark dataset compared with the patient-based analysis in Section VII. The OC-GP and k-NN methods are the least able to separate “normal” from “abnormal” classes in this instance, while the OC-SVM has achieved a low FPR. The k-DTW method performs similarly to the proposed EFT-based method, although the latter achieves a lower FNR (i.e., it is more sensitive to “abnormal” data), and has generally less variable performance (standard deviation of FNR 0.06, compared with 0.11 for the k-DTW).

Fig. 8. Examples of k-DTW resulting in misclassifications, in which (a) two “normal” examples are matched only for the latter half of the timeseries; (b) “abnormal” and “normal” timeseries are closely matched due to the warping effect introduced by DTW.
Again, the DTW procedure can occasionally perform undesirable warping, such that “normal” data are sometimes associated with larger DTW distances to “normality” than some “abnormal” timeseries. This effect is shown in Fig. 8, in which the “abnormal” timeseries in (b) matches the centroid of “normal” data more closely than the example “normal” timeseries shown in (a), with the DTW distance for the “abnormal” case in (b) being 0.67 times the DTW distance for the “normal” case in (a). Again, the GP-based method straightforwardly marginalizes over shorter timeseries to perform a classification using only the relevant input domain of the model.
IX. DISCUSSION
We have extended extreme value theory such that we may
take a function-wise approach to novelty detection, following
the FDA method in which functions are initially represented
by timeseries of discrete observations. While much of the previous work in novelty detection has concentrated on point-wise approaches (which are by far the most commonly described in the literature), these reach the limit of their usefulness in the assessment of timeseries.
GPs offer a natural probabilistic framework in which to define
distributions over a function space, and we have used the GP regression case to illustrate our method due to the convenience of
working with the Gaussian distribution, which has allowed us to find extreme function distributions in closed form. We note that while the df of densities tends asymptotically towards a parametric distribution (the Weibull, in this case), the whole approach is non-parametric, due to the GP distribution over function space [21].
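The asymptotic role played by extremal distributions can be illustrated with the simplest closed-form case, the sample maximum of bounded i.i.d. variates, whose distribution lies in the (reversed) Weibull domain of attraction. This is a generic extreme value theory illustration under our own simple assumptions, not the paper's derivation over function densities:

```python
import numpy as np

# For n i.i.d. U(0,1) draws, P(max <= x) = x**n exactly; extreme value
# theory characterizes the limiting shape of such extremal distributions
# as n grows (for bounded support, the reversed Weibull family).
rng = np.random.default_rng(42)
n, trials = 50, 200_000
maxima = rng.random((trials, n)).max(axis=1)

for x in (0.95, 0.98, 0.99):
    empirical = np.mean(maxima <= x)
    assert abs(empirical - x ** n) < 0.01   # matches the exact CDF x**n
```

The same logic, applied to densities of functions rather than to scalar draws, is what yields the Weibull form referred to above.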
Results reported in this paper suggest that the proposed
method provides slightly better performance than distance-based methods employing dynamic time-warping, while
offering the advantages of a principled probabilistic inference
framework. While DTW copes well with small quantities of
missing data, or with timeseries that differ in length by a small
amount, it is less robust when coping with noisy, incomplete
timeseries that are often encountered in practice, such as in
physiological patient monitoring. By contrast, the proposed
method copes well with missing data, using straightforward
marginalization of the GP, afforded by its consistency property, in which any subset of the random variables over which the GP is defined also has a joint distribution that is multivariate Gaussian.
A natural extension for this work would be to non-Gaussian
processes, in which the distribution over functions is not constrained to be Gaussian; e.g., a Beta process or a generalized
Pareto process. However, the GP framework is sufficiently flexible to allow a wide range of functional forms, as illustrated throughout this paper.
There is further scope for estimation of the Weibull parameters within the Bayesian GP framework, which would require
an approach using approximate inference, such as via sampling
or deterministic (variational) methods.
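As a sketch of what such a sampling approach might look like in the simplest setting, the following fits a Weibull shape parameter by random-walk Metropolis with a flat prior and the scale fixed to 1. This is a deliberately minimal illustration under our own assumptions, not the full Bayesian GP treatment envisaged above:

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic "extreme" values drawn from a Weibull with shape k=2, scale 1.
data = rng.weibull(2.0, size=500)

def log_lik(k, x):
    """Weibull(shape k, scale 1) log-likelihood."""
    return np.sum(np.log(k) + (k - 1) * np.log(x) - x ** k)

# Random-walk Metropolis over the shape parameter (flat prior on k > 0,
# symmetric Gaussian proposal, so the acceptance ratio is the likelihood ratio).
k, chain = 1.0, []
for _ in range(5000):
    prop = k + 0.1 * rng.normal()
    if prop > 0 and np.log(rng.random()) < log_lik(prop, data) - log_lik(k, data):
        k = prop
    chain.append(k)

posterior_mean = np.mean(chain[1000:])   # discard burn-in
assert 1.6 < posterior_mean < 2.4        # concentrates near the true shape
```

A practical implementation would sample the scale jointly and propagate the GP hyperparameter uncertainty, but the mechanism is the same.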
APPENDIX
IN CLOSED FORM

From (10), we apply iterative integration-by-parts to obtain cases for even and odd orders in (15) and (16), respectively, where erfc denotes the complementary Gaussian error function. These reduce to (11), as given by (17)-(21).
REFERENCES
[1] L. Tarassenko, D. Clifton, P. Bannister, S. King, and D. King, “Novelty detection,” Encyclopaedia of Structural Health Monitoring, pp.
653–675, 2009.
[2] D. Tax and R. Duin, “Support vector domain description,” Pattern
Recognit. Lett., vol. 20, pp. 1191–1199, 1999.
[3] B. Schölkopf, J. Platt, J. Shawe-Taylor, A. J. Smola, and R. C.
Williamson, “Estimating the support of a high-dimensional distribution,” Neural Comput., vol. 13, no. 7, pp. 1443–1471, 2001.
[4] N. Hughes, L. Tarassenko, and S. Roberts, “Markov models for automated ECG interval analysis,” in Advances in Neural Information
Processing Systems, S. Thrun, L. Saul, and B. Schoelkopf, Eds. Cambridge, MA: MIT Press, 2004, vol. 16, pp. 611–618.
[5] H. Lee and S. Roberts, “On-line novelty detection using the Kalman
filter and extreme value theory,” in Proc. Int. Conf. Pattern Recognit.,
2008, pp. 1–4.
[6] J. Quinn, C. Williams, and N. McIntosh, “Factorial switching linear dynamical systems applied to physiological condition monitoring,” IEEE
Trans. Pattern Anal. Mach. Intell., vol. 31, no. 9, pp. 1537–1551, Sep.
2009.
[7] H. Kim and J. Lee, J. Wang, Z. Yi, J. Zurada, B.-L. Lu, and H. Yin, Eds.,
“Pseudo-density estimation for clustering with Gaussian processes,” in
Proc. Advances in Neural Networks – ISNN 2006, Berlin, Germany,
2006, vol. 3971, Lecture Notes in Computer Science, pp. 1238–1243.
[8] R. Adams, I. Murray, and D. Mackay, “The Gaussian process density
sampler,” in Advances in Neural Information Processing Systems, D.
Koller, D. Schuurmans, Y. Bengio, and L. Bottou, Eds. Cambridge,
MA: MIT Press, 2009, vol. 21, pp. 9–16.
[9] Y. Gao and Y. Li, R. Kimmel, R. Klette, and A. Sugimoto, Eds., “Improving Gaussian process classification with outlier detection, with applications in image classification,” in Proc. Comput. Vis.—ACCV ’10,
Berlin, Germany, 2011, vol. 6495, Lecture Notes in Computer Science,
pp. 153–164.
[10] M. Kemmler, E. Rodner, and J. Denzler, R. Kimmel, R. Klette, and A.
Sugimoto, Eds., “One-class classification with Gaussian processes,” in
Proc. Comput. Vis.—ACCV ’10, Berlin, 2011, vol. 6493, Lecture Notes
in Computer Science, pp. 489–500.
[11] R. Garnett, M. A. Osborne, S. Reece, A. Rogers, and S. J. Roberts,
“Sequential Bayesian prediction in the presence of changepoints and
faults,” Comput. J., vol. 53, no. 9, pp. 1430–1446, 2010.
[12] L. Tarassenko, P. Hayton, N. Cerneaz, and M. Brady, “Novelty detection for the identification of masses in mammograms,” in Proc. 4th
IEE Int. Conf. Artif. Neural Netw., Perth, Australia, 1995, vol. 4, pp.
442–447.
[13] J. Ramsay and B. Silverman, Functional Data Analysis. New York:
Wiley, 2005.
[14] G. M. James and C. A. Sugar, “Clustering for sparsely sampled functional data,” J. Amer. Statist. Assoc., vol. 98, pp. 397–408, 2003.
[15] J. Rice, “Functional and longitudinal data analysis: Perspectives on
smoothing,” Statist. Sinica, vol. 14, pp. 631–647, 2004.
[16] H.-G. Muller, “Functional modelling and classification of longitudinal
data,” Scandinavian J. Statist., vol. 32, no. 2, pp. 223–240, 2005.
[17] F. Yao, H.-G. Muller, and J. Wang, “Functional linear regression analysis for longitudinal data,” Ann. Statist., vol. 33, no. 6, pp. 2873–2903,
2004.
[18] J. Shi and T. Choi, Gaussian Process Regression Analysis for Functional Data. London, U.K.: Chapman & Hall, 2011.
[19] P. Hall and I. van Keilegom, “Two-sample tests in functional data analysis starting from discrete data,” Statist. Sinica, vol. 17, pp. 1511–1531,
2007.
[20] V. Chandola, A. Banerjee, and V. Kumar, “Anomaly detection: A
survey,” ACM Comput. Surveys, vol. 41, no. 3, 2009.
[21] C. Rasmussen and C. Williams, Gaussian Processes for Machine
Learning. Cambridge, MA: MIT Press, 2006.
[22] J. Shi and B. Wang, “Curve prediction and clustering with mixtures of
Gaussian process functional regression models,” Statist. Comput., vol.
18, pp. 267–283, 2008.
[23] S. J. Roberts, “Extreme value statistics for novelty detection in biomedical signal processing,” IEE Proc. Sci., Technol. Measur., vol. 47, no.
6, pp. 363–367, 2000.
[24] D. Clifton, S. Hugueny, and L. Tarassenko, “Novelty detection with
multivariate extreme value statistics,” J. Signal Process. Syst., vol. 65,
pp. 371–389, 2011.
[25] P. Embrechts, C. Kluppelberg, and T. Mikosch, Modelling Extremal
Events, 4th ed. Berlin, Germany: Springer-Verlag, 2003.
[26] L. de Haan and A. Ferreira, Extreme Value Theory. Berlin, Germany:
Springer-Verlag, 2006.
[27] S. Hugueny, D. Clifton, and L. Tarassenko, “Probabilistic patient monitoring with multivariate, multimodal extreme value theory,” Communicat. Comput. Sci., vol. 127, pp. 199–211, 2011.
[28] R. A. Fisher and L. H. C. Tippett, “Limiting forms of the frequency
distributions of the largest or smallest members of a sample,” in Proc.
Cambridge Philosoph. Soc., 1928, vol. 24.
[29] L. Tarassenko, D. Clifton, M. Pinsky, M. Hravnak, J. Woods, and
P. Watkinson, “Centile-based early warning scores derived from
statistical distributions of vital signs,” Resuscitation, vol. 82, no. 8, pp.
1013–1018, 2011.
[30] L. Clifton, D. Clifton, M. Pimentel, P. Watkinson, and L. Tarassenko,
“Gaussian processes for personalised e-health monitoring with wearable sensors,” IEEE Trans. Biomed. Eng., vol. 60, no. 1, pp. 193–197,
Jan. 2012.
[31] K. Buza, “Fusion methods for time-series classification,” Ph.D. dissertation, Univ. of Hildesheim, Hildesheim, Germany, 2011.
[32] M. Markou and S. Singh, “Novelty detection: A review – Part 1: Statistical approaches,” Signal Process., vol. 83, no. 12, pp. 2481–2497,
2003.
[33] M. Markou and S. Singh, “Novelty detection: A review – Part 2:
Neural network based approaches,” Signal Process., vol. 83, no. 12,
pp. 2499–2521, 2003.
[34] A. Reiss and D. Stricker, “Introducing a new benchmarked dataset for
activity monitoring,” in Proc. 16th IEEE Int. Symp. Wearable Comput.,
Newcastle, U.K., 2012.
[35] A. Reiss and D. Stricker, “Creating and benchmarking a new dataset
for physical activity monitoring,” in Proc. 5th Workshop Affect and
Behavior Related Assistance, Crete, Greece, 2012.
[36] A. Frank and A. Asuncion, UCI Machine Learning Repository 2010
[Online]. Available: http://archive.ics.uci.edu/ml
David A. Clifton is a Research Fellow at Mansfield
College, Oxford and a College Lecturer at Balliol
College, Oxford. He received an MEng degree
in Engineering Mathematics from the University
of Bristol and a DPhil degree in Information
Engineering from the University of Oxford. His
research interests are in statistical signal processing,
particularly in biomedical informatics and other
biomedical applications. His doctoral research led to
patented methods for monitoring the jet engines of
the Eurofighter Typhoon, the Airbus A380, and the
Boeing 787 Dreamliner, and he has won the nPower Science, Engineering and
Technology (SET) award at the UK Houses of Parliament and the J.A. Lodge
award for biomedical engineering by the IET.
Lei Clifton is a post-doctoral research assistant
at the Institute of Biomedical Engineering, in the
Department of Engineering Science at the University
of Oxford. She received BSc and MSc degrees in
Electrical Engineering from the Beijing Institute of
Technology, China, and a PhD degree in Information
Engineering from UMIST. Her research interests
include the use of statistical machine learning for
health informatics and physiological monitoring.
Samuel Hugueny received an MSc degree in Computer Engineering from the École Nationale Supérieure de Techniques Avancées, Paris (ENSTA-ParisTech), and is currently studying for a DPhil
degree at the Institute of Biomedical Engineering in
the Department of Engineering Science, University
of Oxford. His research interests include medical
imaging, signal processing, novelty detection, and
patient monitoring.
David Wong is a BRC post-doctoral research
assistant at the Institute of Biomedical Engineering
in Oxford. He graduated from the University of
Oxford with MEng and DPhil degrees, for the
latter of which his research focussed on developing
systems to detect deterioration in hospital patient
vital-signs using machine learning signal processing
techniques. His recent interests include determining
vital-sign trends using time-series analysis. David
is currently involved in a number of clinical studies
to assess the effect of computer-assisted systems
in increasing patient safety, and reducing nursing workload during routine
hospital observations.
Lionel Tarassenko received the B.A. degree in engineering science in 1978 and the Ph.D. degree in medical engineering in 1985, both from the University
of Oxford. He has held the Chair in Electrical Engineering at Oxford University since October 1997. He
was elected to a Fellowship of the IEE in 1996, when
he was also awarded the IEE Mather Premium for
his work on neural networks, and to a Fellowship of
the Royal Academy of Engineering (RAE) in 2000.
Prof. Tarassenko was awarded the Silver Medal of the
Royal Academy of Engineering for his contribution
to British engineering in 2006, and was awarded the CBE for services to engineering in 2011.