An Extreme Function Theory for Novelty Detection

2013

We introduce an extreme function theory as a novel method by which probabilistic novelty detection may be performed with functions, where the functions are represented by time-series of (potentially multivariate) discrete observations. We set the method within the framework of Gaussian processes (GP), which offers a convenient means of constructing a distribution over functions. Whereas conventional novelty detection methods aim to identify individually extreme data points, with respect to a model of normality constructed using examples of "normal" data points, the proposed method aims to identify extreme functions, with respect to a model of normality constructed using examples of "normal" functions, where those functions are represented by time-series of observations. The method is illustrated using synthetic data, physiological data acquired from a large clinical trial, and a benchmark time-series dataset.

IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, VOL. 7, NO. 1, FEBRUARY 2013

An Extreme Function Theory for Novelty Detection
David A. Clifton, Lei Clifton, Samuel Hugueny, David Wong, and Lionel Tarassenko

Index Terms—Functional analysis, Gaussian processes, signal processing algorithms.

I. INTRODUCTION

Novelty detection [1] is a fundamental task in anomaly detection, outlier detection, and one-class classification, in which we wish to identify whether newly-observed data are in some sense "novel" with respect to previously-observed examples. Novelty detection can be viewed as a hypothesis test, in which we wish to determine whether a previously-unseen test dataset has the same characteristics as a training set of "normal" data.

Definition I.1: Let a "training" dataset X = {x_1, ..., x_N} be a collection of examples of "normality". For the case of non-timeseries data, let the i-th example x_i be a d-dimensional point, x_i ∈ R^d.

Definition I.2: Let a test set be Z = {z_1, ..., z_M}, where each test example z_j ∈ R^d is defined as for the training data in Definition I.1.
In novelty detection, we wish to test the null hypothesis H_0 that the test set Z has the same characteristics as the training set X, preferably in some probabilistic sense. A model of normality M is typically constructed from the training set X, and H_0 is tested by comparing the test set Z to the model M. The majority of novelty detection work in the literature uses point-wise novelty detection, in which M defines a decision boundary in the data space R^d. Individual test points z_j are then compared to the "normal" region of data space defined by the decision boundary, and H_0 is deemed to hold (and z_j is classified as being "normal") if z_j falls within that region. One of the most widely used of such methods is the discriminant based on the support vector machine [2], [3], in which novelty detection takes place in the high-dimensional reproducing kernel Hilbert space (RKHS) corresponding to some kernel function.

Manuscript received August 01, 2012; revised November 06, 2012; accepted December 01, 2012. Date of publication December 13, 2012; date of current version January 22, 2013. The work of D. Clifton was supported by the Centre of Excellence in Personalised Healthcare funded by the Wellcome Trust and EPSRC under Grant WT 088877/Z/09/Z. The work of L. Clifton and D. Wong was supported by the NIHR Biomedical Research Centre, Oxford, U.K. The work of S. Hugueny was supported by the EPSRC LSI Doctoral Training Centre, Oxford, U.K. Matlab implementations of the methods described in this paper may be downloaded from http://www.robots.ox.ac.uk/~davidc. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Ery Arias-Castro. The authors are with the Department of Engineering Science, University of Oxford, Oxford OX3 7DQ, U.K. (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/JSTSP.2012.2234081
Here, test points z_j are classified independently from one another, by being compared to a decision boundary in the RKHS. Generalizations of this point-wise approach to novelty detection have been proposed in, for example, the states within a hidden Markov model [4], the output of a Kalman filter [5], or the states of a factorial switched Kalman filter [6]. More recently, one-class classifiers using Gaussian processes (GPs) have been proposed [7]-[10] that take a similarly point-wise approach to novelty detection, constructing a function to use as the model of normality M, and dividing data space into regions of high support and low support, depending on whether those regions are close to those occupied by "normal" training data X, or not, respectively. Change-point detection has been implemented within a GP framework to identify and rectify sensor failures [11]. This paper aims to tackle the case of timeseries novelty detection, as typically occurs in the analysis of data acquired from critical systems such as jet engines and human patients. In such applications, the test data are a potentially multivariate timeseries that we wish to classify as being either normal or abnormal; i.e., the question may be framed as "is this timeseries of human vital signs indicative of a normal patient or a deteriorating patient?" Conventional point-wise novelty detection is appropriate when instances of independent objects are to be classified; e.g., the classification of different mammograms as being "normal" or "abnormal" [12]. However, when the test data represent a timeseries, the i.i.d. assumption typically does not hold, and adopting a point-wise, sample-by-sample approach to classification can result in large numbers of misclassifications, because we are making large numbers of assumedly independent decisions (perhaps at the sampling rate of the data). Instead, we suggest that a single decision can be taken, testing an entire timeseries, which represents a test function.
We will thus adopt a function-wise approach to novelty detection, where the functions are represented by timeseries of discrete observations. The GP framework offers a convenient, non-parametric method of defining a probability distribution over a Borel-measurable space of functions (often a Lebesgue space, or discretizations thereof), equipped with a valid probability measure, such as a multivariate Gaussian density over a (potentially infinite) number of random variables (rvs). This paper describes a principled, probabilistic approach to functional novelty detection by considering extreme functions.

II. BACKGROUND AND NOTATION

Our work is related to that of functional data analysis (FDA), which is an active branch of research concerning inference where "the data are functions" [13]. Reviews of FDA [14]-[16] identify two main approaches: (i) regularization-based methods, in which functions are resampled (interpolated) such that observations occur on a regular sampling grid, and (ii) filtering-based methods, in which functions are expanded onto a finite-dimensional function basis after smoothing has been applied. The latter is the current focus of most research in FDA, where popular tasks include using the basis coefficients for clustering [17], principal component analysis, and linear discriminant analysis [13], [18], often set within a GP regression framework. However, while multiclass classification has been performed using the latter, there is little focus on novelty detection. The closest existing work to tackling the problem considered in this paper is a two-sample test for functional data [19], which compares two functions, analogously to the standard two-sample test between two sets of data points.
Similarly, timeseries classification has been identified as a topic within anomaly detection [20], where many of the approaches are shared with those in the field of novelty detection (and where the latter is most often presented as a branch of machine learning). In common with the majority of FDA work, and in keeping with the GP literature, we will first consider univariate timeseries (generalizing later to multivariate timeseries). Keeping to the notation of the functional and timeseries literature, we must redefine the notation from Definitions I.1 and I.2, which were conventional notation for the i.i.d. case, giving:

Definition II.1: Let a training set of N examples of functional data (e.g., timeseries) be Y = {y_1, ..., y_N}. Let the i-th example (e.g., the i-th timeseries object) be a sequence of observations y_i occurring at locations x_i. For the case of univariate timeseries, each observation is a scalar rv, and the index set of the rv is time.

Note that, in general, there is no requirement that each of the examples in a collection of functional data should be of the same length, and so the length of the i-th example is denoted by n_i. Furthermore, in the general case, there is no requirement that the examples should consist of observations occurring at the same locations; for timeseries, this means that each example could be observed at different times.

We first consider the case of univariate GP regression to set notation, and initially define a single GP over one of the sequences of observations. We follow the standard treatment of [21]. Let a GP prior be defined over a latent variable f, using, for example, a squared-exponential (SE) covariance function k(x, x') = σ_f² exp(−|x − x'|² / 2λ²), where |·| is the L²-norm, where λ and σ_f² are the length-scale in the x-direction and the variance of f, respectively, and where the mean function is taken to be zero. The observations y for this timeseries are related to the latent variables f via y = f + ε, with ε ∼ N(0, σ_n²), dropping the subscript i for the i-th timeseries object, as in [21].
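The GP prior and observation model above can be sketched numerically. The following is an illustrative sketch only, not the authors' Matlab implementation; the hyperparameter values, grid, and function names are our own assumptions.

```python
import numpy as np

def se_cov(x1, x2, length_scale=1.0, sigma_f=1.0):
    """Squared-exponential covariance k(x, x') = sigma_f^2 exp(-|x - x'|^2 / (2 lambda^2))."""
    d2 = (x1[:, None] - x2[None, :]) ** 2
    return sigma_f ** 2 * np.exp(-0.5 * d2 / length_scale ** 2)

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)                # observation locations (times)
K = se_cov(x, x)                          # prior covariance of the latent f
sigma_n = 0.1                             # observation noise std dev
Ky = K + sigma_n ** 2 * np.eye(len(x))    # covariance of the noisy observations y = f + eps
y = rng.multivariate_normal(np.zeros(len(x)), Ky)  # one sample "normal" timeseries
```

Each draw of `y` is one realization of a timeseries from the zero-mean GP prior, corrupted by i.i.d. Gaussian observation noise.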
Following the novelty detection approach, in which a model of normality is constructed from "normal" training examples, we will assume that the set of training timeseries comprises only "normal" timeseries. We now wish to construct a model M from this training set, where M is a GP that describes the dynamics of the whole collection of "normal" examples. For instance, each example timeseries in the training set could be a timeseries of vital-sign observations taken from a "normal" patient, as will be considered in Section VII. The hyperparameters of the GP used to represent M could be determined in a number of ways; this paper maximizes the joint likelihood of all timeseries in the training set [21], an example of which will be demonstrated in Section VII. More complex examples of model construction considered in FDA include constructing a mixture of GPs [22]. Assuming the presence of a GP model M, we next focus on determining how we can formulate a hypothesis test to classify a test timeseries as being "normal" (i.e., generated from M to some probability) or otherwise "abnormal".

III. ASSIGNING PROBABILITIES TO FUNCTIONS

The formulation of our hypothesis test requires a mapping from functions to probabilities. A common approach used in FDA to assign a sequence of test observations y* to one of many clusters [13] is to consider the marginal likelihood given the set of inputs x* to the function:

p(y* | x*) = ∫ p(y* | f*) p(f* | x*) df*   (1)

in which we have marginalized over the function values f*, using the GP distribution¹ over functions p(f* | x*), and where the likelihood p(y* | f*) = N(f*, σ_n² I). The log marginal likelihood can be found in closed form as a marginalized Gaussian,

log p(y* | x*) = −(1/2) y*ᵀ K_y⁻¹ y* − (1/2) log |K_y| − (n/2) log 2π   (2)

where K_y = K + σ_n² I. However, the resulting quantities are probability densities, not probabilities, and therefore scale with the dimensionality n of the input (i.e., the number of data in the timeseries). Fig.
1 shows log marginal likelihoods for 500 sample functions drawn from an example GP model with a random positive semi-definite covariance matrix, where we have evaluated the sample functions at some dimensionality n and again at dimensionality 2n. It may be seen that doubling the dimensionality of the vector of function values causes a scaling of the log marginal likelihoods. Therefore, while likelihoods may be useful for some tasks (e.g., maximum-likelihood approaches to parameter fitting for some fixed value of n), they are not suitable for comparing functions of arbitrary n, nor for direct probabilistic inference, because they are not probabilities.

¹In keeping with the literature, we use the term distributions to refer to probability density functions (pdfs), which we will denote with lower-case letters, and which we will make distinct from distribution functions (dfs), which we will denote with upper-case letters.

Fig. 1. Histograms of densities (marginal likelihoods) for 500 sample functions generated randomly from a GP model with random covariance, evaluated at dimensionalities n and 2n, shown in red (rightmost) and blue (leftmost), respectively.

Instead, we consider that the GP used as a model M may be used to determine the predicted values of observations y* at times x* using the GP regression framework [21]:

μ* = K*ᵀ K_y⁻¹ y   (3)

Σ* = K** − K*ᵀ K_y⁻¹ K* + σ_n² I   (4)

In keeping with our function-wise philosophy, we will consider the joint distribution over all y* and the model, to evaluate the whole test function in a single classification decision, which is the joint posterior distribution over all functions, conditioned on the model,

p(y* | x*, M) = N(μ*, Σ*)   (5)

where the mean function μ* and covariance matrix Σ* are given by (3) and (4), respectively. The hypothesis test could initially be framed by considering the chance of obtaining the test function y* from the GP model M.
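The standard GP regression predictive equations can be sketched as follows; this is the textbook Cholesky-based computation from Rasmussen and Williams, not the authors' code, and the SE kernel and its hyperparameters are illustrative assumptions.

```python
import numpy as np

def se_cov(x1, x2, length_scale=1.0, sigma_f=1.0):
    """Squared-exponential covariance."""
    d2 = (x1[:, None] - x2[None, :]) ** 2
    return sigma_f ** 2 * np.exp(-0.5 * d2 / length_scale ** 2)

def gp_posterior(x_train, y_train, x_test, sigma_n=0.1):
    """Joint Gaussian posterior over noisy test observations y*:
    predictive mean (cf. eq. 3) and covariance (cf. eq. 4), giving
    the multivariate Gaussian of eq. (5)."""
    K = se_cov(x_train, x_train) + sigma_n ** 2 * np.eye(len(x_train))
    Ks = se_cov(x_train, x_test)
    Kss = se_cov(x_test, x_test)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    v = np.linalg.solve(L, Ks)
    mu_star = Ks.T @ alpha                                   # predictive mean
    Sigma_star = Kss - v.T @ v + sigma_n ** 2 * np.eye(len(x_test))  # covariance of noisy y*
    return mu_star, Sigma_star
```

A whole test timeseries is then evaluated against the single multivariate Gaussian `N(mu_star, Sigma_star)`, rather than point by point.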
The GP offers the convenience that the joint distribution over the test data (5) is multivariate Gaussian, and so we could formulate the hypothesis test by determining the probability of observing an n-dimensional vector y* from the n-dimensional multivariate Gaussian distribution. For low n (such as n = 1 or 2), standard results from extreme value theory may be applied [23]. However, this is not straightforward for multivariate Gaussians of increasingly large dimensionality, as occurs in our case with functional data represented by timeseries, and standard results cannot be used [24]. Instead, we consider forming a distribution over the densities allowed by the GP model M.

IV. A DISTRIBUTION OVER FUNCTION DENSITIES

For some dimensionality n, the GP distribution over functions is multivariate Gaussian by definition, and a test function y* may be evaluated to give a density using the GP:

Definition IV.1: Let the probability density of a test function y* be given by a pdf, G(y*) = N(y* | μ*, Σ*), given by (5).

We emphasize that this notation refers to the multivariate Gaussian distribution N(μ*, Σ*) defined over the outputs of the latent function, drawing careful distinction between N (the multivariate Gaussian defined by the GP), y* (the vector of output values from the latent function), and f (the latent function). We note that the largest likelihood obtained from G occurs at the mode, and has value G_max = (2π)^{−n/2} |Σ*|^{−1/2}. Therefore, densities take the range G ∈ [0, G_max].

Definition IV.2: Let a df F be defined over densities G, according to

F(G(y*)) = ∫_R N(y | μ*, Σ*) dy   (6)

where the region of integration is R = {y : N(y | μ*, Σ*) < G(y*)}. This is an integration over all those points in R^n that result in a density which is lower than the density G(y*). Thus, F(G(y*)) is the probability that a sample function generated from the GP will be "more extreme" (i.e., have lower density w.r.t. the GP) than our test function y*. It is important to note that F is effectively a df over level sets on the output of the multivariate Gaussian N(μ*, Σ*). That is, it is a distribution function defined over probability densities.
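Definition IV.2 can be checked by Monte Carlo, and compared against the closed form that follows from the polar-coordinate argument developed in the next section of the paper: for a multivariate Gaussian, the squared Mahalanobis radius follows a chi-square distribution with n degrees of freedom, so the tail mass outside a level set is a chi-square upper tail. The dimensionality and covariance below are illustrative assumptions.

```python
import numpy as np
from scipy.stats import multivariate_normal, chi2

rng = np.random.default_rng(1)
n = 5                                      # dimensionality of the test function
A = rng.standard_normal((n, n))
Sigma = A @ A.T + n * np.eye(n)            # a random positive-definite covariance
mvn = multivariate_normal(mean=np.zeros(n), cov=Sigma)

y_star = rng.multivariate_normal(np.zeros(n), Sigma)  # outputs of one "test function"
g_star = mvn.pdf(y_star)                              # its density G(y*) under the GP

# Monte Carlo estimate of F(G(y*)): fraction of sample functions with lower density
samples = rng.multivariate_normal(np.zeros(n), Sigma, size=200_000)
F_mc = np.mean(mvn.pdf(samples) < g_star)

# Closed form: the squared Mahalanobis radius is chi^2 with n degrees of freedom
r2 = y_star @ np.linalg.solve(Sigma, y_star)
F_exact = chi2.sf(r2, df=n)
```

The two estimates should agree to within Monte Carlo error, illustrating that F is a well-defined probability even when n is too large for standard multivariate extreme value results.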
The integration in (6) can be evaluated in closed form by casting it into an equivalent integration over Mahalanobis radius r with respect to N(μ*, Σ*), such that

r² = (y − μ*)ᵀ Σ*⁻¹ (y − μ*) = 1ᵀ(z ∘ z),   z = L⁻¹(y − μ*)   (7)

in which L is the Cholesky decomposition of Σ*, and where ∘ denotes the Hadamard (elementwise) matrix product. The integration may be performed in polar form, taking advantage of the radial symmetry of our multivariate Gaussian distribution over functions (5), which simplifies our expression for F:

F(G(y*)) = P(r > r*)   (8)

where r* is the Mahalanobis radius on our multivariate Gaussian distribution over functions that gives density G(y*), via

G(y*) = G_max exp(−r*²/2),   r* = √(−2 log(G(y*)/G_max))   (9)

and where the latter is obtained by rearranging the former. This explicitly shows that F(G(y*)) is the tail mass associated with the level set defined by r* on our multivariate Gaussian, and is therefore the probability of observing a sample function generated from the GP with greater Mahalanobis radius than r*, and hence the probability of observing a sample function with a lower probability density than G(y*). Integration over all of the angles in the radial integration (8) yields

F(G(y*)) = S_n (2π)^{−n/2} ∫_{r*}^∞ exp(−r²/2) r^{n−1} dr   (10)

where S_n = 2π^{n/2} / Γ(n/2) is the total solid angle of the hypersphere in n dimensions, and where Γ is the Gamma function. As n → ∞, the densities G → 0, and so it will be convenient to express densities relative to their maximum, G/G_max ∈ [0, 1]. It will also be convenient to use the Cholesky decomposition Σ* = L Lᵀ. After iterative integration-by-parts (see the appendix), (10) may be written in closed form, giving expressions (11) and (12) in terms of functions that are given in the appendix.

Fig. 2. (a) Df over densities F(G) and the resulting empirical df for densities of sample functions generated randomly from a GP, shown by red line and blue dots, respectively. Densities have been normalized by dividing through by G_max. (b) Sample functions drawn from a GP, colored by F(G). "Abnormal" functions are shown by dark lines. A 95% confidence region (2 standard deviations) around the mean function is shown by the shaded gray background.

Fig.
2(a) shows that the predicted df over a range of densities using (11) is a close estimate of the empirical df obtained by generating sample functions randomly from an example GP. It may be seen that as densities G → 0, the df F(G) → 0 as required: for a given density G, the probability that a sample function generated randomly from the GP will have a lower density (i.e., be more extreme) decreases as G → 0. Conversely, as the density tends towards its maximum value G_max, at the mode of the Gaussian distribution over function space, the probability of observing a more extreme function tends to 1. Finally, we may use the df F to assign probabilities to each function that we wish to examine, given some GP model M.

Definition IV.3: Let a test of the null hypothesis H_0 be defined as the comparison of F(G(y*)) to some threshold probability κ, where the null hypothesis holds if F(G(y*)) ≥ κ (thus classifying the data as being "normal"). The null hypothesis is rejected (and the test function is deemed "abnormal") when the corresponding F(G(y*)) < κ. For example, κ = 0.01 would result in a hypothesis test in which functions are deemed "abnormal" if they could have been generated from the GP model with a probability below significance level κ.

Fig. 2(b) shows an example in which ten sample functions have been drawn from an example GP model M, and which are therefore all "normal", and where two other functions have been shown for comparison (black lines). Each of the sample functions has been colored according to F(G(y*)), given M. Those functions that lie close to the mean function take higher values of F, while those that stray away from the mean function take lower values. Importantly, we note that the ten "normal" sample functions (for which the null hypothesis should not be rejected) take values of F well above the threshold, and so a hypothesis test (e.g., with κ = 0.01) would typically not reject H_0. The two functions shown by
black lines, however, take values of F below κ; these two functions would therefore be rejected by such a hypothesis test, and would be deemed to be "abnormal".

V. EXTREME FUNCTION DISTRIBUTIONS

Extreme value theory is a branch of statistics that considers the distribution of extrema (such as the minimum or maximum value observed in a set of data) in low-dimensional spaces. The majority of the literature [25], [26] is concerned with univariate data; some extensions into higher dimensions have been described using copulae to estimate the dependence between rvs. These are, by definition, point-wise approaches, in which the single most extreme point in a set of data is considered. This point-wise approach has been extended to higher-dimensional work in [24] and [27]. We will now consider the extension of this method to our functional application, and show how it can be used in conjunction with F to provide identification of extreme functions, characterized by n discrete observations, using extreme function distributions.

Fig. 3(a) shows functions sampled randomly from the GP model considered previously, where each is the most extreme of a set of m sample functions, for increasing m, and where "most extreme" is defined as being that function with the lowest density G, as given by the n-dimensional multivariate Gaussian distribution in (5). With increasing m, as shown by the colors in Fig. 3(a), the most extreme function observed in a set of m randomly-generated functions becomes increasingly more extreme, moving away from the mean function. This follows the intuition that, as we draw more and more data from the underlying distribution, we would expect the extremum of those data to become "more extreme" [24]. However, all of the functions generated from M are "normal" functions, in that they have been generated from our GP that represents normality, and so we would like the null hypothesis not to be rejected.
That is, they may be extreme, but they are not "abnormal" – they are extreme only because we have observed many realizations from the GP. If we treat these functions as before, and assign probabilities to them using F, we obtain the results shown in Fig. 3(b), in which most of the functions are assigned very low probabilities, and hence H_0 would be incorrectly rejected for the majority of the functions shown. However, we can adapt extreme value theory to our functional case by considering the extremes in probability density G.

This is possible because the Fisher-Tippett theorem [28], on which extreme value theory is based, may be used to show that all non-degenerate functions² are in the domain of attraction of the generalized extreme value (GEV) distribution in their extrema, and that the limiting form of the GEV for our case, in which probability densities are truncated over the domain [0, G_max], is the Weibull. The Weibull has scale and shape parameters for some value of m, which may be estimated [24] from a quantile on F; this quantile may be found to arbitrary accuracy because we have F in closed form (11). The pdf associated with the df F, given as (14), is straightforwardly the integrand in (10).

Fig. 3. Extreme functions generated by observing m functions from a GP and selecting that with the lowest probability density; (a) colored according to F, (b) colored according to the extreme function distribution. Plots (c) and (d) show extreme function distributions respectively defined over densities G and Mahalanobis radii r, giving predictions (blue lines) and results obtained by randomly generating extreme functions (red dots). Distributions are shown for several values of m, with dimensionality increasing from right to left in (c), and from left to right in (d).
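The failure mode just described can be reproduced numerically: for "normal" draws from the GP, F(G) is uniformly distributed on [0, 1], so the most extreme of m functions is assigned a probability near 1/(m+1), far below any fixed significance threshold. A small simulation sketch (dimensionality and m are illustrative assumptions):

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(2)
n, m, trials = 10, 100, 2000

# For y drawn from the GP, the squared Mahalanobis radius is chi^2_n,
# so F = chi2.sf(r2, n) is Uniform(0, 1) for "normal" sample functions.
r2 = rng.chisquare(n, size=(trials, m))
F = chi2.sf(r2, df=n)

# Probability assigned to the most extreme of each set of m normal functions:
F_min = F.min(axis=1)          # mean is approximately 1/(m+1), i.e. far below kappa
```

The minimum of m uniform variates is Beta(1, m) distributed, so `F_min.mean()` concentrates near 1/(m+1) ≈ 0.01 here, which is why a fixed-κ test on F incorrectly rejects extreme-but-normal functions.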
Given the df F, which is itself univariate in densities G, we note that the most extreme sample of a set of m observations from F will have a density whose distribution asymptotically converges to the Weibull df for minima [26], given as (13). Thus, using this non-standard extension of extreme value theory over a probability density, we have obtained a df that allows us to determine the location in density space of the densities of extreme functions, given some number m of observed functions. This is shown in Fig. 3(c) for increasing dimensionality, where it may be seen that these dfs closely match the densities of extreme functions observed from random sampling of the GP.

Definition V.1: Define an extreme function distribution F_e as being defined over Mahalanobis radii r on the multivariate Gaussian posterior distribution (5), using the relationship between G and r from (9). F_e(r) is the probability that, if a set of m functions is observed from M, then the most extreme function of that set will have Mahalanobis radius (with respect to M) less than r. Therefore, F_e(r*) is the probability that a test function y* (which has Mahalanobis radius r* w.r.t. M) is "abnormal", given a model of normality M.

The extreme function distribution for our example is shown in Fig. 3(d), again showing close agreement with the results of random sampling of functions from the GP. As required, the probability of abnormality increases with r. Increasing the value of m (i.e., generating more functions from M) results in the support of F_e shifting further up the r-axis, as expected: if we observe more functions from M, then we are more likely to observe functions with larger Mahalanobis radius (i.e., functions that are more "extreme") with respect to M.

VI. HYPOTHESIS TESTING WITH EXTREME FUNCTIONS

Finally, we may return to our extreme functions previously shown in Fig. 3(b) and, instead of assigning them probabilities w.r.t. F, we assign probabilities for each using the extreme function distributions F_e for the appropriate number of functions m observed in each case.
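Definition V.1 can be sketched directly from order statistics: the most extreme of m independent sample functions has radius below r exactly when all m do. The paper works with the asymptotic Weibull form of this distribution; the exact finite-m form below is our own illustrative stand-in, not the authors' estimator.

```python
import numpy as np
from scipy.stats import chi2

def efd(r, n, m):
    """Extreme function distribution F_e(r): probability that the most extreme of
    m sample functions from an n-dimensional GP posterior has Mahalanobis radius
    less than r. Exact order-statistic form; the paper uses its Weibull limit."""
    return chi2.cdf(np.asarray(r, dtype=float) ** 2, df=n) ** m

def p_more_extreme(r, n, m):
    """Probability of seeing at least one function as extreme as radius r among m draws."""
    return 1.0 - efd(r, n, m)
```

A test function with radius r* would then be flagged "abnormal" when `efd(r_star, n, m)` is close to 1 (equivalently, `p_more_extreme` falls below the significance level κ): even the most extreme of m normal functions would rarely reach that radius.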
Definition VI.1: Let a hypothesis test be defined as the test comparing the probability derived from the extreme function distribution F_e to some threshold probability κ, rejecting the null hypothesis (and thus classifying functions as "abnormal") for those functions whose probability under F_e falls below κ.

²Non-degenerate functions are those that do not assign all probability mass to a single point in their domain. The multivariate Gaussian defined by the GP is therefore non-degenerate, as is the df F defined over its densities.

Fig. 4. (a) Extreme functions previously shown in Fig. 3, with probabilities now assigned using the extreme function distribution F_e given the appropriate value of m for each, and (b) using F_e, including two other functions shown by dark lines.

Fig. 4(a) shows the result of using the hypothesis test from Definition VI.1, in which it may be seen that each extreme function now takes probabilities above the threshold, and that the null hypothesis will no longer be rejected if F_e is used (with, for example, κ = 0.01). Therefore, the sample functions will be classified as being "normal" functions w.r.t. our GP, as we would hope, because they were all generated from M and are thus "normal". We have therefore successfully identified extreme-but-normal functions that have arisen due to the fact that we have observed multiple functions.

Our original example in Fig. 2(b), which included two additional functions shown by dark lines, may now be revisited in light of our extreme function distributions. Setting m appropriately and assigning probabilities to each function in the example using the resulting extreme function distribution gives the result shown in Fig. 4(b). Here it may be seen that the ten sample functions randomly generated from the posterior GP (shown by yellow lines) are now assigned high probabilities, indicating their obvious "normality" with respect to the GP model M. However, one of the two functions shown by dark lines has been assigned a probability that would cause a hypothesis test based on, for example, κ = 0.01 to accept H_0 and classify the function as being "normal". This classification is correct in this instance, because the function was generated from the GP model M, being the most extreme function (i.e., that with the lowest likelihood G) from a set of sample functions generated from M. The other function shown by a dark line is assigned a very low probability, and is therefore classified "abnormal" by a hypothesis test based on F_e. This is correct in this instance, because it was drawn from a GP with different hyperparameters, whereas the other functions were drawn from an otherwise-identical GP. We have demonstrated that the proposed extension of extreme value theory to the problem considered in this paper can correctly separate functions that are extreme-but-normal (i.e., generated by a model of normality, but with a low probability) from those that are actually abnormal (i.e., those not generated by the model).

VII. ILLUSTRATION WITH PATIENT VITAL-SIGN DATA

We now illustrate the use of the method with data representing patient vital-sign trajectories following upper gastro-intestinal cancer surgery, where the patients are subsequently nursed in a "step-down" recovery ward. This dataset comprises 154 examples of "normal" patient timeseries, in which the patients were discharged home after a variable length-of-stay in the hospital, and 17 examples of "abnormal" patient timeseries, in which the patients either died or were admitted under emergency conditions to the intensive care unit (ICU). Patients in this latter category are associated with highly increased risks of mortality and morbidity, and so the automatic determination of physiological deterioration from timeseries of acquired data is an important task – made more so by the fact that vital-sign observations are taken every four hours at best [29].
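Section VII uses, as each patient's timeseries, the daily range of nurse-charted respiratory rate (described in Section VII-A as the difference between maximum and minimum RR over each 24-hour period). A hypothetical sketch of that feature extraction follows; the function and argument names are our own, not the authors'.

```python
import numpy as np

def daily_rr_range(times_h, rr, n_days=24):
    """Daily variability feature: max(RR) - min(RR) within each 24-hour window.
    times_h: observation times in hours since admission; rr: respiratory rates.
    Days with no observations are returned as NaN (missing data)."""
    feats = np.full(n_days, np.nan)
    day = (np.asarray(times_h) // 24).astype(int)
    rr = np.asarray(rr, dtype=float)
    for d in range(n_days):
        vals = rr[day == d]
        if vals.size:
            feats[d] = vals.max() - vals.min()
    return feats
```

Representing missing days as NaN matches the setting below, where the GP-based method handles absent observations by marginalization rather than imputation.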
Our study [30] was undertaken at the Cancer Hospital within the Oxford University Hospitals NHS Trust, and was granted ethical approval by the local ethics committee.

A. Data

Fig. 5(a) shows a subset of the normal data, illustrating 15 timeseries of variabilities in respiratory rate (RR) from the "most normal" patients³, in which the differences, over a 24-hour period, between the maximum and minimum RR (as observed by the nurses) are shown for the first 24 days after admission to the recovery ward. It may be seen that these timeseries typically take high values immediately after discharge from surgery, which then reduce as the patient recovers on the ward. For comparison, timeseries from a selection of five patients from the abnormal dataset are shown in Fig. 5(b). It may be seen that these example abnormal timeseries can take highly extreme values (such as that shown in orange); however, many examples of abnormality (such as the four shown in purple, red, light blue, and green) occupy the same range of values on the vertical axis as do the normal patients.

The dataset of 154 normal patients has a median length-of-stay of 9 days (IQR 5 days), while the dataset of 17 abnormal patients has a median length-of-stay of 5 days (IQR 4 days). There is significant overlap between the lengths-of-stay in each set, and both sets contain timeseries for all lengths-of-stay up to 24 days. The majority of the datasets exhibited incomplete data within their record.

³Defined to be those closest to the median length-of-stay for normal patients.

TABLE I. MEAN (AND 1 S.D.) CLASSIFICATION ERRORS FOR PATIENT VITAL-SIGN DATA OVER REPEATED EXPERIMENTS, SHOWING FALSE-POSITIVE RATE (FPR) AND FALSE-NEGATIVE RATE (FNR).

Fig. 5. Normal and abnormal patient data shown in (a) and (b), respectively, with the EFT GP model shown by its mean function (in blue) and 95% confidence region (shaded pink).
(a) "Normal" vital-sign trajectories. (b) ICU readmission trajectories.

B. Method

We compared five methods for performing novelty detection: (i) the popular one-class SVM formulation of [3]; (ii) the recently-proposed one-class GP classifier of [10]; (iii) a k-nearest neighbors (k-NN) approach using the Euclidean distance metric; (iv) a k-NN approach using a dynamic time-warping (DTW) distance [31]; and (v) our GP-based extreme function theory. We note that (i) and (ii) are point-wise approaches to novelty detection, and the Gaussian kernel was used as the default choice in each; the former is distance-based, in the RKHS of its kernel, while the latter is probabilistic. Methods (iii) and (iv) are distance-based methods, often used as benchmarks in the anomaly detection and novelty detection literature [20], [32], [33]. We denote the methods (i)-(v) as OC-SVM, OC-GP, k-NN, k-DTW, and EFT, respectively.

Ten-fold cross-validation was used to set model parameters for each method, using a training set that comprised 75% of the available normal data (116 of 154 examples, selected randomly) and 50% of the available abnormal data (8 of the 17 examples, selected randomly). Model parameters were selected in each method to minimize the classification error⁴. The remaining 38 normal examples and 9 abnormal examples were "held out" and used as test data. The entire experiment (random selection of training data, followed by ten-fold cross-validation to set model parameters) was repeated many times for each classifier.

Each method was trained using feature vectors comprising data up to 24 days. For those methods that require fully-specified input vectors (OC-SVM, OC-GP, k-NN), the data were padded with zeros where data were missing, after zero-mean, unit-variance normalization was applied (such that the missing data appear "normal", at the mean of the training set). The k-DTW and EFT methods allow non-probabilistic and probabilistic handling of missing data, using DTW and marginalization, respectively.

C. Results

Results from each of the methods, reported on the "held out" test data in each case, are shown in Table I. It may be seen that the best-performing techniques are those that can adequately cope with the missing data: the k-DTW and the proposed EFT-based methods. While both of the latter exhibit lower FPR than the other methods, the proposed method has greater success at classifying abnormal data, with a lower FNR across all experiments. An example of DTW is shown in Fig. 6(a), which shows one of the normal examples with data removed for the purposes of illustration. Fig. 6(b) demonstrates that the DTW method accurately aligns the original copy of the normal timeseries with that containing missing data. However, the results shown in Table I indicate that the principled treatment of missing data in the proposed method (achieved via straightforward marginalization of the Gaussian process [21]) results in better sensitivity to "abnormal" data, with a lower FNR.

⁴Classification error is defined to be false negatives (FN) + false positives (FP). The former is an erroneous classification of an abnormal example as being "normal", while the latter is an erroneous classification of a normal example as being "abnormal".

Fig. 6. (a) Normal data, corresponding to the centroid of the normal data, in the upper subplot, with the same timeseries having data removed shown in the lower subplot. Missing data are circled in red in the upper subplot. (b) DTW match of the original timeseries to the version containing missing data, where color shows DTW distance.

TABLE II. MEAN (AND 1 S.D.) CLASSIFICATION ERRORS FOR UCI BENCHMARK DATA OVER REPEATED EXPERIMENTS.

Fig. 7. Randomly selected examples from the "normal" and "abnormal" classes in (a) and (b), respectively.

VIII.
ILLUSTRATION WITH BENCHMARK DATASET

The PAMAP2 benchmark dataset [34], [35] from the UCI Machine Learning Repository [36] was used, which comprises activity data from heart-rate and inertial measurement units worn by subjects who were asked to perform a protocol of 12 activities: lying, sitting, standing, ironing clothes, using a vacuum cleaner, walking normally, “Nordic” walking, running, cycling, ascending stairs, descending stairs, and using a skipping rope. Data were acquired for a total of 27,000 seconds, resulting in over 3.8 million labeled data points. This dataset is multivariate; we have investigated time-series of the following features: heart-rate (at approximately 9 Hz) and average spectral power (per 5 s window of data) for each of the three axes of a body-worn accelerometer (at approximately 100 Hz), following [35].

The methodology from Section VII was repeated, where the “normal” class was taken to comprise all time-series examples of ascending and descending stairs, and all other activities were taken as being “abnormal” for the purposes of this investigation. Fig. 7 shows randomly-selected examples of both the “normal” and “abnormal” classes, where it may be seen that the former (ascending and descending stairs) exhibits particular dynamics that a novelty detector could learn, whereas the latter appears more inconsistent (as expected, given that these are random examples of other types of activity). While this benchmark dataset does not suffer from the incompleteness of data evident in the dataset considered in Section VII, the length of each time-series can vary. The multivariate nature of the dataset is straightforwardly used by methods (i)–(iv). For the proposed EFT-based method, we adopt the usual GP approach [21] of providing the kernel with multivariate inputs.
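The approach of passing multivariate inputs directly to the kernel, as described above, can be sketched with a squared-exponential kernel carrying one lengthscale per input dimension (the "automatic relevance determination" form common in the GP literature [21]). This is an illustrative sketch only; the 5-D input layout and lengthscale values are assumptions, not the settings used in the experiments.

```python
import numpy as np

def ard_se_kernel(X1, X2, lengthscales, signal_var=1.0):
    """Squared-exponential kernel with one lengthscale per input
    dimension (ARD), so multivariate inputs are handled directly."""
    A = X1 / lengthscales          # scale each input dimension
    B = X2 / lengthscales
    # pairwise squared Euclidean distances between rows of A and B
    d2 = (np.sum(A**2, axis=1)[:, None]
          + np.sum(B**2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return signal_var * np.exp(-0.5 * np.maximum(d2, 0.0))

# Hypothetical 5-D inputs: (time, heart-rate, 3 axes of spectral power)
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 5))
K = ard_se_kernel(X, X, lengthscales=np.array([1.0, 2.0, 0.5, 0.5, 0.5]))
```

Because each dimension has its own lengthscale, features on very different scales (heart-rate versus spectral power) can be accommodated without separate preprocessing of each channel.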
Table II presents the results of the experiments, in which it may be seen that most methods performed less well on this benchmark dataset than on the patient-based analysis in Section VII. The OC-GP and K-NN methods are the least able to separate the “normal” from the “abnormal” class in this instance, while the OC-SVM achieved a low FPR. The K-DTW method performs similarly to the proposed EFT-based method, although the latter achieves a lower FNR (i.e., it is more sensitive to “abnormal” data), and has generally less variable performance (standard deviation of FNR 0.06, compared with 0.11 for the K-DTW).

Again, the DTW procedure can occasionally perform undesirable warping, such that “normal” data are sometimes associated with larger DTW distances to “normality” than some “abnormal” time-series. This effect is shown in Fig. 8, in which an “abnormal” time-series in (b) matches the centroid of the “normal” data more closely than the example “normal” time-series shown in (a), the DTW distance for the “abnormal” case in (b) being 0.67 times the DTW distance for the “normal” case in (a). Again, the GP-based method straightforwardly marginalizes over shorter time-series to perform classification using only the relevant input domain of the model.

Fig. 8. Examples of K-DTW resulting in misclassifications, in which (a) two “normal” examples are matched only for the latter half of the time-series; (b) “abnormal” and “normal” time-series are closely matched due to the warping effect introduced by DTW.

IX. DISCUSSION

We have extended extreme value theory such that we may take a function-wise approach to novelty detection, following the FDA method in which functions are initially represented by time-series of discrete observations. While much previous work in novelty detection has concentrated on point-wise approaches (which are by far the most commonly described in the literature), these reach the limit of their usefulness in the assessment of time-series.
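The undesirable warping effect noted above stems from how DTW aligns series. A minimal, textbook DTW implementation (a sketch, not the K-DTW implementation of [31] used in the experiments) illustrates the mechanism: warping absorbs phase shifts, so a shifted series can appear far closer under DTW than under a point-wise distance.

```python
import numpy as np

def dtw_distance(a, b):
    """Classical dynamic time-warping distance between two 1-D series,
    computed with the standard O(len(a) * len(b)) dynamic program."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # extend the cheapest of the three admissible warping paths
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# A sinusoid and a phase-shifted copy of itself
t = np.linspace(0, 2 * np.pi, 50)
x, y = np.sin(t), np.sin(t + 0.5)

d_same = dtw_distance(x, x)        # identical series align exactly
d_warp = dtw_distance(x, y)        # warping absorbs most of the shift
d_point = np.sum(np.abs(x - y))    # point-wise distance stays large
```

The same flexibility that makes DTW robust to small misalignments is what allows an “abnormal” series to be warped onto the “normal” centroid, producing the misclassifications of the kind shown in Fig. 8.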
GPs offer a natural probabilistic framework in which to define distributions over a function space, and we have used the GP regression case to illustrate our method due to the convenience of working with the Gaussian distribution, which has allowed us to find extreme function distributions in closed form. We note that while the distribution function of the densities tends asymptotically towards a parametric distribution (the Weibull, in this case), the whole approach is non-parametric, due to the GP distribution over function space [21].

Results reported in this paper suggest that the proposed method provides slightly better performance than distance-based methods employing dynamic time-warping, while offering the advantages of a principled probabilistic inference framework. While DTW copes well with small quantities of missing data, or with time-series that differ in length by a small amount, it is less robust when coping with the noisy, incomplete time-series that are often encountered in practice, such as in physiological patient monitoring. By contrast, the proposed method copes well with missing data, using straightforward marginalization of the GP, afforded by its consistency property, in which any subset of the random variables over which the GP is defined also has a joint distribution which is multivariate Gaussian.

A natural extension of this work would be to non-Gaussian processes, in which the distribution over functions is not constrained to be Gaussian; e.g., a Beta process or a generalized Pareto process. However, the GP framework is sufficiently flexible to allow a wide range of functional forms, as illustrated throughout this paper. There is further scope for estimation of the Weibull parameters within the Bayesian GP framework, which would require approximate inference, such as sampling or deterministic (variational) methods.
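The consistency property invoked above can be made concrete: under a GP, the observations at any subset of inputs are jointly Gaussian, so scoring a record with missing days amounts to indexing the model's mean and covariance at the observed inputs only. A minimal numpy sketch, in which the day grid, mean function, and covariance are illustrative assumptions rather than the trained model from the clinical data:

```python
import numpy as np

def gaussian_logpdf(y, mu, K):
    """Log-density of y under N(mu, K), via a Cholesky factorization."""
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L, y - mu)
    return (-0.5 * alpha @ alpha
            - np.sum(np.log(np.diag(L)))       # -0.5 * log det K
            - 0.5 * len(y) * np.log(2.0 * np.pi))

# Hypothetical GP model of "normal" trajectories on a 10-day grid
days = np.arange(10.0)
mu = 20.0 * np.exp(-0.1 * days)                # illustrative mean function
K = 4.0 * np.exp(-0.5 * (days[:, None] - days[None, :])**2 / 2.0**2) \
    + 1e-6 * np.eye(10)                        # SE covariance + jitter

# Days 3, 4, 7 and 8 are missing: marginalizing the GP amounts to
# indexing its mean and covariance at the observed inputs only.
obs = np.array([0, 1, 2, 5, 6, 9])
y_obs = mu[obs] + 0.5                          # a slightly offset test series
logp = gaussian_logpdf(y_obs, mu[obs], K[np.ix_(obs, obs)])
```

No imputation or zero-padding is required: the missing days simply never enter the computation, which is the behavior contrasted with the padded OC-SVM, OC-GP, and K-NN inputs in Section VII.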
APPENDIX

From (10), we apply iterative integration by parts to obtain closed-form expressions for the even and odd cases, given by (15) and (16), respectively, where erfc(·) is the complementary Gaussian error function. These reduce to (11), with the coefficients given by (17)–(21).

REFERENCES

[1] L. Tarassenko, D. Clifton, P. Bannister, S. King, and D. King, “Novelty detection,” in Encyclopaedia of Structural Health Monitoring, 2009, pp. 653–675.
[2] D. Tax and R. Duin, “Support vector domain description,” Pattern Recognit. Lett., vol. 20, pp. 1191–1199, 1999.
[3] B. Schölkopf, J. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson, “Estimating the support of a high-dimensional distribution,” Neural Comput., vol. 13, no. 7, pp. 1443–1471, 2001.
[4] N. Hughes, L. Tarassenko, and S. Roberts, “Markov models for automated ECG interval analysis,” in Advances in Neural Information Processing Systems, vol. 16, S. Thrun, L. Saul, and B. Schölkopf, Eds. Cambridge, MA: MIT Press, 2004, pp. 611–618.
[5] H. Lee and S. Roberts, “On-line novelty detection using the Kalman filter and extreme value theory,” in Proc. Int. Conf. Pattern Recognit., 2008, pp. 1–4.
[6] J. Quinn, C. Williams, and N. McIntosh, “Factorial switching linear dynamical systems applied to physiological condition monitoring,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 9, pp. 1537–1551, Sep. 2009.
[7] H. Kim and J. Lee, “Pseudo-density estimation for clustering with Gaussian processes,” in Advances in Neural Networks – ISNN 2006, J. Wang, Z. Yi, J. Zurada, B.-L. Lu, and H. Yin, Eds., Lecture Notes in Computer Science, vol. 3971. Berlin, Germany: Springer, 2006, pp. 1238–1243.
[8] R. Adams, I. Murray, and D. MacKay, “The Gaussian process density sampler,” in Advances in Neural Information Processing Systems, vol. 21, D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, Eds. Cambridge, MA: MIT Press, 2009, pp. 9–16.
[9] Y. Gao and Y. Li, “Improving Gaussian process classification with outlier detection, with applications in image classification,” in Proc. Comput. Vis. – ACCV 2010, R. Kimmel, R. Klette, and A. Sugimoto, Eds., Lecture Notes in Computer Science, vol. 6495. Berlin, Germany: Springer, 2011, pp. 153–164.
[10] M. Kemmler, E. Rodner, and J. Denzler, “One-class classification with Gaussian processes,” in Proc. Comput. Vis. – ACCV 2010, R. Kimmel, R. Klette, and A. Sugimoto, Eds., Lecture Notes in Computer Science, vol. 6493. Berlin, Germany: Springer, 2011, pp. 489–500.
[11] R. Garnett, M. A. Osborne, S. Reece, A. Rogers, and S. J. Roberts, “Sequential Bayesian prediction in the presence of changepoints and faults,” Comput. J., vol. 53, no. 9, pp. 1430–1446, 2010.
[12] L. Tarassenko, P. Hayton, N. Cerneaz, and M. Brady, “Novelty detection for the identification of masses in mammograms,” in Proc. 4th IEE Int. Conf. Artif. Neural Netw., Perth, Australia, 1995, vol. 4, pp. 442–447.
[13] J. Ramsay and B. Silverman, Functional Data Analysis. New York: Springer, 2005.
[14] G. M. James and C. A. Sugar, “Clustering for sparsely sampled functional data,” J. Amer. Statist. Assoc., vol. 98, pp. 397–408, 2003.
[15] J. Rice, “Functional and longitudinal data analysis: Perspectives on smoothing,” Statist. Sinica, vol. 14, pp. 631–647, 2004.
[16] H.-G. Müller, “Functional modelling and classification of longitudinal data,” Scand. J. Statist., vol. 32, no. 2, pp. 223–240, 2005.
[17] F. Yao, H.-G. Müller, and J. Wang, “Functional linear regression analysis for longitudinal data,” Ann. Statist., vol. 33, no. 6, pp. 2873–2903, 2005.
[18] J. Shi and T. Choi, Gaussian Process Regression Analysis for Functional Data. London, U.K.: Chapman & Hall, 2011.
[19] P. Hall and I. van Keilegom, “Two-sample tests in functional data analysis starting from discrete data,” Statist. Sinica, vol. 17, pp. 1511–1531, 2007.
[20] V. Chandola, A. Banerjee, and V. Kumar, “Anomaly detection: A survey,” ACM Comput. Surveys, vol. 41, no. 3, 2009.
[21] C. Rasmussen and C.
Williams, Gaussian Processes for Machine Learning. Cambridge, MA: MIT Press, 2006.
[22] J. Shi and B. Wang, “Curve prediction and clustering with mixtures of Gaussian process functional regression models,” Statist. Comput., vol. 18, pp. 267–283, 2008.
[23] S. J. Roberts, “Extreme value statistics for novelty detection in biomedical signal processing,” IEE Proc. Sci., Meas. Technol., vol. 147, no. 6, pp. 363–367, 2000.
[24] D. Clifton, S. Hugueny, and L. Tarassenko, “Novelty detection with multivariate extreme value statistics,” J. Signal Process. Syst., vol. 65, pp. 371–389, 2011.
[25] P. Embrechts, C. Klüppelberg, and T. Mikosch, Modelling Extremal Events, 4th ed. Berlin, Germany: Springer-Verlag, 2003.
[26] L. de Haan and A. Ferreira, Extreme Value Theory. Berlin, Germany: Springer-Verlag, 2006.
[27] S. Hugueny, D. Clifton, and L. Tarassenko, “Probabilistic patient monitoring with multivariate, multimodal extreme value theory,” Commun. Comput. Inf. Sci., vol. 127, pp. 199–211, 2011.
[28] R. A. Fisher and L. H. C. Tippett, “Limiting forms of the frequency distributions of the largest or smallest members of a sample,” in Proc. Cambridge Philosoph. Soc., 1928, vol. 24.
[29] L. Tarassenko, D. Clifton, M. Pinsky, M. Hravnak, J. Woods, and P. Watkinson, “Centile-based early warning scores derived from statistical distributions of vital signs,” Resuscitation, vol. 82, no. 8, pp. 1013–1018, 2011.
[30] L. Clifton, D. Clifton, M. Pimentel, P. Watkinson, and L. Tarassenko, “Gaussian processes for personalised e-health monitoring with wearable sensors,” IEEE Trans. Biomed. Eng., vol. 60, no. 1, pp. 193–197, Jan. 2013.
[31] K. Buza, “Fusion methods for time-series classification,” Ph.D. dissertation, Univ. of Hildesheim, Hildesheim, Germany, 2011.
[32] M. Markou and S. Singh, “Novelty detection: A review – Part 1: Statistical approaches,” Signal Process., vol. 83, no. 12, pp. 2481–2497, 2003.
[33] M. Markou and S. Singh, “Novelty detection: A review – Part 2: Neural network based approaches,” Signal Process., vol. 83, no. 12, pp. 2499–2521, 2003.
[34] A. Reiss and D. Stricker, “Introducing a new benchmarked dataset for activity monitoring,” in Proc. 16th IEEE Int. Symp. Wearable Comput., Newcastle, U.K., 2012.
[35] A. Reiss and D. Stricker, “Creating and benchmarking a new dataset for physical activity monitoring,” in Proc. 5th Workshop Affect and Behavior Related Assistance, Crete, Greece, 2012.
[36] A. Frank and A. Asuncion, UCI Machine Learning Repository, 2010. [Online]. Available: http://archive.ics.uci.edu/ml

David A. Clifton is a Research Fellow at Mansfield College, Oxford, and a College Lecturer at Balliol College, Oxford. He received an MEng degree in Engineering Mathematics from the University of Bristol and a DPhil degree in Information Engineering from the University of Oxford. His research interests are in statistical signal processing, particularly in biomedical informatics and other biomedical applications. His doctoral research led to patented methods for monitoring the jet engines of the Eurofighter Typhoon, the Airbus A380, and the Boeing 787 Dreamliner, and he has won the nPower Science, Engineering and Technology (SET) award at the UK Houses of Parliament and the J. A. Lodge award for biomedical engineering from the IET.

Lei Clifton is a post-doctoral research assistant at the Institute of Biomedical Engineering, in the Department of Engineering Science at the University of Oxford. She received BSc and MSc degrees in Electrical Engineering from the Beijing Institute of Technology, China, and a PhD degree in Information Engineering from UMIST. Her research interests include the use of statistical machine learning for health informatics and physiological monitoring.
Samuel Hugueny received an MSc degree in Computer Engineering from the École Nationale Supérieure de Techniques Avancées, Paris (ENSTA ParisTech), and is currently studying for a DPhil degree at the Institute of Biomedical Engineering in the Department of Engineering Science, University of Oxford. His research interests include medical imaging, signal processing, novelty detection, and patient monitoring.

David Wong is a BRC post-doctoral research assistant at the Institute of Biomedical Engineering in Oxford. He graduated from the University of Oxford with MEng and DPhil degrees; for the latter, his research focused on developing systems to detect deterioration in hospital patients' vital signs using machine learning and signal processing techniques. His recent interests include determining vital-sign trends using time-series analysis. David is currently involved in a number of clinical studies to assess the effect of computer-assisted systems in increasing patient safety and reducing nursing workload during routine hospital observations.

Lionel Tarassenko received the B.A. degree in engineering science in 1978 and the Ph.D. degree in medical engineering in 1985, both from the University of Oxford. He has held the Chair in Electrical Engineering at Oxford University since October 1997. He was elected to a Fellowship of the IEE in 1996, when he was also awarded the IEE Mather Premium for his work on neural networks, and to a Fellowship of the Royal Academy of Engineering (RAE) in 2000. Prof. Tarassenko was awarded the Silver Medal of the Royal Academy of Engineering for his contribution to British engineering in 2006, and was awarded the CBE for services to engineering in 2011.