
The Mixture of Multi-kernel Relevance Vector Machines Model

2012 IEEE 12th International Conference on Data Mining (ICDM), 2012


Konstantinos Blekas and Aristidis Likas
Department of Computer Science, University of Ioannina, P.O. Box 1186, Ioannina 45110, Greece
E-mail: {kblekas, arly}@cs.uoi.gr

Abstract—We present a new regression mixture model where each mixture component is a multi-kernel version of the Relevance Vector Machine (RVM). In the proposed model, we exploit the enhanced modeling capability of RVMs due to their embedded sparsity-enforcing properties. Moreover, robustness with respect to the kernel parameters is achieved by employing a weighted multi-kernel scheme. The mixture model is trained using the maximum a posteriori (MAP) approach, where the Expectation-Maximization (EM) algorithm is applied, offering closed-form update equations for the model parameters. An incremental learning methodology is also presented to tackle the parameter initialization problem of the EM algorithm. The efficiency of the proposed mixture model is empirically demonstrated on the time series clustering problem using various artificial and real benchmark datasets and by performing comparisons with other regression mixture models.

Keywords—Relevance Vector Machines; mixture models; sparse prior; multi-kernel; incremental EM learning

I. INTRODUCTION

Mixture model training constitutes a flexible and well-established approach for data sets containing objects that have been generated from heterogeneous sources [1], [2]. Among their many advantages, mixture models provide a convenient framework for cluster analysis by assigning objects to mixture components (or clusters) while simultaneously estimating the model parameters. Regression mixture models are a special type of mixture model whose components correspond to regression functions; they have mainly been employed to model sequential data. Many problems of scientific interest can be formulated as sequential data modeling problems. Such data are encountered in a number of diverse applications, ranging from gene clustering in bioinformatics to clustering of cyclone trajectories [3], [4], [5], and more recently to video surveillance problems [6], [7], [8] and motion recognition [9].

A natural framework for modeling sequential data is through regression mixture models, also known as latent class regression analysis [1], [2]. A regression mixture model simultaneously models heterogeneous regression functions by training a mixture of distinct distributions, each corresponding to a latent class. Objects within each latent class share the same regression function. The literature contains several types of regression mixture models that have been used for sequential data modeling [10]. Among them, Hidden Markov Models [11], polynomial and spline regression models [12], [4], [5], mixtures of ARMA models [13], and mixtures of Gaussian processes [14] are commonly used. These methods suffer from the drawback of not automatically addressing the problem of model order selection, which is very important in regression: if the order of the regressor is too large, it overfits the observations and does not generalize well, while if it is too small, it may miss trends in the data. Sparse Bayesian regression offers a solution to the model selection problem, see for example [15], [16], [17] and [18], by introducing sparse priors on the model parameters.
Enforcing sparsity is a fundamental machine learning regularization principle that has been successfully used to tackle several problems, such as feature selection. The key idea behind sparse Bayesian regression is that more flexible inference can be obtained by starting from models with many degrees of freedom and imposing a heavy-tailed prior on the regressor coefficients. During training, the coefficients that are not significant are zeroed out by the prior, so only the few coefficients that are significant for the particular training data are retained in the model.

In this paper we propose a regression mixture model where each component corresponds to an extension of the typical RVM model [15] with a weighted multi-kernel function, i.e. each RVM kernel is a weighted combination of basic kernels and the combination weights are estimated during training [19]. We call this extension the Multi-kernel RVM (MKRVM). In the RVM model, the marginal distribution of the observations given the hyperparameters is Gaussian (see Eq. 9). Therefore, the regression mixture is converted into a typical mixture of Gaussians with zero means and full covariances. A significant problem in regression is how to define the scalar parameter of the kernel design matrix. In this study we address this issue with a multi-kernel scheme, where each mixture component has its own kernel matrix calculated as a linear combination of a (common) set of matrices with known kernel parameter values. The vectors of combination coefficients are part of the mixture model parameters and must be estimated. A maximum a posteriori expectation-maximization algorithm (MAP-EM) [20], [1] is then applied to learn the mixture of multi-kernel RVMs and fit the input data. This leads to closed-form update rules for all model parameters during the M-step and improves data fitting. In the case of the multi-kernel coefficients, the M-step reduces to a convex quadratic programming problem with constraints.

Another contribution of the present work is an incremental scheme for training the mixture of multi-kernel RVMs that is based on a repeated component-splitting process. This makes the learning process independent of the initialization of the model parameters and helps reach better solutions. The proposed regression mixture model can be employed to model and cluster a set of functions, i.e. each object in the training set is a function represented by a set of (input, output) pairs. In this work we focus on the specific case of time series clustering, i.e. the (input, output) pairs of a data object correspond to a particular time series. The performance of the proposed methodology is evaluated using a variety of simulated and real data sets. Comparative results demonstrate improvements over previous methods such as the polynomial regression mixture model and the mixture of autoregressive models. Since the ground truth is known for all datasets, we use the percentage of correct classification (purity) and the normalized mutual information (NMI) for evaluating the performance of each method. Also, in the case of artificial data, we compute as an additional performance metric the error in estimating the original functions that generated each cluster. As the experiments indicate, our method offers greater flexibility and robustness and obtains superior clustering solutions.
In Section II we describe the multi-kernel relevance vector machine, which is the building block of our approach. The proposed sparse regression mixture model is then presented in Section III, along with the EM algorithm used for parameter estimation and the incremental learning procedure. To assess the performance of the proposed methodology we present in Section IV numerical experiments with artificial and real benchmark data sets. Finally, in Section V we provide conclusions and suggestions for future research.

II. THE MULTI-KERNEL RELEVANCE VECTOR MACHINE

Suppose we are given a set of $N$ samples (data objects) $Y = \{y_1, \ldots, y_N\}$, where each sample $y_n$ consists of $L$ input-target pairs: $y_n = \{x_n, t_n\} = \{(x_{ni}, t_{ni})\}_{i=1}^{L}$. In regression problems we consider that the real target values $t_{ni}$ correspond to noisy measurements of the output of a parametric model $f$ with input vector $x_{ni}$, i.e.

$$ t_{ni} = f(x_{ni}; \theta) + \epsilon_{ni}, \quad (1) $$

where $\epsilon_{ni}$ refers to noise and $\theta$ denotes the model parameters which must be estimated using a training set. Moreover, for the conditional density of each sample $y_n$ we can write

$$ p(y_n = \{x_n, t_n\} \,|\, \theta) = p(t_n | x_n, \theta)\, p(x_n) \propto p(t_n | x_n, \theta). \quad (2) $$

Typically, we can model $t_n$ by assuming an $M$-order linear regression model on the $L$ input vectors with an additive noise term:

$$ t_n = \Phi_n w + \epsilon_n, \quad (3) $$

where $w = (w_1, \ldots, w_M)^T$ is the vector of unknown regression coefficients and $\Phi_n$ is the design matrix of size $L \times M$. In the above model, the error term $\epsilon_n$ is an $L$-dimensional vector assumed to be zero-mean Gaussian with spherical covariance, i.e. $\epsilon_n \sim \mathcal{N}(0, \sigma^2 I)$.

For constructing the design matrix $\Phi_n$ we can employ several approaches. A simple approach is to use the Vandermonde or B-splines matrix, when we assume polynomial or spline regression models, respectively [21]. Another option is to consider a kernel design matrix of size $L \times L$, consisting of $L$ basis functions, $\Phi_n = [\phi(x_{n1}), \ldots, \phi(x_{nL})]$, where $\phi(x_{ni})$ is a vector of $L$ kernel values between $x_{ni}$ and all other inputs, i.e. $\phi(x_{ni}) = (K(x_{ni}, x_{n1}), \ldots, K(x_{ni}, x_{nL}))$. This is achieved by appropriately selecting a kernel function, with the RBF kernel being the most commonly used:

$$ K(x_{ni}, x_{nk}) = \exp\!\left( - \frac{\| x_{ni} - x_{nk} \|^2}{2\lambda} \right). $$

The scalar parameter $\lambda$ plays a significant role in the quality of the fitting procedure; its selection depends on the amount of local variation of the input data sequences. In our case we consider a multi-kernel scheme using a discrete set of $S$ RBF kernel functions $K_s$, each one having its own scalar parameter value $\lambda_s$. In particular, we assume that the composite kernel matrix $\Phi_n$ can be written as a linear combination of $S$ kernel matrices $F_{ns}$:

$$ \Phi_n = \sum_{s=1}^{S} u_s F_{ns}, \quad (4) $$

where the coefficients $u_s$ satisfy the constraints $u_s \ge 0$ and $\sum_{s=1}^{S} u_s = 1$. These parameters must be estimated during learning in order to construct the composite kernel matrix, as will be shown later. It should be noted that the multi-kernel idea has been successfully used in several machine learning models [22], [23], [24], [19] that assume a weighted linear sum of kernels and estimate the kernel weights during training. However, to the best of our knowledge this is the first time that a multi-kernel version of the RVM with adaptive kernel weights is proposed.
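To make the multi-kernel construction of Eq. (4) concrete, the following sketch (our own illustration, not the authors' code) builds the $S$ RBF Gram matrices $F_{ns}$ for one sample and combines them with a weight vector $u$; all function and variable names are our own assumptions.

```python
import numpy as np

def rbf_gram(X, lam):
    """L x L RBF Gram matrix: K(x_i, x_k) = exp(-||x_i - x_k||^2 / (2*lam))."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * lam))

def composite_design_matrix(X, lambdas, u):
    """Phi_n = sum_s u_s F_ns (Eq. 4), with u_s >= 0 and sum_s u_s = 1."""
    F = np.stack([rbf_gram(X, lam) for lam in lambdas])   # shape (S, L, L)
    return np.tensordot(u, F, axes=1), F                  # (L, L), (S, L, L)

# toy usage: L = 50 inputs of dimension d = 10, S = 3 kernel widths
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
lambdas = [0.1, 0.5, 1.0]
u = np.full(len(lambdas), 1.0 / len(lambdas))             # equal initial weights
Phi, F = composite_design_matrix(X, lambdas, u)
print(Phi.shape)                                          # (50, 50)
```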
Using Equation (3), given the set of regression parameters $\{w, \sigma^2, u\}$, the conditional probability density of the target $t_n$ is the normal distribution

$$ p(t_n \,|\, w, \sigma^2, u) = \mathcal{N}(t_n \,|\, \Phi_n w, \sigma^2 I). \quad (5) $$

An important issue when using a regression model is how to define its order $M$: models of small order may lead to underfitting, while large values of $M$ may lead to overfitting. One approach to tackle this problem is the Bayesian regularization method that has been successfully employed in the Relevance Vector Machine (RVM) model [15]. This technique initially assumes a large value of the order $M$ ($M = L$) and imposes a heavy-tailed prior distribution $p(w)$ on the regression coefficients $w_i$, so as to zero out most of them after training. More specifically, the prior is defined hierarchically by considering a zero-mean Gaussian distribution over $w = (w_1, \ldots, w_L)$:

$$ p(w \,|\, \alpha) = \mathcal{N}(w \,|\, 0, A^{-1}) = \prod_{i=1}^{L} \mathcal{N}(w_i \,|\, 0, \alpha_i^{-1}), \quad (6) $$

where $A$ is a diagonal matrix containing the $L$ elements of the hyperparameter vector $\alpha = [\alpha_1, \ldots, \alpha_L]$. In addition, a Gamma prior is imposed on each hyperparameter $\alpha_i$:

$$ p(\alpha) = \prod_{i=1}^{L} \mathrm{Gamma}(\alpha_i \,|\, a, b) \propto \prod_{i=1}^{L} \alpha_i^{a-1} e^{-b \alpha_i}. \quad (7) $$

Also, we can assume a Gamma hyperprior over the noise parameter $\sigma^2$:

$$ p(\sigma^{-2}) = \mathrm{Gamma}(\sigma^{-2} \,|\, c, d) \propto \sigma^{-2(c-1)} e^{-d \sigma^{-2}}. \quad (8) $$

All Gamma parameters $\{a, b, c, d\}$ are a priori set to zero to obtain uninformative priors. The above two-stage hierarchical prior on $\alpha_i$ is actually a Student-t distribution and is called sparse [15], since it forces most of the values $\alpha_i$ to become large, so the corresponding coefficients $w_i$ are driven to zero and eliminated from the model. In this way the complexity of the regression model is controlled in an automatic and elegant way and overfitting is avoided.

By integrating out the contribution of the weights $w$ from Equation (5), we obtain the marginal distribution of the target $t_n$ given the model parameters $\theta = \{\alpha, \sigma^2, u\}$ as a zero-mean Normal distribution:

$$ p(t_n \,|\, \theta) = \int p(t_n \,|\, w, \sigma^2, u)\, p(w \,|\, \alpha)\, dw = \mathcal{N}(0, C_n), \quad (9) $$

where the covariance matrix has the form

$$ C_n = \Phi_n A^{-1} \Phi_n^T + \sigma^2 I. \quad (10) $$

Furthermore, we can obtain the posterior distribution over the weights $w$, which is also Gaussian:

$$ p(w \,|\, t_n, \theta) = \mathcal{N}(w \,|\, \mu_n, \Sigma_n), \quad (11) $$

with mean and covariance given by

$$ \mu_n = \sigma^{-2} \Sigma_n \Phi_n^T t_n, \qquad \Sigma_n = \left(\sigma^{-2} \Phi_n^T \Phi_n + A\right)^{-1}. \quad (12) $$

Thus, $\Phi_n \mu_n$ denotes the final model-based estimate for sample $y_n$.

Learning the linear weights $u$ of the multi-kernel scheme can be done using the fact that $\Phi_n \mu_n = \sum_{s=1}^{S} u_s F_{ns} \mu_n = F_n u$, where $F_n = [F_{n1}\mu_n \; F_{n2}\mu_n \; \cdots \; F_{nS}\mu_n]$, and by solving the following minimization problem:

$$ \min_u \; \frac{1}{2} \sum_{n=1}^{N} \| t_n - F_n u \|^2 \quad \text{s.t.} \quad \sum_{s=1}^{S} u_s = 1 \;\text{ and }\; u_s \ge 0. \quad (13) $$

More compactly, we can rewrite the above objective function as

$$ \min_u \; \left\{ \frac{1}{2} u^T Z u + u^T q \right\} \quad \text{s.t.} \quad \sum_{s=1}^{S} u_s = 1, \; u_s \ge 0, \quad (14) $$

where $Z = \sum_{n=1}^{N} F_n^T F_n$ and $q = -\sum_{n=1}^{N} F_n^T t_n$. This is a typical convex quadratic programming problem with both equality and inequality constraints that can be solved by active-set methods using Lagrange multipliers, leading to closed-form analytical expressions [25].
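As an illustration of Eqs. (12) and (14), the sketch below computes the posterior weight statistics for one sample and then estimates the kernel weights $u$ with a generic constrained solver (SLSQP) rather than the active-set method cited in [25]; it is a minimal rendering of the equations under our own naming conventions, not the authors' implementation.

```python
import numpy as np
from scipy.optimize import minimize

def posterior_stats(Phi, t, alpha, sigma2):
    """Posterior mean/covariance of w (Eq. 12): Sigma = (Phi'Phi/s2 + A)^-1, mu = Sigma Phi' t / s2."""
    A = np.diag(alpha)
    Sigma = np.linalg.inv(Phi.T @ Phi / sigma2 + A)
    mu = Sigma @ Phi.T @ t / sigma2
    return mu, Sigma

def estimate_kernel_weights(F_list, mu_list, t_list):
    """Solve Eq. (14): min 1/2 u'Zu + u'q  s.t.  sum(u) = 1, u >= 0,
    where Fn = [F_n1 mu_n ... F_nS mu_n] for each sample."""
    S = F_list[0].shape[0]
    Z = np.zeros((S, S))
    q = np.zeros(S)
    for F, mu, t in zip(F_list, mu_list, t_list):
        Fn = np.stack([F[s] @ mu for s in range(S)], axis=1)   # L x S
        Z += Fn.T @ Fn
        q += -Fn.T @ t
    obj = lambda u: 0.5 * u @ Z @ u + u @ q
    cons = ({'type': 'eq', 'fun': lambda u: np.sum(u) - 1.0},)
    bounds = [(0.0, 1.0)] * S
    u0 = np.full(S, 1.0 / S)                                   # start from equal weights
    res = minimize(obj, u0, method='SLSQP', bounds=bounds, constraints=cons)
    return res.x
```

Any solver that handles the simplex constraint would do here; SLSQP is used only because it is readily available in SciPy.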
III. THE MIXTURE OF MKRVMS MODEL

In the mixture of MKRVMs model there are $K$ multi-kernel RVM components with parameters $\theta_j = \{\alpha_j, \sigma_j^2, u_j\}$, $j = 1, \ldots, K$. According to Eq. (9), each component defines a zero-mean Normal distribution:

$$ p(t_n \,|\, \theta_j) = \mathcal{N}(t_n \,|\, 0, C_{nj}), \quad (15) $$

with

$$ C_{nj} = \Phi_{nj} A_j^{-1} \Phi_{nj}^T + \sigma_j^2 I \quad \text{and} \quad \Phi_{nj} = \sum_{s=1}^{S} u_{js} F_{ns}. \quad (16) $$

Moreover, $A_j$ in Eq. (16) is a diagonal matrix containing the elements of the hyperparameter vector $\alpha_j$, i.e. $A_j = \mathrm{diag}\{\alpha_{j1}, \ldots, \alpha_{jL}\}$. The MKRVM mixture model is described by the following probability density function:

$$ f(t_n \,|\, \Theta) = \sum_{j=1}^{K} \pi_j\, p(t_n \,|\, \theta_j) = \sum_{j=1}^{K} \pi_j\, \mathcal{N}(t_n \,|\, 0, C_{nj}). \quad (17) $$

Let $\Theta$ denote the set of all mixture model parameters, i.e. $\Theta = \{\pi_j, \theta_j\}_{j=1}^{K}$. The mixing weights $\pi_j$ satisfy the constraints $\sum_{j=1}^{K} \pi_j = 1$ and $\pi_j \ge 0$. The same holds for the coefficients $u_j$ of the multi-kernel scheme, i.e. $\sum_{s=1}^{S} u_{js} = 1$ and $u_{js} \ge 0$. The parameters $\{\alpha_j, \sigma_j^2\}$ are constrained by Gamma prior distributions:

$$ p(\alpha_j) = \prod_{i=1}^{L} \mathrm{Gamma}(\alpha_{ji} \,|\, a_j, b_j) \propto \prod_{i=1}^{L} \alpha_{ji}^{a_j - 1} e^{-b_j \alpha_{ji}}, \quad (18) $$

$$ p(\sigma_j^{-2}) = \mathrm{Gamma}(\sigma_j^{-2} \,|\, c_j, d_j) \propto \sigma_j^{-2(c_j - 1)} e^{-d_j \sigma_j^{-2}}, \quad (19) $$

where all Gamma parameters $\{a_j, b_j, c_j, d_j\}$ are set to zero (uninformative priors). To train the mixture of MKRVMs model we define the maximum a posteriori (MAP) log-likelihood function:

$$ L_{MAP} = \log p(Y \,|\, \Theta) + \log p(\Theta) = \sum_{n=1}^{N} \log\Big\{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(t_n \,|\, 0, C_{nj})\Big\} + \sum_{j=1}^{K} \big\{\log p(\alpha_j) + \log p(\sigma_j^{-2})\big\}, \quad (20) $$

and use the Expectation-Maximization (EM) algorithm [20] for likelihood maximization. EM iterates two steps. In the E-step, the current posterior probability of each sample $y_n = \{x_n, t_n\}$ belonging to cluster $j$ is calculated:

$$ z_{nj} = P(j \,|\, y_n, \Theta) = \frac{\pi_j\, \mathcal{N}(t_n \,|\, 0, C_{nj})}{\sum_{j'=1}^{K} \pi_{j'}\, \mathcal{N}(t_n \,|\, 0, C_{nj'})}. \quad (21) $$

During the M-step, the expected value of the complete log-likelihood (the Q-function) is maximized with respect to $\Theta$. In our case the Q-function is

$$ Q(\Theta) = \sum_{n=1}^{N} \sum_{j=1}^{K} z_{nj} \Big\{ \log \pi_j - \tfrac{1}{2} \log |C_{nj}| - \tfrac{1}{2} t_n^T C_{nj}^{-1} t_n \Big\} + \sum_{j=1}^{K} \Big\{ \sum_{i=1}^{L} \big( a_j \log \alpha_{ji} - b_j \alpha_{ji} \big) + c_j \log \sigma_j^{-2} - d_j \sigma_j^{-2} \Big\}, \quad (22) $$

where the quantities $z_{nj}$ have been computed by Eq. (21). Setting the partial derivatives to zero, the following update rules for the mixture parameters are obtained:

$$ \hat{\pi}_j = \frac{\sum_{n=1}^{N} z_{nj}}{N}, \quad (23) $$

$$ \hat{\alpha}_{ji} = \frac{\sum_{n=1}^{N} z_{nj} + 2a_j}{\sum_{n=1}^{N} z_{nj}\, \mu_{nji}^2 + \sum_{n=1}^{N} z_{nj}\, (\Sigma_{nj})_{ii} + 2b_j}, \quad (24) $$

$$ \hat{\sigma}_j^2 = \frac{\sum_{n=1}^{N} z_{nj}\, \|t_n - \Phi_{nj}\mu_{nj}\|^2 + 2d_j}{\sum_{n=1}^{N} z_{nj}\, \big(L - \sum_{i=1}^{L} \gamma_{nji}\big) + 2c_j}. \quad (25) $$

In the above rules we have used the following expressions [15]:

$$ \log\big|\Phi_{nj} A_j^{-1} \Phi_{nj}^T + \sigma_j^2 I\big| = -\log|\Sigma_{nj}| + L \log \sigma_j^2 - \log|A_j|, \quad (26) $$

$$ t_n^T C_{nj}^{-1} t_n = t_n^T \big(\Phi_{nj} A_j^{-1} \Phi_{nj}^T + \sigma_j^2 I\big)^{-1} t_n = \frac{1}{\sigma_j^2}\, t_n^T \big(t_n - \Phi_{nj}\mu_{nj}\big) = \frac{1}{\sigma_j^2}\, \|t_n - \Phi_{nj}\mu_{nj}\|^2 + \mu_{nj}^T A_j \mu_{nj}, \quad (27) $$

$$ \mu_{nj} = \sigma_j^{-2}\, \Sigma_{nj}\, \Phi_{nj}^T t_n, \quad (28) $$

$$ \Sigma_{nj} = \big(\sigma_j^{-2}\, \Phi_{nj}^T \Phi_{nj} + A_j\big)^{-1}. \quad (29) $$

Note also that in Eqs. (24) and (25), $(\Sigma_{nj})_{ii}$ denotes the $i$-th diagonal element of the $j$-th RVM posterior weight covariance matrix $\Sigma_{nj}$, while $\mu_{nji}$ is the $i$-th element of the posterior mean vector $\mu_{nj}$. The quantities $\{\gamma_{nji}\}$ are defined as $\gamma_{nji} = 1 - \alpha_{ji} (\Sigma_{nj})_{ii}$.

As in the case of a single multi-kernel RVM (see Eq. 13), the linear weights $u_j = (u_{j1}, \ldots, u_{jS})$ of the multi-kernel scheme can be estimated by solving the following minimization problem per component $j$:

$$ \min_{u_j}\; \frac{1}{2} \sum_{n=1}^{N} z_{nj}\, \Big\| t_n - \sum_{s=1}^{S} u_{js} F_{ns} \mu_{nj} \Big\|^2 = \min_{u_j}\; \frac{1}{2} \sum_{n=1}^{N} z_{nj}\, \| t_n - F_{nj} u_j \|^2 \quad \text{s.t.} \quad \sum_{s=1}^{S} u_{js} = 1 \;\text{ and }\; u_{js} \ge 0, \quad (30) $$

where we have considered only the part of the Q-function (Eq. 22) that involves $u_j$. It must be noted that the posterior mean vector $\mu_{nj}$ and the covariance matrix $\Sigma_{nj}$ are treated as constants. Also, the matrix $F_{nj}$ in the above rule has $S$ columns given by $F_{ns}\mu_{nj}$, i.e. $F_{nj} = [F_{n1}\mu_{nj} \; F_{n2}\mu_{nj} \; \cdots \; F_{nS}\mu_{nj}]$. We can further write the minimization problem in a more convenient form (similar to Eq. 14):

$$ \min_{u_j}\; \Big\{ \frac{1}{2} u_j^T Z_j u_j + u_j^T q_j \Big\} \quad \text{s.t.} \quad \sum_{s=1}^{S} u_{js} = 1, \; u_{js} \ge 0, \quad (31) $$

where $Z_j = \sum_{n=1}^{N} z_{nj} F_{nj}^T F_{nj}$ and $q_j = -\sum_{n=1}^{N} z_{nj} F_{nj}^T t_n$. This is a typical convex quadratic programming problem with both equality and inequality constraints [25].
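The sketch below illustrates one EM iteration for the mixture (Eqs. 21 and 23-25) under our own naming and data-layout assumptions; it is a direct but unoptimized transcription of the update rules, and the per-component weight update of Eq. (31) is omitted here (it would reuse the SLSQP routine sketched in Section II with the residuals weighted by $z_{nj}$).

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_iteration(T, F, pi, alpha, sigma2, u, a=0.0, b=0.0, c=0.0, d=0.0):
    """One EM iteration (Eqs. 21, 23-25). T: (N, L) targets, F: (N, S, L, L) kernel matrices,
    pi: (K,), alpha: (K, L), sigma2: (K,), u: (K, S). Returns updated parameters and z."""
    N, L = T.shape
    K = len(pi)
    logp = np.zeros((N, K))
    mu = np.zeros((K, N, L))
    Sg_diag = np.zeros((K, N, L))
    Phi = np.zeros((K, N, L, L))
    # E-step: component log-densities (Eq. 21) and posterior statistics (Eqs. 28-29)
    for j in range(K):
        for n in range(N):
            Phi[j, n] = np.tensordot(u[j], F[n], axes=1)                                   # Eq. 16
            C = Phi[j, n] @ np.diag(1.0 / alpha[j]) @ Phi[j, n].T + sigma2[j] * np.eye(L)  # C_nj
            logp[n, j] = np.log(pi[j]) + multivariate_normal.logpdf(T[n], mean=np.zeros(L), cov=C)
            Sigma = np.linalg.inv(Phi[j, n].T @ Phi[j, n] / sigma2[j] + np.diag(alpha[j])) # Eq. 29
            mu[j, n] = Sigma @ Phi[j, n].T @ T[n] / sigma2[j]                              # Eq. 28
            Sg_diag[j, n] = np.diag(Sigma)
    z = np.exp(logp - logp.max(axis=1, keepdims=True))
    z /= z.sum(axis=1, keepdims=True)
    # M-step: closed-form updates (Eqs. 23-25)
    for j in range(K):
        zj = z[:, j]
        pi[j] = zj.sum() / N                                                               # Eq. 23
        alpha[j] = (zj.sum() + 2 * a) / (zj @ (mu[j]**2 + Sg_diag[j]) + 2 * b)             # Eq. 24
        resid = np.array([np.sum((T[n] - Phi[j, n] @ mu[j, n])**2) for n in range(N)])
        gamma = 1.0 - alpha[j] * Sg_diag[j]                                                # gamma_nji
        sigma2[j] = (zj @ resid + 2 * d) / (zj @ (L - gamma.sum(axis=1)) + 2 * c)          # Eq. 25
    return pi, alpha, sigma2, z, mu
```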
After EM convergence, the assignment of each sample $y_n$ to one of the $K$ MKRVM components can be made using the maximum value of the posterior probabilities $z_{nj}$ (Eq. 21). The MKRVM regression function can also be obtained for each component $j$ as follows:

$$ w_j = \Big( \sigma_j^{-2} \sum_{n=1}^{N} z_{nj}\, \Phi_{nj}^T \Phi_{nj} + A_j \Big)^{-1} \sigma_j^{-2} \sum_{n=1}^{N} z_{nj}\, \Phi_{nj}^T t_n. \quad (32) $$

A. Incremental mixture learning

An important concern when applying the EM algorithm is its strong dependence on the initialization of the model parameters. Improper initialization may lead to poor local maxima of the log-likelihood, which in turn may affect the quality of the method's estimation capability. A natural way to initialize is to first sample the training set $Y$ at random to select $K$ samples, one for each component. Then, a single multi-kernel RVM is trained on the corresponding selected sample in order to estimate the regression parameters $\theta_j = \{\alpha_j, \sigma_j^2, u_j\}$ for each component $j$. The mixing parameters $\pi_j$ are initially set to $1/K$. Finally, one iteration of the EM algorithm (one-step-EM) is executed to further refine these parameters and to evaluate the MAP log-likelihood value $L_{MAP}$ (Eq. 20). Several such random trials (100 in our experiments) are executed, and the solution with the maximum log-likelihood value is selected for initializing the model parameters.

In Gaussian mixture modeling, several methods have been proposed to overcome the problem of poor initialization, mainly based on incremental construction of the mixture model [26], [27], [28]. We have adopted such a scheme to train our RVM mixture model and have developed a learning methodology that sequentially adds a new RVM component to the mixture based on a component-splitting procedure. Initially, we start with a mixture model having one MKRVM component. This is done by executing the single multi-kernel RVM learning process to estimate the initial regression parameters $\{\alpha_1, \sigma_1^2, u_1\}$ following the update rules given in the previous section, where we set $j = 1$ and $z_{nj} = 1$. Let us now assume that we have already computed a mixture $f_k$ with $k$ MKRVM components ($k < K$):

$$ f_k(t_n \,|\, \Theta_k) = \sum_{j=1}^{k} \pi_j\, p(t_n \,|\, \theta_j). \quad (33) $$

Also, we denote by

$$ f_k^{-j}(t_n \,|\, \Theta_k^{-j}) = f_k(t_n \,|\, \Theta_k) - \pi_j\, p(t_n \,|\, \theta_j) $$

the model containing the $k - 1$ components of the $k$-order mixture $f_k$ after eliminating the contribution of the $j$-th component. In order to add a new component, an existing component $j^*$ is first selected for splitting based on the current maximum mixing prior probability, i.e. $j^* = \arg\max_j \{\pi_j\}$. A new regression component, labeled $(k+1)$, is then added with weight $\pi_{k+1}$ ($\pi_{k+1} < \pi_{j^*}$). Thus, the new mixture density function $f_{k+1}$ with $k + 1$ components is given by

$$ f_{k+1}(t_n \,|\, \Theta_{k+1}) = f_k^{-j^*}(t_n \,|\, \Theta_k^{-j^*}) + \pi_{j^*}^{new}\, p(t_n \,|\, \theta_{j^*}) + \pi_{k+1}\, p(t_n \,|\, \theta_{k+1}). \quad (34) $$

The mixing weights of both the newly inserted component $(k+1)$ and the split component $(j^*)$ are initialized as $\pi_{k+1} = \pi_{j^*}^{new} = \pi_{j^*}^{old} / 2$. For initializing the RVM parameters $\theta_{k+1} = \{\alpha_{k+1}, \sigma_{k+1}^2, u_{k+1}\}$ we apply the following strategy. First, we find the samples $y_n$ that currently belong to cluster $j^*$. We then select a small percentage of those samples that have the lowest probability (outliers), after sorting them in terms of their density values $p(t_n \,|\, \theta_{j^*}) = \mathcal{N}(t_n \,|\, 0, C_{nj^*})$. Next, we execute the training procedure of a single multi-kernel RVM component on this set of samples, in order to obtain an initial estimate of the MKRVM parameters $\theta_{k+1} = \{\alpha_{k+1}, \sigma_{k+1}^2, u_{k+1}\}$.
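The splitting initialization just described reduces to a few array operations; the sketch below (our own illustration, with hypothetical variable names) selects the component with the largest mixing weight, collects its members, and returns the least probable fraction of them as the seed set for the new component.

```python
import numpy as np

def split_seed_set(z, pi, logdens, frac=0.10):
    """Select the seed samples for a new component (Section III-A).
    z: (N, K) responsibilities, pi: (K,) mixing weights,
    logdens: (N, K) values of log p(t_n | theta_j).
    Returns (j_star, indices of the outlier subset of component j_star)."""
    j_star = int(np.argmax(pi))                                # component with the largest mixing weight
    members = np.where(np.argmax(z, axis=1) == j_star)[0]      # samples currently assigned to j_star
    order = members[np.argsort(logdens[members, j_star])]      # least probable first
    n_out = max(1, int(np.ceil(frac * len(members))))
    return j_star, order[:n_out]

# After training a single MKRVM on the returned subset, the mixing weights would be
# initialized as pi_new[k+1] = pi_new[j_star] = pi_old[j_star] / 2, as stated above.
```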
After the above initialization, the EM algorithm is used to estimate the parameters $\Theta_{k+1}$ of the new mixture model $f_{k+1}$ with $k + 1$ RVM components. The splitting procedure proceeds in this incremental fashion, adding one MKRVM component at a time, until we obtain a mixture model with the desired number $K$ of MKRVM components. This approach is summarized in Algorithm 1. An obvious advantage of the incremental learning scheme is that it simultaneously offers solutions for the intermediate models with $k = 1, \ldots, K$ components. Moreover, it can be exploited to introduce criteria for stopping the evolution of learning: training stops when the insertion of a new component does not offer any significant improvement of the likelihood function.

Algorithm 1 Incremental learning of the mixture of MKRVMs model

Initially apply the single multi-kernel RVM training procedure to the dataset $Y$ to estimate the parameters $\theta_1 = \{\alpha_1, \sigma_1^2, u_1\}$. Set $\Theta_1 = \{\pi_1, \theta_1\}$ with $\pi_1 = 1$ and $k = 1$.
1: while $k < K$ do
2:   Select a component for splitting: $j^* = \arg\max_{j=1}^{k} \{\pi_j\}$.
3:   Find the samples that currently belong to component $j^*$, i.e. $Y^* = \{y_n : j^* = \arg\max_{j=1}^{k} z_{nj}\}$. Sort them in terms of their density values $p(t_n \,|\, \theta_{j^*})$.
4:   Select a subset $Y^*_{out}$ of $Y^*$ (e.g. 10%) containing the least probable samples (outliers).
5:   Run single multi-kernel RVM training over $Y^*_{out}$ to initialize the new component parameters $\theta_{k+1} = \{\alpha_{k+1}, \sigma_{k+1}^2, u_{k+1}\}$.
6:   Initialize the mixing weights as $\pi_{k+1} = \pi_{j^*}^{new} = \pi_{j^*}^{old} / 2$.
7:   Apply the EM algorithm to the new mixture of MKRVMs $f_{k+1}(t_n \,|\, \Theta_{k+1})$ and obtain $\Theta_{k+1}$.
8:   $k = k + 1$.
9: end while
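A minimal driver mirroring Algorithm 1 is sketched below. The callables `train_single_mkrvm` and `em_fit` are hypothetical placeholders for the single-MKRVM training of Section II and the EM procedure of Section III (they are not part of the paper), and `split_seed_set` is the helper sketched after Section III-A above.

```python
import numpy as np

def incremental_mkrvm_mixture(Y, K, train_single_mkrvm, em_fit, frac=0.10):
    """Skeleton of Algorithm 1. Y: list of N samples; K: target number of components.
    train_single_mkrvm(samples) -> theta;  em_fit(Y, pi, thetas) -> (pi, thetas, z, logdens)."""
    thetas = [train_single_mkrvm(Y)]                              # single MKRVM fitted on the whole dataset
    pi = np.array([1.0])
    pi, thetas, z, logdens = em_fit(Y, pi, thetas)
    k = 1
    while k < K:
        j_star, seed_idx = split_seed_set(z, pi, logdens, frac)   # steps 2-4 (see earlier sketch)
        theta_new = train_single_mkrvm([Y[i] for i in seed_idx])  # step 5
        pi = np.append(pi, pi[j_star] / 2.0)                      # step 6
        pi[j_star] /= 2.0
        thetas.append(theta_new)
        pi, thetas, z, logdens = em_fit(Y, pi, thetas)            # step 7
        k += 1
    return pi, thetas, z
```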
IV. EXPERIMENTAL RESULTS

We have selected the task of time-series clustering for evaluating the proposed mixture model, using a variety of artificial datasets and real benchmarks. In this case, each sample $y_n$ is a sequence of real observations measured at $L$ successive time instances, which correspond to the target values $t_{ni}$. At each time instance the input $x_{ni}$ is a $d$-dimensional vector containing the $d$ previous target values, i.e. $x_{ni} = (t_{n,i-d}, t_{n,i-d+1}, \ldots, t_{n,i-1})$. In all experiments we considered inputs of length $d = 10$. Moreover, for constructing the multi-kernel scheme for a dataset, we first calculated the total variance of the samples, $\lambda$. Next, we used a set of $S = 10$ RBF kernel functions, where each one had a scalar parameter $\lambda_s = k_s \lambda$, with $k_s \in \{0.1, 0.2, \ldots, 1.0\}$ (percentage levels). Finally, the linear weights of the multi-kernel scheme were in all cases initialized equally to $u_{js} = 1/S$.

In our study we tested both the incremental and the typical (randomly initialized) regression mixture with MKRVM components, referred to in the following as iMMKRVM and MMKRVM, respectively. In the incremental approach, the hyperparameters $\alpha$ are always initialized as $\alpha_l = 1/L$ (step 5 of Algorithm 1) at every single-MKRVM training. We compared our method with two common regression mixture models:

• The polynomial regression mixture model (MPRM), which assumes a polynomial regression function of order $p$ for each cluster, i.e.

$$ t_{ni} = \sum_{l=0}^{p} w_{jl}\, x_{ni}^{l}, \quad (35) $$

where $w_{jl}$ are the $p + 1$ regression coefficients for each cluster. In this case the time instances are considered as inputs ($x_{ni} = i$) and a (common) Vandermonde design matrix is used. In all experiments we chose polynomials of order $p = 10$, since they showed the best performance.

• The mixture of autoregressive (MAR) model, which consists of $K$ different AR models corresponding to the $K$ clusters of interest. Given a time series $t_n$ and an order $p$, the AR($p$) model assumes that any value $t_{ni}$ has been generated as a linear combination of the $p$ previous values plus a constant term, i.e.

$$ t_{ni} = w_{j0} + \sum_{l=1}^{p} w_{jl}\, t_{n,i-l}. \quad (36) $$

Again, $\{w_{jl}\}_{l=0}^{p}$ are the $p + 1$ coefficients for the $j$-th cluster. In this case, the design matrix is created by setting ones (1) in the first column, while the remaining columns contain the past $p$ values for every time instance. In the experiments we again set $p = 10$.

Both regression mixture models were trained using the EM-based maximum likelihood framework [13], [12], [5], following the typical sample-based initialization strategy described in Section III-A.

To quantify the performance and measure the quality of the clustering results obtained by each method, we used two evaluation criteria:

• purity, which is the percentage of correctly classified samples after labeling each cluster with the label of the class that is most frequent among the sequences belonging to this cluster, and

• normalized mutual information (NMI), which is an information-theoretic measure based on the mutual information of the true labeling ($\Omega$) and the clustering ($C$), normalized by their respective entropies:

$$ NMI(\Omega, C) = \frac{I(\Omega, C)}{[H(\Omega) + H(C)]/2}, \quad (37) $$

where

$$ I(\Omega, C) = \sum_{k} \sum_{j} P(\omega_k, c_j)\, \log \frac{P(\omega_k, c_j)}{P(\omega_k)\, P(c_j)}, \quad (38) $$

$$ H(\Omega) = -\sum_{k} P(\omega_k) \log P(\omega_k), \quad (39) $$

$$ H(C) = -\sum_{k} P(c_k) \log P(c_k). \quad (40) $$

The quantities $P(\omega_k)$, $P(c_j)$ and $P(\omega_k, c_j)$ are the probabilities of a sample belonging to class $\omega_k$, to cluster $c_j$ and to their intersection, respectively, and are computed from the corresponding cardinalities (frequencies).
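Both criteria follow directly from the contingency counts between classes and clusters; the sketch below is our own straightforward implementation of the purity definition and of Eqs. (37)-(40), included only for illustration.

```python
import numpy as np

def purity_and_nmi(labels, clusters):
    """Purity and NMI (Eqs. 37-40) from true class labels and cluster assignments (integer arrays)."""
    labels = np.asarray(labels)
    clusters = np.asarray(clusters)
    N = len(labels)
    classes = np.unique(labels)
    groups = np.unique(clusters)
    # contingency counts: |class w  intersected with  cluster c|
    cont = np.array([[np.sum((labels == w) & (clusters == c)) for c in groups] for w in classes])
    purity = cont.max(axis=0).sum() / N
    P_wc = cont / N
    P_w = P_wc.sum(axis=1)
    P_c = P_wc.sum(axis=0)
    nz = P_wc > 0
    I = np.sum(P_wc[nz] * np.log(P_wc[nz] / np.outer(P_w, P_c)[nz]))     # Eq. 38
    H_w = -np.sum(P_w[P_w > 0] * np.log(P_w[P_w > 0]))                   # Eq. 39
    H_c = -np.sum(P_c[P_c > 0] * np.log(P_c[P_c > 0]))                   # Eq. 40
    nmi = I / ((H_w + H_c) / 2.0)                                        # Eq. 37
    return purity, nmi

# example: purity_and_nmi([0, 0, 1, 1, 2, 2], [0, 0, 1, 1, 1, 2])
```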
A. Experiments with simulated datasets

First, the performance of our method was evaluated using simulated time series with known ground truth. For this purpose we selected the Mackey-Glass delay differential equation, a classical benchmark for time-series modeling, discretized by the following rule:

$$ r(t+1) = r(t) + \delta \left( \frac{0.2\, r(t - \tau/\delta)}{1 + r(t - \tau/\delta)^{10}} - 0.1\, r(t) \right), \quad (41) $$

where the step size is set to $\delta = 0.1$. In our study we generated three data sets using $K = \{2, 3, 4\}$ such series of length $L = 500$, respectively, by considering different values of the delay time $\tau$, as illustrated in Fig. 1. For each class, 100 noisy copies of the original curve were generated using various levels of noise.

[Figure 1. Three sets of Mackey-Glass series used for generating the artificial datasets: (a) K = 2 (τ = 17, 50), (b) K = 3 (τ = 17, 30, 50), (c) K = 4 (τ = 20, 30, 40, 50); each panel plots y(t) for t = 0, ..., 500.]

Since the generative series of each cluster is known, we considered an additional evaluation criterion for quantifying the ability of the proposed methodology to model the temporal patterns of the data. In particular, we computed the mean square error (MSE) between the original Mackey-Glass series $\{r_j,\; j = 1, \ldots, K\}$ (Eq. 41) and the estimated mean curves $t_j$ after convergence, calculated as

$$ t_j = \frac{\sum_{n=1}^{N} z_{nj}\, \Phi_{nj} \mu_{nj}}{\sum_{n=1}^{N} z_{nj}}, \quad (42) $$

and

$$ MSE = \frac{1}{K} \sum_{j=1}^{K} \frac{1}{L}\, \| r_j - t_j \|^2. \quad (43) $$
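For reference, a minimal generator iterating the discretization of Eq. (41) with $\delta = 0.1$ is sketched below; the initial history and the noise level used for the noisy copies are our own assumptions, since the paper does not specify them.

```python
import numpy as np

def mackey_glass(length=500, tau=17, delta=0.1, x0=1.2, seed=None):
    """Iterate Eq. (41): r(t+1) = r(t) + delta*(0.2 r(t - tau/delta) / (1 + r(t - tau/delta)^10) - 0.1 r(t)).
    The history r(t) for t < 0 is set to x0 (an assumption; the paper does not state it)."""
    lag = int(round(tau / delta))
    r = np.full(length + lag, x0, dtype=float)
    for t in range(lag, length + lag - 1):
        r_lag = r[t - lag]
        r[t + 1] = r[t] + delta * (0.2 * r_lag / (1.0 + r_lag**10) - 0.1 * r[t])
    series = r[lag:]
    if seed is not None:                     # optional additive Gaussian noise for the noisy copies
        rng = np.random.default_rng(seed)
        series = series + rng.normal(scale=0.05, size=series.shape)   # noise level is illustrative only
    return series

# e.g. the K = 2 setting of Fig. 1: curves with tau = 17 and tau = 50
r1, r2 = mackey_glass(500, tau=17), mackey_glass(500, tau=50)
```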
For every noise level (SNR) we generated 30 different datasets and calculated the mean value and the standard deviation of the three performance criteria: purity, NMI and MSE. Figure 2 illustrates the comparative results as functions of the SNR values. As is evident, the proposed mixture of MKRVMs significantly improves clustering quality compared to the polynomial and the AR regression mixtures, especially for high noise. Between the two proposed versions of MKRVM mixtures, the one based on incremental learning (iMMKRVM) gave slightly better results, confirming its ability to offer efficient parameter initialization and to reach high-quality solutions. Concerning MSE, it is interesting to observe the significant improvement of the fit error criterion in the case of the mixture of MKRVMs. This is in agreement with our belief that sparseness is beneficial not only for classification accuracy, but also for fitting quality. The MSE results also indicate that the incremental learning approach is superior to the randomly initialized MMKRVM.

[Figure 2. Comparative results for the simulated datasets of Fig. 1. The plots illustrate the three evaluation metrics (purity, NMI, MSE) as functions of the noise level (SNR in dB) for K = 2, 3 and 4, comparing iMMKRVM, MMKRVM, MPRM and MAR.]

B. Experiments with real benchmarks

Further experiments were conducted using various real datasets obtained from the UCR time series data collection [29], where the ground truth is known. Table I presents a summary of the eight (8) UCR datasets used in our study.

Table I. Description of the eight (8) UCR datasets used in our experimental study.

dataset             # classes (K)   size (N)   dimension (T)
CBF                 3               930        128
Coffee              2               56         286
ECG                 2               200        96
Face Four           4               112        350
Gun Point           2               200        150
Synthetic control   6               600        60
Trace               4               200        275
Wafer               2               1000       152

The results using the two evaluation metrics, purity and NMI, are shown in Table II for the two versions of the proposed mixture of MKRVMs and the other two regression mixture models. Since the proposed incremental learning approach (iMMKRVM) does not depend on the initialization, we show only the result of a single run. For the other three methods (MMKRVM, MPRM, MAR) we provide the mean value and the standard deviation of each measure over 30 trials. As can be observed, the performance of the proposed mixture of MKRVMs is clearly superior, and in many cases the difference is quite noticeable.

Table II. Comparative results of the mixture models for the UCR datasets (purity / NMI; mean ± standard deviation over 30 trials where applicable).

UCR Dataset         iMMKRVM          MMKRVM                        MPRM                          MAR
                    purity   NMI     purity        NMI             purity        NMI             purity        NMI
CBF                 0.94     0.79    0.85 ± 0.05   0.60 ± 0.09     0.65 ± 0.02   0.38 ± 0.01     0.60 ± 0.03   0.36 ± 0.02
Coffee              0.64     0.06    0.64 ± 0.00   0.06 ± 0.00     0.56 ± 0.00   0.02 ± 0.00     0.57 ± 0.00   0.02 ± 0.00
ECG                 0.78     0.35    0.78 ± 0.00   0.35 ± 0.00     0.69 ± 0.00   0.12 ± 0.00     0.72 ± 0.00   0.18 ± 0.00
Face Four           0.69     0.46    0.61 ± 0.03   0.39 ± 0.02     0.40 ± 0.04   0.31 ± 0.02     0.41 ± 0.00   0.29 ± 0.00
Gun Point           0.72     0.16    0.72 ± 0.00   0.16 ± 0.00     0.50 ± 0.00   0.04 ± 0.00     0.55 ± 0.00   0.08 ± 0.00
Synthetic Control   0.76     0.74    0.72 ± 0.02   0.73 ± 0.02     0.73 ± 0.01   0.72 ± 0.01     0.70 ± 0.04   0.69 ± 0.03
Trace               0.75     0.68    0.72 ± 0.02   0.64 ± 0.03     0.53 ± 0.00   0.50 ± 0.00     0.67 ± 0.08   0.61 ± 0.07
Wafer               0.75     0.64    0.75 ± 0.00   0.64 ± 0.00     0.61 ± 0.01   0.00 ± 0.00     0.67 ± 0.04   0.50 ± 0.06

Finally, Fig. 3 illustrates the mean mixture regression functions estimated by the proposed iMMKRVM model (according to Eq. 42) for six UCR datasets. From these results a significant conclusion can be drawn about the impact of the multi-kernel scheme on the regression modeling performance, which is affected, sometimes significantly, by the choice of the design matrix. In particular, when the input samples contain strong local variations (such as in the Coffee and Face Four datasets), the estimated regression should capture these local details using small values of the scalar parameters $\lambda_s$. On the contrary, in cases where the data samples are smoother (such as in the CBF and Gun Point datasets), large kernel width parameters provide a better fit. The proposed method, incorporating the multi-kernel scheme, has the flexibility to automatically adapt to the characteristics of the input data samples, thus improving the data fitting capability.

[Figure 3. Examples of the resulting regression function for each component (cluster), as estimated by the proposed method on six UCR datasets: CBF (K = 3), Coffee (K = 2), Face Four (K = 4), Gun Point (K = 2), Trace (K = 4) and Wafer (K = 2).]

V. CONCLUSIONS

In this paper we have proposed a novel regression mixture model, where each mixture component is a multi-kernel RVM regression model. The model is very general and can be used to cluster a set of multidimensional functions, where each function is represented by a set of input-target pairs. The key aspect of the proposed technique lies in the employment of RVMs as components and the exploitation of their superior regression performance to model the data of each latent class. We have also presented a weighted multi-kernel scheme for composing the kernel matrix of each component that offers better fitting capabilities. Learning in the proposed sparse regression mixture model is achieved in terms of a maximum a posteriori (MAP) framework that allows the EM algorithm to be effectively used for estimating the model parameters.
This has the advantage of establishing closed-form update rules during the M-step, so that data fitting is computationally efficient. An incremental learning strategy has also been presented that makes the construction of the sparse regression mixture model independent of parameter initialization. Experiments on several time series clustering datasets demonstrated the ability of the proposed MKRVM mixture to achieve improved clustering performance and robustness compared to other typical regression mixture models.

We are planning to study the performance of the proposed methodology in computer vision applications, such as visual tracking and object detection in a video surveillance domain [6], [7], [8]. Another future research direction is to examine the possibility of applying alternative types of sparse priors [17], [18], as well as to make an extensive comparison with other types of mixture models that employ different regression components, such as Gaussian processes. Furthermore, introducing a fully Bayesian framework to our method constitutes an interesting direction for future work, which could also allow for the estimation of the number of clusters.

Acknowledgments

This manuscript is dedicated to the memory of our friend and colleague Professor Nikolaos P. Galatsanos, who contributed significantly to the research and preparation of this work. He was a great mentor who continues to provide inspiration to all of his former colleagues.

REFERENCES

[1] G. McLachlan and D. Peel, Finite Mixture Models. New York: John Wiley & Sons, Inc., 2000.
[2] C. M. Bishop, Pattern Recognition and Machine Learning. Springer, 2006.
[3] W. DeSarbo and W. Cron, "A maximum likelihood methodology for clusterwise linear regression," Journal of Classification, vol. 5, no. 1, pp. 249-282, 1988.
[4] D. Chudova, S. Gaffney, E. Mjolsness, and P. Smyth, "Mixture models for translation-invariant clustering of sets of multi-dimensional curves," in Proc. of the Ninth ACM SIGKDD Intern. Conf. on Knowledge Discovery and Data Mining, Washington, DC, pp. 79-88, 2003.
[5] K. Blekas, C. Nikou, N. Galatsanos, and N. V. Tsekos, "A regression mixture model with spatial constraints for clustering spatiotemporal data," Inter. Journal on Artificial Intelligence Tools, vol. 17, no. 5, pp. 1023-1041, 2008.
[6] J. Alon, S. Sclaroff, G. Kollios, and V. Pavlovic, "Discovering clusters in motion time-series data," in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 375-381, 2003.
[7] O. Williams, A. Blake, and R. Cipolla, "Sparse Bayesian learning for efficient visual tracking," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 27, no. 8, pp. 1292-1304, 2005.
[8] G. Antonini and J. Thiran, "Counting pedestrians in video sequences using trajectory clustering," IEEE Trans. on Circuits and Systems for Video Technology, vol. 16, no. 8, pp. 1008-1020, 2006.
[9] B. Williams, M. Toussaint, and A. Storkey, "Modelling motion primitives and their timing in biologically executed movements," in Advances in Neural Information Processing Systems 15, pp. 1547-1554, 2008.
[10] T. Liao, "Clustering of time series data - a survey," Pattern Recognition, vol. 38, pp. 1857-1874, 2005.
[11] P. Smyth, "Clustering sequences with hidden Markov models," in Advances in Neural Information Processing Systems, pp. 648-654, 1997.
[12] S. Gaffney and P. Smyth, "Curve clustering with random effects regression mixtures," in Proc. of the Ninth Intern. Workshop on Artificial Intelligence and Statistics (C. M. Bishop and B. J. Frey, eds.), 2003.
[13] Y. Xiong and D.-Y. Yeung, "Mixtures of ARMA models for model-based time series clustering," in IEEE International Conference on Data Mining (ICDM), pp. 717-720, 2002.
[14] J. Shi and B. Wang, "Curve prediction and clustering with mixtures of Gaussian process functional regression models," Statistics and Computing, vol. 18, pp. 267-283, 2008.
[15] M. Tipping, "Sparse Bayesian learning and the Relevance Vector Machine," Journal of Machine Learning Research, vol. 1, pp. 211-244, 2001.
[16] M. Zhong, "A variational method for learning sparse Bayesian regression," Neurocomputing, vol. 69, pp. 2351-2355, 2006.
[17] A. Schmolck and R. Everson, "Smooth Relevance Vector Machine: A smoothness prior extension of the RVM," Machine Learning, vol. 68, no. 2, pp. 107-135, 2007.
[18] M. Seeger, "Bayesian inference and optimal design for the sparse linear model," Journal of Machine Learning Research, vol. 9, pp. 759-813, 2008.
[19] M. Gonen and E. Alpaydin, "Multiple kernel learning algorithms," Journal of Machine Learning Research, vol. 12, pp. 2211-2268, 2011.
[20] A. Dempster, N. Laird, and D. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," J. Roy. Statist. Soc. B, vol. 39, pp. 1-38, 1977.
[21] F. Harrell, Regression Modeling Strategies. With Applications to Linear Models, Logistic Regression and Survival Analysis. Springer-Verlag New York, Inc., 2001.
[22] S. Gunn and J. Kandola, "Structural modelling with sparse kernels," Machine Learning, vol. 48, pp. 137-163, 2002.
[23] M. Girolami and S. Rogers, "Hierarchic Bayesian models for kernel learning," in Intern. Conference on Machine Learning (ICML'05), pp. 241-248, 2005.
[24] M. Hu, Y. Chen, and J. Kwok, "Building sparse multiple-kernel SVM classifiers," IEEE Trans. on Neural Networks, vol. 20, no. 5, pp. 827-839, 2009.
[25] J. Nocedal and S. J. Wright, Numerical Optimization. Springer-Verlag New York, Inc., 1999.
[26] J. Li and A. Barron, "Mixture density estimation," in Advances in Neural Information Processing Systems, vol. 12, pp. 279-285, The MIT Press, 2000.
[27] N. Ueda, R. Nakano, Z. Ghahramani, and G. Hinton, "SMEM algorithm for mixture models," Neural Computation, vol. 12, no. 9, pp. 2109-2128, 2000.
[28] N. Vlassis and A. Likas, "A greedy EM algorithm for Gaussian mixture learning," Neural Processing Letters, vol. 15, pp. 77-87, 2001.
[29] E. Keogh, X. Xi, L. Wei, and C. Ratanamahatana, "The UCR time series classification/clustering homepage: www.cs.ucr.edu/~eamonn/time_series_data/," 2006.