The mixture of multi-kernel relevance vector machines model
Konstantinos Blekas and Aristidis Likas
Department of Computer Science, University of Ioannina
P.O. Box 1186, Ioannina 45110, Greece
E-mail: {kblekas, arly}@cs.uoi.gr
Abstract—We present a new regression mixture model where
each mixture component is a multi-kernel version of the
Relevance Vector Machine (RVM). In the proposed model, we
exploit the enhanced modeling capability of RVMs due to their
embedded sparsity enforcing properties. Moreover, robustness
is achieved with respect to the kernel parameters, by employing
a weighted multi-kernel scheme. The mixture model is trained
using the maximum a posteriori (MAP) approach, where the
Expectation Maximization (EM) algorithm is applied offering
closed form update equations for the model parameters. An
incremental learning methodology is also presented to tackle
the parameter initialization problem of the EM algorithm.
The efficiency of the proposed mixture model is empirically
demonstrated on the time series clustering problem using various artificial and real benchmark datasets and by performing
comparisons with other regression mixture models.
Keywords-Relevance Vector Machines; mixture models;
sparse prior; multi-kernel; incremental EM learning
I. INTRODUCTION
Mixture model training constitutes a flexible and well
established approach in the case of data sets containing
data objects that have been generated from heterogeneous
sources [1], [2]. Among many advantages they offer, mixture
models provide a nice framework for cluster analysis by
assigning objects to mixture components (or clusters) while
simultaneously estimating the model parameters. Regression
mixture models are a special type of mixture models where
the components correspond to regression functions, and they have mainly been employed to model sequential data.
Many problems of scientific interest can be formulated
as sequential data modeling problems. Such type of data
can be encountered in a number of diverse applications,
ranging from gene clustering in bioinformatics to clustering
of cyclone trajectories [3], [4], [5], and recently to video
surveillance problems [6], [7], [8] and motion recognition
[9].
A natural framework for modeling sequential data is
through regression mixture models, also known as latent
class regression analysis [1], [2]. A regression mixture model
allows for simultaneously modeling heterogeneous regression functions by training a mixture of distinct distributions,
each corresponding to a latent class. Objects within each
latent class share the same regression function. In the literature there are different types of regression mixture models that have been used for sequential data modeling [10].
Among them, Hidden Markov Models [11], polynomial and
spline regression models [12], [4], [5], mixtures of ARMA
models [13] and mixtures of Gaussian processes [14] are
commonly used models. These methods suffer from
the drawback of not automatically addressing the problem of
model order selection, which is very important in regression.
If the order of the regressor model is too large, it overfits
the observations and does not generalize well. On the other
hand if it is too small, it might miss trends in the data.
Sparse Bayesian regression offers a solution to the model
selection problem, see for example [15], [16], [17] and
[18], by introducing sparse priors on the model parameters.
Enforcing sparsity is a fundamental machine learning regularization principle and has been successfully used to tackle
several problems, such as feature selection. The key idea
behind sparse Bayesian regression is that we can obtain more flexible inference methods by employing models that initially have many degrees of freedom and imposing a heavy-tailed prior on the coefficients of the regressor.
During training, the coefficients that are not significant are
zeroed out due to the prior, thus only a few coefficients are
retained in the model which are considered significant for
the particular training data.
In this paper we propose a regression mixture model
where each component corresponds to an extension of the
typical RVM model [15] assuming a weighted multi-kernel function, i.e. each RVM kernel is a weighted combination of basic kernels whose combination weights are estimated during training [19]. We call this extension Multi-kernel RVM
(MKRVM). In the RVM model the marginal distribution of
the observations given the hyperparameters is a Gaussian
distribution (see Eq. 9). Therefore, the regression mixture is
converted into a typical mixture model of Gaussians with
zero mean and full covariances. A significant problem in
regression is how to define the scalar parameter of the
kernel design matrix. In this study we have faced this issue
by considering a multi-kernel scheme where we assume
that each mixture component has a unique kernel matrix
calculated as a linear combination of a (common) set of
matrices with known kernel parameter values. These vectors
of coefficients are part of the mixture model parameters
which must be estimated. Then, a maximum a posteriori
expectation maximization algorithm (MAP-EM) [20], [1] is
applied to learn this mixture of multi-kernel RVMs model
and fit the input data. This leads to closed-form update rules for all model parameters during the M-step
and improves data fitting. In the case of the multi-kernel
scheme coefficients this leads to a convex quadratic programming problem with constraints. Another contribution
of the present work is an incremental scheme for training
the mixture of multi-kernel RVMs model that is based on
an appropriate repeated splitting process. This makes the learning process independent of the initialization of the model parameters and helps to obtain better solutions.
The proposed regression mixture models can be employed
to model and cluster a set of functions, i.e. each object in the training set is a function represented by a set of (input, output) pairs. In this work we have focused on the specific case of time series clustering, i.e. the (input, output)
pairs of a data object correspond to a particular time series.
The performance of the proposed methodology is evaluated
using a variety of simulated and real data sets. Comparative
results demonstrate improvements over previous methods
such as the polynomial regression mixture model and the
mixture of autoregressive models. Since the ground truth is
already known for all datasets, we have used the percentage
of correct classification (purity) and the normalized mutual
information (NMI) quantities for evaluating the performance
of each method. Also, in the case of artificial data, we have
computed as a performance metric the error in estimating
the original functions that have generated each cluster. As
experiments indicate, our method offers greater flexibility
and robustness and obtains superior clustering solutions.
In section 2 we describe the multi-kernel relevance vector
machine which is the building block in our approach. The
proposed sparse regression mixture model is then presented
in section 3, along with the EM algorithm used for parameter
estimation and the incremental learning procedure. To assess
the performance of the proposed methodology we present
in section 4 numerical experiments with artificial and real
benchmark data sets. Finally, in section 5 we provide conclusions and suggestions for future research.
II. THE MULTI-KERNEL RELEVANCE VECTOR MACHINE
Suppose we are given a set of N samples (data objects) Y = {y_1, ..., y_N}, where each sample y_n consists of L input-target pairs: y_n = {x_n, t_n} = {(x_{ni}, t_{ni})}_{i=1}^{L}.
In regression problems we consider that the real target values t_{ni} correspond to noisy measurements of the output of a parametric model f with input vector x_{ni}, i.e.

t_{ni} = f(x_{ni}; θ) + ϵ_{ni},    (1)

where ϵ_{ni} refers to noise and θ denotes the model parameters which must be estimated using a training set. Moreover, for the conditional density of each sample y_n we can write

p(y_n = {x_n, t_n} | θ) = p(t_n | x_n, θ) p(x_n) ∝ p(t_n | x_n, θ).    (2)
Typically, we can model t_n by assuming an M-order linear regression model on the L input vectors with an additive noise term given by

t_n = Φ_n w + ϵ_n,    (3)

where w = (w_1, ..., w_M)^T is the vector of the unknown regression coefficients, and Φ_n is the design matrix of size L × M. In the above model, the error term ϵ_n is an L-dimensional vector that is assumed to be zero-mean Gaussian with a spherical covariance, i.e. ϵ_n ∼ N(0, σ^2 I).
For constructing the design matrix Φ we can employ
several approaches. A simple approach is to use the Vandermonde or B-splines matrix, in cases where we assume polynomial or splines regression models, respectively
[21]. Another option is to consider a kernel design matrix of size L × L, consisting of L basis functions, Φ_n = [ϕ(x_{n1}), ..., ϕ(x_{nL})], where ϕ(x_{ni}) is a vector of L kernel values between x_{ni} and all other inputs, i.e. ϕ(x_{ni}) = (K(x_{ni}, x_{n1}), ..., K(x_{ni}, x_{nL})). This is achieved by appropriately selecting a kernel function, with the RBF kernel function being the most commonly used:

K(x_{ni}, x_{nk}) = \exp\left( - \frac{\| x_{ni} - x_{nk} \|^2}{2λ} \right).

The scalar parameter λ plays a significant role in the quality of the fitting procedure. Its selection depends on the amount of local variation of the input data sequences.

In our case we consider a multi-kernel scheme by using a discrete set of S RBF kernel functions K_s, each one having its own scalar parameter value λ_s. In particular, we assume that the composite kernel matrix Φ_n can be written as a linear combination of S kernel matrices F_{ns}:

Φ_n = \sum_{s=1}^{S} u_s F_{ns},    (4)

where the coefficients u_s satisfy the constraints u_s ≥ 0 and \sum_{s=1}^{S} u_s = 1. These parameters should be estimated during
learning in order to construct the composite kernel matrix, as
it will be shown later. It must be noted that the multi-kernel
idea has been successfully used in several machine learning
models [22], [23], [24], [19], which assume a weighted linear sum of kernels and estimate the kernel weights during
training. However, to the best of our knowledge this is the
first time that a multi-kernel version of RVM with adaptive
kernel weights is proposed.
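For illustration, the following is a minimal sketch (not part of the original paper) of how a composite kernel design matrix of the form of Eq. 4 can be assembled from a set of RBF kernels; the function names and the example shapes are our own assumptions.

```python
import numpy as np

def rbf_kernel_matrix(x, lam):
    """Pairwise RBF kernel matrix K(x_i, x_k) = exp(-||x_i - x_k||^2 / (2*lam))."""
    d2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)   # L x L squared distances
    return np.exp(-d2 / (2.0 * lam))

def composite_design_matrix(x, lambdas, u):
    """Composite kernel design matrix Phi = sum_s u_s * F_s (Eq. 4)."""
    F = np.stack([rbf_kernel_matrix(x, lam) for lam in lambdas])  # S x L x L
    return np.tensordot(u, F, axes=1)                             # L x L

# Example: L = 50 inputs of dimension d = 10, S = 3 base kernels, equal weights.
rng = np.random.default_rng(0)
x = rng.normal(size=(50, 10))
lambdas = [0.5, 1.0, 2.0]
u = np.full(len(lambdas), 1.0 / len(lambdas))
Phi = composite_design_matrix(x, lambdas, u)
```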
Using Equation 3, it is obvious that, given the set of regression parameters {w, σ^2, u}, we can model the conditional probability density of the target t_n with the normal distribution, i.e.

p(t_n | w, σ^2, u) = N(t_n | Φ_n w, σ^2 I).    (5)
An important issue when using a regression model is how to define its order M, since models of small order may
lead to underfitting, while large values of M may lead
to overfitting. One approach to tackle this problem is the
Bayesian regularization method that has been successfully
employed in the Relevance Vector Machine (RVM) model
[15]. This technique initially assumes a large value of the
order M (M = L) and imposes a heavy tailed prior
distribution p(w) on the regression model parameters w_i,
to zero out most of them after training.
More specifically, the prior is defined in a hierarchical way by considering a zero-mean Gaussian distribution over w = (w_1, ..., w_L):

p(w | α) = N(w | 0, A^{-1}) = \prod_{i=1}^{L} N(w_i | 0, α_i^{-1}),    (6)

where A is a diagonal matrix containing the L elements of the hyperparameter vector α = [α_1 ... α_L]. In addition, a Gamma prior is imposed on each hyperparameter α_i:

p(α) = \prod_{i=1}^{L} Gamma(α_i | a, b) ∝ \prod_{i=1}^{L} α_i^{a-1} e^{-b α_i}.    (7)

Also, we can assume a Gamma hyperprior over the noise parameter σ^2:

p(σ^{-2}) = Gamma(σ^{-2} | c, d) ∝ σ^{-2(c-1)} e^{-d σ^{-2}}.    (8)

All Gamma parameters {a, b, c, d} are a priori set to zero values to achieve uninformative priors.

The above two-stage hierarchical prior on α_i is actually a Student-t distribution and is called sparse [15], since it enforces most of the values α_i to be large; thus the corresponding coefficients w_i are forced to zero and eliminated from the model. In this way the complexity of the regression model is controlled in an automatic and elegant way and overfitting is avoided. By integrating out the contribution of the weights w from Equation 5, we can obtain the marginal distribution of the target t_n given the model parameters θ = {α, σ^2, u} as a zero-mean Normal distribution:

p(t_n | θ) = ∫ p(t_n | w, σ^2, u) p(w | α) dw = N(t_n | 0, C_n),    (9)

where the covariance matrix has the form:

C_n = Φ_n A^{-1} Φ_n^T + σ^2 I.    (10)

Furthermore, we can obtain the posterior distribution over the weights w, which is also Gaussian:

p(w | t_n, θ) = N(w | µ_n, Σ_n),    (11)

with mean and covariance given by

µ_n = σ^{-2} Σ_n Φ_n^T t_n,    Σ_n = (σ^{-2} Φ_n^T Φ_n + A)^{-1}.    (12)

Thus, Φ_n µ_n denotes the final model-based estimate for sample y_n.

Learning the linear weights u of the multi-kernel scheme can be done using the fact that

Φ_n µ_n = \sum_{s=1}^{S} u_s F_{ns} µ_n = F_n u,    where    F_n = [F_{n1} µ_n  F_{n2} µ_n  ···  F_{nS} µ_n],

and by solving the following minimization problem:

\min_u \frac{1}{2} \sum_{n=1}^{N} \| t_n - F_n u \|^2    s.t.    \sum_{s=1}^{S} u_s = 1  and  u_s ≥ 0.    (13)

More compactly, we can rewrite the above objective function as follows:

\min_u \left\{ \frac{1}{2} u^T Z u + u^T q \right\}    s.t.    \sum_{s=1}^{S} u_s = 1,  u_s ≥ 0,    (14)

where Z = \sum_{n=1}^{N} F_n^T F_n and q = -\sum_{n=1}^{N} F_n^T t_n. This is a typical convex quadratic programming problem with both equality and inequality constraints that can be solved by active-set methods that use Lagrange multipliers, leading to closed-form analytical expressions [25].
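As a rough illustration (not part of the original paper), the constrained problem of Eq. 14 can also be handled with a general-purpose solver such as SciPy's SLSQP instead of the active-set method of [25]; the function name and the list-of-matrices interface below are our own assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def solve_kernel_weights(F_cols, targets):
    """Estimate the multi-kernel weights u (Eqs. 13-14):
       minimize 0.5 u^T Z u + u^T q  s.t.  sum_s u_s = 1, u_s >= 0,
       where each element of F_cols is the L x S matrix F_n = [F_n1 mu_n ... F_nS mu_n]."""
    S = F_cols[0].shape[1]
    Z = sum(F.T @ F for F in F_cols)
    q = -sum(F.T @ t for F, t in zip(F_cols, targets))

    def objective(u):
        return 0.5 * u @ Z @ u + u @ q

    def gradient(u):
        return Z @ u + q

    constraints = [{"type": "eq", "fun": lambda u: u.sum() - 1.0}]
    bounds = [(0.0, None)] * S
    u0 = np.full(S, 1.0 / S)                      # start from uniform weights
    res = minimize(objective, u0, jac=gradient, bounds=bounds,
                   constraints=constraints, method="SLSQP")
    return res.x
```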
III. THE MIXTURE OF MKRVMS MODEL

In the mixture of MKRVMs model there are K multi-kernel RVM components with parameters θ_j = {α_j, σ_j^2, u_j}, j = 1, ..., K. According to Eq. 9, each component defines a zero-mean Normal distribution:

p(t_n | θ_j) = N(t_n | 0, C_{nj}),    (15)

with

C_{nj} = Φ_{nj} A_j^{-1} Φ_{nj}^T + σ_j^2 I    and    Φ_{nj} = \sum_{s=1}^{S} u_{js} F_{ns}.    (16)

Moreover, A_j in Eq. 16 is a diagonal matrix containing the elements of the hyperparameter vector α_j, i.e. A_j = diag{α_{j1}, ..., α_{jL}}. The MKRVM mixture model is described by the following probability density function:

f(t_n | Θ) = \sum_{j=1}^{K} π_j p(t_n | θ_j) = \sum_{j=1}^{K} π_j N(t_n | 0, C_{nj}).    (17)

Let Θ denote the set of all mixture model parameters, i.e. Θ = {π_j, θ_j}_{j=1}^{K}. The mixing weights π_j satisfy the constraints \sum_{j=1}^{K} π_j = 1 and π_j ≥ 0. The same holds for the coefficients u_j of the multi-kernel scheme, i.e. \sum_{s=1}^{S} u_{js} = 1 and u_{js} ≥ 0. The parameters {α_j, σ_j^2} are constrained by Gamma prior distributions:

p(α_j) = \prod_{i=1}^{L} Gamma(α_{ji} | a_j, b_j) ∝ \prod_{i=1}^{L} α_{ji}^{a_j - 1} e^{-b_j α_{ji}},    (18)

p(σ_j^{-2}) = Gamma(σ_j^{-2} | c_j, d_j) ∝ σ_j^{-2(c_j - 1)} e^{-d_j σ_j^{-2}},    (19)

where all Gamma parameters {a_j, b_j, c_j, d_j} are set to zero (uninformative priors).
To train the mixture of MKRVMs model we define the maximum a posteriori (MAP) log-likelihood function:

L_{MAP} = \log p(Y | Θ) + \log p(Θ) = \sum_{n=1}^{N} \log \left\{ \sum_{j=1}^{K} π_j N(t_n | 0, C_{nj}) \right\} + \sum_{j=1}^{K} \left\{ \log p(α_j) + \log p(σ_j^{-2}) \right\},    (20)

and use the Expectation-Maximization (EM) algorithm [20] for likelihood maximization. EM iteratively performs two steps. In the E-step, the current posterior probabilities of any sample y_n = {x_n, t_n} belonging to any cluster j are calculated:

z_{nj} = P(j | y_n, Θ) = \frac{π_j N(t_n | 0, C_{nj})}{\sum_{j'=1}^{K} π_{j'} N(t_n | 0, C_{nj'})}.    (21)

During the M-step the maximization of the expected value of the complete log-likelihood (Q-function) is performed with respect to Θ. In our case the Q-function is:

Q(Θ) = \sum_{n=1}^{N} \sum_{j=1}^{K} z_{nj} \left\{ \log π_j - \frac{1}{2} \log |C_{nj}| - \frac{1}{2} t_n^T C_{nj}^{-1} t_n \right\}
     + \sum_{j=1}^{K} \left\{ \sum_{i=1}^{L} \{ a_j \log(α_{ji}) - b_j α_{ji} \} + c_j \log(σ_j^{-2}) - d_j σ_j^{-2} \right\},    (22)

where the quantities z_{nj} have been computed by Eq. 21. Setting the partial derivatives equal to zero, the following update rules for the mixture parameters are obtained:

\hat{π}_j = \frac{\sum_{n=1}^{N} z_{nj}}{N},    (23)

\hat{α}_{ji} = \frac{\sum_{n=1}^{N} z_{nj} + 2 a_j}{\sum_{n=1}^{N} z_{nj} \left( µ_{nji}^2 + (Σ_{nj})_{ii} \right) + 2 b_j},    (24)

\hat{σ}_j^2 = \frac{\sum_{n=1}^{N} z_{nj} \| t_n - Φ_{nj} µ_{nj} \|^2 + 2 d_j}{\sum_{n=1}^{N} z_{nj} \left( L - \sum_{i=1}^{L} γ_{nji} \right) + 2 c_j}.    (25)

In the above rules we have used the following expressions [15]:

\log |Φ_{nj} A_j^{-1} Φ_{nj}^T + σ_j^2 I| = - \log |Σ_{nj}| + L \log σ_j^2 - \log |A_j|,    (26)

t_n^T (Φ_{nj} A_j^{-1} Φ_{nj}^T + σ_j^2 I)^{-1} t_n = \frac{1}{σ_j^2} t_n^T (t_n - Φ_{nj} µ_{nj}) = \frac{1}{σ_j^2} \| t_n - Φ_{nj} µ_{nj} \|^2 + µ_{nj}^T A_j µ_{nj},    (27)

where

µ_{nj} = σ_j^{-2} Σ_{nj} Φ_{nj}^T t_n,    (28)

Σ_{nj} = (σ_j^{-2} Φ_{nj}^T Φ_{nj} + A_j)^{-1}.    (29)

Note also that in the above equations (Eqs. 24, 25) (Σ_{nj})_{ii} indicates the i-th diagonal element of the posterior weight covariance matrix Σ_{nj} of the j-th RVM, while µ_{nji} is the i-th element of the posterior mean vector µ_{nj}. Also, the quantities {γ_{nji}} are defined as γ_{nji} = 1 - α_{ji} (Σ_{nj})_{ii}.

As in the case of a single multi-kernel RVM model (see Eq. 13), the linear weights u_j = (u_{j1}, ..., u_{jS}) of the multi-kernel scheme can be estimated by solving the following minimization problem per component j:

\min_{u_j} \frac{1}{2} \sum_{n=1}^{N} z_{nj} \| t_n - Φ_{nj} µ_{nj} \|^2
  = \min_{u_j} \frac{1}{2} \sum_{n=1}^{N} z_{nj} \| t_n - \sum_{s=1}^{S} u_{js} F_{ns} µ_{nj} \|^2
  = \min_{u_j} \frac{1}{2} \sum_{n=1}^{N} z_{nj} \| t_n - F_{nj} u_j \|^2    s.t.    \sum_{s=1}^{S} u_{js} = 1  and  u_{js} ≥ 0,    (30)

where we have considered only the part of the Q-function (Eq. 22) that involves u_j. It must be noted here that we treat the posterior mean vector µ_{nj} and the covariance matrix Σ_{nj} as constants. Also, the matrix F_{nj} in the above rule has S columns calculated as F_{ns} µ_{nj}, i.e. F_{nj} = [F_{n1} µ_{nj}  F_{n2} µ_{nj}  ···  F_{nS} µ_{nj}]. We can further write the minimization problem in a more convenient way (similar to Eq. 14):

\min_{u_j} \left\{ \frac{1}{2} u_j^T Z_j u_j + u_j^T q_j \right\}    s.t.    \sum_{s=1}^{S} u_{js} = 1,  u_{js} ≥ 0,    (31)

where Z_j = \sum_{n=1}^{N} z_{nj} F_{nj}^T F_{nj} and q_j = - \sum_{n=1}^{N} z_{nj} F_{nj}^T t_n. This is a typical convex quadratic programming problem with both equality and inequality constraints [25].
After EM convergence, the assignment of each sample y_n to the K MKRVM components can be made using the maximum value of the posterior probabilities z_{nj} (Eq. 21). The MKRVM regression function can also be obtained for each component j as follows:

w_j = \left( σ_j^{-2} \sum_{n=1}^{N} z_{nj} Φ_{nj}^T Φ_{nj} + A_j \right)^{-1} σ_j^{-2} \sum_{n=1}^{N} z_{nj} Φ_{nj}^T t_n.    (32)
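To make the EM iteration concrete, here is a rough single-step sketch (ours, not the authors' code) implementing Eqs. 21, 23-25, 28 and 29 under simplifying assumptions: uninformative Gamma priors (a_j = b_j = c_j = d_j = 0), dense covariance computations, and array shapes of our own choosing. The multi-kernel weights u_j would additionally be updated by the quadratic program of Eq. 31, e.g. with a solver like the one sketched after Eq. 14.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(t, F, pi, alpha, sigma2, u):
    """One EM iteration for the mixture of MKRVMs.
       t: (N, L) targets, F: (N, S, L, L) base kernel matrices,
       pi: (K,), alpha: (K, L), sigma2: (K,), u: (K, S)."""
    N, L = t.shape
    K = len(pi)

    # Composite design matrices Phi_nj = sum_s u_js F_ns  (Eq. 16)
    Phi = np.einsum('ks,nsab->knab', u, F)                 # (K, N, L, L)

    # E-step: responsibilities z_nj (Eq. 21)
    logp = np.empty((N, K))
    for j in range(K):
        for n in range(N):
            C = Phi[j, n] @ np.diag(1.0 / alpha[j]) @ Phi[j, n].T + sigma2[j] * np.eye(L)
            logp[n, j] = np.log(pi[j]) + multivariate_normal.logpdf(t[n], mean=np.zeros(L), cov=C)
    z = np.exp(logp - logp.max(axis=1, keepdims=True))
    z /= z.sum(axis=1, keepdims=True)

    # M-step: pi, alpha, sigma2 updates (Eqs. 23-25) with a = b = c = d = 0
    new_pi = z.sum(axis=0) / N
    new_alpha = np.empty_like(alpha)
    new_sigma2 = np.empty_like(sigma2)
    for j in range(K):
        A = np.diag(alpha[j])
        num_a = z[:, j].sum()
        den_a = np.zeros(L)
        num_s = 0.0
        den_s = 0.0
        for n in range(N):
            Sigma = np.linalg.inv(Phi[j, n].T @ Phi[j, n] / sigma2[j] + A)   # Eq. 29
            mu = Sigma @ Phi[j, n].T @ t[n] / sigma2[j]                      # Eq. 28
            gamma = 1.0 - alpha[j] * np.diag(Sigma)
            den_a += z[n, j] * (mu ** 2 + np.diag(Sigma))
            num_s += z[n, j] * np.sum((t[n] - Phi[j, n] @ mu) ** 2)
            den_s += z[n, j] * (L - gamma.sum())
        new_alpha[j] = num_a / den_a
        new_sigma2[j] = num_s / den_s
    return z, new_pi, new_alpha, new_sigma2
```

In practice, the log-density evaluation would preferably reuse the identities of Eqs. 26-27 rather than forming C_{nj} explicitly.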
A. Incremental mixture learning
An important concern when applying the EM algorithm,
is its strong dependence on the initialization of the model
parameters. Improper initialization may lead to poor local
maxima of the log-likelihood, a fact that in turn may affect
the quality of the method’s estimation capability. A natural
way for initialization is to first make a random sampling
through the training set Y to select K samples, one for
each component. Then, a single multi-kernel RVM is trained
using the corresponding selected sample in order to estimate
the regression parameters θ_j = {α_j, σ_j^2, u_j} for each
component j. The mixing parameters π_j are initially set to 1/K. Finally, one iteration of the EM algorithm (one-step EM) is executed to further refine these parameters and to evaluate the MAP log-likelihood function value L_{MAP} (Eq.
20). Several such random trials (100 in our experiments) are
executed and the solution with the maximum log-likelihood
value is selected for initializing the model parameters.
In Gaussian mixture modeling, several methods have been
proposed to overcome the problem of poor initialization,
which are mainly based on incremental construction of the
mixture model [26], [27], [28]. We have adopted such a
scheme to train our RVM mixture model and have developed
a learning methodology that sequentially adds a new RVM
component to the mixture based on a component splitting
procedure. Initially, we start with a mixture model with one
MKRVM component. This is done by executing the single
multi-kernel RVM learning process to estimate the initial
regression parameters {α_1, σ_1^2, u_1} following the update rules given in the previous section, where we set j = 1 and z_{nj} = 1.
Let us now assume that we have already computed a mixture f_k with k MKRVM components (k < K):

f_k(t_n | Θ_k) = \sum_{j=1}^{k} π_j p(t_n | θ_j).    (33)

Also, we denote by

f_k^{-j}(t_n | Θ_k^{-j}) = f_k(t_n | Θ_k) - π_j p(t_n | θ_j)

the model containing the k - 1 components of the k-order mixture model f_k, after eliminating the contribution of the j-th component.
In order to add a new component, at first an existing component j^* is selected for splitting based on the current maximum mixing prior probability value, i.e. j^* = \arg\max_j {π_j}. A new regression component is then added, labeled (k+1), with weight π_{k+1} (π_{k+1} < π_{j^*}). Thus, the new mixture density function f_{k+1} with k+1 components is now given as:

f_{k+1}(t_n | Θ_{k+1}) = f_k^{-j^*}(t_n | Θ_k^{-j^*}) + π_{j^*}^{new} p(t_n | θ_{j^*}) + π_{k+1} p(t_n | θ_{k+1}).    (34)

The mixing weights of both the newly inserted (k+1) and the split (j^*) component are initialized as π_{k+1} = π_{j^*}^{new} = π_{j^*}^{old} / 2. For initializing the RVM parameters θ_{k+1} = {α_{k+1}, σ_{k+1}^2, u_{k+1}} we apply the following strategy: First, we find the samples y_n that currently belong to the cluster j^*. We then select a small percentage of those samples that have the lowest probability (outliers), after sorting them in terms of their density values p(t_n | θ_{j^*}) = N(t_n | 0, C_{nj^*}). Next we execute the training procedure of a single multi-kernel RVM component on this set of samples, in order to obtain an initial estimate of the MKRVM parameters θ_{k+1} = {α_{k+1}, σ_{k+1}^2, u_{k+1}}. After the above initialization, the EM algorithm is used to estimate the parameters Θ_{k+1} of the new mixture model f_{k+1} with k+1 RVM components.
The splitting procedure proceeds in this incremental fashion, adding one MKRVM component at a time, until we obtain a mixture model with the desired number (K) of MKRVM components. This approach is summarized in Algorithm 1. An obvious advantage of the incremental learning scheme is that it simultaneously offers solutions for the intermediate models with k = 1, ..., K components. Moreover, it can be exploited in order to introduce criteria for stopping the evolution of learning: stop training when the insertion of a new component does not offer any significant improvement of the likelihood function.
Algorithm 1 Incremental learning of the mixture of MKRVMs model
Initially apply the single multi-kernel RVM training procedure to the dataset Y for estimating the parameters θ_1 = {α_1, σ_1^2, u_1}. Set Θ_1 = {π_1, θ_1} with π_1 = 1.
1: while k < K do
2:   Select a component for splitting: j^* = \arg\max_{j=1}^{k} {π_j}.
3:   Find the samples that currently belong to component j^*, i.e. Y^* = {y_n : j^* = \arg\max_{j=1}^{k} z_{nj}}. Sort them in terms of their density values p(t_n | θ_{j^*}).
4:   Select a subset (e.g. 10%) Y^*_{out} of Y^* with the least probable samples (outliers).
5:   Run single multi-kernel RVM training over Y^*_{out} for initializing the new component parameters θ_{k+1} = {α_{k+1}, σ_{k+1}^2, u_{k+1}}.
6:   Initialize the mixing weights as π_{k+1} = π_{j^*}^{new} = π_{j^*}^{old} / 2.
7:   Apply the EM algorithm to the new mixture of MKRVMs f_{k+1}(t_n | Θ_{k+1}) and obtain Θ_{k+1}.
8:   k = k + 1.
9: end while
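The incremental procedure can also be summarized as the code skeleton below; it is only a sketch, and the helpers train_single_mkrvm, run_em and component_density are hypothetical stand-ins for the single-MKRVM training, the EM algorithm of Section III, and the component density p(t_n | θ_j), respectively.

```python
import numpy as np

def incremental_mkrvm_mixture(Y, K, train_single_mkrvm, run_em, component_density,
                              outlier_frac=0.10):
    """Sketch of Algorithm 1: grow the mixture one MKRVM component at a time."""
    thetas = [train_single_mkrvm(Y)]                 # one-component model
    pi = np.array([1.0])
    pi, thetas, z = run_em(Y, pi, thetas)
    k = 1
    while k < K:
        j_star = int(np.argmax(pi))                              # split component with largest weight
        members = np.where(z.argmax(axis=1) == j_star)[0]        # samples of cluster j*
        dens = np.array([component_density(Y[n], thetas[j_star]) for n in members])
        n_out = max(1, int(outlier_frac * len(members)))
        outliers = members[np.argsort(dens)[:n_out]]             # least probable samples
        thetas.append(train_single_mkrvm([Y[n] for n in outliers]))
        pi = np.append(pi, pi[j_star] / 2.0)                     # pi_{k+1} = pi_{j*}/2
        pi[j_star] /= 2.0
        pi, thetas, z = run_em(Y, pi, thetas)                    # re-estimate the full mixture
        k += 1
    return pi, thetas, z
```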
IV. EXPERIMENTAL RESULTS
We have selected the task of time-series clustering for evaluating the proposed mixture model, using a variety of artificial datasets and real benchmarks. In this case, each sample y_n is a sequence of real observations measured at L successive time instances that correspond to the target values t_{ni}. At each time instance the input x_{ni} is a d-dimensional vector that contains the d previous target values, i.e. x_{ni} = (t_{n,i-d}, t_{n,i-d+1}, ..., t_{n,i-1}). During all experiments we have considered inputs of length d = 10. Moreover, for constructing the multi-kernel scheme for a dataset, we first calculated the total variance of the samples, λ. Next, we used a set of S = 10 RBF kernel functions, where each one had a scalar parameter λ_s = k_s λ, with k_s = [0.1, 0.2, ..., 1.0] (percentage levels). Finally, the linear weights of the multi-kernel scheme were in all cases initialized equally to u_{js} = 1/S.
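As an illustration of this setup, a small sketch (ours) of the delay embedding of a series into (input, target) pairs and of the kernel-width grid λ_s = k_s λ follows; the function names are assumptions.

```python
import numpy as np

def delay_embedding(t_series, d=10):
    """Inputs x_ni = (t_{n,i-d}, ..., t_{n,i-1}) paired with targets t_ni,
       for the time instances where d past values are available."""
    X = np.array([t_series[i - d:i] for i in range(d, len(t_series))])
    t = t_series[d:]
    return X, t

def kernel_width_grid(all_targets, S=10):
    """Scalar parameters lambda_s = k_s * lambda, k_s = 0.1, ..., 1.0,
       where lambda is the total variance of the samples."""
    lam = np.var(np.concatenate(all_targets))
    return [(s / S) * lam for s in range(1, S + 1)]
```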
In our study we have tested both the incremental and the typical (with random initialization) regression mixture with MKRVM components, which will be referred to in the following as iMMKRVM and MMKRVM, respectively. In the incremental approach, the hyperparameters α are always initialized as α_l = 1/L (step 5 of Algorithm 1) at every single MKRVM training. We compared our method with two common regression mixture models:

• The polynomial regression mixture model (MPRM), which considers a polynomial regression function of order p for any cluster, i.e.

t_{ni} = \sum_{l=0}^{p} w_{jl} x_{ni}^{l},    (35)

where the w_{jl} are the p + 1 regression coefficients for each cluster. In this case the time instances are considered as inputs (x_{ni} = i) and a (common) Vandermonde design matrix is used. Finally, in all experiments we have chosen polynomials of order p = 10, since they showed better performance.

• The mixture of autoregressive (MAR) model, which consists of K different AR models corresponding to the K clusters of interest. Given a time series t_n and an order p, the AR(p) model assumes that any value t_{ni} has been generated as a linear combination of the p previous values plus a constant term, i.e.

t_{ni} = w_{j0} + \sum_{l=1}^{p} w_{jl} t_{n,i-l}.    (36)

Again, {w_{jl}}_{l=0}^{p} are the p + 1 coefficients for the j-th cluster. In this case, the design matrix is created by setting ones (1) in the first column, while the remaining columns contain the past p values for every time instance. Experiments have been made setting p = 10.

Both regression mixture models were trained using the EM-based maximum likelihood framework [13], [12], [5], where we follow the typical sample-based initialization strategy described in section III-A.

To quantify the performance and measure the quality of the clustering results obtained by each method, we have used two evaluation criteria:

• purity, which is the percentage of correctly classified samples after labeling each cluster with the label of the class which is most frequent among the sequences that belong to this cluster, and

• normalized mutual information (NMI), which is an information-theoretic measure based on the mutual information of the true labeling (Ω) and the clustering (C), normalized by their respective entropies:

NMI(Ω, C) = \frac{I(Ω, C)}{[H(Ω) + H(C)]/2},    (37)

where

I(Ω, C) = \sum_{k} \sum_{j} P(ω_k, c_j) \log \frac{P(ω_k, c_j)}{P(ω_k) P(c_j)},    (38)

H(Ω) = - \sum_{k} P(ω_k) \log P(ω_k),    (39)

H(C) = - \sum_{k} P(c_k) \log P(c_k).    (40)

The quantities P(ω_k), P(c_j) and P(ω_k, c_j) are the probabilities of a sample belonging to class ω_k, to cluster c_j and to their intersection, respectively, and are computed based on the corresponding set of cardinalities (frequencies).
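For reference, here is a minimal sketch (not from the paper) of the two scores, assuming integer-coded class labels and cluster assignments.

```python
import numpy as np

def purity(labels, clusters):
    """Fraction of samples whose cluster's majority class matches their true class."""
    correct = 0
    for c in np.unique(clusters):
        members = labels[clusters == c]
        correct += np.bincount(members).max()   # count of the most frequent class
    return correct / len(labels)

def nmi(labels, clusters):
    """Normalized mutual information (Eqs. 37-40), computed from joint frequencies."""
    classes, clusts = np.unique(labels), np.unique(clusters)
    P_joint = np.array([[np.mean((labels == w) & (clusters == c)) for c in clusts]
                        for w in classes])
    P_w, P_c = P_joint.sum(axis=1), P_joint.sum(axis=0)
    nz = P_joint > 0
    I = np.sum(P_joint[nz] * np.log(P_joint[nz] / np.outer(P_w, P_c)[nz]))
    H_w = -np.sum(P_w[P_w > 0] * np.log(P_w[P_w > 0]))
    H_c = -np.sum(P_c[P_c > 0] * np.log(P_c[P_c > 0]))
    return I / ((H_w + H_c) / 2.0)
```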
A. Experiments with simulated datasets

At first, the performance of our method was evaluated using simulated time series with known ground truth. For this purpose we have selected the Mackey-Glass delay differential equation, which provides a classical benchmark for time-series modeling, given by the following rule:

r(t+1) = r(t) + δ \left( 0.2 \frac{r(t - τ/δ)}{1 + r(t - τ/δ)^{10}} - 0.1 r(t) \right),    (41)

where the step size δ is set to δ = 0.1. In our study we have generated three data sets using K = {2, 3, 4} such series of length L = 500, respectively, by considering different values for the delay time τ, as illustrated in Fig. 1. A number of 100 noisy copies of the original curve per class were generated using various levels of noise.
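Such series can be produced, for instance, by the Euler discretization of Eq. 41 sketched below; the burn-in length and the constant initial history r(t) = 1.2 for t ≤ 0 are our own choices, not specified in the paper.

```python
import numpy as np

def mackey_glass(tau, length=500, delta=0.1, burn_in=1000, r0=1.2):
    """Discretized Mackey-Glass series following Eq. 41 with step size delta."""
    lag = int(tau / delta)
    r = [r0] * (lag + 1)                    # constant history for the initial lag
    for _ in range(burn_in + length):
        r_tau = r[-lag - 1]                 # r(t - tau/delta)
        r.append(r[-1] + delta * (0.2 * r_tau / (1.0 + r_tau ** 10) - 0.1 * r[-1]))
    return np.array(r[-length:])

# Example: the two-class dataset of Fig. 1(a) uses tau = 17 and tau = 50.
series_a = [mackey_glass(17), mackey_glass(50)]
```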
Since we are aware of the generative series of each cluster, we have considered an additional evaluation criterion for quantifying the ability of the proposed methodology to model the temporal patterns of the data. In particular, we have computed the mean square error (MSE) between the original Mackey-Glass series {r_j, j = 1, ..., K} (Eq. 41) and the estimated mean curves t_j after convergence, calculated as:

t_j = \frac{\sum_{n=1}^{N} z_{nj} Φ_{nj} µ_{nj}}{\sum_{n=1}^{N} z_{nj}},    (42)

and

MSE = \frac{1}{K} \sum_{j=1}^{K} \frac{1}{L} \| r_j - t_j \|^2.    (43)
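A direct sketch (ours) of Eqs. 42 and 43, assuming the per-sample estimates Φ_{nj} µ_{nj} have been collected into an array:

```python
import numpy as np

def mean_curves(z, fitted):
    """Eq. 42: t_j = sum_n z_nj (Phi_nj mu_nj) / sum_n z_nj.
       z: (N, K) responsibilities, fitted: (N, K, L) estimates Phi_nj mu_nj."""
    return np.einsum('nk,nkl->kl', z, fitted) / z.sum(axis=0)[:, None]

def mse(r_true, t_mean):
    """Eq. 43: average per-cluster squared error between true and estimated curves."""
    K, L = r_true.shape
    return np.mean(np.sum((r_true - t_mean) ** 2, axis=1) / L)
```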
Table I
DESCRIPTION OF THE EIGHT (8) UCR DATASETS USED IN OUR EXPERIMENTAL STUDY.

dataset            | # classes (K) | size (N) | dimension (T)
CBF                | 3             | 930      | 128
Coffee             | 2             | 56       | 286
ECG                | 2             | 200      | 96
Face Four          | 4             | 112      | 350
Gun Point          | 2             | 200      | 150
Synthetic control  | 6             | 600      | 60
Trace              | 4             | 200      | 275
Wafer              | 2             | 1000     | 152
Figure 1. Three sets with (a) K = 2 (τ = 17, 50), (b) K = 3 (τ = 17, 30, 50) and (c) K = 4 (τ = 20, 30, 40, 50) Mackey-Glass series of y(t) versus time, used for generating the artificial datasets.

For every noise level (SNR) we generated 30 different datasets and we calculated the mean value and the standard deviation of the three performance criteria: purity, NMI and MSE. Figure 2 illustrates the comparative results in terms of the SNR values. As is obvious, the proposed mixture of MKRVMs model significantly improves the clustering quality compared to the polynomial and the AR regression mixtures, especially for high noise. Between the two proposed versions of MKRVM mixtures, the one based on incremental learning (iMMKRVM) gave slightly better results, confirming its ability to offer efficient parameter initialization and to reach high quality solutions. Concerning the MSE, it is interesting to observe the significant improvement of the fit error criterion in the case of the mixture of MKRVMs. This is in agreement with our belief that sparseness is beneficial not only for classification accuracy, but also for fitting quality. The MSE results also indicate that the incremental learning approach is superior to the randomly initialized MMKRVM.

B. Experiments with real benchmarks
Further experiments have been conducted using various
real datasets, obtained from the UCR time series data
collection [29], where the ground truth is known. In Table
I we present a summary of eight (8) UCR datasets we have
used in our study. The results using two evaluation metrics,
purity and NMI, are shown in Table II for the two versions
of the proposed mixture of MKRVMs and the other two
regression mixture models. Since the proposed incremental learning approach (iMMKRVM) does not depend on the initialization, we show only the result of a single run. For the other three methods (MMKRVM, MPRM, MAR) we provide the mean value and the standard deviation of each measure (for 30 trials). As can be observed, the performance of the
proposed mixture of MKRVMs is obviously superior and in
many cases the difference is quite noticeable.
Finally, Fig. 3 illustrates the mean mixture regression functions as estimated by the proposed iMMKRVM model (according to Eq. 42) in the case of six UCR datasets. From these results a significant conclusion can be drawn about the impact of the multi-kernel scheme on the regression modeling performance, which is affected, sometimes significantly, by the choice of the proper design matrix. In particular, when the input samples contain strong local variations (such as in the Coffee and Face Four datasets), the estimated regression should capture these local details using small values of the scalar parameters λ_s. On the contrary, in cases where the data samples are smoother (such as in the CBF and Gun Point datasets), large kernel width parameters provide a better fit. The proposed method, incorporating the multi-kernel scheme, has the flexibility to automatically adapt to the characteristics of the input data samples, thus improving the data fitting capability.
Figure 2. Comparative results for the simulated datasets of Fig. 1 (rows of plots for K = 2, K = 3 and K = 4). Plots illustrate the three evaluation metrics (purity, NMI and MSE) versus SNR (dB) for the iMMKRVM, MMKRVM, MPRM and MAR models.
Table II
COMPARATIVE RESULTS OF MIXTURE MODELS FOR THE UCR DATASETS

UCR Dataset        | iMMKRVM purity / NMI | MMKRVM purity / NMI       | MPRM purity / NMI         | MAR purity / NMI
CBF                | 0.94 / 0.79          | 0.85±0.05 / 0.60±0.09     | 0.65±0.02 / 0.38±0.01     | 0.60±0.03 / 0.36±0.02
Coffee             | 0.64 / 0.06          | 0.64±0.00 / 0.06±0.00     | 0.56±0.00 / 0.02±0.00     | 0.57±0.00 / 0.02±0.00
ECG                | 0.78 / 0.35          | 0.78±0.00 / 0.35±0.00     | 0.69±0.00 / 0.12±0.00     | 0.72±0.00 / 0.18±0.00
Face Four          | 0.69 / 0.46          | 0.61±0.03 / 0.39±0.02     | 0.40±0.04 / 0.31±0.02     | 0.41±0.00 / 0.29±0.00
Gun Point          | 0.72 / 0.16          | 0.72±0.00 / 0.16±0.00     | 0.50±0.00 / 0.04±0.00     | 0.55±0.00 / 0.08±0.00
Synthetic Control  | 0.76 / 0.74          | 0.72±0.02 / 0.73±0.02     | 0.73±0.01 / 0.72±0.01     | 0.70±0.04 / 0.69±0.03
Trace              | 0.75 / 0.68          | 0.72±0.02 / 0.64±0.03     | 0.53±0.00 / 0.50±0.00     | 0.67±0.08 / 0.61±0.07
Wafer              | 0.75 / 0.64          | 0.75±0.00 / 0.64±0.00     | 0.61±0.01 / 0.00±0.00     | 0.67±0.04 / 0.50±0.06
Figure 3. Some examples of the resulting regression function for any component (cluster), as estimated by the proposed method in some UCR datasets: CBF (K = 3), Coffee (K = 2), Face Four (K = 4), Gun Point (K = 2), Trace (K = 4) and Wafer (K = 2).
V. CONCLUSIONS
In this paper we have proposed a novel regression mixture
model, where each mixture component is a multi-kernel
RVM regression model. The model is very general and
can be used to cluster a set of multidimensional functions,
where each function is represented by a set of input-target
pairs. The key aspect of the proposed technique lies in the employment of RVMs as components and the exploitation of their superior regression performance to model the data
of each latent class. We have also presented a weighted
multi-kernel scheme for composing the kernel matrix of each
component that offers better fitting capabilities. Learning in
the proposed sparse regression mixture model is achieved
in terms of a maximum a posteriori (MAP) framework
that allows the EM algorithm to be effectively used for
estimating the model parameters. This has the advantage
of establishing update rules in closed form during the M-step, and thus data fitting is computationally efficient. An
incremental learning strategy has also been presented that
makes the construction of the sparse regression mixture
model independent of parameter initialization. Experiments
on several datasets for time series clustering demonstrated
the ability of the proposed MKRVM mixture to achieve
improved clustering performance and robustness compared
to other typical regression models.
We are planning to study the performance of the proposed methodology in computer vision applications, such
as visual tracking problems and object detection in a video
surveillance domain [6], [7], [8]. Another future research
direction is to examine the possibility of applying alternative
types of sparse priors [17], [18], as well as to make an
extensive comparison with other types of mixture models
that employ different types of regression components, such
as Gaussian Processes. Furthermore, introducing the fully
Bayesian framework to our method constitutes an interesting
direction for future work, that could also allow for the
estimation of the number of clusters.
Acknowledgments
This manuscript is dedicated to the memory of our friend
and colleague Professor Nikolaos P. Galatsanos who contributed significantly to the research and preparation of this
work. He was a great mentor who continues to provide
inspiration to all of his former colleagues.
REFERENCES
[1] G. McLachlan and D. Peel, Finite Mixture Models. New York:
John Wiley & Sons, Inc., 2000.
[2] C. M. Bishop, Pattern Recognition and Machine Learning.
Springer, 2006.
[3] W. DeSarbo and W. Cron, “A maximum likelihood methodology for clusterwise linear regression,” Journal of Classification, vol. 5, no. 1, pp. 249–282, 1988.
[4] D. Chudova, S. Gaffney, E. Mjolsness, and P. Smyth,
“Mixture models for translation-invariant clustering of sets
of multi-dimensional curves,” in Proc. of the Ninth ACM
SIGKDD Intern. Conf. on Knowledge Discovery and Data
Mining, (Washington, DC), pp. 79–88, 2003.
[5] K. Blekas, C. Nikou, N. Galatsanos, and N. V. Tsekos,
“A regression mixture model with spatial constraints for
clustering spatiotemporal data,” Inter. Journal on Artificial
Intelligence Tools, vol. 17, no. 5, pp. 1023–1041, 2008.
[6] J. Alon, S. Sclaroff, G. Kollios, and V. Pavlovic, “Discovering
clusters in motion time-series data,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
pp. 375–381, 2003.
[7] O. Williams, A. Blake, and R. Cipolla, “Sparse Bayesian
Learning for Efficient Visual Tracking,” IEEE Trans. on
Pattern Analysis and Machine Intelligence, vol. 27, no. 8,
pp. 1292–1304, 2005.
[8] G. Antonini and J. Thiran, “Counting pedestrians in video sequences using trajectory clustering,” IEEE Trans. on Circuits
and Systems for Video Technology, vol. 16, no. 8, pp. 1008–
1020, 2006.
[9] B. Williams, M. Toussaint, and A. Storkey, “Modelling
motion primitives and their timing in biologically executed
movements,” in Advances in Neural Information Processing
Systems 15, pp. 1547–1554, 2008.
[10] T. Liao, “Clustering of time series data - a survey,” Pattern
Recognition, vol. 38, pp. 1857–1874, 2005.
[11] P. Smyth, “Clustering sequences with hidden Markov models,” in Advances in Neural Information Processing Systems,
pp. 648–654, 1997.
[12] S. Gaffney and P. Smyth, “Curve clustering with random
effects regression mixtures,” in Proc. of the Ninth Intern.
Workshop on Artificial Intelligence and Statistics (C. M.
Bishop and B. J. Frey, eds.), 2003.
[13] Y. Xiong and D.-Y. Yeung, “Mixtures of ARMA models for
model-based time series clustering,” in IEEE International
Conference on Data Mining (ICDM), pp. 717–720, 2002.
[14] J. Shi and B. Wang, “Curve prediction and clustering with
mixtures of Gaussian process functional regression models,”
Statistics and Computing, vol. 18, pp. 267–283, 2008.
[15] M. Tipping, “Sparse Bayesian Learning and the Relevance
Vector Machine,” Journal of Machine Learning Research,
vol. 1, pp. 211–244, 2001.
[16] M. Zhong, “A Variational method for learning Sparse
Bayesian Regression,” Neurocomputing, vol. 69, pp. 2351–
2355, 2006.
[17] A. Schmolck and R. Everson, “Smooth Relevance Vector Machine: A smoothness prior extension of the RVM,” Machine
Learning, vol. 68, no. 2, pp. 107–135, 2007.
[18] M. Seeger, “Bayesian Inference and Optimal Design for
the Sparse Linear Model,” Journal of Machine Learning
Research, vol. 9, pp. 759–813, 2008.
[19] M. Gonen and E. Alpaydin, “Multiple kernel learning algorithms,” Journal of Machine Learning Research, vol. 12,
pp. 2211–2268, 2011.
[20] A. Dempster, N. Laird, and D. Rubin, “Maximum Likelihood
from incomplete data via the EM algorithm,” J. Roy. Statist.
Soc. B, vol. 39, pp. 1–38, 1977.
[21] F. Harrell, Regression Modeling Strategies. With Applications
to Linear Models, Logistic Regression and Survival Analysis.
Springer-Verlag New York, Inc., 2001.
[22] S. Gunn and J. Kandola, “Structural modelling with sparse
kernels,” Machine Learning, vol. 48, pp. 137–163, 2002.
[23] M. Girolami and S. Rogers, “Hierarchic bayesian models for
kernel learning,” in Intern. Conference on Machine Learning
(ICML’05), pp. 241–248, 2005.
[24] M. Hu, Y. Chen, and J. Kwok, “Building sparse multiplekernel svm classifiers,” IEEE Trans. on Neural Networks,
vol. 20, no. 5, pp. 827–839, 2009.
[25] J. Nocedal and S. J. Wright, Numerical Optimization.
Springer-Verlag New York, Inc., 1999.
[26] J. Li and A. Barron, “Mixture density estimation,” in Advances in Neural Information Processing Systems, vol. 12,
pp. 279–285, The MIT Press, 2000.
[27] N. Ueda, R. Nakano, Z. Ghahramani, and G. Hinton, “SMEM
algorithm for mixture models,” Neural Computation, vol. 12,
no. 9, pp. 2109–2128, 2000.
[28] N. Vlassis and A. Likas, “A greedy EM algorithm for Gaussian mixture learning,” Neural Processing Letters, vol. 15,
pp. 77–87, 2001.
[29] E. Keogh, X. Xi, L. Wei, and C. Ratanamahatana,
“The ucr time series classification/clustering homepage:
www.cs.ucr.edu/∼eamonn/time series data/,” 2006.