Academia.eduAcademia.edu

Heteroscedastic Relevance Vector Machine

2013

In this work we propose a heteroscedastic generalization to RVM, a fast Bayesian framework for regression, based on some recent similar works. We use variational approximation and expectation propagation to tackle the problem. The work is still under progress and we are examining the results and comparing with the previous works.

arXiv:1301.2015v1 [stat.ML] 10 Jan 2013 Heteroscedastic Relevance Vector Machine Daniel Khashabi, Mojtaba Ziyadi, Feng Liang {khashab2,ziyadi2,liangf}@illinois.edu University of Illinois, Urbana-Champaign Abstract In this work we first propose a heteroscedastic generalization to RVM, a fast Bayesian framework for regression, based on some recent similar works. We use variational approximation and expectation propagation to tackle the problem. The work is still under progress and we are examining the results and comparing with the previous works. 1 Introduction The Relevance Vector Machines (RVM) [Tipping, 2001] are among the most popular fast Bayesian frameworks. In this model it is assumed that the output are contaminated with a Gaussian noise with constant variance. While simple and effective in some applications this assumption is not sufficient, and we need to assume variable variance noise processes, i.e. heteroscedastic models. Gaussian processes (GP) provide a rich nonparametric tools for modelling. There are interesting similarities between GP models and RVM; in [?, MacKay et al., 1997] it is shown that infinite equally spaced Gaussian kernel functions will converge to a Gaussian process model. Thus one may think of RVM as a fast sparsified GP. Due to richness and flexibility of the GP models, there has been a surge of interests in using these model for different scenarios, especially for modelling output-heteroscedastic data.1 A ground work by [Goldberg et al., 1997] assumes a model consisting of a Gaussian process for real prediction and the 1 Note that heteroscedasticity could be both in input or output, and also note that uncertainty is also different from heteroscedasticity. Examples of uncertainty uncertainty 1 noise process (on the output), and evaluates the model using MCMC. Similar models are proposed in [Quadrianto et al., 2009] and [Kersting et al., 2007], which have slow convergence. In [Lázaro-Gredilla and Titsias, 2011] an outputheteroscedastic GP via GP noise is introduced which uses a variational approximation to train. In the same model in [Munoz-Gonzalez et al., 2011] the approximate training is being done using Expectation Propagation [Minka, 2001]. In this work follow a similar modelling to [Munoz-Gonzalez et al., 2011] and [Lázaro-Gredilla and Titsias, 2011] to impose heteroscedasticity in RVM using a GP model noise. Due to sparsification properties of RVM model over GP, our model has faster convergence comparing to [Munoz-Gonzalez et al., 2011, Lázaro-Gredilla and Titsias, 2011]. The simplified variational approximation simplification which is known as collapsed variational approximation is first introduced in [King and Lawrence, 2006, Teh et al., 2007] and analytically analyzed in [Hensman et al., 2012]. We implement the model on several applications and compare the practical result to several existing models. 2 The learning model Consider we have the data-set D = {xi , yi }N i=1 = (X, y), and y is the model’s scalar output, x is the input vector with length Q. Assume the following linear model similar to [Tipping, 2001], y(x) = M X wi φi (x) + ǫ = Φ(x)w + ǫ i=1 in which w = [w1 , . . . , wM ]T is the vector of weights and Φ(x) = [φ1 (x), . . . , φM (x)] is the vector of kernel function with length M. We also define N × M design  T matrix as Φ = Φ(x1 )T . . . Φ(xN )T which will come handy later. We also put normal distribution over the weights, p(w|α) = M Y i=1 p(wi |αi ) = M Y i=1   N wi |0, αi−1 = N w|0, A−1 . which are not heteroscedastic are [Tresp et al., 1994, Wright, 1999, Dallaire et al., 2011, McHutchon and Rasmussen, 2011]. In addition one other incentive to create outputheteroscedastic models, are input-uncertain data, i.e. one could approximate input noise by taking its effect in the variable-variance noise in the output into account, i.e. via output-heteroscedasity. 2 where A = diag {α1 , . . . , αM }. The main difference of the above formulation with the one in [Tipping, 2001] is that in the above model, the noise term has a variable variance which is modelled with a GPR similar to [Lázaro-Gredilla and Titsias, 2011], ǫ ∼ N (0, eg(x) ), g(x) ∼ GP(µ0 I, Kg ).  p (fn = Φ (xn ) w|A) = N 0, Φ(xn )A−1 ΦT (xn ) ⇒ p(f) = N (0, Kf ) . The hyper-parameters to be learnt in this model are Θ = {A, θg }, in which θg is set of parameters in the covariance function of g(x). The likelihood is,   N Y 1 (yn − Φ(xn )w)2 √ exp − (1a) L= 2e2gn egn 2π n=1 ( ) N 1 1 X −2gn hP i = exp − e (yn − Φ(xn )w)2 (1b) N/2 N 2 exp g (2π) n=1 n=1 n = N (y|µl , Σl ) (1c) in which Σl = diag {e2g1 , . . . , e2gN } and µl = [Φ(x1 )w, . . . , Φ(xN )w]T = Φw. The posterior for each f and g is, p(g|w, D, Θ) ∝ N (µ0 I, Kg ) × L  p(w|g, D, Θ) ∝ N 0, A−1 × L (2) (3) Note that the above likelihood is only Gaussian with respect to w, not g. Considering that both priors are Gaussian, this causes the posterior of g in Eq.2 to become non-Gaussian, and need to approximated to become tractable. Also note that defining f = Φw, because of the deterministic relationship between f and w, having one, entails having the others distribution, and the model resulting model based on f is essentially the same as the one in [Lázaro-Gredilla and Titsias, 2011]. To keep everything simple here instead of writing the Bayes relation on f, we simply write it on w. 3 Approximate learning In RVM training the parameters is achieved via maximizing the marginal likelihood, p(y|X, A, σ 2), in which σ 2 is a constant variance. While in out setting the variance eg(x) is not constant, and it not determined. 3 In this section we formulate the approximations to tackle the problem at hand. We’ll first introduce the variational methods and then expectation propagation method. 3.1 Variational Approximation In variable approximation we approximate the posterior distribution by factorizing the joint distribution of target parameters into independent distributions. We then lower bound the marginal likelihood, henceforth instead of maximizing the posterior, we maximize the lower bound. A comprehensive review of variational methods for several applications could be found in [Wainwright and Jordan, 2008]. We assume our approximate factorization as follows, p(w, g|y) ≈ q(w, g) = q(w).q(g) The optimization of the above independence approximation could be done using Kullback-Leibler divergence, KL (q (w) .q (g) kp (w, g|y)), which at the same time maximizes the lower bound on the marginal likelihood, Z p(w, g, y) log p(y) ≥ q(g)q(w) log dwdg = F (q(w), q(g)) q(g)q(w) which can be rewritten as log p(y) − F (q(w), q(g)) = KL (q (w) .q (g) kp (w, g|y)) The solution to the about variational minimization is Z  t+1 t q (w) ∝ p(w). exp q (g) log p(y|w, g)dg q t+1 (g) ∝ p(g). exp Z q t+1 (w) log p(y|w, g)dw  As suggested in [Lázaro-Gredilla and Titsias, 2011] the above lower bound F , could be more simplified by exploiting the structure of the problem. It turns out that this method was previously introduced by [King and Lawrence, 2006, Teh et al., 2007] and known as collapsed variational inference. The recent work [Hensman et al., 2012] introduces a unified view of this model. Similarly we can analytically marginalize one variable and put in the lower bound 4 to decrease a portion of the optimization process. We analytically compute q(w), Z  p(w) ∗ q (w) = arg max = exp q(g) log p(y|w, g)dg , (4) q(w) Z (q (w)) R  R where Z (q (w)) = p(w) exp q(g) log p(y|w, g)dg dw is the normalizing constant. Now we simplify the lower bound: log p(y) ≥ F |q(w)=q∗ (w) = Ew,g∼q log p(w, g, y) q(w)q(g) (5a) q(w)=q ∗ (w) p(w, g, y)Z q(g)p(w) exp [Eg∼q log p(y|w, g)] (5b)   p(y|w, g)p(w)p(g)Z = Ew∼q∗ Eg∼q log − Eg∼q log p(y|w, g) q(g)p(w) (5c)   p(g)Z (5d) = Ew∼q∗ Eg∼q log q(g)   p(g)Z = Eg∼q log (5e) q(g)   p(g) = Z − KL (q(g)kp (g)) = FKL = Z − Eg∼q log q(g) (5f) = Ew∼q∗ Eg∼q log Which is what we expected, a new variational bound which is functional of only the distribution on g. Similar to [Hensman et al., 2012] we call this new bound FKL . Based on the assumptions of the problem now we can simplify the bound. Following [Lázaro-Gredilla and Titsias, 2011] we assume that q(g) = N (g|µ, Σ), FKL = log Z N w|0, A −1  exp Z  N (g|µ, Σ) log p(y|w, g)dg dw − KL (N (g|µ, Σ)kN (g|µ01, Kg )) (6a) (6b) R Substituting the term inside the exponential N (g|µ, Σ) log p(y|w, g)dg = log N (y|w, R) − 14 tr(Σ) in which [R]ii = e[µ]i −[Σ]ii/2 , which makes the bound 5 as follows: 1 FKL = log N (y|0, ΦA−1ΦT + R) − tr(Σ) − KL(N (g|µ, Σ)kN (g|µ01, Kg )). 4 (7) Similar to what is done in [Lázaro-Gredilla and Titsias, 2011] one could exploit the nature of the formulation and decrease the number of the problems KL = 0, and expressing to be optimized by using equations of ∂F∂µKL = 0, and ∂F ∂Σ the unknown variables in terms of one single smaller variable Λ. Simplifying the update equations will result in the same equation as in appendix of [Lázaro-Gredilla and Titsias, 2011], but with more straightforward equation for updating α parameters, N X ∂FKL φj (xk ) = 0 ⇒ αj = P ∂αj yk − exp {[µ]k − [Σ]kk /2} − N i=1,i6=j k=1 3.2 1 φ (xk ) αi i Expectation Propagation Approximation One could use [Minka, 2001] to do iterative approximation to g’s posterior. We need to approximate the likelihood in Eq. 1a into factors using a Gaussian family, p(yn |w, g) ≈ t̃yn = Z̃.N (yn |µ̃yn , σ̃y2n ), The likelihood is, L = p(y|w, g) ≈ N Y n=1 t̃yn = N (y|µ̃y , Σ̃y ). N Y n=1 Z̃ = LEP Putting the prior and the likelihood into the Bayes formula in Eq. 2 we get the following relation for posterior, q(g|w, D, Θ) ∝ p(q) × LEP Such that q(g|w, D, Θ) = N (g|µ, Σ) in which Σ =  −1 K−1 g + Σ̃y −1 and µ = ΣΣ̃−1 µ̃. The normalizing constant for the above Bayes relation, ZEP is the EP approximation for marginal likelihood. In EP at each step we delete the i-th term from the posterior to get the gap distribution, and then we add the corresponding exact likelihood to the gap distribution. Then we minimize KL(,) for the resulted non-Gaussian distribution and a Gaussian distribution, and then based on the approximated Gaussian, we update t̃yn . 6 4 Prediction Note that using Eq. 4 we have q ∗ (w) ∝ N (0, A−1).N (y|f, R) = N (µw , Σw ) −1 µw = Σw ΦT R−1 y, Σw = A + ΦT R−1 Φ Comparing the above posterior mean and variance with that of Eq.6 in [Tipping et al., 2003] it reveals that the matrix R here acts like the σ 2 I noise matrix in RVM. Now we can find the predictive distribution by marginalizing the latent variables in model. 5 Notes on convergence In conventional RVM, the prior p(w|α) acts to sparsify the model. During the update procedure of the model, some αi → ∞, then we put the corresponding wi = 0, and continue the training, and results in sparsity of RVM. In [Tipping et al., 2003] it’s been showed that the local maxima of marginal likelihood for some values of αi happens in infinity. Assume the lower bound in Eq.7; in this bound the first part is a function of α and need to be optimized. Comparing this with the likelihood in [Tipping et al., 2003]( Eq.7 ) we see that the factor C = σ 2 I + ΦA−1 ΦT defined in that work is equivalent to C′ = R + ΦA−1 ΦT . Thus, similarly we could follow the Eq.15 to Eq.21 in [Tipping et al., 2003] to show that the local maxima of our lower bound happens when some αi go to infinity. So in experiments for a big enough αT h , for each αi > αT h we could put wi = 0 and get smaller model. Also note that due to sparsification properties of RVM, our model has faster convergence comparing to [Munoz-Gonzalez et al., 2011, Lázaro-Gredilla and Titsias, 2011] as we calculate α parameters analytically. 6 Conclusion and future work We derived formulations of Heteroscedastic RVM which are similar to [Lázaro-Gredilla and Titsias, 2 and has interesting similarities with [Tipping et al., 2003]. This is a report of an ongoing work, and we are still experimenting and comparing methods. The final results will soon be published. 7 References [Dallaire et al., 2011] Dallaire, P., Besse, C., and Chaib-draa, B. (2011). An approximate inference with gaussian process to latent functions from uncertain data. Neurocomputing, 74(11):1945–1955. [Goldberg et al., 1997] Goldberg, P., Williams, C., and Bishop, C. (1997). Regression with input-dependent noise: A gaussian process treatment. Advances in neural information processing systems, 10:493–499. [Hensman et al., 2012] Hensman, J., Rattray, M., and Lawrence, N. D. (2012). Fast variational inference in the conjugate exponential family. CoRR, abs/1206.5162. [Kersting et al., 2007] Kersting, K., Plagemann, C., Pfaff, P., and Burgard, W. (2007). Most likely heteroscedastic gaussian process regression. In Proceedings of the 24th international conference on Machine learning, pages 393–400. ACM. [King and Lawrence, 2006] King, N. and Lawrence, N. (2006). Fast variational inference for gaussian process models through kl-correction. Machine Learning: ECML 2006, pages 270–281. [Lázaro-Gredilla and Titsias, 2011] Lázaro-Gredilla, M. and Titsias, M. (2011). Variational heteroscedastic gaussian process regression. [MacKay et al., 1997] MacKay, D. et al. (1997). Gaussian processes-a replacement for supervised neural networks? [McHutchon and Rasmussen, 2011] McHutchon, A. and Rasmussen, C. (2011). Gaussian process training with input noise. [Minka, 2001] Minka, T. (2001). Expectation propagation for approximate bayesian inference. In Proceedings of the Seventeenth conference on Uncertainty in artificial intelligence, pages 362–369. Morgan Kaufmann Publishers Inc. [Munoz-Gonzalez et al., 2011] Munoz-Gonzalez, L., Lazaro-Gredilla, M., and Figueiras-Vidal, A. (2011). Heteroscedastic gaussian process regression using expectation propagation. In Machine Learning for Signal Processing (MLSP), 2011 IEEE International Workshop on, pages 1–6. IEEE. 8 [Quadrianto et al., 2009] Quadrianto, N., Kersting, K., Reid, M., Caetano, T., and Buntine, W. (2009). Kernel conditional quantile estimation via reduction revisited. In Data Mining, 2009. ICDM’09. Ninth IEEE International Conference on, pages 938–943. IEEE. [Teh et al., 2007] Teh, Y., Newman, D., and Welling, M. (2007). A collapsed variational bayesian inference algorithm for latent dirichlet allocation. Advances in neural information processing systems, 19:1353. [Tipping, 2001] Tipping, M. (2001). Sparse bayesian learning and the relevance vector machine. The Journal of Machine Learning Research, 1:211– 244. [Tipping et al., 2003] Tipping, M., Faul, A., et al. (2003). Fast marginal likelihood maximisation for sparse bayesian models. In Proceedings of the ninth international workshop on artificial intelligence and statistics, volume 1. Jan. [Tresp et al., 1994] Tresp, V., Ahmad, S., Neuneier, R., et al. (1994). Training neural networks with deficient data. Advances in neural information processing systems, pages 128–128. [Wainwright and Jordan, 2008] Wainwright, M. and Jordan, M. (2008). Graphical models, exponential families, and variational inference. Foundations and Trends R in Machine Learning, 1(1-2):1–305. [Wright, 1999] Wright, W. (1999). Bayesian approach to neural-network modeling with input uncertainty. Neural Networks, IEEE Transactions on, 10(6):1261–1270. 9