BAYESIAN STATISTICS 9, J. M. Bernardo, M. J. Bayarri, J. O. Berger, A. P. Dawid, D. Heckerman, A. F. M. Smith and M. West (Eds.) © Oxford University Press, 2010

Particle Learning for Sequential Bayesian Computation

Hedibert F. Lopes, The University of Chicago, USA ([email protected])
Carlos M. Carvalho, The University of Chicago, USA ([email protected])
Michael S. Johannes, Columbia University, USA ([email protected])
Nicholas G. Polson, The University of Chicago, USA ([email protected])

SUMMARY

Particle learning provides a simulation-based approach to sequential Bayesian computation. To sample from a posterior distribution of interest we use an essential state vector together with a predictive and propagation rule to build a resampling-sampling framework. Predictive inference and sequential Bayes factors are a direct by-product. Our approach provides a simple yet powerful framework for the construction of sequential posterior sampling strategies for a variety of commonly used models.

Keywords and Phrases: Particle learning; Bayesian; Dynamic factor models; Essential state vector; Mixture models; Sequential inference; Conditional dynamic linear models; Nonparametric; Dirichlet.

Hedibert F. Lopes is Associate Professor of Econometrics and Statistics, The University of Chicago Booth School of Business. Carlos M. Carvalho is Assistant Professor of Econometrics and Statistics, The University of Chicago Booth School of Business. Michael S. Johannes is Roger F. Murray Associate Professor of Finance, Graduate School of Business, Columbia University. Nicholas G. Polson is Professor of Econometrics and Statistics, The University of Chicago Booth School of Business. We would like to thank Mike West, Raquel Prado and Peter Müller for insightful comments that greatly improved the article. We also thank Seung-Min Yae for research assistance with some of the examples. Part of this research was conducted while the first two authors were visiting the Statistical and Applied Mathematical Sciences Institute for the 2008-09 Program on Sequential Monte Carlo Methods.

1. THE PL FRAMEWORK

Sequential Bayesian computation requires the calculation of a set of posterior distributions $p(\theta \mid y^t)$, for $t = 1, \ldots, T$, where $y^t = (y_1, \ldots, y_t)$. The inability to directly compute the marginal $p(y^t) = \int p(y^t \mid \theta)\,p(\theta)\,d\theta$ implies that accessing the desired posterior distributions requires simulation schemes. We present a sequential simulation strategy for both $p(\theta \mid y^t)$ and $p(y^t)$ based on a resample-sampling framework called Particle Learning (PL). PL is a direct extension of the resample-sampling scheme introduced by Pitt and Shephard (1999) in the parameter-free, time series context.

Our new look at Bayes' theorem delivers a sequential, on-line inference strategy for effective posterior simulation in a variety of commonly used models. These strategies are intuitive and easy to implement. In addition, when contrasted with MCMC methods, PL delivers more for less as it provides:
• posterior samples in a parallel environment, an important feature as more multi-processor computational power becomes available;
• direct approximations of marginal likelihoods.

Central to PL is the creation of an essential state vector $Z_t$ to be tracked sequentially. We assume that this vector is conditionally sufficient for the parameter of interest, so that $p(\theta \mid Z_t)$ is either available in closed form or can easily be sampled from.
Given samples $\{Z_t^{(i)}\}_{i=1}^N \sim p(Z_t \mid y^t)$, a simple mixture approximation to the set of posteriors (or moments thereof) is given by
$$p^N(\theta \mid y^t) = \frac{1}{N}\sum_{i=1}^N p(\theta \mid Z_t^{(i)}).$$
This follows from the Rao-Blackwellised identity
$$p(\theta \mid y^t) = E\{p(\theta \mid Z_t)\} = \int p(\theta \mid Z_t)\,p(Z_t \mid y^t)\,dZ_t.$$
If we require samples, we draw $\theta^{(i)} \sim p(\theta \mid Z_t^{(i)})$. See West (1992, 1993) for an early approach to approximating posterior distributions via mixtures.

The task of sequential Bayesian computation is then equivalent to a filtering problem for the essential state vector, drawing $\{Z_t^{(i)}\}_{i=1}^N \sim p(Z_t \mid y^t)$ sequentially from the set of posteriors. To this end, PL exploits the following sequential decomposition of Bayes' rule:
$$p(Z_{t+1} \mid y^{t+1}) = \int p(Z_{t+1} \mid Z_t, y_{t+1})\,dP(Z_t \mid y^{t+1}) \propto \int \underbrace{p(Z_{t+1} \mid Z_t, y_{t+1})}_{\text{propagate}}\,\underbrace{p(y_{t+1} \mid Z_t)}_{\text{resample}}\,dP(Z_t \mid y^t).$$
Here $dP(Z_t \mid y^{t+1}) \propto p(y_{t+1} \mid Z_t)\,dP(Z_t \mid y^t)$, where $P(Z_t \mid y^t)$ denotes the distribution of the current state vector. In particle form this is represented by $\frac{1}{N}\sum_{i=1}^N \delta_{Z_t^{(i)}}$, where $\delta$ is the Dirac measure.

The intuition is as follows. Given $P(Z_t \mid y^t)$, we find the smoothed distribution $P(Z_t \mid y^{t+1})$ via resampling and then propagate forward using $p(Z_{t+1} \mid Z_t, y_{t+1})$ to find the new $Z_{t+1}$. Making an analogy to dynamic linear models, this is exactly the Kalman filtering logic in reverse, first proposed by Pitt and Shephard (1999). From a sampling perspective, this leads to a very simple algorithm for updating particles $\{Z_t^{(i)}\}_{i=1}^N$ to $\{Z_{t+1}^{(i)}\}_{i=1}^N$ in two steps:

(i) Resample: with replacement from a multinomial with weights proportional to the predictive distribution $p(y_{t+1} \mid Z_t^{(i)})$ to obtain $\{Z_t^{\zeta(i)}\}_{i=1}^N$;
(ii) Propagate: with $Z_{t+1}^{(i)} \sim p(Z_{t+1} \mid Z_t^{\zeta(i)}, y_{t+1})$ to obtain $\{Z_{t+1}^{(i)}\}_{i=1}^N$.

The ingredients of particle learning are the essential state vector $Z_t$, a predictive probability rule $p(y_{t+1} \mid Z_t^{(i)})$ for resampling, and a propagation rule to update particles, $Z_t^{\zeta(i)} \rightarrow Z_{t+1}^{(i)}$. We summarize the algorithm as follows:

Particle Learning (PL)
Step 1. (Resample) Generate an index $\zeta \sim \text{Multinomial}(\omega, N)$, where
$$\omega^{(i)} = \frac{p(y_{t+1} \mid Z_t^{(i)})}{\sum_{j=1}^N p(y_{t+1} \mid Z_t^{(j)})};$$
Step 2. (Propagate)
$$Z_{t+1}^{(i)} \sim p(Z_{t+1} \mid Z_t^{\zeta(i)}, y_{t+1});$$
Step 3. (Learn)
$$p^N(\theta \mid y^{t+1}) = \frac{1}{N}\sum_{i=1}^N p(\theta \mid Z_{t+1}^{(i)}).$$

Example 1 (Constructing $Z_n$ for the i.i.d. model). As a first illustration of the derivation of the essential state vector and the implementation of PL, consider the following simple i.i.d. model
$$y_i \mid \lambda_i \sim N(\mu, \tau^2\lambda_i), \qquad \lambda_i \sim IG(\nu/2, \nu/2),$$
for $i = 1, \ldots, n$, with known $\nu$ and priors $\mu \mid \tau^2 \sim N(m_0, C_0\tau^2)$ and $\tau^2 \sim IG(a_0, b_0)$. Here the essential state vector is $Z_n = (\lambda_{n+1}, a_n, b_n, m_n, C_n)$, where $(a_n, b_n)$ index the sufficient statistics for the updating of $\tau^2$, while $(m_n, C_n)$ index the sufficient statistics for the updating of $\mu$. Set $m_0 = 0$ and $C_0 = 1$. The sequence of variables $\lambda_{n+1}$ is i.i.d. and so can be propagated directly from $p(\lambda_{n+1})$, whilst the conditional sufficient statistics $(a_{n+1}, b_{n+1})$ are deterministically calculated from the previous values $(a_n, b_n)$ and the parameters $(\mu_{n+1}, \lambda_{n+1})$. Here $\mu_{n+1}$ simply denotes the draw of the parameter $\mu$ at time $n+1$. Given the particle set $\{(Z_0, \mu, \tau^2)^{(i)}\}_{i=1}^N$, PL cycles through the following steps:

Step 1. Resample $\{(\tilde Z_n, \tilde\mu, \tilde\tau^2)^{(i)}\}_{i=1}^N$ from $\{(Z_n, \mu, \tau^2)^{(i)}\}_{i=1}^N$ with weights
$$w_{n+1}^{(i)} \propto p(y_{n+1} \mid Z_n^{(i)}) = f_N\big(y_{n+1};\, m_n^{(i)},\, \tau^{2(i)}(C_n^{(i)} + \lambda_{n+1}^{(i)})\big), \quad i = 1, \ldots, N;$$
Step 2. Propagate $a_{n+1}^{(i)} = \tilde a_n^{(i)} + 0.5$ and $b_{n+1}^{(i)} = \tilde b_n^{(i)} + 0.5\,y_{n+1}^2/(1 + \tilde\lambda_{n+1}^{(i)})$, and sample $\tau^{2(i)}$ from $IG(a_{n+1}^{(i)}, b_{n+1}^{(i)})$, for $i = 1, \ldots, N$;

Step 3. Propagate $C_{n+1}^{(i)} = 1/(1/\tilde C_n^{(i)} + 1/\tilde\lambda_{n+1}^{(i)})$ and $(C_{n+1}^{(i)})^{-1}m_{n+1}^{(i)} = (\tilde C_n^{(i)})^{-1}\tilde m_n^{(i)} + y_{n+1}/\tilde\lambda_{n+1}^{(i)}$, and sample $\mu_{n+1}^{(i)}$ from $N(m_{n+1}^{(i)}, C_{n+1}^{(i)})$, for $i = 1, \ldots, N$;

Step 4. Sample $\lambda_{n+2}^{(i)}$ from $p(\lambda_{n+2})$ and let $Z_{n+1}^{(i)} = (\lambda_{n+2}, a_{n+1}, b_{n+1}, m_{n+1}, C_{n+1})^{(i)}$, for $i = 1, \ldots, N$.

In Step 1, $f_N(y; \mu, \sigma^2)$ denotes the density of $N(\mu, \sigma^2)$ evaluated at $y$.
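To make the cycle concrete, here is a minimal numpy sketch of Steps 1-4 (a sketch only: it takes $m_0 = 0$ and $C_0 = 1$ as set above, and the function name is ours, not the authors'):

```python
import numpy as np

rng = np.random.default_rng(0)

def pl_iid_example1(y, N=10_000, nu=1.0, a0=5.0, b0=0.05):
    """Sketch of PL for the i.i.d. model of Example 1 (assumes m0=0, C0=1).
    Each particle carries Z_n = (lambda_{n+1}, a_n, b_n, m_n, C_n)."""
    lam = 1.0 / rng.gamma(nu / 2.0, 2.0 / nu, N)   # lambda ~ IG(nu/2, nu/2)
    a, b = np.full(N, a0), np.full(N, b0)
    m, C = np.zeros(N), np.ones(N)                 # m0 = 0, C0 = 1
    tau2 = 1.0 / rng.gamma(a0, 1.0 / b0, N)        # tau2 ~ IG(a0, b0)
    for yn in y:
        # Step 1 (resample): weights f_N(y_{n+1}; m_n, tau2 (C_n + lambda_{n+1}))
        var = tau2 * (C + lam)
        w = np.exp(-0.5 * (yn - m) ** 2 / var) / np.sqrt(var)
        idx = rng.choice(N, size=N, p=w / w.sum())
        lam, a, b, m, C = lam[idx], a[idx], b[idx], m[idx], C[idx]
        # Step 2 (propagate sufficient statistics for tau2, then draw tau2)
        a = a + 0.5
        b = b + 0.5 * yn ** 2 / (1.0 + lam)
        tau2 = 1.0 / rng.gamma(a, 1.0 / b)
        # Step 3 (propagate sufficient statistics for mu, then draw mu)
        C_new = 1.0 / (1.0 / C + 1.0 / lam)
        m = C_new * (m / C + yn / lam)
        C = C_new
        mu = rng.normal(m, np.sqrt(C))
        # Step 4 (refresh the auxiliary variable lambda_{n+2})
        lam = 1.0 / rng.gamma(nu / 2.0, 2.0 / nu, N)
    return mu, tau2

# e.g. mu, tau2 = pl_iid_example1(np.array([-15., -10., 0., 1., 2.]))
```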
The posterior for $\mu$ and $\tau^2$ could also be approximated by a Gibbs sampler based on the full conditionals
$$\mu \mid \lambda, \tau^2, y \sim N\big(g_1(y,\lambda)/s(\lambda),\, \tau^2/s(\lambda)\big),$$
$$\tau^2 \mid \lambda, y \sim IG\big(a_0 + 0.5n,\, b_0 + 0.5\,g_2(y,\lambda)\big),$$
$$\lambda_i \mid \mu, \tau^2, y_i \sim IG\left(\frac{\nu + 1}{2},\, \frac{\nu + (y_i - \mu)^2/\tau^2}{2}\right), \quad i = 1, \ldots, n,$$
where $s(\lambda) = 1 + \sum_{i=1}^n \lambda_i^{-1}$, $g_1(y,\lambda) = \sum_{i=1}^n y_i/\lambda_i$, and $g_2(y,\lambda) = \sum_{i=1}^n y_i^2/(1 + \lambda_i)$. Figure 1 compares PL with the Gibbs sampler.

[Figure 1: i.i.d. model, Gibbs versus Particle Learning. Scatterplots of $(\mu, \tau^2)$ draws. Data $y = (-15, -10, 0, 1, 2)$, degrees of freedom $\nu = 1$, and hyperparameters $a_0 = 5$, $b_0 = 0.05$, $m_0 = 0$ and $C_0 = 1$. For the Gibbs sampler, the initial value for $\tau^2$ is $V(y) = 58.3$, with 5000 draws kept after 10000 as burn-in. PL is based on 10000 particles. The contours represent the true posterior distribution.]

1.1. Constructing the Essential State Vector

At first sight, PL seems a rather simple paradigm. The real power, however, lies in the flexibility one has in defining the essential state vector. This may include state variables, auxiliary variables, a subset of parameters $\theta_{(-i)}$, sufficient statistics, and model specifications, among others.
The dimensionality of $Z_t$ can also be included in the particle set and can grow with the sample size, as, for example, in the nonparametric mixture of Dirichlet processes discussed later. In sum, the use of an essential state vector $Z_t$ is an integral part of our approach, and its definition will become clear in the following sections.

The propagation rule can involve either stochastic or deterministic updates, and in many ways it is a modeling tool by itself. For example, in complicated setups (variable selection, treed models) the propagation rule $p(Z_{t+1} \mid Z_t, y_{t+1})$ suggests many different ways of searching the model space. It is our hope that, with the dissemination of the ideas associated with PL, there will be more cases where the use of $Z_t$ leads to new modeling insights.

The following are examples of the form of $Z_t$ in the models addressed later in the chapter:

• Mixture Regression Models: auxiliary state variable $\lambda_t$ and conditional sufficient statistics $s_t$ for parameter inference;
• State Space Models: in conditionally Gaussian dynamic linear models, $Z_t$ tracks the usual Kalman filter state moments, denoted by $(m_t, C_t)$, and conditional sufficient statistics $s_t$ for fixed parameters;
• Nonparametric Models: $Z_t$ tracks an indicator of each mixture component $k_t$, the number $n_t$ of observations allocated to each component, and the current number of unique components $m_t$. In a Dirichlet process mixture, the particle vector can grow in time, as there is a positive probability of adding a new mixture component with each new observation.

In the rest of the paper we address each of these models and provide the necessary calculations to implement PL.

1.2. Comparison with SIS and MCMC

Particle filtering (Gordon, Salmond and Smith, 1993) and sequential importance sampling (Kong, Liu and Wong, 1994) have a number of features in common with PL. For example, one can view our update for the augmented vector $Z_t$ as a fully-adapted version of Pitt and Shephard's (1999) auxiliary particle filter (APF), with the additional feature that the augmented variables can depend on functionals of the parameter. The additional parameter draw $\theta^{(i)} \sim p(\theta \mid Z_n^{(i)})$ is not present in the APF and is used in PL to replenish the diversity of the parameter particles. Storvik (2002) proposed the use of sufficient statistics that are independent of parameters in a propagate-resample algorithm for state space models. Chen and Liu (2000) work with a similar approach in the mixture Kalman filter context. PL differs in two important ways: (i) they only consider the problem of state filtering, and (ii) they work in the propagate-resample framework. This is carefully discussed in Carvalho, Johannes, Lopes and Polson (2010). Again, our view of the augmented variables $Z_t$ is more general than Storvik's approach.

Another related class of algorithms are Rao-Blackwellised particle filters, which are typically propagate-resample algorithms for a pure filtering problem where $z_{t+1}$ denotes missing data and $x_{t+1}$ a state. Additionally, they attempt to approximate the joint distribution $p(Z^t \mid y^t)$. This target increases in dimensionality as new data become available, leading to unbalanced weights. In our framework, $p(Z^t \mid y^t)$ is not of interest, as the filtered, lower-dimensional $p(Z_t \mid y^t)$ is sufficient for inference at time $t$. Notice that, based on their work, one has to consider the question of "when to resample?" as an alternative to re-balance the approximation weights.
In contrast, our approach requires re-sampling at every step, as the preselection of particles in light of new observations is fundamental in avoiding a decay in the particle approximation for $\theta$. Another avenue of research uses MCMC steps inside a sequential Monte Carlo algorithm, as in the resample-move algorithm of Gilks and Berzuini (2001). This is not required in our strategy, since we use a fully-adapted approach. Finally, see Lopes and Tsay (2010) for a recent review of particle filter methods with an emphasis on contrasting propagate-resample and resample-propagate filters.

1.3. Smoothing

At time $T$, PL provides the filtered distribution of the last essential state vector $Z_T$, namely $p(Z_T \mid y^T)$. If the smoothed distribution of any element $k$ of $Z$, i.e. $p(k^T \mid y^T)$, is required, it can be obtained at the end of the filtering process. To compute the marginal smoothing distribution, we need
$$p(k^T \mid y^T) = \int p(k^T \mid Z_T, y^T)\,p(Z_T \mid y^T)\,dZ_T.$$
In the case where $k_t$ is conditionally independent across time given $Z_t$, this can be further simplified as
$$\int p(k^T \mid Z_T, y^T)\,p(Z_T \mid y^T)\,dZ_T = \int \prod_{t=1}^T p(k_t \mid Z_T, y_t)\,p(Z_T \mid y^T)\,dZ_T,$$
so that samples from $p(k^T \mid y^T)$ can be obtained by sampling (for each particle $Z_T$) each $k_t$ independently from the discrete filtered mixture with probabilities
$$p(k_t = j \mid Z_T, y_t) \propto p(y_t \mid k_t = j, Z_T)\,p(k_t = j \mid Z_T).$$
This is the case, for example, in the mixture models considered later, where $k$ could represent the allocation of each observation to a mixture component. In other models, where $k_t$ has a Markovian evolution (as in state space models), the smoothing distribution can be expressed as
$$\int p(k^T \mid Z_T, y^T)\,p(Z_T \mid y^T)\,dZ_T = \int \prod_{t=1}^T p(k_t \mid k_{t+1}, Z_T)\,p(Z_T \mid y^T)\,dZ_T.$$
By noting that $p(k_t \mid k_{t+1}, Z_t) \propto p(k_{t+1} \mid k_t, Z_t)\,p(k_t \mid Z_t)$, sequential backwards sampling schemes can be constructed using the transition equation of $k_t$ as weights.

This discussion generalizes the algorithm presented in Carvalho, Johannes, Lopes and Polson (2010) for state space models, which was originally proposed as an extension of Godsill, Doucet and West (2004). It is important to point out that traditional SMC algorithms attempt to approximate $p(k^t \mid y^t)$ as part of the filtering process, i.e., they sequentially approximate a posterior that grows in dimension with $t$; this leads, as expected and extensively reported, to unbalanced weights. PL focuses on the simpler, more stable problem of filtering $p(k_t \mid y^t)$ and observes that, in most models, smoothing can effectively be performed at the end.

1.4. Marginal Likelihoods

PL can also provide estimates of the predictive distribution $p(y_{n+1} \mid y^n)$ and of the marginal likelihood $p(y^n)$ for model assessment and Bayes factors. Following our resampling-sampling approach, an on-line estimate of the full marginal likelihood can be developed by sequentially approximating $p(y_{n+1} \mid y^n)$. Specifically, given the current particle draws, we have
$$p^N(y_{n+1} \mid y^n) = \frac{1}{N}\sum_{i=1}^N p(y_{n+1} \mid Z_n^{(i)}) \quad\text{and}\quad p^N(y^n) = \prod_{i=1}^n p^N(y_i \mid y^{i-1}).$$
Therefore we simplify the problem of calculating $p(y^n)$ by estimating a sequence of small integrals. This also provides access to the sequential Bayes factors needed in many sequential decision problems.
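Computationally, this is just a running product of particle averages, best accumulated in log space. A minimal sketch (assuming the per-particle predictive densities have already been evaluated during the resampling step; the function name is ours):

```python
import numpy as np

def sequential_log_evidence(pred_dens):
    """pred_dens[t, i] holds p(y_{t+1} | Z_t^(i)) across the N particles.
    Returns the running log p^N(y^n) = sum_{t<=n} log( mean_i p(y_t | Z_{t-1}^(i)) )."""
    return np.cumsum(np.log(pred_dens.mean(axis=1)))

# A sequential Bayes factor between two models A and B then follows as
#   B_n = exp(logZ_A[n] - logZ_B[n]),
# with logZ_A and logZ_B the outputs above for each model's particle system.
```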
1.5. Choice of Priors

At its simplest level, the algorithm only requires samples $\theta^{(i)}$ from the prior $p(\theta)$. Hence the method is not directly applicable to improper priors. However, the natural class of priors are mixture priors of the form $p(\theta) = \int p(\theta \mid Z_0)\,p(Z_0)\,dZ_0$. The conditional $p(\theta \mid Z_0)$ is chosen to be naturally conjugate to the likelihood. If $Z_0$ is fixed, then we start all particles out with the same $Z_0$ value. More commonly, we will start with a sample $Z_0^{(i)} \sim p(Z_0)$ and let the algorithm resample these draws with the marginal likelihood $p(y_1 \mid Z_0^{(i)})$. This approach will lead to efficiency gains over blindly sampling from the prior. It also allows us to implement nonconjugate priors together with vague "uninformative" priors, such as Cauchy priors, via scale mixtures of normals.

1.6. Monte Carlo Error

Due to the sequential Monte Carlo nature of the algorithm, error bounds of the form $C_T/\sqrt{N}$ are available, where $N$ is the number of particles used. The constant $C_T$ is model, prior and data dependent, and in general its magnitude accumulates over $T$; see, for example, Brockwell, Del Moral and Doucet (2010). Clearly, these propagated errors will be worse for diffuse priors and for large signal-to-noise ratios, as with many Monte Carlo approaches. To assess Monte Carlo standard errors, we propose the convergence diagnostic of Carpenter, Clifford and Fearnhead (1999). By running the algorithm $M$ independent times (each based on $N$ particles), one can calculate Monte Carlo estimates of the mean and variance of the functional of interest. Then, by performing an analysis of variance between replicates, the Monte Carlo error or effective sample size can be assessed. One might also wish to perform this measure over different data trajectories, as some data realizations might be harder to estimate than others.
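A minimal sketch of this replicate-based check (for a single functional, the analysis of variance reduces to the between-replicate spread; the function name is illustrative):

```python
import numpy as np

def replicate_diagnostic(estimates):
    """estimates[j] is the same posterior functional (e.g. a posterior mean)
    computed from the j-th of M independent PL runs, each with N particles.
    The between-replicate standard deviation estimates the Monte Carlo
    error of a single run."""
    return estimates.mean(), estimates.std(ddof=1)
```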
2. APPLICATIONS

2.1. Mixture Regression Models

To illustrate the efficiency gains available with our approach, consider the most common class of applications: mixture or latent variable models,
$$p(y \mid \theta) = \int p(y \mid \theta, \lambda)\,p(\lambda \mid \theta)\,d\lambda,$$
where $\lambda^n = (\lambda_1, \ldots, \lambda_n)$ is a data augmentation variable. For this model, with a conditionally conjugate prior, we can find a conditional sufficient statistic $s_n$ for parameter learning. Therefore, we define our essential state vector as $Z_n = (\lambda_n, s_n)$. Under these assumptions we can write
$$p(\theta \mid \lambda^{n+1}, y^{n+1}) = p(\theta \mid s_{n+1}) \quad\text{with}\quad s_{n+1} = \mathcal{S}(s_n, \lambda_{n+1}, y_{n+1}),$$
where $\mathcal{S}(\cdot)$ is a deterministic recursion relating the previous $s_n$ to the next, conditionally on $\lambda_{n+1}$ and $y_{n+1}$. The propagation step then becomes
$$\lambda_{n+1} \sim p(\lambda_{n+1} \mid \lambda_n, \theta, y_{n+1}) \quad\text{and}\quad s_{n+1} = \mathcal{S}(s_n, \lambda_{n+1}, y_{n+1}).$$
More complicated mixture models appear in Section 2.3.

Example 2 (Bayesian lasso). We can develop a sequential version of the Bayesian lasso (Carlin and Polson, 1991; Hans, 2009) for a simple signal detection problem. The model takes the form $y_i = \theta_i + \epsilon_i$ and $\theta_i = \tau\sqrt{\lambda_i}\,\epsilon_i^\theta$, where $\epsilon_i \sim N(0,1)$, $\epsilon_i^\theta \sim N(0,1)$, $\lambda_i \sim Exp(2)$ and $\tau^2 \sim IG(a_0, b_0)$. This leads to independent double exponential marginal priors for each $\theta_i$, with $p(\theta_i) = (2\tau)^{-1}\exp(-|\theta_i|/\tau)$. The natural set of latent variables is given by the augmentation variable $\lambda_{n+1}$ and the conditional sufficient statistics, leading to $Z_n = (\lambda_{n+1}, a_n, b_n)$. The sequence of variables $\lambda_{n+1}$ is i.i.d. and so can be propagated directly from $p(\lambda_{n+1})$, whilst the conditional sufficient statistics $(a_{n+1}, b_{n+1})$ are deterministically determined from the parameters $(\theta_{n+1}, \lambda_{n+1})$ and the previous values $(a_n, b_n)$.

Given the particle set $\{(Z_0, \tau^2)^{(i)}\}_{i=1}^N$, the resample-propagate algorithm cycles through the following steps:

i) Resample particles with weights $w_{n+1}^{(i)} \propto f_N(y_{n+1}; 0, 1 + \tau^{2(i)}\lambda_{n+1}^{(i)})$;
ii) Propagate $\theta_{n+1}^{(i)} \sim N(m_n^{(i)}, C_n^{(i)})$, where $m_n^{(i)} = C_n^{(i)}y_{n+1}$ and $(C_n^{(i)})^{-1} = 1 + \tilde\tau^{-2(i)}\tilde\lambda_{n+1}^{-1(i)}$;
iii) Update sufficient statistics $a_{n+1}^{(i)} = \tilde a_n^{(i)} + 1/2$ and $b_{n+1}^{(i)} = \tilde b_n^{(i)} + \theta_{n+1}^{2(i)}/(2\tilde\lambda_{n+1}^{(i)})$;
iv) Draw $\tau^{2(i)} \sim IG(a_{n+1}^{(i)}, b_{n+1}^{(i)})$ and $\lambda_{n+2}^{(i)} \sim Exp(2)$;
v) Let $Z_{n+1}^{(i)} = (\lambda_{n+2}, a_{n+1}, b_{n+1})^{(i)}$ and update the particle set $\{(Z_{n+1}, \tau^2)^{(i)}\}_{i=1}^N$.

We use our marginal likelihood (or Bayes factor) approach to compare the lasso with a standard normal prior. Under the normal prior we assume that $\tau^2 \sim IG(a_1, b_1)$ and match the variances of the parameter $\theta_i$. As the lasso is a model for sparsity, we would expect the evidence for it to increase when we observe $y_t = 0$. We can sequentially estimate $p(y_{n+1} \mid y^n, \text{lasso})$ via $p^N(y_{n+1} \mid y^n, \text{lasso}) = N^{-1}\sum_{i=1}^N p(y_{n+1} \mid (\lambda_{n+1}, \tau^2)^{(i)})$ with predictive $p(y_{n+1} \mid \lambda_{n+1}, \tau^2) = f_N(y_{n+1}; 0, \tau^2\lambda_{n+1} + 1)$. This leads to the sequential Bayes factor $B_{n+1} = p(y^{n+1} \mid \text{lasso})/p(y^{n+1} \mid \text{normal})$.

Data were simulated from $\theta = (0,0,0,0,1,0,0,0,1,1,0,0,0,0,1)$ with priors $\tau^2 \sim IG(2,1)$ for the double exponential case and $\tau^2 \sim IG(2,3)$ for the normal case, reflecting the ratio of variances between those two distributions. Results are summarized by computing the sequential Bayes factor (figure not shown). As expected, the evidence in favor of the lasso increases when we observe $y = 0$, and in favor of the normal model when we observe a signal $y = 1$.

PL can easily be extended to a lasso regression setting. Suppose that we have $y_{t+1} = X_t'\beta + \sigma\sqrt{\lambda_{t+1}}\,\epsilon_{t+1}$ with $\theta = (\beta, \sigma^2)$, and that a conditionally conjugate prior is assumed, i.e. $p(\beta \mid \sigma^2) \sim N(b_0, \sigma^2 B_0^{-1})$ and $p(\sigma^2) \sim IG(\nu_0/2, d_0/2)$. We track $Z_t = (s_t, \lambda_{t+1})$, where $s_t = (b_t, B_t, d_t)$ are conditional sufficient statistics for the parameters. The recursive definitions are
$$B_{t+1} = B_t + \lambda_{t+1}^{-1}X_tX_t', \qquad B_{t+1}b_{t+1} = B_tb_t + \lambda_{t+1}^{-1}X_ty_{t+1},$$
$$d_{t+1} = d_t + b_t'B_tb_t + \lambda_{t+1}^{-1}y_{t+1}^2 - b_{t+1}'B_{t+1}b_{t+1}.$$
The conditional posterior $p(\theta \mid Z_n)$ is then available for sampling and our approach applies (see the sketch at the end of this section).

We use this example to compare the accuracy in estimating the posterior distribution of the regularization penalty $p(\tau \mid y)$. We use the generic resample-move batch importance sampling developed by Gilks and Berzuini (2001) and Chopin (2002). The data are cut into batches parameterized by block lengths $(n, p)$. In the generic resample-move algorithm, we first initialize by drawing from the prior $\pi(\theta, \tau)$ with $\theta = (\theta_1, \ldots, \theta_{15})$. The particles are then resampled with the likelihood of the first batch of observations $(y_1, \ldots, y_p)$, and the algorithm proceeds sequentially. There is no need for the $\lambda_i$ augmentation variables, as this algorithm does not exploit this conditioning information. An MCMC kernel is then used to move particles; here we use a simple random-walk MCMC step. This can clearly be tuned to provide better performance, although that detracts from the "black-box" nature of the approach. Chopin (2002) provides recommendations for the choice of kernel. Figure 2 provides the comparison for two separate runs of the algorithm, both with N = 10,000 particles, for $(n, p) = (3, 5)$ or $(n, p) = (15, 1)$. The performance is similar for the case $p = 1$. Our efficiency gains come from the extra conditioning information available in $Z_n$.

[Figure 2: Bayesian lasso. Comparison to Chopin's (2002) batch sampling scheme: histograms of $p(\tau \mid y)$ by PL and by Chopin (2002), for $(n, p) = (15, 1)$ and $(n, p) = (15, 5)$.]
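For reference, a small sketch of the sufficient-statistics recursions for the lasso regression above (the function name is ours; $B$ is stored as the precision-scale matrix, $X_t$ is the regressor vector, and $y_{t+1}$ is scalar):

```python
import numpy as np

def lasso_suffstat_update(b, B, d, x, y, lam):
    """One step of the recursions for s_t = (b_t, B_t, d_t), conditional on the
    latent scale lam = lambda_{t+1} and the new observation (x, y)."""
    B_new = B + np.outer(x, x) / lam                       # B_{t+1}
    b_new = np.linalg.solve(B_new, B @ b + x * y / lam)    # b_{t+1}
    d_new = d + b @ B @ b + y ** 2 / lam - b_new @ B_new @ b_new  # d_{t+1}
    return b_new, B_new, d_new
```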
2.2. Conditional Dynamic Linear Models

We now explicitly derive our PL algorithm in a class of conditional dynamic linear models (CDLMs), which extend the models considered in West and Harrison (1997). This follows Carvalho, Johannes, Lopes and Polson (2010) and covers a vast class of models that embeds many of the commonly used dynamic models. MCMC via forward-filtering backwards-sampling and the mixture Kalman filter (MKF) of Chen and Liu (2000) are the current methods of choice for the estimation of these models. As an approach for filtering, PL has a number of advantages. First, our algorithm is more efficient, as it is a perfectly-adapted filter (Pitt and Shephard, 1999). Second, we extend MKF by including learning about fixed parameters and smoothing for states.

The conditional DLM is defined by observation and evolution equations that take the form of a linear system conditional on an auxiliary state $\lambda_{t+1}$:
$$(y_{t+1} \mid x_{t+1}, \lambda_{t+1}, \theta) \sim N(F_{\lambda_{t+1}}x_{t+1}, V_{\lambda_{t+1}}),$$
$$(x_{t+1} \mid x_t, \lambda_{t+1}, \theta) \sim N(G_{\lambda_{t+1}}x_t, W_{\lambda_{t+1}}),$$
with $\theta$ containing the unknown elements of the quadruple $\{F_\lambda, G_\lambda, V_\lambda, W_\lambda\}$. The marginal distributions of the observation error and state shock can be any combination of normal, scale mixture of normals, or discrete mixture of normals, depending on the specification of the distribution of the auxiliary state variable $p(\lambda_{t+1} \mid \theta)$, so that
$$p(y_{t+1} \mid x_{t+1}, \theta) = \int f_N(y_{t+1}; F_{\lambda_{t+1}}x_{t+1}, V_{\lambda_{t+1}})\,p(\lambda_{t+1} \mid \theta)\,d\lambda_{t+1}.$$
Extensions to hidden Markov specifications, where $\lambda_{t+1}$ evolves according to the transition $p(\lambda_{t+1} \mid \lambda_t, \theta)$, are straightforward and are discussed in the dynamic factor model example below.

In CDLMs the state filtering and parameter learning problem is equivalent to a filtering problem for the joint distribution of the respective sufficient statistics. This is a direct result of the factorization of the full joint $p(x_{t+1}, \theta, \lambda_{t+1}, s_{t+1}, s_{t+1}^x \mid y^{t+1})$ as a sequence of conditional distributions
$$p(\theta \mid s_{t+1})\,p(x_{t+1} \mid s_{t+1}^x, \lambda_{t+1})\,p(\lambda_{t+1}, s_{t+1}, s_{t+1}^x \mid y^{t+1}),$$
where $s_t$ and $s_t^x$ are the conditional sufficient statistics for parameters and states, respectively. These satisfy the deterministic updating rules
$$s_{t+1}^x = \mathcal{K}(s_t^x, \theta, \lambda_{t+1}, y_{t+1}) \quad\text{and}\quad s_{t+1} = \mathcal{S}(s_t, x_{t+1}, \lambda_{t+1}, y_{t+1}).$$
More specifically, define $s_t^x = (m_t, C_t)$ as the Kalman filter first and second moments at time $t$. Conditional on $\theta$, we then have $(x_{t+1} \mid s_t^x, \lambda_{t+1}, \theta) \sim N(a_{t+1}, R_{t+1})$, where $a_{t+1} = G_{\lambda_{t+1}}m_t$ and $R_{t+1} = G_{\lambda_{t+1}}C_tG_{\lambda_{t+1}}' + W_{\lambda_{t+1}}$. Updating the state sufficient statistics $(m_{t+1}, C_{t+1})$ is achieved by
$$m_{t+1} = G_{\lambda_{t+1}}m_t + A_{t+1}(y_{t+1} - e_t) \quad\text{and}\quad C_{t+1}^{-1} = R_{t+1}^{-1} + F_{\lambda_{t+1}}'V_{\lambda_{t+1}}^{-1}F_{\lambda_{t+1}},$$
with Kalman gain matrix $A_{t+1} = R_{t+1}F_{\lambda_{t+1}}'Q_{t+1}^{-1}$, $e_t = F_{\lambda_{t+1}}G_{\lambda_{t+1}}m_t$, and $Q_{t+1} = F_{\lambda_{t+1}}R_{t+1}F_{\lambda_{t+1}}' + V_{\lambda_{t+1}}$.
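For concreteness, a sketch of the mapping $\mathcal{K}$ (the function name is ours; the covariance-form update $C_{t+1} = R_{t+1} - A_{t+1}Q_{t+1}A_{t+1}'$ used below is the standard equivalent of the information-form recursion above):

```python
import numpy as np

def cdlm_kalman_step(m, C, y, F, G, V, W):
    """One-step map K(s_t^x, theta, lambda_{t+1}, y_{t+1}) for a CDLM, where
    (F, G, V, W) is the system quadruple selected by (theta, lambda_{t+1})."""
    a = G @ m                         # prior mean a_{t+1} = G m_t
    R = G @ C @ G.T + W               # prior variance R_{t+1}
    e = F @ a                         # predictive mean e_t = F G m_t
    Q = F @ R @ F.T + V               # predictive variance Q_{t+1}
    A = R @ F.T @ np.linalg.inv(Q)    # Kalman gain A_{t+1}
    m_new = a + A @ (y - e)           # m_{t+1}
    C_new = R - A @ Q @ A.T           # C_{t+1}
    return m_new, C_new
```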
We are now ready to define the PL scheme for CDLMs. First, assume that the auxiliary state variable is discrete, with $\lambda_{t+1} \sim p(\lambda_{t+1} \mid \lambda_t, \theta)$. We start, at time $t$, with a particle approximation for the joint posterior $p(x_t, \theta, \lambda_t, s_t, s_t^x \mid y^t)$. Then we propagate to $t+1$ by first resampling the current particles with weights proportional to the predictive $p(y_{t+1} \mid (\lambda_t, s_t^x, \theta))$. This provides a particle approximation to $p(x_t, \theta, \lambda_t, s_t, s_t^x \mid y^{t+1})$, the smoothing distribution. New states $\lambda_{t+1}$ and $x_{t+1}$ are then propagated through the conditional posterior distributions $p(\lambda_{t+1} \mid \lambda_t, \theta, y_{t+1})$ and $p(x_{t+1} \mid \lambda_{t+1}, x_t, \theta, y_{t+1})$. Finally, the conditional sufficient statistics are updated according to the recursions above, and new samples for $\theta$ are obtained from $p(\theta \mid s_{t+1})$. Notice that in conditional dynamic linear models all of the above densities are available for evaluation and sampling. For instance, the predictive is computed via
$$p(y_{t+1} \mid (\lambda_t, s_t^x, \theta)^{(i)}) = \sum_{\lambda_{t+1}} p(y_{t+1} \mid \lambda_{t+1}, (s_t^x, \theta)^{(i)})\,p(\lambda_{t+1} \mid \lambda_t, \theta),$$
where the inner predictive distribution is given by
$$p(y_{t+1} \mid \lambda_{t+1}, s_t^x, \theta) = \int p(y_{t+1} \mid x_{t+1}, \lambda_{t+1}, \theta)\,p(x_{t+1} \mid s_t^x, \theta)\,dx_{t+1}.$$
In the general case, where the auxiliary state variable $\lambda_t$ is continuous, it might not be possible to integrate $\lambda_{t+1}$ out of the predictive in Step 1. We extend the above scheme by adding to the current particle set a propagated particle $\lambda_{t+1} \sim p(\lambda_{t+1} \mid (\lambda_t, \theta)^{(i)})$. Both algorithms can be combined with the backwards propagation scheme of Carvalho, Johannes, Lopes and Polson (2010) to provide a full draw from the marginal posterior distribution of all the states given the data, namely the smoothing distribution $p(x_1, \ldots, x_T \mid y^T)$.

The next two examples detail the steps of PL for a dynamic factor model with time-varying loadings and for a dynamic logit model.

Example 3 (Dynamic factor model with time-varying loadings). Consider data $y_t = (y_{t1}, y_{t2})'$, $t = 1, \ldots, T$, following a dynamic factor model with time-varying loadings driven by a discrete latent state $\lambda_t$ with possible values $\{1, 2\}$. Specifically, we have
$$(y_{t+1} \mid x_{t+1}, \lambda_{t+1}, \theta) \sim N(\beta_{t+1}x_{t+1}, \sigma^2 I_2),$$
$$(x_{t+1} \mid x_t, \lambda_{t+1}, \theta) \sim N(x_t, \tau^2),$$
with time-varying loadings $\beta_{t+1} = (1, \beta_{\lambda_{t+1}})'$ and initial state distribution $x_0 \sim N(m_0, C_0)$. The jumps in the factor loadings are driven by a Markov switching process $(\lambda_{t+1} \mid \lambda_t, \theta)$ whose transition matrix $\Pi$ has diagonal elements $Pr(\lambda_{t+1} = 1 \mid \lambda_t = 1, \theta) = p$ and $Pr(\lambda_{t+1} = 2 \mid \lambda_t = 2, \theta) = q$. The parameters are $\theta = (\beta_1, \beta_2, \sigma^2, \tau^2, p, q)'$. See Carvalho and Lopes (2007) for related Markov switching models.

We are able to marginalize over both $(x_{t+1}, \lambda_{t+1})$ by using the state sufficient statistics $s_t^x = (m_t, C_t)$ as particles. From the Kalman filter recursions we know that $(x_t \mid \lambda^t, \theta, y^t) \sim N(m_t, C_t)$. The mapping for state sufficient statistics $s_{t+1}^x = \mathcal{K}(s_t^x, \lambda_{t+1}, \theta, y_{t+1})$ is given by the one-step Kalman update equations. The prior distributions are conditionally conjugate, with $(\beta_i \mid \sigma^2) \sim N(b_{i0}, \sigma^2 B_{i0})$ for $i = 1, 2$, $\sigma^2 \sim IG(\nu_{00}/2, d_{00}/2)$ and $\tau^2 \sim IG(\nu_{10}/2, d_{10}/2)$. For the transition probabilities we assume that $p \sim Beta(p_1, p_2)$ and $q \sim Beta(q_1, q_2)$. Assume that at time $t$ we have particles $\{(x_t, \theta, \lambda_t, s_t^x, s_t)^{(i)}, i = 1, \ldots, N\}$ approximating $p(x_t, \theta, \lambda_t, s_t^x, s_t \mid y^t)$. The PL algorithm can then be described through the following steps:

(i) Resampling: draw an index $k^i \sim \text{Multinomial}(w_t^{(1)}, \ldots, w_t^{(N)})$ with weights $w_t^{(i)} \propto p(y_{t+1} \mid (s_t^x, \lambda_t, \theta)^{(i)})$, where
$$p(y_{t+1} \mid s_t^x, \lambda_t, \theta) = \sum_{\lambda_{t+1}=1}^2 f_N\big(y_{t+1};\, \beta_{t+1}m_t,\, (C_t + \tau^2)\beta_{t+1}\beta_{t+1}' + \sigma^2 I_2\big)\,p(\lambda_{t+1} \mid \lambda_t, \theta);$$
(ii) Propagating state $\lambda$: draw $\lambda_{t+1}^{(i)}$ from $p(\lambda_{t+1} \mid (s_t^x, \lambda_t, \theta)^{(k^i)}, y_{t+1})$, where
$$p(\lambda_{t+1} \mid s_t^x, \lambda_t, \theta, y_{t+1}) \propto f_N\big(y_{t+1};\, \beta_{t+1}m_t,\, (C_t + \tau^2)\beta_{t+1}\beta_{t+1}' + \sigma^2 I_2\big)\,p(\lambda_{t+1} \mid \lambda_t, \theta);$$
(iii) Propagating state $x$: draw $x_{t+1}^{(i)}$ from $p(x_{t+1} \mid \lambda_{t+1}^{(i)}, (s_t^x, \theta)^{(k^i)}, y_{t+1})$;
(iv) Propagating state sufficient statistics $s_{t+1}^x$: the Kalman filter recursions yield
$$m_{t+1} = m_t + A_{t+1}'(y_{t+1} - \beta_{t+1}m_t) \quad\text{and}\quad C_{t+1} = C_t + \tau^2 - A_{t+1}'Q_{t+1}A_{t+1},$$
where $Q_{t+1} = (C_t + \tau^2)\beta_{t+1}\beta_{t+1}' + \sigma^2 I_2$ and $A_{t+1} = (C_t + \tau^2)Q_{t+1}^{-1}\beta_{t+1}$;
(v) Propagating parameter sufficient statistics $s_{t+1}$: the posterior $p(\theta \mid s_{t+1})$ is decomposed into $(\beta_i \mid \sigma^2, s_{t+1}) \sim N(b_{i,t+1}, \sigma^2 B_{i,t+1})$ for $i = 1, 2$, $(\sigma^2 \mid s_{t+1}) \sim IG(\nu_{0,t+1}/2, d_{0,t+1}/2)$, $(\tau^2 \mid s_{t+1}) \sim IG(\nu_{1,t+1}/2, d_{1,t+1}/2)$, $(p \mid s_{t+1}) \sim Beta(p_{1,t+1}, p_{2,t+1})$ and $(q \mid s_{t+1}) \sim Beta(q_{1,t+1}, q_{2,t+1})$, with
$$B_{i,t+1}^{-1} = B_{i,t}^{-1} + x_{t+1}^2\,I_{\lambda_{t+1}=i}, \qquad b_{i,t+1} = B_{i,t+1}\big(B_{i,t}^{-1}b_{i,t} + x_{t+1}y_{t+1,2}\,I_{\lambda_{t+1}=i}\big),$$
$\nu_{i,t+1} = \nu_{i,t} + 1$ for $i = 1, 2$, $d_{1,t+1} = d_{1,t} + (x_{t+1} - x_t)^2$, $p_{1,t+1} = p_{1,t} + I_{\lambda_t=1,\lambda_{t+1}=1}$, $p_{2,t+1} = p_{2,t} + I_{\lambda_t=1,\lambda_{t+1}=2}$, $q_{1,t+1} = q_{1,t} + I_{\lambda_t=2,\lambda_{t+1}=2}$, $q_{2,t+1} = q_{2,t} + I_{\lambda_t=2,\lambda_{t+1}=1}$, and
$$d_{0,t+1} = d_{0,t} + \sum_{j=1}^{2}\big[(y_{t+1,2} - b_{j,t+1}x_{t+1})\,y_{t+1,2} + b_{j,t+1}B_{j0}^{-1}(y_{t+1,1} - x_{t+1})^2\big]\,I_{\lambda_{t+1}=j}.$$

Figures 3 and 4 illustrate the performance of the PL algorithm. The first panel of Figure 3 displays the true underlying $\lambda$ process along with filtered and smoothed estimates, whereas the second panel presents the same information for the common factor. Figure 4 provides the sequential parameter learning plots.

[Figure 3: Dynamic factor model, state filtering. Top panel: true value of $\lambda_t$ (red line), $Pr(\lambda_t = 1 \mid y^t)$ (black line) and $Pr(\lambda_t = 1 \mid y^T)$ (blue line). Bottom panel: true value of $x_t$ (red line), $E(x_t \mid y^t)$ (black line) and $E(x_t \mid y^T)$ (blue line).]

[Figure 4: Dynamic factor model, parameter learning. Sequential posterior median (black line) and posterior 95% credibility intervals (blue lines) for the model parameters $\beta_1$, $\beta_2$, $\sigma^2$, $\tau^2$, $p$ and $q$. True values are the red lines.]

Example 4 (Dynamic logit models). Extensions of PL to non-Gaussian, non-linear state space models appear in Carvalho, Lopes and Polson (2010) and Carvalho, Johannes, Lopes and Polson (2010). We illustrate some of these ideas in the context of a dynamic multinomial logit model with the following structure:
$$P(y_{t+1} = 1 \mid \beta_{t+1}) = (1 + \exp\{-\beta_{t+1}x_{t+1}\})^{-1}, \qquad \beta_{t+1} = \phi\beta_t + \sigma_x\epsilon_{t+1}^\beta,$$
where $\beta_0 \sim N(m_0, C_0)$ and $\theta = (\phi, \sigma_x^2)$. For simplicity, assume that $x_t$ is scalar. It is common practice in limited dependent variable models to introduce a latent continuous variable $z_{t+1}$ to link $y_{t+1}$ and $x_{t+1}$ (see Scott, 2004, Kohn, 1997, and Frühwirth-Schnatter and Frühwirth, 2007). More precisely, the previous model, conditionally on $z_{t+1}$, with $y_{t+1} = I(z_{t+1} \geq 0)$, can be rewritten as
$$z_{t+1} = \beta_{t+1}x_{t+1} + \epsilon_{t+1}^z, \qquad \beta_{t+1} = \phi\beta_t + \epsilon_{t+1}^\beta,$$
where $\epsilon_{t+1}^\beta \sim N(0, \sigma_x^2)$ and $\epsilon_{t+1}^z$ follows an extreme value distribution of type 1, i.e. $\epsilon_{t+1}^z \sim -\log\mathcal{E}(1)$, with $\mathcal{E}(1)$ denoting an exponential with mean one. Conditional normality can be achieved by rewriting the extreme value distribution as a mixture of normals. Frühwirth-Schnatter and Frühwirth (2007) suggest a 10-component mixture of normals with weight, mean and variance for component $j$ given by $w_j$, $\mu_j$ and $s_j^2$, for $j = 1, \ldots, 10$. Hence, conditional on the latent vector $(z_{t+1}, \lambda_{t+1})$, the previous representation leads to the Gaussian dynamic linear model
$$z_{t+1} = \beta_{t+1}x_{t+1} + \epsilon_{t+1}, \qquad \beta_{t+1} = \phi\beta_t + \epsilon_{t+1}^\beta,$$
where $\epsilon_{t+1} \sim N(\mu_{\lambda_{t+1}}, s_{\lambda_{t+1}}^2)$.

Given $\lambda_{t+1}$ we have conditional state sufficient statistics $s_t^\beta$ (for $\beta_t$), and the Kalman filter recursions still hold as $s_{t+1}^\beta = \mathcal{K}(s_t^\beta, z_{t+1}, \lambda_{t+1}, \theta, y_{t+1})$. Similarly for the parameter sufficient statistics $s_t$, which now involve $\lambda_{t+1}$. Moreover, as $\lambda_{t+1}$ is discrete, it is straightforward to see that
$$Pr(y_{t+1} = 1 \mid s_t^\beta, \theta, \lambda_{t+1}) = 1 - \Phi\Big(-\phi m_t x_{t+1}\big((\phi^2 C_t + \sigma_x^2)x_{t+1}^2 + s_{\lambda_{t+1}}^2\big)^{-1/2}\Big),$$
leading to the predictive
$$Pr(y_{t+1} = 1 \mid s_t^\beta, \theta) = \sum_{j=1}^{10} w_j\,Pr(y_{t+1} = 1 \mid s_t^\beta, \theta, \lambda_{t+1} = j),$$
which plays an important role in the re-sampling step. The propagation step requires one to be able to sample $\lambda_{t+1}$ from $p(\lambda_{t+1} \mid s_t^\beta, \theta, y_{t+1})$, $z_{t+1}$ from $p(z_{t+1} \mid s_t^\beta, \theta, \lambda_{t+1}, y_{t+1})$ and $\beta_{t+1}$ from $p(\beta_{t+1} \mid s_t^\beta, \theta, \lambda_{t+1}, z_{t+1}, y_{t+1})$. The final step of PL is the deterministic updating of the conditional sufficient statistics.
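A sketch of this predictive, which drives the re-sampling weights; the mixture constants below are illustrative placeholders only, not the published values of Frühwirth-Schnatter and Frühwirth (2007):

```python
import numpy as np
from scipy.stats import norm

# Illustrative 3-component placeholder for (w_j, s_j^2); NOT the published
# 10-component constants of Frühwirth-Schnatter and Frühwirth (2007).
w = np.array([0.3, 0.4, 0.3])
s2 = np.array([0.6, 1.4, 3.0])

def prob_y_one(m, C, x, phi, sigx2):
    """Predictive Pr(y_{t+1} = 1 | s_t^beta, theta) for the dynamic logit
    example, mixing the conditional probabilities over lambda_{t+1}."""
    scale = np.sqrt((phi ** 2 * C + sigx2) * x ** 2 + s2)
    return np.sum(w * (1.0 - norm.cdf(-phi * m * x / scale)))
```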
2.3. Nonparametric Mixture Models

We now develop PL for discrete nonparametric mixture models and Bayesian nonparametric density estimation. Details appear in Carvalho, Lopes, Polson and Taddy (2009). Our essential state vector now depends on the (random) number of unique mixture components. By definition, $\theta_t^\star = \{\theta_1^\star, \ldots, \theta_{m_t}^\star\}$ is the set of $m_t$ distinct components in $\theta^t$, $k_t$ is the associated latent allocation such that $\theta_t = \theta_{k_t}^\star$, $n_t = (n_{t,1}, \ldots, n_{t,m_t})$ is the number of observations allocated to each unique component, and $s_t = (s_{t,1}, \ldots, s_{t,m_t})$ is the set of conditional sufficient statistics for each $\theta_j^\star$. The state vector to be tracked by PL can then be defined as $Z_t = (k_t, m_t, s_t, n_t)$.

PL will not directly provide the full joint posterior of the allocation vector $k^t = (k_1, \ldots, k_t)$. If this is required, either a particle smoothing or an MCMC step is needed. For infinite mixture models, particle learning proceeds through the two familiar steps. Resample: $(s_t, n_t, m_t)$ with weights proportional to $p(y_{t+1} \mid s_t, n_t, m_t)$; and Propagate: $k_{t+1} \sim p(k_{t+1} \mid s_t, n_t, m_t, y_{t+1})$. The filtered posterior for $(s_T, n_T, m_T)$ can be used for inference via the posterior predictive density $p(y \mid s_T, n_T, m_T)$, which is a Rao-Blackwellized version of $E[f(y; G) \mid y^T]$ for many nonparametric priors (including the DP). Alternatively, since $p(G \mid y^T) = \int p(G \mid s_T, n_T, m_T)\,dp(s_T, n_T, m_T \mid y^T)$, the filtered posterior provides a basis for inference about the full random mixing distribution.

The DP characterizes a prior over probability distributions and is most intuitively represented through its constructive definition (Perman, Pitman and Yor, 1992): a random distribution $G$ generated from $DP(\alpha, G_0(\psi))$ is almost surely of the form
$$dG(\cdot) = \sum_{l=1}^{\infty} p_l\,\delta_{\vartheta_l}(\cdot), \quad\text{with}\quad \vartheta_l \overset{iid}{\sim} G_0(\vartheta_l; \psi), \quad p_l = \Big(1 - \sum_{j=1}^{l-1}p_j\Big)v_l,$$
and $v_l \overset{iid}{\sim} Beta(1, \alpha)$ for $l = 1, 2, \ldots$, where $G_0(\vartheta; \psi)$ is the centering distribution function, parametrized by $\psi$, and the sequences $\{\vartheta_l, l = 1, 2, \ldots\}$ and $\{v_l, l = 1, 2, \ldots\}$ are independent. The discreteness of DP realizations is explicit in this definition.

The DP mixture model is then $f(y_r; G) = \int k(y_r; \theta)\,dG(\theta)$ for $r = 1, \ldots, t$, where $G \sim DP(\alpha, G_0)$. Alternatively, in terms of latent variables, the hierarchical model is that, for $r = 1, \ldots, t$, $y_r \overset{ind}{\sim} k(y_r; \theta_r)$, $\theta_r \overset{iid}{\sim} G$ and $G \sim DP(\alpha, G_0)$.
Two properties of the DP are particularly important for sequential inference. First, the DP is a conditionally conjugate prior: given $\theta^t$ (or, equivalently, $\theta_t^\star$ and $n_t$), the posterior distribution for $G$ is characterized as a $DP(\alpha + t, G_0^t)$, where
$$dG_0^t(\theta; \theta_t^\star, n_t) = \frac{\alpha}{\alpha + t}\,dG_0(\theta) + \sum_{j=1}^{m_t}\frac{n_{t,j}}{\alpha + t}\,\delta_{[\theta = \theta_j^\star]}.$$
Second, this Pólya urn density $dG_0^t$ is also $E[dG \mid \theta^t] = \int dG(\theta)\,dp(G \mid \theta_t^\star, n_t)$, and provides a finite predictive probability function for our mixture model: $p(y_{t+1} \mid \theta^t) = \int k(y_{t+1}; \theta)\,dG_0^t(\theta)$.

A Rao-Blackwellized version of the standard Pólya urn mixture serves as a density estimator:
$$p\big(E[f(y; G)] \mid y^t\big) = \int p\big(E[f(y; G)] \mid s_t, n_t, m_t\big)\,dp(s_t, n_t, m_t \mid y^t),$$
with $p(E[f(y; G)] \mid s_t, n_t, m_t) = \int p(y \mid \theta_t^\star, n_t)\,dp(\theta_t^\star \mid s_t, n_t, m_t)$. If either $\alpha$ or $\psi$ is assigned a hyperprior, we include it in $Z_t$ and sample off-line for each particle conditional on $(n_t, s_t, m_t)^{(i)}$ at each iteration. This is of particular importance for understanding the generality of PL.

PL for DP mixture models
Step 1. (Resample) Generate an index $\zeta \sim \text{Multinomial}(\omega, N)$, where
$$\omega^{(i)} = \frac{p(y_{t+1} \mid (s_t, n_t, m_t)^{(i)})}{\sum_{j=1}^N p(y_{t+1} \mid (s_t, n_t, m_t)^{(j)})};$$
Step 2. (Propagate)
Step 2.1. $k_{t+1} \sim p(k_{t+1} \mid (s_t, n_t, m_t)^{\zeta(i)}, y_{t+1})$;
Step 2.2. $s_{t+1} = \mathcal{S}(s_t, k_{t+1}, y_{t+1})$;
Step 2.3. If $k_{t+1} \leq m_t$: $n_{t+1,j} = n_{t,j}$ for $j \neq k_{t+1}$, $n_{t+1,k_{t+1}} = n_{t,k_{t+1}} + 1$ and $m_{t+1} = m_t$; if $k_{t+1} > m_t$: $n_{t+1,m_{t+1}} = 1$ and $m_{t+1} = m_t + 1$;
Step 3. (Estimation)
$$p^N\big(E[f(y; G)] \mid y^t\big) = \frac{1}{N}\sum_{i=1}^N p\big(y \mid (s_t, n_t, m_t)^{(i)}\big).$$

Example 5 (The DP mixture of multivariate normals). The $d$-dimensional DP multivariate normal mixture (DP-MVN) model has density function
$$f(y_t; G) = \int N(y_t \mid \mu_t, \Sigma_t)\,dG(\mu_t, \Sigma_t), \qquad G \sim DP(\alpha, G_0(\mu, \Sigma)),$$
with conjugate centering distribution $G_0 = N(\mu; \lambda, \Sigma/\kappa)\,W(\Sigma^{-1}; \nu, \Omega)$, where $W(\Sigma^{-1}; \nu, \Omega)$ denotes a Wishart distribution such that $E[\Sigma^{-1}] = \nu\Omega^{-1}$ and $E[\Sigma] = (\nu - (d+1)/2)^{-1}\Omega$. Conditional sufficient statistics for each unique mixture component $s_{t,j}$ are
$$\bar y_{t,j} = \sum_{r: k_r = j}y_r/n_{t,j} \quad\text{and}\quad S_{t,j} = \sum_{r: k_r = j}y_ry_r' - n_{t,j}\bar y_{t,j}\bar y_{t,j}'.$$
The initial sufficient statistics are deterministically $n_1 = 1$ and $s_1 = \{y_1, 0\}$, such that the algorithm is populated with $N$ identical particles. Conditional on the existing particles $\{(n_t, s_t)^{(i)}\}_{i=1}^N$, uncertainty is updated through the familiar resample/propagate approach. The resampling step is performed by an application of the predictive probability function
$$p(y_{t+1} \mid s_t, n_t, m_t) = \frac{\alpha}{\alpha + t}\,St(y_{t+1}; a_0, B_0, c_0) + \sum_{j=1}^{m_t}\frac{n_{t,j}}{\alpha + t}\,St(y_{t+1}; a_{t,j}, B_{t,j}, c_{t,j}),$$
with hyperparameters $a_0 = \lambda$, $B_0 = \frac{2(\kappa+1)}{\kappa c_0}\Omega$, $c_0 = 2\nu - d + 1$, and
$$a_{t,j} = \frac{\kappa\lambda + n_{t,j}\bar y_{t,j}}{\kappa + n_{t,j}}, \qquad c_{t,j} = 2\nu + n_{t,j} - d + 1,$$
$$B_{t,j} = \frac{2(\kappa + n_{t,j} + 1)}{(\kappa + n_{t,j})c_{t,j}}(\Omega + 0.5D_{t,j}), \qquad D_{t,j} = S_{t,j} + \frac{\kappa n_{t,j}}{\kappa + n_{t,j}}(\lambda - \bar y_{t,j})(\lambda - \bar y_{t,j})'.$$
In the propagation step, we then sample the component state $k_{t+1}$ such that, for $j = 1, \ldots, m_t$,
$$p(k_{t+1} = j) \propto \frac{n_{t,j}}{\alpha + t}\,St(y_{t+1}; a_{t,j}, B_{t,j}, c_{t,j}) \quad\text{and}\quad p(k_{t+1} = m_t + 1) \propto \frac{\alpha}{\alpha + t}\,St(y_{t+1}; a_0, B_0, c_0).$$
If $k_{t+1} = m_t + 1$, the new sufficient statistics are defined by $m_{t+1} = m_t + 1$ and $s_{t+1,m_{t+1}} = [y_{t+1}, 0]$. If $k_{t+1} = j$, then $n_{t+1,j} = n_{t,j} + 1$ and we update $s_{t+1,j}$ such that $\bar y_{t+1,j} = (n_{t,j}\bar y_{t,j} + y_{t+1})/n_{t+1,j}$ and $S_{t+1,j} = S_{t,j} + y_{t+1}y_{t+1}' + n_{t,j}\bar y_{t,j}\bar y_{t,j}' - n_{t+1,j}\bar y_{t+1,j}\bar y_{t+1,j}'$. The remaining sufficient statistics are the same as at time $t$.
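A sketch of the propagate step for a single particle under these formulas (data layout and function names are ours; scipy's multivariate Student-t density is used for the $St(\cdot)$ terms):

```python
import numpy as np
from scipy.stats import multivariate_t

def predictive_params(n_j, ybar_j, S_j, lam, kappa, nu, Omega):
    """Student-t parameters (a_{t,j}, B_{t,j}, c_{t,j}) for one component,
    following the Example 5 formulas (d = len(lam))."""
    d = len(lam)
    a = (kappa * lam + n_j * ybar_j) / (kappa + n_j)
    D = S_j + kappa * n_j / (kappa + n_j) * np.outer(lam - ybar_j, lam - ybar_j)
    c = 2 * nu + n_j - d + 1
    B = 2 * (kappa + n_j + 1) / ((kappa + n_j) * c) * (Omega + 0.5 * D)
    return a, B, c

def propagate_dpmvn(y, n, ybar, S, alpha, lam, kappa, nu, Omega):
    """Sample k_{t+1} and update (n, ybar, S) for one particle.
    n, ybar, S are lists over the m_t current components."""
    d, t, m = len(y), int(sum(n)), len(n)
    probs = []
    for j in range(m):   # existing components: n_j/(alpha+t) * St(y; a, B, c)
        a, B, c = predictive_params(n[j], ybar[j], S[j], lam, kappa, nu, Omega)
        probs.append(n[j] / (alpha + t) * multivariate_t(loc=a, shape=B, df=c).pdf(y))
    c0 = 2 * nu - d + 1  # new component: alpha/(alpha+t) * St(y; a0, B0, c0)
    B0 = 2 * (kappa + 1) / (kappa * c0) * Omega
    probs.append(alpha / (alpha + t) * multivariate_t(loc=lam, shape=B0, df=c0).pdf(y))
    probs = np.array(probs)
    k = np.random.choice(m + 1, p=probs / probs.sum())
    if k == m:           # open a new component
        n.append(1); ybar.append(y.copy()); S.append(np.zeros((d, d)))
    else:                # update component k's sufficient statistics
        nb = n[k] + 1
        yb = (n[k] * ybar[k] + y) / nb
        S[k] += np.outer(y, y) + n[k] * np.outer(ybar[k], ybar[k]) - nb * np.outer(yb, yb)
        n[k], ybar[k] = nb, yb
    return n, ybar, S
```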
We can also assign hyperpriors to the parameters of $G_0$. In this case, a parameter learning step is added to the algorithm for each particle. Assuming a $W(\gamma_\Omega, \Psi_\Omega^{-1})$ prior for $\Omega$ and a $N(\gamma_\lambda, \Psi_\lambda)$ prior for $\lambda$, the sample at time $t$ is augmented with draws of the auxiliary variables $\{\mu_j^\star, \Sigma_j^\star\}$, for $j = 1, \ldots, m_t$, from their posterior full conditionals,
$$p(\mu_j^\star, \Sigma_j^\star \mid s_t, n_t) \equiv N\Big(\mu_j^\star;\, a_{t,j},\, \frac{1}{\kappa + n_{t,j}}\Sigma_j^\star\Big)\,W\big(\Sigma_j^{\star-1};\, \nu + n_{t,j},\, \Omega + D_{t,j}\big).$$
The parameter updates are then
$$\lambda \sim N\Big(R\big(\Psi_\lambda^{-1}\gamma_\lambda + \kappa\sum_{j=1}^{m_t}\Sigma_j^{\star-1}\mu_j^\star\big),\, R\Big) \quad\text{and}\quad \Omega \sim W\big(\gamma_\Omega + m_t\nu,\, R^{-1}\big),$$
where $R^{-1} = \sum_{j=1}^{m_t}\Sigma_j^{\star-1} + \Psi_\Omega^{-1}$. Similarly, if $\alpha$ is assigned the usual gamma hyperprior, it can be updated for each particle using the auxiliary variable method of Escobar and West (1995).

To illustrate the PL algorithm, a dataset was simulated with dimension $d = 2$ and sample size $T = 1000$. The bivariate vectors $y_t$ were generated from a $N(\mu_t, AR(0.9))$ density, where $\mu_t \sim G_\mu$ and $AR(0.9)$ denotes the correlation matrix implied by an autoregressive process of lag one and correlation 0.9. The mean distribution $G_\mu$ is the realization of a $DP(4, N(0, 4I))$ process. Thus the simulated data are clustered around a set of distinct means and highly correlated within each cluster. The model, with parameters fixed at $\alpha = 2$, $\lambda = 0$, $\kappa = 0.25$, $\nu = 4$ and $\Omega = (\nu - 1.5)I$, was fit to this data. Figure 5 shows the data and bivariate density estimates, which are the mean Rao-Blackwellized posterior predictive $p(y \mid s_T, n_T, m_T)$, hence the posterior expectation of $f(y; G)$. Marginal estimates are just the appropriate marginal densities derived from the mixture of Student's t distributions.

[Figure 5: DP mixture of multivariate normals. Data and density estimates for a PL fit with 1000 particles (left) and for each of ten PL fits with 500 particles (right), applied to a random ordering of the 1000 observations of bivariate data.]

3. OTHER APPLICATIONS

Successful implementations of PL (and hybrid versions of PL) have appeared over the last couple of years. Taddy, Gramacy and Polson (2010) show that PL is the best alternative for online posterior filtering of tree states in dynamic regression tree models, while Gramacy and Polson (2010) use PL for online updating of Gaussian process models for regression and classification. Shi and Dunson (2009) adopt a PL-flavored scheme for stochastic variable selection and model search in linear regression and probit models, while Mukherjee and West (2009) focus on model comparison for applications in cellular dynamics in systems biology.
With a more time-series flavor, Rios and Lopes (2009) propose a hybrid LW-Storvik filter for the Markov switching stochastic volatility model that outperforms the Carvalho and Lopes (2007) filter. Lund and Lopes (2010) sequentially estimate a regime-switching macro-finance model for the postwar US term structure of interest rates, while Prado and Lopes (2010) adapt PL to study state-space autoregressive models with structured priors. Chen, Petralia and Lopes (2010) propose a hybrid PL-LW sequential Monte Carlo algorithm that fully estimates non-linear, non-normal dynamic stochastic general equilibrium models, with a particular application to a neoclassical growth model. Additionally, Dukić, Lopes and Polson (2010) use PL to track flu epidemics using Google Trends data, while Lopes and Polson (2010) use PL to estimate volatility and examine volatility dynamics for financial time series, such as the S&P500 and NDX100 indices, during the early part of the credit crisis.

4. FINAL THOUGHTS

4.1. Historical Note

Since the seminal paper by Gordon, Salmond and Smith (1993), and subsequently Kong, Liu and Wong (1994), Liu and Chen (1998) and Doucet, Godsill and Andrieu (2000), to name but a few, the sequential Monte Carlo literature has grown continuously. The first generation of SMC methods is well summarized in the compendium edited by Doucet, de Freitas and Gordon (2001), where several strategies for improving existing particle filters are discussed, along with about a dozen applications in various areas (see also Ristic, Arulampalam and Gordon, 2004, and the 2002 special issue of IEEE Transactions on Signal Processing on sequential Monte Carlo methods). The vast majority of the literature defining the first generation focuses on sample-resample schemes, but it is the resample-sample particle filter introduced by Pitt and Shephard (1999) that is the key initiator of the second stage of development in the SMC literature. The APF with parameter learning was introduced by Liu and West (2001) and builds on earlier work by West (1992, 1993), who published the first adaptive importance sampling scheme using mixtures (via kernel shrinkage) in sequential models. Our PL approach is a direct extension of Pitt and Shephard's (1999) APF. Carvalho, Johannes, Lopes and Polson (2010) show that APF and PL, both resample-sample schemes, outperform the standard sample-resample filters.

The second wave in the SMC literature occurred over the last decade, with recent advances focusing on, amongst other things, (i) parameter learning, (ii) similarities and differences between propagate-resample and resample-propagate filters, (iii) computationally viable particle smoothers, and (iv) the merging of SMC and MCMC tools towards more efficient sequential schemes. See Cappé, Godsill and Moulines (2007) and Doucet and Johansen (2008) for thorough reviews; see also Prado and West (2010, chapter 6) and Lopes and Tsay (2010). For example, important contributions to parameter learning, either for online or batch sampling, were made by Liu and West (2001), as mentioned above, Pitt (2002), Storvik (2002), Fearnhead (2002), Polson, Stroud and Müller (2008), Doucet and Tadić (2003), Poyiadjis, Doucet and Singh (2005) and Olsson, Cappé, Douc and Moulines (2008), to name but a few, while SIS and APF similarities are the focus of Doucet and Johansen (2008) and Douc, Moulines and Olsson (2009).

4.2. PL and the Future

Particle learning provides a simulation-based approach to sequential Bayesian inference.
It combines the data augmentation features that are prevalent in MCMC with the resample-propagate auxiliary particle filter of Pitt and Shephard (1999). In many ways there is a parallel between the proliferation of data augmentation in Gibbs sampling and its potential role in expanding the PL framework. This combination of factors provides new insights on sequential learning for static parameters. In the case of trees, for example, the essential state vector $Z_t$ is itself a modeling tool, suggesting many different ways of searching the model space and of specifying prior distributions in complicated spaces. This is a fruitful direction for future modeling.

There are many open areas for future implementation of the framework:
• nonlinear and nonnormal panels (econometrics);
• REML (econometrics);
• structural equation models;
• dynamic factor models;
• multivariate extensions (challenging);
• space-time models (ensemble Kalman filters).

Finally, we emphasize that there are differences in the way Monte Carlo errors accumulate in MCMC and PL, and this is clearly another fruitful area for future research, from both a theoretical and an empirical perspective. As with MCMC methods, the usual word of caution about relying heavily on asymptotic central limit theorem results carries over to the PL framework.

REFERENCES

Cappé, O., Godsill, S. and Moulines, E. (2007). An overview of existing methods and recent advances in sequential Monte Carlo. Proc. IEEE 95, 899–924.
Carlin, B. P. and Polson, N. G. (1991). Inference for nonconjugate Bayesian models using the Gibbs sampler. Can. J. Statist. 19, 399–405.
Carpenter, J., Clifford, P. and Fearnhead, P. (1999). An improved particle filter for non-linear problems. IEE Proc. Radar, Sonar and Navigation 146, 2–7.
Carvalho, C. M., Johannes, M., Lopes, H. F. and Polson, N. G. (2010). Particle learning and smoothing. Statist. Science (to appear).
Carvalho, C. M. and Lopes, H. F. (2007). Simulation-based sequential analysis of Markov switching stochastic volatility models. Comput. Statist. Data Anal. 51, 4526–4542.
Carvalho, C. M., Lopes, H. F. and Polson, N. G. (2010). Particle learning for generalized dynamic conditionally linear models. Working paper, University of Chicago Booth School of Business.
Carvalho, C. M., Lopes, H. F., Polson, N. G. and Taddy, M. (2009). Particle learning for general mixtures. Working paper, University of Chicago Booth School of Business.
Chen, H., Petralia, F. and Lopes, H. F. (2010). Sequential Monte Carlo estimation of DSGE models. Working paper, University of Chicago Booth School of Business.
Chen, R. and Liu, J. S. (2000). Mixture Kalman filters. J. Roy. Statist. Soc. B 62, 493–508.
Chopin, N. (2002). A sequential particle filter method for static models. Biometrika 89, 539–551.
Douc, R., Moulines, E. and Olsson, J. (2009). Optimality of the auxiliary particle filter. Probab. Math. Statist. 29, 1–28.
Doucet, A., de Freitas, N. and Gordon, N. (2001). Sequential Monte Carlo Methods in Practice. Berlin: Springer.
Doucet, A., Godsill, S. and Andrieu, C. (2000). On sequential Monte Carlo sampling methods for Bayesian filtering. Statist. Computing 10, 197–208.
Doucet, A. and Johansen, A. (2008). A note on auxiliary particle filters. Statist. Probab. Lett. 78, 1498–1504.
Doucet, A. and Tadić, V. B. (2003). Parameter estimation in general state-space models using particle methods. Ann. Inst. Statist. Math. 55, 409–422.
Dukić, V., Lopes, H. F. and Polson, N. G. (2010).
Tracking flu epidemics using Google trends and particle learning. Working Paper. The University of Chicago Booth School of Business.
Escobar, M. and West, M. (1995). Bayesian density estimation and inference using mixtures. J. Amer. Statist. Assoc. 90, 577–588.
Fearnhead, P. (2002). Markov chain Monte Carlo, sufficient statistics and particle filters. J. Comp. Graphical Statist. 11, 848–862.
Gamerman, D. and Lopes, H. F. (2006). Markov Chain Monte Carlo: Stochastic Simulation for Bayesian Inference (2nd edition). Boca Raton: Chapman & Hall/CRC.
Gilks, W. and Berzuini, C. (2001). Following a moving target: Monte Carlo inference for dynamic Bayesian models. J. Roy. Statist. Soc. B 63, 127–146.
Godsill, S. J., Doucet, A. and West, M. (2004). Monte Carlo smoothing for nonlinear time series. J. Amer. Statist. Assoc. 99, 156–168.
Gordon, N., Salmond, D. and Smith, A. F. M. (1993). Novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEE Proceedings F: Radar and Signal Processing 140, 107–113.
Gramacy, R. and Polson, N. G. (2010). Particle learning of Gaussian process models for sequential design and optimization. Working Paper. The University of Chicago Booth School of Business.
Hans, C. (2009). Bayesian lasso regression. Biometrika 96, 835–845.
Kitagawa, G. (1996). Monte Carlo filter and smoother for non-Gaussian non-linear state space models. J. Comp. Graphical Statist. 5, 1–25.
Kong, A., Liu, J. S. and Wong, W. (1994). Sequential imputation and Bayesian missing data problems. J. Amer. Statist. Assoc. 89, 590–599.
Liu, J. and Chen, R. (1998). Sequential Monte Carlo methods for dynamic systems. J. Amer. Statist. Assoc. 93, 1032–1044.
Liu, J. and West, M. (2001). Combined parameter and state estimation in simulation-based filtering. In Sequential Monte Carlo Methods in Practice (A. Doucet, N. de Freitas and N. Gordon, eds.), 197–223. Berlin: Springer.
Lopes, H. F. (2000). Bayesian analysis in latent factor and longitudinal models. Unpublished Ph.D. Thesis. Institute of Statistics and Decision Sciences, Duke University, USA.
Lopes, H. F. and Polson, N. G. (2010). Extracting S&P500 and NASDAQ volatility: The credit crisis of 2007–2008. In Handbook of Applied Bayesian Analysis (A. O'Hagan and M. West, eds.), 319–342. Oxford: Oxford University Press.
Lopes, H. F. and Tsay, R. S. (2010). Bayesian analysis of financial time series via particle filters. J. Forecast. (to appear).
Lund, B. and Lopes, H. F. (2010). Learning in a regime switching macro-finance model for the term structure. Working Paper. The University of Chicago Booth School of Business.
Mukherjee, C. and West, M. (2009). Sequential Monte Carlo in model comparison: Example in cellular dynamics in systems biology. Working Paper. Department of Statistical Science, Duke University.
Olsson, J., Cappé, O., Douc, R. and Moulines, E. (2008). Sequential Monte Carlo smoothing with application to parameter estimation in non-linear state space models. Bernoulli 14, 155–179.
Perman, M., Pitman, J. and Yor, M. (1992). Size-biased sampling of Poisson point processes and excursions. Probab. Theory Related Fields 92, 21–39.
Pitt, M. K. (2002). Smooth particle filters for likelihood evaluation and maximisation. Working Paper. Department of Economics, University of Warwick, UK.
Pitt, M. K. and Shephard, N. (1999). Filtering via simulation: Auxiliary particle filters. J. Amer. Statist. Assoc. 94, 590–599.
Polson, N. G., Stroud, J. R. and Müller, P. (2008). Practical filtering with sequential parameter learning. J. Roy. Statist. Soc. B 70, 413–428.
Poyiadjis, G., Doucet, A. and Singh, S. S. (2005). Particle methods for optimal filter derivative: Application to parameter estimation. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing 5, 925–928.
Prado, R. and Lopes, H. F. (2010). Sequential parameter learning and filtering in structured autoregressive models. Working Paper. The University of Chicago Booth School of Business.
Prado, R. and West, M. (2010). Time Series: Modelling, Computation and Inference. Boca Raton: Chapman & Hall/CRC.
Rios, M. P. and Lopes, H. F. (2009). Sequential parameter estimation in stochastic volatility models. Working Paper. The University of Chicago Booth School of Business.
Ristic, B., Arulampalam, S. and Gordon, N. (2004). Beyond the Kalman Filter: Particle Filters for Tracking Applications. Boston: Artech House.
Shi, M. and Dunson, D. (2009). Bayesian variable selection via particle stochastic search. Working Paper. Department of Statistical Science, Duke University.
Storvik, G. (2002). Particle filters in state space models with the presence of unknown static parameters. IEEE Trans. Signal Process. 50, 281–289.
Taddy, M., Gramacy, R. and Polson, N. G. (2010). Dynamic trees for learning and design. Working Paper. The University of Chicago Booth School of Business.
West, M. (1992). Modelling with mixtures (with discussion). In Bayesian Statistics 4 (J. M. Bernardo, J. O. Berger, A. P. Dawid and A. F. M. Smith, eds.), 503–524. Oxford: Oxford University Press.
West, M. (1993). Mixture models, Monte Carlo, Bayesian updating and dynamic models. Computing Science and Statistics 24, 325–333.
West, M. and Harrison, J. (1997). Bayesian Forecasting and Dynamic Models (2nd edition). Berlin: Springer.