BAYESIAN STATISTICS 9,
J. M. Bernardo, M. J. Bayarri, J. O. Berger, A. P. Dawid,
D. Heckerman, A. F. M. Smith and M. West (Eds.)
© Oxford University Press, 2010
Particle Learning for Sequential
Bayesian Computation
Hedibert F. Lopes
The University of Chicago, USA
[email protected]
Carlos M. Carvalho
The University of Chicago, USA
[email protected]
Michael S. Johannes
Columbia University, USA
[email protected]
Nicholas G. Polson
The University of Chicago, USA
[email protected]
Summary
Particle learning provides a simulation-based approach to sequential Bayesian
computation. To sample from a posterior distribution of interest we use an
essential state vector together with a predictive and propagation rule to build
a resampling-sampling framework. Predictive inference and sequential Bayes
factors are a direct by-product. Our approach provides a simple yet powerful
framework for the construction of sequential posterior sampling strategies for
a variety of commonly used models.
Keywords and Phrases: Particle learning; Bayesian; Dynamic factor
models; Essential state vector; Mixture models; Sequential
inference; Conditional dynamic linear models; Nonparametric;
Dirichlet.
Hedibert F. Lopes is Associate Professor of Econometrics and Statistics, The University
of Chicago Booth School of Business. Carlos M. Carvalho is Assistant Professor of Econometrics and Statistics, The University of Chicago Booth School of Business. Michael S.
Johannes is Roger F. Murray Associate Professor of Finance, Graduate School of Business, Columbia University. Nicholas G. Polson is Professor of Econometrics and Statistics,
The University of Chicago Booth School of Business. We would like to thank Mike West,
Raquel Prado and Peter Müller for insightful comments that greatly improved the article.
We also thank Seung-Min Yae for research assistance with some of the examples. Part of
this research was conducted while the first two authors were visiting the Statistical and Applied Mathematical Sciences Institute for the 2008-09 Program on Sequential Monte Carlo
Methods.
1. THE PL FRAMEWORK
Sequential Bayesian computation requires calculation of a set of posterior distributions p(θ | y^t), for t = 1, . . . , T, where y^t = (y_1, . . . , y_t). The inability to directly compute the marginal p(y^t) = ∫ p(y^t | θ) p(θ) dθ implies that accessing the desired posterior distributions requires simulation schemes. We present a sequential simulation strategy for both p(θ | y^t) and p(y^t) based on a resample-sampling framework called Particle Learning (PL). PL is a direct extension of the resample-sampling scheme introduced by Pitt and Shephard (1999) in the parameter-free, time series context.
Our new look at Bayes's theorem delivers a sequential, on-line inference strategy for effective posterior simulation in a variety of commonly used models. These strategies are intuitive and easy to implement. In addition, when contrasted with MCMC methods, PL delivers more for less, as it provides
• posterior samples and direct approximations of marginal likelihoods;
• a parallel environment, an important feature as more multi-processor computational power becomes available.
Central to PL is the creation of an essential state vector Z_t to be tracked sequentially. We assume that this vector is conditionally sufficient for the parameter of interest, so that p(θ | Z_t) is either available in closed form or can easily be sampled from.
Given samples {Z_t^(i)}_{i=1}^N ∼ p(Z_t | y^t), a simple mixture approximation to the set of posteriors (or moments thereof) is given by

p_N(θ | y^t) = (1/N) Σ_{i=1}^N p(θ | Z_t^(i)).
This follows from the Rao-Blackwellised identity

p(θ | y^t) = E{p(θ | Z_t)} = ∫ p(θ | Z_t) p(Z_t | y^t) dZ_t.

If we require samples, we draw θ^(i) ∼ p(θ | Z_t^(i)). See West (1992, 1993) for an early approach to approximating posterior distributions via mixtures.
The task of sequential Bayesian computation is then equivalent to a filtering problem for the essential state vector, drawing {Z_t^(i)}_{i=1}^N ∼ p(Z_t | y^t) sequentially from the set of posteriors. To this end, PL exploits the following sequential decomposition of Bayes' rule:

p(Z_{t+1} | y^{t+1}) = ∫ p(Z_{t+1} | Z_t, y_{t+1}) dP(Z_t | y^{t+1})
                    ∝ ∫ p(Z_{t+1} | Z_t, y_{t+1}) p(y_{t+1} | Z_t) dP(Z_t | y^t),

where the first term in the integrand governs the propagate step and the second the resample step. Here dP(Z_t | y^{t+1}) ∝ p(y_{t+1} | Z_t) dP(Z_t | y^t), with P(Z_t | y^t) denoting the distribution of the current state vector. In particle form this is represented by (1/N) Σ_{i=1}^N δ_{Z_t^(i)}, where δ is the Dirac measure.
The intuition is as follows. Given P(Z_t | y^t) we find the smoothed distribution P(Z_t | y^{t+1}) via resampling and then propagate forward using p(Z_{t+1} | Z_t, y_{t+1}) to find the new Z_{t+1}. Making an analogy to dynamic linear models, this is exactly the Kalman filtering logic in reverse, first proposed by Pitt and Shephard (1999). From a sampling perspective, this leads to a very simple algorithm for updating particles {Z_t^(i)}_{i=1}^N to {Z_{t+1}^(i)}_{i=1}^N in two steps:

(i) Resample: with replacement from a multinomial with weights proportional to the predictive distribution p(y_{t+1} | Z_t^(i)), to obtain {Z_t^{ζ(i)}}_{i=1}^N;
(ii) Propagate: with Z_{t+1}^(i) ∼ p(Z_{t+1} | Z_t^{ζ(i)}, y_{t+1}), to obtain {Z_{t+1}^(i)}_{i=1}^N.
The ingredients of particle learning are the essential state vector Z_t, a predictive probability rule p(y_{t+1} | Z_t^(i)) for resampling ζ(i), and a propagation rule to update particles, Z_t^{ζ(i)} → Z_{t+1}^(i). We summarize the algorithm as follows:
Particle Learning (PL)

Step 1. (Resample) Generate an index ζ ∼ Multinomial(ω, N) where

ω^(i) = p(y_{t+1} | Z_t^(i)) / Σ_{i=1}^N p(y_{t+1} | Z_t^(i));

Step 2. (Propagate)

Z_{t+1}^(i) ∼ p(Z_{t+1} | Z_t^{ζ(i)}, y_{t+1});

Step 3. (Learn)

p_N(θ | y^{t+1}) = (1/N) Σ_{i=1}^N p(θ | Z_{t+1}^(i)).
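To make the cycle concrete, the following minimal sketch (ours, not part of the original algorithm statement) implements the three steps for a generic model; predictive, propagate and sample_theta are hypothetical model-specific callables supplying p(y_{t+1} | Z_t), p(Z_{t+1} | Z_t, y_{t+1}) and p(θ | Z_t).

import numpy as np

def particle_learning(y, Z0, predictive, propagate, sample_theta, seed=0):
    """Generic PL cycle.  Z0 is a list of N initial essential state vectors;
    predictive(z, y) evaluates p(y | z), propagate(z, y) draws from
    p(Z' | z, y), and sample_theta(z) draws from p(theta | z)."""
    rng = np.random.default_rng(seed)
    Z, N = list(Z0), len(Z0)
    for yt in y:
        # Step 1 (Resample): multinomial resampling with predictive weights
        w = np.array([predictive(z, yt) for z in Z])
        Z = [Z[k] for k in rng.choice(N, size=N, p=w / w.sum())]
        # Step 2 (Propagate): draw Z_{t+1} given the resampled Z_t and y_{t+1}
        Z = [propagate(z, yt) for z in Z]
    # Step 3 (Learn): draws from the mixture approximation to p(theta | y^t)
    return Z, [sample_theta(z) for z in Z]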
Example 1 (Constructing Z_n for the i.i.d. model). As a first illustration of the derivation of the essential state vector and the implementation of PL, consider the following simple i.i.d. model

y_i | λ_i ∼ N(μ, τ² λ_i) and λ_i ∼ IG(ν/2, ν/2),

for i = 1, . . . , n, with known ν and priors μ | τ² ∼ N(m0, C0 τ²) and τ² ∼ IG(a0, b0). Here the essential state vector is Z_n = (λ_{n+1}, a_n, b_n, m_n, C_n), where (a_n, b_n) index the sufficient statistics for the updating of τ², while (m_n, C_n) index the sufficient statistics for the updating of μ. Set m0 = 0 and C0 = 1. The variables λ_{n+1} are i.i.d. and so can be propagated directly from p(λ_{n+1}), whilst the conditional sufficient statistics (a_{n+1}, b_{n+1}) are deterministically calculated from the previous values (a_n, b_n) and the values (μ_{n+1}, λ_{n+1}); here μ_{n+1} simply denotes the draw of the parameter μ at time n + 1. Given the particle set {(Z_0, μ, τ²)^(i)}_{i=1}^N, PL cycles through the following steps:
Step 1. Resample {(Z̃_n, μ̃, τ̃²)^(i)}_{i=1}^N from {(Z_n, μ, τ²)^(i)}_{i=1}^N with weights

w_{n+1}^(i) ∝ p(y_{n+1} | Z_n^(i)) = f_N(y_{n+1}; m_n^(i), τ^{2(i)}(C_n^(i) + λ_{n+1}^(i))), for i = 1, . . . , N;

Step 2. Propagate a_{n+1}^(i) = ã_n^(i) + 0.5 and b_{n+1}^(i) = b̃_n^(i) + 0.5 y_{n+1}²/(1 + λ̃_{n+1}^(i)), and sample τ^{2(i)} from IG(a_{n+1}^(i), b_{n+1}^(i)), for i = 1, . . . , N;

Step 3. Propagate C_{n+1}^(i) = 1/(1/C̃_n^(i) + 1/λ̃_{n+1}^(i)) and m_{n+1}^(i) = C_{n+1}^(i) (m̃_n^(i)/C̃_n^(i) + y_{n+1}/λ̃_{n+1}^(i)), and sample μ_{n+1}^(i) from N(m_{n+1}^(i), C_{n+1}^(i)), for i = 1, . . . , N;

Step 4. Sample λ_{n+2}^(i) from p(λ_{n+2}) and let Z_{n+1}^(i) = (λ_{n+2}, a_{n+1}, b_{n+1}, m_{n+1}, C_{n+1})^(i), for i = 1, . . . , N.
In Step 1, f_N(y; μ, σ²) denotes the density of N(μ, σ²) evaluated at y. The posterior for μ and τ² could instead be approximated through a Gibbs sampler based on the full conditionals

μ | λ, τ², y ∼ N(g_1(y, λ)/s(λ), τ²/s(λ)),
τ² | λ, y ∼ IG(a0 + 0.5n, b0 + 0.5 g_2(y, λ)),
λ_i | τ², y_i ∼ IG((ν + 1)/2, (ν + (y_i − μ)²/τ²)/2), i = 1, . . . , n,

where s(λ) = 1 + Σ_{i=1}^n λ_i^{-1}, g_1(y, λ) = Σ_{i=1}^n y_i/λ_i, and g_2(y, λ) = Σ_{i=1}^n y_i²/(1 + λ_i).
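A compact transcription of Steps 1-4 follows; this is our sketch, not code from the paper. It reads IG(a, b) as the inverse-gamma distribution with density proportional to x^(−a−1) e^(−b/x), and the default arguments mirror the Figure 1 setting.

import numpy as np
from scipy.stats import norm

def pl_iid_model(y, N=10_000, nu=1.0, a0=5.0, b0=0.05, seed=0):
    """PL for y_i ~ N(mu, tau^2 lambda_i), lambda_i ~ IG(nu/2, nu/2),
    mu | tau^2 ~ N(0, tau^2), tau^2 ~ IG(a0, b0) (i.e. m0 = 0, C0 = 1).
    Particles track Z_n = (lambda_{n+1}, a_n, b_n, m_n, C_n)."""
    rng = np.random.default_rng(seed)
    ig = lambda a, b: 1.0 / rng.gamma(a, 1.0 / b)        # IG(a, b) draws
    lam = ig(nu / 2, np.full(N, nu / 2))                 # lambda_{n+1}
    a, b = np.full(N, a0), np.full(N, b0)
    m, C = np.zeros(N), np.ones(N)
    tau2 = ig(a, b)
    for yn in y:
        # Step 1: resample with weights f_N(y; m_n, tau^2 (C_n + lambda_{n+1}))
        w = norm.pdf(yn, loc=m, scale=np.sqrt(tau2 * (C + lam)))
        k = rng.choice(N, size=N, p=w / w.sum())
        lam, a, b, m, C = lam[k], a[k], b[k], m[k], C[k]
        # Step 2: update (a, b) and refresh tau^2
        a, b = a + 0.5, b + 0.5 * yn**2 / (1.0 + lam)
        tau2 = ig(a, b)
        # Step 3: update (m, C) and draw mu
        C_new = 1.0 / (1.0 / C + 1.0 / lam)
        m, C = C_new * (m / C + yn / lam), C_new
        mu = rng.normal(m, np.sqrt(C))
        # Step 4: draw the next lambda from its IG(nu/2, nu/2) prior
        lam = ig(nu / 2, np.full(N, nu / 2))
    return mu, tau2

# e.g. mu, tau2 = pl_iid_model(np.array([-15., -10., 0., 1., 2.]))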
Figure 1 compares PL with the Gibbs sampler.
Figure 1: i.i.d. model, Gibbs versus Particle Learning; scatter plots of posterior draws of (μ, τ²) from PL (left panel) and Gibbs (right panel). Data y = (−15, −10, 0, 1, 2), number of degrees of freedom ν = 1, and hyperparameters a0 = 5, b0 = 0.05, m0 = 0 and C0 = 1. For the Gibbs sampler, the initial value for τ² is V(y) = 58.3, with 5000 draws kept after 10000 as burn-in. PL is based on 10000 particles. The contours represent the true posterior distribution.
1.1. Constructing the Essential State Vector
At first sight, PL seems a rather simple paradigm. The real power, however, lies in the flexibility one has in defining the essential state vector. This may include state variables, auxiliary variables, a subset of parameters θ_{(−i)}, sufficient statistics, and model specification, among others. The dimensionality of Z_t can also be included in the particle set and increase with the sample size, as, for example, in the Dirichlet process mixture discussed later.

In sum, the use of an essential state vector Z_t is an integral part of our approach, and its definition will become clear in the following sections. The propagation rule can involve either stochastic or deterministic updates and is, in many ways, a modeling tool by itself. For example, in complicated setups (variable selection, treed models) the propagation rule p(Z_{t+1} | Z_t, y_{t+1}) suggests many different ways of searching the model space. It is our hope that, with the dissemination of the ideas associated with PL, more cases will emerge where the use of Z_t leads to new modeling insights.
The following are examples of the form of Z_t in the models addressed later in the chapter:
• Mixture regression models: auxiliary state variable λ_t and conditional sufficient statistics s_t for parameter inference;
• State space models: in conditionally Gaussian dynamic linear models, Z_t tracks the usual Kalman filter state moments, denoted by (m_t, C_t), and conditional sufficient statistics s_t for fixed parameters;
• Nonparametric models: Z_t tracks an indicator of each mixture component k_t, the number n_t of observations allocated to each component, and the current number m_t of unique components. In a Dirichlet process mixture, the particle vector can grow in time, as there is a positive probability of adding a new mixture component with each new observation.
In the rest of the paper, we address each of these models and provide the necessary calculations to implement PL.
1.2. Comparison with SIS and MCMC
Particle filtering (Gordon, Salmond and Smith, 1993) and sequential importance sampling (Kong, Liu and Wong, 1994) have a number of features in common with PL. For example, one can view our update for the augmented vector Z_t as a fully-adapted version of Pitt and Shephard's (1999) auxiliary particle filter (APF), with the additional step that the augmented variables can depend on functionals of the parameter. The additional parameter draw θ^(i) ∼ p(θ | Z_n^(i)) is not present in the APF and is used in PL to replenish the diversity of the parameter particles.
Storvik (2002) proposed the use of sufficient statistics, independent of the parameters, for state space models in a propagate-resample algorithm. Chen and Liu (2000) work with a similar approach in the mixture Kalman filter context. PL differs in two important ways: (i) they only consider the problem of state filtering; (ii) they work in the propagate-resample framework. This is carefully discussed in Carvalho, Johannes, Lopes and Polson (2010). Again, our view of the augmented variables Z_t is more general than Storvik's approach.
Another related class of algorithms are Rao-Blackwellised particle filters, which are typically propagate-resample algorithms for a pure filtering problem, where z_{t+1} denotes missing data and x_{t+1} a state. Additionally, they attempt to approximate the joint distribution p(Z^t | y^t). This target increases in dimensionality as new data becomes available, leading to unbalanced weights. In our framework, p(Z^t | y^t) is not of interest, as the filtered, lower-dimensional p(Z_t | y^t) is sufficient for inference at time t. Notice that, based on their work, one has to consider the question of "when to resample?" as an alternative to re-balance the approximation weights. In contrast, our approach requires re-sampling at every step, as the pre-selection of particles in light of new observations is fundamental in avoiding a decay in the particle approximation for θ.
Another avenue of research uses MCMC steps inside a sequential Monte Carlo
algorithm as in the resample-move algorithm of Gilks and Berzuini (2001). This
is not required in our strategy as we are using a fully-adapted approach. Finally,
see Lopes and Tsay (2010) for a recent review of particle filter methods with an
emphasis on contrasting propagate-resample and resample-propagate filters.
1.3. Smoothing
At time T, PL provides the filtered distribution of the last essential state vector Z_T, namely p(Z_T | y^T). If the smoothed distribution of any element k of Z, i.e. p(k^T | y^T), is required, it can be obtained at the end of the filtering process. To compute the marginal smoothing distribution, we need

p(k^T | y^T) = ∫ p(k^T | Z_T, y^T) p(Z_T | y^T) dZ_T.

In the case where k_t is conditionally independent across time given Z_t, this can be further simplified as

∫ p(k^T | Z_T, y^T) p(Z_T | y^T) dZ_T = ∫ Π_{t=1}^T p(k_t | Z_T, y_t) p(Z_T | y^T) dZ_T,

so that samples from p(k^T | y^T) can be obtained by sampling (for each particle Z_T) each k_t independently from the discrete filtered mixture with probability proportional to

p(k_t = j | Z_T, y_t) ∝ p(y_t | k_t = j, Z_T) p(k_t = j | Z_T).

This is the case, for example, in the mixture models considered later, where k could represent the allocation of each observation to a mixture component. In other models, where k_t has a Markovian evolution (as in state space models), the smoothing distribution can be expressed as

∫ p(k^T | Z_T, y^T) p(Z_T | y^T) dZ_T = ∫ Π_{t=1}^T p(k_t | k_{t+1}, Z_T) p(Z_T | y^T) dZ_T.

By noting that

p(k_t | k_{t+1}, Z_T) ∝ p(k_{t+1} | k_t, Z_t) p(k_t | Z_t),

sequential backwards sampling schemes can be constructed by using the transition equation of k_t as weights.
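For the conditionally independent case, the smoothing draw per particle is a one-pass loop; the sketch below is ours and assumes each particle is stored as a dict with a field "m" for the number of components, with lik and prior as hypothetical model-specific callables.

import numpy as np

def smooth_allocations(particles, y, lik, prior, seed=0):
    """One smoothing draw of (k_1, ..., k_T) per particle, assuming the k_t
    are conditionally independent given Z_T.  lik(y, j, Z) evaluates
    p(y_t | k_t = j, Z_T) and prior(j, Z) evaluates p(k_t = j | Z_T)."""
    rng = np.random.default_rng(seed)
    draws = []
    for Z in particles:
        ks = []
        for yt in y:
            p = np.array([lik(yt, j, Z) * prior(j, Z) for j in range(Z["m"])])
            ks.append(rng.choice(Z["m"], p=p / p.sum()))
        draws.append(ks)
    return draws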
This discussion generalizes the algorithm presented in Carvalho, Johannes, Lopes and Polson (2010) for state space models, which was originally proposed as an extension of Godsill, Doucet and West (2004). It is important to point out that traditional SMC algorithms attempt to approximate p(k^t | y^t) as part of the filtering process, i.e., they sequentially approximate a posterior that grows in dimension with t; this leads, as expected and extensively reported, to unbalanced weights. PL focuses on the simpler, more stable problem of filtering p(k_t | y^t) and observes that, in most models, smoothing can effectively be performed at the end.
1.4. Marginal Likelihoods
PL can also provide estimates of the predictive distribution p(y_{n+1} | y^n) and the marginal likelihood p(y^n) for model assessment and Bayes factors. Following our resampling-sampling approach, an on-line estimate of the full marginal likelihood can be developed by sequentially approximating p(y_{n+1} | y^n). Specifically, given the current particle draws, we have

p_N(y_{n+1} | y^n) = (1/N) Σ_{i=1}^N p(y_{n+1} | Z_n^(i)) and p_N(y^n) = Π_{i=1}^n p_N(y_i | y^{i−1}).

Therefore we simplify the problem of calculating p(y^n) by estimating a sequence of small integrals. This also provides access to the sequential Bayes factors needed in many sequential decision problems.
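In code, this accumulation is one line; the sketch below (ours) assumes the per-particle predictive densities are saved just before each resampling step.

import numpy as np

def sequential_log_marginal(pred):
    """log p_N(y^n) accumulated as sum_t log( N^{-1} sum_i p(y_t | Z_{t-1}^(i)) ).
    pred[t, i] holds the predictive p(y_{t+1} | Z_t^(i)) evaluated just
    before the resampling step at time t."""
    return np.cumsum(np.log(np.asarray(pred).mean(axis=1)))

# The difference of two such sequences, one per model, gives the sequential
# log Bayes factor.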
1.5. Choice of Priors
At its simplest level, the algorithm only requires samples θ^(i) from the prior p(θ); hence the method is not directly applicable to improper priors. The natural class of priors, however, are mixture priors of the form p(θ) = ∫ p(θ | Z_0) p(Z_0) dZ_0, where the conditional p(θ | Z_0) is chosen to be naturally conjugate to the likelihood. If Z_0 is fixed, then we start all particles out with the same Z_0 value. More commonly, we will start with a sample Z_0^(i) ∼ p(Z_0) and let the algorithm resample these draws with the marginal likelihood p(y_1 | Z_0^(i)). This approach will lead to efficiency gains over blindly sampling from the prior. It also allows us to implement non-conjugate priors, together with vague "uninformative" priors such as Cauchy priors, via scale mixtures of normals.
1.6. Monte Carlo Error
Due to the sequential Monte Carlo nature of the algorithm, error bounds of the form C_T/√N are available, where N is the number of particles used. The constant C_T is model, prior and data dependent, and in general its magnitude accumulates over T; see, for example, Brockwell, Del Moral and Doucet (2010). Clearly, these propagated errors will be worse for diffuse priors and for large signal-to-noise ratios, as with many Monte Carlo approaches. To assess Monte Carlo standard errors we propose the convergence diagnostic of Carpenter, Clifford and Fearnhead (1999). By running the algorithm M independent times (each based on N particles), one can calculate Monte Carlo estimates of the mean and variance of the functional of interest. Then, by performing an analysis of variance between replicates, the Monte Carlo error or effective sample size can be assessed. One might also wish to perform this measure over different data trajectories, as some data realizations might be harder to estimate than others.
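A minimal version of this replicate diagnostic, assuming the user has already collected a posterior functional from M independent runs, is:

import numpy as np

def mc_error_diagnostic(estimates):
    """Replicate diagnostic in the spirit of Carpenter, Clifford and
    Fearnhead (1999): `estimates` is an (M, K) array of a posterior
    functional (e.g. E[theta | y^t] at K time points) from M independent
    PL runs of N particles each.  The between-replicate spread estimates
    the Monte Carlo standard error of a single run."""
    est = np.asarray(estimates, dtype=float)
    return est.mean(axis=0), est.std(axis=0, ddof=1)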
2. APPLICATIONS
2.1. Mixture Regression Models
In order to illustrate the efficiency gains available with our approach, consider the most common class of applications: mixture or latent variable models,

p(y | θ) = ∫ p(y | θ, λ) p(λ | θ) dλ,

where λ^n = (λ_1, . . . , λ_n) is a data augmentation variable. For this model, with a conditionally conjugate prior, we can find a conditional sufficient statistic, s_n, for parameter learning. Therefore, we define our essential state vector as Z_n = (λ_n, s_n). Under these assumptions, we can write

p(θ | λ^{n+1}, y^{n+1}) = p(θ | s_{n+1}) with s_{n+1} = S(s_n, λ_{n+1}, y_{n+1}),

where S(·) is a deterministic recursion relating the previous s_n to the next, conditionally on λ_{n+1} and y_{n+1}. The propagation step then becomes

λ_{n+1} ∼ p(λ_{n+1} | λ_n, θ, y_{n+1}) and s_{n+1} = S(s_n, λ_{n+1}, y_{n+1}).

More complicated mixture models appear in Section 2.3.
Example 2 (Bayesian lasso). We can develop a sequential version of the Bayesian lasso (Carlin and Polson, 1991; Hans, 2009) for a simple problem of signal detection. The model takes the form y_i = θ_i + ε_i and θ_i = τ √λ_i ε_i^θ, where ε_i ∼ N(0, 1), ε_i^θ ∼ N(0, 1), λ_i ∼ Exp(2) and τ² ∼ IG(a0, b0). This leads to independent double exponential marginal priors for each θ_i, with p(θ_i) = (2τ)^{-1} exp(−|θ_i|/τ). The natural set of latent variables is given by the augmentation variable λ_{n+1} and the conditional sufficient statistics, leading to Z_n = (λ_{n+1}, a_n, b_n). The variables λ_{n+1} are i.i.d. and so can be propagated directly with p(λ_{n+1}), whilst the conditional sufficient statistics (a_{n+1}, b_{n+1}) are deterministically determined from the values (θ_{n+1}, λ_{n+1}) and the previous (a_n, b_n). Given the particle set (Z_0, τ)^(i), the resample-propagate algorithm cycles through the following steps:
i) Resample particles with weights w_{n+1}^(i) ∝ f_N(y_{n+1}; 0, 1 + τ^{2(i)} λ_{n+1}^(i));
ii) Propagate θ_{n+1}^(i) ∼ N(m_n^(i), C_n^(i)), where C_n^{−1(i)} = 1 + τ̃^{−2(i)} λ̃_{n+1}^{−1(i)} and m_n^(i) = C_n^(i) y_{n+1};
iii) Update sufficient statistics a_{n+1}^(i) = ã_n^(i) + 1/2 and b_{n+1}^(i) = b̃_n^(i) + θ_{n+1}^{2(i)}/(2 λ̃_{n+1}^(i));
iv) Draw τ^{2(i)} ∼ IG(a_{n+1}^(i), b_{n+1}^(i)) and λ_{n+2}^(i) ∼ Exp(2);
v) Let Z_{n+1}^(i) = (λ_{n+2}, a_{n+1}, b_{n+1})^(i) and update (Z_{n+1}, τ)^(i).
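A sketch of steps i)-v), ours rather than the authors' code, is given next. It reads Exp(2) as the exponential distribution with mean 2 (rate 1/2), which yields the Laplace marginal above, and accumulates the sequential log marginal likelihood used for the Bayes factor comparison that follows.

import numpy as np
from scipy.stats import norm

def pl_lasso_signal(y, N=10_000, a0=2.0, b0=1.0, seed=0):
    """PL for y_i = theta_i + eps_i with theta_i | tau, lambda_i ~
    N(0, tau^2 lambda_i), lambda_i ~ Exp(2) (mean 2) and tau^2 ~ IG(a0, b0).
    Returns particle draws of tau^2 and the log marginal likelihood."""
    rng = np.random.default_rng(seed)
    lam = rng.exponential(2.0, N)                    # lambda_{n+1}
    a, b = np.full(N, a0), np.full(N, b0)
    tau2 = 1.0 / rng.gamma(a, 1.0 / b)               # IG(a0, b0) draws
    logml = 0.0
    for yn in y:
        # i) resample with predictive weights f_N(y; 0, 1 + tau^2 lambda)
        w = norm.pdf(yn, scale=np.sqrt(1.0 + tau2 * lam))
        logml += np.log(w.mean())                    # sequential marginal likelihood
        k = rng.choice(N, size=N, p=w / w.sum())
        lam, a, b, tau2 = lam[k], a[k], b[k], tau2[k]
        # ii) propagate theta_{n+1} from its conditional posterior
        C = 1.0 / (1.0 + 1.0 / (tau2 * lam))
        theta = rng.normal(C * yn, np.sqrt(C))
        # iii) update sufficient statistics for tau^2
        a, b = a + 0.5, b + theta**2 / (2.0 * lam)
        # iv) replenish tau^2 and draw the next lambda from its prior
        tau2 = 1.0 / rng.gamma(a, 1.0 / b)
        lam = rng.exponential(2.0, N)
    return tau2, logml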
We use our marginal likelihood (or Bayes factor) to compare the lasso with a standard normal prior. Under the normal prior we assume that τ² ∼ IG(a_1, b_1) and we match the variances of the parameter θ_i. As the lasso is a model for sparsity, we would expect the evidence for it to increase when we observe y_t = 0. We can sequentially estimate p(y_{n+1} | y^n, lasso) via p_N(y_{n+1} | y^n, lasso) = N^{−1} Σ_{i=1}^N p(y_{n+1} | (λ_n, τ)^(i)), with predictive p(y_{n+1} | λ_n, τ) ∼ N(0, τ² λ_n + 1). This leads to the sequential Bayes factor B_{n+1} = p(y^{n+1} | lasso)/p(y^{n+1} | normal).
Data were simulated based on θ = (0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1), with priors τ² ∼ IG(2, 1) for the double exponential case and τ² ∼ IG(2, 3) for the normal case, reflecting the ratio of variances between those two distributions. Results are summarized by computing the sequential Bayes factor (figure not shown). As expected, the evidence favors the lasso when we observe y = 0 and the normal model when we observe a signal y = 1.
PL can easily be extended to a lasso regression setting. Suppose that we have

y_{t+1} = X_t′ β + σ √λ_{t+1} ε_{t+1},

with θ = (β, σ²) and a conditionally conjugate prior, i.e. p(β | σ²) ∼ N(b_0, σ² B_0^{−1}) and p(σ²) ∼ IG(ν_0/2, d_0/2). We track Z_t = (s_t, λ_{t+1}), where s_t = (b_t, B_t, d_t) are conditional sufficient statistics for the parameters. The recursive definitions are

B_{t+1} = B_t + λ_{t+1}^{−1} X_t X_t′,
B_{t+1} b_{t+1} = B_t b_t + λ_{t+1}^{−1} X_t y_{t+1}, and
d_{t+1} = d_t + b_t′ B_t b_t + λ_{t+1}^{−1} y_{t+1}² − b_{t+1}′ B_{t+1} b_{t+1}.

The conditional posterior p(θ | Z_n) is then available for sampling and our approach applies.
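The deterministic part of this recursion is a few lines of linear algebra; the helper below is our sketch, and the y²/λ term in the d-update is our reading of the garbled original.

import numpy as np

def update_lasso_stats(b, B, d, x, y_new, lam_new):
    """Deterministic recursion for s_t = (b_t, B_t, d_t) in the lasso
    regression, following the displayed updates."""
    B_new = B + np.outer(x, x) / lam_new
    b_new = np.linalg.solve(B_new, B @ b + x * y_new / lam_new)
    d_new = d + b @ B @ b + y_new**2 / lam_new - b_new @ B_new @ b_new
    return b_new, B_new, d_new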
We use this example to compare the accuracy in estimating the posterior distribution of the regularization penalty p(τ | y). We use the generic resample-move batch importance sampling developed by Gilks and Berzuini (2001) and Chopin (2002). The data are cut into batches parameterized by block lengths (n, p). In the generic resample-move algorithm, we first initialize by drawing from the prior π(θ, τ) with θ = (θ_1, . . . , θ_15). The particles are then re-sampled with the likelihood from the first batch of observations (y_1, . . . , y_p). Then the algorithm proceeds sequentially.

There is no need for the λ_i augmentation variables, as this algorithm does not exploit that conditioning information. An MCMC kernel is then used to move particles; here, we use a simple random walk MCMC step. This can clearly be tuned to provide better performance, although tuning detracts from the "black-box" nature of the approach. Chopin (2002) provides recommendations for the choice of kernel. Figure 2 provides the comparison with two separate runs of the algorithm, both with N = 10,000 particles, for (n, p) = (15, 5) or (n, p) = (15, 1). The performance is similar for the case p = 1. Our efficiency gains come from the extra conditioning information available in Z_n.
2.2. Conditional Dynamic Linear Models
We now explicitly derive our PL algorithm for a class of conditional dynamic linear models (CDLMs), which extend the models considered in West and Harrison (1997). This follows Carvalho, Johannes, Lopes and Polson (2010) and covers a vast class that embeds many of the commonly used dynamic models. MCMC via forward-filtering backward-sampling or mixture Kalman filtering (MKF) (Chen and Liu, 2000) are the current methods of choice for the estimation of these models. As an approach to filtering, PL has a number of advantages. First, our algorithm is more efficient, as it is a perfectly-adapted filter (Pitt and Shephard, 1999). Second, we extend MKF by including learning about fixed parameters and smoothing for states.

The conditional DLM defined by the observation and evolution equations takes the form of a linear system conditional on an auxiliary state λ_{t+1}:

(y_{t+1} | x_{t+1}, λ_{t+1}, θ) ∼ N(F_{λ_{t+1}} x_{t+1}, V_{λ_{t+1}})
(x_{t+1} | x_t, λ_{t+1}, θ) ∼ N(G_{λ_{t+1}} x_t, W_{λ_{t+1}})
Figure 2: Bayesian lasso. Comparison to Chopin's (2002) batch sampling scheme: histograms of p(τ | y) by PL (left column) and by Chopin (2002) (right column), for (n, p) = (15, 1) (top row) and (n, p) = (15, 5) (bottom row).
with θ containing the unknown elements of the quadruple {F_λ, G_λ, V_λ, W_λ}. The marginal distributions of the observation error and state shock can be any combination of normal, scale mixture of normals, or discrete mixture of normals, depending on the specification of the distribution on the auxiliary state variable p(λ_{t+1} | θ), so that

p(y_{t+1} | x_{t+1}, θ) = ∫ f_N(y_{t+1}; F_{λ_{t+1}} x_{t+1}, V_{λ_{t+1}}) p(λ_{t+1} | θ) dλ_{t+1}.

Extensions to hidden Markov specifications, where λ_{t+1} evolves according to the transition p(λ_{t+1} | λ_t, θ), are straightforward and are discussed in the dynamic factor model example below.
In CDLMs the state filtering and parameter learning problem is equivalent to a filtering problem for the joint distribution of the respective sufficient statistics. This is a direct result of the factorization of the full joint p(x_{t+1}, θ, λ_{t+1}, s_{t+1}, s^x_{t+1} | y^{t+1}) as a sequence of conditional distributions

p(θ | s_{t+1}) p(x_{t+1} | s^x_{t+1}, λ_{t+1}) p(λ_{t+1}, s_{t+1}, s^x_{t+1} | y^{t+1}),

where s_t and s^x_t are the conditional sufficient statistics for parameters and states, respectively. These sufficient statistics satisfy the deterministic updating rules

s^x_{t+1} = K(s^x_t, θ, λ_{t+1}, y_{t+1}) and s_{t+1} = S(s_t, x_{t+1}, λ_{t+1}, y_{t+1}).

More specifically, define s^x_t = (m_t, C_t) as the Kalman filter first and second moments at time t. Conditional on θ, we then have (x_{t+1} | s^x_t, λ_{t+1}, θ) ∼ N(a_{t+1}, R_{t+1}), where a_{t+1} = G_{λ_{t+1}} m_t and R_{t+1} = G_{λ_{t+1}} C_t G′_{λ_{t+1}} + W_{λ_{t+1}}. Updating the state sufficient statistics (m_{t+1}, C_{t+1}) is achieved by m_{t+1} = G_{λ_{t+1}} m_t + A_{t+1}(y_{t+1} − e_t) and C_{t+1}^{−1} = R_{t+1}^{−1} + F′_{λ_{t+1}} V_{λ_{t+1}}^{−1} F_{λ_{t+1}}, with Kalman gain matrix A_{t+1} = R_{t+1} F′_{λ_{t+1}} Q_{t+1}^{−1}, e_t = F_{λ_{t+1}} G_{λ_{t+1}} m_t, and Q_{t+1} = F_{λ_{t+1}} R_{t+1} F′_{λ_{t+1}} + V_{λ_{t+1}}.
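The one-step map K above is standard Kalman algebra; the following sketch (ours) writes it in covariance form, which is algebraically equivalent to the information-form C-update displayed above.

import numpy as np

def kalman_step(m, C, y, F, G, V, W):
    """One-step Kalman update s^x_{t+1} = K(s^x_t, theta, lambda_{t+1}, y_{t+1})
    for the CDLM, with (F, G, V, W) evaluated at the sampled lambda_{t+1}."""
    a = G @ m                        # prior state mean a_{t+1}
    R = G @ C @ G.T + W              # prior state covariance R_{t+1}
    e = F @ a                        # predictive mean of y_{t+1}
    Q = F @ R @ F.T + V              # predictive covariance Q_{t+1}
    A = R @ F.T @ np.linalg.inv(Q)   # Kalman gain A_{t+1}
    return a + A @ (y - e), R - A @ Q @ A.T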
We are now ready to define the PL scheme for CDLMs. First, assume that the auxiliary state variable is discrete, with λ_{t+1} ∼ p(λ_{t+1} | λ_t, θ). We start, at time t, with a particle approximation for the joint posterior p(x_t, λ_t, s_t, s^x_t, θ | y^t). We then propagate to t + 1 by first re-sampling the current particles with weights proportional to the predictive p(y_{t+1} | θ, s^x_t). This provides a particle approximation to the smoothing distribution p(x_t, θ, λ_t, s_t, s^x_t | y^{t+1}). New states λ_{t+1} and x_{t+1} are then propagated through the conditional posterior distributions p(λ_{t+1} | λ_t, θ, y_{t+1}) and p(x_{t+1} | λ_{t+1}, x_t, θ, y_{t+1}). Finally, the conditional sufficient statistics are updated according to the recursions above, and new samples for θ are obtained from p(θ | s_{t+1}). Notice that in conditional dynamic linear models all the above densities are available for evaluation and sampling. For instance, the predictive is computed via

p(y_{t+1} | (λ_t, s^x_t, θ)^(i)) = Σ_{λ_{t+1}} p(y_{t+1} | λ_{t+1}, (s^x_t, θ)^(i)) p(λ_{t+1} | λ_t, θ),

where the inner predictive distribution is given by

p(y_{t+1} | λ_{t+1}, s^x_t, θ) = ∫ p(y_{t+1} | x_{t+1}, λ_{t+1}, θ) p(x_{t+1} | s^x_t, θ) dx_{t+1}.
In the general case, where the auxiliary state variable λ_t is continuous, it might not be possible to integrate out λ_{t+1} from the predictive in step 1. We extend the above scheme by adding to the current particle set a propagated particle λ_{t+1} ∼ p(λ_{t+1} | (λ_t, θ)^(i)). Both algorithms can be combined with the backwards propagation scheme of Carvalho, Johannes, Lopes and Polson (2010) to provide a full draw from the marginal posterior distribution of all the states given the data, namely the smoothing distribution p(x_1, . . . , x_T | y^T). The next two examples detail the steps of PL for a dynamic factor model with time-varying loadings and for a dynamic logit model.
Example 3 (Dynamic factor model with time-varying loadings). Consider data y_t = (y_{t1}, y_{t2})′, t = 1, . . . , T, following a dynamic factor model with time-varying loadings driven by a discrete latent state λ_t with possible values {1, 2}. Specifically, we have

(y_{t+1} | x_{t+1}, λ_{t+1}, θ) ∼ N(β_{t+1} x_{t+1}, σ² I_2)
(x_{t+1} | x_t, λ_{t+1}, θ) ∼ N(x_t, τ²)

with time-varying loadings β_{t+1} = (1, β_{λ_{t+1}})′ and initial state distribution x_0 ∼ N(m_0, C_0). The jumps in the factor loadings are driven by a Markov switching process (λ_{t+1} | λ_t, θ), whose transition matrix Π has diagonal elements Pr(λ_{t+1} = 1 | λ_t = 1, θ) = p and Pr(λ_{t+1} = 2 | λ_t = 2, θ) = q. The parameters are θ = (β_1, β_2, σ², τ², p, q)′. See Carvalho and Lopes (2007) for related Markov switching models.

Figure 3: Dynamic factor model, state filtering. Top panel: true value of λ_t (red line), Pr(λ_t = 1 | y^t) (black line) and Pr(λ_t = 1 | y^T) (blue line). Bottom panel: true value of x_t (red line), E(x_t | y^t) (black line) and E(x_t | y^T) (blue line).
We are able to marginalize over both (x_{t+1}, λ_{t+1}) by using the state sufficient statistics s^x_t = (m_t, C_t) as particles. From the Kalman filter recursions we know that (x_t | λ^t, θ, y^t) ∼ N(m_t, C_t). The mapping for state sufficient statistics, s^x_{t+1} = K(s^x_t, λ_{t+1}, θ, y_{t+1}), is given by the one-step Kalman update equations. The prior distributions are conditionally conjugate, with (β_i | σ²) ∼ N(b_{i0}, σ² B_{i0}) for i = 1, 2, σ² ∼ IG(ν_00/2, d_00/2) and τ² ∼ IG(ν_10/2, d_10/2). For the transition probabilities, we assume that p ∼ Beta(p_1, p_2) and q ∼ Beta(q_1, q_2). Assume that at time t we have particles {(x_t, θ, λ_t, s^x_t, s_t)^(i), i = 1, . . . , N} approximating p(x_t, θ, λ_t, s^x_t, s_t | y^t). The PL algorithm can be described through the following steps:
(i) Re-sampling: draw an index k^i ∼ Multinomial(w_t^(1), . . . , w_t^(N)) with weights w_t^(i) ∝ p(y_{t+1} | (s^x_t, λ_t, θ)^(i)), where

p(y_{t+1} | s^x_t, λ_t, θ) = Σ_{λ_{t+1}=1}^2 f_N(y_{t+1}; β_{t+1} m_t, (C_t + τ²) β_{t+1} β′_{t+1} + σ² I_2) p(λ_{t+1} | λ_t, θ);

(ii) Propagating state λ: draw λ_{t+1}^(i) from p(λ_{t+1} | (s^x_t, λ_t, θ)^{(k^i)}, y_{t+1}), with density

p(λ_{t+1} | s^x_t, λ_t, θ, y_{t+1}) ∝ f_N(y_{t+1}; β_{t+1} m_t, (C_t + τ²) β_{t+1} β′_{t+1} + σ² I_2) p(λ_{t+1} | λ_t, θ);

(iii) Propagating state x: draw x_{t+1}^(i) from p(x_{t+1} | λ_{t+1}^(i), (s^x_t, θ)^{(k^i)}, y_{t+1});

(iv) Propagating state sufficient statistics, s^x_{t+1}: the Kalman filter recursions yield m_{t+1} = m_t + A_{t+1}(y_{t+1} − β_{t+1} m_t) and C_{t+1} = C_t + τ² − A_{t+1} Q_{t+1} A′_{t+1}, where Q_{t+1} = (C_t + τ²) β_{t+1} β′_{t+1} + σ² I_2 and A_{t+1} = (C_t + τ²) β′_{t+1} Q_{t+1}^{−1};

(v) Propagating parameter sufficient statistics, s_{t+1}: the posterior p(θ | s_{t+1}) is decomposed into (β_i | σ², s_{t+1}) ∼ N(b_{i,t+1}, σ² B_{i,t+1}) for i = 1, 2, (σ² | s_{t+1}) ∼ IG(ν_{0,t+1}/2, d_{0,t+1}/2), (τ² | s_{t+1}) ∼ IG(ν_{1,t+1}/2, d_{1,t+1}/2), (p | s_{t+1}) ∼ Beta(p_{1,t+1}, p_{2,t+1}) and (q | s_{t+1}) ∼ Beta(q_{1,t+1}, q_{2,t+1}), with B_{i,t+1}^{−1} = B_{it}^{−1} + x_{t+1}² I(λ_{t+1} = i), b_{i,t+1} = B_{i,t+1}(B_{it}^{−1} b_{it} + x_{t+1} y_{t+1,2} I(λ_{t+1} = i)) and ν_{i,t+1} = ν_{i,t} + 1 for i = 1, 2; d_{1,t+1} = d_{1t} + (x_{t+1} − x_t)²; p_{1,t+1} = p_{1t} + I(λ_t = 1, λ_{t+1} = 1), p_{2,t+1} = p_{2t} + I(λ_t = 1, λ_{t+1} = 2), q_{1,t+1} = q_{1t} + I(λ_t = 2, λ_{t+1} = 2), q_{2,t+1} = q_{2t} + I(λ_t = 2, λ_{t+1} = 1); and d_{0,t+1} = d_{0t} + Σ_{j=1}^2 [(y_{t+1,2} − b_{j,t+1} x_{t+1}) y_{t+1,2} + b_{j,t+1} B_{j0}^{−1} (y_{t+1,1} − x_{t+1})²] I(λ_{t+1} = j).
Figures 3 and 4 illustrate the performance of the PL algorithm. The first panel of Figure 3 displays the true underlying λ process along with filtered and smoothed estimates, whereas the second panel presents the same information for the common factor. Figure 4 provides the sequential parameter learning plots.
Example 4 (Dynamic logit models). Extensions of PL to non-Gaussian, nonlinear state space models appear in Carvalho, Lopes and Polson (2010) and Carvalho, Johannes, Lopes and Polson (2010). We illustrate some of these ideas in the context of a dynamic multinomial logit model with the following structure:

Pr(y_{t+1} = 1 | β_{t+1}) = (1 + exp{−β_{t+1} x_{t+1}})^{−1} and β_{t+1} = φ β_t + σ_x ε^β_{t+1},

where β_0 ∼ N(m_0, C_0) and θ = (φ, σ_x²). For simplicity assume that x_t is scalar. It is common practice in limited dependent variable models to introduce a latent continuous variable z_{t+1} to link y_{t+1} and x_{t+1} (see Scott, 2004, Kohn, 1997, and Frühwirth-Schnatter and Frühwirth, 2007). More precisely, conditionally on z_{t+1}, with y_{t+1} = I(z_{t+1} ≥ 0), the previous model can be rewritten as

z_{t+1} = β_{t+1} x_{t+1} + ε^z_{t+1} and β_{t+1} = φ β_t + ε^β_{t+1},

where ε^β_{t+1} ∼ N(0, σ_x²) and ε^z_{t+1} follows an extreme value distribution of type 1, i.e. ε^z_{t+1} ∼ −log E(1), with E(1) denoting an exponential with mean one.

Conditional normality can be achieved by rewriting the extreme value distribution as a mixture of normals. Frühwirth-Schnatter and Frühwirth (2007) suggest a 10-component mixture of normals with weight, mean and variance for component j given by w_j, μ_j and s_j², for j = 1, . . . , 10. Hence, conditional on the latent vector (z_{t+1}, λ_{t+1}), the previous representation leads to the following Gaussian dynamic linear model:

z_{t+1} = β_{t+1} x_{t+1} + ε_{t+1} and β_{t+1} = φ β_t + ε^β_{t+1},

where ε_{t+1} ∼ N(μ_{λ_{t+1}}, s²_{λ_{t+1}}).

Figure 4: Dynamic factor model, parameter learning. Sequential posterior medians (black lines) and posterior 95% credibility intervals (blue lines) for the model parameters β_1, β_2, σ², τ², p and q; true values are the red lines.

Given λ_{t+1} we have conditional state sufficient statistics (for β_t), and the Kalman filter recursions still hold, with s^β_{t+1} = K(s^β_t, z_{t+1}, λ_{t+1}, θ, y_{t+1}); similarly for the parameter sufficient statistics s_t, which now involve λ_{t+1}. Moreover, as λ_{t+1} is discrete, it is straightforward to see that

Pr(y_{t+1} = 1 | s^β_t, θ, λ_{t+1}) = 1 − Φ(−φ m_t x_{t+1} ((φ² C_t + σ_x²) x_{t+1}² + s²_{λ_{t+1}})^{−1/2}),

leading to the predictive

Pr(y_{t+1} = 1 | s^β_t, θ) = Σ_{j=1}^{10} w_j Pr(y_{t+1} = 1 | s^β_t, θ, λ_{t+1} = j),

which plays an important role in the re-sampling step. The propagation step requires one to be able to sample λ_{t+1} from p(λ_{t+1} | s^β_t, θ, y_{t+1}), z_{t+1} from p(z_{t+1} | s^β_t, θ, λ_{t+1}, y_{t+1}) and β_{t+1} from p(β_{t+1} | s^β_t, θ, λ_{t+1}, z_{t+1}, y_{t+1}). The final step of PL is the deterministic updating of the conditional sufficient statistics.
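The resampling weight is just the displayed mixture; a small sketch (ours) follows. It mirrors the displayed formula, so the component means μ_j do not appear; w and s2 are the 10-component weights and variances of the approximation.

import numpy as np
from scipy.stats import norm

def logit_pred_prob(m, C, x_new, phi, sig2x, w, s2):
    """Pr(y_{t+1} = 1 | s^beta_t, theta) for the dynamic logit, mixing the
    per-component terms over the 10-component normal approximation."""
    var = (phi**2 * C + sig2x) * x_new**2 + np.asarray(s2)
    probs = 1.0 - norm.cdf(-phi * m * x_new / np.sqrt(var))
    return float(np.sum(np.asarray(w) * probs))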
2.3. Nonparametric Mixture Models
We now develop PL for discrete nonparametric mixture models and Bayesian nonparametric density estimation. Details appear in Carvalho, Lopes, Polson and Taddy (2010). Our essential state vector now depends on the (random) number of unique mixture components. By definition, θ⋆_t = {θ⋆_1, . . . , θ⋆_{m_t}} is the set of m_t distinct components in θ^t, k_t is the associated latent allocation such that θ_t = θ⋆_{k_t}, n_t = (n_{t,1}, . . . , n_{t,m_t}) is the number of observations allocated to each unique component, and s_t = (s_{t,1}, . . . , s_{t,m_t}) is the set of conditional sufficient statistics for each θ⋆_j. The state vector to be tracked by PL can then be defined as Z_t = (k_t, m_t, s_t, n_t). Note that PL will not directly provide the full joint posterior of the allocation vector k^t = (k_1, . . . , k_t); if this is required, either a particle smoothing or an MCMC step is needed.
For infinite mixture models, particle learning proceeds through the two familiar steps: resample, with weights proportional to p(y_{t+1} | s_t, n_t, m_t), and propagate, k_{t+1} ∼ p(k_{t+1} | s_t, n_t, m_t, y_{t+1}). The filtered posterior for (s_T, n_T, m_T) can be used for inference via the posterior predictive density p(y | s_T, n_T, m_T), which is a Rao-Blackwellized version of E[f(y; G) | y^T] for many nonparametric priors (including the DP). Alternatively, since p(G | y^T) = ∫ p(G | s_T, n_T, m_T) dp(s_T, n_T, m_T | y^T), the filtered posterior provides a basis for inference about the full random mixing distribution.
The DP characterizes a prior over probability distributions and is most intuitively represented through its constructive definition (Perman, Pitman and Yor, 1992): a random distribution G generated from DP(α, G_0(ψ)) is almost surely of the form

dG(·) = Σ_{l=1}^∞ p_l δ_{ϑ_l}(·), with ϑ_l ∼ iid G_0(ϑ_l; ψ) and p_l = (1 − Σ_{j=1}^{l−1} p_j) v_l,

where v_l ∼ iid Beta(1, α) for l = 1, 2, . . ., G_0(ϑ; ψ) is the centering distribution function, parametrized by ψ, and the sequences {ϑ_l : l = 1, 2, . . .} and {v_l : l = 1, 2, . . .} are independent. The discreteness of DP realizations is explicit in this definition. The DP mixture model is then f(y_r; G) = ∫ k(y_r; θ) dG(θ) for r = 1, . . . , t, where G ∼ DP(α, G_0). Alternatively, in terms of latent variables, the hierarchical model is, for r = 1, . . . , t, y_r ∼ ind k(y_r; θ_r), θ_r ∼ iid G, and G ∼ DP(α, G_0).
Two properties of the DP are particularly important for sequential inference. First, the DP is a conditionally conjugate prior: given θ^t (or, equivalently, θ⋆_t and n_t), the posterior distribution for G is a DP(α + t, G^t_0), where

dG^t_0(θ; θ⋆_t, n_t) = (α/(α + t)) dG_0(θ) + Σ_{j=1}^{m_t} (n_{t,j}/(α + t)) δ_{[θ = θ⋆_j]}.

Second, this Pólya urn density dG^t_0 is also E[dG | θ^t] = ∫ dG(θ) dp(G | θ⋆_t, n_t), and provides a finite predictive probability function for our mixture model: p(y_{t+1} | θ^t) = ∫ k(y_{t+1}; θ) dG^t_0(θ).
A Rao-Blackwellized version of the standard Pólya urn mixture serves as a density estimator:

p(E[f(y; G)] | y^t) = ∫ p(E[f(y; G)] | s_t, n_t, m_t) dp(s_t, n_t, m_t | y^t),

with p(E[f(y; G)] | s_t, n_t, m_t) = ∫ p(y | θ⋆_t, n_t) dp(θ⋆_t | s_t, n_t, m_t). If either α or ψ is assigned a hyperprior, we include it in Z_t and sample off-line for each particle, conditional on (n_t, s_t, m_t)^(i), at each iteration. This is of particular importance for understanding the generality of PL.
PL for DP mixture models

Step 1. (Resample) Generate an index ζ ∼ Multinomial(ω, N) where

ω^(i) = p(y_{t+1} | (s_t, n_t, m_t)^(i)) / Σ_{i=1}^N p(y_{t+1} | (s_t, n_t, m_t)^(i));

Step 2. (Propagate)
Step 2.1. k_{t+1} ∼ p(k_{t+1} | (s_t, n_t, m_t)^{ζ(i)}, y_{t+1});
Step 2.2. s_{t+1} = S(s_t, k_{t+1}, y_{t+1});
Step 2.3. n_{t+1}: n_{t+1,j} = n_{t,j} for j ≠ k_{t+1}; if k_{t+1} ≤ m_t, then n_{t+1,k_{t+1}} = n_{t,k_{t+1}} + 1 and m_{t+1} = m_t; if k_{t+1} > m_t, then n_{t+1,m_{t+1}} = 1 and m_{t+1} = m_t + 1;

Step 3. (Estimation)

p_N(E[f(y; G)] | y^t) = (1/N) Σ_{i=1}^N p(y | (s_t, n_t, m_t)^(i)).
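One sweep of this scheme can be written generically as below; the sketch is ours, storing each particle as a dict with fields "n", "s" and "m", and taking pred, comp_pred and update_stats as hypothetical model-specific callables for the full predictive, the per-component Pólya urn terms (index j = m meaning a new component), and the map S.

import numpy as np

def pl_dp_step(particles, y_new, pred, comp_pred, update_stats, seed=0):
    """One resample/propagate sweep of PL for a DP mixture."""
    rng = np.random.default_rng(seed)
    N = len(particles)
    # Step 1: resample particles with weights p(y_{t+1} | s_t, n_t, m_t)
    w = np.array([pred(p, y_new) for p in particles])
    resampled = [particles[k] for k in rng.choice(N, size=N, p=w / w.sum())]
    out = []
    for p in resampled:
        # Step 2: allocate y_{t+1} to an existing or a new component
        probs = np.array([comp_pred(p, j, y_new) for j in range(p["m"] + 1)])
        j = rng.choice(p["m"] + 1, p=probs / probs.sum())
        out.append(update_stats(p, j, y_new))
    return out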
Example 5 (The DP mixture of multivariate normals). The d-dimensional DP multivariate normal mixture (DP-MVN) model has density function

f(y_t; G) = ∫ N(y_t | μ_t, Σ_t) dG(μ_t, Σ_t), with G ∼ DP(α, G_0(μ, Σ))

and conjugate centering distribution G_0 = N(μ; λ, Σ/κ) W(Σ^{−1}; ν, Ω), where W(Σ^{−1}; ν, Ω) denotes a Wishart distribution such that E[Σ^{−1}] = νΩ^{−1} and E[Σ] = (ν − (d + 1)/2)^{−1} Ω. Conditional sufficient statistics for each unique mixture component, s_{t,j}, are

ȳ_{t,j} = Σ_{r: k_r = j} y_r / n_{t,j} and S_{t,j} = Σ_{r: k_r = j} y_r y_r′ − n_{t,j} ȳ_{t,j} ȳ′_{t,j}.

The initial sufficient statistics are deterministically n_1 = 1 and s_1 = {y_1, 0}, such that the algorithm is populated with N identical particles. Conditional on existing particles {(n_t, s_t)^(i)}_{i=1}^N, uncertainty is updated through the familiar resample/propagate approach. The resampling step is performed by an application of the predictive probability function

p(y_{t+1} | s_t, n_t, m_t) = (α/(α + t)) St(y_{t+1}; a_0, B_0, c_0) + Σ_{j=1}^{m_t} (n_{t,j}/(α + t)) St(y_{t+1}; a_{t,j}, B_{t,j}, c_{t,j}),

with hyperparameters a_0 = λ, B_0 = (2(κ + 1)/(κ c_0)) Ω, c_0 = 2ν − d + 1, and

a_{t,j} = (κλ + n_{t,j} ȳ_{t,j})/(κ + n_{t,j}), c_{t,j} = 2ν + n_{t,j} − d + 1,
B_{t,j} = (2(κ + n_{t,j} + 1)/((κ + n_{t,j}) c_{t,j})) (Ω + 0.5 D_{t,j}), and
D_{t,j} = S_{t,j} + (κ n_{t,j}/(κ + n_{t,j})) (λ − ȳ_{t,j})(λ − ȳ_{t,j})′.
In the propagation step, we then sample the component state k_{t+1} such that, for j = 1, . . . , m_t,

p(k_{t+1} = j) ∝ (n_{t,j}/(α + t)) St(y_{t+1}; a_{t,j}, B_{t,j}, c_{t,j}) and
p(k_{t+1} = m_t + 1) ∝ (α/(α + t)) St(y_{t+1}; a_0, B_0, c_0).

If k_{t+1} = m_t + 1, the new sufficient statistics are defined by m_{t+1} = m_t + 1 and s_{t+1,m_{t+1}} = [y_{t+1}, 0]. If k_{t+1} = j, then n_{t+1,j} = n_{t,j} + 1 and we update s_{t+1,j} such that ȳ_{t+1,j} = (n_{t,j} ȳ_{t,j} + y_{t+1})/n_{t+1,j} and S_{t+1,j} = S_{t,j} + y_{t+1} y′_{t+1} + n_{t,j} ȳ_{t,j} ȳ′_{t,j} − n_{t+1,j} ȳ_{t+1,j} ȳ′_{t+1,j}. The remaining sufficient statistics are the same as at time t.
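The propagation of the sufficient statistics is a direct transcription of these formulas; the helper below is our sketch, with components held in parallel Python lists.

import numpy as np

def propagate_dpmvn(ybar, S, n, j_new, y):
    """Sufficient-statistic propagation for the DP-MVN example: the lists
    ybar, S, n hold (ybar_{t,j}, S_{t,j}, n_{t,j}) per component; j_new is
    the sampled allocation k_{t+1} (len(n) means a new component)."""
    d = len(y)
    if j_new == len(n):                      # open a new component
        ybar.append(np.array(y, dtype=float))
        S.append(np.zeros((d, d)))
        n.append(1)
    else:                                    # update an existing component
        j = j_new
        ybar_new = (n[j] * ybar[j] + y) / (n[j] + 1)
        S[j] = S[j] + np.outer(y, y) + n[j] * np.outer(ybar[j], ybar[j]) \
               - (n[j] + 1) * np.outer(ybar_new, ybar_new)
        ybar[j] = ybar_new
        n[j] += 1
    return ybar, S, n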
We can also assign hyperpriors to the parameters of G_0. In this case, a parameter learning step for each particle is added to the algorithm. Assuming a W(γ_Ω, Ψ_Ω^{−1}) prior for Ω and a N(γ_λ, Ψ_λ) prior for λ, the sample at time t is augmented with draws of the auxiliary variables {μ⋆_j, Σ⋆_j}, for j = 1, . . . , m_t, from their posterior full conditionals,

p(μ⋆_j, Σ⋆_j | s_t, n_t) ≡ N(μ⋆_j; a_{t,j}, Σ⋆_j/(κ + n_{t,j})) W(Σ⋆_j^{−1}; ν + n_{t,j}, Ω + D_{t,j}).

The parameter updates are then

λ ∼ N(R(γ_λ Ψ_λ + κ Σ_{j=1}^{m_t} Σ⋆_j^{−1} μ⋆_j), R) and Ω ∼ W(γ_Ω + m_t ν, R^{−1}),

where R^{−1} = Σ_{j=1}^{m_t} Σ⋆_j^{−1} + Ψ_Ω^{−1}. Similarly, if α is assigned the usual gamma hyperprior, it can be updated for each particle using the auxiliary variable method of Escobar and West (1995).
To illustrate the PL algorithm, a dataset was simulated with dimension d = 2 and sample size T = 1000. The bivariate vector y_t was generated from a N(μ_t, AR(0.9)) density, where μ_t ∼ G_μ and AR(0.9) denotes the correlation matrix implied by an autoregressive process of lag one and correlation 0.9. The mean distribution, G_μ, is the realization of a DP(4, N(0, 4I)) process. Thus the simulated data are clustered around a set of distinct means and highly correlated within each cluster. The model, with parameters fixed at α = 2, λ = 0, κ = 0.25, ν = 4, and Ω = (ν − 1.5)I, was fit to these data. Figure 5 shows the data and bivariate density estimates, which are the mean Rao-Blackwellized posterior predictive p(y | s_T, n_T, m_T), hence the posterior expectation of f(y; G). Marginal estimates are just the appropriate marginal densities derived from mixtures of Student's t distributions.
Figure 5: DP mixture of multivariate normals. Data and density estimates for the PL fit with 1000 particles (left) and for each of ten PL fits with 500 particles (right), applied to a random ordering of the 1000 observations of bivariate data.
3. OTHER APPLICATIONS
Successful implementations of PL (and hybrid versions of PL) have appeared over
the last couple of years. Taddy, Gramacy and Polson (2010) show that PL is the
best alternative to perform online posterior filtering of tree-states in dynamic regression tree models, while Gramacy and Polson (2010) use PL for online updating
of Gaussian process models for regression and classification. Shi and Dunson (2009)
adopt a PL-flavored scheme for stochastic variable selection and model search in linear regression and probit models, while Mukherjee and West (2009) focus on model
comparison for applications in cellular dynamics in systems biology.
With a more time-series flavor, Rios and Lopes (2009), for example, propose a hybrid LW-Storvik filter for the Markov switching stochastic volatility model that outperforms the filter of Carvalho and Lopes (2007). Lund and Lopes (2010) sequentially estimate a regime switching macro-finance model for the postwar US term structure of interest rates, while Prado and Lopes (2010) adapt PL to study state-space autoregressive models with structured priors. Chen, Petralia and Lopes (2010) propose a hybrid PL-LW sequential MC algorithm that fully estimates non-linear, non-normal dynamic stochastic general equilibrium models, with a particular application to a neoclassical growth model. Additionally, Dukić, Lopes and Polson (2010) use PL to track flu epidemics using Google Trends data, while Lopes and Polson (2010) use PL to estimate volatility and examine volatility dynamics for financial time series, such as the S&P500 and NDX100 indices, during the early part of the credit crisis.
4. FINAL THOUGHTS
4.1. Historical Note
Since the seminal paper by Gordon, Salmond and Smith (1993), and subsequently Kong, Liu and Wong (1994), Liu and Chen (1998) and Doucet, Godsill and Andrieu (2000), to name but a few, the sequential Monte Carlo literature has grown continuously. The first generation of SMC methods is well summarized in the compendium edited by Doucet, de Freitas and Gordon (2001), where several strategies for improving existing particle filters are discussed, along with about a dozen applications in various areas (see also Ristic, Arulampalam and Gordon, 2004, and the 2002 special issue of IEEE Transactions on Signal Processing on sequential Monte Carlo methods).

The vast majority of the literature defining the first generation focuses on sample-resample schemes, but it is the resample-sample particle filter introduced by Pitt and Shephard (1999) that is the key initiator of the second stage of development in the SMC literature. The APF with parameter learning was introduced by Liu and West (2001) and builds on earlier work by West (1992, 1993), who published the first adaptive importance sampling scheme using mixtures (via kernel shrinkage) in sequential models. Our PL approach is a direct extension of Pitt and Shephard's (1999) APF. Carvalho, Johannes, Lopes and Polson (2010) show that APF and PL, both resample-sample schemes, outperform the standard sample-resample filters.
The second wave in the SMC literature occurred over the last decade, with recent advances that focus on, amongst other things, (i) parameter learning; (ii) similarities and differences between propagate-resample and resample-propagate filters; (iii) computationally viable particle smoothers; and (iv) the merging of SMC and MCMC tools towards more efficient sequential schemes. See Cappé, Godsill and Moulines (2007) and Doucet and Johansen (2008) for thorough reviews; see also Prado and West (2010, chapter 6) and Lopes and Tsay (2010). For example, important contributions to parameter learning, either for online or batch sampling, were made by Liu and West (2001), as mentioned above, Pitt (2002), Storvik (2002), Fearnhead (2002), Polson, Stroud and Müller (2008), Doucet and Tadić (2003), Poyiadjis, Doucet and Singh (2005) and Olsson, Cappé, Douc and Moulines (2006), to name but a few, while SIS and APF similarities are the focus of Doucet and Johansen (2008) and Douc, Moulines and Olsson (2009).
4.2. PL and the Future
Particle Learning provides a simulation-based approach to sequential Bayesian inference. It combines the data augmentation features that are prevalent in MCMC with the resample-propagate auxiliary particle filter of Pitt and Shephard (1999). In many ways there is a parallel between the proliferation of data augmentation in Gibbs sampling and its potential role in expanding the PL framework. This combination of factors provides new insights on sequential learning for static parameters. In the case of trees, for example, the essential state vector Z_t is itself a modeling tool, suggesting many different ways of searching model space and specifying prior distributions in complicated spaces.

This points to fruitful directions for future modeling. There are many open areas for future implementation of the framework:
• Nonlinear and nonnormal panels (econometrics);
• REML (econometrics);
• Structural equation models;
• Dynamic factor models;
• Multivariate extensions (challenging);
• Space-time models (ensemble Kalman filters).
Finally, we emphasize that there are differences in the way Monte Carlo errors accumulate in MCMC and PL, and this is clearly another fruitful area for future research, both from a theoretical and an empirical perspective. As with MCMC methods, the usual word of caution about relying heavily on asymptotic central limit theorem results carries over to the PL framework.
REFERENCES
Cappé, O., Godsill, S. and Moulines, E. (2007). An overview of existing methods and
recent advances in sequential Monte Carlo. Proceedings of the IEEE 95, 899–924.
Carlin, B. P. and Polson, N. G. (1991). Inference for nonconjugate Bayesian models
using the Gibbs sampler. Can. J. Statist. 19, 399–405.
Carpenter, J., Clifford, P. and Fearnhead, P. (1999). An improved particle filter for
non-linear problems. IEE Proc. Radar, Sonar and Navigation 146, 2–7.
Carvalho, C. M., Johannes, M., Lopes, H. F. and Polson, N. G. (2010). Particle learning
and smoothing. Statist. Science (to appear).
Carvalho, C. M. and Lopes, H. F. (2007). Simulation-based sequential analysis of Markov
switching stochastic volatility models. Comput. Statist. Data Anal. 51, 4526–4542.
Carvalho, C. M., Lopes, H. F. and Polson, N. G. (2010). Particle learning for generalized dynamic
conditionally linear models. Working Paper. University of Chicago Booth School of
Business.
Carvalho, C. M., Lopes, H. F., Polson, N. G. and Taddy, M. (2009). Particle learning for
general mixtures. Working Paper. University of Chicago Booth School of Business.
Chen, H., Petralia, F. and Lopes, H. F. (2010). Sequential Monte Carlo estimation of
DSGE models. Working Paper. The University of Chicago Booth School of Business.
Chen, R. and Liu, J. S. (2000). Mixture Kalman filters. J. Roy. Statist. Soc. B 62,
493–508.
Chopin, N. (2002). A sequential particle filter method for static models. Biometrika 89,
539–551.
Douc, R., Moulines, E. and Olsson, J. (2009). Optimality of the auxiliary particle filter.
Probab. Math. Statist. 29, 1–28.
Doucet, A., de Freitas, N. and Gordon, N. (2001). Sequential Monte Carlo Methods in
Practice. Berlin: Springer.
Doucet, A., Godsill, S. and Andrieu, C. (2000). On sequential Monte Carlo sampling
methods for Bayesian filtering. Statist. Computing 10, 197–208.
Doucet, A. and Johansen, A. (2008). A note on auxiliary particle filters. Statist. Probab.
Lett. 78, 1498–1504.
Doucet, A. and Tadić, V. B. (2003). Parameter estimation in general state-space models
using particle methods. Ann. Inst. Statist. Math. 55, 409–422.
Dukić, V., Lopes, H. F. and Polson, N. G. (2010). Tracking flu epidemics using Google
trends and particle learning. Working Paper. The University of Chicago Booth School
of Business.
Escobar, M. and West, M. (1995). Bayesian density estimation and inference using
mixtures. J. Amer. Statist. Assoc. 90, 577–588.
Fearnhead, P. (2002). Markov chain Monte Carlo, sufficient statistics, and particle filters.
J. Comp. Graphical Statist. 11, 848–862.
Gamerman, D. and Lopes, H. F. (2006). Markov Chain Monte Carlo: Stochastic Simulation for
Bayesian Inference (2nd edition). Boca Raton: Chapman and Hall/CRC.
Gilks, W. and Berzuini, C. (2001). Following a moving target: Monte Carlo inference for
dynamic Bayesian models. J. Roy. Statist. Soc. B 63, 127–146.
Godsill, S. J., Doucet, A. and West, M. (2004). Monte Carlo smoothing for nonlinear time
series. J. Amer. Statist. Assoc. 99, 156–168.
Gordon, N., Salmond, D. and Smith, A. F. M. (1993). Novel approach to
nonlinear/non-Gaussian Bayesian state estimation. IEE Proceedings F, Radar and
Signal Processing 140, 107–113.
Gramacy, R. and Polson, N. G. (2010). Particle learning of Gaussian process models for
sequential design and optimization. Working Paper. The University of Chicago Booth
School of Business.
Hans, C. (2009). Bayesian lasso regression. Biometrika 96, 835–845.
Kitagawa, G. (1996). Monte Carlo filter and smoother for non-Gaussian non-linear state
space models. J. Comp. Graphical Statist. 5, 1–25.
Kong, A., Liu, J. S. and Wong, W. (1994). Sequential imputation and Bayesian missing
data problems. J. Amer. Statist. Assoc. 89, 590–599.
Liu, J. and Chen, R. (1998). Sequential Monte Carlo methods for dynamic systems.
J. Amer. Statist. Assoc. 93, 1032–1044.
Liu, J. and West, M. (2001). Combined parameters and state estimation in simulation
based filtering. In Sequential Monte Carlo Methods in Practice (A. Doucet, N. de
Freitas and N. Gordon, eds.), 197–223. Berlin: Springer.
Lopes, H. F. (2000). Bayesian analysis in latent factor and longitudinal models.
Unpublished Ph.D. Thesis. Institute of Statistics and Decision Sciences, Duke
University, USA.
Lopes, H. F. and Polson, N. G. (2010). Extracting SP500 and NASDAQ volatility: The
credit crisis of 2007-2008. In Handbook of Applied Bayesian Analysis (A. O'Hagan and
M. West, eds.), 319–342. Oxford: University Press.
Lopes, H. F. and Tsay, R. S. (2010). Bayesian analysis of financial time series via particle
filters. J. Forecast. (to appear).
Lund, B. and Lopes, H. F. (2010). Learning in a regime switching macro-finance model for
the term structure. Working Paper. The University of Chicago Booth School of
Business.
Mukherjee, C. and West, M. (2009). Sequential Monte Carlo in model comparison:
example in cellular dynamics in systems biology. Working paper. Department of
Statistical Science, Duke University.
Olsson, J., Cappé, O., Douc, R. and Moulines, E. (2008). Sequential Monte Carlo
smoothing with application to parameter estimation in non-linear state space models.
Bernoulli 14, 155–179.
Perman, M., Pitman, J. and Yor, M. (1992). Size-biased sampling of Poisson point
processes and excursions. Probab. Theory Related Fields 92, 21–39.
Pitt, M. K. (2002). Smooth particle filters for likelihood evaluation and maximisation.
Working paper. Department of Economics, University of Warwick, UK.
Pitt, M. K. and Shephard, N. (1999). Filtering via simulation: Auxiliary particle filters.
J. Amer. Statist. Assoc. 94, 590–599.
Polson, N. G., Stroud, J. R. and Müller, P. (2008). Practical filtering with sequential
parameter learning. J. Roy. Statist. Soc. B 70, 413–428.
Poyiadjis, G., Doucet, A. and Singh, S. S. (2005). Particle methods for optimal filter
derivative: Application to parameter estimation. In Proceedings of the IEEE
International Conference on Acoustics, Speech, and Signal Processing 5, 925–928.
Prado, R. and Lopes, H. F. (2010). Sequential parameter learning and filtering in
structured autoregressive models. Working Paper. The University of Chicago Booth
School of Business.
Prado, R. and West, M. (2010). Time Series: Modelling, Computation and Inference. Boca
Raton: Chapman & Hall/CRC Press.
Rios, M. P. and Lopes, H. F. (2009). Sequential parameter estimation in stochastic
volatility models. Working Paper. The University of Chicago Booth School of Business.
Ristic, B., Arulampalam, S. and Gordon, N. (2004). Beyond the Kalman Filter: Particle
Filters for Tracking Applications. Artech House Radar Library.
Shi, M. and Dunson, D. (2009). Bayesian variable selection via particle stochastic
search. Working paper. Department of Statistical Science, Duke University.
Storvik, G. (2002). Particle filters in state space models with the presence of unknown
static parameters. IEEE Trans. Signal Process. 50, 281–289.
Taddy, M., Gramacy, R. and Polson, N. G. (2010). Dynamic trees for learning and design.
Working Paper. The University of Chicago Booth School of Business.
West, M. (1992). Modelling with mixtures (with discussion). In Bayesian Statistics 4
(J. M. Bernardo, J. O. Berger, A. P. Dawid and A. F. M. Smith, eds.), 503–524.
Oxford: University Press.
West, M. (1993). Mixture models, Monte Carlo, Bayesian updating and dynamic models.
Computing Science and Statistics 24, 325–333.
West, M. and Harrison, J. (1997). Bayesian Forecasting and Dynamic Models (2nd
Edition). Berlin: Springer.