1 Introduction
We consider a continuous time Markov switching model (MSM). The observation process
can be seen as a diffusion where the drift and the volatility coefficients are modeled as
continuous time, finite state Markov processes with a common state process. This model
includes the hidden Markov model (HMM) where the volatility is constant. Continuous
time MSMs (e.g. Buffington and Elliott 2002; Guo 2001) and HMMs (e.g. Liechty and
Roberts 2001; Sass and Haussmann 2004) are widely used in finance, but are also applied
in many other areas, for example in biophysics for the problem of restoration of ion
channels (e.g. Ball et al. 1999).
It is our aim to estimate the states for drift and for volatility as well as the generator matrix of the underlying Markov process based on discretely observed data. This is a common situation, occurring e.g. in many inference problems in finance, where it is more convenient to use continuous time models, since these allow the derivation of closed form solutions while stock prices are observed only in discrete time.
∗ RICAM, Austrian Academy of Sciences, Linz, Austria, mailto:[email protected]
† Department of Mathematics, University of Kaiserslautern, Kaiserslautern, Germany, mailto:[email protected]
We assume that the number of states is given, and refer to (Otranto and Gallo 2002)
and (Frühwirth-Schnatter 2001) for the matter of model selection. For the problem of
label switching, which arises if the different states are not well separated, methods like
identifiability constraints, relabelling, or random permutation (cf. Jasra et al. 2005) can be incorporated into the constructed sampler.
In discrete time, the expectation maximization (EM) algorithm (Dempster et al.
1977) can be used to find maximum likelihood estimates for the parameters, the drift
and covariance coefficients as well as the generator matrix governing the switching of the
states, see e.g. (Elliott et al. 1997) or (Engel and Hamilton 1990). For low rates (i.e. rare jumps) and low volatility, the EM algorithm provides very accurate results. However, for high rates the law of the discretized model (using a discrete time Markov chain) differs considerably from the law of the continuous time model, and the EM algorithm yields poor results. In continuous time, no finite dimensional filter for estimating the
underlying state of the MSM is known, hence it is not possible to compute the maximum
likelihood parameter estimate via an EM algorithm, see (Elliott et al. 1995) for details.
Alternatively, one might use a moment based method, see e.g. (Elliott et al. 2008). This can yield good estimates, but it requires many observations.
Taking a Bayesian approach, for parameter estimation in a setting with continuous
time and a latent state process, two methods based on Markov chain Monte Carlo
(MCMC) seem to be reasonable: First, using time discretization (replacing the generator
matrix with a transition matrix) and augmenting the unknowns with the (then discrete)
state process; or, second, working in continuous time and augmenting with the full state
process (compare Ball et al. 1999; Liechty and Roberts 2001). Both methods are based
on a multiple-block Metropolis-Hastings sampler. Results for the continuous time MSM
in (Hahn et al. 2007) indicate that for data with high rates, considerable noise, and based
on not too many observations, MCMC estimation in continuous time is not sufficiently
stable and fast. Then again, using time discretization and discrete time MCMC methods
turns out to work more robustly and faster, but introduces a discretization error.
The idea of the algorithm presented here is to combine the useful aspects of these two MCMC approaches. On the one hand, the algorithm is inspired by the discretization, where filtering for the state process is possible; on the other hand, we retain attractive features of the continuous time method, like exact estimation (i.e. no discretization error) and direct estimation of the generator matrix rather than the transition matrix (compare Israel et al. 2001). This is achieved by not taking the whole state process for data augmentation, but considering only the states at observation times. Since we work in continuous time but augment with the state process at discrete times only, we refer to this method as “semi-continuous”.
For this setting, the conditional distribution of the observed data given the boundary states is an infinite mixture of normal distributions. As the weights of these mixtures depend on the occupation times of the state process rather than on the path of the state process itself, the calculations simplify. Using results on the distribution of occupation times in Markov processes from (Sericola 2000), it is possible to compute the complete data likelihood exactly.
Hahn and Sass 65
$$R_t = \int_0^t \mu_s \, ds + \int_0^t \sigma_s \, dW_s, \qquad (1)$$
the representations
$$\mu_t = \sum_{k=1}^{d} \mu^{(k)} \mathbf{1}_{\{Y_t = k\}}, \qquad \sigma_t = \sum_{k=1}^{d} \sigma^{(k)} \mathbf{1}_{\{Y_t = k\}}. \qquad (2)$$
The state process Y is characterized by the generator matrix $Q \in \mathbb{R}^{d \times d}$ as follows: for rates $\lambda_k = -Q_{kk} = \sum_{l=1,\, l \neq k}^{d} Q_{kl}$, $k = 1, \dots, d$, in state $k$ the waiting time for the next jump is $\lambda_k$-exponentially distributed, and the probability of jumping to state $l \neq k$ when leaving $k$ is given by $Q_{kl}/\lambda_k$, see e.g. (Brémaud 1981).
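This jump mechanism translates directly into a simulation routine. The following Python sketch (a hypothetical helper of our own, using 0-based state indices) draws a path of Y over a finite horizon:

```python
import random

def simulate_ctmc(Q, y0, horizon, rng=random.Random(0)):
    """Simulate a continuous time Markov chain with generator Q.

    Returns a list of (jump_time, state) pairs starting with (0.0, y0).
    In state k the holding time is Exp(lambda_k) with lambda_k = -Q[k][k],
    and the next state l != k is chosen with probability Q[k][l] / lambda_k.
    """
    t, y = 0.0, y0
    path = [(0.0, y0)]
    while True:
        lam = -Q[y][y]                 # rate of leaving the current state
        t += rng.expovariate(lam)      # exponential holding time
        if t >= horizon:
            break
        # draw the jump target with probabilities Q[y][l] / lam, l != y
        u, acc = rng.random(), 0.0
        for l in range(len(Q)):
            if l == y:
                continue
            acc += Q[y][l] / lam
            if u <= acc:
                y = l
                break
        path.append((t, y))
    return path
```

For a two-state chain with rates of the order used later in the paper, `simulate_ctmc([[-150.0, 150.0], [200.0, -200.0]], 0, 1.0)` produces on the order of a few hundred jumps.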
Remark 1. One cannot distinguish between the pairs $(\sigma, W)$ and $(\bar\sigma, \bar W)$, where $\bar\sigma$ is a square-root of $\sigma\sigma^{\top}$ and $\bar W = \bar\sigma^{-1}\sigma W$. However, without loss of generality we can assume $\sigma^{(k)}$ to be a lower triangular matrix, i.e. $\sigma^{(k)}(\sigma^{(k)})^{\top}$ equals the Cholesky factorization of the covariance matrix.
To this end, we augment the unknowns with Ẏ = (Ẏ_m)_{m=0,…,N}, Ẏ_m = Y_{m∆t}, the latent state process at the observation times. We construct a multiple-block Metropolis–Hastings sampler to draw from the joint posterior distribution of B, Σ, Q, Ẏ given Ṙ. We partition the unknowns into three blocks, namely (B, Σ), Q, and Ẏ.
In Sections 3 and 4, we specify appropriate prior distributions and describe the
construction of the updates.
2.3 Motivation
As pointed out in Section 1, existing algorithms for estimating MSMs work well for
slow state switching, but show weaknesses in dealing with fast switching. While Rydén
et al. (1998) and Timmermann (2000) show that MSMs are suitable approximations
for modeling stock returns (catching a number of stylized facts), in such applications
one often faces fast state switching, compare (Sass and Haussmann 2004) and (Hahn
et al. 2008a) for daily and intra-day data, respectively. To avoid this problem, one
might make the observation interval ∆t smaller, but this may be problematic. More
frequently quoted data might not be available (think of assets quoted on a daily basis only).

[Figure 1: Transition probabilities P(Y_t = l | Y_0 = k, Q) as functions of t ∈ [0, 0.03], one panel per set of rates, with curves for the boundary pairs (Y_0, Y_t) ∈ {(1,1), (2,1), (1,2), (2,2)}.]

Still, if high-frequency data is available, one should keep in mind that appropriate statistical properties for stock returns are obtained only over time intervals that are not too short; e.g. for second-by-second quotations one inevitably has to include the possibility of jumps in the price process.
This motivates the search for methods that give reliable results even for fast state switching. And in fact it is possible to obtain stable results for such data (of course depending on the noise level), where “fast” switching means up to about one switch per observation interval on average (in accordance with observations in Sass and Haussmann 2004 and Hahn et al. 2008b).
To give some illustration, Figure 1 plots P(Y_t = l | Y_0 = k, Q), the transition probabilities as functions of time, for different rates. Assuming ∆t = 1/250 (as in Section 6), even in the third panel with high rates, the transition probabilities for a time-step of length ∆t are far from the ergodic probabilities. This justifies our choice of augmenting with Ẏ. More illustrations and results for some reasonably chosen drift and volatility are found in Section 6.
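The transition probabilities plotted in Figure 1 are the entries of the matrix exponential P(t) = exp(Qt). A minimal pure-Python sketch using the truncated power series (the function name is our own; for larger ‖Qt‖ a library routine such as `scipy.linalg.expm` would be preferable):

```python
def transition_matrix(Q, t, terms=80):
    """Approximate P(t) = exp(Q t) by its truncated power series.

    Adequate here since the entries of Q t are of moderate size; for large
    rates or long horizons use scaling-and-squaring instead.
    """
    d = len(Q)
    # P accumulates the series; 'term' holds (Qt)^m / m!
    P = [[float(i == j) for j in range(d)] for i in range(d)]
    term = [row[:] for row in P]
    Qt = [[Q[i][j] * t for j in range(d)] for i in range(d)]
    for m in range(1, terms):
        term = [[sum(term[i][k] * Qt[k][j] for k in range(d)) / m
                 for j in range(d)] for i in range(d)]
        P = [[P[i][j] + term[i][j] for j in range(d)] for i in range(d)]
    return P

# transition probabilities over one observation interval of length 1/250
Q = [[-150.0, 150.0], [200.0, -200.0]]
P = transition_matrix(Q, 1 / 250)
```

Each row of the result sums to one, and for a two-state chain the entries can be checked against the closed form P₁₁(t) = ω₁ + (1 − ω₁)e^{−(λ₁+λ₂)t}.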
If we think of time zero as the beginning of our observations after the process has
already run for some time, it may be reasonable to alter the assumptions on the initial
state as follows:
Assumption 1a. Assumption 1 holds, but B, Σ, (Q, Ẏ_0) are independent and the state process starts from its ergodic probability ω, i.e.
$$P(\dot Y_0 \mid Q) = \omega. \qquad (5)$$
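The ergodic probability ω solves ωQ = 0 with Σ_k ω_k = 1. One simple way to compute it, sketched below under our own naming, uses the uniformized chain X̃ = I + Q/λ̃, whose stationary distribution coincides with that of the continuous time chain:

```python
def ergodic_distribution(Q, iters=10000):
    """Ergodic distribution of the chain with generator Q via uniformization:
    pi Q = 0 is equivalent to pi (I + Q / lam) = pi for any lam >= max rate,
    so power iteration on the uniformized chain converges to omega."""
    d = len(Q)
    lam = max(-Q[k][k] for k in range(d)) * 1.1   # uniformization rate
    X = [[(1.0 if i == j else 0.0) + Q[i][j] / lam
          for j in range(d)] for i in range(d)]
    w = [1.0 / d] * d
    for _ in range(iters):
        w = [sum(w[i] * X[i][j] for i in range(d)) for j in range(d)]
    return w
```

For d = 2 this reproduces the familiar closed form ω = (λ₂, λ₁)/(λ₁ + λ₂).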
4 Proposal distributions
4.1 Drift and volatility
For the update of B and Σ, a joint normal random walk proposal, reflected at zero for
the diagonal entries of σ (k) , is used:
$$\mu^{(k)\prime} = \mu^{(k)} + r_\mu\,\phi, \qquad \sigma^{(k)\prime}_{ij} = \sigma^{(k)}_{ij} + r_\sigma\,\psi^{(k)}_{ij}, \qquad \sigma^{(k)\prime}_{ii} = \bigl|\,\sigma^{(k)}_{ii} + r_\sigma\,\psi^{(k)}_{ii}\,\bigr|, \qquad (7)$$
where φ and ψ are n-dimensional vectors and n × n matrices, respectively, of independent standard normal variates, i = 1, …, n, j = 1, …, i − 1, k = 1, …, d, and r_µ and r_σ are parameters scaling the step widths.
As the transition kernel is symmetric we have a Metropolis step with acceptance
probability αB,Σ = min{1, ᾱB,Σ }, where
$$\bar\alpha_{B,\Sigma} = \frac{P(\dot R \mid B', \Sigma', Q, \dot Y)\, P(B')\, P(\Sigma')}{P(\dot R \mid B, \Sigma, Q, \dot Y)\, P(B)\, P(\Sigma)}. \qquad (8)$$
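A minimal sketch of the resulting Metropolis step, assuming a one-dimensional setting and a generic `log_target` placeholder for the log of P(Ṙ | ·) times the priors (the function names are illustrative, not the authors' implementation):

```python
import math, random

def propose(mu, sigma_diag, r_mu, r_sigma, rng):
    """Random walk proposal as in (7): normal steps, with diagonal
    volatility entries reflected at zero by taking absolute values."""
    mu_new = [m + r_mu * rng.gauss(0.0, 1.0) for m in mu]
    sig_new = [abs(s + r_sigma * rng.gauss(0.0, 1.0)) for s in sigma_diag]
    return mu_new, sig_new

def metropolis_step(mu, sig, log_target, r_mu, r_sigma, rng):
    """One Metropolis update for (B, Sigma); the reflected random walk
    kernel is symmetric, so the acceptance ratio as in (8) involves only
    ratios of the target density."""
    mu_new, sig_new = propose(mu, sig, r_mu, r_sigma, rng)
    log_alpha = log_target(mu_new, sig_new) - log_target(mu, sig)
    if math.log(rng.random()) < log_alpha:
        return mu_new, sig_new
    return mu, sig
```

Reflection keeps the diagonal entries nonnegative without breaking symmetry of the kernel, since the densities of reaching y from x and x from y coincide for x, y > 0.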
the maximum likelihood estimate of the transition matrix given Ẏ is given by $\hat X_{kl} = \hat n_{kl} \big/ \sum_{j=1}^{d} \hat n_{kj}$, from which we can compute the corresponding generator matrix Q̂, stationary distribution ω̂, expected number of jumps f̂, and expected occupation times ĝ,
$$P(\dot Y \mid Q) = P(\dot Y_0 \mid Q) \prod_{k,l=1}^{d} X_{kl}^{\hat n_{kl}}. \qquad (14)$$
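The count-based estimate X̂ can be computed in a few lines (a hypothetical helper with 0-based states):

```python
def transition_mle(states, d):
    """ML estimate of the one-step transition matrix from a discrete state
    sequence: X_hat[k][l] = n_kl / sum_j n_kj, where n_kl is the observed
    number of k -> l transitions (rows with no visits are left at zero)."""
    n = [[0] * d for _ in range(d)]
    for a, b in zip(states, states[1:]):
        n[a][b] += 1
    return [[n[k][l] / max(sum(n[k]), 1) for l in range(d)]
            for k in range(d)]
```

For instance, the sequence 0, 0, 1, 0, 1, 1 contains one 0→0 and two 0→1 transitions, giving first row (1/3, 2/3).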
where
$$P(\dot Y_m \mid \dot Y_{m-1}, \Delta\dot R_m, \eta) \propto X_{\dot Y_{m-1},\dot Y_m}\; P(\Delta\dot R_m \mid \dot Y_{m-1}, \dot Y_m, \eta), \qquad (17)$$
where
$$\psi(\Delta\dot R_m; B, C, \tau) = \varphi\Bigl(\Delta\dot R_m;\; \sum_{j=1}^{d} \mu^{(j)} \tau_j,\; \sum_{j=1}^{d} C^{(j)} \tau_j\Bigr), \qquad (22)$$
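In the one-dimensional case, (22) is simply a normal density with occupation-time-weighted mean and variance; a small sketch with illustrative names:

```python
import math

def normal_pdf(x, mean, var):
    """Density of N(mean, var) at x."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def psi(dR, mu, C, tau):
    """One-dimensional version of (22): the increment given occupation
    times tau is normal with mean sum_j mu[j]*tau[j] and variance
    sum_j C[j]*tau[j]."""
    mean = sum(m * t for m, t in zip(mu, tau))
    var = sum(c * t for c, t in zip(C, tau))
    return normal_pdf(dR, mean, var)
```

Integrating this density against the occupation-time law F^{bd} then yields the conditional distribution of the observed increment.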
$$F^{\mathrm{in}}_{k}(\tau) = P(O^1_{\Delta t} \le \tau_1, \dots, O^d_{\Delta t} \le \tau_d \mid Y_0 = k), \qquad (23)$$
$$F^{\mathrm{bd}}_{k,l}(\tau) = P(O^1_{\Delta t} \le \tau_1, \dots, O^d_{\Delta t} \le \tau_d \mid Y_0 = k,\, Y_{\Delta t} = l). \qquad (24)$$
Since
$$P(\Delta\dot R_m \mid \dot Y_{m-1} = k, \eta) = \sum_{l=1}^{d} P(\Delta\dot R_m \mid \dot Y_{m-1} = k, \dot Y_m = l, \eta)\, X_{kl}, \qquad (26)$$
We set
$$D = \{\tau \in [0, \Delta t]^d \mid \tau_1 + \dots + \tau_d = \Delta t\}$$
and use the notation ri D for its relative interior. Clearly, we just have ri D = {τ ∈ (0, ∆t)^d | τ_1 + · · · + τ_d = ∆t}. We also define its relative boundary rb D = D \ ri D.
It is useful to introduce the ∗ operator that projects a vector onto its first d − 1 components, i.e. for a d-dimensional vector the last component is omitted, while a (d − 1)-dimensional vector is unchanged.
To ease notation, in the remainder of this section we assume Q to be given.
Sericola (2000, Theorem 4.3) describes the joint distribution of the occupation times of d − 1 states and the final state, conditional on the initial state, in terms of the distribution of the uniformized discrete chain Z̃. We adapt the conditioning and take partial derivatives to obtain an expression for the density f^{bd}.
72 Estimation of Continuous Time Markov Switching Models
$$f^{\mathrm{bd}}_{k,l}(\tau) = \sum_{m=d-1}^{\infty} e^{-\tilde\lambda\,\Delta t}\, \frac{(\tilde\lambda\,\Delta t)^m}{m!} \sum_{v^* \in V_m} \chi^{\Delta t;\tau^*}_{m;v^*}\; P(\tilde V^*_m \le v^*,\, \tilde Z_m = l \mid \tilde Z_0 = k) \,\Big/\, X_{kl}, \qquad (29)$$
where
Remark 3. The conditional joint distribution of Ṽm∗ and Z̃m can be computed via its
backward equation (see Sericola 2000, Section 4.1).
$$f^{\mathrm{bd}}_{k,l}(\tau_1, \tau_2) = \sum_{m=1}^{\infty} e^{-\tilde\lambda\,\Delta t}\, \frac{(\tilde\lambda\,\Delta t)^m}{m!} \sum_{v=0}^{m} \binom{m}{v} \frac{\tau_1^{v-1}\, \tau_2^{m-v-1}\, \bigl(v\,\tau_2 - (m - v)\,\tau_1\bigr)}{\Delta t^{m}}\; P(\tilde V_m \le v,\, \tilde Z_m = l \mid \tilde Z_0 = k) \,\Big/\, X_{kl}. \qquad (32)$$
Remark 4. The error in the computation of the conditional distributions of the observations ∆Ṙ_m results firstly from the truncation of the infinite sum in (29), and secondly from the numerical approximation of the integrals. However, in contrast to the situation of the discrete approximation, with this approach we can control the error and reduce it arbitrarily. In fact, we have fast convergence for the error-affected terms and hence it is possible to keep the numerical error negligible: due to the exponential term in (29), only the very first summands are of importance, and as ψ as well as f^{bd} are very smooth functions, e.g. the composite Simpson's rule provides fast and highly accurate results even for few subintervals (compare Figure 4).
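The truncation error from cutting the sum in (29) after a few summands is bounded by the tail mass of the Poisson weights e^{−λ̃∆t}(λ̃∆t)^m/m!; the following sketch (with illustrative names) makes the fast decay concrete:

```python
import math

def poisson_tail(rate_dt, terms):
    """Mass of the Poisson weights e^{-r} r^m / m! beyond the first
    `terms` summands -- the truncation error budget for series like (29)."""
    head = sum(math.exp(-rate_dt) * rate_dt ** m / math.factorial(m)
               for m in range(terms))
    return 1.0 - head

# with lambda_tilde * dt around 0.8 (rates near 200, dt = 0.004),
# a handful of summands already leaves a negligible tail
tail = poisson_tail(0.8, 5)
```

For λ̃∆t ≈ 0.8, five summands leave a tail around 10⁻³ and ten summands a tail below 10⁻⁶, consistent with the few-term truncations used for Figure 4.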
We illustrate the results of this section with Figures 2, 3, and 4, where we consider a one-dimensional model with d = 2 states and observation time interval length ∆t = 0.004; for states one and two, the corresponding rates are set to λ_1 = 150 and λ_2 = 200, the drifts to µ^(1) = 2 and µ^(2) = −2, and the volatilities to σ^(1) = 0.10 and σ^(2) = 0.12. Figure 2 shows the conditional distributions of the occupation time of the first state given the boundary and initial states. Note that the distributions for Ẏ_0 = 1, Ẏ_1 = 2 and Ẏ_0 = 2, Ẏ_1 = 1 are identical (also in Figures 3 and 4) and that the discrete contributions in the density are not shown.
Figure 3 shows the distributions of the observations ∆Ṙ_m (rescaled to be comparable with µ) given the boundary (left) and initial (right) states; for comparison with the discrete time approximation, we included the normal densities φ( · ; µ^(1), C^(1)) and φ( · ; µ^(2), C^(2)) (dotted). Note that the conditional distributions are skewed and clearly non-normal.
In Figure 4 the distributions of the occupation time of the first state given the boundary (left) and initial (right) states are plotted for varying precision. The solid plots (highest values at the modes) show the true functions. For the dotted plots, three summands in (32) and three sampling points for Simpson's rule were used. For the dashed plots (which almost completely coincide with the true functions), four summands and five integration points were used. This makes it evident that the approximation error is negligible even for fast state switching. Using five terms in the summation and seven integration points takes little more computation time and gives results numerically equal to the exact values.
[Figure 2 (see text): conditional distributions of the occupation time of the first state on [0, 0.004].]
6 Numerical results
6.1 Simulated data
First we consider 100 samples of simulated data (n = 1, d = 2, N = 5000, ∆t = 1/250)
for moderate speed of state switching. To avoid a bias, the parameters for the prior
distributions (according to (4)) are chosen such that the mean values are the true ones,
while the variances are large (Var(µ(k) ) = 16, Var(σ (k) ) = 0.0025, Var(Qkl ) = 120).
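Such data can be generated by simulating the chain exactly within each observation interval and drawing the increment from a normal distribution with occupation-time-weighted mean and variance, as in (22). A sketch under our own naming, with parameters in the spirit of the examples above:

```python
import random

def simulate_returns(Q, mu, sigma, dt, N, rng=random.Random(0)):
    """Simulate increments of R in (1)-(2): within each observation
    interval the increment is N(sum_k mu[k]*O_k, sum_k sigma[k]^2 * O_k),
    where O_k is the occupation time of state k on that interval."""
    d = len(Q)
    y = rng.randrange(d)              # arbitrary initial state
    returns = []
    for _ in range(N):
        occ = [0.0] * d
        t = 0.0
        while t < dt:
            lam = -Q[y][y]
            hold = min(rng.expovariate(lam), dt - t)
            occ[y] += hold
            t += hold
            if t < dt:                # a jump occurs before the interval ends
                u, acc = rng.random(), 0.0
                for l in range(d):
                    if l != y:
                        acc += Q[y][l] / lam
                        if u <= acc:
                            y = l
                            break
        mean = sum(mu[k] * occ[k] for k in range(d))
        var = sum(sigma[k] ** 2 * occ[k] for k in range(d))
        returns.append(rng.gauss(mean, var ** 0.5))
    return returns
```

Simulating on the exact path within each interval, rather than freezing the state over [m∆t, (m+1)∆t), avoids building the discretization bias into the test data.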
We compare three methods: a sampler based on time discretization and data aug-
mentation with the state-vector (referred to as DT), a continuous time sampler based
on data augmentation with the full state process (referred to as CT), and the semi-
continuous method (referred to as SC) to estimate parameters; both DT and CT are
described in (Hahn et al. 2007). The samplers were slightly adapted to make them fully
comparable with respect to number of samples needed and run-time (see Remark 5).
[Figure 3 (see text): distributions of the observations ∆Ṙ_m given the boundary (left) and initial (right) states.]
In Table 1, for each method we give means, standard deviations, and root mean square
errors of the mean estimates.
For DT as well as SC, 100 000 samples (with a burn-in of 20 000) were sufficient
to give stable results; for CT, 5 000 000 samples (1 000 000 burn-in) turned out to be
necessary.
Remark 5. Note that these numbers are not “full” updates of the parameters, for the following reason. For CT, updates for the state process are rejected more often the longer the state process is. Hence, to obtain a reasonable acceptance ratio, only a randomly chosen block is updated each time (cf. Hahn et al. 2007). In this example, on average about 1 % of the state process is updated. To facilitate comparison, we proceed similarly for DT and SC. As the current state process is crucial for the updates of the remaining parameters, the raw numbers given above should be divided by 100 to obtain a rough number of “full” updates.
The estimates are very accurate for CT and SC, while DT introduces some discretization error. Regarding the run-time per sample, CT is fastest; DT takes twice as long, and SC ten times longer. Altogether, DT is fastest but not very exact; SC takes five times longer but gives precise results; CT also gives accurate results, but mixes much more slowly and takes 25 times longer.
Next, we turn to a more challenging example with fewer observations, a lower signal-to-noise ratio, and very frequent state switching. Again we consider 100 samples of simulated data (n = 1, d = 2, N = 2500, ∆t = 1/250); prior distributions are set
such that the mean values are the true ones but with large variances (Var(µ(k) ) = 16,
Var(σ (k) ) = 0.0025, Var(Qkl ) = 1000); the different samplers are run. Results for
200 000 samples per run for DT and SC and for 5 000 000 samples for CT (where again
Remark 5 holds) are given in Table 2.
DT introduces a considerable discretization error; still, the results are acceptable
in the sense that they give an idea about the magnitudes of and the relations between
the parameters. For SC, with a total run-time that is about five times higher, results
are much better and the RMSEs are substantially reduced compared to DT. As the
volatility is higher for the first state, the corresponding estimates are less accurate
than those for the second state. Nevertheless, the estimates are of good quality for all
parameters. CT also yields accurate results; however, its total run-time is 60 times that of DT and 12 times that of SC.
a simple 2-state HMM; that is, we assume we observe up and down periods for the drift but constant volatility. Obviously, this model is not sufficient to give good pathwise fits; however, it is already a serious improvement over the Black–Scholes model (where a constant drift is assumed) and can lead to substantially better results e.g. in portfolio optimization (cf. Hahn et al. 2008b; Sass and Haussmann 2004).
Prior distributions are chosen according to (4); for each data set, means for the
drifts m1 and m2 are set to the 60 % and 40 % quantiles of the daily returns under
consideration, and standard deviations are set to s1 = s2 = (m1 − m2 )/2; for the
volatility, the mean equals 0.9 times the standard deviation of the returns and the
standard deviation equals half of the mean. For the rates, for all data sets we set means
to 80 and standard deviations to 40, meaning that we expect average drift regimes to
last between 2 and 6 trading days.
For DT as well as SC, 500 000 samples (with a burn-in of 100 000) were sufficient
for all data sets to give stable results (where in each update step about 2 % of the state
process is updated, compare Remark 5; so, roughly speaking, we needed about 10 000
“full” samples).
We also tried to employ CT; however, for almost all data sets we obtained no plausible results: no well separated drifts were identified, and often enormous rates were estimated. This is not surprising since (according to the results of DT and SC) state switching occurs frequently, drifts for up and down states are close, and volatility is high (i.e. the signal-to-noise ratio is much lower); moreover, only relatively few data points are used.
Figure 5 gives an overview of the results for all the 500 data sets, showing the
distributions of the sample means both for SC and DT. For SC (DT), drifts are centered
around 1.3 (0.93) and −0.78 (−0.44), volatility around 0.25 (0.26), and rates around
102 (92) and 89 (83), the latter meaning that on average we expect a change in the drift
every 2–3 trading days. Average sample standard deviations for SC (DT) are around
0.50 (0.44) and 0.40 (0.34) for the drifts, 0.006 (0.006) for the volatility, and 39 (40)
and 35 (37) for the rates. These results look quite plausible and reliable.
Clearly, this is only a very rough analysis of the results; however, we are less interested in details of the concrete results for each of the 500 data sets than in the comparison of the results of DT and SC. To this end, for each run we compute the relative deviation of the results of DT from SC; i.e. for some parameter θ we take the mean estimates θ̄_DT and θ̄_SC from the output of the samplers DT and SC; then we plot the distribution of the relative deviation (θ̄_SC − θ̄_DT)/|θ̄_SC|. In Figure 6 these distributions
for the different parameters over all data sets (as well as corresponding average values)
are shown. Comparing DT to SC, on average the drift states are about 40 % less extreme, while the volatility is only 1.5 % higher; the rates are about 8 % lower. This
fits perfectly with the results for simulated data, where SC yielded much more accurate
results than DT; in many applications, this enhanced accuracy will justify the higher
computational costs. Especially when challenging parameters are expected, SC should
be the preferred estimation method.
[Figure 5: Distributions of the sample means over all 500 data sets; panels for drifts, volatility, and rates.]

[Figure 6: Distributions of the relative deviations of DT from SC; panels for drifts, volatility, and rates.]
7 Conclusion
We present a so-called semi-continuous time MCMC algorithm for parameter estimation in multi-dimensional continuous time MSMs. This problem is especially difficult if few data points are available and the switching of the states happens fast. While methods of moments yield precise results very fast for big data sets (say, more than 30 000 data points), we concentrate on the case where not more than 5 000 data points are available. Moreover, in many applications one observes fast regime switching (which makes the expectation maximization algorithm fail) and a low signal-to-noise ratio.
The resulting semi-continuous algorithm is a stable method that mixes quickly, as the dimension of the augmented variables as well as the dependence between the parameters is reduced significantly compared to augmenting with the full state process. In comparison to discrete time approximations, we can still employ exact filtering for the states, but no approximation error is introduced and the generator matrix is estimated directly rather than the transition matrix. The numerical computations can be performed to arbitrary precision. The quality of the results is quite satisfying, and especially for fast state switching the higher computational costs definitely pay off.
While the theory presented here applies to an arbitrary number of states, due to computational complexity it is applicable for a moderate number of states only; it remains for future work to devise a fast and efficient implementation that can deal with large numbers of states.
The transfer of the semi-continuous method to other Markov switching models is straightforward whenever the distribution of the observed data depends on the occupation times rather than on the full path of the underlying Markov process. Otherwise, the semi-continuous approach may still be beneficial, but the calculations will become more involved.
8 Appendix
In the following we describe how to compute P(∆Ṙm | Ẏm−1 = k, Ẏm = l, η), the condi-
tional distributions of the observed increments ∆Ṙm given the boundary states Ẏm−1 ,
Ẏm , for d > 2.
For an index set $I \in \mathcal I$, where $\mathcal I = \{I : I \subsetneq \{1, \dots, d\}\}$, we define
$$D_I^0 = \{\tau \in D \mid \tau_i = 0 \text{ for } i \in I,\ \tau_i > 0 \text{ for } i \notin I\},$$
the set of possible occupation times τ when states in I are not visited, but all others are.
Note that $D_\emptyset^0 = \mathrm{ri}\, D$ and $P(O_{\Delta t} \in D_I^0 \mid Y_0 = k, Y_{\Delta t} = l) > 0$ if $k, l \notin I$, i.e. in general there is positive probability for occupation time 0 of single states, although these sets are of lower effective dimension. Hence, while Theorem 1 only tells us how to deal with ri D, we also need to consider the boundary rb D explicitly. Note that as in Section 5, we assume Q to be given.
The idea is to split D into its interior and parts of its boundary. For the interior, we need to solve a (d − 1)-dimensional integral. For the parts of the boundary, the dimension is lower, but by suitably adapting Q we can proceed similarly as for the interior. In more detail, we use the representation $D = \bigcup_{I \in \mathcal I} D_I^0$ to evaluate the integral (21), conditioning on which states have occupation time 0, i.e. $O_{\Delta t} \in D_I^0$. We obtain
$$\int_{D} \psi(\Delta\dot R_m; B, C, \tau)\, dF^{\mathrm{bd}}_{k,l}(\tau) = \sum_{I \in \mathcal I} \int_{D_I^0} \psi(\Delta\dot R_m; B, C, \tau)\, dF^{\mathrm{bd}}_{I;k,l}(\tau)\; P(O_{\Delta t} \in D_I^0 \mid Y_0 = k,\, Y_{\Delta t} = l), \qquad (37)$$
where
$$F^{\mathrm{bd}}_{I;k,l}(\tau) = P(O_{\Delta t} \le \tau \mid Y_0 = k,\, Y_{\Delta t} = l,\, O_{\Delta t} \in D_I^0). \qquad (38)$$
Corollary 3. For $I \in \mathcal I$, $|I| < d - 1$, $F^{\mathrm{bd}}_I$ has a continuous density $f^{\mathrm{bd}}_I$ on $D_I^0$.
For sets I with 0 < |I| < d − 1, replace d with $d_I = d - |I|$ and adapt Q as follows: reset the diagonal entries $Q_{kk}$ to $-\sum_{l \notin I,\, l \neq k} Q_{kl}$ and drop rows and columns i for i ∈ I, which gives the corresponding $d_I \times d_I$ rate matrix. Adapt the parameters of the uniformized Markov chain accordingly. In τ, the components τ_k with k ∈ I are dropped.
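The adaptation of Q can be sketched as follows (a hypothetical helper with 0-based indices):

```python
def adapt_generator(Q, I):
    """Adapt Q as described above: drop the rows and columns of states in
    I and reset each remaining diagonal entry so that rows again sum to
    zero, i.e. Q_kk = -sum of the kept off-diagonal entries of row k."""
    d = len(Q)
    keep = [k for k in range(d) if k not in I]
    out = []
    for k in keep:
        row = [Q[k][l] for l in keep]
        i = keep.index(k)
        # reset the diagonal entry from the kept off-diagonal rates
        row[i] = -sum(row[j] for j in range(len(row)) if j != i)
        out.append(row)
    return out
```

The result is again a valid rate matrix on the reduced state space, so the occupation-time machinery of Section 5 applies unchanged with d replaced by d_I.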
Then, w.l.o.g. we can assume I = ∅. For $\tau \in D_I^0$, we have
$$f^{\mathrm{bd}}_{I;k,l}(\tau) = \sum_{m=d-1}^{\infty} e^{-\tilde\lambda\,\Delta t}\, \frac{(\tilde\lambda\,\Delta t)^m}{m!} \sum_{v^* \in V_m^{+}} \chi^{\Delta t;\tau^*}_{m;v^*}\; P(\mathbf{1}_{d-1} \le \tilde V^*_m \le v^*,\, \tilde Z_m = l \mid \tilde Z_0 = k) \,\Big/\, X_{kl}, \qquad (39)$$
Finally, we rewrite Equation (37) to obtain a numerically computable form for Equa-
tion (21).
References
Ball, F. G., Cai, Y., Kadane, J. B., and O’Hagan, A. (1999). “Bayesian inference for
ion-channel gating mechanisms directly from single-channel recordings, using Markov
chain Monte Carlo.” Proceedings of the Royal Society Series A, 455.
Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). “Maximum likelihood from
incomplete data via the EM algorithm.” Journal of the Royal Statistical Society
Series B, 39(1): 1–38.
Elliott, R. J., Aggoun, L., and Moore, J. B. (1995). Hidden Markov Models – Estimation and Control, volume 29 of Applications of Mathematics. New York: Springer.
Elliott, R. J., Hunter, W. C., and Jamieson, B. M. (1997). “Drift and volatility estima-
tion in discrete time.” Journal of Economic Dynamics and Control, 22: 209–218.
Elliott, R. J., Krishnamurthy, V., and Sass, J. (2008). “Moment based regression algo-
rithm for drift and volatility estimation in continuous time Markov switching models.”
Econometrics Journal, 11: 244–270.
Engel, C. and Hamilton, J. D. (1990). “Long swings in the dollar: Are they in the data
and do markets know it?” American Economic Review, 80(4): 689–713.
Frühwirth-Schnatter, S. (2001). “Fully Bayesian analysis of switching Gaussian state
space models.” Annals of the Institute of Statistical Mathematics, 53(1): 31–49.
— (2006). Finite Mixture and Markov Switching Models. New York: Springer.
Guo, X. (2001). “An explicit solution to an optimal stopping problem with regime
switching.” Stochastic Processes and their Applications, 38: 464–481.
Hahn, M., Frühwirth-Schnatter, S., and Sass, J. (2007). “Markov chain Monte Carlo
methods for parameter estimation in multidimensional continuous time Markov
switching models.” Technical Report 2007-09, Johann Radon Institute for Com-
putational and Applied Mathematics.
— (2008a). “Estimating models based on Markov jump processes given fragmented
observation series.” Submitted.
Hahn, M., Putschögl, W., and Sass, J. (2008b). “Optimizing consumption and in-
vestment: The case of partial information.” In Kalcsics, J. and Nickel, S. (eds.),
Operations Research Proceedings 2007, 57–62. German Operations Research Society,
Berlin: Springer.
Israel, R. B., Rosenthal, J. S., and Wei, J. Z. (2001). “Finding generators for Markov
chains via empirical transition matrices, with applications to credit ratings.” Mathe-
matical Finance, 11: 245–265.
Jasra, A., Holmes, C. C., and Stephens, D. A. (2005). “Markov chain Monte Carlo
methods and the label switching problem in Bayesian mixture modeling.” Statistical
Science, 20(1): 50–67.
Liechty, J. C. and Roberts, G. O. (2001). “Markov chain Monte Carlo methods for
switching diffusion models.” Biometrika, 88(2): 299–315.
Otranto, E. and Gallo, G. M. (2002). “A nonparametric Bayesian approach to detect
the number of regimes in Markov switching models.” Econometric Reviews, 21(4):
477–496.
Ross, S. M. (1996). Stochastic Processes. New York: John Wiley & Sons, 2nd edition.
Rydén, T., Teräsvirta, T., and Åsbrink, S. (1998). “Stylized facts of daily return series and the hidden Markov model.” Journal of Applied Econometrics, 13: 217–244.
Sass, J. and Haussmann, U. G. (2004). “Optimizing the terminal wealth under partial information: The drift process as a continuous time Markov chain.” Finance and Stochastics, 8(4): 553–577.
Acknowledgments
The authors acknowledge support by the Austrian Science Fund FWF, project P17947-N12,
and by the Heisenberg Programme of the German Research Foundation DFG. We also thank
the referee, the Associate Editor, and the Editor for their comments and suggestions which led
to a considerable improvement of the paper.