Markov Chain Proofs


Math-Stat-491-Fall2014-Notes-III

Hariharan Narayanan

October 28, 2014

1 Introduction
We will be closely following the book
"Essentials of Stochastic Processes", 2nd Edition, by Richard Durrett,
for the topic 'Finite Discrete Time Markov Chains' (FDTM). This note gives a sketch of the
important proofs. The proofs have a value beyond what is proved: they are an introduction to standard
probabilistic techniques.

2 Markov Chain summary


The important ideas related to a Markov chain can be understood by just studying its graph, which has
nodes corresponding to states and edges corresponding to nonzero entries in the transition matrix.
Figure 1 helps us to summarize key ideas.
The first part of this figure shows an irreducible Markov chain on states A, B, C. The graph in this
case is strongly connected, i.e., one can move from any node to any other through directed paths. Such
a Markov chain has a unique stationary distribution.
This Markov chain is also 'aperiodic': if you start from any node you can return to it in 2, 3, 4, 5, ...
steps, so the GCD of all these loop lengths is 1. For such Markov chains, a sufficiently large power P^n
of the transition matrix P has all entries positive. (In this case P itself already has this property.)
If you start from any probability distribution π_0 and run an irreducible aperiodic Markov chain for
'infinite time', π_0^T P^n converges to the unique stationary distribution. The value of this distribution
is positive for each state.
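
As a quick numerical illustration of this convergence, here is a minimal sketch in Python with NumPy; the 3-state matrix below is a made-up example in the spirit of the first chain of Figure 1, not taken from the text. Repeated multiplication of any starting distribution by P settles down to the same vector.

import numpy as np

# A hypothetical irreducible, aperiodic transition matrix (each row sums to 1).
P = np.array([[0.0, 2/3, 1/3],
              [1/3, 0.0, 2/3],
              [2/3, 1/3, 0.0]])

for pi0 in (np.array([1.0, 0.0, 0.0]), np.array([0.2, 0.3, 0.5])):
    dist = pi0
    for _ in range(60):          # compute pi0^T P^n for a large n
        dist = dist @ P
    print(dist)                  # both starting points give the same limiting vector
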
Next consider the second Markov chain, on states A′, B′, C′, D′. Here from D′ we can reach
A′, B′, C′, but not the other way around. Further, if you restrict the Markov chain to A′, B′, C′ you get
an irreducible chain. The Markov chain on A′, B′, C′, D′ is not irreducible but still has a unique stationary
distribution; however, it takes the value zero on some states. The general rule is the following. If from a
given state X you can reach some other state Y but cannot return from Y to X, then the stationary
distribution takes the value zero on X. We call such states 'transient'. If you start from any probability
distribution π_0 and run this Markov chain indefinitely, π_0^T P^n converges to the unique stationary
distribution, which is positive on each state of R = {A′, B′, C′} and zero on D′.

[Figure 1, captioned below, has three panels. The first panel is an irreducible Markov chain on R = {A, B, C}. The second panel is a Markov chain on {A′, B′, C′, D′} with T = {D′} and R = {A′, B′, C′}; it is not irreducible but has a unique stationary distribution, with π(D′) = 0 and π(A′), π(B′), π(C′) positive. The third panel has R1 = {A, B, C}, T = {D, D′}, R2 = {A′, B′, C′}; this Markov chain has non-unique stationary distributions, but in all of them the probability of D and D′ is zero. The edge probabilities shown in the panels are 1/3 and 2/3.]

Figure 1: Markov Chain summary

In general, just by examining the graph we can partition the set of states into a set T of transient states
and a number of sets Ri whose states are 'recurrent', i.e., not transient. Each Ri has the property that
no directed path leaves the set and that there are directed paths from any node in the set to any other.
The third Markov chain has states partitioned into T ≡ {D, D′}, R1 ≡ {A, B, C}, R2 ≡ {A′, B′, C′}.
Note that from D, D′ we can go to states in R1, R2, but not return. If we restrict the Markov chain to
either R1 or R2, we get an irreducible Markov chain. But on the set of states R1 ∪ R2 the Markov chain
is not irreducible, since you cannot go from a state in R1 to a state in R2 or back. In this case we get two
primitive stationary distributions: one which is nonzero only on R1 and zero on all other states, and a second
which is nonzero only on R2 and zero on all other states. We can show that all stationary distributions are
convex combinations of these two primitive distributions, i.e., π^T = λπ_1^T + (1 − λ)π_2^T, 0 ≤ λ ≤ 1.
The first important result is the following.

Theorem 2.1. A finite, irreducible Markov chain Xn has a unique stationary distribution π(·).

Remark: It is not claimed that this stationary distribution is also a 'steady state', i.e., if you start from
an arbitrary probability distribution π_0 and run this Markov chain indefinitely, π_0^T P^n may not converge to the
unique stationary distribution. That happens only if the irreducible Markov chain is aperiodic, i.e., the
GCD of the lengths of loops starting from any node is 1. Figure 2 shows a Markov chain of period 2. It
has a unique stationary distribution, but π_0^T P^n does not converge to it.


Figure 2: Markov chain with period 2

Recall that an irreducible Markov chain is one whose directed graph has a directed path from node i
to node j for every possible node pair.
Now every stationary distribution is a nonnegative row eigenvector of the transition matrix corre-
sponding to the eigenvalue 1.
Thus the statement of the theorem involves only ideas from linear algebra and graphs. We will first
sketch an algebraic proof of this result and then a probabilistic proof.

3 Algebraic Proof of Theorem 2.1


Case 1. The transition matrix has no zero entries.
We know that if π(·) is a stationary distribution, then, written as a row vector π^T, it satisfies
π^T P = π^T, i.e., π^T is a row eigenvector for the eigenvalue 1.
We claim that every row eigenvector corresponding to the eigenvalue 1 has entries which are all of
the same sign.
Suppose not: let some entries be positive and some negative. Consider the equation π^T P = π^T.
The j-th entry of each side of this equation is obtained by taking the 'dot product' of π^T with the j-th
column of the matrix P; formally, Σ_i π(i)P_ij = π(j). This dot product is the sum of some positive and
some negative terms, since the j-th column of P is fully positive while π^T has both positive and negative
entries. Let π′^T be the nonnegative vector obtained by changing the sign of the negative entries
of π^T, i.e., π′(i) = |π(i)| for all i. Clearly |Σ_i π(i)P_ij| < Σ_i |π(i)|P_ij, since the 'dot product' on the right
has no cancellation while the one on the left has. Thus, π′(j) = |π(j)| = |Σ_i π(i)P_ij| < Σ_i |π(i)|P_ij = Σ_i π′(i)P_ij.
We now have Σ_j (Σ_i π′(i)P_ij) > Σ_j |Σ_i π(i)P_ij| = Σ_j |π(j)|. We will show, however, that the left side
actually sums to the same expression as the right side, i.e., to Σ_j |π(j)|, which is a contradiction. We have
Σ_j (Σ_i π′(i)P_ij) = Σ_i (Σ_j π′(i)P_ij) = Σ_i π′(i)(Σ_j P_ij) = Σ_i (π′(i) × 1) = Σ_i π′(i) = Σ_i |π(i)| = Σ_j |π(j)|.
We conclude therefore that every row eigenvector π^T of P corresponding to the eigenvalue 1 has entries of
the same sign. Without loss of generality let us take the entries to be nonnegative.
Next, since P has only positive entries and π^T is nonnegative and nonzero, π^T P has only positive entries. But π^T P = π^T,
so π^T has only positive entries.
Thus, without loss of generality, we may take a row eigenvector of P corresponding to the eigenvalue 1
to have only positive entries.
We will now show that we cannot have two independent row eigenvectors of P corresponding to the
eigenvalue 1.
Suppose x^T, y^T are two such independent vectors. Then there exists a linear combination z^T which
has a zero entry but is not identically zero. But z^T is a row eigenvector of P corresponding to the eigenvalue 1
and therefore, by the above argument, must have all entries nonzero, a contradiction.
We conclude that P has a unique row eigenvector corresponding to the eigenvalue 1 which is nonnegative
and whose entries sum to 1, and this vector is fully positive. But such a row vector is a stationary distribution
of P.
Therefore P has a unique stationary distribution π, and π(y) > 0 for all y.
Case 2. The transition matrix P corresponds to a general irreducible Markov chain and may have zero
entries.
Observe that if π^T P = π^T, then π^T (I + P)/2 = π^T. Moreover, if P is a transition matrix (i.e., the rows of P add
up to 1), so is (I + P)/2, since its rows also add up to 1. The Markov chain corresponding to (I + P)/2
looks like the one corresponding to P except that there are additional self loops of probability 1/2 and all
the old edges have half their original probability. Therefore in the new graph too there are directed paths
from any node to any other node. We conclude that the Markov chain corresponding to (I + P)/2 is also
irreducible.
Let Q ≡ (I + P)/2. Observe that, because of the self loop at each node (state), Q(i, i) ≠ 0 for all i and
Q^n(i, i) ≠ 0 for all i, n. Using the Chapman-Kolmogorov theorem we conclude from the irreducibility of the
Markov chain corresponding to Q that for each i, j there exists some n such that Q^n(i, j) > 0, and
further that for each positive integer k, Q^{n+k}(i, j) ≥ Q^k(i, i) × Q^n(i, j) > 0. Thus for a large enough
positive integer m we must have Q^m fully positive. Observe that if Q has all row sums equal to 1, so does Q^m.
It is clear that if π^T P = π^T, then also π^T Q = π^T and π^T Q^m = π^T. Now Q^m is a fully positive transition
matrix of a Markov chain, and any row eigenvector of P corresponding to the eigenvalue 1
is also a row eigenvector of Q^m corresponding to the eigenvalue 1. In particular any stationary
distribution of P is also a stationary distribution of Q^m.
By the discussion of Case 1 above we know that Q^m has a unique stationary distribution π and π(y) > 0 for all y.
It follows that P has a unique stationary distribution π and π(y) > 0 for all y.

3.1 An additional result


Theorem 3.1. Every eigenvalue of a transition matrix P corresponding to an irreducible Markov chain
has magnitude at most 1, and 1 is itself an eigenvalue. If, in addition, the diagonal entries of P are
positive (as for the matrix Q = (I + P)/2 used above), then 1 is the only eigenvalue of magnitude 1.

We use the following lemma, which can be proved by elementary plane geometry.


Lemma
Let c_1, c_2, ..., c_k be complex numbers and let c_j have the maximum magnitude among them.
Let Σ_i p_i c_i = d be a convex combination of the c_i, with p_j, the coefficient of c_j, greater than zero. Then
|d| ≤ |c_j|, and further |d| = |c_j| iff c_i = c_j whenever p_i > 0.
Proof of Theorem 3.1: Let λ be an eigenvalue of P of maximum magnitude and let x be a right eigenvector
corresponding to it, so that Px = λx. Let x_j have the maximum magnitude among the entries of x. We
then have Σ_k p_jk x_k = λx_j. Noting that Σ_k p_jk x_k is a convex combination of the entries of x, the lemma
gives |λ||x_j| ≤ |x_j|, so |λ| ≤ 1; and λ = 1 is always an eigenvalue, since the rows of P sum to 1 (take x to
be the all-ones vector). If in addition p_jj > 0, the equality case of the lemma shows that |λ| = 1 can happen
only if Σ_k p_jk x_k = x_j, and therefore λ = 1. Some such hypothesis is needed: the period-2 chain of
Figure 2 has both 1 and −1 as eigenvalues of magnitude 1.
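
As a sanity check on Theorem 3.1 and on the remark about periodic chains, here is a sketch in Python/NumPy; the first matrix is the same hypothetical aperiodic example used earlier, the second is the two-state period-2 chain of Figure 2.

import numpy as np

P = np.array([[0.0, 2/3, 1/3],
              [1/3, 0.0, 2/3],
              [2/3, 1/3, 0.0]])        # hypothetical irreducible, aperiodic chain
Q = np.array([[0.0, 1.0],
              [1.0, 0.0]])             # period-2 chain: A <-> B

print(np.linalg.eigvals(P))   # eigenvalue 1, all others of magnitude < 1
print(np.linalg.eigvals(Q))   # eigenvalues 1 and -1, both of magnitude 1
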

4 Probabilistic proof of Theorem 2.1


In order to describe the probabilistic ideas technically we introduce some notation.

4.1 Notation
1. T_y ≡ min{n ≥ 1 : X_n = y} ≡ time of first return to y or, equivalently, the time of first visit to y
after n = 0. Even if we don't start from X_0 = y, we still use the term 'return'. Note that, when
the starting state is specified, this is a random variable, since the event T_y = k has a probability
associated with it.

2. T_y^k ≡ min{n > T_y^{k−1} : X_n = y} ≡ time of the k-th return to y. Note that T_y = T_y^1.

The probability associated with T_y = n when starting from x is denoted by Pr_x{T_y = n}. The expected
value of T_y starting from x is denoted by E_x(T_y) and is given by Σ_n n × Pr_x{T_y = n}.

Remark: Observe that if the expected value of T_z (starting from z) is greater than the expected value
of T_y (starting from y), you return to z less often than you return to y, so we expect π(z) < π(y).
We will show that π(y) = 1/E_y(T_y); a small simulation illustrating this appears after this list.

3. ρ_xy denotes Pr_x{T_y < ∞}, i.e., the probability of reaching y from x in finite time, given that the
starting state is x. The event of 'reaching y from x in finite time' is the union of the events given by all finite sequences
X_0 = x, X_1 = x_1, ..., X_{k−1} = x_{k−1}, X_k = y, where none of the X_i except X_k equals y. The
probability associated with such a sequence is p(x, x_1) × p(x_1, x_2) × ... × p(x_{k−1}, y). The probability
associated with the event is the sum of the probabilities of all such sequences (which represent disjoint
events).

4. ρ_yy denotes Pr_y{T_y < ∞}. Equivalently, ρ_yy is the probability of returning to y, starting from y,
in a finite number of steps, i.e.,

ρ_yy = Σ_n Pr{X_n = y, X_j ≠ y for 0 < j < n | X_0 = y}.

5. ρ_yy^(k) denotes Pr_y{T_y^k < ∞}.

6. N_y ≡ number of visits to y. This again is a random variable once the starting state is specified.

7. We say a state y is recurrent if ρ_yy = 1. It is transient if ρ_yy < 1.
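
The claim π(y) = 1/E_y(T_y) mentioned in the remark above can be checked by simulation. The following is a minimal sketch (Python/NumPy, on a hypothetical 3-state irreducible chain); it estimates E_y(T_y) by simulating returns and compares 1/E_y(T_y) with the stationary distribution obtained as a left eigenvector.

import numpy as np

rng = np.random.default_rng(0)

P = np.array([[0.0, 2/3, 1/3],        # hypothetical irreducible chain
              [1/3, 0.0, 2/3],
              [2/3, 1/3, 0.0]])

def mean_return_time(y, trials=20000):
    # Estimate E_y(T_y): start at y and count steps until the chain is back at y.
    total = 0
    for _ in range(trials):
        state, steps = y, 0
        while True:
            state = rng.choice(len(P), p=P[state])
            steps += 1
            if state == y:
                break
        total += steps
    return total / trials

# Stationary distribution: normalized left eigenvector of P for eigenvalue 1.
vals, vecs = np.linalg.eig(P.T)
pi = np.real(vecs[:, np.argmin(np.abs(vals - 1))])
pi = pi / pi.sum()

for y in range(len(P)):
    print(y, pi[y], 1.0 / mean_return_time(y))   # the two numbers should nearly agree
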

4.2 Preliminary basic results


1. (Strong Markov Property) If T is a stopping time, then

Pr{X_{T+1} = z | X_T = x, T = n} = Pr{X_1 = z | X_0 = x}.

Here T is a random variable with the property that we can determine the truth of T = n if we
know the values X_0 = x_0, X_1 = x_1, ..., X_{n−1} = x_{n−1}, X_n = x_n. Informally, the truth of T = n
depends only on the past and the immediate present and not on the future. The Markov chains
we deal with are time homogeneous, so we could choose any (fixed) time n as our zero time. The
strong Markov property allows us to use the random variable T as our zero time. This essentially
means that we can take a time n as our starting time under the condition that certain things have
taken place by that time.

A consequence: We have

ρ_yy^(k) = Pr_y{T_y^k < ∞} = ρ_yy × ρ_yy^(k−1),

and by induction it follows that ρ_yy^(k) = ρ_yy^k.

[Here are the details proving ρ_yy^(2) = ρ_yy^2. We have

ρ_yy^(2) ≡ Pr_y{T_y^2 < ∞},

i.e., the probability that X_0 = X_m = X_n = y with X_i ≠ y for 0 < i < m and for m < i < n, summed
over all values of m, n with m < n.
Let X_0 = y. Let T′_y be the first m ≥ 1 such that X_m = y, and let T″_y be the first k ≥ 1 such that
X_{k+T′_y} = y.
Observe that T′_y, T″_y are random variables and that T_y^2 = T′_y + T″_y.
So Pr_y{T_y^2 < ∞} = Σ_k Σ_m Pr_y{T′_y = m, T″_y = k} = Σ_k Σ_m Pr{T″_y = k | T′_y = m} × Pr_y{T′_y = m}.

Now by the strong Markov property Pr{T″_y = k | T′_y = m} = Pr_y{T′_y = k}. Thus we have

ρ_yy^(2) = Pr_y{T_y^2 < ∞} = Σ_k Σ_m Pr_y{T′_y = k} × Pr_y{T′_y = m}
= (Σ_k Pr_y{T′_y = k}) × (Σ_m Pr_y{T′_y = m}) = ρ_yy × ρ_yy.]

Another consequence:

Pr_x{T_y^k < ∞} = Pr{T_y^{k−1} < ∞ | X_{T_y} = y} × Pr_x{T_y < ∞} = ρ_yy^{k−1} ρ_xy.

2. (Expectation of infinite sums of random variables) Let X = Σ_{i=0}^∞ X_i, where the X_i are nonnegative
random variables. Then

• if Σ_{i=0}^n E(X_i) converges as n → ∞, then E(X) = Σ_{i=0}^∞ E(X_i);

• if Σ_{i=0}^n E(X_i) becomes (positively) unbounded as n → ∞, then E(X) = ∞.

Observe that this result has to be proved separately and is not immediate from 'finite' linearity of
expectation of random variables.

Proof: Let Y_k ≡ Σ_{i=0}^k X_i. Then, by linearity of expectation, E(Y_k) = Σ_{i=0}^k E(X_i). Now suppose
Σ_{i=0}^k E(X_i) converges as k → ∞. It follows that E(Y_k) converges too. Further, since the X_i are
nonnegative, Y_k ≤ Y_{k+1} for all k and lim_{k→∞} Y_k = X. Now Lebesgue's monotone convergence theorem
states that if Y_k ≤ Y_{k+1} for all k and lim_{k→∞} Y_k = X, then lim_{k→∞} E(Y_k) = E(X).
The conditions of this theorem are clearly satisfied in the present case of Y_k and X. This completes
the proof of the first part of the statement.

Next suppose E(Y_k) = Σ_{i=0}^k E(X_i) is (positively) unbounded as k → ∞. Since Y_k ≤ X, we must have
E(Y_k) ≤ E(X), making E(X) also (positively) unbounded, i.e., E(X) = ∞.

4.3 Intermediate results


1. If X is a discrete random variable taking values 1, 2, ..., then X = Σ_{n=1}^∞ 1_{X≥n}, and therefore

E(X) = Σ_{n=1}^∞ Pr{X ≥ n}.

2. Recall that the random variable N_y ≡ number of visits to y, and that Pr_x{T_y^k < ∞} = Pr_x{N_y ≥ k}.
Hence

Pr_x{N_y ≥ k} = ρ_yy^{k−1} ρ_xy.

3. •
E_x(N_y) = ρ_xy (Σ_{k=0}^∞ ρ_yy^k) = ρ_xy/(1 − ρ_yy),

where E_x(N_y) is the expectation of N_y when the starting state is x.

• A state y is recurrent (i.e., ρ_yy = 1) iff E_y(N_y) = ∞.

4.
E_x(N_y) = Σ_{n=1}^∞ p^n(x, y),

where p^n(x, y) refers to the probability of reaching y from x in n steps.

The proof is by recognizing that N_y = Σ_{n=1}^∞ 1_{X_n=y} and taking expectations on both sides.

5. If y is recurrent and ρ_yx > 0, then x is recurrent.

Proof: Since ρ_yx > 0 and y is recurrent, ρ_xy ≠ 0 (otherwise, starting from y, there would be a
positive probability of reaching x and never returning to y, contradicting the recurrence of y). So
we must have p^j(x, y) > 0 and p^k(y, x) > 0 for some j, k. Now

E_x(N_x) = Σ_{n=1}^∞ p^n(x, x) ≥ p^j(x, y) (Σ_{n=1}^∞ p^n(y, y)) p^k(y, x) = ∞.
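
A small numerical check of items 3 and 4 above (a sketch in Python/NumPy, on a hypothetical chain whose state 0 is transient with ρ_00 = 1/2): truncating the sum Σ_n p^n(0, 0) should give approximately ρ_00/(1 − ρ_00).

import numpy as np

P = np.array([[0.5, 0.5, 0.0],     # state 0 is transient: it leaks into {1, 2}
              [0.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])

Pn, total = np.eye(3), 0.0
for n in range(1, 200):            # E_0(N_0) = sum_{n>=1} p^n(0, 0), truncated
    Pn = Pn @ P
    total += Pn[0, 0]

rho_00 = 0.5
print(total, rho_00 / (1 - rho_00))   # both approximately 1
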

4.4 Final Results
1. • A finite, irreducible Markov chain has at least one recurrent state.
The proof is by noting that

Σ_y E_x(N_y) = Σ_y Σ_n p^n(x, y) = Σ_n (Σ_y p^n(x, y)) = Σ_n 1 = ∞.

Since there are only finitely many states, at least one of the y must be such that E_x(N_y) = ∞.

• A finite, irreducible Markov chain has all its states recurrent.

Proof: At least one of the states y is recurrent, and for each x, ρ_xy and ρ_yx are nonzero. By
a result proved earlier this is sufficient for x to be recurrent.

2. The states of any finite Markov chain can be partitioned into sets T, R_1, ..., R_k, where the states
in T are transient, the remaining states are recurrent, and the restriction of the original Markov
chain to each of the R_i is irreducible.
Proof: The set T is composed of the states x for which there is some y with ρ_xy > 0 and ρ_yx = 0,
i.e., we can reach y from x but not return. The remaining states form equivalence classes of states
which can be reached from each other; these are the R_i. Once you enter an R_i you cannot leave it.
The Markov chain restricted to the nodes of any R_i is irreducible, since one can reach any
node in it from any other, and therefore its states are recurrent.

Remark: One can build infinitely many stationary distributions if there are at least two R_i. Let µ_1
be the stationary distribution with µ_1(x) = 0 for x ∈ T, µ_1(z) = 0 for z ∈ R_2, and with µ_1 agreeing
on R_1 with the unique stationary distribution of the restriction of the Markov chain to R_1. Define
µ_2 by interchanging R_1, R_2 in the definition of µ_1. Every convex combination of µ_1, µ_2 yields a
stationary distribution of the original Markov chain.

3. Every finite irreducible Markov chain has a unique stationary distribution.

The algebraic proof of this fact is simple and has already been given.

When the irreducible Markov chain is aperiodic, i.e., the GCD of the possible return times starting
from any node (state) is 1, we can show that for any starting state x, p^n(x, y) tends to π(y), the
value of the stationary distribution at y, as n → ∞; in particular π(·) is unique.

—————————————————————————
An FDTM has a finite number of states, in one of which it sits at each of the discrete times 1, 2, .... Usually the
state at time n is denoted by X_n, and the possible states are denoted by lower-case symbols
x, y, etc.
If the chain is at state i at time n, it moves to state j at time n + 1 with probability p(i, j). This could be
stated alternatively as: 'the conditional probability of the chain being at state j at time n + 1, given that
it is at state i at time n, is p(i, j).' The key idea is that the probability of the chain being at state j at
time n + 1 does not depend on what happened before time n. Formally, we have the 'Markov property',

Pr{X_{n+1} = j | X_n = i} = Pr{X_{n+1} = j | X_n = i, X_{n−1} = i_{n−1}, ..., X_0 = i_0}.

We have further assumed 'temporal homogeneity', i.e., that p(i, j) does not depend upon n.
In particular this means

Pr{X_{n+1} = j | X_n = i} = Pr{X_{n+1} = j | X_n = i, X_{n−1} = i_{n−1}, ..., X_k = i_k},     (1)

because this latter expression, when we shift the time origin to k, is the same as

Pr{X_{n+1−k} = j | X_{n−k} = i} = Pr{X_{n+1−k} = j | X_{n−k} = i, X_{n−k−1} = i′_{n−k−1}, ..., X_0 = i′_0},

where i′_r ≡ i_{r+k}.


Just to be clear on how we should go about proving formal versions of informal statements, consider
the following. Suppose what is given is not the immediately previous state but, say, the state k steps before.
Surely what happened before time n + 1 − k is unimportant? Let us prove this for k = 2. Formally, let
us prove

Pr{X_{n+1} = j | X_{n−1} = i_{n−1}} = Pr{X_{n+1} = j | X_{n−1} = i_{n−1}, ..., X_0 = i_0}.

We will use the law of total probability. The LHS is

Pr{X_{n+1} = j | X_{n−1} = i_{n−1}} = Σ_{all states s} Pr{X_{n+1} = j | X_n = s, X_{n−1} = i_{n−1}} × Pr{X_n = s | X_{n−1} = i_{n−1}}.

The RHS is

Pr{X_{n+1} = j | X_{n−1} = i_{n−1}, ..., X_0 = i_0}

= Σ_{all states s} Pr{X_{n+1} = j | X_n = s, X_{n−1} = i_{n−1}, ..., X_0 = i_0} × Pr{X_n = s | X_{n−1} = i_{n−1}, ..., X_0 = i_0}.

Now we use the basic Markov property on each of the two terms being multiplied, and using (1) we write the
above expression as

Σ_{all states s} Pr{X_{n+1} = j | X_n = s, X_{n−1} = i_{n−1}} × Pr{X_n = s | X_{n−1} = i_{n−1}}.

Since this is the same as the LHS, the proof is complete.


Next let us prove the following variation of the Markov property:

Pr{X_{n+1} = j | X_n = i} = Pr{X_{n+1} = j | X_n = i, X_{n−1} = i_{n−1}, ..., X_2 = i_2, X_0 = i_0}.

We have skipped X_1 = i_1 in the conditioning on the right side.

Note that the RHS is equal to

Σ_{all states s} Pr{X_{n+1} = j | X_n = i, X_{n−1} = i_{n−1}, ..., X_2 = i_2, X_1 = s, X_0 = i_0} × Pr{X_1 = s | X_n = i, ..., X_2 = i_2, X_0 = i_0}.

But this is the same as

Σ_{all states s} Pr{X_{n+1} = j | X_n = i} × Pr{X_1 = s | X_n = i, ..., X_2 = i_2, X_0 = i_0},

which is equal to

Pr{X_{n+1} = j | X_n = i} × Σ_{all states s} Pr{X_1 = s | X_n = i, ..., X_2 = i_2, X_0 = i_0}.

But

Σ_{all states s} Pr{X_1 = s | X_n = i, ..., X_2 = i_2, X_0 = i_0} = 1.

So the RHS = LHS.
The same trick can be used even when there are multiple gaps.
Temporal homogeneity implies we can start from X_r = i_0 and repeat the above arguments, replacing
X_n by X_{n+r}, etc.
(Natural question at this stage: surely we could also use models where the chain state at n+1 depends
on states at time n, n − 1, · · · , n − k?
Answer: They are not needed! This model is sufficiently versatile.)
Note here that everything we know about an FDTM is captured by the set of p(i, j)s. So we could
represent the FDTM

• pictorially through a graph with vertices as states with, for each ordered pair (i, j), an edge directed
from i to j with weight p(i, j). Of course if p(i, j) = 0, we omit the edge from i to j.

• in terms of a ‘transition matrix’ with rows and columns named by the states with the (i, j)th entry
being p(i, j).

Since, from a given state i at time n, the chain must move to some state (possibly i itself) at time n + 1, it
follows that the outgoing edges (including self loops, if any) must have weights adding up to 1, and,
in the transition matrix, the rows must sum to 1.
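
Here is a minimal sketch (Python/NumPy; the 4-state matrix is a made-up example) of the two representations and of the row-sum check, together with a short simulation of the chain.

import numpy as np

# Hypothetical transition matrix; the (i, j) entry is p(i, j).
P = np.array([[0.0, 0.5, 0.5, 0.0],
              [0.3, 0.0, 0.7, 0.0],
              [0.0, 0.0, 0.2, 0.8],
              [1.0, 0.0, 0.0, 0.0]])

assert np.allclose(P.sum(axis=1), 1.0)        # every row must sum to 1

# Graph representation: one directed weighted edge per nonzero entry.
edges = [(i, j, P[i, j]) for i in range(len(P)) for j in range(len(P)) if P[i, j] > 0]
print(edges)

# Simulate ten steps of the chain starting from state 0.
rng = np.random.default_rng(1)
state, path = 0, [0]
for _ in range(10):
    state = rng.choice(len(P), p=P[state])
    path.append(int(state))
print(path)
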

STEP 1 in the study of FDTM: Convert word problems in the book to FDTMs specifying the
transition matrix and drawing the FDTM graph.

Some fundamental notions associated with an FDTM are listed below.

1. Stationary distribution on the states: this is a probability distribution π(·) such that, if we pick the
initial state i with probability π(i), the next state j occurs with probability π(j).

Note here that if we pick initial states according to some probability distribution π_0(·), the proba-
bility that the next state is j is given by

Σ_i Pr(i) × Pr(j | i) = Σ_i π_0(i) × p(i, j).

Let (π_0)^T denote the row vector whose i-th entry is π_0(i), and let P denote the transition matrix of
the FDTM (with p(i, j) as the (i, j) entry).
So if the initial probability distribution is π_0(·), the next state distribution is (π_0)^T P.
In the case of a stationary distribution π(·), we have π^T = π^T P. (A numerical sketch for computing
such a π appears after this list.)

2. Transient and recurrent states: if you start from a ‘recurrent’ state, you will return to it, with
probability 1. If this probability is < 1, then the state is said to be ‘transient’. We can prove that
the following ‘event’ has probability 1 : if we start from a transient state we will return to it only
a finite number of times.

3. The expected time of return starting from a given recurrent state. This is self explanatory.

4. ρ_xy ≡ Pr{we can go from x to y in finite time}.

5. Absorbing state: A state is said to be absorbing if, when we start from it, we remain at it with
probability 1, i.e., i is an absorbing state iff p(i, i) = 1.

6. Irreducible Markov Chain: The graph of the FDTM has the property that given any ordered pair
of states (i, j), there is a directed path from i to j in the graph of the FDTM.
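
The numerical sketch promised in item 1 above (Python/NumPy, on a hypothetical irreducible chain): a stationary distribution can be computed by solving π^T(P − I) = 0 together with the normalization Σ_i π(i) = 1.

import numpy as np

P = np.array([[0.0, 2/3, 1/3],
              [1/3, 0.0, 2/3],
              [2/3, 1/3, 0.0]])          # hypothetical irreducible chain
n = len(P)

# Stack the stationarity equations with the normalization and solve by least squares.
A = np.vstack([(P - np.eye(n)).T, np.ones(n)])
b = np.concatenate([np.zeros(n), [1.0]])
pi, *_ = np.linalg.lstsq(A, b, rcond=None)

print(pi)              # the stationary distribution
print(pi @ P)          # equals pi, i.e., pi^T P = pi^T
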

STEP 2 in the study of FDTM: Internalize the above definitions.

Through the lectures we will attempt to answer the following questions rigorously:

1. How to identify recurrent and transient states by looking only at the graph?
Answer: Check if we can go from the given state i to some state j through a directed path such
that there is no return directed path. If this is true the state is transient, otherwise recurrent.

2. Does the FDTM have a stationary distribution? If it exists, is it unique?


Answer: FDTMs always have a stationary distribution. It is unique if the FDTM is irreducible.

3. Starting from a given recurrent state i, what is the expected time of return to it?
Answer: If the FDTM is irreducible with stationary distribution π(·), the expected time of
return to state i is 1/π(i).

4. Starting from a transient state what is the expected time to reach a specified absorbing state?
Answer: We will describe a method of computing this quantity.

5. Suppose we start from any state and keep running the Markov chain according to its transition
matrix. Will we encounter the different states with frequencies in proportion to their π(·) values?
Answer: Yes, if the FDTM is irreducible.

6. Suppose we start the Markov chain according to a probability distribution π_0(·) and let it run
forever. Let π_n(·) denote the probability distribution after n steps. Will π_n(·) converge to the
stationary distribution in the limit as n → ∞?
Answer: Yes, if the FDTM is irreducible and aperiodic. ('Periodic' means that returns to each state
can occur only at multiples of some fixed period greater than 1.)

To answer the above questions we will need to understand the following:

1. What are the characteristic features of a transition matrix (or equivalently the graph) of an FDTM?
Answer: The matrix should have nonnegative entries and the rows should add up to 1. The outgoing
edges from any node in the graph should have the sum of their weights equal to 1.
Let us call these respectively Markov transition matrix and Markov chain graph.

2. What kind of a matrix is P^k? What is its significance?

Answer: P^k is also a Markov transition matrix, because its entries are nonnegative and its row sums
are 1.
Its significance is the following:
its (i, j)-th entry gives the probability of reaching j at time n + k starting from i at time n, i.e.,
the probability of reaching j from i in k steps.
The above requires an understanding of the famous 'Chapman-Kolmogorov equation'. Below we
give a sketch of the proof.
Let us denote by p^s(i, j) the probability of reaching state j from state i in s steps. The
Chapman-Kolmogorov equation states

p^{m+n}(i, j) = Σ_k p^m(i, k) × p^n(k, j).

The proof is straightforward. To reach j from i in m + n steps we must reach some intermediate
state in m steps. So the probability of reaching j from i in m + n steps through the intermediate
state k is the product p^m(i, k) × p^n(k, j). Now the events 'reaching k in m steps and then reaching
j in n more steps' are disjoint for different k and together make up the event 'reaching j from i in
m + n steps'. So

p^{m+n}(i, j) = Σ_k p^m(i, k) × p^n(k, j).

But this is exactly how the matrix P^{m+n} is computed in terms of the matrices P^m, P^n.
(To formalize the sketch we must use conditional probabilities.
Example: p^s(i, j) = Pr{X_s = j | X_0 = i}.

The probability of reaching j from i in m + n steps, passing through k after m steps, is

Pr{X_{m+n} = j, X_m = k | X_0 = i}
= Pr{X_{m+n} = j | X_m = k, X_0 = i} × Pr{X_m = k | X_0 = i}
= Pr{X_{m+n} = j | X_m = k} × Pr{X_m = k | X_0 = i} = p^n(k, j) × p^m(i, k).)
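
A quick numerical check of the Chapman-Kolmogorov equation (a sketch in Python/NumPy, using a hypothetical transition matrix): the matrix identity P^{m+n} = P^m P^n, and the fact that P^k is again a transition matrix.

import numpy as np

P = np.array([[0.0, 2/3, 1/3],
              [1/3, 0.0, 2/3],
              [2/3, 1/3, 0.0]])               # hypothetical transition matrix
m, n = 3, 4

lhs = np.linalg.matrix_power(P, m + n)        # p^{m+n}(i, j)
rhs = np.linalg.matrix_power(P, m) @ np.linalg.matrix_power(P, n)
print(np.allclose(lhs, rhs))                  # True: Chapman-Kolmogorov
print(np.allclose(lhs.sum(axis=1), 1.0))      # True: P^k has row sums 1
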

3. A random variable T that takes values 0, 1, 2, ... is called a stopping time or Markov time if whether
or not T = k depends only on the states X_0, ..., X_k that the Markov chain takes at times 0, 1, ..., k.
Are the following stopping times?

(a) T = n, if starting at x at time 0 we reach y at time n.

(b) T = n, if starting at x at time 0 we reach y at time n and n ≤ 100.

(c) T = n, if n is the last time we are at y.

Answer: The first two random variables are stopping times. The last is not, since knowing only the
past history it is impossible to tell that we are at y for the last time.

‘Stopping time’ is one of the most slippery and useful notions that we encounter while studying
FDTMs. For us, its power lies in the fact that we can, because of the ‘strong Markov property,’
treat XT as X0 , even though T is a random variable.

STEP 3 in the study of FDTM: Know how to prove the above results rigorously. But do not lose
the 'common touch': understand them in the commonsensical, intuitive way too.

Special conditions make the computation of stationary probability easy for some Markov chains. Some
of these are discussed below.

1. Doubly stochastic chains For these chains the transition matrix has its columns also adding up
to 1. Now, if we sum all the row vectors of P, we get the all-ones vector 1^T = (1, 1, ..., 1), i.e., 1^T P = 1^T. So
scaling 1^T by the reciprocal of the number of states yields a vector that satisfies the same equation while being a
probability distribution. Thus, for such chains, the uniform probability distribution is a stationary
distribution.
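
A one-line check of this fact (Python/NumPy sketch, with a hypothetical doubly stochastic matrix):

import numpy as np

# Hypothetical doubly stochastic matrix: rows and columns both sum to 1.
P = np.array([[0.2, 0.3, 0.5],
              [0.5, 0.2, 0.3],
              [0.3, 0.5, 0.2]])

pi = np.ones(3) / 3                 # the uniform distribution
print(np.allclose(pi @ P, pi))      # True: uniform is stationary
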

2. Duality Let the Markov chain X_n on the state set S have transition probability p(i, j) and (unique) station-
ary probability π(i), i ∈ S.
Define a new Markov chain Y_n on S with transition probability p′(i, j) ≡ π(j) × p(j, i)/π(i). Let us
call this the dual Markov chain to X_n. It can be verified that Σ_j p′(i, j) = 1, so p′ is indeed a transition
probability.

It can be seen that, whenever there is an edge from i to j in the original Markov chain with
probability p(i, j), in the dual there is an edge from j to i with probability p′(j, i). Let π′(·) denote
the stationary distribution of the dual Markov chain. We can show that π′(·) = π(·).
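
A sketch of the duality construction (Python/NumPy, on a hypothetical chain): build p′(i, j) = π(j) p(j, i)/π(i) and check that its rows sum to 1 and that π is also stationary for the dual.

import numpy as np

P = np.array([[0.0, 2/3, 1/3],
              [1/3, 0.0, 2/3],
              [2/3, 1/3, 0.0]])              # hypothetical irreducible chain

# Stationary distribution of P (normalized left eigenvector for eigenvalue 1).
vals, vecs = np.linalg.eig(P.T)
pi = np.real(vecs[:, np.argmin(np.abs(vals - 1))])
pi = pi / pi.sum()

Pdual = (pi[None, :] * P.T) / pi[:, None]    # p'(i, j) = pi(j) p(j, i) / pi(i)

print(np.allclose(Pdual.sum(axis=1), 1.0))   # rows of the dual sum to 1
print(np.allclose(pi @ Pdual, pi))           # pi is stationary for the dual as well
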

3. Reversibility When the Markov chain is its own dual, we have the detailed balance condition

p(i, j) = π(j) × p(j, i)/π(i).

We call such a Markov chain reversible.

The algorithm given below verifies whether a given chain is reversible even where the distribution
π(·) is not available, and if reversible, computes π(·).

The algorithm is simple. Let us suppose that from any node in the graph we can go to any other
node through a directed path (i.e., always going along the direction of the arrow of the edge). We
can show then that all entries of π(·) are positive even if the Markov chain is not reversible.

Firstly if p(i, j) > 0 we must have p(j, i) > 0 for the detailed balance condition
p(i, j) = π(j) × p(j, i)/π(i), to hold. So we should check if every edge has, in parallel, an edge going
in the opposite direction. If this is not true, we can declare the chain to be not reversible.

Let us start from some node, say 0, and assign it the value π(0) = 1.
(We will scale the π(·) values appropriately later, if our algorithm terminates properly.)
If node 1 has an edge with probability p(0, 1) coming into it from node 0, then for the detailed balance
condition to be satisfied we need π(1) = π(0) × p(0, 1)/p(1, 0). So π(1) is fixed.

We repeat this process:


Starting from the set of nodes V for which the π(·) value is currently fixed, check whether any outgoing
edges lead to nodes outside V. Suppose there is an edge from a node j ∈ V to a node k outside. Fix
the value π(k) as π(k) = π(j) × p(j, k)/p(k, j).

We will stop when there are no outgoing edges from V. This would also mean that we have assigned
a π(·) value to each node in the graph. Observe that by this time, within a scaling factor, π(·) is
unique. The scaling factor arises because we could have assigned any positive value for π(0).

However, we do not know whether for every edge (i, j), the detailed balance condition is satisfied.
So we check this now. If for some edge there is failure, we declare the Markov chain is not reversible.

If there is no failure, the chain is reversible and we scale the π(·) values so that they add up to 1.
This is the desired stationary probability distribution.

An immediate consequence of the above discussion is that if

• every edge has a parallel edge in the opposite direction

• there are no directed loops other than the parallel edges

the Markov chain is reversible. This is because by the time all nodes have been assigned π(·) values
by our algorithm, the detailed balance conditions would have also been verified since there are no
other edges.

Example: The Ehrenfest chain graph is a simple straight line, if we replace parallel edges with
single edges. So there are no loops except the parallel edges and the chain is reversible.

The algorithm is illustrated with an example below.

A test for Reversibility of Markov Chains

Assume that M is an irreducible aperiodic Markov chain. Thus it has a unique stationary distri-
bution, with a positive probability at every vertex.

Step 1 Build the R-graph of the Markov chain.

R-graph: the vertices are the states of the Markov chain. If p_ij is not equal to zero, put a directed
edge e(i, j) from i to j with weight w_ij = p_ij/p_ji. If the weight of any edge is infinite
(i.e., p_ij > 0 but p_ji = 0), STOP. The MC is not reversible, since we then have

π_i p_ij ≠ π_j p_ji     (2)

for any stationary distribution π on the states.


Example:
Let the transition matrix in the case of FDTM be the matrix A below

 
A = [ a11  a12  0    a14
      a21  a22  a23  0
      0    a32  a33  a34
      a41  0    a43  a44 ]
The R-graph is shown in Figure 3

Figure 3: The R-graph of the transition matrix A.

Step 2 a. Start from any node n and assign it the value 1. Suppose a set N_A of nodes has been assigned
values. If N_A is not the full set of nodes, pick any edge from a node i ∈ N_A to a node j ∉ N_A and
assign π_j = w_ij π_i. If no edges leave N_A into its (nonempty) complement, declare the Markov
chain not irreducible.
b. If N_A is the full set of nodes, GO TO Step 3.

Step 3 For every edge e(i, j) verify that π_i w_ij = π_j. If this is violated, declare the chain not reversible.
Otherwise, scale the π_i by k so that

Σ_i kπ_i = 1     (3)

Output (kπ_i) as the stationary distribution and STOP.

Justification: If Step 3 succeeds, the detailed balance equation is satisfied for every
pair {i, j} for which p_ij ≠ 0, and hence the MC is reversible.
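
The reversibility test just described can be sketched in code as follows (Python/NumPy; the birth-death example matrix at the end is hypothetical, and the function assumes the chain is irreducible).

import numpy as np
from collections import deque

def reversibility_test(P):
    # Returns the stationary distribution if the chain is reversible, else None.
    n = len(P)
    # Step 1: an edge with p(i, j) > 0 but p(j, i) = 0 rules out reversibility.
    for i in range(n):
        for j in range(n):
            if P[i, j] > 0 and P[j, i] == 0:
                return None

    # Step 2: propagate tentative values pi(j) = pi(i) * p(i, j) / p(j, i).
    pi = np.zeros(n)
    pi[0] = 1.0
    assigned, queue = {0}, deque([0])
    while queue:
        i = queue.popleft()
        for j in range(n):
            if P[i, j] > 0 and j not in assigned:
                pi[j] = pi[i] * P[i, j] / P[j, i]
                assigned.add(j)
                queue.append(j)
    if len(assigned) < n:
        return None              # the chain is not irreducible

    # Step 3: check detailed balance on every edge, then normalize.
    for i in range(n):
        for j in range(n):
            if P[i, j] > 0 and not np.isclose(pi[i] * P[i, j], pi[j] * P[j, i]):
                return None      # not reversible
    return pi / pi.sum()

# A hypothetical birth-death ("straight line") chain, which is reversible.
P = np.array([[0.5, 0.5, 0.0],
              [0.3, 0.4, 0.3],
              [0.0, 0.5, 0.5]])
print(reversibility_test(P))     # its stationary distribution
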

4. New Markov Chains from old

(a) Merging: Suppose a subset S_k of k states all have the same probability under the stationary
distribution π(·). We build a new Markov chain by fusing all these states into a single 'super-
state'. All the edges leaving the original states have their probabilities multiplied by 1/k;
edges entering have the same probability as before; edges going between nodes in S_k
become self loops, with probability multiplied by 1/k. If parallel edges (of the same direc-
tion) result, they are replaced by a single edge whose probability is the sum of
those of the original parallel edges. If several self loops result at a node, they can be replaced
by a single self loop at the same node with the edge probabilities added. For this Markov
chain, we can show that the new stationary distribution π′(·) can be obtained by putting
π′(S_k) = k × π(i), i ∈ S_k, and leaving all other probabilities unchanged.
[The example in Figure 4, captioned below, shows a chain in which three states c, d, f each have stationary probability p; they are merged into a single super-state cdf with π_cdf = 3p. Incoming edge probabilities (from states a, b) are unchanged, outgoing edge probabilities (to states g, h, k) are multiplied by 1/3, and parallel edges of the same direction are combined by adding their probabilities, e.g. (1/2 + 1/3 + 2/3) × 1/3.]

Figure 4: The merging process

The proof of these statements is best done by visualizing the above process of merging in terms
of the transition matrix.
Let P be the transition matrix. Let S_2 be the set of k states to be merged and S_1 be the
complement. The k states in S_2 have the same π(·) value. Let s ∈ S_2. For the original Markov
chain we have π^T P = π^T. After partitioning the rows and columns according to S_1, S_2,

(π_1^T, π_2^T) [ P_11  P_12
                P_21  P_22 ] = (π_1^T, π_2^T).     (4)

Now this is the same as

(π_1^T, π(s) × k) [ P_11  P_12
                    p̂_21  p̂_22 ] = (π_1^T, π_2^T),     (5)

where (p̂_21, p̂_22) is 1/k × the sum of the rows of (P_21, P_22). Now in the above equation, if we sum
the second set of columns of the matrix

[ P_11  P_12
  p̂_21  p̂_22 ],     (6)

on the LHS, yielding

[ P_11  P̃_12
  p̂_21  p̃_22 ],     (7)

the equation remains correct provided we also sum the second set of columns of the row vector on the
right side. But summing the second set of columns of (π_1^T, π_2^T) yields (π_1^T, π(s) × k).
We thus have

(π_1^T, π(s) × k) [ P_11  P̃_12
                    p̂_21  p̃_22 ] = (π_1^T, π(s) × k).     (8)

To verify that the matrix in (7) is the transition matrix of the new merged Markov chain,
we just note that the edges entering the super-state correspond to entries in P̃_12, edges leaving correspond
to entries in p̂_21, and self loops correspond to entries in p̃_22. It follows therefore that π′(·) =
(π_1^T, π(s) × k) is the stationary distribution for the merged Markov chain.

(b) Splitting: This operation can be regarded as one way of reversing the operation of 'merging'.
In the old Markov chain graph, we take any node and split it into k nodes. Every incoming
edge is duplicated k times, with edge probability 1/k times the previous one, and
feeds into each of the split nodes. Every outgoing edge is duplicated k times and feeds out
of each of the split nodes, with the same edge probability as before. Every self loop is duplicated
k times and attached to each of the resulting split nodes with the same probability as before.
Let the old stationary distribution be π(·) and the new one π′(·). Let node s in the old
Markov chain graph split into the k identical nodes s_i, which form the set S_k. Then we can show
π′(s_i) = (1/k) × π(s). The probabilities associated with the nodes which have not been split
remain as before.
Observe that splitting followed by merging returns the original Markov chain, but
merging followed by splitting might not.

(c) Metropolis-Hastings method This is a powerful method of building a reversible Markov chain
which has a desired stationary distribution π(·) on a given set of states. Let X_n be a Markov
chain with edge probability q(i, j), with the additional condition that q(i, j) ≠ 0 implies
q(j, i) ≠ 0. Let r(i, j) denote min[π(j)q(j, i)/(π(i)q(i, j)), 1].
Let Y_n be the Markov chain on the same graph but with transition probability

p(i, j) = q(i, j)r(i, j)   for i ≠ j,

with the leftover probability 1 − Σ_{j≠i} p(i, j) at each state i assigned to the self loop p(i, i), so
that each row sums to 1.
Let us verify that Y_n satisfies the detailed balance condition with respect to the distribution
π(·). Suppose π(j)q(j, i) ≤ π(i)q(i, j). We then have

π(i)p(i, j) = π(i)q(i, j)r(i, j) = π(i)q(i, j) × π(j)q(j, i)/(π(i)q(i, j)) = π(j)q(j, i),

π(j)p(j, i) = π(j)q(j, i)r(j, i) = π(j)q(j, i) × 1 = π(j)q(j, i),

verifying the detailed balance condition for Y_n.
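
A sketch of the Metropolis-Hastings construction (Python/NumPy; the proposal chain Q and the target π below are hypothetical, and the leftover probability at each state is assigned to a self loop so that each row of the new chain sums to 1).

import numpy as np

def metropolis_hastings(Q, pi):
    # Build p(i, j) = q(i, j) r(i, j) for i != j; the remaining mass goes to p(i, i).
    n = len(Q)
    P = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j and Q[i, j] > 0:
                r = min(pi[j] * Q[j, i] / (pi[i] * Q[i, j]), 1.0)
                P[i, j] = Q[i, j] * r
        P[i, i] = 1.0 - P[i].sum()
    return P

Q = np.array([[0.0, 0.5, 0.5],           # hypothetical proposal chain
              [0.5, 0.0, 0.5],
              [0.5, 0.5, 0.0]])
pi = np.array([0.5, 0.3, 0.2])           # desired stationary distribution

P = metropolis_hastings(Q, pi)
print(np.allclose(pi @ P, pi))           # pi is stationary for P
F = pi[:, None] * P
print(np.allclose(F, F.T))               # detailed balance holds
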

STEP 4 in the study of FDTM: DO LOTS OF PROBLEMS.
