Notes Stochastic Process
Jean-François Delmas
Contents

2 Conditional expectation
2.1 Projection in the L2 space
2.2 Conditional expectation
2.3 Conditional expectation with respect to a random variable
2.3.1 The discrete case
2.3.2 The density case
2.3.3 Elements on the conditional distribution

4 Martingales
4.1 Stopping times
4.2 Martingales and the optional stopping theorem
4.3 Maximal inequalities
4.4 Convergence of martingales
4.5 More on convergence of martingales

5 Optimal stopping
5.1 Finite horizon case
5.1.1 The adapted case
5.1.2 The general case
5.1.3 Marriage of a princess
5.2 Infinite horizon case
5.2.1 Essential supremum
5.2.2 The adapted case: regular stopping times
5.2.3 The adapted case: optimal equations
5.2.4 The general case
5.3 From finite horizon to infinite horizon
5.3.1 From finite horizon to infinite horizon
5.3.2 Castle to sell
5.3.3 The Markovian case
5.4 Appendix

8 Appendix
8.1 More on measure theory
8.1.1 Construction of probability measures
8.1.2 Proof of the Carathéodory extension Theorem 8.3
8.2 More on convergence for sequences of random variables
8.2.1 Convergence in distribution
8.2.2 Law of large numbers and central limit theorem
8.2.3 Uniform integrability
8.2.4 Convergence in probability and in L1

9 Exercises
9.1 Measure theory and random variables
9.2 Conditional expectation
9.3 Discrete Markov chains
9.4 Martingales
9.5 Optimal stopping
9.6 Brownian motion

10 Solutions
10.1 Measure theory and random variables
10.2 Conditional expectation
10.3 Discrete Markov chains
10.4 Martingales
10.5 Optimal stopping
10.6 Brownian motion

11 Vocabulary

Index
Chapter 1

A starter on measure theory and random variables
In this chapter, we present in Section 1.1 a basic tool kit in measure theory with the applications to probability theory in mind. In Section 1.2, we develop the corresponding integration and expectation. The presentation of this chapter follows closely [1], see also [2].
We use the following conventions: N = {0, 1, . . .} is the set of non-negative integers, N∗ = N ∩ (0, +∞), and for m < n ∈ N, we set Jm, nK = [m, n] ∩ N. We shall consider R̄ = R ∪ {±∞} = [−∞, +∞], and for a, b ∈ R̄, we write a ∨ b = max(a, b), a+ = a ∨ 0 the positive part of a, and a− = (−a)+ its negative part.
For a set A ⊂ E, the (indicator) function 1A, defined on E and taking values in R, is equal to 1 on A and to 0 on E \ A.
Definition 1.1. A σ-field F on Ω is a collection of subsets of Ω such that:

(i) Ω ∈ F;

(ii) A ∈ F implies A^c ∈ F;

(iii) if (Ai, i ∈ I) is a finite or countable collection of elements of F, then ∪_{i∈I} Ai ∈ F.
Properties (i) and (ii) imply that ∅ is measurable. Notice that P(Ω) and {∅, Ω} are σ-fields. The latter is called the trivial σ-field. When Ω is at most countable, unless otherwise specified, we shall consider the σ-field P(Ω).
Proposition 1.2. Let C ⊂ P(Ω). There exists a smallest σ-field on Ω which contains C.
The smallest σ-field which contains C is denoted by σ(C) and is also called the σ-field
generated by C.
Proof. Let (Fj, j ∈ J) be the collection of all the σ-fields on Ω containing C. This collection is not empty as it contains P(Ω). It is left to the reader to check that ∩_{j∈J} Fj is a σ-field. Clearly, this is the smallest σ-field containing C.
Remark 1.3. In this remark we give an explicit description of the σ-field generated by a finite number of sets. Let C = {A1, . . . , An}, with n ∈ N∗, be a finite collection of subsets of Ω. It is elementary to check that F = {∪_{I∈A} CI ; A ⊂ P(J1, nK)}, with CI = ∩_{i∈I} Ai ∩ ∩_{j∉I} Aj^c for I ⊂ J1, nK, is a σ-field. Notice that CI ∩ CJ = ∅ for I ≠ J. Thus, the subsets CI are atoms of F in the sense that if B ∈ F, then CI ∩ B is equal to CI or to ∅.
We shall prove that σ(C) = F. Since by construction CI ∈ σ(C) for all I ⊂ J1, nK, we deduce that F ⊂ σ(C). On the other hand, for all i ∈ J1, nK, we have Ai = ∪_{I⊂J1,nK, i∈I} CI. This gives that C ⊂ F, and thus σ(C) ⊂ F. In conclusion, we get σ(C) = F. ♦
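To make the construction of Remark 1.3 concrete, here is a minimal Python sketch (ours, not from the notes; it assumes a finite Ω, and the sets Omega and A below are made-up examples) which enumerates the non-empty atoms CI and builds σ(C) as the set of all unions of atoms:

    from itertools import combinations

    # Sketch of Remark 1.3 (assumption: finite Omega): the atoms C_I partition
    # Omega and sigma(C) consists of all unions of atoms.
    Omega = set(range(8))
    A = [{0, 1, 2, 3}, {2, 3, 4, 5}]          # made-up generating sets A_1, A_2
    n = len(A)

    def atom(I):
        """C_I: the points belonging to A_i exactly for the indices i in I."""
        C = set(Omega)
        for i in range(n):
            C &= A[i] if i in I else Omega - A[i]
        return frozenset(C)

    # Non-empty atoms C_I, for I ranging over the subsets of {0, ..., n-1}.
    atoms = {atom(set(I)) for k in range(n + 1) for I in combinations(range(n), k)}
    atoms.discard(frozenset())

    # sigma(C): all unions of atoms, hence 2**(number of non-empty atoms) sets.
    sigma = {frozenset().union(*S) for k in range(len(atoms) + 1)
             for S in combinations(atoms, k)}
    print(sorted(sorted(a) for a in atoms))   # [[0, 1], [2, 3], [4, 5], [6, 7]]
    print(len(sigma))                         # 2**4 = 16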
If F and H are σ-fields on Ω, we denote by F ∨ H = σ(F ∪ H) the σ-field generated by F and H. More generally, if (Fi, i ∈ I) is a collection of σ-fields on Ω, we denote by ⋁_{i∈I} Fi = σ(∪_{i∈I} Fi) the σ-field generated by (Fi, i ∈ I).
Similarly to the one-dimensional case, as all the open sets of R^d can be written as a countable union of open rectangles, the Borel σ-field on R^d, d ≥ 1, is generated by all the rectangles ∏_{i=1}^{d} (ai, bi) with ai < bi for 1 ≤ i ≤ d. In particular, we get that the Borel σ-field of R^d is the product2 of the d Borel σ-fields on R. ♦
1.1.2 Measures
We give in this section the definition and some properties of measures and probability mea-
sures.
Definition 1.7. Let (Ω, F) be a measurable space.
(i) A [0, +∞]-valued function µ defined on F is σ-additive if for all finite or countable collections (Ai, i ∈ I) of measurable pairwise disjoint sets, that is Ai ∈ F for all i ∈ I and Ai ∩ Aj = ∅ for all i ≠ j, we have:

µ(∪_{i∈I} Ai) = ∑_{i∈I} µ(Ai). (1.1)
(ii) A measure µ on (Ω, F) is a σ-additive [0, +∞]-valued function defined on F such that µ(∅) = 0. We call (Ω, F, µ) a measured space. A measurable set A is µ-null if µ(A) = 0.

(iii) A measure µ on (Ω, F) is σ-finite if there exists a countable collection (An, n ∈ N) of measurable sets such that µ(An) < +∞ for all n ∈ N and ∪_{n∈N} An = Ω.

(iv) A probability measure P on (Ω, F) is a measure on (Ω, F) such that P(Ω) = 1. The measured space (Ω, F, P) is also called a probability space.
We refer to Section 8.1 for the construction of measures such as the Lebesgue measure,
see Proposition 8.4 and Remark 8.6, and the product probability measure, see Proposition
8.7.
Example 1.8. We give some examples of measures (check these are indeed measures). Let Ω
be a space.
(i) The counting measure Card is defined by A ↦ Card(A) for A ⊂ Ω, with Card(A) the cardinality of A. It is σ-finite if and only if Ω is at most countable.

(ii) The Dirac mass δx at x ∈ Ω is defined by δx(A) = 1A(x) for A ⊂ Ω; it is a probability measure.

(iii) The Bernoulli probability distribution with parameter p ∈ [0, 1], PB(p), is a probability measure on (R, B(R)) given by PB(p) = (1 − p)δ0 + pδ1.
2 Let (E1, O1) and (E2, O2) be two topological spaces. Let C = {O1 × O2 ; O1 ∈ O1, O2 ∈ O2} be the set of products of open sets. By definition, B(E1) ⊗ B(E2) is the σ-field generated by C. The product topology O1 ⊗ O2 on E1 × E2 is defined as the smallest topology on E1 × E2 containing C. The Borel σ-field on E1 × E2, B(E1 × E2), is the σ-field generated by O1 ⊗ O2. Since C ⊂ O1 ⊗ O2, we deduce that B(E1) ⊗ B(E2) ⊂ B(E1 × E2). Since O1 ⊗ O2 is stable by infinite (even uncountable) union, it might happen that the previous inclusion is not an equality, see Theorem 4.44 p. 149 from C. Aliprantis and K. Border. Infinite Dimensional Analysis. Springer, 2006.
(iv) The Lebesgue measure λ on (R, B(R)) is a measure characterized by λ([a, b]) = b − a for
all a < b. In particular, any finite set or (by σ-additivity) any countable set is λ-null3 .
The Lebesgue measure is σ-finite.
Let us mention that assuming only the additivity property (that is I is assumed to be
finite in (1.1)), instead of the stronger σ-additivity property, for the definition of measures4
leads to a substantially different and less efficient approach. We give elementary properties
of measures.
Proposition 1.9. Let µ be a measure on (Ω, F). We have the following properties.

(i) Additivity: for A, B ∈ F, µ(A ∪ B) + µ(A ∩ B) = µ(A) + µ(B).

(ii) Monotonicity: for A, B ∈ F such that A ⊂ B, we have µ(A) ≤ µ(B).

(iii) Monotone limit: if (An, n ∈ N) is a non-decreasing sequence of elements of F, then µ(∪_{n∈N} An) = lim_{n→∞} µ(An).

(iv) σ-sub-additivity: for (An, n ∈ N) elements of F, we have µ(∪_{n∈N} An) ≤ ∑_{n∈N} µ(An).

Proof. We prove (i). The sets A ∩ B^c, A ∩ B and A^c ∩ B are measurable and pairwise disjoint. Using the additivity property three times, we get:

µ(A ∪ B) + µ(A ∩ B) = µ(A ∩ B^c) + µ(A^c ∩ B) + 2µ(A ∩ B) = µ(A) + µ(B).
We prove (ii). As A ⊂ B and A^c ∩ B ∈ F, we get by additivity that µ(B) = µ(A) + µ(A^c ∩ B). Then use that µ(A^c ∩ B) ≥ 0 to conclude.
We prove (iii). We set B0 = A0 and Bn = An ∩ A^c_{n−1} for all n ∈ N∗, so that ∪_{n≤m} Bn = Am for all m ∈ N and ∪_{n∈N} Bn = ∪_{n∈N} An. The sets (Bn, n ≥ 0) are measurable and pairwise disjoint. By σ-additivity, we get µ(Am) = µ(∪_{n≤m} Bn) = ∑_{n≤m} µ(Bn) and µ(∪_{n∈N} An) = µ(∪_{n∈N} Bn) = ∑_{n∈N} µ(Bn). Use the convergence of the partial sums ∑_{n≤m} µ(Bn), whose terms are non-negative, towards ∑_{n∈N} µ(Bn) as m goes to infinity to conclude.
Property (iv) is a direct consequence of properties (i) and (iii).
We give a property for probability measures, which is deduced from (i) of Proposition 1.9.
Corollary 1.10. Let (Ω, F, P) be a probability space and A ∈ F. We have P(Ac ) = 1 − P(A).
3 A set A ⊂ R is negligible if there exists a λ-null set B such that A ⊂ B (notice that A might not be a Borel set). Let Nλ be the set of negligible sets. The Lebesgue σ-field, Bλ(R), on R is the σ-field generated by the Borel σ-field and Nλ. By construction, we have B(R) ⊂ Bλ(R) ⊂ P(R). It can be proved that those two inclusions are strict.
4 H. Föllmer and A. Schied. Stochastic finance. An introduction in discrete time. De Gruyter, 2011.
The σ-fields (Fi , i ∈ I) are independent if for all Ai ∈ Fi ⊂ F, i ∈ I, the events (Ai , i ∈ I)
are independent.
Corollary 1.14. Let P and P′ be two probability measures defined on a measurable space (Ω, F). Let C ⊂ F be a collection of events stable by finite intersection. If P(A) = P′(A) for all A ∈ C, then we have P(B) = P′(B) for all B ∈ σ(C).
The next corollary is an immediate consequence of Definition 1.5 and Corollary 1.14.
Corollary 1.15. Let (E, O) be a topological space. Two probability measures on (E, B(E))
which coincide on the open sets O are equal.
Lemma 1.16. Let f be a function from S to E and E a σ-field on E. The collection of sets {f^{−1}(A); A ∈ E} is a σ-field on S.

The σ-field {f^{−1}(A); A ∈ E}, denoted by σ(f), is also called the σ-field generated by f.
When there is no ambiguity on the σ-fields S and E, we simply say that f is measurable.
Example 1.18. Let A ⊂ S. The function 1A is measurable from (S, S) to (R, B(R)) if and only if A is measurable, as σ(1A) = {∅, S, A, A^c}. △
The next proposition is useful to prove that a function is measurable.
Proof. We denote by G the σ-field generated by {f^{−1}(A); A ∈ C}. We have G ⊂ σ(f). It is easy to check that the collection {A ∈ E; f^{−1}(A) ∈ G} is a σ-field on E. It contains C and thus E. This implies that σ(f) ⊂ G and thus σ(f) = G. We conclude using Definition 1.17.
Proposition 1.21. Let (S, S) and ((Ei, Ei), i ∈ I) be measurable spaces. For all i ∈ I, let fi be a function defined on S taking values in Ei, and set f = (fi, i ∈ I). The function f is measurable from (S, S) to (∏_{i∈I} Ei, ⊗_{i∈I} Ei) if and only if for all i ∈ I, the function fi is measurable from (S, S) to (Ei, Ei).
Proof. By definition, the σ-field ⊗_{i∈I} Ei is generated by the sets ∏_{i∈I} Ai with Ai ∈ Ei and Ai = Ei for all i ∈ I but one. Let ∏_{i∈I} Ai be such a set. Assume it is not equal to ∏_{i∈I} Ei and let i0 denote the only index such that A_{i0} ≠ E_{i0}. We have f^{−1}(∏_{i∈I} Ai) = f_{i0}^{−1}(A_{i0}). Thus, if f is measurable, so is f_{i0}. The converse is a consequence of Proposition 1.19.
If (an, n ∈ N) is a real-valued sequence, then its lower and upper limits are defined by:

lim inf_{n→∞} an = lim_{n→∞} inf{ak, k ≥ n} and lim sup_{n→∞} an = lim_{n→∞} sup{ak, k ≥ n}.

The sequence (an, n ∈ N) converges (in R) if lim inf_{n→∞} an = lim sup_{n→∞} an and this common value, denoted by lim_{n→∞} an, belongs to R.
The next proposition asserts in particular that the limit of measurable functions is mea-
surable.
Proposition 1.24. Let (fn , n ∈ N) be a sequence of real-valued measurable functions defined
on a measurable space (S, S). The functions lim supn→∞ fn and lim inf n→∞ fn are measur-
able. The set of convergence of the sequence, {x ∈ S; lim supn→∞ fn (x) = lim inf n→∞ fn (x)},
is measurable. In particular, if the sequence (fn , n ∈ N) converges, then its limit, denoted by
limn→∞ fn , is also measurable.
Proof. For a ∈ R, we have {lim sup_{n→∞} fn < a} = ∪_{m∈N∗} ∪_{N∈N} ∩_{n≥N} {fn ≤ a − 1/m}. Since the functions fn are measurable, we deduce that {lim sup_{n→∞} fn < a} is also measurable. Since the σ-field B(R̄) is generated by [−∞, a) for a ∈ R, we deduce from Proposition 1.19 that lim sup_{n→∞} fn is measurable. Since lim inf_{n→∞} fn = −lim sup_{n→∞}(−fn), we deduce that lim inf_{n→∞} fn is measurable.
Let h = lim supn→∞ fn − lim inf n→∞ fn , with the convention +∞ − ∞ = 0. The function
h is measurable thanks to Corollary 1.23. Since the set of convergence is equal to h−1 ({0})
and that {0} is a Borel set, we deduce that the set of convergence is measurable.
We end this section with a very useful result which completes Proposition 1.22.
Proposition 1.25. Let (Ω, F), (S, S) be measurable spaces, f a measurable function defined on Ω taking values in S, and ϕ a measurable function from (Ω, σ(f)) to (R, B(R)). Then, there exists a real-valued measurable function g defined on S such that ϕ = g ◦ f.
Proof. For simplicity, we assume that ϕ takes its values in R instead of R̄. For all k ∈ Z, n ∈ N, the sets A_{n,k} = ϕ^{−1}([k2^{−n}, (k + 1)2^{−n})) are σ(f)-measurable. Thus, for all n ∈ N, there exists a collection (B_{n,k}, k ∈ Z) of pairwise disjoint sets of S such that ∪_{k∈Z} B_{n,k} = S, B_{n,k} ∈ S and f^{−1}(B_{n,k}) = A_{n,k} for all k ∈ Z. For all n ∈ N, the real-valued function gn = 2^{−n} ∑_{k∈Z} k 1_{B_{n,k}} defined on S is measurable, and we have gn ◦ f ≤ ϕ ≤ gn ◦ f + 2^{−n0} for n ≥ n0 ≥ 0. The function g = lim sup_{n→∞} gn is measurable according to Proposition 1.24, and we have g ◦ f ≤ ϕ ≤ g ◦ f + 2^{−n0} for all n0 ∈ N. This implies that g ◦ f = ϕ.
At some point we shall specify the σ-field F on Ω, and say that X is F-measurable.
We say two E-valued random variables X and Y are equal in distribution, and we write X =(d) Y, if PX = PY. For A ∈ E, we recall we write {X ∈ A} = {ω; X(ω) ∈ A} = X^{−1}(A). Two random variables X and Y defined on the same probability space are equal a.s., and we write X =a.s. Y, if P(X = Y) = 1. Notice that if X and Y are equal a.s., then they have the same probability distribution.
Remark 1.28. Let X be a real-valued random variable. Its cumulative distribution function
FX is defined by FX (x) = P(X ≤ x) for all x ∈ R. It is easy to deduce from Exercise 9.1
that if X and Y are real-valued random variables, then X and Y are equal in distribution if
and only if FX = FY . ♦
The next lemma gives a characterization of the distribution of a family of random variables.

Lemma 1.29. Let X = (Xi, i ∈ I) be a family of random variables, with Xi taking values in a measurable space (Ei, Ei) for i ∈ I. The distribution of X is characterized by the quantities P(Xj ∈ Aj for all j ∈ J), for all finite subsets J ⊂ I and all Aj ∈ Ej, j ∈ J.

According to Proposition 1.21, in the above lemma Xj is an Ej-valued random variable, and its marginal probability distribution can be recovered from the distribution of X as:

P(Xj ∈ Aj) = P(X ∈ ∏_{i∈I} Ai) with Ai = Ei for i ≠ j.
Proof. According to Definition 1.4, the product σ-field E on the product space E = ∏_{i∈I} Ei is generated by the family C of product sets ∏_{i∈I} Ai such that Ai ∈ Ei for all i ∈ I and Ai = Ei for all i ∉ J, with J ⊂ I finite. Notice then that PX(∏_{i∈I} Ai) = P(Xj ∈ Aj for j ∈ J). Since C is stable by finite intersection, we deduce from the monotone class theorem, and more precisely Corollary 1.14, that the probability measure PX on E is uniquely characterized by P(Xj ∈ Aj for j ∈ J) for all finite subsets J of I and all Aj ∈ Ej for j ∈ J.
Definition 1.31. The random variables (Xi, i ∈ I) are independent if the σ-fields (σ(Xi), i ∈ I) are independent. Equivalently, the random variables (Xi, i ∈ I) are independent if for all finite subsets J ⊂ I and all Aj ∈ Ej with j ∈ J, we have:

P(Xj ∈ Aj for all j ∈ J) = ∏_{j∈J} P(Xj ∈ Aj).
We deduce from this definition that if the marginal probability distributions Pi of the random variables Xi for i ∈ I are known and if (Xi, i ∈ I) are independent, then the distribution of X is the product probability ⊗_{i∈I} Pi introduced in Proposition 8.7.
We end this section with the Bernoulli scheme.
Theorem 1.32. Let (E, E, P) be a probability space. Let I be a set of indices. Then, there exists a probability space and a sequence (Xi, i ∈ I) of E-valued random variables defined on this probability space which are independent and with the same probability distribution P.
Lemma 1.33. Let f be a simple function defined on S. The integral µ(f ) does not depend
on the choice of its representation.
Proof. Consider two representations for f: f = ∑_{k=1}^{n} αk 1_{Ak} = ∑_{ℓ=1}^{m} βℓ 1_{Bℓ}, with n, m ∈ N∗ and A1, . . . , An, B1, . . . , Bm ∈ S. We shall prove that ∑_{k=1}^{n} αk µ(Ak) = ∑_{ℓ=1}^{m} βℓ µ(Bℓ).
According to Remark 1.3, there exists a finite family of measurable sets (CI, I ∈ P(J1, n + mK)) such that CI ∩ CJ = ∅ if I ≠ J, and for all k ∈ J1, nK and ℓ ∈ J1, mK there exist Ik ⊂ P(J1, n + mK) and Jℓ ⊂ P(J1, n + mK) such that Ak = ∪_{I∈Ik} CI and Bℓ = ∪_{I∈Jℓ} CI. We deduce that:

f = ∑_{I∈P(J1,n+mK)} (∑_{k=1}^{n} αk 1_{I∈Ik}) 1_{CI} = ∑_{I∈P(J1,n+mK)} (∑_{ℓ=1}^{m} βℓ 1_{I∈Jℓ}) 1_{CI},

and thus ∑_{k=1}^{n} αk 1_{I∈Ik} = ∑_{ℓ=1}^{m} βℓ 1_{I∈Jℓ} for all I such that CI ≠ ∅. We get:

∑_{k=1}^{n} αk µ(Ak) = ∑_{I} (∑_{k=1}^{n} αk 1_{I∈Ik}) µ(CI) = ∑_{I} (∑_{ℓ=1}^{m} βℓ 1_{I∈Jℓ}) µ(CI) = ∑_{ℓ=1}^{m} βℓ µ(Bℓ),

where we used the additivity of µ for the first and third equalities. This ends the proof.
The next lemma gives a representation of µ(f) using that a non-negative measurable function f is the non-decreasing limit of a sequence of simple functions. Such a sequence exists. Indeed, one can define for n ∈ N∗ the simple function fn by fn(x) = min(n, 2^{−n}⌊2^n f(x)⌋) for x ∈ S, with the convention ⌊+∞⌋ = +∞. Then, the functions (fn, n ∈ N∗) are measurable and their non-decreasing limit is f.
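As a numeric illustration of this dyadic approximation (a sketch of ours; NumPy and the test function f(x) = x^2 are arbitrary choices), one can check that the values of fn increase towards those of f:

    import numpy as np

    # Sketch: the simple functions f_n(x) = min(n, 2^-n * floor(2^n f(x)))
    # increase to f; the test function f(x) = x**2 is a made-up example.
    def dyadic_approx(f, n):
        return lambda x: np.minimum(n, 2.0 ** (-n) * np.floor(2.0 ** n * f(x)))

    f = lambda x: x ** 2
    x = np.linspace(0.0, 2.0, 5)             # grid 0, 0.5, 1, 1.5, 2
    for n in (1, 2, 4, 8):
        print(n, dyadic_approx(f, n)(x))     # values non-decreasing in n, towards f(x)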
Lemma 1.35. Let f be a [0, +∞]-valued function defined on S and (fn , n ∈ N) a non-
decreasing sequence of simple functions such that limn→∞ fn = f . Then, we have that
limn→∞ µ(fn ) = µ(f ).
Proof. It is enough to prove that for all non-decreasing sequences of simple functions (fn, n ∈ N) and all simple functions g such that lim_{n→∞} fn ≥ g, we have lim_{n→∞} µ(fn) ≥ µ(g). We deduce from the proof of Lemma 1.33 that there exists a representation of g such that g = ∑_{k=1}^{N} αk 1_{Ak} with the measurable sets (Ak, 1 ≤ k ≤ N) pairwise disjoint. Using this representation and the linearity, we see it is enough to consider the particular case g = α1A, with α ∈ [0, +∞], A ∈ S and fn 1_{A^c} = 0 for all n ∈ N.
By monotonicity, the sequence (µ(fn ), n ∈ N) is non-decreasing and thus limn→∞ µ(fn )
is well defined, taking values in [0, +∞].
The result is clear if α = 0. We assume that α > 0. Let α′ ∈ [0, α). For n ∈ N, we consider the measurable sets Bn = {fn ≥ α′}. The sequence (Bn, n ∈ N) is non-decreasing with A as limit because lim_{n→∞} fn ≥ g. The monotone property for measures, see property (iii) from Proposition 1.9, implies that lim_{n→∞} µ(Bn) = µ(A). As µ(fn) ≥ α′µ(Bn), we deduce that lim_{n→∞} µ(fn) ≥ α′µ(A) and that lim_{n→∞} µ(fn) ≥ µ(g), as α′ ∈ [0, α) is arbitrary.
Corollary 1.36. The linearity and monotonicity properties, see (1.3) and (1.4), also hold for [0, +∞]-valued measurable functions f and g defined on S.

Proof. Let (fn, n ∈ N) and (gn, n ∈ N) be two non-decreasing sequences of simple functions converging respectively towards f and g. Let a, b ∈ [0, +∞). The non-decreasing sequence (afn + bgn, n ∈ N) of simple functions converges towards af + bg. By linearity, we get:

µ(af + bg) = lim_{n→∞} µ(afn + bgn) = a lim_{n→∞} µ(fn) + b lim_{n→∞} µ(gn) = aµ(f) + bµ(g).
The function f is µ-integrable if max (µ(f + ), µ(f − )) < +∞ (i.e. µ(|f |) < +∞).
12 CHAPTER 1. A STARTER ON MEASURE THEORY AND RANDOM VARIABLES
We also write µ(f) = ∫ f dµ = ∫ f(x) µ(dx). A property holds µ-almost everywhere (µ-a.e.) if it holds on a measurable set B such that µ(B^c) = 0. If µ is a probability measure, then one says µ-almost surely (µ-a.s.) for µ-a.e. We shall omit µ and write a.e. or a.s. when there is no ambiguity on the measure.
Lemma 1.38. Let f be a [0, +∞]-valued measurable function defined on S. Then:

µ(f) = 0 ⟺ f = 0 µ-a.e.
The next corollary asserts that it is enough to know f a.e. to compute its integral.
Corollary 1.39. Let f and g be two real-valued measurable functions defined on S such that
µ(f ) and µ(g) are well defined. If a.e. f = g, then we have µ(f ) = µ(g).
To conclude notice that f = g a.e. implies that f + = g + a.e. and f − = g − a.e. and then
use the first part of the proof to conclude.
Corollary 1.40. The linearity property, see (1.3) with a, b ∈ R, and the monotonicity prop-
erty (1.4), where f ≤ g can be replaced by f ≤ g a.e., hold for real-valued measurable
µ-integrable functions f and g defined on S.
We deduce that the set of R-valued µ-integrable functions defined on S is a vector space.
The linearity property (1.3) holds also on the set of real-valued measurable functions h
such that µ(h+ ) < +∞ and on the set of real-valued measurable functions h such that
µ(h− ) < +∞. The monotonicity property holds also for real-valued measurable functions f
and g such that µ(f ) and µ(g) are well defined.
1.2. INTEGRATION AND EXPECTATION 13
Notice that Proposition 1.24 indeed ensures that lim inf_{n→∞} fn is measurable. We thus deduce the following corollary.
We now give the three main results on the convergence of integrals for sequences of converging functions.
Theorem 1.43 (Monotone convergence theorem). Let (fn, n ∈ N) be a sequence of real-valued measurable functions defined on S such that for all n ∈ N, a.e. 0 ≤ fn ≤ fn+1. Then, we have:

lim_{n→∞} ∫ fn dµ = ∫ lim_{n→∞} fn dµ.
Proof. The set A = {x; fn(x) < 0 or fn(x) > fn+1(x) for some n ∈ N} is µ-null as a countable union of µ-null sets. Thus, we get that a.e. fn = fn 1_{A^c} for all n ∈ N. Corollary 1.39 implies that, replacing fn by fn 1_{A^c} without loss of generality, it is enough to prove the theorem under the stronger conditions: for all n ∈ N, 0 ≤ fn ≤ fn+1. We set f = lim_{n→∞} fn, the non-decreasing (everywhere) limit of (fn, n ∈ N).
For all n ∈ N, let (f_{n,k}, k ∈ N) be a non-decreasing sequence of simple functions which converges towards fn. We set gn = max{f_{j,n}; 1 ≤ j ≤ n}. The non-decreasing sequence of simple functions (gn, n ∈ N) converges to f, and thus lim_{n→∞} ∫ gn dµ = ∫ f dµ by Lemma 1.35. By monotonicity, gn ≤ fn ≤ f implies ∫ gn dµ ≤ ∫ fn dµ ≤ ∫ f dµ. Taking the limit, we get lim_{n→∞} ∫ fn dµ = ∫ f dµ.
The proof of the next corollary is left to the reader (hint: use the monotone convergence
theorem to get the σ-additivity).
We shall say that the measure f µ has density f with respect to the reference measure µ.
Fatou’s lemma will be used for the proof of the dominated convergence theorem, but it
is also interesting by itself.
Lemma 1.45 (Fatou's lemma). Let (fn, n ∈ N) be a sequence of real-valued measurable functions defined on S such that a.e. fn ≥ 0 for all n ∈ N. Then, we have the lower semi-continuity property:

lim inf_{n→∞} ∫ fn dµ ≥ ∫ lim inf_{n→∞} fn dµ.
Proof. The function lim inf_{n→∞} fn is the non-decreasing limit of the sequence (gn, n ∈ N) with gn = inf_{k≥n} fk. We get:

∫ lim inf_{n→∞} fn dµ = lim_{n→∞} ∫ gn dµ ≤ lim_{n→∞} inf_{k≥n} ∫ fk dµ = lim inf_{n→∞} ∫ fn dµ,

where we used the monotone convergence theorem for the first equality and the monotonicity property of the integral for the inequality.
The next theorem and the monotone convergence theorem are very useful to exchange
integration and limit.
Theorem 1.46 (Dominated convergence theorem). Let f, g, (fn, n ∈ N) and (gn, n ∈ N) be real-valued measurable functions defined on S. We assume that a.e.: |fn| ≤ gn for all n ∈ N, f = lim_{n→∞} fn and g = lim_{n→∞} gn. We also assume that lim_{n→∞} ∫ gn dµ = ∫ g dµ and ∫ g dµ < +∞. Then, we have:

lim_{n→∞} ∫ fn dµ = ∫ lim_{n→∞} fn dµ.
Taking gn = g for all n ∈ N in the above theorem gives the following result.
Corollary 1.47 (Lebesgue's dominated convergence theorem). Let f, g and (fn, n ∈ N) be real-valued measurable functions defined on S. We assume that a.e.: |fn| ≤ g for all n ∈ N, f = lim_{n→∞} fn, and that ∫ g dµ < +∞. Then, we have:

lim_{n→∞} ∫ fn dµ = ∫ lim_{n→∞} fn dµ.
Proof of Theorem 1.46. As a.e. |f| ≤ g and ∫ g dµ < +∞, we get that the function f is integrable. The functions gn + fn and gn − fn are a.e. non-negative. Fatou's lemma applied to gn + fn and gn − fn gives:

∫ g dµ + ∫ f dµ = ∫ (g + f) dµ ≤ lim inf_{n→∞} ∫ (gn + fn) dµ = ∫ g dµ + lim inf_{n→∞} ∫ fn dµ,
∫ g dµ − ∫ f dµ = ∫ (g − f) dµ ≤ lim inf_{n→∞} ∫ (gn − fn) dµ = ∫ g dµ − lim sup_{n→∞} ∫ fn dµ.

Since ∫ g dµ is finite, we deduce from those inequalities that ∫ f dµ ≤ lim inf_{n→∞} ∫ fn dµ and that lim sup_{n→∞} ∫ fn dµ ≤ ∫ f dµ. Thus, the sequence (∫ fn dµ, n ∈ N) converges towards ∫ f dµ.
We shall use the next corollary in Chapter 5; it is a direct consequence of Fatou's lemma and of the dominated convergence theorem.

Corollary 1.48. Let f, g, (fn, n ∈ N) be real-valued measurable functions defined on S. We assume that a.e. fn^+ ≤ g for all n ∈ N, f = lim_{n→∞} fn, and that ∫ g dµ < +∞. Then, we have that (µ(fn), n ∈ N) and µ(f) are well defined, as well as:

lim sup_{n→∞} ∫ fn dµ ≤ ∫ lim_{n→∞} fn dµ.
Proposition 1.49. Let f and g be real-valued measurable functions defined on S. We have the following inequalities.

• Hölder inequality. Let p, q ∈ (1, +∞) be such that 1/p + 1/q = 1. Assume that |f|^p and |g|^q are integrable. Then fg is integrable and we have:

∫ |fg| dµ ≤ (∫ |f|^p dµ)^{1/p} (∫ |g|^q dµ)^{1/q}.

The Hölder inequality is an equality if and only if there exist c, c′ ∈ [0, +∞) such that (c, c′) ≠ (0, 0) and a.e. c|f|^p = c′|g|^q.
• Cauchy-Schwarz inequality. Assume that f^2 and g^2 are integrable. Then fg is integrable and ∫ |fg| dµ ≤ (∫ f^2 dµ)^{1/2} (∫ g^2 dµ)^{1/2}. Furthermore, we have ∫ fg dµ = (∫ f^2 dµ)^{1/2} (∫ g^2 dµ)^{1/2} if and only if there exist c, c′ ∈ [0, +∞) such that (c, c′) ≠ (0, 0) and a.e. cf = c′g.
• Minkowski inequality. Let p ∈ [1, +∞). Assume that |f|^p and |g|^p are integrable. We have:

(∫ |f + g|^p dµ)^{1/p} ≤ (∫ |f|^p dµ)^{1/p} + (∫ |g|^p dµ)^{1/p}.
Proof. Hölder inequality. We recall the convention 0 · (+∞) = 0. The Young inequality states that for a, b ∈ [0, +∞] and p, q ∈ (1, +∞) such that 1/p + 1/q = 1, we have:

ab ≤ a^p/p + b^q/q.

Indeed, this inequality is obvious if a or b belongs to {0, +∞}. For a, b ∈ (0, +∞), using the convexity of the exponential function, we get:

ab = exp(log(a^p)/p + log(b^q)/q) ≤ exp(log(a^p))/p + exp(log(b^q))/q = a^p/p + b^q/q.
If µ(|f|^p) = 0 or µ(|g|^q) = 0, the Hölder inequality is trivially true, as a.e. fg = 0 thanks to Lemma 1.38. If this is not the case, then integrating with respect to µ in the Young inequality with a = |f|/µ(|f|^p)^{1/p} and b = |g|/µ(|g|^q)^{1/q} gives the result. Because of the strict convexity of the exponential, if a and b are finite, then the Young inequality is an equality if and only if a^p and b^q are equal. This implies that, if |f|^p and |g|^q are integrable, then the Hölder inequality is an equality if and only if there exist c, c′ ∈ [0, +∞) such that (c, c′) ≠ (0, 0) and a.e. c|f|^p = c′|g|^q.
The Cauchy-Schwarz inequality is the Hölder inequality with p = q = 2. If ∫ fg dµ = (∫ f^2 dµ)^{1/2} (∫ g^2 dµ)^{1/2}, then since ∫ fg dµ ≤ ∫ |fg| dµ, the equality holds also in the Cauchy-Schwarz inequality. Thus there exist c, c′ ∈ [0, +∞) such that (c, c′) ≠ (0, 0) and a.e. c|f| = c′|g|. Notice also that ∫ (|fg| − fg) dµ = 0. Then use Lemma 1.38 to conclude that a.e. |fg| = fg and thus a.e. cf = c′g.
Let p ≥ 1. From the convexity of the function x ↦ |x|^p, we get (a + b)^p ≤ 2^{p−1}(a^p + b^p) for all a, b ∈ [0, +∞]. We deduce that |f + g|^p is integrable. The case p = 1 of the Minkowski inequality comes from the triangle inequality in R. Let p > 1. We assume that ∫ |f + g|^p dµ > 0, otherwise the inequality is trivial. Using the Hölder inequality, we get:

∫ |f + g|^p dµ ≤ ∫ |f| |f + g|^{p−1} dµ + ∫ |g| |f + g|^{p−1} dµ
≤ ((∫ |f|^p dµ)^{1/p} + (∫ |g|^p dµ)^{1/p}) (∫ |f + g|^p dµ)^{(p−1)/p}.

Dividing by (∫ |f + g|^p dµ)^{(p−1)/p} gives the Minkowski inequality.
For p ∈ [1, +∞), let 𝓛p((S, S, µ)) denote the set of R-valued measurable functions f defined on S such that ∫ |f|^p dµ < +∞. When there is no ambiguity on the underlying space, resp. space and measure, we shall simply write 𝓛p(µ), resp. 𝓛p. The Minkowski inequality and the linearity of the integral yield that 𝓛p is a vector space and that the map ‖·‖p from 𝓛p to [0, +∞) defined by ‖f‖p = (∫ |f|^p dµ)^{1/p} is a semi-norm (that is, ‖f + g‖p ≤ ‖f‖p + ‖g‖p and ‖af‖p ≤ |a| ‖f‖p for f, g ∈ 𝓛p and a ∈ R). Notice that ‖f‖p = 0 implies that a.e. f = 0, thanks to Lemma 1.38. Recall that the relation "a.e. equal to" is an equivalence relation on the set of real-valued measurable functions defined on S. We deduce that the space (Lp, ‖·‖p), where Lp is the space 𝓛p quotiented by the equivalence relation "a.e. equal to", is a normed vector space. We shall use the same notation for an element of 𝓛p and for its equivalence class in Lp. If we need to stress the dependence on the measure µ of the measured space (S, S, µ), we may write Lp(µ) and even Lp(S, S, µ) for Lp.
The next proposition asserts that the normed vector space (Lp, ‖·‖p) is complete and, by definition, is a Banach space. We recall that a sequence (fn, n ∈ N) of elements of Lp converges in Lp to a limit, say f, if f ∈ Lp and lim_{n→∞} ‖fn − f‖p = 0.

Proposition 1.50. Let p ∈ [1, +∞). The normed vector space (Lp, ‖·‖p) is complete. That is, every Cauchy sequence of elements of Lp converges in Lp: if (fn, n ∈ N) is a sequence of elements of Lp such that lim_{min(n,m)→∞} ‖fn − fm‖p = 0, then there exists f ∈ Lp such that lim_{n→∞} ‖fn − f‖p = 0.
Proof. Let (fn, n ∈ N) be a Cauchy sequence of elements of Lp, that is, fn ∈ Lp for all n ∈ N and lim_{min(n,m)→∞} ‖fn − fm‖p = 0. Consider the sub-sequence (nk, k ∈ N) defined by n0 = 0 and, for k ≥ 1, nk = inf{m > n_{k−1}; ‖fi − fj‖p ≤ 2^{−k} for all i ≥ m, j ≥ m}. In particular, we have ‖f_{n_{k+1}} − f_{n_k}‖p ≤ 2^{−k} for all k ≥ 1. The Minkowski inequality and the monotone convergence theorem imply that ‖∑_{k∈N} |f_{n_{k+1}} − f_{n_k}|‖p < +∞ and thus ∑_{k∈N} |f_{n_{k+1}} − f_{n_k}| is a.e. finite. The series with general term (f_{n_{k+1}} − f_{n_k}) is thus a.e. absolutely convergent. By considering the convergence of the partial sums, we get that the sequence (f_{n_k}, k ∈ N) converges a.e. towards a limit, say f. This limit is a real-valued measurable function, thanks to Corollary 1.42. We deduce from Fatou's lemma that:

‖fm − f‖p ≤ lim inf_{k→∞} ‖fm − f_{n_k}‖p.

This implies that lim_{m→∞} ‖fm − f‖p = 0, and the Minkowski inequality gives that f ∈ Lp.
We give an elementary criterion for the Lp convergence of a.e. converging sequences.

Lemma 1.51. Let p ∈ [1, +∞). Let (fn, n ∈ N) be a sequence of elements of Lp which converges a.e. towards f ∈ Lp. The convergence holds in Lp (i.e. lim_{n→∞} ‖f − fn‖p = 0) if and only if lim_{n→∞} ‖fn‖p = ‖f‖p.
Proof. Assume that lim_{n→∞} ‖f − fn‖p = 0. Using the Minkowski inequality, we deduce that |‖f‖p − ‖fn‖p| ≤ ‖f − fn‖p. This proves that lim_{n→∞} ‖fn‖p = ‖f‖p.
On the other hand, assume that lim_{n→∞} ‖fn‖p = ‖f‖p. Set gn = 2^{p−1}(|fn|^p + |f|^p) and g = 2^p |f|^p. Since the function x ↦ |x|^p is convex, we get |fn − f|^p ≤ gn for all n ∈ N. We also have lim_{n→∞} gn = g a.e. and lim_{n→∞} ∫ gn dµ = ∫ g dµ < +∞. The dominated convergence Theorem 1.46 then gives that lim_{n→∞} ∫ |fn − f|^p dµ = ∫ lim_{n→∞} |fn − f|^p dµ = 0. This ends the proof.
The next theorem allows us to define the integral of a real-valued function with respect to the product of σ-finite5 measures.
Theorem 1.53 (Fubini's theorem). Assume that ν and µ are σ-finite measures.

(i) There exists a unique measure on (E × S, E ⊗ S), denoted by ν ⊗ µ and called the product measure, such that ν ⊗ µ(A × B) = ν(A)µ(B) for all A ∈ E and B ∈ S.
5 When the measures ν and µ are not σ-finite, Fubini's theorem may fail, essentially because the product measure might not be well defined. Consider the measurable space ([0, 1], B([0, 1])) with λ the Lebesgue measure and µ the counting measure (which is not σ-finite), and the measurable function f ≥ 0 defined by f(x, y) = 1_{x=y}, so that ∫∫ f(x, y) µ(dy) λ(dx) = 1 and ∫∫ f(x, y) λ(dx) µ(dy) = 0 are not equal.
Remark 1.54. Notice that the proof of (i) of Fubini's theorem gives a construction of the product of two σ-finite measures, alternative to the one given in Proposition 8.7 for the product of probability measures. Thanks to (i) of Fubini's theorem, the Lebesgue measure on R^d can be seen as the product measure of d copies of the one-dimensional Lebesgue measure. ♦
The expectation of a real-valued random variable X is its integral with respect to the probability measure P; it is denoted by E[X] = ∫ X(ω) P(dω). We recall the expectation E[X] is well defined if min(E[X+], E[X−]) is finite, where X+ = X ∨ 0 and X− = (−X) ∨ 0, and that X is integrable if max(E[X+], E[X−]) is finite.
Example 1.55. If A is an event, then 1A is a random variable and we have E[1A] = P(A). Taking A = Ω, we obviously get that E[1Ω] = E[1] = 1. △
The next elementary lemma is very useful to compute expectation in practice. Recall the
distribution of X, denoted by PX , has been introduced in Definition 1.27.
Lemma 1.56. Let X be a random variable taking values in a measurable space (E, E). Let ϕ be a real-valued measurable function defined on (E, E). If E[ϕ(X)] is well defined, or equivalently if ∫ ϕ(x) PX(dx) is well defined, then we have E[ϕ(X)] = ∫ ϕ(x) PX(dx).
Proof. Assume that ϕ is simple: ϕ = ∑_{k=1}^{n} αk 1_{Ak} for some n ∈ N∗, αk ∈ [0, +∞], Ak ∈ E. We have:

E[ϕ(X)] = ∑_{k=1}^{n} αk P(X ∈ Ak) = ∑_{k=1}^{n} αk PX(Ak) = ∫ ϕ(x) PX(dx).

Then use the monotone convergence theorem to get E[ϕ(X)] = ∫ ϕ(x) PX(dx) when ϕ is measurable and [0, +∞]-valued. Use the definition of E[ϕ(X)] and of ∫ ϕ(x) PX(dx), when they are well defined, to conclude when ϕ is real-valued and measurable.
Obviously, if X and Y have the same distribution, then E[ϕ(X)] = E[ϕ(Y)] for all real-valued measurable functions ϕ such that E[ϕ(X)] is well defined, in particular if ϕ is bounded.
Remark 1.57. We give a closed formula for the expectation of a discrete random variable. Let X be a random variable taking values in a measurable space (E, E). We say that X is a discrete random variable if {x} ∈ E for all x ∈ E and P(X ∈ ∆X) = 1, where ∆X = {x ∈ E; P(X = x) > 0} is the discrete support of the distribution of X. Notice that ∆X is at most countable and thus belongs to E.
If X is a discrete random variable and ϕ a [0, +∞]-valued function defined on E, then we have:

E[ϕ(X)] = ∑_{x∈∆X} ϕ(x) P(X = x). (1.8)

Equation (1.8) also holds for ϕ a real-valued measurable function as soon as E[ϕ(X)] is well defined. ♦
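As a quick illustration of (1.8) (our own sketch; the biased-die distribution below is made up), the expectation of ϕ(X) is a weighted sum over the discrete support:

    # Sketch of formula (1.8): E[phi(X)] = sum of phi(x) * P(X = x) over the support.
    # The distribution below (a die biased towards 6) is a made-up example.
    p = {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5}
    phi = lambda x: x ** 2
    print(sum(phi(x) * px for x, px in p.items()))   # E[X^2] = 23.5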
Remark 1.58. Let B ∈ F be such that P(B) > 0. By considering the probability measure P(B)^{−1} 1B P : A ↦ P(A ∩ B)/P(B), see Corollary 1.44, we can define the expectation conditionally on B by, for all real-valued random variables Y such that E[Y] is well defined:

E[Y|B] = E[Y1B]/P(B). (1.9)

If furthermore P(B) < 1, then we easily get that E[Y] = P(B)E[Y|B] + P(B^c)E[Y|B^c]. ♦
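The decomposition E[Y] = P(B)E[Y|B] + P(B^c)E[Y|B^c] is easy to check by simulation; here is a minimal sketch (ours; the choice of Y and B is arbitrary):

    import numpy as np

    # Sketch checking E[Y] = P(B) E[Y|B] + P(B^c) E[Y|B^c], with E[Y|B] as in (1.9).
    # Made-up example: Z standard normal, Y = Z + 2 and B = {Z > 1}.
    rng = np.random.default_rng(0)
    z = rng.standard_normal(10**6)
    y, B = z + 2.0, z > 1.0

    p_B = B.mean()                           # Monte Carlo estimate of P(B)
    e_B, e_Bc = y[B].mean(), y[~B].mean()    # conditional expectations given B, B^c
    print(y.mean(), p_B * e_B + (1.0 - p_B) * e_Bc)   # both close to E[Y] = 2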
A real-valued random variable X is square-integrable if it belongs to L2 (P). Since 2 |x| ≤
1 + |x|2 , we deduce from the monotonicity property of the expectation that if X ∈ L2 (P) then
X ∈ L1 (P), that is X is integrable. This means that L2 (P) ⊂ L1 (P).
For X = (X1, . . . , Xd) an R^d-valued random variable, we say that E[X] is well defined (resp. X is integrable, resp. square-integrable) if E[Xi] is well defined (resp. Xi is integrable, resp. square-integrable) for all i ∈ J1, dK, and we set E[X] = (E[X1], . . . , E[Xd]).
We recall that an R-valued function ϕ defined on R^d is convex if ϕ(qx + (1 − q)y) ≤ qϕ(x) + (1 − q)ϕ(y) for all x, y ∈ R^d and q ∈ (0, 1). The function ϕ is strictly convex if this convexity inequality is strict for all x ≠ y. Let ⟨·, ·⟩ denote the scalar product on R^d. Then, it is well known that if ϕ is an R-valued convex function defined on R^d, then it is continuous6 and there exists7 a sequence ((an, bn), n ∈ N) with an ∈ R^d and bn ∈ R such that for all x ∈ R^d:

ϕ(x) = sup_{n∈N} (bn + ⟨an, x⟩). (1.10)
We give further inequalities which complete Proposition 1.49. We recall that an R-valued convex function defined on R^d is continuous (and thus measurable).
Proposition 1.59 (Tchebychev and Jensen inequalities).

(i) Tchebychev inequality. Let X be a real-valued random variable and a ∈ (0, +∞). Then we have:

P(|X| ≥ a) ≤ E[X^2]/a^2.

(ii) Jensen inequality. Let ϕ be an R-valued convex function defined on R^d and let X be an integrable R^d-valued random variable. Then E[ϕ(X)] is well defined and:

ϕ(E[X]) ≤ E[ϕ(X)]. (1.11)

If furthermore ϕ is strictly convex and X is not a.s. equal to a constant, then the inequality in (1.11) is strict.
Proof. Since 1{|X|≥a} ≤ X 2 /a2 , we deduce the Tchebychev inequality from the monotonicity
property of the expectation and Example 1.55.
Let ϕ be a real-valued convex function. Using (1.10), we get ϕ(X) ≥ b0 + ha0 , Xi and
thus ϕ(X) ≥ −|b0 | − |a0 ||X|. Since X is integrable, we deduce that E[ϕ(X)− ] < +∞, and
thus E[ϕ(X)] is well defined. Then, using the monotonicity of the expectation, we get that
for all n ∈ N, E[ϕ(X)] ≥ bn + han , E[X]i. Taking the supremum over all n ∈ N and using the
characterization (1.10), we get (1.11).
6 It is enough to prove the continuity at 0, and without loss of generality we can assume that ϕ(0) = 0. Since ϕ is finite on the 2^d vertices of the cube [−1, 1]^d, it is bounded from above by a finite constant, say M. Using the convexity inequality, we deduce that ϕ is bounded on [−1, 1]^d by M. Let α ∈ (0, 1) and y ∈ [−α, α]^d. Using the convexity inequality with x = y/α, y = 0 and q = α, we get that ϕ(y) ≤ αϕ(y/α) ≤ αM. Using the convexity inequality with x = y, y = −y/α and q = 1/(1 + α), we also get that 0 ≤ ϕ(y)/(1 + α) + Mα/(1 + α). This gives that |ϕ(y)| ≤ αM. Thus ϕ is continuous at 0.
7 This is a consequence of the separation theorem for convex sets. See for example Proposition B.1.2.1 in J.-B. Hiriart-Urruty and C. Lemaréchal. Fundamentals of convex analysis. Springer-Verlag, 2001.
To complete the proof, we shall check that if X is not a.s. equal to a constant and if ϕ is strictly convex, then the inequality in (1.11) is strict. For simplicity, we consider the case d = 1, as the case d ≥ 2 can be proved similarly. Set B = {X ≤ E[X]}. Since X is non-constant, we deduce that P(B) ∈ (0, 1) and that E[X|B] < E[X|B^c]. Recall that E[X] = P(B)E[X|B] + P(B^c)E[X|B^c]. We get that:

ϕ(E[X]) < P(B)ϕ(E[X|B]) + P(B^c)ϕ(E[X|B^c]) ≤ P(B)E[ϕ(X)|B] + P(B^c)E[ϕ(X)|B^c] = E[ϕ(X)],

where we used the strict convexity of ϕ and that E[X|B] ≠ E[X|B^c] for the first inequality, and the Jensen inequality for the second. This proves that the inequality in (1.11) is strict.
We end this section with the covariance and variance. Let X, Y be two real-valued square-integrable random variables. Thanks to the Cauchy-Schwarz inequality, XY is integrable. The covariance of X and Y, Cov(X, Y), and the variance of X, Var(X), are defined by:

Cov(X, Y) = E[(X − E[X])(Y − E[Y])] = E[XY] − E[X]E[Y] and Var(X) = Cov(X, X).

In particular, for a, b ∈ R, we have Var(aX + b) = a^2 Var(X). Using Lemma 1.38 with f = (X − E[X])^2, we get that Var(X) = 0 implies X is a.s. constant.
The covariance can be defined for random vectors as follows.
1.2.6 Independence
Recall the independence of σ-fields given in Definition 1.11 and of random variables given in
Definition 1.31.
The random variables (Xi, i ∈ I) are independent if and only if for all finite subsets J ⊂ I and all bounded real-valued measurable functions fj defined on Ej, j ∈ J, we have:

E[∏_{j∈J} fj(Xj)] = ∏_{j∈J} E[fj(Xj)]. (1.13)

Proof. If (1.13) is true, then taking fj = 1_{Aj} with Aj ∈ Ej, we deduce from Definition 1.31 that the random variables (Xi, i ∈ I) are independent.
If (Xi, i ∈ I) are independent, then Definition 1.31 implies that (1.13) holds for indicator functions. By linearity, we get that (1.13) holds also for simple functions. Use the monotone convergence theorem and then linearity to deduce that (1.13) holds for bounded real-valued measurable functions.
Bibliography
[1] O. Kallenberg. Foundations of modern probability. Probability and its Applications (New
York). Springer-Verlag, New York, second edition, 2002.
[2] J. Neveu. Bases mathématiques du calcul des probabilités. Masson et Cie, Éditeurs, Paris,
1970.
Chapter 2
Conditional expectation
Let X be a square integrable real-valued random variable. The constant c which minimizes
E[(X − c)2 ] is the expectation of X. Indeed, we have, with m = E[X]:
E[(X − c)2 ] = E[(X − m)2 + (m − c)2 + 2(X − m)(m − c)] = Var(X) + (m − c)2 .
2.1 Projection in the L2 space

Theorem 2.1. Let H be a closed vector subspace of L2 and let X ∈ L2.

(i) (Existence.) There exists a real-valued random variable XH ∈ H, called the orthogonal projection of X on H, such that E[(X − XH)^2] = inf{E[(X − Y)^2]; Y ∈ H}. And, for all Y ∈ H, we have E[XY] = E[XH Y].

(ii) (Uniqueness.) If Z ∈ H satisfies E[(X − Z)^2] = inf{E[(X − Y)^2]; Y ∈ H}, or E[ZY] = E[XY] for all Y ∈ H, then a.s. Z = XH.
Proof. Set a = inf{E[(X − Y)^2]; Y ∈ H}. Let (Xn, n ∈ N) be a sequence of H such that lim_{n→+∞} E[(X − Xn)^2] = a. Using the median formula with Z′ = Xn − X and Y′ = Xm − X, we get:

E[(Xn − X)^2] + E[(Xm − X)^2] = 2E[(X − I)^2] + E[(Xn − Xm)^2]/2,

with I = (Xn + Xm)/2 ∈ H. As E[(X − I)^2] ≥ a, we deduce that the sequence (Xn, n ∈ N) is a Cauchy sequence in L2 and thus converges in L2, say towards XH. In particular, we have E[(X − XH)^2] = a. Since H is closed, we get that the limit XH belongs to H.
Let Z ∈ H be such that E[(X −Z)2 ] = a. For Y ∈ H, the function t 7→ E[(X −Z +tY )2 ] =
a + 2tE[(X − Z)Y ] + t2 E[Y 2 ] is minimal for t = 0. This implies that its derivative at t = 0
is zero, that is E[(X − Z)Y ] = 0. In particular, we have E[(X − XH )Y ] = 0. This ends the
proof of (i).
On the one hand, let Z ∈ H be such that E[(X − Z)^2] = a. We deduce from the previous arguments that for all Y ∈ H:

E[(XH − Z)Y] = E[(X − Z)Y] − E[(X − XH)Y] = 0.

Taking Y = XH − Z gives that E[(XH − Z)^2] = 0 and thus a.s. Z = XH, see Lemma 1.38.
On the other hand, if there exists Z ∈ H such that E[ZY ] = E[XY ] for all Y ∈ H,
arguing as above, we also deduce that a.s. Z = XH .
According to the remarks at the beginning of this chapter, we see that if X is a real-valued
square-integrable random variable, then E[X] can be seen as the orthogonal projection of X
on the vector space of the constant random variables.
The next lemma asserts that, if the expectation of X conditionally on H exists then it is
unique up to an a.s. equality. It will be denoted by E[X|H].
Lemma 2.3 (Uniqueness of the conditional expectation). Let Z and Z′ be real-valued random variables, H-measurable, with E[Z] and E[Z′] well defined, and such that E[Z1A] = E[Z′1A] for all A ∈ H. Then, we get that a.s. Z = Z′.

Proof. Let n ∈ N∗ and consider A = {n ≥ Z > Z′ ≥ −n}, which belongs to H. By linearity, we deduce from the hypothesis that E[(Z − Z′)1_{n≥Z>Z′≥−n}] = 0. Lemma 1.38 implies that P(n ≥ Z > Z′ ≥ −n) = 0, and thus P(+∞ > Z > Z′ > −∞) = 0 by monotone convergence. Considering A = {Z = +∞, n ≥ Z′}, A = {Z ≥ n, Z′ = −∞} and A = {Z = +∞, Z′ = −∞} leads similarly to P(Z > Z′, Z = +∞ or Z′ = −∞) = 0. So we get P(Z > Z′) = 0. By symmetry, we deduce that a.s. Z = Z′.
We use the orthogonal projection theorem on Hilbert spaces to define the conditional expectation for square-integrable real-valued random variables.
Proposition 2.4. If X ∈ L2, then E[X|H] is the orthogonal projection, defined in Theorem 2.1, of X on the vector space H of all square-integrable H-measurable random variables.
Proof. The set H corresponds to the space L2(Ω, H, P). It is closed thanks to Proposition 1.50. The set H is thus a closed vector subspace of L2. Property (i) (with Y = 1A) from Theorem 2.1 then implies that (2.1) holds, and thus that the orthogonal projection of X ∈ L2 on H is the expectation of X conditionally on H.
(ii) Linearity. For a, b ∈ R, we have that a.s. E[aX + bY|H] = aE[X|H] + bE[Y|H].

(iii) Monotone convergence. Let (Xn, n ∈ N) be square-integrable random variables such that a.s. 0 ≤ Xn ≤ Xn+1 for all n ∈ N and such that lim_{n→+∞} Xn is square-integrable. Then a.s. lim_{n→+∞} E[Xn|H] = E[lim_{n→+∞} Xn|H].

Indeed, we deduce from the monotonicity of the conditional expectation that for all n ∈ N a.s. 0 ≤ E[Xn|H] ≤ E[Xn+1|H]. The random variable Z = lim_{n→+∞} E[Xn|H] is H-measurable according to Corollary 1.42 and a.s. non-negative. The monotone convergence theorem implies that for all A ∈ H:

E[Z1A] = lim_{n→+∞} E[E[Xn|H]1A] = lim_{n→+∞} E[Xn1A] = E[(lim_{n→+∞} Xn)1A].

Deduce from (2.1) and Lemma 2.3 that a.s. Z = E[lim_{n→+∞} Xn|H]. This ends the proof.
Proof. Consider first the case where X is a.s. non-negative. The random variable X is the
a.s. limit of a sequence of positive square-integrable random variables. Property (iii) from
Proposition 2.5 implies that E[X|H] exists. It is unique thanks to Lemma 2.3. It is a.s.
non-negative as limit of non-negative random variables. Taking A = Ω in (2.1), we get (2.3).
We now consider the general case. Recall that X + = max(X, 0) and X − = max(−X, 0).
From the previous argument the expectations of E[X + |H] and E[X − |H] are well defined and
respectively equal to E[X + ] and E[X − ]. Since one of those two expectations is finite, we
deduce that a.s. E[X+|H] is finite or a.s. E[X−|H] is finite. Then use (2.1) and Lemma 2.3
to deduce that E[X + |H] − E[X − |H] is equal to E[X|H], the expectation of X conditionally
on H. Since E[X|H] is the difference of two non-negative random variables, one of them
being integrable, we deduce that the expectation of E[X|H] is well defined and use (2.1) with
A = Ω to get (2.3). Eventually, if X is integrable, so are E[X + |H] and E[X − |H] thanks to
(2.3) for non-negative random variables. This implies that E[X|H] is integrable.
We summarize in the next proposition the properties of the conditional expectation di-
rectly inherited from the properties of the expectation.
Proposition 2.7. We have the following properties.
(i) Positivity. If X is a real-valued random variable such that a.s. X ≥ 0, then a.s.
E[X|H] ≥ 0.
(iii) Monotonicity. For X, Y real-valued random variables such that a.s. X ≤ Y and such that E[X] and E[Y] are well defined, we have a.s. E[X|H] ≤ E[Y|H].
(iv) Monotone convergence. Let (Xn, n ∈ N) be real-valued random variables such that for all n ∈ N a.s. 0 ≤ Xn ≤ Xn+1. Then we have that a.s.:

lim_{n→∞} E[Xn|H] = E[lim_{n→∞} Xn | H].
(v) Fatou lemma. Let (Xn, n ∈ N) be real-valued random variables such that for all n ∈ N a.s. 0 ≤ Xn. Then we have that a.s.:

E[lim inf_{n→∞} Xn | H] ≤ lim inf_{n→∞} E[Xn|H].
(vii) The Tchebychev, Hölder, Cauchy-Schwarz, Minkowski and Jensen inequalities from Propositions 1.49 and 1.59 hold with the expectation replaced by the conditional expectation.
For example, we state the Jensen inequality from property (vii) above. Let ϕ be an R-valued convex function defined on R^d. Let X be an integrable R^d-valued random variable. Then, E[ϕ(X)|H] is well defined and a.s.:

ϕ(E[X|H]) ≤ E[ϕ(X)|H].
Using the monotone or dominated convergence theorems, it is easy to prove the following corollary, which generalizes (2.1).

Corollary 2.8. Let X and Y be two real-valued random variables such that E[X] and E[XY] are well defined, and Y is H-measurable. Then E[E[X|H]Y] is well defined and we have:

E[E[X|H]Y] = E[XY]. (2.5)
Recall Definitions 1.11 and 1.30 on independence. We complete the properties of the
conditional expectation.
Proposition 2.9. Let X be a real-valued random variable such that E[X] is well defined.

(i) If X is H-measurable, then we have that a.s. E[X|H] = X.

(ii) If X is independent of H, then we have that a.s. E[X|H] = E[X].

(iii) If Y is a real-valued H-measurable random variable such that E[XY] is well defined, then we have that a.s. E[YX|H] = YE[X|H].

(iv) If G ⊂ H is a σ-field, then we have that a.s. E[E[X|H]|G] = E[X|G].

(v) If G ⊂ F is a σ-field independent of H ∨ σ(X), then we have that a.s. E[X|G ∨ H] = E[X|H].
Proof. If X is H-measurable, then we can choose Z = X in (2.1) and use Lemma 2.3 to get
property (i). If X is independent of H, then for all A ∈ H, we have E[X1A ] = E[X]E[1A ] =
E[E[X]1A ], and we can choose Z = E[X] in (2.1) and use Lemma 2.3 to get property (ii). If Y
is a real-valued H-measurable random variable such that E[XY ] is well defined, then E[XY 1A ]
is also well defined for A ∈ H, and according to (2.5), we have E[XY 1A ] = E[E[X|H]Y 1A ].
Then, we can choose Z = Y E[X|H] in (2.1), with X replaced by X1A , and use Lemma 2.3
to get property (iii).
We prove property (iv). Let A ∈ G ⊂ H. We have:

E[E[X|G]1A] = E[X1A] = E[E[X|H]1A] = E[E[E[X|H]|G]1A],

where we used (2.1) with H replaced by G for the first equality, (2.1) for the second, and (2.1) with H replaced by G and X by E[X|H] for the last. Then we deduce property (iv) from Definition 2.2 and Lemma 2.3.
We prove property (v) first when X is integrable. For A ∈ G and B ∈ H, we have:

E[1_{A∩B} X] = E[1A 1B X] = E[1A] E[1B X] = E[1A] E[1B E[X|H]] = E[1_{A∩B} E[X|H]],
where we used that 1A is independent of H∨σ(X) in the second equality and independent of H
in the last. Using the dominated convergence theorem, we get that A = {A ∈ F, E [1A X] =
E[1A E[X|H]]} is a monotone class. It contains C = {A ∩ B; A ∈ G, B ∈ H} which is stable
by finite intersection. The monotone class theorem implies that A contains σ(C) and thus
G ∨ H. Then we deduce property (v) from Definition 2.2 and Lemma 2.3. Use the monotone
convergence theorem to extend the result to non-negative random variables, and use that E[X|H′] = E[X+|H′] − E[X−|H′] for any σ-field H′ ⊂ F when E[X] is well defined, to extend the result to any real-valued random variable X such that E[X] is well defined.
In the next two paragraphs we give an explicit formula for g when V is a discrete random
variable and when X = ϕ(Y, V ) with Y some random variable taking values in a measurable
space (S, S) such that (Y, V ) has a density with respect to some product measure on S × E.
Recall (2.2) for the notation P(A| H) for A ∈ F; and we shall write P(A|V ) for P(A| σ(V )).
Corollary 2.12. Let X be a real-valued random variable such that E[X] is well defined, and let V be a discrete random variable. Then we have a.s. E[X|V] = g(V), with:

g(v) = E[X1_{V=v}]/P(V = v) = E[X|V = v] if P(V = v) > 0, and g(v) = 0 otherwise. (2.6)
Proof. According to Corollary 2.11, we have E[X|V] = g(V) for some real-valued measurable function g. We deduce from (2.1) with A = {V = v} that E[X1_{V=v}] = g(v)P(V = v). If P(V = v) > 0, we get:

g(v) = E[X1_{V=v}]/P(V = v) = E[X|V = v].

The value of E[X|V = v] when P(V = v) = 0 is unimportant, and can be set equal to 0.
Remark 2.13. Let (E, E) be a measurable space and V be a discrete E-valued random variable with discrete support ∆V = {v ∈ E, P(V = v) > 0}. For v ∈ ∆V, denote by Pv the probability measure on (Ω, F) defined by Pv(A) = P(A|V = v) for A ∈ F. The law of X conditionally on {V = v}, denoted by P_{X|v}, is the image of the probability measure Pv by X, and we define the law of X conditionally on V as the collection of probability measures P_{X|V} = (P_{X|v}, v ∈ ∆V). An illustration is given in the next example. ♦
Example 2.14. Let (Xi, i ∈ J1, nK) be independent Bernoulli random variables with the same parameter p ∈ (0, 1). We set Sn = ∑_{i=1}^{n} Xi, which has a binomial distribution with parameter (n, p). We shall compute the law of X1 conditionally on Sn. We get for k ∈ J1, nK:

P(X1 = 1|Sn = k) = P(X1 = 1, X2 + · · · + Xn = k − 1)/P(Sn = k) = p P(X2 + · · · + Xn = k − 1)/P(Sn = k) = k/n,

where we used independence of X1 and (X2, . . . , Xn) for the second equality and that X2 + · · · + Xn has a binomial distribution with parameter (n − 1, p) for the last. For k = 0, we get directly that P(X1 = 1|Sn = k) = 0. We deduce that X1 conditionally on {Sn = k} is a Bernoulli random variable with parameter k/n for all k ∈ J0, nK. We shall say that, conditionally on Sn, X1 has the Bernoulli distribution with parameter Sn/n.
Using Corollary 2.12, we get that E[X1|Sn] = Sn/n, which could also have been obtained directly, as the expectation of a Bernoulli random variable is given by its parameter. △
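A short simulation (our own sketch; n, p and the sample size are arbitrary) illustrates Example 2.14: averaging X1 over the draws with Sn = k recovers k/n:

    import numpy as np

    # Sketch of Example 2.14: for i.i.d. Bernoulli(p) variables, the empirical mean
    # of X_1 over the draws with S_n = k estimates P(X_1 = 1 | S_n = k) = k/n.
    rng = np.random.default_rng(1)
    n, p, m = 5, 0.3, 10**6                  # n, p and sample size are arbitrary
    X = rng.random((m, n)) < p
    S = X.sum(axis=1)
    for k in range(n + 1):
        sel = S == k
        if sel.any():
            print(k, round(X[sel, 0].mean(), 3), k / n)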
Assume that (Y, V) has a density f_{(Y,V)} with respect to a product measure µ(dy)ν(dv), and let fV(v) = ∫ f_{(Y,V)}(y, v) µ(dy) denote the (marginal) density of V. For v such that fV(v) ∈ (0, +∞), the conditional density of Y given V = v is defined by:

f_{Y|V}(y|v) = f_{(Y,V)}(y, v)/fV(v), y ∈ S.

Thanks to the Fubini theorem, we get that, for v such that fV(v) ∈ (0, +∞), the function y ↦ f_{Y|V}(y|v) is a density, as it is non-negative and ∫ f_{Y|V}(y|v) µ(dy) = 1.
We now give the expectation of X = ϕ(Y, V ), for some function ϕ, conditionally on V .
Proposition 2.16. Let (E, E, ν) and (S, S, µ) be measured spaces such that ν and µ are σ-finite. Let (Y, V) be an S × E-valued random variable with density (y, v) ↦ f_{(Y,V)}(y, v) with respect to the product measure µ(dy)ν(dv). Let ϕ be a real-valued measurable function defined on S × E and set X = ϕ(Y, V). Assume that E[X] is well defined. Then we have that a.s. E[X|V] = g(V), with:

g(v) = ∫ ϕ(y, v) f_{Y|V}(y|v) µ(dy). (2.7)
Proof. Let A ∈ σ(V). The function 1A is σ(V)-measurable, and thus, thanks to Proposition 1.25, there exists a measurable function h such that 1A = h(V). Since fV is a density, we get that ∫ 1_{fV∉(0,+∞)} fV dν = 0. We have:

E[X1A] = ∫∫ ϕ(y, v) h(v) f_{(Y,V)}(y, v) µ(dy)ν(dv) = ∫ g(v) h(v) fV(v) ν(dv) = E[g(V)1A],

where we used the Fubini theorem and the definition of f_{Y|V} for the second equality, and Lemma 1.56 for the last. This shows (2.1) with H = σ(V) and Z = g(V); we conclude using Definition 2.2 and Lemma 2.3.
The existence of a probability kernel2 allows one to give a representation of the conditional expectation which holds simultaneously for all nice functions ϕ (but on a set of zero probability for V). When V is a discrete random variable, Remark 2.13 states that the kernel κ given by κ(v, dx) = 1_{v∈∆V} P_{X|v}(dx) is, with ν = PV, the conditional distribution of X given V.

1 If one takes ν = PV, then the density is constant equal to 1.
2 The existence of the conditional distribution of Y, taking values in S, given V can be proven under some topological property of the space (S, S). See Theorem 5.3 in O. Kallenberg. Foundations of modern probability. Springer-Verlag, 2002.
Example 2.18. In Example 2.14, with P = Sn/n, the conditional distribution of X1 given P is the Bernoulli distribution with parameter P. This corresponds to the kernel κ(p, dx) = (1 − p)δ0(dx) + pδ1(dx). (Notice one only needs to consider p ∈ [0, 1].)
In Exercise 9.20, the conditional distribution of Y given V is the uniform distribution on [0, V]. This corresponds to the kernel κ(v, dy) = v^{−1} 1_{[0,v]}(y) λ(dy). (Notice one only needs to consider v ∈ (0, +∞).) △
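As a numeric illustration of the second kernel (a sketch of ours; the law chosen for V is arbitrary), sampling Y uniformly on [0, V] and averaging over a small bin of values of V recovers E[Y|V = v] = v/2:

    import numpy as np

    # Sketch for the kernel kappa(v, dy) = v^{-1} 1_{[0,v]}(y) dy: conditionally
    # on V, Y is uniform on [0, V], so E[Y | V = v] = v/2.
    rng = np.random.default_rng(2)
    V = rng.exponential(1.0, size=10**6)     # arbitrary choice for the law of V
    Y = V * rng.random(10**6)                # draw Y from kappa(V, dy)

    v = 1.5
    sel = np.abs(V - v) < 0.05               # small bin around v
    print(Y[sel].mean(), v / 2)              # both close to 0.75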
Chapter 3

Discrete Markov chains
1 The set E is discrete if E is at most countable and all x ∈ E are isolated, that is, all subsets of E are open and closed. For example, the set N with the Euclidean distance is a discrete set, while the set {0} ∪ {1/k, k ∈ N∗} with the Euclidean distance is not a discrete set, as the set {0} is not open.
Definition 3.1. A filtration F = (Fn, n ∈ N) (with respect to the measurable space (Ω, F)) is a sequence of σ-fields such that Fn ⊂ Fn+1 ⊂ F for all n ∈ N. An E-valued process X = (Xn, n ∈ N) is F-adapted if Xn is Fn-measurable for all n ∈ N.
In the setting of stochastic processes, one usually (but not always) chooses the natural filtration F = (Fn, n ∈ N), which is generated by X: for all n ∈ N, Fn is the σ-field generated by (X0, . . . , Xn) and the P-null sets. This obviously implies that X is F-adapted.
A Markov chain is a process such that, conditionally on the process at time n, the past
before time n and the evolution of the process after time n are independent.
Definition 3.2. The process X = (Xn , n ∈ N) is a Markov chain with respect to the filtration
F = (Fn , n ∈ N) if it is adapted and it has the Markov property: for all n ∈ N, conditionally
on Xn , Fn and (Xk , k ≥ n) are independent, that is for all A ∈ Fn and B ∈ σ(Xk , k ≥ n),
P(A ∩ B| Xn ) = P(A| Xn )P(B| Xn ).
In the previous definition, we shall omit to mention the filtration when it is the natural
filtration. Since X is adapted to F, if X is a Markov chain with respect to F, it is also a
Markov chain with respect to its natural filtration.
The Markov property can be stated in the following equivalent ways, for an F-adapted process X:

(i) X is a Markov chain with respect to F.

(ii) For all n ∈ N and B ∈ σ(Xk, k ≥ n), we have a.s. P(B|Fn) = P(B|Xn).

(iii) For all n ∈ N and y ∈ E, we have a.s. P(Xn+1 = y|Fn) = P(Xn+1 = y|Xn).
Proof. That property (i) implies property (ii) is a direct consequence of Exercise 9.18 (with
A = Fn , B = σ(Xk , k ≥ n) and H = σ(Xn )). Let us check that property (ii) implies property
(i). Assume property (ii). Let A ∈ Fn and B ∈ σ(Xk , k ≥ n). A.s. we have, using property
(ii) for the second equality:

P(A ∩ B|Xn) = E[1A P(B|Fn)|Xn] = E[1A P(B|Xn)|Xn] = P(A|Xn) P(B|Xn).

This gives property (i). Taking B = {Xn+1 = y} in property (ii) gives property (iii). We now assume property
(iii) holds, and we prove property (ii). As σ(Xk , k ≥ n) is generated by the events Bk =
{Xn = y0 , . . . , Xn+k = yk } where k ∈ N and y0 , . . . , yk ∈ E, we deduce from the monotone
class theorem, and more precisely Corollary 1.14, that, to prove (ii), it is enough to prove
that a.s.
P(Bk | Fn ) = P(Bk | Xn ). (3.1)
We shall prove this by induction. Notice (3.1) is true for k = 1 thanks to (iii). Assume that (3.1) holds for some k ∈ N∗ and set Bk+1 = Bk ∩ {Xn+k+1 = yk+1}. Conditioning with respect to Fn+k and using property (iii), we get P(Bk+1|Fn) = E[1_{Bk} P(Xn+k+1 = yk+1|Xn+k)|Fn]. Since Xn+k = yk on Bk, the random variable 1_{Bk} P(Xn+k+1 = yk+1|Xn+k) is equal to 1_{Bk} P(Xn+k+1 = yk+1|Xn+k = yk), and the induction hypothesis then gives (3.1) for k + 1.
The Markov chain is called homogeneous when the sequence (Pn , n ∈ N∗ ) is constant, and its
common value, say P , is then called the2 transition matrix of X.
The transition matrix of the simple random walk described in Example 3.4 is given by P(x, y) = 0 if |x − y| ≠ 1, P(x, x + 1) = p and P(x, x − 1) = 1 − p for x, y ∈ Z.
Unless specified otherwise, we shall consider homogeneous Markov chains.
The next proposition states that the transition matrix and the initial distribution char-
acterize the distribution of the Markov chain.
In order to stress the dependence of the distribution of the Markov chain X on the
probability distribution µ0 of X0 , we may write Pµ0 and Eµ0 . When µ0 is simply the Dirac
mass at x (that is P(X0 = x) = 1), then we simply write Px and Ex and say the Markov
chain is started at x.
Proof. For k ∈ N∗ and x0, . . . , xk+1 ∈ E, with Bk = {X0 = x0, . . . , Xk = xk}, we have:

P(Xk+1 = xk+1, Bk) = E[E[1_{Xk+1 = xk+1} 1_{Bk} | Fk]]
                   = E[P(Xk+1 = xk+1 | Fk) 1_{Bk}]
                   = E[P(Xk, xk+1) 1_{Bk}]
                   = P(xk, xk+1) P(Bk),

where we used that Bk ∈ Fk for the second equality, (3.2) for the third, and that Xk = xk on Bk for the last. We then deduce that (3.4) holds by induction.
Use that the sets {(x0, . . . , xn)} for x0, . . . , xn ∈ E generate the product σ-field on E^{n+1} and Lemma 1.29 to deduce that the left-hand side of (3.4) for all n ∈ N and x0, . . . , xn ∈ E characterizes the distribution of X. We then deduce from (3.4) that the distribution of X is characterized by P and µ0.
Example 3.10. Let Xn be the number of items in a stock at time n, Dn+1 the random
consumer demand and q ∈ N∗ the deterministic quantity of items produced between period
n and n + 1. Considering the stock at time n + 1, we get Xn+1 = (Xn + q − Dn+1)+, where z+ = max(z, 0).
Figure 3.1: Simulations of the random evolution of a stock with dynamics Xn+1 = (Xn + q − Dn+1)+, where X0 = 0, q = 3 and the random variables (Dn, n ∈ N∗) are independent with Poisson distribution of parameter θ (θ = 4 on the left and θ = 3 on the right).
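These trajectories are straightforward to reproduce. Below is a minimal Python sketch of the dynamics of Figure 3.1 (the function name simulate_stock and the use of numpy are our choices for illustration, not from the text):

import numpy as np

def simulate_stock(n_steps, q=3, theta=4.0, x0=0, seed=0):
    """Simulate X_{n+1} = max(X_n + q - D_{n+1}, 0) with i.i.d. Poisson(theta) demands."""
    rng = np.random.default_rng(seed)
    x = np.empty(n_steps + 1, dtype=int)
    x[0] = x0
    demands = rng.poisson(theta, size=n_steps)
    for n in range(n_steps):
        x[n + 1] = max(x[n] + q - demands[n], 0)
    return x

# One trajectory for each demand intensity, as in Figure 3.1.
for theta in (4.0, 3.0):
    path = simulate_stock(1000, theta=theta)
    print(f"theta={theta}: final stock {path[-1]}, max stock {path.max()}")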
Remark 3.11. Even if a Markov chain is not a stochastic dynamical system, it is distributed as one. Indeed, let µ0 be a probability distribution on E and P a stochastic matrix on E. Let X0 be an E-valued random variable with distribution µ0 and (Un, n ∈ N) be a sequence of independent random variables, independent of X0 and distributed as U = (U(x), x ∈ E), where the U(x) are independent E-valued random variables such that U(x) has distribution P(x, ·). Then the stochastic dynamical system (Xn, n ∈ N), defined by Xn+1 = Un+1(Xn) for n ∈ N, is a Markov chain on E with initial distribution µ0 and transition matrix P. ♦
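On a finite state space, this construction boils down to drawing Xn+1 from the row P(Xn, ·) at each step. A minimal sketch (the helper sample_chain is ours, not from the text):

import numpy as np

def sample_chain(P, mu0, n_steps, seed=0):
    """Sample a trajectory of a Markov chain on {0, ..., d-1} with
    transition matrix P and initial distribution mu0."""
    rng = np.random.default_rng(seed)
    d = len(mu0)
    x = np.empty(n_steps + 1, dtype=int)
    x[0] = rng.choice(d, p=mu0)              # X_0 ~ mu0
    for n in range(n_steps):
        x[n + 1] = rng.choice(d, p=P[x[n]])  # X_{n+1} ~ P(X_n, .)
    return x

# Two-state example.
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])
print(sample_chain(P, mu0=np.array([1.0, 0.0]), n_steps=20))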
The next corollary is a consequence of the Markov property.
Corollary 3.12. Let X = (Xn , n ∈ N) be a Markov chain with respect to the filtration
F = (Fn , n ∈ N), taking values in a discrete state space E and with transition matrix P . Let
n ∈ N and define the shifted process X̃ = (X̃k = Xn+k, k ∈ N). Conditionally on Xn, we
have that Fn and X̃ are independent and that X̃ is a Markov chain with transition matrix P
started at Xn , which means that a.s. for all k ∈ N, all x0 , . . . , xk ∈ E:
Notice that in the previous corollary the initial distributions of the Markov chains X and X̃ are a priori not the same.
Proof. By definition of a Markov chain, we have that, conditionally on Xn , Fn and X̃ are
independent. So, we only need to prove that:
P(X̃0 = x0, . . . , X̃k = xk | Fn) = 1_{Xn = x0} ∏_{j=1}^{k} P(x_{j−1}, x_j).   (3.6)
where we used that Xn+j = xj on Bj for the last equality. This implies that P(Bj+1 | Fn ) =
P (xj , xj+1 ) P(Bj | Fn ). Thus, we deduce that (3.6) holds by induction. Then, conclude using
Proposition 3.8 on the characterization of the distribution of a Markov chain.
In the setting of Markov chains, computing probability distributions or expectations reduces to elementary linear algebra on E. Let P and Q be two matrices defined on E with non-negative entries. We denote by PQ the matrix on E defined by PQ(x, y) = ∑_{z∈E} P(x, z)Q(z, y) for x, y ∈ E. It is easy to check that if P and Q are stochastic, then PQ is also stochastic. We set P^0 = I_E, the identity matrix on E, and for k ≥ 1, P^k = P^{k−1}P (or equivalently P^k = P P^{k−1}).
For a row vector µ = (µ(x), x ∈ E) with non-negative entries, which we shall see as a measure on E, we denote by µP the row vector (µP(y), y ∈ E) defined by µP(y) = ∑_{x∈E} µ(x)P(x, y). For a column vector f = (f(y), y ∈ E) with real entries, which we shall see as a function defined on E, we denote by Pf or P(f) the column vector (Pf(x), x ∈ E) defined by Pf(x) = ∑_{y∈E} P(x, y)f(y). Notice this last quantity, and thus Pf, is well defined as soon as, for all x ∈ E, P(f⁺)(x) or P(f⁻)(x) is finite. We also write µf = (µ, f) = ∑_{x∈E} µ(x)f(x) for the integral of the function f with respect to the measure µ, when it is well defined.
We shall consider a measure µ = (µ(x) = µ({x}), x ∈ E) on E as a row vector with non-negative entries. For A ⊂ E, we set µ(A) = ∑_{x∈A} µ(x), so that µ is a probability measure if ∑_{x∈E} µ(x) = 1. We shall also consider a real-valued function f = (f(x), x ∈ E) defined on E as a column vector. The next results give explicit formulas to compute (conditional) expectations and distributions.
Proposition 3.13. Let X = (Xn , n ∈ N) be a Markov chain with transition matrix P .
Denote by µn the probability distribution of Xn for n ∈ N. Let f be a bounded or non-
negative function. We have for n ∈ N∗ :
(i) µn = µ0 P n ,
Figure 3.2: Graph of the function x ↦ P(Ln ≥ ⌊x⌋), with Ln the maximal length of the runs of consecutive 1's in a sequence of length n = 100 of independent Bernoulli random variables with parameter p = 1/2.
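Coming back to Proposition 3.13, property (i) reduces the computation of the marginal distributions to matrix powers. A quick numerical sketch (the stochastic matrix P below is an arbitrary example of ours):

import numpy as np

P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.5, 0.3],
              [0.0, 0.4, 0.6]])   # a stochastic matrix on E = {0, 1, 2}
mu0 = np.array([1.0, 0.0, 0.0])   # start at state 0

# mu_n = mu_0 P^n (row vector times matrix power).
for n in (1, 5, 50):
    mu_n = mu0 @ np.linalg.matrix_power(P, n)
    print(n, mu_n.round(4))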
Let X = (Xn, n ∈ N) be a Markov chain with transition matrix P whose initial distribution µ0 = π is an invariant probability measure for P. Denote by µn the probability distribution of Xn. We have µ1 = πP = π and, by induction, µn = π for all n ∈ N∗. This means that Xn is distributed as X0: the distribution of Xn is stationary, that is, constant in time.
Remark 3.16. Let X = (Xn, n ∈ N) be a Markov chain with transition matrix P whose initial distribution µ0 = π is an invariant probability measure for P. For simplicity, let us assume further that π(x) > 0 for all x ∈ E. For x, y ∈ E, we set:

Q(x, y) = π(y)P(y, x) / π(x).   (3.7)

Since π is an invariant probability measure, we have ∑_{y∈E} Q(x, y) = 1 for all x ∈ E. Thus the matrix Q is stochastic. Notice that π is also an invariant probability measure for Q. For x, y ∈ E, n ∈ N, we have:

Pπ(Xn = y | Xn+1 = x) = Pπ(Xn = y, Xn+1 = x) / Pπ(Xn+1 = x) = Q(x, y).
In other words, (Xn, Xn−1, . . . , X0) is distributed under Pπ as the first n steps of a Markov chain with transition matrix Q and initial distribution π. Intuitively, the time reversal of the process X under π is a Markov chain with transition matrix Q. ♦
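Formula (3.7) is easy to check numerically: build Q from P and π and verify that Q is stochastic and leaves π invariant. A sketch (here π is obtained as the normalized left eigenvector of P for the eigenvalue 1, which is one standard way to compute it on a finite state space; the matrix P is an arbitrary example of ours):

import numpy as np

P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.5, 0.3],
              [0.0, 0.4, 0.6]])

# Invariant probability pi: left eigenvector of P for eigenvalue 1, normalized.
vals, vecs = np.linalg.eig(P.T)
pi = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
pi /= pi.sum()

# Time-reversal matrix from (3.7): Q(x, y) = pi(y) P(y, x) / pi(x).
Q = P.T * pi[None, :] / pi[:, None]

print("rows of Q sum to 1:", np.allclose(Q.sum(axis=1), 1.0))
print("pi invariant for Q:", np.allclose(pi @ Q, pi))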
There is an important particular case where a probability measure π is invariant for a
stochastic matrix.
For reversible Markov chains, see the examples of the Ehrenfest urn model and the Metropolis-Hastings algorithm in Section 3.5.
Remark 3.19. If P in Remark 3.16 is also reversible with respect to the probability mea-
sure π, then we get P = Q. Therefore, under Pπ , we get that (X0 , . . . , Xn−1 , Xn ) and
(Xn, Xn−1, . . . , X0) have the same distribution. We give a stronger statement in the next remark. ♦
Remark 3.20. Let P be a stochastic matrix on E reversible with respect to a probability measure π. The following construction is inspired by Remark 3.11. Let (Un, n ∈ Z∗) be a sequence of independent random variables distributed as U = (U(x), x ∈ E), where the U(x) are independent E-valued random variables and U(x) is distributed as P(x, ·). Let X0 be an E-valued random variable independent of (Un, n ∈ Z∗) with distribution π. For n ∈ N, set Xn+1 = Un+1(Xn) and X−(n+1) = U−(n+1)(X−n). Then the process X = (Xn, n ∈ Z) can be seen as a Markov chain with time index Z instead of N in Definition 3.2 (the proof of this fact is left to the reader). We deduce from Remark 3.16 that X̃ = (X̃n = X−n, n ∈ Z) is then also a Markov chain with time index Z. It is called the time reversal process of X. One can easily check that its transition matrix is P, so that X and X̃ have the same distribution. ♦
The Markov chain in the Ehrenfest urn model, see Section 3.5, is irreducible. The Markov chain of the Wright-Fisher model, see Section 3.5, has two absorbing states 0 and N and one open communicating class {1, . . . , N − 1}. 4
We set N^x = ∑_{n∈N} 1_{Xn = x}, the number of visits of the state x. The next proposition gives a characterization of transience and recurrence.
Proposition 3.23. Let X be a Markov chain on E with transition matrix P .
(i) Let x ∈ E be recurrent. Then we have Px(N^x = ∞) = 1 and ∑_{n∈N} P^n(x, x) = +∞.
To have a complete picture, in view of property (iv) above, we shall study closed communicating classes (see Remark 3.25 below for a first result in this direction). For this reason, we shall consider Markov chains started in a closed communicating class. This amounts to studying irreducible Markov chains, as a Markov chain started in a closed communicating class remains in it.
where we used the Markov property at time r for the third equality. Using that Px(N^x > 0) = 1, we deduce that Px(N^x > n) = (1 − p)^n for n ∈ N. This gives that N^x has under Px a geometric distribution with parameter p ∈ [0, 1]. Notice also that Ex[N^x] = ∑_{n∈N} Px(Xn = x) = ∑_{n∈N} P^n(x, x), which is finite if and only if p > 0. Thus, if x is transient, then p > 0 and we get Px(N^x < ∞) = 1 and Ex[N^x] is finite. And, if x is recurrent, then p = 0 and we get Px(N^x < ∞) = 0 and Ex[N^x] is infinite. This proves property (i) and the first part of property (ii).
We prove the second part of property (ii). Let ν be a probability measure on E. As x is transient, by decomposing according to the values of T^x and using the Markov property for the first equality, we get:

Pν(N^x = +∞) = ∑_{n∈N∗} Pν(T^x = n) Px(N^x = +∞) = 0,
Since (1/n) ∑_{k=1}^{n} 1_{Xk = x} is bounded by 1, we deduce that lim_{n→∞} (1/n) ∑_{k=1}^{n} Pπ(Xk = x) = 0 by dominated convergence.
P^{n+n1+n2}(y, y) ≥ P^{n1}(y, x) P^n(x, x) P^{n2}(x, y),   (3.10)

P^{n+n1+n2}(x, x) ≥ P^{n2}(x, y) P^n(y, y) P^{n1}(y, x).   (3.11)

This implies that the sums ∑_{n∈N} P^n(x, x) and ∑_{n∈N} P^n(y, y) are both either converging or
diverging. Thanks to properties (i) and (ii), we get that either x and y are both transient or
both recurrent. This gives (iii).
We now prove property (iv). If C is an open communicating class, then there exist x ∈ C and y ∉ C such that P(x, y) > 0. Since x is not accessible from y, we get Py(T^x = ∞) = 1.
Using the Markov property, we get that Px (T x = ∞) ≥ P (x, y)Py (T x = ∞) > 0. This gives
that x is transient.
According to property (iii) of Proposition 3.23, an irreducible Markov chain is either transient or recurrent. In the former case, the probability of {N^x < ∞} is equal to 1 for every choice of the initial distribution. The next lemma asserts that for an irreducible recurrent Markov chain, the probability of {N^x < ∞} is equal to 0 for every choice of the initial distribution.
Lemma 3.24. Let X be an irreducible Markov chain on E. If X is transient, then P(N x <
∞) = 1 for all x ∈ E. If X is recurrent, then P(N x = ∞) = 1 for all x ∈ E.
Proof. For the transient case, see property (ii) of Proposition 3.23. We assume that X is
recurrent. Let x ∈ E. By decomposing according to the values of T x and using the Markov
property for the first equality and property (i) of Proposition 3.23 for the second, we get:
P(N^x < ∞) = P(T^x = ∞) + ∑_{n∈N} P(T^x = n) Px(N^x < ∞) = P(T^x = ∞).   (3.12)
where for the first equality we used that Px(N^x = ∞) = 1, and for the second the Markov property at time m and that Py(Xn = x for some n ≥ 1) = Py(T^x < ∞). As ∑_{y∈E} Px(Xm = y) = 1 and Py(T^x < ∞) ≤ 1, we deduce that Py(T^x < ∞) = 1 for all y ∈ E such that Px(Xm = y) > 0. Since X is irreducible, for all y ∈ E, there exists m ∈ N∗ such that Px(Xm = y) > 0. We deduce that Py(T^x < ∞) = 1 for all y ∈ E and thus P(T^x < ∞) = 1. Then use (3.12) to get P(N^x < ∞) = 0.
Remark 3.25. Let X be an irreducible Markov chain on a finite state space E. Since ∑_{x∈E} N^x = ∞ and E is finite, we deduce that P(N^x = ∞ for some x ∈ E) = 1. This implies that P(N^x = ∞) > 0 for some x ∈ E. We deduce from Lemma 3.24 that X is recurrent. Thus, all elements of a finite closed communicating class are recurrent. ♦
3.3.3 Periodicity
In Example 3.4 of the simple random walk X = (Xn, n ∈ N), we notice that if X0 is even (resp. odd), then X2n+1 is odd (resp. even) and X2n is even (resp. odd) for n ∈ N. Therefore the state space Z can be written as the disjoint union of two subsets: the even integers, 2Z, and the odd integers, 2Z + 1. And, a.s. the Markov chain jumps from one subset to the other one. From Lemma 3.28 below, we see that X has period 2 in this example.
Definition 3.26. Let X be a Markov chain on E with transition matrix P . The period d of
a state x ∈ E is the greatest common divisor (GCD) of the set {n ∈ N∗ ; P n (x, x) > 0}, with
the convention that d = ∞ if this set is empty. The state is aperiodic if d = 1.
Notice that the set {n ∈ N∗ ; P n (x, x) > 0} is empty if and only if Px (T x = ∞) = 1, and
that this also implies that {x} is an open communicating class.
Proposition 3.27. Let X be a Markov chain on E with transition matrix P . We have the
following properties.
(i) If x ∈ E has a finite period d, then there exists n0 ∈ N such that P nd (x, x) > 0 for all
n ≥ n0 .
(ii) The elements of the same communicating class have the same period.
In view of (ii) above, we get that if X is irreducible, then all the states have the same
finite period. For this reason, we shall say that an irreducible Markov chain is aperiodic
(resp. has period d) if one of the states is aperiodic (resp. has period d).
3.3. IRREDUCIBILITY, RECURRENCE, TRANSIENCE, PERIODICITY 47
Proof. We first consider the case d = 1. Let x ∈ E be aperiodic. We consider the non-empty
set I = {n ∈ N∗ ; P n (x, x) > 0}. Since P n+m (x, x) ≥ P n (x, x)P m (x, x), we deduce that I
is stable by addition. By hypothesis, there exist n1, . . . , nK ∈ I which are relatively prime. According to Bézout's lemma, there exist a1, . . . , aK ∈ Z such that ∑_{k=1}^{K} ak nk = 1. We set n+ = ∑_{k=1; ak>0}^{K} ak nk and n− = ∑_{k=1; ak<0}^{K} |ak| nk. If n− = 0, then we deduce that 1 ∈ I and so (i) is proved with n0 = 1. We assume now that n− ≥ 1. We get that n+, n− ∈ I and n+ − n− = 1. Let n ≥ n−². Considering the Euclidean division of n by n−, we get there exist integers r ∈ {0, . . . , n− − 1} and q ≥ n− such that n = q n− + r = (q − r) n− + r n+.
Since q − r ≥ 0 and I is stable by addition, we get that n ∈ I. This proves property (i) with n0 = n−².
For d ≥ 2 finite, consider Q = P^d. It is easy to check that x is then aperiodic when considering the Markov chain with transition matrix Q. Thus, there exists n0 ≥ 1 such that Q^n(x, x) > 0 for all n ≥ n0, that is P^{nd}(x, x) > 0 for all n ≥ n0. This proves property (i).
Proof. Since X is irreducible, we get that the period d is finite. Let x0 ∈ E. Consider the sets Ei = {x ∈ E; there exists n ∈ N such that P^{nd+i}(x0, x) > 0} for i ∈ J0, d − 1K. Since X is irreducible, for x ∈ E there exists m ∈ N such that P^m(x0, x) > 0. This gives that x ∈ Ei with i = m mod (d). We deduce that E = ∪_{i=0}^{d−1} Ei.
If x ∈ Ei ∩ Ej, then using that P^k(x, x0) > 0 for some k ∈ N, we get there exist n, m ∈ N such that P^{nd+i+k}(x0, x0) > 0 and P^{md+j+k}(x0, x0) > 0. By definition of the period, we deduce that i = j mod (d). This implies that Ei ∩ Ej = ∅ if i ≠ j and i, j ∈ J0, d − 1K.
To conclude, notice that if x ∈ Ei, that is P^{nd+i}(x0, x) > 0 for some n ∈ N, and z ∈ E is such that P(x, z) > 0, then we get that P^{nd+i+1}(x0, z) > 0 and thus z ∈ Ei+1. This readily implies (3.13). Since x0 ∈ E0, we get that E0 is non-empty. Using (3.13), we get by induction that Ei is non-empty for all i ∈ J0, d − 1K. Thus, (Ei, i ∈ J0, d − 1K) is a partition of E.
Lemma 3.29. Let X = (Xn , n ∈ N) and Y = (Yn , n ∈ N) be two independent Markov chains
with respective discrete state spaces E and F . Then, the process Z = ((Xn , Yn ), n ∈ N) is a
Markov chain with state space E × F . If π (resp. ν) is an invariant probability measure for
X (resp. Y ), then π ⊗ ν is an invariant probability measure for Z. If X and Y are irreducible
and furthermore X or Y is aperiodic, then Z is irreducible on E × F .
Proof. Let P and Q be the transition matrix of X and Y . Using the independence of X
and Y , it is easy to prove that Z is a Markov chain with transition matrix R given by
R(z, z 0 ) = P (x, x0 )Q(y, y 0 ) with z = (x, y), z 0 = (x0 , y 0 ) ∈ E × F .
If π (resp. ν) is an invariant measure for X (resp. Y ), then we have for z = (x, y) ∈ E ×F :
(π⊗ν)R(z) = ∑_{x′∈E, y′∈F} π(x′) ν(y′) R((x′, y′), (x, y)) = ∑_{x′∈E, y′∈F} π(x′) ν(y′) P(x′, x) Q(y′, y) = π⊗ν(z).
For an irreducible transient Markov chain, we recall that Px (T x = +∞) > 0 and thus
Ex [T x ] = +∞ for all x ∈ E, so that π = 0.
Definition 3.30. A recurrent state x ∈ E is null recurrent if π(x) = 0 and positive recurrent
if π(x) > 0. The Markov chain is null (resp. positive) recurrent if all the states are null
(resp. positive) recurrent.
We shall consider asymptotic events whose probability depends only on the transition matrix and not on the initial distribution of the Markov chain. This motivates the following definition. An event A ∈ σ(X) is said to be almost sure (a.s.) for a Markov chain X = (Xn, n ∈ N) if Px(A) = 1 for every starting state x ∈ E of X, or equivalently Pµ0(A) = 1 for every initial distribution µ0 of X0.
The next two fundamental theorems on the asymptotics of irreducible Markov chains will be proved in Section 3.4.3.
Theorem 3.31. Let X = (Xn , n ∈ N) be an irreducible Markov chain on E. Let π be given
by (3.14).
(i) The Markov chain X is either transient or null recurrent or positive recurrent.
(ii) If the Markov chain is transient or null recurrent, then there is no invariant probability
measure. Furthermore, we have π = 0.
The next result deals specifically with irreducible positive recurrent Markov chains. The definition of convergence in distribution for sequences of random variables and some of its characterizations are given in Section 8.2.1.
Theorem 3.32 (Ergodic theorem). Let X = (Xn , n ∈ N) be an irreducible positive recurrent
Markov chain on E.
(i) The measure π defined by (3.14) is the unique invariant probability of X. (And we have
π(x) > 0 for all x ∈ E.)
(ii) For all real-valued functions f defined on E such that (π, f) is well defined, we have:

(1/n) ∑_{k=1}^{n} f(Xk) → (π, f) a.s. as n → ∞.   (3.16)

(iii) If X is aperiodic, then we have the convergence in distribution of Xn towards π and:

lim_{n→∞} ∑_{y∈E} |P^n(x, y) − π(y)| = 0 for all x ∈ E.   (3.17)
In particular, for an irreducible positive recurrent Markov chain, the empirical mean or time average converges a.s. to the spatial average with respect to the invariant probability measure. In the aperiodic case, we also get that the asymptotic behavior of the Markov chain is given by the stationary regime. We give the following easy-to-remember corollary.
Corollary 3.33. An irreducible Markov chain X = (Xn , n ∈ N) on a finite state space is
positive recurrent: π defined by (3.14) is its unique invariant probability measure, π(x) > 0
for all x ∈ E and (3.16) holds for all R-valued function f defined on E. If furthermore X is
aperiodic, then the sequence (Xn , n ∈ N) converges in distribution towards π.
Proof. Summing (3.15) over x ∈ E, we get that ∑_{x∈E} π(x) = 1. Thus the Markov chain is positive recurrent according to Theorem 3.31, properties (i)-(ii), and Theorem 3.32, property (i). The remaining part of the corollary is a direct consequence of Theorem 3.32.
The convergence of the empirical means, see (3.16), for irreducible positive recurrent Markov chains is a generalization of the strong law of large numbers recalled in Section 8.2.2. Indeed, if X = (Xn, n ∈ N) is a sequence of independent random variables taking values in E with the same distribution π, then X is a Markov chain with transition matrix P whose rows are all equal to π (that is, P(x, y) = π(y) for all x, y ∈ E). Notice then that P is reversible with respect to π. Assume for simplicity that π(x) > 0 for all x ∈ E, so that X is irreducible with invariant probability π. Then (3.16) corresponds exactly to the strong law of large numbers. By the way, the initial motivation for the introduction of Markov chains by Markov5 in 1906 was to extend the law of large numbers and the central limit theorem (CLT) to sequences of dependent random variables.
Finally, notice that the limits in (3.16) or in (iii) of Theorem 3.31 do not involve the initial distribution of the Markov chain. Forgetting the initial condition is an important property of Markov chains.
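Both phenomena, the convergence of the empirical means (3.16) and the forgetting of the initial condition, are easy to observe numerically. A minimal sketch (the chain P and the function f are arbitrary choices of ours; π is obtained as the left eigenvector of P for the eigenvalue 1):

import numpy as np

rng = np.random.default_rng(1)
P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.5, 0.3],
              [0.0, 0.4, 0.6]])
f = np.array([1.0, -1.0, 2.0])    # a function on E = {0, 1, 2}

# Invariant probability pi (left eigenvector for eigenvalue 1, normalized).
vals, vecs = np.linalg.eig(P.T)
pi = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
pi /= pi.sum()

# Empirical mean of f along one long trajectory, started at state 0.
n, x, total = 100_000, 0, 0.0
for _ in range(n):
    x = rng.choice(3, p=P[x])
    total += f[x]
print("empirical mean:", total / n, " (pi, f):", pi @ f)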
It is legitimate to expect that the variance of the limit Gaussian random variable is the limit of Var(√n In(f)) and, as the mean in time corresponds intuitively to the average under the invariant probability measure, this would be, as (π, f) = 0:

σ(f)² = Eπ[f²(X0)] + 2 Eπ[∑_{j∈N∗} f(X0) f(Xj)] = (π, f²) + 2 ∑_{j∈N∗} (π, f P^j f).   (3.18)
To be precise, we state Theorems II.4.1 and II.4.3 from [1]. For x ∈ E, set:
Hn(x) = ∑_{y∈E} |P^n(x, y) − π(y)|.
We recall that according to (3.17), if X is aperiodic then we have the ergodicity property
limn→∞ Hn (x) = 0 for all x ∈ E.
5
G. Basharin, A. Langville, V. Naumov. The life and work of A.A. Markov. Linear Algebra and its
Applications, vol. 386, pp. 3-26, 2004
Theorem. Let X be an irreducible positive recurrent and aperiodic Markov chain with invari-
ant probability measure π. Let f be a real-valued function defined on E such that (π, f 2 ) < +∞
and (π, f) = 0. If one of the two following conditions is satisfied:

(i) f is bounded and ∑_{n∈N∗} (π, Hn) < +∞ (ergodicity of degree 2);

(ii) lim_{n→∞} sup_{x∈E} Hn(x) = 0 (uniform ergodicity);

then σ(f)² given by (3.18) is finite and non-negative, and √n In(f) converges in distribution towards N(0, σ(f)²) as n → ∞.
Usually the variance σ(f)² is positive, but for some particular Markov chains and particular functions f, it may be null. Concerning hypotheses (i) and (ii) in the previous theorem, we also mention that uniform ergodicity implies there exists c > 1 such that sup_{x∈E} Hn(x) ≤ c^{−n} for large n, which in turn implies ergodicity of degree 2. Notice that if the state space E is finite, then an irreducible aperiodic Markov chain is uniformly ergodic.
Based on the excursion approach developed in Section 3.4.3, it is also possible to give an
alternative result for the CLT of Markov chains, see Theorems 17.2.2, 17.4.4 and 17.5.3 in
[7]. For f a real-valued function defined on E and x ∈ E, we set, when it is well defined:

Sx(f) = ∑_{k=1}^{T^x} f(Xk).
Theorem. Let X be an irreducible positive recurrent Markov chain with invariant probability
measure π. Let f be a real-valued function defined on E such that (π, f ) is well defined with
(π, f ) = 0. Let x ∈ E such that Ex [Sx (1)2 ] = Ex [(T x )2 ] < +∞ and Ex [Sx (|f |)2 ] < +∞ (so
that Sx(f) is a.s. well defined). Set

σ′(f)² = π(x) Ex[Sx(f)²].   (3.19)

Then we have that √n In(f) converges in distribution towards N(0, σ′(f)²) as n → ∞.
Furthermore (3.19) holds for all x ∈ E.
Another approach is based on the Poisson equation. Assume (π, |f|) is finite. We say that an R-valued function f̂ is a solution to the Poisson equation if P f̂ is well defined and:

f̂ − P f̂ = f − (π, f).   (3.20)
Theorem. Let X be an irreducible positive recurrent Markov chain with invariant probability
measure π. Let f be a real-valued function defined on E such that (π, |f |) < +∞ and (π, f ) =
0. Assume there exists a solution f̂ to the Poisson equation such that (π, f̂²) < +∞. Set

σ″(f)² = (π, f̂² − (P f̂)²).   (3.21)

Then we have that √n In(f) converges in distribution towards N(0, σ″(f)²) as n → ∞.
Of course, the asymptotic variances given by (3.18), (3.19) and (3.21) coincide when the hypotheses of the three previous theorems hold. This is in particular the case if E is finite (even if X is periodic).
Notice the measure νx is infinite as (νx , 1) = Ex [T x ] = +∞. According to [2, 1, 4], we have
the following results.
Theorem. Let X = (Xn , n ∈ N) be a Markov chain with transition matrix P . If x is
recurrent then νx is an invariant measure for P .
If furthermore X is irreducible null recurrent, then we get the following results:
(i) The measure νx is the only invariant measure (up to a positive multiplicative constant)
and νx (y) > 0 for all y ∈ E. And for all y, z ∈ E, we have νy (z) = νx (z)/νx (y).
(ii) For all R-valued functions f, g defined on E such that (ν, f ) is well defined and g is
non-negative with 0 < (ν, g) < +∞, we have:
(∑_{k=1}^{n} f(Xk)) / (∑_{k=1}^{n} g(Xk)) → (ν, f)/(ν, g) a.s. as n → ∞.
Proof. Property (ii) of Proposition 3.23 implies that π = 0 and that ∑_{k∈N} 1_{Xk = x} = N^x is a.s. finite. We deduce that (3.15) holds. Then use that π = 0 and Lemma 3.34 to deduce that X has no invariant probability measure.
Y_n = (T_n^x − T_{n−1}^x, X_{T_{n−1}^x}, X_{T_{n−1}^x + 1}, . . . , X_{T_n^x}).   (3.22)
The random variable Yn describes the n-th excursion out of the state x. Notice that x is the end point of the excursion, that is X_{T_n^x} = x, and for n ≥ 2 it is also the starting point of the excursion as X_{T_{n−1}^x} = x. So Yn takes values in the discrete space E^traj = ∪_{k∈N∗} {k} × E^k × {x}.
The next lemma is the key ingredient to prove the asymptotic results for recurrent Markov
chains.
Lemma 3.36. Let X be an irreducible recurrent Markov chain. The random variables
(Yn , n ∈ N∗ ) defined by (3.22) are independent. And the random variables (Yn , n ≥ 2) are all
distributed as Y1 under Px .
Proof. For y = (r, x0, . . . , xr) ∈ E^traj, we set t_y = r, the length of the excursion, and we recall that the end point of the excursion is equal to x: x_r = x. We shall first prove that for all n ∈ N∗, y1, . . . , yn ∈ E^traj, we have:

P(Y1 = y1, . . . , Yn = yn) = P(Y1 = y1) ∏_{k=2}^{n} Px(Y1 = yk).   (3.23)
Then, we get (3.23) by induction. Use Definition 1.31 and (3.23) for any n ∈ N∗ and
y1 , . . . , yn ∈ E traj to conclude.
We will now prove (3.15) for irreducible recurrent Markov chains. This and Lemma 3.35
will give property (iii) from Theorem 3.31.
Proposition 3.37. Let X be an irreducible recurrent Markov chain. Then (3.15) holds.
Proof. Let x ∈ E be fixed. Since T_n^x = T_1^x + ∑_{k=2}^{n} (T_k^x − T_{k−1}^x), with T_1^x a.s. finite, and (T_k^x − T_{k−1}^x, k ≥ 2) are, according to Lemma 3.36, independent positive random variables distributed as T^x under Px, we deduce from the law of large numbers, see Theorem 8.15, that:

T_n^x / n → Ex[T^x] a.s. as n → ∞.   (3.24)
We define the number of visits of x from time 1 to n ∈ N∗:

N_n^x = ∑_{k=1}^{n} 1_{Xk = x}.   (3.25)
By construction, we have:

T^x_{N_n^x} ≤ n < T^x_{N_n^x + 1}.   (3.26)

This gives (N_n^x / (N_n^x + 1)) · ((N_n^x + 1) / T^x_{N_n^x + 1}) ≤ N_n^x / n ≤ N_n^x / T^x_{N_n^x}. Since x is recurrent, we get that a.s. lim_{n→∞} N_n^x = +∞. We deduce from (3.24) that a.s. lim_{n→∞} N_n^x / n = 1/Ex[T^x] = π(x).
The next lemma and property (iii) of Proposition 3.23 give property (i) of Theorem 3.31.
Lemma 3.38. Let X be an irreducible recurrent Markov chain. Then, it is either null
recurrent or positive recurrent.
Proof. Let x ∈ E. Notice the left-hand side of (3.15) is bounded by 1. Integrating (3.15) with respect to Px, we get by dominated convergence that lim_{n→∞} (1/n) ∑_{k=1}^{n} P^k(x, x) = π(x). Since X is irreducible, we deduce from (3.11) that if the above limit is zero for a given x, it is zero for all x ∈ E. This implies that either π = 0 or π(x) > 0 for all x ∈ E.
The proof of the next lemma is a direct consequence of Lemma 3.34 and the fact that
π = 0 for irreducible null recurrent Markov chains.
Lemma 3.39. Let X be an irreducible null recurrent Markov chain. Then, there is no
invariant probability measure.
Lemmas 3.35 and 3.39 imply property (ii) of Theorem 3.31. This ends the proof of
Theorem 3.31.
The end of this section is devoted to the proof of Theorem 3.32. From now on we assume
that X is irreducible and positive recurrent.
Proposition 3.40. Let X be an irreducible positive recurrent Markov chain. Then, the
measure π defined in (3.14) is a probability measure. For all real-valued function f defined
on E such that (π, f ) is well defined, we have (3.16).
Proof. Let x ∈ E. We keep notations from the proof of Lemma 3.36. Let f be a finite
non-negative function defined on E. We set, for y = (r, x0, . . . , xr) ∈ E^traj:

F(y) = ∑_{k=1}^{r} f(xk).
According to Lemma 3.36, the random variables (F(Yn), n ≥ 2) are independent, non-negative and distributed as F(Y1) under Px. As F(Y1) is finite, we deduce from the law of large numbers, see Theorem 8.15, that a.s. lim_{n→∞} (1/n) ∑_{k=1}^{n} F(Yk) = Ex[F(Y1)]. Since ∑_{i=1}^{T_n^x} f(Xi) = ∑_{k=1}^{n} F(Yk), we deduce from (3.24) that:

(1/T_n^x) ∑_{i=1}^{T_n^x} f(Xi) = (n/T_n^x) · (1/n) ∑_{k=1}^{n} F(Yk) → π(x) Ex[F(Y1)] a.s. as n → ∞.
Recall that T^x_{N_n^x} ≤ n < T^x_{N_n^x + 1} from (3.26). Since f is non-negative, we deduce that:

(T^x_{N_n^x} / T^x_{N_n^x + 1}) · (1/T^x_{N_n^x}) ∑_{i=1}^{T^x_{N_n^x}} f(Xi) ≤ (1/n) ∑_{i=1}^{n} f(Xi) ≤ (T^x_{N_n^x + 1} / T^x_{N_n^x}) · (1/T^x_{N_n^x + 1}) ∑_{i=1}^{T^x_{N_n^x + 1}} f(Xi).
Since a.s. lim_{n→∞} N_n^x = +∞, lim_{n→∞} T_n^x = +∞ and lim_{n→∞} T_n^x / T_{n+1}^x = 1, see (3.24), we deduce that:

(1/n) ∑_{i=1}^{n} f(Xi) → π(x) Ex[F(Y1)] a.s. as n → ∞.   (3.27)
Taking f = 1_{y} in the equation above, we deduce from (3.15) that:

π(y) = π(x) Ex[∑_{k=1}^{T^x} 1_{Xk = y}].   (3.28)
Using then (3.27), we deduce that (3.16) holds when f is finite and non-negative. If f is non-negative but not finite, the result is immediate as N^x = ∞ a.s. for all x ∈ E and (π, f) = +∞. If f is real-valued such that (π, f) is well defined, then considering (3.16) with f replaced by f⁺ and f⁻, and taking the difference of the two limits, we get (3.16).
The next proposition and Proposition 3.40 give properties (i) and (ii) of Theorem 3.32.
Proposition 3.41. Let X be an irreducible positive recurrent Markov chain. Then, the
measure π defined in (3.14) is the unique invariant probability measure.
Proof. According to Proposition 3.40, the measure π is a probability measure. We now check it is invariant. Let µ be the distribution of X0. We set:

µ̄n(x) = (1/n) ∑_{k=1}^{n} µP^k(x).

Let y ∈ E be fixed and f(·) = P(·, y). We notice that lim_{n→∞} (µ̄n, f) = (π, f) = πP(y) and that (µ̄n, f) = µ̄n P(y) = ((n+1)/n) µ̄_{n+1}(y) − (1/n) µP(y). Letting n go to infinity in these equalities, we get that πP(y) = π(y). Since y is arbitrary, we deduce that π is invariant. By Lemma 3.34, it is the unique invariant probability measure.
The next proposition and Lemma 8.14 give property (iii) from Theorem 3.32. Its proof
relies on a coupling argument.
Proof. Let X = (Xn, n ∈ N) be an irreducible positive recurrent aperiodic Markov chain. Recall that π defined in (3.14) is its unique invariant probability measure. Let Y = (Yn, n ∈ N) be a Markov chain independent of X with the same transition matrix and initial distribution π. Thanks to Lemma 3.29, the Markov chain Z = ((Xn, Yn), n ∈ N) is irreducible and it has π ⊗ π as invariant probability measure. This gives that Z is positive recurrent.
Let x ∈ E and consider T = inf{n ≥ 1; Xn = Yn = x} the return time of Z to (x, x). For
y ∈ E, we have:
By symmetry we can replace (Xn , Yn ) in the previous inequality by (Yn , Xn ) and deduce that:
Since Z is recurrent, we get that a.s. T is finite. Using that P(Yn = y) = π(y), as π is
invariant and the initial distribution of Y , we deduce that limn→∞ |P(Xn = y) − π(y)| = 0
for all y ∈ E. Then, use Lemma 8.14 to conclude.
Random walk on Z^d

Let d ∈ N∗. Let U be a Z^d-valued random variable with probability distribution p = (p(x) = P(U = x), x ∈ Z^d). Let (Un, n ∈ N∗) be a sequence of independent random variables distributed as U, and X0 a Z^d-valued independent random variable. We consider the random walk X = (Xn, n ∈ N) with increments distributed as U, defined by:

Xn = X0 + ∑_{k=1}^{n} Uk for n ∈ N∗.
The transition matrix P of X is given by P(x, y) = p(y − x). We assume that X is irreducible (equivalently, the smallest additive sub-group of Z^d which contains the support {x ∈ Z^d; p(x) > 0} is Z^d). Because ∑_{x∈Z^d} P(x, y) = 1, we deduce that the counting measure on Z^d is invariant. (According to Section 3.4.2, this implies that irreducible random walks are transient or null recurrent.) We refer to [8, 6] for a detailed account on random walks.
The simple symmetric random walk corresponds to U being uniform on the set of cardinality 2d: {x ∈ Z^d; |x| = 1}, with |x| denoting the Euclidean norm on R^d. It is irreducible with period 2 (as P^2(x, x) > 0 and, by parity, P^{2n+1}(x, x) = 0 for all n ∈ N).
We summarize the main results on transience/recurrence for random walks, see [8], Theorem 8.1.
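The dichotomy can be glimpsed by counting the returns to the origin of the simple symmetric random walk along one long trajectory. A rough Monte Carlo sketch in Python (finite trajectories can only suggest recurrence or transience, not prove it):

import numpy as np

rng = np.random.default_rng(0)

def returns_to_origin(d, n_steps):
    """Count visits to 0 of a simple symmetric random walk on Z^d."""
    steps = np.zeros((n_steps, d), dtype=int)
    axes = rng.integers(d, size=n_steps)        # coordinate to move
    signs = rng.choice([-1, 1], size=n_steps)   # direction of the move
    steps[np.arange(n_steps), axes] = signs
    path = np.cumsum(steps, axis=0)
    return int(np.sum(np.all(path == 0, axis=1)))

for d in (1, 2, 3):
    print(f"d={d}: {returns_to_origin(d, 200_000)} returns in 2e5 steps")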
Metropolis-Hastings algorithm
Let π be a given probability distribution on E such that π(x) > 0 for all x ∈ E. The aim
of the Metropolis-Hastings6 algorithm is to simulate a random variable with distribution
(asymptotically close to) π.
We consider a stochastic matrix Q on E which is irreducible (that is for all x, y ∈ E,
there exists n ∈ N∗ such that Qn (x, y) > 0) and such that for all x, y ∈ E, if Q(x, y) = 0 then
Q(y, x) = 0. The matrix Q is called the selection matrix.
We say a function ρ = (ρ(x, y); x, y ∈ E such that Q(x, y) > 0) taking values in (0, 1] is an accepting probability function if for x, y ∈ E such that Q(x, y) > 0, we have:

ρ(x, y) = γ(π(y)Q(y, x) / (π(x)Q(x, y))),   (3.30)
where γ is a function defined on (0, +∞) taking values in (0, 1] satisfying γ(u) = uγ(1/u) for
u > 0. A common choice for γ is γ(u) = min(1, u) (Metropolis algorithm) or γ(u) = u/(1 + u)
(Boltzmann or Barker algorithm).
We now describe the Metropolis-Hastings algorithm. Let X0 be a random variable on E with distribution µ0. At step n + 1, the random variables X0, . . . , Xn are defined, and we explain how to generate Xn+1. First consider a random variable Yn+1 with distribution Q(Xn, ·). With probability ρ(Xn, Yn+1), we accept the transition and set Xn+1 = Yn+1. If the transition is refused, we set Xn+1 = Xn.
6
W. Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika,
vol. 57, pp.97-109, 1970.
Since r > 0, we have that Q(x, y) > 0 implies P(x, y) > 0 and, for x ≠ y, that Q(x, y) > 0 is equivalent to P(x, y) > 0. We deduce that X is irreducible as Q is irreducible. Condition (3.29) implies that X is reversible with respect to the probability π. Thus, the Markov chain is irreducible positive recurrent with invariant probability π. Let f be a real-valued function defined on E such that (π, f) is well defined. An approximation of (π, f) is, according to Theorem 3.32, given by (1/n) ∑_{k=1}^{n} f(Xk) for n large. The drawback of this approach is that it does not come with a confidence interval for (π, f). If furthermore either Q is aperiodic, or there exist x, y ∈ E such that Q(x, y) > 0 and ρ(x, y) < 1 so that P(x, x) > 0, then the Markov chain X is aperiodic. In this case, Theorem 3.32 then implies that X converges in distribution towards π.
It may happen that π is known only up to a normalizing constant. This is the case of the so-called Boltzmann or Gibbs measures in statistical physics, for example, where E is the state space of a system, and the probability for the system to be in configuration x ∈ E is π(x) = Z_T^{−1} exp(−H(x)/T), where H(x) is the energy of the system in configuration x, T the temperature and Z_T the normalizing constant. It is usually very difficult to compute an approximation of Z_T.
When using the accepting probability function given by (3.30), only the ratio π(x)/π(y) needs to be computed to simulate X. In particular, the simulation does not rely on the value of Z_T.
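As a minimal illustration, here is a Python sketch of the Metropolis variant (γ(u) = min(1, u)) on a finite state space, with a symmetric nearest-neighbour selection matrix Q of our choosing; with Q symmetric, the accepting probability (3.30) reduces to γ(π(y)/π(x)), so only unnormalized weights are needed (all names below are ours):

import numpy as np

rng = np.random.default_rng(0)

# Unnormalized target weights w(x) proportional to pi(x); Z is never needed.
w = np.array([1.0, 4.0, 9.0, 4.0, 1.0])
d = len(w)

def mh_step(x):
    """One Metropolis step with symmetric nearest-neighbour proposals on a cycle."""
    y = (x + rng.choice([-1, 1])) % d          # proposal Y ~ Q(x, .)
    if rng.random() < min(1.0, w[y] / w[x]):   # accept with rho(x, y)
        return y
    return x                                    # refuse: stay at x

# Empirical frequencies approximate pi = w / w.sum().
x, counts, n = 0, np.zeros(d), 200_000
for _ in range(n):
    x = mh_step(x)
    counts[x] += 1
print("empirical:", (counts / n).round(3))
print("target:   ", (w / w.sum()).round(3))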
P_N(i, j) = C(N, j) (i/N)^j (1 − i/N)^{N−j} for i, j ∈ E_N, where C(N, j) denotes the binomial coefficient.
7
R. A. Fisher. On the dominance ratio. Proc. Roy. Soc. Edinburgh, vol. 42, pp. 321-341, 1922.
8
S. Wright. Evolution in Mendelian populations. Genetics, vol. 16, pp.97-159, 1931.
Notice that 0 and N are absorbing states, and that {1, . . . , N − 1} is an open communicating class. The quantity of interest in this model is the extinction time of the diversity (that is
the entry time of {0, N }):
τN = inf{n ≥ 0; Xn ∈ {0, N }},
with the convention inf ∅ = ∞. Using martingale techniques developed in Chapter 4, one can
easily prove the following result.
Lemma 3.43. A.s. the extinction time τN is finite and P(XτN = N |X0 ) = X0 /N .
= 1 + P_N t_N(i),

where we used the Markov property at time 1 for the third equality. As 0 and N are absorbing states, we have t_N(i) = P_N t_N(i) = 0 for i ∈ {0, N}. Let e0 (resp. eN) denote the element of R^{N+1} with all entries equal to 0 except the first (resp. last), which is equal to 1, and 1 = (1, . . . , 1) ∈ R^{N+1}. We have:

t_N = P_N t_N + 1 − e0 − eN.
So to compute t_N, one has to solve a linear system. For large N, we have the following result9 for x ∈ [0, 1]:

(1/N) E_{⌊Nx⌋}[τN] → −2 (x log(x) + (1 − x) log(1 − x)) as N → ∞,

where ⌊z⌋ is the integer part of z ∈ R. We give an illustration of this approximation in Figure 3.3.
Figure 3.3: Mean extinction time of the diversity (k ↦ Ek[τN]) and its continuous limit, Nx ↦ −2N (x log(x) + (1 − x) log(1 − x)), for N = 10 (left) and N = 100 (right).
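The linear system t_N = P_N t_N + 1 − e0 − eN is easily solved numerically: on the interior states {1, . . . , N − 1}, where t_N vanishes at the absorbing boundary, it reads (I − P_N)t_N = 1. A sketch (all names are ours):

import numpy as np
from math import comb

def mean_extinction_times(N):
    """Solve t = P_N t + 1 on the interior states {1, ..., N-1},
    with t(0) = t(N) = 0 at the absorbing states."""
    # Wright-Fisher transition matrix: P_N(i, j) = C(N, j)(i/N)^j (1-i/N)^(N-j).
    P = np.array([[comb(N, j) * (i / N) ** j * (1 - i / N) ** (N - j)
                   for j in range(N + 1)] for i in range(N + 1)])
    interior = slice(1, N)
    A = np.eye(N - 1) - P[interior, interior]
    t_int = np.linalg.solve(A, np.ones(N - 1))
    return np.concatenate(([0.0], t_int, [0.0]))

N = 10
t = mean_extinction_times(N)
x = np.arange(N + 1) / N
limit = np.zeros(N + 1)
limit[1:N] = -2 * N * (x[1:N] * np.log(x[1:N]) + (1 - x[1:N]) * np.log(1 - x[1:N]))
print(np.column_stack([t.round(2), limit.round(2)]))  # compare with Figure 3.3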
Markov chains11. By construction, as all the particles play the same role, the process X is a Markov chain on E = J0, N K with transition matrix P given by P(k, ℓ) = 0 if |k − ℓ| ≠ 1, P(k, k + 1) = (N − k)/N and P(k, k − 1) = k/N for k, ℓ ∈ E. We deduce that the Markov chain X is irreducible. Notice that X is reversible with respect to the binomial distribution πN = (πN(k), k ∈ E), where πN(k) = 2^{−N} C(N, k) for k ∈ E. To see this, it is enough to check that πN(k)P(k, k + 1) = πN(k + 1)P(k + 1, k) for all k ∈ J0, N − 1K. For k ∈ J0, N − 1K, we have:

πN(k)P(k, k + 1) = 2^{−N} C(N, k) (N − k)/N = 2^{−N} C(N, k + 1) (k + 1)/N = πN(k + 1)P(k + 1, k).
According to Lemma 3.18 and Theorem 3.32, we deduce that πN is the unique invariant probability measure of X. Let a > 0 and define the interval I_{a,N} = [(N − a√N)/2, (N + a√N)/2]. We also get that the empirical mean time n^{−1} ∑_{k=1}^{n} 1_{Xk ∈ I_{a,N}} spent by the system in the interval I_{a,N} converges a.s., as n goes to infinity, towards πN(I_{a,N}). Thanks to the CLT, we have that πN(I_{a,N}) converges, as N goes to infinity, towards P(G ∈ [−a, a]) where G ∼ N(0, 1) is a standard Gaussian random variable. For a larger than some units (say 2 or 3), this latter probability is close to 1. This implies that it is unlikely to observe values away from N/2 by more than a few units times √N. Using large deviation theory for the Bernoulli distribution with parameter 1/2, we get that for ε ∈ (0, 1):

(2/N) log(πN([0, N(1 − ε)/2])) → −(1 + ε) log(1 + ε) − (1 − ε) log(1 − ε) as N → ∞.

Thus the probability to observe values away from N/2 by a small multiple of N decreases exponentially fast towards 0 as N goes to infinity.
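A short simulation of the Ehrenfest chain illustrates this concentration of the occupation time around N/2. A sketch (the parameters are arbitrary choices of ours):

import numpy as np

rng = np.random.default_rng(0)
N, a, n_steps = 100, 2.0, 200_000

# Ehrenfest chain: from k, move to k+1 w.p. (N-k)/N, else to k-1.
k, hits = N // 2, 0
lo, hi = (N - a * np.sqrt(N)) / 2, (N + a * np.sqrt(N)) / 2
for _ in range(n_steps):
    k = k + 1 if rng.random() < (N - k) / N else k - 1
    hits += lo <= k <= hi
print("fraction of time in I_{a,N}:", hits / n_steps)  # close to P(|G| <= a)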
For k, ℓ ∈ E, let t_{k,ℓ} = Ek[T^ℓ] be the mean of the return time to ℓ starting from k. Set N0 = ⌊N/2⌋. Using (3.14) and the Stirling formula, we get:

t_{N0,N0} = 1/πN(N0) ∼ √(πN/2) and t_{0,0} = 1/πN(0) = 2^N.

Notice that t_{0,0} and t_{N0,N0} are not of the same order.
11
S. Karlin and J. McGregor. Ehrenfest urn models. J. Appl. Probab, vol. 2, pp. 352-376, 1965
We are now interested in the mean return times from 0 to N0 and from N0 to 0. Let ℓ ≥ 2. By decomposing with respect to X1, we have t_{ℓ−1,ℓ} = 1 + ((ℓ − 1)/N) t_{ℓ−2,ℓ} and, for k ∈ J0, ℓ − 2K:

t_{k,ℓ} = 1 + (k/N) t_{k−1,ℓ} + ((N − k)/N) t_{k+1,ℓ}.

Then, using some lengthy computations, we get by induction that for 0 ≤ k < ℓ ≤ N:

t_{k,ℓ} = (N/2) ∫_0^1 (1 − u)^{N−ℓ} (1 + u)^k [(1 + u)^{ℓ−k} − (1 − u)^{ℓ−k}] du/u.
12
G.-Y. Chen and L. Saloff-Coste. The L2 -cutoff for reversible Markov processes. J. Funct. Analysis, vol.
258, pp. 2246-2315, 2010.
13
A. Erlang. The theory of probabilities and telephone conversations. Nyt Tidsskrift for Matematik B, vol.
20, pp. 33-39, 1909.
Bibliography
[1] X. Chen. Limit theorems for functionals of ergodic Markov chains with general state space.
Mem. Amer. Math. Soc., 1999.
[2] K. L. Chung. Markov chains with stationary transition probabilities. Second edition. Die
Grundlehren der mathematischen Wissenschaften, Band 104. Springer-Verlag New York,
1967.
[3] R. Douc, E. Moulines, P. Priouret, and P. Soulier. Markov chains. Springer Series in
Operations Research and Financial Engineering. Springer, Cham, 2018.
[4] R. Durrett. Probability: theory and examples, volume 31 of Cambridge Series in Statistical
and Probabilistic Mathematics. Cambridge University Press, Cambridge, fourth edition,
2010.
[5] W. Feller. An introduction to probability theory and its applications. Vol. I. Third edition.
John Wiley & Sons, Inc., New York-London-Sydney, 1968.
[6] G. F. Lawler and V. Limic. Random walk: a modern introduction, volume 123 of Cam-
bridge Studies in Advanced Mathematics. Cambridge University Press, Cambridge, 2010.
[7] S. Meyn and R. L. Tweedie. Markov chains and stochastic stability. Cambridge University
Press, Cambridge, second edition, 2009. With a prologue by Peter W. Glynn.
Chapter 4
Martingales
Proof. Use that {τ > n} = {τ ≤ n}^c to get (i). Use that {τ = n} = {τ ≤ n} ∩ {τ ≤ n − 1}^c and that {τ ≤ n} = ∪_{k=0}^{n} {τ = k} to get (ii).
It is left to the reader to check that the σ-field Fτ in the next definition is indeed a σ-field
and a subset of F∞ .
Definition 4.5. Let τ be a F-stopping time. The σ-field Fτ of the events which are prior to
a stopping time τ is defined by:
Fτ = {B ∈ F∞ ; B ∩ {τ = n} ∈ Fn for all n ∈ N} .
(i) The random variable Y is Fτ -measurable if and only if Y 1{τ =n} is Fn -measurable for
all n ∈ N.
Proof. We prove (i). Set Yn = Y 1{τ =n} . We first assume that Y is Fτ -measurable and we
prove that Yn is Fn -measurable for all n ∈ N. If Y = 1B with B ∈ F∞ , we clearly get that
Since Y is F∞ -measurable, we also get that Y 1{τ =∞} is F∞ -measurable. Thus, we deduce
from (i) that Z is Fτ -measurable. For B ∈ Fτ , we have:
E[Z 1B] = ∑_{n∈N} E[E[Y | Fn] 1_{{τ=n}∩B}] = ∑_{n∈N} E[Y 1_{{τ=n}∩B}] = E[Y 1B],
where we used monotone convergence for the first equality, the fact that {τ = n} ∩ B belongs
to Fn and (2.1) for the second and monotone convergence for the last. As Z is Fτ -measurable,
we deduce from (2.1) that a.s. Z = E[Y | Fτ ].
Then consider Y, an F∞-measurable real-valued random variable. Subtracting (4.1) with Y replaced by Y⁻ from (4.1) with Y replaced by Y⁺ gives that (4.1) holds as soon as E[Y] is well defined.
Definition 4.8. Let X = (Xn , n ∈ N) be a F-adapted process and τ a F-stopping time. The
random variable Xτ is defined by:
Xτ = ∑_{n∈N} Xn 1_{τ=n}.
This definition is extended in an obvious way when τ is an a.s. finite stopping time and X a process indexed on N instead of N̄. By construction, the random variable Xτ from Definition 4.8 is Fτ-measurable. We can now give an extension of the Markov property, see Definition 3.2, when considering random times. Compare the next proposition with Corollary 3.12.
Proposition 4.9 (Strong Markov property). Let X = (Xn, n ∈ N) be a Markov chain with respect to the filtration F = (Fn, n ∈ N), taking values in a discrete state space E and with transition matrix P. Let τ be an a.s. finite F-stopping time and define a.s. the shifted process X̃ = (X̃k = Xτ+k, k ∈ N). Conditionally on Xτ, we have that Fτ and X̃ are independent and that X̃ is a Markov chain with transition matrix P, which means that a.s. for all k ∈ N, all x0, . . . , xk ∈ E:
E[Mn+1 | Fn ] = Mn . (4.5)
We provide two proofs of this important lemma. The shorter one relies on the stochastic
integral. The other one can be generalized when the integrability condition is weakened; it
will inspire some computations in Chapter 5.
First proof. Let M be a martingale. The process H = (Hn , n ∈ N∗ ) defined by Hn = 1{τ ≥n}
is predictable, bounded and non-negative. The discrete stochastic integral of H with respect to M is given by H·M = (H·Mn, n ∈ N) with:

H·Mn = ∑_{k=1}^{n} 1_{τ≥k} (Mk − Mk−1) = M_{τ∧n} − M0.
This implies that M τ is a martingale. The proofs are similar in the super-martingale and
sub-martingale cases.
E [Mτ | Fν ] = Mν . (4.6)
Proof. Let n0 ∈ N be such that a.s. τ ≤ n0. According to Lemma 4.7, we have that a.s.:

E[Mτ | Fν] = ∑_{n=0}^{n0} 1_{ν=n} E[Mτ | Fn].
See Exercise 9.27 for an application of martingale theory to the simple random walk.
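As a quick numerical illustration of the optional stopping theorem, one can check that for the symmetric simple random walk M stopped at τ, the hitting time of {−a, b}, the identity E[Mτ] = 0 forces P(Mτ = b) = a/(a + b). A Monte Carlo sketch (not the exercise's solution, just a sanity check; all names are ours):

import numpy as np

rng = np.random.default_rng(0)
a, b, n_runs = 3, 7, 20_000

hits_b = 0
for _ in range(n_runs):
    m = 0
    while -a < m < b:                # run until the stopping time tau
        m += rng.choice([-1, 1])     # symmetric +/-1 increments
    hits_b += (m == b)
print("P(M_tau = b) ~", hits_b / n_runs, " theory:", a / (a + b))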
Proof. Let n ∈ N. Consider the stopping time τ = inf{k ∈ N; Mk ≥ a}, and set A =
{maxk∈J0,nK Mk ≥ a} = {τ ≤ n}. Thanks to the optional stopping theorem, we have E[Mn ] ≥
E[Mτ ∧n ]. Since Mτ ∧n ≥ a1A + Mn 1Ac , we deduce that:
E[Mn ] ≥ aP(A) + E [Mn 1Ac ] .
This implies that E [Mn 1A ] ≥ aP(A). The inequality E [Mn 1A ] ≤ E [Mn+ ] is obvious.
Using Hölder's inequality, we get that E[|Mn| (Mn∗)^{p−1}] ≤ E[|Mn|^p]^{1/p} E[(Mn∗)^p]^{(p−1)/p}. This
We give in the next corollary direct consequences which are so often used that they deserve
to be stated on their own.
Corollary 4.22. We have the following results.
(i) Let M = (Mn , n ∈ N) be a sub-martingale such that supn∈N E[Mn+ ] < +∞. Then, the
process M converges a.s. to a limit, say M∞ , which is integrable and (4.9) holds.
Proof. We first prove property (i). As M is a sub-martingale, we have that E[M0 ] ≤ E[Mn ]
and thus E[|Mn |] ≤ 2E[Mn+ ] − E[M0 ]. We deduce the condition supn∈N E[Mn+ ] < +∞ is
equivalent to supn∈N E[|Mn |] < +∞. Then use Theorem 4.21 to conclude.
Let M be a non-negative super-martingale. Considering property (i) with −M , we get
the a.s. convergence of M towards a limit say M∞ . Then use Fatou’s lemma and that the
sequence (E[Mn ], n ∈ N) is non-increasing to get (4.10).
Remark 4.23. We state without proof the following extension, see [1]. Let M = (Mn, n ∈ N) be a non-negative, not necessarily integrable super-martingale, that is, M is adapted and, for all n ∈ N, a.s. Mn ≥ 0 and E[Mn+1 | Fn] ≤ Mn. Then, the process M converges a.s. and the limit, say M∞, satisfies the inequality E[M∞ | Fn] ≤ Mn a.s. for all n ∈ N. Furthermore, for all stopping times τ and ν such that τ ≥ ν, we have that a.s. E[Mτ | Fν] ≤ Mν. However, Equality (4.6) does not hold in general for positive, not necessarily integrable martingales, that is, adapted processes M = (Mn, n ∈ N) such that, for all n ∈ N, a.s. Mn ≥ 0 and E[Mn+1 | Fn] = Mn. ♦
with the convention that inf ∅ = ∞. We define the number of up-crossings for the sequence
x of the interval [a, b] up to time n ∈ N as:
We deduce that the sequence x converges in R̄ if and only if βa,b(x) < ∞ for all a < b with a, b ∈ Q. Thus, to prove the convergence of the sequence x, it is enough to give a finite upper bound for βa,b(x). Since x_{τk(x)} − x_{νk(x)} ≥ b − a when τk(x) < ∞, that is, when k ≤ βa,b(x), we deduce that:
∑_{k=1}^{βa,b(x,n)} (x_{τk(x)} − x_{νk(x)}) ≥ (b − a) βa,b(x, n).   (4.11)
Define H_ℓ(x) = 1_{∪_{k∈N} {νk(x) < ℓ ≤ τk(x)}} for ℓ ∈ N∗. Considering the discrete integral H(x)·xn = ∑_{ℓ=1}^{n} H_ℓ(x) Δx_ℓ, with Δx_ℓ = x_ℓ − x_{ℓ−1}, we get:
H(x)·xn ≥ ∑_{k=1}^{βa,b(x,n)} (x_{τk(x)} − x_{νk(x)}) − (xn − a)⁻ ≥ (b − a) βa,b(x, n) − (xn − a)⁻,   (4.12)
where for the first inequality we took into account the fact that n may belong to an up-crossing from a to b, and we used (4.11) for the second.
Up to replacing M by −M , we can assume that M is a super-martingale. We now replace
x by the super-martingale M . The random variables νk (M ), τk (M ), for k ∈ N, are by
construction stopping times. This implies that, for ℓ ∈ N∗, the event {νk(M) < ℓ ≤ τk(M)} belongs to F_{ℓ−1}. We deduce that the process H = (H_ℓ(M), ℓ ∈ N∗) is predictable, bounded and non-negative. Thanks to Lemma 4.14, the discrete stochastic integral (H(M)·Mn, n ∈ N) is
a super-martingale. Since H(M )·M0 = 0, we get E[H(M )·Mn ] ≤ 0. We deduce from (4.12)
that:
(b − a)E[βa,b (M, n)] ≤ E[(Mn − a)− ] + E[H(M )·Mn ] ≤ E[|Mn |] + |a|.
Letting n go to infinity in the previous inequality, we get, using sup_{n∈N} E[|Mn|] < +∞ and the monotone convergence theorem, that E[βa,b(M)] < +∞. This implies that the event ∩_{a<b; a,b∈Q} {βa,b(M) < ∞} has probability 1, that is, the super-martingale M a.s. converges to a real-valued random variable, say M∞.
Using Fatou’s lemma, we get:
E[|M∞|] = E[lim_{n→∞} |Mn|] ≤ liminf_{n→∞} E[|Mn|] ≤ sup_{n∈N} E[|Mn|] < +∞.
Proof. Assume property (i) holds. Denote by M∞ the limit of M. For m ≥ n, we have E[Mm | Fn] = Mn a.s. and thus, using Jensen's inequality:

E[|Mn − E[M∞ | Fn]|] = E[|E[Mm − M∞ | Fn]|] ≤ E[E[|Mm − M∞| | Fn]] = E[|Mm − M∞|].
The next result does not hold if we assume that Z is non-negative instead of integrable, see a counter-example on page 31 in [1]. Recall that F∞ = ∨_{n∈N} Fn.
Corollary 4.25. Let Z be an integrable real-valued random variable. Then the process
(E[Z| Fn ], n ∈ N) is a closed martingale which converges a.s. and in L1 towards E[Z| F∞ ].
Proof. Condition (ii) of Theorem 4.24 holds for the martingale M = (Mn = E[Z| Fn ], n ∈ N).
Since (i) and (ii) of Theorem 4.24 are equivalent, we deduce that M converges a.s. and in L1
to a real-valued random variable, say M∞ which is integrable. Using (4.13), we get that for
all A ∈ Fn :
E [(Z − M∞ )1A ] = E [E [(Z − M∞ )| Fn ] 1A ] = 0.
This implies that the set A ⊂ F of events A such that E[Z 1A] = E[M∞ 1A] contains ∪_{n∈N} Fn, which is stable by finite intersection. Since Z and M∞ are integrable, we get, using dominated convergence, that A is also a λ-system. According to the monotone class theorem, A contains the σ-field generated by ∪_{n∈N} Fn, that is F∞. Then, we deduce from Definition 2.2 and Lemma 2.3 that a.s. M∞ = E[Z | F∞].
We can extend the optional stopping theorem for closed martingales to arbitrary stopping times.
Proposition 4.26. Let M = (Mn , n ∈ N) be a closed martingale and write M∞ for its a.s.
limit. Let τ and ν be stopping times such that ν ≤ τ . Then, we have a.s.:
E [Mτ | Fν ] = Mν . (4.14)
Proof. According to Lemma 4.7, we have for any stopping time τ′ that a.s.:

E[M∞ | Fτ′] = ∑_{n∈N̄} 1_{τ′=n} E[M∞ | Fn] = ∑_{n∈N̄} 1_{τ′=n} Mn = M_{τ′},
where we used (4.13) for the second equality. Using this result twice, first with τ′ = τ and then with τ′ = ν, we get, as Fν ⊂ Fτ according to property (iii) of Lemma 4.10, that a.s.:
We have the following result when the martingale is bounded in Lp for some p > 1.
Proposition 4.27. Let M = (Mn, n ∈ N) be a martingale such that sup_{n∈N} E[|Mn|^p] < +∞ for some p > 1. Then, the martingale converges a.s. and in L^p towards a limit, say M∞, and Mn = E[M∞ | Fn] a.s. for all n ∈ N. We also have that M∞∗ = sup_{n∈N} |Mn| belongs to L^p and, with Cp = (p/(p − 1))^p:

E[(M∞∗)^p] ≤ Cp E[|M∞|^p] as well as E[|M∞|^p] = sup_{n∈N} E[|Mn|^p].
Proof. Since M is bounded in L^1, we deduce from Theorem 4.21 that M converges a.s. towards a limit, say M∞ ∈ L^1. We recall, see (4.7), that Mn∗ = max_{k∈J0,nK} |Mk|. By monotone convergence, since M∞∗ = lim_{n→∞} Mn∗, we have that:

E[(M∞∗)^p] = lim_{n→∞} E[(Mn∗)^p].   (4.15)

According to Proposition 4.20 and since sup_{n∈N} E[|Mn|^p] < +∞, we deduce that:

E[(M∞∗)^p] ≤ Cp sup_{n∈N} E[|Mn|^p] < +∞.

This gives that M∞∗ belongs to L^p. We deduce from (4.15) and the dominated convergence
Chapter 5
Optimal stopping
The goal of this chapter is to determine the best time, if any, at which one has to stop a game, seen as a stochastic process, in order to maximize a given criterion seen as a gain or a reward. The following two examples are typical of the problems which will be solved. Their solutions are given respectively in Sections 5.1.3 and 5.3.2.
Example 5.1 (Marriage of a princess: the setting). A long time ago, in a faraway kingdom, a princess had to choose a prince for marriage among ζ ∈ N∗ candidates. At step 1 ≤ n < ζ, she interviews the n-th candidate and at the end of the interview she either accepts to marry this candidate or refuses. In the former case the process stops and she gets married to the n-th candidate; in the latter case the rejected candidate leaves forever and the princess moves on to step n + 1. If n = ζ, she has no choice left but to marry the last candidate. What is the best strategy or stopping rule for the princess if she wants to maximize the probability of marrying the best prince?
This “Marriage problem”, also known as the “Secretary problem”, appeared in the late
1950’s and early 1960’s. See Ferguson [4] for an historical review as well as the corresponding
Wikipedia page1 . 4
Example 5.2 (Castle to sell). A princess wants to sell her castle; let Xn be the n-th price offer. However, preparing the castle for the visit of a potential buyer has a cost, say c > 0 per visit. So the gain of selling at step n ≥ 1 will be Gn = Xn − nc, or Gn = max_{1≤k≤n} Xk − nc if the princess can recall a previously interested buyer. In this infinite time horizon setting, what is the best strategy for the princess in order to maximize her gain?
This "House-selling problem", see Chapter 4 in Ferguson [3], is also known as the "Job search problem" in economics, see Lippman and McCall [5]. 4
For n < ζ ∈ N̄ = N ∪ {∞}, we set Jn, ζK = [n, ζ] ∩ N̄ and Jn, ζJ = [n, ζ) ∩ N̄. We consider a game over the discrete time interval J0, ζK with horizon ζ ∈ N̄, where at step n ≤ ζ we can either stop and receive the gain Gn, or continue to step n + 1 if n + 1 ≤ ζ. Eventually, in the infinite horizon case ζ = ∞, if we never stop, we receive the gain G∞. We assume the gains G = (Gn, n ∈ J0, ζK) form a sequence of random variables on a probability space (Ω, F, P) taking values in [−∞, +∞).
We assume the information available is given by a filtration F = (Fn , n ∈ J0, ζK) with
Fn ⊂ F, and a strategy or stopping rule corresponds to a stopping time. Let Tζ be the set
1
https://en.wikipedia.org/wiki/Secretary_problem
of all stopping times with respect to the filtration F taking values in J0, ζK. We shall assume
that E[Gτ⁺] < +∞ for all τ ∈ Tζ, where x⁺ = max(0, x). In particular, the expectation E[Gτ] is well defined and belongs to [−∞, +∞). Thus, the maximal gain of the game G is:
A stopping time τ′ ∈ Tζ is said to be optimal for G if E[Gτ′] = V∗, and thus V∗ = max_{τ∈Tζ} E[Gτ].
The next theorem, which is a direct consequence of Corollaries 5.8 and 5.18, is the main result of this chapter. For a real sequence (an, n ∈ N), we set limsup_{n↗∞} an = lim_{n→∞} sup_{∞>k≥n} ak.
Theorem 5.3. Let ζ ∈ N̄, G = (Gn, n ∈ J0, ζK) be a sequence of random variables taking values in [−∞, +∞) and F = (Fn, n ∈ J0, ζK) be a filtration. Assume the integrability condition:

E[sup_{n∈J0,ζK} Gn⁺] < +∞.   (5.2)

If ζ ∈ N, or if ζ = ∞ and

limsup Gn ≤ G∞ a.s.,   (5.3)

then there exists an optimal stopping time.
We complete Theorem 5.3 by giving a description of the optimal stopping times when the sequence G is adapted to the filtration F, (5.2) holds, and (5.3) holds if ζ = ∞. In this case, we consider the Snell envelope S = (Sn, n ∈ J0, ζK ∩ N), which is a particular solution to the so-called optimal equations or Bellman equations:

Sn = max(Gn, E[Sn+1 | Fn]) for n ∈ J0, ζJ.   (5.4)
More precisely, in the finite horizon case S is defined by Sζ = Gζ and the backward recursion
(5.4); in the infinite horizon case S is defined by (5.17) which satisfies (5.4) according to Propo-
sition 5.16. In this setting, we will consider the stopping times τ∗ ≤ τ∗∗ in Tζ :
with the convention inf ∅ = ζ. We shall prove that they are optimal, see Propositions 5.6 and
5.17, and Exercises 5.1 and 5.5. Furthermore, if V∗ > −∞, then a stopping time τ is optimal
if and only if τ∗ ≤ τ ≤ τ∗∗ a.s. and on {τ < ∞} we have a.s. Sτ = Gτ . See Exercises 5.1, 5.4
and 5.5. Thus, τ∗ is the minimal optimal stopping time and τ∗∗ the maximal one.
In the following two remarks, we comment on the integrability condition (5.2) and we consider the case where the sequence G is not adapted to the filtration F.
Remark 5.4. Notice that (5.2) implies that E[Gτ⁺] < +∞ for all τ ∈ Tζ. When ζ < ∞, then (5.2) is equivalent to:

E[Gn⁺] < +∞ for all n ∈ J0, ζK.   (5.7)

When ζ = ∞, Condition (5.2) can be slightly weakened, see Proposition 5.17, when G is F-adapted, to Condition (H) page 86, which corresponds to the gain being bounded from above by a non-negative uniformly integrable martingale. ♦
Remark 5.5. When the sequence G is not adapted to the filtration F, the idea is to check that an optimal stopping time for the adapted gain G′ = (G′n, n ∈ J0, ζK), with G′n = E[Gn | Fn], is also an optimal stopping time for G, see Sections 5.1.2 and 5.2.4. ♦
The finite horizon case, ζ < ∞, is presented in Section 5.1, and the infinite horizon case,
ζ = ∞, which is much more delicate, in particular for the definition of S, is presented in
Section 5.2. We consider the approximation of the infinite horizon case by finite horizon
cases in Section 5.3, which includes the Markov chain setting developed in Section 5.3.3.

The presentation of this chapter follows closely Ferguson [3], also inspired by Snell [7];
see also Chow, Robbins and Siegmund [1, 6] and the references therein, or Dynkin [2] for the
Markovian setting. Concerning the infinite horizon case, we consider stopping times taking
values in N̄ instead of N as in most textbooks. Since in some standard applications the gain
of not stopping in finite time is G∞ = −∞ (which de facto implies the optimal stopping
time is finite unless V∗ = −∞), we shall consider rewards Gn taking values in [−∞, +∞),
whereas in most textbooks it is assumed that E[|Gn|] < +∞ holds for all finite n ≤ ζ. The
advantage of this setting is the simplicity of the hypothesis and the generality of the result
given in Theorem 5.3. Its drawback is that we cannot use the elegant martingale theory
which is the cornerstone of the Snell envelope approach, see Remark 5.7 and Exercise 5.1 and
the presentation in Neveu [6]. Thus, we shall deal with integrability technicalities in the
infinite horizon case.
Proof. For n ∈ J0, ζK, we define Tn as the set of all stopping times with respect to the filtration
F taking values in Jn, ζK, as well as the stopping time τn = inf{k ∈ Jn, ζK; Sk = Gk }. Notice
Remark 5.7 (Snell envelope). Let ζ ∈ N. Assume that E[|Gn|] < ∞ for all n ∈ J0, ζK. Notice
from (5.4) that S is a super-martingale and that S dominates G. It is left to the reader to
check that S is in fact the smallest super-martingale which dominates G. It is called the
Snell envelope of G. For n ∈ J0, ζJ, using that Sn = E[S_{n+1} | Fn] on {τ∗ > n}, we have:

    S_{n∧τ∗} = S_{τ∗} 1_{{τ∗≤n}} + Sn 1_{{τ∗>n}} = S_{τ∗} 1_{{τ∗≤n}} + E[S_{n+1} 1_{{τ∗>n}} | Fn] = E[S_{(n+1)∧τ∗} | Fn]. (5.14)

This gives that (S_{n∧τ∗}, n ∈ J0, ζK) is a martingale. ♦
Exercise 5.1. Let ζ ∈ N. Assume that E[|Gn |] < ∞ for all n ∈ J0, ζK.
1. Prove that τ is an optimal stopping time if and only if Sτ = Gτ a.s. and (Sn∧τ , n ∈
J0, ζK) is a martingale.
2. Deduce that τ∗ is the minimal optimal stopping time (that is: if τ is optimal, then a.s.
τ ≥ τ∗ ).
4. Using the Doob decomposition, see Remark 4.15, of the super-martingale S, prove that
if τ ≥ τ∗∗ is an optimal stopping time then τ = τ∗∗ .
5. Arguing as in the proof of property (ii) from Lemma 5.13, prove that if τ and τ′ are
optimal stopping times, then so is max(τ, τ′).
6. Deduce that τ is an optimal stopping time if and only if a.s. τ∗ ≤ τ ≤ τ∗∗ and Sτ = Gτ .
△

Thanks to Jensen's inequality, we have E[(G′_n)^+] ≤ E[G_n^+] < +∞ for all n ∈ J0, ζK. Thus the
sequence G′ is adapted to F and satisfies the integrability condition (5.7), or equivalently
(5.2). Recall T^ζ is the set of all stopping times with respect to the filtration F taking values
in J0, ζK. Thanks to Fubini, we get that for τ ∈ T^ζ:

    E[Gτ] = E[ Σ_{n=0}^{ζ} Gn 1_{{τ=n}} ] = Σ_{n=0}^{ζ} E[G′_n 1_{{τ=n}}] = E[G′_τ].
We thus deduce that the maximal gain for the game G is also the maximal gain for the game G′.
The following corollary is then an immediate consequence of Proposition 5.6.

Corollary 5.8. Let ζ ∈ N and G = (Gn, n ∈ J0, ζK) be such that E[G_n^+] < +∞ for all
n ∈ J0, ζK. Set S_ζ = E[G_ζ | F_ζ] and Sn = max(E[Gn | Fn], E[S_{n+1} | Fn]) for 0 ≤ n < ζ. Then
the stopping time τ∗ = inf{n ∈ J0, ζK; Sn = E[Gn | Fn]} is optimal and V∗ = E[Gτ∗] = E[S0].
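As a minimal numerical sketch of the backward recursion of Corollary 5.8, consider the illustrative case (ours, not from the notes) of i.i.d. gains Gn = Xn with Xn uniform on [0, 1]: then E[S_{n+1} | Fn] is a constant v_{n+1}, and E[max(X, v)] = (1 + v²)/2 for such X.

```python
# Sketch of the backward recursion (5.4) for i.i.d. gains G_n = X_n with
# X_n uniform on [0, 1] (illustrative example). Then E[S_{n+1} | F_n] is the
# constant v_{n+1} = E[S_{n+1}], and E[max(X, v)] = (1 + v * v) / 2.
zeta = 25
v = [0.0] * (zeta + 1)
v[zeta] = 0.5                              # v_zeta = E[S_zeta] = E[X_zeta]
for n in range(zeta - 1, -1, -1):
    v[n] = (1.0 + v[n + 1] ** 2) / 2.0     # v_n = E[max(X_n, v_{n+1})]

print(v[0])   # maximal gain V* = E[S_0]; tau_* = inf{n : X_n >= v_{n+1}}
```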
We continue Example 5.1. The princess wants to maximize the probability of marrying the best
prince among ζ ∈ N∗ candidates. The corresponding gain at step n is Gn = 1_{{Σn=1}}, with
Σn the random rank of the n-th candidate among the ζ candidates. The random variable
Σ = (Σn, n ∈ J1, ζK) takes values in the set S_ζ of permutations of J1, ζK.

For a permutation σ = (σn, n ∈ J1, ζK) ∈ S_ζ, we define the sequence of partial ranks
r(σ) = (r1, . . . , rζ) such that rn is the partial rank of σn in (σ1, . . . , σn). In particular, we
have r1 = 1 and rζ = σζ. Set E = Π_{n=1}^{ζ} J1, nK, the state space of r(σ). It is easy to check that
r is one-to-one from S_ζ to E. Set (R1, . . . , Rζ) = r(Σ), so that Rn is the observed partial rank
of the n-th candidate. In particular Rn corresponds to the observation of the princess at step
n, and the information of the princess at step n is given by the σ-field Fn = σ(R1, . . . , Rn).
In order to stick to the formalism of this chapter, we set G0 = −∞ and F0 the trivial σ-field.
We assume the princes are interviewed at random, that is, the random permutation Σ =
(Σn, n ∈ J1, ζK) is uniformly distributed on S_ζ. Notice then that, for n ∈ J1, ζJ, Σn is
not a function of (R1, . . . , Rn), so it is not Fn-measurable, and thus the gain sequence
G = (Gn, n ∈ J0, ζK) is not adapted to the filtration F = (Fn, n ∈ J0, ζK).

Since r is one-to-one, we deduce that r(Σ) is uniform on E. Since E has a product form,
we get that the random variables R1, . . . , Rζ are independent and Rn is uniform on J1, nK
for all n ∈ J1, ζK. The event {Σn = 1} is equal to {Rn = 1} ∩ ⋂_{k=n+1}^{ζ} {Rk > 1}. Using the
independence of (R_{n+1}, . . . , Rζ) with Fn, we deduce that for n ∈ J1, ζK:

    E[Gn | Fn] = E[1_{{Σn=1}} | Fn] = 1_{{Rn=1}} Π_{k=n+1}^{ζ} P(Rk > 1) = (n/ζ) 1_{{Rn=1}}.
By an elementary backward induction, we get from the definition of Sn given in Corollary 5.8
that, for n ∈ J1, ζK, Sn is a function of Rn; more precisely Sn = max((n/ζ) 1_{{Rn=1}}, s_{n+1}),
with s_{n+1} = E[S_{n+1} | Fn] = E[S_{n+1}], as S_{n+1}, which is a function of R_{n+1}, is independent of
Fn. The sequence (sn, n ∈ J1, ζK) is non-increasing as (Sn, n ∈ J1, ζK) is a super-martingale.
We deduce that the optimal stopping time can be written as τ∗ = γ_{n∗} for some n∗, where for
n ∈ J1, ζK the stopping rule γn corresponds to first observing the first n − 1 candidates and
then choosing the next one who is better than all those already observed (or the last if there
is none): γn = inf{k ∈ Jn, ζK; Rk = 1 or k = ζ}. We set Γn = E[G_{γn}], the gain corresponding
to the strategy γn. We have Γ1 = 1/ζ and, for n ∈ J2, ζK:

    Γn = Σ_{k=n}^{ζ} P(γn = k, Σk = 1) = Σ_{k=n}^{ζ} P(Rn > 1, . . . , Rk = 1, . . . , Rζ > 1) = ((n−1)/ζ) Σ_{k=n}^{ζ} 1/(k−1),

where we used the independence for the last equality. Notice that ζΓ1 = ζΓζ = 1. For
n ∈ J1, ζ − 1K, we have ζ(Γn − Γ_{n+1}) = 1 − Σ_{j=n}^{ζ−1} 1/j. We deduce that Γn is maximal for
n∗ = inf{n ≥ 1; Γn ≥ Γ_{n+1}} = inf{n ≥ 1; Σ_{j=n}^{ζ−1} 1/j ≤ 1}. We also have V∗ = Γ_{n∗}.

For ζ large, we get n∗ ∼ ζ/e, so the optimal strategy is to observe a fraction of order
1/e ≃ 37% of the candidates, and then choose the next best one (or the last if there is none);
the probability of getting the best prince is then V∗ = Γ_{n∗} ≃ n∗/ζ ≃ 1/e ≃ 37%.
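The threshold n∗ and the value V∗ = Γ_{n∗} are easy to evaluate numerically; the following sketch (ours, not from the notes) implements the formulas above.

```python
# Numerical check of the secretary problem formulas above: the threshold
# n* = inf{n >= 1 : sum_{j=n}^{zeta-1} 1/j <= 1} and V* = Gamma_{n*}.
def secretary(zeta: int):
    n_star = next(n for n in range(1, zeta + 1)
                  if sum(1.0 / j for j in range(n, zeta)) <= 1.0)
    if n_star == 1:
        return n_star, 1.0 / zeta          # Gamma_1 = 1/zeta
    v_star = (n_star - 1) / zeta * sum(1.0 / (k - 1)
                                       for k in range(n_star, zeta + 1))
    return n_star, v_star

for zeta in (10, 100, 1000):
    print(zeta, *secretary(zeta))          # n*/zeta and V* both tend to 1/e
```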
Exercise 5.2 (Choosing the second best instead of the best²). Assume the princess knows the
best prince is very likely to get a better proposal somewhere else, so that she wants to select
the second best prince among ζ candidates instead of the best one. For x > 0, we denote by ⌊x⌋
the only integer n ∈ N such that x − 1 < n ≤ x. Prove that the optimal stopping rule is to
reject the first n0 = ⌊(ζ − 1)/2⌋ candidates and then choose the first prince who is second best
so far, or the last if there is none, that is, τ∗ = inf{k > n0; Rk = 2 or k = ζ}, and that the
optimal gain is:

    V∗ = n0(ζ − n0) / (ζ(ζ − 1)).

So for ζ large, we get V∗ ≃ 1/4. Selecting the third best leads to a more complex optimal
strategy. △

² J. S. Rose. A problem of optimal choice assignment. Operations Research, 30(1):172–181, 1982.
    E[G_{n+1} | Fn] = ((2^{n+1} − 1)/(2^{n+1} − 2)) Gn ≥ Gn.

Thus, for any τ ∈ T, we have E[G_{τ∧n}] ≤ E[Gn] ≤ 1, and by Fatou's lemma we get E[Gτ] ≤ 1.
We deduce that V∗ = 1.

Since E[G_{n+1} | Fn] > Gn on {Gn ≠ 0} and G_{n+1} = Gn on {Gn = 0}, we get at step n
that the expected future gain at step n + 1 is better than the gain Gn. Therefore it is more
interesting to continue than to stop at step n. However this strategy will provide the gain
G∞ = 0, and is thus not optimal. We deduce there is no optimal stopping time.

Consider the stopping time τ = inf{n ≥ 1; Gn = 0}. We have that τ is a geometric
random variable with parameter 1/2. Furthermore, we have sup_{n∈J0,ζK} G_n^+ = 2^{τ−1} − 1 and
thus E[sup_{n∈J0,ζK} G_n^+] = +∞. In particular, condition (5.2) does not hold in this case. △
The main result of this section is that if (5.2) and (5.3) hold, then there exists an optimal
stopping time τ∗ ∈ T, see Corollary 5.18. The main idea of the infinite horizon case, inspired
by the finite horizon case, is to consider a process S = (Sn, n ∈ J0, ζK) satisfying the optimal
equations (5.4). But since the initialization of S given in the finite horizon case is now
unavailable, we shall rely on a definition inspired by (5.8) and (5.9). However, we need to
consider a measurable version of the supremum of E[Gτ | Fn], where τ is any stopping time
such that τ ≥ n. This is developed in Section 5.2.1. In the technical Section 5.2.2, since we do
not assume the gain to be integrable, following Ferguson [3], we use the notion of regular
stopping time to prove the existence of an optimal stopping time in the adapted case. We
connect this result with the optimal equations (5.4) in Section 5.2.3. Then, we consider the
general case in Section 5.2.4.
(ii) If there exists a random variable Y such that for all t ∈ T, P(Y ≥ Xt) = 1, then a.s.
Y ≥ X∗.

The random variable X∗ of the previous proposition is called the essential supremum of
(Xt, t ∈ T) and is denoted by:

    X∗ = ess sup_{t∈T} Xt.

Example 5.12. If U is a uniform random variable on [0, 1] and Xt = 1_{{U=t}} for t ∈ T = [0, 1],
then we have that a.s. sup_{t∈T} Xt = 1, while it is easy to check that a.s. ess sup_{t∈T} Xt = 0. △
Proof of Proposition 5.11. Since we are only considering inequalities between real random
variables, by mapping R onto [0, 1] with an increasing one-to-one function, we can assume
that Xt takes values in [0, 1] for all t ∈ T.

Let I be the family of all countable sub-families of T. For each I ∈ I, consider the
(well defined) random variable X_I = sup_{t∈I} Xt and define α = sup_{I∈I} E[X_I]. There exists a
sequence (In, n ∈ N) such that lim_{n→+∞} E[X_{In}] = α. The set I∗ = ⋃_{n∈N} In is countable and
thus I∗ ∈ I. Set X∗ = X_{I∗}. Since E[X_{In}] ≤ E[X∗] ≤ α for all n ∈ N, we get E[X∗] = α.

For any t ∈ T, consider J = I∗ ∪ {t}, which belongs to I, and notice that X_J =
max(Xt, X∗). Since α = E[X∗] ≤ E[X_J] ≤ α, we deduce that E[X∗] = E[X_J] and thus
a.s. X_J = X∗, that is, P(X∗ ≥ Xt) = 1. This gives (i).

Let Y be as in (ii). Since I∗ is countable, we get that a.s. Y ≥ X∗. This gives (ii).
Condition (H) and (4.1) imply that for all τ ∈ T, we have a.s. G_τ^+ ≤ E[M | Fτ]. Notice that
if (5.2) holds, then (H) holds with M = sup_{k∈N} G_k^+.

For n ∈ N, let Tn = {τ ∈ T; τ ≥ n} be the set of stopping times larger than or equal to
n. We say a stopping time τ ∈ Tn is regular, which will be understood with respect to G, if
for all finite k ≥ n:

    E[Gτ | Fk] > Gk a.s. on {τ > k}.
We denote by T′_n the subset of Tn of regular stopping times. Notice that T′_n is non-empty as
it contains n.

Lemma 5.13. Let n ∈ N.

(i) If τ ∈ Tn, then there exists a regular stopping time τ′ ∈ T′_n such that τ′ ≤ τ and a.s.
E[G_{τ′} | Fn] ≥ E[Gτ | Fn].

(ii) If τ′, τ″ ∈ T′_n are regular, then the stopping time τ = max(τ′, τ″) ∈ T′_n is regular and
a.s. E[Gτ | Fn] ≥ max(E[G_{τ′} | Fn], E[G_{τ″} | Fn]).
Proof. We prove property (i). Let τ ∈ Tn and set τ′ = inf{k ≥ n; E[Gτ | Fk] ≤ Gk}, with the
convention that inf ∅ = ∞. Notice that τ′ is a stopping time and that a.s. n ≤ τ′ ≤ τ. On
{τ′ = ∞}, we have τ = ∞ and a.s. G_{τ′} = G∞ = Gτ. For ∞ > m ≥ n, we have, on {τ′ = m},
that a.s. E[G_{τ′} | Fm] = Gm ≥ E[Gτ | Fm]. We deduce that for all finite k ≥ n, a.s. on {τ′ ≥ k}:

    E[G_{τ′} | Fk] = E[ Σ_{m∈Jk,∞K} E[G_{τ′} | Fm] 1_{{τ′=m}} | Fk ] ≥ E[ Σ_{m∈Jk,∞K} E[Gτ | Fm] 1_{{τ′=m}} | Fk ]. (5.15)

We have on {τ′ > k} that E[Gτ | Fk] > Gk. Then use (5.15) to get that τ′ is regular. Take k = n
in (5.15) and use that τ′ ≥ n a.s. to get the last part of (i).

We prove property (ii). Let τ′, τ″ ∈ T′_n and τ = max(τ′, τ″). By construction τ is a
stopping time, see Proposition 4.4. We have for all m ≥ k ≥ n with k finite:

    E[Gτ 1_{{τ′=m}} | Fk] = E[G_{τ′} 1_{{m=τ′≥τ″}} | Fk] + E[G_{τ″} 1_{{τ″>τ′=m}} | Fk]. (5.16)

By summing (5.16) over m with m > k and using that τ′ ∈ T′_n, we get:

    E[Gτ | Fk] 1_{{τ′>k}} ≥ E[G_{τ′} | Fk] 1_{{τ′>k}} > Gk 1_{{τ′>k}}.

By symmetry, we also get E[Gτ | Fk] 1_{{τ″>k}} > Gk 1_{{τ″>k}}. Since {τ > k} = {τ′ > k} ∪ {τ″ > k},
this implies that E[Gτ | Fk] > Gk a.s. on {τ > k}. Thus, τ is regular.

By summing (5.16) over m with m ≥ k = n, and using that τ′ ≥ n a.s., we get E[Gτ | Fn] ≥
E[G_{τ′} | Fn]. By symmetry, we also have E[Gτ | Fn] ≥ E[G_{τ″} | Fn]. We deduce the last part
of (ii).
Lemma 5.14. We assume that G is adapted and hypothesis (H) and (5.3) hold. Then, for
all n ∈ N, there exists τ_n^◦ ∈ Tn such that a.s. ess sup_{τ∈Tn} E[Gτ | Fn] = E[G_{τ_n^◦} | Fn].
Proof. We set X∗ = ess sup_{τ∈Tn} E[Gτ | Fn]. According to the proof of Proposition 5.11, there
exists a sequence (τk, k ∈ N) of elements of Tn such that X∗ = sup_{k∈N} E[G_{τk} | Fn]. Thanks
to (i) of Lemma 5.13, there exists a sequence (τ′_k, k ∈ N) of regular stopping times, elements
of T′_n, such that E[G_{τ′_k} | Fn] ≥ E[G_{τk} | Fn]. According to (ii) of Lemma 5.13, for all k ∈
N, the stopping time τ″_k = max_{0≤j≤k} τ′_j belongs to T′_n, the sequence (E[G_{τ″_k} | Fn], k ∈ N)
is non-decreasing and E[G_{τ″_k} | Fn] ≥ E[G_{τ′_k} | Fn] ≥ E[G_{τk} | Fn]. In particular, we get X∗ =
sup_{k∈N} E[G_{τk} | Fn] ≤ sup_{k∈N} E[G_{τ″_k} | Fn] ≤ X∗, so that a.s. X∗ = lim_{k→∞} E[G_{τ″_k} | Fn].

Let τ_n^◦ ∈ Tn be the limit of the non-decreasing sequence (τ″_k, k ∈ N). Set Yk = E[M | F_{τ″_k}].
We deduce from the optional stopping theorem for closed martingales, see Proposition 4.26,
that (Yk, k ∈ N) is a martingale with respect to the filtration (F_{τ″_k}, k ∈ N), which is closed
thanks to property (ii) from Theorem 4.24. In particular, the sequence (Yk, k ∈ N) converges
a.s. and in L¹ towards Y∞ = E[M | F_{τ_n^◦}] according to Corollary 4.25. Notice also that
a.s. E[Yk | Fn] = E[Y∞ | Fn]. Then, we use Lemma 5.30 with Xk = G_{τ″_k} to get that a.s.
X∗ ≤ E[lim sup_{k→∞} G_{τ″_k} | Fn]. Thanks to (5.3), we have a.s. lim sup_{k→∞} G_{τ″_k} ≤ G_{τ_n^◦}. So we get
that a.s. X∗ ≤ E[G_{τ_n^◦} | Fn]. To conclude, use that by definition of X∗ we have E[G_{τ_n^◦} | Fn] ≤ X∗,
and thus X∗ = E[G_{τ_n^◦} | Fn].
Exercise 5.3. Assume that hypothesis (H) and (5.3) hold. Let n ∈ N. Prove that the limit
of a non-decreasing sequence of regular stopping times, elements of T′_n, is regular. Deduce
that τ_n^◦ in Lemma 5.14 is regular, that is, τ_n^◦ belongs to T′_n. △
Thanks to Lemma 5.14, there exists τ_{n+1}^◦ ∈ T_{n+1} such that a.s. S_{n+1} = E[G_{τ_{n+1}^◦} | F_{n+1}].
Since τ_{n+1}^◦ (resp. n) also belongs to Tn, we have Sn ≥ E[G_{τ_{n+1}^◦} | Fn] = E[S_{n+1} | Fn] (resp.
Sn ≥ Gn). This implies that Sn ≥ max(Gn, E[S_{n+1} | Fn]), and thus (Sn, n ∈ N) satisfies the
optimal equations.

Use Corollary 5.15 and Lemma 5.14 to get V∗ = E[S0].
Proposition 5.17. We assume that G is adapted and hypothesis (H) and (5.3) hold. Then
τ∗ defined by (5.5), with (Sn , n ∈ N) given by (5.17), is optimal: V∗ = E[Gτ∗ ].
Proof. If V∗ = −∞, then nothing has to be proven. So, we assume V∗ > −∞. According to
Corollary 5.15, there exists an optimal stopping time τ.

In a first step, we check that τ′ = min(τ, τ∗) is also optimal. Since E[G_τ^+] < +∞, by
Fubini and the definition of Sn, we have:

    E[Gτ 1_{{τ>τ∗}}] = Σ_{n∈N} E[Gτ 1_{{τ>τ∗=n}}] = Σ_{n∈N} E[E[Gτ | Fn] 1_{{τ>τ∗=n}}] ≤ Σ_{n∈N} E[Sn 1_{{τ>τ∗=n}}].

Since P(τ′ < τ∗) > 0 and Sn > Gn on {τ∗ > n} for n ∈ N, we deduce that:

    E[G_{τ″} 1_{{τ′<τ∗}}] > Σ_{n∈N} E[Gn 1_{{n=τ′<τ∗}}] = E[G_{τ′} 1_{{τ′<τ∗}}],

unless E[G_{τ″} 1_{{τ′<τ∗}}] = E[G_{τ′} 1_{{τ′<τ∗}}] = −∞. The latter case is not possible since
E[G_{τ′}] = V∗ > −∞. Thus, we deduce that E[G_{τ″} 1_{{τ′<τ∗}}] > E[G_{τ′} 1_{{τ′<τ∗}}]. This implies
(using again that E[G_{τ′}] > −∞) that:

    E[G_{τ″}] = E[G_{τ′} 1_{{τ′=τ∗}}] + E[G_{τ″} 1_{{τ′<τ∗}}] > E[G_{τ′} 1_{{τ′=τ∗}}] + E[G_{τ′} 1_{{τ′<τ∗}}] = E[G_{τ′}].
Exercise 5.4. Assume that G is adapted, hypothesis (H) and (5.3) hold, and V∗ > −∞.

1. Deduce from the proof of Proposition 5.17 that τ∗ defined by (5.5) is the minimal
optimal stopping time: if τ is an optimal stopping time, then a.s. τ ≥ τ∗.

△
Exercise 5.5. Assume that G is adapted and hypothesis (H) and (5.3) hold. We set for n ∈ N:

with the convention that inf ∅ = ∞. Recall τ∗ and τ∗∗ defined by (5.5) and (5.6).

3. Prove that E[S0] ≤ E[lim sup S_{n∧τ∗∗}] ≤ E[G_{τ∗∗}]. Deduce that τ∗∗ is optimal.

4. Assume that V∗ > −∞. Prove that if τ is an optimal stopping time, then τ ∧ τ∗∗ is also
optimal. Prove that a.s. τ ≤ τ∗∗.

5. Assume that V∗ > −∞. Prove that τ is an optimal stopping time if and only if a.s.
Sτ = Gτ on {τ < ∞} and τ∗ ≤ τ ≤ τ∗∗.

△
Exercise 5.6. Assume that G is adapted, hypothesis (H) and (5.3) hold, and V∗ > −∞.
Prove that τ∗ defined by (5.5) is regular. △
Corollary 5.18. Let G = (Gn , n ∈ N) be a sequence of random variables such that (5.2) and
(5.3) hold. Then there exists an optimal stopping time.
Proof. According to the first paragraph of Section 5.2, without loss of generality, we can
assume that F∞ = ⋁_{n∈N} Fn. If G is adapted to the filtration F = (Fn, n ∈ N), then use
M = sup_{n∈N} G_n^+, so that (H) holds, and Corollary 5.15 to conclude.

If the sequence G is not adapted to the filtration F, then we shall consider the correspond-
ing adapted sequence G′ = (G′_n, n ∈ N̄) given by G′_n = E[Gn | Fn] for n ∈ N̄. Notice G′ is well
defined thanks to (5.2). Thanks to (5.2), we can use the Fubini lemma to get for τ ∈ T:

    E[Gτ] = Σ_{n∈N̄} E[Gn 1_{{τ=n}}] = Σ_{n∈N̄} E[G′_n 1_{{τ=n}}] = E[G′_τ].

We thus deduce that the maximal gain for the game G is also the maximal gain for the game G′.
Let M = E[sup_{n∈N} G_n^+ | F∞]. Notice then that (H) holds with G replaced by G′. To
conclude using Corollary 5.15, it is enough to check that (5.3) holds with G replaced by G′.

For n ≥ k finite, we have G′_n ≤ E[sup_{ℓ∈Jk,∞K} Gℓ | Fn]. Since E[(sup_{ℓ∈Jk,∞K} Gℓ)^+] is finite
thanks to (5.2), we deduce from Lemma 5.31 that:

    lim sup_n G′_n ≤ lim sup_k E[ sup_{ℓ∈Jk,∞K} Gℓ | F∞ ] ≤ E[ lim sup_k sup_{ℓ∈Jk,∞K} Gℓ | F∞ ] ≤ E[G∞ | F∞] = G′_∞,

where we used Lemma 5.30 (with Xk = sup_{ℓ∈Jk,∞K} Gℓ and Yk = Y = M) for the second
inequality and (5.3) for the last. Thus (5.3) holds with G replaced by G′. This finishes the
proof.
Exercise 5.7. Let G = (Gn, n ∈ N) be a sequence of random variables such that (5.2) and
(5.3) hold. Let τ∗ = inf{n ∈ N; ess sup_{τ∈Tn} E[Gτ | Fn] = E[Gn | Fn]}, with inf ∅ = ∞. Prove
that τ∗ is optimal. △
Remark 5.19. We comment on the conditions (5.19) and (5.20). In particular, (5.20) holds if
(5.3) holds and a.s. G∞ = −∞. We now prove that if (5.3) and the convergence (5.21) hold,
then we can modify the gain so that the maximal gain is the same and (5.19) holds for the
modified gain. Notice that the convergence (5.21) holds in particular if G∞ is integrable,
thanks to Corollary 4.25.
Assume that Condition (H) page 86 holds for G. We consider the gain G′ = (G′_n, n ∈ N)
with G′_n = max(Gn, E[G∞ | Fn]), which satisfies Condition (H) with M′ = M + G_∞^+ as well
as (5.19), since (5.21) holds. According to Proposition 5.17, there exists an optimal stopping
time, say τ′, for the gain G′. The maximal gain is V′_∗ = E[G′_{τ′}]. Set τ = τ′ on ⋃_{n∈N} {τ′ =
n, G′_n = Gn} and τ = +∞ otherwise. Roughly speaking, the stopping rule τ can be described
as follows: on {τ′ = n}, either Gn = G′_n, and then one stops the game at time n to get
the gain Gn, or Gn < G′_n, and then one never stops the game, to get the gain G∞. Notice τ
is a stopping time. We have:

    E[Gτ] = Σ_{n∈N} E[Gn 1_{{τ=n}}]
          = Σ_{n∈N} E[Gn 1_{{τ′=n, Gn=G′_n}}] + Σ_{n∈N} E[G∞ 1_{{τ′=n, Gn<G′_n}}] + E[G∞ 1_{{τ′=∞}}]
          = Σ_{n∈N} E[Gn 1_{{τ′=n, Gn=G′_n}}] + Σ_{n∈N} E[E[G∞ | Fn] 1_{{τ′=n, Gn<G′_n}}] + E[G∞ 1_{{τ′=∞}}]
          = Σ_{n∈N} E[G′_n 1_{{τ′=n}}] + E[G∞ 1_{{τ′=∞}}]
          = E[G′_{τ′}].

As E[Gτ] = E[G′_{τ′}], we get that E[G′_{τ′}] ≤ V∗. Since G′_n ≥ Gn and τ′ is optimal, we also
get that E[G′_{τ′}] ≥ V∗. We deduce that V′_∗ = E[G′_{τ′}] = V∗ = E[Gτ], which implies that τ is
optimal.

Thus, if (5.21) holds, then (5.19) holds with G′ instead of G, and if (H) holds for G, then
we can recover an optimal stopping time for G from an optimal stopping time for G′, the
maximal gain being the same. ♦
Recall Tn = {τ ∈ T; τ ≥ n} is the set of stopping times equal to or larger than n ∈ N, and
T^ζ = {τ ∈ T; τ ≤ ζ} is the set of stopping times bounded by ζ ∈ N. For ζ ∈ N and n ∈ J0, ζK,
we define T_n^ζ = Tn ∩ T^ζ as well as:

    S_n^ζ = ess sup_{τ∈T_n^ζ} E[Gτ | Fn]. (5.22)

From Sections 5.1.1 and 5.2.3, we get that S_ζ^ζ = Gζ and S^ζ = (S_n^ζ, n ∈ J0, ζK) satisfies
the optimal equations (5.4). For n ∈ N, the sequence (S_n^ζ, ζ ∈ Jn, ∞J) is non-decreasing;
denote by S_n^∗ its limit. For n ∈ N, we have a.s. S_n^∗ = ess sup_{τ∈T_n^{(b)}} E[Gτ | Fn], where
T_n^{(b)} = Tn ∩ T^{(b)} and T^{(b)} ⊂ T is the subset of bounded stopping times. By construction of
Sn, we have for all n ∈ N:

    S_n^∗ ≤ Sn. (5.23)

The sequence (τ_∗^ζ, ζ ∈ N), with τ_∗^ζ = inf{n ∈ J0, ζK; S_n^ζ = Gn}, is non-decreasing and thus
converges to a limit, say τ_∗^∗ ∈ N̄, and

    τ_∗^∗ = lim_{ζ→∞} τ_∗^ζ = inf{n ∈ N; S_n^∗ = Gn}. (5.24)

Thanks to (5.23), we deduce that a.s. τ_∗^∗ ≤ τ∗. We set V_∗^ζ = E[S_0^ζ] = sup_{τ∈T^ζ} E[Gτ] and
V∗ = E[S0] = sup_{τ∈T} E[Gτ]. Let V_∗^∗ be the non-decreasing limit of the sequence (V_∗^ζ, ζ ∈ N),
so that V_∗^∗ ≤ V∗.
Remark 5.20. Assume that (5.2) holds and Gn is integrable for all n ∈ N. Since Gn ≤ S_n^ζ ≤
E[sup_{k∈N} G_k^+ | Fn] = Mn for all ζ ≥ n, using dominated convergence, we deduce from (5.2)
that (S_n^∗, n ∈ N) satisfies the optimal equations (5.4) with ζ = ∞. In fact, it is easy to
check that S^∗ = (S_n^∗, n ∈ N) is the smallest sequence satisfying the optimal equations (5.4)
with ζ = ∞. Following Remark 5.7, we deduce that S^∗ is the smallest super-martingale
which dominates (Gn, n ∈ N). And the process (S^∗_{n∧τ_∗^∗}, n ∈ N) is a martingale, which is
a.s. convergent thanks to (5.2). ♦
Definition 5.21. The infinite horizon case is the limit of the finite horizon cases if V_∗^∗ = V∗.

It is not true in general that V_∗^∗ = V∗, see Example 5.22 below, taken from Neveu [6].
Example 5.22. Let (Xn, n ∈ N∗) be independent random variables such that P(Xn = 1) =
P(Xn = −1) = 1/2 for all n ∈ N∗. Let c = (cn, n ∈ N∗) be a strictly increasing sequence such
that 0 < cn < 1 for all n ∈ N∗ and lim_{n→∞} cn = 1. We define G0 = 0, G∞ = 0, and for
n ∈ N∗:

    Gn = min(1, Wn) − cn,

with Wn = Σ_{k=1}^n Xk. Notice that Gn ≤ 1 and a.s. lim sup Gn = G∞, so that (5.2) and (5.19)
hold. (Notice also that E[|Gn|] < +∞ for all n ∈ N∗.) Since E[W_{n+1} | Fn] = Wn, we deduce
from Jensen's inequality that a.s. E[min(1, W_{n+1}) | Fn] ≤ min(1, Wn). Then use that the
sequence c is strictly increasing to get that for all n ∈ N a.s. Gn > E[G_{n+1} | Fn]. Using a
backward induction argument and the optimal equations, we get that S_n^ζ = Gn for all n ∈
J0, ζK and ζ ∈ N, and thus τ_∗^ζ = 0. We deduce that S_n^∗ = Gn for all n ∈ N, τ_∗^∗ = 0 and
V_∗^∗ = 0.

Since (5.2) and (5.3) hold, we deduce there exists an optimal stopping time for the infinite
horizon case. The stopping time τ = inf{n ∈ N∗; Wn = 1} is a.s. strictly positive and finite.
On {τ = n}, we have that Gn = 1 − cn, as well as Gm ≤ 0 < Gn for all m ∈ J0, nJ and
Gm ≤ 1 − cm < Gn for all m ∈ Kn, ∞K. We deduce that Gτ = sup_{τ′∈T} G_{τ′}, that is, τ = τ∗ is
optimal. Notice that V∗ > V_∗^∗ = 0 and a.s. τ∗ > τ_∗^∗ = 0. Thus, the infinite horizon case is
not the limit of the finite horizon cases. △
We give sufficient conditions so that V_∗^∗ = V∗. Recall that (5.2) implies Condition (H).
Proof. If V∗ = −∞, nothing has to be proven. Let us assume that V∗ > −∞. According to
Proposition 5.17, there exists an optimal stopping time, say τ. Since E[G_{min(τ,n)}] ≤ V_∗^n, we
get:

As (Tn, n ∈ N) is uniformly integrable, we deduce from property (iii) of Proposition 8.18 that
(1_{{n<τ<∞}} Tn, n ∈ N) is also uniformly integrable. Since a.s. lim_{n→+∞} 1_{{n<τ<∞}} = 0, and
thus lim_{n→+∞} 1_{{n<τ<∞}} Tn = 0, we deduce from Proposition 8.21 that this latter convergence
also holds in L¹, that is, lim_{n→+∞} E[1_{{n<τ<∞}} Tn] = 0.
If τ is a.s. finite, then we have E[1_{{τ=∞}} (G∞ − Gn)^+] = 0. Otherwise, if (5.20) holds,
then the sequence (1_{{τ=∞}} (G∞ − Gn)^+, n ∈ N) converges a.s. to 0. Since 1_{{τ=∞}} (G∞ −
Gn)^+ ≤ |Tn| and (Tn, n ∈ N) is uniformly integrable, we deduce from property (iii) of
Proposition 8.18 that the sequence (1_{{τ=∞}} (G∞ − Gn)^+, n ∈ N) is uniformly integrable. Use
Proposition 8.21 to get that it converges towards 0 in L¹: lim_{n→+∞} E[1_{{τ=∞}} (G∞ − Gn)^+] = 0.
In both cases, we deduce that lim_{n→∞} (V∗ − V_∗^n) = 0. This gives the result.

The following exercise completes Proposition 5.23 by giving the convergence of the minimal
optimal stopping times in the finite horizon cases to τ∗, the minimal optimal stopping time in
the infinite horizon case, defined in (5.5).
Exercise 5.8. Let (Gn, n ∈ N) be an adapted sequence of random variables taking values in
R and define G∞ by (5.19). Assume that (H) holds and that the sequence (Tn, n ∈ N), with
Tn = sup_{k≥n} Gk − Gn, is uniformly integrable. Recall τ_∗^∗ defined in (5.24).

1. If τ∗ is a.s. finite, prove that a.s. S^∗_{n∧τ∗} = S_{n∧τ∗} for all n ∈ N, and thus a.s. τ_∗^∗ = τ∗.

2. If (5.20) holds, prove that S_n^∗ = Sn for all n ∈ N, and thus a.s. τ_∗^∗ = τ∗.

△
We give an immediate Corollary of Proposition 5.23.
Corollary 5.24. Let (Gn , n ∈ N) be an adapted sequence of R-valued random variables and
define G∞ by (5.19). Assume that for n ∈ N we have Gn = Zn − Wn , with (Zn , n ∈ N)
adapted, E[supn∈N |Zn |] < +∞ and (Wn , n ∈ N) an adapted non-decreasing sequence of non-
negative random variables. If there exists an a.s. finite optimal stopping time or if (5.20)
holds, then the infinite horizon case is the limit of the finite horizon cases.
Proof. For k ≥ n, we have Gk − Gn ≤ Zk − Zn ≤ 2 sup_{ℓ∈N} |Zℓ|. This gives that the sequence
(Tn = sup_{k≥n} Gk − Gn, n ∈ N) is non-negative and bounded by 2 sup_{ℓ∈N} |Zℓ|, hence it is
uniformly integrable. We conclude using Proposition 5.23.
Using super-martingale theory, we can prove directly the following result (which is not a
direct consequence of the previous Corollary with Wn = 0).
Proposition 5.25. Let (Gn, n ∈ N) be an adapted sequence of random variables taking values
in R and define G∞ by (5.19). Assume that E[sup_{n∈N} |Gn|] < +∞. Then the infinite horizon
case is the limit of the finite horizon cases. Furthermore, we have that (Sn, n ∈ N) given by
(5.17) is a.s. equal to (S_n^∗, n ∈ N) given by (5.22), and thus the optimal stopping time τ∗
defined by (5.5) is a.s. equal to τ_∗^∗ defined by (5.24).
Proof. According to Remark 5.20, the process S^∗ = (S_n^∗, n ∈ N) satisfies the optimal equations
(5.4) with ζ = ∞. Since it is bounded by sup_{n∈N} |Gn|, which is integrable, it is a super-
martingale and it converges a.s. to a limit, say S_∞^∗. We have S_n^∗ ≥ Gn for all n ∈ N, which
implies, thanks to (5.19), that S_∞^∗ ≥ G∞.

Let n ∈ N. We have for all stopping times τ ≥ n that a.s. S_n^∗ ≥ lim_{m→∞} E[S^∗_{m∧τ} | Fn] =
E[S_τ^∗ | Fn], where we used the optional stopping theorem for the inequality, and the dominated
convergence from property (vi) in Proposition 2.7 (with Y = sup_{n∈N} |Gn| and Xn = S_n^∗)
for the equality. This implies that, for all stopping times τ ≥ n, a.s. S_n^∗ ≥ E[Gτ | Fn],
which, thanks to Proposition 5.11, implies that a.s. S_n^∗ ≥ Sn. Thanks to (5.23), we get that
a.s. S_n^∗ ≤ Sn and thus a.s. S_n^∗ = Sn for all n ∈ N. By dominated convergence, we have
V_∗^∗ = lim_{ζ→∞} E[S_0^ζ] = E[S_0^∗] = V∗. Thus, the infinite horizon case is the limit of the
finite horizon cases. Using (5.24), we get that a.s. τ∗ = τ_∗^∗.
We set x0 = sup{x ∈ R; P(X ≥ x) > 0}; we have x0 ∈ (−∞, +∞] as P(X > −∞) > 0. Since
E[X^+] is finite, we get that the function f(x) = E[(X − x)^+] is continuous, strictly decreasing
on (−∞, x0), and such that lim_{x→−∞} f(x) = +∞ and lim_{x→x0} f(x) = 0. By convention, we
set f(−∞) = +∞. Since a.s. lim_{n→∞} Mn = x0, where Mn = max_{1≤k≤n} Xk, we get that a.s.
lim_{n→∞} f(Mn) = 0. Thus the stopping time τ = inf{n ∈ N∗; f(Mn) ≤ c} is a.s. finite. From
the properties of f, we deduce there exists a unique c∗ ∈ R such that f(c∗) = c. Using that
(f(Mn), n ∈ N∗) is non-increasing and that it jumps at record times of the sequence
(Xn, n ∈ N∗), we get the representation:

    τ = inf{n ∈ N∗; Xn ≥ c∗}.

We shall prove that τ is optimal and that V∗ = c∗.
Since τ is geometric with parameter P(X ≥ c∗), we have E[τ] = 1/P(X ≥ c∗) < +∞ and:

    E[Gτ] = E[Xτ] − c E[τ] = (E[X 1_{{X≥c∗}}] − c) / P(X ≥ c∗) = (E[(X − c∗)^+] − c) / P(X ≥ c∗) + c∗ = c∗,

where we used that E[(X − c∗)^+] = f(c∗) = c for the last equality. Furthermore, for n ∈ N∗,
we deduce from (5.25) that a.s.:

    E[G_{n+1} | Fn] > Gn on {n < τ} ∩ {Gn > −∞}, (5.26)
    E[G_{n+1} | Fn] ≤ Gn on {n ≥ τ}. (5.27)
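As a numerical illustration (our example, not from the notes), take X exponential with mean 1, so that f(x) = E[(X − x)^+] equals e^{−x} for x ≥ 0 and 1 − x for x < 0; the threshold c∗ solves f(c∗) = c, and a Monte Carlo run can check that E[Gτ] = c∗.

```python
import math, random

def f(x):                     # f(x) = E[(X - x)^+] for X ~ Exp(1)
    return math.exp(-x) if x >= 0 else 1.0 - x

def c_star(c, lo=-50.0, hi=50.0):
    for _ in range(200):      # bisection: f is continuous, strictly decreasing
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) > c else (lo, mid)
    return 0.5 * (lo + hi)

c = 0.1
cs = c_star(c)                # equals -log(c) in this example
random.seed(0)

def one_run():                # stop at tau = inf{n : X_n >= c*}
    n, best = 0, -math.inf
    while True:
        n += 1
        x = random.expovariate(1.0)
        best = max(best, x)
        if x >= cs:
            return best - n * c       # G_n = max_{k <= n} X_k - n c

print(cs, sum(one_run() for _ in range(100_000)) / 100_000)   # both ~ 2.30
```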
We now state a technical Lemma whose proof is postponed to the end of this section.
Lemma 5.26. Let X be a random variable taking values in [−∞, +∞). Let (Xn, n ∈ N∗) be a
sequence of random variables distributed as X. Let c ∈ ]0, +∞[. Set Gn = max_{1≤k≤n} Xk − nc
for n ∈ N∗. If E[(X^+)²] < +∞, then E[sup_{n∈N∗} G_n^+] < +∞ and a.s. lim sup Gn = −∞.
According to Lemma 5.26, conditions (5.2) and (5.3) hold. According to Proposition
5.17, τ∗ given by (5.5) is optimal. This implies that V∗ = E[Gτ∗] ≥ E[Gτ] > −∞, and since
a.s. G∞ = −∞, we get that τ∗ is finite. We also deduce from (5.26) that a.s. τ∗ ≥ τ.

We have, with c′ = c/2:

    −∞ < E[Gτ∗] = E[ max_{k∈J1,τ∗K} Xk − τ∗ c′ ] − E[τ∗ c′] ≤ E[ sup_{n∈N∗} ( max_{1≤k≤n} Xk − n c′ )^+ ] − E[τ∗] c′.

Using Lemma 5.26 with c replaced by c′, we get that E[sup_{n∈N∗} (max_{1≤k≤n} Xk − n c′)^+] is
finite and thus E[τ∗] is finite. Let n ∈ N∗. On {τ = n}, we have for finite k ≥ n that
Gτ − τ∗ c ≤ G_{τ∗∧k} ≤ sup_{n∈N} G_n^+, and thus we get the a.s. bound (5.28).
Mimicking (5.14) with G instead of S and using that τ∗ ≥ τ, we deduce from (5.27) that
a.s., on {τ = n}, E[G_{τ∗∧k} | Fn] ≤ Gn for all finite k ≥ n. Letting k go to infinity, since τ∗ is
a.s. finite, we deduce by dominated convergence, using (5.28), that E[Gτ∗ | Fn] ≤ Gn a.s. on
{τ = n}. Since τ is finite, this gives E[Gτ∗] ≤ E[Gτ]. Since τ∗ is optimal, we deduce that τ is
also optimal. This gives V∗ = E[Gτ] = c∗. Notice also that a.s. τ = τ∗, as τ∗ is the minimal
optimal stopping time according to Exercise 5.4.

If one cannot call back a previous buyer, then the gain is G″_n = Xn − nc. Let V″_∗
be the corresponding maximal gain. On the one hand, since G″_n ≤ Gn for all n ∈ N∗, we
deduce that V″_∗ ≤ V∗. On the other hand, we have G″_τ = Gτ = Gτ∗. This implies that
V″_∗ ≥ E[G″_τ] = E[Gτ] = V∗. We deduce that V″_∗ = c∗, and τ is also optimal in this case.
In this last part, we assume furthermore that E[|X|] < +∞. We shall prove directly, as
Corollary 5.24 cannot be used here, that the infinite horizon case is the limit of the finite
horizon cases. We first consider the case where previous buyers can be called back, so the
gain is Gn = max_{1≤k≤n} Xk − nc for n ∈ N∗. For n ∈ N∗, we have a.s. that X1 − τc ≤
G_{τ∧n} ≤ sup_{n∈N∗} G_n^+. By dominated convergence, we get that lim_{n→∞} E[G_{τ∧n}] = E[Gτ] = V∗.
We deduce that V_∗^∗ ≥ lim_{n→∞} E[G_{τ∧n}] = V∗, and V_∗^∗ = V∗ as V∗ ≥ V_∗^∗. Therefore the
infinite horizon case is the limit of the finite horizon cases. (Notice that if 1 > P(X = −∞) > 0,
then the infinite horizon case is no longer the limit of the finite horizon cases, as V_∗^n = −∞
for all n ∈ N∗.)

We now consider the case where previous buyers cannot be called back, so the gain is
Gn = Xn − nc for n ∈ N∗. Let V″_∗ = V∗ (resp. V″_∗^n) denote the maximal gain when the
horizon is infinite (resp. equal to n):

where we used that G″_{τ∗} − G″_n ≤ X_{τ∗} − Xn on {n < τ∗} for the second inequality, and that,
conditionally on {n < τ∗ < ∞}, (X_{τ∗}, Xn) and (X_{τ∗}, X1) have the same distribution, for the
last equality. Since X_{τ∗} and X1 are integrable, we get that lim_{n→+∞} E[1_{{n<τ∗<∞}} (X_{τ∗} − X1)] = 0
by dominated convergence. We deduce that the infinite horizon case is the limit of the finite
horizon cases.
Proof of Lemma 5.26. Assume that E[(X^+)²] < +∞. Since Xn − nc ≤ Gn ≤ max_{1≤k≤n} (Xk −
kc) for all n ∈ N∗, we deduce that sup_{n∈N∗} Gn = sup_{n∈N∗} (Xn − nc). This gives:

    E[ sup_{n∈N∗} G_n^+ ] = E[ sup_{n∈N∗} (Xn − nc)^+ ] ≤ E[ Σ_{n∈N∗} (Xn − nc)^+ ] = E[ Σ_{n∈N∗} (X − nc)^+ ],

where we used Fubini (twice) and that Xn is distributed as X in the last equality. Then use
that, for x ∈ R:

    Σ_{n∈N∗} (x − n)^+ ≤ Σ_{n∈N∗} x^+ 1_{{n<x^+}} ≤ (x^+)²,

to get E[Σ_{n∈N∗} (X − nc)^+] ≤ E[(X^+)²]/c < +∞. So we obtain E[sup_{n∈N∗} G_n^+] < +∞.

Set G′_n = max_{1≤k≤n} Xk − nc/2. Using the previous result (with c replaced by c/2),
we deduce that sup_{n∈N∗} (G′_n)^+ is integrable and thus a.s. lim sup G′_n < +∞. Since Gn =
G′_n − nc/2, we get that a.s. lim sup Gn ≤ lim sup G′_n − lim nc/2 = −∞.

With the notation of Lemma 5.26, one can prove that if the random variables (Xn, n ∈ N∗)
are independent, then E[sup_{n∈N∗} G_n^+] < +∞ implies that E[(X^+)²] < +∞.
Proof. We keep notations from Section 5.3. Recall definition (5.22) of S_n^ζ for the finite horizon
ζ ∈ N. We deduce from Proposition 5.6 and S_ζ^ζ = Gζ that S_n^ζ = ϕ_{ζ−n}(Xn) for all 0 ≤ n ≤ ζ,
and that the optimal stopping time is τ_∗^ζ = inf{n ∈ J0, ζK; ϕ_{ζ−n}(Xn) = ϕ(Xn)}.
Proof. By an elementary induction argument, we get that the sequence (ϕn, n ∈ N) is non-
decreasing. Let ϕ∗ be its limit. By monotone convergence, we get that ϕ∗ = max(ϕ, Pϕ∗).
Let g be a non-negative function such that g ≥ max(ϕ, Pg); we then have by induction that
g ≥ ϕn, and thus g ≥ ϕ∗.
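The lemma suggests computing ϕ∗ by iterating ϕ_{n+1} = max(ϕ, Pϕn). A sketch for a finite state space follows; the transition matrix P and the gain ϕ below are illustrative placeholders, not taken from the notes.

```python
import numpy as np

# Value iteration for phi* = lim phi_n with phi_{n+1} = max(phi, P phi_n),
# on a finite state space; P and phi are illustrative choices.
P = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.50, 0.50]])   # transition matrix of the chain X
phi = np.array([0.0, 1.0, 3.0])      # gain phi(x) when stopping at x

phi_star = phi.copy()                # phi_0 = phi
for _ in range(10_000):
    nxt = np.maximum(phi, P @ phi_star)
    if np.max(np.abs(nxt - phi_star)) < 1e-12:
        break
    phi_star = nxt

# Optimal rule: stop at the first n with phi*(X_n) = phi(X_n).
print(phi_star, np.isclose(phi_star, phi))
```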
We now give the main result of this section on the infinite horizon case. Recall T^{(b)} is the
set of bounded stopping times.

with the conventions inf ∅ = ∞ and ϕ(X∞) = lim sup ϕ(Xn). Furthermore, the infinite
horizon case is the limit of the finite horizon cases and a.s. τ∗ = τ_∗^∗.
Proof. We keep notations from the proof of Lemma 5.27. Lemma 5.28 implies that S_n^∗ =
lim_{ζ→∞} S_n^ζ = ϕ∗(Xn). Recall that by definition τ_∗^∗ = lim_{ζ→∞} τ_∗^ζ. According to Proposition
5.25, the infinite horizon case is the limit of the finite horizon cases and the optimal stopping
time τ∗ is given by (5.5), that is, by (5.29) with the conventions inf ∅ = ∞ and ϕ(X∞) =
lim sup ϕ(Xn). We also get that it is a.s. equal to τ_∗^∗ and that V∗ = E[Gτ∗] = Ex[S_0^∗] = ϕ∗(x).
5.4 Appendix
We give in this section some technical Lemmas related to integration. Let (Ω, P, F) be a
probability space.
Lemma 5.30. Let X and (Xn, n ∈ N) be real-valued random variables. Let Y and (Yn, n ∈ N)
be non-negative integrable random variables. Assume that a.s. X_n^+ ≤ Yn for all n ∈ N,
lim_{n→∞} Yn = Y and lim_{n→∞} E[Yn | H] = E[Y | H], where H ⊂ F is a σ-field. Then we have
that a.s.:

    lim sup_{n→∞} E[Xn | H] ≤ E[ lim sup_{n→∞} Xn | H ].
Proof. By Fatou's lemma, we get lim inf_{n→∞} E[X_n^− | H] ≥ E[lim inf_{n→∞} X_n^− | H]. We also have:

where we used Fatou's lemma for the inequality. To conclude, use that a.s.:

    lim sup_{n→∞} E[Xn | H] ≤ lim sup_{n→∞} E[X_n^+ | H] − lim inf_{n→∞} E[X_n^− | H],

and that lim sup_{n→∞} X_n^+ − lim inf_{n→∞} X_n^− = lim sup_{n→∞} Xn.
Let F = (Fn, n ∈ N), with Fn ⊂ F, be a filtration. We set F∞ = ⋁_{n∈N} Fn, the smallest
σ-field which contains ⋃_{n∈N} Fn.

Lemma 5.31. Let M be a random variable taking values in [−∞, +∞) such that E[M^+] < +∞.
Let Mn = E[M | Fn] for n ∈ N̄. Then, we have that a.s. lim sup Mn ≤ M∞.
[1] Y. S. Chow, H. Robbins, and D. Siegmund. Great expectations: the theory of optimal
stopping. Houghton Mifflin Co., Boston, Mass., 1971.
[2] E. B. Dynkin. Optimal choice of the stopping moment of a Markov process. Dokl. Akad.
Nauk SSSR, 150:238–240, 1963.
[4] T. S. Ferguson. Who solved the secretary problem? Statist. Sci., 4(3):282–289, 08 1989.
[5] S. A. Lippman and J. J. McCall. Job search in a dynamic economy. J. Econom. Theory,
12(3):365–390, 1976.
[7] J. L. Snell. Applications of martingale system theorems. Trans. Amer. Math. Soc.,
73:293–312, 1952.
Chapter 6
Brownian motion
with parameters m ∈ R and σ > 0. The random variable X is square integrable; the
parameter m is the mean of X and σ² its variance. The law of X is often denoted by
N(m, σ²). By convention, the constant m ∈ R will also be considered as a (degenerate)
Gaussian random variable with σ² = 0, and we shall denote its distribution by N(m, 0). The
characteristic function ψ_{m,σ²} of X with distribution N(m, σ²) is given by:

    ψ_{m,σ²}(u) = E[e^{iuX}] = exp(ium − σ²u²/2) for u ∈ R. (6.1)
In the next definition we recall the extension of the Gaussian distribution to higher di-
mensions. We recall that a matrix Σ ∈ R^{d×d}, with d ≥ 1, is positive semi-definite if it is
symmetric and ⟨u, Σu⟩ ≥ 0 for all u ∈ R^d, where ⟨·, ·⟩ is the Euclidean scalar product on R^d.

Definition 6.1. Let d ≥ 1. Let µ ∈ R^d and Σ ∈ R^{d×d} be a positive semi-definite matrix. An
R^d-valued random variable X has Gaussian distribution N(µ, Σ) if its characteristic function
ψ_{µ,Σ} is given by:

    ψ_{µ,Σ}(u) = E[e^{i⟨u,X⟩}] = exp(i⟨u, µ⟩ − ⟨u, Σu⟩/2) for u ∈ R^d. (6.2)
If X is a Gaussian random variable with distribution N(µ, Σ), then X is square integrable
with mean E[X] = µ and covariance matrix (see Definition 1.61) Cov(X, X) = Σ. Further-
more, using the series expansion of the exponential function, we get that for all λ ∈ C^d
the random variable e^{⟨λ,X⟩} is integrable, and we have:

    E[e^{⟨λ,X⟩}] = exp(⟨λ, µ⟩ + ⟨λ, Σλ⟩/2). (6.3)
Using (6.2) with u replaced by M^⊤u, we deduce the following lemma, which asserts that
every affine transformation of a Gaussian random variable is still a Gaussian random variable.

Lemma 6.2. Let p, d ∈ N∗. Let X be an R^d-valued Gaussian random variable with distribution
N(µ, Σ). Let M ∈ R^{p×d} and c ∈ R^p. Then Y = c + MX is an R^p-valued Gaussian random
variable with mean E[Y] = c + Mµ and covariance matrix Cov(Y, Y) = MΣM^⊤.
The next remark ensures that, for all µ ∈ R^d and all positive semi-definite matrices Σ ∈ R^{d×d},
the distribution N(µ, Σ) is meaningful.

Remark 6.3. Let d ≥ 1. Let (G1, . . . , Gd) be independent real-valued Gaussian random
variables with the same distribution N(0, 1). Using (6.2), we get that the random vector
G = (G1, . . . , Gd) is Gaussian with distribution N(0, Id), with Id ∈ R^{d×d} the identity matrix.

Let µ ∈ R^d and Σ ∈ R^{d×d} be a positive semi-definite matrix. There exists an orthogonal
matrix O ∈ R^{d×d} (that is, O^⊤O = OO^⊤ = Id) and a diagonal matrix ∆ ∈ R^{d×d} with non-
negative entries such that Σ = O∆²O^⊤. According to Lemma 6.2, we get that µ + O∆G has
distribution N(µ, Σ). ♦
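A short sketch of this construction (with illustrative µ and Σ of our choosing):

```python
import numpy as np

# Sample N(mu, Sigma) as mu + O Delta G with Sigma = O Delta^2 O^T,
# following Remark 6.3; mu and Sigma are illustrative placeholders.
rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])          # positive semi-definite

w, O = np.linalg.eigh(Sigma)            # Sigma = O diag(w) O^T, w >= 0
Delta = np.diag(np.sqrt(np.clip(w, 0.0, None)))

G = rng.standard_normal((2, 100_000))   # i.i.d. N(0, 1) entries
X = mu[:, None] + O @ Delta @ G         # samples of N(mu, Sigma)

print(X.mean(axis=1))                   # ~ mu
print(np.cov(X))                        # ~ Sigma
```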
We have the following result on the convergence in distribution of Gaussian vectors.

Lemma 6.4. Let d ≥ 1. The family of Gaussian probability distributions {N(µ, Σ); µ ∈
R^d, Σ ∈ R^{d×d} positive semi-definite} is closed under convergence in distribution. Further-
more, if (Xn, n ∈ N) are Gaussian random variables on R^d, then the sequence (Xn, n ∈ N)
converges in distribution towards a limit, say X, if and only if X is a Gaussian random
variable and (E[Xn], n ∈ N) and (Cov(Xn, Xn), n ∈ N) converge respectively towards E[X]
and Cov(X, X).
This implies that the sequence (mn, n ∈ N) either converges in R or lim_{n→∞} |mn| = +∞.
In the latter case, we deduce from (6.4) that E[e^{−ixaG} ϕ(aG)] = 0 for all a > 0. Letting a
go to 0, we deduce by dominated convergence, as |ϕ| ≤ 1, from the continuity of ϕ at 0,
that ϕ(0) = 0, which is a contradiction. Thus the sequence (mn, n ∈ N) converges to a limit
m ∈ R. We deduce that, for all u ∈ R, ψn(u) converges towards e^{ium − σ²u²/2}, which is thus
equal to ψ(u). We deduce from (6.1) that X has distribution N(m, σ²).
We have proved that if the sequence (Xn, n ∈ N) of real-valued Gaussian random variables
converges in distribution towards X, then X is a Gaussian random variable and (E[Xn], n ∈
N) as well as (Cov(Xn, Xn), n ∈ N) converge respectively towards E[X] and Cov(X, X). The
converse is a direct consequence of (6.1).

Using (6.2) (three times), we get that E[e^{i⟨w,W⟩}] = E[e^{i⟨u,X⟩}] E[e^{i⟨v,Y⟩}] for all w = (u, v) ∈
R^{d+p}. Since the characteristic function characterizes the distribution of R^q-valued random
variables for q ∈ N∗, we deduce that (X, Y) has the same distribution as (X′, Y′), where X′
and Y′ are independent and respectively distributed as X and Y. This implies that X and
Y are independent.

The converse is immediate.
Lemma 6.8. The distribution of a Gaussian process is characterized by its mean process and
covariance kernel.
Theorem 6.9. Let T be a set, m a real-valued function defined on T and K a positive semi-
definite function defined on T . Then there exist a probability space and a Gaussian process
defined on this probability space with mean process m and covariance kernel K.
One very interesting Gaussian process is the so-called Brownian motion. We first give its
covariance kernel.
Lemma 6.10. The function K = (K(s, t); s, t ∈ R+ ) defined by K(s, t) = s∧t is a covariance
kernel on R+ .
Proof. Let λ be the Lebesgue measure on R+. We recall that ⟨f, g⟩ = ∫ fg dλ defines a scalar
product on L²(R+, B(R+), λ). Set f_t = 1_{[0,t]} for t ∈ R+, and notice that K(s, t) = ⟨f_s, f_t⟩
for all s, t ∈ R+. The function K is clearly symmetric, and for all n ∈ N∗, t1, . . . , tn ∈ R+,
a1, . . . , an ∈ R, we have:

    Σ_{1≤i,j≤n} a_i a_j K(t_i, t_j) = ∫ ( Σ_{1≤i≤n} a_i f_{t_i} )² dλ ≥ 0.
The existence of the Brownian motion, see below, is justified by Theorem 6.9 and Lemma
6.10. We say a Gaussian process is centered if its mean function is constant equal to 0.
6.2.1 Continuity

There is a technical difficulty when one says the Brownian motion is a.s. continuous: one
sees the Brownian motion as an R^{R+}-valued random variable, and one can prove that the
set of continuous functions is not measurable with respect to the σ-field B(R)^{⊗R+} on R^{R+}.
For this reason, we shall consider directly the set of continuous functions.

For an interval I ⊂ R+, let C⁰(I) = C⁰(I, R) be the set of R-valued continuous functions
defined on I. We define the uniform norm ‖·‖∞ on C⁰(I) as ‖f‖∞ = sup_{x∈I} |f(x)| for
f ∈ C⁰(I). It is easy to check that (C⁰(I), ‖·‖∞) is a Banach space. We denote by
B(C⁰(I)) the corresponding Borel σ-field. Notice that C⁰(I) is a subset of R^I (but it does
not belong to B(R)^{⊗I}). We consider C⁰(I) ∩ B(R)^{⊗I} = {C⁰(I) ∩ A; A ∈ B(R)^{⊗I}}, which is
a σ-field on C⁰(I); it is called the restriction of B(R)^{⊗I} to C⁰(I). We admit the following
lemma, which states that the Borel σ-field of the Banach space C⁰(I) is C⁰(I) ∩ B(R)^{⊗I}, see
Example 1.3 in [1]. (The proof given in [1] has to be adapted when I is not compact.)

Lemma 6.12. Let I be an interval of R+. We have B(C⁰(I)) = C⁰(I) ∩ B(R)^{⊗I}.
We consider a time-space scaling X^{(n)} = (X_t^{(n)}, t ∈ R+) of the process S given by, for n ∈ N∗:

    X_t^{(n)} = (1/√n) S_{⌊nt⌋}.
Proof. We deduce from the central limit theorem that (⌊nt⌋^{−1/2} S_{⌊nt⌋}, n ∈ N∗) converges in
distribution towards a Gaussian random variable with distribution N(0, 1). This implies
that (X_t^{(n)}, n ∈ N∗) converges in distribution towards Bt. This gives the convergence in
distribution of the 1-dimensional marginals of X^{(n)} towards those of B.

Let t ≥ s ≥ 0. By construction, we have that X_t^{(n)} − X_s^{(n)} is independent of σ(X_u^{(n)}, u ∈
[0, s]) and distributed as a_n(t, s) X_{t−s}^{(n)}, with a_n(t, s) = ⌊n(t−s)⌋/(⌊nt⌋ − ⌊ns⌋) if
⌊nt⌋ − ⌊ns⌋ > 0 and a_n(t, s) = 1 otherwise. Since lim_{n→∞} a_n(t, s) = 1, we deduce that
((X_s^{(n)}, X_t^{(n)} − X_s^{(n)}), n ∈ N∗) converges in distribution towards (G1, G2), where G1 and
G2 are independent Gaussian random variables with G1 ∼ N(0, s) and G2 ∼ N(0, t − s).
Notice that (G1, G2) is distributed as (Bs, Bt − Bs). Indeed, (Bs, Bt − Bs) is a Gaussian
vector, being a linear transformation of the Gaussian vector (Bs, Bt); it has mean (0, 0), and
we have Var(Bs) = s, Var(Bt − Bs) = t − s, see (6.5), and Cov(Bs, Bt − Bs) = 0, see (6.6),
so the means and covariance matrices of (G1, G2) and (Bs, Bt − Bs) are the same. This gives
that they have the same distribution. We deduce that ((X_s^{(n)}, X_t^{(n)}), n ∈ N∗) converges in
distribution towards (Bs, Bt). This gives the convergence in distribution of the 2-dimensional
marginals of X^{(n)} towards those of B.

The convergence in distribution of the k-dimensional marginals of X^{(n)} towards those of
B is then an easy extension which is left to the reader.
In fact we can obtain a much stronger statement concerning this convergence by considering
a continuous linear interpolation of the processes X^{(n)}. For n ∈ N∗, we define the continuous
process X̃^{(n)} = (X̃_t^{(n)}, t ∈ R+) by X̃_t^{(n)} = X_t^{(n)} + C_t^{(n)}, where C_t^{(n)} = (1/√n)(nt − ⌊nt⌋) Y_{⌊nt⌋+1}.
Notice that E[|C_t^{(n)}|²] ≤ n^{−1}, so that (C_t^{(n)}, n ∈ N∗) converges in probability towards 0 for
all t ∈ R+. We deduce that the sequence (X̃^{(n)}, n ∈ N∗) converges in distribution for the
finite dimensional marginals towards B. Donsker's theorem states that this convergence in
distribution holds for the processes seen as continuous functions. For a function f = (f(t), t ∈
R+) defined on R+, we write f_{[0,1]} = (f(t), t ∈ [0, 1]) for its restriction to [0, 1]. We admit
the following result, see Theorem 8.2 in [1].

Theorem 6.15 (Donsker (1951)). The sequence of processes (X̃_{[0,1]}^{(n)}, n ∈ N∗) converges in
distribution, on the space C⁰([0, 1]), towards B_{[0,1]}, where B is a standard Brownian motion.
In particular, we get that for all continuous functionals F defined on C⁰([0, 1]), we have
that (F(X̃_{[0,1]}^{(n)}), n ∈ N∗) converges in distribution towards F(B_{[0,1]}). For example, the
following real-valued functionals F on C⁰([0, 1]) are continuous, for f ∈ C⁰([0, 1]):

    F(f) = ‖f‖∞,   F(f) = sup_{[0,1]}(f),   F(f) = ∫_{[0,1]} f dλ,   F(f) = f(t0) for some t0 ∈ [0, 1].
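As a quick Monte Carlo illustration (ours, with Rademacher steps as an assumption), one can apply the continuous functional F(f) = sup_{[0,1]} f to the rescaled walk; by the reflection principle, sup_{[0,1]} B is distributed as |B1|, so P(sup_{[0,1]} B ≤ 1) = 2Φ(1) − 1 ≈ 0.683.

```python
import numpy as np

# Donsker illustration: distribution of F(X^{(n)}) with F(f) = sup f on [0,1],
# for the rescaled walk with Rademacher steps (an illustrative choice).
rng = np.random.default_rng(0)
n, runs = 1_000, 20_000
Y = rng.choice([-1.0, 1.0], size=(runs, n))
S = np.cumsum(Y, axis=1) / np.sqrt(n)     # X^{(n)} at times k/n, k = 1..n
sup = np.maximum(S.max(axis=1), 0.0)      # sup over the grid (X_0 = 0)

print((sup <= 1.0).mean())                # ~ 2*Phi(1) - 1 ~ 0.683
```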
(i) We say that X is a Markov process with respect to the filtration F if for all t ∈ R+,
conditionally on the σ-field σ(Xt), the σ-fields Ft and σ(Xu, u ≥ t) are independent.

In the previous definition, one usually takes for F the natural filtration of X, that is, Ft =
σ(Xu, u ∈ [0, t]). Clearly, if a process has independent increments, it has the Markov property
(with respect to its natural filtration).
Lemma 6.17. The Brownian motion is a Markov process (with respect to its natural filtra-
tion), with independent and stationary increments.

Proof. Let B = (Bt, t ∈ R+) be a standard Brownian motion and F = (Ft, t ∈ R+) its natural
filtration, that is, Ft = σ(Bu, u ∈ [0, t]). It is enough to check that it has independent and
stationary increments. Let t ≥ s ≥ 0. Since B is a Gaussian process, we deduce that Bt − Bs
is Gaussian, and we have Var(Bt − Bs) = t − s = Var(B_{t−s}), see (6.5). Since B is centered, we
deduce that Bt − Bs and B_{t−s} have the same distribution N(0, t − s). Thus B has stationary
increments. Since B is a Gaussian process and, according to (6.5), Cov(Bu, Bt − Bs) = 0 for
all u ∈ [0, s], we deduce that Bt − Bs is independent of Fs = σ(Bu, u ∈ [0, s]). Thus, B has
independent increments. The extension to a general Brownian motion is immediate.
We mention that the Brownian motion is the only continuous random process with inde-
pendent and stationary increments (the proof of this fact is beyond these notes), and that
the study of general random processes with independent and stationary increments is a very
active domain of probability theory.
    Fτ = {B ∈ F∞; B ∩ {τ ≤ t} ∈ Ft for all t ∈ R+}.

Remark 6.18. We recall the convention that inf ∅ = +∞. Let A be an open set of R. The
entry time τ_A = inf{t ≥ 0; Bt ∈ A} is a stopping time¹. Indeed, we have for t ≥ 0 that:

    {τ_A ≤ t} = ⋃_{s∈Q+, s≤t} {Bs ∈ A} ∈ Ft.

♦
A real-valued process M = (Mt, t ∈ R+) is called an F-martingale if it is F-adapted (that
is, Mt is Ft-measurable for all t ∈ R+) and, for all t ≥ s ≥ 0, Mt is integrable and:

    E[Mt | Fs] = Ms.

¹ One can prove that if X = (Xt, t ∈ R+) is an a.s. continuous process taking values in a metric space E
and A a Borel subset of E, then: the entry time τ_A = inf{t ≥ 0; Xt ∈ A} is a stopping time with respect to the
natural filtration F = (Ft, t ∈ R+), where Ft = σ(Xu, u ∈ [0, t]); and the hitting time T_A = inf{t > 0; Xt ∈ A}
is a stopping time with respect to the filtration (F_{t+}, t ∈ R+), where F_{t+} = ⋂_{s>t} Fs.
We admit that the Brownian motion has the strong Markov property, see [7].
Let H_B be the closure in L²(P) of I_B. Notice that H_B is also a Hilbert space. The space
H_B is a Gaussian space in the following sense.

Lemma 6.22. Let d ∈ N∗ and X1, . . . , Xd ∈ H_B. Then the random vector (X1, . . . , Xd) is a
centered Gaussian vector.
Proof. We first consider the case d = 1. Since X1 ∈ HB , there exists a sequence (Yk , k ∈ N)
of elements of Vect(Bt , t ∈ R+ ) which converges in L2 (P) towards X1 . Thanks to Lemma
6.2, Yk is a centered Gaussian random variable for all k ∈ N. Thanks to Lemma 6.4, we get
that X1 is also a centered Gaussian random variable. The general case d ∈ N∗ is proved using
similar arguments.
By linearity, we deduce that for f, g ∈ I, ⟨I(f), I(g)⟩_P = ⟨f, g⟩_λ. Therefore, I is a linear
isometric map from I to I_B. It admits a unique linear isometric extension from Ī = L²(λ)
to Ī_B = H_B, which we still denote by I. The Wiener integral of a function f ∈ L²(λ) with
respect to the Brownian motion B is any random variable a.s. equal to I(f) ∈ H_B. We
shall use the notation I(f) = ∫_{R+} f(s) dBs = ∫_{R+} f dB, even if the Brownian motion has
no derivative. We shall also use the notation ∫_0^t f dB = ∫_0^t f(s) dBs = ∫_{R+} 1_{[0,t)}(s) f(s) dBs.
With this convention, we have ∫_0^t dBs = Bt.
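A discretization sketch (our illustration, not the construction above): approximating I(f) by Riemann-Itô sums over increments of B recovers the variance ∫ f² dλ.

```python
import numpy as np

# Approximate I(f) = int_0^T f(s) dB_s by sum_i f(s_i) (B_{s_{i+1}} - B_{s_i});
# the choice f(s) = exp(-s) and the horizon T are illustrative.
rng = np.random.default_rng(0)
T, n, runs = 1.0, 1_000, 10_000
dt = T / n
s = np.arange(n) * dt
f = np.exp(-s)

dB = rng.standard_normal((runs, n)) * np.sqrt(dt)   # Brownian increments
I = dB @ f                                          # one integral per run

print(I.mean(), I.var())              # ~ 0 and ~ int_0^T f^2 dlambda
print((1 - np.exp(-2 * T)) / 2)       # exact value of the variance
```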
We have the following properties of the Wiener integral.

Proposition 6.23. Let f, g ∈ L²(λ).

(i) The random variable ∫_{R+} f dB is Gaussian with distribution N(0, ∫_{R+} f² dλ).

(ii) The Gaussian random variables ∫_{R+} f dB and ∫_{R+} g dB are independent if and only if
∫_{R+} fg dλ = 0.

(iii) Let h be a measurable real-valued function defined on R+ which is locally square integrable
(that is, ∫_0^t h² dλ < +∞ for all t ∈ R+). The process M = (Mt = ∫_0^t h dB, t ∈ R+) is a
martingale.

(iv) The process M given in (iii) is a centered Gaussian process with covariance kernel
K = (K(s, t); s, t ∈ R+) given by K(s, t) = ∫_0^{s∧t} h² dλ.
Proof. Proof of property (i). Let f ∈ L²(λ). Since I(f) belongs to H_B, we deduce from
Lemma 6.22 that I(f) is a centered Gaussian random variable. Its variance is given by
E[I(f)²] = ⟨I(f), I(f)⟩_P = ⟨f, f⟩_λ = ∫_{R+} f² dλ.

Proof of property (ii). Let f, g ∈ L²(λ). Since I(f) and I(g) belong to the Gaussian
space H_B and are centered, we deduce from Lemmas 6.6 and 6.22 that I(f) and I(g) are
independent if and only if E[I(f)I(g)] = 0. Then use that E[I(f)I(g)] = ⟨I(f), I(g)⟩_P =
⟨f, g⟩_λ = ∫_{R+} fg dλ to conclude.

Proof of property (iii). Notice that Mt ∈ L²(P) for all t ≥ 0. Let t ≥ s ≥ 0 be
fixed. Since h1_{[s,t)} belongs to L², there exists a sequence (fn, n ∈ N) of elements of I which
converges to h1_{[s,t)} in L². Clearly the sequence (fn 1_{[s,t)}, n ∈ N) also converges to h1_{[s,t)} in
L². Since fn 1_{[s,t)} belongs to I, we get that I(fn 1_{[s,t)}) is σ(Bu − Bs, u ∈ [s, t])-measurable by
construction. This implies that I(h1_{[s,t)}) is also σ(Bu − Bs, u ∈ [s, t])-measurable. We deduce
that Mt − Ms is σ(Bu − Bs, u ∈ [s, t])-measurable and (taking s = 0 and t = s) that Ms is
Fs-measurable. In particular the process M is adapted to the filtration F. Using that the
Brownian motion has independent increments, we get that the σ-fields σ(Bu − Bs, u ∈ [s, t])
and Fs are independent. We deduce that E[Mt − Ms | Fs] = E[Mt − Ms] = E[Mt] − E[Ms] = 0.
This gives that M is a martingale.

Proof of (iv). Since Mt ∈ H_B for all t ≥ 0, we deduce from Lemma 6.22 that M is a
centered Gaussian process. Its covariance kernel is given, for s, t ∈ R+, by K(s, t) = E[Ms Mt] =
⟨I(h1_{[0,s)}), I(h1_{[0,t)})⟩_P = ⟨h1_{[0,s)}, h1_{[0,t)}⟩_λ = ∫_0^{s∧t} h² dλ.
We give a natural representation of ∫_0^t f dB when f is of class C¹.

Proposition 6.24. Assume that f ∈ C¹(R+). We have the following integration by parts
formula, for all t ∈ R+, a.s.:

    ∫_0^t f(s) dBs = f(t)Bt − ∫_0^t Bs f′(s) ds.

Remark 6.25. If f ∈ C¹(R+), then the process M̃ = (M̃t = f(t)Bt − ∫_0^t Bs f′(s) ds, t ∈ R+)
is a.s. continuous. Consider the martingale M = (Mt = ∫_0^t f(s) dBs, t ∈ R+). From
Proposition 6.24, we get that for all t ∈ R+, a.s. M̃t = Mt. We say that M̃ is a continuous
version² of M. ♦

² It can be proven that if f is a measurable real-valued locally square integrable function defined on R+,
then there exists a continuous version of the martingale (∫_0^t f(s) dBs, t ∈ R+).
Proof of Proposition 6.24. For t ≥ 0, we set Zt = f(t)Bt − ∫_0^t Bs f′(s) ds − ∫_0^t f(s) dBs. We
We consider the Langevin equation in dimension 1, which describes the evolution of the
speed V of a particle with mass m in a fluid, with friction and multiple homogeneous random
collisions from the molecules of the fluid:

where λ > 0 is a damping coefficient, which can be seen as a frictional or drag force, and F(t)
is a random force with Gaussian distribution. This force F(t) dt is modeled by a Brownian
motion ρ dBt, with ρ > 0. Taking a = λ/m > 0 and σ = ρ/m > 0, we get the stochastic
differential equation:

    dVt = −aVt dt + σ dBt for t ∈ R+. (6.10)
We say that a random locally integrable process V = (Vt, t ∈ R+) is a solution to the Langevin
equation (6.10) with initial condition V0 if a.s.:

    Vt = V0 − a ∫_0^t Vs ds + σBt.

The process V given in Proposition 6.26 is called the Ornstein-Uhlenbeck process. See
Exercise 9.38 for results on this process.

The Ornstein-Uhlenbeck process can be defined for all times in R by ((σ/√(2a)) e^{−at} B_{e^{2at}}, t ∈
R). This definition is coherent thanks to (9.2).
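A small Euler-Maruyama sketch of (6.10) (parameter values are our illustrative choices); the Ornstein-Uhlenbeck process has stationary variance σ²/(2a), which the simulation should approach.

```python
import numpy as np

# Euler-Maruyama discretization of dV = -a V dt + sigma dB (equation (6.10));
# a, sigma, T and the step are illustrative choices.
rng = np.random.default_rng(0)
a, sigma = 2.0, 1.0
T, n, runs = 10.0, 2_000, 5_000
dt = T / n

V = np.zeros(runs)                       # V_0 = 0
for _ in range(n):
    dB = rng.standard_normal(runs) * np.sqrt(dt)
    V += -a * V * dt + sigma * dB

print(V.var(), sigma**2 / (2 * a))       # both ~ 0.25
```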
If we consider the position of the particle at time t ≥ 0, say Xt, we get that Xt =
X0 + ∫_0^t Vs ds, which gives the following result whose proof is immediate.

Lemma 6.27. Let a > 0 and σ > 0. The path of the particle X = (Xt, t ∈ R+) governed by
the Langevin equation (6.10) is given, a.s., by:

    Xt = X0 + ∫_0^t Vs ds
       = X0 + (V0/a)(1 − e^{−at}) + (σ/a) Bt − (σ/a) ∫_0^t e^{−a(t−r)} dBr for t ∈ R+.
Remark 6.28. Recall that a = λ/m and σ = ρ/m, with m the mass of the particle. Denote
by X^{(m)} = (X_t^{(m)}, t ∈ R+) the path of the particle with mass m. We have:

    X_t^{(m)} = X0 + (V0/a)(1 − e^{−at}) + (ρ/λ) Bt − (ρ/λ) ∫_0^t e^{−a(t−r)} dBr,

which implies that X_t^{(m)} converges to X0 + (ρ/λ) Bt in L²(P) as m goes to zero. ♦
Let h be a measurable real-valued function defined on R+ which is locally square integrable
(that is, ∫_0^t h² dλ < +∞ for all t ∈ R+). We consider the non-negative process M^h = (M_t^h,
t ∈ R+) defined by:

    M_t^h = exp( ∫_0^t h(s) dBs − (1/2) ∫_0^t h(s)² ds ). (6.11)
Proof. According to property (iii) of Proposition 6.23, we get that M^h is adapted to the
Brownian filtration. Notice that for t ≥ s ≥ 0, we have:

    M_t^h = M_s^h e^{G − σ²/2},

with G = ∫_s^t h(u) dBu and σ² = ∫_s^t h(u)² du. Arguing as in the proof of property (iii) of
Proposition 6.23, we get that G is independent of Fs and has distribution N(0, σ²). We
deduce that a.s.:

    E[M_t^h | Fs] = M_s^h E[e^{G − σ²/2}] = M_s^h,

where the last equality is a consequence of (6.3) with λ = 1 and X = G. This implies that
M^h is a martingale. By construction, it is non-negative.
The next theorem asserts that a Brownian motion with drift can be seen as a Brownian
motion under a different probability measure.
Remark 6.31. Let P̃ be the probability measure defined on (Ω, Ft) by P̃(A) = E[M_t^h 1_A] for
A ∈ Ft. (Check that this indeed defines a probability measure.) Let Ẽ denote the correspond-
ing expectation. Theorem 6.30 gives that:

    E[ F((Bu + ∫_0^u h(s) ds, u ∈ [0, t])) ] = Ẽ[ F((Bu, u ∈ [0, t])) ].

In particular, the process t ↦ Bt − ∫_0^t h(s) ds is a Brownian motion under P̃. ♦
Partial proof of Theorem 6.30. We assume that it is enough to check (6.12) for functions
F of the form F((Yu, u ∈ [0, t])) = exp(∫_0^t f(u) dYu) with f ∈ I and I defined by (6.9).
So, we have F((Bu, u ∈ [0, t])) = exp(∫_0^t f(u) dBu) and F((Bu + ∫_0^u h(s) ds, u ∈ [0, t])) =
exp(∫_0^t f(u) dBu + ∫_0^t f(u)h(u) λ(du)). We get:

    E[ F((Bu + ∫_0^u h(s) ds, u ∈ [0, t])) ] = E[ e^{∫_0^t f(u) dBu + ∫_0^t fh dλ} ]
        = exp( (1/2) ∫_0^t f² dλ + ∫_0^t fh dλ )
        = exp( (1/2) ∫_0^t (f + h)² dλ − (1/2) ∫_0^t h² dλ )
        = E[ M_t^h e^{∫_0^t f(u) dBu} ]
        = E[ M_t^h F((Bu, u ∈ [0, t])) ],

where we used that M^f (resp. M^{f+h}) is a martingale for the second (resp. fourth) equality.
As an application, we can compute the Laplace transform (and hence the distribution) of
the hitting time of a line by the Brownian motion. Let a > 0 and δ ∈ R. We consider:

    τ_a^δ = inf{t ∈ R+; Bt ≥ a + δt}.

Notice that τ_a^δ is a stopping time, as {τ_a^δ ≤ t} = ⋂_{n∈N∗} ⋃_{s∈Q+, s≤t} {Bs − a − δs ≥ −1/n},
which belongs to Ft for all t ≥ 0. When δ = 0, we write τ_a for τ_a^δ, and using the continuity
of B, we also get that τ_a = inf{t ∈ R+; Bt ≥ a}.

(ii) Let δ ∈ R. We have P(τ_a^δ < +∞) = exp(−2aδ^+) and, for λ ≥ 0:

    E[ e^{−λτ_a^δ} ] = exp( −a(δ + √(2λ + δ²)) ).
Proof.
√We first prove (i). Let λ ≥ 0. Consider the process M = (Mt , t ∈ R+ ) with Mt =
√
exp 2λBt − λt . Using (6.11), we have M = M h with h constant equal to 2λ. Thus
M is a continuous martingale. This implies that the process N = (Nt = Mτa ∧t , t ∈ R+ )
is a continuous martingale
√ thanks to the optional stopping Theorem 6.19. It converges a.s.
towards N∞ = e a 2λ −λτ a 1{τa <+∞} as Bτa = a on {τa < +∞}. Since the process N takes
√
values in [0, ea 2λ ], we deduce it converges also in L1 towards N∞ . By dominated convergence,
we get that E[N∞ ] = E[N0 ] = 1 and thus:
h √ i
E ea 2λ −λτa 1{τa <+∞} = 1.
Taking λ = 0 in the previous equality implies that τa is a.s. finite. This gives (i).
We now prove (ii). Let f be a non-negative measurable function defined on R. We have:
where we used the Cameron-Martin theorem with h = −δ for the second equality, the optional
stopping Theorem 6.19 (with T = t and S = τa ∧ t and the martingale M −δ which has a
continuous version thanks to Remark 6.25) for the third. Taking f (x) = e−λx with λ ≥ 0, we
get:
h i 2
−λ(τaδ ∧t) −λ(τa ∧t)−δBτa ∧t − δ2 (τa ∧t)
E e =E e .
Assume δ ≤ 0. Letting t goes to infinity in the previous equality and using that τa is a.s.
finite and Bτa = a, we get by dominated convergence (for the left hand-side and the right
hand-side) that:
h i 2
−λτaδ −(λ+ δ2 )τa −δa
E e =E e .
h δ
i √
Then use (6.13) to get that E e−λτa = exp (−δa − a 2λ + δ 2 ). Letting λ goes down to 0,
we deduce that P(τaδ < ∞) = 1.
δ = inf{t ∈
The case δ > 0 is more technical. The idea is to consider the stopping time τa,b
δ and then use that
R+ ; Bt 6∈ (b + δt, a + δt)} for b < a; compute the Laplace transform of τa,b
the non-decreasing sequence (τa,b δ , b ∈ (−∞, a)) converges to τ δ when b goes to −∞. The
a
details are left to the reader.
118 CHAPTER 6. BROWNIAN MOTION
Bibliography
[2] A. Klenke. Probability theory. Universitext. Springer, London, second edition, 2014. A
comprehensive course.
[3] J.-F. Le Gall. Brownian motion, martingales, and stochastic calculus. Springer, 2016.
[4] M. A. Lifshits. Gaussian random functions, volume 322 of Mathematics and its Applica-
tions. Kluwer Academic Publishers, 1995.
[7] D. Revuz and M. Yor. Continuous martingales and Brownian motion, volume 293 of
Grundlehren der Mathematischen Wissenschaften [Fundamental Principles of Mathemat-
ical Sciences]. Springer-Verlag, Berlin, third edition, 1999.
[8] L. C. G. Rogers and D. Williams. Diffusions, Markov processes, and martingales. Vol. 1.
Cambridge Mathematical Library. Cambridge University Press, Cambridge, 2000.
[9] L. C. G. Rogers and D. Williams. Diffusions, Markov processes, and martingales. Vol. 2.
Cambridge Mathematical Library. Cambridge University Press, Cambridge, 2000.
119
120 BIBLIOGRAPHY
Chapter 7
Stochastic approximation
algorithms
The aim of this chapter is to present some results on the convergence of stochastic algo-
rithms approximations, the so called Robbins-Monro algorithm, by using a comparison of the
stochastic algorithm with its companion ODE. This is the so-called ODE method which is
presented in the monographs Kushner and Clark [7], Kushner and Yin [8] and Duflo [4, 5].
Our presentation will follow closely Benaı̈m [1] based first on analytic properties of pseudo-
trajectories associated to the ODE in Section 7.2, and second on the control of the stochastic
algorithms approximations in Sections 7.3.
As applications and motivations, we shall consider in more details the so-called two-armed
bandit see Sections 7.1 and 7.4.3, as well as the estimation of the quantile in linear times in
Section 7.4.2.
121
122 CHAPTER 7. STOCHASTIC APPROXIMATION ALGORITHMS
the slot machine A (resp. B) being chosen with probability Xn (resp. 1 − Xn ). At time n + 1
the probability Xn+1 is up-dated by a reward γn Xn if there has been a gain as follows: if
A has been chosen and there has been a gain then Xn+1 is equal to Xn + γn (1 − Xn ); if
B has been chosen and there has been a gain then Xn+1 is equal to Xn − γn Xn . Notice
that if there is noPgain there is no penalty. The fraction γn ∈ (0, 1) is deterministic. We
shall assume that n∈N γn = +∞, which is a necessary condition to forget the starting value
X0 . More precisely, we consider the following linear scheme. Let pi , i ∈ {A, B}, be the
probability to have a gain with the slot machine i. Let (γn , n ∈ N) be a sequence taking
values in (0, 1). Let X0 ∈ (0, 1) be a random variable and let (Un , n ∈ N∗ ) and (Vn , n ∈ N∗ )
be independent random variables uniformly distributed on [0, 1] and independent of X0 . We
define the sequence (Xn , n ∈ N) recursively as follows:
with
Yn+1 = (1 − Xn )1{Un+1 ≤Xn , Vn+1 ≤pA } − Xn 1{Un+1 >Xn , Vn+1 ≤pB } .
The event {Vn+1 ≤ pi } corresponds to the winning event with the machine i ∈ {A, B}.
We are interested in the behavior of the sequence (Xn , n ∈ N) as n goes to infinity. Let
F = (Fn , n ∈ N) be the natural filtration of the process X. We rewrite Yn+1 as:
so that εn+1 = Yn+1 − F (Xn ) is a martingale increment with respect to F. We have F (x) =
πx(1 − x) with π = pA − pB and thus:
Xn+1 − Xn
= F (Xn ) + εn+1 . (7.2)
γn
The Ordinary Differential Equation (ODE) method intuitively says that the stochastic
approximation algorithm (7.2) (also called perturbed recursive equation) behaves as the de-
terministic ODE:
dx(t)
= F (x(t)). (7.3)
dt
Notice that 0 and 1 are roots of F (y) = 0, and thus the constant functions equal to 0 or to
1 are solutions of (7.3). For x0 ∈ (0, 1), the solution x = (x(t), t ∈ R+ ) of (7.3) with initial
condition x0 is given by:
x0
x(t) = for t ≥ 0.
x0 + (1 − x0 ) e−πt
In particular we have that limt→+∞ x(t) = 1, meaning that 1 is an attractor of (7.3) and
thus a stable fixed point of the ODE, whereas 0 is an unstable fixed point. One expects that
the stochastic approximation algorithms is close to the solutions of the ODE and thus might
converge to the stable fixed point of the ODE. We shall give in Section 7.4.3 some condition on
the step sequence (γn , n ∈ N) which implies the a.s. convergence of the stochastic algorithm
to the stable fixed point.
7.2. ASYMPTOTIC PSEUDO-TRAJECTORIES 123
7.2.1 Definition
The hypothesis on F imply that the vector field F is globally integrable, see ????, and there
exists a flow Φ = (Φt (y), (t, y) ∈ R × Rd ) which is a continuous function (of (t, y)) such that
for all (t, y) ∈ R × R:
dΦt (y)
= F (Φt (y)) and Φ0 (y) = y.
dt
The map t 7→ Φt (y) defined on R is the global solution of the Cauchy problem associated to
F with initial condition y. Notice that Φ0 is the identity function. The function Φ has the
so called flow property as for all s, t ∈ R:
Φs+t = Φs ◦ Φt .
We define the set of equilibria of Φ as Λ0 = {y ∈ Rd , F (y) = 0}.
An asymptotic pseudo-trajectory is a path which is asymptotically close to a solution of
the Cauchy problem.
Definition 7.1. A function x = (x(t), t ∈ R+ ) is an asymptotic pseudo-trajectory for Φ if
for all T > 0:
lim sup |x(t + s) − Φs (x(t))| = 0.
t→+∞ s∈[0,T ]
The aim of this section is to give description of L(x). The next lemma corresponds to Theorem
5.7 in [1], see also the references therein.
Lemma 7.2. Let x be an asymptotic pseudo-trajectory for the flow Φ which is bounded, that
is supt≥0 |x(t)| < +∞. The set L = L(x) is non-empty, connected, compact and Φ-invariant,
that is Φt (L) = L for all t ∈ R.
When there is a Lyapounov function, see definition below, for the flow Φ, it is possible to
have a more precise description of L(x).
Let Λ be an invariant set of Φ. A continuous function V defined on Rd taking values
in R is called a Lyapounov function for Λ if the function t 7→ V (Φt (y)) defined for t ≥ 0 is
constant for y ∈ Λ and strictly decreasing for y 6∈ Λ. If Λ is equal to the set Λ0 of equilibrium
of F , then V is called a strict Lyapounov function.
124 CHAPTER 7. STOCHASTIC APPROXIMATION ALGORITHMS
Remark 7.3. When F is the gradient of a function V , that is F = −∇V , then V itself is
a strict Lyapounov
Ry function. Thus, in dimension d = 1, then the function V defined by
V (y) = − 0 F (s) ds for y ∈ R, is a strict Lyapounov function. ♦
We get from Corollary 6.6 in [1] the following result on the convergence of asymptotic
pseudo-trajectories.
We shall consider the following hypothesis on the control of the martingale increments and
on the step sequence:
X
sup E[|εn |2 ] ≤ +∞ and γn2 < +∞ (7.5)
n∈N∗ n∈N
as well as
X
(εn , n ∈ N∗ ) is sub-Gaussian, and for all c > 0, e−c/γn < +∞. (7.6)
n∈N
Exercise 7.1. LetU be a bounded Rd -valued random variable. By considering the function
hθ,U
θ 7→ log E e i , prove that U is sub-Gaussian, that is there exists a finite constant Γ > 0
2
such that for all θ ∈ Rd , we have E ehθ,U i ≤ eΓ|θ| /2 .
4
We consider the linear interpolation X = (X(t), t ≥ 0) of the sequence (Xn , n ∈ N) with
time step (γn , n ∈ N) as follows. We set τ0 = 0 and τn+1 = τn + γn for n ∈ N. For t ≥ 0, let
n ∈ N be such that t ∈ [τn , τn+1 ), we set:
Xn+1 − Xn
X(t) = Xn + (t − τn ) ·
γn
We have the following main result whose proof is postponed to Section 7.3.2.
7.3. STOCHASTIC APPROXIMATION ALGORITHMS 125
Theorem 7.5. Let F be a continuous map from Rd to Rd which is bounded and locally
Lipschitz. Let (Xn , n ∈ N) be a Robbins-Monro sequence associated to F , see (7.4), and X
its linear interpolation. Assume that
P
(i) n∈N∗ γn = +∞ and either (7.5) or (7.6) holds (so that limn→∞ γn = 0).
(iii) There exists non-negative constants K and R, with K < +∞, such that:
n∈N
with |g(y)| ≤ K1 |y|2 /2 for some non-negative finite constant K1 . Using (7.4) as well as (i),
we get:
Then use (iii) and (iv) to deduce by induction that E[V (Xn )] is finite for all n ∈ N. We define
Vn = V (Xn ) + Wn with
X h i
γj2 E 1{|Xj |<R} |εj+1 |2 + |F (Xj )|2 | Fn .
Wn = K1
j≥n
Notice that Wn is well defined thanks to (iii) and that Vn is integrable. We deduce from (7.7)
and (iii) that:
E [Vn+1 − Vn | Fn ] ≤ −γn k(Xn ) + K1 γn2 Kk(Xn ).
126 CHAPTER 7. STOCHASTIC APPROXIMATION ALGORITHMS
Since limn→∞ γn = 0, we deduce there exists n0 ∈ N such that E [Vn+1 − Vn | Fn ] ≤ 0 for all
n ≥ n0 . Thus (Vn , n ≥ n0 ) is a non-negative super-martingale. Thanks to Corollary 4.22, it
converges a.s. to an integrable limit, say V∞ . Since (iii) implies that a.s. limn→∞ Wn = 0, we
deduce that (V (Xn ), n ∈ N) converges a.s. to V∞ . Use (ii) to get that a.s. lim supn→∞ |Xn | <
+∞. This ends the proof.
X̄(t) = Xn ,
ε̄(t) = εn+1 and γ̄(t) = γn .
R
t+s
For t ≥ 0 and T ≥ 0, we set ∆(t, T ) = sups∈[0,T ] t ε̄(u) du.
The next lemma is the first step of the proof.
Lemma 7.7. Under either (7.5) or (7.6), we have that for all T ≥ 0:
Proof. Since limn→∞ γn = 0, we get that (7.8) holds for all T ≥ 0 if and only if for all T ≥ 0:
We first assume that (7.5) holds. We get that supn∈N∗ E[Mn2 ] is finite and thus the
martingale M a.s. converges. This directly implies (7.9).
The case (7.6) corresponds to Proposition 4.4 in [1].
The second step of the proof is deterministic and corresponds to Proposition 4.1 in [1].
7.4 Applications
7.4.1 Dosage
This application is taken from [5]. A dose y of a chemical products creates a random effect say
g(y, U ), where U is a random variable and g is an unknown real-valued bounded measurable
function defined on R2 . We assume the mean effect G(y) = E[g(y, U )], which is unknown, is
however non-decreasing as a function of y. We want to determine the dose y ∗ which creates
a mean effect of a given level a: that is G(y ∗ ) = a.
We consider the following Robbins-Monro stochastic algorithms, for n ∈ N:
where we assume that (Un , n ∈ N∗ ) are independent random variables distributed as U and
independent of X0 . Notice g(Xn , Un+1 ) corresponds to an effect of the dose Xn produced by
the n-th experiment.
To stick to the formalism (7.4), we set F (y) = G(y) − a and εn+1 = g(Xn , Un+1 ) − G(Xn ).
Since g is bounded, we get that the sequence (εn , n ∈ N∗ ) is bounded. We assume that G is
7.4. APPLICATIONS 127
Ry
Lipschitz (and bounded as g is bounded). Set V (y) = − 0 F (r) dr which is thus of class C 2
and with bounded second order derivatives. Assume that G is non-decreasing and that there
exists a unique root, say y ∗ , to the equation G(y) = a. This implies that Λ0 = {y ∗ } is the
set of equilibrium of the associated ODE and that V is a strict Lyapounov function. We also
have (ii) of Proposition 7.6. Take k = F 2 so that (i) of Proposition 7.6 holds. Assume that:
X X
γn = +∞ and γn2 < +∞.
n∈N n∈N
Thus, we have the second part of (7.5). Since (εn , n ∈ N∗ ) is bounded, we get the first part
of (7.5), as well as (iii) of Proposition 7.6 with R = +∞.
We deduce from Proposition 7.6 that a.s. the sequence (Xn , n ∈ N) is bounded. Theorem
7.5 and Lemma 7.4 imply that a.s. (Xn , n ∈ N) converges towards the only equilibrium y ∗ of
the associated ODE.
Some hypothesis can be slightly weakened, see Section 1.4.2 in [5]. For the speed of
convergence of (Xn , n ∈ N) towards y ∗ , which corresponds to a central limit theorem for
martingales, we also refer to [5].
We recall this is a stochastic approximation algorithm (7.4) with F (y) = πy(1 − y) and
π = pA −pB . The equilibrium set of the ODE associated to F is Λ0 = {0, 1}. For convenience,
we assume that pA > pB so that 1 is stable and 0 is unstable.
128 CHAPTER 7. STOCHASTIC APPROXIMATION ALGORITHMS
Notice that by construction, see (7.1), the sequence (Xn , n ∈ N) belongs a.s. to (0, 1), as
X0 ∈ (0, 1), and thusR is bounded.
y
Since V (y) = − 0 F (r) dr is a strict Lyapounov function, we deduce from Theorem 7.5
and Lemma 7.4 that the sequence (Xn , n ∈ N) converges a.s. to a limit, say X∞ which
belongs to Λ0 .
We write Px when starting the algorithm from X0 = x. Since (Xn , n ∈ N) is a sub-
martingale as F ≥ 0, we deduce that Ex [X∞ ] ≥ Ex [X0 ] = x. As Ex [X∞ ] = Px (X∞ = 1), we
get the elementary lower bound Px (X∞ = 1) ≥ x.
We say the approximation is fallible when starting from x ∈ (0, 1) if Px (X∞ = 0) > 0,
and is infallible if Px (X∞ = 0) = 0 for all x ∈ (0, 1). The study of the algorithm to be fallible
or infallible is a second order property, whereas the ODE method can be seen as a first
order property. Because the two equilibrium of the two armed-bandit lies on the boundary
of the interval definition of the ODE, no general result can be applied. A direct study of this
particular algorithm is developed in [11]. As example, we provide the following results which
is part of Corollaries 1 and 2 in [11].
Lemma 7.8. Let 1/2 < α ≤ 1 and C > 0. Assume that the step sequence is given by
γn = (C/(n + C))α for all n ∈ N. If α = 1 and C ≤ 1/pB then the algorithm is infallible.
Otherwise, it is fallible when starting from x ∈ (0, 1).
For practical implementation the algorithm is infallible for the step sequence γn = 1/(n +
1), which corresponds to α = C = 1.
Bibliography
[3] G. Burtini, J. Loeppky, and R. Lawrence. A survey of online experiment design with
the stochastic multi-armed bandit. arXiv:1510.00757, 2015.
[6] J. C. Gittins. Bandit processes and dynamic allocation indices. J. Roy. Statist. Soc. Ser.
B, 41(2):148–177, 1979. With discussion.
[7] H. J. Kushner and D. S. Clark. Stochastic approximation methods for constrained and
unconstrained systems, volume 26 of Applied Mathematical Sciences. Springer-Verlag,
New York-Berlin, 1978.
[8] H. J. Kushner and G. G. Yin. Stochastic approximation and recursive algorithms and
applications, volume 35 of Applications of Mathematics (New York). Springer-Verlag,
New York, second edition, 2003. Stochastic Modelling and Applied Probability.
[9] D. Lamberton and G. Pagès. How fast is the bandit? Stoch. Anal. Appl., 26(3):603–623,
2008.
[10] D. Lamberton and G. Pagès. A penalized bandit algorithm. Electron. J. Probab., 13:no.
13, 341–373, 2008.
[11] D. Lamberton, G. Pagès, and P. Tarrès. When can the two-armed bandit algorithm be
trusted? Ann. Appl. Probab., 14(3):1424–1454, 2004.
129
130 BIBLIOGRAPHY
[14] M. F. Norman. On the linear model with two absorbing barriers. J. Mathematical
Psychology, 5:225–241, 1968.
[15] H. Robbins. Some aspects of the sequential design of experiments. Bull. Amer. Math.
Soc., 58(5):527–535, 09 1952.
[16] I. J. Shapiro and K. S. Narendra. Use of stochastic automata for parameter self-
optimization with multimodal performance criteria. IEEE Transactions on Systems
Science and Cybernetics, 1969.
[17] R. Weber. On the Gittins index for multiarmed bandits. Ann. Appl. Probab., 2(4):1024–
1033, 1992.
Chapter 8
Appendix
(i) Ω ∈ A;
(ii) A ∈ A implies Ac ∈ A;
(iii) A, B ∈ A implies A ∪ B ∈ A.
(i) P(Ω) = 1;
This extension theorem allows to prove the existence of the Lebesgue measure.
131
132 CHAPTER 8. APPENDIX
Proposition 8.4 (Lebesgue measure). There exists a unique probability measure P on the
measurable space ([0, 1), B([0, 1))), called Lebesgue measure, such that P([a, b)) = b − a for all
0 ≤ a ≤ b ≤ 1.
Before giving the proof of Proposition 8.4, we provide a sufficient condition for a real-
valued additive function defined on a Boolean algebra to be continuous at ∅.
n n n n
!
[ X X X
P(An ) = P An ∩ ( Bkc ) ≤ P(An ∩ Bkc ) ≤ P(Ak ∩ Bkc ) ≤ ε2−k ≤ 2ε,
k=0 k=0 k=0 k=0
that is P(An ) ≤ 2ε for all n ≥ n0 . Since ε > 0 is arbitrary, we deduce that limn→+∞ P(An ) =
0, which ends the proof of the lemma.
Proof of Proposition 8.4. Let A be the set of finite union of intervals [a, b) with 0 ≤ a ≤
b ≤ 1. Notice A is a Boolean algebra which generates the Borel σ-field B([0, 1)). Define
P([a, b)) = b − a for 0 ≤ a ≤ b ≤ 1. It is elementary to check that P can be uniquely extended
to A into an a additive [0, +∞]-valued function, which we still denote by P. Notice that
P([0, 1)) = 1. To conclude, it is enough to that P is continuous at ∅.
For A ∈ A, non empty, there exists n0 ∈ N ∗
S , 0 ≤ ai < bi < ai+1 for i ∈ J1, n0 K,
with the convention an0 +1 = 1, such that A = i∈J1,n0 K [ai , bi ). Let ε > 0. Taking K =
−i −i
S S
i∈J1,n0 K [ai , ai ∨ (bi −Tε2 )] and B = i∈J1,n0 K [ai , ai ∨ (bi − ε2 )) we get that B ∈ A,
c
B ⊂ K ⊂ A and P(A B ) ≤ ε. We deduce that the hypothesis of Lemma 8.5 are satisfied.
Thus P is a probability on A. Therefore Theorem 8.3 implies there exists a unique probability
on [0, 1) which is an extension of P.
Remark 8.6. Let λ1 denote the Lebesgue measure on [0, 1). P Then, the Lebesgue measure
on R, λ, is defined by: for all Borel set A of R, λ(A) = x∈Z λ1 ((A + x) ∩ [0, 1)), where
A + x = {z + x, z ∈ A}. It is easy to check that λ is σ-additive (and thus a measure according
to Definition 1.7). Notice that λ([a, b)) = b − a for all a ≤ b. Using Exercise 9.2, we get that
the Lebesgue measure is the only measure on (R, B(R)) such that this latter property holds.
d d
The construction of the Lebesgue Qdmeasure, λ, onQd (R , B(R )) for d ≤ 1, which is the
unique σ-finite measure such that λ( i=1 [ai , bi )) = i=1 (bi − ai ) for all ai ≤ bi , follows the
same steps and is left to the reader. ♦
8.1. MORE ON MEASURE THEORY 133
Using the extension theorem, we get the existence of the product probability measure of
the product measurable space.
Proposition
Q 8.7. Let N ((Ωi , Gi , Pi ), i ∈ I) be a collection of probability spaces and set Ω =
Ω
i∈I i Qas well as GQ= i∈I Gi . There exists a unique probability measure P on (Ω, G) such
that P i∈I Ai = i∈I Pi (Ai ), where Ai ∈ Gi for all i ∈ I and Ai = Ωi but for a finite
number of indices.
N
The probability P is called the product probability measure and it is denoted by i∈I Pi .
The probability space (Ω, G, P) is called the product probability space. Proposition 8.7 can be
extended to the finite product of σ-finite measures, see also Theorem 1.53 for an alternative
construction in this particular case.
Q
Proof. Let A be the set of finite unions of sets of the form i∈I Ai , where Ai ∈ Gi for all
i ∈ I and Ai = Ωi but for a finite number of indices. Q NoticeQthat A is a Boolean algebra
which generates the product σ-field G. Define P( i∈I Ai ) = i∈I Pi (Ai ). It is elementary
to check that P can be uniquely extended to A into an a additive [0, +∞]-valued function,
which we still denote by P. Notice that P(Ω) = 1. To conclude, it is enough to prove that P
is continuous at ∅.
We first assume that I = N∗ . QFor n ∈ N, we set Ωn = k>n Ωk and An the Boolean
Q
0 0 k > n and A0k = Ωk but for
algebra of the finite unions of set k>n
n
Q Ak with0
AkQ∈ Gk for all 0
finite number of indices. Define P k>n Ak = k>n Pk (Ak ). It is elementary to check
that Pn can be uniquely extended to A into an a additive [0, +∞]-valued function, which we
still denote by Pn .
Let us prove that P is continuous at ∅ by contraposition. TLet (An , n ∈ N∗ ) be a non-
increasing
T A-valued sequence and ε > 0 such that limn→+∞ P ( nk=1 Ak ) ≥ ε. We shall prove
that n∈N∗ An is non-empty.
For ω1 ∈ Ω1 , consider A1n (ω1 ) = {ω 1 ∈ Ω1 ; (ω1 , ω 1 ) ∈ An } the section of An on Ω1 at
ω1 . It is elementary to deduce from An ⊂ A that for all ω1 ∈ Ω1 we have A1n (ω1 ) ∈ A1 .
It is also not difficult to prove that Bn,1 = {ω1 ∈ Ω1 ; P1 (A1n (ω1 )) ≥ ε/2} belongs to G1 .
Since the sequence (An , n ∈ N) is non-increasing, S we get the sequence (Bn1 , n ∈ N∗ ) is also
non-increasing. Since An is a subset of Bn,1 × Ω {(ω1 , ω 1 ); ω1 6∈ Bn,1 and ω 1 ∈ A1n (ω1 )},
1
we get that:
ε
ε ≤ P(An ) ≤ P1 (Bn,1 ) + (1 − P1 (Bn,1 )) ,
2
∗
and thus P1 (BT n,1 ) ≥ ε/2 for all n ∈ N . We deduce from 1the continuity ∗
of P1 at ∅, that there
exists ω̄1 ∈ n∈N∗ Bn,1 . Furthermore, the T sequence (An (ω̄1 ), n ∈ N ) of elements of A1 is
1 n 1
T 1
non-increasing and such that limn→+∞ P k=1 Ak (ω̄1 ) ≥ ε/2 and {ω̄1 } × n∈N∗ An (ω̄1 ) ⊂
∗
T
n∈N∗ An . By iterating the previous argument, we get that for all k ∈ N , there exists
ω̄k ∈ Ωk such that:
\ \
{(ω̄1 , . . . , ω̄k )} × Akn (ω̄1 , . . . , ω̄k ) ⊂ An ,
n∈N∗ n∈N∗
where Akn (ω̄1 , . . . , ω̄k ) = {ω k ∈ Ωk ; (ω̄1 , . . . , ω̄k , ωTk ) ∈ An } is the section of An on ki=1 Ωi at
Q
(ω̄1 , . . . , ω̄k ). This implies that (ω̄k , k ∈ N∗ ) ∈ n∈N∗ An , and thus n∈N∗ An is non-empty.
T
The proposition is thus true when I is countable.
134 CHAPTER 8. APPENDIX
According to the previous arguments, it is clear the proposition is also true when I is finite.
Let us assume that I is uncountable. For all (countable) sequence (An , n ∈ N∗ ) of elements
of A, there Q Q J ⊂ I at most0 countable such that the sets An are finite
exists a set Qunions of
0 J
sets of type j∈J Aj i∈I\J Ωi with Aj ∈ Gj for all j ∈ J. Thus we have An = An i∈I\J Ωi ,
with AJn = {ωJ ∈ j∈J Ωj ; {ωJ } × i∈I\J Ωi ⊂ An }. And the continuity of P at ∅ is the a
Q Q
consequence of the first part of the proof as J is at most countable.
Based on Proposition 8.7, the next exercise provides an alternative proof of Proposition
8.4 on the existence of the Lebesgue measure on [0, 1) .
Exercise 8.1. Set Ωi = {0, 1}, Gi = P(Ω ∗
Qi ) and Pi ({0})N
= Pi ({1}) = 1/2Nfor i ∈ N . Consider
the product probability space (Ω = i∈N∗ Ωi , G = i∈N∗ Gi , P = i∈N∗ Pi ). Define the
function ϕ from Ω to [0, 1) by:
X
ϕ((ωi , i ∈ N∗ )) = 2−i ωi .
i∈N∗
By considering intervals [k2−i , j2−i ), check that ϕ is measurable and, using Corollary 1.14,
that Pϕ , the image of P by ϕ, is the Lebesgue measure on ([0, 1), B([0, 1)). 4
Proposition 8.8. Let A be a Boolean algebra on Ω and P a probability measure on (Ω, A).
We have the following properties.
Proof. Properties (i) and (ii) are consequence of the additive property of P.
It is enough to prove property (iii) with SI countable. Let (Bn , n ∈ N) S a sequence of
elements of A pairwise disjoint and such that n∈N Bn ∈ A. The sequence k>n Bk , n ∈ N
is non-decreasing
S and converges towards ∅. The continuity property at ∅ of P implies that
limn→+∞ P k>n Bk = 0. By additivity, we get:
n n
! ! ! !
[ [ [ X [
P Bk =P Bk +P Bk = P(Bk ) + P Bk .
k∈N k=0 k>n k=0 k>n
S P
Letting n goes to infinity, we get P k∈N Bk = k∈N P(Bk ).
S
Let (An , n ∈ N) be a sequence of elements of A such that n∈N An ∈ A. Set B0 = A0
T Sn−1 c Sn Sn
and for n ≥ 1, Bn = An k=0 Ak . We have Bn ⊂ An as well as k=0 Bk = k=0 Ak
8.1. MORE ON MEASURE THEORY 135
S S
and thus Sk∈N Bk = k∈N Ak . The sets (Bn , n ∈ N) belongs to A, are pairwise disjoint and
such that n∈N Bn ∈ A. we deduce from the first part of the proof that:
! !
[ [ X X
P Ak = P Bk = P(Bk ) ≤ P(Ak ).
k∈N k∈N k∈N k∈N
Let A be a Boolean algebra on Ω and P a probability measure on (Ω, A). The outer
probability measure P∗ is a [0, 1]-valued function defined on P(Ω) by:
( )
X [
P∗ (A) = inf P(Bn ); A ⊂ Bn and Bn ∈ A for all n ∈ N . (8.1)
n∈N n∈N
The next lemma states that the restriction of P∗ to A coincide with P and that P∗ is monotone
and σ-sub-additive.
S
Proof.
S Let (Bn ∈ N) and A be elements A such that A ⊂ n∈N Bn . We have A ∩ Bn ∈ A
and n∈N (A ∩ Bn ) = A ∈ A. We deduce from property (iii) of Proposition 8.8 that:
!
[ X X
P(A) = P (A ∩ Bn ) ≤ P(A ∩ Bn ) ≤ P(Bn ),
n∈N n∈N n∈N
We first prove that G is a Boolean algebra which contains A (Lemma 8.10), then that G
is a σ-field and that P∗ is a probability measure on G (Lemma 8.11).
P∗ (B) = P∗ (B ∩ A1 ) + P∗ (B ∩ Ac1 )
= P∗ (B ∩ A1 ∩ A2 ) + P∗ (B ∩ A1 ∩ Ac2 ) + P∗ (B ∩ Ac1 )
≥ P∗ (B ∩ A1 ∩ A2 ) + P∗ (B ∩ (A1 ∩ A2 )c ), (8.3)
where we used the sub-additivity of P∗ for the inequality and (A1 ∩ Ac2 ) Ac1 = (A1 ∩ A2 )c .
S
As P∗ is sub-additive, we deduce that the inequality (8.3) is in fact an equality, and thus that
A1 ∩ A2 ∈ G. This implies that G is a Boolean algebra.
∈ A, B ∈ Ω and ε > 0. There
Let A S P exists a sequence (Bn , n ∈ N) of elements of A such
that B ⊂ n∈N Bn and P∗ (B) + ε ≥ n∈N P(Bn ). By additivity of P, we get:
X X X
P∗ (B) + ε ≥ P(Bn ) = P(Bn ∩ A) + P(Bn ∩ Ac ) ≥ P∗ (B ∩ A) + P∗ (B ∩ Ac ),
n∈N n∈N n∈N
Lemma 8.11. The family G is a σ-field and the function P∗ restricted to G is a probability
measure.
Proof. Notice that for A ∈ G and B, C ∈ Ω such that A∩C = ∅, we deduce from the definition
of G that:
P∗ (B ∩ (A ∪ C)) = P∗ (B ∩ A) + P∗ (B ∩ C). (8.4)
(An , n ∈ N) be elements of G pairwise disjoint and B ∈ Ω. We set A0n = nk=0 Ak and
S
Let
A0 = k∈N Ak . We have A0n ∈ A. Using the monotonicity of P∗ and then (8.4), we get:
S
n
X
P∗ (B ∩ A0 ) ≥ P∗ (B ∩ A0n ) = P∗ (B ∩ Ak ). (8.5)
k=0
Letting n goes to infinity, we deduce from (8.6) that P∗ (B) ≥ P∗ (B ∩ A0 ) +SP∗ (B ∩ A0 c ). Since
P∗ is sub-additive, this inequality is in fact an equality, and thus A0 = k∈N An ∈ G. It is
then immediate to check that G is stable by countable union. Thus, G is a σ-field.
For B = Ω in (8.6), we get that P∗ is σ-additive on G: P∗ ∗
S P
k∈N Ak = k∈N P (Ak ).
∗
The restriction of P to G is therefore a probability measure.
Definition 8.12. Let (E, B) be a metric space with its Borel σ-field. A sequence (µn , n ∈ N)
of probability measures on E converges weakly to a probability measure Rµ on E ifR for all
bounded real-valued continuous function f defined on E, we have limn→∞ f dµn = f dµ.
Let (Xn , n ∈ N) and X be E-valued random variables. The sequence (Xn , n ∈ N) converges
in distribution towards X if the probability measures (PXn , n ∈ N) converges weakly towards
PX , that is limn→∞ E[f (Xn )] = E[f (X)] for all bounded real-valued continuous function. And
(d) (d)
we write Xn −−−→ X (or some times Xn −−−→ PX ).
n→∞ n→∞
We refer to [1] for further results on convergence in distribution. Since we shall be mainly
interested by the convergence in distribution for random variables taking values in a discrete
space we introduce the convergence for the distance in total variation. The distance in total
variation dTV between two finite measures µ and ν on (S, S) is given by:
It is elementary to check that dTV is indeed a distance1 on the set of finite measures on (S, S).
Lemma 8.13. The convergence for the distance in total variation for probability measures
on a metric space implies the convergence in distribution.
Proof. Let (E, B(E)) be a metric space with its Borel σ-field. Let f be a real-valued mea-
surable
R function
R1 defined on E taking values in (0, 1). By Fubini theorem, we have that
f dν = 0 ν({f > t}) dt for any probability measure ν on (E, B(E)).
Let (µn , n ∈ N) be a sequence of probability measures which converges for the distance
in total variation towards a probability measure µ. This implies that limn→∞ µn ({f > t}) =
µ({f > t}). By dominated R convergence,
R we deduce from the comment at the beginning of
the proof, that limn→∞ f dµn = f dµ. By linearity, we get this convergence holds for any
bounded real-valued measurable function f . This implies that (µn , n ∈ N) converges weakly
towards µ.
We now assume that E is a discrete space (and E = P(E)). Let λ denote the counting
measure on (E, E): λ(A) = Card (A) for A ∈ E. Notice that any measure µ on (E, E) has a
density with respect to the counting measure λ given by the function (µ({x}), x ∈ E). We
shall identify the density of µ with µ and thus write µ(x) for µ({x}). We shall consider the L1
norm with respectP to the counting measure so that for a real-valued function f = (f (x), x ∈ E)
we set kf k1 = x∈E |f (x)|. It is left to the reader to check that for two finite measures µ
and ν on (E, E):
2 dTV (µ, ν) = kµ − νk1 . (8.7)
Lemma 8.14. Let E be a discrete space. Let (Xn , n ∈ N∗ ) and X be E-valued random
variables. The following conditions are equivalent.
(d)
(i) Xn −−−→ X.
n→∞
Proof. Since {x} is open and closed, the function 1{x} is continuous. Thus, property (i)
implies property (ii). Property (iii) implies property (i) thanks to Lemma 8.13.
We now prove that property (ii) implies property (iii). Let ε > 0 and K ⊂ E finite such
that P(X ∈ K) ≥ 1 − ε. Since K is finite, we deduce from (ii) that limn→∞ P(Xn ∈ K) =
1
A non-negative finite function d defined on S × S is a distance on S if for all x, y, z ∈ S, we have:
d(x, y) = d(y, x) (symmetry); d(x, y) ≤ d(x, z) + d(z, y) (triangular inequality); d(x, y) = 0 implies x = y
(separation).
8.2. MORE ON CONVERGENCE FOR SEQUENCE OF RANDOM VARIABLES 139
P(X ∈ K) and thus limn→∞ P(Xn ∈ K c ) = P(X ∈ K c ) ≤ ε. So for n large enough, say
n ≥ n0 , we have P(Xn ∈ K c ) ≤ 2ε. We deduce from (8.7) that for n ≥ n0 :
X
2 dTV (PXn , PX ) = |P(Xn = x) − P(X = x)|
x∈E
X
≤ |P(Xn = x) − P(X = x)| + P(Xn ∈ K c ) + P(X ∈ K c )
x∈K
X
≤ |P(Xn = x) − P(X = x)| + 3ε.
x∈K
This implies that lim supn→∞ 2 dTV (PXn , PX ) ≤ 3ε. Conclude using that ε is arbitrary.
Theorem 8.15 (Law of large number). Let X be a real-valued random variable such that
E[X] is well defined. Let (Xn , n ∈ N∗ ) be a sequence of independent real-valued random
variables distributed as X. We have the following a.s. converge:
a.s.
X̄n −−−→ E[X].
n→∞
The fluctuation are given by the CLT. We denote by N (µ, Σ), where µ ∈ Rd and Σ ∈ Rd×d
a symmetric non-negative matrix, the Gaussian distribution with mean µ and covariance
matrix Σ.
Theorem 8.16 (Central Limit Theorem (CLT)). Let X be a Rd -valued random variable such
that X ∈ L2 . Set µ = E[X] and Σ = Cov(X, X). Let (Xn , n ∈ N∗ ) be a sequence of indepen-
dent real-valued random variables distributed as X. We have the following convergences:
a.s. √ (d)
X̄n −−−→ µ and n X̄n − µ −−−→ N (0, Σ).
n→∞ n→∞
Notice that if (8.8) holds, then E[|Xi |] ≤ ε + K and thus supi∈I E[|Xi |] is finite.
We give some results related to the uniform integrability.
(i) The family (Xi , i ∈ I) is uniformly integrable if and only if the following two conditions
are satisfied:
(a) For all ε > 0, there exists δ > 0 such that for all events A with P(A) ≤ δ, we have
E [|Xi |1A ] ≤ ε.
(b) supi∈I E[|Xi |] < +∞.
(iii) If there exists an integrable real-valued random variable Y such that |Xi | ≤ |Y | a.s. for
all i ∈ I, then the family (Xi , i ∈ I) is uniformly integrable. More generally, if there
exists a family (Yi , i ∈ I) of uniformly integrable real-valued random variables such that
|Xi | ≤ |Yi | a.s. for all i ∈ I, then the family (Xi , i ∈ I) is uniformly integrable.
(v) If there exists r > 0 such that supi∈I E[|Xi |1+r ] < +∞, then the family (Xi , i ∈ I)
is uniformly integrable. More generally, if supi∈I E[f (Xi )] < +∞, where f is a non-
negative real-valued measurable function defined on R such that limx→+∞ f (x)/x = +∞,
then the family (Xi , i ∈ I) is uniformly integrable.
Proof. We first prove property (i). Assume that the family (Xi , i ∈ I) is uniformly integrable.
We have already noticed that (b) holds. Choose K such that (8.8) holds with ε replaced by
ε/2. Set δ = ε/2K and let A be an event such that P(A) ≤ δ. Using (8.8) and Markov
inequality, we get:
ε
E [|Xi |1A ] = E |Xi |1A 1{|Xi |≥K} + E |Xi |1A 1{|Xi |<K} ≤ + KP(A) ≤ ε.
2
This gives (a).
Assume that (a) and (b) hold. Set C = supi∈I E[|Xi |] which is finite by (b). Let ε > 0
be fixed and δ given by (a). Set K = C/δ and Ai = {|Xi | ≥ K}. Markov inequality gives
that P(Ai ) ≤ E[|Xi |]/K ≤ C/K = δ. We deduce from (a), with A replaced by Ai , that (8.8)
holds. This implies that the family (Xi , i ∈ I) is uniformly integrable.
We prove property (ii). Let Y be an integrable real-valued random
variable. In
particular
Y is a.s. finite. By dominated convergence, we get that limK→+∞ E |Y |1{|Y |≥K} = 0. Thus
(8.8) holds and Y is uniformly integrable.
Thanks to property (i), to prove property (iii) it is enough to check (a) and (b). Notice
that E[|Xi |] ≤ E[|Y |] for all i ∈ I and thus (b) holds. We have E [|Xi |1A ] ≤ E [|Y |1A ]. Then
use that Y is uniformly integrable, thanks to (ii), to conclude that (a) holds. The proof of
the more general case is similar.
8.2. MORE ON CONVERGENCE FOR SEQUENCE OF RANDOM VARIABLES 141
We prove property (v). Let ε > 0. Use that |Xi |1{|Xi |≥K} ≤ K −r |Xi |1+r to deduce that
(8.8) holds when K = (supi∈I E[|Xi |1+r ]/ε)1/r . The proof of the more general case is similar.
Thanks to property (i), to prove property (vi) it is enough to check (a) and (b). Let
(Xn , n ∈ N) be a sequence of integrable real-valued random variables which converges in L1
towards zero. Condition (b) is immediate. Let us check (a). Fix ε > 0. There exists n0 ∈ N
such that for all n ≥ n0 , we have E[|Xn |] ≤ ε. Then use (ii) and (i) to get there exists δi > 0
such that if A is an event such that P(A) ≤ δi then E[|Xi |1A ] ≤ ε for all i ≤ n0 . Take
δ = min0≤i≤n0 δi to deduce that (a) holds.
Lemma 8.19. Let X be an integrable real-valued random variable. The family (XH =
E[X| H]; H is a σ-field and H ⊂ G) is uniformly integrable.
Proof. We shall check (a) and (b) from property (i) of Proposition 8.18. Using Jensen in-
equality, we get that E[|XH |] ≤ E[|X|] for all σ-field H ⊂ G. We get (b) as X is integrable.
Furthermore decomposing according to {|X| ≥ K} and {|X| < K}, and using that a.s.
0 ≤ E [1A | H] ≤ 1, we get:
h i ε
E |X| E [1A | H] ≤ E |X|1{|X|≥K} + KE [E [1A | H]] ≤ + KP(A) ≤ ε.
2
We have obtained that E [|XH |1A ] ≤ ε for all σ-field H ⊂ G and all A ⊂ G such that
P(A) ≤ ε/2K. This gives (a).
Lemma 8.20. Let (Xn , n ∈ N) be a sequence of real-valued random variables which converges
in probability towards a real-valued random variable X∞ . Then, there is a sub-sequence
(Xnk , k ∈ N) which converges a.s. to X∞ .
The proof of this lemma is a consequence of the Borel-Cantelli lemma, but we provide a
direct short proof (see also the proof of Proposition 1.50 where similar arguments are used).
142 CHAPTER 8. APPENDIX
Proof. Let n0 = 0, and for k ∈ N, set nk+1 = inf{n > nk ; P(|Xn − X∞ | ≥ 2−k ) ≤ 2−k }. The
sub-sequenceP(nk , k ∈ N) is well defined since (Xn , n ∈ N) converges in probability towards
X∞ . Since k∈N P(|Xnk − X∞ | ≥ 2−k ) < +∞, we get that k∈N 1{|Xn −X∞ |≥2−k } is a.s.
P
k
finite and thus a.s. |Xnk − X∞ | ≥ 2−k for finitely many k. This implies that the sub-sequence
(Xnk , k ∈ N) converges a.s. to X∞ .
The uniform integrability is the right concept to get the L1 convergence from the a.s.
convergence of real-valued random variables. This is a consequence of the following proposi-
tion.
(i) The random variables (Xn , n ∈ N) are uniformly integrable and (Xn , n ∈ N) converges
in probability towards X∞ .
Proof. We first assume (i). Thanks to Lemma 8.20, there exists a sub-sequence (Xnk , k ∈ N)
which converges a.s. to X∞ . As (|Xnk |, k ∈ N) converges a.s. to |X∞ |, we deduce from
Fatou’s lemma that E [|X∞ |] ≤ lim inf k→∞ E [|Xnk |] ≤ supn∈N E [|Xn |]. Since the random
variables (Xn , n ∈ N) are uniformly integrable, we deduce from property (i)-(b) of Proposition
8.18 that X∞ is integrable.
Let ε > 0. Since the random variables (Xn , n ∈ N) are uniformly integrable as well as
X∞ , thanks to property (ii) of Proposition 8.18, we deduce there exists δ > 0 such that if A
is an event with P(A) ≤ δ, then E[|Xn |1A ] ≤ ε for all n ∈ N. Since (Xn , n ∈ N) converges in
probability towards X∞ , there exists n0 such that for n ≥ n0 , we have P(|Xn − X∞ | > ε) ≤ δ.
This implies that for n ≥ n0 :
E [|Xn − X∞ |] = E |Xn − X∞ |1{|Xn −X∞ |≤ε} + E |Xn − X∞ |1{|Xn −X∞ |>ε}
≤ ε + E |Xn |1{|Xn −X∞ |>ε} + E |X∞ |1{|Xn −X∞ |>ε} ≤ 3ε.
[1] P. Billingsley. Probability and measure. Wiley Series in Probability and Statistics. John
Wiley & Sons, Inc., Hoboken, NJ, 2012.
[2] J. Neveu. Bases mathématiques du calcul des probabilités. Masson et Cie, Éditeurs, Paris,
1970.
143
144 BIBLIOGRAPHY
Chapter 9
Exercises
4
Exercise 9.4 (Permutation of integrals). Prove that:
!
x2 − y 2
Z Z
π
2 2 2
λ(dy) dx = ·
(0,1) (0,1) (x + y ) 4
x2 − y 2
Deduce that the function f (x, y) = is not integrable with respect to the Lebesgue
(x2 + y 2 )2
measure on (0, 1)2 . (Hint: compute the derivative with respect to y of y/(x2 + y 2 ).) 4
Exercise 9.5 (Independence). Extend (1.13) to functions fj such that fj ≥ 0 forQall j ∈ J or
to functions fj such that fj (Xj ) is integrable for all j ∈ J. And in the latter case j∈J fj (Xj )
is also integrable. 4
145
146 CHAPTER 9. EXERCISES
Exercise 9.6 (Independence and covariance). Let X and Y be real-valued integrable random
variables. Prove that if X and Y are independent, then XY is integrable and Cov(X, Y ) =
0. Give an example such that X and Y are square-integrable not independent but with
Cov(X, Y ) = 0. 4
Exercise 9.7 (Independence). Let (Ai , i ∈ I) be independent events. Prove that (1Ai , i ∈ I)
are independent random variables and deduce that (Aci , i ∈ I) are also independents events.
4
Exercise 9.8 (Independence). Let (Ω, F, P) be a probability space. Let G ⊂ F be a σ-field
and C a collection of events which are all independents of G.
1. Prove by a counterexample that if C is not stable by finite intersection, then σ(C) may
not be independent of G.
2. Using the monotone class theorem prove that if C is stable by finite intersection, then
σ(C) is independent of G.
3. Let C and C 0 be two collections of events stable by finite intersection such that every
A ∈ C and A0 ∈ C 0 are independent. Prove that σ(C) and σ(C 0 ) are independent.
4
Exercise 9.12 (X conditioned on |X|). Let X be an integrable real-valued random variable
with
density f with respect to the Lebesgue measure. Compute E [X| |X|]. Compute also
2
E X| X . 4
Exercise 9.13 (Variance). Let X be a real-valued random variable such that E[X 2 ] < +∞.
Let H be a σ-field. Prove that E[X| H]2 is integrable and Var(E[X| H]) ≤ Var(X). 4
Exercise 9.14 (L1 distance). Let X, Y be independent R-valued integrable random variables
such that E[Y ] = 0. Prove that E[|X − Y |] ≥ E[|X|]. 4
Exercise 9.15 (Kolmogorov’s maximal inequality). Let (Xn , n ∈ N∗ ) be identically distributed
independent real-valued
Pn random variables. We assume that E[X12 ] < +∞ and E[X1 ] = 0. Let
x > 0. We set Sn = k=1 Xk for n ∈ N and T = inf{n ∈ N∗ ; |Sn | ≥ x} with the convention
∗
Pn
2. Check that k=1 P(T = k) = P (max1≤k≤n |Sk | ≥ x).
3. By noticing that Sn2 ≥ Sk2 + 2Sk (Sn − Sk ), prove Kolmogorov’s maximal inequality:
E[Sn2 ]
P max |Sk | ≥ x ≤ for all x > 0 and n ∈ N∗ .
1≤k≤n x2
4
Exercise 9.16 (An application of Jensen inequality). Let X and Y be two integrable real-
valued random variables such that a.s. E[X| Y ] = Y and E[Y | X] = X. Using Jensen
inequality (twice) with a positive strictly convex function ϕ such that limx→+∞ ϕ(x)/x and
limx→−∞ ϕ(x)/x are finite, prove that a.s. X = Y . 4
Exercise 9.17 (Independence and conditional expectation). Let H ⊂ F be a σ-field, Y and
V random variables taking values in measurable spaces (S, S) and (E, E) such that Y is
independent of H and V is H-measurable. Let ϕ be a non-negative real-valued measurable
function defined on S × E (endowed with the product σ-field). Prove that a.s.:
4
Exercise 9.18 (Conditional independence). Let A, B and H be σ-fields, subsets of F. Assume
that H ⊂ A ∩ B and that conditionally on H the σ-fields A and B are independent, that is
for all A ∈ A, B ∈ B, we have a.s. P(A ∩ B| H) = P(A| H)P(B| H).
1. Let A ∈ A, B ∈ B. Check that a.s. E [1A E [1B | A] | H] = E [1A E [1B | H] | H].
4
Exercise 9.20 (Conditional densities). Let (Y, V ) be an R2 -valued
random variable whose
law has density f(Y,V ) (y, v) = λv −1 e−λv 1{0<y<v} with respect to the Lebesgue measure
on R2 . Check that the law of Y conditionally on V is the uniform distribution on [0, V ].
For a real-valued measurable bounded function ϕ defined on R, deduce that E[ϕ(Y )|V ] =
RV
V −1 0 ϕ(y) dy. 4
Exercise 9.21 (Conditional distribution and independence). Let (Y, V ) be an S × E-valued
random variable. Prove that Y and V are independent if and only if the conditional distribu-
tion of Y given V exists and is given by a kernel, say κ, such that κ(v, dy) does not depend
on v ∈ E. In this case, check that κ(v, dy) = PY (dy). 4
148 CHAPTER 9. EXERCISES
1. Compute P(X2 = y|X0 = x) for x, y ∈ E. Prove that Z is a Markov chain and gives its
transition matrix.
2. Prove that any invariant probability measure for X is also invariant for Z. Prove the
converse is false in general.
4
Exercise 9.23 (Markov chains built from a Markov chain-II). Let X = (Xn , n ∈ N) be a
Markov chain on a finite or countable set E with transition matrix P . Set Y = (Yn , n ∈ N∗ )
where Yn = (Xn−1 , Xn ).
4
Exercise 9.24 (Labyrinth). A mouse is in the labyrinth depicted in figure 9.1 with 9 squares.
We consider the three classes of squares: A = {1, 3, 7, 9} (the corners), B = {5} (the center)
and C = {2, 4, 6, 8} the other squares. At each step n ∈ N, the mouse is in a square and we
denote by Xn its number and Yn its class.
1 2 3
4 5 6
7 8 9
1. At each step, the mouse choose an adjacent square at random (an uniformly). Prove
that X = (Xn , n ∈ N) is a Markov chain and represent its transition graph. Classify
the states of X.
2. Prove that Y = (Yn , n ∈ N) is a Markov chain and represent its transition graph.
Compute the invariant probability measure of Y and deduce the one of X.
4
9.3. DISCRETE MARKOV CHAINS 149
9.4 Martingales
Exercise 9.27 (Exit time distribution). Let U be a random variable on {−1, 1} such that
P(U = 1) = 1 − P(U = −1) = p with p ∈ (0, 1). We consider the simple Pnrandom walk
X = (Xn , n ∈ N) from Exercise 3.4 started at X0 = 0 defined by Xn = k=1 Uk , where
(Un , n ∈ N∗ ) are independent random variables distributed as U . Let a ∈ N∗ and
consider
τa = inf{n ∈ N∗ ; |Xn | ≥ a} the exit time of (−a, a). We set ϕ(λ) = log E[eλU ] for λ ∈ R.
Let λ ∈ R such that ϕ(λ) ≥ 0.
1. Prove that τa is a stopping time. Using that X is an irreducible Markov chain, prove
that a.s. τa is finite (but not bounded if a ≥ 2).
(λ)
2. Prove that M (λ) = (Mn = eλXn −nϕ(λ) ; n ∈ N) is a positive martingale.
(λ)
3. Using the optional stopping theorem and that ϕ(λ) ≥ 0, prove that E[Mτa ] = 1.
(±λ)
4. Assume that p = 1/2. Check that ϕ is non-negative. By considering Mτa for λ ∈ R,
prove that for all r ≥ 0:
1
E e−rτa =
·
cosh(a cosh−1 (er ))
4
Exercise 9.28 (Return time to 0). Let U be a random variable on {−1, 1} such that P(U =
1) = 1 − P(U = −1) = 1/2. We P consider the simple random walk X = (Xn , n ∈ N) started at
X0 = 1 defined by Xn = 1 + nk=1 Uk , where (Un , n ∈ N∗ ) are independent random variables
distributed as U . Let τ = inf{n ∈ N∗ ; Xn = 0} be the return time to 0.
4. Check that E[Mτ ] 6= E[M0 ] (thus τ is not bounded and M is not uniformly integrable).
4
Exercise 9.29 (Martingale not converging in L1 ). Let (Xn , n ∈ N∗ ) be a sequence of indepen-
dent Bernoulli random variables of parameter E[Xn ] = (1 + e)−1 . We define M0 = 1 and for
n ∈ N∗ : Pn
Mn = e−n+2 i=1 Xi .
4
Exercise 9.30 (Martingale not converging a.s.). Let (Zn , n ∈ N∗ )
be independent random
variables such that P(Zn = 1) = P(Zn = −1) = 1/(2n) and P(Zn = 0) = 1 − n−1 . We set
X1 = Z1 and for n ≥ 1:
3. Using Borel-Cantelli’s lemma, prove that P(Zn 6= 0 infinitely often) = 1. Deduce that
P( lim Xn exists) = 0. In particular, the martingale does not converge a.s. towards 0.
n→∞
4
Exercise 9.31 (Wright-Fisher model). We consider a population of constant size N . We
assume that the reproduction is random: this corresponds in the end to each individual
choosing his parent independently in the previous generation. The Wright-Fisher model
study the evolution of the number of individuals carrying one of the two alleles A and a. For
n ∈ N, let Xn denote the number of alleles A at generation n in the population. We assume
that X0 = i ∈ {0, . . . , N } is given. We shall study the process X = (Xn , n ≥ 0).
3. Prove that X converges to a limit, say X∞ , and give the type of convergence.
5. Prove that one of the allele disappears a.s. in finite time. Compute the probability that
allele A disappears.
4
Exercise 9.32 (Waiting time of a given sequence). Let X = (Xn , n ∈ N∗ ) be a sequence of
independent Bernoulli random variable with parameter p ∈ (0, 1): P(Xn = 1) = 1 − P(Xn =
0) = p. Let τijk = inf{n ≥ 3; (Xn−2 , Xn−1 , Xn ) = (i, j, k)} be the waiting time of the
sequence (i, j, k) ∈ {0, 1}3 . The aim of this exercise is to compute its expectation.
X1 X2 X2
4. Compute E[τ110 ], using the sequence (Tn , n ≥ 2) defined by T2 = + and
p2 p
1 − Xn Xn−1 Xn Xn−1 (1 − Xn ) Xn
Tn = Tn−1 + − + for n ≥ 3.
1−p p2 p(1 − p) p
5. Using similar arguments, compute E[τ100 ] and E[τ101 ].
If p = 1/2, it can be proved1 that for any sequence (i, j, k) ∈ {0, 1}, the sequence (̄, i, j),
with ̄ = 1 − j, appears earlier in probability, that is P(τ̄ ij < τijk ) > 1/2. 4
Exercise 9.33 (When does an insurance companies goes bankrupt?). We consider the evolu-
tion of the capital of an insurance company. Let S0 = x > 0 be the initial capital, c > 0
the fixed income per year and Xn ≥ 0 the (random) cost P of the damage for the year n. The
capital at the end of year n ≥ 1 is thus Sn = x + nc − nk=1 Xk . Bankruptcy happens if the
capital becomes negative that is if the bankruptcy time τ = inf{k ∈ N; Sk < 0}, with the
convention inf ∅ = ∞, is finite. The goal of this exercise is to find an upper bound of the
bankruptcy probability P(τ < ∞).
We assume the real random variables (Xk , k ≥ 1) are independent, identically distributed,
a.s. non constant, and have all its exponential moments (i.e. E[eλX1 ] < ∞ for all λ ∈ R).
1. Check that E[X1 ] > c implies P(τ < ∞) = 1, and that P(X1 > c) = 0 implies
P(τ < ∞) = 0.
(You can check we recover the maximal inequality for the positive sub-martingale.)
4. Deduce that P(τ < ∞) ≤ e−λ0 x , where λ0 ∈ (0, ∞) is the unique root of E[eλX1 ] = eλc .
4
Exercise 9.34 (A.s. convergence and convergence in distribution). Pn Let (Xn , n ≥ 1) be a
sequence of independent real random variables. We set Sn = k=1 Xk for n ≥ 1. The goal
of this exercise is to prove that if the sequence (Sn , n ≥ 1) converges in distribution, then it
converges a.s. also.
eitSn
For t ∈ R, we set ψn (t) = E[eitXn ] and Mn (t) = Qn for n ≥ 1 if nk=1 ψk (t) 6= 0.
Q
k=1 ψk (t)
1. Let t ∈ R be such that nk=1 ψk (t) 6= 0. Prove that (Mk (t), 1 ≤ k ≤ n) is a martingale.
Q
1
R. Graham, D. Knuth and O. Patashnik. Concrete mathematics: a foundation of computer science, 2nd
Edition. Addison-Wesley Publishing Company, 1994. (See Section 8.4.)
9.5. OPTIMAL STOPPING 153
3. We recall that if there exists ε > 0 s.t., for almost all t ∈ [−ε, ε], the sequence (eitsn , n ≥
1) converges, then the sequence (sn , n ≥ 1) converges. Prove that (Sn , n ≥ 1) converges
a.s. towards a random variable distributed as S.
4
3. Prove that (tB1/t , t ∈ (0, +∞)) is distributed as (Bt , t ∈ (0, +∞)). Deduce that a.s.
limt→+∞ Bt /t = 0.
4
Exercise 9.37 (Simulation of Brownian motion). We present a recursive algorithm due to
Lévy to simulate the Brownian motion on the interval [0, T ] with T > 0.
1. Prove that the Brownian bridge W T is a centered Gaussian process with covariance
kernel K = (K(s, t); s, t ∈ [0, T ]) given by K(s, t) = t ∧ s(T − t ∨ s)/T .
2. Prove that E[WtT BT +s ] = 0 for all t ∈ [0, T ] and s ∈ R+ . Deduce that W T is indepen-
dent of (BT +s , s ∈ R+ ).
Let s ≥ r ≥ 0 be fixed. We define the process W̃ = (W̃t , t ∈ [r, s]) by:
t−r
W̃t = Bt − Br − (Bs − Br ).
s−r
3. Prove that W̃ is a Gaussian process distributed as W s−r . And deduce the variance of
W̃t for t ∈ [r, s].
S
5. Let t ∈ [r, s]. Deduce that conditionally on (Bu , u ∈ [0, r] [s, +∞)), Bt is distributed
as: r
(t − r)(s − t) s−t t−r
G+ Br + Bs ,
s−r s−r s−r
S
where G ∼ N (0, 1) is independent of (Bu , u ∈ [0, r] [s, +∞)).
4
Exercise 9.38 (Ornstein-Uhlenbeck process). Let V = (Vt , t ∈ R+ ) be the solution of the
Langevin equation (6.10) with initial condition V0 . Let U be a centered Gaussian random
variable with variance σ 2 /(2a) and independent of the Brownian motion B.
4
Chapter 10
Solutions
Exercise 9.2 Let µ and µ0 be two σ-finite measures on (Ω, F) which coincide on a collection
S C stable by finite intersection such that Ωn ∈ C for all n ∈ N, where µ(Ωn ) < +∞
of events
and n∈N Ωn = Ω. By replacing Ωn by ∪0≤k≤n Ωk for n ∈ N, we can assume that the
sequence (Ωn , n ∈ N) is non-decreasing. For n ∈ N, we can define Pn = µ(Ωn )−1 µ and
P0n = µ0 (Ωn )−1 µ0 . Those two probability measures coincide on Cn = {A ∩ Ωn , A ∈ C} ⊂ C
which is also stable by finite intersection, and thus thanks to Corollary 1.14, they coincide
on σ(Cn ). As µ(Ωn ) = µ0 (Ωn ), we deduce that µ and µ0 coincide on σ(Cn ) for all n ∈ N.
Let G = {A ∈ F, A ∩ Ωn ∈ σ(Cn ) for all n ∈ N}. It is elementary to check that G is a
σ-field. SinceSC is stable by finite intersection, we have C ⊂ G and thus σ(C) ⊂ G. If A ∈ G,
we get A = n∈N (A ∩ Ωn ). As A ∩ Ωn ∈ σ(Cn ) ⊂ σ(C), we deduce that A ∈ σ(C). This
implies that G = σ(C). By monotone convergence, we get that for A ∈ σ(C), that is A ∈ G:
where we used that µ and µ0 coincide on σ(Cn ) for the second equality. We deduce that
µ = µ0 on σ(C).
The extension of Corollary 1.15 is immediate.
lim sup(an − bn ) = lim sup (ak − bk ) ≤ lim ( sup ak − inf bk ) = lim sup an − lim inf bn .
n→∞ n→∞ 0≤k≤n n→∞ 0≤k≤n 0≤k≤n n→∞ n→∞
We also have:
lim sup(an − bn ) = lim sup (ak − bk ) ≥ lim ( sup ak − sup bk ) = lim sup an − lim sup bn .
n→∞ n→∞ 0≤k≤n n→∞ 0≤k≤n 0≤k≤n n→∞ n→∞
If furthermore the sequence (bn , n ∈ N) converges, we have lim supn→∞ bn = lim inf n→∞ bn ,
which allows to conclude.
155
156 CHAPTER 10. SOLUTIONS
Exercise 9.5 For the case fj ≥ 0 for j ∈ J, only the last sentence of the proof of Proposition
1.62 need to be changed. Use monotone convergence theorem, to get (1.13) holds if the
function fj are non-negative.
For the other case, according to the above argument, we get that:
Y εj Y ε
E fj (Xj ) = E[fj j (Xj )],
j∈J j∈J
where εj ∈ {−, +} for j ∈ J. Those quantities being all finite as fj (Xj ) is integrable, we
obtain (1.13) using the linearity of the expectation in L1 (P).
Exercise 9.6 The fact that XY is integrable and that Cov(X, Y ) = 0 is a consequence of
Exercise 9.5. Let X be a non-negative square-integrable random variable with non-zero
variance. Let Y = εX, with ε independent of X and such that P(ε = 1) = P(ε = −1) = 1/2.
We have E[Y ] = 0 and E[XY ] = 0 so that Cov(X, Y ) = 0. However, we have Cov(X, |Y |) =
Var(X) > 0. This implies that X and Y are not independent.
Exercise 9.7 Use that f (1A ) = 1 + (f (1) − f (0))1A and Proposition 1.62 to deduce that if the
events (Ai , i ∈ I) are independent then the random variables (1Ai , i ∈ I) are independent.
By Definition 1.31, if the random variables (Xi , i ∈ I) are independent so are the random
variables (fi (Xi ), i ∈ I) for any measurable functions (fi , i ∈ I). Take Xi = 1Ai and fi (x) =
1−x to deduce that (1Aci , i ∈ I) are independent random variables, and thus thanks to (1.13),
deduce that (Aci , i ∈ I) are also independents events.
Exercise 9.10 Since (X1 , X2 ) has the same distribution as (X2 , X1 ), we deduce that (X1 , S2 )
has the same distribution as (X2 , S2 ). This implies that E[X1 | S2 ] = E[X2 | S2 ]. By linearity,
we have:
E[X1 | S2 ] + E[X2 | S2 ] = E[S2 | S2 ] = S2 .
We deduce that E[X1 | S2 ] = S2 /2. Similarly, we get E[X1 |Sn ] = Sn /n.
Exercise 9.11 Notice that (X, X 2 ) and (−X, X 2 ) have the same distribution. We deduce
that E[X| X 2 ] = E[−X| X 2 ] = −E[X| X 2 ]. This implies that E[X| X 2 ] = 0.
for some measurable function g, such that thanks to (2.2) (with h given by 1A = h(|X|) for
A ∈ σ(|X|)), we will deduce that a.s. g(|X|) = E[X| |X|]. On one hand, we have:
Z Z Z
E[X h(|X|)] = xh(|x|)f (x) dx = xh(|x|)f (x) dx + xh(|x|)f (x) dx
R R+ R−
Z
= |x|h(|x|) f (x) − f (−x) dx,
R+
We deduce that:
f (x) − f (−x)
g(x) = |x|
f (x) + f (−x)
satisfies (10.1) and thus a.s.
f (X) − f (−X)
E[X| |X|] = |X| ·
f (X) + f (−X)
√
Notice that σ(|X|) = σ(X 2 ) (as |x| = x2 and x2 = |x|2 ) so that E[X| X 2 ] = E[X| |X|].
Exercise 9.13 By Jensen’s inequality, we have E[X| H]2 ≤ E[X 2 | H]. Since E[E[X 2 | H]] =
E[X 2 ] < +∞, we deduce that E[X| H]2 is integrable. Using Jensen’s inequality, we get that:
Exercise 9.14 Set ϕ(x) = E[|x − Y |] for x ∈ R. Using Jensen’s inequality, we get that
ϕ(x) = E[|x − Y |] ≥ |x − E[Y ]| = |x| for all x ∈ R. As X and Y are independent, we also
have that E[|X − Y |] = E[ϕ(X)]. This gives the result: E[|X − Y |] ≥ E[|X|].
Exercise 9.15
Exercise 9.16 Let ϕ be a positive strictly convex function on R such that limx→+∞ ϕ(x)/x
and limx→−∞ ϕ(x)/x are finite. This implies in particular that ϕ(X) and ϕ(Y ) are inte-
grable. We deduce from Jensen’s inequality that E[ϕ(X)| Y ] ≥ ϕ(E[X| Y ]) = ϕ(Y ) and thus
E[ϕ(X)] ≥ E[ϕ(Y )]. By symmetry, we get E[ϕ(X)] = E[ϕ(Y )] and thus the Jensen’s inequal-
ity is an equality: a.s. E[ϕ(X)| Y ] = ϕ(E[X| Y ])]. Since ϕ is strictly convex, this implies that
a.s. ϕ(X) = ϕ(E[X| Y ])] = ϕ(Y ), and thus, as ϕ is general, a.s. X = Y .
Exercise 9.17 Let A ∈ H and consider the random variable X = (V, 1A ) which takes values in
E × {0, 1}. Since Y and X are independent, we deduce from Lemma 1.56 that for measurable
sets B ∈ S and C ∈ E:
P(Y,X) (B × C) = P(Y ∈ B, X ∈ C) = P(Y ∈ B)P(X ∈ C) = PY (B)PX (C).
We deduce from Fubini’s theorem that P(Y,X) (dy, dx) = PY (dy)PX (dx), and, with x =
(x1 , x2 ) ∈ E × {0, 1} and f (x, y) = ϕ(y, x1 )x2 , from Equation (1.6) that:
Z Z Z
E[ϕ(Y, V )1A ] = f (y, x) P(Y,X) (dy, dx) = f (x, y) PY (dy) PX (dx).
R
Set g(v) = E[ϕ(v, Y )] = ϕ(v, y) PY (dy) so that:
Z
E[ϕ(Y, V )1A ] = g(x1 )x2 PX (dx) = E[g(V )1A ].
Exercise 9.19 1. We have E[|Xn |] = 1/n, which implies that the sequence X = (Xn , n ∈
N∗ ) converges to 0 in L1 . Notice that 0 −2
P P(Xn > 1/n) = P(Bn = Bn = 1) = n and thus
the non-negative random variable n∈N∗ 1{Xn >1/n} is integrable and thus finite. Since
the terms in the sum are either 1 or 0, we deduce that Xn ≤ 1/n for n large enough,
that is X converges to a.s. 0.
Exercise 9.20 The computations are elementary, Z see Sections 2.3.2 and 2.3.3. The density
of the probability distribution of V is fV (v) = fY,V (y, v) dy = λ e−λv 1{v>0} . We deduce
that for y > 0, fY |V (y|v) = v −1 1{0<y<v} , which is the density of the uniform distribution on
[0, v]. The last formula is then clear.
Exercise 9.21 We first assume that Y and V are independent. We deduce from Exercise 9.17
with H = σ(V ) that, for all nonnegative
R measurable function ϕ, we have E[ϕ(Y, V )| H] = g(V )
with g(v) = E[ϕ(Y, v)] = ϕ(y, v) PY (dy). We deduce from Definition 2.17 that P(Y ∈
A|V ) = PY (A), and thus the conditional distribution of Y given V exists and is given by the
kernel κ(v, dy) = PY (dy) which does not depend on v ∈ E.
We now assume that the conditional distribution of Y given V exists and is given by a
kernel which does not depend on v ∈ E, say κ(dy). By Definition 2.17, we have P(Y ∈ A|V ) =
κ(V, A) = κ(A), and taking the expectation, we get P(Y ∈ A) = κ(A), that is κ = PY . We
also get P (Y ∈ A, V ∈ B) = E[1B (V )P(Y ∈ A| V )] = PY (A)E[1B (V )] = P(Y ∈ A)P(V ∈ B),
which means that Y and V are independent.
Assume that X is a stochastic dynamical system: Xn+1 = f (Xn , Un+1 ) for some measur-
able function f and (Un , n ∈ N∗ ) independent identically distributed random variables
independent of X0 . Then, we have:
Zn+1 = X2n+2 = f (f (X2n , U2n+1 ), U2n+2 ) = g(Zn , Vn+1 ),
with Vn+1 = (U2n+1 , U2n+2 ) and g(x, (v1 , v2 )) = f (f (x, v1 ), v2 ). Since the random
variables (Vn , n ∈ N∗ ) are independent, identically distributed and independent of Z0 =
X0 , we deduce that Z is a stochastic dynamical system and thus a Markov chain.
In general, X is distributed as a stochastic dynamical system, say X̃. The process Z
is a functional of X, say Z = F (X). We deduce that Z is distributed as Z̃ = F (X̃),
which is a stochastic dynamical system, according to the previous argument. Hence, Z
is a Markov chain.
Notice that Z has transition matrix Q = P^2.
2. On the one hand, if π is an invariant probability measure for P, then we have πP^2 = (πP)P = πP = π. Hence π is also an invariant probability measure for Q. On the other hand, the converse may fail: for the state space E = {a, b} (with a ≠ b), consider

P = ( 0  1 )
    ( 1  0 ).

The unique invariant probability measure of P is π = (1/2, 1/2)^t. As Q = P^2 is the identity matrix, we get that any probability measure is invariant for Q.
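The two-state counter-example can be checked in a few lines (a small numpy sketch):

```python
import numpy as np

P = np.array([[0.0, 1.0],
              [1.0, 0.0]])
Q = P @ P                      # Q = P^2 is the identity matrix
pi = np.array([0.5, 0.5])
mu = np.array([0.3, 0.7])      # an arbitrary probability measure

print(pi @ P)                  # [0.5 0.5]: pi is invariant for P
print(mu @ P)                  # [0.7 0.3]: mu is not invariant for P
print(mu @ Q)                  # [0.3 0.7]: yet mu is invariant for Q
```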
Exercise 9.23 1. Assume that X is a stochastic dynamical system: Xn+1 = f (Xn , Un+1 )
for some measurable function f and (Un , n ∈ N∗ ) independent identically distributed
random variables independent of X0 . Then, we have:
Yn+1 = (Xn , Xn+1 ) = (Xn , f (Xn , Un+1 )) = g(Yn , Un+1 ),
with g((y_1, y_2), u) = (y_2, f(y_2, u)). Since the random variables (U_n, n ≥ 2) are independent, identically distributed and independent of Y_1 = (X_0, X_1), we deduce that Y is a stochastic dynamical system and thus a Markov chain.
In general, X is distributed as a stochastic dynamical system, say X̃. The process Y
is a functional of X, say Y = F (X). We deduce that Y is distributed as Ỹ = F (X̃),
which is a stochastic dynamical system, according to the previous argument. Hence, Y
is a Markov chain.
The transition matrix Q of Y on E^2 is given by:

Q((x_1, x_2), (y_1, y_2)) = 1_{x_2 = y_1} P(y_1, y_2).
3. It is easy to check that ν = (ν(z), z ∈ E^2), with ν(z) = π(x) P(x, y) for z = (x, y) ∈ E^2, is an invariant measure for Y. Indeed, we have for z = (v, w) ∈ E^2:

νQ(z) = ∑_{x,y∈E} ν((x, y)) Q((x, y), (v, w)) = ∑_{x,y∈E} π(x) P(x, y) 1_{y=v} P(v, w) = π(v) P(v, w) = ν(z).
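The invariance of ν can also be verified numerically on a randomly generated transition matrix (a sketch; all names are our own):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
P = rng.random((d, d))
P /= P.sum(axis=1, keepdims=True)   # a random transition matrix on E = {0, ..., d-1}

# invariant probability of P: normalized left eigenvector for the eigenvalue 1
w, v = np.linalg.eig(P.T)
pi = np.real(v[:, np.argmin(np.abs(w - 1.0))])
pi /= pi.sum()

# transition matrix Q of Y_n = (X_{n-1}, X_n) on E^2:
# Q((x1, x2), (y1, y2)) = 1_{x2 = y1} P(y1, y2)
Q = np.zeros((d * d, d * d))
for x1 in range(d):
    for x2 in range(d):
        for y2 in range(d):
            Q[x1 * d + x2, x2 * d + y2] = P[x2, y2]

nu = np.array([pi[x] * P[x, y] for x in range(d) for y in range(d)])
print(np.abs(nu @ Q - nu).max())    # ~ 1e-16: nu is invariant for Q
```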
Exercise 9.24 1. Since each new step is chosen uniformly at random among the available neighbors, the next position depends on the past only through the current position. This implies that X is a Markov chain. Clearly it is irreducible (so all the states belong to the same closed class). Since the state space is finite, the Markov chain is positive recurrent (so all the states are positive recurrent).
2. For all n ∈ N, set Fn = σ(X0 , . . . , Xn ). We shall check that Y is a Markov chain with
respect to the filtration F = (Fn , n ∈ N). We first compute P(Yn+1 = A| Fn ). Let P be
the transition matrix of the Markov chain X. We have:
P(Y_{n+1} = A|F_n) = ∑_{y∈A} P(X_{n+1} = y|F_n)
                   = ∑_{y∈A} P(X_{n+1} = y|X_n)
                   = ∑_{y∈A} P(X_n, y)
                   = ∑_{y∈A} ∑_{x∈C} P(x, y) 1_{X_n = x}
                   = (2/3) ∑_{x∈C} 1_{X_n = x}
                   = (2/3) 1_{Y_n = C},
where we used that X is a Markov chain for the second equality, that P(x, y) = 0 for x ∉ C and y ∈ A for the fourth, and that ∑_{y∈A} P(x, y) = 2/3 for all x ∈ C for the fifth. Since the last right hand-side term is σ(Y_n)-measurable, this implies that P(Y_{n+1} = A|F_n) = P(Y_{n+1} = A|Y_n). Similarly, we obtain P(Y_{n+1} = B|F_n) = (1/3) 1_{Y_n = C} = P(Y_{n+1} = B|Y_n) and P(Y_{n+1} = C|F_n) = 1_{Y_n ≠ C} = P(Y_{n+1} = C|Y_n).
Since P(Y_{n+1} = • |F_n) = P(Y_{n+1} = • |Y_n), we deduce that Y is a Markov chain. Its transition matrix (on {A, B, C}) is given by:

Q = (  0    0    1 )
    (  0    0    1 )
    ( 2/3  1/3   0 ).
3. A candidate for the invariant probability for P is given by π_P(x) = 1/12 for x ∈ A, π_P(x) = 1/6 for x ∈ B and π_P(x) = 1/8 for x ∈ C. It is indeed elementary to check that π_P P = π_P. Since the Markov chain X is irreducible on a finite state space, the invariant probability exists and is unique, and thus is given by π_P. (One can check that the invariant probability for a uniform random walk on a general finite graph (i.e. the next state is chosen uniformly at random among the closest neighbors) is proportional to the degree of the nodes: π(x) = deg(x)/∑_y deg(y) for all nodes x of the finite graph.)
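The parenthetical claim is easy to test on a small graph (a sketch with an arbitrary 4-node connected graph of our choosing):

```python
import numpy as np

# adjacency matrix of a small connected graph (arbitrary choice)
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)
deg = A.sum(axis=1)
P = A / deg[:, None]                 # uniform random walk: P(x, y) = 1_{x ~ y} / deg(x)
pi = deg / deg.sum()                 # candidate: pi(x) proportional to deg(x)

print(np.abs(pi @ P - pi).max())     # ~ 0: pi is indeed invariant
```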
Exercise 9.25
Exercise 9.26
10.4 Martingales
Exercise 9.27 1. We have {τ_a > n} = ∩_{k=1}^n {|X_k| < a} ∈ F_n. This implies that τ_a is a stopping time. Since X is an irreducible Markov chain, it is either transient, and then the time spent in (−a, a) is finite, or recurrent, and then the number of visits of a is infinite. In both cases, X leaves (−a, a) in finite time, that is, τ_a is a.s. finite. The event ∩_{k=1}^n {U_{2k} = 1, U_{2k+1} = −1} has positive probability and, if a ≥ 2, we have on this event that τ_a ≥ 2n. Therefore, τ_a is not bounded.
2. M^{(λ)} is clearly adapted. Since |X_n| ≤ n, we deduce that, for fixed n, M_n^{(λ)} is bounded and thus integrable. We have, using that U_{n+1} is independent of F_n:

E[M_{n+1}^{(λ)}|F_n] = e^{λX_n − (n+1)ϕ(λ)} E[e^{λU_{n+1}}|F_n] = e^{λX_n − (n+1)ϕ(λ)} E[e^{λU_{n+1}}] = e^{λX_n − nϕ(λ)} = M_n^{(λ)}.
We deduce that:

E[e^{−rτ_a} 1_{X_{τ_a} = a}] = sinh(λa)/sinh(2λa)   and   E[e^{−rτ_a} 1_{X_{τ_a} = −a}] = sinh(λa)/sinh(2λa),

and thus:

E[e^{−rτ_a}] = 2 sinh(λa)/sinh(2λa) = 1/cosh(λa),

where we used that sinh(2λa) = 2 sinh(λa) cosh(λa) for the last equality.
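For the symmetric walk with steps U_n = ±1 with probability 1/2 each (an assumption consistent with Question 1, spelled out here), one has e^{ϕ(λ)} = E[e^{λU_1}] = cosh λ, so the choice ϕ(λ) = r amounts to cosh λ = e^r. The closed formula E[e^{−rτ_a}] = 1/cosh(λa) can then be checked by simulation (a minimal sketch under this ±1 assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
a, r = 3, 0.05
lam = np.arccosh(np.exp(r))        # lambda > 0 solving cosh(lambda) = e^r

est = []
for _ in range(20000):
    x = t = 0
    while abs(x) < a:              # tau_a: exit time of (-a, a)
        x += 1 if rng.random() < 0.5 else -1
        t += 1
    est.append(np.exp(-r * t))
print(np.mean(est), 1.0 / np.cosh(lam * a))   # estimate vs closed formula
```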
Exercise 9.28 1. We have, for y ≠ x and r = |y − x|, that P(X_r = y|X_0 = x) = 2^{−r} > 0. Hence X is irreducible.
5. Notice that N is F-adapted and integrable (as for all n ∈ N, |X_n| ≤ n and thus |N_n| ≤ n^2). We have:

E[X_{n+1}^2|F_n] = E[(X_n + U_{n+1})^2|F_n] = X_n^2 + E[U_{n+1}^2|F_n] + 2X_n E[U_{n+1}|F_n]
                 = X_n^2 + E[U_{n+1}^2] + 2X_n E[U_{n+1}]
                 = X_n^2 + 1,

where we used that U_{n+1} is independent of F_n, with E[U_{n+1}] = 0 and E[U_{n+1}^2] = 1. Hence N_n = X_n^2 − n defines a martingale.
6. By the optional stopping theorem, we have that E[N_{τ∧n}] = 0. This gives E[M_n^2] = E[X_{τ∧n}^2] = E[τ ∧ n] ≤ E[τ]. Since M is not uniformly integrable, it is not bounded in L^2. Thus we have sup_{n∈N} E[M_n^2] = +∞. This implies that E[τ] = +∞. Let T be the first return time to 0. By decomposing the random walk started at 0 with respect to its first step, and considering only the case where it goes first to 1, we get T ≥ 1_{U_1 = 1}(1 + τ'), where τ' is distributed as τ and independent of U_1. Therefore we have E[T] ≥ (1 + E[τ])/2 = +∞. This implies that X is null recurrent.
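Numerically, E[τ] = +∞ shows up through the estimates of E[τ ∧ n] growing without bound in n. A minimal simulation sketch (reading τ, consistently with the decomposition above, as the first hitting time of 1 by the walk started at 0):

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_tau_capped(n, reps=2000):
    """Monte Carlo estimate of E[tau ^ n], tau = first hitting time of 1 from 0."""
    total = 0
    for _ in range(reps):
        x = t = 0
        while x != 1 and t < n:
            x += 1 if rng.random() < 0.5 else -1
            t += 1
        total += t
    return total / reps

for n in (10**2, 10**3, 10**4):
    print(n, mean_tau_capped(n))   # grows (roughly like sqrt(n)): E[tau] = +infinity
```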
Exercise 9.29 1. Computing E[M_{n+1}|F_n], where we use that X_{n+1} is independent of F_n for the first equality, we get E[M_{n+1}|F_n] = M_n. Thus M is a martingale. By the strong law of large numbers, we have that a.s. lim_{n→∞} n^{−1} ∑_{k=1}^n X_k = (1+e)^{−1}, and thus a.s. lim_{n→∞} (−n + 2∑_{k=1}^n X_k) = −∞ and a.s. lim_{n→∞} M_n = 0.
2. Since lim_{n→∞} E[M_n] = 1 > 0 = E[lim_{n→∞} M_n], this implies that M doesn't converge in L^1.
Exercise 9.30 1. An elementary induction gives that |X_n| ≤ n! and thus X_n is integrable. The process X is adapted with respect to the natural filtration F of the process (Z_n, n ∈ N*). We have:

E[X_{n+1}|F_n] = E[Z_{n+1}] 1_{X_n = 0} + (n + 1) X_n E[|Z_{n+1}|] 1_{X_n ≠ 0} = X_n 1_{X_n ≠ 0} = X_n,

where we used that Z_{n+1} is independent of F_n for the first equality. We deduce that X is a martingale.
3. Set A_n = {Z_n ≠ 0}. The events (A_n, n ∈ N*) are independent and ∑_{n∈N*} P(A_n) = +∞. The Borel-Cantelli lemma implies that the set of ω ∈ Ω such that Card{n ∈ N* : ω ∈ A_n} = ∞ has probability 1, that is P(Z_n ≠ 0 infinitely often) = 1. Since {X_n ≠ 0} = {Z_n ≠ 0} and X_n belongs to Z, we deduce that P(|X_n| ≥ 1 infinitely often) = 1. Since X_n converges in probability towards 0, we deduce that P(lim_{n→∞} X_n exists) = P(lim_{n→∞} X_n = 0). But this latter quantity is 0, as P(|X_n| ≥ 1 infinitely often) = 1.
Exercise 9.31 1. Let F = (F_n, n ∈ N) be the natural filtration of the process X. Conditionally on F_n, X_{n+1} has a binomial distribution with parameter (N, X_n/N); notice this proves that X is a homogeneous Markov chain. We deduce that:

E[X_{n+1}|F_n] = E[X_{n+1}|X_n] = N (X_n/N) = X_n.

Thus X is a martingale.
We deduce that:

E[M_{n+1}|F_n] = (N/(N−1))^{n+1} E[X_{n+1}(N − X_{n+1})|X_n]
               = (N/(N−1))^{n+1} N(N−1) (X_n/N)(1 − X_n/N)
               = M_n.
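Both martingale properties can be checked by simulating this binomial resampling dynamics (a sketch; the values of N, X_0 and the horizon are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(0)
N, horizon, reps = 20, 30, 200000
X = np.full(reps, 5)                 # X_0 = 5, arbitrary
for _ in range(horizon):
    X = rng.binomial(N, X / N)       # X_{n+1} | X_n  ~  Bin(N, X_n / N)

print(X.mean(), 5)                   # E[X_n] = X_0: X is a martingale
M = (N / (N - 1)) ** horizon * X * (N - X)
print(M.mean(), 5 * (N - 5))         # E[M_n] = M_0: M is a martingale
```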
Exercise 9.32 1. One computes E[M_{n+1}|F_n] = M_n, where we use that S_n is F_n-measurable for the first equality, and that X_{n+1} is independent of F_n for the second. We deduce that M is a martingale.
2. Write τ for τ_111. Using the optional stopping theorem, we get that E[M_{n∧τ}] = E[M_0] = 0. This implies that for all n ∈ N:

E[n ∧ τ] = E[S_{n∧τ}].

By monotone convergence, we get that lim_{n→∞} E[n ∧ τ] = E[τ]. Since τ is a.s. finite, we have that a.s. lim_{n→∞} S_{n∧τ} = S_τ. It is clear from the dynamics of S that:

0 ≤ S_{n∧τ} ≤ S_τ = 1/p + 1/p^2 + 1/p^3,

thus, by dominated convergence, we deduce that lim_{n→∞} E[S_{n∧τ}] = E[S_τ]. This gives:

E[τ] = E[S_τ] = 1/p + 1/p^2 + 1/p^3.
3. Using the strong Markov property at time τ_11 for the Markov chain X, we deduce that (X_{τ_11 + n}, n ∈ N*) is independent of (X_k, 1 ≤ k ≤ τ_11) and distributed as X. We deduce that:

P(τ_111 > τ_110) = P(X_{τ_11 + 1} = 0) = P(X_1 = 0) = 1 − p,

and, arguing as in Question 2 with the corresponding martingale T:

E[τ_110] = E[T_{τ_110}] = 1/(p^2(1−p)) = 1/p + 1/p^2 + 1/(1−p).
5. Consider U_1 = X_1/p and U_n = U_{n−1} (1 − X_n)/(1−p) + X_n/p for n ≥ 2, to get:

E[τ_100] = E[U_{τ_100}] = 1/(p(1−p)^2).

Consider V_2 = X_1(1 − X_2)/(p(1−p)) + X_2/p and V_n = V_{n−1} X_n/p + X_{n−1}(1 − X_n)/(p(1−p)) − X_{n−1}X_n/p^2 + X_n/p for n ≥ 3, to get:

E[τ_101] = E[V_{τ_101}] = 1/p + 1/(p^2(1−p)).
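All four expected pattern times can be checked by direct simulation of i.i.d. Bernoulli(p) bits (a sketch; the right-hand sides are the formulas obtained above):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.3

def mean_wait(pattern, reps=20000):
    """Monte Carlo estimate of E[tau_pattern] for i.i.d. Bernoulli(p) bits."""
    k, total = len(pattern), 0
    for _ in range(reps):
        window, t = [], 0
        while window != pattern:
            window = (window + [1 if rng.random() < p else 0])[-k:]
            t += 1
        total += t
    return total / reps

print(mean_wait([1, 1, 1]), 1/p + 1/p**2 + 1/p**3)
print(mean_wait([1, 1, 0]), 1/(p**2 * (1 - p)))
print(mean_wait([1, 0, 0]), 1/(p * (1 - p)**2))
print(mean_wait([1, 0, 1]), 1/p + 1/(p**2 * (1 - p)))
```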
Exercise 9.33 1. Assume that E[X_1] > c. By the strong law of large numbers, we get that a.s. lim_{n→∞} S_n/n = c − E[X_1] < 0. This implies that a.s. lim_{n→∞} S_n = −∞, and thus τ is a.s. finite. If P(X_1 > c) = 0, then a.s. X_k ≤ c for all k, and then a.s. S_n ≥ x for all n ∈ N. This implies that a.s. τ is infinite.
3. We have:

E[V_N 1_{τ≤N}] = ∑_{k=1}^N E[V_N 1_{τ=k}]
              = ∑_{k=1}^N E[E[V_N 1_{τ=k}|F_k]]
              = ∑_{k=1}^N E[1_{τ=k} E[V_N|F_k]]
              ≥ ∑_{k=1}^N E[V_k 1_{τ=k}]
              ≥ e^{λx} P(τ ≤ N),

where we used that V is a sub-martingale for the first inequality, and that V_k ≥ e^{λx} on {τ = k} for the second inequality.
4. The function ϕ defined on R_+ by ϕ(λ) = E[e^{λ(X_1−c)}] belongs to C^∞(R_+) (use dominated convergence to prove the continuity, and Fubini to prove recursively that ϕ^{(n)} is differentiable). Since ϕ''(λ) = E[(X_1−c)^2 e^{λ(X_1−c)}] > 0, we deduce that ϕ is strictly convex. We have ϕ(0) = 1 and ϕ'(0) = E[X_1 − c] < 0. As P(X_1 > c) > 0, there exists a > c such that p = P(X_1 ≥ a) > 0. We deduce that ϕ(λ) ≥ E[1_{X_1≥a} e^{λ(X_1−c)}] ≥ p e^{λ(a−c)}, so that lim_{λ→∞} ϕ(λ) = +∞. Thus, there exists a unique root λ_0 of ϕ(λ) = 1 on (0, +∞). Taking λ = λ_0 in Question 2, we deduce that V is a martingale. Then, using Question 3 for the inequality, we get that:

e^{λ_0 x} P(τ ≤ N) ≤ E[V_N 1_{τ≤N}] ≤ E[V_N] = E[V_0] = 1.

This gives P(τ ≤ N) ≤ e^{−λ_0 x}. Then let N go to infinity to get the result.
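As an illustration, take Gaussian steps (a hypothetical choice of ours, not part of the exercise): for X_i of law N(0, 1) and c = 1/2, one gets ϕ(λ) = e^{λ^2/2 − λ/2}, whose unique positive root of ϕ(λ) = 1 is λ_0 = 2c = 1. With the reading of τ used above (first time the centered sum ∑(X_k − c) reaches the level x), the bound P(τ ≤ N) ≤ e^{−λ_0 x} can be compared with a Monte Carlo estimate:

```python
import numpy as np

rng = np.random.default_rng(0)
c, x, N, reps = 0.5, 2.0, 1000, 5000
lam0 = 2 * c                                # root of phi(lambda) = 1 for N(0,1) steps

steps = rng.normal(size=(reps, N)) - c      # the increments X_k - c
walk = np.cumsum(steps, axis=1)
p_hat = (walk.max(axis=1) >= x).mean()      # P(tau <= N): the walk reaches level x by time N
print(p_hat, np.exp(-lam0 * x))             # estimate vs the bound e^{-lambda0 x}
```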
Exercise 9.34
Exercise 9.37
Exercise 9.38
Chapter 11
Vocabulary
english français
N N (but N∗ in some books)
(0, 1) ]0, 1[
positive strictement positif
countable dénombrable
pairwise disjoint sets ensembles disjoints 2 à 2
a σ-field une tribu ou σ-algèbre
a λ-system une classe monotone
nested emboîté(es)
non-negative positif ou nul
convergence in distribution convergence en loi
pointwise convergence convergence simple
irreducible irréductible
super-martingale sur-martingale
sub-martingale sous-martingale
predictable prévisible
optional stopping theorem théorème d’arrêt
optimal stopping arrêt optimal
Index
martingale
closed, 75
measurable space, 1
measure, 3
σ-finite, 3
product, 18
Metropolis, 57