
Stochastic Processes and Applications

Jean-François Delmas

August 22, 2022


Contents

1 A starter on measure theory and random variables 1


1.1 Measures and measurable functions . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Measurable space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.2 Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.3 Characterization of probability measures . . . . . . . . . . . . . . . . . 5
1.1.4 Measurable functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.1.5 Probability distribution and random variables . . . . . . . . . . . . . . 8
1.2 Integration and expectation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.2.1 Integration: construction and properties . . . . . . . . . . . . . . . . . 10
1.2.2 Integration: convergence theorems . . . . . . . . . . . . . . . . . . . . 13
1.2.3 The Lp space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.2.4 Fubini theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.2.5 Expectation, variance, covariance and inequalities . . . . . . . . . . . 18
1.2.6 Independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

2 Conditional expectation 25
2.1 Projection in the L2 space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.2 Conditional expectation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.3 Conditional expectation with respect to a random variable . . . . . . . . . 31
2.3.1 The discrete case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.3.2 The density case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.3.3 Elements on the conditional distribution . . . . . . . . . . . . . . . . . 33

3 Discrete Markov chains 35


3.1 Definition and properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.2 Invariant probability measures, reversibility . . . . . . . . . . . . . . . . . . . 41
3.3 Irreducibility, recurrence, transience, periodicity . . . . . . . . . . . . . . . . . 43
3.3.1 Communicating classes . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.3.2 Recurrence and transience . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.3.3 Periodicity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.4 Asymptotic theorems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.4.1 Main results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.4.2 Complement on the asymptotic results . . . . . . . . . . . . . . . . . . 50
3.4.3 Proof of the asymptotic theorems . . . . . . . . . . . . . . . . . . . . . 52
3.5 Examples and applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56


4 Martingales 65
4.1 Stopping times . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.2 Martingales and the optional stopping theorem . . . . . . . . . . . . . . . . . 69
4.3 Maximal inequalities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.4 Convergence of martingales . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.5 More on convergence of martingales . . . . . . . . . . . . . . . . . . . . . . . 74

5 Optimal stopping 79
5.1 Finite horizon case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.1.1 The adapted case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.1.2 The general case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.1.3 Marriage of a princess . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.2 Infinite horizon case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.2.1 Essential supremum . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.2.2 The adapted case: regular stopping times . . . . . . . . . . . . . . . . 86
5.2.3 The adapted case: optimal equations . . . . . . . . . . . . . . . . . . . 88
5.2.4 The general case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.3 From finite horizon to infinite horizon . . . . . . . . . . . . . . . . . . . . . . 91
5.3.1 From finite horizon to infinite horizon . . . . . . . . . . . . . . . . . . 91
5.3.2 Castle to sell . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
5.3.3 The Markovian case . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
5.4 Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98

6 Brownian motion 103


6.1 Gaussian process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
6.1.1 Gaussian vector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
6.1.2 Gaussian process and Brownian motion . . . . . . . . . . . . . . . . . 105
6.2 Properties of Brownian motion . . . . . . . . . . . . . . . . . . . . . . . . . . 107
6.2.1 Continuity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
6.2.2 Limit of simple random walks . . . . . . . . . . . . . . . . . . . . . . . 107
6.2.3 Markov property . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
6.2.4 Brownian bridge and simulation . . . . . . . . . . . . . . . . . . . . . 110
6.2.5 Martingale and stopping times . . . . . . . . . . . . . . . . . . . . . . 110
6.3 Wiener integrals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
6.3.1 Gaussian space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
6.3.2 An application: the Langevin equation . . . . . . . . . . . . . . . . . . 113
6.3.3 Cameron-Martin Theorem . . . . . . . . . . . . . . . . . . . . . . . . . 115

7 Stochastic approximation algorithms 121


7.1 The two-armed bandit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
7.2 Asymptotic Pseudo-Trajectories . . . . . . . . . . . . . . . . . . . . . . . . . . 123
7.2.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
7.2.2 The limit set for asymptotic pseudo-trajectory . . . . . . . . . . . . . 123
7.3 Stochastic approximation algorithms . . . . . . . . . . . . . . . . . . . . . . . 124
7.3.1 Main results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
7.3.2 Proof of Theorem 7.5 . . . . . . . . . . . . . . . . . . . . . . . . . . . 126

7.4 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126


7.4.1 Dosage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
7.4.2 Estimating a quantile . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
7.4.3 Two-armed bandit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127

8 Appendix 131
8.1 More on measure theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
8.1.1 Construction of probability measures . . . . . . . . . . . . . . . . . . . 131
8.1.2 Proof of the Carathéodory extension Theorem 8.3 . . . . . . . . . . . 134
8.2 More on convergence for sequence of random variables . . . . . . . . . . . . . 137
8.2.1 Convergence in distribution . . . . . . . . . . . . . . . . . . . . . . . . 137
8.2.2 Law of large numbers and central limit theorem . . . . . . . . . . . . . 139
8.2.3 Uniform integrability . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
8.2.4 Convergence in probability and in L1 . . . . . . . . . . . . . . . . . . . 141

9 Exercises 145
9.1 Measure theory and random variables . . . . . . . . . . . . . . . . . . . . . . 145
9.2 Conditional expectation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
9.3 Discrete Markov chains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
9.4 Martingales . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
9.5 Optimal stopping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
9.6 Brownian motion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153

10 Solutions 155
10.1 Measure theory and random variables . . . . . . . . . . . . . . . . . . . . . . 155
10.2 Conditional expectation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
10.3 Discrete Markov chains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
10.4 Martingales . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
10.5 Optimal stopping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
10.6 Brownian motion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167

11 Vocabulary 169

Index 171
Chapter 1

A starter on measure theory and random variables

In this chapter, we present in Section 1.1 a basic tool kit in measure theory, with the applications to probability theory in mind. In Section 1.2, we develop the corresponding integration and expectation. The presentation of this chapter follows closely [1], see also [2].
We use the following conventions: N = {0, 1, . . .} is the set of non-negative integers, N∗ = N ∩ (0, +∞), and for m < n ∈ N, we set Jm, nK = [m, n] ∩ N. We shall consider R̄ = R ∪ {±∞} = [−∞, +∞], and for a, b ∈ R̄, we write a ∨ b = max(a, b), a+ = a ∨ 0 for the positive part of a, and a− = (−a)+ for its negative part.
For sets A ⊂ E, the indicator function 1A, defined on E and taking values in R, is equal to 1 on A and to 0 on E \ A.

1.1 Measures and measurable functions


1.1.1 Measurable space
Let Ω be a set also called a space. A measure on a set Ω is a function which gives the “size”
of subsets of Ω. We shall see that, if one asks the measure to satisfy some natural additive
properties, it is not always possible to define the measure of every subset of Ω. For this reason,
we shall consider families of sub-sets of Ω called σ-fields. We denote by P(Ω) = {A; A ⊂ Ω}
the set of all subsets of Ω.
Definition 1.1. A collection of subsets of Ω, F ⊂ P(Ω), is called a σ-field on Ω if:

(i) Ω ∈ F;

(ii) A ∈ F implies Ac ∈ F;
(iii) if (Ai, i ∈ I) is a finite or countable collection of elements of F, then ⋃i∈I Ai ∈ F.

We call (Ω, F) a measurable space and a set A ∈ F is said to be F-measurable.


When there is no ambiguity on the σ-field we shall simply say that A is measurable instead
of F-measurable. In probability theory a measurable set is also called an event. Properties

(i) and (ii) imply that ∅ is measurable. Notice that P(Ω) and {∅, Ω} are σ-fields. The latter
is called the trivial σ-field. When Ω is at most countable, unless otherwise specified, we shall
consider the σ-field P(Ω).

Proposition 1.2. Let C ⊂ P(Ω). There exists a smallest σ-field on Ω which contains C.
The smallest σ-field which contains C is denoted by σ(C) and is also called the σ-field
generated by C.
Proof. Let (Fj, j ∈ J) be the collection of all the σ-fields on Ω containing C. This collection is not empty as it contains P(Ω). It is left to the reader to check that ⋂j∈J Fj is a σ-field. Clearly, this is the smallest σ-field containing C.
Remark 1.3. In this remark we give an explicit description of the σ-field generated by a finite number of sets. Let C = {A1, . . . , An}, with n ∈ N∗, be a finite collection of subsets of Ω. It is elementary to check that F = { ⋃I∈𝓘 CI ; 𝓘 ⊂ P(J1, nK) }, with CI = ⋂i∈I Ai ∩ ⋂j∉I Ajc for I ⊂ J1, nK, is a σ-field. Notice that CI ∩ CJ = ∅ for I ≠ J. Thus, the subsets CI are atoms of F in the sense that if B ∈ F, then CI ∩ B is equal to CI or to ∅.
We shall prove that σ(C) = F. Since by construction CI ∈ σ(C) for all I ⊂ J1, nK, we deduce that F ⊂ σ(C). On the other hand, for all i ∈ J1, nK, we have Ai = ⋃I⊂J1,nK, i∈I CI. This gives that C ⊂ F, and thus σ(C) ⊂ F. In conclusion, we get σ(C) = F. ♦
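The atoms CI are easy to enumerate on a small finite space. The following Python sketch (an added illustration, not part of the original text, with arbitrarily chosen sets A1, A2, A3) builds the non-empty atoms and checks that each Ai is a union of atoms:

from itertools import combinations

Omega = set(range(6))
A = [{0, 1, 2}, {2, 3}, {3, 4}]   # a finite collection A_1, ..., A_n of subsets of Omega
n = len(A)

def atom(I):
    # C_I = (intersection of A_i for i in I) inter (intersection of A_j^c for j not in I)
    C = set(Omega)
    for i in range(n):
        C &= A[i] if i in I else (Omega - A[i])
    return C

atoms = {}
for r in range(n + 1):
    for I in combinations(range(n), r):
        C = atom(set(I))
        if C:                     # keep only the non-empty atoms; they partition Omega
            atoms[I] = C

# Every A_i is the union of the atoms C_I with i in I, as claimed in Remark 1.3.
for i in range(n):
    assert A[i] == set().union(*(C for I, C in atoms.items() if i in I))
print(atoms)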
If F and H are σ-fields on Ω, we denote by F ∨ H = σ(F ∪ H) the σ-field generated by F and H. More generally, if (Fi, i ∈ I) is a collection of σ-fields on Ω, we denote by ⋁i∈I Fi = σ(⋃i∈I Fi) the σ-field generated by (Fi, i ∈ I).

We shall consider products of measurable spaces. If (Ai, i ∈ I) is a collection of sets, then its product is denoted by ∏i∈I Ai = {(ωi, i ∈ I); ωi ∈ Ai for all i ∈ I}.
Definition 1.4. Let ((Ωi, Fi), i ∈ I) be a collection of measurable spaces. The product σ-field ⊗i∈I Fi on the product space ∏i∈I Ωi is the σ-field generated by all the sets ∏i∈I Ai such that Ai ∈ Fi for all i ∈ I and Ai = Ωi for all i ∈ I but for a finite number of indices.
When all the measurable spaces (Ωi, Fi) are the same for all i ∈ I, say (Ω, F), then we also write ΩI = ∏i∈I Ωi for the product space and F⊗I = ⊗i∈I Fi for the product σ-field.

We recall that a topological space (E, O) is a space E together with a collection O of subsets of E such that: ∅ and E belong to O, any (finite or infinite) union of elements of O belongs to O, and the intersection of any finite number of elements of O belongs to O. The elements of O are called the open sets, and O is called a topology on E. There is a very natural σ-field on a topological space.
Definition 1.5. If (E, O) is a topological space, then the Borel σ-field, B(E) = σ(O), is the
σ-field generated by all the open sets. An element of B(E) is called a Borel set.
Usually the Borel σ-field on E is different from P(E).
Remark 1.6. Since all the open subsets of R can be written as the union of a countable
number of bounded open intervals, we deduce that the Borel σ-field is generated by all the
intervals (a, b) for a < b. It is not trivial to exhibit a set which is not a Borel set; an example
was provided by Vitali1.
1 J. Stern. “Le problème de la mesure.” Séminaire Bourbaki 26 (1983-1984): 325-346. http://eudml.org/doc/110033.

Similarly to the one-dimensional case, as all the open sets of Rd can be written as a countable union of open rectangles, the Borel σ-field on Rd, d ≥ 1, is generated by all the rectangles ∏i=1,...,d (ai, bi) with ai < bi for 1 ≤ i ≤ d. In particular, we get that the Borel σ-field of Rd is the product2 of the d Borel σ-fields on R. ♦

1.1.2 Measures
We give in this section the definition and some properties of measures and probability measures.
Definition 1.7. Let (Ω, F) be a measurable space.

(i) A [0, +∞]-valued function µ defined on F is σ-additive if for all finite or countable
collection (Ai , i ∈ I) of measurable pairwise disjoint sets, that is Ai ∈ F for all i ∈ I
and Ai ∩ Aj = ∅ for all i 6= j, we have:
µ(⋃i∈I Ai) = Σi∈I µ(Ai). (1.1)

(ii) A measure µ on (Ω, F) is a σ-additive [0, +∞]-valued function defined on F such that
µ(∅) = 0. We call (Ω, F, µ) a measured space. A measurable set A is µ-null if µ(A) = 0.

(iii) A measure µ on (Ω, F) is σ-finite if there exists a sequence of measurable sets (Ωn, n ∈ N) such that ⋃n∈N Ωn = Ω and µ(Ωn) < +∞ for all n ∈ N.

(iv) A probability measure P on (Ω, F) is a measure on (Ω, F) such that P(Ω) = 1. The
measured space (Ω, F, P) is also called a probability space.

We refer to Section 8.1 for the construction of measures such as the Lebesgue measure,
see Proposition 8.4 and Remark 8.6, and the product probability measure, see Proposition
8.7.
Example 1.8. We give some examples of measures (check these are indeed measures). Let Ω
be a space.
(i) The counting measure Card is defined by A ↦ Card(A) for A ⊂ Ω, with Card(A) the cardinality of A. It is σ-finite if and only if Ω is at most countable.

(ii) Let ω ∈ Ω. The Dirac measure at ω, δω, is defined by A ↦ δω(A) = 1A(ω) for A ⊂ Ω. It is a probability measure.

(iii) The Bernoulli probability distribution with parameter p ∈ [0, 1], PB(p) , is a probability
measure on (R, B(R)) given by PB(p) = (1 − p)δ0 + pδ1 .
2 Let (E1, O1) and (E2, O2) be two topological spaces. Let C = {O1 × O2; O1 ∈ O1, O2 ∈ O2} be the set of products of open sets. By definition, B(E1) ⊗ B(E2) is the σ-field generated by C. The product topology O1 ⊗ O2 on E1 × E2 is defined as the smallest topology on E1 × E2 containing C. The Borel σ-field on E1 × E2, B(E1 × E2), is the σ-field generated by O1 ⊗ O2. Since C ⊂ O1 ⊗ O2, we deduce that B(E1) ⊗ B(E2) ⊂ B(E1 × E2). Since O1 ⊗ O2 is stable by infinite (even uncountable) unions, it might happen that the previous inclusion is not an equality; see Theorem 4.44 p. 149 from C. Aliprantis and K. Border. Infinite Dimensional Analysis. Springer, 2006.

(iv) The Lebesgue measure λ on (R, B(R)) is a measure characterized by λ([a, b]) = b − a for
all a < b. In particular, any finite set or (by σ-additivity) any countable set is λ-null3 .
The Lebesgue measure is σ-finite.
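These examples can be checked mechanically on a finite space. The sketch below (illustrative only; the sets and parameter are arbitrary) encodes the counting measure, the Dirac measure and the Bernoulli probability distribution as set functions and verifies additivity and total mass:

# Measures from Example 1.8 as set functions on a finite space (illustration only).
Omega = set(range(10))

def counting(A):
    return len(A)                            # Card(A)

def dirac(w):
    return lambda A: 1 if w in A else 0      # delta_w(A) = 1_A(w)

A, B = {0, 1, 2}, {5, 6}                     # two disjoint sets
for mu in (counting, dirac(1)):
    assert mu(A | B) == mu(A) + mu(B)        # additivity (1.1) on disjoint sets

p = 0.25
bernoulli = lambda A: (1 - p) * dirac(0)(A) + p * dirac(1)(A)   # P_B(p) = (1-p) delta_0 + p delta_1
assert bernoulli({0, 1}) == 1.0              # a probability measure: total mass 1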

Let us mention that assuming only the additivity property (that is, I is assumed to be finite in (1.1)) instead of the stronger σ-additivity property for the definition of measures4 leads to a substantially different and less efficient approach. We give elementary properties of measures.
Proposition 1.9. Let µ be a measure on (Ω, F). We have the following properties.

(i) Additivity: µ(A ∪ B) + µ(A ∩ B) = µ(A) + µ(B), for all A, B ∈ F.

(ii) Monotonicity: A ⊂ B implies µ(A) ≤ µ(B), for all A, B ∈ F.

(iii) Monotone convergence: If (An, n ∈ N) is a sequence of elements of F such that An ⊂ An+1 for all n ∈ N, then we have:

µ(⋃n∈N An) = limn→∞ µ(An).

(iv) If (Ai, i ∈ I) is a finite or countable collection of elements of F, then we have the inequality µ(⋃i∈I Ai) ≤ Σi∈I µ(Ai). In particular, a finite or countable union of µ-null sets is µ-null.

Proof. We prove (i). The sets A ∩ B c , A ∩ B and Ac ∩ B are measurable and pairwise disjoint.
Using the additivity property three times, we get:

µ(A ∪ B) + µ(A ∩ B) = µ(A ∩ B c ) + 2µ(A ∩ B) + µ(Ac ∩ B) = µ(A) + µ(B).

We prove (ii). As Ac ∩ B ∈ F, we get by additivity that µ(B) = µ(A) + µ(Ac ∩ B). Then
use that µ(Ac ∩ B) ≥ 0, to conclude.
We prove (iii). We set B0 = A0 and Bn = An ∩ (An−1)c for all n ∈ N∗, so that ⋃n≤m Bn = Am for all m ∈ N∗ and ⋃n∈N Bn = ⋃n∈N An. The sets (Bn, n ≥ 0) are measurable and pairwise disjoint. By σ-additivity, we get µ(Am) = µ(⋃n≤m Bn) = Σn≤m µ(Bn) and µ(⋃n∈N An) = µ(⋃n∈N Bn) = Σn∈N µ(Bn). Use the convergence of the partial sums Σn≤m µ(Bn), whose terms are non-negative, towards Σn∈N µ(Bn) as m goes to infinity to conclude.
Property (iv) is a direct consequence of properties (i) and (iii).

We give a property for probability measures, which is deduced from (i) of Proposition 1.9.

Corollary 1.10. Let (Ω, F, P) be a probability space and A ∈ F. We have P(Ac ) = 1 − P(A).
3 A set A ⊂ R is negligible if there exists a λ-null set B such that A ⊂ B (notice that A might not be a Borel set). Let Nλ be the set of negligible sets. The Lebesgue σ-field, Bλ(R), on R is the σ-field generated by the Borel σ-field and Nλ. By construction, we have B(R) ⊂ Bλ(R) ⊂ P(R). It can be proved that those two inclusions are strict.
4 H. Föllmer and A. Schied. Stochastic finance. An introduction in discrete time. De Gruyter, 2011.

We end this section with the definition of independent events.


Definition 1.11. Let (Ω, F, P) be a probability space. The events (Ai , i ∈ I) are independent
if for all finite subsets J ⊂ I, we have:

P(⋂j∈J Aj) = ∏j∈J P(Aj).

The σ-fields (Fi , i ∈ I) are independent if for all Ai ∈ Fi ⊂ F, i ∈ I, the events (Ai , i ∈ I)
are independent.

1.1.3 Characterization of probability measures


In this section, we prove that if two probability measures coincide on a sufficiently large
family of events, then they are equal, see the main results of Corollaries 1.14 and 1.15. After
introducing a λ-system (or monotone class), we prove the monotone class theorem.
Definition 1.12. A collection A of sub-sets of Ω is a λ-system (or monotone class) if:
(i) Ω ∈ A;

(ii) A, B ∈ A and A ⊂ B imply B ∩ Ac ∈ A;


(iii) if (An, n ∈ N) is an increasing sequence of elements of A, then we have ⋃n∈N An ∈ A.
Theorem 1.13 (Monotone class Theorem). Let C be a collection of sub-sets of Ω stable by finite intersection (also called a π-system). Every λ-system (or monotone class) containing C also contains σ(C).
Proof. Notice that P(Ω) is a λ-system containing C. Let A be the intersection of all λ-systems
containing C. It is easy to check that A is the smallest λ-system containing C. It is clear
that A satisfies properties (i) and (ii) from Definition 1.1. To check that property (iii) from
Definition 1.1 holds also, so that A is a σ-field, it is enough, according to property (iii) from
Definition 1.12, to check that A is stable by finite union or equivalently by finite intersection,
thanks to property (ii) of Definition 1.12.
For B ⊂ Ω, set AB = {A ⊂ Ω; A ∩ B ∈ A}. Assume that B ∈ C. It is easy to check
that AB is a λ-system, as C is stable by finite intersection, and that it contains C and thus
A. Therefore, for all B ∈ C, A ∈ A, we get A ∈ AB and thus A ∩ B ∈ A.
Assume now that B ∈ A. It is easy to check that AB is a λ-system. According to the
previous part, it contains C and thus A. In particular, for all B ∈ A, A ∈ A, we get A ∈ AB
and thus A ∩ B ∈ A. We deduce that A is stable by finite intersection and is therefore a
σ-field. To conclude, notice that A contains C and thus σ(C) also.

Corollary 1.14. Let P and P′ be two probability measures defined on a measurable space (Ω, F). Let C ⊂ F be a collection of events stable by finite intersection. If P(A) = P′(A) for all A ∈ C, then we have P(B) = P′(B) for all B ∈ σ(C).

Proof. Notice that {A ∈ F; P(A) = P′(A)} is a λ-system. It contains C. By the monotone class theorem, it contains σ(C).
6 CHAPTER 1. A STARTER ON MEASURE THEORY AND RANDOM VARIABLES

The next corollary is an immediate consequence of Definition 1.5 and Corollary 1.14.
Corollary 1.15. Let (E, O) be a topological space. Two probability measures on (E, B(E))
which coincide on the open sets O are equal.

1.1.4 Measurable functions


Let (S, S) and (E, E) be two measurable spaces. Let f be a function defined on S and taking
values in E. For A ⊂ E, we set {f ∈ A} = f −1 (A) = {x ∈ S; f (x) ∈ A}. It is easy to check
that for A ⊂ E and (Ai , i ∈ I) a collection of subsets of E, we have:
f −1(Ac) = f −1(A)c,  f −1(⋃i∈I Ai) = ⋃i∈I f −1(Ai)  and  f −1(⋂i∈I Ai) = ⋂i∈I f −1(Ai). (1.2)
We deduce from the properties (1.2) the following elementary lemma.

Lemma 1.16. Let f be a function from S to E and E a σ-field on E. The collection of sets {f −1(A); A ∈ E} is a σ-field on S.

The σ-field {f −1(A); A ∈ E}, denoted by σ(f), is also called the σ-field generated by f.


Definition 1.17. A function f defined on a space S and taking values in a space E is


measurable from (S, S) to (E, E) if σ(f ) ⊂ S.

When there is no ambiguity on the σ-fields S and E, we simply say that f is measurable.
Example 1.18. Let A ⊂ S. The function 1A is measurable from (S, S) to (R, B(R)) if and
only if A is measurable as σ(1A ) = {∅, S, A, Ac }. 4
The next proposition is useful to prove that a function is measurable.

Proposition 1.19. Let C be a collection of subsets of E which generates the σ-field E on E.


A function f from S to E is measurable from (S, S) to (E, E) if and only if for all A ∈ C,
f −1 (A) ∈ S.

Proof. We denote by G the σ-field generated by {f −1(A); A ∈ C}. We have G ⊂ σ(f). It is easy to check that the collection {A ∈ E; f −1(A) ∈ G} is a σ-field on E. It contains C and thus E. This implies that σ(f) ⊂ G and thus σ(f) = G. We conclude using Definition 1.17.

We deduce the following result, which is important in practice.


Corollary 1.20. A continuous function defined on a topological space and taking values in
a topological space is measurable with respect to the Borel σ-fields.

The next proposition concerns functions taking values in product spaces.

Proposition 1.21. Let (S, S) and ((Ei, Ei), i ∈ I) be measurable spaces. For all i ∈ I, let fi be a function defined on S taking values in Ei, and set f = (fi, i ∈ I). The function f is measurable from (S, S) to (∏i∈I Ei, ⊗i∈I Ei) if and only if for all i ∈ I, the function fi is measurable from (S, S) to (Ei, Ei).

Proof. By definition, the σ-field ⊗i∈I Ei is generated by the sets ∏i∈I Ai with Ai ∈ Ei and, for all i ∈ I but one, Ai = Ei. Let ∏i∈I Ai be such a set. Assume it is not equal to ∏i∈I Ei and let i0 denote the only index such that Ai0 ≠ Ei0. We have f −1(∏i∈I Ai) = fi0−1(Ai0). Thus if f is measurable, so is fi0. The converse is a consequence of Proposition 1.19.

The proof of the next proposition is immediate.


Proposition 1.22. Let (Ω, F), (S, S), (E, E) be three measurable spaces, f a measurable
function defined on Ω taking values in S and g a measurable function defined on S taking
values in E. The composed function g ◦ f defined on Ω and taking values in E is measurable.
We shall consider functions taking values in R̄. The Borel σ-field on R̄, B(R̄), is by definition the σ-field generated by B(R), {+∞} and {−∞}, or equivalently by the family ([−∞, a), a ∈ R). We say a function (resp. a sequence) is real-valued if it takes values in R̄ (resp. its elements belong to R̄). With the convention 0 · ∞ = 0, the product of two real-valued functions is always defined. The sum of two functions f and g taking values in R̄ is well defined if (f, g) does not take the values (+∞, −∞) or (−∞, +∞).
Corollary 1.23. Let f and g be real-valued measurable functions defined on the same measurable space. The functions fg and f ∨ g = max(f, g) are measurable. If (f, g) does not take the values (+∞, −∞) and (−∞, +∞), then the function f + g is measurable.
Proof. The R̄-valued functions defined on R̄2 by (x, y) ↦ xy, (x, y) ↦ x ∨ y and (x, y) ↦ (x + y)1{(x,y)∈R̄2\{(−∞,+∞),(+∞,−∞)}} are continuous on R2 and thus measurable on R2 according to Corollary 1.20. Thus, they are also measurable on R̄2. The corollary is then a consequence of Proposition 1.22.

If (an, n ∈ N) is a real-valued sequence, then its lower and upper limits are defined by:

lim infn→∞ an = limn→∞ inf{ak, k ≥ n}  and  lim supn→∞ an = limn→∞ sup{ak, k ≥ n},

and they belong to R̄. Notice that:

lim supn→∞ an = − lim infn→∞ (−an).

The sequence (an, n ∈ N) converges (in R) if lim infn→∞ an = lim supn→∞ an and this common value, denoted by limn→∞ an, belongs to R.
The next proposition asserts in particular that the limit of measurable functions is mea-
surable.
Proposition 1.24. Let (fn , n ∈ N) be a sequence of real-valued measurable functions defined
on a measurable space (S, S). The functions lim supn→∞ fn and lim inf n→∞ fn are measur-
able. The set of convergence of the sequence, {x ∈ S; lim supn→∞ fn (x) = lim inf n→∞ fn (x)},
is measurable. In particular, if the sequence (fn , n ∈ N) converges, then its limit, denoted by
limn→∞ fn , is also measurable.

Proof. For a ∈ R, we have:

{lim supn→∞ fn < a} = ⋃k∈N∗ ⋃m∈N ⋂n≥m {fn ≤ a − 1/k}.

Since the functions fn are measurable, we deduce that {lim supn→∞ fn < a} is also measurable. Since the σ-field B(R̄) is generated by [−∞, a) for a ∈ R, we deduce from Proposition 1.19 that lim supn→∞ fn is measurable. Since lim infn→∞ fn = − lim supn→∞ (−fn), we deduce that lim infn→∞ fn is measurable.
Let h = lim supn→∞ fn − lim infn→∞ fn, with the convention +∞ − ∞ = 0. The function h is measurable thanks to Corollary 1.23. Since the set of convergence is equal to h−1({0}) and {0} is a Borel set, we deduce that the set of convergence is measurable.

We end this section with a very useful result which completes Proposition 1.22.

Proposition 1.25. Let (Ω, F), (S, S) be measurable spaces, f a measurable function defined on Ω taking values in S, and ϕ a measurable function from (Ω, σ(f)) to (R̄, B(R̄)). Then, there exists a real-valued measurable function g defined on S such that ϕ = g ◦ f.

Proof. For simplicity, we assume that ϕ takes its values in R instead of R̄. For all k ∈ Z, n ∈ N, the sets An,k = ϕ−1([k2−n, (k + 1)2−n)) are σ(f)-measurable. Thus, for all n ∈ N, there exists a collection (Bn,k, k ∈ Z) of pairwise disjoint sets of S such that ⋃k∈Z Bn,k = S, Bn,k ∈ S and f −1(Bn,k) = An,k for all k ∈ Z. For all n ∈ N, the real-valued function gn = 2−n Σk∈Z k 1Bn,k defined on S is measurable, and we have gn ◦ f ≤ ϕ ≤ gn ◦ f + 2−n0 for n ≥ n0 ≥ 0. The function g = lim supn→∞ gn is measurable according to Proposition 1.24, and we have g ◦ f ≤ ϕ ≤ g ◦ f + 2−n0 for all n0 ∈ N. This implies that g ◦ f = ϕ.

1.1.5 Probability distribution and random variables


We start with the definition of the image measure (or push-forward measure), which
is obtained by transferring a measure using a measurable function. The proof of the next
Lemma is elementary and left to the reader.
Lemma 1.26. Let (E, E, µ) be a measured space, (S, S) a measurable space, and f a mea-
surable function defined on E and taking values in S. We define the function µf on S by
µf (A) = µ(f −1 (A)) for all A ∈ S. Then µf is a measure on (S, S).

The measure µf is called the push-forward measure or image measure of µ by f .


In what follows, we consider a probability space (Ω, F, P).
Definition 1.27. Let (E, E) be a measurable space. An E-valued random variable X defined
on Ω is a measurable function from (Ω, F) to (E, E). Its probability distribution or law is the
image probability measure PX .

At some point we shall specify the σ-field F on Ω, and say that X is F-measurable.
We say two E-valued random variables X and Y are equal in distribution, and we write
(d)
X = Y , if PX = PY . For A ∈ E, we recall we write {X ∈ A} = {ω; X(ω) ∈ A} = X −1 (A).
Two random variables X and Y defined on the same probability space are equal a.s., and we
a.s.
write X = Y , if P(X = Y ) = 1. Notice that if X and Y are equal a.s., then they have the
same probability distribution.
Remark 1.28. Let X be a real-valued random variable. Its cumulative distribution function
FX is defined by FX (x) = P(X ≤ x) for all x ∈ R. It is easy to deduce from Exercise 9.1
that if X and Y are real-valued random variables, then X and Y are equal in distribution if
and only if FX = FY . ♦
The next lemma gives a characterization of the distribution of a family of random vari-
ables.

Lemma 1.29. Let ((Ei, Ei), i ∈ I) be a collection of measurable spaces and X = (Xi, i ∈ I) a random variable taking values in the product space ∏i∈I Ei endowed with the product σ-field. The distribution of X is characterized by the family of distributions of (Xj, j ∈ J), where J runs over all finite subsets of I.

According to Proposition 1.21, in the above lemma Xj is an Ej-valued random variable, and its marginal probability distribution can be recovered from the distribution of X as:

P(Xj ∈ Aj) = P(X ∈ ∏i∈I Ai) with Ai = Ei for i ≠ j.
Proof. According to Definition 1.4, the product σ-field E on the product space E = ∏i∈I Ei is generated by the family C of product sets ∏i∈I Ai such that Ai ∈ Ei for all i ∈ I and Ai = Ei for all i ∉ J, with J ⊂ I finite. Notice then that PX(∏i∈I Ai) = P(Xj ∈ Aj for j ∈ J). Since C is stable by finite intersection, we deduce from the monotone class theorem, and more precisely Corollary 1.14, that the probability measure PX on E is uniquely characterized by P(Xj ∈ Aj for j ∈ J) for all finite subsets J of I and all Aj ∈ Ej, j ∈ J.

We first give the definition of a random variable independent from a σ-field.


Definition 1.30. A random variable X taking values in a measurable space (E, E) is inde-
pendent from a σ-field H ⊂ F if σ(X) and H are independent or equivalently if, for all A ∈ E
and B ∈ H, the events {X ∈ A} and B are independent.

We now give the definition of independent random variables.

Definition 1.31. The random variables (Xi , i ∈ I) are independent if the σ-fields (σ(Xi ), i ∈
I) are independent. Equivalently, the random variables (Xi , i ∈ I) are independent if for all
finite subset J ⊂ I, all Aj ∈ Ej with j ∈ J, we have:
P(Xj ∈ Aj for all j ∈ J) = ∏j∈J P(Xj ∈ Aj).

We deduce from this definition that if the marginal probability distributions Pi of all the random variables Xi, i ∈ I, are known and if (Xi, i ∈ I) are independent, then the distribution of X is the product probability ⊗i∈I Pi introduced in Proposition 8.7.
We end this section with the Bernoulli scheme.

Theorem 1.32. Let (E, E, P) be a probability space. Let I be a set of indices. Then, there exists a probability space and a collection (Xi, i ∈ I) of E-valued random variables defined on this probability space which are independent and with the same probability distribution P.

When P is the Bernoulli probability distribution and I = N∗, then (Xn, n ∈ N∗) is called a Bernoulli scheme.

Proof. For i ∈ I, set Ωi = E, Fi = E and Pi = P. According to Proposition 8.7, we can consider the product space Ω = ∏i∈I Ωi with the product σ-field and the product probability ⊗i∈I Pi. For all i ∈ I, we consider the random variable Xi(ω) = ωi, where ω = (ωi, i ∈ I). Using Definition 1.31, we deduce that the random variables (Xi, i ∈ I) are independent with the same probability distribution P.
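A Bernoulli scheme is straightforward to simulate. The following sketch (an added illustration; the helper bernoulli_scheme and its parameters are ours, not from the text) draws the first n variables of a Bernoulli scheme with parameter p; by the law of large numbers the empirical frequency of 1's is close to p:

import random

def bernoulli_scheme(p, n, seed=0):
    # First n terms (X_1, ..., X_n) of an i.i.d. Bernoulli(p) sequence.
    rng = random.Random(seed)
    return [1 if rng.random() < p else 0 for _ in range(n)]

X = bernoulli_scheme(p=0.3, n=10_000)
print(sum(X) / len(X))   # empirical frequency of 1's, close to p = 0.3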

1.2 Integration and expectation


Using the results from the integration theory of Sections 1.2.1 and 1.2.2, we introduce in
Section 1.2.5 the expectation of real-valued or Rd -valued random variables and give some
well known inequalities. We study the properties of the Lp spaces in Section 1.2.3 and prove
the Fubini theorem in Section 1.2.4. In Section 1.2.6 we collect some further results on
independence.

1.2.1 Integration: construction and properties


Let (S, S, µ) be a measured space. The set R̄ is endowed with its Borel σ-field. We use the convention 0 · ∞ = 0. A function f defined on S is simple if it is real-valued, measurable, and if there exist n ∈ N∗, α1, . . . , αn ∈ [0, +∞] and A1, . . . , An ∈ S such that we have the representation f = Σ1≤k≤n αk 1Ak. The integral of f with respect to µ, denoted by µ(f) or ∫ f dµ or ∫ f(x) µ(dx), is defined by:

µ(f) = Σ1≤k≤n αk µ(Ak) ∈ [0, +∞].
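When µ is carried by finitely many atoms, this integral is a finite sum and can be computed directly. A minimal sketch (hypothetical weights and sets, not from the notes):

# Integral of a simple function against a measure with finitely many atoms;
# mu is given by point masses w[x] = mu({x}).
w = {0: 0.5, 1: 1.0, 2: 2.0, 3: 0.25, 4: 0.25}

def mu(A):
    return sum(w[x] for x in A)

# f = sum_k alpha_k 1_{A_k}, encoded as a list of (alpha_k, A_k) pairs.
f = [(3.0, {0, 1}), (1.5, {1, 2, 3})]

mu_f = sum(alpha * mu(A) for alpha, A in f)   # mu(f) = sum_k alpha_k mu(A_k)
print(mu_f)                                   # 3.0 * 1.5 + 1.5 * 3.25 = 9.375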

Lemma 1.33. Let f be a simple function defined on S. The integral µ(f ) does not depend
on the choice of its representation.
Proof. Consider two representations for f: f = Σ1≤k≤n αk 1Ak = Σ1≤ℓ≤m βℓ 1Bℓ, with n, m ∈ N∗ and A1, . . . , An, B1, . . . , Bm ∈ S. We shall prove that Σ1≤k≤n αk µ(Ak) = Σ1≤ℓ≤m βℓ µ(Bℓ).
According to Remark 1.3, there exists a finite family of measurable sets (CI, I ∈ P(J1, n + mK)) such that CI ∩ CJ = ∅ if I ≠ J, and for all k ∈ J1, nK and ℓ ∈ J1, mK there exist 𝓘k ⊂ P(J1, n + mK) and 𝓙ℓ ⊂ P(J1, n + mK) such that Ak = ⋃I∈𝓘k CI and Bℓ = ⋃I∈𝓙ℓ CI. We deduce that:

f = ΣI∈P(J1,n+mK) (Σ1≤k≤n αk 1{I∈𝓘k}) 1CI = ΣI∈P(J1,n+mK) (Σ1≤ℓ≤m βℓ 1{I∈𝓙ℓ}) 1CI,

and thus Σ1≤k≤n αk 1{I∈𝓘k} = Σ1≤ℓ≤m βℓ 1{I∈𝓙ℓ} for all I such that CI ≠ ∅. We get:

Σ1≤k≤n αk µ(Ak) = ΣI (Σ1≤k≤n αk 1{I∈𝓘k}) µ(CI) = ΣI (Σ1≤ℓ≤m βℓ 1{I∈𝓙ℓ}) µ(CI) = Σ1≤ℓ≤m βℓ µ(Bℓ),

where we used the additivity of µ for the first and third equalities. This ends the proof.

It is elementary to check that if f and g are simple functions, then we get:


µ(af + bg) = aµ(f ) + bµ(g) for a, b ∈ [0, +∞) (linearity), (1.3)
f ≤ g ⇒ µ(f ) ≤ µ(g) (monotonicity). (1.4)

Definition 1.34. Let f be a [0, +∞]-valued measurable function defined on S. We define


the integral of f with respect to the measure µ by:

µ(f ) = sup{µ(g); g simple such that 0 ≤ g ≤ f }.

The next lemma gives a representation of µ(f) using the fact that a non-negative measurable function f is the non-decreasing limit of a sequence of simple functions. Such a sequence exists: one can define for n ∈ N∗ the simple function fn by fn(x) = min(n, 2−n⌊2nf(x)⌋) for x ∈ S, with the convention ⌊+∞⌋ = +∞. Then the functions (fn, n ∈ N∗) are measurable and their non-decreasing limit is f.
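This dyadic approximation is simple to implement; the sketch below (an added illustration, for a finite-valued f of our choosing) checks that fn(x) is non-decreasing in n and within 2−n of f(x) once n ≥ f(x):

import math

def dyadic(f, n):
    # f_n(x) = min(n, 2**-n * floor(2**n * f(x))) for a [0, +inf]-valued f
    def fn(x):
        y = f(x)
        if math.isinf(y):
            return n                  # convention floor(+inf) = +inf, then truncation at n
        return min(n, math.floor(2 ** n * y) / 2 ** n)
    return fn

f = lambda x: x * x
for x in (0.3, 1.7, 3.0):
    values = [dyadic(f, n)(x) for n in (1, 2, 4, 8, 16)]
    assert all(a <= b for a, b in zip(values, values[1:]))   # non-decreasing in n
    assert values[-1] <= f(x) <= values[-1] + 2 ** -16       # within 2**-n once f(x) <= n
    print(x, values)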
Lemma 1.35. Let f be a [0, +∞]-valued function defined on S and (fn , n ∈ N) a non-
decreasing sequence of simple functions such that limn→∞ fn = f . Then, we have that
limn→∞ µ(fn ) = µ(f ).
Proof. It is enough to prove that for every non-decreasing sequence of simple functions (fn, n ∈ N) and every simple function g such that limn→∞ fn ≥ g, we have limn→∞ µ(fn) ≥ µ(g). We deduce from the proof of Lemma 1.33 that there exists a representation g = Σ1≤k≤N αk 1Ak in which the measurable sets (Ak, 1 ≤ k ≤ N) are pairwise disjoint. Using this representation and the linearity, we see it is enough to consider the particular case g = α 1A, with α ∈ [0, +∞], A ∈ S and fn 1Ac = 0 for all n ∈ N.
By monotonicity, the sequence (µ(fn), n ∈ N) is non-decreasing and thus limn→∞ µ(fn) is well defined, taking values in [0, +∞].
The result is clear if α = 0. We assume that α > 0. Let α′ ∈ [0, α). For n ∈ N, we consider the measurable sets Bn = {fn ≥ α′}. The sequence (Bn, n ∈ N) is non-decreasing with A as limit because limn→∞ fn ≥ g. The monotone property for measures, see property (iii) from Proposition 1.9, implies that limn→∞ µ(Bn) = µ(A). As µ(fn) ≥ α′µ(Bn), we deduce that limn→∞ µ(fn) ≥ α′µ(A), and thus limn→∞ µ(fn) ≥ µ(g) as α′ ∈ [0, α) is arbitrary.

Corollary 1.36. The linearity and monotonicity properties, see (1.3) and (1.4), also hold
for [0, +∞]-valued measurable functions f and g defined on S.
Proof. Let (fn , n ∈ N) and (gn , n ∈ N) be two non-decreasing sequences of simple functions
converging respectively towards f and g. Let a, b ∈ [0, +∞). The non-decreasing sequence
(afn + bgn , n ∈ N) of simple functions converges towards af + bg. By linearity, we get:

µ(af + bg) = lim µ(afn + bgn ) = a lim µ(fn ) + b lim µ(gn ) = aµ(f ) + bµ(g).
n→∞ n→∞ n→∞

Assume f ≤ g. The non-decreasing sequence ((fn ∨ gn ), n ∈ N) of simple functions


converges towards g. By monotonicity, we get µ(f ) = limn→∞ µ(fn ) ≤ limn→∞ µ(fn ∨ gn ) =
µ(g).

Recall that for a function f , we write f + = f ∨ 0 = max(f, 0) and f − = (−f )+ .


Definition 1.37. Let f be a real-valued measurable function defined on S. The integral of f
with respect to the measure µ is well defined if min (µ(f + ), µ(f − )) < +∞ and it is given by:

µ(f ) = µ(f + ) − µ(f − ).

The function f is µ-integrable if max (µ(f + ), µ(f − )) < +∞ (i.e. µ(|f |) < +∞).

We also write µ(f) = ∫ f dµ = ∫ f(x) µ(dx). A property holds µ-almost everywhere (µ-a.e.) if it holds on a measurable set B such that µ(Bc) = 0. If µ is a probability measure, then one says µ-almost surely (µ-a.s.) for µ-a.e.. We shall omit µ and write a.e. or a.s. when there is no ambiguity on the measure.

Lemma 1.38. Let f ≥ 0 be a real-valued measurable function defined on S. We have:

µ(f ) = 0 ⇐⇒ f = 0 µ-a.e..

Proof. The equivalence is clear if f is simple.


When f is not simple, consider a non-decreasing sequence of simple (non-negative) func-
tions (fn , n ∈ N) which converges towards f . As {f 6= 0} is the non-decreasing limit of the
measurable sets {fn 6= 0}, n ∈ N, we deduce from the monotonicity property of Proposition
1.9, that f = 0 a.e. if and only if fn = 0 a.e. for all n ∈ N. We deduce from the first part
of the proof that f = 0 a.e. if and only if µ(fn ) = 0 for all n ∈ N. As (µ(fn ), n ∈ N) is
non-decreasing and converges towards µ(f ), we get that µ(fn ) = 0 for all n ∈ N if and only
if µ(f ) = 0. We deduce that f = 0 a.e. if and only if µ(f ) = 0.

The next corollary asserts that it is enough to know f a.e. to compute its integral.

Corollary 1.39. Let f and g be two real-valued measurable functions defined on S such that
µ(f ) and µ(g) are well defined. If a.e. f = g, then we have µ(f ) = µ(g).

Proof. Assume first that f ≥ 0 and g ≥ 0. By hypothesis the measurable set A = {f 6= g} is


µ-null. We deduce that a.e. f 1A = 0 and g1A = 0. This implies that µ(f 1A ) = µ(g1A ) = 0.
By linearity, see Corollary 1.36, we get:

µ(f ) = µ(f 1Ac ) + µ(f 1A ) = µ(g1Ac ) = µ(g1Ac ) + µ(g1A ) = µ(g).

To conclude notice that f = g a.e. implies that f + = g + a.e. and f − = g − a.e. and then
use the first part of the proof to conclude.

The relation f = g a.e. is an equivalence relation on the set of real-valued measurable functions defined on S. We shall identify a function f with its equivalence class of all measurable functions g such that µ-a.e. f = g. Notice that if f is µ-integrable, then µ-a.e. |f| < +∞. In particular, if f and g are µ-integrable, we shall write f + g for any element of the equivalence class of f 1{|f|<+∞} + g 1{|g|<+∞}. Using this remark, we conclude this section with the following immediate corollary.

Corollary 1.40. The linearity property, see (1.3) with a, b ∈ R, and the monotonicity prop-
erty (1.4), where f ≤ g can be replaced by f ≤ g a.e., hold for real-valued measurable
µ-integrable functions f and g defined on S.

We deduce that the set of R-valued µ-integrable functions defined on S is a vector space.
The linearity property (1.3) holds also on the set of real-valued measurable functions h
such that µ(h+ ) < +∞ and on the set of real-valued measurable functions h such that
µ(h− ) < +∞. The monotonicity property holds also for real-valued measurable functions f
and g such that µ(f ) and µ(g) are well defined.

1.2.2 Integration: convergence theorems


The a.e. convergence for sequences of measurable functions introduced below is weaker than pointwise convergence and adapted to the convergence of integrals. Let (S, S, µ) be a measured space.

Definition 1.41. Let (fn, n ∈ N) be a sequence of real-valued measurable functions defined on S. The sequence converges a.e. if a.e. lim infn→∞ fn = lim supn→∞ fn. We denote by limn→∞ fn any element of the equivalence class of the measurable functions which are a.e. equal to lim infn→∞ fn.

Notice that Proposition 1.24 indeed ensures that lim infn→∞ fn is measurable. We thus deduce the following corollary.

Corollary 1.42. If a sequence of real-valued measurable functions defined on S converges


a.e., then its limit is measurable.

We now give the three main results on the convergence of integrals for sequences of converging functions.
Theorem 1.43 (Monotone convergence theorem). Let (fn, n ∈ N) be a sequence of real-valued measurable functions defined on S such that for all n ∈ N, a.e. 0 ≤ fn ≤ fn+1. Then, we have:

limn→∞ ∫ fn dµ = ∫ limn→∞ fn dµ.

Proof. The set A = {x; fn(x) < 0 or fn(x) > fn+1(x) for some n ∈ N} is µ-null as a countable union of µ-null sets. Thus, we get that a.e. fn = fn 1Ac for all n ∈ N. Corollary 1.39 implies that, replacing fn by fn 1Ac without loss of generality, it is enough to prove the theorem under the stronger conditions: for all n ∈ N, 0 ≤ fn ≤ fn+1. We set f = limn→∞ fn, the non-decreasing (everywhere) limit of (fn, n ∈ N).
For all n ∈ N, let (fn,k, k ∈ N) be a non-decreasing sequence of simple functions which converges towards fn. We set gn = max{fj,n; 1 ≤ j ≤ n}. The non-decreasing sequence of simple functions (gn, n ∈ N) converges to f, and thus limn→∞ ∫ gn dµ = ∫ f dµ by Lemma 1.35. By monotonicity, gn ≤ fn ≤ f implies ∫ gn dµ ≤ ∫ fn dµ ≤ ∫ f dµ. Taking the limit, we get limn→∞ ∫ fn dµ = ∫ f dµ.
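The theorem can also be observed numerically. In the sketch below (an added illustration, with a midpoint Riemann sum standing in for the Lebesgue integral on [0, 1]), fn(x) = min(n, x−1/2) increases to f(x) = x−1/2, and the exact value ∫ fn dλ = 2 − 1/n increases to ∫ f dλ = 2:

# Midpoint Riemann sum as a stand-in for the Lebesgue integral on [0, 1].
def integral01(g, N=100_000):
    return sum(g((k + 0.5) / N) for k in range(N)) / N

# f_n(x) = min(n, x**-0.5) increases to f(x) = x**-0.5, whose integral is 2;
# the exact value of the integral of f_n is 2 - 1/n, which increases to 2.
for n in (1, 2, 4, 8, 16):
    print(n, integral01(lambda x, n=n: min(n, x ** -0.5)))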

The proof of the next corollary is left to the reader (hint: use the monotone convergence
theorem to get the σ-additivity).

Corollary 1.44. Let f be a real-valued measurable function defined on S such that a.e. f ≥ 0. Then the function fµ defined on S by fµ(A) = ∫ 1A f dµ is a measure on (S, S).
R defined on S such that a.e.
f ≥ 0. Then the function f µ defined on S by f µ(A) = 1A f dµ is a measure on (S, S).

We shall say that the measure f µ has density f with respect to the reference measure µ.
Fatou’s lemma will be used for the proof of the dominated convergence theorem, but it
is also interesting by itself.
Lemma 1.45 (Fatou’s lemma). Let (fn , n ∈ N) be a sequence of real-valued measurable
functions defined on S such that a.e. fn ≥ 0 for all n ∈ N. Then, we have the lower
semi-continuity property:

lim infn→∞ ∫ fn dµ ≥ ∫ lim infn→∞ fn dµ.
Proof. The function lim infn→∞ fn is the non-decreasing limit of the sequence (gn, n ∈ N) with gn = infk≥n fk. We get:

∫ lim infn→∞ fn dµ = limn→∞ ∫ gn dµ ≤ limn→∞ infk≥n ∫ fk dµ = lim infn→∞ ∫ fn dµ,

where we used the monotone convergence theorem for the first equality and the monotonicity property of the integral for the inequality.
The next theorem and the monotone convergence theorem are very useful to exchange
integration and limit.
Theorem 1.46 (Dominated convergence theorem). Let f, g, (fn, n ∈ N) and (gn, n ∈ N) be real-valued measurable functions defined on S. We assume that a.e.: |fn| ≤ gn for all n ∈ N, f = limn→∞ fn and g = limn→∞ gn. We also assume that limn→∞ ∫ gn dµ = ∫ g dµ and ∫ g dµ < +∞. Then, we have:

limn→∞ ∫ fn dµ = ∫ limn→∞ fn dµ.

Taking gn = g for all n ∈ N in the above theorem gives the following result.
Corollary 1.47 (Lebesgue's dominated convergence theorem). Let f, g and (fn, n ∈ N) be real-valued measurable functions defined on S. We assume that a.e.: |fn| ≤ g for all n ∈ N, f = limn→∞ fn, and that ∫ g dµ < +∞. Then, we have:

limn→∞ ∫ fn dµ = ∫ limn→∞ fn dµ.
Proof of Theorem 1.46. As a.e. |f| ≤ g and ∫ g dµ < +∞, we get that the function f is integrable. The functions gn + fn and gn − fn are a.e. non-negative. Fatou's lemma applied to gn + fn and to gn − fn gives:

∫ g dµ + ∫ f dµ = ∫ (g + f) dµ ≤ lim infn→∞ ∫ (gn + fn) dµ = ∫ g dµ + lim infn→∞ ∫ fn dµ,

∫ g dµ − ∫ f dµ = ∫ (g − f) dµ ≤ lim infn→∞ ∫ (gn − fn) dµ = ∫ g dµ − lim supn→∞ ∫ fn dµ.

Since ∫ g dµ is finite, we deduce from those inequalities that ∫ f dµ ≤ lim infn→∞ ∫ fn dµ and that lim supn→∞ ∫ fn dµ ≤ ∫ f dµ. Thus, the sequence (∫ fn dµ, n ∈ N) converges towards ∫ f dµ.
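A numerical illustration of dominated convergence (again with a midpoint Riemann sum as a stand-in for the integral; this sketch is not from the notes): fn(x) = xn on [0, 1] is dominated by g = 1, converges a.e. to 0, and ∫ fn dλ = 1/(n + 1) → 0:

# Midpoint Riemann sum as a stand-in for the Lebesgue integral on [0, 1].
def integral01(g, N=100_000):
    return sum(g((k + 0.5) / N) for k in range(N)) / N

# f_n(x) = x**n is dominated by g = 1 (integrable on [0, 1]), and
# f_n -> 0 a.e. (everywhere except at x = 1), so its integral tends to 0.
for n in (1, 10, 100, 1000):
    print(n, integral01(lambda x, n=n: x ** n))   # exact value: 1/(n + 1)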
We shall use the next corollary, a direct consequence of Fatou's lemma and the dominated convergence theorem, in Chapter 5.

Corollary 1.48. Let f, g, (fn, n ∈ N) be real-valued measurable functions defined on S. We assume that a.e. fn+ ≤ g for all n ∈ N, f = limn→∞ fn, and that ∫ g dµ < +∞. Then (µ(fn), n ∈ N) and µ(f) are well defined, and:

lim supn→∞ ∫ fn dµ ≤ ∫ limn→∞ fn dµ.

1.2.3 The Lp space


Let (S, S, µ) be a measured space. We start this section with very useful inequalities.
Proposition 1.49. Let f and g be two real-valued measurable functions defined on S.

• Hölder inequality. Let p, q ∈ (1, +∞) be such that 1/p + 1/q = 1. Assume that |f|p and |g|q are integrable. Then fg is integrable and we have:

∫ |fg| dµ ≤ (∫ |f|p dµ)1/p (∫ |g|q dµ)1/q.

The Hölder inequality is an equality if and only if there exist c, c′ ∈ [0, +∞) such that (c, c′) ≠ (0, 0) and a.e. c|f|p = c′|g|q.

• Cauchy-Schwarz inequality. Assume that f2 and g2 are integrable. Then fg is integrable and we have:

∫ |fg| dµ ≤ (∫ f2 dµ)1/2 (∫ g2 dµ)1/2.

Furthermore, we have ∫ fg dµ = (∫ f2 dµ)1/2 (∫ g2 dµ)1/2 if and only if there exist c, c′ ∈ [0, +∞) such that (c, c′) ≠ (0, 0) and a.e. c f = c′ g.

• Minkowski inequality. Let p ∈ [1, +∞). Assume that |f|p and |g|p are integrable. We have:

(∫ |f + g|p dµ)1/p ≤ (∫ |f|p dµ)1/p + (∫ |g|p dµ)1/p.

Proof. Hölder inequality. We recall the convention 0 · +∞ = 0. The Young inequality states that for a, b ∈ [0, +∞] and p, q ∈ (1, +∞) such that 1/p + 1/q = 1, we have:

ab ≤ (1/p) ap + (1/q) bq.

Indeed, this inequality is obvious if a or b belongs to {0, +∞}. For a, b ∈ (0, +∞), using the convexity of the exponential function, we get:

ab = exp((1/p) log(ap) + (1/q) log(bq)) ≤ (1/p) exp(log(ap)) + (1/q) exp(log(bq)) = (1/p) ap + (1/q) bq.

If µ(|f|p) = 0 or µ(|g|q) = 0, the Hölder inequality is trivially true as a.e. fg = 0 thanks to Lemma 1.38. If this is not the case, then integrating with respect to µ in the Young inequality with a = |f|/µ(|f|p)1/p and b = |g|/µ(|g|q)1/q gives the result. Because of the strict convexity of the exponential, if a and b are finite, then the Young inequality is an equality if and only if ap and bq are equal. This implies that, if |f|p and |g|q are integrable, then the Hölder inequality is an equality if and only if there exist c, c′ ∈ [0, +∞) such that (c, c′) ≠ (0, 0) and a.e. c|f|p = c′|g|q.

The Cauchy-Schwarz inequality is the Hölder inequality with p = q = 2. If ∫ fg dµ = (∫ f2 dµ)1/2 (∫ g2 dµ)1/2, then since ∫ fg dµ ≤ ∫ |fg| dµ, the equality holds also in the Cauchy-Schwarz inequality. Thus there exist c, c′ ∈ [0, +∞) such that (c, c′) ≠ (0, 0) and a.e. c|f| = c′|g|. Notice also that ∫ (|fg| − fg) dµ = 0. Then use Lemma 1.38 to conclude that a.e. |fg| = fg and thus a.e. c f = c′ g.

Let p ≥ 1. From the convexity of the function x ↦ |x|p, we get (a + b)p ≤ 2p−1(ap + bp) for all a, b ∈ [0, +∞]. We deduce that |f + g|p is integrable. The case p = 1 of the Minkowski inequality comes from the triangle inequality in R. Let p > 1. We assume that ∫ |f + g|p dµ > 0, otherwise the inequality is trivial. Using the Hölder inequality, we get:

∫ |f + g|p dµ ≤ ∫ |f| |f + g|p−1 dµ + ∫ |g| |f + g|p−1 dµ
≤ ((∫ |f|p dµ)1/p + (∫ |g|p dµ)1/p) (∫ |f + g|p dµ)(p−1)/p.

Dividing by (∫ |f + g|p dµ)(p−1)/p gives the Minkowski inequality.
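These inequalities are easy to test numerically on random data. A sketch assuming numpy is available (the data and exponents are arbitrary), using the uniform probability measure on {0, . . . , 999}:

import numpy as np

rng = np.random.default_rng(0)
f, g = rng.normal(size=1000), rng.normal(size=1000)
p, q = 3.0, 1.5                                  # conjugate exponents: 1/p + 1/q = 1

def norm(h, r):
    # ||h||_r for the uniform probability measure on {0, ..., 999}
    return np.mean(np.abs(h) ** r) ** (1 / r)

assert np.mean(np.abs(f * g)) <= norm(f, p) * norm(g, q)       # Hoelder
assert np.mean(np.abs(f * g)) <= norm(f, 2) * norm(g, 2)       # Cauchy-Schwarz
assert norm(f + g, p) <= norm(f, p) + norm(g, p)               # Minkowski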
For p ∈ [1, +∞), let ℒp((S, S, µ)) denote the set of R-valued measurable functions f defined on S such that ∫ |f|p dµ < +∞. When there is no ambiguity on the underlying space, resp. space and measure, we shall simply write ℒp(µ), resp. ℒp. The Minkowski inequality and the linearity of the integral yield that ℒp is a vector space and that the map ‖·‖p from ℒp to [0, +∞) defined by ‖f‖p = (∫ |f|p dµ)1/p is a semi-norm (that is, ‖f + g‖p ≤ ‖f‖p + ‖g‖p and ‖af‖p = |a| ‖f‖p for f, g ∈ ℒp and a ∈ R). Notice that ‖f‖p = 0 implies that a.e. f = 0, thanks to Lemma 1.38. Recall that the relation "a.e. equal to" is an equivalence relation on the set of real-valued measurable functions defined on S. We deduce that the space (Lp, ‖·‖p), where Lp is the space ℒp quotiented by the equivalence relation "a.e. equal to", is a normed vector space. We shall use the same notation for an element of ℒp and for its equivalence class in Lp. If we need to stress the dependence on the measure µ of the measured space (S, S, µ), we may write Lp(µ) and even Lp(S, S, µ) for Lp.
The next proposition asserts that the normed vector space (Lp, ‖·‖p) is complete and hence, by definition, a Banach space. We recall that a sequence (fn, n ∈ N) of elements of Lp converges in Lp to a limit, say f, if f ∈ Lp and limn→∞ ‖fn − f‖p = 0.

Proposition 1.50. Let p ∈ [1, +∞). The normed vector space (Lp, ‖·‖p) is complete. That is, every Cauchy sequence of elements of Lp converges in Lp: if (fn, n ∈ N) is a sequence of elements of Lp such that limmin(n,m)→∞ ‖fn − fm‖p = 0, then there exists f ∈ Lp such that limn→∞ ‖fn − f‖p = 0.
Proof. Let (fn, n ∈ N) be a Cauchy sequence of elements of Lp, that is, fn ∈ Lp for all n ∈ N and limmin(n,m)→∞ ‖fn − fm‖p = 0. Consider the sub-sequence (nk, k ∈ N) defined by n0 = 0 and, for k ≥ 1, nk = inf{m > nk−1; ‖fi − fj‖p ≤ 2−k for all i ≥ m, j ≥ m}. In particular, we have ‖fnk+1 − fnk‖p ≤ 2−k for all k ≥ 1. The Minkowski inequality and the monotone convergence theorem imply that ‖Σk∈N |fnk+1 − fnk|‖p < +∞ and thus Σk∈N |fnk+1 − fnk| is a.e. finite. The series with general term (fnk+1 − fnk) is therefore a.e. absolutely converging. By considering the convergence of the partial sums, we get that the sequence (fnk, k ∈ N) converges a.e. towards a limit, say f. This limit is a real-valued measurable function, thanks to Corollary 1.42. We deduce from Fatou's lemma that:

‖fm − f‖p ≤ lim infk→∞ ‖fm − fnk‖p.

This implies that limm→∞ ‖fm − f‖p = 0, and the Minkowski inequality gives that f ∈ Lp.

We give an elementary criterion for Lp convergence of a.e. converging sequences.

Lemma 1.51. Let p ∈ [1, +∞). Let (fn, n ∈ N) be a sequence of elements of Lp which converges a.e. towards f ∈ Lp. The convergence holds in Lp (i.e. limn→∞ ‖f − fn‖p = 0) if and only if limn→∞ ‖fn‖p = ‖f‖p.
Proof. Assume that limn→∞ ‖f − fn‖p = 0. Using the Minkowski inequality, we deduce that |‖f‖p − ‖fn‖p| ≤ ‖f − fn‖p. This proves that limn→∞ ‖fn‖p = ‖f‖p.
On the other hand, assume that limn→∞ ‖fn‖p = ‖f‖p. Set gn = 2p−1(|fn|p + |f|p) and g = 2p|f|p. Since the function x ↦ |x|p is convex, we get |fn − f|p ≤ gn for all n ∈ N. We also have limn→∞ gn = g a.e. and limn→∞ ∫ gn dµ = ∫ g dµ < +∞. The dominated convergence Theorem 1.46 then gives that limn→∞ ∫ |fn − f|p dµ = ∫ limn→∞ |fn − f|p dµ = 0. This ends the proof.

1.2.4 Fubini theorem


Let (E, E, ν) and (S, S, µ) be two measured spaces. The product space E × S is endowed
with the product σ-field E ⊗ S. We give a preliminary lemma.
Lemma 1.52. Assume that ν and µ are σ-finite measures. Let f be a real-valued measurable
function defined on E × S.
(i) For all x ∈ E, the function y ↦ f(x, y) defined on S is measurable, and for all y ∈ S, the function x ↦ f(x, y) defined on E is measurable.

(ii) Assume that f ≥ 0. The function x ↦ ∫ f(x, y) µ(dy) defined on E is measurable and the function y ↦ ∫ f(x, y) ν(dx) defined on S is measurable.
Proof. It is not difficult to check that the set A = {C ∈ E ⊗ S; 1C satisfies (i) and (ii)} is a λ-system, thanks to Corollary 1.23 and Proposition 1.24. (Hint: consider first the case where µ and ν are finite, and then extend to the case where µ and ν are σ-finite, to prove that A satisfies property (ii) from Definition 1.12 of a λ-system.) Since A trivially contains C = {A × B; A ∈ E and B ∈ S}, which is stable by finite intersection, we deduce from the monotone class theorem that A contains σ(C) = E ⊗ S. We deduce that (i) holds for any real-valued measurable function, as such functions are limits of differences of simple functions, see the comments after Definition 1.34. We also deduce that (ii) holds for every simple function, and then for every [0, +∞]-valued measurable function thanks to Proposition 1.24 and the monotone convergence theorem.

The next theorem allows one to define the integral of a real-valued function with respect to the product of σ-finite5 measures.

Theorem 1.53 (Fubini's theorem). Assume that ν and µ are σ-finite measures.

(i) There exists a unique measure on (E × S, E ⊗ S), denoted by ν ⊗ µ and called the product measure, such that:

ν ⊗ µ(A × B) = ν(A)µ(B) for all A ∈ E, B ∈ S. (1.5)

(ii) Let f be a [0, +∞]-valued measurable function defined on E × S. We have:

∫ f(x, y) ν ⊗ µ(dx, dy) = ∫ (∫ f(x, y) µ(dy)) ν(dx) (1.6)
                        = ∫ (∫ f(x, y) ν(dx)) µ(dy). (1.7)

Let f be a real-valued measurable function defined on E × S. If ν ⊗ µ(f) is well defined, then the equalities (1.6) and (1.7) hold, with their right hand-sides being well defined.

5 When the measures ν and µ are not σ-finite, Fubini's theorem may fail, essentially because the product measure might not be well defined. Consider the measurable space ([0, 1], B([0, 1])) with λ the Lebesgue measure and µ the counting measure (which is not σ-finite), and the measurable function f ≥ 0 defined by f(x, y) = 1{x=y}: then ∫ (∫ f(x, y) µ(dy)) λ(dx) = 1 and ∫ (∫ f(x, y) λ(dx)) µ(dy) = 0 are not equal.
We shall write ν(dx)µ(dy) for ν ⊗ µ(dx, dy). If ν and µ are probability measures, then the definition of the product measure ν ⊗ µ coincides with the one given in Proposition 8.7.
Proof. For all C ∈ E ⊗ S, we set ν ⊗ µ(C) = ∫ (∫ 1C(x, y) µ(dy)) ν(dx). The σ-additivity of ν and µ and the dominated convergence imply that ν ⊗ µ is a measure on (E × S, E ⊗ S). It is clear that (1.5) holds. Since ν and µ are σ-finite, we deduce that ν ⊗ µ is σ-finite. Using Exercise 9.2, based on the monotone class theorem, and the fact that {A × B; A ∈ E, B ∈ S} generates E ⊗ S, we get that (1.5) characterizes uniquely the measure ν ⊗ µ. This ends the proof of property (i).
Property (ii) holds clearly for functions f = 1C with C = A × B, A ∈ E and B ∈ S. Exercise 9.2, Corollary 1.23, Proposition 1.24, the monotone convergence theorem and the monotone class theorem imply that (1.6) and (1.7) hold also for f = 1C with C ∈ E ⊗ S. We deduce that (1.6) and (1.7) hold for all simple functions thanks to Corollary 1.23, and then for all [0, +∞]-valued measurable functions defined on E × S thanks to Proposition 1.24 and the monotone convergence theorem.
Let f be a real-valued measurable function defined on E × S such that ν ⊗ µ(f) is well defined. Without loss of generality, we can assume that ν ⊗ µ(f+) is finite. We deduce from (1.6), and then (1.7), with f replaced by f+ that NE = {x ∈ E; ∫ f+(x, y) µ(dy) = +∞} is ν-null, and then that NS = {y ∈ S; ∫ f+(x, y) ν(dx) = +∞} is µ-null. We set g = f+ 1NEc×NSc. It is now legitimate to subtract (1.6) with f replaced by f− from (1.6) with f replaced by g, in order to get (1.6) with f replaced by g − f−. Since ν ⊗ µ-a.e. f+ = g, and thus f = g − f−, Lemma 1.38 implies that (1.6) holds. Equality (1.7) is deduced by symmetry.

Remark 1.54. Notice that the proof of (i) of Fubini's theorem gives an alternative construction
of the product of two σ-finite measures to the one given in Proposition 8.7 for the product
of probability measures. Thanks to (i) of Fubini's theorem, the Lebesgue measure on Rd can
be seen as the product of d copies of the one-dimensional Lebesgue measure. ♦

1.2.5 Expectation, variance, covariance and inequalities


We consider the particular case of probability measures. Let (Ω, F, P) be a probability space.
Let X be a real-valued random variable. The expectation of X is by definition the integral
of X with respect to the probability measure P and is denoted by E[X] = ∫ X(ω) P(dω). We
recall that the expectation E[X] is well defined if min(E[X+], E[X−]) is finite, where X+ = X ∨ 0
and X− = (−X) ∨ 0, and that X is integrable if max(E[X+], E[X−]) is finite.

Example 1.55. If A is an event, then 1A is a random variable and we have E[1A] = P(A).
Taking A = Ω, we obviously get that E[1Ω] = E[1] = 1. △
The next elementary lemma is very useful to compute expectations in practice. Recall that the
distribution of X, denoted by PX, has been introduced in Definition 1.27.

Lemma 1.56. Let X be a random variable taking values in a measurable space (E, E). Let ϕ be
a real-valued measurable function defined on (E, E). If E[ϕ(X)] is well defined, or equivalently
if ∫ ϕ(x) PX(dx) is well defined, then we have E[ϕ(X)] = ∫ ϕ(x) PX(dx).

Proof. Assume that ϕ is simple: ϕ = Σ_{k=1}^n αk 1Ak for some n ∈ N∗, αk ∈ [0, +∞], Ak ∈ E.
We have:

E[ϕ(X)] = Σ_{k=1}^n αk P(X ∈ Ak) = Σ_{k=1}^n αk PX(Ak) = ∫ ϕ(x) PX(dx).

Then use the monotone convergence theorem to get E[ϕ(X)] = ∫ ϕ(x) PX(dx) when ϕ is
measurable and [0, +∞]-valued. Use the definition of E[ϕ(X)] and of ∫ ϕ(x) PX(dx), when
they are well defined, to conclude when ϕ is real-valued and measurable.

Obviously, if X and Y have the same distribution, then E[ϕ(X)] = E[ϕ(Y)] for every real-valued
measurable function ϕ such that E[ϕ(X)] is well defined, in particular if ϕ is bounded.
Remark 1.57. We give a closed formula for the expectation of a discrete random variable. Let
X be a random variable taking values in a measurable space (E, E). We say that X is a
discrete random variable if {x} ∈ E for all x ∈ E and P(X ∈ ∆X) = 1, where ∆X = {x ∈
E; P(X = x) > 0} is the discrete support of the distribution of X. Notice that ∆X is at
most countable and thus belongs to E.
If X is a discrete random variable and ϕ a [0, +∞]-valued measurable function defined on E,
then we have:

E[ϕ(X)] = Σ_{x∈∆X} ϕ(x)P(X = x). (1.8)

Equation (1.8) also holds for ϕ a real-valued measurable function as soon as E[ϕ(X)] is well
defined. ♦
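As a quick numerical illustration of (1.8) (a minimal Python sketch; the Poisson distribution, the function ϕ and the truncation of the countable support are arbitrary choices):

```python
import numpy as np
from math import lgamma

rng = np.random.default_rng(0)

# Hypothetical choice: X Poisson with parameter theta and phi(x) = x**2,
# so that E[phi(X)] = theta + theta**2 = 12.
theta = 3.0
phi = lambda x: x ** 2

# Formula (1.8): E[phi(X)] = sum of phi(x) P(X = x) over the discrete support;
# the countable support N is truncated at K, a negligible approximation here.
K = 200
k = np.arange(K)
pmf = np.exp(k * np.log(theta) - theta - np.array([lgamma(i + 1.0) for i in k]))
series = np.sum(phi(k) * pmf)

# Monte Carlo check of the same expectation.
mc = phi(rng.poisson(theta, size=10**6)).mean()
print(series, mc)  # both close to 12
```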
Remark 1.58. Let B ∈ F such that P(B) > 0. By considering the probability measure
P(B)−1 1B P : A ↦ P(A ∩ B)/P(B), see Corollary 1.44, we can define the expectation conditionally
on B by, for all real-valued random variables Y such that E[Y] is well defined:

E[Y |B] = E[Y 1B]/P(B). (1.9)

If furthermore P(B) < 1, then we easily get that E[Y] = P(B)E[Y |B] + P(Bc)E[Y |Bc]. ♦
A real-valued random variable X is square-integrable if it belongs to L2 (P). Since 2 |x| ≤
1 + |x|2 , we deduce from the monotonicity property of the expectation that if X ∈ L2 (P) then
X ∈ L1 (P), that is X is integrable. This means that L2 (P) ⊂ L1 (P).

For X = (X1, . . . , Xd) an Rd-valued random variable, we say that E[X] is well defined
(resp. X is integrable, resp. square-integrable) if E[Xi] is well defined (resp. Xi is integrable,
resp. square-integrable) for all i ∈ J1, dK, and we set E[X] = (E[X1], . . . , E[Xd]).
We recall that an R-valued function ϕ defined on Rd is convex if ϕ(qx + (1 − q)y) ≤ qϕ(x) +
(1 − q)ϕ(y) for all x, y ∈ Rd and q ∈ (0, 1). The function ϕ is strictly convex if this convexity
inequality is strict for all x ≠ y. Let ⟨·, ·⟩ denote the scalar product on Rd. Then, it is well
known that if ϕ is an R-valued convex function defined on Rd, then it is continuous⁶ and
there exists⁷ a sequence ((an, bn), n ∈ N) with an ∈ Rd and bn ∈ R such that for all x ∈ Rd:

ϕ(x) = sup_{n∈N} (bn + ⟨an, x⟩). (1.10)

We give further inequalities which complete Proposition 1.49. We recall that an R-valued
convex function defined on Rd is continuous (and thus measurable).
Proposition 1.59.

• Tchebychev inequality. Let X be a real-valued random variable. Let a > 0. We have:

P(|X| ≥ a) ≤ E[X2]/a2.

• Jensen inequality. Let X be an Rd-valued integrable random variable. Let ϕ be an
R-valued convex function defined on Rd. We have that E[ϕ(X)] is well defined and:

ϕ(E[X]) ≤ E[ϕ(X)]. (1.11)

Furthermore, if ϕ is strictly convex, the inequality in (1.11) is an equality if and only
if X is a.s. constant.

Remark 1.60. If X is a real-valued integrable random variable, we deduce from the Cauchy-Schwarz
inequality or the Jensen inequality that E[X]2 ≤ E[X2]. ♦
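Both inequalities of Proposition 1.59 are easy to check by simulation; here is a minimal sketch, assuming an exponential distribution for X (so that E[X2] = 2 and E[eX/2] = 2):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.exponential(scale=1.0, size=10**6)

# Tchebychev: P(|X| >= a) <= E[X^2] / a^2 (here E[X^2] = 2).
a = 3.0
print((np.abs(X) >= a).mean(), (X ** 2).mean() / a ** 2)

# Jensen with the strictly convex function phi(x) = exp(x / 2):
# phi(E[X]) <= E[phi(X)], strictly, since X is not a.s. constant.
print(np.exp(X.mean() / 2), np.exp(X / 2).mean())  # about 1.65 < 2
```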

Proof. Since 1{|X|≥a} ≤ X2/a2, we deduce the Tchebychev inequality from the monotonicity
property of the expectation and Example 1.55.
Let ϕ be a real-valued convex function. Using (1.10), we get ϕ(X) ≥ b0 + ⟨a0, X⟩ and
thus ϕ(X) ≥ −|b0| − |a0||X|. Since X is integrable, we deduce that E[ϕ(X)−] < +∞, and
thus E[ϕ(X)] is well defined. Then, using the monotonicity of the expectation, we get that
for all n ∈ N, E[ϕ(X)] ≥ bn + ⟨an, E[X]⟩. Taking the supremum over all n ∈ N and using the
characterization (1.10), we get (1.11).
⁶ It is enough to prove the continuity at 0 and, without loss of generality, we can assume that ϕ(0) = 0.
Since ϕ is finite on the 2^d vertices of the cube [−1, 1]d, it is bounded from above there by a finite constant, say M.
Using the convexity inequality, we deduce that ϕ is bounded from above on [−1, 1]d by M. Let α ∈ (0, 1) and y ∈ [−α, α]d.
Using the convexity inequality with x = y/α, y = 0 and q = α, we get that ϕ(y) ≤ αϕ(y/α) ≤ αM. Using the
convexity inequality with x = y, y = −y/α and q = 1/(1 + α), we also get that 0 ≤ ϕ(y)/(1 + α) + Mα/(1 + α).
This gives that |ϕ(y)| ≤ αM. Thus ϕ is continuous at 0.
⁷ This is a consequence of the separation theorem for convex sets. See for example Proposition B.1.2.1 in
J.-B. Hiriart-Urruty and C. Lemaréchal. Fundamentals of convex analysis. Springer-Verlag, 2001.

To complete the proof, we shall check that if X is not a.s. equal to a constant and if
ϕ is strictly convex, then the inequality in (1.11) is strict. For simplicity, we consider the
case d = 1, as the case d ≥ 2 can be proved similarly. Set B = {X ≤ E[X]}. Since X
is non-constant, we deduce that P(B) ∈ (0, 1) and that E[X|B] < E[X|Bc]. Recall that
E[X] = P(B)E[X|B] + P(B c )E[X|B c ]. We get that:

ϕ(E[X]) < P(B)ϕ(E[X|B]) + P(B c )ϕ(E[X|B c ])


≤ P(B)E[ϕ(X)|B] + P(B c )E[ϕ(X)|B c ] = E[ϕ(X)],

where we used the strict convexity of ϕ and that E[X|B] 6= E[X|B c ] for the first inequality
and Jensen inequality for the second. This proves that the inequality in (1.11) is strict.

We end this section with the covariance and variance. Let X, Y be two real-valued square-
integrable random variables. Thanks to Cauchy-Schwarz inequality, XY is integrable. The
covariance of X and Y , Cov(X, Y ), and the variance of X, Var(X), are defined by:

Cov(X, Y ) = E[XY ] − E[X]E[Y ] and Var(X) = Cov(X, X).

By linearity, we also get that:

Var(X) = E[(X − E[X])2 ] and Var(X + Y ) = Var(X) + Var(Y ) + 2 Cov(X, Y ). (1.12)

The covariance is a bilinear form on L2 (P) and for a, b ∈ R, we get:

Var(aX + b) = a2 Var(X).

Using Lemma 1.38 with f = (X − E[X])2 , we get that Var(X) = 0 implies X is a.s. constant.
The covariance can be defined for random vectors as follows.

Definition 1.61. Let X = (X1 , . . . , Xd ) and Y = (Y1 , . . . , Yp ) be respectively two Rd -valued


and Rp -valued square-integrable random variables with d, p ∈ N∗ . The covariance matrix of
X and Y , Cov(X, Y ), is a d × p matrix defined by:

Cov(X, Y ) = (Cov(Xi , Yj ), i ∈ J1, dK and j ∈ J1, pK).
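A short numpy check of this definition on a hypothetical example (the choice of X and Y below is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical example with d = p = 2: Y = (X1 + noise, X1 - X2).
n = 10**6
X = rng.normal(size=(n, 2))  # X = (X1, X2) with i.i.d. N(0, 1) entries
Y = np.stack([X[:, 0] + rng.normal(size=n), X[:, 0] - X[:, 1]], axis=1)

# Empirical covariance matrix Cov(X, Y)(i, j) = E[Xi Yj] - E[Xi] E[Yj].
cov = (X.T @ Y) / n - np.outer(X.mean(axis=0), Y.mean(axis=0))
print(cov)  # close to [[1, 1], [0, -1]]
```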

1.2.6 Independence
Recall the independence of σ-fields given in Definition 1.11 and of random variables given in
Definition 1.31.

Proposition 1.62. Let ((Ei, Ei), i ∈ I) be a collection of measurable spaces and (Xi, i ∈ I) a
random variable taking values in the product space ∏_{i∈I} Ei endowed with the product σ-field.
The random variables (Xi, i ∈ I) are independent if and only if for every finite subset J ⊂ I
and all bounded real-valued measurable functions fj defined on Ej for j ∈ J, we have:

E[∏_{j∈J} fj(Xj)] = ∏_{j∈J} E[fj(Xj)]. (1.13)

Proof. If (1.13) holds, then taking fj = 1Aj with Aj ∈ Ej, we deduce from Definition 1.31
that the random variables (Xi, i ∈ I) are independent.
If (Xi, i ∈ I) are independent, then Definition 1.31 implies that (1.13) holds for indicator
functions. By linearity, we get that (1.13) holds also for simple functions. Use the monotone
convergence theorem and then linearity to deduce that (1.13) holds for bounded real-valued
measurable functions.
Chapter 2

Conditional expectation

Let X be a square integrable real-valued random variable. The constant c which minimizes
E[(X − c)2 ] is the expectation of X. Indeed, we have, with m = E[X]:

E[(X − c)2 ] = E[(X − m)2 + (m − c)2 + 2(X − m)(m − c)] = Var(X) + (m − c)2 .

In some sense, the expectation of X is the best approximation of X by a constant (with a


quadratic criterion).
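A quick simulation illustrating this (a sketch; the Gamma distribution is an arbitrary choice, with E[X] = 2):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.gamma(shape=2.0, scale=1.0, size=10**6)  # E[X] = 2

# Empirical quadratic risk c -> E[(X - c)^2]; it is minimal near c = E[X],
# with minimal value Var(X), by the decomposition Var(X) + (m - c)^2.
cs = np.linspace(0.0, 4.0, 81)
risk = np.array([np.mean((X - c) ** 2) for c in cs])
print(cs[risk.argmin()], X.mean())  # both close to 2
```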
More generally, the conditional expectation of X given another random variable Y will
be defined as the best approximation of X by a function of Y . In Section 2.1, we define the
conditional expectation of a square integrable random variable as a projection. In Section
2.2, we extend the conditional expectation to random variables whose expectations are well
defined. In Section 2.3, we provide explicit formulas for discrete and continuous random
variables.
We shall consider that all the random variables of this chapter are defined on a probability
space (Ω, F, P). Recall that the normed vector space L2 = L2(Ω, F, P) denotes the set of
(equivalence classes of) square-integrable real-valued random variables.

2.1 Projection in the L2 space


The bilinear form ⟨·, ·⟩L2 on L2 defined by ⟨X, Y⟩L2 = E[XY] is the scalar product corresponding
to the norm ‖·‖2. The space L2 with the scalar product ⟨·, ·⟩L2 is a Hilbert space, as
it is complete thanks to Proposition 1.50. Notice that square-integrable real-valued random
variables which are independent and centered are orthogonal for the scalar product ⟨·, ·⟩L2.
We shall use the following result on projections in Hilbert spaces.

Theorem 2.1. Let H be a closed vector sub-space of L2 and X ∈ L2 .

(i) (Existence.) There exists a real-valued random variable XH ∈ H, called the orthogonal
projection of X on H, such that E[(X − XH )2 ] = inf{E[(X − Y )2 ]; Y ∈ H}. And, for
all Y ∈ H, we have E[XY ] = E[XH Y ].

(ii) (Uniqueness.) Let Z ∈ H such that E[(X − Z)2 ] = inf{E[(X − Y )2 ]; Y ∈ H} or such


that E[ZY ] = E[XY ] for all Y ∈ H. Then, we have that a.s. Z = XH .


Proof. We set a = inf{E[(X − Y)2]; Y ∈ H}. The following median formula is clear:

E[(Z′ − Y′)2] + E[(Z′ + Y′)2] = 2E[Z′2] + 2E[Y′2] for all Y′, Z′ ∈ L2.

Let (Xn, n ∈ N) be a sequence in H such that limn→+∞ E[(X − Xn)2] = a. Using the median
formula with Z′ = Xn − X and Y′ = Xm − X, we get:

E[(Xn − Xm)2] = 2E[(X − Xn)2] + 2E[(X − Xm)2] − 4E[(X − I)2],

with I = (Xn + Xm)/2 ∈ H. As E[(X − I)2] ≥ a, we deduce that the sequence (Xn, n ∈ N)
is a Cauchy sequence in L2 and thus converges in L2, say towards XH. In particular, we have
E[(X − XH)2] = a. Since H is closed, we get that the limit XH belongs to H.
Let Z ∈ H be such that E[(X − Z)2] = a. For Y ∈ H, the function t ↦ E[(X − Z + tY)2] =
a + 2tE[(X − Z)Y] + t2E[Y2] is minimal at t = 0. This implies that its derivative at t = 0
is zero, that is E[(X − Z)Y] = 0. In particular, we have E[(X − XH)Y] = 0. This ends the
proof of (i).
On the one hand, let Z ∈ H be such that E[(X − Z)2 ] = a. We deduce from the previous
arguments that for all Y ∈ H:

E[(XH − Z)Y ] = E[(X − Z)Y ] − E[(X − XH )Y ] = 0.

Taking Y = XH − Z gives that E[(XH − Z)2] = 0 and thus a.s. Z = XH, see Lemma 1.38.
On the other hand, if there exists Z ∈ H such that E[ZY ] = E[XY ] for all Y ∈ H,
arguing as above, we also deduce that a.s. Z = XH .

According to the remarks at the beginning of this chapter, we see that if X is a real-valued
square-integrable random variable, then E[X] can be seen as the orthogonal projection of X
on the vector space of the constant random variables.

2.2 Conditional expectation


Let H ⊂ F be a σ-field. We recall that a random variable Y (which is by definition F-
measurable) is H-measurable if σ(Y ), the σ-field generated by Y , is a subset of H. The
expectation of X conditionally on H corresponds intuitively to the best “approximation” of
X by an H-measurable random variable.
Notice that if X is a real-valued random variable such that E[X] is well defined, then
E[X1A ] is also well defined for any A ∈ F.
Definition 2.2. Let X be a real-valued random variable such that E[X] is well defined. We
say that a real-valued H-measurable random variable Z, such that E[Z] is well defined, is the
expectation of X conditionally on H if:

E[X1A ] = E[Z1A ] for all A ∈ H. (2.1)

The next lemma asserts that, if the expectation of X conditionally on H exists then it is
unique up to an a.s. equality. It will be denoted by E[X|H].

Lemma 2.3 (Uniqueness of the conditional expectation). Let Z and Z′ be real-valued random
variables, H-measurable, with E[Z] and E[Z′] well defined, and such that E[Z1A] = E[Z′1A]
for all A ∈ H. Then, we get that a.s. Z = Z′.
Proof. Let n ∈ N∗ and consider A = {n ≥ Z > Z′ ≥ −n}, which belongs to H. By linearity,
we deduce from the hypothesis that E[(Z − Z′)1{n≥Z>Z′≥−n}] = 0. Lemma 1.38 implies that
P(n ≥ Z > Z′ ≥ −n) = 0, and thus P(+∞ > Z > Z′ > −∞) = 0 by monotone convergence.
Considering A = {Z = +∞, n ≥ Z′}, A = {Z ≥ −n, Z′ = −∞} and A = {Z = +∞, Z′ = −∞}
leads similarly to P(Z > Z′, Z = +∞ or Z′ = −∞) = 0. So we get P(Z > Z′) = 0. By
symmetry, we deduce that a.s. Z = Z′.

We use the orthogonal projection theorem on Hilbert spaces, to define the conditional
expectation for square-integrable real-valued random variables.
Proposition 2.4. If X ∈ L2, then E[X|H] is the orthogonal projection, defined in Theorem
2.1, of X on the vector space H of all square-integrable H-measurable random variables.
Proof. The set H corresponds to the space L2(Ω, H, P). It is closed thanks to Proposition
1.50. The set H is thus a closed vector subspace of L2. Property (i) (with Y = 1A) from
Theorem 2.1 then implies that (2.1) holds, and thus that the orthogonal projection of X ∈ L2
on H is the expectation of X conditionally on H.

Notice that for A ∈ F, we have 1A ∈ L2 , and we shall use the notation:

P(A|H) = E [1A | H] . (2.2)

We have the following properties.


Proposition 2.5. Let X and Y be real-valued square-integrable random variables.
(i) Positivity. If a.s. X ≥ 0 then we have that a.s. E[X|H] ≥ 0.

(ii) Linearity. For a, b ∈ R, we have that a.s. E[aX + bY |H] = aE[X|H] + bE[Y |H].

(iii) Monotone convergence. Let (Xn, n ∈ N) be a sequence of real-valued square-integrable
random variables such that for all n ∈ N a.s. 0 ≤ Xn ≤ Xn+1. Then, we have that a.s.:

lim_{n→+∞} E[Xn|H] = E[lim_{n→+∞} Xn | H].

Proof. Let X be a square-integrable a.s. non-negative random variable. We set A =


{E[X|H] < 0}. We have:
0 ≥ E[E[X|H]1A ] = E[X1A ] ≥ 0,
where we used that A ∈ H and (2.1) for the equality. We deduce that E[E[X|H]1A ] = 0 and
thus that a.s. E[X|H] ≥ 0 according to Lemma 1.38.
The linearity property is a consequence of the linearity property of the expectation, (2.1)
and Lemma 2.3.
Let (Xn , n ∈ N) be a sequence of real-valued square-integrable random variables such that
for all n ∈ N a.s. 0 ≤ Xn ≤ Xn+1 . We deduce from the linearity and positivity properties

of the conditional expectation that for all n ∈ N a.s. 0 ≤ E[Xn|H] ≤ E[Xn+1|H]. The
random variable Z = limn→+∞ E[Xn|H] is H-measurable according to Corollary 1.42 and
a.s. non-negative. The monotone convergence theorem implies that for all A ∈ H:

E[Z1A] = lim_{n→+∞} E[E[Xn|H]1A] = lim_{n→+∞} E[Xn1A] = E[(lim_{n→+∞} Xn) 1A].

Deduce from (2.1) and Lemma 2.3 that a.s. Z = E[limn→+∞ Xn|H]. This ends the proof.

We extend the definition of conditional expectations to random variables whose expecta-


tion is well defined.
Proposition 2.6. Let X be a real-valued random variable such that E[X] is well defined.
Then its expectation conditionally on H, E[X|H], exists and is unique up to an a.s. equality.
Furthermore, the expectation of E[X|H] is well defined and is equal to E[X]:

E [E[X|H]] = E[X]. (2.3)

If X is non-negative a.s. (resp. integrable), so is E[X|H].

Proof. Consider first the case where X is a.s. non-negative. The random variable X is the
a.s. limit of a sequence of positive square-integrable random variables. Property (iii) from
Proposition 2.5 implies that E[X|H] exists. It is unique thanks to Lemma 2.3. It is a.s.
non-negative as limit of non-negative random variables. Taking A = Ω in (2.1), we get (2.3).

We now consider the general case. Recall that X + = max(X, 0) and X − = max(−X, 0).
From the previous argument the expectations of E[X + |H] and E[X − |H] are well defined and
respectively equal to E[X+] and E[X−]. Since one of those two expectations is finite, we
deduce that a.s. E[X+|H] is finite or a.s. E[X−|H] is finite. Then use (2.1) and Lemma 2.3
to deduce that E[X+|H] − E[X−|H] is equal to E[X|H], the expectation of X conditionally
on H. Since E[X|H] is the difference of two non-negative random variables, one of them
being integrable, we deduce that the expectation of E[X|H] is well defined, and use (2.1) with
A = Ω to get (2.3). Finally, if X is integrable, so are E[X+|H] and E[X−|H], thanks to
(2.3) for non-negative random variables. This implies that E[X|H] is integrable.

We summarize in the next proposition the properties of the conditional expectation di-
rectly inherited from the properties of the expectation.
Proposition 2.7. We have the following properties.

(i) Positivity. If X is a real-valued random variable such that a.s. X ≥ 0, then a.s.
E[X|H] ≥ 0.

(ii) Linearity. For a, b in R (resp. in [0, +∞)), X, Y real-valued random-variables with X


and Y integrable (resp. with E[X + +Y + ] or E[X − +Y − ] finite), we have E[aX+bY |H] =
aE[X|H] + bE[Y |H].

(iii) Monotony. For X, Y real-valued random variables such that a.s. X ≤ Y and E[X] as
well as E[Y ] are well defined, we have E[X|H] ≤ E[Y |H].

(iv) Monotone convergence. Let (Xn, n ∈ N) be real-valued random variables such that for
all n ∈ N a.s. 0 ≤ Xn ≤ Xn+1. Then we have that a.s.:

lim_{n→∞} E[Xn|H] = E[lim_{n→∞} Xn | H].

(v) Fatou Lemma. Let (Xn, n ∈ N) be real-valued random variables such that for all n ∈ N
a.s. 0 ≤ Xn. Then we have that a.s.:

E[lim inf_{n→∞} Xn | H] ≤ lim inf_{n→∞} E[Xn|H].

(vi) Dominated convergence (Lebesgue). Let X, Y, (Xn, n ∈ N) be real-valued random
variables such that a.s. limn→∞ Xn = X, for all n ∈ N a.s. |Xn| ≤ Y, and E[Y] < +∞.
Then we have that a.s.:

lim_{n→∞} E[Xn|H] = E[lim_{n→∞} Xn | H].

(vii) The Tchebychev, Hölder, Cauchy-Schwarz, Minkowski and Jensen inequalities from
Propositions 1.49 and 1.59 hold with the expectation replaced by the conditional expectation.
For example, we state the Jensen inequality from property (vii) above. Let ϕ be an R-valued
convex function defined on Rd. Let X be an integrable Rd-valued random variable. Then,
E[ϕ(X)|H] is well defined and a.s.:

ϕ(E [X| H]) ≤ E [ϕ(X)| H] . (2.4)

Furthermore, if ϕ is strictly convex, the inequality in (2.4) is an equality if and only if X is


a.s. equal to an H-measurable random variable.
Proof. The positivity property comes from Proposition 2.6. The linearity property comes
from the linearity of the expectation, (2.1) and Lemma 2.3. The monotony property is a
consequence of the positivity and linearity properties. The proof of the monotone convergence
theorem is based on the same arguments as in the proof of Proposition 2.5. Fatou Lemma and
the dominated convergence theorem are consequences of the monotone convergence theorem,
see proof of Lemma 1.45 and of Theorem 1.46. The proofs of the inequalities are similar to
the proofs of Propositions 1.49 and 1.59. (Be careful when characterizing the equality in (2.4)
when ϕ is strictly convex.)

Using the monotone or dominated convergence theorems, it is easy to prove the following
corollary, which generalizes (2.1).
Corollary 2.8. Let X and Y be two real-valued random variables such that E[X] and E[XY ]
are well defined, and Y is H-measurable. Then E [E[X|H]Y ] is well defined and we have:

E[XY ] = E [E[X|H]Y ] . (2.5)

Recall Definitions 1.11 and 1.30 on independence. We complete the properties of the
conditional expectation.

Proposition 2.9. Let X be a real-valued random variable such that E[X] is well defined.

(i) If X is H-measurable, then we have that a.s. E[X|H] = X.

(ii) If X is independent of H, then we have that a.s. E[X|H] = E[X].

(iii) If Y is a real-valued H-measurable random variable such that E[XY ] is well defined,
then we have that a.s. E[Y X|H] = Y E[X|H].

(iv) If G ⊂ H is a σ-field, then we have that a.s. E [E[X|H]|G] = E[X|G].

(v) If G ⊂ F is a σ-field independent of H and independent of X (that is G is independent


of H ∨ σ(X)), then we have that a.s. E[X|G ∨ H] = E[X|H].

Proof. If X is H-measurable, then we can choose Z = X in (2.1) and use Lemma 2.3 to get
property (i). If X is independent of H, then for all A ∈ H, we have E[X1A ] = E[X]E[1A ] =
E[E[X]1A ], and we can choose Z = E[X] in (2.1) and use Lemma 2.3 to get property (ii). If Y
is a real-valued H-measurable random variable such that E[XY ] is well defined, then E[XY 1A ]
is also well defined for A ∈ H, and according to (2.5), we have E[XY 1A ] = E[E[X|H]Y 1A ].
Then, we can choose Z = Y E[X|H] in (2.1), with X replaced by X1A , and use Lemma 2.3
to get property (iii).
We prove property (iv). Let A ∈ G ⊂ H. We have:

E [E[X|G]1A ] = E[X1A ] = E [E[X|H]1A ] = E [E [E[X|H]|G] 1A ] ,

where we used (2.1) with H replaced by G for the first equality, (2.1) for the second and (2.1)
with H replaced by G and X by E[X|H] for the last. Then we deduce property (iv) from
Definition 2.2 and Lemma 2.3.
We prove property (v) first when X is integrable. For A ∈ G and B ∈ H, we have:

E [1A∩B X] = E [1A 1B X] = E[1A ]E[1B X] = E[1A ]E[1B E[X|H]] = E[1A 1B E[X|H]],

where we used that 1A is independent of H∨σ(X) in the second equality and independent of H
in the last. Using the dominated convergence theorem, we get that A = {A ∈ F, E [1A X] =
E[1A E[X|H]]} is a monotone class. It contains C = {A ∩ B; A ∈ G, B ∈ H} which is stable
by finite intersection. The monotone class theorem implies that A contains σ(C) and thus
G ∨ H. Then we deduce property (v) from Definition 2.2 and Lemma 2.3. Use the monotone
convergence theorem to extend the result to non-negative random variables, and use that
E[X|H′] = E[X+|H′] − E[X−|H′] for any σ-field H′ ⊂ F when E[X] is well defined, to extend
the result to any real-valued random variable X such that E[X] is well defined.

We extend the definition of conditional expectation to Rd -valued random variables.

Definition 2.10. Let d ∈ N∗ . Let X = (X1 , . . . , Xd ) be an Rd -valued random variable such


that E[X] is well defined. The conditional expectation of X conditionally on H, denoted by
E[X|H], is given by (E[X1 |H], . . . , E[Xd |H]).

2.3 Conditional expectation with resp. to a random variable


Let V be a random variable taking values in a measurable space (E, E). Recall that σ(V)
denotes the σ-field generated by V. Let X be a real-valued random variable. We write E[X|V]
for E[X|σ(V)]. The next result states that E[X|V] is a measurable function of V. It is a
direct consequence of Proposition 1.25.
Corollary 2.11. Let V be a random variable taking values in a measurable space (E, E) and
X a real-valued random variable such that E[X] is well defined. There exists a real-valued
measurable function g defined on E such that a.s. E[X|V ] = g(V ).

In the next two paragraphs we give an explicit formula for g when V is a discrete random
variable and when X = ϕ(Y, V ) with Y some random variable taking values in a measurable
space (S, S) such that (Y, V ) has a density with respect to some product measure on S × E.
Recall (2.2) for the notation P(A| H) for A ∈ F; and we shall write P(A|V ) for P(A| σ(V )).

2.3.1 The discrete case


The following corollary provides an explicit formula for the expectation conditionally on a
discrete random variable. Recall the definition of a discrete random variable in Remark 1.57
and of the expectation conditionally on an event in Remark 1.58.
Corollary 2.12. Let (E, E) be a measurable space and V be a discrete E-valued random
variable. Let X be a real-valued random variable such that E[X] is well defined. Then, we
have that a.s. E[X|V] = g(V) with:

g(v) = E[X1{V=v}]/P(V = v) = E[X|V = v] if P(V = v) > 0, and g(v) = 0 otherwise. (2.6)
Proof. According to Corollary 2.11, we have E[X|V] = g(V) for some real-valued measurable
function g. We deduce from (2.1) with A = {V = v} that E[X1{V=v}] = g(v)P(V = v). If
P(V = v) > 0, we get:

g(v) = E[X1{V=v}]/P(V = v) = E[X|V = v].

The value of E[X|V = v] when P(V = v) = 0 is unimportant, and can be set equal to 0.

Remark 2.13. Let (E, E) be a measurable space and V be a discrete E-valued random variable
with discrete support ∆V = {v ∈ E, P(V = v) > 0}. For v ∈ ∆V , denote by Pv the
probability measure on (Ω, F) defined by Pv(A) = P(A|V = v) for A ∈ F. The law of X
conditionally on {V = v}, denoted by PX|v, is the image of the probability measure Pv by
X, and we define the law of X conditionally on V as the collection of probability measures
PX|V = (PX|v, v ∈ ∆V). An illustration is given in the next example. ♦
Example 2.14. Let (Xi, i ∈ J1, nK) be independent Bernoulli random variables with the same
parameter p ∈ (0, 1). We set Sn = Σ_{i=1}^n Xi, which has a binomial distribution with parameter
(n, p). We shall compute the law of X1 conditionally on Sn. We get for k ∈ J1, nK:

P(X1 = 1|Sn = k) = P(X1 = 1, Sn = k)/P(Sn = k) = P(X1 = 1)P(X2 + · · · + Xn = k − 1)/P(Sn = k) = k/n,

where we used the independence of X1 and (X2, . . . , Xn) for the second equality and that X2 +
· · · + Xn has a binomial distribution with parameter (n − 1, p) for the last. For k = 0, we
get directly that P(X1 = 1|Sn = k) = 0. We deduce that X1 conditionally on {Sn = k}
is a Bernoulli random variable with parameter k/n for all k ∈ J0, nK. We shall say that,
conditionally on Sn, X1 has the Bernoulli distribution with parameter Sn/n.
Using Corollary 2.12, we get that E[X1|Sn] = Sn/n, which could have been obtained
directly, as the expectation of a Bernoulli random variable is given by its parameter. △
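This conditional law is easy to check by simulation; a minimal sketch (the values of n and p are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, trials = 10, 0.3, 10**6

# Simulate many copies of (X1, ..., Xn) and group them by the value of Sn.
X = (rng.random((trials, n)) < p).astype(int)
S = X.sum(axis=1)

# Empirical P(X1 = 1 | Sn = k) should be close to k/n for each k.
for k in range(n + 1):
    sel = S == k
    if sel.any():
        print(k, X[sel, 0].mean(), k / n)
```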

2.3.2 The density case


Let Y be a random variable taking values in (S, S) such that (Y, V) has a density with
respect to some product measure on S × E. See Fubini's Theorem 1.53 for the definition of
the product measure. More precisely, we assume that the probability distribution of (Y, V) is given
by f(Y,V)(y, v) µ(dy)ν(dv), where µ and ν are σ-finite measures respectively on (S, S) and
(E, E), and f(Y,V) is a [0, +∞]-valued measurable function such that ∫ f(Y,V) dµ ⊗ ν = 1. In
this setting, we give a closed formula for E[X|V] when X = ϕ(Y, V), with ϕ a real-valued
measurable function defined on S × E endowed with the product σ-field.

According to Fubini's theorem, V has probability distribution fV ν, with density (with respect
to the measure ν) given by fV(v) = ∫ f(Y,V)(y, v) µ(dy), and Y has probability distribution
fY µ, with density (with respect to the measure µ) given by fY(y) = ∫ f(Y,V)(y, v) ν(dv).
We now define the law of Y conditionally on V.

Definition 2.15. The probability distribution of Y conditionally on {V = v}, with v ∈ E
such that fV(v) ∈ (0, +∞), is defined by fY|V(y|v) µ(dy), with its density fY|V (with respect
to the measure µ) given by:

fY|V(y|v) = f(Y,V)(y, v)/fV(v), y ∈ S.

By convention, we set fY|V(y|v) = 0 if fV(v) ∉ (0, +∞).

Thanks to Fubini's theorem, we get that, for v such that fV(v) ∈ (0, +∞), the function
y ↦ fY|V(y|v) is a density, as it is non-negative and ∫ fY|V(y|v) µ(dy) = 1.
We now give the expectation of X = ϕ(Y, V ), for some function ϕ, conditionally on V .

Proposition 2.16. Let (E, E, ν) and (S, S, µ) be measured spaces such that ν and µ are σ-finite.
Let (Y, V) be an S × E-valued random variable with density (y, v) ↦ f(Y,V)(y, v) with
respect to the product measure µ(dy)ν(dv). Let ϕ be a real-valued measurable function defined
on S × E and set X = ϕ(Y, V). Assume that E[X] is well defined. Then we have that a.s.
E[X|V] = g(V), with:

g(v) = ∫ ϕ(y, v)fY|V(y|v) µ(dy). (2.7)

Proof. Let A ∈ σ(V). The function 1A is σ(V)-measurable, and thus, thanks to Proposition
1.25, there exists a measurable function h such that 1A = h(V). Since fV is a density, we
get that ∫ 1{fV ∉(0,+∞)} fV dν = 0. We have:

E[X1A] = E[ϕ(Y, V)h(V)]
= ∫ ϕ(y, v)h(v)f(Y,V)(y, v) µ(dy)ν(dv)
= ∫ ϕ(y, v)h(v)f(Y,V)(y, v)1{fV(v)∈(0,+∞)} µ(dy)ν(dv)
= ∫ h(v) (∫ ϕ(y, v)fY|V(y|v) µ(dy)) fV(v)1{fV(v)∈(0,+∞)} ν(dv)
= ∫ h(v)g(v)fV(v) ν(dv)
= E[g(V)h(V)] = E[g(V)1A],

where we used that dµ ⊗ ν-a.e.:

f(Y,V)(y, v) = f(Y,V)(y, v)1{fV(v)∈(0,+∞)},

as ∫ 1{fV ∉(0,+∞)} f(Y,V) dµ ⊗ ν = ∫ 1{fV ∉(0,+∞)} fV dν = 0, for the third equality; the definition
of fY|V and Fubini's theorem for the fourth; and the definition of g given by (2.7) for the fifth.
Using (2.1) and Lemma 2.3, we deduce that a.s. g(V) = E[X|V].

2.3.3 Elements on the conditional distribution


We shall present in this section some elementary notions on the conditional distribution.
Let (E, E, ν) be a measured space and (S, S) a measurable space. A probability kernel κ
is a [0, 1]-valued function defined on E × S such that: for all v ∈ E, the map A ↦ κ(v, A)
is a measure on (S, S); for all A ∈ S, the map v ↦ κ(v, A) is measurable; and ν(dv)-a.e.
κ(v, S) = 1. It is left to the reader to prove that for any [0, +∞]-valued measurable function
ϕ defined on S × E, the map v ↦ ∫ ϕ(y, v) κ(v, dy) is measurable.
Definition 2.17. Let (Y, V ) be an S × E-valued random variable, such that the distribution
of V has a density1 with respect to the measure ν. The probability kernel κ is the conditional
distribution of Y given V if a.s.:

P(Y ∈ A|V ) = κ(V, A) for all A ∈ S.

If the probability kernel κ is the conditional distribution of Y given V, then arguing as in
the proof of Fubini's Theorem 1.53, we get that for any [0, +∞]-valued measurable function
ϕ defined on S × E, a.s.:

E[ϕ(Y, V)|V] = g(V) with g(v) = ∫ ϕ(y, v) κ(v, dy).

The existence of a probability kernel² allows one to give a representation of the conditional
expectation which holds simultaneously for all nice functions ϕ (but outside a set of probability
0 for V). When V is a discrete random variable, Remark 2.13 states that the kernel κ given
by κ(v, dx) = 1{v∈∆V} PX|v(dx) is, with ν = PV, the conditional distribution of X given V.

¹ If one takes ν = PV, then the density is constant equal to 1.
² The existence of the conditional distribution of Y, taking values in S, given V can be proven under some
topological property of the space (S, S). See Theorem 5.3 in O. Kallenberg. Foundations of modern probability.
Springer-Verlag, 2002.
Example 2.18. In Example 2.14, with P = Sn/n, the conditional distribution of X1 given P
is the Bernoulli distribution with parameter P. This corresponds to the kernel κ(p, dx) =
(1 − p)δ0(dx) + pδ1(dx). (Notice one only needs to consider p ∈ [0, 1].)
In Exercise 9.20, the conditional distribution of Y given V is the uniform distribution on
[0, V]. This corresponds to the kernel κ(v, dy) = v−1 1[0,v](y) λ(dy). (Notice one only needs
to consider v ∈ (0, +∞).) △
Chapter 3

Discrete Markov chains

A Markov chain is a sequence of random variables X = (Xn , n ∈ N) which represents the


dynamical evolution (in discrete time) of a stochastic system: Xn represents the state of the
system at time n. The fundamental property of a Markov chain is that the evolution after time
n of the system depends on the previous states only through the state of the system at time
n. In other words, conditionally on Xn , (X0 , . . . , Xn ) and (Xn+k , k ∈ N) are independent.
Markov chains appear naturally in a large variety of domains: networks, population genetics,
mathematical finance, stock management, stochastic optimization algorithms, simulations,
etc. We shall be interested in the asymptotic behavior of X for large times. In what follows,
we assume that the state space is at most countable.
We give in Section 3.1 the definition and the first properties of the Markov chains. Then,
we consider invariant measures in Section 3.2. We characterize the states of the Markov chain
and introduce the notion of irreducible chain in Section 3.3. Intuitively, an irreducible chain
has a positive probability starting from one state to go in one or many steps to any other state.
The ergodic theorems from Section 3.4 give the asymptotic behavior of an irreducible Markov
chain for large time. They are among the most interesting results on Markov chains. Their
proof is postponed to Section 3.4.3. In Section 3.5, we present and analyze some well known
applications of Markov chains. We refer to [7, 3] for a recent and very detailed presentation
of Markov chains.
We shall consider that all the random variables of this chapter are defined on a probability
space (Ω, F, P). Throughout this chapter, we shall consider a discrete state space¹ E (not reduced
to one state), with the σ-field E = P(E).

3.1 Definition and properties


Let X = (Xn , n ∈ N) be a sequence of E-valued random variables, which will be seen as a
process, Xn being the state of the process at time n. We represent the information available
at time n ∈ N by a σ-field Fn , which is non-decreasing with n.

¹ The set E is discrete if E is at most countable and all x ∈ E are isolated, that is, all subsets of E are open and
closed. For example, the set N with the Euclidean distance is a discrete set, while the set {0} ∪ {1/k, k ∈ N∗}
with the Euclidean distance is not a discrete set, as the set {0} is not open.


Definition 3.1. A filtration F = (Fn, n ∈ N) (with respect to the measurable space (Ω, F))
is a sequence of σ-fields such that Fn ⊂ Fn+1 ⊂ F for all n ∈ N. An E-valued process
X = (Xn, n ∈ N) is F-adapted if Xn is Fn-measurable for all n ∈ N.

In the setting of stochastic process, one usually (but not always) chooses the natural
filtration F = (Fn , n ∈ N) which is generated by X: for all n ∈ N, Fn is the σ-field generated
by (X0 , . . . , Xn ) and the P-null sets. This obviously implies that X is F-adapted.
A Markov chain is a process such that, conditionally on the process at time n, the past
before time n and the evolution of the process after time n are independent.
Definition 3.2. The process X = (Xn , n ∈ N) is a Markov chain with respect to the filtration
F = (Fn , n ∈ N) if it is adapted and it has the Markov property: for all n ∈ N, conditionally
on Xn , Fn and (Xk , k ≥ n) are independent, that is for all A ∈ Fn and B ∈ σ(Xk , k ≥ n),
P(A ∩ B| Xn ) = P(A| Xn )P(B| Xn ).

In the previous definition, we shall omit to mention the filtration when it is the natural
filtration. Since X is adapted to F, if X is a Markov chain with respect to F, it is also a
Markov chain with respect to its natural filtration.

We give equivalent definitions for being a Markov chain.


Proposition 3.3. Let X = (Xn, n ∈ N) be an E-valued process adapted to the filtration
F = (Fn, n ∈ N). The following properties are equivalent.

(i) X is a Markov chain.

(ii) For all n ∈ N and B ∈ σ(Xk , k ≥ n), we have a.s. P(B| Fn ) = P(B| Xn ).

(iii) For all n ∈ N and y ∈ E, we have a.s. P(Xn+1 = y| Fn ) = P(Xn+1 = y| Xn ).

Proof. That property (i) implies property (ii) is a direct consequence of Exercise 9.18 (with
A = Fn , B = σ(Xk , k ≥ n) and H = σ(Xn )). Let us check that property (ii) implies property
(i). Assume property (ii). Let A ∈ Fn and B ∈ σ(Xk , k ≥ n). A.s. we have, using property
(ii) for the second equality:

P(A ∩ B| Xn ) = E [1A E [1B | Fn ] | Xn ] = E [1A E [1B | Xn ] | Xn ] = P(A| Xn )P(B| Xn ).

This gives property (i).

Taking B = {Xn+1 = y} in property (ii) gives property (iii). We now assume property
(iii) holds, and we prove property (ii). As σ(Xk , k ≥ n) is generated by the events Bk =
{Xn = y0 , . . . , Xn+k = yk } where k ∈ N and y0 , . . . , yk ∈ E, we deduce from the monotone
class theorem, and more precisely Corollary 1.14, that, to prove (ii), it is enough to prove
that a.s.
P(Bk | Fn ) = P(Bk | Xn ). (3.1)

We shall prove this by induction. Notice (3.1) is true for k = 1 thanks to (iii). Assume that
(3.1) is true for k ≥ 1. Then, we have a.s.:

P(Bk+1 | Fn) = E[1Bk+1 | Fn] = E[E[1{Xn+k+1 = yk+1} 1Bk | Fn+k] | Fn]
= E[P(Xn+k+1 = yk+1 | Fn+k) 1Bk | Fn]
= E[P(Xn+k+1 = yk+1 | Xn+k) 1Bk | Fn]
= P(Xn+k+1 = yk+1 | Xn+k = yk) P(Bk | Fn)
= P(Xn+k+1 = yk+1 | Xn+k = yk) P(Bk | Xn),

where we used that Bk ∈ Fn+k for the second equality, property (iii) for the third, that
Xn+k = yk on Bk and, see Corollary 2.12, that a.s. P(Xn+k+1 = yk+1 | Xn+k)1{Xn+k = yk} =
P(Xn+k+1 = yk+1 | Xn+k = yk)1{Xn+k = yk} for the fourth, and the induction hypothesis for the last.
In particular, we deduce that P(Bk+1 | Fn) is σ(Xn)-measurable. This readily implies that
P(Bk+1 | Fn) = P(Bk+1 | Xn), and thus (3.1) is true for k replaced by k + 1. This ends the
proof of property (ii).
As an immediate consequence of this proposition, using property (iv) of Proposition 2.9
and that σ(X0, . . . , Xn) ⊂ Fn, we deduce that for a Markov chain X = (Xn, n ∈ N):

a.s. P(Xn+1 = y | Fn) = P(Xn+1 = y | X0, . . . , Xn) = P(Xn+1 = y | Xn). (3.2)

Unless specified otherwise, we shall consider that F is the natural filtration of X.
Example 3.4. We present the example of the simple random walk, which has been (and still
is) thoroughly studied, see [8, 6]. We take E = Z. Let p ∈ (0, 1) and U = (Un, n ∈ N∗) be
independent random variables taking values in {−1, 1} with the same distribution P(Un =
1) = 1 − P(Un = −1) = p. Let X0 be a Z-valued random variable independent of U and set
for n ∈ N∗:

Xn = X0 + Σ_{k=1}^n Uk.

By construction, property (iii) from Proposition 3.3 holds, as P(Xn+1 = y| Fn) =
P(Un+1 = y − Xn| Fn) = ϕ(y − Xn) with ϕ(z) = P(Un+1 = z), since Un+1 is independent
of Fn, thanks to (9.1) with Y = Un+1, V = Xn and H = Fn. Thus the process X is a
Markov chain. △
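A one-line simulation of this walk (a sketch, with X0 = 0 and p = 1/2 as arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(5)
p, n = 0.5, 1000

# One trajectory of the simple random walk started at 0:
# each increment is +1 with probability p and -1 otherwise.
U = np.where(rng.random(n) < p, 1, -1)
X = np.concatenate(([0], np.cumsum(U)))
print(X[:10])
```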
Motivated by this example, we have the following lemma, whose proof is similar and left
to the reader.
Lemma 3.5. Let (S, S) be a measurable space. Let U = (Un, n ∈ N∗) be a sequence of
independent S-valued random variables. Let X0 be an E-valued random variable independent
of U. Let f be a measurable function defined on E × S taking values in E. The stochastic
dynamical system X = (Xn, n ∈ N) defined by Xn+1 = f(Xn, Un+1) for n ∈ N is a Markov
chain.
The sequence U in Lemma 3.5 is called the sequence of innovations. In what follows, we link
the Markov chains with the matrix formalism.

Definition 3.6. A matrix P = (P(x, y), x, y ∈ E) on E is stochastic if: P(x, y) ≥ 0 for all
x, y ∈ E, and Σ_{y∈E} P(x, y) = 1 for all x ∈ E.
In view of (3.2), it is natural to focus on the transition probability P(Xn+1 = y | Xn ).

Definition 3.7. A Markov chain X on E has transition matrices (Pn , n ∈ N∗ ) if (Pn , n ∈ N∗ )


is a sequence of stochastic matrices on E and for all y ∈ E a.s.:

P(Xn+1 = y | Xn ) = Pn+1 (Xn , y). (3.3)

The Markov chain is called homogeneous when the sequence (Pn , n ∈ N∗ ) is constant, and its
common value, say P , is then called the2 transition matrix of X.

The transition matrix of the simple random walk described in Example 3.4 is given by
P(x, y) = 0 if |x − y| ≠ 1, P(x, x + 1) = p and P(x, x − 1) = 1 − p for x, y ∈ Z.
Unless specified otherwise, we shall consider homogeneous Markov chains.
The next proposition states that the transition matrix and the initial distribution char-
acterize the distribution of the Markov chain.

Proposition 3.8. The distribution of a (homogeneous) Markov chain X = (Xn, n ∈ N) is
characterized by its transition matrix, P, and the initial probability distribution, µ0, of X0.
Moreover, we have for all n ∈ N∗, x0, . . . , xn ∈ E:

P(X0 = x0, . . . , Xn = xn) = µ0(x0) ∏_{k=1}^n P(xk−1, xk). (3.4)

In order to stress the dependence of the distribution of the Markov chain X on the
probability distribution µ0 of X0 , we may write Pµ0 and Eµ0 . When µ0 is simply the Dirac
mass at x (that is P(X0 = x) = 1), then we simply write Px and Ex and say the Markov
chain is started at x.
Proof. We have, for k ∈ N and x0, . . . , xk+1 ∈ E, with Bk = {X0 = x0, . . . , Xk = xk}:

P(Xk+1 = xk+1, Bk) = E[E[1{Xk+1 = xk+1} 1Bk | Fk]]
= E[P(Xk+1 = xk+1 | Fk) 1Bk]
= E[P(Xk, xk+1) 1Bk]
= P(xk, xk+1)P(Bk),

where we used that Bk ∈ Fk for the second equality, (3.2) and (3.3) for the third, and that Xk = xk on Bk
for the last. We then deduce that (3.4) holds by induction.

Use that {(x0 , . . . , xn )} for x0 , . . . , xn ∈ E generates the product σ-field on E n+1 and
Lemma 1.29 to deduce that the left hand side of (3.4) for all n ∈ N and x0 , . . . , xn ∈ E
characterizes the distribution of X. We then deduce from (3.4) that the distribution of X is
characterized by P and µ0 .

We now give some examples of Markov chains.


Example 3.9. If the process X = (Xn, n ∈ N) is a sequence of independent random variables
with distribution π, then X is a Markov chain with transition matrix P given by P(x, y) =
π(y) for all x, y ∈ E. △
² There is a slight abuse here, as (3.3) might not characterize P. Indeed, if P(Xn = x) = 0 for some x ∈ E
and all n ∈ N∗, then P(x, ·) is not characterized by (3.3). This shall however not be troublesome.

Example 3.10. Let Xn be the number of items in a stock at time n, Dn+1 the random
consumer demand, and q ∈ N∗ the deterministic quantity of items produced between periods
n and n + 1. Considering the stock at time n + 1, we get:

Xn+1 = (Xn + q − Dn+1)+.

If the demand D = (Dn, n ∈ N∗) is a sequence of independent random variables with the
same distribution, independent of X0, then according to Lemma 3.5, the stochastic dynamical
system X = (Xn, n ∈ N) is a Markov chain. Its transition matrix is given by: P(x, y) =
P(D1 = k) if y = x + q − k > 0, and P(x, 0) = P(D1 ≥ x + q), for x, y ∈ N. Figure 3.1 represents
some simulations of the process X for different probability distributions of the demand. △

10 20

8 16

6 12

4 8

2 4

0 0
0 250 500 750 1000 0 250 500 750 1000

Figure 3.1: Simulations of the random evolution of a stock with dynamics Xn+1 =
(Xn + q − Dn+1)+, where X0 = 0, q = 3, and the random variables (Dn, n ∈ N∗) are
independent with Poisson distribution of parameter θ (θ = 4 on the left and θ = 3 on the right).
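A simulation sketch in the spirit of Figure 3.1 (the parameters mirror the caption; this is illustrative code, not the code that produced the figure):

```python
import numpy as np

rng = np.random.default_rng(6)
n, q, theta, x0 = 1000, 3, 4.0, 0

# Stock dynamics X_{n+1} = (X_n + q - D_{n+1})^+ with i.i.d. Poisson(theta) demand.
D = rng.poisson(theta, size=n)
X = np.empty(n + 1, dtype=int)
X[0] = x0
for k in range(n):
    X[k + 1] = max(X[k] + q - D[k], 0)
print(X[:20])
```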

Remark 3.11. Even if a Markov chain is not a stochastic dynamical system, it is distributed
as one. Indeed let µ0 be a probability distribution on E and P a stochastic matrix on E.
Let X0 be an E-valued random variable with distribution µ0 and (Un, n ∈ N) be a sequence of
independent random variables, independent of X0, distributed as U = (U(x), x ∈ E), where
the U(x) are independent E-valued random variables such that U(x) has distribution P(x, ·).
Then the stochastic dynamical system (Xn , n ∈ N), defined by Xn+1 = Un+1 (Xn ) for n ∈ N,
is a Markov chain on E with initial distribution µ0 and transition matrix P . ♦
The next corollary is a consequence of the Markov property.
Corollary 3.12. Let X = (Xn, n ∈ N) be a Markov chain with respect to the filtration
F = (Fn, n ∈ N), taking values in a discrete state space E and with transition matrix P. Let
n ∈ N and define the shifted process X̃ = (X̃k = Xn+k, k ∈ N). Conditionally on Xn, we
have that Fn and X̃ are independent and that X̃ is a Markov chain with transition matrix P
started at Xn, which means that a.s. for all k ∈ N, all x0, . . . , xk ∈ E:

P(X̃0 = x0, . . . , X̃k = xk | Fn) = P(X̃0 = x0, . . . , X̃k = xk | Xn)
= 1{Xn = x0} ∏_{j=1}^k P(xj−1, xj). (3.5)

Notice that in the previous corollary the initial distributions of the Markov chains X and
X̃ are a priori not the same.
Proof. By definition of a Markov chain, we have that, conditionally on Xn, Fn and X̃ are
independent. So, we only need to prove that:

P(X̃0 = x0, . . . , X̃k = xk | Fn) = 1{Xn = x0} ∏_{j=1}^k P(xj−1, xj). (3.6)

Set Bj = {X̃0 = x0, . . . , X̃j = xj} = {Xn = x0, . . . , Xn+j = xj} for j ∈ {0, . . . , k}. Using
(3.2) and Definition 3.7 with n replaced by n + j, we get for j ∈ {0, . . . , k − 1} that:

E[1Bj+1 | Fn+j] = E[1{Xn+j+1 = xj+1} 1Bj | Fn+j] = P(Xn+j, xj+1) 1Bj = P(xj, xj+1) 1Bj,

where we used that Xn+j = xj on Bj for the last equality. This implies that P(Bj+1 | Fn) =
P(xj, xj+1) P(Bj | Fn). Thus, we deduce that (3.6) holds by induction. Then, conclude using
Proposition 3.8 on the characterization of the distribution of a Markov chain.
In the setting of Markov chains, computing probability distributions or expectations reduces
to elementary linear algebra on E. Let P and Q be two matrices defined on E
with non-negative entries. We denote by PQ the matrix on E defined by PQ(x, y) =
Σ_{z∈E} P(x, z)Q(z, y) for x, y ∈ E. It is easy to check that if P and Q are stochastic, then
PQ is also stochastic. We set P0 = IE, the identity matrix on E, and for k ≥ 1, Pk = Pk−1 P
(or equivalently Pk = P Pk−1).
For a line vector µ = (µ(x), x ∈ E) with non-negative entries, which we shall see as
a measure on E, we denote by µP the line vector (µP(y), y ∈ E) defined by µP(y) =
Σ_{x∈E} µ(x)P(x, y). For a column vector f = (f(y), y ∈ E) with real entries, which we shall
see as a function defined on E, we denote by Pf or P(f) the column vector (Pf(x), x ∈ E)
defined by Pf(x) = Σ_{y∈E} P(x, y)f(y). Notice this last quantity, and thus Pf, is well defined
as soon as, for all x ∈ E, P(f+)(x) or P(f−)(x) is finite. We also write µf = (µ, f) =
Σ_{x∈E} µ(x)f(x) for the integral of the function f with respect to the measure µ, when it is well
defined.
We shall consider a measure µ = (µ(x) = µ({x}), x ∈ E) on E as a line vector with non-negative
entries. For A ⊂ E, we set µ(A) = Σ_{x∈A} µ(x), so that µ is a probability measure
if Σ_{x∈E} µ(x) = 1. We shall also consider a real-valued function f = (f(x), x ∈ E) defined
on E as a column vector. The next result gives explicit formulas to compute (conditional)
expectations and distributions.
Proposition 3.13. Let X = (Xn , n ∈ N) be a Markov chain with transition matrix P .
Denote by µn the probability distribution of Xn for n ∈ N. Let f be a bounded or non-
negative function. We have for n ∈ N∗ :
(i) µn = µ0 P n ,

(ii) E[f (Xn )] = µn f = µ0 P n f and Ex [f (Xn )] = P n f (x) for x ∈ E,

(iii) E[f (Xn )| Fn−1 ] = P f (Xn−1 ) a.s.,

(iv) E[f (Xn )| X0 ] = P n f (X0 ) a.s.,



(v) P(Xn+k = y| Fn ) = P k (Xn , y) a.s. for all k ∈ N, y ∈ E.


Proof. Summing (3.4) over x0 , . . . , xn−1 ∈ E gives property (i). Property (ii) is a direct
consequence of property (i). Using that P(Xn = y | Fn−1 ) = P (Xn−1 , y), see (3.2) and (3.3),
multiplying by f (y) and summing over y ∈ E gives property (iii). Iterating (iii) leads to
E[f (Xn )| F0 ] = P n f (X0 ), which implies (iv) as a.s. E[f (Xn )| X0 ] = E[E[f (Xn )| F0 ] | X0 ].
Iterating (iii) leads also to E[f (Xn+k )| Fn ] = P k f (Xn ) a.s., and then take f = 1{y} .
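Properties (i) and (ii) translate directly into matrix computations; here is a minimal numpy sketch on a hypothetical 3-state chain:

```python
import numpy as np

# A hypothetical stochastic matrix on E = {0, 1, 2} and an initial law mu0.
P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.6, 0.3],
              [0.2, 0.0, 0.8]])
mu0 = np.array([1.0, 0.0, 0.0])

# Property (i): the law of X_n is mu_n = mu0 P^n (line vector times matrix).
n = 20
mun = mu0 @ np.linalg.matrix_power(P, n)
print(mun, mun.sum())  # a probability vector

# Property (ii): E[f(X_n)] = mu_n f, with f seen as a column vector.
f = np.array([0.0, 1.0, 2.0])
print(mun @ f)
```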

Example 3.14. Let (Un, n ∈ N∗) be a sequence of independent Bernoulli random variables
with the same parameter p ∈ (0, 1). Let pℓ,n be the probability of observing a run of consecutive
1's of length at least ℓ in the sequence U1 . . . Un. It is very simple to get a closed formula for pℓ,n
using the formalism of Markov chains³,⁴. We consider the Markov chain X = (Xn, n ∈ N)
defined by X0 = 0 and Xn+1 = (Xn + 1)1{Un+1=1, Xn<ℓ} + ℓ1{Xn=ℓ} for n ∈ N. As soon as we
observe a run of consecutive 1's of length ℓ, the process X stays equal to ℓ.
In particular, we have pℓ,n = P(Xn = ℓ) = Pn(0, ℓ), where P is the transition matrix of the
Markov chain X. The transition matrix is given by P(x, 0) = 1 − p and P(x, x + 1) = p for
x ∈ {0, . . . , ℓ − 1}, P(ℓ, ℓ) = 1, and all the other entries of P are zero. We give the values of
pℓ,n for n = 100 and p = 1/2 in Figure 3.2. In particular, for p = 1/2, we get a probability
larger than 1/2 to observe a run of 6 consecutive 1's in a sequence of length 100. △


Figure 3.2: Graph of the function x ↦ P(Ln ≥ ⌊x⌋), with Ln the maximal length of the
runs of consecutive 1's in a sequence of length n = 100 of independent Bernoulli random
variables with parameter p = 1/2.
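The values plotted in Figure 3.2 can be recomputed in a few lines, following Example 3.14 (a sketch of the matrix-power computation pℓ,n = Pn(0, ℓ)):

```python
import numpy as np

def run_probability(length, n, p):
    # P(a run of 1's of length >= `length` occurs among n Bernoulli(p) trials),
    # computed as P^n(0, length) for the chain of Example 3.14.
    P = np.zeros((length + 1, length + 1))
    P[:length, 0] = 1 - p          # a 0 resets the current run
    for x in range(length):
        P[x, x + 1] = p            # a 1 extends the current run
    P[length, length] = 1.0        # absorbing state: the run has been observed
    return np.linalg.matrix_power(P, n)[0, length]

print(run_probability(6, 100, 0.5))  # larger than 1/2
```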

3.2 Invariant probability measures, reversibility


Invariant probability measures appear naturally in the asymptotic study of Markov chains at
large times.
³ L. Guibas and A. Odlyzko. String overlaps, pattern matching, and nontransitive games. J. Combin.
Theory Ser. A, vol. 30(2), pp. 183-208, 1981.
⁴ J. Fu and V. Koutras. Distribution theory of runs: a Markov chain approach. J. Amer. Statist. Assoc.,
vol. 89(427), pp. 1050-1058, 1994.

Definition 3.15. A probability measure π is invariant for a stochastic matrix P if π = πP .


It is also called a stationary probability measure.

Let X = (Xn, n ∈ N) be a Markov chain with transition matrix P whose starting probability
measure µ0 = π is an invariant probability measure for P. Denote by µn the probability
distribution of Xn. We have µ1 = πP = π and, by induction, µn = π for all n ∈ N∗.
This means that Xn is distributed as X0: the distribution of Xn is stationary, that is, constant
in time.
Remark 3.16. Let X = (Xn, n ∈ N) be a Markov chain with transition matrix P whose initial
distribution µ0 = π is an invariant probability measure for P. For simplicity, let us
assume further that π(x) > 0 for all x ∈ E. For x, y ∈ E, we set:

Q(x, y) = π(y)P(y, x)/π(x). (3.7)

Since π is an invariant probability measure, we have Σ_{y∈E} Q(x, y) = 1 for all x ∈ E. Thus
the matrix Q is stochastic. Notice that π is also an invariant probability measure for Q. For
x, y ∈ E, n ∈ N, we have:

Pπ(Xn = y|Xn+1 = x) = Pπ(Xn = y, Xn+1 = x)/Pπ(Xn+1 = x) = Q(x, y).

More generally, it is easy to check that for all n ∈ N, x0, . . . , xn ∈ E:

Pπ(Xn = x0, . . . , X0 = xn) = π(x0) ∏_{k=1}^n Q(xk−1, xk).

In other words, (Xn, Xn−1, . . . , X0) is distributed under Pπ as the first n steps of a Markov
chain with transition matrix Q and initial distribution π. Intuitively, the time reversal of
the process X under π is a Markov chain with transition matrix Q. ♦
There is an important particular case where a probability measure π is invariant for a
stochastic matrix.

Definition 3.17. A stochastic matrix P is reversible with respect to a probability measure π


if for all x, y ∈ E:
π(x)P (x, y) = π(y)P (y, x). (3.8)
A Markov chain X is reversible with respect to a probability measure π if its transition matrix
is reversible with respect to π.

Summing (3.8) over x ∈ E, we deduce the following lemma.


Lemma 3.18. If a stochastic matrix P is reversible with respect to a probability measure π,
then π is an invariant probability measure for P .

See the Ehrenfest urn model and the Metropolis-Hastings algorithm in Section 3.5 for
examples of reversible Markov chains.
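As an illustration, one can check (3.8) and Lemma 3.18 numerically on the Ehrenfest urn (a sketch; the transition matrix below and its reversible binomial measure anticipate Section 3.5):

```python
import numpy as np
from math import comb

N = 10
P = np.zeros((N + 1, N + 1))
for x in range(N + 1):
    if x > 0:
        P[x, x - 1] = x / N        # move a ball out of the urn
    if x < N:
        P[x, x + 1] = (N - x) / N  # move a ball into the urn
pi = np.array([comb(N, x) for x in range(N + 1)]) / 2**N  # Binomial(N, 1/2)

# Detailed balance (3.8): pi(x) P(x, y) = pi(y) P(y, x) for all x, y,
# i.e. the matrix M(x, y) = pi(x) P(x, y) is symmetric.
M = pi[:, None] * P
print(np.allclose(M, M.T))
# Hence pi is invariant (Lemma 3.18): pi P = pi.
print(np.allclose(pi @ P, pi))
```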

Remark 3.19. If P in Remark 3.16 is also reversible with respect to the probability mea-
sure π, then we get P = Q. Therefore, under Pπ , we get that (X0 , . . . , Xn−1 , Xn ) and
(Xn , Xn−1 , . . . , X0 ) have the same distribution. We give a stronger statement in the next
Remark. ♦
Remark 3.20. Let P be a stochastic matrix on E, reversible with respect to a probability
measure π. The following construction is inspired by Remark 3.11. Let (Un, n ∈ Z∗) be
a sequence of independent random variables distributed as U = (U(x), x ∈ E), where the
U(x) are independent E-valued random variables and U(x) is distributed as P(x, ·). Let X0 be an
E-valued random variable independent of (Un, n ∈ Z∗) with distribution π. For n ∈ N, set
Xn+1 = Un+1(Xn) and X−(n+1) = U−(n+1)(X−n). Then the process X = (Xn, n ∈ Z) can be
seen as a Markov chain with time index Z instead of N in Definition 3.2 (the proof of this
fact is left to the reader). We deduce from Remark 3.16 that X̃ = (X̃n = X−n, n ∈ Z) is then
also a Markov chain with time index Z. It is called the time reversal process of X. One can
easily check that its transition matrix is P, so that X and X̃ have the same distribution. ♦

3.3 Irreducibility, recurrence, transience, periodicity


Let P be a stochastic matrix on E and X = (Xn , n ∈ N) be a Markov chain with transition
matrix P . Recall E is a finite or countable discrete space with at least two elements.

3.3.1 Communicating classes


In order to study the longtime behavior of the Markov chain X, we shall decompose the state
space E in subsets on which the study of X will be easier.
We introduce some definitions. A state y is accessible from a state x, which we shall write
x → y, if Pn(x, y) > 0 for some n ∈ N, or equivalently Px(Xn = y for some n ∈ N) > 0. Since
P0 = IE, the identity matrix on E, we get that x → x. The states x and y communicate,
which we shall write as x ↔ y, if they are accessible from each other (that is, x → y and
y → x). It is clear that "to communicate with" is an equivalence relation, and we denote by
Cx the equivalence class of x. The communicating classes form a partition of the state space
E. Notice the communicating classes are completely determined by the zeros of the transition
matrix P. We say the Markov chain X is irreducible if all states communicate with each
other, that is, E is the only communicating class.

A communicating class C is called closed if for all x ∈ C we have that x → y implies
y ∈ C (that is, x ↔ y), and open otherwise. Intuitively, when a Markov chain reaches a closed
communicating class, it stays there. A state x ∈ E is called an absorbing state if Cx = {x}
and Cx is closed. Equivalently, the state x is an absorbing state if and only if P(x, x) = 1,
and thus Px(Xn = x for all n ∈ N) = 1. In particular, a Markov chain with an absorbing
state is not irreducible.
Example 3.21. In Example 3.4, the simple random walk is an irreducible Markov chain with
state space Z (that is, Z is a closed communicating class).
In Example 3.14, the state ℓ is an absorbing state, and {0, . . . , ℓ − 1} is an open communicating
class.
The Markov chain in the Ehrenfest urn model, see Section 3.5, is irreducible. The
Markov chain of the Wright-Fisher model, see Section 3.5, has two absorbing states, 0 and
N, and one open communicating class {1, . . . , N − 1}. △

3.3.2 Recurrence and transience


We use the convention inf ∅ = +∞. We define the (first) return time to x ∈ E for the Markov
chain X by:

Tx = inf{n ≥ 1; Xn = x}.

Definition 3.22. Let X be a Markov chain on E. The state x ∈ E is transient if Px(Tx =
∞) > 0, and recurrent (or persistent) otherwise. The Markov chain is transient (resp.
recurrent) if all the states are transient (resp. recurrent).

We set Nx = Σ_{n∈N} 1{Xn=x}, the number of visits of the state x. The next proposition
gives a characterization of transience and recurrence.
Proposition 3.23. Let X be a Markov chain on E with transition matrix P.

(i) Let x ∈ E be recurrent. Then we have Px(N^x = ∞) = 1 and ∑_{n∈N} P^n(x, x) = +∞.

(ii) Let x ∈ E be transient. Then we have Px(N^x < ∞) = 1 and ∑_{n∈N} P^n(x, x) < +∞, and N^x has under Px a geometric distribution with parameter Px(T^x = ∞). And for every probability measure ν on E, we have Pν(N^x < ∞) = 1. Furthermore, if π is an invariant measure for P, then π(x) = 0.
(iii) The elements of the same communicating class are either all transient or all recurrent.
(iv) The elements of an open communicating class are transient.

To have a complete picture, in view of property (iv) above, we shall study closed commu-
nicating classes (see Remark 3.25 below for a first result in this direction). For this reason,
we shall consider Markov chains started in a closed communicating class. This amounts to studying irreducible Markov chains, as a Markov chain started in a closed communicating class
remains in it.

Proof. We set p = Px(T^x = ∞) = Px(N^x = 1). Notice that {T^x < ∞} = {N^x > 1} under Px. By decomposing according to the possible values of T^x, we get for n ∈ N:

Px(N^x > n + 1) = ∑_{r∈N∗} Px(N^x > n + 1, T^x = r)
 = ∑_{r∈N∗} Px(T^x = r, X_r = x, ∑_{ℓ∈N} 1{X_{r+ℓ} = x} > n)
 = ∑_{r∈N∗} Px(T^x = r, X_r = x) Px(∑_{ℓ∈N} 1{X_ℓ = x} > n)
 = Px(T^x < ∞) Px(N^x > n)    (3.9)
 = (1 − p) Px(N^x > n),

where we used the Markov property at time r for the third equality. Using that Px(N^x > 0) = 1, we deduce that Px(N^x > n) = (1 − p)^n for n ∈ N. This gives that N^x has under Px a geometric distribution with parameter p ∈ [0, 1]. Notice also that Ex[N^x] = ∑_{n∈N} Px(X_n = x) = ∑_{n∈N} P^n(x, x), which is finite if and only if p > 0. Thus, if x is transient, then p > 0 and we get Px(N^x < ∞) = 1 and Ex[N^x] is finite. And, if x is recurrent, then p = 0 and we get Px(N^x < ∞) = 0 and Ex[N^x] is infinite. This proves property (i) and the first part of property (ii).
We prove the second part of property (ii). Let ν be a probability measure on E. As x is transient, by decomposing according to the values of T^x and using the Markov property for the first equality, we get:

Pν(N^x = +∞) = ∑_{n∈N∗} Pν(T^x = n) Px(N^x = +∞) = 0,

that is, Pν(N^x < ∞) = 1. Let π be an invariant measure. Use Pπ(N^x < ∞) = 1 to get that:

Pπ-a.s.  lim_{n→∞} (1/n) ∑_{k=1}^{n} 1{X_k = x} = lim_{n→∞} (1/n) N^x = 0.

Since (1/n) ∑_{k=1}^{n} 1{X_k = x} is bounded by 1, we deduce that lim_{n→∞} (1/n) ∑_{k=1}^{n} Pπ(X_k = x) = 0 by dominated convergence. As π is invariant, we get that Pπ(X_k = x) = πP^k(x) = π(x). We deduce that π(x) = 0. This finishes the proof of property (ii).
We prove property (iii). Let x, y be two elements of the same communicating class. In particular, there exist n1, n2 ∈ N such that P^{n1}(y, x) > 0 and P^{n2}(x, y) > 0. We deduce that for all n ∈ N:

P^{n+n1+n2}(y, y) ≥ P^{n1}(y, x) P^{n}(x, x) P^{n2}(x, y),    (3.10)
P^{n+n1+n2}(x, x) ≥ P^{n2}(x, y) P^{n}(y, y) P^{n1}(y, x).    (3.11)

This implies that the sums ∑_{n∈N} P^n(x, x) and ∑_{n∈N} P^n(y, y) are both either converging or diverging. Thanks to properties (i) and (ii), we get that either x and y are both transient or both recurrent. This gives (iii).
We now prove property (iv). If C is an open communicating class, then there exist x ∈ C
and y 6∈ C such that P (x, y) > 0. Since x is not accessible from y, we get Py (T x = ∞) = 1.
Using the Markov property, we get that Px (T x = ∞) ≥ P (x, y)Py (T x = ∞) > 0. This gives
that x is transient.

According to property (iii) from Proposition 3.23, we get that an irreducible Markov chain is either transient or recurrent. And, in the former case, the probability of {N^x < ∞} is equal to 1 for every choice of the initial distribution. The next lemma asserts that for an irreducible recurrent Markov chain, the probability of {N^x < ∞} is equal to 0 for every choice of the initial distribution.

Lemma 3.24. Let X be an irreducible Markov chain on E. If X is transient, then P(N x <
∞) = 1 for all x ∈ E. If X is recurrent, then P(N x = ∞) = 1 for all x ∈ E.

Proof. For the transient case, see property (ii) of Proposition 3.23. We assume that X is
recurrent. Let x ∈ E. By decomposing according to the values of T x and using the Markov
property for the first equality and property (i) of Proposition 3.23 for the second, we get:
P(N^x < ∞) = P(T^x = ∞) + ∑_{n∈N∗} P(T^x = n) Px(N^x < ∞) = P(T^x = ∞).    (3.12)

To conclude, we shall prove that P(T x < ∞) = 1. We get that for m ∈ N∗ :


1 = Px(X_n = x for some n ≥ m + 1) = ∑_{y∈E} Px(X_m = y) Py(T^x < ∞),

where for the first equality we used that Px(N^x = ∞) = 1, and for the second the Markov property at time m and that Py(X_n = x for some n ≥ 1) = Py(T^x < ∞). As ∑_{y∈E} Px(X_m = y) = 1 and Py(T^x < ∞) ≤ 1, we deduce that Py(T^x < ∞) = 1 for all y ∈ E such that Px(X_m = y) > 0. Since X is irreducible, for all y ∈ E, there exists m ∈ N∗ such that Px(X_m = y) > 0. We deduce that Py(T^x < ∞) = 1 for all y ∈ E and thus P(T^x < ∞) = 1. Then use (3.12) to get P(N^x < ∞) = 0.

Remark 3.25. Let X be an irreducible Markov chain on a finite state space E. Since ∑_{x∈E} N^x = ∞ and E is finite, we deduce that P(N^x = ∞ for some x ∈ E) = 1. This implies that P(N^x = ∞) > 0 for some x ∈ E. We deduce from Lemma 3.24 that X is recurrent. Thus, all elements of a finite closed communicating class are recurrent. ♦

3.3.3 Periodicity
In Example 3.4 of the simple random walk X = (Xn, n ∈ N), we notice that if X0 is even (resp. odd), then X_{2n+1} is odd (resp. even) and X_{2n} is even (resp. odd) for n ∈ N. Therefore the state space Z can be written as the disjoint union of two subsets: the even integers, 2Z, and the odd integers, 2Z + 1. And, a.s. the Markov chain jumps from one subset to the other one. From Lemma 3.28 below, we see that X has period 2 in this example.

Definition 3.26. Let X be a Markov chain on E with transition matrix P . The period d of
a state x ∈ E is the greatest common divisor (GCD) of the set {n ∈ N∗ ; P n (x, x) > 0}, with
the convention that d = ∞ if this set is empty. The state is aperiodic if d = 1.
Notice that the set {n ∈ N∗ ; P n (x, x) > 0} is empty if and only if Px (T x = ∞) = 1, and
that this also implies that {x} is an open communicating class.
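In practice, the period of a state can be computed from the powers of the transition matrix. Here is a small Python sketch (with an illustrative matrix) that truncates the set of Definition 3.26, under the assumption that inspecting the first n_max powers is enough for the example at hand.

from math import gcd

def period(P, x, n_max=64):
    # GCD of {1 <= n <= n_max : P^n(x, x) > 0}; this truncates the set of
    # Definition 3.26 and returns 0 if no return occurs within n_max steps.
    size = len(P)
    Pn = [row[:] for row in P]  # Pn holds P^n, starting at n = 1
    d = 0
    for n in range(1, n_max + 1):
        if Pn[x][x] > 0:
            d = gcd(d, n)
        Pn = [[sum(Pn[i][k] * P[k][j] for k in range(size))
               for j in range(size)] for i in range(size)]
    return d

# A deterministic 2-cycle has period 2, like the simple random walk.
print(period([[0.0, 1.0], [1.0, 0.0]], 0))  # 2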
Proposition 3.27. Let X be a Markov chain on E with transition matrix P . We have the
following properties.
(i) If x ∈ E has a finite period d, then there exists n0 ∈ N such that P nd (x, x) > 0 for all
n ≥ n0 .

(ii) The elements of the same communicating class have the same period.
In view of (ii) above, we get that if X is irreducible, then all the states have the same
finite period. For this reason, we shall say that an irreducible Markov chain is aperiodic
(resp. has period d) if one of the states is aperiodic (resp. has period d).

Proof. We first consider the case d = 1. Let x ∈ E be aperiodic. We consider the non-empty set I = {n ∈ N∗; P^n(x, x) > 0}. Since P^{n+m}(x, x) ≥ P^n(x, x) P^m(x, x), we deduce that I is stable by addition. By hypothesis, there exist n1, . . . , nK ∈ I which are relatively prime. According to Bézout's lemma, there exist a1, . . . , aK ∈ Z such that ∑_{k=1}^{K} a_k n_k = 1. We set n+ = ∑_{k=1; a_k>0}^{K} a_k n_k and n− = ∑_{k=1; a_k<0}^{K} |a_k| n_k. If n− = 0, then we deduce that 1 ∈ I and so (i) is proved with n0 = 1. We assume now that n− ≥ 1. We get that n+, n− ∈ I and n+ − n− = 1. Let n ≥ (n−)². Considering the Euclidean division of n by n−, we get there exist integers r ∈ {0, . . . , n− − 1} and q ≥ n− such that:

n = q n− + r = q n− + r(n+ − n−) = (q − r) n− + r n+.

Since q − r ≥ 0 and I is stable by addition, we get that n ∈ I. This proves property (i) with n0 = (n−)².
For d ≥ 2 finite, consider Q = P^d. It is easy to check that x is then aperiodic when considering the Markov chain with transition matrix Q. Thus, there exists n0 ≥ 1 such that Q^n(x, x) > 0 for all n ≥ n0, that is, P^{nd}(x, x) > 0 for all n ≥ n0. This proves property (i).

Property (ii) is a direct consequence of property (i), (3.10) and (3.11).

We give a natural interpretation of the period.

Lemma 3.28. Let X = (Xn , n ∈ N) be an irreducible Markov chain on E with period d.


Then, there exists a partition (Ei , i ∈ J0, d−1K) of E such that, with the convention Ed = E0 :

Px (X1 ∈ Ei+1 ) = 1 for all i ∈ J0, d − 1K and x ∈ Ei . (3.13)

Proof. Since X is irreducible, we get that the period d is finite. Let x0 ∈ E. Consider the sets E_i = {x ∈ E; there exists n ∈ N such that P^{nd+i}(x0, x) > 0} for i ∈ J0, d − 1K. Since X is irreducible, for x ∈ E there exists m ∈ N such that P^m(x0, x) > 0. This gives that x ∈ E_i with i = m mod (d). We deduce that E = ∪_{i=0}^{d−1} E_i.
If x ∈ E_i ∩ E_j, then using that P^k(x, x0) > 0 for some k ∈ N, we get there exist n, m ∈ N such that P^{nd+i+k}(x0, x0) > 0 and P^{md+j+k}(x0, x0) > 0. By definition of the period, we deduce that i = j mod (d). This implies that E_i ∩ E_j = ∅ if i ≠ j and i, j ∈ J0, d − 1K.
To conclude, notice that if x ∈ E_i, that is P^{nd+i}(x0, x) > 0 for some n ∈ N, and z ∈ E is such that P(x, z) > 0, then we get that P^{nd+i+1}(x0, z) > 0 and thus z ∈ E_{i+1}. This readily implies (3.13). Since x0 ∈ E0, we get that E0 is non-empty. Using (3.13), we get by induction that E_i is non-empty for i ∈ J0, d − 1K. Thus, (E_i, i ∈ J0, d − 1K) is a partition of E.

The next lemma will be used in Section 3.4.3.

Lemma 3.29. Let X = (Xn , n ∈ N) and Y = (Yn , n ∈ N) be two independent Markov chains
with respective discrete state spaces E and F . Then, the process Z = ((Xn , Yn ), n ∈ N) is a
Markov chain with state space E × F . If π (resp. ν) is an invariant probability measure for
X (resp. Y ), then π ⊗ ν is an invariant probability measure for Z. If X and Y are irreducible
and furthermore X or Y is aperiodic, then Z is irreducible on E × F .

Proof. Let P and Q be the transition matrices of X and Y. Using the independence of X and Y, it is easy to prove that Z is a Markov chain with transition matrix R given by R(z, z′) = P(x, x′) Q(y, y′) with z = (x, y), z′ = (x′, y′) ∈ E × F.
If π (resp. ν) is an invariant measure for X (resp. Y), then we have for z = (x, y) ∈ E × F:

(π ⊗ ν)R(z) = ∑_{x′∈E, y′∈F} π(x′) ν(y′) R((x′, y′), (x, y)) = ∑_{x′∈E, y′∈F} π(x′) ν(y′) P(x′, x) Q(y′, y) = π ⊗ ν(z).

Therefore the probability measure π ⊗ ν is invariant for Z.


Let us assume that X is aperiodic and irreducible and that Y is irreducible. Let z = (x, y), z′ = (x′, y′) ∈ E × F. Since X and Y are irreducible, there exist n1, n2, n3 ∈ N∗ such that P^{n1}(x, x′) > 0, Q^{n2}(y, y′) > 0 and Q^{n3}(y′, y′) > 0. Property (i) of Proposition 3.27 gives that P^{kn3+n2−n1}(x′, x′) > 0 for k ∈ N∗ large enough. Thus, we get for k large enough:

R^{kn3+n2}(z, z′) = P^{kn3+n2}(x, x′) Q^{kn3+n2}(y, y′)
 ≥ P^{n1}(x, x′) P^{kn3+n2−n1}(x′, x′) Q^{n2}(y, y′) (Q^{n3}(y′, y′))^k > 0.

We deduce that Z is irreducible. We get the same result if Y is aperiodic instead of X.

3.4 Asymptotic theorems


3.4.1 Main results
Let X = (Xn , n ∈ N) be a Markov chain on a discrete state space E. We recall the first
return time of x is given by T x = inf {n ≥ 1; Xn = x}. Since T x ≥ 1, we get that Ex [T x ] ≥ 1.
We set for x ∈ E:

π(x) = 1 / Ex[T^x] ∈ [0, 1].    (3.14)

For an irreducible transient Markov chain, we recall that Px (T x = +∞) > 0 and thus
Ex [T x ] = +∞ for all x ∈ E, so that π = 0.
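The quantity π(x) = 1/Ex[T^x] can be checked by simulation. Below is a Python sketch (an illustration, not from the notes) on an assumed two-state chain whose invariant probability is known in closed form.

import random

# Monte Carlo check of (3.14) on the two-state chain with
# P = [[1-a, a], [b, 1-b]], whose invariant probability is
# pi = (b/(a+b), a/(a+b)); here pi(0) = 0.25, so E_0[T^0] = 4.
a, b = 0.3, 0.1
P = [[1 - a, a], [b, 1 - b]]

def step(x):
    return random.choices([0, 1], weights=P[x])[0]

def return_time(x):
    # T^x = inf{n >= 1; X_n = x} starting from X_0 = x.
    y, n = step(x), 1
    while y != x:
        y, n = step(y), n + 1
    return n

samples = [return_time(0) for _ in range(100_000)]
print(1 / (sum(samples) / len(samples)))  # close to pi(0) = b/(a+b) = 0.25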
Definition 3.30. A recurrent state x ∈ E is null recurrent if π(x) = 0 and positive recurrent
if π(x) > 0. The Markov chain is null (resp. positive) recurrent if all the states are null
(resp. positive) recurrent.
We shall consider asymptotic events whose probability depends only on the transition
matrix and not on the initial distribution of the Markov chain. This motivates the following
definition. An event A ∈ σ(X) is said to be almost sure (a.s.) for a Markov chain X = (Xn, n ∈ N) if Px(A) = 1 for every starting state x ∈ E of X, or equivalently Pµ0(A) = 1 for every initial distribution µ0 of X0.
The next two fundamental theorems on the asymptotics of irreducible Markov chains will be proved in Section 3.4.3.
Theorem 3.31. Let X = (Xn , n ∈ N) be an irreducible Markov chain on E. Let π be given
by (3.14).

(i) The Markov chain X is either transient or null recurrent or positive recurrent.

(ii) If the Markov chain is transient or null recurrent, then there is no invariant probability
measure. Furthermore, we have π = 0.

(iii) For all x ∈ E, we have:

(1/n) ∑_{k=1}^{n} 1{X_k = x} → π(x) a.s. as n → ∞.    (3.15)

The next result is specifically on irreducible positive recurrent Markov chains. The definition of the convergence in distribution of a sequence of random variables and some of its characterizations are given in Section 8.2.1.
Theorem 3.32 (Ergodic theorem). Let X = (Xn , n ∈ N) be an irreducible positive recurrent
Markov chain on E.

(i) The measure π defined by (3.14) is the unique invariant probability of X. (And we have
π(x) > 0 for all x ∈ E.)

(ii) For every real-valued function f defined on E such that (π, f) is well defined, we have:

(1/n) ∑_{k=1}^{n} f(X_k) → (π, f) a.s. as n → ∞.    (3.16)

(iii) If X is aperiodic, then we have the convergence in distribution X_n →(d) π as n → ∞, and:

lim_{n→∞} ∑_{y∈E} |P^n(x, y) − π(y)| = 0 for all x ∈ E.    (3.17)

In particular for an irreducible positive recurrent Markov chain, the empirical mean or
time average converges a.s. to the spatial average with respect to the invariant probability
measure. In the aperiodic case, we also get that the asymptotic behavior of the Markov chain
is given by the stationary regime. We give the following easy-to-remember corollary.
Corollary 3.33. An irreducible Markov chain X = (Xn , n ∈ N) on a finite state space is
positive recurrent: π defined by (3.14) is its unique invariant probability measure, π(x) > 0
for all x ∈ E, and (3.16) holds for every R-valued function f defined on E. If furthermore X is
aperiodic, then the sequence (Xn , n ∈ N) converges in distribution towards π.
Proof. Summing (3.15) over x ∈ E, we get that ∑_{x∈E} π(x) = 1. Thus the Markov chain is positive recurrent according to Theorem 3.31, properties (i)-(ii), and Theorem 3.32, property (i). The remaining part of the corollary is a direct consequence of Theorem 3.32.
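To make (3.16) and Corollary 3.33 concrete, here is a Python sketch on an illustrative three-state irreducible aperiodic chain (the matrix and the function f are assumptions made for the example): the time average of f along one trajectory should agree with the spatial average (π, f), the latter obtained here from the convergence of µP^n to π.

import random

# Empirical mean (time average) of f along one trajectory of a finite
# irreducible aperiodic chain, versus the spatial average (pi, f).
P = [[0.9, 0.1, 0.0],
     [0.4, 0.5, 0.1],
     [0.0, 0.6, 0.4]]
f = [1.0, 2.0, 5.0]  # an arbitrary function on E = {0, 1, 2}

x, total, n = 0, 0.0, 500_000
for _ in range(n):
    x = random.choices([0, 1, 2], weights=P[x])[0]
    total += f[x]
print(total / n)  # by (3.16), converges a.s. to (pi, f)

# (pi, f) via mu P^n -> pi (the chain is aperiodic, so (iii) applies):
mu = [1.0, 0.0, 0.0]
for _ in range(2_000):
    mu = [sum(mu[i] * P[i][j] for i in range(3)) for j in range(3)]
print(sum(mu[j] * f[j] for j in range(3)))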

The convergence of the empirical means, see (3.16), for irreducible positive recurrent Markov chains is a generalization of the strong law of large numbers recalled in Section 8.2.2. Indeed, if X = (Xn, n ∈ N) is a sequence of independent random variables taking values in E with the same distribution π, then X is a Markov chain with transition matrix P whose rows are all equal to π (that is, P(x, y) = π(y) for all x, y ∈ E). Notice then that P is reversible with respect to π. Assume for simplicity that π(x) > 0 for all x ∈ E, so that X is irreducible with invariant probability π. Then (3.16) corresponds exactly to the strong law of large numbers. By the way, the initial motivation for the introduction of Markov chains by Markov⁵ in 1906 was to extend the law of large numbers and the central limit theorem (CLT) to sequences of dependent random variables.
Finally, notice that the limits in (3.16) or in (iii) of Theorem 3.31 do not involve the initial distribution of the Markov chain. Forgetting the initial condition is an important property of Markov chains.

3.4.2 Complement on the asymptotic results


We shall state without proof some results on the CLT for irreducible positive recurrent Markov chains and on invariant measures for irreducible null recurrent Markov chains.

On the CLT in the positive recurrent case


Similarly to the CLT for sequences of independent random variables with the same distribution, see Section 8.2.2, it is possible to provide the fluctuations associated to (3.16) under reasonable assumptions. Let X = (Xn, n ∈ N) be an irreducible positive recurrent Markov chain on E with transition matrix P and invariant probability measure π. Set I_n(f) = (1/n) ∑_{k=1}^{n} f(X_k) for n ∈ N∗ and f a real-valued function defined on E such that (π, f²) is finite. Thanks to Theorem 3.32, we have that a.s. lim_{n→∞} I_n(f) = (π, f). Without loss of generality, we assume that:

(π, f) = 0.
As in the CLT for independent random variables, we expect the convergence in distribution of (√n I_n(f), n ∈ N∗) towards a centered Gaussian random variable. With this idea in mind, it is natural to consider the variance of √n I_n(f):

Var(√n I_n(f)) = (1/n) ∑_{k,ℓ=1}^{n} Cov(f(X_k), f(X_ℓ))
 = (1/n) [ ∑_{k=1}^{n} Var(f(X_k)) + 2 ∑_{k=1}^{n} ∑_{j=1}^{n−k} Cov(f(X_k), f(X_{k+j})) ].

It is legitimate to expect that the variance of the limit Gaussian random variable is the limit of Var(√n I_n(f)); as the mean in time corresponds intuitively to the average under the invariant probability measure, this would be, as (π, f) = 0:

σ(f)² = Eπ[f²(X0)] + 2 ∑_{j∈N∗} Eπ[f(X0) f(X_j)] = (π, f²) + 2 ∑_{j∈N∗} (π, f P^j f).    (3.18)

To be precise, we state Theorems II.4.1 and II.4.3 from [1]. For x ∈ E, set:

H_n(x) = ∑_{y∈E} |P^n(x, y) − π(y)|.

We recall that according to (3.17), if X is aperiodic then we have the ergodicity property
limn→∞ Hn (x) = 0 for all x ∈ E.
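On a finite state space these quantities are easy to compute exactly. The following Python sketch (with an illustrative two-state chain, an assumption made for the example) evaluates sup_x H_n(x) for growing n and exhibits the geometric decay behind uniform ergodicity.

# sup_x H_n(x) = sup_x sum_y |P^n(x, y) - pi(y)| for an illustrative
# two-state chain; pi = (2/3, 1/3) is its invariant probability.
P = [[0.9, 0.1], [0.2, 0.8]]
pi = [2 / 3, 1 / 3]

def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

Pn = [[1.0, 0.0], [0.0, 1.0]]  # P^0
for n in range(1, 21):
    Pn = mat_mul(Pn, P)
    H = max(sum(abs(Pn[x][y] - pi[y]) for y in range(2)) for x in range(2))
    if n % 5 == 0:
        print(n, H)  # decays like 0.7 ** n (the second eigenvalue of P)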
⁵ G. Basharin, A. Langville and V. Naumov. The life and work of A.A. Markov. Linear Algebra and its Applications, vol. 386, pp. 3-26, 2004.

Theorem. Let X be an irreducible positive recurrent and aperiodic Markov chain with invariant probability measure π. Let f be a real-valued function defined on E such that (π, f²) < +∞ and (π, f) = 0. If one of the two following conditions is satisfied:

(i) f is bounded and ∑_{n∈N∗} (π, H_n) < +∞ (ergodicity of degree 2);

(ii) lim_{n→∞} sup_{x∈E} H_n(x) = 0 (uniform ergodicity);

then σ(f)² given by (3.18) is finite and non-negative, and:

√n I_n(f) →(d) N(0, σ(f)²) as n → ∞.

Usually the variance σ(f)² is positive, but for some particular Markov chains and particular functions f, it may be zero. Concerning the hypotheses (i) and (ii) in the previous theorem, we also mention that uniform ergodicity implies there exists c > 1 such that sup_{x∈E} H_n(x) ≤ c^{−n} for large n, which in turn implies ergodicity of degree 2. Notice that if the state space E is finite, then an irreducible aperiodic Markov chain is uniformly ergodic.
Based on the excursion approach developed in Section 3.4.3, it is also possible to give an alternative result for the CLT of Markov chains, see Theorems 17.2.2, 17.4.4 and 17.5.3 in [7]. For f a real-valued function defined on E and x ∈ E, we set, when it is well defined:

S_x(f) = ∑_{k=1}^{T^x} f(X_k).
Theorem. Let X be an irreducible positive recurrent Markov chain with invariant probability measure π. Let f be a real-valued function defined on E such that (π, f) is well defined with (π, f) = 0. Let x ∈ E be such that Ex[S_x(1)²] = Ex[(T^x)²] < +∞ and Ex[S_x(|f|)²] < +∞ (so that S_x(f) is a.s. well defined). Set

σ′(f)² = π(x) Ex[S_x(f)²].    (3.19)

Then, we have that:

√n I_n(f) →(d) N(0, σ′(f)²) as n → ∞.

Furthermore, (3.19) holds for all x ∈ E.
Another approach is based on the Poisson equation. Assume (π, |f|) is finite. We say that an R-valued function f̂ is a solution to the Poisson equation if P f̂ is well defined and:

f̂ − P f̂ = f − (π, f).    (3.20)
Theorem. Let X be an irreducible positive recurrent Markov chain with invariant probability measure π. Let f be a real-valued function defined on E such that (π, |f|) < +∞ and (π, f) = 0. Assume there exists a solution f̂ to the Poisson equation such that (π, f̂²) < +∞. Set

σ″(f)² = (π, f̂² − (P f̂)²).    (3.21)

Then we have that:

√n I_n(f) →(d) N(0, σ″(f)²) as n → ∞.
Of course, the asymptotic variances given by (3.18), (3.19) and (3.21) coincide when the hypotheses of the three previous theorems hold. This is in particular the case if E is finite (even if X is periodic).

More on the null recurrent case


It is possible to have more precise ergodic results for irreducible null recurrent Markov chains,
but with less natural probabilistic interpretation.
Let ν be a measure on E such that ν ≠ 0 and ν(x) < +∞ for all x ∈ E. We say that ν is an invariant measure for a stochastic matrix P if νP = ν. This generalizes Definition 3.15. It can be proved that if X is an irreducible positive recurrent Markov chain, then the only invariant measures are λπ, where π is the invariant probability measure and λ > 0.
Let X = (Xn, n ∈ N) be a Markov chain. For x ∈ E, we define the measure νx by:

ν_x(y) = Ex[ ∑_{k=1}^{T^x} 1{X_k = y} ] for y ∈ E.

Notice that the measure ν_x is infinite, as (ν_x, 1) = Ex[T^x] = +∞. According to [2, 1, 4], we have the following results.
Theorem. Let X = (Xn , n ∈ N) be a Markov chain with transition matrix P . If x is
recurrent then νx is an invariant measure for P .
If furthermore X is irreducible null recurrent, then we get the following results:
(i) The measure νx is the only invariant measure (up to a positive multiplicative constant)
and νx (y) > 0 for all y ∈ E. And for all y, z ∈ E, we have νy (z) = νx (z)/νx (y).
(ii) For all R-valued functions f, g defined on E such that (ν, f) is well defined and g is non-negative with 0 < (ν, g) < +∞, we have:

( ∑_{k=1}^{n} f(X_k) ) / ( ∑_{k=1}^{n} g(X_k) ) → (ν, f)/(ν, g) a.s. as n → ∞.

(iii) We have limn→+∞ P(Xn = y) = 0 for all y ∈ E.


For irreducible transient Markov chain, there is no simple answer on the existence or
uniqueness
Pn of invariant measure, see the two exercises below; furthermore notice that the
sum k=1 1{Xk =x} is constant for large n and that limn→∞ P(Xn = x) = 0 for all x ∈ E.

3.4.3 Proof of the asymptotic theorems


Let X = (Xn, n ≥ 0) be an irreducible Markov chain on a discrete state space E, with transition matrix P. Recall the measure π defined by (3.14). The next lemma ensures that if there exists an invariant probability measure, then it has to be π.
Lemma 3.34. Let X be an irreducible Markov chain. If (3.15) holds and ν is an invariant
probability measure, then we have ν = π.
Proof. Assume that ν is an invariant probability measure. Since the left-hand side of (3.15) is bounded by 1, using dominated convergence and taking the expectation in (3.15) with ν as initial distribution of X, we get that for all x ∈ E:

(1/n) ∑_{k=1}^{n} νP^k(x) → π(x) as n → ∞.

Since ν is invariant, we get νP^k = ν. We deduce that ν(x) = π(x) for all x ∈ E.



The next result is on transient Markov chains.

Lemma 3.35. Let X be an irreducible transient Markov chain. We have: π = 0, (3.15) holds, and X has no invariant probability measure.

Proof. Property (ii) of Proposition 3.23 implies that π = 0 and that ∑_{k∈N} 1{X_k = x} = N^x is a.s. finite. We deduce that (3.15) holds. Then use that π = 0 and Lemma 3.34 to deduce that X has no invariant probability measure.

From now on we assume that X is irreducible and recurrent.

Let x ∈ E be fixed. Lemma 3.24 gives that a.s. the number of visits of x is infinite. We can thus define a.s. the successive return times to x. By convention, we write T^x_0 = 0 and for n ∈ N:

T^x_{n+1} = inf{k > T^x_n; X_k = x}.
We define the successive excursions (Y_n, n ∈ N∗) out of the state x as follows:

Y_n = (T^x_n − T^x_{n−1}, X_{T^x_{n−1}}, X_{T^x_{n−1}+1}, . . . , X_{T^x_n}).    (3.22)

The random variable Y_n describes the n-th excursion out of the state x. Notice that x is the end of the excursion, that is X_{T^x_n} = x, and for n ≥ 2 it is also the starting point of the excursion as X_{T^x_{n−1}} = x. So Y_n takes values in the discrete space E^traj = ∪_{k∈N∗} {k} × E^k × {x}. The next lemma is the key ingredient to prove the asymptotic results for recurrent Markov chains.

Lemma 3.36. Let X be an irreducible recurrent Markov chain. The random variables
(Yn , n ∈ N∗ ) defined by (3.22) are independent. And the random variables (Yn , n ≥ 2) are all
distributed as Y1 under Px .

Proof. For y = (r, x0, . . . , x_r) ∈ E^traj, we set t_y = r the length of the excursion, and we recall that the end point of the excursion is equal to x: x_r = x. We shall first prove that for all n ∈ N∗, y1, . . . , yn ∈ E^traj, we have:

P(Y_1 = y_1, . . . , Y_n = y_n) = P(Y_1 = y_1) ∏_{k=2}^{n} Px(Y_1 = y_k).    (3.23)

For n = 1 and y_1 ∈ E^traj, Equation (3.23) holds trivially. Let n ≥ 2 and y_1, . . . , y_n ∈ E^traj. On the event {Y_1 = y_1, . . . , Y_{n−1} = y_{n−1}}, the time s = ∑_{k=1}^{n−1} t_{y_k} is the end of the (n−1)-th excursion, and at this time we have X_s = x as all the excursions end at state x. Using the Markov property at time s and that X_s = x on {Y_1 = y_1, . . . , Y_{n−1} = y_{n−1}}, we get that:

P(Y_1 = y_1, . . . , Y_n = y_n) = P(Y_1 = y_1, . . . , Y_{n−1} = y_{n−1}) Px(Y_1 = y_n).

Then, we get (3.23) by induction. Use Definition 1.31 and (3.23) for any n ∈ N∗ and y_1, . . . , y_n ∈ E^traj to conclude.

We will now prove (3.15) for irreducible recurrent Markov chains. This and Lemma 3.35
will give property (iii) from Theorem 3.31.

Proposition 3.37. Let X be an irreducible recurrent Markov chain. Then (3.15) holds.

Proof. Let x ∈ E be fixed. Since T^x_n = T^x_1 + ∑_{k=2}^{n} (T^x_k − T^x_{k−1}), with T^x_1 a.s. finite, and (T^x_k − T^x_{k−1}, k ≥ 2) are, according to Lemma 3.36, independent positive random variables distributed as T^x under Px, we deduce from the law of large numbers, see Theorem 8.15, that:

T^x_n / n → Ex[T^x] a.s. as n → ∞.    (3.24)

We define the number of visits of x from time 1 to n ∈ N∗:

N^x_n = ∑_{k=1}^{n} 1{X_k = x}.    (3.25)

By construction, we have:

T^x_{N^x_n} ≤ n < T^x_{N^x_n + 1}.    (3.26)

This gives (N^x_n / (N^x_n + 1)) · ((N^x_n + 1) / T^x_{N^x_n + 1}) ≤ N^x_n / n ≤ N^x_n / T^x_{N^x_n}. Since x is recurrent, we get that a.s. lim_{n→∞} N^x_n = +∞. We deduce from (3.24) that a.s. lim_{n→∞} N^x_n / n = 1 / Ex[T^x] = π(x).

The next lemma and property (iii) of Proposition 3.23 give property (i) of Theorem 3.31.

Lemma 3.38. Let X be an irreducible recurrent Markov chain. Then, it is either null recurrent or positive recurrent.

Proof. Let x ∈ E. Notice the left-hand side of (3.15) is bounded by 1. Integrating (3.15) with respect to Px, we get by dominated convergence that lim_{n→∞} (1/n) ∑_{k=1}^{n} P^k(x, x) = π(x). Since X is irreducible, we deduce from (3.11) that if the above limit is zero for a given x, it is zero for all x ∈ E. This implies that either π = 0 or π(x) > 0 for all x ∈ E.

The proof of the next lemma is a direct consequence of Lemma 3.34 and the fact that
π = 0 for irreducible null recurrent Markov chains.
Lemma 3.39. Let X be an irreducible null recurrent Markov chain. Then, there is no
invariant probability measure.
Lemmas 3.35 and 3.39 imply property (ii) of Theorem 3.31. This ends the proof of
Theorem 3.31.
The end of this section is devoted to the proof of Theorem 3.32. From now on we assume
that X is irreducible and positive recurrent.
Proposition 3.40. Let X be an irreducible positive recurrent Markov chain. Then, the
measure π defined in (3.14) is a probability measure. For all real-valued function f defined
on E such that (π, f ) is well defined, we have (3.16).
Proof. Let x ∈ E. We keep the notations from the proof of Lemma 3.36. Let f be a finite non-negative function defined on E. We set for y = (r, x0, . . . , x_r) ∈ E^traj:

F(y) = ∑_{k=1}^{r} f(x_k).

According to Lemma 3.36, the random variables (F(Y_n), n ≥ 2) are independent, non-negative and distributed as F(Y_1) under Px. As F(Y_1) is finite, we deduce from the law of large numbers, see Theorem 8.15, that a.s. lim_{n→∞} (1/n) ∑_{k=1}^{n} F(Y_k) = Ex[F(Y_1)]. Since ∑_{i=1}^{T^x_n} f(X_i) = ∑_{k=1}^{n} F(Y_k), we deduce from (3.24) that:

(1/T^x_n) ∑_{i=1}^{T^x_n} f(X_i) = (n/T^x_n) · (1/n) ∑_{k=1}^{n} F(Y_k) → π(x) Ex[F(Y_1)] a.s. as n → ∞.

Recall that T^x_{N^x_n} ≤ n < T^x_{N^x_n+1} from (3.26). Since f is non-negative, we deduce that:

(T^x_{N^x_n} / T^x_{N^x_n+1}) · (1/T^x_{N^x_n}) ∑_{i=1}^{T^x_{N^x_n}} f(X_i) ≤ (1/n) ∑_{i=1}^{n} f(X_i) ≤ (T^x_{N^x_n+1} / T^x_{N^x_n}) · (1/T^x_{N^x_n+1}) ∑_{i=1}^{T^x_{N^x_n+1}} f(X_i).

Since a.s. lim_{n→∞} N^x_n = +∞, lim_{n→∞} T^x_n = +∞ and lim_{n→∞} T^x_n / T^x_{n+1} = 1, see (3.24), we deduce that:

(1/n) ∑_{i=1}^{n} f(X_i) → π(x) Ex[F(Y_1)] a.s. as n → ∞.    (3.27)
Taking f = 1{y} in the equation above, we deduce from (3.15) that:

π(y) = π(x) Ex[ ∑_{k=1}^{T^x} 1{X_k = y} ].    (3.28)

Summing over y ∈ E, we get by monotone convergence that ∑_{y∈E} π(y) = π(x) Ex[T^x] = 1. This gives that π is a probability measure. By monotone convergence, we deduce from (3.28) that:

π(x) Ex[F(Y_1)] = ∑_{y∈E} f(y) π(x) Ex[ ∑_{k=1}^{T^x} 1{X_k = y} ] = ∑_{y∈E} f(y) π(y) = (π, f).

Using then (3.27), we deduce that (3.16) holds when f is finite and non-negative. If f is non-negative but not finite, the result is immediate as N^x = ∞ a.s. for all x ∈ E and (π, f) = +∞. If f is real-valued such that (π, f) is well defined, then considering (3.16) with f replaced by f⁺ and f⁻, and taking the difference of the two limits, we get (3.16).

The next proposition and Proposition 3.40 give properties (i) and (ii) of Theorem 3.32.
Proposition 3.41. Let X be an irreducible positive recurrent Markov chain. Then, the
measure π defined in (3.14) is the unique invariant probability measure.
Proof. According to Proposition 3.40, the measure π is a probability measure. We now check it is invariant. Let µ be the distribution of X0. We set:

µ̄_n(x) = (1/n) ∑_{k=1}^{n} µP^k(x).

By dominated convergence, taking the expectation in (3.15) with respect to Pµ, we get lim_{n→∞} µ̄_n(x) = π(x) for all x ∈ E. Similarly, using (3.16) with f bounded, we get that lim_{n→∞} (µ̄_n, f) = (π, f).

Let y ∈ E be fixed and f(·) = P(·, y). We notice that lim_{n→∞} (µ̄_n, f) = (π, f) = πP(y) and that (µ̄_n, f) = µ̄_n P(y) = ((n+1)/n) µ̄_{n+1}(y) − (1/n) µP(y). Letting n go to infinity in these equalities, we get that πP(y) = π(y). Since y is arbitrary, we deduce that π is invariant. By Lemma 3.34, this is the unique invariant probability measure.

The next proposition and Lemma 8.14 give property (iii) from Theorem 3.32. Its proof
relies on a coupling argument.

Proposition 3.42. An irreducible positive recurrent aperiodic Markov chain converges in


distribution towards its unique invariant probability measure.

Proof. Let X = (Xn, n ∈ N) be an irreducible positive recurrent aperiodic Markov chain. Recall that π defined in (3.14) is its unique invariant probability measure. Let Y = (Yn, n ∈ N) be a Markov chain independent of X with the same transition matrix and initial distribution π. Thanks to Lemma 3.29, the Markov chain Z = ((Xn, Yn), n ∈ N) is irreducible and it has π ⊗ π as invariant probability measure. This gives that Z is positive recurrent.
Let x ∈ E and consider T = inf{n ≥ 1; Xn = Yn = x} the return time of Z to (x, x). For
y ∈ E, we have:

P(Xn = y) = P(Xn = y, T ≤ n) + P(Xn = y, T > n) ≤ P(Xn = y, T ≤ n) + P(T > n).

Decomposing according to the events {T = k} for k ∈ N∗, and using that X_k = x = Y_k on {T = k}, that X and Y have the same transition matrix, as well as the Markov property at time k, we get that P(X_n = y, T ≤ n) = P(Y_n = y, T ≤ n). Thus, we obtain:

P(Xn = y) ≤ P(Yn = y, T ≤ n) + P(T > n) ≤ P(Yn = y) + P(T > n).

By symmetry we can replace (Xn , Yn ) in the previous inequality by (Yn , Xn ) and deduce that:

|P(Xn = y) − P(Yn = y)| ≤ P(T > n).

Since Z is recurrent, we get that a.s. T is finite. Using that P(Yn = y) = π(y), as π is
invariant and the initial distribution of Y , we deduce that limn→∞ |P(Xn = y) − π(y)| = 0
for all y ∈ E. Then, use Lemma 8.14 to conclude.

3.5 Examples and applications


In this section, we give some well known examples of Markov chains.

Random walk on Z^d

Let d ∈ N∗. Let U be a Z^d-valued random variable with probability distribution p = (p(x) = P(U = x), x ∈ Z^d). Let (Un, n ∈ N∗) be a sequence of independent random variables distributed as U, and X0 a Z^d-valued independent random variable. We consider the random walk X = (Xn, n ∈ N) with increments distributed as U defined by:

X_n = X_0 + ∑_{k=1}^{n} U_k for n ∈ N∗.

The transition matrix P of X is given by P(x, y) = p(y − x). We assume that X is irreducible (equivalently, the smallest additive sub-group of Z^d which contains the support {x ∈ Z^d; p(x) > 0} is Z^d). Because ∑_{x∈Z^d} P(x, y) = 1, we deduce that the counting measure on Z^d is invariant. (According to Section 3.4.2, this implies that irreducible random walks are transient or null recurrent.) We refer to [8, 6] for a detailed account on random walks.
The simple symmetric random walk corresponds to U being uniform on the set of cardinality 2d: {x ∈ Z^d; |x| = 1}, with |x| denoting the Euclidean norm on R^d. It is irreducible with period 2 (as P²(x, x) > 0 and by parity P^{2n+1}(x, x) = 0 for all n ∈ N).
We summarize the main results on transience/recurrence for random walks, see [8], Theorem 8.1.

Theorem. Let X be an irreducible random walk on Z^d with increments distributed as U. We have the following results:

(i) If d = 1, U ∈ L¹ and E[U] = 0, then X is null recurrent.

(ii) If d = 2, U ∈ L² and E[U] = 0, then X is null recurrent.

(iii) If d ≥ 3, then X is transient.
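A crude Monte Carlo illustration of this dichotomy (a sketch, not a proof: it only looks at a finite horizon, with parameters chosen to keep the run short) estimates the probability that the simple symmetric walk returns to the origin within a fixed number of steps.

import random

# Probability that the simple symmetric walk returns to 0 within n_steps
# steps, for d = 1, 2, 3. As n_steps grows, the estimates approach 1 for
# d <= 2 and stay bounded away from 1 for d = 3 (finite-horizon proxy).
def returns(d, n_steps):
    x = [0] * d
    for _ in range(n_steps):
        i = random.randrange(d)
        x[i] += random.choice((-1, 1))
        if all(c == 0 for c in x):
            return True
    return False

for d in (1, 2, 3):
    trials = 1_000
    freq = sum(returns(d, 2_000) for _ in range(trials)) / trials
    print(d, freq)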

Metropolis-Hastings algorithm
Let π be a given probability distribution on E such that π(x) > 0 for all x ∈ E. The aim
of the Metropolis-Hastings6 algorithm is to simulate a random variable with distribution
(asymptotically close to) π.
We consider a stochastic matrix Q on E which is irreducible (that is for all x, y ∈ E,
there exists n ∈ N∗ such that Qn (x, y) > 0) and such that for all x, y ∈ E, if Q(x, y) = 0 then
Q(y, x) = 0. The matrix Q is called the selection matrix.
We say a function ρ = (ρ(x, y); x, y ∈ E such that Q(x, y) > 0) taking values in (0, 1] is
an accepting probability function if for x, y ∈ E such that Q(x, y) > 0, we have:

ρ(x, y)π(x)Q(x, y) = ρ(y, x)π(y)Q(y, x). (3.29)

An example of an accepting probability function is given by:

ρ(x, y) = γ( π(y)Q(y, x) / (π(x)Q(x, y)) ) for x, y ∈ E such that Q(x, y) > 0,    (3.30)

where γ is a function defined on (0, +∞) taking values in (0, 1] satisfying γ(u) = u γ(1/u) for u > 0. A common choice for γ is γ(u) = min(1, u) (Metropolis algorithm) or γ(u) = u/(1 + u) (Boltzmann or Barker algorithm).
(Boltzmann or Barker algorithm).
We now describe the Metropolis-Hastings algorithm. Let X0 be a random variable on
E with distribution µ0 . At step n + 1, the random variables X0 , . . . , Xn are defined, and
we explain how to generate Xn+1 . First consider a random variable Yn+1 with distribution
Q(Xn , ·). With probability ρ(Xn , Yn+1 ), we accept the transition and set Xn+1 = Yn+1 . If the
⁶ W. Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, vol. 57, pp. 97-109, 1970.

transition is rejected, we set X_{n+1} = X_n. By construction X = (Xn, n ∈ N) is a stochastic dynamical system and thus a Markov chain. Its transition matrix is given by, for x, y ∈ E:

P(x, y) = Q(x, y) ρ(x, y) if x ≠ y, and P(x, x) = 1 − ∑_{z≠x} P(x, z).

Since ρ > 0, we have that Q(x, y) > 0 implies P(x, y) > 0 and, for x ≠ y, that Q(x, y) > 0 is equivalent to P(x, y) > 0. We deduce that X is irreducible as Q is irreducible. Condition (3.29) implies that X is reversible with respect to the probability π. Thus, the Markov chain is irreducible positive recurrent with invariant probability π. Let f be a real-valued function defined on E such that (π, f) is well defined. An approximation of (π, f) is, according to Theorem 3.32, given by (1/n) ∑_{k=1}^{n} f(X_k) for n large. The drawback of this approach is that it does not come with a confidence interval for (π, f). If furthermore either Q is aperiodic or there exist x, y ∈ E such that Q(x, y) > 0 and ρ(x, y) < 1, so that P(x, x) > 0, then the Markov chain X is aperiodic. In this case, Theorem 3.32 implies that X converges in distribution towards π.
It may happen that π is known only up to a normalizing constant. This is the case of the so-called Boltzmann or Gibbs measure in statistical physics for example, where E is the state space of a system, and the probability for the system to be in configuration x ∈ E is π(x) = Z_T^{−1} exp(−H(x)/T), where H(x) is the energy of the system in configuration x, T the temperature and Z_T the normalizing constant. It is usually very difficult to compute an approximation of Z_T.
When using the accepting probability function given by (3.30), only the ratios π(x)/π(y) need to be computed to simulate X. In particular, the simulation does not rely on the value of Z_T.
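As a concrete illustration, here is a minimal Python sketch of the Metropolis algorithm (γ(u) = min(1, u)) for a Boltzmann-type target on a small finite state space. The energy H, the temperature T and the nearest-neighbour proposal Q below are illustrative assumptions; note that only ratios π(y)/π(x) are ever evaluated, so Z_T is never needed.

import math
import random

# Metropolis sketch: target pi(x) proportional to exp(-H(x)/T) on
# E = {0, ..., 49}, with a symmetric nearest-neighbour proposal Q
# (so Q(y, x)/Q(x, y) = 1) and gamma(u) = min(1, u).
T = 0.7
def H(x):
    return (x / 10.0 - 2.5) ** 2  # an arbitrary energy landscape

def mh_step(x):
    y = (x + random.choice((-1, 1))) % 50          # proposal from Q(x, .)
    rho = min(1.0, math.exp(-(H(y) - H(x)) / T))   # accepting probability
    return y if random.random() < rho else x       # rejected: stay at x

x, counts = 0, [0] * 50
for _ in range(200_000):
    x = mh_step(x)
    counts[x] += 1
print(max(range(50), key=counts.__getitem__))  # mode of the samples, near 25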

Wright-Fisher model

The Wright-Fisher model for population evolution was introduced by Fisher in 1922⁷ and Wright⁸ in 1931. Consider a population of constant size N whose individuals live one unit of time and reproduce at each unit of time. We assume the reproduction is random and that there is no mating (each individual can have children on its own). More formally, if Y_i^n ∈ J1, N K denotes the parent, alive at generation n − 1, of the individual i at generation n ∈ N∗, then the random variables (Y_i^{n+1}, i ∈ J1, N K, n ∈ N) are independent and uniformly distributed on J1, N K. Intuitively, each child chooses independently and uniformly its parent.
We assume that individuals may be either of type A or type a, and that a child inherits the type of its parent. Let X_n be the number of individuals of type A at time n. By construction X = (Xn, n ∈ N) is a Markov chain on E_N = J0, N K. Conditionally on X_n, each child at generation n + 1 has probability X_n/N to be of type A. Thus, conditionally on X_n, the random variable X_{n+1} has a binomial distribution with parameters (N, X_n/N). The transition matrix P_N is thus given by:

P_N(i, j) = \binom{N}{j} (i/N)^j (1 − i/N)^{N−j} for i, j ∈ E_N.
⁷ R. A. Fisher. On the dominance ratio. Proc. Roy. Soc. Edinburgh, vol. 42, pp. 321-341, 1922.
⁸ S. Wright. Evolution in Mendelian populations. Genetics, vol. 16, pp. 97-159, 1931.

Notice that 0 and N are absorbing states, and that {1, . . . , N − 1} is an open communicating class. The quantity of interest in this model is the extinction time of the diversity (that is, the entry time of {0, N}):

τ_N = inf{n ≥ 0; X_n ∈ {0, N}},
with the convention inf ∅ = ∞. Using martingale techniques developed in Chapter 4, one can
easily prove the following result.

Lemma 3.43. A.s. the extinction time τN is finite and P(XτN = N |X0 ) = X0 /N .

It is interesting to study the mean extinction time t_N = (t_N(i); i ∈ E_N) defined by t_N(i) = E_i[τ_N]. We have t_N(0) = t_N(N) = 0 and for i ∈ {1, . . . , N − 1}:

t_N(i) = ∑_{j∈E_N} E_i[τ_N 1{X_1 = j}]
 = 1 + ∑_{j∈E_N} E_i[ inf{n ≥ 0; X_{n+1} ∈ {0, N}} 1{X_1 = j} ]
 = 1 + ∑_{j∈E_N} E_j[τ_N] P_i(X_1 = j)
 = 1 + P_N t_N(i),

where we used the Markov property at time 1 for the third equality. As 0 and N are absorbing states, we have t_N(i) = P_N t_N(i) = 0 for i ∈ {0, N}. Let e_0 (resp. e_N) denote the element of R^{N+1} with all entries equal to 0 except the first (resp. last) one, which is equal to 1, and 1 = (1, . . . , 1) ∈ R^{N+1}. We have:

t_N = P_N t_N + 1 − e_0 − e_N.

So to compute t_N, one has to solve a linear system; a numerical sketch is given below. For large N, we have the following result⁹ for x ∈ [0, 1]:

(1/N) E_{⌊N x⌋}[τ_N] → −2 (x log(x) + (1 − x) log(1 − x)) as N → ∞,

where ⌊z⌋ is the integer part of z ∈ R. We give an illustration of this approximation in Figure 3.3.
3.3.

Ehrenfest urn model

The Ehrenfest¹⁰ model was introduced in 1907 to describe some “paradoxes” of statistical physics. We consider N particles in two identical containers. At each discrete time, we pick one particle at random and move it to the other container. Let X_n denote the number of particles in the first container at time n, X_0 being the initial configuration. The sequence X = (Xn, n ∈ N) represents the evolution of the system. The equilibrium states should concentrate about half of the particles in each container. In this model one container being empty is possible, but almost never observed. We shall explain this situation using results on Markov chains¹¹.

⁹ W. J. Ewens. Mathematical population genetics. Springer-Verlag, second edition, 2004.
¹⁰ T. Ehrenfest and P. Ehrenfest. The conceptual foundations of the statistical approach in mechanics. Cornell Univ. Press, 1959.

[Figure 3.3: Mean extinction time of the diversity (k ↦ E_k[τ_N]) and its continuous limit, x ↦ −2N (x log(x) + (1 − x) log(1 − x)), for N = 10 (left) and N = 100 (right). Plot data omitted.]

By construction, as all the particles play the same role, the process X is a Markov chain on E = J0, N K with transition matrix P given by P(k, ℓ) = 0 if |k − ℓ| ≠ 1, P(k, k + 1) = (N − k)/N and P(k, k − 1) = k/N for k, ℓ ∈ E. We deduce that the Markov chain X is irreducible. Notice that X is reversible with respect to the binomial distribution π_N = (π_N(k), k ∈ E), where π_N(k) = 2^{−N} \binom{N}{k} for k ∈ E. To see this, it is enough to check that π_N(k) P(k, k + 1) = π_N(k + 1) P(k + 1, k) for all k ∈ J0, N − 1K. For k ∈ J0, N − 1K, we have:

π_N(k) P(k, k + 1) = 2^{−N} \binom{N}{k} (N − k)/N = 2^{−N} \binom{N}{k+1} (k + 1)/N = π_N(k + 1) P(k + 1, k).
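A quick numerical check of this reversibility computation (a Python sketch, with an illustrative value of N):

from math import comb

# Check that the Ehrenfest chain satisfies detailed balance with respect
# to the binomial distribution pi_N(k) = 2^(-N) * binom(N, k).
N = 6
pi = [comb(N, k) / 2 ** N for k in range(N + 1)]

def P(k, l):
    if l == k + 1:
        return (N - k) / N
    if l == k - 1:
        return k / N
    return 0.0

assert all(abs(pi[k] * P(k, k + 1) - pi[k + 1] * P(k + 1, k)) < 1e-12
           for k in range(N))
print("detailed balance holds for N =", N)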
According to Lemma 3.18 and Theorem 3.32, we deduce that π_N is the unique invariant probability measure of X. Let a > 0 and define the interval I_{a,N} = [(N − a√N)/2, (N + a√N)/2]. We also get that the empirical mean time (1/n) ∑_{k=1}^{n} 1{X_k ∈ I_{a,N}} spent by the system in the interval I_{a,N} converges a.s., as n goes to infinity, towards π_N(I_{a,N}). Thanks to the CLT, we have that π_N(I_{a,N}) converges, as N goes to infinity, towards P(G ∈ [−a, a]), where G ∼ N(0, 1) is a standard Gaussian random variable. For a larger than a few units (say 2 or 3), this latter probability is close to 1. This implies that it is unlikely to observe values away from N/2 by more than a few units of √N. Using large deviation theory for the Bernoulli distribution with parameter 1/2, we get that for ε ∈ (0, 1):

(2/N) log(π_N([0, N(1 − ε)/2])) → −(1 + ε) log(1 + ε) − (1 − ε) log(1 − ε) as N → ∞.

Thus the probability to observe values away from N/2 by a small multiple of N decreases exponentially fast towards 0 as N goes to infinity.
For k, ℓ ∈ E, let t_{k,ℓ} = E_k[T^ℓ] be the mean of the return time to ℓ starting from k. Set N_0 = ⌊N/2⌋. Using (3.14) and Stirling's formula, we get:

t_{N_0,N_0} = 1/π_N(N_0) ∼ √(πN/2) and t_{0,0} = 1/π_N(0) = 2^N.

Notice that t_{0,0} and t_{N_0,N_0} are not of the same order.
¹¹ S. Karlin and J. McGregor. Ehrenfest urn models. J. Appl. Probab., vol. 2, pp. 352-376, 1965.

We are now interested in the mean return times from 0 to N_0 and from N_0 to 0. Let ℓ ≥ 2. By decomposing with respect to X_1, we have t_{ℓ−1,ℓ} = 1 + ((ℓ − 1)/N) t_{ℓ−2,ℓ} and for k ∈ J0, ℓ − 2K:

t_{k,ℓ} = 1 + (k/N) t_{k−1,ℓ} + ((N − k)/N) t_{k+1,ℓ}.

Then, using some lengthy computations, we get by induction that for 0 ≤ k < ℓ ≤ N:

t_{k,ℓ} = (N/2) ∫₀¹ (1 − u)^{N−ℓ} (1 + u)^k [ (1 + u)^{ℓ−k} − (1 − u)^{ℓ−k} ] du/u.

We then deduce that:

t_{0,N_0} ∼ (N/4) log(N) and t_{N_0,0} ∼ 2^N.

This is another indication that the process is mostly observed around N_0 rather than around 0. Notice that the mean time to reach equilibrium starting from 0 is about N log(N)/4. In fact, one can show a cut-off phenomenon¹²: starting from any initial distribution, one needs at most about N log(N)/4 steps to be close to the invariant measure.

Queuing and stock models

Queuing theory goes back to A. Erlang¹³ in 1909, whose work was motivated by telephone exchanges. Since then, this domain has generated a huge amount of work. We shall consider a toy example in discrete time. We consider Y_n the size of the queue at the end of the service of the n-th client, with initial state Y_0. We have Y_{n+1} = (Y_n − 1 + V_{n+1})_+, where V_{n+1} is the number of clients who arrived during the service of the n-th client. The random variables (V_n, n ∈ N∗) are assumed to be independent with the same distribution and independent of Y_0, so that (Y_n, n ∈ N) is a Markov chain on N. More generally, we can consider the Markov chain X = (Xn, n ∈ N) on N defined as a stochastic dynamical system, for n ∈ N:

X_{n+1} = (X_n + U_{n+1})_+,    (3.31)

where the innovation (U_n, n ∈ N∗) is a sequence of Z-valued independent random variables with the same distribution, independent of X_0. Notice this Markov chain is also a model for the evolution of a stock, with U_n being the delivery minus the consumption at time n. The next lemma gives a criterion for the transience or recurrence of X.

Lemma. Let X = (Xn, n ∈ N) be a Markov chain defined by (3.31). We assume that U_1 ∈ L¹, P(U_1 > 0) > 0, P(U_1 < 0) > 0 and X is irreducible.

1. If E[U1 ] > 0, then X is transient.

2. If E[U1 ] = 0 and U1 ∈ L2 , then X is null recurrent.

3. If E[U1 ] < 0 and U1 has exponential moments, then X is positive recurrent.
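A simulation sketch of the stock model (3.31), under the simplifying assumption that U takes values in {−1, +1} only; it illustrates the dichotomy of the lemma through the fraction of time spent at 0.

import random

# Stock model (3.31) with P(U = 1) = p and P(U = -1) = 1 - p, so that
# E[U] = 2p - 1. The fraction of time spent at 0 stays positive in the
# positive recurrent case E[U] < 0, and vanishes when E[U] > 0.
def time_at_zero(p, n):
    x, visits = 0, 0
    for _ in range(n):
        u = 1 if random.random() < p else -1
        x = max(x + u, 0)  # X_{n+1} = (X_n + U_{n+1})_+
        visits += (x == 0)
    return visits / n

print(time_at_zero(0.4, 100_000))  # E[U] < 0: positive fraction at 0
print(time_at_zero(0.6, 100_000))  # E[U] > 0: fraction close to 0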

¹² G.-Y. Chen and L. Saloff-Coste. The L²-cutoff for reversible Markov processes. J. Funct. Analysis, vol. 258, pp. 2246-2315, 2010.
¹³ A. Erlang. The theory of probabilities and telephone conversations. Nyt Tidsskrift for Matematik B, vol. 20, pp. 33-39, 1909.
Bibliography

[1] X. Chen. Limit theorems for functionals of ergodic Markov chains with general state space.
Mem. Amer. Math. Soc., 1999.

[2] K. L. Chung. Markov chains with stationary transition probabilities. Second edition. Die
Grundlehren der mathematischen Wissenschaften, Band 104. Springer-Verlag New York,
1967.

[3] R. Douc, E. Moulines, P. Priouret, and P. Soulier. Markov chains. Springer Series in
Operations Research and Financial Engineering. Springer, Cham, 2018.

[4] R. Durrett. Probability: theory and examples, volume 31 of Cambridge Series in Statistical
and Probabilistic Mathematics. Cambridge University Press, Cambridge, fourth edition,
2010.

[5] W. Feller. An introduction to probability theory and its applications. Vol. I. Third edition.
John Wiley & Sons, Inc., New York-London-Sydney, 1968.

[6] G. F. Lawler and V. Limic. Random walk: a modern introduction, volume 123 of Cam-
bridge Studies in Advanced Mathematics. Cambridge University Press, Cambridge, 2010.

[7] S. Meyn and R. L. Tweedie. Markov chains and stochastic stability. Cambridge University
Press, Cambridge, second edition, 2009. With a prologue by Peter W. Glynn.

[8] F. Spitzer. Principles of random walk. Springer-Verlag, second edition, 1976.

Chapter 4

Martingales

In all this chapter, we consider (Ω, F, P) a probability space and F = (F_n, n ∈ N) a filtration. We also set F_∞ = ⋁_{n∈N} F_n. We say a process H = (Hn, n ∈ N) is: i) F-adapted if H_n is F_n-measurable for all n ∈ N; ii) integrable if H_n is integrable for all n ∈ N; iii) bounded if sup_{n∈N} |H_n| is a.s. bounded. Those definitions are extended in an obvious way to processes indexed by N̄ = N ∪ {+∞} instead of N. We say a process H = (Hn, n ∈ N∗) is F-predictable if H_n is F_{n−1}-measurable for all n ∈ N∗.
In Section 4.1, we define random times called stopping times and their associated σ-field.
This allows us to extend the Markov property of Markov chains to stopping times, which
is the so-called strong Markov property. Section 4.2 is devoted to the definition and first
properties of the martingales (and super-martingales and sub-martingales). Martingales are
a powerful tool to study processes in particular because of the maximal inequalities, see
Section 4.3, and the convergence results, see Sections 4.4 and 4.5.
The presentation of this chapter follows closely [1], see also [2] for numerous applications.

4.1 Stopping times


Stopping times are random times which play an important role in Markov process theory and
martingale theory.
Definition 4.1. An N̄-valued random variable τ is an F-stopping time if {τ ≤ n} ∈ F_n for all n ∈ N.

From the definition above, notice that if τ is an F-stopping time, then {τ = ∞} = ∩_{n∈N} {τ ≤ n}^c belongs to F_∞.
When there is no ambiguity on the filtration F, we shall write stopping time instead of F-stopping time. It is clear that the integers are stopping times.
Example 4.2. For the simple random walk X = (Xn, n ∈ N), see Example 3.4, and F the natural filtration of X, it is easy to check that the return time to 0, T⁰ = inf{n ≥ 1; X_n = 0}, with the convention that inf ∅ = +∞, is a stopping time. It is also easy to check that T⁰ − 1 is not a stopping time. △
In the next lemma, we give equivalent characterizations of stopping times.

65

Lemma 4.3. Let τ be an N̄-valued random variable.

(i) τ is a stopping time if and only if {τ > n} ∈ F_n for all n ∈ N.

(ii) τ is a stopping time if and only if {τ = n} ∈ F_n for all n ∈ N.

Proof. Use that {τ > n} = {τ ≤ n}^c to get (i). Use that {τ = n} = {τ ≤ n} ∩ {τ ≤ n − 1}^c and that {τ ≤ n} = ∪_{k=0}^{n} {τ = k} to get (ii).

We give in the following proposition some properties of stopping times.


Proposition 4.4. Let (τ_n, n ∈ N) be a sequence of stopping times. The random variables sup_{n∈N} τ_n, inf_{n∈N} τ_n, lim sup_{n→∞} τ_n and lim inf_{n→∞} τ_n are stopping times.

Proof. We have that {sup_{k∈N} τ_k ≤ n} = ∩_{k∈N} {τ_k ≤ n} belongs to F_n for all n ∈ N, as the τ_k are stopping times. This proves that sup_{k∈N} τ_k is a stopping time. Similarly, use that {inf_{k∈N} τ_k ≤ n} = ∪_{k∈N} {τ_k ≤ n} to deduce that inf_{k∈N} τ_k is a stopping time.
Since stopping times are N̄-valued random variables, we get that {lim sup_{k→∞} τ_k ≤ n} = ∪_{m∈N} ∩_{k≥m} {τ_k ≤ n} for n ∈ N. This last event belongs to F_n as the τ_k are stopping times. We deduce that lim sup_{k→∞} τ_k is a stopping time. Similarly, use that {lim inf_{k→∞} τ_k ≤ n} = ∩_{m∈N} ∪_{k≥m} {τ_k ≤ n} for n ∈ N to deduce that lim inf_{k→∞} τ_k is a stopping time.

It is left to the reader to check that the σ-field Fτ in the next definition is indeed a σ-field
and a subset of F∞ .
Definition 4.5. Let τ be a F-stopping time. The σ-field Fτ of the events which are prior to
a stopping time τ is defined by:
Fτ = {B ∈ F∞ ; B ∩ {τ = n} ∈ Fn for all n ∈ N} .

Clearly, we have that τ is Fτ -measurable.


Remark 4.6. Consider X = (Xn , n ∈ N) a Markov chain on a discrete state space E with its
natural filtration F = (Fn , n ∈ N). Recall the return time to x ∈ E defined by T x = inf{n ≥
1; Xn = x} and the excursion Y1 = (T x , X0 , . . . , XT x ) defined in section 3.4.3. It is easy to
check that T x is an F-stopping time and that FT x is equal to σ(Y1 ). Roughly speaking the
σ-field FT x contains all the information on X prior to T x . ♦
We give an elementary characterization of the Fτ -measurable random variables.
Lemma 4.7. Let Y be an F_∞-measurable real-valued random variable and τ a stopping time.

(i) The random variable Y is F_τ-measurable if and only if Y 1{τ = n} is F_n-measurable for all n ∈ N.

(ii) If E[Y] is well defined, then we have that a.s.:

E[Y | F_τ] = ∑_{n∈N̄} 1{τ = n} E[Y | F_n].    (4.1)

Proof. We prove (i). Set Y_n = Y 1{τ = n}. We first assume that Y is F_τ-measurable and we prove that Y_n is F_n-measurable for all n ∈ N. If Y = 1_B with B ∈ F_τ, we clearly get that Y_n is F_n-measurable for all n ∈ N by definition of F_τ. It is then easy to extend this result to any F_τ-measurable random variable which takes finitely many values in R, and then to any F_τ-measurable real-valued random variable Y by considering a sequence of random variables (Y^k, k ∈ N) which converges to Y and such that Y^k is F_τ-measurable and takes finitely many values in R (for example take Y^k = 2^{−k} ⌊2^k Y⌋ 1{|Y| ≤ k} + Y 1{|Y| = +∞}).
We now assume that Y_n is F_n-measurable for all n ∈ N and we prove that Y is F_τ-measurable. Let A ∈ B(R) and set B = Y^{−1}(A). Notice that B belongs to F_∞ as Y is F_∞-measurable. First assume that 0 ∉ A. In this case, we get B ∩ {τ = n} = Y_n^{−1}(A) and thus B ∩ {τ = n} ∈ F_n for all n ∈ N. This gives B ∈ F_τ. If 0 ∈ A, then use that B = Y^{−1}(A) = (Y^{−1}(A^c))^c to also get that B ∈ F_τ. This implies that Y is F_τ-measurable. This ends the proof of (i).
We now prove (ii). Assume first that Y ≥ 0 and set:

Z = ∑_{n∈N̄} E[Y | F_n] 1{τ = n}.

Since Y is F_∞-measurable, we also get that Y 1{τ = ∞} is F_∞-measurable. Thus, we deduce from (i) that Z is F_τ-measurable. For B ∈ F_τ, we have:

E[Z 1_B] = ∑_{n∈N̄} E[ E[Y | F_n] 1_{{τ = n} ∩ B} ] = ∑_{n∈N̄} E[ Y 1_{{τ = n} ∩ B} ] = E[Y 1_B],

where we used monotone convergence for the first equality, the fact that {τ = n} ∩ B belongs to F_n and (2.1) for the second, and monotone convergence for the last. As Z is F_τ-measurable, we deduce from (2.1) that a.s. Z = E[Y | F_τ].
Then consider Y an F_∞-measurable real-valued random variable. Subtracting (4.1) with Y replaced by Y⁻ from (4.1) with Y replaced by Y⁺ gives that (4.1) holds as soon as E[Y] is well defined.

Definition 4.8. Let X = (Xn, n ∈ N̄) be an F-adapted process and τ an F-stopping time. The random variable X_τ is defined by:

X_τ = ∑_{n∈N̄} X_n 1{τ = n}.

This definition is extended in an obvious way when τ is an a.s. finite stopping time and X a process indexed by N instead of N̄. By construction, the random variable X_τ from Definition 4.8 is F_τ-measurable. We can now give an extension of the Markov property, see Definition 3.2, when considering random times. Compare the next proposition with Corollary 3.12.

Proposition 4.9 (Strong Markov property). Let X = (Xn , n ∈ N) be a Markov chain with
respect to the filtration F = (Fn , n ∈ N), taking values in a discrete state space E and with
transition matrix P . Let τ be a F-stopping time a.s. finite and define a.s. the shifted process
X̃ = (X̃k = Xτ +k , k ∈ N). Conditionally on Xτ , we have that Fτ and X̃ are independent
and that X̃ is a Markov chain with transition matrix P , which means that a.s. for all k ∈ N,
all x0 , . . . , xk ∈ E:

P(X̃_0 = x_0, . . . , X̃_k = x_k | F_τ) = P(X̃_0 = x_0, . . . , X̃_k = x_k | X_τ) = 1{X_τ = x_0} ∏_{j=1}^{k} P(x_{j−1}, x_j).    (4.2)

Proof. Let B ∈ F_τ, k ∈ N and x_0, . . . , x_k ∈ E. We first compute:

I_n = E[ 1_B 1{X̃_0 = x_0, . . . , X̃_k = x_k} | F_n ] 1{τ = n}.

We have, using that B ∩ {τ = n} ∈ F_n and the Markov property at time n:

I_n = E[ 1_{B ∩ {τ = n}} 1{X_n = x_0, . . . , X_{n+k} = x_k} | F_n ] = 1_{B ∩ {τ = n}} H(X_n),

where for x ∈ E:

H(x) = Px(X_0 = x_0, . . . , X_k = x_k) = 1{x = x_0} ∏_{i=0}^{k−1} P(x_i, x_{i+1}).    (4.3)

We get from Lemma 4.7 and Definition 4.8 that E[ 1_B 1{X̃_0 = x_0, . . . , X̃_k = x_k} | F_τ ] = 1_B H(X_τ). Then, taking the expectation conditionally on X_τ, we deduce that:

E[ 1_B 1{X̃_0 = x_0, . . . , X̃_k = x_k} | X_τ ] = P(B | X_τ) H(X_τ).    (4.4)

Since this holds for all B ∈ F_τ, k ∈ N and x_0, . . . , x_k ∈ E, we get that conditionally on X_τ, F_τ and X̃ are independent. Take B = Ω in (4.4) to get P(X̃_0 = x_0, . . . , X̃_k = x_k | X_τ) = H(X_τ), and use the definition (4.3) of H to conclude that X̃ is, conditionally on X_τ, a Markov chain with transition matrix P. Take B = Ω in the previous computations to get (4.2).
Using the strong Markov property, it is immediate to get that the excursions of a recurrent irreducible Markov chain out of a given state are independent and, except for the first one, with the same distribution, see the key Lemma 3.36.
We end this section with the following lemma.
We end this section with the following lemma.
Lemma 4.10. Let τ and τ′ be two stopping times.
(i) The events {τ < τ′}, {τ = τ′} and {τ ≥ τ′} belong to F_τ and F_τ′.
(ii) If B ∈ Fτ , then we have that B ∩ {τ ≤ τ 0 } belongs to Fτ 0 .
(iii) If τ ≤ τ 0 , then we have Fτ ⊂ Fτ 0 .
Proof. We have {τ < τ 0 }∩{τ = n} = {τ = n}∩{τ 0 > n} which belongs to Fn as {τ = n} and
{τ 0 > n} belong already to Fn . Since this holds for all n ∈ N, we deduce that {τ < τ 0 } ∈ Fτ .
The other results of property (i) can be proved similarly.
Let B ∈ Fτ . This implies that B ∩ {τ ≤ n} belongs to Fn . We deduce that B ∩ {τ ≤
τ 0 } ∩ {τ 0 = n} = B ∩ {τ ≤ n} ∩ {τ 0 = n} belongs to Fn . Since this holds for n ∈ N, we get
that B ∩ {τ ≤ τ 0 } ∈ Fτ 0 . This gives property (ii).
Property (iii) is a direct consequence of property (ii) as {τ ≤ τ 0 } = Ω.
Remark 4.11. In some cases, it can be convenient to assume that F_0 contains at least all the P-null sets. Under this condition, if an N̄-valued random variable is a.s. constant, then it is a stopping time. And, more importantly, under this condition, property (iii) of Lemma 4.10 holds if a.s. τ ≤ τ′. ♦

4.2 Martingales and the optional stopping theorem

Definition 4.12. A real-valued process M = (Mn, n ∈ N) is called an F-martingale if it is F-adapted, integrable and for all n ∈ N a.s.:

E[Mn+1 | Fn ] = Mn . (4.5)

If (4.5) is replaced by E[Mn+1 | Fn ] ≥ Mn , then M is called an F-sub-martingale.


If (4.5) is replaced by E[Mn+1 | Fn ] ≤ Mn , then M is called an F-super-martingale.

Quoting [1]: “A super-martingale is by definition a sequence of random variables which


decrease in conditional mean. For a sequence (Mn , n ∈ N) of non-negative random variables
denoting the sequence of values of the fortune of a gambler, the super-martingale condition
express the property that at each play the game is unfavorable to the player in conditional
mean. On the other hand, a martingale remains constant in conditional mean and, for the
gambler, corresponds to a game which is on the average fair”.
When there is no possible confusion, we omit the filtration; for example we write mar-
tingale for F-martingale. See [1], for a theory of super-martingales which are non-negative
processes instead of integrable.
Example 4.13 (Random walk in R). Let (Un , n ∈ N∗ ) be independent integrable real-valued
random variables with the sameP distribution. We consider the random walk X = (Xn , n ∈ N)
defined by X0 = 0 and Xn = nk=1 Uk for n ∈ N∗ . Let F = (Fn , n ∈ N) be the natural
filtration of the process X. If E[U1 ] ≤ 0, then X is a super-martingale. If E[U1 ] = 0, then X
is a martingale.
Assume that U1 has allits exponential moments (that is E[exp(λU1 )] < +∞ for all λ ∈ R),
λU

and define ϕ(λ) = log E e 1 for λ ∈ R. Let λ ∈ R be fixed. It is elementary to check
that M λ = (Mnλ , n ∈ N) defined by, for n ∈ N:
Mnλ = eλXn −nϕ(λ)
is a positive martingale. This martingale is called the exponential martingale associated to
the random walk X. 4
Let X = (Xn , n ∈ N) and (Hn , n ∈ N∗ ) be two sequences of real-valued random variables
which are a.s. finite. We define the discrete stochastic integral of H with respect to X by
the process H ·X = (H ·Xn , n ∈ N) with H ·X0 = 0 and for all n ∈ N∗ :
n
X
H ·Xn = Hk ∆Xk = H ·Xn−1 + Hn ∆Xn where ∆Xk = Xk − Xk−1 .
k=1

Lemma 4.14. Let M be a martingale (resp. super-martingale) and H a bounded real-valued


predictable process (resp. and non-negative). Then, the discrete stochastic integral H·M is a
martingale (resp. super-martingale).
Proof. With M = (Mn , n ∈ N), H = (Hn , n ∈ N∗ ) and H ·M = (H ·Mn , n ∈ N), we get that
the process H ·M is adapted and integrable. Assume that M is a martingale. We have for
n ∈ N∗ a.s.:
E[H ·Mn | Fn−1 ] = H ·Mn−1 + Hn (E[Mn | Fn−1 ] − Mn−1 ) = 0.
70 CHAPTER 4. MARTINGALES

The conclusion is then straightforward. The case M super-martingale and H non-negative


is proved similarly.

Remark 4.15 (Doob decomposition of a super-martingale). Let M = (Mn , n ∈ N) be a super-


martingale. We set N0 = M0 , A0 = 0 and for n ∈ N:

Nn+1 = Nn + Mn+1 − E[Mn+1 | Fn ] and An+1 = An + Mn − E[Mn+1 | Fn ].

By construction the process N = (Nn ∈ N) is adapted and the process A = (An , n ∈ N∗ )


is predictable. It easy to check that N is a martingale and A is non-decreasing and thus
non-negative. The decomposition of the super-martingale M as Mn = Nn − An with N a
martingale and A predictable non-decreasing is called the Doob decomposition of M . ♦
Using Jensen inequality (2.4), we easily derive the next corollary.

Corollary 4.16. Let M = (Mn , n ∈ N) be a real-valued F-martingale. Let ϕ be a R-valued


convex function defined on R. Assume that ϕ(Mn ) is integrable for all n ∈ N. Then, the
process (ϕ(Mn ), n ∈ N) is a sub-martingale.

For x, x0 ∈ R, we recall that we write x ∧ x0 for min(x, x0 ).


Lemma 4.17. Let τ be a stopping time and M = (Mn , n ∈ N) a martingale (resp. super-
martingale, sub-martingale). Then, the process M τ = (Mτ ∧n , n ∈ N) is a martingale (resp.
super-martingale, sub-martingale).

We provide two proofs of this important lemma. The shorter one relies on the stochastic
integral. The other one can be generalized when the integrability condition is weakened; it
will inspire some computations in Chapter 5.

First proof. Let M be a martingale. The process H = (Hn , n ∈ N∗ ) defined by Hn = 1{τ ≥n}
is predictable bounded and non-negative. The discrete stochastic integral of H with respect
to M is given by H ·M = (H ·Mn , n ∈ N) with:
n
X
H ·Mn = 1{τ ≥k} (Mk − Mk−1 ) = Mτ ∧n − M0 .
k=1

As H ·M is a martingale according to Lemma 4.14, we deduce that M τ is a martingale. The


proofs are similar in the super-martingale and sub-martingale cases.
Second proof. Let M be a martingale. For n ∈ N, we have:
n−1
X
Mτ ∧n = Mk 1{τ =k} + Mn 1{τ >n−1} .
k=0
4.3. MAXIMAL INEQUALITIES 71

This implies that Mτ ∧n is integrable and Fn -measurable. For n ≥ 1, we have:


n−1
X
E [Mτ ∧n | Fn−1 ] = Mk 1{τ =k} + E [Mn | Fn−1 ] 1{τ >n−1}
k=0
n−1
X
= Mk 1{τ =k} + Mn−1 1{τ >n−1}
k=0
= Mτ ∧(n−1) .

This implies that M τ is a martingale. The proofs are similar in the super-martingale and
sub-martingale cases.

We recall that a stopping time τ is bounded if P(τ ≤ n0 ) = 1 for some n0 ∈ N. If ν is a


stopping time, recall the σ-field Fν of the events prior to ν.
Theorem 4.18 (Optional stopping theorem). Let M = (Mn , n ∈ N) be a martingale. Let τ
and ν be bounded stopping times such that ν ≤ τ . We have a.s.:

E [Mτ | Fν ] = Mν . (4.6)

When M is a super-martingale (resp. sub-martingale) the equality in (4.6) is replaced by the


inequality E [Mτ | Fν ] ≤ Mν (resp. E [Mτ | Fν ] ≥ Mν ).

In particular, if M is a martingale and τ a bounded stopping time, taking the expectation


in (4.6) with ν = 0, we get E [Mτ ] = E [M0 ]. See Proposition 4.26 for an extension of the
optional stopping theorem to unbounded stopping times for closed martingale.

Proof. Let n0 ∈ N be such that a.s. τ ≤ n0 . We have according to Lemma 4.7 that a.s.:
n0
X
E [Mτ | Fν ] = 1{ν=n} E [Mτ | Fn ] .
n=0

Since M τ is a martingale and τ ∧ n0 = τ a.s., we have for n ≤ n0 that a.s. E [Mτ | Fn ] =


E [Mτ ∧n0 | Fn ] = Mτ ∧n . Since ν ≤ τ , we deduce that:
n0
X
E [Mτ | Fν ] = 1{ν=n} Mτ ∧n = Mν .
n=0

The proofs are similar for super-martingales and sub-martingales.

See Exercise 9.27 for an application of the martingale theory to simple random walk.

4.3 Maximal inequalities


In this section, we provide inequalities in mean on the path of a martingale using the last
value of the path.
72 CHAPTER 4. MARTINGALES

Theorem 4.19 (Doob’s maximal inequality). Let (Mn , n ∈ N) be a sub-martingale. Let


a > 0. Then, we have for n ∈ N:
  h i
aP max Mk ≥ a ≤ E Mn 1{maxk∈J0,nK Mk ≥a} ≤ E Mn+ .
 
k∈J0,nK

Proof. Let n ∈ N. Consider the stopping time τ = inf{k ∈ N; Mk ≥ a}, and set A =
{maxk∈J0,nK Mk ≥ a} = {τ ≤ n}. Thanks to the optional stopping theorem, we have E[Mn ] ≥
E[Mτ ∧n ]. Since Mτ ∧n ≥ a1A + Mn 1Ac , we deduce that:
E[Mn ] ≥ aP(A) + E [Mn 1Ac ] .
This implies that E [Mn 1A ] ≥ aP(A). The inequality E [Mn 1A ] ≤ E [Mn+ ] is obvious.

Let (Mn , n ∈ N) be a sequence of real-valued random variables. We define for n ∈ N:


Mn∗ = max |Mk |. (4.7)
k∈J0,nK

We deduce from Corollary 4.16 that if (Mn , n ∈ N) is a martingale, then (|Mn |, n ∈ N)


is a sub-martingale and thus, thanks to Theorem 4.19, we have for a > 0:
aP (Mn∗ ≥ a) ≤ E[|Mn |].
Proposition 4.20. Let M = (Mn , n ∈ N) be a martingale. Assume that Mn ∈ Lp for some
n ∈ N and p > 1. Then, we have, with Cp = (p/(p − 1))p :
E [(Mn∗ )p ] ≤ Cp E [|Mn |p ] .
Proof. We first prove that Mn∗ belongs to Lp . We deduce from Corollary 4.16 that (|Mk |, k ∈
N) is a non-negative sub-martingale. We deduce from Jensen inequality that for 0 ≤ k ≤ n:
E [|Mk |p ] ≤ E [E [|Mn || Fk ]p ] ≤ E [|Mn |p ] . (4.8)
Pn
Since Mn∗ ≤ k=1 |Mk |, we deduce that Mn∗ belongs to Lp .
Thanks to Theorem 4.19 (with Mk replaced by |Mk |), we have for all a > 0 that aP(Mn∗ ≥
a) ≤ E |Mn |1{Mn∗ ≥a} . Multiplying this inequality by p(p−1)ap−2 and integrating over a > 0


with respect to the Lebesgue measure gives:


Z
(p − 1)E [(Mn∗ )p ] = p(p − 1) ap−1 P(Mn∗ ≥ a) da
Za>0
ap−2 E |Mn |1{Mn∗ ≥a} da
 
≤ p(p − 1)
a>0
= pE |Mn | (Mn∗ )p−1 .
 

Using Hölder inequality, we get that E |Mn | (Mn∗ )p−1 ≤ E [|Mn |p ]1/p E [(Mn∗ )p ](p−1)/p . This
 

implies that (p − 1)E [(Mn∗ )p ]1/p ≤ pE [|Mn |p ]1/p .

4.4 Convergence of martingales


We now state the main result on convergence of martingales whose proof is given at the end
of this section.
4.4. CONVERGENCE OF MARTINGALES 73

Theorem 4.21. Let M = (Mn , n ∈ N) be a martingale or a sub-martingale or a super-


martingale bounded in L1 , that is supn∈N E[|Mn |] < +∞. Then, the process M converges a.s.
to a limit, say M∞ , which is integrable and:

lim inf E[|Mn |] ≥ E[|M∞ |]. (4.9)


n→∞

We give in the next corollary direct consequences which are so often used that they deserve
to be stated on their own.
Corollary 4.22. We have the following results.

(i) Let M = (Mn , n ∈ N) be a sub-martingale such that supn∈N E[Mn+ ] < +∞. Then, the
process M converges a.s. to a limit, say M∞ , which is integrable and (4.9) holds.

(ii) Let M = (Mn , n ∈ N) be a non-negative martingale or a non-negative super-martingale.


Then the process M converges a.s. to a limit, say M∞ , which is integrable, and we have:

lim E[Mn ] ≥ E[M∞ ]. (4.10)


n→∞

Proof. We first prove property (i). As M is a sub-martingale, we have that E[M0 ] ≤ E[Mn ]
and thus E[|Mn |] ≤ 2E[Mn+ ] − E[M0 ]. We deduce the condition supn∈N E[Mn+ ] < +∞ is
equivalent to supn∈N E[|Mn |] < +∞. Then use Theorem 4.21 to conclude.
Let M be a non-negative super-martingale. Considering property (i) with −M , we get
the a.s. convergence of M towards a limit say M∞ . Then use Fatou’s lemma and that the
sequence (E[Mn ], n ∈ N) is non-increasing to get (4.10).

Remark 4.23. We state without proof the following extension, see [1]. Let M = (Mn , n ∈ N)
be a non-negative non necessarily integrable super-martingale, that is M is adapted, and
a.s., for all n ∈ N, we have Mn ≥ 0 and E[Mn+1 |Fn ] ≤ Mn . Then, the process M converges
a.s. and the limit, say M∞ , satisfies the inequality E [M∞ | Fn ] ≤ Mn a.s. for all n ∈ N.
Furthermore, for all stopping times τ and ν such that τ ≥ ν, we have that a.s. E [Mτ | Fν ] ≤
Mν . However, Equality (4.6) does not hold in general for positive non necessarily integrable
martingale, that is an adapted process M = (Mn , n ∈ N) such that, for all n ∈ N, a.s. Mn ≥ 0
and E[Mn+1 |Fn ] = Mn . ♦

Proof of Theorem 4.21


This proof can be skipped in a first reading. For a < b ∈ R and a sequence x = (xn , n ∈ N)
of elements of R, we define the down-crossing and up-crossing times of [a, b] for the sequence
x as τ0 (x) = 0 and for all k ∈ N∗ :

νk (x) = inf{n ≥ τk−1 (x); xn ≤ a} and τk (x) = inf{n ≥ νk (x); xn ≥ b},

with the convention that inf ∅ = ∞. We define the number of up-crossings for the sequence
x of the interval [a, b] up to time n ∈ N as:

βa,b (x, n) = sup{k ∈ N; τk (x) ≤ n}.


74 CHAPTER 4. MARTINGALES

We shall also consider the total number of up-crossings given by:


βa,b (x) = lim βa,b (x, n) = Card ({k ∈ N; τk (x) < ∞}) ∈ N.
n→∞
As a < b, we have the following implications:
lim inf xn < a < b < lim sup xn =⇒ βa,b (x) = ∞ =⇒ lim inf xn ≤ a < b ≤ lim sup xn .
n→∞ n→∞ n→∞ n→∞

We deduce that the sequence x converges in R if and only if βa,b (x) < ∞ for all a < b with
a, b ∈ Q. Thus, to prove the convergence of the sequence x, it is enough to give a finite upper
bounds of βa,b (x). Since xτk (x) − xνk (x) ≥ b − a when τk (x) < ∞ that is k ≤ βa,b (x), we
deduce that:
βa,b (x,n)
X
(xτk (x) − xνk (x) ) ≥ (b − a)βa,b (x, n). (4.11)
k=1
Define H` (x) = 1Sk∈N {νk (x)<`≤τk (x)} for ` ∈ N∗ . Considering the discrete integral H(x)·xn =
Pn
`=1 H` (x)∆x` , with ∆x` = x` − x`−1 , we get:
βa,b (x,n)
X
H(x)·xn ≥ (xτk (x) − xνk (x) ) − (xn − a)− ≥ (b − a)βa,b (x, n) − (xn − a)− , (4.12)
k=1
where for the first inequality we took into account the fact that n may belongs to an up-
crossing from a to b, and we used (4.11) for the second.
Up to replacing M by −M , we can assume that M is a super-martingale. We now replace
x by the super-martingale M . The random variables νk (M ), τk (M ), for k ∈ N, are by
construction stopping times. This implies that, for ` ∈ N∗ , the event {νk (M ) < ` ≤ τk (M )}
belongs to F`−1 . We deduce that the process H = (H` (M ), ` ∈ N∗ ) is adapted bounded and
non-negative. Thanks to Lemma 4.14 the discrete stochastic integral (H(M )·Mn , n ∈ N) is
a super-martingale. Since H(M )·M0 = 0, we get E[H(M )·Mn ] ≤ 0. We deduce from (4.12)
that:
(b − a)E[βa,b (M, n)] ≤ E[(Mn − a)− ] + E[H(M )·Mn ] ≤ E[|Mn |] + |a|.
Letting n goes to infinity in the previous inequality, we get using supn∈N E[|Mn |] < +∞
and
T the monotone convergence theorem that E[βa,b (M )] < +∞. This implies that the event
a<b; a,b∈Q {βa,b (M ) < ∞} has probability 1, that is the super-martingale M a.s. converges
to a real-valued random variable, say M∞ .
Using Fatou’s lemma, we get:
E[|M∞ |] = E[ lim |Mn |] ≤ lim inf E[|Mn |] ≤ sup E[|Mn |] < +∞.
n→∞ n→∞ n∈N

We deduce that M∞ is integrable and that (4.9) holds.

4.5 More on convergence of martingales


The fact that (4.10) is an equality or not for martingales plays an important role in the
applications, which motivate this section. We refer to Section 8.2.3 for the definition of the
uniform integrability and some related results.
Theorem 4.24. Let M = (Mn , n ∈ N) be a martingale. The next conditions are equivalent.
4.5. MORE ON CONVERGENCE OF MARTINGALES 75

(i) The martingale M converges a.s. and in L1 to a limit, say M∞ .


(ii) There exists a real-valued integrable random variable Z such that Mn = E[Z| Fn ] a.s.
for all n ∈ N.

(iii) The random variables (Mn , n ∈ N) are uniformly integrable.

If any of these conditions hold, we have that a.s. for all n ∈ N:


Mn = E[M∞ | Fn ]. (4.13)

A martingale satisfying the conditions of Theorem 4.24 is called a closed martingale.

Proof. Assume property (i) holds. Denote M∞ the limit of M . For m ≥ n, we have
E[Mm | Fn ] = Mn a.s. and thus using Jensen’s inequality:
h i h i h i h i
E |Mn − E[M∞ | Fn ]| = E |E[Mm − M∞ | Fn ]| ≤ E E[|Mm − M∞ || Fn ] = E |Mm − M∞ | .

As M converges to M∞ in L1 , we deduce that the h right-hand side of


i the previous equation
goes to 0 as m goes to infinity. We deduce that E |Mn − E[M∞ | Fn ]| = 0, which gives (4.13)
and that (ii) holds with Z = M∞ .
Using Lemma 8.19, we get that property (ii) implies property (iii).
Assume property (iii) holds. Thanks to (b) from property (i) of Proposition 8.18, we
get that supn∈N E[|Mn |] < +∞. Using Theorem 4.21, we deduce that the martingale con-
verges a.s. towards a limit, say M∞ . Since the random variables (Mn , n ∈ N) are uniformly
integrable, we deduce from Proposition 8.21 that the convergence holds also in L1 . Hence,
property (i) holds.

The next Lemma does not hold if we assume that Z isWnon-negative instead of integrable,
see a counter-example page 31 in [1]. Recall that F∞ = n∈N Fn .

Corollary 4.25. Let Z be an integrable real-valued random variable. Then the process
(E[Z| Fn ], n ∈ N) is a closed martingale which converges a.s. and in L1 towards E[Z| F∞ ].

Proof. Condition (ii) of Theorem 4.24 holds for the martingale M = (Mn = E[Z| Fn ], n ∈ N).
Since (i) and (ii) of Theorem 4.24 are equivalent, we deduce that M converges a.s. and in L1
to a real-valued random variable, say M∞ which is integrable. Using (4.13), we get that for
all A ∈ Fn :
E [(Z − M∞ )1A ] = E [E [(Z − M∞ )| Fn ] 1A ] = 0.
S
This implies that the set A ⊂ F of events A such that E [Z1A ] = E [M∞ 1A ] contains n∈N Fn
which is stable by finite intersection. Since Z and M∞ are integrable, we get, using dominated
convergence, that A is alsoSa λ-system. According to the monotone class theorem, A contains
the σ-field generated by n∈N Fn , that is F∞ . Then, we deduce from Definition 2.2 and
Lemma 2.3 that a.s. M∞ = E[Z| F∞ ].

We can extend the optional stopping theorem for closed martingale to any stopping times.
76 CHAPTER 4. MARTINGALES

Proposition 4.26. Let M = (Mn , n ∈ N) be a closed martingale and write M∞ for its a.s.
limit. Let τ and ν be stopping times such that ν ≤ τ . Then, we have a.s.:

E [Mτ | Fν ] = Mν . (4.14)

Proof. According to Lemma 4.7, we have for any stopping time τ 0 that a.s.:
X X
E [M∞ |Fτ 0 ] = 1{τ 0 =n} E [M∞ |Fn ] = 1{τ 0 =n} Mn = Mτ 0 ,
n∈N n∈N

where we used (4.13) for the second equality. Using this result twice first with τ 0 = τ and
then with τ 0 = ν, we get, as Fν ⊂ Fτ according to property (iii) of Lemma 4.10, that a.s.:

E [Mτ | Fν ] = E [E [M∞ | Fτ ] | Fν ] = E [M∞ | Fν ] = Mν .

This gives the result.

We have the following result when the martingale is bounded in Lp for some p > 1.

Proposition 4.27. Let M = (Mn , n ∈ N) be a martingale such that supn∈N E[|Mn |p ] < +∞
for some p > 1. Then, the martingale converges a.s. and in Lp towards a limit, say M∞ and
Mn = E[M∞ | Fn ] a.s. for all n ∈ N. We also have that M∞ ∗ = sup p
n∈N |Mn | belongs to L
p
and, with Cp = (p/(p − 1)) :
∗ p
E [(M∞ ) ] ≤ Cp E [|M∞ |p ] as well as E [|M∞ |p ] = sup E[|Mn |p ].
n∈N

Proof. Since M is bounded in L1 , we deduce from Theorem 4.21 that M converges a.s.
towards a limit, say M∞ ∈ L1 . We recall, see (4.7), that Mn∗ = maxk∈J0,nK |Mk |. By monotone
convergence, since M∞ ∗ = lim ∗
n→∞ Mn , we have that:

∗ p
E [(M∞ ) ] = lim E [(Mn∗ )p ] . (4.15)
n→∞

According to Proposition 4.20 and since supn∈N E[|Mn |p ] < +∞, we deduce that:
∗ p
E [(M∞ ) ] ≤ Cp sup E[|Mn |p ] < +∞.
n∈N

This gives that M∞ ∗ belongs to Lp . We deduce from (4.15) and the dominated convergence

Theorem 1.46 (with fn = |Mn − M∞ |p , gn = 2p−1 ((Mn∗ )p + (M∞ ∗ )p ) and f ≤ g as (a + b)p ≤


n n
p−1 p p p
2 (a + b ) for a, b ∈ R+ ) that M converges in L towards M∞ . This implies in particular
that E [|M∞ |p ] = limn→∞ E [|Mn |p ]. Then, use that (E [|Mn |p ] , n ∈ N) is non-decreasing, see
(4.8), to deduce that E [|M∞ |p ] = supn∈N E[|Mn |p ].
Since the martingale M converges in Lp towards M∞ , it also converges in L1 . We deduce
then from Theorem 4.21 that Mn = E[M∞ | Fn ] a.s. for all n ∈ N.
Bibliography

[1] J. Neveu. Discrete-parameter martingales. North-Holland Publishing Co., New York,


revised edition, 1975.

[2] D. Williams. Probability with martingales. Cambridge Mathematical Textbooks. Cam-


bridge University Press, Cambridge, 1991.

77
78 BIBLIOGRAPHY
Chapter 5

Optimal stopping

The goal of this chapter is to determine the best time, if any, at which one has to stop a
game, seen as a stochastic process, in order to maximize a given criterion seen as a gain or a
reward. The following two examples are typical of the problems which will be solved. Their
solution are given respectively in Sections 5.1.3 and 5.3.2.
Example 5.1 (Marriage of a princess: the setting). In a faraway old age, a princess had to
choose a prince for a marriage among ζ ∈ N∗ candidates. At step 1 ≤ n < ζ, she interviews
the n-th candidate and at the end of the interview she either accepts to marry this candidate
or refuses. In the former case the process stops and she get married with the n-th candidate;
in the latter case the rebuked candidate leaves forever and the princess moves on to step
n + 1. If n = ζ, she has no more choice but to marry the last candidate. What is the best
strategy or stopping rule for the princess if she wants to maximize the probability to marry
the best prince?
This “Marriage problem”, also known as the “Secretary problem”, appeared in the late
1950’s and early 1960’s. See Ferguson [4] for an historical review as well as the corresponding
Wikipedia page1 . 4
Example 5.2 (Castle to sell). A princess want to sell her castle, let Xn be the n-th price offer.
However, preparing the castle for the visit of a potential buyer has a cost, say c > 0 per visit.
So the gain of the selling at step n ≥ 1 will be Gn = Xn − nc or Gn = max1≤k≤n Xk − nc if
the princess can recall a previous interested buyer. In this infinite time horizon setting, what
is the best strategy for the princess in order to maximize her gain?
This “House-selling problem”, see Chapter 4 in Ferguson [3] is also known as the “Job
search problem” in economy, see Lippman and McCall [5]. 4
S T T
For n < ζ ∈ N = N {∞}, we set Jn, ζK = [n, ζ] N and Jn, ζJ= [n, ζ) N. We consider
a game over the discrete time interval J0, ζK with horizon ζ ∈ N, where at step n ≤ ζ we can
either stop and receive the gain Gn or continue to step n + 1 if n + 1 ≤ ζ. Eventually in the
infinite horizon case, ζ = ∞, if we never stop, we receive the gain G∞ . We assume the gains
G = (Gn , n ∈ J0, ζK) form a sequence of random variables on a probability space (Ω, P, F)
taking values in [−∞, +∞).
We assume the information available is given by a filtration F = (Fn , n ∈ J0, ζK) with
Fn ⊂ F, and a strategy or stopping rule corresponds to a stopping time. Let Tζ be the set
1
https://en.wikipedia.org/wiki/Secretary_problem

79
80 CHAPTER 5. OPTIMAL STOPPING

of all stopping times with respect to the filtration F taking values in J0, ζK. We shall assume
that E[G+ ζ +
τ ] < +∞ for all τ ∈ T , where x = max(0, x). In particular, the expectation E[Gτ ]
is well defined and belongs to [−∞, +∞). Thus, the maximal gain of the game G is:

V∗ = sup E[Gτ ]. (5.1)


τ ∈Tζ

A stopping time τ 0 ∈ Tζ is said optimal for G if E[Gτ 0 ] = V∗ and thus V∗ = maxτ ∈Tζ E[Gτ ].
The next theorem, which is a direct consequences of Corollaries 5.8 and 5.18, is the main
result of this Chapter. For a real sequence (an , n ∈ N), we set lim sup an = lim sup ak .
n%∞ ∞>k≥n

Theorem 5.3. Let ζ ∈ N, G = (Gn , n ∈ J0, ζK) be a sequence of random variables tak-
ing values in [−∞, +∞) and F = (Fn , n ∈ J0, ζK) be a filtration. Assume the integrability
condition: h i
E sup G+ n < +∞. (5.2)
n∈J0,ζK

If ζ ∈ N or if ζ = ∞ and
lim sup Gn ≤ G∞ a.s., (5.3)
then, there exists an optimal stopping time.

We complete Theorem 5.3 by giving a description of the optimal stopping times when the
T and (5.3) holds if ζ = ∞. In this case,
sequence G is adapted to the filtration F, (5.2) holds
we consider the Snell envelope S = (Sn , n ∈ J0, ζK N) which is a particular solution to the
so-called optimal equations or Bellman equations:
Sn = max (Gn , E[Sn+1 |Fn ]) for n ∈ J0, ζJ. (5.4)

More precisely, in the finite horizon case S is defined by Sζ = Gζ and the backward recursion
(5.4); in the infinite horizon case S is defined by (5.17) which satisfies (5.4) according to Propo-
sition 5.16. In this setting, we will consider the stopping times τ∗ ≤ τ∗∗ in Tζ :

τ∗ = inf{n ∈ J0, ζJ; Sn = Gn }, (5.5)


τ∗∗ = inf{n ∈ J0, ζJ; Sn > E[Sn+1 |Fn ]}, (5.6)

with the convention inf ∅ = ζ. We shall prove that they are optimal, see Propositions 5.6 and
5.17, and Exercises 5.1 and 5.5. Furthermore, if V∗ > −∞, then a stopping time τ is optimal
if and only if τ∗ ≤ τ ≤ τ∗∗ a.s. and on {τ < ∞} we have a.s. Sτ = Gτ . See Exercises 5.1, 5.4
and 5.5. Thus, τ∗ is the minimal optimal stopping time and τ∗∗ the maximal one.
In the following two Remarks, we comment on the integrability condition (5.2) and we
consider the case when the sequence G is not adapted to the filtration F.
Remark 5.4. Notice that (5.2) implies that E[G+ ζ
τ ] < +∞ for all τ ∈ T . When ζ < ∞, then
(5.2) is equivalent to
E[G+
n ] < +∞ for all n ∈ J0, ζK. (5.7)
When ζ = ∞, Condition (5.2) can be slightly weaken, see Proposition 5.17, when G is F-
adapted to Condition (H) page 86 which corresponds to the gain being bounded from above
by a non-negative uniformly integrable martingale. ♦
5.1. FINITE HORIZON CASE 81

Remark 5.5. When the sequence G is not adapted to the filtration F, the idea is to check that
an optimal stopping time for the adapted gain G0 = (G0n , n ∈ J0, ζK) with G0n = E[Gn | Fn ] is
also an optimal stopping time for G, see Sections 5.1.2 and 5.2.4. ♦
The finite horizon case, ζ < ∞, is presented in Section 5.1, and the infinite horizon case,
ζ = ∞, which is much more delicate in particular for the definition of S, is presented in
Section 5.2. We consider the approximation of the infinite horizon case by finite horizon
cases in Section 5.3, which includes the Markov chain setting developed in Section 5.3.3.
The presentation of this Chapter follows closely Ferguson [3] also inspired by Snell [7],
see also Chow, Robbins and Siegmund [1, 6] and the references therein or for the Markovian
setting Dynkin [2]. Concerning the infinite horizon case, we consider stopping time taking
values in N instead of N in most text books. Since in some standard applications, the gain
of not stopping in finite time is G∞ = −∞ (which de facto implies the optimal stopping
time is finite unless V∗ = −∞), we shall consider rewards Gn taking values in [−∞, +∞),
whereas in most text books it is assumed that E[|Gn |] < +∞ holds for all finite n ≤ ζ. The
advantage of this setting is the simplicity of the hypothesis and the generality of the result
given in Theorem 5.3. Its drawback is that we can not use the elegant martingale theory
which is the corner stone of the Snell envelope approach, see Remark 5.7 and Exercise 5.1 and
the presentation in Neveu [6]. Thus, we shall deal with integral technicalities in the infinite
horizon case.

5.1 Finite horizon case


We assume in this section that ζ ∈ N and that the gain process G = (Gn , n ∈ J0, ζK)
satisfies the integrability condition (5.7), or equivalently (5.2). We consider the filtration
F = (Fn , n ∈ J0, ζK). Recall Tζ is the set of stopping times with respect to the filtration F
taking values in J0, ζK. Notice that (5.7) implies that E[G+ ζ
τ ] < +∞ for all τ ∈ T .
We shall first consider in Section 5.1.1 that the gain process G is adapted to the filtration
F. This is not always the case. Indeed, in Example 5.1 on the marriage of a princess, the gain
at step n ∈ J1, ζK is given by Gn = 1{Σn =1} , with Σn the random rank of the n-th candidate
among the ζ candidates. In particular the rank Σn and thus the gain Gn are not observed
unless n = ζ, and thus the gain process is not adapted to the filtration generated by the
observations. We extend the results of Section 5.1.1 to the case where G is not adapted to F
in Section 5.1.2. Then, we solve the marriage problem in Section 5.1.3.

5.1.1 The adapted case


We assume that the sequence G is adapted to the filtration F. We define the sequence
S = (Sn , n ∈ J0, ζK) recursively by Sζ = Gζ and the optimal equations (5.4). The following
Proposition gives a solution to the optimal stopping problem in the setting of this section.
Proposition 5.6. Let ζ ∈ N and G = (Gn , n ∈ J0, ζK) be an adapted sequence such that
E[G+n ] < +∞ for all n ∈ J0, ζK. The stopping time τ∗ given by (5.5), with (Sn , n ∈ J0, ζK)
defined by Sζ = Gζ and (5.4), is optimal and V∗ = E[Gτ∗ ] = E[S0 ].

Proof. For n ∈ J0, ζK, we define Tn as the set of all stopping times with respect to the filtration
F taking values in Jn, ζK, as well as the stopping time τn = inf{k ∈ Jn, ζK; Sk = Gk }. Notice
82 CHAPTER 5. OPTIMAL STOPPING

that n ≤ τn ≤ ζ. We first prove by backward induction that:

Sn ≥ E[Gτ |Fn ] a.s. for all τ ∈ Tn , (5.8)


Sn = E[Gτn |Fn ] a.s.. (5.9)

Notice that (5.8) and (5.9) are clear for n = ζ as Sζ = Gζ .


Let n ∈ J0, ζ − 1K. We assume (5.8) and (5.9) hold for n + 1 and prove them for n. Let
τ ∈ Tn and consider τ 0 = max(τ, n + 1) ∈ Tn+1 . As τ = τ 0 on {τ > n}, we have:

E[Gτ |Fn ] = Gn 1{τ =n} + E[Gτ 0 |Fn ]1{τ >n} . (5.10)

Using Inequality (5.8) with n + 1 and τ 0 , we get that a.s.:

E[Gτ 0 |Fn ] = E [E[Gτ 0 |Fn+1 ]Fn ] ≤ E[Sn+1 |Fn ]. (5.11)

Using the optimal equations (5.4), we get a.s.:

E[Sn+1 |Fn ] ≤ Sn . (5.12)

Since (5.4) gives also Gn ≤ Sn , we get using (5.10) that a.s.

E[Gτ |Fn ] ≤ Sn . (5.13)

This gives (5.8).


Consider τn instead of τ in (5.10). Then notice that on {τn > n}, we have max(τn , n+1) =
τn+1 . Then the inequality in (5.11) (with τ 0 = τn+1 ) is in fact an equality thanks to (5.9)
(with n + 1). The inequality in (5.12) is also an equality on {τn > n} by definition of τn .
Then use that Gn = Sn on {τn = n}, so that (5.13), with τn instead of τ , is also an equality.
This gives (5.9). We then deduce that (5.8) and (5.9) hold for all n ∈ J0, ζK.
Notice that τ∗ = τ0 by definition. We deduce from (5.8), with n = 0, that E[S0 ] ≥ E[Gτ ]
for all τ ∈ Tζ , and from (5.9), that E[S0 ] = E[Gτ∗ ]. This gives V∗ = E[S0 ] and τ∗ is
optimal.

Remark 5.7 (Snell envelope). Let ζ ∈ N. Assume that E[|Gn |] < ∞ for all n ∈ J0, ζK. Notice
from (5.4) that S is a super-martingale and that S dominates G. It is left to the reader to
check that S is in fact the smallest super-martingale which dominates G. It is called the
Snell enveloppe of G. For n ∈ J0, ζJ, using that Sn = E[Sn+1 |Fn ] on {τ∗ > n}, we have:
 
Sn∧τ∗ = Sτ∗ 1{τ∗ ≤n} + Sn 1{τ∗ >n} = Sτ∗ 1{τ∗ ≤n} + E[Sn+1 1{τ∗ >n} |Fn ] = E S(n+1)∧τ∗ |Fn .
(5.14)
This gives that (Sn∧τ∗ , n ∈ J0, ζK) is a martingale. ♦

Exercise 5.1. Let ζ ∈ N. Assume that E[|Gn |] < ∞ for all n ∈ J0, ζK.

1. Prove that τ is an optimal stopping time if and only if Sτ = Gτ a.s. and (Sn∧τ , n ∈
J0, ζK) is a martingale.

2. Deduce that τ∗ is the minimal optimal stopping time (that is: if τ is optimal, then a.s.
τ ≥ τ∗ ).
5.1. FINITE HORIZON CASE 83

3. Prove that τ∗∗ defined by (5.6) is an optimal stopping time.

4. Using the Doob decomposition, see Remark 4.15, of the super-martingale S, prove that
if τ ≥ τ∗∗ is an optimal stopping time then τ = τ∗∗ .

5. Arguing as in the proof of property (ii) from Lemma 5.13, prove that if τ and τ 0 are
optimal stopping times so is max(τ, τ 0 ).

6. Deduce that τ is an optimal stopping time if and only if a.s. τ∗ ≤ τ ≤ τ∗∗ and Sτ = Gτ .
4

5.1.2 The general case


If the sequence G = (Gn , n ∈ J0, ζK) is not adapted to the filtration F, then we shall consider
the corresponding adapted sequence G0 = (G0n , n ∈ J0, ζK) defined by:

G0n = E[Gn |Fn ].

Thanks to Jensen inequality, we have E[(G0n )+ ] ≤ E[G+ n ] < +∞ for all n ∈ J0, ζK. Thus the
0
sequence G is is adapted to F and satisfies the integrability condition (5.7) or equivalently
(5.2). Recall Tζ is the set of all stopping time with respect to the filtration F taking values
in J0, ζK. Thanks to Fubini, we get that for τ ∈ Tζ :
ζ
X ζ
 X
E G0n 1{τ =n} = E[G0τ ].
  
E[Gτ ] = E Gn 1{τ =n} =
n=0 n=0

We thus deduce the maximal gain for the game G is also the maximal gain for the game G0 .
The following Corollary is then an immediate consequence of Proposition 5.6.
Corollary 5.8. Let ζ ∈ N and G = (Gn , n ∈ J0, ζK) be such that E[G+ n ] < +∞ for all
n ∈ J0, ζK. Set Sζ = E[Gζ |Fζ ] and Sn = max (E[Gn |Fn ], E[Sn+1 |Fn ]) for 0 ≤ n < ζ. Then
the stopping time τ∗ = inf{n ∈ J0, ζK; Sn = E[Gn |Fn ]} is optimal and V∗ = E[Gτ∗ ] = E[S0 ].

5.1.3 Marriage of a princess

We continue Example 5.1. The princess wants to maximize the probability to marry the best
prince among ζ ∈ N∗ candidates. The corresponding gain at step n is Gn = 1{Σn =1} , with
Σn the random rank of the n-th candidate among the ζ candidates. The random variable
Σ = (Σn , n ∈ J1, ζK) takes values in the set Sζ of permutation on J1, ζK.
For a permutation σ = (σn , n ∈ J1, ζK) ∈ Sζ , we define the sequence of partial ranks
r(σ) = (r1 , . . . , rζ ) such that rn is the partial rank of σn in (σ1 , . . . , σn ). In particular, we
have r1 = 1 and rζ = σζ . Set E = ζn=1 J1, nK the state space of r(σ). It is easy to check that
Q
r is one-to-one from Sζ to E. Set (R1 , . . . , Rn ) = r(Σ), so that Rn is the observed partial rank
of the n-th candidate. In particular Rn corresponds to the observation of the princess at step
n, and the information of the princess at step n is given by the σ-field Fn = σ(R1 , . . . , Rn ).
In order to stick to the formalism of this chapter, we set G0 = −∞ and F0 the trivial σ-field.
84 CHAPTER 5. OPTIMAL STOPPING

We assume the princes are interviewed at random, that is the random permutation Σ =
(Σn , n ∈ J1, ζK) is uniformly distributed on Sζ . Notice then that, for n ∈ J1, ζJ, Σn is
not a function of (R1 , . . . , Rn ) and so it is not Fn -measurable and thus the gain sequence
G = (Gn , n ∈ J0, ζK) is not adapted to the filtration F = (Fn , n ∈ J0, ζK).
Since r is one-to-one, we deduce that r(Σ) is uniform on E. Since E has a product form,
we get that the random variables R1 , . . . , Rζ are independent and Rn is uniform on J1, nK
for all n ∈ J1, ζK. The event {Σn = 1} is equal to {Rn = 1} ζk=n+1 {Rk > 1}. Using the
T
independence of (Rn+1 , . . . , Rζ ) with Fn , we deduce that for n ∈ J1, ζK:
ζ
Y n
E[Gn |Fn ] = E[1{Σn =1} |Fn ] = 1{Rn =1} P(Rk > 1) = 1 .
ζ {Rn =1}
k=n+1

By an elementary backward induction, we get from the definition of Sn given  in Corollary 5.8

that, for n ∈ J1, ζK, Sn is a function of Rn and more precisely Sn = max nζ 1{Rn =1} , sn+1 ,
with sn+1 = E[Sn+1 |Fn ] = E[Sn+1 ] as Sn+1 , which is a function of Rn+1 , is independent of
Fn . The sequence (sn , n ∈ J1, ζK) is non-increasing as (Sn , n ∈ J1, ζK) is a super-martingale.
We deduce that the optimal stopping time can be written as τ∗ = γn∗ for some n∗ , where for
n ∈ J1, ζK, the stopping rule γn corresponds to first observe n − 1 candidate and then choose
the next one who is better than those who have been observed (or the last if there is none):
γn = inf{k ∈ Jn, ζK; Rk = 1 or k = ζ}. We set Γn = E[Gγn ] the gain corresponding to the
strategy γn . We have Γ1 = 1/ζ and for n ∈ J2, ζK:
ζ ζ ζ
X X n−1 X 1
Γn = P(γn = k, Σk = 1) = P(Rn > 1, . . . , Rk = 1, . . . , Rζ > 1) = ,
ζ k−1
k=n k=n k=n

where we used the independence for the last equality. Notice that ζΓ1 = ζΓζ = 1. For
n ∈ J1, ζ − 1K, we have ζ(Γn − Γn+1 ) = 1 − ζ−1
P
1/j. We deduce that Γn is maximal for
Pζ−1 j=n
n∗ = inf{n ≥ 1; Γn ≥ Γn+1 } = inf{n ≥ 1; j=n 1/j ≤ 1}. We also have V∗ = Γn∗ .
For ζ large, we get n∗ ∼ ζ/ e, so the optimal strategy is to observe a fraction of order
1/ e ' 37% of the candidates, and then choose the next best one (or the last if there is none);
the probability to get the best prince is then V∗ = Γn∗ ' n∗ /ζ ' 1/ e ' 37%.
Exercise 5.2 (Choosing the second best instead of the best2 ). Assume the princess knows the
best prince is very likely to get a better proposal somewhere else, so that she wants to select
the second best prince among ζ candidates instead of the best one. For x > 0, we set bxc
the only integer n ∈ N such that x − 1 < n ≤ x. Prove that the optimal stopping rule is to
reject the first n0 = b(ζ − 1)/2c candidates and then chose the first second best so far prince
or the last if none that is τ∗ = inf{k > n0 ; Rk = 2 or k = ζ} and that the optimal gain is:

n0 (ζ − n0 )
V∗ = ·
ζ(ζ − 1)

So for ζ large, we get V∗ ' 1/4. Selecting the third best leads to a more complex optimal
strategy. 4
2
J. S. Rose. A problem of optimal choice assignment. Operarions Research, 30(1):172-181, 1982.
5.2. INFINITE HORIZON CASE 85

5.2 Infinite horizon case


We assume in this section that ζ = ∞. Let (Fn , n ∈ N) be a filtration. For simplicity,
we write T = T∞ for the set of stopping times taking values in N. Notice the definition
of stopping time, and thus of the set T, does not depend on the choice of F∞ asWlong as
this σ-field contains Fn for all n ∈ N. For Sthis reason, we shall take for F∞ = n∈N Fn
the smallest possible σ-field which contains n∈N Fn , see Proposition 1.2. We also use the
following convention for the limit operators: limn→∞ will be understood as limn→∞; n<∞ ,
and for a real sequence (an , n ∈ N), we set lim sup an = limn→∞ sup∞>k≥n ak as well as
lim inf an = limn→∞ inf ∞>k≥n ak .
The next two examples prove one can not remove the hypothesis (5.2) and (5.3) on the
gain process to ensure the existence of an optimal stopping time.
Example 5.9. We consider the gain process G = (Gn , n ∈ N) given by Gn = n/(n + 1) for
n ∈ N and G∞ = 0. Clearly we have V∗ = 1 and there is no optimal stopping time. Notice
that (5.3) does not hold in this case. 4
Example 5.10. Let (Xn , n ∈ N∗ ) be independent Bernoulli random variables such that P(Xk =
1) = P(Xk = 0) Q = 1/2. We consider the gain process G = (Gn , n ∈ N) given by G0 = 0,
Gn = (2n − 1) nk=1 Xk for n ∈ N∗ and a.s. G∞ = limn→∞ Gn = 0. Let F be the natural
filtration of the process G. We have E[Gn ] = 1 − 2−n so that V∗ ≥ 1. Notice G is a
non-negative sub-martingale as:

2n+1 − 1
E[Gn+1 |Fn ] = G n ≥ Gn .
2n+1 − 2
Thus, for any τ ∈ T, we have E[Gτ ∧n ] ≤ E[Gn ] ≤ 1. And by Fatou Lemma, we get E[Gτ ] ≤ 1.
Thus, we deduce that V∗ = 1.
Since E[Gn+1 |Fn ] > Gn on {Gn 6= 0} and Gn+1 = Gn on {Gn = 0}, we get at step n
that the expected future gain at step n + 1 is better than the gain Gn . Therefore it is more
interesting to continue than to stop at step n. However this strategy will provide the gain
G∞ = 0, and is thus not optimal. We deduce there is no optimal stopping time.
Consider the stopping time τ = inf{n ≥ 1; Gn = 0}. We have that τ is a geometric
random variable with parameter 1/2. Furthermore, we have supn∈J0,ζK G+ τ −1 − 1 and
h i n = 2
thus E supn∈J0,ζK G+n = +∞. In particular, condition (5.2) does not hold in this case. 4
The main result of this section is that if (5.2) and (5.3) hold, then there exists an optimal
stopping time τ∗ ∈ T, see Corollary 5.18. The main idea of the infinite horizon case, inspired
by the finite horizon case, is to consider a process S = (Sn , n ∈ J0, ζK) satisfying the optimal
equations (5.4). But since the initialization of S given in the finite horizon case is now useless,
we shall rely on a definition inspired by (5.8) and (5.9). However, we need to consider a
measurable version of the supremum of E[Gτ |Fn ], where τ is any stopping time such that
τ ≥ n. This is developed in Section 5.2.1. In the technical Section 5.2.2, due to the fact we
don’t assume the gain to be integrable, following Ferguson [3], we use the notion of regular
stopping time to prove the existence of an optimal stopping time in the adapted case. We
connect this result with the optimal equations (5.4) in Section 5.2.3. Then, we consider the
general case in Section 5.2.4.
86 CHAPTER 5. OPTIMAL STOPPING

5.2.1 Essential supremum


The following proposition asserts the existence of a minimal random variable dominating a
family (which might be uncountable) of random variables in the sense of a.s. inequality. We
set R = [−∞, +∞].
Proposition 5.11. Let (Xt , t ∈ T ) be a family of real-valued random variables indexed by a
general set T . There exists a unique (up to the a.s. equivalence) real-valued random variable
X∗ such that:
(i) For all t ∈ T , P(X∗ ≥ Xt ) = 1.

(ii) If there exists a random variable Y such that for all t ∈ T , P(Y ≥ Xt ) = 1, then a.s.
Y ≥ X∗ .
The random variable X∗ of the previous proposition is called the essential supremum of
(Xt , t ∈ T ) and is denoted by:
X∗ = ess sup Xt .
t∈T

Example 5.12. If U is a uniform random variable on [0, 1], and Xt = 1{U =t} for t ∈ T = [0, 1],
then we have that a.s. supt∈T Xt = 1 and it is easy to check that a.s. ess supt∈T Xt = 0. 4

Proof of Proposition 5.11. Since we are only considering inequalities between real random
variables, by mapping R onto [0, 1] with an increasing one-to-one function, we can assume
that Xt takes values in [0, 1] for all t ∈ T .
Let I be the family of all countable sub-families of T . For each I ∈ I, consider the
(well defined) random variable XI = supt∈I Xt and define α = supI∈I S E[XI ]. There exists a
sequence (In , n ∈ N) such that limn→+∞ E[XIn ] = α. The set I∗ = n∈N In is countable and
thus I∗ ∈ I. Set X∗ = XI∗ . Since E[XInS] ≤ E[X∗ ] ≤ α for all n ∈ N, we get E[X∗ ] = α.
For any t ∈ T , consider J = I∗ {t}, which belongs to I, and notice that XJ =
max(Xt , X∗ ). Since α = E[X∗ ] ≤ E[XJ ] ≤ α, we deduce that E[X∗ ] = E[XJ ] and thus
a.s. XJ = X∗ , that is P(X∗ ≥ Xt ) = 1. This gives (i).
Let Y be as in (ii). Since I∗ is countable, we get that a.s. Y ≥ X∗ . This gives (ii).

5.2.2 The adapted case: regular stopping times

W the sequence G = (Gn , n ∈ N) is adapted to the filtration


We assume in this section that
F = (Fn , n ∈ N), with F∞ = n∈N Fn . We shall consider the following hypothesis which is
slightly weaker than (5.2):
(H) There exists a non-negative integrable random variable M such that for all n ∈ N, we
have a.s. G+
n ≤ E[M |Fn ].

Condition (H) and (4.1) imply that for all τ ∈ T, we have a.s. G+
τ ≤ E[M |Fτ ]. Notice that
if (5.2) holds then (H) holds with M = supk∈N G+k .
For n ∈ N, let Tn = {τ ∈ T; τ ≥ n} be the set of stopping times larger than or equal to
n. We say a stopping times τ ∈ Tn is regular, which will be understood with respect to G, if
for all finite k ≥ n:
E[Gτ |Fk ] > Gk a.s. on {τ > k}.
5.2. INFINITE HORIZON CASE 87

We denote by T0n the subset of Tn of regular stopping times. Notice that T0n is non-empty as
it contains n.

Lemma 5.13. Assume that G is adapted and a.s. E[G+


τ ] < +∞ for all τ ∈ T. Let n ∈ N.

(i) If τ ∈ Tn , then there exists a regular stopping time τ 0 ∈ T0n such that τ 0 ≤ τ and a.s.
E[Gτ 0 |Fn ] ≥ E[Gτ |Fn ].

(ii) If τ 0 , τ 00 ∈ T0n are regular, then the stopping time τ = max(τ 0 , τ 00 ) ∈ T0n is regular and
a.s. E[Gτ |Fn ] ≥ max (E[Gτ 0 |Fn ], E[Gτ 00 |Fn ]).

Proof. We prove property (i). Let τ ∈ Tn and set τ 0 = inf{k ≥ n; E[Gτ |Fk ] ≤ Gk } with the
convention that inf ∅ = ∞. Notice that τ 0 is a stopping time and that a.s. n ≤ τ 0 ≤ τ . On
{τ 0 = ∞}, we have τ = ∞ and a.s. Gτ 0 = G∞ = Gτ . For ∞ > m ≥ n, we have, on {τ 0 = m},
that a.s. E[Gτ 0 |Fm ] = Gm ≥ E[Gτ |Fm ]. We deduce that for all finite k ≥ n a.s. on {τ 0 ≥ k}:
X   X  
E[Gτ 0 |Fk ] = E E[Gτ 0 |Fm ]1{τ 0 =m} |Fk ≥ E E[Gτ |Fm ]1{τ 0 =m} |Fk .
m∈Jk,∞K m∈Jk,∞K

And thus, for all finite k ≥ n:

E[Gτ 0 |Fk ]1{τ 0 ≥k} ≥ E[Gτ |Fk ]1{τ 0 ≥k} . (5.15)

We have on {τ 0 > k}, E[Gτ |Fk ] > Gk . Then use (5.15) to get that τ 0 is regular. Take k = n
in (5.15) and use that τ 0 ≥ n a.s. to get the last part of (i).
We prove property (ii). Let τ 0 , τ 00 ∈ T0n and τ = max(τ 0 , τ 00 ). By construction τ is a
stopping time, see Proposition 4.4. We have for all m ≥ k ≥ n and k finite:
     
E Gτ 1{τ 0 =m} |Fk = E Gτ 0 1{m=τ 0 ≥τ 00 } |Fk + E Gτ 00 1{τ 00 >τ 0 =m} |Fk .

Using that τ 00 ∈ T0n , we get for all finite m ≥ k ≥ n:


 
E[Gτ 00 1{τ 00 >τ 0 =m} |Fk ] = E E[Gτ 00 |Fm ]1{τ 00 >m} 1{τ 0 =m} |Fk ≥ E[Gm 1{τ 00 >τ 0 =m} |Fk ].

We deduce that for all m ≥ k ≥ n and k finite:

E[Gτ 1{τ 0 =m} |Fk ] ≥ E[Gτ 0 1{τ 0 =m} |Fk ]. (5.16)

By summing (5.16) over m with m > k and using that τ 0 ∈ T0n , we get:

E[Gτ |Fk ]1{τ 0 >k} ≥ E[Gτ 0 |Fk ]1{τ 0 >k} > Gk 1{τ 0 >k} .

By symmetry, we also get E[Gτ |Fk ]1{τ 00 >k} > Gk 1{τ 00 >k} . Since {τ > k} = {τ 0 > k} {τ 00 >
S
k}, this implies that E[Gτ |Fk ] > Gk a.s. on {τ > k}. Thus, τ is regular.
By summing (5.16) over m with m ≥ k = n, and using that τ 0 ≥ n a.s., we get E[Gτ |Fn ] ≥
E[Gτ 0 |Fn ]. By symmetry, we also have E[Gτ |Fn ] ≥ E[Gτ 00 |Fn ]. We deduce the last part
of (ii).

The next lemma is the main result of this section.


88 CHAPTER 5. OPTIMAL STOPPING

Lemma 5.14. We assume that G is adapted and hypothesis (H) and (5.3) hold. Then, for
all n ∈ N, there exists τn◦ ∈ Tn such that a.s. ess supτ ∈Tn E[Gτ |Fn ] = E[Gτn◦ |Fn ].
Proof. We set X∗ = ess supτ ∈Tn E[Gτ |Fn ]. According to the proof of Proposition 5.11, there
exists a sequence (τk , k ∈ N) of elements of Tn such that X∗ = supk∈N E[Gτk |Fn ]. Thanks
to (i) of Lemma 5.13, there exists a sequence (τk0 , k ∈ N) of regular stopping times, elements
of T0n , such that E[Gτk0 |Fn ] ≥ E[Gτk |Fn ]. According to (ii) of Lemma 5.13, for all k ∈
N, the stopping time τk00 = max0≤j≤k τj0 belongs to T0n , the sequence (E[Gτk00 |Fn ], k ∈ N)
is non-decreasing and E[Gτk00 |Fn ] ≥ E[Gτk0 |Fn ] ≥ E[Gτk |Fn ]. In particular, we get X∗ =
supk∈N E[Gτk |Fn ] ≤ supk∈N E[Gτk00 |Fn ] ≤ X∗ , so that a.s. X∗ = limk→∞ E[Gτk00 |Fn ].
Let τn◦ ∈ Tn be the limit of the non-decreasing sequence (τk00 , k ∈ N). Set Yk = E[M |Fτk00 ].
We deduce from the optional stopping theorem for closed martingale, see Proposition 4.26,
that (Yk , k ∈ N) is a martingale with respect to the filtration (Fτk , k ∈ N), which is closed
thanks to property (ii) from Theorem 4.24. In particular, the sequence (Yk , k ∈ N) converges
a.s. and in L1 towards Y∞ = E[M |Fτn◦ ] according to Corollary 4.25. Notice also that
a.s. E [Yk | Fn ] = E [Y∞ | Fn ]. Then, we use Lemma 5.30 with Xk = Gτk00 to get that a.s.
X∗ ≤ E[lim supk→∞ Gτk00 |Fn ]. Thanks to (5.3), we have a.s. lim supk→∞ Gτk00 ≤ Gτn◦ . So we get
that a.s. X∗ ≤ E[Gτn◦ |Fn ]. To conclude use that by definition of X∗ , we have E[Gτn◦ |Fn ] ≤ X∗
and thus X∗ = E[Gτn◦ |Fn ].

We have the following Corollary.


Corollary 5.15. We assume that G is adapted and hypothesis (H) and (5.3) hold. Then,
we have that τ0◦ is optimal that is V∗ = E[Gτ0◦ ].
Proof. Lemma 5.14 gives that E[Gτ ] ≤ E[Gτ0◦ ] for all τ ∈ T. Thus τ0◦ is optimal.

Exercise 5.3. Assume that hypothesis (H) and (5.3) hold. Let n ∈ N. Prove that the limit
of a non-decreasing sequence of regular stopping times, elements of T0n , is regular. Deduce
that τn◦ in Lemma 5.14 is regular, that is τn◦ belongs to T0n . 4

5.2.3 The adapted case: optimal equations


We assume in this section thatWthe sequence G = (Gn , n ∈ N) is adapted to the filtration
F = (Fn , n ∈ N), with F∞ = n∈N Fn . Recall that Tn = {τ ∈ T; τ ≥ n} for n ∈ N. We
assume (H) holds. We set for n ∈ N:

Sn = ess sup E[Gτ | Fn ]. (5.17)


τ ∈Tn

The next proposition is the main result of this section.


Proposition 5.16. We assume that G is adapted and hypothesis (H) and (5.3) hold. Then,
for all n ∈ N, we have E[Sn+ ] < +∞. The sequence (Sn , n ∈ N) satisfies the optimal equations
(5.4). We also have V∗ = E[S0 ].
Proof. Recall that (H) implies E[G+ τ ] < +∞ for all τ ∈ Tn . Then use Lemma 5.14 to deduce
+ +
that E[Sn ] = E[Gτn◦ ] < +∞. For τ ∈ Tn , we have (5.10) and (5.11) by definition of the
essential supremum for Sn+1 . We deduce that a.s. E[Gτ | Fn ] ≤ max(Gn , E[Sn+1 | Fn ]). This
implies, see (ii) of Proposition 5.11, that a.s. Sn ≤ max(Gn , E[Sn+1 | Fn ]).
5.2. INFINITE HORIZON CASE 89


Thanks to Lemma 5.14, there exists τn+1 ∈ Tn+1 such that a.s. Sn+1 = E[Gτn+1 ◦ |Fn+1 ].

Since τn+1 (resp. n) belongs also to Tn , we have Sn ≥ E[Gτn+1 ◦ |Fn ] = E[Sn+1 |Fn ] (resp.
Sn ≥ Gn ). This implies that Sn ≥ max(Gn , E[Sn+1 | Fn ]). And thus (Sn , n ∈ N) satisfies the
optimal equations.
Use Corollary 5.15 and Lemma 5.14 to get V∗ = E[S0 ].

We conclude this section by giving an explicit optimal stopping time.

Proposition 5.17. We assume that G is adapted and hypothesis (H) and (5.3) hold. Then
τ∗ defined by (5.5), with (Sn , n ∈ N) given by (5.17), is optimal: V∗ = E[Gτ∗ ].

Proof. If V∗ = −∞ then nothing has to be proven. So, we assume V∗ > −∞. According to
Corollary 5.15, there exists an optimal stopping time τ .
In a first step, we check that τ 0 = min(τ, τ∗ ) is also optimal. Since E[G+
τ ] < +∞, by
Fubini and the definition of Sn , we have:
  X   X   X  
E Gτ 1{τ >τ∗ } = E Gτ 1{τ >τ∗ =n} = E E[Gτ |Fn ]1{τ >τ∗ =n} ≤ E Sn 1{τ >τ∗ =n} .
n∈N n∈N n∈N

Since Sn = Gn on {τ∗ = n} for n ∈ N, we deduce that:


  X    
E Gτ 1{τ >τ∗ } ≤ E Gn 1{τ >τ∗ =n} = E Gτ∗ 1{τ >τ∗ } .
n∈N

This implies that:


       
E [Gτ ] = E Gτ 1{τ >τ∗ } + E Gτ 1{τ ≤τ∗ } ≤ E Gτ∗ 1{τ >τ∗ } + E Gτ 1{τ ≤τ∗ } = E [Gτ 0 ] .

And thus τ 0 is optimal.


In a second step we check that a.s. τ 0 = τ∗ . Let us assume that P(τ 0 < τ∗ ) > 0. Recall τn◦
defined in Lemma 5.14. We define the stopping time τ 00 by τ 00 = τ∗ on {τ 0 = τ∗ } and τ 00 = τn◦
on {n = τ 0 < τ∗ } for n ∈ N. Since E[G+ τ 00 ] < +∞, by Fubini and the definition of Sn , we
have:
  X   X   X  
E Gτ 00 1{τ 0 <τ∗ } = E Gτn◦ 1{n=τ 0 <τ∗ } = E E[Gτn◦ |Fn ]1{n=τ 0 <τ∗ } = E Sn 1{n=τ 0 <τ∗ } .
n∈N n∈N n∈N

Since P(τ 0 < τ∗ ) > 0 and Sn > Gn on {τ∗ > n} for n ∈ N, we deduce that:
  X    
E Gτ 00 1{τ 0 <τ∗ } > E Gn 1{n=τ 0 <τ∗ } = E Gτ 0 1{τ 0 <τ∗ }
n∈N
   
unless E Gτ 00 1{τ 0 <τ∗ } = E Gτ 0 1{τ 0 <τ∗ } = −∞.  The latter case
 is not possible
 since
E [Gτ 0 ] = V∗ > −∞. Thus, we deduce that E Gτ 00 1{τ 0 <τ∗ } > E Gτ 0 1{τ 0 <τ∗ } . This im-
plies (using again that E[Gτ 0 ] > −∞) that:
       
E [Gτ 00 ] = E Gτ 0 1{τ 0 =τ∗ } + E Gτ 00 1{τ 0 <τ∗ } > E Gτ 0 1{τ 0 =τ∗ } + E Gτ 0 1{τ 0 <τ∗ } = E [Gτ 0 ] .

This is impossible as τ 0 is optimal. Thus, we have a.s. τ 0 = τ∗ and τ∗ is optimal.


90 CHAPTER 5. OPTIMAL STOPPING

Exercise 5.4. Assume that G is adapted and hypothesis (H) and (5.3) hold and V∗ > −∞.

1. Deduce from the proof of Proposition 5.17, that τ∗ defined by (5.5) is the minimal
optimal stopping time: if τ is an optimal stopping time then a.s. τ ≥ τ∗ .

2. Deduce that if G∞ = −∞ a.s., then a.s. τ∗ is finite.

4
Exercise 5.5. Assume that G is adapted and hypothesis (H) and (5.3) hold. We set for n ∈ N:

Wn = ess sup E[Gτ | Fn ] (5.18)


τ ∈Tn+1

with the convention that inf ∅ = ∞. Recall τ∗ and τ∗∗ defined by (5.5) and (5.6).

1. Prove that Wn = E[Sn+1 |Fn ].

2. Prove that (Sn∧τ∗∗ , n ∈ N) is such that E[S0 ] = E[Sn∧τ∗∗ ] for all n ∈ N.

3. Prove that E[S0 ] ≤ E[lim sup Sn∧τ∗∗ ] ≤ E[Gτ∗∗ ]. Deduce that τ∗∗ is optimal.

4. Assume that V∗ > −∞. Prove that if τ is an optimal stopping time, then τ ∧ τ∗∗ is also
optimal. Prove that a.s. τ ≤ τ∗∗ .

5. Assume that V∗ > −∞. Prove that τ is an optimal stopping time if and only if a.s.
Sτ = Gτ on {τ < ∞} and τ∗ ≤ τ ≤ τ∗∗ .

4
Exercise 5.6. Assume that G is adapted and hypothesis (H) and (5.3) hold, as well as V∗ >
−∞. Prove that τ∗ defined by (5.5) is regular. 4

5.2.4 The general case


We state the main result of this section. Let T denote the set of stopping times (taking values
in N) with respect to the filtration (Fn , n ∈ N).

Corollary 5.18. Let G = (Gn , n ∈ N) be a sequence of random variables such that (5.2) and
(5.3) hold. Then there exists an optimal stopping time.

Proof. According to Wthe first paragraph of Section 5.2, without loss of generality, we can
assume that F∞ = n∈N Fn . If G is adapted to the filtration F = (Fn , n ∈ N) then use
M = supn∈N G+ n , so that (H) holds, and Corollary 5.15 to conclude.
If the sequence G is not adapted to the filtration F, then we shall consider the correspond-
ing adapted sequence G0 = (G0n , n ∈ N) given by G0n = E[Gn |Fn ] for n ∈ N. Notice G0 is well
defined thanks to (5.2). Thanks to (5.2), we can use Fubini lemma to get for τ ∈ T:
X X
E[Gτ ] = E[Gn 1{τ =n} ] = E[G0n 1{τ =n} ] = E[G0τ ].
n∈N n∈N

We thus deduce the maximal gain for the game G is also the maximal gain for the game G0 .
5.3. FROM FINITE HORIZON TO INFINITE HORIZON 91

Let M = E supn∈N G+ 0
 
n |F∞ . Notice then that (H) holds with G replaced by G . To
conclude using Corollary 5.15, it is enough to check that 0
0
 (5.3) holds with G replaced
+
 by G .
For n ≥ k finite, we have Gn ≤ E sup`∈Jk,∞K G` Fn . Since E sup`∈Jk,∞K G` is finite
thanks to (5.2), we deduce from Lemma 5.31 that:

lim sup G0n ≤ lim sup E sup G` Fn ≤ E sup G` F∞ .


   
n n `∈Jk,∞K `∈Jk,∞K

Since k is arbitrary, we get:

lim sup G0n ≤ lim sup E sup G` F∞ ≤ E lim sup sup G` F∞ ≤ E[G∞ | F∞ ] = G0∞ ,
   
n k `∈Jk,∞K k `∈Jk,∞K

where we used Lemma 5.30 (with Xk = sup`∈Jk,∞K G` and Yk = Y = M ) for the second
inequality and (5.3) for the last. Thus (5.3) holds with G replaced by G0 . This finishes the
proof.

Exercise 5.7. Let G = (Gn , n ∈ N) be a sequence of random variables such that (5.2) and
(5.3) hold. Let τ∗ = inf{n ∈ N; ess supτ ∈Tn E[Gτ |Fn ] = E[Gn |Fn ]} with inf ∅ = ∞. Prove
that τ∗ is optimal. 4

5.3 From finite horizon to infinite horizon


In the finite horizon case, the solution to the optimal equations (5.4) are defined recursively in
a constructive way. There is no such constructive way in the infinite horizon case. Thus, it is
natural to ask if the infinite horizon case can be seen as the limit of finite horizon cases, when
the horizon ζ goes to infinity. We shall give sufficient condition for this to hold in Section
5.3.1 for the adapted case then derive a solution to the castle selling problem of Example 5.2
in Section 5.3.2 and a solution in a Markov chain setting in Section 5.3.3.

5.3.1 From finite horizon to infinite horizon


We assume in this section thatWthe gain sequence G = (Gn , n ∈ N) is adapted to the filtration
F = (Fn , n ∈ N), with F∞ = n∈N Fn . We also assume that (5.2), or the weaker Condition
(H) page 86, holds. We consider the following assumptions which are stronger than (5.3):

lim sup Gn = G∞ a.s. (5.19)


lim Gn = G∞ a.s.. (5.20)
n→∞

Remark 5.19. We comment on the conditions (5.19) and (5.20). In particular, (5.20) holds if
(5.3) holds and a.s. G∞ = −∞. We now prove that if (5.3) holds and

lim E[G∞ |Fn ] = G∞ a.s., (5.21)


n→∞

then we can modify the gain so that the maximal gain is the same and (5.19) holds for the
modified gain. Notice the convergence (5.21) holds in particular if G∞ is integrable, thanks
to Corollary 4.25.
92 CHAPTER 5. OPTIMAL STOPPING

Assume that Condition (H) page 86 holds for G. We consider the gain G0 = (G0n , n ∈ N)
with G0n = max(Gn , E[G∞ |Fn ]) which satisfies Condition (H) with M 0 = M + G+ ∞ as well
as (5.19), since (5.21) holds. According to Proposition 5.17, there exists an optimal
S stopping
0 0 0 0
time, say τ , for the gain G . The maximal gain is V∗ = E[Gτ 0 ]. Set τ = τ on n∈N {τ 0 =
0

n, G0n = Gn } and τ = +∞ otherwise. Roughly speaking, the stopping rule τ can be described
as follows: on {τ 0 = n}, then either Gn = G0n , and then one stops the game at time n to get
the gain Gn , or Gn < G0n and then one never stops the game to get the gain G∞ . Notice τ
is a stopping time. We have:
X
E[Gτ ] = E[Gn 1{τ =n} ]
n∈N
X X
= E[Gn 1{τ 0 =n,Gn =G0n } ] + E[G∞ 1{τ 0 =n,Gn <G0n } ] + E[G∞ 1{τ 0 =∞} ]
n∈N n∈N
X X
= E[Gn 1{τ 0 =n,Gn =G0n } ] + E[E[G∞ |Fn ]1{τ 0 =n,Gn <G0n } ] + E[G∞ 1{τ 0 =∞} ]
n∈N n∈N
X
= E[G0n 1{τ 0 =n} ] + E[G∞ 1{τ 0 =∞} ]
n∈N
= E[G0τ 0 ].
As E[Gτ ] = E[G0τ 0 ], we get that E[G0τ 0 ] ≤ V∗ . Since G0n ≥ Gn and τ 0 is optimal, we also
get that E[G0τ 0 ] ≥ V∗ . We deduce that V∗0 = E[G0τ 0 ] = V∗ = E[Gτ ], which implies that τ is
optimal.
Thus, if (5.21) holds, then (5.19) holds with G0 instead of G, and if (H) holds for G, then
we can recover an optimal stopping times for G from an optimal stopping times for G0 , the
maximal gain being the same. ♦
Recall Tn = {τ ∈ T; τ ≥ n} is the set of stopping time equal to or larger than n ∈ N and
Tζ = {τ ∈ T; τ ≤ ζ} is the set of stopping times bounded by ζ ∈ N. For ζ ∈ N and n ∈ J0, ζK
we define Tζn = Tn Tζ as well as:
T

Snζ = ess sup E[Gτ | Fn ]. (5.22)


τ ∈Tζn

From Sections 5.1.1 and 5.2.3, we get that Sζζ = Gζ and S ζ = (Snζ , n ∈ J0, ζK) satisfies
the optimal equations (5.4). For n ∈ N, the sequence (Snζ , ζ ∈ Jn, ∞J) is non-decreasing
and denote by Sn∗ its limit. For n ∈ N, we have a.s. Sn∗ = ess supτ ∈T(b) E[Gτ | Fn ], where
n
(b)
Tn = Tn T(b) and T(b) ⊂ T is the subset of bounded stopping times. By construction of
T
Sn , we have for all n ∈ N:
Sn∗ ≤ Sn , (5.23)
The sequence (τ∗ζ , ζ ∈ N), with τ∗ζ = inf{n ∈ J0, ζK; Snζ = Gn }, is non-decreasing and thus
converge to a limit, say τ∗∗ ∈ N and
τ∗∗ = lim τ∗ζ = inf{n ∈ N; Sn∗ = Gn }. (5.24)
ζ→∞

Thanks to (5.23) we deduce that a.s. τ∗∗ ≤ τ∗ . We set V∗ζ = E[S0ζ ] = supτ ∈Tζ E[Gτ ] and
V∗ = E[S0 ] = supτ ∈T E[Gτ ]. Let V∗∗ be the non-decreasing limit of the sequence (V∗ζ , ζ ∈ N),
so that V∗∗ ≤ V∗ .
5.3. FROM FINITE HORIZON TO INFINITE HORIZON 93

Remark 5.20. Assume that (5.2) holds and Gn is integrable for all n ∈ N. Since Gn ≤ Snζ ≤
E supk∈N G+
 
k | Fn = Mn for all ζ ≥ n, using dominated convergence, we deduce from (5.2)
that (Sn∗ , n ∈ N) satisfies the optimal equations (5.4) with ζ = ∞. In fact, it is easy to
check that S ∗ = (Sn∗ , n ∈ N) is the smallest sequence satisfying the optimal equations (5.4)
with ζ = ∞. Following Remark 5.7, we deduce that S ∗ is the smallest super-martingale
which dominates (Gn , n ∈ N). And the process (Sn∧τ ∗
∗ , n ∈ N) is a martingale, which is a.s.

converging thanks to (5.2). ♦

Definition 5.21. The infinite horizon case is the limit of the finite horizon cases if V∗∗ = V∗ .

It is not true in general that V∗∗ = V∗ , see Example 5.22 below taken from Neveu [6].
Example 5.22. Let (Xn , n ∈ N∗ ) be independent random variables such that P(Xn = 1) =
P(Xn = −1) = 1/2 for all n ∈ N. Let c = (cn , n ∈ N∗ ) be a strictly increasing sequence such
that 0 < cn < 1 for all n ∈ N∗ and limn→∞ cn = 1. We define G0 = 0, G∞ = 0, and for
n ∈ N∗ : 
Gn = min 1, Wn − cn ,
with Wn = nk=1 Xk . Notice that Gn ≤ 1 and a.s. lim sup Gn = G∞ so that (5.2) and (5.19)
P
hold. (Notice also that E[|Gn |] for all n ∈ N.) Since E[Wn+1 |Fn ] = Wn , we deduce from
Jensen inequality that a.s. E[min(1, Wn+1 )|Fn ] ≥ min(1, Wn ). Then use that the sequence
c is strictly increasing to get that for all n ∈ N a.s. Gn > E[Gn+1 |Fn ]. Using a backward
induction argument and the optimal equations, we get that Snζ = Gn for all n ∈ J0, ζK and
ζ ∈ N and thus τ∗ζ = 0. We deduce that Sn∗ = Gn for all n ∈ N, τ∗∗ = 0 and V∗∗ = 0.
Since (5.2) and (5.3) hold, we deduce there exists an optimal stopping time for the infinite
horizon case. The stopping time τ = inf{n ∈ N∗ ; Wn = 1} is a.s. strictly positive and finite.
On {τ = n}, we have that Gn = 1 − cn as well as Gm ≤ 0 < Gn for all m ∈ J0, nJ and
Gm ≤ 1 − cm < Gn for all m ∈ Kn, ∞K. We deduce that Gτ = supτ 0 ∈T Gτ 0 , that is τ = τ∗ is
optimal. Notice that V∗ > V∗∗ = 0 and a.s. τ∗ > τ∗∗ = 0. Thus, the infinite horizon case is
not the limit of the finite horizon cases. 4
We give sufficient conditions so that V∗∗ = V∗ . Recall that (5.2) implies Condition (H).

Proposition 5.23. Let (Gn , n ∈ N) be an adapted sequence of R-valued random variables


and define G∞ by (5.19). Assume that (H) holds and that the sequence (Tn , n ∈ N), with
Tn = supk≥n Gk − Gn , is uniformly integrable. If there exists an a.s. finite optimal stopping
time or if (5.20) holds, then the infinite horizon case is the limit of the finite horizon cases.

Proof. If V∗ = −∞, nothing has to be proven. Let us assume that V∗ > −∞. According to
Proposition 5.17, there exists an optimal stopping time, say τ . Since E[Gmin(τ,n) ] ≤ V∗n , we
get:

0 ≤ V∗ − V∗n ≤ E[Gτ − Gmin(τ,n) ] = E 1{n<τ <∞} (Gτ − Gn ) + E 1{τ =∞} (G∞ − Gn )


   

≤ E 1{n<τ <∞} Tn + E 1{τ =∞} (G∞ − Gn )+ .


   

As (Tn , n ∈ N) is uniformly integrable, we deduce from property (iii) of Proposition 8.18 that
(1{n<τ <∞} Tn , n ∈ N) is also uniformly integrable. Since a.s. limn→+∞ 1{n<τ <∞} = 0 and
thus limn→+∞ 1{n<τ <∞} Tn = 0, we deduce from Proposition 8.21 that this latter convergence
holds also in L1 that is limn→+∞ E 1{n<τ <∞} Tn = 0.

94 CHAPTER 5. OPTIMAL STOPPING

If τ is a.s. finite, then we have E 1{τ =∞} (G∞ − Gn )+ = 0. Otherwise, if (5.20) holds,
 

then the sequence (1{τ =∞} (G∞ − Gn )+ , n ∈ N) converges a.s. to 0. Since 1{τ =∞} (G∞ −
Gn )+ ≤ |Tn | and (Tn , n ∈ N) is uniformly integrable, we deduce from property (iii) of
Proposition 8.18 that the sequence (1{τ =∞} (G∞ − Gn )+ , n ∈ N) is uniformly integrable.
 Use
Proposition 8.21 to get it converges towards 0 in L1 : limn→+∞ E 1{τ =∞} (G∞ − Gn )+ = 0.
In both cases, we deduce that limn→∞ V∗ − V∗n = 0. This gives the result.

The following exercise complete Proposition 5.23 by giving the convergence of the minimal
optimal stopping time in the finite horizon case to τ∗ the minimal optimal stopping time in
the infinite horizon case defined in (5.5).
Exercise 5.8. Let (Gn , n ∈ N) be an adapted sequence of random variables taking values in
R and define G∞ by (5.19). Assume that (H) holds and that the sequence (Tn , n ∈ N), with
Tn = supk≥n Gk − Gn , is uniformly integrable. Recall τ∗∗ defined in (5.24).

1. If τ∗ is a.s. finite, prove that a.s. Sn∧τ = Sn∧τ∗ for all n ∈ N and thus a.s. τ∗∗ = τ∗ .

2. If (5.20) holds, prove that Sn∗ = Sn for all n ∈ N and thus a.s. τ∗∗ = τ∗ .

4
We give an immediate Corollary of Proposition 5.23.

Corollary 5.24. Let (Gn , n ∈ N) be an adapted sequence of R-valued random variables and
define G∞ by (5.19). Assume that for n ∈ N we have Gn = Zn − Wn , with (Zn , n ∈ N)
adapted, E[supn∈N |Zn |] < +∞ and (Wn , n ∈ N) an adapted non-decreasing sequence of non-
negative random variables. If there exists an a.s. finite optimal stopping time or if (5.20)
holds, then the infinite horizon case is the limit of the finite horizon cases.

Proof. For k ≥ n, we have Gk − Gn ≤ Zk − Zn ≤ 2 sup`∈N |Z` |. This gives that the sequence
(Tn = supk≥n Gk − Gn , n ∈ N) is non-negative and bounded by 2 sup`∈N |Z` |, hence it is
uniformly integrable. We conclude using Proposition 5.23.

Using super-martingale theory, we can prove directly the following result (which is not a
direct consequence of the previous Corollary with Wn = 0).

Proposition 5.25. Let (Gn , n ∈ N) be an adapted sequence of random variables taking values
in R and define G∞ by (5.19). Assume that E[supn∈N |Gn |] < +∞. Then the infinite horizon
case is the limit of the finite horizon cases. Furthermore, we have that (Sn , n ∈ N) given by
(5.17) is a.s. equal to (Sn∗ , n ∈ N) given by (5.22), and thus the optimal stopping time τ∗
defined by (5.5) is a.s. equal to τ∗∗ defined by (5.24).

Proof. According to Remark 5.20, the process S ∗ = (Sn∗ , n ∈ N) satisfies the optimal equations
(5.4) with ζ = ∞. Since it is bounded by supn∈N |Gn | which is integrable, it is a super-
martingale and it converges a.s. to a limit say S∞ ∗ . We have S ∗ ≥ G for all n ∈ N, which
n n
implies thanks to (5.19) that S∞ ∗ ≥G .

Let n ∈ N. We have for all stopping times τ ≥ n that a.s. Sn∗ ≥ limm→∞ E[Sm∧τ ∗ | Fn ] =

E[Sτ | Fn ], where we used the optional stopping theorem for the inequality, and the dominated
convergence from property (vi) in Proposition 2.7 (with Y = supn∈N |Gn | and Xn = Sn∗ )
for the equality. This implies that, for all stopping times τ ≥ n, a.s. Sn∗ ≥ E[Gτ |Fn ],
5.3. FROM FINITE HORIZON TO INFINITE HORIZON 95

which thanks to Proposition 5.11 implies that a.s. Sn∗ ≥ Sn . Thanks to (5.23), we get that
a.s. Sn∗ ≤ Sn and thus a.s. Sn∗ = Sn for all n ∈ N. By dominated convergence, we have
V∗∗ = limζ→∞ E[S0ζ ] = E[S0∗ ] = V∗ . Thus, the infinite horizon case is the limit of the finite
horizon cases. Using (5.24), we get that a.s. τ∗ = τ∗∗ .

Exercise 5.9. Extend Proposition 5.25 to the non adapted case. 4

5.3.2 Castle to sell


Continuation of Example 5.2. We model the proposal of the n-th buyer of the castle by a
random variable Xn . We assume (Xn , n ∈ N∗ ) is a sequence of independent random variables
distributed as a random variable X which takes values in [−∞, +∞) with E[(X + )2 ] < +∞
and P(X > −∞) > 0. We assume each visit of the castle has a fixed cost c > 0. We first
consider the case, where a previous buyer can be called back, so that the gain at step n ∈ N∗
is given by Gn = Mn − nc, with Mn = max1≤k≤n XkW. We set G∞ = −∞. We consider the
σ-field Fn = σ(X1 , . . . , Xn ) for n ∈ N∗ and F∞ = n∈N∗ Fn . (Notice that to stick to the
presentation of this section, we could set G0 = −∞ and F0 the trivial σ-field.)
Notice that max(x, y) = (x − y)+ + y for x ∈ [−∞, +∞) and y ∈ R. In particular, if Y
is a R-valued random variable independent of X, we get E[max(X, Y )|Y ] = f (Y ) + Y with
f (x) = E[(X − x)+ ]. We deduce for n ∈ N∗ that on {Mn > −∞}:

E[Gn+1 |Fn ] = E[max(Xn+1 , Mn )|Mn ] − (n + 1)c = f (Mn ) − c + Gn . (5.25)

We set x0 = sup{x ∈ R; P(X ≥ x) > 0} and x0 ∈ (−∞, +∞] as P(X > −∞) > 0. Since
E[X + ] is finite, we get that the function f (x) = E[(X − x)+ ] is continuous strictly decreasing
on (−∞, x0 ) and such that limx→−∞ f (x) = +∞ and limx→x0 f (x) = 0. By convention, we
set f (−∞) = +∞. Since a.s. limn→∞ Mn = x0 , we get that a.s. limn→∞ f (Mn ) = 0. Thus
the stopping time τ = inf{n ∈ N∗ , f (Mn ) ≤ c} is a.s. finite. From the properties of f , we
deduce there exists a unique c∗ ∈ R such that f (c∗ ) = c. Using that (f (Mn ), n ∈ N∗ ) is
non-increasing and that it jumps at record times of the sequence (Xn , n ∈ N∗ ), we get the
representation:
τ = inf{n ∈ N∗ , Xn ≥ c∗ }.
We shall prove that τ is optimal and:

τ∗ = τ a.s. and V∗ = E[Gτ ] = c∗ .

Since τ is geometric with parameter P(X ≥ c∗ ), we have E[τ ] = 1/P(X ≥ c∗ ) < +∞ and:
E[X1{X≥c∗ } ] − c E[(X − c∗ )+ ] − c
E[Gτ ] = E[Xτ ] − cE[τ ] = = + c∗ = c∗ ,
P(X ≥ c∗ ) P(X ≥ c∗ )

where we used that E[(X − c∗ )+ ] = f (c∗ ) = c for the last equality. Furthermore, for n ∈ N∗ ,
we deduce from (5.25) that a.s. :
\
E[Gn+1 |Fn ] > Gn on {n < τ } {Gn > −∞}, (5.26)
E[Gn+1 |Fn ] ≤ Gn on {n ≥ τ }. (5.27)

We now state a technical Lemma whose proof is postponed to the end of this section.
96 CHAPTER 5. OPTIMAL STOPPING

Lemma 5.26. Let X be a random variable taking values in [−∞, +∞). Let (Xn , n ∈ N∗ ) be a
sequence of random variables distributed as X. Let c ∈]0, +∞[. Set Gn = max1≤k≤n Xk − nc
for n ∈ N∗ . If E[(X + )2 ] < +∞, then E[supn∈N∗ G+
n ] < +∞ and lim sup Gn = −∞.

According to Lemma 5.26, we have that (5.2) and (5.3) hold. According to Proposition
5.17, τ∗ given by (5.5) is optimal. This implies that V∗ = E[Gτ∗ ] ≥ E[Gτ ] > −∞ and since
a.s. G∞ = −∞, we get that τ∗ is finite. We deduce also from (5.26) that a.s. τ∗ ≥ τ .
We have with c0 = c/2:
−∞ < E[Gτ∗ ] = E[ max Xk − τ∗ c0 ] − E[τ∗ c0 ] ≤ E[ sup ( max Xk − nc0 )+ ] − E[τ∗ ]c0 .
k∈J1,τ∗ K n∈N∗ 1≤k≤n

Using Lemma 5.26 with c replaced by c0 , we get that E[supn∈N∗ (max1≤k≤n Xk − nc0 )+ ] is
finite and thus E[τ∗ ] is finite. Let n ∈ N∗ . On {τ = n}, we have for finite k ≥ n that
Gτ − τ∗ c ≤ Gτ∗ ∧k ≤ supn∈N G+ n and thus a.s.:

1{τ =n} E[sup |Gτ∗ ∧k | | Fn ] < +∞. (5.28)


k≥n

Mimicking (5.14) with G instead of S and using that τ∗ ≥ τ , we deduce from (5.27) that
a.s., on {τ = n}, E[Gτ∗ ∧k |Fn ] ≤ Gn for all finite k ≥ n. Letting k goes to infinity, since τ∗ is
a.s. finite, we deduce by dominated convergence, using (5.28), that E[Gτ∗ |Fn ] ≤ Gn a.s. on
{τ = n}. Since τ is finite, this gives E[Gτ∗ ] ≤ E[Gτ ]. Since τ∗ is optimal, we deduce that τ is
also optimal. This gives V∗ = E[Gτ ] = c∗ . Notice also that a.s. τ = τ∗ as τ∗ is the minimal
optimal stopping time according to Exercise 5.4.
If one can not call back a previous buyer, then the gain is G00n = Xn − nc. Let V∗00
be the corresponding maximal gain. On the one hand, since G00n ≤ Gn for all n ∈ N, we
deduce that V∗00 ≤ V∗ . On the other hand, we have G00τ = Gτ = Gτ∗ . This implies that
V∗00 ≥ E[G00τ ] = E[Gτ ] = V∗ . We deduce that V∗00 = c∗ and τ is also optimal in this case.
In this last part, we assume furthermore that E[|X|] < +∞. We shall prove directly, as
Corollary 5.24 can not be used here, that the infinite horizon case is the limit of the finite
horizon cases. We first consider the case where previous buyers can be called back, so the
gain is Gn = max1≤k≤n Xk − nc for n ∈ N∗ . For n ∈ N∗ , we have a.s. that X1 − τ c ≤
Gτ ∧n ≤ supn∈N G+ n . By dominated convergence, we get that limn→∞ E[Gτ ∧n ] = E[Gτ ] = V∗ .
We deduce that V∗∗ ≥ limn→∞ E[Gτ ∧n ] = V∗ and V∗∗ = V∗ as V∗ ≥ V∗∗ . Therefore the infinite
horizon case is the limit of the finite horizon cases. (Notice that if 1 > P(X = −∞) > 0,
then the infinite horizon case is no more the limit of the finite horizon cases as V∗n = −∞ for
all n ∈ N∗ .)
We now consider the case where previous buyers can not be called back, so the gain is
Gn = Xn − nc for n ∈ N∗ . Let V∗00 = V∗ (resp. V∗00 n ) denote the maximal gain when the
00

horizon is infinite (resp. equal to n). We have:


n
0 ≤ V∗00 − V∗00 ≤ E[G00τ∗ − G00τ∗ ∧n ] ≤ E 1{n<τ∗ <∞} (Xτ∗ − Xn ) = E 1{n<τ∗ <∞} (Xτ∗ − X1 ) ,
   

where we used that G00τ∗ − G00n ≤ Xτ∗ − Xn on {n < τ∗ } for the second inequality and that con-
ditionally on {n < τ∗ < ∞}, (Xτ∗ , Xn ) and (Xτ∗ , X1 ) have the same
 distribution for the last
equality. Since X∗ and X1 are integrable, we get that limn→+∞ E 1{n<τ∗ <∞} (Xτ∗ − X1 ) = 0
by dominated convergence. We deduce that the infinite horizon case is the limit of the finite
horizon cases.
5.3. FROM FINITE HORIZON TO INFINITE HORIZON 97

Proof of Lemma 5.26. Assume that E[(X + )2 ] < +∞. Since Xn −nc ≤ Gn ≤ max1≤k≤n (Xk −
kc) for all n ∈ N∗ , we deduce that supn∈N∗ Gn = supn∈N∗ (Xn − nc). This gives:
h i h i h X i h X i
E sup G+ n = E sup (Xn − nc)+
≤ E (Xn − nc) +
= E (X − nc)+
,
n∈N∗ n∈N∗ n∈N∗ n∈N∗

where we used Fubini (twice) and that Xn is distributed as X in the last equality. Then use
that for x ∈ R: X X
(x − n)+ ≤ x+ 1{n<x+ } ≤ (x+ )2 ,
n∈N∗ n∈N∗
hP i h i
to get E (X − nc)+ ≤ E[(X + )2 ]/c < +∞. So we obtain E sup G + < +∞.
n∈N∗ n∈N∗ n
Set G0n = max1≤k≤n Xk − nc/2. Using the previous result (with c replaced by c/2),
we deduce that supn∈N∗ (G0n )+ is integrable and thus a.s. lim sup G0n < +∞. Since Gn =
G0n − nc/2, we get that a.s. lim sup Gn ≤ lim sup G0n − lim nc/2 = −∞.

With the notation of Lemma 5.26, one can prove that if the random variables (Xn , n ∈ N∗ )
are independent then E[supn∈N∗ G+ + 2
n ] < +∞ implies that E[(X ) ] < +∞.

5.3.3 The Markovian case


Let (Fn , n ∈ N) be a filtration. Recall T is the set of stopping times and Tζ is the set of
stopping times bounded by ζ ∈ N. Let (Xn , n ∈ N) be a Markov chain with state space E
(at most countable) and transition kernel P . Let ϕ be a non-negative function defined on
E. We shall consider the optimal stopping problem for the game with gain Gn = ϕ(Xn ) for
n ∈ N and G∞ = lim sup Gn with horizon ζ ∈ N. We set:

ϕ0 = ϕ and ϕn+1 = max(ϕ, P ϕn ) for n ∈ N.

We have the following result for the finite horizon case.

Lemma 5.27. Let ζ ∈ N, x ∈ E and ϕ a non-negative function defined on E. Assume that


Ex [ϕ(Xn )] < +∞ for all n ∈ J0, ζK. Then, we have:

ϕζ (x) = sup Ex [ϕ(Xτ )] = Ex [ϕ(Xτ ζ )],



τ ∈Tζ

with τ∗ζ = inf {n ∈ J0, ζK; Xn ∈ {ϕ = ϕζ−n }} .

Proof. We keep notations from Section 5.3. Recall definition (5.22) of Snζ for the finite horizon
ζ ∈ N. We deduce from Proposition 5.6 and Sζζ = Gζ that Snζ = ϕζ−n (Xn ) for all 0 ≤ n ≤ ζ
and that the optimal stopping time is τ∗ζ = inf{n ∈ J0, ζK; ϕζ−n (Xn ) = ϕ(Xn )}.

We give a technical lemma.

Lemma 5.28. The sequence of functions (ϕn , n ∈ N) is non-decreasing and converges to


a limit say ϕ∗ such that ϕ∗ = max(ϕ, P ϕ∗ ). For any non-negative function g such that
g ≥ max(ϕ, P g), we have that g ≥ ϕ∗ .
98 CHAPTER 5. OPTIMAL STOPPING

Proof. By an elementary induction argument, we get that the sequence (ϕn , n ∈ N) is non-
decreasing. Let ϕ∗ be its limit. By monotone convergence, we get that ϕ∗ = max(ϕ, P ϕ∗ ).
Let g be a non-negative function g such that g ≥ max(ϕ, P g), we have by induction that
g ≥ ϕn and thus g ≥ ϕ∗ .

We now give the main result of this section on the infinite horizon case. Recall T(b) is the
set of bounded stopping times.

Proposition 5.29. Let x ∈ E and ϕ a non-negative function defined on E. Assume that


Ex [supn∈N ϕ(Xn )] < +∞. Then, the maximal gain under Px is given by:

ϕ∗ (x) = sup Ex [ϕ(Xτ )] = sup Ex [ϕ(Xτ )] = E[ϕ(Xτ∗ )],


τ ∈T(b) τ ∈T

with the optimal stopping time:

τ∗ = inf{n ∈ N; Xn ∈ {ϕ = ϕ∗ }}, (5.29)

and the conventions inf ∅ = ∞ and ϕ(X∞ ) = lim sup ϕ(Xn ). Furthermore, the infinite
horizon case is the limit of the finite horizon case and a.s. τ∗ = τ∗∗ .

Proof. We keep notations from the proof of Lemma 5.27. Lemma 5.28 implies that Sn∗ =
limζ→∞ Snζ = ϕ∗ (Xn ). Recall that by definition τ∗∗ = limζ→∞ τ∗ζ . According to Proposition
5.25, the infinite horizon case is the limit of the finite horizon cases and the optimal stopping
time τ∗ is given by (5.5) that is by (5.29) with the conventions inf ∅ = ∞ and ϕ(X∞ ) =
lim sup ϕ(Xn ). We also get it is a.s. equal to τ∗∗ and that V∗ = E[Gτ∗ ] = Ex [S0∗ ] = ϕ∗ (x).

Exercise 5.10. Let x ∈ E and ϕ a non-negative function defined on E. Assume that


Ex [supn∈N ϕ(Xn )] < +∞ and consider the gain sequence (Gn , n ∈ N) with Gn = ϕ(Xn )
and G∞ = lim sup Gn . Recall the minimal stopping time τ∗ defined by (5.5) or equivalently
(5.29) and the maximal stopping time τ∗∗ defined by (5.6). Prove that:

τ∗ = inf {n ∈ N; Xn ∈ {ϕ ≥ P ϕ∗ }} and τ∗∗ = inf {n ∈ N; Xn ∈ {ϕ > P ϕ∗ }} ,

with the convention inf ∅ = ∞. 4

5.4 Appendix
We give in this section some technical Lemmas related to integration. Let (Ω, P, F) be a
probability space.

Lemma 5.30. Let X and (Xn , n ∈ N) be real-valued random variables. Let Y and (Yn , n ∈ N)
be non-negative integrable random variables. Assume that a.s. Xn+ ≤ Yn for all n ∈ N,
limn→∞ Yn = Y and limn→∞ E[Yn |H] = E[Y |H], where H ⊂ F is a σ-field. Then we have
that a.s.:  

lim sup E[Xn |H] ≤ E lim sup Xn H .

n→∞ n→∞
5.4. APPENDIX 99

Proof. By Fatou Lemma, we get lim inf n→∞ E [Xn− |H] ≥ E [lim inf n→∞ Xn− |H]. We also have:

lim sup E Xn+ | H = − lim inf E Yn − Xn+ | H − lim E [Yn | H]


   
n→∞ n→∞ n→∞
h i
≤ −E lim inf (Yn − Xn+ )| H − E [Y | H]
n→∞
  h i  
+ +
= E lim sup Xn | H − E lim Yn | H − E [Y | H] = E lim sup Xn | H ,
n→∞ n→∞ n→∞

where we used Fatou lemma for the inequality. To conclude, use that a.s.: :

lim sup E[Xn |H] ≤ lim sup E[Xn+ |H] − lim inf E[Xn− |H]
n→∞ n→∞ n→∞

and lim supn→∞ Xn+ − lim inf n→∞ Xn− = lim supn→∞ Xn .
W
Let F = (Fn , n ∈ N), with FSn ⊂ F, be a filtration. We set F∞ = n∈N Fn the smallest
possible σ-field which contains n∈N Fn .

Lemma 5.31. Let M be random variable taking values in [−∞, +∞) such that E[M + ] < +∞.
Let Mn = E[M | Fn ] for n ∈ N. Then, we have that a.s. lim sup Mn ≤ M∞ .

Proof. Let a ∈ R. By Jensen inequality, we have that Mn ∨ a ≤ E[M ∨ a| Fn ]. According


to Corollary 4.25, we have a.s. lim sup Mn ≤ lim sup Mn ∨ a ≤ E[M ∨ a| F∞ ]. By monotone
convergence, we deduce by letting a goes to −∞, that lim sup Mn ≤ E[M | F∞ ] = M∞ .
100 CHAPTER 5. OPTIMAL STOPPING
Bibliography

[1] Y. S. Chow, H. Robbins, and D. Siegmund. Great expectations: the theory of optimal
stopping. Houghton Mifflin Co., Boston, Mass., 1971.

[2] E. B. Dynkin. Optimal choice of the stopping moment of a Markov process. Dokl. Akad.
Nauk SSSR, 150:238–240, 1963.

[3] T. Ferguson. Optimal stopping and applications. http://www.math.ucla.edu/~tom/


Stopping/Contents.html.

[4] T. S. Ferguson. Who solved the secretary problem? Statist. Sci., 4(3):282–289, 08 1989.

[5] S. A. Lippman and J. J. McCall. Job search in a dynamic economy. J. Econom. Theory,
12(3):365–390, 1976.

[6] J. Neveu. Discrete-parameter martingales. North-Holland Publishing Co., New York,


revised edition, 1975.

[7] J. L. Snell. Applications of martingale system theorems. Trans. Amer. Math. Soc.,
73:293–312, 1952.

101
102 BIBLIOGRAPHY
Chapter 6

Brownian motion

6.1 Gaussian process


6.1.1 Gaussian vector
We recall that X is a Gaussian (or normal) random variable if it is a real-valued random
variable whose distribution has density fm,σ2 with respect to the Lebesgue measure on R
given by:
1 2 2
fm,σ2 (x) = √ e−(x−m) /(2σ ) for x ∈ R,
2πσ 2

with parameters m ∈ R and σ > 0. The random variable X is square integrable and the
parameter m is the mean of X and σ 2 its variance. The law of X is often denoted by
N (m, σ 2 ). By convention, the constant m ∈ R will also be considered as a (degenerate)
Gaussian random variable with σ 2 = 0 and we shall denote its distribution by N (m, 0). The
characteristic function ψm,σ2 of X with distribution N (m, σ 2 ) is given by:
 
 iuX  1 2 2
ψm,σ2 (u) = E e = exp ium − σ u for u ∈ R. (6.1)
2
In the next definition we recall the extension of the Gaussian distribution in higher di-
mension. We recall that a matrix Σ ∈ Rd×d , with d ≥ 1, is positive semi-definite if it is
symmetric and hu, Σui ≥ 0 for all u ∈ Rd , where h·, ·i is the Euclidean scalar product on Rd .
Definition 6.1. Let d ≥ 1. Let µ ∈ Rd and Σ ∈ Rd×d be a positive semi-definite matrix. A
Rd -valued random variable X has Gaussian distribution N (µ, Σ) if its characteristic function
ψµ,Σ is given by:
 
h i 1
ψµ,Σ (u) = E eihu,Xi = exp ihu, µi − hu, Σui for u ∈ Rd . (6.2)
2
If X is a Gaussian random variable with distribution N (µ, Σ), then X is square integrable
with mean E[X] = µ and covariance matrix (see Definition 1.61) Cov(X, X) = Σ. Further-
more using the development of the exponential function in series, we get that for all λ ∈ Cd ,
the random variable ehλ,Xi is integrable, and we have:
 
h
hλ,Xi
i 1
E e = exp hλ, µi + hλ, Σλi . (6.3)
2

103
104 CHAPTER 6. BROWNIAN MOTION

Using (6.2) with u replaced by M t u, we deduce the following lemma, which asserts that
every affine transformation of a Gaussian random variable is still a Gaussian random variable.

Lemma 6.2. Let p, d ∈ N∗ . Let X be a Rd -valued Gaussian random variable with distribution
N (µ, Σ). Let M ∈ Rp×d and c ∈ Rp . Then Y = c + M X is a Rp -valued Gaussian random
variable with parameter E[Y ] = c + M µ and Cov(Y, Y ) = M ΣM > .

The next remark ensures that for all µ ∈ Rd and Σ ∈ Rd×d a positive semi-definite matrix,
the distribution N (µ, Σ) is meaningful.
Remark 6.3. Let d ≥ 1. Let (G1 , . . . , Gd ) be independent real-valued Gaussian random
variables with the same distribution N (0, 1). Using (6.2), we get that the random vector
G = (G1 , . . . , Gd ) is Gaussian with distribution N (0, Id ) and Id ∈ Rd×d the identity matrix.
Let µ ∈ Rd and Σ ∈ Rd×d a positive semi-definite matrix. There exists an orthogonal
matrix O ∈ Rd×d (that is O> O = OO> = Id ) and a diagonal matrix ∆ ∈ Rd×d with non-
negative entries such that Σ = O∆2 O> . According to Lemma 6.2, we get that µ + O∆G has
distribution N (µ, Σ). ♦
We have the following result on the convergence in distribution of Gaussian vectors.

Lemma 6.4. Let d ≥ 1. The family of Gaussian probability distributions {N (µ, Σ); µ ∈
Rd , Σ ∈ Rd×d positive semi-definite} is closed for the convergence in distribution. Further-
more, if (Xn , n ∈ N) are Gaussian random variables on Rd , then the sequence (Xn , n ∈ N)
converges in distribution towards a limit, say X, if and only if X is a Gaussian random
variable, (E[Xn ], n ∈ N) and (Cov(Xn , Xn ), n ∈ N) converge respectively towards E[X] and
Cov(X, X).

Proof. We consider the one-dimensional case d = 1. (The general case d ≥ 1 which is


proved similarly is left to the reader.) Let (Xn , n ∈ N) be a sequence of real-valued Gaussian
random variables which converges in distribution towards a limit, say X. Let mn = E[Xn ]
and σn2 = Var(Xn ), and ψn be the characteristic function of Xn . As the sequence of functions
(ψn , n ∈ N) converges pointwise towards ψ, the characteristic function of X, we get that
limn→+∞ |ψn (u)| = |ψ(u)|. This and (6.1) readily implies that the sequence (σn , n ∈ N)
converges to a limit σ ∈ [0, +∞]. Use that ψ(0) = 1 to deduce that σ is finite.
We shall now prove that the sequence (mn , n ∈ N) converges. We deduce from the
2 2
first part of the proof that (eiumn = eu σn /2 ψn (u), n ∈ N) converges pointwise towards
2 2
ϕ(u) = eu σ /2 ψ(u). Notice that ϕ is continuous on R and that |ϕ(u)| = 1 for u ∈ R. Let G
be Gaussian random variable with distribution N (0, 1). By dominated convergence, we get
that for all x ∈ R and a > 0:
2 2
h i
e−(mn −x) a /2 = E ei(mn −x)aG −−−→ E e−ixaG ϕ(aG) .
 
(6.4)
n→∞

This implies that the sequence (mn , n ∈ N) either converges in R or limn→∞ |mn | = +∞.
 −ixaG
In the latter case, we deduce from (6.4) that E e ϕ(aG) = 0 for all a > 0. Letting a
goes to 0, we deduce by dominated convergence, as |ϕ| = 1, from the continuity of ϕ at 0
that ϕ(0) = 0 which is a contradiction. Thus the sequence (mn , n ∈ N) converges to a limit
2 2
m ∈ R. We deduce that, for all u ∈ R, ψn (u) converges towards eium−σ u /2 which is thus
equal to ψ(u). We deduce from (6.1) that X has distribution N (m, σ 2 ).
6.1. GAUSSIAN PROCESS 105

We have proved that if the sequence (Xn , n ∈ N) of real-valued Gaussian random variables
converges in distribution towards X, then X is a Gaussian random variable and (E[Xn ], n ∈
N) as well as (Cov(Xn , Xn ), n ∈ N) converge respectively towards E[X] and Cov(X, X). The
converse is a direct consequence of (6.1).

We give in the next remark an alternative characterization for Gaussian vectors.


Remark 6.5. Let d ≥ 1 and X a Rd -valued random variable. If hu, Xi is Gaussian for all
u ∈ Rd , then X has a Gaussian distribution.
Indeed, since hu, Xi is square integrable for all u ∈ Rd , we deduce that X is square
integrable. Let µ = E[X] and Σ = Cov(X, X). Notice that Σ is positive semi-definite as
hu, Σui = Var(hu, Xi) ≥ 0 for u ∈ Rd . Since hu, Xi is Gaussian with mean hµ, E[X]i and
variance Var(hu, Xi), its distribution is N (hu, µi, hu, Σui). We deduce from (6.1) (with u and
X replaced respectively by 1 and hu, Xi) that (6.2) holds. Thus, by Definition 6.1, X is a
Gaussian random vector with distribution N (µ, Σ). ♦
It is easy to characterize the independence for Gaussian vectors.

Lemma 6.6. Let d ≥ 1, p ≥ 1, X be a Rd -valued random variable and Y be a Rp -valued


random variable. Assume that (X, Y ) has a Gaussian distribution. Then X and Y are
independent if and only if Cov(X, Y ) = 0.

Proof. Since Cov(X, Y ) = 0, we get, with W = (X, Y ), that:


 
Cov(X, X) 0
Cov(W, W ) = ,
0 Cov(Y, Y )

and thus for all w = (u, v) ∈ Rd+p :

hw, Cov(W, W )wi = hu, Cov(X, X)ui + hv, Cov(Y, Y )vi.

Using (6.2) (three times), we get that = E[eihw,W i ] = E[eihu,Xi ]E[eihv,Y i ] for all w = (u, v) ∈
Rd+p . Since the characteristic function characterizes the distribution of Rq -valued random
variables for q ∈ N∗ , we deduce that (X, Y ) has the same distribution as (X 0 , Y 0 ) where X 0
and Y 0 are independent and respectively distributed as X and Y . This implies that X and
Y are independent.
The converse is immediate.

6.1.2 Gaussian process and Brownian motion


We refer to [6, 4, 5] for a general theory of Gaussian processes. The next definition gives an
extension of Gaussian vectors to processes.
Definition 6.7. Let T be a set. Consider the Borel σ-field B(R) on R, and the product
space E = RT with the corresponding product σ-field E = B(R)⊗T . We say a measurable
E-valued random variable X = (Xt , t ∈ T ) is a Gaussian process indexed by T if for all finite
set J ⊂ T , the vector (Xt , t ∈ J) is Gaussian. In this case the mean process m is given
by m = (m(t) = E[Xt ]; t ∈ T ) and the covariance kernel K is given by K = (K(s, t) =
Cov(Xs , Xt ); s, t ∈ T ).
106 CHAPTER 6. BROWNIAN MOTION

Lemma 6.8. The distribution of a Gaussian process is characterized by its mean process and
covariance kernel.

Proof. Let X = (Xt , t ∈ T ) be a Gaussian process indexed by T with mean process m


and covariance kernel K. For all J ⊂ T finite, the vector (Xt , t ∈ J) is Gaussian and its
distribution is characterized by its mean (m(t), t ∈ J) and its covariance (K(s, t); s, t ∈ J),
hence by m and K. Then use Lemma 1.29 to get that the distribution of X is characterized
by m and K.

Let K = (K(s, t); s, t ∈ T ) be the covariance kernel of a Gaussian process indexed by


T . From the proof of Lemma 6.8, we get that (K(s, t); s, t ∈ J) is a positive semi-definite
matrix for all finite J ⊂ T . We deduce that the covariance kernel K is a positive semi-definite
function, that is K(s, t) = K(t, s) for all s, t ∈ T and for all finite set J ⊂ T and all R-valued
vector (at , t ∈ J), we have:
X
as at K(s, t) ≥ 0.
s,t∈J

We admit the converse, see Corollary 3.5 in [6].

Theorem 6.9. Let T be a set, m a real-valued function defined on T and K a positive semi-
definite function defined on T . Then there exist a probability space and a Gaussian process
defined on this probability space with mean process m and covariance kernel K.

One very interesting Gaussian process is the so called Brownian motion. We first give its
covariance kernel.

Lemma 6.10. The function K = (K(s, t); s, t ∈ R+ ) defined by K(s, t) = s∧t is a covariance
kernel on R+ .
R
Proof. Let λ be the Lebesgue measure on R+ . We recall that hf, gi = f g dλ defines a scalar
product on L2 (R+ , B(R+ ), λ). Set ft = 1[0,t] for t ∈ R+ , and notice that K(s, t) = hfs , ft i
for all s, t ∈ R+ . The function K is clearly symmetric and for all n ∈ N∗ , t1 , . . . , tn ∈ R+ ,
a1 , . . . , an ∈ R, we have:
X Z  X 2
ai aj K(ti , tj ) = ai fti dλ ≥ 0.
1≤i,j≤n 1≤i≤n

Thus the function K is positive semi-definite.

The existence of the Brownian motion, see below, is justified by Theorem 6.9 and Lemma
6.10. We say a Gaussian process is centered if its mean function is constant equal to 0.

Definition 6.11. A standard Brownian motion B = (Bt , t ∈ R+ ) defined a probability


space (Ω, F, P) is a centered Gaussian process with covariance kernel K given in Lemma
6.10. A Brownian motion with drift b ∈ R and diffusion coefficient σ ∈ R∗+ is distributed as
(bt + σBt , t ∈ R+ ).
6.2. PROPERTIES OF BROWNIAN MOTION 107

We derive some elementary computations for a standard Brownian motion B = (Bt , t ∈


R+ ). For t ≥ s ≥ u ≥ 0, we have:

Var(Bt − Bs ) = Var(Bt ) + Var(Bs ) − 2 Cov(Bt , Bs ) = t − s = Var(Bt−s ), (6.5)


Cov(Bt − Bs , Bu ) = Cov(Bt , Bu ) − Cov(Bs , Bu ) = 0. (6.6)

6.2 Properties of Brownian motion


We refer to [3, 7] for a general presentation of Brownian motion.

6.2.1 Continuity
There is a technical difficulty when one says the Brownian motion is a.s. continuous, because
one sees the Brownian motion as a RR+ -valued random variable and one can prove that the
set of continuous functions is not measurable with respect to σ-field B(R)⊗R+ on RR+ . For
this reason, we shall consider directly the set of continuous functions.
For an interval I ⊂ R+ , let C 0 (I) = C 0 (I, R) be the set of R-valued continuous functions
defined on I. We define the uniform norm k·k∞ on C 0 (I) as kf k∞ = supx∈I |f (x)| for
f ∈ C 0 (I). It is easy to check that (C 0 (I), k·k∞ ) is a Banach space. And we denote by
B(C 0 (I)) the corresponding Borel σ-field. Notice that C 0 (I) is T a subset of RI (but it does
not belong to B(R) ). We consider C (I) ∩ B(R) = {C (I) A; A ∈ B(R)⊗I } which is
⊗I 0 ⊗I 0

a σ-field on C 0 (I); it is called the restriction of B(R)⊗I to C 0 (I). We admit the following
lemma which states that the Borel σ-field of the Banach space C 0 (I) is C 0 (I) ∩ B(R)⊗I , see
Example 1.3 in [1]. (The proof given in [1] has to be adapted when I is not compact).
Lemma 6.12. Let I be an interval of R+ . We have B(C 0 (I)) = C 0 (I) B(R)⊗I .
T

Since a probability measure on RI is characterized by the distribution of the corresponding


finite marginals, one can then prove that a probability measure on C 0 (I) is also character-
ized by the distribution of the corresponding finite marginals. We also admit the following
theorem, which says that the Brownian motion is a.s. continuous, see Theorem I.2.2 and
Corollary I.2.6 in [7].
Theorem 6.13. There exists a probability space (Ω, F, P) and a C 0 (R+ )-valued process B =
(Bt , t ∈ R+ ) defined on it which is a Brownian motion (when one sees B as a RR+ -valued
process). Furthermore, for any ε > 0, B is a.s. Hölder with index 1/2 − ε and a.e. not
Hölder with index 1/2 + ε.
In particular, the Brownian motion has no derivative.

6.2.2 Limit of simple random walks


Let Y be a R-valued square integrable random variable such that E[Y ] = 0 and Var(Y ) = 1.
Let (Yn , n ∈ N) be independent random variables distributed as Y . We consider the random
walk S = (Sn , n ∈ N) defined by:
n
X
S0 = 0 and Sn = Yk for n ∈ N∗ .
k=1
108 CHAPTER 6. BROWNIAN MOTION

(n)
We consider a time-space scaling X (n) = (Xt , t ∈ R+ ) of the process S given by, for n ∈ N∗ :

(n) 1
Xt = √ Sbntc .
n

We have the following important result.

Proposition 6.14. Let B = (Bt , t ∈ R+ ) be a standard Brownian motion. The sequence


of processes (X (n) , n ∈ N∗ ) converges in distribution for the finite dimensional marginals
(n) (n)
towards B: for all k ∈ N∗ , t1 , . . . , tk ∈ R+ , the sequence of vectors ((Xt1 , . . . , Xtk ), n ∈ N∗ )
converges in distribution towards (Bt1 , . . . , Btk ).

Proof. We deduce from the central limit theorem that (bntc−1/2 Sbntc , n ∈ N∗ ) converges in
distribution towards a Gaussian random variable with distribution N (0, 1). This implies
(n)
that (Xt , n ∈ N∗ ) converges in distribution towards Bt . This gives the convergence in
distribution of the 1-dimensional marginals of X (n) towards those of B.
(n) (n) (n)
Let t ≥ s ≥ 0. By construction, we have that Xt − Xs is independent of σ(Xu , u ∈
(n)
[0, s]) and distributed as an (t, s)Xt−s with an (t, s) = bn(t−s)c/(bntc−bnsc) if 
bntc−bnsc > 0
(n) (n)
and an (t, s) = 1 otherwise. Since limn→∞ an (t, s) = 1, we deduce that (Xs , Xt −

(n)
Xs ), n ∈ N∗ converges in distribution towards (G1 , G2 ), where G1 and G2 are independent
Gaussian random variable with G1 ∼ N (0, s) and G2 ∼ N (0, t − s). Notice that (G1 , G2 )
is distributed as (Bs , Bt − Bs ). Indeed (Bs , Bt − Bs ) is Gaussian vector as the linear trans-
formation of the Gaussian vector (Bs , Bt ); it has mean (0, 0) and we have Var(Bs ) = s,
Var(Bt − Bs ) = t − s, see (6.5), and Cov(Bs , Bt − Bs ) = 0, see (6.6), so the mean and
covariance matrix of (G1 , G2 ) and (Bs , Bt − Bs ) are the  same. This gives they have the
(n) (n) ∗
same distribution. We deduce that (Xs , Xt ), n ∈ N converges in distribution towards
(Bs , Bt ). This gives the convergence in distribution of the 2-dimensional marginals of X (n)
towards those of B.
The convergence in distribution of the k-dimensional marginals of X (n) towards those of
B is then an easy extension which is left to the reader.

In fact we can have a much stronger statement concerning this convergence by considering
a continuous linear interpolation of the processes X (n) . For n ∈ N∗ , we define the continuous
(n) (n) (n) (n) (n)
process X̃ (n) = (X̃t , t ∈ R+ ) by X̃t = Xt + Ct , where Ct = √1n (nt − bntc)Ybntc+1 .
(n) (n)
Notice that E[|Ct |2 ] ≤ n−1 so that (Ct , n ∈ N∗ ) converges in probability towards 0 for
all t ∈ R+ . We deduce that the sequence (X̃ (n) , n ∈ N∗ ) converges in distribution for the
finite dimensional marginals towards B. The Donsker’s theorem state this convergence in
distribution holds for the process seen as continuous functions. For a function f = (f (t), t ∈
R+ ) defined on R+ , we write f[0,1] = (f (t), t ∈ [0, 1]) for its restriction to [0, 1]. We admit
the following result, see Theorem 8.2 in [1].
 
(n)
Theorem 6.15 (Donsker (1951)). The sequence of processes X̃[0,1] , n ∈ N∗ converges in
distribution, on the space C 0 ([0, 1]), towards B[0,1] , where B is a standard Brownian motion.
6.2. PROPERTIES OF BROWNIAN MOTION 109

In particular, we get that for all continuous functional F defined on C 0 ([0, 1]), we have
(n)
that (F (X̃[0,1] ), n ∈ N∗ ) converges in distribution towards F (B[0,1] ). For example the following
real-valued functionals, say F , on C 0 ([0, 1]) are continuous, for f ∈ C 0 ([0, 1]):
Z
F (f ) = kf k∞ , F (f ) = sup(f ), F (f ) = f dλ, F (f ) = f (t0 ) for some t0 ∈ [0, 1].
[0,1] [0,1]

6.2.3 Markov property


Let F = (Ft , t ∈ R+ ) be a filtration on (Ω, F, P), that is a non-decreasing of family of σ-fields,
subsets of F. A process (Xt , t ∈ R+ ) defined on Ω is said F-adapted if Xt is Ft -measurable
for all t ∈ R+ .

Definition 6.16. Let X = (Xt , t ∈ R+ ) be a R-valued process adapted to the filtration


F = (Ft , t ∈ R+ ).

(i) We say that X is a Markov process with respect to the filtration F if for all t ∈ R+ ,
conditionally on the σ-field σ(Xt ) the σ-fields Ft and σ(Xu , u ≥ t) are independent.

(ii) We say that X has independent increments if for all t ≥ s ≥ 0, Xt − Xs is independent


of Fs .

(iii) We say that X has stationary increments if for all t ≥ s, Xt − Xs is distributed as


Xt−s − X0 .

In the previous definition, usually one takes F the natural filtration of X, that is Ft =
σ(Xu , u ∈ [0, t]). Clearly, if a process has independent increments, it has the Markov property
(with respect to its natural filtration).
Lemma 6.17. The Brownian motion is a Markov process (with respect to its natural filtra-
tion), with independent and stationary increments.

Proof. Let B = (Bt , t ∈ R+ ) be a standard Brownian motion and F = (Ft , t ∈ R+ ) its natural
filtration, that is Ft = σ(Bu , u ∈ [0, t]). It is enough to check that it has independent and
stationary increments. Let t ≥ s ≥ 0. Since B is a Brownian process, we deduce that Bt − Bs
is Gaussian, and we have Var(Bt − Bs ) = t − s = Var(Bt−s ), see (6.5). Since B is centered, we
deduce that Bt − Bs and Bt−s have the same distribution N (0, t − s). Thus B has stationary
increments. Since B is a Gaussian process and, according to (6.5), Cov(Bu , Bt − Bs ) = 0 for
all u ∈ [0, s], we deduce that Bt − Bs is independent of Fs = σ(Bu , u ∈ [0, s]). Thus, B has
independent increments. The extension to a general Brownian motion is immediate.

We mention that the Brownian motion is the only continuous random process with inde-
pendent and stationary increments (the proof of this fact is beyond those notes), and that
the study of general random process with independent and stationary increments is a very
active domain of the probabilities.
110 CHAPTER 6. BROWNIAN MOTION

6.2.4 Brownian bridge and simulation


Let B = (Bt , t ∈ R+ ) be a standard Brownian motion. Let T > 0 be given. The Brownian
bridge over [0, T ] is the distribution of the process W T = (WtT , t ∈ [0, T ]) defined by:
t
WtT = Bt − BT .
T
See Exercise 9.37 for the recursive simulation of the Brownian motion using Brownian
bridge approach.

6.2.5 Martingale and stopping times


We shall admit all the results presented in this section and refer to [2, 3, 7, 8, 9]. We consider
a standard Brownian motion B = (Bt , t ∈ R+ ) seen as a random variable on C 0 (R+ ) defined
on a probability space (Ω, F, P). The Brownian filtration F = (Ft , W t ∈ R+ ) of the Brownian
motion is given by Ft generated by σ(Bu , u ∈ [0, t]). We set F∞ = t∈R+ Ft .
A non-negative real-valued random variable τ is a stopping time with respect to the
filtration F if {τ ≤ t} ∈ Ft for all t ∈ R+ . The σ-field Fτ of the events which are prior to a
stopping time τ is defined by:

Fτ = {B ∈ F∞ ; B ∩ {τ ≤ t} ∈ Ft for all t ∈ R+ } .

Remark 6.18. We recall the convention that inf ∅ = +∞. Let A be an open set of R. The
entry time τA = inf{t ≥ 0; Bt ∈ A} is a stopping time 1 . Indeed, we have for t ≥ 0 that:
[
{τA ≤ t} = {Bs ∈ A} ∈ Ft .
s∈Q+ , s≤t


A real-valued process M = (Mt , t ∈ R+ ) is called a F-martingale if it is F-adapted (that
is Mt is Ft -measurable for all t ∈ R+ ) and for all t ≥ s ≥ 0, Mt is integrable and:

E[Mt | Fs ] = Ms a.s.. (6.7)

If (6.7) is replaced by E[Mt | Fs ] ≥ Ms a.s., then M is called an F-sub-martingale.


If (6.7) is replaced by E[Mt | Fs ] ≤ Ms a.s., then M is called an F-super-martingale.
In this setting, we admit the following optional stopping theorem, see [7].
Theorem 6.19. If M is a continuous F-martingale and T, S are two bounded stopping times
such that S ≤ T , then we have:

E[MT |FS ] = MS a.s.. (6.8)

In particular, we get that E[MT ] = E[M0 ].

1
One can prove that if X = (Xt , t ∈ R+ ) is an a.s. continuous process taking values in a metric space E
and A a Borel subset of E then: the entry time τA = inf{t ≥ 0; Bt ∈ A} is a stopping time with respect to the
natural filtration F = (Ft , t ∈ R+ ) where Ft = σ(Xu , u ∈ [0, t]); and the hitting
T time TA = inf{t > 0; Bt ∈ A}
is a stopping time with respect to the filtration (Ft+ , t ∈ R+ ) where Ft+ = s>t Fs .
6.3. WIENER INTEGRALS 111

We admit that the Brownian motion has the strong Markov property, see [7].

Theorem 6.20. Let T be a finite stopping time. Then B̃ = (B̃t = BT +t − BT , t ∈ R+ ) is a


standard Brownian motion independent of FT .

6.3 Wiener integrals


Let (Ω, F, P) be a probability space on which is defined a Brownian motion R t B = (Bt , t ∈ R+ ).
In Section 6.3.1, we shall give a precise meaning of the Wiener integral 0 f (s) dBs for some
general function f , whereas the Brownian motion is not differentiable. This integral (and
its generalization known as the Itô integral when the integrand f is also random) has been
intensively used in physics, biology, finance, applied mathematics, ... The Wiener integral
can also bee seen as an extension of the stochastic discrete integrals (see Lemma 4.14) to the
continuous case. This approach, which is not developed here, known as stochastic calculus
with respect to martingales is another very important generalization of the integrals with
respect to the Brownian motion. We refer to [7, 8, 9] for a complete exposition. We present
in Section 6.3.2 an application to the Langevin equation which describes the evolution of
the speed of a particle in a fluid. In Section 6.3.3, we use the Wiener integral to define the
Cameron-Martin change of probability measure and compute the Laplace transform of the
hitting time of a line by the Brownian motion.

6.3.1 Gaussian space


2 2
Let λ be the Lebesgue measure on R R. The vector space L (λ) = L (R+ , B(R+ ), λ) endowed
with the scalar product hf, giλ = R+ f g dλ is an Hilbert space. We consider the vector space
I = Vect(1[0,t) , t ∈ R+ ) ⊂ L2 of finite linear combinations of indicators of intervals [0, t) for
t ∈ R+ that is:
n
nX o
I= ak−1 1[tk−1 ,tk ) for some n ∈ N∗ , 0 = t0 < · · · < tn and a0 , . . . , an−1 ∈ R . (6.9)
k=1

We admit the following lemma.


Lemma 6.21. The vector space I is dense in the Hilbert space L2 .
We deduce from Proposition 1.50 that L2 (P) = L2 (Ω, F, P) endowed with the scalar
product hX, Y iP = E[XY ] for X, Y ∈ L2 (P) is an Hilbert space. Let IB = Vect(Bt , t ∈
R+ ) ∈ L2 (P) be the vector space of finite linear combinations of marginals of B, that is:
n
nX o
IB = ak−1 (Btk − Btk−1 ) for some n ∈ N∗ , 0 = t0 < · · · < tn and a0 , . . . , an−1 ∈ R .
k=1

Let HB be the closure in L2 (P) of IB . Notice that HB is also an Hilbert space. The space
HB is a Gaussian space in the following sense.
Lemma 6.22. Let d ∈ N∗ and X1 , . . . , Xd ∈ HB . Then the random vector (X1 , . . . , Xd ) is a
Gaussian centered vector.
112 CHAPTER 6. BROWNIAN MOTION

Proof. We first consider the case d = 1. Since X1 ∈ HB , there exists a sequence (Yk , k ∈ N)
of elements of Vect(Bt , t ∈ R+ ) which converges in L2 (P) towards X1 . Thanks to Lemma
6.2, Yk is a centered Gaussian random variable for all k ∈ N. Thanks to Lemma 6.4, we get
that X1 is also a centered Gaussian random variable. The general case d ∈ N∗ is proved using
similar arguments.

For f = nk=1 ak−1 1[tk−1 ,tP


P
k)
, element of I, we define the integral of f with respect to the
Brownian motion as I(f ) = nk=1 ak−1 (Btk − Btk−1 ). Notice that I is a linear map from I
to IB . We have I(1[0,t) ) = Bt for all t ∈ R+ and thus for t, s ∈ R+ :

hI(1[0,t) ), I(1[0,s) )iP = E[Bt , Bs ] = s ∧ t = h1[0,t) , 1[0,s) iλ .

By linearity, we deduce that for f, g ∈ I, hI(f ), I(g)iP = hf, giλ . Therefore, I is a linear
isometric map from I to IB . It admits a unique linear isometric extension from I = L2 (λ)
to IB = HB , which we still denote by I. The Wiener integral of a function f ∈ L2 (λ) with
respect to the Brownian motion R variable a.s. equal to I(f ) ∈ HB . We
R B is any random
shall use the notation I(f ) = R+ f (s) dBs = R+ f dB, even if the Brownian motion has
Rt Rt R
no derivative. We shall also use the notation 0 f dB = 0 f (s) dBs = R+ 1[0,t) (s)f (s) dBs .
Rt
With this convention, we have 0 dBs = Bt .
We have the following properties of the Wiener integral.
Proposition 6.23. Let f, g ∈ L2 (λ).
 R 
2 dλ .
R
(i) The random variable R+ f dB is Gaussian with distribution N 0, R+ f
R R
(ii) The Gaussian random variables R+ f dB and R+ g dB are independent if and only if
R
R+ f g dλ = 0.

(iii) LetR h be a measurable real-valued function defined on R+ locally square integrable (that
t Rt
is 0 h2 dλ < +∞ for all t ∈ R+ ). The process M = (Mt = 0 h dB, t ∈ R+ ) is a
martingale.

(iv) The process M given in (iii) is a centeredR Gaussian process with covariance kernel
s∧t
K = (K(s, t); s, t ∈ R+ ) given by K(s, t) = 0 h2 dλ.

Proof. Proof of property (i). Let f ∈ L2 (λ). Since I(f ) belongs to HB , we deduce from
Lemma 6.22, that I(f ) is a centered R Gaussian random variable. Its variance is given by
E[I(f )2 ] = hI(f ), I(f )iP = hf, f iλ = R+ f 2 dλ.
Proof of property (ii). Let f, g ∈ L2 (λ). Since I(f ) and I(g) belongs to the Gaussian
space HB and are centered, we deduce from Lemmas 6.6 and 6.22, that I(f ) and I(g) are
R if and only if E[I(f )I(g)] = 0. Then use that E[I(f )I(g)] = hI(f ), I(g)iP =
independent
hf, giλ = R+ f g dλ to conclude.
Proof of property (iii). Notice that Mt ∈ L2 (P) for all t ≥ 0. Let t ≥ s ≥ 0 be
fixed. Since h1[s,t) belongs to L2 , there exists a sequence (fn , n ∈ N) of elements of I which
converges to h1[s,t) in L2 . Clearly the sequence (fn 1[s,t) , n ∈ N) converges also to h1[s,t) in
L2 . Since fn 1[s,t) belongs to I, we get that I(fn 1[s,t) ) is σ(Bu − Bs , u ∈ [s, t])-measurable by
6.3. WIENER INTEGRALS 113

construction. This implies that I(h1[s,t) ) is also σ(Bu −Bs , u ∈ [s, t])-measurable. We deduce
that Mt − Ms is σ(Bu − Bs , u ∈ [s, t])-measurable and (taking s = 0 and t = s) that Ms is
Fs -measurable. In particular the process M is adapted to the filtration F. Using that the
Brownian motion has independent increments, we get that the σ-fields σ(Bu − Bs , u ∈ [s, t])
and Fs are independent. We deduce that E[Mt − Ms | Fs ] = E[Mt − Ms ] = E[Mt ] − E[Ms ] = 0.
This gives that M is a martingale.
Proof of (iv). Since Mt ∈ HB for all t ≥ 0, we deduce from Lemma 6.22 that M is a
centered Gaussian process. Its covariance kernel R s∧tis given for s, t ∈ R+ by K(s, t) = E[Ms Mt ] =
hI(h1[0,s) ), I(h1[0,t) )iP = hh1[0,s) , h1[0,t) iλ = 0 h2 dλ.
Rt
We give a natural representation of 0 f dB when f is of class C 1 .
Proposition 6.24. Assume that f ∈ C 1 (R+ ). We have the following integration by part
formula, for all t ∈ R+ , a.s.:
Z t Z t
f (s) dBs = f (t)Bt − Bs f 0 (s) ds.
0 0
Rt
Remark 6.25. If f ∈ C 1 (R+ ), then the process M̃ = (M̃t = f (t)Bt − 0 Bs f 0 (s) d, t ∈ R+ )
Rt
is a.s. continuous. Consider the martingale M = (Mt = 0 f (s) dBs , t ∈ R+ ). From
Proposition 6.24, we get that for all t ∈ R+ , a.s. M̃t = Mt . We say that M̃ is a continuous
version2 of M . ♦
Rt t
Proof of Proposition 6.24. For t ≥ 0, we set Zt = f (t)Bt − 0 Bs f 0 (s) ds − 0 f (s) dBs . We
R

have Zt ∈ HB . We compute for u ≥ 0:


Z t
hZt , Bu iP = E[Zt Bu ] = f (t)E[Bt Bu ] − E[Bs Bu ]f 0 (s) ds − E[I(f 1[0,t) )I(1[0,u) )]
0
Z t Z t∧u
0
= f (t)(t ∧ u) − (s ∧ u)f (s) ds − f dλ
0 0
= 0.

Since Vect(Bt , t ∈ R+ ) is dense in HB , we deduce that hZt , XiP = 0 for all X ∈ HB . As Zt


belongs to HB , we can take X = Zt and deduce that a.s. Zt = 0. This ends the proof.

6.3.2 An application: the Langevin equation

We consider the Langevin equation in dimension 1 which describes the evolution of the
speed V of a particle with mass m in a fluid with friction and multiple homogeneous random
collisions from molecules of the fluid:

mdVt = −λVt dt + F (t) dt for t ∈ R+ ,

where λ > 0 is a damping coefficient, which can be seen as a frictional or drag force, and F (t)
is a random force with Gaussian distribution. This force F (t) dt is modeled by a Brownian
2
It can be proven that if f is a measurable real-valued
R tlocally square integrable function defined on R+ ,
then there exists a continuous version of the martingale ( 0 f (s) dBs , t ∈ R+ ).
114 CHAPTER 6. BROWNIAN MOTION

motion ρ dBt , with ρ > 0. Taking a = λ/m > 0 and σ = ρ/m > 0, we get the stochastic
differential equation:
dVt = −aVt dt + σ dBt for t ∈ R+ . (6.10)
We say that a random locally integrable process V = (Vt , t ∈ R+ ) is solution to the Langevin
equation (6.10) with initial condition V0 if a.s.:
Z t
Vt = V0 − a Vs ds + σBt .
0

We have the following solution to the Langevin equation.


Proposition 6.26. Let a > 0 and σ > 0. The equation (6.10) with initial condition V0 has
a unique locally integrable solution V = (Vt , t ∈ R+ ) given a.s. by:
Z t
Vt = V0 e−at +σ e−a(t−s) dBs a.s. for t ∈ R+ .
0
Rt
Proof. We define Y = (Yt , t ∈ R+ ) by Yt = V0 e−at +σ 0 e−a(t−s) dBs for all t ∈ R+ . Using
Proposition 6.24, we have that a.s. for t ≥ 0:
 Z t 
−at −at at as
Yt = V0 e +σ e e Bt − a Bs e ds
0
Z t
−at
= V0 e +σBt − aσ Bs e−a(t−s) ds.
0
Rt Rt
We deduce that a 0 Ys ds = V0 (1 − e−at ) + aσ 0 Bs ds − aσXt , where, using Fubini theorem:
Z t Z s Z t Z t Z t  
Xt = a ds Bu e−a(s−u) du = du Bu eau a e−as ds = Bu 1 − e−a(t−u) du.
0 0 0 u 0
Rt
We get that a.s. for all t ≥ 0: a 0 Ys ds = V0 − Yt + σBt . This gives that Y is a solution to
(6.10) with initial condition V0 . Let Y 0 = (Yt0 , t ∈ R+ ) be another locally integrable solution.
Taking Zt = Yt − Yt0 , we getR that Z = (Zt , t ∈ R+ ) is locally integrable and that a.s. Z0 = 0
t
and for all t ≥ 0: Zt = −a 0 Zs ds. This gives that a.s. Zt = 0 for all t ≥ 0. Hence there
exists at most one locally integrable solution to (6.10) with initial condition V0 .

The process V given in Proposition 6.26 is called the Ornstein-Uhlenbeck process. See
Exercise 9.38 for results on this process.
The Ornstein-Uhlenbeck process can be defined for all times in R by ( √σ2a e−at Be2at , t ∈
R). This definition is coherent thanks to (9.2).
If Rwe consider the position of the particle at time t ≥ 0, say Xt , we get that Xt =
t
X0 + 0 Vs ds, which gives the following result whose proof is immediate.
Lemma 6.27. Let a > 0 and σ > 0. The path of the particle X = (Xt , t ∈ R+ ) governed by
the Langevin equation (6.10) is given by a.s.:
Z t
Xt = X0 + Vs ds
0
σ t −a(t−r)
Z
V0  σ
= X0 + 1 − e−at + Bt − e dBr for t ∈ R+ .
a a a 0
6.3. WIENER INTEGRALS 115

Remark 6.28. Recall that a = λ/m and σ = ρ/m, with m the mass of the particle. Denote
(m)
by X (m) = (Xt , t ∈ R+ ) the path of the particle with mass m. We have:
Z t
(m) V0
1 − e−at + ρBt − ρ e−a(t−r) dBr .

Xt = X0 +
a 0

Letting m goes to 0 (that is considering an infinitesimal particle), or equivalently a goes to


(m)
infinity, with X0 and ρ fixed, we get that Xt converges in L2 (P) towards the Einstein model
(0)
for the motion of a infinitesimal particle in a fluid X (0) = (Xt , t ∈ R+ ) given by:
(0)
Xt = X0 + ρBt .
V0
1 − e−at = 0 and

Indeed, we have, as a goes to infinity, that lima→+∞ a
"Z 2 #
t Z t
−a(t−r)
lim E e dBr = lim e−2a(t−r) dr = 0,
a→+∞ 0 a→+∞ 0

(m)
which implies that Xt converges to X0 + ρBt in L2 (P) as m goes to zero. ♦

6.3.3 Cameron-Martin Theorem


The Cameron-Martin theorem is a particular case of the family of Girsanov theorem whose
spirit is a change of probability measure using exponential martingales.

R Let
t 2
h be a measurable real-valued function defined on R+ locally square integrable (that
is 0 h dλ < +∞ for all t ∈ R+ ). We consider the non-negative process M h = (Mth , t ∈ R+ )
defined by: Z t
1 t
Z 
h 2
Mt = exp h(s) dBs − h(s) ds . (6.11)
0 2 0

Lemma 6.29. The process M h defined in (6.11) is a non-negative martingale.

Proof. According to property (iii) of Proposition 6.23, we get that M is adapted to the
Brownian filtration. Notice that for t ≥ s ≥ 0, we have:
2 /2)
Mth = Msh eG−(σ
Rt Rt
with G = s h(s) dBs and σ 2 = s h(s)2 ds. Arguing as in the proof of property (iii) of
Proposition 6.23, we get that G is independent of Fs and has distribution N (0, σ 2 ). We
deduce that a.s.: h i h i
2
E Mth | Fs = Msh E eG−(σ /2) = Msh ,

where the last equality is a consequence of (6.3) with λ = 1 and X = G. This implies that
M is a martingale. By construction, it is non-negative.

The next theorem assert that a Brownian motion with a drift can be seen as a Brownian
motion under a different probability measure.
116 CHAPTER 6. BROWNIAN MOTION

Theorem 6.30 (Cameron-Martin).


 Let t ≥ 0 and F be a real-valued non-negative measurable
0 0
function defined on C ([0, t]), B(C ([0, t])) . We have:
  Z u  h i
E F Bu + h(s) ds, u ∈ [0, t] = E Mth F ((Bu , u ∈ [0, t])) . (6.12)
0

Remark 6.31. Let P̃ be a probability measure defined on (Ω, Ft ) by P̃(A) = E[Mth 1A ] for
A ∈ Ft . (Check this define indeed a probability measure.) Let Ẽ denote the corresponding
expectation. Theorem 6.30 gives that:
  Z u 
E F Bu + h(s) ds, u ∈ [0, t] = Ẽ [F ((Bu , u ∈ [0, t]))] .
0
Rt
In particular, the process t 7→ Bt − 0 h(s) ds is a Brownian motion under P̃. ♦

Partial proof of Theorem 6.30. We assumeR3 that it is enough to check (6.12) for functions
t
R ( 0 f (u) dYu ) with f ∈ I Rand I defined by (6.9).
F of the form F ((Yu , u ∈ [0, t])) = exp
t u 
So, we have F ((Bu , u ∈ [0, t])) = exp 0 f (u) dBu and F Bu + 0 h(s) ds, u ∈ [0, t] =
R 
t Rt
exp 0 f (u) dBu + 0 f (u)h(u) λ(du) . We get:
  Z u  h Rt Rt i
E F Bu + h(s) ds, u ∈ [0, t] = E e 0 f (u) dBu + 0 f h dλ
0
 Z t Z t 
1 2
= exp f dλ + f h dλ
2 0 0
 Z t
1 t 2
Z 
1
= exp (f + h)2 dλ − h dλ
2 0 2 0
h Rt i
= E Mth e 0 f (u) dBu
h i
= E Mth F ((Bu , u ∈ [0, t])) ,

where we used that M f (resp. M f +h ) is a martingale for the second (resp. fourth) equality.

As an application, we can compute the Laplace transform (and hence the distribution) of
the hitting time of a line for the Brownian motion. Let a > 0 and δ ∈ R. We consider:

τaδ = inf{t ∈ R+ ; Bt = a + δt}.

Notice that τaδ is a stopping times as {τaδ ≤ t} = n∈N∗ s∈Q+ , s≤t {Bs − a − δs ≥ −1/n}
T S

which belongs to Ft for all t ≥ 0. When δ = 0, we write τa for τaδ , and using the continuity
of B, we get also that τa = inf{t ∈ R+ ; Bt ≥ a}.

Proposition 6.32. Let a > 0.


3
This is the functional version of the monotone class theorem.
6.3. WIENER INTEGRALS 117

(i) We have that for all λ ≥ 0: h i √


E e−λτa = e−a 2λ . (6.13)

(ii) Let δ ∈ R. We have P(τaδ < +∞) = exp (−2aδ + ) and for λ ≥ 0:
h δ
i   p 
E e−λτa = exp −a δ + 2λ + δ 2 .

Proof.
√We first prove (i). Let λ ≥ 0. Consider the process M = (Mt , t ∈ R+ ) with Mt =
 √
exp 2λBt − λt . Using (6.11), we have M = M h with h constant equal to 2λ. Thus
M is a continuous martingale. This implies that the process N = (Nt = Mτa ∧t , t ∈ R+ )
is a continuous martingale
√ thanks to the optional stopping Theorem 6.19. It converges a.s.
towards N∞ = e a 2λ −λτ a 1{τa <+∞} as Bτa = a on {τa < +∞}. Since the process N takes

values in [0, ea 2λ ], we deduce it converges also in L1 towards N∞ . By dominated convergence,
we get that E[N∞ ] = E[N0 ] = 1 and thus:
h √ i
E ea 2λ −λτa 1{τa <+∞} = 1.

Taking λ = 0 in the previous equality implies that τa is a.s. finite. This gives (i).
We now prove (ii). Let f be a non-negative measurable function defined on R. We have:

E[f (τaδ ∧ t)] = E[f (inf{u ∈ R+ ; Bu − δu = a} ∧ t)]


= E[Mt−δ f (τa ∧ t)]
= E[Mτ−δ
a ∧t
f (τa ∧ t)],

where we used the Cameron-Martin theorem with h = −δ for the second equality, the optional
stopping Theorem 6.19 (with T = t and S = τa ∧ t and the martingale M −δ which has a
continuous version thanks to Remark 6.25) for the third. Taking f (x) = e−λx with λ ≥ 0, we
get:  
h i 2
−λ(τaδ ∧t) −λ(τa ∧t)−δBτa ∧t − δ2 (τa ∧t)
E e =E e .

Assume δ ≤ 0. Letting t goes to infinity in the previous equality and using that τa is a.s.
finite and Bτa = a, we get by dominated convergence (for the left hand-side and the right
hand-side) that:  
h i 2
−λτaδ −(λ+ δ2 )τa −δa
E e =E e .
h δ
i √
Then use (6.13) to get that E e−λτa = exp (−δa − a 2λ + δ 2 ). Letting λ goes down to 0,
we deduce that P(τaδ < ∞) = 1.
δ = inf{t ∈
The case δ > 0 is more technical. The idea is to consider the stopping time τa,b
δ and then use that
R+ ; Bt 6∈ (b + δt, a + δt)} for b < a; compute the Laplace transform of τa,b
the non-decreasing sequence (τa,b δ , b ∈ (−∞, a)) converges to τ δ when b goes to −∞. The
a
details are left to the reader.
118 CHAPTER 6. BROWNIAN MOTION
Bibliography

[1] P. Billingsley. Convergence of probability measures. Wiley Series in Probability and


Statistics: Probability and Statistics. John Wiley & Sons, Inc., New York, second edition,
1999. A Wiley-Interscience Publication.

[2] A. Klenke. Probability theory. Universitext. Springer, London, second edition, 2014. A
comprehensive course.

[3] J.-F. Le Gall. Brownian motion, martingales, and stochastic calculus. Springer, 2016.

[4] M. A. Lifshits. Gaussian random functions, volume 322 of Mathematics and its Applica-
tions. Kluwer Academic Publishers, 1995.

[5] M. A. Lifshits. Lecture on Gaussian processes. Springer Briefs in Mathematics. Springer,


2012.

[6] J. Neveu. Processus aléatoires gaussiens. Séminaire de Mathématiques Supérieures, No.


34 (Été, 1968). Les Presses de l’Université de Montréal, Montreal, Que., 1968.

[7] D. Revuz and M. Yor. Continuous martingales and Brownian motion, volume 293 of
Grundlehren der Mathematischen Wissenschaften [Fundamental Principles of Mathemat-
ical Sciences]. Springer-Verlag, Berlin, third edition, 1999.

[8] L. C. G. Rogers and D. Williams. Diffusions, Markov processes, and martingales. Vol. 1.
Cambridge Mathematical Library. Cambridge University Press, Cambridge, 2000.

[9] L. C. G. Rogers and D. Williams. Diffusions, Markov processes, and martingales. Vol. 2.
Cambridge Mathematical Library. Cambridge University Press, Cambridge, 2000.

119
120 BIBLIOGRAPHY
Chapter 7

Stochastic approximation
algorithms

The aim of this chapter is to present some results on the convergence of stochastic algo-
rithms approximations, the so called Robbins-Monro algorithm, by using a comparison of the
stochastic algorithm with its companion ODE. This is the so-called ODE method which is
presented in the monographs Kushner and Clark [7], Kushner and Yin [8] and Duflo [4, 5].
Our presentation will follow closely Benaı̈m [1] based first on analytic properties of pseudo-
trajectories associated to the ODE in Section 7.2, and second on the control of the stochastic
algorithms approximations in Sections 7.3.
As applications and motivations, we shall consider in more details the so-called two-armed
bandit see Sections 7.1 and 7.4.3, as well as the estimation of the quantile in linear times in
Section 7.4.2.

7.1 The two-armed bandit


The one-armed bandit describes a slot machine, and the two-armed bandit corresponds to
having two slot machines with possibly different unknown probability distributions of gain.
The aim is to maximize number of winnings, when playing n times, which is a competition
between exploration (in order to determine which slot machine is the most advantageous)
and exploitation (use what you think is the best slot machine to achieve a maximal gain).
This problem goes back to Robbins [15], where it is seen as a problem of sequential design.
This problem can be solved using Markov decision processes leading to the use of the Gittins
policy, see Gittins [6] and Weber [17] for a short proof. See the survey from Burtini, Loeppky
and Lawrence [3] for other recent variants on the subject.
We shall consider a recursive random procedure developed independently by Norman [14]
and Shapiro and Narendra [16], see the survey and the book from Narendra and Thathachar
[12, 13] on learning automate theory. This random procedure, also called the Linear Reward-
Inaction scheme, has been fully studied by Lamberton, Pagès and Tarrès [11] as well as
Lamberton and Pagès [9] for the two armed bandit; see also Lamberton and Pagès [10] for
the Linear Reward-Penalty scheme.
We consider two slot machines A and B. At time n, one plays with only one slot machine, the slot machine A (resp. B) being chosen with probability Xn (resp. 1 − Xn). At time n + 1, the probability Xn+1 is updated, in case of a gain, as follows: if A has been chosen and there has been a gain, then Xn+1 is equal to Xn + γn(1 − Xn); if B has been chosen and there has been a gain, then Xn+1 is equal to Xn − γn Xn. Notice that if there is no gain there is no penalty. The fraction γn ∈ (0, 1) is deterministic. We shall assume that ∑_{n∈N} γn = +∞, which is a necessary condition to forget the starting value X0. More precisely, we consider the following linear scheme. Let pi, i ∈ {A, B}, be the probability to have a gain with the slot machine i. Let (γn, n ∈ N) be a sequence taking values in (0, 1). Let X0 ∈ (0, 1) be a random variable and let (Un, n ∈ N∗) and (Vn, n ∈ N∗) be independent random variables uniformly distributed on [0, 1] and independent of X0. We define the sequence (Xn, n ∈ N) recursively as follows:

Xn+1 = Xn + γn Yn+1,    (7.1)

with

Yn+1 = (1 − Xn) 1{Un+1 ≤ Xn, Vn+1 ≤ pA} − Xn 1{Un+1 > Xn, Vn+1 ≤ pB}.

The event {Vn+1 ≤ pi} corresponds to the winning event with the machine i ∈ {A, B}. We are interested in the behavior of the sequence (Xn, n ∈ N) as n goes to infinity. Let F = (Fn, n ∈ N) be the natural filtration of the process X. We rewrite Yn+1 as:

Yn+1 = F(Xn) + εn+1    with    F(Xn) = E[Yn+1 | Fn],

so that εn+1 = Yn+1 − F(Xn) is a martingale increment with respect to F. We have F(x) = πx(1 − x) with π = pA − pB, and thus:

(Xn+1 − Xn)/γn = F(Xn) + εn+1.    (7.2)

The Ordinary Differential Equation (ODE) method intuitively says that the stochastic approximation algorithm (7.2) (also called a perturbed recursive equation) behaves as the deterministic ODE:

dx(t)/dt = F(x(t)).    (7.3)

Notice that 0 and 1 are roots of F(y) = 0, and thus the constant functions equal to 0 or to 1 are solutions of (7.3). For x0 ∈ (0, 1), the solution x = (x(t), t ∈ R+) of (7.3) with initial condition x0 is given by:

x(t) = x0 / (x0 + (1 − x0) e^{−πt})    for t ≥ 0.

In particular we have lim_{t→+∞} x(t) = 1, meaning that 1 is an attractor of (7.3) and thus a stable fixed point of the ODE, whereas 0 is an unstable fixed point. One expects the stochastic approximation algorithm to be close to the solutions of the ODE, and thus to converge to the stable fixed point of the ODE. We shall give in Section 7.4.3 conditions on the step sequence (γn, n ∈ N) which imply the a.s. convergence of the stochastic algorithm to the stable fixed point.
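To make the comparison with the companion ODE concrete, here is a minimal numerical sketch of the scheme (7.1) in Python. It is an illustration only: the function name, the step sequence γn = 1/(n + 2) and the values of pA, pB and X0 are our choices, not prescribed by the text.

```python
import numpy as np

def two_armed_bandit(p_a, p_b, x0, gammas, rng):
    """Linear Reward-Inaction scheme (7.1): returns the path (X_0, ..., X_n)."""
    x, path = x0, [x0]
    for g in gammas:
        u, v = rng.random(), rng.random()
        if u <= x and v <= p_a:        # machine A is chosen and wins
            x += g * (1.0 - x)
        elif u > x and v <= p_b:       # machine B is chosen and wins
            x -= g * x
        # if the chosen machine does not win, there is no update (reward-inaction)
        path.append(x)
    return np.array(path)

rng = np.random.default_rng(0)
n, p_a, p_b, x0 = 10_000, 0.7, 0.4, 0.5
gammas = 1.0 / np.arange(2, n + 2)     # illustrative steps in (0, 1) with infinite sum
path = two_armed_bandit(p_a, p_b, x0, gammas, rng)

# Companion ODE solution read on the internal clock tau_n = gamma_0 + ... + gamma_{n-1}.
tau = np.concatenate([[0.0], np.cumsum(gammas)])
ode = x0 / (x0 + (1.0 - x0) * np.exp(-(p_a - p_b) * tau))
print(path[-1], ode[-1])               # both approach the stable fixed point 1
```

One should observe the simulated path and the ODE solution ending close to 1, in line with the heuristic above.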

7.2 Asymptotic pseudo-trajectories


We follow the approach developed in detail in [1], and presented also in Benaïm and El Karoui [2], on pseudo-trajectories. In this section we give properties of paths which are close to the solutions of an autonomous ODE.

Let F be a continuous map from Rd to Rd which is either Lipschitz, or bounded and locally Lipschitz. Let |y| denote the Euclidean norm of y ∈ Rd.

7.2.1 Definition

The hypotheses on F imply that the vector field F is globally integrable, and that there exists a flow Φ = (Φt(y), (t, y) ∈ R × Rd) which is a continuous function (of (t, y)) such that for all (t, y) ∈ R × Rd:

dΦt(y)/dt = F(Φt(y))    and    Φ0(y) = y.

The map t ↦ Φt(y) defined on R is the global solution of the Cauchy problem associated to F with initial condition y. Notice that Φ0 is the identity function. The function Φ has the so-called flow property, as for all s, t ∈ R:

Φs+t = Φs ◦ Φt.

We define the set of equilibria of Φ as Λ0 = {y ∈ Rd ; F(y) = 0}.
An asymptotic pseudo-trajectory is a path which is asymptotically close to a solution of the Cauchy problem.

Definition 7.1. A function x = (x(t), t ∈ R+) is an asymptotic pseudo-trajectory for Φ if for all T > 0:

lim_{t→+∞} sup_{s∈[0,T]} |x(t + s) − Φs(x(t))| = 0.

7.2.2 The limit set of an asymptotic pseudo-trajectory

The limit set of an asymptotic pseudo-trajectory x is defined as:

L(x) = ⋂_{t≥0} cl(x([t, +∞))),

where cl(A) denotes the closure of A in Rd.

The aim of this section is to give a description of L(x). The next lemma corresponds to Theorem 5.7 in [1], see also the references therein.

Lemma 7.2. Let x be an asymptotic pseudo-trajectory for the flow Φ which is bounded, that is, sup_{t≥0} |x(t)| < +∞. The set L = L(x) is non-empty, connected, compact and Φ-invariant, that is, Φt(L) = L for all t ∈ R.
When there is a Lyapounov function, see the definition below, for the flow Φ, it is possible to give a more precise description of L(x).

Let Λ be an invariant set of Φ. A continuous function V defined on Rd and taking values in R is called a Lyapounov function for Λ if the function t ↦ V(Φt(y)), defined for t ≥ 0, is constant for y ∈ Λ and strictly decreasing for y ∉ Λ. If Λ is equal to the set Λ0 of equilibria of F, then V is called a strict Lyapounov function.

Remark 7.3. When F derives from a potential V, that is F = −∇V, then V itself is a strict Lyapounov function. Thus, in dimension d = 1, the function V defined by V(y) = −∫_0^y F(s) ds for y ∈ R is a strict Lyapounov function. ♦
We get from Corollary 6.6 in [1] the following result on the convergence of asymptotic pseudo-trajectories.

Lemma 7.4. Let x be a bounded asymptotic pseudo-trajectory of Φ such that L(x) ∩ Λ0 is at most countable. Assume there exists a strict Lyapounov function for Φ. Then x converges to an equilibrium: there exists y ∈ Λ0 such that lim_{t→+∞} x(t) = y.

7.3 Stochastic approximation algorithms

7.3.1 Main results

Let F = (Fn, n ∈ N) be a filtration and let (γn, n ∈ N) be a step sequence of deterministic positive real numbers. A Robbins-Monro sequence (Xn, n ∈ N) associated to F is defined by an Rd-valued F0-measurable random variable X0 and, for n ∈ N:

Xn+1 = Xn + γn F(Xn) + γn εn+1,    (7.4)

where (εn, n ∈ N∗) is a sequence of Rd-valued F-adapted integrable random variables such that E[εn+1 | Fn] = 0 for all n ∈ N. The sequence (εn, n ∈ N∗) is said to be sub-Gaussian if there exists a finite constant Γ > 0 such that for all θ ∈ Rd and n ∈ N:

E[e^{⟨θ, εn+1⟩} | Fn] ≤ e^{Γ|θ|²/2}.

We shall consider the following hypotheses on the control of the martingale increments and on the step sequence:

sup_{n∈N∗} E[|εn|²] < +∞    and    ∑_{n∈N} γn² < +∞,    (7.5)

as well as:

(εn, n ∈ N∗) is sub-Gaussian, and for all c > 0, ∑_{n∈N} e^{−c/γn} < +∞.    (7.6)

Exercise 7.1. Let U be a bounded Rd-valued random variable. By considering the function θ ↦ log E[e^{⟨θ,U⟩}], prove that U is sub-Gaussian, that is, there exists a finite constant Γ > 0 such that for all θ ∈ Rd we have E[e^{⟨θ,U⟩}] ≤ e^{Γ|θ|²/2}. 4
We consider the linear interpolation X = (X(t), t ≥ 0) of the sequence (Xn, n ∈ N) with time steps (γn, n ∈ N) as follows. We set τ0 = 0 and τn+1 = τn + γn for n ∈ N. For t ≥ 0, let n ∈ N be such that t ∈ [τn, τn+1), and set:

X(t) = Xn + (t − τn) (Xn+1 − Xn)/γn.
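For later use in experiments, the interpolation is straightforward to implement; the helper below is a sketch (the names and array conventions are ours), valid for 0 ≤ t < ∑ γn when finitely many steps are given.

```python
import numpy as np

def interpolate(xs, gammas):
    """Piecewise-linear interpolation X(t) of (X_n) with time steps (gamma_n).

    xs has length len(gammas) + 1; X(t) is defined for 0 <= t < sum(gammas).
    """
    tau = np.concatenate([[0.0], np.cumsum(gammas)])   # tau_0 = 0, tau_{n+1} = tau_n + gamma_n
    def X(t):
        n = np.searchsorted(tau, t, side="right") - 1  # index n with t in [tau_n, tau_{n+1})
        return xs[n] + (t - tau[n]) * (xs[n + 1] - xs[n]) / gammas[n]
    return X
```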
We have the following main result whose proof is postponed to Section 7.3.2.

Theorem 7.5. Let F be a continuous map from Rd to Rd which is bounded and locally Lipschitz. Let (Xn, n ∈ N) be a Robbins-Monro sequence associated to F, see (7.4), and X its linear interpolation. Assume that:

(i) ∑_{n∈N} γn = +∞ and either (7.5) or (7.6) holds (so that limn→∞ γn = 0).

(ii) The sequence (Xn, n ∈ N) is a.s. bounded.

Then X is an asymptotic pseudo-trajectory for the flow Φ associated to F.
Notice that to use Theorem 7.5, one has to check that the sequence (Xn , n ∈ N) is a.s.
bounded. There are several stability conditions based on Lyapounov functions which imply
that the sequence (Xn , n ∈ N) is a.s. bounded, see Section 7.3 in [1] for references. The
following condition is from [8].
Proposition 7.6. Let F be a continuous map from Rd to Rd which is bounded and locally Lipschitz. Let (Xn, n ∈ N) be a Robbins-Monro sequence associated to F, see (7.4). Assume that limn→∞ γn = 0. Let V and k be two measurable functions defined on Rd and taking values in R+. Assume that V is of class C² with bounded second order derivatives and that:

(i) ⟨∇V(y), F(y)⟩ ≤ −k(y) for y ∈ Rd.

(ii) lim_{|y|→+∞} V(y) = +∞.

(iii) There exist non-negative constants K and R, with K < +∞, such that:

E[|εn+1|² | Fn] + |F(Xn)|² ≤ K k(Xn)    on {|Xn| ≥ R},

∑_{n∈N} γn² E[(|εn+1|² + |F(Xn)|²) 1{|Xn|<R}] < +∞.

(iv) E[V(X0)] < +∞, and E[k(Xn)] < +∞ whenever E[V(Xn)] < +∞.

Then a.s. the sequence (Xn, n ∈ N) is bounded.
Proof. Since V is of class C² with bounded second derivatives, we get:

V(Xn+1) = V(Xn) + ⟨Xn+1 − Xn, ∇V(Xn)⟩ + g(Xn+1 − Xn),

with |g(y)| ≤ K1 |y|²/2 for some non-negative finite constant K1. Using (7.4) as well as (i), we get:

E[V(Xn+1) | Fn] ≤ V(Xn) − γn k(Xn) + K1 γn² (E[|εn+1|² | Fn] + |F(Xn)|²).    (7.7)

Then use (iii) and (iv) to deduce by induction that E[V(Xn)] is finite for all n ∈ N. We define Vn = V(Xn) + Wn with:

Wn = K1 ∑_{j≥n} γj² E[1{|Xj|<R} (|εj+1|² + |F(Xj)|²) | Fn].

Notice that Wn is well defined thanks to (iii) and that Vn is integrable. We deduce from (7.7) and (iii) that:

E[Vn+1 − Vn | Fn] ≤ −γn k(Xn) + K1 K γn² k(Xn).

Since limn→∞ γn = 0, we deduce there exists n0 ∈ N such that E [Vn+1 − Vn | Fn ] ≤ 0 for all
n ≥ n0 . Thus (Vn , n ≥ n0 ) is a non-negative super-martingale. Thanks to Corollary 4.22, it
converges a.s. to an integrable limit, say V∞ . Since (iii) implies that a.s. limn→∞ Wn = 0, we
deduce that (V (Xn ), n ∈ N) converges a.s. to V∞ . Use (ii) to get that a.s. lim supn→∞ |Xn | <
+∞. This ends the proof.

7.3.2 Proof of Theorem 7.5

We consider the martingale M = (Mn = ∑_{k=1}^n γk εk, n ∈ N∗). For t ≥ 0, let n ∈ N be such that t ∈ [τn, τn+1), and define:

X̄(t) = Xn,    ε̄(t) = εn+1    and    γ̄(t) = γn.

For t ≥ 0 and T ≥ 0, we set ∆(t, T) = sup_{s∈[0,T]} |∫_t^{t+s} ε̄(u) du|.

The next lemma is the first step of the proof.

Lemma 7.7. Under either (7.5) or (7.6), we have that for all T ≥ 0:

a.s.  lim_{t→+∞} ∆(t, T) = 0.    (7.8)

Proof. Since limn→∞ γn = 0, we get that (7.8) holds for all T ≥ 0 if and only if for all T ≥ 0:

a.s.  lim_{n→∞} sup {|Mk − Mn| ; k ≥ n such that τk ≤ τn + T} = 0.    (7.9)

We first assume that (7.5) holds. Since the martingale increments are orthogonal in L², we have E[Mn²] = ∑_{k=1}^n γk² E[|εk|²] ≤ sup_{k∈N∗} E[|εk|²] ∑_{k∈N} γk², so that sup_{n∈N∗} E[Mn²] is finite and the martingale M converges a.s. This directly implies (7.9).

The case (7.6) corresponds to Proposition 4.4 in [1].

The second step of the proof is deterministic and corresponds to Proposition 4.1 in [1].

7.4 Applications

7.4.1 Dosage

This application is taken from [5]. A dose y of a chemical product creates a random effect, say g(y, U), where U is a random variable and g is an unknown real-valued bounded measurable function defined on R². We assume the mean effect G(y) = E[g(y, U)], which is unknown, is non-decreasing as a function of y. We want to determine the dose y∗ which creates a mean effect of a given level a: that is, G(y∗) = a.

We consider the following Robbins-Monro stochastic algorithm, for n ∈ N:

Xn+1 = Xn − γn (g(Xn, Un+1) − a),

where we assume that (Un, n ∈ N∗) are independent random variables distributed as U and independent of X0. Notice that g(Xn, Un+1) corresponds to an effect of the dose Xn produced by the (n + 1)-th experiment.

To stick to the formalism (7.4), we set F(y) = a − G(y) and εn+1 = G(Xn) − g(Xn, Un+1), so that (7.4) matches the recursion above. Since g is bounded, the sequence (εn, n ∈ N∗) is bounded.
We assume that G is Lipschitz (and bounded, as g is bounded). Set V(y) = −∫_0^y F(r) dr, which is thus of class C² with bounded second order derivatives. Assume that G is non-decreasing and that there exists a unique root, say y∗, to the equation G(y) = a. This implies that Λ0 = {y∗} is the set of equilibria of the associated ODE and that V is a strict Lyapounov function. We also have (ii) of Proposition 7.6. Take k = F² so that (i) of Proposition 7.6 holds. Assume that:

∑_{n∈N} γn = +∞    and    ∑_{n∈N} γn² < +∞.

Thus, we have the second part of (7.5). Since (εn, n ∈ N∗) is bounded, we get the first part of (7.5), as well as (iii) of Proposition 7.6 with R = +∞.

We deduce from Proposition 7.6 that a.s. the sequence (Xn, n ∈ N) is bounded. Theorem 7.5 and Lemma 7.4 then imply that a.s. (Xn, n ∈ N) converges towards the only equilibrium y∗ of the associated ODE.

Some hypotheses can be slightly weakened, see Section 1.4.2 in [5]. For the speed of convergence of (Xn, n ∈ N) towards y∗, which corresponds to a central limit theorem for martingales, we also refer to [5].

7.4.2 Estimating a quantile

Assume we dispose of a sequence (Un, n ∈ N∗) of independent random variables with the same unknown distribution, say L0. An estimation of the quantile αq of order q ∈ (0, 1) of the distribution L0 using a sample of size n is naturally given by the empirical quantile. In order to find the empirical quantile, one has to sort the n realizations, which takes of order n log(n) operations. This might take too long if n is huge.

Instead, it is possible to use a linear stochastic algorithm to approximate the quantile. We follow the setting developed in Section 7.4.1 with g(y, u) = 1{u≤y} and a = q. Assume the distribution of interest has a bounded continuous density f0 and that the cumulative distribution function F0 is such that F0(y) = q has only one solution. If

∑_{n∈N} γn = +∞    and    ∑_{n∈N} γn² < +∞,

then, according to Section 7.4.1, the stochastic algorithm:

Xn+1 = Xn − γn (1{Un+1 ≤ Xn} − q)

converges a.s. to the quantile αq.
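As an illustration, here is a hedged one-pass implementation of this algorithm. The step sequence γn = 10/(n + 10) is an illustrative tuning (any choice with ∑ γn = +∞ and ∑ γn² < +∞ fits the assumptions), and the standard normal sample merely stands in for the unknown distribution L0.

```python
import numpy as np

def quantile_sa(sample, q, x0=0.0):
    """One pass of X_{n+1} = X_n - gamma_n (1_{U_{n+1} <= X_n} - q): O(n) time, O(1) memory."""
    x = x0
    for n, u in enumerate(sample):
        gamma = 10.0 / (n + 10.0)      # illustrative steps: infinite sum, finite sum of squares
        x -= gamma * (float(u <= x) - q)
    return x

rng = np.random.default_rng(1)
sample = rng.normal(size=1_000_000)    # stand-in for the unknown distribution L_0
print(quantile_sa(sample, q=0.9))      # should be close to 1.2816, the 0.9 quantile of N(0, 1)
print(np.quantile(sample, 0.9))        # empirical quantile (requires sorting), for comparison
```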

7.4.3 Two-armed bandit

We consider the stochastic algorithm given by (7.1) in Section 7.1 with

∑_{n∈N} γn = +∞    and    ∑_{n∈N} γn² < +∞.

We recall this is a stochastic approximation algorithm (7.4) with F(y) = πy(1 − y) and π = pA − pB. The equilibrium set of the ODE associated to F is Λ0 = {0, 1}. For convenience, we assume that pA > pB, so that 1 is stable and 0 is unstable.
Notice that by construction, see (7.1), the sequence (Xn, n ∈ N) belongs a.s. to (0, 1), as X0 ∈ (0, 1), and thus is bounded.

Since V(y) = −∫_0^y F(r) dr is a strict Lyapounov function, we deduce from Theorem 7.5 and Lemma 7.4 that the sequence (Xn, n ∈ N) converges a.s. to a limit, say X∞, which belongs to Λ0.

We write Px when starting the algorithm from X0 = x. Since (Xn, n ∈ N) is a sub-martingale, as F ≥ 0, we deduce that Ex[X∞] ≥ Ex[X0] = x. As Ex[X∞] = Px(X∞ = 1), we get the elementary lower bound Px(X∞ = 1) ≥ x.

We say the approximation is fallible when starting from x ∈ (0, 1) if Px(X∞ = 0) > 0, and infallible if Px(X∞ = 0) = 0 for all x ∈ (0, 1). Whether the algorithm is fallible or infallible is a second order property, whereas the ODE method can be seen as a first order property. Because the two equilibria of the two-armed bandit lie on the boundary of the interval of definition of the ODE, no general result can be applied. A direct study of this particular algorithm is developed in [11]. As an example, we provide the following result, which is part of Corollaries 1 and 2 in [11].

Lemma 7.8. Let 1/2 < α ≤ 1 and C > 0. Assume that the step sequence is given by γn = (C/(n + C))^α for all n ∈ N. If α = 1 and C ≤ 1/pB, then the algorithm is infallible. Otherwise, it is fallible when starting from x ∈ (0, 1).

In particular, for practical implementation, the algorithm is infallible for the step sequence γn = 1/(n + 1), which corresponds to α = C = 1.
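The dichotomy of Lemma 7.8 can be probed numerically. The sketch below is only a finite-horizon Monte Carlo proxy for Px(X∞ = 0): it counts the trajectories ending below 1/2 after a fixed number of steps, and all parameter choices are illustrative.

```python
import numpy as np

def freeze_fraction(p_a, p_b, x0, alpha, c, n_steps, n_runs, seed=0):
    """Fraction of runs of (7.1), with gamma_n = (c/(n+c))**alpha, ending below 1/2."""
    rng = np.random.default_rng(seed)
    ns = np.arange(1, n_steps + 1)             # steps indexed from n = 1 so that gamma_n < 1
    gammas = (c / (ns + c)) ** alpha
    low = 0
    for _ in range(n_runs):
        x = x0
        for g in gammas:
            u, v = rng.random(), rng.random()
            if u <= x and v <= p_a:            # A chosen and wins
                x += g * (1.0 - x)
            elif u > x and v <= p_b:           # B chosen and wins
                x -= g * x
        low += x < 0.5
    return low / n_runs

# alpha = C = 1 (infallible by Lemma 7.8, as 1 <= 1/p_B) versus C = 10 > 1/p_B (fallible):
for c in (1.0, 10.0):
    print(c, freeze_fraction(0.7, 0.4, x0=0.5, alpha=1.0, c=c, n_steps=5_000, n_runs=400))
```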
Bibliography

[1] M. Benaïm. Dynamics of stochastic approximation algorithms. In Séminaire de Probabilités, XXXIII, volume 1709 of Lecture Notes in Math., pages 1–68. Springer, Berlin, 1999.

[2] M. Benaïm and N. El Karoui. Promenade aléatoire. Éditions de l'École Polytechnique, 2007.

[3] G. Burtini, J. Loeppky, and R. Lawrence. A survey of online experiment design with
the stochastic multi-armed bandit. arXiv:1510.00757, 2015.

[4] M. Duflo. Algorithmes stochastiques, volume 23 of Mathématiques & Applications


(Berlin) [Mathematics & Applications]. Springer-Verlag, Berlin, 1996.

[5] M. Duflo. Random iterative models, volume 34 of Applications of Mathematics (New


York). Springer-Verlag, Berlin, 1997. Translated from the 1990 French original by
Stephen S. Wilson and revised by the author.

[6] J. C. Gittins. Bandit processes and dynamic allocation indices. J. Roy. Statist. Soc. Ser.
B, 41(2):148–177, 1979. With discussion.

[7] H. J. Kushner and D. S. Clark. Stochastic approximation methods for constrained and
unconstrained systems, volume 26 of Applied Mathematical Sciences. Springer-Verlag,
New York-Berlin, 1978.

[8] H. J. Kushner and G. G. Yin. Stochastic approximation and recursive algorithms and
applications, volume 35 of Applications of Mathematics (New York). Springer-Verlag,
New York, second edition, 2003. Stochastic Modelling and Applied Probability.

[9] D. Lamberton and G. Pagès. How fast is the bandit? Stoch. Anal. Appl., 26(3):603–623,
2008.

[10] D. Lamberton and G. Pagès. A penalized bandit algorithm. Electron. J. Probab., 13:no.
13, 341–373, 2008.

[11] D. Lamberton, G. Pagès, and P. Tarrès. When can the two-armed bandit algorithm be
trusted? Ann. Appl. Probab., 14(3):1424–1454, 2004.

[12] K. S. Narendra and M. A. L. Thathachar. Learning automata - a survey. IEEE Trans-


actions on Systems, Man, and Cybernetics, SMC-4(4):323–334, July 1974.


[13] K. S. Narendra and M. A. L. Thathachar. Learning Automata: An Introduction.


Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1989.

[14] M. F. Norman. On the linear model with two absorbing barriers. J. Mathematical
Psychology, 5:225–241, 1968.

[15] H. Robbins. Some aspects of the sequential design of experiments. Bull. Amer. Math.
Soc., 58(5):527–535, 09 1952.

[16] I. J. Shapiro and K. S. Narendra. Use of stochastic automata for parameter self-
optimization with multimodal performance criteria. IEEE Transactions on Systems
Science and Cybernetics, 1969.

[17] R. Weber. On the Gittins index for multiarmed bandits. Ann. Appl. Probab., 2(4):1024–
1033, 1992.
Chapter 8

Appendix

8.1 More on measure theory


8.1.1 Construction of probability measures
We give in this section, without proofs, the main theorem which allows one to build the usual measures, such as the Lebesgue measure and the product measure.

Definition 8.1. A collection, A, of subsets of Ω is called a Boolean algebra if:

(i) Ω ∈ A;

(ii) A ∈ A implies Ac ∈ A;

(iii) A, B ∈ A implies A ∪ B ∈ A.

It is easy to check that a Boolean algebra is stable by finite intersection. A probability


distribution can be defined on a Boolean algebra (to be compared with Definition 1.7).

Definition 8.2. Let A be a Boolean algebra. A probability measure on (Ω, A) is a map P defined on A taking values in [0, +∞] such that:

(i) P(Ω) = 1;

(ii) Additivity: for all A, B ∈ A disjoint, P(A ∪ B) = P(A) + P(B);

(iii) Continuity at ∅: for all sequences (An, n ∈ N) such that An ∈ A, An+1 ⊂ An for all n ∈ N and ⋂_{n∈N} An = ∅, the sequence (P(An), n ∈ N) converges to 0.

The following extension theorem allows one to extend a probability measure on a Boolean algebra to a probability measure on the σ-field generated by the Boolean algebra. Its proof can be found in Section I.5 of [2] or in Section 3 of [1].

Theorem 8.3 (Carathéodory extension theorem). Let P be a probability measure defined on a Boolean algebra A of Ω. There exists a unique probability measure P′ on (Ω, σ(A)) such that P′ and P coincide on A.

This extension theorem allows one to prove the existence of the Lebesgue measure.


Proposition 8.4 (Lebesgue measure). There exists a unique probability measure P on the
measurable space ([0, 1), B([0, 1))), called Lebesgue measure, such that P([a, b)) = b − a for all
0 ≤ a ≤ b ≤ 1.

Before giving the proof of Proposition 8.4, we provide a sufficient condition for a real-valued additive function defined on a Boolean algebra to be continuous at ∅.

Lemma 8.5. Let A be a Boolean algebra on Rd, d ≥ 1. Let P be a [0, +∞]-valued function defined on A such that P(Ω) = 1 and P is additive (that is, (ii) of Definition 8.2 holds). If for all A ∈ A and ε > 0, there exist a compact set K ⊂ Rd and B ∈ A such that B ⊂ K ⊂ A and P(A ∩ Bᶜ) ≤ ε, then P is a probability measure on A (that is, P is continuous at ∅).
Proof. Let (An, n ∈ N) be a non-increasing A-valued sequence such that ⋂_{n∈N} An = ∅. We shall prove that lim_{n→+∞} P(An) = 0.

Let ε > 0. For all k ∈ N, there exist a compact set Kk and Bk ∈ A such that Bk ⊂ Kk ⊂ Ak and P(Ak ∩ Bkᶜ) ≤ ε/2^k. Since ⋂_{k∈N} Ak = ∅, we get ⋂_{k∈N} Kk = ∅. As a sequence of compact sets with empty intersection has a finite sub-family with empty intersection, we deduce there exists n0 ∈ N such that ⋂_{k=0}^{n0} Kk = ∅. This implies that ⋂_{k=0}^{n0} Bk = ∅, and thus ⋃_{k=0}^{n} Bkᶜ = Rd for all n ≥ n0. We get that, for n ≥ n0:

P(An) = P(An ∩ ⋃_{k=0}^{n} Bkᶜ) ≤ ∑_{k=0}^{n} P(An ∩ Bkᶜ) ≤ ∑_{k=0}^{n} P(Ak ∩ Bkᶜ) ≤ ∑_{k=0}^{n} ε 2^{−k} ≤ 2ε,

that is, P(An) ≤ 2ε for all n ≥ n0. Since ε > 0 is arbitrary, we deduce that lim_{n→+∞} P(An) = 0, which ends the proof of the lemma.

Proof of Proposition 8.4. Let A be the set of finite unions of intervals [a, b) with 0 ≤ a ≤ b ≤ 1. Notice A is a Boolean algebra which generates the Borel σ-field B([0, 1)). Define P([a, b)) = b − a for 0 ≤ a ≤ b ≤ 1. It is elementary to check that P can be uniquely extended to an additive [0, +∞]-valued function on A, which we still denote by P. Notice that P([0, 1)) = 1. To conclude, it is enough to prove that P is continuous at ∅.

For A ∈ A non-empty, there exist n0 ∈ N∗ and 0 ≤ ai < bi < ai+1 for i ∈ ⟦1, n0⟧, with the convention a_{n0+1} = 1, such that A = ⋃_{i∈⟦1,n0⟧} [ai, bi). Let ε > 0. Taking K = ⋃_{i∈⟦1,n0⟧} [ai, ai ∨ (bi − ε2^{−i})] and B = ⋃_{i∈⟦1,n0⟧} [ai, ai ∨ (bi − ε2^{−i})), we get that B ∈ A, B ⊂ K ⊂ A and P(A ∩ Bᶜ) ≤ ε. We deduce that the hypotheses of Lemma 8.5 are satisfied. Thus P is a probability measure on A. Therefore Theorem 8.3 implies there exists a unique probability measure on [0, 1) which is an extension of P.

Remark 8.6. Let λ1 denote the Lebesgue measure on [0, 1). Then, the Lebesgue measure on R, λ, is defined by: for all Borel sets A of R, λ(A) = ∑_{x∈Z} λ1((A + x) ∩ [0, 1)), where A + x = {z + x, z ∈ A}. It is easy to check that λ is σ-additive (and thus a measure according to Definition 1.7). Notice that λ([a, b)) = b − a for all a ≤ b. Using Exercise 9.2, we get that the Lebesgue measure is the only measure on (R, B(R)) such that this latter property holds.

The construction of the Lebesgue measure, λ, on (Rd, B(Rd)) for d ≥ 1, which is the unique σ-finite measure such that λ(∏_{i=1}^d [ai, bi)) = ∏_{i=1}^d (bi − ai) for all ai ≤ bi, follows the same steps and is left to the reader. ♦

Using the extension theorem, we get the existence of the product probability measure on the product measurable space.

Proposition 8.7. Let ((Ωi, Gi, Pi), i ∈ I) be a collection of probability spaces and set Ω = ∏_{i∈I} Ωi as well as G = ⊗_{i∈I} Gi. There exists a unique probability measure P on (Ω, G) such that P(∏_{i∈I} Ai) = ∏_{i∈I} Pi(Ai), where Ai ∈ Gi for all i ∈ I and Ai = Ωi but for a finite number of indices.

The probability measure P is called the product probability measure and is denoted by ⊗_{i∈I} Pi. The probability space (Ω, G, P) is called the product probability space. Proposition 8.7 can be extended to the finite product of σ-finite measures, see also Theorem 1.53 for an alternative construction in this particular case.
Proof. Let A be the set of finite unions of sets of the form ∏_{i∈I} Ai, where Ai ∈ Gi for all i ∈ I and Ai = Ωi but for a finite number of indices. Notice that A is a Boolean algebra which generates the product σ-field G. Define P(∏_{i∈I} Ai) = ∏_{i∈I} Pi(Ai). It is elementary to check that P can be uniquely extended to an additive [0, +∞]-valued function on A, which we still denote by P. Notice that P(Ω) = 1. To conclude, it is enough to prove that P is continuous at ∅.

We first assume that I = N∗. For n ∈ N, we set Ωⁿ = ∏_{k>n} Ωk and Aⁿ the Boolean algebra of the finite unions of sets ∏_{k>n} A′k with A′k ∈ Gk for all k > n and A′k = Ωk but for a finite number of indices. Define Pⁿ(∏_{k>n} A′k) = ∏_{k>n} Pk(A′k). It is elementary to check that Pⁿ can be uniquely extended to an additive [0, +∞]-valued function on Aⁿ, which we still denote by Pⁿ.
still denote by Pn .
Let us prove that P is continuous at ∅ by contraposition. Let (An, n ∈ N∗) be a non-increasing A-valued sequence and ε > 0 such that lim_{n→+∞} P(⋂_{k=1}^n Ak) ≥ ε. We shall prove that ⋂_{n∈N∗} An is non-empty.

For ω1 ∈ Ω1, consider A¹n(ω1) = {ω¹ ∈ Ω¹ ; (ω1, ω¹) ∈ An}, the section of An on Ω¹ at ω1. It is elementary to deduce from An ∈ A that for all ω1 ∈ Ω1 we have A¹n(ω1) ∈ A¹. It is also not difficult to prove that Bn,1 = {ω1 ∈ Ω1 ; P¹(A¹n(ω1)) ≥ ε/2} belongs to G1. Since the sequence (An, n ∈ N∗) is non-increasing, the sequence (Bn,1, n ∈ N∗) is also non-increasing. Since An is a subset of (Bn,1 × Ω¹) ∪ {(ω1, ω¹) ; ω1 ∉ Bn,1 and ω¹ ∈ A¹n(ω1)}, we get that:

ε ≤ P(An) ≤ P1(Bn,1) + (1 − P1(Bn,1)) ε/2,

and thus P1(Bn,1) ≥ ε/2 for all n ∈ N∗. We deduce from the continuity of P1 at ∅ that there exists ω̄1 ∈ ⋂_{n∈N∗} Bn,1. Furthermore, the sequence (A¹n(ω̄1), n ∈ N∗) of elements of A¹ is non-increasing and such that lim_{n→+∞} P¹(⋂_{k=1}^n A¹k(ω̄1)) ≥ ε/2 and {ω̄1} × ⋂_{n∈N∗} A¹n(ω̄1) ⊂ ⋂_{n∈N∗} An. By iterating the previous argument, we get that for all k ∈ N∗ there exists ω̄k ∈ Ωk such that:

{(ω̄1, . . . , ω̄k)} × ⋂_{n∈N∗} Aᵏn(ω̄1, . . . , ω̄k) ⊂ ⋂_{n∈N∗} An,

where Aᵏn(ω̄1, . . . , ω̄k) = {ωᵏ ∈ Ωᵏ ; (ω̄1, . . . , ω̄k, ωᵏ) ∈ An} is the section of An on ∏_{i=1}^k Ωi at (ω̄1, . . . , ω̄k). This implies that (ω̄k, k ∈ N∗) ∈ ⋂_{n∈N∗} An, and thus ⋂_{n∈N∗} An is non-empty. The proposition is thus true when I is countable.

According to the previous arguments, it is clear the proposition is also true when I is finite. Let us assume that I is uncountable. For every (countable) sequence (An, n ∈ N∗) of elements of A, there exists a set J ⊂ I at most countable such that the sets An are finite unions of sets of the type ∏_{j∈J} A′j × ∏_{i∈I\J} Ωi with A′j ∈ Gj for all j ∈ J. Thus we have An = AJn × ∏_{i∈I\J} Ωi, with AJn = {ωJ ∈ ∏_{j∈J} Ωj ; {ωJ} × ∏_{i∈I\J} Ωi ⊂ An}. And the continuity of P at ∅ is then a consequence of the first part of the proof, as J is at most countable.

Based on Proposition 8.7, the next exercise provides an alternative proof of Proposition 8.4 on the existence of the Lebesgue measure on [0, 1).

Exercise 8.1. Set Ωi = {0, 1}, Gi = P(Ωi) and Pi({0}) = Pi({1}) = 1/2 for i ∈ N∗. Consider the product probability space (Ω = ∏_{i∈N∗} Ωi, G = ⊗_{i∈N∗} Gi, P = ⊗_{i∈N∗} Pi). Define the function ϕ from Ω to [0, 1) by:

ϕ((ωi, i ∈ N∗)) = ∑_{i∈N∗} 2^{−i} ωi.

By considering intervals [k2^{−i}, j2^{−i}), check that ϕ is measurable and, using Corollary 1.14, that Pϕ, the image of P by ϕ, is the Lebesgue measure on ([0, 1), B([0, 1))). 4
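A quick numerical sanity check of this exercise (a sketch only; truncating ϕ at 32 binary digits is an illustrative choice): the image of independent fair bits under ϕ is, up to truncation error, uniform on [0, 1).

```python
import numpy as np

rng = np.random.default_rng(0)
k, n = 32, 100_000                        # 32 fair bits per sample point
bits = rng.integers(0, 2, size=(n, k))    # omega_i in {0, 1}, each with probability 1/2
u = bits @ (0.5 ** np.arange(1, k + 1))   # phi truncated at k digits: sum of 2^{-i} omega_i

# Under the product measure, u should be (nearly) uniform on [0, 1):
for a, b in [(0.0, 0.25), (0.25, 0.5), (0.5, 1.0)]:
    print((a, b), np.mean((u >= a) & (u < b)))   # each frequency close to b - a
```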

8.1.2 Proof of the Carathéodory extension Theorem 8.3


We first give some properties of probability measures on Boolean algebras.

Proposition 8.8. Let A be a Boolean algebra on Ω and P a probability measure on (Ω, A).
We have the following properties.

(i) P(A ∪ B) + P(A ∩ B) = P(A) + P(B) for all A, B ∈ A.

(ii) P(A) ≤ P(B) for all A, B ∈ A such that A ⊂ B.


(iii) Let (Ai, i ∈ I) be an at most countable family of elements of A such that ⋃_{i∈I} Ai ∈ A. One has P(⋃_{i∈I} Ai) ≤ ∑_{i∈I} P(Ai), with equality if the sets (Ai, i ∈ I) are pairwise disjoint.

Proof. Properties (i) and (ii) are consequences of the additivity of P.

It is enough to prove property (iii) with I countable. Let (Bn, n ∈ N) be a sequence of pairwise disjoint elements of A such that ⋃_{n∈N} Bn ∈ A. The sequence (⋃_{k>n} Bk, n ∈ N) is non-increasing and converges towards ∅. The continuity property at ∅ of P implies that lim_{n→+∞} P(⋃_{k>n} Bk) = 0. By additivity, we get:

P(⋃_{k∈N} Bk) = P(⋃_{k=0}^{n} Bk) + P(⋃_{k>n} Bk) = ∑_{k=0}^{n} P(Bk) + P(⋃_{k>n} Bk).

Letting n go to infinity, we get P(⋃_{k∈N} Bk) = ∑_{k∈N} P(Bk).

Let (An, n ∈ N) be a sequence of elements of A such that ⋃_{n∈N} An ∈ A. Set B0 = A0 and, for n ≥ 1, Bn = An ∩ (⋃_{k=0}^{n−1} Ak)ᶜ. We have Bn ⊂ An as well as ⋃_{k=0}^{n} Bk = ⋃_{k=0}^{n} Ak, and thus ⋃_{k∈N} Bk = ⋃_{k∈N} Ak. The sets (Bn, n ∈ N) belong to A, are pairwise disjoint and such that ⋃_{n∈N} Bn ∈ A. We deduce from the first part of the proof that:

P(⋃_{k∈N} Ak) = P(⋃_{k∈N} Bk) = ∑_{k∈N} P(Bk) ≤ ∑_{k∈N} P(Ak).

Let A be a Boolean algebra on Ω and P a probability measure on (Ω, A). The outer probability measure P∗ is the [0, 1]-valued function defined on P(Ω) by:

P∗(A) = inf {∑_{n∈N} P(Bn) ; A ⊂ ⋃_{n∈N} Bn and Bn ∈ A for all n ∈ N}.    (8.1)

The next lemma states that the restriction of P∗ to A coincides with P and that P∗ is monotone and σ-sub-additive.

Lemma 8.9. We have the following properties.

(i) P∗(A) = P(A) for all A ∈ A.

(ii) Monotonicity: for all A ⊂ B ⊂ Ω, we have P∗(A) ≤ P∗(B).

(iii) σ-sub-additivity: let (Ai, i ∈ I) be an at most countable family of subsets of Ω. We have:

P∗(⋃_{i∈I} Ai) ≤ ∑_{i∈I} P∗(Ai).

Proof. Let (Bn, n ∈ N) and A be elements of A such that A ⊂ ⋃_{n∈N} Bn. We have A ∩ Bn ∈ A and ⋃_{n∈N} (A ∩ Bn) = A ∈ A. We deduce from property (iii) of Proposition 8.8 that:

P(A) = P(⋃_{n∈N} (A ∩ Bn)) ≤ ∑_{n∈N} P(A ∩ Bn) ≤ ∑_{n∈N} P(Bn),

which is an equality if, for example, B0 = A and Bn = ∅ for n ∈ N∗. We deduce that P∗(A) = P(A), that is, property (i).

Property (ii) is a consequence of the definition of P∗. We now prove property (iii). Let (An, n ∈ N) be subsets of Ω, ε > 0 and (Bn,k ; n, k ∈ N) be elements of A such that for all n ∈ N, An ⊂ ⋃_{k∈N} Bn,k and ∑_{k∈N} P(Bn,k) ≤ P∗(An) + ε2^{−n}. As ⋃_{n∈N} An ⊂ ⋃_{n,k∈N} Bn,k, we deduce that:

P∗(⋃_{n∈N} An) ≤ ∑_{n,k∈N} P(Bn,k) ≤ ∑_{n∈N} (P∗(An) + ε2^{−n}) = 2ε + ∑_{n∈N} P∗(An).

Since ε > 0 is arbitrary, we deduce that P∗(⋃_{n∈N} An) ≤ ∑_{n∈N} P∗(An), and thus property (iii) holds.

The sub-additivity of P∗ implies that P∗(B) ≤ P∗(B ∩ A) + P∗(B ∩ Aᶜ) for all A, B ⊂ Ω. We consider the family of measurable sets for P∗ defined by:

G = {A ⊂ Ω ; P∗(B) = P∗(B ∩ A) + P∗(B ∩ Aᶜ) for all B ⊂ Ω}.    (8.2)

We first prove that G is a Boolean algebra which contains A (Lemma 8.10), then that G is a σ-field and that P∗ is a probability measure on G (Lemma 8.11).

Lemma 8.10. The set G is a Boolean algebra and it contains A.

Proof. As P∗(∅) = P(∅) = 0, we deduce that Ω ∈ G. By symmetry, if A ∈ G then Aᶜ ∈ G. Let A1, A2 ∈ G and B ⊂ Ω. We have:

P∗(B) = P∗(B ∩ A1) + P∗(B ∩ A1ᶜ)
      = P∗(B ∩ A1 ∩ A2) + P∗(B ∩ A1 ∩ A2ᶜ) + P∗(B ∩ A1ᶜ)
      ≥ P∗(B ∩ A1 ∩ A2) + P∗(B ∩ (A1 ∩ A2)ᶜ),    (8.3)

where we used the sub-additivity of P∗ for the inequality and (A1 ∩ A2ᶜ) ∪ A1ᶜ = (A1 ∩ A2)ᶜ. As P∗ is sub-additive, we deduce that the inequality (8.3) is in fact an equality, and thus that A1 ∩ A2 ∈ G. This implies that G is a Boolean algebra.

Let A ∈ A, B ⊂ Ω and ε > 0. There exists a sequence (Bn, n ∈ N) of elements of A such that B ⊂ ⋃_{n∈N} Bn and P∗(B) + ε ≥ ∑_{n∈N} P(Bn). By additivity of P, we get:

P∗(B) + ε ≥ ∑_{n∈N} P(Bn) = ∑_{n∈N} P(Bn ∩ A) + ∑_{n∈N} P(Bn ∩ Aᶜ) ≥ P∗(B ∩ A) + P∗(B ∩ Aᶜ),

where for the last inequality we used the definition of P∗ and that B ∩ A ⊂ ⋃_{n∈N} (Bn ∩ A) as well as B ∩ Aᶜ ⊂ ⋃_{n∈N} (Bn ∩ Aᶜ). Since ε > 0 is arbitrary, we deduce that P∗(B) ≥ P∗(B ∩ A) + P∗(B ∩ Aᶜ), and then that this inequality is an equality as P∗ is sub-additive. Thus, we get that A ∈ G if A ∈ A.

Lemma 8.11. The family G is a σ-field and the function P∗ restricted to G is a probability measure.

Proof. Notice that for A ∈ G and B, C ⊂ Ω such that A ∩ C = ∅, we deduce from the definition of G that:

P∗(B ∩ (A ∪ C)) = P∗(B ∩ A) + P∗(B ∩ C).    (8.4)

Let (An, n ∈ N) be pairwise disjoint elements of G and B ⊂ Ω. We set A′n = ⋃_{k=0}^{n} Ak and A′ = ⋃_{k∈N} Ak. We have A′n ∈ G. Using the monotonicity of P∗ and then (8.4), we get:

P∗(B ∩ A′) ≥ P∗(B ∩ A′n) = ∑_{k=0}^{n} P∗(B ∩ Ak).    (8.5)

We deduce that P∗(B ∩ A′) ≥ ∑_{k∈N} P∗(B ∩ Ak) and then, since P∗ is σ-sub-additive, that:

P∗(B ∩ A′) = ∑_{k∈N} P∗(B ∩ Ak).    (8.6)

We deduce from the equality in (8.5) and the monotonicity of P∗ that:

P∗(B) = P∗(B ∩ A′n) + P∗(B ∩ A′nᶜ) ≥ ∑_{k=0}^{n} P∗(B ∩ Ak) + P∗(B ∩ A′ᶜ).

Letting n go to infinity, we deduce from (8.6) that P∗(B) ≥ P∗(B ∩ A′) + P∗(B ∩ A′ᶜ). Since P∗ is sub-additive, this inequality is in fact an equality, and thus A′ = ⋃_{k∈N} Ak ∈ G. It is then immediate to check that G is stable by countable union. Thus, G is a σ-field.

For B = Ω in (8.6), we get that P∗ is σ-additive on G: P∗(⋃_{k∈N} Ak) = ∑_{k∈N} P∗(Ak). The restriction of P∗ to G is therefore a probability measure.

Proof of Theorem 8.3. Let P be a probability measure on a Boolean algebra A of Ω. According to Lemma 8.11, the family of sets G defined by (8.2) is a σ-field and the restriction of P∗, defined by (8.1), to G is a probability measure. Since G is a σ-field, we deduce from Lemma 8.10 that σ(A) ⊂ G. According to Lemma 8.9, P∗ and P coincide on A. We deduce that the restriction of P∗ to σ(A) is a probability measure on σ(A) which coincides with P on A.

Uniqueness of this extension on σ(A) is a consequence of the monotone class theorem, more precisely of Corollary 1.14.

8.2 More on convergence for sequences of random variables

All the random variables of this section will be defined on a given probability space (Ω, G, P).

8.2.1 Convergence in distribution

The convergence in distribution of random variables corresponds to the weak convergence of their distributions. We shall investigate this subject only partially, so that we can state the ergodic theorems for Markov chains.

Definition 8.12. Let (E, B) be a metric space with its Borel σ-field. A sequence (µn, n ∈ N) of probability measures on E converges weakly to a probability measure µ on E if for all bounded real-valued continuous functions f defined on E, we have limn→∞ ∫ f dµn = ∫ f dµ. Let (Xn, n ∈ N) and X be E-valued random variables. The sequence (Xn, n ∈ N) converges in distribution towards X if the probability measures (PXn, n ∈ N) converge weakly towards PX, that is, limn→∞ E[f(Xn)] = E[f(X)] for all bounded real-valued continuous functions f. We write Xn →(d) X as n → ∞ (or sometimes Xn →(d) PX).

We refer to [1] for further results on convergence in distribution. Since we shall be mainly interested in convergence in distribution for random variables taking values in a discrete space, we introduce the convergence for the distance in total variation. The distance in total variation dTV between two finite measures µ and ν on (S, S) is given by:

dTV(µ, ν) = sup_{A∈S} |µ(A) − ν(A)|.

It is elementary to check that dTV is indeed a distance on the set of finite measures on (S, S). (Recall that a non-negative finite function d defined on S × S is a distance on S if for all x, y, z ∈ S: d(x, y) = d(y, x) (symmetry); d(x, y) ≤ d(x, z) + d(z, y) (triangle inequality); and d(x, y) = 0 implies x = y (separation).)

Lemma 8.13. The convergence for the distance in total variation for probability measures on a metric space implies the convergence in distribution.

Proof. Let (E, B(E)) be a metric space with its Borel σ-field. Let f be a real-valued measurable function defined on E taking values in (0, 1). By the Fubini theorem, we have ∫ f dν = ∫_0^1 ν({f > t}) dt for any probability measure ν on (E, B(E)).

Let (µn, n ∈ N) be a sequence of probability measures which converges for the distance in total variation towards a probability measure µ. This implies that limn→∞ µn({f > t}) = µ({f > t}) for all t. By dominated convergence, we deduce from the identity at the beginning of the proof that limn→∞ ∫ f dµn = ∫ f dµ. By linearity, this convergence holds for any bounded real-valued measurable function f. This implies that (µn, n ∈ N) converges weakly towards µ.

The discrete state space case

We now assume that E is a discrete space (and E = P(E)). Let λ denote the counting measure on (E, E): λ(A) = Card(A) for A ∈ E. Notice that any measure µ on (E, E) has a density with respect to the counting measure λ given by the function (µ({x}), x ∈ E). We shall identify the density of µ with µ and thus write µ(x) for µ({x}). We shall consider the L¹ norm with respect to the counting measure, so that for a real-valued function f = (f(x), x ∈ E) we set ‖f‖₁ = ∑_{x∈E} |f(x)|. It is left to the reader to check that for two finite measures µ and ν on (E, E):

2 dTV(µ, ν) = ‖µ − ν‖₁.    (8.7)
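For measures on a discrete space given by their densities, (8.7) turns the total variation distance into a sum. A minimal sketch (representing each measure as a dict from points to masses is our convention, not the text's):

```python
def d_tv(mu, nu):
    """Total variation distance between two finite measures on a discrete space,
    each given as a dict x -> mass, via the identity (8.7): 2 d_TV = ||mu - nu||_1."""
    support = set(mu) | set(nu)
    return 0.5 * sum(abs(mu.get(x, 0.0) - nu.get(x, 0.0)) for x in support)

mu = {0: 0.5, 1: 0.3, 2: 0.2}
nu = {0: 0.4, 1: 0.4, 2: 0.2}
print(d_tv(mu, nu))  # (0.1 + 0.1 + 0) / 2 = 0.1
```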

We now give a characterization of the weak convergence of probability measures on a discrete space.

Lemma 8.14. Let E be a discrete space. Let (Xn, n ∈ N∗) and X be E-valued random variables. The following conditions are equivalent.

(i) Xn →(d) X as n → ∞.

(ii) For all x ∈ E, we have P(Xn = x) → P(X = x) as n → ∞.

(iii) limn→∞ dTV(PXn, PX) = 0.

Proof. Since {x} is open and closed, the function 1{x} is continuous. Thus, property (i) implies property (ii). Property (iii) implies property (i) thanks to Lemma 8.13.

We now prove that property (ii) implies property (iii). Let ε > 0 and K ⊂ E finite such that P(X ∈ K) ≥ 1 − ε. Since K is finite, we deduce from (ii) that limn→∞ P(Xn ∈ K) = P(X ∈ K), and thus limn→∞ P(Xn ∈ Kᶜ) = P(X ∈ Kᶜ) ≤ ε. So for n large enough, say n ≥ n0, we have P(Xn ∈ Kᶜ) ≤ 2ε. We deduce from (8.7) that for n ≥ n0:

2 dTV(PXn, PX) = ∑_{x∈E} |P(Xn = x) − P(X = x)|
               ≤ ∑_{x∈K} |P(Xn = x) − P(X = x)| + P(Xn ∈ Kᶜ) + P(X ∈ Kᶜ)
               ≤ ∑_{x∈K} |P(Xn = x) − P(X = x)| + 3ε.

This implies that lim supn→∞ 2 dTV(PXn, PX) ≤ 3ε. Conclude using that ε is arbitrary.

8.2.2 Law of large numbers and central limit theorem

We refer to any introductory book in probability for a proof of the next results on the law of large numbers and the central limit theorem (CLT). For a sequence (Xn, n ∈ N∗) of real-valued or Rd-valued random variables, we define the empirical mean, when it is meaningful, by:

X̄n = (1/n) ∑_{k=1}^n Xk.

Theorem 8.15 (Law of large numbers). Let X be a real-valued random variable such that E[X] is well defined. Let (Xn, n ∈ N∗) be a sequence of independent real-valued random variables distributed as X. We have the a.s. convergence:

X̄n → E[X] a.s. as n → ∞.

If X ∈ L¹, then the convergence also holds in L¹.

The fluctuations are given by the CLT. We denote by N(µ, Σ), where µ ∈ Rd and Σ ∈ Rd×d is a symmetric non-negative matrix, the Gaussian distribution with mean µ and covariance matrix Σ.

Theorem 8.16 (Central Limit Theorem (CLT)). Let X be an Rd-valued random variable such that X ∈ L². Set µ = E[X] and Σ = Cov(X, X). Let (Xn, n ∈ N∗) be a sequence of independent Rd-valued random variables distributed as X. We have the following convergences:

X̄n → µ a.s.    and    √n (X̄n − µ) →(d) N(0, Σ)    as n → ∞.
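A quick numerical illustration of both theorems (a sketch; the exponential distribution and the sample sizes are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 10_000, 2_000
x = rng.exponential(scale=1.0, size=(reps, n))  # E[X] = 1 and Var(X) = 1
xbar = x.mean(axis=1)                           # one empirical mean per experiment
print(xbar.mean())                              # law of large numbers: close to mu = 1
z = np.sqrt(n) * (xbar - 1.0)                   # CLT normalization
print(z.mean(), z.std())                        # approximately 0 and 1, as for N(0, 1)
```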

8.2.3 Uniform integrability


Let (Ω, G, P) be a probability space.

Definition 8.17. We say a family of real-valued random variables (Xi, i ∈ I) is uniformly integrable if for all ε > 0 there exists a finite K such that for all i ∈ I:

E[|Xi| 1{|Xi|≥K}] ≤ ε.    (8.8)

Notice that if (8.8) holds, then E[|Xi |] ≤ ε + K and thus supi∈I E[|Xi |] is finite.
We give some results related to the uniform integrability.

Proposition 8.18. Let (Xi , i ∈ I) be a family of real-valued random variables.

(i) The family (Xi , i ∈ I) is uniformly integrable if and only if the following two conditions
are satisfied:

(a) For all ε > 0, there exists δ > 0 such that for all events A with P(A) ≤ δ, we have
E [|Xi |1A ] ≤ ε.
(b) supi∈I E[|Xi |] < +∞.

(ii) Any single real-valued integrable random variable is uniformly integrable.

(iii) If there exists an integrable real-valued random variable Y such that |Xi | ≤ |Y | a.s. for
all i ∈ I, then the family (Xi , i ∈ I) is uniformly integrable. More generally, if there
exists a family (Yi , i ∈ I) of uniformly integrable real-valued random variables such that
|Xi | ≤ |Yi | a.s. for all i ∈ I, then the family (Xi , i ∈ I) is uniformly integrable.

(v) If there exists r > 0 such that supi∈I E[|Xi |1+r ] < +∞, then the family (Xi , i ∈ I)
is uniformly integrable. More generally, if supi∈I E[f (Xi )] < +∞, where f is a non-
negative real-valued measurable function defined on R such that limx→+∞ f (x)/x = +∞,
then the family (Xi , i ∈ I) is uniformly integrable.

(vi) If (Xn , n ∈ N) is a sequence of integrable real-valued random variables which converges


in L1 to zero, then it is uniformly integrable.

Proof. We first prove property (i). Assume that the family (Xi, i ∈ I) is uniformly integrable. We have already noticed that (b) holds. Choose K such that (8.8) holds with ε replaced by ε/2. Set δ = ε/2K and let A be an event such that P(A) ≤ δ. Using (8.8), we get:

E[|Xi| 1A] = E[|Xi| 1A 1{|Xi|≥K}] + E[|Xi| 1A 1{|Xi|<K}] ≤ ε/2 + K P(A) ≤ ε.

This gives (a).

Assume that (a) and (b) hold. Set C = sup_{i∈I} E[|Xi|], which is finite by (b). Let ε > 0 be fixed and δ given by (a). Set K = C/δ and Ai = {|Xi| ≥ K}. The Markov inequality gives that P(Ai) ≤ E[|Xi|]/K ≤ C/K = δ. We deduce from (a), with A replaced by Ai, that (8.8) holds. This implies that the family (Xi, i ∈ I) is uniformly integrable.
We prove property (ii). Let Y be an integrable real-valued random variable. In particular, Y is a.s. finite. By dominated convergence, we get that lim_{K→+∞} E[|Y| 1{|Y|≥K}] = 0. Thus (8.8) holds and Y is uniformly integrable.

Thanks to property (i), to prove property (iii) it is enough to check (a) and (b). Notice that E[|Xi|] ≤ E[|Y|] for all i ∈ I, and thus (b) holds. We have E[|Xi| 1A] ≤ E[|Y| 1A]. Then use that Y is uniformly integrable, thanks to (ii), to conclude that (a) holds. The proof of the more general case is similar.

We prove property (v). Let ε > 0. Use that |Xi| 1{|Xi|≥K} ≤ K^{−r} |Xi|^{1+r} to deduce that (8.8) holds when K = (sup_{i∈I} E[|Xi|^{1+r}]/ε)^{1/r}. The proof of the more general case is similar.

Thanks to property (i), to prove property (vi) it is enough to check (a) and (b). Let (Xn, n ∈ N) be a sequence of integrable real-valued random variables which converges in L¹ towards zero. Condition (b) is immediate. Let us check (a). Fix ε > 0. There exists n0 ∈ N such that for all n ≥ n0 we have E[|Xn|] ≤ ε, and thus E[|Xn| 1A] ≤ ε for any event A. Then use (ii) and (i) to get that there exists δi > 0 such that if A is an event with P(A) ≤ δi then E[|Xi| 1A] ≤ ε, for all i ≤ n0. Take δ = min_{0≤i≤n0} δi to deduce that (a) holds.

We provide an interesting example of a family of uniformly integrable random variables.

Lemma 8.19. Let X be an integrable real-valued random variable. The family (XH =
E[X| H]; H is a σ-field and H ⊂ G) is uniformly integrable.

Proof. We shall check (a) and (b) from property (i) of Proposition 8.18. Using the Jensen inequality, we get that E[|XH|] ≤ E[|X|] for all σ-fields H ⊂ G. We get (b) as X is integrable.

We prove (a). Let ε > 0. According to property (ii) from Proposition 8.18, X is uniformly integrable. Thus, there exists K such that E[|X| 1{|X|≥K}] ≤ ε/2. Let A ∈ G be such that P(A) ≤ ε/2K. For any σ-field H ⊂ G, we have, using the Jensen inequality:

E[|XH| 1A] = E[|XH| E[1A | H]] ≤ E[E[|X| | H] E[1A | H]] = E[|X| E[1A | H]].

Furthermore, decomposing according to {|X| ≥ K} and {|X| < K}, and using that a.s. 0 ≤ E[1A | H] ≤ 1, we get:

E[|X| E[1A | H]] ≤ E[|X| 1{|X|≥K}] + K E[E[1A | H]] ≤ ε/2 + K P(A) ≤ ε.

We have obtained that E[|XH| 1A] ≤ ε for all σ-fields H ⊂ G and all A ∈ G such that P(A) ≤ ε/2K. This gives (a).

8.2.4 Convergence in probability and in L¹

We recall that a sequence (Xn, n ∈ N) of real-valued random variables converges in probability towards a real-valued random variable X∞ if for all ε > 0 we have limn→∞ P(|Xn − X∞| ≥ ε) = 0. We also recall that a.s. convergence implies convergence in probability. The converse is false in general, but we have the following partial converse. We call (a_{n_k}, k ∈ N) a sub-sequence of (an, n ∈ N) if the N-valued sequence (nk, k ∈ N) is increasing.

Lemma 8.20. Let (Xn, n ∈ N) be a sequence of real-valued random variables which converges in probability towards a real-valued random variable X∞. Then, there is a sub-sequence (X_{n_k}, k ∈ N) which converges a.s. to X∞.

The proof of this lemma is a consequence of the Borel-Cantelli lemma, but we provide a direct short proof (see also the proof of Proposition 1.50, where similar arguments are used).

Proof. Let n0 = 0 and, for k ∈ N, set n_{k+1} = inf{n > n_k ; P(|Xn − X∞| ≥ 2^{−k}) ≤ 2^{−k}}. The sub-sequence (n_k, k ∈ N) is well defined since (Xn, n ∈ N) converges in probability towards X∞. Since ∑_{k∈N} P(|X_{n_k} − X∞| ≥ 2^{−k}) < +∞, we get that ∑_{k∈N} 1{|X_{n_k} − X∞| ≥ 2^{−k}} is a.s. finite, and thus a.s. |X_{n_k} − X∞| ≥ 2^{−k} for only finitely many k. This implies that the sub-sequence (X_{n_k}, k ∈ N) converges a.s. to X∞.

The uniform integrability is the right concept to get the L1 convergence from the a.s.
convergence of real-valued random variables. This is a consequence of the following proposi-
tion.

Proposition 8.21. Let (Xn , n ∈ N) be a sequence of integrable real-valued random variables


and X∞ a real-valued random variable. The following properties are equivalent.

(i) The random variables (Xn , n ∈ N) are uniformly integrable and (Xn , n ∈ N) converges
in probability towards X∞ .

(ii) The sequence (Xn , n ∈ N) converges in L1 towards X∞ which is integrable.

Proof. We first assume (i). Thanks to Lemma 8.20, there exists a sub-sequence (X_{n_k}, k ∈ N) which converges a.s. to X∞. As (|X_{n_k}|, k ∈ N) converges a.s. to |X∞|, we deduce from Fatou's lemma that E[|X∞|] ≤ lim inf_{k→∞} E[|X_{n_k}|] ≤ sup_{n∈N} E[|Xn|]. Since the random variables (Xn, n ∈ N) are uniformly integrable, we deduce from property (i)-(b) of Proposition 8.18 that X∞ is integrable.

Let ε > 0. Since the random variables (Xn, n ∈ N) are uniformly integrable, as well as X∞ thanks to property (ii) of Proposition 8.18, we deduce there exists δ > 0 such that if A is an event with P(A) ≤ δ, then E[|Xn| 1A] ≤ ε for all n ∈ N and E[|X∞| 1A] ≤ ε. Since (Xn, n ∈ N) converges in probability towards X∞, there exists n0 such that for n ≥ n0 we have P(|Xn − X∞| > ε) ≤ δ. This implies that for n ≥ n0:

E[|Xn − X∞|] = E[|Xn − X∞| 1{|Xn−X∞|≤ε}] + E[|Xn − X∞| 1{|Xn−X∞|>ε}]
             ≤ ε + E[|Xn| 1{|Xn−X∞|>ε}] + E[|X∞| 1{|Xn−X∞|>ε}] ≤ 3ε.

This gives (ii).


We now assume (ii). First recall that convergence in L¹ implies, by the Markov inequality, convergence in probability. Since (Xn − X∞, n ∈ N) converges in L¹ towards 0, we deduce from property (vi) of Proposition 8.18 that this sequence is uniformly integrable. Since X∞ is integrable, it is uniformly integrable according to property (ii) of Proposition 8.18. Using that |Xn| ≤ |Xn − X∞| + |X∞| and the characterization of uniform integrability given by property (i) of Proposition 8.18, we easily get that (Xn, n ∈ N) is uniformly integrable.
Bibliography

[1] P. Billingsley. Probability and measure. Wiley Series in Probability and Statistics. John
Wiley & Sons, Inc., Hoboken, NJ, 2012.

[2] J. Neveu. Bases mathématiques du calcul des probabilités. Masson et Cie, Éditeurs, Paris,
1970.

Chapter 9

Exercises

9.1 Measure theory and random variables


Exercise 9.1 (Characterization of probability measures on R). Let P and P′ be two probability measures on (R, B(R)) such that P((−∞, a]) = P′((−∞, a]) for all a in a dense subset of R. Prove that P = P′. 4
Exercise 9.2 (Characterization of σ-finite measures). Extend Corollary 1.14 (resp. Corollary
1.15) to σ-finite measures by assuming further that the sequence (Ωn , n ∈ N) from (iii) of
Definition 1.7 can be chosen to belong to C (resp. to O). 4
Exercise 9.3 (Limit of differences). Let (an, n ∈ N) and (bn, n ∈ N) be [−∞, +∞]-valued sequences such that (an, bn) for all n ∈ N, as well as (lim supn→∞ an, lim infn→∞ bn), are not equal to (+∞, +∞) nor to (−∞, −∞). Prove that:

lim supn→∞ (an − bn) ≤ lim supn→∞ an − lim infn→∞ bn.

If furthermore the sequence (bn, n ∈ N) converges, deduce that:

lim supn→∞ (an − bn) = lim supn→∞ an − limn→∞ bn.

4
Exercise 9.4 (Permutation of integrals). Prove that:

∫_{(0,1)} ( ∫_{(0,1)} (x² − y²)/(x² + y²)² λ(dy) ) dx = π/4.

Deduce that the function f(x, y) = (x² − y²)/(x² + y²)² is not integrable with respect to the Lebesgue measure on (0, 1)². (Hint: compute the derivative with respect to y of y/(x² + y²).) 4
Exercise 9.5 (Independence). Extend (1.13) to functions fj such that fj ≥ 0 for all j ∈ J, or to functions fj such that fj(Xj) is integrable for all j ∈ J; in the latter case prove that ∏_{j∈J} fj(Xj) is also integrable. 4


Exercise 9.6 (Independence and covariance). Let X and Y be real-valued integrable random variables. Prove that if X and Y are independent, then XY is integrable and Cov(X, Y) = 0. Give an example where X and Y are square-integrable and not independent, but Cov(X, Y) = 0. 4
Exercise 9.7 (Independence). Let (Ai, i ∈ I) be independent events. Prove that (1Ai, i ∈ I) are independent random variables and deduce that (Aiᶜ, i ∈ I) are also independent events. 4
Exercise 9.8 (Independence). Let (Ω, F, P) be a probability space. Let G ⊂ F be a σ-field and C a collection of events which are all independent of G.

1. Prove by a counterexample that if C is not stable by finite intersection, then σ(C) may
not be independent of G.

2. Using the monotone class theorem prove that if C is stable by finite intersection, then
σ(C) is independent of G.

3. Let C and C 0 be two collections of events stable by finite intersection such that every
A ∈ C and A0 ∈ C 0 are independent. Prove that σ(C) and σ(C 0 ) are independent.

9.2 Conditional expectation


Exercise 9.9 (Indicators). Let (Ω, F, P) be a probability space and A, B ∈ F be two events.
Describe σ(1B ) and then compute E [1A |1B ]. 4
Exercise 9.10 (Random walk). Let (Xn, n ∈ N∗) be identically distributed independent real-valued random variables such that E[X1] is well defined. Let Sn = ∑_{k=1}^n Xk for n ∈ N∗. Compute E[X1 | S2] and deduce E[X1 | Sn] for n ≥ 2. 4
Exercise 9.11 (Symmetric random variable). Let X be an integrable and symmetric real-valued random variable, that is, X and −X have the same distribution. Compute E[X | X²]. 4
Exercise 9.12 (X conditioned on |X|). Let X be an integrable real-valued random variable with density f with respect to the Lebesgue measure. Compute E[X | |X|]. Compute also E[X | X²]. 4
Exercise 9.13 (Variance). Let X be a real-valued random variable such that E[X²] < +∞. Let H be a σ-field. Prove that E[X | H]² is integrable and Var(E[X | H]) ≤ Var(X). 4
Exercise 9.14 (L1 distance). Let X, Y be independent R-valued integrable random variables
such that E[Y ] = 0. Prove that E[|X − Y |] ≥ E[|X|]. 4
Exercise 9.15 (Kolmogorov's maximal inequality). Let (Xn, n ∈ N∗) be identically distributed independent real-valued random variables. We assume that E[X1²] < +∞ and E[X1] = 0. Let x > 0. We set Sn = ∑_{k=1}^n Xk for n ∈ N∗ and T = inf{n ∈ N∗ ; |Sn| ≥ x}, with the convention that inf ∅ = +∞.

1. Prove that P(T = k) ≤ (1/x²) E[Sk² 1{T=k}] for all k ∈ N∗.

2. Check that ∑_{k=1}^n P(T = k) = P(max_{1≤k≤n} |Sk| ≥ x).

3. By noticing that Sn² ≥ Sk² + 2Sk(Sn − Sk), prove Kolmogorov's maximal inequality:

P(max_{1≤k≤n} |Sk| ≥ x) ≤ E[Sn²]/x²    for all x > 0 and n ∈ N∗.

4
Exercise 9.16 (An application of Jensen inequality). Let X and Y be two integrable real-
valued random variables such that a.s. E[X| Y ] = Y and E[Y | X] = X. Using Jensen
inequality (twice) with a positive strictly convex function ϕ such that limx→+∞ ϕ(x)/x and
limx→−∞ ϕ(x)/x are finite, prove that a.s. X = Y . 4
Exercise 9.17 (Independence and conditional expectation). Let H ⊂ F be a σ-field, Y and
V random variables taking values in measurable spaces (S, S) and (E, E) such that Y is
independent of H and V is H-measurable. Let ϕ be a non-negative real-valued measurable
function defined on S × E (endowed with the product σ-field). Prove that a.s.:

E[ϕ(Y, V )| H] = g(V ) with g(v) = E[ϕ(Y, v)]. (9.1)

4
Exercise 9.18 (Conditional independence). Let A, B and H be σ-fields, subsets of F. Assume
that H ⊂ A ∩ B and that conditionally on H the σ-fields A and B are independent, that is
for all A ∈ A, B ∈ B, we have a.s. P(A ∩ B| H) = P(A| H)P(B| H).
1. Let A ∈ A, B ∈ B. Check that a.s. E [1A E [1B | A] | H] = E [1A E [1B | H] | H].

2. Deduce that, for all B ∈ B, a.s. P(B| A) = P(B| H).


4
Exercise 9.19 (Convergence of conditional expectation). Let B = (Bn, n ∈ N∗) and B′ = (B′n, n ∈ N∗) be independent sequences of independent random variables such that Bn and B′n are Bernoulli random variables with parameter 1/n. Let H = σ(B). We set Xn = n Bn B′n for n ∈ N∗.

1. Prove that (Xn , n ∈ N∗ ) converges a.s. and in L1 towards 0.

2. Prove that (E [Xn | H] , n ∈ N∗ ) converges in L1 but not a.s. towards 0.

4
Exercise 9.20 (Conditional densities). Let (Y, V) be an R²-valued random variable whose law has density f_{(Y,V)}(y, v) = λ v^{−1} e^{−λv} 1{0<y<v} with respect to the Lebesgue measure on R². Check that the law of Y conditionally on V is the uniform distribution on [0, V]. For a real-valued measurable bounded function ϕ defined on R, deduce that E[ϕ(Y) | V] = V^{−1} ∫_0^V ϕ(y) dy. 4
Exercise 9.21 (Conditional distribution and independence). Let (Y, V ) be an S × E-valued
random variable. Prove that Y and V are independent if and only if the conditional distribu-
tion of Y given V exists and is given by a kernel, say κ, such that κ(v, dy) does not depend
on v ∈ E. In this case, check that κ(v, dy) = PY (dy). 4

9.3 Discrete Markov chains


Exercise 9.22 (Markov chains built from a Markov chain-I). Let X = (Xn , n ∈ N) be a Markov
chain on a finite or countable set E with transition matrix P . Set Z = (Zn = X2n , n ∈ N).

1. Compute P(X2 = y | X0 = x) for x, y ∈ E. Prove that Z is a Markov chain and give its transition matrix.

2. Prove that any invariant probability measure for X is also invariant for Z. Prove the
converse is false in general.

4
Exercise 9.23 (Markov chains built from a Markov chain-II). Let X = (Xn , n ∈ N) be a
Markov chain on a finite or countable set E with transition matrix P . Set Y = (Yn , n ∈ N∗ )
where Yn = (Xn−1 , Xn ).

1. Prove that Y is a Markov chain on E 2 and give its transition matrix.

2. Give an example with X irreducible and Y not irreducible. If X is irreducible, change the state space of Y so that it is also irreducible.

3. Let π be an invariant probability distribution of X. Deduce an invariant probability


distribution for Y .

4
Exercise 9.24 (Labyrinth). A mouse is in the labyrinth depicted in figure 9.1 with 9 squares.
We consider the three classes of squares: A = {1, 3, 7, 9} (the corners), B = {5} (the center)
and C = {2, 4, 6, 8} the other squares. At each step n ∈ N, the mouse is in a square and we
denote by Xn its number and Yn its class.

1 2 3
4 5 6
7 8 9

Figure 9.1: Labyrinth (a 3 × 3 grid of squares, numbered 1 to 9).

1. At each step, the mouse chooses an adjacent square at random (uniformly). Prove that X = (Xn, n ∈ N) is a Markov chain and represent its transition graph. Classify the states of X.

2. Prove that Y = (Yn , n ∈ N) is a Markov chain and represent its transition graph.
Compute the invariant probability measure of Y and deduce the one of X.

4

Exercise 9.25 (Skeleton Markov chains). Let X = (Xn, n ∈ N) be a Markov chain on a countable space E with transition matrix P. We use the convention inf ∅ = +∞. We define τ1 = inf{k ≥ 1 ; Xk ≠ X0}.

1. Let x ∈ E. Give the distribution of τ1 conditionally on {X0 = x}. Check that, conditionally on {X0 = x}, τ1 = +∞ a.s. if x is an absorbing state, and otherwise a.s. τ1 is finite.

2. Conditionally on {X0 = x}, if x is not an absorbing state, give the distribution of Xτ1.

We set S0 = 0, Y0 = X0 and, by recurrence for n ≥ 1: Sn = Sn−1 + τn and, if Sn < +∞, Yn = XSn as well as τn+1 = inf{k ≥ 1 ; Xk+Sn ≠ XSn}. Let R = inf{n ; τn = +∞} = inf{n ; Sn = +∞}.

3. Prove that if X does not have absorbing states, then a.s. R = +∞.
We assume that X does not have absorbing states.
4. Prove that Y = (Yn, n ∈ N) is a Markov chain (it is called the skeleton of X). Prove that its transition matrix, Q, is given by:

Q(x, y) = (P(x, y)/(1 − P(x, x))) 1{x≠y}    for x, y ∈ E.

5. Let π be an invariant probability distribution of X. We define a measure ν on E by:

ν(x) = π(x)(1 − P(x, x)) / ∑_{y∈E} π(y)(1 − P(y, y)),    x ∈ E.

Check that ν is an invariant probability measure of Y.
4
Exercise 9.26 (Parameter estimation). Let X = (Xn , n ∈ N) be an irreducible positive re-
current Markov chain on a countable state space E with transition matrix P and invariant
probability π. The aim of this exercise is to estimate the parameters π and P of
the Markov chain X.
1. For x ∈ E and n ∈ N∗ , we set π̂(x; n) = n^{−1} Card {1 ≤ k ≤ n; Xk = x}. Prove that a.s.
for all x ∈ E, limn→+∞ π̂(x; n) = π(x).
We set Z = (Zn , n ∈ N∗ ) with Zn = (Xn−1 , Xn ).
2. Prove that Z is an irreducible Markov chain on E2 = {(x, y) ∈ E^2 ; P (x, y) > 0}, and
compute its transition matrix.


3. Compute the invariant probability distribution of Z and deduce that Z is positive
recurrent.
4. For x, y ∈ E and n ∈ N∗ , we set:

P̂ (x, y; n) = Card {1 ≤ k ≤ n; Zk = (x, y)} / Card {0 ≤ k ≤ n − 1; Xk = x},

with the convention that P̂ (x, y; n) = 0 if Card {0 ≤ k ≤ n − 1; Xk = x} = 0. Prove
that a.s. for all (x, y) ∈ E2 , limn→+∞ P̂ (x, y; n) = P (x, y).
4
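As an illustration of these estimators, here is a short Python sketch; it is not part of the exercise, and the two-state transition matrix, the sample size and the seed are assumptions chosen only to make the almost-sure convergences of Questions 1 and 4 visible.

```python
# Illustrative sketch for Exercise 9.26: estimate pi and P from one trajectory.
# The two-state chain below (assumed matrix P) is a toy example, not from the text.
import numpy as np

rng = np.random.default_rng(0)
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])   # irreducible; its invariant probability is (0.8, 0.2)

n = 200_000
X = np.empty(n + 1, dtype=int)
X[0] = 0
for k in range(n):
    X[k + 1] = rng.choice(2, p=P[X[k]])   # one step of the chain

# pi_hat(x; n) = (1/n) Card{1 <= k <= n; X_k = x}
pi_hat = np.bincount(X[1:], minlength=2) / n

# P_hat(x, y; n) = Card{1 <= k <= n; Z_k = (x, y)} / Card{0 <= k <= n-1; X_k = x}
counts = np.zeros((2, 2))
for x, y in zip(X[:-1], X[1:]):
    counts[x, y] += 1
P_hat = counts / counts.sum(axis=1, keepdims=True)

print(pi_hat)   # close to (0.8, 0.2)
print(P_hat)    # close to P
```

Both estimators converge a.s. by the ergodic behaviour of positive recurrent chains, which is exactly the content of Questions 1 and 4.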

9.4 Martingales
Exercise 9.27 (Exit time distribution). Let U be a random variable on {−1, 1} such that
P(U = 1) = 1 − P(U = −1) = p with p ∈ (0, 1). We consider the simple random walk
X = (Xn , n ∈ N) from Exercise 3.4 started at X0 = 0 and defined by Xn = Σ_{k=1}^n Uk , where
(Un , n ∈ N∗ ) are independent random variables distributed as U . Let a ∈ N∗ and consider
τa = inf{n ∈ N∗ ; |Xn | ≥ a} the exit time of (−a, a). We set ϕ(λ) = log E[e^{λU}] for λ ∈ R.
Let λ ∈ R be such that ϕ(λ) ≥ 0.

1. Prove that τa is a stopping time. Using that X is an irreducible Markov chain, prove
that a.s. τa is finite (but not bounded if a ≥ 2).
2. Prove that M^{(λ)} = (Mn^{(λ)} = e^{λXn − nϕ(λ)}; n ∈ N) is a positive martingale.

3. Using the optional stopping theorem and that ϕ(λ) ≥ 0, prove that E[Mτa^{(λ)}] = 1.

4. Assume that p = 1/2. Check that ϕ is non-negative. By considering Mτa^{(±λ)} for λ ∈ R,
prove that for all r ≥ 0:

E[e^{−rτa}] = 1 / cosh(a cosh^{−1}(e^r)).

4
Exercise 9.28 (Return time to 0). Let U be a random variable on {−1, 1} such that P(U =
1) = 1 − P(U = −1) = 1/2. We consider the simple random walk X = (Xn , n ∈ N) started at
X0 = 1 and defined by Xn = 1 + Σ_{k=1}^n Uk , where (Un , n ∈ N∗ ) are independent random
variables distributed as U . Let τ = inf{n ∈ N∗ ; Xn = 0} be the return time to 0.

1. Check that the Z-valued Markov chain X is irreducible.

2. Prove that M = (Mn = Xn∧τ ; n ∈ N) is a non-negative martingale.

3. Deduce that τ is a.s. finite and thus X is recurrent.

4. Check that E[Mτ ] ≠ E[M0 ] (thus τ is not bounded and M is not uniformly integrable).

5. Prove that N = (Nn = Xn^2 − n, n ∈ N) is a martingale.

6. Deduce that E[τ ] = +∞ and prove that X is null recurrent.

4
Exercise 9.29 (Martingale not converging in L1 ). Let (Xn , n ∈ N∗ ) be a sequence of indepen-
dent Bernoulli random variables of parameter E[Xn ] = (1 + e)^{−1} . We define M0 = 1 and for
n ∈ N∗ :

Mn = e^{−n + 2 Σ_{i=1}^n Xi} .

1. Prove that M = (Mn , n ∈ N) is a martingale and that a.s. limn→∞ Mn = 0.

2. Check that M does not converge in L1 .



4
Exercise 9.30 (Martingale not converging a.s.). Let (Zn , n ∈ N∗ ) be independent random
variables such that P(Zn = 1) = P(Zn = −1) = 1/(2n) and P(Zn = 0) = 1 − n^{−1} . We set
X1 = Z1 and for n ≥ 1:

Xn+1 = Zn+1 1_{Xn =0} + (n + 1)Xn |Zn+1 | 1_{Xn ≠0} .

1. Check that |Xn | ≤ n! and that X = (Xn , n ∈ N∗ ) is a martingale.

2. Prove directly that X converges in probability towards 0.

3. Using the Borel-Cantelli lemma, prove that P(Zn ≠ 0 infinitely often) = 1. Deduce that
P(lim_{n→∞} Xn exists) = 0. In particular, the martingale does not converge a.s. towards 0.

4
Exercise 9.31 (Wright-Fisher model). We consider a population of constant size N . We
assume that the reproduction is random: this amounts to each individual choosing its parent
independently and uniformly at random in the previous generation. The Wright-Fisher model
studies the evolution of the number of individuals carrying one of the two alleles A and a. For
n ∈ N, let Xn denote the number of alleles A at generation n in the population. We assume
that X0 = i ∈ {0, . . . , N } is given. We shall study the process X = (Xn , n ≥ 0).

1. Give the law of Xn+1 conditionally on Xn .

2. Prove that X is a martingale (specify the filtration).

3. Prove that X converges to a limit, say X∞ , and give the type of convergence.

4. Prove that M = (Mn = (N/(N − 1))^n Xn (N − Xn ), n ≥ 0) is a martingale and compute
E[X∞ (N − X∞ )].

5. Prove that one of the alleles disappears a.s. in finite time. Compute the probability that
allele A disappears.

6. Compute limn→∞ Mn and deduce that M does not converge in L1 .

4
Exercise 9.32 (Waiting time of a given sequence). Let X = (Xn , n ∈ N∗ ) be a sequence of
independent Bernoulli random variables with parameter p ∈ (0, 1): P(Xn = 1) = 1 − P(Xn =
0) = p. Let τijk = inf{n ≥ 3; (Xn−2 , Xn−1 , Xn ) = (i, j, k)} be the waiting time of the
sequence (i, j, k) ∈ {0, 1}^3 . The aim of this exercise is to compute its expectation.

1. Prove that τijk is a stopping time a.s. finite.


2. We set S0 = 0 and Sn = (Sn−1 + 1) Xn /p for n ≥ 1. Prove that (Sn − n, n ≥ 0) is a
martingale. Deduce E[τ111 ].

3. Compute P(τ110 < τ111 ).



4. Compute E[τ110 ], using the sequence (Tn , n ≥ 2) defined by T2 = X1 X2 /p^2 + X2 /p and

Tn = Tn−1 (1 − Xn )/(1 − p) + Xn−1 Xn /p^2 − Xn−1 (1 − Xn )/(p(1 − p)) + Xn /p for n ≥ 3.
5. Using similar arguments, compute E[τ100 ] and E[τ101 ].

If p = 1/2, it can be proved¹ that for any sequence (i, j, k) ∈ {0, 1}^3 , the sequence (j̄, i, j),
with j̄ = 1 − j, is more likely to appear first, that is P(τ_{j̄ij} < τijk ) > 1/2. 4

¹R. Graham, D. Knuth and O. Patashnik. Concrete mathematics: a foundation for computer science, 2nd
edition. Addison-Wesley Publishing Company, 1994. (See Section 8.4.)
Exercise 9.33 (When does an insurance company go bankrupt?). We consider the evolu-
tion of the capital of an insurance company. Let S0 = x > 0 be the initial capital, c > 0
the fixed income per year and Xn ≥ 0 the (random) cost of the damage for the year n. The
capital at the end of year n ≥ 1 is thus Sn = x + nc − Σ_{k=1}^n Xk . Bankruptcy happens if the
capital becomes negative, that is if the bankruptcy time τ = inf{k ∈ N; Sk < 0}, with the
convention inf ∅ = ∞, is finite. The goal of this exercise is to find an upper bound on the
bankruptcy probability P(τ < ∞).
We assume the real random variables (Xk , k ≥ 1) are independent, identically distributed,
a.s. non constant, and have all their exponential moments (i.e. E[e^{λX1}] < ∞ for all λ ∈ R).

1. Check that E[X1 ] > c implies P(τ < ∞) = 1, and that P(X1 > c) = 0 implies
P(τ < ∞) = 0.

We assume that E[X1 ] < c and P(X1 > c) > 0.

2. Check that if E[e^{λXk}] ≥ e^{λc} , then V = (Vn = e^{−λSn + λx} , n ≥ 0) is a non-negative
sub-martingale.

3. Let N ≥ 1. Prove that {τ ≤ N } is the disjoint union of the events Fk = {Sr ≥
0 for r < k, Sk < 0} = {τ = k} for k ∈ {1, . . . , N }. Deduce that:

E[VN 1_{τ ≤N}] ≥ Σ_{k=1}^N E[Vk 1_{τ =k}] ≥ e^{λx} P(τ ≤ N ).

(You can check we recover the maximal inequality for the positive sub-martingale.)

4. Deduce that P(τ < ∞) ≤ e^{−λ0 x} , where λ0 ∈ (0, ∞) is the unique root of E[e^{λX1}] = e^{λc} .

4
Exercise 9.34 (A.s. convergence and convergence in distribution). Let (Xn , n ≥ 1) be a
sequence of independent real random variables. We set Sn = Σ_{k=1}^n Xk for n ≥ 1. The goal
of this exercise is to prove that if the sequence (Sn , n ≥ 1) converges in distribution, then it
also converges a.s.
For t ∈ R, we set ψn (t) = E[e^{itXn}] and Mn (t) = e^{itSn} / Π_{k=1}^n ψk (t) for n ≥ 1 if Π_{k=1}^n ψk (t) ≠ 0.

1. Let t ∈ R be such that Π_{k=1}^n ψk (t) ≠ 0. Prove that (Mk (t), 1 ≤ k ≤ n) is a martingale.


We assume that (Sn , n ≥ 1) converges in distribution towards S.


2. Prove there exists ε > 0 such that for all t ∈ [−ε, ε], a.s. the sequence (e^{itSn} , n ≥ 1)
converges.

3. We recall that if there exists ε > 0 s.t., for almost all t ∈ [−ε, ε], the sequence (e^{itsn} , n ≥
1) converges, then the sequence (sn , n ≥ 1) converges. Prove that (Sn , n ≥ 1) converges
a.s. towards a random variable distributed as S.
4

9.5 Optimal stopping


Exercise 9.35 (Dice). In a game, you are allowed to roll a die at most three times. When
you roll the die, you can either stop and gain its result or, unless it is your third
roll, go on. What is the strategy which maximizes your gain, and what is the corresponding
average gain? 4

9.6 Brownian motion


Exercise 9.36 (Transformations of Brownian motion). Let B = (Bt , t ∈ R+ ) be a standard
Brownian motion.
1. Let t0 ∈ R+ . Prove that (Bt+t0 − Bt0 , t ∈ R+ ) is a Brownian motion.

2. Let λ > 0. Prove that (λ^{−1/2} B_{λt} , t ∈ R+ ) is a Brownian motion.

3. Prove that (t B_{1/t} , t ∈ (0, +∞)) is distributed as (Bt , t ∈ (0, +∞)). Deduce that a.s.
lim_{t→+∞} Bt /t = 0.
4
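A quick simulation makes the scaling property of Question 2 tangible. The sketch below is only an illustration (λ, the grid and the Monte Carlo covariance check are assumptions): it verifies empirically that λ^{−1/2} B_{λt} has the covariance s ∧ t of a Brownian motion.

```python
# Empirical check that lambda^{-1/2} B_{lambda t} has covariance min(s, t).
# All parameters below are assumptions chosen for the illustration.
import numpy as np

rng = np.random.default_rng(1)
lam, dt, n, runs = 4.0, 0.001, 1000, 20_000
# simulate B on the grid k * lam * dt via independent Gaussian increments
inc = rng.standard_normal((runs, n)) * np.sqrt(lam * dt)
B = np.cumsum(inc, axis=1)
W = B / np.sqrt(lam)            # candidate Brownian motion at times k * dt
s, t = 300, 700                 # indices of times 0.3 and 0.7
print(np.mean(W[:, s - 1] * W[:, t - 1]))   # close to min(0.3, 0.7) = 0.3
```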
Exercise 9.37 (Simulation of Brownian motion). We present a recursive algorithm due to
Lévy to simulate the Brownian motion on the interval [0, T ] with T > 0.
1. Prove that the Brownian bridge W^T is a centered Gaussian process with covariance
kernel K = (K(s, t); s, t ∈ [0, T ]) given by K(s, t) = (s ∧ t)(T − s ∨ t)/T .

2. Prove that E[Wt^T B_{T+s}] = 0 for all t ∈ [0, T ] and s ∈ R+ . Deduce that W^T is indepen-
dent of (B_{T+s} , s ∈ R+ ).
Let s ≥ r ≥ 0 be fixed. We define the process W̃ = (W̃t , t ∈ [r, s]) by:

W̃t = Bt − Br − ((t − r)/(s − r)) (Bs − Br ).
3. Prove that W̃ is a Gaussian process distributed as W^{s−r} . Deduce the variance of
W̃t for t ∈ [r, s].

4. Using that (W̃ , B) is a Gaussian process, prove that W̃ is independent of (Bu , u ∈
[0, r] ∪ [s, +∞)).

5. Let t ∈ [r, s]. Deduce that conditionally on (Bu , u ∈ [0, r] ∪ [s, +∞)), Bt is distributed
as:

√((t − r)(s − t)/(s − r)) G + ((s − t)/(s − r)) Br + ((t − r)/(s − r)) Bs ,

where G ∼ N (0, 1) is independent of (Bu , u ∈ [0, r] ∪ [s, +∞)).

6. Let n ≥ 1. Deduce a recursive algorithm to simulate a standard Brownian motion at
times 0 = t0 ≤ t1 ≤ · · · ≤ t_{2^n} = T , by first simulating the standard Brownian motion
at times 0 and T .

4
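Questions 5 and 6 translate directly into code. The sketch below is one possible implementation (the function name, the dyadic grid and the use of numpy are assumptions): it simulates B at times 0 and T first, then fills in the dyadic midpoints using the conditional Gaussian law of Question 5.

```python
# A sketch of Levy's midpoint construction (Exercise 9.37, Questions 5-6).
import numpy as np

def levy_brownian(T=1.0, n=10, rng=np.random.default_rng()):
    """Simulate B at the dyadic times k * T / 2**n, refining from B_0 and B_T."""
    m = 2 ** n
    B = np.zeros(m + 1)
    B[m] = np.sqrt(T) * rng.standard_normal()   # B_T ~ N(0, T)
    step = m
    while step > 1:
        half = step // 2
        for left in range(0, m, step):
            r, s = left * T / m, (left + step) * T / m
            t = (left + half) * T / m
            # conditionally on B_r and B_s (Question 5):
            # B_t ~ N(((s-t) B_r + (t-r) B_s)/(s-r), (t-r)(s-t)/(s-r))
            mean = ((s - t) * B[left] + (t - r) * B[left + step]) / (s - r)
            sd = np.sqrt((t - r) * (s - t) / (s - r))
            B[left + half] = mean + sd * rng.standard_normal()
        step = half
    return np.linspace(0.0, T, m + 1), B
```

Each refinement level halves the mesh, so the whole grid of 2^n + 1 points costs one Gaussian variable per new point.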
Exercise 9.38 (Ornstein-Uhlenbeck process). Let V = (Vt , t ∈ R+ ) be the solution of the
Langevin equation (6.10) with initial condition V0 . Let U be a centered Gaussian random
variable with variance σ^2 /(2a) and independent of the Brownian motion B.

1. Prove that (Vt , t ∈ R+ ) converges in distribution towards U as t goes to infinity.

2. Assume that V0 is deterministic. Prove that V is a Gaussian process and a Markov
process. Prove that Vt is distributed as N(V0 e^{−at} , σ^2 (2a)^{−1} (1 − e^{−2at})).

3. Assume that V0 = U . Prove that Vt is distributed as U for all t ∈ R+ .

4. Assume that V0 = U . Prove that V is distributed as (Zt , t ∈ R+ ) with:

Zt = (σ/√(2a)) e^{−at} B_{e^{2at}} . (9.2)

4
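The stationarity in Questions 1 and 3 can be observed numerically. The following sketch is only an illustration (the values of a, σ and the time step are assumptions): it iterates the exact Gaussian transition of Question 2 and compares the empirical variance with σ^2/(2a).

```python
# Numerical check of the stationary variance sigma^2/(2a) of the
# Ornstein-Uhlenbeck process, using the exact transition of Question 2.
import numpy as np

rng = np.random.default_rng(2)
a, sigma, dt, n = 2.0, 1.0, 0.01, 100_000
v = np.zeros(n)
coef = np.exp(-a * dt)
sd = np.sqrt(sigma**2 / (2 * a) * (1 - np.exp(-2 * a * dt)))
for k in range(n - 1):
    # V_{t+dt} | V_t ~ N(V_t e^{-a dt}, sigma^2 (2a)^{-1} (1 - e^{-2a dt}))
    v[k + 1] = coef * v[k] + sd * rng.standard_normal()

print(v[n // 2:].var(), sigma**2 / (2 * a))   # both close to 0.25
```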
Chapter 10

Solutions

10.1 Measure theory and random variables


Exercise 9.1 The family C = {(−∞, a], a ∈ R} is stable by finite intersection (this is a π-
system). All the intervals belong to the σ-field σ(C), so σ(C) contains the Borel σ-field; since
the elements of C are Borel sets, σ(C) is the Borel σ-field on R.

Exercise 9.2 Let µ and µ′ be two σ-finite measures on (Ω, F) which coincide on a collection
of events C stable by finite intersection such that Ωn ∈ C for all n ∈ N, where µ(Ωn ) < +∞
and ∪_{n∈N} Ωn = Ω. By replacing Ωn by ∪_{0≤k≤n} Ωk for n ∈ N, we can assume that the
sequence (Ωn , n ∈ N) is non-decreasing. For n ∈ N, we can define the probability measures
Pn = µ(Ωn )^{−1} µ(· ∩ Ωn ) and P′n = µ′(Ωn )^{−1} µ′(· ∩ Ωn ). Those two probability measures
coincide on Cn = {A ∩ Ωn , A ∈ C} ⊂ C, which is also stable by finite intersection, and thus,
thanks to Corollary 1.14, they coincide on σ(Cn ). As µ(Ωn ) = µ′(Ωn ), we deduce that µ and
µ′ coincide on σ(Cn ) for all n ∈ N.
Let G = {A ∈ F, A ∩ Ωn ∈ σ(Cn ) for all n ∈ N}. It is elementary to check that G is a
σ-field. Since C is stable by finite intersection, we have C ⊂ G and thus σ(C) ⊂ G. If A ∈ G,
we get A = ∪_{n∈N} (A ∩ Ωn ). As A ∩ Ωn ∈ σ(Cn ) ⊂ σ(C), we deduce that A ∈ σ(C). This
implies that G = σ(C). By monotone convergence, we get for A ∈ σ(C), that is A ∈ G:

µ(A) = lim_{n→∞} µ(A ∩ Ωn ) = lim_{n→∞} µ′(A ∩ Ωn ) = µ′(A),

where we used that µ and µ′ coincide on σ(Cn ) for the second equality. We deduce that
µ = µ′ on σ(C).
The extension of Corollary 1.15 is immediate.

Exercise 9.3 We have:

lim sup_{n→∞} (an − bn ) = lim_{n→∞} sup_{k≥n} (ak − bk ) ≤ lim_{n→∞} (sup_{k≥n} ak − inf_{k≥n} bk ) = lim sup_{n→∞} an − lim inf_{n→∞} bn .

We also have:

lim sup_{n→∞} (an − bn ) = lim_{n→∞} sup_{k≥n} (ak − bk ) ≥ lim_{n→∞} (sup_{k≥n} ak − sup_{k≥n} bk ) = lim sup_{n→∞} an − lim sup_{n→∞} bn .

If furthermore the sequence (bn , n ∈ N) converges, we have lim sup_{n→∞} bn = lim inf_{n→∞} bn ,
which allows us to conclude.


Exercise 9.4 By successive integration, we get:

I1 = ∫_{(0,1)} (∫_{(0,1)} f (x, y) dy) dx = ∫_{(0,1)} [y/(x^2 + y^2)]_{y=0}^{y=1} dx = ∫_{(0,1)} 1/(x^2 + 1) dx = π/4.

Similarly, we get I2 = ∫_{(0,1)} (∫_{(0,1)} f (x, y) dx) dy = −π/4. If f were integrable on (0, 1)^2 , we
could use Fubini's theorem and get that I1 = I2 . Since this equality does not hold, we deduce
that f is not integrable over (0, 1)^2 .

Exercise 9.5 For the case fj ≥ 0 for j ∈ J, only the last sentence of the proof of Proposition
1.62 needs to be changed: use the monotone convergence theorem to get that (1.13) holds if
the functions fj are non-negative.
For the other case, according to the above argument, we get that:

E[Π_{j∈J} fj^{εj} (Xj )] = Π_{j∈J} E[fj^{εj} (Xj )],

where εj ∈ {−, +} for j ∈ J. Those quantities being all finite, as fj (Xj ) is integrable, we
obtain (1.13) using the linearity of the expectation in L1 (P).

Exercise 9.6 The fact that XY is integrable and that Cov(X, Y ) = 0 is a consequence of
Exercise 9.5. Let X be a non-negative square-integrable random variable with non-zero
variance. Let Y = εX, with ε independent of X and such that P(ε = 1) = P(ε = −1) = 1/2.
We have E[Y ] = 0 and E[XY ] = 0 so that Cov(X, Y ) = 0. However, we have Cov(X, |Y |) =
Var(X) > 0. This implies that X and Y are not independent.

Exercise 9.7 Use that f (1A ) = f (0) + (f (1) − f (0))1A and Proposition 1.62 to deduce that if
the events (Ai , i ∈ I) are independent then the random variables (1_{Ai} , i ∈ I) are independent.
By Definition 1.31, if the random variables (Xi , i ∈ I) are independent so are the random
variables (fi (Xi ), i ∈ I) for any measurable functions (fi , i ∈ I). Take Xi = 1_{Ai} and fi (x) =
1 − x to deduce that (1_{Ai^c} , i ∈ I) are independent random variables, and thus, thanks to (1.13),
deduce that (Ai^c , i ∈ I) are also independent events.

Exercise 9.8 1. Let X1 and X2 be two independent random variables distributed as 2X − 1,
where X has the Bernoulli distribution with parameter 1/2. Set G = σ(X2 ) and C = {{X1 =
1}, {X1 X2 = 1}}. It is elementary to check that if A ∈ C then the set A is independent
of G. Notice that σ(C) = σ(X1 , X2 ), which is not independent of G = σ(X2 ).

2. Let B ⊂ F be a collection of events. Consider the collection A ⊂ F of events A which
are independent of B, that is P(A ∩ B) = P(A)P(B) for all B ∈ B. It is easy to
check that this collection is a monotone class. It contains C, which is stable by finite
intersection. By the monotone class theorem, we get that σ(C) ⊂ A, and thus σ(C) is
independent of B. Taking B = G gives the result.

3. Take B = C′ in the proof of the previous question to deduce that σ(C) and C′ are
independent, and then use the previous question to deduce that σ(C) and σ(C′) are
independent.

10.2 Conditional expectation


Exercise 9.9 We have σ(1B ) = {∅, Ω, B, B c }. We get using the characterization (2.1) of the
conditional expectation that:

E [1A |1B ] = P(A|B)1B + P(A|B c )1B c .

Exercise 9.10 Since (X1 , X2 ) has the same distribution as (X2 , X1 ), we deduce that (X1 , S2 )
has the same distribution as (X2 , S2 ). This implies that E[X1 | S2 ] = E[X2 | S2 ]. By linearity,
we have:
E[X1 | S2 ] + E[X2 | S2 ] = E[S2 | S2 ] = S2 .
We deduce that E[X1 | S2 ] = S2 /2. Similarly, we get E[X1 |Sn ] = Sn /n.

Exercise 9.11 Notice that (X, X 2 ) and (−X, X 2 ) have the same distribution. We deduce
that E[X| X 2 ] = E[−X| X 2 ] = −E[X| X 2 ]. This implies that E[X| X 2 ] = 0.

Exercise 9.12 Let h be measurable bounded. We want to write

E[X h(|X|)] = E[g(|X|) h(|X|)] (10.1)

for some measurable function g, so that thanks to (2.2) (with h such that h(|X|) = 1A for
A ∈ σ(|X|)), we will deduce that a.s. g(|X|) = E[X| |X|]. On one hand, we have:

E[X h(|X|)] = ∫_R x h(|x|) f (x) dx = ∫_{R+} x h(|x|) f (x) dx + ∫_{R−} x h(|x|) f (x) dx
            = ∫_{R+} |x| h(|x|) (f (x) − f (−x)) dx,

and on the other hand we have:

E[g(|X|) h(|X|)] = ∫_R g(|x|) h(|x|) f (x) dx = ∫_{R+} g(x) h(x) (f (x) + f (−x)) dx.

We deduce that:

g(x) = |x| (f (x) − f (−x)) / (f (x) + f (−x))

satisfies (10.1) and thus a.s.

E[X| |X|] = |X| (f (X) − f (−X)) / (f (X) + f (−X)).

Notice that σ(|X|) = σ(X^2 ) (as |x| = √(x^2) and x^2 = |x|^2 ), so that E[X| X^2 ] = E[X| |X|].

Exercise 9.13 By Jensen's inequality, we have E[X| H]^2 ≤ E[X^2 | H]. Since E[E[X^2 | H]] =
E[X^2 ] < +∞, we deduce that E[X| H]^2 is integrable. Using Jensen's inequality, we get that:

Var(E[X| H]) = E[E[X| H]^2 ] − E[E[X| H]]^2 ≤ E[E[X^2 | H]] − E[X]^2 = Var(X).

Exercise 9.14 Set ϕ(x) = E[|x − Y |] for x ∈ R. Using Jensen’s inequality, we get that
ϕ(x) = E[|x − Y |] ≥ |x − E[Y ]| = |x| for all x ∈ R. As X and Y are independent, we also
have that E[|X − Y |] = E[ϕ(X)]. This gives the result: E[|X − Y |] ≥ E[|X|].

Exercise 9.15

Exercise 9.16 Let ϕ be a positive strictly convex function on R such that limx→+∞ ϕ(x)/x
and limx→−∞ ϕ(x)/x are finite. This implies in particular that ϕ(X) and ϕ(Y ) are inte-
grable. We deduce from Jensen's inequality that E[ϕ(X)| Y ] ≥ ϕ(E[X| Y ]) = ϕ(Y ) and thus
E[ϕ(X)] ≥ E[ϕ(Y )]. By symmetry, we get E[ϕ(X)] = E[ϕ(Y )], and thus Jensen's inequality
is an equality: a.s. E[ϕ(X)| Y ] = ϕ(E[X| Y ]). Since ϕ is strictly convex, this equality implies
that a.s. X = E[X| Y ] = Y .

Exercise 9.17 Let A ∈ H and consider the random variable X = (V, 1A ), which takes values in
E × {0, 1}. Since Y and X are independent, we deduce from Lemma 1.56 that for measurable
sets B ∈ S and C ∈ E:

P_{(Y,X)} (B × C) = P(Y ∈ B, X ∈ C) = P(Y ∈ B)P(X ∈ C) = PY (B)PX (C).

We deduce from Fubini's theorem that P_{(Y,X)} (dy, dx) = PY (dy)PX (dx), and, with x =
(x1 , x2 ) ∈ E × {0, 1} and f (y, x) = ϕ(y, x1 )x2 , from Equation (1.6) that:

E[ϕ(Y, V )1A ] = ∫ f (y, x) P_{(Y,X)} (dy, dx) = ∫ (∫ f (y, x) PY (dy)) PX (dx).

Set g(v) = E[ϕ(Y, v)] = ∫ ϕ(y, v) PY (dy) so that:

E[ϕ(Y, V )1A ] = ∫ g(x1 )x2 PX (dx) = E[g(V )1A ].

This directly implies that a.s. E[ϕ(Y, V )| H] = g(V ).

Exercise 9.18 1. We have:


E [1A E [1B | A] | H] = E [E [1A 1B | A] | H]
= P(A ∩ B| H)
= P(A| H)P(B| H)
= E [1A E [1B | H] | H] ,
where we used that A ∈ A for the first equality, that A and B are independent condi-
tionally on H for the third, and that E [1B | H] is H-measurable for the last.
2. We have for any A ∈ A:
E [1A E [1B | A]] = E [E [1A E [1B | A] | H]]
= E [E [1A E [1B | H] | H]]
= E [1A E [1B | H]] ,
where we used the first question for the second equality. Since H ⊂ A, we deduce that
E [1B | H] is A-measurable, and by uniqueness of the conditional expectation, we get
that a.s. E [1B | A] = E [1B | H].

Exercise 9.19 1. We have E[|Xn |] = 1/n, which implies that the sequence X = (Xn , n ∈
N∗ ) converges to 0 in L1 . Notice that P(Xn > 1/n) = P(Bn = B′n = 1) = n^{−2} and thus
the non-negative random variable Σ_{n∈N∗} 1_{Xn >1/n} is integrable and thus finite. Since
the terms in the sum are either 1 or 0, we deduce that Xn ≤ 1/n for n large enough,
that is X converges a.s. to 0.

2. Set Y = (Yn = E[Xn | H], n ∈ N∗ ). As Yn is non-negative, we get E[|Yn |] = E[Yn ] =
E[Xn ], and thus Y converges to 0 in L1 .
We also have Yn = B′n ∈ {0, 1}. The events ({Yn = 1}, n ∈ N∗ ) are independent and
Σ_{n∈N∗} P(Yn = 1) = Σ_{n∈N∗} n^{−1} = +∞. We deduce from the Borel-Cantelli lemma that
a.s. lim sup_{n→∞} Yn = 1, and thus the sequence Y does not converge a.s. to 0.

Exercise 9.20 The computations are elementary, see Sections 2.3.2 and 2.3.3. The density
of the probability distribution of V is fV (v) = ∫ f_{Y,V} (y, v) dy = λ e^{−λv} 1_{v>0} . We deduce
that for v > 0, f_{Y|V} (y|v) = v^{−1} 1_{0<y<v} , which is the density of the uniform distribution on
[0, v]. The last formula is then clear.

Exercise 9.21 We first assume that Y and V are independent. We deduce from Exercise 9.17
with H = σ(V ) that, for all non-negative measurable functions ϕ, we have E[ϕ(Y, V )| H] = g(V )
with g(v) = E[ϕ(Y, v)] = ∫ ϕ(y, v) PY (dy). We deduce from Definition 2.17 that P(Y ∈
A|V ) = PY (A), and thus the conditional distribution of Y given V exists and is given by the
kernel κ(v, dy) = PY (dy), which does not depend on v ∈ E.
We now assume that the conditional distribution of Y given V exists and is given by a
kernel which does not depend on v ∈ E, say κ(dy). By Definition 2.17, we have P(Y ∈ A|V ) =
κ(V, A) = κ(A), and taking the expectation, we get P(Y ∈ A) = κ(A), that is κ = PY . We
also get P (Y ∈ A, V ∈ B) = E[1B (V )P(Y ∈ A| V )] = PY (A)E[1B (V )] = P(Y ∈ A)P(V ∈ B),
which means that Y and V are independent.

10.3 Discrete Markov chains


Exercise 9.22 1. We have P(X2 = y|X0 = x) = P^2 (x, y) as:

P(X2 = y|X0 = x) = Σ_{z∈E} P(X2 = y, X1 = z|X0 = x)
                 = Σ_{z∈E} P(X2 = y|X1 = z, X0 = x) P(X1 = z|X0 = x)
                 = Σ_{z∈E} P(X2 = y|X1 = z) P(X1 = z|X0 = x)
                 = Σ_{z∈E} P (x, z)P (z, y)
                 = P^2 (x, y),

where we used the Markov property for the third equality.



Assume that X is a stochastic dynamical system: Xn+1 = f (Xn , Un+1 ) for some measur-
able function f and (Un , n ∈ N∗ ) independent identically distributed random variables
independent of X0 . Then, we have:
Zn+1 = X2n+2 = f (f (X2n , U2n+1 ), U2n+2 ) = g(Zn , Vn+1 ),
with Vn+1 = (U2n+1 , U2n+2 ) and g(x, (v1 , v2 )) = f (f (x, v1 ), v2 ). Since the random
variables (Vn , n ∈ N∗ ) are independent, identically distributed and independent of Z0 =
X0 , we deduce that Z is a stochastic dynamical system and thus a Markov chain.
In general, X is distributed as a stochastic dynamical system, say X̃. The process Z
is a functional of X, say Z = F (X). We deduce that Z is distributed as Z̃ = F (X̃),
which is a stochastic dynamical system, according to the previous argument. Hence, Z
is a Markov chain.
Notice that Z has transition matrix Q = P^2 .
2. On one hand, if π is an invariant probability measure for P , then we have πP^2 =
(πP )P = πP = π. Hence it is also an invariant probability measure for Q.
On the other hand, for the state space E = {a, b} (with a ≠ b), consider:

P = ( 0 1 ; 1 0 ),

whose unique invariant probability measure is π = (1/2, 1/2)^t . As Q = P^2 is the
identity matrix, we get that any probability measure is invariant for Q.
Exercise 9.23 1. Assume that X is a stochastic dynamical system: Xn+1 = f (Xn , Un+1 )
for some measurable function f and (Un , n ∈ N∗ ) independent identically distributed
random variables independent of X0 . Then, we have:
Yn+1 = (Xn , Xn+1 ) = (Xn , f (Xn , Un+1 )) = g(Yn , Un+1 ),
with g((y1 , y2 ), u) = (y2 , f (y2 , u)). Since the random variables (Un , n ≥ 2) are independent,
identically distributed and independent of Y1 = (X0 , X1 ), we deduce that Y is a stochastic
dynamical system and thus a Markov chain.
In general, X is distributed as a stochastic dynamical system, say X̃. The process Y
is a functional of X, say Y = F (X). We deduce that Y is distributed as Ỹ = F (X̃),
which is a stochastic dynamical system, according to the previous argument. Hence, Y
is a Markov chain.
The transition matrix Q of Y on E^2 is given by:

Q((x1 , x2 ), (y1 , y2 )) = 1_{x2 =y1} P (y1 , y2 ).

2. Take E = {0, 1} and P (1, 1) = 1 so that 1 is an absorbing state for X. Then
Q(x, (1, 0)) = 0 for all x ∈ E^2 . Thus Y is not irreducible.
Assume that X is irreducible. Notice that Y takes values in Ẽ = {(x, y) ∈ E^2 , P (x, y) >
0}. Let x = (x1 , x2 ), y = (y1 , y2 ) ∈ Ẽ. Since X is irreducible, there exist n ≥ 3 and
x3 , . . . , xn = y1 ∈ E such that Π_{i=3}^n P (xi−1 , xi ) > 0. This clearly implies that
Π_{i=3}^{n+1} Q(zi−1 , zi ) > 0, with zi = (xi−1 , xi ), z2 = x and zn+1 = y (notice we used that
Q(zn , zn+1 ) = Q(zn , y) > 0 as y ∈ Ẽ). Thus Y is irreducible (on Ẽ).

3. It is easy to check that ν = (ν(z), z ∈ E^2 ), with ν(z) = π(x)P (x, y) for z = (x, y) ∈ E^2 ,
is an invariant measure for Y . Indeed, we have for z = (v, w) ∈ E^2 :

νQ(z) = Σ_{x,y∈E} ν((x, y)) Q((x, y), (v, w))
      = Σ_{x,y∈E} π(x)P (x, y) 1_{y=v} P (v, w) = π(v)P (v, w) = ν(z).

This gives that νQ = ν, that is ν is invariant for Q.

Exercise 9.24 1. Since each new step is chosen uniformly at random among the available
neighbors, the next position depends on the past only through the current position. This
implies that X is a Markov chain. Clearly it is irreducible (so all the states belong
to the same closed class). Since the state space is finite, the Markov chain is positive
recurrent (so all the states are positive recurrent).

2. For all n ∈ N, set Fn = σ(X0 , . . . , Xn ). We shall check that Y is a Markov chain with
respect to the filtration F = (Fn , n ∈ N). We first compute P(Yn+1 = A| Fn ). Let P be
the transition matrix of the Markov chain X. We have:

P(Yn+1 = A| Fn ) = Σ_{y∈A} P(Xn+1 = y| Fn )
                = Σ_{y∈A} P(Xn+1 = y| Xn )
                = Σ_{y∈A} P (Xn , y)
                = Σ_{y∈A} Σ_{x∈C} P (x, y) 1_{Xn =x}
                = (2/3) Σ_{x∈C} 1_{Xn =x}
                = (2/3) 1_{Yn =C} ,

where we used that X is a Markov chain for the second equality, that P (x, y) = 0
for x ∉ C and y ∈ A for the fourth, and that Σ_{y∈A} P (x, y) = 2/3 for all x ∈ C
for the fifth. Since the last right hand-side term is σ(Yn )-measurable, this implies
that P(Yn+1 = A| Fn ) = P(Yn+1 = A| Yn ). Similarly, we obtain P(Yn+1 = B| Fn ) =
(1/3) 1_{Yn =C} = P(Yn+1 = B| Yn ) and P(Yn+1 = C| Fn ) = 1_{Yn ≠C} = P(Yn+1 = C| Yn ).
Since P(Yn+1 = • | Fn ) = P(Yn+1 = • | Yn ), we deduce that Y is a Markov chain. Its
transition matrix (on {A, B, C}) is given by:

Q = ( 0 0 1 ; 0 0 1 ; 2/3 1/3 0 ).

The invariant probability for Q is given by πQ = (1/3, 1/6, 1/2). When Yn is in a
given state D ∈ {A, B, C}, then Xn can be in any state x ∈ D, so intuitively the invariant
probability for P could be given by πP (x) = 1/12 for x ∈ A, πP (x) = 1/6 for x ∈ B
and πP (x) = 1/8 for x ∈ C. It is indeed elementary to check that πP P = πP . Since
the Markov chain X is irreducible on a finite state space, the invariant probability exists and
is unique, and thus is given by πP . (One can check that the invariant probability for a
uniform random walk on a general finite graph (i.e. the next state is chosen uniformly
at random among the closest neighbors) is proportional to the degree of the nodes:
π(x) = deg(x)/Σ_y deg(y) for all nodes x of the finite graph.)
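The invariant probability can also be checked numerically. The sketch below is an illustration (the numbering of the squares from 0 to 8 and the use of numpy are assumptions): it builds the transition matrix of X and extracts the left eigenvector associated with the eigenvalue 1.

```python
# Check that the invariant probability of the labyrinth chain is proportional
# to the degrees: 1/12 on corners, 1/6 on the center, 1/8 on the other squares.
import numpy as np

adj = {0: [1, 3], 1: [0, 2, 4], 2: [1, 5],
       3: [0, 4, 6], 4: [1, 3, 5, 7], 5: [2, 4, 8],
       6: [3, 7], 7: [4, 6, 8], 8: [5, 7]}
P = np.zeros((9, 9))
for x, nbrs in adj.items():
    P[x, nbrs] = 1 / len(nbrs)       # uniform move to an adjacent square

# the invariant probability is the left eigenvector of P for the eigenvalue 1
w, V = np.linalg.eig(P.T)
pi = np.real(V[:, np.argmax(np.real(w))])
pi /= pi.sum()
print(pi)   # [1/12, 1/8, 1/12, 1/8, 1/6, 1/8, 1/12, 1/8, 1/12]
```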

Exercise 9.25

Exercise 9.26

10.4 Martingales

Exercise 9.27 1. We have {τa > n} = ∩_{k=1}^n {|Xk | < a}, so {τa ≤ n} ∈ Fn . This implies
that τa is a stopping time. Since X is an irreducible Markov chain, it is either transient, and
then the time spent in (−a, a) is finite, or recurrent, and then the number of visits to a is
infinite. In both cases, X leaves (−a, a) in finite time, that is, τa is a.s. finite. The event
∩_{k=1}^n {U2k−1 = 1, U2k = −1} has positive probability, and, if a ≥ 2, we have on this
event that τa > 2n. Therefore, τa is not bounded.

2. M^{(λ)} is clearly adapted. Since |Xn | ≤ n, we deduce that for fixed n, Mn^{(λ)} is bounded,
thus integrable. We have, using that Un+1 is independent of Fn :

E[Mn+1^{(λ)} | Fn ] = e^{λXn −(n+1)ϕ(λ)} E[e^{λUn+1} | Fn ]
                = e^{λXn −(n+1)ϕ(λ)} E[e^{λUn+1}]
                = e^{λXn −nϕ(λ)} = Mn^{(λ)} .

This implies that M^{(λ)} is a martingale.


3. Using the optional stopping theorem, we get that E[Mτa∧n^{(λ)}] = E[M0^{(λ)}] = 1 for all n ∈ N.
We have that limn→∞ τa ∧ n = τa and, as τa is finite a.s., that a.s. limn→∞ Xτa∧n = Xτa .
This implies that a.s. limn→∞ Mτa∧n^{(λ)} = Mτa^{(λ)} .
Since ϕ(λ) ≥ 0 and |Xτa∧n | ≤ a, we deduce that 0 ≤ Mτa∧n^{(λ)} ≤ e^{|λ|a} . By dominated
convergence we get that:

E[Mτa^{(λ)}] = 1.

4. We have ϕ(λ) = log(cosh(λ)). Since cosh(λ) ≥ 1, we deduce that ϕ ≥ 0. As τa is finite,
we obtain that Xτa ∈ {−a, a}. We have, considering first λ and then −λ, and with
r = ϕ(λ) = ϕ(−λ):

e^{λa} E[e^{−rτa} 1_{Xτa =a}] + e^{−λa} E[e^{−rτa} 1_{Xτa =−a}] = 1,
e^{−λa} E[e^{−rτa} 1_{Xτa =a}] + e^{λa} E[e^{−rτa} 1_{Xτa =−a}] = 1.

We deduce that:

E[e^{−rτa} 1_{Xτa =a}] = sinh(λa)/sinh(2λa) and E[e^{−rτa} 1_{Xτa =−a}] = sinh(λa)/sinh(2λa).

Then use that 2 sinh(x) cosh(x) = sinh(2x) to deduce that:

E[e^{−rτa}] = 2 sinh(λa)/sinh(2λa) = 1/cosh(λa),

and conclude using that λ = cosh^{−1}(e^r ).
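The closed formula can be confirmed by simulation. The sketch below is an illustration (a, r and the number of runs are assumptions): it estimates E[e^{−rτa}] by Monte Carlo and compares it with 1/cosh(a cosh^{−1}(e^r)).

```python
# Monte Carlo check of E[exp(-r tau_a)] = 1/cosh(a * arccosh(exp(r)))
# for the symmetric walk (p = 1/2); the parameters are assumptions.
import numpy as np

rng = np.random.default_rng(3)
a, r, runs = 3, 0.05, 50_000
est = 0.0
for _ in range(runs):
    x, n = 0, 0
    while abs(x) < a:               # walk until it exits (-a, a)
        x += rng.choice((-1, 1))
        n += 1
    est += np.exp(-r * n)

print(est / runs, 1 / np.cosh(a * np.arccosh(np.exp(r))))
```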

Exercise 9.28 1. We have, for y ≠ x and r = |y − x|, that P(Xr = y|X0 = x) = 2^{−r} > 0.
Hence X is irreducible.

2. We consider the natural filtration F = (Fn , n ∈ N) where Fn = σ(U1 , . . . , Un ) =
σ(X0 , . . . , Xn ). Recall the return time τ is a stopping time. Since X is a martingale,
we get that M is the martingale X stopped at the stopping time τ .

3. Since M is a non-negative martingale, it a.s. converges. On the event {τ = +∞},


we have |Mn+1 − Mn | = |Un+1 | = 1. Thus on {τ = +∞} the martingale M is not
converging. We deduce that a.s. τ is finite. This implies that X is recurrent.

4. Since τ is finite a.s., we get Xτ = 0. In particular 0 = E[Xτ ] = E[M∞ ] < E[M0 ] =
E[X0 ] = 1. Thus, the optional stopping theorem does not apply; this implies
that τ is not bounded. This also implies that M is not a closed martingale, that is, M
is not uniformly integrable.

5. Notice that N is F-adapted and integrable (as for all n ∈ N, |Xn | ≤ n + 1 and thus
|Nn | ≤ (n + 1)^2 + n). We have:

E[Xn+1^2 | Fn ] = E[(Xn + Un+1 )^2 | Fn ] = Xn^2 + E[Un+1^2 | Fn ] + 2Xn E[Un+1 | Fn ]
             = Xn^2 + E[Un+1^2 ] + 2Xn E[Un+1 ]
             = Xn^2 + 1.

This implies that N is a martingale.

6. By the optional stopping theorem, we have that E[Nτ∧n ] = E[N0 ] = 1. This gives E[Mn^2 ] =
E[Xτ∧n^2 ] = 1 + E[τ ∧ n] ≤ 1 + E[τ ]. Since M is not uniformly integrable, it is not bounded
in L2 . Thus we have supn∈N E[Mn^2 ] = +∞. This implies that E[τ ] = +∞. Let
T be the first return time to 0 of the walk started at 0. By decomposing this walk with
respect to its first step, and considering only the case where it first goes to 1, we get
T ≥ 1_{U1 =1} (1 + τ′), where τ′ is distributed as τ and independent of U1 . Therefore we
have E[T ] ≥ (1 + E[τ ])/2 = +∞. This implies that X is null recurrent.

Exercise 9.29 1. Let F = (Fn , n ∈ N) be the natural filtration associated to M , so that M
is F-adapted. Notice that 0 ≤ Mn ≤ e^n , so that Mn is integrable. We also have that:

E[Mn+1 |Fn ] = Mn E[e^{2Xn+1 −1}] = Mn ,

where we used that Xn+1 is independent of Fn for the first equality. Thus M is a
martingale.
By the strong law of large numbers, we have that a.s. limn→∞ n^{−1} Σ_{k=1}^n Xk = (1 + e)^{−1}
and thus a.s. limn→∞ (−n + 2 Σ_{k=1}^n Xk ) = −∞ and a.s. limn→∞ Mn = 0.

2. Since limn→∞ E[Mn ] = 1 > 0 = E[limn→∞ Mn ], this implies that M does not converge
in L1 .

Exercise 9.30 1. An elementary induction gives that |Xn | ≤ n! and thus Xn is integrable.
The process X is adapted with respect to the natural filtration of the process (Zn , n ∈
N∗ ). We have:

E[Xn+1 |Fn ] = E[Zn+1 ] 1_{Xn =0} + (n + 1)Xn E[|Zn+1 |] 1_{Xn ≠0} = Xn 1_{Xn ≠0} = Xn ,

where we used that Zn+1 is independent of Fn for the first equality. We deduce that
X is a martingale.

2. We have P(Xn ≠ 0) = P(Zn ≠ 0) = n^{−1} . This implies that X converges in probability
towards 0.

3. Set An = {Zn ≠ 0}. The events (An , n ∈ N∗ ) are independent and Σ_{n∈N∗} P(An ) = +∞.
The Borel-Cantelli lemma implies that the set of ω ∈ Ω such that Card {n ∈ N∗ , ω ∈
An } = ∞ has probability 1, that is P(Zn ≠ 0 infinitely often) = 1. Since {Xn ≠
0} = {Zn ≠ 0} and Xn belongs to Z, we deduce that P(|Xn | ≥ 1 infinitely often) = 1.
Since Xn converges in probability towards 0, we deduce that P(limn→∞ Xn exists) =
P(limn→∞ Xn = 0). But this latter quantity is 0 as P(|Xn | ≥ 1 infinitely often) = 1.

Exercise 9.31 1. Let F = (Fn , n ∈ N) be the natural filtration of the process X. We have
that, conditionally on Fn , Xn+1 has a binomial distribution with parameter (N, Xn /N ).
Notice this proves that X is a homogeneous Markov chain.

2. By definition X is F-adapted. We have that Xn ∈ [0, N ], thus Xn is integrable. We
have, using that a binomial (N, p) random variable has mean N p, that:

E[Xn+1 | Fn ] = E[Xn+1 | Xn ] = N (Xn /N ) = Xn .

Thus X is a martingale.

3. Since X is a non-negative bounded martingale, it converges a.s. and in Lp for any p ≥ 1
towards a limit, say X∞ .

4. As Mn is a measurable function of Xn , we deduce that M is F-adapted. For n ∈ N
fixed, since Xn ∈ [0, N ], we deduce that Mn is bounded and thus integrable. Let Y be a
binomial (N, p) random variable; then we have E[Y ] = N p and E[Y^2 ] = N p(1−p)+N^2 p^2 ,
and thus:

E[Y (N − Y )] = N^2 p − N p(1 − p) − N^2 p^2 = N (N − 1)p(1 − p).

We deduce that:

E[Mn+1 | Fn ] = (N/(N − 1))^{n+1} E[Xn+1 (N − Xn+1 )| Xn ]
             = (N/(N − 1))^{n+1} N (N − 1) (Xn /N )(1 − Xn /N )
             = Mn .

Thus M is a martingale. By dominated convergence, we have:

E[X∞ (N − X∞ )] = lim_{n→∞} E[Xn (N − Xn )] = lim_{n→∞} ((N − 1)/N )^n E[Mn ] = 0.

5. Since E[X∞ (N − X∞ )] = 0, we deduce that a.s. X∞ ∈ {0, N }, that is, a.s. one allele
disappears. Since the sequence (Xn , n ∈ N) is N-valued and a.s. converging, it is a.s.
constant for n large enough, i.e. for n ≥ N0 , with N0 a finite random variable. In
particular, one of the alleles has disappeared at time N0 . We have X0 = E[X∞ ] =
N P(X∞ = N ), which gives P(A disappears) = (N − X0 )/N .
6. Since Xn (N − Xn ) = 0 for n ≥ N0 , we deduce that Mn = 0 for n ≥ N0 . This implies
that a.s. M∞ = 0. Since E[M∞ ] < M0 (unless X0 ∈ {0, N }), we deduce that M does
not converge in L1 .
Notice that 0 and N are absorbing states of the Markov chain X, and that {1, . . . , N − 1}
are transient states.
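The fixation probability (N − X0 )/N is easy to confirm by simulation. The sketch below is an illustration (N, X0 and the number of runs are assumptions): each run iterates the binomial transition until absorption and records whether allele A was lost.

```python
# Monte Carlo check of the fixation result of Exercise 9.31:
# allele A disappears with probability (N - X_0)/N.
import numpy as np

rng = np.random.default_rng(4)
N, x0, runs = 20, 5, 10_000
lost = 0
for _ in range(runs):
    x = x0
    while 0 < x < N:
        x = rng.binomial(N, x / N)   # X_{n+1} | X_n is binomial(N, X_n/N)
    lost += (x == 0)

print(lost / runs, (N - x0) / N)     # both close to 0.75
```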

Exercise 9.32 1. Set Yn = (Xn−2 , Xn−1 , Xn ) for n ≥ 3. Since Y = (Yn , n ≥ 3) is a
stochastic dynamical system, it is a Markov chain on the finite state space {0, 1}^3 .
Since p ∈ (0, 1), Y is irreducible. An irreducible Markov chain on a finite state space
is positive recurrent, and thus visits all the states infinitely often. In particular the
hitting time τijk is a.s. finite.
2. Let F = (Fn , n ∈ N) be the natural filtration of X, with F0 = {∅, Ω} the trivial σ-field.
The process M = (Mn = Sn − n, n ∈ N) is F-adapted. We have 0 ≤ Sn ≤ Σ_{k=1}^n p^{−k} .
This implies that Sn and Mn are integrable. We have:

E[Sn+1 | Fn ] = (Sn + 1) p^{−1} E[Xn+1 | Fn ] = Sn + 1,

where we used that Sn is Fn -measurable for the first equality, and that Xn+1 is inde-
pendent of Fn for the second. We deduce that M is a martingale.
Write τ for τ111 . Using the optional stopping theorem, we get that E[Mn∧τ ] = E[M0 ] =
0. This implies that for all n ∈ N:

E[n ∧ τ ] = E[Sn∧τ ].

By monotone convergence, we get that limn→∞ E[n ∧ τ ] = E[τ ]. Since τ is finite a.s.,
we have that a.s. limn→∞ Sn∧τ = Sτ . It is clear from the dynamics of S that:

0 ≤ Sn∧τ ≤ Sτ = 1/p + 1/p^2 + 1/p^3 .

Thus, by dominated convergence, we deduce that limn→∞ E[Sn∧τ ] = E[Sτ ]. This gives:

E[τ ] = E[Sτ ] = 1/p + 1/p^2 + 1/p^3 .

3. Using the strong Markov property at time τ11 for the Markov chain X, we deduce that
(Xτ11 +n , n ∈ N∗ ) is independent of (Xk , 1 ≤ k ≤ τ11 ) and distributed as X. We deduce
that:
P(τ111 > τ110 ) = P(Xτ11 +1 = 0) = P(X1 = 0) = 1 − p.

4. Arguing as in Question 2, we get that M = (Mn = Tn − n, n ≥ 2) is a martingale,
with E[M2 ] = 0; that for n < τ110 : Tn = 0 if Xn = 0, Tn = p^{−1} if (Xn−1 , Xn ) = (0, 1),
Tn = p^{−1} + p^{−2} if (Xn−1 , Xn ) = (1, 1); and Tτ110 = 1/(p^2 (1 − p)); and then that:

E[τ110 ] = E[Tτ110 ] = 1/(p^2 (1 − p)) = 1/p + 1/p^2 + 1/(1 − p).

5. Consider U1 = X1 /p and Un = Un−1 (1 − Xn )/(1 − p) + Xn /p for n ≥ 2, to get:

E[τ100 ] = E[Uτ100 ] = 1/(p(1 − p)^2).

Consider V2 = X1 (1 − X2 )/(p(1 − p)) + X2 /p and
Vn = Vn−1 Xn /p + Xn−1 (1 − Xn )/(p(1 − p)) − Xn−1 Xn /p^2 + Xn /p for n ≥ 3, to get:

E[τ101 ] = E[Vτ101 ] = 1/p + 1/(p^2 (1 − p)).
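These expectations can be confirmed by Monte Carlo; for p = 1/2 one gets E[τ111 ] = 14, E[τ110 ] = 8, E[τ100 ] = 8 and E[τ101 ] = 10. The sketch below is an illustration (the bit-mask encoding of the last three outcomes and the parameters are assumptions); it checks the first value.

```python
# Monte Carlo check of E[tau_111] = 1/p + 1/p^2 + 1/p^3 (= 14 for p = 1/2).
import numpy as np

rng = np.random.default_rng(5)
p, runs = 0.5, 100_000
total = 0
for _ in range(runs):
    last3, n = 0, 0                  # the last three outcomes, as a bit mask
    while last3 != 0b111:
        n += 1
        last3 = ((last3 << 1) | (rng.random() < p)) & 0b111
    total += n

print(total / runs, 1/p + 1/p**2 + 1/p**3)   # both close to 14
```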

Exercise 9.33 1. Assume that E[X1 ] > c. By the strong law of large numbers, we get that
a.s. limn→∞ Sn /n = c − E[X1 ] < 0. This implies that a.s. limn→∞ Sn = −∞ and thus
τ is a.s. finite.
If P(X1 > c) = 0, then a.s. Xk ≤ c and then Sn ≥ x a.s. for all n ∈ N. This implies
that a.s. τ is infinite.

2. The process V is adapted to the natural filtration F = (Fn , n ∈ N) of the process
(Xn , n ∈ N∗ ). We have, by independence, that E[Vn ] = Π_{k=1}^n E[e^{λXk −λc}] < +∞. We
also get for n ∈ N:

E[Vn+1 | Fn ] = Vn E[e^{λXn+1 −λc} | Fn ] = Vn E[e^{λXn+1 −λc}] ≥ Vn .

This gives that V is a non-negative sub-martingale.



3. We have:

E[VN 1_{τ ≤N}] = Σ_{k=1}^N E[VN 1_{τ =k}]
              = Σ_{k=1}^N E[E[VN 1_{τ =k} | Fk ]]
              = Σ_{k=1}^N E[1_{τ =k} E[VN | Fk ]]
              ≥ Σ_{k=1}^N E[Vk 1_{τ =k}]
              ≥ e^{λx} P(τ ≤ N ),

where we used that V is a sub-martingale for the first inequality, and that Vk ≥ e^{λx} on
{τ = k} for the second inequality.

4. The function ϕ defined on R+ by ϕ(λ) = E[e^{λ(X1 −c)}] belongs to C^∞ (R+ ) (use domi-
nated convergence to prove the continuity and Fubini to prove recursively that ϕ^{(n)} is
differentiable). Since ϕ′′(λ) = E[(X1 −c)^2 e^{λ(X1 −c)}] > 0, we deduce that ϕ is strictly
convex. We have ϕ(0) = 1 and ϕ′(0) = E[X1 − c] < 0. As P(X1 > c) > 0, there exists
a > c such that p = P(X1 ≥ a) > 0. We deduce that ϕ(λ) ≥ E[1_{X1 ≥a} e^{λ(X1 −c)}] ≥
p e^{λ(a−c)} , so that limλ→∞ ϕ(λ) = +∞. Thus, there exists a unique root λ0 of ϕ(λ) = 1
on (0, +∞). Taking λ = λ0 in Question 2, we deduce that V is a martingale. Then,
using Question 3 for the inequality, we get that:

1 = E[V0 ] = E[VN ] ≥ e^{λ0 x} P(τ ≤ N ).

This gives P(τ ≤ N ) ≤ e^{−λ0 x} . Then let N go to infinity to get the result.
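For a concrete claim distribution, λ0 is easy to compute numerically. In the sketch below (an illustration, not from the text), X1 = 2 · Bernoulli(q) with c = 1, for which the root has the closed form λ0 = log((1 − q)/q); scipy's brentq recovers it.

```python
# Compute the exponent lambda_0 solving E[exp(lambda (X_1 - c))] = 1 for the
# assumed claim law X_1 = 2 * Bernoulli(q); then P(tau < inf) <= exp(-lambda_0 x).
import numpy as np
from scipy.optimize import brentq

q, c = 0.3, 1.0          # E[X_1] = 0.6 < c and P(X_1 > c) = q > 0

def phi_minus_one(lam):
    # E[exp(lam (X_1 - c))] - 1, with X_1 taking the values 0 and 2
    return (1 - q) * np.exp(-lam) + q * np.exp(lam) - 1.0

lam0 = brentq(phi_minus_one, 1e-6, 50.0)
print(lam0, np.log((1 - q) / q))   # both close to 0.847
```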

Exercise 9.34

10.5 Optimal stopping


Exercise 9.35 Using the optimal equations, see (5.4) and Proposition 5.6, we get that at the
first roll of the die you stop only if you get 5 or 6, and at the second roll you stop only if you
get 4, 5 or 6. The average gain of this strategy is 14/3.
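The backward induction behind this answer fits in a few lines of exact rational arithmetic (the sketch below is an illustration): the value of the forced third roll is 7/2, and each earlier round replaces a face value by the continuation value whenever the latter is larger.

```python
# Backward-induction check for the three-roll die game: optimal gain 14/3.
from fractions import Fraction

v = Fraction(7, 2)                  # value of the third (forced) roll: E = 7/2
for _ in range(2):                  # second roll, then first roll
    # stop on face k if k beats the value v of going on, else continue
    v = sum(max(Fraction(k), v) for k in range(1, 7)) / 6
print(v)                            # Fraction(14, 3)
```

The intermediate value 17/4 explains the thresholds: at the first roll one continues below 17/4 (keep only 5 or 6), and at the second roll one continues below 7/2 (keep 4, 5 or 6).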

10.6 Brownian motion


Exercise 9.36

Exercise 9.37

Exercise 9.38
Chapter 11

Vocabulary

english français
N N (but N∗ in some books)
(0, 1) ]0, 1[
positive strictement positif
countable dénombrable
pairwise disjoint sets ensembles disjoints 2 à 2
a σ-field une tribu ou σ-algèbre
a λ-system une classe monotone
nested emboîté(es)
non-negative positif ou nul
convergence in distribution convergence en loi
pointwise convergence convergence simple
irreducible irréductible
super-martingale sur-martingale
sub-martingale sous-martingale
predictable prévisible
optional stopping theorem théorème d’arrêt
optimal stopping arrêt optimal

Index

λ-system, 5
σ-field, 1
a.e., 12
algorithm
    Metropolis, 57
almost everywhere, 12
Bellman, 80
Boolean algebra, 131
Borel set, 2
Carathéodory, 131
Cauchy-Schwarz inequality, 15
discrete space, 35, 138
equation
    Bellman, 80
    optimal, 80
function
    integrable, 11
Hilbert space, 25
Hölder inequality, 15
irreducible, 43
Jensen inequality, 20
Lebesgue, 132
Lebesgue measure, 132
martingale
    closed, 75
measurable space, 1
measure, 3
    σ-finite, 3
    product, 18
Metropolis, 57
Minkowski inequality, 15
monotone class, 5
probability measure
    invariant, 42
    product, 133
    stationary, 42
product σ-field, 2
projection
    orthogonal, 25
random variable
    discrete, 19
reversibility, 42
Snell, 82
Snell enveloppe, 82
state
    absorbing, 43
    aperiodic, 46
    recurrent, 44
    transient, 44
stochastic matrix, 37
Tchebychev inequality, 20
theorem
    extension, 131
uniform integrability, 75, 139
