Conditional Probability and Expectation


36-752 Advanced Probability Overview Spring 2018

6. Conditional Probability and Expectation

Instructor: Alessandro Rinaldo

Associated reading: Chapter 5 of Ash and Doléans-Dade; Sec 5.1 of Durrett.

Overview
In this set of lecture notes we shift focus to dependent random variables. We introduce
measure-theoretic definitions of conditional probability and conditional expectations.

1 Conditional Expectation
The measure-theoretic definition of conditional expectation is a bit unintuitive, but we will
show how it matches what we already know from earlier study.

Definition 1 (Conditional Expectation). Let (Ω, F, P ) be a probability space, and let


C ⊆ F be a sub-σ-field. Let X be a random variable that is F/B 1 measurable and E|X| < ∞.
We use the symbol E(X|C) to stand for any function h : Ω → IR that is C/B 1 measurable
and that satisfies

    ∫_C h dP = ∫_C X dP,   for all C ∈ C.                                        (1)

We call such a function h a version of the conditional expectation of X given C.

Equation (1) can also be written E(IC h) = E(IC X) for all C ∈ C. Any two versions of
E(X|C) must be equal a.s. according to Theorem 21 in Lecture Notes Set 3 (part 3). Also,
any C/B 1 -measurable function that equals a version of E(X|C) a.s. is another version.

Example 2. If X is itself C/B 1 measurable, then X is a version of E(X|C).

Example 3. If X = a a.s., then E(X|C) = a a.s.

Let Y be a random quantity and let C = σ(Y ). We will use the notation E(X|Y ) to stand
for E(X|C). According to Theorem 39 (given in the Appendix), E(X|Y ) is some function
g(Y ) because it is σ(Y )/B 1 -measurable. We will also use the notation E(X|Y = y) to stand
for g(y).

Example 4 (Joint Densities). Let (X, Y ) be a pair of random variables with a joint density
fX,Y with respect to Lebesgue measure. Let C = σ(Y ). The usual marginal and conditional
densities are
    fY (y) = ∫ fX,Y (x, y) dx,

    fX|Y (x|y) = fX,Y (x, y) / fY (y).

The traditional calculation of the conditional mean of X given Y = y is

    g(y) = ∫ x fX|Y (x|y) dx.

That is, E(X|Y ) = g(Y ) is the traditional definition of conditional mean of X given Y . We
also use the symbol E(X|Y = y) to stand for g(y). We can prove that h = g(Y ) is a version
of the conditional mean according to Definition 1. Since g(Y ) is a function of Y , we know
that it is C/B 1 measurable. We need to show that Equation (1) holds. Let C ∈ C so that
there exists B ∈ B 1 so that C = Y −1 (B). Then IC (ω) = IB (Y (ω)) for all ω. Then
    ∫_C h dP = ∫ IC h dP
             = ∫ IB(Y) g(Y) dP
             = ∫ IB g dµY
             = ∫ IB(y) g(y) fY (y) dy
             = ∫ IB(y) [∫ x fX|Y (x|y) dx] fY (y) dy
             = ∫∫ IB(y) x fX,Y (x, y) dx dy
             = E(IB(Y) X) = E(IC X).
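The defining property in Equation (1) is easy to check numerically. Here is a minimal Monte Carlo sketch (not from the notes; it assumes a standard bivariate normal pair with correlation ρ, for which E(X|Y) = ρY):

```python
import numpy as np

# Sketch: (X, Y) standard bivariate normal with correlation rho, so that
# E(X|Y) = rho * Y.  Check E(I_C X) = E(I_C g(Y)) for C = Y^{-1}(B).
rng = np.random.default_rng(0)
rho, n = 0.6, 1_000_000
y = rng.standard_normal(n)
x = rho * y + np.sqrt(1 - rho**2) * rng.standard_normal(n)

in_B = (y > 0.5) & (y < 2.0)        # an arbitrary Borel set B
print(np.mean(x * in_B))            # E(I_C X)
print(np.mean(rho * y * in_B))      # E(I_C g(Y)), with g(y) = rho * y
```

The two averages agree up to Monte Carlo error, as Equation (1) requires.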

Example 4 can be extended easily to handle two more general cases. First, we could find
E(r(X)|Y ) by virtually the same calculation. Second, the use of conditional densities extends
to the case in which the joint distribution of (X, Y ) has a density with respect to an arbitrary
product measure.
All of the familiar results about conditional expectation are special cases of the general
definition. Here is an unfamiliar example.

Example 5. Let X1 , X2 be independent with U (0, θ) distribution for some known θ. Let
Y = max{X1 , X2 } and X = X1 . We want to find the conditional mean of X given Y .

Intuitively, with probability 1/2, X = Y , and with probability 1/2, X is the min of X1
and X2 and ought to be uniformly distributed between 0 and Y . The mean of this hybrid
distribution is Y /2 + Y /4 = 3Y /4. Let’s verify this.
First, we see that h = 3Y /4 is measurable with respect to C = σ(Y ). Next, let C ∈ C. We
need to show that E(XIC ) = E([3Y /4]IC ). Theorem 21 of Lecture Notes Set 3 (part 4) says
that we only need to check this for sets of the form C = Y −1 ([0, d]) with 0 < d < θ. Rewrite
these expectations as integrals with respect to the joint distribution of (X1 , X2 ). We need to
show that

    ∫_0^d ∫_0^d (x1/θ²) dx1 dx2 = ∫_0^d (3y/4)(2y/θ²) dy,                        (2)

for all 0 < d < θ. It is easy to see that both sides of Equation (2) equal d³/[2θ²]. Furthermore, h′ defined by

    h′ = 3Y/4  if Y is irrational,    h′ = 0  otherwise,

is another version of E(X|Y ).
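The claim in Example 5 can also be checked by simulation. The following sketch (my own, assuming θ = 1) conditions on Y landing in a narrow bin around y0 and compares the empirical mean of X1 with 3y0/4:

```python
import numpy as np

# Sketch (theta = 1): Y = max(X1, X2) with X1, X2 iid U(0, 1); Example 5 says
# E(X1 | Y = y) = 3y/4.  Condition on Y in a narrow bin and compare.
rng = np.random.default_rng(1)
n = 2_000_000
x1 = rng.uniform(0, 1, n)
x2 = rng.uniform(0, 1, n)
y = np.maximum(x1, x2)

y0, eps = 0.8, 0.005
in_bin = np.abs(y - y0) < eps
print(x1[in_bin].mean(), 3 * y0 / 4)   # both close to 0.6
```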

Here is a simple property that extends from expectations to conditional expectations. It will
be used to prove the existence of conditional expectations.

Lemma 6 (Monotonicity). If X1 ≤ X2 a.s., then E(X1 |C) ≤ E(X2 |C) a.s.

Proof: Suppose that both E(X1 |C) and E(X2 |C) exist. Let

C0 = {∞ > E(X1 |C) > E(X2 |C)},


C1 = {∞ = E(X1 |C) > E(X2 |C)}.

Then, for i = 0, 1,
    0 ≤ ∫_{Ci} [E(X1|C) − E(X2|C)] dP = ∫_{Ci} (X1 − X2) dP ≤ 0.

It follows that all terms in this string are 0 and P (Ci ) = 0 for i = 0, 1. Since C0 ∪ C1 =
{E(X1 |C) > E(X2 |C)}, the result is proven.

2 Existence of Conditional Expectation


We could prove that versions of conditional expectations exist by the Radon-Nikodym the-
orem. However, the “modern” way to prove the existence of conditional expectations is
through the theory of Hilbert spaces.

Definition 7. An inner product space is a vector space V with an inner product 〈·, ·〉 :
V × V → IR, a function that satisfies

• symmetry: 〈u, v〉 = 〈v, u〉,

• bilinearity (part 1): 〈u1 + u2 , v〉 = 〈u1 , v〉 + 〈u2 , v〉,

• bilinearity (part 2): for real λ, 〈λu, v〉 = λ〈u, v〉,

• positivity: 〈u, u〉 > 0 for all u ≠ 0, and 〈u, u〉 = 0 if and only if u = 0.


An inner product provides a norm, namely ‖u‖ = √〈u, u〉, and a metric d(u, v) = ‖u − v‖.
These facts follow from some simple properties of inner products.

Proposition 8. Let V be a vector space with an inner product 〈·, ·〉. Then

1. Parallelogram law: for all u, v ∈ V, ‖u‖² + ‖v‖² = (1/2)(‖u + v‖² + ‖u − v‖²).

2. Cauchy-Schwarz inequality: for all u, v ∈ V, |〈u, v〉| ≤ ‖u‖ ‖v‖, with equality if and
only if u and v are collinear.

3. Triangle inequality: for all u, v ∈ V, ‖u + v‖ ≤ ‖u‖ + ‖v‖.

Definition 9. A complete (see Definition 7 in Lecture Notes Set 6) inner product space is
a Hilbert space.
Example 10. Let V = L2 (Ω, F, µ). Define 〈f, g〉 = ∫ f g dµ. This is an inner product that
produces the norm ‖ · ‖_2 . Lemma 9 of Lecture Notes Set 6 showed that Lp is complete.

We prove existence of conditional expectations using orthogonal projection in Hilbert spaces.


The following theorem is a basic result in Hilbert space theory, and is proved in the Appendix.

Theorem 11 (Existence and uniqueness of orthogonal projections). Let V be a


Hilbert space and let V0 be a closed subspace. For each v ∈ V, there is a unique v0 ∈ V0
(called the orthogonal projection of v into V0 ) such that v − v0 is orthogonal to every vector
in V0 and ‖v − v0‖ = inf_{w∈V0} ‖v − w‖.

Now, we can prove the existence of conditional expectations.

Theorem 12 (Existence of conditional expectation). Let (Ω, F, P ) be a probability


space, and let Y be a random variable. Let C be a sub-σ-field of F. If E(Y ) exists, then there
exists a version of E(Y |C).

Proof: It is easy to see that L2 (Ω, C, P ) is a closed linear subspace of L2 (Ω, F, P ). If Y ∈ L2 (Ω, F, P ),
let Y0 be the projection of Y into L2 (Ω, C, P ). According to Theorem 11, E([Y − Y0 ]X) = 0
for all X ∈ L2 (Ω, C, P ), in particular for X = IC for arbitrary C ∈ C. Hence E(IC Y ) = E(IC Y0 )
for all C ∈ C, and Y0 is a version of E(Y |C).
If Y > 0 but not in L2 , define Yn = min{Y, n}. Then Yn ∈ L2 . Let Y0,n be a version
of E(Yn |C), and assume that Y0,n ≤ Y0,n+1 for all n, which is allowed by Lemma 6. Let
Y0 = lim_{n→∞} Y0,n , which exists (by monotonicity of the conditional expectation) and is
C-measurable. Thus, for each C ∈ C,

    E(IC Y ) = lim_{n→∞} E(IC Yn )
             = lim_{n→∞} E(IC Y0,n )
             = E(IC Y0 ),

by the conditional and unconditional version of the monotone convergence theorem. It


follows that E(IC Y ) = E(IC Y0 ) for all C ∈ C and Y0 is a version of E(Y |C).
If Y takes both positive and negative values, write Y = Y⁺ − Y⁻. If one of the means E(Y⁺)
or E(Y⁻) is finite, then the probability is 0 that both E(Y⁺|C) = ∞ and E(Y⁻|C) = ∞. Then
E(Y⁺|C) − E(Y⁻|C) is a version of E(Y |C).

The following result summarizes what we have learned about the existence and uniqueness
of conditional expectation.

Corollary 13. Let Y ∈ L2 (Ω, F, P ) and let C be a sub-σ-field of F. For Z ∈ L2 (Ω, C, P ),
the following are equivalent.

1. Z = E(Y |C).

2. E(XZ) = E(XY ) for all X ∈ L2 (Ω, C, P ).

3. Z is the orthogonal projection of Y into L2 (Ω, C, P ).
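On a finite probability space, Corollary 13 can be seen concretely: averaging over the atoms of C produces the same function as least-squares projection onto the C-measurable random variables. A small illustrative sketch (my own example, not from the notes):

```python
import numpy as np

# Sketch: Omega = {0,...,5} with uniform P; C is generated by the partition
# {0,1,2}, {3,4,5}.  E(Y|C) averages Y over each atom, and the residual
# Y - E(Y|C) is orthogonal to every C-measurable Z (item 2 of Corollary 13).
p = np.full(6, 1 / 6)
Y = np.array([1.0, 4.0, 2.0, 7.0, 0.0, 5.0])
atoms = [np.array([0, 1, 2]), np.array([3, 4, 5])]

cond_exp = np.empty(6)
for atom in atoms:
    cond_exp[atom] = np.average(Y[atom], weights=p[atom])

Z = np.array([2.0, 2.0, 2.0, -3.0, -3.0, -3.0])   # constant on atoms of C
print(np.sum(p * (Y - cond_exp) * Z))              # ~ 0: E[(Y - E(Y|C)) Z] = 0
```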

3 Additional Properties of Conditional Expectation


The following fact is immediate from taking C = Ω in Equation (1).

Proposition 14. E(E(X|C)) = E(X).

Here is a generalization of Proposition 14, which is sometimes called the tower property of
conditional expectations, or law of total probability.

Proposition 15 (Williams' Tower Property). If C1 ⊆ C2 ⊆ F are sub-σ-fields and
E(X) exists, then E(X|C1 ) is a version of E(E(X|C2 )|C1 ).

Proof: By definition E(X|C1 ) is C1 /B 1 -measurable. We need to show that, for every C ∈ C1 ,
    ∫_C E(X|C1 ) dP = ∫_C E(X|C2 ) dP.

The left side is E(XIC ) by definition of conditional mean. Similarly, because C ∈ C2 also,
the right side is E(XIC ) as well.

Example 16. Let (X, Y, Z) be a triple of random variables. Then E(X|Y ) is a version of
E(E(X|(Y, Z))|Y ).
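Here is a small numerical illustration of Proposition 15 on a finite space (my own example, with C1 and C2 generated by nested partitions):

```python
import numpy as np

# Sketch: uniform P on Omega = {0,...,7}; C1 generated by {0..3},{4..7} and
# C2 by the finer partition {0,1},{2,3},{4,5},{6,7}.  Conditioning on C2 and
# then on C1 gives the same function as conditioning on C1 directly.
p = np.full(8, 1 / 8)
X = np.array([3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0])

def cond_exp(values, weights, partition):
    out = np.empty_like(values)
    for atom in partition:
        out[atom] = np.average(values[atom], weights=weights[atom])
    return out

C1 = [np.arange(0, 4), np.arange(4, 8)]
C2 = [np.arange(0, 2), np.arange(2, 4), np.arange(4, 6), np.arange(6, 8)]

inner = cond_exp(X, p, C2)            # E(X | C2)
print(cond_exp(inner, p, C1))         # E(E(X|C2) | C1)
print(cond_exp(X, p, C1))             # E(X | C1): the same vector
```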

The following corollary to Proposition 15 is sometimes useful.

Corollary 17. Assume that C1 ⊆ C2 ⊆ F are sub-σ-fields and E(X) exists. If a version
of E(X|C2 ) is C1 /B 1 -measurable, then E(X|C1 ) is a version of E(X|C2 ) and E(X|C2 ) is a
version of E(X|C1 ).

Example 18. Suppose that X and Y have a joint conditional density given Θ that factors,

fX,Y |Θ (x, y|θ) = fX|Θ (x|θ)fY |Θ (y|θ).

Then, the conditional density of X given (Y, Θ) is


    fX|Y,Θ (x|y, θ) = fX,Y |Θ (x, y|θ) / fY |Θ (y|θ) = fX|Θ (x|θ).

With C1 = σ(Θ) and C2 = σ(Y, Θ), we see that E(r(X)|C1 ) will be a version of E(r(X)|C2 )
for every function r(X) with defined mean.

The next lemma shows that conditional expectation is linear.

Lemma 19 (Linearity). If E(X), E(Y ), and E(X + Y ) all exist, then E(X|C) + E(Y |C) is
a version of E(X + Y |C).

Proof: Clearly E(X|C) + E(Y |C) is C/B 1 -measurable. We need to show that for all C ∈ C,
    ∫_C [E(X|C) + E(Y |C)] dP = ∫_C (X + Y ) dP.                                  (3)
The left side of Equation (3) is ∫_C X dP + ∫_C Y dP = ∫_C (X + Y ) dP because E(IC X), E(IC Y ),
and E(IC [X + Y ]) all exist.

The following theorem is used extensively in later results.

Theorem 20. Let (Ω, F, P ) be a probability space and let C be a sub-σ-field of F. Suppose
that E(Y ) and E(XY ) exist and that X is C/B 1 -measurable. Then E(XY |C) = XE(Y |C).

Proof: Clearly, XE(Y |C) is C/B 1 -measurable. We will use the standard machinery on X.
If X = IB for a set B ∈ C, then

E(IC XY ) = E(IC∩B Y ) = E(IC∩B E(Y |C)) = E(IC XE(Y |C)), (4)

for all C ∈ C. Hence, XE(Y |C) = E(XY |C). By linearity of expectation, the extreme ends
of Equation (4) are equal for every nonnegative simple function, X. Next, suppose that X
is nonnegative and let {Xn } be a sequence of nonnegative simple functions converging to X
from below. Then

    E(IC Xn Y⁺) = E(IC Xn E(Y⁺|C)),
    E(IC Xn Y⁻) = E(IC Xn E(Y⁻|C)),

for each n and each C ∈ C. Apply the monotone convergence theorem to all four sequences
above to get

    E(IC XY⁺) = E(IC X E(Y⁺|C)),
    E(IC XY⁻) = E(IC X E(Y⁻|C)),

for all C ∈ C. It now follows easily from Lemma 19 that XE(Y |C) = E(XY |C). Finally, if X
is general, use what we just proved to see that X⁺E(Y |C) = E(X⁺Y |C) and X⁻E(Y |C) =
E(X⁻Y |C). Apply Lemma 19 one last time.
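Theorem 20 is also easy to check by simulation. A sketch (my own example, with C = σ(Z), X = Z² and Y = Z + ε):

```python
import numpy as np

# Sketch: C = sigma(Z) with Z standard normal, X = Z^2 (C-measurable), and
# Y = Z + eps with eps independent noise, so E(Y|C) = Z.  Theorem 20 predicts
# E(I_C X Y) = E(I_C X Z) for any C = Z^{-1}(B).
rng = np.random.default_rng(4)
n = 2_000_000
z = rng.standard_normal(n)
eps = rng.standard_normal(n)
x, y = z**2, z + eps

in_C = (z > -1.0) & (z < 0.5)                          # C = {Z in (-1, 0.5)}
print(np.mean(in_C * x * y), np.mean(in_C * x * z))    # agree up to MC error
```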

In all of the proofs so far, we have proven that the defining equation for conditional expecta-
tion holds for all C ∈ C. Sometimes, this is too difficult and the following result can simplify
a proof.

Proposition 21. Let (Ω, F, P ) be a probability space and let C be a sub-σ-field of F. Let D
be a π-system that generates C. Let Y be a random variable whose mean exists. Let Z be a
C/B 1 -measurable random variable such that E(IC Z) = E(IC Y ) for all C ∈ D. Then Z is a
version of E(Y |C).

One proof of this result relies on signed measures, and is very similar to the proof of unique-
ness of measure.

4 Conditional Distribution
Now we introduce the measure-theoretic version of conditional probability and distribution.

4.1 Conditional probability
For A ∈ F, define Pr(A|C) = E(IA |C). That is, treat IA as a random variable X and define
the conditional probability of A to be the conditional mean of X. We would like to show
that conditional probabilities behave like probabilities. The first thing we can show is that
they are additive: it follows easily from Lemma 19 that Pr(A|C) + Pr(B|C) = Pr(A ∪ B|C) a.s. if A and B are
disjoint. The following additional properties are straightforward, and we will not do them
all in class; their proofs are similar to that of Lemma 19.
Example 22 (Probability at most 1). We shall show that Pr(A|C) ≤ 1 a.s. Let B = {ω :
Pr(A|C) > 1}. Then B ∈ C, and
    P (B) ≤ ∫_B Pr(A|C) dP = ∫_B IA dP = P (A ∩ B) ≤ P (B),

where the first inequality is strict if P (B) > 0. Clearly, neither of the inequalities can be
strict, hence P (B) = 0.
Example 23 (Countable Additivity). Let {An}_{n=1}^∞ be disjoint elements of F. Let
W = Σ_{n=1}^∞ Pr(An |C). We shall show that W is a version of Pr(∪_{n=1}^∞ An |C). Let C ∈ C. Then

    E(IC · I_{∪_{n=1}^∞ An}) = P (C ∩ ∪_{n=1}^∞ An )
                             = Σ_{n=1}^∞ P (C ∩ An )
                             = Σ_{n=1}^∞ ∫_C Pr(An |C) dP
                             = ∫_C Σ_{n=1}^∞ Pr(An |C) dP
                             = ∫_C W dP,

where the sum and integral are interchangeable by the monotone convergence theorem.

We could also prove that Pr(A|C) ≥ 0 a.s. and Pr(Ω|C) = 1, a.s. But there are generally
uncountably many different A ∈ F and uncountably many different sequences of disjoint
events. Although countable additivity holds a.s. separately for each sequence of disjoint
events, how can we be sure that it holds simultaneously for all sequences a.s.?
Definition 24 (Regular Conditional Probabilities). Let A ⊆ F be a sub-σ-field. We
say that a collection of versions {Pr(A|C) : A ∈ A} are regular conditional probabilities if,
for each ω, Pr(·|C)(ω) is a probability measure on (Ω, A).

Rarely do regular conditional probabilities exist on (Ω, F), but there are lots of common
sub-σ-fields A such that regular conditional probabilities exist on (Ω, A). Oddly enough,
the existence of regular conditional probabilities doesn’t seem to depend on C.

Example 25 (Joint Densities). Use the same setup as in Example 4. For each y such
that fY (y) = 0, define fX|Y (x|y) = φ(x), the standard normal density. For each y such that
fY (y) > 0, define fX|Y as in Example 4. Next, for each B ∈ B 1 , let A = X −1 (B), and define
    h(y) = ∫_B fX|Y (x|y) dx,

for all y. Finally, define Pr(A|C)(ω) = h(Y (ω)). The calculation done in Example 4 shows
that this is a version of the conditional mean of IA given C. But it is easy to see that for
each ω, Pr(·|C)(ω) is a probability measure on (Ω, σ(X)).

The results we have on existence of regular conditional probabilities are for the cases in which
A is the σ-field generated by a random variable or something a lot like a random variable.
Note that this is a condition on A not on C. The conditioning σ-field can be anything at
all. What matters is the σ-field on which the conditional probability is to be defined. There
are examples in which no regular conditional probabilities exist. These examples depend
upon the existence of a nonmeasurable set, which we did not prove. We will not cover such
examples here.

4.2 Conditional distribution


Let (Ω, F, P ) be a probability space and let (X , B) be a measurable space. Let X : Ω → X
be a random quantity. If A = σ(X), conditional probabilities on A form a conditional
distribution for X.

Definition 26 (Conditional distribution). For each B ∈ B, define µX|C (B)(ω) = Pr(X −1 (B)|C)(ω).
A collection of versions {µX|C (B)(·) : B ∈ B} is called a conditional distribution of X given
C. If, in addition, for each ω, µX|C (·)(ω) is a probability measure on (X , B), then the collec-
tion is a regular conditional distribution (rcd).

Example 25 is already an example of an rcd.


Here is a bit of notation that we will use when we deal with conditional distributions given
random quantities.

Definition 27 (Conditional distribution given a random quantity). Let X and Y be


random quantities (defined on the same probability space) taking values in X (with σ-field B)
and Y (with σ-field D) respectively. Assume that D contains all singletons. We will use µX|Y
to denote the conditional distribution of X given Y with the following meaning. For every
y ∈ Y and ω ∈ Y −1 ({y}) and B ∈ B, we define µX|Y (B|y) = µX|C (B)(ω), where C = σ(Y ).

Remember that a C/B 1 -measurable function like µX|C (B) must itself be a function of Y .
That is, for all ω, µX|C (B)(ω) = h(Y (ω)), where h : Y → IR is D/B 1 -measurable. What we
have done is define µX|Y (B|y) = h(y). We also use the notation Pr(X ∈ B|Y = y) to stand
for this same value.
Here is a result that says that, in the presence of an rcd, conditional expectations can be
computed the naïve way.

Proposition 28 (Expectation under RCD). Let (X , B) be a measurable space, and let


X : Ω → X be a random quantity. Let g : X → IR be a measurable function such that the
mean of g(X) exists. Suppose that µX|C is an rcd for X given C. Then ∫ g dµX|C is a version
of E(g(X)|C).

Just use the standard machinery to prove this result.
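As an illustration of Proposition 28, the following sketch (my own example, using SciPy) takes the rcd of X given Y = y to be N(y, 1) and computes E(g(X)|Y = y) by integrating g against the conditional density; for g(x) = x² the exact answer is y² + 1:

```python
import numpy as np
from scipy import integrate, stats

# Sketch: take the rcd of X given Y = y to be N(y, 1) and g(x) = x^2.
# Integrating g against the conditional density gives E(g(X)|Y = y);
# here the exact value is y^2 + 1.
y = 1.7
g = lambda x: x**2
f_cond = lambda x: stats.norm.pdf(x, loc=y, scale=1.0)

val, _ = integrate.quad(lambda x: g(x) * f_cond(x), -np.inf, np.inf)
print(val, y**2 + 1)    # both about 3.89
```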


Very often, a conditional distribution of X given Y is proposed on the space X in which
X takes its values, and is given as a function of Y . To check whether such a proposed
distribution is the conditional distribution of X given Y , the following result can help.

Lemma 29. Let (Ω, F, P ) be a probability space. Let (X , B) and (Y, D) be measurable
spaces. Let B1 and D1 be π-systems that generate B and D respectively. Let X : Ω → X and
Y : Ω → Y be random quantities. Let µY stand for the distribution of Y and let µX,Y stand
for the joint distribution of (X, Y ). For each B ∈ B, let hB : Y → IR be a measurable function
such that for all D ∈ D1 and B ∈ B1 , ∫_D hB dµY = µX,Y (B × D). Then {hB (Y ) : B ∈ B} is
a version of the conditional distribution of X given Y .

Proof: Let C = σ(Y ). Each hB (Y ) is a C/B 1 -measurable function from Ω to IR. We need
to show that for all C ∈ C and B ∈ B,
    ∫_C hB (Y ) dP = P (C ∩ X −1 (B)).                                            (5)

First, notice that F1 = {C ∩ X −1 (B) : C ∈ C, B ∈ B} is a sub-σ-field of F and that both


sides of Equation (5) define σ-finite measures on (Ω, F1 ). We will prove that these two
measures agree on the π-system {Y −1 (D) ∩ X −1 (B) : D ∈ D1 , B ∈ B1 }, which generates F1 .
Then apply the uniqueness of measure. For D ∈ D1 and B ∈ B1 , let C = Y −1 (D). It follows
that

    ∫_C hB (Y ) dP = ∫_D hB dµY = µX,Y (B × D) = P (C ∩ X −1 (B)).

Theorem 30 (Existence of RCD for r.v.’s). Let (Ω, F, P ) be a probability space with C
a sub-σ-field. Let X be a random variable. Then there is an rcd of X given C.

Proof: For each rational number q, let µX|C ((−∞, q]) be a version of Pr(X ≤ q|C). Define

    C1 = {ω : µX|C ((−∞, q])(ω) = inf_{rational r>q} µX|C ((−∞, r])(ω), for all rational q},
    C2 = {ω : lim_{x→−∞, x rational} µX|C ((−∞, x])(ω) = 0},
    C3 = {ω : lim_{x→∞, x rational} µX|C ((−∞, x])(ω) = 1}.

Let C0 = C1 ∩ C2 ∩ C3 . Details on why P (C0 ) = 1 are given in Appendix C.1. For
ω ∈ C0 and irrational x, define

    µX|C ((−∞, x])(ω) = inf_{rational q>x} µX|C ((−∞, q])(ω).                     (6)

Rational x already satisfy Equation (6) for ω ∈ C0 since C0 ⊆ C1 . If ω ∉ C0 , define
µX|C ((−∞, x])(ω) = F (x), where F is your favorite cdf. For each ω, we have defined a cdf
on IR, which extends to a probability measure on (IR, B 1 ). This collection of probabilities
forms an rcd by construction.

Further discussions are given in the Appendix.


The following result says that rcd’s are preserved under bimeasurable mappings.

Lemma 31. Let X : Ω → X and let φ : X → R be one-to-one, onto, and measurable with
measurable inverse, where R ∈ B 1 . Let Y = φ(X). Let µY |C be an rcd for Y given C. Define µX|C (B)(ω) =
µY |C (φ(B))(ω). Then µX|C defines an rcd for X given C.

The proof of Lemma 31, along with some examples, is given in the Appendix.
There is a laundry list of properties of conditional expectation that mimic similar properties
of expectation. Proposition 28 can be used to prove some of these. Using Proposition 28
requires the existence of an rcd. Theorem 30 shows that an rcd exists for every random
variable.

Proposition 32 (Integral properties under RCD). Let C be a sub-σ-field of F.

1. (Monotone convergence theorem) If 0 ≤ Xn ≤ X a.s. for all n and Xn → X


a.s., then E(Xn |C) → E(X|C) a.s.

2. (Dominated convergence theorem) If Xn → X a.s. and |Xn | ≤ Y a.s., where


Y ∈ L1 , then E(Xn |C) → E(X|C) a.s.

3. (Jensen’s inequality) Let E(X) be finite. If φ is a convex function and φ(X) ∈ L1 ,


then E[φ(X)|C] ≥ φ(E[X|C]) a.s.

4. Assume that E(X) exists and σ(X) is independent of C. Then E(X) is a version of
E(X|C).

Proof: We will prove only the first part. Let X0 = X and let X = (X0 , X1 , X2 , . . .). Let
µX|C (·) be a regular conditional distribution for X given C. That is, for each B ∈ B ∞ (the
product σ-field for IR∞ ) µX|C (B)(·) (a function from Ω to IR) is a version of Pr(X ∈ B|C)(·),
and for each ω ∈ Ω, µX|C (·)(ω) is a probability measure on B ∞ . Let L = {x ∈ IR∞ :
limn→∞ xn = x0 }. Let C = {ω : µX|C (L)(ω) = 1}. We know that

    1 = P (X ∈ L)
      = ∫ µX|C (L) dP
      = ∫_C µX|C (L) dP + ∫_{C^c} µX|C (L) dP
      = P (C) + ∫_{C^c} µX|C (L) dP,

where the first equality follows from lim_{n→∞} Xn = X0 a.s., the second follows from the law of
total probability (Proposition 14), and the last two are obvious. If P (C) < 1, the last integral
above is strictly less than 1 − P (C), contradicting the first equality. Hence P (C) = 1.¹
Define fn (x) = xn for n = 0, 1, . . .. For each ω ∈ C, limn→∞ fn = f0 , a.s. [µX|C (·)(ω)]. The
monotone convergence theorem says that, for each ω ∈ C,
    lim_{n→∞} ∫ fn µX|C (dx)(ω) = ∫ f0 µX|C (dx)(ω).

According to Proposition 28,


    ∫ fn (x) µX|C (dx)(ω) = E(fn (X)|C) = E(Xn |C),
    ∫ f0 (x) µX|C (dx)(ω) = E(f0 (X)|C) = E(X0 |C).

It follows that limn→∞ E(Xn |C) = E(X|C) a.s.

5 Bayes’ Theorem
Let (X , B) be a Borel space and let X : Ω → X be a random quantity. Also, let Θ : Ω → T ,
where (T , τ ) is a measurable space. We can safely assume that X has an rcd given Θ. Let
ν be a measure on (X , B) such that µX|Θ (·|θ) has a density fX|Θ (x|θ) with respect to ν for
all θ. Assume that fX|Θ is jointly measurable as a function of (x, θ).
¹ This illustrates a common technique in probability proofs. We integrate a nonnegative function f ≤ 1
with respect to a probability P and find that ∫ f dP = 1. It follows that P (f = 1) = 1.

Theorem 33 (Bayes’ theorem). Assume the above structure. Then

    fX (x) = ∫_T fX|Θ (x|θ) dµΘ (θ)   is a density for µX with respect to ν.      (7)

Also, µΘ|X (·|x) ≪ µΘ a.s. [µX ] and the following is a density for µΘ|X with respect to µΘ :

    fΘ|X (θ|x) = fX|Θ (x|θ) / fX (x).

Surprisingly, Bayes’ theorem does not require that (T , τ ) be a Borel space. Nevertheless, we
get an rcd for Θ given X. The proof of Theorem 33 is given in another course document.
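A grid-based sketch of Theorem 33 (my own example, not from the notes): with Θ uniform on (0, 1), so that µΘ is Lebesgue measure, and X | Θ = θ ∼ Binomial(n, θ), the density fΘ|X (θ|x) = fX|Θ (x|θ)/fX (x) should coincide with the Beta(x + 1, n − x + 1) density:

```python
import numpy as np
from scipy import stats

# Sketch: Theta ~ Uniform(0, 1) (mu_Theta = Lebesgue measure on (0, 1)) and
# X | Theta = theta ~ Binomial(n, theta).  Build the posterior density on a
# grid via f_{Theta|X} = f_{X|Theta} / f_X and compare with Beta(x+1, n-x+1).
n, x = 10, 7
theta = np.linspace(0.0005, 0.9995, 1000)
like = stats.binom.pmf(x, n, theta)            # f_{X|Theta}(x | theta)
f_x = np.sum(like) * (theta[1] - theta[0])     # f_X(x), Riemann sum over the grid
posterior = like / f_x                         # density w.r.t. mu_Theta

exact = stats.beta.pdf(theta, x + 1, n - x + 1)
print(np.max(np.abs(posterior - exact)))       # small
```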

6 Independence and Conditioning


Recall that {Xα }α∈ℵ are mutually independent if, for all finite k and all distinct i1 , . . . , ik ∈ ℵ,
the joint distributions satisfy
µXi1 ,...,Xik = µXi1 × · · · × µXik .

Theorem 34. Random quantities {Xα }α∈ℵ are mutually independent if and only if for
all integers k1 , k2 (such that k1 + k2 is no more than the cardinality of ℵ) and distinct
i1 , . . . , ik1 , j1 , . . . , jk2 ∈ ℵ, µXj1 ,...,Xjk2 is a version of µXj1 ,...,Xjk2 |Xi1 ,...,Xik1 .

Proof: For the “if” direction, we shall use induction. Start with k = 2, i1 ≠ j1 = i2 , and
k1 = k2 = 1. We assume that µXj1 is a version of µXj1 |Xi1 , so

    µXi1 ,Xi2 (B1 × B2 ) = ∫_{B1} µXi2 (B2 ) dµXi1 = µXi1 (B1 )µXi2 (B2 ).

So, Xi1 and Xi2 are independent for all distinct i1 and i2 . Now, assume that the “if”
implication is true for all k ≤ k0 . Let i1 , . . . , ik0 +1 be distinct elements of ℵ. Let k1 = k0
and k2 = 1. Let j1 = ik0 +1 . Then µXik0 +1 is a version of µXik0 +1 |Xi1 ,...,Xik0 . Using the same
argument as above, we see that Xi1 , . . . , Xik0 +1 are independent.
For the “only if” direction, assume that the random variables are independent. Let i1 , . . . , ik1 , j1 , . . . , jk2
be distinct. Let B1 be in the product σ-field of the spaces where Xi1 , . . . , Xik1 take their
values and let B2 be in the product σ-field of the spaces where Xj1 , . . . , Xjk2 take their values.
We have assumed that

    µXi1 ,...,Xjk2 (B1 × B2 ) = µXi1 ,...,Xik1 (B1 )µXj1 ,...,Xjk2 (B2 )
                             = ∫_{B1} µXj1 ,...,Xjk2 (B2 ) dµXi1 ,...,Xik1 .

This equality for all B1 and B2 is sufficient (by Lemma 29) to say that µXj1 ,...,Xjk2 is a version
of µXj1 ,...,Xjk2 |Xi1 ,...,Xik1 .

Theorem 35. Two σ-fields C1 and C2 are independent if and only if, for every C1 -measurable
random variable X such that E(X) is defined, E(X) is a version of E(X|C2 ).

Proof: We know that C1 and C2 are independent if and only if, for all Ai ∈ Ci (i = 1, 2)
P (A1 ∩A2 ) = P (A1 )P (A2 ). Notice that P (A1 ) = E(IA1 ) is a version of E(IA1 |C2 ) = Pr(A1 |C2 )
if and only if

    P (A1 ∩ A2 ) = ∫_{A2} P (A1 ) dP,   for all A2 ∈ C2 .

But the right side equals P (A1 )P (A2 ). Hence, we have proven that C1 and C2 are inde-
pendent if and only if E(IA1 ) is a version of E(IA1 |C2 ) for all A1 ∈ C1 . The extension to
all C1 -measurable random variables is an application of the standard machinery (part 4 of
Proposition 32, which you are proving for homework).

The following definition is useful in statistics.

Definition 36. We say that X1 , X2 , . . . are conditionally independent given C if either of


the following conditions holds:

• For all k1 , k2 and distinct i1 , . . . , ik1 , j1 , . . . , jk2 , µXj1 ,...,Xjk2 |C is a version of µXj1 ,...,Xjk2 |σ(C,Xi1 ,...,Xik1 ) .

• For every k and distinct i1 , . . . , ik , µXi1 |C × · · · × µXik |C is a version of µXi1 ,...,Xik |C .

Proposition 37. The two conditions in the above definition are equivalent.

Example 38. A sequence {Xn }_{n=1}^∞ is called a Markov chain if, for all n > 1, (X1 , . . . , Xn−1 )
is conditionally independent of {Xk }_{k=n+1}^∞ given Xn .
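A simulation sketch of Example 38 (my own illustration): for a two-state Markov chain, X1 and X3 should be conditionally independent given X2, so the empirical conditional probability of X3 given X2 should not depend on X1:

```python
import numpy as np

# Sketch: two-state Markov chain with transition matrix P.  Conditionally on
# X2, the past X1 and the future X3 are independent, so the empirical
# P(X3 = 1 | X2 = 0, X1 = c) should not depend on c.
rng = np.random.default_rng(2)
P = np.array([[0.7, 0.3],
              [0.4, 0.6]])
n = 1_000_000

x1 = rng.integers(0, 2, n)
x2 = (rng.random(n) < P[x1, 1]).astype(int)    # transition from x1
x3 = (rng.random(n) < P[x2, 1]).astype(int)    # transition from x2

for c in (0, 1):
    past = (x2 == 0) & (x1 == c)
    print(c, x3[past].mean())                  # both close to P[0, 1] = 0.3
```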

Appendix
A Conditional expectation given a random variable
Theorem 39. Let (Ωi , Fi ) for i = 1, 2, 3 be measurable spaces. Let f : Ω1 → Ω2 be a
measurable onto function. Suppose that F3 contains all singletons. Let A1 = σ(f ). Let
g : Ω1 → Ω3 be F1 /F3 -measurable. Then g is A1 /F3 -measurable if and only if there exists a
F2 /F3 -measurable h : Ω2 → Ω3 such that g = h(f ).

Proof: For the “if” part, assume that there is a measurable h : Ω2 → Ω3 such that
g(ω) = h(f (ω)) for all ω ∈ Ω1 . Let B ∈ F3 . We need to show that g −1 (B) ∈ A1 . Since h is
measurable, h−1 (B) ∈ F2 , so h−1 (B) = A for some A ∈ F2 . Since g −1 (B) = f −1 (h−1 (B)), it
follows that g −1 (B) = f −1 (A) ∈ A1 .

For the “only if” part, assume that g is A1 measurable. For each t ∈ Ω3 , let Ct = g −1 ({t}).
Since g is measurable with respect to A1 = f −1 (F2 ), every element of g −1 (F3 ) is in f −1 (F2 ).
So let At ∈ F2 be such that Ct = f −1 (At ). Define h(ω) = t for all ω ∈ At . (Note that
if t1 ≠ t2 , then At1 ∩ At2 = ∅, so h is well defined.) To see that g(ω) = h(f (ω)), let
g(ω) = t, so that ω ∈ Ct = f −1 (At ). This means that f (ω) ∈ At , which in turn implies
h(f (ω)) = t = g(ω).
To see that h is measurable, let A ∈ F3 . We must show that h−1 (A) ∈ F2 . Since g is A1
measurable, g −1 (A) ∈ A1 , so there is some B ∈ F2 such that g −1 (A) = f −1 (B). We will show
that h−1 (A) = B ∈ F2 to complete the proof. If ω ∈ h−1 (A), let t = h(ω) ∈ A and ω = f (x)
(because f is onto). Hence, x ∈ Ct ⊆ g −1 (A) = f −1 (B), so f (x) ∈ B. Hence, ω ∈ B. This
implies that h−1 (A) ⊆ B. Lastly, if ω ∈ B, ω = f (x) for some x ∈ f −1 (B) = g −1 (A) and
h(ω) = h(f (x)) = g(x) ∈ A. So, h(ω) ∈ A and ω ∈ h−1 (A). This implies B ⊆ h−1 (A).

The condition that f be onto can be relaxed at the expense of changing the domain of h to
be the image of f , i.e. h : f (Ω1 ) → Ω3 , with a different σ-field. The proof is slightly more
complicated due to having to keep track of the image of f , which might not be a measurable
set in F2 .
The following is an example to show why the condition that F3 contains all singletons is
included in Theorem 39.

Example 40. Let Ωi = IR for all i and let F1 = F2 = B 1 , while F3 = {IR, ∅}. Then every
function g : Ω1 → Ω3 is σ(f )/F3 -measurable, no matter what f : Ω1 → Ω2 is. For example,
let f (x) = x2 and g(x) = x for all x. Then g −1 (F3 ) ⊆ σ(f ) but g is not a function of f .

The reason that we need the condition about singletons is the following. Suppose that there
are two points t1 , t2 ∈ Ω3 such that t1 ∈ A implies t2 ∈ A and vice versa for every A ∈ F3 .
Then there can be a set A ∈ F3 that contains both t1 and t2 , and g can take both of the
values t1 and t2 , but f is constant on g −1 (A) and all the measurability conditions still hold.
In this case, g is not a function of f .

B Projection in Hilbert spaces


Theorem 11 is well known in finite dimensional spaces. The following lemma aids in the
general proof.

Lemma 41. Let x1 , x2 , x be elements of an inner product space. Then

    ‖x1 − x2 ‖² = 2‖x1 − x‖² + 2‖x2 − x‖² − 4‖(x1 + x2 )/2 − x‖².                 (8)

Proof: Use the relation between inner products and norms to compute

    ‖x1 − x2 ‖² = 〈x1 , x1 〉 + 〈x2 , x2 〉 − 2〈x1 , x2 〉,
    2‖x1 − x‖² = 2〈x1 , x1 〉 + 2〈x, x〉 − 4〈x1 , x〉,
    2‖x2 − x‖² = 2〈x2 , x2 〉 + 2〈x, x〉 − 4〈x2 , x〉,
    −4‖(x1 + x2 )/2 − x‖² = −〈x1 , x1 〉 − 〈x2 , x2 〉 − 2〈x1 , x2 〉 − 4〈x, x〉 + 4〈x1 , x〉 + 4〈x2 , x〉.

Add the last three rows, and the sum is the first row.
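Identity (8) is easy to sanity-check numerically; the following sketch uses the Euclidean inner product on IR^5 and random vectors:

```python
import numpy as np

# Numerical check of Equation (8) for the Euclidean inner product on R^5.
rng = np.random.default_rng(3)
x1, x2, x = rng.standard_normal((3, 5))

lhs = np.sum((x1 - x2)**2)
rhs = (2 * np.sum((x1 - x)**2) + 2 * np.sum((x2 - x)**2)
       - 4 * np.sum(((x1 + x2) / 2 - x)**2))
print(np.isclose(lhs, rhs))    # True
```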

Proof: [Proof of Theorem 11] Fix v ∈ V. Define g(w) = ‖w − v‖ and let c0 = inf_{w∈V0} g(w).
Let {vn }_{n=1}^∞ be elements of V0 such that lim_{n→∞} g(vn ) = c0 . We will prove that {vn }_{n=1}^∞ is
a Cauchy sequence. If not, there is a subsequence, call it {yn }_{n=1}^∞ , and ε > 0 such that
‖yn − yn+1 ‖ > ε for all n. For each n, use Lemma 41 with x1 = yn , x2 = yn+1 , x = v to
conclude that, for all n,

    ε² < ‖yn − yn+1 ‖² = 2‖yn − v‖² + 2‖yn+1 − v‖² − 4‖(yn + yn+1 )/2 − v‖².       (9)

Notice that lim sup_n ‖yn − v‖² = lim sup_n ‖yn+1 − v‖² = c0² and lim inf_n ‖(yn + yn+1 )/2 − v‖² ≥
c0², because (yn + yn+1 )/2 ∈ V0 . It follows that the lim sup_n of the right-hand side of Equation (9) is at most 0, a contradiction.
Because V0 is complete, it follows that {vn }_{n=1}^∞ has a limit v0 . Because g is continuous,
‖v0 − v‖ = c0 .
Next, let w ∈ V0 be nonzero and define c = 〈v0 − v, w〉. Notice that

    ‖aw + v0 − v‖² = ‖v0 − v‖² + 2ac + a²‖w‖² ≥ ‖v0 − v‖²,

by the definition of v0 . It follows that h(a) = 2ac + a²‖w‖² ≥ 0 for all a. But the function h
has a unique minimum of −c²/‖w‖² at a = −c/‖w‖², hence c = 0. So v0 − v is orthogonal to
every vector in V0 .
Finally, show that v0 is unique. Suppose that there is v1 ∈ V0 such that ‖v − v1 ‖ = ‖v − v0 ‖. Apply
Lemma 41 with x1 = v0 , x2 = v1 , and x = v. The left side of Equation (8) is nonnegative
while the right side is nonpositive, so they are both 0 and v1 = v0 .

C More on existence of RCD


We give further detailed discussion of the existence of RCD.

C.1 More on the proof of Theorem 30
For each rational number q, let µX|C ((−∞, q]) be a version of Pr(X ≤ q|C). Define
    C1 = {ω : µX|C ((−∞, q])(ω) = inf_{rational r>q} µX|C ((−∞, r])(ω), for all rational q},
    C2 = {ω : lim_{x→−∞, x rational} µX|C ((−∞, x])(ω) = 0},
    C3 = {ω : lim_{x→∞, x rational} µX|C ((−∞, x])(ω) = 1}.

(Notice that C2 and C3 are defined slightly differently than in the original class notes.)
Define

    Mq,r = {ω : µX|C ((−∞, q])(ω) < µX|C ((−∞, r])(ω)},    M = ∪_{q>r} Mq,r .

If P (Mq,r ) > 0 for some q > r, then

    Pr(Mq,r ∩ {X ≤ q}) = ∫_{Mq,r} µX|C ((−∞, q]) dP < ∫_{Mq,r} µX|C ((−∞, r]) dP
                       = Pr(Mq,r ∩ {X ≤ r}),
which is a contradiction. Hence, P (M ) = 0. Next, define
    Nq = {ω ∈ M^c : lim_{r↓q, r rational} µX|C ((−∞, r])(ω) ≠ µX|C ((−∞, q])(ω)},    N = ∪_{all q} Nq .

If P (Nq ) > 0 for some q, then

    Pr(Nq ∩ {X ≤ q}) = ∫_{Nq} µX|C ((−∞, q]) dP < ∫_{Nq} lim_{r↓q, r rational} µX|C ((−∞, r]) dP
                     = lim_{r↓q, r rational} ∫_{Nq} µX|C ((−∞, r]) dP = lim_{r↓q, r rational} Pr(Nq ∩ {X ≤ r}),

which is a contradiction. We can use Example 23 once again to prove that P (N ) = 0. Notice
that C1 = N C , so P (C1 ) = 1.
Next, notice that

    0 = P (C1 ∩ C2^c ∩ ∩_{rational x} {X ≤ x}) = lim_{x→−∞, x rational} ∫_{C1 ∩ C2^c} µX|C ((−∞, x]) dP
      = ∫_{C1 ∩ C2^c} lim_{x→−∞, x rational} µX|C ((−∞, x]) dP.

If P (C1 ∩ C2 ) < 1, then the last integral above is strictly positive, a contradiction. The
interchange of limit and integral is justified by the fact that, for ω ∈ C1 , µX|C ((−∞, x]) is
nondecreasing in x. A similar contradiction arises if P (C1 ∩ C3 ) < 1.

C.2 Bimeasurable functions and Borel spaces
Definition 42 (Bimeasurable functions). A one-to-one onto measurable function φ be-
tween two measurable spaces is called bimeasurable if its inverse is also measurable. Suppose
that there exists a bimeasurable φ : X → R, where R is a measurable subset of IR. In this
case, we say that (X , B) is a Borel space.

Lemma 43. Let X : Ω → X and let φ : X → R ⊆ IR be bimeasurable, where R ∈ B 1 . Let


Y = φ(X). Let µY |C be an rcd for Y given C. Define µX|C (B)(ω) = µY |C (φ(B))(ω). Then
µX|C defines an rcd for X given C.

Proof: Recall that φ−1 is measurable, hence, for each ω, µX|C (·)(ω) = µY |C (φ(·))(ω) is a
probability measure on (X , B). For each B ∈ B, µX|C (B)(·) is a measurable function of ω.
What remains is to verify that for all C ∈ C and B ∈ B, P (C ∩ X −1 (B)) = ∫_C µX|C (B) dP .
But the left side of this is

    P (C ∩ Y −1 (φ(B))) = ∫_C µY |C (φ(B)) dP = ∫_C µX|C (B) dP.

Besides (IR, B 1 ) and measurable subspaces, what else are Borel spaces?

• Finite and countable products of Borel spaces are Borel spaces.

• Complete separable metric spaces (Polish spaces) are Borel spaces.

• The collection of all continuous functions on the closed unit interval with the L∞ norm
and Borel σ-field is a Borel space.

These results are proven in Section B.3.2 of Schervish (1995) Theory of Statistics. Here is
one example.

Example 44. (Bimeasurable Function from (0, 1)∞ to a Subset of (0, 1)) For each
x ∈ (0, 1), define

    y0 (x) = x,
    zj (x) = 1 if 2yj−1 (x) ≥ 1 and 0 if not,   for j = 1, 2, . . .,
    yj (x) = 2yj−1 (x) − zj (x),   for j = 1, 2, . . . .

This makes zj (x) the jth bit in the binary expansion of x that has infinitely many 0’s. Also,
each zj is a measurable function. Construct the following array that contains each positive

integer once and only once:
    1   3   6  10  15  ···
    2   5   9  14  20  ···
    4   8  13  19  26  ···
    7  12  18  25  33  ···
    ⋮   ⋮   ⋮   ⋮   ⋮
Let ℓ(i, j) stand for the jth number from the top of the ith column. Now, define
    φ(x1 , x2 , . . .) = Σ_{i=1}^∞ Σ_{j=1}^∞ zj (xi ) / 2^{ℓ(i,j)} .

Intuitively, φ takes each xi and places its binary expansion down column i of a doubly infinite
array and then combines all the bits in the order of the array above into a single number.
It is easy to see that φ is a limit of measurable functions, so it is measurable. Its inverse is
φ−1 = (g1 , g2 , . . .) where
    gi (x) = Σ_{j=1}^∞ zℓ(i,j) (x) / 2^j .

This definition makes zj (gi (x)) = zℓ(i,j) (x), confirming that φ(g1 (x), g2 (x), . . .) = x. Also,
each gi is measurable, so φ−1 is measurable. The range R of the function φ is all elements of
(0, 1) except ∪_{i=1}^∞ Bi , where Bi is defined as follows. For each finite subset I of {ℓ(i, j)}_{j=1}^∞ ,
let Ci,I be the set of all x that have 1’s in the bits of their binary expansions corresponding
to all coordinates not in I. Then Bi = ∪_I Ci,I . Since there are only countably many finite
subsets I of each {ℓ(i, j)}_{j=1}^∞ , Bi is a countable union of sets. Each Ci,I is measurable, so R
is measurable.
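For concreteness, here is a finite-precision sketch of the interleaving map (my own code; it truncates each coordinate to m bits and uses exact dyadic arithmetic, so it only illustrates the construction rather than implementing φ on all of (0, 1)∞):

```python
from fractions import Fraction

# Sketch of Example 44 with finitely many coordinates and finitely many bits:
# place bit j of x_i at position l(i, j) = (i+j-1)(i+j-2)/2 + i and read it
# back out.  Exact dyadic arithmetic (Fraction) avoids rounding issues.

def bits(x, m):
    """First m binary digits z_1(x), ..., z_m(x) of x in (0, 1)."""
    out, y = [], x
    for _ in range(m):
        y *= 2
        b = 1 if y >= 1 else 0
        out.append(b)
        y -= b
    return out

def ell(i, j):
    return (i + j - 1) * (i + j - 2) // 2 + i

def interleave(xs, m):
    """Truncated phi: bit j of x_i contributes 2**(-ell(i, j))."""
    total = Fraction(0)
    for i, x in enumerate(xs, start=1):
        for j, b in enumerate(bits(x, m), start=1):
            total += Fraction(b, 2 ** ell(i, j))
    return total

def extract(y, i, m, k):
    """Truncated g_i: read the bits of y sitting at positions ell(i, 1..m)."""
    zs = bits(y, k)
    return sum(Fraction(zs[ell(i, j) - 1], 2 ** j) for j in range(1, m + 1))

xs = [Fraction(1, 3), Fraction(5, 8), Fraction(2, 7)]
m = 12                              # bits kept per coordinate
y = interleave(xs, m)
k = ell(len(xs), m)                 # enough bits of y to cover every slot
for i, x in enumerate(xs, start=1):
    print(float(x), float(extract(y, i, m, k)))   # agree to within 2**-m
```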

One can extend Example 44 to IR∞ by first mapping IR∞ to (0, 1)∞ using a strictly increasing
cdf on each coordinate. Also, a slightly simpler argument can map (0, 1)k into (0, 1), so that
IRk is a Borel space. Indeed, the argument in Example 44 proves that products of Borel
spaces are Borel spaces. (Just map each Borel space into IR first and then into (0, 1) and
then apply the example to the product.)

