MASSACHUSETTS INSTITUTE OF TECHNOLOGY

6.436J/15.085J Fall 2018


Lecture 6

MORE ON DISCRETE RANDOM VARIABLES AND THEIR EXPECTATIONS

Contents

1. Comments on expected values


2. Expected values of some common random variables
3. Covariance and correlation
4. Indicator variables and the inclusion-exclusion formula
5. Conditional expectations

1 COMMENTS ON EXPECTED VALUES


(a) Recall that E[X] is well defined unless both sums ∑_{x: x<0} x pX(x) and
∑_{x: x>0} x pX(x) are infinite. Furthermore, E[X] is well-defined and finite if
and only if both sums are finite. This is the same as requiring that

E[|X|] = ∑_x |x| pX(x) < ∞.

Random variables that satisfy this condition are called integrable.


(b) Note that for any random variable X, E[X^2] is always well-defined (whether
finite or infinite), because all the terms in the sum ∑_x x^2 pX(x) are nonnegative.
If we have E[X^2] < ∞, we say that X is square integrable.

(c) Using the inequality |x| ≤ 1 + x^2, we have E[|X|] ≤ 1 + E[X^2], which shows
that a square integrable random variable is always integrable. Similarly, for every
positive integer r, if E[|X|^r] is finite, then E[|X|^l] is also finite for every
positive integer l < r (fill in the details).

Exercise 1. Recall that the r-th central moment of a random variable X is
E[(X − E[X])^r]. Show that if the r-th central moment of an almost surely
non-negative random variable X is finite, then its l-th central moment is
also finite for every l < r.

(d) Because of the formula var(X) = E[X^2] − (E[X])^2, we see that: (i) if X is
square integrable, the variance is finite; (ii) if X is integrable, but not square
integrable, the variance is infinite; (iii) if X is not integrable, the variance is
undefined.

2 EXPECTED VALUES OF SOME COMMON RANDOM VARIABLES

In this section, we use either the definition or the properties of expectations to


calculate the mean and variance of a few common discrete random variables.

(a) Bernoulli(p). Let X be a Bernoulli random variable with parameter p.
Then,

E[X] = 1 · p + 0 · (1 − p) = p,
var(X) = E[X^2] − (E[X])^2 = 1^2 · p + 0^2 · (1 − p) − p^2 = p(1 − p).

(b) Binomial(n, p). Let X be a binomial random variable with parameters n
and p. We note that X can be expressed in the form X = ∑_{i=1}^n Xi, where
X1, . . . , Xn are independent Bernoulli random variables with a common
parameter p. It follows that

E[X] = ∑_{i=1}^n E[Xi] = np.

Furthermore, using the independence of the random variables Xi, we have

var(X) = ∑_{i=1}^n var(Xi) = np(1 − p).

(c) Geometric(p). Let X be a geometric random variable with parameter p.
We will use the formula E[X] = ∑_{n=0}^∞ P(X > n). We observe that

P(X > n) = ∑_{j=n+1}^∞ (1 − p)^{j−1} p = (1 − p)^n,

which implies that

E[X] = ∑_{n=0}^∞ (1 − p)^n = 1/p.

The variance of X is given by

var(X) = (1 − p)/p^2,

but we defer the derivation to a later section.
(d) Poisson(λ). Let X be a Poisson random variable with parameter λ. A direct
calculation yields

E[X] = e^{−λ} ∑_{n=0}^∞ n λ^n/n!
     = e^{−λ} ∑_{n=1}^∞ n λ^n/n!
     = e^{−λ} ∑_{n=1}^∞ λ^n/(n − 1)!
     = λ e^{−λ} ∑_{n=0}^∞ λ^n/n!
     = λ.

The variance of X turns out to satisfy var(X) = λ, but we defer the deriva-
tion to a later section. We note, however, that the mean and the variance of a
Poisson random variable are exactly what one would expect, on the basis of
the formulae for the mean and variance of a binomial random variable, and
taking the limit as n → ∞, p → 0, while keeping np fixed at λ.
(e) Power(α). Let X be a random variable with a power law distribution with
parameter α. We have

E[X] = ∑_{k=0}^∞ P(X > k) = ∑_{k=0}^∞ 1/(k + 1)^α.

If α ≤ 1, the expected value is seen to be infinite. For α > 1, the sum is
finite, but a closed form expression is not available; it is known as the
Riemann zeta function, and is denoted by ζ(α).
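The means computed above can also be checked numerically. The sketch below (not part of the original notes) compares Monte Carlo estimates against the formulas E[X] = p, np, 1/p, and λ; the sample size, the parameter values, and the sampler names are arbitrary illustrative choices.

    import math
    import random

    random.seed(0)
    N = 200_000                      # number of simulated samples (arbitrary)
    p, n, lam = 0.3, 10, 4.0         # arbitrary parameter values

    def bernoulli():
        return 1 if random.random() < p else 0

    def binomial():
        return sum(bernoulli() for _ in range(n))

    def geometric():
        # number of independent Bernoulli(p) trials until the first success
        k = 1
        while random.random() >= p:
            k += 1
        return k

    def poisson():
        # inverse-transform sampling from the Poisson(lam) PMF
        u, k = random.random(), 0
        prob = math.exp(-lam)        # P(X = 0)
        cdf = prob
        while u > cdf:
            k += 1
            prob *= lam / k          # P(X = k) from P(X = k - 1)
            cdf += prob
        return k

    for name, sampler, mean in [("Bernoulli", bernoulli, p),
                                ("Binomial", binomial, n * p),
                                ("Geometric", geometric, 1 / p),
                                ("Poisson", poisson, lam)]:
        est = sum(sampler() for _ in range(N)) / N
        print(f"{name:9s}  simulated mean {est:.3f}   formula {mean:.3f}")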

3 COVARIANCE AND CORRELATION

3.1 Covariance
The covariance of two square integrable random variables X and Y is denoted
by cov(X, Y ), and is defined by
cov(X, Y ) = E[(X − E[X])(Y − E[Y ])].

When cov(X, Y ) = 0, we say that X and Y are uncorrelated.


Note that, under the square integrability assumption, the covariance is al-
ways well-defined and finite. This is a consequence of the fact that |XY | ≤
(X 2 + Y 2 )/2, which implies that XY , as well as (X − E[X])(Y − E[Y ]), are
integrable.
Roughly speaking, a positive or negative covariance indicates that the values
of X − E[X] and Y − E[Y ] obtained in a single experiment “tend” to have the
same or the opposite sign, respectively. Thus, the sign of the covariance provides
an important qualitative indicator of the relation between X and Y .
We record a few properties of the covariance, which are immediate conse-
quences of its definition:
(a) cov(X, X) = var(X);
(b) cov(X, Y + a) = cov(X, Y );
(c) cov(X, Y ) = cov(Y, X);
(d) cov(X, aY + bZ) = a · cov(X, Y ) + b · cov(X, Z).
An alternative formula for the covariance is
cov(X, Y ) = E[XY ] − E[X] E[Y ],
as can be verified by a simple calculation. Recall from last lecture that if X
and Y are independent, we have E[XY ] = E[X] E[Y ], which implies that
cov(X, Y ) = 0. Thus, if X and Y are independent, they are also uncorrelated.
However, the reverse is not true, as illustrated by the following example.
Example. Suppose that the pair of random variables (X, Y ) takes the values (1, 0),
(0, 1), (−1, 0), and (0, −1), each with probability 1/4. Thus, the marginal PMFs of X
and Y are symmetric around 0, and E[X] = E[Y ] = 0. Furthermore, for all possible
value pairs (x, y), either x or y is equal to 0, which implies that XY = 0 and E[XY ] =
0. Therefore,
cov(X, Y ) = E[XY ] − E[X] E[Y ] = 0,
and X and Y are uncorrelated. However, X and Y are not independent since, for
example, a nonzero value of X fixes the value of Y to zero.
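As a quick sanity check of this example, one can compute E[X], E[Y ], and E[XY ] directly from the four-point joint PMF. The sketch below (an illustration, not part of the notes) does so, and also exhibits the dependence by comparing a joint probability with the product of the corresponding marginal probabilities.

    # Joint PMF of (X, Y): four equally likely points
    pmf = {(1, 0): 0.25, (0, 1): 0.25, (-1, 0): 0.25, (0, -1): 0.25}

    EX = sum(x * q for (x, y), q in pmf.items())
    EY = sum(y * q for (x, y), q in pmf.items())
    EXY = sum(x * y * q for (x, y), q in pmf.items())
    print("cov(X, Y) =", EXY - EX * EY)                   # 0.0: uncorrelated

    # Dependence: P(X = 1, Y = 0) differs from P(X = 1) * P(Y = 0)
    pX1 = sum(q for (x, y), q in pmf.items() if x == 1)   # 1/4
    pY0 = sum(q for (x, y), q in pmf.items() if y == 0)   # 1/2
    print(pmf[(1, 0)], "vs", pX1 * pY0)                   # 0.25 vs 0.125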

3.2 Variance of the sum of random variables
The covariance can be used to obtain a formula for the variance of the sum of
several (not necessarily independent) random variables. In particular, if X1 , X2 ,
. . . , Xn are random variables with finite variance, we have

var(X1 + X2 ) = var(X1 ) + var(X2 ) + 2cov(X1 , X2 ),

and, more generally,


var(∑_{i=1}^n Xi) = ∑_{i=1}^n var(Xi) + 2 ∑_{i=1}^{n−1} ∑_{j=i+1}^n cov(Xi, Xj).

This can be seen from the following calculation, where for brevity, we denote
X̃i = Xi − E[Xi ]:
var(∑_{i=1}^n Xi) = E[(∑_{i=1}^n X̃i)^2]
                  = E[∑_{i=1}^n ∑_{j=1}^n X̃i X̃j]
                  = ∑_{i=1}^n ∑_{j=1}^n E[X̃i X̃j]
                  = ∑_{i=1}^n E[X̃i^2] + 2 ∑_{i=1}^{n−1} ∑_{j=i+1}^n E[X̃i X̃j]
                  = ∑_{i=1}^n var(Xi) + 2 ∑_{i=1}^{n−1} ∑_{j=i+1}^n cov(Xi, Xj).
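The identity can be confirmed on a small concrete joint distribution. The following sketch (an illustration, not from the notes) builds three dependent {0, 1}-valued variables from two fair coin flips and checks that the variance of their sum equals the sum of the variances plus twice the pairwise covariances.

    from itertools import product

    # Sample space: two fair coin flips; X3 is built from X1 and X2, so the
    # three variables are dependent (arbitrary illustrative construction).
    outcomes = list(product([0, 1], repeat=2))
    prob = {w: 0.25 for w in outcomes}
    X = [lambda w: w[0], lambda w: w[1], lambda w: w[0] * w[1]]

    def E(f):
        return sum(f(w) * prob[w] for w in outcomes)

    def cov(f, g):
        return E(lambda w: f(w) * g(w)) - E(f) * E(g)

    def S(w):
        return sum(Xi(w) for Xi in X)

    lhs = cov(S, S)                                        # variance of the sum
    rhs = sum(cov(Xi, Xi) for Xi in X) + 2 * sum(
        cov(X[i], X[j]) for i in range(3) for j in range(i + 1, 3))
    print(lhs, rhs)                                        # both 1.1875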

3.3 Correlation coefficient


The correlation coefficient ρ(X, Y ) of two random variables X and Y that
have nonzero and finite variances is defined as
ρ(X, Y ) = cov(X, Y ) / √(var(X) var(Y )).
(The simpler notation ρ will also be used when X and Y are clear from the
context.) It may be viewed as a normalized version of the covariance cov(X, Y ).

Theorem 1. Let X and Y be discrete random variables with positive vari-
ance, and correlation coefficient equal to ρ.

(a) We have −1 ≤ ρ ≤ 1.
(b) We have ρ = 1 (respectively, ρ = −1) if and only if there exists a positive
(respectively, negative) constant a such that Y −E[Y ] = a(X −E[X]), with
probability 1.

The proof of Theorem 1 relies on the Schwarz (or Cauchy-Schwarz) inequality,
given below.

Proposition 1. (Cauchy-Schwarz inequality) For any two random variables,
X and Y , with finite variance, we have

(E[XY ])^2 ≤ E[X^2] E[Y^2].

Proof: Let us assume that E[Y^2] ≠ 0; otherwise, we have Y = 0 with probability 1,
and hence E[XY ] = 0, so the inequality holds. We have

0 ≤ E[(X − (E[XY ]/E[Y^2]) Y)^2]
  = E[X^2 − 2 (E[XY ]/E[Y^2]) XY + (E[XY ]/E[Y^2])^2 Y^2]
  = E[X^2] − 2 (E[XY ]/E[Y^2]) E[XY ] + (E[XY ]/E[Y^2])^2 E[Y^2]
  = E[X^2] − (E[XY ])^2/E[Y^2],

i.e., (E[XY ])^2 ≤ E[X^2] E[Y^2].

Proof of Theorem 1:

(a) Let X̃ = X − E[X] and Ỹ = Y − E[Y ]. Using the Schwarz inequality, we
get

ρ(X, Y )^2 = (E[X̃Ỹ ])^2 / (E[X̃^2] E[Ỹ^2]) ≤ 1,

and hence |ρ(X, Y )| ≤ 1.

(b) One direction is straightforward. If Ỹ = aX̃, then

ρ(X, Y ) = E[X̃ · aX̃] / √(E[X̃^2] E[(aX̃)^2]) = a/|a|,

which equals 1 or −1 depending on whether a is positive or negative.

To establish the reverse direction, let us assume that ρ(X, Y )^2 = 1, which
implies that E[X̃^2] E[Ỹ^2] = (E[X̃Ỹ ])^2. Using the inequality established in
the proof of Proposition 1, we conclude that the random variable

X̃ − (E[X̃Ỹ ]/E[Ỹ^2]) Ỹ

is equal to zero, with probability 1. It follows that, with probability 1,

X̃ = (E[X̃Ỹ ]/E[Ỹ^2]) Ỹ = ρ(X, Y ) √(E[X̃^2]/E[Ỹ^2]) Ỹ.

Note that the sign of the constant ratio of X̃ and Ỹ is determined by the sign
of ρ(X, Y ), as claimed.

Example. Consider n independent tosses of a coin with probability of a head equal to


p. Let X and Y be the numbers of heads and of tails, respectively, and let us look at the
correlation coefficient of X and Y . Here, we have X + Y = n, and also E[X] + E[Y ] =
n. Thus,
X − E[X] = −(Y − E[Y ]).
We will calculate the correlation coefficient of X and Y , and verify that it is indeed
equal to −1.
We have
cov(X, Y ) = E[(X − E[X])(Y − E[Y ])]
           = −E[(X − E[X])^2]
           = −var(X).

Hence, the correlation coefficient is


ρ(X, Y ) = cov(X, Y ) / √(var(X) var(Y )) = −var(X) / √(var(X) var(X)) = −1.
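A direct simulation illustrates the same point. The sketch below (not part of the notes; the values of n, p, and the number of trials are arbitrary) estimates ρ(X, Y ) from simulated coin tosses and obtains −1 up to floating-point rounding, since Y = n − X holds exactly in every trial.

    import random

    random.seed(1)
    n, p, trials = 20, 0.6, 100_000

    xs, ys = [], []
    for _ in range(trials):
        heads = sum(random.random() < p for _ in range(n))
        xs.append(heads)
        ys.append(n - heads)               # tails = n - heads

    def mean(v):
        return sum(v) / len(v)

    mx, my = mean(xs), mean(ys)
    cov = mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])
    var_x = mean([(x - mx) ** 2 for x in xs])
    var_y = mean([(y - my) ** 2 for y in ys])
    print("rho =", cov / (var_x * var_y) ** 0.5)   # -1.0 (up to rounding)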

4 INDICATOR VARIABLES AND THE INCLUSION-EXCLUSION FOR-
MULA

Indicator functions are special discrete random variables that can be useful in
simplifying certain derivations or proofs. In this section, we develop the inclusion-
exclusion formula and apply it to a matching problem.
Recall that with every event A, we can associate its indicator function,
which is a discrete random variable IA : Ω → {0, 1}, defined by IA (ω) = 1 if
ω ∈ A, and IA (ω) = 0 otherwise. Note that IAc = 1 − IA and that E[IA ] =
P(A). These simple observations, together with the linearity of expectations,
turn out to be quite useful.

4.1 The inclusion-exclusion formula


Note that IA∩B = IA IB , for every A, B ∈ F. Therefore,

IA∪B = 1 − I(A∪B)c = 1 − IAc ∩B c = 1 − IAc IB c


= 1 − (1 − IA )(1 − IB ) = IA + IB − IA IB .

Taking expectations of both sides, we obtain

P(A ∪ B) = P(A) + P(B) − P(A ∩ B),

an already familiar formula.


We now derive a generalization, known as the inclusion-exclusion formula.
Suppose we have a collection of events Aj, j = 1, . . . , n, and that we are
interested in the probability of the event B = ∪_{j=1}^n Aj. Note that

IB = 1 − ∏_{j=1}^n (1 − IAj).

We begin with the easily verifiable fact that for any real numbers a1 , . . . , an , we
have
∏_{j=1}^n (1 − aj) = 1 − ∑_{1≤j≤n} aj + ∑_{1≤i<j≤n} ai aj − ∑_{1≤i<j<k≤n} ai aj ak
                     + · · · + (−1)^n a1 · · · an.

We replace aj by IAj, and then take expectations of both sides, to obtain

P(B) = ∑_{1≤j≤n} P(Aj) − ∑_{1≤i<j≤n} P(Ai ∩ Aj) + ∑_{1≤i<j<k≤n} P(Ai ∩ Aj ∩ Ak)
       − · · · + (−1)^{n+1} P(A1 ∩ · · · ∩ An).
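The formula is easy to verify computationally on a small finite probability space. The following sketch (an illustration, not from the notes; the die and the three events are arbitrary choices) compares the alternating sum against a direct computation of the probability of the union.

    from itertools import combinations

    # Probability space: a fair six-sided die
    omega = range(1, 7)
    def P(event):
        return sum(1 for w in omega if w in event) / 6

    A = [{1, 2, 3}, {2, 4}, {3, 4, 5}]               # three arbitrary events
    n = len(A)

    # Direct computation of P(A1 u A2 u A3)
    direct = P(set().union(*A))

    # Inclusion-exclusion: alternating sum over nonempty index subsets
    incl_excl = 0.0
    for k in range(1, n + 1):
        for idx in combinations(range(n), k):
            inter = set.intersection(*(A[i] for i in idx))
            incl_excl += (-1) ** (k + 1) * P(inter)

    print(direct, incl_excl)                         # both 5/6 (up to rounding)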

4.2 The matching problem


Suppose that n people throw their hats in a box, where n ≥ 2, and then each
person picks one hat at random. (Each hat will be picked by exactly one person.)
We interpret “at random” to mean that every permutation of the n hats is equally
likely, and therefore has probability 1/n!.
In an alternative model, we can visualize the experiment sequentially: the
first person picks one of the n hats, with all hats being equally likely; then,
the second person picks one of the n − 1 remaining hats, with every
remaining hat being equally likely, etc. It can be verified that under this second
model, every permutation has probability 1/n!, so the two models are equivalent.
We are interested in the mean, variance, and PMF of a random variable X,
defined as the number of people that get back their own hat.1 This problem is
best approached using indicator variables.
For the ith person, we introduce a random variable Xi that takes the value 1
if the person selects his/her own hat, and takes the value 0 otherwise. Note that
X = X1 + X2 + · · · + Xn .
Since P(Xi = 1) = 1/n and P(Xi = 0) = 1 − 1/n, the mean of Xi is

E[Xi] = 1 · (1/n) + 0 · (1 − 1/n) = 1/n,
which implies that

E[X] = E[X1] + E[X2] + · · · + E[Xn] = n · (1/n) = 1.
In order to find the variance of X, we first find the variance and covariances
of the random variables Xi. We have

var(Xi) = (1/n)(1 − 1/n).
1. For more results on various extensions of the matching problem, see L.A. Zager and G.C.
Verghese, “Caps and robbers: what can you expect?,” College Mathematics Journal, v. 38, n. 3,
2007, pp. 185–191.

For i ≠ j, we have

cov(Xi, Xj) = E[(Xi − E[Xi])(Xj − E[Xj])]
            = E[Xi Xj] − E[Xi] E[Xj]
            = P(Xi = 1 and Xj = 1) − P(Xi = 1)P(Xj = 1)
            = P(Xi = 1)P(Xj = 1 | Xi = 1) − P(Xi = 1)P(Xj = 1)
            = (1/n) · (1/(n − 1)) − 1/n^2
            = 1/(n^2(n − 1)).

Therefore,

var(X) = var(∑_{i=1}^n Xi)
       = ∑_{i=1}^n var(Xi) + 2 ∑_{i=1}^{n−1} ∑_{j=i+1}^n cov(Xi, Xj)
       = n · (1/n)(1 − 1/n) + 2 · (n(n − 1)/2) · 1/(n^2(n − 1))
       = 1.
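Both conclusions are easy to check by simulation. The sketch below (not part of the notes; n and the number of trials are arbitrary) shuffles n hats uniformly at random and estimates the mean and variance of the number of matches, both of which come out close to 1.

    import random

    random.seed(2)
    n, trials = 10, 200_000

    counts = []
    for _ in range(trials):
        perm = list(range(n))
        random.shuffle(perm)            # a uniformly random hat assignment
        counts.append(sum(perm[i] == i for i in range(n)))   # number of matches

    m = sum(counts) / trials
    v = sum((c - m) ** 2 for c in counts) / trials
    print("mean ~", m, "  variance ~", v)   # both close to 1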

Finding the PMF of X is a little harder. Let us first dispense with some
easy cases. We have P(X = n) = 1/n!, because there is only one permutation (out of
the n! possible) under which every person receives their own hat.
Furthermore, the event X = n − 1 is impossible: if n − 1 persons have received
their own hat, the remaining person must also have received their own hat.
Let us continue by finding the probability that X = 0. Let Ai be the event
that the ith person gets their own hat, i.e., Xi = 1. Note that the event X = 0
is the same as the event ∩_i Ai^c. Thus, P(X = 0) = 1 − P(∪_{i=1}^n Ai). Using the
inclusion-exclusion formula, we have

P(∪_{i=1}^n Ai) = ∑_i P(Ai) − ∑_{i<j} P(Ai ∩ Aj) + ∑_{i<j<k} P(Ai ∩ Aj ∩ Ak) − · · · .

Observe that for any fixed distinct indices i1, i2, . . . , ik, we have

P(Ai1 ∩ Ai2 ∩ · · · ∩ Aik) = (1/n) · (1/(n − 1)) · · · (1/(n − k + 1)) = (n − k)!/n!.    (1)

Thus,

P(∪_{i=1}^n Ai) = n · (1/n) − (n choose 2) (n − 2)!/n! + (n choose 3) (n − 3)!/n!
                  − · · · + (−1)^{n+1} (n choose n) (n − n)!/n!
                = 1 − 1/2! + 1/3! − · · · + (−1)^{n+1} 1/n!.
We conclude that
P(X = 0) = 1/2! − 1/3! + · · · + (−1)^n 1/n!.    (2)
Note that P(X = 0) → e−1 , as n → ∞.
To conclude, let us now fix some integer r, with 0 < r ≤ n−2, and calculate
P(X = r). The event {X = r} can only occur as follows: for some subset S of
{1, . . . , n}, of cardinality r, the following two events, BS and CS , occur:

BS : for every i ∈ S, person i receives their own hat;


CS : for every i ∉ S, person i does not receive their own hat.

We then have

{X = r} = ∪_{S: |S|=r} (BS ∩ CS).

The events BS ∩ CS for different subsets S are disjoint. Furthermore, by
symmetry, P(BS ∩ CS) is the same for every S of cardinality r. Thus,

P(X = r) = ∑_{S: |S|=r} P(BS ∩ CS)
         = (n choose r) P(BS) P(CS | BS).

Note that

P(BS) = (n − r)!/n!,
by the same argument as in Eq. (1). Conditioned on the event that the r persons
in the set S have received their own hats, the event CS will materialize if and
only if none of the remaining n − r persons receive their own hat. But this is
the same situation as the one analyzed when we calculated the probability that
X = 0, except that n needs to be replaced by n − r. We conclude that
P(CS | BS) = 1/2! − 1/3! + · · · + (−1)^{n−r} 1/(n − r)!.

Putting everything together, we conclude that

P(X = r) = (n choose r) ((n − r)!/n!) (1/2! − 1/3! + · · · + (−1)^{n−r} 1/(n − r)!)
         = (1/r!) (1/2! − 1/3! + · · · + (−1)^{n−r} 1/(n − r)!).
Note that for each fixed r, the probability P(X = r) converges to e−1 /r!,
as n → ∞, which corresponds to a Poisson distribution with parameter 1. An
intuitive justification is as follows. The random variables Xi are not independent
(in particular, their covariance is nonzero). On the other hand, as n → ∞, they
are “approximately independent”. Furthermore, the success probability for each
person is 1/n, and the situation is similar to the one in our earlier proof that the
binomial PMF approaches the Poisson PMF.
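This convergence is already visible for moderate n. The sketch below (not from the notes; n = 12 is an arbitrary choice) evaluates the exact matching PMF derived above, using the fact that 1/2! − 1/3! + · · · + (−1)^{n−r}/(n − r)! equals the alternating sum ∑_{k=0}^{n−r} (−1)^k/k!, and compares it with the Poisson(1) PMF.

    import math

    def matching_pmf(r, n):
        # P(X = r) = (1/r!) * sum_{k=0}^{n-r} (-1)^k / k!
        # (the k = 0 and k = 1 terms cancel, leaving 1/2! - 1/3! + ...)
        return (1 / math.factorial(r)) * sum(
            (-1) ** k / math.factorial(k) for k in range(n - r + 1))

    n = 12
    for r in range(5):
        exact = matching_pmf(r, n)
        poisson = math.exp(-1) / math.factorial(r)
        print(f"r={r}:  matching {exact:.6f}   Poisson(1) {poisson:.6f}")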

5 CONDITIONAL EXPECTATIONS

We have already defined the notion of a conditional PMF, pX | Y ( · | y), given


the value of a random variable Y . Similarly, given an event A, we can define a
conditional PMF pX|A , by letting pX|A (x) = P(X = x | A). In either case, the
conditional PMF, as a function of x, is a bona fide PMF (a nonnegative function
that sums to one). As such, it is natural to associate a (conditional) expectation
to the (conditional) PMF.

Definition 1. Given an event A, such that P(A) > 0, and a discrete random
variable X, the conditional expectation of X given A is defined as

E[X | A] = ∑_x x pX|A(x),

provided that the sum is well-defined.

Note that the preceding also provides a definition for a conditional expecta-
tion of the form E[X | Y = y], for any y such that pY (y) > 0: just let A be the
event {Y = y}, which yields

E[X | Y = y] = ∑_x x pX|Y(x | y).

We note that the conditional expectation is always well defined when either
the random variable X is nonnegative, or when the random variable X is inte-
grable. In particular, whenever E[|X|] < ∞, we also have E[|X| | Y = y] < ∞,

for every y such that pY(y) > 0. To verify the latter assertion, note that for
every y such that pY(y) > 0, we have

∑_x |x| pX|Y(x | y) = ∑_x |x| pX,Y(x, y)/pY(y) ≤ (1/pY(y)) ∑_x |x| pX(x) = E[|X|]/pY(y).

The converse, however, is not true: it is possible that E[|X| | Y = y] is finite


for every y that has positive probability, while E[|X|] = ∞. This is left as an
exercise.
The conditional expectation is essentially the same as an ordinary expecta-
tion, except that the original PMF is replaced by the conditional PMF. As such,
the conditional expectation inherits all the properties of ordinary expectations
(cf. Proposition 4 in the notes for Lecture 6).

5.1 The total expectation theorem


A simple calculation yields

∑_y E[X | Y = y] pY(y) = ∑_y ∑_x x pX|Y(x | y) pY(y)
                       = ∑_y ∑_x x pX,Y(x, y)
                       = E[X].
Note that this calculation is rigorous if X is nonnegative or integrable.
Suppose now that {Ai } is a countable family of disjoint events that forms a
partition of the probability space Ω. Define a random variable Y by letting Y = i
if and only if Ai occurs. Then, pY (i) = P(Ai ), and E[X | Y = i] = E[X | Ai ],
which yields

E[X] = ∑_i E[X | Ai] P(Ai).
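The computation above is easy to reproduce on a concrete joint PMF. The sketch below (an illustration, not part of the notes; the PMF values are arbitrary) computes E[X | Y = y] for each y and verifies that ∑_y E[X | Y = y] pY(y) = E[X].

    from collections import defaultdict

    # An arbitrary small joint PMF p_{X,Y}(x, y)
    pXY = {(0, 0): 0.20, (1, 0): 0.20, (2, 0): 0.10,
           (0, 1): 0.05, (1, 1): 0.15, (2, 1): 0.30}

    pY = defaultdict(float)
    for (x, y), q in pXY.items():
        pY[y] += q

    def cond_exp(y):
        # E[X | Y = y] = sum_x x * p_{X,Y}(x, y) / p_Y(y)
        return sum(x * q for (x, yy), q in pXY.items() if yy == y) / pY[y]

    EX = sum(x * q for (x, y), q in pXY.items())
    total = sum(cond_exp(y) * pY[y] for y in pY)
    print(EX, total)          # both 1.15: the total expectation theorem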

Example. (The mean of the geometric.) Let X be a geometric random variable with
parameter p, so that pX(k) = (1 − p)^{k−1} p, for k ∈ N. We first observe that the
geometric distribution is memoryless: for k ∈ N, we have
P(X − 1 = k | X > 1) = P(X = k + 1, X > 1)/P(X > 1)
                     = P(X = k + 1)/P(X > 1)
                     = (1 − p)^k p/(1 − p) = (1 − p)^{k−1} p
                     = P(X = k).

In words, in a sequence of repeated i.i.d. trials, given that the first trial was a failure,
the distribution of the remaining number of trials, X − 1, until the first success is the
same as the unconditional distribution of the number of trials, X, until the first success.
In particular, E[X − 1 | X > 1] = E[X].
Using the total expectation theorem, we can write
E[X] = E[X | X > 1]P(X > 1) + E[X | X = 1]P(X = 1) = (1 + E[X])(1 − p) + 1 · p.
We solve for E[X], and find that E[X] = 1/p.
Similarly,
E[X^2] = E[X^2 | X > 1]P(X > 1) + E[X^2 | X = 1]P(X = 1).
Note that
E[X^2 | X > 1] = E[(X − 1)^2 | X > 1] + E[2(X − 1) + 1 | X > 1] = E[X^2] + (2/p) + 1.
Thus,
E[X^2] = (1 − p)(E[X^2] + (2/p) + 1) + p,
which yields
E[X^2] = 2/p^2 − 1/p.
We conclude that
var(X) = E[X^2] − (E[X])^2 = 2/p^2 − 1/p − 1/p^2 = (1 − p)/p^2.
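Both formulas can be checked numerically by truncating the PMF at a large cutoff. This sketch (not from the notes; p and the cutoff K are arbitrary choices) computes E[X] and E[X^2] from the truncated sums and compares the results with 1/p and (1 − p)/p^2.

    p = 0.3
    K = 10_000                               # truncation point; the tail is negligible

    pmf = [(1 - p) ** (k - 1) * p for k in range(1, K + 1)]
    EX = sum(k * q for k, q in enumerate(pmf, start=1))
    EX2 = sum(k * k * q for k, q in enumerate(pmf, start=1))

    print(EX, 1 / p)                         # ~3.333 both
    print(EX2 - EX ** 2, (1 - p) / p ** 2)   # ~7.778 both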

Example. Suppose we flip a biased coin N times, independently, where N is a Poisson


random variable with parameter λ. The probability of heads at each flip is p. Let X be
the number of heads, and let Y be the number of tails. Then,
E[X | N = n] = ∑_{m=0}^∞ m P(X = m | N = n) = ∑_{m=0}^n m (n choose m) p^m (1 − p)^{n−m}.

But this is just the expected number of heads in n independent trials, so that E[X | N = n] = np.
Let us now calculate E[N | X = m]. We have

E[N | X = m] = ∑_{n=0}^∞ n P(N = n | X = m)
             = ∑_{n=m}^∞ n P(N = n, X = m)/P(X = m)
             = ∑_{n=m}^∞ n P(X = m | N = n) P(N = n)/P(X = m)
             = ∑_{n=m}^∞ n (n choose m) p^m (1 − p)^{n−m} (λ^n/n!) e^{−λ} / P(X = m).

Recall that X is distributed as Poisson(λp), so that P(X = m) = e^{−λp}(λp)^m/m!. Thus,
after some cancellations, we obtain

E[N | X = m] = ∑_{n=m}^∞ n (1 − p)^{n−m} λ^{n−m} e^{−λ(1−p)} / (n − m)!
             = ∑_{n=m}^∞ (n − m) (1 − p)^{n−m} λ^{n−m} e^{−λ(1−p)} / (n − m)!
               + m ∑_{n=m}^∞ (1 − p)^{n−m} λ^{n−m} e^{−λ(1−p)} / (n − m)!
             = λ(1 − p) + m.

A faster way of obtaining this result is as follows. From Theorem 3 in the notes for
Lecture 6, we have that X and Y are independent, and that Y is Poisson with parameter
λ(1 − p). Therefore,

E[N | X = m] = E[X | X = m] + E[Y | X = m] = m + E[Y ] = m + λ(1 − p).
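A simulation makes both conditional expectations concrete. The sketch below (not part of the notes; λ, p, the conditioning values, and the sampler are arbitrary illustrative choices) draws N from a Poisson distribution, flips N coins, and estimates E[X | N = n] and E[N | X = m] empirically.

    import math
    import random

    random.seed(3)
    lam, p, trials = 6.0, 0.4, 300_000

    def sample_poisson(mu):
        # Knuth's multiplication method for a Poisson(mu) sample
        L = math.exp(-mu)
        k, prod = 0, 1.0
        while True:
            prod *= random.random()
            if prod <= L:
                return k
            k += 1

    pairs = []
    for _ in range(trials):
        n = sample_poisson(lam)
        x = sum(random.random() < p for _ in range(n))   # heads among N flips
        pairs.append((n, x))

    n0 = 6                                    # condition on N = n0
    xs = [x for n, x in pairs if n == n0]
    print("E[X | N = %d] ~ %.3f  (formula %.3f)" % (n0, sum(xs) / len(xs), n0 * p))

    m0 = 2                                    # condition on X = m0
    ns = [n for n, x in pairs if x == m0]
    print("E[N | X = %d] ~ %.3f  (formula %.3f)" % (m0, sum(ns) / len(ns), m0 + lam * (1 - p)))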

Exercise. (Simpson’s “paradox”) Let S be an event and X, Y discrete random variables,


all defined on a common probability space. Show that

P[S|X = 0, Y = y] > P[S|X = 1, Y = y] ∀y

does not imply


P[S|X = 0] ≥ P[S|X = 1] .
Thus, in a clinical trial comparing two treatments (indexed by X), a drug can be more
successful on each group of patients (indexed by Y ) yet be less successful overall.

5.2 The conditional expectation as a random variable


Let X and Y be two discrete random variables. For any fixed value of y, the
expression E[X | Y = y] is a real number, which however depends on y, and
can be used to define a function φ : R → R, by letting φ(y) = E[X | Y = y].
Consider now the random variable φ(Y ); this random variable takes the value
E[X | Y = y] whenever Y takes the value y, which happens with probability
P(Y = y). This random variable will be denoted as E[X | Y ]. (Strictly speak-
ing, one needs to verify that this is a measurable function, which is left as an
exercise.)
Example. Let us return to the last example and find E[X | N ] and E[N | X]. We found
that E[X | N = n] = np. Thus E[X | N ] = N p, i.e., it is a random variable that takes
the value np with probability P(N = n) = (λn /n!)e−λ . We found that E[N | X =
m] = λ(1 − p) + m. Thus E[N | X] = λ(1 − p) + X.

Note further that
E[E[X | N ]] = E[N p] = λp = E[X],
and
E[E[N | X]] = λ(1 − p) + E[X] = λ(1 − p) + λp = λ = E[N ].
This is not a coincidence; the equality E[E[X | Y ]] = E[X] is always true, as we shall
now see. In fact, this is just the total expectation theorem, written in more abstract
notation.

Theorem 2. Let g : R → R be a measurable function such that Xg(Y ) is


either nonnegative or integrable. Then,

E[E[X | Y ] g(Y )] = E[X g(Y )].

In particular, by letting g(y) = 1 for all y , we obtain E[E[X|Y ]] = E[X].

Proof: We have

E[E[X|Y ] g(Y )] = ∑_y E[X | Y = y] g(y) pY(y)
               = ∑_y ∑_x x pX|Y(x | y) g(y) pY(y)
               = ∑_{x,y} x g(y) pX,Y(x, y) = E[X g(Y )].
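The identity can be spot-checked on any small joint PMF. The sketch below (an illustration, not from the notes; the PMF and the function g are arbitrary) evaluates both sides of E[E[X | Y ] g(Y )] = E[X g(Y )] and confirms they agree.

    # Arbitrary joint PMF p_{X,Y} and arbitrary function g of Y
    pXY = {(1, 0): 0.15, (2, 0): 0.35, (1, 1): 0.30, (3, 1): 0.20}
    def g(y):
        return y ** 2 + 1

    pY = {}
    for (x, y), q in pXY.items():
        pY[y] = pY.get(y, 0.0) + q

    def E_X_given(y):
        return sum(x * q for (x, yy), q in pXY.items() if yy == y) / pY[y]

    lhs = sum(E_X_given(y) * g(y) * pY[y] for y in pY)
    rhs = sum(x * g(y) * q for (x, y), q in pXY.items())
    print(lhs, rhs)                          # both 2.65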

The formula in Theorem 2 can be rewritten in the form

E[(E[X | Y ] − X) g(Y )] = 0.    (3)
Here is an interpretation. We can think of E[X | Y ] as an estimate of X, on the
basis of Y , and E[X | Y ] − X as an estimation error. The above formula says
that the estimation error is uncorrelated with every function of the original data.
Equation (3) can be used as the basis for an abstract definition of conditional
expectations. Namely, we define the conditional expectation as a random vari-
able of the form φ(Y ), where φ is a measurable function, that has the property

E[(φ(Y ) − X) g(Y )] = 0,
for every measurable function g. The merit of this definition is that it can
be used for all kinds of random variables (discrete, continuous, mixed, etc.).
However, for this definition to be sound, there are two facts that need to be
verified:

(a) Existence: It turns out that as long as X is integrable, a function φ with the
above properties is guaranteed to exist. We already know that this is the
case for discrete random variables: the conditional expectation as defined in
the beginning of this section does have the desired properties. For general
random variables, this is a nontrivial and deep result. It will be revisited
later in this course.
(b) Uniqueness: It turns out that there is essentially only one function φ with
the above properties. More precisely, any two functions with the above
properties are equal with probability 1.

MIT OpenCourseWare
https://ocw.mit.edu

6.436J / 15.085J Fundamentals of Probability


Fall 2018

For information about citing these materials or our Terms of Use, visit: https://ocw.mit.edu/terms
