MIT 6.436J (Fall 2018), Lecture 6
Exercise 1. Recall that the r-th central moment of a random variable X is E[(X − E[X])^r]. Show that if the r-th central moment of an almost surely non-negative random variable X is finite, then its l-th central moment is also finite for every l < r.
(d) Because of the formula var(X) = E[X^2] − (E[X])^2, we see that: (i) if X is square integrable, the variance is finite; (ii) if X is integrable, but not square integrable, the variance is infinite; (iii) if X is not integrable, the variance is undefined.
E[X] = 1 · p + 0 · (1 − p) = p,
var(X) = E[X^2] − (E[X])^2 = 1^2 · p + 0^2 · (1 − p) − p^2 = p(1 − p).
For a geometric random variable X with parameter p, we have P(X > n) = (1 − p)^n for every n ≥ 0, which implies that
E[X] = Σ_{n=0}^∞ P(X > n) = Σ_{n=0}^∞ (1 − p)^n = 1/p.
Consider next a Poisson random variable X with parameter λ, for which a direct calculation gives E[X] = λ.
The variance of X turns out to satisfy var(X) = λ, but we defer the deriva-
tion to a later section. We note, however, that the mean and the variance of a
Poisson random variable are exactly what one would expect, on the basis of
the formulae for the mean and variance of a binomial random variable, and
taking the limit as n → ∞, p → 0, while keeping np fixed at λ.
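As a quick numerical illustration (not part of the original notes; the choice λ = 2 and the Python sketch below are ours), one can compare the Binomial(n, λ/n) PMF with the Pois(λ) PMF and watch the gap shrink as n grows:

# Numerical check (illustrative): Binomial(n, lam/n) PMF approaches the Poisson(lam) PMF.
from math import comb, exp, factorial

lam = 2.0  # arbitrary choice of lambda for the illustration

def binomial_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    return exp(-lam) * lam**k / factorial(k)

for n in [10, 100, 1000]:
    p = lam / n  # keep n*p fixed at lambda
    max_gap = max(abs(binomial_pmf(k, n, p) - poisson_pmf(k, lam)) for k in range(11))
    print(f"n = {n:5d}: max |Binomial - Poisson| over k <= 10 is {max_gap:.5f}")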
(e) Power(α). Let X be a random variable with a power law distribution with
parameter α. We have
E[X] = Σ_{k=0}^∞ P(X > k) = Σ_{k=0}^∞ 1/(k + 1)^α.
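In particular, the sum is finite precisely when α > 1. As a side illustration (the script and the parameter choices are ours, not from the notes), the following Python sketch tabulates partial sums for a few values of α; for α > 1 they stabilize, while for α ≤ 1 they keep growing:

# Partial sums of sum_{k=0}^{K} 1/(k+1)**alpha for a few alpha values.
# For alpha > 1 the partial sums stabilize (E[X] finite); for alpha <= 1 they keep growing.
def partial_sum(alpha, K):
    return sum(1.0 / (k + 1) ** alpha for k in range(K + 1))

for alpha in [0.5, 1.0, 1.5, 2.0]:
    sums = [partial_sum(alpha, K) for K in (10**2, 10**4, 10**6)]
    print(f"alpha = {alpha}: partial sums {['%.3f' % s for s in sums]}")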
3 COVARIANCE AND CORRELATION
3.1 Covariance
The covariance of two square integrable random variables X and Y is denoted
by cov(X, Y ), and is defined by
cov(X, Y) = E[(X − E[X])(Y − E[Y])].
3.2 Variance of the sum of random variables
The covariance can be used to obtain a formula for the variance of the sum of
several (not necessarily independent) random variables. In particular, if X1, X2, . . . , Xn are random variables with finite variance, we have
var(Σ_{i=1}^n Xi) = Σ_{i=1}^n var(Xi) + 2 Σ_{i=1}^{n−1} Σ_{j=i+1}^n cov(Xi, Xj).
This can be seen from the following calculation, where for brevity, we denote X̃i = Xi − E[Xi]:
var(Σ_{i=1}^n Xi) = E[(Σ_{i=1}^n X̃i)^2]
= E[Σ_{i=1}^n Σ_{j=1}^n X̃i X̃j]
= Σ_{i=1}^n Σ_{j=1}^n E[X̃i X̃j]
= Σ_{i=1}^n E[X̃i^2] + 2 Σ_{i=1}^{n−1} Σ_{j=i+1}^n E[X̃i X̃j]
= Σ_{i=1}^n var(Xi) + 2 Σ_{i=1}^{n−1} Σ_{j=i+1}^n cov(Xi, Xj).
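As a sanity check (illustrative only; it assumes numpy is available and uses arbitrarily chosen dependent variables), the following Python sketch estimates both sides of this identity from simulated data:

# Monte Carlo check of var(sum X_i) = sum var(X_i) + 2 * sum_{i<j} cov(X_i, X_j)
# on deliberately dependent random variables.
import numpy as np

rng = np.random.default_rng(0)
Z = rng.standard_normal((100_000, 3))
# Build three dependent variables X1, X2, X3 from common noise.
X = np.column_stack([Z[:, 0], Z[:, 0] + Z[:, 1], Z[:, 1] - 2 * Z[:, 2]])

lhs = X.sum(axis=1).var()
C = np.cov(X, rowvar=False, ddof=0)    # sample covariance matrix
rhs = np.trace(C) + 2 * np.sum(np.triu(C, k=1))
print(lhs, rhs)                        # the two numbers agree up to floating-point error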
Theorem 1. Let X and Y be discrete random variables with positive vari-
ance, and correlation coefficient equal to ρ.
(a) We have −1 ≤ ρ ≤ 1.
(b) We have ρ = 1 (respectively, ρ = −1) if and only if there exists a positive
(respectively, negative) constant a such that Y −E[Y ] = a(X −E[X]), with
probability 1.
Proof of Theorem 1:
(b) One direction is straightforward. If Ỹ = aX̃, then
ρ(X, Y) = E[X̃Ỹ] / (√(E[X̃^2]) √(E[Ỹ^2])) = E[a X̃^2] / (√(E[X̃^2]) √(E[(aX̃)^2])) = a/|a|.
Note that the sign of the constant ratio of X̃ and Ỹ is determined by the sign
of ρ(X, Y ), as claimed.
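A minimal numerical sketch of this fact (illustrative; the particular X, the offset 3.0, and the values of a are arbitrary choices, and numpy is assumed):

# rho(X, Y) = a/|a| when Y - E[Y] = a*(X - E[X]): the sample correlation is +1 or -1.
import numpy as np

rng = np.random.default_rng(1)
X = rng.integers(0, 10, size=50_000).astype(float)   # an arbitrary discrete X
for a in (2.5, -0.7):
    Y = 3.0 + a * (X - X.mean())                      # Y - E[Y] = a*(X - E[X])
    rho = np.corrcoef(X, Y)[0, 1]
    print(a, round(rho, 6))                           # prints 1.0 and -1.0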
4 INDICATOR VARIABLES AND THE INCLUSION-EXCLUSION FORMULA
Indicator functions are special discrete random variables that can be useful in
simplifying certain derivations or proofs. In this section, we develop the inclusion-
exclusion formula and apply it to a matching problem.
Recall that with every event A, we can associate its indicator function,
which is a discrete random variable IA : Ω → {0, 1}, defined by IA (ω) = 1 if
ω ∈ A, and IA (ω) = 0 otherwise. Note that IAc = 1 − IA and that E[IA ] =
P(A). These simple observations, together with the linearity of expectations, turn out to be quite useful.
We begin with the easily verifiable fact that for any real numbers a1 , . . . , an , we
have
∏_{j=1}^n (1 − aj) = 1 − Σ_{1≤j≤n} aj + Σ_{1≤i<j≤n} ai aj − Σ_{1≤i<j<k≤n} ai aj ak + · · · + (−1)^n a1 · · · an.
Now let A1, . . . , An be events, and let B = ∪_{j=1}^n Aj. Applying the identity above with aj = I_{Aj}, noting that ∏_{j=1}^n (1 − I_{Aj}) = I_{B^c} = 1 − I_B, and taking expectations of both sides, we obtain
P(B) = Σ_{1≤j≤n} P(Aj) − Σ_{1≤i<j≤n} P(Ai ∩ Aj) + Σ_{1≤i<j<k≤n} P(Ai ∩ Aj ∩ Ak) − · · · + (−1)^{n+1} P(A1 ∩ · · · ∩ An).
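As an illustration (the sample space, the events, and the code are ours, not from the notes), the formula can be checked by brute force on a small finite sample space with equally likely outcomes:

# Brute-force check of inclusion-exclusion for randomly generated events
# over a finite sample space with equally likely outcomes.
from itertools import combinations
import random

random.seed(0)
omega = range(30)                                  # sample space with 30 equally likely outcomes
events = [set(random.sample(list(omega), 12)) for _ in range(4)]  # A_1, ..., A_4

def prob(s):
    return len(s) / len(omega)

union_prob = prob(set().union(*events))

ie = 0.0
n = len(events)
for k in range(1, n + 1):
    for subset in combinations(events, k):
        ie += (-1) ** (k + 1) * prob(set.intersection(*subset))

print(union_prob, ie)                              # the two numbers coincide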
We now apply these tools to the matching problem. Suppose that n persons throw their hats into a box and then each picks one hat at random, with all n! assignments equally likely. Let Xi be the indicator of the event that the ith person picks their own hat, and let X = Σ_{i=1}^n Xi be the number of persons who pick their own hat. Each Xi is a Bernoulli random variable with P(Xi = 1) = 1/n, so that E[Xi] = 1/n, var(Xi) = (1/n)(1 − 1/n), and E[X] = 1.
For i ≠ j, we have
cov(Xi, Xj) = E[(Xi − E[Xi])(Xj − E[Xj])]
= E[Xi Xj] − E[Xi] E[Xj]
= P(Xi = 1 and Xj = 1) − P(Xi = 1) P(Xj = 1)
= P(Xi = 1) P(Xj = 1 | Xi = 1) − P(Xi = 1) P(Xj = 1)
= (1/n) · (1/(n − 1)) − 1/n^2
= 1/(n^2 (n − 1)).
Therefore,
var(X) = var(Σ_{i=1}^n Xi)
= Σ_{i=1}^n var(Xi) + 2 Σ_{i=1}^{n−1} Σ_{j=i+1}^n cov(Xi, Xj)
= n · (1/n)(1 − 1/n) + 2 · (n(n − 1)/2) · (1/(n^2 (n − 1)))
= 1.
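A short Monte Carlo sketch (illustrative; the value n = 10 and the number of trials are arbitrary choices) confirms that both the sample mean and the sample variance of the number of matches are close to 1:

# Monte Carlo for the matching (hat) problem: E[X] = 1 and var(X) = 1.
import random

random.seed(0)
n, trials = 10, 200_000
counts = []
for _ in range(trials):
    perm = list(range(n))
    random.shuffle(perm)                                # hat received by each person
    counts.append(sum(perm[i] == i for i in range(n)))  # number of own-hat matches

mean = sum(counts) / trials
var = sum((c - mean) ** 2 for c in counts) / trials
print(round(mean, 3), round(var, 3))                    # both approximately 1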
Finding the PMF of X is a little harder. Let us first dispense with some
easy cases. We have P(X = n) = 1/n!, because there is only one permutation (out of the n! possible ones) under which every person receives their own hat.
Furthermore, the event X = n − 1 is impossible: if n − 1 persons have received
their own hat, the remaining person must also have received their own hat.
Let us continue by finding the probability that X = 0. Let Ai be the event
that the ith person gets their own hat, i.e., Xi = 1. Note that the event X = 0
is the same as the event ∩i Aci . Thus, P(X = 0) = 1 − P(∪ni=1 Ai ). Using the
inclusion-exclusion formula, we have
P(∪_{i=1}^n Ai) = Σ_i P(Ai) − Σ_{i<j} P(Ai ∩ Aj) + Σ_{i<j<k} P(Ai ∩ Aj ∩ Ak) − · · · .
Moreover, for any k distinct indices i1 < i2 < · · · < ik,
P(Ai1 ∩ Ai2 ∩ · · · ∩ Aik) = (1/n) · (1/(n − 1)) · · · (1/(n − k + 1)) = (n − k)!/n!.   (1)
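Equation (1) can also be confirmed by exhaustive enumeration for a small n (an illustrative sketch; the values n = 6, k = 3 are arbitrary):

# Exhaustive check of Eq. (1) for small n: among all n! permutations, the fraction
# in which k fixed persons all get their own hats equals (n - k)! / n!.
from itertools import permutations
from math import factorial

n, k = 6, 3
fixed = range(k)                                   # persons i1, ..., ik (here 0, ..., k-1)
count = sum(all(perm[i] == i for i in fixed) for perm in permutations(range(n)))
print(count / factorial(n), factorial(n - k) / factorial(n))   # identical values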
Since there are (n choose k) terms in the kth sum and each of them equals (n − k)!/n!, the kth sum is equal to 1/k!. Thus,
P(X = 0) = 1 − P(∪_{i=1}^n Ai) = 1 − 1 + 1/2! − 1/3! + · · · + (−1)^n (1/n!).
Consider now the probability P(X = r), for 1 ≤ r ≤ n − 2. For a set S ⊂ {1, . . . , n} of cardinality |S| = r, let BS be the event that every person in S receives their own hat, and let CS be the event that none of the persons outside S receives their own hat. We then have
{X = r} = ∪_{S: |S|=r} (BS ∩ CS),
and the events BS ∩ CS, for different sets S, are disjoint.
Note that
P(BS) = (n − r)!/n!,
by the same argument as in Eq. (1). Conditioned on the event that the r persons
in the set S have received their own hats, the event CS will materialize if and
only if none of the remaining n − r persons receive their own hat. But this is
the same situation as the one analyzed when we calculated the probability that
X = 0, except that n needs to be replaced by n − r. We conclude that
P(CS | BS) = 1/2! − 1/3! + · · · + (−1)^{n−r} (1/(n − r)!).
Putting everything together, we conclude that
P(X = r) = (n choose r) · ((n − r)!/n!) · (1/2! − 1/3! + · · · + (−1)^{n−r} (1/(n − r)!))
= (1/r!) (1/2! − 1/3! + · · · + (−1)^{n−r} (1/(n − r)!)).
Note that for each fixed r, the probability P(X = r) converges to e^{−1}/r! as n → ∞, which corresponds to a Poisson distribution with parameter 1. An
intuitive justification is as follows. The random variables Xi are not independent
(in particular, their covariance is nonzero). On the other hand, as n → ∞, they
are “approximately independent”. Furthermore, the success probability for each
person is 1/n, and the situation is similar to the one in our earlier proof that the
binomial PMF approaches the Poisson PMF.
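As a numerical illustration of this convergence (the script is ours; it simply evaluates the formula derived above against the Pois(1) PMF for a moderate n):

# Compare the exact matching PMF P(X = r) = (1/r!) * sum_{j=0}^{n-r} (-1)^j / j!
# with the Poisson(1) PMF e^{-1}/r! for a moderate n.
from math import exp, factorial

n = 20
for r in range(6):
    p_match = sum((-1) ** j / factorial(j) for j in range(n - r + 1)) / factorial(r)
    p_poisson = exp(-1) / factorial(r)
    print(r, round(p_match, 6), round(p_poisson, 6))   # nearly identical columns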
5 CONDITIONAL EXPECTATIONS
Definition 1. Given an event A, such that P(A) > 0, and a discrete random
variable X, the conditional expectation of X given A is defined as
E[X | A] = Σ_x x pX|A(x),
where pX|A is the conditional PMF of X given A.
Note that the preceding also provides a definition for a conditional expecta-
tion of the form E[X | Y = y], for any y such that pY (y) > 0: just let A be the
event {Y = y}, which yields
E[X | Y = y] = Σ_x x pX|Y(x | y).
We note that the conditional expectation is always well defined when either
the random variable X is nonnegative, or when the random variable X is inte-
grable. In particular, whenever E[|X|] < ∞, we also have E[|X| | Y = y] < ∞,
for every y such that pY (y) > 0. To verify the latter assertion, note that for
every y such that pY(y) > 0, we have
Σ_x |x| pX|Y(x | y) = Σ_x |x| pX,Y(x, y)/pY(y) ≤ (1/pY(y)) Σ_x |x| pX(x) = E[|X|]/pY(y).
Example. (The mean of the geometric.) Let X be a geometric random variable with parameter p, so that pX(k) = (1 − p)^{k−1} p, for k ∈ N. We first observe that the geometric distribution
is memoryless: for k ∈ N, we have
P(X − 1 = k | X > 1) = P(X = k + 1, X > 1)/P(X > 1)
= P(X = k + 1)/P(X > 1)
= (1 − p)^k p/(1 − p) = (1 − p)^{k−1} p
= P(X = k).
In words, in a sequence of repeated i.i.d. trials, given that the first trial was a failure,
the distribution of the remaining trials, X − 1, until the first success is the same as the
unconditional distribution of the number of trials, X, until the first success. In particular,
E[X − 1 | X > 1] = E[X].
Using the total expectation theorem, we can write
E[X] = E[X | X > 1] P(X > 1) + E[X | X = 1] P(X = 1) = (1 + E[X])(1 − p) + 1 · p.
We solve for E[X], and find that E[X] = 1/p.
Similarly,
E[X^2] = E[X^2 | X > 1] P(X > 1) + E[X^2 | X = 1] P(X = 1).
Note that
E[X^2 | X > 1] = E[(X − 1)^2 | X > 1] + E[2(X − 1) + 1 | X > 1] = E[X^2] + (2/p) + 1.
Thus,
E[X^2] = (1 − p)(E[X^2] + (2/p) + 1) + p,
which yields
E[X^2] = 2/p^2 − 1/p.
We conclude that
var(X) = E[X^2] − (E[X])^2 = 2/p^2 − 1/p − 1/p^2 = (1 − p)/p^2.
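A quick simulation sketch (illustrative; p = 0.3 and the number of trials are arbitrary choices) agrees with E[X] = 1/p and var(X) = (1 − p)/p^2:

# Simulate a geometric random variable (number of trials until the first success)
# and compare the sample mean and variance with 1/p and (1 - p)/p**2.
import random

random.seed(0)
p, trials = 0.3, 200_000
samples = []
for _ in range(trials):
    x = 1
    while random.random() >= p:     # repeat until the first success
        x += 1
    samples.append(x)

mean = sum(samples) / trials
var = sum((s - mean) ** 2 for s in samples) / trials
print(round(mean, 3), 1 / p)                    # both approximately 3.333
print(round(var, 3), (1 - p) / p ** 2)          # both approximately 7.778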
Example. (Splitting a Poisson random variable.) Let N be a Poisson random variable with parameter λ, representing a number of coin tosses; each toss results in heads with probability p, independently of everything else. Let X be the number of heads and let Y = N − X be the number of tails. Conditioned on N = n, X is just the number of heads in n trials, so that E[X | N = n] = np.
Let us now calculate E[N | X = m]. We have
E[N | X = m] = Σ_{n=0}^∞ n P(N = n | X = m)
= Σ_{n=m}^∞ n P(N = n, X = m)/P(X = m)
= Σ_{n=m}^∞ n P(X = m | N = n) P(N = n)/P(X = m)
= Σ_{n=m}^∞ n (n choose m) p^m (1 − p)^{n−m} (λ^n/n!) e^{−λ} / P(X = m).
Recall that X is distributed as Pois(λp), so that P(X = m) = e^{−λp} (λp)^m/m!. Thus, after some cancellations, we obtain
E[N | X = m] = Σ_{n=m}^∞ n (1 − p)^{n−m} λ^{n−m} e^{−λ(1−p)}/(n − m)!
= Σ_{n=m}^∞ (n − m) (1 − p)^{n−m} λ^{n−m} e^{−λ(1−p)}/(n − m)!
+ m Σ_{n=m}^∞ (1 − p)^{n−m} λ^{n−m} e^{−λ(1−p)}/(n − m)!
= λ(1 − p) + m.
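A simulation sketch of the splitting construction (illustrative; assumes numpy, and the values of λ, p, and m are arbitrary choices) gives the same answer:

# Simulate N ~ Poisson(lam), then X | N = n ~ Binomial(n, p), and estimate E[N | X = m].
import numpy as np

rng = np.random.default_rng(0)
lam, p, m = 5.0, 0.4, 3
N = rng.poisson(lam, size=1_000_000)
X = rng.binomial(N, p)                          # number of "heads" among the N trials
estimate = N[X == m].mean()                     # empirical E[N | X = m]
print(round(estimate, 3), lam * (1 - p) + m)    # both approximately 6.0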
A faster way of obtaining this result is as follows. From Theorem 3 in the notes for
Lecture 6, we have that X and Y are independent, and that Y is Poisson with parameter
λ(1 − p). Therefore,
E[N | X = m] = E[X + Y | X = m] = m + E[Y | X = m] = m + E[Y] = m + λ(1 − p).
Note further that
E[E[X | N ]] = E[N p] = λp = E[X],
and
E[E[N | X]] = λ(1 − p) + E[X] = λ(1 − p) + λp = λ = E[N ].
This is not a coincidence; the equality E[E[X | Y ]] = E[X] is always true, as we shall
now see. In fact, this is just the total expectation theorem, written in more abstract
notation.
More generally, for any bounded function g, we have E[E[X | Y] g(Y)] = E[X g(Y)]; taking g ≡ 1 recovers E[E[X | Y]] = E[X].
Proof: We have
E[E[X | Y] g(Y)] = Σ_y E[X | Y = y] g(y) pY(y)
= Σ_y Σ_x x pX|Y(x | y) g(y) pY(y)
= Σ_{x,y} x g(y) pX,Y(x, y) = E[X g(Y)].
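The identity can also be verified mechanically on any small joint PMF (an illustrative sketch; the PMF table and the function g below are arbitrary choices):

# Direct verification of E[ E[X|Y] g(Y) ] = E[ X g(Y) ] for an arbitrary small joint PMF.
# The joint PMF table and the function g below are arbitrary choices for illustration.
p_xy = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.15, (2, 1): 0.25}
g = lambda y: y ** 2 + 1

p_y = {}
for (x, y), p in p_xy.items():
    p_y[y] = p_y.get(y, 0.0) + p

def cond_exp_x(y):                     # E[X | Y = y] = sum_x x p_{X,Y}(x, y) / p_Y(y)
    return sum(x * p for (x, yy), p in p_xy.items() if yy == y) / p_y[y]

lhs = sum(cond_exp_x(y) * g(y) * py for y, py in p_y.items())
rhs = sum(x * g(y) * p for (x, y), p in p_xy.items())
print(lhs, rhs)                        # identical values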
The identity just proved motivates a more abstract definition of conditional expectation: for general random variables X and Y, one looks for a function φ such that E[φ(Y) g(Y)] = E[X g(Y)] for every bounded function g, and calls φ(Y) the conditional expectation of X given Y. Two issues arise.
(a) Existence: It turns out that as long as X is integrable, a function φ with the
above properties is guaranteed to exist. We already know that this is the
case for discrete random variables: the conditional expectation as defined in
the beginning of this section does have the desired properties. For general
random variables, this is a nontrivial and deep result. It will be revisited
later in this course.
(b) Uniqueness: It turns out that there is essentially only one function φ with
the above properties. More precisely, any two functions with the above
properties are equal with probability 1.
MIT OpenCourseWare
https://ocw.mit.edu
For information about citing these materials or our Terms of Use, visit: https://ocw.mit.edu/terms