
David J. Olive

Probability and Measure

January 3, 2023

Springer
Preface

Many statistics departments offer a one semester graduate course in Probability and Measure. Two good texts are Karr (1993) and Resnick (1999).
Billingsley (1995) and Ash and Doleans-Dade (1999) are more difficult. Also
see Breiman (1968), Capiński and Kopp (2004), Chung (2001), Dudley (2002),
Durrett (2019), Feller (1971), Gnedenko (1989), Pollard (2001), Rényi (2007),
Rosenthal (2006), and Shiryaev (1996). Problems are given in Shiryaev (2012)
and Stoyanov et al. (1989).
The prerequisite for this text is a course in Lebesgue Measure and Lebesgue
Integration at the level of Royden and Fitzpatrick (2007) and Spiegel (1969).
A prerequisite for Lebesgue Measure and Integration is an Introduction to
Real Analysis course at the level of Gaughan (2009) and Ross (1980). A
course on Real Analysis and Metric Spaces, such as Ash (1993), is at an
intermediate level between an Introduction to Real Analysis and Lebesgue
Measure and Integration.

Acknowledgements

Teaching Probability and Measure and Large Sample Theory as Math 581
and Math 582 at Southern Illinois University in Fall 2021 and Spring 2022
was useful.

Contents

1 Probability Measures and Measures
  1.1 Probability Measures
  1.2 Measures
  1.3 Summary
  1.4 Complements
  1.5 Problems
2 Random Variables and Random Vectors
  2.1 Measurable Functions
  2.2 Random Variables
  2.3 Random Vectors
  2.4 Some Useful Distributions
  2.5 Summary
  2.6 Complements
  2.7 Problems
3 Integration and Expected Value
  3.1 Integration
  3.2 Expected Value
  3.3 Fubini's Theorem and Product Measures
  3.4 Summary
  3.5 Complements
  3.6 Problems
4 Large Sample Theory
  4.1 Modes of Convergence
  4.2 The Characteristic Function and Related Functions
  4.3 The CLT
  4.4 Slutsky's Theorem, the Continuity Theorem and Related Results
  4.5 Order Relations and Convergence Rates
  4.6 More CLTs
  4.7 More Results for Random Variables
  4.8 Multivariate Limit Theorems
  4.9 The Delta Method
  4.10 Summary
  4.11 Complements
  4.12 Problems
5 Conditional Probability and Conditional Expectation
  5.1 Conditional Probability
  5.2 Conditional Expectation
  5.3 Summary
  5.4 Complements
  5.5 Problems
6 Martingales
7 Some Solutions
Index
Chapter 1
Probability Measures and Measures

This chapter covers probability measures and measures.

1.1 Probability Measures

Definition 1.1. The sample space Ω is the set of all possible outcomes of
an experiment.

Remark 1.1. We will assume that Ω is not the empty set, which is the
set that contains no elements. The experiment is an idealized experiment.
For example, toss a coin once. Then Ω = {heads, tails}. Outcomes where one cannot tell whether the coin is heads or tails are not allowed in the idealized experiment.

Definition 1.2. Let A, B ⊆ Ω.
a) The complement of A is A^c = {ω ∈ Ω : ω ∉ A} = Ω − A.
b) A − B = A ∩ B^c is the difference between A and B.
c) The empty set is ∅.
Note that (A^c)^c = A and ∅ = Ω^c.
Definition 1.3. Let Λ be a nonempty index set of sets A_λ ⊆ Ω. Then {A_λ}_{λ∈Λ} is an indexed family of sets.
a) The union ∪_{λ∈Λ} A_λ = {ω ∈ Ω : ω ∈ A_λ for at least one λ ∈ Λ}.
b) The intersection ∩_{λ∈Λ} A_λ = {ω ∈ Ω : ω ∈ A_λ for all λ ∈ Λ}.

Notation: a) Often "∈ Ω" will be omitted. Hence
{ω ∈ Ω : ω ∈ A_λ for all λ ∈ Λ} = {ω : ω ∈ A_λ for all λ ∈ Λ}.
b) Often Λ = N = {i}_{i=1}^∞ = {1, 2, ...} = the set of positive integers = the set of natural numbers. Then ∪_{λ∈N} A_λ = ∪_{i=1}^∞ A_i, and ∩_{λ∈N} A_λ = ∩_{i=1}^∞ A_i.
c) If Λ = {i}_{i=m}^∞ = {m, m + 1, ...} = the set of integers ≥ m, then ∪_{λ∈Λ} A_λ = ∪_{i=m}^∞ A_i, and ∩_{λ∈Λ} A_λ = ∩_{i=m}^∞ A_i.

Warning 1.1: Since ∞ is not an integer, there is no set A_∞ in ∪_{i=m}^∞ A_i or ∩_{i=m}^∞ A_i.

Remark 1.2. One way to prove A = B is to prove A ⊆ B and B ⊆ A. This technique is equivalent to i) showing that if ω ∈ A, then ω ∈ B, and ii) showing that if ω ∈ B, then ω ∈ A. A second way to prove A = B is to show ω ∈ A iff ω ∈ B where "iff" means "if and only if."

Theorem 1.1 De Morgan's Laws: Let Λ be a nonempty index set of sets A_λ ⊆ Ω.
i) [∪_{λ∈Λ} A_λ]^c = ∩_{λ∈Λ} A_λ^c.
ii) [∩_{λ∈Λ} A_λ]^c = ∪_{λ∈Λ} A_λ^c.
iii) [∩_{i=1}^∞ A_i]^c = ∪_{i=1}^∞ A_i^c.
iv) [∪_{i=1}^∞ A_i]^c = ∩_{i=1}^∞ A_i^c.
v) [A ∪ B]^c = A^c ∩ B^c.
vi) [A ∩ B]^c = A^c ∪ B^c.

Proof: Equations iii) - vi) are special cases of i) and ii).
Proof of i): ω ∈ [∪_{λ∈Λ} A_λ]^c iff ω ∉ ∪_{λ∈Λ} A_λ iff ω ∉ A_λ for any λ ∈ Λ iff ω ∈ A_λ^c for all λ ∈ Λ iff ω ∈ ∩_{λ∈Λ} A_λ^c.
Alternative proof of i): If ω ∈ [∪_{λ∈Λ} A_λ]^c, then ω ∉ ∪_{λ∈Λ} A_λ. Hence ω ∉ A_λ for any λ ∈ Λ. Hence ω ∈ A_λ^c for all λ ∈ Λ. Thus ω ∈ ∩_{λ∈Λ} A_λ^c.
If ω ∈ ∩_{λ∈Λ} A_λ^c, then ω ∈ A_λ^c for all λ ∈ Λ. Hence ω ∉ A_λ for any λ ∈ Λ. Hence ω ∉ ∪_{λ∈Λ} A_λ. Thus ω ∈ [∪_{λ∈Λ} A_λ]^c.
ii) Can be proved in a manner similar to i). Alternatively, using (A^c)^c = A, by i) [∪_{λ∈Λ} A_λ^c]^c = ∩_{λ∈Λ} A_λ. Taking complements of both sides gives ∪_{λ∈Λ} A_λ^c = [∩_{λ∈Λ} A_λ]^c. □
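For a finite Ω, De Morgan's laws i) and ii) can be checked mechanically with set operations. A minimal sketch in Python (the sample space and the indexed family below are made up for illustration):

```python
# Verify De Morgan's laws on a toy finite sample space.
omega = frozenset(range(10))
family = [frozenset({0, 1, 2}), frozenset({2, 3, 4}), frozenset({4, 5, 6})]

def complement(a):
    return omega - a

# i) [union of the A_lambda]^c == intersection of the complements
lhs_i = complement(frozenset().union(*family))
rhs_i = omega.intersection(*[complement(a) for a in family])
assert lhs_i == rhs_i

# ii) [intersection of the A_lambda]^c == union of the complements
lhs_ii = complement(omega.intersection(*family))
rhs_ii = frozenset().union(*[complement(a) for a in family])
assert lhs_ii == rhs_ii

print(sorted(lhs_i), sorted(lhs_ii))  # → [7, 8, 9] [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```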

Definition 1.4. Let Ω ≠ ∅. A class C of subsets of Ω is a field (or algebra) on Ω if
i) Ω ∈ C.
ii) A ∈ C ⇒ A^c ∈ C.
iii) A, B ∈ C ⇒ A ∩ B ∈ C.

Theorem 1.2. Principle of Mathematical Induction: Let P(n) be a statement for each n ∈ N such that
a) P(1) is true, and
b) for each k ∈ N, if P(k) is true, then P(k + 1) is true.
Then P(n) is true for all n ∈ N.
By induction, if A_1, ..., A_n ∈ C, then ∩_{i=1}^n A_i ∈ C where C is a field. However, if A_1, A_2, ... ∈ C, it is not necessarily true that ∩_{i=1}^∞ A_i ∈ C. Note that ∞ ∉ N.
A field is closed under the formation of complements, finite unions, and finite intersections by induction and De Morgan's laws.

Definition 1.5. Let Ω ≠ ∅. A class F of subsets of Ω is a σ-field (or σ-algebra) on Ω if
i) Ω ∈ F.
ii) A ∈ F ⇒ A^c ∈ F.
iii) A, B ∈ F ⇒ A ∩ B ∈ F.
iv) A_1, A_2, ... ∈ F ⇒ ∪_{i=1}^∞ A_i ∈ F.

Notation: Countable in this text means finite or countably infinite.
Note that i), ii), and iii) mean that a σ-field is a field on Ω. A σ-field is a field that is closed under countable set operations: complementation, countable unions, and countable intersections. The term "on Ω" is often understood and omitted.

Warning: A common error is to use n instead of ∞ in Definition 1.5 iv).

Example 1.1. The largest σ-field consists of all subsets of Ω. The smallest
σ-field is F = {∅, Ω}.

Example 1.2. A finite field is a σ-field.

Proof. We need to show that iv) from Def. 1.5 holds. By induction, a field is closed under finite intersections: A_1, ..., A_n ∈ C implies ∩_{i=1}^n A_i ∈ C. Hence a field is closed under finite unions by De Morgan's laws. Since the field is finite, it has a finite number of sets, B_1, ..., B_J, say. If A_1, A_2, ... ∈ C, then only a finite number of these sets are distinct, say C_1, ..., C_k where k depends on the sequence A_1, .... Thus ∪_{i=1}^∞ A_i = ∪_{i=1}^k C_i ∈ C, and C is a σ-field by Def. 1.5. □
Definition 1.6. Let A be a class of sets of Ω. The σ-field generated by A, denoted by σ(A), is the intersection of all σ-fields containing A.

Let Λ be the class of σ-fields containing A. Then Λ is nonempty since the σ-field of all subsets of Ω is in Λ. Then σ(A) = ∩_{λ∈Λ} F_λ. Thus σ(A) is the smallest σ-field containing A, since if F_λ is a σ-field containing A, then σ(A) ⊆ F_λ.
Proof that σ(A) is a σ-field:
i) Ω ∈ σ(A) since Ω ∈ F_λ for each λ ∈ Λ.
ii) If A ∈ σ(A), then A ∈ F_λ for each λ ∈ Λ. Hence A^c ∈ F_λ for each λ ∈ Λ. Thus A^c ∈ σ(A).
iii) If A, B ∈ σ(A), then A, B ∈ F_λ for each λ ∈ Λ. Hence A ∩ B ∈ F_λ for each λ ∈ Λ. Thus A ∩ B ∈ σ(A).
iv) If A_1, A_2, ... ∈ σ(A), then A_1, A_2, ... ∈ F_λ for each λ ∈ Λ. Hence ∪_{i=1}^∞ A_i ∈ F_λ for each λ ∈ Λ. Thus ∪_{i=1}^∞ A_i ∈ σ(A). □
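On a finite Ω, σ(A) can be computed by brute force: start from A together with ∅ and Ω, then repeatedly close under complements and pairwise unions until nothing new appears (on a finite space, countable unions reduce to finite ones). A sketch, with a made-up generating class:

```python
from itertools import combinations

def generated_sigma_field(omega, gen):
    """Smallest sigma-field on a finite omega containing every set in gen.
    Closing under complement and pairwise union is enough here, since on a
    finite space countable unions reduce to finite ones."""
    omega = frozenset(omega)
    f = {frozenset(), omega} | {frozenset(a) for a in gen}
    while True:
        new = {omega - a for a in f}                   # complements
        new |= {a | b for a, b in combinations(f, 2)}  # pairwise unions
        if new <= f:                                   # nothing new: closed
            return f
        f |= new

sigma = generated_sigma_field({1, 2, 3, 4}, [{1}, {1, 2}])
print(len(sigma))  # → 8: the atoms {1}, {2}, {3,4} give 2**3 sets
```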
Definition 1.7. a) Let A be the class of all open intervals of [0,1]. Then σ(A) = B[0,1] is the Borel σ-field on [0,1].
b) Let x = (x_1, ..., x_k) ∈ R^k. Let A be the class of "rectangles" {x ∈ R^k : a_i < x_i ≤ b_i, i = 1, ..., k} where a_i, b_i ∈ R. Then σ(A) = B(R^k) is the Borel σ-field on R^k.
Fact: B[0,1] = σ(A) where A is the class of all closed intervals in [0,1], or A is the class of all intervals of the form (a, b] in [0,1], or A is the class of all intervals of the form [a, b) in [0,1].
Definition 1.8. A_1, A_2, ... are disjoint F sets if A_i ∈ F ∀i ∈ N, and A_i ∩ A_j = ∅ for i ≠ j.
Notation: The phrase "F sets" will often be omitted. Other terms for disjoint are pairwise disjoint and mutually exclusive. The sequence of sets in Def. 1.8 can be finite: A_1, ..., A_n with n ≥ 2. The sets A_i and B = ∅ are disjoint for any A_i ∈ F. Notation such as ∪_{i=1}^∞ A_i = ⊎_{i=1}^∞ B_i means that the sets B_i are disjoint.

Definition 1.9. A set function P on a σ-field F on Ω is a probability measure if
P1) 0 ≤ P(A) ≤ 1 for A ∈ F,
P2) P(∅) = 0 and P(Ω) = 1,
P3) If A_1, A_2, ... are disjoint F sets, then P(∪_{i=1}^∞ A_i) = Σ_{i=1}^∞ P(A_i) (countable additivity).
Common error: use n instead of ∞ in P3).
Note that for P3), P(∪_{i=1}^∞ A_i) = P(⊎_{i=1}^∞ A_i) = Σ_{i=1}^∞ P(A_i).
Definition 1.10. (Ω, F , P ) is a probability space if Ω is a sample space,
F is a σ-field on Ω, and P is a probability measure on (Ω, F ). Then an event
A is any set A ∈ F.
Typically F is not the class of all subsets of Ω. Then there are subsets of Ω
that are not events. Typically in this chapter, we will assume that (Ω, F , P )
is a probability space, and that sets such as An are F sets: An ∈ F.

For a discrete random variable (RV) X, Ω is a countable set and F is often the σ-field of all subsets of Ω. For a continuous RV X, F is often the Borel σ-field B(Ω) where Ω is an interval.
Example 1.3. Let µ_L be the Lebesgue measure on Ω = [a, b]: µ_L([c, d]) = d − c if [c, d] ⊆ [a, b]. The uniform(a,b) RV has

P = µ_L / µ_L([a, b]) = µ_L / (b − a).

The interval [a, b] = [0, 1] is interesting.


Notation. A_n ↑ A means A_1 ⊆ A_2 ⊆ ⋯ and A = ∪_{i=1}^∞ A_i.
A_n ↓ A means A_1 ⊇ A_2 ⊇ ⋯ and A = ∩_{i=1}^∞ A_i.
x_n ↑ x means x_1 ≤ x_2 ≤ ⋯ and x_n → x.
x_n ↓ x means x_1 ≥ x_2 ≥ ⋯ and x_n → x.
Theorem 1.3. Properties of P: Let A, B, A_i, A_n, A_k be F sets.
i) Finite additivity: If A_1, ..., A_n are disjoint, then P(∪_{i=1}^n A_i) = Σ_{i=1}^n P(A_i).
ii) P is monotone: A ⊆ B ⇒ P(A) ≤ P(B).
iii) If A ⊆ B, then P(B − A) = P(B) − P(A).
iv) Complement rule: P(A^c) = 1 − P(A).
v) Finite subadditivity: P(∪_{i=1}^n A_i) ≤ Σ_{i=1}^n P(A_i).
vi) Continuity from below: If A_n ↑ A, then P(A_n) ↑ P(A).
vii) Continuity from above: If A_n ↓ A, then P(A_n) ↓ P(A).
viii) Countable subadditivity: P(∪_{k=1}^∞ A_k) ≤ Σ_{k=1}^∞ P(A_k).
Proof. i) Let A_i = ∅ for i ≥ n + 1. Then A_1, A_2, ... are disjoint, and P(∪_{i=1}^n A_i) = P(⊎_{i=1}^n A_i) = P(⊎_{i=1}^∞ A_i) = Σ_{i=1}^∞ P(A_i) = Σ_{i=1}^n P(A_i) by P3).
ii) and iii) If A ⊆ B, then B = A ⊎ (B − A) where this notation means A and B − A are disjoint. Hence P(B) = P(A) + P(B − A) ≥ P(A) by i), and P(B − A) = P(B) − P(A).
iv) Take B = Ω = A ⊎ A^c. Then P(Ω) = 1 = P(A) + P(A^c).
v) We will find disjoint sets B_1, ..., B_n such that a) B_j ⊆ A_j for j = 1, ..., n, b) B_k ⊆ A_j^c for j < k, and c) ∪_{i=1}^n A_i = ⊎_{i=1}^n B_i. Then P(∪_{i=1}^n A_i) = P(⊎_{i=1}^n B_i) = Σ_{i=1}^n P(B_i) ≤ Σ_{i=1}^n P(A_i).
The sets B_i that work are B_1 = A_1 and
B_k = A_k ∩ A_1^c ∩ ⋯ ∩ A_{k−1}^c = A_k ∩ [∪_{i=1}^{k−1} A_i]^c
for k = 2, ..., n. To see that the B_k are disjoint, without loss of generality (WLOG) let j < k. Then B_j ⊆ A_j and B_k ⊆ A_j^c. Hence B_j and B_k are disjoint for j ≠ k. Now ∪_{i=1}^n A_i =
A_1 ∪ [A_2 ∩ A_1^c] ∪ [A_3 ∩ (A_1 ∪ A_2)^c] ∪ ⋯ ∪ [A_n ∩ (A_1 ∪ ⋯ ∪ A_{n−1})^c] = ⊎_{i=1}^n B_i.

(Use induction or make a Venn diagram of concentric circles with the inner-
most circle A1 . Then the second innermost circle is A1 ∪ A2 where the ring
about the A1 circle is the set B2 , the third innermost circle is A1 ∪ A2 ∪ A3
where the ring about the A1 ∪ A2 circle is B3 , et cetera.)
vi) We will find disjoint sets B_1, B_2, ... such that A_n = ∪_{k=1}^n A_k = ∪_{k=1}^n B_k, and A = ∪_{k=1}^∞ A_k = ∪_{k=1}^∞ B_k. Then

P(A) = Σ_{k=1}^∞ P(B_k) = lim_{n→∞} Σ_{k=1}^n P(B_k) = lim_{n→∞} P(A_n).

Thus

P(A_n) = Σ_{k=1}^n P(B_k) ↑ P(A).

The sets B_i that work are B_1 = A_1 and B_k = A_k − A_{k−1} for k > 1. Since A_n ↑ A, A_n = ∪_{k=1}^n A_k. Use induction or a Venn diagram to show that A_n = ∪_{k=1}^n A_k = ∪_{k=1}^n B_k for each positive integer n. Then solve Problem 1.1 to prove that A = ∪_{k=1}^∞ A_k = ∪_{k=1}^∞ B_k.
vii) A_n ↓ A ⇒ A_n^c ↑ A^c. Hence

P(A_n^c) = [1 − P(A_n)] ↑ [1 − P(A)] = P(A^c)

by vi). Thus P(A_n) ↓ P(A).


viii) Let B_n = ∪_{k=1}^n A_k. Then

P(B_n) = P(∪_{k=1}^n A_k) ≤ Σ_{k=1}^n P(A_k) ≤ Σ_{k=1}^∞ P(A_k)

for any positive integer n. Now B_n ↑ B = ∪_{k=1}^∞ A_k, and thus P(B_n) ↑ P(B) = P(∪_{k=1}^∞ A_k) by vi). Hence P(B) is the least upper bound of the sequence P(B_n) while Σ_{k=1}^∞ P(A_k) is an upper bound on the P(B_n). Thus P(∪_{k=1}^∞ A_k) ≤ Σ_{k=1}^∞ P(A_k). □
Note: vi) and vii) together are known as monotone continuity. Finite subadditivity is also known as Boole's inequality.
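The disjointification B_1 = A_1, B_k = A_k ∩ [∪_{i=1}^{k−1} A_i]^c used in the proof of v) is easy to check numerically. A sketch with made-up finite sets:

```python
# Disjointify A_1, ..., A_n into B_1 = A_1, B_k = A_k minus (A_1 u ... u A_{k-1}),
# so the B_k are pairwise disjoint with the same union as the A_k.
A = [{1, 2, 3}, {2, 3, 4}, {4, 5}, {1, 5, 6}]

B = []
seen = set()                  # union of A_1, ..., A_{k-1}
for a in A:
    B.append(set(a) - seen)   # B_k = A_k with everything earlier removed
    seen |= a

assert all(B[i].isdisjoint(B[j])
           for i in range(len(B)) for j in range(i + 1, len(B)))
assert set().union(*A) == set().union(*B)
print(B)  # → [{1, 2, 3}, {4}, {5}, {6}]
```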
The limit superior and limit inferior of a sequence will be useful. The sequence {a_n}_{n=1}^∞ = a_1, a_2, .... Useful references are Hunter (2014, pp. 6-7) and Ross (1980, pp. 57-60). Let {a_n}_{n=m}^∞ (= a_m, a_{m+1}, ...) be a sequence of numbers. Then i) sup a_n = least upper bound of {a_n}, and ii) inf a_n = greatest lower bound of {a_n}.

Definition 1.11. Let {a_n}_{n=1}^∞ be a sequence.
a) The limit superior of the sequence, limsup_n a_n, is the limit of the nonincreasing sequence {sup_{k≥j} a_k}_{j=1}^∞.
b) The limit inferior of the sequence, liminf_n a_n, is the limit of the nondecreasing sequence {inf_{k≥j} a_k}_{j=1}^∞.

Remark 1.3. a) Unlike the limit, liminf_n a_n and limsup_n a_n always exist when ±∞ are allowed as limits, since limits of nondecreasing and nonincreasing sequences then exist.
b) liminf_n a_n ≤ limsup_n a_n.
c) lim_n a_n = a iff liminf_n a_n = limsup_n a_n = a. Hence the limit of a sequence exists iff liminf_n a_n = limsup_n a_n. Again, a = ±∞ is allowed.
d) Let lim*_n a_n denote either liminf_n a_n or limsup_n a_n.
If a_n ≤ b_n, then lim*_n a_n ≤ lim*_n b_n.
If a_n < b_n, then lim*_n a_n ≤ lim*_n b_n.
If a_n ≥ b_n, then lim*_n a_n ≥ lim*_n b_n.
If a_n > b_n, then lim*_n a_n ≥ lim*_n b_n.
That is, when taking the liminf or limsup on both sides of a strict inequality, the < or > must be replaced by ≤ or ≥. A similar result holds for limits if both limits exist.
e) limsup_n(−a_n) = −liminf_n a_n.
f) i) limsup_n a_n is the limit of the nonincreasing sequence
sup_{k≥m} a_k, sup_{k≥m+1} a_k, ....
ii) liminf_n a_n is the limit of the nondecreasing sequence
inf_{k≥m} a_k, inf_{k≥m+1} a_k, ....
iii) limsup_n a_n = inf_n sup_{k≥n} a_k = lim_{k→∞} sup{a_n : n ≥ k}.
iv) liminf_n a_n = sup_n inf_{k≥n} a_k = lim_{k→∞} inf{a_n : n ≥ k}.
v) If a limit point of a sequence {a_n} is any number, including ±∞, that is a limit of some subsequence, then liminf_n a_n and limsup_n a_n are the inf and sup of the set of limit points, often the smallest and largest limit points.

A limit point is also called an accumulation point and a cluster point. If {x_n} is a bounded sequence, then limsup x_n = largest accumulation point (cluster point) of {x_n}, and liminf x_n = smallest accumulation point of {x_n}.

Remark 1.4. Warning: a common error is to take the limit of both sides of an equation a_n = b_n or of an inequality a_n ≤ b_n. Taking the limit is an error if the existence of the limit has not been shown. If ±∞ are allowed, liminf_n a_n and limsup_n a_n always exist. Hence the liminf or limsup of the above equation or inequality can be taken.

Example 1.1. a) If a_n = (−1)^n, then limsup_n a_n = 1 and liminf_n a_n = −1.
b) If a_n = (−1)^n/n, then limsup_n a_n = liminf_n a_n = lim_n a_n = 0.
c) Note that 1/(n+1) < 1/n, but lim*_n 1/(n+1) ≤ lim*_n 1/n. In fact, lim_n 1/(n+1) = 0 = lim_n 1/n. Thus lim_n 1/(n+1) is not less than lim_n 1/n.
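The formulas limsup_n a_n = inf_n sup_{k≥n} a_k and liminf_n a_n = sup_n inf_{k≥n} a_k can be illustrated numerically by truncating the tails at a finite horizon; for a_n = (−1)^n the truncated values already equal the true limits. A sketch (the horizon N is made up for illustration):

```python
# Approximate limsup/liminf of a_n = (-1)^n via truncated tail sup/inf.
# The true limits need the full infinite tail; here truncation is harmless.
N = 1000
a = [(-1) ** n for n in range(1, N + 1)]

tail_sups = [max(a[n:]) for n in range(N - 1)]  # sup_{k >= n} a_k, nonincreasing
tail_infs = [min(a[n:]) for n in range(N - 1)]  # inf_{k >= n} a_k, nondecreasing

limsup_approx = min(tail_sups)  # inf over n of the tail sups
liminf_approx = max(tail_infs)  # sup over n of the tail infs
print(limsup_approx, liminf_approx)  # → 1 -1
```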
The limsup, liminf, and limit of sets can also be defined.

Definition 1.12. Let A_n be a sequence of F sets.
a) limsup_n A_n = ∩_{n=1}^∞ ∪_{k=n}^∞ A_k = {ω : ω ∈ A_n for infinitely many A_n}.
b) liminf_n A_n = ∪_{n=1}^∞ ∩_{k=n}^∞ A_k = {ω : ω ∈ A_n for all but finitely many A_n}.
c) If liminf_n A_n = limsup_n A_n, then lim_n A_n = A = liminf_n A_n = limsup_n A_n, written A_n → A.

Example 1.3. Let A_n = {(−1)^n}. Then limsup_n A_n = {−1, 1} since both numbers occur infinitely often, while liminf_n A_n = ∅ since −1 and 1 are the only possible elements of A_n, and neither number occurs for all but finitely many A_n.
Note that ω ∈ limsup_n A_n iff for each positive integer n, there exists k ≥ n such that ω ∈ A_k iff ω is in infinitely many of the A_n. Note that ω ∈ liminf_n A_n iff there exists a positive integer n such that ω ∈ A_k for all k ≥ n iff ω lies in all but finitely many of the A_n.
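The set formulas limsup_n A_n = ∩_n ∪_{k≥n} A_k and liminf_n A_n = ∪_n ∩_{k≥n} A_k can likewise be illustrated over a finite horizon; for A_n = {(−1)^n} the truncated computation reproduces limsup = {−1, 1} and liminf = ∅. A sketch (the horizon N is made up for illustration):

```python
from functools import reduce

# Finite-horizon illustration of limsup/liminf of sets for A_n = {(-1)^n}.
N = 50
A = [frozenset({(-1) ** n}) for n in range(1, N + 1)]

tails = [A[n:] for n in range(N // 2)]  # the tails A_n, A_{n+1}, ..., A_N

tail_unions = [reduce(frozenset.union, t) for t in tails]
tail_inters = [reduce(frozenset.intersection, t) for t in tails]

limsup_A = reduce(frozenset.intersection, tail_unions)  # intersect the tail unions
liminf_A = reduce(frozenset.union, tail_inters)         # union of the tail intersections

print(sorted(limsup_A), sorted(liminf_A))  # → [-1, 1] []
```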

Theorem 1.4. Let A_n be a sequence of F sets.
a) limsup_n A_n, liminf_n A_n ∈ F.
b) If lim_n A_n exists, then lim_n A_n = A ∈ F.
c) liminf_n A_n ⊆ limsup_n A_n.
d) (limsup_n A_n)^c = liminf_n A_n^c.
e) (liminf_n A_n)^c = limsup_n A_n^c.
Proof. a) C_n = ∪_{k=n}^∞ A_k ∈ F for each n. Hence ∩_{n=1}^∞ C_n = limsup_n A_n ∈ F. B_n = ∩_{k=n}^∞ A_k ∈ F for each n. Hence ∪_{n=1}^∞ B_n = liminf_n A_n ∈ F.
b) Follows from a).
c) If ω ∈ A_n for all but finitely many A_n, then ω ∈ A_n for infinitely many A_n. Hence if ω ∈ liminf_n A_n then ω ∈ limsup_n A_n. Thus liminf_n A_n ⊆ limsup_n A_n.
d) By De Morgan's laws applied twice, (limsup_n A_n)^c = [∩_{n=1}^∞ C_n]^c = ∪_{n=1}^∞ C_n^c = liminf_n A_n^c where C_n is given in a).
e) By De Morgan's laws applied twice, (liminf_n A_n)^c = [∪_{n=1}^∞ B_n]^c = ∩_{n=1}^∞ B_n^c = limsup_n A_n^c where B_n is given in a). □
If limsup_n A_n ⊆ A ⊆ liminf_n A_n, then lim_n A_n = A by Theorem 1.4.

Remark 1.5. a) B_n = ∩_{k=n}^∞ A_k ↑ liminf_n A_n. Thus lim_{n→∞} ∩_{k=n}^∞ A_k = liminf_n A_n.
b) C_n = ∪_{k=n}^∞ A_k ↓ limsup_n A_n. Thus lim_{n→∞} ∪_{k=n}^∞ A_k = limsup_n A_n, and limsup_n A_n = ∩_{n=1}^∞ C_n.
c) Do not treat convergence of sets like convergence of functions. A_n → A iff limsup_n A_n = liminf_n A_n, which implies that if ω ∈ A_n for infinitely many n, then ω ∈ A_n for all but finitely many n.
d) Warning: Students who have not figured out the following two examples tend to make errors on similar problems.
e) Typically we want to show that open, closed, and half open intervals can be written as a countable union or countable intersection of intervals of another type. Then the Borel σ-field B(R) = σ(C) where C is a class of intervals such as the class of all open intervals.
Example 1.4. Prove the following results.
a) A_1 ⊆ A_2 ⊆ ⋯ implies that A_n ↑ A = ∪_{n=1}^∞ A_n.
b) A_1 ⊇ A_2 ⊇ ⋯ implies that A_n ↓ A = ∩_{n=1}^∞ A_n.
Proof. a) For each n, ∪_{k=n}^∞ A_k = A. Thus limsup_n A_n = ∩_{n=1}^∞ A = A. For each n, ∩_{k=n}^∞ A_k = A_n. Thus liminf_n A_n = ∪_{n=1}^∞ A_n = A.
b) For each n, ∪_{k=n}^∞ A_k = A_n. Thus limsup_n A_n = ∩_{n=1}^∞ A_n = A. For each n, ∩_{k=n}^∞ A_k = A. Thus liminf_n A_n = ∪_{n=1}^∞ A = A. □
Example 1.5. Simplify the following sets where a < b. Answers might be (a, b), [a, b), (a, b], [a, b], [a, a] = {a}, (a, a) = ∅.
a) I = ∩_{n=1}^∞ (a, b + 1/n]
b) I = ∪_{n=1}^∞ (a, b − 1/n]
c) I = ∪_{n=1}^∞ (a + 1/n, b − 1/n)
d) I = ∩_{n=1}^∞ [a, b + 1/n]
e) I = ∩_{n=1}^∞ [a, a + 1/n)
f) I = ∪_{n=1}^∞ [a, b − 1/n]

Solution. a) I = (a, b] = ∩_{n=1}^∞ (a, b + 1/n] = ∩_{n=1}^∞ A_n where A_n ↓ I. Note that (a, b] ⊆ A = ∩_{n=1}^∞ (a, b + 1/n] since b ∈ (a, b + 1/n] ∀n. For any ε > 0, (a, b + ε] ⊄ A since b + 1/n < b + ε for large enough n. Note that b + 1/n → b, but sets are not functions. (A common error is to say I = (a, b).)
b) I = (a, b) = ∪_{n=1}^∞ (a, b − 1/n] = ∪_{n=1}^∞ A_n where A_n ↑ I. Note that b ∉ ∪_{n=1}^∞ (a, b − 1/n] = A since b ∉ (a, b − 1/n] ∀n and since n ∈ N so n = ∞ never occurs. Note that (a, b − 1/n] = ∅ if b − 1/n ≤ a. For any ε > 0 such that b − ε > a, it follows that b − ε ∈ A since b − 1/n > b − ε for large enough n, say n > N. Thus b − ε ∈ A_n for all but finitely many n.
c) I = (a, b) = ∪_{n=1}^∞ (a + 1/n, b − 1/n) = ∪_{n=1}^∞ A_n where A_n ↑ I. Note that a, b ∉ A = I since a, b ∉ (a + 1/n, b − 1/n) ∀n ∈ N. Then the proof is similar to that of b).
d) I = [a, b] = ∩_{n=1}^∞ [a, b + 1/n] = ∩_{n=1}^∞ A_n where A_n ↓ I. This proof is similar to that of a).
e) I = [a, a] = {a} = ∩_{n=1}^∞ [a, a + 1/n) = ∩_{n=1}^∞ A_n where A_n ↓ I. Note that a ∈ A = I, but a + ε ∉ A ∀ε > 0.
f) I = [a, b) = ∪_{n=1}^∞ [a, b − 1/n] = ∪_{n=1}^∞ A_n where A_n ↑ I. This proof is similar to that of b).
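Part a) can be sanity-checked numerically: b lies in every A_n = (a, b + 1/n], while any b + ε is eventually excluded. A sketch with endpoints a = 0, b = 1 and a tolerance ε made up for illustration:

```python
# Check a): the intersection of A_n = (a, b + 1/n] behaves like (a, b].
a, b, eps = 0.0, 1.0, 0.01

def in_An(x, n):
    return a < x <= b + 1.0 / n

ns = range(1, 10_000)
assert all(in_An(b, n) for n in ns)            # b lies in every A_n, so b is in I
assert not all(in_An(b + eps, n) for n in ns)  # b + eps drops out once 1/n < eps
assert all(in_An(0.5, n) for n in ns)          # interior points of (a, b] stay in
assert not any(in_An(a, n) for n in ns)        # a is never in A_n
print("membership consistent with I = (a, b]")
```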

Theorem 1.3 proved monotone continuity (continuity from below and continuity from above) of P. The following theorem proves continuity of P. In the proof, we can't take limits when the limits have not been shown to exist, but we can use the liminf or limsup operators.
Theorem 1.5. For each sequence {A_n} of F sets,
i) P(liminf_n A_n) ≤ liminf_n P(A_n) ≤ limsup_n P(A_n) ≤ P(limsup_n A_n).
ii) Continuity of probability: If A_n → A, then P(A_n) → P(A).
Proof. i) We need to show a) P(liminf_n A_n) ≤ liminf_n P(A_n) and b) limsup_n P(A_n) ≤ P(limsup_n A_n). Let B_n = ∩_{k=n}^∞ A_k ↑ liminf_n A_n, and C_n = ∪_{k=n}^∞ A_k ↓ limsup_n A_n. Then P(A_n) ≥ P(B_n) → P(liminf_n A_n). (We can't take limits on both sides of the inequality since we do not know if lim_n P(A_n) exists. Note that lim_n P(B_n) = P(liminf_n A_n) by monotone continuity.) Taking the liminf of both sides gives liminf_n P(A_n) ≥ liminf_n P(B_n) = P(liminf_n A_n), proving a).
Similarly, P(A_n) ≤ P(C_n) → P(limsup_n A_n). Taking the limsup of both sides of the inequality gives limsup_n P(A_n) ≤ limsup_n P(C_n) = P(limsup_n A_n), proving b).

ii) Follows from i) since P(A_n) → P(A) iff liminf_n P(A_n) = limsup_n P(A_n) = P(A). □

In the above proof, a common alternative for proving i) b) in probability texts is to use (limsup_n A_n)^c = liminf_n A_n^c. Hence 1 − P(limsup_n A_n) = P[(limsup_n A_n)^c] = P[liminf_n A_n^c] ≤ liminf_n P(A_n^c) = 1 − limsup_n P(A_n), where the last inequality follows by i) a) using the sets A_n^c instead of the sets A_n. Thus P(limsup_n A_n) ≥ limsup_n P(A_n).

The following theorem shows that if A_1, A_2, ... are sets each having probability 0, then ∪_{i=1}^∞ A_i is also a set having probability 0. If A_1, A_2, ... are sets each having probability 1, then ∩_{i=1}^∞ A_i is also a set having probability 1.

Theorem 1.6. Let A_1, A_2, ... be F sets.
i) If P(A_i) = 0 for all i, then P(∪_{i=1}^∞ A_i) = 0.
ii) If P(A_i) = 1 for all i, then P(∩_{i=1}^∞ A_i) = 1.
Proof. i) 0 ≤ P(∪_{i=1}^∞ A_i) ≤ Σ_{i=1}^∞ P(A_i) = 0.
ii) Let B_i = A_i^c so P(B_i) = 0. Then by i), 1 = 1 − 0 = P[(∪_{i=1}^∞ B_i)^c] = P(∩_{i=1}^∞ B_i^c) = P(∩_{i=1}^∞ A_i). □

Definition 1.13. i) Two events A and B are independent, written A ⫫ B, if P(A ∩ B) = P(A)P(B).
ii) A finite collection of events A_1, ..., A_n is independent if for any subcollection A_{i_1}, ..., A_{i_k}, P(∩_{j=1}^k A_{i_j}) = ∏_{j=1}^k P(A_{i_j}).
iii) An infinite (perhaps uncountable) collection of events is independent if each of its finite subcollections is.
iv) If the events are not independent, then the events are dependent.

Theorem 1.7: First Borel-Cantelli Lemma. Let (Ω, F, P) be fixed and A_n events. If Σ_{n=1}^∞ P(A_n) < ∞ (the sum Σ_{n=1}^∞ P(A_n) converges), then P(limsup_n A_n) = 0.
Proof. Since limsup_n A_n ⊆ ∪_{k=m}^∞ A_k for any positive integer m,

P(limsup_n A_n) ≤ P(∪_{k=m}^∞ A_k) ≤ Σ_{k=m}^∞ P(A_k) ≤ ε

for m ≥ m(ε) by definition of a convergent sum. Since ε > 0 is arbitrary, P(limsup_n A_n) = 0. □
The proof of the following theorem will use the fact that 1 − x ≤ e^{−x} for x ≥ 0.

Theorem 1.8: Second Borel-Cantelli Lemma. Let (Ω, F, P) be fixed and A_n events. If the A_n are independent events and Σ_{n=1}^∞ P(A_n) = ∞ (the sum Σ_{n=1}^∞ P(A_n) diverges), then P(limsup_n A_n) = 1.
Proof. The result holds if 0 = P[(limsup_n A_n)^c] = P(liminf_n A_n^c) = P(∪_{n=1}^∞ ∩_{k=n}^∞ A_k^c), which is true if P(∩_{k=n}^∞ A_k^c) = 0 for each positive integer n by Theorem 1.6. Since 1 − x ≤ e^{−x},

P(∩_{k=n}^{n+j} A_k^c) = ∏_{k=n}^{n+j} [1 − P(A_k)] ≤ ∏_{k=n}^{n+j} exp[−P(A_k)] = exp[−Σ_{k=n}^{n+j} P(A_k)].

Since Σ_{k=n}^∞ P(A_k) diverges for each positive integer n, the last term converges to 0 as j → ∞. Thus

0 ≤ P(∩_{k=n}^∞ A_k^c) = lim_{j→∞} P(∩_{k=n}^{n+j} A_k^c) ≤ lim_{j→∞} exp[−Σ_{k=n}^{n+j} P(A_k)] = 0

(where the first limit exists since ∩_{k=n}^{n+j} A_k^c ↓ ∩_{k=n}^∞ A_k^c). □
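The bound ∏(1 − P(A_k)) ≤ exp[−Σ P(A_k)] used in the proof forces the product to 0 precisely when the probabilities have a divergent sum. A numeric sketch contrasting p_k = 1/(k+1), whose sum diverges, with p_k = 1/(k+1)², whose sum converges (the probabilities are made up for illustration):

```python
def partial_product(p, m):
    """prod_{k=1}^{m} (1 - p(k)): the probability that none of the first m
    independent events occurs when event k has probability p(k)."""
    out = 1.0
    for k in range(1, m + 1):
        out *= 1.0 - p(k)
    return out

m = 100_000
# sum of 1/(k+1) diverges, so the product tends to 0 ...
div = partial_product(lambda k: 1.0 / (k + 1), m)
# ... while sum of 1/(k+1)^2 converges, and the product stays positive
conv = partial_product(lambda k: 1.0 / (k + 1) ** 2, m)

print(div, conv)  # div is tiny (it telescopes to 1/(m+1)); conv stays near 1/2
```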

Definition 1.14. Let {A_n} be a sequence of events defined on (Ω, F, P).
a) Then τ = ∩_{n=1}^∞ σ(A_n, A_{n+1}, ...) is the tail σ-field.
b) If A ∈ τ, then A is a tail event.
Note that σ(A_n, A_{n+1}, ...) is the σ-field generated by A_n, A_{n+1}, .... See Definition 1.6. By Remark 1.5, limsup_n A_n and liminf_n A_n are tail events.

Theorem 1.9: The Kolmogorov 0-1 Law. Let {A_n} be a sequence of independent events defined on (Ω, F, P). If A ∈ τ, then P(A) = 0 or P(A) = 1.

1.2 Measures

Definition 1.15. A set function µ is a measure on (Ω, F) (where F is a σ-field on Ω) if
m1) µ(A) ∈ [0, ∞] for A ∈ F. (Note that ∞ is allowed.)
m2) µ(∅) = 0, and
m3) If A_1, A_2, ... are disjoint F sets, then µ(∪_{i=1}^∞ A_i) = Σ_{i=1}^∞ µ(A_i) (countable additivity).
Note that the value of µ(Ω) ∈ [0, ∞] is not specified in Def. 1.15. For a
probability measure, P (Ω) = 1.
Definition 1.16. A measure µ is
i) finite if µ(Ω) < ∞ and
ii) infinite if µ(Ω) = ∞.
iii) If Ω = ∪_{i=1}^∞ A_i where A_i ∈ F with µ(A_i) < ∞ for all i ∈ N, then µ is σ-finite.
A measure is a probability measure if µ(Ω) = 1, and every probability measure is a finite measure and a σ-finite measure.

Definition 1.17. a) (Ω, F) is a measurable space if Ω is a sample space and F is a σ-field on Ω.
b) (Ω, F, µ) is a measure space if Ω is a sample space, F is a σ-field on Ω, and µ is a measure on (Ω, F).

Theorem 1.10. Properties of a measure µ: Let A, B, A_i, A_n, A_k be F sets.
i) µ is monotone: A ⊆ B ⇒ µ(A) ≤ µ(B).
ii) If A ⊆ B and µ(B) < ∞, then µ(B − A) = µ(B) − µ(A).
iii) Finite additivity: If A_1, ..., A_n are disjoint, then µ(∪_{i=1}^n A_i) = Σ_{i=1}^n µ(A_i).
iv) Finite subadditivity: µ(∪_{i=1}^n A_i) ≤ Σ_{i=1}^n µ(A_i).
v) Continuity from above: If A_n ↓ A and µ(A_1) < ∞, then µ(A_n) ↓ µ(A).
vi) Continuity from below: If A_n ↑ A, then µ(A_n) ↑ µ(A).
vii) Countable subadditivity: µ(∪_{k=1}^∞ A_k) ≤ Σ_{k=1}^∞ µ(A_k).

The proof of this theorem is similar to that of Theorem 1.3 which gives
properties of a probability measure. One difference is that in Theorem 1.10
v), the condition µ(A1 ) < ∞ is needed. See Problem 1.9.

1.3 Summary

1) The sample space Ω is the set of all outcomes from an idealized experiment. The empty set is ∅. The complement of a set A is A^c = {ω ∈ Ω : ω ∉ A}.
2) Let Ω ≠ ∅. A class F of subsets of Ω is a σ-field (or σ-algebra) on Ω if
i) Ω ∈ F.
ii) A ∈ F ⇒ A^c ∈ F.
iii) A, B ∈ F ⇒ A ∩ B ∈ F.
iv) A_1, A_2, ... ∈ F ⇒ ∪_{i=1}^∞ A_i ∈ F.
Note that i), ii), and iii) mean that a σ-field is a field (or algebra) on Ω. A σ-field is closed under countable set operations. The term "on Ω" is often understood and omitted.
Common error: Use n instead of ∞ in iv).
3) De Morgan's laws: i) A ∩ B = (A^c ∪ B^c)^c, ii) A ∪ B = (A^c ∩ B^c)^c, iii) [∪_{i=1}^∞ A_i]^c = ∩_{i=1}^∞ A_i^c.

4) Let A be a class of sets. The σ-field generated by A, denoted by σ(A), is the intersection of all σ-fields containing A. Then σ(A) is the smallest σ-field containing A.
5) Let A be the class of all open intervals of [0,1]. Then σ(A) = B[0, 1] is
the Borel σ−field on [0,1]. Fact: B[0, 1] = σ(A) where A is the class of all
closed intervals in [0,1], or A is the class of all intervals of the form (a, b] in
[0,1], or A is the class of all intervals of the form [a, b) in [0,1].

6) A set function P is a probability measure on a σ-field F on Ω if
P1) 0 ≤ P(A) ≤ 1 for A ∈ F.
P2) P(∅) = 0 and P(Ω) = 1.
P3) If A_1, A_2, ... are disjoint F sets, then P(∪_{i=1}^∞ A_i) = Σ_{i=1}^∞ P(A_i) (countable additivity).
Common error: use n instead of ∞ in P3).
7) A − B = A ∩ B^c is the difference between A and B.
8) A_n ↑ A means A_1 ⊆ A_2 ⊆ ⋯ and A = ∪_{i=1}^∞ A_i.
A_n ↓ A means A_1 ⊇ A_2 ⊇ ⋯ and A = ∩_{i=1}^∞ A_i.
x_n ↑ x means x_1 ≤ x_2 ≤ ⋯ and x_n → x.
x_n ↓ x means x_1 ≥ x_2 ≥ ⋯ and x_n → x.
9) Properties of P: Let A, B, A_i, A_n, A_k be F sets.
i) Finite additivity: If A_1, ..., A_n are disjoint, then P(∪_{i=1}^n A_i) = Σ_{i=1}^n P(A_i).
ii) P is monotone: A ⊆ B ⇒ P(A) ≤ P(B).
iii) If A ⊆ B, then P(B − A) = P(B) − P(A).
iv) Complement rule: P(A^c) = 1 − P(A).
v) Finite subadditivity: P(∪_{i=1}^n A_i) ≤ Σ_{i=1}^n P(A_i).
vi) Continuity from below: If A_n ↑ A, then P(A_n) ↑ P(A).
vii) Continuity from above: If A_n ↓ A, then P(A_n) ↓ P(A).
viii) Countable subadditivity: P(∪_{k=1}^∞ A_k) ≤ Σ_{k=1}^∞ P(A_k).
Note: vi) and vii) together are known as monotone continuity.
10) limsup_n A_n = ∩_{n=1}^∞ ∪_{k=n}^∞ A_k = {ω : ω ∈ A_n for infinitely many A_n}.
liminf_n A_n = ∪_{n=1}^∞ ∩_{k=n}^∞ A_k = {ω : ω ∈ A_n for all but finitely many A_n}.
If A_n ∈ F, then limsup_n A_n, liminf_n A_n ∈ F. Also, liminf_n A_n ⊆ limsup_n A_n.
11) If liminf_n A_n = limsup_n A_n, then lim_n A_n = A = liminf_n A_n = limsup_n A_n, written A_n → A. If A_n ∈ F, then lim_n A_n = A ∈ F.
Facts: (limsup_n A_n)^c = liminf_n A_n^c and (liminf_n A_n)^c = limsup_n A_n^c.
12) (Ω, F , P ) is a probability space if Ω is a sample space, F is a σ-field
on Ω and P is a probability measure on (Ω, F ). Then an event A is any set
A ∈ F.
13) For a sequence of real numbers, lim xn = limsupn xn = inf_n sup_{k≥n} x_k , and
lim xn = liminfn xn = sup_n inf_{k≥n} x_k . Also, lim (−xn ) = −lim xn .
Here inf = infimum = greatest lower bound, and sup = supremum = least upper bound.
Fact 1) lim xn ≤ lim xn . Fact 2) limn xn = x iff x = lim xn = lim xn .
Then xn → x. Fact 3) If {xn } is a bounded sequence, then lim xn = the largest
accumulation point (cluster point) of {xn }, and lim xn = the smallest accumulation point of {xn }.
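A standard illustrative sequence (chosen for this sketch, not taken from the text) is x_n = (−1)^n (1 + 1/n), whose accumulation points are ±1; finite truncations of inf_n sup_{k≥n} x_k and sup_n inf_{k≥n} x_k approximate the limits.

```python
# lim sup and lim inf of x_n = (-1)^n (1 + 1/n): the accumulation points are 1 and -1.
xs = [(-1)**n * (1 + 1/n) for n in range(1, 2001)]
tail_sups = [max(xs[n:]) for n in range(len(xs) - 1)]   # sup_{k>=n} x_k (finite truncation)
tail_infs = [min(xs[n:]) for n in range(len(xs) - 1)]
limsup = min(tail_sups)   # inf_n sup_{k>=n} x_k
liminf = max(tail_infs)   # sup_n inf_{k>=n} x_k
assert abs(limsup - 1.0) < 1e-2 and abs(liminf + 1.0) < 1e-2
assert liminf <= limsup   # Fact 1)
```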
14) Theorem: For each sequence {An } of F sets,
i) P (liminfn An ) ≤ liminfn P (An ) ≤ limsupn P (An ) ≤ P (limsupn An )
ii) Continuity of probability: If An → A, then P (An ) → P (A).
15) Theorem: Let A1 , A2 , ... be F sets.
i) If P (Ai ) = 0 for all i, then P (∪_{i=1}^∞ A_i) = 0.
ii) If P (Ai ) = 1 for all i, then P (∩_{i=1}^∞ A_i) = 1.

16) i) Two events A and B are independent, written A ⫫ B, if P (A ∩ B) = P (A)P (B).
ii) A finite collection of events A1 , ..., An is independent if for any subcollection
Ai1 , ..., Aik , P (∩_{j=1}^k A_{i_j}) = Π_{j=1}^k P (A_{i_j}).
iii) An infinite (perhaps uncountable) collection of events is independent
if each of its finite subcollections is.
If the events are not independent, then the events are dependent.
17) Borel-Cantelli Lemmas: Let (Ω, F , P ) be fixed and let the An be events.
1) If Σ_{n=1}^∞ P (An ) < ∞ (the sum converges), then P (limsupn An ) = 0.
2) If the An are independent events and Σ_{n=1}^∞ P (An ) = ∞ (the sum diverges), then P (limsupn An ) = 1.
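The two lemmas can be seen in simulation. The sketch below uses the illustrative choices P (An ) = 1/n² (summable) and P (An ) = 1/(n + 1) (not summable) with independent events; the seed and sample size are arbitrary, and the deterministic driver of the first lemma is that the tail sum Σ_{k≥n} P (Ak ) → 0.

```python
import random

# Borel-Cantelli, numerically: independent events with P(A_n) = 1/n^2
# (summable) versus P(A_n) = 1/(n+1) (not summable); count occurrences.
random.seed(0)
N = 200_000
occ_sq = sum(1 for n in range(1, N) if random.random() < 1 / n**2)
occ_harm = sum(1 for n in range(1, N) if random.random() < 1 / (n + 1))
# First lemma: sum 1/n^2 < infinity, so only finitely many A_n occur (a.s.);
# second lemma: sum 1/(n+1) = infinity, so infinitely many occur (a.s.).
print(occ_sq, occ_harm)   # typically a handful vs. on the order of ln N
# The driving fact for the first lemma is the vanishing tail sum:
tail = sum(1 / k**2 for k in range(1000, N))
assert tail < 0.002
assert occ_sq >= 1   # guaranteed: P(A_1) = 1 in the first sequence
```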
18) Let {An } be a sequence of events defined on (Ω, F , P ). Then τ =
∩_{n=1}^∞ σ(An , An+1 , ...) is the tail σ−field. (See point 4) above.) If
A ∈ τ , then A is a tail event.
19) The Kolmogorov 0-1 Law: Let {An } be a sequence of independent
events defined on (Ω, F , P ). If A ∈ τ , then P (A) = 0 or P (A) = 1.
20) A set function µ is a measure on (Ω, F ) (where F is a σ−field on Ω) if
m1) µ(A) ∈ [0, ∞] for A ∈ F (note that ∞ is allowed),
m2) µ(∅) = 0, and
m3) if A1 , A2 , ... are disjoint F sets, then µ(∪_{i=1}^∞ A_i) = Σ_{i=1}^∞ µ(A_i) (countable additivity).
21) A measure µ is finite if µ(Ω) < ∞ and infinite if µ(Ω) = ∞. If Ω =
∪_{i=1}^∞ A_i where Ai ∈ F with µ(Ai ) < ∞ for all i ∈ N, then µ is σ−finite. A
measure µ is a probability measure if µ(Ω) = 1, and every probability measure
is a finite measure and a σ−finite measure.
22) (Ω, F ) is a measurable space if Ω is a sample space and F is a
σ−field on Ω. (Ω, F , µ) is a measure space if Ω is a sample space, F is a
σ−field on Ω, and µ is a measure on (Ω, F ).
23) Theorem: Properties of a measure µ: Let A, B, Ai , An , Ak be F sets.
i) µ is monotone: A ⊆ B ⇒ µ(A) ≤ µ(B).
ii) If A ⊆ B and µ(B) < ∞, then µ(B − A) = µ(B) − µ(A).
iii) Finite additivity: If A1 , ..., An are disjoint, then µ(∪_{i=1}^n A_i) = Σ_{i=1}^n µ(A_i).
iv) Finite subadditivity: µ(∪_{i=1}^n A_i) ≤ Σ_{i=1}^n µ(A_i).
v) Continuity from above: If An ↓ A and µ(A1 ) < ∞, then µ(An ) ↓ µ(A).
vi) Continuity from below: If An ↑ A, then µ(An ) ↑ µ(A).
vii) Countable subadditivity: µ(∪_{k=1}^∞ A_k) ≤ Σ_{k=1}^∞ µ(A_k).
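For a finite Ω, the counting measure µ(A) = |A| satisfies these properties, and ii) and iv) can be checked over all pairs of subsets. A small exhaustive sketch (the four-point Ω is an arbitrary choice, not from the text):

```python
# Counting measure mu(A) = |A| on a finite Omega, checking properties
# ii) and iv) exhaustively over all pairs of subsets of {0, 1, 2, 3}.
from itertools import chain, combinations

omega = {0, 1, 2, 3}
def subsets(s):
    s = list(s)
    return [set(c) for c in chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))]

mu = len   # counting measure
for A in subsets(omega):
    for B in subsets(omega):
        if A <= B:
            assert mu(B - A) == mu(B) - mu(A)          # ii), difference rule
        assert mu(A | B) <= mu(A) + mu(B)              # iv), finite subadditivity
```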

1.4 Complements

Kolmogorov’s definition of a probability function makes a probability function a normed measure. Hence many of the tools of measure theory can be used for probability theory. See, for example, Ash and Doleans-Dade (1999), Billingsley (1995), Dudley (2002), Durrett (1995), Feller (1971), and Resnick (1999).
Gaughan (2009) is a good reference for induction.

1.5 Problems

PROBLEMS WITH AN ASTERISK * ARE ESPECIALLY USEFUL.
1.1. One way to show that A = B is to show i) if ω ∈ A then ω ∈ B so
A ⊆ B, and ii) if ω ∈ B then ω ∈ A so B ⊆ A. Suppose for each positive
integer n, ∪_{k=1}^n A_k = ∪_{k=1}^n B_k. Let A = ∪_{k=1}^∞ A_k and B = ∪_{k=1}^∞ B_k. Prove
A = B by showing i) and ii). In probability theory, often the Bk are disjoint.

1.2. Suppose A1 ⊇ A2 ⊇ A3 ⊇ · · · so that An ↓ A. Prove A = ∩_{n=1}^∞ A_n.
1.3. Billingsley (1986, problem 2.3): a) Suppose Ω ∈ D and A, B ∈ D ⇒
A − B = A ∩ B c ∈ D. Show D is a field. Hint: the first 3 properties of a
σ−field define a field.
b) Suppose Ω ∈ D and that D is closed under the formation of comple-
ments and finite disjoint unions. Show that D need not be a field. Hint: let
Ω = {1, 2, 3, 4} and D = {∅, {1, 2}, {1, 3}, {1, 4}, {2, 3}, {2, 4}, {3, 4}, Ω}.

1.4. (Similar to Billingsley (1986, problem 4.3 c)): Suppose (limsupn An )^c =
liminfn A_n^c for any sequence of sets {An }. Show (liminfn An )^c = limsupn A_n^c.

1.5. (Similar to Billingsley (1986, problem 4.3 a)):
i) Show (limsupn An ) ∩ (limsupn Bn ) ⊇ limsupn (An ∩ Bn ). Hint: ω ∈ limsupn Cn
means ω ∈ Cn infinitely often (i.o.).
ii) Show (limsupn An ) ∪ (limsupn Bn ) = limsupn (An ∪ Bn ).
[Note: By i), (limsupn Acn ) ∩ (limsupn Bnc ) ⊇ limsupn (Acn ∩ Bnc ).
Taking complements of both sides shows (liminfn An ) ∪ (liminfn Bn ) ⊆
liminfn (An ∪ Bn ).
By ii) (limsupn Acn ) ∪ (limsupn Bnc ) = limsupn (Acn ∪ Bnc ). Taking comple-
ments of both sides gives (liminfn An )∩(liminfn Bn ) = liminfn (An ∩Bn ).]

1.6. Let Λ be an arbitrary nonempty index set, and for λ ∈ Λ, let Fλ be
a σ−field on Ω. Prove that F = ∩_{λ∈Λ} Fλ is a σ−field on Ω.
1.7Q . Billingsley (1986, problem 4.15): Suppose A1 , A2 , ... are independent.
There are 4 cases for the convergence or divergence of Σ_n P (An ) and Σ_n P (A_n^c).
Describe the pair P (limsupn An ) and P (liminfn An ) in each case.

i) Σ_n P (An ) < ∞ and Σ_n P (A_n^c) < ∞
ii) Σ_n P (An ) = ∞ and Σ_n P (A_n^c) = ∞
iii) Σ_n P (An ) < ∞ and Σ_n P (A_n^c) = ∞
iv) Σ_n P (An ) = ∞ and Σ_n P (A_n^c) < ∞
Hint: One case is impossible, and for the other cases the probabilities are
0 or 1 by the Borel Cantelli lemmas and Theorem 4.1. Complementation may
also be needed.
1.8. (Similar to Billingsley (1986, problem 4.14 a)): Let A1 , A2 , ... be independent events.
i) Prove P (∩_{n=1}^∞ A_n) = Π_{n=1}^∞ P (An ).
ii) Prove P (∪_{n=1}^∞ A_n) = 1 − Π_{n=1}^∞ [1 − P (An )].
Hint: for i), Bm = ∩_{n=1}^m A_n ↓ ∩_{n=1}^∞ A_n as m → ∞.
1.9Q . Let µ be a measure on (Ω, F ), and let A, B, Ai , An , Ak be F sets.
Prove the following.
Hint for a), b), and c): the proof is nearly identical to that for a probability
measure; just replace P by µ.
a) Finite additivity: If A1 , ..., An are disjoint, then µ(∪_{i=1}^n A_i) = Σ_{i=1}^n µ(A_i).
b) µ is monotone: A ⊆ B ⇒ µ(A) ≤ µ(B).
c) If A ⊆ B and µ(B) < ∞, then µ(B − A) = µ(B) − µ(A).
d) Finite subadditivity: µ(∪_{i=1}^n A_i) ≤ Σ_{i=1}^n µ(A_i).
Hint: Let B1 = A1 and Bk = Ak ∩ A_1^c ∩ · · · ∩ A_{k−1}^c = Ak ∩ [∪_{i=1}^{k−1} A_i]^c. You
may use the fact that the Bi are disjoint, Bi ⊆ Ai , and ∪_{i=1}^n A_i = ∪_{i=1}^n B_i, as
was done for proving the analogous property for a probability measure.
e) Continuity from below: If An ↑ A, then µ(An ) ↑ µ(A).
Hint: Let B1 = A1 and Bk = Ak − Ak−1 . You may use the fact that the Bk
are disjoint, An = ∪_{i=1}^n A_i = ∪_{i=1}^n B_i for each n, and A = ∪_{i=1}^∞ A_i = ∪_{i=1}^∞ B_i,
as was done for proving the analogous property for a probability measure.
1.10. Suppose An = A for n ≥ 1 where P (A) = 0.5. Then An → A,
P (An ) → P (A) = 0.5, limsupn An = ∩_{n=1}^∞ ∪_{k=n}^∞ A_k = {ω : ω ∈ An for
infinitely many An } = A, and liminfn An = ∪_{n=1}^∞ ∩_{k=n}^∞ A_k = {ω : ω ∈ An
for all but finitely many An } = A. It is known that liminf An and limsup An
are tail events. Why does the above result not contradict Kolmogorov’s zero-
one (0-1) law?
1.11. Let µ be a measure on (Ω, F ) and let c > 0. Prove that ν = cµ is a
measure on (Ω, F ).
Note: If µ = Π_{i=1}^n µ_i is a product measure, then ν = c^n µ = Π_{i=1}^n cµ_i =
Π_{i=1}^n ν_i is a product measure by Problem 1.11. Also, a finite measure µ is a
scaled probability measure: µ = P/c where P = cµ is a probability measure
with c = 1/µ(Ω).
Exam and Quiz Problems
1.12. Let a < b and let
I = ∪_{n=1}^∞ [a + 1/n, b − 1/n] = ∪_{n∈N} [a + 1/n, b − 1/n] = ∪_{n=m}^∞ [a + 1/n, b − 1/n]
where m is the smallest positive integer such that a + 1/m ≤ b − 1/m, since
[c, d] = ∅ if c > d. I is equal to an interval. Find that interval.
1.13. a) Let {Ai }_{i=1}^∞ be a sequence of sets such that P (An ) = 0 ∀n. Prove
P (∪_{i=1}^∞ A_i) = 0.
b) Let {Bi }_{i=1}^∞ be a sequence of sets such that P (Bn ) = 1 ∀n. Then
P (B_n^c) = 0 ∀n, and by a), P (∪_{i=1}^∞ B_i^c) = 0. Prove P (∩_{i=1}^∞ B_i) = 1.
1.14. For an arbitrary sequence of events {An },

P (liminfn An ) ≤ liminfn P (An ) ≤ limsupn P (An ) ≤ P (limsupn An ).

Also, limn→∞ xn = x iff liminfn xn = limsupn xn = x where xn , x ∈ R.


a) Use these results to prove that if limn→∞ An exists, then P (limn→∞ An ) =
limn→∞ P (An ).
b) Let {An } be a sequence of events with the same probability P (An ) = p
∀n. Prove P (limsupn An ) ≥ p.
" N #c N
\ [
1.15. Prove DeMorgan’s law Ak = Ack where N ≥ n, n is a
k=n k=n
positive integer, and N = ∞ is allowed. (You may assume Ak ⊆ Ω ∀k.)
1.16. Let µ be a measure on (Ω, F ), and let A, B, Ai be F sets. You
may assume finite additivity: if A1 , ..., An are disjoint, then µ(∪_{i=1}^n A_i) =
Σ_{i=1}^n µ(A_i). If A ⊆ B and µ(B) < ∞, prove µ(B − A) = µ(B) − µ(A).
1.17. Simplify the following sets. Answers might be (a, b), [a, b), (a, b], [a, b], [a, a] =
{a}, (a, a) = ∅.
i) ∩_{n=1}^∞ (a, b + 1/n) =
ii) ∩_{n=1}^∞ [a, b + 1/n] =
iii) ∩_{n=1}^∞ (a, a + 1/n) =
iv) ∪_{n=1}^∞ (a, b − 1/n) =
v) ∪_{n=1}^∞ [a, b − 1/n] =
" N
#c N
\ [
1.18. A DeMorgan’s law can be written as Ak = Ack where
k=n k=n
N ≥ n, n is a positive integer, and N = ∞ is allowed.
" N #c
[
i) Find Ak using the above law (and complementation).
k=n
∞ [
\ ∞ ∞
\ ∞
[
ii) limsupn Acn = = Ack
where Cnc Cnc = Ack .
n=1 k=n n=1 # k=n
" ∞ c
\
Use DeMorgan’s law to find Cnc .
n=1
iii) Find Cn .
iv) Use the above
" results to show
#c
∞ [ ∞ ∞ \

c
\ [
c c
[ limsupn An ] = Ak = Ak = lim inf An .
n
n=1 k=n n=1 k=n
1.19. Suppose Λ is the index set for σ−fields Fλ on Ω that contain a
class A of subsets of Ω. Then Λ is nonempty since the σ−field of all subsets
of Ω contains A. Let the σ−field generated by A be σ(A) = ∩_{λ∈Λ} Fλ .
Prove that σ(A) is a σ−field.


1.20. Prove (limsupn An )c = liminfn Acn .
1.21. Prove (limsupn An )c = liminfn Bn and find Bn .
(Variant of 1.20.)
1.22. Fix (Ω, F , P ). If A ∈ F is an event, prove P (A^c) = 1 − P (A).
You may assume finite additivity: if A1 , ..., An are disjoint events, then
P (∪_{i=1}^n A_i) = Σ_{i=1}^n P (A_i).
1.23. What is a probability space?
Some Qual Type Problems
1.30Q . Prove the following theorem.
Theorem 1.3. Properties of P : Let A, B, Ai , An , Ak be F sets.
i) Finite additivity: If A1 , ..., An are disjoint, then P (∪_{i=1}^n A_i) = Σ_{i=1}^n P (A_i).
ii) P is monotone: A ⊆ B ⇒ P (A) ≤ P (B).
iii) If A ⊆ B, then P (B − A) = P (B) − P (A).
iv) Complement rule: P (A^c) = 1 − P (A).
v) Finite subadditivity: P (∪_{i=1}^n A_i) ≤ Σ_{i=1}^n P (A_i).

1.31Q . Prove the following theorem.
Theorem 1.3. Properties of P : Let A, B, Ai , An , Ak be F sets.
vi) Continuity from below: If An ↑ A, then P (An ) ↑ P (A).
vii) Continuity from above: If An ↓ A, then P (An ) ↓ P (A).
viii) Countable subadditivity: P (∪_{k=1}^∞ A_k) ≤ Σ_{k=1}^∞ P (A_k).
1.32Q . Prove the following theorem.
Theorem 1.5. For each sequence {An } of F sets,
i) P (liminfn An ) ≤ liminfn P (An ) ≤ limsupn P (An ) ≤ P (limsupn An )
ii) Continuity of probability: If An → A, then P (An ) → P (A).
(Problems 1.13 and 1.14 are similar to Problem 1.9.)
1.33Q . State and prove the First Borel Cantelli Lemma.
1.34Q . State and prove the Second Borel Cantelli Lemma.
1.35Q . Suppose A1 , A2 , ... are independent. There are 4 cases for the convergence
and/or divergence of Σ_n P (An ) and Σ_n P (A_n^c). One case is impossible.
(This problem is similar to Problem 1.7.)
a) Suppose that Σ_n P (An ) < ∞ and Σ_n P (A_n^c) < ∞. If possible, find
P (limsupn An ), find P (liminfn An ), and if P (An ) → c, find c.
b) Suppose that Σ_n P (An ) = ∞ and Σ_n P (A_n^c) = ∞. If possible, find
P (limsupn An ), and find P (liminfn An ). Does limn An = A exist?
c) Suppose that Σ_n P (An ) < ∞ and Σ_n P (A_n^c) = ∞. If possible, find
P (limsupn An ), find P (liminfn An ), and if P (An ) → c, find c. Was independence needed?
d) Suppose that Σ_n P (An ) = ∞ and Σ_n P (A_n^c) < ∞. If possible, find
P (limsupn An ), find P (liminfn An ), and if P (An ) → c, find c.
Chapter 2
Random Variables and Random Vectors

This chapter shows that random variables and random vectors are measurable
functions.

2.1 Measurable Functions

Let (Ω, F ) and (Ω′ , F′ ) be two measurable spaces. A mapping T : Ω → Ω′ is
a generalized function from one sample space to another sample space. Often
the mapping will be a real valued or vector valued function. If Ω′ = R^k
and F′ = B(R^k ), then X : Ω → R^k is an important mapping. If Ω′ = R
and F′ = B(R), then X : Ω → R is an important mapping. In the definition
below, the inverse image is a set, not an inverse mapping or inverse function.

Definition 2.1. a) The inverse image T −1 (A′ ) = {ω ∈ Ω : T (ω) ∈ A′ }
for any set A′ ⊆ Ω′ .
b) The inverse image X −1 (B) = {ω ∈ Ω : X(ω) ∈ B} for any set B ∈
B(R^k ).
c) The inverse image X −1 (B) = {ω ∈ Ω : X(ω) ∈ B} for any set B ∈ B(R).

Definition 2.2. a) Let (Ω, F ) and (Ω′ , F′ ) be two measurable spaces. For
a mapping T : Ω → Ω′ , the mapping T is measurable F /F′ if T −1 (A′ ) ∈ F
for each A′ ∈ F′ .
b) If Ω 0 = Rk , F 0 = B(Rk ), and X : Ω → Rk , then X is a measurable
function or measurable or measurable F if X is measurable F /B(Rk ).
Hence X is a measurable function if X −1 (B) = {ω : X(ω) ∈ B} ∈ F
∀B ∈ B(Rk ).
c) If Ω 0 = R, F 0 = B(R), and X : Ω → R, then X is a measurable function
or measurable or measurable F if X is measurable F /B(R). Hence X is
a measurable function if X −1 (B) = {ω : X(ω) ∈ B} ∈ F ∀B ∈ B(R).


Measurable functions can also be defined for the extended real numbers
[−∞, ∞].
Definition 2.3. A function f : Ω → [−∞, ∞] is a measurable function (or
measurable or F measurable or Borel measurable) if
i) f −1 (B) ∈ F ∀B ∈ B(R),
ii) f −1 ({∞}) = {ω : f(ω) = ∞} ∈ F, and
iii) f −1 ({−∞}) = {ω : f(ω) = −∞} ∈ F.

2.2 Random Variables

Comparing definitions 2.4 and 2.2 c) shows that X is a random variable iff
X is a measurable function.
Definition 2.4. Let (Ω, F , P ) be a probability space. A function X :
Ω → R = (−∞, ∞) is a random variable if the inverse image X −1 (B) ∈ F
∀B ∈ B(R). Equivalently, a function X : Ω → R is a random variable iff X
is a measurable function.
Warning: The inverse image X −1 (A) is a set, not an inverse function.

Theorem 2.1. Let X : Ω → R. Let A, B, Bn , Bλ ∈ B(R).


i) If A ⊆ B, then X −1 (A) ⊆ X −1 (B).
ii) X −1 (∪∞ ∞
n=1 Bn ) = ∪n=1 X
−1
(Bn ).
iii) X (∩n=1 Bn ) = ∩n=1 X −1 (Bn ).
−1 ∞ ∞

iv) If A and B are disjoint, then X −1 (A) and X −1 (B) are disjoint.
v) X −1 (B c ) = [X −1 (B)]c .
Let Λ be a nonempty index set.
vi) X −1 (∪λ∈Λ Bλ ) = ∪λ∈Λ X −1 (Bλ ).
vii) X −1 (∩λ∈Λ Bλ ) = ∩λ∈Λ X −1 (Bλ ).
Proof Sketch. i) If ω ∈ X −1 (A), then X(ω) ∈ A ⊆ B. Hence X(ω) ∈ B
and ω ∈ X −1 (B). Thus X −1 (A) ⊆ X −1 (B).
ii) See Problem 2.1.
iii) ω ∈ X −1 (∩∞ ∞
n=1 Bn ) iff X(ω) ∈ ∩n=1 Bn iff X(ω) ∈ Bn for each n iff
−1 ∞ −1
ω ∈ X (Bn ) for each n iff ω ∈ ∩n=1 X (Bn ).
iv) If ω ∈ X −1 (A), then X(ω) ∈ A. Hence X(ω) ∈ / B. Thus ω ∈/ X −1 (B).
−1 c c −1 c
v) ω ∈ X (B ) iff X(ω) ∈ B iff X(ω) ∈ / B iff ω ∈ [X (B)] .
vi) Similar to ii).
vii) Replace n by λ in iii). 
Note that unions and intersections in the above theorem can be finite,
countable, or uncountable.
Theorem 2.2. Let (Ω, F , P ) be a probability space. A function X : Ω →
R = (−∞, ∞) is a random variable iff {X ≤ t} = {ω ∈ Ω : X(ω) ≤ t} ∈ F
∀t ∈ R.

Remark 2.1. a) Note that (−∞, t] ∈ B(R) for any t ∈ R. Hence if X is


a random variable, then X −1 ((−∞, t]) = {ω ∈ Ω : X(ω) ∈ (−∞, t]} = {ω ∈
Ω : X(ω) ≤ t} = {X ≤ t} ∈ F ∀t ∈ R. Hence {X ≤ t} is an event for any
t ∈ R.
b) Showing that {X ≤ t} ∈ F ∀t ∈ R implies X −1 (B) ∈ F ∀B ∈ B(R) is
nontrivial.
Definition 2.5. Let (Ω, F , P ) be a probability space. Let X : Ω → R be
a random variable. Then the cumulative distribution function (cdf) of
X is the real valued function F (t) = P (X ≤ t) = P ({X ≤ t}) for t ∈ R.
The cdf is sometimes called the distribution function.
The Borel σ-field is large enough so that most functions that could be
suggested by a person who has not had measure theory tend to be measurable.
If A ⊆ Ω but A ∈/ F, then IA is not a measurable function, where the
indicator function IA (ω) = 1 if ω ∈ A, and IA (ω) = 0 if ω ∈/ A. See Example
2.1 b).
Theorem 2.3. Fix (Ω, F , P ). Let X : Ω → R. X is a measurable function
iff X is a random variable iff any one of the following conditions holds.
i) X −1 (B) = {ω ∈ Ω : X(ω) ∈ B} ∈ F ∀ B ∈ B(R).
ii) X −1 ((−∞, t]) = {X ≤ t} = {ω ∈ Ω : X(ω) ≤ t} ∈ F ∀t ∈ R.
iii) X −1 ((−∞, t)) = {X < t} = {ω ∈ Ω : X(ω) < t} ∈ F ∀t ∈ R.
iv) X −1 ([t, ∞)) = {X ≥ t} = {ω ∈ Ω : X(ω) ≥ t} ∈ F ∀t ∈ R.
v) X −1 ((t, ∞)) = {X > t} = {ω ∈ Ω : X(ω) > t} ∈ F ∀t ∈ R.
Note that i) holds in the above theorem by Definition 2.4 and ii) holds by
Theorem 2.2.
Example 2.1. a) A constant X(ω) ≡ c for all ω ∈ Ω is a random variable
since X −1 (A) = Ω ∈ F if c ∈ A and X −1 (A) = ∅ ∈ F if c ∈ / A for any
A ∈ B(R).
b) Let the indicator function

1, ω ∈ A
X(ω) = IA (ω) =
0, ω ∈/ A.

Then X = IA is a random variable iff A ∈ F.


Proof. X −1 (B) = {ω : IA (ω) ∈ B}, but IA (ω) is 0 or 1. Thus
X −1 (B) = ∅ if 0 ∉ B and 1 ∉ B,
X −1 (B) = A^c if 0 ∈ B and 1 ∉ B,
X −1 (B) = A if 0 ∉ B and 1 ∈ B,
X −1 (B) = Ω if 0 ∈ B and 1 ∈ B.
Hence X −1 (B) ∈ F ∀B ∈ B(R) iff A ∈ F.
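The four cases in this proof can be checked by brute force on a small finite Ω; the particular Ω and A below are illustrative choices, not from the text.

```python
# Inverse images of X = I_A on a finite Omega: as in Example 2.1 b), X^{-1}(B)
# is one of emptyset, A^c, A, Omega depending on whether B contains 0 and/or 1.
omega = {1, 2, 3, 4}
A = {1, 2}

def inv_image(B):
    return {w for w in omega if (1 if w in A else 0) in B}

assert inv_image(set()) == set()          # 0 not in B, 1 not in B
assert inv_image({0}) == omega - A        # 0 in B, 1 not in B -> A^c
assert inv_image({1}) == A                # 1 in B, 0 not in B -> A
assert inv_image({0, 1}) == omega         # both -> Omega
```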


Let Q be the set of rational numbers. Let RV stand for random variable.
Theorem 2.4. Let X, Y , and Xi be RVs on (Ω, F , P ).
a) aX is a RV for any a ∈ R.
b) aX + bY is a RV for any a, b ∈ R. Hence Σ_{i=1}^n X_i is a RV.
c) max(X, Y ) is a RV. Hence max(X1 , ..., Xn) is a RV.
d) min(X, Y ) is a RV. Hence min(X1 , ..., Xn) is a RV.
e) XY is a RV. Hence X1 · · · Xn is a RV.
f) X/Y is a RV if Y (ω) 6= 0 ∀ ω ∈ Ω.
g) supn Xn is a RV.
h) infn Xn is a RV.
i) limsupn Xn is a RV.
j) liminfn Xn is a RV.
k) If limn P
Xn = X, thenPX is a RV.
m ∞
l) If limm n=1 Xn = n=1 Xn = X, then X is a RV.
m) If h : Rn → R is measurable, then Y = h(X1 , ..., Xn) is a RV.
n) If h : Rn → R is continuous, then h is measurable and Y = h(X1 , ..., Xn)
is a RV.
o) If h : R → R is monotone, then h is measurable and h(X) is a RV.
Proof of a)–l). a) If a > 0, then {aX ≤ t} = {X ≤ t/a} ∈ F.
If a < 0, then {aX ≤ t} = {X ≥ t/a} ∈ F.
If a = 0, then aX ≡ 0 is a constant, and a constant is a random variable.
Thus aX is a random variable if X is a random variable.
b) For each t,

{X + Y < t} = ∪r∈Q [{X < r} ∩ {Y < t − r}] ∈ F

since the union is countable. Thus a sum of two random variables is a random
variable, and by induction, a finite sum of random variables is a random
variable.
c) For each t, {max(X, Y ) ≤ t} = {X ≤ t} ∩ {Y ≤ t} ∈ F
(since max(X, Y ) ≤ t iff both X ≤ t and Y ≤ t).
d) For each t, {min(X, Y ) ≤ t} = {X ≤ t} ∪ {Y ≤ t} ∈ F
(since min(X, Y ) ≤ t iff at least one of the following holds i) X ≤ t or ii)
Y ≤ t).
e) First show that X 2 is a random variable if X is a random variable. For
any t ≥ 0, {X 2 ≤ t} = {−√t ≤ X ≤ √t} = {X ≤ √t} − {X < −√t} ∈ F,
while for any t < 0, {X 2 ≤ t} = ∅ ∈ F. Thus X 2 is a random variable. Then
XY = 0.5[(X + Y )2 − X 2 − Y 2 ] is a random variable by b).
f) First show 1/Y is a random variable. Then the result follows by e). Now
{1/Y ≤ t} = {Y ≥ 1/t} ∪ {Y < 0} for t > 0, {1/Y ≤ t} = {Y < 0} for t = 0,
and {1/Y ≤ t} = {1/t ≤ Y < 0} for t < 0.
(Note that 1/Y ≤ 0 iff Y < 0 since Y (ω) ≠ 0 ∀ω, and for t < 0, 1/Y ≤ t
requires Y < 0.)


g) For each t, {supn Xn ≤ t} = ∩∞n=1 {Xn ≤ t} ∈ F.
h) For each t, {inf n Xn ≥ t} = ∩∞
n=1 {Xn ≥ t} ∈ F.
i) limsupn Xn = inf k supm≥k Xm = inf k Yk is a RV by h).
j) liminfn Xn = supk inf m≥k Xm = supk Wk is a RV by g).
k) X = lim supn Xn = lim inf n Xn is a RV by i) and j).
l) By induction and b), Ym = Σ_{n=1}^m X_n is a RV. Thus limm Ym =
limm Σ_{n=1}^m X_n = Σ_{n=1}^∞ X_n = X is a RV by k).
(Note that limm Σ_{n=1}^m X_n = Σ_{n=1}^∞ X_n = X means that
limm Σ_{n=1}^m X_n(ω) = Σ_{n=1}^∞ X_n(ω) = X(ω) ∀ ω.)
Example 2.2. If X is a random variable, then X + = max(X, 0) and
X − = − min(X, 0) are random variables. Hence X + + X − = |X| is a random
variable.
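Since X +, X −, and |X| are all defined pointwise, the identities can be checked at each ω; a tiny sketch with an assumed three-point Ω (not from the text):

```python
# X^+ = max(X, 0) and X^- = -min(X, 0) computed pointwise on a finite Omega;
# X^+ + X^- = |X| and X^+ - X^- = X hold at every omega.
X = {"w1": -2.5, "w2": 0.0, "w3": 3.0}   # a random variable as a map Omega -> R
Xplus = {w: max(x, 0) for w, x in X.items()}
Xminus = {w: -min(x, 0) for w, x in X.items()}
for w in X:
    assert Xplus[w] + Xminus[w] == abs(X[w])
    assert Xplus[w] - Xminus[w] == X[w]
```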
Theorem 2.5. Fix (Ω, F , P ). Let the induced probability PX be defined by
PX (B) = P [X −1 (B)] for any B ∈ B(R). Then (R, B(R), PX ) is a probability
space.
Proof. PX is a set function on B(R). We need to show that PX is a probability
measure.
P1) Let B ∈ B(R). Then PX (B) = P [X −1 (B)]. Hence 0 ≤ PX (B) ≤ 1.
P2) PX (R) = P [X −1 (R)] = P (Ω) = 1, and PX (∅) = P ({ω : X(ω) ∈ ∅}) =
P (∅) = 0.
P3) Let {Bi } be disjoint B(R) sets. Then PX (∪_{i=1}^∞ B_i) = P [X −1 (∪_{i=1}^∞ B_i)] =
P [∪_{i=1}^∞ X −1 (B_i)] = Σ_{i=1}^∞ P [X −1 (B_i)] = Σ_{i=1}^∞ PX (B_i). (Theorem 2.1 ii)
gives the second equality, but the inverse images of disjoint sets are disjoint
sets by Theorem 2.1 iv), giving the third equality.)
Definition 2.6. The distribution of X is PX (B) = P [X −1 (B)], B ∈ B(R).
Note that the cumulative distribution function F (t) = FX (t) = PX ((−∞, t])
since PX ((−∞, t]) = P [X −1 ((−∞, t])] = P ({ω : X(ω) ∈ (−∞, t]}) = P (X ≤
t) and since (−∞, t] ∈ B(R).
Notation. For a given random variable X, the subscript X in PX will
often be suppressed: e.g., write P ((−∞, x]) for PX ((−∞, x]). This notation
is often used when PX is the only probability of interest, and this notation
is used in the following proof.
Theorem 2.6. A function F : R → [0, 1] is a cumulative distribution
function of a random variable X if
df1) F is nondecreasing: x1 < x2 ⇒ F (x1 ) ≤ F (x2 ).
df2) F is right continuous:

lim F (x + h) = F (x) ∀x ∈ R.
h↓0

df3) F (−∞) = lim F (x) = 0.


x→−∞
df4) F (∞) = lim F (x) = 1.
x→∞
df5) F (x) can have at most countably many points of discontinuity.
Proof of df1)-df4). F (x) = P ((−∞, x]). Thus 0 ≤ F (x) ≤ 1 ∀x.
df1) If x1 < x2 , then (−∞, x1 ] ⊆ (−∞, x2 ]. Thus F (x1 ) = P ((−∞, x1 ]) ≤
P ((−∞, x2 ]) = F (x2 ).
df2) As h ↓ 0, (−∞, x + h] ↓ (−∞, x]. Thus F (x + h) = P ((−∞, x + h]) ↓
P ((−∞, x]) = F (x).

df3) (−∞, −n] ↓ ∅. Hence F (−n) ↓ 0, and limn→∞ F (−n) = limx→−∞ F (x) =
0.
df4) (−∞, n] ↑ R. Hence F (n) ↑ 1, and limn→∞ F (n) = limx→∞ F (x) = 1.

For the above proof, technically need Ah ↓ A to be a countable limit,
where Ah = (−∞, x + h] ↓ (−∞, x] = A, to apply the continuity from above
property of probability, but (−∞, x+h] ↓ (−∞, x] regardless of how h ↓ 0 (e.g.
using h = 1/n, a countable sequence of rational numbers, or an uncountable
sequence of irrationals), and (−∞, x + h] and (−∞, x] are Borel sets. Thus
the probabilities do exist and do decrease and converge to the limit F (x).
Similar remarks apply to df3) and df4).
Remark 2.2. Define F (x−) = P (X < x). Then P (X = x) = F (x) −
F (x−). Note that P (a < X ≤ b) = F (b) − F (a).
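These identities are easy to check for a discrete random variable; the fair-die pmf below is an illustrative choice, not from the text.

```python
# For a fair die, F(x) = P(X <= x) and F(x-) = P(X < x); then
# P(X = x) = F(x) - F(x-) and P(a < X <= b) = F(b) - F(a).
pmf = {k: 1/6 for k in range(1, 7)}

def F(x):
    return sum(p for k, p in pmf.items() if k <= x)

def F_minus(x):
    return sum(p for k, p in pmf.items() if k < x)

assert abs((F(3) - F_minus(3)) - 1/6) < 1e-12      # P(X = 3)
assert abs((F(5) - F(2)) - 3/6) < 1e-9             # P(2 < X <= 5) = P({3, 4, 5})
assert abs(F(6) - 1.0) < 1e-9                      # df4): F(x) -> 1
```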
Definition 2.7. The σ-field σ(X) is the smallest σ-field with respect to
which the random variable X is measurable.
Theorem 2.7. σ(X) = the collection {X −1 (B) : B ∈ B(R)}, which is a
σ-field.
Proof. The above collection of sets is a subset of σ(X). Hence the result
follows if the collection is a σ-field.
σ1) X −1 (R) = Ω ∈ σ(X).
σ2) Let A ∈ σ(X). Then A = X −1 (B) for some B ∈ B(R). Thus A^c =
[X −1 (B)]^c = X −1 (B^c ) by Theorem 2.1 v), where B^c ∈ B(R). Hence A^c ∈
σ(X).
σ3) A, B ∈ σ(X) implies A = X −1 (C) and B = X −1 (D) for some sets
C, D ∈ B(R). Hence A ∩ B = X −1 (C ∩ D) ∈ σ(X) by Theorem 2.1 vii).
σ4) Let {Ai }_{i=1}^∞ ∈ σ(X). Then Ai = X −1 (Bi ) for some Bi ∈ B(R). Thus
∪_{i=1}^∞ A_i = ∪_{i=1}^∞ X −1 (B_i) = X −1 (∪_{i=1}^∞ B_i) by Theorem 2.1 ii). Thus
∪_{i=1}^∞ A_i ∈ σ(X).
Example 2.1, continued. For a) where X is a constant, σ(X) = {∅, Ω},
the smallest possible σ-field. For b) where X = IA with A ∈ F, σ(X) =
{∅, A, A^c , Ω}.
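For a finite Ω, σ(X) can be enumerated directly as the collection of inverse images X −1 (B) with B ranging over subsets of the (finite) range of X. A sketch for X = IA (the particular Ω and A below are arbitrary choices):

```python
# sigma(X) = {X^{-1}(B) : B in B(R)} for X = I_A on a finite Omega.
# Since X takes only the values 0 and 1, every inverse image is determined by
# whether B contains 0 and/or 1, so sigma(X) = {emptyset, A, A^c, Omega}.
from itertools import chain, combinations

omega = frozenset({1, 2, 3, 4})
A = frozenset({1, 3})
X = lambda w: 1 if w in A else 0

sigma_X = set()
for B in chain.from_iterable(combinations([0, 1], r) for r in range(3)):
    sigma_X.add(frozenset(w for w in omega if X(w) in B))
assert sigma_X == {frozenset(), A, omega - A, omega}
```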

2.3 Random Vectors

Definition 2.8. Let (Ω, F , P ) be a probability space. A function X : Ω → Rk


is a random vector iff X is a measurable function iff the inverse image
X −1 (B) = {ω : X(ω) ∈ B} ∈ F ∀B ∈ B(Rk ).
The random vector X = (X1 , ..., Xk) and X(ω) = (X1 (ω), ..., Xk (ω))
where the Xi : Ω → R are random variables (measurable functions) for
i = 1, ..., k. A random variable is the special case of a random vector where
k = 1.

Theorem 2.1 is the special case of Theorem 2.8 with k = 1.


Theorem 2.8. Let X : Ω → Rk . Let A, B, Bn , Bλ ∈ B(Rk ).
i) If A ⊆ B, then X −1 (A) ⊆ X −1 (B).
ii) X −1 (∪∞ ∞
n=1 Bn ) = ∪n=1 X
−1
(Bn ).
iii) X (∩n=1 Bn ) = ∩n=1 X −1 (Bn ).
−1 ∞ ∞

iv) If A and B are disjoint, then X −1 (A) and X −1 (B) are disjoint.
v) X −1 (B c ) = [X −1 (B)]c .
Let Λ be a nonempty index set.
vi) X −1 (∪λ∈Λ Bλ ) = ∪λ∈Λ X −1 (Bλ ).
vii) X −1 (∩λ∈Λ Bλ ) = ∩λ∈Λ X −1 (Bλ ).
Proof Sketch. i) If ω ∈ X −1 (A), then X(ω) ∈ A ⊆ B. Hence X(ω) ∈ B
and ω ∈ X −1 (B). Thus X −1 (A) ⊆ X −1 (B).
ii) Similar to Problem 2.1.
iii) ω ∈ X −1 (∩_{n=1}^∞ B_n) iff X(ω) ∈ ∩_{n=1}^∞ B_n iff X(ω) ∈ Bn for each n iff
ω ∈ X −1 (Bn ) for each n iff ω ∈ ∩_{n=1}^∞ X −1 (B_n).
iv) If ω ∈ X −1 (A), then X(ω) ∈ A. Hence X(ω) ∈/ B. Thus ω ∈/ X −1 (B).
v) ω ∈ X −1 (B^c ) iff X(ω) ∈ B^c iff X(ω) ∈/ B iff ω ∈ [X −1 (B)]^c .
vi) Similar to ii).
vii) Replace n by λ in iii). 
Theorem 2.5 is the special case of Theorem 2.9 with k = 1.
Theorem 2.9. Fix (Ω, F , P ). If X is a 1 × k random vector, let the
induced probability PX be defined by PX (B) = P [X −1 (B)] for any B ∈
B(R^k ). Then (R^k , B(R^k ), PX ) is a probability space.
Proof. PX is a set function on B(R^k ). We need to show that PX is a
probability measure.
P1) Let B ∈ B(R^k ). Then PX (B) = P [X −1 (B)]. Hence 0 ≤ PX (B) ≤ 1.
P2) PX (R^k ) = P [X −1 (R^k )] = P (Ω) = 1, and PX (∅) = P ({ω : X(ω) ∈
∅}) = P (∅) = 0.
P3) Let {Bi } be disjoint B(R^k ) sets. Then PX (∪_{i=1}^∞ B_i) = P [X −1 (∪_{i=1}^∞ B_i)] =
P [∪_{i=1}^∞ X −1 (B_i)] = Σ_{i=1}^∞ P [X −1 (B_i)] = Σ_{i=1}^∞ PX (B_i). (Theorem 2.8 ii)
gives the second equality, but the inverse images of disjoint sets are disjoint
sets by Theorem 2.8 iv), giving the third equality.)
Definition 2.9. The σ-field σ(X) is the smallest σ-field with respect to
which the 1 × k random vector X is measurable.
Theorem 2.10. σ(X) = the collection {X −1 (B) : B ∈ B(Rk )}, which is
a σ-field.
Proof. The above collection of sets is a subset of σ(X). Hence the result
follows if the collection is a σ-field.
σ1) X −1 (R^k ) = Ω ∈ σ(X).
σ2) Let A ∈ σ(X). Then A = X −1 (B) for some B ∈ B(R^k ). Thus A^c =
[X −1 (B)]^c = X −1 (B^c ) by Theorem 2.8 v), where B^c ∈ B(R^k ). Hence A^c ∈
σ(X).
σ3) A, B ∈ σ(X) implies A = X −1 (C) and B = X −1 (D) for some sets
C, D ∈ B(R^k ). Hence A ∩ B = X −1 (C ∩ D) ∈ σ(X) by Theorem 2.8 vii).
σ4) Let {Ai }_{i=1}^∞ ∈ σ(X). Then Ai = X −1 (Bi ) for some Bi ∈ B(R^k ). Thus
∪_{i=1}^∞ A_i = ∪_{i=1}^∞ X −1 (B_i) = X −1 (∪_{i=1}^∞ B_i) by Theorem 2.8 ii). Thus
∪_{i=1}^∞ A_i ∈ σ(X).
Definition 2.10. The cumulative distribution function (cdf) of a
1 × k random vector X is FX (x) = F (x) = P (X1 ≤ x1 , ..., Xk ≤ xk ) for any
x = (x1 , ..., xk) ∈ Rk .

2.4 Some Useful Distributions

Let the population quantile be yδ . Then P (Y ≤ yδ ) = δ if Y has a pdf that
is positive at yδ . The moment generating function (mgf) m(t) and characteristic
function c(t) will be defined in Chapter 4. The cumulative distribution
function (cdf) or distribution function is F (x). Context will be used to determine
whether f(x) is a probability density function (pdf) or probability
mass function (pmf).
R∞
Definition 2.11. The gamma function Γ (x) = 0 tx−1 e−t dt for x > 0.

Some properties of the gamma function follow. i) Γ (k) = (k − 1)! for integer
k ≥ 1. ii) Γ (x + 1) = x Γ (x) for x > 0. iii) Γ (x) = (x − 1) Γ (x − 1) for
x > 1. iv) Γ (0.5) = √π.

1) Y ∼ beta(δ, ν)

Γ (δ + ν) δ−1
f(y) = y (1 − y)ν−1
Γ (δ)Γ (ν)

where δ > 0, ν > 0 and 0 ≤ y ≤ 1.


E(Y ) = δ/(δ + ν), V (Y ) = δν/[(δ + ν)^2 (δ + ν + 1)].

2) Bernoulli(ρ) = binomial(k = 1, ρ) f(y) = ρy (1 − ρ)1−y for y = 0, 1.


E(Y ) = ρ, V (Y ) = ρ(1 − ρ).

m(t) = [(1 − ρ) + ρet ], c(t) = [(1 − ρ) + ρeit ].

3) binomial(k, ρ), Y ∼ BIN (k, ρ),
f(y) = C(k, y) ρ^y (1 − ρ)^{k−y},
where C(k, y) is the binomial coefficient, for y = 0, 1, . . . , k where 0 < ρ < 1.
E(Y ) = kρ, V (Y ) = kρ(1 − ρ).


m(t) = [(1 − ρ) + ρe^t ]^k , c(t) = [(1 − ρ) + ρe^{it} ]^k . If Y1 , ..., Yn are independent
binomial BIN (ki , ρ) random variables, then
Σ_{i=1}^n Y_i ∼ BIN (Σ_{i=1}^n k_i , ρ).
Thus if Y1 , ..., Yn are iid BIN (k, ρ) random variables, then Σ_{i=1}^n Y_i ∼ BIN (nk, ρ).
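The independent-sum fact corresponds to an identity of pmfs: the convolution of the BIN (k1 , ρ) and BIN (k2 , ρ) pmfs equals the BIN (k1 + k2 , ρ) pmf (Vandermonde's identity). A sketch checking this exactly; the parameter values are arbitrary.

```python
# Exact check that the convolution of BIN(k1, rho) and BIN(k2, rho) pmfs is the
# BIN(k1 + k2, rho) pmf (the independent-sum fact), using math.comb.
from math import comb, isclose

def binom_pmf(k, rho):
    return [comb(k, y) * rho**y * (1 - rho)**(k - y) for y in range(k + 1)]

def convolve(p, q):
    """pmf of the sum of two independent integer-valued random variables."""
    r = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            r[i + j] += pi * qj
    return r

rho, k1, k2 = 0.3, 4, 6
lhs = convolve(binom_pmf(k1, rho), binom_pmf(k2, rho))
rhs = binom_pmf(k1 + k2, rho)
assert all(isclose(a, b, abs_tol=1e-12) for a, b in zip(lhs, rhs))
```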
4) Y ∼ Cauchy(µ, σ),
f(y) = 1/{πσ[1 + ((y − µ)/σ)^2 ]}
where y and µ are real numbers and σ > 0. E(Y ) and V (Y ) do not exist.
c(t) = exp(itµ − |t|σ).
F (y) = [arctan((y − µ)/σ) + π/2]/π.
5) chi-square(p) = gamma(ν = p/2, λ = 2), Y ∼ χ²_p ,
f(y) = y^{p/2−1} e^{−y/2} / [2^{p/2} Γ (p/2)]
where y > 0 and p is a positive integer. E(Y ) = p, V (Y ) = 2p.
m(t) = [1/(1 − 2t)]^{p/2} = (1 − 2t)^{−p/2} for t < 1/2, c(t) = [1/(1 − i2t)]^{p/2} .
If Y1 , ..., Yn are independent chi-square χ²_{p_i} , then
Σ_{i=1}^n Y_i ∼ χ²(Σ_{i=1}^n p_i).
Thus if Y1 , ..., Yn are iid χ²_p , then
Σ_{i=1}^n Y_i ∼ χ²_{np} .

6) exponential(λ) = gamma(ν = 1, λ), Y ∼ EXP (λ),
f(y) = (1/λ) exp(−y/λ) I(y ≥ 0)
where λ > 0. E(Y ) = λ, V (Y ) = λ² , and yδ = −λ ln(1 − δ).
m(t) = 1/(1 − λt) for t < 1/λ, c(t) = 1/(1 − iλt).
F (y) = 1 − exp(−y/λ), y ≥ 0.
If Y1 , ..., Yn are iid exponential EXP (λ), then
Σ_{i=1}^n Y_i ∼ G(n, λ).
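The quantile formula yδ = −λ ln(1 − δ) just inverts the cdf F (y) = 1 − exp(−y/λ), which can be confirmed numerically; the value of λ below is arbitrary.

```python
# The exponential quantile formula y_delta = -lambda*ln(1 - delta) inverts the
# cdf F(y) = 1 - exp(-y/lambda): F(y_delta) = delta.
from math import exp, log, isclose

lam = 2.5
F = lambda y: 1 - exp(-y / lam)
for delta in (0.1, 0.5, 0.9, 0.99):
    y_delta = -lam * log(1 - delta)
    assert isclose(F(y_delta), delta, abs_tol=1e-12)
```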

7) gamma(ν, λ), Y ∼ G(ν, λ),
f(y) = y^{ν−1} e^{−y/λ} / [λ^ν Γ (ν)]
where ν, λ, and y are positive. E(Y ) = νλ, V (Y ) = νλ² .
m(t) = [1/(1 − λt)]^ν for t < 1/λ, c(t) = [1/(1 − iλt)]^ν .
If Y1 , ..., Yn are independent gamma G(νi , λ), then
Σ_{i=1}^n Y_i ∼ G(Σ_{i=1}^n ν_i , λ).
Thus if Y1 , ..., Yn are iid G(ν, λ), then Σ_{i=1}^n Y_i ∼ G(nν, λ).
8) Y ∼ N (µ, σ² ),
f(y) = [1/√(2πσ²)] exp(−(y − µ)²/(2σ²))
where σ > 0 and µ and y are real. E(Y ) = µ, V (Y ) = σ² , and yδ = µ + σzδ .
m(t) = exp(tµ + t²σ²/2), c(t) = exp(itµ − t²σ²/2).
F (y) = Φ((y − µ)/σ).
If Y1 , ..., Yn are independent normal N (µi , σ²_i ), then
Σ_{i=1}^n (a_i + b_i Y_i) ∼ N (Σ_{i=1}^n (a_i + b_i µ_i), Σ_{i=1}^n b²_i σ²_i).
Here ai and bi are fixed constants. Thus if Y1 , ..., Yn are iid N (µ, σ² ), then
the sample mean Ȳ ∼ N (µ, σ²/n).
9) Poisson(θ), Y ∼ POIS(θ),
f(y) = e^{−θ} θ^y / y!
for y = 0, 1, . . . , where θ > 0. E(Y ) = θ = V (Y ).
m(t) = exp(θ(e^t − 1)), c(t) = exp(θ(e^{it} − 1)).
If Y1 , ..., Yn are independent POIS(θi ), then
Σ_{i=1}^n Y_i ∼ POIS(Σ_{i=1}^n θ_i).
Thus if Y1 , ..., Yn are iid POIS(θ), then
Σ_{i=1}^n Y_i ∼ POIS(nθ).

10) uniform(θ1 , θ2 ), Y ∼ U (θ1 , θ2 ),
f(y) = I(θ1 ≤ y ≤ θ2 )/(θ2 − θ1 ).
F (y) = (y − θ1 )/(θ2 − θ1 ) for θ1 ≤ y ≤ θ2 . E(Y ) = (θ1 + θ2 )/2, V (Y ) =
(θ2 − θ1 )²/12, and yδ = (θ2 − θ1 )δ + θ1 . By definition, m(0) = c(0) = 1. For
t ≠ 0,
m(t) = (e^{tθ2} − e^{tθ1})/[(θ2 − θ1 )t], and c(t) = (e^{itθ2} − e^{itθ1})/[(θ2 − θ1 )it].
11) point mass at c: The distribution of Y is a point mass at c (or Y
is degenerate at c) if P (Y = c) = 1 with pmf f(c) = 1. Hence Y ∼ N (c, 0),
E(Y ) = c, V (Y ) = 0. m(t) = etc . c(t) = eitc.
More Distributions:
12) If Y has a geometric distribution, Y ∼ geom(ρ), then the pmf of Y is
f(y) = P (Y = y) = ρ(1 − ρ)^y
for y = 0, 1, 2, ... and 0 < ρ < 1. E(Y ) = (1 − ρ)/ρ, V (Y ) = (1 − ρ)/ρ² .
Y ∼ NB(1, ρ). Hence the mgf of Y is
m(t) = ρ/[1 − (1 − ρ)e^t ]
for t < − log(1 − ρ).


13) If Y has an inverse Gaussian distribution, Y ∼ IG(θ, λ), then the pdf
of Y is
f(y) = √(λ/(2πy³)) exp(−λ(y − θ)²/(2θ²y))
where y, θ, λ > 0. E(Y ) = θ and V (Y ) = θ³/λ. The mgf is
m(t) = exp[(λ/θ)(1 − √(1 − 2θ²t/λ))] for t < λ/(2θ²),
and c(t) = exp[(λ/θ)(1 − √(1 − 2θ²it/λ))].

14) If Y has a negative binomial distribution, Y ∼ NB(r, ρ), then the pmf
of Y is
f(y) = P (Y = y) = C(r + y − 1, y) ρ^r (1 − ρ)^y,
where C(n, y) is the binomial coefficient, for y = 0, 1, . . . where 0 < ρ < 1.
E(Y ) = r(1 − ρ)/ρ, and V (Y ) = r(1 − ρ)/ρ² .
The moment generating function is
m(t) = [ρ/(1 − (1 − ρ)e^t )]^r
for t < − log(1 − ρ).


15) If Y has an F distribution, Y ∼ F (ν1 , ν2 ), then the pdf of Y is
f(y) = [Γ ((ν1 + ν2 )/2)/(Γ (ν1 /2)Γ (ν2 /2))] (ν1 /ν2 )^{ν1 /2} y^{(ν1 −2)/2} / [1 + (ν1 /ν2 )y]^{(ν1 +ν2 )/2}
where y > 0 and ν1 and ν2 are positive integers.
E(Y ) = ν2 /(ν2 − 2) for ν2 > 2, and
V (Y ) = 2 [ν2 /(ν2 − 2)]² (ν1 + ν2 − 2)/[ν1 (ν2 − 4)] for ν2 > 4.
16) If Y has a Student’s t distribution, Y ∼ tp, then the pdf of Y is

f(y) = [Γ((p + 1)/2) / ((pπ)^(1/2) Γ(p/2))] (1 + y^2/p)^(−(p+1)/2)

where p is a positive integer and y is real. This family is symmetric about 0. The t1 distribution is the Cauchy(0, 1) distribution. If Z is N(0, 1) and is independent of W ∼ χ^2_p, then

Z/(W/p)^(1/2)

is tp. E(Y) = 0 for p ≥ 2. V(Y) = p/(p − 2) for p ≥ 3.


Two Multivariate Distributions:
17) point mass at c: The distribution of the p × 1 random vector Y is a
point mass at c (or Y is degenerate at c) if P (Y = c) = 1 with pmf f(c) = 1.
Hence Y ∼ Np(c, 0), E(Y) = c, Cov(Y) = 0, m(t) = e^(tᵀc), c(t) = e^(itᵀc).
18) multivariate normal (MVN) distribution: If Y ∼ Np (µ, Σ), then
E(Y ) = µ and Cov(Y ) = Σ.
   
m(t) = exp(tᵀµ + (1/2)tᵀΣt), c(t) = exp(itᵀµ − (1/2)tᵀΣt).

If Y ∼ Np (µ, Σ) and if A is a q × p matrix, then AY ∼ Nq (Aµ, AΣAT ).


If a is a p × 1 vector of constants, then Y + a ∼ Np (µ + a, Σ).
     
Let Y = (Y1ᵀ, Y2ᵀ)ᵀ, µ = (µ1ᵀ, µ2ᵀ)ᵀ, and Σ = [Σ11 Σ12; Σ21 Σ22] be partitioned conformably.
All subsets of a MVN are MVN: (Yk1 , ..., Ykq )T ∼ Nq (µ̃, Σ̃) where
µ̃i = E(Yki ) and Σ̃ ij = Cov(Yki , Ykj ). In particular, Y 1 ∼ Nq (µ1 , Σ 11 ) and
Y 2 ∼ Np−q (µ2 , Σ 22 ). If Y ∼ Np (µ, Σ), then Y 1 and Y 2 are independent iff
Σ 12 = 0.
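The moment map Y → AY, with AY ∼ Nq(Aµ, AΣAᵀ), can be illustrated with a small hand computation; the numbers below are made up, and A averages the two coordinates (q = 1, p = 2).

```python
# Compute A mu and A Sigma A^T for one illustrative 2-dimensional example.
def matmul(A, B):
    # naive matrix product for small lists-of-lists
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(row) for row in zip(*A)]

mu = [[1.0], [2.0]]
Sigma = [[4.0, 1.0], [1.0, 9.0]]
A = [[0.5, 0.5]]  # AY = (Y1 + Y2)/2

new_mean = matmul(A, mu)                          # A mu
new_cov = matmul(matmul(A, Sigma), transpose(A))  # A Sigma A^T

print(new_mean[0][0])  # 1.5
print(new_cov[0][0])   # (4 + 1 + 1 + 9)/4 = 3.75
```

So (Y1 + Y2)/2 ∼ N(1.5, 3.75) for this choice of µ and Σ.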

2.5 Summary

33) Let (Ω, F) and (Ω′, F′) be two measurable spaces. For a mapping T : Ω → Ω′, the mapping T is measurable F/F′ if T⁻¹(A′) ∈ F for each A′ ∈ F′. If Ω′ = R^k and F′ = B(R^k), and X : Ω → R^k, then X is a measurable function iff X is measurable F/B(R^k) iff X is a random variable for k = 1 and X is a 1 × k random vector for k > 1 iff X⁻¹(B) = {ω : X(ω) ∈ B} ∈ F ∀B ∈ B(R^k).
Note the random vector X = (X1 , ..., Xk) and X(ω) = (X1 (ω), ..., Xk (ω))
where the Xi : Ω → R are random variables (measurable functions) for
i = 1, ..., k.
20) Let (Ω, F , P ) be a probability space. A function X : Ω → R is a
random variable if X −1 (B) ∈ F ∀ B ∈ B(R). Equivalently, X is a random
variable if
{X ≤ t} = {ω ∈ Ω : X(ω) ≤ t} ∈ F ∀t ∈ R.
21) The random variable X is a measurable function. B(R) is the Borel
σ−field on the real numbers R = (−∞, ∞). The inverse image X −1 (B) =

{ω ∈ Ω : X(ω) ∈ B}. Note that the inverse image X −1 (B) is a set. X −1 (B)
is not the inverse function.
27) Let B(R) be the Borel σ−field on the real numbers R = (−∞, ∞). Let
(Ω, F ) be a measurable space, and let the real function X : Ω → R. Then X
is a measurable function if X −1 (B) ∈ F ∀ B ∈ B(R). Equivalently, X is
a measurable function if
{X ≤ t} = {ω ∈ Ω : X(ω) ≤ t} ∈ F ∀t ∈ R.
28) Fix the probability space (Ω, F , P ). Combining 20) and 27) shows X
is a random variable iff X is a measurable function.
71) Let X : Ω → R. Let A, B, Bn ∈ B(R).
i) If A ⊆ B, then X −1 (A) ⊆ X −1 (B).
ii) X −1 (∪n Bn ) = ∪n X −1 (Bn ).
iii) X −1 (∩n Bn ) = ∩n X −1 (Bn ).
iv) If A and B are disjoint, then X −1 (A) and X −1 (B) are disjoint.
v) X −1 (B c ) = [X −1 (B)]c .
(The unions and intersections in ii) and iii) can be finite, countable or un-
countable.)
72) Theorem: Fix (Ω, F , P ). Let X : Ω → R. X is a measurable function
iff X is a RV iff any one of the following conditions holds.
i) X −1 (B) = {ω ∈ Ω : X(ω) ∈ B} ∈ F ∀ B ∈ B(R).
ii) X −1 ((−∞, t]) = {X ≤ t} = {ω ∈ Ω : X(ω) ≤ t} ∈ F ∀t ∈ R.
iii) X −1 ((−∞, t)) = {X < t} = {ω ∈ Ω : X(ω) < t} ∈ F ∀t ∈ R.
iv) X −1 ([t, ∞)) = {X ≥ t} = {ω ∈ Ω : X(ω) ≥ t} ∈ F ∀t ∈ R.
v) X −1 ((t, ∞)) = {X > t} = {ω ∈ Ω : X(ω) > t} ∈ F ∀t ∈ R.
73) Theorem: Let X, Y, and Xi be RVs on (Ω, F, P).
a) aX + bY is a RV for any a, b ∈ R. Hence Σ_{i=1}^n Xi is a RV.
b) max(X, Y ) is a RV. Hence max(X1 , ..., Xn) is a RV.
c) min(X, Y ) is a RV. Hence min(X1 , ..., Xn) is a RV.
d) XY is a RV. Hence X1 · · · Xn is a RV.
e) X/Y is a RV if Y (ω) 6= 0 ∀ ω ∈ Ω.
f) supn Xn is a RV.
g) infn Xn is a RV.
h) limsupn Xn is a RV.
i) liminfn Xn is a RV.
j) If limn Xn = X, then X is a RV.
k) If limm Σ_{n=1}^m Xn = Σ_{n=1}^∞ Xn = X, then X is a RV.
l) If h : Rn → R is measurable, then Y = h(X1 , ..., Xn) is a RV.
m) If h : Rn → R is continuous, then h is measurable and Y = h(X1 , ..., Xn)
is a RV.
n) If h : R → R is monotone, then h is measurable and h(X) is a RV.
34) An indicator IA is the function such that IA (ω) = 1 if ω ∈ A and
IA (ω) = 0 if ω 6∈ A.
35) A function f is a simple function if f = Σ_{i=1}^k xi I_{Ai} for some positive integer k. Thus a simple function f has finite range.
36) A simple function is a random variable if each Ai ∈ F.

2.6 Complements

2.7 Problems

2.1. Prove

X⁻¹(∪_{i=1}^∞ Bi) = ∪_{i=1}^∞ X⁻¹(Bi)

if X : Ω → R is a random variable (measurable function and real function) and the Bi ∈ B(R).
2.2. Fix (Ω, F , P ). Let X : Ω → R. Then X is a measurable function or
random variable if X −1 ((−∞, t]) = {X ≤ t} = {ω ∈ Ω : X(ω) ≤ t} ∈ F ∀t ∈
R. Prove that X is a random variable if X −1 ((t, ∞)) = {X > t} = {ω ∈ Ω :
X(ω) > t} ∈ F ∀t ∈ R.
2.3. Let (Ω, F, P) be a probability space. Prove that IA is not a random variable (with respect to the probability space) if A ∉ F.
2.4. Let t : (R, B(R)) → (R, B(R)) be a measurable real function. Hence
t−1 (B) = {y ∈ R : t(y) ∈ B} = B 0 ∈ B(R) ∀B ∈ B(R). Let X : (Ω, F ) →
(R, B(R)) be a random variable. Prove that Z = t(X) is a random variable
where Z : Ω → R. Hint: show Z −1 (B) = X −1 (t−1 (B)) if B ∈ B(R).
Exam and Quiz Problems
2.5. Fix (Ω, F , P ). Let X : Ω → R. Then X is a measurable function or
random variable if X −1 ((−∞, t]) = {X ≤ t} = {ω ∈ Ω : X(ω) ≤ t} ∈ F ∀t ∈
R. If X and Y are random variables, prove that W = max(X, Y ) is a random
variable.
2.6. Fix (Ω, F , P ). Let X : Ω → R. Then X is a measurable function or
random variable if X −1 ((−∞, t]) = {X ≤ t} = {ω ∈ Ω : X(ω) ≤ t} ∈ F ∀t ∈
R. Prove that X is a random variable if X −1 ((t, ∞)) = {X > t} = {ω ∈ Ω :
X(ω) > t} ∈ F ∀t ∈ R.
2.7. Prove

X⁻¹(∩_{i=1}^∞ Bi) = ∩_{i=1}^∞ X⁻¹(Bi).

(You may assume, for example, that X : Ω → R^k is a random vector and the Bi ∈ B(R^k).)
2.8. Let (Ω, F ) be a measurable space. Suppose the 1 × k vector X : Ω →
Rk . Give the definition of a random vector X.
2.9. Fix (Ω, F , P ) and let X be a RV. What is the induced probability
PX (B) for B ∈ B(R)?
2.10.
2.11.
2.12.
2.13.
2.14.

2.15.
2.16.
2.17.
2.18.
2.19.
Some Qual Type Problems
2.20Q. Prove Theorem 2.4 using Theorem 2.3.
Theorem 2.3. Fix (Ω, F , P ). Let X : Ω → R. X is a measurable function
iff X is a random variable iff any one of the following conditions holds.
i) X −1 (B) = {ω ∈ Ω : X(ω) ∈ B} ∈ F ∀ B ∈ B(R).
ii) X −1 ((−∞, t]) = {X ≤ t} = {ω ∈ Ω : X(ω) ≤ t} ∈ F ∀t ∈ R.
iii) X −1 ((−∞, t)) = {X < t} = {ω ∈ Ω : X(ω) < t} ∈ F ∀t ∈ R.
iv) X −1 ([t, ∞)) = {X ≥ t} = {ω ∈ Ω : X(ω) ≥ t} ∈ F ∀t ∈ R.
v) X −1 ((t, ∞)) = {X > t} = {ω ∈ Ω : X(ω) > t} ∈ F ∀t ∈ R.
Theorem 2.4. Let X, Y, and Xi be RVs on (Ω, F, P).
a) aX + bY is a RV for any a, b ∈ R. Hence Σ_{i=1}^n Xi is a RV.
b) max(X, Y ) is a RV. Hence max(X1 , ..., Xn) is a RV.
c) min(X, Y ) is a RV. Hence min(X1 , ..., Xn) is a RV.
d) XY is a RV. Hence X1 · · · Xn is a RV.
e) X/Y is a RV if Y (ω) 6= 0 ∀ ω ∈ Ω.
f) supn Xn is a RV.
g) infn Xn is a RV.
h) limsupn Xn is a RV.
i) liminfn Xn is a RV.
j) If limn Xn = X, then X is a RV.
k) If limm Σ_{n=1}^m Xn = Σ_{n=1}^∞ Xn = X, then X is a RV.

2.21Q. Fix (Ω, F, P). For a random variable X, prove that the induced probability PX(B) = P[X⁻¹(B)] for B ∈ B(R) is a probability measure on (R, B(R)). You may use without proof i) X⁻¹(R) = Ω, ii) X⁻¹(∅) = ∅, iii) X⁻¹(∪_{i=1}^∞ Bi) = ∪_{i=1}^∞ X⁻¹(Bi), and iv) if A and C are disjoint, then X⁻¹(A) and X⁻¹(C) are disjoint.


2.22Q. Let σ(X) = the collection {X⁻¹(B) : B ∈ B(R)}. Prove that σ(X) is a σ-field. You may use without proof i) X⁻¹(R) = Ω, ii) X⁻¹(∅) = ∅, iii) X⁻¹(∪_{i=1}^∞ Bi) = ∪_{i=1}^∞ X⁻¹(Bi), iv) if A and C are disjoint, then X⁻¹(A) and X⁻¹(C) are disjoint, v) [X⁻¹(B)]ᶜ = X⁻¹(Bᶜ), and vi) X⁻¹(C) ∩ X⁻¹(D) = X⁻¹(C ∩ D).
Chapter 3
Integration and Expected Value

This chapter covers integration, expected values, Fubini’s theorem, and prod-
uct measures. Most of the proofs for integration are omitted, but the corre-
sponding results for expectation are often given.

3.1 Integration

Remark 3.1. For the theory of integration, assume the function f in


the integrand is measurable where f : Ω → R, f : Ω → [0, ∞], or f :
Ω → [−∞, ∞], and (Ω, F , µ) is a measure space. We will often say f : Ω →
[−∞, ∞] when the result is also valid for f : Ω → [a, b] with a < b, and
a = −∞ and b = ∞ possible.
Definition 3.1. A disjoint sequence of sets {Ai} is a finite F decomposition (or an F decomposition of Ω) if Ai ∈ F and Ω = ∪_{i=1}^n Ai = ⊎_{i=1}^n Ai for some n.
First a definition of integration is given for nonnegative functions, then for
general functions.
Definition 3.2. Let f : Ω → [0, ∞] be a nonnegative function. Then the integral

∫ f dµ = sup_{Ai} Σ_i (inf_{ω∈Ai} f(ω)) µ(Ai)

where the sup is over all finite F decompositions {Ai}.
Remark 3.2. Conventions for integration of a nonnegative function. a)
Ai = ∅ implies that the inf term = ∞, b) x(∞) = ∞ for x > 0, c) 0(∞) = 0,
and d) ∞ − ∞ = −∞ + ∞ is undefined.
Theorem 3.1. Let f ≥ 0 with f(ω) = Σ_{j=1}^m xj I_{Bj}(ω) where each xj ≥ 0 and {Bj} is an F decomposition of Ω. Then ∫ f dµ = Σ_{j=1}^m xj µ(Bj).
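Theorem 3.1 is easy to see for the counting measure on a finite Ω, where the integral is just a sum over the points of Ω; the decomposition and values below are illustrative.

```python
# Integral of a nonnegative simple function wrt the counting measure:
# both sides of Theorem 3.1 computed directly.
Omega = {0, 1, 2, 3, 4, 5}
mu = lambda A: len(A)              # counting measure
B = [{0, 1}, {2, 3, 4}, {5}]       # an F decomposition of Omega
x = [2.0, 0.5, 3.0]                # f = sum_j x_j I_{B_j}

def f(w):
    # evaluate the simple function at the point w
    for xj, Bj in zip(x, B):
        if w in Bj:
            return xj

lhs = sum(f(w) for w in Omega)                  # integral wrt counting measure
rhs = sum(xj * mu(Bj) for xj, Bj in zip(x, B))  # sum_j x_j mu(B_j)
print(lhs, rhs)  # 8.5 8.5
```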


Definition 3.3. If f : Ω → [−∞, ∞], then the positive part f⁺ = max(f, 0) = f I(f ≥ 0), and the negative part f⁻ = max(−f, 0) = −min(f, 0) = −f I(f ≤ 0). Hence f⁺(ω) = f(ω)I(f(ω) ≥ 0) and f⁻(ω) = −f(ω)I(f(ω) ≤ 0).
Remark 3.3. Here I(f ≥ 0) = I(0 ≤ f ≤ ∞) while I(f(ω) ≤ 0) =
I(−∞ ≤ f ≤ 0). If f is measurable, then f + ≥ 0, f − ≥ 0 are both measur-
able, f = f + − f − , and |f| = f + + f − .
Definition 3.4. Let f : Ω → [−∞, ∞].
i) The integral ∫ f dµ = ∫ f⁺ dµ − ∫ f⁻ dµ.
ii) The integral is defined unless it involves ∞ − ∞.
iii) The function f is integrable if both ∫ f⁺ dµ and ∫ f⁻ dµ are finite. Thus ∫ f dµ ∈ R if f is integrable.
Definition 3.5. A property holds almost everywhere (ae), if the prop-
erty holds for ω outside a set of measure 0, i.e. the property holds on a set A
such that µ(Ac ) = 0. If µ is a probability measure P , then P (A) = 1 while
P (Ac) = 0.
Theorem 3.2. Suppose f and g are both nonnegative.
i) If f = 0 ae, then ∫ f dµ = 0.
ii) If µ({ω : f(ω) > 0}) > 0, then ∫ f dµ > 0.
iii) If ∫ f dµ < ∞, then f < ∞ ae.
iv) If f ≤ g ae, then ∫ f dµ ≤ ∫ g dµ.
v) If f = g ae, then ∫ f dµ = ∫ g dµ.
R
Theorem 3.3. i) f is integrable iff ∫ |f| dµ < ∞.
ii) monotonicity: If f and g are integrable and f ≤ g ae, then ∫ f dµ ≤ ∫ g dµ.
iii) linearity: If f and g are integrable and a, b ∈ R, then af + bg is integrable with ∫ (af + bg) dµ = a ∫ f dµ + b ∫ g dµ.
Theorem 3.4: Monotone Convergence Theorem (MCT): If 0 ≤ fn ↑ f ae, then ∫ fn dµ ↑ ∫ f dµ.
R
Theorem 3.5: Fatou’s Lemma: For nonnegative fn, ∫ liminf_n fn dµ ≤ liminf_n ∫ fn dµ.
Theorem 3.6: Lebesgue’s Dominated Convergence Theorem (LDCT): If the |fn| ≤ g ae where g is integrable, and if fn → f ae, then f and fn are integrable and ∫ fn dµ → ∫ f dµ.
Theorem 3.7: Bounded Convergence Theorem (BCT): If µ(Ω) < ∞ and the fn are uniformly bounded, then fn → f ae implies ∫ fn dµ → ∫ f dµ.
Theorem 3.8. i) If fn ≥ 0, then ∫ Σ_{n=1}^∞ fn dµ = Σ_{n=1}^∞ ∫ fn dµ.
ii) If Σ_{n=1}^∞ ∫ |fn| dµ < ∞, then ∫ Σ_{n=1}^∞ fn dµ = Σ_{n=1}^∞ ∫ fn dµ.
iii) If f and g are integrable, then |∫ f dµ − ∫ g dµ| ≤ ∫ |f − g| dµ.
Remark 3.4. Consequences: a) linearity implies ∫ Σ_{n=1}^k fn dµ = Σ_{n=1}^k ∫ fn dµ: i.e., the integral and finite sum operators can be interchanged.
b) MCT, LDCT, and BCT give conditions where the limit and ∫ can be interchanged: limn ∫ fn dµ = ∫ limn fn dµ = ∫ f dµ.
c) Theorem 3.8 i) and ii) give conditions where the infinite sum Σ_{n=1}^∞ and the integral ∫ can be interchanged: ∫ Σ_{n=1}^∞ fn dµ = Σ_{n=1}^∞ ∫ fn dµ.
Remark 3.5. A common technique is to show the result is true for indicators. Extend to simple functions by linearity, and then to nonnegative functions by a monotone passage to the limit. Use f = f⁺ − f⁻ for general functions.
Definition 3.6. If A ∈ F, then ∫_A f dµ = ∫ f I_A dµ.

Theorem 3.9. If µ(A) = 0, then ∫_A f dµ = 0.
Theorem 3.10. If µ : F → [0, ∞] is a measure and f ≥ 0, then
a) ν(A) = ∫_A f dµ is a measure on F.
b) If ∫_Ω f dµ = 1, then P(A) = ∫_A f dµ is a probability measure on F.

3.2 Expected Value

Definition 3.7. Fix (Ω, F , P ). A simple random variable (SRV) is a function


X : Ω → R such that the range of X is finite and {X = x} = {ω : X(ω) =
x} ∈ F ∀x ∈ R.
Hence X is a discrete random variable with finite support.
Example 3.1. Note that X = Σ_{i=1}^n xi I_{Ai} is a SRV if each Ai ∈ F.
Example 3.2. Suppose that the An are disjoint for n ≥ 1 and xi ≠ xj for i ≠ j. Then X = Σ_{i=1}^∞ xi I_{Ai} is not a simple random variable since X has infinite range.
Definition 3.8. Suppose events A1, ..., An are disjoint and ⊎_{i=1}^n Ai = Ω. Let X = Σ_{i=1}^n xi I_{Ai}. Then the expected value of X is

E(X) = Σ_{i=1}^n xi P(Ai) = Σ_x x P(X = x).   (3.1)

Example 3.3. IA = 1·IA + 0·I_{Aᶜ} is a simple random variable, and E(IA) = P(A) if A is an event.
Remark 3.6. The expected value E(X) is a finite sum since X is a SRV.
The middle term is useful for proofs. For the given SRV, E(X) exists and is
unique. In the second sum, the x need to be the distinct values in the range
of X.
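Both sums in (3.1) can be computed side by side for a small partition with a repeated value xi, which also shows why the second sum must run over the distinct values in the range of X; the probabilities and values below are illustrative.

```python
from fractions import Fraction

# Partition A_1, A_2, A_3 with P(A_i) and values x_i (note x_1 = x_2 = 5).
P = {1: Fraction(1, 4), 2: Fraction(1, 4), 3: Fraction(1, 2)}
x = {1: 5, 2: 5, 3: -1}

# first sum in (3.1): over the partition
E1 = sum(x[i] * P[i] for i in P)

# second sum in (3.1): over distinct values, with
# P(X = v) = P(union of the A_i with x_i = v)
vals = set(x.values())
E2 = sum(v * sum(P[i] for i in P if x[i] == v) for v in vals)

print(E1, E2)  # both equal 2
```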
Proof of Existence and Uniqueness of (3.1). Existence: Suppose SRV X takes on distinct values x1, ..., xm where m need not equal n. Then X = Σ_{i=1}^m xi I_{Bi} where the Bi = {X = xi} = {ω : X(ω) = xi} are disjoint with ⊎_{i=1}^m Bi = Ω. Thus

E(X) = Σ_{i=1}^m xi P(Bi) = Σ_{i=1}^m xi P(X = xi).

Uniqueness:

Σ_{i=1}^n xi P(Ai) = Σ_x Σ_{i:xi=x} xi P(Ai) = Σ_x x P(∪_{i:xi=x} Ai) = Σ_x x P(X = x).

Note that all sums in the above proof are finite. Also note that although
many partitions Ai may exist, each partition gives the same value of E(X).
Theorem 3.11. Let Xn, X, and Y be SRVs.
a) −∞ < E(X) < ∞.
b) linearity: E(aX + bY) = aE(X) + bE(Y).
c) If SRV X = Σ_{i=1}^n xi I_{Ai} where the Ai are not necessarily disjoint, then E(X) = Σ_{i=1}^n xi P(Ai).
d) monotonicity: If X ≤ Y, then E(X) ≤ E(Y).
e) If the sequence {Xn} is uniformly bounded and X = limn Xn on a set of probability 1, then E(X) = limn E(Xn).
f) If t is a real valued function, then E[t(X)] = Σ_x t(x) P(X = x).
g) If X is nonnegative, X ≥ 0, then E(X) = Σ_i P(X > xi) = ∫_0^∞ [1 − F(x)] dx.
h) If X and Y are independent, then E(XY) = E(X)E(Y).
Proof. a) E(X) = Σ_x x P(X = x) where the x are bounded since X has finite range x1, ..., xm and P(X = x) ∈ [0, 1]. Hence min(xi) ≤ E(X) ≤ max(xi).
b) Let X = Σ_i xi I_{Ai} and Y = Σ_j yj I_{Bj} where the Ai partition Ω and the Bj partition Ω. Then the Ai ∩ Bj partition Ω, and aX + bY = axi + byj for ω ∈ Ai ∩ Bj. Thus

aX + bY = Σ_i Σ_j (axi + byj) I_{Ai∩Bj}

is a SRV with

E(aX + bY) = Σ_i Σ_j (axi + byj) P(Ai ∩ Bj) = Σ_i axi Σ_j P(Ai ∩ Bj) + Σ_j byj Σ_i P(Ai ∩ Bj) = a Σ_i xi P(Ai) + b Σ_j yj P(Bj) = aE(X) + bE(Y).

c) Since I_{Ai} is a SRV with E(I_{Ai}) = P(Ai) by Example 3.3, by linearity and induction,

E(X) = Σ_{i=1}^n xi P(Ai).
d) Let W = Y − X ≥ 0. Then E(W) = Σ_w w P(W = w) ≥ 0 since each distinct value of w ≥ 0. By linearity, 0 ≤ E(Y − X) = E(Y) − E(X), or E(X) ≤ E(Y).
e)
f) If X = Σ_{i=1}^n xi I_{Ai}, then W = t(X) = Σ_{i=1}^n t(xi) I_{Ai} shows W is a SRV. Thus E(W) = E[t(X)] = Σ_w w P(W = w) = Σ_{i=1}^n t(xi) P(Ai) by c).
g)
h)

XY = (Σ_i xi I_{Ai})(Σ_j yj I_{Bj}) = Σ_i Σ_j xi yj I_{Ai∩Bj}

is a SRV. Thus

E(XY) = Σ_i Σ_j xi yj P(Ai ∩ Bj) = Σ_i Σ_j xi yj P(Ai)P(Bj) = (Σ_i xi P(Ai))(Σ_j yj P(Bj)) = E(X)E(Y),

where the second equality holds by independence. 
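The product identity in h) can be checked exactly for two small independent SRVs; the distributions below are illustrative, and the joint pmf factors by independence.

```python
from fractions import Fraction
from itertools import product

# Two independent SRVs given by their pmfs (exact rational arithmetic).
pX = {0: Fraction(1, 3), 2: Fraction(2, 3)}
pY = {-1: Fraction(1, 2), 4: Fraction(1, 2)}

EX = sum(v * p for v, p in pX.items())
EY = sum(v * p for v, p in pY.items())

# joint pmf P(X = a, Y = b) = pX[a] pY[b] under independence
EXY = sum(a * b * pX[a] * pY[b] for a, b in product(pX, pY))

print(EX, EY, EXY)  # 4/3, 3/2, and their product 2
```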

Remark 3.7. For expected values, assume (Ω, F, P) is fixed, and the random variables are measurable with respect to (wrt) F. We can define the expected value to be E(X) = ∫ X dP as the special case of integration where the measure µ = P is a probability measure, or we can use the following definition that ignores most measure theory. There are several equivalent ways to define integrals and expected values. Hence E(X) can also be defined as in Def. 3.2 with µ replaced by P and f replaced by X : Ω → [0, ∞).
Theorem 3.12. Let X ≥ 0 be a random variable. Then there exist SRVs
Xn ≥ 0 such that Xn ↑ X.
Proof.
Note: Xn ↑ X means Xn (ω) ↑ X(ω) ∀ω. An analogy for Theorem 3.12 is
to take step functions, and “increase them” to get Riemann integrability of a
function. A consequence of Theorem 3.12 is that if X ≤ 0, then there exist
SRVs Xn such that Xn ↓ X.
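One standard construction behind Theorem 3.12 (a common choice, not necessarily the book’s) is the dyadic truncation Xn = min(⌊2ⁿX⌋/2ⁿ, n), which has finite range and increases pointwise to X. A sketch at a single sample point:

```python
from math import floor

def X_n(x, n):
    # dyadic truncation: a SRV with values in {0, 1/2^n, 2/2^n, ..., n}
    return min(floor(2**n * x) / 2**n, n)

x = 2.71828  # one sample value X(w); illustrative
approx = [X_n(x, n) for n in range(1, 25)]

# the sequence is nondecreasing and climbs up toward x from below
assert all(a <= b for a, b in zip(approx, approx[1:]))
print(approx[-1])  # within 2^-24 of x
```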
Definition 3.9. Let X ≥ 0 be a nonnegative RV.
a) E(X) = lim_{n→∞} E(Xn) = ∫ X dP ≤ ∞ where the Xn are nonnegative SRVs with 0 ≤ Xn ↑ X.
b) The expectation of X over an event A is E(X IA).
Proof of existence and uniqueness:
existence: 0 ≤ E(X1 ) ≤ E(X2 ) ≤ .... So {E(Xn )} is a monotone sequence
and limn→∞ E(Xn ) exists in [0, ∞].
uniqueness (show E(X) is well defined): later

Theorem 3.13. Let X, Y be nonnegative random variables.


a) “restricted linearity:” For X, Y ≥ 0 and a, b ≥ 0,
E(aX + bY ) = aE(X) + bE(Y ).
b) “monotonicity:” If X ≤ Y ae, then E(X) ≤ E(Y ).
Proof. a) For SRVs 0 ≤ Xn ↑ X and 0 ≤ Yn ↑ Y, the RVs aXn + bYn are SRVs and aXn + bYn ↑ aX + bY, which is nonnegative. Thus

E(aX + bY) = lim_{n→∞} E(aXn + bYn) = lim_{n→∞} (aE[Xn] + bE[Yn]) = a lim_{n→∞} E[Xn] + b lim_{n→∞} E[Yn] = aE(X) + bE(Y).

The first and last equalities hold by the definition of expected value for nonnegative RVs. The second equality holds by linearity for SRVs. The third equality holds since lim (an + bn) = lim an + lim bn if the RHS exists.
b) Let W = Y − X ≥ 0. Since E(Z) ≥ 0 when Z ≥ 0, E(Y − X) ≥ 0.
Using a) gives

E(Y ) = E(Y − X + X) = E(Y − X) + E(X).

Hence E(Y ) = ∞ if E(X) = ∞. If E(X) < ∞, then

E(Y ) − E(X) = E(Y − X) ≥ 0

where E(Y ) − E(X) exists since 0 ≤ E(X) < ∞. 


By induction, if the ai Xi ≥ 0, then E(Σ_{i=1}^n ai Xi) = Σ_{i=1}^n E(ai Xi): the expected value of a finite sum of nonnegative random variables is the sum of the expected values. Note that a) is not linearity since a and b are restricted to be nonnegative.
Definition 3.10. For a random variable X : Ω → (−∞, ∞), the
positive part X + = max(X, 0) = XI(X ≥ 0), and the negative part
X − = max(−X, 0) = −min(X, 0) = −XI(X ≤ 0).
Remark 3.8. Hence X = X + − X − , and |X| = X + + X − . Random
variables are real functions: ±∞ are not allowed.
Definition 3.11. Let the random variable X : Ω → (−∞, ∞).
i) The expected value E(X) = ∫ X dP = ∫ X⁺ dP − ∫ X⁻ dP = E(X⁺) − E(X⁻).
ii) The expected value is defined unless it involves ∞ − ∞.
iii) The random variable X is integrable if E[|X|] < ∞. Thus E(X) ∈ R if X is integrable.
Theorem 3.14. i) X is integrable iff both E[X + ] and E[X − ] are finite.
ii) linearity: If X and Y are integrable and a, b ∈ R, then aX + bY is
integrable with E(aX + bY ) = aE(X) + bE(Y ).
iii) monotonicity: If X and Y are integrable and X ≤ Y ae, then

E(X) ≤ E(Y ).
iv) |E(X)| ≤ E( |X| ).
Proof. i) If X is integrable, then E[|X|] = E[X + ] + E[X − ] by Theorem
3.13 a). Since E[X + ] ≥ 0, E[X − ] ≥ 0, and the sum is finite, both terms are
finite. If both E[X + ] and E[X − ] are finite, then E[|X|] = E[X + ] + E[X − ] is
finite.
ii)
iii) By ii) 0 ≤ E(Y − X) = E(Y ) − E(X). Thus E(Y ) ≥ E(X).
iv) Since −|X| ≤ X ≤ |X|, iii) implies that E(X) ≤ E(|X|) and
−E(|X|) ≤ E(X). Thus −E(X) ≤ E(|X|). Hence |E(X)| ≤ E( |X| ). 
Theorem 3.15: Fatou’s Lemma: For RVs Xn ≥ 0, E[lim inf n Xn ] ≤
lim inf n E[Xn ].
Proof.
Theorem 3.16: Monotone Convergence Theorem (MCT): If 0 ≤
Xn ↑ X ae, then
E(Xn ) ↑ E(X).
Proof. The proof is for when the convergence is everywhere. Then Xn ↑ X
implies E(Xn ) ≤ E(X) for all n using monotonicity of nonnegative RVs. Thus
limsupn E(Xn ) ≤ E(X). By Fatou’s lemma:

E(X) = E[lim Xn ] = E[liminfXn ] ≤ liminfE[Xn ] ≤ limsupE(Xn ) ≤ E(X).

Thus lim E(Xn ) = E(X). Since Xn ↑ X, E(Xn ) ≤ E(Xn+k ) for k ≥ 0. Thus


E(Xn ) ↑ E(X). 
Theorem 3.17: Lebesgue’s Dominated Convergence Theorem
(LDCT): If the |Xn | ≤ Y ae where Y is integrable, and if Xn → X ae,
then X and Xn are integrable and E(Xn ) → E(X).
Proof. Since limsup |Xn | = liminf|Xn | = lim |Xn | = |X| ≤ Y , Xn and
X are integrable. Using the nonnegativity of Y − Xn and Y + Xn ,

E(Y ) − E(X) = E(Y − X) = E[liminf (Y − Xn )] ≤

liminfE(Y − Xn ) = E(Y ) − limsup E(Xn )


where the first and third equalities follow by linearity, the second equality holds since lim (Y − Xn) = liminf (Y − Xn) = Y − X, and the inequality holds by Fatou’s lemma. Since E(Y) is finite by integrability,

−E(X) ≤ −limsup E(Xn ).

Thus E(X) ≥ limsup E(Xn ). Similarly,

E(Y ) + E(X) = E(Y + X) = E[liminf (Y + Xn )] ≤

liminfE(Y + Xn ) = E(Y ) + liminf E(Xn ).


Hence

E(X) ≤ liminf E(Xn ) ≤ limsup E(Xn ) ≤ E(X).


Thus E(X) = limn→∞ E(Xn ). 
Theorem 3.18: Bounded Convergence Theorem (BCT): If the Xn are uniformly bounded, then Xn → X ae implies E(Xn) → E(X).
Proof.
Theorem 3.19. i) If Xn ≥ 0, then E(Σ_{n=1}^∞ Xn) = Σ_{n=1}^∞ E(Xn).
ii) If Σ_{n=1}^∞ E(|Xn|) < ∞, then E(Σ_{n=1}^∞ Xn) = Σ_{n=1}^∞ E(Xn).
iii) If X and Y are integrable, then |E(X) − E(Y)| ≤ E[|X − Y|].
Proof. i) 0 ≤ Ym = Σ_{n=1}^m Xn ↑ Y = Σ_{n=1}^∞ Xn. Thus

lim_{m→∞} E(Ym) = lim_{m→∞} Σ_{n=1}^m E(Xn) = Σ_{n=1}^∞ E(Xn) = E(Y) = E(Σ_{n=1}^∞ Xn)

by MCT. 
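The interchange in Theorem 3.19 i) can be verified exactly on a finite sample space, where only finitely many Xn are nonzero so the infinite sum is really a finite one; the space and RVs below are illustrative.

```python
from fractions import Fraction

# Omega = {1,...,6} with P uniform; X_n = 2^{-n} I(w = n), which is zero
# for n > 6, so summing n over Omega captures the whole series exactly.
Omega = range(1, 7)
P = Fraction(1, 6)

def X_n(n, w):
    return Fraction(1, 2**n) if w == n else Fraction(0)

# E(sum_n X_n): form the pointwise sum first, then take the expectation
E_of_sum = sum(P * sum(X_n(n, w) for n in Omega) for w in Omega)
# sum_n E(X_n): take expectations first, then sum
sum_of_E = sum(sum(P * X_n(n, w) for w in Omega) for n in Omega)

print(E_of_sum, sum_of_E)  # both 21/128
```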
 Pk Pk
Remark 3.9. Consequences: a) linearity implies E(Σ_{n=1}^k an Xn) = Σ_{n=1}^k an E(Xn): i.e., the expectation and finite sum operators can be interchanged, or the expectation of a finite sum is the sum of the expectations if the Xn are integrable.
b) MCT, LDCT, and BCT give conditions where the limit and E can be interchanged: limn E(Xn) = E[limn Xn] = E(X).
c) Theorem 3.19 i) and ii) give conditions where the infinite sum Σ_{n=1}^∞ and the expected value can be interchanged: E[Σ_{n=1}^∞ Xn] = Σ_{n=1}^∞ E(Xn).
Definition 3.12. Given (Ω, F , P ), the collection of all integrable random
vectors or random variables is denoted by L1 = L1 (Ω, F , P ).
Definition 3.13. Let X be a 1 × k random vector with cdf FX(t) = F(t) = P(X1 ≤ t1, ..., Xk ≤ tk). Then the Lebesgue Stieltjes integral E[h(X)] = ∫ h(t) dF(t) provided the expected value exists, and the integral is a linear operator with respect to both h and F. If X is a random variable, then E[h(X)] = ∫ h(t) dF(t). If W = h(X) is integrable or if W = h(X) ≥ 0, then the expected value exists. Here h : R^k → R^j with 1 ≤ j ≤ k.
Definition 3.14. The distribution of a 1 × k random vector X is a mixture distribution if the cdf of X is

FX(t) = Σ_{j=1}^J πj F_{Uj}(t)

where the probabilities πj satisfy 0 ≤ πj ≤ 1 and Σ_{j=1}^J πj = 1, J ≥ 2, and F_{Uj}(t) is the cdf of a 1 × k random vector Uj. Then X has a mixture distribution of the Uj with probabilities πj. If X is a random variable, then FX(t) = Σ_{j=1}^J πj F_{Uj}(t) where the Uj are random variables.

Theorem 3.20: Expected Value Theorem: Assume all expected values exist. Let dx = dx1 dx2 · · · dxk. Let X be the support of X: X = {x : f(x) > 0} or {x : p(x) > 0}.
a) If X has (joint) pdf f(x), then E[h(X)] = ∫_{−∞}^∞ · · · ∫_{−∞}^∞ h(x) f(x) dx = ∫ · · · ∫_X h(x) f(x) dx. Hence E[X] = ∫_{−∞}^∞ · · · ∫_{−∞}^∞ x f(x) dx = ∫ · · · ∫_X x f(x) dx.
b) If X has pdf f(x), then E[h(X)] = ∫_{−∞}^∞ h(x) f(x) dx = ∫_X h(x) f(x) dx. Hence E[X] = ∫_{−∞}^∞ x f(x) dx = ∫_X x f(x) dx.
c) If X has (joint) pmf p(x), then E[h(X)] = Σ_{x1} · · · Σ_{xk} h(x) p(x) = Σ_{x∈R^k} h(x) p(x) = Σ_{x∈X} h(x) p(x). Hence E[X] = Σ_{x1} · · · Σ_{xk} x p(x) = Σ_{x∈R^k} x p(x) = Σ_{x∈X} x p(x).
d) If X has pmf p(x), then E[h(X)] = Σ_x h(x) p(x) = Σ_{x∈X} h(x) p(x). Hence E[X] = Σ_x x p(x) = Σ_{x∈X} x p(x).
e) Suppose X has a mixture distribution given by Definition 3.14 and that E(h(X)) and the E(h(Uj)) exist. Then

E[h(X)] = Σ_{j=1}^J πj E[h(Uj)] and E(X) = Σ_{j=1}^J πj E[Uj].

f) Suppose the random variable X has a mixture distribution given by Definition 3.14 and that E(h(X)) and the E(h(Uj)) exist. Then

E[h(X)] = Σ_{j=1}^J πj E[h(Uj)] and E(X) = Σ_{j=1}^J πj E[Uj].

This theorem is easy to prove if the Uj are continuous random vectors with (joint) probability density functions (pdfs) f_{Uj}(t). Then X is a continuous random vector with pdf

fX(t) = Σ_{j=1}^J πj f_{Uj}(t), and

E[h(X)] = ∫_{−∞}^∞ · · · ∫_{−∞}^∞ h(t) fX(t) dt = Σ_{j=1}^J πj ∫_{−∞}^∞ · · · ∫_{−∞}^∞ h(t) f_{Uj}(t) dt = Σ_{j=1}^J πj E[h(Uj)]

where E[h(Uj)] is the expectation with respect to the random vector Uj.


Alternatively, with respect to a Lebesgue Stieltjes integral, E[h(X)] = ∫ h(t) dF(t) provided the expected value exists, and the integral is a linear operator with respect to both h and F. Hence for a mixture distribution,

E[h(X)] = ∫ h(t) dF(t) = ∫ h(t) d[Σ_{j=1}^J πj F_{Uj}(t)] = Σ_{j=1}^J πj ∫ h(t) dF_{Uj}(t) = Σ_{j=1}^J πj E[h(Uj)].
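The mixture identity E[h(X)] = Σ_j πj E[h(Uj)] can be verified exactly for a two-component discrete mixture; the weights, component pmfs, and h below are illustrative.

```python
from fractions import Fraction

# Two-component discrete mixture with weights pi_1 = 1/4, pi_2 = 3/4.
pi = [Fraction(1, 4), Fraction(3, 4)]
U = [{0: Fraction(1, 2), 1: Fraction(1, 2)},   # pmf of U_1
     {1: Fraction(1, 3), 4: Fraction(2, 3)}]   # pmf of U_2

h = lambda x: x * x

# direct route: the mixture pmf is sum_j pi_j p_{U_j}(x)
support = set(U[0]) | set(U[1])
mix_pmf = {v: sum(pj * Uj.get(v, Fraction(0)) for pj, Uj in zip(pi, U))
           for v in support}
direct = sum(h(v) * p for v, p in mix_pmf.items())

# component route: sum_j pi_j E[h(U_j)]
bycomp = sum(pj * sum(h(v) * p for v, p in Uj.items())
             for pj, Uj in zip(pi, U))

print(direct, bycomp)  # both 67/8
```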

Remark 3.10. Let f(x) ≥ 0 be a Lebesgue integrable pdf of a RV with cdf F. Then PX(B) = PF(B) = ∫_B f(x) dx wrt Lebesgue integration. So many probability distributions can be obtained with Lebesgue integration.

3.3 Fubini’s theorem and Product Measures

Definition 3.15. RVs X1, ..., Xk are independent if P(X1 ∈ B1, ..., Xk ∈ Bk) = Π_{i=1}^k P(Xi ∈ Bi) for any B1, ..., Bk ∈ B(R) iff F_{X1,...,Xk}(x1, ..., xk) = F_{X1}(x1) · · · F_{Xk}(xk) for any real x1, ..., xk iff σ(X1), ..., σ(Xk) are independent (∀Ai ∈ σ(Xi), A1, ..., Ak are independent). An infinite collection of RVs X1, X2, ... is independent if any finite subset is independent. If pdfs exist, X1, ..., Xk are independent iff f_{X1,...,Xk}(x1, ..., xk) = f_{X1}(x1) · · · f_{Xk}(xk) for any real x1, ..., xk. If pmfs exist, X1, ..., Xk are independent iff p_{X1,...,Xk}(x1, ..., xk) = p_{X1}(x1) · · · p_{Xk}(xk) for any real x1, ..., xk.
Theorem 3.21. Suppose X1, ..., Xn are independent and gi(Xi) is a function of Xi alone. Then E[g1(X1) · · · gn(Xn)] = E[Π_{i=1}^n gi(Xi)] = Π_{i=1}^n E[gi(Xi)] provided the expected values exist.
Definition 3.16. Let (Ω1, F1, P1) and (Ω2, F2, P2) be two probability spaces. The Cartesian product = cross product Ω1 × Ω2 = {(ω1, ω2) : ω1 ∈ Ω1, ω2 ∈ Ω2}. The product of F1 and F2, denoted by F1 × F2, is the σ-field σ(A) where A = {A1 × A2 : A1 ∈ F1, A2 ∈ F2} is the collection of all cross products A1 × A2 of events in F1 and F2.
Theorem 3.22. There is a unique probability measure P = P1 × P2 ,
called the product of P1 and P2 or the product probability measure, such
that P (A1 × A2 ) = P1 (A1 )P2 (A2 ) for all A1 ∈ F1 and A2 ∈ F2 .
Definition 3.17. The product probability space is (Ω1 × Ω2 , F1 ×
F2 , P1 × P2 ).
Remark 3.11. Theorem 3.22 and Definitions 3.16 and 3.17 can be extended to (Ωi, Fi, Pi) for i = 1, ..., n. Denote P1 × · · · × Pn by Π_{i=1}^n Pi, F1 × · · · × Fn by Π_{i=1}^n Fi, and Ω1 × · · · × Ωn by Π_{i=1}^n Ωi. If (Ωi, Fi, Pi) = (R, B(R), Pi), then the product probability space is (R^n, B(R^n), Π_{i=1}^n Pi). If (Ωi, Fi, Pi) = (R, B(R), P_{Xi}), then the product probability space is (R^n, B(R^n), Π_{i=1}^n P_{Xi}).
Definition 3.18. Let independent Xi be defined on (R, B(R), P_{Xi}). Then the product probability space (Ω, F, P) = (R^n, B(R^n), Π_{i=1}^n P_{Xi}) is the probability space for X = (X1, ..., Xn).
Definition 3.19. Let ∫ f dµ = ∫ f(x) dµ(x). Then the double integral

∫_{Ω1×Ω2} f(x1, x2) d[P1 × P2](x1, x2) = ∫_{Ω1} [∫_{Ω2} f(x1, x2) dP2(x2)] dP1(x1) = ∫_{Ω2} [∫_{Ω1} f(x1, x2) dP1(x1)] dP2(x2).   (3.2)
The last two expressions are known as iterated integrals.
Theorem 3.23: Fubini’s Theorem: a) Assume f ≥ 0. Then ∫_{Ω1} f(x1, x2) dP1(x1) is measurable F2, ∫_{Ω2} f(x1, x2) dP2(x2) is measurable F1, and Equation (3.2) holds.
b) Assume f is integrable wrt P1 × P2. Then ∫_{Ω1} f(x1, x2) dP1(x1) is finite ae and measurable F2 ae, ∫_{Ω2} f(x1, x2) dP2(x2) is finite ae and measurable F1 ae, and (3.2) holds.
Note: Part a) is also known as Tonelli’s theorem or the Fubini-Tonelli theorem. The double integral is often written as ∫_{Ω1×Ω2}. Note that f : Ω1 × Ω2 → R (at least ae). Fubini’s theorem for product probability measures shows double integrals can be calculated with iterated integrals if X1 and X2 are independent, and the theorem is sometimes stated as below.
Theorem 3.24: Fubini’s Theorem for product probability measures: If f is measurable, then

∫_{Ω1×Ω2} f d[P1 × P2] = ∫_{Ω1} [∫_{Ω2} f(x1, x2) dP2(x2)] dP1(x1) = ∫_{Ω2} [∫_{Ω1} f(x1, x2) dP1(x1)] dP2(x2)

provided that either a) f ≥ 0, or b) ∫_{Ω1×Ω2} |f| d[P1 × P2] < ∞.
Definition 3.20. A product measure µ satisfies µ(Π_{i=1}^n Ai) = Π_{i=1}^n µ(Ai).
Theorem 3.25: Fubini’s Theorem for product measures: If f is measurable, then

∫_{Ω1×Ω2} f d[µ1 × µ2] = ∫_{Ω1} [∫_{Ω2} f(x1, x2) dµ2(x2)] dµ1(x1) = ∫_{Ω2} [∫_{Ω1} f(x1, x2) dµ1(x1)] dµ2(x2)

provided that the µi are σ-finite and either a) f ≥ 0, or b) ∫_{Ω1×Ω2} |f| d[µ1 × µ2] < ∞.
Note: the Lebesgue measure is σ-finite on R and the counting measure µC is σ-finite if Ω is countable, where µC(A) = the number of points in set A. Let λ be the Lebesgue measure on R² and µL the Lebesgue measure on R. Then λ(A × B) = µL(A)µL(B) is a product measure. Let ν be the counting measure on Z² and µC the counting measure on Z. Then ν(A × B) = µC(A)µC(B) is a product measure.
Theorem 3.26: Fubini’s Theorem for Lebesgue Integrals: Let C = {(x, y) : a ≤ x ≤ b, c ≤ y ≤ d} = [a, b] × [c, d]. Let g(x, y) be measurable and Lebesgue integrable. Then

∫∫_C g(x, y) dx dy = ∫_c^d [∫_a^b g(x, y) dx] dy = ∫_a^b [∫_c^d g(x, y) dy] dx.
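The order-of-integration exchange can be illustrated numerically with iterated midpoint Riemann sums in both orders; the integrand and rectangle below are illustrative, with exact value ∫∫ (x + y) dx dy = 3 over [0, 1] × [0, 2].

```python
# Iterated midpoint Riemann sums in both orders for g(x, y) = x + y
# on C = [0, 1] x [0, 2]; both approximate the same double integral.
a, b, c, d = 0.0, 1.0, 0.0, 2.0
n = 400
hx, hy = (b - a) / n, (d - c) / n
xs = [a + (i + 0.5) * hx for i in range(n)]
ys = [c + (j + 0.5) * hy for j in range(n)]
g = lambda x, y: x + y

# dx inside, dy outside, and then the reverse order
dy_then_dx = sum(sum(g(x, y) * hy for y in ys) * hx for x in xs)
dx_then_dy = sum(sum(g(x, y) * hx for x in xs) * hy for y in ys)

print(dy_then_dx, dx_then_dy)  # both close to 3.0
```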

Remark 3.12. The result in Theorem 3.26 can be extended to where


the limits of integration are infinite and to n ≥ 2 integrals. Using g(x, y) =
h(x, y)f(x, y) where f is a pdf gives E[h(X, Y )]. Note that g : R2 → R (at
least ae).

3.4 Summary

37) Fix (Ω, F, P). A simple random variable (SRV) is a function X : Ω → R such that the range of X is finite and {X = x} = {ω : X(ω) = x} ∈ F ∀x ∈ R. Hence X is a discrete RV with finite support. Note that X = Σ_{i=1}^n xi I_{Ai} is a SRV if each Ai ∈ F.
38) Suppose events A1, ..., An are disjoint and ∪_{i=1}^n Ai = Ω. Let X = Σ_{i=1}^n xi I_{Ai}. Then the expected value of X is E(X) = Σ_{i=1}^n xi P(Ai) = Σ_x x P(X = x), which is a finite sum since X is a SRV. The middle term is useful for proofs. For this SRV, E(X) exists and is unique. In the second sum, the x need to be the distinct values in the range of X.
39) Suppose SRV X takes on distinct values x1, ..., xm. Then X = Σ_{i=1}^m xi I_{Bi} where the Bi = {X = xi} are disjoint with ∪_{i=1}^m Bi = Ω. Hence a SRV has the form of 38) with Ai = Bi and n = m.
40) Theorem. Let Xn, X, and Y be SRVs.
a) −∞ < E(X) < ∞.
b) linearity: E(aX + bY) = aE(X) + bE(Y).
c) If SRV X = Σ_{i=1}^n xi I_{Ai} where the Ai are not necessarily disjoint, then E(X) = Σ_{i=1}^n xi P(Ai).
d) If X ≤ Y, then E(X) ≤ E(Y).
e) If {Xn} is uniformly bounded and X = limn Xn on a set of probability 1, then E(X) = limn E(Xn).
f) If t is a real valued function, then E[t(X)] = Σ_x t(x) P(X = x).
g) If X is nonnegative, X ≥ 0, then E(X) = Σ_i P(X > xi) = ∫_0^∞ [1 − F(x)] dx.
h) If X and Y are independent, then E(XY) = E(X)E(Y).
41) For the theory of integration, assume the function f in the integrand is measurable, where f : Ω → R and (Ω, F, µ) is a measure space.
42) A function f : Ω → [−∞, ∞] is a measurable function (or measurable or F measurable or Borel measurable) if
i) f⁻¹(B) ∈ F ∀B ∈ B(R),
ii) f⁻¹({∞}) = {ω : f(ω) = ∞} ∈ F, and
iii) f⁻¹({−∞}) = {ω : f(ω) = −∞} ∈ F.

43) Def. Let f : Ω → [0, ∞] be a nonnegative function. Then the integral

∫ f dµ = sup_{Ai} Σ_i (inf_{ω∈Ai} f(ω)) µ(Ai)

where {Ai} is a finite F decomposition.
(A finite F decomposition (F decomp of Ω) means that Ai ∈ F and Ω = ∪_{i=1}^n Ai for some n, and the Ai are disjoint.)
44) Conventions for integration of a nonnegative function. a) Ai = ∅ implies that the inf term = ∞, b) x(∞) = ∞ for x > 0, and c) 0(∞) = 0.
45) Theorem: Let f ≥ 0 with f(ω) = Σ_{j=1}^m xj I_{Bj}(ω) where each xj ≥ 0 and {Bj} is an F decomp of Ω. Then ∫ f dµ = Σ_{j=1}^m xj µ(Bj).
46) If f : Ω → [−∞, ∞], then the positive part f + = fI(f ≥ 0) =
max(f, 0), and the negative part f − = −fI(f ≤ 0) = max(−f, 0) =
−min(f, 0). Hence f + (ω) = f(ω)I(f(ω) ≥ 0) and f − (ω) = −f(ω)I(f(ω) ≤
0).
Here I(f ≥ 0) = I(0 ≤ f ≤ ∞) while I(f(ω) ≤ 0) = I(−∞ ≤ f ≤ 0). If f
is measurable, then f + ≥ 0, f − ≥ 0 are both measurable, f = f + − f − , and
|f| = f + + f − .
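A minimal Python sketch of the decomposition in 46), using a hypothetical function f(ω) = ω − 1/2 (an arbitrary choice for illustration):

```python
# Positive and negative parts of a real-valued function.
def pos(f):
    return lambda w: max(f(w), 0.0)    # f+ = max(f, 0)

def neg(f):
    return lambda w: max(-f(w), 0.0)   # f- = max(-f, 0)

f = lambda w: w - 0.5
for w in (0.2, 0.5, 0.9):
    assert pos(f)(w) - neg(f)(w) == f(w)        # f = f+ - f-
    assert pos(f)(w) + neg(f)(w) == abs(f(w))   # |f| = f+ + f-
```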
47) Convention: ∞ − ∞ = −∞ + ∞ is undefined.
48) Def: Let f : Ω → [−∞, ∞].
i) The integral ∫ f dµ = ∫ f + dµ − ∫ f − dµ.
ii) The integral is defined unless it involves ∞ − ∞.
iii) The function f is integrable if both ∫ f + dµ and ∫ f − dµ are finite. Thus ∫ f dµ ∈ R if f is integrable.
49) A property holds almost everywhere (ae), if the property holds for
ω outside a set of measure 0, i.e. the property holds on a set A such that
µ(Ac ) = 0. If µ is a probability measure P , then P (A) = 1 while P (Ac ) = 0.
50) Theorem: suppose f and g are both nonnegative.
i) If f = 0 ae, then ∫ f dµ = 0.
ii) If µ({ω : f(ω) > 0}) > 0, then ∫ f dµ > 0.
iii) If ∫ f dµ < ∞, then f < ∞ ae.
iv) If f ≤ g ae, then ∫ f dµ ≤ ∫ g dµ.
v) If f = g ae, then ∫ f dµ = ∫ g dµ.
51) Theorem: i) f is integrable iff ∫ |f| dµ < ∞.
ii) monotonicity: If f and g are integrable and f ≤ g ae, then ∫ f dµ ≤ ∫ g dµ.
iii) linearity: If f and g are integrable and a, b ∈ R, then af + bg is integrable with ∫ (af + bg) dµ = a ∫ f dµ + b ∫ g dµ.
iv) Monotone Convergence Theorem (MCT): If 0 ≤ fn ↑ f ae, then ∫ fn dµ ↑ ∫ f dµ.
v) Fatou's Lemma: For nonnegative fn, ∫ liminfn fn dµ ≤ liminfn ∫ fn dµ.
vi) Lebesgue's Dominated Convergence Theorem (LDCT): If the |fn| ≤ g ae where g is integrable, and if fn → f ae, then f and fn are integrable and ∫ fn dµ → ∫ f dµ.
vii) Bounded Convergence Theorem (BCT): If µ(Ω) < ∞ and the fn are uniformly bounded, then fn → f ae implies ∫ fn dµ → ∫ f dµ.
50 3 Integration and Expected Value
viii) If fn ≥ 0 then ∫ ∑_{n=1}^∞ fn dµ = ∑_{n=1}^∞ ∫ fn dµ.
ix) If ∑_{n=1}^∞ ∫ |fn| dµ < ∞, then ∫ ∑_{n=1}^∞ fn dµ = ∑_{n=1}^∞ ∫ fn dµ.
x) If f and g are integrable, then |∫ f dµ − ∫ g dµ| ≤ ∫ |f − g| dµ.
52) Consequences: a) linearity implies ∫ ∑_{n=1}^k fn dµ = ∑_{n=1}^k ∫ fn dµ: i.e., the integral and finite sum operators can be interchanged.
b) MCT, LDCT, and BCT give conditions where the limit and ∫ can be interchanged: limn ∫ fn dµ = ∫ limn fn dµ = ∫ f dµ.
c) 51) viii) and ix) give conditions where the infinite sum ∑_{n=1}^∞ and the integral ∫ can be interchanged: ∫ ∑_{n=1}^∞ fn dµ = ∑_{n=1}^∞ ∫ fn dµ.
53) A common technique is to show the result is true for indicators. Ex-
tend to simple functions by linearity, and then to nonnegative functions by a
monotone passage of the limit. Use f = f + − f − for general functions.
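The monotone passage in 53) and the MCT interchange in 52) b) can be illustrated numerically. This is a sketch with an assumed sequence, not an example from the text: fn(x) = x^{1+1/n} increases pointwise to f(x) = x on [0, 1], and the integrals 1/(2 + 1/n) increase to ∫_0^1 x dx = 1/2.

```python
# Midpoint-rule approximation of the integral of fn(x) = x**(1 + 1/n) on [0, 1].
def integral_fn(n, grid=20_000):
    h = 1.0 / grid
    return sum(((i + 0.5) * h) ** (1 + 1 / n) for i in range(grid)) * h

# The integrals increase toward 1/2, as the MCT predicts.
vals = [integral_fn(n) for n in (1, 10, 100)]
assert vals[0] < vals[1] < vals[2] < 0.5
```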
54) Induction Theorem: If R(n) is a statement for each n ∈ N such that
a) R(1) is true, and b) for each k ∈ N, if R(k) is true, then R(k + 1) is true,
then R(n) is true for each n ∈ N.
Note that ∞ 6∈ N. Induction can be used with linearity to prove 52) a),
but induction generally does R not work R for 52) c).
55) Def. If A ∈ F, then ∫_A f dµ = ∫ f IA dµ.
56) If µ(A) = 0, then ∫_A f dµ = 0.
57) If µ : F → [0, ∞] is a measure and f ≥ 0, then
a) ν(A) = ∫_A f dµ is a measure on F.
b) If ∫_Ω f dµ = 1, then P(A) = ∫_A f dµ is a probability measure on F.
58) For expected values, assume (Ω, F , P ) is fixed, and the random vari-
ables are measurable wrt F . R
59) We can define the expected value to be E(X) = XdP as the special
case of integration where the measure µ = P is a probability measure, or we
can use a definition that ignores most measure theory.
60) Def. Let X ≥ 0 be a nonnegative RV.
a) E(X) = limn→∞ E(Xn) = ∫ X dP ≤ ∞ where the Xn are nonnegative
SRVs with 0 ≤ Xn ↑ X.
b) The expectation of X over an event A is E(XIA ).
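One standard choice of approximating SRVs (an assumption for illustration; the text does not fix a particular sequence) is Xn = min(⌊2^n X⌋/2^n, n), which satisfies 0 ≤ Xn ↑ X pointwise:

```python
# Dyadic SRV approximation of a nonnegative value x = X(omega).
def Xn(x, n):
    return min(int(x * 2 ** n) / 2 ** n, n)   # int() = floor for x >= 0

x = 2.71828
vals = [Xn(x, n) for n in range(1, 12)]
assert all(a <= b for a, b in zip(vals, vals[1:]))   # 0 <= Xn increasing
assert abs(vals[-1] - x) < 2 ** -10                  # converges up to x
```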
There are several equivalent ways to define integrals and expected values.
Hence E(X) can also be defined as in 43) with µ replaced by P and f replaced
by X : Ω → R.
61) Theorem: Let X, Y be nonnegative random variables.
a) For X, Y ≥ 0 and a, b ≥ 0, E(aX + bY ) = aE(X) + bE(Y ).
b) If X ≤ Y ae, then E(X) ≤ E(Y ).
By induction, if the ai Xi ≥ 0, then E(∑_{i=1}^n ai Xi) = ∑_{i=1}^n E(ai Xi): the
expected value of a finite sum of nonnegative RVs is the sum of the expected
values.
62) For a random variable X : Ω → (−∞, ∞), the positive part X + = XI(X ≥ 0) = max(X, 0), and the negative part X − = −XI(X ≤ 0) = max(−X, 0) = −min(X, 0). Hence X = X + − X −, and |X| = X + + X −.
Random variables are real functions: ±∞ are not allowed.
63) Def: Let the random variable X : Ω → (−∞, ∞).
i) The expected value E(X) = ∫ X dP = ∫ X + dP − ∫ X − dP = E(X + ) − E(X − ).
ii) The expected value is defined unless it involves ∞ − ∞.
iii) The random variable X is integrable if E[|X|] < ∞. Thus E(X) ∈ R if
X is integrable.
64) Theorem: i) X is integrable iff both E[X + ] and E[X − ] are finite.
ii) monotonicity: If X and Y are integrable and X ≤ Y ae, then E(X) ≤
E(Y ).
iii) linearity: If X and Y are integrable and a, b ∈ R, then aX + bY is inte-
grable with E(aX + bY ) = aE(X) + bE(Y ).
iv) Monotone Convergence Theorem (MCT): If 0 ≤ Xn ↑ X ae, then
E(Xn ) ↑ E(X).
v) Fatou’s Lemma: For RVs Xn ≥ 0, E[lim infn Xn ] ≤ lim inf n E[Xn ].
vi) Lebesgue’s Dominated Convergence Theorem (LDCT): If the
|Xn | ≤ Y ae where Y is integrable, and if Xn → X ae, then X and Xn
are integrable and E(Xn ) → E(X).
vii) Bounded Convergence Theorem (BCT): If the Xn are uniformly
bounded, then Xn → X ae implies E(Xn) → E(X).
viii) If Xn ≥ 0 then E(∑_{n=1}^∞ Xn) = ∑_{n=1}^∞ E(Xn).
ix) If ∑_{n=1}^∞ E(|Xn|) < ∞, then E(∑_{n=1}^∞ Xn) = ∑_{n=1}^∞ E(Xn).
x) If X and Y are integrable, then |E(X) − E(Y )| ≤ E[|X − Y |].
65) Consequences: a) linearity implies E(∑_{n=1}^k an Xn) = ∑_{n=1}^k an E(Xn):
i.e., the expectation and finite sum operators can be interchanged, or the
expectation of a finite sum is the sum of the expectations if the Xn are
integrable.
b) MCT, LDCT, and BCT give conditions where the limit and E can be
interchanged: limn E(Xn) = E[limn Xn] = E(X).
c) 64) viii) and ix) give conditions where the infinite sum ∑_{n=1}^∞ and the expected value can be interchanged: E[∑_{n=1}^∞ Xn] = ∑_{n=1}^∞ E(Xn).
66) Given (Ω, F , P ), the collection of all integrable random vectors or RVs
is denoted by L1 = L1 (Ω, F , P ).
67) Let X be a 1 × k random vector with cdf FX(t) = F(t) = P(X1 ≤ t1, ..., Xk ≤ tk). Then the Lebesgue Stieltjes integral E[h(X)] = ∫ h(t)dF(t) provided the expected value exists, and the integral is a linear operator with respect to both h and F. If X is a random variable, then E[h(X)] = ∫ h(t)dF(t). If W = h(X) is integrable or if W = h(X) ≥ 0, then the expected value exists. Here h : R^k → R^j with 1 ≤ j ≤ k.
68) The distribution of a 1 × k random vector X is a mixture distribution if the cdf of X is
FX(t) = ∑_{j=1}^J πj FU_j(t)
where the probabilities πj satisfy 0 ≤ πj ≤ 1 and ∑_{j=1}^J πj = 1, J ≥ 2, and FU_j(t) is the cdf of a 1 × k random vector U_j. Then X has a mixture distribution of the U_j with probabilities πj. If X is a random variable, then
FX(t) = ∑_{j=1}^J πj FUj(t).

69) Expected Value Theorem: Assume all expected values exist. Let dx = dx1 dx2 ... dxk. Let X be the support of X: X = {x : f(x) > 0} or {x : p(x) > 0}.
a) If X has (joint) pdf f(x), then E[h(X)] = ∫_{−∞}^∞ ··· ∫_{−∞}^∞ h(x)f(x) dx = ∫···∫_X h(x)f(x) dx. Hence E[X] = ∫_{−∞}^∞ ··· ∫_{−∞}^∞ x f(x) dx = ∫···∫_X x f(x) dx.
b) If X has pdf f(x), then E[h(X)] = ∫_{−∞}^∞ h(x)f(x) dx = ∫_X h(x)f(x) dx. Hence E[X] = ∫_{−∞}^∞ x f(x) dx = ∫_X x f(x) dx.
c) If X has (joint) pmf p(x), then E[h(X)] = ∑_{x1} ··· ∑_{xk} h(x)p(x) = ∑_{x∈R^k} h(x)p(x) = ∑_{x∈X} h(x)p(x). Hence E[X] = ∑_{x1} ··· ∑_{xk} x p(x) = ∑_{x∈R^k} x p(x) = ∑_{x∈X} x p(x).
d) If X has pmf p(x), then E[h(X)] = ∑_x h(x)p(x) = ∑_{x∈X} h(x)p(x). Hence E[X] = ∑_x x p(x) = ∑_{x∈X} x p(x).
e) Suppose X has a mixture distribution given by 68) and that E(h(X)) and the E(h(U_j)) exist. Then
E[h(X)] = ∑_{j=1}^J πj E[h(U_j)] and E(X) = ∑_{j=1}^J πj E[U_j].
f) Suppose X has a mixture distribution given by 68) and that E(h(X)) and the E(h(Uj)) exist. Then
E[h(X)] = ∑_{j=1}^J πj E[h(Uj)] and E(X) = ∑_{j=1}^J πj E[Uj].

This theorem is easy to prove if the U_j are continuous random vectors with (joint) probability density functions (pdfs) fU_j(t). Then X is a continuous random vector with pdf
fX(t) = ∑_{j=1}^J πj fU_j(t), and E[h(X)] = ∫_{−∞}^∞ ··· ∫_{−∞}^∞ h(t)fX(t)dt = ∑_{j=1}^J πj ∫_{−∞}^∞ ··· ∫_{−∞}^∞ h(t)fU_j(t)dt = ∑_{j=1}^J πj E[h(U_j)]
where E[h(U_j)] is the expectation with respect to the random vector U_j.
Alternatively, with respect to a Lebesgue Stieltjes integral, E[h(X)] = ∫ h(t)dF(t) provided the expected value exists, and the integral is a linear operator with respect to both h and F. Hence for a mixture distribution,
E[h(X)] = ∫ h(t)dF(t) = ∫ h(t) d[∑_{j=1}^J πj FU_j(t)] = ∑_{j=1}^J πj ∫ h(t)dFU_j(t) = ∑_{j=1}^J πj E[h(U_j)].
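A quick Monte Carlo sketch of e)/f) with assumed ingredients (a two-component mixture of uniforms, not an example from the text): U1 ∼ U(0,1) with probability 0.3 and U2 ∼ U(0,2) with probability 0.7, so E(X) = 0.3·0.5 + 0.7·1.0 = 0.85.

```python
import random

random.seed(1)
pis, means = [0.3, 0.7], [0.5, 1.0]
expected = sum(p * m for p, m in zip(pis, means))   # sum of pi_j E[U_j]

def draw():
    # Draw from the mixture: pick a component, then draw from it.
    if random.random() < pis[0]:
        return random.uniform(0, 1)
    return random.uniform(0, 2)

n = 100_000
mc = sum(draw() for _ in range(n)) / n
assert abs(mc - expected) < 0.01
```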

70) Fix (Ω, F, P). Let the induced probability PX = PF be PX(B) = P[X^{−1}(B)] for any B ∈ B(R). Then (R, B(R), PX) is a probability space. If X is a 1 × k random vector, then the induced probability PX = PF is PX(B) = P[X^{−1}(B)] for any B ∈ B(R^k). Then (R^k, B(R^k), PX) is a probability space.
Then E[h(X)] = ∫ h(X) dP = ∫ h(x) dF(x) = EF[h] = ∫ h dPX. Here W = h(X) is a RV wrt (Ω, F, P), while Z = h is a RV wrt (R, B(R), PX).
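A toy discrete sketch of the induced probability (the fair-die example is an assumption for illustration): with Ω = {1, ..., 6}, P uniform, and X(ω) = ω mod 2, the induced probability PX(B) = P[X^{−1}(B)] is itself a probability measure.

```python
from fractions import Fraction

Omega = range(1, 7)                      # fair die
P = {w: Fraction(1, 6) for w in Omega}   # P({w}) = 1/6

def X(w):
    return w % 2                         # parity of the roll

def PX(B):
    # Induced probability: P of the preimage X^{-1}(B).
    return sum(P[w] for w in Omega if X(w) in B)

assert PX({1}) == Fraction(1, 2)   # three odd faces
assert PX({0, 1}) == 1             # PX(R) = 1
```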
71) Let X : Ω → R. Let A, B, Bn ∈ B(R).
i) If A ⊆ B, then X −1 (A) ⊆ X −1 (B).
ii) X −1 (∪n Bn ) = ∪n X −1 (Bn ).
iii) X −1 (∩n Bn ) = ∩n X −1 (Bn ).
iv) If A and B are disjoint, then X −1 (A) and X −1 (B) are disjoint.
v) X −1 (B c ) = [X −1 (B)]c .
(The unions and intersections in ii) and iii) can be finite, countable or un-
countable.)
74) Let f(x) ≥ 0 be a Lebesgue integrable pdf of a RV with cdf F. Then PX(B) = PF(B) = ∫_B f(x)dx wrt Lebesgue integration. So many probability distributions can be obtained with Lebesgue integration.
75) RVs X1, ..., Xk are independent if P(X1 ∈ B1, ..., Xk ∈ Bk) = ∏_{i=1}^k P(Xi ∈ Bi) for any B1, ..., Bk ∈ B(R) iff FX1,...,Xk(x1, ..., xk) = FX1(x1) ··· FXk(xk) for any real x1, ..., xk iff σ(X1), ..., σ(Xk) are independent (∀Ai ∈ σ(Xi), A1, ..., Ak are independent). An infinite collection of RVs X1, X2, ... is independent if any finite subset is independent. If pdfs exist, X1, ..., Xk are independent iff fX1,...,Xk(x1, ..., xk) = fX1(x1) ··· fXk(xk) for any real x1, ..., xk. If pmfs exist, X1, ..., Xk are independent iff pX1,...,Xk(x1, ..., xk) = pX1(x1) ··· pXk(xk) for any real x1, ..., xk. Recall that the σ-field σ(X) = {X^{−1}(B) : B ∈ B(R)}.
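A minimal check of the pmf factorization criterion in 75) for two independent fair coins (a hypothetical example):

```python
from fractions import Fraction

# Joint pmf of two independent fair coins.
pmf = {(x, y): Fraction(1, 4) for x in (0, 1) for y in (0, 1)}

# Marginal pmfs obtained by summing out the other coordinate.
pX = {x: sum(v for (a, _), v in pmf.items() if a == x) for x in (0, 1)}
pY = {y: sum(v for (_, b), v in pmf.items() if b == y) for y in (0, 1)}

# Independence: the joint pmf factors into the product of marginals.
assert all(pmf[(x, y)] == pX[x] * pY[y] for x in (0, 1) for y in (0, 1))
```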
76) Suppose X1, ..., Xn are independent and gi(Xi) is a function of Xi alone. Then E[g1(X1) ··· gn(Xn)] = E[∏_{i=1}^n gi(Xi)] = ∏_{i=1}^n E[gi(Xi)] provided the expected values exist.
77) Let (Ω1, F1, P1) and (Ω2, F2, P2) be two probability spaces. The Cartesian product = cross product Ω1 × Ω2 = {(ω1, ω2) : ω1 ∈ Ω1, ω2 ∈ Ω2}. The product of F1 and F2, denoted by F1 × F2, is the σ-field σ(A) where A = {A1 × A2 : A1 ∈ F1, A2 ∈ F2} is the collection of all cross products A1 × A2 of events in F1 and F2.
78) Theorem: There is a unique probability measure P = P1 × P2 , called
the product of P1 and P2 or the product probability measure, such that
P (A1 × A2 ) = P1 (A1 )P2 (A2 ) for all A1 ∈ F1 and A2 ∈ F2 .
79) The product probability space is (Ω1 × Ω2 , F1 × F2 , P1 × P2 ).
80) 77)-79) can be extended to (Ωi, Fi, Pi) for i = 1, ..., n. Denote P1 × ··· × Pn by ∏_{i=1}^n Pi, F1 × ··· × Fn by ∏_{i=1}^n Fi, and Ω1 × ··· × Ωn by ∏_{i=1}^n Ωi. If (Ωi, Fi, Pi) = (R, B(R), Pi), then the product probability space is (R^n, B(R^n), ∏_{i=1}^n Pi). If (Ωi, Fi, Pi) = (R, B(R), PXi), then the product probability space is (R^n, B(R^n), ∏_{i=1}^n PXi).
81) Let independent Xi be defined on (R, B(R), PXi). Then the product probability space (Ω, F, P) = (R^n, B(R^n), ∏_{i=1}^n PXi) is the probability space for X = (X1, ..., Xn).
82) Let ∫ f dµ = ∫ f(x) dµ(x). Then the double integral
∫_{Ω1×Ω2} f(x1, x2) d[P1 × P2](x1, x2) = ∫_{Ω1} [∫_{Ω2} f(x1, x2) dP2(x2)] dP1(x1) = ∫_{Ω2} [∫_{Ω1} f(x1, x2) dP1(x1)] dP2(x2).
The last two expressions are known as iterated integrals.


83) Fubini's Theorem: a) Assume f ≥ 0. Then ∫_{Ω1} f(x1, x2) dP1(x1) is measurable F2, ∫_{Ω2} f(x1, x2) dP2(x2) is measurable F1, and 82) holds.
b) Assume f is integrable wrt P1 × P2. Then ∫_{Ω1} f(x1, x2) dP1(x1) is finite ae and measurable F2 ae, ∫_{Ω2} f(x1, x2) dP2(x2) is finite ae and measurable F1 ae, and 82) holds.
Note: Part 83 a) is also known as Tonelli's theorem or the Fubini-Tonelli theorem. The double integral is often written as ∫_{Ω1×Ω2}. Note that f : Ω1 × Ω2 → R (at least ae). Fubini's theorem for product probability measures shows double integrals can be calculated with iterated integrals if X1 ⊥⊥ X2 (X1 and X2 independent), and the theorem is sometimes stated as below.
84) Fubini's Theorem for product probability measures: If f is measurable, then
∫_{Ω1×Ω2} f d[P1 × P2] = ∫_{Ω1} [∫_{Ω2} f(x1, x2) dP2(x2)] dP1(x1) = ∫_{Ω2} [∫_{Ω1} f(x1, x2) dP1(x1)] dP2(x2)
provided that either a) f ≥ 0, or b) ∫_{Ω1×Ω2} |f| d[P1 × P2] < ∞.
85) A product measure µ satisfies µ(∏_{i=1}^n Ai) = ∏_{i=1}^n µ(Ai).
86) Fubini's Theorem for product measures: If f is measurable, then
∫_{Ω1×Ω2} f d[µ1 × µ2] = ∫_{Ω1} [∫_{Ω2} f(x1, x2) dµ2(x2)] dµ1(x1) = ∫_{Ω2} [∫_{Ω1} f(x1, x2) dµ1(x1)] dµ2(x2)
provided that the µi are σ-finite and either a) f ≥ 0, or b) ∫_{Ω1×Ω2} |f| d[µ1 × µ2] < ∞.
Note: the Lebesgue measure is σ-finite on R and the counting measure µC is σ-finite if Ω is countable, where µC(A) = the number of points in set A. Let λ be the Lebesgue measure on R2 and µL the Lebesgue measure on R. Then λ(A × B) = µL(A)µL(B) is a product measure. Let ν be the counting measure on Z2 and µC the counting measure on Z. Then ν(A × B) = µC(A)µC(B) is a product measure.
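With the counting measure, integrals reduce to sums, which makes the σ-finiteness condition easy to illustrate concretely. A small sketch (the function f is an arbitrary choice):

```python
from fractions import Fraction

# Counting measure muC on a finite Omega: the integral of f is the sum
# of its values, one term per point.
Omega = range(1, 6)
f = {w: Fraction(1, w) for w in Omega}
integral = sum(f[w] for w in Omega)   # 1 + 1/2 + 1/3 + 1/4 + 1/5

assert integral == Fraction(137, 60)
```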
87) Fubini's Theorem for Lebesgue Integrals: Let C = {(x, y) : a ≤ x ≤ b, c ≤ y ≤ d} = [a, b] × [c, d]. Let g(x, y) be measurable and Lebesgue integrable. Then
∫∫_C g(x, y) dx dy = ∫_c^d [∫_a^b g(x, y) dx] dy = ∫_a^b [∫_c^d g(x, y) dy] dx.

88) The result in 87) can be extended to where the limits of integration
are infinite and to n ≥ 2 integrals. Using g(x, y) = h(x, y)f(x, y) where f is
a pdf gives E[h(X, Y )]. Note that g : R2 → R (at least ae).
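A numerical sketch of 87) with the assumed integrand g(x, y) = xy on [0, 1] × [0, 2]: both iterated midpoint-rule integrals agree with the exact value ∫_0^1 x dx · ∫_0^2 y dy = 1.

```python
# Iterated midpoint-rule integration of g(x, y) = x*y in either order.
def iterated(outer_is_y, n=400):
    hx, hy = 1.0 / n, 2.0 / n
    total = 0.0
    if outer_is_y:                      # integrate over x first, then y
        for j in range(n):
            y = (j + 0.5) * hy
            inner = sum((i + 0.5) * hx * y for i in range(n)) * hx
            total += inner * hy
    else:                               # integrate over y first, then x
        for i in range(n):
            x = (i + 0.5) * hx
            inner = sum(x * (j + 0.5) * hy for j in range(n)) * hy
            total += inner * hx
    return total

assert abs(iterated(True) - 1.0) < 1e-6
assert abs(iterated(False) - 1.0) < 1e-6
```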

3.5 Complements

3.6 Problems

3.1. Suppose the Xn are nonnegative random variables with limn→∞ Xn = X and limn→∞ E(Xn) = c > 0. What does Fatou's lemma say about the 2 quantities limn→∞ Xn and limn→∞ E(Xn)?
3.2. Suppose limsupn ∫ fn dµ ≤ liminfn ∫ fn dµ. Does limn→∞ ∫ fn dµ exist? Explain briefly.
3.3. Let P be the uniform U(0,1) probability and let
X = 1I(0,0.75) + 1I(0.5,1).
a) Find E(X) using linearity: E(∑_{i=1}^n xi IAi) = ∑_{i=1}^n xi P(Ai).
b) Find E(X) = ∑_x xP(X = x) by finding the two distinct values of x in the range of X and the two values of P(X = x).
(Note: for X = 1I(0,0.75) + 1I(0.5,1), n = 2, and xi = 1 for i = 1, 2. Thus E(X) ≠ ∑_{i=1}^n xi P(X = xi) = 2(1)P(X = 1). Need the xi to be the distinct values of the range of SRV X for E(X) = ∑_{i=1}^n xi P(X = xi) = ∑_x xP(X = x).)
3.4. Fix (Ω, F, P). Let the induced probability PX = PF be PX(B) = P[X^{−1}(B)] for any B ∈ B(R). Show that E[IB(X)] = ∫ IB dPX.
Hint: Find IB(X(ω)), and then take the expectation.
(Note: E[IB(X)] = ∫ IB(x)dF(x). Hence ∫ h(x) dF(x) and ∫ h dPX agree on indicator RVs h = IB. By linearity, ∫ h(x) dF(x) and ∫ h dPX agree on SRVs. By monotone passage of the limit of nonnegative SRVs, ∫ h(x) dF(x) and ∫ h dPX should agree on nonnegative RVs h, and hence on general RVs h. Also, ∫ h dPX = EX[h] = EF[h] where Z = h is a RV on (R, B(R), PX). E[h(X)] = ∫ h(X)dP = ∫ h(x)dF(x) has W = h(X) a RV on (Ω, F, P). Do not use these results for solving 3.4.)
3.5. Let W , X and Y be integrable.
a) Prove |E[W ]| ≤ E[|W |].
b) Prove |E[X] − E[Y]| ≤ E[|X − Y|].
3.6. Billingsley (1986, 16.9): Let fn be integrable and supn ∫ fn dµ < ∞. If fn ↑ f, prove that f is integrable and ∫ fn dµ → ∫ f dµ.
Hints: a) 0 ≤ (fn − f1) ↑ (f − f1). Apply the MCT.
b) Let g = f − f1. Then ∫ g dµ = limn ∫ (fn − f1) dµ ≤ supn ∫ (fn − f1) dµ.
Show this implies g is integrable.
c) Then g + f1 = f is integrable.
3.7. Billingsley (1986, 21.5): X ∼ C(0, 1), a Cauchy distribution with location and scale parameters 0 and 1, if the probability density function (pdf) of X is
f(x) = 1/[π(1 + x²)]
for −∞ < x < ∞. Show E(X) does not exist by showing that E[|X|] = ∞.
Hint: |x|f(x) is an even function. Thus
E[|X|] = ∫_{−∞}^∞ |x|f(x)dx = 2 ∫_0^∞ |x|/[π(1 + x²)] dx.

3.8. Billingsley (1986, 21.6 modified): Theorem 16.6: if fn ≥ 0, then ∫ ∑_n fn dµ = ∑_n ∫ fn dµ. a) Apply Theorem 16.6 to indicator RVs with µ = P to prove the first Borel Cantelli lemma.
3.9. Billingsley (1986, 20.2): If X is a positive RV with pdf f, prove that X^{−1} = 1/X has pdf
(1/x²) f(1/x).
3.10. a) Suppose X has a mixture distribution of the Uj with probabilities πj, and that the cdf of X is FX(t) = ∑_{j=1}^J πj FUj(t). If fUj(t) is the pdf of Uj for each j, find the pdf fX of X.
b) Using a), show that if E[h(X)] and the E[h(Uj)] exist, then E[h(X)] = ∑_{j=1}^J πj E[h(Uj)].
c) Suppose X has a mixture distribution of U1 with probability 0.95 and U2 with probability 0.05 where P(U1 = 0) = 1 and U2 is a nonnegative random variable with E(U2) = 1000 and V(U2) = 10000. Find i) E(X), ii) E(X²), and iii) V(X).

Note: X can be the claims distribution for an insurance policy where 95%
of the policy holders make no claim in the year, and 5% make a claim with
a complicated nonnegative distribution U2 where the mean and variance are
known from extensive past records. Then the central limit theorem can be used to find the percentiles of ∑ Xi where the Xi are iid from the distribution
of X.
3.11. The random variable X is a point mass at the real number c if
P (X = c) = 1. Then the pmf pX (x) > 0 only at x = c. If h is a (measurable)
function, find E[h(X)].
3.12. Like part of Billingsley (1986, 5.12): Let X = ∑_{k=1}^n IAk be a simple random variable, and find E[X/n].
3.13. Billingsley (1986, 5.14): Prove that if X has nonnegative integers as values, then
E[X] = ∑_{n=1}^∞ P(X ≥ n).
Hint: E[X] = ∑_x xP(X = x) = ∑_{n=1}^∞ nP(X = n). Consider the following array, and sum on columns and sum on rows.

Table 3.1
sum
P(X=1) P(X=2) P(X=3) P(X=4) ··· P (X ≥ 1)
P(X=2) P(X=3) P(X=4) ··· P (X ≥ 2)
P(X=3) P(X=4) ··· P (X ≥ 3)
P(X=4) ··· P (X ≥ 4)
.. .. .. .. .. ..
. . . . . .
sum P(X=1) 2 P(X=2) 3 P(X=3) 4 P(X=4) ··· E(X)

Exam and Quiz Problems


3.14. a) Which theorem allows double integrals to be computed with it-
erated integrals?
b) State Fatou’s Lemma for random variables.
3.15. State the Central Limit Theorem.
3.16. a) Suppose X has a mixture distribution of the Uj with probabilities πj, and that the cdf of X is FX(t) = ∑_{j=1}^J πj FUj(t). If fUj(t) is the pdf of Uj for each j, find the pdf fX of X.

b) Using a), show that if E[h(X)] and the E[h(Uj)] exist, then E[h(X)] = ∑_{j=1}^J πj E[h(Uj)].
3.17. Let P be the uniform U(0,1) probability and let X = 1I(0,0.7) + 1I(0.6,1). Find E(X).
3.18. Suppose X = ∑_{i=1}^n xi IAi where the xi are real numbers and the Ai are events. Using linearity, find E(X).

3.19. Suppose the fn are nonnegative functions with limn→∞ fn = f and limn→∞ ∫ fn dµ = c > 0. What does Fatou's lemma say about these 2 quantities?
3.20. Suppose fn → f, ∫ f dµ ≤ liminfn ∫ fn dµ, and limsupn ∫ fn dµ ≤ ∫ f dµ. Find limn→∞ ∫ fn dµ, if possible.
3.21. State the Monotone Convergence Theorem for nonnegative measur-
able functions fn and f.
3.22. For each result given for integrals, fn, f, and g, give the corresponding result for expectation, Xn, X, and Y.
i) f is integrable iff ∫ |f| dµ < ∞.
ii) monotonicity: If f and g are integrable and f ≤ g ae, then ∫ f dµ ≤ ∫ g dµ.
iii) linearity: If f and g are integrable and a, b ∈ R, then af + bg is integrable with ∫ (af + bg) dµ = a ∫ f dµ + b ∫ g dµ.
iv) Monotone Convergence Theorem (MCT): If 0 ≤ fn ↑ f ae, then ∫ fn dµ ↑ ∫ f dµ.
v) Fatou's Lemma: For nonnegative fn, ∫ liminfn fn dµ ≤ liminfn ∫ fn dµ.

vi) Lebesgue's Dominated Convergence Theorem (LDCT): If the |fn| ≤ g ae where g is integrable, and if fn → f ae, then f and fn are integrable and ∫ fn dµ → ∫ f dµ.
vii) Bounded Convergence Theorem (BCT): If µ(Ω) < ∞ and the fn are uniformly bounded, then fn → f ae implies ∫ fn dµ → ∫ f dµ.
viii) If fn ≥ 0 then ∫ ∑_{n=1}^∞ fn dµ = ∑_{n=1}^∞ ∫ fn dµ.
ix) If ∑_{n=1}^∞ ∫ |fn| dµ < ∞, then ∫ ∑_{n=1}^∞ fn dµ = ∑_{n=1}^∞ ∫ fn dµ.
x) If f and g are integrable, then |∫ f dµ − ∫ g dµ| ≤ ∫ |f − g| dµ.
3.23. Suppose X has a mixture distribution of the Uj with probabilities πj, and that the cdf of X is FX(t) = ∑_{j=1}^J πj FUj(t). If each Uj is a discrete RV with probability mass function (pmf) pUj(t), then X is a discrete RV with pmf
pX(t) = ∑_{j=1}^J πj pUj(t).
Use the pmf pX(t) to show that if E[h(X)] and the E[h(Uj)] exist, then E[h(X)] = ∑_{j=1}^J πj E[h(Uj)].
3.24. Prove one of the following: a) the Monotone Convergence Theorem for RVs, b) If Xn ≥ 0, then E[∑_{n=1}^∞ Xn] = ∑_{n=1}^∞ E[Xn], or c) Lebesgue's Dominated Convergence Theorem for RVs. State which result, a), b), or c), that you are proving.
Some Qual Type Problems
3.30Q. Suppose events A1, ..., An are disjoint and ∪_{i=1}^n Ai = Ω. Let simple random variable (SRV) X = ∑_{i=1}^n xi IAi. Then the expected value of X is
E(X) = ∑_{i=1}^n xi P(Ai) = ∑_x xP(X = x). (3.3)
Prove the existence and uniqueness of Equation (3.3).


3.31Q. Prove the following theorem.
Theorem 3.11. Let Xn, X, and Y be SRVs.
a) −∞ < E(X) < ∞
b) linearity: E(aX + bY ) = aE(X) + bE(Y )
c) If SRV X = ∑_{i=1}^n xi IAi where the Ai are not necessarily disjoint, then E(X) = ∑_{i=1}^n xi P(Ai).
d) monotonicity: If X ≤ Y , then E(X) ≤ E(Y )
f) If t is a real valued function, then E[t(X)] = ∑_x t(x)P(X = x)
h) If X ⊥⊥ Y (X and Y independent), then E(XY ) = E(X)E(Y ).
3.32Q. Let X ≥ 0 be a nonnegative RV. Then
E(X) = limn→∞ E(Xn) = ∫ X dP ≤ ∞ (3.4)
where the Xn are nonnegative SRVs with 0 ≤ Xn ↑ X. Prove the existence of Equation (3.4).
3.33Q . Prove the following theorem.
Theorem 3.13. Let X, Y be nonnegative random variables.
a) “restricted linearity:” For X, Y ≥ 0 and a, b ≥ 0,
E(aX + bY ) = aE(X) + bE(Y ).
b) “monotonicity:” If X ≤ Y ae, then E(X) ≤ E(Y ).
3.34Q . Prove the following theorem. In your proof of iii) and iv), you may
use ii) linearity: If X and Y are integrable and a, b ∈ R, then aX + bY is
integrable with E(aX + bY ) = aE(X) + bE(Y ).
Theorem 3.14. i) X is integrable iff both E[X + ] and E[X − ] are finite.

iii) monotonicity: If X and Y are integrable and X ≤ Y ae, then


E(X) ≤ E(Y ).
iv) |E(X)| ≤ E( |X| ).
3.35Q . State and prove the Monotone Convergence Theorem (for RVs).
Ignore “ae” in the proof.
3.36Q. State and prove the Lebesgue Dominated Convergence Theorem (for RVs). Ignore "ae" in the proof.
Chapter 4
Large Sample Theory

This chapter discusses the central limit theorem, convergence in distribution


and convergence in probability.
Large sample theory, also called asymptotic theory, is used to approxi-
mate the distribution of an estimator when the sample size n is large. This
theory is extremely useful if the exact sampling distribution of the estimator
is complicated or unknown. To use this theory, one must determine what the
estimator is estimating, the rate of convergence, the asymptotic distribution,
and how large n must be for the approximation to be useful.
The cumulative distribution function (cdf) F (x) is defined in Definition
2.5 for a random variable and Definition 2.10 for a random vector. Some
properties of the cdf for a random variable are given in Theorem 2.6. Some
useful distributions are given in Section 2.4.

4.1 Modes of Convergence

Definition 4.1. Let {Zn, n = 1, 2, ...} be a sequence of random variables with cdfs Fn, and let X be a random variable with cdf F. Then Zn converges in distribution to X, written Zn →D X, or Zn converges in law to X, written Zn →L X, if
limn→∞ Fn(t) = F(t)
at each continuity point t of F. The distribution of X is called the limiting distribution or the asymptotic distribution of Zn.


Convergence in distribution is also known as weak convergence or Xn converges weakly to X. The Central Limit Theorem gives the limiting distribution of Zn = √n(Ȳn − µ).
Remark 4.1. i) An important fact is that the limiting distribution
does not depend on the sample size n.
ii) Warning: A common error is to get a “limiting distribution” that does
depend on n.
iii) Know: If Fn(t) → H(t) and H(t) is continuous, then for convergence in distribution, H(t) needs to be a cdf: H(t) = FX(t) if Xn →D X. If H(t) is a constant: H(t) = c ∈ [0, 1] ∀t, then H(t) is not a cdf, and Xn does not converge in distribution to any random variable X.
iv) Since F(x) = P(X ≤ x), it follows that 0 ≤ Fn(t) ≤ 1. Thus limn→∞ Fn(t) = H(t) has 0 ≤ H(t) ≤ 1 if the limit exists. Warning: A common error is to get H(t) < 0 or H(t) > 1.
v) Warning: Convergence in distribution says that the cdf Fn(t) of Xn gets close to the cdf F(t) of X as n → ∞ provided that t is a continuity point of F. Hence for any ε > 0 there exists Nt such that if n > Nt, then |Fn(t) − F(t)| < ε. Notice that Nt depends on the value of t. Convergence in distribution does not imply that the random variables Xn ≡ Xn(ω) converge to the random variable X ≡ X(ω) for all ω.
vi) If FXn(t) → FX(t) at all continuity points of FX(t), then Xn →D X. If t0 is a discontinuity point of FX(t), then the behavior of FXn(t0) is not important: could have limn→∞ FXn(t0) = ct0 ∈ [0, 1] or that limn→∞ FXn(t0) does not exist. Convergence in distribution does not need ct0 = FX(t0).
vii) If FXn(t) → H(t) except at discontinuity points of FX(t), still need H(t) = FX(t) at continuity points of FX(t) for Xn →D X.
Convergence in distribution is useful because if the distribution of Xn is unknown or complicated and the distribution of X is easy to use, then for large n we can approximate the probability that Xn is in an interval by the probability that X is in the interval. To see this, notice that if Xn →D X, then P(a < Xn ≤ b) = Fn(b) − Fn(a) → F(b) − F(a) = P(a < X ≤ b) if F is continuous at a and b.

Example 4.1. Suppose that Xn ∼ U(−1/n, 1/n). Then the cdf Fn(x) of Xn is Fn(x) = 0 for x ≤ −1/n, Fn(x) = nx/2 + 1/2 for −1/n ≤ x ≤ 1/n, and Fn(x) = 1 for x ≥ 1/n.
Sketching Fn(x) shows that it has a line segment rising from 0 at x = −1/n to 1 at x = 1/n and that Fn(0) = 0.5 for all n ≥ 1. Examining the cases x < 0, x = 0 and x > 0 shows that as n → ∞,
Fn(x) → 0 for x < 0, Fn(x) → 1/2 for x = 0, and Fn(x) → 1 for x > 0.

Notice that if X is a random variable such that P(X = 0) = 1, then X has cdf FX(x) = 0 for x < 0 and FX(x) = 1 for x ≥ 0.
Since x = 0 is the only discontinuity point of FX(x) and since Fn(x) → FX(x) for all continuity points of FX(x) (i.e. for x ≠ 0), Xn →D X.
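The cdf computation in Example 4.1 can be checked numerically. This is a sketch: at every x ≠ 0 the U(−1/n, 1/n) cdf converges to the point-mass-at-0 cdf, while Fn(0) = 1/2 for every n.

```python
# Cdf of Xn ~ U(-1/n, 1/n).
def Fn(x, n):
    if x <= -1.0 / n:
        return 0.0
    if x >= 1.0 / n:
        return 1.0
    return n * x / 2 + 0.5

# For large n, Fn matches the limiting point-mass cdf away from 0.
for x in (-0.3, -0.01, 0.01, 0.3):
    limit = 0.0 if x < 0 else 1.0
    assert Fn(x, 10_000) == limit
assert Fn(0.0, 10_000) == 0.5   # the discontinuity point does not matter
```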

Example 4.2. Suppose Yn ∼ U(0, n). Then Fn(t) = t/n for 0 < t ≤ n, Fn(t) = 0 for t ≤ 0, and Fn(t) = 1 for t ≥ n. Hence limn→∞ Fn(t) = 0 for t ≤ 0. If t > 0 and
n > t, then Fn (t) = t/n → 0 as n → ∞. Thus limn→∞ Fn (t) = H(t) = 0
for all t, and Yn does not converge in distribution to any random variable Y
since H(t) ≡ 0 is a continuous function but not a cdf.

Definition 4.2. A sequence of random variables Xn converges in distribution to a constant τ(θ), written
Xn →D τ(θ), if Xn →D X
where P(X = τ(θ)) = 1. The distribution of the random variable X is said to be degenerate at τ(θ) or to be a point mass at τ(θ).

See Section 2.4 for some properties of the point mass distribution, which
corresponds to a discrete random variable that only takes on exactly one
value. Using characteristic functions, it can be shown that if X has a point
mass at τ (θ), then X ∼ N (τ (θ), 0), a normal distribution with mean τ (θ)
and variance 0. See Section 4.2. A point mass at 0, where P (X = 0) = 1, is
a common limiting distribution. See Examples 4.1 and 4.3.
Example 4.3. X has a point mass distribution at c or X is degenerate
at c if P (X = c) = 1. Thus X has a probability mass function with all of the
mass at the point c. Then FX(t) = 1 for t ≥ c and FX(t) = 0 for t < c. Often FXn(t) → FX(t) for all t ≠ c where P(X = c) = 1. Then Xn →D X where P(X = c) = 1. Thus FXn(t) → H(t) for all t ≠ c where H(t) = FX(t) ∀t ≠ c.
It is possible that limn→∞ FXn (c) = H(c) ∈ [0, 1] or that limn→∞ FXn (c)
does not exist.
Example 4.4. Prove whether the following sequences of random variables Xn converge in distribution to some random variable X. If Xn →D X, find the distribution of X (for example, find FX(t) or note that P(X = c) = 1, so X has the point mass distribution at c).

a) Xn ∼ U (−n − 1, −n)
b) Xn ∼ U (n, n + 1)
c) Xn ∼ U (an , bn ) where an → a < b and bn → b.
d) Xn ∼ U (an , bn ) where an → c and bn → c.
e) Xn ∼ U (−n, n)
f) Xn ∼ U (c − 1/n, c + 1/n)
Solution. If Xn ∼ U(an, bn) with an < bn, then
FXn(t) = (t − an)/(bn − an)
for an ≤ t ≤ bn, FXn(t) = 0 for t ≤ an and FXn(t) = 1 for t ≥ bn. On [an, bn], FXn(t) is a line segment from (an, 0) to (bn, 1) with slope 1/(bn − an).
a) FXn (t) → H(t) ≡ 1 ∀t ∈ R since FXn (t) = 1 for t ≥ −n. Since H(t) is
continuous but not a cdf, Xn does not converge in distribution to any RV X.
b) FXn (t) → H(t) ≡ 0 ∀t ∈ R since FXn (t) = 0 for t < n. Since H(t) is
continuous but not a cdf, Xn does not converge in distribution to any RV X.
c) FXn(t) → FX(t) where FX(t) = 0 for t ≤ a, FX(t) = (t − a)/(b − a) for a ≤ t ≤ b, and FX(t) = 1 for t ≥ b. Hence Xn →D X ∼ U(a, b).
d) FXn(t) → 0 for t < c and FXn(t) → 1 for t > c. Hence Xn →D X where P(X = c) = 1. Hence X has a point mass distribution at c. (The behavior of limn→∞ FXn(c) is not important, even if the limit does not exist.)
e) FXn(t) = (t + n)/(2n) = 1/2 + t/(2n) for −n ≤ t ≤ n. Thus FXn(t) → H(t) ≡ 0.5 ∀t ∈ R. Since H(t) is continuous but not a cdf, Xn does not converge in distribution to any RV X.
f) FXn(t) = (t − c + 1/n)/(2/n) = 1/2 + (n/2)(t − c) for c − 1/n ≤ t ≤ c + 1/n. Thus FXn(t) → H(t) where H(t) = 0 for t < c, H(t) = 1/2 for t = c, and H(t) = 1 for t > c.

If X has the point mass at c, then FX(t) = 0 for t < c and FX(t) = 1 for t ≥ c. Hence t = c is the only discontinuity point of FX(t), and H(t) = FX(t) at all continuity points of FX(t). Thus Xn →D X where P(X = c) = 1.
Definition 4.3. a) A sequence of random variables Xn converges in probability to a constant τ(θ), written
Xn →P τ(θ),
if for every ε > 0,
limn→∞ P(|Xn − τ(θ)| < ε) = 1 or, equivalently, limn→∞ P(|Xn − τ(θ)| ≥ ε) = 0.
b) The sequence Xn converges in probability to X, written
Xn →P X,
if for every ε > 0,
limn→∞ P(|Xn − X| < ε) = 1 or, equivalently, limn→∞ P(|Xn − X| ≥ ε) = 0.
Notice that Xn →P X if Xn − X →P 0.

Definition 4.4. For a real number r > 0, Yn converges in rth mean to a random variable Y, written Yn →r Y, if
E(|Yn − Y|^r) → 0
as n → ∞. In particular, if r = 2, Yn converges in quadratic mean to Y, written
Yn →2 Y or Yn →qm Y,
if E[(Yn − Y)²] → 0 as n → ∞. We say that Xn converges in rth mean to τ(θ), written
Xn →r τ(θ),
if E(|Xn − τ(θ)|^r) → 0 as n → ∞. The mean square error MSEτ(θ)(Xn) = Eθ[(Xn − τ(θ))²].
Convergence in quadratic mean is also known as convergence in mean square and as mean square convergence. The notations Yn →r Y, Yn →Lr Y, and Yn →L_r Y are used in the literature, especially for r ≥ 1.

Theorem 4.1: Generalized Chebyshev's Inequality or Generalized Markov's Inequality: Let u : R → [0, ∞) be a nonnegative function. If E[u(Y)] exists, then for any c > 0,
P[u(Y) ≥ c] ≤ E[u(Y)]/c.
If µ = E(Y) exists, then taking u(y) = |y − µ|^r and c̃ = c^r gives
Markov's Inequality: for r > 0 with E[|Y − µ|^r] finite and for any c > 0,
P(|Y − µ| ≥ c) = P(|Y − µ|^r ≥ c^r) ≤ E[|Y − µ|^r]/c^r.
If r = 2 and σ² = V(Y) exists, then we obtain
Chebyshev's Inequality:
P(|Y − µ| ≥ c) ≤ V(Y)/c².

Proof. The proof is given for pdfs. For pmfs, replace the integrals by sums. Now
E[u(Y)] = ∫_R u(y)f(y)dy = ∫_{y:u(y)≥c} u(y)f(y)dy + ∫_{y:u(y)<c} u(y)f(y)dy ≥ ∫_{y:u(y)≥c} u(y)f(y)dy
since the integrand u(y)f(y) ≥ 0. Hence
E[u(Y)] ≥ c ∫_{y:u(y)≥c} f(y)dy = cP[u(Y) ≥ c]. □
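A Monte Carlo sanity check of Chebyshev's inequality for an assumed distribution Y ∼ U(0, 1), where µ = 0.5 and V(Y) = 1/12 (a sketch, not part of the proof):

```python
import random

random.seed(2)
c = 0.4
n = 100_000
# Empirical frequency of |Y - mu| >= c for Y ~ U(0,1).
hits = sum(abs(random.random() - 0.5) >= c for _ in range(n))

# Chebyshev bound V(Y)/c^2 ~ 0.521; the true probability is 0.2.
assert hits / n <= (1 / 12) / c ** 2
```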

Note: if E[|Y − µ|k ] is finite and k > 1, then E[|Y − µ|r ] is finite for
1 ≤ r ≤ k.
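A quick numerical illustration of Theorem 4.1 (an added sketch, not part of the original text): the following Python code checks Chebyshev's bound by Monte Carlo for an exponential distribution; the seed and sample size are arbitrary choices.

```python
import math
import random

# Illustrative Monte Carlo check of Chebyshev's inequality
# P(|Y - mu| >= c) <= V(Y)/c^2 for Y exponential with mean 1,
# so mu = 1 and V(Y) = 1. Here P(|Y - 1| >= 2) = P(Y >= 3) = e^{-3}.
random.seed(0)
n = 100_000
c = 2.0
ys = [random.expovariate(1.0) for _ in range(n)]
tail = sum(abs(y - 1.0) >= c for y in ys) / n
bound = 1.0 / c**2                 # V(Y)/c^2 = 0.25
print(tail, bound)                 # empirical tail is far below the bound
```

The bound is crude here: the exact tail probability e^{-3} ≈ 0.05 is much smaller than the Chebyshev bound 0.25.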
The following theorem gives sufficient conditions for Tn to converge in
qm
probability to τ (θ). Notice that MSEτ(θ) (Tn ) → 0 is equivalent to Tn → τ (θ).
Theorem 4.2. a) If
limn→∞ MSEτ(θ) (Tn ) = 0,
P
then Tn → τ (θ).
b) If
limn→∞ Vθ (Tn ) = 0 and limn→∞ Eθ (Tn ) = τ (θ),
P
then Tn → τ (θ).
Proof. a) Using Theorem 4.1 with Y = Tn , u(Tn ) = (Tn − τ (θ))2 and
c = ε2 shows that for any ε > 0,

Pθ (|Tn − τ (θ)| ≥ ε) = Pθ [(Tn − τ (θ))2 ≥ ε2 ] ≤ Eθ [(Tn − τ (θ))2 ]/ε2 .

Hence
limn→∞ Eθ [(Tn − τ (θ))2 ] = limn→∞ MSEτ(θ) (Tn ) = 0
P
is a sufficient condition for Tn → τ (θ).
b)
MSEτ(θ) (Tn ) = Vθ (Tn ) + [Biasτ(θ) (Tn )]2
where Biasτ(θ) (Tn ) = Eθ (Tn ) − τ (θ). Since MSEτ(θ) (Tn ) → 0 if both Vθ (Tn )
→ 0 and Biasτ(θ) (Tn ) = Eθ (Tn ) − τ (θ) → 0, the result follows from a). 
P
Remark 4.2. We want conditions A ⇒ B where B is Xn → X. A ⇒ B
means that if A holds, then B holds; it does not mean that if A fails, then B
fails. A common error is to conclude that since A does not hold, Xn does not
converge in probability to X.
Theorem 4.3. a) Suppose Xn and X are RVs with the same probability
P D
space. If Xn → X, then Xn → X.
P D
b) Xn → τ (θ) iff Xn → τ (θ).
P
Proof. a) Assume Xn → X, and let ε > 0. Then Fn (x) = P (Xn ≤ x) =

P (Xn ≤ x, X > x+ε)+P (Xn ≤ x, X ≤ x+ε) ≤ P (|Xn −X| ≥ ε)+P (X ≤ x+ε)

= P (|Xn − X| ≥ ε) + FX (x + ε)

where the first equality holds because the events form a partition. P (Xn ≤
x, X > x + ε) ≤ P (|Xn − X| ≥ ε) by the following diagram with e = ε.
X_n X
--------------------------
x x+e
Note that P (Xn ≤ x, X ≤ x + ε) ≤ P (X ≤ x + ε) since P (A ∩ B) ≤ P (B).
Similarly,

FX (x − ε) = P (X ≤ x − ε) = P (X ≤ x − ε, Xn > x) + P (X ≤ x − ε, Xn ≤ x)

≤ P (|Xn − X| ≥ ε) + P (Xn ≤ x) = P (|Xn − X| ≥ ε) + Fn (x)

where the first equality holds because the events form a partition. P (X ≤
x − ε, Xn > x) ≤ P (|Xn − X| ≥ ε) by the following diagram with e = ε.
X X_n
--------------------------
x-e x
Thus

FX (x − ε) − P (|Xn − X| ≥ ε) ≤ Fn (x) ≤ P (|Xn − X| ≥ ε) + FX (x + ε).

P
Since Xn → X, it follows that P (|Xn − X| ≥ ε) → 0 as n → ∞. Taking
liminf and limsup as n → ∞ gives

FX (x − ε) ≤ liminfn Fn (x) ≤ limsupn Fn (x) ≤ FX (x + ε).

If FX is continuous at x, then FX (x − ε) → FX (x) and FX (x + ε) → FX (x)
as ε → 0. Letting ε → 0 in the above display shows that Fn (x) → FX (x) as
D
n → ∞ if FX is continuous at x. Thus Xn → X.
P D D
b) Let c = τ (θ). If Xn → c, then Xn → c by a). Assume Xn → c and let
ε > 0. Then

P [|Xn − c| ≥ ε] = P (Xn ≥ c + ε) + P (Xn ≤ c − ε) =

1 − P (Xn < c + ε) + P (Xn ≤ c − ε) = RHS.

Now
P (Xn < c + ε) ≥ P (Xn ≤ c + ε/2).

Thus P [|Xn − c| ≥ ε] = RHS ≤

1 − P (Xn ≤ c + ε/2) + P (Xn ≤ c − ε) = 1 − Fn (c + ε/2) + Fn (c − ε) → 1 − 1 + 0 = 0

as n → ∞ since Fn (t) → FX (t) as n → ∞ for t ≠ c, where FX (t) = 0 for
t < c and FX (t) = 1 for t ≥ c.
P
Thus P [|Xn − c| ≥ ε] → 0 as n → ∞, and Xn → c. 
Definition 4.5. a) A sequence of random variables Xn converges with
probability 1 (or almost surely, or almost everywhere) to X if

P ( limn→∞ Xn = X) = 1.

This type of convergence will be denoted by
wp1
Xn → X.
b)
wp1
Xn → τ (θ)
if P ( limn→∞ Xn = τ (θ)) = 1.

The convergence in Definition 4.5 is also known as strong convergence.
Notation such as “Xn converges to X wp1” will also be used. Sometimes
ae as
“wp1” will be replaced with “as” or “ae.” The notations Xn → X, Xn → X,
wp1
and Xn → X are often used.

Theorem 4.4. Let Yi be a sequence of iid random variables with E(Yi ) =
µ. Then
wp1
a) Strong Law of Large Numbers (SLLN): Y n → µ, and
P
b) Weak Law of Large Numbers (WLLN): Y n → µ.
Proof of WLLN when V (Yi ) = σ 2 : By Chebyshev’s inequality, for every
ε > 0,
P (|Y n − µ| ≥ ε) ≤ V (Y n )/ε2 = σ 2 /(nε2 ) → 0
as n → ∞. 
P r wp1
Remark 4.3. a) For i) Xn → X, ii) Xn → X, or iii) Xn → X, the Xn
and X need to be defined on the same probability space.
D
b) For Xn → X, the probability spaces can differ.
P wp1 D r
c) For i) Xn → c, ii) Xn → c, iii) Xn → c, and iv) Xn → c, the probability
spaces of the Xn can differ.
d) Warning: For the SLLN and WLLN, students often forget that V (Yi ) =
σ 2 is not needed. The Yi need only be iid with E(Yi ) = µ.
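A small simulation illustrates the WLLN (an added sketch; the exponential distribution, mean, and sample sizes are arbitrary choices, and this example happens to have a finite variance even though the WLLN does not need one).

```python
import random
import statistics

# Illustrative sketch of the WLLN: Ybar_n -> mu in probability.
# Yi are iid exponential with mean mu = 0.5.
random.seed(1)
mu = 0.5
devs = {}
for n in [100, 10_000]:
    ybar = statistics.fmean(random.expovariate(1 / mu) for _ in range(n))
    devs[n] = abs(ybar - mu)
print(devs)   # with high probability the n = 10,000 deviation is tiny
```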

Theorem 4.5: Let k > 0. If E(|X|k ) is finite, then E(|X|j ) is finite for
0 < j ≤ k.
Proof. If |y| ≤ 1, then |y|j ≤ 1. If |y| > 1, then |y|j ≤ |y|k . Thus
|y|j ≤ |y|k + 1 and |X|j ≤ |X|k + 1. Hence E[|X|j ] ≤ E[|X|k ] + 1 < ∞. 
Theorem 4.6, Jensen’s Inequality:

g[E(X)] ≤ E[g(X)]

if the expected values exist and the function g is convex on an interval con-
taining the range of X.
Remark 4.4. a) Let (a, b) be an open interval where a = −∞ and b = ∞
are allowed. A sufficient condition for a function g to be convex on an open
interval (a, b) is g''(x) > 0 on (a, b). If (a, b) = (0, ∞) and g is continuous on
[0, ∞) and convex on (0, ∞), then g is convex on [0, ∞).
b) If X is a positive RV, then the range of X is contained in (0, ∞).
r k
Theorem 4.7: If Xn → X, then Xn → X where 0 < k < r.
Proof. Let Un = |Xn − X|r and Wn = |Xn − X|k . Then Un = Wnt where
t = r/k > 1. The function g(x) = xt is convex on [0, ∞). By Jensen’s
inequality,

E[|Xn − X|r ] = E[Un ] = E[Wnt ] ≥ (E[Wn ])t = (E[|Xn − X|k ])r/k

for r > k. Thus limn→∞ E[|Xn − X|r ] = 0 implies that limn→∞ E[|Xn −
X|k ] = 0 for 0 < k < r. 
r P
Theorem 4.8. If Xn → X, then Xn → X.
Proof I) For ε > 0,

|Xn − X|r ≥ |Xn − X|r I[|Xn − X| ≥ ε] ≥ εr I[|Xn − X| ≥ ε]

where the first inequality holds since the indicator is 0 or 1, and the second
inequality holds since |Xn − X|r ≥ εr when the indicator is 1. Thus for any
ε > 0,

E[|Xn − X|r ] ≥ E[|Xn − X|r I(|Xn − X| ≥ ε)] ≥ εr P [|Xn − X| ≥ ε].

Hence
P [|Xn − X| ≥ ε] ≤ E[|Xn − X|r ]/εr → 0
as n → ∞. 
Proof II)

P [|Xn − X| ≥ ε] = P [|Xn − X|r ≥ εr ] ≤ E[|Xn − X|r ]/εr → 0

as n → ∞ by the Generalized Chebyshev Inequality. 

Example 4.5. a) Let P (Xn = n) = 1/n and P (Xn = 0) = 1 − 1/n. Hence
Xn is discrete and takes on two values with E(Xn ) = n(1/n) = 1 for all
positive integers n. Hence E[|Xn − 0|] = E(Xn ) = 1 ∀n and Xn does not
1
satisfy Xn → 0. Let ε > 0. Then

P [|Xn − 0| ≥ ε] ≤ P (Xn = n) = 1/n → 0

P D
as n → ∞. Hence Xn → 0 and Xn → 0.
b) Let P (Xn = 0) = 1 − 1/n and P (Xn = 1) = 1/n. Hence Xn is discrete
and takes on two values with

E[(Xn − 0)2 ] = E(Xn2 ) = Σx x2 P (Xn = x) = 02 (1 − 1/n) + 12 (1/n) = 1/n → 0

2 P D
as n → ∞. Hence Xn → 0, Xn → 0, and Xn → 0. Note that

E[|Xn − 0|] = E(Xn ) = 1/n → 0.

1 2
Hence Xn → 0 as expected by Theorem 4.7 since Xn → 0.
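The computations in Example 4.5 a) can be verified directly (an added illustration; the values of n are arbitrary):

```python
# Example 4.5 a): P(Xn = n) = 1/n and P(Xn = 0) = 1 - 1/n.
# E|Xn - 0| = n * (1/n) = 1 for every n, so Xn does not converge to 0
# in 1st mean, yet P(|Xn - 0| >= eps) <= 1/n -> 0 for any eps > 0,
# so Xn -> 0 in probability.
for n in [10, 100, 1000]:
    e_abs = n * (1 / n)      # E|Xn| stays at 1
    p_tail = 1 / n           # P(Xn = n) -> 0
    print(n, e_abs, p_tail)
```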
Theorem 4.9: Let Xn have pdf fXn (x), and let X have pdf fX (x). If
fXn (x) → fX (x) for all x (or for x outside of a set of Lebesgue measure 0),
D
then Xn → X.
Theorem 4.10: Suppose Xn and X are integer valued RVs with pmfs
D
fXn (x) and fX (x). Then Xn → X iff P (Xn = k) → P (X = k) for every
integer k iff fXn (x) → fX (x) for every real x.

4.2 The Characteristic Function and Related Functions

Definition 4.6. The moment generating function (mgf) of a random
variable Y is
m(t) = E[etY ] (4.1)
if the expectation exists for t in some neighborhood of 0. Otherwise, the
mgf does not exist. If Y is discrete, then m(t) = Σy ety f(y), and if Y is
continuous, then m(t) = ∫R ety f(y)dy.

Notation. The natural logarithm of y is log(y) = ln(y). If another base
is wanted, it will be given, e.g. log10 (y).

Definition 4.7. If the mgf exists, then the cumulant generating func-
tion (cgf) is k(t) = log(m(t)) for the values of t where the mgf is defined.

Definition 4.8. The characteristic function of a random variable Y
is c(t) = E[eitY ] = E[cos(tY )] + iE[sin(tY )] where the complex number
i = √−1.

Moment generating functions do not necessarily exist in a neighborhood
of zero, but a characteristic function always exists. This text does not require
much knowledge of the theory of complex variables, but know that i2 = −1,
i3 = −i, and i4 = 1. Hence i4k−3 = i, i4k−2 = −1, i4k−1 = −i, and i4k = 1
for k = 1, 2, 3, .... Let the complex number z = a + ib. Then the modulus of
z is |z| = |a + ib| = √(a2 + b2 ).

Definition 4.9. For positive integers k, the kth moment of Y is E[Y k ]
while the kth central moment is E[(Y − E[Y ])k ].

Remark 4.5. a) Suppose that Y is a random variable with an mgf m(t)
that exists for |t| < b for some constant b > 0. Then often the characteristic
function of Y is i) c(t) = m(it) while ii) m(t) = c(−it). If Y has a pmf
with f(yj ) = P (Y = yj ) = pj , then the characteristic function of Y is
c(t) = cY (t) = Σj eityj pj while the mgf mY (t) = Σj etyj pj . Hence the two
formulas i) and ii) “hold” if Y has a pmf, at least for t such that the mgf is
defined. If Y is nonnegative, then the mgf is a scaled Laplace transformation
and c(t) is a scaled Fourier transformation, and then the two formulas i) and
ii) hold by Laplace and Fourier transformation theory, at least for t such that
the mgf is defined. The Taylor series for the mgf is

mY (t) = Σk≥0 (tk /k!) E[Y k ]

for |t| < b while the characteristic function

cY (t) = Σk≥0 ((it)k /k!) E[Y k ]

for all real t if Y has an mgf defined for all real t. Hence if b = ∞, the two
formulas i) and ii) hold. See Billingsley (1986, pp. 285, 353).
b) If E[Y 2 ] is finite, then

cY (t) = 1 + itE(Y ) − (1/2)t2 E[Y 2 ] + o(t2 ) as t → 0.

In particular, if E(Y ) = 0 and E(Y 2 ) = V (Y ) = σ 2 , then

cY (t) = 1 − t2 σ 2 /2 + o(t2 ) as t → 0. (4.2)

Here a(t) = o(t2 ) as t → 0 if limt→0 a(t)/t2 = 0. See Billingsley (1986, p. 354).
c) Properties of c(t): i) c(0) = 1, ii) the modulus |c(t)| ≤ 1 for all real t,
iii) c(t) is a continuous function.
d) If Y has mgf m(t), then E(Y k ) is finite for each positive integer k.
e) A complex random variable Z = X + iY where X and Y are ordi-
nary random variables. Then E(Z) = E(X) + iE(Y ), and E(Z) exists if
E(|Z|) = E(√(X 2 + Y 2 )) < ∞. Linearity of expectation and key inequali-
ties such as |E(Z)| ≤ E(|Z|) remain valid. Also, if Z1 and Z2 are indepen-
dent and gi (Zi ) is a function of the complex random variable Zi alone, then
E[g1 (Z1 )g2 (Z2 )] = E[g1 (Z1 )]E[g2 (Z2 )] if the expectations exist. Z = eitY is
the main complex random variable in this book.
Note that c(0) = E(ei0X ) = E(e0 ) = 1. Note that |c(t)| = |E[eitX ]| ≤
E(|eitX |) = E[√([cos(tX)]2 + [sin(tX)]2 )] = E(1) = 1 since
[cos(tX(ω))]2 + [sin(tX(ω))]2 = 1 for each ω ∈ Ω.

Remarks 4.5 and 4.6 are often used in proofs of the Central Limit Theorem.
Note that by Remark 4.6 a), limn→∞ (1 − (c ± ε)/n)n = e−(c±ε) where ε is a
real number. Letting positive ε → 0 proves Remark 4.6 b). Remark 4.6 c)
shows that this result holds even if ε is complex valued.
Remark 4.6. a) limn→∞ (1 − c/n)n = e−c .
b) If cn → c as n → ∞, then limn→∞ (1 − cn /n)n = e−c .
c) If cn is a sequence of complex numbers such that cn → c as n → ∞
where c is real, then limn→∞ (1 − cn /n)n = e−c .
In the following theorem, let the kth derivative of g(t) be g(k) (t) with
first derivative g(1) (t) = g'(t) and second derivative g(2) (t) = g''(t).
Theorem 4.11. Suppose that the mgf m(t) exists for |t| < b for some
constant b > 0, and suppose that the kth derivative m(k) (t) exists for |t| < b.
Then E[Y k ] = m(k) (0) for positive integers k. In particular, E[Y ] = m'(0)
and E[Y 2 ] = m''(0). For the cumulant generating function k(t) = kY (t),
E(Y ) = k'(0) and V (Y ) = k''(0). If E(Y k ) exists for a positive integer k,
then
E[Y k ] = c(k) (0)/ik .

Note that
k'(0) = [d log(mY (t))/dt]t=0 = m'Y (0)/mY (0) = E(Y )/1 = E(Y ).

Now
k''(t) = d[m'Y (t)/mY (t)]/dt = [m''Y (t)mY (t) − (m'Y (t))2 ]/[mY (t)]2 .

So
k''(0) = m''Y (0) − [m'Y (0)]2 = E(Y 2 ) − [E(Y )]2 = V (Y ).
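Theorem 4.11 can be checked numerically. The sketch below (an added illustration, not from the text) uses the cgf of an exponential random variable with mean 2, k(t) = −log(1 − 2t), and central finite differences to recover E(Y) and V(Y); the step size h is an arbitrary choice.

```python
import math

# Added illustration of Theorem 4.11: for Y exponential with mean 2,
# m(t) = 1/(1 - 2t) for t < 1/2, so k(t) = log(m(t)) = -log(1 - 2t),
# and k'(0) = E(Y) = 2 while k''(0) = V(Y) = 4.
def k(t):
    return -math.log(1 - 2 * t)

h = 1e-5
k1 = (k(h) - k(-h)) / (2 * h)            # central difference for k'(0)
k2 = (k(h) - 2 * k(0) + k(-h)) / h**2    # central difference for k''(0)
print(k1, k2)   # approximately 2 and 4
```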
Definition 4.10. Random variables X and Y are identically distributed,
D
written X ∼ Y , X = Y , or Y ∼ FX , if FX (y) = FY (y) for all real y.

Theorem 4.12. Let X and Y be random variables. Then X and Y are
identically distributed, X ∼ Y , if any of the following conditions hold.
a) FX (y) = FY (y) for all y,
b) fX (y) = fY (y) for all y,
c) cX (t) = cY (t) for all t, or
d) mX (t) = mY (t) for all t in a neighborhood of zero.

Proof of the WLLN. Want to show that if the Xi are iid with E(|Xi |) <
D
∞, then X n = Tn /n → E(X1 ) where Tn = Σni=1 Xi = nX n . Let Yi =
Xi − E(Xi ) have characteristic function ϕY (t). Then Y n = Tn /n − E(X1 )
has characteristic function
ψn (t) = [ϕY (t/n)]n .
Now

|[ϕY (t/n)]n − 1| = |[ϕY (t/n)]n − 1n | ≤ n |ϕY (t/n) − 1|.

If t ≠ 0, then

n |ϕY (t/n) − 1| = |(ϕY (t/n) − ϕY (0))/(t/n)| |t| → |t| |ϕ'Y (0)|

as n → ∞. By Theorem 4.11, ϕ'Y (0) = iE(Y ) = i[E(X1 ) − E(X1 )] = 0.
Hence for t ≠ 0,
[ϕY (t/n)]n − 1 → 0
as n → ∞. Since ϕY (0) = 1,

limn→∞ [ϕY (t/n)]n = limn→∞ ψn (t) = eit0 = ϕX (t) ∀t ∈ R

where P (X = 0) = 1. By the continuity theorem,
D
Tn /n − E(X1 ) → X.
Thus
D
Tn /n − E(X1 ) + E(X1 ) = Tn /n = X n → E(X1 )
P
by Slutsky’s theorem using an = E(X1 ) → a = E(X1 ). 

Definition 4.11. The characteristic function (cf) of a random vector
Y is
cY (t) = E[exp(itT Y )]
∀t ∈ Rn where the complex number i = √−1.

Definition 4.12. The moment generating function (mgf) of a random
vector Y is
mY (t) = E[exp(tT Y )]
provided that the expectation exists for all t in some neighborhood of the
origin 0.

Theorem 4.13. If Y1 , ..., Yn have a cf cY (t) and mgf mY (t) then the
marginal cf and mgf for Yi1 , ..., Yik are found from the joint cf and mgf by
replacing tij by 0 for j = k + 1, ..., n. In particular, if Y = (Y 1 , Y 2 )T and
t = (t1 , t2 )T , then

cY 1 (t1 ) = cY ((tT1 , 0T )T ) and mY 1 (t1 ) = mY ((tT1 , 0T )T ).

Proof. Use the definition of the cf and mgf. For example, if Y 1 =
(Y1 , ..., Yk )T and s = t1 , then m((tT1 , 0T )T ) =

E[exp(t1 Y1 + · · · + tk Yk + 0Yk+1 + · · · + 0Yn )] = E[exp(t1 Y1 + · · · + tk Yk )] =

E[exp(sT Y 1 )] = mY 1 (s), which is the mgf of Y 1 . 

Theorem 4.14. Partition the n × 1 vectors Y and t as Y = (Y 1 , Y 2 )T
and t = (t1 , t2 )T . Then the random vectors Y 1 and Y 2 are independent iff
their joint cf factors into the product of their marginal cfs:

cY (t) = cY 1 (t1 )cY 2 (t2 ) ∀t ∈ Rn .

If the joint mgf exists, then the random vectors Y 1 and Y 2 are independent
iff their joint mgf factors into the product of their marginal mgfs:

mY (t) = mY 1 (t1 )mY 2 (t2 )

∀t in some neighborhood of 0.
Note that if Y 1 and Y 2 are independent, then

cY (t) = E[exp(itT Y )] = E[exp(itT1 Y 1 + itT2 Y 2 )] = E[exp(itT1 Y 1 ) exp(itT2 Y 2 )]

= E[exp(itT1 Y 1 )]E[exp(itT2 Y 2 )] = cY 1 (t1 )cY 2 (t2 )

for any t = (tT1 , tT2 )T ∈ Rn , where the next to last equality holds by inde-
pendence.
Theorem 4.15. a) The characteristic function uniquely determines the
distribution.
b) If the moment generating function exists, then it uniquely determines
the distribution.
c) Assume that Y1 , ..., Yn are independent with characteristic functions
cYi (t). Then the characteristic function of W = Σni=1 Yi is

cW (t) = Πni=1 cYi (t). (4.3)
d) Assume that Y1 , ..., Yn are iid with characteristic function cY (t). Then
the characteristic function of W = Σni=1 Yi is

cW (t) = [cY (t)]n . (4.4)

e) Assume that Y1 , ..., Yn are independent with mgfs mYi (t). Then the mgf
of W = Σni=1 Yi is
mW (t) = Πni=1 mYi (t). (4.5)

f) Assume that Y1 , ..., Yn are iid with mgf mY (t). Then the mgf of W =
Σni=1 Yi is
mW (t) = [mY (t)]n . (4.6)

g) Assume that Y1 , ..., Yn are independent with characteristic functions
cYj (t). Then the characteristic function of W = Σnj=1 (aj + bj Yj ) is

cW (t) = exp(it Σnj=1 aj ) Πnj=1 cYj (bj t). (4.7)

h) Assume that Y1 , ..., Yn are independent with mgfs mYi (t). Then the mgf
of W = Σni=1 (ai + bi Yi ) is

mW (t) = exp(t Σni=1 ai ) Πni=1 mYi (bi t). (4.8)

Partial Proof:
c)

cW (t) = E[eitΣnj=1 Yj ] = E[eitY1 +···+itYn ] = E[Πnj=1 eitYj ] = Πnj=1 E[eitYj ] = Πnj=1 cYj (t),

where the next to last equality holds since the Yj are independent.
The proofs for d), e), and f) are similar, but for mgfs, omit the i’s and
change c to m.
g) Recall that exp(w) = ew and exp(Σnj=1 dj ) = Πnj=1 exp(dj ). Now

cW (t) = E(eitW ) = E(exp[it Σnj=1 (aj + bj Yj )])

= exp(it Σnj=1 aj ) E(exp[Σnj=1 itbj Yj ])

= exp(it Σnj=1 aj ) E(Πnj=1 exp[itbj Yj ])

= exp(it Σnj=1 aj ) Πnj=1 E[exp(itbj Yj )]

since by Remark 4.5 e), the expected value of a product of independent
random variables is the product of the expected values of the independent
random variables. Now in the definition of a cf, the t is a dummy variable
as long as t is real. Hence cY (t) = E[exp(itY )] and cY (s) = E[exp(isY )].
Taking s = tbj gives E[exp(itbj Yj )] = cYj (tbj ). Thus

cW (t) = exp(it Σnj=1 aj ) Πnj=1 cYj (tbj ). 

The distribution of W = Σni=1 Yi is known as the convolution of Y1 , ..., Yn .
Even for n = 2, convolution formulas tend to be hard; however, the following
two theorems suggest that to find the distribution of W = Σni=1 Yi , first find
the mgf or characteristic function of W . If the mgf or cf is that of a brand
name distribution, then W has that distribution. For example, if the mgf of
W is a normal (ν, τ 2 ) mgf, then W has a normal (ν, τ 2 ) distribution, written
W ∼ N (ν, τ 2 ). This technique is useful for several brand name distributions
given in Section 2.4.

Theorem 4.16. a) If Y1 , ..., Yn are independent binomial BIN(ki , ρ) ran-
dom variables, then
Σni=1 Yi ∼ BIN(Σni=1 ki , ρ).

Thus if Y1 , ..., Yn are iid BIN(k, ρ) random variables, then Σni=1 Yi ∼ BIN(nk, ρ).

b) Denote a chi–square χ2p random variable by χ2 (p). If Y1 , ..., Yn are in-
dependent chi–square χ2pi , then

Σni=1 Yi ∼ χ2 (Σni=1 pi ).

Thus if Y1 , ..., Yn are iid χ2p , then

Σni=1 Yi ∼ χ2np .

c) If Y1 , ..., Yn are iid exponential EXP(λ), then

Σni=1 Yi ∼ G(n, λ).

d) If Y1 , ..., Yn are independent Gamma G(νi , λ), then

Σni=1 Yi ∼ G(Σni=1 νi , λ).

Thus if Y1 , ..., Yn are iid G(ν, λ), then

Σni=1 Yi ∼ G(nν, λ).

e) If Y1 , ..., Yn are independent normal N (µi , σi2 ), then

Σni=1 (ai + bi Yi ) ∼ N (Σni=1 (ai + bi µi ), Σni=1 b2i σi2 ).

Here ai and bi are fixed constants. Thus if Y1 , ..., Yn are iid N (µ, σ 2 ), then
Y ∼ N (µ, σ 2 /n).
f) If Y1 , ..., Yn are independent Poisson POIS(θi ), then

Σni=1 Yi ∼ POIS(Σni=1 θi ).

Thus if Y1 , ..., Yn are iid POIS(θ), then

Σni=1 Yi ∼ POIS(nθ).

Theorem 4.17. a) If Y1 , ..., Yn are independent Cauchy C(µi , σi ), then

Σni=1 (ai + bi Yi ) ∼ C(Σni=1 (ai + bi µi ), Σni=1 |bi |σi ).

Thus if Y1 , ..., Yn are iid C(µ, σ), then Y ∼ C(µ, σ).
b) If Y1 , ..., Yn are iid geometric geom(p), then

Σni=1 Yi ∼ NB(n, p).

c) If Y1 , ..., Yn are iid inverse Gaussian IG(θ, λ), then

Σni=1 Yi ∼ IG(nθ, n2 λ).

Also
Y ∼ IG(θ, nλ).
d) If Y1 , ..., Yn are independent negative binomial NB(ri , ρ), then

Σni=1 Yi ∼ NB(Σni=1 ri , ρ).

Thus if Y1 , ..., Yn are iid NB(r, ρ), then

Σni=1 Yi ∼ NB(nr, ρ).

Example 4.6. Suppose Y1 , ..., Yn are iid IG(θ, λ) where the mgf

mYi (t) = m(t) = exp[(λ/θ)(1 − √(1 − 2θ2 t/λ))]

for t < λ/(2θ2 ). Then

mΣni=1 Yi (t) = Πni=1 mYi (t) = [m(t)]n = exp[(nλ/θ)(1 − √(1 − 2θ2 t/λ))]

= exp[(n2 λ/(nθ))(1 − √(1 − 2(nθ)2 t/(n2 λ)))]

which is the mgf of an IG(nθ, n2 λ) random variable. The last equality was
obtained by multiplying nλ/θ by 1 = n/n and by multiplying 2θ2 t/λ by
1 = n2 /n2 . Hence Σni=1 Yi ∼ IG(nθ, n2 λ).
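The mgf technique of Theorem 4.16 c) can also be checked by simulation. The sketch below (an added illustration; it assumes EXP(λ) is parameterized by its mean λ, so that G(n, λ) has mean nλ and variance nλ²) compares the sample mean and variance of sums of 5 iid exponentials to the Gamma values.

```python
import random
import statistics

# Added Monte Carlo sketch of Theorem 4.16c): if Y1,...,Y5 are iid
# EXP(lam) with mean lam (assumed parameterization), then
# W = Y1 + ... + Y5 ~ G(5, lam) with E(W) = 5*lam = 10 and
# V(W) = 5*lam^2 = 20 when lam = 2.
random.seed(2)
lam, k, reps = 2.0, 5, 50_000
w = [sum(random.expovariate(1 / lam) for _ in range(k)) for _ in range(reps)]
print(statistics.fmean(w), statistics.variance(w))  # near 10 and 20
```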

4.3 The CLT

The CLT is also known as the Lindeberg-Lévy CLT, and several proofs will
be given later in this chapter.

Theorem 4.18: the Central Limit Theorem (CLT). Let Y1 , ..., Yn be
iid with E(Y ) = µ and V (Y ) = σ 2 . Let the sample mean Y n = (1/n)Σni=1 Yi .
Then
D
√n(Y n − µ) → N (0, σ 2 ).
Remark 4.7. i) The sample mean is estimating the population mean µ
with a √n convergence rate, and the asymptotic distribution is normal.
ii)

Zn = √n(Y n − µ)/σ = (Y n − µ)/(σ/√n) = (Σni=1 Yi − nµ)/(√n σ)

D D
is the z–score of Y n and the z–score of Σni=1 Yi . Then Zn → N (0, 1). If Zn →
N (0, 1), then the notation Zn ≈ N (0, 1), also written as Zn ∼ AN (0, 1),
means approximate the cdf of Zn by the standard normal cdf. Similarly, the
notation
Y n ≈ N (µ, σ 2 /n),
also written as Y n ∼ AN (µ, σ 2 /n), means approximate the cdf of Y n as if
Y n ∼ N (µ, σ 2 /n). Note that the approximate distribution, unlike the limit-
ing distribution, often does depend on n.
D
iii) The notation Yn → X means that for large n we can approximate the cdf
of Yn by the cdf of X.
iv) The distribution of X is the limiting distribution or asymptotic distribu-
tion of Yn , and the limiting distribution does not depend on n.
The two main applications of the CLT are to give the limiting distribution
of √n(Y n − µ) and the limiting distribution of √n(Yn /n − µX ) for a random
variable Yn such that Yn = Σni=1 Xi where the Xi are iid with E(X) = µX
and V (X) = σX2 . Several of the random variables in Theorems 4.16 and 4.17
can be approximated in this way.
Given iid data from some distribution, a common homework problem is to
find the limiting distribution of √n(Y n − µ) using the CLT. You may need to
find E(Y ), E(Y 2 ), and V (Y ) = E(Y 2 ) − [E(Y )]2 . A variant of this problem
gives a formula for E(Y r ). Then find E(Y ) = E(Y 1 ) with r = 1 and E(Y 2 )
with r = 2.

Example 4.7. a) Let Y1 , ..., Yn be iid Ber(ρ). Then E(Y ) = ρ and V (Y ) =
ρ(1 − ρ). Hence
D
√n(Y n − ρ) → N (0, ρ(1 − ρ))
by the CLT.
D
b) Now suppose that Yn ∼ BIN(n, ρ). Then Yn = Σni=1 Xi where
X1 , ..., Xn are iid Ber(ρ). Hence
D
√n(Yn /n − ρ) → N (0, ρ(1 − ρ))
since
D D
√n(Yn /n − ρ) = √n(X n − ρ) → N (0, ρ(1 − ρ))
by a).
c) Now suppose that Yn ∼ BIN(kn , ρ) where kn → ∞ as n → ∞. Then

√kn (Yn /kn − ρ) ≈ N (0, ρ(1 − ρ))

or
Yn /kn ≈ N (ρ, ρ(1 − ρ)/kn ) or Yn ≈ N (kn ρ, kn ρ(1 − ρ)).

4.4 Slutsky’s Theorem, the Continuity Theorem and


Related Results

Theorem 4.19. Suppose Xn and X are RVs with the same probability space.
P D
a) If Xn → X, then Xn → X.
wp1 P D
b) If Xn → X, then Xn → X and Xn → X.
r P D
c) If Xn → X, then Xn → X and Xn → X.
P D
d) Xn → τ (θ) iff Xn → τ (θ).
D D D
e) If Xn → X and Xn → Y , then X = Y and FX (x) = FY (x) for all real x.
Partial Proof. a) See Theorem 4.3. c) See Theorem 4.8. d) See Theorem
4.3.
e) Suppose X has cdf F and Y has cdf G. Then F and G agree at their
common points of continuity. Hence F and G agree at all but countably many
points since F and G are cdfs. Hence F and G agree at all points by right
continuity. 
A A D
Note: If Xn → X and Xn → Y , then X = Y where A is wp1, r, or P .
A A
This result holds by Theorem 4.19 e) since if Xn → X and Xn → Y , then
D D
Xn → X and Xn → Y .
D P
Theorem 4.20: Slutsky’s Theorem. Suppose Yn → Y and Wn → w
for some constant w. Then
D
a) Yn + Wn → Y + w,
D
b) Yn Wn → wY, and
D
c) Yn /Wn → Y /w if w 6= 0.
A D
Remark 4.8. Note that Yn → Y implies Yn → Y where A = wp1, r, or
P D
P . Also Wn → w iff Wn → w. If a sequence of constants cn → c as n → ∞
wp1 P
(regular convergence is everywhere convergence), then cn → c and cn → c.
P B
So Wn → w can be replaced by Wn → w where B = D, wp1, r, P, or regular
convergence.
A B
i) So Slutsky’s theorem a), b) and c) hold if Yn → Y and Wn → w.
A B
ii) If Y ≡ y where y is a constant, then Yn → y and Wn → w implies that
D P
a), b) and c) hold with Y replaced by y, and → can be replaced by →.
D P P D
iii) If Yn → Y , an → a, and bn → b, then an + bn Yn → a + bY .
P P
Theorem 4.21. a) If Xn → θ and τ is continuous at θ, then τ (Xn ) → τ (θ).
D D
b) If Xn → θ and τ is continuous at θ, then τ (Xn ) → τ (θ).

Theorem 4.21 is a special case of the continuous mapping theorem. See


D r wp1
Theorem 4.25. Suppose that Tn → τ (θ), Tn → τ (θ) or Tn → τ (θ). Then
P
Tn → τ (θ) by Theorem 4.19. We are assuming that the function τ does not
depend on n since we want a single function τ (θ) rather than a sequence of
functions τn (θ).

Example 4.8. Let Y1 , ..., Yn be iid with mean E(Yi ) = µ and variance
P
V (Yi ) = σ 2 . Then the sample mean Y n → µ since i) the SLLN holds (use
Theorems 4.19 and 4.4), ii) the WLLN holds, and iii) the CLT holds (use
Theorem 4.34). Since

limn→∞ Vµ (Y n ) = limn→∞ σ 2 /n = 0 and limn→∞ Eµ (Y n ) = µ,

P
Y n → µ by Theorem 4.2.
D D
Example 4.9. (Ferguson 1996, p. 40): If Xn → X then 1/Xn → 1/X if
X is a continuous random variable since P (X = 0) = 0 and x = 0 is the only
discontinuity point of g(x) = 1/x.

Example 4.10. Show that if Yn ∼ tn , a t distribution with n degrees of
D
freedom, then Yn → Z where Z ∼ N (0, 1).
D
Solution: Yn = Z/√(Vn /n) where Z and Vn are independent with Vn ∼ χ2n .
P
If Wn = √(Vn /n) → 1, then the result follows by Slutsky’s Theorem. But
P
Vn = Σni=1 Xi where the iid Xi ∼ χ21 . Hence Vn /n → 1 by the WLLN and
P
√(Vn /n) → 1 by Theorem 4.21a.
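Example 4.10 can be illustrated by simulation (an added sketch with arbitrary degrees of freedom, replication count, and seed): draws of Yn = Z/√(Vn/n) have sample standard deviation near √(n/(n−2)), which tends to 1, the N(0, 1) value.

```python
import random
import statistics

# Added Monte Carlo sketch of Example 4.10: Yn = Z / sqrt(Vn/n) with
# Z ~ N(0,1) independent of Vn ~ chi^2_n has a t_n distribution; its
# standard deviation sqrt(n/(n-2)) approaches 1 as n grows.
random.seed(4)

def t_draw(n):
    z = random.gauss(0, 1)
    v = sum(random.gauss(0, 1) ** 2 for _ in range(n))  # chi^2_n draw
    return z / (v / n) ** 0.5

sds = {n: statistics.stdev(t_draw(n) for _ in range(3000)) for n in [5, 200]}
print(sds)   # roughly 1.29 for n = 5, roughly 1.00 for n = 200
```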

Theorem 4.22: Continuity Theorem. Let Yn be sequence of random


variables with characteristic functions cn (t). Let Y be a random variable with
cf c(t).
a)
D
Yn → Y iff cn (t) → c(t) ∀t ∈ R.
b) Also assume that Yn has mgf mn and Y has mgf m. Assume that
all of the mgfs mn and m are defined on |t| ≤ d for some d > 0. If
D
mn (t) → m(t) as n → ∞ for all |t| < a where 0 < a < d, then Yn → Y .
The following theorem is often part of the continuity theorem in the liter-
ature, and helps explain why Theorem 4.22 is called the continuity theorem.

Theorem 4.23: If limn→∞ cXn (t) = g(t) for all t where g is continuous
at t = 0, then g(t) = cX (t) is a characteristic function for some RV X, and
D
Xn → X.

Remark 4.9. a) Continuity at t = 0 implies continuity everywhere since


g(t) = cX (t) is continuous. If g(t) is not continuous at 0, then Xn does not
converge in distribution.
b) If cYn (t) → h(t) where h(t) is not continuous, then Yn does not converge
in distribution to any RV Y , by the Continuity Theorem and a).
c) Warning: cXn (0) ≡ 1, but cXn (0) → 1 as n → ∞ does not imply that
g is continuous at t = 0 if limn→∞ cXn (t) = g(t) for all real t.
D
Theorem 4.24, Helly-Bray-Portmanteau Theorem: Xn → X iff
E[g(Xn )] → E[g(X)] for every bounded, real, continuous function g.

The above theorem is used to prove Theorem 4.25 b).

Theorem 4.25. a) Generalized Continuous Mapping Theorem: If


D
Xn → X and the function g is such that P [X ∈ C(g)] = 1 where C(g) is the
D
set of points where g is continuous, then g(Xn ) → g(X).
D
b) Continuous Mapping Theorem: If Xn → X and the function g is
D
continuous, then g(Xn ) → g(X).

Proof of the Continuous Mapping Theorem: If g is real and contin-
uous, then cos[tg(x)] and sin[tg(x)] are bounded real continuous functions.
Hence by the Helly-Bray-Portmanteau theorem, for each real t, the character-
istic function

cg(Xn ) (t) = E[eitg(Xn ) ] = E(cos[tg(Xn )]) + iE(sin[tg(Xn )]) →

E(cos[tg(X)]) + iE(sin[tg(X)]) = E[eitg(X) ] = cg(X) (t).

D
Thus g(Xn ) → g(X) by the continuity theorem. 
Remark 4.10, Notes for Proving the CLT. a) Suppose the Yi are
iid with characteristic function cY (t). Then E(Yi − µ) = 0 and V (Yi − µ) =
E[(Yi − µ)2 ] = σ 2 . Thus by Remark 4.5,

cY −µ (t) = 1 − (σ 2 /2)t2 + o(t2 ) and

cY −µ (t/(σ√n)) = 1 − t2 /(2n) + o(t2 /n)
where o(t2 /n)/(t2 /n) → 0 as n → ∞. Hence n o(t2 /n) → 0 as n → ∞.
b) Let the Z-score of Y n be

Zn = √n(Y n − µ)/σ = (Y n − µ)/(σ/√n) = (Σni=1 Yi − nµ)/(σ√n) = Σni=1 (Yi − µ)/(σ√n)

where the Yi − µ are iid with characteristic function cY −µ (t). Then the char-
acteristic function of (Yi − µ)/(σ√n) is cY −µ (t/(σ√n)), and the characteristic
function of Zn is
cZn (t) = [cY −µ (t/(σ√n))]n .

If cZn (t) → cZ (t), the N (0, 1) characteristic function, then σZn = √n(Y n −µ)
has
cσZn (t) → cσZ (t) = cZ (σt) = exp(−σ 2 t2 /2),

the N (0, σ 2 ) characteristic function, and the CLT holds.
Proof of the CLT: Let Zn be the Z-score of Y n . By Remark 4.10,

cZn (t) = [1 − t2 /(2n) + o(t2 /n)]n = [1 − (t2 /2 − n o(t2 /n))/n]n → exp(−t2 /2) = cZ (t)

D D
for all t by Remarks 4.5 b) and 4.6 c). Thus Zn → Z ∼ N (0, 1) and σZn =
√n(Y n − µ) → N (0, σ 2 ). 
The next proof does not use characteristic functions, but only applies to iid
random variables Yi that have a moment generating function. Thus E(Yij )
exists for each positive integer j. The CLT only needs E(Y ) and E(Y 2 ) to
exist. In the proof, k(t) = log(m(t)) is the cumulant generating function with
k'(0) = E(X) and k''(0) = V (X) for a random variable X with mgf m(t).
L’Hôpital’s Rule: Suppose functions f(x) → 0 and g(x) → 0 as x ↓ d,
x ↑ d, x → d, x → ∞, or x → −∞. If

f'(x)/g'(x) → L then f(x)/g(x) → L

as x ↓ d, x ↑ d, x → d, x → ∞, or x → −∞.
Proof of a Special Case of the CLT. Following
Rohatgi (1984, pp. 569-9) and Tardiff (1981), let Y1 , ..., Yn be iid with mean
µ, variance σ 2 and mgf mY (t) for |t| < to . Then

Zi = (Yi − µ)/σ

has mean 0, variance 1 and mgf mZ (t) = exp(−tµ/σ)mY (t/σ) for |t| < σto .
Want to show that
D
Wn = √n(Y n − µ)/σ → N (0, 1).

Notice that

Wn = n−1/2 Σni=1 Zi = n−1/2 Σni=1 (Yi − µ)/σ = n−1/2 (Σni=1 Yi − nµ)/σ = (Y n − µ)/(n−1/2 σ).

Thus

mWn (t) = E(etWn ) = E[exp(tn−1/2 Σni=1 Zi )] = E[exp(Σni=1 tZi /√n)]

= Πni=1 E[etZi /√n ] = Πni=1 mZ (t/√n) = [mZ (t/√n)]n .

The cumulant generating function kZ (t) = log(mZ (t)). Then

kWn (t) = log[mWn (t)] = n log[mZ (t/√n)] = nkZ (t/√n) = kZ (t/√n)/(1/n).
Now kZ (0) = log[mZ (0)] = log(1) = 0. Thus by L’Hôpital’s rule (where the
derivative is with respect to n), limn→∞ log[mWn (t)] =

limn→∞ kZ (t/√n)/(1/n) = limn→∞ k'Z (t/√n)[−t/(2n3/2 )]/(−1/n2 ) = (t/2) limn→∞ k'Z (t/√n)/(1/√n).

Now k'Z (0) = E(Zi ) = 0, so L’Hôpital’s rule can be applied again, giving
limn→∞ log[mWn (t)] =

(t/2) limn→∞ k''Z (t/√n)[−t/(2n3/2 )]/[(−1/2)n−3/2 ] = (t2 /2) limn→∞ k''Z (t/√n) = (t2 /2)k''Z (0).

Now k''Z (0) = V (Zi ) = 1. Hence limn→∞ log[mWn (t)] = t2 /2 and

limn→∞ mWn (t) = exp(t2 /2),

which is the N(0,1) mgf. Thus by the continuity theorem,
D
Wn = √n(Y n − µ)/σ → N (0, 1). 

By Theorem 4.26, g Fg,dn ,1−δ → χ2g,1−δ as dn → ∞. Here P (X ≤ χ2g,1−δ ) =
1 − δ if X ∼ χ2g , and P (X ≤ Fg,dn ,1−δ ) = 1 − δ if X ∼ Fg,dn .

Theorem 4.26. If Wn ∼ Fr,dn where the positive integer dn → ∞ as
D
n → ∞, then rWn → χ2r .

Proof. If X1 ∼ χ2d1 and X2 ∼ χ2d2 are independent, then

(X1 /d1 )/(X2 /d2 ) ∼ Fd1 ,d2 .

If the Ui ∼ χ21 are iid, then Σki=1 Ui ∼ χ2k . Let d1 = r and k = d2 = dn .
Hence if X2 ∼ χ2dn , then
P
X2 /dn = Σdi=1
n
Ui /dn = U → E(Ui ) = 1

D
by the law of large numbers. Hence if Wn ∼ Fr,dn , then rWn → χ2r by
Slutsky’s theorem. 
Example 4.11. a) Let Xn ∼ bin(n, pn ) where npn = λ > 0 for all positive
integers n. Then the mgf mXn (t) = (1 − pn + pn et )n for all t. Thus

mXn (t) = [1 − λ/n + (λ/n)et ]n = [1 + λ(et − 1)/n]n → exp[λ(et − 1)] = mX (t)

D
for all t where X ∼ POIS(λ). Hence Xn → X ∼ POIS(λ) by the continuity
theorem.
b) Now let Xn ∼ bin(n, pn ) where npn → λ > 0 as n → ∞. Thus

mXn (t) = [1 + (−npn + npn et )/n]n → exp[λ(et − 1)] = mX (t)

for all t since
(1 + cn /n)n → ec

D
if cn → c. Here c = −λ + λet = λ(et − 1). See Remark 4.6. Hence Xn → X ∼
POIS(λ) by the continuity theorem.

Note: In the above example, a) is easier, and making assumptions that
make the large sample theory easier is a useful technique.
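The convergence in Example 4.11 can also be seen numerically (an added sketch; λ = 3 and the range of k are arbitrary choices): the maximum gap between the BIN(n, λ/n) pmf and the POIS(λ) pmf over small k shrinks as n grows.

```python
import math

# Added numerical sketch of Example 4.11: BIN(n, lambda/n) pmf values
# approach POIS(lambda) pmf values as n -> infinity (lambda = 3 here).
lam = 3.0

def binom_pmf(j, n, p):
    return math.comb(n, j) * p**j * (1 - p) ** (n - j)

def pois_pmf(j, lam):
    return math.exp(-lam) * lam**j / math.factorial(j)

gaps = {}
for n in [10, 100, 1000]:
    gaps[n] = max(abs(binom_pmf(j, n, lam / n) - pois_pmf(j, lam))
                  for j in range(11))
print(gaps)   # the gap shrinks as n grows
```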
4.5 Order Relations and Convergence Rates

Definition 4.13. Lehmann (1999, pp. 53-54): a) A sequence of random vari-
ables Wn is tight or bounded in probability, written Wn = OP (1), if for every
ε > 0 there exist positive constants Dε and Nε such that

P (|Wn | ≤ Dε ) ≥ 1 − ε

for all n ≥ Nε . Also Wn = OP (Xn ) if |Wn /Xn | = OP (1).

b) The sequence Wn = oP (n−δ ) if nδ Wn = oP (1) which means that
P
nδ Wn → 0.

c) Wn has the same order as Xn in probability, written Wn ≍P Xn , if for
every ε > 0 there exist positive constants Nε and 0 < dε < Dε such that

P (dε ≤ |Wn /Xn | ≤ Dε ) ≥ 1 − ε

for all n ≥ Nε .
d) Similar notation is used for a k × r matrix An = [ai,j (n)] if each
element ai,j (n) has the desired property. For example, An = OP (n−1/2 ) if
each ai,j (n) = OP (n−1/2 ).

Definition 4.14. Let $\hat{\beta}_n$ be an estimator of a $p \times 1$ vector $\beta$, and let $W_n = \|\hat{\beta}_n - \beta\|$.
a) If $W_n \asymp_P n^{-\delta}$ for some $\delta > 0$, then both $W_n$ and $\hat{\beta}_n$ have (tightness) rate $n^\delta$.
b) If there exists a constant $\kappa$ such that
$$n^\delta(W_n - \kappa) \xrightarrow{D} X$$
for some nondegenerate random variable $X$, then both $W_n$ and $\hat{\beta}_n$ have convergence rate $n^\delta$.

Theorem 4.27. Suppose there exists a constant $\kappa$ such that
$$n^\delta(W_n - \kappa) \xrightarrow{D} X.$$
a) Then $W_n = O_P(n^{-\delta})$.
b) If $X$ is not degenerate, then $W_n \asymp_P n^{-\delta}$.

The above result implies that if Wn has convergence rate nδ , then Wn has
tightness rate nδ , and the term “tightness” will often be omitted. Part a) is
proved, for example, in Lehmann (1999, p. 67).

The following result shows that if $W_n \asymp_P X_n$, then $X_n \asymp_P W_n$, $W_n = O_P(X_n)$, and $X_n = O_P(W_n)$. Notice that if $W_n = O_P(n^{-\delta})$, then $n^\delta$ is a lower bound on the rate of $W_n$. As an example, if the CLT holds then $\overline{Y}_n = O_P(n^{-1/3})$, but $\overline{Y}_n \asymp_P n^{-1/2}$.
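A tightness statement like $\sqrt{n}\,\overline{Y}_n = O_P(1)$ can be examined by simulation. In the sketch below (illustrative only, not from the text; the Uniform$(-1,1)$ population, the candidate bound $D_\epsilon = 3$, and the sample sizes are arbitrary assumptions), the CLT gives the scaled means an approximate $N(0, 1/3)$ limit, so they should stay inside the bound with high frequency at every $n$.

```python
import random

random.seed(0)

def scaled_means(n, reps=500):
    # sqrt(n) * (sample mean) for iid Uniform(-1, 1) data; Var = 1/3,
    # so the CLT limit is N(0, 1/3) and these values are O_P(1)
    out = []
    for _ in range(reps):
        s = sum(random.uniform(-1.0, 1.0) for _ in range(n))
        out.append(n ** 0.5 * s / n)
    return out

D_eps = 3.0  # candidate bound D_eps for eps = 0.05
fracs = [sum(abs(z) <= D_eps for z in scaled_means(n)) / 500.0
         for n in (50, 200, 800)]
```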

Theorem 4.28. a) If $W_n \asymp_P X_n$ then $X_n \asymp_P W_n$.
b) If $W_n \asymp_P X_n$ then $W_n = O_P(X_n)$.
c) If $W_n \asymp_P X_n$ then $X_n = O_P(W_n)$.
d) $W_n \asymp_P X_n$ iff $W_n = O_P(X_n)$ and $X_n = O_P(W_n)$.

Proof. a) Since $W_n \asymp_P X_n$,
$$P\left(d_\epsilon \le \left|\frac{W_n}{X_n}\right| \le D_\epsilon\right) = P\left(\frac{1}{D_\epsilon} \le \left|\frac{X_n}{W_n}\right| \le \frac{1}{d_\epsilon}\right) \ge 1 - \epsilon$$
for all $n \ge N_\epsilon$. Hence $X_n \asymp_P W_n$.
b) Since $W_n \asymp_P X_n$,
$$P(|W_n| \le |X_n| D_\epsilon) \ge P\left(d_\epsilon \le \left|\frac{W_n}{X_n}\right| \le D_\epsilon\right) \ge 1 - \epsilon$$
for all $n \ge N_\epsilon$. Hence $W_n = O_P(X_n)$.
c) Follows by a) and b).
d) If $W_n \asymp_P X_n$, then $W_n = O_P(X_n)$ and $X_n = O_P(W_n)$ by b) and c). Now suppose $W_n = O_P(X_n)$ and $X_n = O_P(W_n)$. Then
$$P(|W_n| \le |X_n| D_{\epsilon/2}) \ge 1 - \epsilon/2$$
for all $n \ge N_1$, and
$$P(|X_n| \le |W_n|\,(1/d_{\epsilon/2})) \ge 1 - \epsilon/2$$
for all $n \ge N_2$. Hence
$$P(A) \equiv P\left(\left|\frac{W_n}{X_n}\right| \le D_{\epsilon/2}\right) \ge 1 - \epsilon/2$$
and
$$P(B) \equiv P\left(d_{\epsilon/2} \le \left|\frac{W_n}{X_n}\right|\right) \ge 1 - \epsilon/2$$
for all $n \ge N = \max(N_1, N_2)$. Since $P(A \cap B) = P(A) + P(B) - P(A \cup B) \ge P(A) + P(B) - 1$,
$$P(A \cap B) = P\left(d_{\epsilon/2} \le \left|\frac{W_n}{X_n}\right| \le D_{\epsilon/2}\right) \ge 1 - \epsilon/2 + 1 - \epsilon/2 - 1 = 1 - \epsilon$$
for all $n \ge N$. Hence $W_n \asymp_P X_n$. $\square$



The following result is used to prove Theorem 4.30, which says that if there are $K$ estimators $T_{j,n}$ of a parameter $\beta$ such that $\|T_{j,n} - \beta\| = O_P(n^{-\delta})$ where $0 < \delta \le 1$, and if $T_n^*$ picks one of these estimators, then $\|T_n^* - \beta\| = O_P(n^{-\delta})$.
Theorem 4.29: Pratt (1959). Let $X_{1,n},...,X_{K,n}$ each be $O_P(1)$ where $K$ is fixed. Suppose $W_n = X_{i_n,n}$ for some $i_n \in \{1,...,K\}$. Then
$$W_n = O_P(1). \qquad (4.9)$$
Proof.
$$P(\max\{X_{1,n},...,X_{K,n}\} \le x) = P(X_{1,n} \le x,...,X_{K,n} \le x) \le F_{W_n}(x)$$
$$\le P(\min\{X_{1,n},...,X_{K,n}\} \le x) = 1 - P(X_{1,n} > x,...,X_{K,n} > x).$$
Since $K$ is finite, there exist $B > 0$ and $N$ such that $P(X_{i,n} \le B) > 1 - \epsilon/2K$ and $P(X_{i,n} > -B) > 1 - \epsilon/2K$ for all $n > N$ and $i = 1,...,K$. Bonferroni's inequality states that $P(\cap_{i=1}^K A_i) \ge \sum_{i=1}^K P(A_i) - (K-1)$. Thus
$$F_{W_n}(B) \ge P(X_{1,n} \le B,...,X_{K,n} \le B) \ge K(1 - \epsilon/2K) - (K-1) = K - \epsilon/2 - K + 1 = 1 - \epsilon/2$$
and
$$-F_{W_n}(-B) \ge -1 + P(X_{1,n} > -B,...,X_{K,n} > -B) \ge -1 + K(1 - \epsilon/2K) - (K-1) = -1 + K - \epsilon/2 - K + 1 = -\epsilon/2.$$
Hence
$$F_{W_n}(B) - F_{W_n}(-B) \ge 1 - \epsilon \ \text{ for } n > N. \quad \square$$
Theorem 4.30. Suppose $\|T_{j,n} - \beta\| = O_P(n^{-\delta})$ for $j = 1,...,K$ where $0 < \delta \le 1$. Let $T_n^* = T_{i_n,n}$ for some $i_n \in \{1,...,K\}$ where, for example, $T_{i_n,n}$ is the $T_{j,n}$ that minimized some criterion function. Then
$$\|T_n^* - \beta\| = O_P(n^{-\delta}). \qquad (4.10)$$
Proof. Let $X_{j,n} = n^\delta \|T_{j,n} - \beta\|$. Then $X_{j,n} = O_P(1)$ so by Theorem 4.29, $n^\delta \|T_n^* - \beta\| = O_P(1)$. Hence $\|T_n^* - \beta\| = O_P(n^{-\delta})$. $\square$

4.6 More CLTs

Remark 4.11. For each positive integer $n$, let $W_{n1},...,W_{nr_n}$ be independent. The probability space may change with $n$, giving a double array of random variables. Let $E[W_{nk}] = 0$, $V(W_{nk}) = E[W_{nk}^2] = \sigma_{nk}^2$, and $s_n^2 = \sum_{k=1}^{r_n} \sigma_{nk}^2 = V\left[\sum_{k=1}^{r_n} W_{nk}\right]$. Then
$$Z_n = \frac{\sum_{k=1}^{r_n} W_{nk}}{s_n}$$
is the z-score of $\sum_{k=1}^{r_n} W_{nk}$.

For the above remark, let $r_n = n$. Then the double array is the triangular array shown below. Double arrays are sometimes called triangular arrays.
$$\begin{array}{l}
W_{11} \\
W_{21}, W_{22} \\
W_{31}, W_{32}, W_{33} \\
\vdots \\
W_{n1}, W_{n2}, W_{n3},\ldots, W_{nn} \\
\vdots
\end{array}$$
Theorem 4.31, Lyapounov's CLT: Under Remark 4.11, assume the $|W_{nk}|^{2+\delta}$ are integrable for some $\delta > 0$. Assume Lyapounov's condition:
$$\lim_{n\to\infty} \sum_{k=1}^{r_n} \frac{E[|W_{nk}|^{2+\delta}]}{s_n^{2+\delta}} = 0. \qquad (4.11)$$
Then
$$Z_n = \frac{\sum_{k=1}^{r_n} W_{nk}}{s_n} \xrightarrow{D} N(0,1).$$
Theorem 4.31 can be proved using Theorem 4.32. Note that $Z_n$ is the z-score of $\sum_{k=1}^{r_n} W_{nk}$.
Example 4.12. Special cases: i) $r_n = n$ and $W_{nk} = W_k$ has $W_1,...,W_n,...$ independent with $s_n^2 = \sum_{k=1}^n \sigma_k^2$.
ii) $W_{nk} = X_{nk} - E(X_{nk}) = X_{nk} - \mu_{nk}$ has
$$\frac{\sum_{k=1}^{r_n}(X_{nk} - \mu_{nk})}{s_n} \xrightarrow{D} N(0,1).$$
iii) Suppose $X_1, X_2,...$ are independent with $E(X_i) = \mu_i$ and $V(X_i) = \sigma_i^2$. Let
$$Z_n = \frac{\sum_{i=1}^n X_i - \sum_{i=1}^n \mu_i}{\left(\sum_{i=1}^n \sigma_i^2\right)^{1/2}}$$
be the z-score of $\sum_{i=1}^n X_i$. Assume $E[|X_i - \mu_i|^3] < \infty$ for all $i \in \mathbb{N}$ and
$$\lim_{n\to\infty} \frac{\sum_{i=1}^n E[|X_i - \mu_i|^3]}{\left(\sum_{i=1}^n \sigma_i^2\right)^{3/2}} = 0. \qquad (4.12)$$
Then $Z_n \xrightarrow{D} N(0,1)$.
Proof of iii): Take $W_{nk} = X_k - \mu_k$, $\delta = 1$, $s_n^2 = \sum_{k=1}^n \sigma_k^2$, and apply Lyapounov's CLT. Note that
$$\left(\sum_{k=1}^n \sigma_k^2\right)^{3/2} = (s_n^2)^{3/2} = s_n^3 = s_n^{2+1}. \quad \square$$


The (Lindeberg-Lévy) CLT has the Xi iid with V (Xi ) = σ 2 < ∞. The
Lyapounov CLT in Example 4.12. iii) has the Xi independent (not necessar-
ily identically distributed), but needs stronger moment conditions to satisfy
Equation (4.11) or (4.12).

Theorem 4.32, Lindeberg CLT: Let the $W_{nk}$ satisfy Remark 4.11 and Lindeberg's condition
$$\lim_{n\to\infty} \sum_{k=1}^{r_n} \frac{E(W_{nk}^2\, I[|W_{nk}| \ge \epsilon s_n])}{s_n^2} = 0 \qquad (4.13)$$
for any $\epsilon > 0$. Then
$$Z_n = \frac{\sum_{k=1}^{r_n} W_{nk}}{s_n} \xrightarrow{D} N(0,1).$$
Note: The Lindeberg CLT is sometimes called the Lindeberg-Feller CLT. Lindeberg's condition is nearly necessary for $Z_n = \sum_{k=1}^{r_n} W_{nk}/s_n \xrightarrow{D} N(0,1)$. Lindeberg's condition can also be written as
$$\lim_{n\to\infty} \frac{1}{s_n^2} \sum_{k=1}^{r_n} \int_{\{|W_{nk}| \ge \epsilon s_n\}} W_{nk}^2\, dP = 0 \qquad (4.14)$$
for any $\epsilon > 0$. Note that $Z_n$ is the z-score of $\sum_{k=1}^{r_n} W_{nk}$.
Example 4.13. a) Special case of the Lindeberg CLT: Let $r_n = n$ and let the $W_{nk} = W_k$ be independent. If
$$\lim_{n\to\infty} \sum_{k=1}^{n} \frac{E(W_k^2\, I[|W_k| \ge \epsilon s_n])}{s_n^2} = 0 \qquad (4.15)$$
for any $\epsilon > 0$, prove that
$$Z_n = \frac{\sum_{k=1}^n W_k}{s_n} \xrightarrow{D} N(0,1).$$
b) uniformly bounded sequence: Let $r_n = n$ and $W_{nk} = W_k$. If there is a constant $c > 0$ such that $P(|W_k| < c) = 1$ $\forall k$, and if $s_n \to \infty$ as $n \to \infty$, then Lindeberg's CLT holds.
c) Let $r_n = n$ and let the $W_{nk} = W_k$ be iid with $V(W_k) = \sigma^2 \in (0, \infty)$. Then Lindeberg's CLT holds. (Taking $W_i = X_i - \mu$ proves the usual CLT with the Lindeberg CLT.)
d) If Lyapounov's condition holds, then Lindeberg's condition holds. Hence the Lindeberg CLT proves the Lyapounov CLT.
Proof: a) Plug the special case values into Theorem 4.32.
b) Once $n$ is large enough so that $\epsilon s_n > c$ (which occurs since $s_n \to \infty$), $I[|W_k| \ge \epsilon s_n] = 0$. Hence Equation (4.15) holds and $Z_n \xrightarrow{D} N(0,1)$.
c) Now $s_n^2 = n\sigma^2$ and the $W_k^2\, I[|W_k| \ge \epsilon s_n]$ are iid for given $n$. Thus
$$\frac{1}{s_n^2} \sum_{k=1}^n E(W_k^2\, I[|W_k| \ge \epsilon s_n]) = \frac{1}{\sigma^2} E(W_1^2\, I[|W_1| \ge \epsilon\sigma\sqrt{n}]) = \frac{1}{\sigma^2} \int_{|W_1| \ge \epsilon\sigma\sqrt{n}} W_1^2\, dP \to 0$$
as $n \to \infty$ since $P(|W_1| \ge \epsilon\sigma\sqrt{n}) \downarrow 0$ as $n \to \infty$. Or $Y_n = W_1^2\, I[|W_1| \ge \epsilon\sigma\sqrt{n}]$ satisfies $Y_n \le W_1^2$ and $Y_n \downarrow Y = 0$ as $n \to \infty$. Thus $E(Y_n) \to E(Y) = 0$ by Lebesgue's Dominated Convergence Theorem. Thus Equation (4.15) holds and $Z_n \xrightarrow{D} N(0,1)$. If the $W_i = X_i - \mu$, then
$$Z_n = \frac{\sum_{i=1}^n (X_i - \mu)}{\sigma\sqrt{n}} = \frac{\sqrt{n}(\overline{X}_n - \mu)}{\sigma} \xrightarrow{D} N(0,1).$$
Thus $\sqrt{n}(\overline{X}_n - \mu) \xrightarrow{D} N(0, \sigma^2)$.
d) Note that
$$\frac{1}{s_n^2} \sum_{k=1}^{r_n} \int_{\{|W_{nk}| \ge \epsilon s_n\}} W_{nk}^2\, dP \le \frac{1}{s_n^2} \sum_{k=1}^{r_n} \int_{\{|W_{nk}| \ge \epsilon s_n\}} \frac{|W_{nk}|^{2+\delta}}{\epsilon^\delta s_n^\delta}\, dP = RHS$$
since $|W_{nk}|^\delta \ge \epsilon^\delta s_n^\delta$ on the set of integration, so
$$\frac{|W_{nk}|^\delta}{\epsilon^\delta s_n^\delta} \ge 1$$
on the set of integration. Thus
$$RHS \le \frac{1}{\epsilon^\delta} \sum_{k=1}^{r_n} \frac{1}{s_n^{2+\delta}} E[|W_{nk}|^{2+\delta}] \to 0$$
for any $\epsilon > 0$ if Lyapounov's condition holds. Thus Lindeberg's condition holds. Note that the above inequality holds since $|W_{nk}|^{2+\delta} \ge 0$. Hence
$$\int_A |W_{nk}|^{2+\delta}\, dP \le \int_\Omega |W_{nk}|^{2+\delta}\, dP = E[|W_{nk}|^{2+\delta}]$$
using $\Omega = A \cup A^c$ and $\int_\Omega |f|\, dP = \int_A |f|\, dP + \int_{A^c} |f|\, dP$. $\square$

Example 4.14. DeGroot (1975, pp. 229-230): Suppose the $X_i$ are independent $Ber(p_i) \sim bin(m = 1, p_i)$ random variables with $E(X_i) = p_i$, $V(X_i) = p_i q_i$, $q_i = 1 - p_i$, and $\sum_{i=1}^\infty p_i q_i = \infty$. Prove that
$$Z_n = \frac{\sum_{i=1}^n X_i - \sum_{i=1}^n p_i}{\left(\sum_{i=1}^n p_i q_i\right)^{1/2}} \xrightarrow{D} N(0,1)$$
as $n \to \infty$.
Proof. Let $Y_i = |W_i| = |X_i - p_i|$. Then $P(Y_i = 1 - p_i) = p_i$ and $P(Y_i = p_i) = q_i$. Thus
$$E[|X_i - p_i|^3] = E[|W_i|^3] = \sum_y y^3 f(y) = (1 - p_i)^3 p_i + p_i^3 q_i = q_i^3 p_i + p_i^3 q_i = p_i q_i (p_i^2 + q_i^2) \le p_i q_i$$
since $p_i^2 + q_i^2 \le (p_i + q_i)^2 = 1$. Thus $\sum_{i=1}^n E[|X_i - p_i|^3] \le \sum_{i=1}^n p_i q_i$. Dividing both sides by $\left(\sum_{i=1}^n p_i q_i\right)^{3/2}$ gives
$$\frac{\sum_{i=1}^n E[|X_i - p_i|^3]}{\left(\sum_{i=1}^n p_i q_i\right)^{3/2}} \le \frac{1}{\left(\sum_{i=1}^n p_i q_i\right)^{1/2}} \to 0$$
as $n \to \infty$. Thus Equation (4.12) holds and $Z_n \xrightarrow{D} N(0,1)$. $\square$
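Example 4.14 can be checked by simulation. In the sketch below (not from the text; the particular $p_i$ sequence, $n = 500$, the replication count, and the seed are arbitrary assumptions), the standardized sum $Z_n$ of independent non-identically distributed Bernoulli variables should have sample mean near 0 and sample variance near 1.

```python
import random

random.seed(3)

# independent Bernoulli(p_i) with p_i bounded away from 0 and 1, so
# sum p_i q_i -> infinity and the Lyapounov condition of Example 4.14 holds
n, reps = 500, 2000
p = [0.2 + 0.6 * (i % 10) / 9.0 for i in range(n)]  # p_i in [0.2, 0.8]
mu = sum(p)
s = sum(pi * (1.0 - pi) for pi in p) ** 0.5          # (sum p_i q_i)^{1/2}

zs = []
for _ in range(reps):
    total = sum(1 for pi in p if random.random() < pi)
    zs.append((total - mu) / s)

z_mean = sum(zs) / reps
z_var = sum(z * z for z in zs) / reps  # exact variance of Z_n is 1
```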

Theorem 4.33, Hájek-Šídák CLT: Let $X_1,...,X_n$ be iid with $E(X_i) = \mu$ and $V(X_i) = \sigma^2$. Let $c_n = (c_{n1},...,c_{nn})^T$ be a vector of constants such that
$$\max_{1\le i\le n} \frac{c_{ni}^2}{\sum_{j=1}^n c_{nj}^2} \to 0 \text{ as } n \to \infty.$$
Then
$$Z_n = \frac{\sum_{i=1}^n c_{ni}(X_i - \mu)}{\sigma\sqrt{\sum_{j=1}^n c_{nj}^2}} \xrightarrow{D} N(0,1).$$
Note: $c_{ni} = 1/n$ gives the usual CLT.
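The Hájek-Šídák weight condition is easy to compute directly. The sketch below (illustrative only, not from the text) contrasts equal weights $c_{ni} = 1/n$, for which the max-ratio is $1/n \to 0$, with weights concentrated on a single coordinate, for which the ratio stays at 1 and the condition fails.

```python
def max_weight_ratio(c):
    # max_i c_i^2 / sum_j c_j^2, the quantity in the Hajek-Sidak condition
    total = sum(x * x for x in c)
    return max(x * x for x in c) / total

# equal weights: the ratio is 1/n -> 0, so the condition holds
equal = [max_weight_ratio([1.0 / n] * n) for n in (10, 100, 1000)]

# all weight on one term: the ratio stays 1, so the condition fails
spike = [max_weight_ratio([1.0] + [0.0] * (n - 1)) for n in (10, 100, 1000)]
```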



4.7 More Results for Random Variables



The following result shows that estimators that converge in distribution to a constant at an $n^\delta$ rate also converge in probability. Note that b) follows from a) with $X_\theta \sim N(0, v(\theta))$.

Theorem 4.34. a) Let $X_\theta$ be a random variable with a distribution depending on $\theta$, and $0 < \delta \le 1$. If
$$n^\delta(T_n - \tau(\theta)) \xrightarrow{D} X_\theta,$$
then $T_n \xrightarrow{P} \tau(\theta)$.
b) If
$$\sqrt{n}(T_n - \tau(\theta)) \xrightarrow{D} N(0, v(\theta)),$$
then $T_n \xrightarrow{P} \tau(\theta)$.

Theorem 4.35: a) $T_n \xrightarrow{P} \tau(\theta)$ iff $T_n \xrightarrow{D} \tau(\theta)$.
b) If $T_n \xrightarrow{P} \theta$ and $\tau$ is continuous at $\theta$, then $\tau(T_n) \xrightarrow{P} \tau(\theta)$.

Theorem 4.36: Suppose $X_n$ and $X$ are RVs with the same probability space for b) and c). Let $g : \mathbb{R} \to \mathbb{R}$ be a continuous function.
a) If $X_n \xrightarrow{D} X$, then $g(X_n) \xrightarrow{D} g(X)$.
b) If $X_n \xrightarrow{P} X$, then $g(X_n) \xrightarrow{P} g(X)$.
c) If $X_n \xrightarrow{ae} X$, then $g(X_n) \xrightarrow{wp1} g(X)$.

Theorem 4.37: Suppose $X_n$ and $X$ are RVs with the same probability space.
a) If $X_n \xrightarrow{wp1} X$, then $X_n \xrightarrow{P} X$ and $X_n \xrightarrow{D} X$.
b) If $X_n \xrightarrow{P} X$, then $X_n \xrightarrow{D} X$.
c) If $X_n \xrightarrow{r} X$, then $X_n \xrightarrow{P} X$ and $X_n \xrightarrow{D} X$.
d) $X_n \xrightarrow{P} \tau(\theta)$ iff $X_n \xrightarrow{D} \tau(\theta)$, where $\tau(\theta)$ is a constant.

Theorem 4.38: a) If $E[(X_n - X)^2] \to 0$ as $n \to \infty$, then $X_n \xrightarrow{P} X$.
b) If $E(X_n) \to E(X)$ and $V(X_n - X) \to 0$ as $n \to \infty$, then $X_n \xrightarrow{P} X$.
Note: Part a) follows from Theorem 4. c) with $r = 2$. See Theorem 4. if $P(X = \tau(\theta)) = 1$.

Theorem 4.39: Let $g : \mathbb{R} \to \mathbb{R}$ be continuous at constant $c$.
a) If $X_n \xrightarrow{D} c$, then $g(X_n) \xrightarrow{D} g(c)$.
b) If $X_n \xrightarrow{P} c$, then $g(X_n) \xrightarrow{P} g(c)$.
c) If $X_n \xrightarrow{wp1} c$, then $g(X_n) \xrightarrow{wp1} g(c)$.
Remark 4.12. For Theorem 4., a) follows from Slutsky's Theorem by taking $Y_n \equiv X = Y$ and $W_n = X_n - X$. Then $Y_n \xrightarrow{D} Y = X$ and $W_n \xrightarrow{P} 0$. Hence $X_n = Y_n + W_n \xrightarrow{D} Y + 0 = X$. The convergence in distribution parts of b) and c) follow from a). Part f) follows from d) and e). Part e) implies that if $T_n$ is a consistent estimator of $\theta$ and $\tau$ is a continuous function, then $\tau(T_n)$ is a consistent estimator of $\tau(\theta)$. Theorem 4. says that convergence in distribution is preserved by continuous functions, and even some discontinuities are allowed as long as the set of continuity points is assigned probability 1 by the asymptotic distribution. Equivalently, the set of discontinuity points is assigned probability 0.

4.8 Multivariate Limit Theorems

Many of the univariate results from previous sections can be extended to random vectors. For the limit theorems, the vector $X$ is typically a $k \times 1$ column vector and $X^T$ is a row vector. Let $\|x\| = \sqrt{x_1^2 + \cdots + x_k^2}$ be the Euclidean norm of $x$.

Definition 4.15. Let $X_n$ be a sequence of random vectors with joint cdfs $F_n(x)$ and let $X$ be a random vector with joint cdf $F(x)$.
a) $X_n$ converges in distribution to $X$, written $X_n \xrightarrow{D} X$, if $F_n(x) \to F(x)$ as $n \to \infty$ for all points $x$ at which $F(x)$ is continuous. The distribution of $X$ is the limiting distribution or asymptotic distribution of $X_n$.
b) $X_n$ converges in probability to $X$, written $X_n \xrightarrow{P} X$, if for every $\epsilon > 0$, $P(\|X_n - X\| > \epsilon) \to 0$ as $n \to \infty$.
c) Let $r > 0$ be a real number. Then $X_n$ converges in $r$th mean to $X$, written $X_n \xrightarrow{r} X$, if $E(\|X_n - X\|^r) \to 0$ as $n \to \infty$.
d) $X_n$ converges almost everywhere to $X$, written $X_n \xrightarrow{ae} X$, if $P(\lim_{n\to\infty} X_n = X) = 1$.

The following theorem is an extension of Theorem 4.1.


Theorem 4.40: Generalized Chebyshev's Inequality or Generalized Markov's Inequality: Let $u : \mathbb{R}^k \to [0, \infty)$ be a nonnegative function. If $E[u(X)]$ exists, then for any $\epsilon > 0$,
$$P[u(X) \ge \epsilon] \le \frac{E[u(X)]}{\epsilon}.$$

Proof Sketch. The proof is nearly identical to that of Theorem 4.1.

Example 4.15. Let $u(x) = \|x - c\|^r$ for some $r > 0$. Often $c = 0$ or $c = E(X) = \mu$. If $E[u(X)]$ exists, then for any $\epsilon > 0$,
$$P(\|X - c\| \ge \epsilon) = P(\|X - c\|^r \ge \epsilon^r) \le \frac{E[\|X - c\|^r]}{\epsilon^r}.$$

Some results on the expected value and covariance matrix of a random


vector will be useful.
Definition 4.16. If the second moments exist, the population mean of a random $p \times 1$ vector $x = (X_1,...,X_p)^T$ is
$$E(x) = \mu = (E(X_1),...,E(X_p))^T,$$
and the $p \times p$ population covariance matrix
$$Cov(x) = E[(x - E(x))(x - E(x))^T] = E[(x - E(x))x^T] = E(xx^T) - E(x)[E(x)]^T = ((\sigma_{i,j})) = \Sigma_x.$$
That is, the $ij$ entry of $Cov(x)$ is $Cov(X_i, X_j) = \sigma_{i,j} = E([X_i - E(X_i)][X_j - E(X_j)])$.

If $x$ and $y$ are $p \times 1$ random vectors with covariance matrices, $a$ a conformable constant vector, and $A$ and $B$ are conformable constant matrices, then
$$E(a + x) = a + E(x) \ \text{ and } \ E(x + y) = E(x) + E(y) \qquad (4.16)$$
and
$$E(Ax) = AE(x) \ \text{ and } \ E(AxB) = AE(x)B. \qquad (4.17)$$
Thus
$$Cov(a + Ax) = Cov(Ax) = A\,Cov(x)\,A^T. \qquad (4.18)$$
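Equation (4.18) is an exact algebraic identity, so it also holds for sample covariance matrices. The sketch below (illustrative only; the data points, the matrix $A$, and the shift $a$ are arbitrary choices, not from the text) checks $Cov(a + Ax) = A\,Cov(x)\,A^T$ numerically in the $2 \times 2$ case.

```python
# sample covariance (dividing by n) of 2-dimensional data points
data = [(1.0, 2.0), (2.0, 3.0), (4.0, 1.0), (0.0, 5.0), (3.0, 3.0)]
A = [[2.0, 1.0], [0.0, 3.0]]  # arbitrary 2x2 constant matrix
a = (5.0, -1.0)               # arbitrary constant shift

def cov2(rows):
    n = len(rows)
    m0 = sum(r[0] for r in rows) / n
    m1 = sum(r[1] for r in rows) / n
    c00 = sum((r[0] - m0) ** 2 for r in rows) / n
    c11 = sum((r[1] - m1) ** 2 for r in rows) / n
    c01 = sum((r[0] - m0) * (r[1] - m1) for r in rows) / n
    return [[c00, c01], [c01, c11]]

def affine(r):
    # a + A r
    return (a[0] + A[0][0] * r[0] + A[0][1] * r[1],
            a[1] + A[1][0] * r[0] + A[1][1] * r[1])

S = cov2(data)
left = cov2([affine(r) for r in data])  # Cov(a + A x)
right = [[sum(A[i][k] * S[k][l] * A[j][l] for k in range(2) for l in range(2))
          for j in range(2)] for i in range(2)]  # A S A^T
diff = max(abs(left[i][j] - right[i][j]) for i in range(2) for j in range(2))
```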
Theorem 4.41 is the multivariate extension of the CLT. When the limiting distribution of $Z_n = \sqrt{n}(g(T_n) - g(\theta))$ is multivariate normal $N_k(0, \Sigma)$, approximate the joint cdf of $Z_n$ with the joint cdf of the $N_k(0, \Sigma)$ distribution. Thus to find probabilities, manipulate $Z_n$ as if $Z_n \approx N_k(0, \Sigma)$. To see that the CLT is a special case of the MCLT below, let $k = 1$, $E(X) = \mu$, and $V(X) = \Sigma = \sigma^2$.

Theorem 4.41: the Multivariate Central Limit Theorem (MCLT). If $X_1,...,X_n$ are iid $k \times 1$ random vectors with $E(X) = \mu$ and $Cov(X) = \Sigma$, then
$$\sqrt{n}(\overline{X}_n - \mu) \xrightarrow{D} N_k(0, \Sigma)$$
where the sample mean
$$\overline{X}_n = \frac{1}{n}\sum_{i=1}^n X_i.$$
The MCLT is proven after Theorem 4..


Remark 4.13. The behavior of convergence in distribution to a MVN distribution in B) is much like the behavior of the MVN distributions in A). The results in B) can be proven using the multivariate delta method. Let $A$ be a $q \times k$ constant matrix, $b$ a constant, $a$ a $k \times 1$ constant vector, and $d$ a $q \times 1$ constant vector. Note that $a + bX_n = a + AX_n$ with $A = bI$. Thus i) and ii) follow from iii).
A) Suppose $X \sim N_k(\mu, \Sigma)$. Then
i) $AX \sim N_q(A\mu, A\Sigma A^T)$.
ii) $a + bX \sim N_k(a + b\mu, b^2\Sigma)$.
iii) $AX + d \sim N_q(A\mu + d, A\Sigma A^T)$.
(Find the mean and covariance matrix of the left hand side and plug in those values for the right hand side. Be careful with the dimension $k$ or $q$.)
B) Suppose $X_n \xrightarrow{D} N_k(\mu, \Sigma)$. Then
i) $AX_n \xrightarrow{D} N_q(A\mu, A\Sigma A^T)$.
ii) $a + bX_n \xrightarrow{D} N_k(a + b\mu, b^2\Sigma)$.
iii) $AX_n + d \xrightarrow{D} N_q(A\mu + d, A\Sigma A^T)$.
Definition 4.17. If the estimator $g(T_n) \xrightarrow{P} g(\theta)$ for all $\theta \in \Theta$, then $g(T_n)$ is a consistent estimator of $g(\theta)$.

Theorem 4.42. If $0 < \delta \le 1$, $X$ is a random vector, and
$$n^\delta(g(T_n) - g(\theta)) \xrightarrow{D} X,$$
then $g(T_n) \xrightarrow{P} g(\theta)$.

Theorem 4.43. If $X_1,...,X_n$ are iid, $E(\|X\|) < \infty$ and $E(X) = \mu$, then
a) WLLN: $\overline{X}_n \xrightarrow{P} \mu$ and
b) SLLN: $\overline{X}_n \xrightarrow{ae} \mu$.

Theorem 4.44: Continuity Theorem. Let $X_n$ be a sequence of $k \times 1$ random vectors with characteristic function $c_n(t)$ and let $X$ be a $k \times 1$ random vector with cf $c(t)$. Then
$$X_n \xrightarrow{D} X \ \text{ iff } \ c_n(t) \to c(t)$$
for all $t \in \mathbb{R}^k$.

Theorem 4.45: Cramér-Wold Device. Let $X_n$ be a sequence of $k \times 1$ random vectors and let $X$ be a $k \times 1$ random vector. Then
$$X_n \xrightarrow{D} X \ \text{ iff } \ t^T X_n \xrightarrow{D} t^T X$$
for all $t \in \mathbb{R}^k$.

Proof. (Severini (2005, p. 337)): Let $W_n = t^T X_n$ and $W = t^T X$. Note that
$$c_{W_n}(y) = c_{t^T X_n}(y) = E\left[e^{iy t^T X_n}\right] = c_{X_n}(yt)$$
where $y \in \mathbb{R}$, and similarly
$$c_W(y) = c_{t^T X}(y) = c_X(yt)$$
where $y \in \mathbb{R}$.
If $X_n \xrightarrow{D} X$, then $c_{X_n}(t) \to c_X(t)$ $\forall\, t \in \mathbb{R}^k$. Fix $t$. Then $c_{X_n}(yt) \to c_X(yt)$ $\forall\, y \in \mathbb{R}$. Thus $t^T X_n \xrightarrow{D} t^T X$.
Now assume $t^T X_n \xrightarrow{D} t^T X$ $\forall\, t \in \mathbb{R}^k$. Then $c_{X_n}(yt) \to c_X(yt)$ $\forall\, y \in \mathbb{R}$ and $\forall\, t \in \mathbb{R}^k$. Take $y = 1$ to get $c_{X_n}(t) \to c_X(t)$ $\forall\, t \in \mathbb{R}^k$. Hence $X_n \xrightarrow{D} X$ by the Continuity Theorem. $\square$

Application: Proof of the MCLT Theorem 4.41. Note that for fixed $t$, the $t^T X_i$ are iid random variables with mean $t^T \mu$ and variance $t^T \Sigma t$. Hence by the CLT, $t^T \sqrt{n}(\overline{X}_n - \mu) \xrightarrow{D} N(0, t^T \Sigma t)$. The right hand side has distribution $t^T X$ where $X \sim N_k(0, \Sigma)$. Hence by the Cramér-Wold Device, $\sqrt{n}(\overline{X}_n - \mu) \xrightarrow{D} N_k(0, \Sigma)$. $\square$
Theorem 4.46. a) If $X_n \xrightarrow{P} X$, then $X_n \xrightarrow{D} X$.
b) $X_n \xrightarrow{P} g(\theta)$ iff $X_n \xrightarrow{D} g(\theta)$.

Let $g(n) \ge 1$ be an increasing function of the sample size $n$: $g(n) \uparrow \infty$, e.g. $g(n) = \sqrt{n}$. See White (1984, p. 15). If a $k \times 1$ random vector $T_n - \mu$ converges to a nondegenerate multivariate normal distribution with convergence rate $\sqrt{n}$, then $T_n$ has (tightness) rate $\sqrt{n}$.

Definition 4.18. Let $A_n = [a_{i,j}(n)]$ be an $r \times c$ random matrix.
a) $A_n = O_P(X_n)$ if $a_{i,j}(n) = O_P(X_n)$ for $1 \le i \le r$ and $1 \le j \le c$.
b) $A_n = o_p(X_n)$ if $a_{i,j}(n) = o_p(X_n)$ for $1 \le i \le r$ and $1 \le j \le c$.
c) $A_n \asymp_P (1/g(n))$ if $a_{i,j}(n) \asymp_P (1/g(n))$ for $1 \le i \le r$ and $1 \le j \le c$.
d) Let $A_{1,n} = T_n - \mu$ and $A_{2,n} = C_n - c\Sigma$ for some constant $c > 0$. If $A_{1,n} \asymp_P (1/g(n))$ and $A_{2,n} \asymp_P (1/g(n))$, then $(T_n, C_n)$ has (tightness) rate $g(n)$.

Theorem 4.47. Let Wn , Xn , Yn and Zn be sequences of random variables


such that Yn > 0 and Zn > 0. (Often Yn and Zn are deterministic, e.g.
Yn = n−1/2 .)
a) If Wn = OP (1) and Xn = OP (1), then Wn + Xn = OP (1) and Wn Xn =
OP (1), thus OP (1) + OP (1) = OP (1) and OP (1)OP (1) = OP (1).
b) If Wn = OP (1) and Xn = oP (1), then Wn + Xn = OP (1) and Wn Xn =
oP (1), thus OP (1) + oP (1) = OP (1) and OP (1)oP (1) = oP (1).

c) If Wn = OP (Yn ) and Xn = OP (Zn ), then Wn +Xn = OP (max(Yn , Zn ))


and Wn Xn = OP (Yn Zn ), thus OP (Yn ) + OP (Zn ) = OP (max(Yn , Zn )) and
OP (Yn )OP (Zn ) = OP (Yn Zn ).

Theorem 4.48: Continuous Mapping Theorem. Let $X_n \in \mathbb{R}^k$. If $X_n \xrightarrow{D} X$ and if the function $g : \mathbb{R}^k \to \mathbb{R}^j$ is continuous, then $g(X_n) \xrightarrow{D} g(X)$.

Theorem 4.49. Suppose $x_n$ and $x$ are random vectors with the same probability space.
a) If $x_n \xrightarrow{P} x$, then $x_n \xrightarrow{D} x$.
b) If $x_n \xrightarrow{wp1} x$, then $x_n \xrightarrow{P} x$ and $x_n \xrightarrow{D} x$.
c) If $x_n \xrightarrow{r} x$ for some $r > 0$, then $x_n \xrightarrow{P} x$ and $x_n \xrightarrow{D} x$.
d) $x_n \xrightarrow{P} c$ iff $x_n \xrightarrow{D} c$ where $c$ is a constant vector.
The proof of c) follows from the Generalized Chebyshev inequality. See Example 4.15.
Remark 4.14. Let $W_n$ be a sequence of $m \times m$ random matrices and let $C$ be an $m \times m$ constant matrix.
a) $W_n \xrightarrow{P} C$ iff $a^T W_n b \xrightarrow{P} a^T C b$ for all constant vectors $a, b \in \mathbb{R}^m$.
b) If $W_n \xrightarrow{P} C$, then the determinant $\det(W_n) = |W_n| \xrightarrow{P} |C| = \det(C)$.
c) If $W_n^{-1}$ exists for each $n$ and $C^{-1}$ exists, then $W_n \xrightarrow{P} C$ iff $W_n^{-1} \xrightarrow{P} C^{-1}$.

The following two theorems are taken from Severini (2005, pp. 345-349,
354).

Theorem 4.50. Let $X_n = (X_{1n},...,X_{kn})^T$ be a sequence of $k \times 1$ random vectors, let $Y_n$ be a sequence of $k \times 1$ random vectors, and let $X = (X_1,...,X_k)^T$ be a $k \times 1$ random vector. Let $W_n$ be a sequence of $k \times k$ nonsingular random matrices, and let $C$ be a $k \times k$ constant nonsingular matrix.
a) $X_n \xrightarrow{P} X$ iff $X_{in} \xrightarrow{P} X_i$ for $i = 1,...,k$.
b) Slutsky's Theorem: If $X_n \xrightarrow{D} X$ and $Y_n \xrightarrow{P} c$ for some constant $k \times 1$ vector $c$, then i) $X_n + Y_n \xrightarrow{D} X + c$ and
ii) $Y_n^T X_n \xrightarrow{D} c^T X$.
c) If $X_n \xrightarrow{D} X$ and $W_n \xrightarrow{P} C$, then $W_n X_n \xrightarrow{D} CX$, $X_n^T W_n \xrightarrow{D} X^T C$, $W_n^{-1} X_n \xrightarrow{D} C^{-1} X$, and $X_n^T W_n^{-1} \xrightarrow{D} X^T C^{-1}$.

Theorem 4.51. Let Wn , Xn , Yn , and Zn be sequences of random variables


such that Yn > 0 and Zn > 0. (Often Yn and Zn are deterministic, e.g.
Yn = n−1/2 .)
a) If Wn = OP (1) and Xn = OP (1), then Wn + Xn = OP (1) and Wn Xn =
OP (1), thus OP (1) + OP (1) = OP (1) and OP (1)OP (1) = OP (1).

b) If Wn = OP (1) and Xn = oP (1), then Wn + Xn = OP (1) and Wn Xn =


oP (1), thus OP (1) + oP (1) = OP (1) and OP (1)oP (1) = oP (1).
c) If Wn = OP (Yn ) and Xn = OP (Zn ), then Wn +Xn = OP (max(Yn , Zn ))
and Wn Xn = OP (Yn Zn ), thus OP (Yn ) + OP (Zn ) = OP (max(Yn , Zn )) and
OP (Yn )OP (Zn ) = OP (Yn Zn ).
Theorem 4.52. i) Suppose $\sqrt{n}(T_n - \mu) \xrightarrow{D} N_p(\theta, \Sigma)$. Let $A$ be a $q \times p$ constant matrix. Then $A\sqrt{n}(T_n - \mu) = \sqrt{n}(AT_n - A\mu) \xrightarrow{D} N_q(A\theta, A\Sigma A^T)$.
ii) Let $\Sigma > 0$. If $(T, C)$ is a consistent estimator of $(\mu, s\Sigma)$ where $s > 0$ is some constant, then $D_x^2(T, C) = (x - T)^T C^{-1}(x - T) = s^{-1} D_x^2(\mu, \Sigma) + o_P(1)$, so $D_x^2(T, C)$ is a consistent estimator of $s^{-1} D_x^2(\mu, \Sigma)$.
iii) Let $\Sigma > 0$. If $\sqrt{n}(T - \mu) \xrightarrow{D} N_p(0, \Sigma)$ and if $C$ is a consistent estimator of $\Sigma$, then $n(T - \mu)^T C^{-1}(T - \mu) \xrightarrow{D} \chi^2_p$. In particular,
$$n(\overline{x} - \mu)^T S^{-1}(\overline{x} - \mu) \xrightarrow{D} \chi^2_p.$$
Proof: ii) $D_x^2(T, C) = (x - T)^T C^{-1}(x - T) =$
$$(x - \mu + \mu - T)^T [C^{-1} - s^{-1}\Sigma^{-1} + s^{-1}\Sigma^{-1}](x - \mu + \mu - T)$$
$$= (x - \mu)^T [s^{-1}\Sigma^{-1}](x - \mu) + (x - T)^T [C^{-1} - s^{-1}\Sigma^{-1}](x - T)$$
$$+ (x - \mu)^T [s^{-1}\Sigma^{-1}](\mu - T) + (\mu - T)^T [s^{-1}\Sigma^{-1}](x - \mu)$$
$$+ (\mu - T)^T [s^{-1}\Sigma^{-1}](\mu - T) = s^{-1} D_x^2(\mu, \Sigma) + o_P(1).$$
(Note that $D_x^2(T, C) = s^{-1} D_x^2(\mu, \Sigma) + O_P(n^{-\delta})$ if $(T, C)$ is a consistent estimator of $(\mu, s\Sigma)$ with rate $n^\delta$ where $0 < \delta \le 0.5$ if $[C^{-1} - s^{-1}\Sigma^{-1}] = O_P(n^{-\delta})$.)
Alternatively, $D_x^2(T, C)$ is a continuous function of $(T, C)$ if $C > 0$ for $n > 10p$. Hence $D_x^2(T, C) \xrightarrow{P} D_x^2(\mu, s\Sigma)$.
iii) Note that $Z_n = \sqrt{n}\,\Sigma^{-1/2}(T - \mu) \xrightarrow{D} N_p(0, I_p)$. Thus $Z_n^T Z_n = n(T - \mu)^T \Sigma^{-1}(T - \mu) \xrightarrow{D} \chi^2_p$. Now $n(T - \mu)^T C^{-1}(T - \mu) = n(T - \mu)^T [C^{-1} - \Sigma^{-1} + \Sigma^{-1}](T - \mu) = n(T - \mu)^T \Sigma^{-1}(T - \mu) + n(T - \mu)^T [C^{-1} - \Sigma^{-1}](T - \mu) = n(T - \mu)^T \Sigma^{-1}(T - \mu) + o_P(1) \xrightarrow{D} \chi^2_p$ since $\sqrt{n}(T - \mu)^T [C^{-1} - \Sigma^{-1}]\sqrt{n}(T - \mu) = O_P(1)\,o_P(1)\,O_P(1) = o_P(1)$. $\square$

Theorem 4.53. Let $x_n = (x_{1n},...,x_{kn})^T$ and $x = (x_1,...,x_k)^T$ be random vectors. Then $x_n \xrightarrow{D} x$ implies $x_{in} \xrightarrow{D} x_i$ for $i = 1,...,k$.
Proof. Use the Cramér-Wold device with $t_i = (0,...,0,1,0,...,0)^T$ where the 1 is in the $i$th position. Thus
$$t_i^T x_n = x_{in} \xrightarrow{D} x_i = t_i^T x. \quad \square$$


Joint convergence in distribution implies marginal convergence in distribution by Theorem 4.53. Typically marginal convergence in distribution $x_{in} \xrightarrow{D} x_i$ for $i = 1,...,m$ does not imply that
$$\begin{pmatrix} x_{1n} \\ \vdots \\ x_{mn} \end{pmatrix} \xrightarrow{D} \begin{pmatrix} x_1 \\ \vdots \\ x_m \end{pmatrix}.$$
That is, marginal convergence in distribution does not imply joint convergence in distribution. An exception is when the marginal random vectors are independent.
Example 4.16. Suppose that $x_n \perp\!\!\!\perp y_n$ (independent) for $n = 1, 2,...$. Suppose $x_n \xrightarrow{D} x$ and $y_n \xrightarrow{D} y$ where $x \perp\!\!\!\perp y$. Then
$$\begin{pmatrix} x_n \\ y_n \end{pmatrix} \xrightarrow{D} \begin{pmatrix} x \\ y \end{pmatrix}$$
by the continuity theorem. To see this, let $t = (t_1^T, t_2^T)^T$, $z_n = (x_n^T, y_n^T)^T$, and $z = (x^T, y^T)^T$. Since $x_n \perp\!\!\!\perp y_n$ and $x \perp\!\!\!\perp y$, the characteristic function
$$\phi_{z_n}(t) = \phi_{x_n}(t_1)\phi_{y_n}(t_2) \to \phi_x(t_1)\phi_y(t_2) = \phi_z(t).$$
Hence $z_n \xrightarrow{D} z$ and $g(z_n) \xrightarrow{D} g(z)$ if $g$ is continuous, by the continuous mapping theorem.

Remark 4.15. a) In the above example, we can show $x \perp\!\!\!\perp y$ instead of assuming $x \perp\!\!\!\perp y$. See Ferguson (1996, p. 42).
b) If $x_n \xrightarrow{D} x$ and $y_n \xrightarrow{P} c$, a constant vector, then
$$\begin{pmatrix} x_n \\ y_n \end{pmatrix} \xrightarrow{D} \begin{pmatrix} x \\ c \end{pmatrix}.$$
Note that a constant vector $c \perp\!\!\!\perp x$ for any random vector $x$.


Example 4.17. a) Let $X \sim N(0,1)$. Let $X_n = X$ $\forall n$. Let
$$Y_n = \begin{cases} X, & n \text{ even} \\ -X, & n \text{ odd.} \end{cases}$$
Thus $Y_n \sim N(0,1)$, $X_n \xrightarrow{D} X$, and $Y_n \xrightarrow{D} X$. Then
$$(1\ \ 1)\begin{pmatrix} X_n \\ Y_n \end{pmatrix} = X_n + Y_n = \begin{cases} 2X, & n \text{ even} \\ 0, & n \text{ odd} \end{cases}$$
does not converge in distribution as $n \to \infty$ by the Cramér-Wold Device with $t = (1\ \ 1)^T$. Thus
$$\begin{pmatrix} X_n \\ Y_n \end{pmatrix}$$
does not converge in distribution.
b) Let $X \sim N(0,1)$ and $W \sim N(0,1)$. Let $X_n = X$ $\forall n$ and $Y_n = -X$ $\forall n$. Then
$$\begin{pmatrix} X_n \\ Y_n \end{pmatrix} = \begin{pmatrix} X \\ -X \end{pmatrix} \forall n, \ \text{ and } \ \begin{pmatrix} X_n \\ Y_n \end{pmatrix} \xrightarrow{D} \begin{pmatrix} X \\ -X \end{pmatrix}.$$
Now $X_n \xrightarrow{D} W$ and $Y_n \xrightarrow{D} W$. Since
$$(1\ \ 1)\begin{pmatrix} X_n \\ Y_n \end{pmatrix} = X_n + Y_n = 0 \ \forall n,$$
$$\begin{pmatrix} X_n \\ Y_n \end{pmatrix}$$
does not converge in distribution to
$$\begin{pmatrix} W \\ W \end{pmatrix}$$
as $n \to \infty$.

4.9 The Delta Method

Theorem 4.54: the Delta Method. If $g'(\theta) \ne 0$, and
$$\sqrt{n}(T_n - \theta) \xrightarrow{D} N(0, \sigma^2),$$
then
$$\sqrt{n}(g(T_n) - g(\theta)) \xrightarrow{D} N(0, \sigma^2 [g'(\theta)]^2).$$

The CLT says that $\overline{Y}_n \sim AN(\mu, \sigma^2/n)$. The delta method says that if $T_n \sim AN(\theta, \sigma^2/n)$, and if $g'(\theta) \ne 0$, then $g(T_n) \sim AN(g(\theta), \sigma^2 [g'(\theta)]^2/n)$. Hence a smooth function $g(T_n)$ of a well behaved statistic $T_n$ tends to be well behaved (asymptotically normal with a $\sqrt{n}$ convergence rate). By the delta method and Theorem 4.b, $T_n = g(\overline{Y}_n) \xrightarrow{P} g(\mu)$ if $g'(\mu) \ne 0$ for all $\mu \in \Theta$. By Theorem 4.e, $g(\overline{Y}_n) \xrightarrow{P} g(\mu)$ if $g$ is continuous at $\mu$.
Example 4.18. Let $Y_1,...,Y_n$ be iid with $E(Y) = \mu$ and $V(Y) = \sigma^2$. Then by the CLT,
$$\sqrt{n}(\overline{Y}_n - \mu) \xrightarrow{D} N(0, \sigma^2).$$
Let $g(\mu) = \mu^2$. Then $g'(\mu) = 2\mu \ne 0$ for $\mu \ne 0$. Hence
$$\sqrt{n}((\overline{Y}_n)^2 - \mu^2) \xrightarrow{D} N(0, 4\sigma^2\mu^2)$$
for $\mu \ne 0$ by the delta method.
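Example 4.18 can be checked by simulation. In the sketch below (not from the text; $\mu = 2$, $\sigma = 1$, $n = 400$, the replication count, and the seed are arbitrary assumptions), the delta method predicts that $\sqrt{n}((\overline{Y}_n)^2 - \mu^2)$ has approximate mean 0 and variance $4\sigma^2\mu^2 = 16$.

```python
import random

random.seed(4)

mu, sigma, n, reps = 2.0, 1.0, 400, 3000
vals = []
for _ in range(reps):
    ybar = sum(random.gauss(mu, sigma) for _ in range(n)) / n
    vals.append(n ** 0.5 * (ybar ** 2 - mu ** 2))

sim_mean = sum(vals) / reps
sim_var = sum((v - sim_mean) ** 2 for v in vals) / reps
# delta method prediction: N(0, 4 * sigma^2 * mu^2) = N(0, 16)
```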

Example 4.19. Let $X_n \sim$ Poisson$(n\lambda)$ where the positive integer $n$ is large and $0 < \lambda$.
a) Find the limiting distribution of $\sqrt{n}\left(\dfrac{X_n}{n} - \lambda\right)$.
b) Find the limiting distribution of $\sqrt{n}\left[\sqrt{\dfrac{X_n}{n}} - \sqrt{\lambda}\right]$.
Solution. a) $X_n \stackrel{D}{=} \sum_{i=1}^n Y_i$ where the $Y_i$ are iid Poisson$(\lambda)$. Hence $E(Y) = \lambda = V(Y)$. Thus by the CLT,
$$\sqrt{n}\left(\frac{X_n}{n} - \lambda\right) \stackrel{D}{=} \sqrt{n}\left(\frac{\sum_{i=1}^n Y_i}{n} - \lambda\right) \xrightarrow{D} N(0, \lambda).$$
b) Let $g(\lambda) = \sqrt{\lambda}$. Then $g'(\lambda) = \frac{1}{2\sqrt{\lambda}}$ and by the delta method,
$$\sqrt{n}\left[\sqrt{\frac{X_n}{n}} - \sqrt{\lambda}\right] = \sqrt{n}\left(g\left(\frac{X_n}{n}\right) - g(\lambda)\right) \xrightarrow{D} N(0, \lambda\,(g'(\lambda))^2) = N\left(0, \lambda\,\frac{1}{4\lambda}\right) = N\left(0, \frac{1}{4}\right).$$
Example 4.20. Let $Y_1,...,Y_n$ be independent and identically distributed (iid) from a Gamma$(\alpha, \beta)$ distribution.
a) Find the limiting distribution of $\sqrt{n}\left(\overline{Y} - \alpha\beta\right)$.
b) Find the limiting distribution of $\sqrt{n}\left((\overline{Y})^2 - c\right)$ for appropriate constant $c$.
Solution: a) Since $E(Y) = \alpha\beta$ and $V(Y) = \alpha\beta^2$, by the CLT
$$\sqrt{n}\left(\overline{Y} - \alpha\beta\right) \xrightarrow{D} N(0, \alpha\beta^2).$$
b) Let $\mu = \alpha\beta$ and $\sigma^2 = \alpha\beta^2$. Let $g(\mu) = \mu^2$ so $g'(\mu) = 2\mu$ and $[g'(\mu)]^2 = 4\mu^2 = 4\alpha^2\beta^2$. Then by the delta method, $\sqrt{n}\left((\overline{Y})^2 - c\right) \xrightarrow{D} N(0, \sigma^2[g'(\mu)]^2) = N(0, 4\alpha^3\beta^4)$ where $c = \mu^2 = \alpha^2\beta^2$.

Example 4.21. Let $X \sim$ Binomial$(n, p)$ where the positive integer $n$ is large and $0 < p < 1$. Find the limiting distribution of $\sqrt{n}\left[\left(\dfrac{X}{n}\right)^2 - p^2\right]$.
Solution. Example 4.b gives the limiting distribution of $\sqrt{n}\left(\frac{X}{n} - p\right)$. Let $g(p) = p^2$. Then $g'(p) = 2p$ and by the delta method,
$$\sqrt{n}\left[\left(\frac{X}{n}\right)^2 - p^2\right] = \sqrt{n}\left(g\left(\frac{X}{n}\right) - g(p)\right) \xrightarrow{D}$$
$$N(0, p(1-p)(g'(p))^2) = N(0, p(1-p)4p^2) = N(0, 4p^3(1-p)).$$
Remark 4.16. a) Note that if $\sqrt{n}(T_n - k) \xrightarrow{D} N(0, \sigma^2)$, then evaluate the derivative at $k$. Thus use $g'(k)$ where $k = \alpha\beta$ in the above example. A common error occurs when $k$ is a simple function of $\theta$, for example $k = \theta/2$ with $g(\mu) = \mu^2$. Thus $g'(\mu) = 2\mu$ so $g'(\theta/2) = 2\theta/2 = \theta$. Then the common delta method error is to plug in $g'(\theta) = 2\theta$ instead of $g'(k) = \theta$. See Problems 2.3, 2.33, 2.35, 2.36, and 2.37.
b) For the delta method, also note that the function $g$ cannot depend on $n$ since then there would be a sequence of functions $g_n$ rather than one function $g$. This fact also applies to several other theorems in this chapter.

The following extension of the delta method is sometimes useful.

Theorem 4.55: the Second Order Delta Method. Suppose that $g'(\theta) = 0$, $g''(\theta) \ne 0$ and
$$\sqrt{n}(T_n - \theta) \xrightarrow{D} N(0, \tau^2(\theta)).$$
Then
$$n[g(T_n) - g(\theta)] \xrightarrow{D} \frac{1}{2}\tau^2(\theta)\,g''(\theta)\,\chi^2_1.$$
Example 4.22. Let $X_n \sim$ Binomial$(n, p)$ where the positive integer $n$ is large and $0 < p < 1$. Let $g(\theta) = \theta^3 - \theta$. Find the limiting distribution of $n\left(g\left(\dfrac{X_n}{n}\right) - c\right)$ for appropriate constant $c$ when $p = \dfrac{1}{\sqrt{3}}$.
Solution: Since $X_n \stackrel{D}{=} \sum_{i=1}^n Y_i$ where $Y_i \sim BIN(1, p)$,
$$\sqrt{n}\left(\frac{X_n}{n} - p\right) \xrightarrow{D} N(0, p(1-p))$$
by the CLT. Let $\theta = p$. Then $g'(\theta) = 3\theta^2 - 1$ and $g''(\theta) = 6\theta$. Notice that
$$g(1/\sqrt{3}) = (1/\sqrt{3})^3 - 1/\sqrt{3} = (1/\sqrt{3})\left(\frac{1}{3} - 1\right) = \frac{-2}{3\sqrt{3}} = c.$$
Also $g'(1/\sqrt{3}) = 0$ and $g''(1/\sqrt{3}) = 6/\sqrt{3}$. Since $\tau^2(p) = p(1-p)$,
$$\tau^2(1/\sqrt{3}) = \frac{1}{\sqrt{3}}\left(1 - \frac{1}{\sqrt{3}}\right).$$
Hence
$$n\left(g\left(\frac{X_n}{n}\right) - \frac{-2}{3\sqrt{3}}\right) \xrightarrow{D} \frac{1}{2}\,\frac{1}{\sqrt{3}}\left(1 - \frac{1}{\sqrt{3}}\right)\frac{6}{\sqrt{3}}\,\chi^2_1 = \left(1 - \frac{1}{\sqrt{3}}\right)\chi^2_1.$$
To see that the delta method is a special case of the multivariate delta
method, note that if Tn and parameter θ are real valued, then Dg (θ ) = g0 (θ).

Theorem 4.56: the Multivariate Delta Method. If
$$\sqrt{n}(T_n - \theta) \xrightarrow{D} N_k(0, \Sigma),$$
then
$$\sqrt{n}(g(T_n) - g(\theta)) \xrightarrow{D} N_d(0, D_{g(\theta)}\,\Sigma\,D_{g(\theta)}^T)$$
if $D_{g(\theta)}\,\Sigma\,D_{g(\theta)}^T$ is nonsingular, where the $d \times k$ Jacobian matrix of partial derivatives
$$D_{g(\theta)} = \begin{pmatrix}
\frac{\partial}{\partial\theta_1} g_1(\theta) & \cdots & \frac{\partial}{\partial\theta_k} g_1(\theta) \\
\vdots & & \vdots \\
\frac{\partial}{\partial\theta_1} g_d(\theta) & \cdots & \frac{\partial}{\partial\theta_k} g_d(\theta)
\end{pmatrix}.$$
Here the mapping $g : \mathbb{R}^k \to \mathbb{R}^d$ needs to be differentiable in a neighborhood of $\theta \in \mathbb{R}^k$.

Example 4.23. If $Y$ has a Weibull distribution, $Y \sim W(\phi, \lambda)$, then the pdf of $Y$ is
$$f(y) = \frac{\phi}{\lambda}\, y^{\phi-1}\, e^{-\frac{y^\phi}{\lambda}}$$
where $\lambda$, $y$, and $\phi$ are all positive. If $\mu = \lambda^{1/\phi}$ so $\mu^\phi = \lambda$, then the Weibull pdf
$$f(y) = \frac{\phi}{\mu}\left(\frac{y}{\mu}\right)^{\phi-1} \exp\left[-\left(\frac{y}{\mu}\right)^\phi\right].$$
Let $(\hat\mu, \hat\phi)$ be the MLE of $(\mu, \phi)$. According to Bain (1978, p. 215),
$$\sqrt{n}\left(\begin{pmatrix}\hat\mu \\ \hat\phi\end{pmatrix} - \begin{pmatrix}\mu \\ \phi\end{pmatrix}\right) \xrightarrow{D} N\left(\begin{pmatrix}0 \\ 0\end{pmatrix}, \begin{pmatrix} \frac{1.109\,\mu^2}{\phi^2} & 0.257\mu \\ 0.257\mu & 0.608\phi^2 \end{pmatrix}\right) = N_2(0, I^{-1}(\theta))$$
where $I(\theta)$ is given in Definition 4..
Let column vectors $\theta = (\mu\ \ \phi)^T$ and $\eta = (\lambda\ \ \phi)^T$. Then
$$\eta = g(\theta) = \begin{pmatrix} \lambda \\ \phi \end{pmatrix} = \begin{pmatrix} \mu^\phi \\ \phi \end{pmatrix} = \begin{pmatrix} g_1(\theta) \\ g_2(\theta) \end{pmatrix}.$$
So
$$D_{g(\theta)} = \begin{pmatrix} \frac{\partial}{\partial\theta_1} g_1(\theta) & \frac{\partial}{\partial\theta_2} g_1(\theta) \\ \frac{\partial}{\partial\theta_1} g_2(\theta) & \frac{\partial}{\partial\theta_2} g_2(\theta) \end{pmatrix} = \begin{pmatrix} \frac{\partial}{\partial\mu}\mu^\phi & \frac{\partial}{\partial\phi}\mu^\phi \\ \frac{\partial}{\partial\mu}\phi & \frac{\partial}{\partial\phi}\phi \end{pmatrix} = \begin{pmatrix} \phi\mu^{\phi-1} & \mu^\phi\log(\mu) \\ 0 & 1 \end{pmatrix}.$$
Thus by the multivariate delta method,
$$\sqrt{n}\left(\begin{pmatrix}\hat\lambda \\ \hat\phi\end{pmatrix} - \begin{pmatrix}\lambda \\ \phi\end{pmatrix}\right) \xrightarrow{D} N_2(0, \Sigma)$$
where (see Definition 4. below)
$$\Sigma = I(\eta)^{-1} = [I(g(\theta))]^{-1} = D_{g(\theta)}\, I^{-1}(\theta)\, D_{g(\theta)}^T =$$
$$\begin{pmatrix} 1.109\lambda^2\left(1 + 0.4635\log(\lambda) + 0.5482(\log(\lambda))^2\right) & 0.257\phi\lambda + 0.608\lambda\phi\log(\lambda) \\ 0.257\phi\lambda + 0.608\lambda\phi\log(\lambda) & 0.608\phi^2 \end{pmatrix}.$$

4.10 Summary

1) $X_n \xrightarrow{D} X$ if
$$\lim_{n\to\infty} F_n(t) = F(t)$$
at each continuity point $t$ of $F$. Convergence in distribution is also known as weak convergence and convergence in law. $X$ is the limiting distribution or asymptotic distribution of $X_n$. The limiting distribution does not depend on the sample size $n$. $X_n \xrightarrow{D} \tau(\theta)$ if $X_n \xrightarrow{D} X$ where $P(X = \tau(\theta)) = 1$: hence $X$ is degenerate at $\tau(\theta)$ or the distribution of $X$ is a point mass at $\tau(\theta)$.
2) If $X_n \xrightarrow{D} X$ and $X_n \xrightarrow{D} Y$, then i) $X \stackrel{D}{=} Y$ and ii) $F_X(x) = F_Y(x)$ for all real $x$.
3) Convergence in probability: a) $X_n \xrightarrow{P} \tau(\theta)$ if for every $\epsilon > 0$,
$$\lim_{n\to\infty} P(|X_n - \tau(\theta)| < \epsilon) = 1 \ \text{ or, equivalently, } \ \lim_{n\to\infty} P(|X_n - \tau(\theta)| \ge \epsilon) = 0.$$
b) $X_n \xrightarrow{P} X$ if for every $\epsilon > 0$,
$$\lim_{n\to\infty} P(|X_n - X| < \epsilon) = 1, \ \text{ or, equivalently, } \ \lim_{n\to\infty} P(|X_n - X| \ge \epsilon) = 0.$$

4) Theorem: $T_n \xrightarrow{P} \tau(\theta)$ if any of the following 2 conditions holds:
i) $\lim_{n\to\infty} V_\theta(T_n) = 0$ and $\lim_{n\to\infty} E_\theta(T_n) = \tau(\theta)$.
ii) $MSE_{\tau(\theta)}(T_n) = E[(T_n - \tau(\theta))^2] \to 0$.
Here
$$MSE_{\tau(\theta)}(T_n) = V_\theta(T_n) + [Bias_{\tau(\theta)}(T_n)]^2$$
where $Bias_{\tau(\theta)}(T_n) = E_\theta(T_n) - \tau(\theta)$.
5) Theorem: a) Let $X_\theta$ be a random variable with a distribution depending on $\theta$, and $0 < \delta \le 1$. If
$$n^\delta(T_n - \tau(\theta)) \xrightarrow{D} X_\theta$$
for all $\theta \in \Theta$, then $T_n \xrightarrow{P} \tau(\theta)$.
b) If
$$\sqrt{n}(T_n - \tau(\theta)) \xrightarrow{D} N(0, v(\theta))$$
for all $\theta \in \Theta$, then $T_n$ is a consistent estimator of $\tau(\theta)$.
Note: If $\sqrt{n}(T_n - \theta) \xrightarrow{D} N(0, \sigma^2)$, then $T_n \xrightarrow{P} \theta$. Often $X_\theta \sim N(0, v(\theta))$.
6) WLLN: Let $Y_1,...,Y_n,...$ be a sequence of iid random variables with $E(Y_i) = \mu$. Then $\overline{Y}_n \xrightarrow{P} \mu$. Hence $\overline{Y}_n$ is a consistent estimator of $\mu$.
7) $Y_n$ converges in $r$th mean to a random variable $Y$, $Y_n \xrightarrow{r} Y$, if
$$E(|Y_n - Y|^r) \to 0$$
as $n \to \infty$. In particular, if $r = 2$, $Y_n$ converges in quadratic mean to $Y$, written
$$Y_n \xrightarrow{2} Y \ \text{ or } \ Y_n \xrightarrow{qm} Y,$$
if $E[(Y_n - Y)^2] \to 0$ as $n \to \infty$. $Y_n \xrightarrow{r} \tau(\theta)$ if $E(|Y_n - \tau(\theta)|^r) \to 0$ as $n \to \infty$. If $r \ge 1$, $Y_n \xrightarrow{r} Y$ is often written as $Y_n \xrightarrow{L_r} Y$.
8) A sequence of random variables $X_n$ converges with probability 1 (or almost surely, or almost everywhere, or strong convergence) to $X$ if
$$P(\lim_{n\to\infty} X_n = X) = 1.$$
This type of convergence will be denoted by $X_n \xrightarrow{wp1} X$. Notation such as "$X_n$ converges to $X$ wp1" will also be used. Sometimes "wp1" will be replaced with "as" or "ae." $X_n \xrightarrow{wp1} \tau(\theta)$ if $P(\lim_{n\to\infty} X_n = \tau(\theta)) = 1$.
9) SLLN: If $X_1,...,X_n$ are iid with $E(X_i) = \mu$ finite, then $\overline{X}_n \xrightarrow{wp1} \mu$.
10) a) For i) $X_n \xrightarrow{P} X$, ii) $X_n \xrightarrow{r} X$, or iii) $X_n \xrightarrow{wp1} X$, the $X_n$ and $X$ need to be defined on the same probability space.
b) For $X_n \xrightarrow{D} X$, the probability spaces can differ.
c) For i) $X_n \xrightarrow{P} c$, ii) $X_n \xrightarrow{wp1} c$, iii) $X_n \xrightarrow{D} c$, and iv) $X_n \xrightarrow{r} c$, the probability spaces of the $X_n$ can differ.
11) Theorem: i) $T_n \xrightarrow{P} \tau(\theta)$ iff $T_n \xrightarrow{D} \tau(\theta)$.
ii) If $T_n \xrightarrow{P} \theta$ and $\tau$ is continuous at $\theta$, then $\tau(T_n) \xrightarrow{P} \tau(\theta)$. Hence if $T_n$ is a consistent estimator of $\theta$, then $\tau(T_n)$ is a consistent estimator of $\tau(\theta)$ if $\tau$ is a continuous function on $\Theta$.
12) Theorem: Suppose $X_n$ and $X$ are RVs with the same probability space for b) and c). Let $g : \mathbb{R} \to \mathbb{R}$ be a continuous function.
a) If $X_n \xrightarrow{D} X$, then $g(X_n) \xrightarrow{D} g(X)$.
b) If $X_n \xrightarrow{P} X$, then $g(X_n) \xrightarrow{P} g(X)$.
c) If $X_n \xrightarrow{ae} X$, then $g(X_n) \xrightarrow{wp1} g(X)$.
13) CLT: Let Y1 , ..., Yn be iid with E(Y ) = µ and V (Y ) = σ². Then

√n(Y n − µ) →D N (0, σ²).

14) a) Zn = √n (Y n − µ)/σ = (Σ_{i=1}^n Yi − nµ)/(√n σ) is the z-score of Y n (and the z-score of Σ_{i=1}^n Yi ), and Zn →D N (0, 1). b) Two applications of the CLT are to give the limiting distribution of √n(Y n − µ) and the limiting distribution of √n(Yn /n − µX ) for a random variable Yn such that Yn = Σ_{i=1}^n Xi where the Xi are iid with E(X) = µX and V (X) = σX². See Section 1.4. c) The CLT is the Lindeberg-Lévy CLT.
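As a simulation sketch of 13) and 14) (not from the text; Uniform(0,1) summands, the seed, and the rep counts are arbitrary choices), the z-score of a sum of iid uniforms should be roughly N(0,1), so about 95% of simulated z-scores should fall in (−1.96, 1.96).

```python
import random, math

# CLT sketch: the z-score of a sum of n iid Uniform(0,1) RVs
# (mu = 1/2, sigma^2 = 1/12) is approximately N(0,1) for moderate n.
random.seed(1)
n, reps = 50, 4000
mu, sigma = 0.5, math.sqrt(1 / 12)

def zscore():
    s = sum(random.random() for _ in range(n))
    return (s - n * mu) / (math.sqrt(n) * sigma)

zs = [zscore() for _ in range(reps)]
coverage = sum(abs(z) <= 1.96 for z in zs) / reps
print(coverage)  # should be near the N(0,1) central probability 0.95
```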
15) Theorem: Suppose Xn and X are RVs with the same probability space.
a) If Xn →wp1 X, then Xn →P X and Xn →D X.
b) If Xn →P X, then Xn →D X.
c) If Xn →r X, then Xn →P X and Xn →D X.
d) Xn →P τ (θ) iff Xn →D τ (θ) where τ (θ) is a constant.
16) Theorem: a) If E[(Xn − X)²] → 0 as n → ∞, then Xn →P X.
b) If E(Xn ) → E(X) and V (Xn − X) → 0 as n → ∞, then Xn →P X.
Note: See 15) if P (X = τ (θ)) = 1.
17) Theorem: If Xn →r X, then Xn →k X where 0 < k < r.
18) Theorem: Let Xn have pdf fXn (x), and let X have pdf fX (x). If fXn (x) → fX (x) for all x (or for x outside of a set of Lebesgue measure 0), then Xn →D X.
19) Theorem: Let g : R → R be continuous at the constant c.
a) If Xn →D c, then g(Xn ) →D g(c).
b) If Xn →P c, then g(Xn ) →P g(c).
c) If Xn →wp1 c, then g(Xn ) →wp1 g(c).
20) Theorem: Suppose Xn and X are integer valued RVs with pmfs fXn (x) and fX (x). Then Xn →D X iff P (Xn = k) → P (X = k) for every integer k iff fXn (x) → fX (x) for every real x.
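A classical illustration of 20) is the Poisson approximation to the binomial: Binomial(n, λ/n) is integer valued and its pmfs converge to Poisson(λ) pmfs, giving convergence in distribution. The numeric check below is a sketch, not from the text; λ = 3 and n = 10000 are arbitrary choices.

```python
import math

# Binomial(n, lam/n) pmfs converge to Poisson(lam) pmfs as n -> infinity,
# so by item 20) Binomial(n, lam/n) ->D Poisson(lam).
lam, n = 3.0, 10_000

def binom_pmf(k):
    p = lam / n
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def pois_pmf(k):
    return math.exp(-lam) * lam**k / math.factorial(k)

diffs = [abs(binom_pmf(k) - pois_pmf(k)) for k in range(10)]
print(max(diffs))  # small for large n
```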
21) Slutsky’s Theorem: If Yn →D Y and Wn →P w for some constant w, then i) Yn Wn →D wY , ii) Yn + Wn →D Y + w, and iii) Yn /Wn →D Y /w for w ≠ 0.
Note that Yn →B Y implies Yn →D Y where B = wp1, r, or P . Also Wn →P w iff Wn →D w. If a sequence of constants cn → c as n → ∞ (everywhere convergence), then cn →wp1 c and cn →P c. (So everywhere convergence is a special case of almost everywhere convergence.)
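Slutsky’s theorem can be sketched by simulation (not from the text; the choices Yn ~ N(0,1), w = 2, and the noise scale are arbitrary): if Yn →D N(0,1) and Wn →P w, then Yn Wn →D N(0, w²).

```python
import random, statistics

# Slutsky sketch: Yn ->D N(0,1) and Wn ->P w imply Yn*Wn ->D N(0, w^2).
# Here w = 2 and Wn = 2 + noise/n, which collapses onto 2 as n grows.
random.seed(2)
n, reps, w = 1000, 5000, 2.0
prods = []
for _ in range(reps):
    y = random.gauss(0, 1)            # Yn is exactly N(0,1) for every n
    wn = w + random.gauss(0, 1) / n   # Wn ->P w = 2
    prods.append(y * wn)
var = statistics.pvariance(prods)
print(var)  # should be near w^2 = 4
```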
22) The cumulative distribution function (cdf) of any random variable Y is F (y) = P (Y ≤ y) for all y ∈ R. If F (y) is a cumulative distribution function, then i) F (−∞) = lim_{y→−∞} F (y) = 0, ii) F (∞) = lim_{y→∞} F (y) = 1, iii)
4.10 Summary 109

F is a nondecreasing function: if y1 < y2 , then F (y1 ) ≤ F (y2 ), iv) F is right continuous: lim_{h↓0} F (y + h) = F (y) for all real y, v) since a cdf is a probability for fixed y, 0 ≤ F (y) ≤ 1 for all real y, vi) a cdf F (y) can have at most countably many points of discontinuity, vii) P (a < Y ≤ b) = F (b) − F (a), viii) if Y is a random variable, then FY (y) completely determines the distribution of Y .
23) The moment generating function (mgf) of a random variable Y is

m(t) = E[e^{tY}]    (4.19)

if the expectation exists for t in some neighborhood of 0. Otherwise, the mgf does not exist. If Y is discrete, then m(t) = Σ_y e^{ty} f(y), and if Y is continuous, then m(t) = ∫_{−∞}^{∞} e^{ty} f(y) dy. If Y is a random variable and mY (t) exists, then mY (t) completely determines the distribution of Y .
Notes: a) If X has mgf mX (t), then E(X^k) exists for all positive integers k.
b) Let j and k be positive integers. If E(X^k) is finite, then E(X^j) is finite for 1 ≤ j ≤ k.
24) The characteristic function of a random variable Y is c(t) = E[e^{itY}] = E[cos(tY )] + iE[sin(tY )] where the complex number i = √−1.
i) c(0) = 1, ii) the modulus |c(t)| ≤ 1 for all real t, iii) c(t) is a continuous function. iv) If E(Y ) = 0 and E(Y ²) = V (Y ) = σ², then

cY (t) = 1 − t²σ²/2 + o(t²) as t → 0.

Here a(t) = o(t²) as t → 0 if lim_{t→0} a(t)/t² = 0. v) If Y is discrete with pmf fY (y), then cY (t) = Σ_y e^{ity} fY (y). vi) If Y is a random variable, then cY (t) always exists, and completely determines the distribution of Y .
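Since cY (t) is an expectation, it can be approximated by a Monte Carlo average. The sketch below (not from the text; N(0,1), the seed, and the sample size are arbitrary choices) compares the empirical characteristic function to the exact N(0,1) value exp(−t²/2).

```python
import random, math, cmath

# Empirical characteristic function: average exp(i*t*Y) over simulated Y.
# For Y ~ N(0,1) the exact characteristic function is exp(-t^2 / 2).
random.seed(3)
reps = 100_000
ys = [random.gauss(0, 1) for _ in range(reps)]

def emp_cf(t):
    return sum(cmath.exp(1j * t * y) for y in ys) / reps

for t in (0.0, 0.5, 1.0):
    approx, exact = emp_cf(t), math.exp(-t * t / 2)
    print(t, abs(approx - exact))  # small Monte Carlo error
```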
25) Continuity Theorem: Let Yn be a sequence of random variables with characteristic functions cYn (t). Let Y be a random variable with cf cY (t).
a) Yn →D Y iff cYn (t) → cY (t) ∀t ∈ R.
b) Also assume that Yn has mgf mYn and Y has mgf mY . Assume that all of the mgfs mYn and mY are defined on |t| ≤ d for some d > 0. Then if mYn (t) → mY (t) as n → ∞ for all |t| < c where 0 < c < d, then Yn →D Y .
26) Theorem: If limn→∞ cXn (t) = g(t) for all t where g is continuous at t = 0, then g(t) = cX (t) is a characteristic function for some RV X, and Xn →D X.
Note: Hence continuity at t = 0 implies continuity everywhere since g(t) = ϕX (t) is continuous. If g(t) is not continuous at 0, then Xn does not converge in distribution.

27) If cYn (t) → h(t) where h(t) is not continuous, then Yn does not converge in distribution to any RV Y , by the Continuity Theorem and 26).
28) Let X1 , ..., Xn be independent RVs with characteristic functions cXj (t). Then the characteristic function of Σ_{j=1}^n Xj is

c_{Σ_{j=1}^n Xj}(t) = Π_{j=1}^n cXj (t).

If the RVs also have mgfs mXj (t), then the mgf of Σ_{j=1}^n Xj is

m_{Σ_{j=1}^n Xj}(t) = Π_{j=1}^n mXj (t).
29) Helly-Bray-Portmanteau Theorem: Xn →D X iff E[g(Xn )] → E[g(X)] for every bounded, real, continuous function g.
Note: 29) is used to prove 30) b).
30) a) Generalized Continuous Mapping Theorem: If Xn →D X and the function g is such that P [X ∈ C(g)] = 1 where C(g) is the set of points where g is continuous, then g(Xn ) →D g(X).
Note: P [X ∈ C(g)] = 1 can be replaced by P [X ∈ D(g)] = 0 where D(g) is the set of points where g is not continuous.
b) Continuous Mapping Theorem: If Xn →D X and the function g is continuous, then g(Xn ) →D g(X).
Note: the function g cannot depend on n since gn is a sequence of functions rather than a single function.
31) Generalized Chebyshev’s Inequality or Generalized Markov’s Inequality: Let u : R → [0, ∞) be a nonnegative function. If E[u(Y )] exists, then for any c > 0,

P [u(Y ) ≥ c] ≤ E[u(Y )]/c.

If µ = E(Y ) exists, then taking u(y) = |y − µ|^r and c̃ = c^r gives
Markov’s Inequality: for r > 0 and any c > 0,

P (|Y − µ| ≥ c) = P (|Y − µ|^r ≥ c^r) ≤ E[|Y − µ|^r]/c^r.

If r = 2 and σ² = V (Y ) exists, then we obtain
Chebyshev’s Inequality:

P (|Y − µ| ≥ c) ≤ V (Y )/c².
32) a) lim_{n→∞} (1 − c/n)^n = e^{−c}.
b) If cn → c as n → ∞, then lim_{n→∞} (1 − cn /n)^n = e^{−c}.
c) If cn is a sequence of complex numbers such that cn → c as n → ∞ where c is real, then lim_{n→∞} (1 − cn /n)^n = e^{−c}.

33) For each positive integer n, let Wn1 , ..., Wn,rn be independent. The probability space may change with n, giving a triangular array of RVs. Let

E[Wnk ] = 0, V (Wnk ) = E[Wnk²] = σnk², and sn² = Σ_{k=1}^{rn} σnk² = V (Σ_{k=1}^{rn} Wnk ).

Then

Zn = Σ_{k=1}^{rn} Wnk / sn

is the z-score of Σ_{k=1}^{rn} Wnk .
34) Lyapounov’s CLT: Under 33), assume the |Wnk |^{2+δ} are integrable for some δ > 0. Assume Lyapounov’s condition:

lim_{n→∞} Σ_{k=1}^{rn} E[|Wnk |^{2+δ}] / sn^{2+δ} = 0.

Then

Zn = Σ_{k=1}^{rn} Wnk / sn →D N (0, 1).
35) Special cases: i) rn = n and Wnk = Wk has W1 , ..., Wn , ... independent.
ii) Wnk = Xnk − E(Xnk ) = Xnk − µnk has

Σ_{k=1}^{rn} (Xnk − µnk ) / sn →D N (0, 1).

iii) Suppose X1 , X2 , ... are independent with E(Xi ) = µi and V (Xi ) = σi². Let

Zn = (Σ_{i=1}^n Xi − Σ_{i=1}^n µi ) / (Σ_{i=1}^n σi²)^{1/2}

be the z-score of Σ_{i=1}^n Xi . Assume E[|Xi − µi |³] < ∞ for all i ∈ N and

lim_{n→∞} Σ_{i=1}^n E[|Xi − µi |³] / (Σ_{i=1}^n σi²)^{3/2} = 0.    (∗)

Then Zn →D N (0, 1).
36) The (Lindeberg-Lévy) CLT has the Xi iid with V (Xi ) = σ² < ∞. The Lyapounov CLT in 35) iii) has the Xi independent (not necessarily identically distributed), but needs stronger moment conditions to satisfy (∗).
37) Lindeberg CLT: Let the Wnk satisfy 33) and Lindeberg’s condition

lim_{n→∞} Σ_{k=1}^{rn} E(Wnk² I[|Wnk | ≥ ε sn ]) / sn² = 0

for any ε > 0. Then

Zn = Σ_{k=1}^{rn} Wnk / sn →D N (0, 1).
Notes: The Lindeberg CLT is sometimes called the Lindeberg-Feller CLT. Lindeberg’s condition is nearly necessary for Zn = Σ_{k=1}^{rn} Wnk / sn →D N (0, 1).
38) Special case of the Lindeberg CLT: Let rn = n and let the Wnk = Wk be independent. If

lim_{n→∞} Σ_{k=1}^n E(Wk² I[|Wk | ≥ ε sn ]) / sn² = 0

for any ε > 0, then

Zn = Σ_{k=1}^n Wk / sn →D N (0, 1).
39) a) Uniformly bounded sequence: Let rn = n and Wnk = Wk . If there is a constant c > 0 such that P (|Wk | < c) = 1 ∀k, and if sn → ∞ as n → ∞, then the Lindeberg CLT 37) holds.
b) Let rn = n and let the Wnk = Wk be iid with V (Wk ) = σ² ∈ (0, ∞). Then the Lindeberg CLT 37) holds. (Taking Wi = Xi − µ proves the usual CLT with the Lindeberg CLT.)
c) If Lyapounov’s condition holds, then Lindeberg’s condition holds. Hence the Lindeberg CLT proves the Lyapounov CLT.
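The uniformly bounded case 39) a) can be sketched by simulation (not from the text; the alternating Uniform(−1, 1) and Uniform(−2, 2) summands, seed, and sizes are arbitrary choices). The Wk are bounded but not identically distributed, yet the z-score still looks N(0,1).

```python
import random, math

# Independent, uniformly bounded, non-identically distributed W_k with
# s_n -> infinity: W_k ~ Uniform(-a_k, a_k), a_k alternating 1 and 2,
# so V(W_k) = a_k^2 / 3 and s_n^2 is the sum of the variances.
random.seed(6)
n, reps = 400, 3000
a = [1.0 if k % 2 == 0 else 2.0 for k in range(n)]
s_n = math.sqrt(sum(ak**2 / 3 for ak in a))
zs = []
for _ in range(reps):
    total = sum(random.uniform(-ak, ak) for ak in a)
    zs.append(total / s_n)
coverage = sum(abs(z) <= 1.96 for z in zs) / reps
print(coverage)  # near the N(0,1) value 0.95
```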
40) Let h(y), g(y), n(y), and d(y) be functions. Review how to find the derivative g′(y) of g(y) and how to find the kth derivative

g^(k)(y) = d^k g(y)/dy^k

for integers k ≥ 2. Recall that the product rule is

(h(y)g(y))′ = h′(y)g(y) + h(y)g′(y).

The quotient rule is

(n(y)/d(y))′ = [d(y)n′(y) − n(y)d′(y)] / [d(y)]².

The chain rule is

[h(g(y))]′ = [h′(g(y))][g′(y)].

Then given the mgf m(t), find E[Y ] = m′(0), E[Y ²] = m′′(0) and V (Y ) = E[Y ²] − (E[Y ])².
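The moment-from-mgf recipe in 40) can be checked numerically (a sketch, not from the text) by differencing an mgf at 0. Here the EXP(λ) mgf m(t) = 1/(1 − λt), with mean λ and variance λ², is used with λ = 2 (an arbitrary choice).

```python
# Moments from mgf derivatives via central differences:
# E[Y] = m'(0), E[Y^2] = m''(0), V(Y) = m''(0) - m'(0)^2.
lam, h = 2.0, 1e-4

def m(t):
    """mgf of EXP(lambda) with E(Y) = lambda: m(t) = 1/(1 - lambda*t)."""
    return 1.0 / (1.0 - lam * t)

m1 = (m(h) - m(-h)) / (2 * h)            # ~ m'(0)  = lambda = 2
m2 = (m(h) - 2 * m(0) + m(-h)) / h**2    # ~ m''(0) = 2*lambda^2 = 8
var = m2 - m1**2                         # V(Y) = lambda^2 = 4
print(m1, m2, var)
```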

4.11 Complements

Many statistics departments offer a one semester graduate course in large sample theory. A nice review of large sample theory is Chernoff (1956). There are several PhD level texts on large sample theory including, in roughly increasing order of difficulty, Olive (2022), Lehmann (1999), Ferguson (1996), Sen and Singer (1993), and Serfling (1980). Cramér (1946) is also an important reference, and White (1984) considers asymptotic theory for econometric applications. The online text Hunter (2014) is useful. Also see DasGupta (2008), Davidson (1994), Jiang (2022), Polansky (2011), Sen, Singer, and Pedrosa De Lima (2010), and van der Vaart (1998).
More advanced topics for large sample theory can be found in Lukacs
(1970, 1975), Petrov (1995), Pollard (1984), and Shorack and Wellner (1986).
For some roughly Master’s level large sample theory (USA), see Bickel
and Doksum (1977, section 4.4), Casella and Berger (2002, section 5.5), Hoel,
Port, and Stone (1971, sections 8.2-8.4), Lehmann (1983, ch. 5), Olive (2014,
ch. 8), Rohatgi (1976, ch. 6), Rohatgi (1984, ch. 9), and Woodroofe (1975,
ch. 9).
Hoel, Port, and Stone (1971) has useful material on characteristic functions
and an interesting proof of the CLT.

4.12 Problems

4.1. Let Xn ∼ U (−n, n) have cdf Fn (x). Then limn Fn (x) = 0.5 for all real x. Does Xn →D X for some random variable X? Explain briefly.
4.2. Let Xn be a sequence of random variables such that P (Xn = 1/n) =
1. Does Xn converge in distribution? If yes, prove it by finding X and the
cdf of X. If no, prove it.
4.3. Suppose Xn has cdf

Fn (x) = 1 − (1 − x/(θn))^n

for x ≥ 0 and Fn (x) = 0 for x < 0. Show that Xn →D X by finding the cdf of X.
4.4. Suppose that Y1 , ..., Yn are iid with E(Y ) = (1 − ρ)/ρ and VAR(Y ) = (1 − ρ)/ρ² where 0 < ρ < 1. Find the limiting distribution of √n (Y n − (1 − ρ)/ρ).

4.5. Let X1 , ..., Xn be iid with cdf F (x) = P (X ≤ x). Let Yi = I(Xi ≤ x)
where the indicator equals 1 if Xi ≤ x and 0, otherwise.
a) Find E(Yi ).

b) Find VAR(Yi ).
c) Let F̂n (x) = (1/n) Σ_{i=1}^n I(Xi ≤ x) for some fixed real number x. Find the limiting distribution of √n (F̂n (x) − cx ) for an appropriate constant cx .
4.6. Let Xn ∼ Binomial(n, p) where the positive integer n is large and 0 < p < 1. Find the limiting distribution of √n (Xn /n − p).
4.7. Suppose Xn is a discrete random variable with P (Xn = n) = 1/n and P (Xn = 0) = (n − 1)/n.
a) Does Xn →D X? Explain briefly.
b) Does E(Xn ) → E(X)? Explain briefly.
4.8. Lemma 1 (from Billingsley (1986)): Let z1 , ..., zm and w1 , ..., wm be complex numbers of modulus at most 1. Then

|(z1 · · · zm ) − (w1 · · · wm )| ≤ Σ_{k=1}^m |zk − wk |.

Prove this lemma by induction using (z1 · · · zm ) − (w1 · · · wm ) = (z1 − w1 )(z2 · · · zm ) + w1 [(z2 · · · zm ) − (w2 · · · wm )]. Also, the modulus |z| acts much like the absolute value. Hence |z1 z2 | = |z1 ||z2 |, and |z1 + z2 | ≤ |z1 | + |z2 |.

4.9. The characteristic function for Y ∼ N (µ, σ²) is cY (t) = exp(itµ − t²σ²/2). Let Xn ∼ N (0, n).
a) Prove cXn (t) → h(t) ∀t by finding h(t).
b) Use a) to prove whether Xn converges in distribution.
4.10. X has a point mass at c or X is degenerate at c if P (X = c) = 1.
a) Find the characteristic function of X.
b) Suppose Xn is a sequence of random variables and cXn (t) → 1 ∀t as
n → ∞. Prove whether Xn converges in distribution.
4.11. Suppose X1 , ..., Xn are uncorrelated with E(Xi ) = µi and V (Xi ) = σi². Then E(X n ) = µ̄n = (1/n) Σ_{i=1}^n µi and V (X n ) = (1/n²) Σ_{i=1}^n σi² → 0 as n → ∞. Use Chebyshev’s inequality to prove (X n − µ̄n ) →P 0 as n → ∞.
4.12. If X ∼ C(0, 1), the Cauchy (0,1) distribution, then the characteristic function of X is ϕX (t) = e^{−|t|}.
a) If X1 , ..., Xn are iid C(0, 1), prove X n ∼ C(0, 1).
b) Prove X n →D X.
4.13. A proof for showing convergence in rth mean implies convergence in probability is given in this problem. If h(t) is an increasing function (at least on the range of W ), then P (W ≥ c) = P (h(W ) ≥ h(c)). Let ε > 0. Then P (|Xn − X| ≥ ε) = P (|Xn − X|^r ≥ ε^r). Now apply the Generalized Chebyshev’s Inequality to show that if Xn →r X, then P (|Xn − X| ≥ ε) → 0 as n → ∞.

4.14. For each n ∈ N, let Xn1 , ..., Xn,rn be independent RVs on probability space (Ωn , Fn , Pn ) with E(Xnk ) = µnk , V (Xnk ) = σnk², Tn = Σ_{k=1}^{rn} Xnk , E(Tn ) = µn = Σ_{k=1}^{rn} µnk , and V (Tn ) = σn² = Σ_{k=1}^{rn} σnk².
a) If vn > 0 and σn /vn → 0 as n → ∞, use Chebyshev’s inequality to prove

Pn (|(Tn − µn )/vn | ≥ ε) → 0

∀ ε > 0 as n → ∞.
b) Billingsley (1986, problem 6.5 slightly modified): Let A1 , A2 , ... be independent events with P (Ai ) = pi and p̄n = (1/n) Σ_{i=1}^n pi . Let Xnk = Xk = I_{Ak} and Tn = Σ_{k=1}^n Xk = Σ_{k=1}^n I_{Ak}. Let rn = n and Pn = P for all n. Use a) to prove

P [|n^{−1} Tn − p̄n | ≥ ε] → 0

for all ε > 0 as n → ∞.

 
4.15. Let Yn ∼ χ²n. Find the limiting distribution of √n (Yn /n − 1).
4.16. Suppose that X1 , ..., Xn are iid and that t is a function such that E(t(X1 )) = µt . Is there a constant c such that

(1/n) Σ_{i=1}^n t(Xi ) →P c ?

Explain briefly.
4.17. Let P (Xn = n) = 1.
a) Show FXn (x) → H(x) as n → ∞.
b) Let MXn (t) be the moment generating function of Xn . Find limn MXn (t)
for all t.
Hint: examine t < 0, t = 0, and t > 0.
c) Does Xn converge in distribution?
4.18. Suppose that X1 , ..., Xn are iid and V(X1 ) = σ². Given that

σ̂n² = (1/n) Σ_{i=1}^n (Xi − X)² →P σ²,

give a very short proof that the sample variance

Sn² = (1/(n − 1)) Σ_{i=1}^n (Xi − X)² →P σ².

4.19. Suppose X 1 , ..., X n are iid p × 1 random vectors from a multivariate t-distribution with parameters µ and Σ with d degrees of freedom. Then E(X i ) = µ and Cov(X i ) = [d/(d − 2)] Σ for d > 2. Assuming d > 2, find the limiting distribution of √n (X − c) for an appropriate vector c.
4.20. Suppose

Zn = √n (X n − µ)/σ →D N (0, 1)

and sn² →P σ² where σ > 0. Prove that

√n (X n − µ)/sn →D N (0, 1).
4.21. If Yn →D Y , an →P a, and bn →P b, then an + bn Yn →D X. Find X.
4.22. Let X 1 , ..., X n be iid k × 1 random vectors where E(X i ) = (λ1 , ..., λk )^T and Cov(X i ) = diag(λ1², ..., λk²), a diagonal k × k matrix with jth diagonal entry λj². The nondiagonal entries are 0. Find the limiting distribution of √n (X − c) for an appropriate vector c.
4.23. What theorem can be used to prove both the (usual) central limit
theorem and the Lyapounov CLT?
Exam and Quiz Problems
4.24. Let Yn ∼ binomial(n, p).
a) Find the limiting distribution of √n (Yn /n − p).
b) Find the limiting distribution of

√n (arcsin(√(Yn /n)) − arcsin(√p)).

Hint: (d/dx) arcsin(x) = 1/√(1 − x²).
4.25. Suppose Yn ∼ uniform(−n, n). Let Fn (y) be the cdf of Yn .
a) Find F (y) such that Fn (y) → F (y) for all y as n → ∞.
b) Does Yn →D Y ? Explain briefly.
4.26.
4.27. Suppose x1 , ..., xn are iid p × 1 random vectors where E(xi ) = e^{0.5} 1 and Cov(xi ) = (e² − e)I p . Find the limiting distribution of √n (x − c) for an appropriate vector c.
4.28. Assume that

√n [(β̂1 , β̂2 )^T − (β1 , β2 )^T ] →D N2 ((0, 0)^T , diag(σ1², σ2²)).

Find the limiting distribution of

√n [(β̂1 − β̂2 ) − (β1 − β2 )] = (1, −1) √n [(β̂1 , β̂2 )^T − (β1 , β2 )^T ].

4.29. Let X1 , ..., Xn be iid with mean E(X) = µ and variance V (X) = σ² > 0. Then n(X − µ)² = [√n (X − µ)]² →D W . What is W ?
4.30. Suppose that X1 , ..., Xn are iid N (µ, σ²).
a) Find the limiting distribution of √n (X n − µ).
b) Let g(θ) = [log(1 + θ)]². Find the limiting distribution of √n (g(X n ) − g(µ)) for µ > 0.
c) Let g(θ) = [log(1 + θ)]². Find the limiting distribution of n (g(X n ) − g(µ)) for µ = 0. Hint: Use the second order delta method.


4.31. Let Y1 , ..., Yn be iid double exponential DE(θ, λ) with E(Y ) = θ and V (Y ) = 2λ² where θ is real and λ > 0.
a) Find the limiting distribution of √n [Y − c] for an appropriate constant c.
b) Find the limiting distribution of √n [(Y )² − d] for an appropriate constant d for the values of θ where the delta method applies.
c) What is the limiting distribution of n [(Y )² − d] for the value or values of θ where the delta method does not apply?
4.32. Let Y1 , ..., Yn be iid with E(Y ^r) = exp(rµ + r²σ²/2) for any real r. Find the limiting distribution of √n (Y n − c) for an appropriate constant c.
4.33. Let Yn ∼ Poisson(nθ). Find the limiting distribution of √n (Yn /n − c) for an appropriate constant c.
4.34. Suppose Xn ∼ U (0, n). Does Xn →D X for some random variable X? Prove or disprove. If Xn →D X, find X.
4.35. Suppose Yn ∼ EXP (1/n) with cdf FYn (y) = 1 − exp(−ny) for y ≥ 0, and FYn (y) = 0 for y < 0. Does Yn →D Y for some random variable Y ? Prove or disprove. If Yn →D Y , find Y .
4.36. Suppose X1 , ..., Xn are iid from a distribution with mean µ and variance σ². Then (1/n) Σ_{i=1}^n Xi² →P c. What is c? Hint: Use the WLLN on Wi = Xi².

4.37. Rohatgi (1971, p. 248): Let P (Xn = 0) = 1 − 1/n^r and P (Xn = n) = 1/n^r where r > 0.
a) Prove that Xn does not converge in rth mean to 0. Hint: Find E[|Xn |^r].
b) Does Xn →D X for some random variable X? Prove or disprove.
Hint: P (|Xn − 0| ≥ ε) ≤ P (Xn = n).
4.38. Suppose Y1 , ..., Yn are iid EXP(λ). Let Tn = Y(1) = Y1:n = min(Y1 , ..., Yn ). It can be shown that the mgf of Tn is

mTn (t) = 1/(1 − λt/n)

for t < n/λ. Show that Tn →D X and give the distribution of X.
4.39. Suppose X 1 , ..., X n are iid 3 × 1 random vectors from a multinomial distribution with

E(X i ) = (mρ1 , mρ2 , mρ3 )^T

and

Cov(X i ) = [ mρ1 (1 − ρ1 )   −mρ1 ρ2        −mρ1 ρ3
             −mρ1 ρ2         mρ2 (1 − ρ2 )  −mρ2 ρ3
             −mρ1 ρ3         −mρ2 ρ3        mρ3 (1 − ρ3 ) ]

where m is a known positive integer and 0 < ρi < 1 with ρ1 + ρ2 + ρ3 = 1. Find the limiting distribution of √n (X − c) for an appropriate vector c.
4.40. Suppose Y n →P Y . Then W n = Y n − Y →P 0. Define X n = Y for all n. Then X n →D Y . Then Y n = X n + W n →D Z by Slutsky’s Theorem. What is Z?
4.41. If X ∼ Nk (µ, Σ), then the characteristic function of X is

cX (t) = exp(i t^T µ − t^T Σt/2)

for t ∈ R^k. Let a ∈ R^k and find the characteristic function of a^T X using c_{a^T X}(y) = E[exp(i y a^T X)] = cX (ya) for any y ∈ R. Simplify any constants.
4.42. Suppose

√n [(θ̂1 , ..., θ̂p )^T − (θ1 , ..., θp )^T ] →D Np (0, Σ).

Let θ = (θ1 , ..., θp )^T and let g(θ) = (e^{θ1}, ..., e^{θp})^T . Find D_g(θ) .
4.43. Let µi be the ith population mean and let Σ i be the nonsingular population covariance matrix of the ith population. Let xi,1 , ..., xi,ni be iid from the ith population. Let xi be the k × 1 sample mean from the xi,j , j = 1, ..., ni .
a) Find the limiting distribution of √ni (xi − µi ).
b) Assume there are p populations, n = Σ_{i=1}^p ni , and ni /n → πi where 0 < πi < 1 and 1 = Σ_{i=1}^p πi . Find the limiting distribution of √n (xi − µi ). Hint: √n = (√n/√ni )(√ni ).
4.44. Suppose Z n →D Np (µ, I). Let a be a p × 1 constant vector. Find the limiting distribution of a^T (Z n − µ).

4.45. Let x1 , ..., xn be iid with mean E(x) = µ and variance V (x) = σ² > 0. Then exp[√n (x − µ)] →D W . What is W ? Hint: use the continuous mapping theorem: if Z n →D X and g is continuous, then g(Z n ) →D g(X).

4.46. Let X1 , ..., Xn be independent and identically distributed (iid) from a N (µ, σ²) distribution. Let X = (1/n) Σ_{i=1}^n Xi .
a) Find the limiting distribution of √n (X − µ).
b) Find the limiting distribution of

√n [1/X − c]

for an appropriate constant c. You may assume µ ≠ 0.
4.47. Let Y1 , ..., Yn be independent and identically distributed (iid) from a distribution with probability density function

f(y) = 2y/θ²

for 0 < y ≤ θ and f(y) = 0, otherwise. Then E(Y ) = 2θ/3 and V (Y ) = θ²/18.
a) Find the limiting distribution of √n (Y − c) for an appropriate constant c.
b) Find the limiting distribution of √n (log(Y ) − d) for an appropriate constant d. Note: log(x) is ln(x) in this class.
4.48. Suppose that X1 , ..., Xn are iid N (µ, σ²).
a) Find the limiting distribution of √n (X n − µ).
b) Let g(θ) = [log(1 + θ)]². Find the limiting distribution of √n (g(X n ) − g(µ)) for µ > 0.
c) Let g(θ) = [log(1 + θ)]². Find the limiting distribution of n (g(X n ) − g(µ)) for µ = 0. Hint: use the Second Order Delta Method and find g′′(0).
4.49. Suppose

FXn (x) = 0 for x ≤ c − 1/n,
FXn (x) = (n/2)(x − c + 1/n) for c − 1/n < x < c + 1/n,
FXn (x) = 1 for x ≥ c + 1/n.

Does Xn →D X for some random variable X? Prove or disprove. If Xn →D X, find X.
4.50. Suppose Yn ∼ EXP (n) with cdf FYn (y) = 1 − exp(−y/n) for y ≥ 0 and FYn (y) = 0 for y < 0. Does Yn →D Y for some random variable Y ? Prove or disprove. If Yn →D Y , find Y .
4.51. Suppose that Y1 , ..., Yn are iid with E(Y ) = (1 − ρ)/ρ and VAR(Y ) = (1 − ρ)/ρ² where 0 < ρ < 1.
a) Find the limiting distribution of √n (Y n − (1 − ρ)/ρ).
b) Find the limiting distribution of √n (g(Y n ) − ρ) for an appropriate function g.

4.52. Let Xn ∼ Binomial(n, p) where the positive integer n is large and 0 < p < 1.
a) Find the limiting distribution of √n (Xn /n − p).
b) Find the limiting distribution of √n [(Xn /n)² − p²].
c) Let g(θ) = θ³ − θ. Find the limiting distribution of n (g(Xn /n) − c) for an appropriate constant c when p = 1/√3. Hint: Use the second order delta method.
4.53. Suppose Y1 , ..., Yn are iid P OIS(θ). Then the MLE of θ is θ̂n = Y n .
a) Find the limiting distribution of √n (Y n − c) for an appropriate constant c.
b) Let τ (θ) = θ². Find the limiting distribution of √n [τ (θ̂n ) − τ (θ)] using the Delta Method.
4.54. Let Xn be a sequence of random variables with cdfs Fn and mgfs mn . Let X be a random variable with cdf F and mgf m. Assume that all of the mgfs mn and m are defined on |t| ≤ d for some d > 0. Let

mn (t) = 1/[1 − (λ + 1/n)t]

for t < 1/(λ + 1/n). Show that mn (t) → m(t) by finding m(t). (Then Xn →D X where X ∼ EXP (λ) with E(X) = λ by the continuity theorem for mgfs.)
4.55. Suppose X 1 , ..., X n are iid k × 1 random vectors where E(X i ) = (µ1 , ..., µk )^T and Cov(X i ) = diag(σ1², ..., σk²), a diagonal k × k matrix with jth diagonal entry σj². The nondiagonal entries are 0. Find the limiting distribution of √n (X − c) for an appropriate vector c.
4.56. Suppose Yn →P Y . Then Wn = Yn − Y →P 0. Define Xn = Y for all n. Then Xn →D Y . Then Yn = Xn + Wn →D Z by Slutsky’s Theorem. What is Z?
4.57. The method of moments estimator for Cov(X, Y ) = σX,Y is

σ̂X,Y = (1/n) Σ_{i=1}^n (xi − x)(yi − y).

Another common estimator is

SX,Y = (1/(n − 1)) Σ_{i=1}^n (xi − x)(yi − y) = [n/(n − 1)] σ̂X,Y .

Using the fact that σ̂X,Y →P σX,Y when the covariance exists, prove that SX,Y →P σX,Y with Slutsky’s Theorem. Hint: Zn →P c iff Zn →D c if c is a constant, and usual convergence an → a of a sequence of constants implies an →P a.
4.58. Suppose that the characteristic function of X n is

cX n (t) = exp(−t²σ²/(2n)).

Then the characteristic function of √n X n is c_{√n X n}(t) = cX n (√n t). Does √n X n →D W for some random variable W ? Explain.
4.59. Suppose that β is a p × 1 vector and that √n (β̂ n − β) →D Np (0, C) where C is a p × p nonsingular matrix. Let A be a j × p matrix with full rank j. Suppose that Aβ = 0.
a) What is the limiting distribution of √n Aβ̂ n ?
b) What is the limiting distribution of Z n = √n [ACA^T ]^{−1/2} Aβ̂ n ? Hint: for a square symmetric nonsingular matrix D, we have D^{1/2} D^{1/2} = D, and D^{−1/2} D^{−1/2} = D^{−1}, and D^{−1/2} and D^{1/2} are both symmetric.
c) What is the limiting distribution of Z n^T Z n = n β̂ n^T A^T [ACA^T ]^{−1} Aβ̂ n ? Hint: If Z n →D Z ∼ Nk (0, I) then Z n^T Z n →D Z^T Z ∼ χ²k.
4.60. Suppose

√n [(σ̂1², ..., σ̂p²)^T − (σ1², ..., σp²)^T ] →D Np (0, Σ).

Let θ = (σ1², ..., σp²)^T and let g(θ) = (√σ1², ..., √σp²)^T . Find D_g(θ) .
4.61. Suppose

√n [(σ̂1 , ..., σ̂p )^T − (σ1 , ..., σp )^T ] →D Np (0, Σ).

Let θ = (σ1 , ..., σp )^T and let g(θ) = ((σ1 )², ..., (σp )²)^T . Find D_g(θ) .
 
4.62. Let wB ∼ Np (0, Σ/B). Then wB →D w as B → ∞. Find w.
4.63. Let x1 , ..., xn be iid with mean E(x) = µ and variance V (x) = σ² > 0. Then Σ_{i=1}^n (xi − x n )² = Σ_{i=1}^n (xi − µ + µ − x n )² = Σ_{i=1}^n (xi − µ)² − n(x − µ)².
a) (1/n) Σ_{i=1}^n (xi − µ)² →P θ. What is θ?
b) n(x − µ)² = [√n (x − µ)]² →D W . What is W ?
4.64. Suppose Z n →D Nk (µ, I). Let A be a constant r × k matrix. Find the limiting distribution of A(Z n − µ).

4.65. Suppose x1 , ..., xn are iid p × 1 random vectors where

xi ∼ (1 − γ)Np (µ, Σ) + γNp (µ, cΣ)

with 0 < γ < 1 and c > 0. Then E(xi ) = µ and Cov(xi ) = [1 + γ(c − 1)]Σ. Find the limiting distribution of √n (x − d) for an appropriate vector d.
4.66. Let Σ i be the nonsingular population covariance matrix of the ith treatment group or population. To simplify the large sample theory, assume ni = πi n where 0 < πi < 1 and Σ_{i=1}^3 πi = 1. Let Ti be a multivariate location estimator such that

√ni (Ti − µi ) →D Nm (0, Σ i ), and √n (Ti − µi ) →D Nm (0, Σ i /πi ) for i = 1, 2, 3.

Assume the Ti are independent. Then

√n (T1 − µ1 , T2 − µ2 , T3 − µ3 )^T →D u.

a) Find the distribution of u.
b) Suggest an estimator π̂i of πi .
4.67. Let X1 , ..., Xn be independent and identically distributed (iid) from a Poisson(λ) distribution with E(X) = λ. Let X = Σ_{i=1}^n Xi /n.
a) Find the limiting distribution of √n (X − λ).
b) Find the limiting distribution of √n [(X)³ − (λ)³].
4.68. Let X1 , ..., Xn be iid from a normal distribution with unknown mean µ and known variance σ². Find the limiting distribution of √n ((X)³ − c) for an appropriate constant c.
4.69. Let X1 , ..., Xn be a random sample from a population with pdf

f(x) = θ x^{θ−1} / 3^θ for 0 < x < 3, and f(x) = 0 elsewhere.

The method of moments estimator for θ is Tn = X/(3 − X). Find the limiting distribution of √n (Tn − θ) as n → ∞.
4.70. Let Yn ∼ χ²n.
a) Find the limiting distribution of √n (Yn /n − 1).
b) Find the limiting distribution of √n [(Yn /n)³ − 1].
4.71. Let Y1 , ..., Yn be iid with E(Y ) = µ and V (Y ) = σ 2 . Let g(µ) = µ2 .
For µ = 0, find the limiting distribution of n[(Y n )2 − 02 ] = n(Y n )2 by using
the Second Order Delta Method.

4.72. In earlier courses, you should have used moment generating functions to show that if Yn = Σ_{i=1}^n Xi where the Xi are iid from a nice distribution, then Yn has a nice distribution, where the nice distributions are the binomial, chi-square, gamma, negative binomial, normal, and Poisson distributions. If E(X) = µ and V(X) = σ², then by the CLT

√n (X n − µ) →D N (0, σ²).

Since √n (Yn /n − µ) and √n (X n − µ) have the same distribution,

√n (Yn /n − µ) →D N (0, σ²).

For example, if Yn ∼ N (nµ, nσ²) then Yn ∼ Σ_{i=1}^n Xi where the Xi are iid N (µ, σ²). Hence

√n (Yn /n − µ) ∼ √n (X n − µ) →D N (0, σ²),

which should not be surprising since

√n (Yn /n − µ) ∼ N (0, σ²).

Write down the distribution of Xi if
i) Yn ∼ BIN(n, p) where BIN stands for binomial.
ii) Yn ∼ χ²n.
iii) Yn ∼ G(nα, β) where G stands for gamma.
iv) Yn ∼ N B(n, p) where NB stands for negative binomial.
v) Yn ∼ P OIS(nθ) where POIS stands for Poisson.
(Write down the distribution if you know it or can find it. Do not use mgfs unless you cannot find the distribution.)
4.73. Suppose that Xn ∼ U (−1/n, 1/n).
a) What is the cdf Fn (x) of Xn ?
b) What does Fn (x) converge to? (Hint: consider x < 0, x = 0 and
x > 0.)
c) Xn →D X. What is X?
4.74. Suppose X1 , ..., Xn are iid from a distribution with E(X^k) = 2θ^k /(k + 2). Find the limiting distribution of √n (X n − c) for an appropriate constant c.

4.75. Suppose Xn is a discrete random variable with P (Xn = n) = 1/n and P (Xn = 0) = (n − 1)/n.
a) Show Xn →D X.

b) Does E(Xn ) → E(X)? Explain briefly.


4.76. Suppose Xn has cdf

Fn (x) = 1 − (1 − x/(θn))^n

for x ≥ 0 and Fn (x) = 0 for x < 0. Show that Xn →D X by finding the cdf of X.
4.77. Let Wn = Xn − X and let r > 0. Notice that for any ε > 0,

E|Xn − X|^r ≥ E[|Xn − X|^r I(|Xn − X| ≥ ε)] ≥ ε^r P (|Xn − X| ≥ ε).

Show that Wn →P 0 if E|Xn − X|^r → 0 as n → ∞.

4.78. Rohatgi (1971, p. 248): Let P (Xn = 0) = 1 − 1/n^r and P (Xn = n) = 1/n^r where r > 0.
a) Prove that Xn does not converge in rth mean to 0. Hint: Find E[|Xn |^r].
b) Does Xn →D X for some random variable X? Prove or disprove.
4.79. Suppose X1 , ..., Xn are iid C(µ, σ) with characteristic function cX (t) = exp(itµ − |t|σ) where exp(a) = e^a.
a) Find the characteristic function cTn (t) of Tn = Σ_{i=1}^n Xi .
b) Find the characteristic function of X n = Tn /n.
c) Does X n →D W for some RV W ? Explain.

4.80. Suppose X1 , ..., Xn are iid from a distribution with mean µ and variance σ². The method of moments estimator for σ² is

SM² = (1/n) Σ_{i=1}^n (Xi − X n )² = (1/n) Σ_{i=1}^n Xi² − (X n )².

a) (1/n) Σ_{i=1}^n Xi² →P c. What is c? Hint: Use the WLLN on Wi = Xi².
b) (X n )² →P d. What is d? Hint: g(x) = x² is continuous, so if Zn →P θ, then g(Zn ) →P g(θ).
c) Show SM² →P σ².
d) S² = [n/(n − 1)] SM² = (1/(n − 1)) Σ_{i=1}^n (Xi − X n )². Prove S² →P σ².
4.81. Suppose X 1 , ..., X n are iid k × 1 random vectors where E(X i ) = (µ1 , ..., µk )^T and Cov(X i ) = (1 − α)I + α11^T , where I is the k × k identity matrix, 1 = (1, 1, ..., 1)^T , and −(k − 1)^{−1} < α < 1. Find the limiting distribution of √n (X − c) for an appropriate vector c.
4.82. Suppose Xn are random variables with characteristic functions cXn (t), and that cXn (t) → e^{itc} for every t ∈ R where c is a constant. Does Xn →D X for some random variable X? Explain briefly. Hint: Is the function g(t) = e^{itc} continuous at t = 0? Is there a random variable that has characteristic function g(t)?
4.83. The characteristic function for Y ∼ N (µ, σ 2 ) is
cY (t) = exp(itµ − t2 σ 2 /2). Let Xn ∼ N (0, n).
a) Prove cXn (t) → h(t) ∀t by finding h(t).
b) Use a) to prove whether Xn converges in distribution.
4.84. Suppose

Zn = √n (X n − µ)/σ →D N (0, 1)

and sn² →P σ² where σ > 0. Prove that

√n (X n − µ)/sn →D N (0, 1).
4.85. Show the usual Delta Method is a special case of the Multivariate
Delta Method if g is a real function (d = 1), Tn is a random variable, θ is a
scalar and Σ = σ 2 is a scalar (k = 1).
4.86. Let X be a k × 1 random vector and X n be a sequence of k × 1 random vectors and suppose that

t^T X n →D t^T X

for all t ∈ R^k. Does X n →D X? Explain briefly.
4.87. Suppose the k × 1 random vector X n →D Nk (µ, Σ). Hence the asymptotic distribution of X n is the multivariate normal MVN Nk (µ, Σ) distribution. Find the d, µ̃ and Σ̃ for the following problem. Let C^T be the transpose of C. Let C be an m × k matrix; then C X n →D Nd (µ̃, Σ̃).
4.88. Suppose X n are k × 1 random vectors with characteristic functions
cX n (t). Does cX n (0) → a for some constant a? Prove or disprove. Here 0 is
a k × 1 vector of zeroes.
4.89. Suppose

√n [(λ̂, η̂^T )^T − (λ, η^T )^T ] →D Np+1 (0, Σ)

where λ is a scalar, η = (η1 , ..., ηp )^T , and Σ has the partitioned form

Σ = [ Σλ    Σλη
      Σηλ   Ση ].

Let g((λ, η^T )^T ) = λη = (λη1 , ..., ληp )^T . Then

√n (λ̂η̂ − λη) →D Np (0, D_g(θ) Σ D_g(θ)^T )

by the Multivariate Delta Method.
a) Find D_g(θ) .
b) Let A be a k × p full rank constant matrix with k ≤ p and 0 = Aη. Find A D_g(θ) .
Note: then √n (Aλ̂η̂ − 0) →D Nk (0, A D_g(θ) Σ D_g(θ)^T A^T ).
4.90. Suppose

√n [(σ̂1², ..., σ̂p²)^T − (σ1², ..., σp²)^T ] →D Np (0, Σ).

Let θ = (σ1², ..., σp²)^T and let g(θ) = (log(σ1²), ..., log(σp²))^T . Find D_g(θ) .
4.91. It is true that Wn has the same order as Xn in probability, written Wn ≍P Xn , iff for every ε > 0 there exist positive constants Nε and 0 < dε < Dε such that

P (dε ≤ |Wn /Xn | ≤ Dε ) ≥ 1 − ε

for all n ≥ Nε .
a) Show that if Wn ≍P Xn then Xn ≍P Wn .
b) Show that if Wn ≍P Xn then Wn = OP (Xn ).
c) Show that if Wn ≍P Xn then Xn = OP (Wn ).
d) Show that if Wn = OP (Xn ) and if Xn = OP (Wn ), then Wn ≍P Xn .
4.92. This problem will prove the following Theorem which says that if
there are K estimators Tj,n of a parameter β, such that ‖Tj,n − β‖ = OP(n^−δ)
where 0 < δ ≤ 1, and if Tn* picks one of these estimators, then ‖Tn* − β‖ =
OP(n^−δ).
Lemma: Pratt (1959). Let X1,n, ..., XK,n each be OP(1) where K is
fixed. Suppose Wn = Xin,n for some in ∈ {1, ..., K}. Then

Wn = OP(1).    (4.20)

Proof. Let ε > 0. Then
P(max{X1,n, ..., XK,n} ≤ x) = P(X1,n ≤ x, ..., XK,n ≤ x) ≤
FWn(x) ≤ P(min{X1,n, ..., XK,n} ≤ x) = 1 − P(X1,n > x, ..., XK,n > x).
Since K is finite, there exists B > 0 and N such that P(Xi,n ≤ B) > 1 − ε/2K
and P(Xi,n > −B) > 1 − ε/2K for all n > N and i = 1, ..., K. Bonferroni's
inequality states that P(∩_{i=1}^K Ai) ≥ Σ_{i=1}^K P(Ai) − (K − 1). Thus
FWn(B) ≥ P(X1,n ≤ B, ..., XK,n ≤ B) ≥ K(1 − ε/2K) − (K − 1) = K − ε/2 − K + 1 = 1 − ε/2
and
−FWn(−B) ≥ −1 + P(X1,n > −B, ..., XK,n > −B) ≥
−1 + K(1 − ε/2K) − (K − 1) = −1 + K − ε/2 − K + 1 = −ε/2.
Hence
FWn(B) − FWn(−B) ≥ 1 − ε for n > N. QED
Theorem. Suppose ‖Tj,n − β‖ = OP(n^−δ) for j = 1, ..., K where 0 < δ ≤
1. Let Tn* = Tin,n for some in ∈ {1, ..., K} where, for example, Tin,n is the
Tj,n that minimized some criterion function. Then

‖Tn* − β‖ = OP(n^−δ).    (4.21)
Prove the above theorem using the Lemma with an appropriate Xj,n .
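The Lemma can also be checked numerically. The sketch below is an illustration only, not part of the required proof; the estimators and the selection rule are hypothetical. It builds K = 3 estimators of β = 0, each with error OP(n^(−1/2)), selects one by a worst-case rule, and confirms that √n times the selected error stays bounded as n grows.

```python
import math
import random

rng = random.Random(0)

def estimators(n):
    # K = 3 hypothetical estimators of beta = 0, each O_P(n^(-1/2)):
    # sample means over the full sample and over each half of an iid
    # Uniform(-0.5, 0.5) sample (mean 0, variance 1/12).
    x = [rng.random() - 0.5 for _ in range(n)]
    h = n // 2
    return [sum(x) / n, sum(x[:h]) / h, sum(x[h:]) / (n - h)]

q95 = []
for n in [100, 1600, 25600]:
    reps = 100
    # T*_n picks the estimator with the largest error (a worst-case
    # selection rule i_n); by the Lemma, sqrt(n)|T*_n - 0| is still O_P(1).
    q = sorted(math.sqrt(n) * max(abs(t) for t in estimators(n))
               for _ in range(reps))
    q95.append(q[94])            # empirical 95th percentile
    print(n, round(q95[-1], 2))
```

The printed 95th percentiles do not grow with n, which is what Wn = OP(1) predicts.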
4.93. Let W ∼ N(µW, σW^2) and let X ∼ Np(µ, Σ). The characteristic
function of W is
ϕW(y) = E(e^{iyW}) = exp(iyµW − y^2 σW^2 / 2).
Suppose W = t^T X. Then W ∼ N(µW, σW^2). Find µW and σW^2. Then the
characteristic function of X is
ϕX(t) = E(e^{i t^T X}) = ϕW(1).
Use these results to find ϕX(t).


4.94. Suppose X1, ..., Xn are iid k × 1 random vectors where E(Xi) =
1 = (1, ..., 1)^T and Cov(Xi) = Ik = diag(1, ..., 1), the k × k identity matrix.
Find the limiting distribution of √n(X̄ − c) for appropriate vector c.
4.95. Suppose FXn(x) = 0 for x ≤ c − 1/n; FXn(x) = (n/2)(x − c + 1/n)
for c − 1/n < x < c + 1/n; and FXn(x) = 1 for x ≥ c + 1/n.
Does Xn →D X for some random variable X? Prove or disprove. If Xn →D X,
find X.
4.96. Suppose X1, ..., Xn are iid from a distribution with E(X^k) = Γ(3 −
k)/(6λ^k) for integer k < 4. Recall that Γ(n) = (n − 1)! for integers n ≥ 1. Find
the limiting distribution of √n(X̄n − c) for appropriate constant c.
4.97. Suppose Xn is a discrete random variable with P(Xn = n) = 1/n
and P(Xn = 0) = (n − 1)/n. Does Xn →D X? Explain.

 
4.98. Let Xn ∼ Poisson(nθ). Find the limiting distribution of √n(Xn/n − θ).

4.99. Let Y1, ..., Yn be iid Gamma(θ, θ) random variables with E(Yi) = θ^2
and V(Yi) = θ^3 where θ > 0. Find the limiting distribution of √n(Ȳn − c)
for appropriate constant c.
4.100. Let Xn = √n with probability 1/n and Xn = 0 with probability
1 − 1/n.
(Xn = √n I[0,1/n] wrt U(0,1) probability.)
a) Prove that Xn →1 0.
b) Does Xn →2 0? Prove or disprove.
4.101. Suppose Xn ∼ U(c − 1/n, c + 1/n). Does Xn →D X for some random
variable X? Prove or disprove. (If Y ∼ U(θ1, θ2), then the cdf of Y is F(y) =
(y − θ1)/(θ2 − θ1) for θ1 ≤ y ≤ θ2.)
4.102. Suppose X1, ..., Xn are iid with E(Xi) = 0 but Cov(Xi) does not
exist. Does X̄n →P c for some constant vector c? Explain briefly.
4.103. Suppose Xn →D X and Yn − Xn →P 0. Does Yn →D W for some
random vector W? [Hint: Yn = Xn + (Yn − Xn).]
4.104. Let Xn ∼ N(0, σn^2) where σn^2 → ∞ as n → ∞. Let Φ(x) be the cdf
of a N(0, 1) RV. Then the cdf of Xn is Fn(x) = Φ(x/σn).
a) Find F (x) such that Fn (x) → F (x) for all real x.
b) Does Xn →D X? Explain briefly.
4.105. Define when a sequence of random variables Xn converges in prob-
ability to a random variable X.
4.106. Suppose X1, ..., Xn are iid C(µ, σ) with characteristic function
ϕX(t) = exp(itµ − |t|σ) where exp(a) = e^a.
a) Find the characteristic function ϕTn(t) of Tn = Σ_{i=1}^n Xi.
b) Find the characteristic function of X̄n = Tn/n.
c) Does X̄n →D W for some RV W? Explain.

4.107. Let P(Xn = 1) = 1/n and P(Xn = 0) = 1 − 1/n.
a) Find P(|Xn| ≥ ε) for 0 < ε ≤ 1.
(Note that P(|Xn| ≥ ε) = 0 for ε > 1.)
b) Does Xn converge in probability? Explain.


4.108. Let P(Xn = 0) = 1 − 1/n and P(Xn = 1) = 1/n. Prove Xn →2 0
by showing E[(Xn − 0)^2] → 0 as n → ∞.
4.109. Let Yn and Y be random variables such that Yn = Y with probability
1 − pn and Yn = n with probability pn where pn → 0. Prove or disprove:
Yn →D Y.
4.110. a) If X ∼ Nk(µ, Σ), then the characteristic function of X is
ϕX(t) = exp( i t^T µ − (1/2) t^T Σ t )
for t ∈ R^k. Let a ∈ R^k and find the characteristic function of a^T X:
ϕ_{a^T X}(y) = E[exp(i y a^T X)] = ϕX(t) for any y ∈ R and some vector
t ∈ R^k that depends on y. Simplify any constants.
b) Suppose X = c for some constant vector c ∈ R^k. Prove c ∼ Nk(c, 0)
where 0 is the k × k matrix of zeroes. Hint: find the characteristic function
of X where P(X = c) = 1, and compare to the characteristic function given
in a).
4.111.
4.112.
4.113.
4.114.
4.115.
4.116.
4.117.
4.118.
4.119.
Some Qual Type Problems
4.120Q . a) Suppose that Xn ∼ U (−1/n, 1/n). Prove whether or not Xn
converges in distribution to a random variable X.
b) Suppose Yn ∼ U(0, n). Prove whether or not Yn converges in distribution
to a random variable Y.
4.121Q . State and prove Generalized Chebyshev’s Inequality = General-
ized Markov’s Inequality.
4.122Q . State a) the SLLN and b) the WLLN. c) Prove the WLLN for
the special case where V (Yi ) = σ 2 .
4.123Q . Prove the following theorem.
Theorem 4.6: If Xn →r X, then Xn →k X where 0 < k < r.
4.124Q . Prove the following theorem.
Theorem 4.7. If Xn →r X, then Xn →P X.
4.125Q. State and prove the Central Limit Theorem.
4.126Q . State and prove the Continuous Mapping theorem.
4.127Q . State and prove the Cramér Wold Device.
4.128Q . State and prove the multivariate central limit theorem.
4.129Q. Suppose that xn is independent of yn for n = 1, 2, .... Suppose
xn →D x and yn →D y where x is independent of y. Prove that
(xn^T, yn^T)^T →D (x^T, y^T)^T.

4.130Q. Prove whether the following sequences of random variables Xn
converge in distribution to some random variable X. If Xn →D X, find the
distribution of X (for example, find FX(t) or note that P(X = c) = 1, so X
has the point mass distribution at c).
a) Xn ∼ U (−n − 1, −n)
b) Xn ∼ U (n, n + 1)
c) Xn ∼ U (an , bn ) where an → a < b and bn → b.
d) Xn ∼ U (an , bn ) where an → c and bn → c.
e) Xn ∼ U (−n, n)
f) Xn ∼ U (c − 1/n, c + 1/n)
4.131Q. a) Let P(Xn = n) = 1/n and P(Xn = 0) = 1 − 1/n.
i) Determine whether Xn →1 0.
ii) Determine whether Xn →P 0.
iii) Determine whether Xn →D 0.
b) Let P(Xn = 0) = 1 − 1/n and P(Xn = 1) = 1/n.
i) Determine whether Xn →2 0.
ii) Determine whether Xn →1 0.
iii) Determine whether Xn →P 0.
iv) Determine whether Xn →D 0.

4.132Q . Prove the following theorem.


Theorem 4.3. a) Suppose Xn and X are RVs with the same probability
space. If Xn →P X, then Xn →D X.
b) Xn →P τ(θ) iff Xn →D τ(θ).
4.133Q. a) Let Xn ∼ Binomial(n, p) where the positive integer n is large
and 0 < p < 1. Find the limiting distribution of √n(Xn/n − p).
b) Let X1 , ..., Xn be iid with cdf F (x) = P (X ≤ x). Let Yi = I(Xi ≤ x)
where the indicator equals 1 if Xi ≤ x and 0, otherwise.
i) Find E(Yi ).
ii) Find VAR(Yi ).
iii) Let F̂n(x) = (1/n) Σ_{i=1}^n I(Xi ≤ x) for some fixed real number x. Find the
limiting distribution of √n(F̂n(x) − cx) for an appropriate constant cx.
c) Suppose X1, ..., Xn are iid p × 1 random vectors from a multivariate
t-distribution with parameters µ and Σ with d degrees of freedom. Then
E(Xi) = µ and Cov(Xi) = (d/(d − 2)) Σ for d > 2. Assuming d > 2, find the
limiting distribution of √n(X̄ − c) for appropriate vector c.
d) Let Y1, ..., Yn be iid with E(Y^r) = exp(rµ + r^2 σ^2/2) for any real r. Find
the limiting distribution of √n(Ȳn − c) for appropriate constant c.
4.134Q. Billingsley (1986, problem 27.4 a) modified slightly): For each
n ∈ N, let Wnk be independent with E(Wnk) = 0, V(Wnk) = σnk^2, and
sn^2 = Σ_{k=1}^{rn} σnk^2. Suppose |Wnk| ≤ Mn wp1 and Mn/sn → 0. Verify that
Lyapounov's condition holds.
Hint: |Wnk|^{2+δ} ≤ Mn^δ Wnk^2 wp1 for δ > 0. Take expectations of both sides.
4.135Q. Billingsley (1986, 27.4 b) modified slightly): For each n ∈ N, let
Wnk be independent with E(Wnk) = 0, V(Wnk) = σnk^2, and sn^2 = Σ_{k=1}^{rn} σnk^2.
Suppose |Wnk| ≤ Mn wp1 and Mn/sn → 0. Verify that Lindeberg's condition
holds. Show directly: do not use the fact that if Lyapounov's condition holds,
then Lindeberg's condition holds.
4.136Q. Suppose the Xi are independent Ber(pi) ∼ bin(m = 1, pi) random
variables with E(Xi) = pi, V(Xi) = pi qi, qi = 1 − pi, and Σ_{i=1}^∞ pi qi = ∞.
Prove that
Zn = (Σ_{i=1}^n Xi − Σ_{i=1}^n pi) / (Σ_{i=1}^n pi qi)^{1/2} →D N(0, 1)
as n → ∞.
4.137Q . Prove Lyapounov’s CLT.
4.138Q. Let rn = n and Wnk = Wk . If there is a constant c > 0 such that
P (|Wk | < c) = 1 ∀k, and if sn → ∞ as n → ∞, prove that Lindeberg’s CLT
holds.
4.139Q. Let rn = n and let the Wnk = Wk be iid with E(Wk ) = 0, and
V (Wk ) = σ 2 ∈ (0, ∞). Prove that Lindeberg’s CLT holds. (Taking Wi =
Xi − µ proves the usual CLT with the Lindeberg CLT.)
Chapter 5
Conditional Probability and Conditional
Expectation

The Radon-Nikodym theorem is used to prove the existence of the condi-


tional probability P (A|G) and of the conditional expectation E(X|G). The
conditional probability can be regarded as a special case of conditional ex-
pectation.

5.1 Conditional Probability

Definition 5.1. Let µ and ν be measures on (Ω, F ). Then ν is absolutely


continuous wrt µ if for each A ∈ F, µ(A) = 0 ⇒ ν(A) = 0, written ν << µ.

Theorem 5.1, Radon-Nikodym Theorem: If µ and ν are σ-finite measures
such that ν << µ, then there exists a measurable, nonnegative f, a
density, such that ν(A) = ∫_A f dµ for all A ∈ F. For two such densities f and
g, µ[f ≠ g] = 0. Hence f = g µ ae.

Definition 5.2. The density f = dν/dµ is called the Radon-Nikodym derivative
of ν wrt µ. Note that ν(A) = ∫_A (dν/dµ) dµ = ∫_A dν for all A ∈ F.
Definition 5.3. Fix A ∈ F and let the σ-field G ⊆ F. A conditional
probability of A given G is an f = P[A|G] that is i) measurable G and
integrable, and ii) ∫_G P[A|G] dP = E[P(A|G) IG] = P(A ∩ G) for any G ∈ G.

Remark 5.1. i) Note that f = P [A|G] is a random variable wrt G.


ii) 0 ≤ P [A|G] ≤ 1 wp1.
iii) There are many such RVs P [A|G] satisfying Definition 5.3, but any two
of them are equal wp1. A specific such RV is called a version of P [A|G].

5.2 Conditional Expectation

Definition 5.4. Let E(X) exist on (Ω, F, P), and let the σ-field G ⊆ F. A
conditional expectation of X given G is a f = E[X|G] that is i) measurable
G and integrable, and ii) ∫_G E[X|G] dP = E[E(X|G) IG] = E[X IG] =
∫_G X dP for any G ∈ G.
Remark 5.2. i) Note that f = E[X|G] is a random variable wrt G.
ii) There are many such RVs E[X|G] satisfying Definition 5.4, but any two
of them are equal wp1. A specific such RV is called a version of E[X|G].
iii) Fix A ∈ F. If X = IA , then E[IA |G] is a version of P [A|G].
iv) Since G ⊆ F, often X is not measurable G. Then X is not a version of
E[X|G]. If X is measurable G, then X is a version of E[X|G].
Theorem 5.2. If X is measurable G and Y and XY are integrable, then
E[XY |G] = XE[Y |G] wp1. That is, XE[Y |G] is a version of E[XY |G].
Theorem 5.3. Let X, Y , and Xn be integrable. Let a and b be constants.
i) If X = a wp1, then E[X|G] = a wp1.
ii) E[(aX + bY )|G] = aE[X|G] + bE[Y |G] wp1.
iii) If X ≤ Y wp1, then E[X|G] ≤ E[Y |G] wp1.
iv) |E[X|G]| ≤ E[|X| | G] wp1.
v) If limn Xn = X wp1, |Xn | ≤ Y , and Y is integrable, then limn E[Xn |G] =
E[X|G] wp1.
Theorem 5.4. If X is integrable and σ−fields G1 ⊆ G2 ⊆ F, then
E(E[X|G2 ]|G1 ) = E[X|G1 ] wp1.
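On a finite probability space, the defining property ii) of Definition 5.4 and the identity E(E[X|G]) = E(X) (Theorem 5.4 with G1 the trivial σ-field) can be verified by direct computation. The sketch below uses a hypothetical example: Ω = {1, ..., 6} with the uniform (fair die) probability, G generated by the partition {1, 2}, {3, 4, 5, 6}, and X(ω) = ω.

```python
from fractions import Fraction as F

omega = list(range(1, 7))
P = {w: F(1, 6) for w in omega}          # fair die
partition = [{1, 2}, {3, 4, 5, 6}]       # generates the sub-sigma-field G
X = {w: F(w) for w in omega}

def cond_exp(X):
    # On a finite space, a version of E[X|G] is constant on each partition
    # block B, with value E[X I_B] / P(B).
    out = {}
    for B in partition:
        pB = sum(P[w] for w in B)
        val = sum(X[w] * P[w] for w in B) / pB
        for w in B:
            out[w] = val
    return out

EXG = cond_exp(X)
print(EXG[1], EXG[6])   # 3/2 on {1,2} and 9/2 on {3,4,5,6}

# Defining property ii): for each generating block B (so for each G in G),
# the integral of E[X|G] over B equals the integral of X over B.
for B in partition:
    assert sum(EXG[w] * P[w] for w in B) == sum(X[w] * P[w] for w in B)

# With G1 trivial: E(E[X|G]) = E(X) = 7/2.
assert sum(EXG[w] * P[w] for w in omega) == F(7, 2)
```

Exact rational arithmetic (fractions) makes the equalities hold exactly rather than up to rounding.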

5.3 Summary

154) Let µ and ν be measures on (Ω, F ). Then ν is absolutely continuous


wrt µ if for each A ∈ F, µ(A) = 0 ⇒ ν(A) = 0, written ν << µ.
155) Radon-Nikodym Theorem: If µ and ν are σ-finite measures such
that ν << µ, then there exists a measurable, nonnegative f, a density, such
that ν(A) = ∫_A f dµ for all A ∈ F. For two such densities f and g, µ[f ≠
g] = 0. Hence f = g µ ae.

156) The density f = dν/dµ is called the Radon-Nikodym derivative of ν wrt
µ. Note that ν(A) = ∫_A (dν/dµ) dµ = ∫_A dν for all A ∈ F.
157) The Radon-Nikodym Theorem is used to prove the existence of the
conditional probability P (A|G) and of the conditional expectation E(X|G).
See points 158) and 160).
158) Fix A ∈ F and let the σ-field G ⊆ F. A conditional probability
of A given G is an f = P[A|G] that is i) measurable G and integrable, and
ii) ∫_G P[A|G] dP = E[P(A|G) IG] = P(A ∩ G) for any G ∈ G.
159) i) Note that f = P [A|G] is a random variable wrt G.


ii) 0 ≤ P [A|G] ≤ 1 wp1.
iii) There are many such RVs P [A|G] satisfying 158), but any two of them
are equal wp1. A specific such RV is called a version of P [A|G].
160) Let E(X) exist on (Ω, F, P), and let the σ-field G ⊆ F. A conditional
expectation of X given G is a f = E[X|G] that is i) measurable G
and integrable, and ii) ∫_G E[X|G] dP = E[E(X|G) IG] = E[X IG] = ∫_G X dP
for any G ∈ G.
161) i) Note that f = E[X|G] is a random variable wrt G.
ii) There are many such RVs E[X|G] satisfying 160), but any two of them
are equal wp1. A specific such RV is called a version of E[X|G].
iii) Fix A ∈ F. If X = IA, then E[IA|G] is a version of P[A|G].
iv) Since G ⊆ F, often X is not measurable G. Then X is not a version of
E[X|G]. If X is measurable G, then X is a version of E[X|G].
162) Theorem: If X is measurable G and Y and XY are integrable, then
E[XY |G] = XE[Y |G] wp1. That is, XE[Y |G] is a version of E[XY |G].
163) Theorem: Let X, Y , and Xn be integrable. Let a and b be constants.
i) If X = a wp1, then E[X|G] = a wp1.
ii) E[(aX + bY )|G] = aE[X|G] + bE[Y |G] wp1.
iii) If X ≤ Y wp1, then E[X|G] ≤ E[Y |G] wp1.
iv) |E[X|G]| ≤ E[|X| | G] wp1.
v) If limn Xn = X wp1, |Xn | ≤ Y , and Y is integrable, then limn E[Xn |G] =
E[X|G] wp1.
164) If X is integrable and σ−fields G1 ⊆ G2 ⊆ F, then E(E[X|G2 ]|G1 ) =
E[X|G1 ] wp1.

5.4 Complements

5.5 Problems

Problem 5.1. What theorem can be used to prove the existence of P [A|G]
and E[X|G]?
Problem 5.2. Using E[IA|G] = P[A|G] wp1, use X = IA, Y = IB,
Xi = IAi, and the result for E[X|G] to get the corresponding result for
P[A|G].
a) Using E[Σ_{i=1}^n ai Xi | G] = Σ_{i=1}^n ai E[Xi | G], find E[Σ_{i=1}^n ai IAi | G]
in terms of P[Ai|G].
b) If X ≤ Y wp1, then E[X|G] ≤ E[Y |G] wp1. If A ⊆ B, then IA ≤ IB .
Use these results to show that if A ⊆ B, then P [A|G] ≤ P [B|G] wp1.
c) If X = a wp1, then E[X|G] = a wp1. Use 1 = IΩ and b) with B = Ω
to prove P [A|G] ≤ 1 wp1.
Problem 5.3. Let a be a constant. Prove E[aX|G] = aE[X|G] wp1.
Exam and Quiz Problems


Problem 5.4. Suppose X is an integrable random variable on (Ω, F, P)
and that σ-field G ⊆ F, G ∈ G, and A ∈ F. Then E(X) = ∫ X dP = ∫_Ω X dP.
Use the definitions of E(X|G) and P(A|G) to find the following integrals.
Simplify.
a) E[E(X|G)] = ∫_Ω E(X|G) dP =
b) ∫_Ω P(A|G) dP =
Problem 5.5. a) Find ∫ IA dP.
b) Find ∫_G IA dP.
Problem 5.6.
Problem 5.7.
Problem 5.8.
Some Qual Type Problems
Problem 5.9. Suppose X is an integrable random variable on (Ω, F, P)
and that σ-field G ⊆ F, G ∈ G, and A ∈ F. Then E(X) = ∫ X dP = ∫_Ω X dP.
Use the definitions of E(X|G) and P(A|G) to find the following integrals.
a) ∫_G E(X|G) dP =
b) E[E(X|G)] = ∫_Ω E(X|G) dP =
c) ∫_G E(IA|G) dP =
d) ∫_G P(A|G) dP =
e) ∫_Ω P(A|G) dP =
Chapter 6
Martingales

Chapter 7
Some Solutions

Some solutions to qual type problems are at


(http://parker.ad.siu.edu/Olive/zM581qualprob.pdf).

A) Probability and Measure


1.30. See proof of Theorem 1.3.
1.31. See proof of Theorem 1.3.
1.32. See proof of Theorem 1.5.
1.33. See the First Borel Cantelli Lemma and its proof.
1.34. See the Second Borel Cantelli Lemma and its proof.
1.35. a) If Σn P(An) < ∞, then by the first Borel Cantelli lemma,
P(An) → 0, which implies that P(An^c) → 1, which implies that Σn P(An^c) =
∞. Hence this case is impossible.
b) By the 2nd Borel Cantelli lemma, P (limsupn An ) = 1 and 1 =
P (limsupn Acn ) = P [(liminf An )c ]. Thus P (limsupn An ) = 1 and P (liminf An ) =
0. Thus limn An does not exist.
c) By the 1st Borel Cantelli lemma, P(limsupn An) = 0. Thus P[liminf An] =
0 and c = 0. Independence was not needed since the 1st Borel Cantelli lemma
was used.
d) By the 1st Borel Cantelli lemma, P (limsupn Acn ) = P [(liminf An )c ] =
0. Hence P [liminf An ] = 1 ≤ P (limsupn An ) = 1 = c. Thus P (An ) → 1.
(The 2nd Borel Cantelli lemma also gives P (limsupn An ) = 1, but is not
needed.)

B) Random Variables and Random Vectors

2.20. See the proof of Theorem 2.4.


2.21. See the proof of Theorem 2.5.
2.22. See the proof of Theorem 2.7.

C) Integration and Expected Value


3.30. Proof. Existence: Suppose the SRV X takes on distinct values x1, ..., xm
where m need not equal n. Then X = Σ_{i=1}^m xi IBi where the Bi = {X = xi} =
{ω : X(ω) = xi} are disjoint with ∪_{i=1}^m Bi = Ω. Thus
E(X) = Σ_{i=1}^m xi P(Bi) = Σ_{i=1}^m xi P(X = xi).
Uniqueness:
Σ_{i=1}^n xi P(Ai) = Σ_x Σ_{i:xi=x} xi P(Ai) = Σ_x x P(∪_{i:xi=x} Ai) = Σ_x x P(X = x).


3.31. See the proof of Theorem 3.11.
3.32. Proof. 0 ≤ E(X1 ) ≤ E(X2 ) ≤ .... So {E(Xn )} is a monotone
sequence and limn→∞ E(Xn ) exists in [0, ∞].
3.33. See the proof of Theorem 3.13.
3.34. See the proof of Theorem 3.14.
3.35. See the Monotone Convergence Theorem 3.16 and its proof.
3.36. See the Lebesgue Dominated Convergence Theorem for RVs and its
proof.
D) Large Sample Theory:

4.1. Fn (y) = 0.5 + 0.5y/n for −n < y < n, so Fn (y) → H(y) ≡ 0.5 for all
real y. Hence Xn does not converge in distribution since H(y) is not a cdf.
4.16. c = µt by the WLLN since the Wi = t(Xi) are iid.
4.23. Lindeberg CLT
4.34. a) Fn (y) = y/n for 0 < y < n, so Fn (y) → H(y) ≡ 0.0 for all real y.
Hence Xn does not converge in distribution since H(y) is not a cdf.
4.36. c = E(Xi^2) = σ^2 + µ^2
4.46. a) √n(X̄ − µ) →D N(0, σ^2).
b) Define g(x) = 1/x, so g′(x) = −1/x^2. Using the delta method,
√n(1/X̄ − 1/µ) →D N(0, σ^2/µ^4), provided µ ≠ 0.
4.120. Solution. a) The cdf Fn(x) of Xn is
Fn(x) = 0 for x ≤ −1/n; Fn(x) = nx/2 + 1/2 for −1/n ≤ x ≤ 1/n; and
Fn(x) = 1 for x ≥ 1/n.
1, x ≥ n1 .

Sketching Fn (x) shows that it has a line segment rising from 0 at x = −1/n
to 1 at x = 1/n and that Fn (0) = 0.5 for all n ≥ 1. Examining the cases
x < 0, x = 0 and x > 0 shows that as n → ∞,

Fn(x) → 0 for x < 0; Fn(x) → 1/2 for x = 0; and Fn(x) → 1 for x > 0.

Notice that if X is a random variable such that P (X = 0) = 1, then X has


cdf FX(x) = 0 for x < 0 and FX(x) = 1 for x ≥ 0.
Since x = 0 is the only discontinuity point of FX (x) and since Fn (x) → FX (x)
for all continuity points of FX (x) (i.e. for x 6= 0),
D
Xn → X.

b) Fn(t) = t/n for 0 < t ≤ n and Fn(t) = 0 for t ≤ 0. Hence
limn→∞ Fn(t) = 0 for t ≤ 0. If t > 0 and n > t, then Fn(t) = t/n → 0
as n → ∞. Thus limn→∞ Fn(t) = H(t) = 0 for all t, and Yn does not converge
in distribution to any random variable Y since H(t) ≡ 0 is a continuous
function but not a cdf.
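A quick numerical check of a) and b) (an illustration only, not part of the proof): evaluate the two cdfs at fixed points for increasing n.

```python
def F_a(x, n):
    # cdf of Xn ~ U(-1/n, 1/n) from part a)
    if x <= -1.0 / n:
        return 0.0
    if x >= 1.0 / n:
        return 1.0
    return n * x / 2 + 0.5

def F_b(t, n):
    # cdf of Yn ~ U(0, n) from part b)
    if t <= 0:
        return 0.0
    if t >= n:
        return 1.0
    return t / n

for n in [10, 1000, 100000]:
    print(n, F_a(-0.01, n), F_a(0.0, n), F_a(0.01, n), F_b(5.0, n))
# F_a tends to 0 for x < 0, stays 1/2 at x = 0, and tends to 1 for x > 0:
# the cdf of the point mass at 0 except at the discontinuity x = 0.
# F_b tends to 0 at every fixed t, and H(t) = 0 is not a cdf.
```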

4.121. See the Generalized Chebyshev's Inequality and its proof.
4.122. See the SLLN and the WLLN, and the proof of the WLLN for the
special case where V(Xi) = σ^2.
4.123. See the proof of Theorem 4.6.
4.124. See the proof of Theorem 4.7.
4.125. Solution. CLT: Let Y1, ..., Yn be iid with E(Y) = µ and V(Y) = σ^2.
Let the sample mean Ȳn = (1/n) Σ_{i=1}^n Yi. Then
√n(Ȳn − µ) →D N(0, σ^2).
Proof. Let Zn be the Z-score of Ȳn. Then the characteristic function
cZn(t) = [1 − t^2/(2n) + o(t^2/n)]^n = [1 − (t^2/2 − n o(t^2/n))/n]^n → e^{−t^2/2} = cZ(t)
for all t. Thus Zn →D Z ∼ N(0, 1) and σZn = √n(Ȳn − µ) →D N(0, σ^2). 
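The characteristic-function argument can be illustrated by simulation (an illustration, not a proof; the Exponential(1) choice is an arbitrary assumption): Z-scores of sample means of iid Exponential(1) RVs, for which µ = σ = 1, should behave like N(0, 1) draws for large n.

```python
import math
import random

rng = random.Random(1)
n, reps = 400, 2000
z = []
for _ in range(reps):
    y = [rng.expovariate(1.0) for _ in range(n)]   # E(Y) = V(Y) = 1
    ybar = sum(y) / n
    z.append(math.sqrt(n) * (ybar - 1.0))          # Z-score of the sample mean
# Compare empirical probabilities with Phi(0) = 0.5 and Phi(1.645) ~ 0.95.
p0 = sum(zi <= 0.0 for zi in z) / reps
p95 = sum(zi <= 1.645 for zi in z) / reps
print(round(p0, 2), round(p95, 2))
```

Both empirical probabilities land near the standard normal values, as the CLT predicts.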
4.126. See the Continuous Mapping theorem and its proof.
4.127. See the Cramér Wold Device and its proof.
4.128. See the multivariate central limit theorem and its proof.
4.129. Solution. Let t = (t1^T, t2^T)^T, zn = (xn^T, yn^T)^T, and z = (x^T, y^T)^T.
Since xn is independent of yn and x is independent of y, the characteristic
function
φzn(t) = φxn(t1) φyn(t2) → φx(t1) φy(t2) = φz(t).
Hence zn →D z. 

4.130. Solution. If Xn ∼ U(an, bn) with an < bn, then
FXn(t) = (t − an)/(bn − an)
for an ≤ t ≤ bn, FXn(t) = 0 for t ≤ an and FXn(t) = 1 for t ≥ bn. On [an, bn],
FXn(t) is a line segment from (an, 0) to (bn, 1) with slope 1/(bn − an).
a) FXn (t) → H(t) ≡ 1 ∀t ∈ R. Since H(t) is continuous but not a cdf, Xn
does not converge in distribution to any RV X.
b) FXn (t) → H(t) ≡ 0 ∀t ∈ R. Since H(t) is continuous but not a cdf,
Xn does not converge in distribution to any RV X.
c) FXn(t) → FX(t) where FX(t) = 0 for t ≤ a; FX(t) = (t − a)/(b − a) for
a ≤ t ≤ b; and FX(t) = 1 for t ≥ b. Hence Xn →D X ∼ U(a, b).
d) FXn(t) → 0 for t < c and FXn(t) → 1 for t > c.
Hence Xn →D X where P(X = c) = 1. Hence X has a point mass distribution
at c. (The behavior of limn→∞ FXn(c) is not important, even if the limit does
not exist.)
e) FXn(t) = (t + n)/(2n) = 1/2 + t/(2n)
for −n ≤ t ≤ n. Thus FXn(t) → H(t) ≡ 0.5 ∀t ∈ R. Since H(t) is continuous
but not a cdf, Xn does not converge in distribution to any RV X.
f) FXn(t) = (t − c + 1/n)/(2/n) = 1/2 + (n/2)(t − c)
for c − 1/n ≤ t ≤ c + 1/n. Thus
FXn(t) → H(t) where H(t) = 0 for t < c; H(t) = 1/2 for t = c; and H(t) = 1
for t > c.
If X has the point mass at c, then FX(t) = 0 for t < c and FX(t) = 1 for
t ≥ c. Hence t = c is the only discontinuity point of FX(t), and H(t) = FX(t) at all
continuity points of FX(t). Thus Xn →D X where P(X = c) = 1.
4.131. Solution. a) i) Xn is discrete and takes on two values with E(Xn) =
n(1/n) = 1 for all positive integers n. Hence E[|Xn − 0|] = E(Xn) = 1 ∀n and Xn
does not satisfy Xn →1 0.
ii) Let ε > 0. Then
P[|Xn − 0| ≥ ε] ≤ P(Xn = n) = 1/n → 0
as n → ∞. Hence Xn →P 0.
iii) By ii), Xn →D 0.
b) i) Xn is discrete and takes on two values with
E[(Xn − 0)^2] = E(Xn^2) = Σx x^2 P(Xn = x) = 0^2 (1 − 1/n) + 1^2 (1/n) = 1/n → 0
as n → ∞. Hence Xn →2 0.
Since i) holds, so do ii), iii) and iv).
(Also note that
E[|Xn − 0|] = E(Xn) = 1/n → 0 as n → ∞.
Hence Xn →1 0.)
4.132. See the proof of Theorem 4.3.
4.133. Solution. a) Xn ∼ Σ_{i=1}^n Yi where the Yi are iid bin(n = 1, p)
random variables with E(Yi) = p and V(Yi) = p(1 − p). Thus
√n(Xn/n − p) = √n(Ȳ − p) →D N[0, p(1 − p)]
by the CLT.
b) Yi = I(Xi ≤ x) ∼ bin(n = 1, F (x)) for fixed x.
i) E(Yi ) = P (Xi ≤ x) = F (x)
ii) V (Yi ) = F (x)(1 − F (x))
iii) √n(F̂n(x) − F(x)) →D N[0, F(x)(1 − F(x))] by the CLT.

 
c) √n(X̄ − µ) →D Np(0, (d/(d − 2)) Σ)
by the MCLT.
d) E(Y) = exp(µ + σ^2/2) using r = 1, and E(Y^2) = exp(2µ + 2σ^2) using
r = 2. V(Y) = E(Y^2) − [E(Y)]^2. Thus
√n(Ȳn − E(Y)) = √n(Ȳn − exp(µ + σ^2/2)) →D N(0, V(Y))
by the CLT.
4.134. Solution: Proof: Let δ > 0. Then
lim_{n→∞} Σ_{k=1}^{rn} E[|Wnk|^{2+δ}]/sn^{2+δ} ≤ lim_{n→∞} Σ_{k=1}^{rn} Mn^δ E[|Wnk|^2]/sn^{2+δ}
= lim_{n→∞} (Mn/sn)^δ Σ_{k=1}^{rn} E[|Wnk|^2]/sn^2 = lim_{n→∞} (Mn/sn)^δ = 0
using sn^2 = Σ_{k=1}^{rn} E[|Wnk|^2]. 
4.135. Solution: Proof: Let ε > 0. Then
(1/sn^2) E[Wnk^2 I(|Wnk| ≥ ε sn)] ≤ (1/sn^2) E[Mn^2 I(|Wnk| ≥ ε sn)] = (Mn^2/sn^2) P(|Wnk| ≥ ε sn)
≤ (Mn^2/sn^2) E(Wnk^2)/(ε^2 sn^2) = (Mn/sn)^2 σnk^2/(ε^2 sn^2),
where the last inequality holds by Chebyshev's inequality. So
Σ_{k=1}^{rn} (1/sn^2) E[Wnk^2 I(|Wnk| ≥ ε sn)] ≤ (Mn/sn)^2 (1/(ε^2 sn^2)) Σ_{k=1}^{rn} σnk^2 = (Mn/sn)^2 (1/ε^2) → 0
using Σ_{k=1}^{rn} σnk^2 = sn^2. 
4.136. Proof. Let Yi = |Wi| = |Xi − pi|. Then P(Yi = 1 − pi) = pi and
P(Yi = pi) = qi. Thus
E[|Xi − pi|^3] = E[|Wi|^3] = Σy y^3 f(y) = (1 − pi)^3 pi + pi^3 qi = qi^3 pi + pi^3 qi
= pi qi (pi^2 + qi^2) ≤ pi qi
since pi^2 + qi^2 ≤ (pi + qi)^2 = 1. Thus Σ_{i=1}^n E[|Xi − pi|^3] ≤ Σ_{i=1}^n pi qi. Dividing
both sides by (Σ_{i=1}^n pi qi)^{3/2} gives
Σ_{i=1}^n E[|Xi − pi|^3]/(Σ_{i=1}^n pi qi)^{3/2} ≤ 1/(Σ_{i=1}^n pi qi)^{1/2} → 0
as n → ∞. Hence the special case of Lyapounov's condition
lim_{n→∞} Σ_{i=1}^n E[|Xi − µi|^3]/(Σ_{i=1}^n σi^2)^{3/2} = 0
holds with µi = pi and σi^2 = pi qi. Thus
Zn = Σ_{i=1}^n (Xi − µi)/(Σ_{i=1}^n σi^2)^{1/2} →D N(0, 1). 
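A Monte Carlo sketch of this result (an illustration only; the particular pi sequence is a hypothetical choice): with pi alternating between 0.25 and 0.75, Σ pi qi = ∞, and the standardized sum Zn should be approximately N(0, 1).

```python
import math
import random

rng = random.Random(2)
n, reps = 500, 2000
p = [0.25 if i % 2 == 0 else 0.75 for i in range(n)]  # sum p_i q_i diverges
mu = sum(p)
s = math.sqrt(sum(pi * (1.0 - pi) for pi in p))
z = []
for _ in range(reps):
    sn = sum(rng.random() < pi for pi in p)  # sum of independent Ber(p_i)
    z.append((sn - mu) / s)
share = sum(abs(zi) <= 1.96 for zi in z) / reps
print(round(share, 2))   # close to P(|Z| <= 1.96) ~ 0.95 for Z ~ N(0, 1)
```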
4.137. See the proof of Lyapounov’s CLT.
4.138. Solution: Proof: Once n is large enough so that ε sn > c (which
occurs since sn → ∞), I[|Wk| ≥ ε sn] = 0 wp1. Hence Lindeberg's condition holds.

4.139. Solution: Proof: Need to show that Lindeberg's condition holds.
Now sn^2 = nσ^2 and the Wk^2 I[|Wk| ≥ ε sn] are iid for given n. Thus
(1/sn^2) Σ_{k=1}^n E(Wk^2 I[|Wk| ≥ ε sn]) = (1/σ^2) E(W1^2 I[|W1| ≥ εσ√n])
= (1/σ^2) ∫_{|W1| ≥ εσ√n} W1^2 dP → 0
as n → ∞ since P(|W1| ≥ εσ√n) ↓ 0 as n → ∞. Or Yn = W1^2 I[|W1| ≥ εσ√n]
satisfies Yn ≤ W1^2 and Yn ↓ Y = 0 as n → ∞. Thus E(Yn) → E(Y) = 0
by Lebesgue's Dominated Convergence Theorem. Thus Equation (4.15) holds
and Zn →D N(0, 1). If the Wi = Xi − µ, then
Zn = Σ_{i=1}^n (Xi − µ)/(σ√n) = √n(X̄n − µ)/σ →D N(0, 1).
Thus √n(X̄n − µ) →D N(0, σ^2). 

E) Conditional Probability and Conditional Expectation

5.9. Solution:
a) ∫_G E(X|G) dP = ∫_G X dP (= E[X IG])
b) E[E(X|G)] = ∫_Ω E(X|G) dP = ∫_Ω X dP = E[X]
c) ∫_G E(IA|G) dP = ∫_G IA dP = ∫ IA IG dP = ∫ IA∩G dP = P(A ∩ G)
d) ∫_G P(A|G) dP = P(A ∩ G)
e) ∫_Ω P(A|G) dP = P(A ∩ Ω) = P(A)
References
Ash, R.B. (1993), Real Variables: with Metric Space Topology, IEEE Press,
New York, NY. Available from (https://faculty.math.illinois.edu/∼r-ash/).
Ash, R.B., and Doleans-Dade, C.A. (1999), Probability and Measure The-
ory, 2nd ed., Academic Press, San Diego, CA.
Bickel, P.J., and Doksum, K.A. (1977), Mathematical Statistics: Basic
Ideas and Selected Topics, 1st ed., Holden Day, Oakland, CA.
Billingsley, P. (1986, 1995), Probability and Measure, 2nd and 3rd ed.,
Wiley, New York, NY.
Breiman, L. (1968), Probability, Addison-Wesley, Reading, MA.
Capiński, M., and Kopp, P.E. (2004), Measure, Integral and Probability,
2nd ed., Springer-Verlag, London, UK.
Casella, G., and Berger, R.L. (2002), Statistical Inference, 2nd ed., Duxbury,
Belmont, CA.
Chernoff, H. (1956), “Large-Sample Theory: Parametric Case,” The An-
nals of Mathematical Statistics, 27, 1-22.
Chung, K.L. (2001), A Course in Probability Theory, 3rd ed., Academic
Press, San Diego, CA.
Cramér, H. (1946), Mathematical Methods of Statistics, Princeton Univer-
sity Press, Princeton, NJ.
DasGupta, A. (2008), Asymptotic Theory of Statistics and Probability,
Springer, New York, NY.
Davidson, J. (1994), Stochastic Limit Theory, Oxford University Press,
Oxford, UK.
DeGroot, M.H. (1975), Probability and Statistics, 1st ed., Addison-Wesley
Publishing Company, Reading, MA.
Dudley, R.M. (2002), Real Analysis and Probability, Cambridge University
Press, Cambridge, UK.
Durrett, R. (2019), Probability, Theory and Examples, 5th ed., Cambridge
University Press, Cambridge, UK.
Feller, W. (1971), An Introduction to Probability Theory and Its Applica-
tions, Vol. II, 2nd ed., Wiley, New York, NY.
Ferguson, T.S. (1996), A Course in Large Sample Theory, Chapman &
Hall, New York, NY.
Gaughan, E.D. (2009), Introduction to Analysis, 5th ed., American Math-
ematical Society, Providence, RI.
Gnedenko, B.V. (1989), Theory of Probability, 5th ed., Chelsea Publishers,
Providence, RI.
Hoel, P.G., Port, S.C., and Stone, C.J. (1971), Introduction to Probability
Theory, Houghton Mifflin, Boston, MA.
Hunter, D.R. (2014), Notes for a Graduate-Level Course in Asymptotics
for Statisticians, available from (www.stat.psu.edu/∼dhunter/asymp/lectures/).
Jiang, J. (2022), Large Sample Techniques for Statistics, 2nd ed., Springer,
New York, NY.
Karr, A.F. (1993), Probability, Springer, New York, NY.
Lehmann, E.L. (1999), Elements of Large–Sample Theory, Springer, New


York, NY.
Lukacs, E. (1970), Characteristic Functions, 2nd ed., Hafner, New York,
NY.
Lukacs, E. (1975), Stochastic Convergence, Academic Press, New York,
NY.
Olive, D.J. (2014), Statistical Theory and Inference, Springer, New York,
NY.
Olive, D.J. (2022), Large Sample Theory: online course notes,
(http://parker.ad.siu.edu/Olive/lsampbk.pdf).
Petrov, V.V. (1995), Limit Theorems of Probability Theory: Sequences of
Independent Random Variables, Clarendon Press, Oxford, UK.
Polansky, A.M. (2011), Introduction to Statistical Limit Theory, CRC
Press, Boca Raton, FL.
Pollard, D. (1984), Convergence of Stochastic Processes, Springer, Berlin.
Pollard, D. (2001), A User’s Guide to Measure Theoretic Probability, Cam-
bridge University Press, Cambridge, UK.
Pratt, J.W. (1959), “On a General Concept of “in Probability”,” The
Annals of Mathematical Statistics, 30, 549-558.
Rényi, A., (2007), Probability Theory, Dover, New York, NY.
Resnick, S. (1999), A Probability Path, Birkhäuser, Boston, MA.
Rohatgi, V.K. (1976), An Introduction to Probability Theory and Mathe-
matical Statistics, Wiley, New York, NY.
Rohatgi, V.K. (1984), Statistical Inference, Wiley, New York, NY.
Rosenthal, J.S. (2006), A First Look at Rigorous Probability Theory, 2nd
ed., World Scientific, Singapore.
Ross, K.A. (1980), Elementary Analysis: the Theory of Calculus, Springer,
New York, NY.
Royden, H.L., and Fitzpatrick, P. (2007), Real Analysis, 4th ed., Prentice
Hall, Englewood Cliffs, NJ.
Sen, P.K., and Singer, J.M. (1993), Large Sample Methods in Statistics:
an Introduction with Applications, Chapman & Hall, New York, NY.
Sen, P.K., Singer, J.M., and Pedrosa De Lima, A.C. (2010), From Finite
Sample to Asymptotic Methods in Statistics, Cambridge University Press,
New York, NY.
Serfling, R.J. (1980), Approximation Theorems of Mathematical Statistics,
Wiley, New York, NY.
Severini, T.A. (2005), Elements of Distribution Theory, Cambridge Uni-
versity Press, New York, NY.
Shiryaev, A.N. (1996), Probability, 2nd ed. Springer Verlag, New York,
NY.
Shiryaev, A.N. (2012), Problems in Probability, Springer, New York, NY.
Shorack, G.R., and Wellner, J.A. (1986), Empirical Processes With Appli-
cations to Statistics, Wiley, New York, NY.
Spiegel, M.R. (1969), Schaum’s Outline of Theory and Problems of Real


Variables: Lebesgue Measure and Integration with Applications to Fourier
Series, McGraw–Hill, New York, NY.
Stoyanov, J., Mirazchiiski, I., Ignatov, Z., and Tanushev, M. (1989), Exercise
Manual in Probability Theory, Kluwer Academic Publishers, Boston,
MA.
Tardiff, R.M. (1981), “L’Hospital’s Rule and the Central Limit Theorem,”
The American Statistician, 35, 43.
van der Vaart, A.W. (1998), Asymptotic Statistics, Cambridge University
Press, Cambridge, UK.
White, H. (1984), Asymptotic Theory for Econometricians, Academic
Press, San Diego, CA.
Woodroofe, M. (1975), Probability With Applications, McGraw-Hill, New
York, NY.
Index

Ash, v, 15
asymptotic distribution, 61
asymptotic theory, 61
Bain, 105
Berger, 113
Bickel, 113
Billingsley, v, 16, 72, 114, 115, 130, 131
Boole's inequality, 6
Breiman, v
Capiński, v
Casella, 113
cdf, 108
Central Limit Theorem, 79
cf, 74
characteristic function, 71, 74, 109
Chebyshev's Inequality, 66, 110
Chernoff, 113
Chung, v
Continuity Theorem, 82
Continuous Mapping Theorem, 83
convergence in mean square, 65
converges in rth mean, 65
converges in distribution, 61
converges in law, 61
converges in probability, 65
converges in quadratic mean, 65, 107
converges with probability 1, 68, 107
covariance matrix, 96
Cramér, 113
cumulant generating function, 71
cumulative distribution function, 108
DasGupta, 113
Davidson, 113
De Morgan's Laws, 2
DeGroot, 93
Delta Method, 102
Doksum, 113
Doleans-Dade, v, 15
Dudley, v, 16
Durrett, v, 16
Euclidean norm, 95
Feller, v, 16
Ferguson, 82, 113
Fitzpatrick, v
gamma function, 28
Gaughan, v, 16
Generalized Chebyshev's Inequality, 66, 95
Generalized Markov's Inequality, 66, 95
Gnedenko, v
Hoel, 113
Hunter, 6, 113
intersection, 1
Jacobian matrix, 105
Jensen's Inequality, 69
Jiang, 113
Karr, v
Kopp, v
Lehmann, 87, 113
limiting distribution, 61, 80
Lindeberg-Lévy CLT, 79
Lukacs, 113
Markov's Inequality, 66, 110
mean square convergence, 65
mgf, 71, 74, 109
moment generating function, 71, 74, 109
monotone continuity, 6
Multivariate Central Limit Theorem, 96
Multivariate Delta Method, 104
Olive, 113
Pedrosa De Lima, 113
Petrov, 113
Polansky, 113
Pollard, v, 113
population mean, 96
Port, 113
Pratt, 126
Rényi, v
Radon-Nikodym derivative, 133
Radon-Nikodym Theorem, 133
random variable, 22
random vector, 26
rational numbers, 23
Resnick, v, 16
Rohatgi, 84, 113, 117, 124
Rosenthal, v
Ross, v, 6
Royden, v
RV, 5, 23
sample mean, 80
sample space, 1
Sen, 113
Serfling, 113
Severini, 98, 99
Shiryaev, v
Shorack, 113
Singer, 113
SLLN, 69
Slutsky's Theorem, 81, 99
Spiegel, v
Stone, 113
Stoyanov, v
strong convergence, 69
Strong Law of Large Numbers, 69
Tardiff, 84
union, 1
van der Vaart, 113
weak convergence, 62
Weak Law of Large Numbers, 69
Weibull distribution, 105
Wellner, 113
White, 98, 113
WLLN, 69
WLOG, 5
Woodroofe, 113
wrt, 41
