Chapter 5. Elements of Probability Theory
The purpose of this chapter is to summarize some important concepts and results in
probability theory. Of particular interest to us are the limit theorems which are powerful
tools to analyze the convergence behaviors of econometric estimators and test statistics.
These properties are the core of the asymptotic analysis in subsequent chapters. For a
more complete and thorough treatment of probability theory see Davidson (1994) and
other probability textbooks, such as Ash (1972) and Billingsley (1979). Bierens (1994), Gallant (1997) and White (2001) also provide concise coverage of the topics in this chapter. Many results here are taken freely from the references cited above; we will not refer to them again in the text unless necessary.
5.1 Probability Space and Random Variables

The probability space for a random experiment is a triple (Ω, F, IP), where Ω is the outcome space, F is a σ-algebra of subsets of Ω, and IP is a probability measure. A collection F of subsets of Ω is a σ-algebra if it satisfies the following properties.

1. Ω ∈ F.

2. If A ∈ F, then Aᶜ ∈ F.

3. If A₁, A₂, . . . are in F, then ∪_{n=1}^∞ A_n ∈ F.
The first and second properties together imply that Ωc = ∅ is also in F. Combining the
second and third properties we have from de Morgan’s law that
∩_{n=1}^∞ A_n = ( ∪_{n=1}^∞ A_nᶜ )ᶜ ∈ F.
A σ-algebra is thus closed under complementation, countable union and countable in-
tersection.
A probability measure IP on the measurable space (Ω, F) is a function IP : F → [0, 1] satisfying the following axioms.

1. IP(Ω) = 1.

2. IP(A) ≥ 0 for every A ∈ F.

3. Countable additivity: if A₁, A₂, . . . ∈ F are pairwise disjoint, then IP(∪_{n=1}^∞ A_n) = ∑_{n=1}^∞ IP(A_n).

From these axioms we easily deduce that IP(∅) = 0, IP(Aᶜ) = 1 − IP(A), IP(A) ≤ IP(B) if A ⊆ B, and, for A₁, A₂, . . . in F (not necessarily disjoint),

IP( ∪_{n=1}^∞ A_n ) ≤ ∑_{n=1}^∞ IP(A_n).
The Borel field B on R is the σ-algebra generated by all open intervals (a, b) in R. The collection of all closed intervals (half-open intervals, half lines) generates the same Borel field. This is why open intervals, closed intervals, half-open intervals and half lines are also known as Borel sets. The Borel field on R^d, denoted as B^d, is generated by all open hypercubes:

(a₁, b₁) × ⋯ × (a_d, b_d),
or by sets of the form

(−∞, ζ₁] × ⋯ × (−∞, ζ_d].

The sets that generate the Borel field B^d are all Borel sets.
A random variable z defined on (Ω, F, IP) is a function z : Ω → R such that for every B in the Borel field B, the inverse image of B under z is in F, i.e.,

z⁻¹(B) = {ω : z(ω) ∈ B} ∈ F.

More generally, a d-dimensional random vector z defined on (Ω, F, IP) is a function z : Ω → R^d such that for every B in the Borel field B^d,

z⁻¹(B) = {ω : z(ω) ∈ B} ∈ F;

that is, z is an F/B^d-measurable function. Given the random vector z, its inverse images z⁻¹(B) form a σ-algebra, denoted as σ(z). This σ-algebra is contained in F, and it is the smallest σ-algebra contained in F such that z is measurable. It is known as the σ-algebra generated by z or, more intuitively, the information set associated with z.
A function g : R → R is said to be Borel measurable (or B-measurable) if for every real number b,

{ζ ∈ R : g(ζ) ≤ b} ∈ B.
If z is a random variable defined on (Ω, F, IP), then g(z) is also a random variable
defined on the same probability space provided that g is Borel measurable. Note that
the functions we usually encounter (e.g., continuous functions and integrable functions)
are Borel measurable. Similarly, for the d-dimensional random vector z, g(z) is a
random variable provided that g is B^d-measurable.
Recall from Section 2.1 that the joint distribution function of z is the non-decreasing, right-continuous function F_z such that for ζ = (ζ₁, . . . , ζ_d)′ ∈ R^d,

F_z(ζ) = IP(z₁ ≤ ζ₁, . . . , z_d ≤ ζ_d),

with the marginal distribution function of the i-th element given by F_{z_i}(ζ_i) = F_z(∞, . . . , ∞, ζ_i, ∞, . . . , ∞).
Two random variables y and z are said to be (pairwise) independent if, and only if, for any Borel sets B₁ and B₂,

IP(y ∈ B₁, z ∈ B₂) = IP(y ∈ B₁) IP(z ∈ B₂).
This immediately leads to the standard definition of independence: y and z are indepen-
dent if, and only if, their joint distribution is the product of their marginal distributions,
as in Section 2.1. A sequence of random variables {zi } is said to be totally independent
if

IP( ∩_{all i} {z_i ∈ B_i} ) = ∏_{all i} IP(z_i ∈ B_i),
for any Borel sets B_i. In what follows, a totally independent sequence will be referred to as an independent sequence or a sequence of independent variables for convenience. For
an independent sequence, we have the following generalization of Lemma 2.1.
Lemma 5.1 Let z₁, . . . , z_n be totally independent random variables such that their product is integrable. Then, IE(z₁ ⋯ z_n) = IE(z₁) ⋯ IE(z_n).

The expectation of the i-th element of the random vector z is

IE(z_i) = ∫_Ω z_i(ω) dIP(ω),

where the right-hand side is a Lebesgue integral. In view of the distribution function defined above, a change of ω changes the realization of z, so that equivalently

IE(z_i) = ∫_{R^d} ζ_i dF_z(ζ) = ∫_R ζ_i dF_{z_i}(ζ_i).
Other moments, such as variance and covariance, can also be defined as Lebesgue inte-
grals with respect to the probability measure; see Section 2.2.
Lemma 5.2 (Jensen) For the Borel measurable function g that is convex on the support of the integrable random variable z, suppose that g(z) is also integrable. Then,

g(IE(z)) ≤ IE[g(z)];

the inequality is reversed when g is concave.
For the random variable z with finite p-th moment, let ‖z‖_p = [IE |z|^p]^{1/p} denote its L_p-norm. Also define the inner product of two square integrable random variables z_i and z_j as their cross moment:

⟨z_i, z_j⟩ = IE(z_i z_j).

Then, the L₂-norm can be obtained from the inner product as ‖z_i‖₂ = ⟨z_i, z_i⟩^{1/2}. It is easily seen that for any c > 0 and p > 0,

c^p IP(|z| ≥ c) = c^p ∫ 1_{{ζ:|ζ|≥c}} dF_z(ζ) ≤ ∫_{{ζ:|ζ|≥c}} |ζ|^p dF_z(ζ) ≤ IE |z|^p,
where 1{ζ:|ζ|≥c} is the indicator function which equals one if |ζ| ≥ c and equals zero
otherwise. This establishes the following result.
Lemma 5.3 (Markov) Let z be a random variable with finite p-th moment. Then,

IP(|z| ≥ c) ≤ IE |z|^p / c^p,

where c is a positive real number.
For p = 2, Lemma 5.3 is also known as the Chebyshev inequality. If c is small, so that IE |z|^p/c^p > 1, Markov's inequality is trivial. When c becomes large, the probability that z assumes very extreme values vanishes at the rate c^{−p}.
Lemma 5.4 (Hölder) Let y be a random variable with finite p-th moment (p > 1) and z a random variable with finite q-th moment (q = p/(p − 1)). Then,

IE |yz| ≤ ‖y‖_p ‖z‖_q.

For p = q = 2, we have IE |yz| ≤ ‖y‖₂ ‖z‖₂. By noting that |IE(yz)| ≤ IE |yz|, we immediately have the next result; cf. Lemma 2.3.
Lemma 5.5 (Cauchy-Schwarz) Let y and z be two square integrable random variables. Then,

|IE(yz)| ≤ ‖y‖₂ ‖z‖₂.
Let y = 1 and x = z^p. Then for q > p and r = q/p, Hölder's inequality also ensures that

IE |x| = IE |z|^p ≤ ‖1‖_{r/(r−1)} ‖x‖_r = [IE |z|^q]^{p/q}.

This shows that when a random variable has finite q-th moment, it must also have finite p-th moment for any p < q, as stated below.
Lemma 5.6 (Liapunov) Let z be a random variable with finite q-th moment. Then for p < q, ‖z‖_p ≤ ‖z‖_q.
The inequality below states that the L_p-norm of a finite sum is no greater than the sum of the individual L_p-norms: for random variables z₁, . . . , z_n with finite p-th moments (p ≥ 1),

‖z₁ + ⋯ + z_n‖_p ≤ ‖z₁‖_p + ⋯ + ‖z_n‖_p.
When there are only two random variables in the sum, this is just the triangle inequality
for Lp -norms; see also Exercise 5.3.
5.2 Conditional Distributions and Moments

Given two events A and B in F, the conditional probability of A given B is defined as

IP(A | B) = IP(A ∩ B) / IP(B),

for IP(B) > 0. It can be shown that IP(· | B) satisfies the axioms for probability measures; see Exercise 5.4. This concept is readily extended to construct the conditional density function and conditional distribution function.
Let y and z denote two integrable random vectors with joint density function f_{z,y} and marginal densities f_z and f_y. For f_y(η) > 0, define the conditional density function of z given y = η as

f_{z|y}(ζ | y = η) = f_{z,y}(ζ, η) / f_y(η).

Integrating this function over ζ yields f_y(η)/f_y(η) = 1; thus, f_{z|y} is a legitimate density function. For example, the bivariate density function
of two random variables z and y forms a surface on the zy-plane. By fixing y = η,
we obtain a cross section (slice) under this surface. Dividing the joint density by the
marginal density fy (η) amounts to adjusting the height of this slice so that the resulting
area integrates to one.
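This normalization is easy to verify numerically. The Python sketch below slices a bivariate normal density at y = η and checks that dividing by the marginal density makes the slice integrate to one; the correlation ρ and the slice point η are illustrative choices.

    import numpy as np

    rho, eta = 0.6, 0.8                        # illustrative correlation and slice point
    zeta = np.linspace(-8.0, 8.0, 4001)        # grid for numerical integration

    def joint(z, y):
        # standard bivariate normal density with correlation rho
        det = 1.0 - rho**2
        return np.exp(-(z**2 - 2*rho*z*y + y**2) / (2*det)) / (2*np.pi*np.sqrt(det))

    f_y = np.exp(-eta**2 / 2) / np.sqrt(2*np.pi)   # standard normal marginal at eta
    slice_ = joint(zeta, eta)                      # cross section under the surface
    cond = slice_ / f_y                            # conditional density f_{z|y}

    print("area under raw slice:", np.trapz(slice_, zeta))   # equals f_y(eta), below one
    print("area under f_{z|y}  :", np.trapz(cond, zeta))     # approximately one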
The conditional probability of the event {z ∈ A} given y = η can then be computed as

IP(z ∈ A | y = η) = ∫_A f_{z|y}(ζ | y = η) dζ.

Note that this conditional probability is well defined even when IP(y = η) may be zero. In particular, when

A = (−∞, ζ₁] × ⋯ × (−∞, ζ_d],

this integral yields the conditional distribution function F_{z|y}(ζ | y = η).
When z and y are independent, the conditional density (distribution) simply reduces
to the unconditional density (distribution).
More generally, let G be a sub-σ-algebra of F. The conditional expectation IE(z | G) is defined as the G-measurable random variable satisfying

∫_G IE(z | G) dIP = ∫_G z dIP, for every G ∈ G.

This definition basically says that the conditional expectation with respect to G is such that its weighted sum over any G in G is the same as that of z. Suppose that G is the trivial σ-algebra {Ω, ∅}, i.e., the smallest σ-algebra, which contains no extra information from any random vectors. The conditional expectation with respect to the trivial σ-algebra must be a constant c with probability one, so as to be measurable with respect to {Ω, ∅}. Then,

IE(z) = ∫_Ω z dIP = ∫_Ω c dIP = c.
That is, the conditional expectation with respect to the trivial σ-algebra is the uncon-
ditional expectation IE(z). Consider now G = σ(y), the σ-algebra generated by y. We also write

IE(z | y) := IE(z | σ(y)),

which is interpreted as the prediction of z given all the information associated with y.
In particular,

IE[ IE(z | G) ] = IE(z);

this is known as the law of iterated expectations. This result also suggests that if conditional expectations are taken sequentially with respect to a collection of nested σ-algebras, only the smallest σ-algebra matters. For example, for k random vectors y₁, . . . , y_k,

IE[ IE(z | y₁, . . . , y_k) | y₁ ] = IE(z | y₁).
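The nesting property can be illustrated by simulation, as in the Python sketch below; the discrete conditioning variables y₁, y₂ and the model for z are illustrative choices.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 500_000
    y1 = rng.integers(0, 2, n)                 # coarse information
    y2 = rng.integers(0, 3, n)                 # additional information
    z = y1 + 0.5 * y2 + rng.normal(size=n)

    # IE(z | y1, y2): cell means; then average the cell means within each y1 cell.
    inner = np.zeros(n)
    for a in (0, 1):
        for b in (0, 1, 2):
            cell = (y1 == a) & (y2 == b)
            inner[cell] = z[cell].mean()

    for a in (0, 1):
        iterated = inner[y1 == a].mean()       # IE[ IE(z|y1,y2) | y1 = a ]
        direct = z[y1 == a].mean()             # IE(z | y1 = a)
        print(f"y1={a}:  iterated={iterated:.4f}  direct={direct:.4f}")
    # The two columns agree: only the smallest information set matters.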
In particular, z can be taken out from the conditional expectation when z itself is a
conditioning variable. This result is generalized as follows.
Lemma 5.10 Let z be a G-measurable random vector. Then for any Borel-measurable function g and any integrable random vector y such that g(z) y is integrable,

IE[ g(z) y | G ] = g(z) IE(y | G), with probability one.
Two square integrable random variables z and y are said to be orthogonal if their
inner product IE(zy) = 0. This definition allows us to discuss orthogonal projection in
the space of square integrable random vectors. Let z be a square integrable random
variable and z̃ be a G-measurable random variable. Then, by Lemma 5.9 (law of iterated
expectations) and Lemma 5.10,
IE{ [z − IE(z | G)] z̃ } = IE( IE{ [z − IE(z | G)] z̃ | G } )
                        = IE( IE(z | G) z̃ − IE(z | G) z̃ )
                        = 0,

where the second equality pulls the G-measurable z̃ out of the inner conditional expectation (Lemma 5.10).
That is, the difference between z and its conditional expectation IE(z | G) must be
orthogonal to any G-measurable random variable. It can then be seen that for any
square integrable, G-measurable random variable z̃,

IE[(z − z̃)²] = IE{ [z − IE(z | G)] + [IE(z | G) − z̃] }²
             = IE[z − IE(z | G)]² + IE[IE(z | G) − z̃]²
             ≥ IE[z − IE(z | G)]²,
where in the second equality the cross-product term vanishes because both IE(z | G)
and z̃ are G-measurable and hence orthogonal to z − IE(z | G). That is, among all
G-measurable random variables that are also square integrable, IE(z | G) is the closest
to z in terms of the L2 -norm. This shows that IE(z | G) is the orthogonal projection of
z onto the space of all G-measurable, square integrable random variables.
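The projection property can be seen in the following Python sketch, which compares the mean squared errors of several predictors of z that are functions of y; the data-generating process and the competing predictors are illustrative choices.

    import numpy as np

    rng = np.random.default_rng(2)
    n = 1_000_000
    y = rng.normal(size=n)
    z = np.sin(y) + rng.normal(scale=0.5, size=n)   # here IE(z | y) = sin(y)

    predictors = {
        "conditional mean sin(y)": np.sin(y),
        "best linear fit a + b*y": np.polyval(np.polyfit(y, z, 1), y),
        "zero predictor": np.zeros(n),
    }
    for name, zhat in predictors.items():
        mse = np.mean((z - zhat) ** 2)
        print(f"{name:26s}  IE(z - zhat)^2 = {mse:.4f}")
    # The conditional mean attains the minimum, the noise variance 0.25;
    # every other y-measurable predictor does strictly worse.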
In particular, let G = σ(y), where y is a square integrable random vector. Lemma 5.11 implies that

IE[ (z − IE(z | σ(y)))² ] ≤ IE[ (z − h(y))² ],
for any Borel-measurable function h such that h(y) is also square integrable. Thus, IE[z | σ(y)] minimizes the L₂-norm ‖z − h(y)‖₂, and its difference from z is orthogonal
to any function of y that is also square integrable. We may then say that, given all the
information generated from y, IE[z | σ(y)] is the “best approximation” of z in terms of
the L2 -norm (or simply the best L2 predictor).
The conditional variance-covariance matrix var(z | y) is defined analogously, with unconditional expectations replaced by conditional ones. For a nonstochastic matrix A and vector b,

var(Az + b | y) = A var(z | y) A′,

which is nonsingular provided that A has full row rank and var(z | y) is positive definite.
It can also be shown that

var(z) = IE[var(z | y)] + var( IE(z | y) );

see Exercise 5.6. That is, the variance of z can be expressed as the sum of two components: the mean of its conditional variance and the variance of its conditional mean. This is also known as the analysis-of-variance decomposition.
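The Python sketch below verifies this decomposition by simulation; the discrete conditioning variable and the model for z, whose conditional mean and variance both depend on y, are illustrative choices.

    import numpy as np

    rng = np.random.default_rng(3)
    n = 1_000_000
    y = rng.integers(0, 3, n)                        # conditioning variable
    z = 2.0 * y + rng.normal(scale=1.0 + y, size=n)  # heteroskedastic in y

    cond_mean = np.zeros(n)
    cond_var = np.zeros(n)
    for a in (0, 1, 2):
        sel = y == a
        cond_mean[sel] = z[sel].mean()               # IE(z | y = a)
        cond_var[sel] = z[sel].var()                 # var(z | y = a)

    print("var(z)                      :", round(z.var(), 4))
    print("IE[var(z|y)] + var(IE(z|y)) :", round(cond_var.mean() + cond_mean.var(), 4))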
It is easy to see that the conditional density function of y given x, obtained from dividing the joint normal density of y and x by the normal density of x, is also normal, with conditional mean

IE(y | x) = μ_y + Σ_{yx} Σ_{xx}⁻¹ (x − μ_x),

and conditional variance-covariance matrix

var(y | x) = Σ_{yy} − Σ_{yx} Σ_{xx}⁻¹ Σ_{xy}.

Note that IE(y | x) is a linear function of x and that var(y | x) does not vary with x.
5.3 Modes of Convergence

We first introduce the concept of almost sure convergence (convergence with probability one). Suppose that {z_n} is a sequence of random variables and z is a random variable, all defined on the probability space (Ω, F, IP). The sequence {z_n} is said to converge to z almost surely if, and only if,

IP( {ω : z_n(ω) → z(ω)} ) = 1,

denoted as z_n →a.s. z.
The following result shows that continuous transformation preserves almost sure
convergence.
Lemma 5.13 Let g : R → R be a function continuous on S_g ⊆ R.

[a] If z_n →a.s. z, where z is a random variable such that IP(z ∈ S_g) = 1, then g(z_n) →a.s. g(z).

[b] If z_n →a.s. c, where c is a real number at which g is continuous, then g(z_n) →a.s. g(c).
Proof: Let Ω₀ = {ω : z_n(ω) → z(ω)} and Ω₁ = {ω : z(ω) ∈ S_g}. Then, for ω ∈ (Ω₀ ∩ Ω₁), continuity of g ensures that g(z_n(ω)) → g(z(ω)). Note that

(Ω₀ ∩ Ω₁)ᶜ = Ω₀ᶜ ∪ Ω₁ᶜ,

which has probability zero because IP(Ω₀ᶜ) = IP(Ω₁ᶜ) = 0. It follows that Ω₀ ∩ Ω₁ has probability one. This proves that g(z_n) → g(z) with probability one. The second assertion is just a special case of the first result. ∎
Lemma 5.13 also applies elementwise to random vectors, so that, in particular,

z_{1,n} + z_{2,n} →a.s. z₁ + z₂ and z_{1,n} z_{2,n} →a.s. z₁ z₂,

where z_{1,n}, z_{2,n} are two elements of z_n and z₁, z₂ are the corresponding elements of z. Also, provided that z₂ ≠ 0 with probability one, z_{1,n}/z_{2,n} → z₁/z₂ a.s.
The sequence {z_n} is said to converge to z in probability if, for every ε > 0,

lim_{n→∞} IP( {ω : |z_n(ω) − z(ω)| > ε} ) = 0,

or equivalently,

lim_{n→∞} IP( {ω : |z_n(ω) − z(ω)| ≤ ε} ) = 1,

denoted as z_n →IP z or z_n → z in probability. We also say that z is the probability limit of z_n, denoted as plim z_n = z. In particular, if the probability limit of z_n is a constant c, all the probability mass of z_n will concentrate around c when n becomes large.
More specifically, for ε > 0, let Ω_n(ε) = {ω : |z_n(ω) − z(ω)| ≤ ε}, and let Ω₀ denote the set of ω such that z_n(ω) converges to z(ω). For ω ∈ Ω₀, there is some m such that ω is in Ω_n(ε) for all n > m. That is,

Ω₀ ⊆ ∪_{m=1}^∞ ∩_{n=m}^∞ Ω_n(ε) ∈ F.

As ∩_{n=m}^∞ Ω_n(ε) is also in F and non-decreasing in m, it follows that

IP(Ω₀) ≤ IP( ∪_{m=1}^∞ ∩_{n=m}^∞ Ω_n(ε) ) = lim_{m→∞} IP( ∩_{n=m}^∞ Ω_n(ε) ) ≤ lim_{m→∞} IP(Ω_m(ε)).
This inequality proves that almost sure convergence implies convergence in probability,
but the converse is not true in general. We state this result below.
Lemma 5.14 If z_n →a.s. z, then z_n →IP z.
The following well-known example shows that when there is convergence in proba-
bility, the random variables themselves may not even converge for any ω.
Example 5.15 Let Ω = [0, 1] and IP be the Lebesgue measure (i.e., IP{(a, b]} = b − a for (a, b] ⊆ [0, 1]). Consider the sequence {I_n} of intervals [0, 1], [0, 1/2), [1/2, 1], [0, 1/3), [1/3, 2/3), [2/3, 1], . . . , and let z_n = 1_{I_n} be the indicator function of I_n: z_n(ω) = 1 if ω ∈ I_n and z_n(ω) = 0 otherwise. As n tends to infinity, the length of I_n shrinks to zero. For 0 < ε < 1, we then have

IP( {ω : |z_n(ω)| > ε} ) = IP(I_n) → 0,

so that z_n →IP 0. For each ω, however, z_n(ω) = 1 for
infinitely many n, and hence z_n(ω) does not converge to zero. Note that convergence in probability permits z_n to deviate from the probability limit infinitely often, but almost sure convergence does not, except for ω in a set of probability zero. ∎
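The Python sketch below mimics this example over a finite horizon (using half-open intervals throughout, an immaterial simplification): IP(z_n = 1) shrinks to zero, yet the fixed sample point ω keeps landing in I_n.

    import numpy as np

    def intervals(n_max):
        """Enumerate [0,1), [0,1/2), [1/2,1), [0,1/3), ... up to n_max intervals."""
        out, k = [], 1
        while len(out) < n_max:
            for j in range(k):
                out.append((j / k, (j + 1) / k))
            k += 1
        return out[:n_max]

    omega = 0.7                                  # one fixed sample point
    ivals = intervals(5000)
    z = np.array([lo <= omega < hi for lo, hi in ivals], dtype=int)

    lengths = [ivals[i][1] - ivals[i][0] for i in (9, 99, 999)]
    print("IP(z_n = 1) at n = 10, 100, 1000:", [round(x, 4) for x in lengths])
    print("last n <= 5000 with z_n(omega) = 1:", z.nonzero()[0].max() + 1)
    # The probabilities vanish, but z_n(omega) = 1 keeps recurring, so this
    # sample path does not converge.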
Intuitively, when zn has finite variance such that var(zn ) vanishes asymptotically,
the distribution of zn would shrink toward its mean IE(zn ). If, in addition, IE(zn )
tends to a constant c (or IE(zn ) = c), then zn ought to be degenerate at c in the
limit. These observations suggest the following sufficient conditions for convergence in
probability; see Exercises 5.7 and 5.8. In many cases, it is easier to establish convergence
in probability by verifying these conditions.
Lemma 5.16 Let {z_n} be a sequence of square integrable random variables. If IE(z_n) → c and var(z_n) → 0, then z_n →IP c.
Lemma 5.17 Let g : R → R be a function continuous on S_g ⊆ R.

[a] If z_n →IP z, where z is a random variable such that IP(z ∈ S_g) = 1, then g(z_n) →IP g(z).

[b] (Slutsky) If z_n →IP c, where c is a real number at which g is continuous, then g(z_n) →IP g(c).
Proof: By the continuity of g on S_g, for each ε > 0 we can find a δ > 0 such that

{ω : |z_n(ω) − z(ω)| ≤ δ} ∩ {ω : z(ω) ∈ S_g} ⊆ {ω : |g(z_n(ω)) − g(z(ω))| ≤ ε}.

Taking complementation of both sides and noting that the complement of {ω : z(ω) ∈ S_g} has probability zero, we have

IP( |g(z_n) − g(z)| > ε ) ≤ IP( |z_n − z| > δ ) → 0. ∎
Lemma 5.17 likewise applies elementwise to random vectors, so that, in particular,

z_{1,n} + z_{2,n} →IP z₁ + z₂ and z_{1,n} z_{2,n} →IP z₁ z₂,

where z_{1,n}, z_{2,n} are two elements of z_n and z₁, z₂ are the corresponding elements of z. Also, provided that z₂ ≠ 0 with probability one, z_{1,n}/z_{2,n} →IP z₁/z₂.
A sequence of random variables {z_n} with distribution functions {F_{z_n}} is said to converge in distribution to the random variable z with distribution function F_z, denoted as z_n →D z, if

F_{z_n}(ζ) → F_z(ζ),

for every continuity point ζ of F_z. That is, regardless of the actual distributions of the z_n, convergence in distribution ensures that F_{z_n} will be arbitrarily close to F_z for all n sufficiently large. The distribution F_z is thus known as the limiting distribution of z_n. We also say that z_n is asymptotically distributed as F_z, denoted as z_n ∼ᴬ F_z.
For random vectors {z_n} and z, z_n →D z if the joint distributions F_{z_n} converge to F_z at every continuity point ζ of F_z. It is, however, more cumbersome to show convergence in distribution for a sequence of random vectors. The so-called Cramér-Wold device allows us to transform this multivariate convergence problem into a univariate one. This result is stated below without proof.

Lemma 5.18 (Cramér-Wold Device) Let {z_n} be a sequence of d-dimensional random vectors. Then z_n →D z if, and only if, λ′z_n →D λ′z for every λ ∈ R^d such that λ′λ = 1.
It can also be shown that convergence in probability implies convergence in distribution; that is, if z_n →IP z, then F_{z_n}(ζ) → F_z(ζ) at every continuity point ζ of F_z. The converse is not true in general, however.
When zn converges in distribution to a real number c, it is not difficult to show
that zn also converges to c in probability. In this case, these two convergence modes
are equivalent. To be sure, note that a real number c can be viewed as a degenerate
random variable with the distribution function:
F(ζ) = 0 for ζ < c, and F(ζ) = 1 for ζ ≥ c,
which is a step function with a jump point at c. When z_n →D c, all the probability mass of z_n will concentrate at c as n becomes large; this is precisely what z_n →IP c means. More formally, c − ε and c + ε are continuity points of F for any ε > 0, so that

IP( |z_n − c| ≤ ε ) ≥ F_{z_n}(c + ε) − F_{z_n}(c − ε) → 1 − 0 = 1.
The continuous mapping theorem below asserts that continuous functions preserve convergence in distribution; cf. Lemmas 5.13 and 5.17.

Lemma 5.20 (Continuous Mapping Theorem) Let g : R → R be a function continuous on S_g ⊆ R. If z_n →D z, where z is a random variable such that IP(z ∈ S_g) = 1, then g(z_n) →D g(z).
For example, if z_n converges in distribution to the standard normal random variable, the limiting distribution of z_n² is χ²(1). Generalizing this result to R^d-valued random variables, we can see that when z_n converges in distribution to the d-dimensional standard normal random vector, the limiting distribution of z_n′ z_n is χ²(d).
Two sequences of random variables {y_n} and {z_n} are said to be asymptotically equivalent if their difference y_n − z_n converges to zero in probability. Intuitively, the limiting distributions of two asymptotically equivalent sequences, if they exist, ought to be the same. This is stated in the next result without proof.
Lemma 5.21 Let {y_n} and {z_n} be two sequences of random vectors such that y_n − z_n →IP 0. If z_n →D z, then y_n →D z.
The next result is concerned with two sequences of random variables such that one converges in distribution and the other converges in probability.

Lemma 5.22 Let {y_n} and {z_n} be two sequences of random variables such that y_n →D y and z_n →IP 0. Then y_n z_n converges in probability to zero.
The order notations can be easily extended to describe the behavior of sequences of random variables. A sequence of random variables {z_n} is said to be O_a.s.(c_n) (or O(c_n) almost surely) if z_n/c_n is O(1) a.s., and it is said to be O_IP(c_n) (or O(c_n) in probability) if for every ε > 0, there is some Δ > 0 such that

IP( |z_n|/c_n ≥ Δ ) ≤ ε,

for all n sufficiently large. Similarly, {z_n} is o_a.s.(c_n) (or o(c_n) almost surely) if z_n/c_n →a.s. 0, and it is o_IP(c_n) (or o(c_n) in probability) if z_n/c_n →IP 0.
If {z_n} is O_a.s.(1) (respectively o_a.s.(1)), we say that z_n is bounded (vanishing) almost surely; if {z_n} is O_IP(1) (o_IP(1)), z_n is bounded (vanishing) in probability. Note that Lemma 5.23 also holds for stochastic order notations. In particular, if one sequence of random variables is bounded almost surely (in probability) and another is vanishing almost surely (in probability), the products of their corresponding elements vanish almost surely (in probability). That is, if y_n = O_a.s.(1) and z_n = o_a.s.(1), then y_n z_n is o_a.s.(1).
When z_n →D z, we know that z_n does not converge in probability to z in general, but more can be said about the behavior of z_n. Let ζ be a continuity point of F_z. For any ε > 0, we can choose ζ sufficiently large such that IP(|z| > ζ) < ε/2. As z_n →D z, we can also choose n large enough such that

IP( |z_n| > ζ ) < IP( |z| > ζ ) + ε/2,

which implies IP(|z_n| > ζ) < ε. This leads to the following conclusion.
Lemma 5.24 Let {z_n} be a sequence of random vectors such that z_n →D z. Then z_n = O_IP(1).
5.5 Law of Large Numbers

A sequence of integrable random variables is said to obey a law of large numbers when its sample average essentially follows its mean behavior; random irregularities (deviations from the mean) are “wiped out” in the limit by averaging. When a law of large numbers holds almost surely, it is a strong law of large numbers (SLLN); when a law of large numbers holds in probability, it is a weak law of large numbers (WLLN). For a sequence of random vectors (matrices), a SLLN (WLLN) is defined elementwise.
There are different versions of the SLLN (WLLN) for various types of random vari-
ables. Below is a well known SLLN for i.i.d. random variables.
Lemma 5.25 (Kolmogorov) Let {z_t} be a sequence of i.i.d. random variables with mean μo. Then,

(1/T) ∑_{t=1}^T z_t →a.s. μo.
This result asserts that, when zt have a finite, common mean μo , the sample average of
zt is essentially close to μo , a non-stochastic number. Note, however, that i.i.d. random
variables need not obey Kolmogorov’s SLLN if they do not have a finite mean; for
instance, Lemma 5.25 does not apply to i.i.d. Cauchy random variables. As almost
sure convergence implies convergence in probability, the same condition in Lemma 5.25
ensures that {zt } also obeys a WLLN.
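The contrast is easy to see by simulation. The Python sketch below compares sample averages of i.i.d. standard normal and i.i.d. standard Cauchy variables at increasing sample sizes (illustrative choices).

    import numpy as np

    rng = np.random.default_rng(4)
    sizes = [100, 10_000, 1_000_000]
    for name, draw in [("normal", rng.standard_normal), ("Cauchy", rng.standard_cauchy)]:
        means = [draw(T).mean() for T in sizes]
        print(name.ljust(6), ["%9.3f" % m for m in means])
    # The normal averages settle near zero; the Cauchy averages keep jumping,
    # since the average of T i.i.d. Cauchy variables is itself standard Cauchy.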
When {zt } is a sequence of independent random variables with possibly heteroge-
neous distributions, it may still obey a SLLN (WLLN) under a stronger condition.
Lemma 5.26 (Markov) Let {z_t} be a sequence of independent random variables with non-degenerate distributions such that for some δ > 0, IE |z_t|^{1+δ} is bounded for all t. Then,

(1/T) ∑_{t=1}^T [z_t − IE(z_t)] →a.s. 0,

or, equivalently,

(1/T) ∑_{t=1}^T z_t →a.s. lim_{T→∞} (1/T) ∑_{t=1}^T IE(z_t) =: μ*.
The following example shows that a sequence of correlated random variables may
also obey a WLLN.
Example 5.27 Consider the AR(1) process

y_t = α_o y_{t−1} + u_t, |α_o| < 1,

where u_t are i.i.d. random variables with mean zero and variance σ_u². In view of Section 4.3, we have IE(y_t) = 0, var(y_t) = σ_u²/(1 − α_o²), and

cov(y_t, y_{t−j}) = α_o^j σ_u² / (1 − α_o²).
These results imply that IE(T⁻¹ ∑_{t=1}^T y_t) = 0 and

var( ∑_{t=1}^T y_t ) = ∑_{t=1}^T var(y_t) + 2 ∑_{τ=1}^{T−1} (T − τ) cov(y_t, y_{t−τ})
                     ≤ ∑_{t=1}^T var(y_t) + 2T ∑_{τ=1}^{T−1} |cov(y_t, y_{t−τ})|
                     = O(T).
The latter result shows that var(T⁻¹ ∑_{t=1}^T y_t) = O(T⁻¹), which converges to zero as T approaches infinity. It follows from Lemma 5.16 that

(1/T) ∑_{t=1}^T y_t →IP 0;
that is, {y_t} obeys a WLLN. A key condition in the argument above is that the variance of ∑_{t=1}^T y_t does not grow too rapidly (it is O(T)). The facts that y_t has a constant variance and that cov(y_t, y_{t−j}) goes to zero exponentially fast as j tends to infinity are sufficient for this condition. This WLLN result is readily generalized to weakly stationary AR(p) processes. ∎
The example above shows that it may be quite cumbersome to establish a WLLN
for weakly stationary processes. The lemma below gives a strong law for correlated
random variables and is convenient in practice; see Davidson (1994, p. 326) for a more
general result.
Lemma 5.28 Let y_t = ∑_{i=0}^∞ π_i u_{t−i}, where u_t are i.i.d. random variables with mean zero and variance σ_u². If the π_i are absolutely summable, i.e., ∑_{i=0}^∞ |π_i| < ∞, then

(1/T) ∑_{t=1}^T y_t →a.s. 0.
In Example 5.27, y_t = ∑_{i=0}^∞ α_o^i u_{t−i} with |α_o| < 1, so that ∑_{i=0}^∞ |α_o|^i < ∞. Hence, Lemma 5.28 ensures that the average of y_t also converges to its mean (zero) almost surely. If y_t = z_t − μo, then the average of z_t converges to IE(z_t) = μo almost surely. Compared with the direct calculation in Example 5.27, Lemma 5.28 is quite general and applies to any process that can be expressed as an MA process with absolutely summable weights.
From Lemmas 5.25, 5.26 and 5.28 we can see that a SLLN (WLLN) does not always
hold. The random variables in a sequence must be “well behaved” (i.e., satisfying certain
regularity conditions) to ensure a SLLN (WLLN). In particular, the sufficient conditions
for a SLLN (WLLN) usually regulate the moments and dependence structure of random
variables. Intuitively, random variables without suitably bounded moments may exhibit aberrant behavior, so that their random irregularities cannot be completely averaged
out. For random variables with strong correlations over time, the variation of their
partial sums may grow too rapidly and cannot be eliminated by simple averaging. More
generally, it is also possible for a sequence of weakly dependent and heterogeneously dis-
tributed random variables to obey a SLLN (WLLN). This usually requires even stronger
conditions on their moments and dependence structure. To avoid technicality, we will
not give a SLLN (WLLN) for such general sequences but refer to White (2001) and
Davidson (1994) for details. The following examples illustrate why a SLLN (WLLN)
may fail to hold.
Example 5.29 Consider the sequences {t} and {t²}, t = 1, 2, . . .. It is well known that

∑_{t=1}^T t = T(T + 1)/2,

∑_{t=1}^T t² = T(T + 1)(2T + 1)/6.

Hence, T⁻¹ ∑_{t=1}^T t and T⁻¹ ∑_{t=1}^T t² both diverge. Thus, these sequences do not obey a SLLN. ∎
Example 5.30 Suppose that u_t are i.i.d. random variables with mean zero and variance σ_u². Thus, T⁻¹ ∑_{t=1}^T u_t →a.s. 0 by Kolmogorov's SLLN (Lemma 5.25). Consider now {t u_t}. This sequence does not have bounded (1 + δ)-th moment, because IE |t u_t|^{1+δ} grows with t, and therefore does not obey Markov's SLLN (Lemma 5.26). Moreover, note that

var( ∑_{t=1}^T t u_t ) = ∑_{t=1}^T t² var(u_t) = σ_u² T(T + 1)(2T + 1)/6.

By Exercise 5.11, ∑_{t=1}^T t u_t = O_IP(T^{3/2}). It follows that T⁻¹ ∑_{t=1}^T t u_t = O_IP(T^{1/2}), which diverges in probability. This shows that the sequence {t u_t} does not obey a WLLN either. ∎
Example 5.31 Consider the random walk

y_t = y_{t−1} + u_t, t = 1, 2, . . . ,

with y₀ = 0, where u_t are i.i.d. random variables with mean zero and variance σ_u². Clearly,

y_t = ∑_{i=1}^t u_i,

which has mean zero and unbounded variance t σ_u². For s < t, write

y_t = y_s + ∑_{i=s+1}^t u_i = y_s + v_{t−s},

where v_{t−s} = ∑_{i=s+1}^t u_i is independent of y_s, so that cov(y_t, y_{t−τ}) = var(y_{t−τ}) = (t − τ)σ_u². We then have

∑_{t=1}^T var(y_t) = ∑_{t=1}^T t σ_u² = O(T²),

2 ∑_{τ=1}^{T−1} ∑_{t=τ+1}^T cov(y_t, y_{t−τ}) = 2 ∑_{τ=1}^{T−1} ∑_{t=τ+1}^T (t − τ) σ_u² = O(T³).
Thus, var(∑_{t=1}^T y_t) = O(T³), so that ∑_{t=1}^T y_t = O_IP(T^{3/2}) by Exercise 5.11. This shows that

(1/T) ∑_{t=1}^T y_t = O_IP(T^{1/2}),

which diverges in probability. This shows that when {y_t} is a random walk, it does not obey a WLLN. In this case, the y_t have unbounded variances and strong correlations over time. Due to these correlations, the variation of the partial sum of y_t grows much too fast. (Recall that the variance of ∑_{t=1}^T y_t is only O(T) in Example 5.27.) These conclusions are not altered when {u_t} is a white noise or a weakly stationary process. ∎
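The divergence is visible in the following Python sketch (standard normal innovations and simulation sizes are illustrative choices): the spread of T⁻¹ ∑ y_t grows like T^{1/2}.

    import numpy as np

    rng = np.random.default_rng(6)
    reps = 1000
    for T in (100, 1000, 10000):
        y = rng.standard_normal((reps, T)).cumsum(axis=1)   # random walk paths
        avg = y.mean(axis=1)                                # T^{-1} sum of y_t
        print(f"T={T:6d}  std of average={avg.std():8.3f}"
              f"  divided by sqrt(T)={avg.std() / np.sqrt(T):.3f}")
    # The raw standard deviation grows with T; only after dividing by sqrt(T)
    # does it stabilize, so the average itself diverges in probability.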
Example 5.32 Consider again the random walk

y_t = y_{t−1} + u_t, t = 1, 2, . . . ,

with y₀ = 0, as in Example 5.31. Then, the sequence {y_{t−1} u_t} has mean zero and

var(y_{t−1} u_t) = IE(y_{t−1}²) IE(u_t²) = (t − 1)σ_u⁴.

We then have

var( ∑_{t=1}^T y_{t−1} u_t ) = ∑_{t=1}^T var(y_{t−1} u_t) = ∑_{t=1}^T (t − 1)σ_u⁴ = O(T²),

and ∑_{t=1}^T y_{t−1} u_t = O_IP(T). Note, however, that var(T⁻¹ ∑_{t=1}^T y_{t−1} u_t) converges to σ_u⁴/2, rather than 0. Thus, T⁻¹ ∑_{t=1}^T y_{t−1} u_t cannot behave like a non-stochastic number in the limit. This shows that {y_{t−1} u_t} does not obey a WLLN, even though its partial sums are O_IP(T). ∎
5.6 Uniform Law of Large Numbers

Recall from the previous section that a sequence of integrable random variables {z_t} obeys a SLLN if

(1/T) ∑_{t=1}^T [z_t − IE(z_t)] →a.s. 0; (5.1)

it is said to obey a WLLN if the almost sure convergence above is replaced by convergence in probability. When IE(z_t) is a constant μo, (5.1) simplifies to

(1/T) ∑_{t=1}^T z_t →a.s. μo.
Consider now functions of the random variables z_t that also depend on a parameter vector θ, and suppose that {q(z_t; θ)} obeys a SLLN for each θ ∈ Θ:

Q_T(ω; θ) = (1/T) ∑_{t=1}^T q(z_t(ω); θ) →a.s. Q(θ),

where Q(θ) denotes the non-stochastic pointwise limit.
In applications, θ is typically replaced by an estimator θ̃_T, which changes with T. There may then not exist a finite T* such that Q_T(ω; θ̃_T) is arbitrarily close to Q(θ̃_T) for all T > T*.
These observations suggest that we should study convergence that is uniform on the parameter space Θ. In particular, Q_T(ω; θ) converges to Q(θ) uniformly in θ almost surely (in probability) if the largest possible difference

sup_{θ∈Θ} |Q_T(ω; θ) − Q(θ)|

converges to zero almost surely (in probability). In what follows we always assume that this supremum is a random variable for all T. The example below, similar to Example 2.14 of Davidson (1994), illustrates the difference between uniform and pointwise convergence.
Example 5.33 Let z_t be i.i.d. random variables with zero mean, let Θ = [0, ∞), and define

q_T(z_t(ω); θ) = z_t(ω) + Tθ, for 0 ≤ θ ≤ 1/(2T),
q_T(z_t(ω); θ) = z_t(ω) + 1 − Tθ, for 1/(2T) < θ ≤ 1/T,
q_T(z_t(ω); θ) = z_t(ω), for 1/T < θ < ∞.

For any fixed θ and all T sufficiently large,

Q_T(ω; θ) = (1/T) ∑_{t=1}^T q_T(z_t; θ) = (1/T) ∑_{t=1}^T z_t,

which converges to zero almost surely by Kolmogorov's SLLN. Thus, for each given θ, Q_T(ω; θ) →a.s. 0, the pointwise limit. On the other hand, the “tent” component of q_T attains its maximum 1/2 at θ = 1/(2T), so that

sup_{θ∈Θ} |Q_T(ω; θ)| = |z̄_T + 1/2| →a.s. 1/2,

where z̄_T = T⁻¹ ∑_{t=1}^T z_t; that is, Q_T(ω; θ) does not converge to zero uniformly in θ. ∎
More generally, let q_{Tt}(z_t; θ) denote functions of the random vectors z_t and the parameter vector θ, where θ takes values in the parameter space Θ ⊆ R^m. For notational simplicity, we will not explicitly write ω in the functions. We say that {q_{Tt}(z_t; θ)} obeys a strong uniform law of large numbers (SULLN) if

sup_{θ∈Θ} | (1/T) ∑_{t=1}^T [q_{Tt}(z_t; θ) − IE(q_{Tt}(z_t; θ))] | →a.s. 0; (5.2)

cf. (5.1). Similarly, {q_{Tt}(z_t; θ)} is said to obey a weak uniform law of large numbers (WULLN) if the convergence above holds in probability. If the q_{Tt} are vector-valued functions, the SULLN (WULLN) is defined elementwise.
We have seen that pointwise convergence alone does not imply uniform convergence.
An interesting question one would ask is: What are the additional conditions required
to guarantee uniform convergence? Let
Q_T(θ) = (1/T) ∑_{t=1}^T [q_{Tt}(z_t; θ) − IE(q_{Tt}(z_t; θ))].
Suppose that Q_T satisfies a Lipschitz-type continuity condition: for θ, θ† ∈ Θ,

|Q_T(θ) − Q_T(θ†)| ≤ C_T ‖θ − θ†‖ a.s.,

where ‖·‖ denotes the Euclidean norm, and C_T is a random variable bounded almost surely that does not depend on θ. Under this condition, Q_T(θ†) can be made arbitrarily close to Q_T(θ), provided that θ† is sufficiently close to θ. Using the triangle inequality and taking the supremum over θ we have

sup_{θ∈Θ} |Q_T(θ)| ≤ sup_{θ∈Θ} |Q_T(θ) − Q_T(θ†)| + |Q_T(θ†)|.

Let Δ denote an almost sure bound of C_T. Then given any ε > 0, choosing θ† such that ‖θ − θ†‖ < ε/(2Δ) implies

sup_{θ∈Θ} |Q_T(θ) − Q_T(θ†)| ≤ C_T ε/(2Δ) ≤ ε/2,

for all T sufficiently large. As these results hold almost surely, we have a SULLN for Q_T(θ); the conditions ensuring a WULLN are analogous.
Lemma 5.34 Suppose that for each θ ∈ Θ, {q_{Tt}(z_t; θ)} obeys a SLLN (WLLN), and that for θ, θ† ∈ Θ,

|Q_T(θ) − Q_T(θ†)| ≤ C_T ‖θ − θ†‖ a.s.,

where C_T is a random variable bounded almost surely (in probability) and does not depend on θ. Then, {q_{Tt}(z_t; θ)} obeys a SULLN (WULLN).
5.7 Central Limit Theorem

There are also different versions of the central limit theorem (CLT) for various types of random variables. The following CLT applies to i.i.d. random variables.
Lemma 5.35 (Lindeberg-Lévy) Let {z_t} be a sequence of i.i.d. random variables with mean μo and variance σo² > 0. Then,

√T (z̄_T − μo)/σo →D N(0, 1),

where z̄_T = T⁻¹ ∑_{t=1}^T z_t.

A sequence of i.i.d. random variables need not obey this CLT if it does not have a finite variance; consider, e.g., random variables with the t(2) distribution. Compared with Lemma 5.25, one can immediately see that the Lindeberg-Lévy CLT requires a stronger condition (i.e., finite variance) than does Kolmogorov's SLLN.
Remark: Under the conditions of Lemma 5.35, z̄_T converges to μo in probability, and its variance σo²/T vanishes when T tends to infinity. To prevent a degenerate distribution in the limit, it is natural to consider the normalized average T^{1/2}(z̄_T − μo), which has a constant
variance σo² for all T. This explains why the normalizing factor T^{1/2} is needed. For a normalizing factor T^a with a < 1/2, the normalized average still converges to zero because its variance vanishes in the limit. For a normalizing factor T^a with a > 1/2, the normalized average diverges. In both cases, the resulting normalized averages cannot have a well-behaved, non-degenerate distribution in the limit. Thus, when {z_t} obeys a CLT, z̄_T is said to converge to μo at the rate T^{−1/2}.
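The role of the rate can be seen in the Python sketch below, which compares the spread of T^a(z̄_T − μo) for a below, at, and above 1/2; the exponential distribution and the simulation sizes are illustrative choices.

    import numpy as np

    rng = np.random.default_rng(7)
    reps, mu = 1000, 1.0
    for T in (100, 10_000):
        zbar = rng.exponential(scale=mu, size=(reps, T)).mean(axis=1)  # variance mu^2 = 1
        for a in (0.25, 0.5, 0.75):
            spread = np.std(T**a * (zbar - mu))
            print(f"T={T:6d}  a={a:4.2f}  std of T^a*(zbar-mu) = {spread:8.3f}")
    # Only a = 1/2 keeps the spread stable (near sigma_o = 1) as T grows;
    # smaller a degenerates to zero and larger a blows up.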
Independent but heterogeneously distributed random variables may also obey a CLT. Below is a version of Liapunov's CLT for independent (but not necessarily identically distributed) random variables.
Lemma 5.36 Let {z_{Tt}} be a triangular array of independent random variables with means μ_{Tt} and variances σ_{Tt}² > 0 such that

σ̄_T² = (1/T) ∑_{t=1}^T σ_{Tt}² → σo² > 0.

If for some δ > 0, IE |z_{Tt}|^{2+δ} is bounded for all t, then

√T (z̄_T − μ̄_T)/σo →D N(0, 1),

where z̄_T = T⁻¹ ∑_{t=1}^T z_{Tt} and μ̄_T = T⁻¹ ∑_{t=1}^T μ_{Tt}.
Note that this result requires a stronger condition (bounded (2 + δ)-th moment) than does Markov's SLLN, Lemma 5.26. Compared with the Lindeberg-Lévy CLT, Lemma 5.36 allows the mean and variance to vary with t, at the expense of a stronger moment condition.
The sufficient conditions for a CLT are similar to but usually stronger than those for
a WLLN. In particular, the random variables that obey a CLT have bounded moment
up to some higher order and are asymptotically independent with dependence vanishing
sufficiently fast. Moreover, every random variable must also be asymptotically negli-
gible, in the sense that no random variable is influential in affecting the partial sums.
Although we will not specify the regularity conditions explicitly, we note that weakly
stationary AR and MA processes obey a CLT in general. A sequence of weakly depen-
dent and heterogeneously distributed random variables may also obey a CLT, depending
on its moment and dependence structure. The following examples show that a CLT may
not always hold.
Example 5.37 Suppose that {u_t} is a sequence of independent random variables with mean zero, variance σ_u², and bounded (2 + δ)-th moment. From Example 5.29, we know that var(∑_{t=1}^T t u_t) is O(T³), which implies that the variance of T^{−1/2} ∑_{t=1}^T t u_t diverges at the rate O(T²). On the other hand, observe that

var( T^{−1/2} ∑_{t=1}^T (t/T) u_t ) = [T(T + 1)(2T + 1)/(6T³)] σ_u² → σ_u²/3.

These results show that {(t/T) u_t} obeys a CLT, whereas {t u_t} does not. ∎
Example 5.38 Consider again the random walk

y_t = y_{t−1} + u_t, t = 1, 2, . . . ,

with y₀ = 0, where u_t are i.i.d. random variables with mean zero and variance σ_u². From Example 5.31 we have seen that the y_t have unbounded variances and strong correlations over time. Hence, they do not obey a CLT. Example 5.32 also suggests that {y_{t−1} u_t} does not obey a CLT. ∎
More generally, a triangular array {z_{Tt}} is said to obey a CLT if

√T (z̄_T − μ̄_T)/σo →D N(0, 1), (5.3)

where z̄_T = T⁻¹ ∑_{t=1}^T z_{Tt}, μ̄_T = IE(z̄_T), and

σ_T² = var( T^{−1/2} ∑_{t=1}^T z_{Tt} ) → σo² > 0.

Note that this definition requires neither IE(z_{Tt}) nor var(z_{Tt}) to be a constant. If IE(z_{Tt}) is the constant μo, (5.3) would read:

√T (z̄_T − μo)/σo →D N(0, 1),

as is usually seen in other textbooks.
For a sequence of d-dimensional random vectors {z_{Tt}}, let

Σo = lim_{T→∞} var( T^{−1/2} ∑_{t=1}^T z_{Tt} ),

a positive definite matrix. Using the Cramér-Wold device (Lemma 5.18), {z_{Tt}} is said to obey a multivariate CLT, in the sense that

Σo^{−1/2} (1/√T) ∑_{t=1}^T [z_{Tt} − IE(z_{Tt})] = Σo^{−1/2} √T (z̄_T − μ̄_T) →D N(0, I_d).
5.8 Functional Central Limit Theorem

A d-dimensional stochastic process z on the probability space (Ω, F, IP) with index set T is a collection of R^d-valued random variables:

z(ω) = {z_t(ω), t ∈ T}.
For each t ∈ T, z_t(·) is an R^d-valued random variable; for each ω, z(ω) is a sample path (realization) of z, which is an R^d-valued function on T. Therefore, a stochastic process may be understood as a collection of random variables or as a random function on the index set.
The random sequence encountered in the preceding sections is just a stochastic process
whose index set is the set of integers.
In what follows, for the stochastic process z, we will write z(t, ·) or simply z(t) in
place of z t (·). Thus, z with a subscript (say, z n ) denotes a process in a sequence of
stochastic processes. To signify the index set T , we may also write z as {z(t, ·), t ∈ T }.
The finite-dimensional distributions of {z(t, ·), t ∈ T} are

IP( z(t₁) ≤ ζ₁, . . . , z(t_k) ≤ ζ_k ),

for any finite collection of indices t₁, . . . , t_k in T.
The process {w(t), t ∈ [0, ∞)} is the standard Wiener process (also known as the
standard Brownian motion) if it has continuous sample paths almost surely and satisfies
the following properties.
(i) IP( w(0) = 0 ) = 1.

(ii) For 0 ≤ t₀ ≤ t₁ ≤ ⋯ ≤ t_k and any Borel sets B₁, . . . , B_k,

IP( w(t_i) − w(t_{i−1}) ∈ B_i, i ≤ k ) = ∏_{i≤k} IP( w(t_i) − w(t_{i−1}) ∈ B_i ).

(iii) For 0 ≤ s < t, w(t) − w(s) is normally distributed with mean zero and variance t − s.
By (i), this process must start from the origin with probability one. The second property requires non-overlapping increments of w to be independent. By property (iii), every increment of w is normally distributed with variance depending on the time difference; in particular, w(t) is normally distributed with mean zero and variance t. This implies that for r ≤ t,

cov( w(r), w(t) ) = IE[ w(r)(w(t) − w(r)) ] + IE[ w(r)² ] = r.
We also note that, although the sample paths of the Wiener process are a.s. con-
tinuous, they are highly irregular. To see this, define wc (t) = w(c2 t)/c for c > 0. It
can be shown that wc is also a standard Wiener process (Exercise 5.13). Note that
wc (1/c) = w(c)/c, where w(c)/c is the slope of the chord between w(c) and w(0). If
we choose a c large enough such that w(c)/c > 1, then the slope of the chord between
wc (1/c) and wc (0) is
w_c(1/c)/(1/c) = (w(c)/c) · c = w(c) > c.

This shows that the sample path of w_c has a chord of slope greater than c, and hence must experience a large change on the very small interval (0, 1/c). In fact, it can be shown that almost
all the sample paths of w are nowhere differentiable; see e.g., Billingsley (1979, p. 450).
Intuitively, the difference quotient [w(t + h) − w(t)]/h is distributed as N(0, 1/|h|). As its variance diverges to infinity when h tends to zero, the difference quotient cannot converge to a finite limit with positive probability.
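The scaling property w_c(t) = w(c²t)/c used above can be checked by simulation, as in the Python sketch below, which approximates Wiener paths by partial sums of Gaussian increments (the grid size and the value of c are illustrative choices).

    import numpy as np

    rng = np.random.default_rng(8)
    reps, n, c = 5000, 2000, 3.0            # n grid points on [0, c^2]
    dt = c**2 / n
    w = np.sqrt(dt) * rng.standard_normal((reps, n))
    w = w.cumsum(axis=1)                    # approximate paths of w on [0, c^2]

    for t in (0.25, 0.5, 1.0):
        idx = int(round(c**2 * t / dt)) - 1 # grid index of time c^2 * t
        wc_t = w[:, idx] / c                # w_c(t) = w(c^2 t)/c
        print(f"t={t:4.2f}  var of w_c(t) = {wc_t.var():.4f}  (target {t})")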
We may also construct different processes using the standard Wiener process. In
particular, the process w0 on [0, 1] with w0 (t) = w(t) − tw(1) is known as the Brownian
bridge or the tied down Brownian motion. It is easily seen that w0 (0) = w0 (1) = 0 with
probability one so that the Brownian bridge starts from zero and must return to zero
at t = 1. Moreover, IE[w⁰(t)] = 0 and, for r < t,

cov( w⁰(r), w⁰(t) ) = cov( w(r) − r w(1), w(t) − t w(1) ) = r(1 − t) I_d.
Let S be a metric space and S be the Borel σ-algebra generated by the open sets in S.
If for every bounded, continuous real function f on S we have

∫_S f(s) dIP_n(s) → ∫_S f(s) dIP(s),
where {IPn } and IP are probability measures on (S, S), we say that IPn converges weakly
to IP and write IPn ⇒ IP. For the random elements z n and z in S with the distributions
induced by IPn and IP, respectively, we say that {z n } converges in distribution to z, also
D
denoted as z n −→ z, if IPn ⇒ IP. Note that z n and z here may be random functions.
When z n and z are all Rd -valued random variables, IPn ⇒ IP reduces to the usual notion
of convergence in distribution, as in Section 5.3.3. When z_n and z are d-dimensional stochastic processes, z_n →D z implies that all the finite-dimensional distributions of z_n converge to the corresponding distributions of z. To distinguish between the convergence in distribution of random variables and that of random functions, we shall, in what follows, denote the latter as z_n ⇒ z.
Let S and S′ be two metric spaces with respective Borel σ-algebras S and S′. Also let g : S → S′ be a measurable mapping. Then each probability measure IP on (S, S) induces a unique probability measure IP* on (S′, S′) via

IP*(A) = IP( g⁻¹(A) ), A ∈ S′.

If g is continuous, then for every bounded, continuous real function f on S′, the composition f ∘ g is bounded and continuous on S, so that IP_n ⇒ IP implies

∫_S f(g(s)) dIP_n(s) → ∫_S f(g(s)) dIP(s),

which is equivalent to

∫_{S′} f(a) dIP*_n(a) → ∫_{S′} f(a) dIP*(a).

This proves that IP*_n ⇒ IP*. This result is also known as the continuous mapping
theorem; cf. Lemma 5.20.
A sequence of random variables {ζi } is said to obey a functional central limit theorem
(FCLT) if its normalized partial sums zn converge in distribution to the standard Wiener
process w, i.e., zn ⇒ w. The FCLT, also known as the invariance principle, ensures that
the limiting behavior of the normalized partial sums of ζi is governed by the standard
Wiener process, regardless of the original distributions of ζi .
To see how the FCLT works, consider an i.i.d. sequence {ζ_i} with mean zero and variance σ². The partial sum of the ζ_i is s_n = ζ₁ + ⋯ + ζ_n, and it can be normalized as z_n(i/n) = s_i/(σ√n). For t ∈ [(i − 1)/n, i/n), define the constant interpolation of z_n as

z_n(t) = z_n((i − 1)/n) = s_{[nt]}/(σ√n),

where [nt] is the largest integer less than or equal to nt, so that [nt] = i − 1. It can be seen that the sample paths of z_n are right continuous with left-hand limits, i.e., z_n(t+) = z_n(t) and z_n(t−) = lim_{r↑t} z_n(r). Such sample paths are also known as cadlag (an abbreviation of the French “continue à droite, limite à gauche”) functions.
The interpolated process zn is thus a random element of D[0, 1], the space of all cadlag
functions. In view of the discussion of Section 5.8.2, we may study the weak convergence
property of {zn }.
from which we deduce that (z_n(r), z_n(t)) →D (w(r), w(t)). Proceeding along the same
line we can show that all the finite-dimensional distributions of zn converge to the
corresponding distributions of the standard Wiener process. Although merely proving
convergence of the finite-dimensional distributions is not sufficient for z_n ⇒ w, it should help in understanding the intuition of the FCLT. To arrive at z_n ⇒ w, it is also required that the probability measures induced by z_n be “well behaved”; we omit the details.
In view of the discussion above, we are now ready to state an FCLT for i.i.d. random
variables.
Lemma 5.41 (Donsker) Let ζ_t be i.i.d. random variables with mean μo and variance σo² > 0, and let z_T be the stochastic process with

z_T(r) = (1/(σo√T)) ∑_{t=1}^{[Tr]} (ζ_t − μo), r ∈ [0, 1].

Then, z_T ⇒ w as T → ∞.
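The Python sketch below illustrates Donsker's FCLT by simulating the normalized partial-sum process for i.i.d. uniform innovations (an illustrative choice) and checking the second moments implied by the standard Wiener process: var w(1/2) = 1/2, var w(1) = 1, and cov(w(1/2), w(1)) = 1/2.

    import numpy as np

    rng = np.random.default_rng(9)
    reps, T = 20_000, 500
    mu, sigma = 0.5, np.sqrt(1 / 12)                 # mean and std of U(0,1)
    zeta = rng.random((reps, T))
    zT = (zeta - mu).cumsum(axis=1) / (sigma * np.sqrt(T))   # zT(t/T), t = 1..T

    half, one = zT[:, T // 2 - 1], zT[:, -1]
    print("var zT(1/2) =", round(half.var(), 3))
    print("var zT(1)   =", round(one.var(), 3))
    print("cov         =", round(np.cov(half, one)[0, 1], 3))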
Lemma 5.42 Let ζ_t be independent random variables with means μ_t and variances σ_t² > 0 such that

σ̄_T² = (1/T) ∑_{t=1}^T σ_t² → σo² > 0,

and let z_T be the stochastic process with

z_T(r) = (1/(σo√T)) ∑_{t=1}^{[Tr]} (ζ_t − μ_t), r ∈ [0, 1].

If for some δ > 0, IE |ζ_t|^{2+δ} is bounded for all t, then z_T ⇒ w as T → ∞.
For a sequence of d-dimensional random vectors {ζ_t} with means μ_t, define

Σ* = lim_{T→∞} var( T^{−1/2} ∑_{t=1}^T ζ_t ),

and assume that Σ* exists and is positive definite. We say that {ζ_t} obeys a (multivariate) FCLT if z_T ⇒ w as T → ∞, where z_T is the d-dimensional stochastic process with

z_T(r) = (1/√T) Σ*^{−1/2} ∑_{t=1}^{[Tr]} (ζ_t − μ_t), r ∈ [0, 1],

and w is the d-dimensional standard Wiener process.
Example 5.43 Consider again the random walk

y_t = y_{t−1} + u_t, t = 1, 2, . . . ,

with y₀ = 0, where u_t are i.i.d. random variables with mean zero and variance σ_u². As {u_t} obeys Donsker's FCLT and y_{[Tr]} = ∑_{t=1}^{[Tr]} u_t is a partial sum of the u_t, we have

T^{−3/2} ∑_{t=1}^T y_t = σ_u ∑_{t=1}^T ∫_{(t−1)/T}^{t/T} (1/(σ_u√T)) y_{[Tr]} dr ⇒ σ_u ∫₀¹ w(r) dr,
where the right-hand side is a random variable. This result also verifies that ∑_{t=1}^T y_t is O_IP(T^{3/2}), as stated in Example 5.31. Similarly,

T^{−2} ∑_{t=1}^T y_t² ⇒ σ_u² ∫₀¹ w(r)² dr,

so that ∑_{t=1}^T y_t² is O_IP(T²). It is clear that these results remain valid as long as the u_t obey an FCLT (but they need not be i.i.d. or independent). ∎
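The first limit can be checked by simulation: since ∫₀¹ w(r) dr is distributed as N(0, 1/3), the statistic T^{−3/2} ∑ y_t should have variance near σ_u²/3. The Python sketch below uses σ_u = 1 and illustrative simulation sizes.

    import numpy as np

    rng = np.random.default_rng(10)
    reps, T = 20_000, 500
    y = rng.standard_normal((reps, T)).cumsum(axis=1)   # random walk paths
    stat = y.sum(axis=1) / T**1.5                       # T^{-3/2} sum of y_t

    print("mean:", round(stat.mean(), 4), " variance:", round(stat.var(), 4),
          " (target variance 1/3)")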
Exercises
5.1 Let C be a collection of subsets of Ω. Show that the intersection of all the σ-
algebras on Ω that contain C is the smallest σ-algebra containing C.
5.2 Let y and z be two independent, integrable random variables. Show that IE(yz) =
IE(y) IE(z).
5.3 Let x and y be two random variables with finite p-th moment (p > 1). Prove the following triangle inequality:

‖x + y‖_p ≤ ‖x‖_p + ‖y‖_p.

Hint: Write IE |x + y|^p = IE(|x + y| |x + y|^{p−1}) and apply Hölder's inequality.
5.4 In the probability space (Ω, F, IP) suppose that we know the event B in F has
occurred. Show that the conditional probability IP(·|B) satisfies the axioms for
probability measures.
5.6 Prove that for the square integrable random vectors z and y,

var(z) = IE[var(z | y)] + var( IE(z | y) ).

5.7 A sequence of square integrable random variables {z_n} is said to converge to the random variable z in L₂ if IE(z_n − z)² → 0. Show that convergence in L₂ implies convergence in probability.
5.8 Show that a sequence of square integrable random variables {zn } converges to a
constant c in L2 if and only if IE(zn ) → c and var(zn ) → 0.
5.9 Prove that z_T ∼ᴬ N(0, I) if, and only if, λ′z_T ∼ᴬ N(0, 1) for all λ such that λ′λ = 1.
5.11 Suppose that IE(z_n²) = O(c_n), where {c_n} is a sequence of positive real numbers. Show that z_n = O_IP(c_n^{1/2}).
5.12 Consider the random walk

y_t = y_{t−1} + u_t, t = 1, 2, . . . ,

with y₀ = 0, where u_t are i.i.d. normal random variables with mean zero and variance σ_u². Show that ∑_{t=1}^T y_t² is O_IP(T²).
5.13 Let w be a standard Wiener process and define w_c as w_c(t) = w(c²t)/c, where c > 0. Show that w_c is also a standard Wiener process.
5.14 Let w be a standard Wiener process and w0 a Brownian bridge. Suppose that
x(t) = w(t + r) − w(r) for a given r > 0 and y(t) = (1 + t) w0 (t/(1 + t)), t ∈ [0, ∞).
Show that both x and y are standard Wiener processes.
References
Ash, Robert B. (1972). Real Analysis and Probability, New York, NY: Academic Press.
Bierens, Herman J. (1994). Topics in Advanced Econometrics, New York, NY: Cam-
bridge University Press.
Billingsley, Patrick (1979). Probability and Measure, New York, NY: John Wiley and
Sons.
Davidson, James (1994). Stochastic Limit Theory, New York, NY: Oxford University
Press.
Gallant, A. Ronald (1997). An Introduction to Econometric Theory, Princeton, NJ: Princeton University Press.

Gallant, A. Ronald and Halbert White (1988). A Unified Theory of Estimation and Inference for Nonlinear Dynamic Models, Oxford, UK: Basil Blackwell.
White, Halbert (2001). Asymptotic Theory for Econometricians, revised edition, Or-
lando, FL: Academic Press.