Measure Theory
This is a slightly updated version of the Lecture Notes used in 204 in the
summer of 2002. The measure-theoretic foundations for probability theory
are assumed in courses in econometrics and statistics, as well as in some
courses in microeconomic theory and finance. These foundations are not
developed in the classes that use them, a situation we regard as very unfor-
tunate. The experience in the summer of 2002 indicated that it is impossible
to develop a good understanding of this material in the brief time available
for it in 204. Accordingly, this material will not be covered in 204. This hand-
out is being made available in the hope it will be of some help to students
as they see measure-theoretic constructions used in other courses.
The Riemann Integral (the integral that is treated in freshman calculus)
applies to continuous functions. It can be extended a little beyond the class
of continuous functions, but not very far. It can be used to define the lengths,
areas, and volumes of sets in R, R2, and R3 , provided those sets are rea-
sonably nice, in particular not too irregularly shaped. In R2, the Riemann
Integral defines the area under the graph of a function by dividing the x-axis
into a collection of small intervals. Over each of these small intervals, two
rectangles are erected: one lies entirely inside the region under the graph of
the function, while the other entirely contains the part of that region lying
above the interval. The
function is Riemann integrable (and its integral equals the area under its
graph) if, by making the intervals sufficiently small, it is possible to make
the sum of the areas of the outside rectangles arbitrarily close to the sum of
the areas of the inside rectangles.
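A rough numerical sketch of this squeezing, in Python (the function f(x) = x², the interval [0, 1], and the grid sizes are illustrative choices; f is increasing there, so its infimum and supremum on each subinterval sit at the endpoints):

    # Lower and upper Riemann sums of f(x) = x**2 on [0, 1].  As the partition
    # is refined, the two sums squeeze together toward the integral 1/3.
    def riemann_sums(f, a, b, n):
        width = (b - a) / n
        lower = upper = 0.0
        for i in range(n):
            left = a + i * width
            right = left + width
            # f is increasing on [0, 1], so its minimum on [left, right] is at
            # the left endpoint and its maximum is at the right endpoint.
            lower += f(left) * width
            upper += f(right) * width
        return lower, upper

    for n in (10, 100, 1000):
        print(n, riemann_sums(lambda x: x ** 2, 0.0, 1.0, n))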
Measure theory provides a way to extend our notions of length, area,
volume etc. to a much larger class of sets than can be treated using the
Riemann Integral. It also provides a way to extend the Riemann Integral
to Lebesgue integrable functions, a much larger class of functions than the
continuous functions.
The fundamental conceptual difference between the Riemann and Lebesgue
integrals is the way in which the partitioning is done. As noted above, the
Riemann Integral partitions the domain of the function into small intervals.
By contrast, the Lebesgue Integral partitions the range of the function into
small intervals, then considers the set of points in the domain on which the
value of the function falls into one of these intervals. Let f : [0, 1] → R.
Given an interval [a, b) ⊆ R, f −1 ([a, b)) may be a very messy set. However,
as long as we can assign a “length” or “measure” µ (f −1 ([a, b))) to this set,
we know that the contribution of this set to the integral of f should be be-
tween aµ (f −1 ([a, b))) and bµ (f −1 ([a, b))). By making the partition of the
range finer and finer, we can determine the integral of the function.
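The same integral can be approached the Lebesgue way. A sketch in Python, again for the illustrative choice f(x) = x² on [0, 1]: the range is partitioned into intervals [a, b), and µ (f −1 ([a, b))) is approximated by the fraction of a fine grid of domain points whose image lands in [a, b) (this grid fraction is only a stand-in for genuine Lebesgue measure):

    # Lebesgue-style sum for f(x) = x**2 on [0, 1]: partition the *range* into
    # intervals [a, b), weight each by the approximate measure of the preimage
    # f^{-1}([a, b)), and use the left endpoint a of each range interval, so the
    # result is a lower bound (using b instead gives an upper bound).
    def lebesgue_sum(f, domain_points, n_range_intervals, f_min, f_max):
        width = (f_max - f_min) / n_range_intervals
        total = 0.0
        for i in range(n_range_intervals):
            a = f_min + i * width
            b = a + width
            preimage_fraction = sum(a <= f(x) < b for x in domain_points) / len(domain_points)
            total += a * preimage_fraction    # contribution between a*mu and b*mu
        return total

    grid = [k / 10_000 for k in range(10_000)]
    print(lebesgue_sum(lambda x: x ** 2, grid, 200, 0.0, 1.0))   # close to 1/3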
Clearly, the key to extending the Lebesgue Integral to as wide a class of
functions as possible is to define the notion of “measure” on as wide a class
of sets as possible. In an ideal world, we would be able to define the measure
of every set; if we could do this, we could then define the Lebesgue integral
of every function. Unfortunately, as we shall see, it is not possible to define
a measure with nice properties on every subset of R.
Measure theory is thus a second best exercise. We try to extend the notion
of measure from our intuitive notions of length, area and volume to as large
a class of measurable subsets of R, R2, and R3 as possible. In order to
be able to make use of measures and integrals, we need to know that the
class of measurable sets is closed under certain types of operations. If we
can assign a sensible notion of measure to a set, we ought to be able to
assign a sensible notion to its complement. Probability and statistics focus
on questions about convergence of sequences of random variables. In order to
talk about convergence, we need to be able to assign measures to countable
unions and countable intersections of measurable sets. Thus, we would like
the collection of measurable sets to be a σ-algebra:
Definition 1 A measure space is a triple (Ω, B, µ), where

1. Ω is a set

2. B is a σ-algebra of subsets of Ω, i.e. a collection of subsets of Ω with Ω ∈ B that is closed under complements and countable unions (and hence also under countable intersections)

3. µ is a measure on B, i.e.

(a) µ : B → R+ ∪ {∞}

(b) (countable additivity) Bn ∈ B, n ∈ N, Bn ∩ Bm = ∅ if n ≠ m ⇒ µ (∪n∈N Bn) = Σn∈N µ(Bn)

and µ(∅) = 0.
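A toy finite illustration of the definition (a hypothetical example in Python, where Ω has three points, B is the collection of all subsets, and µ is given by point masses):

    # mu(B) = sum of the masses of the points of B.  Every subset is measurable
    # here, mu(empty set) = 0, and countable additivity reduces to finite
    # additivity because Omega is finite.
    mass = {"a": 0.5, "b": 0.25, "c": 0.25}

    def mu(B):
        return sum(mass[x] for x in B)

    B1, B2 = {"a"}, {"b", "c"}                # disjoint sets
    print(mu(set()))                          # 0
    print(mu(B1 | B2), mu(B1) + mu(B2))       # additivity: both equal 1.0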
Remark 2 The definition of a σ-algebra is closely related to the properties of
open sets in a metric space. Recall that the collection of open sets is closed
under (1) arbitrary unions and (2) finite intersections; by contrast, a σ-
algebra is closed under (1) countable unions and (2) countable intersections.
Notice also that σ-algebras are closed under complements; the complement
of an open set is closed, and generally not open, so closure under taking
complements is not a property of the collection of open sets. The analogy
between the properties of a σ-algebra and the properties of open sets in a
metric space will be very useful in developing the Lebesgue integral. Recall
that a function f : X → Y is continuous if and only if f −1 (U) is open
in X for every open set U in Y . Recall from the earlier discussion that
the Lebesgue integral of a function f is defined by partitioning the range
of the function f into small intervals, and summing up numbers of the form
aµ (f −1 ([a, b))); thus, we will need to know that f −1 ([a, b)) ∈ B. We will see
in a while that a function f : (Ω, B, µ) → (Ω′, B′) is said to be measurable
if f −1 (B′) ∈ B for every B′ ∈ B′. Thus, there is a close analogy between
measurable functions and continuous functions. As you know from calculus,
continuous functions on a closed interval can be integrated using the so-called
Riemann integral; the Lebesgue integral extends the Riemann integral to all
bounded measurable functions (and many unbounded measurable functions).
Remark 3 Countable additivity implies µ(∅) = 0 provided there is some set
B with µ(B) < ∞; thus, the requirement µ(∅) = 0 is imposed to rule out
the pathological case in which µ(B) = ∞ for all B ∈ B.
Remark 4 If we have a finite collection B1, . . . , Bk ∈ B with Bn ∩ Bm = ∅
if n ≠ m, we can write Bn = ∅ for n > k, and obtain

µ(B1 ∪ · · · ∪ Bk) = µ (∪n∈N Bn) = Σn∈N µ(Bn) = µ(B1) + · · · + µ(Bk) + Σn>k µ(∅) = µ(B1) + · · · + µ(Bk)

so countable additivity implies finite additivity.
Let Ω be a set and let {Ωλ : λ ∈ Λ} be a partition of Ω, i.e. a collection of pairwise disjoint nonempty sets whose union is Ω, and let BΛ = {∪λ∈C Ωλ : C ⊆ Λ}. In other words, BΛ is the collection of all subsets of Ω which can be formed
by taking unions of partition sets. BΛ is closed under complements, as well as
arbitrary (not just countable) unions and intersections. Suppose the partition
is finite, i.e. Λ is finite, say Λ = {1, . . . , n}. Then BΛ is finite; it has exactly
2^n elements, each corresponding to a subset of Λ. Suppose now that Λ
is countably infinite; since every subset C ⊆ Λ determines a different set
B ∈ BΛ , BΛ is uncountable. Suppose finally that Λ is uncountable. For
concreteness, let Ω = R, Λ = R, and Ωλ = {λ}, i.e. each set in the partition
consists of a single real number. There are many σ-algebras containing this
partition. As before, we can take
BΛ = {∪λ∈C Ωλ : C ⊆ Λ} = {C : C ⊆ R} = 2^R

or, for example, the smaller σ-algebra

B0 = {C : C ⊆ R, C countable or R \ C countable}
The Borel σ-algebra B on R is the smallest σ-algebra containing all the open
sets in R, i.e. it is the intersection of the class of all σ-algebras that contain all the
open sets in R. A set is called Borel if it belongs to B.
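The finite-partition case is easy to make concrete. A small Python sketch (the three-block partition below is an arbitrary illustrative choice) builds BΛ by taking all unions of partition blocks and confirms that it has 2^n members:

    # The sigma-algebra generated by a finite partition: every member is a union
    # of blocks, so a partition with n blocks generates exactly 2**n sets.
    from itertools import combinations

    blocks = [frozenset({1, 2}), frozenset({3}), frozenset({4, 5, 6})]   # a partition of {1,...,6}
    sigma_algebra = set()
    for k in range(len(blocks) + 1):
        for chosen in combinations(blocks, k):
            sigma_algebra.add(frozenset().union(*chosen))    # union of the chosen blocks

    print(len(sigma_algebra), 2 ** len(blocks))   # both 8
    print(frozenset() in sigma_algebra)           # the empty union gives the empty set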
The most important example of a measure space is the Lebesgue measure
space, which comes in two flavors. The first flavor is (R, B, µ), where B is
the Borel σ-algebra and µ is Lebesgue measure, the measure defined in the
following theorem; the second flavor is (R, C, µ), where C is the σ-algebra
defined in the proof sketch (below) of the following theorem.
Theorem 8 If B is the Borel σ-algebra on R, there is a unique measure µ
(called Lebesgue measure) defined on B such that µ((a, b)) = b − a provided
that b > a. For every Borel set B, µ(B) = sup{µ(K) : K compact, K ⊆ B} = inf{µ(U) : U open, B ⊆ U}.
The proof works by gradually extending µ from the open intervals to the
Borel σ-algebra. First, one shows one can extend µ to bounded open sets;
it follows that one can extend it to compact sets. Then let Cn = {C ⊂
[−n, n] : sup{µ(K) : K compact, K ⊂ C} = inf{µ(U) : U open, C ⊂ U}},
and C = {C ⊂ R : C ∩ [−n, n] ∈ Cn for all n}. If C ∈ C, we define µ(C) =
sup{µ(K) : K compact, K ⊂ C}. One can verify that C is a σ-algebra
containing every open set, hence C ⊃ B.
Definition 9 A measure space (Ω, B, µ) is complete if B ∈ B, A ⊆ B,
µ(B) = 0 ⇒ A ∈ B. The completion of a measure space (Ω, B, µ) is the
measure space (Ω, B̄, µ̄), where

B̄ = {B ⊆ Ω : ∃B1, B2 ∈ B s.t. B1 ⊆ B ⊆ B2 and µ(B2 \ B1) = 0}

and

µ̄(B) = µ(B1) = µ(B2)

for B ∈ B̄ (the common value does not depend on the choice of B1 and B2).
It is easy to verify that the Lebesgue measure space (R, C, µ) is complete,
and is the completion of (R, B, µ), where B is the Borel σ-algebra.
Definition 10 Suppose µ is a measure on the Lebesgue σ-algebra C on R.
µ is translation-invariant if, for all x ∈ R and all C ∈ C, µ(C + x) = µ(C).
Theorem 11 Lebesgue measure is translation-invariant.

The theorem follows readily from the construction of Lebesgue measure, since
translation does not change the lengths of intervals.
Observe that if x ∈ R, then µ({x}) ≤ µ((x − ε/2, x + ε/2)) = ε for every positive
ε, so µ({x}) = 0.
As we have already indicated, it is not possible to extend Lebesgue mea-
sure to every subset of R, at least in the conventional set-theoretic founda-
tions of mathematics.¹
Theorem 12 There is a set D ⊂ R which is not Lebesgue measurable.
Proof: We actually prove the following stronger statement: there is no
translation-invariant measure µ defined on all subsets of R such that 0 <
µ([0, 1]) < ∞.
Let µ be a translation-invariant measure defined on all subsets of R.
Define an equivalence relation ∼ on R by
x ∼ y ⇔ x − y ∈ Q

Using the Axiom of Choice, choose a set D ⊆ [0, 1) containing exactly one element of each equivalence class. Then

∀x ∈ R ∃d ∈ D s.t. d − x ∈ Q

∀d1, d2 ∈ D, d1 ≠ d2 ⇒ d1 − d2 ∉ Q
The operation +′ is addition modulo one. It is easy to check that, given any
C ⊆ [0, 1) and y ∈ [0, 1),

µ(C +′ y) = µ(C)

i.e. µ is translation-invariant with respect to translation using the operation
+′. Then

[0, 1) = ∪q∈Q∩[0,1) (D +′ q)
¹The crucial axiom needed in the proof of Theorem 12 is the Axiom of Choice; it
is possible to construct alternative set theories in which the Axiom of Choice fails, and
every subset of R is Lebesgue measurable. This is true, not because Lebesgue measure is
extended further, but rather because the class of all sets is restricted.
Moreover, the sets D +′ q, q ∈ Q ∩ [0, 1), are pairwise disjoint, so

µ([0, 1)) = Σq∈Q∩[0,1) µ(D +′ q) = Σq∈Q∩[0,1) µ(D)

Every term in the last sum equals µ(D), so the sum is 0 if µ(D) = 0 and ∞ if µ(D) > 0. On the other hand, [0, 1) ⊆ [0, 1] ⊆ [0, 1) ∪ ([0, 1) + 1), so 0 < µ([0, 1])/2 ≤ µ([0, 1)) ≤ µ([0, 1]) < ∞, a contradiction.
We will also need the following continuity property of measures: if Bn ∈ B, B1 ⊇ B2 ⊇ · · · and µ(B1) < ∞, then limn→∞ µ(Bn) = µ (∩n∈N Bn); in particular, if ∩n∈N Bn = ∅, then

limn→∞ µ(Bn) = 0

To see this, let Cm = Bm \ Bm+1, so that

B1 = (∩n∈N Bn) ∪ (∪m∈N Cm)

with the sets on the right-hand side pairwise disjoint, so

Σm∈N µ(Cm) = µ(B1) − µ (∩n∈N Bn) ≤ µ(B1) < ∞

Since the series converges, its tails vanish:

Σm≥n µ(Cm) → 0 as n → ∞

Therefore

µ(Bn) = µ (∩m∈N Bm) + Σm≥n µ(Cm) → µ (∩m∈N Bm)
We now turn to the definition of a random variable. We think of Ω as the
set of all possible states of the world tomorrow. Exactly one state will occur
tomorrow. Tomorrow, we will be able to observe some function of the state
which occurred. We do not know today which state will occur, and hence
what value of the function will occur. However, we do know the probability
that any given state will occur, and we know the mapping from possible states
into values of the function, so we know the probability that the function will
take on any given possible value. Thus, viewed from today, the value the
function will take on tomorrow is a random variable; we know its probability
distribution, but not its exact value. Tomorrow, we will observe the value
which is realized.
Definition 16 Let (Ω, B, P ) be a probability space. A random variable on
Ω is a function X : Ω → R ∪ {−∞, ∞} satisfying X −1 (B) ∈ B for every Borel set B ⊆ R and X −1 ({−∞, ∞}) ∈ B. In fact, such an X (with X −1 ({−∞, ∞}) ∈ B) is a random variable if and only if for every open interval (a, b), X −1 ((a, b)) ∈ B. To see this, let
M = {B ⊂ R : X −1 (B) ∈ B}
By assumption, X −1 ({−∞, ∞}) ∈ B; since B is a σ-algebra, it is closed under
complements, so X −1 (R) = Ω \ X −1 ({−∞, ∞}) ∈ B, and R ∈ M.
If B ∈ M, X −1 (B) ∈ B, so X −1 (R \ B) = X −1 (R) \ X −1 (B) ∈ B because
X −1 (R) ∈ B and X −1 (B) ∈ B. Therefore, R \ B ∈ M, so M is closed under
complements.
Now suppose Bn ∈ M, n ∈ N. Then X −1 (Bn) ∈ B for all n, so X −1 (∪n∈N Bn) = ∪n∈N X −1 (Bn) ∈ B, and ∪n∈N Bn ∈ M; thus M is closed under countable unions. Hence M is a σ-algebra; if it contains every open interval, it contains the Borel σ-algebra they generate, so X is a random variable.

As an example, let f : ([0, 1], C, µ) → R be defined by f(x) = 1 if x ∈ Q and f(x) = 0 if x ∉ Q. Given any B ⊆ R, there are four cases:

1. If 0, 1 ∉ B, f −1 (B) = ∅ ∈ C.
2. If 0 ∈ B, 1 ∉ B, f −1 (B) = [0, 1] \ Q ∈ C.
3. If 0 ∉ B, 1 ∈ B, f −1 (B) = [0, 1] ∩ Q ∈ C.
4. If 0, 1 ∈ B, f −1 (B) = [0, 1] ∈ C.
Thus, f is measurable.
In elementary probability theory, random variables are not rigorously
defined. Usually, they are described only by specifying their cumulative dis-
tribution functions. Continuous and discrete distributions often seem like en-
tirely unconnected notions, and the formulation of mixed distributions (which
have both continuous and discrete parts) can be problematic. Measure theory
gives us a way to deal simultaneously with continuous, discrete, and mixed
distributions in a unified way. We shall first define the cumulative distribu-
tion function of a random variable, and establish that it satisfies the defining
properties of a cumulative distribution function, as defined in elementary
probability theory. We will then show (first in examples, then in a general
theorem) that given any function F satisfying the defining properties of a
cumulative distribution function, there is in fact a random variable defined
on the Lebesgue probability space whose cumulative distribution function is
F.
The cumulative distribution function of a random variable X on a probability space (Ω, B, P ) is the function F : R → [0, 1] defined by F(t) = P({ω ∈ Ω : X(ω) ≤ t}). It has the following properties:

1. limt→−∞ F(t) = 0

2. limt→∞ F(t) = 1

3. F is increasing, i.e.

s < t ⇒ F(t) ≥ F(s)

4. F is right-continuous, i.e. for every t, lims↓t F(s) = F(t)
Proof: We prove only right-continuity, leaving the rest as an exercise. Since
F is increasing, it is enough to show that (as an exercise, think through
carefully why this is enough)
limn→∞ F(t + 1/n) = F(t)

F(t + 1/n) − F(t)
= P({ω ∈ Ω : X(ω) ≤ t + 1/n}) − P({ω ∈ Ω : X(ω) ≤ t})
= P(X −1 ((−∞, t + 1/n])) − P(X −1 ((−∞, t]))
= P(X −1 ((t, t + 1/n]))
→ P(∩n∈N X −1 ((t, t + 1/n]))
= P(X −1 (∩n∈N (t, t + 1/n]))
= P(X −1 (∅))
= P(∅) = 0
by Theorem 15.
Example 22 The uniform distribution on [0, 1] is the cumulative distribu-
tion function
F(t) = 0 if t < 0
F(t) = t if t ∈ [0, 1]
F(t) = 1 if t > 1
Consider the random variable X defined on the Lebesgue probability space
X : ([0, 1], C, µ) → R, X(t) = t. Observe that
µ ({ω ∈ [0, 1] : X(ω) ≤ t}) = µ ([0, t]) = t = F(t) for t ∈ [0, 1]
Thus, X has the uniform distribution on [0, 1]. Notice also that F is strictly
increasing and hence one-to-one on [0, 1]. Thus, F |[0,1] has an inverse function
(F |[0,1] )−1 : [0, 1] → [0, 1]. In fact, X = (F |[0,1] )−1.
Example 23 The standard normal distribution has the cumulative distribu-
tion function
F(t) = (1/√(2π)) ∫(−∞,t] e^(−x²/2) dx
Notice that F is strictly increasing, hence one-to-one, and the range of F is
(0, 1), so F has an inverse function X : (0, 1) → R; extend X to [0, 1] by
defining X(0) = −∞ and X(1) = ∞. One can show that the inverse of a
strictly increasing, continuous function is strictly increasing and continuous.
Since X|(0,1) is continuous, it is measurable on the Lebesgue probability space.
Observe that, for ω ∈ (0, 1), X(ω) ≤ t if and only if ω ≤ F(t), so

µ ({ω ∈ [0, 1] : X(ω) ≤ t}) = µ (X −1 ((−∞, t])) = µ ((0, F(t)]) = F(t)

Thus, X has the standard normal distribution.
The general result promised earlier holds for an arbitrary cumulative distribution function: if F : R → [0, 1] satisfies properties 1-4 above, then there is a random variable X on the Lebesgue probability space ([0, 1], C, µ) whose cumulative distribution function is F.

Proof: Let

X(ω) = inf{t : F(t) ≥ ω}
Since

ω′ > ω ⇒ {t : F(t) ≥ ω′} ⊆ {t : F(t) ≥ ω}
⇒ inf{t : F(t) ≥ ω′} ≥ inf{t : F(t) ≥ ω}
⇒ X(ω′) ≥ X(ω)

X is increasing, and hence measurable.
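This construction, X(ω) = inf{t : F(t) ≥ ω}, is exactly the quantile transform used to simulate draws from a given distribution. A sketch in Python (assuming numpy; the exponential distribution F(t) = 1 − e^(−t), whose generalized inverse is −log(1 − ω), is an illustrative choice made so that the inverse has a closed form):

    # Apply X(omega) = inf{t : F(t) >= omega} for F(t) = 1 - exp(-t):
    # X(omega) = -log(1 - omega).  Feeding in uniform draws on [0, 1), which
    # play the role of points of the Lebesgue probability space, should
    # reproduce F as the empirical distribution.
    import numpy as np

    rng = np.random.default_rng(0)
    omega = rng.random(100_000)
    X = -np.log(1.0 - omega)                  # the quantile transform

    for t in (0.5, 1.0, 2.0):
        print(t, (X <= t).mean(), 1.0 - np.exp(-t))   # empirical vs. theoretical F(t)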
Definition 26 A simple function is a function of the form

f = Σ_{i=1}^{n} αi χBi                (1)

where αi ∈ R and Bi ∈ B for each i.
Given a random variable f and a grid of points α1 < α2 < · · · < αn+1 partitioning its range, let

fn = Σ_{i=1}^{n} αi χf −1 ([αi ,αi+1 ))
Thus,
fn (ω) = αi ⇔ f(ω) ∈ [αi , αi+1 )
Because f is a random variable, f −1 ([αi , αi+1 )) ∈ B, so fn is a simple func-
tion.
The value of the integral may be ∞. We say that f is integrable if ∫Ω f dP < ∞.
Example 28 Consider the function f(x) = 1/x on the Lebesgue probability
space ([0, 1], C, µ). Let

fn = Σ_{i=1}^{n} i χf −1 ([i,i+1)) = Σ_{i=1}^{n} i χ((1/(i+1), 1/i])

Then

∫[0,1] fn dµ = Σ_{i=1}^{n} i (1/i − 1/(i + 1)) = Σ_{i=1}^{n} 1/(i + 1) → ∞ as n → ∞

so ∫[0,1] f dµ = ∞ and f is not integrable.
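The divergence is easy to check numerically; a short Python sketch of the partial sums (which grow roughly like log n):

    # integral of f_n = sum_{i=1}^{n} i * mu((1/(i+1), 1/i])
    #                 = sum_{i=1}^{n} i * (1/i - 1/(i+1)) = sum_{i=1}^{n} 1/(i+1),
    # a tail of the harmonic series, so it increases without bound.
    import math

    def integral_fn(n):
        return sum(i * (1.0 / i - 1.0 / (i + 1)) for i in range(1, n + 1))

    for n in (10, 1_000, 100_000):
        print(n, integral_fn(n), math.log(n))   # grows like log n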
The following theorem, which we shall not prove, shows that the Lebesgue
and Riemann integrals coincide when the Riemann integral is defined:
Theorem 30 Suppose that f : [a, b] → R is Riemann integrable (in par-
ticular, this is the case if f is continuous). Then f is Lebesgue integrable
and

∫[a,b] f dµ = ∫_a^b f(t) dt
Definition 32 Suppose f : (Ω, B, P ) → R is a random variable. If B ∈ B,
we define

∫B f dP = ∫Ω f χB dP

where χB is the characteristic function of B.
The integral over B inherits the usual properties of the integral; for example:

1. ∫B cf dP = c ∫B f dP

3. f ≤ g a.e. on B ⇒ ∫B f dP ≤ ∫B g dP
Since P (Ω) = 1, every square-integrable random variable is integrable, so

L2 (Ω, B, P ) ⊆ L1 (Ω, B, P )
Theorem 37 L1 (Ω, B, P ) and L2 (Ω, B, P ) are Banach spaces under the re-
spective norms
‖f‖1 = ∫Ω |f(ω)| dP

‖f‖2 = √( ∫Ω f(ω)² dP )
For the Lebesgue probability space, L1 ([0, 1], C, µ) and L2 ([0, 1], C, µ) are the
completions of the normed space C([0, 1]), with respect to the norms ‖ · ‖1
and ‖ · ‖2.
Example 38 Recall that C([0, 1]) is a Banach space with the sup norm
‖f‖ = sup{|f(x)| : x ∈ [0, 1]}. (C[0, 1], ‖ · ‖1) and (C[0, 1], ‖ · ‖2) are normed
spaces, but they are not complete in these norms, hence are not Banach
spaces. To see this, let fn(x) be the function which is zero for x ∈ [0, 1/2 − 1/n],
one for x ∈ [1/2 + 1/n, 1], and linear for x ∈ [1/2 − 1/n, 1/2 + 1/n]. fn is not Cauchy
with respect to ‖ · ‖∞. However, fn is Cauchy and has a natural limit with
respect to ‖ · ‖1. Indeed, let
f(x) = 0 if x < 1/2
f(x) = 1 if x ≥ 1/2
Then
∫[0,1] |fn − f| dµ ≤ (1/2) · (2/n) = 1/n → 0
so fn converges to f in L1 ([0, 1], C, µ), and hence must be Cauchy with respect
to ‖ · ‖1. Notice that this limit does not belong to C([0, 1]).
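A quick numerical check of the example in Python (the grid sums below are only approximations to the two norms):

    # ||f_n - f||_1 is at most 1/n, while the sup distance stays near 1/2,
    # so f_n is Cauchy in ||.||_1 but not in the sup norm.
    def f_n(x, n):
        if x <= 0.5 - 1.0 / n:
            return 0.0
        if x >= 0.5 + 1.0 / n:
            return 1.0
        return (x - (0.5 - 1.0 / n)) * n / 2.0     # the linear piece

    def f(x):
        return 1.0 if x >= 0.5 else 0.0

    grid = [k / 100_000 for k in range(100_001)]
    for n in (10, 100, 1000):
        l1 = sum(abs(f_n(x, n) - f(x)) for x in grid) / len(grid)
        sup = max(abs(f_n(x, n) - f(x)) for x in grid)
        print(n, l1, sup)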
Definition 40 If f, g ∈ L2 (Ω, B, P ), define the inner product of f and g by

f · g = ∫Ω f(ω) g(ω) dP
The properties of the inner product are closely analogous to those of the dot
product of vectors in Euclidean space Rn. In particular, they determine a
geometry, including lengths (‖f‖2 = √(f · f)) and angles. The most basic
property of the dot product, the Cauchy-Schwarz inequality, extends to the
inner product.
|f · g| ≤ ‖f‖2 ‖g‖2
For f, g ∈ L2 (Ω, B, P ), define E(f) = ∫Ω f dP and Covar(f, g) = E((f − E(f))(g − E(g))). Observe from the definition that Covar(f, g) = Covar(g, f). Thus, given
f1 , . . . , fn ∈ L2(Ω, B, P ), the covariance matrix C whose (i, j) entry is cij =
Covar(fi , fj ) is a symmetric matrix. Hence, there is an orthonormal basis
of Rn composed of eigenvectors of C; expressed in this basis, the matrix
becomes diagonal. Observe also that
Covar(Σ_{i=1}^{n} αi fi, Σ_{j=1}^{n} βj fj) = (α1, . . . , αn) C (β1, . . . , βn)′

where (β1, . . . , βn)′ denotes the column vector with entries β1, . . . , βn.
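This bilinear identity can be checked numerically by realizing the random variables as long vectors of simulated draws and replacing covariances by sample covariances, which obey the same identity (numpy, the three variables, and the coefficient vectors below are illustrative choices):

    # Check Covar(sum_i a_i f_i, sum_j b_j f_j) = a' C b with sample covariances.
    import numpy as np

    rng = np.random.default_rng(1)
    f = rng.standard_normal((3, 200_000))     # three random variables, as rows of draws
    f[1] += 0.5 * f[0]                        # introduce some correlation
    a = np.array([1.0, -2.0, 0.5])
    b = np.array([0.3, 1.0, 2.0])

    C = np.cov(f)                             # 3 x 3 sample covariance matrix
    lhs = np.cov(a @ f, b @ f)[0, 1]          # sample covariance of the two combinations
    rhs = a @ C @ b
    print(lhs, rhs)                           # equal up to floating-point rounding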
Given g ∈ L2 (Ω, B, P ), let V denote the linear subspace of L2 (Ω, B, P ) spanned by
f1, . . . , fn. The orthogonal projection of g onto V is the element π(g) = Σ_{i=1}^{n} αi fi
of V such that g − π(g) is orthogonal to V (i.e. (g − π(g)) · fi = 0
for each i). π(g) is also characterized as the point in V closest to g. The
coefficients α1, . . . , αn are the regression coefficients of g on f1, . . . , fn. If
f1, . . . , fn are orthonormal (i.e. fi · fj = 1 if i = j and 0 if i ≠ j), then αi = g · fi
for each i.
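A sketch of the regression interpretation in the same spirit (simulated draws again; the inner product is approximated by the sample average of the product, and the orthonormalization and the choice of g are illustrative): for an orthonormal family, the least-squares coefficients coincide with the inner products g · fi.

    # Regression of g on f_1, f_2, f_3 viewed as projection in L2.
    import numpy as np

    rng = np.random.default_rng(2)
    N = 100_000
    q, _ = np.linalg.qr(rng.standard_normal((N, 3)))   # orthonormal columns
    f = np.sqrt(N) * q.T                     # rows satisfy (1/N) f_i . f_j = 1{i = j}
    g = 2.0 * f[0] - 1.0 * f[2] + rng.standard_normal(N)

    inner = (f @ g) / N                      # the inner products g . f_i
    coeffs, *_ = np.linalg.lstsq(f.T, g, rcond=None)   # regression coefficients
    print(inner)
    print(coeffs)                            # essentially the same numbers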
One frequently encounters two measures living on a single measurable
space. For example, Lebesgue measure is one measure on the real line R;
any random variable determines another measure on R by its distribution.
Given two measures µ and ν on a measurable space (X, A), ν is said to be
absolutely continuous with respect to µ if

µ(B) = 0 ⇒ ν(B) = 0

for all B ∈ A.
Given two probability spaces (X, A, P ) and (Y, B, Q), we would like to define a product measure P × Q on X × Y satisfying

(P × Q)(A × B) = P (A)Q(B)

whenever A ∈ A and B ∈ B.
The collection of sets of the form A × B with A ∈ A and B ∈ B is not itself a σ-algebra; for example, when X = Y = R, the
diagonal {(x, y) : x = y} is not of that form. However, it is possible to write
the diagonal in the form

∩m∈N ∪n∈N (Amn × Bmn)
with Amn ∈ A and Bmn ∈ B. This suggests we should extend P × Q to the
smallest σ-algebra containing {A × B : A ∈ A, B ∈ B}.
Definition 46 Suppose (X, A, P ) and (Y, B, Q) are probability spaces. The
product probability space is (X × Y, A × B, P × Q), where
1. A ×′ B is the smallest σ-algebra containing all sets of the form A × B,
where A ∈ A and B ∈ B

2. P ×′ Q is the unique measure on A ×′ B that takes the value P (A)Q(B)
on sets of the form A × B, where A ∈ A and B ∈ B.

3. (X × Y, A × B, P × Q) is the completion of (X × Y, A ×′ B, P ×′ Q)
The existence of the product measure P × Q is proven by extending the
definition of P ×′ Q from sets of the form A × B to the σ-algebra A ×′ B, in a manner
similar to the process by which Lebesgue measure is obtained from lengths
of intervals.
5.

∫X (∫Y f(x, y) dQ(y)) dP(x) = ∫Y (∫X f(x, y) dP(x)) dQ(y)
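On a finite product space the identity amounts to interchanging two finite sums, which can be verified directly (the weights P, Q and the function f below are arbitrary illustrative choices):

    # Iterated sums over X = {0, 1} and Y = {0, 1, 2} agree in either order.
    P = {0: 0.3, 1: 0.7}
    Q = {0: 0.2, 1: 0.5, 2: 0.3}

    def f(x, y):
        return (x + 1) * (y - 1) ** 2

    y_then_x = sum(P[x] * sum(Q[y] * f(x, y) for y in Q) for x in P)
    x_then_y = sum(Q[y] * sum(P[x] * f(x, y) for x in P) for y in Q)
    print(y_then_x, x_then_y)    # the same number, up to floating-point rounding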
There are many notions of convergence of functions, and results showing
that the integral of a limit of functions is the limit of the integrals.
For example, a sequence of random variables fn converges to f in probability if

∀ε > 0 ∃N s.t. n > N ⇒ P ({ω : |fn(ω) − f(ω)| > ε}) < ε
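For instance, on the Lebesgue probability space ([0, 1], C, µ), the random variables fn(ω) = ω^n converge to 0 in probability, since µ({ω : ω^n > ε}) = 1 − ε^(1/n) → 0 for each fixed ε > 0. A two-line check of the numbers in Python:

    # mu({omega in [0, 1] : omega**n > eps}) = 1 - eps**(1/n) -> 0 as n grows.
    eps = 0.01
    for n in (1, 10, 100, 1000, 10_000):
        print(n, 1 - eps ** (1.0 / n))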
Theorem 51 (Fatou’s Lemma) If (Ω, B, P ) is a probability space, fn , f :
Ω → R+ are random variables, and fn converges to f almost surely, then
∫Ω f dP ≤ lim infn→∞ ∫Ω fn dP
In the next theorem, the functions fn converge to f from below; this guar-
antees that no mass which is present in the fn can suddenly disappear at the
limit.
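A standard example shows what can go wrong without this monotonicity: on the Lebesgue probability space ([0, 1], C, µ), the functions fn = n χ(0,1/n) converge to 0 almost surely, yet ∫[0,1] fn dµ = n · (1/n) = 1 for every n, while ∫[0,1] 0 dµ = 0. So the inequality in Fatou's Lemma can be strict: mass can disappear in the limit, but it cannot appear out of nowhere, and convergence from below rules out the disappearance.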