Lecture Notes On Stochastic Calculus (NYU)


Stochastic Calculus Notes, Lecture 1

Last modified September 12, 2004


1 Overture
1.1. Introduction: The term "stochastic" means random. Because it usually
occurs together with "process" (stochastic process), it makes people think of
something random that changes in a random way over time. The term
"calculus" refers to ways to calculate things or find things that can be calculated
(e.g. derivatives in the differential calculus). Stochastic calculus is the study of
stochastic processes through a collection of powerful ways to calculate things.
Whenever we have a question about the behavior of a stochastic process, we will
try to find an expected value or probability that we can calculate that answers
our question.
1.2. Organization: We start in the discrete setting in which there is a
finite or countable (definitions below) set of possible outcomes. The tools are
summations and matrix multiplication. The main concepts can be displayed
clearly and concretely in this setting. We then move to continuous processes
in continuous time where things are calculated using integrals, either ordinary
integrals in R^n or abstract integrals in probability space. It is impossible (and
beside the point if it were possible) to treat these matters with full mathematical
rigor in these notes. The reader should get enough to distinguish mathematical
right from wrong in cases that occur in practical applications.
1.3. Backward and forward equations: Backward equations and forward
equations are perhaps the most useful tools for getting information about stochas-
tic processes. Roughly speaking, there is some number, f, that we want to know.
For example f could be the expected value of a portfolio after following a pro-
posed trading strategy. Rather than compute f directly, we define an array
of related expected values, f(x, t). The tower property implies relationships,
backward equations or forward equations, among these values that allow us to
compute some of them in terms of others. Proceeding from the few known val-
ues (initial conditions and boundary conditions), we eventually find the f we
first wanted. For discrete time and space, the equations are matrix equations or
recurrence relations. For continuous time and space, they are partial differential
equations of diffusion type.
1.4. Diffusions and Ito calculus: The Ito calculus is a tool for studying
continuous stochastic processes in continuous time. If X(t) is a differentiable
function of time, then ΔX = X(t + Δt) − X(t) is of the order of¹ Δt. Therefore
Δf(X(t)) = f(X(t + Δt)) − f(X(t)) ≈ f′ ΔX to this accuracy. For an Ito
process, ΔX is of the order of √Δt, so Δf ≈ f′ ΔX + (1/2) f″ ΔX² has an error
smaller than Δt. In the special case where X(t) is Brownian motion, it is often
permissible (and the basis of the Ito calculus) to replace ΔX² by its mean value,
Δt.

¹ This means that there is a C so that |X(t + Δt) − X(t)| ≤ C |Δt| for small Δt.
2 Discrete probability
Here are some basic definitions and ideas of probability. These might seem dry
without examples. Be patient. Examples are coming in later sections. Although
the topic is elementary, the notation is taken from more advanced probability
so some of it might be unfamiliar. The terminology is not always helpful for
simple problems but it is just the thing for describing stochastic processes and
decision problems under incomplete information.
2.1. Probability space: Do an "experiment" or "trial", get an "outcome",
ω. The set of all possible outcomes is Ω, which is the probability space. The
Ω is discrete if it is finite or countable (able to be listed in a single infinite
numbered list). The outcome ω is often called a random variable. I avoid that
term because I (and most other people) want to call functions X(ω) random
variables, see below.
2.2. Probability: The probability of a specific outcome ω is P(ω). We always
assume that P(ω) ≥ 0 for any ω ∈ Ω and that Σ_{ω∈Ω} P(ω) = 1. The interpreta-
tion of probability is a matter for philosophers, but we might say that P(ω) is
the probability of outcome ω happening, or the fraction of times event ω would
happen in a large number of independent trials. The philosophical problem is
that it may be impossible actually to perform a large number of independent
trials. People also sometimes say that probabilities represent our often subjec-
tive (lack of) knowledge of future events. Probability 1 means something that
is certain to happen while probability 0 is for something that cannot happen.
"Probability zero implies impossible" is only strictly true for discrete probability.
2.3. Event: An event is a set of outcomes, a subset of Ω. The probability of
an event is the sum of the probabilities of the outcomes that make up the event:

P(A) = Σ_{ω∈A} P(ω) . (1)

Usually, we specify an event in some way other than listing all the outcomes in
it (see below). We do not distinguish between the outcome ω and the event that
that outcome occurred, A = {ω}. That is, we write P(ω) for P({ω}) or vice
versa. This is called "abuse of notation": we use notation in a way that is not
absolutely correct but whose meaning is clear. It's the mathematical version of
saying "I could care less" to mean the opposite.
2.4. Countable and uncountable (technical detail): A probability space (or
any set) that is not countable is called "uncountable". This distinction was
formalized by the late nineteenth century mathematician Georg Cantor, who
showed that the set of (real) numbers in the interval [0, 1] is not countable.
Under the uniform probability density, P(ω) = 0 for any ω ∈ [0, 1]. It is hard to
imagine that the probability formula (1) is useful in this case, since every term
in the sum is zero. The difference between continuous and discrete probability
is the difference between integrals and sums.
2.5. Example: Toss a coin 4 times. Each toss yields either H (heads) or T
(tails). There are 16 possible outcomes, TTTT, TTTH, TTHT, TTHH, THTT,
. . ., HHHH. The number of outcomes is #(Ω) = |Ω| = 16. We suppose that
each outcome is equally likely, so P(ω) = 1/16 for each ω ∈ Ω. If A is the event
that the first two tosses are H, then

A = {HHHH, HHHT, HHTH, HHTT} .

There are 4 elements (outcomes) in A, each having probability 1/16. Therefore

P(first two H) = P(A) = Σ_{ω∈A} P(ω) = Σ_{ω∈A} 1/16 = 4/16 = 1/4 .
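For readers who like to check such counts by machine, here is a minimal sketch
(plain Python, not part of the original notes) that enumerates the 16 outcomes
and recovers P(first two H) = 1/4.

```python
from itertools import product

# All 16 outcomes of 4 coin tosses, each equally likely with probability 1/16.
outcomes = ["".join(s) for s in product("HT", repeat=4)]
p = {w: 1.0 / 16.0 for w in outcomes}

# Event A: the first two tosses are H.
A = [w for w in outcomes if w.startswith("HH")]
print(len(outcomes), len(A), sum(p[w] for w in A))  # 16 4 0.25
```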
2.6. Set operations: Events are sets, so set operations apply to events. If A
and B are events, the event "A and B" is the set of outcomes in both A and
B. This is the set intersection A ∩ B, because the outcomes that make both A
and B happen are those that are in both events. The union A ∪ B is the set
of outcomes in A or in B (or in both). The complement of A, A^c, is the event
"not A", the set of outcomes not in A. The empty event is the empty set, the
set with no elements, ∅. The probability of ∅ should be zero because the sum
that defines it has no terms: P(∅) = 0. The complement of ∅ is Ω. Events A
and B are disjoint if A ∩ B = ∅. Event A is contained in event B, A ⊂ B, if
every outcome in A is also in B. For example, if the event A is as above and B
is the event that the first toss is H, then A ⊂ B.
2.7. Basic facts: Each of these facts is a consequence of the representation
P(A) = Σ_{ω∈A} P(ω). First, P(A) ≤ P(B) if A ⊂ B. Also, P(A) + P(B) =
P(A ∪ B) if P(A ∩ B) = 0, but not otherwise. If P(ω) ≠ 0 for all ω, then
P(A ∩ B) = 0 only when A and B are disjoint. Clearly, P(A) + P(A^c) = P(Ω) =
1.
2.8. Conditional probability: The probability of outcome A given that B has
occurred is the conditional probability (read "the probability of A given B"),

P(A | B) = P(A ∩ B) / P(B) . (2)

This is the fraction of B outcomes that are also A outcomes. The formula is
called Bayes' rule. It is often used to calculate P(A ∩ B) once we know P(B)
and P(A | B). The formula for that is P(A ∩ B) = P(A | B)P(B).
2.9. Independence: Events A and B are independent if P(A | B) = P(A).
That is, knowing whether or not B occurred does not change the probability of
A. In view of Bayes' rule, this is expressed as

P(A ∩ B) = P(A) · P(B) . (3)

For example, suppose A is the event that two of the four tosses are H, and B
is the event that the first toss is H. Then A has 6 elements (outcomes), B has
8, and, as you can check by listing them, A ∩ B has 3 elements. Since each
element has probability 1/16, this gives P(A ∩ B) = 3/16 while P(A) = 6/16 and
P(B) = 8/16 = 1/2. We might say "duh" for the last calculation since we started
the example with the hypothesis that H and T were equally likely. Anyway,
this shows that (3) is indeed satisfied in this case. This example is supposed to
show that while some pairs of events, such as the first and second tosses, are
obviously independent, others are independent as the result of a calculation.
Note that if C is the event that 3 of the 4 tosses are H (instead of 2 for A),
then P(C) = 4/16 = 1/4 and P(B ∩ C) = 3/16, because

B ∩ C = {HHHT, HHTH, HTHH}

has three elements. Bayes' rule (2) gives P(B | C) = (3/16)/(1/4) = 3/4. Knowing that
there are 3 heads in all raises the probability that the first toss is H from 1/2 to
3/4.
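A similar brute-force check, again only an illustrative sketch rather than anything
from the notes, confirms the independence identity (3) and the value P(B | C) = 3/4.

```python
from itertools import product

outcomes = ["".join(s) for s in product("HT", repeat=4)]
P = lambda event: sum(1.0 / 16.0 for w in outcomes if event(w))

A = lambda w: w.count("H") == 2      # two of the four tosses are H
B = lambda w: w[0] == "H"            # the first toss is H
C = lambda w: w.count("H") == 3      # three of the four tosses are H

print(P(lambda w: A(w) and B(w)), P(A) * P(B))   # 0.1875 0.1875, so (3) holds
print(P(lambda w: B(w) and C(w)) / P(C))         # 0.75, i.e. P(B | C) = 3/4
```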
2.10. Working with conditional probability: Let us fix the event B, and
discuss the conditional probability P̃(ω) = P(ω | B), which also is a probability
(assuming P(B) > 0). There are two slightly different ways to discuss P̃. One
way is to take B to be the probability space and define

P̃(ω) = P(ω)/P(B)

for all ω ∈ B. Since B is the probability space for P̃, we do not have to define
P̃ for ω ∉ B. This P̃ is a probability because P̃(ω) ≥ 0 for all ω ∈ B and
Σ_{ω∈B} P̃(ω) = 1. The other way is to keep Ω as the probability space and
set the conditional probabilities to zero for ω ∉ B. If we know the event B
happened, then the probability of an outcome not in B is zero.

P(ω | B) = { P(ω)/P(B) for ω ∈ B,
             0          for ω ∉ B.       (4)

Either way, we restrict to outcomes in B and renormalize the probabilities
by dividing by P(B) so that they again sum to one. Note that (4) is just the
general conditional probability formula (2) applied to the event A = {ω}.

We can condition a second time by conditioning P̃ on another event, C. It
seems natural that P̃(ω | C), which is the conditional probability of ω given that
C occurred given that B occurred, should be the P conditional probability
of ω given that both B and C occurred. Bayes' rule verifies this intuition:

P̃(ω | C) = P̃(ω)/P̃(C)
         = P(ω | B) / P(C | B)
         = ( P(ω)/P(B) ) / ( P(C ∩ B)/P(B) )
         = P(ω)/P(B ∩ C)
         = P(ω | B ∩ C) .

The conclusion is that conditioning on B and then on C is the same as condi-
tioning on B ∩ C (B and C) all at once. This "tower property" underlies the many
recurrence relations that allow us to get answers in practical situations.
2.11. Algebra of sets and incomplete information: A set of events, F, is an
algebra if
i: A ∈ F implies that A^c ∈ F.
ii: A ∈ F and B ∈ F implies that A ∪ B ∈ F and A ∩ B ∈ F.
iii: Ω ∈ F and ∅ ∈ F.
We interpret F as representing a state of partial information. We know whether
any of the events in F occurred but we do not have enough information to
determine whether an event not in F occurred. The above axioms are natural
in light of this interpretation. If we know whether A happened, we surely know
whether "not A" happened. If we know whether A happened and whether B
happened, then we can tell whether "A and B" happened. We definitely know
whether ∅ happened (it did not) and whether Ω happened (it did). Events in
F are called measurable or determined in F.
2.12. Example 1 of an F: Suppose we learn the outcomes of the first two
tosses. One event measurable in F is (with some abuse of notation)

{HH} = {HHHH, HHHT, HHTH, HHTT} .

An example of an event not determined by this F is the event of no more than
one H:

A = {TTTT, TTTH, TTHT, THTT, HTTT} .

Knowing just the first two tosses does not tell you with certainty whether the
total number of heads is less than two.
2.13. Example 2 of an F: Suppose we know only the results of the tosses
but not the order. This might happen if we toss 4 identical coins at the same
time. In this case, we know only the number of H coins. Some measurable sets
are (with an abuse of notation)

{4} = {HHHH}
{3} = {HHHT, HHTH, HTHH, THHH}
  .
  .
  .
{0} = {TTTT}

The event {2} has 6 outcomes (list them), so its probability is 6 · (1/16) = 3/8. There
are other events measurable in this algebra, such as "less than 3 H", but, in
some sense, the events listed generate the algebra.
2.14. σ-algebra: An algebra of sets is a σ-algebra (pronounced "sigma
algebra") if it is closed under countable intersections, which means the following.
Suppose A_n ∈ F is a countable family of events measurable in F, and A = ∩_n A_n
is the set of outcomes in all of the A_n; then A ∈ F, too. The reader can
check that an algebra closed under countable intersections is also closed under
countable unions, and conversely. An algebra is automatically a σ-algebra if
Ω is finite. If Ω is infinite, an algebra might or might not be a σ-algebra.² In
a σ-algebra, it is possible to take limits of infinite sequences of events, just as
it is possible to take limits of sequences of real numbers. We will never (again)
refer to an algebra of events that is not a σ-algebra.
2.15. Terminology: What we call "outcome" is usually called "random
variable". I did not use this terminology because it can be confusing, in that we
often think of variables as real (or complex) numbers. A real valued function
of the random variable is a real number X for each ω, written X(ω). The
most common abuse of notation in probability is to write X instead of X(ω).
We will do this most of the time, but not just yet. We often think of X as a
random number whose value is determined by the outcome (random variable) ω.
A common convention is to use upper case letters for random numbers and lower
case letters for specific values of that variable. For example, the cumulative
distribution function (CDF), F(x), is the probability that X ≤ x, that is:

F(x) = Σ_{ω : X(ω) ≤ x} P(ω).
2.16. Informal event terminology: We often describe events in words. For
example, we might write P(X ≤ x) where, strictly, we might be supposed to
say A_x = {ω | X(ω) ≤ x}; then P(X ≤ x) = P(A_x). For example, if there are
two functions, X_1 and X_2, we might try to calculate the probability that they
are equal, P(X_1 = X_2). Strictly speaking, this is the probability of the set of
ω so that X_1(ω) = X_2(ω).

² Let Ω be the set of integers and A ∈ F if A is finite or A^c is finite. This F is an algebra
(check), but not a σ-algebra. For example, if A_n leaves out only the first n odd integers,
then A = ∩_n A_n is the set of even integers, and neither A nor A^c is finite.
2.17. Measurable: A function (of a random variable) X(ω) is measurable
with respect to the algebra F if the value of X is completely determined by
the information in F. To give a mathematical definition, for any number, x,
we can consider the event that X = x, which is B_x = {ω : X(ω) = x}.
In discrete probability, B_x will be the empty set for almost all x values and
will not be empty only for those values of x actually taken by X(ω) for one
of the outcomes ω ∈ Ω. The function X(ω) is measurable with respect to F
if the sets B_x are all measurable. People often write X ∈ F (an abuse of
notation) to indicate that X is measurable with respect to F. In Example 2
above, the function X = number of H minus number of T is measurable, while
the function X = number of T before the first H is not (find an x and B_x ∉ F
to show this).
2.18. Generating an algebra of sets: Suppose there are events A_1, . . .,
A_k that you know. The algebra, F, generated by these sets is the algebra
that expresses the information about the outcome you gain by knowing these
events. One definition of F is that an event A is in F if A can be expressed in
terms of the known events A_j using the set operations intersection, union, and
complement a number of times. For example, we could define an event A by
saying "ω is in A_1 and (A_2 or A_3) but not in A_4 or A_5", which would be written
A = (A_1 ∩ (A_2 ∪ A_3)) ∩ (A_4 ∪ A_5)^c. This is the same as saying that F is the
smallest algebra of sets that contains the known events A_j. Obviously (think
about this!) any algebra that contains the A_j contains any event described by
set operations on the A_j; that is the definition of algebra of sets. Also, the sets
defined by set operations on the A_j form an algebra of sets. For example, if A_1
is the event that the first toss is H and A_2 is the event that both the first two
are H, then A_1 and A_2 generate the algebra of events determined by knowing
the results of the first two tosses. This is Example 1 above. To generate a
σ-algebra, we may have to allow infinitely many set operations, but a precise
discussion of this would be off message.
2.19. Generating by a function: A function X(ω) defines an algebra of
sets generated by the sets B_x. This is the smallest algebra, F, so that X is
measurable with respect to F. Example 2 above has this form. We can think of
F as being the algebra of sets defined by statements about the values of X(ω).
For example, one A ∈ F would be the set of ω with X either between 1 and 3
or greater than 4.

We write F_X for the algebra of sets generated by X and ask what it means
that another function of ω, Y(ω), is measurable with respect to F_X. The
information interpretation of F_X says that Y ∈ F_X if knowing the value of X(ω)
determines the value of Y(ω). This means that if ω_1 and ω_2 have the same X
value (X(ω_1) = X(ω_2)) then they also have the same Y value. Said another
way, if B_x is not empty, then there is some number, u(x), so that Y(ω) = u(x)
for every ω ∈ B_x. This means that Y(ω) = u(X(ω)) for all ω ∈ Ω. Altogether,
saying Y ∈ F_X is a fancy way of saying that Y is a function of X. Of course,
u(x) only needs to be defined for those values of x actually taken by the random
variable X.

For example, if X is the number of H in 4 tosses, and Y is the number of
H minus the number of T, then, for any 4 tosses, ω, Y(ω) = 2X(ω) − 4. That
is, u(x) = 2x − 4.
2.20. Equivalence relation: A σ-algebra, F, determines an equivalence
relation. Outcomes ω_1 and ω_2 are equivalent, written ω_1 ∼ ω_2, if the information
in F does not distinguish ω_1 from ω_2. More formally, ω_1 ∼ ω_2 if
ω_1 ∈ A ⇔ ω_2 ∈ A for every A ∈ F. For example, in Example 2 above, THTT ∼ TTTH.
Because F is an algebra, ω_1 ∼ ω_2 also implies that ω_1 ∉ A ⇔ ω_2 ∉ A (think this
through). Note that it is possible that A_ω = A_{ω′} while ω ≠ ω′. This happens
when ω ∼ ω′.

The equivalence class of outcome ω is the set of outcomes equivalent to ω in
F, indistinguishable from ω using the information available in F. If A_ω is the
equivalence class of ω, then A_ω ∈ F. (Proof: for any ω′ not equivalent to ω in
F, there is at least one B_{ω′} ∈ F with ω ∈ B_{ω′} but ω′ ∉ B_{ω′}. Since there are (at
most) countably many ω′, and F is a σ-algebra, A_ω = ∩_{ω′} B_{ω′} ∈ F. This A_ω
contains every ω_1 that is equivalent to ω (why?) and only those.) In Example
2, the equivalence class of THTT is the event {HTTT, THTT, TTHT, TTTH}.
2.21. Partition: A partition of Ω is a collection of events, P = {B_1, B_2, . . .},
so that every outcome ω ∈ Ω is in exactly one of the events B_k. The algebra
generated by P, which we call F_P, consists of events that are unions of events
in P (Why are complements and intersections not needed?). For any partition
P, the equivalence classes of F_P are the events in P (think this through). Con-
versely, if P is the partition of Ω into equivalence classes for F, then P generates
F. In Example 2 above, the sets B_k = {k} form the partition corresponding to
F. More generally, the sets B_x = {ω | X(ω) = x} that are not empty are the
partition corresponding to F_X. In discrete probability, partitions are a conve-
nient way to understand conditional expectation (below). The information in
F_P is the knowledge of which of the B_j happened. The remaining uncertainty
is which ω within that B_j happened.
2.22. Expected value: A random variable (actually, a function of a random
variable) X(ω) has expected value

E[X] = Σ_{ω∈Ω} X(ω)P(ω) .

(Note that we do not write ω on the left. We think of X as simply a random
number and ω as a story telling how X was generated.) This is the average
value in the sense that if you could perform the "experiment" of sampling X
many times then average the resulting numbers, you would get roughly E[X].
This is because P(ω) is the fraction of the time you would get ω and X(ω) is
the number you get for ω. If X_1(ω) and X_2(ω) are two random variables, then
E[X_1 + X_2] = E[X_1] + E[X_2]. Also, E[cX] = cE[X] if c is a constant (not
random).
2.23. Best approximation property: If we wanted to approximate a random
variable, X, (function X(ω) with ω not written) by a single non random number,
x, what value would we pick? That would depend on the sense of "best". One
such sense is least squares, choosing x to minimize the expected value of (X − x)².
A calculation, which uses the above properties of expected value, gives

E[(X − x)²] = E[X² − 2Xx + x²]
            = E[X²] − 2xE[X] + x² .

Minimizing this over x gives the optimal value

x_opt = E[X] . (5)
2.24. Classical conditional expectation: There are two senses of the term
conditional expectation. We start with the original "classical" sense then turn
to the related but different "modern" sense often used in stochastic processes.
Conditional expectation is defined from conditional probability in the obvious
way

E[X | B] = Σ_{ω∈B} X(ω)P(ω | B) . (6)

For example, we can calculate

E[# of H in 4 tosses | at least one H] .

Write B for the event "at least one H". Since only ω = TTTT does not have
at least one H, |B| = 15 and P(ω | B) = 1/15 for any ω ∈ B. Let X(ω) be the
number of H in ω. Unconditionally, E[X] = 2, which means

(1/16) Σ_{ω∈Ω} X(ω) = 2 .

Note that X(ω) = 0 for all ω ∉ B (only TTTT), so

Σ_{ω∈Ω} X(ω)P(ω) = Σ_{ω∈B} X(ω)P(ω) ,

and therefore

(1/16) Σ_{ω∈B} X(ω) = 2
(15/16) · (1/15) Σ_{ω∈B} X(ω) = 2
(1/15) Σ_{ω∈B} X(ω) = 2 · 16/15
E[X | B] = 32/15 = 2 + .133 . . . .

Knowing that there was at least one H increases the expected number of H by
.133 . . . .
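The same kind of enumeration, given here only as an illustrative sketch, reproduces
the value E[X | B] = 32/15.

```python
from itertools import product

outcomes = ["".join(s) for s in product("HT", repeat=4)]
X = lambda w: w.count("H")                 # number of H in the outcome
B = [w for w in outcomes if X(w) >= 1]     # at least one H

E_X = sum(X(w) / 16.0 for w in outcomes)              # unconditional: 2.0
E_X_given_B = sum(X(w) * (1.0 / len(B)) for w in B)   # P(w | B) = 1/15 on B
print(E_X, E_X_given_B, 32.0 / 15.0)       # 2.0 2.1333... 2.1333...
```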
2.25. Law of total probability: Suppose P = {B_1, B_2, . . .} is a partition of
Ω. The law of total probability is the formula

E[X] = Σ_k E[X | B_k] P(B_k) . (7)

This is easy to understand: exactly one of the events B_k happens. The expected
value of X is the sum over each of the events B_k of the expected value of X
given that B_k happened, multiplied by the probability that B_k did happen. The
derivation is a simple combination of the definitions of conditional expectation
(6) and conditional probability (4):

E[X] = Σ_{ω∈Ω} X(ω)P(ω)
     = Σ_k ( Σ_{ω∈B_k} X(ω)P(ω) )
     = Σ_k ( Σ_{ω∈B_k} X(ω) P(ω)/P(B_k) ) P(B_k)
     = Σ_k E[X | B_k] P(B_k) .

This fact underlies the recurrence relations that are among the primary tools of
stochastic calculus. It will be reformulated below as the tower property when
we discuss the modern view of conditional probability.
2.26. Modern conditional expectation: The modern conditional expectation
starts with an algebra, F, rather than just the set B. It defines a (function of
a) random variable, Y(ω) = E[X | F], that is measurable with respect to F
even though X is not. This function represents the best prediction (in the least
squares sense) of X given the information in F. If X ∈ F, then the value of
X(ω) is determined by the information in F, so Y = X.

In the classical case, the information is the occurrence or non occurrence of
a single event, B. That is, the algebra, F_B, consists only of the sets B, B^c, ∅,
and Ω. For this F_B, the modern definition gives a function Y(ω) so that

Y(ω) = { E[X | B]   if ω ∈ B,
         E[X | B^c] if ω ∉ B.

Make sure you understand the fact that this two valued function Y is measurable
with respect to F_B.
Only slightly more complicated is the case where F is generated by a parti-
tion, P = {B_1, B_2, . . .}, of Ω. The conditional expectation Y(ω) = E[X | F] is
defined to be

Y(ω) = E[X | B_j] if ω ∈ B_j , (8)

where E[X | B_j] is the classical conditional expectation (6). A single set B defines
a partition: B_1 = B, B_2 = B^c, so this agrees with the earlier definition in that
case. The information in F is only which of the B_j occurred. The modern
conditional expectation replaces X with its expected value over the set that
occurred. This is the expected value of X given the information in F.
2.27. Example of modern conditional expectation: Take Ω to be sequences of
4 coin tosses. Take F to be the algebra of Example 2 determined by the number
of H tosses. Take X(ω) to be the number of H tosses before the first T (e.g.
X(HHTH) = 2, X(TTTT) = 0, X(HHHH) = 4, etc.). With the usual abuse
of notation, we calculate (below): Y({0}) = 0, Y({1}) = 1/4, Y({2}) = 2/3,
Y({3}) = 3/2, Y({4}) = 4. Note, for example, that because HHTT and HTHT
are equivalent in F (in the equivalence class {2}), Y(HHTT) = Y(HTHT) = 2/3
even though X(HHTT) ≠ X(HTHT). The common value of Y is the average
value of X over the outcomes in the equivalence class.

{0}: TTTT
     X values: 0
     expected value = 0

{1}: HTTT THTT TTHT TTTH
     X values: 1 0 0 0
     expected value = (1 + 0 + 0 + 0)/4 = 1/4

{2}: HHTT HTHT HTTH THHT THTH TTHH
     X values: 2 1 1 0 0 0
     expected value = (2 + 1 + 1 + 0 + 0 + 0)/6 = 2/3

{3}: HHHT HHTH HTHH THHH
     X values: 3 2 1 0
     expected value = (3 + 2 + 1 + 0)/4 = 3/2

{4}: HHHH
     X values: 4
     expected value = 4
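The table above can be reproduced by grouping outcomes into the equivalence
classes of F and averaging X over each class. The following sketch (plain Python,
with helper names chosen here for illustration) does exactly that.

```python
from itertools import product
from collections import defaultdict

outcomes = ["".join(s) for s in product("HT", repeat=4)]
X = lambda w: len(w) - len(w.lstrip("H"))   # number of H before the first T (4 for HHHH)
nH = lambda w: w.count("H")                 # generates the algebra F of Example 2

# Group outcomes into the partition {nH = j} and average X over each block.
blocks = defaultdict(list)
for w in outcomes:
    blocks[nH(w)].append(w)
Y = {w: sum(X(v) for v in blocks[nH(w)]) / len(blocks[nH(w)]) for w in outcomes}

print(sorted({nH(w): Y[w] for w in outcomes}.items()))
# [(0, 0.0), (1, 0.25), (2, 0.666...), (3, 1.5), (4, 4.0)]
```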
2.28. Best approximation property: Suppose we have a random variable,
X(ω), that is not measurable with respect to the algebra F. That is, the
information in F does not completely determine the value of X. The conditional
expectation, Y(ω) = E[X | F], among all functions measurable with respect to
F, is the closest to X in the least squares sense. That is, if Z ∈ F, then

E[(Z − X)²] ≥ E[(Y − X)²] .

In fact, this best approximation property will be the definition of conditional
expectation in situations where the partition definition is not directly applica-
ble. The best approximation property for modern conditional expectation is
a consequence of the best approximation for classical conditional expectation.
The least squares error is the sum of the least squares errors over each B_k in the
partition defined by F. We minimize the least squares error in B_k by choosing
Y(B_k) to be the average of X over B_k (weighted by the probabilities P(ω) for
ω ∈ B_k). By choosing the best approximation in each B_k, we get the best
approximation overall.

This can be expressed in the terminology of linear algebra. The set of func-
tions (random variables) X is a vector space (Hilbert space) with inner product

⟨X, Y⟩ = Σ_{ω∈Ω} X(ω)Y(ω)P(ω) = E[XY] ,

so ‖X − Y‖² = E[(X − Y)²]. The set of functions measurable with respect
to F is a subspace, which we call S_F. The conditional expectation, Y, is the
orthogonal projection of X onto S_F, which is the element of S_F that is closest to
X in the norm just given.
2.29. Tower property: Suppose G is a σ-algebra that has less information
than F. That is, every event in G is also in F, but events in F need not be in
G. This is expressed simply (without abuse of notation) as G ⊆ F. Consider
the (modern) conditional expectations Y = E[X | F] and Z = E[X | G]. The
tower property is the fact that Z = E[Y | G]. That is, conditioning in one step
gives the same result as conditioning in two steps. As we said before, the tower
property underlies the backward equations that are among the most useful tools
of stochastic calculus.

The tower property is an application of the law of total probability to condi-
tional expectation. Suppose P and Q are the partitions of Ω corresponding to
F and G respectively. The partition P is a refinement of Q, which means that
each C_k ∈ Q itself is partitioned into events B_{k,1}, B_{k,2}, . . ., where the B_{k,j} are
elements of P. Then (see "Working with conditional probability") for ω ∈ C_k,
we want to show that Z(ω) = E[Y | C_k]:

Z(ω) = E[X | C_k]
     = Σ_j E[X | B_{k,j}] P(B_{k,j} | C_k)
     = Σ_j Y(B_{k,j}) P(B_{k,j} | C_k)
     = E[Y | C_k] .

The linear algebra projection interpretation makes the tower property seem
obvious. Any function measurable with respect to G is also measurable with
respect to F, which means that the subspace S_G is contained in S_F. If you
project X onto S_F then project the projection onto S_G, you get the same thing
as projecting X directly onto S_G (always orthogonal projections).
2.30. Modern conditional probability: Probabilities can be defined as ex-
pected values of characteristic functions (see below). Therefore, the modern def-
inition of conditional expectation gives a modern definition of conditional prob-
ability. For any event, A, the indicator function, 1_A(ω), (also written χ_A(ω),
for "characteristic function", terminology less used by probabilists because char-
acteristic function means something else to them) is defined by 1_A(ω) = 1 if
ω ∈ A, and 1_A(ω) = 0 if ω ∉ A. The obvious formula P(A) = E[1_A] is the
representation of the probability as an expected value. The modern conditional
probability then is P(A | F) = E[1_A | F]. Unraveling the definitions, this is a
function, Y_A(ω), that takes the value P(A | B_k) whenever ω ∈ B_k. A related
statement, given for practice with notation, is

P(A | F)(ω) = Σ_{B_k ∈ P_F} P(A | B_k) 1_{B_k}(ω) .
3 Markov Chains, I

3.1. Introduction: Discrete time Markov³ chains are a simple abstract class
of discrete random processes. Many practical models are Markov chains. Here
we discuss Markov chains having a finite state space (see below).

Many of the general concepts above come into play here. The probability
space Ω is the space of paths. The natural states of partial information are
described by the algebras F_t, which represent the information obtained by ob-
serving the chain up to time t. The tower property applied to the F_t leads to
backward and forward equations. This section is mostly definitions. The good
stuff is in the next section.
3.2. Time: The time variable, t, will be an integer representing the number
of time units from a starting time. The actual time to go from t to t + 1 could
be a nanosecond (for modeling computer communication networks) or a month
(for modeling bond rating changes), or whatever. To be specific, we usually
start with t = 0 and consider only non negative times.
3.3. State space: At time t the system will be in one of a finite list of states.
This set of states is the state space, S. To be a Markov chain, the state should
be a complete description of the actual state of the system at time t. This
means that it should contain any information about the system at time t that
helps predict the state at future times t + 1, t + 2, . . . . This is illustrated with
the hidden Markov model below. The state at time t will be called X(t) or X_t.
Eventually, there may be an ω also, so that the state is a function of t and ω:
X(t, ω) or X_t(ω). The states may be called s_1, . . ., s_m, or simply 1, 2, . . . , m,
depending on the context.
3.4. Path space: The sequence of states X_0, X_1, . . ., X_T is a path. The set of
paths is path space. It is possible and often convenient to use the set of paths as
the probability space, Ω. When we do this, the path X = (X_0, X_1, . . . , X_T) =
(X(0), X(1), . . . , X(T)) plays the role that was played by the outcome ω in the
general theory above. We will soon have a formula for P(X), the probability of
path X, in terms of transition probabilities.

In principle, it should be possible to calculate the probability of any event
(such as X(2) ≠ s, or X(t) = s_1 for some t ≤ T) by listing all the paths
(outcomes) in that event and summing their probabilities. This is rarely the
easiest way. For one thing, the path space, while finite, tends to be enormous.
For example, if there are m = |S| = 7 states and T = 50 times, then the number
of paths is |Ω| = m^T = 7^50, which is about 1.8 × 10^42. This number is beyond
computers.

³ The Russian mathematician A. A. Markov was active in the last decades of the 19th
century. He is known for his path breaking work on the distribution of prime numbers as well
as on probability.
3.5. Algebras F_t and G_t: The information learned by observing a Markov
chain up to and including time t is F_t. Paths X_1 and X_2 are equivalent in F_t
if X_1(s) = X_2(s) for 0 ≤ s ≤ t. Said only slightly differently, the equivalence
class of path X is the set of paths X′ with X′(s) = X(s) for 0 ≤ s ≤ t. The F_t
form an increasing family of algebras: F_t ⊆ F_{t+1}. (Event A is in F_t if we can
tell whether A occurred by knowing X(s) for 0 ≤ s ≤ t. In this case, we also
can tell whether A occurred by knowing X(s) for 0 ≤ s ≤ t + 1, which is what
it means for A to be in F_{t+1}.)

The algebra G_t is generated by X(t) only. It encodes the information learned
by observing X at time t only, not at earlier times. Clearly G_t ⊆ F_t, but G_t is
not contained in G_{t+1}, because X(t + 1) does not determine X(t).
3.6. Nonanticipating (adapted) functions: The underlying outcome, which
was called ω, is now called X. A function of the outcome, or function of
a random variable, will now be called F(X) instead of X(ω). Over and over
in stochastic processes, we deal with functions that depend on both X and t.
Such a function will be called F(X, t). The simplest such function is F(X, t) =
X(t). More complicated functions are: (i) F(X, t) = 1 if X(s) = 1 for some
s ≤ t, F(X, t) = 0 otherwise, and (ii) F(X, t) = min(s > t) with X(s) = 1, or
F(X, t) = T if X(s) ≠ 1 for t < s ≤ T.

A function F(X, t) is nonanticipating (also called adapted, though the notions
are slightly different in more sophisticated situations) if, for each t, the function
of X given by F(X, t) is measurable with respect to F_t. This is the same as
saying that F(X, t) is determined by the values X(s) for s ≤ t. The function
(i) above has this property but (ii) does not.

Nonanticipating functions are important for several reasons. In time, we
will see that the Ito integral makes sense only for nonanticipating functions.
Moreover, functions F(X, t) are a model of decision making under uncertainty.
That F is nonanticipating means that the decision at time t is made based on
information available at time t and does not depend on future information.
3.7. Markov property: Informally, the Markov property is that X(t) is all the
information about the past that is helpful in predicting the future. In classical
terms, for example,

P(X(t + 1) = k | X(t) = j) = P(X(t + 1) = k | X(t) = j, X(t − 1) = l, etc.) .

In modern notation, this may be stated

P(X(t + 1) = k | F_t) = P(X(t + 1) = k | G_t) . (9)

Recall that both sides are functions of the outcome, X. The function on the
right side, to be measurable with respect to G_t, must be a function of X(t) only
(see "Generating by a function" in the previous section). The left side also is
a function, but in general could depend on all the values X(s) for s ≤ t. The
equality (9) states that this function depends on X(t) only.

This may be interpreted as the absence of hidden variables, variables that
influence the evolution of the Markov chain but are not observable or included
in the state description. If there were hidden variables, observing the chain for a
long period might help identify them and therefore change our prediction of the
future state. The Markov property (9) states, on the contrary, that observing
X(s) for s < t does not change our predictions.
3.8. Transition probabilities: The conditional probabilities (9) are transition
probabilities:

P_jk = P(X(t + 1) = k | X(t) = j) = P(j → k in one step) .

The Markov chain is stationary if the transition probabilities P_jk are indepen-
dent of t. Each transition probability P_jk is between 0 and 1, with values 0 and
1 allowed, though 0 is more common than 1. Also, with j fixed, the P_jk must
sum to 1 (summing over k) because k = 1, 2, . . ., m is a complete list of the
possible states at time t + 1.
3.9. Path probabilities: The Markov property leads to a formula for the
probabilities of individual path outcomes P(X) as products of transition prob-
abilities. We do this here for a stationary Markov chain to keep the notation
simple. First, suppose that the probabilities of the initial states are known, and
call them

f_0(j) = P(X(0) = j) .

Bayes' rule (2) implies that

P(X(1) = k and X(0) = j)
  = P(X(1) = k | X(0) = j) P(X(0) = j) = f_0(j) P_jk .

Using this argument again, and using (9), we find (changing the order of the
factors on the last line)

P(X(2) = l and X(1) = k and X(0) = j)
  = P(X(2) = l | X(1) = k and X(0) = j) P(X(1) = k and X(0) = j)
  = P(X(2) = l | X(1) = k) P(X(1) = k and X(0) = j)
  = f_0(j) P_jk P_kl .

This can be extended to paths of any length.

One way to express the general formula uses a notational habit common
in probability, using upper case letters to represent a random value of a vari-
able and lower case for generic values of the same quantity (see "Terminol-
ogy", Section 2, but note that the meaning of X has changed). We write
x = (x(0), x(1), . . . , x(T)) for a generic path, and seek P(x) = P(X = x) =
P(X(0) = x(0), X(1) = x(1), . . .). The argument above shows that this is given
by

P(x) = f_0(x(0)) P_{x(0),x(1)} · · · P_{x(T−1),x(T)} = f_0(x(0)) ∏_{t=0}^{T−1} P_{x(t),x(t+1)} . (10)
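Formula (10) is easy to turn into code. The sketch below is an illustration only,
with a made-up two-state chain and initial distribution rather than anything from
the notes; it multiplies the initial probability by the transition probabilities along
a given path.

```python
import numpy as np

def path_probability(x, f0, P):
    """P(x) = f0(x(0)) * prod_{t=0}^{T-1} P[x(t), x(t+1)], as in formula (10)."""
    p = f0[x[0]]
    for t in range(len(x) - 1):
        p *= P[x[t], x[t + 1]]
    return p

# A small hypothetical two-state chain (states 0 and 1) for illustration.
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])
f0 = np.array([0.5, 0.5])        # initial probabilities f_0(j)
print(path_probability([0, 0, 1, 1], f0, P))   # 0.5 * 0.9 * 0.1 * 0.6 = 0.027
```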
3.10. Transition matrix: The transition probabilities form an m × m matrix,
P (an unfortunate conflict of notation), called the transition matrix. The (j, k)
entry of P is the transition probability P_jk = P(j → k). The sum of the
entries of the transition matrix P in row j is Σ_k P_jk = 1. A matrix with these
properties, no negative entries and all row sums equal to 1, is a stochastic matrix.
Any stochastic matrix can be the transition matrix for a Markov chain.

Methods from linear algebra often help in the analysis of Markov chains. As
we will see in the next lecture, the time s transition probability

P^s_jk = P(X_{t+s} = k | X_t = j)

is the (j, k) entry of P^s, the s-th power of the transition matrix (explanation
below). Also, as discussed later, steady state probabilities form an eigenvector
of P corresponding to eigenvalue λ = 1.
3.11. Example 3, coin flips: The state space has m = 2 states, called U
(up) and D (down). (Writing H and T would conflict with T being the length
of the chain.) The coin starts in the U position, which means that f_0(U) = 1
and f_0(D) = 0. At every time step, the coin turns over with 20% probability,
so the transition probabilities are P_UU = .8, P_UD = .2, P_DU = .2, P_DD = .8.
The transition matrix is (taking U for 1 and D for 2):

P = [ .8  .2 ]
    [ .2  .8 ]

For example, we can calculate

P² = P · P = [ .68  .32 ]    and    P⁴ = P² · P² = [ .5648  .4352 ]
             [ .32  .68 ]                          [ .4352  .5648 ] .

This implies that P(X(4) = D) = P(X(0) = U → X(4) = D) = P⁴_UD = .4352.
The eigenvalues of P are λ_1 = 1 and λ_2 = .6, the former required by theory.
Numerical experimentation can convince the reader that

‖ P^s − [ .5 .5 ; .5 .5 ] ‖ = const · λ_2^s .

Take T = 3 and let A be the event UUzU, where the state X(2) = z is
unknown. There are two outcomes (paths) in A:

A = {UUUU, UUDU} ,

so P(A) = P(UUUU) + P(UUDU). The individual path probabilities are cal-
culated using (10):

U -(.8)-> U -(.8)-> U -(.8)-> U   so   P(UUUU) = 1 × .8 × .8 × .8 = .512 .
U -(.8)-> U -(.2)-> D -(.2)-> U   so   P(UUDU) = 1 × .8 × .2 × .2 = .032 .

Thus, P(A) = .512 + .032 = .544.
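The numbers in this example can be checked with a few matrix multiplications.
The sketch below assumes the state ordering U = 0, D = 1; it is an illustration,
not part of the original notes.

```python
import numpy as np

# Transition matrix of Example 3: states 0 = U, 1 = D, flip probability 0.2.
P = np.array([[0.8, 0.2],
              [0.2, 0.8]])

P2 = P @ P
P4 = P2 @ P2
print(P2)          # [[0.68 0.32] [0.32 0.68]]
print(P4[0, 1])    # P(X(4) = D | X(0) = U) = 0.4352

# P(A) for A = {UUUU, UUDU}, starting from U with probability 1.
pA = 1.0 * P[0, 0] * P[0, 0] * P[0, 0] + 1.0 * P[0, 0] * P[0, 1] * P[1, 0]
print(pA)          # 0.544
```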
3.12. Example 4: There are two coins, F (fast) and S (slow). Either coin will
be either U or D at any given time. Only one coin is present at any given time
but sometimes the coin is replaced (F for S or vice versa) without changing its
UD status. The F coin has the same UD transition probabilities as Example
3. The S coin has UD transition probabilities:

[ .9   .1  ]
[ .05  .95 ]

The probability of coin replacement at any given time is 30%. The replacement
(if it happens) is done after the (possible) coin flip without changing the UD
status of the coin after that flip. The Markov chain has 4 states, which we
arbitrarily number 1: UF, 2: DF, 3: US, 4: DS. States 1 and 3 are U states
while states 1 and 2 are F states, etc. The transition matrix is 4 × 4. We can
calculate, for example, the (non) transition probability for UF → UF. We first
have a U → U (non) transition then an F → F (non) transition. The probability
is then P(U → U | F) · P(F → F) = .8 × .7 = .56. The other entries can be
found in a similar way. The transitions are:

[ UF→UF  UF→DF  UF→US  UF→DS ]
[ DF→UF  DF→DF  DF→US  DF→DS ]
[ US→UF  US→DF  US→US  US→DS ]
[ DS→UF  DS→DF  DS→US  DS→DS ] .

The resulting transition matrix is

P = [ .8×.7   .2×.7   .8×.3   .2×.3  ]
    [ .2×.7   .8×.7   .2×.3   .8×.3  ]
    [ .9×.3   .1×.3   .9×.7   .1×.7  ]
    [ .05×.3  .95×.3  .05×.7  .95×.7 ] .

If we start with U but equally likely F or S, and want to know the probability
of being D after 4 time periods, the answer is

.5 ( P⁴_12 + P⁴_14 + P⁴_32 + P⁴_34 )

because states 1 = UF and 3 = US are the (equally likely) possible initial U
states, and 2 = DF and 4 = DS are the two D states. We also could calculate
P(UUzU) by adding up the probabilities of the 32 (list them) paths that make
up this event.
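As a sketch (an illustration only, with 0-based indices 0 = UF, 1 = DF, 2 = US,
3 = DS instead of the 1-based numbering above), the 4 × 4 matrix and the
probability of being D after 4 steps can be assembled as follows.

```python
import numpy as np

# Example 4: states 0 = UF, 1 = DF, 2 = US, 3 = DS.
coin_P = [np.array([[0.8, 0.2], [0.2, 0.8]]),      # fast coin UD transitions
          np.array([[0.9, 0.1], [0.05, 0.95]])]    # slow coin UD transitions
swap = np.array([[0.7, 0.3], [0.3, 0.7]])          # coin kept with prob .7, replaced with prob .3

P = np.zeros((4, 4))
for c in range(2):          # current coin (0 = F, 1 = S)
    for u in range(2):      # current UD state (0 = U, 1 = D)
        for c2 in range(2):
            for u2 in range(2):
                P[2 * c + u, 2 * c2 + u2] = coin_P[c][u, u2] * swap[c, c2]

P4 = np.linalg.matrix_power(P, 4)
# Start in U with F or S equally likely; probability of D after 4 steps:
print(0.5 * (P4[0, 1] + P4[0, 3] + P4[2, 1] + P4[2, 3]))
```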
3.13. Example 5, incomplete state information: In the model of Example 4
we might be able to observe the UD status but not the FS status. Let X(t)
be the state of the Example 4 model above at time t. Suppose Y(t) = U if
X(t) = UF or X(t) = US, and Y(t) = D if X(t) = DF or X(t) = DS. Then
the sequence Y(t) is a stochastic process but it is not a Markov chain. We can
better predict U → D transitions if we know whether the coin is F or S, or
even if we have a basis for guessing its FS status.

For example, suppose that the four states (UF, DF, US, DS) at time t = 0
are equally likely, that we know Y(1) = U, and we want to guess whether Y(2)
will again be U. If Y(0) is D then we are more likely to have the F coin, so
a Y(1) = U → Y(2) = D transition is more likely. That is, with Y(1) fixed,
Y(0) = D makes it less likely to have Y(2) = U. This is a violation of the
Markov property brought about by incomplete state information. Models of this
kind are called hidden Markov models. Statistical estimation of the unobserved
variable is a topic for another day.

Thanks to Laura K and Craig for pointing out mistakes and confusions in
earlier drafts.
Stochastic Calculus Notes, Lecture 2
Last modified September 16, 2004
1 Forward and Backward Equations for Markov chains

1.1. Introduction: Forward and backward equations are useful ways to
get answers to quantitative questions about Markov chains. The probabilities
u(k, t) = P(X(t) = k) satisfy forward equations that allow us to compute all
the numbers u(k, t + 1) once all the numbers u(j, t) are known. This moves
us forward from time t to time t + 1. The expected values f(k, t) = E[V(X(T)) |
X(t) = k] (for t < T) satisfy a backward equation that allows us to calculate
the numbers f(k, t) once all the f(j, t + 1) are known. A duality relation allows
us to infer the forward equation from the backward equation, or conversely.
The transition matrix is the generator of both equations, though in different
ways. There are many related problems that have solutions involving forward
and backward equations. Two treated here are hitting probabilities and random
compound interest.
1.2. Forward equation, functional version: Let u(k, t) = P(X(t) = k). The
law of total probability gives

u(k, t + 1) = P(X(t + 1) = k)
            = Σ_j P(X(t + 1) = k | X(t) = j) P(X(t) = j) .

Therefore

u(k, t + 1) = Σ_j P_jk u(j, t) . (1)

This is the forward equation for probabilities. It is also called the Kolmogorov
forward equation or the Chapman Kolmogorov equation. Once u(j, t) is known
for all j ∈ S, (1) gives u(k, t + 1) for any k. Thus, we can go forward in time
from t = 0 to t = 1, etc. and calculate all the numbers u(k, t).

Note that if we just wanted one number, say u(17, 49), still we would have
to calculate many related quantities, all the u(j, t) for t < 49. If the state space
is too large, this direct forward equation approach may be impractical.
1.3. Row and column vectors: If A is an n × m matrix, and B is an m × p
matrix, then AB is n × p. The matrices are compatible for multiplication because
the second dimension of A, the number of columns, matches the first dimension
of B, the number of rows. A matrix with just one column is a column vector.¹
Just one row makes it a row vector. Matrix-vector multiplication is a special
case of matrix-matrix multiplication. We often denote genuine matrices (more
than one row and column) with capital letters and vectors, row or column, with
lower case. In particular, if u is an n dimensional row vector, a 1 × n matrix, and
A is an n × n matrix, then uA is another n dimensional row vector. We do not
write Au for this because that would be incompatible. Matrix multiplication is
always associative. For example, if u is a row vector and A and B are square
matrices, then (uA)B = u(AB). We can compute the row vector uA then
multiply by B, or we can compute the n × n matrix AB then multiply by u.

If u is a row vector, we usually denote the k-th entry by u_k instead of u_{1k}.
Similarly, the k-th entry of column vector f is f_k instead of f_{k1}. If both u and f
have n components, then uf = Σ_{k=1}^{n} u_k f_k is a 1 × 1 matrix, i.e. a number. Thus,
treating row and column vectors as special kinds of matrices makes the product
of a row with a column vector natural, but not, for example, the product of two
column vectors.

¹ The physicists' more sophisticated idea that a vector is a physical quantity with certain
transformation properties is inoperative here.
1.4. Forward equation, matrix version: The probabilities u(k, t) form the
components of a row vector, u(t), with components u_k(t) = u(k, t) (an abuse of
notation). The forward equation (1) may be expressed (check this)

u(t + 1) = u(t)P . (2)

Because matrix multiplication is associative, we have

u(t) = u(t − 1)P = u(t − 2)P² = · · · = u(0)P^t . (3)

Tricks of matrix multiplication give information about the evolution of probabil-
ities. For example, we can write a formula for u(t) in terms of the eigenvectors
and eigenvalues of P. Also, we can save effort in computing u(t) for large t by
repeated squaring:

P → P² → (P²)² = P⁴ → · · · → P^(2^k)

using just k matrix multiplications. For example, this computes P^1024 using
just ten matrix multiplies, instead of a thousand.
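A short sketch of the forward evolution (2)-(3) and of the repeated-squaring trick,
for illustration only; the chain used is the two-state coin chain of Example 3 in
Lecture 1.

```python
import numpy as np

def evolve_forward(u0, P, T):
    """Apply the forward equation u(t+1) = u(t) P for T steps (row vector u)."""
    u = u0.copy()
    for _ in range(T):
        u = u @ P
    return u

def matrix_power_by_squaring(P, n):
    """Compute P^n with O(log n) matrix multiplies by repeated squaring."""
    result = np.eye(P.shape[0])
    while n > 0:
        if n % 2 == 1:
            result = result @ P
        P = P @ P
        n //= 2
    return result

P = np.array([[0.8, 0.2], [0.2, 0.8]])
u0 = np.array([1.0, 0.0])
# u(0) P^1024 computed by squaring agrees with 1024 explicit forward steps.
print(np.allclose(evolve_forward(u0, P, 1024), u0 @ matrix_power_by_squaring(P, 1024)))
```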
1.5. Backward equation, functional version: Suppose we run the Markov
chain until time T then get a reward, V(X(T)). For t ≤ T, define the condi-
tional expectations

f(k, t) = E[V(X(T)) | X(t) = k] . (4)

This expression is used so often it often is abbreviated

f(k, t) = E_{k,t}[V(X(T))] .

These satisfy a backward equation that follows from the law of total probability:

f(k, t) = E[V(X(T)) | X(t) = k]
        = Σ_{j∈S} E[V(X(T)) | X(t) = k and X(t + 1) = j] P(X(t + 1) = j | X(t) = k)
f(k, t) = Σ_{j∈S} f(j, t + 1) P_kj . (5)

The Markov property is used to infer that

E[V(X(T)) | X(t) = k and X(t + 1) = j] = E_{j,t+1}[V(X(T))] .

The dynamics (5) must be supplemented with the final condition

f(k, T) = V(k) . (6)

Using these, we may compute all the numbers f(k, T − 1), then all the numbers
f(k, T − 2), etc.
1.6. Backward equation using modern conditional expectation: As usual, F_t
denotes the algebra generated by X(0), . . ., X(t). Define F(t) = E[V(X(T)) |
F_t]. The left side is a random variable that is measurable in F_t, which means
that F(t) is a function of (X(0), . . . , X(t)). The Markov property implies that
F(t) actually is measurable with respect to G_t, the algebra generated by X(t)
alone. This means that F(t) is a function of X(t) alone, which is to say that
there is a function f(k, t) so that F(t) = f(X(t), t), and

f(X(t), t) = E[V(X(T)) | F_t] = E[V(X(T)) | G_t] .

Since G_t is generated by the partition of events {X(t) = k}, this is the same as def-
inition (4). Moreover, because F_t ⊆ F_{t+1} and F(t + 1) = E[V(X(T)) | F_{t+1}],
the tower property gives

E[V(X(T)) | F_t] = E[F(t + 1) | F_t] ,

so that, again using the Markov property,

F(t) = E[F(t + 1) | G_t] . (7)

Note that this is a version of the tower property. On the event {X(t) = k}, the
right side above takes the value

Σ_{j∈S} f(j, t + 1) P(X(t + 1) = j | X(t) = k) .

Thus, (7) is the same as the backward equation (5). In the continuous time
versions to come, (7) will be very handy.
1.7. Backward equation, matrix version: We organize the numbers f(k, t)
into a column vector f(t) = (f(1, t), f(2, t), · · ·)^t. It is barely an abuse to write
f(t) both for a function of k and a vector. After all, any computer programmer
knows that a vector really is a function of the index. The backward equation
(5) then is equivalent to (check this)

f(t) = Pf(t + 1) . (8)

Again the associativity of matrix multiplication lets us write, for example,

f(t) = P^{T−t} V ,

writing V for the vector of values of V.
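The backward recursion (8) is equally short in code. The sketch below uses a
made-up 3-state chain and reward vector (nothing from the notes) and also checks
the invariance u(t)f(t) = E[V(X(T))] discussed in the next two paragraphs.

```python
import numpy as np

def backward_rewards(P, V, T):
    """Backward equation (8): f(T) = V, f(t) = P f(t+1); returns f(0)."""
    f = V.copy()
    for _ in range(T):
        f = P @ f
    return f

# Hypothetical 3-state chain, reward vector, and initial distribution.
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.8, 0.1],
              [0.2, 0.2, 0.6]])
V = np.array([1.0, 0.0, 2.0])
u0 = np.array([0.2, 0.5, 0.3])
T = 10

f0 = backward_rewards(P, V, T)
uT = u0 @ np.linalg.matrix_power(P, T)
print(np.isclose(u0 @ f0, uT @ V))   # True: u(0) f(0) = u(T) V = E[V(X(T))]
```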
1.8. Invariant expectation value: We combine the conditional expectations
(4) with the probabilities u(k, t), using the law of total probability, to get, for any
t,

E[V(X(T))] = Σ_{k∈S} P(X(t) = k) E[V(X(T)) | X(t) = k]
           = Σ_{k∈S} u(k, t) f(k, t)
           = u(t) f(t) .

The last line is a natural example of an inner product between a row vector and a
column vector. Note that the product E[V(X(T))] = u(t)f(t) does not depend
on t even though u(t) and f(t) are different for different t. For this invariance
to be possible, the forward evolution equation for u and the backward equation
for f must be related.
1.9. Relationship between the forward and backward equations: It often
is possible to derive the backward equation from the forward equation and
conversely using the invariance of u(t)f(t). For example, suppose we know
that f(t) = Pf(t + 1). Then u(t + 1)f(t + 1) = u(t)f(t) may be rewritten
u(t + 1)f(t + 1) = u(t)Pf(t + 1), which may be rearranged as (using rules of
matrix multiplication)

( u(t + 1) − u(t)P ) f(t + 1) = 0 .

If this is true for enough linearly independent vectors f(t + 1), then the vector
u(t + 1) − u(t)P must be zero, which is the matrix version of the forward equation
(2). A theoretically minded reader can verify that enough f vectors are produced
if the transition matrix is nonsingular and we choose a linearly independent
family of reward vectors, V. In the same way, the backward evolution of f is
a consequence of invariance and the forward evolution of u.

We now have two ways to evaluate E[V(X(T))]: (i) start with the given u(0),
compute u(T) = u(0)P^T, then evaluate u(T)V, or (ii) start with the given V = f(T),
compute f(0) = P^T V, then evaluate u(0)f(0). The former might be preferable,
for example, if we had a number of different reward functions to
evaluate. We could compute u(T) once then evaluate u(T)V for all our V vectors.
1.10. Duality: In its simplest form, duality is the relationship between a
matrix and its transpose. The set of column vectors with n components is a
vector space of dimension n. The set of n component row vectors is the dual
space, which has the same dimension but may be considered to be a different
space. We can combine an element of a vector space with an element of its dual
to get a number: row vector u multiplied by column vector f yields the number
uf. Any linear transformation on the vector space of column vectors is repre-
sented by an n × n matrix, P. This matrix also defines a linear transformation,
the dual transformation, on the dual space of row vectors, given by u → uP.
This is the sense in which the forward and backward equations are dual to each
other.

Some people prefer not to use row vectors and instead think of organizing
the probabilities u(k, t) into a column vector that is the transpose of what
we called u(t). For them, the forward equation would be written u(t + 1) =
P^t u(t) (note the notational problem: the t in P^t means transpose while the
t in u(t) and f(t) refers to time). The invariance relation for them would be
u^t(t + 1)f(t + 1) = u^t(t)f(t). The transpose of a matrix is often called its dual.
1.11. Hitting probabilities, backwards: The hitting probability for state 1
up to time T is

P(X(t) = 1 for some t ∈ [0, T]) . (9)

Here and below we write [a, b] for all the integers between a and b, including
a and/or b if they are integers. Hitting probabilities can be computed using
forward or backward equations, often by modifying P and adding boundary
conditions. For one backward equation approach, define

f(k, t) = P(X(t′) = 1 for some t′ ∈ [t, T] | X(t) = k) . (10)

Clearly,

f(1, t) = 1 for all t, (11)

and

f(k, T) = 0 for k ≠ 1. (12)

Moreover, if k ≠ 1, the law of total probability yields a backward relation

f(k, t) = Σ_{j∈S} P_kj f(j, t + 1) . (13)

The difference between this and the plain backward equation (5) is that the
relation (13) holds only for "interior" states k ≠ 1, while the boundary condition
(11) supplies the values of f(1, t). The sum on the right of (13) includes the
term corresponding to state j = 1.
1.12. Hitting probabilities, forward: We also can compute the hitting proba-
bilities (9) using a forward equation approach. Define the survival probabilities

u(k, t) = P(X(t) = k and X(t′) ≠ 1 for t′ ∈ [0, t]) . (14)

These satisfy the obvious boundary condition

u(1, t) = 0 , (15)

and initial condition

u(k, 0) = 1 for k ≠ 1. (16)

The forward equation is (as the reader should check)

u(k, t + 1) = Σ_{j∈S} u(j, t) P_jk . (17)

We may include or exclude the term with j = 1 on the right because u(1, t) = 0.
Of course, (17) applies only at interior states k ≠ 1. The overall probability
of survival up to time T is Σ_{k∈S} u(k, T) and the hitting probability is the
complementary 1 − Σ_{k∈S} u(k, T).

The matrix vector formulation of this involves the row vector

u(t) = (u(2, t), u(3, t), . . .)

and the matrix P̃ formed from P by removing the first row and column. The
evolution equation (17) and boundary condition (15) are both expressed by the
matrix equation

u(t + 1) = u(t)P̃ .

Note that P̃ is not a stochastic matrix because some of the row sums are less
than one:

Σ_{j≠1} P_kj < 1 if P_k1 > 0 .
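The forward (survival) computation with the reduced matrix P̃ can be sketched as
follows. The 3-state chain and the choice of initial distribution are made up for
illustration, with state 0 playing the role of the boundary state 1 in the notes and
the chain assumed to start away from it.

```python
import numpy as np

def hitting_probability(P, u0, T, hit_state=0):
    """Survival probabilities via u(t+1) = u(t) P~, where P~ drops the row and
    column of the hit state; returns the hitting probability up to time T."""
    keep = [k for k in range(P.shape[0]) if k != hit_state]
    P_sub = P[np.ix_(keep, keep)]          # the matrix called P~ in the notes
    u = u0[keep].copy()                    # mass on the surviving states at t = 0
    for _ in range(T):
        u = u @ P_sub
    return 1.0 - u.sum()

# Hypothetical 3-state chain; probability of visiting state 0 within T = 20 steps,
# starting from state 1 with probability one.
P = np.array([[1.0, 0.0, 0.0],             # row of the hit state (unused after removal)
              [0.3, 0.5, 0.2],
              [0.1, 0.4, 0.5]])
u0 = np.array([0.0, 1.0, 0.0])
print(hitting_probability(P, u0, T=20))
```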
1.13. Absorbing boundaries: Absorbing boundaries are another way to think
about hitting and survival probabilities. The absorbing boundary Markov chain
is the same as the original chain (same transition probabilities) as long as the
state is not one of the boundary states. In the absorbing chain, the state never
again changes after it visits an absorbing boundary point. If P̂ is the transition
matrix of the absorbing chain and P is the original transition matrix, this means
that P̂_jk = P_jk if j is not a boundary state, while P̂_jk = 0 if j is a boundary
state and k ≠ j. The probabilities u(k, t) for the absorbing chain are the same
as the survival probabilities (14) for the original chain.
1.14. Running cost: Suppose we have a running cost function, W(x), and
we want to calculate

f = E[ Σ_{t=0}^{T} W(X(t)) ] . (18)

Sums like this are called path dependent because their value depends on the
whole path, not just the final value X(T). We can calculate (18) with the
forward equation using

f = Σ_{t=0}^{T} E[W(X(t))] = Σ_{t=0}^{T} u(t)W . (19)

Here W is the column vector with components W_k = W(k). We compute the
probabilities that are the components of the u(t) using the standard forward
equation (2) and sum the products (19).

One backward equation approach uses the quantities

f(k, t) = E_{k,t}[ Σ_{t′=t}^{T} W(X(t′)) ] . (20)

These satisfy (check this):

f(t) = Pf(t + 1) + W . (21)

Starting with f(T) = W, we work backwards with (21) until we reach the
desired f(0).
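A sketch of the backward recursion (21) for the running cost, with a made-up
two-state chain and cost vector; it is an illustration under those assumptions, not
a computation from the notes.

```python
import numpy as np

def expected_running_cost(P, W, u0, T):
    """Backward recursion (21): f(T) = W, f(t) = P f(t+1) + W; returns u(0) f(0)."""
    f = W.copy()
    for _ in range(T):
        f = P @ f + W
    return u0 @ f

P = np.array([[0.8, 0.2], [0.2, 0.8]])
W = np.array([1.0, 0.0])                  # pay 1 whenever the chain is in state 0
u0 = np.array([1.0, 0.0])
print(expected_running_cost(P, W, u0, T=5))
```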
1.15. Multiplicative functionals: For some reason, a function of a function is
often called a functional. The path, X(t), is a function of t, so a function, F(X),
that depends on the whole path is often called a functional. Some applications
call for finding the expected value of a multiplicative functional:

f = E[ ∏_{t=0}^{T} V(X(t)) ] . (22)

For example, X(t) could represent the state of a financial market and V(k) =
1 + r(k) the interest rate for state k. Then (22) would be the expected total
interest. We also can write V(k) = e^{W(k)}, so that

∏_t V(X(t)) = exp( Σ_t W(X(t)) ) = e^Z ,

with Z = Σ_t W(X(t)). This does not solve the problem of evaluating (22) because
E[e^Z] ≠ e^{E[Z]}.
The backward equation approach uses the intermediate quantities

f(k, t) = E_{k,t}[ ∏_{t′=t}^{T} V(X(t′)) ] .

The t′ = t term in the product has V(X(t)) = V(k). The final condition is
f(k, T) = V(k). The backward evolution equation is derived more or less as
before:

f(k, t) = E_{k,t}[ V(k) ∏_{t′>t} V(X(t′)) ]
        = V(k) E_{k,t}[ ∏_{t′=t+1}^{T} V(X(t′)) ]
        = V(k) E_{k,t}[ f(X(t + 1), t + 1) ]   (the tower property)
f(k, t) = V(k) ( Pf(t + 1) )(k) . (23)

In the last line on the right, f(t + 1) is the column vector with components
f(k, t + 1) and Pf(t + 1) is the matrix vector product. We write ( Pf(t + 1) )(k)
for the k-th component of the column vector Pf(t + 1). We could express the
whole thing in matrix terms using diag(V), the diagonal matrix with V(k) in
the (k, k) position:

f(t) = diag(V) Pf(t + 1) .

A version of (23) for Brownian motion is called the Feynman-Kac formula.
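A sketch of the recursion (23), f(t) = diag(V) P f(t+1), with made-up numbers
for the market interpretation V(k) = 1 + r(k); the two states, rates, and initial
distribution are illustrative assumptions only.

```python
import numpy as np

def expected_multiplicative_functional(P, V, u0, T):
    """Backward recursion (23): f(T) = V, f(t) = diag(V) P f(t+1); returns u(0) f(0)."""
    f = V.copy()
    for _ in range(T):
        f = V * (P @ f)                   # elementwise V(k) times (P f(t+1))(k)
    return u0 @ f

# Hypothetical 2-state market: per-period interest rates 5% and 1%.
P = np.array([[0.9, 0.1], [0.5, 0.5]])
V = np.array([1.05, 1.01])
u0 = np.array([1.0, 0.0])
print(expected_multiplicative_functional(P, V, u0, T=10))
```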
1.16. Branching processes: One forward equation approach to (22) leads to
a dierent interpretation of the answer. Let B(k, t) be the event {X(t) = k}
and I(k, t) the indicator function of B(k, t). That is I(k, t, X) = 1 if X
B(k, t) (i.e. X(t) = k), and I(k, t, X) = 0 otherwise. It is in keeping with the
probabilists habbit of leaving out the arguents of functions when the argument
is the underlying random outcome. We have u(k, t) = E[I(k, t)]. The forward
equation for the quantities
g(k, t) = E
_
I(k, t)
t

=0
V (X(t

))
_
(24)
is (see homework):
g(k, t) = V (k)
_
g(t 1)P
_
(k) . (25)
This is also the forward equation for a branching process with branching factors V(k). At time t, the branching process has N(k, t) particles, or walkers, at state k. The numbers N(k, t) are random. A time step of the branching process has two parts. First, each particle takes one step of the Markov chain. A particle at state j goes to state k with probability P_{jk}. All steps for all particles are independent. Then, each particle at state k does a branching or birth/death step in which the particle is replaced by a random number of particles with expected number V(k). For example, if V(k) = 1/2, we could delete the particle (death) with probability one half. If V(k) = 2.8, we could keep the existing particle, add one new one, and then add a third with probability .8. All particles are treated independently. If there are m particles in state k before the birth/death step, the expected number after the birth/death step is V(k)m. The expected number of particles, g(k, t) = E[N(k, t)], satisfies (25).
When V(k) = 1 for all k there need be no birth or death. There will be just one particle, the path X(t). The number of particles at state k at time t, N(k, t), will be zero if X(t) ≠ k or one if X(t) = k. In fact, N(k, t) = I(k, t)(X). The expected values will be g(k, t) = E[N(k, t)] = E[I(k, t)] = u(k, t).
The branching process representation of (22) is possible when V(k) ≥ 0 for all k. Monte Carlo methods based on branching processes are more accurate than direct Monte Carlo in many cases.
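The sketch below (all numbers invented, numpy assumed) runs the branching process just described and compares the empirical averages E[N(k, T)] with the forward recursion (25). The branching step realizes the mean V(k) as floor(V(k)) offspring plus one more with the leftover probability, which is one choice among many; the first branching step at t = 0 matches g(k, 0) = u0(k) V(k).

    import numpy as np

    rng = np.random.default_rng(1)

    P = np.array([[0.6, 0.4, 0.0],
                  [0.3, 0.4, 0.3],
                  [0.0, 0.5, 0.5]])
    V = np.array([0.9, 1.0, 1.3])      # expected offspring per particle at state k
    u0 = np.array([1.0, 0.0, 0.0])
    T = 6

    # Forward recursion (25): g(k,0) = u0(k) V(k), g(k,t) = V(k) (g(t-1) P)(k).
    g = u0 * V
    for t in range(T):
        g = V * (g @ P)
    print("recursion   g(.,T) =", g)

    def branch(counts):
        # replace each particle at state k by a random number with mean V(k)
        new = np.zeros_like(counts)
        for k, m in enumerate(counts):
            base = int(np.floor(V[k]))
            new[k] = base * m + rng.binomial(m, V[k] - base)
        return new

    n_runs, acc = 10_000, np.zeros(3)
    for _ in range(n_runs):
        counts = np.zeros(3, dtype=int)
        counts[rng.choice(3, p=u0)] = 1
        counts = branch(counts)                      # branching step at t = 0
        for t in range(T):
            moved = np.zeros(3, dtype=int)
            for k, m in enumerate(counts):
                if m:
                    moved += rng.multinomial(m, P[k])
            counts = branch(moved)                   # move, then branch
        acc += counts
    print("Monte Carlo E[N(.,T)] ~", acc / n_runs)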
2 Lattices, trees, and random walk
2.1. Introduction: Random walk on a lattice is an important example where the abstract theory of Markov chains is used. It is the simplest model of something randomly moving through space with none of the subtlety of Brownian motion, though random walk on a lattice is a useful approximation to Brownian motion, and vice versa. The forward and backward equations take a specific simple form for lattice random walk and it is often possible to calculate or approximate the solutions by hand. Boundary conditions will be applied at the boundaries of lattices, hence the name.
We pursue forward and backward equations for several reasons. First, they often are the best way to calculate expectations and hitting probabilities. Second, many theoretical qualitative properties of specific Markov chains are understood using backward or forward equations. Third, they help explain and motivate the partial differential equations that arise as backward and forward equations for diffusion processes.
2.2. Simple random walk: The state space for simple random walk is the integers, positive and negative. At each time, the walker has three choices: (A) move up one, (B) do not move, (C) move down one. The probabilities are P(A) = P(k → k+1) = a, P(B) = P(X(t+1) = X(t)) = b, and P(X(t+1) = X(t)−1) = c. Naturally, we need a, b, and c to be non-negative and a + b + c = 1. The transition matrix² has b on the diagonal (P_{kk} = b for all k), a on the super-diagonal (P_{k,k+1} = a for all k), and c on the sub-diagonal. All other matrix elements P_{jk} are zero.
This Markov chain is homogeneous or translation invariant: the probabilities of moving up or down are independent of X(t). A translation by k is a shift of everything by k (I do not know why this is called translation). Translation invariance means, for example, that the probability of going from m to l in s steps is the same as the probability of going from m + k to l + k in s steps: P(X(t+s) = l | X(t) = m) = P(X(t+s) = l + k | X(t) = m + k). It is common to simplify general discussions by choosing k so that X(0) = 0. Mathematicians often say "without loss of generality" or "w.l.o.g." when doing so.
² This matrix is infinite when the state space is infinite. Matrix multiplication is still defined. For example, the k component of uP is given by (uP)_k = Σ_j u_j P_{jk}. This possibly infinite sum has only three nonzero terms when P is tridiagonal.
Often, particularly when discussing multidimensional random walk, we use x, y, etc. instead of j, k, etc. to denote lattice points (states of the Markov chain). Probabilists often use lower case Latin letters for general possible values of a random variable, while using the capital letter for the random variable itself. Thus, we might write P_{xy} = P(X(t+1) = y | X(t) = x). As an exercise in definition unwrapping, review Lecture 1 and check that this is the same as P_{X(t),x} = P(X(t+1) = x | F_t).
2.3. Gaussian approximation, drift, and volatility: We can write X(t+1) = X(t) + Y(t), where P(Y(t) = 1) = a, P(Y(t) = 0) = b, and P(Y(t) = −1) = c. The random variables Y(t) are independent of each other because of the Markov property and homogeneity. Assuming (without loss of generality) that X(0) = 0, we have
X(t) = Σ_{s=0}^{t−1} Y(s) ,   (26)
which expresses X(t) as a sum of iid (independent and identically distributed) random variables. The central limit theorem then tells us that for large t, X(t) is approximately Gaussian with mean μt and variance σ²t, where μ = E[Y(t)] = a − c and σ² = var[Y(t)] = a + c − (a − c)². These are called drift and volatility³ respectively. The mean and variance of X(t) grow linearly in time with rates μ and σ² respectively. Figure 1 shows some probability distributions for simple random walk.
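A quick simulation check of the drift and volatility formulas, assuming numpy; the values of a, b, c, and T below are simply those used in Figure 1.

    import numpy as np

    rng = np.random.default_rng(2)

    a, b, c = 0.2, 0.2, 0.6                      # P(Y=+1), P(Y=0), P(Y=-1)
    T, n_paths = 60, 100_000

    Y = rng.choice([1, 0, -1], size=(n_paths, T), p=[a, b, c])
    X_T = Y.sum(axis=1)                          # X(T) as the sum (26)

    mu = a - c
    sigma2 = a + c - (a - c) ** 2
    print("sample mean", X_T.mean(), "vs mu*T     =", mu * T)
    print("sample var ", X_T.var(),  "vs sigma2*T =", sigma2 * T)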
2.4. Trees: Simple random walk can be thought of as a sequence of decisions. At each time you decide: up (A), stay (B), or down (C). A more general sequence of decisions is a decision tree. In a general decision tree, making choice A at time 0 then B at time one would have a different result than choosing first B then A. After t decisions, there could be 3^t different decision paths and results.
The simple random walk decision tree is recombining, which means that many different decision paths lead to the same X(t). For example, starting (w.l.o.g.) with X(0) = 0, the paths ABB, CAA, BBA, etc. all lead to X(3) = 1. A recombining tree is much smaller than a general decision tree. For simple random walk, after t steps there are 2t + 1 possible states, instead of up to 3^t. For t = 10, this is 21 instead of about 60 thousand.
2.5. Urn models: Urn models illustrate several features of more general random walks. Unlike simple random walk, urn models are mean reverting and have steady state probabilities that determine their large time behavior. We will come back to them when we discuss scaling in future lectures.
The simple urn contains n balls that are identical except for their color. There are k red balls and n − k green ones. At each stage, someone chooses one of the balls at random, with each ball equally likely to be chosen. He or she replaces the chosen ball with a fresh ball that is red with probability p and green
³ People use the term volatility in two distinct ways. In the Black-Scholes theory, volatility means something else.
[Figure 1: two panels showing the probability distribution of simple random walk with a = 0.20, b = 0.20, c = 0.60, after T = 8 and after T = 60 steps; horizontal axis k, vertical axis probability.]
Figure 1: The probability distributions after T = 8 (top) and T = 60 (bottom) steps for simple random walk. The smooth curve and circles represent the central limit theorem Gaussian approximation. The plots have different probability and k scales. Values not shown have very small probability.
with probability 1 − p. All choices are independent. The number of red balls decreases by one if he or she removes a red ball and returns a green one. This happens with probability (k/n)·(1 − p). Similarly, the k → k+1 probability is ((n − k)/n)·p. In formal terms, the state space is the integers from 0 to n and the transition probabilities are
P_{k,k−1} = k(1 − p)/n ,   P_{kk} = ((2p − 1)k + (1 − p)n)/n ,   P_{k,k+1} = (n − k)p/n ,
P_{jk} = 0 otherwise.
If these formulas are right, then P_{k,k−1} + P_{kk} + P_{k,k+1} = 1.
2.6. Urn model steady state: For the simple urn model, the probabilities u(k, t) = P(X(t) = k) converge to steady state probabilities, v(k), as t → ∞. This is illustrated in Figure (2). The steady state probabilities are
v(k) = (n choose k) p^k (1 − p)^{n−k} .
The steady state probabilities have the property that if u(k, t) = v(k) for all k, then u(k, t+1) = v(k) also for all k. This is statistical steady state because the probabilities have reached steady state values though the states themselves keep changing, as in Figure (3). In matrix vector notation, we can form the row vector, v, with entries v(k). Then v is a statistical steady state if vP = v. It is no coincidence that v(k) is the probability of getting k red balls in n independent trials with probability p for each trial. The steady state expected number of red balls is
E_v[X] = np ,
where the notation E_v[·] refers to expectation in the probability distribution v.
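The check below builds the urn transition matrix from the formulas above, verifies that its rows sum to one, and confirms that the binomial distribution v is a statistical steady state with mean np; the particular n and p are arbitrary choices and numpy is assumed.

    import numpy as np
    from math import comb

    n, p = 30, 0.4

    # Urn transition matrix on states k = 0, ..., n.
    P = np.zeros((n + 1, n + 1))
    for k in range(n + 1):
        down, up = k * (1 - p) / n, (n - k) * p / n
        if k > 0: P[k, k - 1] = down
        if k < n: P[k, k + 1] = up
        P[k, k] = 1.0 - down - up

    print(np.allclose(P.sum(axis=1), 1.0))                 # rows sum to one

    # Binomial steady state v(k) = C(n,k) p^k (1-p)^(n-k).
    v = np.array([comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)])
    print(np.allclose(v @ P, v))                           # v P = v
    print(v @ np.arange(n + 1), n * p)                     # E_v[X] = n p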
2.7. Urn model mean reversion: If we let m(t) be the expected value of X(t), then a calculation using the transition probabilities gives the relation
m(t+1) = m(t) + (1/n)(np − m(t)) .   (27)
This relation shows not only that m(t) = np is a steady state value (m(t) = np implies m(t+1) = np), but also that m(t) → np as t → ∞ (if r(t) = m(t) − np, then r(t+1) = λ r(t) with |λ| = |1 − 1/n| < 1).
Another way of expressing mean reversion will be useful in discussing stochastic differential equations later. Because the urn model is a Markov chain,
E[X(t+1) | F_t] = E[X(t+1) | X(t)] .
Again using the transition probabilities, we get
E[X(t+1) | F_t] = X(t) + (1/n)(np − X(t)) .   (28)
[Figure 2: probability distributions for the simple urn model with n = 30, plotted every T = 6 time steps; horizontal axis k, vertical axis probability.]
Figure 2: The probability distributions for the simple urn model plotted every T time steps. The first curve is blue, low, and flat. The last one is red and most peaked in the center. The computation starts with each state being equally likely. Over time, states near the edges become less likely.
[Figure 3: eleven Monte Carlo sample paths of the simple urn model with p = 0.5, n = 100; horizontal axis t from 0 to 400, vertical axis X.]
Figure 3: A Monte Carlo sampling of 11 paths from the simple urn model. At time t = 0 (the left edge), the paths are evenly spaced within the state space.
If X(t) > np, the expected change
E[ΔX(t) | F_t] = E[X(t+1) − X(t) | F_t] = (1/n)(np − X(t))
is negative. If X(t) < np, it is positive.
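Mean reversion is easy to see in a direct simulation of the urn; the sketch below (parameters chosen to match Figure 3, numpy assumed) starts bundles of paths above and below np and prints the sample mean at a few times, which decays toward np as (27) predicts.

    import numpy as np

    rng = np.random.default_rng(3)

    n, p, T, n_paths = 100, 0.5, 400, 2000

    def mean_path(k0):
        # simulate the urn: draw a ball (red with prob k/n), replace it (red with prob p)
        k = np.full(n_paths, k0)
        means = [k.mean()]
        for t in range(T):
            drew_red = rng.random(n_paths) < k / n
            new_red = rng.random(n_paths) < p
            k = k - drew_red + new_red
            means.append(k.mean())
        return np.array(means)

    print(mean_path(90)[[0, 50, 200, 400]])    # decreases toward np = 50
    print(mean_path(10)[[0, 50, 200, 400]])    # increases toward np = 50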
2.8. Boundaries: The terms boundary, interior, region, etc. as used in the general discussion of Markov chain hitting probabilities come from applications in lattice Markov chains such as simple random walk. For example, the region x > α has boundary x = α. The quantities
u(x, t) = P(X(t) = x and X(s) > α for 0 ≤ s ≤ t)
satisfy the forward equation (just (1) in this special case)
u(x, t+1) = a u(x−1, t) + b u(x, t) + c u(x+1, t)
for x > α together with the absorbing boundary condition u(α, t) = 0. We could create a finite state space Markov chain by considering a region α < x < β with simple random walk in the interior together with absorbing boundaries at x = α and x = β. Absorbing boundary conditions are also called Dirichlet boundary conditions.
Another way to create a finite state space Markov chain is to put reflecting boundaries at x = α and x = β. This chain has the same transition probabilities as ordinary random walk in the interior (α < x < β). However, transitions from α to α−1 are disallowed and replaced by transitions from α to α+1. This means changing the transition probabilities starting from x = α to
P(α → α−1) = P_{α,α−1} = 0 ,   P(α → α) = P_{αα} = b ,   P(α → α+1) = P_{α,α+1} = a + c .
The transition rules at x = β are similarly changed to block β → β+1 transitions. There is some freedom in defining the reflection rules at the boundaries. We could, for example, make P(α → α) = b + c and P(α → α+1) = a, which changes the blocked transition to standing still rather than moving right. We return to this point in discussing oblique reflection in multidimensional random walks and diffusions.
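To make the two kinds of boundary concrete, the sketch below assembles finite transition matrices for a walk on the integers 0, ..., 10 (an arbitrary stand-in for the interval between the two boundary points) with either absorbing or reflecting ends; the probabilities a, b, c are invented and numpy is assumed.

    import numpy as np

    a, b, c = 0.3, 0.4, 0.3          # up / stay / down probabilities
    m = 11                           # states 0, ..., 10; 0 and 10 are the boundaries

    P = np.zeros((m, m))
    for i in range(m):
        P[i, i] = b
        if i + 1 < m: P[i, i + 1] = a
        if i - 1 >= 0: P[i, i - 1] = c

    # Absorbing (Dirichlet) boundaries: the walk stays put once it reaches 0 or 10.
    P_abs = P.copy()
    P_abs[0, :] = 0;  P_abs[0, 0] = 1.0
    P_abs[-1, :] = 0; P_abs[-1, -1] = 1.0

    # Reflecting boundaries: the blocked move is redirected one step inward.
    P_ref = P.copy()
    P_ref[0, 0], P_ref[0, 1] = b, a + c
    P_ref[-1, -1], P_ref[-1, -2] = b, a + c

    print(np.allclose(P_abs.sum(axis=1), 1), np.allclose(P_ref.sum(axis=1), 1))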
2.9. Multidimensional lattice: The unit square lattice in d dimensions is the set of d-tuples of integers (the set of integers is called Z):
x = (x_1, . . . , x_d) with x_j ∈ Z for 1 ≤ j ≤ d .
The scaled square lattice, with lattice spacing h > 0, is the set of points hx = (hx_1, . . . , hx_d), where x are integer lattice points. In the present discussion, the scaling is irrelevant, so we use the unit lattice. We say that lattice points x and y are neighbors if
|x_j − y_j| ≤ 1 for all coordinates j = 1, . . . , d .
Stochastic Calculus Notes, Lecture 3
Last modified September 30, 2004
1 Martingales and stopping times
1.1. Introduction: Martingales and stopping times are important technical tools used in the study of stochastic processes such as Markov chains and diffusions. A martingale is a stochastic process that is always unpredictable in the sense that E[F_{t+t'} | F_t] = F_t (see below) if t' > 0. A stopping time is a random time, τ(ω), so that we know at time t whether to stop, i.e. the event {τ ≤ t} is measurable in F_t. These tools work well together because stopping a martingale at a stopping time also gives increments with mean zero: if t ≤ τ, then E[F_τ | F_t] = F_t. A central fact about the Ito calculus is that Ito integrals with respect to Brownian motion are martingales.
1.2. Stochastic process: Here is a more abstract definition of a discrete time stochastic process. We have a probability space, Ω. The information available at time t is represented by the algebra of events F_t. We assume that for each t, F_t ⊂ F_{t+1}; since we are supposed to gain information going from t to t+1, every known event in F_t is also known at time t+1. A stochastic process is a family of random variables, X_t(ω), with X_t ∈ F_t (X_t measurable with respect to F_t). Sometimes it happens that the random variables X_t contain all the information in the F_t in the sense that F_t is generated by X_1, . . ., X_t. This is the minimal algebra in which the X_t form a stochastic process. In other cases F_t contains more information. Economists use these possibilities when they distinguish between the "weak" efficient market hypothesis (the F_t are minimal) and the "strong" hypothesis (F_t contains all the public information in the world, literally). In the case of minimal F_t, it may be possible to identify the outcome, ω, with the path X = (X_1, . . . , X_T). This is less common when the F_t are not minimal because the extra information may have to do with processes other than X_t. For the definition of stochastic process, the probabilities are not important, just the algebras of sets and the random variables X_t. An expanding family of algebras F_t ⊂ F_{t+1} is a filtration.
1.3. Notation: The value of a stochastic process at time t may be written X_t or X(t). The subscript notation reminds us that the X_t are a family of functions of the random outcome (random variable) ω. In practical contexts, particularly in discussing multidimensional processes (X(t) ∈ R^n), we prefer X(t) so that X_k(t) can represent the k-th component of X(t). When the process is a martingale, we often call it F_t. This will allow us to let X(t) be a Markov chain and F_t(X) a martingale function of X.
1.4. Example 1, Markov chains: In this example, the F_t are minimal and Ω is the path space of sequences of length T from the state space, S. The new information revealed at time t is the state of the chain at time t. The variables X_t may be called coordinate functions because X_t is coordinate t (or entry t) in the sequence X. In principle, we could express this with the notation X_t(X), but that would drive people crazy. Although we distinguish between Markov chains (discrete time) and Markov processes (continuous time), the term stochastic process can refer to either continuous or discrete time.
1.5. Example 2, diadic sets: This is a set of definitions for discussing averages over a range of length scales. The time variable, t, represents the amount of averaging that has been done. The new information revealed at time t is finer scale information about a function (an audio signal or digital image). The state space is the positive integers from 1 to 2^T. We start with a function X(ω) and ask that X_t(ω) be constant on diadic blocks of length 2^{T−t}. The diadic blocks at level t are
B_{t,k} = { 1 + (k−1)2^{T−t}, 2 + (k−1)2^{T−t}, . . . , k·2^{T−t} } .   (1)
The reader should check that moving from level t to level t+1 splits each block into right and left halves:
B_{t,k} = B_{t+1,2k−1} ∪ B_{t+1,2k} .   (2)
The algebras F_t are generated by the block partitions
P_t = { B_{t,k} with k = 1, . . . , 2^{T−t} } .
Because F_t ⊂ F_{t+1}, the partition P_{t+1} is a refinement of P_t. The union (2) shows how. We will return to this example after discussing martingales.
1.6. Martingales: A real valued stochastic process, F_t, is a martingale¹ if
E[F_{t+1} | F_t] = F_t .
If we take the overall expectation of both sides we see that the expectation value does not depend on t, E[F_{t+1}] = E[F_t]. The martingale property says more. Whatever information you might have at time t notwithstanding, still the expectation of future values is the present value. There is a gambling interpretation: F_t is the amount of money you have at time t. No matter what has happened, your expected winnings between t and t+1, the martingale difference Y_{t+1} = F_{t+1} − F_t, have zero expected value. You can also think of martingale differences as a generalization of independent random variables. If the random variables Y_t were actually independent, then the sums F_t = Σ_{k=1}^{t} Y_k would form a martingale (using the F_t generated by Y_1, . . ., Y_t). The reader should check this.
¹ For finite Ω this is the whole story. For countable Ω we also assume that the sums defining E[X_t] converge absolutely. This means that E[|X_t|] < ∞. That implies that the conditional expectations E[X_{t+1} | F_t] are well defined.
1.7. Examples: The simplest way to get a martingale is to start with a random variable, F(ω), and define F_t = E[F | F_t]. If we apply this to a Markov chain with the minimal filtration F_t, and F is a final time reward F = V(X(T)), then F_t = f(X(t), t) as in the previous lecture. If we apply this to Ω = {1, 2, . . . , 2^T}, with uniform probability P(ω_k) = 2^{−T} for ω_k ∈ Ω, and the diadic filtration, we get the diadic martingale with F_t(j) constant on the diadic blocks (1) and equal to the average of F over the block j is in.
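A small sketch of the diadic martingale, with T and the function F invented and numpy assumed: F_t is the average of F over the level-t block containing each point, and averaging F_{t+1} over a level-t block reproduces F_t, which is the martingale property.

    import numpy as np

    rng = np.random.default_rng(4)

    T = 4
    F = rng.normal(size=2 ** T)        # an arbitrary function F(omega), omega = 1, ..., 2^T

    def level(t):
        # F_t: constant on diadic blocks of length 2^(T-t), equal to the block average of F
        block = 2 ** (T - t)
        return np.repeat(F.reshape(-1, block).mean(axis=1), block)

    print(level(0)[0], F.mean())                  # F_0 is the overall average
    print(np.allclose(level(T), F))               # F_T is F itself

    for t in range(T):
        block = 2 ** (T - t)
        coarse = level(t + 1).reshape(-1, block).mean(axis=1)
        assert np.allclose(coarse, level(t).reshape(-1, block)[:, 0])
    print("E[F_{t+1} | F_t] = F_t checked for all t")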
1.8. A lemma on conditional expectation: In working with martingales we often make use of a basic lemma about conditional expectation. Suppose U(ω) and Y(ω) are real valued random variables and that U ∈ F. Then
E[UY | F] = U E[Y | F] .   (3)
We see this using classical conditional expectation over the sets in the partition defining F. Let B be one of these sets. Let y_B = E[Y | B] be the value of E[Y | F] for ω ∈ B. We know that U(ω) is constant in B because U ∈ F. Call this value u_B. Then E[UY | B] = u_B E[Y | B] = u_B y_B. But this is the value of U E[Y | F] for ω ∈ B. Since each ω is in some B, this proves (3) for all ω.
1.9. Doob's principle: This lemma lets us make new martingales from old ones. Let F_t be a martingale and Y_t = F_t − F_{t−1} the martingale differences (called innovations by statisticians and returns in finance). We use the convention that F_{−1} = 0 so that F_0 = Y_0. The martingale condition is that E[Y_{t+1} | F_t] = 0. Clearly F_t = Σ_{t'=0}^{t} Y_{t'}.
Suppose that at time t we are allowed to place a bet of any size² on the as yet unknown martingale difference, Y_{t+1}. Let U_t ∈ F_t be the size of the bet. The return from betting on Y_t will be U_{t−1} Y_t, and the total accumulated return up to time t is
G_t = U_0 Y_1 + U_1 Y_2 + · · · + U_{t−1} Y_t .   (4)
Because of the lemma (3), the betting returns have E[U_t Y_{t+1} | F_t] = 0, so E[G_{t+1} | F_t] = G_t and G_t also is a martingale.
The fact that G_t in (4) is a martingale sometimes is called Doob's principle or Doob's theorem after the probabilist who formulated it. A special case below for stopping times is Doob's stopping time theorem or the optional stopping theorem. They all say that strategizing on a martingale never produces anything but a martingale. Nonanticipating strategies on martingales do not give positive expected returns.
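A simulation makes the point plainly; the coin-toss martingale and the particular betting rule below ("bet 2 right after a losing step, otherwise bet 1") are invented, but any rule that uses only past information gives the same conclusion: E[G_t] stays at zero, just like buy and hold. Numpy is assumed.

    import numpy as np

    rng = np.random.default_rng(5)

    T, n_paths = 50, 100_000
    Y = rng.choice([1.0, -1.0], size=(n_paths, T))   # martingale differences

    G = np.zeros(n_paths)            # accumulated betting returns, as in (4)
    U = np.ones(n_paths)             # current bet, a function of the past only
    for t in range(T):
        G += U * Y[:, t]
        U = np.where(Y[:, t] < 0, 2.0, 1.0)

    F = Y.sum(axis=1)                # buy and hold
    print("E[G_T] ~", G.mean(), "   E[F_T] ~", F.mean())   # both near zero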
1.10. Weak and strong efficient market hypotheses: It is possible that the random variables F_t form a martingale with respect to their minimal filtration, F_t, but not with respect to an enriched filtration G_t ⊃ F_t. The simplest example would be the algebras G_t = F_{t+1}, which already know the value of F_{t+1} at time t. Note that the F_t also are a stochastic process with respect to the G_t. The
² We may have to require that the bet have finite expected value.
weak efficient market hypothesis is that e^{−μt} S_t is a martingale (S_t being the stock price and μ its expected growth rate) with respect to its minimal filtration. Technical analysis means using trading strategies that are nonanticipating with respect to the minimal filtration. Therefore, the weak efficient market hypothesis says that technical trading does not produce better returns than buy and hold. Any extra information you might get by examining the price history of S up to time t is already known by enough people that it is already reflected in the price S_t.
The strong efficient market hypothesis states that e^{−μt} S_t is a martingale with respect to the filtration, G_t, representing all the public information in the world. This includes the previous price history of S and much more (prices of related stocks, corporate reports, market trends, etc.).
1.11. Investing with Doob: Economists sometimes use Doob's principle and the efficient market hypotheses to make a point about active trading in the stock market. Suppose that F_t, the price of a stock at time t, is a martingale³. Suppose that at time t we use all the information in F_t and choose an amount, U_t, to invest at time t. The fact that the resulting accumulated return, G_t, has zero expected value is said to show that active investing is no better than a buy and hold strategy that just produces the value F_t. The well known book A Random Walk Down Wall Street is mostly an exposition of this point of view. This argument breaks down when applied to non-martingale processes, such as stock prices over longer times. Active trading strategies such as (4) may reduce the risk more than enough to compensate risk averse investors for small amounts of lost expected value. Merton's optimal dynamic investment analysis is a simple example of an active trading strategy that is better for some people than passive buy and hold.
1.12. Stopping times: We have Ω and the expanding family F_t. A stopping time is a function τ(ω) that takes one of the values 1, . . ., T, so that the event {τ ≤ t} is in F_t. Stopping times might be thought of as possible strategies. Whatever your criterion for stopping is, you have enough information at time t to know whether you should stop at time t. Many stopping times are expressed as the first time something happens, such as the first time X_t > a. We cannot ask to stop, for example, at the last t with X_t > a because we might not know at time t whether X_{t'} > a for some t' > t.
1.13. Doob's stopping time theorem for one stopping time: Because stopping times are nonanticipating strategies, they also cannot make money from a martingale. One version of this statement is that E[X_τ] = E[X_1]. The proof of this makes use of the events B_t, that τ = t. The stopping time hypothesis is that B_t ∈ F_t. Since τ has some value 1 ≤ τ ≤ T, the B_t form a partition of Ω. Also, if ω ∈ B_t, τ(ω) = t, so X_τ = X_t. Therefore,
E[X_1] = E[X_T]
       = Σ_{t=1}^{T} E[X_T | B_t] P(B_t)
       = Σ_{t=1}^{T} E[X_τ | B_t] P(τ = t)
       = E[X_τ] .
³ This is a reasonable approximation for much short term trading.
In this derivation we made use of the classical statement of the martingale property: if B ∈ F_t then E[X_T | B] = E[X_t | B]. In our case B = B_t, on which X_t = X_τ.
This simple idea, using the martingale property applied to the partition B_t, is crucial for much of the theory of martingales. The idea itself was first used by Kolmogorov in the context of random walk or Brownian motion. Doob realized that Kolmogorov's argument was even simpler and more beautiful when applied to martingales.
1.14. Stopping time paradox: The technical hypotheses above, finite state space and bounded stopping times, may be too strong, but they cannot be completely ignored, as this famous example shows. Let X_t be a symmetric random walk starting at zero. This forms a martingale, so E[X_τ] = 0 for any stopping time, τ. On the other hand, suppose we take τ = min(t | X_t = 1). Then X_τ = 1 always, so E[X_τ] = 1. The catch is that there is no T with τ(ω) ≤ T for all ω. Even though τ < ∞ almost surely (more to come on that expression), E[τ] = ∞ (explanation later). Even that would be OK if the possible values of X_t were bounded. Suppose you choose T and set τ' = min(τ, T). That is, you wait until X_t = 1 or t = T, whichever comes first, to stop. For large T, it is very likely that you stopped for X_t = 1. Still, those paths that never reached 1 probably drifted just far enough in the negative direction so that their contribution to the overall expected value cancels the 1 to yield E[X_{τ'}] = 0.
1.15. More stopping times theorem: Suppose we have an increasing family of stopping times, 1 ≤ τ_1 ≤ τ_2 ≤ · · ·. In a natural way the random variables Y_1 = X_{τ_1}, Y_2 = X_{τ_2}, etc. also form a martingale. This is a final elaborate way of saying that strategizing on a martingale is a no win game.
Stochastic Calculus Notes, Lecture 4
Last modified October 4, 2004
1 Continuous probability
1.1. Introduction: Recall that a set is discrete if it is finite or countable. We will call a set continuous if it is not discrete. Many of the probability spaces used in stochastic calculus are continuous in this sense (examples below). Kolmogorov¹ suggested a general framework for continuous probability based on abstract integration with respect to abstract probability measures. The theory makes it possible to discuss general constructions such as conditional expectation in a way that applies to a remarkably diverse set of examples.
The difference between continuous and discrete probability is the difference between integration and summation. Continuous probability cannot be based on the formula
P(A) = Σ_{ω∈A} P(ω) .   (1)
Indeed, the typical situation in continuous probability is that any event consisting of a single outcome has probability zero: P({ω}) = 0 for all ω ∈ Ω.
As we explain below, the classical formalism of probability densities also does not apply in many of the situations we are interested in. Abstract probability measures give a framework for working with probability in path space, as well as more traditional discrete probability and probabilities given by densities on R^n.
These notes outline Kolmogorov's formalism of probability measures for continuous probability. We leave out a great number of details and mathematical proofs. Attention to all these details would be impossible within our time constraints. In some cases we indicate where a precise definition or a complete proof is missing, but sometimes we just leave it out. If it seems like something is missing, it probably is.
1.2. Examples of continuous probability spaces: By definition, a probability space is a set, Ω, of possible outcomes, together with a σ-algebra, F, of measurable events. This section discusses only the sets Ω. The corresponding σ-algebras are discussed below.
R, the real numbers. If x_0 is a real number and u(x) is a probability density, then the probability of the event B_r(x_0) = {x_0 − r ≤ X ≤ x_0 + r} is
P([x_0 − r, x_0 + r]) = ∫_{x_0−r}^{x_0+r} u(x) dx → 0 as r → 0.
¹ The Russian mathematician Kolmogorov was active in the middle of the 20th century. Among his many lasting contributions to mathematics are the modern axioms of probability and some of its most important theorems. His theories of turbulent fluid flow anticipated modern fractals by several decades.
Thus the probability of any individual outcome is zero. An event with positive probability (P(A) > 0) is made up entirely of outcomes x_0 ∈ A with P(x_0) = 0. Because of countable additivity (see below), this is only possible when Ω is uncountable.
R^n, sequences of n numbers (possibly viewed as a row or column vector depending on the context): X = (X_1, . . . , X_n). Here too, if there is a probability density then the probability of any given outcome is zero.
S^N. Let S be the discrete state space of a Markov chain. The space S^T is the set of sequences of length T of elements of S. An element of S^T may be written x = (x(0), x(1), . . . , x(T−1)), with each of the x(t) in S. It is common to write x_t for x(t). An element of S^N is an infinite sequence of elements of S. The exponent N stands for the natural numbers. We misuse this notation because ours start with t = 0 while the actual natural numbers start with t = 1. We use S^N when we ask questions about an entire infinite trajectory. For example, the hitting probability is P(X(t) ≠ 1 for all t ≥ 0). Cantor proved that S^N is not countable whenever the state space has more than one element. Generally, the probability of any particular infinite sequence is zero. For example, suppose the transition matrix has P_{11} = .6 and u_0(1) = 1. Let x be the infinite sequence that never leaves state 1: x = (1, 1, 1, . . .). Then P(x) = u_0(1) · .6 · .6 · · ·. Multiplying together an infinite number of .6 factors should give the answer P(x) = 0. More generally, if the transition matrix has P_{jk} ≤ r < 1 for all (j, k), then P(x) = 0 for any single infinite path.
C([0, T] → R), the path space for Brownian motion. The C stands for continuous. The [0, T] is the time interval 0 ≤ t ≤ T; the square brackets tell us to include the endpoints (0 and T in this case). Round parentheses (0, T) would mean to leave out 0 and T. The final R is the target space, the real numbers in this case. An element of Ω is a continuous function from the interval [0, T] to R. This function could be called X(t) or X_t (for 0 ≤ t ≤ T). In this space we can ask questions such as P( ∫_0^T X(t) dt > 4 ).
1.3. Probability measures: Let F be a σ-algebra of subsets of Ω. A probability measure is a way to assign a probability to each event A ∈ F. In discrete probability, this is done using (1). In R^n a probability density leads to a probability measure by integration
P(A) = ∫_A u(x) dx .   (2)
There are still other ways to specify probabilities of events in path space. All of these probability measures satisfy the same basic axioms.
Suppose that for each A ∈ F we have a number P(A). The numbers P(A) are a probability measure if
i. If A ∈ F and B ∈ F are disjoint events, then P(A ∪ B) = P(A) + P(B).
ii. P(A) ≥ 0 for any event A ∈ F.
iii. P(Ω) = 1.
iv. If A_n ∈ F is a sequence of events each disjoint from all the others and ∪_{n=1}^{∞} A_n = A, then Σ_{n=1}^{∞} P(A_n) = P(A).
The last property is called countable additivity. It is possible to consider probability measures that are not countably additive, but this is not very useful.
1.4. Example 1, discrete probability: If Ω is discrete, we may take F to be the set of all events (i.e. all subsets of Ω). If we know the probabilities of each individual outcome, then the formula (1) defines a probability measure. The axioms (i), (ii), and (iii) are clear. The last, countable additivity, can be verified given a solid undergraduate analysis course.
1.5. Borel sets: It is rare that one can define P(A) for all A ⊆ Ω. Usually, there are non measurable events whose probability one does not try to define (see below). This is not related to partial information, but is an intrinsic aspect of continuous probability. Events that are not measurable are quite artificial, but they are impossible to get rid of. In most applications in stochastic calculus, it is convenient to take the largest σ-algebra to be the Borel sets².
In a previous lecture we discussed how to generate a σ-algebra from a collection of sets. The Borel σ-algebra is the σ-algebra that is generated by all balls. The open ball with center x_0 and radius r > 0 in n dimensional space is B_r(x_0) = {x : |x − x_0| < r}. A ball in one dimension is an interval. In two dimensions it is a disk. Note that the ball is solid, as opposed to the hollow sphere, S_r(x_0) = {x : |x − x_0| = r}. The condition |x − x_0| ≤ r instead of |x − x_0| < r defines a closed ball. The σ-algebra generated by open balls is the same as the σ-algebra generated by closed balls (check this if you wish).
1.6. Borel sets in path space: The definition of Borel sets works the same way in the path space of Brownian motion, C([0, T], R). Let x_0(t) and x(t) be two continuous functions of t. The distance between them in the sup norm is
‖x − x_0‖ = sup_{0≤t≤T} |x(t) − x_0(t)| .
We often use double bars to represent the distance between functions and single bar absolute value signs to represent the distance between numbers or vectors in R^n. As before, the open ball of radius r about a path x_0 is the set of all paths with ‖x − x_0‖ < r.
1.7. The σ-algebra for Markov chain path space: There is a convenient limit process that defines a useful σ-algebra on S^N, the infinite time horizon path space for a Markov chain. We have the σ-algebras F_t generated by the first t+1 states x(0), x(1), . . ., x(t). We take F to be the σ-algebra generated by all these. Note that the event A = {X(t) ≠ 1 for all t ≥ 0} is not in any of the F_t. However, the event A_t = {X(s) ≠ 1 for 0 ≤ s ≤ t} is in F_t. Therefore A = ∩_{t≥0} A_t must be in any σ-algebra that contains all the F_t. Also note that the union of all the F_t is an algebra of sets, though it is not a σ-algebra.
² The larger σ-algebra of Lebesgue sets seems to be more of a nuisance than a help, particularly in discussing convergence of probability measures in path space.
1.8. Generating a probability measure: Let M be a collection of events that generates the σ-algebra F. Let A be the algebra of sets that are finite intersections, unions, and complements of events in M. Clearly the σ-algebra generated by M is the same as the one generated by A. The process of going from the algebra A to the σ-algebra F is one of completion, adding all limits of countable intersections or unions of events in A.
In order to specify P(A) for all A ∈ F, it suffices to give P(A) for all events A ∈ A. That is, if there is a countably additive probability measure P(A) for all A ∈ F, then it is completely determined by the numbers P(A) for those A ∈ A. Hopefully it is plausible that if the events in A generate those in F, then the probabilities of events in M determine the probabilities of events in F (proof omitted).
For example, in R^n if we specify P(A) for every event described by finitely many balls, then we have determined P(A) for any Borel set. It might be that the numbers P(A) for A ∈ A are inconsistent with the axioms of probability (which is easy to check) or cannot be extended in a way that is countably additive to all of F (this does not happen in our examples), but otherwise the measure is determined.
1.9. Non measurable sets (technical aside): A construction demonstrates that non measurable sets are unavoidable. Let Ω be the unit circle. The simplest probability measure on Ω would seem to be uniform measure (divided by 2π so that P(Ω) = 1). This measure is rotation invariant: if A is a measurable event having probability P(A) then the event A + θ = {x + θ | x ∈ A} is measurable and has P(A + θ) = P(A). It is possible to construct a set B and a (countable) sequence of rotations, θ_n, so that the events B + θ_k and B + θ_n are disjoint if k ≠ n and ∪_n (B + θ_n) = Ω. This set cannot be measurable. If it were and ε = P(B), then there would be two choices: ε = 0 or ε > 0. In the former case we would have P(Ω) = Σ_n P(B + θ_n) = Σ_n 0 = 0, which is not what we want. In the latter case, again using countable additivity, we would get P(Ω) = ∞.
The construction of the set B starts with a description of the θ_n. Write n in base ten, flip it over the decimal point to get a number between 0 and 1, then multiply by 2π. For example, for n = 130 we get θ_130 = 2π · .031. Now use the θ_n to create an equivalence relation and partition of Ω by setting x ∼ y if x = y + θ_n (mod 2π) for some n. The reader should check that this is an equivalence relation (x ∼ y ⟹ y ∼ x, and x ∼ y and y ∼ z ⟹ x ∼ z). Now, let B be a set that has exactly one representative from each of the equivalence classes in the partition. Any x ∈ Ω is in one of the equivalence classes, which means that there is a y ∈ B (the representative of the x equivalence class) and an n so that y + θ_n = x. That means that any x has x ∈ B + θ_n for some n, which is to say that ∪_n (B + θ_n) = Ω. To see that B + θ_k is disjoint from B + θ_n when k ≠ n, suppose that x ∈ B + θ_k and x ∈ B + θ_n. Then x = y + θ_k and x = z + θ_n for y ∈ B and z ∈ B. But (and this is the punch line) this would mean y ∼ z, which is impossible because B has only one representative from each equivalence class. The possibility of selecting a single element from each partition element without having to say how it is to be done is the axiom of choice.
1.10. Probability densities in R^n: Suppose u(x) is a probability density in R^n. If A is an event made from finitely many balls (or rectangles) by set operations, we can define P(A) by integrating, as in (2). This leads to a probability measure on Borel sets corresponding to the density u. Deriving the probability measure from a probability density does not seem to work in path space because there is nothing like the Riemann integral to use in (2)³. Therefore, we describe path space probability measures directly rather than through probability densities.
1.11. Measurable functions: Let Ω be a probability space with a σ-algebra F. Let f(ω) be a function defined on Ω. In discrete probability, f was measurable with respect to F if the sets B_a = {ω | f(ω) = a} all were measurable. In continuous probability, this definition is replaced by the condition that the sets A_{ab} = {ω | a ≤ f(ω) ≤ b} are measurable. Because F is closed under countable set operations, and because the event a < f is the (countable) union of the events a + 1/n ≤ f, this is the same as requiring all the sets {ω | a < f(ω) < b} to be measurable. If Ω is discrete (finite or countable), then the two definitions of measurable function agree.
In continuous probability, the notion of measurability of a function with respect to a σ-algebra plays two roles. The first, which is purely technical, is that f is sufficiently regular (meaning not crazy) that abstract integrals (defined below) make sense for it. The second, particularly for smaller σ-algebras G ⊂ F, again involves incomplete information. A function that is measurable with respect to G not only needs to be regular, but also must depend on fewer variables (possibly in some abstract sense).
1.12. Integration with respect to a measure: The definition of integration with respect to a general probability measure is easier than the definition of the Riemann integral. The integral is written
E[f] = ∫_Ω f(ω) dP(ω) .
We will see that in R^n with a density u, this agrees with the classical definition
E[f] = ∫_{R^n} f(x) u(x) dx ,
³ The Feynman integral in path space has some properties of true integrals but lacks others. The probabilist Mark Kac (pronounced "cats") discovered that Feynman's ideas applied to the heat equation rather than the Schrodinger equation can be interpreted as integration with respect to Wiener measure. This is now called the Feynman-Kac formula.
if we write dP(x) = u(x) dx. Note that the abstract variable ω is replaced by the concrete variable, x, in this more concrete situation. The general definition is forced on us once we make the natural requirements
i. If A ∈ F is any event, then E[1_A] = P(A). The integral of the indicator function of an event is the probability of that event.
ii. If f_1 and f_2 have f_1(ω) ≤ f_2(ω) for all ω, then E[f_1] ≤ E[f_2]. Integration is monotone.
iii. For any reasonable functions f_1 and f_2 (e.g. bounded), we have E[a f_1 + b f_2] = a E[f_1] + b E[f_2]. (Linearity of integration.)
iv. If f_n(ω) is an increasing family of positive functions converging pointwise to f (f_n(ω) ≥ 0 and f_{n+1}(ω) ≥ f_n(ω) for all n, and f_n(ω) → f(ω) as n → ∞ for all ω), then E[f_n] → E[f] as n → ∞. (This form of countable additivity for abstract probability integrals is called the monotone convergence theorem.)
A function is a simple function if there are finitely many events A_k, and weights w_k, so that f = Σ_k w_k 1_{A_k}. Properties (i) and (iii) imply that the expectation of a simple function is
E[f] = Σ_k w_k P(A_k) .
We can approximate general functions by simple functions to determine their expectations.
Suppose f is a nonnegative bounded function: 0 ≤ f(ω) ≤ M for all ω. Choose a small number ε = 2^{−n} and define the ring sets⁴ A_k = {(k−1)ε ≤ f < kε}. The A_k depend on ε but we do not indicate that. Although the events A_k might be complicated, fractal, or whatever, each of them is measurable. A simple function that approximates f is f_n(ω) = Σ_k (k−1)ε 1_{A_k}. This f_n takes the value (k−1)ε on the sets A_k. The sum defining f_n is finite because f is bounded, though the number of terms is M/ε. Also, f_n(ω) ≤ f(ω) for each ω (though by at most ε). Property (ii) implies that
E[f] ≥ E[f_n] = Σ_k (k−1)ε P(A_k) .
In the same way, we can consider the upper function g_n = Σ_k kε 1_{A_k} and have
E[f] ≤ E[g_n] = Σ_k kε P(A_k) .
The reader can check that f_n ≤ f_{n+1} ≤ f ≤ g_{n+1} ≤ g_n and that g_n − f_n ≤ ε. Therefore, the numbers E[f_n] form an increasing sequence while the E[g_n] are a decreasing sequence converging to the same number, which is the only possible value of E[f] consistent with (i), (ii), and (iii).
⁴ Take f = f(x, y) = x² + y² in the plane to see why we call them ring sets.
It is sometimes said that the difference between classical (Riemann) integration and abstract integration (here) is that the Riemann integral cuts the x axis into little pieces, while the abstract integral cuts the y axis (which is what the simple function approximations amount to).
If the function f is positive but not bounded, it might happen that E[f] = ∞. The cut off functions, f_M(ω) = min(f(ω), M), might have E[f_M] → ∞ as M → ∞. If so, we say E[f] = ∞. Otherwise, property (iv) implies that E[f] = lim_{M→∞} E[f_M]. If f is both positive and negative (for different ω), we integrate the positive part, f_+(ω) = max(f(ω), 0), and the negative part, f_−(ω) = min(f(ω), 0), separately and subtract the results. We do not attempt a definition if E[f_+] = ∞ and E[f_−] = −∞. We omit the long process of showing that these definitions lead to an integral that actually has the properties (i) - (iv).
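The construction above can be imitated numerically. In the sketch below (numpy assumed) the probability space is the interval [0, 1] with uniform measure, represented by Monte Carlo samples, and f(ω) = ω², so the exact answer is E[f] = 1/3; the lower and upper simple functions built from the ring sets squeeze this value as ε = 2^{-n} shrinks.

    import numpy as np

    rng = np.random.default_rng(7)

    omega = rng.random(1_000_000)      # stand-in for Omega = [0,1] with uniform measure
    f = omega ** 2                     # bounded: 0 <= f <= 1

    for n in (2, 4, 8):
        eps = 2.0 ** (-n)
        k = np.floor(f / eps).astype(int) + 1      # f lies in A_k = {(k-1)eps <= f < k eps}
        lower = ((k - 1) * eps).mean()             # E[f_n], below E[f]
        upper = (k * eps).mean()                   # E[g_n], above E[f]
        print(n, lower, upper)
    print("exact E[f] =", 1 / 3)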
1.13. Markov chain probability measures on S^N: Let A = ∪_t F_t as before. The probability of any A ∈ A is given by the probability of that event in F_t if A ∈ F_t. Therefore P(A) is given by a formula like (1) for any A ∈ A. A theorem of Kolmogorov states that the completion of this measure to all of F makes sense and is countably additive.
1.14. Conditional expectation: We have a random variable X(ω) that is measurable with respect to the σ-algebra, F. We have a σ-algebra that is a sub σ-algebra: G ⊂ F. We want to define the conditional expectation Y = E[X | G]. In discrete probability this is done using the partition defined by G. The partition is less useful here because it probably is uncountable, and because each partition element, B(ω) = ∩ A (the intersection being over all A ∈ G with ω ∈ A), may have P(B(ω)) = 0 (examples below). This means that we cannot apply Bayes' rule directly.
The definition is that Y(ω) is the random variable measurable with respect to G that best approximates X in the least squares sense
E[(Y − X)²] = min_{Z ∈ G} E[(Z − X)²] .
This is one of the definitions we gave before, the one that works for continuous and discrete probability. In the theory, it is possible to show that there is a minimizer and that it is unique.
1.15. Generating a σ-algebra: When the probability space, Ω, is finite, we can understand an algebra of sets by using the partition of Ω that generates the algebra. This is not possible for continuous probability spaces. Another way to specify an algebra for finite Ω was to give a function X(ω), or a collection of functions X_k(ω), that are supposed to be measurable with respect to F. We noted that any function measurable with respect to the algebra generated by the functions X_k is actually a function of the X_k. That is, if F ∈ F (abuse of notation), then there is some function u(x_1, . . . , x_n) so that
F(ω) = u(X_1(ω), . . . , X_n(ω)) .   (3)
The intuition was that F contains the information you get by knowing the values of the functions X_k. Any function measurable with respect to this algebra is determined by knowing the values of these functions, which is precisely what (3) says. This approach using functions is often convenient in continuous probability.
If Ω is a continuous probability space, we may again specify functions X_k that we want to be measurable. Again, these functions generate an algebra, a σ-algebra, F. If F is measurable with respect to this σ-algebra then there is a (Borel measurable) function u(x_1, . . .) so that F(ω) = u(X_1, . . .), as before. In fact, it is possible to define F in this way. Saying that A ∈ F is the same as saying that 1_A is measurable with respect to F. If u(x_1, . . .) is a Borel measurable function that takes values only 0 or 1, then the function F defined by (3) also takes only the values 0 or 1. The event A = {ω | F(ω) = 1} has (obviously) F = 1_A. The σ-algebra generated by the X_k is the set of events that may be defined in this way. A complete proof of this would take a few pages.
1.16. Example in two dimensions: Suppose Ω is the unit square in two dimensions: (x, y) ∈ Ω if 0 ≤ x ≤ 1 and 0 ≤ y ≤ 1. The x coordinate function is X(x, y) = x. The information in this is the value of the x coordinate, but not the y coordinate. An event measurable with respect to this F will be any event determined by the x coordinate alone. I call such sets bar code sets. You can see why by drawing some.
1.17. Marginal density and total probability: The abstract situation is that we have a probability space, Ω, with generic outcome ω ∈ Ω. We have some functions (X_1(ω), . . . , X_n(ω)) = X(ω). With Ω in the background, we can ask for the joint PDF of (X_1, . . . , X_n), written u(x_1, . . . , x_n). A formal definition of u would be that if A ⊆ R^n, then
P(X(ω) ∈ A) = ∫_{x∈A} u(x) dx .   (4)
Suppose we neglect the last variable, X_n, and consider the reduced vector X̃(ω) = (X_1, . . . , X_{n−1}) with probability density ũ(x_1, . . . , x_{n−1}). This ũ is the marginal density and is given by integrating u over the forgotten variable:
ũ(x_1, . . . , x_{n−1}) = ∫_{−∞}^{∞} u(x_1, . . . , x_n) dx_n .   (5)
This is a continuous probability analogue of the law of total probability: integrate (or sum) over a complete set of possibilities, all values of x_n in this case.
We can prove (5) from (4) by considering a set B ⊆ R^{n−1} and the corresponding set A ⊆ R^n given by A = B × R (i.e. A is the set of all pairs (x̃, x_n) with x̃ = (x_1, . . . , x_{n−1}) ∈ B). The definition of A from B is designed so that P(X ∈ A) = P(X̃ ∈ B). With this notation,
P(X̃ ∈ B) = P(X ∈ A)
          = ∫_A u(x) dx
          = ∫_{x̃∈B} ∫_{x_n=−∞}^{∞} u(x̃, x_n) dx_n dx̃
P(X̃ ∈ B) = ∫_B ũ(x̃) dx̃ .
This is exactly what it means for ũ to be the PDF for X̃.
1.18. Classical conditional expectation: Again in the abstract setting ω ∈ Ω, suppose we have random variables (X_1(ω), . . . , X_n(ω)). Now consider a function f(x_1, . . . , x_n), its expected value E[f(X)], and the conditional expectations
v(x_n) = E[f(X) | X_n = x_n] .
The Bayes' rule definition of v(x_n) has some trouble because both the denominator, P(X_n = x_n), and the numerator,
E[f(X) · 1_{X_n = x_n}] ,
are zero.
The classical solution to this problem is to replace the exact condition X_n = x_n with an approximate condition having positive (though small) probability: x_n ≤ X_n ≤ x_n + ε. We use the approximation
∫_{x_n}^{x_n+ε} g(x̃, ξ_n) dξ_n ≈ ε g(x̃, x_n) .
The error is roughly proportional to ε² and much smaller than either of the terms above. With this approximation the numerator in Bayes' rule is
E[f(X) · 1_{x_n ≤ X_n ≤ x_n+ε}] = ∫_{x̃∈R^{n−1}} ∫_{ξ_n=x_n}^{x_n+ε} f(x̃, ξ_n) u(x̃, ξ_n) dξ_n dx̃
                               ≈ ε ∫_{x̃} f(x̃, x_n) u(x̃, x_n) dx̃ .
Similarly, the denominator is
P(x_n ≤ X_n ≤ x_n + ε) ≈ ε ∫_{x̃} u(x̃, x_n) dx̃ .
If we take the Bayes' rule quotient and let ε → 0, we get the classical formula
E[f(X) | X_n = x_n] = ∫_{x̃} f(x̃, x_n) u(x̃, x_n) dx̃ / ∫_{x̃} u(x̃, x_n) dx̃ .   (6)
By taking f to be the characteristic function of an event (all possible events) we get a formula for the probability density of X̃ given that X_n = x_n, namely
u(x̃ | X_n = x_n) = u(x̃, x_n) / ∫_{x̃} u(x̃, x_n) dx̃ .   (7)
This is the classical formula for conditional probability density. The integral in the denominator insures that, for each x_n, u(x̃ | X_n = x_n) is a probability density as a function of x̃, that is,
∫ u(x̃ | X_n = x_n) dx̃ = 1 ,
for any value of x_n. It is very useful to notice that, as functions of x̃, u(x̃, x_n) and u(x̃ | X_n = x_n) are almost the same. They differ only by a constant normalization. For example, this is why conditioning Gaussians gives Gaussians.
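Formula (6) is easy to test on an invented example. Below (numpy assumed) the joint density is u(x_1, x_2) = x_1 + x_2 on the unit square and f(x) = x_1; the quotient of one-dimensional integrals in (6) is compared with a Monte Carlo average over a thin slab |X_2 − x_2| < ε, which is exactly the classical approximate conditioning.

    import numpy as np

    rng = np.random.default_rng(8)

    u = lambda x1, x2: x1 + x2          # invented density on the unit square
    x2_star = 0.25

    # Formula (6) with f(x) = x1: quotient of integrals over x1 (midpoint rule).
    dx = 1e-3
    x1 = np.arange(dx / 2, 1, dx)
    num = np.sum(x1 * u(x1, x2_star)) * dx
    den = np.sum(u(x1, x2_star)) * dx
    print("formula (6):", num / den)    # equals (1/3 + x2/2) / (1/2 + x2) for this density

    # Monte Carlo with the approximate condition |X_2 - x2*| < eps.
    cand = rng.random((2_000_000, 2))
    keep = rng.random(len(cand)) * 2.0 < u(cand[:, 0], cand[:, 1])   # rejection sampling, u <= 2
    samp = cand[keep]
    slab = np.abs(samp[:, 1] - x2_star) < 0.01
    print("Monte Carlo :", samp[slab, 0].mean())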
1.19. Modern conditional expectation: The classical conditional expectation (6) and conditional probability (7) formulas are the same as what comes from the modern definition of paragraph 1.6. Suppose X = (X_1, . . . , X_n) has density u(x), F is the σ-algebra of Borel sets, and G is the σ-algebra generated by X_n (which might be written X_n(X), thinking of X as ω in the abstract notation). For any f(x), we have f̃(x_n) = E[f | G]. Since G is generated by X_n, the function f̃ being measurable with respect to G is the same as its being a function of x_n. The modern definition of f̃(x_n) is that it minimizes
∫_{R^n} ( f(x) − f̃(x_n) )² u(x) dx ,   (8)
over all functions that depend only on x_n (measurable in G).
To see the formula (6) emerge, again write x = (x̃, x_n), so that f(x) = f(x̃, x_n) and u(x) = u(x̃, x_n). The integral (8) is then
∫_{x_n=−∞}^{∞} ∫_{x̃∈R^{n−1}} ( f(x̃, x_n) − f̃(x_n) )² u(x̃, x_n) dx̃ dx_n .
In the inner integral,
R(x_n) = ∫_{x̃∈R^{n−1}} ( f(x̃, x_n) − f̃(x_n) )² u(x̃, x_n) dx̃ ,
f̃(x_n) is just a constant. We find the value of f̃(x_n) that minimizes R(x_n) by minimizing the quantity
∫_{x̃∈R^{n−1}} ( f(x̃, x_n) − g )² u(x̃, x_n) dx̃ = ∫ f(x̃, x_n)² u(x̃, x_n) dx̃ − 2g ∫ f(x̃, x_n) u(x̃, x_n) dx̃ + g² ∫ u(x̃, x_n) dx̃ .
The optimal g is given by the classical formula (6).
1.20. Modern conditional probability: We already saw that the modern approach to conditional probability for G ⊂ F is through conditional expectation. In its most general form, for every (or almost every) ω ∈ Ω, there should be a probability measure P_ω on Ω so that the mapping ω → P_ω is measurable with respect to G. The measurability condition probably means that for every event A ∈ F the function p_A(ω) = P_ω(A) is a G measurable function of ω. In terms of these measures, the conditional expectation f̃ = E[f | G] would be f̃(ω) = E_ω[f]. Here E_ω means the expected value using the probability measure P_ω. There are many such subscripted expectations coming.
A subtle point here is that the conditional probability measures are defined on the original probability space, Ω. This forces the measures to live on tiny (generally measure zero) subsets of Ω. For example, if Ω = R^n and G is generated by x_n, then the conditional expectation value f̃(x_n) is an average of f (using density u) only over the hyperplane X_n = x_n. Thus, the conditional probability measures P_X depend only on x_n, leading us to write P_{x_n}. Since f̃(x_n) = ∫ f(x) dP_{x_n}(x), and f̃(x_n) depends only on values of f(x̃, x_n) with the last coordinate fixed, the measure dP_{x_n} is some kind of measure on that hyperplane. This point of view is useful in many advanced problems, but we will not need it in this course (I sincerely hope).
1.21. Semimodern conditional probability: Here is an intermediate "semi-modern" version of conditional probability density. We have Ω = R^n, and Ω̃ = R^{n−1} with elements x̃ = (x_1, . . . , x_{n−1}). For each x_n, there will be a (conditional) probability density function u_{x_n}. Saying that u_{x_n} depends only on x_n is the same as saying that the function x → u_{x_n} is measurable with respect to G. The conditional expectation formula (6) may be written
E[f | G](x_n) = ∫_{R^{n−1}} f(x̃, x_n) u_{x_n}(x̃) dx̃ .
In other words, the classical u(x̃ | X_n = x_n) of (7) is the same as the semimodern u_{x_n}(x̃).
2 Gaussian Random Variables
The central limit theorem (CLT) makes Gaussian random variables important. A generalization of the CLT is Donsker's invariance principle, which gives Brownian motion as a limit of random walk. In many ways Brownian motion is a multivariate Gaussian random variable. We review multivariate normal random variables and the corresponding linear algebra as a prelude to Brownian motion.
2.1. Gaussian random variables, scalar: The one dimensional standard normal, or Gaussian, random variable is a scalar with probability density
u(x) = (1/√(2π)) e^{−x²/2} .
The normalization factor 1/√(2π) makes ∫_{−∞}^{∞} u(x) dx = 1 (a famous fact). The mean value is E[X] = 0 (the integrand x e^{−x²/2} is antisymmetric about x = 0). The variance is (using integration by parts)
E[X²] = (1/√(2π)) ∫_{−∞}^{∞} x² e^{−x²/2} dx
      = (1/√(2π)) ∫_{−∞}^{∞} x · ( x e^{−x²/2} ) dx
      = (1/√(2π)) ∫_{−∞}^{∞} x · ( −(d/dx) e^{−x²/2} ) dx
      = (1/√(2π)) [ −x e^{−x²/2} ]_{−∞}^{∞} + (1/√(2π)) ∫_{−∞}^{∞} e^{−x²/2} dx
      = 0 + 1 .
Similar calculations give E[X⁴] = 3, E[X⁶] = 15, and so on. I will often write Z for a standard normal random variable. A one dimensional Gaussian random variable with mean E[X] = μ and variance var(X) = E[(X − μ)²] = σ² has density
u(x) = (1/√(2πσ²)) e^{−(x−μ)²/(2σ²)} .
It is often more convenient to think of Z as the random variable (like ω) and write X = μ + σZ. We write X ∼ N(μ, σ²) to express the fact that X is normal (Gaussian) with mean μ and variance σ². The standard normal random variable is Z ∼ N(0, 1).
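The moments quoted above are quick to confirm by simulation (numpy assumed; the sample size, mean, and standard deviation below are arbitrary choices).

    import numpy as np

    rng = np.random.default_rng(9)

    Z = rng.standard_normal(2_000_000)
    print((Z**2).mean(), (Z**4).mean(), (Z**6).mean())   # ~ 1, 3, 15

    mu, sigma = 1.5, 0.7
    X = mu + sigma * Z                                   # X ~ N(mu, sigma^2)
    print(X.mean(), X.var())                             # ~ mu, sigma^2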
2.2. Multivariate normal random variables: The n×n matrix, H, is positive definite if x^T H x > 0 for any n component column vector x ≠ 0. It is symmetric if H^T = H. A symmetric matrix is positive definite if and only if all its eigenvalues are positive. Since the inverse of a symmetric matrix is symmetric, the inverse of a symmetric positive definite (SPD) matrix is also SPD. An n component random variable is a mean zero multivariate normal if it has a probability density of the form
u(x) = (1/z) e^{−(1/2) x^T H x} ,
for some SPD matrix, H. We can get mean μ = (μ_1, . . . , μ_n)^T either by taking X + μ where X has mean zero, or by using the density with x^T H x replaced by (x − μ)^T H (x − μ).
If X ∈ R^n is multivariate normal and if A is an m×n matrix with rank m, then Y ∈ R^m given by Y = AX is also multivariate normal. Both the cases m = n (same number of X and Y variables) and m < n occur.
2.3. Diagonalizing H: Suppose the eigenvalues and eigenvectors of H are H v_j = λ_j v_j. We can express x ∈ R^n as a linear combination of the v_j either in vector form, x = Σ_{j=1}^{n} y_j v_j, or in matrix form, x = V y, where V is the n×n matrix whose columns are the v_j and y = (y_1, . . . , y_n)^T. Since the eigenvectors of a symmetric matrix are orthogonal to each other, we may normalize them so that v_j^T v_k = δ_{jk}, which is the same as saying that V is an orthogonal matrix, V^T V = I. In the y variables, the quadratic form x^T H x is diagonal, as we can see using the vector or the matrix notation. With vectors, the trick is to use the two expressions x = Σ_{j=1}^{n} y_j v_j and x = Σ_{k=1}^{n} y_k v_k, which are the same since j and k are just summation variables. Then we can write
x^T H x = ( Σ_{j=1}^{n} y_j v_j )^T H ( Σ_{k=1}^{n} y_k v_k )
        = Σ_{jk} ( v_j^T H v_k ) y_j y_k
        = Σ_{jk} λ_k v_j^T v_k y_j y_k
x^T H x = Σ_k λ_k y_k² .   (9)
The matrix version of the eigenvector/eigenvalue relations is V^T H V = Λ (Λ being the diagonal matrix of eigenvalues). With this we have x^T H x = (V y)^T H V y = y^T (V^T H V) y = y^T Λ y. A diagonal matrix in the quadratic form is equivalent to having a sum involving only squares λ_k y_k². All the λ_k will be positive if H is positive definite. For future reference, also remember that det(H) = Π_{k=1}^{n} λ_k.
2.4. Calculations using the multivariate normal density: We use the y variables as new integration variables. The point is that if the quadratic form is diagonal, the multiple integral becomes a product of one dimensional Gaussian integrals that we can do. For example,
∫_{R²} e^{−(1/2)(λ_1 y_1² + λ_2 y_2²)} dy_1 dy_2 = ∫_{y_1=−∞}^{∞} ∫_{y_2=−∞}^{∞} e^{−(1/2)(λ_1 y_1² + λ_2 y_2²)} dy_1 dy_2
  = ( ∫_{y_1=−∞}^{∞} e^{−λ_1 y_1²/2} dy_1 ) · ( ∫_{y_2=−∞}^{∞} e^{−λ_2 y_2²/2} dy_2 )
  = √(2π/λ_1) · √(2π/λ_2) .
Ordinarily we would need a Jacobian determinant representing |dx/dy|, but here the determinant is det(V) = 1, for an orthogonal matrix. With this we can find the normalization constant, z, by
1 = ∫ u(x) dx
  = (1/z) ∫ e^{−(1/2) x^T H x} dx
  = (1/z) ∫ e^{−(1/2) y^T Λ y} dy
  = (1/z) ∫ exp( −(1/2) Σ_{k=1}^{n} λ_k y_k² ) dy
  = (1/z) ∫ ( Π_{k=1}^{n} e^{−λ_k y_k²/2} ) dy
  = (1/z) Π_{k=1}^{n} ( ∫_{y_k=−∞}^{∞} e^{−λ_k y_k²/2} dy_k )
  = (1/z) Π_{k=1}^{n} √(2π/λ_k)
1 = (1/z) · (2π)^{n/2} / √(det(H)) .
This gives a formula for z, and the final formula for the multivariate normal density
u(x) = ( √(det H) / (2π)^{n/2} ) e^{−(1/2) x^T H x} .   (10)
2.5. The covariance, by direct integration: We can calculate the covariance matrix of the $X_j$. The $jk$ element of $E[XX^\top]$ is $E[X_j X_k] = \mathrm{cov}(X_j, X_k)$. The covariance matrix consisting of all these elements is $C = E[XX^\top]$. Note the conflict of notation with the constant $C$ above. A direct way to evaluate $C$ is to use the density (10):
\[
C = \int_{\mathbb{R}^n} x\, x^\top\, u(x)\, dx
= \frac{\sqrt{\det H}}{(2\pi)^{n/2}} \int_{\mathbb{R}^n} x\, x^\top\, e^{-\frac{1}{2} x^\top H x}\, dx .
\]
Note that the integrand is an $n \times n$ matrix. Although each particular $xx^\top$ has rank one, the average of all of them will be a nonsingular positive definite matrix, as we will see. To work the integral, we use the $x = Vy$ change of variables above. This gives
\[
C = \frac{\sqrt{\det H}}{(2\pi)^{n/2}} \int_{\mathbb{R}^n} (Vy)(Vy)^\top\, e^{-\frac{1}{2} y^\top \Lambda y}\, dy .
\]
We use $(Vy)(Vy)^\top = V(yy^\top)V^\top$ and take the constant matrices $V$ outside the integral. This gives $C$ as the product of three matrices, first $V$, then an integral involving $yy^\top$, then $V^\top$. So, to calculate $C$, we can calculate all the matrix elements
\[
B_{jk} = \frac{\sqrt{\det H}}{(2\pi)^{n/2}} \int_{\mathbb{R}^n} y_j\, y_k\, e^{-\frac{1}{2} y^\top \Lambda y}\, dy .
\]
Clearly, if $j \neq k$, $B_{jk} = 0$, because the integrand is an odd (antisymmetric) function, say, of $y_j$. The diagonal elements $B_{kk}$ may be found using the fact that the integrand is a product:
\[
B_{kk} = \frac{\sqrt{\det H}}{(2\pi)^{n/2}}\; \prod_{j \neq k}\Big(\int_{y_j} e^{-\lambda_j y_j^2/2}\, dy_j\Big)\cdot
\int_{y_k} y_k^2\, e^{-\lambda_k y_k^2/2}\, dy_k .
\]
As before, the $\lambda_j$ factors (for $j \neq k$) integrate to $\sqrt{2\pi/\lambda_j}$. The $\lambda_k$ factor integrates to $\sqrt{2\pi}/\lambda_k^{3/2}$; it differs from the others only by a factor $1/\lambda_k$. Most of these factors combine to cancel the normalization. All that is left is
\[
B_{kk} = \frac{1}{\lambda_k} .
\]
This shows that $B = \Lambda^{-1}$, so
\[
C = V \Lambda^{-1} V^\top .
\]
Finally, since $H = V \Lambda V^\top$, we see that
\[
C = H^{-1} . \qquad (11)
\]
The covariance matrix is the inverse of the matrix defining the multivariate normal.
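Formula (11) is also easy to test numerically. The sketch below is an added illustration (assuming numpy; the $2\times 2$ matrix $H$ and the grid are made up). It approximates $C = \int x x^\top u(x)\, dx$ on a grid in $\mathbb{R}^2$ and compares it with $H^{-1}$.

    import numpy as np

    H = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
    xs = np.linspace(-8, 8, 801)
    dx = xs[1] - xs[0]
    X1, X2 = np.meshgrid(xs, xs, indexing="ij")
    u = np.sqrt(np.linalg.det(H)) / (2*np.pi) * np.exp(
        -0.5 * (H[0, 0]*X1**2 + 2*H[0, 1]*X1*X2 + H[1, 1]*X2**2))   # density (10)

    C = np.empty((2, 2))
    C[0, 0] = np.sum(X1 * X1 * u) * dx**2
    C[0, 1] = C[1, 0] = np.sum(X1 * X2 * u) * dx**2
    C[1, 1] = np.sum(X2 * X2 * u) * dx**2
    print(C)
    print(np.linalg.inv(H))                   # should agree with C to grid accuracy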
2.6. Linear functions of multivariate normals: A fundamental fact about multivariate normals is that a linear transformation of a multivariate normal is also multivariate normal, provided that the transformation is onto. Let $A$ be an $m \times n$ matrix with $m \le n$. This $A$ defines a linear transformation $y = Ax$. The transformation is onto if, for every $y \in \mathbb{R}^m$, there is at least one $x \in \mathbb{R}^n$ with $Ax = y$. If $n = m$, the transformation is onto if and only if $A$ is invertible ($\det(A) \neq 0$), and the only $x$ is $A^{-1}y$. If $m < n$, $A$ is onto if its $m$ rows are linearly independent. In this case, the set of solutions is a hyperplane of dimension $n - m$. Either way, the fact is that if $X$ is an $n$ dimensional multivariate normal and $Y = AX$, then $Y$ is an $m$ dimensional multivariate normal. Given this, we can completely determine the probability density of $Y$ by calculating its mean and covariance matrix. Writing $\mu_X$ and $\mu_Y$ for the means of $X$ and $Y$ respectively, we have
\[
\mu_Y = E[Y] = E[AX] = A\,E[X] = A\,\mu_X .
\]
Similarly, if $E[Y] = 0$, we have
\[
C_Y = E[YY^\top] = E[(AX)(AX)^\top] = E[A X X^\top A^\top] = A\, E[XX^\top]\, A^\top = A\, C_X\, A^\top .
\]
The reader should verify that if $C_X$ is $n \times n$, then this formula gives a $C_Y$ that is $m \times m$. The reader should also be able to derive the formula for $C_Y$ in terms of $C_X$ without assuming that $\mu_Y = 0$. We will soon give the proof that linear functions of Gaussians are Gaussian.
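A quick Monte Carlo sanity check of $C_Y = A C_X A^\top$ (an illustrative sketch assuming numpy; the particular $A$ and $C_X$ are made up, and the samples of $X$ are drawn with the factorization idea from paragraph 2.8 below):

    import numpy as np

    rng = np.random.default_rng(1)
    C_X = np.array([[1.0, 0.3, 0.0],
                    [0.3, 2.0, 0.5],
                    [0.0, 0.5, 1.5]])
    A = np.array([[1.0, -1.0, 0.0],
                  [0.5,  0.5, 1.0]])          # m = 2, n = 3, rows linearly independent

    L = np.linalg.cholesky(C_X)
    X = L @ rng.standard_normal((3, 200000))  # samples of X with covariance C_X
    Y = A @ X
    print(np.cov(Y))                          # empirical covariance of Y
    print(A @ C_X @ A.T)                      # A C_X A^T, should be close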
2.7. Uncorrelation and independence: The inverse of a symmetric matrix is another symmetric matrix. Therefore, $C_X$ is diagonal if and only if $H$ is diagonal. If $H$ is diagonal, the probability density function given by (10) is a product of densities for the components. We have already used that fact and will use it more below. For now, just note that $C_X$ is diagonal if and only if the components of $X$ are uncorrelated. Then $C_X$ being diagonal implies that $H$ is diagonal and the components of $X$ are independent. The fact that uncorrelated components of a multivariate normal are actually independent firstly is a property only of Gaussians, and secondly has curious consequences. For example, suppose $Z_1$ and $Z_2$ are independent standard normals and $X_1 = Z_1 + Z_2$ and $X_2 = Z_1 - Z_2$; then $X_1$ and $X_2$, being uncorrelated, are independent of each other. This may seem surprising in view of the fact that increasing $Z_1$ by $1/2$ increases both $X_1$ and $X_2$ by the same $1/2$. If $Z_1$ and $Z_2$ were independent uniform random variables (PDF $u(z) = 1$ if $0 \le z \le 1$, $u(z) = 0$ otherwise), then $X_1$ and $X_2$ would again be uncorrelated, but this time not independent (for example, the only way to get $X_1 = 2$ is to have both $Z_1 = 1$ and $Z_2 = 1$, which implies that $X_2 = 0$).
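The contrast between the Gaussian and the uniform example can be seen in a small simulation (an illustrative sketch assuming numpy, not part of the original notes; the threshold 1.9 is an arbitrary choice). In both cases the sample correlation of $X_1$ and $X_2$ is near zero, but conditioning on a large value of $X_1$ changes the distribution of $X_2$ only in the uniform case.

    import numpy as np

    rng = np.random.default_rng(2)
    N = 10**6

    for name, sampler in [("normal", rng.standard_normal), ("uniform", rng.random)]:
        Z1, Z2 = sampler(N), sampler(N)
        X1, X2 = Z1 + Z2, Z1 - Z2
        corr = np.corrcoef(X1, X2)[0, 1]      # near 0 in both cases (uncorrelated)
        tail = X1 > 1.9                       # condition on a large value of X1
        print(name, round(corr, 3),
              round(np.mean(X2[tail]**2), 3), # E[X2^2 | X1 > 1.9]
              round(np.mean(X2**2), 3))       # E[X2^2]
    # Gaussian case: conditioning leaves E[X2^2] unchanged (independence).
    # Uniform case: X1 > 1.9 forces X2 near 0, so the conditional mean square collapses.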
2.8. Application, generating correlated normals: There are simple techniques for generating (more or less) independent standard normal random variables, the Box Muller method being the most famous. Suppose we have a positive definite symmetric matrix, $C_X$, and we want to generate a multivariate normal with this covariance. One way to do this is to use the Choleski factorization $C_X = LL^\top$, where $L$ is an $n \times n$ lower triangular matrix. Now define $Z = (Z_1, \ldots, Z_n)^\top$ where the $Z_k$ are independent standard normals. This $Z$ has covariance $C_Z = I$. Now define $X = LZ$. This $X$ has covariance $C_X = L I L^\top = LL^\top$, as desired. Actually, we do not necessarily need the Choleski factorization; $L$ does not have to be lower triangular. Another possibility is to use the symmetric square root of $C_X$. Let $C_X = V \Sigma V^\top$, where $\Sigma$ is the diagonal matrix with the eigenvalues of $C_X$ ($\Sigma = \Lambda^{-1}$, where $\Lambda$ is given above), and $V$ is the orthogonal matrix of eigenvectors. We can take $A = V\sqrt{\Sigma}$, where $\sqrt{\Sigma}$ is the diagonal matrix of square roots of the eigenvalues. Usually the Choleski factorization is easier to get than the symmetric square root.
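A minimal sketch of this recipe (illustrative, assuming numpy; the target covariance matrix is made up, not from the notes). It generates correlated normals with $X = LZ$, checks the sample covariance, and shows the eigenvalue square-root route as an alternative.

    import numpy as np

    rng = np.random.default_rng(3)
    C_X = np.array([[2.0, 0.8, 0.2],
                    [0.8, 1.5, 0.4],
                    [0.2, 0.4, 1.0]])         # target covariance, positive definite

    L = np.linalg.cholesky(C_X)               # C_X = L L^T, L lower triangular
    Z = rng.standard_normal((3, 100000))      # independent standard normals, C_Z = I
    X = L @ Z                                 # covariance L I L^T = C_X
    print(np.cov(X))                          # should be close to C_X

    sig, V = np.linalg.eigh(C_X)              # alternative: A = V sqrt(Sigma)
    A = V @ np.diag(np.sqrt(sig))
    print(np.allclose(A @ A.T, C_X))          # A also reproduces the covariance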
2.9. Central Limit Theorem: Let $X$ be an $n$ dimensional random variable with probability density $u(x)$. Let $X^{(1)}, X^{(2)}, \ldots$ be a sequence of independent samples of $X$, that is, independent random variables with the same density $u$. Statisticians call this iid (independent, identically distributed). If we need to talk about the individual components of $X^{(k)}$, we write $X^{(k)}_j$ for component $j$ of $X^{(k)}$. For example, suppose we have a population of people. If we choose a person at random and record his or her height ($X_1$) and weight ($X_2$), we get a two dimensional random variable. If we measure 100 people, we get 100 samples, $X^{(1)}, \ldots, X^{(100)}$, each consisting of a height and weight pair. The weight of person 27 is $X^{(27)}_2$. Let $\mu = E[X]$ be the mean and $C = E[(X - \mu)(X - \mu)^\top]$ the covariance matrix. The Central Limit Theorem (CLT) states that for large $n$, the random variable
\[
R^{(n)} = \frac{1}{\sqrt{n}} \sum_{k=1}^{n} \big( X^{(k)} - \mu \big)
\]
has a probability distribution close to the multivariate normal with mean zero and covariance $C$. One interesting consequence is that if $X_1$ and $X_2$ are uncorrelated, then an average of many independent samples will have $R^{(n)}_1$ and $R^{(n)}_2$ nearly independent.
2.10. What the CLT says about Gaussians: The Central Limit Theorem tells us that if we average a large number of independent samples from the same distribution, the distribution of the average depends only on the mean and covariance of the starting distribution. It may be surprising that many of the properties that we deduced from the formula (10) may be found with almost no algebra simply knowing that the multivariate normal is the limit of averages. For example, we showed (or didn't show) that if $X$ is multivariate normal and $Y = AX$ where the rows of $A$ are linearly independent, then $Y$ is multivariate normal. This is a consequence of the averaging property. If $X$ is (approximately) the average of iid random variables $U_k$, then $Y$ is the average of random variables $V_k = AU_k$. Applying the CLT to the averaging of the $V_k$ shows that $Y$ is also multivariate normal.

Now suppose $U$ is a univariate random variable with iid samples $U_k$, and $E[U_k] = 0$, $E[U_k^2] = \sigma^2$, and $E[U_k^4] = a_4 < \infty$. Define $X_n = \frac{1}{\sqrt{n}} \sum_{k=1}^{n} U_k$. A calculation shows that $E[X_n^4] = 3\sigma^4 + \frac{1}{n}(a_4 - 3\sigma^4)$. For large $n$, the fourth moment of the average depends only on the second moment of the underlying distribution. A multivariate and slightly more general version of this calculation gives Wick's theorem, an expression for the expected value of a product of components of a multivariate normal in terms of covariances.
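The fourth moment claim above is easy to check by simulation (an illustrative sketch assuming numpy, not part of the notes). With $U_k$ uniform on $[-1,1]$ (so $\sigma^2 = 1/3$ and $a_4 = 1/5$), the fourth moment of $X_n$ approaches $3\sigma^4 = 1/3$ as $n$ grows.

    import numpy as np

    rng = np.random.default_rng(4)
    M = 50000                                  # Monte Carlo samples of X_n for each n
    for n in [1, 2, 8, 32, 128]:
        U = rng.uniform(-1.0, 1.0, size=(M, n))   # sigma^2 = 1/3, a_4 = 1/5
        Xn = U.sum(axis=1) / np.sqrt(n)
        print(n, round(np.mean(Xn**4), 4))     # approaches 3*sigma^4 = 1/3 as n grows
    print("limit 3*sigma^4 =", 1/3)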
Stochastic Calculus Notes, Lecture 5
Last modified October 21, 2004
1 Brownian Motion
1.1. Introduction: Brownian motion is the simplest of the stochastic processes called diffusion processes. It is helpful to see many of the properties of general diffusions appear explicitly in Brownian motion. In fact, the Ito calculus makes it possible to describe any other diffusion process in terms of Brownian motion. Furthermore, Brownian motion arises as a limit of many discrete stochastic processes in much the same way that Gaussian random variables appear as a limit of other random variables through the central limit theorem. Finally, the solutions to many other mathematical problems, particularly various common partial differential equations, may be expressed in terms of Brownian motion. For all these reasons, Brownian motion is a central object to study.
1.2. History: Early in the 19th century, an English botanist named Brown looked at pollen grains in water under a microscope. To his amazement, they were moving randomly. He had no explanation for supposedly inert pollen grains, and later inorganic dust, seeming to swim as though alive. In 1905, Einstein proposed the explanation that the observed Brownian motion was caused by individual water molecules hitting the pollen or dust particles. This allowed him to estimate, for the first time, the weight of a water molecule and won him the Nobel prize (relativity and quantum mechanics being too controversial at the time). This is the modern view, that the observed random motion of pollen grains is the result of a huge number of independent and random collisions with tiny water molecules.
1.3. Basics: The mathematical description of Brownian motion involves a random but continuous function of time, $X(t)$. The standard Brownian motion starts at $x = 0$ at time $t = 0$: $X(0) = 0$. The displacement, or increment, between time $t_1 > 0$ and time $t_2 > t_1$, $Y = X(t_2) - X(t_1)$, is the sum of a large number of i.i.d. mean zero random variables (each modeling the result of one water molecule collision). It is natural to suppose that the number of such collisions is proportional to the time increment. This implies, through the central limit theorem, that $Y$ should be a Gaussian random variable with variance proportional to $t_2 - t_1$. The standard Brownian motion has $X$ normalized so that the variance is equal to $t_2 - t_1$. The random "shocks" (a term used in finance for any change, no matter how small) in disjoint time intervals should be independent. If $t_3 > t_2$ and $Y_2 = X(t_3) - X(t_2)$, $Y_1 = X(t_2) - X(t_1)$, then $Y_2$ and $Y_1$ should be independent, with variances $t_3 - t_2$ and $t_2 - t_1$ respectively. This makes the increments $Y_2$ and $Y_1$ a two dimensional multivariate normal.
1.4. Wiener measure: The probability space for standard Brownian motion is $C_0([0,T], \mathbb{R})$. As we said before, this consists of continuous functions, $X(t)$, defined for $t$ in the range $0 \le t \le T$. The notation $C_0$ means^1 that $X(0) = 0$. The σ-algebra representing full information is the Borel σ-algebra. The infinite dimensional Gaussian probability measure on $C_0([0,T], \mathbb{R})$ that represents Brownian motion is called Wiener measure^2.

This measure is uniquely specified by requiring that for any times $0 = t_0 < t_1 < \cdots < t_n \le T$, the increments $Y_k = X(t_{k+1}) - X(t_k)$ are independent Gaussian random variables with $\mathrm{var}(Y_k) = t_{k+1} - t_k$. The proof (which we omit) has two parts. First, it is shown that there indeed is such a measure. Second, it is shown that there is only one such. All the information we need is contained in the joint distribution of the increments. The fact that increments from disjoint time intervals are independent is the independent increments property. It also is possible to consider Brownian motion on an infinite time horizon with probability space $C_0([0,\infty), \mathbb{R})$.
1.5. Technical aside: There is a different description of the Borel σ-algebra on $C_0([0,T], \mathbb{R})$. Rather than using balls in the sup norm, one can use sets more closely related to the definition of Wiener measure through the joint distribution of increments. Choose times $0 = t_0 < t_1 < \cdots < t_n$, and for each $t_k$ a Borel set, $I_k \subseteq \mathbb{R}$ (thought of as intervals though they may not be). Let $A$ be the event $\{X(t_k) \in I_k \text{ for all } k\}$. The set of such events forms an algebra (check this), though not a σ-algebra. The probabilities $P(A)$ are determined by the joint distributions of the increments. The Borel σ-algebra on $C_0([0,T], \mathbb{R})$ is generated by this algebra (proof omitted), so Wiener measure (if it exists) is determined by these probabilities.
1.6. Transition probabilities: The transition probability density for Brownian motion is the probability density for $X(t+s)$ given that $X(t) = y$. We denote this by $G(y, x, s)$, the $G$ standing for Green's function. It is much like the Markov chain transition probabilities $P^t_{y,x}$ except that (i) $G$ is a probability density as a function of $x$, not a probability, and (ii) $t$ is continuous, not discrete. In our case, the increment $X(t+s) - X(t)$ is Gaussian with variance $s$. If we learn that $X(t) = y$, then $y$ becomes the expected value of $X(t+s)$. Therefore,
\[
G(y, x, s) = \frac{1}{\sqrt{2\pi s}}\, e^{-(x-y)^2/2s} . \qquad (1)
\]
1.7. Functionals: An element of $\Omega = C_0([0,T], \mathbb{R})$ is called $X$. We denote by $F(X)$ a real valued function of $X$. In this context, such a function is often called a functional, to keep from confusing it with $X(t)$, which is a random function of $t$. This functional is just what we called a function of a random variable (the path $X$ playing the role of the abstract random outcome $\omega$). The simplest example of a functional is just a function of $X(T)$: $F(X) = V(X(T))$. More complicated functionals are integrals: $F(X) = \int_0^T V(X(t))\, dt$; extrema: $F(X) = \max_{t \le T} X(t)$; or stopping times such as $F(X) = \min\big\{ t \ \text{such that} \ \int_0^t X(s)\, ds \ge 1 \big\}$. Stochastic calculus provides tools for computing the expected values of many such functionals, often through solutions of partial differential equations. Computing expected values of functionals is our main way to understand the behavior of Brownian motion (or any other stochastic process).

^1 In other contexts, people use $C_0$ to indicate functions with compact support (whatever that means) or functions that tend to zero as $t \to \infty$, but not here.
^2 The American mathematician and MIT professor Norbert Wiener was equally brilliant and inarticulate.
1.8. Markov property: The independent increments property makes Brownian motion a Markov process. Let $\mathcal{F}_t$ be the σ-algebra generated by the path up to time $t$. This may be characterized as the σ-algebra generated by all the random variables $X(s)$ for $s \le t$, which is the smallest σ-algebra in which all the functions $X(s)$ are measurable. It also may be characterized as the σ-algebra generated by events of the form $A$ above (Technical aside) with $t_n \le t$ (proof omitted). We also have the σ-algebra $\mathcal{G}_t$ generated by the present only. That is, $\mathcal{G}_t$ is generated by the single random variable $X(t)$; it is the smallest σ-algebra in which $X(t)$ is measurable. Finally, we let $\mathcal{H}_t$ denote the σ-algebra that depends only on future values $X(s)$ for $s \ge t$. The Markov property states that if $F(X)$ is any functional measurable with respect to $\mathcal{H}_t$ (i.e. depending only on the future of $t$), then $E[F \mid \mathcal{F}_t] = E[F \mid \mathcal{G}_t]$.

Here is a quick sketch of the proof. If $F(X)$ is a function of finitely many values, $X(t_k)$, with $t_k \ge t$, then $E[F \mid \mathcal{F}_t] = E[F \mid \mathcal{G}_t]$ follows from the independent increments property. It is possible (though tedious) to show that any $F$ measurable with respect to $\mathcal{H}_t$ may be approximated by a functional depending on finitely many future times. This extends $E[F \mid \mathcal{F}_t] = E[F \mid \mathcal{G}_t]$ to all $F$ measurable in $\mathcal{H}_t$.
1.9. Path probabilities: For discrete Markov chains, as here, the individual outcomes are paths, $X$. For Markov chains one can compute the probability of an individual path by multiplying the transition probabilities. The situation is different for Brownian motion, where each individual path has probability zero. We will make much use of the following partial substitute. Again choose times $t_0 = 0 < t_1 < \cdots < t_n \le T$, let $\vec{t} = (t_1, \ldots, t_n)$ be the vector of these times, and let $\vec{X} = (X(t_1), \ldots, X(t_n))$ be the vector of the corresponding observations of $X$. We write $U^{(n)}(x, \vec{t}\,)$ for the joint probability density for the $n$ observations, which is found by multiplying together the transition probability densities (1) (and using properties of exponentials), with $x_0 = 0$:
\[
U^{(n)}(x, \vec{t}\,) = \prod_{k=0}^{n-1} G(x_k, x_{k+1}, t_{k+1} - t_k)
= \frac{1}{(2\pi)^{n/2}} \prod_{k=0}^{n-1} \frac{1}{\sqrt{t_{k+1} - t_k}}\;
\exp\bigg( -\frac{1}{2} \sum_{k=0}^{n-1} \frac{(x_{k+1} - x_k)^2}{t_{k+1} - t_k} \bigg) . \qquad (2)
\]
The formula (2) is a concrete summary of the defining properties of the probability measure for Brownian motion, Wiener measure: the independent increments property, the Gaussian distribution of the increments, the variance being proportional to the time differences, and the increments having mean zero. It also makes clear that each finite collection of observations forms a multivariate normal. For any of the events $A$ as in Technical aside, we have
\[
P(A) = \int_{x_1 \in I_1} \cdots \int_{x_n \in I_n} U^{(n)}(x_1, \ldots, x_n, \vec{t}\,)\, dx_1 \cdots dx_n .
\]
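The defining properties behind (2) translate directly into a simulation recipe: sample independent Gaussian increments with variance $t_{k+1} - t_k$ and add them up. The sketch below is an added illustration (assuming numpy; the uniform time spacing and parameter values are arbitrary choices) that generates a few approximate Brownian motion paths this way.

    import numpy as np

    rng = np.random.default_rng(5)
    T, n, n_paths = 1.0, 1000, 5
    dt = T / n
    dX = np.sqrt(dt) * rng.standard_normal((n_paths, n))    # increments ~ N(0, dt), independent
    X = np.concatenate([np.zeros((n_paths, 1)),
                        np.cumsum(dX, axis=1)], axis=1)     # X(0) = 0, then running sums
    t = np.linspace(0.0, T, n + 1)
    print(X[:, -1])           # X(T) for each path; each is ~ N(0, T)
    print(np.var(X[:, -1]))   # very rough check of var X(T) = T (only a few paths)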
1.10. Consistency: You cannot give just any old probability densities to replace the joint densities (2). They must satisfy a simple consistency condition. Having given the joint density for $n$ observations, you also have given the joint density for a subset of these observations. For example, the joint density for $X(t_1)$ and $X(t_3)$ must be the marginal of the joint density of $X(t_1)$, $X(t_2)$, and $X(t_3)$:
\[
U^{(2)}(x_1, x_3, t_1, t_3) = \int_{x_2=-\infty}^{\infty} U^{(3)}(x_1, x_2, x_3, t_1, t_2, t_3)\, dx_2 .
\]
It is possible to verify these consistency conditions by direct calculation with the Gaussian integrals. A more abstract way is to understand the consistency conditions as adding random increments. The $U^{(2)}$ density says that we get $X(t_3)$ from $X(t_1)$ by adding an increment that is Gaussian with mean zero and variance $t_3 - t_1$. The $U^{(3)}$ density says that we get $X(t_3)$ from $X(t_2)$ by adding a Gaussian with mean zero and variance $t_3 - t_2$. In turn, we get $X(t_2)$ from $X(t_1)$ by adding an increment having mean zero and variance $t_2 - t_1$. Since the smaller time increments are Gaussian and independent of each other, their sum is also Gaussian, with mean zero and variance $(t_3 - t_2) + (t_2 - t_1)$, which is the same as the variance in going from $X(t_1)$ to $X(t_3)$ directly.
1.11. Rough paths: The above picture shows 5 Brownian motion paths. They are random and differ in gross features (some go up, others go down), but the fine scale structure of the paths is the same. They are not smooth, or even differentiable, functions of $t$. If $X(t)$ is a differentiable function of $t$, then for small $\Delta t$ its increments are roughly proportional to $\Delta t$:
\[
\Delta X = X(t + \Delta t) - X(t) \approx \frac{dX}{dt}\,\Delta t .
\]
For Brownian motion, the expected value of the square of $\Delta X$ (the variance of $\Delta X$) is proportional to $\Delta t$. This suggests that typical values of $\Delta X$ will be on the order of $\sqrt{\Delta t}$. In fact, an easy calculation gives
\[
E\big[\,|\Delta X|\,\big] = \sqrt{\frac{2\,\Delta t}{\pi}} .
\]
This would be impossible if successive increments of Brownian motion were all in the same direction (see Total variation below). Instead, Brownian motion paths are constantly changing direction. They go nowhere (or not very far) fast.
1.12. Total variation: One quantitative sense of path roughness is the fact that Brownian motion paths have infinite total variation. The total variation of a function $X(t)$ measures the total distance it moves, counting both ups and downs. For a differentiable function, this would be
\[
\mathrm{TV}(X) = \int_0^T \Big|\frac{dX}{dt}\Big|\, dt . \qquad (3)
\]
If $X(t)$ has simple jump discontinuities, we add the sizes of the jumps to (3). For general functions, the total variation is
\[
\mathrm{TV}(X) = \sup \sum_{k=0}^{n-1} \big| X(t_{k+1}) - X(t_k) \big| , \qquad (4)
\]
where the supremum is over all positive $n$ and all sequences $t_0 = 0 < t_1 < \cdots < t_n \le T$.

Suppose $X(t)$ has finitely many local maxima or minima, such as $t_0 = $ local max, $t_1 = $ local min, etc. Then taking these $t$ values in (4) gives the exact total variation (further subdivision does not increase the sum). This is one way to relate the general definition (4) to the definition (3) for differentiable functions. This does not help for Brownian motion paths, which have infinitely many local maxima and minima.
1.13. Almost surely: Let $A \in \mathcal{F}$ be a measurable event. We say $A$ happens almost surely if $P(A) = 1$. This allows us to establish properties of random objects by doing calculations (stochastic calculus). For example, we will show that Brownian motion paths have infinite total variation almost surely by showing that for any (small) $\epsilon > 0$ and any (large) $N$,
\[
P(\mathrm{TV}(X) < N) < \epsilon . \qquad (5)
\]
Let $B \subset C_0([0,T], \mathbb{R})$ be the set of paths with finite total variation. This is a countable union
\[
B = \bigcup_{N>0} \{\mathrm{TV}(X) < N\} = \bigcup_{N>0} B_N .
\]
Since $P(B_N) < \epsilon$ for any $\epsilon > 0$, we must have $P(B_N) = 0$. Countable additivity then implies that $P(B) = 0$, which means that $P(\mathrm{TV} = \infty) = 1$.

There is a distinction between outcomes that do not exist and events that never happen because they have probability zero. For example, if $Z$ is a one dimensional Gaussian random variable, the outcome $Z = 0$ does exist, but the event $\{Z = 0\}$ is impossible (never will be observed). This is what we mean when we say "a Gaussian random variable never is zero", or "every Brownian motion path has infinite total variation".
1.14. The TV of BM: The heart of the matter is the actual calculation behind the inequality (5). We choose an $n > 0$ and define (not for the last time) $\Delta t = T/n$ and $t_k = k\Delta t$. Let $Y$ be the random variable
\[
Y = \sum_{k=0}^{n-1} \big| X(t_{k+1}) - X(t_k) \big| .
\]
Remember that $Y$ is one of the candidates we must use in the supremum (4) that defines the total variation. If $Y$ is large, then the total variation is at least as large. Because $E[\,|\Delta X|\,] = \sqrt{2\Delta t/\pi}$, we have $E[Y] = \sqrt{2nT/\pi}$. A calculation using the independent increments property shows that
\[
\mathrm{var}(Y) = \Big( 1 - \frac{2}{\pi} \Big) T
\]
for any $n$. Tchebychev's inequality^3 implies that
\[
P\bigg( Y < \sqrt{\frac{2nT}{\pi}} - k\sqrt{\Big(1 - \frac{2}{\pi}\Big) T}\; \bigg) \le \frac{1}{k^2} .
\]
If we take very large $n$ and medium large $k$, this inequality says that it is very unlikely for $Y$ (or the total variation of $X$) to be much less than $\mathrm{const}\cdot\sqrt{n}$. Our inequality (5) follows from this with a suitable choice of $n$ and $k$.
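The calculation can be illustrated numerically (an added sketch assuming numpy, not part of the notes): for a single simulated path, the sum $Y_n = \sum |X(t_{k+1}) - X(t_k)|$ over finer and finer uniform partitions grows like $\sqrt{n}$, so its supremum over partitions is infinite.

    import numpy as np

    rng = np.random.default_rng(6)
    T = 1.0
    n_fine = 2**16
    dX = np.sqrt(T / n_fine) * rng.standard_normal(n_fine)
    X = np.concatenate([[0.0], np.cumsum(dX)])        # one path sampled on a fine grid

    for n in [2**6, 2**8, 2**10, 2**12, 2**14, 2**16]:
        step = n_fine // n                            # coarser partition with n intervals
        Y_n = np.sum(np.abs(np.diff(X[::step])))
        print(n, round(Y_n, 2), round(np.sqrt(2*n*T/np.pi), 2))   # compare with E[Y_n]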
1.15. Structure of BM paths: For any function $X(t)$, we can define the total variation on the interval $[t_1, t_2]$ in an obvious way. The odometer of a car records the distance travelled regardless of the direction. For $X(t)$, the total variation on the interval $[0, t]$ plays a similar role. Clearly, $X$ is monotone on the interval $[t_1, t_2]$ if and only if $\mathrm{TV}(X, t_1, t_2) = |X(t_2) - X(t_1)|$. Otherwise, $X$ has at least one local min or max within $[t_1, t_2]$. Now, Brownian motion paths have infinite total variation on any interval (the proof above implies this). Therefore, a Brownian motion path has a local max or min within any interval. This means that (like the rational numbers, for example) the set of local maxima and minima is dense: there is a local max or min arbitrarily close to any given number.
1.16. Dynamic trading: The infinite total variation of Brownian motion has a consequence for dynamic trading strategies. Some of the simplest dynamic trading strategies, Black-Scholes hedging, and Merton half stock/half cash trading, call for trades that are proportional to the change in the stock price. If the stock price is a diffusion process and there are transaction costs proportional to the size of the trade, then the total transaction costs will either be infinite (in the idealized continuous trading limit) or very large (if we trade as often as possible). It turns out that dynamic trading strategies that take trading costs into account can approach the idealized zero cost strategies when trading costs are small. Next term you will learn how this is done.

^3 If $E[Y] = \mu$ and $\mathrm{var}(Y) = \sigma^2$, then $P(|Y - \mu| > k\sigma) < \frac{1}{k^2}$. The proof and more examples are in any good basic probability book.
1.17. Quadratic variation: A more useful measure of roughness of Brownian motion paths and other diffusion processes is quadratic variation. Using previous notations, $\Delta t = T/n$ and $t_k = k\Delta t$, the definition is^4 (where $n \to \infty$ as $\Delta t \to 0$ with $T = n\Delta t$ fixed)
\[
Q(X) = \lim_{\Delta t \to 0} Q_n(X) = \lim_{\Delta t \to 0} \sum_{k=0}^{n-1} \big( X(t_{k+1}) - X(t_k) \big)^2 . \qquad (6)
\]
If $X$ is a differentiable function of $t$, then its quadratic variation is zero ($Q_n$ is the sum of $n$ terms each of order $1/n^2$). For Brownian motion, $Q(X) = T$ (almost surely). Clearly $E[Q_n] = T$ for any $n$ (independent increments, Gaussian increments with variance $\Delta t$). The independent increments property also lets us evaluate $\mathrm{var}(Q_n) = 2T^2/n$ (the sum of $n$ terms each equal to $2\Delta t^2 = 2T^2/n^2$). Thus, $Q_n$ must be increasingly close to $T$ as $n$ gets larger.^5
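In contrast with the total variation, the quadratic variation computed on refining partitions of a simulated path settles down to $T$. A short illustrative sketch (assuming numpy, not part of the notes):

    import numpy as np

    rng = np.random.default_rng(7)
    T = 1.0
    n_fine = 2**16
    dX = np.sqrt(T / n_fine) * rng.standard_normal(n_fine)
    X = np.concatenate([[0.0], np.cumsum(dX)])        # one path on a fine grid

    for n in [2**4, 2**8, 2**12, 2**16]:
        step = n_fine // n
        Q_n = np.sum(np.diff(X[::step])**2)
        print(n, round(Q_n, 4))   # approaches T = 1 (E[Q_n] = T, var(Q_n) of order 1/n)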
1.18. Trading volatility: The quadratic variation of a stock price (or a similar
quantity) is called its realized volatility. The fact that it is possible to buy
and sell realized volatility says that the (geometric) Brownian motion model
of stock price movement is not completely realistic. That model predicts that
realized volatility is a constant, which is nothing to bet on.
1.19. Brownian bridge construction:
1.20. Continuous time stochastic process: The general abstract definition of a continuous time stochastic process is just a probability space, $\Omega$, and, for each $t > 0$, a σ-algebra $\mathcal{F}_t$. These σ-algebras should form a filtration (corresponding to increase of information): $\mathcal{F}_{t_1} \subseteq \mathcal{F}_{t_2}$ if $t_1 \le t_2$. There should also be a family of random variables $Y_t(\omega)$, with $Y_t$ measurable in $\mathcal{F}_t$ (i.e. having a value known at time $t$). This explains why probabilists often write $X_t$ instead of $X(t)$ for Brownian motion and other diffusion processes. For each $t$, we think of $X_t$ as a function of $\omega$ with $t$ simply being a parameter. Our choice of probability space $\Omega = C_0([0,T], \mathbb{R})$ implies that for each $\omega$, $X_t(\omega)$ is a continuous function of $t$. (Actually, for simple Brownian motion, the path $X$ plays the role of the abstract outcome $\omega$, though we never write $X_t(X)$.) Other stochastic processes, such as the Poisson jump process, do not have continuous sample paths.
^4 It is possible, though not customary, to define $\mathrm{TV}(X)$ using evenly spaced points. In the limit $\Delta t \to 0$, we would get the same answer for continuous paths or paths with $\mathrm{TV}(X) < \infty$. You don't have to use uniformly spaced times in the definition of $Q(X)$, but I think you get a different answer if you let the times depend on $X$ as they might in the definition of total variation.
^5 This does not quite prove that (almost surely) $Q_n \to T$ as $n \to \infty$. We will come back to this point in later lectures.
1.21. Continuous time martingales: A stochastic process $F_t$ (with $\Omega$ and the $\mathcal{F}_t$) is a martingale if $E[F_s \mid \mathcal{F}_t] = F_t$ for $s > t$. Brownian motion forms the first example of a continuous time martingale. Another famous martingale related to Brownian motion is $F_t = X_t^2 - t$ (the reader should check this). As in discrete time, any random variable, $Y$, defines a continuous time martingale through conditional expectations: $Y_t = E[Y \mid \mathcal{F}_t]$. The Ito calculus is based on the idea that a stochastic integral with respect to $X$ should produce a martingale.
2 Brownian motion and the heat equation
2.1. Introduction: Forward and backward equations are tools for calculating probabilities and expected values related to Brownian motion, as they are for Markov chains and stochastic processes more generally. The probability density of $X(t)$ satisfies a forward equation. The conditional expectations $E[V \mid \mathcal{F}_t]$ satisfy backward equations for a variety of functionals $V$. For Brownian motion, the forward and backward equations are partial differential equations, either the heat equation or a close relative. We will see that the theory of partial differential equations of diffusion type (the heat equation being a prime example) and the theory of diffusion processes (Brownian motion being a prime example) each draw from the other.
2.2. Forward equation for the probability density: If $X(t)$ is a standard Brownian motion with $X(0) = 0$, then $X(t) \sim \mathcal{N}(0,t)$, so its probability density is (see (1))
\[
u(x,t) = G(0,x,t) = \frac{1}{\sqrt{2\pi t}}\, e^{-x^2/2t} .
\]
Directly calculating partial derivatives, we can verify that
\[
\partial_t G = \frac{1}{2}\,\partial_x^2 G . \qquad (7)
\]
We also could consider a Brownian motion with a more general initial density $X(0) \sim u_0(x)$. Then $X(t)$ is the sum of independent random variables $X(0)$ and an $\mathcal{N}(0,t)$. Therefore, the probability density for $X(t)$ is
\[
u(x,t) = \int_{y=-\infty}^{\infty} G(y,x,t)\, u_0(y)\, dy
       = \int_{y=-\infty}^{\infty} G(0, x-y, t)\, u_0(y)\, dy . \qquad (8)
\]
Again, direct calculation (differentiating (8), $x$ and $t$ derivatives land on $G$) shows that $u$ satisfies
\[
\partial_t u = \frac{1}{2}\,\partial_x^2 u . \qquad (9)
\]
This is the heat equation, also called diffusion equation. The equation is used in two ways. First, we can compute probabilities by solving the partial differential equation. Second, we can use known probability densities as solutions of the partial differential equation.
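Equation (7) can be verified numerically by comparing finite difference approximations of $\partial_t G$ and $\tfrac{1}{2}\partial_x^2 G$ for the Gaussian kernel. The sketch below is an added illustration (assuming numpy; the evaluation points and step sizes are arbitrary choices).

    import numpy as np

    def G(x, t):
        return np.exp(-x**2 / (2*t)) / np.sqrt(2*np.pi*t)

    x = np.linspace(-3.0, 3.0, 7)
    t, h, k = 0.7, 1e-4, 1e-4
    dG_dt = (G(x, t + k) - G(x, t - k)) / (2*k)                  # centered difference in t
    d2G_dx2 = (G(x + h, t) - 2*G(x, t) + G(x - h, t)) / h**2     # centered difference in x
    print(np.max(np.abs(dG_dt - 0.5 * d2G_dx2)))                 # small: (7) holds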
2.3. Heat equation via Taylor series: The above is not so much a derivation of the heat equation as a verification. We are told that $u(x,t)$ (the probability density of $X_t$) satisfies the heat equation and we verify that fact. Here is a method for deriving a forward equation without knowing it in advance. We assume that $u(x,t)$ is smooth enough as a function of $x$ and $t$ that we may expand it to second order in Taylor series, do the expansion, then take the conditional expectation of the terms. Variations of this idea lead to the backward equations and to major parts of the Ito calculus.

Let us fix two times separated by a small $\Delta t$: $t' = t + \Delta t$. The rules of conditional probability allow us to compute the density of $X = X(t')$ in terms of the density of $Y = X(t)$ and the transition probability density (1):
\[
u(x, t+\Delta t) = \int_{y=-\infty}^{\infty} G(y, x, \Delta t)\, u(y,t)\, dy . \qquad (10)
\]
The main idea is that for small $\Delta t$, $X(t+\Delta t)$ will be close to $X(t)$. This is expressed in $G$ being small unless $y$ is close to $x$, which is evident in (1). In the integral, $x$ is a constant and $y$ is the variable of integration. If we would approximate $u(y,t)$ by $u(x,t)$, the value of the integral just would be $u(x,t)$. This would give the true but not very useful approximation $u(x, t+\Delta t) \approx u(x,t)$ for small $\Delta t$. Adding the next Taylor series term (writing $u_x$ for $\partial_x u$), $u(y,t) \approx u(x,t) + u_x(x,t)(y-x)$, does not change the result, because $\int G(y,x,\Delta t)(y-x)\, dy = 0$. Adding the next term:
\[
u(y,t) \approx u(x,t) + u_x(x,t)(y-x) + \frac{1}{2}\, u_{xx}(x,t)(y-x)^2 ,
\]
gives (because $E[(Y-X)^2] = \Delta t$)
\[
u(x, t+\Delta t) \approx u(x,t) + \frac{1}{2}\, u_{xx}(x,t)\,\Delta t .
\]
To derive a partial differential equation, we expand the left side as $u(x, t+\Delta t) = u(x,t) + u_t(x,t)\Delta t + O(\Delta t^2)$. On the right, we use
\[
\int G(y,x,\Delta t)\, |y-x|^3\, dy = O(\Delta t^{3/2}) .
\]
Altogether, this gives
\[
u(x,t) + u_t(x,t)\,\Delta t = u(x,t) + \frac{1}{2}\, u_{xx}(x,t)\,\Delta t + O(\Delta t^{3/2}) .
\]
If we cancel the common $u(x,t)$, then cancel the common factor $\Delta t$ and let $\Delta t \to 0$, we get the desired heat equation (9).
2.4. The initial value problem: The heat equation (9) is the Brownian motion analogue of the forward equation for Markov chains. If we know the time 0 density $u(x,0) = u_0(x)$ and the evolution equation (9), the values of $u(x,t)$ are completely and uniquely determined (ignoring mathematical technicalities that would be unlikely to trouble a practical person). The task of finding $u(x,t)$ for $t > 0$ from $u_0(x)$ and (9) is called the initial value problem, with $u_0(x)$ being the initial value (or initial data). This initial value problem is well posed, which means that the solution, $u(x,t)$, exists and depends continuously on the initial data, $u_0$. If you want a proof that the solution exists, just use the integral formula for the solution (8). Given $u_0$, the integral (8) exists, satisfies the heat equation, and is a continuous function of $u_0$. The proof that $u$ is unique is more technical, partly because it rests on more technical assumptions.
2.5. Ill posed problems: In some situations, the problem of finding a function $u$ from a partial differential equation and other data may be ill posed, useless for practical purposes. A problem is ill posed if it is not well posed. This means either that the solution does not exist, or that it does not depend continuously on the data, or that it is not unique. For example, if I try to find $u(x,t)$ for positive $t$ knowing only $u_0(x)$ for $x > 0$, I must fail. A mathematician would say that the solution, while it exists, is not unique, there being many different ways to give $u_0(x)$ for $x < 0$, each leading to a different $u$. A more subtle situation arises, for example, if we give $u(x,T)$ for all $x$ and wish to determine $u(x,t)$ for $0 \le t < T$. For example, if $u(x,T) = \mathbf{1}_{[0,1]}(x)$, there is no solution (trust me). Even if there is a solution, for example given by (8), it does not depend continuously on the values of $u(x,T)$ for $T > t$ (trust me).

The heat equation (9) relates values of $u$ at one time to values at another time. However, it is well posed only for determining $u$ at future times from $u$ at earlier times. This forward equation is well posed only for moving forward in time.
2.6. Conditional expectations: We saw already for Markov chains that certain conditional expected values can be calculated by working backwards in time with the backward equation. The Brownian motion version of this uses the conditional expectation
\[
f(x,t) = E[\,V(X_T) \mid X_t = x\,] . \qquad (11)
\]
One modern formulation of this defines $F_t = E[V(X_T) \mid \mathcal{F}_t]$. The Markov property implies that $F_t$ is measurable in $\mathcal{G}_t$, which makes it a function of $X_t$. We write this as $F_t = f(X_t, t)$. Of course, these definitions mean the same thing and yield the same $f$. The definition is also sometimes written as $f(x,t) = E_{x,t}[V(X_T)]$. In general, if we have a parametrized family of probability measures, $P_\alpha$, we write the expected value with respect to $P_\alpha$ as $E_\alpha[\cdot]$. Here, the probability measure $P_{x,t}$ is the Wiener measure describing Brownian motion paths that start from $x$ at time $t$, which is defined by the densities of increments for times larger than $t$ as before.
2.7. Backward equation by direct verification: Given that $X_t = x$, the conditional density for $X_T$ is the same transition density (1). Writing the expectation (11) as an integral, we get
\[
f(x,t) = \int_{-\infty}^{\infty} G(x, y, T-t)\, V(y)\, dy . \qquad (12)
\]
We can verify by explicit differentiation ($x$ and $t$ derivatives act on $G$) that
\[
\partial_t f + \frac{1}{2}\,\partial_x^2 f = 0 . \qquad (13)
\]
Note that the sign of $\partial_t$ here is not what it was in (9), which is because we are calculating $\partial_t G(T-t)$ rather than $\partial_t G(t)$. This (13) is the backward equation.
2.8. Backward equation by Taylor series: As with the forward equation (9), we can find the backward equation by Taylor series expansions. We start by choosing a small $\Delta t$ and expressing $f(x,t)$ in terms of^6 $f(\cdot, t+\Delta t)$. As before, define $F_t = E[V(X_T) \mid \mathcal{F}_t] = f(X_t, t)$. Since $\mathcal{F}_t \subseteq \mathcal{F}_{t+\Delta t}$, the tower property implies that $F_t = E[F_{t+\Delta t} \mid \mathcal{F}_t]$, so
\[
f(x,t) = E_{x,t}\big[ f(X_{t+\Delta t}, t+\Delta t) \big]
= \int_{y=-\infty}^{\infty} f(y, t+\Delta t)\, G(x, y, \Delta t)\, dy . \qquad (14)
\]
As before, we expand $f(y, t+\Delta t)$ about $x, t$, dropping terms that contribute less than $O(\Delta t)$:
\[
f(y, t+\Delta t) = f(x,t) + f_x(x,t)(y-x) + \frac{1}{2} f_{xx}(x,t)(y-x)^2 + f_t(x,t)\,\Delta t
+ O(|y-x|^3) + O(\Delta t^2) .
\]
Substituting this into (14) and integrating each term leads to
\[
f(x,t) = f(x,t) + 0 + \frac{1}{2} f_{xx}(x,t)\,\Delta t + f_t(x,t)\,\Delta t + O(\Delta t^{3/2}) + O(\Delta t^2) .
\]
A bit of algebra and $\Delta t \to 0$ then gives (13).

For future reference, we pause to note the differences between this derivation of (13) and the related derivation of (9). Here, we integrated $G$ with respect to its second argument, while earlier we integrated with respect to the first argument. This does not matter for the special case of Brownian motion and the heat equation because $G(x,y,t) = G(y,x,t)$. When we apply this reasoning to other diffusion processes, $G(x,y,t)$ will be a probability density as a function of $y$ for every $x$, but it need not be a probability density as a function of $x$ for given $y$. This is an analogue of the fact in Markov chains that the transition matrix $P$ acts from the left on column vectors $f$ (summing $P_{jk}$ over $k$) but from the right on row vectors $u$ (summing $P_{jk}$ over $j$). For each $j$, $\sum_k P_{jk} = 1$, but the column sums $\sum_j P_{jk}$ may not equal one. Of course, the sign of the $\partial_t$ term is different in the two cases because we did the $t$ Taylor series on the right side of (14) but on the left side of (10).

^6 The notation $f(\cdot, t+\Delta t)$ is to avoid writing $f(x, t+\Delta t)$, which might imply that the value $f(x,t)$ depends only on $f$ at time $t+\Delta t$ for the same $x$ value. Instead, it depends on all the values $f(y, t+\Delta t)$.
2.9. The final value problem: The final values $f(x,T) = V(x)$, together with the backward evolution equation (13), allow us to determine the values $f(\cdot, t)$ for $t < T$. The definition (11) makes this obvious. This means that the final value problem for the backward heat equation is a well posed problem.

On the other hand, the initial value problem for the backward heat equation is not a well posed problem. If we have an $f(x,0)$ and we want a $V(x)$ that leads to it, we are probably out of luck.
2.10. Duality: As for Markov chains, we can express the expected value of $V(X_T)$ in terms of the probability density at any earlier time $t \le T$:
\[
E[V(X_T)] = \int u(x,t)\, f(x,t)\, dx .
\]
This again implies that the right side is independent of $t$, which in turn allows us to derive the forward equation (9) from the backward equation (13) or conversely. For example, differentiating and using (13) gives
\[
0 = \frac{d}{dt}\int u(x,t) f(x,t)\, dx
= \int u_t(x,t)\, f(x,t)\, dx + \int u(x,t)\, f_t(x,t)\, dx
= \int u_t(x,t)\, f(x,t)\, dx - \int u(x,t)\,\frac{1}{2} f_{xx}(x,t)\, dx .
\]
To derive an equation involving only $u$ derivatives, we want to integrate the last integral by parts to move the $x$ derivatives from $f$ to $u$. In this formal derivation, we will assume that the probability density $u(x,t)$ decays to zero fast enough as $|x| \to \infty$ that we can neglect possible boundary terms at $x = \pm\infty$. This gives
\[
\int \Big( u_t(x,t) - \frac{1}{2} u_{xx}(x,t) \Big)\, f(x,t)\, dx = 0 .
\]
If this relation holds for a sufficiently rich family of functions $f$, we can only conclude that $u_t - \frac{1}{2} u_{xx}$ is identically zero, which is the forward equation (9).
2.11. The smoothing property, regularity: Solutions of the forward or backward heat equation become smooth functions of $x$ and $t$ even if the initial data (for the forward equation) or final data (for the backward equation) are not smooth. For $u$, this is clear from the integral formula (8). If we differentiate with respect to $x$, this derivative passes under the integral and onto the $G$ factor. This applies also to $x$ or $t$ derivatives of any order, since the corresponding derivatives of $G$ are still smooth integrable functions of $x$. The same can be said for $f$ using (12); as long as $t < T$, any derivatives of $f$ with respect to $x$ and/or $t$ are bounded. A function that has all partial derivatives of any order bounded is called "smooth". (Warning, this term is not used consistently. Some people say "smooth" to mean, for example, merely having derivatives up to second order bounded.) Solutions of more general forward and backward equations often, but not always, have the smoothing property.
2.12. Rate of smoothing: Suppose the payout (and final value) function, $V(x)$, is a discontinuous function such as $V(x) = \mathbf{1}_{x<0}(x)$ (a "digital" option in finance). The solution to the backward equation can be expressed in terms of the cumulative normal (with $Z \sim \mathcal{N}(0,1)$)
\[
N(x) = P(Z < x) = \frac{1}{\sqrt{2\pi}}\int_{z=-\infty}^{x} e^{-z^2/2}\, dz .
\]
Then we have
\[
f(x,t) = \int_{y=-\infty}^{0} G(x, y, T-t)\, dy
= \frac{1}{\sqrt{2\pi(T-t)}}\int_{y=-\infty}^{0} e^{-(x-y)^2/2(T-t)}\, dy ,
\]
\[
f(x,t) = N\!\big( -x/\sqrt{T-t}\, \big) . \qquad (15)
\]
From this it is clear that $f$ is differentiable when $t < T$, but the first $x$ derivative is as large as $1/\sqrt{T-t}$, the second as large as $1/(T-t)$, etc. All derivatives blow up as $t \to T$, with higher derivatives blowing up faster. This can make numerical solution of the backward equation difficult and inaccurate when the final data $V(x)$ is not smooth.

The formula (15) can be derived without integration. One way is to note that $f(x,t) = P(X_T < 0 \mid X_t = x)$ and $X_T \approx x + \sqrt{T-t}\, Z$ (Gaussian increments), so that $X_T < 0$ is the same as $Z < -x/\sqrt{T-t}$. Even without the normal probability, a physicist would tell you that $\Delta X \sim \sqrt{\Delta t}$, so the hitting probability starting from $x$ at time $t$ has to be some function of $x/\sqrt{T-t}$.
2.13. Diffusion: If you put a drop of ink into a glass of still water, you will see the ink slowly diffuse through the water. This is modelled as a vast number of tiny ink particles each performing an independent Brownian motion in the water. Let $u(x,t)$ represent the density of particles about $x$ at time $t$ (say, particles per cubic millimeter). This $u$ satisfies the heat equation but not the requirement that $\int u(x,t)\, dx = 1$. If ink has been diffusing through water for some time, there will be dark regions with a high density of particles (large $u$) and lighter regions with smaller $u$. In the absence of boundaries (sides of the glass and the top of the water), the ink distribution would be Gaussian.
2.14. Heat: Heat also can diffuse through a medium, as happens when we put a thick metal pan over a flame and wait for the other side to heat up. We can think of $u(x,t)$ as representing the temperature in a metal at location $x$ at time $t$. This helps us interpret solutions of the heat equation (9) when $u$ is not necessarily positive. In particular, it helps us imagine the cancellation that can occur when regions of positive and negative $u$ are close to each other. Heat flows from the high temperature regions to low or negative temperature regions in a way that makes the temperature distribution more uniform. A physical argument that heat (temperature) flowing through a metal should satisfy the heat equation was given by the French mathematical physicist, friend of Napoleon, and founder of Ecole Polytechnique, Joseph Fourier.
2.15. Hitting times: A stopping time, $\tau$, is any time that depends on the Brownian motion path $X$ so that the event $\tau \le t$ is measurable with respect to $\mathcal{F}_t$. This is the same as saying that for each $t$ there is some process that has as input the values $X_s$ for $0 \le s \le t$ and as output a decision $\tau \le t$ or $\tau > t$. One kind of stopping time is a hitting time:
\[
\tau_a = \min\,( t \mid X_t = a ) .
\]
More generally (particularly for Brownian motion in more than one dimension), if $A$ is a closed set, we may consider $\tau_A = \min(t \mid X_t \in A)$. It is useful to define a Brownian motion that stops at time $\tau$: $\widetilde{X}_t = X_t$ if $t \le \tau$, $\widetilde{X}_t = X_\tau$ if $t \ge \tau$.
2.16. Probabilities for stopped Brownian motion: Suppose $X_t$ is Brownian motion starting at $X_0 = 1$ and $\widetilde{X}$ is the Brownian motion stopped at time $\tau_0$, the first time $X_t = 0$. The probability measure, $P_t$, for $\widetilde{X}_t$ may be written as the sum of two terms, $P_t = P_t^{\mathrm{s}} + P_t^{\mathrm{ac}}$. (Since $\widetilde{X}_t$ is a single number, the probability space is $\Omega = \mathbb{R}$, and the σ-algebra is the Borel algebra.) The singular part, $P_t^{\mathrm{s}}$, corresponds to the paths that have been stopped. If $p(t)$ is the probability that $\tau \le t$, then $P_t^{\mathrm{s}} = p(t)\,\delta(x)$, which means that for any Borel set, $A \subseteq \mathbb{R}$, $P_t^{\mathrm{s}}(A) = p(t)$ if $0 \in A$ and $P_t^{\mathrm{s}}(A) = 0$ if $0 \notin A$. This $\delta$ is called the "delta function" or "delta mass"; it puts weight one on the point zero and no weight anywhere else. Probabilists sometimes write $\delta_{x_0}$ for the measure that puts weight one on the point $x_0$. Physicists write $\delta_{x_0}(x) = \delta(x - x_0)$. The absolutely continuous part, $P_t^{\mathrm{ac}}$, is given by a density, $u(x,t)$. This means that $P_t^{\mathrm{ac}}(A) = \int_A u(x,t)\, dx$. Because $\int_{\mathbb{R}} u(x,t)\, dx = 1 - p(t) < 1$, $u$, while being a density, is not a probability density.

This decomposition of a measure ($P$) as a sum of a singular part and absolutely continuous part is a special case of the Radon Nikodym theorem. We will see the same idea in other contexts later.
2.17. Forward equation for u: The density for the absolutely continuous part, $u(x,t)$, is the density for paths that have not touched $X = a$. In the diffusion interpretation, think of a tiny ink particle diffusing as before but being absorbed if it ever touches $a$. It is natural to expect that when $x \neq a$, the density satisfies the heat equation (9). $u$ "knows about" the absorbing boundary through the boundary condition $u(a,t) = 0$. This says that the density of particles approaches zero near the absorbing boundary. By the end of the course, we will have several ways to prove this. For now, think of a diffusing particle, a Brownian motion path, as being hyperactive; it moves so fast that it has already visited a neighborhood of its current location. In particular, if $X_t$ is close to $a$, then very likely $X_s = a$ for some $s < t$. Only a small minority of the particles at $x$ near $a$, with small density $u(x,t) \to 0$ as $x \to a$, have not touched $a$.
2.18. Probability flux: Suppose a Brownian motion starts at a random point $X_0 > 0$ with probability density $u_0(x)$ and we take the absorbing boundary at $a = 0$. Clearly, $u(x,t) = 0$ for $x < 0$ because a particle cannot cross from positive to negative without crossing zero, the Brownian motion paths being continuous. The probability of not being absorbed before time $t$ is given by
\[
1 - p(t) = \int_{x>0} u(x,t)\, dx . \qquad (16)
\]
The rate of absorption of particles, the rate of decrease of probability, may be calculated by using the heat equation and the boundary condition. Differentiating (16) with respect to $t$, using the heat equation for the right side, then integrating gives
\[
-\partial_t p(t) = \int_{x>0} \partial_t u(x,t)\, dx
= \int_{x>0} \frac{1}{2}\,\partial_x^2 u(x,t)\, dx ,
\]
\[
\partial_t p(t) = \frac{1}{2}\,\partial_x u(0,t) . \qquad (17)
\]
Note that both sides of (17) are positive: the left side because $P(\tau \le t)$ is an increasing function of $t$, the right side because $u(0,t) = 0$ and $u(x,t) > 0$ for $x > 0$. The identity (17) leads us to interpret the left side as the probability flux (or density flux if we are thinking of diffusing particles). The rate at which probability flows (or particles flow) across a fixed point ($x = 0$) is proportional to the derivative (the gradient) at that point. In the heat flow interpretation this says that the rate of heat flow across a point is proportional to the temperature gradient. This natural idea is called Fick's law (or possibly Fourier's law).
2.19. Images and Reflections: We want a function $u(x,t)$ that satisfies the heat equation when $x > 0$, the boundary condition $u(0,t) = 0$, and goes to $\delta_{x_0}$ as $t \to 0$. The method of images is a trick for doing this. We think of $\delta_{x_0}$ as a unit "charge" (in the electrical, not financial sense) at $x_0$ and $g(x - x_0, t) = \frac{1}{\sqrt{2\pi t}}\, e^{-(x-x_0)^2/2t}$ as the response to this charge, if there is no absorbing boundary. For example, think of putting a unit drop of ink at $x_0$ and watching it spread along the $x$ axis in a bell shaped (i.e. Gaussian) density distribution. Now think of adding a negative "image charge" at $-x_0$, so that $u_0(x) = \delta_{x_0} - \delta_{-x_0}$ and correspondingly
\[
u(x,t) = \frac{1}{\sqrt{2\pi t}}\Big( e^{-(x-x_0)^2/2t} - e^{-(x+x_0)^2/2t} \Big) . \qquad (18)
\]
This function satisfies the heat equation everywhere, and in particular for $x > 0$. It also satisfies the boundary condition $u(0,t) = 0$. Also, it has the same initial data as $g$, as long as $x > 0$. Therefore, as long as $x > 0$, the $u$ given by (18) represents the density of unabsorbed particles in a Brownian motion with absorption at $x = 0$. You might want to consider the image charge contribution in (18), $-\frac{1}{\sqrt{2\pi t}}\, e^{-(x+x_0)^2/2t}$, as "red ink" (the ink that represents negative quantities) that also diffuses along the $x$ axis. To get the total density, we subtract the red ink density from the black ink density. For $x = 0$, the red and black densities are the same because the distances to the sources at $\pm x_0$ are the same. When $x > 0$ the black density is higher, so we get a positive $u$. We can think of the image point, $-x_0$, as the reflection of the original source point through the barrier $x = 0$.
2.20. The reflection principle: The explicit formula (18) allows us to evaluate $p(t)$, the probability of touching $x = 0$ by time $t$ starting at $X_0 = x_0$. This is
\[
p(t) = 1 - \int_{x>0} u(x,t)\, dx
= 1 - \int_{x>0} \frac{1}{\sqrt{2\pi t}}\Big( e^{-(x-x_0)^2/2t} - e^{-(x+x_0)^2/2t} \Big)\, dx .
\]
Because $\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi t}}\, e^{-(x-x_0)^2/2t}\, dx = 1$, we may write
\[
p(t) = \int_{-\infty}^{0} \frac{1}{\sqrt{2\pi t}}\, e^{-(x-x_0)^2/2t}\, dx
     + \int_{0}^{\infty} \frac{1}{\sqrt{2\pi t}}\, e^{-(x+x_0)^2/2t}\, dx .
\]
Of course, the two terms on the right are the same! Therefore
\[
p(t) = 2\int_{-\infty}^{0} \frac{1}{\sqrt{2\pi t}}\, e^{-(x-x_0)^2/2t}\, dx .
\]
This formula is a particular case of the Kolmogorov reflection principle. It says that the probability that $X_s \le 0$ for some $s \le t$ (the left side) is exactly twice the probability that $X_t < 0$ (the integral on the right). Clearly some of the particles that cross to the negative side at times $s < t$ will cross back, while others will not. This formula says that exactly half the particles that touch zero for some $s \le t$ have $X_t > 0$. Kolmogorov gave a proof of this based on the Markov property and the symmetry of Brownian motion. Since $X_\tau = 0$ and the increments of $X$ for $s > \tau$ are independent of the increments for $s < \tau$, and since the increments are symmetric Gaussian random variables, they have the same chance to be positive, $X_t > 0$, as negative, $X_t < 0$.
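The reflection principle is easy to test by simulation on a fine time grid (an added illustrative sketch assuming numpy, not part of the notes; discretization makes the simulated hitting probability slightly too small, since the path can cross zero between grid points).

    import numpy as np

    rng = np.random.default_rng(8)
    x0, t, n, M = 1.0, 1.0, 1000, 5000
    dt = t / n
    dX = np.sqrt(dt) * rng.standard_normal((M, n))
    X = x0 + np.cumsum(dX, axis=1)                     # paths started at x0 > 0
    hit = np.mean(np.min(X, axis=1) <= 0.0)            # fraction that touch 0 before time t
    refl = 2 * np.mean(X[:, -1] < 0.0)                 # 2 P(X_t < 0), the reflection formula
    print(hit, refl)                                   # should be close to each other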
Stochastic Calculus Notes, Lecture 5
Last modified October 26, 2004
1 Integrals involving Brownian motion
1.1. Introduction: There are two kinds of integrals involving Brownian motion, time integrals and Ito integrals. The time integral, which is discussed here, is just the ordinary Riemann integral of a continuous but random function of $t$ with respect to $t$. Such integrals define stochastic processes that satisfy interesting backward equations. On the one hand, this allows us to compute the expected value of the integral by solving a partial differential equation. On the other hand, we may find the solution of the partial differential equation by computing the expected value by Monte Carlo, for example. The Feynman Kac formula is one of the examples in this section.
1.2. The integral of Brownian motion: Consider the random variable, where $X(t)$ continues to be standard Brownian motion,
\[
Y = \int_0^T X(t)\, dt . \qquad (1)
\]
We expect $Y$ to be Gaussian because the integral is a linear functional of the (Gaussian) Brownian motion path $X$. Because $X(t)$ is a continuous function of $t$, this is a standard Riemann integral. The Riemann sum approximations converge. As usual, for $n > 0$ we define $\Delta t = T/n$ and $t_k = k\Delta t$. The Riemann sum approximation is
\[
Y_n = \Delta t \sum_{k=0}^{n-1} X(t_k) , \qquad (2)
\]
and $Y_n \to Y$ as $n \to \infty$ because $X(t)$ is a continuous function of $t$. The $n$ summands in (2), $X(t_k)$, form an $n$ dimensional multivariate normal, so each of the $Y_n$ is normal. It would be surprising if $Y$, as the limit of Gaussians, were not Gaussian.
1.3. The variance of Y: We will start the hard way, computing the variance from (2) and letting $\Delta t \to 0$. The trick is to use two summation variables, $Y_n = \Delta t \sum_{k=0}^{n-1} X(t_k)$ and $Y_n = \Delta t \sum_{j=0}^{n-1} X(t_j)$. It is immediate from (2) that $E[Y_n] = 0$ and $\mathrm{var}(Y_n) = E[Y_n^2]$:
\[
E[Y_n^2] = E[Y_n \cdot Y_n]
= E\bigg[ \Big( \Delta t \sum_{k=0}^{n-1} X(t_k) \Big)\Big( \Delta t \sum_{j=0}^{n-1} X(t_j) \Big) \bigg]
= \Delta t^2 \sum_{jk} E[X(t_k) X(t_j)] .
\]
If we now let $\Delta t \to 0$, the left side converges to $E[Y^2]$ and the right side converges to a double integral:
\[
E[Y^2] = \int_{s=0}^{T}\int_{t=0}^{T} E[X(t)X(s)]\, ds\, dt . \qquad (3)
\]
We can find the needed $E[X(t)X(s)]$ if $s > t$ by writing $X(s) = X(t) + \Delta X$ with $\Delta X$ independent of $X(t)$, so
\[
E[X(t)X(s)] = E[X(t)(X(t) + \Delta X)] = E[X(t)X(t)] = t .
\]
A variation of this argument gives $E[X_t X_s] = s$ if $s < t$. Altogether,
\[
E[X_t X_s] = \min(t, s) ,
\]
which is a famous formula. This now gives
\[
E[Y^2] = \int_{s=0}^{T}\int_{t=0}^{T} E[X_t X_s]\, ds\, dt
= \int_{s=0}^{T}\int_{t=0}^{T} \min(s,t)\, ds\, dt = \frac{1}{3} T^3 .
\]
There is a simpler and equally rigorous way to get this. Write $Y = \int_{s=0}^{T} X(s)\, ds$ and $Y = \int_{t=0}^{T} X(t)\, dt$ so that again
\[
E[Y^2] = E\bigg[ \int_{s=0}^{T} X(s)\, ds \int_{t=0}^{T} X(t)\, dt \bigg]
= E\bigg[ \int_{s=0}^{T}\int_{t=0}^{T} X(s)X(t)\, dt\, ds \bigg] \qquad (4)
\]
\[
= \int_{s=0}^{T}\int_{t=0}^{T} E[X(s)X(t)]\, dt\, ds . \qquad (5)
\]
Going from (4) to (5) involves changing the order of integration^1. After all, $E[\cdot]$ just represents integration over a probability space. The right side of (4) has the abstract form
\[
\int_{\Omega} \bigg( \int_{s\in[0,T]}\int_{t\in[0,T]} F(\omega, s, t)\, dt\, ds \bigg)\, dP(\omega) .
\]
Here $F = X(s)X(t)$, and $\omega$ is the random outcome (the whole path $X[0,T]$ here), and $P$ represents Wiener measure. If we interchange the ordinary Riemann $ds\, dt$ integral with the abstract $dP$ integral, we get
\[
\int_{s\in[0,T]}\int_{t\in[0,T]} \bigg( \int_{\Omega} F(\omega, s, t)\, dP(\omega) \bigg)\, dt\, ds ,
\]
which is the abstract form of (5).

^1 The possibility of changing the order of abstract integrals was established by the twentieth century mathematician Fubini. He proved it to be correct if the double (triple in our case) integral converges absolutely (a requirement even for ordinary Riemann integrals) and the function $F$ is jointly measurable in all its arguments. Our integrand is nonnegative, so the result will be infinite if the integral does not converge absolutely. We omit a discussion of product measures and joint measurability.
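A Monte Carlo check of $E[Y^2] = T^3/3$, using the Riemann sum (2) for $Y$ (an added illustrative sketch assuming numpy; the parameter values are arbitrary):

    import numpy as np

    rng = np.random.default_rng(9)
    T, n, M = 2.0, 500, 10000
    dt = T / n
    dX = np.sqrt(dt) * rng.standard_normal((M, n))
    X = np.hstack([np.zeros((M, 1)),
                   np.cumsum(dX, axis=1)[:, :-1]])    # X(t_0), ..., X(t_{n-1}), left endpoints
    Y = dt * X.sum(axis=1)                            # Riemann sum approximation (2)
    print(np.var(Y), T**3 / 3)                        # should be close to T^3/3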
1.4. Measurability of Brownian motion integrals: Suppose $t_1 < t_2$. Consider the integrals $U = \int_0^{t_1} X(t)\, dt$ and $V = \int_{t_1}^{t_2} (X(t) - X(t_1))\, dt$. We expect $U$ to be measurable in $\mathcal{F}_{t_1}$ because all the $X$ values defining $U$ are measurable in $\mathcal{F}_{t_1}$. Similarly, all the differences defining $V$ are independent of anything in $\mathcal{F}_{t_1}$. Therefore, we expect $V$ to be independent of $U$. We omit the straightforward proofs of these facts, which depend on elementary properties of abstract integration.
1.5. The $X_t^3$ martingale: Many martingales are constructed from integrals involving Brownian motion. A simple one is
\[
F(t) = X(t)^3 - 3\int_0^t X(s)\, ds .
\]
To check the martingale property, choose $t_2 > t_1$ and, for $t > t_1$, write $X(t) = X(t_1) + \Delta X(t)$. Then
\[
E\Big[ \int_0^{t_2} X(t)\, dt \,\Big|\, \mathcal{F}_{t_1} \Big]
= E\Big[ \int_0^{t_1} X(t)\, dt + \int_{t_1}^{t_2} X(t)\, dt \,\Big|\, \mathcal{F}_{t_1} \Big]
= \int_0^{t_1} X(t)\, dt + E\Big[ \int_{t_1}^{t_2} \big( X(t_1) + \Delta X(t) \big)\, dt \,\Big|\, \mathcal{F}_{t_1} \Big]
= \int_0^{t_1} X(t)\, dt + (t_2 - t_1)\, X(t_1) .
\]
In the last line we use the facts that $X(t) \in \mathcal{F}_{t_1}$ when $t \le t_1$, and $X_{t_1} \in \mathcal{F}_{t_1}$, and that $E[\Delta X(t) \mid \mathcal{F}_{t_1}] = 0$ when $t > t_1$, which is part of the independent increments property. For the $X(t)^3$ part, writing $\Delta X(t_2) = X(t_2) - X(t_1)$, we have
\[
E\big[ (X(t_1) + \Delta X(t_2))^3 \mid \mathcal{F}_{t_1} \big]
= E\big[ X(t_1)^3 + 3 X(t_1)^2\, \Delta X(t_2) + 3 X(t_1)\, \Delta X(t_2)^2 + \Delta X(t_2)^3 \mid \mathcal{F}_{t_1} \big]
= X(t_1)^3 + 3 X(t_1)^2 \cdot 0 + 3 X(t_1)\, E[\Delta X(t_2)^2 \mid \mathcal{F}_{t_1}] + 0
= X(t_1)^3 + 3 (t_2 - t_1)\, X(t_1) .
\]
In the last line we used the independent increments property to get $E[\Delta X(t_2) \mid \mathcal{F}_{t_1}] = 0$ and $E[\Delta X(t_2)^3 \mid \mathcal{F}_{t_1}] = 0$, and the formula for the variance of the increment to get $E[\Delta X(t_2)^2 \mid \mathcal{F}_{t_1}] = t_2 - t_1$. Combining the two calculations verifies that $E[F(t_2) \mid \mathcal{F}_{t_1}] = F(t_1)$, which is the martingale property.
1.6. Backward equations for expected values of integrals: Many integrals involving Brownian motion arise in applications and may be "solved" using backward equations. One example is $F = \int_0^T V(X(t))\, dt$, which represents the total accumulated $V(X)$ over a Brownian motion path. If $V(x)$ is a continuous function of $x$, the integral is a standard Riemann integral, because $V(X(t))$ is a continuous function of $t$. We can calculate $E[F]$, using the more general function
\[
f(x,t) = E_{x,t}\bigg[ \int_t^T V(X(s))\, ds \bigg] . \qquad (6)
\]
As before, we can describe the function $f(x,t)$ in terms of the random variable
\[
F(t) = E\bigg[ \int_t^T V(X(s))\, ds \,\bigg|\, \mathcal{F}_t \bigg] .
\]
Since $F(t)$ is measurable in $\mathcal{F}_t$ and depends only on future values ($X(s)$ with $s \ge t$), $F(t)$ is measurable in $\mathcal{G}_t$. Since $\mathcal{G}_t$ is generated by $X(t)$ alone, this means that $F(t)$ is a function of $X(t)$, which we write as $F(t) = f(X(t), t)$. Of course, this definition is a big restatement of definition (6). Once we know $f(x,t)$, we can plug in $t = 0$ to get $E[F] = F(0) = f(x_0, 0)$ if $X(0) = x_0$ is known. Otherwise, $E[F] = E[f(X(0), 0)]$.

The backward equation for $f$ is
\[
\partial_t f + \frac{1}{2}\,\partial_x^2 f + V(x) = 0 , \qquad (7)
\]
with final conditions $f(x,T) = 0$. The derivation is similar to the one we used before for the backward equation for $E_{x,t}[V(X_T)]$. We use Taylor series and the tower property to calculate how $f$ changes over a small time increment, $\Delta t$. We start with
\[
\int_t^T V(X(s))\, ds = \int_t^{t+\Delta t} V(X(s))\, ds + \int_{t+\Delta t}^{T} V(X(s))\, ds ,
\]
take the $x,t$ expectation, and use (6) to get
\[
f(x,t) = E_{x,t}\bigg[ \int_t^{t+\Delta t} V(X(s))\, ds \bigg]
       + E_{x,t}\bigg[ \int_{t+\Delta t}^{T} V(X(s))\, ds \bigg] . \qquad (8)
\]
The first term on the right has the value $V(x)\Delta t + o(\Delta t)$. We write $o(\Delta t)$ for a quantity that is smaller than $\Delta t$ in the sense that $o(\Delta t)/\Delta t \to 0$ as $\Delta t \to 0$ (we will shortly divide by $\Delta t$, take the limit $\Delta t \to 0$, and neglect all $o(\Delta t)$ terms). For the second term, we have
\[
E\bigg[ \int_{t+\Delta t}^{T} V(X(s))\, ds \,\bigg|\, \mathcal{F}_{t+\Delta t} \bigg]
= F(t+\Delta t) = f(X(t+\Delta t), t+\Delta t) .
\]
Writing $X(t+\Delta t) = X(t) + \Delta X$, we use the tower property with $\mathcal{F}_t \subseteq \mathcal{F}_{t+\Delta t}$ to get
\[
E_{x,t}\bigg[ \int_{t+\Delta t}^{T} V(X(s))\, ds \bigg] = E_{x,t}\big[ f(X(t) + \Delta X,\, t+\Delta t) \big] .
\]
As before, we use Taylor expansion inside the conditional expectation to get first
\[
f(x+\Delta X, t+\Delta t) = f(x,t) + \Delta t\,\partial_t f(x,t) + \Delta X\,\partial_x f(x,t)
+ \frac{1}{2}\,\Delta X^2\,\partial_x^2 f(x,t) + o(\Delta t) ,
\]
then
\[
E_{x,t}\big[ f(x+\Delta X, t+\Delta t) \big] = f(x,t) + \Delta t\,\partial_t f(x,t)
+ \frac{1}{2}\,\Delta t\,\partial_x^2 f(x,t) + o(\Delta t) .
\]
Putting all this back into (8) gives
\[
f(x,t) = \Delta t\, V(x) + f(x,t) + \Delta t\,\partial_t f(x,t)
+ \frac{1}{2}\,\Delta t\,\partial_x^2 f(x,t) + o(\Delta t) .
\]
Now just cancel $f(x,t)$ from both sides, divide by $\Delta t$, and let $\Delta t \to 0$ to get the promised equation (7).
1.7. Application of PDE: Most commonly, we cannot evaluate either the expected value (6) or the solution of the partial differential equation (PDE) (7). How does the PDE represent progress toward evaluating $f$? One way is by suggesting a completely different computational procedure. If we work only from the definition (6), we would use Monte Carlo for numerical evaluation. Monte Carlo is notoriously slow and inaccurate. There are several techniques for finding the solution of a PDE that avoid Monte Carlo, including finite difference methods, finite element methods, spectral methods, and trees. When such deterministic methods are practical, they generally are more reliable, more accurate, and faster. In financial applications, we are often able to find PDEs for quantities that have no simple Monte Carlo probabilistic definition. Many such examples are related to optimization problems: maximizing an expected return or minimizing uncertainty with dynamic trading strategies in a randomly evolving market. The Black Scholes evaluation of the value of an American style option is a well known example.
1.8. The Feynman Kac formula: Consider
F = E
_
exp
_
_
T
0
V (X(t)dt
__
. (9)
As before, we evaluate F using the related and more rened quantities
f(x, t) = E
x,t
_
e
_
T
t
V (Xs)ds
_
(10)
5
satises the backward equation

t
f +
1
2

2
x
f +V (x)f = 0 . (11)
When someone refers to the Feynman Kac formula, they usually are referring to the fact that (10) is a formula for the solution of the PDE (11). In our work, the situation mostly will be reversed. We use the PDE (11) to get information about the quantity defined by (10) or even just about the process X(t).
We can verify that (10) satisfies (11) more or less as in the preceding paragraph. We note that
\[ \exp\Bigl(\int_t^{t+\Delta t} V(X(s))\,ds + \int_{t+\Delta t}^T V(X(s))\,ds\Bigr) = \exp\Bigl(\int_t^{t+\Delta t} V(X(s))\,ds\Bigr)\exp\Bigl(\int_{t+\Delta t}^T V(X(s))\,ds\Bigr) \]
\[ = \bigl(1 + \Delta t\,V(X(t)) + o(\Delta t)\bigr)\exp\Bigl(\int_{t+\Delta t}^T V(X(s))\,ds\Bigr) . \]
The expectation of the right side with respect to ℱ_{t+Δt} is
\[ \bigl(1 + \Delta t\,V(X_t) + o(\Delta t)\bigr)\, f(X(t+\Delta t),\, t+\Delta t) . \]
When we now take the expectation with respect to ℱ_t, which amounts to averaging over ΔX, using the Taylor expansion of f about f(x, t) as before, we get (11).
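As a sanity check on (10), here is a small Monte Carlo sketch (mine, with my choice of V, not from the notes). For V(x) = −x²/2 the classical Cameron–Martin formula gives the exact value E[exp(−½∫_0^T W(t)²dt)] = 1/√(cosh T), which the simulation should reproduce.

    import numpy as np

    # Monte Carlo estimate of E[ exp( \int_0^T V(W(t)) dt ) ] with V(x) = -x^2/2.
    # Exact value (Cameron-Martin): 1/sqrt(cosh(T)).
    T = 1.0
    rng = np.random.default_rng(1)
    n_paths, n_steps = 50_000, 400
    dt = T / n_steps

    dW = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n_steps))
    W = np.cumsum(dW, axis=1)
    integral = np.sum(-0.5 * W**2, axis=1) * dt   # \int_0^T V(W(t)) dt per path
    mc = np.mean(np.exp(integral))

    exact = 1.0 / np.sqrt(np.cosh(T))
    print(mc, exact)        # both close to 0.805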
1.9. The Feynman integral: A precursor to the Feynman Kac formula is the Feynman integral² solution of the Schrödinger equation. The Feynman integral is not an integral in the sense of measure theory. (Neither is the Ito integral, for that matter.) The colorful probabilist Mark Kac (pronounced "Katz") discovered that an actual integral over Wiener measure (10) gives the solution of (11). Feynman's reasoning will help us derive the Girsanov formula, so we pause to sketch it.
The finite difference approximation
\[ \int_0^T V(X(t))\,dt \;\approx\; \Delta t \sum_{k=0}^{n-1} V(X(t_k)) , \tag{12} \]
(always Δt = T/n, t_k = kΔt) leads to an approximation to F of the form
\[ F_n = E\Bigl[\exp\Bigl(\Delta t \sum_{k=0}^{n-1} V(X(t_k))\Bigr)\Bigr] . \tag{13} \]
² The American physicist Richard Feynman was born and raised in Far Rockaway (a neighborhood of Queens, New York). He is the author of several wonderful popular books, including Surely You're Joking, Mr. Feynman and The Feynman Lectures on Physics.
The functional F_n depends only on finitely many values X_k = X(t_k), so we may evaluate (13) using the known joint density function for ⃗X = (X_1, ..., X_n). The density is (see "Path probabilities" from Lecture 5):
\[ U^{(n)}(x) = \frac{1}{(2\pi\,\Delta t)^{n/2}}\, \exp\Bigl( -\sum_{k=0}^{n-1} (x_{k+1}-x_k)^2 / 2\Delta t \Bigr) . \]
It is suggestive to rewrite this as
\[ U^{(n)}(x) = \frac{1}{(2\pi\,\Delta t)^{n/2}}\, \exp\Bigl( -\frac{\Delta t}{2}\sum_{k=0}^{n-1} \Bigl(\frac{x_{k+1}-x_k}{\Delta t}\Bigr)^{2} \Bigr) . \tag{14} \]
Using this to evaluate F_n gives
\[ F_n = \frac{1}{(2\pi\,\Delta t)^{n/2}} \int_{\mathbb{R}^n} \exp\Bigl( \Delta t\sum_{k=0}^{n-1} V(x_k) - \frac{\Delta t}{2}\sum_{k=0}^{n-1}\Bigl(\frac{x_{k+1}-x_k}{\Delta t}\Bigr)^{2} \Bigr)\,dx . \tag{15} \]
It is easy to show that F_n → F as n → ∞ as long as V(x) is, say, continuous and bounded (see below).
Feynman proposed a view of F = lim_{n→∞} F_n in (15) that is not mathematically rigorous but explains what's going on. If x_k ≈ x(t_k), then we should have
\[ \Delta t\sum_{k=0}^{n-1} V(x_k) \;\approx\; \int_{0}^{T} V(x(t))\,dt . \]
Also,
\[ \frac{x_{k+1}-x_k}{\Delta t} \;\approx\; \frac{dx}{dt} = \dot{x}(t_k) , \]
so we should also have
\[ \frac{\Delta t}{2}\sum_{k=0}^{n-1}\Bigl(\frac{x_{k+1}-x_k}{\Delta t}\Bigr)^{2} \;\approx\; \frac{1}{2}\int_0^T \dot{x}(t)^2\,dt . \]
As n → ∞, the integral over R^n should converge to the integral over all paths x(t). We denote this by ∫_P without worrying about exactly which paths are allowed (continuous, differentiable, ...?). The integration element dx has the possible formal limit
\[ dx = \prod_{k=0}^{n-1} dx_k = \prod_{k=0}^{n-1} dx(t_k) \;\longrightarrow\; \prod_{t=0}^{T} dx(t) . \]
Altogether, this gives the formal expression for the limit of (15):
\[ F = \mathrm{const}\int_P \exp\Bigl( \int_0^T V(x(t))\,dt - \frac{1}{2}\int_0^T \dot{x}(t)^2\,dt \Bigr)\,\prod_{t=0}^{T} dx(t) . \tag{16} \]
1.10. Feynman and Wiener integration: Mathematicians were quick to complain about (16). For one thing, the constant const = lim_{n→∞}(2πΔt)^{-n/2} should be infinite. More seriously, there is no abstract integral measure corresponding to ∫_P ∏_{t=0}^T dx(t) (it is possible to prove this). Kac proposed to write (16) as
\[ F = \int_P \exp\Bigl(\int_0^T V(x(t))\,dt\Bigr)\Bigl[\mathrm{const}\,\exp\Bigl(-\frac{1}{2}\int_0^T \dot{x}(t)^2\,dt\Bigr)\prod_{t=0}^{T} dx(t)\Bigr] , \]
and then interpret the latter part as Wiener measure (dP):
\[ \mathrm{const}\,\exp\Bigl(-\frac{1}{2}\int_0^T \dot{x}(t)^2\,dt\Bigr)\prod_{t=0}^{T} dx(t) = dP(X) . \tag{17} \]
In fact, we have already implicitly argued informally (and it can be formalized) that
\[ U^{(n)}(x)\prod_{k=0}^{n-1} dx_k \;\longrightarrow\; dP(X) \quad \text{as } n \to \infty . \]
These intuitive but mathematically nonsensical formulas are a great help in understanding Brownian motion. For one thing, (17) makes clear that Wiener measure is Gaussian. Its density has the form const · exp(-Q(x)), where Q(x) is a positive quadratic function of x. Here Q(x) = ½∫ ẋ(t)² dt (and the constant is, alas, infinite). Moreover, in many cases it is possible to approximate integrals of the form ∫ exp(φ(x))dx by e^{φ*}, where φ* = max_x φ(x), if φ is sharply peaked around its maximum. This is particularly common in rare event or large deviation problems. In our case, this would lead us to solve the calculus of variations problem
\[ \max_{x(\cdot)} \Bigl( \int_0^T V(x(t))\,dt - \frac{1}{2}\int_0^T \dot{x}(t)^2\,dt \Bigr) . \]
1.11. Application of Feynman Kac: The problem of evaluating
\[ f = E\Bigl[\exp\Bigl(\int_0^T V(X_t)\,dt\Bigr)\Bigr] \]
arises in many situations. In finance, f could represent the present value of a payment in the future subject to unknown fluctuating interest rates. The PDE (11) provides a possible way to evaluate f = f(0, 0), either analytically or numerically.
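A hedged toy illustration of this use of (9): the sketch below prices a zero coupon bond P = E[exp(−∫_0^T r(t)dt)] under the hypothetical short rate model r(t) = r_0 + σW(t) (chosen only because it has a closed form; it is not a realistic rate model and is not from the notes). Since ∫_0^T W(t)dt is Gaussian with mean 0 and variance T³/3, the exact value is exp(−r_0 T + σ²T³/6).

    import numpy as np

    # Present value of $1 paid at time T under the toy short rate r(t) = r0 + sigma*W(t).
    r0, sigma, T = 0.05, 0.02, 2.0
    rng = np.random.default_rng(2)
    n_paths, n_steps = 50_000, 200
    dt = T / n_steps

    W = np.cumsum(rng.normal(0.0, np.sqrt(dt), size=(n_paths, n_steps)), axis=1)
    r = r0 + sigma * W
    discount = np.exp(-np.sum(r, axis=1) * dt)   # exp(-\int_0^T r(t) dt), per path
    mc = discount.mean()

    exact = np.exp(-r0 * T + sigma**2 * T**3 / 6.0)
    print(mc, exact)       # both close to 0.905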
2 Mathematical formalism
2.1. Introduction: We examine the solution formulas for the backward and forward equations from two points of view. The first is an analogy with linear algebra, with function spaces playing the role of vector spaces and operators playing the role of matrices. The second is a more physical picture, interpreting G(x, y, t) as the Green's function describing the forward diffusion of a point mass of probability or the backward diffusion of a localized unit of payout.
2.2. Solution operator: As time moves forward, the probability density for X_t changes, or evolves. As time moves backward, the value function f(x, t) also evolves.³ The backward evolution process is given by (for s > 0; this is a consequence of the tower property)
\[ f(x, t-s) = \int G(x, y, s)\, f(y, t)\,dy . \tag{18} \]
We write this abstractly as f(t - s) = G(s)f(t).
This formula is analogous to the comparable Markov chain formula f(t-s) = P^s f(t). In the Markov chain case, s and t are integers and f(t) represents a vector in R^n whose components are f_k(t). Here, f(t) is a function of x whose values are f(x, t). We can think of P^s as an n × n matrix or as the linear operator that transforms the vector f to the vector g = P^s f. Similarly, G(s) is a linear operator, transforming a function f into g, with
\[ g(x) = \int G(x, y, s)\, f(y)\,dy . \]
The operation is linear, which means that G(af^{(1)} + bf^{(2)}) = aGf^{(1)} + bGf^{(2)}. The family of operators G(s) for s > 0 produces the solution to the backward equation, so we call G(s) the solution operator for time s.
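A small numerical sketch (my own, not from the notes) of the solution operator in action: apply G(s) to a function by quadrature against the heat kernel. For f(y) = y² the exact result of one application is x² + s, which makes an easy check. The grid and the test function are my choices.

    import numpy as np

    def apply_G(f_vals, y, x, s):
        """Apply the heat-kernel solution operator: g(x) = \int G(x,y,s) f(y) dy."""
        dy = y[1] - y[0]
        K = np.exp(-(x[:, None] - y[None, :])**2 / (2.0 * s)) / np.sqrt(2*np.pi*s)
        return K @ f_vals * dy       # simple quadrature on a fine, wide grid

    y = np.linspace(-12, 12, 2401)   # wide enough that the Gaussian tails are covered
    x = np.linspace(-2, 2, 5)
    s = 0.7

    g = apply_G(y**2, y, x, s)
    print(g)                         # numerically close to x**2 + s
    print(x**2 + s)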
2.3. Duhamel's principle: The inhomogeneous backward equation
\[ \partial_t f + \tfrac{1}{2}\,\partial_x^2 f = -V(x,t) , \tag{19} \]
with homogeneous⁴ final condition f(x, T) = 0 may be solved by
\[ f(x,t) = E_{x,t}\Bigl[\int_t^T V(X(t'), t')\,dt'\Bigr] . \]
Exchanging the order of integration, we may write
\[ f(x,t) = \int_{t'=t}^{T} g(x, t, t')\,dt' , \tag{20} \]
where
\[ g(x, t, t') = E_{x,t}\bigl[V(X(t'), t')\bigr] . \]
³ Unlike biological evolution, this evolution process makes the solution less complicated, not more.
⁴ We often say "homogeneous" to mean zero and "inhomogeneous" to mean not zero. That may be because if V(x, t) is zero then it is constant, i.e. the same everywhere, which is the usual meaning of homogeneous.
This g is the expected value (at (x, t)) of a payout V(·, t') at time t' > t. As such, g is the solution of a homogeneous final value problem with inhomogeneous final values:
\[ \partial_t g + \tfrac{1}{2}\,\partial_x^2 g = 0 \ \text{ for } t < t' , \qquad g(x, t') = V(x, t') . \tag{21} \]
Duhamel's principle, which we just demonstrated, is as follows. To solve the inhomogeneous final value problem (19), we solve a homogeneous final value problem (21) for each t' between t and T, then we add up the results (20).
2.4. Infinitesimal generator: There are matrices of many different types that play various roles in theory and computation. And so it is with operators. In addition to the solution operator, there is the infinitesimal generator (or simply generator). For Brownian motion in one dimension, the generator is
\[ L = \tfrac{1}{2}\,\partial_x^2 . \tag{22} \]
The backward equation may be written
\[ \partial_t f + Lf = 0 . \tag{23} \]
For other diffusion processes, the generator is the operator L that puts the backward equation for the process in the form (23).
Just as a matrix has a transpose, an operator has an adjoint, written L*. The forward equation takes the form
\[ \partial_t u = L^* u . \]
The operator (22) for Brownian motion is self adjoint, which means that L* = L, which is why the operator ½∂_x² appears in both. We will return to these points later.
2.5. Composing (multiplying) operators: If A and B are matrices, then there are two ways to form the matrix AB. One way is to multiply the matrices. The other is to compose the linear transformations: f → Bf → ABf. In this way, AB is the composite linear transformation formed by first applying B then applying A. We also can compose operators, even if we sometimes lack a good explicit representation for the composite AB. As with matrices, composition of operators is associative: A(Bf) = (AB)f.
2.6. Composing solution operators: The solution operator G(s_1) moves the value function backward in time by the amount s_1, which is written f(t - s_1) = G(s_1)f(t). The operator G(s_2) moves it back an additional s_2, i.e. f(t - (s_1 + s_2)) = G(s_2)f(t - s_1) = G(s_2)G(s_1)f(t). The result is to move f back by s_1 + s_2 in total, which is the same as applying G(s_1 + s_2). This shows that for every (allowed) f, G(s_2)G(s_1)f = G(s_2 + s_1)f, which means that
\[ G(s_2)G(s_1) = G(s_2 + s_1) . \tag{24} \]
This is called the semigroup property. It is a basic property of the solution operator for any problem. The matrix analogue for Markov chains is P^{s_2+s_1} = P^{s_2}P^{s_1}, which is a basic fact about powers of matrices having nothing to do with Markov chains. The property (24) would be called the group property if we were to allow negative s_2 or s_1, which we do not. Negative s is allowed in the matrix version if P is nonsingular. There is no particular physical reason for the transition matrix of a Markov chain to be nonsingular.
2.7. Operator kernels: If matrix A has elements A_{jk}, we can compute g = Af by doing the sum g_j = Σ_k A_{jk} f_k. Similarly, operator A may or may not have a kernel⁵, which is a function A(x, y) so that g = Af is represented by
\[ g(x) = \int A(x, y)\,f(y)\,dy . \]
If operators A and B both have kernels, then the composite operator has the kernel
\[ (AB)(x, y) = \int A(x, z)\,B(z, y)\,dz . \tag{25} \]
To derive this formula, set g = Bf and h = Ag. Then h(x) = ∫A(x, z)g(z)dz and g(z) = ∫B(z, y)f(y)dy implies that
\[ h(x) = \int \Bigl( \int A(x, z)\,B(z, y)\,dz \Bigr) f(y)\,dy . \]
This shows that (25) is the kernel of AB. The formula is analogous to the formula for matrix multiplication.
2.8. The semigroup property: When we defined (18) the solution operators G(s), we did so by specifying the kernels
\[ G(x, y, s) = \frac{1}{\sqrt{2\pi s}}\, e^{-(x-y)^2/2s} . \]
According to (25), the semigroup property should be an integral identity involving G. The identity is
\[ G(x, y, s_2 + s_1) = \int G(x, z, s_2)\,G(z, y, s_1)\,dz . \]
More concretely:
\[ \frac{1}{\sqrt{2\pi(s_2+s_1)}}\, e^{-(x-y)^2/2(s_2+s_1)} = \frac{1}{\sqrt{2\pi s_2}}\,\frac{1}{\sqrt{2\pi s_1}} \int e^{-(x-z)^2/2s_2}\, e^{-(z-y)^2/2s_1}\,dz . \]
⁵ The term kernel also describes vectors f with Af = 0; it is unfortunate that the same word is used for these different objects.
The reader is encouraged to verify this by direct integration. It also can be verified by recognizing it as the statement that adding independent mean zero Gaussian random variables with variances s_2 and s_1 respectively gives a Gaussian with variance s_2 + s_1.
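For readers who prefer a numerical check to the direct integration, here is a short sketch (mine, not from the notes) that verifies the kernel identity by quadrature; the points and variances are arbitrary choices.

    import numpy as np

    # Convolving heat kernels with variances s1 and s2 reproduces
    # the heat kernel with variance s1 + s2.
    def G(x, y, s):
        return np.exp(-(x - y)**2 / (2.0 * s)) / np.sqrt(2.0 * np.pi * s)

    s1, s2 = 0.3, 0.8
    z = np.linspace(-15, 15, 6001)      # quadrature grid for the z integral
    dz = z[1] - z[0]

    for x, y in [(0.0, 0.0), (1.0, -0.5), (2.0, 1.5)]:
        lhs = G(x, y, s1 + s2)
        rhs = np.sum(G(x, z, s2) * G(z, y, s1)) * dz
        print(x, y, lhs, rhs)           # lhs and rhs agree to quadrature error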
2.9. Fundamental solution: The operators G(t) form a fundamental solution⁶ for the problem ∂_t f + Lf = 0 if
\[ \partial_t G = LG \quad \text{for } t > 0 , \tag{26} \]
\[ G(0) = I . \tag{27} \]
The property (26) really means that ∂_t (G(t)f) = L(G(t)f) for any f. If G(t) has a kernel G(x, y, t), this in turn means (as the reader should check) that
\[ \partial_t G(x, y, t) = L_x\, G(x, y, t) , \tag{28} \]
where L_x means that the derivatives in L are with respect to the x variables in G. In our case, with G being the heat kernel, this is
\[ \partial_t\, \frac{1}{\sqrt{2\pi t}}\, e^{-(x-y)^2/2t} = \frac{1}{2}\,\partial_x^2\, \frac{1}{\sqrt{2\pi t}}\, e^{-(x-y)^2/2t} , \]
which we have checked and rechecked.
Without matrices, we still have the identity operator: If = f for all f. The property (27) really means that G(t)f → f as t → 0. It is easy to verify this for our heat kernel provided that f is continuous.
2.10. Duhamel with fundamental solution operator: The g appearing in (20) may be expressed as g(t, t') = G(t' - t)V(t'), where V(t') is the function with values V(x, t'). This puts (20) in the form
\[ f(t) = \int_t^T G(t'-t)\,V(t')\,dt' . \tag{29} \]
We illustrate the properties of the fundamental solution operator by verifying (29) directly. We want to show that (29) implies that ∂_t f + Lf = -V(t) and f(T) = 0. The latter is clear. For the former we compute ∂_t f(t) by differentiating the right side of (29):
\[ \partial_t \int_t^T G(t'-t)\,V(t')\,dt' = -G(t-t)V(t) - \int_t^T G'(t'-t)\,V(t')\,dt' . \]
We write G'(t) to represent ∂_t G(t). This allows us to write ∂_t G(t'-t) = -G'(t'-t) = -LG(t'-t). Continuing, the derivative is
\[ -V(t) - \int_t^T LG(t'-t)\,V(t')\,dt' . \]
⁶ We have adjusted this definition from its original form in books on ordinary differential equations to accommodate the backward evolution of the backward equation. This amounts to reversing the sign of L.
If we take L outside the integral, we recognize what is left in the integral as f(t). Altogether, we have ∂_t f = -V(t) - Lf(t), which is exactly (19).
2.11. Green's function: Consider the solution formula for the homogeneous final value problem ∂_t f + Lf = 0, f(T) = V:
\[ f(x,t) = \int G(x, y, T-t)\,V(y)\,dy . \tag{30} \]
Consider a special "jackpot" payout V(y) = δ(y - x_0). If you like, you can think of V(y) = 1/(2ε) when |y - x_0| < ε, then let ε → 0. We then get f(x, t) = G(x, x_0, T-t). The function that satisfies ∂_t G + L_x G = 0, G(x, T) = δ(x - x_0), is called the Green's function⁷. The Green's function represents the result of a point mass payout. A general payout can be expressed as a sum (integral) of point mass payouts at x_0 with weight V(x_0):
\[ V(y) = \int V(x_0)\,\delta(y - x_0)\,dx_0 . \]
Since the backward equation is linear, the general value function will be the weighted sum (integral) of the point mass value functions, which is the formula (30).
2.12. More generally: Brownian motion is special in that G(x, y, t) is a function of x - y. This is because Brownian motion is translation invariant: a Brownian motion starting from any point looks like a Brownian motion starting from any other point. Brownian motion is also special in that the forward equation and backward equation are nearly the same, having the same spatial operator L = ½∂_x².
More general diffusion processes lose both these properties. The solution operator depends in a more complicated way on x and y. The backward equation is ∂_t f + Lf = 0 but the forward equation is ∂_t u = L*u. The Green's function G(x, y, t) is the fundamental solution for the backward equation in the x, t variables with y as a parameter. It also is the fundamental solution to the forward equation in the y, t variables with x as a parameter. This material will be in a future lecture.

⁷ This is in honor of a 19th century Englishman named Green.
Stochastic Calculus Notes, Lecture 7
Last modified December 3, 2004
1 The Ito integral with respect to Brownian motion
1.1. Introduction: Stochastic calculus is about systems driven by noise. The Ito calculus is about systems driven by white noise, which is the derivative of Brownian motion. To find the response of the system, we integrate the forcing, which leads to the Ito integral of a function against the derivative of Brownian motion.
The Ito integral, like the Riemann integral, has a definition as a certain limit. The fundamental theorem of calculus allows us to evaluate Riemann integrals without returning to the original definition. Ito's lemma plays that role for Ito integration. Ito's lemma has an extra term not present in the fundamental theorem that is due to the non smoothness of Brownian motion paths. We will explain the formal rule dW² = dt, and its meaning.
In this section, standard one dimensional Brownian motion is W(t) (W(0) = 0, E[W(t)²] = t). The change in Brownian motion in time dt is formally called dW(t). The independent increments property implies that dW(t) is independent of dW(t') when t ≠ t'. Therefore, the dW(t) are a model of driving noise impulses acting on a system that are independent from one time to another. We want a rule to add up the cumulative effects of these impulses. In the first instance, this is the integral
\[ Y(T) = \int_0^T F(t)\,dW(t) . \tag{1} \]
Our plan is to lay out the principal ideas first, then address the mathematical foundations for them later. There will be many points in the beginning paragraphs where we appeal to intuition rather than to mathematical analysis in making a point. To justify this approach, I (mis)quote a snippet of a poem I memorized in grade school: "So you have built castles in the sky. That is where they should be. Now put the foundations under them." (Author unknown by me.)
1.2. The Ito integral: Let ℱ_t be the filtration generated by Brownian motion up to time t, and let F(t) ∈ ℱ_t be an adapted stochastic process. Corresponding to the Riemann sum approximation to the Riemann integral, we define the following approximations to the Ito integral
\[ Y_{\Delta t}(t) = \sum_{t_k < t} F(t_k)\,\Delta W_k , \tag{2} \]
with the usual notation t_k = kΔt and ΔW_k = W(t_{k+1}) - W(t_k). If the limit exists, the Ito integral is
\[ Y(t) = \lim_{\Delta t \to 0} Y_{\Delta t}(t) . \tag{3} \]
There is some flexibility in this definition, though far less than with the Riemann integral. It is absolutely essential that we use the forward difference rather than, say, the backward difference ((wrong) ΔW_k = W(t_k) - W(t_{k-1})), so that
\[ E\bigl[ F(t_k)\,\Delta W_k \;\big|\; \mathcal{F}_{t_k} \bigr] = 0 . \tag{4} \]
Each of the terms in the sum (2) is measurable in ℱ_t, therefore Y_{Δt}(t) is also. If we evaluate at the discrete times t_n, Y_{Δt} is a martingale:
\[ E\bigl[ Y_{\Delta t}(t_{n+1}) \;\big|\; \mathcal{F}_{t_n} \bigr] = Y_{\Delta t}(t_n) . \]
In the limit Δt → 0 this should make Y(t) also a martingale measurable in ℱ_t.
1.3. Famous example: The simplest interesting integral with an F(t) that is random is
\[ Y(T) = \int_0^T W(t)\,dW(t) . \]
If W(t) were differentiable with derivative Ẇ, we could calculate the limit of (2) using dW(t) = Ẇ(t)dt as
\[ \text{(wrong)} \quad \int_0^T W(t)\,\dot{W}(t)\,dt = \frac{1}{2}\int_0^T \partial_t\bigl(W(t)^2\bigr)\,dt = \frac{1}{2}\,W(T)^2 . \quad \text{(wrong)} \tag{5} \]
But this is not what we get from the definition (2) with an actual rough Brownian motion path. Instead we write
\[ W(t_k) = \frac{1}{2}\bigl(W(t_{k+1}) + W(t_k)\bigr) - \frac{1}{2}\bigl(W(t_{k+1}) - W(t_k)\bigr) , \]
and get
\[ Y_{\Delta t}(t_n) = \sum_{k<n} W(t_k)\bigl(W(t_{k+1}) - W(t_k)\bigr) \]
\[ = \sum_{k<n} \frac{1}{2}\bigl(W(t_{k+1}) + W(t_k)\bigr)\bigl(W(t_{k+1}) - W(t_k)\bigr) - \sum_{k<n} \frac{1}{2}\bigl(W(t_{k+1}) - W(t_k)\bigr)^2 \]
\[ = \sum_{k<n} \frac{1}{2}\bigl(W(t_{k+1})^2 - W(t_k)^2\bigr) - \sum_{k<n} \frac{1}{2}\bigl(W(t_{k+1}) - W(t_k)\bigr)^2 . \]
The first sum on the bottom right telescopes to (since W(0) = 0) ½W(t_n)².
The second term is a sum of n independent random variables, each with expected value Δt/2 and variance Δt²/2. As a result, the sum is a random variable with mean nΔt/2 = t_n/2 and variance nΔt²/2 = t_nΔt/2. This implies that
\[ \frac{1}{2}\sum_{t_k < T}\bigl(W(t_{k+1}) - W(t_k)\bigr)^2 \;\longrightarrow\; T/2 \quad \text{as } \Delta t \to 0 . \tag{6} \]
Together, these results give the correct Ito answer
\[ \int_0^T W(t)\,dW(t) = \frac{1}{2}\bigl( W(T)^2 - T \bigr) . \tag{7} \]
The difference between the right answer (7) and the wrong answer (5) is the T/2 coming from (6). This is a quantitative consequence of the roughness of Brownian motion paths. If W(t) were a differentiable function of t, that term would have the approximate value
\[ \Delta t \int_0^T \Bigl(\frac{dW}{dt}\Bigr)^2 dt \;\to\; 0 \quad \text{as } \Delta t \to 0 . \]
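Here is a quick pathwise illustration (my sketch, not in the original notes) of the T/2 correction: on one finely sampled Brownian path, the forward difference sums (2) land near (W(T)² − T)/2, while backward difference sums, discussed in the next paragraph, land near (W(T)² + T)/2.

    import numpy as np

    rng = np.random.default_rng(3)
    T, n = 1.0, 2**18
    dt = T / n
    dW = rng.normal(0.0, np.sqrt(dt), n)
    W = np.concatenate(([0.0], np.cumsum(dW)))   # W(t_0), ..., W(t_n)

    ito_sum = np.sum(W[:-1] * dW)                # forward difference (Ito)
    backward_sum = np.sum(W[1:] * dW)            # backward difference
    print(ito_sum, 0.5 * (W[-1]**2 - T))         # these two agree closely
    print(backward_sum, 0.5 * (W[-1]**2 + T))    # and so do these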
1.4. Backward differencing, etc.: If we use the backward difference ΔW_k = W(t_k) - W(t_{k-1}), then the martingale property (4) does not hold. For example, if F(t) = W(t) as above, then the right side changes from zero to (W(t_n) - W(t_{n-1}))W(t_n) (all quantities measurable in ℱ_{t_n}), which has expected value¹ Δt. In fact, if we use the backward difference and follow the argument used to get (7), we get instead ½(W(T)² + T). In addition to the Ito integral there is a Stratonovich integral, which uses the central difference ΔW_k = ½(W(t_{k+1}) - W(t_{k-1})). The Stratonovich definition makes the stochastic integral act more like a Riemann integral. In particular, the reader can check that the Stratonovich integral of W dW is ½W(T)².
1.5. Martingales: The Ito integral is a martingale. It was dened for that
purpose. Often one can compute an Ito integral by starting with the ordinary
calculus guess (such as
1
2
W(T)
2
) and asking what needs to change to make the
answer a martingale. In this case, the balancing term T/2 does the trick.
1.6. The Ito differential: Ito's lemma is a formula for the Ito differential, which, in turn, is defined using the Ito integral. Let F(t) be a stochastic process. We say dF = a(t)dW(t) + b(t)dt (the Ito differential) if
\[ F(T) - F(0) = \int_0^T a(t)\,dW(t) + \int_0^T b(t)\,dt . \tag{8} \]
The first integral on the right is an Ito integral and the second is a Riemann integral. Both a(t) and b(t) may be stochastic processes (random functions of
¹ E[(W(t_n) - W(t_{n-1}))W(t_{n-1})] = 0, so E[(W(t_n) - W(t_{n-1}))W(t_n)] = E[(W(t_n) - W(t_{n-1}))(W(t_n) - W(t_{n-1}))] = Δt.
time). For example, the Ito differential of W(t)² is
\[ d\bigl(W(t)^2\bigr) = 2W(t)\,dW(t) + dt , \]
which we verify by checking that
\[ W(T)^2 = 2\int_0^T W(t)\,dW(t) + \int_0^T dt . \]
This is a restatement of (7).
1.7. Ito's lemma: The simplest version of Ito's lemma involves a function f(w, t). The lemma is the formula (which must have been stated as a lemma in one of his papers):
\[ df(W(t), t) = \partial_w f(W(t), t)\,dW(t) + \tfrac{1}{2}\,\partial_w^2 f(W(t), t)\,dt + \partial_t f(W(t), t)\,dt . \tag{9} \]
According to the definition of the Ito differential, this means that
\[ f(W(T), T) - f(W(0), 0) \tag{10} \]
\[ = \int_0^T \partial_w f(W(t), t)\,dW(t) + \int_0^T \Bigl( \tfrac{1}{2}\,\partial_w^2 f(W(t), t) + \partial_t f(W(t), t) \Bigr)\,dt . \tag{11} \]
1.8. Using Ito's lemma to evaluate an Ito integral: Like the fundamental theorem of calculus, Ito's lemma can be used to evaluate integrals. For example, consider
\[ Y(T) = \int_0^T W(t)^2\,dW(t) . \]
A naive guess might be ⅓W(T)³, which would be the answer for a differentiable function. To check this, we calculate (using (9), ∂_w(⅓w³) = w², and ½∂_w²(⅓w³) = w)
\[ d\bigl(\tfrac{1}{3}W(t)^3\bigr) = W(t)^2\,dW(t) + W(t)\,dt . \]
This implies that
\[ \tfrac{1}{3}W(T)^3 = \int_0^T d\bigl(\tfrac{1}{3}W(t)^3\bigr) = \int_0^T W(t)^2\,dW(t) + \int_0^T W(t)\,dt , \]
which in turn gives
\[ \int_0^T W(t)^2\,dW(t) = \tfrac{1}{3}W(T)^3 - \int_0^T W(t)\,dt . \]
This seems to be the end. There is no way to integrate Z(T) = ∫_0^T W(t)dt to get a function of W(T) alone. This is to say that Z(T) is not measurable in 𝒢_T, the σ-algebra generated by W(T) alone. In fact, Z(T) depends equally on all W(t) values for 0 ≤ t ≤ T. A more technical version of this remark is coming after the discussion of the Brownian bridge.
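A pathwise numerical check of the identity just derived (my sketch; step count and seed are arbitrary choices):

    import numpy as np

    # Check  \int_0^T W^2 dW  =  W(T)^3/3 - \int_0^T W dt  on one fine path,
    # using forward-difference Ito sums and a Riemann sum for the time integral.
    rng = np.random.default_rng(4)
    T, n = 1.0, 2**18
    dt = T / n
    dW = rng.normal(0.0, np.sqrt(dt), n)
    W = np.concatenate(([0.0], np.cumsum(dW)))

    lhs = np.sum(W[:-1]**2 * dW)                  # Ito sum for \int W^2 dW
    rhs = W[-1]**3 / 3.0 - np.sum(W[:-1]) * dt    # W(T)^3/3 - \int W dt
    print(lhs, rhs)                               # agree up to O(sqrt(dt)) error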
1.9. To tell a martingale: Suppose F(t) is an adapted stochastic process with dF(t) = a(t)dW(t) + b(t)dt. Then F is a martingale if and only if b(t) = 0. We call a(t)dW(t) the martingale part and b(t)dt the drift term. If b(t) is at all continuous, then it can be identified through (because E[∫ a(s)dW(s) | ℱ_t] = 0)
\[ E\bigl[ F(t+\Delta t) - F(t) \;\big|\; \mathcal{F}_t \bigr] = E\Bigl[ \int_t^{t+\Delta t} b(s)\,ds \;\Big|\; \mathcal{F}_t \Bigr] = b(t)\,\Delta t + o(\Delta t) . \tag{12} \]
We give one and a half of the two parts of the proof of this theorem. If b = 0 for all t (and all, or almost all, ω), then F(T) is an Ito integral and hence a martingale. If b(t) is a continuous function of t, then we may find a t* and δ > 0 and ε > 0 so that, say, b(t) > ε > 0 when |t - t*| < δ. Then E[F(t* + δ) - F(t* - δ)] > 2δε > 0, so F is not a martingale².
1.10. Deriving a backward equation: Ito's lemma gives a quick derivation of backward equations. For example, take
\[ f(W(t), t) = E\bigl[ V(W(T)) \;\big|\; \mathcal{F}_t \bigr] . \]
The tower property tells us that F(t) = f(W(t), t) is a martingale. But Ito's lemma, together with the previous paragraph, implies that f(W(t), t) is a martingale if and only if ∂_t f + ½∂_w²f = 0, which is the backward equation for this case. In fact, the proof of Ito's lemma (below) is much like the proof of this backward equation.
1.11. A backward equation with drift: The derivation of the backward equation for
\[ f(w, t) = E_{w,t}\Bigl[ \int_t^T V(W(s), s)\,ds \Bigr] \]
uses the above, plus (12). Again using
\[ F(t) = E\Bigl[ \int_t^T V(W(s), s)\,ds \;\Big|\; \mathcal{F}_t \Bigr] , \]
with F(t) = f(W(t), t), we calculate
\[ E\bigl[ F(t+\Delta t) - F(t) \;\big|\; \mathcal{F}_t \bigr] = -E\Bigl[ \int_t^{t+\Delta t} V(W(s), s)\,ds \;\Big|\; \mathcal{F}_t \Bigr] = -V(W(t), t)\,\Delta t + o(\Delta t) . \]
² This is a somewhat incorrect version of the proof because δ, ε, and t* probably are random. There is a real proof something like this.
This says that dF(t) = a(t)dW(t) + b(t)dt where
\[ b(t) = -V(W(t), t) . \]
But also, b(t) = ∂_t f + ½∂_w²f. Equating these gives the backward equation from Lecture 6:
\[ \partial_t f + \tfrac{1}{2}\,\partial_w^2 f + V(w, t) = 0 . \]
1.12. Proof of Ito's lemma: We want to show that
\[ f(W(T), T) - f(W(0), 0) = \int_0^T f_w(W(t), t)\,dW(t) + \int_0^T f_t(W(t), t)\,dt + \frac{1}{2}\int_0^T f_{ww}(W(t), t)\,dt . \tag{13} \]
Define Δt = T/n, t_k = kΔt, W_k = W(t_k), ΔW_k = W(t_{k+1}) - W(t_k), and f_k = f(W_k, t_k), and write
\[ f_n - f_0 = \sum_{k=0}^{n-1} \bigl( f_{k+1} - f_k \bigr) . \tag{14} \]
Taylor series expansion of the terms on the right of (14) will produce terms that converge to the three integrals on the right of (13) plus error terms that converge to zero. In our pre-Ito derivations of backward equations, we used the relation E[(ΔW)²] = Δt. Here we argue that with many independent ΔW_k, we may replace (ΔW_k)² with Δt (its mean value).
The Taylor series expansion is
\[ f_{k+1} - f_k = \partial_w f_k\,\Delta W_k + \tfrac{1}{2}\,\partial_w^2 f_k\,(\Delta W_k)^2 + \partial_t f_k\,\Delta t + R_k , \tag{15} \]
where ∂_w f_k means ∂_w f(W(t_k), t_k), etc. The remainder has the bound³
\[ |R_k| \le C\bigl( \Delta t^2 + \Delta t\,|\Delta W_k| + |\Delta W_k|^3 \bigr) . \]
Finally, we separate the mean value of ΔW_k² from the deviation from the mean:
\[ \tfrac{1}{2}\,\partial_w^2 f_k\,\Delta W_k^2 = \tfrac{1}{2}\,\partial_w^2 f_k\,\Delta t + \tfrac{1}{2}\,\partial_w^2 f_k\,\bigl(\Delta W_k^2 - \Delta t\bigr) . \]
The individual summands on the right side all have order of magnitude Δt. However, the mean zero terms (the second sum) add up to much less than the first sum, as we will see. With this, (14) takes the form
\[ f_n - f_0 = \sum_{k=0}^{n-1} \partial_w f_k\,\Delta W_k + \sum_{k=0}^{n-1} \partial_t f_k\,\Delta t + \frac{1}{2}\sum_{k=0}^{n-1} \partial_w^2 f_k\,\Delta t + \frac{1}{2}\sum_{k=0}^{n-1} \partial_w^2 f_k\,\bigl(\Delta W_k^2 - \Delta t\bigr) + \sum_{k=0}^{n-1} R_k . \tag{16} \]
³ We assume that f(w, t) is thrice differentiable with bounded third derivatives. The error in a finite Taylor approximation is bounded by the size of the largest terms not used. Here, that is Δt² (for the omitted term ∂_t²f), Δt(ΔW)² (for ∂_t∂_w f), and ΔW³ (for ∂_w³f).
The first three sums on the right converge respectively to the corresponding integrals on the right side of (13). A technical digression will show that the last two converge to zero as n → ∞ in a suitable way.
1.13. Like Borel Cantelli: As much as the formulas, the proofs in stochastic calculus rely on calculating expected values of things. Here, S_m is a sequence of random numbers and we want to show that S_m → 0 as m → ∞ (almost surely). We use two observations. First, if s_m is a sequence of numbers with Σ_{m=1}^∞ |s_m| < ∞, then s_m → 0 as m → ∞. Second, if B ≥ 0 is a random variable with E[B] < ∞, then B < ∞ almost surely (if the event B = ∞ has positive probability, then E[B] = ∞). We take B = Σ_{m=1}^∞ |S_m|. If B < ∞ then Σ_{m=1}^∞ |S_m| < ∞, so S_m → 0 as m → ∞. What this shows is:
\[ \Bigl( \sum_{m=1}^{\infty} E\bigl[\,|S_m|\,\bigr] < \infty \Bigr) \;\Longrightarrow\; \Bigl( S_m \to 0 \ \text{as } m \to \infty \ \text{(a.s.)} \Bigr) . \tag{17} \]
This observation is a variant of the Borel Cantelli lemma, which often is used in such arguments.
1.14. One of the error terms: To apply the Borel Cantelli lemma we must find bounds for the error terms, bounds whose sum is finite. We start with the last error term in (16). Choose n = 2^m and define S_m = Σ_{k=0}^{n-1} R_k, with
\[ |R_k| \le C\bigl( \Delta t^2 + \Delta t\,|\Delta W_k| + |\Delta W_k|^3 \bigr) . \]
Since E[|ΔW_k|] ≤ C√Δt and E[|ΔW_k|³] ≤ CΔt^{3/2} (you do the math – the integrals), this gives (with nΔt = T)
\[ E\bigl[\,|S_m|\,\bigr] \le Cn\bigl( \Delta t^2 + \Delta t^{3/2} \bigr) \le C\,T\,\sqrt{\Delta t} . \]
Expressed in terms of m, we have Δt = T/2^m and √Δt = √T·2^{-m/2} = √T(√2)^{-m}. Therefore E[|S_m|] ≤ C(T)(√2)^{-m}. Now, if z is any number greater than one, then Σ_{m=1}^∞ z^{-m} = 1/(z - 1) < ∞. This implies that Σ_{m=1}^∞ E[|S_m|] < ∞ and (using Borel Cantelli) that S_m → 0 as m → ∞ (almost surely).
This argument would not have worked this way had we taken n = m instead of n = 2^m. The error bounds of order 1/√n would not have had a finite sum. If both error terms in the bottom line of (16) go to zero as m → ∞ with n = 2^m, this will prove Ito's lemma. We will return to this point when we discuss the difference between almost sure convergence, which we are using here, and convergence in probability, which we are not.
1.15. The other sum: The other error sum in (16) is small not because of the smallness of its terms, but because of cancellation. The positive and negative terms roughly balance, leaving a sum smaller than the sizes of the terms would suggest. This cancellation is of the same sort appearing in the central limit theorem, where Σ_{k=0}^{n-1} X_k = U_n is of order √n rather than n when the X_k are i.i.d. with mean zero and finite variance. In fact, using a trick we used before, we show that U_n² is of order n rather than n²:
\[ E\bigl[U_n^2\bigr] = \sum_{j,k} E[X_j X_k] = n\,E\bigl[X_k^2\bigr] = cn . \]
Our sum is
\[ U_n = \sum_k \tfrac{1}{2}\,\partial_w^2 f(W_k, t_k)\,\bigl( \Delta W_k^2 - \Delta t \bigr) . \]
The above argument applies, though the terms are not independent. Suppose j ≠ k and, say, k > j. The cross term involving ΔW_j and ΔW_k still vanishes because
\[ E\bigl[ \Delta W_k^2 - \Delta t \;\big|\; \mathcal{F}_{t_k} \bigr] = 0 , \]
and the rest is in ℱ_{t_k}. Also (as we have used before)
\[ E\bigl[ (\Delta W_k^2 - \Delta t)^2 \;\big|\; \mathcal{F}_{t_k} \bigr] = 2\,\Delta t^2 . \]
Therefore
\[ E\bigl[U_n^2\bigr] = \frac{1}{4}\sum_{k=0}^{n-1} E\Bigl[ \bigl( \partial_w^2 f(W_k, t_k) \bigr)^2 \Bigr]\, 2\,\Delta t^2 \le C(T)\,\Delta t . \]
As before, we take n = 2^m and sum to find that U_{2^m}² → 0 as m → ∞, which of course implies that U_{2^m} → 0 as m → ∞ (almost surely).
1.16. Convergence of Ito sums: Choose Δt and define t_k = kΔt and W_k = W(t_k). To approximate the Ito integral
\[ Y(T) = \int_0^T F(t)\,dW(t) , \]
we have the Ito sums
\[ Y_m(T) = \sum_{t_k < T} F(t_k)\,\bigl( W_{k+1} - W_k \bigr) , \tag{18} \]
where Δt = 2^{-m}. In proving convergence of Riemann sums to the Riemann integral, we assume that the integrand is continuous. Here, we will prove that lim_{m→∞} Y_m(T) exists under the hypothesis
\[ E\Bigl[ \bigl( F(t+\Delta t) - F(t) \bigr)^2 \Bigr] \le C\,\Delta t . \tag{19} \]
This is natural in that it represents the smoothness of Brownian motion paths. We will discuss what can be done for integrands more rough than (19).
The trick is to compare Y_m with Y_{m+1}, which is to compare the Δt approximation to the Δt/2 approximation. For that purpose, define t_{k+1/2} = (k + ½)Δt, W_{k+1/2} = W(t_{k+1/2}), etc. The t_k term in the Y_m sum corresponds to the time interval (t_k, t_{k+1}). The Y_{m+1} sum divides this interval into two subintervals of length Δt/2. Therefore, for each term in the Y_m sum there are two corresponding terms in the Y_{m+1} sum (assuming T is a multiple of Δt), and:
\[ Y_{m+1}(T) - Y_m(T) = \sum_{t_k < T} \Bigl[ F(t_k)(W_{k+1/2} - W_k) + F(t_{k+1/2})(W_{k+1} - W_{k+1/2}) - F(t_k)(W_{k+1} - W_k) \Bigr] \]
\[ = \sum_{t_k < T} (W_{k+1} - W_{k+1/2})\bigl( F(t_{k+1/2}) - F(t_k) \bigr) = \sum_{t_k < T} R_k , \]
where
\[ R_k = (W_{k+1} - W_{k+1/2})\bigl( F(t_{k+1/2}) - F(t_k) \bigr) . \]
We compute E[(Y_{m+1}(T) - Y_m(T))²] = Σ_{j,k} E[R_j R_k]. As before,⁴ E[R_j R_k] = 0 unless j = k. Also, the independent increments property and (19) imply that⁵
\[ E\bigl[R_k^2\bigr] = E\Bigl[ \bigl( W_{k+1} - W_{k+1/2} \bigr)^2 \Bigr]\, E\Bigl[ \bigl( F(t_{k+1/2}) - F(t_k) \bigr)^2 \Bigr] \le \frac{\Delta t}{2}\, C\,\frac{\Delta t}{2} = C\,\Delta t^2 . \]
This gives
\[ E\Bigl[ \bigl( Y_{m+1}(T) - Y_m(T) \bigr)^2 \Bigr] \le C\,2^{-m} . \tag{20} \]
The convergence of the Ito sums follows from (20) using our Borel Cantelli type lemma. Let S_m = Y_{m+1} - Y_m. From (20), we have⁶ E[|S_m|] ≤ C√(2^{-m}). Thus
\[ \lim_{m\to\infty} Y_m(T) = Y_1(T) + \sum_{m\ge 1} \bigl( Y_{m+1}(T) - Y_m(T) \bigr) \]
exists and is finite. This shows that the limit defining the Ito integral exists, at least in the case of an integrand that satisfies (19), which includes most of the cases we use.
1.17. Ito isometry formula: This is the formula
\[ E\Bigl[ \Bigl( \int_{T_1}^{T_2} a(t)\,dW(t) \Bigr)^{\!2} \Bigr] = \int_{T_1}^{T_2} E\bigl[a(t)^2\bigr]\,dt . \tag{21} \]
⁴ If j > k then E[W_{j+1} - W_{j+1/2} | ℱ_{t_{j+1/2}}] = 0, so E[R_j R_k | ℱ_{t_{j+1/2}}] = 0, and E[R_j R_k] = 0.
⁵ Mathematicians often use the same letter C to represent different constants in the same formula. For example, writing C₁Δt + C₂Δt ≤ CΔt really means: let C = C₁ + C₂; if u ≤ C₁Δt and v ≤ C₂Δt, then u + v ≤ CΔt. Instead, we don't bother to distinguish between the various constants.
⁶ The Cauchy Schwarz inequality gives E[|S_m|] = E[|S_m|·1] ≤ (E[S_m²]E[1²])^{1/2} = E[S_m²]^{1/2}.
The derivation uses what we have just done. We approximate the Ito integral by the sum
\[ \sum_{T_1 \le t_k < T_2} a(t_k)\,\Delta W_k . \]
Because the different ΔW_k are independent, and because of the independent increments property, the expected square of this is
\[ \sum_{T_1 \le t_k < T_2} E\bigl[a(t_k)^2\bigr]\,\Delta t . \]
The formula (21) follows from this.
An application of this is to understand the roughness of Y(T) = ∫_0^T a(t)dW(t). If E[a(t)²] < C for all t ≤ T, then E[(Y(T_2) - Y(T_1))²] ≤ C(T_2 - T_1). This is the same roughness as Brownian motion itself.
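A Monte Carlo check of (21) (my sketch, not from the notes) with the random integrand a(t) = W(t), for which the right side is ∫_0^T t dt = T²/2:

    import numpy as np

    rng = np.random.default_rng(5)
    T, n_steps, n_paths = 1.0, 200, 100_000
    dt = T / n_steps

    dW = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n_steps))
    W = np.cumsum(dW, axis=1) - dW       # W(t_k) at the left endpoint of each step
    ito = np.sum(W * dW, axis=1)         # forward-difference Ito sums, one per path

    print(np.mean(ito**2), T**2 / 2)     # both close to 0.5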
1.18. White noise: White noise is a generalized function,⁷ ξ(t), which is thought of as homogeneous and Gaussian with ξ(t_1) independent of ξ(t_2) for t_1 ≠ t_2. More precisely, if t_0 < t_1 < ··· < t_n and Y_k = ∫_{t_k}^{t_{k+1}} ξ(t)dt, then the Y_k are independent and normal with zero mean and var(Y_k) = t_{k+1} - t_k. You can convince yourself that ξ(t) is not a true function by showing that it would have to have ∫_a^b ξ(t)²dt = ∞ for any a < b. Brownian motion can be thought of as the motion of a particle pushed by white noise, i.e. W(t) = ∫_0^t ξ(s)ds. The Y_k defined above then are the increments of Brownian motion and have the appropriate statistical properties (independent, mean zero, normal, variance t_{k+1} - t_k). These properties may be summarized by saying that ξ(t) has mean zero and
\[ \mathrm{cov}\bigl(\xi(t), \xi(t')\bigr) = E[\xi(t)\,\xi(t')] = \delta(t - t') . \tag{22} \]
For example, if f(t) and g(t) are deterministic functions and Y_f = ∫ f(t)dW(t) and Y_g = ∫ g(t)dW(t), then (22) implies that
\[ E[Y_f Y_g] = \int_t \int_{t'} f(t)\,g(t')\,E[\xi(t)\xi(t')]\,dt\,dt' = \int_t \int_{t'} f(t)\,g(t')\,\delta(t-t')\,dt\,dt' = \int_t f(t)\,g(t)\,dt . \]
It is tempting to take dW(t) = ξ(t)dt in the Ito integral and use (22) to derive the Ito isometry formula. However, this must be done carefully because the existence of the Ito integral, and the isometry formula, depend on the causality structure that makes dW(t) independent of a(t).

⁷ A generalized function is not an actual function, but has properties defined as though it were an actual function through integration. The δ function, for example, is defined by the formula ∫ f(t)δ(t)dt = f(0). No actual function can do this. Generalized functions also are called distributions.
2 Stochastic Differential Equations
2.1. Introduction: The theory of stochastic differential equations (SDE) is a framework for expressing dynamical models that include both random and non random forces. The theory is based on the Ito integral. Like the Ito integral, approximations based on finite differences that do not respect the martingale structure of the equation can converge to different answers. Solutions to Ito SDEs are Markov processes in that the future depends on the past only through the present. For this reason, the solutions can be studied using backward and forward equations, which turn out to be linear parabolic partial differential equations of diffusion type.
2.2. A Stochastic Differential Equation: An Ito stochastic differential equation takes the form
\[ dX(t) = a(X(t), t)\,dt + \sigma(X(t), t)\,dW(t) . \tag{23} \]
A solution is an adapted process that satisfies (23) in the sense that
\[ X(T) - X(0) = \int_0^T a(X(t), t)\,dt + \int_0^T \sigma(X(t), t)\,dW(t) , \tag{24} \]
where the first integral on the right is a Riemann integral and the second is an Ito integral. We often specify initial conditions X(0) ~ u_0(x), where u_0(x) is the given probability density for X(0). Specifying X(0) = x_0 is the same as saying u_0(x) = δ(x - x_0). As in the general Ito differential, a(X(t), t)dt is the drift term, and σ(X(t), t)dW(t) is the martingale term. We often call σ(x, t) the volatility. However, this is a different use of the letter σ from Black–Scholes, where the martingale term is σx for a constant σ (also called volatility).
2.3. Geometric Brownian motion: The SDE
\[ dX(t) = \mu X(t)\,dt + \sigma X(t)\,dW(t) , \tag{25} \]
with initial data X(0) = 1, defines geometric Brownian motion. In the general formulation above, (25) has drift coefficient a(x, t) = μx and volatility σ(x, t) = σx (with the conflict of terminology noted above). If W(t) were a differentiable function of t, the solution would be
\[ \text{(wrong)} \quad X(t) = e^{\mu t + \sigma W(t)} . \tag{26} \]
To check this, define the function x(w, t) = e^{\mu t + \sigma w}, with x_w = σx, x_t = μx, and x_{ww} = σ²x, so that the Ito differential of the trial function (26) is
\[ dX(W(t), t) = \mu X\,dt + \sigma X\,dW(t) + \frac{\sigma^2}{2}\,X\,dt . \]
We can remove the unwanted final term by multiplying by e^{-\sigma^2 t/2}, which suggests that the formula
\[ X(t) = e^{\mu t - \sigma^2 t/2 + \sigma W(t)} \tag{27} \]
satisfies (25). A quick Ito differentiation verifies that it does.
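A simulation sketch (mine, not from the notes): the Euler–Maruyama discretization X_{k+1} = X_k + μX_kΔt + σX_kΔW_k of (25), compared against the exact solution (27) driven by the same increments. The parameter values and step count are arbitrary choices.

    import numpy as np

    mu, sigma, T = 0.1, 0.4, 1.0
    rng = np.random.default_rng(6)
    n = 100_000
    dt = T / n
    dW = rng.normal(0.0, np.sqrt(dt), n)
    W = np.cumsum(dW)
    t = dt * np.arange(1, n + 1)

    # Euler-Maruyama for dX = mu*X dt + sigma*X dW
    X = np.empty(n + 1)
    X[0] = 1.0
    for k in range(n):
        X[k + 1] = X[k] + mu * X[k] * dt + sigma * X[k] * dW[k]

    # Exact solution (27) on the same path
    X_exact = np.exp(mu * t - 0.5 * sigma**2 * t + sigma * W)
    print(X[-1], X_exact[-1])    # close; the gap shrinks as dt -> 0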
2.4. Properties of geometric Brownian motion: Let us focus on the simple case μ = 0, σ = 1, so that
\[ dX(t) = X(t)\,dW(t) . \tag{28} \]
The solution, with initial data X(0) = 1, is the simple geometric Brownian motion
\[ X(t) = \exp\bigl( W(t) - t/2 \bigr) . \tag{29} \]
We discuss (29) in relation to the martingale property (X(t) is a martingale because the drift term in (23) is zero in (28)). A simple calculation based on
\[ X(t + t') = \exp\bigl( W(t) - t/2 + (W(t+t') - W(t)) - t'/2 \bigr) \]
and integrals of Gaussians shows that E[X(t + t') | ℱ_t] = X(t).
However, W(t) has the order of magnitude √t. For large t, this means that the exponent in (29) is roughly equal to -t/2, which suggests that
\[ X(t) = \exp\bigl( W(t) - t/2 \bigr) \approx e^{-t/2} \to 0 \quad \text{as } t \to \infty \quad \text{(a.s.)} \]
Therefore the expected value E[X(t)] = 1 is, for large t, not produced by typical paths, but by very exceptional ones. To be quantitative, there is an exponentially small probability that X(t) is as large as its expected value: P(X(t) ≥ 1) ≤ e^{-t/8} for large t.
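A quick experiment (my sketch) showing this numerically: for moderately large t, almost every sample of X(t) = exp(W(t) − t/2) is tiny, the median is about e^{−t/2}, and the fraction of samples with X(t) ≥ 1 is small, yet the sample mean hovers around 1 and is itself noisy, because it is carried by rare large samples.

    import numpy as np

    rng = np.random.default_rng(7)
    t = 8.0
    W_t = rng.normal(0.0, np.sqrt(t), size=1_000_000)   # W(t) sampled directly
    X_t = np.exp(W_t - t / 2.0)

    print(np.mean(X_t))          # near 1, but noisy: the mean comes from rare big samples
    print(np.median(X_t))        # about exp(-t/2) ~ 0.018
    print(np.mean(X_t >= 1.0))   # P(X(t) >= 1) ~ 0.08 here, and it shrinks as t grows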
2.5. Dominated convergence theorem: The dominated convergence theorem is about expected values of limits of random variables. Suppose X(t, ω) is a family of random variables and that lim_{t→∞} X(t, ω) = Y(ω) for almost every ω. The random variable U(ω) dominates the X(t, ω) if |X(t, ω)| ≤ U(ω) for almost every ω and for every t > 0. We often write this simply as X(t) → Y as t → ∞ a.s., and |X(t)| ≤ U a.s. The theorem states that if E[U] < ∞ then E[X(t)] → E[Y] as t → ∞. It is fairly easy to prove the theorem from the definition of abstract integration. The simplicity of the theorem is one of the ways abstract integration is simpler than Riemann integration.
The reason for mentioning this theorem here is that geometric Brownian motion (29) is an example showing what can go wrong without a dominating function. Although X(t) → 0 as t → ∞ a.s., the expected value of X(t) does not go to zero, as it would do if the conditions of the dominated convergence theorem were met. The reader is invited to study the maximal function, which is the random variable M = max_{t>0}(W(t) - t/2), in enough detail to show that E[e^M] = ∞.
2.6. Strong and weak solutions: A strong solution is an adapted function X(W, t), where the Brownian motion path W again plays the role of the abstract random variable ω. As in the discrete case, X(t) (i.e. X(W, t)) being measurable in ℱ_t means that X(t) is a function of the values of W(s) for 0 ≤ s ≤ t. The two examples we have, geometric Brownian motion (27) and the Ornstein Uhlenbeck process⁸
\[ X(t) = \int_0^t e^{-\gamma(t-s)}\,dW(s) , \tag{30} \]
both have this property. Note that (27) depends only on W(t), while (30) depends on the whole path up to time t.
A weak solution is a stochastic process, X(t), defined perhaps on a different probability space and filtration (Ω, ℱ_t), that has the statistical properties called for by (23). These are (using ΔX = X(t + Δt) - X(t)), roughly,⁹
\[ E[\Delta X \;|\; \mathcal{F}_t] = a(X(t), t)\,\Delta t + o(\Delta t) , \tag{31} \]
and
\[ E[\Delta X^2 \;|\; \mathcal{F}_t] = \sigma^2(X(t), t)\,\Delta t + o(\Delta t) . \tag{32} \]
We will see that a strong solution satisfies (31) and (32), so a strong solution is a weak solution. It makes no sense to ask whether a weak solution is a strong solution since we have no information on how, or even whether, the weak solution depends on W.
The formulas (31) and (32) are helpful in deriving SDE descriptions of physical or financial systems. We calculate the left sides to identify the a(x, t) and σ(x, t) in (23). Brownian motion paths and Ito integration are merely a tool for constructing the desired process X(t). We saw in the example of geometric Brownian motion that expressing the solution in terms of W(t) can be very convenient for understanding its properties. For example, it is not particularly easy to show that X(t) → 0 as t → ∞ from (31) and (32) with a = 0 and¹⁰ σ(x, t) = x.
2.7. Strong is weak: We just verify that the strong solution to (23) that
satises (24) also satises the weak form requirements (31) and (32). This is an
important motivation for using the Ito denition of dW rather than, say, the
Stratonovich denition.
A slightly more general fact is simpler to explain. Dene R and I by
R =
_
t+t
t
a(t)dt , I =
_
t+t
t
(t)dW(t) ,
⁸ This process satisfies the SDE dX = -γX dt + dW, with X(0) = 0.
⁹ The little o notation f(Δt) = g(Δt) + o(Δt) informally means that the difference between f and g is a mathematician's order of magnitude smaller than Δt for small Δt. Formally, it means that (f(Δt) - g(Δt))/Δt → 0 as Δt → 0.
¹⁰ This conflict of notation is common in discussing geometric Brownian motion. On the left is the coefficient of dW(t). On the right is the financial volatility coefficient.
where a(t) and (t) are continuous adapted stochastic processes. We want to
see that
E
_
R +I

T
t

= a(t) +o(t) , (33)


and
E
_
(R +I)
2

T
t

=
2
(t) +o(t) . (34)
We may leave I out of (33) because E[I] = 0 always. We may leave R out of (34)
because [I[ >> [R[. (If a is bounded then R = O(t) so E[R
2
[ T
t
] = O(t
2
).
The Ito isometry formula suggests that E[I
2
[ T
t
] = O(t). Cauchy Schwartz
then gives E[RI [ T
t
] = O(t
3/2
). Altogether, E[(R + I)
2
[ T
t
] = E[I
2
[
T
t
] +O(t
3/2
).)
To verify (33) without I, we assume that a(t) is a continuous function of t
in the sense that for s > t,
E
_
a(s) a(t)

T
t

0 as s t .
This implies that
1
t
_
t+t
t
E [a(s) a(t) [ T
t
] 0 as t 0,
so that
E
_
R

T
t

=
_
t+t
t
E [a(s) [ T
t
]
=
_
t+t
t
E [a(t) [ T
t
] +
_
t+t
t
E [a(s) a(t) [ T
t
]
= ta(t) +o(t) .
This veries (33). The Ito isometry formula gives
E
_
I
2

T
t

=
_
t+t
t
(s)
2
ds ,
so (34) follows in the same way.
2.8. Markov diffusions: Roughly speaking,¹¹ a diffusion process is a continuous stochastic process that satisfies (31) and (32). If the process is Markov, the a of (31) and the σ² of (32) must be functions of X(t) and t. If a(x, t) and σ(x, t) are Lipschitz (|a(x, t) - a(y, t)| ≤ C|x - y|, etc.) functions of x and t, then it is possible to express X(t) as a strong solution of an Ito SDE (23).
This is the way equations (23) are often derived in practice. We start off wanting to model a process with an SDE. It could be a random walk on a lattice with the lattice size converging to zero, or some other process that we hope will have a limit as a diffusion. The main step in proving the limit exists is tightness, which we hint at in a lecture to follow. We identify a and σ by calculations. Then we use the representation theorem to say that the process may be represented as the strong solution to (23).

¹¹ More detailed treatments are the books by Steele, Chung and Williams, Karatzas and Shreve, and Øksendal.
2.9. Backward equation: The simplest backward equation is the PDE satisfied by f(x, t) = E_{x,t}[V(X(T))]. We derive it using the weak form conditions (31) and (32) and the tower property. As with Brownian motion, the tower property gives
\[ f(x, t) = E_{x,t}[V(X(T))] = E_{x,t}[F(t + \Delta t)] , \]
where F(s) = E[V(X(T)) | ℱ_s]. The Markov property implies that F(s) is a function of X(s) alone, so F(s) = f(X(s), s). This gives
\[ f(x, t) = E_{x,t}\bigl[ f(X(t+\Delta t),\, t+\Delta t) \bigr] . \tag{35} \]
If we assume that f is a smooth function of x and t, we may expand in Taylor series, keeping only terms that contribute O(Δt) or more.¹² We use ΔX = X(t + Δt) - x and write f for f(x, t), f_t for f_t(x, t), etc.:
\[ f(X(t+\Delta t),\, t+\Delta t) = f + f_t\,\Delta t + f_x\,\Delta X + \tfrac{1}{2} f_{xx}\,\Delta X^2 + \text{smaller terms.} \]
Therefore (31) and (32) give:
\[ f(x, t) = E_{x,t}\bigl[ f(X(t+\Delta t),\, t+\Delta t) \bigr] = f(x, t) + f_t\,\Delta t + f_x\,E_{x,t}[\Delta X] + \tfrac{1}{2} f_{xx}\,E_{x,t}[\Delta X^2] + o(\Delta t) \]
\[ = f(x, t) + f_t\,\Delta t + f_x\,a(x, t)\,\Delta t + \tfrac{1}{2} f_{xx}\,\sigma^2(x, t)\,\Delta t + o(\Delta t) . \]
We now just cancel the f(x, t) from both sides, let Δt → 0, and drop the o(Δt) terms to get the backward equation
\[ \partial_t f(x, t) + a(x, t)\,\partial_x f(x, t) + \frac{\sigma^2(x, t)}{2}\,\partial_x^2 f(x, t) = 0 . \tag{36} \]
2.10. Forward equation: The forward equation follows from the backward equation by duality. Let u(x, t) be the probability density for X(t). Since f(x, t) = E_{x,t}[V(X(T))], we may write
\[ E[V(X(T))] = \int u(x, t)\,f(x, t)\,dx , \]
which is independent of t. Differentiating with respect to t and using the backward equation (36) for f_t, we get
\[ 0 = \int u(x, t)\,f_t(x, t)\,dx + \int u_t(x, t)\,f(x, t)\,dx \]
\[ = -\int u(x, t)\,a(x, t)\,\partial_x f(x, t)\,dx - \frac{1}{2}\int u(x, t)\,\sigma^2(x, t)\,\partial_x^2 f(x, t)\,dx + \int u_t(x, t)\,f(x, t)\,dx . \]

¹² The homework has more on the terms left out.
We integrate by parts to put the x derivatives on u. We may ignore boundary terms if u decays fast enough as |x| → ∞ and if f does not grow too fast. The result is
\[ \int \Bigl( \partial_x\bigl( a(x,t)\,u(x,t) \bigr) - \frac{1}{2}\,\partial_x^2\bigl( \sigma^2(x,t)\,u(x,t) \bigr) + \partial_t u(x,t) \Bigr) f(x,t)\,dx = 0 . \]
Since this should be true for every function f(x, t), the integrand must vanish identically, which implies that
\[ \partial_t u(x, t) = -\partial_x\bigl( a(x, t)\,u(x, t) \bigr) + \frac{1}{2}\,\partial_x^2\bigl( \sigma^2(x, t)\,u(x, t) \bigr) . \tag{37} \]
This is the forward equation for the Markov process that satisfies (31) and (32).
This is the forward equation for the Markov process that satises (31) and (32).
2.11. Transition probabilities: The transition probability density is the
probability density for X(s) given that X(t) = y and s > t. We write it as
G(y, s, t, s), the probabiltiy density to go from y to x as time goes from t to s.
If the drift and diusion coecients do not depend on t, then G is a function
of t s. Because G is a probability density in the x and s variables, it satises
the forward equation

s
G(y, x, t, s) =
x
(a(x, s)G(y, x, t, s)) +
1
2

2
x
_

2
(x, s)G(y, x, t, s)
_
. (38)
In this equation, t and y are merely parameters, but s may not be smaller than
t. The initial condition that represents the requirement that X(t) = y is
G(y, x, t, t) = (x y) . (39)
The transition density is the Greens function for the forward equation, which
means that the general solution may be written in terms of G as
u(x, s) =
_

u(y, t)G(y, x, t, s)dy . (40)


This formula is a continuous time version of the law of total probability: the
probability density to be at x at time s is the sum (integral) of the probability
density to be at x at time s conditional on being at y at time t (which is
G(y, x, t, s)) multiplied by the probability density to be at y at time s (which is
u(y, t)).
2.12. Greens function for the backward equation: We can also express the
solution of the backward equation in terms the transition probabilities G. For
s > t,
f(y, t) = E
y,t
[f(X(s), s)] ,
which is an expression of the tower property. The expected value on the right
may be evaluated using the transition probability density for X(s). The result
is
f(y, t) =
_

G(y, x, t, s)f(x, s)dx . (41)


For this to hold, G must satisfy the backward equation as a function of y and t
(which were parameters in (38). To show this, we apply the backward equation
operator (see below for terminology)
t
+a(y, t)
y
+
1
2

2
(y, t)
2
y
to both sides.
The left side gives zero because f satises the backward equation. Therefore we
nd that
0 =
_
_

t
+a(y, t)
y
+
1
2

2
(y, t)
2
y
_
G(y, x, t, s)f(x, s)dx
for any f(x, s). Therefore, we conclude that

t
G(y, x, t, s) +a(y, t)
y
G(y, x, t, s) +
1
2

2
(y, t)
2
y
G(y, x, t, s) = 0 . (42)
Here x and s are parameters. The nal condition for (42) is the same as (39).
The equality s = t represents the initial time for s and the nal time for t
because G is dened for all t s.
2.13. The generator: The generator of an Ito process is the operator con-
taining the spatial part of the backward equation
13
L(t) = a(x, t)
x
+
1
2

2
(x, t)
2
x
.
The backward equation is
t
f(x, t) + L(t)f(x, t) = 0. We write just L when a
and do not depend on t. For a general continuous time Markov process, the
generator is dened by the requirement that
d
dt
E[g(X(t), t)] = E [(L(t)g)(X(t), t) +g
t
(X(t), t)] , (43)
for a suciently rich (dense) family of functions g. This applies not only to Ito
processes (diusions), but also to jump diusions, continuous time birth/death
processes, continuous time Markov chains, etc. Part of the requirement is that
the limit dening the derivative on the left side should exist. Proving (43) for an
Ito process is more or less what we did when we derived the backward equation.
On the other hand, if we know (43) we can derive the backward equation by
requiring that
d
dt
E[f(X(t), t)] = 0.
2.14. Adjoint: The adjoint of L is another operator that we call L

. It is
dened in terms of the inner product
u, f) =
_

u(x)f(x)dx .
We leave out the t variable for the moment. If u and f are complex, we take
the complex conjugate of u above. The adjoint is dened by the requirement
that for general u and f,
u, Lf) = L

u, f) .
13
Some people include the time derivative in the denition of the generator. Watch for this.
In practice, this boils down to the same integration by parts we used to derive
the forward equation from the backward equation:
u, Lf) =
_

u(x)
_
a(x)
x
f(x) +
1
2

2
(x)
2
x
f(x)
_
dx
=
_

x
(a(x)u(x)) +
1
2

2
x
(
2
(x)u(x))
_
f(x)dx .
Putting the t dependence back in, we nd the action of L

on u to be
(L(t)

u)(x, t) =
x
(a(x, t)u(x, t)) +
1
2

2
x
(
2
(x, t)u(x, t)) .
The forward equation (37) then may be written

t
u = L(t)

u .
All we have done here is dene notation (L

) and show how our previous deriva-


tion of the forward equation is expressed in terms of it.
2.15. Adjoints and the Greens function: Let us summarize and record what
we have said about the transition probability density G(y, x, t, s). It is dened
for s t and has G(x, y, t, t) = (x y). It moves probabilities forward by
integrating over y (38) and moves expected values backward by integrating over
x ??). As a function of x and s it satises the forward equation

s
G(y, x, t, s) = (L

x
(t)G)(y, x, t, s) .
We write L

x
to indicate that the derivatives in L

are with respect to the x


variable:
(L

x
(t)G)(y, x, t, s) =
x
(a(x, t)G(y, x, t, s)) +
1
2

2
x
(
2
(x, t)G(y, x, t, s)) .
As a function of y and t it satises the backward equation

t
G(y, x, t, s) + (L
y
(t)G)(y, x, t, s) = 0 .
3 Properties of the solutions
3.1. Introduction: The next few paragraphs describe some properties of
solutions of backward and forward equations. For Brownian motion, f and u
have every property because the forward and backward equations are essentially
the same. Here f has some and u has others.
3.2. Backward equation maximum principle: The backward equation has a
maximum principle
max
x
f(x, t) max
y
f(y, s) for s > t. (44)
This follows immediately from the representation
f(x, t) = E
x,t
[f(X(s), s)] .
The expected value of f(X(s), s) cannot be larger than its maximum value.
Since this holds for every x, it holds in particular for the maximizer.
There is a more complicated proof of the maximum principle that uses the backward equation. I give a slightly naive explanation to avoid taking too long with it. Let m(t) = max
x
f(x, t). We are trying to show that m(t) never
increases. If, on the contrary, m(t) does increase as t decreases, there must be
a t

with
dm
dt
(t

) = < 0. Choose x

so that f(x

, t

) = max
x
f(x, t

). Then
f
x
(x

, t

) = 0 and f
xx
(x

, t

) 0. The backward equation then implies that


f
t
(x

, t

) 0 (because
2
0), which contradicts f
t
(x

, t

) < 0.
The PDE proof of the maximum principle shows that the coecients a and

2
have to be outside the derivatives in the backward equation. Our argument
that Lf 0 at a maximum where f
x
= 0 and f
xx
0 would be wrong if we had,
say,
x
(a(x)f(x, t)) rather than a(x)
x
f(x, t). We could get a non zero value
because of variation in a(x) even when f was constant. The forward equation
does not have a maximum principle for this reason. Both the Ornstein Uhlen-
beck and geometric Brownian motion problems have cases where max
x
u(x, t)
increases in forward time or backward time.
3.3. Conservation of probability: The probability density has
_

u(x, t)dx =
1. We can see that
d
dt
_

u(x, t)dx = 0
also from the forward equation (37). We simply dierentiate under the integral,
substitute from the equation, and integrate the resulting x derivatives. For this
it is crucial that the coecients a and
2
be inside the derivatives. Almost any
example with a(x, t) or (x, t) not independent of x will show that
d
dt
_

f(x, t)dx ,= 0 .
3.4. Martingale property: If there is no drift, a(x, t) = 0, then X(t) is a
martingale. In particular, E[X(t)] is independent of t. This too follows from
the forward equation (37). There will be no boundary contributions in the
integrations by parts.
d
dt
E[X(t)] =
d
dt
_

xu(x, t)dx
=
_

xu
t
(x, t)
=
_

x
1
2

2
x
(
2
(x, t)u(x, t))dx
=
_

1
2

x
(
2
(x, t)u(x, t))dx
= 0 .
This would not be true for the backward equation form
1
2

2
(x, t)
2
x
f(x, t) or even
for the mixed form we get from the Stratonovich calculus
1
2

x
(
2
(x, t)
x
f(x, t)).
The mixed Stratonovich form conserves probability but not expected value.
3.5. Drift and advection: If there is no noise (σ = 0), then the SDE (23) becomes the ordinary differential equation (ODE)
\[ \frac{dx}{dt} = a(x, t) . \tag{45} \]
If x(t) is a solution, then clearly the expected payout should satisfy f(x(t), t) =
f(x(s), s), if nothing is random then the expected value is the value. It is easy
to check using the backward equation that f(x(t), t) is independent of t if x(t)
satises (45) and = 0:
d
dt
f(x(t), t) = f
t
(x(t), t) +
dx
dt
f
x
(x(t), t) = f
t
(x(t), t) +a(x(t), t)f
x
(x(t), t) = 0 .
Advection is the process of being carried by the wind. If there is no diusion,
then the values of f are simply advected by the drift. The term drift implies
that this advection process is slow and gentle. If is small but not zero, then
f may be essentially advected with a little spreading or smearing induced by
diusion. Computing drift dominated solutions can be more challenging than
computing diusion dominated ones.
The probability density does not have u(x(t), t) a constant (try it in the
forward equation). There is a conservation of probability correction to this that
you can nd if you are interested.
Stochastic Calculus Notes, Lecture 8
Last modified December 14, 2004
1 Path space measures and change of measure
1.1. Introduction: We turn to a closer study of the probability measures
on path space that represent solutions of stochastic dierential equations. We
do not have exact formulas for the probability densities, but there are approxi-
mate formulas that generalize the ones we used to derive the Feynman integral
(not the Feynman Kac formula). In particular, these allow us to compare the
measures for dierent SDEs so that we may use solutions of one to represent ex-
pected values of another. This is the Cameron Martin Girsanov formula. These
changes of measure have many applications, including importance sampling in
Monte Carlo and change of measure in nance.
1.2. Importance sampling: Importance sampling is a technique that can make Monte Carlo computations more accurate. In the simplest version, we have a random variable, X, with probability density u(x). We want to estimate A = E_u[φ(X)]. Here and below, we write E_P[·] to represent expectation using the P measure. To estimate A, we generate N (a large number) independent samples from the population u. That is, we generate random variables X_k for k = 1, ..., N that are independent and have probability density u. Then we estimate A using
\[ A \approx \widehat{A}_u = \frac{1}{N}\sum_{k=1}^{N} \varphi(X_k) . \tag{1} \]
The estimate is unbiased because the bias, A - E_u[\widehat{A}_u], is zero. The error is determined by the variance var(\widehat{A}_u) = (1/N)·var_u(φ(X)).
Let v(x) be another probability density with v(x) ≠ 0 for all x where u(x) ≠ 0. Then clearly
\[ A = \int \varphi(x)\,u(x)\,dx = \int \varphi(x)\,\frac{u(x)}{v(x)}\,v(x)\,dx . \]
We express this as
\[ A = E_u[\varphi(X)] = E_v[\varphi(X)\,L(X)] , \quad \text{where } L(x) = \frac{u(x)}{v(x)} . \tag{2} \]
The ratio L(x) is called the score function in Monte Carlo, the likelihood ratio in statistics, and the Radon Nikodym derivative by mathematicians. We get a different unbiased estimate of A by generating N independent samples of v and taking
\[ A \approx \widehat{A}_v = \frac{1}{N}\sum_{k=1}^{N} \varphi(X_k)\,L(X_k) . \tag{3} \]
The accuracy of (3) is determined by
\[ \mathrm{var}_v\bigl( \varphi(X)L(X) \bigr) = E_v\bigl[ (\varphi(X)L(X) - A)^2 \bigr] = \int \bigl( \varphi(x)L(x) - A \bigr)^2 v(x)\,dx . \]
The goal is to improve the Monte Carlo accuracy by getting var(\widehat{A}_v) << var(\widehat{A}_u).
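A small sketch (my example, not from the notes) of the reweighting identity (2)–(3): estimate A = E_u[cos(X)] with u = N(0, 1), once by direct sampling from u and once by sampling from v = N(0, 2²) and weighting by L = u/v. The choice of φ, the choice of v, and the closed form check e^{−1/2} are mine.

    import numpy as np

    rng = np.random.default_rng(8)
    N = 1_000_000
    phi = np.cos

    # direct estimator (1)
    Xu = rng.normal(0.0, 1.0, N)
    A_u = np.mean(phi(Xu))

    # importance sampling estimator (3) with v = N(0, 4)
    Xv = rng.normal(0.0, 2.0, N)
    u_pdf = np.exp(-Xv**2 / 2.0) / np.sqrt(2.0 * np.pi)
    v_pdf = np.exp(-Xv**2 / 8.0) / (2.0 * np.sqrt(2.0 * np.pi))
    L = u_pdf / v_pdf                       # likelihood ratio u/v
    A_v = np.mean(phi(Xv) * L)

    print(A_u, A_v, np.exp(-0.5))           # all three close to 0.6065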
1.3. A rare event example: Importance sampling is particularly helpful in estimating probabilities of rare events. As a simple example, consider the problem of estimating P(X > a) (corresponding to φ(x) = 1_{x>a}) when X ~ N(0, 1) is a standard normal random variable and a is large. The naive Monte Carlo method would be to generate N sample standard normals, X_k, and take
\[ X_k \sim N(0,1),\; k = 1, \ldots, N , \qquad A = P(X > a) \approx \widehat{A}_u = \frac{1}{N}\,\#\{X_k > a\} = \frac{1}{N}\sum_{X_k > a} 1 . \tag{4} \]
For large a, the hits, X_k > a, would be a small fraction of the samples, with the rest being wasted.
One importance sampling strategy uses v corresponding to N(a, 1). It seems natural to try to increase the number of hits by moving the mean from 0 to a. Since most hits are close to a, it would be a mistake to move the mean farther than a. Using the probability densities u(x) = (1/√(2π))e^{-x²/2} and v(x) = (1/√(2π))e^{-(x-a)²/2}, we find L(x) = u(x)/v(x) = e^{a²/2}e^{-ax}. The importance sampling estimate is
\[ X_k \sim N(a,1),\; k = 1, \ldots, N , \qquad A \approx \widehat{A}_v = \frac{1}{N}\,e^{a^2/2}\sum_{X_k > a} e^{-aX_k} . \]
Some calculations show that the variance of Â_v is smaller than the variance of the naive estimator Â_u by a factor of roughly e^{-a²/2}. A simple way to generate N(a, 1) random variables is to start with mean zero standard normals Y_k ~ N(0, 1) and add a: X_k = Y_k + a. In this form, e^{a²/2}e^{-aX_k} = e^{-a²/2}e^{-aY_k}, and X_k > a is the same as Y_k > 0, so the variance reduced estimator becomes
\[ Y_k \sim N(0,1),\; k = 1, \ldots, N , \qquad A \approx \widehat{A}_v = e^{-a^2/2}\,\frac{1}{N}\sum_{Y_k > 0} e^{-aY_k} . \tag{5} \]
The naive Monte Carlo method (4) produces a small Â by getting a small number of hits in many samples. The importance sampling method (5) gets roughly 50% hits but weights each hit by e^{-a²/2}e^{-aY_k} ≤ e^{-a²/2}, so as to get the same expected value as the naive estimator.
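A numerical version of this comparison (my sketch; a = 4 and the sample size are arbitrary choices): the naive estimator (4) versus the shifted mean estimator (5), checked against the exact normal tail probability.

    import numpy as np
    from math import erfc, sqrt

    a, N = 4.0, 1_000_000
    rng = np.random.default_rng(9)

    # naive estimator (4)
    X = rng.normal(0.0, 1.0, N)
    A_naive = np.mean(X > a)

    # importance sampling estimator (5): sample Y ~ N(0,1), shift by a, reweight
    Y = rng.normal(0.0, 1.0, N)
    weights = np.exp(-a**2 / 2.0) * np.exp(-a * Y) * (Y > 0)
    A_is = np.mean(weights)

    exact = 0.5 * erfc(a / sqrt(2.0))    # P(X > a) for a standard normal
    print(A_naive, A_is, exact)          # A_is is far more accurate than A_naive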
1.4. Radon Nikodym derivative: Suppose Ω is a measure space with σ-algebra ℱ and probability measures P and Q. We say that L(ω) is the Radon Nikodym derivative of P with respect to Q if dP(ω) = L(ω)dQ(ω), or, more formally,
\[ \int_{\Omega} V(\omega)\,dP(\omega) = \int_{\Omega} V(\omega)\,L(\omega)\,dQ(\omega) , \]
which is to say
\[ E_P[V] = E_Q[V L] , \tag{6} \]
for any V, say, with E_P[|V|] < ∞. People often write L = dP/dQ, and call it the Radon Nikodym derivative of P with respect to Q. If we know L, then the right side of (6) offers a different and possibly better way to estimate E_P[V]. Our goal will be a formula for L when P and Q are measures corresponding to different SDEs.
1.5. Absolute continuity: One obstacle to finding L is that it may not exist.
If A is an event with P(A) > 0 but Q(A) = 0, L cannot exist because the
formula (6) would become

    P(A) = ∫_A dP(ω) = ∫_Ω 1_A(ω) dP(ω) = ∫_Ω 1_A(ω) L(ω) dQ(ω) .

Looking back at our definition of the abstract integral, we see that if the event
A = {f(ω) ≠ 0} has Q(A) = 0, then all the approximations to ∫ f(ω) dQ(ω) are
zero, so ∫ f(ω) dQ(ω) = 0.
    We say that measure P is absolutely continuous with respect to Q if Q(A) = 0
implies P(A) = 0 for every¹ A ∈ F. We just showed that L cannot exist unless
P is absolutely continuous with respect to Q. On the other hand, the Radon
Nikodym theorem states that an L satisfying (6) does exist if P is absolutely
continuous with respect to Q.
    In practical examples, if P is not absolutely continuous with respect to Q,
then P and Q are completely singular with respect to each other. This means
that there is an event, A ∈ F, with P(A) = 1 and Q(A) = 0.

¹ This assumes that measures P and Q are defined on the same σ algebra. It is useful
for this reason always to use the σ algebra of Borel sets. It is common to imagine completing
a measure by adding to F all subsets of events with P(A) = 0. Though it may seem better to
have more measurable events, it makes the change of measure discussions more complicated.

1.6. Discrete probability: In discrete probability, with a finite or countable
state space, P is absolutely continuous with respect to Q if and only if Q(ω) > 0
whenever P(ω) > 0. In that case, L(ω) = P(ω)/Q(ω). If P and Q represent
Markov chains on a discrete state space, then P is not absolutely continuous
with respect to Q if the transition matrix for P (also called P) allows transitions
that are not allowed in Q.
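
In the discrete case a two line check of (6) is possible. The sketch below is my own illustration (the three point distributions are made up); it verifies E_P[V] = E_Q[V L] with L(ω) = P(ω)/Q(ω).

```python
import numpy as np

# Two probability measures on a three point state space; Q(omega) > 0 wherever
# P(omega) > 0, so P is absolutely continuous with respect to Q.
P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.2, 0.3, 0.5])
V = np.array([1.0, -2.0, 4.0])      # any function of the outcome

L = P / Q                           # Radon-Nikodym derivative dP/dQ
print(np.dot(P, V))                 # E_P[V]
print(np.dot(Q, V * L))             # E_Q[V L], identical by (6)
```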
1.7. Finite dimensional spaces: If Ω = R^n and the probability measures are
given by densities, then P may fail to be absolutely continuous with respect to
Q if the densities are different from zero in different places. An example with
n = 1 is P corresponding to a negative exponential random variable u(x) = e^x
for x ≤ 0 and u(x) = 0 for x > 0, while Q corresponds to a positive exponential
v(x) = e^{−x} for x ≥ 0 and v(x) = 0 for x < 0.
    Another way to get singular probability measures is to have measures using
δ functions concentrated on lower dimensional sets. An example with Ω = R² has
Q saying that X_1 and X_2 are independent standard normals while P says that
X_1 = X_2. The probability density for P is u(x_1, x_2) = (1/√(2π)) e^{−x_1²/2} δ(x_2 − x_1).
The event A = {X_1 = X_2} has Q probability zero but P probability one.
1.8. Testing for singularity: It sometimes helps to think of complete singularity
of measures in the following way. Suppose we learn the outcome, ω, and
we try to determine which probability measure produced it. If there is a set
A with P(A) = 1 and Q(A) = 0, then we report P if ω ∈ A and Q if ω ∉ A.
We will be correct 100% of the time. Conversely, if there is a way to determine
whether P or Q was used to generate ω, then let A be the set of outcomes that
you say came from P and you have P(A) = 1 because you always are correct
in saying P if ω came from P. Also Q(A) = 0 because you never say Q when
ω ∈ A.
    Common tests involve statistics, i.e. functions of ω. If there is a (measurable)
statistic F(ω) with F(ω) = a almost surely with respect to P and F(ω) = b ≠ a
almost surely with respect to Q, then we take A = {ω | F(ω) = a} and see
that P and Q are completely singular with respect to each other.
1.9. Coin tossing: In common situations where this works, the function F(ω)
is a limit that exists almost surely (but with different values) for both P and Q.
If lim_{n→∞} F_n(ω) = a almost surely with respect to P and lim_{n→∞} F_n(ω) = b
almost surely with respect to Q, then P and Q are completely singular.
    Suppose we make an infinite sequence of coin tosses with the tosses being
independent and having the same probability of heads. We describe this by
taking ω to be infinite sequences ω = (Y_1, Y_2, . . .), where the kth toss Y_k equals
one or zero, and the Y_k are independent. Let the measure P represent tossing
with Y_k = 1 with probability p, and Q represent tossing with Y_k = 1 with
probability q ≠ p. Let F_n(ω) = (1/n) Σ_{k=1}^n Y_k. The (Kolmogorov strong) law of
large numbers states that F_n → p as n → ∞ almost surely in P and F_n → q
as n → ∞ almost surely in Q. This shows that P and Q are completely
singular with respect to each other. Note that this is not an example of discrete
probability in our sense because the state space Ω consists of infinite sequences.
The set of infinite sequences is not countable (a theorem of Cantor).
1.10. The Cameron Martin formula: The Cameron Martin formula relates
the measure, P, for Brownian motion with drift to the Wiener measure, W, for
standard Brownian motion without drift. Wiener measure describes the process

    dX(t) = dB(t) .   (7)

The P measure describes solutions of the SDE

    dX(t) = a(X(t), t)dt + dB(t) .   (8)

For definiteness, suppose X(0) = x_0 is specified in both cases.
1.11. Approximate joint probability measures: We find the formula for
L(X) = dP(X)/dW(X) by taking a finite Δt approximation, directly computing
L_Δt, and observing the limit of L_Δt as Δt → 0. We use our standard notations
t_k = kΔt, X_k ≈ X(t_k), ΔB_k = B(t_{k+1}) − B(t_k), and the vector of observations
X = (X_1, . . . , X_n) ∈ R^n. The approximate solution of (8) is

    X_{k+1} = X_k + Δt a(X_k, t_k) + ΔB_k .   (9)

This is exact in the case a = 0. We write V(x) for the joint density of X for
W and U(x) for the joint density under (9). We calculate L_Δt(x) = U(x)/V(x)
and observe the limit as Δt → 0.
    To carry this out, we again note that the joint density is the product of the
transition probability densities. For (7), if we know x_k, then X_{k+1} is normal
with mean x_k and variance Δt. This gives

    G(x_k, x_{k+1}, Δt) = (1/√(2πΔt)) e^{−(x_{k+1}−x_k)²/2Δt} ,

and

    V(x) = (2πΔt)^{−n/2} exp( −(1/2Δt) Σ_{k=0}^{n−1} (x_{k+1} − x_k)² ) .   (10)

For (9), the approximation to (8), X_{k+1} is normal with mean x_k + Δt a(x_k, t_k)
and variance Δt. This makes its transition density

    G(x_k, x_{k+1}, Δt) = (1/√(2πΔt)) e^{−(x_{k+1}−x_k−Δt a(x_k,t_k))²/2Δt} ,

so that

    U(x) = (2πΔt)^{−n/2} exp( −(1/2Δt) Σ_{k=0}^{n−1} (x_{k+1} − x_k − Δt a(x_k, t_k))² ) .   (11)

To calculate the ratio, we expand (using some obvious notation, with Δx_k = x_{k+1} − x_k)

    (Δx_k − Δt a_k)² = Δx_k² − 2Δt a_k Δx_k + Δt² a_k² .

Dividing U by V removes the 2πΔt factors and the Δx_k² in the exponents. What
remains is

    L_Δt(x) = U(x)/V(x)
            = exp( Σ_{k=0}^{n−1} a(x_k, t_k)(x_{k+1} − x_k) − (Δt/2) Σ_{k=0}^{n−1} a(x_k, t_k)² ) .

The first term in the exponent converges to the Ito integral

    Σ_{k=0}^{n−1} a(x_k, t_k)(x_{k+1} − x_k) → ∫_0^T a(X(t), t) dX(t)   as Δt → 0,

if t_n = max{t_k < T}. The second term converges to the Riemann integral

    Δt Σ_{k=0}^{n−1} a(x_k, t_k)² → ∫_0^T a²(X(t), t) dt   as Δt → 0.

Altogether, this suggests that if we fix T and let Δt → 0, then

    dP/dW = L(X) = exp( ∫_0^T a(X(t), t) dX(t) − (1/2) ∫_0^T a²(X(t), t) dt ) .   (12)

This is the Cameron Martin formula.
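
The discrete time computation behind (12) is easy to carry out numerically. The sketch below (mine, not from the notes) evaluates L_Δt for a sampled path by forming the two sums that converge to the Ito and Riemann integrals; the drift a(x, t) = −x and the grid parameters are illustrative choices only.

```python
import numpy as np

def cameron_martin_L(x, dt, a):
    # x: path values x_0, ..., x_n at times t_k = k*dt.
    # Returns L_dt = exp( sum a_k dx_k - (dt/2) sum a_k^2 ), the discrete
    # likelihood ratio dP/dW along this path.
    t = dt * np.arange(len(x) - 1)
    ak = a(x[:-1], t)
    dx = np.diff(x)
    return np.exp(np.sum(ak * dx) - 0.5 * dt * np.sum(ak**2))

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    dt, n = 1e-3, 1000
    # Sample a Brownian path (the W measure) starting at x_0 = 0.
    x = np.concatenate([[0.0], np.cumsum(np.sqrt(dt) * rng.standard_normal(n))])
    a = lambda x, t: -x          # illustrative drift
    print(cameron_martin_L(x, dt, a))
```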
2 Multidimensional diffusions

2.1. Introduction: Some of the most interesting examples, curious phenomena,
and challenging problems come from diffusion processes with more than one
state variable. The n state variables are arranged into an n dimensional state
vector X(t) = (X_1(t), . . . , X_n(t))^t. We will have a Markov process if the state
vector contains all the information about the past that is helpful in predicting
the future. At least in the beginning, the theory of multidimensional diffusions
is a vector and matrix version of the one dimensional theory.

2.2. Strong solutions: The drift now is a drift for each component of X,
a(x, t) = (a_1(x, t), . . . , a_n(x, t))^t. Each component of a may depend on all
components of X. The σ now is an n × m matrix, where m is the number of
independent sources of noise. We let B(t) be a column vector of m independent
standard Brownian motion paths, B(t) = (B_1(t), . . . , B_m(t))^t. The stochastic
differential equation is

    dX(t) = a(X(t), t)dt + σ(X(t), t)dB(t) .   (13)

A strong solution is a function X(t, B) that is nonanticipating and satisfies

    X(t) = X(0) + ∫_0^t a(X(s), s)ds + ∫_0^t σ(X(s), s)dB(s) .

The middle term on the right is a vector of Riemann integrals whose kth component
is the standard Riemann integral

    ∫_0^t a_k(X(s), s)ds .

The last term on the right is a collection of standard Ito integrals. The kth
component is

    Σ_{j=1}^m ∫_0^t σ_{kj}(X(s), s)dB_j(s) ,

with each summand on the right being a scalar Ito integral as defined in previous
lectures.
2.3. Weak form: The weak form of a multidimensional diffusion problem asks
for a probability measure, P, on the probability space Ω = C([0, T], R^n) with
filtration F_t generated by {X(s) for s ≤ t} so that X(t) is a Markov process
with

    E[ΔX | F_t] = a(X(t), t)Δt + o(Δt) ,   (14)

and

    E[ΔX ΔX^t | F_t] = µ(X(t), t)Δt + o(Δt) .   (15)

Here ΔX = X(t+Δt) − X(t), we assume Δt > 0, and ΔX^t = (ΔX_1, . . . , ΔX_n) is
the transpose of the column vector ΔX. The matrix formula (15) is a convenient
way to express the short time variances and covariances²

    E[ΔX_j ΔX_k | F_t] = µ_{jk}(X(t), t)Δt + o(Δt) .   (16)

As for one dimensional diffusions, it is easy to verify that a strong solution of
(13) satisfies (14) and (15) with µ = σσ^t.

² The reader should check that the true covariances E[(ΔX_j − E[ΔX_j])(ΔX_k − E[ΔX_k]) | F_t]
also satisfy (16) when E[ΔX_j | F_t] = O(Δt).
2.4. Backward equation: As for one dimensional diffusions, the weak form
conditions (14) and (15) give a simple derivation of the backward equation for

    f(x, t) = E_{x,t}[V(X(T))] .

We start with the tower property in the familiar form

    f(x, t) = E_{x,t}[f(x + ΔX, t + Δt)] ,   (17)

and expand f(x+ΔX, t+Δt) about (x, t) to second order in ΔX and first order
in Δt:

    f(x + ΔX, t + Δt) = f + ∂_{x_k}f ΔX_k + (1/2) ∂_{x_j}∂_{x_k}f ΔX_j ΔX_k + ∂_t f Δt + R .

Here we follow the Einstein summation convention by leaving out the sums over j
and k on the right. We also omit arguments of f and its derivatives when the
arguments are (x, t). For example, ∂_{x_k}f ΔX_k really means

    Σ_{k=1}^n ∂_{x_k}f(x, t) ΔX_k .

As in one dimension, the error term R satisfies

    |R| ≤ C ( |ΔX| Δt + |ΔX|³ + Δt² ) ,

so that, as before,

    E[|R|] ≤ C Δt^{3/2} .

Putting these back into (17) and using (14) and (15) gives (with the same
shorthand)

    f = f + a_k(x, t) ∂_{x_k}f Δt + (1/2) µ_{jk}(x, t) ∂_{x_j}∂_{x_k}f Δt + ∂_t f Δt + o(Δt) .

Again we cancel the f from both sides, divide by Δt and take Δt → 0 to get

    ∂_t f + a_k(x, t) ∂_{x_k}f + (1/2) µ_{jk}(x, t) ∂_{x_j}∂_{x_k}f = 0 ,   (18)

which is the backward equation.
It sometimes is convenient to rewrite (18) in matrix vector form. For any
function, f, we may consider its gradient to be the row vector ∇_x f = D_x f =
(∂_{x_1}f, . . . , ∂_{x_n}f). The middle term on the left of (18) is the product of the
row vector D_x f and the column vector a. We also have the Hessian matrix of
second partials (D_x² f)_{jk} = ∂_{x_j}∂_{x_k}f. Any symmetric matrix has a trace
tr(M) = Σ_k M_{kk}. The summation convention makes this just tr(M) = M_{kk}. If A and
B are symmetric matrices, then (as the reader should check) tr(AB) = A_{jk}B_{jk}
(with summation convention). With all this, the backward equation may be
written

    ∂_t f + D_x f · a(x, t) + (1/2) tr( µ(x, t) D_x² f ) = 0 .   (19)
2.5. Generating correlated Gaussians: Suppose we observe the solution of
(13) and want to reconstruct the matrix σ. A simpler version of this problem
is to observe

    Y = AZ ,   (20)

and reconstruct A. Here Z = (Z_1, . . . , Z_m) ∈ R^m, with Z_k ∼ N(0, 1) i.i.d.,
is an m dimensional standard normal. If m < n or rank(A) < n then Y is a
degenerate Gaussian whose probability density (measure) is concentrated on
the subspace of R^n consisting of vectors of the form y = Az for some z ∈ R^m.
The problem is to find A knowing the distribution of Y.
2.6. SVD and PCA: The singular value decomposition (SVD) of A is a
factorization

    A = UΣV^t ,   (21)

where U is an n×n orthogonal matrix (U^tU = I_{n×n}, the n×n identity matrix),
V is an m×m orthogonal matrix (V^tV = I_{m×m}), and Σ is an n×m diagonal
matrix (Σ_{jk} = 0 if j ≠ k) with nonnegative singular values on the diagonal:
Σ_{kk} = σ_k ≥ 0. We assume the singular values are arranged in decreasing order
σ_1 ≥ σ_2 ≥ · · · . The singular values also are called principal components and
the SVD is called principal component analysis (PCA). The columns of U and
V (not V^t) are left and right singular vectors respectively, which also are called
principal components or principal component vectors. The calculation

    C = AA^t = (UΣV^t)(VΣ^tU^t) = UΣΣ^tU^t

shows that the diagonal n×n matrix Λ = ΣΣ^t contains the eigenvalues of
C = AA^t, which are real and nonnegative because C is symmetric and positive
semidefinite. Therefore, the left singular vectors, the columns of U, are the
eigenvectors of the symmetric matrix C. The singular values are the nonnegative
square roots of the eigenvalues of C: σ_k = √λ_k. Thus, the singular values and
left singular vectors are determined by C. In a similar way, the right singular
vectors are the eigenvectors of the m×m positive semidefinite matrix A^tA. If
n > m, then the σ_k are defined only for k ≤ m (there is no Σ_{m+1,m+1} in the
n×m matrix Σ). Since the rank of C is at most m in this case, we have λ_k = 0
for k > m. Even when n = m, A may be rank deficient. The rank of A being l
is the same as σ_k = 0 for k > l. When m > n, the rank of A is at most n.
2.7. The SVD and nonuniqueness of A: Because Y = AZ is Gaussian
with mean zero, its distribution is determined by its covariance C = E[YY^t] =
E[AZZ^tA^t] = A E[ZZ^t] A^t = AA^t. This means that the distribution of Y
determines U and Σ but not V. We can see this directly by plugging (21) into
(20) to get

    Y = UΣ(V^tZ) = UΣZ′ ,  where Z′ = V^tZ .

Since Z′ is a mean zero Gaussian with covariance V^tV = I, Z′ has the same
distribution as Z, which means that Y′ = UΣZ has the same distribution as Y.
Furthermore, if A has rank l < m, then we will have σ_k = 0 for k > l and we
need not bother with the Z′_k for k > l. That is, for generating Y, we never need
to take m > n or m > rank(A).
    For a simpler point of view, suppose we are given C and want to generate
Y ∼ N(0, C) in the form Y = AZ with Z ∼ N(0, I). The condition is that
C = AA^t. This is a sort of square root of C. One solution is A = UΣ as above.
Another solution is the Choleski decomposition of C: C = LL^t for a lower
triangular matrix L. This is most often done in practice because the Choleski
decomposition is easier to compute than the SVD. Any A that works has the
same U and Σ in its SVD.
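
A sketch (my addition) of the two "square roots" of C mentioned above, using numpy: A = UΣ built from the eigendecomposition of C and A = L from the Cholesky factorization both satisfy AA^t = C and therefore generate the same Gaussian distribution. The particular C is only an example.

```python
import numpy as np

C = np.array([[4.0, 2.0, 0.0],
              [2.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])        # example covariance (symmetric positive definite)

# Square root via the eigen/SVD route: C = U diag(lam) U^t, A = U diag(sqrt(lam)).
lam, U = np.linalg.eigh(C)
A_svd = U @ np.diag(np.sqrt(lam))

# Square root via Cholesky: C = L L^t with L lower triangular.
A_chol = np.linalg.cholesky(C)

print(np.allclose(A_svd @ A_svd.T, C))    # True
print(np.allclose(A_chol @ A_chol.T, C))  # True

# Either A generates Y = A Z ~ N(0, C) from iid standard normals Z.
rng = np.random.default_rng(3)
Z = rng.standard_normal((3, 100_000))
Y = A_chol @ Z
print(np.cov(Y))                          # close to C
```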
2.8. Choosing σ(x, t): This non uniqueness of A carries over to non uniqueness
of σ(x, t) in the SDE (13). A diffusion process X(t) defines µ(x, t) through
(15), but any σ(x, t) with σσ^t = µ leads to the same distribution of X trajectories.
In particular, if we have one σ(x, t), we may choose any adapted matrix
valued function V(t) with V V^t ≡ I_{m×m}, and use σ̃ = σV. To say this another
way, if we solve dZ′ = V(t)dZ(t) with Z′(0) = 0, then Z′(t) also is a Brownian
motion. (The Levy uniqueness theorem states that any continuous path process
that is weakly Brownian motion in the sense that a ≡ 0 and µ ≡ I in (14) and
(15) actually is Brownian motion in the sense that the measure on Ω is Wiener
measure.) Therefore, using dZ′ = V(t)dZ gives the same measure on the space
of paths X(t).
    The conclusion is that it is possible for SDEs with different σ(x, t) to represent
the same X distribution. This happens when σσ^t = σ̃σ̃^t. If we have µ, we
may represent the process X(t) as the strong solution of an SDE (13). For this,
we must choose with some arbitrariness a σ(x, t) with σ(x, t)σ(x, t)^t = µ(x, t).
The number of noise sources, m, is the number of non zero eigenvalues of µ. We
never need to take m > n, but m < n may be called for if µ has rank less than
n.
2.9. Correlated Brownian motions: Sometimes we wish to use the SDE model
(13) where the B_k(t) are correlated. We can accomplish this with a change in σ.
Let us see how to do this in the simpler case of generating correlated standard
normals. In that case, we want Z = (Z_1, . . . , Z_m)^t ∈ R^m to be a multivariate
mean zero normal with var(Z_k) = 1 and given correlation coefficients

    ρ_{jk} = cov(Z_j, Z_k) / √( var(Z_j) var(Z_k) ) = cov(Z_j, Z_k) .

This is the same as generating Z with covariance matrix C with ones on the
diagonal and C_{jk} = ρ_{jk} when j ≠ k. We know how to do this: choose A with
AA^t = C and take Z = AZ′. This also works in the SDE. We solve

    dX(t) = a(X(t), t)dt + σ(X(t), t)A dB(t) ,

with the B_k being independent standard Brownian motions. We get the effect
of correlated Brownian motions by using independent ones and replacing σ(x, t)
by σ(x, t)A.
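
To make the recipe concrete, here is a small sketch (mine, not from the notes) that generates two Brownian motions with a prescribed correlation ρ by mixing independent increments through a matrix A with AA^t = C; ρ and the time grid are arbitrary illustration values.

```python
import numpy as np

def correlated_brownian_paths(rho, dt, n, rng):
    # Correlation matrix for two Brownian motions and a matrix A with A A^t = C.
    C = np.array([[1.0, rho], [rho, 1.0]])
    A = np.linalg.cholesky(C)
    # Independent Brownian increments, mixed through A.
    dB = np.sqrt(dt) * rng.standard_normal((n, 2))
    dW = dB @ A.T
    return np.cumsum(dW, axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(4)
    W = correlated_brownian_paths(rho=0.8, dt=1e-3, n=100_000, rng=rng)
    dW = np.diff(W, axis=0)
    print(np.corrcoef(dW.T))   # off diagonal entries close to 0.8
```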
2.10. Normal copulas (a digression): Suppose we have a probability density
u(y) for a scalar random variable Y. We often want to generate families
Y_1, . . . , Y_m so that each Y_k has the density u(y) but different Y_k are correlated.
A favorite heuristic for doing this³ is the normal copula. Let U(y) = P(Y < y)
be the cumulative distribution function (CDF) for Y. Then the Y_k will have
density u(y) if and only if U(Y_k) = T_k and the T_k are uniformly distributed in
the interval [0, 1] (check this). In turn, the T_k are uniformly distributed in [0, 1]
if T_k = N(Z_k) where the Z_k are standard normals and N(z) is the standard
normal CDF. Now, rather than generating independent Z_k, we may use correlated
ones as above. This in turn leads to correlated T_k and correlated Y_k. I do
not know how to determine the Z correlations in order to get a specified set of
Y correlations.

³ I hope this goes out of fashion in favor of more thoughtful methods that postulate some
mechanism for the correlations.
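
A sketch of the normal copula heuristic (my code, not the author's): correlated standard normals are pushed through the normal CDF to get correlated uniforms T_k, which are then pushed through the inverse CDF of the target density. The target marginal here (an exponential distribution) and the correlation value are examples only.

```python
import numpy as np
from scipy.stats import norm

def normal_copula_sample(rho, target_inv_cdf, n, rng):
    # Correlated standard normals Z with corr(Z_1, Z_2) = rho.
    A = np.linalg.cholesky(np.array([[1.0, rho], [rho, 1.0]]))
    Z = rng.standard_normal((n, 2)) @ A.T
    T = norm.cdf(Z)                    # correlated uniforms on [0, 1]
    return target_inv_cdf(T)           # correlated samples with the target marginal

if __name__ == "__main__":
    rng = np.random.default_rng(5)
    # Target marginal: exponential with rate 1, inverse CDF -log(1 - t).
    inv_cdf = lambda t: -np.log1p(-t)
    Y = normal_copula_sample(0.7, inv_cdf, 100_000, rng)
    print(Y.mean(axis=0))              # both close to 1
    print(np.corrcoef(Y.T)[0, 1])      # correlated, though not exactly 0.7
```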
2.11. Degenerate diffusions: Many practical applications have fewer sources
of noise than state variables. In the strong form (13) this is expressed as m < n
or m = n and det(σ) = 0. In the weak form µ is always n × n but it may be
rank deficient. In either case we call the stochastic process a degenerate diffusion.
Nondegenerate diffusions have qualitative behavior like that of Brownian
motion: every component has infinite total variation and finite quadratic variation,
transition densities are smooth functions of x and t (for t > 0) and satisfy
forward and backward equations (in different variables) in the usual sense, etc.
Degenerate diffusions may lack some or all of these properties. The qualitative
behavior of degenerate diffusions is subtle and problem dependent. There are
some examples in the homework. Computational methods that work well for
nondegenerate diffusions may fail for degenerate ones.
2.12. A degenerate diffusion for Asian options: An Asian option gives a
payout that depends on some kind of time average of the price of the underlying
security. The simplest form would have the underlier being a geometric
Brownian motion in the risk neutral measure

    dS(t) = rS(t)dt + σS(t)dB(t) ,   (22)

and a payout that depends on ∫_0^T S(t)dt. This leads us to evaluate

    E[V(Y(T))] ,  where  Y(T) = ∫_0^T S(t)dt .

To get a backward equation for this, we need to identify a state space so
that the state is a Markov process. We use the two dimensional vector

    X(t) = ( S(t), Y(t) )^t ,

where S(t) satisfies (22) and dY(t) = S(t)dt. Then X(t) satisfies (13) with

    a = ( rS, S )^t ,

and m = 1 < n = 2 and (with the usual double meaning of σ)

    σ = ( σS, 0 )^t .

For the backward equation we have

    µ = σσ^t = ( S²σ²  0 ; 0  0 ) ,

so the backward equation is

    ∂_t f + rs ∂_s f + s ∂_y f + (s²σ²/2) ∂_s² f = 0 .   (23)

Note that this is a partial differential equation in two space variables,
x = (s, y)^t. Of course, we are interested in the answer at t = 0 only for y = 0.
Still, we have to include other y values in the computation. If we were to try the
standard finite difference approximate solution of (23) we might use a central
difference approximation ∂_y f(s, y, t) ≈ (1/2Δy)( f(s, y + Δy, t) − f(s, y − Δy, t) ).
If σ > 0 it is fine to use a central difference approximation for ∂_s f, and this
is what most people do. However, a central difference approximation for ∂_y f
leads to an unstable computation that does not produce anything like the right
answer. The inherent instability of central differencing is masked in s by the
strongly stabilizing second derivative term, but there is nothing to stabilize the
unstable y differencing in this degenerate diffusion problem.
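
Since the point here is a finite difference pitfall, a safe cross check is plain Monte Carlo on the two component state (S, Y). The sketch below (my addition) integrates (22) together with dY = S dt by forward Euler and averages a payout V(Y(T)); the parameter values and the particular payout are placeholders.

```python
import numpy as np

def asian_mc(s0, r, sigma, T, n_steps, n_paths, payout, rng):
    # Forward Euler for the degenerate pair dS = r S dt + sigma S dB, dY = S dt.
    dt = T / n_steps
    S = np.full(n_paths, s0)
    Y = np.zeros(n_paths)
    for _ in range(n_steps):
        dB = np.sqrt(dt) * rng.standard_normal(n_paths)
        Y += S * dt                      # no noise enters the Y component
        S += r * S * dt + sigma * S * dB
    return np.exp(-r * T) * np.mean(payout(Y))

if __name__ == "__main__":
    rng = np.random.default_rng(6)
    # Illustrative average price call: payout max(Y/T - K, 0).
    T, K = 1.0, 1.0
    payout = lambda Y: np.maximum(Y / T - K, 0.0)
    print(asian_mc(s0=1.0, r=0.05, sigma=0.2, T=T,
                   n_steps=250, n_paths=50_000, payout=payout, rng=rng))
```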
2.13. Integration with dX: We seek the analogue of the Ito integral and
Ito's lemma for a more general diffusion. If we have a function f(x, t), we seek
a formula df = adt + bdX. This would mean that

    f(X(T), T) = f(X(0), 0) + ∫_0^T a(t)dt + ∫_0^T b(t)dX(t) .   (24)

The first integral on the right would be a Riemann integral that would be defined
for any continuous function a(t). The second would be like the Ito integral with
Brownian motion, whose definition depends on b(t) being an adapted process.
The definition of the dX Ito integral should be so that Ito's lemma becomes
true.
    For small Δt we seek to approximate Δf = f(X(t+Δt), t+Δt) − f(X(t), t).
If this follows the usual pattern (partial justification below), we should expand
to second order in ΔX and first order in Δt. This gives (with summation convention)

    Δf ≈ (∂_{x_j}f) ΔX_j + (1/2)(∂_{x_j}∂_{x_k}f) ΔX_j ΔX_k + ∂_t f Δt .   (25)

As with the Ito lemma for Brownian motion, the key idea is to replace the
products ΔX_j ΔX_k by their expected values (conditional on F_t). If this is true,
(15) suggests the general Ito lemma

    df = (∂_{x_j}f) dX_j + ( (1/2)(∂_{x_j}∂_{x_k}f) µ_{jk} + ∂_t f ) dt ,   (26)

where all quantities are evaluated at (X(t), t).
2.14. Ito's rule: One often finds this expressed in a slightly different way. A
simpler way to represent the small time variance condition (15) is

    E[dX_j dX_k] = µ_{jk}(X(t), t)dt .

(Though it probably should be E[dX_j dX_k | F_t].) Then (26) becomes

    df = (∂_{x_j}f) dX_j + (1/2)(∂_{x_j}∂_{x_k}f) E[dX_j dX_k] + ∂_t f dt .

This has the advantage of displaying the main idea, which is that the fluctuations
in dX_j are important but only the mean values of the products dX_j dX_k are
important, not the fluctuations. Ito's rule (never enunciated by Ito as far as I
know) is the formula

    dX_j dX_k = µ_{jk} dt .   (27)

Although this leads to the correct formula (26), it is not strictly true, since the
standard deviation of the left side is as large as its mean.
    In the derivation of (26) sketched below, the total change in f is represented
as the sum of many small increments. As with the law of large numbers, the
sum of many random numbers can be much closer to its mean (in relative terms)
than the random summands.
2.15. Ito integral: The definition of the dX Ito integral follows the definition
of the Ito integral with respect to Brownian motion. Here is a quick sketch
with many details missing. Suppose X(t) is a multidimensional diffusion process,
F_t is the σ algebra generated by the X(s) for 0 ≤ s ≤ t, and b(t) is a
possibly random function that is adapted to F_t. There are n components of b(t)
corresponding to the n components of X(t). The Ito integral is (t_k = kΔt as
usual):

    ∫_0^T b(t)dX(t) = lim_{Δt→0} Σ_{t_k<T} b(t_k) · ( X(t_{k+1}) − X(t_k) ) .   (28)

This definition makes sense because the limit exists (almost surely) for a rich
enough family of integrands b(t). Let Y_Δt = Σ_{t_k<T} b(t_k) · ( X(t_{k+1}) − X(t_k) ) and
write (for appropriately chosen T)

    Y_{Δt/2} − Y_Δt = Σ_{t_k<T} R_k ,

where

    R_k = ( b(t_{k+1/2}) − b(t_k) ) · ( X(t_{k+1}) − X(t_{k+1/2}) ) .

The bound

    E[ ( Y_{Δt/2} − Y_Δt )² ] = O(Δt^p) ,   (29)

implies that the limit (28) exists almost surely if Δt_l = 2^{−l}.
    As in the Brownian motion case, we assume that b(t) has the (lack of)
smoothness of Brownian motion: E[(b(t + Δt) − b(t))²] = O(Δt). In the martingale
case (drift = a ≡ 0 in (14)), E[R_j R_k] = 0 if j ≠ k. In evaluating E[R_k²],
we get from (15) that

    E[ |X(t_{k+1}) − X(t_{k+1/2})|² | F_{t_{k+1/2}} ] = O(Δt) .

Since b(t_{k+1/2}) is known in F_{t_{k+1/2}}, we may use the tower property and our
assumption on b to get

    E[R_k²] ≤ E[ |X(t_{k+1}) − X(t_{k+1/2})|² |b(t_{k+1/2}) − b(t_k)|² ] = O(Δt²) .

This gives (29) with p = 1 (as for Brownian motion) for that case. For the
general case, my best effort is too complicated for these notes and gives (29)
with p = 1/2.
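
The limit (28) can be watched numerically. The sketch below (mine, not part of the notes) builds a fine Brownian path, evaluates the left endpoint sums Y_Δt for ∫_0^T X dX at several resolutions, and compares them with the value (X(T)² − T)/2 known from the Ito calculus; the grid sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(7)
T, n_fine = 1.0, 2**16
dt_fine = T / n_fine
# A fine Brownian path X(t_k), k = 0, ..., n_fine.
X = np.concatenate([[0.0], np.cumsum(np.sqrt(dt_fine) * rng.standard_normal(n_fine))])

def left_sum(X, stride):
    # Left endpoint sum  sum_k b(t_k) (X(t_{k+1}) - X(t_k))  with b = X itself.
    Xc = X[::stride]
    return np.sum(Xc[:-1] * np.diff(Xc))

for stride in (256, 64, 16, 4, 1):
    print(stride, left_sum(X, stride))
print("exact:", 0.5 * (X[-1]**2 - T))   # Ito: int_0^T X dX = (X(T)^2 - T)/2
```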
2.16. Ito's lemma: We give a half sketch of the proof of Ito's lemma for
diffusions. We want to use k to represent the time index (as in t_k = kΔt) so
we replace the index notation above with vector notation: ∂_x f ΔX instead of
∂_{x_k}f ΔX_k, ∂_x²f (ΔX, ΔX) instead of (∂_{x_j}∂_{x_k}f) ΔX_j ΔX_k, and tr(∂_x²f µ) instead
of (∂_{x_j}∂_{x_k}f) µ_{jk}. Then ΔX_k will be the vector X(t_{k+1}) − X(t_k) and ∂_x²f_k the
n × n matrix of second partial derivatives of f evaluated at (X(t_k), t_k), etc.
    Now it is easy to see how f(X(T), T) − f(X(0), 0) = Σ_{t_k<T} Δf_k is given
by the Riemann and Ito integrals of the right side of (26). We have

    Δf_k = ∂_t f_k Δt + ∂_x f_k ΔX_k + (1/2) ∂_x²f_k (ΔX_k, ΔX_k)
           + O(Δt²) + O(Δt |ΔX_k|) + O( |ΔX_k|³ ) .

As Δt → 0, the contribution from the second row terms vanishes (the third
term takes some work, see below). The sum of the ∂_t f_k Δt converges to the
Riemann integral ∫_0^T ∂_t f(X(t), t)dt. The sum of the ∂_x f_k ΔX_k converges to the
Ito integral ∫_0^T ∂_x f(X(t), t)dX(t). The remaining term may be written as

    ∂_x²f_k (ΔX_k, ΔX_k) = E[ ∂_x²f_k (ΔX_k, ΔX_k) | F_{t_k} ] + U_k .

It can be shown that

    E[ |U_k|² | F_{t_k} ] ≤ C E[ |ΔX_k|⁴ | F_{t_k} ] ≤ C Δt² ,

as it is for Brownian motion. This shows (with E[U_j U_k] = 0) that

    E[ ( Σ_{t_k<T} U_k )² ] = Σ_{t_k<T} E[ |U_k|² ] ≤ C T Δt ,

so Σ_{t_k<T} U_k → 0 as Δt → 0 almost surely (with Δt = 2^{−l}). Finally, the small
time variance formula (15) gives

    E[ ∂_x²f_k (ΔX_k, ΔX_k) | F_{t_k} ] = tr( ∂_x²f_k µ_k ) Δt + o(Δt) ,

so

    Σ_{t_k<T} E[ ∂_x²f_k (ΔX_k, ΔX_k) | F_{t_k} ] → ∫_0^T tr( ∂_x²f(X(t), t) µ(X(t), t) ) dt ,

(the Riemann integral) as Δt → 0. This shows how the terms in the Ito lemma
(26) are accounted for.
2.17. Theory left out: We did not show that there is a process satisfying (14)
and (15) (existence) or that these conditions characterize the process (uniqueness).
Even showing that a process satisfying (14) and (15) with zero drift and
µ = I is Brownian motion is a real theorem: the Levy uniqueness theorem.
The construction of the stochastic process X(t) (existence) also gives bounds
on higher moments, such as E[ |ΔX|⁴ ] ≤ C Δt², that we used above. The
higher moment estimates are true for Brownian motion because the increments
are Gaussian.
2.18. Approximating diffusions: The strong form formulation of the
diffusion problem (13) suggests a way to generate approximate diffusion paths.
If X_k is the approximation to X(t_k) we can use

    X_{k+1} = X_k + a(X_k, t_k)Δt + σ(X_k, t_k) √Δt Z_k ,   (30)

where the Z_k are i.i.d. N(0, I_{m×m}). This has the properties corresponding to
(14) and (15) that

    E[ X_{k+1} − X_k | X_1, . . . , X_k ] = a(X_k, t_k)Δt

and

    cov( X_{k+1} − X_k ) = µ Δt .

This is the forward Euler method. There are methods that are better in some
ways, but in a surprisingly large number of problems, methods better than this
are not known. This is a distinct contrast to numerical solution of ordinary
differential equations (without noise), for which forward Euler almost never is
the method of choice. There is much research to do to help the SDE solution
methodology catch up to the ODE solution methodology.
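
A minimal implementation sketch of the forward Euler scheme (30) (my code, not the author's); the drift, diffusion matrix, and parameters are example choices, here with n = 2 state variables and m = 1 noise source taken from the Asian option pair of paragraph 2.12.

```python
import numpy as np

def euler_paths(a, sigma, x0, dt, n_steps, n_paths, m, rng):
    # Forward Euler (30): X_{k+1} = X_k + a(X_k, t_k) dt + sigma(X_k, t_k) sqrt(dt) Z_k.
    X = np.tile(np.asarray(x0, dtype=float), (n_paths, 1))
    for k in range(n_steps):
        t = k * dt
        Z = rng.standard_normal((n_paths, m))
        drift = np.stack([a(x, t) for x in X])
        noise = np.stack([sigma(x, t) @ z for x, z in zip(X, Z)])
        X = X + drift * dt + np.sqrt(dt) * noise
    return X

if __name__ == "__main__":
    rng = np.random.default_rng(8)
    r, sig = 0.05, 0.2
    a = lambda x, t: np.array([r * x[0], x[0]])          # (rS, S)
    sigma = lambda x, t: np.array([[sig * x[0]], [0.0]])  # (sig S, 0)^t, one noise source
    XT = euler_paths(a, sigma, x0=[1.0, 0.0], dt=1e-2, n_steps=100,
                     n_paths=2_000, m=1, rng=rng)
    print(XT.mean(axis=0))   # roughly (exp(r), (exp(r) - 1)/r) at T = 1
```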
2.19. Drift change of measure: The analogue of the Cameron Martin formula
for general diffusions is the Girsanov formula. We derive it by writing the joint
densities for the discrete time processes (30) with and without the drift term a.
As usual, this is a product of transition probabilities, the conditional probability
densities for X_{k+1} conditional on knowing X_j for j ≤ k. Actually, because (30)
is a Markov process, the conditional density for X_{k+1} depends on X_k only. We
write it G(x_k, x_{k+1}, t_k, Δt). Conditional on X_k, X_{k+1} is a multivariate normal
with covariance matrix µ(X_k, t_k)Δt. If a ≡ 0, the mean is X_k. Otherwise, the
mean is X_k + a(X_k, t_k)Δt. We write µ_k and a_k for µ(X_k, t_k) and a(X_k, t_k).
    Without drift, the Gaussian transition density is

    G(x_k, x_{k+1}, t_k, Δt) = (1/( (2πΔt)^{n/2} √det(µ_k) )) exp( −(x_{k+1} − x_k)^t µ_k^{−1} (x_{k+1} − x_k) / 2Δt ) .   (31)

With nonzero drift, the prefactor

    z_k = 1/( (2πΔt)^{n/2} √det(µ_k) )

is the same and the exponential factor accommodates the new mean:

    G(x_k, x_{k+1}, t_k, Δt) = z_k exp( −(x_{k+1} − x_k − a_kΔt)^t µ_k^{−1} (x_{k+1} − x_k − a_kΔt) / 2Δt ) .   (32)

Let U(x_1, . . . , x_N) be the joint density without drift and V(x_1, . . . , x_N) the joint
density with drift. We want to evaluate L(x) = V(x)/U(x). Both U and V are
products of the appropriate transition densities G. In the division, the prefactors
z_k cancel, as they are the same for U and V because the µ_k are the same.
    The main calculation is the subtraction of the exponents (with Δx_k = x_{k+1} − x_k):

    (Δx_k − a_kΔt)^t µ_k^{−1} (Δx_k − a_kΔt) − Δx_k^t µ_k^{−1} Δx_k = −2Δt a_k^t µ_k^{−1} Δx_k + Δt² a_k^t µ_k^{−1} a_k .

This gives:

    L(x) = exp( Σ_{k=0}^{N−1} a_k^t µ_k^{−1} Δx_k − (Δt/2) Σ_{k=0}^{N−1} a_k^t µ_k^{−1} a_k ) .

This is the exact likelihood ratio for the discrete time processes with and without
drift. If we take the limit Δt → 0 for the continuous time problem, the two terms
in the exponent converge respectively to the Ito integral

    ∫_0^T a(X(t), t)^t µ(X(t), t)^{−1} dX(t) ,

and the Riemann integral

    ∫_0^T (1/2) a(X(t), t)^t µ(X(t), t)^{−1} a(X(t), t) dt .

The result is the Girsanov formula

    dP/dQ = L(X) = exp( ∫_0^T a(X(t), t)^t µ(X(t), t)^{−1} dX(t)
                        − ∫_0^T (1/2) a(X(t), t)^t µ(X(t), t)^{−1} a(X(t), t) dt ) .   (33)
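
The discrete product leading to (33) is easy to program. The sketch below (my addition) computes L(x) for a path of the scalar forward Euler scheme, with µ = σ² a scalar; the drift, noise coefficient, and grid are illustrative only.

```python
import numpy as np

def girsanov_L(x, dt, a, mu):
    # Discrete likelihood ratio exp( sum a_k mu_k^{-1} dx_k - (dt/2) sum a_k mu_k^{-1} a_k )
    # for a scalar process; a(x, t) is the drift and mu(x, t) = sigma(x, t)^2.
    t = dt * np.arange(len(x) - 1)
    ak, mk = a(x[:-1], t), mu(x[:-1], t)
    dx = np.diff(x)
    return np.exp(np.sum(ak / mk * dx) - 0.5 * dt * np.sum(ak**2 / mk))

if __name__ == "__main__":
    rng = np.random.default_rng(9)
    dt, n = 1e-3, 1000
    sigma = lambda x, t: 0.5 + 0.0 * x          # constant sigma, as an example
    a = lambda x, t: -2.0 * x                   # mean reverting drift, as an example
    mu = lambda x, t: sigma(x, t)**2
    # Sample a driftless path dX = sigma dB (the Q measure) and weight it.
    x = np.zeros(n + 1)
    for k in range(n):
        x[k+1] = x[k] + sigma(x[k], k*dt) * np.sqrt(dt) * rng.standard_normal()
    print(girsanov_L(x, dt, a, mu))
```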
Stochastic Calculus, Fall 2004 (http://www.math.nyu.edu/faculty/goodman/teaching/StochCalc2004/)
Assignment 1.
Given Summer 2004, due September 9, the first day of class. The course web page has
hints for reviewing or filling in missing background.
Last revised, May 26.
Objective: Review of Basic Probability.
1. We have a container with 300 red balls and 600 blue balls. We mix the balls well and
choose one at random, with each ball being equally likely to be chosen. After each
choice, we return the chosen ball to the container and mix again.
a. What is the probability that the first n balls chosen are all blue?
b. Let N be the number of blue balls chosen before the first red one. What is
P(N = n)? What are the mean and variance of N? Explain your answers using
the formulae

    Σ_{n=0}^∞ x^n = 1/(1 − x)   for |x| < 1 ,
    Σ_{n=0}^∞ n x^n = x (d/dx) 1/(1 − x)   for |x| < 1 ,

etc.
c. What is the probability that N = 0 given that N ≤ 2?
d. What is the probability that N is an even number? Count 0 as an even number.
2. A tourist decides between two plays, called Good (G) and Bad (B). The probability
of the tourist choosing Good is P(G) = 10%. A tourist choosing Good likes it (L)
with 70% probability (P(L | G) = .7) while a tourist choosing Bad dislikes it with 80%
probability (P(D | B) = .8).
a. Draw a probability decision tree diagram to illustrate the choices.
b. Calculate P(L), the probability that the tourist liked the play he or she saw.
c. If the tourist liked the play he or she chose, what is the probability that he or she
chose Good?
3. A triangular random variable, X, has probability density function (PDF) f(x) given
by

    f(x) = 2(1 − x) if 0 ≤ x ≤ 1,  and 0 otherwise.

a. Calculate the mean and variance of X.
b. Suppose X_1 and X_2 are independent samples (copies) of X and Y = X_1 + X_2. That
is to say that X_1 and X_2 are independent random variables and each has the same
density f. Find the PDF for Y.
c. Calculate the mean and variance of Y without using the formula for its PDF.
d. Find P(Y > 1).
e. Suppose X_1, X_2, . . ., X_100 are independent samples of X. Estimate Pr(X_1 + · · · +
X_100 > 34) using the central limit theorem. You will need access to standard
normal probabilities either through a table or a calculator or computer program.
4. Suppose X and Y have a joint PDF

    f(x, y) = (1/(8π)) (4 − x² − y²) if x² + y² ≤ 4,  and 0 otherwise.

a. Calculate P(X² + Y² ≤ 1).
b. Calculate the marginal PDF for X alone.
c. What is the covariance between X and Y?
d. Find an event depending on X alone whose probability depends on Y. Use this to
show that X is not independent of Y.
e. Write the joint PDF for U = X² and V = Y².
f. Calculate the covariance between X² and Y². It may be easier to do this without
using part e. Use this to show, again, that X and Y are not independent.
Stochastic Calculus, Fall 2004 (http://www.math.nyu.edu/faculty/goodman/teaching/StochCalc2004/)
Assignment 2.
Given September 9, due September 23.
Last revised, September 10.
Objective: Conditioning and Markov chains.
1. Suppose that F and G are two σ algebras of sets and that F adds information to G in the
sense that any G measurable event is also F measurable: G ⊆ F. Suppose that the
probability space Ω is discrete (finite or countable) and that X(ω) is a variable defined
on Ω (that is, a function of the random variable ω). The conditional expectations (in
the modern sense) of X with respect to F and G are Y = E[X | F] and Z = E[X | G].
In each case below, state whether the statement is true or false and explain your answer
with a proof or a counterexample.
a. Z ∈ F.
b. Y ∈ G.
c. Z = E[Y | G].
d. Y = E[Z | F].
2. Let Ω be a discrete probability space and F a σ algebra. Let X(ω) be a (function of
a) random variable with E[X²] < ∞. Let Y = E[X | F]. The variance of X is
var(X) = E[(X − X̄)²], where X̄ = E[X].
a. Show directly from the (modern) definition of conditional expectation that

    E[X²] = E[(X − Y)²] + E[Y²] .   (1)

Note that this equation also could be written

    E[X²] = E[ (X − E[X | F])² ] + E[ (E[X | F])² ] .

b. Use this to show that var(X) = var(X − Y) + var(Y).
c. If we interpret conditional expectation as an orthogonal projection in a vector space,
what theorem about orthogonality does part a represent?
d. We have n independent coin tosses with each equally likely to be H or T. Take X
to be the indicator function of the event that the first toss is H. Take F to be
the σ algebra generated by the number of H tosses in all. Calculate each of the
three quantities in (1) from scratch and check that the equation holds. Both of
the terms on the right are easiest to do using the law of total probability, which
is pretty obvious in this case.
3. (Bayesian identification of a Markov chain) We have a state space of size m and two
m × m stochastic matrices, Q, and R. First we pick one of the matrices, choosing Q
with probability f and R with probability 1 − f. Then we use the chosen matrix to
run a Markov chain X, starting with X(0) = 1 up to time T.
a. Describe the probability space Ω appropriate for this situation.
b. Let F be the σ algebra generated by the chain itself, without knowing whether Q or
R was chosen. Find a formula for P(Q | F) (which would be P(Q | X = x) in
classical notation). Though this formula might be ugly, it is easy to program.
4. Suppose we have a 3 state Markov chain with transition matrix

    P = ( .6 .2 .2
          .3 .5 .2
          .1 .2 .7 )

and suppose that X(0) = 1. For any t > 0, the σ algebras F_t and G_t are as in the notes,
and H_t is the σ algebra generated by X(s) for t ≤ s ≤ T.
a. Show that the probability distribution of the first t steps conditioned on G_{t+1} is the
same as that conditioned on H_{t+1}. This is a kind of backwards Markov property:
a forward Markov chain is a backward Markov chain also.
b. Calculate P(X(3) = 2 | G_4). This consists of 3 numbers.
Stochastic Calculus, Fall 2004 (http://www.math.nyu.edu/faculty/goodman/teaching/StochCalc2004/)
Assignment 3.
Given September 16, due September 23. Last revised, September 16.
Objective: Markov chains, II and lattices.
1. We have a three state Markov chain with transition matrix

    P = ( 1/2 1/4 1/4
          1/2 1/2  0
          1/3 1/3 1/3 ) .

(Some of the transition probabilities are P(1 → 1) = 1/2, P(3 → 1) = 1/3, and
P(1 → 2) = 1/4.) Let τ = min{t | X_t = 3}. Even though this τ is not bounded (it
could be arbitrarily large), we will see that P(τ > t) ≤ Ca^t for some a < 1 so that the
probability of large τ is very small. This is enough to prevent the stopping time paradox
(take my word for it). Suppose that at time t = 1, all states are equally likely.
a. Consider the quantities u_t(j) = P(X_t = j and τ > t). Find a matrix evolution
equation for a two component vector made from the u_t(j) and a submatrix, P̃, of
P.
b. Solve this equation using the eigenvectors and eigenvalues of P̃ to find a formula
for m_t = P(τ = t).
c. Use the answer of part b to find E[τ]. It might be helpful to use the formula

    Σ_{t=1}^∞ t P(τ = t) = Σ_{t=1}^∞ P(τ ≥ t) .

Verify the formula if you want to use it.
d. Consider the quantities f_t(j) = P(τ ≤ t | X_1 = j). Find a matrix recurrence for
them.
e. Use the method of question 1 to solve this and find the f_t.
2. Let P be the transition matrix for an n state Markov chain. Let v(k) be a function of
the state k ∈ S. For this problem, suppose that paths in the Markov chain start at
time t = 0 rather than t = 1, so that X = (X_0, X_1, . . .). For any complex number, z,
with |z| < 1, consider the sum

    E[ Σ_{t=0}^∞ z^t v(X_t) | F_0 ] .   (1)

Of course, this is a function of X_0 = k, which we call f(k). Find a linear matrix
equation for the quantities f that involves P, z, and v. Hint: the sum

    E[ Σ_{t=1}^∞ z^t v(X_t) | F_1 ]

may be related to (1) if we take out the common factor, z.
3. (Change of measure) Let P be the probability measure corresponding to ordinary random
walk with step probabilities a = b = c = 1/3. Let Q be the measure for the random
walk where the up, stay, and down probabilities from site x depend on x as

    (c(x), b(x), a(x)) = (1/3) e^{ψ(x)} ( e^{φ(x)}, 1, e^{−φ(x)} ) .

We may choose φ(x) arbitrarily and then choose ψ(x) so that a(x) + b(x) + c(x) = 1.
Taking φ ≠ 0 introduces drift into the random walk. The state space for walks of
length T is the set of paths x = x(0), . . . , x(T) through the lattice. Assume that
the probability distribution for x(0) is the same for P and Q. Find a formula for
Q(x)/P(x). The answer is a discrete version of Girsanov's formula.
4. In the urn process, suppose balls are either stale or fresh. Assume that the process
starts with all n stale balls and that all replacement balls are fresh. Let τ be the first
time all balls are fresh. Let X(t) be the number of stale balls at time t. Show that
X(t) is a Markov chain and write the transition probabilities. Use the backward or
forward equation approach to calculate the quantities a(x) = E_x(τ). This is a two term
recurrence relation (a(x + 1) = something a(x)) that is easy to solve. Show that at
time τ, the colors of the balls are iid red with probability p. Use this to explain the
binomial formula for the invariant distribution of colors.
Stochastic Calculus, Fall 2004 (http://www.math.nyu.edu/faculty/goodman/teaching/StochCalc2004/)
Assignment 4.
Given October 1, due October 14. Last revised, October 1.
Objective: Gaussian random variables.
1. Suppose X ∼ N(µ, σ²). Find a formula for E[e^X]. Hint: write the integral and complete
the square in the exponent. Use the answer without repeating the calculation to get
E[e^{aX}] for any constant a.
2. In finance, people often use N(x) for the CDF (cumulative distribution function) for the
standard normal. That is, if Z ∼ N(0, 1) then N(x) = P(Z ≤ x). Suppose S = e^X for
X ∼ N(µ, σ²). Find a formula for E[max(S, K)] in terms of the N function. (Hint:
as in problem 1.) This calculation is part of the Black–Scholes theory of the value of a
vanilla European style call option. K is the known strike price and S is the unknown
stock price.
3. Suppose X = (X_1, X_2, X_3) is a 3 dimensional Gaussian random variable with mean zero
and covariance

    E[XX^t] = ( 2 1 0
                1 2 1
                0 1 2 ) .

Set Y = X_1 + X_2 − X_3 and Z = 2X_1 − X_2.
a. Write a formula for the probability density of Y.
b. Write a formula for the joint probability density for (Y, Z).
c. Find a linear combination W = aY + bZ that is independent of X_1.
4. Take X_0 = 0 and define X_{k+1} = X_k + Z_k, for k = 0, . . . , n − 1. Here the Z_k are iid
standard normals, so that

    X_k = Σ_{j=0}^{k−1} Z_j .   (1)

Let X ∈ R^n be the vector X = (X_1, . . . , X_n).
a. Write the joint probability density for X and show that X is a multivariate normal.
Identify the n × n tridiagonal matrix H that arises.
b. Use the formula (1) to calculate the variance of X_k and the covariance E[X_j X_k].
c. Use the answers to part b to write a formula for the elements of H^{−1}.
d. Verify by matrix multiplication that your answer to part c is correct.
Stochastic Calculus, Fall 2004 (http://www.math.nyu.edu/faculty/goodman/teaching/StochCalc2004/)
Assignment 5.
Given October 1, due October 21. Last revised, October 7.
Objective: Brownian Motion.
1. Suppose h(x) has h′(x) > 0 for all x so that there is at most one x for each y so that
y = h(x). Consider the process Y_t = h(X_t), where X_t is standard Brownian motion.
Suppose the function h(x) is smooth. The answers to the questions below depend at
least on second derivatives of h.
a. With the notation ΔY_t = Y_{t+Δt} − Y_t, for a positive Δt, calculate a(y) and b(y) so
that E[ΔY_t | F_t] = a(Y_t)Δt + O(Δt²) and E[ΔY_t² | F_t] = b(Y_t)Δt + O(Δt²).
b. With the notation f(Y_t, t) = E[V(Y_T) | F_t], find the backward equation satisfied
by f. (Assume T > t.)
c. Writing u(y, t) for the probability density of Y_t, use the duality argument to find the
forward equation satisfied by u.
d. Write the forward and backward equations for the special case Y_t = e^{cX_t}. Note (for
those who know) the similarity of the backward equation to the Black–Scholes
partial differential equation.
2. Use a calculation similar to the one we used in class to show that Y_T = X_T⁴ − 6 ∫_0^T X_t² dt
is a martingale. Here X_t is Brownian motion.
3. Show that Y_t = cos(kX_t) e^{k²t/2} is a martingale.
a. Verify this directly by first calculating (as in problem 1) that

    E[Y_{t+Δt} | F_t] = Y_t + O(Δt²) .

Then explain why this implies that Y_t is a martingale exactly. (Hint: To show that
E[Y_{t′} | F_t] = Y_t, divide the time interval (t, t′) into n small pieces and let n → ∞.)
b. Verify that Y_t is a martingale using the fact that a certain function satisfies the
backward equation. Note that, for any function V(x), Z_t = E[V(X_T) | F_t] is a
martingale (the tower property). Functions like this Z satisfy backward equations.
c. Find a simple intuition that allows a supposed martingale to grow exponentially in
time.
4. Let A_{x_0,t} be the event that a standard Brownian motion starting at x_0 has X_{t′} > 0 for all t′
between 0 and t. Here are two ways to verify the large time asymptotic approximation

    P(A_{x_0,t}) ≈ (1/√(2π)) (2x_0/√t) .

a. Use the formula from Images and reflections to get

    P(A_{x_0,t}) = ∫_0^∞ u(x, t)dx ≈ (1/√(2πt)) ∫_0^∞ e^{−x²/2t} ( e^{xx_0/t} − e^{−xx_0/t} ) dx .

The change of variables y = x/√t should make it clear how to approximate the
last integral for large t.
b. Use the same formula to get

    (d/dt) P(A_{x_0,t}) = −(1/√(2π)) x_0 t^{−3/2} e^{−x_0²/2t} .   (1)

Once we know that P(A_{x_0,t}) → 0 as t → ∞, we can estimate its value by
integrating (1) from t to ∞ using the approximation e^{−const/t} ≈ 1 for large t. Note:
There are other hitting problems for which P(A_t) does not go to zero as t → ∞.
This method would not work for them.
Stochastic Calculus, Fall 2004 (http://www.math.nyu.edu/faculty/goodman/teaching/StochCalc2004/)
Assignment 6.
Given October 21, due October 28. Last revised, October 21.
Objective: Forward and Backward equations for Brownian motion.
The terms forward and backward equation refer to the equations satisfied by the probability
density of X_t and by E_{x,t}[V(X_T)] respectively. The integrals below are easily done if you
use identities such as

    (1/√(2πσ²)) ∫ x^{2n} e^{−x²/2σ²} dx = σ^{2n} (2n − 1)(2n − 3) · · · 3 · 1 .

You should not have to do any actual integration for these problems.
1. Solve the forward equation with initial data

    u_0(x) = (x²/√(2π)) e^{−x²/2} .

a. Assume the solution has the form

    u(x, t) = ( A(t)x² + B(t) ) g(x, t) ,   g(x, t) = G(0, x, t+1) = (1/√(2π(t+1))) e^{−x²/2(t+1)} .   (1)

Then find and solve the ordinary differential equations for A(t) and B(t) that
make (1) a solution of the forward equation.
b. Compute the integrals

    u(x, t) = ∫ u_0(y) G(y, x, t) dy .

This should give the same answer as part a.
c. Sketch the probability density at time t = 0, for small time, and for large time.
Rescale the large time plot so that it doesn't look flat.
d. Why does the structure seen in u(x, t) for small time (the double hump) disappear
for large t?
e. Show in a rough way that a similar phenomenon happens for any initial data of the
form u_0(x) = p(x)g(x, 0), where p(x) is an even nonnegative polynomial. When t
is large, u(x, t) looks like a simple Gaussian, no matter what p was.
2. Solve the backward equation with final data V(x) = x⁴.
a. Write the solution in the form

    f(x, t) = x⁴ + a(t)x² + b(t) .   (2)

Then find and solve the differential equations that a(t) and b(t) must satisfy so
that (2) is the solution of the backward equation.
b. Compute the integrals

    f(x, t) = ∫ G(x, y, T − t) V(y) dy .

This should be the same as your answer to part a.
c. Give a simple explanation for the form of the formula for f(0, t) = b(t) in terms of
moments of a Gaussian random variable.
3. Check that ∫ u(x, t) f(x, t) dx is independent of t.
Stochastic Calculus, Fall 2004 (http://www.math.nyu.edu/faculty/goodman/teaching/StochCalc2004/)
Assignment 7.
Given November 4, due November 11. Last revised, November 9.
Objective: Pure and applied mathematics.
The first problems are strictly theoretical. They illustrate how clever some rigorous proofs
are. The inequality (3) serves the following function: We want to understand something
about the entire path F_k for 0 ≤ k ≤ n. We can get bounds on F_k for particular values
of k by calculating expectations (e.g. E[F_k²]). Then (3) uses this to say something about
the whole path. As an application, we will have an easy proof of the convergence of the
approximations to the Ito integral for all t ≤ T once we can prove it at the single time T.
1. Let F_k be a discrete time nonnegative martingale. Let M_n = max_{0≤k≤n} F_k be its maximal
function. This problem is the proof that

    P(M_n > f) ≤ (1/f) E[ F_n 1_{M_n≥f} ] .   (1)

The proof also shows that if F_k is any martingale and M_n = max_{0≤k≤n} |F_k| its maximal
function, then

    P(M_n > f) ≤ (1/f) E[ |F_n| 1_{M_n≥f} ] .   (2)

These inequalities are relatives of Markov's inequality (also called Chebychev's inequality,
though that term is usually applied to an interesting special case), which says that
if X is a nonnegative random variable then P(X > a) < (1/a) E[X 1_{X>a}], or, if X is any
random variable, that P(|X| > a) < (1/a) E[|X| 1_{|X|>a}].
a. Let A be the event A = {M_n ≥ f}. Write A as a disjoint union of disjoint events
B_k ∈ F_k so that F_k(ω) ≥ f when ω ∈ B_k. Hint: If M_n ≥ f, there is a first k with
F_k ≥ f.
b. Since F_k(ω) ≥ f for ω ∈ B_k, show that P(B_k) ≤ (1/f) E[1_{B_k} F_k] (the main step in the
Markov/Chebychev inequality).
c. Use the martingale property and the tower property to show that 1_{B_k} F_k = E[1_{B_k} F_n | F_k],
so E[1_{B_k} F_k] = E[1_{B_k} F_n]. Do this for discrete probability if that helps you.
d. Add these to get (1).
e. We say G_k is a submartingale if G_k ≤ E[G_n | F_k] (warning: submartingales go up,
not down). Show that if F_k is any martingale, then |F_k| is a submartingale. Show
(1) applies to nonnegative submartingales so (2) applies to general martingales,
positive or not.
2. Let M be any nonnegative random variable. Define λ(f) = P(M ≥ f), which is related
to the CDF of M. Use the definition of the abstract integral to show that E[M] =
∫_0^∞ λ(f)df and E[M²] = 2 ∫_0^∞ f λ(f)df. These formulas work even if the common
value is infinite. If G is another nonnegative random variable, show that E[GM] =
∫_0^∞ E[G 1_{M≥f}]df. Of course, one way to do this is to formulate a single general formula
that each of these is a special case of.
3. Use the formulas of part 2 together with Doob's inequality (2) to show that

    E[M_n²] ≤ 2 E[ M_n |F_n| ] ,

so

    E[M_n²] ≤ 4 E[ F_n² ] .   (3)

(It will help to use the Cauchy Schwarz inequality E[XY] ≤ (E[X²] E[Y²])^{1/2}.)
Now some more concrete examples. We can think of martingales as absolutely non mean
reverting. The inequality (3) expresses that fact in one way: the maximum of a martingale
is comparable to its value at the final time, on average. The Ornstein Uhlenbeck process
is the simplest continuous time mean reverting process, a continuous time analogue of the
simple urn model.
4. An Ornstein Uhlenbeck process is an adapted process X(t) that satisfies the Ito differential
equation

    dX(t) = −X(t)dt + dW(t) .   (4)

We cannot use Ito's lemma to calculate dX(t) because X(t) is not a function of W(t)
and t alone.
a. Examine the definition of the Ito integral and verify that if g(t) is a differentiable
function of t, and dX(t) = a(t)dt + b(t)dW(t), with a random but bounded b(t),
then d(g(t)X(t)) = g′(t)X(t)dt + g(t)dX(t). It may be helpful to use the Ito
isometry formula (paragraph 1.17 of lecture 7).
b. Bring the drift term to the left side of (4), multiply by e^t and integrate (using part
a) to get

    X(T) = e^{−T} X(0) + ∫_0^T e^{−(T−t)} dW(t) .

c. Conclude that X(T) is Gaussian for any T (if X(0) is) and that the probability
density for X(T) has a limit as T → ∞. Find the limit by computing the mean
and variance.
d. Contrast the large time behavior of the Ornstein Uhlenbeck process with that of
Brownian motion.
5. Show that e^{ikW(t)+k²t/2} is a martingale using Ito differentiation.
Stochastic Calculus, Fall 2004 (http://www.math.nyu.edu/faculty/goodman/teaching/StochCalc2004/)
Assignment 8.
Given November 11, due November 18. Last revised, November 11.
Objective: Diffusions and diffusion equations.
1. An Ornstein Uhlenbeck process is a stochastic process that satisfies the stochastic differential
equation

    dX(t) = −X(t)dt + dW(t) .   (1)

a. Write the backward equation for f(x, t) = E_{x,t}[V(X(T))].
b. Show that the backward equation has (Gaussian) solutions of the form f(x, t) =
A(t) exp( −s(t)(x − µ(t))²/2 ). Find the differential equations for A, µ, and s that
make this work.
c. Show that f(x, t) does not represent a probability distribution, possibly by showing
that ∫ f(x, t)dx is not a constant.
d. What is the large time behavior of A(t) and s(t)? What does this say about the
nature of an Ornstein Uhlenbeck reward that is paid long in the future as a
function of starting position?
2. The forward equation:
a. Write the forward equation for u(x, t) which is the probability density for X(t).
b. Show that the forward equation has Gaussian solutions of the form

    u(x, t) = (1/√(2πσ²(t))) e^{−(x−µ(t))²/2σ²(t)} .

Find the appropriate differential equations for µ and σ.
c. Use the explicit solution formula for (1) from assignment 7 to calculate µ(t) =
E[X(t)] and σ²(t) = var[X(t)]. These should satisfy the equations you wrote for
part b.
d. Use the approximation from (1): ΔX ≈ −XΔt + ΔW (and the independent
increments property) to express Δµ and Δ(σ²) in terms of µ and σ and get yet
another derivation of the answer in part b. Use the definitions of µ and σ from
part c.
e. Differentiate ∫ x u(x, t)dx with respect to t using the forward equation to find a
formula for dµ/dt. Find the formula for dσ/dt in a similar way from the forward
equation.
f. Give an abstract argument that X(t) should be a Gaussian random variable for
each t (something is a linear function of something), so that knowing µ(t) and
σ(t) determines u(x, t).
g. Find the solutions corresponding to σ(0) = 0 and µ(0) = y and use them to get a
formula for the transition probability density (Green's function) G(y, x, t). This
is the probability density for X(t) given that X(0) = y.
h. The transition density for Brownian motion is G_B(y, x, t) = (1/√(2πt)) exp(−(x−y)²/2t).
Derive the transition density for the Ornstein Uhlenbeck process from this using
the Cameron Martin Girsanov formula (warning: I have not been able to do this
yet, but it must be easy since there is a simple formula for the answer. Check the
bboard.).
i. Find the large time behavior of µ(t) and σ(t). What does this say about the distribution
of X(t) for large t as a function of the starting point?
3. Duality:
a. Show that the Green's function from part 2 satisfies the backward equation as a
function of y and t.
b. Suppose the initial density is u(x, 0) = δ(x − y) and that the reward is V(x) =
δ(x − z). Use your expressions for the corresponding forward solution u(x, t) and
backward solution f(x, t) to show by explicit integration that ∫ u(x, t) f(x, t)dx
is independent of t.
Stochastic Calculus, Fall 2004 (http://www.math.nyu.edu/faculty/goodman/teaching/StochCalc2004/)
Assignment 9.
Given December 9, due December 23. Last revised, December 9.
Instructions: Please answer these questions without discussing them with others or looking
up the answers in books.
1. Let S be a finite state space for a Markov chain. Let ξ(t) ∈ S be the state of the
chain at time t. The chain is nondegenerate if there is an n with P^n_{jk} ≠ 0 for all
j ∈ S and k ∈ S. Here the P_{jk} are the j → k transition probabilities and P^n_{jk} is
the (j, k) entry of P^n, which is the n step j → k transition probability. For any
nondegenerate Markov chain with a finite state space, the Perron Frobenius theorem
gives the following information. There is a row vector, π, with Σ_{k∈S} π(k) = 1 and
π(k) > 0 for all k ∈ S (a probability vector) so that ||P^t − 1π|| ≤ Ce^{−αt}. Here 1 is the
column vector of all ones and α > 0. In the problems below, assume that the transition
matrix P represents a nondegenerate Markov chain.
(a) Show that if P(ξ(t) = k) = π(k) for all k ∈ S, then P(ξ(t + 1) = k) = π(k) for
all k ∈ S. In this sense, π represents the steady state or invariant probability
distribution.
(b) Show that P has one eigenvalue equal to one, which is simple, and that every
other eigenvalue λ has |λ| < 1.
(c) Let u(k, t) = P(ξ(t) = k). Show that u(k, t) → π(k) as t → ∞. No matter what
probability distribution the Markov chain starts with, the probability distribution
converges to the unique steady state distribution.
(d) Suppose we have a function f(k) defined for k ∈ S and that E_π[f(ξ)] = 0.
Let f be the column vector with entries f(k) and πf the row vector with entries
(πf)(k) = f(k)π(k). Show that

    cov_π( f(ξ(0)), f(ξ(t)) ) = E_π[ f(ξ(0)) f(ξ(t)) ] = πf P^t f .

(e) Show that if A is a square matrix with ||A|| < 1, then Σ_{t=0}^∞ A^t = (I − A)^{−1}.
This is a generalization of the geometric sequence formula Σ_{t=0}^∞ a^t = 1/(1 − a) if
|a| < 1, and the proof/derivation can be almost the same, once the series is shown
to converge.
(f) Show that if E_π[f(ξ)] = 0, then Σ_{t=0}^∞ P^t f = g with g − Pg = f and E_π[g(ξ)] = 0.
If the series converges, the argument above should apply.
(g) Show that

    C = Σ_{t=0}^∞ cov_π[ f(ξ(0)), f(ξ(t)) ] = πf g ,

where g is as above.
(h) Let X(T) = Σ_{t=0}^T f(ξ(t)). Show that var(X(T)) ≈ DT for large T, where

    D = var_π[f(ξ)] + 2 Σ_{t=1}^∞ cov_π[ f(ξ(0)), f(ξ(t)) ] .

This is a version of the Einstein Kubo formula. To be precise, (1/T) var(X(T)) → D
as T → ∞. Even more precisely, |var(X(T)) − DT| is bounded as T → ∞. Prove
whichever of these you prefer.
(i) Suppose P represents a Markov chain with invariant probability distribution π
and we want to know ā = E_π[f(ξ)]. Show that â_T = (1/T) Σ_{t=0}^T f(ξ(t)) converges
to ā as T → ∞ in the sense that E[(â_T − ā)²] → 0 as T → ∞. Show that this
convergence does not depend on u(k, 0), the initial probability distribution. It is
not terribly hard (though not required in this assignment) to show that â_T → ā as
T → ∞ almost surely. This is the basis of Markov chain Monte Carlo, which uses
Markov chains to sample probability distributions, π, that cannot be sampled in
any simpler way.
(j) Consider the Markov chain with state space −L ≤ k ≤ L having 2L + 1 states.
The one step transition probabilities are 1/3 for any k → k − 1, k → k or k → k + 1
transitions that do not take the state out of the state space. Transitions that
would go out of S are rejected, so that, for example, P(L → L) = 2/3. Take
f(k) = k and calculate ā and D. Hint: the general solution to the equations
(g − Pg)(k) = k is a cubic polynomial in k.
2. A Brownian bridge is a Brownian motion, X(t), with X(0) = X(T) = 0. Find an
SDE satisfied by the Brownian bridge. Hint: Calculate E_{x,t}[ΔX | X(T) = 0], which is
something about a multivariate normal.
3. Suppose stock prices S_1(t), . . . , S_n(t) satisfy the SDEs dS_k(t) = µ_k S_k dt + σ_k S_k dW_k(t),
where the W_k(t) are correlated standard Brownian motions with correlation coefficients
ρ_{jk} = corr(W_j(t), W_k(t)).
(a) Write a formula for S_1(t), . . . , S_n(t) in terms of independent Brownian motions
B_1(t), . . . , B_n(t). You may use the Cholesky decomposition LL^t = ρ.
(b) Write a formula for u(s, t), the joint density function of S(t) ∈ R^n. This is the n
dimensional correlated lognormal density.
(c) Write the partial differential equation one could solve to determine E[max(S_1(T), S_2(T))]
with S_1(0) = s_1 and S_2(0) = s_2 and ρ_{12} ≠ 0.
4. Suppose dS(t) = a(S(t), t)dt + σ(S(t), t)S(t)dB(t). Write a formula for ∫_0^T S(t)dS(t)
that involves only Riemann integrals and evaluations.