Math Camp
ECMT
July 5, 2018
1 Differentiation
2 Integration
3 Multi-Variate Calculus
If $y = e^x$, then $\frac{dy}{dx} = e^x$.

If $y = \ln x$, then $\frac{dy}{dx} = \frac{1}{x}$.

If $h(x) = \sum_{i=1}^n g_i(x)$, then $h'(x) = \sum_{i=1}^n g_i'(x)$.

Chain rule: if $y = f(u)$ and $u = g(x)$, then $\frac{dy}{dx} = \frac{dy}{du}\frac{du}{dx}$.
Example

$$f(x) = \begin{cases} x^2 \sin(\tfrac{1}{x}) & \text{if } x \neq 0 \\ 0 & \text{if } x = 0 \end{cases}$$

is differentiable at $x = 0$ but not continuously differentiable: $f'(0) = \lim_{h \to 0} h \sin(\tfrac{1}{h}) = 0$ exists, but for $x \neq 0$, $f'(x) = 2x \sin(\tfrac{1}{x}) - \cos(\tfrac{1}{x})$, which has no limit as $x \to 0$.
$$x_1 = g_1(y_1, \cdots, y_m)$$
$$\vdots$$
$$x_n = g_n(y_1, \cdots, y_m)$$
The Implicit Function Theorem is the idea behind the Lagrange method.
Example

$$x^2 - y^2 - u^3 + v^2 + 4 = 0$$
$$2xy + y^2 - 2u^2 + 3v^4 + 8 = 0$$

Let

$$f = \begin{pmatrix} f_1 \\ f_2 \end{pmatrix} = \begin{pmatrix} x^2 - y^2 - u^3 + v^2 + 4 \\ 2xy + y^2 - 2u^2 + 3v^4 + 8 \end{pmatrix}$$

then

$$\begin{pmatrix} \frac{\partial f_1}{\partial u} & \frac{\partial f_1}{\partial v} \\ \frac{\partial f_2}{\partial u} & \frac{\partial f_2}{\partial v} \end{pmatrix} = \begin{pmatrix} -3u^2 & 2v \\ -4u & 12v^3 \end{pmatrix}$$

Hence

$$\begin{pmatrix} \frac{\partial u}{\partial x} & \frac{\partial u}{\partial y} \\ \frac{\partial v}{\partial x} & \frac{\partial v}{\partial y} \end{pmatrix} = -\begin{pmatrix} -3u^2 & 2v \\ -4u & 12v^3 \end{pmatrix}^{-1} \begin{pmatrix} 2x & -2y \\ 2y & 2x + 2y \end{pmatrix} = \frac{1}{8uv - 36u^2 v^3} \begin{pmatrix} -12v^3 & 2v \\ -4u & 3u^2 \end{pmatrix} \begin{pmatrix} 2x & -2y \\ 2y & 2x + 2y \end{pmatrix}$$
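The implicit Jacobian can be verified symbolically; a minimal sketch using sympy (not part of the original slides, variable names are illustrative):

```python
import sympy as sp

x, y, u, v = sp.symbols('x y u v')
F = sp.Matrix([x**2 - y**2 - u**3 + v**2 + 4,
               2*x*y + y**2 - 2*u**2 + 3*v**4 + 8])

# Implicit function theorem: D(u,v)/D(x,y) = -[dF/d(u,v)]^{-1} [dF/d(x,y)]
J_endog = F.jacobian([u, v])   # derivatives w.r.t. the endogenous variables
J_exog = F.jacobian([x, y])    # derivatives w.r.t. the exogenous variables
D = sp.simplify(-J_endog.inv() * J_exog)
print(D)
```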
Example

$$e^x = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \cdots + \frac{x^{n-1}}{(n-1)!} + R_n$$

where $R_n = \frac{e^{\xi} x^n}{n!}$ (the Lagrange form of the remainder) and $\xi$ lies between $0$ and $x$.
If the function $f(x)$ is continuous on the closed interval $[a, b]$ and if $F(x)$ is any antiderivative (indefinite integral) of $f(x)$, then

$$\int_a^b f(x)\,dx = F(b) - F(a)$$

where $F(b)$ is the antiderivative of $f(x)$ evaluated at $x = b$ and $F(a)$ is the antiderivative evaluated at $x = a$. The expression $F(b) - F(a)$ is often denoted $[F(x)]_a^b$.
$$\int x^n\,dx = \frac{x^{n+1}}{n+1} + C \quad (n \neq -1)$$

$$\int [f(x) \pm g(x)]\,dx = \int f(x)\,dx \pm \int g(x)\,dx$$

$$\int k f(x)\,dx = k \int f(x)\,dx$$

$$\int e^x\,dx = e^x + C$$

$$\int \frac{1}{x}\,dx = \ln(x) + C$$

$$\int_a^c f(x)\,dx = \int_a^b f(x)\,dx + \int_b^c f(x)\,dx$$

$$\int_a^a f(x)\,dx \equiv \lim_{c \to a} \int_a^c f(x)\,dx = 0$$

$$\int_a^c f(x)\,dx = -\int_c^a f(x)\,dx$$
Substitution

Let $u = x^3 + e^x$; then $\frac{du}{dx} = \frac{d(x^3 + e^x)}{dx} = 3x^2 + e^x$, so we have

$$\int u \frac{du}{dx}\,dx = \int u\,du = \frac{u^2}{2} + C = \frac{(x^3 + e^x)^2}{2} + C$$
Integration by Parts

$$\int u\,dv = uv - \int v\,du$$

Example

$$\int x e^x\,dx = x e^x - \int e^x\,dx = x e^x - e^x + C$$
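Both manipulations are easy to verify symbolically; a minimal sketch with sympy (illustrative, not from the slides):

```python
import sympy as sp

x = sp.symbols('x')
u = x**3 + sp.exp(x)
# Substitution example: the integrand is u * du/dx
print(sp.expand(u**2 / 2 - sp.integrate(u * sp.diff(u, x), x)))  # -> 0
# Integration by parts example
print(sp.integrate(x * sp.exp(x), x))   # -> (x - 1)*exp(x), i.e. x e^x - e^x
```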
Leibniz's Rule. If $f(x, \theta)$, $a(\theta)$ and $b(\theta)$ are differentiable with respect to $\theta$, then

$$\frac{d}{d\theta} \int_{a(\theta)}^{b(\theta)} f(x, \theta)\,dx = f(b(\theta), \theta) \frac{d}{d\theta} b(\theta) - f(a(\theta), \theta) \frac{d}{d\theta} a(\theta) + \int_{a(\theta)}^{b(\theta)} \frac{\partial f(x, \theta)}{\partial \theta}\,dx$$

Useful for bringing the differentiation inside the integral. Also useful for finding integrals by differentiating first.
Example

$$\int_0^1 \frac{x^{\alpha} - 1}{\ln x}\,dx \quad (\alpha \geq 0)$$

Let $F(\alpha) = \int_0^1 \frac{x^{\alpha} - 1}{\ln x}\,dx$. Differentiating both sides with respect to $\alpha$,

$$F'(\alpha) = \frac{d}{d\alpha} \int_0^1 \frac{x^{\alpha} - 1}{\ln x}\,dx = \int_0^1 \frac{\partial}{\partial \alpha} \frac{x^{\alpha} - 1}{\ln x}\,dx = \int_0^1 x^{\alpha}\,dx = \frac{1}{\alpha + 1}$$

Integrating back and using $F(0) = 0$ gives $F(\alpha) = \ln(1 + \alpha)$.
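A quick numerical check of the result (a sketch using scipy; the endpoint singularities are integrable and quad handles them):

```python
import numpy as np
from scipy.integrate import quad

alpha = 2.5
val, err = quad(lambda x: (x**alpha - 1) / np.log(x), 0, 1)
print(val, np.log(1 + alpha))   # both approximately 1.2528
```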
Useful for finding the rate of change with respect to one variable keeping
all others constant.
Example

Let

$$f(x_1, x_2) = A x_1^{\alpha} x_2^{\beta}$$

then

$$\frac{\partial f}{\partial x_1} = \alpha A x_1^{\alpha - 1} x_2^{\beta}$$

$$\frac{\partial f}{\partial x_2} = \beta A x_1^{\alpha} x_2^{\beta - 1}$$

Cross-partials are symmetric (Young's theorem):

$$\frac{\partial^2 f}{\partial x_i \partial x_j} = \frac{\partial^2 f}{\partial x_j \partial x_i}$$

The directional derivative of $f$ at $x_0$ in direction $v$ is

$$\frac{\partial f}{\partial v} = \lim_{h \to 0,\, h \neq 0} \frac{f(x_0 + hv) - f(x_0)}{h}$$
Example: for $f(x, y) = x^2 + y^2$ and direction $v = (1, 1)$,

$$\frac{\partial f}{\partial v} = \nabla f \cdot v = 2x + 2y$$
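A finite-difference check of the directional-derivative definition (a sketch; the function and direction are the ones assumed in the example above):

```python
import numpy as np

f = lambda p: p[0]**2 + p[1]**2
x0 = np.array([1.0, 2.0])
v = np.array([1.0, 1.0])

h = 1e-6
num = (f(x0 + h * v) - f(x0)) / h      # definition with a small step h
grad_dot_v = 2 * x0[0] + 2 * x0[1]     # closed form: grad f . v = 2x + 2y
print(num, grad_dot_v)                 # both approximately 6.0
```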
Math Camp
ECMT
1 Set Theory
2 Unconstrained Optimization
3 Constrained Optimization
Lagrange Method
Envelope Theorem
Kuhn-Tucker Theorem
$x \in S$ denotes that $x$ is an element of the set $S$; $x \notin S$ denotes that it is not.
Number sets:
Natural numbers N = {1, 2, 3, ...}
Integers Z = {..., −2, −1, 0, 1, 2, ...}
Positive integers Z+ = {1, 2, 3, ...}
Negative integers Z− = {..., −3, −2, −1}
Rational numbers Q = {p/q : p, q ∈ Z and q ≠ 0}
Real numbers R = (−∞, ∞)
Positive real numbers R+ = [0, ∞)
Strictly positive real numbers R++ = (0, ∞)
Negative real numbers R− = (−∞, 0]
Strictly negative real numbers R−− = (−∞, 0)
If all the elements of set X are also elements of set Y , then X is a subset
of Y , written
X ⊆Y
If all the elements of set X are in set Y , but not all elements of set Y are
in set X , then X is a proper subset of Y , written
X ⊂Y
The empty set or the null set is the set with no elements, written ∅
Sets and Subsets
The intersection W of two sets X and Y is the set of elements that are in
both X and Y
W = X ∩ Y = {x : x ∈ X and x ∈ Y }
The union V of two sets X and Y is the set of elements that are in one or
other of the sets
V = X ∪ Y = {x : x ∈ X or x ∈ Y }
Example: for X = {1, 2, 3} and Y = {3, 4, 5},

X ∩ Y = {3}

X ∪ Y = {1, 2, 3, 4, 5}

The power set of X is the set of all its subsets, P(X) = {A : A ⊆ X}; for X = {1, 2, 3},

P(X) = {∅, {1}, {2}, {3}, {1, 2}, {1, 3}, {2, 3}, {1, 2, 3}}
Example

$$d[(2, 3, 4), (4, 1, -5)] = \sqrt{(2 - 4)^2 + (3 - 1)^2 + (4 - (-5))^2} = \sqrt{89}$$
A point $x^*$ is a global maximum if $f(x^*) \geq f(x)$ for all $x$ in the domain, and $\hat{x}$ is a local maximum if $f(\hat{x}) \geq f(x)$ for all $x$ in a neighborhood of $\hat{x}$. Reversing the inequalities gives a global minimum, $f(x^*) \leq f(x)$, and a local minimum, $f(\hat{x}) \leq f(x)$.
For functions of several variables, the second-order conditions are checked through the leading principal minors of the Hessian.

Example

Given

$$f(x_1, x_2, x_3) = 5x_1^2 + 2x_2^2 + x_3^4 - 32x_3 + 6x_1 x_2 + 5x_2$$

Solving $\nabla f(X^*) = 0$:

$$\nabla f = \begin{pmatrix} 10x_1 + 6x_2 \\ 6x_1 + 4x_2 + 5 \\ 4x_3^3 - 32 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}$$

which gives

$$X^* = \begin{pmatrix} 7.5 \\ -12.5 \\ 2 \end{pmatrix}$$

with leading principal minors of the Hessian (evaluated at $X^*$)

$$|H_1| = |10| = 10, \qquad |H_2| = \begin{vmatrix} 10 & 6 \\ 6 & 4 \end{vmatrix} = 4, \qquad |H_3| = \begin{vmatrix} 10 & 6 & 0 \\ 6 & 4 & 0 \\ 0 & 0 & 12(2)^2 \end{vmatrix} = 192$$

All three minors are positive, so the Hessian is positive definite at $X^*$ and the point is a local minimum.
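A numerical cross-check of the critical point (a sketch using scipy.optimize; the starting point is arbitrary):

```python
import numpy as np
from scipy.optimize import minimize

f = lambda x: (5*x[0]**2 + 2*x[1]**2 + x[2]**4 - 32*x[2]
               + 6*x[0]*x[1] + 5*x[1])
res = minimize(f, x0=np.zeros(3))
print(res.x)   # approximately [ 7.5, -12.5, 2.0 ]
```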
$$\frac{\partial x^T b}{\partial x} = \frac{\partial b^T x}{\partial x} = b$$

$$\frac{\partial Ax}{\partial x} = \frac{\partial x^T A}{\partial x} = A$$

$$\frac{\partial y^T Ax}{\partial x} = \frac{\partial x^T A^T y}{\partial x} = A^T y$$

$$\frac{\partial x^T Ax}{\partial x} = (A + A^T)x$$

$$\frac{\partial^2 x^T Ax}{\partial x \partial x^T} = A + A^T$$

$$\frac{\partial a^T X b}{\partial X} = ab^T$$

$$\frac{\partial a^T X^n b}{\partial X} = \sum_{r=0}^{n-1} (X^r)^T ab^T (X^{n-1-r})^T$$

$$\frac{\partial a^T X^T b}{\partial X} = ba^T$$

$$\frac{\partial a^T X a}{\partial X} = \frac{\partial a^T X^T a}{\partial X} = aa^T$$

$$\frac{\partial a^T X^T X b}{\partial X} = X(ab^T + ba^T)$$

$$\frac{\partial a^T (X^n)^T X^n b}{\partial X} = \sum_{r=0}^{n-1} \left[ X^{n-1-r} ab^T (X^n)^T X^r + (X^r)^T X^n ab^T (X^{n-1-r})^T \right]$$
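Identities like these are easy to spot-check symbolically; a minimal sketch verifying $\partial(x^T A x)/\partial x = (A + A^T)x$ with sympy (dimensions chosen arbitrarily):

```python
import sympy as sp

n = 3
x = sp.Matrix(sp.symbols(f'x0:{n}'))
A = sp.Matrix(n, n, sp.symbols(f'a0:{n*n}'))

quad_form = (x.T * A * x)[0, 0]
grad = sp.Matrix([sp.diff(quad_form, xi) for xi in x])
print(sp.simplify(grad - (A + A.T) * x))   # zero vector
```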
then the Lagrange method of finding $(x_1^*, \cdots, x_n^*)$ consists of forming the Lagrange function $L = f(x_1, \cdots, x_n) + \sum_j \lambda_j g^j(x_1, \cdots, x_n)$ and solving the first-order conditions

$$\frac{\partial}{\partial x_i} f(x_1^*, \cdots, x_n^*) + \sum_j \lambda_j \frac{\partial}{\partial x_i} g^j(x_1^*, \cdots, x_n^*) = 0$$

$$g^j(x_1^*, \cdots, x_n^*) = 0$$

where $i = 1, \cdots, n$ and $j = 1, \cdots, m$.
The better or upper contour set of the point $(x_1^0, x_2^0, \ldots, x_n^0)$ is

$$B(x_1^0, x_2^0, \ldots, x_n^0) = \{(x_1, \ldots, x_n) \in X : f(x_1, \ldots, x_n) \geq f(x_1^0, x_2^0, \ldots, x_n^0)\}$$

The worse or lower contour set of the point $(x_1^0, x_2^0, \ldots, x_n^0)$ is

$$W(x_1^0, x_2^0, \ldots, x_n^0) = \{(x_1, \ldots, x_n) \in X : f(x_1, \ldots, x_n) \leq f(x_1^0, x_2^0, \ldots, x_n^0)\}$$
Totally differentiating the first-order conditions $f_i(x_1^*, \cdots, x_n^*; \alpha_1, \cdots, \alpha_m) = 0$ with respect to $\alpha_j$, $j = 1, \cdots, m$:

$$\begin{pmatrix} \frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} & \cdots & \frac{\partial f_2}{\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_n}{\partial x_1} & \frac{\partial f_n}{\partial x_2} & \cdots & \frac{\partial f_n}{\partial x_n} \end{pmatrix} \begin{pmatrix} \frac{\partial x_1^*}{\partial \alpha_j} \\ \frac{\partial x_2^*}{\partial \alpha_j} \\ \vdots \\ \frac{\partial x_n^*}{\partial \alpha_j} \end{pmatrix} = \begin{pmatrix} -\frac{\partial f_1}{\partial \alpha_j} \\ -\frac{\partial f_2}{\partial \alpha_j} \\ \vdots \\ -\frac{\partial f_n}{\partial \alpha_j} \end{pmatrix}$$
By Cramer's Rule

$$\frac{\partial x_i^*}{\partial \alpha_j} = \frac{|F_{ij}|}{|F|}$$

where $F_{ij}$ is given by replacing the $i$th column of $F$ by the $j$th column of the Jacobian $J(f)$ with respect to the $\alpha_j$'s:
$$J = \begin{pmatrix} -\frac{\partial f_1}{\partial \alpha_1} & -\frac{\partial f_1}{\partial \alpha_2} & \cdots & -\frac{\partial f_1}{\partial \alpha_m} \\ -\frac{\partial f_2}{\partial \alpha_1} & -\frac{\partial f_2}{\partial \alpha_2} & \cdots & -\frac{\partial f_2}{\partial \alpha_m} \\ \vdots & \vdots & \ddots & \vdots \\ -\frac{\partial f_n}{\partial \alpha_1} & -\frac{\partial f_n}{\partial \alpha_2} & \cdots & -\frac{\partial f_n}{\partial \alpha_m} \end{pmatrix}$$
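A small numeric illustration of the Cramer's rule formula (a sketch; F and the right-hand side are arbitrary invertible examples):

```python
import numpy as np

F = np.array([[4.0, 1.0], [2.0, 3.0]])
b = np.array([1.0, 2.0])           # stands in for the -df/d(alpha_j) column

x = np.linalg.solve(F, b)          # direct solve of F dx = b
for i in range(2):
    Fi = F.copy()
    Fi[:, i] = b                   # replace the i-th column by b
    print(np.linalg.det(Fi) / np.linalg.det(F), x[i])   # identical values
```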
Example

Given

$$\max u(x_1, x_2) \quad \text{s.t.} \quad p_1 x_1 + p_2 x_2 = m$$

Ordering the endogenous variables from first to last as $x_1^*, x_2^*, \lambda^*$, we get

$$|F| = \begin{vmatrix} \frac{\partial^2 u}{\partial x_1 \partial x_1} & \frac{\partial^2 u}{\partial x_1 \partial x_2} & -p_1 \\ \frac{\partial^2 u}{\partial x_2 \partial x_1} & \frac{\partial^2 u}{\partial x_2 \partial x_2} & -p_2 \\ -p_1 & -p_2 & 0 \end{vmatrix}$$

$$\frac{\partial x_1^*}{\partial p_1} = \frac{|F_{11}|}{|F|} = \frac{1}{|F|} \begin{vmatrix} \lambda^* & \frac{\partial^2 u}{\partial x_1 \partial x_2} & -p_1 \\ 0 & \frac{\partial^2 u}{\partial x_2 \partial x_2} & -p_2 \\ x_1^* & -p_2 & 0 \end{vmatrix}$$

$$\frac{\partial x_2^*}{\partial p_2} = \frac{|F_{22}|}{|F|} = \frac{1}{|F|} \begin{vmatrix} \frac{\partial^2 u}{\partial x_1 \partial x_1} & 0 & -p_1 \\ \frac{\partial^2 u}{\partial x_2 \partial x_1} & \lambda^* & -p_2 \\ -p_1 & x_2^* & 0 \end{vmatrix}$$

$$\frac{\partial x_1^*}{\partial m} = \frac{|F_{13}|}{|F|} = \frac{1}{|F|} \begin{vmatrix} 0 & \frac{\partial^2 u}{\partial x_1 \partial x_2} & -p_1 \\ 0 & \frac{\partial^2 u}{\partial x_2 \partial x_2} & -p_2 \\ -1 & -p_2 & 0 \end{vmatrix}$$

$$\frac{\partial x_2^*}{\partial m} = \frac{|F_{23}|}{|F|} = \frac{1}{|F|} \begin{vmatrix} \frac{\partial^2 u}{\partial x_1 \partial x_1} & 0 & -p_1 \\ \frac{\partial^2 u}{\partial x_2 \partial x_1} & 0 & -p_2 \\ -p_1 & -1 & 0 \end{vmatrix}$$
$$\frac{dV}{d\alpha} = \frac{\partial f}{\partial x_1}\frac{dx_1}{d\alpha} + \frac{\partial f}{\partial x_2}\frac{dx_2}{d\alpha} + \frac{\partial f}{\partial \alpha}$$

Substituting the first two FOCs ($\partial f/\partial x_i = -\lambda^* \partial g/\partial x_i$),

$$\frac{dV}{d\alpha} = -\lambda^* \left( \frac{\partial g}{\partial x_1}\frac{dx_1}{d\alpha} + \frac{\partial g}{\partial x_2}\frac{dx_2}{d\alpha} \right) + \frac{\partial f}{\partial \alpha}$$

Now consider the Lagrangian along the solution path,

$$L(\alpha) = f(x_1^*(\alpha), x_2^*(\alpha); \alpha) + \lambda^*(\alpha)\, g(x_1^*(\alpha), x_2^*(\alpha); \alpha)$$

Differentiating, we get

$$\frac{dL}{d\alpha} = \frac{\partial}{\partial x_1}(f + \lambda^* g)\frac{dx_1}{d\alpha} + \frac{\partial}{\partial x_2}(f + \lambda^* g)\frac{dx_2}{d\alpha} + \frac{\partial f}{\partial \alpha} + \frac{d\lambda^*}{d\alpha} g + \lambda^* \frac{\partial g}{\partial \alpha}$$

Substituting the FOCs ($\partial(f + \lambda^* g)/\partial x_i = 0$ and $g = 0$),

$$\frac{dL}{d\alpha} = 0 \cdot \frac{dx_1}{d\alpha} + 0 \cdot \frac{dx_2}{d\alpha} + \frac{\partial f}{\partial \alpha} + \frac{d\lambda^*}{d\alpha} \cdot 0 + \lambda^* \frac{\partial g}{\partial \alpha} = \frac{\partial f}{\partial \alpha} + \lambda^* \frac{\partial g}{\partial \alpha}$$

Since $g(x_1^*(\alpha), x_2^*(\alpha); \alpha) = 0$ identically in $\alpha$, differentiating the constraint gives $\frac{\partial g}{\partial x_1}\frac{dx_1}{d\alpha} + \frac{\partial g}{\partial x_2}\frac{dx_2}{d\alpha} = -\frac{\partial g}{\partial \alpha}$, so the two expressions agree. Hence

$$\frac{dV}{d\alpha} = \frac{dL}{d\alpha} = \frac{\partial f}{\partial \alpha} + \lambda^* \frac{\partial g}{\partial \alpha}$$
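A numeric sanity check of $dV/d\alpha = \partial L/\partial \alpha$ on a toy problem (a sketch: max $x_1 x_2$ subject to $x_1 + x_2 = \alpha$, so $V(\alpha) = \alpha^2/4$ and $\lambda^* = \alpha/2$; the problem is illustrative, not from the slides):

```python
import numpy as np

alpha, h = 3.0, 1e-6
V = lambda a: (a / 2) ** 2                     # value function of the toy problem
dV = (V(alpha + h) - V(alpha - h)) / (2 * h)   # central-difference derivative
lam = alpha / 2                                # multiplier at the optimum
# f has no direct alpha-dependence and g(x; a) = a - x1 - x2,
# so the envelope theorem predicts dV/da = lambda* * dg/da = lambda*
print(dV, lam)                                 # both approximately 1.5
```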
gk (x1 , · · · , xn ; α1 , · · · , αm ) = 0 for k = 1, · · · , K
The Lagrange multiplier measures the rate at which the value function
changes when the corresponding constraint is tightened or relaxed slightly.
If a constraint is nonbinding at the optimum, so that a small tightening or
relaxing of it has no effect on the solution, then the associated Lagrange
multiplier will take the value zero at the optimum.
Envelope Theorem
Example

Allocate a fixed labor supply $L_0$ across two activities to maximize $p_1 a_1 L_1^b + p_2 a_2 L_2^b$ subject to $L_0 - L_1 - L_2 = 0$ (the problem implied by the FOCs below).

FOCs

$$b p_i a_i L_i^{b-1} - \lambda^* = 0 \quad \text{for } i = 1, 2$$

$$L_0 - L_1 - L_2 = 0$$

Solving,

$$L_1 = c_1 L_0 \quad \text{and} \quad L_2 = c_2 L_0$$

where

$$c_1 = \left[ 1 + \left( \frac{p_1 a_1}{p_2 a_2} \right)^{1/(b-1)} \right]^{-1} \quad \text{and} \quad c_2 = 1 - c_1$$

Optimized value function

$$V(p_1, p_2, L_0) = p_1 a_1 (c_1 L_0)^b + p_2 a_2 (c_2 L_0)^b$$
where both $f$ and $g$ are concave and differentiable, the Lagrange function is $L = f(x_1, x_2) + \lambda g(x_1, x_2)$.
Kuhn-Tucker Theorem
Given

$$\max f(x_1, x_2) \quad \text{s.t.} \quad g(x_1, x_2) \geq 0 \quad \text{for } x_1, x_2 \geq 0$$

if $f$ and $g$ are concave and differentiable, and if there exists a point $(x_1^0, x_2^0)$ such that $g(x_1^0, x_2^0) > 0$, then there exists a Lagrange multiplier $\lambda^*$ such that the Kuhn-Tucker conditions are both necessary and sufficient for the point $(x_1^*, x_2^*)$ to be a solution to the problem.
Lagrange function
$$L = u(x_1, x_2) + \lambda(m - p_1 x_1 - p_2 x_2)$$

Kuhn-Tucker Conditions

$$\frac{\partial L}{\partial x_i} = \frac{\partial u}{\partial x_i} - \lambda^* p_i \leq 0 \quad \text{where } x_i^* \geq 0$$

$$x_i^* \left( \frac{\partial u}{\partial x_i} - \lambda^* p_i \right) = 0$$

$$\frac{\partial L}{\partial \lambda} = m - p_1 x_1^* - p_2 x_2^* \geq 0 \quad \text{where } \lambda^* \geq 0$$

$$\lambda^*(m - p_1 x_1^* - p_2 x_2^*) = 0$$
Kuhn-Tucker Theorem
At an interior solution ($x_1^*, x_2^* > 0$), the conditions imply

$$\frac{\partial u / \partial x_1}{\partial u / \partial x_2} = \frac{p_1}{p_2}$$
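A numerical version of this consumer problem (a sketch assuming Cobb-Douglas utility $u = \sqrt{x_1 x_2}$ and made-up prices; scipy's SLSQP handles the inequality constraint):

```python
import numpy as np
from scipy.optimize import minimize

p, m = np.array([2.0, 3.0]), 12.0
neg_u = lambda x: -np.sqrt(x[0] * x[1])                  # scipy minimizes
budget = {'type': 'ineq', 'fun': lambda x: m - p @ x}    # g(x) = m - p.x >= 0
res = minimize(neg_u, x0=[1.0, 1.0], bounds=[(0, None)] * 2,
               constraints=[budget])
print(res.x)        # approximately [3, 2]: half of income on each good
print(p @ res.x)    # budget binds at 12.0, consistent with lambda* > 0
```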
Given

$$\max f(x_1, \cdots, x_n)$$

subject to

$$g_j(x_1, \cdots, x_n) \geq 0 \quad \text{for } j = 1, \cdots, m$$

if all the functions $f$ and $g_j$ are concave and differentiable, and there exists a point $(x_1^0, \cdots, x_n^0)$ such that $g_j(x_1^0, \cdots, x_n^0) > 0$ for all $j$, then there exist $m$ Lagrange multipliers $\lambda_j^*$ such that the following conditions are necessary and sufficient for the point $(x_1^*, \cdots, x_n^*)$ to be a solution to the problem.

Conditions

$$\frac{\partial f(x_1^*, \cdots, x_n^*)}{\partial x_i} - \sum_j \lambda_j^* \frac{\partial g_j(x_1^*, \cdots, x_n^*)}{\partial x_i} \leq 0 \quad \text{and} \quad x_i^* \geq 0$$

$$x_i^* \left( \frac{\partial f}{\partial x_i} - \sum_j \lambda_j^* \frac{\partial g_j}{\partial x_i} \right) = 0$$

$$g_j(x_1^*, \cdots, x_n^*) \geq 0 \quad \text{and} \quad \lambda_j^* \geq 0$$

$$\lambda_j^* g_j(x_1^*, \cdots, x_n^*) = 0$$
Math Camp
ECMT
1 Sequences and Series
2 Probability
3 Convergence
4 Random Variables
5 Distribution Functions
6 Estimators
A sequence is said to have the limit $L$ if for any $\varepsilon > 0$, however small, there is some value $N$ such that $|a_n - L| < \varepsilon$ whenever $n > N$. We then write

$$\lim_{n \to \infty} a_n = L$$

Example:

$$\lim_{n \to \infty} \left( 1 + \frac{1}{n} \right)^n = e$$

If a sequence has no limit, it is divergent.
Properties of Sequences
If $a_t$ is a sequence, then $s_n = \sum_{t=1}^n a_t$ is a series.

Ratio test: if $L = \lim_{t \to \infty} |a_{t+1}/a_t|$ exists, we have:

If $L < 1$, then the series $s_n$ converges

If $L > 1$, then the series $s_n$ diverges

If $L = 1$, then the series $s_n$ may converge or diverge
Some properties:

If $A_1, A_2, \cdots \in \mathcal{B}$, then $\bigcap_{i=1}^{\infty} A_i \in \mathcal{B}$

If $B_1 \in \mathcal{B}$ and $B_2 \in \mathcal{B}$, then $B_1 \cap B_2 \in \mathcal{B}$

$\{\emptyset, \Omega\}$ is a $\sigma$-algebra (the smallest one)

$\mathcal{P}(\Omega)$ is a $\sigma$-algebra (the largest one), where $\mathcal{P}$ denotes the power set
Some identities:
P(∅) = 0
P(A) ≤ 1
P(AC ) = 1 − P(A)
If A ⊆ B, then P(A) ≤ P(B)
For $A_1 \subseteq A_2 \subseteq \cdots$,

$$\lim_{n \to \infty} P(A_n) = P\left( \bigcup_{i=1}^{\infty} A_i \right)$$

For $A_1 \supseteq A_2 \supseteq \cdots$,

$$\lim_{n \to \infty} P(A_n) = P\left( \bigcap_{i=1}^{\infty} A_i \right)$$
then let

$$X(s) = \begin{cases} 1 & \text{for } 0 \leq s < \frac{1}{2} \\ 0 & \text{otherwise} \end{cases}$$

For $0 \leq s < \frac{1}{2}$: since $\frac{n+1}{2n} > \frac{1}{2}$ $\forall n \geq 1$, we have $X_n(s) = 1$ $\forall n \geq 1$, and

$$P\left( \lim_{n \to \infty} X_n(s) = X(s) \right) = 1$$

For $\frac{1}{2} \leq s \leq 1$: we have $X_n(s) = 0$ $\forall n \geq 1$ and

$$P\left( \lim_{n \to \infty} X_n(s) = X(s) \right) = 1$$

Hence $X_n \xrightarrow{a.s.} X$
Convergence in Probability (Probability Limit)
Alternatively
$$\lim_{n \to \infty} P(|X_n - X| < \varepsilon) = 1$$

Some identities (assuming the probability limits on the right exist):

plim $cX_n = c\,$plim $X_n$

plim $(X_n + Y_n) =$ plim $X_n +$ plim $Y_n$

plim $X_n Y_n =$ plim $X_n \cdot$ plim $Y_n$

Slutsky's Theorem: for continuous $g$,

$$\text{plim } g(X_n) = g(\text{plim } X_n)$$
Example: let

$$F_{X_n}(x) = \begin{cases} 1 - (1 - \frac{1}{n})^{nx} & \text{for } x > 0 \\ 0 & \text{otherwise} \end{cases} \qquad F_X(x) = \begin{cases} 1 - e^{-x} & \text{for } x > 0 \\ 0 & \text{otherwise} \end{cases}$$

For $x \leq 0$, we have

$$F_{X_n}(x) = F_X(x) = 0 \quad \forall n \geq 2$$

For $x > 0$, we have

$$\lim_{n \to \infty} F_{X_n}(x) = \lim_{n \to \infty} \left[ 1 - \left( 1 - \frac{1}{n} \right)^{nx} \right] = 1 - \lim_{n \to \infty} \left( 1 - \frac{1}{n} \right)^{nx} = 1 - e^{-x}$$

Hence $X_n \xrightarrow{d} X$
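A quick numeric look at the pointwise convergence of the CDFs (a sketch; $x$ fixed at 1):

```python
import numpy as np

x = 1.0
for n in [2, 10, 100, 10000]:
    print(n, 1 - (1 - 1/n)**(n*x))   # approaches 1 - exp(-1)
print('limit', 1 - np.exp(-x))       # 0.6321...
```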
Convergence in r -th Mean
$$\lim_{n \to \infty} E(|X_n - X|^r) = 0$$

Example: let

$$f_{X_n}(x) = \begin{cases} n & \text{for } 0 \leq x \leq \frac{1}{n} \\ 0 & \text{otherwise} \end{cases}$$

then

$$E(|X_n - 0|^r) = \int_0^{1/n} x^r n\,dx = \frac{1}{(r+1)n^r} \to 0$$

Hence $X_n \xrightarrow{L^r} 0$ for all $r \geq 1$
Markov's Inequality (for a nonnegative random variable $X$ and $a > 0$)

$$P(X \geq a) \leq \frac{E(X^n)}{a^n}$$

Chebyshev's Inequality

$$P(|X - E(X)| \geq a) \leq \frac{Var(X)}{a^2}$$
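A Monte Carlo sanity check of Chebyshev's inequality (a sketch using a standard normal, where $E(X) = 0$ and $Var(X) = 1$):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)
a = 2.0
print((np.abs(x) >= a).mean())   # approximately 0.0455
print(1 / a**2)                  # Chebyshev bound 0.25, comfortably larger
```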
Borel-Cantelli Lemma

If $\sum_{n=1}^{\infty} P(|X_n - c| > \varepsilon) < \infty$ $\forall \varepsilon > 0$, then

$$X_n \xrightarrow{a.s.} c$$
$$X_n \xrightarrow{a.s.} X \;\Rightarrow\; X_n \xrightarrow{p} X \;\Rightarrow\; X_n \xrightarrow{d} X$$

For $r \geq 1$,

$$X_n \xrightarrow{L^r} X \;\Rightarrow\; X_n \xrightarrow{p} X$$

For $s \geq r \geq 1$,

$$X_n \xrightarrow{L^s} X \;\Rightarrow\; X_n \xrightarrow{L^r} X$$
Weak LLN

$$\bar{X} - \frac{1}{n} \sum_{i=1}^n \mu_i \xrightarrow{p} 0$$

(the strong LLN strengthens the convergence to almost-sure.)
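A quick simulation of the LLN (a sketch; iid exponential draws with mean 1):

```python
import numpy as np

rng = np.random.default_rng(1)
for n in [10, 1000, 100000]:
    print(n, rng.exponential(1.0, n).mean())   # drifts toward mu = 1
```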
Lindeberg-Levy CLT

Lyapunov CLT

Lindeberg-Feller CLT: if, with $\bar{\sigma}_n^2 = \frac{1}{n} \sum_{i=1}^n \sigma_i^2$,

$$\lim_{n \to \infty} \max_{i \in [1, n]} \frac{\sigma_i^2}{n \bar{\sigma}_n^2} = 0$$

then

$$\sqrt{n}(\bar{X} - \bar{\mu}_n) \xrightarrow{d} N(0, \lim \bar{\sigma}_n^2)$$
Delta method: if $\sqrt{n}(\bar{X} - \mu) \xrightarrow{d} N(0, \sigma^2)$ and $g$ is differentiable, then

$$\sqrt{n}\left( g(\bar{X}) - g(\mu) \right) \xrightarrow{d} N\left( 0, \frac{\partial g(\mu)}{\partial \mu} \sigma^2 \frac{\partial g(\mu)}{\partial \mu} \right)$$
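A simulation check of the delta method (a sketch with $g(x) = x^2$ and exponential(1) draws, so $\mu = \sigma^2 = 1$ and the asymptotic variance is $[g'(1)]^2 \cdot 1 = 4$):

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 1000, 5000
xbar = rng.exponential(1.0, (reps, n)).mean(axis=1)
z = np.sqrt(n) * (xbar**2 - 1.0)    # sqrt(n) * (g(xbar) - g(mu)), g(x) = x^2
print(z.var())                      # approximately 4 = [g'(mu)]^2 * sigma^2
```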
If $\lim_{n \to \infty} \frac{g(n)}{f(n)} = c$ for some finite $c$, we say that $g(n) = O(f(n))$; if the limit is zero, $g(n) = o(f(n))$. For example,

$$a_1 n^2 + a_2 n + a_3 = O(n^2)$$

$$a_1 n^2 + a_2 n + a_3 = o(n^3)$$

If $g(n) = O(f(n))$, then $cg(n) = O(f(n))$ for any constant $c$

If $g_1(n) = O(f(n))$ and $g_2(n) = O(f(n))$, then $g_1(n) + g_2(n) = O(f(n))$

If $g_1(n) = O(f(n))$ but $g_2(n) = o(f(n))$, then $g_1(n) + g_2(n) = O(f(n))$

If $g(n) = O(f(n))$ but $f(n) = o(b(n))$, then $g(n) = o(b(n))$
Example: let $X$ be the number of heads in three tosses of a fair coin. Then $X$ takes values in $\{0, 1, 2, 3\}$ with

$$P(X = 0) = 0.125, \quad P(X = 1) = 0.375, \quad P(X = 2) = 0.375, \quad P(X = 3) = 0.125$$
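These probabilities are binomial; a one-line check (a sketch):

```python
from math import comb

for k in range(4):
    print(k, comb(3, k) * 0.5**3)   # 0.125, 0.375, 0.375, 0.125
```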
$$f(x) = \frac{d}{dx} F(x)$$

with the following properties:

$f(x) \geq 0$ (a density can exceed 1, unlike a probability)

$\int_{-\infty}^{\infty} f(x)\,dx = 1$

$F(x) = \int_{-\infty}^{x} f(t)\,dt$
Alternatively,

$$E[g(X)] = \int_{-\infty}^{\infty} g(x) f(x)\,dx$$

Some identities:

$E[a g_1(X) + b g_2(X) + c] = a E[g_1(X)] + b E[g_2(X)] + c$

If $g_1(x) \geq g_2(x)$ for all $x$, then $E[g_1(X)] \geq E[g_2(X)]$
$$E(X^n) = \left. \frac{d^n}{dt^n} M_X(t) \right|_{t=0}$$

Example for $n = 1$:

$$\frac{d}{dt} M_X(t) = \frac{d}{dt} \int e^{tx} f(x)\,dx = \int \frac{d}{dt} e^{tx} f(x)\,dx = \int x e^{tx} f(x)\,dx$$

$$\left. \frac{d}{dt} M_X(t) \right|_{t=0} = \left. \int x e^{tx} f(x)\,dx \right|_{t=0} = \int x f(x)\,dx = E[X]$$
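A symbolic check of the moment-from-MGF recipe, using the normal MGF given later in the deck (a sketch with sympy):

```python
import sympy as sp

t, mu, sigma = sp.symbols('t mu sigma', positive=True)
M = sp.exp(mu*t + sigma**2 * t**2 / 2)   # normal MGF
m1 = sp.diff(M, t).subs(t, 0)            # first moment
m2 = sp.diff(M, t, 2).subs(t, 0)         # second moment
print(m1, sp.simplify(m2 - m1**2))       # mu and sigma**2
```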
Some identities:

$Var(X) \geq 0$

$Var(c) = 0$ for constant $c$

$Var(X + c) = Var(X)$ for constant $c$

$Var(cX) = c^2 Var(X)$ for constant $c$

$Var(cX + dY) = c^2 Var(X) + d^2 Var(Y) + 2cd\,Cov(X, Y)$ for constants $c, d$

For sums of random variables,

$$Var\left( \sum_{i=1}^N X_i \right) = \sum_{i=1}^N \sum_{j=1}^N Cov(X_i, X_j) = \sum_{i=1}^N Var(X_i) + \sum_{i \neq j} Cov(X_i, X_j)$$

$$Var\left( \sum_{i=1}^N a_i X_i \right) = \sum_{i=1}^N \sum_{j=1}^N a_i a_j Cov(X_i, X_j) = \sum_{i=1}^N a_i^2 Var(X_i) + \sum_{i \neq j} a_i a_j Cov(X_i, X_j) = \sum_{i=1}^N a_i^2 Var(X_i) + 2 \sum_{1 \leq i < j \leq N} a_i a_j Cov(X_i, X_j)$$

If the $X_i$ are pairwise uncorrelated (for example, independent),

$$Var\left( \sum_{i=1}^N X_i \right) = \sum_{i=1}^N Var(X_i)$$
$$P(X|Y) = \frac{P(X \cap Y)}{P(Y)}$$

Bayes' Rule

$$P(X|Y) = \frac{P(Y|X)P(X)}{P(Y)}$$

Conditional distribution

$$f(x|y) = \frac{f(x, y)}{f(y)}$$

Conditional Expectation

$$E(X|Y = y) = \int x f(x|y)\,dx$$

Conditional Variance

$$Var(X|Y = y) = \int [x - E(X|Y = y)]^2 f(x|y)\,dx = E(X^2|Y) - [E(X|Y)]^2$$

Law of Iterated Expectations

$$E[E(X|Y)] = E(X)$$

$$E[E(X|Y, Z)|Y] = E(X|Y)$$

Variance Decomposition

$$Var(X) = E[Var(X|Y)] + Var(E(X|Y))$$

If $X$ and $Y$ are independent events, then

$$P(X|Y) = \frac{P(X \cap Y)}{P(Y)} = \frac{P(X)P(Y)}{P(Y)} = P(X)$$
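A simulation check of the law of iterated expectations (a sketch: $Y$ uniform on $[0,1]$, $X$ normal with conditional mean $Y$):

```python
import numpy as np

rng = np.random.default_rng(3)
y = rng.uniform(0, 1, 500_000)
x = rng.normal(loc=y, scale=1.0)   # E(X|Y) = Y
print(x.mean(), y.mean())          # both approximately 0.5 = E[E(X|Y)]
```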
Two random variables $X$ and $Y$ are identically distributed iff

$$P[X \leq x] = P[Y \leq x] \quad \forall x$$
The maximum likelihood estimator solves the score equation

$$\frac{\partial}{\partial \theta} \ell(\theta|x) = 0$$
FX ,Y (x, y ) = P(X ≤ x, Y ≤ y )
$$Var(Z) = \begin{pmatrix} \sigma_x^2 & \sigma_{xy} \\ \sigma_{xy} & \sigma_y^2 \end{pmatrix}$$
$$f(x) = \frac{\partial^n}{\partial x_1 \cdots \partial x_n} F_X$$

$$E[g(X)] = \int \cdots \int_{R^n} g(x) f(x)\,dx_1 \cdots dx_n$$

$$M_X(t) = E_X[e^{t^T X}] = \int_{R^n} e^{t^T x} f(x)\,dx$$
$$\mu = E(X) = \begin{pmatrix} E(X_1) \\ \vdots \\ E(X_n) \end{pmatrix}$$

$$Var(X) = \begin{pmatrix} \sigma_{x_1}^2 & \sigma_{x_1 x_2} & \cdots & \sigma_{x_1 x_n} \\ \sigma_{x_2 x_1} & \sigma_{x_2}^2 & \cdots & \sigma_{x_2 x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{x_n x_1} & \sigma_{x_n x_2} & \cdots & \sigma_{x_n}^2 \end{pmatrix}$$

Some properties

$$E(a^T X) = a^T E(X)$$

$$Var(a^T X) = a^T Var(X) a$$
Probability Density Function

$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$

Cumulative Distribution Function

$$F(x) = \frac{1}{2}\left[ 1 + \text{erf}\left( \frac{x - \mu}{\sqrt{2\sigma^2}} \right) \right]$$

Moment Generating Function

$$M_X(t) = e^{\mu t + \frac{\sigma^2 t^2}{2}}$$

where

$$\text{erf}(z) = \frac{2}{\sqrt{\pi}} \int_0^z e^{-t^2}\,dt$$

Mean and Variance

$$E(X) = \mu, \qquad Var(X) = \sigma^2$$
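A numeric check that the erf expression matches the normal CDF (a sketch with scipy):

```python
import numpy as np
from scipy.special import erf
from scipy.stats import norm

mu, sigma, x = 1.0, 2.0, 2.5
via_erf = 0.5 * (1 + erf((x - mu) / np.sqrt(2 * sigma**2)))
print(via_erf, norm.cdf(x, loc=mu, scale=sigma))   # identical values
```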
we have

$$f(x_1, x_2) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1 - \rho^2}} \exp\left( -\frac{\left( \frac{x_1 - \mu_1}{\sigma_1} \right)^2 - 2\rho \left( \frac{x_1 - \mu_1}{\sigma_1} \right)\left( \frac{x_2 - \mu_2}{\sigma_2} \right) + \left( \frac{x_2 - \mu_2}{\sigma_2} \right)^2}{2(1 - \rho^2)} \right)$$
Probability Mass Function

$$f(k) = \frac{\lambda^k e^{-\lambda}}{k!} \quad \text{for } k = 0, 1, 2, \ldots$$

Cumulative Distribution Function

$$F(k) = e^{-\lambda} \sum_{i=0}^{k} \frac{\lambda^i}{i!}$$
Exponential distribution:

$$f(x) = \lambda e^{-\lambda x}, \qquad F(x) = 1 - e^{-\lambda x} \quad (x \geq 0)$$

Gamma distribution (shape $\alpha$, rate $\beta$):

$$f(x) = \frac{\beta^{\alpha}}{\Gamma(\alpha)} x^{\alpha - 1} e^{-\beta x}, \qquad F(x) = \frac{\gamma(\alpha, \beta x)}{\Gamma(\alpha)}$$

where

$$\Gamma(\alpha) = \int_0^{\infty} t^{\alpha - 1} e^{-t}\,dt, \qquad \gamma(\alpha, \beta x) = \int_0^{\beta x} t^{\alpha - 1} e^{-t}\,dt$$

Mean and Variance

$$E(X) = \frac{\alpha}{\beta}, \qquad Var(X) = \frac{\alpha}{\beta^2}$$
Beta distribution:

$$f(x) = \frac{x^{\alpha - 1}(1 - x)^{\beta - 1}}{B(\alpha, \beta)}$$

where

$$B(\alpha, \beta) = \int_0^1 t^{\alpha - 1}(1 - t)^{\beta - 1}\,dt, \qquad B(x; \alpha, \beta) = \int_0^x t^{\alpha - 1}(1 - t)^{\beta - 1}\,dt$$

Mean and Variance

$$E(X) = \frac{\alpha}{\alpha + \beta}, \qquad Var(X) = \frac{\alpha\beta}{(\alpha + \beta)^2(\alpha + \beta + 1)}$$
For testing $H_0: c^T \beta = r$ in the classical linear model,

$$T_n = \frac{c^T \hat{\beta} - r}{\sqrt{s^2\, c^T (X^T X)^{-1} c}}$$

follows a Student t distribution (with $n - k$ degrees of freedom, where $k$ is the number of regressors).
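A numerical sketch of the statistic on simulated data (illustrative; it tests the true hypothesis $\beta_2 = 1$, so $T_n$ should look like a single t draw):

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 200, 2
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
beta = np.array([0.5, 1.0])
y = X @ beta + rng.standard_normal(n)

XtX_inv = np.linalg.inv(X.T @ X)
b_hat = XtX_inv @ X.T @ y                    # OLS estimate
resid = y - X @ b_hat
s2 = resid @ resid / (n - k)                 # unbiased error variance
c, r = np.array([0.0, 1.0]), 1.0             # test H0: beta_2 = 1
T = (c @ b_hat - r) / np.sqrt(s2 * c @ XtX_inv @ c)
print(T)    # a draw from t(n - k); typically within about +/- 2
```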