Cheat Sheet


Cheatsheet for 18.6501x by Blechturm

1 Important probability distributions

Bernoulli
Parameter p ∈ [0, 1], discrete.
pmf: p_X(k) = p if k = 1, and 1 − p if k = 0.
E[X] = p
Var(X) = p(1 − p)
Likelihood (n trials):
L_n(X_1, ..., X_n, p) = p^{Σ_{i=1}^n X_i} (1 − p)^{n − Σ_{i=1}^n X_i}
Loglikelihood (n trials):
ℓ_n(p) = ln(p) Σ_{i=1}^n X_i + (n − Σ_{i=1}^n X_i) ln(1 − p)
MLE:
p̂^MLE = (1/n) Σ_{i=1}^n X_i
Fisher information:
I(p) = 1 / (p(1 − p))
Canonical exponential form:
f_θ(y) = exp(yθ − ln(1 + e^θ) + 0), with θ = ln(p/(1 − p)), b(θ) = ln(1 + e^θ), c(y, φ) = 0, φ = 1.
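
As an added numerical illustration (not part of the original sheet), the following Python sketch simulates Bernoulli data, computes the MLE p̂ = X̄_n repeatedly, and compares its empirical variance with the asymptotic value 1/(n I(p)) = p(1 − p)/n; numpy is assumed to be available.

import numpy as np

rng = np.random.default_rng(0)
p_true, n, reps = 0.3, 500, 2000

# MLE of a Bernoulli parameter is the sample mean X_bar_n
p_hat = rng.binomial(1, p_true, size=(reps, n)).mean(axis=1)

# Empirical variance of the estimator vs. asymptotic variance 1/(n I(p)) = p(1-p)/n
print("empirical Var(p_hat):", p_hat.var())
print("asymptotic 1/(n I(p)):", p_true * (1 - p_true) / n)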

Binomial
Parameters p and n, discrete. Describes the number of successes in n independent Bernoulli trials.
pmf: p_X(k) = C(n, k) p^k (1 − p)^{n−k}, k = 0, 1, ..., n
E[X] = np
Var(X) = np(1 − p)
Likelihood (n observations from Bin(K, θ)):
L_n(X_1, ..., X_n, θ) = Π_{i=1}^n C(K, X_i) · θ^{Σ_{i=1}^n X_i} (1 − θ)^{nK − Σ_{i=1}^n X_i}
Loglikelihood:
ℓ_n(θ) = C + (Σ_{i=1}^n X_i) log(θ) + (nK − Σ_{i=1}^n X_i) log(1 − θ)
Canonical exponential form:
f_p(y) = exp( y(ln(p) − ln(1 − p)) + n ln(1 − p) + ln C(n, y) ), with θ = ln(p/(1 − p)), φ = 1.

Multinomial
Parameters n > 0 and p_1, ..., p_r.
pmf: p_X(x) = n!/(x_1! ··· x_r!) · p_1^{x_1} ··· p_r^{x_r}
E[X_i] = n p_i
Var(X_i) = n p_i (1 − p_i)
Likelihood:
L_n = Π_{j=1}^r p_j^{T_j}, where T_j = Σ_{i=1}^n 1(X_i = j) counts how often outcome j is seen in the trials.
Loglikelihood:
ℓ_n = Σ_{j=1}^r T_j ln(p_j)

Poisson
Parameter λ, discrete; approximates the binomial pmf when n is large, p is small, and λ = np.
pmf: p_X(k) = e^{−λ} λ^k / k!  for k = 0, 1, ...
E[X] = λ
Var(X) = λ
Likelihood:
L_n(x_1, ..., x_n, λ) = (λ^{Σ_{i=1}^n x_i} / Π_{i=1}^n x_i!) · e^{−nλ}
Loglikelihood:
ℓ_n(λ) = −nλ + log(λ) Σ_{i=1}^n x_i − log(Π_{i=1}^n x_i!)
MLE:
λ̂^MLE = (1/n) Σ_{i=1}^n X_i
Fisher information:
I(λ) = 1/λ
Canonical exponential form:
f_θ(y) = exp(yθ − e^θ − ln y!), with θ = ln(λ), b(θ) = e^θ, c(y, φ) = −ln y!, φ = 1.

Exponential
Parameter λ, continuous.
pdf: f_X(x) = λ exp(−λx) if x ≥ 0, 0 otherwise.
CDF: F_X(x) = 1 − exp(−λx) if x ≥ 0, 0 otherwise.
E[X] = 1/λ
Var(X) = 1/λ^2
Likelihood:
L(X_1, ..., X_n; λ) = λ^n exp(−λ Σ_{i=1}^n X_i)
Loglikelihood:
ℓ_n(λ) = n ln(λ) − λ Σ_{i=1}^n X_i
MLE:
λ̂^MLE = n / Σ_{i=1}^n X_i
Fisher information:
I(λ) = 1/λ^2
Canonical exponential form:
f_θ(y) = exp(yθ − (−ln(−θ)) + 0), with θ = −λ = −1/μ, b(θ) = −ln(−θ), c(y, φ) = 0, φ = 1.

Shifted Exponential
Parameters λ > 0 and θ ∈ R, continuous.
pdf: f_X(x) = λ exp(−λ(x − θ)) if x ≥ θ, 0 otherwise.
CDF: F_X(x) = 1 − exp(−λ(x − θ)) if x ≥ θ, 0 otherwise.
E[X] = θ + 1/λ
Var(X) = 1/λ^2
Likelihood:
L(X_1, ..., X_n; λ, θ) = λ^n exp(−λ n (X̄_n − θ)) · 1(min_i X_i ≥ θ)
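
A small sanity check (added here, not in the original sheet): for exponential data the closed-form MLE λ̂ = n/Σ X_i should coincide with the maximizer of ℓ_n(λ) = n ln(λ) − λ Σ X_i found numerically; numpy assumed.

import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=1 / 2.5, size=1000)   # true rate λ = 2.5

lam_mle = len(x) / x.sum()                       # closed-form MLE n / Σ X_i

# Crude numerical maximization of ℓ_n(λ) = n·ln(λ) − λ·Σ X_i over a grid
grid = np.linspace(0.1, 10, 100000)
loglik = len(x) * np.log(grid) - grid * x.sum()
lam_grid = grid[np.argmax(loglik)]

print(lam_mle, lam_grid)                         # the two values should agree closely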

Uniform
Parameters a and b, continuous.
pdf: f_X(x) = 1/(b − a) if a < x < b, 0 otherwise.
E[X] = (a + b)/2
Var(X) = (b − a)^2 / 12
Likelihood (for Unif(0, b)):
L(x_1, ..., x_n; b) = 1(max_i x_i ≤ b) / b^n

Cauchy
Continuous, parameter m.
pdf: f_m(x) = (1/π) · 1/(1 + (x − m)^2)
E[X] is not defined.
Var(X) is not defined.
med(X) = m: P(X > m) = P(X < m) = 1/2 = ∫_m^∞ (1/π) · 1/(1 + (x − m)^2) dx

Univariate Gaussians
Parameters μ and σ^2 > 0, continuous.
pdf: f(x) = 1/√(2πσ^2) · exp(−(x − μ)^2 / (2σ^2))
E[X] = μ
Var(X) = σ^2
CDF of the standard Gaussian:
Φ(z) = ∫_{−∞}^{z} (1/√(2π)) e^{−x^2/2} dx
Likelihood:
L(x_1, ..., x_n; μ, σ^2) = (1/(σ√(2π)))^n exp(−(1/(2σ^2)) Σ_{i=1}^n (X_i − μ)^2)
Loglikelihood:
ℓ_n(μ, σ^2) = −n log(σ√(2π)) − (1/(2σ^2)) Σ_{i=1}^n (X_i − μ)^2
MLE:
μ̂^MLE = X̄_n,  σ̂^2_MLE = (1/n) Σ_{i=1}^n (X_i − X̄_n)^2
Gaussians are invariant under affine transformation:
aX + b ∼ N(aμ + b, a^2 σ^2)
Sum of independent Gaussians: let X ∼ N(μ_X, σ_X^2) and Y ∼ N(μ_Y, σ_Y^2) be independent.
If U = X + Y, then U ∼ N(μ_X + μ_Y, σ_X^2 + σ_Y^2)
If U = X − Y, then U ∼ N(μ_X − μ_Y, σ_X^2 + σ_Y^2)
Symmetry:
If X ∼ N(0, σ^2), then −X ∼ N(0, σ^2) and P(|X| > x) = 2P(X > x)
Standardization:
Z = (X − μ)/σ ∼ N(0, 1)
P(X ≤ t) = P(Z ≤ (t − μ)/σ)
Higher moments:
E[X^2] = μ^2 + σ^2
E[X^3] = μ^3 + 3μσ^2
E[X^4] = μ^4 + 6μ^2σ^2 + 3σ^4
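
A quick illustration of standardization (added example, not from the sheet): P(X ≤ t) for X ∼ N(μ, σ^2) is computed via Φ((t − μ)/σ) using scipy.stats.norm and checked against a Monte Carlo estimate.

import numpy as np
from scipy.stats import norm

mu, sigma, t = 2.0, 3.0, 4.5

# P(X <= t) via standardization: Phi((t - mu) / sigma)
p_exact = norm.cdf((t - mu) / sigma)

# Monte Carlo check with simulated N(mu, sigma^2) draws
x = np.random.default_rng(2).normal(mu, sigma, size=1_000_000)
p_mc = (x <= t).mean()

print(p_exact, p_mc)   # the two probabilities should be close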

Chi squared
The χ²_d distribution with d degrees of freedom is given by the distribution of Z_1² + Z_2² + ··· + Z_d², where Z_1, ..., Z_d ∼ iid N(0, 1).
If V ∼ χ²_d:
E[V] = E[Z_1²] + E[Z_2²] + ... + E[Z_d²] = d
Var(V) = Var(Z_1²) + Var(Z_2²) + ... + Var(Z_d²) = 2d

Student's T Distribution
T_n := Z / √(V/n), where Z ∼ N(0, 1), V ∼ χ²_n, and Z and V are independent.

2 Quantiles of a Distribution
Let α ∈ (0, 1). The quantile of order 1 − α of a random variable X is the number q_α such that:
P(X ≤ q_α) = 1 − α
P(X ≥ q_α) = α
F_X(q_α) = 1 − α
q_α = F_X^{-1}(1 − α)
If X ∼ N(0, 1): P(|X| > q_{α/2}) = α.

3 Expectation
E[X] = ∫_{−∞}^{+∞} x · f_X(x) dx
E[g(X)] = ∫_{−∞}^{+∞} g(x) · f_X(x) dx
E[X | Y = y] = ∫_{−∞}^{+∞} x · f_{X|Y}(x|y) dx
The integration limits only have to cover the support of the pdf. For discrete r.v.s the same formulas hold with sums and pmfs.
Total expectation theorem:
E[X] = ∫_{−∞}^{+∞} f_Y(y) · E[X | Y = y] dy
Expectation of a constant a: E[a] = a
Product of independent r.v.s X and Y: E[X · Y] = E[X] · E[Y]
Product of dependent r.v.s X and Y: in general E[X · Y] ≠ E[X] · E[Y]; instead E[X · Y] = E[E[XY | Y]] = E[Y · E[X | Y]]
Linearity of expectation, where a and c are given scalars: E[aX + cY] = aE[X] + cE[Y]
If the variance of X is known: E[X²] = Var(X) + (E[X])²

4 Variance
Variance is the expected squared deviation from the mean.
Var(X) = E[(X − E[X])²]
Var(X) = E[X²] − (E[X])²
Variance of a product with a constant a:
Var(aX) = a² Var(X)
Variance of the sum of two dependent r.v.s:
Var(X + Y) = Var(X) + Var(Y) + 2Cov(X, Y)
Variance of the sum of two independent r.v.s:
Var(X + Y) = Var(X) + Var(Y)
Var(X − Y) = Var(X) + Var(Y)

5 Sample Mean and Sample Variance
Let X_1, ..., X_n ∼ iid P_μ, where E(X_i) = μ and Var(X_i) = σ² for all i = 1, 2, ..., n.
Sample mean:
X̄_n = (1/n) Σ_{i=1}^n X_i
Sample variance:
S_n = (1/n) Σ_{i=1}^n (X_i − X̄_n)² = (1/n) Σ_{i=1}^n X_i² − X̄_n²
Unbiased estimator of the sample variance:
S̃_n = (1/(n − 1)) Σ_{i=1}^n (X_i − X̄_n)² = (n/(n − 1)) S_n
Expectation of the mean:
E[X̄_n] = (1/n) E[X_1 + X_2 + ... + X_n] = μ
Variance of the mean:
Var(X̄_n) = (1/n)² Var(X_1 + X_2 + ... + X_n) = σ²/n
Cochran's Theorem:
If X_1, ..., X_n ∼ iid N(μ, σ²), the sample mean X̄_n and the sample variance S_n are independent, X̄_n ⊥⊥ S_n, for all n, and nS_n/σ² follows a chi-squared distribution with n − 1 degrees of freedom: nS_n/σ² ∼ χ²_{n−1}.

6 Covariance
The covariance is a measure of how much the values of two correlated random variables determine each other.
Cov(X, Y) = E[(X − μ_X)(Y − μ_Y)]
Cov(X, Y) = E[XY] − E[X]E[Y]
Cov(X, Y) = E[X(Y − μ_Y)]
Possible notations: Cov(X, Y) = σ(X, Y) = σ_{X,Y}
Covariance is commutative: Cov(X, Y) = Cov(Y, X)
The covariance of an r.v. with itself is the variance: Cov(X, X) = E[(X − μ_X)²] = Var(X)
Useful properties:
Cov(aX + h, bY + c) = ab Cov(X, Y)
Cov(X, X + Y) = Var(X) + Cov(X, Y)
Cov(aX + bY, Z) = a Cov(X, Z) + b Cov(Y, Z)
If Cov(X, Y) = 0, we say that X and Y are uncorrelated. If X and Y are independent, their covariance is zero. The converse is not always true; it holds if X and Y form a Gaussian vector, i.e. any linear combination αX + βY is Gaussian for all (α, β) ∈ R² \ {(0, 0)}.
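
A small numpy check (added, not in the original): the biased sample variance S_n corresponds to np.var(x) (ddof=0) and the unbiased version S̃_n to np.var(x, ddof=1), related by the factor n/(n − 1).

import numpy as np

x = np.random.default_rng(3).normal(loc=1.0, scale=2.0, size=50)
n = len(x)

s_biased = np.var(x)            # S_n, divides by n
s_unbiased = np.var(x, ddof=1)  # S~_n, divides by n-1

print(s_unbiased, s_biased * n / (n - 1))   # identical up to rounding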

7 Law of Large Numbers and Central Limit Theorem (univariate)
Let X_1, ..., X_n ∼ iid P_μ, where E(X_i) = μ and Var(X_i) = σ² for all i = 1, 2, ..., n, and X̄_n = (1/n) Σ_{i=1}^n X_i.
Law of large numbers:
X̄_n → μ  (P, a.s.) as n → ∞
(1/n) Σ_{i=1}^n g(X_i) → E[g(X)]  (P, a.s.) as n → ∞
Central Limit Theorem:
√n (X̄_n − μ)/√(σ²) →(d) N(0, 1) as n → ∞
√n (X̄_n − μ) →(d) N(0, σ²) as n → ∞

8 Random Vectors
A random vector X = (X^(1), ..., X^(d))^T of dimension d × 1 is a vector-valued function from a probability space Ω to R^d:
X: Ω → R^d, ω ↦ (X^(1)(ω), X^(2)(ω), ..., X^(d)(ω))^T
where each X^(k) is a (scalar) random variable on Ω.
PDF of X: the joint distribution of its components X^(1), ..., X^(d).
CDF of X: R^d → [0, 1], x ↦ P(X^(1) ≤ x^(1), ..., X^(d) ≤ x^(d)).
The sequence X_1, X_2, ... converges in probability to X if and only if each component sequence X_1^(k), X_2^(k), ... converges in probability to X^(k).
Expectation of a random vector: the elementwise expectation. For X of dimension d × 1, E[X] = (E[X^(1)], ..., E[X^(d)])^T.
The expectation of a random matrix is the expected value of each of its elements. Let X = {X_ij} be an n × p random matrix. Then E[X] is the n × p matrix of numbers (if they exist) with entries E[X_ij].
Let X and Y be random matrices of the same dimension, and let A and B be conformable matrices of constants. Then:
E[X + Y] = E[X] + E[Y]
E[AXB] = A E[X] B
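
An added Monte Carlo sketch of the CLT (not part of the sheet): standardized sample means of exponential data should be approximately N(0, 1) for large n; numpy assumed.

import numpy as np

rng = np.random.default_rng(4)
n, reps = 200, 5000
lam = 2.0                                   # Exp(λ): mean 1/λ, variance 1/λ²

samples = rng.exponential(1 / lam, size=(reps, n))
z = np.sqrt(n) * (samples.mean(axis=1) - 1 / lam) / np.sqrt(1 / lam**2)

# If the CLT holds, z is approximately N(0, 1)
print("mean ~ 0:", z.mean(), " var ~ 1:", z.var())
print("P(|Z| > 1.96) ~ 0.05:", np.mean(np.abs(z) > 1.96))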

Covariance Matrix
Let X be a random vector of dimension d × 1 with expectation μ_X. The covariance matrix is the matrix outer product
Σ = E[(X − μ_X)(X − μ_X)^T]
The covariance matrix Σ = Cov(X) is a d × d matrix: a table of the pairwise covariances of the elements of the random vector, Σ = (σ_jk)_{j,k=1,...,d} with σ_jk = Cov(X^(j), X^(k)). Its diagonal elements are the variances of the elements of the random vector; the off-diagonal elements are their covariances. Note that the covariance is commutative, e.g. σ_12 = σ_21.
Alternative forms:
Σ = E[XX^T] − E[X]E[X]^T = E[XX^T] − μ_X μ_X^T
Let the random vector X ∈ R^d and let A and B be conformable matrices of constants. Then:
Cov(AX + B) = Cov(AX) = A Cov(X) A^T = AΣA^T
Every covariance matrix is positive semi-definite: Σ ⪰ 0.

Gaussian Random Vectors
A random vector X = (X^(1), ..., X^(d))^T is a Gaussian vector, or multivariate Gaussian or normal variable, if any linear combination of its components is a (univariate) Gaussian variable or a constant (a "Gaussian" variable with zero variance), i.e., if α^T X is (univariate) Gaussian or constant for any constant non-zero vector α ∈ R^d.

Multivariate Gaussians
The distribution of X, the d-dimensional Gaussian or normal distribution, is completely specified by the vector mean μ = E[X] = (E[X^(1)], ..., E[X^(d)])^T and the d × d covariance matrix Σ. If Σ is invertible, then the pdf of X is:
f_X(x) = 1/√((2π)^d det(Σ)) · exp(−(1/2)(x − μ)^T Σ^{-1} (x − μ)),  x ∈ R^d
where det(Σ) is the determinant of Σ, which is positive when Σ is invertible.
If μ = 0 and Σ is the identity matrix, then X is called a standard normal random vector.
If the covariance matrix Σ is diagonal, the pdf factors into pdfs of univariate Gaussians, and hence the components are independent.
The linear transform of a Gaussian X ∼ N_d(μ, Σ) with a conformable matrix A and vector b is a Gaussian:
AX + b ∼ N_d(Aμ + b, AΣA^T)
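
An added numpy check of the transformation rule Cov(AX) = AΣA^T, using simulated multivariate normal data (illustrative values, not from the sheet).

import numpy as np

rng = np.random.default_rng(5)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
A = np.array([[1.0, 2.0],
              [0.0, 3.0]])

X = rng.multivariate_normal(mu, Sigma, size=200_000)   # rows are samples of X
Y = X @ A.T                                             # samples of AX

print(np.cov(Y, rowvar=False))    # empirical Cov(AX)
print(A @ Sigma @ A.T)            # theoretical A Σ A^T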

Multivariate CLT
Let X_1, ..., X_n ∈ R^d be independent copies of a random vector X such that E[X] = μ (d × 1 vector of expectations) and Cov(X) = Σ. Then:
√n (X̄_n − μ) →(d) N_d(0, Σ) as n → ∞
√n Σ^{-1/2} (X̄_n − μ) →(d) N_d(0, I_d) as n → ∞
where Σ^{-1/2} is the d × d matrix such that Σ^{-1/2} Σ^{-1/2} = Σ^{-1} and I_d is the identity matrix.

Multivariate Delta Method
Gradient matrix of a vector function: given a vector-valued function f: R^d → R^k, the gradient or gradient matrix of f, denoted ∇f, is the d × k matrix
∇f = [∇f_1 ∇f_2 ... ∇f_k], with entries (∇f)_{ij} = ∂f_j/∂x_i.
This is also the transpose of what is known as the Jacobian matrix J_f of f.
Given
• (T_n)_{n≥1}, a sequence of random vectors satisfying √n (T_n − θ) →(d) T, and
• a function g: R^d → R^k that is continuously differentiable at θ,
then
√n (g(T_n) − g(θ)) →(d) ∇g(θ)^T T as n → ∞.
With multivariate Gaussians and the sample mean: let T_n = X̄_n, where X̄_n is the sample average of X_1, ..., X_n ∼ iid X, and θ = E[X]. The (multivariate) CLT then gives T ∼ N(0, Σ_X), where Σ_X is the covariance of X. In this case we have:
√n (g(T_n) − g(θ)) →(d) ∇g(θ)^T T, with ∇g(θ)^T T ∼ N(0, ∇g(θ)^T Σ_X ∇g(θ)).
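
An added simulation sketch of the one-dimensional delta method (not from the sheet): for g(x) = x² and exponential data, √n(g(X̄_n) − g(μ)) should be approximately N(0, g′(μ)²σ²); numpy assumed.

import numpy as np

rng = np.random.default_rng(6)
n, reps, lam = 400, 5000, 1.5
mu, sigma2 = 1 / lam, 1 / lam**2            # mean and variance of Exp(λ)

g = lambda x: x**2                          # the function applied to the sample mean
xbar = rng.exponential(mu, size=(reps, n)).mean(axis=1)
stat = np.sqrt(n) * (g(xbar) - g(mu))

# Delta method prediction: variance = g'(mu)^2 * sigma^2 with g'(mu) = 2*mu
print("empirical variance:", stat.var())
print("delta-method value:", (2 * mu) ** 2 * sigma2)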

9 Statistical models
(E, {P_θ}_{θ∈Θ})
E is a sample space for X, i.e. a set that contains all possible outcomes of X.
{P_θ}_{θ∈Θ} is a family of probability distributions on E.
Θ is a parameter set, i.e. a set consisting of some possible values of θ.
θ is the true parameter and is unknown. In a parametric model we assume that Θ ⊂ R^d, for some d ≥ 1.
Identifiability: θ ≠ θ′ ⇒ P_θ ≠ P_θ′, equivalently P_θ = P_θ′ ⇒ θ = θ′.
A model is well specified if ∃θ s.t. P = P_θ.

10 Estimators
A statistic is any measurable function of the sample, e.g. X̄_n, max(X_i), etc. An estimator of θ is any statistic which does not depend on θ.
An estimator θ̂_n is weakly consistent if θ̂_n → θ in probability as n → ∞. If the convergence is almost sure, it is strongly consistent.
Asymptotic normality of an estimator:
√n (θ̂_n − θ) →(d) N(0, σ²) as n → ∞
σ² is called the asymptotic variance of θ̂_n. In the case of the sample mean it is the variance of a single X_i. If the estimator is a function of the sample mean, the delta method is needed to compute the asymptotic variance. Asymptotic variance ≠ variance of an estimator.
Bias of an estimator: Bias(θ̂_n) = E[θ̂_n] − θ
Quadratic risk of an estimator: R(θ̂_n) = E[(θ̂_n − θ)²] = Bias² + Variance

11 Confidence intervals
Let (E, (P_θ)_{θ∈Θ}) be a statistical model based on observations X_1, ..., X_n, and assume Θ ⊆ R. Let α ∈ (0, 1).
Non-asymptotic confidence interval of level 1 − α for θ: any random interval I, depending on the sample X_1, ..., X_n but not on θ, such that P_θ[I ∋ θ] ≥ 1 − α, ∀θ ∈ Θ.
Confidence interval of asymptotic level 1 − α for θ: any random interval I whose boundaries do not depend on θ and such that lim_{n→∞} P_θ[I ∋ θ] ≥ 1 − α, ∀θ ∈ Θ.
Two-sided asymptotic CI: let X̃ = (X_1, ..., X_n) with X_i ∼ iid P_θ. A two-sided CI is a function of X̃ giving an upper and a lower bound in which the estimated parameter lies, I = [l(X̃), u(X̃)], with P(θ ∈ I) ≥ 1 − α and conversely P(θ ∉ I) ≤ α.
Since the estimator is an r.v. depending on X̃, it has a variance Var(θ̂_n) and a mean E[θ̂_n]. After finding those, it is possible to standardize the estimator using the CLT. This yields an asymptotic CI:
I = [θ̂_n − q_{α/2} √Var / √n, θ̂_n + q_{α/2} √Var / √n]
This expression depends on the true (asymptotic) variance Var, which has to be estimated. Three possible methods: plug-in (use the sample mean), solve (solve the quadratic inequality), conservative (use the maximum of the variance).

Delta Method
Takes a function of the sample mean and makes it converge to the function of the mean (m̂_1 = X̄_n is the empirical first moment):
√n (g(m̂_1) − g(m_1(θ))) →(d) N(0, g′(m_1(θ))² σ²) as n → ∞
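
An added example of a two-sided asymptotic CI for a Bernoulli proportion using the plug-in method (p̂(1 − p̂) in place of the unknown variance); 1.96 ≈ q_{α/2} of N(0, 1) for α = 0.05; numpy assumed.

import numpy as np

x = np.random.default_rng(7).binomial(1, 0.4, size=500)    # Bernoulli(p = 0.4) sample
n, p_hat = len(x), x.mean()

q = 1.96                                                    # q_{alpha/2} for alpha = 0.05
half_width = q * np.sqrt(p_hat * (1 - p_hat) / n)           # plug-in variance estimate

print("95% asymptotic CI:", (p_hat - half_width, p_hat + half_width))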

12 Hypothesis tests
Comparisons of two proportions
Let X_1, ..., X_n ∼ iid Bern(p_x) and Y_1, ..., Y_n ∼ iid Bern(p_y), with X independent of Y. Let p̂_x = (1/n) Σ_{i=1}^n X_i and p̂_y = (1/n) Σ_{i=1}^n Y_i.
H_0: p_x = p_y;  H_1: p_x ≠ p_y
To get the asymptotic variance, use the multivariate delta method. Consider p̂_x − p̂_y = g(p̂_x, p̂_y) with g(x, y) = x − y; then
√n (g(p̂_x, p̂_y) − g(p_x, p_y)) →(d) N(0, ∇g(p_x, p_y)^T Σ ∇g(p_x, p_y)) = N(0, p_x(1 − p_x) + p_y(1 − p_y))

Pivot
Let X_1, ..., X_n be random samples and let T_n be a function of X_1, ..., X_n and a parameter vector θ. Let g(T_n) be a random variable whose distribution is the same for all θ. Then g is called a pivotal quantity or a pivot.
For example, let X be a random variable with mean μ and variance σ², and let X_1, ..., X_n be iid samples of X. Then
g_n = (X̄_n − μ)/σ
is a pivot with θ = [μ, σ²]^T being the parameter vector. The notion of a parameter vector here is not to be confused with the set of parameters that we use to define a statistical model.

Wald's Test
Let X_1, ..., X_n ∼ iid P_{θ*} for some true parameter θ* ∈ R^d. We construct the associated statistical model (R, {P_θ}_{θ∈R^d}) and the maximum likelihood estimator θ̂_n^MLE for θ*.
Decide between two hypotheses:
H_0: θ* = 0  vs  H_1: θ* ≠ 0
Assuming that the null hypothesis is true, the asymptotic normality of the MLE θ̂_n^MLE implies that the random variable
‖√n I(0)^{1/2} (θ̂_n^MLE − 0)‖² = n (θ̂_n^MLE − 0)^T I(0) (θ̂_n^MLE − 0)
converges to a χ²_d distribution:
‖√n I(0)^{1/2} (θ̂_n^MLE − 0)‖² →(d) χ²_d as n → ∞

Wald's Test in 1 dimension
In 1 dimension, Wald's test coincides with the two-sided test based on the asymptotic normality of the MLE.
Given the hypotheses H_0: θ* = θ_0 vs H_1: θ* ≠ θ_0, a two-sided test of level α based on the asymptotic normality of the MLE is
ψ_α = 1( √(n I(θ_0)) |θ̂^MLE − θ_0| > q_{α/2}(N(0, 1)) )
where the inverse Fisher information I(θ_0)^{-1} is the asymptotic variance of θ̂^MLE under the null hypothesis. On the other hand, a Wald's test of level α is
ψ_α^Wald = 1( n I(θ_0) (θ̂^MLE − θ_0)² > q_α(χ²_1) ) = 1( √(n I(θ_0)) |θ̂^MLE − θ_0| > √(q_α(χ²_1)) )
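
An added sketch of the two-proportion test above: compute the standardized statistic and compare it with the N(0, 1) quantile q_{α/2} ≈ 1.96 (α = 0.05); numpy assumed.

import numpy as np

rng = np.random.default_rng(8)
n = 1000
x = rng.binomial(1, 0.52, size=n)      # group X ~ Bern(p_x)
y = rng.binomial(1, 0.45, size=n)      # group Y ~ Bern(p_y)

px, py = x.mean(), y.mean()
# Plug-in estimate of the asymptotic variance p_x(1-p_x) + p_y(1-p_y)
se = np.sqrt((px * (1 - px) + py * (1 - py)) / n)
stat = (px - py) / se

print("test statistic:", stat, "reject H0 at 5%:", abs(stat) > 1.96)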

13 Distance between distributions
Total variation
The total variation distance TV between the probability measures P and Q with a sample space E is defined as:
TV(P, Q) = max_{A⊂E} |P(A) − Q(A)|
Calculation with pdfs/pmfs f and g:
TV(P, Q) = (1/2) Σ_{x∈E} |f(x) − g(x)|  (discrete)
TV(P, Q) = (1/2) ∫_{x∈E} |f(x) − g(x)| dx  (continuous)
Symmetric: d(P, Q) = d(Q, P)
Nonnegative: d(P, Q) ≥ 0
Definite: d(P, Q) = 0 ⟺ P = Q
Triangle inequality: d(P, V) ≤ d(P, Q) + d(Q, V)
If the supports of P and Q are disjoint: d(P, Q) = 1. The TV distance between a continuous and a discrete r.v. is d(P, Q) = 1.

KL divergence
The KL divergence (also known as relative entropy) between the probability measures P and Q with common sample space E and pmf/pdf functions p and q is defined as:
KL(P, Q) = Σ_{x∈E} p(x) ln(p(x)/q(x))  (discrete)
KL(P, Q) = ∫_{x∈E} p(x) ln(p(x)/q(x)) dx  (continuous)
Not a distance! Sum over the support of P.
Asymmetric in general: KL(P, Q) ≠ KL(Q, P)
Nonnegative: KL(P, Q) ≥ 0
Definite: if P = Q then KL(P, Q) = 0
Does not satisfy the triangle inequality in general: KL(P, V) ≰ KL(P, Q) + KL(Q, V)
Estimator of the KL divergence:
KL(P_{θ*}, P_θ) = E_{θ*}[ln(p_{θ*}(X)/p_θ(X))]
K̂L(P_{θ*}, P_θ) = const − (1/n) Σ_{i=1}^n log(p_θ(X_i))
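
An added numpy illustration computing TV and KL between two discrete distributions on a common finite support (arbitrary example values, not from the sheet).

import numpy as np

# Two pmfs on the support {0, 1, 2, 3}
p = np.array([0.1, 0.4, 0.3, 0.2])
q = np.array([0.25, 0.25, 0.25, 0.25])

tv = 0.5 * np.abs(p - q).sum()          # TV(P, Q) = (1/2) Σ |p(x) − q(x)|
kl_pq = np.sum(p * np.log(p / q))       # KL(P, Q), summed over the support of P
kl_qp = np.sum(q * np.log(q / p))       # KL(Q, P) — differs from KL(P, Q) in general

print(tv, kl_pq, kl_qp)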

Maximum likelihood estimation
Let (E, (P_θ)_{θ∈Θ}) be a statistical model associated with a sample of i.i.d. random variables X_1, X_2, ..., X_n. Assume that there exists θ* ∈ Θ such that X_i ∼ P_{θ*}.
The maximum likelihood estimator is the (unique) θ that minimizes K̂L(P_{θ*}, P_θ) over the parameter space. (The minimizer of the KL divergence is unique due to it being strictly convex in the space of distributions once P_{θ*} is fixed.)
θ̂_n^MLE = argmin_{θ∈Θ} K̂L_n(P_{θ*}, P_θ) = argmax_{θ∈Θ} Σ_{i=1}^n ln p_θ(X_i) = argmax_{θ∈Θ} ln( Π_{i=1}^n p_θ(X_i) )
Cookbook: take the log of the likelihood function; take the partial derivative of the loglikelihood function with respect to the parameter; set the partial derivative to zero and solve for the parameter.
If an indicator function in the pdf/pmf does not depend on the parameter, it can be ignored. If it depends on the parameter, it cannot be ignored, because there is a discontinuity in the loglikelihood function; the maximum/minimum of the X_i is then the maximum likelihood estimator.
Gaussian maximum likelihood estimators:
μ̂^MLE = (1/n) Σ_{i=1}^n x_i
MLE for σ² = τ (in the mean-zero model N(0, τ)): τ̂_n^MLE = (1/n) Σ_{i=1}^n X_i²

13.1 Fisher Information
The Fisher information is the covariance matrix of the gradient of the loglikelihood function. It is equal to the negative expectation of the Hessian of the loglikelihood function and captures the negative of the expected curvature of the loglikelihood function.
Let θ ∈ Θ ⊂ R^d and let (E, {P_θ}_{θ∈Θ}) be a statistical model. Let f_θ(x) be the pdf of the distribution P_θ. Then the Fisher information of the statistical model is
I(θ) = Cov(∇ℓ(θ)) = E[∇ℓ(θ) ∇ℓ(θ)^T] − E[∇ℓ(θ)] E[∇ℓ(θ)]^T = −E[Hℓ(θ)]
where ℓ(θ) = ln f_θ(X). If ∇ℓ(θ) ∈ R^d, I(θ) is a d × d matrix. The definition when the distribution has a pmf p_θ(x) is the same, with the expectation taken with respect to the pmf.
Formula for the calculation of the Fisher information of X: let (R, {P_θ}_{θ∈R}) denote a continuous statistical model and f_θ(x) the pdf of P_θ, assumed twice differentiable as a function of the parameter θ. Then
I(θ) = ∫_{−∞}^{∞} (∂f_θ(x)/∂θ)² / f_θ(x) dx
Models with one parameter (e.g. Bernoulli):
I(θ) = Var(ℓ′(θ)) = −E(ℓ″(θ))
Models with multiple parameters (e.g. Gaussians):
I(θ) = −E[Hℓ(θ)]
Cookbook (better to use the 2nd derivative): find the loglikelihood; take the second derivative (= Hessian if multivariate); massage the second derivative or Hessian (isolate functions of X_i to use with −E(ℓ″(θ)) or −E[Hℓ(θ)]); find the expectation of the functions of X_i and substitute them back into the Hessian or the second derivative — be extra careful to substitute the right power back, E[X_i] ≠ E[X_i²]; and don't forget the minus sign!

Asymptotic normality of the maximum likelihood estimator
Under certain conditions the MLE is asymptotically normal and consistent. This applies even if the MLE is not the sample average. Let the true parameter be θ* ∈ Θ. Necessary assumptions:
• The parameter is identifiable.
• For all θ ∈ Θ, the support of P_θ does not depend on θ (unlike e.g. Unif(0, θ)).
• θ* is not on the boundary of Θ.
• The Fisher information I(θ) is invertible in a neighborhood of θ*.
• A few more technical conditions.
The asymptotic variance of the MLE is the inverse of the Fisher information:
√n (θ̂_n^MLE − θ*) →(d) N_d(0, I(θ*)^{-1}) as n → ∞
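
An added numerical check of I(θ) = −E[ℓ″(θ)] for the Bernoulli model: ℓ″(p) = −X/p² − (1 − X)/(1 − p)², so −E[ℓ″(p)] should equal 1/(p(1 − p)); numpy assumed.

import numpy as np

p = 0.3
x = np.random.default_rng(9).binomial(1, p, size=1_000_000)

# Second derivative of the Bernoulli loglikelihood l(p) = X ln p + (1-X) ln(1-p)
ell_pp = -x / p**2 - (1 - x) / (1 - p) ** 2

print("-E[l''(p)] ~", -ell_pp.mean())
print("1/(p(1-p)) =", 1 / (p * (1 - p)))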

14 Method of Moments
Let X_1, ..., X_n ∼ iid P_{θ*} associated with the model (E, {P_θ}_{θ∈Θ}), with E ⊆ R and Θ ⊆ R^d, for some d ≥ 1.
Population moments: m_k(θ) = E_θ[X_1^k], 1 ≤ k ≤ d
Empirical moments: m̂_k = (1/n) Σ_{i=1}^n X_i^k
Convergence of empirical moments:
m̂_k → m_k  (P, a.s.) as n → ∞
(m̂_1, ..., m̂_d) → (m_1, ..., m_d)  (P, a.s.) as n → ∞
MOM estimator: M is a map from the parameters of a model to the moments of its distribution:
ψ: Θ → R^d, θ ↦ (m_1(θ), m_2(θ), ..., m_d(θ))
This map is invertible, i.e. it results in a system of equations that can be solved for the true parameter vector θ*:
θ* = M^{-1}(m_1(θ*), m_2(θ*), ..., m_d(θ*))
Find the moments (as many as there are parameters), set up the system of equations, solve for the parameters, and use the empirical moments to estimate. The MOM estimator uses the empirical moments:
θ̂_n^MM = M^{-1}( (1/n) Σ_{i=1}^n X_i, (1/n) Σ_{i=1}^n X_i², ..., (1/n) Σ_{i=1}^n X_i^d )
Assuming M^{-1} is continuously differentiable at M(θ), the asymptotic variance of the MOM estimator is given by
√n (θ̂_n^MM − θ) →(d) N(0, Γ(θ)) as n → ∞
where
Γ(θ) = [∂M^{-1}/∂θ (M(θ))]^T Σ(θ) [∂M^{-1}/∂θ (M(θ))] = ∇_θ(M^{-1})^T Σ ∇_θ(M^{-1})
and Σ(θ) is the covariance matrix of the random vector of the moments (X_1^1, X_1^2, ..., X_1^d).
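
An added worked example of the method of moments for a two-parameter family (this particular distribution is an illustration, not taken from the sheet): for a Gamma(α, β) distribution in shape–rate form, m_1 = α/β and m_2 = α(α + 1)/β², so β̂ = m̂_1/(m̂_2 − m̂_1²) and α̂ = m̂_1²/(m̂_2 − m̂_1²); numpy assumed.

import numpy as np

rng = np.random.default_rng(10)
alpha, beta = 3.0, 2.0                                 # true shape and rate
x = rng.gamma(shape=alpha, scale=1 / beta, size=100_000)

m1 = x.mean()                                          # empirical first moment
m2 = (x**2).mean()                                     # empirical second moment

beta_mm = m1 / (m2 - m1**2)                            # m2 - m1^2 = Var = alpha/beta^2
alpha_mm = m1**2 / (m2 - m1**2)

print(alpha_mm, beta_mm)                               # should be close to (3.0, 2.0)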

15 OLS
Y | X = x ∼ N(μ(x), σ² I)
Regression function μ(x):
E[Y | X = x] = μ(x) = x^T β
Random component of the linear model: Y is continuous and Y | X = x is Gaussian with mean μ(x).

16 Generalized Linear Models
We relax the assumption that μ is linear. Instead, we assume that g∘μ is linear, for some function g:
g(μ(x)) = x^T β
The function g is assumed to be known and is referred to as the link function. It maps the domain of the dependent variable to the entire real line:
it has to be strictly increasing,
it has to be continuously differentiable, and
its range is all of R.
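
As an added illustration (not from the sheet), one commonly used link satisfying these requirements for a probability-valued mean is the logit, g(μ) = ln(μ/(1 − μ)), which is strictly increasing, continuously differentiable, and maps (0, 1) onto all of R; a minimal numpy sketch:

import numpy as np

def logit(mu):          # link: maps the mean's domain (0, 1) onto all of R
    return np.log(mu / (1 - mu))

def logit_inv(eta):     # inverse link: maps R back to (0, 1)
    return 1 / (1 + np.exp(-eta))

mu = np.array([0.1, 0.5, 0.9])
print(logit(mu))                  # strictly increasing, range all of R
print(logit_inv(logit(mu)))       # recovers mu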

16.1 The Exponential Family
A family of distributions {P_θ : θ ∈ Θ}, where the parameter space Θ ⊂ R^k is k-dimensional, is called a k-parameter exponential family on R^q if the pmf or pdf f_θ: R^q → R of P_θ can be written in the form:
f_θ(y) = h(y) exp( η(θ) · T(y) − B(θ) )
where
η(θ) = (η_1(θ), ..., η_k(θ)): R^k → R^k
T(y) = (T_1(y), ..., T_k(y)): R^q → R^k
B(θ): R^k → R
h(y): R^q → R
If k = 1 it reduces to:
f_θ(y) = h(y) exp( η(θ) T(y) − B(θ) )

17 Algebra
Absolute value inequalities:
|f(x)| < a ⇒ −a < f(x) < a
|f(x)| > a ⇒ f(x) > a or f(x) < −a

18 Matrix algebra
‖Ax‖² = (Ax)^T (Ax) = x^T A^T A x

19 Calculus
Differentiation under the integral sign:
d/dx ∫_{a(x)}^{b(x)} f(x, t) dt = f(x, b(x)) b′(x) − f(x, a(x)) a′(x) + ∫_{a(x)}^{b(x)} ∂f(x, t)/∂x dt
Concavity in 1 dimension: if g: I → R is twice differentiable on the interval I:
concave: if and only if g″(x) ≤ 0 for all x ∈ I
strictly concave: if g″(x) < 0 for all x ∈ I
convex: if and only if g″(x) ≥ 0 for all x ∈ I
strictly convex: if g″(x) > 0 for all x ∈ I
Multivariate Calculus
The gradient ∇f of a twice differentiable function f: R^d → R is defined as:
∇f: R^d → R^d, θ = (θ_1, ..., θ_d) ↦ (∂f/∂θ_1, ∂f/∂θ_2, ..., ∂f/∂θ_d)^T
Hessian
The Hessian of f is the symmetric d × d matrix of second partial derivatives of f:
Hf(θ) = ∇²f(θ) ∈ R^{d×d}, with entries (Hf(θ))_{ij} = ∂²f/∂θ_i ∂θ_j
A symmetric (real-valued) d × d matrix A is:
Positive semi-definite: x^T A x ≥ 0 for all x ∈ R^d.
Positive definite: x^T A x > 0 for all non-zero vectors x ∈ R^d.
Negative semi-definite (resp. negative definite): x^T A x ≤ 0 (resp. < 0) for all x ∈ R^d \ {0}.
Positive (or negative) definiteness implies positive (or negative) semi-definiteness.
If the Hessian is positive definite at a, then f attains a local minimum at a (convex).
If the Hessian is negative definite at a, then f attains a local maximum at a (concave).
If the Hessian has both positive and negative eigenvalues, then a is a saddle point of f.
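
An added numpy sketch: for a symmetric matrix, definiteness can be checked via the signs of its eigenvalues (all positive ⇒ positive definite; mixed signs ⇒ a saddle point when the matrix is a Hessian).

import numpy as np

H = np.array([[2.0, 0.5],
              [0.5, 1.0]])           # a symmetric candidate Hessian

eigvals = np.linalg.eigvalsh(H)      # eigenvalues of a symmetric matrix

if np.all(eigvals > 0):
    print("positive definite: local minimum")
elif np.all(eigvals < 0):
    print("negative definite: local maximum")
elif np.any(eigvals > 0) and np.any(eigvals < 0):
    print("indefinite: saddle point")
else:
    print("semi-definite: test inconclusive")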
