Cheat Sheet


Cheatsheet for 18.6501x by Blechturm

1 Important probability distributions

Bernoulli
Parameter p ∈ [0, 1], discrete.
pmf: p_X(k) = p if k = 1, and 1 − p if k = 0.
E[X] = p
Var(X) = p(1 − p)
Likelihood (n trials):
L_n(X_1, ..., X_n, p) = p^{Σ_{i=1}^n X_i} (1 − p)^{n − Σ_{i=1}^n X_i}
Loglikelihood (n trials):
ℓ_n(p) = ln(p) Σ_{i=1}^n X_i + (n − Σ_{i=1}^n X_i) ln(1 − p)
MLE:
p̂^MLE = (1/n) Σ_{i=1}^n X_i
Fisher information:
I(p) = 1 / (p(1 − p))
Canonical exponential form:
f_θ(y) = exp(yθ − ln(1 + e^θ) + 0), with θ = ln(p/(1 − p)), b(θ) = ln(1 + e^θ), c(y, φ) = 0, φ = 1.
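
As an added numerical illustration (not part of the original sheet), the following Python sketch simulates Bernoulli data, computes the MLE p̂ = X̄_n repeatedly, and compares its empirical variance with the asymptotic value 1/(n I(p)) = p(1 − p)/n; numpy is assumed to be available.

import numpy as np

rng = np.random.default_rng(0)
p_true, n, reps = 0.3, 500, 2000

# MLE of a Bernoulli parameter is the sample mean X_bar_n
p_hat = rng.binomial(1, p_true, size=(reps, n)).mean(axis=1)

# Empirical variance of the estimator vs. asymptotic variance 1/(n I(p)) = p(1-p)/n
print("empirical Var(p_hat):", p_hat.var())
print("asymptotic 1/(n I(p)):", p_true * (1 - p_true) / n)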

Binomial
Parameters p and n, discrete. Describes the number of successes in n independent Bernoulli trials.
pmf: p_X(k) = C(n, k) p^k (1 − p)^{n−k}, k = 0, 1, ..., n
E[X] = np
Var(X) = np(1 − p)
Likelihood (n observations from Bin(K, θ)):
L_n(X_1, ..., X_n, θ) = Π_{i=1}^n C(K, X_i) · θ^{Σ_{i=1}^n X_i} (1 − θ)^{nK − Σ_{i=1}^n X_i}
Loglikelihood:
ℓ_n(θ) = C + (Σ_{i=1}^n X_i) log(θ) + (nK − Σ_{i=1}^n X_i) log(1 − θ)
Canonical exponential form:
f_p(y) = exp( y(ln(p) − ln(1 − p)) + n ln(1 − p) + ln C(n, y) ), with θ = ln(p/(1 − p)), φ = 1.

Multinomial
Parameters n > 0 and p_1, ..., p_r.
pmf: p_X(x) = n!/(x_1! ··· x_r!) · p_1^{x_1} ··· p_r^{x_r}
E[X_i] = n p_i
Var(X_i) = n p_i (1 − p_i)
Likelihood:
L_n = Π_{j=1}^r p_j^{T_j}, where T_j = Σ_{i=1}^n 1(X_i = j) counts how often outcome j is seen in the trials.
Loglikelihood:
ℓ_n = Σ_{j=1}^r T_j ln(p_j)

Poisson
Parameter λ, discrete; approximates the binomial pmf when n is large, p is small, and λ = np.
pmf: p_X(k) = e^{−λ} λ^k / k!  for k = 0, 1, ...
E[X] = λ
Var(X) = λ
Likelihood:
L_n(x_1, ..., x_n, λ) = (λ^{Σ_{i=1}^n x_i} / Π_{i=1}^n x_i!) · e^{−nλ}
Loglikelihood:
ℓ_n(λ) = −nλ + log(λ) Σ_{i=1}^n x_i − log(Π_{i=1}^n x_i!)
MLE:
λ̂^MLE = (1/n) Σ_{i=1}^n X_i
Fisher information:
I(λ) = 1/λ
Canonical exponential form:
f_θ(y) = exp(yθ − e^θ − ln y!), with θ = ln(λ), b(θ) = e^θ, c(y, φ) = −ln y!, φ = 1.

Exponential
Parameter λ, continuous.
pdf: f_X(x) = λ exp(−λx) if x ≥ 0, 0 otherwise.
CDF: F_X(x) = 1 − exp(−λx) if x ≥ 0, 0 otherwise.
E[X] = 1/λ
Var(X) = 1/λ^2
Likelihood:
L(X_1, ..., X_n; λ) = λ^n exp(−λ Σ_{i=1}^n X_i)
Loglikelihood:
ℓ_n(λ) = n ln(λ) − λ Σ_{i=1}^n X_i
MLE:
λ̂^MLE = n / Σ_{i=1}^n X_i
Fisher information:
I(λ) = 1/λ^2
Canonical exponential form:
f_θ(y) = exp(yθ − (−ln(−θ)) + 0), with θ = −λ = −1/μ, b(θ) = −ln(−θ), c(y, φ) = 0, φ = 1.

Shifted Exponential
Parameters λ > 0 and θ ∈ R, continuous.
pdf: f_X(x) = λ exp(−λ(x − θ)) if x ≥ θ, 0 otherwise.
CDF: F_X(x) = 1 − exp(−λ(x − θ)) if x ≥ θ, 0 otherwise.
E[X] = θ + 1/λ
Var(X) = 1/λ^2
Likelihood:
L(X_1, ..., X_n; λ, θ) = λ^n exp(−λ n (X̄_n − θ)) · 1(min_i X_i ≥ θ)
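
A small sanity check (added here, not in the original sheet): for exponential data the closed-form MLE λ̂ = n/Σ X_i should coincide with the maximizer of ℓ_n(λ) = n ln(λ) − λ Σ X_i found numerically; numpy assumed.

import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=1 / 2.5, size=1000)   # true rate λ = 2.5

lam_mle = len(x) / x.sum()                       # closed-form MLE n / Σ X_i

# Crude numerical maximization of ℓ_n(λ) = n·ln(λ) − λ·Σ X_i over a grid
grid = np.linspace(0.1, 10, 100000)
loglik = len(x) * np.log(grid) - grid * x.sum()
lam_grid = grid[np.argmax(loglik)]

print(lam_mle, lam_grid)                         # the two values should agree closely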

Uniform
Parameters a and b, continuous.
pdf: f_X(x) = 1/(b − a) if a < x < b, 0 otherwise.
E[X] = (a + b)/2
Var(X) = (b − a)^2 / 12
Likelihood (for Unif(0, b)):
L(x_1, ..., x_n; b) = 1(max_i x_i ≤ b) / b^n

Cauchy
Continuous, parameter m.
pdf: f_m(x) = (1/π) · 1/(1 + (x − m)^2)
E[X] is not defined.
Var(X) is not defined.
med(X) = m: P(X > m) = P(X < m) = 1/2 = ∫_m^∞ (1/π) · 1/(1 + (x − m)^2) dx

Univariate Gaussians
Parameters μ and σ^2 > 0, continuous.
pdf: f(x) = 1/√(2πσ^2) · exp(−(x − μ)^2 / (2σ^2))
E[X] = μ
Var(X) = σ^2
CDF of the standard Gaussian:
Φ(z) = ∫_{−∞}^{z} (1/√(2π)) e^{−x^2/2} dx
Likelihood:
L(x_1, ..., x_n; μ, σ^2) = (1/(σ√(2π)))^n exp(−(1/(2σ^2)) Σ_{i=1}^n (X_i − μ)^2)
Loglikelihood:
ℓ_n(μ, σ^2) = −n log(σ√(2π)) − (1/(2σ^2)) Σ_{i=1}^n (X_i − μ)^2
MLE:
μ̂^MLE = X̄_n,  σ̂^2_MLE = (1/n) Σ_{i=1}^n (X_i − X̄_n)^2
Gaussians are invariant under affine transformation:
aX + b ∼ N(aμ + b, a^2 σ^2)
Sum of independent Gaussians: let X ∼ N(μ_X, σ_X^2) and Y ∼ N(μ_Y, σ_Y^2) be independent.
If U = X + Y, then U ∼ N(μ_X + μ_Y, σ_X^2 + σ_Y^2)
If U = X − Y, then U ∼ N(μ_X − μ_Y, σ_X^2 + σ_Y^2)
Symmetry:
If X ∼ N(0, σ^2), then −X ∼ N(0, σ^2) and P(|X| > x) = 2P(X > x)
Standardization:
Z = (X − μ)/σ ∼ N(0, 1)
P(X ≤ t) = P(Z ≤ (t − μ)/σ)
Higher moments:
E[X^2] = μ^2 + σ^2
E[X^3] = μ^3 + 3μσ^2
E[X^4] = μ^4 + 6μ^2σ^2 + 3σ^4
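
A quick illustration of standardization (added example, not from the sheet): P(X ≤ t) for X ∼ N(μ, σ^2) is computed via Φ((t − μ)/σ) using scipy.stats.norm and checked against a Monte Carlo estimate.

import numpy as np
from scipy.stats import norm

mu, sigma, t = 2.0, 3.0, 4.5

# P(X <= t) via standardization: Phi((t - mu) / sigma)
p_exact = norm.cdf((t - mu) / sigma)

# Monte Carlo check with simulated N(mu, sigma^2) draws
x = np.random.default_rng(2).normal(mu, sigma, size=1_000_000)
p_mc = (x <= t).mean()

print(p_exact, p_mc)   # the two probabilities should be close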

Chi squared
The χ²_d distribution with d degrees of freedom is given by the distribution of Z_1² + Z_2² + ··· + Z_d², where Z_1, ..., Z_d ∼ iid N(0, 1).
If V ∼ χ²_d:
E[V] = E[Z_1²] + E[Z_2²] + ... + E[Z_d²] = d
Var(V) = Var(Z_1²) + Var(Z_2²) + ... + Var(Z_d²) = 2d

Student's T Distribution
T_n := Z / √(V/n), where Z ∼ N(0, 1), V ∼ χ²_n, and Z and V are independent.

2 Quantiles of a Distribution
Let α ∈ (0, 1). The quantile of order 1 − α of a random variable X is the number q_α such that:
P(X ≤ q_α) = 1 − α
P(X ≥ q_α) = α
F_X(q_α) = 1 − α
q_α = F_X^{-1}(1 − α)
If X ∼ N(0, 1): P(|X| > q_{α/2}) = α.

3 Expectation
E[X] = ∫_{−∞}^{+∞} x · f_X(x) dx
E[g(X)] = ∫_{−∞}^{+∞} g(x) · f_X(x) dx
E[X | Y = y] = ∫_{−∞}^{+∞} x · f_{X|Y}(x|y) dx
The integration limits only have to cover the support of the pdf. For discrete r.v.s the same formulas hold with sums and pmfs.
Total expectation theorem:
E[X] = ∫_{−∞}^{+∞} f_Y(y) · E[X | Y = y] dy
Expectation of a constant a: E[a] = a
Product of independent r.v.s X and Y: E[X · Y] = E[X] · E[Y]
Product of dependent r.v.s X and Y: in general E[X · Y] ≠ E[X] · E[Y]; instead E[X · Y] = E[E[XY | Y]] = E[Y · E[X | Y]]
Linearity of expectation, where a and c are given scalars: E[aX + cY] = aE[X] + cE[Y]
If the variance of X is known: E[X²] = Var(X) + (E[X])²

4 Variance
Variance is the expected squared deviation from the mean.
Var(X) = E[(X − E[X])²]
Var(X) = E[X²] − (E[X])²
Variance of a product with a constant a:
Var(aX) = a² Var(X)
Variance of the sum of two dependent r.v.s:
Var(X + Y) = Var(X) + Var(Y) + 2Cov(X, Y)
Variance of the sum of two independent r.v.s:
Var(X + Y) = Var(X) + Var(Y)
Var(X − Y) = Var(X) + Var(Y)

5 Sample Mean and Sample Variance
Let X_1, ..., X_n ∼ iid P_μ, where E(X_i) = μ and Var(X_i) = σ² for all i = 1, 2, ..., n.
Sample mean:
X̄_n = (1/n) Σ_{i=1}^n X_i
Sample variance:
S_n = (1/n) Σ_{i=1}^n (X_i − X̄_n)² = (1/n) Σ_{i=1}^n X_i² − X̄_n²
Unbiased estimator of the sample variance:
S̃_n = (1/(n − 1)) Σ_{i=1}^n (X_i − X̄_n)² = (n/(n − 1)) S_n
Expectation of the mean:
E[X̄_n] = (1/n) E[X_1 + X_2 + ... + X_n] = μ
Variance of the mean:
Var(X̄_n) = (1/n)² Var(X_1 + X_2 + ... + X_n) = σ²/n
Cochran's Theorem:
If X_1, ..., X_n ∼ iid N(μ, σ²), the sample mean X̄_n and the sample variance S_n are independent, X̄_n ⊥⊥ S_n, for all n, and nS_n/σ² follows a chi-squared distribution with n − 1 degrees of freedom: nS_n/σ² ∼ χ²_{n−1}.

6 Covariance
The covariance is a measure of how much the values of two correlated random variables determine each other.
Cov(X, Y) = E[(X − μ_X)(Y − μ_Y)]
Cov(X, Y) = E[XY] − E[X]E[Y]
Cov(X, Y) = E[X(Y − μ_Y)]
Possible notations: Cov(X, Y) = σ(X, Y) = σ_{X,Y}
Covariance is commutative: Cov(X, Y) = Cov(Y, X)
The covariance of an r.v. with itself is the variance: Cov(X, X) = E[(X − μ_X)²] = Var(X)
Useful properties:
Cov(aX + h, bY + c) = ab Cov(X, Y)
Cov(X, X + Y) = Var(X) + Cov(X, Y)
Cov(aX + bY, Z) = a Cov(X, Z) + b Cov(Y, Z)
If Cov(X, Y) = 0, we say that X and Y are uncorrelated. If X and Y are independent, their covariance is zero. The converse is not always true; it holds if X and Y form a Gaussian vector, i.e. any linear combination αX + βY is Gaussian for all (α, β) ∈ R² \ {(0, 0)}.
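
A small numpy check (added, not in the original): the biased sample variance S_n corresponds to np.var(x) (ddof=0) and the unbiased version S̃_n to np.var(x, ddof=1), related by the factor n/(n − 1).

import numpy as np

x = np.random.default_rng(3).normal(loc=1.0, scale=2.0, size=50)
n = len(x)

s_biased = np.var(x)            # S_n, divides by n
s_unbiased = np.var(x, ddof=1)  # S~_n, divides by n-1

print(s_unbiased, s_biased * n / (n - 1))   # identical up to rounding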

7 Law of Large Numbers and Central Limit Theorem (univariate)
Let X_1, ..., X_n ∼ iid P_μ, where E(X_i) = μ and Var(X_i) = σ² for all i = 1, 2, ..., n, and X̄_n = (1/n) Σ_{i=1}^n X_i.
Law of large numbers:
X̄_n → μ  (P, a.s.) as n → ∞
(1/n) Σ_{i=1}^n g(X_i) → E[g(X)]  (P, a.s.) as n → ∞
Central Limit Theorem:
√n (X̄_n − μ)/√(σ²) →(d) N(0, 1) as n → ∞
√n (X̄_n − μ) →(d) N(0, σ²) as n → ∞

8 Random Vectors
A random vector X = (X^(1), ..., X^(d))^T of dimension d × 1 is a vector-valued function from a probability space Ω to R^d:
X: Ω → R^d, ω ↦ (X^(1)(ω), X^(2)(ω), ..., X^(d)(ω))^T
where each X^(k) is a (scalar) random variable on Ω.
PDF of X: the joint distribution of its components X^(1), ..., X^(d).
CDF of X: R^d → [0, 1], x ↦ P(X^(1) ≤ x^(1), ..., X^(d) ≤ x^(d)).
The sequence X_1, X_2, ... converges in probability to X if and only if each component sequence X_1^(k), X_2^(k), ... converges in probability to X^(k).
Expectation of a random vector: the elementwise expectation. For X of dimension d × 1, E[X] = (E[X^(1)], ..., E[X^(d)])^T.
The expectation of a random matrix is the expected value of each of its elements. Let X = {X_ij} be an n × p random matrix. Then E[X] is the n × p matrix of numbers (if they exist) with entries E[X_ij].
Let X and Y be random matrices of the same dimension, and let A and B be conformable matrices of constants. Then:
E[X + Y] = E[X] + E[Y]
E[AXB] = A E[X] B
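
An added Monte Carlo sketch of the CLT (not part of the sheet): standardized sample means of exponential data should be approximately N(0, 1) for large n; numpy assumed.

import numpy as np

rng = np.random.default_rng(4)
n, reps = 200, 5000
lam = 2.0                                   # Exp(λ): mean 1/λ, variance 1/λ²

samples = rng.exponential(1 / lam, size=(reps, n))
z = np.sqrt(n) * (samples.mean(axis=1) - 1 / lam) / np.sqrt(1 / lam**2)

# If the CLT holds, z is approximately N(0, 1)
print("mean ~ 0:", z.mean(), " var ~ 1:", z.var())
print("P(|Z| > 1.96) ~ 0.05:", np.mean(np.abs(z) > 1.96))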

Covariance Matrix
Let X be a random vector of dimension d × 1 with expectation μ_X. The covariance matrix is the matrix outer product
Σ = E[(X − μ_X)(X − μ_X)^T]
The covariance matrix Σ = Cov(X) is a d × d matrix: a table of the pairwise covariances of the elements of the random vector, Σ = (σ_jk)_{j,k=1,...,d} with σ_jk = Cov(X^(j), X^(k)). Its diagonal elements are the variances of the elements of the random vector; the off-diagonal elements are their covariances. Note that the covariance is commutative, e.g. σ_12 = σ_21.
Alternative forms:
Σ = E[XX^T] − E[X]E[X]^T = E[XX^T] − μ_X μ_X^T
Let the random vector X ∈ R^d and let A and B be conformable matrices of constants. Then:
Cov(AX + B) = Cov(AX) = A Cov(X) A^T = AΣA^T
Every covariance matrix is positive semi-definite: Σ ⪰ 0.

Gaussian Random Vectors
A random vector X = (X^(1), ..., X^(d))^T is a Gaussian vector, or multivariate Gaussian or normal variable, if any linear combination of its components is a (univariate) Gaussian variable or a constant (a "Gaussian" variable with zero variance), i.e., if α^T X is (univariate) Gaussian or constant for any constant non-zero vector α ∈ R^d.

Multivariate Gaussians
The distribution of X, the d-dimensional Gaussian or normal distribution, is completely specified by the vector mean μ = E[X] = (E[X^(1)], ..., E[X^(d)])^T and the d × d covariance matrix Σ. If Σ is invertible, then the pdf of X is:
f_X(x) = 1/√((2π)^d det(Σ)) · exp(−(1/2)(x − μ)^T Σ^{-1} (x − μ)),  x ∈ R^d
where det(Σ) is the determinant of Σ, which is positive when Σ is invertible.
If μ = 0 and Σ is the identity matrix, then X is called a standard normal random vector.
If the covariance matrix Σ is diagonal, the pdf factors into pdfs of univariate Gaussians, and hence the components are independent.
The linear transform of a Gaussian X ∼ N_d(μ, Σ) with a conformable matrix A and vector b is a Gaussian:
AX + b ∼ N_d(Aμ + b, AΣA^T)
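
An added numpy check of the transformation rule Cov(AX) = AΣA^T, using simulated multivariate normal data (illustrative values, not from the sheet).

import numpy as np

rng = np.random.default_rng(5)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
A = np.array([[1.0, 2.0],
              [0.0, 3.0]])

X = rng.multivariate_normal(mu, Sigma, size=200_000)   # rows are samples of X
Y = X @ A.T                                             # samples of AX

print(np.cov(Y, rowvar=False))    # empirical Cov(AX)
print(A @ Sigma @ A.T)            # theoretical A Σ A^T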

Multivariate CLT
Let X_1, ..., X_n ∈ R^d be independent copies of a random vector X such that E[X] = μ (d × 1 vector of expectations) and Cov(X) = Σ. Then:
√n (X̄_n − μ) →(d) N_d(0, Σ) as n → ∞
√n Σ^{-1/2} (X̄_n − μ) →(d) N_d(0, I_d) as n → ∞
where Σ^{-1/2} is the d × d matrix such that Σ^{-1/2} Σ^{-1/2} = Σ^{-1} and I_d is the identity matrix.

Multivariate Delta Method
Gradient matrix of a vector function: given a vector-valued function f: R^d → R^k, the gradient or gradient matrix of f, denoted ∇f, is the d × k matrix
∇f = [∇f_1 ∇f_2 ... ∇f_k], with entries (∇f)_{ij} = ∂f_j/∂x_i.
This is also the transpose of what is known as the Jacobian matrix J_f of f.
Given
• (T_n)_{n≥1}, a sequence of random vectors satisfying √n (T_n − θ) →(d) T, and
• a function g: R^d → R^k that is continuously differentiable at θ,
then
√n (g(T_n) − g(θ)) →(d) ∇g(θ)^T T as n → ∞.
With multivariate Gaussians and the sample mean: let T_n = X̄_n, where X̄_n is the sample average of X_1, ..., X_n ∼ iid X, and θ = E[X]. The (multivariate) CLT then gives T ∼ N(0, Σ_X), where Σ_X is the covariance of X. In this case we have:
√n (g(T_n) − g(θ)) →(d) ∇g(θ)^T T, with ∇g(θ)^T T ∼ N(0, ∇g(θ)^T Σ_X ∇g(θ)).
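
An added simulation sketch of the one-dimensional delta method (not from the sheet): for g(x) = x² and exponential data, √n(g(X̄_n) − g(μ)) should be approximately N(0, g′(μ)²σ²); numpy assumed.

import numpy as np

rng = np.random.default_rng(6)
n, reps, lam = 400, 5000, 1.5
mu, sigma2 = 1 / lam, 1 / lam**2            # mean and variance of Exp(λ)

g = lambda x: x**2                          # the function applied to the sample mean
xbar = rng.exponential(mu, size=(reps, n)).mean(axis=1)
stat = np.sqrt(n) * (g(xbar) - g(mu))

# Delta method prediction: variance = g'(mu)^2 * sigma^2 with g'(mu) = 2*mu
print("empirical variance:", stat.var())
print("delta-method value:", (2 * mu) ** 2 * sigma2)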

9 Statistical models
(E, {P_θ}_{θ∈Θ})
E is a sample space for X, i.e. a set that contains all possible outcomes of X.
{P_θ}_{θ∈Θ} is a family of probability distributions on E.
Θ is a parameter set, i.e. a set consisting of some possible values of θ.
θ is the true parameter and is unknown. In a parametric model we assume that Θ ⊂ R^d, for some d ≥ 1.
Identifiability: θ ≠ θ′ ⇒ P_θ ≠ P_θ′, equivalently P_θ = P_θ′ ⇒ θ = θ′.
A model is well specified if ∃θ s.t. P = P_θ.

10 Estimators
A statistic is any measurable function of the sample, e.g. X̄_n, max(X_i), etc. An estimator of θ is any statistic which does not depend on θ.
An estimator θ̂_n is weakly consistent if θ̂_n → θ in probability as n → ∞. If the convergence is almost sure, it is strongly consistent.
Asymptotic normality of an estimator:
√n (θ̂_n − θ) →(d) N(0, σ²) as n → ∞
σ² is called the asymptotic variance of θ̂_n. In the case of the sample mean it is the variance of a single X_i. If the estimator is a function of the sample mean, the delta method is needed to compute the asymptotic variance. Asymptotic variance ≠ variance of an estimator.
Bias of an estimator: Bias(θ̂_n) = E[θ̂_n] − θ
Quadratic risk of an estimator: R(θ̂_n) = E[(θ̂_n − θ)²] = Bias² + Variance

11 Confidence intervals
Let (E, (P_θ)_{θ∈Θ}) be a statistical model based on observations X_1, ..., X_n, and assume Θ ⊆ R. Let α ∈ (0, 1).
Non-asymptotic confidence interval of level 1 − α for θ: any random interval I, depending on the sample X_1, ..., X_n but not on θ, such that P_θ[I ∋ θ] ≥ 1 − α, ∀θ ∈ Θ.
Confidence interval of asymptotic level 1 − α for θ: any random interval I whose boundaries do not depend on θ and such that lim_{n→∞} P_θ[I ∋ θ] ≥ 1 − α, ∀θ ∈ Θ.
Two-sided asymptotic CI: let X̃ = (X_1, ..., X_n) with X_i ∼ iid P_θ. A two-sided CI is a function of X̃ giving an upper and a lower bound in which the estimated parameter lies, I = [l(X̃), u(X̃)], with P(θ ∈ I) ≥ 1 − α and conversely P(θ ∉ I) ≤ α.
Since the estimator is an r.v. depending on X̃, it has a variance Var(θ̂_n) and a mean E[θ̂_n]. After finding those, it is possible to standardize the estimator using the CLT. This yields an asymptotic CI:
I = [θ̂_n − q_{α/2} √Var / √n, θ̂_n + q_{α/2} √Var / √n]
This expression depends on the true (asymptotic) variance Var, which has to be estimated. Three possible methods: plug-in (use the sample mean), solve (solve the quadratic inequality), conservative (use the maximum of the variance).

Delta Method
Takes a function of the sample mean and makes it converge to the function of the mean (m̂_1 = X̄_n is the empirical first moment):
√n (g(m̂_1) − g(m_1(θ))) →(d) N(0, g′(m_1(θ))² σ²) as n → ∞
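
An added example of a two-sided asymptotic CI for a Bernoulli proportion using the plug-in method (p̂(1 − p̂) in place of the unknown variance); 1.96 ≈ q_{α/2} of N(0, 1) for α = 0.05; numpy assumed.

import numpy as np

x = np.random.default_rng(7).binomial(1, 0.4, size=500)    # Bernoulli(p = 0.4) sample
n, p_hat = len(x), x.mean()

q = 1.96                                                    # q_{alpha/2} for alpha = 0.05
half_width = q * np.sqrt(p_hat * (1 - p_hat) / n)           # plug-in variance estimate

print("95% asymptotic CI:", (p_hat - half_width, p_hat + half_width))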

12 Hypothesis tests
Comparisons of two proportions
Let X_1, ..., X_n ∼ iid Bern(p_x) and Y_1, ..., Y_n ∼ iid Bern(p_y), with X independent of Y. Let p̂_x = (1/n) Σ_{i=1}^n X_i and p̂_y = (1/n) Σ_{i=1}^n Y_i.
H_0: p_x = p_y;  H_1: p_x ≠ p_y
To get the asymptotic variance, use the multivariate delta method. Consider p̂_x − p̂_y = g(p̂_x, p̂_y) with g(x, y) = x − y; then
√n (g(p̂_x, p̂_y) − g(p_x, p_y)) →(d) N(0, ∇g(p_x, p_y)^T Σ ∇g(p_x, p_y)) = N(0, p_x(1 − p_x) + p_y(1 − p_y))

Pivot
Let X_1, ..., X_n be random samples and let T_n be a function of X_1, ..., X_n and a parameter vector θ. Let g(T_n) be a random variable whose distribution is the same for all θ. Then g is called a pivotal quantity or a pivot.
For example, let X be a random variable with mean μ and variance σ², and let X_1, ..., X_n be iid samples of X. Then
g_n = (X̄_n − μ)/σ
is a pivot with θ = [μ, σ²]^T being the parameter vector. The notion of a parameter vector here is not to be confused with the set of parameters that we use to define a statistical model.

Wald's Test
Let X_1, ..., X_n ∼ iid P_{θ*} for some true parameter θ* ∈ R^d. We construct the associated statistical model (R, {P_θ}_{θ∈R^d}) and the maximum likelihood estimator θ̂_n^MLE for θ*.
Decide between two hypotheses:
H_0: θ* = 0  vs  H_1: θ* ≠ 0
Assuming that the null hypothesis is true, the asymptotic normality of the MLE θ̂_n^MLE implies that the random variable
‖√n I(0)^{1/2} (θ̂_n^MLE − 0)‖² = n (θ̂_n^MLE − 0)^T I(0) (θ̂_n^MLE − 0)
converges to a χ²_d distribution:
‖√n I(0)^{1/2} (θ̂_n^MLE − 0)‖² →(d) χ²_d as n → ∞

Wald's Test in 1 dimension
In 1 dimension, Wald's test coincides with the two-sided test based on the asymptotic normality of the MLE.
Given the hypotheses H_0: θ* = θ_0 vs H_1: θ* ≠ θ_0, a two-sided test of level α based on the asymptotic normality of the MLE is
ψ_α = 1( √(n I(θ_0)) |θ̂^MLE − θ_0| > q_{α/2}(N(0, 1)) )
where the inverse Fisher information I(θ_0)^{-1} is the asymptotic variance of θ̂^MLE under the null hypothesis. On the other hand, a Wald's test of level α is
ψ_α^Wald = 1( n I(θ_0) (θ̂^MLE − θ_0)² > q_α(χ²_1) ) = 1( √(n I(θ_0)) |θ̂^MLE − θ_0| > √(q_α(χ²_1)) )
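
An added sketch of the two-proportion test above: compute the standardized statistic and compare it with the N(0, 1) quantile q_{α/2} ≈ 1.96 (α = 0.05); numpy assumed.

import numpy as np

rng = np.random.default_rng(8)
n = 1000
x = rng.binomial(1, 0.52, size=n)      # group X ~ Bern(p_x)
y = rng.binomial(1, 0.45, size=n)      # group Y ~ Bern(p_y)

px, py = x.mean(), y.mean()
# Plug-in estimate of the asymptotic variance p_x(1-p_x) + p_y(1-p_y)
se = np.sqrt((px * (1 - px) + py * (1 - py)) / n)
stat = (px - py) / se

print("test statistic:", stat, "reject H0 at 5%:", abs(stat) > 1.96)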

13 Distance between distributions
Total variation
The total variation distance TV between the probability measures P and Q with a sample space E is defined as:
TV(P, Q) = max_{A⊂E} |P(A) − Q(A)|
Calculation with pdfs/pmfs f and g:
TV(P, Q) = (1/2) Σ_{x∈E} |f(x) − g(x)|  (discrete)
TV(P, Q) = (1/2) ∫_{x∈E} |f(x) − g(x)| dx  (continuous)
Symmetric: d(P, Q) = d(Q, P)
Nonnegative: d(P, Q) ≥ 0
Definite: d(P, Q) = 0 ⟺ P = Q
Triangle inequality: d(P, V) ≤ d(P, Q) + d(Q, V)
If the supports of P and Q are disjoint: d(P, Q) = 1. The TV distance between a continuous and a discrete r.v. is d(P, Q) = 1.

KL divergence
The KL divergence (also known as relative entropy) between the probability measures P and Q with common sample space E and pmf/pdf functions p and q is defined as:
KL(P, Q) = Σ_{x∈E} p(x) ln(p(x)/q(x))  (discrete)
KL(P, Q) = ∫_{x∈E} p(x) ln(p(x)/q(x)) dx  (continuous)
Not a distance! Sum over the support of P.
Asymmetric in general: KL(P, Q) ≠ KL(Q, P)
Nonnegative: KL(P, Q) ≥ 0
Definite: if P = Q then KL(P, Q) = 0
Does not satisfy the triangle inequality in general: KL(P, V) ≰ KL(P, Q) + KL(Q, V)
Estimator of the KL divergence:
KL(P_{θ*}, P_θ) = E_{θ*}[ln(p_{θ*}(X)/p_θ(X))]
K̂L(P_{θ*}, P_θ) = const − (1/n) Σ_{i=1}^n log(p_θ(X_i))
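
An added numpy illustration computing TV and KL between two discrete distributions on a common finite support (arbitrary example values, not from the sheet).

import numpy as np

# Two pmfs on the support {0, 1, 2, 3}
p = np.array([0.1, 0.4, 0.3, 0.2])
q = np.array([0.25, 0.25, 0.25, 0.25])

tv = 0.5 * np.abs(p - q).sum()          # TV(P, Q) = (1/2) Σ |p(x) − q(x)|
kl_pq = np.sum(p * np.log(p / q))       # KL(P, Q), summed over the support of P
kl_qp = np.sum(q * np.log(q / p))       # KL(Q, P) — differs from KL(P, Q) in general

print(tv, kl_pq, kl_qp)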

Maximum likelihood estimation
Let (E, (P_θ)_{θ∈Θ}) be a statistical model associated with a sample of i.i.d. random variables X_1, X_2, ..., X_n. Assume that there exists θ* ∈ Θ such that X_i ∼ P_{θ*}.
The maximum likelihood estimator is the (unique) θ that minimizes K̂L(P_{θ*}, P_θ) over the parameter space. (The minimizer of the KL divergence is unique due to it being strictly convex in the space of distributions once P_{θ*} is fixed.)
θ̂_n^MLE = argmin_{θ∈Θ} K̂L_n(P_{θ*}, P_θ) = argmax_{θ∈Θ} Σ_{i=1}^n ln p_θ(X_i) = argmax_{θ∈Θ} ln( Π_{i=1}^n p_θ(X_i) )
Cookbook: take the log of the likelihood function; take the partial derivative of the loglikelihood function with respect to the parameter; set the partial derivative to zero and solve for the parameter.
If an indicator function in the pdf/pmf does not depend on the parameter, it can be ignored. If it depends on the parameter, it cannot be ignored, because there is a discontinuity in the loglikelihood function; the maximum/minimum of the X_i is then the maximum likelihood estimator.
Gaussian maximum likelihood estimators:
μ̂^MLE = (1/n) Σ_{i=1}^n x_i
MLE for σ² = τ (in the mean-zero model N(0, τ)): τ̂_n^MLE = (1/n) Σ_{i=1}^n X_i²

13.1 Fisher Information
The Fisher information is the covariance matrix of the gradient of the loglikelihood function. It is equal to the negative expectation of the Hessian of the loglikelihood function and captures the negative of the expected curvature of the loglikelihood function.
Let θ ∈ Θ ⊂ R^d and let (E, {P_θ}_{θ∈Θ}) be a statistical model. Let f_θ(x) be the pdf of the distribution P_θ. Then the Fisher information of the statistical model is
I(θ) = Cov(∇ℓ(θ)) = E[∇ℓ(θ) ∇ℓ(θ)^T] − E[∇ℓ(θ)] E[∇ℓ(θ)]^T = −E[Hℓ(θ)]
where ℓ(θ) = ln f_θ(X). If ∇ℓ(θ) ∈ R^d, I(θ) is a d × d matrix. The definition when the distribution has a pmf p_θ(x) is the same, with the expectation taken with respect to the pmf.
Formula for the calculation of the Fisher information of X: let (R, {P_θ}_{θ∈R}) denote a continuous statistical model and f_θ(x) the pdf of P_θ, assumed twice differentiable as a function of the parameter θ. Then
I(θ) = ∫_{−∞}^{∞} (∂f_θ(x)/∂θ)² / f_θ(x) dx
Models with one parameter (e.g. Bernoulli):
I(θ) = Var(ℓ′(θ)) = −E(ℓ″(θ))
Models with multiple parameters (e.g. Gaussians):
I(θ) = −E[Hℓ(θ)]
Cookbook (better to use the 2nd derivative): find the loglikelihood; take the second derivative (= Hessian if multivariate); massage the second derivative or Hessian (isolate functions of X_i to use with −E(ℓ″(θ)) or −E[Hℓ(θ)]); find the expectation of the functions of X_i and substitute them back into the Hessian or the second derivative — be extra careful to substitute the right power back, E[X_i] ≠ E[X_i²]; and don't forget the minus sign!

Asymptotic normality of the maximum likelihood estimator
Under certain conditions the MLE is asymptotically normal and consistent. This applies even if the MLE is not the sample average. Let the true parameter be θ* ∈ Θ. Necessary assumptions:
• The parameter is identifiable.
• For all θ ∈ Θ, the support of P_θ does not depend on θ (unlike e.g. Unif(0, θ)).
• θ* is not on the boundary of Θ.
• The Fisher information I(θ) is invertible in a neighborhood of θ*.
• A few more technical conditions.
The asymptotic variance of the MLE is the inverse of the Fisher information:
√n (θ̂_n^MLE − θ*) →(d) N_d(0, I(θ*)^{-1}) as n → ∞
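
An added numerical check of I(θ) = −E[ℓ″(θ)] for the Bernoulli model: ℓ″(p) = −X/p² − (1 − X)/(1 − p)², so −E[ℓ″(p)] should equal 1/(p(1 − p)); numpy assumed.

import numpy as np

p = 0.3
x = np.random.default_rng(9).binomial(1, p, size=1_000_000)

# Second derivative of the Bernoulli loglikelihood l(p) = X ln p + (1-X) ln(1-p)
ell_pp = -x / p**2 - (1 - x) / (1 - p) ** 2

print("-E[l''(p)] ~", -ell_pp.mean())
print("1/(p(1-p)) =", 1 / (p * (1 - p)))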

14 Method of Moments
Let X_1, ..., X_n ∼ iid P_{θ*} associated with the model (E, {P_θ}_{θ∈Θ}), with E ⊆ R and Θ ⊆ R^d, for some d ≥ 1.
Population moments: m_k(θ) = E_θ[X_1^k], 1 ≤ k ≤ d
Empirical moments: m̂_k = (1/n) Σ_{i=1}^n X_i^k
Convergence of empirical moments:
m̂_k → m_k  (P, a.s.) as n → ∞
(m̂_1, ..., m̂_d) → (m_1, ..., m_d)  (P, a.s.) as n → ∞
MOM estimator: M is a map from the parameters of a model to the moments of its distribution:
ψ: Θ → R^d, θ ↦ (m_1(θ), m_2(θ), ..., m_d(θ))
This map is invertible, i.e. it results in a system of equations that can be solved for the true parameter vector θ*:
θ* = M^{-1}(m_1(θ*), m_2(θ*), ..., m_d(θ*))
Find the moments (as many as there are parameters), set up the system of equations, solve for the parameters, and use the empirical moments to estimate. The MOM estimator uses the empirical moments:
θ̂_n^MM = M^{-1}( (1/n) Σ_{i=1}^n X_i, (1/n) Σ_{i=1}^n X_i², ..., (1/n) Σ_{i=1}^n X_i^d )
Assuming M^{-1} is continuously differentiable at M(θ), the asymptotic variance of the MOM estimator is given by
√n (θ̂_n^MM − θ) →(d) N(0, Γ(θ)) as n → ∞
where
Γ(θ) = [∂M^{-1}/∂θ (M(θ))]^T Σ(θ) [∂M^{-1}/∂θ (M(θ))] = ∇_θ(M^{-1})^T Σ ∇_θ(M^{-1})
and Σ(θ) is the covariance matrix of the random vector of the moments (X_1^1, X_1^2, ..., X_1^d).
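
An added worked example of the method of moments for a two-parameter family (this particular distribution is an illustration, not taken from the sheet): for a Gamma(α, β) distribution in shape–rate form, m_1 = α/β and m_2 = α(α + 1)/β², so β̂ = m̂_1/(m̂_2 − m̂_1²) and α̂ = m̂_1²/(m̂_2 − m̂_1²); numpy assumed.

import numpy as np

rng = np.random.default_rng(10)
alpha, beta = 3.0, 2.0                                 # true shape and rate
x = rng.gamma(shape=alpha, scale=1 / beta, size=100_000)

m1 = x.mean()                                          # empirical first moment
m2 = (x**2).mean()                                     # empirical second moment

beta_mm = m1 / (m2 - m1**2)                            # m2 - m1^2 = Var = alpha/beta^2
alpha_mm = m1**2 / (m2 - m1**2)

print(alpha_mm, beta_mm)                               # should be close to (3.0, 2.0)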

15 OLS
Y | X = x ∼ N(μ(x), σ² I)
Regression function μ(x):
E[Y | X = x] = μ(x) = x^T β
Random component of the linear model: Y is continuous and Y | X = x is Gaussian with mean μ(x).

16 Generalized Linear Models
We relax the assumption that μ is linear. Instead, we assume that g∘μ is linear, for some function g:
g(μ(x)) = x^T β
The function g is assumed to be known and is referred to as the link function. It maps the domain of the dependent variable to the entire real line:
it has to be strictly increasing,
it has to be continuously differentiable, and
its range is all of R.
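
As an added illustration (not from the sheet), one commonly used link satisfying these requirements for a probability-valued mean is the logit, g(μ) = ln(μ/(1 − μ)), which is strictly increasing, continuously differentiable, and maps (0, 1) onto all of R; a minimal numpy sketch:

import numpy as np

def logit(mu):          # link: maps the mean's domain (0, 1) onto all of R
    return np.log(mu / (1 - mu))

def logit_inv(eta):     # inverse link: maps R back to (0, 1)
    return 1 / (1 + np.exp(-eta))

mu = np.array([0.1, 0.5, 0.9])
print(logit(mu))                  # strictly increasing, range all of R
print(logit_inv(logit(mu)))       # recovers mu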

16.1 The Exponential Family
A family of distributions {P_θ : θ ∈ Θ}, where the parameter space Θ ⊂ R^k is k-dimensional, is called a k-parameter exponential family on R^q if the pmf or pdf f_θ: R^q → R of P_θ can be written in the form:
f_θ(y) = h(y) exp( η(θ) · T(y) − B(θ) )
where
η(θ) = (η_1(θ), ..., η_k(θ)): R^k → R^k
T(y) = (T_1(y), ..., T_k(y)): R^q → R^k
B(θ): R^k → R
h(y): R^q → R
If k = 1 it reduces to:
f_θ(y) = h(y) exp( η(θ) T(y) − B(θ) )

17 Algebra
Absolute value inequalities:
|f(x)| < a ⇒ −a < f(x) < a
|f(x)| > a ⇒ f(x) > a or f(x) < −a

18 Matrix algebra
‖Ax‖² = (Ax)^T (Ax) = x^T A^T A x

19 Calculus
Differentiation under the integral sign:
d/dx ∫_{a(x)}^{b(x)} f(x, t) dt = f(x, b(x)) b′(x) − f(x, a(x)) a′(x) + ∫_{a(x)}^{b(x)} ∂f(x, t)/∂x dt
Concavity in 1 dimension: if g: I → R is twice differentiable on the interval I:
concave: if and only if g″(x) ≤ 0 for all x ∈ I
strictly concave: if g″(x) < 0 for all x ∈ I
convex: if and only if g″(x) ≥ 0 for all x ∈ I
strictly convex: if g″(x) > 0 for all x ∈ I
Multivariate Calculus
The gradient ∇f of a twice differentiable function f: R^d → R is defined as:
∇f: R^d → R^d, θ = (θ_1, ..., θ_d) ↦ (∂f/∂θ_1, ∂f/∂θ_2, ..., ∂f/∂θ_d)^T
Hessian
The Hessian of f is the symmetric d × d matrix of second partial derivatives of f:
Hf(θ) = ∇²f(θ) ∈ R^{d×d}, with entries (Hf(θ))_{ij} = ∂²f/∂θ_i ∂θ_j
A symmetric (real-valued) d × d matrix A is:
Positive semi-definite: x^T A x ≥ 0 for all x ∈ R^d.
Positive definite: x^T A x > 0 for all non-zero vectors x ∈ R^d.
Negative semi-definite (resp. negative definite): x^T A x ≤ 0 (resp. < 0) for all x ∈ R^d \ {0}.
Positive (or negative) definiteness implies positive (or negative) semi-definiteness.
If the Hessian is positive definite at a, then f attains a local minimum at a (convex).
If the Hessian is negative definite at a, then f attains a local maximum at a (concave).
If the Hessian has both positive and negative eigenvalues, then a is a saddle point of f.
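
An added numpy sketch: for a symmetric matrix, definiteness can be checked via the signs of its eigenvalues (all positive ⇒ positive definite; mixed signs ⇒ a saddle point when the matrix is a Hessian).

import numpy as np

H = np.array([[2.0, 0.5],
              [0.5, 1.0]])           # a symmetric candidate Hessian

eigvals = np.linalg.eigvalsh(H)      # eigenvalues of a symmetric matrix

if np.all(eigvals > 0):
    print("positive definite: local minimum")
elif np.all(eigvals < 0):
    print("negative definite: local maximum")
elif np.any(eigvals > 0) and np.any(eigvals < 0):
    print("indefinite: saddle point")
else:
    print("semi-definite: test inconclusive")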
