Tutorial C09 | Statistical Inference (C206) | Mihir Arjunwadkar | MTech M&S 2016-17

1. What are the differences between parametric and nonparametric inference?

2. What are the alternative terms for parametric and nonparametric inference?

3. Explain: Parameter of interest, nuisance parameter.

4. Define: kth moment of a distribution.

5. Define: kth sample moment.

6. Explain: Method of moments estimator/estimation (mome).

7. Derive expressions for the mome for

(a) Bernoulli(p)
(b) Normal(µ, σ 2 )
(c) Gamma(k, θ)

8. State the essential properties of mome.

9. Define: Likelihood function.

10. Explain the rationale behind the likelihood principle.

11. Define: Maximum likelihood estimation/estimator (mle).

12. Derive the likelihood functions for

(a) Bernoulli(p)
(b) Normal(µ, σ 2 )
(c) Gamma(k, θ)
(d) Exp(λ)
(e) Poisson(λ)
(f) Beta(α, β)

assuming that the respective parameters are unknown.

13. Derive expressions for the mle for indicated parameters:

(a) p for Bernoulli(p)


(b) µ, σ 2 for Normal(µ, σ 2 )
(c) θ for Gamma(k, θ), assuming k is known
(d) λ for Exp(λ)
(e) λ for Poisson(λ)

assuming that the respective parameters are unknown.

14. State the essential properties of mle.

15. Define Fisher information.

16. Explain how {mle, asymptotic normality} and {equivariance, Delta method} are related to one
another.

17. State the rationale behind the parametric bootstrap.

18. State the parametric bootstrap procedure.


2
Let X1 , . . . , Xn ∼ Beta(α, β) iid. The value of β is known.

1. Derive the expression for E(X).


The pdf for Beta(α, β) is

    fX(x) = x^(α−1) (1 − x)^(β−1) / B(α, β)  for 0 ≤ x ≤ 1, and 0 otherwise,

where

    B(α, β) = Γ(α)Γ(β) / Γ(α + β)             (Beta function)
    Γ(z) = ∫_0^∞ t^(z−1) e^(−t) dt            (Gamma function).

Because fX(x) is normalized, we have

    ∫_0^1 x^(α−1) (1 − x)^(β−1) dx = B(α, β).

Hence

    E(X) = (1/B(α, β)) ∫_0^1 x · x^(α−1) (1 − x)^(β−1) dx
         = (1/B(α, β)) ∫_0^1 x^α (1 − x)^(β−1) dx
         = B(α + 1, β) / B(α, β)
         = [Γ(α + 1)Γ(β)/Γ(α + β + 1)] / [Γ(α)Γ(β)/Γ(α + β)]
         = [αΓ(α)Γ(β)/((α + β)Γ(α + β))] / [Γ(α)Γ(β)/Γ(α + β)]     (because Γ(z + 1) = zΓ(z))
         = α/(α + β).

2. Assuming β to be known, find the mome for α.


Equate α̂n/(α̂n + β) with the first sample moment µ̂1 = n⁻¹ Σ_{i=1}^n Xi, and solve for α̂n. We get

    α̂n = β × µ̂1/(1 − µ̂1).

3. Use parametric bootstrap to get glimpses of the sampling distribution of α̂n for different values of n.
# Beta distribution parameter
alpha <- 0.5   # we will estimate this using MoM; see function below
beta  <- 0.5   # we will assume this to be known

# MoM estimator of alpha
mome.alpha <- function(x, beta)
{
  mu1.hat <- mean(x)
  beta * mu1.hat / (1 - mu1.hat)
}

# data sizes
N <- c(5, 10, 50, 100, 500, 1000, 5000, 10000)

# bootstrap size
B <- 1000

op <- par(mfrow = c(2, 4))

for (n in N)
{
  # data from Beta(alpha, beta)
  X <- rbeta(n, alpha, beta)

  # MoM estimate of alpha assuming beta to be known
  alpha.hat <- mome.alpha(X, beta)

  # parametric bootstrap
  alpha.hat.star <- replicate(B, mome.alpha(rbeta(n, alpha.hat, beta), beta))
  #
  # this code is intended to illustrate the basic parametric bootstrap.
  # consider using package boot for other purposes.

  # bootstrap distribution of alpha.hat
  hist(alpha.hat.star, min(50, nclass.FD(alpha.hat.star)),
       xlim = range(alpha.hat.star, alpha.hat, alpha),
       freq = FALSE, main = paste('n =', n), border = 'white', col = 'gray')
  curve(dnorm(x, mean(alpha.hat.star), sd(alpha.hat.star)), col = 'red', lty = 2,
        from = min(alpha.hat.star), to = max(alpha.hat.star), n = 501, add = TRUE)
  abline(v = alpha, col = 'blue')
  axis(3, alpha, label = expression(alpha), col = 'blue', col.axis = 'blue')
}

par(op)


1. Let X1 , . . . , Xn ∼ N (α + β, σ 2 ) iid and Y1 , . . . , Ym ∼ N (α − β, σ 2 ) iid. Also assume independence


between the X and Y rvs. Find the maximum likelihood estimators of α, β and σ 2 .
The likelihood L(α, β, σ²) and log-likelihood l(α, β, σ²) ≡ log L(α, β, σ²) functions for this problem are

    L(α, β, σ²) = ∏_{i=1}^n (1/√(2πσ²)) exp(−(xi − (α + β))²/(2σ²)) × ∏_{j=1}^m (1/√(2πσ²)) exp(−(yj − (α − β))²/(2σ²))

    l(α, β, σ²) = −((n + m)/2)(log 2π + log σ²) − (1/(2σ²)) Σ_{i=1}^n (xi − (α + β))² − (1/(2σ²)) Σ_{j=1}^m (yj − (α − β))².

ML estimators for the three parameters are obtained by maximizing the likelihood; i.e., by solving

    ∂l(α, β, σ²)/∂α = 0,   ∂l(α, β, σ²)/∂β = 0,   ∂l(α, β, σ²)/∂σ² = 0

together for the three parameters. In principle, one should also check whether the 3 × 3 matrix of second partial derivatives is negative (semi)definite to ensure that it is a maximum and not a minimum. The ML estimators turn out to be

    α̂ = (1/2) (n⁻¹ Σ_{i=1}^n xi + m⁻¹ Σ_{j=1}^m yj)
    β̂ = (1/2) (n⁻¹ Σ_{i=1}^n xi − m⁻¹ Σ_{j=1}^m yj)
    σ̂² = (1/(n + m)) ( Σ_{i=1}^n (xi − (α̂ + β̂))² + Σ_{j=1}^m (yj − (α̂ − β̂))² ).

2. With α = 1, β = 2, σ 2 = 1, draw a sample of size 60 from the distribution of X and a sample of


size 40 from the distribution of Y . Compute the maximum likelihood estimates of α, β and σ 2
from the combined sample of size 100.
# true values of parameters
#
alpha <- 1
beta  <- 2
sigma <- 1

# data sizes
#
n <- 60
m <- 40

# data
#
x <- rnorm(n, mean = alpha + beta, sd = sigma)
y <- rnorm(m, mean = alpha - beta, sd = sigma)

# MLE
#
alpha.hat <- 0.5 * (mean(x) + mean(y))
beta.hat  <- 0.5 * (mean(x) - mean(y))
sigma.sqr.hat <- (sum((x - (alpha.hat + beta.hat))^2)
                  + sum((y - (alpha.hat - beta.hat))^2)) / (n + m)

3. Give the confidence intervals for the above estimates by estimating standard error through para-
metric bootstrap and using asymptotic normal approximation.
We use below the standard R package boot for the parametric bootstrap. The code below is based
on the last example on the help page for function boot.


# data generator
get . data <- function ( data , mle )
{
data $ x <- rnorm ( data $n , mle [1] + mle [2] , mle [3] )
data $ y <- rnorm ( data $m , mle [1] - mle [2] , mle [3] )

return ( data )
}

# MLE
get . mle <- function ( data )
{
# stopifnot ( length ( data $ x ) == n , length ( data $ y ) == m )

a . hat <- 0.5 * ( mean ( data $ x ) + mean ( data $ y ) )


b . hat <- 0.5 * ( mean ( data $ x ) - mean ( data $ y ) )
ss . hat <- ( sum ( ( data $ x - ( a . hat + b . hat ) ) ^2 )
+ sum ( ( data $ y - ( a . hat - b . hat ) ) ^2 ) ) / ( data $ n + data $ m )

c ( alpha = a . hat , beta = b . hat , sigma = sqrt ( ss . hat ) )


}

# true values of parameters


theta <- c ( alpha = 1 , beta = 2 , sigma = 1 )

# data sizes and data


data <- list ( n = 60 , m = 40 , x = NULL , y = NULL )
data <- get . data ( data , theta )

# MLE
theta . hat <- get . mle ( data )

# parametric bootstrap
require ( boot )

R <- 1000
b <- boot ( data = data , statistic = get . mle , R = R ,
sim = " parametric " , ran . gen = get . data , mle = theta . hat )

# bootstrap CIs
alpha <- 0.05 # confidence parameter
ci <- rbind ( boot . ci ( b , type = ’ norm ’ , conf = 1 - alpha , index = 1 ) $ normal [ -1] ,
boot . ci ( b , type = ’ norm ’ , conf = 1 - alpha , index = 2 ) $ normal [ -1] ,
boot . ci ( b , type = ’ norm ’ , conf = 1 - alpha , index = 3 ) $ normal [ -1] )
rownames ( ci ) <- names ( theta )

# plot bootstrap distributions


op <- par ( mfrow = c ( 1 , 3 ) )
for ( i in 1: ncol ( b $ t ) )
{
# histogram
h <- hist ( b $ t [ , i ] , ’ FD ’ , plot = FALSE )
plot ( h , col = ’ gray ’ , border = ’ white ’ , main = ’ ’ , freq = FALSE ,
xlab = parse ( text = paste ( ’ widehat ( ’ , names ( theta ) [ i ] , ’) ’ ) ) )

# normal pdf with same mean and sd as the bootstrap distribution


curve ( dnorm ( x , mean ( b $ t [ , i ] ) , sd ( b $ t [ , i ] ) ) , from = min ( b $ t [ , i ] ) ,
to = max ( b $ t [ , i ] ) , n = 501 , add = TRUE , lty = 2 )
# mean of bootstrap values
abline ( v = mean ( b $ t [ , i ] ) , lty = 2 )

# MLE
abline ( v = theta . hat [ i ] ) # estimated bias = mean ( b $ t [ , i ] - b $ t [ i ] )

# CI
arrows ( ci [i ,1] , max ( h $ density ) , ci [i ,2] , max ( h $ density ) ,
length = 0.1 , angle = 90 , code = 3 )
}
par ( op )


4
Let X1 , . . . , Xn ∼ N (µ, σ 2 ) iid. Let τ be the β percentile (0 ≤ β ≤ 1); i.e., P (X < τ ) = β.
The log-likelihood function for this problem, sans additive constants, is

    ln(µ, σ) = −n log σ − (n/(2σ²)) [Sn² + (X̄n − µ)²],

where X̄n = n⁻¹ Σ_{i=1}^n Xi and Sn² = n⁻¹ Σ_{i=1}^n (Xi − X̄n)². See Example 9.11, AoS2004. The mle of µ and σ are, respectively,

    µ̂n = X̄n = (1/n) Σ_{i=1}^n Xi
    σ̂n = Sn = √( (1/n) Σ_{i=1}^n (Xi − µ̂n)² ).

1. Find the maximum likelihood estimate of τ .


Let Z = (X − µ)/σ ∼ N(0, 1), and let Φ be the cdf of the standard normal N(0, 1). Then

    β = P(X < τ) = P((X − µ)/σ < (τ − µ)/σ) = P(Z < (τ − µ)/σ) = Φ((τ − µ)/σ).

Hence,

    τ = σ Φ⁻¹(β) + µ.

The mle of τ is obtained by invoking the equivariance property of the mle:

    τ̂n = σ̂n Φ⁻¹(β) + µ̂n.

2. Derive an expression for an approximate 95% confidence interval for τ.

For this purpose, we need the multiparameter Delta method, which establishes asymptotic normality of the mle of functions of parameters. By definition, the (i, j)th element of the Fisher information matrix In is the negative expected value of the second partial derivative of ln(θ) with respect to the ith and jth parameters, θ ≡ (µ, σ); i.e.,

    (In)_{i,j} = −E[ ∂²ln(µ, σ) / (∂θi ∂θj) ].

The expected values of the four second partial derivatives of ln(µ, σ) with respect to µ and σ can be obtained using the facts E(X̄n) = µ and E(Sn²) = (n − 1)σ²/n. (The last expression follows from the fact that E(nSn²/(n − 1)) = σ².) The Fisher information matrix In turns out to be

    In = (n/σ²) [ 1  0 ; 0  2 ].

The inverse Jn of In is

    Jn = (σ²/n) [ 1  0 ; 0  1/2 ].

The gradient of τ with respect to the parameters is

    ∇τ(µ, σ) = ( ∂τ/∂µ , ∂τ/∂σ ) = ( 1 , Φ⁻¹(β) ).

For the Delta method, the gradient is required to be non-zero at the maximum-likelihood point (µ̂, σ̂). The standard error of τ̂n is

    se(τ̂n) = √( (∇τ)ᵀ Jn (∇τ) ) = √( (σ²/n) (1 + (Φ⁻¹(β))²/2) ).

The estimated standard error ŝe(τ̂n) is obtained by substituting σ̂n for σ. Because (τ̂n − τ)/ŝe(τ̂n) is approximately N(0, 1) for large n, an approximate/asymptotic (1 − α) confidence interval for τ is

    τ̂n ± z_{α/2} ŝe(τ̂n).

3. Consider the LakeHuron data available in R. Find the maximum likelihood estimate of τ for this dataset. Find the standard error using the delta method. Find the standard error using the parametric bootstrap.
# visualize
#
op <- par(mfrow = c(1, 3))
# time series plot
plot(LakeHuron, bty = 'n')

# QQ against normal
qqnorm(LakeHuron, bty = 'n')
qqline(LakeHuron)

# histogram with a normal curve overlaid
hist(LakeHuron, 'FD', col = 'gray', border = 'white', freq = FALSE)
curve(dnorm(x, mean(LakeHuron), sd(LakeHuron)),
      from = par('usr')[1], to = par('usr')[2], n = 501, add = TRUE, lty = 2)

# data locations on the x-axis
rug(LakeHuron)
par(op)

# data size
#
n <- length(LakeHuron)

# quantile
#
beta <- 0.95
qb <- qnorm(beta)

# MLE
#
mu.hat <- mean(LakeHuron)
sigma.hat <- sd(LakeHuron) * sqrt((n - 1) / n)
tau.hat <- sigma.hat * qb + mu.hat
tau.hat.se.hat <- sqrt(sigma.hat^2 * (1 + 0.5 * qb^2) / n)

# confidence interval for tau
#
alpha <- 0.05
ci <- tau.hat + c(-1, +1) * qnorm(1 - 0.5 * alpha) * tau.hat.se.hat
#
# while MLEs are asymptotically normal, one never knows if the distribution
# of tau.hat is approximately normal for this data size. The CI should not
# be over-interpreted.
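
The problem also asks for the standard error of τ̂ via the parametric bootstrap; the code above covers only the delta method. A minimal sketch, reusing n, qb, mu.hat and sigma.hat from above (the bootstrap size B is an arbitrary choice):

# parametric bootstrap standard error of tau.hat
B <- 1000
tau.hat.star <- replicate(B, {
  x.star <- rnorm(n, mean = mu.hat, sd = sigma.hat)   # simulate from the fitted normal model
  m <- mean(x.star)                                   # mle of mu on the bootstrap sample
  s <- sd(x.star) * sqrt((n - 1) / n)                 # mle of sigma on the bootstrap sample
  s * qb + m                                          # tau.hat on the bootstrap sample
})
tau.hat.se.boot <- sd(tau.hat.star)                   # bootstrap estimate of se(tau.hat)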


5
Two coins have P(Head) equal to p and p² respectively, where 0 < p < 1. Each coin is tossed exactly once. The resulting rvs are denoted X and Y respectively. In other words, X ∼ Bernoulli(p) and Y ∼ Bernoulli(p²).

1. Write the likelihood function for p, the log-likelihood and its first two derivatives, assuming independence between X and Y.

    L(p) = p^X (1 − p)^(1−X) × (p²)^Y (1 − p²)^(1−Y) = p^(X+2Y) (1 − p)^(1−X) (1 − p²)^(1−Y)

    l(p) = (X + 2Y) log p + (1 − X) log(1 − p) + (1 − Y) log(1 − p²)

    dl(p)/dp = (X + 2Y)/p − (1 − X)/(1 − p) − 2p (1 − Y)/(1 − p²)

    d²l(p)/dp² = −(X + 2Y)/p² − (1 − X)/(1 − p)² − 2(1 − Y)/(1 − p²) − (2p)² (1 − Y)/(1 − p²)².

2. Derive the likelihood equation for p.

Differentiating l(p) with respect to p and setting it equal to zero, we get

    0 = dl(p)/dp = (X + 2Y)/p − (1 − X)/(1 − p) − 2p (1 − Y)/(1 − p²)
      = [ (1 − p)(1 − p²)(X + 2Y) − p(1 − p²)(1 − X) − 2p²(1 − p)(1 − Y) ] / [ p(1 − p)(1 − p²) ]
      = [ (1 − p − p² + p³)(X + 2Y) − (p − p³)(1 − X) − 2(p² − p³)(1 − Y) ] / [ p(1 − p)(1 − p²) ]
      = [ 3p³ − (X + 2)p² − (2Y + 1)p + (X + 2Y) ] / [ p(1 − p)(1 − p²) ],

which leads to the cubic equation

    3p³ − (X + 2)p² − (2Y + 1)p + (X + 2Y) = 0.

3. If X = 1 and Y = 0, find the mle of p.

For X = 1 and Y = 0, the above equation reduces to

    3p³ − 3p² − p + 1 = (3p² − 1)(p − 1) = 0,

whose three solutions are p = ±1/√3 and p = 1. Of these, the negative solution is irrelevant. It can be verified that the second derivative of the log-likelihood is negative at p = 1/√3, so that p = 1/√3 is the required maximum of the log-likelihood and the estimate of the unknown parameter.

4. Based on any insights from the previous result, plus some guesswork/trial-and-error, find all solutions to the likelihood equation in terms of X and Y.

Trial-and-error shows that p = 1 is one solution to this equation. Factoring p = 1 out of the likelihood equation, we get the quadratic equation

    3p² + (1 − X)p − (X + 2Y) = 0,

whose roots are

    p± = (1/6) [ −(1 − X) ± √((1 − X)² + 12(X + 2Y)) ] = (1/6) [ (X − 1) ± √(1 + 10X + 24Y + X²) ].

Both roots are real for the four joint possibilities for X and Y; namely, (a) X = 1 and Y = 0, (b) X = 1 and Y = 1, (c) X = 0 and Y = 1, and (d) X = 0 and Y = 0. One of the positive roots, namely 1 or p+, is the mle.


5. Plot the likelihood, log-likelihood, and its first two derivatives as functions of p for the four cases
(a) X = 1 and Y = 0, (b) X = 1 and Y = 1, (c) X = 0 and Y = 1, and (d) X = 0 and Y = 0.
ll <- function ( p , X , Y , deriv = 0 )
{
stopifnot ( length ( deriv ) == 1 , deriv % in % ( 0:2 ) ,
all ( X % in % ( 0:1 ) ) , all ( Y % in % ( 0:1 ) ) )

nX <- length ( X )
nY <- length ( Y )
sX <- sum ( X )
sY <- sum ( Y )

if ( deriv == 0 )
{
i <- which ( p % in % c ( 0 , 1 ) )
j <- setdiff ( 1: length ( p ) , i )

p [ i ] <- - Inf
p [ j ] <- ( sX +2 * sY ) * log ( p [ j ]) +( nX - sX ) * log (1 - p [ j ]) +( nY - sY ) * log (1 - p [ j ]^2)

return ( p )
}

switch ( deriv , # need better handling of p =0 ,1


( sX +2 * sY ) /p -( nX - sX ) / (1 - p ) -2 * p * ( nY - sY ) / (1 - p ^2) ,
-( sX +2 * sY ) / p ^2 -( nX - sX ) / (1 - p ) ^2 -2 * ( nY - sY ) / (1 - p ^2) -4 * p ^2 * ( nY - sY ) / (1 - p ^2) ^2)
}

mle <- function ( X , Y )


{
stopifnot ( all ( X % in % ( 0:1 ) ) , all ( Y % in % ( 0:1 ) ) )

nX <- length ( X )
nY <- length ( Y )
sX <- sum ( X )
sY <- sum ( Y )

0.5 * ( sX - nX + sqrt ( ( sX - nX ) ^2 + 4 * ( nX +2 * nY ) * ( sX +2 * sY ) ) ) / ( nX +2 * nY )
}

plot . ll <- function ( X , Y , ylim = c ( -50 , +50 ) )


{
nX <- length ( X )
nY <- length ( Y )
sX <- sum ( X )
sY <- sum ( Y )
p . hat <- mle ( X , Y )

curve ( ll ( x , X , Y , 0 ) , from = 0 , to = 1 , n = 501 ,


bty = ’n ’ , xlab = ’p ’ , ylab = ’ ’ , lwd = 1.5 )
title ( main = paste ( ’Log - Likelihood | Sx = ’ , sX , ’ , Nx = ’ , nX ,
’ | Sy = ’ , sY , ’ , Ny = ’ , nY , sep = ’ ’ ) )
abline ( v = p . hat , lty = 2 , col = ’ red ’ )
axis ( 3 , at = p . hat , labels = ’ MLE ’ , col . axis = ’ red ’ , col = ’ white ’ )

curve ( ll ( x , X , Y , 1 ) , from = 0 , to = 1 , n = 501 ,


bty = ’n ’ , xlab = ’p ’ , ylab = ’ ’ , lwd = 1.5 , ylim = ylim )
title ( main = ’ First Derivative ’ )
abline ( v = p . hat , h = 0 , lty = 2 , col = ’ red ’ )
axis ( 3 , at = p . hat , labels = ’ MLE ’ , col . axis = ’ red ’ , col = ’ white ’ )

curve ( ll ( x , X , Y , 2 ) , from = 0 , to = 1 , n = 501 ,


bty = ’n ’ , xlab = ’p ’ , ylab = ’ ’ , lwd = 1.5 , ylim = ylim )
title ( main = ’ Second Derivative ’ )
abline ( v = p . hat , h = 0 , lty = 2 , col = ’ red ’ )
axis ( 3 , at = p . hat , labels = ’ MLE ’ , col . axis = ’ red ’ , col = ’ white ’ )
}

op <- par ( mfcol = c ( 3 , 4 ) )


plot . ll ( 0 , 0 )
plot . ll ( 1 , 0 )
plot . ll ( 0 , 1 )
plot . ll ( 1 , 1 )
par ( op )

Computation of the derivatives in the script above may not be correct for p = 0, 1.


6. Generalize the likelihood function, log-likelihood, and its derivatives for X1, . . . , XN ∼ Bernoulli(p) and Y1, . . . , YM ∼ Bernoulli(p²). Assume independence.

Let SX = Σ_{i=1}^N Xi and SY = Σ_{j=1}^M Yj. Then

    L(p) = p^SX (1 − p)^(N−SX) × (p²)^SY (1 − p²)^(M−SY) = p^(SX+2SY) (1 − p)^(N−SX) (1 − p²)^(M−SY)
    l(p) = (SX + 2SY) log p + (N − SX) log(1 − p) + (M − SY) log(1 − p²)
    dl(p)/dp = (SX + 2SY)/p − (N − SX)/(1 − p) − 2p (M − SY)/(1 − p²)
    d²l(p)/dp² = −(SX + 2SY)/p² − (N − SX)/(1 − p)² − 2(M − SY)/(1 − p²) − (2p)² (M − SY)/(1 − p²)².

The likelihood equation is

    (N + 2M)p³ − (SX + 2M)p² − (2SY + N)p + (SX + 2SY) = 0,

which has p = 1 as one solution. The other two solutions are the roots of the quadratic

    (N + 2M)p² + (N − SX)p − (SX + 2SY) = 0;

namely,

    p± = [ −(N − SX) ± √((N − SX)² + 4(N + 2M)(SX + 2SY)) ] / (2(N + 2M)).

The positive roots 1 and p+ qualify as candidates for the mle.

7. Derive the formula for Fisher information on p.

Since E(X) = p and E(Y) = p², E(SX) = Np and E(SY) = Mp². The Fisher information for parameter p is

    I(p) = −E[ d²l(p)/dp² ]
         = (Np + 2Mp²)/p² + N(1 − p)/(1 − p)² + 2M(1 − p²)/(1 − p²) + (2p)² M(1 − p²)/(1 − p²)²
         = (N + 2Mp)/p + N/(1 − p) + 2M + 4Mp²/(1 − p²).


6
If X1, X2, . . . , Xn ∼ Poisson(λ) iid, design estimators for λ using estimation methods you have studied.

The Poisson distribution is a discrete distribution defined over nonnegative integers k = 0, 1, . . . by the pmf

    P(k) = λ^k e^(−λ) / k!,

with λ > 0. It is normalized (i.e., Σ_{k=0}^∞ P(k) = 1) because Σ_{k=0}^∞ λ^k/k! = e^λ.

1. mome. We have

    E(X) = Σ_{k=0}^∞ k P(k) = Σ_{k=1}^∞ k P(k)
         = e^(−λ) Σ_{k=1}^∞ λ^k/(k − 1)! = e^(−λ) Σ_{k=0}^∞ λ^(k+1)/k!
         = λ × e^(−λ) Σ_{k=0}^∞ λ^k/k! = λ × Σ_{k=0}^∞ P(k)
         = λ.

Equating this with the first sample moment n⁻¹ Σ_{i=1}^n Xi, we get the mome

    λ̂ = n⁻¹ Σ_{i=1}^n Xi.

2. mle. The likelihood and log-likelihood functions are

    L(λ) = ∏_{i=1}^n P(Xi) = ∏_{i=1}^n λ^(Xi) e^(−λ)/Xi! = e^(−nλ) λ^(Σ_{i=1}^n Xi) × ∏_{i=1}^n 1/Xi!
    l(λ) = −nλ + log(λ) Σ_{i=1}^n Xi + terms that do not depend on λ.

To maximize l(λ), set dl(λ)/dλ = 0:

    −n + (1/λ) Σ_{i=1}^n Xi = 0,

which gives us the mle

    λ̂ = n⁻¹ Σ_{i=1}^n Xi.

3. plug-in. This is a parametric setting and, as such, the plug-in method is not appropriate. If one uses E(X) = λ and the plug-in estimator for the mean, the result is the same as the mome.


1. X1 , X2 , . . . , Xn ∼ Uniform(0, θ), where θ > 0. What is the maximum likelihood estimator for θ?
This is AoS2004 Example 9.12.
    fX(x; θ) = 1/θ for 0 ≤ x ≤ θ, and 0 otherwise.

Because θ defines the right boundary of the pdf, the pdf is discontinuous at x = θ. This
discontinuity will be reflected in the likelihood function. As such, the derivative-based method of
maximization is inapplicable here. The likelihood function is
    L(θ) = ∏_{i=1}^n fX(Xi; θ) = θ^(−n) if 0 ≤ X1, . . . , Xn ≤ θ, and 0 otherwise.

To see how L(θ) changes with θ, imagine sliding the value of θ from left to right across the range of the observed data¹ (picture the data points X1, . . . , Xn marked on the θ axis). As long as any data point remains to the right of θ, the likelihood L(θ) will be equal to zero. Continuing the same argument, the likelihood will be nonzero (= θ^(−n)) for θ ≥ max{X1, . . . , Xn}. So

    L(θ) = 0 for θ < max{X1, . . . , Xn}, and θ^(−n) for θ ≥ max{X1, . . . , Xn}.

Clearly, the maximum value of L occurs at the boundary, i.e., θ = max{X1, . . . , Xn}. Therefore,

    θ̂n = max{X1, . . . , Xn}.

theta <- 1
n <- 10
u <- runif(n, 0, theta)

curve(x^(-n), from = max(u), to = 1.5 * max(u), lwd = 2, bty = 'n',
      xlab = expression(theta), ylab = 'Likelihood', xlim = c(0, 1.5 * max(u)))
lines(c(0, max(u)), c(0, 0), lwd = 2)
abline(v = max(u), lty = 2)
rug(u, col = 'red', lwd = 1.5)
axis(3, at = max(u), label = expression(widehat(theta)[n]))

Formally, we may write the pdf in the alternative form

    fX(x; θ) = (1/θ) I_[0,θ](x),

where

    I_A(x) = 1 if x ∈ A, and 0 if x ∉ A.

The likelihood function takes the form

    L(θ) = θ^(−n) ∏_{i=1}^n I_[0,θ](Xi).

Requiring all data to be inside the interval [0, θ] is the same as requiring the maximum data value to be inside the interval [0, θ]. Therefore,

    L(θ) = θ^(−n) I_[0,θ](max{X1, . . . , Xn}).

¹ Because θ is the upper bound on X, we can meaningfully put X1, . . . , Xn on the θ axis.


From this form, it is straightforward to see that

    L(θ) = 0 for θ < max{X1, . . . , Xn}, and θ^(−n) for θ ≥ max{X1, . . . , Xn},

and that L(θ) attains its maximum value at θ = max{X1, . . . , Xn}.

2. If X1 , X2 , . . . , Xn ∼ Uniform(θ, 0), what is the maximum likelihood estimator for θ?

    θ̂n = min{X1, . . . , Xn}.


8
Let X1 , X2 , . . . , Xn ∼ Uniform(a, b) iid. a and b are the unknown parameters. Assume that a < b.

1. Find the mome for a and b.


The pdf of each Xi is

    fX(x) = 1/(b − a) if a ≤ x ≤ b, and 0 otherwise.

The first two moments of the distribution are

    µ1 = E(X) = ∫ x fX(x) dx = (1/(b − a)) ∫_a^b x dx = (a + b)/2
    µ2 = E(X²) = ∫ x² fX(x) dx = (1/(b − a)) ∫_a^b x² dx = (a² + ab + b²)/3.

The first two sample moments (i.e., estimators for the first two moments above) are

    µ̂1 = (1/n) Σ_{i=1}^n Xi
    µ̂2 = (1/n) Σ_{i=1}^n Xi².

Equating µk with µ̂k, k = 1, 2, we get two equations for the estimators of a and b:

    µ̂1 = (ân + b̂n)/2
    µ̂2 = (ân² + ân b̂n + b̂n²)/3.

Solving for ân and b̂n, we get the following two pairs of solutions:

    ân = µ̂1 − √(3(µ̂2 − µ̂1²)),   b̂n = µ̂1 + √(3(µ̂2 − µ̂1²))

OR

    ân = µ̂1 + √(3(µ̂2 − µ̂1²)),   b̂n = µ̂1 − √(3(µ̂2 − µ̂1²)).

Notice that µ̂2 − µ̂1² = (1/n) Σ_{i=1}^n (Xi − µ̂1)² > 0. Hence, the required pair of estimators, the one consistent with a < b, is the first one.

2. Find the maximum likelihood estimator for a and b.


Using arguments similar to AoS2004 Example 9.12, or those in the previous problem, we get

    ân = min{X1, . . . , Xn}
    b̂n = max{X1, . . . , Xn}.
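
A small simulation sketch (true values and sample size are hypothetical) comparing the two sets of estimators:

# compare mome and mle for Uniform(a, b) on simulated data
a <- 2; b <- 7; n <- 200              # hypothetical true values
x <- runif(n, a, b)

m1 <- mean(x)                         # first sample moment
m2 <- mean(x^2)                       # second sample moment
s  <- sqrt(3 * (m2 - m1^2))

c(a.mome = m1 - s, b.mome = m1 + s)   # method-of-moments estimates
c(a.mle  = min(x), b.mle  = max(x))   # maximum likelihood estimates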


9
Let X1, X2, . . . , Xn be independent and identically distributed rvs with the pdf

    f(x; α) = (α + 1)/x^(α+2),   x ≥ 1, α > 0.

This distribution is a special form of the Pareto distribution, with parameter xm = 1.

1. Derive the mome for α.

First get the expression for E(X):

    E(X) = ∫_1^∞ x f(x; α) dx = (α + 1) ∫_1^∞ x^(−(α+1)) dx = (α + 1) [ −x^(−α)/α ]_1^∞ = (α + 1)/α.

The mome α̂n is obtained by equating the first sample moment µ̂1 = n⁻¹ Σ_{i=1}^n Xi with this expected value, and solving the resulting equation for α. We get

    α̂n = 1/(µ̂1 − 1) = n / ( Σ_{i=1}^n Xi − n ).

2. Derive the mle for α.

Assuming that X1, . . . , Xn ≥ 1, the likelihood, log-likelihood, and log-likelihood derivative are

    L(α) = ∏_{i=1}^n (α + 1)/Xi^(α+2) = (α + 1)^n ( ∏_{i=1}^n Xi )^(−(α+2))
    l(α) = n log(α + 1) − (α + 2) Σ_{i=1}^n log(Xi)
    dl(α)/dα = n/(α + 1) − Σ_{i=1}^n log(Xi).

Setting the derivative to zero and solving for α, we get the mle

    α̂n = ( (1/n) Σ_{i=1}^n log(Xi) )^(−1) − 1.
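
A small simulation sketch (hypothetical α and n) that draws from this Pareto form via the inverse cdf F(x) = 1 − x^(−(α+1)) and computes both estimators:

# simulate from f(x; alpha) = (alpha + 1) / x^(alpha + 2), x >= 1, and estimate alpha
alpha <- 2; n <- 500                  # hypothetical true value and sample size
x <- runif(n)^(-1 / (alpha + 1))      # inverse-cdf sampling: F(x) = 1 - x^-(alpha + 1)

alpha.mome <- 1 / (mean(x) - 1)       # method of moments: 1 / (mu1.hat - 1)
alpha.mle  <- 1 / mean(log(x)) - 1    # maximum likelihood
c(mome = alpha.mome, mle = alpha.mle)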


10
Suppose we have independent normal data with unequal means but the same variance; i.e., Xi ∼ N(µi, σ²), i = 1, . . . , n. The means µi are all known.

1. Derive the mle for σ².

The likelihood function L(σ²) and the log-likelihood function l(σ²) = log L(σ²) for this problem are

    L(σ²) = ∏_{i=1}^n (1/√(2πσ²)) exp( −(Xi − µi)²/(2σ²) )
    l(σ²) = −(n/2) log(2πσ²) − (1/(2σ²)) Σ_{i=1}^n (Xi − µi)².

Differentiating l(σ²) with respect to σ² and setting the derivative equal to zero, we get

    dl(σ²)/dσ² = −n/(2σ²) + (1/(2σ⁴)) Σ_{i=1}^n (Xi − µi)² = 0.

The solution of the above equation for σ² is the desired estimator

    σ̂² = (1/n) Σ_{i=1}^n (Xi − µi)².

2. Find the Fisher information for σ².

The second derivative of the log-likelihood with respect to σ² is

    d²l(σ²)/d(σ²)² = n/(2σ⁴) − (1/σ⁶) Σ_{i=1}^n (Xi − µi)².

Because (Xi − µi)/σ ∼ N(0, 1), E[(Xi − µi)²] = σ². Therefore,

    I(σ²) = −E[ d²l(σ²)/d(σ²)² ] = −n/(2σ⁴) + nσ²/σ⁶ = n/(2σ⁴).
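
A small numerical check of the estimator and of the Fisher-information-based standard error 1/√(I(σ²)) = √(2σ⁴/n); the known means and the true σ are hypothetical:

# mle of sigma^2 with known, unequal means, plus its Fisher-information-based se
n  <- 100
mu <- seq(-2, 2, length.out = n)        # hypothetical known means
sigma <- 1.5                            # hypothetical true sigma
x <- rnorm(n, mean = mu, sd = sigma)

sigma2.hat <- mean((x - mu)^2)          # mle of sigma^2
se.hat <- sqrt(2 * sigma2.hat^2 / n)    # 1 / sqrt(I(sigma^2)), with sigma^2 plugged in
c(sigma2.hat = sigma2.hat, se = se.hat)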


11
Suppose we have independent normal data with the same mean but unequal variances; i.e., Xi ∼ N(µ, σi²), i = 1, . . . , n. The variances σi² are all known.

1. Derive the mle for µ.

The likelihood function L(µ) and the log-likelihood function l(µ) = log L(µ) for this problem are

    L(µ) = ∏_{i=1}^n (1/√(2πσi²)) exp( −(xi − µ)²/(2σi²) )
    l(µ) = −(1/2) Σ_{i=1}^n log(2πσi²) − (1/2) Σ_{i=1}^n ((xi − µ)/σi)².

Differentiating l(µ) with respect to µ and setting the derivative equal to zero, we get

    dl(µ)/dµ = Σ_{i=1}^n (xi − µ)/σi² = 0.

The solution of the above equation for µ is the desired estimator

    µ̂ = Σ_{i=1}^n wi xi,   where   wi = (1/σi²) / Σ_{j=1}^n (1/σj²).

2. Find the Fisher information for µ.

The second derivative of the log-likelihood with respect to µ is

    d²l(µ)/dµ² = −Σ_{i=1}^n 1/σi².

Therefore,

    I(µ) = −E[ d²l(µ)/dµ² ] = Σ_{i=1}^n 1/σi².
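
A small numerical check (hypothetical known variances and true µ) of the weighted-mean mle and of the standard error 1/√(I(µ)):

# mle of mu with known, unequal variances: the inverse-variance weighted mean
n <- 50
sigma2 <- runif(n, 0.5, 3)                  # hypothetical known variances
mu <- 2                                     # hypothetical true mean
x <- rnorm(n, mean = mu, sd = sqrt(sigma2))

w <- (1 / sigma2) / sum(1 / sigma2)         # weights proportional to 1 / sigma_i^2
mu.hat <- sum(w * x)                        # mle of mu
se.hat <- 1 / sqrt(sum(1 / sigma2))         # 1 / sqrt(I(mu))
c(mu.hat = mu.hat, se = se.hat)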


12
X1 , X2 , . . . , Xn ∼ Exp(θ) iid.

1. Derive the expression for E(X).

The pdf for Exp(θ) is

    fX(x) = (1/θ) e^(−x/θ) for x ≥ 0, and 0 otherwise.

Therefore,

    E(X) = ∫_0^∞ x fX(x) dx = (1/θ) ∫_0^∞ x e^(−x/θ) dx = θ ∫_0^∞ y e^(−y) dy = θ Γ(2) = θ,

where Γ(z) = ∫_0^∞ t^(z−1) e^(−t) dt is the Gamma function. For integer values of z, Γ(z) = (z − 1)!.

2. What is the mome for θ?

Equating E(X) with the first sample moment µ̂1 = n⁻¹ Σ_{i=1}^n Xi, we get the estimator

    θ̂n = n⁻¹ Σ_{i=1}^n Xi.


13
n1 people are given treatment 1 and n2 people are given treatment 2. n1 and n2 are known. Let X1 be
the number of people on treatment 1 who respond favourably to the treatment and X2 be the number of
people on treatment 2 who respond favourably to the treatment. Assume that X1 ∼ Binomial(n1 , p1 ),
X2 ∼ Binomial(n2 , p2 ). Let ψ = p1 − p2 .

1. Find the maximum likelihood estimator ψ̂ of ψ.

2. Find the Fisher information matrix I(p1 , p2 ).

3. Use the multiparameter delta method to find the asymptotic standard error of ψ̂.
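
No solution is worked out above. As a numerical complement, a small sketch of the standard answers — ψ̂ = p̂1 − p̂2 with the delta-method standard error √(p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2) — might look as follows; all data values are hypothetical:

# mle of psi = p1 - p2 and its delta-method standard error (values hypothetical)
n1 <- 120; x1 <- 78          # treatment 1: number of people and favourable responses
n2 <- 100; x2 <- 55          # treatment 2

p1.hat <- x1 / n1            # mle of p1
p2.hat <- x2 / n2            # mle of p2
psi.hat <- p1.hat - p2.hat   # mle of psi, by equivariance

# se from the inverse Fisher information of (p1, p2) via the delta method
psi.se <- sqrt(p1.hat * (1 - p1.hat) / n1 + p2.hat * (1 - p2.hat) / n2)
c(psi.hat = psi.hat, se = psi.se)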


14
Let X1 , . . . , Xn be iid rvs that take values θ + 1 and θ − 1 with probabilities p and 1 − p respectively.
Here, 0 ≤ p ≤ 1 is known. θ, the unknown, is a real number. We wish to estimate θ from data
X1 , X2 , . . . , Xn .

1. What is the the expected value of X?

E(X) = p × (θ + 1) + (1 − p) × (θ − 1) = 2p + θ − 1.

2. What is the variance of X?

Using similar reasoning, the variance of X turns out to be

    Var(X) = E(X²) − (E(X))² = 4p(1 − p).

3. What is the pdf of X?

The pdf (pmf) of X can be written in the form

    fX(x) = p if x = θ + 1, and 1 − p if x = θ − 1,

or equivalently

    fX(x) = p^((x−(θ−1))/2) (1 − p)^(1−(x−(θ−1))/2).

4. What is the plug-in estimator of θ?


This is a parametric setting and, as such, the plug-in method is not appropriate. We could
consider using the plug-in estimator for the mean and relate it to E(X), but this would be the
mome; see below.

5. What is the mome of θ?

Equating E(X) to the first sample moment µ̂1 = n⁻¹ Σ_{i=1}^n Xi, we get the mome for θ:

    θ̂n = 1 − 2p + n⁻¹ Σ_{i=1}^n Xi.

It can be shown easily that this estimator is unbiased, and has variance 4p(1 − p)/n.

6. What is the mle of θ?

The likelihood and log-likelihood functions are (with S = Σ_{i=1}^n xi):

    L(θ) = ∏_{i=1}^n fX(xi) = p^((S−n(θ−1))/2) (1 − p)^(n−(S−n(θ−1))/2)
    l(θ) = ((S − n(θ − 1))/2) log(p) + (n − (S − n(θ − 1))/2) log(1 − p).

Sans additive constants, the log-likelihood function is

    l(θ) = (n/2) (log(1 − p) − log(p)) × θ.

Because the log-likelihood is linear in θ, it is unbounded: it increases indefinitely as θ → ∞ when p < 1/2, increases indefinitely as θ → −∞ when p > 1/2, and is constant when p = 1/2. In no case does it have a finite maximizer, so the mle method is inapplicable here.


7. Any other estimator for θ that you could design?


Assuming that both values of X, i.e., θ ± 1, are represented in the data, we have the following
two obvious estimators

    θ̂n = min(X1, . . . , Xn) + 1
    θ̂n = max(X1, . . . , Xn) − 1

which will both give the true value of θ.

8. If all the observations turn out to be identical, would you need to modify your estimators? If yes,
state the modification and justify.
The events
X1 = . . . = Xn = θ + 1
and
X1 = . . . = Xn = θ − 1
could occur with probabilities pn and (1 − p)n respectively. However, we don’t know which
possibility has actually shown up. In these two cases, all previous estimators will lead to wrong
estimates of θ. Suppose X1 = . . . = Xn = α, where α is either θ + 1 or θ − 1 – we don’t know.
The following estimators may work better for this situation:

    θ̂n = α − 1 if p > 1 − p, and α + 1 otherwise;
    θ̂n = α − 1 with probability p, and α + 1 with probability 1 − p.

The second estimator above requires generating a Uniform(0,1) random number r, and choosing
α − 1 if r ≤ p, and α + 1 otherwise. The second estimator above appears more appropriate when
p = 1/2.

** CONFIRM **


15
Suppose we have paired data (Xi, Yi), i = 1, . . . , n. Let us assume that X and Y are related to each other as

    Yi = (aXi + b) + εi,

where the random noise εi ∼ N(0, 1). Find the mles of a, b.
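
No solution is worked out above. Since σ = 1 is known, maximizing the likelihood amounts to minimizing Σ_{i}(Yi − aXi − b)², so the mles coincide with the usual least-squares estimates. A minimal sketch with hypothetical values:

# mle of (a, b) for Y = a X + b + eps, eps ~ N(0, 1): coincides with least squares
n <- 100
a <- 1.5; b <- -0.5                  # hypothetical true values
x <- runif(n, 0, 10)
y <- a * x + b + rnorm(n)

a.hat <- cov(x, y) / var(x)          # = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b.hat <- mean(y) - a.hat * mean(x)
c(a.hat = a.hat, b.hat = b.hat)      # same as coef(lm(y ~ x)), up to naming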


16
Suppose you have paired data (xi, yi), i = 1, . . . , n, where the xi are precisely known (they are not rvs). Further, you are expecting a relationship between xi and yi that has the form

    yi = a e^(xi/b) + εi,

where εi ∼ N(0, σ²) iid and σ² is not known. Write down the log-likelihood function and the likelihood equations for estimation of a, b, σ².
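
No solution is worked out above. The log-likelihood is l(a, b, σ²) = −(n/2) log(2πσ²) − (1/(2σ²)) Σ_{i} (yi − a e^(xi/b))², and the likelihood equations obtained by setting its partial derivatives to zero are nonlinear in a and b. A sketch that instead maximizes l numerically with optim; the true values, starting values, and data grid are all hypothetical:

# numerical mle for y = a * exp(x / b) + eps, eps ~ N(0, sigma^2)
n <- 100
a <- 2; b <- 5; sigma <- 0.3           # hypothetical true values
x <- seq(0, 10, length.out = n)
y <- a * exp(x / b) + rnorm(n, sd = sigma)

negll <- function(par) {               # par = c(a, b, log(sigma)); log keeps sigma > 0
  mu <- par[1] * exp(x / par[2])
  -sum(dnorm(y, mean = mu, sd = exp(par[3]), log = TRUE))
}
fit <- optim(c(1, 3, 0), negll)        # minimize the negative log-likelihood
c(a = fit$par[1], b = fit$par[2], sigma = exp(fit$par[3]))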


17
Let X1, . . . , Xn ∼ N(µ, σ²) iid, with known µ and unknown σ. Find the mle of σ and of ψ = log σ. Find the Fisher information for σ. Obtain thereby an estimate of the standard error of ψ̂, together with a (1 − α) confidence interval on ψ.
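
No solution is worked out above. A numerical sketch of the standard results (σ̂² = n⁻¹ Σ(Xi − µ)², ψ̂ = log σ̂ by equivariance, I(σ) = 2n/σ², and, by the delta method, se(ψ̂) ≈ 1/√(2n)); all values are hypothetical:

# mle of sigma and psi = log(sigma) with known mu, plus a (1 - alpha) CI for psi
n <- 200
mu <- 3; sigma <- 2                        # hypothetical true values
x <- rnorm(n, mu, sigma)

sigma.hat <- sqrt(mean((x - mu)^2))        # mle of sigma (uses the known mu)
psi.hat <- log(sigma.hat)                  # mle of psi by equivariance

# I(sigma) = 2n / sigma^2, so se(sigma.hat) ~ sigma / sqrt(2n);
# delta method with dpsi/dsigma = 1/sigma gives se(psi.hat) ~ 1 / sqrt(2n)
alpha <- 0.05
psi.se <- 1 / sqrt(2 * n)
ci <- psi.hat + c(-1, 1) * qnorm(1 - alpha / 2) * psi.se
c(psi.hat = psi.hat, lower = ci[1], upper = ci[2])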


18
Let X1 , . . . , Xn ∼ Bernoulli(p) and let ψ = g(p) = log(p/(1 − p)). State the maximum likelihood
estimate for p. Using that, derive the formula for maximum likelihood estimate ψ̂ of ψ. Also compute
the standard error of ψ̂.
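
No solution is worked out above. A numerical sketch of the standard results (p̂ = X̄, ψ̂ = log(p̂/(1 − p̂)) by equivariance, and, by the delta method, se(ψ̂) ≈ 1/√(n p̂(1 − p̂))); all values are hypothetical:

# mle of p, the log-odds psi = log(p / (1 - p)), and its delta-method se
n <- 150
p <- 0.3                                     # hypothetical true value
x <- rbinom(n, size = 1, prob = p)

p.hat <- mean(x)                             # mle of p
psi.hat <- log(p.hat / (1 - p.hat))          # mle of psi by equivariance
psi.se <- 1 / sqrt(n * p.hat * (1 - p.hat))  # delta method: |dpsi/dp| * se(p.hat)
c(p.hat = p.hat, psi.hat = psi.hat, se = psi.se)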


19
Let X1, . . . , Xn ∼ f(·; k, θ) iid, where

    f(x; k, θ) = ((k + 1)/θ^(k+1)) (θ − x)^k for 0 ≤ x ≤ θ, and 0 otherwise,

with θ > 0, k > 0.

1. Find the mome for k and θ.

First moment of the pdf:

    E(X) = ((k + 1)/θ^(k+1)) ∫_0^θ x (θ − x)^k dx = ((k + 1)/θ) ∫_0^θ x (1 − x/θ)^k dx
         = θ(k + 1) ∫_0^1 y (1 − y)^k dy = θ(k + 1) Γ(2)Γ(k + 1)/Γ(k + 3) = θ/(k + 2),

where we have used the definition and a property of the Beta function

    B(α, β) = ∫_0^1 y^(α−1) (1 − y)^(β−1) dy = Γ(α)Γ(β)/Γ(α + β),

and the Gamma function properties Γ(n) = (n − 1)! for integer n, and Γ(k + 1) = kΓ(k).

Second moment of the pdf:

    E(X²) = ((k + 1)/θ^(k+1)) ∫_0^θ x² (θ − x)^k dx = ((k + 1)/θ) ∫_0^θ x² (1 − x/θ)^k dx
          = θ²(k + 1) ∫_0^1 y² (1 − y)^k dy = θ²(k + 1) Γ(3)Γ(k + 1)/Γ(k + 4) = (2θ/(k + 3)) × (θ/(k + 2)).

Equating the first two moments of the pdf with the first two sample moments µ̂1 = n⁻¹ Σ_{i=1}^n Xi and µ̂2 = n⁻¹ Σ_{i=1}^n Xi², we get the mome equations

    µ̂1 = θ/(k + 2)
    µ̂2 = (2θ/(k + 3)) × (θ/(k + 2)).

Solving for k and θ, we get the mome

    k̂ = (3µ̂2 − 4µ̂1²) / (2µ̂1² − µ̂2)
    θ̂ = (k̂ + 2) µ̂1.

2. Write the likelihood and log-likelihood functions. Proceed as far as possible towards finding the mle for the parameters.

The pdf is discontinuous at x = 0 and possibly non-differentiable at x = θ. Since θ defines a boundary for the pdf, we need to analyze the behaviour of the likelihood function carefully. Write the pdf in the alternative form

    fX(x; k, θ) = ((k + 1)/θ^(k+1)) (θ − x)^k I_[0,θ](x),

where

    I_A(x) = 1 if x ∈ A, and 0 if x ∉ A.

The likelihood function is

    L(k, θ) = ((k + 1)/θ^(k+1))^n ∏_{i=1}^n (θ − Xi)^k I_[0,θ](Xi).

Because of the indicator functions I_[0,θ](Xi), i = 1, . . . , n, the likelihood as a function of θ will be nonzero only when θ > max{X1, . . . , Xn}. We may write the likelihood and log-likelihood functions as

    L(k, θ) = 0 for θ ≤ max{X1, . . . , Xn}, and ((k + 1)/θ^(k+1))^n ∏_{i=1}^n (θ − Xi)^k for θ > max{X1, . . . , Xn}
    l(k, θ) = −∞ for θ ≤ max{X1, . . . , Xn}, and n log(k + 1) − n(k + 1) log(θ) + k Σ_{i=1}^n log(θ − Xi) for θ > max{X1, . . . , Xn}.

Assuming that θ is constrained to the region θ > max{X1, . . . , Xn}, we may write the likelihood equations

    ∂l/∂θ = −n(k + 1)/θ + k Σ_{i=1}^n 1/(θ − Xi) = 0
    ∂l/∂k = n/(k + 1) − n log(θ) + Σ_{i=1}^n log(θ − Xi) = 0,

which turn out to be nonlinear in the parameters, and presumably hard or impossible to solve analytically. We may have to resort to numerical maximization methods or numerical methods for solving systems of nonlinear equations.

3. Write R code to plot the log-likelihood function (as function of θ for fixed value of k) to see if the
mle of θ necessarily lies on the boundary θ = max{X1 , . . . , Xn }, or on the right of this boundary.
Write R code to make a surface plot or heatmap of the log-likelihood as function of both θ and k.
Core functions:
df <- function ( x , theta = 1 , k = 1 )
{
# probability density function

stopifnot ( theta > 0 , k > 0 )

i <- which ( ( x <= theta ) & ( x >= 0 ) )


j <- setdiff ( 1: length ( x ) , i )

if ( ! is . null ( i ) ) x [ i ] <- ( k + 1 ) * ( 1 - x [ i ] / theta ) ^ k / theta


if ( ! is . null ( j ) ) x [ j ] <- 0

x
}

pf <- function ( x , theta = 1 , k = 1 )


{
# cumulative distribution function

stopifnot ( theta > 0 , k > 0 )

l <- which ( x <= 0 )


n <- which ( x >= theta )
m <- setdiff ( 1: length ( x ) , c ( l , n ) )

if ( ! is . null ( l ) ) x [ l ] <- 0
if ( ! is . null ( m ) ) x [ m ] <- 1 - ( 1 - x [ m ] / theta ) ^( k + 1 )
if ( ! is . null ( n ) ) x [ n ] <- 1

x
}

qf <- function ( q , theta = 1 , k = 1 )


{
# quantile function

stopifnot ( theta > 0 , k > 0 , q >= 0 , q <= 1 )

theta * ( 1 - ( 1 - q ) ^( 1 / ( k + 1 ) ) )
}

rf <- function ( n , theta = 1 , k = 1 )


{
# random number generator

stopifnot ( theta > 0 , k > 0 , n >= 0 , n == as . integer ( n ) )

qf ( runif ( n ) , theta , k )
}

ll <- function ( theta , k = 1 , x )


{
# log - likelihood function for single fixed value of k

stopifnot ( length ( k ) == 1 , k > 0 , all ( theta > 0 ) )

n <- length ( theta )

i <- which ( theta <= max ( x ) )


j <- setdiff ( 1: n , i )

theta [ i ] <- - Inf


theta [ j ] <- ( n * log ( k + 1 ) - ( k + 1 ) * n * log ( theta [ j ] )
+ sapply ( theta [ j ] , function ( t ) { sum ( log ( t - x ) ) } ) )

theta
}

Plots of pdf, cdf, qf, and a random number histogram:


source ( ’ 18 -0. r ’ )

theta <- 1
k <- 1 / 2

op <- par ( mfcol = c ( 2 , 2 ) )


# pdf
curve ( df ( x , theta , k ) , from = 0 , to = theta + 0.5 , n = 501 ,
bty = ’n ’ , lwd = 2 , xlim = c ( -0.5 , theta + 0.5 ) , ylab = ’ PDF ’ )
lines ( c ( -0.5 , 0 ) , c ( 0 , 0 ) , lwd = 2 )
abline ( v = c ( 0 , theta ) , lty = 2 )
title ( main = parse ( text = paste ( " paste ( k == " , k , " ~ ~ theta == " , theta , " ) " ) ) )

# random numbers histogram overlaid with pdf


r <- rf ( 1000 , theta , k )
hist ( r , ’ FD ’ , border = ’ white ’ , col = ’ gray ’ , freq = FALSE , xlab = ’x ’ ,
main = ’ Random Numbers ’ , xlim = c ( -0.5 , theta + 0.5 ) )
abline ( v = c ( 0 , theta ) , lty = 2 )
lines ( c ( -0.5 , 0 ) , c ( 0 , 0 ) , lwd = 2 )
lines ( c ( theta , theta + 0.5 ) , c ( 0 , 0 ) , lwd = 2 )
curve ( df ( x , theta , k ) , from = 0 , to = theta , n = 501 , add = TRUE , lwd = 2 )

# cdf
curve ( pf ( x , theta , k ) , from = -0.5 , to = theta + 0.5 , n = 501 ,
bty = ’n ’ , lwd = 2 , xlim = c ( -0.5 , theta + 0.5 ) , ylab = ’ CDF ’ , asp = 1 )
abline ( h = c ( 0 , 1 ) , v = c ( 0 , theta ) , lty = 2 )

# qf
curve ( qf ( x , theta , k ) , from = 0 , to = 1 , n = 501 ,
bty = ’n ’ , lwd = 2 , xlim = c ( -0.1 , 1.1 ) , ylab = ’ QF ’ , xlab = ’q ’ , asp = 1 )
abline ( h = c ( 0 , theta ) , v = c ( 0 , 1 ) , lty = 2 )
par ( op )

Log-likelihood function for fixed k:


source ( ’ 18 -0. r ’ )

theta <- 1
k <- 1 / 2


r <- rf ( 10 , theta , k ) # a random sample from the pdf

# log - likelihood function as function of theta , with fixed value for k


# log - likelihood --> - infinity for theta < max ( r ) ( not shown )

curve ( ll ( x , k , r ) , from = max ( r ) , to = 1.01 * max ( r ) , n = 501 , col = ’ red ’ ,


bty = ’n ’ , xlab = expression ( theta ) , ylab = ’ Log Likelihood ’ , lwd = 2 )
abline ( v = max ( r ) , lty = 2 )
axis ( 3 , at = max ( r ) ,
label = expression ( paste ( max , ’( ’ , X [1] , ’ , ’ , ldots , ’ , ’ , X [2] , ’) ’ ) ) )
title ( main = paste ( ’log - likelihood for fixed value of k = ’ , k ) )


20
Let X1, . . . , Xn ∼ f(·; θ, δ) iid, where

    f(x; θ, δ) = (1/θ) e^(−(x−δ)/θ) for x ≥ δ, and 0 otherwise.

The pdf defined above can be thought of as a shifted version of the exponential distribution. It can be verified that (1/θ) ∫_δ^∞ e^(−(x−δ)/θ) dx = 1.

1. Find the mome for θ and δ.

Because there are two parameters to estimate, we need to evaluate the first two moments of f:

    µ1 = E(X) = (1/θ) ∫_δ^∞ x e^(−(x−δ)/θ) dx = ∫_0^∞ (θy + δ) e^(−y) dy = θ + δ
    µ2 = E(X²) = (1/θ) ∫_δ^∞ x² e^(−(x−δ)/θ) dx = ∫_0^∞ (θy + δ)² e^(−y) dy = 2θ² + 2θδ + δ² = θ² + µ1²,

where we have used the result ∫_0^∞ y^(k−1) e^(−y) dy ≡ Γ(k) = (k − 1)! for positive integer k. The first two sample moments are

    µ̂1 = (1/n) Σ_{i=1}^n Xi
    µ̂2 = (1/n) Σ_{i=1}^n Xi².

Equating the respective moments with their sample counterparts, we get the mome equations

    µ̂1 = θ̂n + δ̂n
    µ̂2 = θ̂n² + µ̂1².

Solving for θ̂n and δ̂n, we get the mome:

    θ̂n = √(µ̂2 − µ̂1²)
    δ̂n = µ̂1 − θ̂n.

2. Find the mle for θ and δ.

The likelihood and log-likelihood functions are, respectively,

    L(δ, θ) = ∏_{i=1}^n (1/θ) e^(−(Xi−δ)/θ) = θ^(−n) e^(−(1/θ) Σ_{i=1}^n (Xi − δ))
    l(δ, θ) = −n log(θ) − (1/θ) Σ_{i=1}^n (Xi − δ).

Setting the derivatives of l to zero:

    0 = ∂l(δ, θ)/∂δ = n/θ
    0 = ∂l(δ, θ)/∂θ = −n/θ + (1/θ²) Σ_{i=1}^n (Xi − δ).

Clearly, something is wrong with this naive treatment: the culprit is that f is neither continuous nor differentiable at x = δ. An argument like the one used for Uniform(0, θ) (for fixed θ, the likelihood is zero unless δ ≤ min{X1, . . . , Xn} and is increasing in δ on that region) leads to

    δ̂n = min{X1, . . . , Xn}.

With this, the second equation leads to

    θ̂n = (1/n) Σ_{i=1}^n (Xi − δ̂n).

** CONFIRM **

3. Visualize the density and the log-likelihood function.


Core functions:
# PDF
df <- function ( x , delta = 0 , theta = 1 ) { dexp ( x - delta , rate = 1 / theta ) }

# CDF
pf <- function ( x , delta = 0 , theta = 1 ) { pexp ( x - delta , rate = 1 / theta ) }

# QF
qf <- function ( q , delta = 0 , theta = 1 ) { qexp ( q , rate = 1 / theta ) + delta }

# RNG
rf <- function ( n , delta = 0 , theta = 1 ) { rexp ( n , rate = 1 / theta ) + delta }

# log - likelihood function for single fixed value of theta


ll <- function ( delta , theta = 1 , x )
{
stopifnot ( length ( theta ) == 1 , theta > 0 )

n <- length ( delta )

i <- which ( delta > min ( x ) )


j <- setdiff ( 1: n , i )

delta [ i ] <- - Inf


delta [ j ] <- - ( n * log ( theta )
+ sapply ( delta [ j ] , function ( d ) { sum ( x - d ) } ) / theta )

delta
}

Plots of pdf, cdf, qf, and a random number histogram:


source ( ’ 020 -0. r ’ )

delta <- 1
theta <- 2

op <- par ( mfcol = c ( 2 , 2 ) )


# pdf
curve ( df ( x , delta , theta ) , from = delta , to = 10 * delta , n = 501 , bty = ’n ’ , lwd = 2 ,
xlim = c ( - delta , 10 * delta ) , ylim = c ( 0 , 1 / theta ) , ylab = ’ PDF ’ )
lines ( c ( - delta , delta ) , c ( 0 , 0 ) , lwd = 2 )
abline ( v = delta , lty = 2 )
axis ( 3 , at = delta , label = expression ( delta ) )

# random numbers histogram overlaid with pdf


r <- rf ( 1000 , delta , theta )
hist ( r , ’ FD ’ , border = ’ white ’ , col = ’ gray ’ , freq = FALSE , xlab = ’x ’ ,
main = ’ Random Numbers ’ , xlim = c ( - delta , 10 * delta ) , ylim = c ( 0 , 1 / theta ) )
abline ( v = delta , lty = 2 )
lines ( c ( - delta , delta ) , c ( 0 , 0 ) , lwd = 2 )
curve ( df ( x , delta , theta ) , from = delta , to = 10 * delta , n = 501 , add = TRUE , lwd = 2
)
axis ( 3 , at = delta , label = expression ( delta ) )

# cdf


curve ( pf ( x , delta , theta ) , from = - delta , to = 10 * delta , n = 501 ,


bty = ’n ’ , lwd = 2 , ylab = ’ CDF ’ )
abline ( h = c ( 0 , 1 ) , v = delta , lty = 2 )
axis ( 3 , at = delta , label = expression ( delta ) )

# qf
curve ( qf ( x , delta , theta ) , from = 0 , to = 1 , n = 501 , bty = ’n ’ , lwd = 2 ,
xlim = c ( -0.1 , 1.1 ) , ylim = c ( - delta , 10 * delta ) , ylab = ’ QF ’ , xlab = ’q ’ )
abline ( h = delta , v = c ( 0 , 1 ) , lty = 2 )
axis ( 4 , at = delta , label = expression ( delta ) , las = 2 )
par ( op )

Log-likelihood function for fixed θ:


source ( ’ 020 -0. r ’ )

delta <- 1
theta <- 2

r <- rf ( 10 , delta , theta ) # a random sample from the pdf

# log - likelihood function as function of delta , with fixed value for theta
# log - likelihood --> - infinity for delta > min ( r ) ( not shown )

curve ( ll ( x , theta , r ) , from = 0.8 * min ( r ) , to = min ( r ) , n = 501 , col = ’ red ’ ,


bty = ’n ’ , xlab = expression ( delta ) , ylab = ’ Log Likelihood ’ , lwd = 2 )
abline ( v = min ( r ) , lty = 2 )
axis ( 3 , at = min ( r ) ,
label = expression ( paste ( min , ’( ’ , X [1] , ’ , ’ , ldots , ’ , ’ , X [2] , ’) ’ ) ) )
title ( main = paste ( ’log - likelihood for fixed value of theta = ’ , theta ) )


21
Let X1, . . . , Xn ∼ Gamma(k, θ) iid. The pdf of Gamma(k, θ) is

    f(x; k, θ) = x^(k−1) e^(−x/θ) / (θ^k Γ(k)) for x > 0, and 0 otherwise,

with k, θ > 0.
1. Find the mome for k and θ.

We need the first two moments of Gamma(k, θ):

    µ1 = E(X) = kθ
    µ2 = E(X²) = Var(X) + (E(X))² = kθ² + (kθ)².

Equating these to the first two sample moments µ̂1 = n⁻¹ Σ_{i=1}^n Xi and µ̂2 = n⁻¹ Σ_{i=1}^n Xi², we get the mome equations for the estimators of the parameters:

    µ̂1 = k̂ θ̂
    µ̂2 = k̂ θ̂² + (k̂ θ̂)².

Solving these for k̂ and θ̂, we get

    k̂ = µ̂1² / (µ̂2 − µ̂1²)
    θ̂ = (µ̂2 − µ̂1²) / µ̂1.

2. Assuming that the value of k is known, find the mle of θ.

The likelihood function is

    L(θ) = ∏_{i=1}^n f(xi; k, θ) = ∏_{i=1}^n xi^(k−1) e^(−xi/θ) / (θ^k Γ(k))

(assuming x1, . . . , xn > 0, which is true by assumption). The log-likelihood function is

    l(θ) = (k − 1) Σ_{i=1}^n log(xi) − (1/θ) Σ_{i=1}^n xi − nk log(θ) − n log(Γ(k)).

This is a continuous and differentiable function of θ, so to maximize it, we find its derivative and set it to zero:

    −nk/θ + (1/θ²) Σ_{i=1}^n xi = 0,

giving us

    θ̂ = (1/(nk)) Σ_{i=1}^n xi = µ̂1/k.

The mle of k (i.e., k̂) does not have a closed-form analytical expression; see https://en.wikipedia.org/wiki/Gamma_distribution#Maximum_likelihood_estimation for more information.

3. Assuming that the value of k is known, find the mle for θ and for τ = (2θ − 1)².

The mle of θ was found above; using equivariance, we get τ̂ = (2θ̂ − 1)².
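
A small numerical sketch (hypothetical k, θ and n) of the mome and of the mle of θ and τ when k is known:

# mome for Gamma(k, theta), and mle of theta and tau = (2 theta - 1)^2 when k is known
n <- 500
k <- 3; theta <- 2                       # hypothetical true values
x <- rgamma(n, shape = k, scale = theta)

m1 <- mean(x); m2 <- mean(x^2)
k.mome <- m1^2 / (m2 - m1^2)             # mome for k
theta.mome <- (m2 - m1^2) / m1           # mome for theta

theta.mle <- m1 / k                      # mle of theta when k is known
tau.mle <- (2 * theta.mle - 1)^2         # mle of tau by equivariance
c(k.mome = k.mome, theta.mome = theta.mome, theta.mle = theta.mle, tau.mle = tau.mle)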


22
Visualize the likelihood or the log-likelihood functions:

1. X1 , . . . , Xn ∼ Bernoulli(p) iid.
# likelihood function for a Bernoulli(p) sample
log.likelihood.bernoulli <- function(data, p = seq(0, 1, by = 0.1))
{
  N <- length(data)
  S <- sum(data)

  L    <- p^S * (1 - p)^(N - S)
  logL <- S * log(p) + (N - S) * log(1 - p)

  cbind(p, L, logL)
}

# The "Truth"
p <- 0.6   # stopifnot(p >= 0, p <= 1)

# visualization
op <- par(mfrow = c(2, 3))

for (N in c(5, 10, 15, 30, 100, 1000))   # loop over sample sizes
{
  x <- sample(c(1, 0), N, replace = T, prob = c(p, 1 - p))

  ll <- log.likelihood.bernoulli(x, p = seq(0, 1, by = 0.01))

  mle <- sum(x) / length(x)

  plot(ll[, 'p'], ll[, 'L'], type = 'l', xlab = 'p', ylab = 'likelihood',
       main = paste(N, 'Tosses,', sum(x), 'Heads'), bty = 'n')
  abline(v = mle, col = 'blue', lty = 2)
  abline(v = p, col = 'red')   # the true value of p
  mtext('TRUTH', 3, at = p, line = 0.25, cex = 0.5, col = 'red')
  text(mle, 1.01 * max(ll[, 'L']), 'MLE', cex = 0.75, col = 'blue', adj = c(0.5, 0))
}

par(op)

2. X1 , . . . , Xn ∼ N (µ, σ 2 ) iid.
# log likelihood function for a normal sample
log . likelihood . normal <- function ( mu , sigma , X )
{
# stopifnot ( sigma > 0 )

n <- length ( X )

S2 <- ( n - 1 ) * var ( X ) / n
Xbar <- mean ( X )

sapply ( sigma , function ( sigma ) { -n * log ( sigma ) - 0.5 * n * ( S2 +( Xbar - mu ) ^2) / sigma ^2 } )
}

# The " Truth "


mu <- 3
sigma <- 2

# visualization
grid . size <- 501 # finer grid == > smoother contours , slower draw , larger storage

# grids : arbitrary choice


mu . grid <- seq ( floor ( 0.5 * mu ) , ceiling ( 1.5 * mu ) , length = grid . size )
se . grid <- seq ( floor ( 0.5 * sigma ) , ceiling ( 1.5 * sigma ) , length = grid . size )

op <- par ( mfrow = c ( 2 , 3 ) )

for ( N in c ( 5 , 10 , 15 , 30 , 100 , 1000 ) ) # loop over sample sizes


{
X <- rnorm ( N , mu , sigma ) # data sample
mu . hat <- mean ( X ) # MLE of mu
se . hat <- sqrt ( ( N - 1 ) * var ( X ) / N ) # MLE of sigma ; ! = sd ( X )

ll <- log . likelihood . normal ( mu . grid , se . grid , X )


contour ( mu . grid , se . grid , ll , nlevels = 21 , xlim = range ( mu . grid , mu , mu . hat ) ,


ylim = range ( se . grid , sigma , se . hat ) , asp = 1 , col = ’ blue ’ , bty = ’n ’ )
title ( xlab = expression ( mu ) , ylab = expression ( sigma ) ,
main = paste ( ’ log ( likelihood ) for N ( mu , sigma ) | N = ’ , N ) )

points ( mu , sigma , pch = ’+ ’ , cex = 2 , col = ’ red ’ )


text ( mu , sigma , ’ TRUTH ’ ,
pos = if ( mu < mu . hat ) 2 else 4 , cex = 0.75 , col = ’ red ’ )

points ( mu . hat , se . hat , pch = ’+ ’ , cex = 2 , col = ’ blue ’ )


text ( mu . hat , se . hat , ’ MLE ’ ,
pos = if ( mu < mu . hat ) 4 else 2 , cex = 0.75 , col = ’ blue ’ )
}

par ( op )
