
Stat 544, Lecture 1

Discrete Probability and Likelihood
Readings: Agresti (2002), Section 1.2

Today we'll review principles of likelihood-based inference. By the next lecture, we'll apply them to the simplest problem in categorical data analysis: tests and intervals regarding a population proportion.

Discrete distributions. Continuous random variables are described by density functions (e.g. the normal curve). But a discrete random variable Y is described by a probability mass function, which we will also call a distribution,

f(y) = P(Y = y).

The set of y-values for which f(y) > 0 is called the support. The support of a discrete random variable is finite (e.g. {0, 1, ..., n}) or countably infinite (e.g. {0, 1, 2, ...}).
If we denote the support by 𝒴, then

$$\sum_{y \in \mathcal{Y}} f(y) = 1, \qquad \sum_{y \in \mathcal{Y}} y\, f(y) = E(Y), \qquad \sum_{y \in \mathcal{Y}} \big(y - E(Y)\big)^2 f(y) = V(Y),$$

and so on.
If the distribution depends on unknown parameter(s) θ, we can write it as f(y; θ) (preferred by frequentists) or f(y | θ) (preferred by Bayesians).
One simple example of a discrete random variable is the binomial, Y ∼ Bin(n, π),

$$f(y \mid \pi) = \frac{n!}{y!\,(n-y)!}\,\pi^y (1-\pi)^{n-y},$$

for y = 0, 1, ..., n and 0 ≤ π ≤ 1. By now you should know that E(Y) = nπ and V(Y) = nπ(1 − π).
Another example is the Poisson, Y ∼ P(λ),

$$f(y \mid \lambda) = \frac{\lambda^y \exp(-\lambda)}{y!},$$

for y = 0, 1, 2, ... and λ > 0. You should already know that E(Y) = V(Y) = λ.
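These identities are easy to check numerically. Below is a minimal sketch (not part of the original notes) that verifies them for the Bin(5, 0.3) and Poisson(2) examples using NumPy and SciPy; the Poisson sum is truncated at a point where the remaining mass is negligible.

```python
# A sketch (not from the notes): numerically check
#   sum f(y) = 1,  sum y f(y) = E(Y),  sum (y - E(Y))^2 f(y) = V(Y)
# for Y ~ Bin(5, 0.3) and Y ~ P(2).
import numpy as np
from scipy import stats

# Binomial: finite support {0, 1, ..., n}.
n, pi = 5, 0.3
y = np.arange(n + 1)
f = stats.binom.pmf(y, n, pi)
print(f.sum())                                    # 1.0
print(f @ y, n * pi)                              # E(Y) = n*pi = 1.5
print(f @ (y - n * pi) ** 2, n * pi * (1 - pi))   # V(Y) = n*pi*(1-pi) = 1.05

# Poisson: countably infinite support, so truncate where the mass is negligible.
lam = 2.0
y = np.arange(200)
f = stats.poisson.pmf(y, lam)
print(f.sum(), f @ y, f @ (y - lam) ** 2)         # ~1, ~2, ~2
```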
A plot of f(y) for a discrete random variable consists of spikes over the support. For example, here's a plot of the Bin(n, π) distribution with n = 5 and π = 0.3.
[Figure: spike plot of the Bin(5, 0.3) probability mass function; y = 0, 1, ..., 5 on the horizontal axis, f(y) from 0.0 to 0.3 on the vertical axis.]
Likelihood and loglikelihood. If we observe a random variable Y = y from distribution f(y | θ), then the likelihood associated with y, L(θ | y), is simply the distribution f(y | θ) regarded as a function of θ with y fixed. For example, if we observe y from Bin(n, π), the likelihood function is

$$L(\pi \mid y) \propto \pi^y (1-\pi)^{n-y}.$$

Any multiplicative constant which does not depend on π is irrelevant and may be discarded.

The formula for the likelihood looks similar to the distribution f(y | π), but don't be deceived: it's very different. The distribution function is defined over the support y ∈ 𝒴, but the likelihood is defined over the continuous parameter space for π. In most cases, we will be working with the loglikelihood

$$l(\theta \mid y) = \log L(\theta \mid y),$$

which is defined up to an arbitrary additive constant. For example, the binomial loglikelihood is

$$l(\pi \mid y) = y \log\pi + (n - y)\log(1 - \pi).$$
If n = 5, the loglikelihood looks like this if y = 0,

[Figure: binomial loglikelihood for n = 5, y = 0; π from 0.0 to 1.0 on the horizontal axis, loglik from −20 to 0 on the vertical axis.]
like this if y = 1,

[Figure: binomial loglikelihood for n = 5, y = 1; π from 0.0 to 1.0 on the horizontal axis, loglik from −15 to −5 on the vertical axis.]
and like this if y = 2:

[Figure: binomial loglikelihood for n = 5, y = 2; π from 0.0 to 1.0 on the horizontal axis, loglik from −14 to −4 on the vertical axis.]
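The notes do not show how these plots were made; here is a minimal sketch (assuming Python with NumPy and matplotlib, not the original code) that reproduces the three loglikelihood curves.

```python
# A sketch (not the code used for the notes) reproducing the plots of
# l(pi | y) = y log(pi) + (n - y) log(1 - pi) for n = 5 and y = 0, 1, 2.
import numpy as np
import matplotlib.pyplot as plt

def binom_loglik(pi, y, n):
    return y * np.log(pi) + (n - y) * np.log(1 - pi)

n = 5
pi = np.linspace(0.001, 0.999, 400)   # stay inside (0, 1): log blows up at the endpoints
fig, axes = plt.subplots(1, 3, figsize=(10, 3))
for ax, y in zip(axes, [0, 1, 2]):
    ax.plot(pi, binom_loglik(pi, y, n))
    ax.set_xlabel("pi")
    ax.set_ylabel("loglik")
    ax.set_title(f"n = 5, y = {y}")
plt.tight_layout()
plt.show()
```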
In many problems of interest, we will derive our loglikelihood from a sample rather than from a single observation. If we observe an independent sample y_1, y_2, ..., y_n from a distribution f(y | θ), the likelihood is

$$L(\theta) = \prod_{i=1}^n f(y_i \mid \theta)$$

and the loglikelihood is

$$l(\theta) = \sum_{i=1}^n \log f(y_i \mid \theta),$$

where conditioning on the sample y_1, ..., y_n is understood. (In the case of a binomial, the single count Y represents the total number of successes from n Bernoulli trials, so the binomial problem also has this form.)
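As a sketch of this construction (hypothetical data, not from the notes), here is the Poisson loglikelihood built from an independent sample; the grid maximizer lands near the sample mean, which is the ML estimate.

```python
# A sketch (hypothetical data) of a loglikelihood built from an independent
# sample: l(lambda) = sum_i log f(y_i | lambda) for Poisson observations.
import numpy as np
from scipy import stats

def poisson_loglik(lam, y):
    return stats.poisson.logpmf(y, lam).sum()

y = np.array([3, 1, 4, 2, 2, 0, 5, 3])                 # hypothetical sample
grid = np.linspace(0.5, 6.0, 500)
ll = np.array([poisson_loglik(lam, y) for lam in grid])
print(grid[ll.argmax()], y.mean())                     # grid maximizer ~ ybar, the ML estimate
```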
In regular problems, as the total sample size n grows,
the loglikelihood function does two things: (a) it
becomes more sharply peaked around its maximum,
and (b) its shape becomes nearly quadratic (i.e. a
parabola, if there is a single parameter). This is
important, because the loglikelihood for a
normal-mean problem is exactly quadratic. That is, if
we observe y_1, ..., y_n from a normal population with known variance, the loglikelihood is

$$l(\mu) = -\frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \mu)^2$$

in the one-dimensional case and

$$l(\mu) = -\frac{1}{2}\sum_{i=1}^n (y_i - \mu)^T \Sigma^{-1} (y_i - \mu)$$
in the multivariate case. As the sample size grows,
the inference comes to resemble the normal-mean
problem. This is true even for discrete data. The
extent to which normal-theory approximations work
for discrete data does not depend on how closely the
distribution of responses resembles a normal curve,
but on how closely the loglikelihood resembles a
quadratic function.
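One way to see point (b) numerically (a sketch, under the assumption that the sample proportion stays fixed at 0.3 as n grows) is to recenter the binomial loglikelihood at its maximum and compare it, over the region within two standard errors of the maximizer, with the quadratic it approaches; the maximal gap shrinks as n grows.

```python
# A sketch (sample proportion held at 0.3 as n grows): compare the recentred
# binomial loglikelihood with a quadratic over pihat +/- 2 standard errors.
import numpy as np

def binom_loglik(pi, y, n):
    return y * np.log(pi) + (n - y) * np.log(1 - pi)

for n in [10, 100, 1000, 10000]:
    y = int(0.3 * n)
    pihat = y / n
    se = np.sqrt(pihat * (1 - pihat) / n)
    pi = np.linspace(pihat - 2 * se, pihat + 2 * se, 201)
    ll = binom_loglik(pi, y, n) - binom_loglik(pihat, y, n)   # recentred at the maximum
    quad = -(pi - pihat) ** 2 / (2 * se ** 2)                 # quadratic approximation
    print(n, np.abs(ll - quad).max())                         # gap shrinks as n grows
```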
Transformations. If L(θ) is a likelihood and φ = g(θ) is a one-to-one function of the parameter, with back-transformation θ = g⁻¹(φ), then we can express the likelihood in terms of φ as L(g⁻¹(φ)). There is no Jacobian term, because L(θ) is not a distribution but a function. Transformations may help us to improve the shape of the loglikelihood. If the parameter space for θ has boundaries, we may want to choose a transformation φ whose parameter space is the entire real line.
For example, consider the binomial loglikelihood,

$$l(\pi \mid y) = y \log\pi + (n - y)\log(1 - \pi) = y \log\frac{\pi}{1-\pi} + n \log(1 - \pi).$$

If we apply the logit transformation

$$\beta = \log\frac{\pi}{1-\pi},$$

whose back-transformation is

$$\pi = \frac{\exp(\beta)}{1 + \exp(\beta)},$$

the loglikelihood in terms of β is

$$l(\beta \mid y) = y\beta + n \log\!\left(\frac{1}{1 + \exp(\beta)}\right).$$
If we observe y = 1 from a binomial with n = 5, the loglikelihood in terms of β looks like this.

[Figure: binomial loglikelihood in terms of β for n = 5, y = 1; β from −5 to 1 on the horizontal axis, loglik from −5.5 to −2.5 on the vertical axis.]
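A short sketch (same n = 5, y = 1 example; the grid search is only illustrative) of evaluating the loglikelihood on the β scale; its maximizer sits near logit(y/n) = log(1/4), as the invariance result below implies.

```python
# A sketch (n = 5, y = 1, as above): the loglikelihood on the beta scale,
#   l(beta) = y*beta + n*log(1 / (1 + exp(beta))) = y*beta - n*log(1 + exp(beta)).
import numpy as np

def binom_loglik_beta(beta, y, n):
    return y * beta - n * np.log1p(np.exp(beta))

n, y = 5, 1
beta = np.linspace(-5, 1, 601)
ll = binom_loglik_beta(beta, y, n)
print(beta[ll.argmax()], np.log((y / n) / (1 - y / n)))   # both ~ log(1/4) = -1.386
```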
Transformations do not affect the location of the maximum-likelihood (ML) estimate. If l(θ) is maximized at θ̂, then l(φ) is maximized at φ̂ = g(θ̂).
ML estimation. As you undoubtedly know, the ML estimate for θ is the maximizer of L(θ) or, equivalently, the maximizer of l(θ). In regular problems, the ML estimate can be found by setting to zero the first derivative(s) of l(θ) with respect to θ.

A first derivative of l(θ) with respect to θ is called a score function or simply a score. In a one-parameter problem, the score function from an independent sample y_1, ..., y_n is

$$l'(\theta) = \sum_{i=1}^n u_i(\theta),$$

where

$$u_i(\theta) = \frac{d}{d\theta} \log f(y_i \mid \theta)$$

is the score contribution for y_i. If there are k unknown parameters,

$$\theta = (\theta_1, \theta_2, \ldots, \theta_k)^T,$$
then the score vector is

$$l'(\theta) = \begin{bmatrix} \partial l / \partial\theta_1 \\ \partial l / \partial\theta_2 \\ \vdots \\ \partial l / \partial\theta_k \end{bmatrix} = \sum_{i=1}^n \begin{bmatrix} (\partial/\partial\theta_1) \log f(y_i \mid \theta) \\ (\partial/\partial\theta_2) \log f(y_i \mid \theta) \\ \vdots \\ (\partial/\partial\theta_k) \log f(y_i \mid \theta) \end{bmatrix}.$$
For example, if Y is Bin(n, π), then the score with respect to π is

$$l'(\pi) = \frac{Y}{\pi} - \frac{n - Y}{1 - \pi} = \frac{Y - n\pi}{\pi(1 - \pi)}.$$
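A sketch of solving this score equation numerically (hypothetical y = 1, n = 5; scipy.optimize.brentq is just one convenient root finder), confirming that the root is Y/n:

```python
# A sketch: solve the score equation l'(pi) = (Y - n*pi) / (pi*(1 - pi)) = 0.
import numpy as np
from scipy.optimize import brentq

def score(pi, y, n):
    return (y - n * pi) / (pi * (1 - pi))

n, y = 5, 1
pihat = brentq(score, 1e-6, 1 - 1e-6, args=(y, n))   # root of the score function
print(pihat, y / n)                                   # both ~0.2
```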
Mean of the score function. A well-known property of the score is that it has mean zero. But what, exactly, does that mean? The score is an expression that involves both the parameter θ and the data Y. Because it involves Y, we can take its expectation with respect to the data distribution f(y | θ). The expected score is no longer a function of Y, but it's still a function of θ. If we evaluate this expected score at the true value of θ, that is, at the same value of θ assumed when we took the expectation, we get zero:

$$E\big( l'(\theta) \big) = \int l'(\theta)\, f(y \mid \theta)\, dy = 0.$$
For example, in the case of the binomial proportion, we have

$$E\big( l'(\pi) \big) = E\!\left(\frac{Y - n\pi}{\pi(1 - \pi)}\right) = \frac{1}{\pi(1 - \pi)}\, E(Y - n\pi),$$

which is zero because E(Y) = nπ. If we apply a one-to-one transformation to the parameter, φ = g(θ), then the score function with respect to the new parameter φ also has mean zero.
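A quick simulation sketch (hypothetical settings, not from the notes) illustrating the mean-zero property: averaged over many data sets generated at the true π, the binomial score is essentially zero, while evaluating it at some other value of π gives a nonzero mean.

```python
# A simulation sketch: the binomial score (Y - n*pi) / (pi*(1 - pi)) averages
# to ~0 over data generated at the true pi, but not at some other value of pi.
import numpy as np

rng = np.random.default_rng(544)
n, pi_true = 5, 0.3

def score(pi, y, n):
    return (y - n * pi) / (pi * (1 - pi))

y = rng.binomial(n, pi_true, size=200_000)   # many replicate data sets
print(score(pi_true, y, n).mean())           # ~0: mean zero at the true pi
print(score(0.5, y, n).mean())               # ~-4: not zero away from the true pi
```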
Estimating functions. This property of the score function, that it has an expectation of zero when evaluated at the true parameter, is a key to the modern theory of statistical estimation.

In the original theory of likelihood-based estimation, as developed by R.A. Fisher and others, the ML estimate θ̂ is viewed as the value of the parameter that, under the parametric model, makes the observed data most likely. More recently, however, statisticians have begun to view θ̂ as the solution to the score equation(s). That is, we now often view an ML estimate as the solution to

$$l'(\theta) = 0.$$

The fact that the score functions have mean zero is what makes an ML estimate θ̂ asymptotically unbiased (i.e., √n-consistent) and asymptotically normally distributed,

$$\sqrt{n}\,(\hat\theta - \theta) \rightarrow N(0, v)$$

for some v > 0. But the score functions are not the only functions with this property. Any function of the data and the parameters having mean zero at the true θ has this property as well. Functions having the mean-zero property are called estimating functions. Setting the estimating functions to zero gives the estimating equations.

In the case of the binomial proportion, for example, Y − nπ is a mean-zero estimating function, and so is π⁻¹[Y − nπ].
Is one set of estimating functions as good as another?
Not really. If the distributional assumptions are
correct, then the solution to the true score equations
has some desirable properties of optimality. But other
estimating functions may lead to other estimates that
are also consistent and asymptotically normal.
We will be discussing the theory of estimating
functions at various points throughout the semester.
For now, you should simply note that they exist and
become familiar with this terminology.
Information and variance estimation. The variance of the score is known as the Fisher information. In the case of a single parameter, the Fisher information is

$$i(\theta) = V\big( l'(\theta) \big) = E\big( u^2(\theta) \big) = E\!\left[\left(\sum_i u_i(\theta)\right)^2\right],$$

where u(θ) = Σᵢ uᵢ(θ) denotes the total score. If θ has k parameters, the Fisher information is the k × k covariance matrix for scores,
$$i(\theta) = V\big( l'(\theta) \big) = E\big( u(\theta)\,u(\theta)^T \big) = E\!\left[\left(\sum_i u_i(\theta)\right)\left(\sum_i u_i(\theta)\right)^{\!T}\right].$$
Like the score function, the Fisher information is also a function of θ, so we can evaluate it at any given value of θ.

Notice that i(θ) as we defined it is the expectation of the square of a sum, which, in many problems, can be messy. To actually compute the Fisher information, we usually make use of the well-known identity

$$i(\theta) = -E\big[ l''(\theta) \big],$$

where

$$l''(\theta) = \frac{d^2}{d\theta^2}\sum_{i=1}^n \log f(y_i \mid \theta) = \sum_{i=1}^n \frac{d^2}{d\theta^2}\log f(y_i \mid \theta)$$

is the second derivative of the loglikelihood. In the multiparameter case, l''(θ) is the k × k matrix of
second derivatives

$$l''(\theta) = \frac{\partial^2}{\partial\theta\,\partial\theta^T}\left[\sum_{i=1}^n \log f(y_i \mid \theta)\right],$$

whose (l, m)th element is

$$\sum_{i=1}^n \frac{\partial^2}{\partial\theta_l\,\partial\theta_m}\log f(y_i \mid \theta) = \frac{\partial^2}{\partial\theta_l\,\partial\theta_m}\sum_{i=1}^n \log f(y_i \mid \theta).$$
For example, in the binomial case, the loglikelihood is

$$l(\pi) = y \log\pi + (n - y)\log(1 - \pi)$$

and the score function is

$$l'(\pi) = \frac{Y}{\pi} - \frac{n - Y}{1 - \pi} = \frac{Y - n\pi}{\pi(1 - \pi)}.$$

Differentiating again with respect to π gives

$$l''(\pi) = \frac{-n\pi(1 - \pi) - (Y - n\pi)(1 - 2\pi)}{\pi^2(1 - \pi)^2}.$$

Multiplying by −1 and taking the expectation gives

$$i(\pi) = E\!\left[\frac{n\pi(1 - \pi) + (Y - n\pi)(1 - 2\pi)}{\pi^2(1 - \pi)^2}\right] = \frac{n}{\pi(1 - \pi)},$$

because the second term in the sum is zero.
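A numerical check of this result (a sketch, not from the notes): taking the expectation of −l''(π) over the support of Bin(5, 0.3) reproduces n/(π(1 − π)).

```python
# A sketch checking i(pi) = E[-l''(pi)] = n / (pi*(1 - pi)) for Bin(5, 0.3)
# by averaging -l''(pi) over the support with weights f(y | pi).
import numpy as np
from scipy import stats

n, pi = 5, 0.3
y = np.arange(n + 1)
f = stats.binom.pmf(y, n, pi)

neg_l2 = (n * pi * (1 - pi) + (y - n * pi) * (1 - 2 * pi)) / (pi ** 2 * (1 - pi) ** 2)

print(f @ neg_l2)            # expectation of -l''(pi): ~23.81
print(n / (pi * (1 - pi)))   # closed form: 5 / 0.21 = 23.81
```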
The reason why we care about the Fisher information is that it provides us with a way (several ways, actually) of assessing the uncertainty in the ML estimate. It is well known that, in regular problems, θ̂ is approximately normally distributed about the true θ with variance given by the reciprocal (or, in the multiparameter case, the matrix inverse) of the Fisher information.

In practice, there are two common ways to approximate the variance of θ̂. The first way is to plug θ̂ into i(θ) and invert,

$$\hat V(\hat\theta) \approx i^{-1}(\theta)\Big|_{\theta=\hat\theta};$$

this is commonly called the expected information. The second way is to invert (minus one times) the actual second derivative of the loglikelihood at θ = θ̂,

$$\hat V(\hat\theta) \approx \big[-l''(\theta)\big]^{-1}\Big|_{\theta=\hat\theta};$$

this is commonly called the observed information.

Let's see how this works for the binomial parameter π. The score equation

$$l'(\pi) = \frac{Y - n\pi}{\pi(1 - \pi)} = 0$$
is obviously solved at Y − nπ = 0, or π̂ = Y/n. The Fisher information is

$$i(\pi) = \frac{n}{\pi(1 - \pi)},$$

so the variance estimate based on the expected information is

$$\hat V(\hat\pi) = \frac{\hat\pi(1 - \hat\pi)}{n} = \frac{(Y/n)(1 - Y/n)}{n}.$$

Minus one times the actual second derivative of the loglikelihood is

$$-l''(\pi) = \frac{n\pi(1 - \pi) + (Y - n\pi)(1 - 2\pi)}{\pi^2(1 - \pi)^2}.$$

Plugging in π̂ = Y/n for π and taking the reciprocal gives the estimate based on the observed information,

$$\hat V(\hat\pi) = \frac{\hat\pi(1 - \hat\pi)}{n} = \frac{(Y/n)(1 - Y/n)}{n}.$$
In this example, the two variance estimates are the
same. This will not always happen, however; in many
problems of interest, the two will differ. The two
methods are asymptotically equivalent; in large
samples, they will give very similar answers. Some
statisticians have expressed a mild preference for
using the observed information.
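For completeness, a sketch (using the same n = 5, y = 1 numbers as above) computing both variance estimates for the binomial; as derived, they coincide at π̂(1 − π̂)/n.

```python
# A sketch (n = 5, y = 1, as above): variance estimates for pihat based on the
# expected and the observed information; for the binomial they coincide.
n, y = 5, 1
pihat = y / n

var_expected = 1 / (n / (pihat * (1 - pihat)))     # invert i(pi) evaluated at pihat

neg_l2_at_pihat = (n * pihat * (1 - pihat)
                   + (y - n * pihat) * (1 - 2 * pihat)) / (pihat ** 2 * (1 - pihat) ** 2)
var_observed = 1 / neg_l2_at_pihat                 # invert -l''(pihat)

print(var_expected, var_observed, pihat * (1 - pihat) / n)   # all 0.032
```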
