In Lectures 1-2, we went back to basics and provided exploratory analysis
of random variables and relations between them in a nonparametric (hence
agnostic and robust) way, then in a parametric (hence efficient) and robust way.
In Lectures 2-4, we tackled predicting their level.
In Lecture 5, we considered 2 and variance as a measure of risk.

But is the variance the best measure of risk?

There are at least two reasons why this may not be the case:

1. The variance is calculated from both tails of a distribution, so it is not
ideal for describing risk: the lower tail of the distribution of profits is the one
that matters for defining risk, not the upper tail (positive profits). The two tails
are the same if the distribution is symmetric, but not otherwise.
For example, a distribution with only one long tail has the same variance as
its mirror image ($\operatorname{var}(x) = \operatorname{var}(-x)$), but certainly not the same risk!
2. There are distributions whose tails decay so slowly that the variance is
infinitely large; e.g. $x \sim t(\nu)$ for $\nu = 1$ (Cauchy case) and $\nu = 2$ have
$\operatorname{var}(x) = \infty$. Is this a case where we should give up measuring (hence
controlling) risk?! Certainly not, as risk becomes even more relevant!

One of the purposes of this lecture is to introduce a method of analyzing the
probability that a critical event (e.g. bankruptcy, crash, etc.) happens, and
finding out which factors affect it and by how much.
This probability will always exist, even if the moments don’t (e.g. $\operatorname{var}(x) = \infty$).
A. Qualitative LHS variates

A dichotomous (or binary) variate is one that can only take one of two possible
values; e.g. default or not, crash or not. More generally, we can have qualitative
variates (e.g. earnings that either meet expectations, are below, much below,
above, or much above), themselves a special case of discrete variates whose
outcomes can be enumerated (counted).
What if the probability of an outcome occurring (e.g. crash) is a function of
some variables (e.g. the economy’s fundamentals, the past values of the index,
etc.)? If we model this relation, we can predict the event better, and we can
anticipate how its distribution changes as these explanatory factors evolve.

But first we need to take the discreteness of the LHS variate into account. We
cannot calculate the usual regressions otherwise, as we shall see.

Let us start by defining some basic discrete variates...

A Bernoulli trial is a variate $y \sim \mathrm{Ber}(p)$ that can take the values 0 or 1. The
event $y = 1$ is a “success” and occurs with probability $p$, so the p.d.f. is
$$\Pr(y = v) \equiv f_y(v) = p^v (1-p)^{1-v} \qquad (v = 0, 1;\ 0 \le p \le 1). \tag{1}$$
We can now go in two different directions to generalize the simple $\mathrm{Ber}(p)$.
Repeating a Bernoulli trial $n$ independent times, then counting the number of
successes $z \equiv \sum_{i=1}^{n} y_i$, we get a binomial $z \sim \mathrm{Bin}(n, p)$ with p.d.f.
$$\Pr(z = v) \equiv f_z(v) = \binom{n}{v} p^v (1-p)^{n-v} \qquad (v = 0, 1, \dots, n;\ 0 \le p \le 1), \tag{2}$$
where $\binom{n}{v}$ are the binomial coefficients seen in Lecture 3.
On the other hand, if $p \ne 0$, to achieve the first $\nu$ successes we need $w$ trials,
where $w$ is now random while $\nu$ is fixed in advance. The p.d.f. of $w - \nu$ is the
negative binomial
$$\Pr(w - \nu = v) = \binom{v+\nu-1}{v} p^{\nu} (1-p)^{v} \qquad (v = 0, 1, \dots;\ 0 < p \le 1), \tag{3}$$
denoted $w - \nu \sim \mathrm{Nbin}(\nu, p)$ with $w - \nu$ the number of failures. The case of $\nu = 1$
success gives rise to the geometric variate (alternatively, $\nu$ independent draws
from the geometric give the negative binomial). For the binomial, $E(z) = np$.
For the negative binomial, $E(w - \nu) = \nu p^{-1} - \nu$ hence $E(w) = \nu p^{-1}$.
The adjective “negative” in (3) is because
$$\binom{v+\nu-1}{v} = \frac{(v+\nu-1)(v+\nu-2)\cdots(\nu+1)(\nu)}{v!} = (-1)^v \frac{(-\nu)(-\nu-1)\cdots(-\nu-v+2)(-\nu-v+1)}{v!} = (-1)^v \binom{-\nu}{v}, \tag{4}$$
where $-\nu < 0$.
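As a quick numerical check of (2) and (3), the following minimal Python sketch evaluates both p.d.f.s and confirms the means quoted above. The function names and the parameter values ($n = 10$, $\nu = 3$, $p = 0.6$) are illustrative assumptions, not part of the lecture; the infinite sum for the negative binomial is truncated at 500 terms.

```python
from math import comb

def binomial_pmf(v, n, p):
    """Pr(z = v) for z ~ Bin(n, p), as in (2)."""
    return comb(n, v) * p**v * (1 - p)**(n - v)

def negative_binomial_pmf(v, nu, p):
    """Pr(v failures occur before the nu-th success), as in (3)."""
    return comb(v + nu - 1, v) * p**nu * (1 - p)**v

# Hypothetical parameter values chosen only for illustration.
n, nu, p = 10, 3, 0.6

binomial_total = sum(binomial_pmf(v, n, p) for v in range(n + 1))
binomial_mean = sum(v * binomial_pmf(v, n, p) for v in range(n + 1))

# The support is infinite; 500 terms leave a negligible remainder here.
negbin_total = sum(negative_binomial_pmf(v, nu, p) for v in range(500))
negbin_mean = sum(v * negative_binomial_pmf(v, nu, p) for v in range(500))

print(binomial_total, binomial_mean)   # compare with 1 and n*p
print(negbin_total, negbin_mean)       # compare with 1 and nu/p - nu
```

Setting `nu = 1` in `negative_binomial_pmf` gives the geometric case mentioned above.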
We assumed the sample is i.i.d., which is rarely the case for data in practice.
If there is some autocorrelation or other dynamic pattern, it is common to
remove it before analyzing the remainder. We have done this when we presented
Johansen’s procedure for estimating VARs (eliminating the short-run dynamics
by two regressions before conducting the long-run cointegration analysis).
Here, e.g., if returns follow an AR(1) (this is checkable as in Lectures 3-4)
$$y_t = \rho y_{t-1} + \varepsilon_t,$$
we should estimate $\rho$ by $\hat\rho$ (e.g. LS estimator), calculate
$$\hat\varepsilon_t = y_t - \hat\rho\, y_{t-1},$$
and conduct the analysis above on $\hat\varepsilon_t$.
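The filtering step just described can be sketched as follows. This is only an illustration under stated assumptions: the AR(1) has no intercept, the data are simulated with a hypothetical coefficient of 0.5 and standard normal errors, and the function name `ar1_residuals` is made up for this example.

```python
import random

def ar1_residuals(y):
    """LS estimate of rho in y_t = rho * y_{t-1} + e_t (no intercept),
    together with the filtered residuals e_t = y_t - rho_hat * y_{t-1}."""
    num = sum(y[t] * y[t - 1] for t in range(1, len(y)))
    den = sum(y[t - 1] ** 2 for t in range(1, len(y)))
    rho_hat = num / den
    resid = [y[t] - rho_hat * y[t - 1] for t in range(1, len(y))]
    return rho_hat, resid

# Simulated series (hypothetical): rho = 0.5, standard normal errors.
random.seed(1)
rho = 0.5
y = [0.0]
for _ in range(5000):
    y.append(rho * y[-1] + random.gauss(0.0, 1.0))

rho_hat, resid = ar1_residuals(y)
print(rho_hat)   # should be close to 0.5
```

The residuals `resid` are then the series on which the i.i.d.-based analysis would be conducted.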
We now examine the logical progression from the simplest (and flawed) model to
a better one, then to the best one. We are only able to tackle estimation in Section C.
A.1. The linear probability model
Let us take the simplest Bernoulli case of $\Pr(y = 1) = p$ and $\Pr(y = 0) = 1 - p$.
Note that $p \in [0, 1]$ but $y$ can only be 0 or 1.
The linear probability model specifies a continuous model for $p$ (not $y$ which is
discrete): $p = x'\beta$; e.g. the probability of default $p$ may depend on a number
of factors $x$ like the characteristics of the firm and the state of the economy.
From the Bernoulli distribution, $E(y) = p$ and $\operatorname{var}(y) = p(1 - p)$. Hence,
$$E(y \mid x) = x'\beta \quad\text{and}\quad \operatorname{var}(y \mid x) = x'\beta\,(1 - x'\beta). \tag{5}$$
As in Lecture 1,
$$y \equiv E(y \mid x) + (y - E(y \mid x)) \equiv E(y \mid x) + u,$$
where $E(u \mid x) = 0$, and the first result of (5) implies
$$y = x'\beta + u. \tag{6}$$
We have $E(u \mid x) = 0$, but $\operatorname{var}(u \mid x) = \operatorname{var}(y \mid x) = x'\beta\,(1 - x'\beta)$ from the
second result in (5), so the residual in (6) is not homoskedastic (as you have
guessed from the notation $u$ instead of $\varepsilon$) because its variance depends on $x$.
More importantly, (6) is flawed in a fundamental way. The functional form
$p = x'\beta$ gives us no guarantee that $p \in [0, 1]$, since a linear relation implies
that $p$ can go off to $\pm\infty$! Also, $\operatorname{var}(y \mid x)$ can become negative!
One solution is to take $p = F(x'\beta)$, where $F$ is some distribution function.
Since $F(-\infty) = 0$ and $F$ increases monotonically to $F(\infty) = 1$, such a
transformation would guarantee that $p \in [0, 1]$; see Figure 1.





Figure 1: The transformation of $x'\beta$ into $p = F(x'\beta)$ (horizontal axis: $x'\beta$ from $-4$ to $4$).
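To see the flaw of the linear probability model numerically, the short sketch below compares the linear form with a distribution-function transformation over the range shown in Figure 1. The logistic c.d.f. is used as the example transformation, and the index values are hypothetical.

```python
import math

def linear_probability(index):
    """Linear probability model: p = x'beta, with nothing bounding p."""
    return index

def logistic_probability(index):
    """Transformation p = F(x'beta) with F the logistic c.d.f."""
    return 1.0 / (1.0 + math.exp(-index))

indices = [-4.0, -2.0, 0.0, 2.0, 4.0]   # hypothetical values of x'beta

linear_p = [linear_probability(u) for u in indices]
logistic_p = [logistic_probability(u) for u in indices]

# Implied Bernoulli variances p(1 - p): negative whenever p leaves [0, 1].
linear_var = [p * (1.0 - p) for p in linear_p]
logistic_var = [p * (1.0 - p) for p in logistic_p]

for u, pl, pf in zip(indices, linear_p, logistic_p):
    print(u, pl, round(pf, 4))
```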

A.2. Probit and logit models

One possible choice for this $F$ transformation in $p = F(x'\beta)$ is the normal c.d.f.
$$F(u) = \Phi(u) \equiv \int_{-\infty}^{u} \phi(t)\, dt \equiv \int_{-\infty}^{u} \frac{\exp(-t^2/2)}{\sqrt{2\pi}}\, dt, \tag{7}$$
giving the probit model $\Phi^{-1}(p) = x'\beta$, where $\Phi^{-1}(p)$ is regressed linearly on
$x$ since $\Phi^{-1}(p) \in (-\infty, \infty)$.
This choice of $F$ is inconvenient, because of the integral in (7) and the lack of
an explicit formula for $\Phi^{-1}(p)$. It is also not justified by a statistical argument.
An alternative choice is the logistic c.d.f.
$$F(u) = \frac{1}{1 + \exp(-u)}, \tag{8}$$
giving rise to the logit model $F^{-1}(p) = x'\beta$ where $F^{-1}(p) = \log(p/(1 - p))$.
Linking $\log(p/(1 - p))$ to $x$ is called the logistic regression model.
The c.d.f. in (8) is the one that was plotted in Figure 1, and it is very close to
$\Phi$ when the variance is adjusted to be 1; the variance in (8) is $\pi^2/3 \approx 3.290$.
In addition to the explicit formula that involves no integrals in (8), the logistic
is justified within our Bernoulli setup; see Exercise 5.
In both cases, the marginal effects of a change in $x$ on $p \equiv F(x'\beta)$ are
$$\frac{\partial p}{\partial x} = \frac{dF(x'\beta)}{d(x'\beta)}\,\frac{\partial (x'\beta)}{\partial x} = f(x'\beta)\,\beta,$$
where the density $f$ corresponds to the distribution $F$ (assumed continuous).
We have used the chain rule, and the differentiation of linear forms w.r.t.
vectors; see Remark 1 (at the end) for a generalization.
Note that the derivatives depend on $x$, as well as $\beta$, unlike in the usual
regression models. Also, $\partial p/\partial x_j = f(x'\beta)\,\beta_j$ depends on all the $x$’s.
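A minimal sketch of these marginal effects for the logit and probit cases follows. The helper names and the example values of $x$ and $\beta$ are hypothetical; the logistic density is computed from the identity $f(u) = F(u)(1 - F(u))$, which holds for the c.d.f. in (8).

```python
import math
from statistics import NormalDist

def index_value(x, beta):
    """The linear index x'beta."""
    return sum(xj * bj for xj, bj in zip(x, beta))

def logit_marginal_effects(x, beta):
    """f(x'beta) * beta with f the logistic density F(1 - F)."""
    p = 1.0 / (1.0 + math.exp(-index_value(x, beta)))
    density = p * (1.0 - p)
    return [density * bj for bj in beta]

def probit_marginal_effects(x, beta):
    """f(x'beta) * beta with f the standard normal density."""
    density = NormalDist().pdf(index_value(x, beta))
    return [density * bj for bj in beta]

# Hypothetical regressors and coefficients, chosen so that x'beta = 0.
x = [1.0, 0.0]
beta = [0.0, 2.0]
logit_me = logit_marginal_effects(x, beta)
probit_me = probit_marginal_effects(x, beta)
print(logit_me, probit_me)
```

Re-evaluating the functions at a different `x` changes every element of the result, which is the dependence on all the regressors noted above.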
So far, the errors $u$ have not been i.i.d., hence not ideal for estimation.
Even by fixing the nonlinearity problem and choosing $p = F(x'\beta)$, we still
have a remaining problem if we adopted a formulation like
$$y = p + u.$$
This is because $y$ can only be 0 or 1, while $p$ can be anything in $[0, 1]$ such as
0.011 etc., and the error $u$ has to make up the difference. The dependence of
$u$ on $p$ would violate ideal conditions (more on this later).
We need an alternative formulation of the model where the errors can be i.i.d.
and not dependent on $x'\beta$; the third (final) model of this section.
The model we will see here bears some similarity to discriminant analysis in
statistics, whereby observations are classified into whether they truly belong
to the case of $y = 0$ or $y = 1$, based on some discriminant function exceeding
a threshold or not.
The final formulation of our problem is in terms of an unobservable (or latent)
variate $y^\dagger$ defined by
$$y^\dagger \equiv x'\beta + \varepsilon,$$
where $\varepsilon$ (notice change of notation) has distribution $F$ (could be normal,
logistic,...) and density $f$. For a given $x$, we have $x'\beta = \mu$ (a constant) and $y^\dagger$
inherits the randomness of $\varepsilon$ by shifting the location of $\varepsilon$ by this constant $\mu$.
We observe
$$y = \begin{cases} 0 & \text{if } y^\dagger \le c_0 \\ 1 & \text{if } y^\dagger > c_0 \end{cases}$$
where $c_0$ is some threshold. Think of $y^\dagger$ as a score combining a firm’s
characteristics (cash flow, leverage,...) and other factors into a single number; e.g.
Altman’s Z-score for bankruptcy. The convention is to take $c_0 = 0$, since we can
redefine the unobservable as $y^\ddagger \equiv y^\dagger - c_0$ and proceed with $y^\ddagger$ instead of $y^\dagger$.
Then, $p \equiv \Pr(y = 1)$ is
$$p \equiv \Pr(y^\dagger > 0) = \Pr(\varepsilon > -x'\beta) = 1 - \Pr(\varepsilon \le -x'\beta) \equiv 1 - F(-x'\beta)$$
and $\Pr(y = 0) \equiv 1 - p = F(-x'\beta)$. (If $f$ is symmetric around 0, then
$\Pr(\varepsilon > -x'\beta) = \Pr(\varepsilon < x'\beta)$ and we get the simplification $p = F(x'\beta)$.)
This formulation in terms of latent variates is preferred because we can take
an i.i.d. sequence of $\varepsilon$’s to facilitate estimation, as we shall see in Section C.
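The latent-variate formulation can be checked by simulation. The sketch below, written under the assumption of logistic errors and a hypothetical index value of 0.8, draws i.i.d. errors, applies the zero threshold, and compares the share of ones with the c.d.f. evaluated at the index (the symmetric-density simplification).

```python
import math
import random

def logistic_cdf(u):
    return 1.0 / (1.0 + math.exp(-u))

def draw_logistic(rng):
    """Inverse-c.d.f. draw from the standard logistic distribution."""
    v = rng.random()
    while v == 0.0:            # rng.random() lies in [0, 1); avoid log(0)
        v = rng.random()
    return math.log(v / (1.0 - v))

def simulate_binary(index, n_draws, rng):
    """Share of draws with y = 1, where y = 1 if index + eps > 0."""
    hits = sum(index + draw_logistic(rng) > 0.0 for _ in range(n_draws))
    return hits / n_draws

rng = random.Random(42)
index = 0.8                                   # hypothetical value of x'beta
simulated_p = simulate_binary(index, 50_000, rng)
theoretical_p = logistic_cdf(index)           # F(x'beta), by symmetry of f
print(simulated_p, theoretical_p)
```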
These models for dichotomous variates have been generalized to include more
than two outcomes. The case that needs special treatment is when these
outcomes have a ranking (e.g. bond rating categories) denoted by $y = 0, 1, \dots, m$.
Then, given $x$, the ordered probit or logit is obtained from the classification
$$\begin{aligned}
\Pr(y = 0) &= \Pr(y^\dagger \le 0) = \Pr(x'\beta + \varepsilon \le 0) = F(-x'\beta) \\
\Pr(y = 1) &= \Pr(0 < y^\dagger \le c_1) = \Pr(0 < x'\beta + \varepsilon \le c_1) = F(c_1 - x'\beta) - F(-x'\beta) \\
\Pr(y = 2) &= \Pr(c_1 < y^\dagger \le c_2) = \Pr(c_1 < x'\beta + \varepsilon \le c_2) = F(c_2 - x'\beta) - F(c_1 - x'\beta) \\
&\ \ \vdots \\
\Pr(y = m) &= \Pr(c_{m-1} < y^\dagger) = \Pr(c_{m-1} < x'\beta + \varepsilon) = 1 - F(c_{m-1} - x'\beta)
\end{aligned} \tag{9}$$
where $0 < c_1 < \cdots < c_{m-1}$ are the boundaries for the classification into the
$m + 1$ groups and we use
$$\Pr(a < \varepsilon \le b) \equiv \Pr(\varepsilon \le b) - \Pr(\varepsilon \le a) \equiv F(b) - F(a)$$
for $a < b$. Note that the probabilities in (9) add up to 100%, as usual.
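The classification in (9) translates directly into code. The following sketch is an illustration only: the function name, the index value 0.3 and the boundaries are hypothetical, and the normal c.d.f. (ordered probit) is used as the example choice of distribution.

```python
from statistics import NormalDist

def ordered_probabilities(index, cutoffs, cdf):
    """Return [Pr(y=0), ..., Pr(y=m)] as in (9).

    cutoffs = [0, c_1, ..., c_{m-1}] must be increasing; cdf is F.
    """
    cumulative = [cdf(c - index) for c in cutoffs]   # Pr(latent <= c_j)
    probs = [cumulative[0]]
    probs += [hi - lo for lo, hi in zip(cumulative, cumulative[1:])]
    probs.append(1.0 - cumulative[-1])
    return probs

# Hypothetical example: m + 1 = 4 ordered categories.
probs = ordered_probabilities(0.3, [0.0, 1.0, 2.5], NormalDist().cdf)
print(probs, sum(probs))
```

Passing a logistic c.d.f. as `cdf` gives the ordered logit instead.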
B. Limited LHS variates
We now move to the case of limited LHS variates, i.e. ones that can be allowed
to be continuous but have a limited range. We consider two types.
B.1. Truncated variates
Truncated variates are ones that are sampled only if they are above (resp.
below) some chosen threshold, $y > c$ (resp. $y < c$); e.g. taking a sample that
included only taxpayers. Suppose $y_i = x_i'\beta + \varepsilon_i$, where $\{\varepsilon_i\}_{i=1}^{n}$ are i.i.d. with
common density $f$ (and distribution $F$) having mean 0. Then, we would only
have observations $y_i > c$, arising from the shaded region of $f$ in Figure 2.

Figure 2: Observations drawn from the values to the right of $\varepsilon = c - x'\beta$.

Hence $\varepsilon > c - x'\beta$ and the truncated density of $\varepsilon$ defined over $u > c - x'\beta$ is
$$g(u) \equiv f(u \mid \varepsilon > c - x'\beta) = \frac{f(u)}{\Pr(\varepsilon > c - x'\beta)} = \frac{f(u)}{1 - F(c - x'\beta)}, \tag{10}$$
to rescale the shaded area so that the new density $g$ is proper (integrates to 1).
The hazard rate, seen in Lecture 1, is this function evaluated at the truncation
point $u = c - x'\beta$.
From the figure, it is obvious that $E(\varepsilon \mid \varepsilon > c - x'\beta) > 0$ if $E(\varepsilon) = 0$.
Furthermore, $x'\beta$ and $\varepsilon$ (i.e. regressor and error) are now negatively correlated,
so the usual OLS estimator of $\beta$ (i.e. $\hat\beta = (X'X)^{-1}X'y$) is biased: $E(\hat\beta) \ne \beta$.

The truncated $\varepsilon$ is also less variable than the original $\varepsilon$, since one of the tails
has been cut off!
We need alternatives to OLS...
Two general estimation methods
Let $x$ be a $k \times 1$ variate whose density $f_x$ is determined by the $m \times 1$ vector
of unknown parameters $\theta$. We wish to estimate $\theta$ from a random sample
$\{x_i\}_{i=1}^{n}$ which we collate in the $n \times k$ matrix $X \equiv (x_1, \dots, x_n)'$. For example,
when $k = 1$, the density may be a normal with unknown mean and variance,
$f_x(u) = \exp(-(u - \mu)^2/(2\sigma^2))/\sqrt{2\pi\sigma^2}$, with $\theta \equiv (\mu, \sigma^2)'$.
What value of $\theta$ is most likely to have generated these data $\{x_i\}_{i=1}^{n}$? The
solution to this question is known as the maximum-likelihood estimator (MLE).
Once we have sampled $x$, the only remaining unknowns in the sample’s density
are $\theta$. To stress this state, we change notation and consider the joint density
of $x_1, \dots, x_n$ as a function of $\theta$, writing the likelihood function of $\theta$ as
$$L(\theta) \equiv f_x(x_1) \times \cdots \times f_x(x_n)$$
by the independence of $x_1, \dots, x_n$. Alternative notation includes $L(\theta \mid X)$,
to stress that we take the data $X$ as given (we condition on $X$). Note also
that the arguments of $f_x(\cdot)$ are the actual data $x_i$ (not some hypothetical $u$).
The MLE of the parameters is the vector $\hat\theta$ that maximizes $L(\theta)$.
To illustrate, consider the sample $\{x_i\}_{i=1}^{n} \sim \mathrm{IN}(\mu, 1)$. Here, there is only one
unknown parameter, $\theta \equiv \mu$, with likelihood
$$L(\mu) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}} \exp\left(-\tfrac{1}{2}(x_i - \mu)^2\right) = \frac{\exp\left(-\tfrac{1}{2}\sum_{i=1}^{n}(x_i - \mu)^2\right)}{(2\pi)^{n/2}}.$$
The value of $\mu$ that maximizes this function is the MLE, and is denoted by $\hat\mu$.

Figure 3: Likelihood function for the mean ($\theta = \mu$) of a normal random sample.

The function $L(\mu)$ being differentiable, we can obtain $\hat\mu$ by solving for the
derivative being zero, as we can see in Figure 3. Then,
$$\frac{dL(\mu)}{d\mu} = \frac{\exp\left(-\tfrac{1}{2}\sum_{i=1}^{n}(x_i - \mu)^2\right)}{(2\pi)^{n/2}} \sum_{i=1}^{n}(x_i - \mu)$$
implies $0 = \sum_{i=1}^{n}(x_i - \hat\mu)$, i.e.
$$\sum_{i=1}^{n} x_i = \sum_{i=1}^{n} \hat\mu \equiv n\hat\mu \quad\Rightarrow\quad \hat\mu = \frac{1}{n}\sum_{i=1}^{n} x_i,$$
hence $\hat\mu = \bar{x}$: the MLE of $\mu$ in this example is the usual sample mean $\bar{x}$.
(Check that this is a maximum, rather than a minimum, by considering the
sign of the second derivative of $L(\mu)$ evaluated at $\hat\mu$.)
The maxima of $L(\theta)$ and $\ell(\theta) \equiv \log(L(\theta))$ are achieved at the same $\hat\theta$, since
log is an increasing function of its argument. For our example,
$$\ell(\mu) = -\frac{n}{2}\log(2\pi) - \frac{1}{2}\sum_{i=1}^{n}(x_i - \mu)^2, \tag{11}$$
where we recognize the sum-of-squares criterion $\sum_{i=1}^{n}(x_i - \mu)^2$ whose
minimization gave rise to LS. Optimizing $\ell$ gives
$$\frac{d\ell(\mu)}{d\mu} = \sum_{i=1}^{n}(x_i - \mu) \quad\Rightarrow\quad \sum_{i=1}^{n}(x_i - \hat\mu) = 0 \iff \hat\mu = \bar{x}, \tag{12}$$
the second derivative being $-n < 0$ (easy now!) confirming a maximum at $\hat\mu$.
The next example illustrates that:
• the maximum of the likelihood may not be solvable by differentiation;
• the MLE of $\mu$ is not always $\bar{x}$ (otherwise, LS and ML would be the same!).
Let $\{x_i\}_{i=1}^{n}$ be a random sample from a uniform density on $[0, \alpha]$, equal to $1/\alpha$
for $u \in [0, \alpha]$ and to 0 otherwise. Then,
$$L(\alpha) = \prod_{i=1}^{n} \frac{1}{\alpha} = \frac{1}{\alpha^n}$$
for $\max_i\{x_i\} \le \alpha$ (since all the $x_i$’s satisfy $x_i \in [0, \alpha]$ and $\alpha$ cannot be smaller
than $\max_i\{x_i\}$), and $L(\alpha) = 0$ for $\alpha < \max_i\{x_i\}$. This $L(\alpha)$ is a diminishing
function of $\alpha$ for $\alpha \ge \max_i\{x_i\}$, so we maximize it by choosing $\hat\alpha = \max_i\{x_i\}$.
There is another type of estimator, called a method-of-moments (MoM) estimator.
To estimate the unknown parameters of the population, it equates sample
moments to population moments.
Here, the population mean is $\alpha/2$ and the sample mean is the usual $\bar{x}$, i.e. $2\bar{x}$
is the MoM estimator of $\alpha$. This estimator is inefficient, and has a higher MSE
than the MLE for any $n > 2$ (they have the same MSE for $n = 1, 2$).

• In this example, we did not differentiate the likelihood because the maximum
was achieved as a corner solution where the derivative is nonzero: the
maximum was at the edge of the domain of definition of $L(\alpha)$.
• In this example, the largest observation is a much more efficient
(lower-variance) estimator, and it results in a lower MSE as well. This result
makes sense: the largest observation indicates the most likely value of the
unknown upper bound $\alpha$ of the distribution.
[Think of $x$ as the grade that some teacher gives to his students. His marking
is such that all grades are equally probable, but he is stingy and does not give
100% even if the answers are all correct! You can collect grades from this
teacher’s courses, and discover his upper bound $\alpha$ more efficiently from the
maximum grade than from the mean (quite literally!) grade. Also, in this
case, MoM can generate an estimate of $\alpha$ bigger than 100% (if $\bar{x} > 50\%$)!]
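The MSE comparison between the two estimators of the uniform's upper bound can be illustrated by a small Monte Carlo experiment. This is a sketch under assumed settings (upper bound 1, samples of size 10, 20,000 replications); the closed-form benchmarks in the comments, $2\alpha^2/((n+1)(n+2))$ for the maximum and $\alpha^2/(3n)$ for twice the mean, are standard results for the uniform that are not derived in the text.

```python
import random

def mse_of_estimators(alpha, n, replications, rng):
    """Monte Carlo MSE of the MLE (sample maximum) and of the MoM
    estimator (twice the sample mean) of the upper bound alpha."""
    sq_err_mle = 0.0
    sq_err_mom = 0.0
    for _ in range(replications):
        sample = [rng.uniform(0.0, alpha) for _ in range(n)]
        mle = max(sample)
        mom = 2.0 * sum(sample) / n
        sq_err_mle += (mle - alpha) ** 2
        sq_err_mom += (mom - alpha) ** 2
    return sq_err_mle / replications, sq_err_mom / replications

rng = random.Random(0)
alpha, n = 1.0, 10                     # hypothetical settings
mse_mle, mse_mom = mse_of_estimators(alpha, n, 20_000, rng)

# Benchmarks (standard results, assumed here rather than derived above).
theory_mle = 2.0 * alpha**2 / ((n + 1) * (n + 2))
theory_mom = alpha**2 / (3.0 * n)
print(mse_mle, theory_mle)
print(mse_mom, theory_mom)
```

The two benchmark formulas coincide at $n = 1$ and $n = 2$, in line with the statement above.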
Going back to our truncated variate, we know that OLS is flawed, but we now
have ML and MoM.
The correction of OLS problems is by using the relevant density $g$ from (10):
the MLE is obtained by maximizing w.r.t. $\beta$ the likelihood (e.g. with $f = \phi$)
$$L(\beta) = \prod_{i=1}^{n} g(\varepsilon_i) \equiv \prod_{i=1}^{n} g(y_i - x_i'\beta) \equiv \frac{f(y_1 - x_1'\beta)}{1 - F(c - x_1'\beta)} \times \cdots \times \frac{f(y_n - x_n'\beta)}{1 - F(c - x_n'\beta)}$$
instead of maximizing the usual $\prod_{i=1}^{n} f(\varepsilon_i)$ (only the numerator here) which
applies when there is no truncation.
We will use this method of estimation in Section C. For now, we illustrate the
simpler (but less efficient) MoM.
Consider the example of the normal distribution and take $x \equiv 1$ and $\beta \equiv \mu$.
Then, $y \equiv \mu + \varepsilon \sim N(\mu, \sigma^2)$. If we truncate when $y < c$, we have
$$E(y \mid y > c) = \mu + \sigma\lambda \quad\text{and}\quad \operatorname{var}(y \mid y > c) = \sigma^2\left(1 - (\lambda - \alpha)\lambda\right), \tag{13}$$
where $\alpha \equiv (c - \mu)/\sigma$ is the standardized truncation point (or quantile) and
$\lambda \equiv \phi(\alpha)/(1 - \Phi(\alpha)) \equiv \phi(\alpha)/\Pr(y > c)$ is the normal’s hazard rate (in general,
the inverse of the hazard rate, $(1 - F(u))/f(u)$, is known as Mills’ ratio).
The first result in (13) gives a relation between the mean of the truncated
distribution and the original $N(\mu, \sigma^2)$. We can exploit it to estimate the original
from the truncated distribution, when we only observe the truncated variate.
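The two moments in (13) can be verified by simulation. The sketch below assumes hypothetical values for the mean, standard deviation and truncation point (1, 2 and 2 respectively), computes the formulas, and compares them with the sample moments of normal draws that survive the truncation.

```python
import random
from statistics import NormalDist, fmean, pvariance

def truncated_normal_moments(mu, sigma, c):
    """Mean and variance of y ~ N(mu, sigma^2) given y > c, as in (13)."""
    std = NormalDist()
    alpha = (c - mu) / sigma                        # standardized truncation point
    lam = std.pdf(alpha) / (1.0 - std.cdf(alpha))   # normal hazard rate
    mean = mu + sigma * lam
    variance = sigma**2 * (1.0 - (lam - alpha) * lam)
    return mean, variance

# Hypothetical parameter values for the check.
mu, sigma, c = 1.0, 2.0, 2.0
formula_mean, formula_var = truncated_normal_moments(mu, sigma, c)

rng = random.Random(7)
draws = (rng.gauss(mu, sigma) for _ in range(200_000))
kept = [y for y in draws if y > c]                  # only these would be observed

print(formula_mean, fmean(kept))
print(formula_var, pvariance(kept))
```

The truncated variance is below the original one, as noted earlier for the truncated error.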
William Greene illustrates. A 1987 newspaper ‘survey’ of affluent Americans
(earning above $100K, the top 2%) says that their typical income is $142K.
Can we infer the average income of all the population from only the top 2%?
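Incomes often follow a log-normal distribution, so we will use $y \equiv \log(\text{income})$.
We are told that $\Pr(y < \log 100) = 0.98$. Corresponding to it, the quantile
$\Phi^{-1}(0.98) \approx 2.054$ yields
$$\alpha \equiv \frac{(\log 100) - \mu}{\sigma} \approx 2.054 \quad\Rightarrow\quad \mu \approx (\log 100) - 2.054\,\sigma$$
and
$$\lambda \equiv \frac{\phi(\alpha)}{\Pr(y > \log 100)} \approx \frac{\phi(2.054)}{0.02} = \frac{\exp\left(-\tfrac{1}{2}(2.054)^2\right)/\sqrt{2\pi}}{0.02} \approx \frac{0.0484}{0.02} \approx 2.42. \tag{14}$$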
The ‘survey’ revealed that $E(y \mid y > \log 100) \approx \log 142$ and we can plug this
number and our approximations of $\mu$ and $\lambda$ from (14) into $E(y \mid y > c) = \mu + \sigma\lambda$
of (13) to get $\log 142 \approx [(\log 100) - 2.054\,\sigma] + 2.42\,\sigma \Rightarrow \sigma \approx 0.958$, hence
$\mu \approx (\log 100) - 2.054\,(0.958) \approx 2.637$.
Therefore, $\log(\text{income}) \sim N(2.637, (0.958)^2)$. But what is $E(\text{income})$? Recall
Jensen’s inequality from Lecture 5 and apply it here as $E(\log z) < \log E(z)$
with $z \equiv \text{income}$.
Exercise 7 proves that the log-normal density has mean $e^{\mu + \sigma^2/2}$, giving us an
estimate of the (untruncated) mean income of the whole population as
$$\exp\left(2.637 + (0.958)^2/2\right) \approx 22$$
based on adjusting the newspaper ‘survey’. The actual figure for 1987 is $25K,
so the adjustment for such an extreme truncation still works quite well!
Amazingly, we are able to predict something also for the 98% of the population who
were not even sampled.

Note that the mean is $e^{\mu} e^{\sigma^2/2}$ where the factor $e^{\sigma^2/2} > 1$ is due to Jensen’s
inequality. This $\sigma^2/2$ term is something you will also encounter in Itô calculus,
in the calculation of stochastic discount factors in asset pricing, and in many
other applications in finance.
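The arithmetic of this income example can be reproduced in a few lines. The sketch below uses only the three figures quoted from the ‘survey’ (the 98% share, the $100K cutoff and the $142K typical income, all in thousands of dollars) and the variable names are illustrative. Because it uses the unrounded quantile rather than 2.054 and 2.42, the intermediate values differ slightly in the third decimal from those in the text.

```python
import math
from statistics import NormalDist

std = NormalDist()

share_below = 0.98                  # Pr(income < 100K), from the 'survey'
cutoff = math.log(100.0)            # log of the truncation point, in $K
truncated_mean = math.log(142.0)    # E(y | y > log 100), as reported

alpha = std.inv_cdf(share_below)                 # standardized truncation point
lam = std.pdf(alpha) / (1.0 - share_below)       # hazard rate, as in (14)

# Solve the first result of (13), with mu = cutoff - alpha * sigma, for sigma.
sigma = (truncated_mean - cutoff) / (lam - alpha)
mu = cutoff - alpha * sigma

mean_income = math.exp(mu + sigma**2 / 2.0)      # log-normal mean, in $K
print(round(alpha, 3), round(lam, 2), round(sigma, 3), round(mu, 3))
print(round(mean_income, 1))
```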
B.2. Censored variates

Suppose that, instead of truncation, we have the following situation. A
consumer or investor decides to buy £$y$ worth of an asset if the conditions are
right, but 0 units otherwise. A formalization can take the following form
$$y = \begin{cases} x'\beta + \varepsilon & (\text{if } x'\beta + \varepsilon > 0) \\ 0 & (\text{otherwise}) \end{cases}$$
where $\varepsilon$ has density $f$ and distribution $F$.
Think of $x'\beta$ as the purchasing power of this person. It depends on a host of
factors, like income, wealth, access to credit, etc. If it exceeds a certain level,
then purchases are made; but not otherwise.
Unlike before, we now have observations $y$ when $\varepsilon \le -x'\beta$, which is good:
the corresponding combinations of $x$’s tell us something about the relation
that leads to $y = 0$. However, all these $y$’s are reported to be zero and are said
to be censored: we have no idea how far $x'\beta + \varepsilon$ was below zero, so again we
cannot do the usual OLS analysis.
For convenience, relabel the first $m < n$ observations to be the ones that
are not censored. Then, in a random sample, these have the joint density
(continuous part)
$$\prod_{i=1}^{m} f_{y_i}(y_i) = \prod_{i=1}^{m} f(y_i - x_i'\beta). \tag{15}$$
For the remaining $n - m$ observations, we have the joint density (discrete part)
$$\prod_{i=m+1}^{n} \Pr(y_i = 0) = \prod_{i=m+1}^{n} \Pr(\varepsilon_i \le -x_i'\beta) \equiv \prod_{i=m+1}^{n} F(-x_i'\beta). \tag{16}$$
The joint density of all observations is
$$L(\beta) \equiv \left(\prod_{i=1}^{m} f(y_i - x_i'\beta)\right) \left(\prod_{i=m+1}^{n} F(-x_i'\beta)\right). \tag{17}$$
Optimizing (17) w.r.t. $\beta$ yields the MLE: $y_i$ and $x_i$ are the known data, $F$ is
the chosen normal, logistic, etc., and we search for the $\beta$ that maximizes this
function $L(\beta)$.
(Don’t use OLS: same bias problem as before, and we shouldn’t get $\hat{y} < 0$.)
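A minimal sketch of the logarithm of (17) follows, assuming normal errors with a known standard deviation (an assumption made here for illustration; the text leaves the distribution open). The data are simulated with hypothetical coefficients, and the function does not need the observations to be relabelled because it checks each one for censoring.

```python
import math
import random
from statistics import NormalDist

def censored_loglik(beta, sigma, X, y):
    """Log of (17) with eps ~ N(0, sigma^2); y_i = 0 marks censoring."""
    eps_dist = NormalDist(0.0, sigma)
    total = 0.0
    for xi, yi in zip(X, y):
        index = sum(a * b for a, b in zip(xi, beta))
        if yi > 0.0:
            total += math.log(eps_dist.pdf(yi - index))   # continuous part (15)
        else:
            total += math.log(eps_dist.cdf(-index))       # discrete part (16)
    return total

# Simulated censored data (hypothetical intercept 0.5 and slope 1.0).
rng = random.Random(3)
true_beta = [0.5, 1.0]
X = [[1.0, rng.uniform(-1.0, 2.0)] for _ in range(2000)]
y = [max(0.0, sum(a * b for a, b in zip(xi, true_beta)) + rng.gauss(0.0, 1.0))
     for xi in X]

ll_true = censored_loglik(true_beta, 1.0, X, y)
ll_wrong = censored_loglik([0.5, 0.5], 1.0, X, y)   # attenuated slope
print(ll_true, ll_wrong)
```

In practice the function would be handed to a numerical optimizer to search over the coefficients; here it is only evaluated at two candidate values.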
C. Further estimation issues

The previous slide has implicitly solved also the estimation of the models in
Section A! In the dichotomous case,
$$\Pr(y = 0 \mid x) = F(-x'\beta), \qquad \Pr(y = 1 \mid x) = 1 - F(-x'\beta)$$
as a special case of (9), and the joint density is
$$L(\beta) \equiv \left(\prod_{i=1}^{m} \left(1 - F(-x_i'\beta)\right)\right) \left(\prod_{i=m+1}^{n} F(-x_i'\beta)\right), \tag{18}$$
where the relabelled first $m$ observations are the ones corresponding to $y = 1$.
In the case of more than two outcomes, use the multinomial in (18) instead of
the binomial.
The standard limiting distributions arise for the MLE and tests based on it.
(Apply to LDV.xls.) However, the same is not true of OLS: it performs badly.
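To make the maximization of (18) concrete, here is a small sketch for the logit case with an intercept and one regressor, using Newton-Raphson steps written out by hand. It is an illustration under assumptions, not the procedure used in the lecture's spreadsheet: the data are simulated with hypothetical coefficients (-0.5 and 1.0), and the symmetry of the logistic density is used to write the probability of a one as the c.d.f. evaluated at the index.

```python
import math
import random

def logistic_cdf(u):
    return 1.0 / (1.0 + math.exp(-u))

def logit_loglik(b0, b1, xs, ys):
    """Log of (18) for the logit, using 1 - F(-u) = F(u)."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = logistic_cdf(b0 + b1 * x)
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total

def fit_logit(xs, ys, iterations=25):
    """Newton-Raphson for the intercept b0 and slope b1."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = logistic_cdf(b0 + b1 * x)
            w = p * (1.0 - p)
            g0 += y - p                # score for the intercept
            g1 += (y - p) * x          # score for the slope
            h00 += w                   # elements of minus the Hessian
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Simulated binary data with hypothetical coefficients.
rng = random.Random(5)
xs = [rng.gauss(0.0, 1.0) for _ in range(5000)]
ys = [1 if rng.random() < logistic_cdf(-0.5 + 1.0 * x) else 0 for x in xs]

b0_hat, b1_hat = fit_logit(xs, ys)
print(b0_hat, b1_hat)
```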
A few issues in estimation remain to be addressed. We do so (briefly) now.
First, heteroskedasticity arises frequently in this context. So far, we made the
i.i.d. assumption, and $\varepsilon$ had a given (fixed) variance. This can be generalized
to allow for the variance to change as $x$ changes. In this case, in the previous
formulae we replace $x_i'\beta$ by $x_i'\beta/\sigma_i$ (where the form $\sigma_i = e^{\alpha' x_i} > 0$ is often used).
A second issue that arises is the choice of $F$. Kernel estimators have been used
to reduce the dependence of the results on assumed functional forms for $F$.
The third and final issue we address is about the sample selection bias, also
known as attrition (or survivorship) bias or incidental truncation.
Suppose we formulate a hypothesis that leads us to study the history of an
existing stock. We can’t select a company that has been delisted and/or has
gone bankrupt! This induces a bias, just as we showed with the truncation
problem earlier. The CRSP dataset (available in WRDS) is adjusted for this
type of problem.
D. Technical result and Exercises
Remark 1 (Differentiating quadratic forms) Consider the quadratic form $y \equiv x'Ax$
as a function of the $k \times 1$ vector $x$, and let $A$ be a $k \times k$ symmetric
matrix of constants. Then, the $k \times 1$ vector of derivatives of $y$ with respect to
the elements $x_1, \dots, x_k$ of $x$ is
$$\begin{pmatrix} \partial y/\partial x_1 \\ \vdots \\ \partial y/\partial x_k \end{pmatrix} \equiv \frac{\partial y}{\partial x} \equiv \frac{\partial (x'Ax)}{\partial x} = 2Ax,$$
where the factor 2 comes from differentiating a quadratic function, and $x'$ (not
$x$) drops out from $x'Ax$ in order to make the dimension of the RHS compatible
with the rest (i.e. $Ax$ is $k \times 1$). The $k \times k$ matrix of second derivatives is
$$\frac{\partial^2 y}{\partial x\, \partial x'} \equiv \frac{\partial (2Ax)}{\partial x'} = 2A.$$
Try this for $k = 1$ and $k = 2$ to see it in terms of the elements. When $k = 1$,
differentiate $y = ax^2$. When $k = 2$, differentiate w.r.t. $x_1$ and $x_2$ the function
$$y = (x_1, x_2) \begin{pmatrix} 1 & 2 \\ 2 & 3 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = (x_1, x_2) \begin{pmatrix} x_1 + 2x_2 \\ 2x_1 + 3x_2 \end{pmatrix} = x_1^2 + 2(2)x_1 x_2 + 3x_2^2,$$
then check that the derivatives equal $2Ax$ and the second derivatives equal $2A$.
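The check suggested at the end of Remark 1 can also be done numerically. The sketch below uses the same $2 \times 2$ matrix, a hypothetical evaluation point, and central finite differences (the helper names are made up for this illustration).

```python
def quadratic_form(A, x):
    """Evaluate x'Ax for a square matrix A given as a list of rows."""
    k = len(x)
    return sum(x[i] * A[i][j] * x[j] for i in range(k) for j in range(k))

def numerical_gradient(A, x, step=1e-6):
    """Central finite-difference approximation to the gradient of x'Ax."""
    grad = []
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] += step
        down[i] -= step
        grad.append((quadratic_form(A, up) - quadratic_form(A, down)) / (2.0 * step))
    return grad

A = [[1.0, 2.0], [2.0, 3.0]]      # the symmetric matrix from the remark
x = [0.5, -1.5]                   # hypothetical evaluation point

analytic = [2.0 * sum(A[i][j] * x[j] for j in range(2)) for i in range(2)]  # 2Ax
numeric = numerical_gradient(A, x)
print(analytic, numeric)
```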
Exercise 1 (Binomial representation) A new drug cures patients with
probability $p$. What is the probability that $v$ patients are cured in a trial on a random
group of $n$ patients?
Exercise 2 (If you don’t succeed, try and try again) Consider the following
two stories:
(a) Sarah throws eggs at a bad musician who will give up if and only if three
eggs have hit him. The probability of a successful hit is 0.6. You may assume
that no-one else in the audience has eggs, and that Sarah is the best shot of
them all. Compute the probability that exactly $w$ eggs will be required to stop
the musician from playing. What is the probability that fewer than six eggs will
be required? (Sarah needs to know how many eggs to buy from the shop!)
(b) There are $n$ different types of coupons. Every box sold contains one coupon.
The probability that a box contains coupon $i$ is $1/n$. What is the expected
number of boxes you have to buy so that you possess at least one of every type?
Exercise 3 (The fiddle) Let $p$ denote the probability that a firm fiddles its accounts.
(a) Defining $y = 1$ if it does, and $y = 0$ otherwise, derive the density of $y$ and
name it.
(b) What is the probability that $v$ firms do so in a random group of $n$ firms,
and what is the name of this distribution?
(c) Suppose that $p = 0.1$ and that we keep auditing more firms until we have
found 3 bad ones. Compute the probability that we will need to audit exactly $w$
firms, and identify the resulting distribution. What is the probability that fewer
than six firms will need to be audited?
Exercise 4 (Extremists) An extreme value occurs with probability $p \in [0, 1]$.
Define $y = 1$ if it occurs, and $y = 0$ otherwise. Derive and give the name of:
(a) the density of $y$;
(b) the probability that $v$ of these events will occur in a random sample of size $n$;
(c) the probability that you have to wait for $w$ observations before $\nu$ of these
events occur, if $p \ne 0$.
Exercise 5 (Bernoulli MLE and logit) Let $y_1, \dots, y_n$ be a random sample from
a Bernoulli distribution with parameter $p \in [0, 1]$.
(a) Obtain the MLE of $p$.
(b) Is this estimator unbiased?
(c) What is the log-likelihood of $p$?
Exercise 6 (Signs of marginals in ordered data) Differentiate the first and last
probabilities in (9). What can you say about the opposite signs of these two derivatives?
Exercise 7 (Log-normal moments) Let $x \sim N(\mu, \sigma^2)$ with moment-generating
function $m_x(t) \equiv E(e^{tx}) = \exp(\mu t + \sigma^2 t^2/2)$ for $t \in \mathbb{R}$. Defining the log-normal
$z \equiv e^{x}$ (i.e. $\log(z)$ is normal), prove that $E(z^j) = \exp(\mu j + \sigma^2 j^2/2)$ for all
$j \in \mathbb{N}$.
Exercise 8 (Mean and variance of censored normal, optional exercise) Let
$y \equiv \max\{c, \mu + \varepsilon\}$, where $\varepsilon \sim N(0, \sigma^2)$. Prove that
$$E(y) = c\,\Phi(\alpha) + (1 - \Phi(\alpha))(\mu + \sigma\lambda) \quad\text{and}$$
$$\operatorname{var}(y) = \sigma^2 (1 - \Phi(\alpha)) \left[ 1 - (\lambda - \alpha)\lambda + (\lambda - \alpha)^2 \Phi(\alpha) \right],$$
where $\alpha \equiv (c - \mu)/\sigma$ and $\lambda \equiv \phi(\alpha)/(1 - \Phi(\alpha))$.
Solution 1. Define the random variable $y_i$ which takes the value 1 if the new
drug cures patient $i$ and 0 otherwise. Then $y_i$ follows a Bernoulli distribution
($y_i$ has a binary Yes/No outcome) with parameter $p$, the percentage of patients
cured by the new drug.
We need to derive the distribution of the variate $z \equiv \sum_{i=1}^{n} y_i \in \{0, 1, \dots, n\}$.
Assume that the $y_i$’s are independent, for example because the disease is not
contagious and/or because the sample was randomly selected from different
locations. For any realization $z = v$, there are $\binom{n}{v}$ possible combinations of
patients, and the probability of observing each of these combinations is
$$\left(\prod_{i=1}^{v} p\right) \cdot \left(\prod_{i=v+1}^{n} (1 - p)\right) = p^v (1 - p)^{n - v}$$
by the independence of each patient from the other. Then, $z \sim \mathrm{Bin}(n, p)$. The
binomial is therefore the general distribution of the sum of a repeated Bernoulli trial.
Solution 2. (a) Sarah will require $w$ eggs if 2 out of the previous $w - 1$ eggs hit
their target, and the $w$-th is a hit too (the last one has to be a hit: it finishes
the game!). Defining this joint probability as the product of probabilities of
two independent events, with the probability of one success as $p = 0.6$, we have
$$\Pr(w \text{ throws}) = p \cdot f_{\mathrm{Bin}(w-1,\,p)}(2) = p \binom{w-1}{2} p^2 (1 - p)^{w-1-2} = \binom{w-1}{2} (0.6)^3 (0.4)^{w-3}.$$
Recall the identity $\binom{n}{v} = \binom{n}{n-v}$: the number of ways of choosing $v$ out of
$n$ is identical to the number of ways of excluding $n - v$ from $n$. Here,
$\binom{w-1}{2} = \binom{w-1}{w-3}$ and equation (3) gives the distribution of the random number
of throws in excess of 3, whose realization ($v$) is $w - 3$. It is the negative
binomial $\mathrm{Nbin}(\nu, p)$ where $\nu = 3$. It is the general distribution for sampling
over and above $\nu$, until $\nu$ successes are achieved.
The probability that fewer than 6 eggs will be required by Sarah is
$$\sum_{w=3}^{5} \binom{w-1}{2} (0.6)^3 (0.4)^{w-3} = \sum_{v=0}^{5-3} \binom{v+2}{2} (0.6)^3 (0.4)^{v} \approx 0.683,$$
where $v = 0$ denotes the perfect score of exactly 3 throws. She’d better
improve her aim (practice a few days to change $p$) or buy more eggs to have a
better chance than 68.3%!
(b) The first box gives you one coupon. Let the random variable $w_1$ be the
number of boxes you have to buy in order to get a coupon which is different from
the first one. From (a), the geometric p.d.f. arises for the number of required
attempts in excess of 1, so $w_1 - 1$ is a geometric variate with $p_1 = (n - 1)/n$.
Once you have two different coupons, let $w_2$ be the number of boxes you have
to buy in order to get a coupon which is different from the first two. Then
$w_2 - 1$ is a geometric random variable with $p_2 = (n - 2)/n$.
Proceeding in this way, the number of boxes you need to buy equals $w =
1 + w_1 + w_2 + \cdots + w_{n-1}$. From the mean of a geometric variate (i.e. $\mathrm{Nbin}(1, p)$),
we have $E(w_i) = 1/p_i = n/(n - i)$ and the expected number of boxes you
have to buy is $\sum_{i=0}^{n-1} n/(n - i) = n \sum_{j=1}^{n} 1/j$ by reversing the index $i$ into
$j = n - i$.
Solution 3. This is Exercises 1 and 2a in disguise! Here, we get in the final
part $\sum_{w=3}^{5} \binom{w-1}{2} (0.1)^3 (0.9)^{w-3} \approx 0.00856$.
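Both numerical answers (Sarah's eggs and the audited firms) come from the same formula, so a single helper reproduces them. The function names below are illustrative.

```python
from math import comb

def prob_exactly(w, nu, p):
    """Probability that the nu-th success arrives exactly on trial w."""
    if w < nu:
        return 0.0
    return comb(w - 1, nu - 1) * p**nu * (1 - p)**(w - nu)

def prob_fewer_than(limit, nu, p):
    """Probability that fewer than `limit` trials are needed for nu successes."""
    return sum(prob_exactly(w, nu, p) for w in range(nu, limit))

eggs = prob_fewer_than(6, 3, 0.6)     # Sarah: three hits with p = 0.6
audits = prob_fewer_than(6, 3, 0.1)   # auditing: three bad firms with p = 0.1
print(round(eggs, 3), round(audits, 5))
```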
Solution 4. Guess what?! This is another form of the same story of the
previous exercise.
Solution 5. (a) The likelihood function is given by
$$L(p) = \prod_{i=1}^{n} \left( p^{y_i} (1 - p)^{1 - y_i} \right) = p^{\sum_{i=1}^{n} y_i} (1 - p)^{n - \sum_{i=1}^{n} y_i} = \left( p^{\bar{y}} (1 - p)^{1 - \bar{y}} \right)^{n}$$
for $p \in [0, 1]$, and zero otherwise; and we denote its logarithm by $\ell(p)$. This
function of $p$ is continuous, even though the density of $y$ is discrete. There are
two cases where we get a corner solution and we cannot optimize the function
by differentiation. If $\sum_{i=1}^{n} y_i = 0$ (hence $\bar{y} = 0$), then $L(p) = (1 - p)^n$ is
maximized at $\hat{p} = 0$ hence $\hat{p} = \bar{y}$. If $\sum_{i=1}^{n} y_i = n$ (hence $\bar{y} = 1$), then
$L(p) = p^n$ is maximized at $\hat{p} = 1$ hence $\hat{p} = \bar{y}$ again. For other values of
$\sum_{i=1}^{n} y_i$,
$$\frac{d\ell(p)}{dp} = n\left( \frac{\bar{y}}{p} - \frac{1 - \bar{y}}{1 - p} \right)$$
gives the MLE $\hat{p} = \bar{y}$, the sample mean again. In this third case, we have
$L(0) = 0 = L(1)$ with $L(p) > 0$ in between. Therefore, $L(p)$ has at least one
maximum in $(0, 1)$, and it is the one we have derived: we do not have to check
further the sign of the second-order derivative of $L(p)$ at $\hat{p}$.
(b) By $E(y_1 + y_2) = E(y_1) + E(y_2)$ for any $y_1, y_2$ whose expectations exist,
$$E(\bar{y}) = E\left( \frac{1}{n} \sum_{i=1}^{n} y_i \right) = \frac{1}{n} \sum_{i=1}^{n} E(y_i) = \frac{1}{n} \sum_{i=1}^{n} p = \frac{1}{n}(np) = p,$$
so $E(\bar{y})$ equals the population mean $p$ and $\bar{y}$ is an unbiased estimator of $p$.
(c) The likelihood obtained in (a) gives the log-likelihood
$$\ell(p) = n \log(1 - p) + n\bar{y} \log\left( \frac{p}{1 - p} \right),$$
where the second term is where the data $\bar{y}$ and the parameter $p$ interact.
[Note: Suppose we take $\log(p/(1 - p))$ from the interaction term in $\ell(p)$, and
make it a function of $u$ such as $\log(p/(1 - p)) = u$. Then,
$$\exp(-u) = \frac{1 - p}{p} = \frac{1}{p} - 1 \quad\Rightarrow\quad p = \frac{1}{1 + \exp(-u)}.$$
This function (logistic c.d.f.) maps $u \in \mathbb{R}$ to $p \in [0, 1]$: values of $p$ outside this
interval do not arise. Such transformations are the subject matter of generalized
linear models in statistics, of which the logistic regression is a special case.]
Solution 6. From (9),
$$\frac{\partial \Pr(y = 0 \mid x)}{\partial x} = -f(-x'\beta)\,\beta, \qquad \frac{\partial \Pr(y = m \mid x)}{\partial x} = f(c_{m-1} - x'\beta)\,\beta,$$
where we notice that one derivative has the opposite sign from the other one,
since $f \ge 0$ always.
There are two tails, one to the left of $c_0$ (i.e. 0) and one to the right of $c_{m-1}$.
When the data change, the density $f$ shifts either to the right or to the left
because $x'\beta$ has changed. As a result of the shift, the area included in one of
the tails increases while the other shrinks.
Solution 7. The moment-generating function (m.g.f.) of $x$ is given as
$$m_x(t) \equiv E(e^{tx}) = \exp(\mu t + \sigma^2 t^2/2),$$
where the last equality identifies $x$ as $N(\mu, \sigma^2)$ uniquely (just like a c.d.f. would).
Then,
$$E(z^j) \equiv E((e^{x})^j) \equiv E(e^{jx})$$
and choosing $t = j$ in $m_x(t)$ gives the required result.
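Solution 8. Let us rewrite $y \equiv \max\{c, z\}$, where $z \sim N(\mu, \sigma^2)$. There are
two possibilities. First, $z \le c$, in which case $y = c$ with probability
$$\Pr(z \le c) = \Pr\left( \frac{z - \mu}{\sigma} \le \frac{c - \mu}{\sigma} \right) \equiv \Pr\left( \frac{z - \mu}{\sigma} \le \alpha \right) = \Phi(\alpha)$$
since $(z - \mu)/\sigma \sim N(0, 1)$. Second, $z > c$ with probability $1 - \Phi(\alpha)$, in which
case (13) tells us that $E(z \mid z > c) = \mu + \sigma\lambda$. The unconditional $E(y)$ is
obtained from the law of iterated expectations (LIE) as
$$E(y) \equiv \Pr(z \le c)\, E(y \mid z \le c) + \Pr(z > c)\, E(y \mid z > c) = c\,\Phi(\alpha) + (1 - \Phi(\alpha))(\mu + \sigma\lambda).$$
For the variance, we will use the LIE again as
$$\operatorname{var}(y) \equiv \operatorname{var}_z\left( E_{y \mid z}(y) \right) + E_z\left( \operatorname{var}_{y \mid z}(y) \right),$$
where the conditioning is on whether $z \le c$ or $z > c$.
The last term is easy to work out as before, because
$$\begin{aligned}
E_z\left( \operatorname{var}_{y \mid z}(y) \right) &= \Phi(\alpha) \operatorname{var}(y \mid z \le c) + (1 - \Phi(\alpha)) \operatorname{var}(y \mid z > c) \\
&= \Phi(\alpha) \operatorname{var}(c) + (1 - \Phi(\alpha))\, \sigma^2 (1 - (\lambda - \alpha)\lambda) \\
&= \sigma^2 (1 - \Phi(\alpha)) (1 - (\lambda - \alpha)\lambda).
\end{aligned} \tag{19}$$
Since $E_{y \mid z \le c}(y) = c$ and $E_{y \mid z > c}(y) = \mu + \sigma\lambda$, we work out the remaining term
$$\begin{aligned}
\operatorname{var}_z\left( E_{y \mid z}(y) \right) &= E_z\left( \left[ E_{y \mid z}(y) - E_z\left( E_{y \mid z}(y) \right) \right]^2 \right) \\
&= E_z\left( \left[ E_{y \mid z}(y) - E(y) \right]^2 \right) \quad \text{(by the LIE)} \\
&= E_z\left( \left[ E_{y \mid z}(y) - c\,\Phi(\alpha) - (1 - \Phi(\alpha))(\mu + \sigma\lambda) \right]^2 \right) \\
&= \Phi(\alpha) \left[ c - c\,\Phi(\alpha) - (1 - \Phi(\alpha))(\mu + \sigma\lambda) \right]^2 \\
&\quad + (1 - \Phi(\alpha)) \left[ \mu + \sigma\lambda - c\,\Phi(\alpha) - (1 - \Phi(\alpha))(\mu + \sigma\lambda) \right]^2 \\
&= \Phi(\alpha) (1 - \Phi(\alpha))^2 \left[ c - \mu - \sigma\lambda \right]^2 \\
&\quad + (1 - \Phi(\alpha))\, \Phi(\alpha)^2 \left[ -c + \mu + \sigma\lambda \right]^2 \\
&= \Phi(\alpha) (1 - \Phi(\alpha)) \left[ c - \mu - \sigma\lambda \right]^2 \left[ 1 - \Phi(\alpha) + \Phi(\alpha) \right] \\
&= \sigma^2\, \Phi(\alpha) (1 - \Phi(\alpha)) \left[ \alpha - \lambda \right]^2,
\end{aligned}$$
where the last step follows from the definition $\alpha \equiv (c - \mu)/\sigma$. Adding this
result to (19) gives the required $\operatorname{var}(y)$.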

