Empirical Finance 6
1. The variance is calculated from the two tails of any distribution, so it is not
ideal for describing risk: the lower tail of the distribution of profits is the one
that matters for defining risk, not the upper tail (positive profits). They are
the same if the distribution is symmetric, but not otherwise.
For example, a distribution with only one long tail has the same variance as
its mirror image (var(x) = var(−x)), but certainly not the same risk!
2. There are distributions whose tails decay so slowly that the variance is in-
finitely large; e.g. x ∼ t(ν) for ν = 1 (the Cauchy case) and ν = 2 has
var(x) = ∞. Is this a case where we should give up measuring (hence
controlling) risk?! Certainly not, as risk becomes even more relevant!
A dichotomous (or binary) variate is one that can only take one of two possible
values; e.g. default or not, crash or not. More generally, we can have qualitative
variates (e.g. earnings that either meet expectations, are below, much below,
above, or much above), themselves a special case of discrete variates whose
outcomes can be enumerated (counted).
What if the probability of an outcome occurring (e.g. a crash) is a function of
some variables (e.g. the economy's fundamentals, the past values of the index,
etc.)? If we model this relation, we can predict the event better, and we can
anticipate how its distribution changes as these explanatory factors evolve.
But first we need to take the discreteness of the LHS variate into account. We
cannot calculate the usual regressions otherwise, as we shall see.
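To see the problem concretely, here is a minimal simulated sketch (not part of the lecture material): running OLS directly on a 0/1 outcome, the so-called linear probability model, can return fitted 'probabilities' outside [0, 1]. The data-generating process, coefficients and sample size below are arbitrary illustrative choices.

```python
import numpy as np

# Illustration (hypothetical data): OLS on a binary outcome can produce
# fitted values outside [0, 1], which cannot be probabilities.
rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
p_true = 1.0 / (1.0 + np.exp(-(0.5 + 2.0 * x)))  # a logistic data-generating process
y = rng.binomial(1, p_true)                      # dichotomous (0/1) outcome

X = np.column_stack([np.ones(n), x])             # add an intercept
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None) # the usual OLS coefficients
fitted = X @ beta_ols
print("share of OLS fitted values outside [0, 1]:",
      np.mean((fitted < 0) | (fitted > 1)))
```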
[Figure: density of a truncated variate, with one tail cut off.]
The truncated variate is also less variable than the original one, since one of the tails
has been cut off!
We need alternatives to OLS...
Two general estimation methods
Let x be an m × 1 variate whose density f_x is determined by the k × 1 vector
of unknown parameters θ. We wish to estimate θ from a random sample
{x_i}_{i=1}^{n}, which we collate in the n × m matrix X ≡ (x_1, ..., x_n)'. For example,
when m = 1, the density may be a normal with unknown mean and variance,
f_x(u) = exp(−(u − μ)²/(2σ²)) / √(2πσ²), with θ ≡ (μ, σ²)'.
What value of θ is most likely to have generated these data {x_i}_{i=1}^{n}? The
solution to this question is known as the maximum-likelihood estimator (MLE).
Once we have sampled x, the only remaining unknowns in the sample's density
are the parameters θ. To stress this, we change notation and consider the joint density
of x_1, ..., x_n as a function of θ, writing the likelihood function of θ as

L(θ) ≡ f_x(x_1) × ··· × f_x(x_n)

by the independence of x_1, ..., x_n. Alternative notation includes L(θ | X),
to stress that we take the data X as given (we condition on X). Note also
that the arguments of f_x(·) are the actual data x_i (not some hypothetical u).
The MLE of the parameters is the vector θ̂ that maximizes L(θ).
To illustrate, consider the sample {x_i}_{i=1}^{n} ∼ IN(μ, 1). Here, there is only one
unknown parameter, θ ≡ μ, with likelihood

L(μ) = ∏_{i=1}^{n} (2π)^{-1/2} exp(−½(x_i − μ)²) = exp(−½ Σ_{i=1}^{n} (x_i − μ)²) / (2π)^{n/2}.
The value of μ that maximizes this function is the MLE, and it is denoted by μ̂.
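As a quick numerical cross-check (a sketch with simulated data, not from the lecture), the same likelihood can be maximized with an off-the-shelf optimizer, and the maximizer coincides with the sample mean:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Sketch: maximize the IN(mu, 1) likelihood numerically; the MLE equals the sample mean.
rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=1.0, size=200)   # simulated sample with true mu = 2

def neg_loglik(mu):
    return 0.5 * np.sum((x - mu) ** 2)         # -log L(mu), dropping constants

res = minimize_scalar(neg_loglik, bounds=(-10.0, 10.0), method="bounded")
print("numerical MLE of mu:", res.x, "  sample mean:", x.mean())
```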
A contrasting example is a sample {z_i}_{i=1}^{n} ∼ U(0, α), uniform with an unknown
upper bound α: the likelihood α^{−n} (for α ≥ max_i z_i, and zero otherwise) is decreasing
in α, so the MLE is α̂ = max_i z_i, the largest observation.
• In this example, we did not differentiate the likelihood because the maxi-
mum was achieved as a corner solution where the derivative is nonzero: the
maximum was at the edge of the domain of definition of α.
• In this example, the largest observation is a much more efficient (lower-
variance) estimator than the MoM estimator 2z̄, and it results in a lower MSE
as well. This result makes sense: the largest observation indicates the most
likely value of the unknown upper bound of the distribution.
[Think of z as the grade that some teacher gives to his students. His marking
is such that all grades are equally probable, but he is stingy and does not give
100% even if the answers are all correct! You can collect grades from this
teacher's courses, and discover his upper bound α more efficiently from the
maximum grade than from the mean (quite literally!) grade. Also, in this
case, MoM can generate an estimate of α bigger than 100% (if z̄ > 50%)!]
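A small simulation sketch of the grading story (the upper bound α = 0.87, the class size and the number of replications are hypothetical choices) shows the MSE ranking claimed above, and that the MoM estimate 2z̄ can exceed 100%:

```python
import numpy as np

# Grades z ~ U(0, alpha) with unknown upper bound alpha:
# compare the MLE (sample maximum) with the MoM estimator (2 * sample mean).
rng = np.random.default_rng(2)
alpha, reps, n = 0.87, 10_000, 30
grades = rng.uniform(0.0, alpha, size=(reps, n))

mle = grades.max(axis=1)          # corner-solution MLE
mom = 2.0 * grades.mean(axis=1)   # method of moments, since E(z) = alpha / 2

print("MSE of MLE:", np.mean((mle - alpha) ** 2))
print("MSE of MoM:", np.mean((mom - alpha) ** 2))
print("share of MoM estimates above 100%:", np.mean(mom > 1.0))
```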
Going back to our truncated variate, we know that OLS is flawed, but we now
have ML and MoM.
The correction of the OLS problems is achieved by using the relevant density from (10):
the MLE is obtained by maximizing w.r.t. β the likelihood (e.g. with F = Φ in the normal case)

L(β) = ∏_{i=1}^{n} f_y(y_i) ≡ ∏_{i=1}^{n} f(y_i − x_i'β) / (1 − F(c − x_i'β)) = [f(y_1 − x_1'β) / (1 − F(c − x_1'β))] × ··· × [f(y_n − x_n'β) / (1 − F(c − x_n'β))]

instead of maximizing the usual ∏_{i=1}^{n} f(y_i − x_i'β) (only the numerator here), which
applies when there is no truncation.
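Here is a minimal sketch of this truncated-regression MLE, assuming a normal error (f = φ, F = Φ) and simulated data; the coefficients, the truncation point c = 0 and the sample size are arbitrary illustrative choices, and OLS on the truncated sample is shown for comparison:

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

# Sketch (hypothetical data): MLE for y = x'b + e, e ~ N(0, sigma^2),
# observed only when y > c, using f(y - x'b) / [1 - F(c - x'b)].
rng = np.random.default_rng(3)
n, c = 5000, 0.0
x = rng.normal(size=n)
y = 1.0 + 0.5 * x + rng.normal(size=n)       # true intercept 1.0, slope 0.5, sigma 1.0
keep = y > c                                 # truncation: only y > c is observed
Xk, yk = np.column_stack([np.ones(keep.sum()), x[keep]]), y[keep]

def neg_loglik(theta):
    b, sigma = theta[:2], np.exp(theta[2])   # log-parametrize sigma to keep it positive
    resid = yk - Xk @ b
    # log numerator minus log of the truncation probability Pr(y > c | x)
    return -np.sum(norm.logpdf(resid, scale=sigma) - norm.logsf(c - Xk @ b, scale=sigma))

res = minimize(neg_loglik, x0=np.zeros(3), method="BFGS")
print("MLE (intercept, slope, sigma):", res.x[0], res.x[1], np.exp(res.x[2]))

beta_ols, *_ = np.linalg.lstsq(Xk, yk, rcond=None)
print("OLS on the truncated sample (biased):", beta_ols)
```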
We will use this method of estimation in Section C. For now, we illustrate the
simpler (but less efficient) MoM.
Consider the example of the normal distribution and take x_i ≡ 1 and β ≡ μ.
Then, y ≡ μ + ε ∼ N(μ, σ²). If we truncate when y > c, we have

E(y | y > c) = μ + σλ    and    var(y | y > c) = σ²(1 − λ(λ − a))    (13)

where a ≡ (c − μ)/σ is the standardized truncation point (or quantile) and
λ ≡ φ(a)/(1 − Φ(a)) ≡ φ(a)/Pr(y > c) is the normal's hazard rate (in general,
λ⁻¹ ≡ (1 − F(u))/f(u) is known as Mills' ratio).
The first result in (13) gives a relation between the mean of the truncated
distribution and the original N(μ, σ²). We can exploit it to estimate the original
distribution from the truncated one, when we only observe the truncated variate.
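Before turning to the illustration, here is a quick simulation check of (13); the values of μ, σ and c are arbitrary:

```python
import numpy as np
from scipy.stats import norm

# Check E(y | y > c) = mu + sigma*lambda and var(y | y > c) = sigma^2 (1 - lambda(lambda - a)).
rng = np.random.default_rng(4)
mu, sigma, c = 1.0, 2.0, 1.5
y = rng.normal(mu, sigma, size=2_000_000)
yt = y[y > c]                                 # the truncated sample

a = (c - mu) / sigma                          # standardized truncation point
lam = norm.pdf(a) / (1.0 - norm.cdf(a))       # hazard rate
print("mean:     simulated", yt.mean(), " formula", mu + sigma * lam)
print("variance: simulated", yt.var(),  " formula", sigma**2 * (1 - lam * (lam - a)))
```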
William Greene illustrates. A 1987 newspaper ‘survey’ of affluent Americans
(earning above $100K, the top 2%) says that their typical income is $142K.
Can we infer the average income of all the population from only the top 2%?
Incomes often follow a log-normal distribution, so we will use y ≡ log(income).
We are told that Pr(y < log 100) = 0.98. Corresponding to it, the quantile
Φ⁻¹(0.98) ≈ 2.054 yields

a ≡ (log 100 − μ)/σ ≈ 2.054  ⇒  μ ≈ log 100 − 2.054σ,  and
λ ≡ φ(a)/Pr(y > c) ≈ φ(2.054)/0.02 = [exp(−½(2.054)²)/√(2π)]/0.02 ≈ 0.0484/0.02 ≈ 2.42    (14)

The 'survey' revealed that E(y | y > log 100) ≈ log 142, and we can plug this
number and our approximations of μ and λ from (14) into E(y | y > c) = μ + σλ
of (13) to get log 142 ≈ [log 100 − 2.054σ] + 2.42σ ⇒ σ ≈ 0.958, hence
μ ≈ log 100 − 2.054(0.958) ≈ 2.637.
Therefore, log(income) ∼ N(2.637, (0.958)²). But what is E(income)? Recall
Jensen's inequality from Lecture 5 and apply it here as E(log(income)) < log E(income).
Exercise 7 proves that the log-normal density has mean e^{μ + σ²/2}, giving us an
estimate of the (untruncated) mean income of the whole population as

exp(2.637 + (0.958)²/2) ≈ 22, i.e. roughly $22K,
based on adjusting the newspaper 'survey'. The actual figure for 1987 is $25K,
so the adjustment for such an extreme truncation still works quite well! Amazingly,
we are able to predict something also for the 98% of the population who
were not even sampled.
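The back-of-the-envelope calculation above is easy to reproduce numerically (same inputs as in the text; the small differences from 0.958, 2.637 and 22 come only from the rounding used on the slide):

```python
import numpy as np
from scipy.stats import norm

# Greene's newspaper-survey example: incomes in thousands of dollars, natural logs.
p_below = 0.98                                  # Pr(y < log 100): the top 2% earn > $100K
a = norm.ppf(p_below)                           # ~ 2.054
lam = norm.pdf(a) / (1.0 - p_below)             # ~ 2.42
mean_trunc = np.log(142.0)                      # E(y | y > log 100) from the 'survey'

# Solve E(y | y > c) = mu + sigma*lambda together with mu = log(100) - a*sigma:
sigma = (mean_trunc - np.log(100.0)) / (lam - a)   # ~ 0.955
mu = np.log(100.0) - a * sigma                     # ~ 2.64
mean_income = np.exp(mu + 0.5 * sigma**2)          # log-normal mean, ~ 22 (i.e. ~$22K)
print(sigma, mu, mean_income)
```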
Note that the mean is e^μ e^{σ²/2}, where the factor e^{σ²/2} > 1 is due to Jensen's
inequality. This σ²/2 term is something you will also encounter in Itô calculus,
in the calculation of stochastic discount factors in asset pricing, and in many
other applications in finance.
B.2. Censored variates
The previous slide has also implicitly solved the estimation of the models in
Section A! In the dichotomous case,
Pr(y_i = 0 | x_i) = F(−x_i'β)
Pr(y_i = 1 | x_i) = 1 − F(−x_i'β)

as a special case of (9), and the joint density is

L(β) ≡ (∏_{i=1}^{n_1} [1 − F(−x_i'β)]) × (∏_{i=n_1+1}^{n} F(−x_i'β))    (18)

where the relabelled first n_1 observations are the ones corresponding to y_i = 1.
In the case of more than two outcomes, use the multinomial in (18) instead of
the binomial.
The standard limiting distributions arise for the MLE and tests based on it.
(Apply to LDV.xls.) However, the same is not true of OLS: it performs badly.
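A minimal sketch of maximizing (18), taking F = Φ (a probit choice) and simulated data rather than LDV.xls; the true coefficients below are arbitrary:

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

# Sketch: MLE of the dichotomous model with Pr(y=1|x) = 1 - F(-x'beta), F = standard normal cdf.
rng = np.random.default_rng(5)
n = 2000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
beta_true = np.array([0.3, 1.0])
y = (X @ beta_true + rng.normal(size=n) > 0).astype(float)   # y = 1 iff latent index > 0

def neg_loglik(beta):
    z = -X @ beta
    # log Pr(y=1|x) = log(1 - F(-x'beta)); log Pr(y=0|x) = log F(-x'beta)
    return -np.sum(y * norm.logsf(z) + (1.0 - y) * norm.logcdf(z))

res = minimize(neg_loglik, x0=np.zeros(2), method="BFGS")
print("probit MLE of beta:", res.x)
```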
A few issues in estimation remain to be addressed. We do so (briefly) now.
First, heteroskedasticity arises frequently in this context. So far, we made the
i.i.d. assumption, and had a given (fixed) variance. This can be generalized
to allow for the variance to change as x_i changes. In this case, in the previous
formulae we replace x_i'β by x_i'β/σ_i (where the form σ_i = e^{α'x_i} is often
chosen).
A second issue that arises is the choice of F. Kernel estimators have been used
to reduce the dependence of the results on assumed functional forms for F.
The third and final issue we address is sample selection bias, also
known as attrition (or survivorship) bias or incidental truncation.
Suppose we formulate a hypothesis that leads us to study the history of an
existing stock. We can’t select a company that has been delisted and/or has
gone bankrupt! This induces a bias, just as we showed with the truncation
problem earlier. The CRSP dataset (available in WRDS) is adjusted for this
type of problem.
D. Technical result and Exercises
Remark 1 (Differentiating quadratic forms) Consider the quadratic form q ≡
x'Ax as a function of the n × 1 vector x, and let A be an n × n symmetric
matrix of constants. Then, the n × 1 vector of derivatives of q with respect to
the elements x_1, ..., x_n of x is

(∂q/∂x_1, ..., ∂q/∂x_n)' ≡ ∂q/∂x ≡ ∂(x'Ax)/∂x = 2Ax

where the factor 2 comes from differentiating a quadratic function, and x' (not
x) drops out from x'Ax in order to make the dimension of the RHS compatible
with the rest (i.e. Ax is n × 1). The n × n matrix of second derivatives is

∂²q/(∂x ∂x') ≡ ∂(2Ax)/∂x' = 2A
Try this for n = 1 and n = 2 to see it in terms of the elements. When n = 1,
differentiate q = ax² to get dq/dx = 2ax. When n = 2, differentiate w.r.t. x_1 and x_2 the function

q = (x_1, x_2) A (x_1, x_2)'  with  A = [1 2; 2 3],  i.e.  q = x_1(x_1 + 2x_2) + x_2(2x_1 + 3x_2) = x_1² + 4x_1x_2 + 3x_2²,

then check that the derivatives equal 2Ax and the second derivatives equal 2A.
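A quick numerical check of Remark 1, using the 2 × 2 example above and central finite differences (the evaluation point is arbitrary):

```python
import numpy as np

# Check d(x'Ax)/dx = 2Ax and the Hessian = 2A for a symmetric A.
A = np.array([[1.0, 2.0],
              [2.0, 3.0]])
x0 = np.array([0.7, -1.3])
q = lambda x: x @ A @ x
h = 1e-4

grad_fd = np.array([(q(x0 + h * e) - q(x0 - h * e)) / (2 * h) for e in np.eye(2)])
hess_fd = np.array([[(q(x0 + h*ei + h*ej) - q(x0 + h*ei - h*ej)
                      - q(x0 - h*ei + h*ej) + q(x0 - h*ei - h*ej)) / (4 * h * h)
                     for ej in np.eye(2)] for ei in np.eye(2)])
print("gradient:", grad_fd, " vs 2Ax:", 2 * A @ x0)
print("Hessian:\n", hess_fd, "\n vs 2A:\n", 2 * A)
```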
Exercise 1 (Binomial representation) A new drug cures patients with proba-
bility p. What is the probability that j patients are cured in a trial on a random
group of n patients?
Exercise 2 (If you don’t succeed, try and try again) Consider the following
two stories:
(a) Sarah throws eggs at a bad musician who will give up if and only if three
eggs have hit him. The probability of a successful hit is 0.6. You may assume
that no-one else in the audience has eggs, and that Sarah is the best shot of
them all. Compute the probability that exactly j eggs will be required to stop
the musician from playing. What is the probability that fewer than six eggs will
be required? (Sarah needs to know how many eggs to buy from the shop!)
(b) There are c different types of coupons. Every box sold contains one coupon.
The probability that a box contains coupon i is 1/c. What is the expected
number of boxes you have to buy so that you possess at least one of every
coupon?
Exercise 3 (The fiddle) Let p denote the probability that a firm fiddles its
books.
(a) Defining x = 1 if it does, and x = 0 otherwise, derive the density of x and
name it.
(b) What is the probability that j firms do so in a random group of n firms,
and what is the name of this distribution?
(c) Suppose that p = 0.1 and that we keep auditing more firms until we have
found 3 bad ones. Compute the probability that we will need to audit exactly j
firms, and identify the resulting distribution. What is the probability that fewer
than six firms will need to be audited?
Exercise 4 (Extremists) An extreme value occurs with probability p ∈ [0, 1].
Define x = 1 if it occurs, and x = 0 otherwise. Derive and give the name of:
(a) the density of x;
(b) the probability that j of these events will occur in a random sample of size
n;
(c) the probability that you have to wait for n observations before j of these
events occur, if p ≠ 0.
Exercise 5 (Bernoulli MLE and logit) Let x_1, ..., x_n be a random sample from
a Bernoulli distribution with parameter p ∈ [0, 1].
(a) Obtain the MLE of p.
(b) Is this estimator unbiased?
(c) What is the log-likelihood of p?
Exercise 6 (Signs of marginals in ordered data) Differentiate the first and last
probabilities in (9). What can you say about the opposite signs of these two
derivatives?
Exercise 7 (Log-normal moments) Let x ∼ N(μ, σ²) with moment-generating
function m(t) ≡ E(e^{tx}) = exp(μt + σ²t²/2) for t ∈ R. Defining the log-normal
y ≡ e^x (i.e. log(y) is normal), prove that E(y^j) = exp(jμ + j²σ²/2) for all
j ∈ N.
Exercise 8 (Mean and variance of censored normal, optional exercise) Let
y ≡ max{c, μ + ε}, where ε ∼ N(0, σ²). Prove that

E(y) = cΦ(a) + (1 − Φ(a))(μ + σλ)    and
var(y) = σ²(1 − Φ(a))[1 − λ(λ − a) + (a − λ)²Φ(a)],

where a ≡ (c − μ)/σ and λ ≡ φ(a)/(1 − Φ(a)).
Solution 1. Define the random variable x_i which takes the value 1 if the new
drug cures patient i and 0 otherwise. Then x_i follows a Bernoulli distribution
(x_i has a binary Yes/No outcome) with parameter p, the percentage of patients
cured by the new drug.
We need to derive the distribution of the variate y ≡ Σ_{i=1}^{n} x_i ∈ {0, 1, ..., n}.
Assume that the x_i's are independent, for example because the disease is not
contagious and/or because the sample was randomly selected from different
locations. For any realization y = j, there are C(n, j) possible combinations of j
cured patients, and the probability of observing each of these combinations is

(∏_{i=1}^{j} p) · (∏_{i=j+1}^{n} (1 − p)) = p^j (1 − p)^{n−j}

by the independence of each patient from the others. Then, y ∼ Bin(n, p). The
binomial is therefore the general distribution of the sum of repeated Bernoulli
trials.
Solution 2. (a) Sarah will require j eggs if 2 out of the previous j − 1 eggs hit
their target, and the j-th is a hit too (the last one has to be a hit: it finishes
the game!). Defining this joint probability as the product of probabilities of
two independent events, with the probability of one success p = 0.6, we have

Pr(j throws) = p · Bin_{(j−1, p)}(2) = p C(j−1, 2) p²(1 − p)^{j−1−2} = C(j−1, 2) (0.6)³(0.4)^{j−3}

Recall the identity C(n, j) = C(n, n − j): the number of ways of choosing j out of
n individuals is identical to the number of ways of excluding n − j from n. Here,
C(j−1, 2) = C(j−1, j−3), and equation (3) gives the distribution of the random number
of throws in excess of 3, whose realization here is j − 3. It is the negative
binomial Nbin(k, p) where k = 3. It is the general distribution for sampling
over and above k, until k successes are achieved.
The probability that fewer than 6 eggs will be required by Sarah is

Σ_{j=3}^{5} C(j−1, 2) (0.6)³(0.4)^{j−3} = Σ_{i=0}^{5−3} C(i+2, 2) (0.6)³(0.4)^{i} ≈ 0.683

where i = 0 denotes the perfect score of exactly 3 throws. She'd better
improve her aim (practice a few days to change p) or buy more eggs to have a
better chance than 68.3%!
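The 0.683 figure can be cross-checked with the negative binomial distribution in scipy, which counts the misses before the third hit ('fewer than six eggs' means at most two misses):

```python
from scipy.stats import nbinom

# Misses before the 3rd hit ~ NBin with r = 3 successes and success probability p = 0.6.
print(nbinom.cdf(2, 3, 0.6))                         # Pr(at most 2 misses) ~ 0.6826
print(sum(nbinom.pmf(k, 3, 0.6) for k in range(3)))  # the same sum written out
```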
(b) The first box gives you one coupon. Let the random variable T_1 be the
number of boxes you have to buy in order to get a coupon which is different from
the first one. From (a), the geometric p.d.f. arises for the number of required
attempts in excess of 1, so T_1 − 1 is a geometric variate with p_1 = (c − 1)/c.
Once you have two different coupons, let T_2 be the number of boxes you have
to buy in order to get a coupon which is different from the first two. Then
T_2 − 1 is a geometric random variable with p_2 = (c − 2)/c.
Proceeding in this way, the number of boxes you need to buy equals T =
1 + T_1 + T_2 + ··· + T_{c−1}. From the mean of a geometric variate (i.e. NBin(1, p)),
we have E(T_j) = 1/p_j = c/(c − j), and the expected number of boxes you
have to buy is Σ_{j=0}^{c−1} c/(c − j) = c Σ_{i=1}^{c} 1/i by reversing the index into
i = c − j.
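A short sketch checking the coupon-collector formula c Σ_{i=1}^{c} 1/i against simulation (c = 10 is an arbitrary choice):

```python
import numpy as np

# Expected number of boxes: formula c * (1 + 1/2 + ... + 1/c) versus simulation.
rng = np.random.default_rng(6)
c = 10
expected = c * sum(1.0 / i for i in range(1, c + 1))

def boxes_needed():
    seen, count = set(), 0
    while len(seen) < c:
        seen.add(int(rng.integers(c)))   # each box carries one of c equally likely coupons
        count += 1
    return count

sims = [boxes_needed() for _ in range(20_000)]
print("formula:", expected, "  simulated:", np.mean(sims))
```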
Solution 3. This is Exercises 1 and 2a in disguise! Here, we get in the final
part Σ_{j=3}^{5} C(j−1, 2) (0.1)³(0.9)^{j−3} ≈ 0.00856.
Solution 4. Guess what?! This is another form of the same story of the
previous exercise.
Solution 5. (a) The likelihood function is given by

L(p) = ∏_{i=1}^{n} p^{x_i}(1 − p)^{1−x_i} = p^{Σ_{i=1}^{n} x_i} (1 − p)^{n − Σ_{i=1}^{n} x_i} = p^{nx̄}(1 − p)^{n(1−x̄)}

for p ∈ [0, 1], and zero otherwise; and we denote its logarithm by ℓ(p). This
function of p is continuous, even though the density of x is discrete. There are
two cases where we get a corner solution and we cannot optimize the function
by differentiation. If Σ_{i=1}^{n} x_i = 0 (hence x̄ = 0), then L(p) = (1 − p)^n is
maximized at p̂ = 0, hence p̂ = x̄. If Σ_{i=1}^{n} x_i = n (hence x̄ = 1), then
L(p) = p^n is maximized at p̂ = 1, hence p̂ = x̄ again. For other values of
Σ_{i=1}^{n} x_i,

dℓ(p)/dp = nx̄/p − n(1 − x̄)/(1 − p)

which, set equal to zero, gives the MLE p̂ = x̄, the sample mean again. In this third case, we have
L(0) = 0 = L(1) with L(p) > 0 in between. Therefore, L(p) has at least one
maximum in (0, 1), and it is the one we have derived: we do not have to check
further the sign of the second-order derivative of ℓ(p) at p̂.
(b) By E(y_1 + y_2) = E(y_1) + E(y_2) for any y_1, y_2 whose expectations exist,

E(x̄) = E((1/n) Σ_{i=1}^{n} x_i) = (1/n) Σ_{i=1}^{n} E(x_i) = (1/n) Σ_{i=1}^{n} p = (1/n)(np) = p,

so E(x̄) equals the population mean p and x̄ is an unbiased estimator of p.
(c) The likelihood obtained in (a) gives the log-likelihood

ℓ(p) = n log(1 − p) + nx̄ log(p/(1 − p)),

where the second term is where the data and the parameter interact.
[Note: Suppose we take log(p/(1 − p)) from the interaction term in ℓ(p), and
make it a function of explanatory variables x, such as log(p/(1 − p)) = x'β. Then,

exp(−x'β) = (1 − p)/p = 1/p − 1,

hence

p = 1/(1 + exp(−x'β)).

This function (the logistic c.d.f.) maps x'β ∈ R to p ∈ [0, 1]: values of p outside this
interval do not arise. Such transformations are the subject matter of generalized
linear models in statistics, of which the logistic regression is a special case.]
Solution 6. From (9),

∂Pr(y = 0 | x)/∂x = −f(−x'β)β
∂Pr(y = m | x)/∂x = f(c_{m−1} − x'β)β,

where m is the top category and c_{m−1} the highest threshold in (9), and where
we notice that one derivative has the opposite sign from the other one,
since f ≥ 0 always.
There are two tails, one to the left of 0 (i.e. y = 0) and one to the right of c_{m−1}.
When the data change, the density shifts either to the right or to the left
because x'β has changed. As a result of the shift, the area included in one of
the tails increases while the other shrinks.
Solution 7. The moment-generating function (m.g.f.) of x is given as

m(t) ≡ E(e^{tx}) = exp(μt + σ²t²/2),

where the last equality identifies x as N(μ, σ²) uniquely (just like a c.d.f. would).
Writing

E(y^j) ≡ E((e^x)^j) ≡ E(e^{jx})

and choosing t = j in m(t) gives the required result.
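A quick simulation check of this result (with arbitrary μ, σ and j):

```python
import numpy as np

# Check E(y^j) = exp(j*mu + j^2*sigma^2/2) for y = e^x, x ~ N(mu, sigma^2).
rng = np.random.default_rng(7)
mu, sigma, j = 0.2, 0.5, 2
x = rng.normal(mu, sigma, size=5_000_000)
print("simulated E(y^j):", np.mean(np.exp(j * x)))
print("formula:         ", np.exp(j * mu + 0.5 * j**2 * sigma**2))
```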
Solution 8. Let us rewrite y ≡ max{c, x}, where x ∼ N(μ, σ²). There are
two possibilities. First, x ≤ c, in which case y = c with probability

Pr(x ≤ c) = Pr((x − μ)/σ ≤ (c − μ)/σ) ≡ Pr(z ≤ a) = Φ(a)

since z ≡ (x − μ)/σ ∼ N(0, 1). Second, x > c with probability 1 − Φ(a), in which
case (13) tells us that E(x | x > c) = μ + σλ. The unconditional E(y) is
obtained from the law of iterated expectations (LIE) as

E(y) ≡ Pr(x ≤ c) E(y | x ≤ c) + Pr(x > c) E(y | x > c)
= cΦ(a) + (1 − Φ(a)) (μ + σλ).
For the variance, we will use the LIE again as

var(y) ≡ var(E(y | s)) + E(var(y | s)),

where s indicates whether x ≤ c or x > c. The last term is easy to work out as before, because

E(var(y | s)) = Φ(a) var(y | x ≤ c) + (1 − Φ(a)) var(y | x > c)
= Φ(a) var(c) + (1 − Φ(a)) σ²(1 − λ(λ − a))
= σ²(1 − Φ(a))(1 − λ(λ − a)),    (19)

since var(c) = 0 for the constant c. Since E(y | x ≤ c) = c and E(y | x > c) = μ + σλ, we work out the remaining term

var(E(y | s)) = E([E(y | s) − E(E(y | s))]²)
= E([E(y | s) − E(y)]²)    (by the LIE)
= E([E(y | s) − cΦ(a) − (1 − Φ(a))(μ + σλ)]²)