Solved Exercises and Problems of
Statistical Inference
David Casado
You can decide not to print this file and consult it in digital format – paper and ink will be saved. Otherwise, print
it on recycled paper, double-sided and with less ink. Be ecological. Thank you very much.
Contents
Links, Keywords and Descriptions
PE – CI – HT
Additional Exercises
Appendixes
Probability Theory
Some Reminders
Markov's Inequality. Chebyshev's Inequality
Probability and Moments Generating Functions. Characteristic Function
Mathematics
Some Reminders
Limits
References
Tables of Statistics
Probability Tables
Index
Prologue
These exercises and problems are a necessary complement to the theory included in Notes of Statistical
Inference, available at http://www.casado-d.org/edu/NotesStatisticalInference-Slides.pdf. Nevertheless, some
important theoretical details are also included in the remarks at the beginning of each chapter. Those Notes are
intended for teaching purposes, and they do not include the advanced mathematical justifications and
calculations included in this document.
Although we can study only linearly and step by step, it is worth noticing that in Statistical Inference
methods are usually related, as tasks are in the real world. Thus, in most exercises and problems we have
made it clear what the suppositions are and how they should be properly checked. In some cases, several
statistical methods have been "naturally" combined in the statement. Many steps and even sentences are
repeated in most exercises of the same type, both to insist on them and to make it easier to read the exercises
individually. The advanced exercises have been marked with the symbol (*).
The code with which we have done some calculations with the programming language R is written in
Courier New font; you can copy and paste this code from the file. I include some notes to help, to the best of
my knowledge, students whose mother language is not English.
Acknowledgements
This document has been created with Linux, LibreOffice, OpenOffice.org, GIMP and R. I thank those who
make this software available for free. I donate funds to these kinds of projects from time to time.
Links, Keywords and Descriptions
Inference Theory (IT)
Framework and Scope of the Methods
> [Keywords] infinite populations, independent populations, normality, asymptotic behaviour, descriptive statistics.
> [Description] The conditions under which the Statistics considered here can be applied are listed.
Some Remarks
> [Keywords] partial knowledge, randomness, certainty, dimensional analysis, validity, use of the samples, calculations.
> [Description] The partial knowledge justifies both the random character of the mathematical variables used to explain the variables of
the real-world problems and the impossibility of reaching the maximum certainty in using samples instead of the whole population. The
validity of the results must be understood within the scenario made of the assumptions, the methods, the certainty and the data.
Sampling Probability Distribution
Exercise 1it-spd
> [Keywords] inference theory, joint distribution, sampling distribution, sample mean, probability function.
> [Description] From a simple probability distribution for X, the joint distribution of a sample (X1,X2) and the sampling distribution of
the sample mean X̄ are determined.
PE – CI – HT
Exercise 1pe-ci-ht
> [Keywords] point estimations, confidence intervals, method of the pivot, normal distribution, t distribution, pooled sample variance.
> [Description] The probability of an event involving the difference between the means of two independent normal populations is
calculated with and without the supposition that the variances of the populations are the same. The method of the pivot is applied to
construct a confidence interval for the quotient of the standard deviations.
Exercise 2pe-ci-ht
> [Keywords] confidence intervals, point estimations, normal distribution, method of the pivot, probability, pooled sample variance.
> [Description] For the difference of the means of two (independent) normally distributed variables, a confidence interval is
constructed by applying the method of the pivotal quantity. Since the equality of the means is included in a high-confidence interval,
the pooled sample variance is considered in calculating a probability involving the difference of the sample means.
Exercise 3pe-ci-ht
> [Keywords] hypothesis tests, confidence intervals, Bernoulli populations, one-tailed tests, population proportion, critical region,
p-value, type I error, type II error, power function, method of the pivot.
> [Description] A decision on whether the population proportion in one population is smaller than or equal to that in the other is made by looking
at both the critical values and the p-value. The type II error is calculated and the power function is plotted. By applying the method of
the pivot, a confidence interval for the difference of the population proportions is built. This interval can be seen as the acceptance
region of the equivalent two-sided hypothesis test. In this case, the same decision is made with the test and with the interval.
Exercise 4pe-ci-ht
> [Keywords] point estimations, hypothesis tests, standard power function density, method of the moments, maximum likelihood
method, plug-in principle, Neyman-Pearson's lemma, likelihood ratio tests, critical region.
> [Description] Given the probability function of a population random variable, estimators are built by applying both the method of
the moments and the maximum likelihood method. Then, the plug-in principle allows us to obtain estimators for the mean and the
variance of the distribution of the variable. In testing the equality of the parameter to a given value, the form of the critical region is
theoretically studied when four different types of alternative hypothesis are considered.
Additional Exercises (Solved, but neither ordered by difficulty, nor described, nor referred to in the final index.)
References
Index
Samples
[As1] Sample sizes are supposed to be much smaller than population sizes, so a correction factor is not
necessary for these (effectively) infinite populations.
[As2] At the same time, we consider either any amount of normally distributed data or many data
(large samples) from any distribution.
[As3] Data will be supposed to have been selected randomly, with the same probability and
independently; that is, by applying simple random sampling.
Methods
[Am1] Before applying inferential methods, data should be analysed to guarantee that nothing strange
will spoil the inference—we suppose that such descriptive analysis and data treatment have been done.
[Am2] We are able to learn only linearly, but in practice methods need not be applied in the order in
which they are presented here—e.g. nonparametric hypothesis tests to check assumptions before
applying parametric methods.
Finally, at the end of the theoretical part of an exercise, we do not always insist that, in practice, a sample
(X1,...,Xn) would be used by entering its values into the theoretical expressions obtained as a solution.
Estimators and statistics are random quantities until specific data are used.
Useful Questions
To build the answer, readers may find it useful to ask themselves:
On the Populations
On the Samples
● If populations are not normally distributed, are the sample sizes large enough to apply asymptotic
results?
● Do we know the data themselves, or only some quantities calculated from them?
● Which are the estimators, the statistics and the methods that will be applied?
On the Quantities
● Which are the units of measurement? Are all the units equal?
● How large are the magnitudes? Do they seem reasonable? Are all of them coherent (variability is
positive, probabilities and relative frequencies are between 0 and 1, etc)?
On the Interpretation
They may want to consult some other pieces of advice that we have written in Guide for Students of Statistics,
available at http://www.casado-d.org/edu/GuideForStudentsOfStatistics-Slides.pdf.
For two populations, other basic estimators are built from the one-population ones: the difference of the sample means, $\bar{X}-\bar{Y}$; the difference of the sample proportions, $\hat{\eta}_X-\hat{\eta}_Y$; and the pairs of variance estimators $V_X^2$, $s_X^2$, $S_X^2$ and $V_Y^2$, $s_Y^2$, $S_Y^2$.
Finally, all these estimators are used to make statistics whose sampling distribution is known.
Exercise 1it-spd
Given a population (variable) X following the probability distribution determined by the following values and
probabilities:

  Value x        1     2     3
  Probability    3/9   1/9   5/9

Discussion: The distribution of X is totally determined, since we know all the information necessary to
calculate any quantity, e.g. the mean:

$$\mu_X = E(X) = \sum_\Omega x_j\cdot P_X(x_j) = \sum_{\{1,2,3\}} x_j\cdot p_j = 1\cdot\frac{3}{9} + 2\cdot\frac{1}{9} + 3\cdot\frac{5}{9} = \frac{20}{9} = 2.222222$$
Instead of a table, a function is sometimes used to provide the values and the probabilities—the mass or
density function. We can represent this function with the computer:
values = c(1, 2, 3)
probabilities = c(3/9, 1/9, 5/9)
plot(values, probabilities, type='h', xlab='Value', ylab='Probability', ylim=c(0,1), main= 'Mass Function', lwd=7)
The sampling probability distribution of X̄ is determined once we give the possible values and the
probabilities with which they can be taken. Before doing that, we describe the probability distribution of the
random vector X = (X1,X2).
$$f_{\mathbf{X}}(1,1) = P_{\mathbf{X}}(\{X_1=1\}\cap\{X_2=1\}) = P_X(X_1=1)\cdot P_X(X_2=1) = \frac{3}{9}\cdot\frac{3}{9} = \frac{1}{9}$$
To fill in the following table, the other probabilities are calculated in the same way.
Joint Probability Distribution of (X1,X2)

  Value (x1,x2)    Probability of (x1,x2)
  (1,1)            (3/9)·(3/9) = 1/9
  (1,2)            (3/9)·(1/9) = 1/27
  (1,3)            (3/9)·(5/9) = 5/27
  (2,1)            (1/9)·(3/9) = 1/27
  (2,2)            (1/9)·(1/9) = 1/81
  (2,3)            (1/9)·(5/9) = 5/81
  (3,1)            (5/9)·(3/9) = 5/27
  (3,2)            (5/9)·(1/9) = 5/81
  (3,3)            (5/9)·(5/9) = 25/81
Notice that (1,3) and (3,1), for example, contain the same information. The values and their probabilities can be entered and represented with the help of a computer:
valuesX1 = c(1, 1, 1, 2, 2, 2, 3, 3, 3)
valuesX2 = c(1, 2, 3, 1, 2, 3, 1, 2, 3)
probabilities = c(1/9, 1/27, 5/27, 1/27, 1/81, 5/81, 5/27, 5/81, 25/81)
library('scatterplot3d') # To load the package
scatterplot3d(valuesX1, valuesX2, probabilities, type='h', xlab='Value X1', ylab='Value X2', zlab='Probability',
xlim=c(0, 4), ylim=c(0, 4), zlim=c(0,1), main= 'Mass Function', lwd=7)
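As a check, the joint probabilities can also be computed with R as the outer product of the marginal ones (a small sketch; the object names are ours):

probs = c(3/9, 1/9, 5/9)
joint = outer(probs, probs)           # joint[i,j] = P(X1 = value i) * P(X2 = value j)
rownames(joint) = colnames(joint) = c('1', '2', '3')
joint                                 # reproduces the table above
sum(joint)                            # must be equal to 1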
The probabilities of the possible values of X̄ are obtained by adding the probabilities of the pairs that produce each value; for example,

$$P_{\bar{X}}\left(\frac{5}{2}\right) = P_{\mathbf{X}}(\{(2,3)\}\cup\{(3,2)\}) = P_{\mathbf{X}}(\{(2,3)\}) + P_{\mathbf{X}}(\{(3,2)\}) = \frac{5}{81} + \frac{5}{81} = \frac{10}{81}$$

$$P_{\bar{X}}(3) = P_{\mathbf{X}}(\{(3,3)\}) = \frac{25}{81}$$
Then, the sampling probability distribution of the sample mean X̄ is determined, in this case, by

Probability Distribution of X̄

  Value x̄        1      3/2    2      5/2    3
  Probability    1/9    2/27   31/81  10/81  25/81
We can check that the total sum of probabilities is equal to one:

$$\sum_\Omega P_{\bar{X}}(\bar{x}_j) = \sum_\Omega p_j = \frac{1}{9} + \frac{2}{27} + \frac{31}{81} + \frac{10}{81} + \frac{25}{81} = \frac{9+6+31+10+25}{81} = \frac{81}{81} = 1$$
From the information in the table above it is possible to calculate any quantity, e.g. the mean:

$$\mu_{\bar{X}} = E(\bar{X}) = \sum_\Omega \bar{x}_j\cdot P_{\bar{X}}(\bar{x}_j) = 1\cdot\frac{1}{9} + \frac{3}{2}\cdot\frac{2}{27} + 2\cdot\frac{31}{81} + \frac{5}{2}\cdot\frac{10}{81} + 3\cdot\frac{25}{81} = \frac{9+9+62+25+75}{81} = \frac{180}{81} = 2.222222$$

It is worth noticing that this value is equal to the value that we obtained at the beginning, which agrees with
the well-known theoretical property:

$$\mu_{\bar{X}} = E(\bar{X}) = E(X) = \mu_X$$
Values and probabilities can also be provided by using a function—the mass or density function, which can be
represented with the help of a computer:
values = c(1, 3/2, 2, 5/2, 3)
probabilities = c(1/9, 2/27, 31/81, 10/81, 25/81)
plot(values, probabilities, type='h', xlab='Value', ylab='Probability', ylim=c(0,1), main= 'Mass Function', lwd=7)
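The whole sampling distribution of X̄ can also be obtained with R, by listing all the pairs, multiplying the marginal probabilities and adding those that give the same sample mean (a sketch for the same data; the object names are ours):

values = c(1, 2, 3)
probs = c(3/9, 1/9, 5/9)
pairs = expand.grid(x1 = values, x2 = values)       # all pairs (x1, x2)
jointp = probs[match(pairs$x1, values)] * probs[match(pairs$x2, values)]
xbar = (pairs$x1 + pairs$x2) / 2                    # sample mean of each pair
tapply(jointp, xbar, sum)                           # probabilities 1/9, 2/27, 31/81, 10/81, 25/81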
Conclusion: For a simple distribution of X and a small sample X = (X1,X2), we have written both the
joint probability distribution of the sample X and the sampling distribution of X̄. This helps us to understand
the concept of the sampling distribution of any random quantity (not only the sample mean), whether we are
able to write it explicitly or only to know it (e.g. thanks to a theorem).
My notes:
Remark 2pe: If the method of the moments is used to estimate m parameters (frequently 1 or 2), the first m equations of the system
usually suffice; nevertheless, if not all the parameters appear in the first-order moments of X, the smallest m moments (and
equations) in which the parameters appear must be considered. For example, if μ₁ = 0, or if the interest lies directly in σ² because
μ is known, the first-order equation μ₁ = μ = E(X) = m₁ does not involve σ, and hence the second-order equation μ₂ = E(X²) = Var(X)
+ E(X)² = σ² + μ² = m₂ must be considered instead.
Remark 3pe: When looking for local maxima or minima of differentiable functions, the first-order derivatives are set equal to zero.
After that, to discriminate between maxima and minima, the second-order derivatives are studied. For most of the functions we will
work with, this second step can be solved by applying some qualitative reasoning about the sign of the quantities involved and the
possible values of the data xⱼ. When this does not suffice, the values found in the first step, say θ₀, must be substituted into the
expression of the second step. On the other hand, global maxima and minima cannot in general be found by using the derivatives, and
some qualitative reasoning must be applied. It is important to highlight that, in applying the maximum likelihood method, the
purpose is to find the maximum, whatever the mathematical way.
Exercise 1pe-m
If X is a population variable that follows a binomial distribution of parameters κ and η, and X = (X1,...,Xn) is
a simple random sample:
(a) Apply the method of the moments to obtain an estimator of the parameter η.
(b) Apply the maximum likelihood method to obtain an estimator of the parameter η.
(c) When κ = 10 and x = (x1,...,x5) = (4, 4, 3, 5, 6), use the estimators obtained in the two previous
sections to construct final estimates of the parameter η and the measures μ and σ2.
Hint: (i) In the two first sections treat the parameter κ as if it were known. (ii) In the likelihood function, join the combinatorial
terms into a product; this product does not depend on the parameter η and hence its derivative will be zero.
Discussion: This statement is mathematical, although in the last section we are given some data to be
substituted. In practice, that the binomial can be used to explain X should be supported. The variable X is
dimensionless. For the binomial distribution, $\mu = E(X) = \kappa\eta$ and $\sigma^2 = Var(X) = \kappa\eta(1-\eta)$.
(See the appendixes to see how the mean and the variance of this distribution can be calculated.) In particular,
the results obtained here can be applied to the Bernoulli distribution with κ = 1.
(a1) Population and sample moments: The probability distribution has two parameters originally, but we
have to study only one. The first-order moments are

$$\mu_1(\eta) = E(X) = \kappa\eta \quad\text{and}\quad m_1(x_1,x_2,\ldots,x_n) = \frac{1}{n}\sum_{j=1}^n x_j = \bar{x}$$

(a2) System of equations: Since the parameter of interest η appears in the first-order population moment of
X, the first equation suffices:

$$\mu_1(\eta) = m_1(x_1,\ldots,x_n) \;\rightarrow\; \kappa\eta = \bar{x} \;\rightarrow\; \eta_0 = \frac{\bar{x}}{\kappa}$$

(a3) The estimator: $\hat{\eta}_M = \dfrac{1}{\kappa}\bar{X}$
(b1) Likelihood function: For the binomial distribution the mass function is $f(x;\kappa,\eta) = \binom{\kappa}{x}\eta^x(1-\eta)^{\kappa-x}$.
We are interested only in η, so

$$L(x_1,x_2,\ldots,x_n;\eta) = \prod_{j=1}^n f(x_j;\eta) = \prod_{j=1}^n \binom{\kappa}{x_j}\eta^{x_j}(1-\eta)^{\kappa-x_j} = \left[\prod_{j=1}^n \binom{\kappa}{x_j}\right]\eta^{\sum_{j=1}^n x_j}\,(1-\eta)^{n\kappa-\sum_{j=1}^n x_j}.$$
(b2) Optimization problem: The logarithm function is applied to facilitate the calculations,

$$\log L(x_1,x_2,\ldots,x_n;\eta) = \log\left[\prod_{j=1}^n \binom{\kappa}{x_j}\right] + \left(\sum_{j=1}^n x_j\right)\log(\eta) + \left(n\kappa-\sum_{j=1}^n x_j\right)\log(1-\eta).$$
To discover the local or relative extreme values, the necessary condition is

$$0 = \frac{d}{d\eta}\log L(x_1,\ldots,x_n;\eta) = \left(\sum_{j=1}^n x_j\right)\frac{1}{\eta} - \left(n\kappa-\sum_{j=1}^n x_j\right)\frac{1}{1-\eta} \;\rightarrow\; \frac{n\kappa-\sum_{j=1}^n x_j}{1-\eta} = \frac{\sum_{j=1}^n x_j}{\eta}$$

$$\rightarrow\; \eta\,n\kappa - \eta\sum_{j=1}^n x_j = \sum_{j=1}^n x_j - \eta\sum_{j=1}^n x_j \;\rightarrow\; \eta\,n\kappa = \sum_{j=1}^n x_j \;\rightarrow\; \eta_0 = \frac{1}{n\kappa}\sum_{j=1}^n x_j = \frac{\bar{x}}{\kappa}$$
To verify that the only candidate is a local or relative maximum, the sufficient condition is

$$\frac{d^2}{d\eta^2}\log L(x_1,\ldots,x_n;\eta) = -\frac{\sum_{j=1}^n x_j}{\eta^2} - \frac{n\kappa-\sum_{j=1}^n x_j}{(1-\eta)^2} < 0$$

since $\kappa \ge x_j$ and therefore $n\kappa \ge \sum_{j=1}^n x_j$. This holds for any value, including $\eta_0$.
(b3) The estimator: $\hat{\eta}_{ML} = \dfrac{1}{\kappa}\bar{X}$, the same as the one provided by the method of the moments.

(c) Since $\mu = E(X) = \kappa\eta$ and $\sigma^2 = Var(X) = \kappa\eta(1-\eta)$, an estimator of η induces estimators of μ and σ² by applying the plug-in principle: $\hat{\mu} = \kappa\hat{\eta} = \bar{X}$ and $\hat{\sigma}^2 = \kappa\hat{\eta}(1-\hat{\eta})$. With the data given, $\bar{x} = (4+4+3+5+6)/5 = 4.4$, so $\hat{\eta} = 4.4/10 = 0.44$, $\hat{\mu} = 4.4$ and $\hat{\sigma}^2 = 10\cdot 0.44\cdot 0.56 = 2.464$.
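These estimates can be computed with R (a small sketch; the object names are ours):

kappa = 10
x = c(4, 4, 3, 5, 6)
eta = mean(x) / kappa              # 0.44
mu = kappa * eta                   # 4.4, equal to mean(x)
sigma2 = kappa * eta * (1 - eta)   # 2.464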
Conclusion: We can see that for the binomial population the two methods provide the same estimator of η.
The value of κ must be known to use the expression obtained. In this particular case, the value 0.44 indicates
that, for each underlying trial (Bernoulli variable), one value seems slightly more probable than the other. On the
other hand, the quality of the estimator obtained should be studied, especially if the two methods had provided
different estimators. As a particular case, κ = 1 gives the Bernoulli distribution.
My notes:
Exercise 2pe-m
A random quantity X is supposed to follow a geometric distribution with parameter η. Let X = (X1,...,Xn) be a simple random sample.
A) Apply the method of the moments to find an estimator of the parameter η.
B) Apply the maximum likelihood method to find an estimator of the parameter η.
C) Given a sample of size n = 27 such that $\sum_{j=1}^{27} x_j = 134$, apply the formulas obtained in the two previous sections
to give final estimates of η. Finally, give estimates of the mean and the variance of X.
Discussion: This statement is mathematical, although we are given some data in the last section. The
random variable X is dimensionless. For the geometric distribution, $\mu = E(X) = 1/\eta$ and $\sigma^2 = Var(X) = (1-\eta)/\eta^2$.
(See the appendixes to see how the mean and the variance of this distribution can be calculated.)
a1) Population and sample moments: The population distribution has only one parameter, so one equation
suffices. The first-order moments of the model X and the sample x are, respectively,
$$\mu_1(\eta) = E(X) = \frac{1}{\eta} \quad\text{and}\quad m_1(x_1,x_2,\ldots,x_n) = \frac{1}{n}\sum_{j=1}^n x_j = \bar{x}$$
a2) System of equations: Since the parameter of interest η appears in the first-order moment of X, the first
equation suffices:
$$\mu_1(\eta) = m_1(x_1,\ldots,x_n) \;\rightarrow\; \frac{1}{\eta} = \frac{1}{n}\sum_{j=1}^n x_j = \bar{x} \;\rightarrow\; \eta = \left(\frac{1}{n}\sum_{j=1}^n x_j\right)^{-1} = \frac{1}{\bar{x}}$$
a3) The estimator:
$$\hat{\eta}_M = \left(\frac{1}{n}\sum_{j=1}^n X_j\right)^{-1} = \frac{1}{\bar{X}}$$
b1) Likelihood function: For the geometric distribution, the mass function is $f(x;\eta) = \eta(1-\eta)^{x-1}$, so

$$L(x_1,\ldots,x_n;\eta) = \prod_{j=1}^n f(x_j;\eta) = \eta(1-\eta)^{x_1-1}\cdots\eta(1-\eta)^{x_n-1} = \eta^n\,(1-\eta)^{\left(\sum_{j=1}^n x_j\right)-n}$$
b2) Optimization problem: The logarithm function is applied to make calculations easier,

$$\log L(x_1,\ldots,x_n;\eta) = \log(\eta^n) + \log\left[(1-\eta)^{\left(\sum_{j=1}^n x_j\right)-n}\right] = n\log(\eta) + \left[\left(\sum_{j=1}^n x_j\right)-n\right]\log(1-\eta)$$
The population distribution has only one parameter, so a one-dimensional function must be maximized. To find
the local or relative extreme values, the necessary condition is:

$$0 = \frac{d}{d\eta}\log L(x_1,\ldots,x_n;\eta) = \frac{n}{\eta} - \left[\left(\sum_{j=1}^n x_j\right)-n\right]\frac{1}{1-\eta} \;\rightarrow\; \frac{n}{\eta} = \frac{\left(\sum_{j=1}^n x_j\right)-n}{1-\eta}$$

$$\rightarrow\; n - n\eta = \eta\sum_{j=1}^n x_j - \eta n \;\rightarrow\; n = \eta\sum_{j=1}^n x_j \;\rightarrow\; \eta_0 = \frac{n}{\sum_{j=1}^n x_j} = \frac{1}{\bar{x}}$$
To verify that the only candidate is a (local) maximum, the sufficient condition is:

$$\frac{d^2}{d\eta^2}\log L(x_1,\ldots,x_n;\eta) = -\frac{n}{\eta^2} - \left[\left(\sum_{j=1}^n x_j\right)-n\right]\frac{1}{(1-\eta)^2} < 0$$

as $\left(\sum_{j=1}^n x_j\right)-n \ge 0$ (note that $x_j \ge 1$). This holds for any value, including $\eta_0 = \dfrac{1}{\bar{x}}$.
b3) The estimator:

$$\hat{\eta}_{ML} = \left(\frac{1}{n}\sum_{j=1}^n X_j\right)^{-1} = \frac{1}{\bar{X}}$$
C) Estimation of η, μ and σ²

Since n = 27 and $\sum_{j=1}^{27} x_j = 134$:

From the method of the moments: $\hat{\eta}_M = \dfrac{1}{\bar{x}} = \dfrac{27}{134} = 0.201$.

From the maximum likelihood method, as the same estimator was obtained: $\hat{\eta}_{ML} = 0.201$.

Since $\mu = E(X) = 1/\eta$, an estimator of η induces an estimator of μ:

From the method of the moments: $\hat{\mu}_M = \dfrac{1}{\hat{\eta}_M} = \dfrac{134}{27} = 4.96$.

From the maximum likelihood method, since the same estimator was obtained: $\hat{\mu}_{ML} = 4.96$.
Note: From the numerical point of view, calculating 134/27 is expected to have smaller error than calculating 1/0.201.
Finally, since $\sigma^2 = Var(X) = \dfrac{1-\eta}{\eta^2}$, the plug-in principle gives, with both methods,

$$\hat{\sigma}^2 = \frac{1-\hat{\eta}}{\hat{\eta}^2} = \frac{1-0.201}{0.201^2} = 19.8$$
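These estimates can be computed with R (a small sketch; the object names are ours):

n = 27; sumx = 134
eta = n / sumx                 # 0.201
mu = 1 / eta                   # 4.96
sigma2 = (1 - eta) / eta^2     # about 19.8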
Conclusion: For the geometric model, the two methods provide the same estimator for η. We have used the
estimator of η to obtain an estimator of μ. On the other hand, the quality of the estimator obtained should be
studied, especially if the two methods had provided different estimators.
My notes:
Exercise 3pe-m
A real-world variable is modeled by using a random variable X that follows a Poisson distribution. Given a
simple random sample of size n,
A) Apply the method of the moments to obtain an estimator of the parameter λ.
B) Apply the maximum likelihood method to obtain an estimator of the parameter λ.
C) Use these estimators to build estimators of the mean μ and the variance σ2 of the distribution.
(See the appendixes to see how the mean and the variance of this distribution can be calculated.)
a1) Population and sample moments: The population distribution has only one parameter, so one equation
suffices. The first-order moments of the model X and the sample x are, respectively,
$$\mu_1(\lambda) = E(X) = \lambda \quad\text{and}\quad m_1(x_1,x_2,\ldots,x_n) = \frac{1}{n}\sum_{j=1}^n x_j = \bar{x}$$
a2) System of equations: Since the parameter of interest λ appears in the first-order moment of X, the first
equation suffices. The system has only one trivial equation:

$$\mu_1(\lambda) = m_1(x_1,x_2,\ldots,x_n) \;\rightarrow\; \lambda = \frac{1}{n}\sum_{j=1}^n x_j = \bar{x}$$

a3) The estimator: $\hat{\lambda}_M = \bar{X}$
b1) Likelihood function: For the Poisson distribution the mass function is $f(x;\lambda) = \dfrac{\lambda^x}{x!}e^{-\lambda}$, so

$$L(x_1,\ldots,x_n;\lambda) = \prod_{j=1}^n f(x_j;\lambda) = \frac{\lambda^{x_1}}{x_1!}e^{-\lambda}\cdot\frac{\lambda^{x_2}}{x_2!}e^{-\lambda}\cdots\frac{\lambda^{x_n}}{x_n!}e^{-\lambda} = \frac{\lambda^{\sum_{j=1}^n x_j}}{\prod_{j=1}^n x_j!}\,e^{-n\lambda}$$
b2) Optimization problem: The logarithm function is applied to make calculations easier:

$$\log L(x_1,\ldots,x_n;\lambda) = \left(\sum_{j=1}^n x_j\right)\log(\lambda) - n\lambda - \log\left(\prod_{j=1}^n x_j!\right)$$
The population distribution has only one parameter, so a one-dimensional function must be maximized. To find
the local extreme values, the necessary condition is:

$$0 = \frac{d}{d\lambda}\log L(x_1,\ldots,x_n;\lambda) = \left(\sum_{j=1}^n x_j\right)\frac{1}{\lambda} - n \;\rightarrow\; \lambda_0 = \frac{1}{n}\sum_{j=1}^n x_j = \bar{x}$$
To verify that the only candidate is a (local) maximum, the sufficient condition is:

$$\frac{d^2}{d\lambda^2}\log L(x_1,\ldots,x_n;\lambda) = -\left(\sum_{j=1}^n x_j\right)\frac{1}{\lambda^2} < 0$$

since $x\in\{0,1,2,\ldots\}$ and hence $\sum_{j=1}^n x_j \ge 0$. Then, the second derivative is always negative, also for $\lambda_0$.
b3) The estimator: For λ, it is obtained after substituting the lower-case letters xⱼ (numbers representing THE
sample we have) by upper-case letters Xⱼ (random variables representing ANY possible sample we may have):

$$\hat{\lambda}_{ML} = \frac{1}{n}\sum_{j=1}^n X_j = \bar{X}$$
C) Estimation of μ and σ²

To obtain estimators of the mean and the variance, we take into account that for this model $\mu = E(X) = \lambda$
and $\sigma^2 = Var(X) = \lambda$, so by applying the plug-in principle:

$$\hat{\mu} = \hat{\lambda} = \bar{X} \qquad\text{and}\qquad \hat{\sigma}^2 = \hat{\lambda} = \bar{X}$$
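The behaviour of this estimator can be checked with a small simulation in R (a sketch; the value λ = 2.5 and the object names are arbitrary choices of ours):

set.seed(1)
x = rpois(1000, lambda = 2.5)  # a hypothetical simple random sample
mean(x)                        # estimate of lambda and, for this model, of mu and sigma^2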
Conclusion: For the Poisson model, the two methods provide the same estimator for λ, and therefore for μ
and σ2 (when the plug-in principle is applied). On the other hand, the quality of the estimator obtained should
be studied (though the sample mean is a well-known estimator).
My notes:
Exercise 4pe-m
A random variable X follows the normal distribution. Let X = (X1,...,Xn) be a simple random sample of X (seen
as the population). To obtain an estimator of the parameters θ = (μ,σ), apply:
(A) The method of the moments (B) The maximum likelihood method
A) Method of the moments: The system of the first two equations is

$$\begin{cases} \mu_1(\mu,\sigma) = m_1(x_1,\ldots,x_n) \\ \mu_2(\mu,\sigma) = m_2(x_1,\ldots,x_n) \end{cases} \;\rightarrow\; \begin{cases} \mu = \dfrac{1}{n}\displaystyle\sum_{j=1}^n x_j = \bar{x} \\ \sigma^2+\mu^2 = \dfrac{1}{n}\displaystyle\sum_{j=1}^n x_j^2 \end{cases} \;\rightarrow\; \begin{cases} \mu = \bar{x} \\ \sigma^2 = \dfrac{1}{n}\displaystyle\sum_{j=1}^n x_j^2 - \bar{x}^2 = s_x^2 \end{cases}$$

where $Var(X) = E(X^2) - E(X)^2$ and $s_x^2 = \dfrac{1}{n}\displaystyle\sum_{j=1}^n x_j^2 - \left(\dfrac{1}{n}\displaystyle\sum_{j=1}^n x_j\right)^2 = \bar{x^2} - \bar{x}^2$ have been used.
The estimator:

$$\hat{\theta}_M = \begin{cases} \hat{\mu}_M = \bar{X} \\ \hat{\sigma}_M = s_X \end{cases}$$
B) Maximum likelihood method: For the normal distribution the density function is $f(x;\mu,\sigma) = \dfrac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{(x-\mu)^2}{2\sigma^2}}$, so the likelihood function is

$$L(x_1,\ldots,x_n;\mu,\sigma) = \prod_{j=1}^n \frac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{(x_j-\mu)^2}{2\sigma^2}} = \left(\frac{1}{\sigma\sqrt{2\pi}}\right)^n e^{-\frac{1}{2\sigma^2}\sum_{j=1}^n (x_j-\mu)^2}$$
Maximum: The population distribution has two parameters, so it is necessary to maximize a
two-dimensional function. To discover the local extreme values, the necessary conditions are:
$$\begin{cases} \dfrac{\partial}{\partial\mu}\log L = 0 \\[6pt] \dfrac{\partial}{\partial\sigma}\log L = 0 \end{cases} \;\rightarrow\; \begin{cases} -\dfrac{1}{2\sigma^2}\displaystyle\sum_{j=1}^n 2(x_j-\mu)(-1) = 0 \\[6pt] -\dfrac{n}{\sigma} + \dfrac{1}{\sigma^3}\displaystyle\sum_{j=1}^n (x_j-\mu)^2 = 0 \end{cases} \;\rightarrow\; \begin{cases} \displaystyle\sum_{j=1}^n (x_j-\mu) = 0 \\[6pt] \displaystyle\sum_{j=1}^n (x_j-\mu)^2 = n\sigma^2 \end{cases}$$

$$\rightarrow\; \begin{cases} \displaystyle\sum_{j=1}^n x_j = n\mu \\[6pt] \sigma^2 = \dfrac{1}{n}\displaystyle\sum_{j=1}^n (x_j-\mu)^2 \end{cases} \;\rightarrow\; \begin{cases} \mu = \bar{x} \\[4pt] \sigma^2 = \dfrac{1}{n}\displaystyle\sum_{j=1}^n (x_j-\bar{x})^2 = s_x^2 \end{cases} \;\rightarrow\; \begin{cases} \mu = \bar{x} \\ \sigma = s_x \end{cases}$$
To verify that the only candidate is a (local) maximum, the sufficient conditions on the partial derivatives of
second order are:
$$A = \frac{\partial^2}{\partial\mu^2}\log L = \frac{\partial}{\partial\mu}\left[\frac{1}{\sigma^2}\sum_{j=1}^n (x_j-\mu)\right] = \frac{1}{\sigma^2}\sum_{j=1}^n (-1) = -\frac{n}{\sigma^2}$$

$$B = \frac{\partial^2}{\partial\mu\,\partial\sigma}\log L = \frac{\partial}{\partial\sigma}\left[\frac{1}{\sigma^2}\sum_{j=1}^n (x_j-\mu)\right] = -\frac{2}{\sigma^3}\sum_{j=1}^n (x_j-\mu)$$

$$C = \frac{\partial^2}{\partial\sigma^2}\log L = \frac{\partial}{\partial\sigma}\left[-\frac{n}{\sigma} + \frac{1}{\sigma^3}\sum_{j=1}^n (x_j-\mu)^2\right] = \frac{n}{\sigma^2} - \frac{3}{\sigma^4}\sum_{j=1}^n (x_j-\mu)^2$$
To calculate D = B² − AC, substituting the pair $(\mu,\sigma) = (\bar{x},s_x)$ in A, B and C simplifies the work:

$$A\big|_{(\bar{x},s_x)} = -\frac{n}{s_x^2} < 0 \qquad B\big|_{(\bar{x},s_x)} = -\frac{2}{s_x^3}\sum_{j=1}^n (x_j-\bar{x}) = 0 \qquad C\big|_{(\bar{x},s_x)} = \frac{n}{s_x^2} - \frac{3}{s_x^4}\,n s_x^2 = -\frac{2n}{s_x^2}$$

$$D\big|_{(\bar{x},s_x)} = 0^2 - \left(-\frac{n}{s_x^2}\right)\left(-\frac{2n}{s_x^2}\right) = -\frac{2n^2}{s_x^4} < 0$$

as $\sum_{j=1}^n (x_j-\bar{x}) = \left(\sum_{j=1}^n x_j\right) - n\bar{x} = 0$ and $\sum_{j=1}^n (x_j-\bar{x})^2 = n s_x^2$. Then,
$\log L(\mathbf{x};\mu,\sigma)$ has a maximum at $(\mu,\sigma) = (\bar{x},s_x)$, since it is a local extreme value and D < 0, A < 0.
$$\hat{\theta}_{ML} = \begin{cases} \hat{\mu}_{ML} = \bar{X} \\ \hat{\sigma}_{ML} = s_X \end{cases}$$
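With data, the maximum likelihood estimates can be computed in R; note that sd() uses the denominator n−1, while the estimator obtained here uses n (a sketch; the sample and the object names are arbitrary choices of ours):

set.seed(1)
x = rnorm(100, mean = 5, sd = 2)          # a hypothetical simple random sample
mu_hat = mean(x)
sigma_hat = sqrt(mean((x - mean(x))^2))   # s_x, with denominator n
c(mu_hat, sigma_hat)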
Conclusion: Since in this case there are two parameters, both the parameter and its estimator can be thought
of as two-dimensional quantities: $\theta = (\mu,\sigma)$ and $\hat{\theta} = (\hat{\mu},\hat{\sigma})$. On the other hand, the quality of the estimator
obtained should be studied, especially if the two methods had provided different estimators.
My notes:
Exercise 5pe-m

A probability distribution has

$$f(x;\theta) = \begin{cases} \dfrac{1}{\theta} & \text{if } x\in[0,\theta] \\[4pt] 0 & \text{otherwise} \end{cases}$$

as a density function. Let X = (X1,...,Xn) be a simple random sample of a population X following this
probability distribution.
A) Apply the method of the moments to find an estimator of the parameter θ.
B) Apply the maximum likelihood method to find an estimator of the parameter θ.
C) Use this estimator to build others for the mean and the variance of X.
Discussion: This statement is mathematical, and there is no supposition that would require justification. The
random variable X is dimensionless. We are given the density function of the distribution of X, though for this
distribution it could be deduced from the fact that all values have the same probability. For the general
continuous uniform distribution,
Note: If we had not remembered the first population moments, with the notation of this exercise we could do

$$E(X) = \int_{-\infty}^{+\infty} x f(x;\theta)\,dx = \int_0^\theta x\,\frac{1}{\theta}\,dx = \frac{1}{\theta}\left[\frac{x^2}{2}\right]_0^\theta = \frac{1}{\theta}\left(\frac{\theta^2}{2}-0\right) = \frac{\theta}{2}$$

$$E(X^2) = \int_{-\infty}^{+\infty} x^2 f(x;\theta)\,dx = \int_0^\theta x^2\,\frac{1}{\theta}\,dx = \frac{1}{\theta}\left[\frac{x^3}{3}\right]_0^\theta = \frac{1}{\theta}\left(\frac{\theta^3}{3}-0\right) = \frac{\theta^2}{3}$$

so

$$\mu = E(X) = \frac{\theta}{2} \quad\text{and}\quad \sigma^2 = Var(X) = E(X^2)-E(X)^2 = \frac{\theta^2}{3} - \frac{\theta^2}{4} = \frac{\theta^2}{12}$$
A) Method of the moments

a1) Population and sample moments: For uniform distributions, discrete or continuous, the mean is the
middle value. Then, the first-order moments of the distribution and of the sample are

$$\mu_1(\theta) = \frac{0+\theta}{2} = \frac{\theta}{2} \quad\text{and}\quad m_1(x_1,\ldots,x_n) = \frac{1}{n}\sum_{j=1}^n x_j = \bar{x}$$

a2) System of equations: Since θ appears in the first-order moment, the first equation suffices:

$$\mu_1(\theta) = m_1(x_1,\ldots,x_n) \;\rightarrow\; \frac{\theta}{2} = \bar{x} \;\rightarrow\; \theta_0 = 2\bar{x}$$

a3) The estimator: $\hat{\theta}_M = 2\bar{X}$

B) Maximum likelihood method

b1) Likelihood function: Since $f(x;\theta) = 1/\theta$ for $x\in[0,\theta]$,

$$L(x_1,\ldots,x_n;\theta) = \prod_{j=1}^n f(x_j;\theta) = \frac{1}{\theta^n} = \theta^{-n} \quad\text{when } 0\le x_j\le\theta,\ \forall j \text{ (and zero otherwise)}$$
b2) Optimization problem: First, we try to discover the maximum by applying the technique based on the
derivatives. The logarithm function is applied,

$$\log L(x_1,\ldots,x_n;\theta) = \log(\theta^{-n}) = -n\log(\theta),$$

and the first condition leads to a useless equation:

$$0 = \frac{d}{d\theta}\log L(x_1,\ldots,x_n;\theta) = -n\,\frac{1}{\theta} \;\rightarrow\; ?$$

Then, we realize that global minima and maxima cannot always be found through the derivatives (only if they
are also local extremes). In fact, it is easy to see that the function L monotonically decreases with θ, and
therefore monotonically increases when θ decreases (this pattern, or just the opposite, tends to happen when the
probability function changes monotonically with the parameter, e.g. when the parameter appears only once in
the expression). As a consequence, L has no local extreme values. Since, on the other hand, L increases as θ
decreases but the constraint $0\le x_j\le\theta,\ \forall j$, must hold, the likelihood is maximized at the smallest
admissible value:

$$\theta_0 = \max_j\{x_j\}$$
b3) The estimator: It is obtained after substituting the lower-case letters xj (numbers representing THE
sample we have) by upper-case letters Xj (random variables representing ANY possible sample we may have):
$$\hat{\theta}_{ML} = \max_j\{X_j\}$$
C) Estimation of μ and σ²

To obtain estimators of the mean, we take into account that $\mu = E(X) = \dfrac{\theta}{2}$ and apply the plug-in principle:

$$\hat{\mu}_M = \frac{\hat{\theta}_M}{2} = \frac{2\bar{X}}{2} = \bar{X} \qquad\qquad \hat{\mu}_{ML} = \frac{\hat{\theta}_{ML}}{2} = \frac{\max_j\{X_j\}}{2}$$

To obtain estimators of the variance, since $\sigma^2 = Var(X) = \dfrac{\theta^2}{12}$:

$$\hat{\sigma}^2_M = \frac{\hat{\theta}_M^2}{12} = \frac{(2\bar{X})^2}{12} = \frac{\bar{X}^2}{3} \qquad\qquad \hat{\sigma}^2_{ML} = \frac{\hat{\theta}_{ML}^2}{12} = \frac{\left(\max_j\{X_j\}\right)^2}{12}$$
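The two estimators of θ can be compared on simulated data with R (a sketch; the value θ = 4 and the object names are arbitrary choices of ours):

set.seed(1)
theta = 4
x = runif(50, min = 0, max = theta)
2 * mean(x)   # method-of-moments estimate of theta
max(x)        # maximum likelihood estimate of theta (always <= theta)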
Conclusion: For the uniform distribution, both methods provide different estimators of the parameter and
hence of the mean. The quality of the estimators obtained should be studied.
My notes:
Exercise 6pe-m

A random variable X follows the distribution with density function

$$f(x;\theta) = \begin{cases} 0 & \text{if } x < 3 \\[4pt] \dfrac{1}{\theta}\,e^{-\frac{x-3}{\theta}} & \text{if } x \ge 3 \end{cases}$$

for which $E(X) = 3+\theta$ and $E(X^2) = 2\theta^2+6\theta+9$. Given a simple random sample X = (X1,...,Xn), apply the
method of the moments and the maximum likelihood method to find estimators of the parameter θ, and use
them to build estimators of the mean and the variance of X.

Discussion: This statement is mathematical. The random variable X is supposed to be dimensionless. The
probability function and the first two moments are given, which is enough to apply the two methods. In the
last step, the plug-in principle will be applied.
Note: If E(X) had not been given in the statement, it could have been calculated by applying integration by parts (since polynomials and
exponentials are functions "of different type"):

$$E(X) = \int_{-\infty}^{+\infty} x f(x;\theta)\,dx = \int_3^\infty x\,\frac{1}{\theta}e^{-\frac{x-3}{\theta}}\,dx = \left[-x\,e^{-\frac{x-3}{\theta}}\right]_3^\infty - \int_3^\infty 1\cdot\left(-e^{-\frac{x-3}{\theta}}\right)dx$$

$$= \left[-x\,e^{-\frac{x-3}{\theta}} - \theta\,e^{-\frac{x-3}{\theta}}\right]_3^\infty = \left[-(x+\theta)\,e^{-\frac{x-3}{\theta}}\right]_3^\infty = 3+\theta.$$

That $\int u(x)\,v'(x)\,dx = u(x)\,v(x) - \int u'(x)\,v(x)\,dx$ has been used with
• $u = x \rightarrow u' = 1$
• $v' = \frac{1}{\theta}e^{-\frac{x-3}{\theta}} \rightarrow v = \int \frac{1}{\theta}e^{-\frac{x-3}{\theta}}\,dx = -e^{-\frac{x-3}{\theta}}$
On the other hand, $e^x$ changes faster than $x^k$ for any k. To calculate E(X²):

$$E(X^2) = \int_{-\infty}^{+\infty} x^2 f(x;\theta)\,dx = \int_3^\infty x^2\,\frac{1}{\theta}e^{-\frac{x-3}{\theta}}\,dx = \left[-x^2\,e^{-\frac{x-3}{\theta}}\right]_3^\infty + 2\int_3^\infty x\,e^{-\frac{x-3}{\theta}}\,dx$$

$$= (3^2-0) + 2\theta\int_3^\infty x\,\frac{1}{\theta}e^{-\frac{x-3}{\theta}}\,dx = 9 + 2\theta\mu = 9 + 2\theta(3+\theta) = 2\theta^2+6\theta+9.$$

Integration by parts has been applied again: $\int u(x)\,v'(x)\,dx = u(x)\,v(x) - \int u'(x)\,v(x)\,dx$ with
• $u = x^2 \rightarrow u' = 2x$
• $v' = \frac{1}{\theta}e^{-\frac{x-3}{\theta}} \rightarrow v = \int \frac{1}{\theta}e^{-\frac{x-3}{\theta}}\,dx = -e^{-\frac{x-3}{\theta}}$
Again, $e^x$ changes faster than $x^k$ for any k.
a1) Population and sample moments: There is only one parameter, so one equation suffices. The first-order
moments of the model X and the sample x are, respectively,

$$\mu_1(\theta) = E(X) = \theta+3 \quad\text{and}\quad m_1(x_1,x_2,\ldots,x_n) = \frac{1}{n}\sum_{j=1}^n x_j = \bar{x}$$

a2) System of equations: Since the parameter of interest θ appears in the first-order moment of X, the first
equation suffices:

$$\mu_1(\theta) = m_1(x_1,\ldots,x_n) \;\rightarrow\; \theta+3 = \bar{x} \;\rightarrow\; \theta_0 = \bar{x}-3 \qquad\text{and}\qquad \hat{\theta}_M = \bar{X}-3$$
b1) Likelihood function: For this probability distribution, the density function is $f(x;\theta) = \dfrac{1}{\theta}e^{-\frac{x-3}{\theta}}$, so

$$L(x_1,\ldots,x_n;\theta) = \prod_{j=1}^n f(x_j;\theta) = \frac{1}{\theta}e^{-\frac{x_1-3}{\theta}}\cdots\frac{1}{\theta}e^{-\frac{x_n-3}{\theta}} = \frac{1}{\theta^n}\,e^{-\frac{1}{\theta}\sum_{j=1}^n (x_j-3)}$$
b2) Optimization problem: The logarithm function is applied to make calculations easier,

$$\log L(x_1,\ldots,x_n;\theta) = \log(\theta^{-n}) - \frac{1}{\theta}\sum_{j=1}^n (x_j-3) = -n\log(\theta) - \frac{1}{\theta}\sum_{j=1}^n (x_j-3)$$
The population distribution has only one parameter, so a one-dimensional function must be maximized. To find
the local or relative extreme values, the necessary condition is:

$$0 = \frac{d}{d\theta}\log L(x_1,\ldots,x_n;\theta) = -\frac{n}{\theta} + \frac{1}{\theta^2}\sum_{j=1}^n (x_j-3) \;\rightarrow\; \frac{n}{\theta} = \frac{1}{\theta^2}\sum_{j=1}^n (x_j-3)$$

$$\rightarrow\; \theta = \frac{1}{n}\sum_{j=1}^n (x_j-3) = \frac{1}{n}\sum_{j=1}^n x_j - \frac{1}{n}\sum_{j=1}^n 3 = \bar{x}-3 \;\rightarrow\; \theta_0 = \bar{x}-3$$
To verify that the only candidate is a (local) maximum, the sufficient condition is:

$$\frac{d^2}{d\theta^2}\log L(x_1,\ldots,x_n;\theta) = \frac{d}{d\theta}\left[-\frac{n}{\theta} + \frac{1}{\theta^2}\sum_{j=1}^n (x_j-3)\right] = \frac{n}{\theta^2} - \frac{2}{\theta^3}\sum_{j=1}^n (x_j-3) \;\overset{?}{<}\; 0$$

The first term is always positive but the second is always negative, so we had better substitute the candidate
$\theta_0 = \bar{x}-3$, for which $\sum_{j=1}^n (x_j-3) = n(\bar{x}-3) = n\theta_0$:

$$\left.\frac{d^2}{d\theta^2}\log L\right|_{\theta_0} = \frac{n}{\theta_0^2} - \frac{2}{\theta_0^3}\,n\theta_0 = \frac{n}{\theta_0^2} - \frac{2n}{\theta_0^2} = -\frac{n}{\theta_0^2} < 0$$
b3) The estimator:
$$\hat{\theta}_{ML} = \bar{X}-3$$
C) Estimation of μ and σ²
c1) For the mean: By using the expression of μ and the plug-in principle,

From the method of the moments: $\hat{\mu}_M = \hat{\theta}_M + 3 = \bar{X}-3+3 = \bar{X}$.

From the maximum likelihood method, as the same estimator was obtained: $\hat{\mu}_{ML} = \bar{X}$.

c2) For the variance: We must write it in terms of the first two moments of X,

$$\sigma^2 = Var(X) = E(X^2)-E(X)^2 = 2\theta^2+6\theta+9-(\theta+3)^2 = 2\theta^2+6\theta+9-\theta^2-6\theta-9 = \theta^2$$

Then,

From the method of the moments: $\hat{\sigma}^2_M = \hat{\theta}_M^2 = (\bar{X}-3)^2 = \bar{X}^2-6\bar{X}+9$.

From the maximum likelihood method: $\hat{\sigma}^2_{ML} = \hat{\theta}_{ML}^2 = (\bar{X}-3)^2 = \bar{X}^2-6\bar{X}+9$.
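The estimates can be computed on simulated data with R; a sample from this model is 3 plus an exponential variable with mean θ (a sketch; the value θ = 2 and the object names are arbitrary choices of ours):

set.seed(1)
theta = 2
x = 3 + rexp(100, rate = 1/theta)   # a hypothetical sample from the shifted model
theta_hat = mean(x) - 3             # estimate of theta (both methods)
c(theta_hat, theta_hat^2)           # estimates of theta and of sigma^2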
Conclusion: For this model, the two methods provide the same estimator. We have used the estimator of θ
to obtain estimators of μ and σ². The quality of the estimator obtained should be studied, especially if the two
methods had provided different estimators.
In fact, the distribution with probability function $f(x;\theta) = \frac{1}{\theta}e^{-\frac{x-\delta}{\theta}}$, $x > \delta$ (and zero elsewhere), is termed the
two-parameter exponential distribution. It is a translation of size δ of the usual exponential distribution. A
particular, simple case is obtained for θ = 1 and δ = 0, since then $f(x) = e^{-x}$, $x > 0$.
My notes:
Exercise 7pe-m

A random quantity X is supposed to follow a distribution whose probability function is, for θ > 0,

$$f(x;\theta) = \begin{cases} \dfrac{3x^2}{\theta^3} & \text{if } 0\le x\le\theta \\[4pt] 0 & \text{otherwise} \end{cases}$$

Apply the method of the moments and the maximum likelihood method to find estimators of θ, and use them
to build estimators of the mean and the variance of X.
Discussion: This statement is mathematical. The random variable X is supposed to be dimensionless. The
probability function and the first two moments are given, which is enough to apply the two methods. In the
last step, the plug-in principle will be applied.
Note: If E(X) had not been given in the statement, it could have been calculated by integrating:

$$E(X) = \int_{-\infty}^{+\infty} x f(x;\theta)\,dx = \int_0^\theta x\,\frac{3x^2}{\theta^3}\,dx = \frac{3}{\theta^3}\left[\frac{x^4}{4}\right]_0^\theta = \frac{3}{4}\theta$$

On the other hand, if Var(X) had not been given in the statement, it could have been calculated by using a property and integrating:

$$E(X^2) = \int_{-\infty}^{+\infty} x^2 f(x;\theta)\,dx = \int_0^\theta x^2\,\frac{3x^2}{\theta^3}\,dx = \frac{3}{\theta^3}\left[\frac{x^5}{5}\right]_0^\theta = \frac{3}{5}\theta^2.$$

Now,

$$\mu = E(X) = \frac{3}{4}\theta \quad\text{and}\quad \sigma^2 = Var(X) = E(X^2)-E(X)^2 = \frac{3}{5}\theta^2 - \left(\frac{3}{4}\theta\right)^2 = \left(\frac{3}{5}-\frac{9}{16}\right)\theta^2 = \frac{3}{80}\theta^2.$$
A) Method of the moments

a1) Population and sample moments: There is only one parameter, so one equation suffices. The first-order
moments of the model X and the sample x are, respectively,

$$\mu_1(\theta) = E(X) = \frac{3}{4}\theta \quad\text{and}\quad m_1(x_1,\ldots,x_n) = \frac{1}{n}\sum_{j=1}^n x_j = \bar{x}$$

a2) System of equations: Since the parameter of interest θ appears in the first-order moment of X, the first
equation suffices:

$$\mu_1(\theta) = m_1(x_1,x_2,\ldots,x_n) \;\rightarrow\; \frac{3}{4}\theta = \frac{1}{n}\sum_{j=1}^n x_j = \bar{x} \;\rightarrow\; \theta_0 = \frac{4}{3}\bar{x}$$
a3) The estimator:

$$\hat{\theta}_M = \frac{4}{3}\bar{X}$$
3
B) Maximum likelihood method

b1) Likelihood function: For this distribution,

$$L(x_1,\ldots,x_n;\theta) = \prod_{j=1}^n \frac{3x_j^2}{\theta^3} = \frac{3^n\prod_{j=1}^n x_j^2}{\theta^{3n}} \quad\text{when } 0\le x_j\le\theta,\ \forall j \text{ (and zero otherwise)}$$

b2) Optimization problem: Now, if we try to find the maximum by looking at the first-order derivative of the
logarithm, a useless equation is obtained:

$$0 = \frac{d}{d\theta}\log L(x_1,\ldots,x_n;\theta) = -3n\,\frac{1}{\theta} \;\rightarrow\; ?$$

Then, we realize that global minima and maxima cannot in general be found through the derivatives (only if
they are also local). It is easy to see that the function L monotonically increases when θ decreases (this pattern,
or just the opposite, tends to happen when the probability function changes monotonically with the parameter,
e.g. when the parameter appears only once in the expression). As a consequence, L has no local extreme
values. On the other hand, $0\le x_j\le\theta,\ \forall j$, so

$$\theta_0 = \max_j\{x_j\} \qquad\text{and}\qquad \hat{\theta}_{ML} = \max_j\{X_j\}$$
C) Estimation of μ and σ²

c1) For the mean: By using the expression of μ and the plug-in principle,

From the method of the moments: $\hat{\mu}_M = \frac{3}{4}\hat{\theta}_M = \frac{3}{4}\cdot\frac{4}{3}\bar{X} = \bar{X}$.

From the maximum likelihood method: $\hat{\mu}_{ML} = \frac{3}{4}\hat{\theta}_{ML} = \frac{3}{4}\max_j\{X_j\}$.

c2) For the variance: By using that principle again,

From the method of the moments: $\hat{\sigma}^2_M = \frac{3}{80}\hat{\theta}_M^2 = \frac{3}{80}\left(\frac{4}{3}\bar{X}\right)^2 = \frac{1}{15}\bar{X}^2$.
From the maximum likelihood method: $\hat{\sigma}^2_{ML} = \frac{3}{80}\hat{\theta}_{ML}^2 = \frac{3}{80}\left(\max_j\{X_j\}\right)^2$.
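The two estimators can be compared on simulated data with R; since $F(x) = (x/\theta)^3$ on $[0,\theta]$, a sample can be generated by the inverse-transform method as $x = \theta u^{1/3}$ (a sketch; the value θ = 2 and the object names are arbitrary choices of ours):

set.seed(1)
theta = 2
x = theta * runif(100)^(1/3)   # inverse-transform sampling: F(x) = (x/theta)^3
(4/3) * mean(x)                # method-of-moments estimate of theta
max(x)                         # maximum likelihood estimate of theta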
Conclusion: For this model, the two methods provide different estimators. The quality of the estimators
obtained should be studied. We have used the estimator of θ to obtain estimators of μ and σ2.
My notes:
Remark 5pe: We do not usually use the definition of the mean square error but the result at the end of the following equalities:

$$MSE(\hat{\theta}) = E([\hat{\theta}-\theta]^2) = E([\hat{\theta}-E(\hat{\theta})+E(\hat{\theta})-\theta]^2) = E([\hat{\theta}-E(\hat{\theta})]^2) + [E(\hat{\theta})-\theta]^2 + 2\,E\big([\hat{\theta}-E(\hat{\theta})]\cdot[E(\hat{\theta})-\theta]\big)$$

$$= Var(\hat{\theta}) + b(\hat{\theta})^2 + 2\,[E(\hat{\theta})-\theta]\cdot\big(E(\hat{\theta})-E(\hat{\theta})\big) = Var(\hat{\theta}) + b(\hat{\theta})^2$$
Remark 6pe: To study the consistency in probability we have been taught a sufficient, but not necessary, condition that is
equivalent to the consistency in mean of order two (managing the definition is quite complex). Thus, this type of consistency is
proved when the condition is fulfilled, which is sufficient, but not necessary, for the consistency in probability. By using
Chebyshev's inequality:

$$P(|\hat{\theta}-\theta| \ge \epsilon) \le \frac{E((\hat{\theta}-\theta)^2)}{\epsilon^2} = \frac{MSE(\hat{\theta})}{\epsilon^2} \;\rightarrow\; \lim_{n\to\infty} P(|\hat{\theta}-\theta| \ge \epsilon) \le \frac{\lim_{n\to\infty} MSE(\hat{\theta})}{\epsilon^2}$$

If the sufficient condition is not fulfilled, the estimator under study is not consistent in mean of order two, but it can still be
consistent in probability; this type of consistency should then be studied in a different way. Additionally, since $MSE(\hat{\theta})$, $b(\hat{\theta})^2$
and $Var(\hat{\theta})$ are nonnegative, the mean square error is zero if and only if the other two are zero at the same time, and vice versa.
The same happens for their limits. That is why we are allowed to split the limit of the mean square error into two limits.
Exercise 1pe-p
The efficiency (in lumens per watt, u) of light bulbs of a certain type has a population mean of 9.5u and a
standard deviation of 0.5u, according to production specifications. The specifications for a room in which
eight of these bulbs (the simple random sample) are to be installed call for the average efficiency of the eight
bulbs to exceed 10u. Find the probability that this specification for the room will be met, assuming that
efficiency measurements are normally distributed.
(From Mathematical Statistics with Applications, Mendenhall, W., D.D. Wackerly and R.L. Scheaffer, Duxbury Press.)
Identification of the variable and selection of the statistic: The variable is the efficiency of the
light bulbs, while the estimator is the sample mean of eight elements. Since the population is normal and the
two population parameters are known, we will consider the (dimensionless) statistic:

$$T(\mathbf{X};\mu) = \frac{\bar{X}-\mu}{\sqrt{\dfrac{\sigma^2}{n}}} \sim N(0,1)$$
2
Rewriting the event: Although in this case the sampling distribution of X is known, as X̄ ∼ N μ , σ ,
we need to standardize before consulting the table of the standard normal distribution:
(√ ) ( ) ( )
X
̄ −μ 10−μ 10−9.5 0.5 √ 8
̄ > 10)=P
P(X > =P T > =P T > =P ( T > √ 8) =0.0023
√ √ √ 0.5 2
2 2
σ σ 0.52
n n 8
where in this case the language R has been used:

1 - pnorm(sqrt(8), 0, 1)   # [1] 0.002338867
Conclusion: The production specifications will be met, for the room mentioned, with a probability of
0.0023, that is, they will hardly be met.
My notes:
Exercise 2pe-p
When a production process is working properly, the resistance of the components follows a normal
distribution with standard deviation 4.68u. A simple random sample with four components is taken. What is
the probability that the sample quasivariance will be bigger than 30u2?
Discussion: In this exercise, the supposition that the normal distribution reasonably explains the variable
resistance should be evaluated by using proper statistical techniques. The question involves S2. Again, it is
necessary to make the proper statistic appear, in order to use its sampling distribution.
Search for a known distribution: The quantity required is P(S² > 30). To calculate the probability of an
event, we need to know the distribution of the random quantity involved. In this case, we do not know the
sampling distribution of S², but since R follows a normal distribution we are allowed to use

$$T = \frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1} \qquad\text{so}\qquad P(S^2 > 30) = P\left(\frac{(n-1)S^2}{\sigma^2} > \frac{(4-1)\cdot 30}{4.68^2}\right) = P(T > 4.11)$$

Table of the χ² distribution: Since n−1 = 4−1 = 3, it is necessary to look at the third row.
The probabilities in the table are given for events of the form $P(T < x)$ (or $P(T \le x)$, as the distribution is
continuous), and therefore the complementary of the event must be considered:

$$P(T > 4.11) = 1 - P(T \le 4.11) = 1 - 0.75 = 0.25$$
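The same probability can be computed directly with R:

1 - pchisq((4 - 1) * 30 / 4.68^2, df = 3)   # about 0.25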
Conclusion: The probability of the event is 0.25. This means that S2 will sometimes take a value larger than
30u2, when evaluated at specific data x coming from the mentioned distribution.
My notes:
Exercise 3pe-p
A simple random sample of 270 homes was taken from a large population of older homes to estimate the
proportion of homes with unsafe wiring. If, in fact, 20% of homes have unsafe wiring, what is the probability
that the sample proportion will be between 16% and 24%?
Hint: Since probabilities and proportions are measured in a 0-to-1 scale, write all quantities in this scale.
(From Statistics for Business and Economics, Newbold, P., W.L. Carlson and B.M. Thorne, Pearson.)
LINGUISTIC NOTE (From: The Careful Writer: A Modern Guide to English Usage. Bernstein, T.M. Atheneum)
home, house. It is a tribute to the unquenchable sentimentalism of users of English that one of the matters of usage that seem to agitate
them the most is the use of home to designate a structure designed for residential purposes. Their contention is that what the builder erects
is a house and that the occupants then fashion it into a home.
That is, or at least was, basically true, but the distinction has become blurred. Nor is this solely the doing of the real estate
operators. They do, indeed, lure prospective buyers not with the thought of mere masonry but with glowing picture of comfort,
congeniality, and family collectivity that make a house into a home. But the prospective buyers are their co-conspirators; they, too, view
the premises not as a heap of stone and wood but as a potential abode.
There may be areas in which the words are not used interchangeably. In legal or quasi-legal terminology we speak of a “house and
lot,” not a “home and lot.” The police and fire departments usually speak of a robbery or a fire in a house, not a home, at Main Street and
First Avenue. And the individual most often buys a home, but sells his house (there, apparently, speaks sentiment again). But in most
areas the distinction between the words has become obfuscated. When a flood or a fire destroys a community, it wipes out not merely
houses but homes as well, and homes has come to be accepted in this sense. No one would discourage the sentimentalists from trying to
pry the two words apart, but it would be rash to predict much success for them.
Discussion: The information of this "real-world study" must be translated into the mathematical language.
Since there are two possible situations, each home can be "modeled" by using a Bernoulli variable. Although
the population is not normally distributed, the sample size is large enough to apply asymptotic results.
Identification of the variable and selection of the statistic: The variable having unsafe wiring can
take two possible values: 0 (not having unsafe wiring) and 1 (having it, if one wants to register or count this
fact). The theoretical proportion of older homes with unsafe wiring is known: η = 0.20 (20%). For this
framework, a large sample from a Bernoulli population with parameter η, we select the dimensionless,
asymptotic statistic:

$$T(\mathbf{X};\eta) = \frac{\hat{\eta}-\eta}{\sqrt{\dfrac{\eta(1-\eta)}{n}}} \;\overset{d}{\longrightarrow}\; N(0,1)$$
Rewriting the event: We are asked for the probability $P(0.16 < \hat{\eta} < 0.24)$, but to calculate it we need to
rewrite the event until making T appear:

$$P(0.16 < \hat{\eta} < 0.24) = P\left(\frac{0.16-\eta}{\sqrt{\frac{\eta(1-\eta)}{n}}} < \frac{\hat{\eta}-\eta}{\sqrt{\frac{\eta(1-\eta)}{n}}} < \frac{0.24-\eta}{\sqrt{\frac{\eta(1-\eta)}{n}}}\right)$$

$$= P\left(T < \frac{0.24-0.20}{\sqrt{\frac{0.20(1-0.20)}{270}}}\right) - P\left(T \le \frac{0.16-0.20}{\sqrt{\frac{0.20(1-0.20)}{270}}}\right) = P(T < 1.64) - P(T \le -1.64)$$
(In these calculations, we have standardized and then decomposed, but it is also possible to decompose and
then standardize.) Now, let us assume that we have a table of the standard normal distribution including
positive quantiles only. By using a simple plot of the density function of this distribution, it is easy to see
(look at the areas) that for the second probability $P(T \le -1.64) = P(T \ge +1.64) = 1 - P(T < +1.64)$, so

$$P(T < 1.64) - P(T \le -1.64) = P(T < 1.64) - [1 - P(T < 1.64)] = 2\cdot P(T < 1.64) - 1 = 2\cdot 0.9495 - 1 = 0.90.$$
Alternatively, by using the language R:

pnorm(1.64, 0, 1) - pnorm(-1.64, 0, 1)   # [1] 0.8989948
Conclusion: The probability of the event is 0.90, which means that the sample proportion of older homes
with unsafe wiring, calculated from the sample X = (X1,...,X270), will take a value between 0.16 and 0.24 with
this probability. As a percentage: the proportion of the 270 homes with unsafe wiring will be between 16%
and 24% with 90% certainty.
My notes:
Exercise 4pe-p
Simple random samples X = (X1,...,X11) and Y = (Y1,...,Y6) are taken from two independent populations

$$X \sim N(\mu_X = 1,\; \sigma_X^2 = 1) \quad\text{and}\quad Y \sim N(\mu_Y = 2,\; \sigma_Y^2 = 0.5)$$

Calculate or find:
(1) The probability $P(S_Y^2 \le 1.5)$.
(2) The value c such that $P(\bar{X} > c) = 0.25$.
(3) The probability $P(\bar{X} - 0.1 > 0.1 + \bar{Y})$.
(4) The value c such that $P(S_X^2/S_Y^2 \le c) = 0.9$.
(*) The probability $P(\bar{X} - 0.1 > 0.1 - \bar{Y})$.
Discussion: There are two independent normal populations whose parameters are known. The variances, not
the standard deviations, are given. It is required to calculate probabilities or find quantiles for events involving
the sample means and the sample quasivariances. In the first two sections, only one of the populations is
involved. The sample sizes are 11 and 6, respectively. The variables X and Y are dimensionless, and so are both
sides of the inequalities.
(1) The event involves the estimator $S_Y^2$, which reminds us of the statistic $T = \dfrac{(n_Y-1)S_Y^2}{\sigma_Y^2} \sim \chi^2_{n_Y-1}$. Then,

$$P(S_Y^2 \le 1.5) = P\left(\frac{(n_Y-1)S_Y^2}{\sigma_Y^2} \le \frac{(n_Y-1)\cdot 1.5}{\sigma_Y^2}\right) = P\left(T \le \frac{(6-1)\cdot 1.5}{0.5}\right) = P\left(T \le \frac{5\cdot 1.5}{0.5}\right) = P(T \le 15) = 0.99$$
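With R:

pchisq(15, df = 5)   # about 0.99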
(2) The event involves $\bar{X}$, so we think about the statistic $T = \dfrac{\bar{X}-\mu_X}{\sqrt{\sigma_X^2/n_X}} \sim N(0,1)$. Then,

$$0.25 = P(\bar{X} > c) = P\left(\frac{\bar{X}-\mu_X}{\sqrt{\sigma_X^2/n_X}} > \frac{c-\mu_X}{\sqrt{\sigma_X^2/n_X}}\right) = P\left(T > \frac{c-1}{\sqrt{1/11}}\right)$$

or, equivalently,

$$1-0.25 = 0.75 = P\left(T \le \frac{c-1}{\sqrt{1/11}}\right)$$

Now, the quantile found in the table of the standard normal distribution must verify that

$$r_{0.25} = l_{0.75} = 0.674 = \frac{c-1}{\sqrt{1/11}} \;\rightarrow\; c = 0.674\sqrt{\frac{1}{11}} + 1 = 1.20$$
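With R:

qnorm(0.75) * sqrt(1/11) + 1   # about 1.20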
(3) To work with the means of two populations, we use $T = \dfrac{(\bar{X}-\bar{Y})-(\mu_X-\mu_Y)}{\sqrt{\dfrac{\sigma_X^2}{n_X}+\dfrac{\sigma_Y^2}{n_Y}}} \sim N(0,1)$, so

$$P(\bar{X}-0.1 > 0.1+\bar{Y}) = P(\bar{X}-\bar{Y} > 0.2) = P\left(\frac{(\bar{X}-\bar{Y})-(\mu_X-\mu_Y)}{\sqrt{\frac{\sigma_X^2}{n_X}+\frac{\sigma_Y^2}{n_Y}}} > \frac{0.2-(\mu_X-\mu_Y)}{\sqrt{\frac{\sigma_X^2}{n_X}+\frac{\sigma_Y^2}{n_Y}}}\right)$$

$$= P\left(T > \frac{0.2-(1-2)}{\sqrt{\frac{1}{11}+\frac{0.5}{6}}}\right) = P\left(T > \frac{1.2}{\sqrt{\frac{1}{11}+\frac{1}{12}}}\right) = P(T > 2.87) = 1-P(T \le 2.87) = 1-0.9979 = 0.0021$$
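With R:

1 - pnorm(2.87)   # about 0.0021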
(4) To work with the variances of two populations, $T = \dfrac{S_X^2/\sigma_X^2}{S_Y^2/\sigma_Y^2} = \dfrac{S_X^2\,\sigma_Y^2}{S_Y^2\,\sigma_X^2} \sim F_{n_X-1,\,n_Y-1}$ is used:

$$0.9 = P\left(\frac{S_X^2}{S_Y^2} \le c\right) = P\left(\frac{S_X^2\,\sigma_Y^2}{S_Y^2\,\sigma_X^2} \le c\,\frac{\sigma_Y^2}{\sigma_X^2}\right) = P\left(T \le c\,\frac{0.5}{1}\right) = P\left(T \le \frac{c}{2}\right)$$

The quantile found in the table of the distribution $F_{n_X-1,n_Y-1} = F_{11-1,6-1} = F_{10,5}$ is 3.30, which allows us to
find the unknown c:

$$r_{0.1} = l_{0.9} = 3.30 = \frac{c}{2} \;\rightarrow\; c = 6.60$$

qf(0.9, 10, 5)   # [1] 3.297402
(Advanced Item) In this case, allocating the two sample means on the first side of the inequality leads to

$$P(\bar{X}-0.1 > 0.1-\bar{Y}) = P(\bar{X}+\bar{Y} > 0.2)$$

We remember that

$$\bar{X} \sim N\!\left(\mu_X, \frac{\sigma_X^2}{n_X}\right) \quad\text{and}\quad \bar{Y} \sim N\!\left(\mu_Y, \frac{\sigma_Y^2}{n_Y}\right)$$

so the rules that govern the sums (and hence subtractions) of normally distributed variables imply both

$$\bar{X}-\bar{Y} \sim N\!\left(\mu_X-\mu_Y, \frac{\sigma_X^2}{n_X}+\frac{\sigma_Y^2}{n_Y}\right) \quad\text{and}\quad \bar{X}+\bar{Y} \sim N\!\left(\mu_X+\mu_Y, \frac{\sigma_X^2}{n_X}+\frac{\sigma_Y^2}{n_Y}\right)$$

(Note that in both cases the variances are added; uncertainty increases.) Although the difference is used more
frequently, to compare two populations, the sampling distribution of the sum of the sample means is also known
thanks to the rules for normal variables; alternatively, we could still use the first result by writing X̄+Ȳ = X̄−(−Ȳ)
and using that −Ȳ has mean −μ_Y and variance σ_Y²/n_Y. Either way, after standardizing,

$$T = \frac{(\bar{X}+\bar{Y})-(\mu_X+\mu_Y)}{\sqrt{\dfrac{\sigma_X^2}{n_X}+\dfrac{\sigma_Y^2}{n_Y}}} \sim N(0,1)$$

is the "mathematical tool" necessary to work with X̄+Ȳ. Now,

$$P(\bar{X}-0.1 > 0.1-\bar{Y}) = P(\bar{X}+\bar{Y} > 0.2) = P\left(T > \frac{0.2-(\mu_X+\mu_Y)}{\sqrt{\frac{\sigma_X^2}{n_X}+\frac{\sigma_Y^2}{n_Y}}}\right) = P\left(T > \frac{0.2-(1+2)}{\sqrt{\frac{1}{11}+\frac{1}{12}}}\right)$$

$$= P(T > -6.71) = 1-P(T \le -6.71) = 1$$

The quantile 6.71 is not usually in the tables of the N(0,1), so we can consider that $P(T \le -6.71) \approx 0$. Or, if
we use the programming language R:

1 - pnorm(-6.71, 0, 1)   # [1] 1
Conclusion: For each case, we have selected the appropriate statistic. After completing the expression of the
event, the statistic T appears. Then, since the (sampling) distribution of T is known, the tables can be used to
calculate probabilities or to find quantiles. In the latter case, the unknown c is found from the quantile of T.
Exercise 5pe-p
Suppose that you manage a bank where the amounts of daily deposits and daily withdrawals are given by
independent random variables with normal distributions. For deposits, the mean is ₤12,000 and the standard
deviation is ₤4,000; for withdrawals, the mean is ₤10,000 and the standard deviation is ₤5,000.
(a) For a week, calculate or bind the probability that the five withdrawals will add up to more than
₤55,000.
(b) For a particular day, calculate or bind the probability that withdrawals will exceed deposits by more
than ₤5,000.
Imagine that you are to launch a new monthly product. A prospective study indicated that profits (in million
pounds) can be modeled through the random quantity Q = (X+1)/2.325, where X follows a t distribution with
twenty degrees of freedom.
(c) For a particular month, calculate or bind the probability that profits will be smaller than ₤10⁶ (one
million pounds).
(Based on an exercise of Business Statistics, Douglas Downing and Jeffrey Clark, Barron's.)
Discussion: There are several suppositions implicit in the statement, namely: (i) the normal distribution can
reasonably be used to model the two variables of interest D and W; (ii) withdrawals and deposits are
independent; and (iii) X can reasonably be modeled by using the t distribution. These suppositions should
firstly be evaluated by using proper statistical techniques. To solve this exercise, the rules on sums and
differences of normally distributed variables must be used.
Identification of variables and distributions: If D and W represent the random variables daily sum
of deposits and daily sum of withdrawals, respectively, from the statement we have that

$$D \sim N(\mu_D = 12{,}000,\; \sigma_D^2 = 4{,}000^2) \quad\text{and}\quad W \sim N(\mu_W = 10{,}000,\; \sigma_W^2 = 5{,}000^2)$$

(in ₤ and ₤², respectively).
(a) Since the variables are measured daily, in a week we have five measurements (one for each working day).
Translation into the mathematical language: We are asked for the probability
5
P (W 1+ W 2+ W 3+W 4+ W 5 > 55,000)=P ( ∑ j =1 W j > 55,000)
Search for a known distribution: To calculate or bind this probability, we need to know the distribution of
the sum or, alternatively, to relate it to any quantity whose distribution we know. By using the rules that
govern the sums and subtractions of normal variables,
$$\sum_{j=1}^5 W_j \sim N(5\mu_W,\; 5\sigma_W^2)$$

Rewriting the event: We can easily rewrite the event in terms of the standardized version of this normal
distribution:

$$P\left(\sum_{j=1}^5 W_j > 55{,}000\right) = P\left(\frac{\sum_{j=1}^5 W_j - 5\mu_W}{\sqrt{5\sigma_W^2}} > \frac{55{,}000-5\mu_W}{\sqrt{5\sigma_W^2}}\right) = P\left(Z > \frac{55{,}000-50{,}000}{\sqrt{5\cdot 5{,}000^2}}\right) = P(Z > 0.4472)$$

Consulting the table: $P(Z > 0.4472)$ lies between $1-0.6736 = 0.3264$ and $1-0.6700 = 0.3300$, that is, it
is around 0.33.
Alternatively, we can rewrite the event in terms of the sample mean,

$$P\left(\sum_{j=1}^5 W_j > 55{,}000\right) = P\left(\frac{1}{5}\sum_{j=1}^5 W_j > \frac{55{,}000}{5}\right) = P(\bar{W} > 11{,}000)$$

and use that

$$\bar{W} = \frac{1}{5}\sum_{j=1}^5 W_j \sim N\!\left(\mu_W, \frac{\sigma_W^2}{5}\right) \;\rightarrow\; \frac{\bar{W}-\mu_W}{\sqrt{\sigma_W^2/5}} \sim N(0,1)$$
(b) Translation into the mathematical language: We are asked for the probability P(W > D + 5,000).

Search for a known distribution: To calculate or bind this probability, we rewrite the event until all random
quantities are on the left side of the inequality:

$$P(W > D+5{,}000) = P(W-D > 5{,}000)$$

Now we need to know the distribution of W − D or, alternatively, of a quantity involving this difference. By
again using the rules that govern the sums and differences of normal variables, it holds that

$$W-D \sim N(\mu_W-\mu_D,\; \sigma_W^2+\sigma_D^2) = N(-2{,}000,\; 5{,}000^2+4{,}000^2)$$

Rewriting the event: We can easily express the event in terms of the standardized version of W − D:

$$P(W-D > 5{,}000) = P\left(\frac{(W-D)-(\mu_W-\mu_D)}{\sqrt{\sigma_W^2+\sigma_D^2}} > \frac{5{,}000-(-2{,}000)}{\sqrt{25\cdot 10^6+16\cdot 10^6}}\right) = P\left(Z > \frac{7\cdot 10^3}{\sqrt{41\cdot 10^6}}\right) = P(Z > 1.0932)$$
Consulting the table: We can bind the probability as follows:

$$P(Z > 1.09) > P(Z > 1.0932) > P(Z > 1.10)$$
$$1-P(Z \le 1.09) > P(Z > 1.0932) > 1-P(Z \le 1.10)$$
$$1-0.8621 > P(Z > 1.0932) > 1-0.8643$$
$$0.1379 > P(Z > 1.0932) > 0.1357$$

Then, $0.1357 < P(W > D+5{,}000) < 0.1379$.
(c) Rewriting the event: The event can easily be rewritten in terms of this known distribution:

$$P\left(\frac{X+1}{2.325} < 1\right) = P(X+1 < 2.325) = P(X < 2.325-1) = P(X < 1.325)$$

Consulting the table: Finally, it is enough to consult the table of the t distribution. The quantity 1.325 is in
our table of lower-tail probabilities, so $P(X < 1.325) = 0.900$.
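The three probabilities can also be computed with R:

1 - pnorm((55000 - 5*10000) / sqrt(5 * 5000^2))              # (a): about 0.33
1 - pnorm((5000 - (10000 - 12000)) / sqrt(5000^2 + 4000^2))  # (b): about 0.14
pt(1.325, df = 20)                                           # (c): about 0.90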
Conclusion: For a week, the probability that the five withdrawals will add up to more than ₤55,000 is
around 0.33. For a particular day, the probability that withdrawals will exceed deposits by more than ₤5,000 is
around 0.13. For a particular month, the probability that profits will be smaller than one million pounds is
0.9, that is, quite high.
My notes:
Exercise 6pe-p
To study the mean of a population variable X, μ = E(X), a simple random sample of size n is considered.
Imagine that we do not trust the first and the last data, so we think about using the statistic
$$\tilde{X} = \frac{1}{n-2}\sum_{j=2}^{n-1} X_j = \frac{1}{n-2}\left(X_2+X_3+\cdots+X_{n-1}\right) = \frac{X_2+X_3+\cdots+X_{n-1}}{n-2}$$
Calculate the expectation and the variance of this statistic. Calculate the mean square error (MSE) and its
limit when n tends to infinity. Study the consistency. Compare the previous error with that of the ordinary
sample mean.
Discussion: The statement of this exercise is mathematical. Here we are interested in the mean. The quantity
X is dimensionless. We cannot apply the definitions directly; the mean and the variance of X̃ must be written
in terms of the mean and the variance of X by applying the basic properties of these measures.
Expectation and variance: The basic properties of the mean and the variance are applied:

$$E(\tilde{X}) = E\left(\frac{1}{n-2}(X_2+X_3+\cdots+X_{n-1})\right) = \frac{1}{n-2}\big(E(X)+\cdots+E(X)\big) = \frac{1}{n-2}(n-2)\mu = \mu$$

$$Var(\tilde{X}) = Var\left(\frac{1}{n-2}(X_2+X_3+\cdots+X_{n-1})\right) = \frac{1}{(n-2)^2}\sum_{j=2}^{n-1} Var(X_j) = \frac{1}{(n-2)^2}(n-2)\sigma^2 = \frac{\sigma^2}{n-2}$$
When n increases, that is, when the sample consists of more and more data, the limits are, respectively:

$$\lim_{n\to\infty} E(\tilde{X}) = \lim_{n\to\infty}\mu = \mu \quad\text{and}\quad \lim_{n\to\infty} Var(\tilde{X}) = \lim_{n\to\infty}\frac{\sigma^2}{n-2} = 0$$
Comparison of errors:

$$MSE(\tilde{X}) = \frac{\sigma^2}{n-2} \qquad\qquad MSE(\bar{X}) = \frac{\sigma^2}{n}$$

Since σ² appears in the two positive quantities, by looking at the coefficients it is easy to see that

$$MSE(\bar{X}) < MSE(\tilde{X})$$
(for n larger than 2). This result is due to the fact that the sample mean uses all the data available, though only
the number of data—not their quality, since all of them are supposed to follow the same distribution—is
considered in calculating the mean square error. In the limit, –2 is negligible. We can plot the coefficients
(they are also the mean square errors when σ=1).
# Grid of values for 'n'
n = seq(from=3,to=10,by=1)
# The two sequences of coefficients
coeff1 = 1/(n-2)
coeff2 = 1/n
# The plot
allValues = c(coeff1, coeff2)
yLim = c(min(allValues), max(allValues));
x11(); par(mfcol=c(1,3))
plot(n, coeff1, xlim=c(min(n),max(n)), ylim = yLim, xlab=' ', ylab=' ', main='Coefficients 1', type='l')
plot(n, coeff2, xlim=c(min(n),max(n)), ylim = yLim, xlab=' ', ylab=' ', main='Coefficients 2', type='b')
plot(n, coeff1, xlim=c(min(n),max(n)), ylim = yLim, xlab=' ', ylab=' ', main='All coefficients', type='l')
points(n, coeff2, type='b')
Conclusion: X̃ is a consistent estimator of μ, so it is appropriate for estimating μ. Nevertheless, when nothing
suggests removing data, it is better to keep them all in the sample.
Advanced theory: The estimator in the statement is the usual sample mean when the sample has n–2 data
instead of n (leaving out these two data can be seen as a sort of data treatment implemented in the method, not
in the previous analysis of data). When any of the two left out data is not trustable, using this estimator makes
sense; otherwise, it does not exploit the information available efficiently. On the other hand, the sample mean
can be affected by tiny or huge values (outliers). To make the sample mean robust, this estimator is sometimes
considered after ordering the data from the smallest to the largest; if X(j) is the j-th datum in the sample already
reordered:
$$\tilde{X} = \frac{1}{n-2}\sum_{j=2}^{n-1} X_{(j)} = \frac{1}{n-2}\left(X_{(2)}+X_{(3)}+\cdots+X_{(n-1)}\right)$$
This new robust estimator of the population mean μ is called trimmed sample mean, and any number of data
can be left out—not only two.
Exercise 7pe-p
A population variable X follows the χ2 distribution with κ degrees of freedom. We consider a statistic T that
uses the information contained in the simple random sample X = (X1, X2,...,Xn). If
T ( X )=T ( X 1 , X 2 , ... , X n )=2 X̄ −1 ,
calculate its expectation and variance. Calculate the mean square error of T. As an estimator of twice the
mean of the population law, is T a consistent estimator?
Hint: If X follows the χ2 distribution with κ degrees of freedom, μ = E(X) = κ and σ2 = Var(X) = 2κ.
Discussion: Even though a population is mentioned, this statement is mathematical. To calculate the value of
these two properties of the sampling distribution of T, we have to apply the general properties of the
expectation and the variance. The knowledge about the distribution of X will be used in the last steps. T is a
dimensionless quantity. The mean square error is defined in terms of these quantities.
Expectation or mean:

$$E(T(\mathbf{X})) = E\left(2\left[\frac{1}{n}\sum_{j=1}^n X_j\right]-1\right) = \frac{2}{n}E\left(\sum_{j=1}^n X_j\right) - E(1) = \frac{2}{n}\sum_{j=1}^n E(X_j) - 1 = \frac{2}{n}\,n\,E(X) - 1 = 2\kappa-1$$

(since μ = E(X) = κ).
Variance:
Var(T(X)) = Var( 2·(1/n) Σ_{j=1}^n X_j − 1 ) = Var( (2/n) Σ_{j=1}^n X_j ) = (4/n²) Var( Σ_{j=1}^n X_j ) = (4/n²) Σ_{j=1}^n Var(X_j) = (4/n²) n Var(X) = 8κ/n

(By the independence of the X_j —simple random sample— and since σ² = Var(X) = 2κ.)
Mean square error: Since b(T) = E(T) − 2E(X) = (2κ−1) − 2κ = −1, then

MSE(T) = b(T)² + Var(T) = 1 + 8κ/n → 1 (as n → ∞)
Consistency: Although the variance of T tends to zero when n increases, the bias does not (thus, T is asymptotically biased). Hence, the mean square error does not tend to zero either, and nothing can be said about the consistency in probability in this way (although we can say that T is not consistent in mean of order two).
Conclusion: Since the mean square error tends to 1, in general T is not a “good” estimator of 2μ even for
many data.
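A short Monte Carlo sketch of this behaviour (κ, the sample sizes and the number of replications are arbitrary choices, not part of the statement):
# Monte Carlo: the mean square error of T = 2*mean(X) - 1 tends to 1 + 8*kappa/n -> 1
kappa = 5; M = 10000
for (n in c(10, 100, 1000)) {
  T = replicate(M, 2*mean(rchisq(n, df=kappa)) - 1)
  print(c(n, mean((T - 2*kappa)^2), 1 + 8*kappa/n))   # empirical vs theoretical MSE
}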
My notes:
Discussion: This statement is basically mathematical. The relative efficiency is defined in terms of the mean
square error of the estimators.
E(μ̂₁) = E( (1/2)X₁ + (1/2)X₂ ) = (1/2)E(X₁) + (1/2)E(X₂) = (1/2)μ + (1/2)μ = μ

E(μ̂₂) = E( (1/3)X₁ + (2/3)X₂ ) = (1/3)E(X₁) + (2/3)E(X₂) = (1/3)μ + (2/3)μ = μ

Var(μ̂₁) = (1/4 + 1/4)σ² = σ²/2   Var(μ̂₂) = (1/9 + 4/9)σ² = (5/9)σ²
Conclusion: Both estimators are unbiased, while the first has smaller variance; then, the first is preferred. We have not mathematically proved that this first estimator minimizes the variance, so we cannot say that it is an efficient estimator.
Exercise 9pe-p
The mean μ = E(X) of any population can be estimated from a simple random sample of size n through X̄.
Prove that:
(a) This estimator is always consistent.
(b) For X normally distributed (normal population), this estimator is efficient.
Discussion: This statement is theoretical. The first section of this exercise needs calculations similar to
those of previous exercises. To prove the efficiency, we have to apply its definition.
(a) Consistency: The expectation of the sample mean is always—for any population—the population mean.
Nevertheless, we repeat the calculations:
E(X̄) = E( (1/n) Σ_{j=1}^n X_j ) = (1/n) E( Σ_{j=1}^n X_j ) = (1/n) Σ_{j=1}^n E(X_j) = (1/n) n E(X) = E(X) = μ
The variance of the sample mean is always—for any population—the population variance divided by n. We
repeat the calculations too:
(By the independence of the X_j —simple random sample—)

Var(X̄) = Var( (1/n) Σ_{j=1}^n X_j ) = (1/n²) Var( Σ_{j=1}^n X_j ) = (1/n²) Σ_{j=1}^n Var(X_j) = (1/n²) n Var(X) = σ²/n
The bias is b(X̄) = E(X̄) − μ = 0. We prove the consistency (in probability) by using the sufficient—but not necessary—condition (consistency in mean of order two):

lim_{n→∞} MSE(X̄) = lim_{n→∞} [ b(X̄)² + Var(X̄) ] = lim_{n→∞} [ 0 + σ²/n ] = 0
Then, it is consistent in mean of order two and therefore in probability.
(b) Efficiency: It is necessary to prove that the two conditions of the definition are fulfilled:
i. The expectation of X̄ is always μ = E(X), that is, X̄ is always an unbiased estimator of μ.
ii. X̄ has minimum variance, which happens—because of a theoretical result—when Var(X̄) attains the Cramér-Rao's lower bound

1 / ( n · E[ ( ∂log[f(X;θ)]/∂θ )² ] )

where θ = μ in this case, and f(x;θ) is the probability function of the population law where the nonrandom variable x is substituted by the random variable X (otherwise, it is not possible to talk about expectation, since f(x;θ) is not random when θ is a parameter).
The unbiasedness is proved. On the other hand, we compute the Cramér-Rao's lower bound step by step:
(1) Function (with X in place of x):

f(X;μ) = (1/√(2πσ²)) e^(−(X−μ)²/(2σ²))

(2)–(3) Logarithm and derivative:

log[f(X;μ)] = −(1/2)log(2πσ²) − (X−μ)²/(2σ²) → ∂log[f(X;μ)]/∂μ = (X−μ)/σ²

(4) Expectation:

E[ ( ∂log[f(X;μ)]/∂μ )² ] = E[ ( (X−μ)/σ² )² ] = (1/σ⁴) E[(X−μ)²] = (1/σ⁴) Var(X) = σ²/σ⁴ = 1/σ²
(5) Cramér-Rao's lower bound:

1 / ( n · E[ ( ∂log[f(X;μ)]/∂μ )² ] ) = 1 / ( n · (1/σ²) ) = σ²/n
The variance of the estimator, calculated in section (a), attains the bound and hence the estimator has minimum variance. Since both conditions are fulfilled, the efficiency is proved.
Conclusion: We have proved that the sample mean X is always—for any population—a consistent estimator
of the population mean μ. For a normal population, it is also efficient.
Advanced theory: When log[f(x;θ)] is twice differentiable with respect to θ, the Cramér-Rao's bound can
equivalently be written as
−1 / ( n · E[ ∂²log[f(X;θ)]/∂θ² ] )
Concerning the regularity conditions, Wikipedia refers (http://en.wikipedia.org/wiki/Fisher_information) to eq. (2.5.16) of Theory of Point Estimation, Lehmann, E. L. and G. Casella, 1998. Springer. Let us assume that this alternative expression can be applied; then, step (3) would be

∂²(log[f(X;μ)])/∂μ² = ∂/∂μ ( (X−μ)/σ² ) = (1/σ²)·(−1) = −1/σ²
step (4) would be
E[ ∂²log[f(X;μ)]/∂μ² ] = E[ −1/σ² ] = −1/σ²
and, finally, step (5) would be
−1 / ( n · E[ ∂²log[f(X;μ)]/∂μ² ] ) = −1 / ( n · (−1/σ²) ) = σ²/n
We would have obtained the same result with easier calculations, although the fulfillment of the regularity conditions must be verified beforehand.
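A quick numerical sketch of this result (the parameter values and sizes are arbitrary):
# Monte Carlo: the variance of the sample mean attains the Cramer-Rao bound sigma^2/n
mu = 1; sigma = 2; n = 30; M = 20000
xbar = replicate(M, mean(rnorm(n, mean=mu, sd=sigma)))
c(var(xbar), sigma^2/n)    # both values should be close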
My notes:
Discussion: This statement is mathematical. We should know the density function of the continuous uniform
distribution, although it could also be deduced from the fact that all possible values have the same probability.
The quantity X is dimensionless.
(a) Density function: For this distribution, all values have the same probability, so the density function must be a flat curve over its support. For the case θ > 2 the figure (omitted here) is similar for any other θ.
(b1) Bias: By applying a property of the sample mean and the information of the statement,

E(X̄) = E(X) = θ − 1/2 → b(X̄) = E(X̄) − θ = θ − 1/2 − θ = −1/2
→ lim_{n→∞} b(X̄) = lim_{n→∞} (−1/2) = −1/2
(It is asymptotically biased.) Since one condition of the pair is not verified, it is not necessary to check the other, and neither the fulfillment of the consistency in probability nor the opposite can be proved in this way (though the estimator is not consistent in the mean-square sense).
(c1) Unbiasedness: In the previous section it has been proved that X̄ is a biased estimator of θ. The first condition does not hold, and hence it is not necessary to check the second one. The conclusion is that X̄ is not an efficient estimator of θ.
(d1) Bias: By applying a property of the sample mean and the information of the statement,

E(θ̂) = E(X̄) + 1/2 = θ − 1/2 + 1/2 = θ → b(θ̂) = E(θ̂) − θ = θ − θ = 0 → lim_{n→∞} b(θ̂) = lim_{n→∞} 0 = 0
(d2) Variance: By applying a property of the sample mean and the information of the statement,

Var(θ̂) = Var( X̄ + 1/2 ) = Var(X̄) = Var(X)/n = 3/(4n) → lim_{n→∞} Var(θ̂) = lim_{n→∞} 3/(4n) = 0
As a conclusion, the mean square error (MSE) tends to zero and hence the proposed estimator θ̂ = X̄ + 1/2 is a consistent—in mean square error and hence in probability—estimator of θ.
Conclusion: We could prove neither the consistency nor the efficiency. Nevertheless, the bias has allowed
us to build an unbiased, consistent estimator of the parameter. The efficiency of this new estimator could be
studied, but it is not required in the statement.
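The moments used above (E(X) = θ − 1/2 and Var(X)/n = 3/(4n)) are consistent, for instance, with a continuous uniform distribution on (θ−2, θ+1); assuming that support, an assumption made here only for illustration, a simulation sketch is:
# Simulation sketch; the support (theta-2, theta+1) is an assumed example
theta = 5; n = 50; M = 10000
xbar = replicate(M, mean(runif(n, min=theta-2, max=theta+1)))
mean(xbar) - theta          # close to the bias -1/2
mean(xbar + 1/2) - theta    # close to 0: the corrected estimator is unbiased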
My notes:
Exercise 11pe-p
A population random quantity X is supposed to follow a geometric distribution. Let X = (X1,...,Xn) be a simple
random sample. By applying the factorization theorem below, find a sufficient statistic T(X) = T(X1,...,Xn) for
the parameter. Give explanations.
Discussion: The factorization theorem can be applied both to prove that a given statistic is sufficient and to
find sufficient statistics. On the other hand, for the distribution involved we know that
Likelihood function:
L(X;η) = ∏_{j=1}^n f(X_j;η) = f(X₁;η)·f(X₂;η)⋯f(X_n;η) = η(1−η)^(X₁−1) · η(1−η)^(X₂−1) ⋯ η(1−η)^(X_n−1)
Theorem (factorization): the statistic T(X) is sufficient for the parameter if and only if the likelihood function can be written as L(X;η) = g(T(X);η)·h(X), where h does not depend on the parameter.
We must try allocating each term of the likelihood function:
➔ ηⁿ depends only on the parameter, not on the data X_j. Then, it would be part of g.
➔ (1−η)^(Σ_{j=1}^n X_j − n) depends on both the parameter and the data X_j, and these two kinds of information neither are mixed nor can mathematically be separated. Then, it would be part of g too, and the only possible sufficient statistic, if the theorem holds, is T = Σ_{j=1}^n X_j.
By considering g(T(X);η) = ηⁿ·(1−η)^(Σ_{j=1}^n X_j − n) and h(X) = 1, the theorem holds and hence the statistic T(X) = Σ_{j=1}^n X_j is sufficient for studying η. The idea behind this kind of statistics is that they “summarize the important information (about the parameter)” contained in the sample. In fact, the statistic T has essentially the same information as any one-to-one transformation of it, particularly the sample mean (1/n) T(X) = (1/n) Σ_{j=1}^n X_j = X̄.
Conclusion: The factorization theorem has been used to find a sufficient statistic (for the parameter). Since
the total sum appears, we complete the expression to write the result in terms of the sample mean. Both
statistics contain the same information about the parameter of the distribution.
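A tiny numerical sketch of what sufficiency means (the samples and the value of η are arbitrary): two samples of the same size and the same total sum produce exactly the same likelihood, so they carry the same information about η.
# Sufficiency sketch: the likelihood depends on the data only through their sum
eta = 0.3
loglik = function(x, eta) sum(log(eta) + (x - 1)*log(1 - eta))
x1 = c(1, 2, 6); x2 = c(3, 3, 3)     # same size and same sum (9)
c(loglik(x1, eta), loglik(x2, eta))  # identical values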
My notes:
(C) For normal populations: V², s², S², VX²/VY², sX²/sY², SX²/SY²
Suppose that the two populations are independent. Study the consistency in mean of order two and then the
consistency in probability.
Discussion: In this exercise, the most important estimators are involved. The basic properties of the expectation and the variance allow us to calculate the mean square error. In most cases, the estimators will be completed so that a proper quantity (with known sampling distribution) appears, whose properties can then be used. Although the estimators of the third section can be used for any X and Y, the calculations for normally distributed variables are easier due to the use of additional information—the knowledge about statistics and their sampling distribution. Thus, the results of this section are based on the normality of the variables X and Y. (Some of the quantities are also valid for any variables.)
Fortunately, the limits of the two-variable functions—sequences, really—that appear in this exercise can easily be solved either by decomposing them into two limits of one-variable functions or by bounding the two-variable sequences. That the limits are studied when nX and nY tend to infinity facilitates the calculations (e.g. a constant like −2 is negligible when it appears in a factor). It is both sufficient and necessary that the sample sizes tend to infinity—see the mathematical appendix.
(c1) For the variance of the sample V²
By using nV²/σ² ~ χ²_n and the properties of the chi-squared distribution,

E(V²) = E( (σ²/n)·(nV²/σ²) ) = (σ²/n) E(nV²/σ²) = (σ²/n)·n = σ²

Var(V²) = Var( (σ²/n)·(nV²/σ²) ) = (σ⁴/n²) Var(nV²/σ²) = (σ⁴/n²)·2n = 2σ⁴/n

MSE(V²) = [ E(V²) − σ² ]² + Var(V²) = 2σ⁴/n
Then,
• The estimator V² is unbiased for σ², whatever the sample size.
• The estimator V² is consistent (in mean of order two and therefore in probability) for σ², since lim_{n→∞} MSE(V²) = lim_{n→∞} 2σ⁴/n = 0.
In another exercise, this estimator is compared with the other two estimators of the variance. (For the
expectation, it is easy to find in literature direct calculations that lead to the same value for any variables—not
necessarily normal.)
(c2) For the quotient between the variances of the samples VX²/VY²

By using T = (VX² σY²)/(VY² σX²) ~ F_{nX,nY} and the properties of the F distribution,

E(VX²/VY²) = (σX²/σY²) E( (VX² σY²)/(VY² σX²) ) = (σX²/σY²) · nY/(nY−2) = ( nY/(nY−2) ) σX²/σY²  (nY > 2)
Var(VX²/VY²) = Var( (σX²/σY²)·(VX² σY²)/(VY² σX²) ) = (σX²/σY²)² Var( (VX² σY²)/(VY² σX²) ) = ( 2nY²(nX+nY−2) / (nX(nY−2)²(nY−4)) ) σX⁴/σY⁴  (nY > 4)
MSE(VX²/VY²) = [ E(VX²/VY²) − σX²/σY² ]² + Var(VX²/VY²) = [ (nY/(nY−2)) σX²/σY² − σX²/σY² ]² + ( 2nY²(nX+nY−2) / (nX(nY−2)²(nY−4)) ) σX⁴/σY⁴
= { [ nY/(nY−2) − 1 ]² + 2nY²(nX+nY−2) / (nX(nY−2)²(nY−4)) } σX⁴/σY⁴  (nY > 4)
Then,
• The estimator VX²/VY² is biased for σX²/σY², but it is asymptotically unbiased since

lim_{nX,nY→∞} E(VX²/VY²) = lim_{nY→∞} ( nY/(nY−2) ) σX²/σY² = (σX²/σY²) lim_{nY→∞} 1/(1 − 2/nY) = σX²/σY²

Mathematically, only nY must tend to infinity. Statistically, since populations can be named and allocated in either order, it is deduced that both sample sizes must tend to infinity. In fact, it is both sufficient and necessary that the two sample sizes tend to infinity—see the mathematical appendix.
• The estimator VX²/VY² is consistent (in mean of order two and therefore in probability) for σX²/σY², since it is asymptotically unbiased and

lim_{nX,nY→∞} Var(VX²/VY²) = (σX⁴/σY⁴) lim_{nX,nY→∞} 2nY²(nX+nY−2) / (nX(nY−2)²(nY−4)) = (σX⁴/σY⁴) lim_{nX,nY→∞} [ 2( 1/nY + 1/nX − 2/(nX nY) ) ] / [ (1 − 2/nY)²(1 − 4/nY) ] = 0

The numerator tends to zero if and only if so do both sample sizes. In short, it is both sufficient and necessary that the two sample sizes tend to infinity—this limit has been studied in the mathematical appendix.
(c3) For the sample variance s²
By using ns²/σ² ~ χ²_{n−1} and the properties of the chi-squared distribution,

E(s²) = E( (σ²/n)·(ns²/σ²) ) = (σ²/n) E(ns²/σ²) = (σ²/n)(n−1) = ( (n−1)/n ) σ²

Var(s²) = Var( (σ²/n)·(ns²/σ²) ) = (σ⁴/n²) Var(ns²/σ²) = (σ⁴/n²)·2(n−1) = 2(n−1)σ⁴/n²
MSE(s²) = [ E(s²) − σ² ]² + Var(s²) = [ ((n−1)/n)σ² − σ² ]² + 2(n−1)σ⁴/n² = ( 2/n − 1/n² ) σ⁴
Then,
• The estimator s² is biased but asymptotically unbiased (for σ²), since

lim_{n→∞} E(s²) = σ² lim_{n→∞} (n−1)/n = σ² lim_{n→∞} (1 − 1/n) = σ²

It is both sufficient and necessary that the sample size tend to infinity—see the mathematical appendix.
• The estimator s² is consistent (in mean of order two and therefore in probability) for σ², since

lim_{n→∞} MSE(s²) = lim_{n→∞} ( 2/n − 1/n² ) σ⁴ = 0

It is both sufficient and necessary that the sample size tend to infinity—see the mathematical appendix.
In another exercise, this estimator is compared with the other two estimators of the variance. (For the
expectation, it is easy to find in literature direct calculations that lead to the same value for any variables—not
necessarily normal.)
(c4) For the quotient between the sample variances sX²/sY²

By using T = (SX² σY²)/(SY² σX²) = (nX(nY−1) sX² σY²)/(nY(nX−1) sY² σX²) ~ F_{nX−1,nY−1} and the properties of the F distribution,
E(sX²/sY²) = ( nY(nX−1)σX² / (nX(nY−1)σY²) ) E( (nX(nY−1) sX² σY²)/(nY(nX−1) sY² σX²) ) = ( nY(nX−1)σX² / (nX(nY−1)σY²) ) · (nY−1)/((nY−1)−2) = ( nY(nX−1)/(nX(nY−3)) ) σX²/σY²  (nY−1 > 2)
Var(sX²/sY²) = ( nY²(nX−1)²σX⁴ / (nX²(nY−1)²σY⁴) ) Var( (nX(nY−1) sX² σY²)/(nY(nX−1) sY² σX²) ) = ( nY²(nX−1)²σX⁴ / (nX²(nY−1)²σY⁴) ) · 2(nY−1)²(nX−1+nY−1−2) / ( (nX−1)((nY−1)−2)²((nY−1)−4) ) = ( 2nY²(nX−1)(nX+nY−4) / (nX²(nY−3)²(nY−5)) ) σX⁴/σY⁴  (nY−1 > 4)
MSE(sX²/sY²) = [ E(sX²/sY²) − σX²/σY² ]² + Var(sX²/sY²) = [ ( nY(nX−1)/(nX(nY−3)) ) σX²/σY² − σX²/σY² ]² + ( 2nY²(nX−1)(nX+nY−4) / (nX²(nY−3)²(nY−5)) ) σX⁴/σY⁴
= { [ nY(nX−1)/(nX(nY−3)) − 1 ]² + 2nY²(nX−1)(nX+nY−4) / (nX²(nY−3)²(nY−5)) } σX⁴/σY⁴  (nY−1 > 4)
Then,
• The estimator sX²/sY² is biased for σX²/σY², but it is asymptotically unbiased since

lim_{nX,nY→∞} E(sX²/sY²) = lim_{nX,nY→∞} ( nY(nX−1)/(nX(nY−3)) ) σX²/σY² = (σX²/σY²) lim_{nX,nY→∞} (nX nY − nY)/(nX nY − 3nX) = (σX²/σY²) lim_{nX,nY→∞} (1 − 1/nX)/(1 − 3/nY) = σX²/σY²

It is both sufficient and necessary that the two sample sizes tend to infinity—see the mathematical appendix.
• The estimator sX²/sY² is consistent (in mean of order two and therefore in probability) for σX²/σY², as it is asymptotically unbiased and

lim_{nX,nY→∞} Var(sX²/sY²) = lim_{nX,nY→∞} ( 2nY²(nX−1)(nX+nY−4) / (nX²(nY−3)²(nY−5)) ) σX⁴/σY⁴ = (σX⁴/σY⁴) lim_{nX,nY→∞} [ 2(1 − 1/nX)( 1/nX + 1/nY − 4/(nX nY) ) ] / [ (1 − 3/nY)²(1 − 5/nY) ] = 0

It is both sufficient and necessary that the two sample sizes tend to infinity—see the mathematical appendix.
In another exercise, this estimator is compared with the other two estimators of the quotient of variances.
(c5) For the sample quasivariance S²
By using (n−1)S²/σ² ~ χ²_{n−1} and the properties of the chi-squared distribution,

E(S²) = E( (σ²/(n−1))·((n−1)S²/σ²) ) = (σ²/(n−1)) E((n−1)S²/σ²) = (σ²/(n−1))(n−1) = σ²

Var(S²) = Var( (σ²/(n−1))·((n−1)S²/σ²) ) = (σ⁴/(n−1)²) Var((n−1)S²/σ²) = (σ⁴/(n−1)²)·2(n−1) = 2σ⁴/(n−1)

MSE(S²) = [ E(S²) − σ² ]² + Var(S²) = 2σ⁴/(n−1)
Then,
• The estimator S² is unbiased for σ², whatever the sample size.
• The estimator S² is consistent (in mean of order two and therefore in probability) for σ², since

lim_{n→∞} MSE(S²) = lim_{n→∞} 2σ⁴/(n−1) = 0

It is both sufficient and necessary that the sample size tend to infinity—see the mathematical appendix.
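The expectations and the ordering of the mean square errors of the three individual estimators can be checked numerically; a minimal Monte Carlo sketch (all values arbitrary):
# Monte Carlo for V^2 (population mean known), s^2 and S^2 under normality
mu = 0; sigma = 1; n = 10; M = 20000
mseV = mses = mseS = numeric(M)
for (m in 1:M) {
  x = rnorm(n, mean=mu, sd=sigma)
  V2 = mean((x - mu)^2)       # uses the known population mean
  S2 = var(x)                 # quasivariance (denominator n-1)
  s2 = (n-1)/n * S2           # sample variance (denominator n)
  mseV[m] = (V2 - sigma^2)^2; mses[m] = (s2 - sigma^2)^2; mseS[m] = (S2 - sigma^2)^2
}
c(mean(mses), mean(mseV), mean(mseS))   # should reflect 2/n - 1/n^2 < 2/n < 2/(n-1)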
(c6) For the quotient between the sample quasivariances SX²/SY²

By using T = (SX² σY²)/(SY² σX²) ~ F_{nX−1,nY−1} and the properties of the F distribution,

E(SX²/SY²) = (σX²/σY²) E( (SX² σY²)/(SY² σX²) ) = (σX²/σY²) · (nY−1)/((nY−1)−2) = ( (nY−1)/(nY−3) ) σX²/σY²  (nY−1 > 2)

Var(SX²/SY²) = (σX⁴/σY⁴) Var( (SX² σY²)/(SY² σX²) ) = (σX⁴/σY⁴) · 2(nY−1)²(nX−1+nY−1−2) / ( (nX−1)((nY−1)−2)²((nY−1)−4) ) = ( 2(nY−1)²(nX+nY−4) / ((nX−1)(nY−3)²(nY−5)) ) σX⁴/σY⁴  (nY−1 > 4)
MSE(SX²/SY²) = [ E(SX²/SY²) − σX²/σY² ]² + Var(SX²/SY²) = [ ((nY−1)/(nY−3)) σX²/σY² − σX²/σY² ]² + ( 2(nY−1)²(nX+nY−4) / ((nX−1)(nY−3)²(nY−5)) ) σX⁴/σY⁴
= { [ (nY−1)/(nY−3) − 1 ]² + 2(nY−1)²(nX+nY−4) / ((nX−1)(nY−3)²(nY−5)) } σX⁴/σY⁴  (nY−1 > 4)
Then,
• The estimator SX²/SY² is biased for σX²/σY², but it is asymptotically unbiased since

lim_{nX,nY→∞} E(SX²/SY²) = lim_{nX,nY→∞} ( (nY−1)/(nY−3) ) σX²/σY² = (σX²/σY²) lim_{nY→∞} (1 − 1/nY)/(1 − 3/nY) = σX²/σY²

Mathematically, only nY must tend to infinity. Statistically, since populations can be named and allocated in either order, it is deduced that both sample sizes must tend to infinity. In fact, it is both sufficient and necessary that the two sample sizes tend to infinity—see the mathematical appendix.
• The estimator SX²/SY² is consistent (in mean of order two and therefore in probability) for σX²/σY², as it is asymptotically unbiased and

lim_{nX,nY→∞} Var(SX²/SY²) = lim_{nX,nY→∞} ( 2(nY−1)²(nX+nY−4) / ((nX−1)(nY−3)²(nY−5)) ) σX⁴/σY⁴ = (σX⁴/σY⁴) lim_{nX,nY→∞} [ 2(1 − 1/nY)²( 1/nX + 1/nY − 4/(nX nY) ) ] / [ (1 − 1/nX)(1 − 3/nY)²(1 − 5/nY) ] = 0

It is both sufficient and necessary that the two sample sizes tend to infinity—see the mathematical appendix.
In another exercise, this estimator is compared with the other two estimators of the quotient of variances.
Conclusion: For the most important estimators, the mean square error has been calculated either directly (in a few cases) or by making a proper statistic appear. The consistencies in mean of order two and in probability have been proved.
My notes:
(B) For V², s², S² and for VX²/VY², sX²/sY², SX²/SY² (consider only the case nX = n = nY)
In the second section, suppose that the populations are independent.
Discussion: The expressions of the mean square error of these estimators have been calculated in another exercise. Comparing the coefficients is easy in some cases, but sequences may sometimes cross one another and the comparisons must be done analytically—by solving equalities and inequalities—or graphically. We plot the sequences (lines between dots are used to facilitate the identification).
The mean square errors were found for static situations, but the idea of limit involves dynamic situations. By using a computer, it is also possible to study—either analytically or graphically—the asymptotic behaviour of the estimators (but it is not a “whole mathematical proof”). It is worth noticing that the formulas and results of this exercise are valid for normal populations (because of the theoretical results on which they are based); in the general case, the expressions for the mean square error of these estimators are more complex. For two populations, there are infinitely many mathematical ways for the two sample sizes to tend to infinity (see the figure); the case nX = n = nY, in the last figure, will be considered.
MSE(V²) = (2/n) σ⁴   MSE(s²) = (2/n − 1/n²) σ⁴   MSE(S²) = ( 2/(n−1) ) σ⁴
Since σ⁴ appears in all these positive quantities, by looking at the coefficients it is easy to see that, for n larger than two,

MSE(s²) < MSE(V²) < MSE(S²)

That is, the sequences—indexed by n—do not cross one another. We can plot the coefficients (they are also the mean square errors when σ=1).
# Grid of values for 'n'
n = seq(from=2,to=10,by=1)
# The three sequences of coefficients
coeff1 = 2/n
coeff2 = 2/n - 1/(n^2)
coeff3 = 2/(n-1)
# The plot
allValues = c(coeff1, coeff2, coeff3)
yLim = c(min(allValues), max(allValues));
x11(); par(mfcol=c(1,4))
plot(n, coeff1, xlim=c(min(n),max(n)), ylim = yLim, xlab=' ', ylab=' ', main='Coefficients 1', type='l')
plot(n, coeff2, xlim=c(min(n),max(n)), ylim = yLim, xlab=' ', ylab=' ', main='Coefficients 2', type='b')
plot(n, coeff3, xlim=c(min(n),max(n)), ylim = yLim, xlab=' ', ylab=' ', main='Coefficients 3', type='b')
plot(n, coeff1, xlim=c(min(n),max(n)), ylim = yLim, xlab=' ', ylab=' ', main='All coefficients', type='l')
points(n, coeff2, type='b')
points(n, coeff3, type='b')
Asymptotically, the three estimators behave similarly, since 2/n − 1/n² ≈ 2/n ≈ 2/(n−1).
(B) For VX²/VY², sX²/sY² and SX²/SY²
The expressions of their mean square error, when nX = n = nY, are:
MSE(VX²/VY²) = { [ n/(n−2) − 1 ]² + 2n²(n+n−2) / (n(n−2)²(n−4)) } σX⁴/σY⁴ = { [ n/(n−2) − 1 ]² + 4n(n−1) / ((n−2)²(n−4)) } σX⁴/σY⁴  (n > 4)
MSE(sX²/sY²) = { [ n(n−1)/(n(n−3)) − 1 ]² + 2n²(n−1)(n+n−4) / (n²(n−3)²(n−5)) } σX⁴/σY⁴ = { [ (n−1)/(n−3) − 1 ]² + 4(n−1)(n−2) / ((n−3)²(n−5)) } σX⁴/σY⁴  (n−1 > 4)
MSE(SX²/SY²) = { [ (n−1)/(n−3) − 1 ]² + 2(n−1)²(n+n−4) / ((n−1)(n−3)²(n−5)) } σX⁴/σY⁴ = { [ (n−1)/(n−3) − 1 ]² + 4(n−1)(n−2) / ((n−3)²(n−5)) } σX⁴/σY⁴  (n−1 > 4)
For equal sample sizes, the mean square error of the last two estimators is the same (but they may behave
differently under other criteria different to the mean square error, e.g. even their expectation). We can plot the
coefficients (they are also the mean square errors when σX = σY), for n > 5.
This shows that, for normal populations and samples of sizes nX = n = nY, it seems that

MSE(VX²/VY²) ≤? MSE(sX²/sY²) = MSE(SX²/SY²)

(the question mark indicates that the inequality is still to be proved)
and the sequences do not cross one another. Really, a figure is not a mathematical proof, so we do the following calculations:

[ n/(n−2) − 1 ]² + 4n(n−1)/((n−2)²(n−4)) ≤? [ (n−1)/(n−3) − 1 ]² + 4(n−1)(n−2)/((n−3)²(n−5))

[ 4(n−4) + 4n(n−1) ] / ((n−2)²(n−4)) ≤? [ 4(n−5) + 4(n−1)(n−2) ] / ((n−3)²(n−5)) ↔ (n−4+n²−n) / ((n−2)²(n−4)) ≤? (n²−2n−3) / ((n−3)²(n−5))

(n−2)(n+2) / ((n−2)²(n−4)) ≤? (n−3)(n+1) / ((n−3)²(n−5)) ↔ (n+2)(n−3)(n−5) ≤? (n+1)(n−2)(n−4)

n³−6n²−n+30 ≤? n³−5n²+2n+8 ↔ 22 ≤? n(n+3)

This last inequality is true for n ≥ 4, since it is true for n = 4 and the right-hand side increases with n. Thus, we can guarantee that, for n > 5,
MSE(VX²/VY²) ≤ MSE(sX²/sY²) = MSE(SX²/SY²)
Asymptotically, by using infinites,

lim_{nX,nY→∞} MSE(VX²/VY²) = lim_{nX,nY→∞} { [ nY/(nY−2) − 1 ]² + 2nY²(nX+nY−2) / (nX(nY−2)²(nY−4)) } σX⁴/σY⁴ = lim_{nX,nY→∞} { [ nY/nY − 1 ]² + 2nY²(nX+nY) / (nX nY³) } σX⁴/σY⁴ = lim_{nX,nY→∞} ( 2(nX+nY) / (nX nY) ) σX⁴/σY⁴ = 0
lim_{nX,nY→∞} MSE(sX²/sY²) = lim_{nX,nY→∞} { [ nY nX/(nX nY) − 1 ]² + 2nY² nX(nX+nY) / (nX² nY³) } σX⁴/σY⁴ = lim_{nX,nY→∞} ( 2(nX+nY) / (nX nY) ) σX⁴/σY⁴ = 0
lim_{nX,nY→∞} MSE(SX²/SY²) = lim_{nX,nY→∞} { [ (nY−1)/(nY−3) − 1 ]² + 2(nY−1)²(nX+nY−4) / ((nX−1)(nY−3)²(nY−5)) } σX⁴/σY⁴ = lim_{nX,nY→∞} { [ nY/nY − 1 ]² + 2nY²(nX+nY) / (nX nY³) } σX⁴/σY⁴ = lim_{nX,nY→∞} ( 2(nX+nY) / (nX nY) ) σX⁴/σY⁴ = 0
The three estimators behave similarly, since the quantitative behaviour of their mean square errors is characterized by the same limit, namely:

lim_{nX,nY→∞} ( 2(nX+nY) / (nX nY) ) σX⁴/σY⁴ = 0.

(It is worth noticing that this asymptotic behaviour arises when the limits are solved by using infinites—it cannot be seen when the limits are solved in other ways.)
Conclusion: The expressions of the mean square error of these estimators allow us to compare them, to study their consistency and even their rate of convergence. We have proved the following result:
Proposition
(1) For a normal population,
MSE ( s 2 ) < MSE (V 2 ) < MSE (S 2 )
(2) For two independent normal populations, when nX = n = nY
MSE(VX²/VY²) ≤ MSE(sX²/sY²) = MSE(SX²/SY²)
Note: For one population, V² has higher error than s², even if the information about the value of the population mean μ is used by the former while it is estimated in the other two estimators. For two populations, the information about the value of the two population means μX and μY is used in the first quotient while they must be estimated in the other two estimators. Either way, the population mean in itself does not play an important role in studying the variance, which is based on relative distances, but any estimation using the same data reduces the amount of information available—and the degrees of freedom—by one unit.
Again, it is worth noticing that there are in general several matters to be considered in selecting among different estimators of the same quantity:
(a) The error can be measured by using a quantity different to the mean square error.
(b) For large sample sizes, the differences provided by the formulas above may be negligible.
(c) The computational or manual effort in calculating the quantities must also be taken into account—not all of them require the same number of operations.
(d) We may have some quantities already available.
(d) We may have some quantities already available.
My notes:
Discussion: The expressions of the mean square error of the basic estimators involved in this exercise have been calculated in another exercise, and they will be used in calculating the mean square errors of the new estimators. The errors are calculated for static situations, but limits are studied in dynamic situations. Comparing the coefficients is easy in some cases, but sequences can sometimes cross one another and the comparisons must be done analytically—by solving equalities and inequalities—or graphically. By using a computer, it is also possible to study—either analytically or graphically—the behaviour of the estimators. The results obtained here are valid for two independent Bernoulli populations and two independent normal populations, respectively. On the other hand, we must find the expression of the error for the new estimators based on semisums:
MSE( (1/2)(θ̂₁+θ̂₂) ) = [ E( (1/2)(θ̂₁+θ̂₂) ) − θ ]² + Var( (1/2)(θ̂₁+θ̂₂) )

and, for unbiased estimators (independent, as the populations are),

MSE( (1/2)(θ̂₁+θ̂₂) ) = 0 + (1/4)[ Var(θ̂₁) + Var(θ̂₂) ]
(A) For Bernoulli populations: (1/2)(η̂X + η̂Y) and η̂p
(a1) For the semisum of the sample proportions (1/2)(η̂X + η̂Y)
By using previous results and that μ = η and σ² = η(1−η),

E( (1/2)(η̂X + η̂Y) ) = (1/2)[ E(η̂X) + E(η̂Y) ] = (1/2)(ηX + ηY) = η

MSE( (1/2)(η̂X + η̂Y) ) = [ (1/2)(ηX+ηY) − η ]² + (1/4)( ηX(1−ηX)/nX + ηY(1−ηY)/nY ) = (1/4)( 1/nX + 1/nY ) η(1−η)
Then,
• The estimator (1/2)(η̂X + η̂Y) is unbiased for η, whatever the sample sizes.
• The estimator (1/2)(η̂X + η̂Y) is consistent (in the mean-square sense and therefore in probability) for η, since

lim_{nX,nY→∞} MSE( (1/2)(η̂X + η̂Y) ) = lim_{nX,nY→∞} (1/4)( 1/nX + 1/nY ) η(1−η) = 0

It is both sufficient and necessary that both sample sizes tend to infinity—see the mathematical appendix.
(a2) For the pooled sample proportion η̂p
By using previous results,

E(η̂p) = ( 1/(nX+nY) ) [ nX E(η̂X) + nY E(η̂Y) ] = (nX ηX + nY ηY)/(nX+nY) = η

Var(η̂p) = ( 1/(nX+nY)² ) [ nX² Var(η̂X) + nY² Var(η̂Y) ] = ( nX ηX(1−ηX) + nY ηY(1−ηY) )/(nX+nY)² = ( 1/(nX+nY) ) η(1−η)

MSE(η̂p) = [ (nX ηX + nY ηY)/(nX+nY) − η ]² + ( nX ηX(1−ηX) + nY ηY(1−ηY) )/(nX+nY)² = ( 1/(nX+nY) ) η(1−η)
Then,
• The estimator η̂p is unbiased for η, whatever the sample sizes.
• The estimator η̂p is consistent (in mean of order two and therefore in probability) for η, since

lim_{nX,nY→∞} MSE(η̂p) = lim_{nX,nY→∞} η(1−η)/(nX+nY) = 0

If the mean square error is compared with those of the two populations, we can see that the new denominator is the sum of both sample sizes. Again, it is worth noticing that it is sufficient and necessary that at least one sample size tend to infinity, but not both. In this case, the denominator tends to infinity. The interpretation of this fact is that, in estimating, one sample can do “the whole work.”
(a3) Comparison of (1/2)(η̂X + η̂Y) and η̂p
Case nX = n = nY

MSE( (1/2)(η̂X + η̂Y) ) = η(1−η)/(2n) = MSE(η̂p)

In fact, by looking at the expressions of the estimators themselves, η̂p = (1/2)(η̂X + η̂Y) in this case.
General case
The expressions of their mean square error are (the sample proportion is unbiased):
Then

(1/4)( 1/nX + 1/nY ) ≤? 1/(nX+nY) ↔ (nX+nY)·(nX+nY)/(nX nY) ≤? 4 ↔ nX² + nY² + 2nX nY ≤? 4nX nY ↔ (nX−nY)² ≤? 0

which can hold only with equality, when nX = nY.
Then, the pooled estimator is always better than or equal to the semisum of the sample proportions. Both estimators have the same mean square error—though their behaviour may be different under criteria other than the mean square error—only when nX = nY. Thus, (nX−nY)² can be seen as a measure of the convenience of using the pooled sample proportion, since it shows how different the two errors are. The inequality also shows a symmetric situation, in the sense that it does not matter which sample size is bigger: the measure depends on the difference. We have proved the following result:
Proposition
For two independent Bernoulli populations with the same parameter, the pooled sample proportion
has smaller or equal mean square error than the semisum of the sample proportions. Besides, both
are equivalent only when the sample sizes are equal.
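A quick numerical check of this proposition over a grid of sample sizes (a sketch; the grid is arbitrary):
# Verifying MSE(pooled) <= MSE(semisum), with equality only when nX = nY
nX = rep(2:30, times=29); nY = rep(2:30, each=29)
semisum = (1/nX + 1/nY)/4              # coefficient of eta(1-eta)
pooled  = 1/(nX + nY)                  # coefficient of eta(1-eta)
all(pooled <= semisum + 1e-15)                       # TRUE
all((abs(pooled - semisum) < 1e-12) == (nX == nY))   # TRUE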
We can plot the coefficients (they are also the mean square errors when η(1−η)=1) for a sequence of sample sizes, indexed by k, such that nY(k) = 2nX(k), for example (but this is only one possible way for the sample sizes to tend to infinity):
# Grid of values for 'n'
c = 2
n = seq(from=2,to=10,by=1)
# The sequences of coefficients
coeff1 = (1 + 1/c)/(4*n)
coeff2 = 1/((1+c)*n)
# The plot
allValues = c(coeff1, coeff2)
yLim = c(min(allValues), max(allValues));
x11(); par(mfcol=c(1,3))
plot(n, coeff1, xlim=c(min(n),max(n)), ylim = yLim, xlab=' ', ylab=' ', main='Coefficients 1', type='l')
plot(n, coeff2, xlim=c(min(n),max(n)), ylim = yLim, xlab=' ', ylab=' ', main='Coefficients 2', type='b')
plot(n, coeff1, xlim=c(min(n),max(n)), ylim = yLim, xlab=' ', ylab=' ', main='All coefficients', type='l')
points(n, coeff2, type='b')
The reader can repeat this figure by using values closer to and farther from 1 than c(k) = 2.
(b1) For the semisum (1/2)(VX² + VY²)
By using previous results,

MSE( (1/2)(VX² + VY²) ) = [ (1/2)(σX² + σY²) − σ² ]² + (1/2)( σX⁴/nX + σY⁴/nY ) = (1/2)( 1/nX + 1/nY ) σ⁴
Then,
• The estimator (1/2)(VX² + VY²) is unbiased for σ², whatever the sample sizes.
• The estimator (1/2)(VX² + VY²) is consistent (in the mean-square sense and therefore in probability) for σ², since

lim_{nX,nY→∞} MSE( (1/2)(VX² + VY²) ) = lim_{nX,nY→∞} (1/2)( 1/nX + 1/nY ) σ⁴ = 0

It is both sufficient and necessary that both sample sizes tend to infinity—see the mathematical appendix.
(b2) For the semisum of the sample variances (1/2)(sX² + sY²)
By using previous results,

E( (1/2)(sX² + sY²) ) = (1/2)[ E(sX²) + E(sY²) ] = (1/2)[ ((nX−1)/nX)σX² + ((nY−1)/nY)σY² ] = (1/2)( (nX−1)/nX + (nY−1)/nY ) σ²
Var( (1/2)(sX² + sY²) ) = (1/4)[ Var(sX²) + Var(sY²) ] = (1/4)[ 2(nX−1)σX⁴/nX² + 2(nY−1)σY⁴/nY² ] = (1/2)( (nX−1)/nX² + (nY−1)/nY² ) σ⁴
1
(
MSE (s2X + s 2Y ) =
2 2 nX
σX+ ) [(
1 n X −1 2 n Y −1 2
nY
σY −σ 2 +
1 n X −1 4 n Y −1 4
2 n2X
σ X + 2 σY
nY ) ] [ ]
2
=
[(
1 n X −1 n Y −1 2
2 nX
+
nY
σ −σ 2 +
1 n X −1 n Y −1 4
2 n 2X
+ 2 σ
nY ) ] [ ]
{[ [ ]} [ ]
2
n X nY2 −nY2 + n2X nY −n2X (n X + nY )2 n X n2Y −n 2Y + n2X nY −n2X 4
= −
1 1 1
+
2 n X nY ( )] +2
4 n 2X n2Y
σ=
4
4 n 2X nY2
+2
4 n 2X n 2Y
σ
[ ] [ ]
2 2 2 2 2
2 n X n Y + 2 n X nY + 2 n X nY −n X −n Y 4 2 n X nY (n X + nY )−(n X −nY )
= 2 2
σ= 2 2
σ4
4n n Y X 4n n X Y
[ ] [ ]
2 2
1 n X + nY ( n X −n Y ) 4 1 1 1 ( n X −n Y )
= − σ= + − σ4
2 n X nY 2 2
2 n X nY 2 n X n Y
2 2
2 n X nY
Then,
• The estimator (1/2)(sX² + sY²) is biased but asymptotically unbiased for σ², since

lim_{nX,nY→∞} E( (1/2)(sX² + sY²) ) = σ² lim_{nX,nY→∞} (1/2)( (nX−1)/nX + (nY−1)/nY ) = σ² (1/2)(1+1) = σ²

It is both sufficient and necessary that the two sample sizes tend to infinity—see the mathematical appendix.
• The estimator (1/2)(sX² + sY²) is consistent (in the mean-square sense and therefore in probability) for σ², because it is asymptotically unbiased and

lim_{nX,nY→∞} Var( (1/2)(sX² + sY²) ) = σ⁴ lim_{nX,nY→∞} (1/2)( (nX−1)/nX² + (nY−1)/nY² ) = 0

Again, it is both sufficient and necessary that the two sample sizes tend to infinity—see the mathematical appendix.
(b3) For the semisum of the sample quasivariances (1/2)(SX² + SY²)
By using previous results,

E( (1/2)(SX² + SY²) ) = (1/2)[ E(SX²) + E(SY²) ] = (1/2)( σX² + σY² ) = σ²

MSE( (1/2)(SX² + SY²) ) = [ (1/2)(σX² + σY²) − σ² ]² + (1/2)( σX⁴/(nX−1) + σY⁴/(nY−1) ) = (1/2)( 1/(nX−1) + 1/(nY−1) ) σ⁴
Then,
• The estimator (1/2)(SX² + SY²) is unbiased for σ², whatever the sample sizes.
• The estimator (1/2)(SX² + SY²) is consistent (in the mean-square sense and therefore in probability) for σ², since

lim_{nX,nY→∞} MSE( (1/2)(SX² + SY²) ) = lim_{nX,nY→∞} (1/2)( 1/(nX−1) + 1/(nY−1) ) σ⁴ = 0

It is both sufficient and necessary that both sample sizes tend to infinity—see the mathematical appendix.
For the pooled sample variance sp²:

MSE(sp²) = [ ( (nX+nY−2)/(nX+nY) ) σ² − σ² ]² + ( 2(nX+nY−2)/(nX+nY)² ) σ⁴ = ( (nX+nY−2−nX−nY)² + 2(nX+nY−2) )/(nX+nY)² σ⁴ = ( 2/(nX+nY) ) σ⁴
Then,
• The estimator sp² is biased for σ², but asymptotically unbiased:

lim_{nX,nY→∞} E(sp²) = lim_{nX,nY→∞} ( (nX+nY−2)/(nX+nY) ) σ² = σ²
(The calculation above for the mean suggests that a –2 in the denominator of the definition would
provide an unbiased estimator—see the estimator in the following section.)
• The estimator sp² is consistent (in mean of order two and therefore in probability) for σ², since

lim_{nX,nY→∞} MSE(sp²) = σ⁴ lim_{nX,nY→∞} 2/(nX+nY) = 0

It is worth noticing that it is sufficient and necessary that at least one sample size tend to infinity, but not both. In this case, the denominator tends to infinity. The interpretation of this fact is that, in estimating, one sample can do “the whole work.”
For the pooled sample quasivariance Sp²:

MSE(Sp²) = [ ( (nX−1)σX² + (nY−1)σY² )/(nX+nY−2) − σ² ]² + ( 2/(nX+nY−2) ) σ⁴ = ( 2/(nX+nY−2) ) σ⁴
Then,
• The estimator Sp² is unbiased for σ², whatever the sample sizes.
• The estimator Sp² is consistent (in mean of order two and therefore in probability) for σ², since lim_{nX,nY→∞} MSE(Sp²) = σ⁴ lim_{nX,nY→∞} 2/(nX+nY−2) = 0; again, at least one sample size tending to infinity is sufficient and necessary.
(b7) Comparison of (1/2)(VX²+VY²), (1/2)(sX²+sY²), (1/2)(SX²+SY²), Vp², sp² and Sp²
Case nX = n = nY
MSE( (1/2)(VX²+VY²) ) = (1/2)(2/n) σ⁴ = (1/n) σ⁴
MSE( (1/2)(sX²+sY²) ) = ( (1/2)(2/n) − 0 ) σ⁴ = (1/n) σ⁴
MSE( (1/2)(SX²+SY²) ) = (1/2)( 2/(n−1) ) σ⁴ = ( 1/(n−1) ) σ⁴
MSE(Vp²) = ( 2/(2n) ) σ⁴ = (1/n) σ⁴
MSE(sp²) = ( 2/(2n) ) σ⁴ = (1/n) σ⁴
MSE(Sp²) = ( 2/(2n−2) ) σ⁴ = ( 1/(n−1) ) σ⁴
Since σ⁴ appears in all these positive quantities, by looking at the coefficients it is easy to see the relation

MSE( (1/2)(sX²+sY²) ) = MSE( (1/2)(VX²+VY²) ) = MSE(Vp²) = MSE(sp²) < MSE(Sp²) = MSE( (1/2)(SX²+SY²) )

(For individual estimators, the order MSE(s²) < MSE(V²) < MSE(S²) was obtained in another exercise.) This relation has been obtained for the case nX = n = nY and (independent) normal populations. We can plot the coefficients (they are also the mean square errors when σ=1).
# Grid of values for 'n'
n = seq(from=10,to=20,by=1)
# The three sequences of coefficients
coeff1 = 1/n
coeff2 = coeff1
coeff3 = 1/(n-1)
coeff4 = coeff1
coeff5 = coeff1
coeff6 = coeff3
# The plot
allValues = c(coeff1, coeff2, coeff3, coeff4, coeff5, coeff6)
yLim = c(min(allValues), max(allValues));
x11(); par(mfcol=c(1,7))
plot(n, coeff1, xlim=c(min(n),max(n)), ylim = yLim, xlab=' ', ylab=' ', main='Coefficients 1', type='l')
plot(n, coeff2, xlim=c(min(n),max(n)), ylim = yLim, xlab=' ', ylab=' ', main='Coefficients 2', type='l')
plot(n, coeff3, xlim=c(min(n),max(n)), ylim = yLim, xlab=' ', ylab=' ', main='Coefficients 3', type='b')
plot(n, coeff4, xlim=c(min(n),max(n)), ylim = yLim, xlab=' ', ylab=' ', main='Coefficients 4', type='l')
plot(n, coeff5, xlim=c(min(n),max(n)), ylim = yLim, xlab=' ', ylab=' ', main='Coefficients 5', type='l')
plot(n, coeff6, xlim=c(min(n),max(n)), ylim = yLim, xlab=' ', ylab=' ', main='Coefficients 6', type='b')
plot(n, coeff1, xlim=c(min(n),max(n)), ylim = yLim, xlab=' ', ylab=' ', main='All coefficients', type='l')
points(n, coeff2, type='l')
points(n, coeff3, type='b')
points(n, coeff4, type='l')
points(n, coeff5, type='l')
points(n, coeff6, type='b')
General case
The expressions of their mean square error are:
MSE( (1/2)(VX²+VY²) ) = (1/2)( 1/nX + 1/nY ) σ⁴
MSE( (1/2)(sX²+sY²) ) = ( (1/2)( 1/nX + 1/nY ) − (nX−nY)²/(2nX nY)² ) σ⁴
MSE( (1/2)(SX²+SY²) ) = (1/2)( 1/(nX−1) + 1/(nY−1) ) σ⁴
MSE(Vp²) = ( 2/(nX+nY) ) σ⁴
MSE(sp²) = ( 2/(nX+nY) ) σ⁴
MSE(Sp²) = ( 2/(nX+nY−2) ) σ⁴
We have simplified the expressions as much as possible, and now a general comparison can be tackled by doing some pairwise comparisons. Firstly, by looking at the coefficients,

MSE( (1/2)(sX²+sY²) ) ≤ MSE( (1/2)(VX²+VY²) ) < MSE( (1/2)(SX²+SY²) )

and the equality is reached only when nX = n = nY. On the other hand,

MSE(Vp²) = MSE(sp²) < MSE(Sp²)
Now, we would like to allocate Vp², sp² and Sp² in the first chain. To compare Vp² and sp² with (1/2)(VX²+VY²),

2/(nX+nY) ≤ (1/2)( 1/nX + 1/nY ) ↔ 4nX nY ≤ (nX+nY)² ↔ 4nX nY ≤ nX² + nY² + 2nX nY ↔ 0 ≤ (nX−nY)²
which always holds, so these two pooled estimators never have larger mean square error than this semisum. Analogously, comparing Sp² with the same semisum leads to:

MSE(Sp²) ≤ MSE( (1/2)(VX²+VY²) ) if 2(nX+nY) ≤ (nX−nY)²
MSE(Sp²) ≥ MSE( (1/2)(VX²+VY²) ) if 2(nX+nY) ≥ (nX−nY)²
Intuitively, in the region around the bisector line the difference of the sample sizes is small, and therefore the pooled sample variance Sp² is worse; on the other hand, in the complementary region the square of the difference is bigger than twice the sum of the sizes, and, therefore, the pooled sample variance is better. The frontier seems to be parabolic. Some work can be done to find the frontier determined by the equality and the two regions on both sides—this is done in the mathematical appendix. Now, we write some brute-force lines for the computer to plot the points of the frontier:
N = 100
vectorNx = vector(mode="numeric", length=0)
vectorNy = vector(mode="numeric", length=0)
for (nx in 1:N)
{
for (ny in 1:N)
{
if (2*(nx+ny)==(nx-ny)^2) { vectorNx = c(vectorNx, nx); vectorNy = c(vectorNy, ny) }
}
}
plot(vectorNx, vectorNy, xlim = c(0,N+1), ylim = c(0,N+1), xlab='nx', ylab='ny', main=paste('Frontier of the region'), type='p')
To compare Sp² with (1/2)(SX² + SY²),

2/(nX+nY−2) ≤ (1/2)( 1/(nX−1) + 1/(nY−1) ) ↔ 4(nX−1)(nY−1) ≤ (nX+nY−2)²

and the equality is attained only if the sample sizes are the same.
We can summarize all the results of this section in the following statement:
(e) MSE(Sp²) ≤ MSE( (1/2)(VX²+VY²) ) if 2(nX+nY) ≤ (nX−nY)²
    MSE(Sp²) ≥ MSE( (1/2)(VX²+VY²) ) if 2(nX+nY) ≥ (nX−nY)²
(f) MSE(Sp²) ≤ MSE( (1/2)(SX²+SY²) )
Note: I have tried to compare Vp², sp² and Sp² with (1/2)(sX² + sY²), but I have not managed to solve the inequalities. On the other hand, these relations show that, for two independent normal populations, there exist estimators with smaller mean square error than the pooled sample variance Sp². Nevertheless, there are other criteria different to the mean square error, and, additionally, the pooled sample variance has also some advantages (see the advanced theory at the end).
Conclusion: For some pooled estimators, the mean square errors have been calculated either directly or by making a proper statistic appear. The consistencies in mean of order two and in probability have been proved. By using theoretical expressions for the mean square error, the behaviour of the pooled estimators for the proportion (Bernoulli populations) and for the variance (normal populations) has been compared with “natural” estimators consisting of the semisum of the individual estimators for each population.
Once more, it is worth noticing that there are in general several matters to be considered in selecting
among different estimators of the same quantity:
(a) The error can be measured by using a quantity different to the mean square error.
(b) For large sample sizes, the differences provided by the formulas above may be negligible.
(c) The computational or manual effort in calculating the quantities must also be taken into account—not all of them require the same number of operations.
(d) We may have some quantities already available.
Advanced Theory: The previous estimators can be written as a sum ωX θ̂X + ωY θ̂Y with weights ω = (ωX, ωY) such that ωX + ωY = 1. As regards the interpretation of the weights, they can be seen as a measure of the importance that each estimator is given in the global formula. For some weights that depend on the sample sizes, it is possible for one estimator to acquire all the importance when the sample sizes increase in the proper way. On the contrary, when the weights are constant the possible effect—positive or negative—of each estimator on the global one is fixed, whatever the sample sizes.
My notes:
Discussion: This statement is mathematical. The assumptions are supposed to have been checked. We are given the density function of the distribution of X (a dimensionless quantity). The exercise involves two methods of estimation, the definition of the bias, the mean square error and the sufficient condition for the consistency (in probability). The first two population moments are provided.
Note: If E(X) and E(X2) had not been given in the statement, they could have been calculated by applying the definition and solving the integrals,
E(X) = ∫_{−∞}^{+∞} x f(x;θ) dx = ∫₀^θ x · 2(θ−x)/θ² dx = (2/θ²)( ∫₀^θ θx dx − ∫₀^θ x² dx ) = (2/θ²)( θ [x²/2]₀^θ − [x³/3]₀^θ ) = (2/θ²)( θ³/2 − θ³/3 ) = (2/θ²)(θ³/6) = θ/3
E(X²) = ∫_{−∞}^{+∞} x² f(x;θ) dx = ∫₀^θ x² · 2(θ−x)/θ² dx = (2/θ²)( ∫₀^θ θx² dx − ∫₀^θ x³ dx ) = (2/θ²)( θ [x³/3]₀^θ − [x⁴/4]₀^θ ) = (2/θ²)( θ⁴/3 − θ⁴/4 ) = (2/θ²)(θ⁴/12) = θ²/6
(a) Method of the moments
(a1) Population and sample moments
The distribution has only one parameter, so one equation suffices. By using the information in the hint, μ₁(θ) = E(X) = θ/3 and m₁(x₁,...,x_n) = x̄, so the equation μ₁(θ) = m₁ yields θ = 3x̄. The estimator is obtained after substituting the lower-case letters x_j by upper-case letters X_j:

θ̂_M = (3/n) Σ_{j=1}^n X_j = 3X̄
We do not usually apply the definition MSE ( θ̂ M ) = E ( ( θ̂ M −θ)2 ) but a property derived from it, for which
we need to calculate the variance:
Var(θ̂_M) = Var(3X̄) = 3² Var(X)/n = (9/n)( E(X²) − E(X)² ) = (9/n)( θ²/6 − θ²/9 ) = (9/n)(θ²/18) = θ²/(2n)

where we have used the properties of the variance, a property of the sample mean and the information in the statement. Then

MSE(θ̂_M) = b(θ̂_M)² + Var(θ̂_M) = 0 + θ²/(2n) = θ²/(2n)
(c) Consistency: We try applying the sufficient condition lim_{n→∞} MSE(θ̂) = 0 or, equivalently, lim_{n→∞} b(θ̂) = 0 and lim_{n→∞} Var(θ̂) = 0. Since the bias is identically zero and lim_{n→∞} Var(θ̂_M) = lim_{n→∞} θ²/(2n) = 0, the estimator θ̂_M is consistent (in mean of order two and therefore in probability).
To obtain estimators of the variance, since σ² = Var(X) = θ²/18, the plug-in principle suggests

σ̂²_M = θ̂²_M/18 = (3X̄)²/18 = X̄²/2   σ̂²_ML = ?

(no maximum likelihood estimator of θ was obtained, so this method induces no estimator of σ²).
Conclusion: The method of the moments is applied to obtain an estimator that is unbiased for any sample size n and has good behaviour for large n (many data). The maximum likelihood method cannot be applied, since it is difficult to optimize the likelihood function by considering either its expression or the behaviour of the density function.
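A small simulation sketch of the estimator (θ, n and the number of replications are arbitrary; samples are drawn with the inverse of the distribution function, F(x) = 1 − (1 − x/θ)² for this density):
# Monte Carlo for the method-of-moments estimator
theta = 2; n = 100; M = 10000
rtri = function(n, theta) theta * (1 - sqrt(1 - runif(n)))   # inverse-CDF sampling
thetaM = replicate(M, 3 * mean(rtri(n, theta)))
mean(thetaM)                   # close to theta (unbiasedness)
c(var(thetaM), theta^2/(2*n))  # Monte Carlo variance vs theoretical value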
My notes:
Exercise 2pe
Let X be a random variable following the Rayleigh distribution, whose probability (density) function is

f(x;θ) = (x/θ²) e^(−x²/(2θ²)),  x ≥ 0  (θ > 0)

such that E(X) = θ√(π/2) and Var(X) = ((4−π)/2) θ². Let X = (X₁,...,X_n) be a simple random sample.
Discussion: This is a theoretical exercise where we must apply two methods of point estimation. The basic
properties must be considered for the estimator obtained through the first method.
Note: If E(X) had not been given in the statement, it could have been calculated by applying integration by parts (since polynomials and exponentials are functions “of different type”):

E(X) = ∫₀^{+∞} x (x/θ²) e^(−x²/(2θ²)) dx = [ −x e^(−x²/(2θ²)) ]₀^∞ + ∫₀^∞ e^(−x²/(2θ²)) dx = 0 + √(2θ²) ∫₀^∞ e^(−t²) dt = √(2θ²) (√π/2) = θ √(π/2)

where ∫ u(x)·v'(x) dx = u(x)·v(x) − ∫ u'(x)·v(x) dx has been used with
• u = x → u' = 1
• v' = (x/θ²) e^(−x²/(2θ²)) → v = ∫ (x/θ²) e^(−x²/(2θ²)) dx = −e^(−x²/(2θ²))
Then, we have applied the change

x/√(2θ²) = t → x = t√(2θ²) → dx = √(2θ²) dt
We calculate the variance by using the first two moments. For the second moment, we can apply integration by parts again (the exponent decreases one unit each time):

E(X²) = ∫₀^∞ x² (x/θ²) e^(−x²/(2θ²)) dx = [ −x² e^(−x²/(2θ²)) ]₀^∞ + ∫₀^∞ 2x e^(−x²/(2θ²)) dx = 2θ²

where ∫ u(x)·v'(x) dx = u(x)·v(x) − ∫ u'(x)·v(x) dx has been used with
• u = x² → u' = 2x
• v' = (x/θ²) e^(−x²/(2θ²)) → v = −e^(−x²/(2θ²))

so that Var(X) = E(X²) − E(X)² = 2θ² − (π/2)θ² = ((4−π)/2) θ²
μ₁(θ) = x̄ → θ√(π/2) = x̄ → θ = √(2/π) x̄ → θ̂_M = √(2/π) X̄
(b) Bias, mean square error and consistency
Bias: E(θ̂_M) = √(2/π) E(X̄) = √(2/π) θ√(π/2) = θ, so b(θ̂_M) = 0.

Variance: Var(θ̂_M) = (2/π) Var(X̄) = (2/π) Var(X)/n = ((4−π)/(πn)) θ²

Mean square error: MSE(θ̂_M) = b(θ̂_M)² + Var(θ̂_M) = 0 + ((4−π)/(πn)) θ² = ((4−π)/(πn)) θ²

Consistency: lim_{n→∞} MSE(θ̂_M) = lim_{n→∞} ((4−π)/(πn)) θ² = 0 and therefore θ̂_M is consistent (for θ).
(b) Maximum likelihood method
Likelihood function:

L(X;θ) = ∏_{j=1}^n (x_j/θ²) e^(−x_j²/(2θ²)) = ( ∏_{j=1}^n x_j ) θ^(−2n) e^(−Σ_{j=1}^n x_j²/(2θ²))

Log-likelihood function:
To facilitate the differentiation, θ^(2n) is moved to the numerator and a property of the logarithm is applied.

log( L(X;θ) ) = log( ∏_{j=1}^n x_j ) − Σ_{j=1}^n x_j²/(2θ²) + log( θ^(−2n) ) = log( ∏_{j=1}^n x_j ) − Σ_{j=1}^n x_j²/(2θ²) − 2n log(θ)

The necessary condition for an extreme value gives

0 = d/dθ log( L(X;θ) ) = Σ_{j=1}^n x_j²/θ³ − 2n/θ → θ₀² = Σ_{j=1}^n x_j²/(2n)

and the second derivative is

d²/dθ² log( L(X;θ) ) = d/dθ ( Σ_{j=1}^n x_j²/θ³ − 2n/θ ) = −3 Σ_{j=1}^n x_j²/θ⁴ + 2n/θ²
The first term is negative and the second is positive, but it is difficult to check qualitatively whether the
second is larger in absolute value than the first. Then, the extreme obtained is substituted:
d²/dθ² log( L(X;θ) ) |_{θ² = Σ x_j²/(2n)} = −3 Σ_{j=1}^n x_j² · 4n²/(Σ_{j=1}^n x_j²)² + 2n · 2n/Σ_{j=1}^n x_j² = −12n²/Σ_{j=1}^n x_j² + 4n²/Σ_{j=1}^n x_j² = −8n²/Σ_{j=1}^n x_j² < 0

so the candidate is a maximum. The estimator is

θ̂_ML = √( Σ_{j=1}^n X_j² / (2n) )
Discussion: The Rayleigh distribution is one of the few cases for which the two methods provide different
estimators of the parameter. In the first case, we could easily calculate the mean and the variance, as the
estimator was linear in Xj; nevertheless, in the second case the nonlinearities Xj2 and the square root make
those calculations difficult.
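A Monte Carlo sketch comparing the two estimators (all values arbitrary; Rayleigh samples are drawn via the inverse of the distribution function, X = θ√(−2 log U)):
# Monte Carlo comparison of the moment and maximum likelihood estimators
theta = 1.5; n = 50; M = 10000
rrayleigh = function(n, theta) theta * sqrt(-2 * log(runif(n)))
tM  = replicate(M, sqrt(2/pi) * mean(rrayleigh(n, theta)))
tML = replicate(M, sqrt(sum(rrayleigh(n, theta)^2) / (2*n)))
c(mean((tM - theta)^2), mean((tML - theta)^2))   # ML tends to show slightly smaller MSE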
My notes:
Exercise 3pe
Before commercializing a new model of light bulb, a deep statistical study on its duration (measured in days,
d) must be carried out. The population variable duration is expected to follow the exponential probability
model:

f(x;λ) = λ e^(−λx),  x ≥ 0  (λ > 0)

Now you must prove that X̄ is an efficient estimator of θ = λ⁻¹ (with this notation you can easily calculate the derivative d/dθ, while only experts can calculate d/dλ⁻¹).
(g) The empirical part of the study, based on the measurement of 55 independent light bulbs, has yielded a total sum of Σ_{j=1}^{55} x_j = 598 d. Introduce this information in the expressions obtained in previous sections to give final estimates of λ.
(h) Give an estimate of the mean μ = E(X).
Hint: For section (c), apply the factorization theorem and make it clear which the two parts are. In the theorem: (1) g and h are nonnegative; (2) T cannot depend on θ; (3) g depends only on the sample and the parameter, and it depends on the sample through T; (4) h can be 1; and (5) since h is any function of the sample, it may involve T.
Discussion: First of all, the supposition that the exponential distribution can reasonably be used to model the
variable duration should be tested. One aim of this exercise is to show how many methods and properties
involved in previous exercises can be involved in the same statistical analysis. The quality of the estimators obtained is also studied.
(a1) Population and sample moments: The population distribution has only one parameter, so one equation
suffices. The first-order moments of the model X and the sample x are, respectively,
μ₁(λ) = E(X) = 1/λ and m₁(x₁, x₂,..., x_n) = (1/n) Σ_{j=1}^n x_j = x̄
(a2) System of equations: Since the parameter of interest λ appears in the first moment of X, the solution is:
μ₁(λ) = m₁(x₁, x₂,..., x_n) → 1/λ = (1/n) Σ_{j=1}^n x_j = x̄ → λ = ( (1/n) Σ_{j=1}^n x_j )⁻¹ = 1/x̄
(a3) The estimator:
λ̂_M = ( (1/n) Σ_{j=1}^n X_j )⁻¹ = 1/X̄
(b1) Likelihood function: For an exponential random variable the density function is f (x ; λ)=λ e−λ x , so
we write the product and join the terms that are similar
L(x₁, x₂,..., x_n;λ) = ∏_{j=1}^n f(x_j;λ) = ∏_{j=1}^n λ e^(−λx_j) = λ e^(−λx₁) · λ e^(−λx₂) ⋯ λ e^(−λx_n) = λⁿ e^(−λ Σ_{j=1}^n x_j)
(b2) Optimization problem: The logarithm function is applied to make calculations easier
log[ L(x₁, x₂,..., x_n;λ) ] = log[λⁿ] + log[ e^(−λ Σ_{j=1}^n x_j) ] = n·log[λ] − λ·Σ_{j=1}^n x_j

The population distribution has only one parameter, and hence a one-dimensional function must be maximized. To find the local or relative extreme values, the necessary condition is:

0 = d/dλ log[ L(x₁, x₂,..., x_n;λ) ] = n/λ − Σ_{j=1}^n x_j → λ₀ = ( (1/n) Σ_{j=1}^n x_j )⁻¹ = 1/x̄

To verify that the only candidate is a (local) maximum, the sufficient condition is:

d²/dλ² log[ L(x₁, x₂,..., x_n;λ) ] = −n/λ² < 0

which holds for any value, particularly for λ₀ = 1/x̄.
(b3) The estimator:
λ̂_ML = ( (1/n) Σ_{j=1}^n X_j )⁻¹ = 1/X̄
mathematically both types of information; then, this term would be part of g too. Moreover, the only candidate to be a sufficient statistic is T(X) = T(X₁,...,X_n) = Σ_{j=1}^n X_j.
Since the condition holds for g(T(X₁, X₂,..., X_n);λ) = λⁿ e^(−λ Σ_{j=1}^n X_j) and h(X₁, X₂,..., X_n) = 1, the statistic T(X) = T(X₁,...,X_n) = Σ_{j=1}^n X_j is sufficient. This means that it “summarizes the important information (about the parameter)” contained in the sample. The previous statistic contains the same information as any one-to-one transformation of it, concretely the sample mean (1/n) T(X) = (1/n) Σ_{j=1}^n X_j = X̄.
(d1) Unbiasedness: By applying a property of the sample mean and the information of the statement,
E(X̄) = E(X) = 1/λ → b(X̄) = E(X̄) − λ = 1/λ − λ ≠ 0

The first condition does not hold for all values of λ, and hence it is not necessary to check the second one.

Note: The previous bias is zero when 1/λ − λ = 0 ↔ λ = ±√1 → λ = 1 (for f(x) to be a probability function, λ must be positive, so the solution −1 is not taken into account). Thus, when λ = 1, the estimator may still be efficient if the second condition holds.
(e2) Variance: By applying a property of the sample mean and the information of the statement,
Var(X̄) = Var(X)/n = 1/(λ²·n) → lim_{n→∞} Var(X̄) = lim_{n→∞} 1/(λ²·n) = 0
As a conclusion, the mean square error (MSE) tends to zero, which is sufficient—but not necessary—for the
consistency (in probability).
(f2) Minimum variance: We compare the variance and the Cramér-Rao's bound. The variance is:

Var(X̄) = Var(X)/n = θ²/n
On the other hand, the bound is calculated step by step:
i. Function (with X in place of x):

f(X;θ) = (1/θ) e^(−X/θ)

ii. Logarithm of the function:

log[f(X;θ)] = log(θ⁻¹) + log( e^(−X/θ) ) = −log(θ) − X/θ

iii.–iv. Derivative and expectation:

E[ ( ∂log[f(X;θ)]/∂θ )² ] = E[ ( −1/θ + X/θ² )² ] = E[ ( (X−θ)/θ² )² ] = Var(X)/θ⁴ = θ²/θ⁴ = 1/θ²

v. Theoretical Cramér-Rao's lower bound:

1 / ( n · E[ ( ∂log[f(X;θ)]/∂θ )² ] ) = 1 / ( n · (1/θ²) ) = θ²/n
The variance of the estimator attains the bound, so the estimator has minimum variance. The fulfillment of the two conditions proves that X̄ is an efficient estimator of λ⁻¹ = θ.
(g) Estimation of λ
It is necessary to use the only information available: Σ_{j=1}^{55} x_j = 598 d.

From the method of the moments: λ̂_M = ( (1/n) Σ_{j=1}^n x_j )⁻¹ = ( (1/55)·598 d )⁻¹ = 0.09197 d⁻¹.

From the maximum likelihood method, since the same estimator was obtained: λ̂_ML = 0.09197 d⁻¹.
(h) Estimation of μ
Since μ = E(X) = 1/λ, an estimator of λ induces, by applying the plug-in principle, an estimator of μ:

μ̂ = 1/λ̂ = x̄ = (1/55)·598 d = 10.87 d
Conclusion: We can see that for the exponential model the two methods provide the same estimator for λ.
The estimator obtained has been used to obtain an estimator of the population mean. The mean duration
estimate of the new model of light bulb was 10.87 days. On the other hand, some desirable properties of the
estimator have been proved. A different, equivalent notation has been used to facilitate the proof of one of
these properties, which emphasizes the importance of the notation in doing calculations.
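These final computations can be reproduced in R (only the total sum and the sample size of the statement are used):
# Estimates of sections (g) and (h)
n = 55; total = 598        # measured in days
lambdaHat = n/total        # 0.09197 per day
muHat = total/n            # 10.87 days
c(lambdaHat, muHat)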
My notes:
Remark 2ci: Since there is an infinite number of pairs of quantiles a1 and a2 such that P (a 1≤T ≤a2 )=1−α , those
determining tails of probability α/2 are considered by convention. This criterion is also applied for two-tailed hypothesis tests.
Remark 3ci: When the Central Limit Theorem can be applied, asymptotic results on averages are relatively independent of the initial population. Therefore, in some exercises there are no suppositions on the distribution of the population variables.
Exercise 1ci-m
To forecast the yearly inflation (in percent, %), a simple random sample has been gathered:
1.5 2.1 1.9 2.3 2.5 3.2 3.0
It is assumed that the variable inflation follows a normal distribution.
(a) By using these data, construct a 99% confidence interval for the mean of the inflation.
(b) Experts have the opinion that the previous interval is too wide, and they want a total length of a unit.
Find the level of confidence for this new interval.
(c) Construct a confidence interval of 90% for the standard deviation.
Discussion: The intervals will be built by applying the method of the pivot, and then the expression of the
margin of error is determined. Since variances are nonnegative by definition and the positive branch of the
square root function is strictly increasing, the interval for the standard deviation is obtained by applying the
square root to the interval for the variance.
Sample information
Theoretical (simple random) sample: X1,..., X7 s.r.s. → n = 7
Empirical sample: x1,..., x7 → 1.5 2.1 1.9 2.3 2.5 3.2 3.0
In this exercise, we know the values of the sample xi. This allows calculating any quantity we want.
(a) Confidence interval for the mean: To choose the proper pivot, we take into account:
• The variable of interest follows a normal distribution.
• The population variance σ2 is unknown, so it must be estimated by the sample (quasi)variance.
• The sample size is small, n = 7, so we should not think about the asymptotic framework.
From a table of statistics (e.g. in [T]), the pivot

T(X;μ) = (X̄−μ)/√(S²/n) ~ t_{n−1}

is selected. Then

1−α = P( −r_{α/2} ≤ (X̄−μ)/√(S²/n) ≤ +r_{α/2} ) = P( −r_{α/2}√(S²/n) ≤ X̄−μ ≤ +r_{α/2}√(S²/n) )
= P( −X̄−r_{α/2}√(S²/n) ≤ −μ ≤ −X̄+r_{α/2}√(S²/n) ) = P( X̄+r_{α/2}√(S²/n) ≥ μ ≥ X̄−r_{α/2}√(S²/n) )

so

I_{1−α} = [ X̄ − r_{α/2}√(S²/n) , X̄ + r_{α/2}√(S²/n) ]
where r α / 2 is the quantile such that P(T > r α/2 )=α /2. Let us calculate the quantities in the formula:
• x̄ = (1/7) Σ_{j=1}^7 x_j = 2.36 %
• The level of confidence is 99%, and hence α = 0.01. The quantile is found in the table of the t distribution with κ = 7−1 degrees of freedom: r_{α/2} = r_{0.01/2} = r_{0.005} = 3.71
• By using the data, S² = (1/(7−1)) Σ_{j=1}^7 (x_j − x̄)² = (1/6)[ (1.5% − 2.36%)² + ⋯ + (3.0% − 2.36%)² ] = 0.36 %²
• Finally, n = 7
I_{0.99} = [ 2.36% − 3.71·√(0.36%²/7) , 2.36% + 3.71·√(0.36%²/7) ] = [1.52% , 3.20%]

whose length is 3.20% − 1.52% = 1.68%.
(b) Confidence level: The length of the interval, the distance between the two endpoints, is twice the margin
of error when T follows a symmetric distribution.
L = ( X̄ + r_{α/2}√(S²/n) ) − ( X̄ − r_{α/2}√(S²/n) ) = 2 r_{α/2}√(S²/n)

In this section L is given and α must be found; nevertheless, it is necessary to find r_{α/2} first.

r_{α/2} = L√n/(2S) = (1%·√7)/(2·0.6%) = 2.20

> 1-pt(2.20, 7-1)
[1] 0.03505109

In the table of the t law it is found that α/2 = 0.035, so α = 0.07 and 1−α = 0.93. The confidence level is 93%.
(c) Confidence interval for the standard deviation: To choose the new statistic:
• The variable of interest follows a normal distribution.
• The quantity of interest is the standard deviation σ.
• The population mean μ is unknown.
• The sample size is small, n = 7, so we should not think about the asymptotic framework.
From a table of statistics (e.g. in [T]), the proper pivot T(X;σ) = (n−1)S²/σ² ~ χ²_{n−1} is selected, which leads to the interval for the variance

I_{1−α} = [ (n−1)S²/r_{α/2} , (n−1)S²/l_{α/2} ]

where r_{α/2} and l_{α/2} are the quantiles of the chi-squared distribution with 6 degrees of freedom leaving probability α/2 in the right and left tails, respectively. By applying the square root,

I_{0.9} = [ √(6·0.36%²/12.6) , √(6·0.36%²/1.64) ] = [0.414% , 1.148%]
Conclusion: The length in section (b) is smaller than in section (a), that is, the interval is narrower and the
confidence is smaller.
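Both sections can be checked with R (the data are those of the statement):
# Section (a): 99% interval for the mean
x = c(1.5, 2.1, 1.9, 2.3, 2.5, 3.2, 3.0)
c(mean(x), var(x))                    # approximately 2.36 and 0.36
qt(1 - 0.01/2, df=6)                  # 3.707, the quantile used above
t.test(x, conf.level=0.99)$conf.int   # interval for the mean
# Section (c): 90% interval for the standard deviation
sqrt(6*var(x)/qchisq(c(0.95, 0.05), df=6))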
My notes:
Exercise 2ci-m
In the library of a university, the mean duration (in days, d) of the borrowing period seems to be 20 d. A simple random sample of 100 books is analysed, and the values 18 d and 8 d² are obtained for the sample mean and the sample variance, respectively. Construct a 99% confidence interval for the mean duration of the borrowings to check if the initial population value is inside.
Discussion: For so many data, asymptotic results are considered. The method of the pivotal quantity can
also be applied. The dimension of the variable duration is time, while the unit of measurement is days.
Sample information:
Theoretical (simple random) sample: X1,...,X100 s.r.s. → n = 100
Empirical sample: x1,...,x100 → $\bar{x} = 18d$, $s^2 = 8d^2$
The values xj of the sample are unknown; instead, the evaluation of some statistics is given. These quantities must be sufficient for the calculations, and, therefore, formulas must be written in terms of $\bar{X}$ and $S^2$.
The statistic

$$T(X;\mu) = \frac{\bar{X}-\mu}{\sqrt{S^2/n}} \xrightarrow{d} N(0,1)$$

is chosen, where S² is the sample quasivariance. By applying the method of the pivotal quantity:
$$1-\alpha = P(l_{\alpha/2} \le T(X;\mu) \le r_{\alpha/2}) = P\left(-r_{\alpha/2} \le \frac{\bar{X}-\mu}{\sqrt{S^2/n}} \le +r_{\alpha/2}\right) = P\left(\bar{X}-r_{\alpha/2}\sqrt{\frac{S^2}{n}} \le \mu \le \bar{X}+r_{\alpha/2}\sqrt{\frac{S^2}{n}}\right)$$

Then, the interval is

$$I_{1-\alpha} = \left[\bar{X}-r_{\alpha/2}\sqrt{\frac{S^2}{n}},\ \bar{X}+r_{\alpha/2}\sqrt{\frac{S^2}{n}}\right]$$

where $r_{\alpha/2}$ is the quantile such that $P(Z > r_{\alpha/2}) = \alpha/2$.
The interval is

$$I_{0.99} = \left[18d - 2.58\sqrt{\frac{8.1d^2}{100}},\ 18d + 2.58\sqrt{\frac{8.1d^2}{100}}\right] = [17.27d,\ 18.73d]$$

where $S^2 = \frac{n}{n-1}s^2 = \frac{100\cdot 8d^2}{99} \approx 8.1d^2$ has been used.
Conclusion: With 99% confidence, the mean duration of the borrowings belongs to the interval obtained. The initial value μ = 20d is not inside this high-confidence interval, that is, it is not
supported by the data. (Remember: statistical results depend on: the assumptions, the methods, the certainty
and the data.)
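A minimal R sketch of the computation (the variable names are ours):

n = 100; xbar = 18; s2 = 8        # sample variance, divisor n
S2 = n*s2/(n-1)                   # quasivariance, ~8.1 d^2
r = qnorm(1-0.01/2)               # 2.58
xbar + c(-1,1) * r * sqrt(S2/n)   # [17.27, 18.73]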
My notes:
Exercise 3ci-m
The accounting firm Price Waterhouse periodically monitors the U.S. Postal Service's performance. One parameter of interest is the percentage of mail delivered on time. In a simple random sample of 332,000 mail items, 282,200 were delivered on time. Build a 99% confidence interval for the proportion of mail delivered on time.
Discussion: The population is characterized by a Bernoulli variable, since for each item there are only two
possible values. We must construct a confidence interval for the proportion (a percent is a proportion
expressed in a 0-to-100 scale). Proportions have no dimension.
Confidence interval
For this kind of population and amount of data, we use the statistic:

$$T(X;\eta) = \frac{\hat{\eta}-\eta}{\sqrt{\dfrac{?(1-?)}{n}}} \xrightarrow{d} N(0,1)$$

where ? is substituted by η or $\hat{\eta}$. For confidence intervals η is unknown and no value is supposed, and hence it is estimated through the sample proportion. By applying the method of the pivot:
$$1-\alpha = P(l_{\alpha/2} \le T(X;\eta) \le r_{\alpha/2}) = P\left(-r_{\alpha/2} \le \frac{\hat{\eta}-\eta}{\sqrt{\hat{\eta}(1-\hat{\eta})/n}} \le +r_{\alpha/2}\right) = P\left(\hat{\eta}-r_{\alpha/2}\sqrt{\frac{\hat{\eta}(1-\hat{\eta})}{n}} \le \eta \le \hat{\eta}+r_{\alpha/2}\sqrt{\frac{\hat{\eta}(1-\hat{\eta})}{n}}\right)$$

Then, the interval is

$$I_{1-\alpha} = \left[\hat{\eta}-r_{\alpha/2}\sqrt{\frac{\hat{\eta}(1-\hat{\eta})}{n}},\ \hat{\eta}+r_{\alpha/2}\sqrt{\frac{\hat{\eta}(1-\hat{\eta})}{n}}\right]$$
Substitution: We calculate the quantities in the formula,
• n = 332000
• $\hat{\eta} = \frac{282200}{332000} = 0.850$
• 99% → 1–α = 0.99 → α = 0.01 → α/2 = 0.005 → $r_{\alpha/2} = r_{0.005} = l_{0.995} = 2.58$
So

$$I_{0.99} = \left[0.850 - 2.58\sqrt{\frac{0.850(1-0.850)}{332000}},\ 0.850 + 2.58\sqrt{\frac{0.850(1-0.850)}{332000}}\right] = [0.848,\ 0.852]$$
Conclusion: With a confidence of 0.99, measured in a 0-to-1 scale, the value of η will be in the interval [0.848, 0.852].
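A minimal R sketch (variable names ours):

n = 332000; p = 282200/n          # 0.850
r = qnorm(1-0.01/2)               # 2.58
p + c(-1,1) * r * sqrt(p*(1-p)/n) # [0.848, 0.852]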
My notes:
Exercise 4ci-m
Two independent groups, A and B, consist of 100 people each, all of whom have a disease. A serum is given to group A but not to group B, which are termed treatment and control groups, respectively; otherwise, the two
groups are treated identically. Two simple random samples have yielded that in the two groups, 75 and 65
people, respectively, recover from the disease. To study the effect of the serum, build a 95% confidence
interval for the difference ηA–ηB. Does the interval contain the case ηA = ηB?
Discussion: There are two independent Bernoulli populations. The interval for the difference of proportions is built by applying the method of the pivot. Proportions are, by definition, dimensionless quantities.
The statistic

$$T(A,B) = \frac{(\hat{\eta}_A-\hat{\eta}_B)-(\eta_A-\eta_B)}{\sqrt{\dfrac{\hat{\eta}_A(1-\hat{\eta}_A)}{n_A}+\dfrac{\hat{\eta}_B(1-\hat{\eta}_B)}{n_B}}} \xrightarrow{d} N(0,1)$$

leads, by the method of the pivot, to the interval

$$I_{1-\alpha} = \left[(\hat{\eta}_A-\hat{\eta}_B) - r_{\alpha/2}\sqrt{\frac{\hat{\eta}_A(1-\hat{\eta}_A)}{n_A}+\frac{\hat{\eta}_B(1-\hat{\eta}_B)}{n_B}},\ (\hat{\eta}_A-\hat{\eta}_B) + r_{\alpha/2}\sqrt{\frac{\hat{\eta}_A(1-\hat{\eta}_A)}{n_A}+\frac{\hat{\eta}_B(1-\hat{\eta}_B)}{n_B}}\right]$$

where $r_{\alpha/2}$ is the value of the standard normal distribution such that $P(Z > r_{\alpha/2}) = \alpha/2$. Substituting $\hat{\eta}_A = 75/100 = 0.75$, $\hat{\eta}_B = 65/100 = 0.65$, $n_A = n_B = 100$ and $r_{0.025} = 1.96$,

$$I_{0.95} = (0.75-0.65) \mp 1.96\sqrt{\frac{0.75(1-0.75)}{100}+\frac{0.65(1-0.65)}{100}} = [-0.0263,\ 0.2263]$$
Conclusion: The lack-of-effect case (ηA = ηB) cannot be excluded when the decision has 95% confidence.
Since η ∈( 0,1), any “reasonable” estimator of η should provide values in this range or close to it. Because
of the natural uncertainty of the sampling process (randomness and variability), in this case the smaller
endpoint of the interval was –0.0263, which can be interpreted as being 0. When an interval of high
confidence is far from 0, the case ηA = ηB can clearly be discarded or rejected. Finally, it is important to notice
that a confidence interval can be used to make decisions about hypotheses on the parameter values—it is
equivalent to a two-sided hypothesis test, as the interval is also two-sided. (Remember: statistical results
depend on: the assumptions, the methods, the certainty and the data.)
Advanced theory: When the assumption ηA = η = ηB seems reasonable (notice that this case is included in the 95% confidence interval just calculated), it makes sense to try to estimate the common variance of the estimator as well as possible. This can be done by using the pooled sample proportion $\hat{\eta}_p = \frac{n_A\hat{\eta}_A + n_B\hat{\eta}_B}{n_A+n_B}$ in estimating η(1–η) for the denominator; nonetheless, the pooled estimator should not be considered in the numerator, as $(\hat{\eta}_p-\hat{\eta}_p)=0$ whatever the data are. The statistic would be:
$$\tilde{T}(A,B) = \frac{(\hat{\eta}_A-\hat{\eta}_B)-(\eta_A-\eta_B)}{\sqrt{\dfrac{\hat{\eta}_p(1-\hat{\eta}_p)}{n_A}+\dfrac{\hat{\eta}_p(1-\hat{\eta}_p)}{n_B}}} \xrightarrow{d} N(0,1)$$

$$\tilde{I}_{1-\alpha} = \left[(\hat{\eta}_A-\hat{\eta}_B) - r_{\alpha/2}\sqrt{\frac{\hat{\eta}_p(1-\hat{\eta}_p)}{n_A}+\frac{\hat{\eta}_p(1-\hat{\eta}_p)}{n_B}},\ (\hat{\eta}_A-\hat{\eta}_B) + r_{\alpha/2}\sqrt{\frac{\hat{\eta}_p(1-\hat{\eta}_p)}{n_A}+\frac{\hat{\eta}_p(1-\hat{\eta}_p)}{n_B}}\right]$$
The quantities involved in the previous formula are
• nA = 100 and nB = 100
• $\hat{\eta}_p = \frac{100\cdot 0.75 + 100\cdot 0.65}{100+100} = 0.70$
Then,

$$\tilde{I}_{0.95} = (0.75-0.65) \mp 1.96\sqrt{\frac{2\cdot 0.70(1-0.70)}{100}} = [-0.0270,\ 0.227]$$
One way to measure how different the results are consists in directly comparing the length—twice the margin
of error—in both cases:
$$L = 0.226-(-0.0263) = 0.2523 \qquad \tilde{L} = 0.227-(-0.0270) = 0.254$$

Even if the latter length is larger, it is theoretically more trustworthy than the former when ηA = η = ηB is true.
The general expressions of these lengths can be found too:

$$L = 2r_{\alpha/2}\sqrt{\frac{\hat{\eta}_A(1-\hat{\eta}_A)}{n_A}+\frac{\hat{\eta}_B(1-\hat{\eta}_B)}{n_B}} \qquad \tilde{L} = 2r_{\alpha/2}\sqrt{\frac{\hat{\eta}_p(1-\hat{\eta}_p)}{n_A}+\frac{\hat{\eta}_p(1-\hat{\eta}_p)}{n_B}}$$
Another way to measure how different the results are can be based on comparing the statistics:

$$\tilde{T}(A,B) = \frac{(\hat{\eta}_A-\hat{\eta}_B)-(\eta_A-\eta_B)}{\sqrt{\dfrac{\hat{\eta}_p(1-\hat{\eta}_p)}{n_A}+\dfrac{\hat{\eta}_p(1-\hat{\eta}_p)}{n_B}}} = T(A,B)\cdot\frac{\sqrt{\dfrac{\hat{\eta}_A(1-\hat{\eta}_A)}{n_A}+\dfrac{\hat{\eta}_B(1-\hat{\eta}_B)}{n_B}}}{\sqrt{\dfrac{\hat{\eta}_p(1-\hat{\eta}_p)}{n_A}+\dfrac{\hat{\eta}_p(1-\hat{\eta}_p)}{n_B}}}$$

so that

$$\frac{\tilde{T}}{T} = \frac{\sqrt{\dfrac{\hat{\eta}_A(1-\hat{\eta}_A)}{n_A}+\dfrac{\hat{\eta}_B(1-\hat{\eta}_B)}{n_B}}}{\sqrt{\dfrac{\hat{\eta}_p(1-\hat{\eta}_p)}{n_A}+\dfrac{\hat{\eta}_p(1-\hat{\eta}_p)}{n_B}}} = \frac{L}{\tilde{L}} \qquad (\text{so } L\cdot T = \tilde{L}\cdot\tilde{T})$$
Thus, the quantity

$$\frac{\sqrt{\dfrac{\hat{\eta}_A(1-\hat{\eta}_A)}{n}+\dfrac{\hat{\eta}_B(1-\hat{\eta}_B)}{n}}}{\sqrt{\dfrac{\hat{\eta}_p(1-\hat{\eta}_p)}{n}+\dfrac{\hat{\eta}_p(1-\hat{\eta}_p)}{n}}} = \frac{\sqrt{\hat{\eta}_A(1-\hat{\eta}_A)+\hat{\eta}_B(1-\hat{\eta}_B)}}{\sqrt{2\,\hat{\eta}_p(1-\hat{\eta}_p)}} = 0.994$$
can be seen as a measure of the effect of using the pooled sample proportion. This effect is small in this exercise, but it could be larger in other situations. As regards the case ηA = η = ηB, it is also included in this interval, which is not remarkable since it has been used as an assumption; nevertheless, the exclusion of this case would have contradicted the initial assumption.
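A minimal R sketch of both intervals and of the measure above (variable names ours):

nA = 100; nB = 100; pA = 0.75; pB = 0.65
r = qnorm(1-0.05/2)                       # 1.96
se = sqrt(pA*(1-pA)/nA + pB*(1-pB)/nB)    # unpooled standard error
pp = (nA*pA + nB*pB)/(nA+nB)              # pooled proportion, 0.70
sep = sqrt(pp*(1-pp)/nA + pp*(1-pp)/nB)   # pooled standard error
(pA-pB) + c(-1,1) * r * se                # [-0.0263, 0.2263]
(pA-pB) + c(-1,1) * r * sep               # [-0.0270, 0.2270]
se/sep                                    # 0.994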
My notes:
Remark 5ci: Once there is a discrete quantity in an equation, the unknown cannot take any possible value. This implies that, strictly speaking, equalities like

$$E = r_{\alpha/2}\sqrt{\frac{\sigma^2}{n}} \qquad \frac{\sigma^2}{nE^2} = \alpha$$

may never be fulfilled for continuous E, α, σ and discrete n. Solving the equality and rounding the result upward is an alternative to solving the inequalities

$$E_g \ge E = r_{\alpha/2}\sqrt{\frac{\sigma^2}{n}} \qquad \frac{\sigma^2}{nE_g^2} \le \alpha$$

where the purpose is to find the minimum n for which the (possibly discrete values of the) margin of error is smaller than or equal to the given precision Eg.
Exercise 1ci-s
The lengths (in millimeters, mm) of metal rods produced by an industrial process are normally distributed
with a standard deviation of 1.8mm. Based on a simple random sample of nine observations from this
population, the 99% confidence interval was found for the population mean length to extend from 194.65mm
to 197.75mm. Suppose that a production manager believes that the interval is too wide for practical use and,
instead, requires a 99% confidence interval extending no further than 0.50mm on each side of the sample
mean. How large a sample is needed to achieve such an interval? Apply both the method based on the confidence interval and the method based on Chebyshev's inequality.
(From: Statistics for Business and Economics, Newbold, P., W.L. Carlson and B.M. Thorne, Pearson.)
Discussion: There is one normal population with known standard deviation. By using a sample of nine
elements, a 99% confidence interval was built, I1 = [194.65mm, 197.75mm], of length 197.75mm – 194.65mm
= 3.1mm and margin of error 3.1mm/2 = 1.55mm. A narrower interval is desired, and the number of data
necessary in the new sample must be calculated. More data will be necessary for the new margin of error to be
smaller (0.50 < 1.55) while the other quantities—standard deviation and confidence—are the same.
Sample information:
Theoretical (simple random) sample: X1,..., Xn s.r.s. (the lengths of n rods are taken)
Margin of error:
We need the expression of the margin of error. If we do not remember it, we can apply the method of the pivot
to take the expression from the formula of the interval.
$$I_{1-\alpha} = \left[\bar{X}-r_{\alpha/2}\sqrt{\frac{\sigma^2}{n}},\ \bar{X}+r_{\alpha/2}\sqrt{\frac{\sigma^2}{n}}\right]$$

If we remember the expression, we can use it directly. Either way, the margin of error (for one normal population with known variance) is

$$E = r_{\alpha/2}\sqrt{\frac{\sigma^2}{n}}$$
Sample size
Method based on the confidence interval: We want the margin of error E to be smaller than or equal to the given Eg,
$$E_g \ge E = r_{\alpha/2}\sqrt{\frac{\sigma^2}{n}} \ \rightarrow\ E_g^2 \ge r_{\alpha/2}^2\frac{\sigma^2}{n} \ \rightarrow\ n \ge r_{\alpha/2}^2\frac{\sigma^2}{E_g^2} = \left(2.58\,\frac{1.8\text{mm}}{0.5\text{mm}}\right)^2 = 86.27 \ \rightarrow\ n \ge 87$$

since $r_{\alpha/2} = r_{0.01/2} = r_{0.005} = 2.58$. (The inequality changes neither when multiplying or dividing by positive quantities nor when squaring, while it changes when inverting.)
Method based on Chebyshev's inequality: For unbiased estimators, it holds that:

$$P(|\hat{\theta}-\theta| \ge E) = P(|\hat{\theta}-E(\hat{\theta})| \ge E) \le \frac{Var(\hat{\theta})}{E^2} \le \alpha$$

so, with $Var(\hat{\theta}) = Var(\bar{X}) = \frac{\sigma^2}{n}$,

$$\frac{1}{n}\frac{\sigma^2}{E_g^2} \le \alpha \ \rightarrow\ n \ge \frac{1}{\alpha}\frac{\sigma^2}{E_g^2} = \frac{1.8^2\text{mm}^2}{0.01\cdot 0.5^2\text{mm}^2} = 1296 \ \rightarrow\ n \ge 1296$$
Conclusion: At least n data are necessary to guarantee that the margin of error is at most 0.50mm (this margin can be thought of as “the maximum error in probability”, in the sense that the distance or error $|\theta-\hat{\theta}|$ will be smaller than Eg with a probability of 1–α = 0.99, but larger with a probability of α = 0.01). Any number of data larger than n would guarantee—and go beyond—the precision desired. As expected, more data are necessary (87 > 9) to increase the accuracy (narrower interval) with the same confidence. The minimum sample sizes provided by the two methods are quite different (see remark 4ci). (Remember: statistical results depend on: the assumptions, the methods, the certainty and the data.)
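A minimal R sketch of the two sample-size calculations (variable names ours):

sigma = 1.8; Eg = 0.5; alpha = 0.01
r = 2.58                          # qnorm(1-alpha/2), rounded as in the tables
ceiling((r*sigma/Eg)^2)           # 87, confidence-interval method
ceiling(sigma^2/(alpha*Eg^2))     # 1296, Chebyshev method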
My notes:
Sample information:
Theoretical (simple random) sample: X1,..., X9 s.r.s. (the marks of nine students are to be taken) → n = 9
Empirical sample: x1,...,x9 → $\sum_{j=1}^{9} x_j = 1{,}098$, $\sum_{j=1}^{9} x_j^2 = 138{,}148$ (the marks have been taken)
We can see that the sample values xj themselves are unknown in this exercise; instead, information calculated
from them is provided; this information must be sufficient for carrying out the calculations.
a) Method of the pivotal quantity: To choose the proper statistic with which the confidence interval is
calculated, we take into account that:
• The variable follows a normal distribution
• We are given the value of the population standard deviation σ
• The sample size is small, n = 9, so asymptotic formulas cannot be applied
the statistic

$$T(X;\mu) = \frac{\bar{X}-\mu}{\sqrt{\sigma^2/n}} \sim N(0,1)$$

is selected. Then

$$1-\alpha = P(l_{\alpha/2} \le T(X;\mu) \le r_{\alpha/2}) = P\left(-r_{\alpha/2} \le \frac{\bar{X}-\mu}{\sqrt{\sigma^2/n}} \le +r_{\alpha/2}\right) = P\left(\bar{X}-r_{\alpha/2}\sqrt{\frac{\sigma^2}{n}} \le \mu \le \bar{X}+r_{\alpha/2}\sqrt{\frac{\sigma^2}{n}}\right)$$

so

$$I_{1-\alpha} = \left[\bar{X}-r_{\alpha/2}\sqrt{\frac{\sigma^2}{n}},\ \bar{X}+r_{\alpha/2}\sqrt{\frac{\sigma^2}{n}}\right]$$
where r α / 2 is the value of the standard normal distribution verifying P( Z>r α /2 )=α / 2 , that is, the value
such that an area equal to α /2 is on the right (upper tail).
$$I_{0.9} = \left[122 - 1.645\,\frac{28.2}{\sqrt{9}},\ 122 + 1.645\,\frac{28.2}{\sqrt{9}}\right] = [106.54,\ 137.46]$$

where $\bar{x} = \frac{1}{9}\sum_{j=1}^{9}x_j = \frac{1098}{9} = 122$.
b) Length of the interval: To answer this question it is possible to argue that, when all the parameters but the
length are fixed, if higher certainty is desired it is necessary to widen the interval, that is, to increase the
distance between the two endpoints. The formal way to justify this idea consists in using the formula of the
interval:
$$L = \left(\bar{X}+r_{\alpha/2}\sqrt{\frac{\sigma^2}{n}}\right) - \left(\bar{X}-r_{\alpha/2}\sqrt{\frac{\sigma^2}{n}}\right) = 2r_{\alpha/2}\sqrt{\frac{\sigma^2}{n}}$$
Now, if σ and n remain unchanged, to study how L changes with α it is enough to see how the quantile
“moves”. For the 95% interval:
• α = 0.05 → α decreases with respect to the value in section (a)
• Now r α/ 2 must leave less area (probability) on the right → r α/ 2 increases → L increases
In short, when the tails (α) get smaller the interval (1–α) gets wider, and vice versa.
c) Sample size:
Method based on the confidence interval: Now the 90% confidence interval of the first section is revisited.
For given α and Lg, the value of n must be found. From the expression of the length,

$$L_g \ge L = 2r_{\alpha/2}\sqrt{\frac{\sigma^2}{n}} \ \rightarrow\ L_g^2 \ge 4r_{\alpha/2}^2\frac{\sigma^2}{n} \ \rightarrow\ n \ge \left(2r_{\alpha/2}\frac{\sigma}{L_g}\right)^2 = \left(2\cdot 1.645\,\frac{28.2}{10}\right)^2 = 86.08 \ \rightarrow\ n \ge 87$$

(Only when inverting must the inequality be changed.)
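A minimal R sketch for this exercise (variable names ours):

sigma = 28.2; n = 9; xbar = 1098/9        # 122
r = qnorm(1-0.1/2)                        # 1.645, 90% confidence
xbar + c(-1,1) * r * sigma/sqrt(n)        # [106.54, 137.46]
Lg = 10
ceiling((2*r*sigma/Lg)^2)                 # 87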
My notes:
Discussion: For 64 data, asymptotic results can be applied. The method of the pivotal quantity will be
applied. The role of the number 100 is no other than being part of the units in which the data are measured.
For the second section, additional suppositions—added by myself—are considered; in a real-world situation
they should be evaluated.
Sample information:
Theoretical (simple random) sample: C = (C1,...,C64) s.r.s. → n = 64
Empirical sample: c = (c1,...,c64) → c̄=9.36u , s=1.4 u
The values cj of the sample are unknown; instead, the evaluation of some statistics is given. These quantities must be sufficient for the calculations, so formulas must involve $\bar{C}$ and $s^2$.
The statistic

$$T(C;\mu) = \frac{\bar{C}-\mu}{\sqrt{S^2/n}} \xrightarrow{d} N(0,1)$$

is chosen, where S² will be calculated by applying the relation $n s^2 = (n-1)S^2$. By applying the method of the pivot:
$$1-\alpha = P(l_{\alpha/2} \le T(C;\mu) \le r_{\alpha/2}) = P\left(-r_{\alpha/2} \le \frac{\bar{C}-\mu}{\sqrt{S^2/n}} \le +r_{\alpha/2}\right) = P\left(\bar{C}-r_{\alpha/2}\sqrt{\frac{S^2}{n}} \le \mu \le \bar{C}+r_{\alpha/2}\sqrt{\frac{S^2}{n}}\right)$$

Then, the confidence interval is

$$I_{1-\alpha} = \left[\bar{C}-r_{\alpha/2}\sqrt{\frac{S^2}{n}},\ \bar{C}+r_{\alpha/2}\sqrt{\frac{S^2}{n}}\right]$$

where $r_{\alpha/2}$ is the quantile such that $P(Z > r_{\alpha/2}) = \alpha/2$.
The interval is

$$I_{0.96} = \left[9.36u - 2.054\sqrt{\frac{1.99u^2}{64}},\ 9.36u + 2.054\sqrt{\frac{1.99u^2}{64}}\right] = [9.00u,\ 9.72u]$$

where $S^2 = \frac{n}{n-1}s^2 = \frac{64\cdot 1.4^2u^2}{63} \approx 1.99u^2$.
From a table of statistics (e.g. in [T]), the following pivot is selected (now the exact sampling distribution is known, instead of the asymptotic distribution):

$$T(C;\mu) = \frac{\bar{C}-\mu}{\sqrt{\sigma^2/n}} \sim N(0,1)$$
By doing calculations similar to those of the previous section or exercise, the interval is

$$I_{1-\alpha} = \left[\bar{C}-r_{\alpha/2}\sqrt{\frac{\sigma^2}{n}},\ \bar{C}+r_{\alpha/2}\sqrt{\frac{\sigma^2}{n}}\right]$$

from which the expression of the margin of error is obtained, namely $E = r_{\alpha/2}\sqrt{\sigma^2/n}$. Values can be substituted either before or after working the inequality; this time let us use numbers from the beginning:

$$E_g = \frac{1}{4}u \ge E = 2.054\sqrt{\frac{2u^2}{n}} \ \rightarrow\ \frac{1}{4^2}u^2 \ge 2.054^2\,\frac{2u^2}{n} \ \rightarrow\ n \ge 4^2\cdot 2.054^2\cdot 2 = 135.01 \ \rightarrow\ n \ge 136$$

(When inverting, the inequality must be changed.)
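A minimal R sketch of both sections (variable names ours; σ² = 2u² in the second part):

n = 64; cbar = 9.36; s2 = 1.4^2
S2 = n*s2/(n-1)                  # quasivariance, ~1.99 u^2
r = 2.054                        # qnorm(1-0.04/2), rounded as in the tables
cbar + c(-1,1)*r*sqrt(S2/n)      # [9.00, 9.72]
ceiling(r^2*2/(1/4)^2)           # 136, with sigma^2 = 2u^2 and Eg = 1/4 u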
My notes:
Exercise 3ci
You have been hired by a consortium of dairy farmers to conduct a survey about the consumption of milk.
Based on results from a pilot study, assume that σ = 8.7oz. Suppose that the amount of milk is normally
distributed. If you want to estimate the mean amount of milk consumed daily by adults:
(a) How many adults must you survey if you want 95% confidence that your sample mean is in error by no more than 0.5oz? Apply both the method based on the confidence interval and the method based on Chebyshev's inequality.
(b) Calculate the margin of error if the number of data in the sample were twice the minimum (rounded) value that you obtained. Is the margin of error now half the value it was?
(Based on an exercise of: Elementary Statistics. Triola M.F. Pearson.)
Discussion: There is one normal population with known standard deviation. In both sections, the answer can
be found by using the expression of the margin of error.
Sample information:
Theoretical (simple random) sample: X1,...,Xn s.r.s. (the amount is measured for n adults)
If we remember the expression, we can use it directly. Either way, the margin of error (for one normal population with known variance) is:

$$E = r_{\alpha/2}\sqrt{\frac{\sigma^2}{n}}$$
(a) Sample size
Method based on the confidence interval: The equation involves four quantities, and we can calculate any of them once the others are known. Here:

$$E_g \ge E = r_{\alpha/2}\sqrt{\frac{\sigma^2}{n}} \ \rightarrow\ E_g^2 \ge r_{\alpha/2}^2\frac{\sigma^2}{n} \ \rightarrow\ n \ge r_{\alpha/2}^2\frac{\sigma^2}{E_g^2} = \left(1.96\,\frac{8.7\text{oz}}{0.5\text{oz}}\right)^2 = 1163.08 \ \rightarrow\ n \ge 1164$$

since $r_{\alpha/2} = r_{0.05/2} = r_{0.025} = 1.96$. (The inequality changes neither when multiplying or dividing by positive quantities nor when squaring, while it changes when inverting.)
Method based on Chebyshev's inequality: As in exercise 1ci-s,

$$\frac{\sigma^2}{nE_g^2} \le \alpha \ \rightarrow\ n \ge \frac{1}{\alpha}\frac{\sigma^2}{E_g^2} = \frac{8.7^2\text{oz}^2}{0.05\cdot 0.5^2\text{oz}^2} = 6055.2 \ \rightarrow\ n \ge 6056$$
(b) Margin of error: With twice the minimum (rounded) number of data,

$$E = r_{\alpha/2}\sqrt{\frac{\sigma^2}{2n}} = 1.96\sqrt{\frac{8.7^2\text{oz}^2}{2\cdot 1164}} = 0.3534\text{oz}$$

When the sample size is doubled, the margin of error is not reduced by half but by less than this amount:

$$\tilde{E} = r_{\alpha/2}\sqrt{\frac{\sigma^2}{2n}} = \frac{1}{\sqrt{2}}\,r_{\alpha/2}\sqrt{\frac{\sigma^2}{n}} = \frac{E}{\sqrt{2}} = \frac{0.5\text{oz}}{\sqrt{2}} = 0.3535\text{oz}$$
Now it is easy to see that if the sample size is multiplied by 2, the margin of error is divided by √2. Besides,
more generally:
Proposition
For the confidence interval estimation of the mean of a normal population with known variance,
based on the method of the pivot, when the sample size is multiplied by any scalar c the margin of
error is divided by √c.
(Notice that the real margin of error after rounding n upward is slightly smaller than 0.5oz; that is why there is a small difference between the results of both ways.)
Conclusion: At least 1164 or 6056 data (depending on the method) are necessary to guarantee that the margin of error is at most 0.5oz (this margin can be thought of as “the maximum error in probability”, in the sense that the distance or error $|\theta-\hat{\theta}|$ will be smaller than Eg with a probability of 1–α = 0.95, but larger with a probability of α = 0.05).
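A minimal R sketch of both sections (variable names ours):

sigma = 8.7; Eg = 0.5; alpha = 0.05
r = qnorm(1-alpha/2)             # 1.96
n = ceiling((r*sigma/Eg)^2); n   # 1164
ceiling(sigma^2/(alpha*Eg^2))    # 6056, Chebyshev method
r*sigma/sqrt(2*n)                # 0.3534 oz = E/sqrt(2), not E/2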
My notes:
Exercise 4ci
A company makes two products, A and B, that can be considered independent and whose demands follow the
distributions N(μA, σA2=702u2) and N(μB, σB2=602u2), respectively. After analysing 500 shops, the two simple
random samples yield a = 156 and b = 128.
(a) Build 95 and 98 percent confidence intervals for the difference between the population means.
(b) What are the margins of error? If sales are measured in the unit u = number of boxes, what is the unit of measure of the margin of error?
(c) If a margin of error equal to 10 is desired, how many shops are necessary? Apply both the method based on the confidence interval and the method based on Chebyshev's inequality.
(d) If only product A is considered, as if product B had not been analysed, how many shops are necessary
to guarantee a margin of error equal to 10? Again, apply the two methods.
LINGUISTIC NOTE (From: Longman Dictionary of Common Errors. Turton, N.D., and J.B.Heaton. Longman.)
company. an organization that makes or sells goods or that sells services: 'My father works for an insurance company.' 'IBM is one of the
biggest companies in the electronics industry.'
factory. a place where goods such as furniture, carpets, curtains, clothes, plates, toys, bicycles, sports equipment, drinks and packaged
food are produced: 'The company's UK factory produces 500 golf trolleys a week.'
industry. (1) all the people, factories, companies etc involved in a major area of production: 'the steel industry', 'the clothing industry'
(2) all industries considered together as a single thing: 'Industry has developed rapidly over the years at the expense of agriculture.'
mill. (1) a place where a particular type of material is made: 'a cotton mill', 'a textile mill', 'a steel mill', 'a paper mill' (2) a place where
flour is made from grain: 'a flour mill'
plant. a factory or building where vehicles, engines, weapons, heavy machinery, drugs or industrial chemicals are produced, where chemical processes are carried out, or where power is generated: 'Vauxhall-Opel's UK car plants', 'Honda's new engine plant at Swindon', 'a sewage plant', 'a wood treatment plant', 'ICI's ₤100m plant', 'the Sellafield nuclear reprocessing plant in Cumbria'
works. an industrial building where materials such as cement, steel, and bricks are produced, or where industrial processes are carried
out: 'The drop in car and van sales has led to redundancies in the country's steel works.'
Discussion: The supposition that the normal distribution is appropriate to model both variables should be statistically proved. The independence of the two populations should be tested as well. The method of the pivot will be applied. After obtaining the theoretical expression of the interval, it is possible to argue about the relation confidence-length. Given the length of the interval, the expression allows us to calculate the minimum number of data necessary. The numbers of units demanded can be seen as dimensionless quantities. An approximation is implicitly being used in this exercise, since the number of units demanded is a discrete variable while the normal distribution is continuous.
(a1) The statistic

$$T(A,B;\mu_A,\mu_B) = \frac{(\bar{A}-\bar{B})-(\mu_A-\mu_B)}{\sqrt{\dfrac{\sigma_A^2}{n_A}+\dfrac{\sigma_B^2}{n_B}}} \sim N(0,1)$$

(a2) Event rewriting

$$1-\alpha = \cdots = P\left((\bar{A}-\bar{B})-r_{\alpha/2}\sqrt{\frac{\sigma_A^2}{n_A}+\frac{\sigma_B^2}{n_B}} \le \mu_A-\mu_B \le (\bar{A}-\bar{B})+r_{\alpha/2}\sqrt{\frac{\sigma_A^2}{n_A}+\frac{\sigma_B^2}{n_B}}\right)$$

(a3) The interval

$$I_{1-\alpha} = \left[(\bar{A}-\bar{B})-r_{\alpha/2}\sqrt{\frac{\sigma_A^2}{n_A}+\frac{\sigma_B^2}{n_B}},\ (\bar{A}-\bar{B})+r_{\alpha/2}\sqrt{\frac{\sigma_A^2}{n_A}+\frac{\sigma_B^2}{n_B}}\right]$$
Thus, at 95%

$$I_{0.95} = \left[(156-128)-1.96\sqrt{\frac{70^2}{500}+\frac{60^2}{500}},\ (156-128)+1.96\sqrt{\frac{70^2}{500}+\frac{60^2}{500}}\right] = [19.92,\ 36.08]$$
and at 98%

$$I_{0.98} = \left[(156-128)-2.326\sqrt{\frac{70^2}{500}+\frac{60^2}{500}},\ (156-128)+2.326\sqrt{\frac{70^2}{500}+\frac{60^2}{500}}\right] = [18.41,\ 37.59]$$

(b) Margins of error: At 95%,

$$E_{0.95} = r_{\alpha/2}\sqrt{\frac{\sigma_A^2}{n_A}+\frac{\sigma_B^2}{n_B}} = 1.96\sqrt{\frac{70^2u^2}{500}+\frac{60^2u^2}{500}} = 1.96\sqrt{\frac{70^2}{500}+\frac{60^2}{500}}\sqrt{u^2} = 8.08u$$

and at 98%

$$E_{0.98} = 2.326\sqrt{\frac{70^2u^2}{500}+\frac{60^2u^2}{500}} = 9.59u$$

Since the demands are measured in u, the margin of error is measured in u too.
(c) Sample size: With a common sample size nA = n = nB,

$$E_g \ge E = r_{\alpha/2}\sqrt{\frac{\sigma_A^2}{n}+\frac{\sigma_B^2}{n}} \ \rightarrow\ E_g^2 \ge r_{\alpha/2}^2\,\frac{\sigma_A^2+\sigma_B^2}{n} \ \rightarrow\ n \ge r_{\alpha/2}^2\,\frac{\sigma_A^2+\sigma_B^2}{E_g^2} = 1.96^2\,\frac{70^2+60^2}{10^2} = 326.5 \ \rightarrow\ n \ge 327$$

at 95% (at 98%, $n \ge 2.326^2\cdot 8500/100 = 459.9 \rightarrow n \ge 460$).
(d) For only one product,

$$I_{1-\alpha} = \left[\bar{A}-r_{\alpha/2}\sqrt{\frac{\sigma_A^2}{n_A}},\ \bar{A}+r_{\alpha/2}\sqrt{\frac{\sigma_A^2}{n_A}}\right] \qquad\text{and}\qquad E = r_{\alpha/2}\sqrt{\frac{\sigma_A^2}{n_A}}$$

(Note that this case can be thought of as a particular case where the second population has values B = 0, μB=0 and σB²=0.) Then,

$$E_g \ge E = r_{\alpha/2}\sqrt{\frac{\sigma_A^2}{n_A}} \ \rightarrow\ E_g^2 \ge r_{\alpha/2}^2\frac{\sigma_A^2}{n_A} \ \rightarrow\ n_A \ge r_{\alpha/2}^2\frac{\sigma_A^2}{E_g^2} = 1.96^2\,\frac{70^2}{10^2} = 188.2 \ \rightarrow\ n_A \ge 189$$

at 95%.
Conclusion: As expected, when the probability of the tails α decreases the margin of error—and hence the
length—increases. For either one or two products and given the margin of error, the more confidence (less
significance) we want the more data we need. Since 500 shops were really considered to attain this margin of
error, there has been a waste of time and money—fewer shops would have sufficed for the desired accuracy
(95% or 98%). When two independent quantities are added or subtracted, the error or uncertainty of the result
can be as large as the total of the two individual errors or uncertainties; this also holds for random quantities
(if they are dependent, a correction term—covariance—appears); for this reason, to guarantee the same
margin of error, more data are necessary in each of the two samples—notice that for two populations the
minimum value is larger than or equal to the sum of the minimum values that would be necessary for each
population individually (for the same precision and confidence). The minimum sample sizes provided by the
two methods are quite different (see remark 4ci). (Remember: statistical results depend on: the assumptions,
the methods, the certainty and the data.)
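A minimal R sketch of sections (c) and (d), including the Chebyshev variants (variable names ours):

sA = 70; sB = 60; Eg = 10; alpha = 0.05
r95 = 1.96; r98 = 2.326            # normal quantiles, rounded as in the tables
# (c) two products, common sample size n per product
ceiling(r95^2*(sA^2+sB^2)/Eg^2)    # 327 at 95%
ceiling(r98^2*(sA^2+sB^2)/Eg^2)    # 460 at 98%
ceiling((sA^2+sB^2)/(alpha*Eg^2))  # 1700, Chebyshev at 95%
# (d) product A alone
ceiling(r95^2*sA^2/Eg^2)           # 189 at 95%
ceiling(sA^2/(alpha*Eg^2))         # 980, Chebyshev at 95%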
My notes:
Remark 2ht: The quantities α, p-value, β, 1–β and φ are probabilities, so their values must be between 0 and 1.
Remark 3ht: For two-tailed tests, since there is an infinite number of pairs of quantiles such that P (a 1≤T 0≤a2 )=1−α ,
those that determine tails of probability α/2 are considered by convention. This criterion is also applied for confidence intervals.
Remark 4ht: To apply the second methodology, binding the p-value is sometimes enough to compare it with α. To do that, the
proper closest value included in the table is used.
Remark 5ht: In calculating the p-value for two-tailed tests, by convention the probability of the tail determined by T0(x,y) is
doubled. When T0(X,Y) follows an asymmetric distribution, it is difficult to identify the tail if the value of T0(x,y) is close to the
median. In fact, knowing the median is not necessary, since if we select the wrong tail, twice its probability will be greater than 1
and we will realize that the other tail must have been considered. Alternatively, it is always possible to calculate the two
probabilities (on the left and on the right) and double the minimum of them (this is useful in writing code for software programs).
Remark 6ht: When more than one test can be applied to make a decision about the same hypotheses, the most powerful should be
considered (if it exists).
Remark 7ht: After making a decision, it is possible to evaluate the strength with which it was made: for the first methodology, by comparing the distance from the statistic to the critical values—or, better, the area between this set of values and the density function of T0—and, for the second methodology, by looking at the magnitude of the p-value.
Remark 8ht: For small sample sizes, n=2 or n=3, the critical region—obtained by applying any methodology—can be plotted in the two- or three-dimensional space.
[HT] Parametric
Remark 9ht: There are four types of pair of hypotheses:
(1) simple versus simple
(2) simple versus one-sided composite
(3) one-sided composite versus one-sided composite
(4) simple versus two-sided composite
We will directly apply Neyman-Pearson's lemma for the first case. When the solution of the first case does not depend upon any
particular value of the parameter θ1 under H1, the same test will be uniformly most powerful for the second case. In addition, when
there is a uniformly most powerful test for the second case, it will also be uniformly most powerful for the third case.
Remark 10ht: Given H0 and α, different decisions can be made for one- and two-tailed tests. That is why: (i) describing the details of the framework is of great importance in Statistics; and (ii) as a general rule, all trustworthy information must be used, which implies that a one-sided test should be used when there is information that strongly suggests so—compare the estimate calculated from the sample with the hypothesized values.
Remark 11ht: For parametric tests, $\alpha(\theta) = P(\text{Reject } H_0 \mid \theta\in\Theta_0)$ and $1-\beta(\theta) = P(\text{Reject } H_0 \mid \theta\in\Theta_1)$, so to plot the power function $\phi(\theta) = P(\text{Reject } H_0 \mid \theta\in\Theta_0\cup\Theta_1)$ it is usually enough to enter $\theta\in\Theta_0$ in the analytical expression of $1-\beta(\theta)$. This is the method that we have used in some exercises where the computer has been used.
Remark 12ht: A reasonable testing process should verify that
1−β(θ1 )=P (T 0 ∈Rc ∣ θ∈Θ1) > P (T 0 ∈ Rc ∣ θ∈Θ0 ) = α(θ 0 )
with 1–β(θ1) ≈ α(θ0) when θ1 ≈ θ0. This can be noticed in the power functions plotted in some exercises, where there is a local
minimum at θ0.
Remark 13ht: Since one-sided tests are, in its range of parameter values, more powerful than the corresponding two-sided test, the
best way of testing an equality consists in accepting it when it is compared with the two types of inequality. Similarly, the best way
[HT-p] Based on T
Exercise 1ht-T
The lifetime of a machine (measured in years, y) follows a normal distribution with variance equal to 4y². A simple random sample of size 100 yields a sample mean equal to 1.3y. Test the null hypothesis that the
population mean is equal to 1.5y, by applying a two-tailed test with 5 percent significance level. What is the
type I error? Calculate the type II error when the population mean is 2y. Find the general expression of the
type II error and then use a computer to plot the power function.
Discussion: First of all, the supposition that the normal distribution reasonably explains the lifetime of the
machine should be evaluated by using proper statistical techniques. Nevertheless, the purpose of this exercise
is basically to apply the decision-making methodologies.
Statistic: Since
• There is one normal population
• The population variance is known
the statistic

$$T(X;\mu) = \frac{\bar{X}-\mu}{\sqrt{\sigma^2/n}} \sim N(0,1)$$

is selected from a table of statistics (e.g. in [T]). Two particular cases of T will be used:

$$T_0(X) = \frac{\bar{X}-\mu_0}{\sqrt{\sigma^2/n}} \sim N(0,1) \qquad\text{and}\qquad T_1(X) = \frac{\bar{X}-\mu_1}{\sqrt{\sigma^2/n}} \sim N(0,1)$$

To apply any of the two methodologies, the value of T0 at the specific sample x = (x1,...,x100) is necessary:

$$T_0(x) = \frac{\bar{x}-\mu_0}{\sqrt{\sigma^2/n}} = \frac{1.3-1.5}{\sqrt{4/100}} = \frac{-0.2\cdot 10}{2} = -1$$
$$\frac{\alpha(1.5)}{2} = P(T_0(X) < a_1) \rightarrow a_1 = l_{\alpha/2} = -1.96 \qquad \frac{\alpha(1.5)}{2} = P(T_0(X) > a_2) \rightarrow a_2 = r_{\alpha/2} = +1.96$$

$$\rightarrow R_c = \{T_0(X) < -1.96\}\cup\{T_0(X) > +1.96\} = \{|T_0(X)| > 1.96\}$$
The decision is: T 0 ( x)=−1 → T 0 ( x)∉ Rc → H0 is not rejected.
Type II error: To calculate β, we have to work under H1, that is, with T1. Nonetheless, the critical region is
expressed in terms of T0. Thus, the mathematical trick of adding and subtracting the same quantity is applied:
β(μ 1) = P(Type II error) = P ( Accept H 0 ∣ H 1 true) = P (T 0 ( X )∉ Rc ∣ H 1 )= P (∣T 0 ( X )∣≤1.96 ∣ H 1)
$$= P(-1.96 \le T_0(X) \le +1.96 \mid H_1) = P\left(-1.96 \le \frac{\bar{X}-\mu_0}{\sqrt{\sigma^2/n}} \le +1.96 \,\Big|\, H_1\right)$$

$$= P\left(-1.96 \le \frac{\bar{X}-\mu_1+\mu_1-\mu_0}{\sqrt{\sigma^2/n}} \le +1.96 \,\Big|\, H_1\right) = P\left(-1.96-\frac{\mu_1-\mu_0}{\sqrt{\sigma^2/n}} \le T_1(X) \le +1.96-\frac{\mu_1-\mu_0}{\sqrt{\sigma^2/n}}\right)$$

$$= P\left(T_1(X) \le +1.96-\frac{\mu_1-\mu_0}{\sqrt{\sigma^2/n}}\right) - P\left(T_1(X) < -1.96-\frac{\mu_1-\mu_0}{\sqrt{\sigma^2/n}}\right)$$
For the particular value μ1 = 2,
> pnorm(-0.54,0,1)-pnorm(-4.46,0,1)
β(2) = P ( T 1 ( X )≤−0.54 )− P ( T 1 ( X )<−4.46 ) =0.29 [1] 0.2945944
By using a computer, many more values μ1 ≠ 2 can be considered so as to numerically determine the power
curve 1–β(μ1) of the test and to plot the power function.
$$\phi(\mu) = P(\text{Reject } H_0) = \begin{cases} \alpha(\mu) & \text{if } \mu\in\Theta_0 \\ 1-\beta(\mu) & \text{if } \mu\in\Theta_1 \end{cases}$$
# Population
variance = 4
# Sample and inference
n = 100
alpha = 0.05
theta0 = 1.5 # Value under the null hypothesis H0
q = qnorm(1-alpha/2,0,1)
# Power function (completing the block as in the analogous exercises below)
theta1 = seq(from=0,to=3,0.01)
paramSpace = sort(unique(c(theta1,theta0)))
shift = (paramSpace-theta0)/sqrt(variance/n)
PowerFunction = 1 - pnorm(q-shift,0,1) + pnorm(-q-shift,0,1)
plot(paramSpace, PowerFunction, xlab='Theta', ylab='Probability of rejecting theta0', main='Power Function', type='l')
Conclusion: The hypothesis that 1.5y is the mean of the distribution of the lifetime is not rejected. As
expected, when the true value is supposed to be 2, far from 1.5, the probability of rejecting 1.5 is 1–β(2) =
0.71, that is, high. This value has been calculated by hand; additionally, after finding the analytical expression
of the curve 1–β, also by hand, the computer allows the power function to be plotted. This theoretical curve,
not depending on the sample information, is symmetric with respect to μ0 = 1.5. (Remember: statistical results
depend on: the assumptions, the methods, the certainty and the data.)
My notes:
Exercise 2ht-T
A company produces electric devices operated by a thermostatic control. The standard deviation of
the temperature at which these controls actually operate should not exceed 2.0ºF. For a simple
random sample of 20 of these controls, the sample quasi-standard deviation of operating
temperatures was 2.39ºF. Stating any assumptions you need (write them), test at the 5% level the null
hypothesis that the population standard deviation is not larger than 2.0ºF against the alternative that
it is. Apply the two methodologies and calculate the type II error at σ²=4.5ºF². Use a computer to plot the power function. On the other hand, between the two alternative hypotheses H1: σ = σ1 > 2 and H1: σ = σ1 ≠ 2, which one would you have selected? Why?
Hint: Be careful to use S² and σ² wherever you work with a variance instead of a standard deviation.
(Based on an exercise of Statistics for Business and Economics. Newbold, P., W.L. Carlson and B.M. Thorne. Pearson.)
LINGUISTIC NOTE (From: Longman Dictionary of Common Errors. Turton, N.D., and J.B. Heaton. Longman.)
actual = real (as opposed what is believed, planned or expected): 'People think he is over fifty but his actual age is forty-eight.' 'Although
buses are supposed to run every fifteen minutes, the actual waiting time can be up to an hour.'
present/current = happening or existing now: 'No one can drive that car in its present condition.' 'Her current boyfriend works for Shell.'
LINGUISTIC NOTE (From: Common Errors in English Usage. Brians, P. William, James & Co.)
“Device” is a noun. A can-opener is a device. “Devise” is a verb. You can devise a plan for opening a can with a sharp rock instead. Only
in law is “devise” properly used as a noun, meaning something deeded in a will.
Hypothesis test

$$H_0: \sigma \le \sigma_0 = 2 \qquad H_1: \sigma > 2$$

Statistic: Since there is one normal population and the population mean is unknown, the statistic $T(X;\sigma^2) = \frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}$ is selected from a table of statistics (e.g. in [T]). Then,

$$T_0(x) = \frac{(n-1)S^2}{\sigma_0^2} = \frac{(20-1)\cdot 2.39^2\text{ºF}^2}{4\text{ºF}^2} = 27.13$$
Decision: To determine the rejection region, under H0, the critical value a is found by applying the definition
of type I error, with α = 0.05 at σ02 = 4ºF2 :
α (4) = P (Type I error ) = P ( Reject H 0 ∣ H 0 true)= P (T ( X ;θ)∈ Rc ∣ H 0 ) = P (T 0 (X )>a)
→ a=r α=r 0.05=30.14 → Rc = {T 0 ( X )>30.14 }
To make the final decision: T 0 ( x)=27.13 < 30.14 → T 0 ( x)∉ Rc → H0 is not rejected.
The second methodology requires the calculation of the p-value:

$$pV = P(X\text{ more rejecting than }x \mid H_0\text{ true}) = P(T_0(X) > 27.13) \approx 0.10 > 0.05 = \alpha \ \rightarrow\ H_0\text{ is not rejected.}$$
Type II error: To calculate β, we have to work under H1, that is, with T1. Since the critical region is already expressed in terms of T0, the mathematical trick of multiplying and dividing by the same quantity is applied:

$$\beta(\sigma_1^2) = P(\text{Type II error}) = P(\text{Accept } H_0 \mid H_1\text{ true}) = P(T_0(X) \le 30.14 \mid H_1) = P\left(\frac{(n-1)S^2}{\sigma_1^2} \le 30.14\,\frac{\sigma_0^2}{\sigma_1^2} \,\Big|\, H_1\right) = P\left(T_1(X) \le \frac{30.14\cdot\sigma_0^2}{\sigma_1^2}\right)$$

For the particular value σ1² = 4.5ºF², $\beta(4.5) = P(T_1(X) \le 30.14\cdot 4/4.5) = P(T_1(X) \le 26.79) \approx 0.89$.
By using a computer, many other values σ12 ≠ 4.5ºF2 can be considered so as to numerically determine the
power curve 1–β(σ12) of the test and to plot the power function.
$$\phi(\sigma^2) = P(\text{Reject } H_0) = \begin{cases} \alpha(\sigma^2) & \text{if } \sigma^2\in\Theta_0 \\ 1-\beta(\sigma^2) & \text{if } \sigma^2\in\Theta_1 \end{cases}$$
# Sample and inference
n = 20
alpha = 0.05
theta0 = 4 # Value under the null hypothesis H0
q = qchisq(1-alpha,n-1)
theta1 = seq(from=4,to=15,0.01)
paramSpace = sort(unique(c(theta1,theta0)))
PowerFunction = 1 - pchisq(q*theta0/paramSpace, n-1)
plot(paramSpace, PowerFunction, xlab='Theta', ylab='Probability of rejecting theta0', main='Power Function', type='l')
Conclusion: The null hypothesis H 0 : σ=σ 0 ≤ 2 is not rejected. When any of these factors is different,
the decision might be the opposite. As regards the most appropriate alternative hypothesis, the value of S
suggests that the test with σ1 > 2 is more powerful than the test with σ1 ≠ 2 (the test with σ1 < 2 against
the equality would be the least powerful as both the methodologies—H0 is the default hypothesis—and the
data “tend to help H0”). (Remember: statistical results depend on: the assumptions, the methods, the certainty
and the data.)
My notes:
Exercise 3ht-T
Let X = (X1,...,Xn) be a simple random sample with 25 data taken from a normal population variable X. The sample information is summarized in the value s² = 5.53 of the sample variance.
Discussion: The supposition that the normal distribution is appropriate to model X should be statistically
proved. This statement is theoretical.
$$T_0(x) = \frac{n s^2}{\sigma_0^2} = \frac{1}{4}\left[\sum_{j=1}^{25} x_j^2 - \frac{1}{25}\left(\sum_{k=1}^{25} x_k\right)^2\right] = \frac{25\cdot 5.53}{4} = 34.56$$

where, to calculate the sample variance, the general property $s^2 = \frac{1}{n}\sum_{j=1}^{n}X_j^2 - \left(\frac{1}{n}\sum_{j=1}^{n}X_j\right)^2$ has been used.
Decision: To determine the rejection region, under H0, the critical value a is found by applying the definition of type I error, with α = 0.05 at σ0² = 4:

$$\alpha(4) = P(\text{Type I error}) = P(\text{Reject } H_0 \mid H_0\text{ true}) = P(T(X;\theta)\in R_c \mid H_0) = P(T_0(X) > a)$$

$$\rightarrow a = r_{0.05} = 36.4 \ \rightarrow\ R_c = \{T_0(X) > 36.4\}$$

To make the final decision: $T_0(x) = 34.56 < 36.4 \rightarrow T_0(x)\notin R_c \rightarrow$ H0 is not rejected.
Type II error: To calculate β, we have to work under H1, that is, with T1. Since the critical region is expressed
in terms of T0, the mathematical trick of multiplying and dividing by same quantity is applied:
β(σ12) = P (Type II error ) = P ( Accept H 0 ∣ H 1 true)= P (T 0 ( X )∉R c ∣ H 1) = P (T 0 ( X )≤36.4 ∣ H 1)
$$= P\left(\frac{n s^2}{\sigma_0^2} \le 36.4 \,\Big|\, H_1\right) = P\left(\frac{n s^2}{\sigma_1^2} \le 36.4\,\frac{\sigma_0^2}{\sigma_1^2} \,\Big|\, H_1\right) = P\left(T_1(X) \le \frac{36.4\cdot\sigma_0^2}{\sigma_1^2}\right)$$
For the particular value σ1² = 5,

$$\beta(5) = P\left(T_1(X) \le \frac{36.4\cdot 4}{5}\right) = P(T_1(X) \le 29.12) = 0.78$$
> pchisq(29.12, 25-1)
[1] 0.7843527
By using a computer, many other values σ12 ≠ 5 can be considered so as to numerically determine the power
curve 1–β(σ12) of the test and to plot the power function.
$$\phi(\sigma^2) = P(\text{Reject } H_0) = \begin{cases} \alpha(\sigma^2) & \text{if } \sigma^2\in\Theta_0 \\ 1-\beta(\sigma^2) & \text{if } \sigma^2\in\Theta_1 \end{cases}$$
# Sample and inference
n = 25
alpha = 0.05
theta0 = 4 # Value under the null hypothesis H0
q = qchisq(1-alpha,n-1)
theta1 = seq(from=4,to=15,0.01)
paramSpace = sort(unique(c(theta1,theta0)))
PowerFunction = 1 - pchisq(q*theta0/paramSpace, n-1)
plot(paramSpace, PowerFunction, xlab='Theta', ylab='Probability of rejecting theta0', main='Power Function', type='l')
Decision: Now there are two tails, determined by two critical values a1 and a2 that are found by applying the
definition of type I error, with α = 0.05 at σ02 = 4, and the criterion of leaving half the probability in each tail:
α (4)= P(Type I error )=P ( Reject H 0 ∣ H 0 true)= P(T ( X ; θ)∈R c ∣ H 0 )=P (T 0 ( X )<a 1)+ P (T 0 ( X )>a 2 )
We always consider two tails with the same probability,
$$\frac{\alpha(4)}{2} = P(T_0(X) < a_1) \rightarrow a_1 = l_{\alpha/2} = 12.4 \qquad \frac{\alpha(4)}{2} = P(T_0(X) > a_2) \rightarrow a_2 = r_{\alpha/2} = 39.4$$

$$\rightarrow R_c = \{T_0(X) < 12.4\}\cup\{T_0(X) > 39.4\}$$
To make the final decision: T 0 ( x)=34.56 → T 0 ( x)∉Rc → H0 is not rejected
To base the decision on the p-value, we calculate twice the probability of the tail determined by T0(x) = 34.56:

$$pV = 2\,P(T_0(X) > 34.56) \approx 0.15 > 0.05 = \alpha \ \rightarrow\ H_0\text{ is not rejected.}$$

Type II error: Working under H1, the critical region being expressed in terms of T0,

$$\beta(\sigma_1^2) = P(12.4 \le T_0(X) \le 39.4 \mid H_1) = P\left(\frac{n s^2}{\sigma_1^2} \le \frac{39.4\cdot\sigma_0^2}{\sigma_1^2} \,\Big|\, H_1\right) - P\left(\frac{n s^2}{\sigma_1^2} < \frac{12.4\cdot\sigma_0^2}{\sigma_1^2} \,\Big|\, H_1\right) = P\left(T_1(X) \le \frac{39.4\cdot\sigma_0^2}{\sigma_1^2}\right) - P\left(T_1(X) < \frac{12.4\cdot\sigma_0^2}{\sigma_1^2}\right)$$
For the particular value σ1² = 5,

$$\beta(5) = P(T_1(X) \le 31.52) - P(T_1(X) < 9.92) = 0.86 - 0.0051 = 0.85$$

> pchisq(c(9.92, 31.52), 25-1)
[1] 0.00513123 0.86065162
Comparison of the power functions: For the one-tailed test, the power of the test at σ12 = 5 is 1–β(5) =
1–0.78 = 0.22, while for the two-tailed test it is 1–β(5) = 1–0.85 = 0.15. As expected, this latter test has
smaller power (higher type II error), since in the former test additional information is being used when one tail
is previously discarded. Now we compare the power functions of the two tests graphically, for the common
values (> 4), by using the code
# Sample and inference
n = 25
alpha = 0.05
theta0 = 4 # Value under the null hypothesis H0
q = qchisq(c(alpha/2,1-alpha/2),25-1)
theta1 = seq(from=0,to=15,0.01)
paramSpace1 = sort(unique(c(theta1,theta0)))
PowerFunction1 = 1 - pchisq(q[2]*theta0/paramSpace1, n-1) +
pchisq(q[1]*theta0/paramSpace1, n-1)
q = qchisq(1-alpha,n-1)
theta1 = seq(from=4,to=15,0.01)
paramSpace2 = sort(unique(c(theta1,theta0)))
PowerFunction2 = 1 - pchisq(q*theta0/paramSpace2, n-1)
plot(paramSpace1, PowerFunction1, xlim=c(0,15), xlab='Theta',
ylab='Probability of rejecting theta0', main='Power Function', type='l')
lines(paramSpace2, PowerFunction2, lty=2)
It can be noticed that the curve of the one-sided test is over the curve of the two-sided test for any σ² > 4.
Conclusion: The hypothesis that the population variance is equal to 4 is not rejected in either of the two sections. Although it has not happened in this case, different decisions may be made for the one- and two-tailed cases. (Remember: statistical results depend on: the assumptions, the methods, the certainty and the data.)
My notes:
Exercise 4ht-T
Imagine that you are hired as a cook. Not an ordinary one but a “statistical cook.” For a normal population,
in testing the two hypotheses
$$H_0: \sigma^2 = \sigma_0^2 = 4 \qquad H_1: \sigma^2 = \sigma_1^2 > 4$$

the data (sample x of size n = 11 such that S²=7.6u²) and the significance (α=0.05) have led to rejecting the null hypothesis because

$$r_{0.05} = 18.31 < T_0(x) = 19$$
Since the chef—your boss—wants the null hypothesis H0 not to be rejected, find three different ways to
scientifically make the opposite decision by changing any of the previous factors. Give qualitative
explanations and, if possible, quantitative ones.
Discussion: Metaphorically, Statistics can be thought of as the kitchen with its utensils and appliances, the
first two factors as the recipe, and the next three items as the ingredients—if H1, α or x are inappropriate, there
is little to do and it does not matter how good the kitchen, the recipe and you are. Our statistical knowledge
allows us to change only the last three elements. The statistic to study the variance of a normal population is

$$T(X) = \frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1} \qquad\text{so, under } H_0,\quad T_0(x) = \frac{(n-1)S^2}{\sigma_0^2} = \frac{(11-1)\cdot 7.6u^2}{4u^2} = \frac{76}{4} = 19.$$
Quantitative reasoning: The previous qualitative explanations can be supported with calculations.
A) For the two-tailed test, now the critical value would be r 0.05 /2=r 0.025=20.48 . Then
T 0 ( x)=19 < 20.48=r 0.025 → T 0 ( x)∉Rc → H0 is not rejected.
B) The same effect is obtained if, for the original one-tailed H1, the significance is taken to be 0.025
instead of 0.05. Any other value smaller than 0.025 would lead to the same result. Is 0.025—suggested
by the previous item—the smallest possible value? The answer is found by using the p-value, since it is sometimes defined as the smallest significance level at which the null hypothesis is rejected. Then, since

$$pV = P(X\text{ more rejecting than }x \mid H_0\text{ true}) = P(T_0(X) > 19) = 0.0403$$

> 1 - pchisq(19, 11-1)
[1] 0.04026268

any significance level α smaller than pV = 0.0403 leads to not rejecting H0.
Conclusion: This exercise highlights how careful one must be in either writing or reading statistical works.
My notes:
Discussion: In a real-world situation, suppositions should be proved. We must pay careful attention to the
details: the sample quasivariance is provided for one group, while the sample variance is given for the other.
the statistic

$$T(X,Y;\sigma_X,\sigma_Y) = \frac{S_X^2/\sigma_X^2}{S_Y^2/\sigma_Y^2} = \frac{S_X^2}{S_Y^2}\,\frac{\sigma_Y^2}{\sigma_X^2} \sim F_{n_X-1,\,n_Y-1}$$

is selected from a table of statistics (e.g. in [T]). It will be used in two forms (we can write σX²/σY² = θ1):

$$T_0(X,Y) = \frac{S_X^2}{S_Y^2} \sim F_{n_X-1,\,n_Y-1} \qquad\text{and}\qquad T_1(X,Y) = \frac{1}{\theta_1}\,\frac{S_X^2}{S_Y^2} \sim F_{n_X-1,\,n_Y-1}$$
(On the other hand, the pooled sample variance Sp2 should not be considered even under H0: σX = σ = σY, as
T 0=( S 2p /S 2p )=1 whatever the data are.) To apply any of the two methodologies we need to evaluate T0 at
the samples x and y:
$$T_0(x,y) = \frac{S_X^2}{S_Y^2} = \frac{6.8}{\frac{10}{10-1}\,7.1} = 0.86$$
Since we were given the sample quasivariance of population X, but the sample variance of population Y, the
general property n s 2 = (n−1) S 2 has been used to calculate SY2.
Decision: To determine the critical region, under H0, the critical value a is found by applying the definition of
type I error, with α = 0.1 at θ0 = 1:
α (1)= P (Type I error )=P (Reject H 0 ∣ H 0 true)= P (T ( X , Y )<a ∣ H 0 )=P (T 0 ( X , Y )< a)
$$0.1 = P(T_0(X,Y) < a) = P\left(\frac{1}{T_0(X,Y)} > \frac{1}{a}\right) \ \rightarrow\ \frac{1}{a} = 2.35$$

(From the definition of the F distribution, it is easy to see that if X follows an Fk1,k2 then 1/X follows an Fk2,k1. We use this property to consult our table.)

$$\rightarrow a = r_{1-\alpha} = \frac{1}{2.35} = 0.43 \ \rightarrow\ R_c = \{T_0(X,Y) < 0.43\}$$
To make the final decision about the hypotheses:
T 0 ( x , y )=0.86 → T 0 ( x)∉ Rc → H0 is not rejected.
The second methodology requires the calculation of the p-value:
pV =P ( X ,Y more rejecting than x , y ∣ H 0 true)
=P (T 0 (X , Y )<T 0 ( x , y))=P (T 0 ( X , Y )< 0.86)=0.41
> pf(0.86, 11-1, 10-1)
→ pV =0.41> 0.1=α → H0 is not rejected. [1] 0.406005
Power function: To calculate β, we have to work under H1, that is, with T1. Since in this case the critical
region is already expressed in terms of T0, the mathematical trick of multiplying and dividing by the same
quantity is applied:
β(θ1 ) = P (Type II error) = P( Accept H 0 ∣ H 1 true) = P (T 0 ( X )∉ Rc ∣ H 1 ) = P (T 0 ( X )≥0.43 ∣ H 1 )
$$= P\left(\frac{S_X^2}{S_Y^2} \ge 0.43 \,\Big|\, H_1\right) = P\left(\frac{1}{\theta_1}\frac{S_X^2}{S_Y^2} \ge \frac{0.43}{\theta_1} \,\Big|\, H_1\right) = P\left(T_1(X,Y) \ge \frac{0.43}{\theta_1}\right) = 1 - P\left(T_1(X,Y) < \frac{0.43}{\theta_1}\right)$$
) ( )
By using a computer, many values θ1 can be considered so as to determine the power curve 1–β(θ1) of the test
and to plot the power function.
$$\phi(\theta) = P(\text{Reject } H_0) = \begin{cases} \alpha(\theta) & \text{if } \theta\in\Theta_0 \\ 1-\beta(\theta) & \text{if } \theta\in\Theta_1 \end{cases}$$
# Sample and inference
nx = 11; ny = 10
alpha = 0.1
theta0 = 1
q = qf(alpha,nx-1,ny-1)
theta1 = seq(from=0,to=1,0.01)
paramSpace = sort(unique(c(theta1,theta0)))
PowerFunction = pf(q/paramSpace, nx-1, ny-1)
plot(paramSpace, PowerFunction, xlab='Theta', ylab='Probability of rejecting theta0', main='Power Function', type='l')
Decision: To apply the methodology based on the rejection region, the critical value a is found by applying
the definition of type I error, with α = 0.1 at θ0 = 1:
α (1)= P (Type I error )=P ( Reject H 0 ∣ H 0 true)= P (T ( X , Y )>a ∣ H 0 )=P (T 0 ( X , Y )> a)
→ a=r α=2.42 → Rc = {T 0 ( X , Y )> 2.42 }
The final decision is: T 0 ( x , y )=0.86 → T 0 ( x)∉ Rc → H0 is not rejected.
The second methodology requires the calculation of the p-value:
pV =P ( X ,Y more rejecting than x , y | H 0 true)= P(T 0 ( X , Y )>T 0 ( x , y ))
=P (T 0 (X , Y )> 0.86)= 1−0.41=0.59 > pf(0.86, 11-1, 10-1)
[1] 0.406005
→ pV =0.59> 0.1=α → H0 is not rejected.
Type II error: Working under H1 as before,

$$\beta(\theta_1) = P\left(\frac{S_X^2}{S_Y^2} \le 2.42 \,\Big|\, H_1\right) = P\left(\frac{1}{\theta_1}\frac{S_X^2}{S_Y^2} \le \frac{2.42}{\theta_1} \,\Big|\, H_1\right) = P\left(T_1(X,Y) \le \frac{2.42}{\theta_1}\right)$$
By using a computer, many values θ1 can be considered so as to plot the power function.
Decision: For the first methodology, the critical region must be determined by applying the definition of type I
error, with α = 0.1 at θ1 = 1, and the criterion of leaving half the probability in each tail:
α (1)= P (Type I error )= P( Reject H 0 | H 0 true)=P (T 0 ( X ,Y )<a 1)+ P (T 0 ( X , Y )>a2 )
$$\frac{\alpha(1)}{2} = P(T_0(X,Y) < a_1) \rightarrow a_1 = l_{\alpha/2} = 0.33 \qquad \frac{\alpha(1)}{2} = P(T_0(X,Y) > a_2) \rightarrow a_2 = r_{\alpha/2} = 3.14$$

> qf(c(0.05, 0.95), 11-1, 10-1)
[1] 0.3310838 3.1372801

$$\rightarrow R_c = \{T_0(X,Y) < 0.33\}\cup\{T_0(X,Y) > 3.14\}$$
The decision is: $T_0(x,y) = 0.86 \rightarrow T_0(x,y)\notin R_c \rightarrow$ H0 is not rejected. For the type II error,

$$\beta(\theta_1) = P\left(\frac{0.33}{\theta_1} \le T_1(X,Y) \le \frac{3.14}{\theta_1}\right) = P\left(T_1(X,Y) \le \frac{3.14}{\theta_1}\right) - P\left(T_1(X,Y) < \frac{0.33}{\theta_1}\right)$$
By using a computer, many values θ1 can be considered in order to plot the power function.
# Sample and inference
nx = 11; ny = 10
alpha = 0.1
theta0 = 1
q = qf(c(alpha/2, 1-alpha/2),nx-1,ny-1)
theta1 = seq(from=0,to=15,0.01)
paramSpace = sort(unique(c(theta1,theta0)))
PowerFunction = 1 - pf(q[2]/paramSpace, nx-1, ny-1) + pf(q[1]/paramSpace, nx-1, ny-1)
plot(paramSpace, PowerFunction, xlab='Theta', ylab='Probability of rejecting theta0', main='Power Function', type='l')
Comparison of the power functions: Now we compare the power functions of the three tests
graphically, by using the code
# Sample and inference
nx = 11; ny = 10
alpha = 0.1
theta0 = 1
q = qf(c(alpha/2, 1-alpha/2),nx-1,ny-1)
theta1 = seq(from=0,to=15,0.01)
paramSpace1 = sort(unique(c(theta1,theta0)))
PowerFunction1 = 1 - pf(q[2]/paramSpace1, nx-1, ny-1) + pf(q[1]/paramSpace1, nx-1, ny-1)
q = qf(alpha,nx-1,ny-1)
theta1 = seq(from=0,to=1,0.01)
paramSpace2 = sort(unique(c(theta1,theta0)))
PowerFunction2 = pf(q/paramSpace2, nx-1, ny-1)
q = qf(1-alpha,nx-1,ny-1)
# Completing the block with the third curve, following the pattern above
theta1 = seq(from=1,to=15,0.01)
paramSpace3 = sort(unique(c(theta1,theta0)))
PowerFunction3 = 1 - pf(q/paramSpace3, nx-1, ny-1)
plot(paramSpace1, PowerFunction1, xlim=c(0,15), xlab='Theta',
ylab='Probability of rejecting theta0', main='Power Function', type='l')
lines(paramSpace2, PowerFunction2, lty=2)
lines(paramSpace3, PowerFunction3, lty=3)
It can be seen that the curves of the one-sided tests are over the curve of the two-sided test for any θ1—in its
region each one-sided test has more power than the two-sided test, since additional information is used when
one tail is discarded. Then, any of the two one-sided tests is uniformly more powerful than the two-sided test
in their respective common domains.
Conclusion: The hypothesis that the population variance is equal in the two biological populations is not
rejected when tested against any of the three alternative hypotheses. Although it has not happened in this case,
different decisions can be made for the one- and two-tailed tests. In this exercise, the empirical value T0(x) =
SX2/SY2 = 0.86 suggests the alternative hypothesis H1: σX2/σY2 < 1. (Remember: statistical results depend on: the
assumptions, the methods, the certainty and the data.)
My notes:
Exercise 6ht-T
Two simple random samples of 700 citizens of Italy and Russia yielded, respectively, that 53% of Italian
people and 47% of Russian people wish to visit Spain within the next ten years. Should we conclude, with confidence 0.99, that the Italians' desire is higher than the Russians'? Determine the critical region and make a decision. What is the type I error? Calculate the p-value and apply the methodology based on the p-value to
decision. What is the type I error? Calculate the p-value and apply the methodology based on the p-value to
make a decision.
1) Allocate the question in the null hypothesis. Calculate the type II error for the value –0.1.
2) Allocate the question in the alternative hypothesis. Calculate the type II error for the value +0.1.
Use a computer to plot the power function.
Statistic: The statistic to compare the two proportions is

$$T(I,R) = \frac{(\hat{\eta}_I-\hat{\eta}_R)-(\eta_I-\eta_R)}{\sqrt{\dfrac{?_I(1-?_I)}{n_I}+\dfrac{?_R(1-?_R)}{n_R}}} \xrightarrow{d} N(0,1)$$

where each ? must be substituted by the best possible information: supposed or estimated. Two particular versions of this statistic will be used:

$$T_0(I,R) = \frac{(\hat{\eta}_I-\hat{\eta}_R)-\theta_0}{\sqrt{\dfrac{\hat{\eta}_I(1-\hat{\eta}_I)}{n_I}+\dfrac{\hat{\eta}_R(1-\hat{\eta}_R)}{n_R}}} \xrightarrow{d} N(0,1) \qquad\text{and}\qquad T_1(I,R) = \frac{(\hat{\eta}_I-\hat{\eta}_R)-\theta_1}{\sqrt{\dfrac{\hat{\eta}_I(1-\hat{\eta}_I)}{n_I}+\dfrac{\hat{\eta}_R(1-\hat{\eta}_R)}{n_R}}} \xrightarrow{d} N(0,1)$$
To determine the critical region or to calculate the p-value, both under H0, we need the value of the statistic for
the particular samples available:
$$T_0(i,r) = \frac{(0.53-0.47)-0}{\sqrt{\dfrac{0.53(1-0.53)}{700}+\dfrac{0.47(1-0.47)}{700}}} = 2.25$$
1) Question in H0
Hypotheses: If we want to allocate the question in the null hypothesis to reject it only when the data strongly
suggest so,
H 0 : ηI −ηR = θ 0 ≥ 0 and H 1 : ηI −ηR = θ1 < 0
By looking at the alternative hypothesis, we deduce the form of the critical region:
Decision: To apply the first methodology, the critical value a that determines the rejection region is found by
applying the definition of type I error, with the value α = 1 – 0.99 = 0.01 at θ0 = 0:
α (0) = P (Type I error) = P( Reject H 0 | H 0 true) = P( T (I , R)∈R c | H 0 )= P (T 0 ( I , R)<a)
→ a=l 0.01=−2.326 → Rc = {T 0 ( I , R)<−2.326}
The decision is: T 0 ( i , r )=2.25 → T 0 (i , r )∉ Rc → H0 is not rejected.
As regards the value of the type I error, it is α by definition. The second methodology is based on the
calculation of the p-value:
pV =P ( I , R more rejecting than i , r | H 0 true)= P (T 0 ( I , R) < T 0 (i , r ))
=P (T 0 (I , R) < 2.25)=0.988
→ pV =0.988 > 0.01=α → H0 is not rejected.
Type II error: To calculate β, we have to work under H1. Since the critical region is expressed in terms of T0
and we must use T1, we are going to apply the mathematical trick of adding and subtracting the same quantity:
$$\beta(\theta_1) = P(\text{Type II error}) = P(\text{Accept } H_0 \mid H_1\text{ true}) = P(T_0(I,R)\notin R_c \mid H_1) = P\left(\frac{(\hat{\eta}_I-\hat{\eta}_R)-\theta_0}{\sqrt{\dfrac{\hat{\eta}_I(1-\hat{\eta}_I)}{n_I}+\dfrac{\hat{\eta}_R(1-\hat{\eta}_R)}{n_R}}} \ge -2.326 \,\Big|\, H_1\right)$$

$$= P\left(\frac{(\hat{\eta}_I-\hat{\eta}_R)+0-\theta_1}{\sqrt{\dfrac{\hat{\eta}_I(1-\hat{\eta}_I)}{n_I}+\dfrac{\hat{\eta}_R(1-\hat{\eta}_R)}{n_R}}} + \frac{\theta_1}{\sqrt{\dfrac{\hat{\eta}_I(1-\hat{\eta}_I)}{n_I}+\dfrac{\hat{\eta}_R(1-\hat{\eta}_R)}{n_R}}} \ge -2.326 \,\Big|\, H_1\right) = P\left(T_1(I,R) \ge -2.326 - \frac{\theta_1}{\sqrt{\dfrac{0.53(1-0.53)}{700}+\dfrac{0.47(1-0.47)}{700}}}\right)$$

For the particular value θ1 = –0.1,

$$\beta(-0.1) = P\left(T_1(I,R) \ge -2.326 + \frac{0.1}{0.0267}\right) = P(T_1(I,R) \ge 1.42) = 0.078$$
2) Question in H1
Hypotheses: If we want to allocate the question in the alternative hypothesis to accept it only when the data
strongly suggest so,
H 0 : ηI −ηR = θ 0 ≤ 0 and H 1 : ηI −ηR = θ1 > 0
By looking at the alternative hypothesis, we deduce the form of the critical region, $R_c = \{T_0(I,R) > a\}$.
The quantity c by which a exceeds θ0 can be thought of as a margin over θ0 not to exclude cases where ηI – ηR = θ0 = 0 really holds while values slightly larger than θ0 are due to mere random effects.
Decision: To apply the first methodology, the critical value a is calculated as follows:
α (0)= P (Type I error) = P( Reject H 0 | H 0 true) = P( T (I , R)∈R c | H 0 )= P (T 0 ( I , R)>a)
→ a=r 0.01=2.326 → Rc = {T 0 ( I , R)> 2.326 }
The decision is: T 0 ( i , r )=2.25 → T 0 (i , r )∉ Rc → H0 is not rejected.
The second methodology requires the calculation of the p-value:

$$pV = P(I,R\text{ more rejecting than }i,r \mid H_0\text{ true}) = P(T_0(I,R) > 2.25) = 0.0122 > 0.01 = \alpha \ \rightarrow\ H_0\text{ is not rejected.}$$

Type II error: Working under H1 as in the previous section,

$$\beta(\theta_1) = P(T_0(I,R)\notin R_c \mid H_1) = P\left(\frac{(\hat{\eta}_I-\hat{\eta}_R)-\theta_0}{\sqrt{\dfrac{\hat{\eta}_I(1-\hat{\eta}_I)}{n_I}+\dfrac{\hat{\eta}_R(1-\hat{\eta}_R)}{n_R}}} \le 2.326 \,\Big|\, H_1\right) = P\left(T_1(I,R) \le 2.326 - \frac{\theta_1}{\sqrt{\dfrac{0.53(1-0.53)}{700}+\dfrac{0.47(1-0.47)}{700}}}\right)$$

For the particular value θ1 = +0.1,

$$\beta(0.1) = P\left(T_1(I,R) \le 2.326 - \frac{0.1}{0.0267}\right) = P(T_1(I,R) \le -1.42) = 0.078$$
By using a computer, many more values θ1 ≠ 0.1 can be considered so as to numerically determine the power
of the test curve 1–β(θ1) and to plot the power function.
$$\phi(\theta) = P(\text{Reject } H_0) = \begin{cases} \alpha(\theta) & \text{if } \theta\in\Theta_0 \\ 1-\beta(\theta) & \text{if } \theta\in\Theta_1 \end{cases}$$
Conclusion: The hypothesis that the two proportions are equal is not rejected when the question is allocated
in either the alternative or the null hypothesis (the best way of testing an equality). That is, it seems that both
populations wish to visit Spain with the same desire. The sample information η^ I =0.53 and η^ R =0.47
suggested the alternative hypothesis H1: ηI – ηR > 0. The two power functions show how symmetric the
situations are. (Remember: statistical results depend on: the assumptions, the methods, the certainty and the
data.)
Advanced theory: Under the hypothesis H0: ηI = η = ηR, it makes sense to try to estimate the common variance η(1–η) of the estimator—in the denominator—as well as possible. This can be done by using the pooled sample proportion $\hat{\eta}_p = \frac{n_I\hat{\eta}_I + n_R\hat{\eta}_R}{n_I+n_R}$. Nevertheless, the pooled estimator should not be considered in the numerator, since $(\hat{\eta}_p-\hat{\eta}_p)=0$ whatever the data are. Now, the statistic under the null hypothesis is:
$$\tilde{T}_0(I,R) = \frac{(\hat{\eta}_I-\hat{\eta}_R)-\theta_0}{\sqrt{\dfrac{\hat{\eta}_p(1-\hat{\eta}_p)}{n_I}+\dfrac{\hat{\eta}_p(1-\hat{\eta}_p)}{n_R}}} = T_0(I,R)\cdot\frac{\sqrt{\dfrac{\hat{\eta}_I(1-\hat{\eta}_I)}{n_I}+\dfrac{\hat{\eta}_R(1-\hat{\eta}_R)}{n_R}}}{\sqrt{\dfrac{\hat{\eta}_p(1-\hat{\eta}_p)}{n_I}+\dfrac{\hat{\eta}_p(1-\hat{\eta}_p)}{n_R}}} \xrightarrow{d} N(0,1)$$

Then,

$$\hat{\eta}_p = \frac{700\cdot 0.53 + 700\cdot 0.47}{700+700} = \frac{0.53+0.47}{1+1} = \frac{1}{2} = 0.5 \ \rightarrow\ \frac{\sqrt{\dfrac{\hat{\eta}_I(1-\hat{\eta}_I)}{n_I}+\dfrac{\hat{\eta}_R(1-\hat{\eta}_R)}{n_R}}}{\sqrt{\dfrac{\hat{\eta}_p(1-\hat{\eta}_p)}{n_I}+\dfrac{\hat{\eta}_p(1-\hat{\eta}_p)}{n_R}}} = 0.9981983$$
Repeating the previous calculations with $\tilde{T}_0$,

$$\beta(\theta_1) = P(\tilde{T}_0(I,R)\notin R_c \mid H_1) = P\left(\frac{(\hat{\eta}_I-\hat{\eta}_R)-\theta_0}{\sqrt{\dfrac{\hat{\eta}_p(1-\hat{\eta}_p)}{n_I}+\dfrac{\hat{\eta}_p(1-\hat{\eta}_p)}{n_R}}} \ge -2.326 \,\Big|\, H_1\right)$$

$$= P\left(\frac{(\hat{\eta}_I-\hat{\eta}_R)-0-\theta_1+\theta_1}{\sqrt{\dfrac{\hat{\eta}_I(1-\hat{\eta}_I)}{n_I}+\dfrac{\hat{\eta}_R(1-\hat{\eta}_R)}{n_R}}} \ge -2.326\cdot\frac{\sqrt{\dfrac{\hat{\eta}_p(1-\hat{\eta}_p)}{n_I}+\dfrac{\hat{\eta}_p(1-\hat{\eta}_p)}{n_R}}}{\sqrt{\dfrac{\hat{\eta}_I(1-\hat{\eta}_I)}{n_I}+\dfrac{\hat{\eta}_R(1-\hat{\eta}_R)}{n_R}}} \,\Big|\, H_1\right)$$

$$= P\left(T_1(I,R) \ge -2.326\cdot 1.002 - \frac{\theta_1}{\sqrt{\dfrac{0.53(1-0.53)}{700}+\dfrac{0.47(1-0.47)}{700}}}\right) = P\left(T_1(I,R) \ge -2.330 - \frac{\theta_1}{\sqrt{\dfrac{0.53(1-0.53)}{700}+\dfrac{0.47(1-0.47)}{700}}}\right)$$

For the particular value θ1 = –0.1,

$$\beta(-0.1) = P\left(T_1(I,R) \ge -2.330 + \frac{0.1}{0.0267}\right) = P(T_1(I,R) \ge 1.41) = 0.079.$$
Similarly for section (b).
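A minimal R sketch of the main quantities in this exercise (variable names ours):

nI = 700; nR = 700; pI = 0.53; pR = 0.47
se = sqrt(pI*(1-pI)/nI + pR*(1-pR)/nR)
(pI-pR)/se                           # T0 = 2.25
pnorm((pI-pR)/se)                    # p-value in section 1): 0.988
1 - pnorm((pI-pR)/se)                # p-value in section 2): 0.012
1 - pnorm(-qnorm(0.99) - (-0.1)/se)  # beta(-0.1), ~0.078
pnorm(qnorm(0.99) - 0.1/se)          # beta(+0.1), ~0.078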
My notes:
[HT-p] Based on Λ
Exercise 1ht-Λ
A random quantity X follows a Poisson distribution. Let X = (X1,...,Xn) be a simple random sample. By
applying the results involving Neyman-Pearson's lemma and the likelihood ratio, study the critical region
(estimator that arises and form) for the following pairs of hypotheses.
Discussion: This is a theoretical exercise where no assumption should be evaluated. First of all, Neyman-
-Pearson's lemma will be applied. We expect the maximum-likelihood estimator of the parameter—calculated
in a previous exercise—and the “usual” critical region form to appear. If the critical region does not depend on
any particular value θ1, the uniformly most powerful test will have been found.
Hypothesis test
{ H 0: λ = λ0
H 1: λ = λ1
$$L(X;\lambda) = \prod_{j=1}^{n}\frac{e^{-\lambda}\lambda^{X_j}}{X_j!} = \frac{e^{-n\lambda}\,\lambda^{\sum_{j=1}^{n}X_j}}{\prod_{j=1}^{n}X_j!} \qquad\text{and}\qquad \Lambda(X;\lambda_0,\lambda_1) = \frac{L(X;\lambda_0)}{L(X;\lambda_1)} = \left(\frac{\lambda_0}{\lambda_1}\right)^{\sum_{j=1}^{n}X_j} e^{-n(\lambda_0-\lambda_1)}$$
Rejection region:

$$R_c = \{\Lambda < k\} = \left\{\left(\frac{\lambda_0}{\lambda_1}\right)^{\sum_{j=1}^{n}X_j} e^{-n(\lambda_0-\lambda_1)} < k\right\} = \left\{\left(\sum_{j=1}^{n}X_j\right)\log\left(\frac{\lambda_0}{\lambda_1}\right) - n(\lambda_0-\lambda_1) < \log(k)\right\} = \left\{n\bar{X}\log\left(\frac{\lambda_0}{\lambda_1}\right) < \log(k) + n(\lambda_0-\lambda_1)\right\}$$
• if $\lambda_1 < \lambda_0$ then $\log(\lambda_0/\lambda_1) > 0$ and hence $R_c = \left\{\bar{X} < \dfrac{\log(k)+n(\lambda_0-\lambda_1)}{n\log(\lambda_0/\lambda_1)}\right\}$
• if $\lambda_1 > \lambda_0$ then $\log(\lambda_0/\lambda_1) < 0$ and hence $R_c = \left\{\bar{X} > \dfrac{\log(k)+n(\lambda_0-\lambda_1)}{n\log(\lambda_0/\lambda_1)}\right\}$
This suggests the estimator X̄ =λ̂ ML (calculated in a previous exercise) and regions of the form
Rc = {Λ< k } = ⋯= { λ̂ ML <c }= ⋯= {T 0 < a } or Rc = {Λ< k } = ⋯= { λ^ ML >c }= ⋯= {T 0 > a }
Hypothesis tests
{ H 0 : λ = λ0
H 1 : λ = λ 1> λ 0 { H 0 : λ = λ0
H 1 : λ = λ 1< λ 0
In applying the methodologies, and given α, the same critical value c or a will be obtained for any λ1 since it only depends upon λ0 through $\hat{\lambda}_{ML}$ or T0:

$$\alpha = P(\text{Type I error}) = P(T_0 < a) \qquad\text{or}\qquad \alpha = P(\text{Type I error}) = P(T_0 > a)$$

This implies that the uniformly most powerful test has been found.
Hypothesis tests
{ H 0 : λ ≤ λ0
H 1 : λ = λ 1> λ 0 { H 0 : λ ≥ λ0
H 1 : λ = λ 1< λ 0
A uniformly most powerful test for H 0 : λ = λ0 is also uniformly most powerful for H 0 : λ ≤ λ0 .
Hypothesis test: The calculations below correspond to an exponential population, whose likelihood is $L(X;\lambda) = \lambda^n e^{-\lambda\sum_{j=1}^{n}X_j}$, with

$$H_0: \lambda = \lambda_0 \qquad H_1: \lambda = \lambda_1$$

$$\Lambda(X;\lambda_0,\lambda_1) = \frac{L(X;\lambda_0)}{L(X;\lambda_1)} = \left(\frac{\lambda_0}{\lambda_1}\right)^{n} e^{-(\lambda_0-\lambda_1)\sum_{j=1}^{n}X_j}$$

Rejection region:

$$R_c = \{\Lambda < k\} = \left\{n\log\left(\frac{\lambda_0}{\lambda_1}\right) - (\lambda_0-\lambda_1)\sum_{j=1}^{n}X_j < \log(k)\right\} = \left\{(\lambda_1-\lambda_0)\,n\bar{X} < \log(k) - n\log\left(\frac{\lambda_0}{\lambda_1}\right)\right\}$$
Now it is necessary that $\lambda_1\neq\lambda_0$ and
• if $\lambda_1 < \lambda_0$ then $(\lambda_1-\lambda_0) < 0$ and $R_c = \left\{\bar{X} > \dfrac{\log(k)-n\log(\lambda_0/\lambda_1)}{n(\lambda_1-\lambda_0)}\right\}$
• if $\lambda_1 > \lambda_0$ then $(\lambda_1-\lambda_0) > 0$ and $R_c = \left\{\bar{X} < \dfrac{\log(k)-n\log(\lambda_0/\lambda_1)}{n(\lambda_1-\lambda_0)}\right\}$
This suggests the estimator $\frac{1}{\bar{X}} = \hat{\lambda}_{ML}$ (calculated in a previous exercise) and regions of the form

$$R_c = \{\Lambda < k\} = \cdots = \{\hat{\lambda}_{ML} < c\} = \cdots = \{T_0 < a\} \qquad\text{or}\qquad R_c = \{\Lambda < k\} = \cdots = \{\hat{\lambda}_{ML} > c\} = \cdots = \{T_0 > a\}$$
Hypothesis tests
{ H 0 : λ = λ0
H 1 : λ = λ 1> λ 0 { H 0 : λ = λ0
H 1 : λ = λ 1< λ 0
In applying the methodologies, and given α, the same critical value c or a will be obtained for any λ1 since it only depends upon λ0 through $\hat{\lambda}_{ML}$ or T0. This implies that the uniformly most powerful test has been found.
Hypothesis tests
{ H 0 : λ ≤ λ0
H 1 : λ = λ 1>λ 0 { H 0 : λ ≥ λ0
H 1 : λ = λ 1<λ 0
A uniformly most powerful test for H 0 : λ = λ0 is also uniformly most powerful for H 0 : λ ≤ λ0 .
Hypothesis test
{ H 0 : η= η0
H 1 : η= η1
$$L(X;\eta) = \eta^{\sum_{j=1}^{n}X_j}(1-\eta)^{n-\sum_{j=1}^{n}X_j} \qquad\text{and}\qquad \Lambda(X;\eta_0,\eta_1) = \frac{L(X;\eta_0)}{L(X;\eta_1)} = \left(\frac{\eta_0}{\eta_1}\right)^{\sum_{j=1}^{n}X_j}\left(\frac{1-\eta_0}{1-\eta_1}\right)^{n-\sum_{j=1}^{n}X_j}$$
Rejection region:

$$R_c = \{\Lambda < k\} = \left\{\left(\sum_{j=1}^{n}X_j\right)\log\left(\frac{\eta_0}{\eta_1}\right) + \left(n-\sum_{j=1}^{n}X_j\right)\log\left(\frac{1-\eta_0}{1-\eta_1}\right) < \log(k)\right\}$$

$$= \left\{\left(\sum_{j=1}^{n}X_j\right)\left[\log\left(\frac{\eta_0}{\eta_1}\right) - \log\left(\frac{1-\eta_0}{1-\eta_1}\right)\right] < \log(k) - n\log\left(\frac{1-\eta_0}{1-\eta_1}\right)\right\} = \left\{n\bar{X}\log\left(\frac{\eta_0(1-\eta_1)}{\eta_1(1-\eta_0)}\right) < \log(k) - n\log\left(\frac{1-\eta_0}{1-\eta_1}\right)\right\}$$
{ }
1−η0
( )
log ( k )−n log
1−η1
• if η1 < η0 then log
( η0 (1−η1)
η1( 1−η0) )
> 0 and Rc = X
̄ <
n log ( )
η0 (1−η1)
η1(1−η0)
{ ) }
1−η0
log( k )−n log ( )
1−η1
• if η1 > η0 then log
( η0 (1−η1)
η1( 1−η0) ) ̄ >
<0 and Rc = X
n log ( η0 (1−η1)
η1(1−η0)
Hypothesis tests
H0: η = η0 vs H1: η = η1 > η0,  and  H0: η = η0 vs H1: η = η1 < η0
In applying the methodologies, and given α, the same critical value c or a will be obtained for any η1 since it
only depends upon η0 through η^ ML or T0:
α=P (Type I error)= P (T 0 <a) or α=P (Type I error)=P (T 0 > a)
This implies that the uniformly most powerful test has been found.
Hypothesis tests
H0: η ≤ η0 vs H1: η = η1 > η0,  and  H0: η ≥ η0 vs H1: η = η1 < η0
A uniformly most powerful test for H 0 : η = η0 is also uniformly most powerful for H 0 : η ≤ η0 .
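Here too the critical value is exact, since under H0 the statistic T0 = Σ Xj follows a Bin(n, η0) distribution. A minimal R sketch, where n = 50, η0 = 0.5 and α = 0.05 are illustrative values:
# Critical value for H0: eta <= eta0 vs H1: eta > eta0
# Under H0, sum(X) ~ Bin(n, eta0); reject H0 when sum(X) > a
n = 50; eta0 = 0.5; alpha = 0.05     # illustrative values
a = qbinom(1 - alpha, n, eta0)
1 - pbinom(a, n, eta0)               # attained size (below alpha: binomial is discrete)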
Hypothesis test
H0: μ = μ0  vs  H1: μ = μ1

$$L(X;\mu)=\left(\frac{1}{\sqrt{2\pi\sigma^{2}}}\right)^{n}e^{-\frac{1}{2\sigma^{2}}\sum_{j=1}^{n}(X_j-\mu)^{2}}$$
and
$$\Lambda(X;\mu_0,\mu_1)=\frac{L(X;\mu_0)}{L(X;\mu_1)}
=e^{-\frac{1}{2\sigma^{2}}\left[\sum_{j=1}^{n}(X_j-\mu_0)^{2}-\sum_{j=1}^{n}(X_j-\mu_1)^{2}\right]}
=e^{-\frac{1}{2\sigma^{2}}\left[n(\mu_0^{2}-\mu_1^{2})-2(\mu_0-\mu_1)\sum_{j=1}^{n}X_j\right]}
=e^{\frac{(\mu_0-\mu_1)}{\sigma^{2}}\sum_{j=1}^{n}X_j-\frac{n(\mu_0^{2}-\mu_1^{2})}{2\sigma^{2}}}$$
Rejection region:
$$R_c=\{\Lambda<k\}=\left\{e^{\frac{(\mu_0-\mu_1)}{\sigma^{2}}\sum_{j=1}^{n}X_j-\frac{n(\mu_0^{2}-\mu_1^{2})}{2\sigma^{2}}}<k\right\}
=\left\{\frac{(\mu_0-\mu_1)}{\sigma^{2}}\sum_{j=1}^{n}X_j-\frac{n(\mu_0^{2}-\mu_1^{2})}{2\sigma^{2}}<\log(k)\right\}$$
$$=\left\{(\mu_0-\mu_1)\left(\sum_{j=1}^{n}X_j\right)<\log(k)\,\sigma^{2}+\frac{n}{2}(\mu_0^{2}-\mu_1^{2})\right\}
=\left\{(\mu_0-\mu_1)\,n\bar{X}<\log(k)\,\sigma^{2}+\frac{n}{2}(\mu_0^{2}-\mu_1^{2})\right\}$$
• if μ1 < μ0 then (μ0−μ1) > 0 and
$$R_c=\left\{\bar{X}<\frac{\log(k)\,\sigma^{2}+\frac{n}{2}(\mu_0^{2}-\mu_1^{2})}{n(\mu_0-\mu_1)}\right\}$$
• if μ1 > μ0 then (μ0−μ1) < 0 and
$$R_c=\left\{\bar{X}>\frac{\log(k)\,\sigma^{2}+\frac{n}{2}(\mu_0^{2}-\mu_1^{2})}{n(\mu_0-\mu_1)}\right\}$$
Hypothesis tests
H0: μ = μ0 vs H1: μ = μ1 > μ0,  and  H0: μ = μ0 vs H1: μ = μ1 < μ0
In applying the methodologies, and given α, the same critical value c or a will be obtained for any μ1 since it
only depends upon μ0 through μ^ ML or T0:
α=P (Type I error)= P (T 0 <a) or α=P (Type I error)=P (T 0 > a)
This implies that the uniformly most powerful test has been found.
Hypothesis tests
H0: μ ≤ μ0 vs H1: μ = μ1 > μ0,  and  H0: μ ≥ μ0 vs H1: μ = μ1 < μ0
A uniformly most powerful test for H 0 : μ = μ 0 is also uniformly most powerful for H 0 : μ ≤ μ 0 .
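In the normal case with known σ, X̄ ∼ N(μ0, σ²/n) under H0 and the critical value comes directly from a normal quantile. A minimal R sketch, where μ0 = 0, σ = 1, n = 25 and α = 0.05 are illustrative values:
# Critical value for H0: mu <= mu0 vs H1: mu > mu0 (sigma known)
# Under H0, mean(X) ~ N(mu0, sigma^2/n); reject H0 when mean(X) > c
mu0 = 0; sigma = 1; n = 25; alpha = 0.05   # illustrative values
c = qnorm(1 - alpha, mean = mu0, sd = sigma/sqrt(n))
1 - pnorm(c, mu0, sigma/sqrt(n))           # size equals alpha exactly (continuous case)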
Conclusion: Well-known theoretical results have been applied to study the optimal form of the critical
region for different pairs of hypotheses. Since both the likelihood ratio and the maximum likelihood estimator
use the likelihood function, the critical region of the tests can be expressed in terms of this estimator.
My notes:
Discussion: The analysis of variance can be applied when populations are normally distributed and their
variances are equal, that is, X_p ∼ N(μ_p, σ_p²) with σ_p = σ, ∀p. These suppositions should be evaluated
(this will be done at the end of the exercise). If the equality of the means is rejected, additional analyses would
be necessary to identify which means are different—this information is not provided by the analysis of
variance. On the other hand, the calculations involved in this analysis are so tedious that almost everybody
uses the computer. Finally, the unit of measurement of the index u is unknown to us.
Statistic: There is one factor identifying the population out of the three possible ones (we do not consider
other magazines), so a one-factor fixed-effects analysis will be applied. The statistic is
$$T(X_{SA},X_{FO},X_{NY})=\frac{MSG}{MSW}\quad\text{with}\quad T_0=\frac{MSG}{MSW}\sim F_{P-1,\,n-P}\equiv F_{3-1,\,18-3}\equiv F_{2,15}$$
Some calculations are necessary to evaluate the statistic T(x_SA, x_FO, x_NY). First of all, we look at the
three sample means:
$$\bar{X}_{SA}=\frac{1}{6}\sum_{j=1}^{6}X_{SA,j}=\frac{15.75u+\cdots+8.20u}{6}=10.97u$$
$$\bar{X}_{FO}=\frac{1}{6}\sum_{j=1}^{6}X_{FO,j}=\frac{12.63u+\cdots+9.42u}{6}=10.68u$$
$$\bar{X}_{NY}=\frac{1}{6}\sum_{j=1}^{6}X_{NY,j}=\frac{9.27u+\cdots+5.66u}{6}=7.35u$$
The magnitude of the first and the third seems quite different, which suggests that the population means may
be different. Nevertheless, we should not trust intuition.
$$\bar{X}=\frac{1}{n}\sum_{j=1}^{n}X_j=\frac{15.75u+\cdots+5.66u}{18}=9.67u$$
$$SSG=\sum_{p=1}^{P}n_p(\bar{X}_p-\bar{X})^{2}=n_{SA}(\bar{X}_{SA}-\bar{X})^{2}+n_{FO}(\bar{X}_{FO}-\bar{X})^{2}+n_{NY}(\bar{X}_{NY}-\bar{X})^{2}$$
$$=6\,(10.97u-9.67u)^{2}+6\,(10.68u-9.67u)^{2}+6\,(7.35u-9.67u)^{2}=48.53u^{2}$$
$$MSG=\frac{1}{P-1}SSG=\frac{48.53u^{2}}{3-1}=24.26u^{2}$$
$$SSW=\sum_{p=1}^{P}\sum_{j=1}^{n_p}(X_{p,j}-\bar{X}_p)^{2}=\sum_{j=1}^{6}(X_{SA,j}-\bar{X}_{SA})^{2}+\sum_{j=1}^{6}(X_{FO,j}-\bar{X}_{FO})^{2}+\sum_{j=1}^{6}(X_{NY,j}-\bar{X}_{NY})^{2}$$
$$=(15.75u-10.97u)^{2}+\cdots+(8.20u-10.97u)^{2}$$
$$+(12.63u-10.68u)^{2}+\cdots+(9.42u-10.68u)^{2}$$
$$+(9.27u-7.35u)^{2}+\cdots+(5.66u-7.35u)^{2}$$
$$=52.22u^{2}$$
$$MSW=\frac{1}{n-P}SSW=\frac{52.22u^{2}}{18-3}=3.48u^{2}$$
and, finally,
$$T_0(x_{SA},x_{FO},x_{NY})=\frac{MSG}{MSW}=\frac{24.26u^{2}}{3.48u^{2}}=6.97$$
Decision: Finally, it is necessary to check whether this region “suggested by H0” is compatible with the value that
the data provide for the statistic. If they are not compatible because the value seems extreme when the
hypothesis is true, we will trust the data and reject the hypothesis H0.
Since T 0 ( x SA , x FO , x NY )=6.97 > 6.359 → T 0 ( x)∈Rc → H0 is rejected.
The second methodology is based on the calculation of the p-value:
pV = P((X_SA, X_FO, X_NY) more rejecting than (x_SA, x_FO, x_NY) | H0 true)
= P(T0(X_SA, X_FO, X_NY) > T0(x_SA, x_FO, x_NY)) = P(T0 > 6.97) = 0.0072
→ pV = 0.0072 < 0.01 = α → H0 is rejected.
> 1-pf(6.97, 2, 15)
[1] 0.007235116
Conclusion: As suggested by the sample means, the population means of the three magazines are not equal
with a confidence of 0.99, measured in a 0-to-1 scale. Pairwise comparisons could be applied to identify the
differences.
# To calculate the sample mean of the three groups and the total sample mean
mean(SA) ; mean(FO) ; mean(NY) ; mean(Data)
# To calculate the measures and the statistic (for large datasets, the previous means should have been saved)
SSG = 6*((mean(SA) - mean(Data))^2) + 6*((mean(FO) - mean(Data))^2) + 6*((mean(NY) - mean(Data))^2)
MSG = SSG/(3-1)
SSW = sum((SA - mean(SA))^2) + sum((FO - mean(FO))^2) + sum((NY - mean(NY))^2)
MSW = SSW/(18-3)
T0 = MSG/MSW
# To find the quantile 'a' that determines the critical region
a = qf(0.99, 2, 15)
# To calculate the p-value
pValue = 1 - pf(T0, 2, 15)
(In the console, write the name of a quantity to print its value.)
(Compare these quantities with those obtained in the previous calculations.) An equivalent way of applying
the analysis of variance with R consists in substituting the lines
# To apply a one-factor analysis of variance
objectAV = aov(Data ~ Group)
# To print the table with the results
summary(objectAV)
by the lines
# To fit a linear regression model
Model = lm(Data ~ Group)
# To apply and print the analysis of variance
anova(Model)
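The snippets above assume that the vectors SA, FO and NY (the six measurements of each magazine) already exist in the R session; the stacked vector Data and the factor Group can then be built as follows (the data values themselves are not reproduced in this document):
# SA, FO and NY must contain the six measurements of each magazine
Data = c(SA, FO, NY)                                # stacked responses
Group = factor(rep(c("SA", "FO", "NY"), each = 6))  # population labels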
My notes:
Exercise 1ht-np
Occupational Hazards. The following table is based on data from the U.S. Department of Labor, Bureau of
Labor Statistics.
                          Police   Cashiers   Taxi Drivers   Guards
Homicide                    82        107           70          59
Cause of death other
than homicide               92          9           29          42
                                                                       n = 490
A) Use the data in the table, coming from a simple random sample, to test the claim that occupation is
independent of whether the cause of death was homicide. Use a significance level α = 0.05 and apply a
nonparametric chi-square test.
B) Does any particular occupation appear to be most prone to homicides? If so, which one?
(Based on an exercise of Essentials of Statistics, Mario F. Triola, Pearson)
LINGUISTIC NOTE (From: Longman Dictionary of Common Errors. Turton, N.D., and J.B.Heaton. Longman.)
job. Your job is what you do to earn your living: 'You'll never get a job if you don't have any qualifications.' 'She'd like to change her job
but can't find anything better.' Your job is also the particular type of work that you do: 'John's new job sounds really interesting.' 'I know
she works for the BBC but I'm not sure what job she does.' A job may be full-time or part-time (NOT half-time or half-day): 'All she
could get was a part-time job at a petrol station.'
do (for a living). When you want to know about the type of work that someone does, the usual questions are What do you do? What
does she do for a living? etc 'What does your father do?' - 'He's a police inspector.'
occupation. Occupation and job have similar meanings. However, occupation is far less common than job and is used mainly in formal
and official styles: 'Please give brief details of your employment history and present occupation.' 'People in manual occupations seem to
suffer less from stress.'
post/position. The particular job that you have in a company or organization is your post or position: 'She's been appointed to the post of
deputy principal.' 'He's applied for the position of sales manager.' Post and position are used mainly in formal styles and often refer to
jobs which have a lot of responsibility.
career. Your career is your working life, or the series of jobs that you have during your working life: 'The scandal brought his career in
politics to a sudden end.' 'Later on in his career, he became first secretary at the British Embassy in Washington.' Your career is also the
particular kind of work for which you are trained and that you intend to do for a long time: 'I wanted to find out more about careers in
publishing.'
trade. A trade is a type of work in which you do or make things with your hands: 'Most of the men had worked in skilled trades such as
carpentry or printing.' 'My grandfather was a bricklayer by trade.'
profession. A profession is a type of work such as medicine, teaching, or law which requires a high level of training or education: 'Until
recently, medicine has been a male-dominated profession.' 'She entered the teaching profession in 1987.'
LINGUISTIC NOTE (From: The Careful Writer: A Modern Guide to English Usage. Bernstein, T.M. Atheneum)
occupations. The words people use affectionately, humorously, or disparagingly to describe their own occupations are their own affair.
They may say, “I'm in show business” (or, more likely, “show biz”), or “I'm in the advertising racket,” or “I'm in the oil game,” or “I'm in
the garment line.” But outsiders should use more caution, more discretion, and more precision. For instance, it is improper to write, “Mr.
Danaher has been in the law business in Washington.” Law is a profession. Similarly, to say someone is “in the teaching game” would
undoubtedly give offense to teachers. Unless there is some special reason to be slangy or colloquial, the advisable thing to do is to accord
every occupation the dignity it deserves.
Statistic: Since we have to apply a test of independence, from a table of statistics (e.g. in [T]) we select
$$T_0(X)=\sum_{l=1}^{L}\sum_{k=1}^{K}\frac{(N_{lk}-\hat{e}_{lk})^{2}}{\hat{e}_{lk}}\;\xrightarrow{d}\;\chi^{2}_{(L-1)(K-1)}$$
for L and K classes, respectively.
Hypotheses: The null hypothesis supposes that the two variables are independent,
H 0 : X , Y independent and H 1 : X , Y dependent
or, probabilistically,
H 0 : f ( x , y)= f X ( x)⋅ f Y ( y ) and H 1 : f ( x , y )≠ f X ( x )⋅ f Y ( y )
This implies that the probability at any cell is the product of the marginal probabilities of its row and column.
Note that two underlying probability distributions are supposed for X and Y, although we do not care about
them, and we will directly estimate the probabilities from the empirical table.
Instead of using the computer, we can use the last value in our table to bound the p-value (statisticians usually
want to discover its value, while here we only want to check whether or not it is smaller than α):
pV = P(T0(X) > 65.52) < P(T0(X) > 11.3) = 0.01 → pV < 0.01 < 0.05 = α → H0 is rejected
Conclusion: The hypothesis that the two variables are independent is rejected. This means that there seems
to be an association between occupation and cause of death. (Remember: statistical results depend on: the
assumptions, the methods, the certainty and the data.)
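A quick check with R's chisq.test, assuming the table has been read correctly from the statement (for tables larger than 2×2 no continuity correction is applied); the statistic should match the value 65.52 used above:
# Columns: Police, Cashiers, Taxi Drivers, Guards; rows: homicide / other causes
deaths = matrix(c(82, 107, 70, 59,
                  92,   9, 29, 42), nrow = 2, byrow = TRUE)
chisq.test(deaths)   # chi-squared statistic with (2-1)*(4-1) = 3 degrees of freedom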
My notes:
Exercise 2ht-np
World War II Bomb Hits in London. To carry out an analysis, South London was divided into 576 areas. For
the variable N ≡ number of bombs in the k-th area (any), a simple random sample (x1,...,x576) was gathered
and grouped in the following table:
EMPIRICAL
Number of Bombs      0     1     2     3     4    5 or more
Number of Regions   229   211   93    35    7    1            n = 576
Data taken from: An application of the Poisson distribution. Clarke, R. D. Journal of the Institute of Actuaries [JIA] (1946) 72: 481
http://www.actuaries.org.uk/research-and-resources/documents/application-poisson-distribution
Discussion: We must apply the chi-square methodology to study whether the data statistically fit the models
specified. In the second section, a value for the parameter is given. For this probability model, we have to
calculate or estimate the probabilities in order to obtain the expected absolute frequencies. Finally, by using
the statistic T0 we will compare the two tables and make a decision.
Statistic: Since we have to apply a goodness-of-fit test, from a table of statistics (e.g. in [T]) we select
$$T_0(X)=\sum_{k=1}^{K}\frac{(N_k-\hat{e}_k)^{2}}{\hat{e}_k}\;\xrightarrow{d}\;\chi^{2}_{K-s-1}$$
With the estimate λ̂ = x̄ = 0.93, the probabilities of the model are:

Poisson (λ = 0.93)
Values          0       1       2       3       4        5 or more
Probabilities   0.395   0.367   0.17    0.0529  0.0123   0.00270      1
We have really done the calculations with the programming language R. By using a calculator, some
quantities may be slightly different due to technical effects (number of decimal digits, accuracy, etc).
To guarantee the quality of the chi-square methodology, the expected absolute frequencies are usually required
to be at least five (≥5). For this reason, we merge the last two classes in both the empirical and the
expected tables.
EMPIRICAL
Number of Bombs      0     1     2     3     4 or more
Number of Regions   229   211   93    35    7+1=8         n = 576
For this kind of test, the critical region always has the form Rc = {T0(X) > a}.
Decision: There are K = 5 classes (after merging two of them) and s = 1 estimation, so
$$T_0(X)\;\xrightarrow{d}\;\chi^{2}_{K-s-1}\equiv\chi^{2}_{5-1-1}\equiv\chi^{2}_{3}$$
If we apply the methodology based on the critical region, the necessary quantile a is calculated from the
definition of the type I error, with the given α = 0.05:
α=P (Type I error)= P ( Reject H 0 ∣ H 0 true)= P( T 0 ( X )∈Rc )≈P (T 0 (X )>a)
→ a = r_α = 7.81 → Rc = {T0(X) > 7.81}
> qchisq(1-0.05, 3)
[1] 7.814728
Then, the decision is: T0(x) = 1.019 < 7.81 → T0(x) ∉ Rc → H0 is not rejected.
If we apply the alternative methodology based on the p-value,
Poisson (λ = 0.8)
Values          0       1       2       3       4         5 or more
Probabilities   0.449   0.359   0.144   0.0383  0.00767   0.00141      1
As in the previous case, we merge the last two classes so that all the expected absolute frequencies are large
enough:

EMPIRICAL
Number of Bombs      0     1     2     3     4 or more
Number of Regions   229   211   93    35    7+1=8         n = 576
so
Conclusion: The hypothesis that bomb hits can reasonably be modeled by using the Poisson family has not
been rejected. In this case, data provided an estimate λ̂ = 0.93. Nevertheless, when the value λ = 0.8 is
imposed, the hypothesis that bomb hits can be modeled by using a Pois(λ=0.8) model is rejected. This proves
that:
i. Even a quite reasonable model may not fit the data if inappropriate parameter values are considered.
This emphasizes the importance of using good parameter estimation methods.
ii. Estimating the parameter value was better than fixing a value close to the estimate. As statisticians say:
“let the data talk”. This highlights the necessity of testing all suppositions, which implies that
nonparametric procedures should sometimes be applied before the parametric ones: in this case, before
supposing that the Poisson family is proper and imposing a value for the parameter, the whole Poisson
family must be considered.
(Remember: statistical results depend on: the assumptions, the methods, the certainty and the data.)
Advanced theory: Mendenhall, W., D.D. Wackerly and R.L. Scheaffer say (Mathematical Statistics with
Applications, Duxbury Press) that the expected absolute frequencies can be as low as 1 for some situations,
according to Cochran, W.G., “The χ2 Test of Goodness of Fit”, Annals of Mathematical Statistics, 23 (1952)
pp. 315-345. To take the most advantage of this exercise, we repeat the previous calculations without merging
the last two classes.
(1) Fit to the Poisson family
We evaluate T0, which is necessary to apply any of the two methodologies.
$$T_0(x)=\frac{(229-227.26)^{2}}{227.26}+\cdots+\frac{(1-1.55)^{2}}{1.55}=1.167$$
Now there are K = 6 classes and s = 1 estimation, so T0(X) →d χ²_{K−s−1} ≡ χ²_{6−1−1} ≡ χ²_4. If we apply the
methodology based on the critical region, a = qchisq(0.95, 4) = 9.49 and Rc = {T0(X) > 9.49}.
Then, the decision is: T 0 ( x)=1.167 < 9.49 → T 0 ( x)∉Rc → H0 is not rejected.
If we apply the alternative methodology based on the p-value,
In both sections the same decisions have been made, which implies that this is one of those situations where
merging the last two classes does not seem essential.
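The first fit can be reproduced in R; the sketch assumes, as needed to recover λ̂ ≈ 0.93, that the single region of the '5 or more' class is counted as 5 bombs when computing the sample mean:
obs = c(229, 211, 93, 35, 7, 1)                # observed absolute frequencies
n = sum(obs)                                    # 576 regions
lambdaHat = sum((0:5) * obs) / n                # sample mean, approx. 0.93
probs = c(dpois(0:4, lambdaHat), 1 - ppois(4, lambdaHat))
expected = n * probs                            # expected absolute frequencies
T0 = sum((obs - expected)^2 / expected)         # approx. 1.167 (unmerged classes)
qchisq(0.95, 6 - 1 - 1)                         # critical value 9.49
1 - pchisq(T0, 4)                               # p-value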
My notes:
Exercise 3ht-np
Three financial products have been commercialized and the presence of interest in them has been registered
for some individuals. It is possible to imagine different situations where the following data could have been
obtained.
            Product 1   Product 2   Product 3
Group 1        10          18           9        37
Group 2        20          13          15        48
               30          31          24        85
(a) A simple random sample of 48 people of the second group was classified according to the
variable product; test at α = 0.01 whether this variable follows the distribution determined by the
sample of the first group.
Discussion: In this exercise, the same table is looked at as containing data obtained from three different
schemes. The chi-square methodology will be applied in all sections through three kinds of test: goodness-of-
-fit, independence and homogeneity. In the first case, a probability distribution F0 is specified, while in the last
two cases the underlying distributions have no interest by themselves.
Hypotheses: For a nonparametric goodness-of-fit test, the null hypothesis assumes that the theoretical
probabilities of the second group follow the probabilities determined by the sample of the first group. If Fk
represents the distribution of the variable product in the k-th population,
H 0: F 2 ∼ F 1 and H 1 : F 2 ∼ F ≠F 1
The variable of the first group determines the following distribution F1:
Value          1       2       3
Probability   10/37   18/37   9/37
Now, under H0 the formula e_k = n·p_k allows us to fill in the expected table: 48·(10/37) = 12.97,
48·(18/37) = 23.35 and 48·(9/37) = 11.68.
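A short R check of this section; it reproduces the rejection at α = 0.01 mentioned in the advanced conclusion below:
obs2 = c(20, 13, 15)                       # group 2 counts
p1 = c(10, 18, 9) / 37                     # distribution determined by group 1
expected = 48 * p1
T0 = sum((obs2 - expected)^2 / expected)   # approx. 9.3
qchisq(0.99, 3 - 1)                        # critical value 9.21; T0 > 9.21, H0 rejected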
Hypotheses: For a nonparametric independence test, the null hypothesis assumes that the probability at any
cell is the product of the marginal probabilities of its row and column,
H0: X, Y independent and H1: X, Y dependent
or, probabilistically,
H0: f(x,y) = f_X(x)·f_Y(y) and H1: f(x,y) ≠ f_X(x)·f_Y(y)
Under H0, the formula $\hat{e}_{lk}=n\hat{p}_{lk}=n\,\hat{p}_{l\cdot}\,\hat{p}_{\cdot k}=n\,\frac{N_{l\cdot}}{n}\,\frac{N_{\cdot k}}{n}$ allows us to fill in the expected table:
Then,
$$T_0(x)=\frac{\left(10-\frac{37\cdot30}{85}\right)^{2}}{\frac{37\cdot30}{85}}+\cdots+\frac{\left(15-\frac{48\cdot24}{85}\right)^{2}}{\frac{48\cdot24}{85}}=4.29$$
For this kind of test, the critical region always has the form Rc = {T0(X) > a}.
Hypotheses: For a nonparametric homogeneity test, the null hypothesis assumes that the marginal
probabilities in any column are the same for the two groups, that is, are independent of the group or stratum.
This means that the variable of interest X follows the same probability distribution in each (sub)group or
stratum. If G represents the variable group, mathematically,
H 0 : F ( x∣ G)= F ( x) and H 1 : F ( x∣ G)≠F (x )
Under H0, the formula $\hat{e}_{lk}=n_l\,\hat{p}_{lk}=n_l\,\hat{p}_{\cdot k}=n_l\,\frac{N_{\cdot k}}{n}$ allows us to fill in the expected table:
$$T_0(x)=\frac{\left(10-37\,\frac{30}{85}\right)^{2}}{37\,\frac{30}{85}}+\cdots+\frac{\left(15-48\,\frac{24}{85}\right)^{2}}{48\,\frac{24}{85}}=4.29$$
For this kind of test, the critical region always has the form Rc = {T0(X) > a}.
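Both the independence and the homogeneity computations reduce to the same statistic, which can be verified with chisq.test:
interest = matrix(c(10, 18, 9,
                    20, 13, 15), nrow = 2, byrow = TRUE)
chisq.test(interest)   # X-squared approx. 4.29, df = (2-1)*(3-1) = 2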
Conclusion (advanced): Neither the independence nor the homogeneity has been rejected, while the
hypothesis supposing that the variable product follows in population 2 the distribution determined by the
sample of group 1 has been rejected. On the one hand, the distribution determined by one sample, involved
in section (a), is in general different from the common supposed underlying distribution involved in section (b),
which is estimated by using the samples of both groups. Thus, it can be thought that this underlying
distribution “is between the two samples”, which justifies the decisions made in (a), (b) and (c).
Group 2 has more weight in determining that distribution, since it has more elements. It is worth noticing the
similarity between the independence and the homogeneity tests: same distribution and evaluation for the
statistic, same critical region, et cetera. (As regards the application of the methodologies, bounding the p-value
is sometimes enough to discover whether it is smaller than α or not, but in general statisticians want to find its
value.)
My notes:
Discussion: In this exercise, no supposition should be evaluated: in (a) because the Bernoulli model is “the
only proper one” to model a coin, and in (b) and (c) because they involve nonparametric tests. The sections of
this exercise need the same calculations as in previous exercises.
Statistic: From a table of statistics (e.g. in [T]), since the population variable is Bernoulli and the asymptotic
framework can be considered (since n is large), the statistic
$$T(X;\eta)=\frac{\hat{\eta}-\eta}{\sqrt{\frac{?(1-?)}{n}}}\;\xrightarrow{d}\;N(0,1)$$
is selected, where the symbol ? is substituted by the best information available. In testing hypotheses, it will
be used in two forms:
$$T_0(X)=\frac{\hat{\eta}-\eta_0}{\sqrt{\frac{\eta_0(1-\eta_0)}{n}}}\;\xrightarrow{d}\;N(0,1)
\quad\text{and}\quad
T_1(X)=\frac{\hat{\eta}-\eta_1}{\sqrt{\frac{\eta_1(1-\eta_1)}{n}}}\;\xrightarrow{d}\;N(0,1)$$
where the supposed knowledge about the value of η is used in the denominators to estimate the variance (we
do not have nor suppose this information when T is used to build a confidence interval, or for tests with two
populations). Regardless of the methodology to be applied, the following value will be necessary:
$$T_0(x)=\frac{\frac{50{,}347}{100{,}000}-\frac{1}{2}}{\sqrt{\frac{\frac{1}{2}\left(1-\frac{1}{2}\right)}{100{,}000}}}=2.19$$
where η0 = 1/2 when the coin is supposed to be fair.
Hypotheses: Since a parametric test must be applied, the coin—dichotomic situation—is modeled by a
Bernoulli random variable, and the hypotheses are
H0: η = η0 = 1/2 and H1: η = η1 ≠ 1/2
Note that the question is about the value of the parameter η while the Bernoulli distribution is supposed under
both hypotheses; in some nonparametric tests, this distribution is not even supposed (although the Bernoulli
model is here the reasonable way of modeling a coin).
Decision: To determine Rc, the quantiles are calculated from the type I error with α = 0.1 at η0 = 1/2:
α(1/2) = P(Type I error) = P(Reject H0 | H0 true) = P(T(X;θ) ∈ Rc | H0) = P(|T0(X)| > a)
→ a = r_{α/2} = 1.645 → Rc = {|T0(X)| > 1.645}
Thus, the decision is: T0(x) = 2.19 > 1.645 → T0(x) ∈ Rc → H0 is rejected.
If we apply the methodology based on the p-value,
pV = P ( X more rejecting than x ∣ H 0 true)=P (∣T 0 ( X )∣>∣T 0 ( x)∣)
= 2⋅P (T 0 ( X )<−2.19)=2⋅0.0143=0.0248
→ pV =0.0248 < 0.1=α → H0 is rejected.
Power function: To calculate β, we have to work under H1. Since in this case the critical region is already
expressed in terms of T0 and we must use T1, we apply the mathematical tricks of multiplying and dividing by
the same quantity and of adding and subtracting the same quantity:
$$\beta(\eta_1)=P(\text{Type II error})=P(\text{Accept } H_0\mid H_1\text{ true})=P(T_0(X)\notin R_c\mid H_1)=P(|T_0(X)|\leq 1.645\mid H_1)$$
$$=P\left(-1.645\leq\frac{\hat{\eta}-\eta_0}{\sqrt{\frac{\eta_0(1-\eta_0)}{n}}}\leq+1.645\;\Bigg|\;H_1\right)
=P\left(-1.645\,\frac{\sqrt{\eta_0(1-\eta_0)}}{\sqrt{\eta_1(1-\eta_1)}}\leq\frac{\hat{\eta}-\eta_0}{\sqrt{\frac{\eta_1(1-\eta_1)}{n}}}\leq+1.645\,\frac{\sqrt{\eta_0(1-\eta_0)}}{\sqrt{\eta_1(1-\eta_1)}}\;\Bigg|\;H_1\right)$$
$$=P\left(-1.645\,\frac{\sqrt{\eta_0(1-\eta_0)}}{\sqrt{\eta_1(1-\eta_1)}}-\frac{\eta_1-\eta_0}{\sqrt{\frac{\eta_1(1-\eta_1)}{n}}}\leq\frac{\hat{\eta}-\eta_1}{\sqrt{\frac{\eta_1(1-\eta_1)}{n}}}\leq+1.645\,\frac{\sqrt{\eta_0(1-\eta_0)}}{\sqrt{\eta_1(1-\eta_1)}}-\frac{\eta_1-\eta_0}{\sqrt{\frac{\eta_1(1-\eta_1)}{n}}}\;\Bigg|\;H_1\right)$$
By using a computer, many values η1 can be considered to plot the power function
$$\phi(\eta)=P(\text{Reject } H_0)=\begin{cases}\alpha(\eta) & \text{if } \eta\in\Theta_0\\ 1-\beta(\eta) & \text{if } \eta\in\Theta_1\end{cases}$$
# Sample and inference
n = 100000
alpha = 0.1
theta0 = 0.5 # Value under the null hypothesis H0
q = qnorm(c(alpha/2, 1-alpha/2),0,1)
theta1 = seq(from=0,to=1,0.01)
paramSpace = sort(unique(c(theta1,theta0)))
PowerFunction = 1-pnorm((q[2]*sqrt(theta0*(1-theta0))-sqrt(n)*(paramSpace-theta0))/sqrt(paramSpace*(1-paramSpace)),0,1) +
pnorm((q[1]*sqrt(theta0*(1-theta0))-sqrt(n)*(paramSpace-theta0))/sqrt(paramSpace*(1-paramSpace)),0,1)
plot(paramSpace, PowerFunction, xlab='Theta', ylab='Probability of rejecting theta0', main='Power Function', type='l')
Statistic: To apply a goodness-of-fit test, from a table of statistics (e.g. in [T]) we select
$$T_0(X)=\sum_{k=1}^{K}\frac{(N_k-\hat{e}_k)^{2}}{\hat{e}_k}\;\xrightarrow{d}\;\chi^{2}_{K-s-1}$$
where there are K classes, and s parameters have to be estimated to determine F0 and hence the probabilities.
Hypotheses: For a nonparametric goodness-of-fit test, the null hypothesis supposes that the sample was
generated by a Bernoulli distribution with η0 = 1/2, while the alternative hypothesis supposes that it was
generated by a different distribution (Bernoulli or not, although this distribution is here “the reasonable way”
of modeling a coin).
$$H_0: X\sim F_0=B\!\left(\tfrac{1}{2}\right)\quad\text{and}\quad H_1: X\sim F\neq B\!\left(\tfrac{1}{2}\right)$$
For the distribution F0, the table of probabilities is
Value         –1 (tail)   +1 (head)
Probability      1/2         1/2
and, under H0, the formula e_k = n·p_k = n·P_θ(k-th class) = 100,000·(1/2) = 50,000 allows us to fill in the
expected table: 50,000 and 50,000.
Decision: There are K = 2 classes and s = 0 (no parameter has been estimated), so
$$T_0(X)\;\xrightarrow{d}\;\chi^{2}_{K-s-1}\equiv\chi^{2}_{2-1-0}\equiv\chi^{2}_{1}$$
If we apply the methodology based on the critical region, the definition of type I error, with α = 0.1, is applied
to calculate the quantile a:
α=P (Type I error)= P ( Reject H 0 ∣ H 0 true)= P( T 0 ( X )∈ Rc )≈P (T 0 (X )>a)
→ a=r α=2.71 → Rc = {T 0 ( X )> 2.71}
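The evaluation of the statistic can be reproduced in R; note that it equals the square of the value T0(x) = 2.19 obtained with the parametric statistic (see remark (3) in the conclusion):
obs = c(49653, 50347)                     # tails and heads in 100,000 tosses
expected = c(50000, 50000)
T0 = sum((obs - expected)^2 / expected)   # approx. 4.82 > 2.71, so H0 is rejected
qchisq(1 - 0.1, 1)                        # critical value 2.71
1 - pchisq(T0, 1)                         # p-value, approx. 0.028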
Statistic: To apply a position sign test, from a table of statistics (e.g. in [T]) we select
$$T_0(X)=\#\{X_j-\theta_0>0\}\sim Bin\left(n,\,P(X_j>\theta_0)\right)$$
Here θ0 = 0 and P(Xj > 0) = 1/2, so Me(T0) = E(T0) = n/2.
Hypotheses: For a nonparametric position test, if head and tail are equivalently translated into the numbers
+1 and –1, respectively, the hypotheses are
H 0 : Me( X ) = θ 0 = 0 and H 1 : Me( X ) = θ1 ≠ 0
For these hypotheses,
$$\alpha=P\left(\left|T_0(X)-\frac{n}{2}\right|>a\right)=P\left(\frac{\left|T_0(X)-\frac{n}{2}\right|}{\sqrt{n\,\frac{1}{2}\left(1-\frac{1}{2}\right)}}>\frac{a}{\sqrt{n\,\frac{1}{2}\left(1-\frac{1}{2}\right)}}\right)\approx P\left(|Z|>\frac{2a}{\sqrt{n}}\right)$$
$$\rightarrow\; r_{\alpha/2}=1.645=\frac{2a}{\sqrt{n}}\;\rightarrow\; a\approx 1.645\,\frac{\sqrt{100{,}000}}{2}\approx 260.097\;\rightarrow\; R_c=\left\{\left|T_0(X)-\frac{n}{2}\right|>260.097\right\}$$
The final decision is: ∣T 0 (x )−100,000/ 2∣=347 > 260.097=a → T 0 ( x)∈Rc → H0 is rejected.
If we apply the methodology based on the p-value,
$$pV=P\left(\left|T_0(X)-\frac{n}{2}\right|>\left|T_0(x)-\frac{n}{2}\right|\right)
=P\left(\frac{\left|T_0(X)-\frac{n}{2}\right|}{\sqrt{n\,\frac{1}{2}\left(1-\frac{1}{2}\right)}}>\frac{|50{,}347-50{,}000|}{\sqrt{100{,}000\cdot\frac{1}{2}\left(1-\frac{1}{2}\right)}}\right)
\approx P(|Z|>2.19)$$
$$=2\cdot P(Z<-2.19)=2\cdot 0.0143=0.0248$$
→ pV = 0.0248 < 0.1=α → H0 is rejected.
Conclusion: (1) In this case the three different tests agree to make the same decision, but this may not
happen in other situations. When it is possible to compare the power functions and there exists a uniformly
most powerful test, the decision of the most powerful should be considered. In general, (proper) parametric
tests are expected to have more power than the nonparametric ones in testing the same hypotheses. (2) With
two classes, the chi-square test does not distinguish any two distributions such that the two class probabilities
are (½, ½), that is, in this case the test provides a decision about the symmetry of the distribution (chi-square
tests work with class probabilities, not with the distributions themselves). (3) In this exercise the parametric
test and the nonparametric test of the signs are essentially the same. (Remember: statistical results depend on:
the assumptions, the methods, the certainty and the data.)
My notes:
Discussion: The pilot statistical study mentioned in the statement should cover the evaluation of all
suppositions. The hypothesis that σM = σF should be evaluated as well. The interval will be built by applying
the method of the pivot.
$$T(M,F;\mu_M,\mu_F)=\frac{(\bar{M}-\bar{F})-(\mu_M-\mu_F)}{\sqrt{\frac{S_M^2}{n_M}+\frac{S_F^2}{n_F}}}\sim t_{\kappa}
\quad\text{with}\quad
\kappa=\frac{\left(\frac{S_M^2}{n_M}+\frac{S_F^2}{n_F}\right)^{2}}{\frac{1}{n_M-1}\left(\frac{S_M^2}{n_M}\right)^{2}+\frac{1}{n_F-1}\left(\frac{S_F^2}{n_F}\right)^{2}}$$
$$T(M,F;\mu_M,\mu_F)=\frac{(\bar{M}-\bar{F})-(\mu_M-\mu_F)}{\sqrt{\frac{S_p^2}{n_M}+\frac{S_p^2}{n_F}}}\sim t_{n_M+n_F-2}
\quad\text{with}\quad
S_p^2=\frac{n_M s_M^2+n_F s_F^2}{n_M+n_F-2}=\frac{(n_M-1)S_M^2+(n_F-1)S_F^2}{n_M+n_F-2}$$
$$T(M,F;\sigma_M,\sigma_F)=\frac{S_M^2/\sigma_M^2}{S_F^2/\sigma_F^2}\sim F_{n_M-1,\,n_F-1}$$
Because of the information available, the first and the second statistics allow studying M – F (the second for
the particular case where σM = σF), while the third allows studying σM/σF.
$$P(\bar{M}-\bar{F}\leq 1.27)=P\left(\frac{(\bar{M}-\bar{F})-(\mu_M-\mu_F)}{\sqrt{\frac{S_M^2}{n_M}+\frac{S_F^2}{n_F}}}\leq\frac{1.27-(\mu_M-\mu_F)}{\sqrt{\frac{S_M^2}{n_M}+\frac{S_F^2}{n_F}}}\right)
=P\left(T\leq\frac{1.27-(14.2-13.5)}{\sqrt{\frac{4.99}{53}+\frac{5.02}{49}}}\right)=P(T\leq 1.29)$$
with T ~ t_κ where
$$\kappa=\frac{\left(\frac{S_M^2}{n_M}+\frac{S_F^2}{n_F}\right)^{2}}{\frac{1}{n_M-1}\left(\frac{S_M^2}{n_M}\right)^{2}+\frac{1}{n_F-1}\left(\frac{S_F^2}{n_F}\right)^{2}}
=\frac{\left(\frac{4.99}{53}+\frac{5.02}{49}\right)^{2}}{\frac{1}{53-1}\left(\frac{4.99}{53}\right)^{2}+\frac{1}{49-1}\left(\frac{5.02}{49}\right)^{2}}=99.33$$
Should we round this value downward, κ = 99, or upward, κ = 100? We will use this exercise to show that
➢ For large values of κ1 and κ2, the t distribution provides close values
➢ For a large value of κ, the t distribution provides values close to those of the standard normal
distribution (the tκ distribution tends with κ to the standard normal distribution)
By using the programming language R:
• If we round κ = 99.33 down to 99, the probability is
> pt(1.29, 99)
[1] 0.8999721
• If we round κ = 99.33 up to 100, the probability is
> pt(1.29, 100)
[1] 0.8999871
On the other hand, when the variances are supposed to be equal they can and should be estimated jointly by
using the pooled sample variance.
$$S_p^2=\frac{(n_M-1)S_M^2+(n_F-1)S_F^2}{n_M+n_F-2}=\frac{(53-1)\cdot 4.99\$^2+(49-1)\cdot 5.02\$^2}{53+49-2}=5.0044\$^2\approx 5\$^2$$
Then,
$$P(\bar{M}-\bar{F}\leq 1.27)=P\left(\frac{(\bar{M}-\bar{F})-(\mu_M-\mu_F)}{\sqrt{\frac{S_p^2}{n_M}+\frac{S_p^2}{n_F}}}\leq\frac{1.27-(14.2-13.5)}{\sqrt{\frac{5}{53}+\frac{5}{49}}}\right)=P(T\leq 1.29)$$
$$I_{1-\alpha}=\left[\frac{S_M^2}{r_{\alpha/2}\,S_F^2},\;\frac{S_M^2}{l_{\alpha/2}\,S_F^2}\right]
\quad\text{and then}\quad
I_{1-\alpha}=\left[\sqrt{\frac{S_M^2}{r_{\alpha/2}\,S_F^2}},\;\sqrt{\frac{S_M^2}{l_{\alpha/2}\,S_F^2}}\right]$$
In the calculations, multiplying by a quantity and inverting can be applied in either order.
Then
$$I_{0.95}=\left[\sqrt{\frac{4.99}{1.76\cdot 5.02}},\;\sqrt{\frac{4.99}{0.57\cdot 5.02}}\right]=[0.75,\,1.32]$$
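These computations can be checked in R; qf returns the exact F quantiles behind the rounded table values 1.76 and 0.57:
sM2 = 4.99; sF2 = 5.02; nM = 53; nF = 49
# Welch's degrees of freedom
kappa = (sM2/nM + sF2/nF)^2 /
        ((sM2/nM)^2/(nM - 1) + (sF2/nF)^2/(nF - 1))   # approx. 99.33
pt(1.29, 99); pt(1.29, 100)                            # both approx. 0.90
# 95% confidence interval for sigmaM/sigmaF
r = qf(0.975, nM - 1, nF - 1)
l = qf(0.025, nM - 1, nF - 1)
sqrt(c(sM2/(r*sF2), sM2/(l*sF2)))                      # approx. [0.75, 1.32]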
Conclusion: First of all, in this case there is very little difference between the two ways of estimating the
variance. On the other hand, as the variances are related through a quotient, the interpretation is not direct: the
dimensionless, multiplicative factor c in σM2 = cσF2 is, with 95% confidence, in the interval obtained. The
interval (with dimensionless endpoints) contains the value 1, so it may happen that the variability of the
amount of money spent is the same for males and females—we cannot reject this hypothesis (note that
confidence intervals can be used to make decisions). (Remember: statistical results depend on: the
assumptions, the methods, the certainty and the data.)
My notes:
Exercise 2pe-ci-ht
The electric light bulbs of manufacturer X have a mean lifetime of 1400 hours (h), while those of
manufacturer Y have a mean lifetime of 1200h. Simple random samples of 125 bulbs of each brand are tested.
From these datasets the sample quasivariances Sx2 = 156h2 and Sy2 = 159h2 are computed. If manufacturers
are supposed to be independent and their lifetimes are supposed to be normally distributed:
a) Build a 99% confidence interval for the quotient of standard deviations σX/σY. Is the value σX/σY=1,
that is, the case σX=σY, included in the interval?
b) By using the proper statistic T, find k such that P(X̄ − Ȳ ≤ k) = 0.4.
Hint: (i) Firstly, build an interval for the quotient σX2/σY2; secondly, apply the positive square root function. (ii) If a random variable
ξ follows a F124, 124 then P(ξ ≤ 0.628) = 0.005 and P(ξ ≤ 1.59) = 0.995. (iii) If ξ follows a t248, then P(ξ ≤ –0.25) = 0.4
(Based on an exercise of Statistics, Spiegel, M.R., and L.J. Stephens, McGraw–Hill.)
LINGUISTIC NOTE (From: Longman Dictionary of Common Errors. Turton, N.D., and J.B.Heaton. Longman.)
electric means carrying, producing, produced by, powered by, or charged with electricity: 'an electric wire', 'an electric generator', 'an
electric shock', 'an electric current', 'an electric light bulb', 'an electric toaster'. For machines and devices that are powered by electricity
but do not have transistors, microchips, valves, etc, use electric (NOT electronic): 'an electric guitar', 'an electric train set', 'an electric
razor'.
Discussion: There are two independent normal populations. All suppositions should be evaluated. Their
means are known while their variances are estimated from samples of size 125. A 99% confidence interval for
σX/σY is required. The interval will be built by applying the method of the pivot. If the value σX/σY=1 belongs
to this interval of confidence 0.99, the probability of the second section can reasonably be calculated under the
supposition σX=σ=σY—this implies that the common variance σ2 is jointly estimated by using the pooled
sample quasivariance Sp2. On the other hand, this exercise shows the natural order in which the statistical
techniques must sometimes be applied in practice: the supposition σX=σY is empirically supported—by
applying a confidence interval or a hypothesis test—before using it in calculating the probability. Since the
standard deviations have the same units of measurement as the data (hours), their quotient is dimensionless,
and so are the endpoints of the interval.
where $V_X^2=\frac{1}{n}\sum_{j=1}^{n}(X_j-\mu)^2$ and $S_X^2=\frac{1}{n-1}\sum_{j=1}^{n}(X_j-\bar{X})^2$, respectively (similarly for population Y).
We would use the first if we were given V²X and V²Y or we had enough information to calculate them (we
know the means but not the data themselves). In this exercise we can use only the second statistic.
$$I_{1-\alpha}=\left[\frac{S_X^2}{r_{\alpha/2}\,S_Y^2},\;\frac{S_X^2}{l_{\alpha/2}\,S_Y^2}\right]
\quad\text{and}\quad
I_{1-\alpha}=\left[\sqrt{\frac{S_X^2}{r_{\alpha/2}\,S_Y^2}},\;\sqrt{\frac{S_X^2}{l_{\alpha/2}\,S_Y^2}}\right]$$
$$I_{0.99}=\left[\sqrt{\frac{156h^2}{1.59\cdot 159h^2}},\;\sqrt{\frac{156h^2}{0.628\cdot 159h^2}}\right]=[0.786,\,1.25]$$
The value σX/σY=1 is in the interval of confidence 0.99 (99%), so the supposition σX=σY is strongly supported.
(b) Probability
To work with the difference of the means of two independent normal populations when σX = σY, we consider:
$$T(X,Y;\mu_X,\mu_Y)=\frac{(\bar{X}-\bar{Y})-(\mu_X-\mu_Y)}{\sqrt{\frac{S_p^2}{n_X}+\frac{S_p^2}{n_Y}}}\sim t_{n_X+n_Y-2}$$
where $S_p^2=\frac{(n_X-1)S_X^2+(n_Y-1)S_Y^2}{n_X+n_Y-2}=\frac{124\cdot 156h^2+124\cdot 159h^2}{125+125-2}=157.5h^2$ is the pooled sample quasivariance. Then,
$$P(\bar{X}-\bar{Y}\leq k)=P\left(\frac{(\bar{X}-\bar{Y})-(\mu_X-\mu_Y)}{\sqrt{\frac{S_p^2}{n_X}+\frac{S_p^2}{n_Y}}}\leq\frac{k-(1400h-1200h)}{\sqrt{\frac{157.5}{125}+\frac{157.5}{125}}}\right)=0.4$$
Now, by using the information in (iii) of the hint,
$$l_{0.4}=-0.25=\frac{k-(1400h-1200h)}{\sqrt{\frac{157.5h^2}{125}+\frac{157.5h^2}{125}}}
\;\rightarrow\;
k=200h-0.25\sqrt{\frac{2\cdot 157.5h^2}{125}}=199.60h$$
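A quick R verification; qt returns the exact t quantile behind the rounded value −0.25 of the hint:
Sp2 = (124*156 + 124*159) / (125 + 125 - 2)   # pooled quasivariance, 157.5 h^2
se = sqrt(Sp2/125 + Sp2/125)
k = 200 + qt(0.4, 248) * se                    # approx. 199.6 h
# 99% confidence interval for sigmaX/sigmaY
sqrt(c(156/(qf(0.995, 124, 124)*159), 156/(qf(0.005, 124, 124)*159)))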
Conclusion: A confidence interval has been obtained for the quotient of the standard deviations. The
dimensionless value of θ = σX/σY is between 0.786 and 1.250 with confidence 99%; alternatively, as the
standard deviations are related through a quotient, an equivalent interpretation is the following: the
(dimensionless) multiplicative factor θ in σX=θ·σY is, with 99% confidence, in the interval obtained. Since the
value θ = 1 is in this high-confidence interval, it may happen that the variability of the two lifetimes is the
same—we cannot reject this hypothesis (note that confidence intervals can be used to make decisions);
besides, it is reasonable to use the supposition σX=σY in calculating the probability of the second section. If
any two simple random samples of size 125 were considered, the difference of the sample means would be
smaller than 199.60h with a probability of 0.4. Once two particular samples are substituted, randomness is not
involved any more and the inequality x̄ − ȳ ≤ k = 199.60 is true or false. The endpoints of the interval have
no dimension, like the quotient σX/σY or the multiplicative factor c. (Remember: statistical results depend on:
the assumptions, the methods, the certainty and the data.)
My notes:
Discussion: In this exercise, no supposition should be evaluated. The number 30 plays a role only in
defining the population under study. The Bernoulli model is “the only proper one” to register the presence-
-absence of a condition. Percents must be rewritten in a 0-to-1 scale. Since the default option is that the
proportion has not changed, the equality is allocated in the null hypothesis. On the other hand, proportions are
dimensionless by definition.
Statistic: From a table of statistics (e.g. in [T]), the statistic
$$T(X;\eta)=\frac{\hat{\eta}-\eta}{\sqrt{\frac{?(1-?)}{n}}}\;\xrightarrow{d}\;N(0,1)$$
is selected, where the symbol ? is substituted by the best information available: η or η̂. In testing
hypotheses, it will be used in two forms:
$$T_0(X)=\frac{\hat{\eta}-\eta_0}{\sqrt{\frac{\eta_0(1-\eta_0)}{n}}}\;\xrightarrow{d}\;N(0,1)
\quad\text{and}\quad
T_1(X)=\frac{\hat{\eta}-\eta_1}{\sqrt{\frac{\eta_1(1-\eta_1)}{n}}}\;\xrightarrow{d}\;N(0,1)$$
where the supposed knowledge about the value of η is used in the denominators to estimate the variance (we
do not have this information when T is used to build a confidence interval, like in the next section).
Regardless of the testing methodology to be applied, the evaluation of the statistic is necessary to make the
decision. Since η0 = 0.25,
$$T_0(x)=\frac{\frac{34}{120}-0.25}{\sqrt{\frac{0.25(1-0.25)}{120}}}=0.843$$
Hypotheses: H0: η = η0 = 0.25 and H1: η = η1 > 0.25. The critical region has the form
$$R_c=\{\hat{\eta}>c\}=\left\{\frac{\hat{\eta}-\eta_0}{\sqrt{\frac{\eta_0(1-\eta_0)}{n}}}>\frac{c-\eta_0}{\sqrt{\frac{\eta_0(1-\eta_0)}{n}}}\right\}=\{T_0>a\}$$
Decision: To determine Rc, the quantile is calculated from the type I error with α = 0.1 at η0 = 0.25:
α(0.25) = P(Type I error) = P(Reject H0 | H0 true) = P(T0 > a)
→ a = r_{0.1} = l_{0.9} = 1.28 → Rc = {T0(X) > 1.28}
Now, the decision is: T0(x) = 0.843 < 1.28 → T0(x) ∉ Rc → H0 is not rejected.
p-value: pV = P(T0(X) > T0(x)) = P(T0(X) > 0.843) = 0.20 > 0.1 = α → H0 is not rejected.
Type II error: To calculate β, we have to work under H1. Since the critical region has been expressed in terms
of T0, and we must use T1, we could apply the mathematical trick of adding and subtracting the same quantity.
Nevertheless, this way is useful when the value c in Rc = {η̂ > c} has not been calculated yet; now, since we
have been told that Rc = {η̂ > 0.3}, it is easier to directly standardize with η1:
$$\beta(\eta_1)=P(\text{Type II error})=P(\text{Accept } H_0\mid H_1\text{ true})=P(T_0(X)\notin R_c\mid H_1)=P(\hat{\eta}\leq 0.3\mid H_1)$$
$$=P\left(\frac{\hat{\eta}-\eta_1}{\sqrt{\frac{\eta_1(1-\eta_1)}{n}}}\leq\frac{0.3-\eta_1}{\sqrt{\frac{\eta_1(1-\eta_1)}{n}}}\;\Bigg|\;H_1\right)=P\left(T_1\leq\frac{0.3-\eta_1}{\sqrt{\frac{\eta_1(1-\eta_1)}{n}}}\right)$$
For the particular value η1 = 0.35,
$$\beta(0.35)=P\left(T_1\leq\frac{0.3-0.35}{\sqrt{\frac{0.35(1-0.35)}{120}}}\right)=P(T_1\leq -1.15)=0.125$$
> pnorm(-1.15,0,1)
[1] 0.125
By using a computer, many more values η1 ≠ 0.35 can be considered to plot the power function
$$\phi(\eta)=P(\text{Reject } H_0)=\begin{cases}\alpha(\eta) & \text{if } \eta\in\Theta_0\\ 1-\beta(\eta) & \text{if } \eta\in\Theta_1\end{cases}$$
# Sample and inference
n = 120
alpha = 0.1
theta0 = 0.25 # Value under the null hypothesis H0
c = 0.3
theta1 = seq(from=0.25,to=1,0.01)
paramSpace = sort(unique(c(theta1,theta0)))
PowerFunction = 1 - pnorm((c-paramSpace)/sqrt(paramSpace*(1-paramSpace)/n),0,1)
plot(paramSpace, PowerFunction, xlab='Theta', ylab='Probability of rejecting theta0', main='Power Function', type='l')
Statistic: We use the same statistic
$$T(X;\eta)=\frac{\hat{\eta}-\eta}{\sqrt{\frac{?(1-?)}{n}}}\;\xrightarrow{d}\;N(0,1)$$
where the symbol ? is substituted by the best information available. In testing hypotheses we were also
studying the unknown quantity η, although it was provisionally supposed to be known under the hypotheses;
for confidence intervals, we are not working under any hypothesis and η must be estimated in the
denominator:
$$T(X;\eta)=\frac{\hat{\eta}-\eta}{\sqrt{\frac{\hat{\eta}(1-\hat{\eta})}{n}}}\;\xrightarrow{d}\;N(0,1)$$
̂
The interval is obtained with the same calculations as in previous exercises involving a Bernoulli population,
$$I_{1-\alpha}=\left[\hat{\eta}-r_{\alpha/2}\sqrt{\frac{\hat{\eta}(1-\hat{\eta})}{n}},\;\hat{\eta}+r_{\alpha/2}\sqrt{\frac{\hat{\eta}(1-\hat{\eta})}{n}}\right]$$
where r α / 2 is the value of the standard normal distribution such that P( Z>r α/2 )=α / 2. By using
• n = 120.
• Sample proportion: η̂ = 34/120 = 0.283.
• 90% → 1–α = 0.9 → α = 0.1 → α/2 = 0.05 → r_{0.05} = l_{0.95} = 1.645.
the particular interval (for these data) appears
$$I_{0.9}=\left[0.283-1.645\sqrt{\frac{0.283(1-0.283)}{120}},\;0.283+1.645\sqrt{\frac{0.283(1-0.283)}{120}}\right]=[0.215,\,0.351]$$
Thinking about the interval as an acceptance region, since η0 = 0.25 ∈ I, the hypothesis that η may still be
0.25 is not rejected.
Conclusion: With confidence 90%, the proportion of births by mothers of over 30 years of age seems to be
0.25 at most. The same decision is still made by considering the confidence interval that would correspond to
My notes:
Exercise 4pe-ci-ht
A random quantity X is supposed to follow a distribution whose probability function is, for θ > 0,
$$f(x;\theta)=\begin{cases}\theta x^{\theta-1} & \text{if } 0\leq x\leq 1\\ 0 & \text{otherwise}\end{cases}$$
A) Apply the method of the moments to find an estimator of the parameter θ.
B) Apply the maximum likelihood method to find an estimator of the parameter θ.
C) Use the estimators obtained to build others for the mean μ and the variance σ2.
D) Let X = (X1,...,Xn) be a simple random sample. By applying the results involving Neyman-Pearson's
lemma and the likelihood ratio, study the critical region for the following pairs of hypotheses.
H0: θ = θ0 vs H1: θ = θ1
H0: θ = θ0 vs H1: θ = θ1 > θ0
H0: θ = θ0 vs H1: θ = θ1 < θ0
H0: θ ≤ θ0 vs H1: θ = θ1 > θ0
H0: θ ≥ θ0 vs H1: θ = θ1 < θ0
Hint: Use that E(X) = θ/(θ+1) and E(X2) = θ/(θ+2).
Discussion: This statement is basically mathematical. The random variable X is dimensionless. (This
probability distribution, with standard power function density, is a particular case of the Beta distribution.)
Note: If E(X) had not been given in the statement, it could have been calculated by integrating:
$$E(X)=\int_{-\infty}^{+\infty}x\,f(x;\theta)\,dx=\int_{0}^{1}x\,\theta x^{\theta-1}\,dx=\theta\int_{0}^{1}x^{\theta}\,dx=\theta\left[\frac{x^{\theta+1}}{\theta+1}\right]_{0}^{1}=\frac{\theta}{\theta+1}$$
Besides, E(X²) could have been calculated as follows:
$$E(X^2)=\int_{-\infty}^{+\infty}x^2\,f(x;\theta)\,dx=\int_{0}^{1}x^2\,\theta x^{\theta-1}\,dx=\theta\int_{0}^{1}x^{\theta+1}\,dx=\theta\left[\frac{x^{\theta+2}}{\theta+2}\right]_{0}^{1}=\frac{\theta}{\theta+2}$$
Now,
$$\mu=E(X)=\frac{\theta}{\theta+1}
\quad\text{and}\quad
\sigma^2=Var(X)=E(X^2)-E(X)^2=\frac{\theta}{\theta+2}-\left(\frac{\theta}{\theta+1}\right)^{2}=\frac{\theta}{(\theta+2)(\theta+1)^{2}}$$
A) Method of the moments
a1) Population and sample moments: There is only one parameter—one equation is needed. The first-order
moments of the model X and the sample x are, respectively,
$$\mu_1(\theta)=E(X)=\frac{\theta}{\theta+1}
\quad\text{and}\quad
m_1(x_1,x_2,...,x_n)=\frac{1}{n}\sum_{j=1}^{n}x_j=\bar{x}$$
a2) System of equations: Since the parameter of interest θ appears in the first-order moment of X, the first
equation suffices:
$$\mu_1(\theta)=m_1(x_1,x_2,...,x_n)\;\rightarrow\;\frac{\theta}{\theta+1}=\frac{1}{n}\sum_{j=1}^{n}x_j=\bar{x}\;\rightarrow\;\theta=\theta\bar{x}+\bar{x}\;\rightarrow\;\theta=\frac{\bar{x}}{1-\bar{x}}$$
B) Maximum likelihood method
b1) Likelihood function: For this probability distribution, the density function is f(x;θ) = θx^{θ−1}, so
$$L(x_1,x_2,...,x_n;\theta)=\prod_{j=1}^{n}f(x_j;\theta)=\prod_{j=1}^{n}\theta x_j^{\theta-1}=\theta^{n}\left(\prod_{j=1}^{n}x_j\right)^{\theta-1}$$
b2) Optimization problem: The logarithm function is applied to make calculations easier
$$\log[L(x_1,x_2,...,x_n;\theta)]=n\log(\theta)+(\theta-1)\log\left(\prod_{j=1}^{n}x_j\right)$$
To find the local or relative extreme values, the necessary condition is:
$$0=\frac{d}{d\theta}\log[L(x_1,x_2,...,x_n;\theta)]=\frac{n}{\theta}+\log\left(\prod_{j=1}^{n}x_j\right)
\;\rightarrow\;
\theta_0=-\frac{n}{\log\left(\prod_{j=1}^{n}x_j\right)}$$
To verify that the only candidate is a (local) maximum, the sufficient condition is:
$$\frac{d^2}{d\theta^2}\log[L(x_1,x_2,...,x_n;\theta)]=\frac{d}{d\theta}\left[\frac{n}{\theta}+\log\left(\prod_{j=1}^{n}x_j\right)\right]=-\frac{n}{\theta^{2}}<0$$
The second derivative is always negative, also at the value θ0.
C) Estimation of μ and σ²
c1) For the mean: By applying the plug-in principle,
From the method of the moments:
$$\hat{\mu}_M=\frac{\hat{\theta}_M}{\hat{\theta}_M+1}=\frac{\frac{\bar{X}}{1-\bar{X}}}{\frac{\bar{X}}{1-\bar{X}}+1}=\frac{\frac{\bar{X}}{1-\bar{X}}}{\frac{\bar{X}+1-\bar{X}}{1-\bar{X}}}=\bar{X}$$
From the maximum likelihood method:
$$\hat{\mu}_{ML}=\frac{\hat{\theta}_{ML}}{\hat{\theta}_{ML}+1}=\frac{\frac{-n}{\log\left(\prod_{j=1}^{n}X_j\right)}}{\frac{-n}{\log\left(\prod_{j=1}^{n}X_j\right)}+1}=\frac{n}{n-\log\left(\prod_{j=1}^{n}X_j\right)}$$
c2) For the variance: Instead of substituting in the large expression of σ2, we use functional notation
From the method of the moments: σ^ 2M =σ 2 ( θ^ M ) , with σ2 (θ) and θ^ M given above.
From the maximum likelihood method: σ^ 2ML =σ2 ( θ^ ML ), with σ2 (θ) and θ^ ML given above.
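Since θx^(θ−1) on [0,1] is the Beta(θ,1) density, both estimators can be checked quickly by simulation; θ = 2 and n = 1000 are illustrative values:
set.seed(1)
theta = 2; n = 1000                  # illustrative values
x = rbeta(n, theta, 1)               # theta*x^(theta-1) is the Beta(theta,1) density
thetaM  = mean(x) / (1 - mean(x))    # method of moments
thetaML = -n / sum(log(x))           # maximum likelihood
c(thetaM, thetaML)                   # both should be close to theta = 2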
D) Critical region
$$L(X;\theta)=\theta^{n}\left(\prod_{j=1}^{n}X_j\right)^{\theta-1}
\quad\text{and}\quad
\Lambda(X;\theta_0,\theta_1)=\frac{L(X;\theta_0)}{L(X;\theta_1)}=\left(\frac{\theta_0}{\theta_1}\right)^{n}\left(\prod_{j=1}^{n}X_j\right)^{\theta_0-\theta_1}$$
Then, the critical or rejection region is
$$R_c=\{\Lambda<k\}=\left\{\left(\frac{\theta_0}{\theta_1}\right)^{n}\left(\prod_{j=1}^{n}X_j\right)^{\theta_0-\theta_1}<k\right\}
=\left\{n\log\left(\frac{\theta_0}{\theta_1}\right)+(\theta_0-\theta_1)\log\left(\prod_{j=1}^{n}X_j\right)<\log(k)\right\}$$
$$=\left\{(\theta_0-\theta_1)\log\left(\prod_{j=1}^{n}X_j\right)<\log(k)-n\log\left(\frac{\theta_0}{\theta_1}\right)\right\}$$
Since each Xj takes values in (0,1), log(∏Xj) < 0, and in terms of $\hat{\theta}_{ML}=-n/\log\left(\prod_{j=1}^{n}X_j\right)$,
$$=\left\{\frac{1}{\theta_0-\theta_1}\cdot\frac{-n}{\log\left(\prod_{j=1}^{n}X_j\right)}<\frac{-n}{\log(k)-n\log\left(\frac{\theta_0}{\theta_1}\right)}\right\}
=\left\{\hat{\theta}_{ML}\cdot\frac{1}{\theta_0-\theta_1}<\frac{-n}{\log(k)-n\log\left(\frac{\theta_0}{\theta_1}\right)}\right\}$$
Now it is necessary that θ1 ≠ θ0 and
• if θ1 < θ0 then (θ0−θ1) > 0 and hence
$$R_c=\left\{\hat{\theta}_{ML}<\frac{-n(\theta_0-\theta_1)}{\log(k)-n\log\left(\frac{\theta_0}{\theta_1}\right)}\right\}$$
• if θ1 > θ0 then (θ0−θ1) < 0 and hence
$$R_c=\left\{\hat{\theta}_{ML}>\frac{-n(\theta_0-\theta_1)}{\log(k)-n\log\left(\frac{\theta_0}{\theta_1}\right)}\right\}$$
Hypothesis tests
H0: θ = θ0 vs H1: θ = θ1 > θ0,  and  H0: θ = θ0 vs H1: θ = θ1 < θ0
In applying the methodologies, the same critical value c will be obtained for any θ1 since it only depends upon
θ0 through θ̂_ML: α = P(Type I error) = P(θ̂_ML < c) or α = P(Type I error) = P(θ̂_ML > c). This implies that
the uniformly most powerful test has been found.
Hypothesis tests
H0: θ ≤ θ0 vs H1: θ = θ1 > θ0,  and  H0: θ ≥ θ0 vs H1: θ = θ1 < θ0
A uniformly most powerful test for H 0 : θ = θ0 is also uniformly most powerful for H 0 : θ ≤ θ0 .
Conclusion: For the probability distribution determined by the function given, two methods of point
estimation have been applied. In this case, the two methods provide different estimators. By applying the
plug-in principle, estimators of the mean and the variance have also been obtained. The form of the critical
region has been studied by applying Neyman-Pearson's lemma and the likelihood ratio.
Additional Exercises
Exercise 1ae
Assume that the height (in centimeters, cm) of any student of a group follows a normal distribution with
variance 55cm2. If a simple random sample of 25 students is considered, calculate the probability that the
sample quasivariance will be bigger than 64.625cm2.
Discussion: In this exercise, the supposition that the normal distribution reasonably explains the variable
height should be evaluated by using proper statistical techniques.
Identification of the variable and selection of the statistic: The variable is the height, the
population distribution is normal, the sample size is 25, and we are asked for the probability of an event
expressed in terms of one of the usual statistics: P (S 2 > 64.625).
Search for a known distribution: Since we do not know the sampling distribution of S2, we cannot
calculate this probability directly. Instead, just after reading 'sample quasivariance' we should think about the
following theoretical result
$$T=\frac{(n-1)S^2}{\sigma^2}\sim\chi^{2}_{n-1},
\quad\text{or, in this case,}\quad
T=\frac{(25-1)S^2}{55cm^2}\sim\chi^{2}_{25-1}$$
Rewriting the event: The event has to be rewritten by completing some terms until (the dimensionless
statistic) T appears. Additionally, since the table of the χ² distribution gives lower-tail probabilities P(X ≤ x), it
is necessary to consider the complementary event:
$$P(S^2>64.625)=P\left(\frac{(25-1)S^2}{55cm^2}>\frac{(25-1)\,64.625cm^2}{55cm^2}\right)=P(T>28.2)=1-P(T\leq 28.2)=1-0.75=0.25$$
In these calculations, one property of the transformations has been applied: multiplying or dividing by a
positive quantity does not modify an inequality.
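The same probability in R:
1 - pchisq((25 - 1) * 64.625 / 55, 25 - 1)   # = 0.25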
Conclusion: The probability of the event is 0.25. This means that S2 will sometimes take a value bigger than
64.625cm2, when evaluated at specific data x coming from the population distribution.
My notes:
Exercise 2ae
Let X be a random variable with probability function
$$f(x;\theta)=\frac{\theta x^{\theta-1}}{3^{\theta}},\quad x\in[0,3]$$
Discussion: This statement is mathematical. Although it is given, the expectation of X could be calculated as
follows:
$$\mu_1(\theta)=E(X)=\int_{-\infty}^{+\infty}x\,f(x;\theta)\,dx=\int_{0}^{3}x\,\frac{\theta x^{\theta-1}}{3^{\theta}}\,dx=\frac{\theta}{3^{\theta}}\left[\frac{x^{\theta+1}}{\theta+1}\right]_{0}^{3}=\frac{\theta}{3^{\theta}}\,\frac{3^{\theta+1}}{\theta+1}=\frac{3\theta}{\theta+1}$$
Method of the moments
System of equations: Since the parameter θ appears in the first-order moment of X, the first equation is
sufficient to apply the method:
$$\mu_1(\theta)=m_1(x_1,x_2,...,x_n)\;\rightarrow\;\frac{3\theta}{\theta+1}=\frac{1}{n}\sum_{j=1}^{n}x_j=\bar{x}\;\rightarrow\;3\theta=\theta\bar{x}+\bar{x}\;\rightarrow\;\theta(3-\bar{x})=\bar{x}\;\rightarrow\;\theta=\frac{\bar{x}}{3-\bar{x}}$$
The estimator:
$$\hat{\theta}_M=\frac{\bar{X}}{3-\bar{X}}$$
My notes:
Exercise 3ae
A poll of 1000 individuals, being a simple random sample, over the age of 65 years was taken to determine
the percent of the population in this age group who had an Internet connection. It was found that 387 of the
1000 had one. Find a 95% confidence interval for η.
(Taken from an exercise of Statistics, Spiegel, M.R., and L.J. Stephens, McGraw-Hill)
Discussion: Asymptotic results can be applied for this large sample of a Bernoulli population. The cutoff
age value determines the population of the statistical analysis, but it plays no other role. Both η and η^ are
dimensionless.
Identification of the variable: Having the connection or not is a dichotomic situation; then
X ≡ Connected (an individual)? X ~ Bern(η)
$$T(X;\eta)=\frac{\hat{\eta}-\eta}{\sqrt{\frac{\hat{\eta}(1-\hat{\eta})}{n}}}\;\xrightarrow{d}\;N(0,1)$$
$$1-\alpha=P(l_{\alpha/2}\leq T(X;\eta)\leq r_{\alpha/2})=P\left(-r_{\alpha/2}\leq\frac{\hat{\eta}-\eta}{\sqrt{\frac{\hat{\eta}(1-\hat{\eta})}{n}}}\leq +r_{\alpha/2}\right)$$
$$=P\left(-r_{\alpha/2}\sqrt{\frac{\hat{\eta}(1-\hat{\eta})}{n}}\leq\hat{\eta}-\eta\leq +r_{\alpha/2}\sqrt{\frac{\hat{\eta}(1-\hat{\eta})}{n}}\right)
=P\left(-\hat{\eta}-r_{\alpha/2}\sqrt{\frac{\hat{\eta}(1-\hat{\eta})}{n}}\leq -\eta\leq -\hat{\eta}+r_{\alpha/2}\sqrt{\frac{\hat{\eta}(1-\hat{\eta})}{n}}\right)$$
$$=P\left(\hat{\eta}+r_{\alpha/2}\sqrt{\frac{\hat{\eta}(1-\hat{\eta})}{n}}\geq\eta\geq\hat{\eta}-r_{\alpha/2}\sqrt{\frac{\hat{\eta}(1-\hat{\eta})}{n}}\right)$$
(3) The interval: Then,
$$I_{1-\alpha}=\left[\hat{\eta}-r_{\alpha/2}\sqrt{\frac{\hat{\eta}(1-\hat{\eta})}{n}},\;\hat{\eta}+r_{\alpha/2}\sqrt{\frac{\hat{\eta}(1-\hat{\eta})}{n}}\right]$$
where r α / 2 is the value of the standard normal distribution verifying P( Z> r α /2 )=α /2.
$$I_{0.95}=\left[0.387-1.96\sqrt{\frac{0.387(1-0.387)}{1000}},\;0.387+1.96\sqrt{\frac{0.387(1-0.387)}{1000}}\right]=[0.357,\,0.417]$$
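The same interval in R:
etaHat = 387/1000
etaHat + c(-1, 1) * qnorm(0.975) * sqrt(etaHat*(1 - etaHat)/1000)   # [0.357, 0.417]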
Conclusion: The unknown proportion of individuals over the age of 65 years with an Internet connection is
inside the range [0.357, 0.417] with confidence 0.95, and outside the interval with confidence 0.05.
Perhaps a 0-to-100 scale facilitates the interpretation: the percent of individuals is in [35.7%, 41.7%] with
95% confidence. Proportions and probabilities are always dimensionless quantities, even when expressed as
percents.
My notes:
Exercise 4ae
A company is interested in studying its clients' behaviour. For this purpose, the mean time between
consecutive demands of service is modeled by a random variable whose density function is:
$$f(x;\theta)=\frac{1}{\theta}\,e^{-\frac{x-2}{\theta}},\quad x\geq 2,\;(\theta>0)$$
Discussion: The two sections are based on the calculation of the mean and the variance of the estimator
given in the statement. Then, the formulas of the bias and the mean square error must be used. Finally, the
limit of the mean square error is studied.
Unbiasedness: The estimator is unbiased, as the expression of the mean shows. Alternatively, we calculate
the bias
b ( θ^ M )= E(θ^ M )−θ=θ−θ=0
Conclusion: The calculations of the mean and the variance are quite easy. They show that the estimator is
unbiased and, if the variance is finite, consistent.
Advanced Theory: If E(X) had not been given in the statement, it could have been calculated by
applying integration by parts (since polynomials and exponentials are functions “of different type”):
$$E(X)=\int_{-\infty}^{+\infty}x\,f(x;\theta)\,dx=\int_{2}^{\infty}x\,\frac{1}{\theta}e^{-\frac{x-2}{\theta}}\,dx=\left[-x\,e^{-\frac{x-2}{\theta}}\right]_{2}^{\infty}-\int_{2}^{\infty}1\cdot\left(-e^{-\frac{x-2}{\theta}}\right)dx$$
$$=\left[-x\,e^{-\frac{x-2}{\theta}}-\theta\,e^{-\frac{x-2}{\theta}}\right]_{2}^{\infty}=\left[-(x+\theta)\,e^{-\frac{x-2}{\theta}}\right]_{2}^{\infty}=2+\theta.$$
That $\int u(x)\,v'(x)\,dx=u(x)\,v(x)-\int u'(x)\,v(x)\,dx$ has been used with
• u = x → u' = 1
• $v'=\frac{1}{\theta}e^{-\frac{x-2}{\theta}}\;\rightarrow\;v=\int\frac{1}{\theta}e^{-\frac{x-2}{\theta}}\,dx=-e^{-\frac{x-2}{\theta}}$
On the other hand, e^x changes faster than x^k for any k. To calculate E(X²):
$$E(X^2)=\int_{2}^{\infty}x^{2}\,\frac{1}{\theta}e^{-\frac{x-2}{\theta}}\,dx=\left[-x^{2}e^{-\frac{x-2}{\theta}}\right]_{2}^{\infty}+2\theta\int_{2}^{\infty}x\,\frac{1}{\theta}e^{-\frac{x-2}{\theta}}\,dx=(2^{2}-0)+2\theta\mu=4+2\theta(2+\theta)=2\theta^{2}+4\theta+4.$$
Again, integration by parts has been applied: $\int u(x)\,v'(x)\,dx=u(x)\,v(x)-\int u'(x)\,v(x)\,dx$ with
• u = x² → u' = 2x
• $v'=\frac{1}{\theta}e^{-\frac{x-2}{\theta}}\;\rightarrow\;v=\int\frac{1}{\theta}e^{-\frac{x-2}{\theta}}\,dx=-e^{-\frac{x-2}{\theta}}$
My notes:
Exercise 5ae
Is There Intelligent Life on Other Planets? In a 1997 Marist Institute survey of 935 randomly selected
Americans, 60% of the sample answered “yes” to the question “Do you think there is intelligent life on other
planets?” (http://maristpoll.marist.edu/tag/mipo/). Let's use this sample estimate to calculate a 90%
confidence interval for the proportion of all Americans who believe there is intelligent life on other planets.
What are the margin of error and the length of the interval?
(From Mind on Statistics. Utts, J.M., and R.F. Heckard. Thomson)
LINGUISTIC NOTE (From: Common Errors in English Usage. Brians, P. William, James & Co.)
American. Many Canadians and Latin Americans are understandably irritated when U.S. citizens refer to themselves simply as
“Americans.” Canadians (and only Canadians) use the term “North American” to include themselves in a two-member group with their
neighbor to the south, though geographers usually include Mexico in North America. When addressing an international audience
composed largely of people from the Americas, it is wise to consider their sensitivities.
However, it is pointless to try to ban this usage in all contexts. Outside of the Americas, “American” is universally understood to refer
to things relating to the U.S. There is no good substitute. Brazilians, Argentineans, and Canadians all have unique terms to refer to
themselves. None of them refer routinely to themselves as “Americans” outside of contexts like the “Organization of American States.”
Frank Lloyd Wright promoted “Usonian,” but it never caught on. For better or worse, “American” is standard English for “citizen or
resident of the United States of America.”
Discussion: There are several complementary pieces of information in the statement that help us to identify
the distribution of the population variable X (Bernoulli distribution) and select the proper statistic T:
(a) The meaning of the question—for each item there are two possible values: “yes” or “no”.
(b) The value 60% suggests that this is a proportion expressed in percent.
(c) The words Let's use this sample estimate and confidence interval for the proportion.
Thus, we must construct a confidence interval for the proportion η (a percent is a proportion expressed in a
0-to-100 scale) of one Bernoulli population. The sample information available consists of two data: the sample
size n = 935 and the sample proportion η̂ = 0.6. The relation between these quantities is the following:
$$\hat{\eta}=\frac{1}{n}\sum_{j=1}^{n}X_j=\frac{\#\,1\text{'s}}{n}\left(=\frac{\#\text{ Yeses}}{n}\right).$$
Confidence interval
For this kind of population and amount of data, we use the statistic:
$$T(X;\eta)=\frac{\hat{\eta}-\eta}{\sqrt{\frac{?(1-?)}{n}}}\;\xrightarrow{d}\;N(0,1)$$
where ? is substituted by η or η̂. For confidence intervals η is unknown and no value is supposed, and
hence it is estimated through the sample proportion. By applying the method of the pivot:
$$1-\alpha=P(l_{\alpha/2}\leq T(X;\eta)\leq r_{\alpha/2})=P\left(-r_{\alpha/2}\leq\frac{\hat{\eta}-\eta}{\sqrt{\frac{\hat{\eta}(1-\hat{\eta})}{n}}}\leq +r_{\alpha/2}\right)$$
$$=P\left(-r_{\alpha/2}\sqrt{\frac{\hat{\eta}(1-\hat{\eta})}{n}}\leq\hat{\eta}-\eta\leq +r_{\alpha/2}\sqrt{\frac{\hat{\eta}(1-\hat{\eta})}{n}}\right)
=P\left(-\hat{\eta}-r_{\alpha/2}\sqrt{\frac{\hat{\eta}(1-\hat{\eta})}{n}}\leq -\eta\leq -\hat{\eta}+r_{\alpha/2}\sqrt{\frac{\hat{\eta}(1-\hat{\eta})}{n}}\right)$$
$$=P\left(\hat{\eta}+r_{\alpha/2}\sqrt{\frac{\hat{\eta}(1-\hat{\eta})}{n}}\geq\eta\geq\hat{\eta}-r_{\alpha/2}\sqrt{\frac{\hat{\eta}(1-\hat{\eta})}{n}}\right)$$
$$I_{1-\alpha}=\left[\hat{\eta}-r_{\alpha/2}\sqrt{\frac{\hat{\eta}(1-\hat{\eta})}{n}},\;\hat{\eta}+r_{\alpha/2}\sqrt{\frac{\hat{\eta}(1-\hat{\eta})}{n}}\right]$$
Substitution: We calculate the quantities in the formula,
• n = 935
• η=0.6
^
• 90% → 1–α = 0.90 → α = 0.10 → α/2 = 0.05 → r α /2=r 0.05=l 0.95=1.645
So
$$I_{0.9}=\left[0.6-1.645\sqrt{\frac{0.6(1-0.6)}{935}},\;0.6+1.645\sqrt{\frac{0.6(1-0.6)}{935}}\right]=[0.574,\,0.626]$$
The margin of error is
$$E=r_{\alpha/2}\sqrt{\frac{\hat{\eta}(1-\hat{\eta})}{n}}=1.645\sqrt{\frac{0.6(1-0.6)}{935}}=0.0264$$
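In R, the margin of error and the length of the interval (twice the margin of error):
E = qnorm(0.95) * sqrt(0.6 * (1 - 0.6) / 935)   # margin of error, approx. 0.0264
2 * E                                            # length of the interval, approx. 0.053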
Conclusion: Since the population proportion is in the interval (0,1) by definition, the values obtained seem
reasonable. Both endpoints are over 0.5, which means that most US citizens think there is intelligent life on
other planets. With a confidence of 0.90, measured in a 0-to-1 scale, the value of η will be in the interval
obtained. As regards the methodology applied, on average it provides a right interval 90% of the times.
Nonetheless, frequently we do not know the real η and therefore we will never know if the method has failed
or not.
My notes:
Exercise 6ae
It is desired to know the proportion η of female students at university. To that end, a simple random sample of
n students is to be gathered. Obtain the estimators η^ M and η^ ML for that proportion, by applying the
method of the moments and the maximum likelihood method.
Discussion: This statement is mathematical, really. Although it is given in the statement, the expectation of
X could be calculated as follows:
$$\mu_1(\eta)=E(X)=\sum_{\Omega}x\,f(x;\eta)=\sum_{x=0}^{1}x\,\eta^{x}(1-\eta)^{1-x}=0\cdot 1\cdot(1-\eta)+1\cdot\eta\cdot 1=\eta$$
Population and sample centered moments: The probability distribution has one parameter. The first-order
moments are
1 n
μ1 (η)=E ( X )=η and m1 ( x 1 , x 2 ,... , x n )= ∑ j =1 x j= x̄
n
System of equations: Since the parameter η appears in the first-order moment of X, the first equation is
sufficient to apply the method:
$$\mu_1(\eta)=m_1(x_1,x_2,...,x_n)\;\rightarrow\;\eta=\frac{1}{n}\sum_{j=1}^{n}x_j=\bar{x}$$
The estimator:
η^ M = X̄
Likelihood function: For this distribution the mass function is f(x;η) = η^x(1−η)^{1−x}, so
$$L(x_1,x_2,...,x_n;\eta)=\prod_{j=1}^{n}f(x_j;\eta)=\eta^{x_1}(1-\eta)^{1-x_1}\cdots\eta^{x_n}(1-\eta)^{1-x_n}=\eta^{\sum_{j=1}^{n}x_j}(1-\eta)^{\,n-\sum_{j=1}^{n}x_j}$$
since 1 ≥ xj and therefore $n\geq\sum_{j=1}^{n}x_j\;\leftrightarrow\;n-\sum_{j=1}^{n}x_j\geq 0$. This holds for any value, including η0.
The estimator:
η^ ML = X̄
My notes:
Discussion: The distribution considered has two parameters, though one of them is known.
Method of the moments
System of equations: Since the parameter of interest λ2 appears in the first-order population moment of X, the
first equation is enough to apply the method:
$$\mu_1(\lambda_2)=m_1(x_1,x_2,...,x_n)\;\rightarrow\;\frac{\lambda_2}{2}=\frac{1}{n}\sum_{j=1}^{n}x_j=\bar{x}\;\rightarrow\;\lambda_2=2\bar{x}$$
The estimator:
$$\hat{\lambda}_2=2\bar{X}$$
Conclusion: To estimate the parameter λ2, the method of the moments suggests twice the sample mean.
My notes:
Exercise 8ae
Plastic sheets produced by a machine are constantly monitored for possible fluctuations in thickness
(measured in millimeters, mm). If the true variance in thicknesses exceeds 2.25 square millimeters, there is
cause for concern about product quality. The production process continues while the variance seems smaller
than the cutoff. Thickness measurements for a simple random sample of 10 sheets produced in a particular
shift were taken, giving the following results:
(226, 226, 227, 226, 225, 228, 225, 226, 229, 227)
Test, at the 5% significance level, the hypothesis that the population variance is smaller than 2.25mm².
Suppose that thickness is normally distributed. Calculate the type II error β(2), find the general expression of
β(σ2) and plot the power function.
(Based on an exercise of: Statistics for Business and Economics, Newbold, P., W.L. Carlson and B.M. Thorne, Pearson.)
Discussion: The supposition of normality should be evaluated. This statistical problem requires us to study
the variance of a normal population; concretely, to apply a hypothesis test to see whether or not the
value considered as reasonable has been exceeded. For large samples, we are usually given some quantities already
calculated; here we are given the crude data, from which we can calculate any quantity. The hypothesis is
allocated at H1 for the production process to continue only when high quality sheets are being made (and for
the equality to be in H0).
The statistic and its value: for the variance of a normal population, under H₀,
T₀(X) = (n−1)S²/σ₀² ∼ χ²_{n−1} and T₀(x) = (10−1)·1.61 mm²/2.25 mm² = 6.44
where s² = 1.61 mm² is the sample quasivariance of the data.
Hypotheses and form of the critical region: H₀: σ² ≥ σ₀² = 2.25 and H₁: σ² < 2.25. Small values of the statistic support H₁, so the critical region has the form Rc = {T₀(x) ≤ χ²_{α;n−1}} = {T₀(x) ≤ 3.33}, where 3.33 = qchisq(0.05, 10-1).
Decision: Finally, it is necessary to check whether this region "suggested by H₀" is compatible with the value that
the data provide for the statistic. If they are not compatible, because the value seems extreme when the
hypothesis is true, we will trust the data and reject the hypothesis H₀.
Since T 0 ( x)=6.44 > 3.33 → T 0 ( x)∉Rc → H0 is not rejected.
The second methodology is based on the calculation of the p-value:
pV = P(X more rejecting than x | H₀ true) = P(T₀(X) < T₀(x)) = P(T₀ < 6.44) = 0.305
> pchisq(6.44, 10-1)
[1] 0.3047995
→ pV = 0.305 > 0.05 = α → H₀ is not rejected.
Type II error and power function: To calculate β, we have to work under H₁, that is, with T₁(X) = (n−1)S²/σ₁² ∼ χ²_{n−1}. Since the
critical region is expressed in terms of T₀, the mathematical trick of multiplying and dividing by the same quantity
is applied:
β(σ₁²) = P(T₀(X) ≥ 3.33 | H₁) = P((n−1)S²/σ₀² ≥ 3.33 | H₁) = P((n−1)S²/σ₁² ≥ 3.33·σ₀²/σ₁² | H₁) = P(T₁(X) ≥ 3.33·σ₀²/σ₁²)
For the particular value σ₁² = 2,
β(2) = P(T₁(X) ≥ 3.33·2.25/2) = P(T₁(X) ≥ 3.75) = 0.927
> 1 - pchisq(3.75, 10-1)
[1] 0.9270832
By using a computer, many other values σ₁² ≠ 2 can be considered so as to numerically determine the power of
the test 1−β(σ₁²) and to plot the power function
ϕ(σ²) = P(Reject H₀) = α(σ²) if σ² ∈ Θ₀ ; 1−β(σ²) if σ² ∈ Θ₁
# Sample and inference
n = 10
alpha = 0.05
theta0 = 2.25                                # value under the null hypothesis H0
q = qchisq(alpha, n-1)                       # critical value: reject H0 when T0 <= q
theta1 = seq(from=0, to=2.25, by=0.01)       # values of sigma^2 under H1
paramSpace = sort(unique(c(theta1, theta0)))
PowerFunction = pchisq(q*theta0/paramSpace, n-1)   # P(reject H0 | sigma^2)
plot(paramSpace, PowerFunction, xlab='Theta', ylab='Probability of rejecting theta0', main='Power Function', type='l')
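The value β(2) = 0.927 can also be checked by simulation (a sketch: the seed, the number of replications and the population mean are arbitrary, since the statistic does not depend on μ):
set.seed(1)
B = 10000
noRejections = replicate(B, {
  x = rnorm(10, mean=226.5, sd=sqrt(2))   # samples generated under H1 with sigma^2 = 2
  T0 = (10-1)*var(x)/2.25                 # var() is the sample quasivariance in R
  T0 > qchisq(0.05, 10-1)                 # TRUE when H0 is not rejected
})
mean(noRejections)                        # close to beta(2) = 0.927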
Conclusion: Since H₀ is not rejected at the 5% significance level, the data do not allow us to claim that the real
value of σ² is smaller than 2.25 mm²; that is, there is still cause for concern about the quality of the product. On
average, when H₀ is true the methodology wrongly rejects it only 5% of the time; however, since frequently we
do not know the true value of σ², we never know whether the decision is right or not.
My notes:
Exercise 9ae
If 132 of 200 male voters and 90 of 159 female voters favor a certain candidate running for governor, build a 99% confidence interval for the difference ηM − ηF between the proportions of male and female voters who favor the candidate.
Discussion: There are two independent Bernoulli populations whose proportions must be compared
(the populations would not be independent if, for example, males and females had been selected from the
same couples or families). The value 1 has been used to count the number of voters who favor the candidate.
The method of the pivot will be used.
The (approximate) pivot involves the standard error √(η̂M(1−η̂M)/nM + η̂F(1−η̂F)/nF), and it leads to the interval
I₁₋α = [ (η̂M − η̂F) − r_{α/2}·√(η̂M(1−η̂M)/nM + η̂F(1−η̂F)/nF) , (η̂M − η̂F) + r_{α/2}·√(η̂M(1−η̂M)/nM + η̂F(1−η̂F)/nF) ]
where r_{α/2} is the value of the standard normal distribution such that P(Z > r_{α/2}) = α/2.
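With the data of the statement, the interval can be computed in R (a minimal sketch of the calculation above):
nM = 200; nF = 159
etaM = 132/nM; etaF = 90/nF                      # sample proportions
r = qnorm(1 - 0.01/2)                            # r_{alpha/2} for 99% confidence
se = sqrt(etaM*(1-etaM)/nM + etaF*(1-etaF)/nF)   # standard error of the difference
(etaM - etaF) + c(-1, 1)*r*se                    # approximately (-0.03906, 0.22698)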
Conclusion: The case ηM = ηF cannot formally be excluded when the decision is made with 99%
confidence. Since each proportion lies in (0,1), the difference ηM − ηF lies in (−1,1), and any "reasonable"
estimate of it should be in this range; because of the natural uncertainty of the sampling process (randomness
and variability), in this case the smallest endpoint of the interval was −0.03906, so the value 0 cannot be
discarded. When an interval of high confidence is far from 0, the case ηM = ηF can clearly be rejected.
Finally, it is important to notice that a confidence interval can be used to make decisions about hypotheses
on the parameter values.
My notes:
Exercise 10ae
For two Bernoulli populations with the same parameter, prove that the pooled sample proportion is an
unbiased estimator of the population proportion. For two normal populations, prove that the pooled sample
variance is an unbiased estimator of the population variance.
Discussion: It is necessary to calculate the expectation of the pooled sample proportion by using its
expression and the basic properties of the mean. Alternatively, the most general pooled sample variance can be
used. For Bernoulli populations, the mean and the variance can be written as μ = η and σ² = η(1−η).
Mean of η̂p: This estimator can be used when ηX = η = ηY. On the other hand, E(η̂) = E(X̄) = η.
E(η̂p) = E( (nX·η̂X + nY·η̂Y)/(nX + nY) ) = (1/(nX+nY))·[nX·E(η̂X) + nY·E(η̂Y)] = (1/(nX+nY))·(nX + nY)·η = η
Then, the bias is b(η̂p) = E(η̂p) − η = η − η = 0.
Mean of S²p: This estimator can be used when σX² = σ² = σY². On the other hand, E(S²) = σ².
E(S²p) = E( ((nX−1)·SX² + (nY−1)·SY²)/(nX + nY − 2) ) = ((nX−1)·E(SX²) + (nY−1)·E(SY²))/(nX + nY − 2) = ((nX − 1 + nY − 1)/(nX + nY − 2))·σ² = σ²
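Both results can also be illustrated by simulation; the following sketch checks the unbiasedness of the pooled sample variance (all the numerical values are arbitrary):
set.seed(1)
nX = 5; nY = 8; sigma2 = 4
Sp2 = replicate(20000, {
  x = rnorm(nX, 0, sqrt(sigma2)); y = rnorm(nY, 10, sqrt(sigma2))
  ((nX-1)*var(x) + (nY-1)*var(y))/(nX + nY - 2)   # pooled sample variance
})
mean(Sp2)                                         # close to sigma2 = 4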
My notes:
Discussion: In calculating the minimum sample size, the only case we consider (in our subject) is that of
one normal population with known standard deviation. Thus, we can suppose that this is the distribution of X.
Sample information:
Theoretical (simple random) sample: X1,..., Xn s.r.s. (the time measurement of n rotations will be considered)
Margin of error:
We need the expression of the margin of error. If we do not remember it, we can apply the method of the pivot
to obtain it from the formula of the interval
I₁₋α = [ X̄ − r_{α/2}·√(σ²/n) , X̄ + r_{α/2}·√(σ²/n) ]
Either way, the margin of error (for one normal population with known variance) is
E = r_{α/2}·√(σ²/n)
Sample size
Method based on the confidence interval: We want the margin of error E to be smaller than or equal to the
given Eg,
Eg ≥ E = r_{α/2}·√(σ²/n) → Eg² ≥ r²_{α/2}·σ²/n → n ≥ (r_{α/2}·σ/Eg)² = (1.96 · 1.6 min / 0.50 min)² = 6.272² = 39.3 → n ≥ 40
since r_{α/2} = r_{0.05/2} = r_{0.025} = l_{0.975} = 1.96. (The inequality changes neither when multiplying or dividing
by positive quantities nor when squaring, since both sides are positive.)
Conclusion: At least n = 40 data are necessary to guarantee that the margin of error is 0.50 min at most. Any
number of data larger than n would guarantee—and go beyond—the precision desired. (This margin can be
thought of as "the maximum error in probability", in the sense that the distance or error |θ̂ − θ| will be
smaller than E with a probability of 1−α = 0.95, but larger with a probability of α = 0.05.)
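The calculation can be reproduced in R:
ceiling((qnorm(0.975)*1.6/0.50)^2)   # 40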
My notes:
Exercise 12ae
To estimate the average tree height of a forest, a simple random sample with 20 elements is considered, giving a sample mean of 14.70 u and a sample quasistandard deviation of 6.34 u. Build the 95% confidence interval for the mean height and find the margin of error.
Discussion: In this exercise, the supposition that the normal distribution reasonably explains the variable
height should be evaluated by using proper statistical techniques. To build the interval and find the margin of
error, the method of the pivotal quantity will be applied.
To apply this method, we need a statistic with known distribution, easy to manage and involving μ. From a
table of statistics (e.g. in [T]), we select
T(X; μ) = (X̄ − μ)/√(S²/n) ∼ t_{n−1}
where X = (X₁, X₂, …, Xₙ) is a simple random sample, S² is the sample quasivariance and t_κ denotes the t
distribution with κ degrees of freedom.
1−α = P(l_{α/2} ≤ T(X; μ) ≤ r_{α/2}) = P( −r_{α/2} ≤ (X̄−μ)/√(S²/n) ≤ +r_{α/2} )
= P( −r_{α/2}·√(S²/n) ≤ X̄−μ ≤ +r_{α/2}·√(S²/n) )
= P( −X̄−r_{α/2}·√(S²/n) ≤ −μ ≤ −X̄+r_{α/2}·√(S²/n) ) = P( X̄+r_{α/2}·√(S²/n) ≥ μ ≥ X̄−r_{α/2}·√(S²/n) )
(3) The interval:
I₁₋α = [ X̄ − r_{α/2}·√(S²/n) , X̄ + r_{α/2}·√(S²/n) ] = X̄ ∓ r_{α/2}·√(S²/n)
Note: We have simplified the notation, but it is important to notice that the quantities rα/2 and S depend on the sample size n.
To use this general formula with the specific data we have, the quantiles of the t distribution with κ = n–1 =
20–1 = 19 degrees of freedom are necessary
95% → 0.95 = 1–α → α = 0.05
In the table of the t distribution, we must search for the quantile provided for the probability p = 1−α/2 = 0.975 in
a lower-tail probability table, or p = α/2 = 0.025 in an upper-tail probability table; if a two-tailed table is used,
the quantile given for p = 1−α = 0.950 must be used. Whichever table is used, the quantile is 2.093. Finally,
I₀ = x̄ ∓ r_{0.05/2}·√(s²/20) = 14.70 u ∓ 2.093 · (6.34 u/√20) = 14.70 u ∓ 2.97 u = [11.73 u, 17.67 u]
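The same interval can be obtained from the summary statistics with R:
n = 20; xbar = 14.70; s = 6.34
xbar + c(-1, 1)*qt(0.975, n-1)*s/sqrt(n)   # approximately (11.73, 17.67)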
Conclusion: With 95% confidence we can say that the mean tree height is in the interval obtained. The
margin of error, which is expressed in the same unit of measure as the data, can be thought of as the maximum
distance—when the interval contains the true value—between the real unknown mean and the middle point of the
interval, that is, "the maximum error in probability".
My notes:
Some Reminders
● Markov's Inequality. Chebyshev's Inequality. For any (real) random variable X, any (real) function
h(x) taking nonnegative values, and any (real) positive a > 0,
E(h(X)) = ∫_Ω h(x)dP = ∫_{h(X)<a} h(x)dP + ∫_{h(X)≥a} h(x)dP ≥ ∫_{h(X)≥a} h(x)dP ≥ a·P(h(X) ≥ a)
so that P(h(X) ≥ a) ≤ E(h(X))/a (Markov's inequality). Taking h(x) = (x−μ)² and a = (kσ)², it follows that
P(|X−μ| ≥ kσ) ≤ 1/k² (Chebyshev's inequality). Interpretation of the case k = 2: the probability that X takes a value farther from the mean μ than twice
the standard deviation 2σ is 0.25 at most.
All these inequalities are true whichever the probability distribution of X, and the proof above is
based on bounding in a rough way. They are nonparametric or distribution-free inequalities. As a
consequence, it seems reasonable to expect that there will be "more powerful" inequalities either
when additional or stronger nonparametric results are used or when a parametric approach is
considered (for example, in calculating the minimum sample size necessary to guarantee a given
precision, we can also apply methods using statistics T based on asymptotic or parametric results).
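The roughness of the bound can be illustrated by simulation (a sketch with an arbitrary distribution, here the exponential with mean μ = 1 and standard deviation σ = 1):
set.seed(1)
x = rexp(100000, rate=1)
mean(abs(x - 1) >= 2)   # about 0.05, well below the Chebyshev bound 0.25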
● Generating Functions. (This section has been extracted from Probability and Random Processes.
Grimmett, G., and D. Stirzaker. Oxford University Press, 3rd ed.) In Probability, generating functions
are useful tools to work with—e.g. when convolutions or sums of independent variables are
considered. Let a = (a0, a1, a2,...) be a sequence. The simplest one is the (ordinary) generating function
of a, defined
G_a(t) = Σ_{i=0}^{∞} a_i·t^i , t ∈ ℝ for which the sum converges
The sequence may in principle be reconstructed from the function by setting a_j = G_a^{(j)}(0)/j!. This
function is especially useful when ai are probabilities. The exponential generating function of a is
G_a(t) = Σ_{j=0}^{∞} a_j·t^j/j! , t ∈ ℝ for which the sum converges
On the other hand, the probability generating function of a random variable X taking nonnegative
integer values is defined as
G(t) = E(t^X) , t ∈ ℝ for which there is convergence
(Some authors give a definition for z∈ℂ , and the radius of convergence is one at least) “There are
two major applications of probability generating functions: in calculating moments, and in calculating
the distributions of sums of independent variables.”
Theorem: E(X) = G′(1) and, more generally, E(X(X−1)⋯(X−k+1)) = G^{(k)}(1).
"Of course, G^{(k)}(1) is shorthand for lim_{s↑1} G^{(k)}(s) whenever the radius of convergence of G is 1."
Particularly, to calculate the first two raw moments:
E(X) = G^{(1)}(1)
E(X(X−1)) = E(X²) − E(X) = G^{(2)}(1) → E(X²) = G^{(2)}(1) + E(X) = G^{(2)}(1) + G^{(1)}(1)
“If you are more interested in the moments of X than in its mass function, you may prefer to work not
with G but with the function M” called moment generating function and defined by
M(t) = G(e^t) = E(e^{tX}) , t ∈ ℝ for which there is convergence
It is, under convergence, the exponential generating function of the moments E(X^k). It holds that
Theorem: E(X) = M′(0) and, more generally, E(X^k) = M^{(k)}(0).
Particularly, to calculate the first two raw moments,
E(X) = M^{(1)}(0)
E(X²) = M^{(2)}(0)
Finally, the characteristic function of a random variable X is defined by
φ(t) = E(e^{itX}) , t ∈ ℝ , i = √(−1)
Theorem: (a) If φ^{(k)}(0) exists, then E(|X^k|) < ∞ if k is even and E(|X^{k−1}|) < ∞ if k is odd.
(b) If E(|X^k|) < ∞, then φ^{(k)}(0) = i^k·E(X^k), so E(X^k) = φ^{(k)}(0)/i^k.
Then, to calculate the first two crude moments,
E(X) = φ^{(1)}(0)/i
E(X²) = φ^{(2)}(0)/i²
Existence: Techniques for series and integrals must be used to determine the values of t∈ℝ that
guarantee the convergence and hence the existence of the generating function.
When possible, we drop the subindex of the functions to simplify the notation. The reader can consult
the literature on Probability to see whether it is allowed to differentiate inside the series or the
integrals, which is equivalent to differentiating inside the expectation. On the other hand, there are other
generating functions in the literature: joint probability generating function, joint characteristic function,
cumulant generating function, et cetera.
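As a small numerical illustration of the theorem E(X) = G′(1)—using the Poisson probability generating function G(t) = e^{λ(t−1)} obtained later in this document—the derivative can be approximated by a finite difference:
lambda = 2.7
G = function(t) exp(lambda*(t-1))   # Poisson pgf
h = 1e-6
(G(1) - G(1-h))/h                   # approximates G'(1) = lambda = 2.7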
Discussion: Several distributions, discrete and continuous, are involved in this exercise. Different ways can
be considered to find the answers: the probability function f(x), the probability tables or a statistical software
program. Sometimes events need to be rewritten or decomposed. For discrete distributions, tables can contain
either individual {X=x} or cumulative {X≤x} (or {X>x}) probabilities; for continuous distributions, only
cumulative probabilities.
(a) The parameter value is λ = 2.7, and for the Poisson distribution the possible values are always 0, 1, 2... If
the table provides cumulative probabilities of the form P(X≤x),
P (1≤ X <3)=P ( X ≤2)− P( X ≤0)=⋯
If the table provides individual probabilities,
P (1≤ X <3)=P ( X =1)+ P( X =2)=0.1815+0.2450=0.4265
By using the mass function,
P(1 ≤ X < 3) = P(X=1) + P(X=2) = (2.7¹/1!)·e^{−2.7} + (2.7²/2!)·e^{−2.7} = 0.1814549 + 0.2449641 = 0.426419
Finally, by using the statistical software program R, whose function gives cumulative probabilities,
> ppois(2, 2.7) - ppois(0, 2.7)
[1] 0.426419
(b) The parameter values are κ = 11 and η = 0.3, so the possible values are 0, 1, 2,..., 11. If the table of the
binomial distribution gives individual probabilities P(X = x),
P ( X ≤2)=P ( X =0)+ P ( X =1)+ P ( X =2)=0.0198+ 0.0932+0.1998=0.3128
If cumulative probabilities were given in the table, the probability P ( X ≤2) would be provided directly. On
the other hand, the mass function can be used too,
P(X ≤ 2) = P(X=0) + P(X=1) + P(X=2) = C(11,0)·0.3⁰·(1−0.3)^{11−0} + C(11,1)·0.3¹·(1−0.3)^{11−1} + C(11,2)·0.3²·(1−0.3)^{11−2}
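With R, whose functions give individual and cumulative probabilities:
sum(dbinom(0:2, 11, 0.3))   # 0.3127 (the table value 0.3128 comes from rounded entries)
pbinom(2, 11, 0.3)          # the same, directly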
(c) The parameter value is κ = 6, so the possible values are 0, 1, 2,..., 6. This probability distribution is so
simple that no table is needed. Since the event can be decomposed into two disjoint elementary outcomes,
P(X ∈ {2, 5}) = P(X=2) + P(X=5) = 1/6 + 1/6 = 2/6 = 1/3
To plot the probability function
values = seq(1, 6)
probabilities = rep(1/6, length(values))
plot(values, probabilities, type='h', xlab='x', ylab='P(X=x)')
(d) The parameter values are κ1 = 2 and κ2 = 5, so the possible values are the real numbers in the interval [2,5]
(or with open endpoints, depending on the definition for the uniform distribution that you are considering). No
table is necessary for this distribution, and if we realize that 3.5 is the middle value between 2 and 5 no
calculation is needed either,
P ( X ≥3.5)=0.5
If not, we can use the density function,
P(X ≥ 3.5) = ∫_{3.5}^{5} 1/(5−2) dx = (1/3)·(5−3.5) = 1.5/3 = 0.5
To plot the density function
values = seq(2, 5, by=0.01)                  # fine grid over the support [2, 5]
probabilities = rep(1/(5-2), length(values))
plot(values, probabilities, type='l', xlab='x', ylab='f(x)')
Writing the event in terms of +1.7 is necessary when the table contains only positive quantiles. The
standardization can be applied before or after considering the complementary event. If we try solving the
integral of the density
f(x) = (1/√(2πσ²))·e^{−(x−μ)²/(2σ²)}
we should remember that the antiderivative of e^{−x²} cannot be expressed in terms of elementary functions, and that the definite integral of f(x) can be solved exactly only for some
limits of integration but it can always be solved numerically. On the other hand, by using the statistical
software program R, whose function contains cumulative probabilities for events of the form {X<x},
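for instance, for the standardized value +1.7 mentioned above:
pnorm(1.7)       # P(Z < 1.7) = 0.9554
1 - pnorm(1.7)   # P(Z > 1.7) = 0.0446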
(f) The parameter value is κ = 16. The set of possible values is always composed of all positive real numbers.
Most tables of the chi-square distribution provide the probability of events of the form P(X>x). In this case, it
is necessary to consider the complementary event before looking for the quantile:
P ( X ≤a)=0.025 ↔ P ( X >a)=1−0.025=0.975 → a = 6.91
We do not use the density function, as it is too complex. By using the statistical software program R, whose
function gives quantiles for events of the form {X<x},
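qchisq(0.025, 16)   # approximately 6.91, matching the table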
(g) Now the parameter value is κ = 27. A variable enjoying the t distribution can take any real value. Most
tables of this distribution provide the probability of events of the form P(X>x). In this case, it is not necessary
to rewrite the event:
P ( X > a)=0.1 → a = 1.314
The density function is too complex to be used. The statistical software program R allows doing this too (its
function provides quantiles for events of the form {X<x}):
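qt(1 - 0.1, 27)   # approximately 1.314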
(h) The parameter values for this F distribution are κ1 = 10 and κ2 = 8. The possible values are always all
positive real numbers. Again, most tables of this distribution provide the probability for events of the form
{X>x}, so:
P ( X >5.81)=0.01
The density function is also complex. Finally, by using the computer,
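qf(1 - 0.01, 10, 8)   # approximately 5.81
1 - pf(5.81, 10, 8)   # approximately 0.01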
(j) Since the parameter value is κ = 12, after decomposing the event into two disjoint tails
P ({ X ≤1.356 }∪{ X >3.055 })=P ({ X ≤1.356 })+P ({ X >3.055 })
=1− P ({ X > 1.356})+P ({ X >3.055 })=1−0.1+0.005=0.905
The density function is also complex. Finally,
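pt(1.356, 12) + (1 - pt(3.055, 12))   # approximately 0.905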
My notes:
Exercise 2pt
Weekly maintenance costs (measured in dollars, $) for a certain factory, recorded over a long period of time
and adjusted for inflation, tend to have an approximately normal distribution with an average of $420 and a
standard deviation of $30. If $450 is budgeted for next week, what is an approximate probability that this
budgeted figure will be exceeded?
(Taken from Mathematical Statistics with Applications. W. Mendenhall, D.D. Wackerly and R.L. Scheaffer. Duxbury Press)
Discussion: We need to extract the mathematical information from the statement. There is a quantity, the
weekly maintenance costs, say C, that can be assumed to follow the distribution
C ∼ N(μ = $420, σ = $30) or, in terms of the variance, C ∼ N(μ = $420, σ² = 30² $² = 900 $²)
(In practice, this supposition should be evaluated.) We are asked for the probability P(C > 450). Since C
does not follow a standard normal distribution, we standardize both sides of the inequality, by using
μ = E(C) = $420 and σ² = Var(C) = 900 $², to be able to use the table of the standard normal distribution:
P(C > 450) = P( (C−420)/30 > (450−420)/30 ) = P(Z > 1) = 1 − Φ(1) = 1 − 0.8413 = 0.1587
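The same probability with R, both directly and after standardizing:
1 - pnorm(450, mean=420, sd=30)   # 0.1587
1 - pnorm(1)                      # the same: P(Z > 1)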
My notes:
Discussion: Different methods can be applied to calculate the first two moments. We have practiced as
many of them as possible, both to learn as much as possible and to compare their difficulty; besides, some of
them are more powerful than others. Some of these calculations are advanced. To work with characteristic
functions, the definitions and rules of the analysis for complex functions of a real variable must be considered,
and even some calculations may be easier if we work with the theory for complex functions of a complex
variable. Most of these definitions and rules are “natural generalizations” of those of real analysis, but we
must be careful not to apply them without the necessary justification.
This function exists for any t. Now, the usual definitions and rules of the mathematical analysis for real
functions of a real variable imply that
E(X) = G^{(1)}(1) = [η]_{t=1} = η
E(X²) = G^{(2)}(1) + E(X) = [0]_{t=1} + η = η
This function exists for any real t. Because of the mathematical real analysis,
E(X) = M^{(1)}(0) = [η·e^t]_{t=0} = η
This complex function exists for any real t. Complex analysis is considered to do,
E(X) = φ^{(1)}(0)/i = [η·e^{it}·i]_{t=0}/i = ηi/i = η
This way can also be used to calculate the variance easily, but not to calculate the second moment:
σ² = Var(X) = Var( Σ_{i=1}^{κ} Y_i ) = Σ_{i=1}^{κ} Var(Y_i) = κ·η(1−η)
G(t) = E(t^X) = Σ_{x=0}^{κ} t^x·C(κ,x)·η^x·(1−η)^{κ−x} = (1−η)^κ·Σ_{x=0}^{κ} C(κ,x)·(ηt/(1−η))^x = [ (1−η)·(1 + ηt/(1−η)) ]^κ = (1−η+ηt)^κ
where the binomial theorem (see the appendixes of Mathematics) has been applied. Alternatively, this function
can also be calculated by looking at X as a sum of Bernoulli variables Yj and applying a property for
probability generating functions of a sum of independent random variables,
G(t) = [G_Y(t)]^κ = (1−η+ηt)^κ
This function exists for any t. Again, complex analysis allows us to do
E(X) = G^{(1)}(1) = [κ·(1−η+ηt)^{κ−1}·η]_{t=1} = κ·1^{κ−1}·η = κη
E(X²) = G^{(2)}(1) + E(X) = [κ(κ−1)·(1−η+ηt)^{κ−2}·η²]_{t=1} + κη = κ(κ−1)η² + κη = κη(κη−η+1)
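A quick numerical check of these two moments (the values κ = 11 and η = 0.3 are arbitrary):
kappa = 11; eta = 0.3
x = 0:kappa
sum(x*dbinom(x, kappa, eta))     # kappa*eta = 3.3
sum(x^2*dbinom(x, kappa, eta))   # kappa*eta*(kappa*eta - eta + 1) = 13.2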
M(t) = E(e^{tX}) = ⋯ = [ (1−η)·(1 + ηe^t/(1−η)) ]^κ = (1−η+ηe^t)^κ
Again, it is also possible to look at X as a sum of Bernoulli variables Yj and apply a property for moment
generating functions of a sum of independent random variables. Either way,
E(X) = M^{(1)}(0) = [κ·(1−η+ηe^t)^{κ−1}·ηe^t]_{t=0} = κη
E(X²) = M^{(2)}(0) = [κ(κ−1)·(1−η+ηe^t)^{κ−2}·(ηe^t)² + κ·(1−η+ηe^t)^{κ−1}·ηe^t]_{t=0} = κ(κ−1)η² + κη = κη(κη−η+1)
φ(t) = E(e^{itX}) = Σ_{x=0}^{κ} e^{itx}·C(κ,x)·η^x·(1−η)^{κ−x} = (1−η)^κ·Σ_{x=0}^{κ} C(κ,x)·(ηe^{it}/(1−η))^x = [ (1−η)·(1 + ηe^{it}/(1−η)) ]^κ = (1−η+ηe^{it})^κ
Once more, by looking at X as a sum of Bernoulli variables Yj and applying a property for characteristic
functions of a sum of independent random variables,
φ(t) = [φ_Y(t)]^κ = (1−η+ηe^{it})^κ
This complex function exists for any real t. Again, complex analysis is considered in doing
E(X²) = φ^{(2)}(0)/i² = [κ(κ−1)·(1−η+ηe^{it})^{κ−2}·(ηe^{it}i)² + κ·(1−η+ηe^{it})^{κ−1}·ηe^{it}i²]_{t=0}/i²
= (κ(κ−1)η²i² + κηi²)/i² = κη(κη−η+1)
As an example, I include a way to calculate E(X) that I found. To prove that any moment of order r is finite or,
equivalently, that the series is (absolutely) convergent, we apply the ratio test for nonnegative series. Then,
E(X) = Σ_{x=1}^{∞} x·η·(1−η)^{x−1} = η·[ Σ_{x=0}^{∞} (1−η)^x ]·[1+(1−η)+⋯] = η·[ Σ_{x=0}^{∞} (1−η)^x ]² = η·( 1/(1−(1−η)) )² = η/η² = 1/η
where the formula of the geometric sequence (see the appendixes of Mathematics) has been used.
Alternatively, μ can be calculated by applying the formula available in the literature for arithmetico-geometric
series.
E(X²) = G^{(2)}(1) + E(X) = [ η·2[1−(1−η)t]·(1−η) / [1−(1−η)t]⁴ ]_{t=1} + 1/η = [ 2η(1−η) / [1−(1−η)t]³ ]_{t=1} + 1/η
= 2(1−η)/η² + 1/η = (2−η)/η²
E(X²) = M^{(2)}(0) = [ (ηe^t·[1−(1−η)e^t]² − ηe^t·2[1−(1−η)e^t]·(−(1−η)e^t)) / [1−(1−η)e^t]⁴ ]_{t=0}
= [ ηe^t·[1−(1−η)e^t+2(1−η)e^t] / [1−(1−η)e^t]³ ]_{t=0} = [ ηe^t·[1+(1−η)e^t] / [1−(1−η)e^t]³ ]_{t=0} = η(2−η)/η³ = (2−η)/η²
E(X) = φ^{(1)}(0)/i = (1/i)·[ (ηe^{it}i·[1−(1−η)e^{it}] − ηe^{it}·(−(1−η)e^{it}i)) / [1−(1−η)e^{it}]² ]_{t=0}
= (1/i)·[ ηe^{it}i·[1−(1−η)e^{it}+(1−η)e^{it}] / [1−(1−η)e^{it}]² ]_{t=0} = (1/i)·[ ηe^{it}i / [1−(1−η)e^{it}]² ]_{t=0} = (1/i)·(ηi/η²) = 1/η
E(X²) = φ^{(2)}(0)/i² = (1/i²)·[ (ηe^{it}i²·[1−(1−η)e^{it}]² − ηe^{it}i²·2[1−(1−η)e^{it}]·(−(1−η)e^{it})) / [1−(1−η)e^{it}]⁴ ]_{t=0}
= (1/i²)·[ ηe^{it}i²·[1+(1−η)e^{it}] / [1−(1−η)e^{it}]³ ]_{t=0} = (1/i²)·(ηi²(2−η)/η³) = (2−η)/η²
σ² = Var(X) = E(X²) − E(X)² = (2−η)/η² − (1/η)² = (1−η)/η²
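A numerical check of these quantities (the value η = 0.4 is arbitrary; the series is truncated, which is harmless given its fast convergence):
eta = 0.4
x = 1:2000
sum(x*eta*(1-eta)^(x-1))     # E(X) = 1/eta = 2.5
sum(x^2*eta*(1-eta)^(x-1))   # E(X^2) = (2-eta)/eta^2 = 10
# variance: 10 - 2.5^2 = 3.75 = (1-eta)/eta^2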
Advanced theory: Additional way 1: In Cálculo de probabilidades I, by Vélez, R., and V. Hernández, UNED,
the first four moments are calculated as follows (I write the calculations for the first two moments, with the
notation we are using)
E(X) = Σ_{x=1}^{∞} x·η·(1−η)^{x−1} = η·Σ_{x=1}^{∞} x·(1−η)^{x−1} = η·(d/d(1−η))( Σ_{x=1}^{∞} (1−η)^x ) = η·(d/d(1−η))( (1−η)/(1−(1−η)) )
= η·( 1·[1−(1−η)] − (1−η)·(−1) ) / [1−(1−η)]² = η·(1/η²) = 1/η
E(X²) = Σ_{x=1}^{∞} x²·η·(1−η)^{x−1} = η·Σ_{x=1}^{∞} (x+1)x·(1−η)^{x−1} − η·Σ_{x=1}^{∞} x·(1−η)^{x−1}
= η·(d/d(1−η))( Σ_{x=1}^{∞} (x+1)·(1−η)^x ) − E(X) = η·(d/d(1−η))( [2(1−η)·[1−(1−η)] − (1−η)²·(−1)] / [1−(1−η)]² ) − E(X)
= η·(d/d(1−η))( [2(1−η) − 2(1−η)² + (1−η)²] / [1−(1−η)]² ) − E(X) = η·(d/d(1−η))( [2(1−η) − (1−η)²] / [1−(1−η)]² ) − E(X)
= η·( [2−2(1−η)]·[1−(1−η)]² − [2(1−η)−(1−η)²]·2[1−(1−η)]·(−1) ) / [1−(1−η)]⁴ − 1/η
= η·( 2[1−(1−η)]² + 2[2(1−η)−(1−η)²] ) / [1−(1−η)]³ − 1/η
= (2η² + 4(1−η) − 2(1−η)²)/η² − η/η² = (2η² + 4 − 4η − 2 − 2η² + 4η − η)/η² = (2−η)/η²
(We have already justified the convergence of the series involved.) Additional way 2: In trying to find a way
based on calculating the main part of the series by using an ordinary differential equation, as I had previously
done for the Poisson distribution (in the next section), I found the following way, which is essentially the same as
the additional way above. A series can be differentiated and integrated term by term inside the circle of
convergence (the radius of convergence was one, which includes all possible values of η). The expression of
the mean suggests the following definition for g(η):
E(X) = Σ_{x=1}^{∞} x·η·(1−η)^{x−1} = η·g(η) → g(η) = Σ_{x=1}^{∞} x·(1−η)^{x−1}
and it follows, since g is a well-behaved function of η, that
G(η) = ∫ g(η)dη = Σ_{x=1}^{∞} ∫ x·(1−η)^{x−1}dη = −Σ_{x=1}^{∞} (1−η)^x + c = −(1−η)/(1−(1−η)) + c = (η−1)/η + c
I spent some time searching for a differential equation... and I found this integral one. Now, by solving it,
g(η) = G′(η) = (η−(η−1))/η² + 0 = 1/η²
(This is a general method to calculate some infinite series.) Finally, the mean is
E(X) = η·g(η) = η·(1/η²) = 1/η
For the second moment, we define
E(X²) = Σ_{x=1}^{∞} x²·η·(1−η)^{x−1} = η·g(η) → g(η) = Σ_{x=1}^{∞} x²·(1−η)^{x−1}
and it follows that
G(η) = ∫ g(η)dη = Σ_{x=1}^{∞} x·∫ x·(1−η)^{x−1}dη = −Σ_{x=1}^{∞} x·(1−η)^x + c
= −((1−η)/η)·Σ_{x=1}^{∞} x·η·(1−η)^{x−1} + c = c + (η−1)/η²
Now, by solving this trivial integral equation,
g(η) = G′(η) = 0 + (η² − (η−1)·2η)/η⁴ = (η² − 2η² + 2η)/η⁴ = (2−η)/η³
Finally, the second moment is
E(X²) = η·g(η) = η·(2−η)/η³ = (2−η)/η²
lim_{x→∞} |a_{x+1}/a_x| = lim_{x→∞} | ((x+1)^r·λ^{x+1}·e^{−λ}/(x+1)!) / (x^r·λ^x·e^{−λ}/x!) | = lim_{x→∞} ((x+1)/x)^r · |λ|/(x+1) = 0 < 1
This implies that ∞ > Σ_{x=0}^{∞} x^r·(λ^x/x!)·e^{−λ} = e^{−λ}·Σ_{x=0}^{∞} x^r·λ^x/x!. Once the (absolute) convergence has been proved,
the rules of “the usual arithmetic for finite quantities” could be applied. Nevertheless, working with factorial
numbers in series makes it easy to prove the convergence but difficult to find the value.
lim_{x→∞} |a_{x+1}/a_x| = lim_{x→∞} | ((tλ)^{x+1}/(x+1)!) / ((tλ)^x/x!) | = lim_{x→∞} |tλ|/(x+1) = 0 < 1 .
Now, the definitions and rules of the mathematical analysis for real functions of a real variable,
E(X) = G^{(1)}(1) = [e^{λ(t−1)}·λ]_{t=1} = λ
E(X²) = M^{(2)}(0) = [e^{λ(e^t−1)}·(λe^t)² + e^{λ(e^t−1)}·λe^t]_{t=0} = [e^{λ(e^t−1)}·λe^t·(λe^t+1)]_{t=0} = λ(λ+1) = λ² + λ
lim_{x→∞} |a_{x+1}/a_x| = lim_{x→∞} | ((e^{it}λ)^{x+1}/(x+1)!) / ((e^{it}λ)^x/x!) | = lim_{x→∞} |e^{it}λ|/(x+1) = 0 < 1 .
The definitions and rules of the analysis for complex functions have been applied in the previous calculations
(they are similar to those for real functions of real variable). Now, by using the analysis for complex functions
of one real variable,
Advanced theory: Additional way 1: In finding ways, I found the following one. A series can be
differentiated and integrated term by term inside its circle of convergence. The limit calculated at the
beginning was the same for any λ, so the radius of convergence is infinite when the series is looked at as
a function of λ. The expression of the mean suggests the following definition for g(λ):
E(X) = Σ_{x=0}^{∞} x·(λ^x/x!)·e^{−λ} = e^{−λ}·g(λ) → g(λ) = Σ_{x=0}^{∞} x·λ^x/x!
and it follows, since g is a well-behaved function of λ, that
g′(λ) = Σ_{x=1}^{∞} x·λ^{x−1}/(x−1)! = Σ_{x=1}^{∞} (1+(x−1))·λ^{x−1}/(x−1)! = Σ_{x=1}^{∞} λ^{x−1}/(x−1)! + Σ_{x=1}^{∞} (x−1)·λ^{x−1}/(x−1)! = e^λ + g(λ)
Now, we solve the first-order ordinary differential equation g′(λ) − g(λ) = e^λ.
Homogeneous equation:
g′(λ) − g(λ) = 0 → dg/dλ = g → (1/g)dg = dλ → log(g) = λ + k → g_h(λ) = e^{λ+k} = c·e^λ
Particular solution: We apply, for example, the method of variation of parameters or constants. Substituting in
the equation g(λ) = c(λ)·e^λ and g′(λ) = c′(λ)·e^λ + c(λ)·e^λ,
c′(λ)·e^λ + c(λ)·e^λ − c(λ)·e^λ = e^λ → c′(λ) = 1 → c(λ) = λ → g_p(λ) = λ·e^λ
Any g(λ) given by the previous expression verifies the differential equation, so an additional condition is
necessary to determine the value of c. The initial definition implies that g(0) = 0, so c = 0. Finally, the mean is
E(X) = e^{−λ}·g(λ) = e^{−λ}·λ·e^λ = λ
(The same can be done to calculate some infinite series.) For the second moment, we define
E(X²) = Σ_{x=0}^{∞} x²·(λ^x/x!)·e^{−λ} = e^{−λ}·g(λ) → g(λ) = Σ_{x=0}^{∞} x²·λ^x/x!
and it follows, since g is a well-behaved function of λ, that
g′(λ) = Σ_{x=1}^{∞} x²·λ^{x−1}/(x−1)! = Σ_{x=1}^{∞} [1+(x−1)²+2(x−1)]·λ^{x−1}/(x−1)!
= Σ_{x=1}^{∞} λ^{x−1}/(x−1)! + Σ_{x=1}^{∞} (x−1)²·λ^{x−1}/(x−1)! + 2·Σ_{x=1}^{∞} (x−1)·λ^{x−1}/(x−1)! = e^λ + g(λ) + 2·λ·e^λ
(The expression of the expectation of X has been used in the last term.) Thus, the function we are looking for
verifies the first-order ordinary differential equation g′(λ) − g(λ) = e^λ·(1+2λ).
Homogeneous equation: This equation is the same, so g_h(λ) = e^{λ+k} = c·e^λ.
Particular solution: By applying the same method,
c′(λ)·e^λ + c(λ)·e^λ − c(λ)·e^λ = e^λ·(1+2λ) → c′(λ) = 1+2λ → c(λ) = λ+λ² → g_p(λ) = (λ+λ²)·e^λ
Again, the initial definition implies that g(0) = 0, so c = 0, and the second moment is E(X²) = e^{−λ}·(λ+λ²)·e^λ = λ² + λ, which coincides with the value obtained above.
E(X) = ∫₀^{+∞} x·λ·e^{−λx} dx = [−x·e^{−λx}]₀^{+∞} − ∫₀^{+∞} −e^{−λx} dx = [−x·e^{−λx} − (1/λ)·e^{−λx}]₀^{+∞} = (1/λ) − 0 = 1/λ
Where the formula ∫ u (x )⋅v ' (x )dx=u (x )⋅v (x )−∫ u ' (x)⋅v (x )dx of integration by parts has been applied
φ(t) = E(e^{itX}) = ∫₀^{+∞} e^{itx}·λ·e^{−λx} dx = λ·lim_{M→∞} [ e^{z(it−λ)}/(it−λ) ]_{z=0}^{z=M} = (λ/(it−λ))·lim_{M→∞} [e^{M(it−λ)} − 1] = (λ/(it−λ))·lim_{M→∞} [e^{−Mλ}·e^{iMt} − 1] = λ/(λ−it)
This function exists for any real t, since it−λ ≠ 0 (dividing by zero is not allowed). In the previous
calculation, the fact that the complex integrand is differentiable has been used to calculate the (line) complex integral
by using an antiderivative and the equivalent of Barrow's rule. Now, the definitions and rules of the
analysis for complex functions of a real variable must be considered to do
E(X) = φ^{(1)}(0)/i = (1/i)·[ −λ·(−i)/(λ−it)² ]_{t=0} = (1/i)·(λi/λ²) = 1/λ
E ( X )=∫−∞ x e 2σ
dx=∫−∞ (t+μ) e 2σ
dt
√2 π σ2 2
√ 2 π σ 2
2
t t
+∞ 1 − +∞ 1 − 2 2
=∫−∞ t e 2σ
dt + μ ∫ e 2σ
dt = 0+μ⋅1 = μ
√2 π σ 2 −∞
√2 π σ 2
E(X²) = ∫_{−∞}^{+∞} (t+μ)²·(1/√(2πσ²))·e^{−t²/(2σ²)} dt = ∫_{−∞}^{+∞} t²·(1/√(2πσ²))·e^{−t²/(2σ²)} dt + μ²·1 + 2μ·0
= (1/√(2πσ²))·σ²·√(2πσ²) + μ² = σ² + μ²
since
∫_{−∞}^{+∞} t²·e^{−t²/(2σ²)} dt = ∫_{−∞}^{+∞} t·(t·e^{−t²/(2σ²)}) dt = [−t·σ²·e^{−t²/(2σ²)}]_{−∞}^{+∞} + σ²·∫_{−∞}^{+∞} e^{−t²/(2σ²)} dt = (0−0) + σ²·√(2σ²)·∫_{−∞}^{+∞} e^{−u²} du = σ²·√(2πσ²)
Firstly, we have applied integration by parts with
• u = t → u′ = 1
• v′ = t·e^{−t²/(2σ²)} → v = ∫ t·e^{−t²/(2σ²)} dt = −σ²·e^{−t²/(2σ²)}
(Again, the function e^x changes faster than x^k, for any k.) Then, we have applied the change
t/√(2σ²) = u → t = u·√(2σ²) → dt = du·√(2σ²)
and the well-known result ∫_{−∞}^{+∞} e^{−x²} dx = √π (see the appendix of Mathematics). On the other hand, these
integrals converge for any real t.
M(t) = E(e^{tX}) = (1/√(2πσ²))·∫_{−∞}^{+∞} e^{xt}·e^{−(x−μ)²/(2σ²)} dx = e^{(1/2)t(2μ+σ²t)}
since
∫_{−∞}^{+∞} e^{xt}·e^{−(x−μ)²/(2σ²)} dx = ∫_{−∞}^{+∞} e^{−(1/(2σ²))[−2σ²tx+x²+μ²−2μx]} dx = ∫_{−∞}^{+∞} e^{−(1/(2σ²)){x²−2x[σ²t+μ]+μ²}} dx
= ∫_{−∞}^{+∞} e^{−(1/(2σ²)){(x−[σ²t+μ])²−[σ²t+μ]²+μ²}} dx = e^{−(1/(2σ²)){μ²−[σ²t+μ]²}}·∫_{−∞}^{+∞} e^{−((x−[σ²t+μ])/√(2σ²))²} dx
= e^{−(1/(2σ²))(μ−[σ²t+μ])(μ+[σ²t+μ])}·∫_{−∞}^{+∞} e^{−u²}·√(2σ²) du = e^{−(1/(2σ²))[−σ²t][2μ+σ²t]}·√(2πσ²) = e^{(1/2)t(2μ+σ²t)}·√(2πσ²)
This function exists for any real t. Now,
E(X) = M^{(1)}(0) = [ e^{(1/2)t(2μ+σ²t)}·(1/2)·(2μ+2σ²t) ]_{t=0} = [ e^{(1/2)t(2μ+σ²t)}·(μ+σ²t) ]_{t=0} = μ
For the characteristic function, the improper integral is understood as
∫_{−∞}^{+∞} e^{itx}·e^{−(x−μ)²/(2σ²)} dx = lim_{M→∞} ∫_{−M}^{+M} e^{itx}·e^{−(x−μ)²/(2σ²)} dx
Because of the rules of complex analysis, these calculations are similar—but based on new definitions and
properties—to those of previous sections. What is much different is the way of solving the integral. Now we
cannot find an antiderivative of the integrand—as we did for the exponential distribution—and therefore we
must think of calculating the integral by considering a contour containing the points
{x−μ−iσ²t, −M ≤ x ≤ +M}. The integral of a complex function is null for any closed contour within the
domain in which the function is differentiable. We consider the rectangular contour made of this segment, the parallel real segment, and the two vertical segments joining their endpoints.
We are interested in the limit when M increases. For the first (vertical) integral,
| ∫₀^{tσ²} e^{−(1/(2σ²))(M−μ)²}·e^{(1/(2σ²))(γ−tσ²)²}·e^{−(1/(2σ²))[i·2(M−μ)(γ−tσ²)]} dγ | ≤ ∫₀^{tσ²} e^{−(1/(2σ²))(M−μ)²}·e^{(1/(2σ²))(γ−tσ²)²} dγ = e^{−(1/(2σ²))(M−μ)²}·∫₀^{tσ²} e^{(1/(2σ²))(γ−tσ²)²} dγ →_{M→∞} 0
since |e^{ic}| = |cos(c)+i·sin(c)| = 1, ∀c ∈ ℝ, and the last integral is finite (the integrand is a continuous
function and the interval of integration is compact) and does not depend on M. For the second integral,
∫_{−M}^{+M} e^{−(1/(2σ²))(γ−μ)²} dγ = ∫_{(−M−μ)/√(2σ²)}^{(+M−μ)/√(2σ²)} e^{−u²}·√(2σ²) du →_{M→∞} √(2σ²)·∫_{−∞}^{+∞} e^{−u²} du = √(2πσ²)
where the change
(γ−μ)/√(2σ²) = u → γ = u·√(2σ²)+μ → dγ = du·√(2σ²) and (−M−μ)/√(2σ²) ≤ (γ−μ)/√(2σ²) ≤ (+M−μ)/√(2σ²)
has been applied. Finally, for the third (vertical) integral,
| ∫₀^{tσ²} e^{−(1/(2σ²))(M+μ)²}·e^{(1/(2σ²))γ²}·e^{−(1/(2σ²))i·2(M+μ)γ} dγ | ≤ e^{−(1/(2σ²))(M+μ)²}·∫₀^{tσ²} e^{(1/(2σ²))γ²} dγ →_{M→∞} 0
Again, the last integral is finite and does not depend on M. In short,
φ(t) = (1/√(2πσ²))·∫_{−∞}^{+∞} e^{itx−(x−μ)²/(2σ²)} dx = (1/√(2πσ²))·lim_{M→∞} ∫_{−M}^{+M} e^{itx−(x−μ)²/(2σ²)} dx
= e^{(1/2)it[2μ+σ²it]}·(1/√(2πσ²))·lim_{M→∞} ∫_{−M}^{+M} e^{−(1/(2σ²))(x−μ−iσ²t)²} dx = e^{(1/2)it[2μ+σ²it]}·(√(2πσ²)/√(2πσ²)) = e^{(1/2)it[2μ+σ²it]}
This function exists for any real t. (The reader can notice that the correct way is slightly longer.) Now,
E(X) = φ^{(1)}(0)/i = (1/i)·[ e^{(1/2)it[2μ+σ²it]}·(1/2)·i·(2μ+2σ²it) ]_{t=0} = (1/i)·[ e^{(1/2)it[2μ+σ²it]}·i·(μ+iσ²t) ]_{t=0} = iμ/i = μ
Conclusion: To calculate the moments of a probability distribution, different methods can be considered,
some of them considerably more difficult than others. The characteristic function is a complex function of a real
variable, which requires theoretical justifications from complex analysis that we must be aware of.
My notes:
[Ap] Mathematics
Remark 1m: The exponential function e^x changes faster than any monomial x^k, for any k.
Remark 2m: In complex analysis, there are frequently definitions and properties analogous to those of real analysis. Nevertheless,
one must take care before applying them.
Remark 3m: Theoretically, quantities like proportions (sometimes expressed in per cent), rates, statistics, etc., are dimensionless. To
interpret a numerical quantity, it is necessary to know the framework in which it is being used. For example, 0.98% and 0.98%² are
different: the second must be interpreted as √(0.98%²) = 0.99%. Thus, to track how such quantities are transformed, the use of a symbol may
be useful.
Remark 4m: In working with expressions—equations, inequations, sums, limits, integrals, etc.—special attention must be paid
when 0 or ∞ appears. For example, even if two limits (series, integrals, etc.) do not exist, their summation (difference, quotient,
product, etc.) may exist:
lim_{n→∞} n³ = ∞ and lim_{n→∞} n⁴ = ∞, but lim_{n→∞} n³/n⁴ = 0; or ∫₁^{∞} (1/x) dx does not exist while ∫₁^{∞} (1/x)·(1/x) dx does.
On the other hand, many paradoxes (e.g. Zeno's ones) are based on some wrong step (here, cancelling the factor 0 or ∞):
0 = 0 ↔ 0·2 = 0·3, yet 2 ≠ 3; and ∞ = ∞ ↔ ∞·2 = ∞·3, yet 2 ≠ 3
Readers of advanced sections may want to check some theoretical details related to the following items (the
very basic theory is not itemized).
Some Reminders
Real Analysis
For real functions of one or several real variables.
● Binomial Theorem.
(x+y)^n = Σ_{j=0}^{n} C(n,j)·x^j·y^{n−j} or, equivalently, (x+y)^n = Σ_{j=0}^{n} C(n,j)·x^{n−j}·y^j
● Limits: infinitesimal and infinite quantities.
● Integration: methods (integration by substitution, integration by parts, etc.), Fubini's theorem, line
integral.
● Series: convergence, criteria of convergence, radius of convergence, differentiability and integrability,
Taylor series, representation of the exponential function, power series. Concretely, when the criterion
of the quotient is applied to study the convergence, the radius of convergence is defined as:
lim_{m→∞} |a_{m+1}/a_m| = lim_{m→∞} |c_{m+1}·x^{m+1}|/|c_m·x^m| = |x|·lim_{m→∞} |c_{m+1}|/|c_m| < 1 → |x| < lim_{m→∞} |c_m|/|c_{m+1}| = r
(Similarly for the criterion of the root.)
Complex Analysis
For complex functions of one real variable.
● Limits: definitions and basic properties.
● Differentiation: definitions and basic properties.
● Integration: definitions and basic properties, antiderivatives and Barrow's rule.
Limits
Frequently, we need to deal with limits of sequences and functions. For sequences, any variable or index (say
n) and the quantity of interest (say Q) can take values in a countable set of discrete positive values, even for
multidimensional situations: the countable product of countable sets is a countable set. Calculations are easier
when there is some monotony—since "the small steps determine the whole way"—or symmetry. For example, the
summation and the product increase when any term increases, or both, while the difference and the quotient
may increase or decrease depending on which term increases by one unit, since the two terms do not affect
the total expression in the same direction.
Techniques
In calculating limits, firstly we try to mentally substitute the value of the variable in the sequence or
function. This is frequently enough to solve it, although we can do some formal calculations (especially if we
are not totally sure about the value). When the previous substitution leads to one of the following cases
∞−∞ , ∞·0 , ∞/∞ , 0/0 , 1^∞ , ∞⁰ , and 0⁰
we talk about indeterminate forms (we have not written possible variations of the signs or positions, e.g. 0·∞,
–∞+∞, or –0/0). The value depends on the particular case, since one term can be "faster" than the other in
tending to its limit.
Limits in Statistics
Since the sample sizes of populations are positive integer numbers, in Statistics we have to deal with limits of
sequences frequently.
One-Variable Limits: The variable n takes values in ℕ. For this variable, there is a unique natural way for
n to tend to infinite by increasing it one unit at a time. There is a total order in the set ℕ, which is countable.
In Statistics, we are usually interested only in any possible nondecreasing sequences of values for n, which
can be seen as a possible sequence of schemes where more and more data are added.
Two-Variable Limits: A pair of values (nX, nY) can be seen as a point in ℕ x ℕ . There are infinite ways for
nX and nY to tend to infinite by increasing any of them, or both, one unit at a time. There is not a total order in
the product space ℕ x ℕ , though it is still a countable set. Again, in Statistics we are usually interested only
in any possible nondecreasing sequence of pairs of values (nX,nY), which can be seen as a sequence of schemes
where more and more data are added.
In this document, we have to work with easy limits or indeterminate forms like ∞/∞ involving
polynomials. For the latter type of limit, we look at the terms with the highest exponents and we multiply and
divide the quotient by the proper monomial so as to identify the negligible terms, which formally can be
seen as the use of infinites. We will also mention other techniques.
One-Variable Limits: Any possible sequence of values for the sample size, say n(k), can be seen as a
subsequence of the most complete set ℕ of possible values n(k) = k. We are specially interested in
nondecreasing sequences of values n(k) ≤ n(k+1).
The evaluation of any one-dimensional quantity at a subsequence, Q(n(k)), can be seen as a subsequence of
Q(k). If this sequence converges, any subsequence like that must converge. The opposite is not true, since we
can find nonconvergent Q(k) with a convergent subsequence Q(n(k)). The following result can be found in
literature.
Theorem
For a real function f of a real variable x, defined on ℝ̄ = ℝ∪{∞}, if a is an accumulation point the
following conditions are equivalent:
(i) limx→a f(x) = L
(ii) For any sequence (in the domain) such that limk→∞ x(k) = a, it holds that limk→∞ f(x(k)) = L
A sequence is a particular case of real function of a real variable, and ∞ is an accumulation point in ℝ̄.
Two-Variable Limits: Any possible sequence of values (nX(k),nY(k)) can be seen as a path s(k) in the most
complete set ℕ x ℕ of possible values (k1,k2). Again, we are specially interested in nondecreasing sequences
of values nX(k) ≤ nX(k+1) and nY(k) ≤ nY(k+1).
Exercise 1m (*)
Prove that
∫_{−∞}^{+∞} e^{−x²} dx = √π , ∫_{−∞}^{+∞} e^{−ax²} dx = √(π/a) , and ∫₀^{+∞} e^{−x²} dx = √π/2
Discussion: The integrand is a continuous function. We remember that e^{−x²} has no elementary antiderivative but it is
still possible to calculate definite integrals for some domains. As regards the limits of integration, the domain
is infinite and we must deal with improper integrals.
(a) Finiteness: Firstly, we prove that the integral is finite, so as not to be working with the equality of two infinite
quantities (something "really dangerous").
∫_{−∞}^{+∞} e^{−x²} dx = ∫_{|x|<1} e^{−x²} dx + ∫_{|x|≥1} e^{−x²} dx ≤ ∫_{−1}^{+1} 1 dx + 2·∫_{+1}^{∞} e^{−x} dx = 2 + 2·[−e^{−x}]_{x=1}^{∞} = 2 + 2e^{−1} < ∞
since
• If 0 ≤ |x| < 1 then 0 ≤ x² < 1 and e⁰ ≤ e^{x²} < e¹, and hence 1 = e⁰ ≥ e^{−x²} > e^{−1}
• If |x| ≥ 1 then x² ≥ |x|, and hence e^{−x²} ≤ e^{−|x|}
• For an even function, the integral between −k and +k is twice the integral between 0 and +k.
(b) Jump to a two-dimensional space:
I² = I·I = ∫_{−∞}^{+∞} e^{−x²} dx · ∫_{−∞}^{+∞} e^{−y²} dy = ∫_{−∞}^{+∞}∫_{−∞}^{+∞} e^{−(x²+y²)} dx dy = ∫₀^{+∞}∫₀^{2π} e^{−[ρ²cos(θ)²+ρ²sin(θ)²]}·ρ dθ dρ
= ∫₀^{+∞}∫₀^{2π} e^{−ρ²}·ρ dθ dρ = 2π·∫₀^{+∞} e^{−ρ²}·ρ dρ = 2π·[−e^{−ρ²}/2]₀^{∞} = π
where the change to polar coordinates has Jacobian
|J| = | ∂x/∂ρ ∂x/∂θ ; ∂y/∂ρ ∂y/∂θ | = | cos(θ) −ρ·sin(θ) ; sin(θ) ρ·cos(θ) | = ρ·cos(θ)² + ρ·sin(θ)² = ρ
Come back to a one-dimensional space: Finally,
I = √π , that is, ∫_{−∞}^{+∞} e^{−x²} dx = √π
(c) On the other hand, since f(x) = e^{−x²} = e^{−(−x)²} = f(−x) is an even function,
∫₀^{+∞} e^{−x²} dx = (1/2)·∫_{−∞}^{+∞} e^{−x²} dx = √π/2
An alternative proof uses the gamma function Γ(p) = ∫₀^{+∞} e^{−x}·x^{p−1} dx and the fact that Γ(1/2) = √π. Now,
by applying the change of variable x² = t, for t ≥ 0, which implies that x = √t and hence dx = dt/(2√t),
∫₀^{+∞} e^{−x²} dx = (1/2)·∫₀^{+∞} e^{−t}·t^{−1/2} dt = (1/2)·Γ(1/2) = √π/2 .
Conclusion: To be allowed to apply the version of Fubini's theorem for improper integrals, the finiteness
of the first integral has been proved first. The integral of section (a) is used to calculate the others,
respectively by applying a change of variables and by considering the even character of the integrand.
About the proof based on the multiple integration: Proof by Siméon Denis Poisson (1781–1840), according to El omnipresente
número π, Zhúkov, A.V., URSS. I had found this proof in many books, including the previous reference (for the integral in section b
with a=1/2). I have written the bound of the integral. About the proof based on the gamma function: I have found this proof in
Problemas de oposiciones: Matemáticas (Vol. 6), De Diego y otros, Editorial Deimos. In this textbook, the integral in section c is
solved by using the two approaches.
My notes:
Discussion: We have to study several limits. Firstly, we try to substitute the value to which the variable
tends in the expression of the quantity in the limit. If we are lucky, the value is found and the formal
calculations are done later; if not, techniques to solve the indeterminate forms must be applied.
(1) lim_{n→∞} (a_k·n^k + a_{k−1}·n^{k−1} + ⋯ + a₁·n + a₀), where the a_j are constants
Way 0: Intuitively, the term with the largest exponent leads the growth when n tends to infinite. Then,
lim_{n→∞} (a_k·n^k + a_{k−1}·n^{k−1} + ⋯ + a₁·n + a₀) = −∞ if a_k < 0 ; +∞ if a_k > 0
(2) lim_{n→∞} 1/(n+c), where c is a constant
Way 0: Intuitively, the denominator tends to infinite while the numerator does not. (For huge n, the value of c
is negligible.) Then, the limit is zero.
Way 1: Formally, we divide the numerator and the denominator (all their terms) by n.
lim_{n→∞} 1/(n+c) = lim_{n→∞} (n⁻¹·1)/(n⁻¹·(n+c)) = lim_{n→∞} (1/n)/(1+c/n) = 0
Necessity: lim_{n→∞} 1/(n+c) = 0 requires that n → ∞.
If not, that is, if ∃M>0 such that n < M < ∞, then 1/(n+c) > 1/(M+c) > 0 and the limit could not be zero.
(3) lim_{n→∞} (an+b)/(cn+d), where a, b, c and d are constants
Way 0: (This limit includes the previous one.) The quotient is an indeterminate form. Intuitively, the numerator
increases like an and the denominator like cn. (The terms b and d are negligible for huge n.) Then, the
quotient tends to a/c.
Way 1: Formally, we divide the numerator and the denominator (all their terms) by n.
lim_{n→∞} (an+b)/(cn+d) = lim_{n→∞} (n⁻¹·(an+b))/(n⁻¹·(cn+d)) = lim_{n→∞} (a+b/n)/(c+d/n) = a/c
Necessity: lim_{n→∞} (an+b)/(cn+d) = a/c requires that n → ∞.
If not, that is, if ∃M>0 such that n < M < ∞, then
| (an+b)/(cn+d) − a/c | = | (acn+bc−acn−ad)/(c(cn+d)) | ≥ |bc−ad| / (|c|·(|c|M+|d|)) > 0
and the limit could not be a/c... unless the original quotient was always equal to this value. Notice that when
the previous numerator is zero,
ad = bc ↔ a/c = λ = b/d ↔ { a = λc , b = λd } ↔ (an+b)/(cn+d) = λ(cn+d)/(cn+d) = λ = a/c ↔ (an+b)/(cn+d) − a/c = 0
that is, in this case the function is really a constant. In the initial statement, the condition |a b; c d| ≠ 0 could
have been added for the polynomials an+b and cn+d to be independent.
(4) lim_{n→∞} (a·n^{k₁} + b(n))/(c·n^{k₂} + d(n)), where a and c are constants and b(n) and d(n) are polynomials whose degrees are
smaller than k₁ and k₂, respectively
Way 1: Formally, we divide the numerator and the denominator (all their terms) by the power of n with the
highest degree among all the terms in the quotient (if there were products, we should imagine how the
monomials are). For example, for the case k₁ < k₂,
lim_{n→∞} (a·n^{k₁}+b(n))/(c·n^{k₂}+d(n)) = lim_{n→∞} (n^{−k₂}·(a·n^{k₁}+b(n)))/(n^{−k₂}·(c·n^{k₂}+d(n))) = lim_{n→∞} ( (a + b(n)/n^{k₁})/n^{k₂−k₁} ) / ( c + d(n)/n^{k₂} ) = 0
Way 2: By using infinites, since b(n) and d(n) are negligible for huge n,
lim_{n→∞} (a·n^{k₁}+b(n))/(c·n^{k₂}+d(n)) = lim_{n→∞} (a·n^{k₁})/(c·n^{k₂}) = lim_{n→∞} (a/c)·n^{k₁−k₂} = { 0 if k₁ < k₂ ; a/c if k₁ = k₂ ; −∞ if k₁ > k₂ and a/c < 0 ; +∞ if k₁ > k₂ and a/c > 0 }
(5) lim_{n→∞} (a/n + b/n²)/(c/n³), where a, b and c are constants
Way 0: The quotient is an indeterminate form. Intuitively, the numerator decreases like a/n (the slowest term) and
the denominator like c/n³, so the denominator is smaller and smaller with respect to the numerator and, as a
consequence, the limit is −∞ or +∞ depending on whether a/c is negative or positive, respectively.
Way 1: Formally, it is always possible to multiply or divide the numerator and the denominator (all their
monomials, if they are summations, or any element, if they are products) by the power of n with the
appropriate exponent. Then we can do
lim_{n→∞} (a/n + b/n²)/(c/n³) = lim_{n→∞} (n³·(a/n + b/n²))/(n³·(c/n³)) = lim_{n→∞} (a·n² + b·n)/c = { −∞ if a/c < 0 ; +∞ if a/c > 0 }
or, by using infinites (b/n² is negligible against a/n),
lim_{n→∞} (a/n)/(c/n³) = lim_{n→∞} a·n²/c , with the same result.
Conclusion: We have studied the limits proposed. Some of them were almost trivial, while others involved
indeterminate forms like 0/0 or ∞/∞. All the cases were quotients of polynomials, so the limits of the former
form have been transformed into limits of the latter form. To solve these cases, the technique of multiplying
and dividing by the same quantity has sufficed (there are other techniques, e.g. L'Hôpital's rule).
Additional examples
lim_{n→∞} 1/(n−1) = 0 or lim_{n→∞} 1/(n−1) = lim_{n→∞} 1/n = 0
lim_{n→∞} (2/n² − 1/n) = 0
My notes:
Exercise 3m (*)
Study the following limits of sequences of two variables, among others:
(2) lim_{nX→∞,nY→∞} (nX·nY) and lim_{nX→∞,nY→∞} nX/nY
(3) lim_{nX→∞,nY→∞} (nX·nY)/nX and lim_{nX→∞,nY→∞} nX/(nX·nY)
Discussion: We have to study several limits of two-variable sequences. Firstly, we try to substitute the value
to which the variables tend in the expression of the quantity in the limit. If we are lucky, the value is found
and the formal calculations are done later; if not, techniques to solve the indeterminate forms must be applied.
These limits may be quite more difficult than those for one variable, since we need to prove that the value
does not depend on the particular way in which the sample sizes tend to infinite (if the limit exists or is infinite)
or find two ways such that different values are obtained (then the limit does not exist).
Way 0: Intuitively, the first limit is infinite while the second does not exist, since it depends on which variable
increases faster.
Way 1: For the first limit to be infinite, it is necessary and sufficient that one variable tends to infinite, say nX:
lim_{nX→∞,nY→∞} (nX + nY) > lim_{nX→∞} nX = ∞
For the second limit, it is enough to see that different values are obtained for different paths: s₁(k) = (k², k) and s₂(k) = (k, k).
(2) lim_{nX→∞,nY→∞} (nX·nY) and lim_{nX→∞,nY→∞} nX/nY
Way 0: Intuitively, the first limit is infinite while the second does not exist, since it depends on which variable
increases faster.
Way 1: For the first limit to be infinite, it is necessary and sufficient that one variable tends to infinite, say nX:
lim_{nX→∞,nY→∞} (nX·nY) > lim_{nX→∞} nX = ∞
For the second limit, it is enough to see that different values are obtained for different paths: s₁(k) = (k², k) and s₂(k) = (k, k),
lim_{s₁(k)} nX/nY = lim_{k→∞} k²/k = ∞ and lim_{s₂(k)} nX/nY = lim_{k→∞} k/k = 1
(3) lim_{nX→∞,nY→∞} (nX·nY)/nX and lim_{nX→∞,nY→∞} nX/(nX·nY)
Way 0: Even if the expression can be simplified, we use this case to show that the product of increasing terms
increases faster than any of its terms, and the new rate is the product of the two rates (the exponents are
added). The quotient is an indeterminate form. The first limit seems infinite and the second zero.
A product of increasing terms that are bigger than one increases faster than any of its terms. The second limit
can also be seen as the inverse of the first. The sufficiency and the necessity in these limits are determined by
the behaviour of nY: the first limit is infinite and the second is zero if and only if nY tends to infinite.
(4) lim_{nX→∞,nY→∞} ((nX+a)(nY+b))/(nX+c) and lim_{nX→∞,nY→∞} (nX+c)/((nX+a)(nY+b))
Way 0: The quotient is an indeterminate form. Intuitively, the product of increasing terms increases faster than
any of its terms, and the new rate is the product of the two rates (the exponents are added). The constants are
negligible when they are added to or subtracted from a power. The first limit seems infinite and the second
zero.
Way 1: Formally, we multiply the numerator and the denominator (all their monomials, if they are summations,
or any element, if they are products) by the product of the powers of nX and nY with the highest exponents
lim_{nX→∞,nY→∞} ((nX+a)(nY+b))/(nX+c) = lim_{nX→∞,nY→∞} (nX⁻¹·nY⁻¹·(nX+a)(nY+b))/(nX⁻¹·nY⁻¹·(nX+c)) = lim_{nX→∞,nY→∞} ((1+a/nX)(1+b/nY))/(1/nY + c/(nX·nY)) = ∞
The second limit can also be seen as the inverse of the first, by changing the letter of the constants, so we do
not repeat the calculations. The sufficiency and the necessity in these limits are determined by nY: the first limit
is infinite and the second is zero if and only if nY tends to infinite.
(5) lim_{nX→∞,nY→∞} ((1/nX)·(1/nY))/(1/nX) and lim_{nX→∞,nY→∞} (1/nX)/((1/nX)·(1/nY))
Way 0: Even if the expression can be simplified, we use this case to show that the product of decreasing terms
decreases faster than any of its terms, and the new rate is the product of the two (the exponents are added).
The quotient is an indeterminate form. The first limit seems zero and the second infinite.
A product of decreasing terms that are smaller than one decreases faster than any of its terms. The second
limit can also be seen as the inverse of the first. The sufficiency and the necessity in these limits are determined
by the behaviour of nY: the first limit is zero and the second is infinite if and only if nY tends to infinite.
(6) lim_{nX→∞,nY→∞} (nX+nY)/nX and lim_{nX→∞,nY→∞} nX/(nX+nY)
Way 0: Since
lim_{nX→∞,nY→∞} (nX+nY)/nX = lim_{nX→∞,nY→∞} (1 + nY/nX) = ? and lim_{nX→∞,nY→∞} nX/(nX+nY) = lim_{nX→∞,nY→∞} 1/(1 + nY/nX) = ?
and we have seen that the limits of the new quotients do not exist, it seems that none of the limits exists.
Formally, we could consider the same paths as we considered there. The second limit can also be seen as the
inverse of the first.
(7) lim_{nX→∞,nY→∞} (1/((nX+a)(nY+b)))/(1/(nX+c)) and lim_{nX→∞,nY→∞} (1/(nX+c))/(1/((nX+a)(nY+b)))
Way 1: Formally, we multiply the numerator and the denominator (all their monomials, if they are summations,
or any element, if they are products) by the product of the powers of nX and nY with the highest exponents
lim_{nX→∞,nY→∞} (1/((nX+a)(nY+b)))/(1/(nX+c)) = lim_{nX→∞,nY→∞} (nX+c)/((nX+a)(nY+b)) = 0
The second limit can also be seen as the inverse of the first, by changing the letter of the constants, so we do
not repeat the calculations. As regards the sufficiency and the necessity in these limits, it is determined by the
behaviour of nY: the first limit is zero and the second is infinite if and only if nY tends to infinite.
(8) lim_{nX→∞,nY→∞} ((1/nX)+(1/nY))/(1/nX) and lim_{nX→∞,nY→∞} (1/nX)/((1/nX)+(1/nY))
Way 0: The quotient is an indeterminate form. Intuitively, a sum of decreasing terms decreases like the
slowest one, while the other becomes negligible. Thus, the first limit would be one if the fastest is nY and infinite if the
fastest is nX; and, if both are equal, the limits are two and one over two, respectively. In short, it seems these
limits do not exist.
lim_{nX→∞,nY→∞} (1/nX)/((1/nX)+(1/nY)) = lim_{nX→∞,nY→∞} (nX·nY·(1/nX))/(nX·nY·((1/nX)+(1/nY))) = lim_{nX→∞,nY→∞} nY/(nY+nX) = ?
The second limit can also be seen as the inverse of the first.
(9) lim_{nX→∞,nY→∞} (nX+nY)/(nX·nY) and lim_{nX→∞,nY→∞} (nX·nY)/(nX+nY)
The limit appears in the variance of the estimators of σX²/σY². We solve it in two simple ways, although other
ways are considered as an "intellectual exercise."
Way 1: Since
lim_{nX→∞,nY→∞} (nX·nY)/(nX+nY) = lim_{nX→∞,nY→∞} 1/((1/nY)+(1/nX)) = ∞
and, for the first limit, lim_{nX→∞,nY→∞} (nX+nY)/(nX·nY) = lim_{nX→∞,nY→∞} ((1/nY)+(1/nX)) = 0+0 = 0.
It is sufficient and necessary that both variables tend to infinite. For the necessity, if nX < M < +∞ then
(nX+nY)/(nX·nY) = 1/nY + 1/nX > 1/M > 0
and the limit could not be zero.
Way 2: Firstly, let us suppose, without loss of generality, that nX ≤ nY. Then
0 ≤ lim_{nX→∞,nY→∞} (nX+nY)/(nX·nY) ≤ lim_{nX→∞,nY→∞} (2nY)/(nX·nY) = lim_{nX→∞} 2/nX = 0
(nY has been dropped from the numerator and the denominator; it is not that an iterated limit is being
calculated). Nonetheless, this solution does not consider those paths (for the sample sizes) that cross the
bisector line, that is, when none of the sizes is uniformly behind the other. To complete the proof it is enough
to use again the symmetry of the expression with respect to the two variables (it is the same if we switch
them): for any sequence of values for (nX,nY) crossing the bisector line, an equivalent sequence—in the sense
that the sequence takes the same values—either above or behind the bisector line can be considered by
looking at the bisector line as a mirror or a barrier.
Way 3: Polar coordinates can also be used to study this limit. For any sequence s(k) = (nX(k), nY(k)),
nX(k) = ρ(k)·cos[α(k)] and nY(k) = ρ(k)·sin[α(k)] , with 0 < ρ(k) < ∞ and 0 < α(k) < π/2 ,
where ρ(k) = √(nX(k)² + nY(k)²) and α(k) = arctg( nY(k)/nX(k) ).
A mathematical characterization of a sequence s(k) corresponding to sample sizes that tend to infinite can be
ρ(k) → ∞
in such a way that even when cos[α(k)] → 0 or sin[α(k)] → 0 the products nX(k) = ρ(k)·cos[α(k)]
and nY(k) = ρ(k)·sin[α(k)] still tend to infinite. Then, the limit is calculated as follows
lim_{k→∞} ρ(k)·(cos[α(k)]+sin[α(k)]) / (ρ(k)²·cos[α(k)]·sin[α(k)]) ≤ lim_{k→∞} 2/(ρ(k)·cos[α(k)]·sin[α(k)]) = 0
The only cases that could cause trouble would be those for which either the cosine or the sine tends to zero
(the other tends to one). Nevertheless, the characterization above shows that the denominator would still tend
to infinite. Finally, as regards the necessity, let us suppose, without loss of generality, that nX ≤ M < ∞. Then,
since ρ(k) → ∞ it must be cos[α(k)] → 0 in such a way that nX(k) = ρ(k)·cos[α(k)] ≤ M. As a
consequence,
lim_{k→∞} ρ(k)·(cos[α(k)]+sin[α(k)]) / (ρ(k)²·cos[α(k)]·sin[α(k)]) ≥ lim_{k→∞} (cos[α(k)]+sin[α(k)])/(M·sin[α(k)]) = (0+1)/M = 1/M > 0
(10) lim_{nX→∞,nY→∞} (nX−nY)/(nX·nY) and lim_{nX→∞,nY→∞} (nX·nY)/(nX−nY)
Way 0: Intuitively, the limit of the difference does not exist, since it takes different values that depend on the
path; but the difference—or the summation, in the previous section—is so much smaller than the product that the
first limit seems zero while the second seems infinite. Formally, we can do calculations as for the previous
limit, for example
lim_{nX→∞,nY→∞} (nX−nY)/(nX·nY) = lim_{nX→∞,nY→∞} nX/(nX·nY) − lim_{nX→∞,nY→∞} nY/(nX·nY) = lim_{nX→∞,nY→∞} 1/nY − lim_{nX→∞,nY→∞} 1/nX = 0−0 = 0
Conclusion: We have studied the limits proposed. Some of them were almost trivial, while others involved
indeterminate forms like 0/0 or ∞/∞. Most cases were quotients of polynomials, so the limits of the former
form have been transformed into limits of the latter form. To solve these cases, the technique of multiplying
and dividing by the same quantity has sufficed (there are other techniques, e.g. L'Hôpital's rule). Other
techniques have been applied too.
Additional Examples: Several limits have been solved in the exercises—look for limit in the final index.
My notes:
Exercise 4m (*)
For two positive integers nX and nY, find the (discrete) frontier and the two regions determined by the equality
2(nX+nY) = (nX−nY)²
Discussion: Both sides of the expression are symmetric with respect to the variables, meaning that they are
the same if the two variables are switched. This implies that the frontier we are looking for is symmetric with
respect to the bisector line. The square suggests a parabolic curve, while
2(nX+nY) = (nX−nY)² ↔ 2(1+nX·nY) = (nX−1)² + (nY−1)²
suggests a sort of transformation of a conic curve.
Intuitively, in the region around the bisector line, the difference of the variables is small and therefore
the right-hand side of the original equality is smaller than the left-hand side; obviously, the other region lies on
the other side of the (discrete) frontier.
Purely computational approach: In a previous exercise we wrote some brute-force lines for the
computer to plot the points of the frontier. Here we use the same code to plot the inner region (see the figures
below)
# Collect the points (nx, ny) in the inner region plus the frontier,
# that is, those where 2(nx + ny) >= (nx - ny)^2
N = 100
vectorNx = vector(mode="numeric", length=0)
vectorNy = vector(mode="numeric", length=0)
for (nx in 1:N)
{
  for (ny in 1:N)
  {
    if (2*(nx+ny) >= (nx-ny)^2) { vectorNx = c(vectorNx, nx); vectorNy = c(vectorNy, ny) }
  }
}
plot(vectorNx, vectorNy, xlim=c(0,N+1), ylim=c(0,N+1), xlab='nx', ylab='ny', main='Regions', type='p')
Algebraical-computational approach: Before using the computer, we can do some algebraical work:
$$n_X^2 + n_Y^2 - 2 n_X n_Y = 2 n_X + 2 n_Y \;\leftrightarrow\; n_Y^2 - 2(n_X + 1) n_Y + n_X(n_X - 2) = 0$$
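Seen as a quadratic equation in nY, the quadratic formula gives nY = (nX + 1) ± √(4nX + 1), since (nX + 1)² − nX(nX − 2) = 4nX + 1; hence the integer points of the frontier are those for which 4nX + 1 is a perfect square. A few lines of R reproduce this (a sketch of ours, complementing the brute-force code above):
# Frontier via the quadratic formula: ny = (nx + 1) +/- sqrt(4*nx + 1)
N = 100
nx = 1:N
disc = 4*nx + 1
ok = sqrt(disc) == round(sqrt(disc))   # perfect-square discriminant
nyPlus = (nx + 1) + sqrt(disc)
nyMinus = (nx + 1) - sqrt(disc)
cbind(nx, nyPlus, nyMinus)[ok, ]       # frontier points (discard any ny < 1)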
The change of variables u = nX − nY, v = nX + nY, that is, C1(a1, a2) = (a1 − a2, a1 + a2), has matrix
$\begin{pmatrix} 1 & -1 \\ 1 & 1 \end{pmatrix}$; this matrix reminds us of a rotation in the plane (although movements have orthonormal matrices and
the previous one is only orthogonal, with columns orthogonal but not unitary). Let us have a look at how a triangle, a rigid polygon, is transformed.
To confirm that C1 is a rotation plus a dilatation (homothetic transformation), or vice versa, we consider the
distances between points, the linearity, and a rotation of the axes. First, if
First, if
$$A = (a_1, a_2) \rightarrow \tilde{A} = (a_1 - a_2,\, a_1 + a_2), \qquad B = (b_1, b_2) \rightarrow \tilde{B} = (b_1 - b_2,\, b_1 + b_2),$$
then
$$d(\tilde{A}, \tilde{B}) = \sqrt{[(b_1 - b_2) - (a_1 - a_2)]^2 + [(b_1 + b_2) - (a_1 + a_2)]^2} = \sqrt{[(b_1 - a_1) - (b_2 - a_2)]^2 + [(b_1 - a_1) + (b_2 - a_2)]^2}$$
$$= \sqrt{2(b_1 - a_1)^2 + 2(b_2 - a_2)^2} = \sqrt{2}\, d(A, B),$$
so all distances are multiplied by the same factor √2. Second, as regards the linearity, the point
$$\lambda B + (1 - \lambda) A = (\lambda b_1 + (1 - \lambda) a_1,\, \lambda b_2 + (1 - \lambda) a_2)$$
determines the line containing A and B if λ ∈ℝ and the segment from A to B if λ ∈[0,1] . It is transformed
as follows
$$C_1(\lambda b_1 + (1-\lambda) a_1,\, \lambda b_2 + (1-\lambda) a_2) = (\lambda b_1 + (1-\lambda) a_1 - \lambda b_2 - (1-\lambda) a_2,\, \lambda b_1 + (1-\lambda) a_1 + \lambda b_2 + (1-\lambda) a_2)$$
$$= (\lambda (b_1 - b_2) + (1-\lambda)(a_1 - a_2),\, \lambda (b_1 + b_2) + (1-\lambda)(a_1 + a_2)) = \lambda\, C_1(b_1, b_2) + (1-\lambda)\, C_1(a_1, a_2)$$
(similarly for C2). This expression determines the line containing C1(A) and C1(B) if λ ∈ ℝ and the segment
from C1(A) to C1(B) if λ ∈ [0,1]. Third, as regards the rotation of axes, the following formulas are
general.
For a rotation of the axes through an angle α (rotation sinistrorsum),
$$\begin{cases} \vec{e}_1 = \cos\alpha\, \tilde{e}_1 + \sin\alpha\, \tilde{e}_2 \\ \vec{e}_2 = -\sin\alpha\, \tilde{e}_1 + \cos\alpha\, \tilde{e}_2 \end{cases}$$
For α = π/4,
$$\begin{pmatrix} \vec{e}_1 \\ \vec{e}_2 \end{pmatrix} = \begin{pmatrix} \cos\frac{\pi}{4} & \sin\frac{\pi}{4} \\ -\sin\frac{\pi}{4} & \cos\frac{\pi}{4} \end{pmatrix} \begin{pmatrix} \tilde{e}_1 \\ \tilde{e}_2 \end{pmatrix} = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \\ -1 & 1 \end{pmatrix} \begin{pmatrix} \tilde{e}_1 \\ \tilde{e}_2 \end{pmatrix}$$
Any point P = (x, y) is transformed through
$$\frac{1}{\sqrt{2}} \begin{pmatrix} 1 & -1 \\ 1 & 1 \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} = \frac{1}{\sqrt{2}} \begin{pmatrix} x - y \\ x + y \end{pmatrix} = \begin{pmatrix} u \\ v \end{pmatrix}.$$
The matrix $M = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & -1 \\ 1 & 1 \end{pmatrix}$ is orthogonal, which means that $M M^t = I = M^t M$ and implies that $M^{-1} = M^t$. Then
$$\begin{pmatrix} \tilde{e}_1 \\ \tilde{e}_2 \end{pmatrix} = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & -1 \\ 1 & 1 \end{pmatrix} \begin{pmatrix} \vec{e}_1 \\ \vec{e}_2 \end{pmatrix}.$$
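As a quick check (our own lines, not part of the original solution), we can verify numerically that M is orthogonal and that C1 multiplies distances by √2:
# M is orthogonal and C1 scales every distance by sqrt(2)
M = matrix(c(1, 1, -1, 1), nrow=2) / sqrt(2)   # columns (1,1) and (-1,1)
round(M %*% t(M), 10)                           # identity matrix
C1 = function(p) c(p[1] - p[2], p[1] + p[2])
A = c(1, 2); B = c(4, 6)
sqrt(sum((C1(B) - C1(A))^2)) / sqrt(sum((B - A)^2))   # equals sqrt(2)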
Conclusion: We have applied different approaches to study the frontier and the two regions determined by
the given equality. Fortunately, nowadays the computer allows us to do this work even without any deep
theoretical study (change of variable, transformation, et cetera).
My notes:
Basic Measures
$$\mu = E(X) = \sum_\Omega x_i\, f(x_i) \ \text{(discrete)}, \qquad \mu = E(X) = \int_\Omega x\, f(x)\, dx \ \text{(continuous)}$$
$$\sigma^2 = Var(X) = E([X - \mu]^2) = \cdots = E(X^2) - \mu^2$$
Basic Estimators
$$\bar{X} = \frac{1}{n} \sum_{i=1}^n X_i \qquad s^2 = \frac{1}{n} \sum_{i=1}^n (X_i - \bar{X})^2 = \cdots = \frac{1}{n} \sum_{i=1}^n X_i^2 - \bar{X}^2$$
$$S^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X})^2, \qquad n s^2 = (n-1) S^2, \qquad S_p^2 = \frac{n_X s_X^2 + n_Y s_Y^2}{n_X + n_Y - 2} = \frac{(n_X - 1) S_X^2 + (n_Y - 1) S_Y^2}{n_X + n_Y - 2}$$
$$V^2 = \frac{1}{n} \sum_{i=1}^n (X_i - \mu)^2 \qquad \hat{\eta} = \frac{\sum_{i=1}^n X_i}{n} \qquad \hat{\eta}_p = \frac{n_X \hat{\eta}_X + n_Y \hat{\eta}_Y}{n_X + n_Y}$$
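For illustration, these estimators can be computed in R for a simulated sample (the data are hypothetical; note that var() returns the quasivariance S², with denominator n − 1):
# Basic estimators for a simulated normal sample
set.seed(1)
x = rnorm(10, mean=5, sd=2)
n = length(x)
xbar = mean(x)               # sample mean
S2 = var(x)                  # sample quasivariance (denominator n-1)
s2 = (n - 1) * S2 / n        # sample variance (denominator n)
mu = 5                       # usable only when mu is known
V2 = mean((x - mu)^2)        # estimator V^2 for known mu
c(xbar=xbar, s2=s2, S2=S2, V2=V2)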
1 population                              2 populations
Parameter                  Estimator      Parameter                        Estimator
μ                          X̄              μX – μY                          X̄ – Ȳ
σ²  (μ known)              V²             σX²/σY²  (μX, μY known)          VX² / VY²
σ²  (μ unknown)            s² or S²       σX²/σY²  (μX, μY unknown)        sX²/sY² or SX²/SY²
η                          η̂              ηX – ηY                          η̂X – η̂Y
Parameter | Statistic (normal populations)

μ (σ² unknown): $T(X;\mu) = \dfrac{\bar{X} - \mu}{\sqrt{S^2/n}} \sim t_{n-1}$

σ² (μ known): $T(X;\sigma) = \dfrac{n V^2}{\sigma^2} \sim \chi^2_n$

σ² (μ unknown): $T(X;\sigma) = \dfrac{n s^2}{\sigma^2} = \dfrac{(n-1) S^2}{\sigma^2} \sim \chi^2_{n-1}$

μX–μY (σX², σY² known): $(\bar{X} - \bar{Y}) \sim N\!\left(\mu_X - \mu_Y,\ \sqrt{\dfrac{\sigma_X^2}{n_X} + \dfrac{\sigma_Y^2}{n_Y}}\right)$

μX–μY (σX², σY² unknown): $T(X,Y;\mu_X,\mu_Y) = \dfrac{(\bar{X} - \bar{Y}) - (\mu_X - \mu_Y)}{\sqrt{\dfrac{S_X^2}{n_X} + \dfrac{S_Y^2}{n_Y}}} \sim t_k$, where k is the closest integer to $\dfrac{(S_X^2/n_X + S_Y^2/n_Y)^2}{(S_X^2/n_X)^2/(n_X-1) + (S_Y^2/n_Y)^2/(n_Y-1)}$ (Welch's approximation)

σX²/σY² (μX, μY known): $T(X,Y;\sigma_X,\sigma_Y) = \dfrac{\frac{n_X V_X^2}{n_X \sigma_X^2}}{\frac{n_Y V_Y^2}{n_Y \sigma_Y^2}} = \dfrac{V_X^2/\sigma_X^2}{V_Y^2/\sigma_Y^2} = \dfrac{V_X^2 \sigma_Y^2}{V_Y^2 \sigma_X^2} \sim F_{n_X,\, n_Y}$

σX²/σY² (μX, μY unknown): $T(X,Y;\sigma_X,\sigma_Y) = \dfrac{\frac{(n_X-1) S_X^2}{(n_X-1)\sigma_X^2}}{\frac{(n_Y-1) S_Y^2}{(n_Y-1)\sigma_Y^2}} = \dfrac{S_X^2/\sigma_X^2}{S_Y^2/\sigma_Y^2} = \dfrac{S_X^2 \sigma_Y^2}{S_Y^2 \sigma_X^2} \sim F_{n_X-1,\, n_Y-1}$
1 population, large n
Parameter | Statistic

μ: $T(X;\mu) = \dfrac{\bar{X} - \mu}{\sqrt{?/n}} \xrightarrow{d} N(0,1)$, $\sum_{i=1}^n X_i \xrightarrow{d} N(n\mu,\ \sqrt{n \cdot ?})$, $\bar{X} \xrightarrow{d} N(\mu,\ \sqrt{?/n})$, where ? is substituted by σ², S² or s²

η: $T(X;\eta) = \dfrac{\hat{\eta} - \eta}{\sqrt{?(1-?)/n}} \xrightarrow{d} N(0,1)$, $\hat{\eta} \xrightarrow{d} N\!\left(\eta,\ \sqrt{\dfrac{?(1-?)}{n}}\right)$, where ? is substituted by η or η̂

2 populations, large nX and nY

μX–μY: $(\bar{X} - \bar{Y}) \xrightarrow{d} N\!\left(\mu_X - \mu_Y,\ \sqrt{\dfrac{?_X}{n_X} + \dfrac{?_Y}{n_Y}}\right)$

ηX–ηY: $T(X,Y;\eta_X,\eta_Y) = \dfrac{(\hat{\eta}_X - \hat{\eta}_Y) - (\eta_X - \eta_Y)}{\sqrt{\dfrac{?_X(1-?_X)}{n_X} + \dfrac{?_Y(1-?_Y)}{n_Y}}} \xrightarrow{d} N(0,1)$, where for each population ? is substituted by η or η̂
Remark 1T: For normal populations, the rules that govern the addition and subtraction imply that
$$\bar{X} \sim N\!\left(\mu_x,\ \sqrt{\sigma_x^2/n_x}\right), \qquad \bar{Y} \sim N\!\left(\mu_y,\ \sqrt{\sigma_y^2/n_y}\right), \qquad \text{and hence} \qquad \bar{X} \mp \bar{Y} \sim N\!\left(\mu_x \mp \mu_y,\ \sqrt{\frac{\sigma_x^2}{n_x} + \frac{\sigma_y^2}{n_y}}\right).$$
The tables include results combining these rules with a standardization or studentization. We are usually interested in comparing the means of the two
populations, for which the difference is considered; nevertheless, the addition can also be considered, with the same standard deviation
$$\sqrt{\frac{\sigma_X^2}{n_X} + \frac{\sigma_Y^2}{n_Y}}.$$
On the other hand, since the quality of estimators (e.g. measured through the mean square error) increases with the sample size, when the
parameters of the two populations are supposed to be equal the samples should be merged to estimate the parameter jointly (especially for small nx and
ny). Then, under the hypothesis σx = σy, the pooled sample quasivariance should be used through the statistic:
$$T(X,Y;\mu_X,\mu_Y) = \frac{(\bar{X} - \bar{Y}) - (\mu_X - \mu_Y)}{\sqrt{\dfrac{S_p^2}{n_X} + \dfrac{S_p^2}{n_Y}}} \sim t_{n_X + n_Y - 2}$$
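For illustration, the pooled statistic can be computed with a few lines of R (hypothetical data; the built-in t.test with var.equal=TRUE uses the same pooled quasivariance):
# Pooled two-sample t statistic under the hypothesis sigma_x = sigma_y
set.seed(2)
x = rnorm(8, 10, 2); y = rnorm(12, 10, 2)
nx = length(x); ny = length(y)
Sp2 = ((nx - 1)*var(x) + (ny - 1)*var(y)) / (nx + ny - 2)
T0 = (mean(x) - mean(y)) / sqrt(Sp2/nx + Sp2/ny)   # ~ t_{nx+ny-2} under H0
T0
t.test(x, y, var.equal=TRUE)$statistic              # same value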
Remark 2T: For any populations with finite mean and variance, one version of the Central Limit Theorem implies that
$$\bar{X} \xrightarrow{d} N\!\left(\mu_x,\ \sqrt{\sigma_X^2/n_x}\right), \qquad \bar{Y} \xrightarrow{d} N\!\left(\mu_y,\ \sqrt{\sigma_Y^2/n_y}\right), \qquad \text{and hence} \qquad \bar{X} \mp \bar{Y} \xrightarrow{d} N\!\left(\mu_x \mp \mu_y,\ \sqrt{\frac{\sigma_X^2}{n_x} + \frac{\sigma_Y^2}{n_y}}\right),$$
where the rules that govern the convergence (in distribution) of the addition (and subtraction) of sequences of random variables (see a text on
Probability Theory) and the rules that govern the addition and subtraction of normally distributed variables are applied. We are usually interested in
comparing the means of the two populations, for which the difference is considered; nevertheless, the addition can also be considered with
$$\frac{(\bar{X} \mp \bar{Y}) - (\mu_X \mp \mu_Y)}{\sqrt{\dfrac{?_x}{n_x} + \dfrac{?_y}{n_y}}} \xrightarrow{d} N(0,1) \qquad \text{and, for a Bernoulli population,} \qquad \frac{(\hat{\eta}_X \mp \hat{\eta}_Y) - (\eta_X \mp \eta_Y)}{\sqrt{\dfrac{?_X(1-?_X)}{n_X} + \dfrac{?_Y(1-?_Y)}{n_Y}}} \xrightarrow{d} N(0,1).$$
Besides, variances can be estimated when they are unknown. By applying theorems in section 2.2 of Approximation Theorems of Mathematical
Statistics, by R.J. Serfling, John Wiley & Sons, and sections 7.2 and 7.3 of Probability and Random Processes, by G. Grimmett and D. Stirzaker,
Oxford University Press,
$$\frac{\bar{X} - \mu}{\sqrt{S^2/n}} = \frac{1}{\sqrt{S^2/\sigma^2}} \cdot \frac{\bar{X} - \mu}{\sqrt{\sigma^2/n}} \xrightarrow{d} 1 \cdot N(0,1) = N(0,1) \qquad \text{and} \qquad \frac{\hat{\eta} - \eta}{\sqrt{\dfrac{\hat{\eta}(1-\hat{\eta})}{n}}} = \frac{1}{\sqrt{\dfrac{\hat{\eta}(1-\hat{\eta})}{\eta(1-\eta)}}} \cdot \frac{\hat{\eta} - \eta}{\sqrt{\dfrac{\eta(1-\eta)}{n}}} \xrightarrow{d} N(0,1).$$
d
Similarly for two populations. From the first convergence it is deduced that t n−1 → N (0,1). On the other hand, when the parameters of two
populations are supposed to be equal the samples should be merged to estimate the parameter jointly (especially for medium nx and ny). Then, under
the hypothesis σx = σy the pooled sample quasivariance should be used—although in some cases its effect is negligible—through the statistic:
̄ −Ȳ )−(μ X −μY )
(X d
T ( X , Y ; μ X ,μ Y )= → N (0,1)
√ S 2p S 2p
+
n X nY
For a Bernoulli population, under the hypothesis ηx = ηy, the pooled sample proportion should be used (although in some cases the effect is
negligible) in the denominator of the statistic:
$$T(X,Y;\eta_X,\eta_Y) = \frac{(\hat{\eta}_X - \hat{\eta}_Y) - (\eta_X - \eta_Y)}{\sqrt{\dfrac{\hat{\eta}_p(1-\hat{\eta}_p)}{n_X} + \dfrac{\hat{\eta}_p(1-\hat{\eta}_p)}{n_Y}}} \xrightarrow{d} N(0,1).$$
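For illustration, with hypothetical counts the pooled-proportion statistic is computed as follows (our own sketch):
# Pooled-proportion statistic under the hypothesis eta_x = eta_y
xSucc = 45; nx = 100    # successes and sample size, population X
ySucc = 52; ny = 120    # successes and sample size, population Y
etaX = xSucc/nx; etaY = ySucc/ny
etaP = (xSucc + ySucc) / (nx + ny)    # pooled sample proportion
Z0 = (etaX - etaY) / sqrt(etaP*(1 - etaP)*(1/nx + 1/ny))
Z0                                    # compare with the N(0,1) quantiles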
Remark 3T: In the last tables, the best information available should be used in place of the symbol ?.
Remark 4T: The Bernoulli population is a particular case for which μ = η and σ² = η·(1−η), so X̄ = η̂. When the variance σ² is
directly estimated without estimating η, $\hat{\sigma}^2$ is used in place of the product ?(1−?).
Remark 5T: Once an interval for the variance is obtained, P(a1 < σ² < a2), since the positive square root is a strictly increasing function (and
therefore it preserves the order between two values), an interval for the standard deviation is given by P(√a1 < σ < √a2). (Notice that, for a
reasonable initial interval, 0 < a1.) Similarly for the quotient of two variances σX²/σY².
Parameters | Statistic

θ (1 dimension): $\Lambda = \dfrac{L(X;\theta_0)}{L(X;\theta_1)}$

θ (r dimensions): $\Lambda = \dfrac{L(X;\hat{\theta}_0)}{L(X;\hat{\theta})}$. Asymptotically, $-2\ln(\Lambda) \xrightarrow{d} \chi^2_r$
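For illustration, here is a minimal R sketch of the asymptotic likelihood ratio test for a Bernoulli sample with H0: η = 0.5 (simulated data; r = 1 dimension):
# -2 ln(Lambda) for a Bernoulli sample, H0: eta = 0.5
set.seed(3)
x = rbinom(50, size=1, prob=0.6)
eta0 = 0.5; etaHat = mean(x)                 # restricted and unrestricted estimates
logL = function(eta) sum(dbinom(x, 1, eta, log=TRUE))
lambda = -2 * (logL(eta0) - logL(etaHat))    # -2 ln(Lambda)
lambda > qchisq(0.95, df=1)                  # reject H0 at alpha = 0.05?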
Homogeneity (K classes, L samples): data
$$X_{11}, \ldots, X_{1 n_1}; \quad X_{21}, \ldots, X_{2 n_2}; \quad \ldots; \quad X_{L1}, \ldots, X_{L n_L}$$
H0: the samples come from the same model.
$$T_0(X) = \sum_{i=1}^{L} \sum_{j=1}^{K} \frac{(N_{ij} - \hat{e}_{ij})^2}{\hat{e}_{ij}} \xrightarrow{d} \chi^2_{KL-(L+K-1)} = \chi^2_{(K-1)(L-1)}, \qquad \text{where } \hat{e}_{ij} = n_i \hat{p}_{ij} = n_i \hat{p}_j = n_i \frac{N_{\cdot j}}{n}$$
Independence (KL classes, 2 variables): data $(X_1, Y_1), \ldots, (X_n, Y_n)$. H0: the bivariate sample comes from two independent models.
$$T_0(X,Y) = \sum_{i=1}^{K} \sum_{j=1}^{L} \frac{(N_{ij} - \hat{e}_{ij})^2}{\hat{e}_{ij}} \xrightarrow{d} \chi^2_{KL-(L-1+K-1+1)} = \chi^2_{(K-1)(L-1)}, \qquad \text{where } \hat{e}_{ij} = n \hat{p}_{ij} = n \hat{p}_i \hat{p}_j = n \frac{N_{i\cdot}}{n} \frac{N_{\cdot j}}{n}$$
Remark 6T: Although for different theoretical reasons, for the practical estimation of eij the same mnemonic rule can be used in both
homogeneity and independence tests: for each position, multiply the absolute frequencies of the row and the column and divide by the total number
of elements n.
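For illustration, R's built-in chisq.test applies this rule; the contingency table below is hypothetical:
# Independence test on a hypothetical 2x3 contingency table
tab = matrix(c(20, 30, 25, 35, 15, 25), nrow=2)
out = chisq.test(tab)
out$expected     # row total * column total / n, as in the mnemonic rule
out$statistic    # T0
out$parameter    # degrees of freedom: (K-1)(L-1)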
Homogeneity (2 samples): data
$$X_1, \ldots, X_{n_X}; \qquad Y_1, \ldots, Y_{n_Y}$$
with empirical distribution functions
$$F_{n_X}(t) = \frac{1}{n_X}\,\mathrm{Number}\{X_i \le t\}, \qquad F_{n_Y}(t) = \frac{1}{n_Y}\,\mathrm{Number}\{Y_i \le t\}$$
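For illustration, the empirical distribution functions and the corresponding two-sample Kolmogorov–Smirnov test are available in R (simulated data of ours):
# Empirical distribution functions and two-sample Kolmogorov-Smirnov test
set.seed(4)
x = rnorm(30); y = rnorm(40, mean=0.5)
Fx = ecdf(x); Fy = ecdf(y)   # empirical distribution functions
Fx(0); Fy(0)                 # proportions of observations <= t = 0
ks.test(x, y)                # compares Fx and Fy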
Other Tests: Data | Null Hypothesis | Statistic

(of Position) 1 sample; a position measure Q (e.g. the median) and a value q0. H0: Q = q0.
$$T_0(X) = \sum_i R_i\, \mathbf{1}\{X_i - q_0 > 0\}, \qquad \tilde{T}_0(X) = \frac{T_0(X) - \mu}{\sqrt{\sigma^2}} \to N(0,1)$$
where Ri is the rank of |Xi − q0|, with
$$\mu = \frac{n(n+1)}{4}, \qquad \sigma^2 = \frac{n(n+1)(2n+1)}{24},$$
and using the table of the standard normal distribution.
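For illustration, the statistic and its normal approximation can be computed directly and compared with R's built-in wilcox.test (simulated data; q0 = 10 is a hypothetical value of ours):
# Position test for the median, H0: Q = q0 = 10
set.seed(5)
x = rnorm(25, mean=10.8, sd=2)
q0 = 10
n = length(x)
T0 = sum(rank(abs(x - q0))[x - q0 > 0])   # ranks of |Xi - q0| for positive differences
Ztilde = (T0 - n*(n + 1)/4) / sqrt(n*(n + 1)*(2*n + 1)/24)
Ztilde                                    # compare with N(0,1)
wilcox.test(x, mu=q0)                     # built-in version of the test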
Remark 7s: In the statistics, the parameter of interest is the unknown for confidence intervals while it is supposed to be known for hypothesis tests.
Remark 8s: Usually the estimators involved in the statistic T (like s, S, ...) and the quantiles (like a, ...) also depend on the sample size n, although
the notation is simplified.
Remark 9s: For big sample sizes, when the Central Limit Theorem can be applied to T or its standardization, quantiles or probabilities that are not
tabulated can be approximated: p is directly calculated given a, while, given p, a is calculated from the quantile z of the standard normal
distribution:
$$p = P(T \le a) = P\!\left(Z \le \frac{a - E(T)}{\sqrt{Var(T)}}\right), \qquad z = \frac{a - E(T)}{\sqrt{Var(T)}}, \qquad a = E(T) + z\sqrt{Var(T)}$$
This is used in the asymptotic approximations proposed in the tests of the last table.
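For illustration, with the mean and variance of the previous position statistic (n = 25 is an arbitrary choice of ours):
# Approximating a non-tabulated quantile and probability through N(0,1)
n = 25
ET = n*(n + 1)/4
VT = n*(n + 1)*(2*n + 1)/24
ET + qnorm(0.95) * sqrt(VT)      # approximate quantile a for p = 0.95
pnorm((200 - ET)/sqrt(VT))       # approximate p = P(T <= 200)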
Remark 10s: To consider the approximations, sample sizes bigger than 20 have been proposed in the last table, although it is possible to find other
cutoff values in the literature (like 8, 10 or 30); in practice, there is no severe change at any value.
Remark 11s: The goodness-of-fit chi-square test can also be used to test position measures: by considering two classes with probabilities (p,1–p).
Remark 12s: To test the symmetry of a distribution, the position tests can be used.
Remark 13s: Although different types of test can be applied to evaluate the same hypotheses H0 and H1 with the same α (type I error), their quality
is usually different, and β (type II error) should be taken into account. A global comparison can be done by using their power functions.
My notes:
(Taken from: Newbold, P., W. Carlson and B. Thorne. Statistics for Business and Economics. Pearson-Prentice Hall.)
algebra, 4m
analysis
complex, 3pt
real, 1m, 2m, 3m, 4m
analysis of variance, 1ht-av
ANOVA → analysis of variance
asymptoticness, 3pe-p, 1ci-m, 2ci-m, 3ci-m, 4ci-m, 2ci
(see also 'consistency')
basic estimators, 12pe-p, 13pe-p
(see also 'sample mean', 'sample variance', 'sample quasivariance', 'sample proportion')
Bernoulli distribution, 1pe-m, 3pe-p, 12pe-p, 14pe-p, 3ci-m, 4ci-m, 6ht-T, 1ht-Λ, 1ht, 3pe-ci-ht, 3pt
(see also 'binomial distribution')
bind, 1m
binomial distribution, 1pe-m, 1pt, 3pt
bound, 5pe-p, 1m
(see also Cramér-Rao's lower bound)
characteristic function, 3pt
Chebyshev's inequality, 1ci-s, 1ci, 2ci, 3ci, 4ci
chi-square distribution, 7pe-p, 1pt
chi-square tests,
goodness-of-fit, 2ht-np, 3ht-np, 1ht
homogeneity, 3ht-np
independence, 1ht-np, 3ht-np
completion, 2pe-p, 4pe-p, 5pe-p
standardization, 1pe-p, 3pe-p, 4pe-p, 2pt
complex analysis, 3pt
confidence intervals, 1ci-m, 2ci-m, 3ci-m, 4ci-m, 1ci-s, 1ci, 2ci, 3ci, 4ci, 1pe-ci-ht, 2pe-ci-ht, 3pe-ci-ht
consistency, 6pe-p, 7pe-p, 9pe-p, 10pe-p, 12pe-p, 13pe-p, 14pe-p, 1pe, 2pe, 3pe
convergence → rate of convergence
cook → statistical cook
coordinates
rectangular, 4m
polar, 1m, 3m
Cramér-Rao's lower bound, 9pe-p
critical region, 1ht-T, 2ht-T, 3ht-T, 4ht-T, 5ht-T, 6ht-T, 1ht-Λ, 1ht-av, 1ht-np, 2ht-np, 3ht-np, 1ht, 3pe-ci-ht, 4pe-ci-ht
critical values → critical region
density function → probability function
(see the continuous probability distributions)
differential equation, 3pt
efficiency, 9pe-p, 10pe-p, 3pe
(see also 'relative efficiency')
exponential distribution, 3pe, 1ht-Λ, 3pt
two-parameter (or translated), 6pe-m
exponential function, 1m
factorization theorem, 11pe-p, 3pe
F distribution, 1pt
frontier, 4m
Fubini's theorem, 1m
generating functions
→ probability generating function
→ moments generating function
→ characteristic function
My notes: