Solved Exercises and Problems of
Statistical Inference
David Casado
You can decide not to print this file and consult it in digital format – paper and ink will be saved. Otherwise, print
it on recycled paper, double-sided and with less ink. Be ecological. Thank you very much.
Contents
Links, Keywords and Descriptions
PE – CI – HT
Additional Exercises
Appendixes
Probability Theory
Some Reminders
Markov's Inequality. Chebyshev's Inequality
Probability and Moments Generating Functions. Characteristic Function
Mathematics
Some Reminders
Limits
References
Tables of Statistics
Probability Tables
Index
Prologue
These exercises and problems are a necessary complement to the theory included in Notes of Statistical
Inference, available at http://www.casado-d.org/edu/NotesStatisticalInference-Slides.pdf. Nevertheless, some
important theoretical details are also included in the remarks at the beginning of each chapter. Those Notes are
intended for teaching purposes, and they do not include the advanced mathematical justifications and
calculations included in this document.
Although we can study only linearly and step by step, it is worth noticing that in Statistical Inference
methods are usually related, as tasks are in the real world. Thus, in most exercises and problems we have
made it clear what the suppositions are and how they should be properly checked. In some cases, several
statistical methods have been "naturally" combined in the statement. Many steps and even sentences are
repeated in most exercises of the same type, both to insist on them and to make it easier to read the exercises
individually. The advanced exercises have been marked with the symbol (*).
The code with which we have done some calculations with the programming language R is written in
Courier New font; you can copy and paste this code from the file. I include some notes to help, to the best of
my knowledge, students whose mother language is not English.
Acknowledgements
This document has been created with Linux, LibreOffice, OpenOffice.org, GIMP and R. I thank those who
make this software available for free. I donate funds to these kinds of projects from time to time.
Links, Keywords and Descriptions
Inference Theory (IT)
Framework and Scope of the Methods
> [Keywords] infinite populations, independent populations, normality, asymptotic behaviour, descriptive statistics.
> [Description] The conditions under which the Statistics considered here can be applied are listed.
Some Remarks
> [Keywords] partial knowledge, randomness, certainty, dimensional analysis, validity, use of the samples, calculations.
> [Description] The partial knowledge justifies both the random character of the mathematical variables used to explain the variables of
the real-world problems and the impossibility of reaching the maximum certainty in using samples instead of the whole population. The
validity of the results must be understood within the scenario made of the assumptions, the methods, the certainty and the data.
Sampling Probability Distribution
Exercise 1it-spd
> [Keywords] inference theory, joint distribution, sampling distribution, sample mean, probability function.
> [Description] From a simple probability distribution for X, the joint distribution of a sample (X1,X2) and the sampling distribution of
the sample mean X̄ are determined.
PE – CI – HT
Exercise 1pe-ci-ht
> [Keywords] point estimations, confidence intervals, method of the pivot, normal distribution, t distribution, pooled sample variance.
> [Description] The probability of an event involving the difference between the means of two independent normal populations is
calculated with and without the supposition that the variances of the populations are the same. The method of the pivot is applied to
construct a confidence interval for the quotient of the standard deviations.
Exercise 2pe-ci-ht
> [Keywords] confidence intervals, point estimations, normal distribution, method of the pivot, probability, pooled sample variance.
> [Description] For the difference of the means of two (independent) normally distributed variables, a confidence interval is
constructed by applying the method of the pivotal quantity. Since the equality of the means is included in a high-confidence interval,
the pooled sample variance is considered in calculating a probability involving the difference of the sample means.
Exercise 3pe-ci-ht
> [Keywords] hypothesis tests, confidence intervals, Bernoulli populations, one-tailed tests, population proportion, critical region,
p-value, type I error, type II error, power function, method of the pivot.
> [Description] A decision on whether the population proportion in one population is smaller than or equal to that in the other is made by looking
at both the critical values and the p-value. The type II error is calculated and the power function is plotted. By applying the method of
the pivot, a confidence interval for the difference of the population proportions is built. This interval can be seen as the acceptance
region of the equivalent two-sided hypothesis test. In this case, the same decision is made with the test and with the interval.
Exercise 4pe-ci-ht
> [Keywords] point estimations, hypothesis tests, standard power function density, method of the moments, maximum likelihood
method, plug-in principle, Neyman-Pearson's lemma, likelihood ratio tests, critical region.
> [Description] Given the probability function of a population random variable, estimators are built by applying both the method of
the moments and the maximum likelihood method. Then, the plug-in principle allows us to obtain estimators for the mean and the
variance of the distribution of the variable. In testing the equality of the parameter to a given value, the form of the critical region is
theoretically studied when four different types of alternative hypothesis are considered.
Additional Exercises (Solved, but neither ordered by difficulty, nor described, nor referred to in the final index.)
References
Index
Samples
[As1] Sample sizes are supposed to be much smaller than population sizes, so a correction factor is not
necessary for these (effectively) infinite populations.
[As2] At the same time, we consider either any amount of normally distributed data or many data
(large samples) from any distribution.
[As3] Data will be supposed to have been selected randomly, with the same probability and
independently; that is, by applying simple random sampling.
Methods
[Am1] Before applying inferential methods, data should be analysed to guarantee that nothing strange
will spoil the inference—we suppose that such descriptive analysis and data treatment have been done.
[Am2] We are able to learn only linearly, but in practice methods need not be applied in the order in
which they are presented here—e.g. nonparametric hypothesis tests to check assumptions before
applying parametric methods.
Finally, at the end of the theoretical part of an exercise, we do not always insist that, in practice, a sample
(X1,...,Xn) would be used by entering its values into the theoretical expressions obtained as a solution.
Estimators and statistics are random quantities until specific data are used.
Useful Questions
To build the answer, readers may find it useful to ask themselves:
On the Populations
On the Samples
● If populations are not normally distributed, are the sample sizes large enough to apply asymptotic
results?
● Do we know the data themselves, or only some quantities calculated from them?
● Which are the estimators, the statistics and the methods that will be applied?
On the Quantities
● Which are the units of measurement? Are all the units equal?
● How large are the magnitudes? Do they seem reasonable? Are all of them coherent (variability is
positive, probabilities and relative frequencies are between 0 and 1, etc)?
On the Interpretation
They may want to consult some other pieces of advice that we have written in Guide for Students of Statistics,
available at http://www.casado-d.org/edu/GuideForStudentsOfStatistics-Slides.pdf.
For two populations, other basic estimators are built from the one-population ones: the difference of the sample means, $\bar{X}-\bar{Y}$; the difference of the sample proportions, $\hat{\eta}_X-\hat{\eta}_Y$; and the pairs of variance estimators $V_X^2$, $s_X^2$, $S_X^2$ and $V_Y^2$, $s_Y^2$, $S_Y^2$.
Finally, all these estimators are used to make statistics whose sampling distribution is known.
Exercise 1it-spd
Given a population (variable) X following the probability distribution determined by the following values and
probabilities:

  Value x        1     2     3
  Probability    3/9   1/9   5/9

Discussion: The distribution of X is totally determined, since we know all the information necessary to
calculate any quantity, e.g. the mean:

$$\mu_X = E(X) = \sum_\Omega x_j\cdot P_X(x_j) = \sum_{\{1,2,3\}} x_j\cdot p_j = 1\cdot\frac{3}{9} + 2\cdot\frac{1}{9} + 3\cdot\frac{5}{9} = \frac{20}{9} = 2.222222$$
Instead of a table, a function is sometimes used to provide the values and the probabilities—the mass or
density function. We can represent this function with the computer:
values = c(1, 2, 3)
probabilities = c(3/9, 1/9, 5/9)
plot(values, probabilities, type='h', xlab='Value', ylab='Probability', ylim=c(0,1), main= 'Mass Function', lwd=7)
The sampling probability distribution of X̄ is determined once we give the possible values and the
probabilities with which they can be taken. Before doing that, we describe the probability distribution of the
random vector X = (X1,X2).
$$f_{\mathbf{X}}(1,1) = P_{\mathbf{X}}(\{X_1=1\}\cap\{X_2=1\}) = P_X(X_1=1)\cdot P_X(X_2=1) = \frac{3}{9}\cdot\frac{3}{9} = \frac{1}{9}$$
To fill in the following table, the other probabilities are calculated in the same way.
Joint Probability Distribution of (X1,X2)

  Value (x1,x2)    Probability of (x1,x2)
  (1,1)            (3/9)·(3/9) = 1/9
  (1,2)            (3/9)·(1/9) = 1/27
  (1,3)            (3/9)·(5/9) = 5/27
  (2,1)            (1/9)·(3/9) = 1/27
  (2,2)            (1/9)·(1/9) = 1/81
  (2,3)            (1/9)·(5/9) = 5/81
  (3,1)            (5/9)·(3/9) = 5/27
  (3,2)            (5/9)·(1/9) = 5/81
  (3,3)            (5/9)·(5/9) = 25/81
Notice that (1,3) and (3,1), for example, contain the same information. The values and their probabilities can be entered and represented with the help of a computer:
valuesX1 = c(1, 1, 1, 2, 2, 2, 3, 3, 3)
valuesX2 = c(1, 2, 3, 1, 2, 3, 1, 2, 3)
probabilities = c(1/9, 1/27, 5/27, 1/27, 1/81, 5/81, 5/27, 5/81, 25/81)
library('scatterplot3d') # To load the package
scatterplot3d(valuesX1, valuesX2, probabilities, type='h', xlab='Value X1', ylab='Value X2', zlab='Probability',
xlim=c(0, 4), ylim=c(0, 4), zlim=c(0,1), main= 'Mass Function', lwd=7)
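As a check, the joint probabilities can also be computed with R as the outer product of the marginal ones (a small sketch; the object names are ours):

probs = c(3/9, 1/9, 5/9)
joint = outer(probs, probs)           # joint[i,j] = P(X1 = value i) * P(X2 = value j)
rownames(joint) = colnames(joint) = c('1', '2', '3')
joint                                 # reproduces the table above
sum(joint)                            # must be equal to 1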
The probabilities of the possible values of X̄ are obtained by adding the probabilities of the pairs that produce each value; for example,

$$P_{\bar{X}}\left(\frac{5}{2}\right) = P_{\mathbf{X}}(\{(2,3)\}\cup\{(3,2)\}) = P_{\mathbf{X}}(\{(2,3)\}) + P_{\mathbf{X}}(\{(3,2)\}) = \frac{5}{81} + \frac{5}{81} = \frac{10}{81}$$

$$P_{\bar{X}}(3) = P_{\mathbf{X}}(\{(3,3)\}) = \frac{25}{81}$$
Then, the sampling probability distribution of the sample mean X̄ is determined, in this case, by

Probability Distribution of X̄

  Value x̄        1      3/2    2      5/2    3
  Probability    1/9    2/27   31/81  10/81  25/81
We can check that the total sum of probabilities is equal to one:

$$\sum_\Omega P_{\bar{X}}(\bar{x}_j) = \sum_\Omega p_j = \frac{1}{9} + \frac{2}{27} + \frac{31}{81} + \frac{10}{81} + \frac{25}{81} = \frac{9+6+31+10+25}{81} = \frac{81}{81} = 1$$
From the information in the table above it is possible to calculate any quantity, e.g. the mean:

$$\mu_{\bar{X}} = E(\bar{X}) = \sum_\Omega \bar{x}_j\cdot P_{\bar{X}}(\bar{x}_j) = 1\cdot\frac{1}{9} + \frac{3}{2}\cdot\frac{2}{27} + 2\cdot\frac{31}{81} + \frac{5}{2}\cdot\frac{10}{81} + 3\cdot\frac{25}{81} = \frac{9+9+62+25+75}{81} = \frac{180}{81} = 2.222222$$

It is worth noticing that this value is equal to the value that we obtained at the beginning, which agrees with
the well-known theoretical property:

$$\mu_{\bar{X}} = E(\bar{X}) = E(X) = \mu_X$$
Values and probabilities can also be provided by using a function—the mass or density function, which can be
represented with the help of a computer:
values = c(1, 3/2, 2, 5/2, 3)
probabilities = c(1/9, 2/27, 31/81, 10/81, 25/81)
plot(values, probabilities, type='h', xlab='Value', ylab='Probability', ylim=c(0,1), main= 'Mass Function', lwd=7)
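The whole sampling distribution of X̄ can also be obtained with R, by listing all the pairs, multiplying the marginal probabilities and adding those that give the same sample mean (a sketch for the same data; the object names are ours):

values = c(1, 2, 3)
probs = c(3/9, 1/9, 5/9)
pairs = expand.grid(x1 = values, x2 = values)       # all pairs (x1, x2)
jointp = probs[match(pairs$x1, values)] * probs[match(pairs$x2, values)]
xbar = (pairs$x1 + pairs$x2) / 2                    # sample mean of each pair
tapply(jointp, xbar, sum)                           # probabilities 1/9, 2/27, 31/81, 10/81, 25/81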
Conclusion: For a simple distribution of X and a small sample X = (X1,X2), we have written both the
joint probability distribution of the sample X and the sampling distribution of X̄. This helps us to understand
the concept of the sampling distribution of any random quantity (not only the sample mean), whether we are
able to write it explicitly or only to know it (e.g. thanks to a theorem).
My notes:
Remark 2pe: If the method of the moments is used to estimate m parameters (frequently 1 or 2), the first m equations of the system
usually suffice; nevertheless, if not all the parameters appear in the first-order moments of X, the smallest m moments (and
equations) in which the parameters appear must be considered. For example, if μ₁ = 0, or if the interest lies directly in σ² because
μ is known, the first-order equation μ₁ = μ = E(X) = m₁ does not involve σ, and hence the second-order equation μ₂ = E(X²) = Var(X)
+ E(X)² = σ² + μ² = m₂ must be considered instead.
Remark 3pe: When looking for local maxima or minima of differentiable functions, the first-order derivatives are set equal to zero.
After that, to discriminate between maxima and minima, the second-order derivatives are studied. For most of the functions we will
work with, this second step can be solved by applying some qualitative reasoning about the sign of the quantities involved and the
possible values of the data xⱼ. When this does not suffice, the values found in the first step, say θ₀, must be substituted into the
expression of the second step. On the other hand, global maxima and minima cannot in general be found by using the derivatives, and
some qualitative reasoning must be applied. It is important to highlight that, in applying the maximum likelihood method, the
purpose is to find the maximum, whatever the mathematical way.
Exercise 1pe-m
If X is a population variable that follows a binomial distribution of parameters κ and η, and X = (X1,...,Xn) is
a simple random sample:
(a) Apply the method of the moments to obtain an estimator of the parameter η.
(b) Apply the maximum likelihood method to obtain an estimator of the parameter η.
(c) When κ = 10 and x = (x1,...,x5) = (4, 4, 3, 5, 6), use the estimators obtained in the two previous
sections to construct final estimates of the parameter η and the measures μ and σ2.
Hint: (i) In the two first sections treat the parameter κ as if it were known. (ii) In the likelihood function, join the combinatorial
terms into a product; this product does not depend on the parameter η and hence its derivative will be zero.
Discussion: This statement is mathematical, although in the last section we are given some data to be
substituted. In practice, that the binomial can be used to explain X should be supported. The variable X is
dimensionless. For the binomial distribution, $\mu = E(X) = \kappa\eta$ and $\sigma^2 = Var(X) = \kappa\eta(1-\eta)$.
(See the appendixes to see how the mean and the variance of this distribution can be calculated.) In particular,
the results obtained here can be applied to the Bernoulli distribution with κ = 1.
(a1) Population and sample moments: The probability distribution has two parameters originally, but we
have to study only one. The first-order moments are

$$\mu_1(\eta) = E(X) = \kappa\eta \quad\text{and}\quad m_1(x_1,x_2,\ldots,x_n) = \frac{1}{n}\sum_{j=1}^n x_j = \bar{x}$$

(a2) System of equations: Since the parameter of interest η appears in the first-order population moment of
X, the first equation suffices:

$$\mu_1(\eta) = m_1(x_1,\ldots,x_n) \;\rightarrow\; \kappa\eta = \bar{x} \;\rightarrow\; \eta_0 = \frac{\bar{x}}{\kappa}$$

(a3) The estimator: $\hat{\eta}_M = \dfrac{1}{\kappa}\bar{X}$
(b1) Likelihood function: For the binomial distribution the mass function is $f(x;\kappa,\eta) = \binom{\kappa}{x}\eta^x(1-\eta)^{\kappa-x}$.
We are interested only in η, so

$$L(x_1,x_2,\ldots,x_n;\eta) = \prod_{j=1}^n f(x_j;\eta) = \prod_{j=1}^n \binom{\kappa}{x_j}\eta^{x_j}(1-\eta)^{\kappa-x_j} = \left[\prod_{j=1}^n \binom{\kappa}{x_j}\right]\eta^{\sum_{j=1}^n x_j}\,(1-\eta)^{n\kappa-\sum_{j=1}^n x_j}.$$
(b2) Optimization problem: The logarithm function is applied to facilitate the calculations,

$$\log L(x_1,x_2,\ldots,x_n;\eta) = \log\left[\prod_{j=1}^n \binom{\kappa}{x_j}\right] + \left(\sum_{j=1}^n x_j\right)\log(\eta) + \left(n\kappa-\sum_{j=1}^n x_j\right)\log(1-\eta).$$
To discover the local or relative extreme values, the necessary condition is

$$0 = \frac{d}{d\eta}\log L(x_1,\ldots,x_n;\eta) = \left(\sum_{j=1}^n x_j\right)\frac{1}{\eta} - \left(n\kappa-\sum_{j=1}^n x_j\right)\frac{1}{1-\eta} \;\rightarrow\; \frac{n\kappa-\sum_{j=1}^n x_j}{1-\eta} = \frac{\sum_{j=1}^n x_j}{\eta}$$

$$\rightarrow\; \eta\,n\kappa - \eta\sum_{j=1}^n x_j = \sum_{j=1}^n x_j - \eta\sum_{j=1}^n x_j \;\rightarrow\; \eta\,n\kappa = \sum_{j=1}^n x_j \;\rightarrow\; \eta_0 = \frac{1}{n\kappa}\sum_{j=1}^n x_j = \frac{\bar{x}}{\kappa}$$
To verify that the only candidate is a local or relative maximum, the sufficient condition is

$$\frac{d^2}{d\eta^2}\log L(x_1,\ldots,x_n;\eta) = -\frac{\sum_{j=1}^n x_j}{\eta^2} - \frac{n\kappa-\sum_{j=1}^n x_j}{(1-\eta)^2} < 0$$

since $\kappa \ge x_j$ and therefore $n\kappa \ge \sum_{j=1}^n x_j$. This holds for any value, including $\eta_0$.
(b3) The estimator: $\hat{\eta}_{ML} = \dfrac{1}{\kappa}\bar{X}$, the same as the one provided by the method of the moments.

(c) Since $\mu = E(X) = \kappa\eta$ and $\sigma^2 = Var(X) = \kappa\eta(1-\eta)$, an estimator of η induces estimators of μ and σ² by applying the plug-in principle: $\hat{\mu} = \kappa\hat{\eta} = \bar{X}$ and $\hat{\sigma}^2 = \kappa\hat{\eta}(1-\hat{\eta})$. With the data given, $\bar{x} = (4+4+3+5+6)/5 = 4.4$, so $\hat{\eta} = 4.4/10 = 0.44$, $\hat{\mu} = 4.4$ and $\hat{\sigma}^2 = 10\cdot 0.44\cdot 0.56 = 2.464$.
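These estimates can be computed with R (a small sketch; the object names are ours):

kappa = 10
x = c(4, 4, 3, 5, 6)
eta = mean(x) / kappa              # 0.44
mu = kappa * eta                   # 4.4, equal to mean(x)
sigma2 = kappa * eta * (1 - eta)   # 2.464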
Conclusion: We can see that for the binomial population the two methods provide the same estimator of η.
The value of κ must be known to use the expression obtained. In this particular case, the value 0.44 indicates
that, for each underlying trial (Bernoulli variable), one value seems slightly more probable than the other. On the
other hand, the quality of the estimator obtained should be studied, especially if the two methods had provided
different estimators. As a particular case, κ = 1 gives the Bernoulli distribution.
My notes:
Exercise 2pe-m
A random quantity X is supposed to follow a geometric distribution with parameter η. Let X = (X1,...,Xn) be a simple random sample.
A) Apply the method of the moments to find an estimator of the parameter η.
B) Apply the maximum likelihood method to find an estimator of the parameter η.
C) Given a sample of size n = 27 such that $\sum_{j=1}^{27} x_j = 134$, apply the formulas obtained in the two previous sections
to give final estimates of η. Finally, give estimates of the mean and the variance of X.
Discussion: This statement is mathematical, although we are given some data in the last section. The
random variable X is dimensionless. For the geometric distribution, $\mu = E(X) = 1/\eta$ and $\sigma^2 = Var(X) = (1-\eta)/\eta^2$.
(See the appendixes to see how the mean and the variance of this distribution can be calculated.)
a1) Population and sample moments: The population distribution has only one parameter, so one equation
suffices. The first-order moments of the model X and the sample x are, respectively,
$$\mu_1(\eta) = E(X) = \frac{1}{\eta} \quad\text{and}\quad m_1(x_1,x_2,\ldots,x_n) = \frac{1}{n}\sum_{j=1}^n x_j = \bar{x}$$
a2) System of equations: Since the parameter of interest η appears in the first-order moment of X, the first
equation suffices:
$$\mu_1(\eta) = m_1(x_1,\ldots,x_n) \;\rightarrow\; \frac{1}{\eta} = \frac{1}{n}\sum_{j=1}^n x_j = \bar{x} \;\rightarrow\; \eta = \left(\frac{1}{n}\sum_{j=1}^n x_j\right)^{-1} = \frac{1}{\bar{x}}$$
a3) The estimator:
$$\hat{\eta}_M = \left(\frac{1}{n}\sum_{j=1}^n X_j\right)^{-1} = \frac{1}{\bar{X}}$$
b1) Likelihood function: For the geometric distribution, the mass function is $f(x;\eta) = \eta(1-\eta)^{x-1}$, so

$$L(x_1,\ldots,x_n;\eta) = \prod_{j=1}^n f(x_j;\eta) = \eta(1-\eta)^{x_1-1}\cdots\eta(1-\eta)^{x_n-1} = \eta^n\,(1-\eta)^{\left(\sum_{j=1}^n x_j\right)-n}$$
b2) Optimization problem: The logarithm function is applied to make calculations easier,

$$\log L(x_1,\ldots,x_n;\eta) = \log(\eta^n) + \log\left[(1-\eta)^{\left(\sum_{j=1}^n x_j\right)-n}\right] = n\log(\eta) + \left[\left(\sum_{j=1}^n x_j\right)-n\right]\log(1-\eta)$$
The population distribution has only one parameter, so a one-dimensional function must be maximized. To find
the local or relative extreme values, the necessary condition is:

$$0 = \frac{d}{d\eta}\log L(x_1,\ldots,x_n;\eta) = \frac{n}{\eta} - \left[\left(\sum_{j=1}^n x_j\right)-n\right]\frac{1}{1-\eta} \;\rightarrow\; \frac{n}{\eta} = \frac{\left(\sum_{j=1}^n x_j\right)-n}{1-\eta}$$

$$\rightarrow\; n - n\eta = \eta\sum_{j=1}^n x_j - \eta n \;\rightarrow\; n = \eta\sum_{j=1}^n x_j \;\rightarrow\; \eta_0 = \frac{n}{\sum_{j=1}^n x_j} = \frac{1}{\bar{x}}$$
To verify that the only candidate is a (local) maximum, the sufficient condition is:

$$\frac{d^2}{d\eta^2}\log L(x_1,\ldots,x_n;\eta) = -\frac{n}{\eta^2} - \left[\left(\sum_{j=1}^n x_j\right)-n\right]\frac{1}{(1-\eta)^2} < 0$$

as $\left(\sum_{j=1}^n x_j\right)-n \ge 0$ (note that $x_j \ge 1$). This holds for any value, including $\eta_0 = \dfrac{1}{\bar{x}}$.
b3) The estimator:

$$\hat{\eta}_{ML} = \left(\frac{1}{n}\sum_{j=1}^n X_j\right)^{-1} = \frac{1}{\bar{X}}$$
C) Estimation of η, μ and σ²

Since n = 27 and $\sum_{j=1}^{27} x_j = 134$:

From the method of the moments: $\hat{\eta}_M = \dfrac{1}{\bar{x}} = \dfrac{27}{134} = 0.201$.

From the maximum likelihood method, as the same estimator was obtained: $\hat{\eta}_{ML} = 0.201$.

Since $\mu = E(X) = 1/\eta$, an estimator of η induces an estimator of μ:

From the method of the moments: $\hat{\mu}_M = \dfrac{1}{\hat{\eta}_M} = \dfrac{134}{27} = 4.96$.

From the maximum likelihood method, since the same estimator was obtained: $\hat{\mu}_{ML} = 4.96$.
Note: From the numerical point of view, calculating 134/27 is expected to have smaller error than calculating 1/0.201.
Finally, since $\sigma^2 = Var(X) = \dfrac{1-\eta}{\eta^2}$, the plug-in principle gives, with both methods,

$$\hat{\sigma}^2 = \frac{1-\hat{\eta}}{\hat{\eta}^2} = \frac{1-0.201}{0.201^2} = 19.8$$
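These estimates can be computed with R (a small sketch; the object names are ours):

n = 27; sumx = 134
eta = n / sumx                 # 0.201
mu = 1 / eta                   # 4.96
sigma2 = (1 - eta) / eta^2     # about 19.8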
Conclusion: For the geometric model, the two methods provide the same estimator for η. We have used the
estimator of η to obtain an estimator of μ. On the other hand, the quality of the estimator obtained should be
studied, especially if the two methods had provided different estimators.
My notes:
Exercise 3pe-m
A real-world variable is modeled by using a random variable X that follows a Poisson distribution. Given a
simple random sample of size n,
A) Apply the method of the moments to obtain an estimator of the parameter λ.
B) Apply the maximum likelihood method to obtain an estimator of the parameter λ.
C) Use these estimators to build estimators of the mean μ and the variance σ2 of the distribution.
(See the appendixes to see how the mean and the variance of this distribution can be calculated.)
a1) Population and sample moments: The population distribution has only one parameter, so one equation
suffices. The first-order moments of the model X and the sample x are, respectively,
$$\mu_1(\lambda) = E(X) = \lambda \quad\text{and}\quad m_1(x_1,x_2,\ldots,x_n) = \frac{1}{n}\sum_{j=1}^n x_j = \bar{x}$$
a2) System of equations: Since the parameter of interest λ appears in the first-order moment of X, the first
equation suffices. The system has only one trivial equation:

$$\mu_1(\lambda) = m_1(x_1,x_2,\ldots,x_n) \;\rightarrow\; \lambda = \frac{1}{n}\sum_{j=1}^n x_j = \bar{x}$$

a3) The estimator: $\hat{\lambda}_M = \bar{X}$
b1) Likelihood function: For the Poisson distribution the mass function is $f(x;\lambda) = \dfrac{\lambda^x}{x!}e^{-\lambda}$, so

$$L(x_1,\ldots,x_n;\lambda) = \prod_{j=1}^n f(x_j;\lambda) = \frac{\lambda^{x_1}}{x_1!}e^{-\lambda}\cdot\frac{\lambda^{x_2}}{x_2!}e^{-\lambda}\cdots\frac{\lambda^{x_n}}{x_n!}e^{-\lambda} = \frac{\lambda^{\sum_{j=1}^n x_j}}{\prod_{j=1}^n x_j!}\,e^{-n\lambda}$$
b2) Optimization problem: The logarithm function is applied to make calculations easier:

$$\log L(x_1,\ldots,x_n;\lambda) = \left(\sum_{j=1}^n x_j\right)\log(\lambda) - n\lambda - \log\left(\prod_{j=1}^n x_j!\right)$$
The population distribution has only one parameter, so a one-dimensional function must be maximized. To find
the local extreme values, the necessary condition is:

$$0 = \frac{d}{d\lambda}\log L(x_1,\ldots,x_n;\lambda) = \left(\sum_{j=1}^n x_j\right)\frac{1}{\lambda} - n \;\rightarrow\; \lambda_0 = \frac{1}{n}\sum_{j=1}^n x_j = \bar{x}$$
To verify that the only candidate is a (local) maximum, the sufficient condition is:

$$\frac{d^2}{d\lambda^2}\log L(x_1,\ldots,x_n;\lambda) = -\left(\sum_{j=1}^n x_j\right)\frac{1}{\lambda^2} < 0$$

since $x\in\{0,1,2,\ldots\}$ and hence $\sum_{j=1}^n x_j \ge 0$. Then, the second derivative is always negative, also for $\lambda_0$.
b3) The estimator: For λ, it is obtained after substituting the lower-case letters xⱼ (numbers representing THE
sample we have) by upper-case letters Xⱼ (random variables representing ANY possible sample we may have):

$$\hat{\lambda}_{ML} = \frac{1}{n}\sum_{j=1}^n X_j = \bar{X}$$
C) Estimation of μ and σ²

To obtain estimators of the mean and the variance, we take into account that for this model $\mu = E(X) = \lambda$
and $\sigma^2 = Var(X) = \lambda$, so by applying the plug-in principle:

$$\hat{\mu} = \hat{\lambda} = \bar{X} \qquad\text{and}\qquad \hat{\sigma}^2 = \hat{\lambda} = \bar{X}$$
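The behaviour of this estimator can be checked with a small simulation in R (a sketch; the value λ = 2.5 and the object names are arbitrary choices of ours):

set.seed(1)
x = rpois(1000, lambda = 2.5)  # a hypothetical simple random sample
mean(x)                        # estimate of lambda and, for this model, of mu and sigma^2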
Conclusion: For the Poisson model, the two methods provide the same estimator for λ, and therefore for μ
and σ2 (when the plug-in principle is applied). On the other hand, the quality of the estimator obtained should
be studied (though the sample mean is a well-known estimator).
My notes:
Exercise 4pe-m
A random variable X follows the normal distribution. Let X = (X1,...,Xn) be a simple random sample of X (seen
as the population). To obtain an estimator of the parameters θ = (μ,σ), apply:
(A) The method of the moments (B) The maximum likelihood method
A) Method of the moments: The system of the first two equations is

$$\begin{cases} \mu_1(\mu,\sigma) = m_1(x_1,\ldots,x_n) \\ \mu_2(\mu,\sigma) = m_2(x_1,\ldots,x_n) \end{cases} \;\rightarrow\; \begin{cases} \mu = \dfrac{1}{n}\displaystyle\sum_{j=1}^n x_j = \bar{x} \\ \sigma^2+\mu^2 = \dfrac{1}{n}\displaystyle\sum_{j=1}^n x_j^2 \end{cases} \;\rightarrow\; \begin{cases} \mu = \bar{x} \\ \sigma^2 = \dfrac{1}{n}\displaystyle\sum_{j=1}^n x_j^2 - \bar{x}^2 = s_x^2 \end{cases}$$

where $Var(X) = E(X^2) - E(X)^2$ and $s_x^2 = \dfrac{1}{n}\displaystyle\sum_{j=1}^n x_j^2 - \left(\dfrac{1}{n}\displaystyle\sum_{j=1}^n x_j\right)^2 = \bar{x^2} - \bar{x}^2$ have been used.
The estimator:

$$\hat{\theta}_M = \begin{cases} \hat{\mu}_M = \bar{X} \\ \hat{\sigma}_M = s_X \end{cases}$$
B) Maximum likelihood method: For the normal distribution the density function is $f(x;\mu,\sigma) = \dfrac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{(x-\mu)^2}{2\sigma^2}}$, so the likelihood function is

$$L(x_1,\ldots,x_n;\mu,\sigma) = \prod_{j=1}^n \frac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{(x_j-\mu)^2}{2\sigma^2}} = \left(\frac{1}{\sigma\sqrt{2\pi}}\right)^n e^{-\frac{1}{2\sigma^2}\sum_{j=1}^n (x_j-\mu)^2}$$
Maximum: The population distribution has two parameters, so it is necessary to maximize a
two-dimensional function. To discover the local extreme values, the necessary conditions are:
$$\begin{cases} \dfrac{\partial}{\partial\mu}\log L = 0 \\[6pt] \dfrac{\partial}{\partial\sigma}\log L = 0 \end{cases} \;\rightarrow\; \begin{cases} -\dfrac{1}{2\sigma^2}\displaystyle\sum_{j=1}^n 2(x_j-\mu)(-1) = 0 \\[6pt] -\dfrac{n}{\sigma} + \dfrac{1}{\sigma^3}\displaystyle\sum_{j=1}^n (x_j-\mu)^2 = 0 \end{cases} \;\rightarrow\; \begin{cases} \displaystyle\sum_{j=1}^n (x_j-\mu) = 0 \\[6pt] \displaystyle\sum_{j=1}^n (x_j-\mu)^2 = n\sigma^2 \end{cases}$$

$$\rightarrow\; \begin{cases} \displaystyle\sum_{j=1}^n x_j = n\mu \\[6pt] \sigma^2 = \dfrac{1}{n}\displaystyle\sum_{j=1}^n (x_j-\mu)^2 \end{cases} \;\rightarrow\; \begin{cases} \mu = \bar{x} \\[4pt] \sigma^2 = \dfrac{1}{n}\displaystyle\sum_{j=1}^n (x_j-\bar{x})^2 = s_x^2 \end{cases} \;\rightarrow\; \begin{cases} \mu = \bar{x} \\ \sigma = s_x \end{cases}$$
To verify that the only candidate is a (local) maximum, the sufficient conditions on the partial derivatives of
second order are:
$$A = \frac{\partial^2}{\partial\mu^2}\log L = \frac{\partial}{\partial\mu}\left[\frac{1}{\sigma^2}\sum_{j=1}^n (x_j-\mu)\right] = \frac{1}{\sigma^2}\sum_{j=1}^n (-1) = -\frac{n}{\sigma^2}$$

$$B = \frac{\partial^2}{\partial\mu\,\partial\sigma}\log L = \frac{\partial}{\partial\sigma}\left[\frac{1}{\sigma^2}\sum_{j=1}^n (x_j-\mu)\right] = -\frac{2}{\sigma^3}\sum_{j=1}^n (x_j-\mu)$$

$$C = \frac{\partial^2}{\partial\sigma^2}\log L = \frac{\partial}{\partial\sigma}\left[-\frac{n}{\sigma} + \frac{1}{\sigma^3}\sum_{j=1}^n (x_j-\mu)^2\right] = \frac{n}{\sigma^2} - \frac{3}{\sigma^4}\sum_{j=1}^n (x_j-\mu)^2$$
To calculate D = B² − AC, substituting the pair $(\mu,\sigma) = (\bar{x},s_x)$ in A, B and C simplifies the work:

$$A\big|_{(\bar{x},s_x)} = -\frac{n}{s_x^2} < 0 \qquad B\big|_{(\bar{x},s_x)} = -\frac{2}{s_x^3}\sum_{j=1}^n (x_j-\bar{x}) = 0 \qquad C\big|_{(\bar{x},s_x)} = \frac{n}{s_x^2} - \frac{3}{s_x^4}\,n s_x^2 = -\frac{2n}{s_x^2}$$

$$D\big|_{(\bar{x},s_x)} = 0^2 - \left(-\frac{n}{s_x^2}\right)\left(-\frac{2n}{s_x^2}\right) = -\frac{2n^2}{s_x^4} < 0$$

as $\sum_{j=1}^n (x_j-\bar{x}) = \left(\sum_{j=1}^n x_j\right) - n\bar{x} = 0$ and $\sum_{j=1}^n (x_j-\bar{x})^2 = n s_x^2$. Then,
$\log L(\mathbf{x};\mu,\sigma)$ has a maximum at $(\mu,\sigma) = (\bar{x},s_x)$, since it is a local extreme value and D < 0, A < 0.
$$\hat{\theta}_{ML} = \begin{cases} \hat{\mu}_{ML} = \bar{X} \\ \hat{\sigma}_{ML} = s_X \end{cases}$$
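With data, the maximum likelihood estimates can be computed in R; note that sd() uses the denominator n−1, while the estimator obtained here uses n (a sketch; the sample and the object names are arbitrary choices of ours):

set.seed(1)
x = rnorm(100, mean = 5, sd = 2)          # a hypothetical simple random sample
mu_hat = mean(x)
sigma_hat = sqrt(mean((x - mean(x))^2))   # s_x, with denominator n
c(mu_hat, sigma_hat)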
Conclusion: Since in this case there are two parameters, both the parameter and its estimator can be thought
of as two-dimensional quantities: $\theta = (\mu,\sigma)$ and $\hat{\theta} = (\hat{\mu},\hat{\sigma})$. On the other hand, the quality of the estimator
obtained should be studied, especially if the two methods had provided different estimators.
My notes:
Exercise 5pe-m

A probability distribution has

$$f(x;\theta) = \begin{cases} \dfrac{1}{\theta} & \text{if } x\in[0,\theta] \\[4pt] 0 & \text{otherwise} \end{cases}$$

as a density function. Let X = (X1,...,Xn) be a simple random sample of a population X following this
probability distribution.
A) Apply the method of the moments to find an estimator of the parameter θ.
B) Apply the maximum likelihood method to find an estimator of the parameter θ.
C) Use this estimator to build others for the mean and the variance of X.
Discussion: This statement is mathematical, and there is no supposition that would require justification. The
random variable X is dimensionless. We are given the density function of the distribution of X, though for this
distribution it could be deduced from the fact that all values have the same probability. For the general
continuous uniform distribution,
Note: If we had not remembered the first population moments, with the notation of this exercise we could do

$$E(X) = \int_{-\infty}^{+\infty} x f(x;\theta)\,dx = \int_0^\theta x\,\frac{1}{\theta}\,dx = \frac{1}{\theta}\left[\frac{x^2}{2}\right]_0^\theta = \frac{1}{\theta}\left(\frac{\theta^2}{2}-0\right) = \frac{\theta}{2}$$

$$E(X^2) = \int_{-\infty}^{+\infty} x^2 f(x;\theta)\,dx = \int_0^\theta x^2\,\frac{1}{\theta}\,dx = \frac{1}{\theta}\left[\frac{x^3}{3}\right]_0^\theta = \frac{1}{\theta}\left(\frac{\theta^3}{3}-0\right) = \frac{\theta^2}{3}$$

so

$$\mu = E(X) = \frac{\theta}{2} \quad\text{and}\quad \sigma^2 = Var(X) = E(X^2)-E(X)^2 = \frac{\theta^2}{3} - \frac{\theta^2}{4} = \frac{\theta^2}{12}$$
A) Method of the moments

a1) Population and sample moments: For uniform distributions, discrete or continuous, the mean is the
middle value. Then, the first-order moments of the distribution and of the sample are

$$\mu_1(\theta) = \frac{0+\theta}{2} = \frac{\theta}{2} \quad\text{and}\quad m_1(x_1,\ldots,x_n) = \frac{1}{n}\sum_{j=1}^n x_j = \bar{x}$$

a2) System of equations: Since θ appears in the first-order moment, the first equation suffices:

$$\mu_1(\theta) = m_1(x_1,\ldots,x_n) \;\rightarrow\; \frac{\theta}{2} = \bar{x} \;\rightarrow\; \theta_0 = 2\bar{x}$$

a3) The estimator: $\hat{\theta}_M = 2\bar{X}$

B) Maximum likelihood method

b1) Likelihood function: Since $f(x;\theta) = 1/\theta$ for $x\in[0,\theta]$,

$$L(x_1,\ldots,x_n;\theta) = \prod_{j=1}^n f(x_j;\theta) = \frac{1}{\theta^n} = \theta^{-n} \quad\text{when } 0\le x_j\le\theta,\ \forall j \text{ (and zero otherwise)}$$
b2) Optimization problem: First, we try to discover the maximum by applying the technique based on the
derivatives. The logarithm function is applied,

$$\log L(x_1,\ldots,x_n;\theta) = \log(\theta^{-n}) = -n\log(\theta),$$

and the first condition leads to a useless equation:

$$0 = \frac{d}{d\theta}\log L(x_1,\ldots,x_n;\theta) = -n\,\frac{1}{\theta} \;\rightarrow\; ?$$

Then, we realize that global minima and maxima cannot always be found through the derivatives (only if they
are also local extremes). In fact, it is easy to see that the function L monotonically decreases with θ, and
therefore monotonically increases when θ decreases (this pattern, or just the opposite, tends to happen when the
probability function changes monotonically with the parameter, e.g. when the parameter appears only once in
the expression). As a consequence, L has no local extreme values. Since, on the other hand, L increases as θ
decreases but the constraint $0\le x_j\le\theta,\ \forall j$, must hold, the likelihood is maximized at the smallest
admissible value:

$$\theta_0 = \max_j\{x_j\}$$
b3) The estimator: It is obtained after substituting the lower-case letters xj (numbers representing THE
sample we have) by upper-case letters Xj (random variables representing ANY possible sample we may have):
$$\hat{\theta}_{ML} = \max_j\{X_j\}$$
C) Estimation of μ and σ²

To obtain estimators of the mean, we take into account that $\mu = E(X) = \dfrac{\theta}{2}$ and apply the plug-in principle:

$$\hat{\mu}_M = \frac{\hat{\theta}_M}{2} = \frac{2\bar{X}}{2} = \bar{X} \qquad\qquad \hat{\mu}_{ML} = \frac{\hat{\theta}_{ML}}{2} = \frac{\max_j\{X_j\}}{2}$$

To obtain estimators of the variance, since $\sigma^2 = Var(X) = \dfrac{\theta^2}{12}$:

$$\hat{\sigma}^2_M = \frac{\hat{\theta}_M^2}{12} = \frac{(2\bar{X})^2}{12} = \frac{\bar{X}^2}{3} \qquad\qquad \hat{\sigma}^2_{ML} = \frac{\hat{\theta}_{ML}^2}{12} = \frac{\left(\max_j\{X_j\}\right)^2}{12}$$
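The two estimators of θ can be compared on simulated data with R (a sketch; the value θ = 4 and the object names are arbitrary choices of ours):

set.seed(1)
theta = 4
x = runif(50, min = 0, max = theta)
2 * mean(x)   # method-of-moments estimate of theta
max(x)        # maximum likelihood estimate of theta (always <= theta)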
Conclusion: For the uniform distribution, both methods provide different estimators of the parameter and
hence of the mean. The quality of the estimators obtained should be studied.
My notes:
Exercise 6pe-m

A random variable X follows the distribution with density function

$$f(x;\theta) = \begin{cases} 0 & \text{if } x < 3 \\[4pt] \dfrac{1}{\theta}\,e^{-\frac{x-3}{\theta}} & \text{if } x \ge 3 \end{cases}$$

for which $E(X) = 3+\theta$ and $E(X^2) = 2\theta^2+6\theta+9$. Given a simple random sample X = (X1,...,Xn), apply the
method of the moments and the maximum likelihood method to find estimators of the parameter θ, and use
them to build estimators of the mean and the variance of X.

Discussion: This statement is mathematical. The random variable X is supposed to be dimensionless. The
probability function and the first two moments are given, which is enough to apply the two methods. In the
last step, the plug-in principle will be applied.
Note: If E(X) had not been given in the statement, it could have been calculated by applying integration by parts (since polynomials and
exponentials are functions "of different type"):

$$E(X) = \int_{-\infty}^{+\infty} x f(x;\theta)\,dx = \int_3^\infty x\,\frac{1}{\theta}e^{-\frac{x-3}{\theta}}\,dx = \left[-x\,e^{-\frac{x-3}{\theta}}\right]_3^\infty - \int_3^\infty 1\cdot\left(-e^{-\frac{x-3}{\theta}}\right)dx$$

$$= \left[-x\,e^{-\frac{x-3}{\theta}} - \theta\,e^{-\frac{x-3}{\theta}}\right]_3^\infty = \left[-(x+\theta)\,e^{-\frac{x-3}{\theta}}\right]_3^\infty = 3+\theta.$$

That $\int u(x)\,v'(x)\,dx = u(x)\,v(x) - \int u'(x)\,v(x)\,dx$ has been used with
• $u = x \rightarrow u' = 1$
• $v' = \frac{1}{\theta}e^{-\frac{x-3}{\theta}} \rightarrow v = \int \frac{1}{\theta}e^{-\frac{x-3}{\theta}}\,dx = -e^{-\frac{x-3}{\theta}}$
On the other hand, $e^x$ changes faster than $x^k$ for any k. To calculate E(X²):

$$E(X^2) = \int_{-\infty}^{+\infty} x^2 f(x;\theta)\,dx = \int_3^\infty x^2\,\frac{1}{\theta}e^{-\frac{x-3}{\theta}}\,dx = \left[-x^2\,e^{-\frac{x-3}{\theta}}\right]_3^\infty + 2\int_3^\infty x\,e^{-\frac{x-3}{\theta}}\,dx$$

$$= (3^2-0) + 2\theta\int_3^\infty x\,\frac{1}{\theta}e^{-\frac{x-3}{\theta}}\,dx = 9 + 2\theta\mu = 9 + 2\theta(3+\theta) = 2\theta^2+6\theta+9.$$

Integration by parts has been applied again: $\int u(x)\,v'(x)\,dx = u(x)\,v(x) - \int u'(x)\,v(x)\,dx$ with
• $u = x^2 \rightarrow u' = 2x$
• $v' = \frac{1}{\theta}e^{-\frac{x-3}{\theta}} \rightarrow v = \int \frac{1}{\theta}e^{-\frac{x-3}{\theta}}\,dx = -e^{-\frac{x-3}{\theta}}$
Again, $e^x$ changes faster than $x^k$ for any k.
a1) Population and sample moments: There is only one parameter, so one equation suffices. The first-order
moments of the model X and the sample x are, respectively,

$$\mu_1(\theta) = E(X) = \theta+3 \quad\text{and}\quad m_1(x_1,x_2,\ldots,x_n) = \frac{1}{n}\sum_{j=1}^n x_j = \bar{x}$$

a2) System of equations: Since the parameter of interest θ appears in the first-order moment of X, the first
equation suffices:

$$\mu_1(\theta) = m_1(x_1,\ldots,x_n) \;\rightarrow\; \theta+3 = \bar{x} \;\rightarrow\; \theta_0 = \bar{x}-3 \qquad\text{and}\qquad \hat{\theta}_M = \bar{X}-3$$
b1) Likelihood function: For this probability distribution, the density function is $f(x;\theta) = \dfrac{1}{\theta}e^{-\frac{x-3}{\theta}}$, so

$$L(x_1,\ldots,x_n;\theta) = \prod_{j=1}^n f(x_j;\theta) = \frac{1}{\theta}e^{-\frac{x_1-3}{\theta}}\cdots\frac{1}{\theta}e^{-\frac{x_n-3}{\theta}} = \frac{1}{\theta^n}\,e^{-\frac{1}{\theta}\sum_{j=1}^n (x_j-3)}$$
b2) Optimization problem: The logarithm function is applied to make calculations easier,

$$\log L(x_1,\ldots,x_n;\theta) = \log(\theta^{-n}) - \frac{1}{\theta}\sum_{j=1}^n (x_j-3) = -n\log(\theta) - \frac{1}{\theta}\sum_{j=1}^n (x_j-3)$$
The population distribution has only one parameter, so a one-dimensional function must be maximized. To find
the local or relative extreme values, the necessary condition is:

$$0 = \frac{d}{d\theta}\log L(x_1,\ldots,x_n;\theta) = -\frac{n}{\theta} + \frac{1}{\theta^2}\sum_{j=1}^n (x_j-3) \;\rightarrow\; \frac{n}{\theta} = \frac{1}{\theta^2}\sum_{j=1}^n (x_j-3)$$

$$\rightarrow\; \theta = \frac{1}{n}\sum_{j=1}^n (x_j-3) = \frac{1}{n}\sum_{j=1}^n x_j - \frac{1}{n}\sum_{j=1}^n 3 = \bar{x}-3 \;\rightarrow\; \theta_0 = \bar{x}-3$$
To verify that the only candidate is a (local) maximum, the sufficient condition is:

$$\frac{d^2}{d\theta^2}\log L(x_1,\ldots,x_n;\theta) = \frac{d}{d\theta}\left[-\frac{n}{\theta} + \frac{1}{\theta^2}\sum_{j=1}^n (x_j-3)\right] = \frac{n}{\theta^2} - \frac{2}{\theta^3}\sum_{j=1}^n (x_j-3) \;\overset{?}{<}\; 0$$

The first term is always positive but the second is always negative, so we had better substitute the candidate
$\theta_0 = \bar{x}-3$, for which $\sum_{j=1}^n (x_j-3) = n(\bar{x}-3) = n\theta_0$:

$$\left.\frac{d^2}{d\theta^2}\log L\right|_{\theta_0} = \frac{n}{\theta_0^2} - \frac{2}{\theta_0^3}\,n\theta_0 = \frac{n}{\theta_0^2} - \frac{2n}{\theta_0^2} = -\frac{n}{\theta_0^2} < 0$$
b3) The estimator:
$$\hat{\theta}_{ML} = \bar{X}-3$$
C) Estimation of μ and σ²
c1) For the mean: By using the expression of μ and the plug-in principle,

From the method of the moments: $\hat{\mu}_M = \hat{\theta}_M + 3 = \bar{X}-3+3 = \bar{X}$.

From the maximum likelihood method, as the same estimator was obtained: $\hat{\mu}_{ML} = \bar{X}$.

c2) For the variance: We must write it in terms of the first two moments of X,

$$\sigma^2 = Var(X) = E(X^2)-E(X)^2 = 2\theta^2+6\theta+9-(\theta+3)^2 = 2\theta^2+6\theta+9-\theta^2-6\theta-9 = \theta^2$$

Then,

From the method of the moments: $\hat{\sigma}^2_M = \hat{\theta}_M^2 = (\bar{X}-3)^2 = \bar{X}^2-6\bar{X}+9$.

From the maximum likelihood method: $\hat{\sigma}^2_{ML} = \hat{\theta}_{ML}^2 = (\bar{X}-3)^2 = \bar{X}^2-6\bar{X}+9$.
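The estimates can be computed on simulated data with R; a sample from this model is 3 plus an exponential variable with mean θ (a sketch; the value θ = 2 and the object names are arbitrary choices of ours):

set.seed(1)
theta = 2
x = 3 + rexp(100, rate = 1/theta)   # a hypothetical sample from the shifted model
theta_hat = mean(x) - 3             # estimate of theta (both methods)
c(theta_hat, theta_hat^2)           # estimates of theta and of sigma^2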
Conclusion: For this model, the two methods provide the same estimator. We have used the estimator of θ
to obtain estimators of μ and σ². The quality of the estimator obtained should be studied, especially if the two
methods had provided different estimators.
In fact, the distribution with probability function $f(x;\theta) = \frac{1}{\theta}e^{-\frac{x-\delta}{\theta}}$, $x > \delta$ (and zero elsewhere), is termed the
two-parameter exponential distribution. It is a translation of size δ of the usual exponential distribution. A
particular, simple case is obtained for θ = 1 and δ = 0, since then $f(x) = e^{-x}$, $x > 0$.
My notes:
Exercise 7pe-m

A random quantity X is supposed to follow a distribution whose probability function is, for θ > 0,

$$f(x;\theta) = \begin{cases} \dfrac{3x^2}{\theta^3} & \text{if } 0\le x\le\theta \\[4pt] 0 & \text{otherwise} \end{cases}$$

Apply the method of the moments and the maximum likelihood method to find estimators of θ, and use them
to build estimators of the mean and the variance of X.
Discussion: This statement is mathematical. The random variable X is supposed to be dimensionless. The
probability function and the first two moments are given, which is enough to apply the two methods. In the
last step, the plug-in principle will be applied.
Note: If E(X) had not been given in the statement, it could have been calculated by integrating:

$$E(X) = \int_{-\infty}^{+\infty} x f(x;\theta)\,dx = \int_0^\theta x\,\frac{3x^2}{\theta^3}\,dx = \frac{3}{\theta^3}\left[\frac{x^4}{4}\right]_0^\theta = \frac{3}{4}\theta$$

On the other hand, if Var(X) had not been given in the statement, it could have been calculated by using a property and integrating:

$$E(X^2) = \int_{-\infty}^{+\infty} x^2 f(x;\theta)\,dx = \int_0^\theta x^2\,\frac{3x^2}{\theta^3}\,dx = \frac{3}{\theta^3}\left[\frac{x^5}{5}\right]_0^\theta = \frac{3}{5}\theta^2.$$

Now,

$$\mu = E(X) = \frac{3}{4}\theta \quad\text{and}\quad \sigma^2 = Var(X) = E(X^2)-E(X)^2 = \frac{3}{5}\theta^2 - \left(\frac{3}{4}\theta\right)^2 = \left(\frac{3}{5}-\frac{9}{16}\right)\theta^2 = \frac{3}{80}\theta^2.$$
A) Method of the moments

a1) Population and sample moments: There is only one parameter, so one equation suffices. The first-order
moments of the model X and the sample x are, respectively,

$$\mu_1(\theta) = E(X) = \frac{3}{4}\theta \quad\text{and}\quad m_1(x_1,\ldots,x_n) = \frac{1}{n}\sum_{j=1}^n x_j = \bar{x}$$

a2) System of equations: Since the parameter of interest θ appears in the first-order moment of X, the first
equation suffices:

$$\mu_1(\theta) = m_1(x_1,x_2,\ldots,x_n) \;\rightarrow\; \frac{3}{4}\theta = \frac{1}{n}\sum_{j=1}^n x_j = \bar{x} \;\rightarrow\; \theta_0 = \frac{4}{3}\bar{x}$$
a3) The estimator:

$$\hat{\theta}_M = \frac{4}{3}\bar{X}$$
3
B) Maximum likelihood method

b1) Likelihood function: For this distribution,

$$L(x_1,\ldots,x_n;\theta) = \prod_{j=1}^n \frac{3x_j^2}{\theta^3} = \frac{3^n\prod_{j=1}^n x_j^2}{\theta^{3n}} \quad\text{when } 0\le x_j\le\theta,\ \forall j \text{ (and zero otherwise)}$$

b2) Optimization problem: Now, if we try to find the maximum by looking at the first-order derivative of the
logarithm, a useless equation is obtained:

$$0 = \frac{d}{d\theta}\log L(x_1,\ldots,x_n;\theta) = -3n\,\frac{1}{\theta} \;\rightarrow\; ?$$

Then, we realize that global minima and maxima cannot in general be found through the derivatives (only if
they are also local). It is easy to see that the function L monotonically increases when θ decreases (this pattern,
or just the opposite, tends to happen when the probability function changes monotonically with the parameter,
e.g. when the parameter appears only once in the expression). As a consequence, L has no local extreme
values. On the other hand, $0\le x_j\le\theta,\ \forall j$, so

$$\theta_0 = \max_j\{x_j\} \qquad\text{and}\qquad \hat{\theta}_{ML} = \max_j\{X_j\}$$
C) Estimation of μ and σ²

c1) For the mean: By using the expression of μ and the plug-in principle,

From the method of the moments: $\hat{\mu}_M = \frac{3}{4}\hat{\theta}_M = \frac{3}{4}\cdot\frac{4}{3}\bar{X} = \bar{X}$.

From the maximum likelihood method: $\hat{\mu}_{ML} = \frac{3}{4}\hat{\theta}_{ML} = \frac{3}{4}\max_j\{X_j\}$.

c2) For the variance: By using that principle again,

From the method of the moments: $\hat{\sigma}^2_M = \frac{3}{80}\hat{\theta}_M^2 = \frac{3}{80}\left(\frac{4}{3}\bar{X}\right)^2 = \frac{1}{15}\bar{X}^2$.
From the maximum likelihood method: $\hat{\sigma}^2_{ML} = \frac{3}{80}\hat{\theta}_{ML}^2 = \frac{3}{80}\left(\max_j\{X_j\}\right)^2$.
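The two estimators can be compared on simulated data with R; since $F(x) = (x/\theta)^3$ on $[0,\theta]$, a sample can be generated by the inverse-transform method as $x = \theta u^{1/3}$ (a sketch; the value θ = 2 and the object names are arbitrary choices of ours):

set.seed(1)
theta = 2
x = theta * runif(100)^(1/3)   # inverse-transform sampling: F(x) = (x/theta)^3
(4/3) * mean(x)                # method-of-moments estimate of theta
max(x)                         # maximum likelihood estimate of theta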
Conclusion: For this model, the two methods provide different estimators. The quality of the estimators
obtained should be studied. We have used the estimator of θ to obtain estimators of μ and σ2.
My notes:
Remark 5pe: We do not usually use the definition of the mean square error but the result at the end of the following equalities:

$$MSE(\hat{\theta}) = E([\hat{\theta}-\theta]^2) = E([\hat{\theta}-E(\hat{\theta})+E(\hat{\theta})-\theta]^2) = E([\hat{\theta}-E(\hat{\theta})]^2) + [E(\hat{\theta})-\theta]^2 + 2\,E\big([\hat{\theta}-E(\hat{\theta})]\cdot[E(\hat{\theta})-\theta]\big)$$

$$= Var(\hat{\theta}) + b(\hat{\theta})^2 + 2\,[E(\hat{\theta})-\theta]\cdot\big(E(\hat{\theta})-E(\hat{\theta})\big) = Var(\hat{\theta}) + b(\hat{\theta})^2$$
Remark 6pe: To study the consistency in probability we have been taught a sufficient, but not necessary, condition that is
equivalent to the consistency in mean of order two (managing the definition is quite complex). Thus, this type of consistency is
proved when the condition is fulfilled, which is sufficient, but not necessary, for the consistency in probability. By using
Chebyshev's inequality:

$$P(|\hat{\theta}-\theta| \ge \epsilon) \le \frac{E((\hat{\theta}-\theta)^2)}{\epsilon^2} = \frac{MSE(\hat{\theta})}{\epsilon^2} \;\rightarrow\; \lim_{n\to\infty} P(|\hat{\theta}-\theta| \ge \epsilon) \le \frac{\lim_{n\to\infty} MSE(\hat{\theta})}{\epsilon^2}$$

If the sufficient condition is not fulfilled, the estimator under study is not consistent in mean of order two, but it can still be
consistent in probability; this type of consistency should then be studied in a different way. Additionally, since $MSE(\hat{\theta})$, $b(\hat{\theta})^2$
and $Var(\hat{\theta})$ are nonnegative, the mean square error is zero if and only if the other two are zero at the same time, and vice versa.
The same happens for their limits. That is why we are allowed to split the limit of the mean square error into two limits.
Exercise 1pe-p
The efficiency (in lumens per watt, u) of light bulbs of a certain type has a population mean of 9.5u and a
standard deviation of 0.5u, according to production specifications. The specifications for a room in which
eight of these bulbs (the simple random sample) are to be installed call for the average efficiency of the eight
bulbs to exceed 10u. Find the probability that this specification for the room will be met, assuming that
efficiency measurements are normally distributed.
(From Mathematical Statistics with Applications, Mendenhall, W., D.D. Wackerly and R.L. Scheaffer, Duxbury Press.)
Identification of the variable and selection of the statistic: The variable is the efficiency of the
light bulbs, while the estimator is the sample mean of eight elements. Since the population is normal and the
two population parameters are known, we will consider the (dimensionless) statistic:

$$T(\mathbf{X};\mu) = \frac{\bar{X}-\mu}{\sqrt{\dfrac{\sigma^2}{n}}} \sim N(0,1)$$
2
Rewriting the event: Although in this case the sampling distribution of X is known, as X̄ ∼ N μ , σ ,
we need to standardize before consulting the table of the standard normal distribution:
(√ ) ( ) ( )
X
̄ −μ 10−μ 10−9.5 0.5 √ 8
̄ > 10)=P
P(X > =P T > =P T > =P ( T > √ 8) =0.0023
√ √ √ 0.5 2
2 2
σ σ 0.52
n n 8
where in this case the language R has been used:

1 - pnorm(sqrt(8), 0, 1)   # [1] 0.002338867
Conclusion: The production specifications will be met, for the room mentioned, with a probability of
0.0023, that is, they will hardly be met.
My notes:
Exercise 2pe-p
When a production process is working properly, the resistance of the components follows a normal
distribution with standard deviation 4.68u. A simple random sample with four components is taken. What is
the probability that the sample quasivariance will be bigger than 30u2?
Discussion: In this exercise, the supposition that the normal distribution reasonably explains the variable
resistance should be evaluated by using proper statistical techniques. The question involves S2. Again, it is
necessary to make the proper statistic appear, in order to use its sampling distribution.
Search for a known distribution: The quantity required is P(S² > 30). To calculate the probability of an
event, we need to know the distribution of the random quantity involved. In this case, we do not know the
sampling distribution of S², but since R follows a normal distribution we are allowed to use

$$T = \frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1} \qquad\text{so}\qquad P(S^2 > 30) = P\left(\frac{(n-1)S^2}{\sigma^2} > \frac{(4-1)\cdot 30}{4.68^2}\right) = P(T > 4.11)$$

Table of the χ² distribution: Since n−1 = 4−1 = 3, it is necessary to look at the third row.
The probabilities in the table are given for events of the form $P(T < x)$ (or $P(T \le x)$, as the distribution is
continuous), and therefore the complementary of the event must be considered:

$$P(T > 4.11) = 1 - P(T \le 4.11) = 1 - 0.75 = 0.25$$
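The same probability can be computed directly with R:

1 - pchisq((4 - 1) * 30 / 4.68^2, df = 3)   # about 0.25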
Conclusion: The probability of the event is 0.25. This means that S2 will sometimes take a value larger than
30u2, when evaluated at specific data x coming from the mentioned distribution.
My notes:
Exercise 3pe-p
A simple random sample of 270 homes was taken from a large population of older homes to estimate the
proportion of homes with unsafe wiring. If, in fact, 20% of homes have unsafe wiring, what is the probability
that the sample proportion will be between 16% and 24%?
Hint: Since probabilities and proportions are measured in a 0-to-1 scale, write all quantities in this scale.
(From Statistics for Business and Economics, Newbold, P., W.L. Carlson and B.M. Thorne, Pearson.)
LINGUISTIC NOTE (From: The Careful Writer: A Modern Guide to English Usage. Bernstein, T.M. Atheneum)
home, house. It is a tribute to the unquenchable sentimentalism of users of English that one of the matters of usage that seem to agitate
them the most is the use of home to designate a structure designed for residential purposes. Their contention is that what the builder erects
is a house and that the occupants then fashion it into a home.
That is, or at least was, basically true, but the distinction has become blurred. Nor is this solely the doing of the real estate
operators. They do, indeed, lure prospective buyers not with the thought of mere masonry but with glowing picture of comfort,
congeniality, and family collectivity that make a house into a home. But the prospective buyers are their co-conspirators; they, too, view
the premises not as a heap of stone and wood but as a potential abode.
There may be areas in which the words are not used interchangeably. In legal or quasi-legal terminology we speak of a “house and
lot,” not a “home and lot.” The police and fire departments usually speak of a robbery or a fire in a house, not a home, at Main Street and
First Avenue. And the individual most often buys a home, but sells his house (there, apparently, speaks sentiment again). But in most
areas the distinction between the words has become obfuscated. When a flood or a fire destroys a community, it wipes out not merely
houses but homes as well, and homes has come to be accepted in this sense. No one would discourage the sentimentalists from trying to
pry the two words apart, but it would be rash to predict much success for them.
Discussion: The information of this "real-world study" must be translated into the mathematical language.
Since there are two possible situations, each home can be "modeled" by using a Bernoulli variable. Although
the population is not normally distributed, the sample size is large enough to apply asymptotic results.
Identification of the variable and selection of the statistic: The variable having unsafe wiring can
take two possible values: 0 (not having unsafe wiring) and 1 (having it, if one wants to register or count this
fact). The theoretical proportion of older homes with unsafe wiring is known: η = 0.20 (20%). For this
framework, a large sample from a Bernoulli population with parameter η, we select the dimensionless,
asymptotic statistic:

$$T(\mathbf{X};\eta) = \frac{\hat{\eta}-\eta}{\sqrt{\dfrac{\eta(1-\eta)}{n}}} \;\overset{d}{\longrightarrow}\; N(0,1)$$
Rewriting the event: We are asked for the probability $P(0.16 < \hat{\eta} < 0.24)$, but to calculate it we need to
rewrite the event until making T appear:

$$P(0.16 < \hat{\eta} < 0.24) = P\left(\frac{0.16-\eta}{\sqrt{\frac{\eta(1-\eta)}{n}}} < \frac{\hat{\eta}-\eta}{\sqrt{\frac{\eta(1-\eta)}{n}}} < \frac{0.24-\eta}{\sqrt{\frac{\eta(1-\eta)}{n}}}\right)$$

$$= P\left(T < \frac{0.24-0.20}{\sqrt{\frac{0.20(1-0.20)}{270}}}\right) - P\left(T \le \frac{0.16-0.20}{\sqrt{\frac{0.20(1-0.20)}{270}}}\right) = P(T < 1.64) - P(T \le -1.64)$$
(In these calculations, we have standardized and then decomposed, but it is also possible to decompose and
then standardize.) Now, let us assume that we have a table of the standard normal distribution including
positive quantiles only. By using a simple plot of the density function of this distribution, it is easy to see
(look at the areas) that for the second probability $P(T \le -1.64) = P(T \ge +1.64) = 1 - P(T < +1.64)$, so

$$P(T < 1.64) - P(T \le -1.64) = P(T < 1.64) - [1 - P(T < 1.64)] = 2\cdot P(T < 1.64) - 1 = 2\cdot 0.9495 - 1 = 0.90.$$
Alternatively, by using the language R:

pnorm(1.64, 0, 1) - pnorm(-1.64, 0, 1)   # [1] 0.8989948
Conclusion: The probability of the event is 0.90, which means that the sample proportion of older homes
with unsafe wiring, calculated from the sample X = (X1,...,X270), will take a value between 0.16 and 0.24 with
this probability. As a percentage: the proportion of the 270 homes with unsafe wiring will be between 16%
and 24% with 90% certainty.
My notes:
Exercise 4pe-p
Simple random samples X = (X1,...,X11) and Y = (Y1,...,Y6) are taken from two independent populations

$$X \sim N(\mu_X = 1,\; \sigma_X^2 = 1) \quad\text{and}\quad Y \sim N(\mu_Y = 2,\; \sigma_Y^2 = 0.5)$$

Calculate or find:
(1) The probability $P(S_Y^2 \le 1.5)$.
(2) The value c such that $P(\bar{X} > c) = 0.25$.
(3) The probability $P(\bar{X} - 0.1 > 0.1 + \bar{Y})$.
(4) The value c such that $P(S_X^2/S_Y^2 \le c) = 0.9$.
(*) The probability $P(\bar{X} - 0.1 > 0.1 - \bar{Y})$.
Discussion: There are two independent normal populations whose parameters are known. The variances, not
the standard deviations, are given. It is required to calculate probabilities or find quantiles for events involving
the sample means and the sample quasivariances. In the first two sections, only one of the populations is
involved. The sample sizes are 11 and 6, respectively. The variables X and Y are dimensionless, and so are both
sides of the inequalities.
(1) The event involves the estimator $S_Y^2$, which reminds us of the statistic $T = \dfrac{(n_Y-1)S_Y^2}{\sigma_Y^2} \sim \chi^2_{n_Y-1}$. Then,

$$P(S_Y^2 \le 1.5) = P\left(\frac{(n_Y-1)S_Y^2}{\sigma_Y^2} \le \frac{(n_Y-1)\cdot 1.5}{\sigma_Y^2}\right) = P\left(T \le \frac{(6-1)\cdot 1.5}{0.5}\right) = P\left(T \le \frac{5\cdot 1.5}{0.5}\right) = P(T \le 15) = 0.99$$
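With R:

pchisq(15, df = 5)   # about 0.99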
(2) The event involves $\bar{X}$, so we think about the statistic $T = \dfrac{\bar{X}-\mu_X}{\sqrt{\sigma_X^2/n_X}} \sim N(0,1)$. Then,

$$0.25 = P(\bar{X} > c) = P\left(\frac{\bar{X}-\mu_X}{\sqrt{\sigma_X^2/n_X}} > \frac{c-\mu_X}{\sqrt{\sigma_X^2/n_X}}\right) = P\left(T > \frac{c-1}{\sqrt{1/11}}\right)$$

or, equivalently,

$$1-0.25 = 0.75 = P\left(T \le \frac{c-1}{\sqrt{1/11}}\right)$$

Now, the quantile found in the table of the standard normal distribution must verify that

$$r_{0.25} = l_{0.75} = 0.674 = \frac{c-1}{\sqrt{1/11}} \;\rightarrow\; c = 0.674\sqrt{\frac{1}{11}} + 1 = 1.20$$
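With R:

qnorm(0.75) * sqrt(1/11) + 1   # about 1.20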
(3) To work with the means of two populations, we use $T = \dfrac{(\bar{X}-\bar{Y})-(\mu_X-\mu_Y)}{\sqrt{\dfrac{\sigma_X^2}{n_X}+\dfrac{\sigma_Y^2}{n_Y}}} \sim N(0,1)$, so

$$P(\bar{X}-0.1 > 0.1+\bar{Y}) = P(\bar{X}-\bar{Y} > 0.2) = P\left(\frac{(\bar{X}-\bar{Y})-(\mu_X-\mu_Y)}{\sqrt{\frac{\sigma_X^2}{n_X}+\frac{\sigma_Y^2}{n_Y}}} > \frac{0.2-(\mu_X-\mu_Y)}{\sqrt{\frac{\sigma_X^2}{n_X}+\frac{\sigma_Y^2}{n_Y}}}\right)$$

$$= P\left(T > \frac{0.2-(1-2)}{\sqrt{\frac{1}{11}+\frac{0.5}{6}}}\right) = P\left(T > \frac{1.2}{\sqrt{\frac{1}{11}+\frac{1}{12}}}\right) = P(T > 2.87) = 1-P(T \le 2.87) = 1-0.9979 = 0.0021$$
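With R:

1 - pnorm(2.87)   # about 0.0021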
(4) To work with the variances of two populations, $T = \dfrac{S_X^2/\sigma_X^2}{S_Y^2/\sigma_Y^2} = \dfrac{S_X^2\,\sigma_Y^2}{S_Y^2\,\sigma_X^2} \sim F_{n_X-1,\,n_Y-1}$ is used:

$$0.9 = P\left(\frac{S_X^2}{S_Y^2} \le c\right) = P\left(\frac{S_X^2\,\sigma_Y^2}{S_Y^2\,\sigma_X^2} \le c\,\frac{\sigma_Y^2}{\sigma_X^2}\right) = P\left(T \le c\,\frac{0.5}{1}\right) = P\left(T \le \frac{c}{2}\right)$$

The quantile found in the table of the distribution $F_{n_X-1,n_Y-1} = F_{11-1,6-1} = F_{10,5}$ is 3.30, which allows us to
find the unknown c:

$$r_{0.1} = l_{0.9} = 3.30 = \frac{c}{2} \;\rightarrow\; c = 6.60$$

qf(0.9, 10, 5)   # [1] 3.297402
(Advanced Item) In this case, allocating the two sample means on the first side of the inequality leads to

$$P(\bar{X}-0.1 > 0.1-\bar{Y}) = P(\bar{X}+\bar{Y} > 0.2)$$

We remember that

$$\bar{X} \sim N\!\left(\mu_X, \frac{\sigma_X^2}{n_X}\right) \quad\text{and}\quad \bar{Y} \sim N\!\left(\mu_Y, \frac{\sigma_Y^2}{n_Y}\right)$$

so the rules that govern the sums (and hence subtractions) of normally distributed variables imply both

$$\bar{X}-\bar{Y} \sim N\!\left(\mu_X-\mu_Y, \frac{\sigma_X^2}{n_X}+\frac{\sigma_Y^2}{n_Y}\right) \quad\text{and}\quad \bar{X}+\bar{Y} \sim N\!\left(\mu_X+\mu_Y, \frac{\sigma_X^2}{n_X}+\frac{\sigma_Y^2}{n_Y}\right)$$

(Note that in both cases the variances are added; uncertainty increases.) Although the difference is used more
frequently, to compare two populations, the sampling distribution of the sum of the sample means is also known
thanks to the rules for normal variables; alternatively, we could still use the first result by writing X̄+Ȳ = X̄−(−Ȳ)
and using that −Ȳ has mean −μ_Y and variance σ_Y²/n_Y. Either way, after standardizing,

$$T = \frac{(\bar{X}+\bar{Y})-(\mu_X+\mu_Y)}{\sqrt{\dfrac{\sigma_X^2}{n_X}+\dfrac{\sigma_Y^2}{n_Y}}} \sim N(0,1)$$

is the "mathematical tool" necessary to work with X̄+Ȳ. Now,

$$P(\bar{X}-0.1 > 0.1-\bar{Y}) = P(\bar{X}+\bar{Y} > 0.2) = P\left(T > \frac{0.2-(\mu_X+\mu_Y)}{\sqrt{\frac{\sigma_X^2}{n_X}+\frac{\sigma_Y^2}{n_Y}}}\right) = P\left(T > \frac{0.2-(1+2)}{\sqrt{\frac{1}{11}+\frac{1}{12}}}\right)$$

$$= P(T > -6.71) = 1-P(T \le -6.71) = 1$$

The quantile 6.71 is not usually in the tables of the N(0,1), so we can consider that $P(T \le -6.71) \approx 0$. Or, if
we use the programming language R:

1 - pnorm(-6.71, 0, 1)   # [1] 1
Conclusion: For each case, we have selected the appropriate statistic. After completing the expression of the
event, the statistic T appears. Then, since the (sampling) distribution of T is known, the tables can be used to
calculate probabilities or to find quantiles. In the latter case, the unknown c is found from the quantile of T.
Exercise 5pe-p
Suppose that you manage a bank where the amounts of daily deposits and daily withdrawals are given by
independent random variables with normal distributions. For deposits, the mean is ₤12,000 and the standard
deviation is ₤4,000; for withdrawals, the mean is ₤10,000 and the standard deviation is ₤5,000.
(a) For a week, calculate or bind the probability that the five withdrawals will add up to more than
₤55,000.
(b) For a particular day, calculate or bind the probability that withdrawals will exceed deposits by more
than ₤5,000.
Imagine that you are to launch a new monthly product. A prospective study indicated that profits (in million
pounds) can be modeled through the random quantity Q = (X+1)/2.325, where X follows a t distribution with
twenty degrees of freedom.
(c) For a particular month, calculate or bind the probability that profits will be smaller than ₤10⁶ (one
million pounds).
(Based on an exercise of Business Statistics, Douglas Downing and Jeffrey Clark, Barron's.)
Discussion: There are several suppositions implicit in the statement, namely: (i) the normal distribution can
reasonably be used to model the two variables of interest D and W; (ii) withdrawals and deposits are
independent; and (iii) X can reasonably be modeled by using the t distribution. These suppositions should
firstly be evaluated by using proper statistical techniques. To solve this exercise, the rules on sums and
differences of normally distributed variables must be used.
Identification of variables and distributions: If D and W represent the random variables daily sum
of deposits and daily sum of withdrawals, respectively, from the statement we have that

$$D \sim N(\mu_D = 12{,}000,\; \sigma_D^2 = 4{,}000^2) \quad\text{and}\quad W \sim N(\mu_W = 10{,}000,\; \sigma_W^2 = 5{,}000^2)$$

(in ₤ and ₤², respectively).
(a) Since the variables are measured daily, in a week we have five measurements (one for each working day).
Translation into the mathematical language: We are asked for the probability
5
P (W 1+ W 2+ W 3+W 4+ W 5 > 55,000)=P ( ∑ j =1 W j > 55,000)
Search for a known distribution: To calculate or bind this probability, we need to know the distribution of
the sum or, alternatively, to relate it to any quantity whose distribution we know. By using the rules that
govern the sums and subtractions of normal variables,
$$\sum_{j=1}^5 W_j \sim N(5\mu_W,\; 5\sigma_W^2)$$

Rewriting the event: We can easily rewrite the event in terms of the standardized version of this normal
distribution:

$$P\left(\sum_{j=1}^5 W_j > 55{,}000\right) = P\left(\frac{\sum_{j=1}^5 W_j - 5\mu_W}{\sqrt{5\sigma_W^2}} > \frac{55{,}000-5\mu_W}{\sqrt{5\sigma_W^2}}\right) = P\left(Z > \frac{55{,}000-50{,}000}{\sqrt{5\cdot 5{,}000^2}}\right) = P(Z > 0.4472)$$

Consulting the table: $P(Z > 0.4472)$ lies between $1-0.6736 = 0.3264$ and $1-0.6700 = 0.3300$, that is, it
is around 0.33.
Alternatively, we can rewrite the event in terms of the sample mean,

$$P\left(\sum_{j=1}^5 W_j > 55{,}000\right) = P\left(\frac{1}{5}\sum_{j=1}^5 W_j > \frac{55{,}000}{5}\right) = P(\bar{W} > 11{,}000)$$

and use that

$$\bar{W} = \frac{1}{5}\sum_{j=1}^5 W_j \sim N\!\left(\mu_W, \frac{\sigma_W^2}{5}\right) \;\rightarrow\; \frac{\bar{W}-\mu_W}{\sqrt{\sigma_W^2/5}} \sim N(0,1)$$
(b) Translation into the mathematical language: We are asked for the probability P(W > D + 5,000).

Search for a known distribution: To calculate or bind this probability, we rewrite the event until all random
quantities are on the left side of the inequality:

$$P(W > D+5{,}000) = P(W-D > 5{,}000)$$

Now we need to know the distribution of W − D or, alternatively, of a quantity involving this difference. By
again using the rules that govern the sums and differences of normal variables, it holds that

$$W-D \sim N(\mu_W-\mu_D,\; \sigma_W^2+\sigma_D^2) = N(-2{,}000,\; 5{,}000^2+4{,}000^2)$$

Rewriting the event: We can easily express the event in terms of the standardized version of W − D:

$$P(W-D > 5{,}000) = P\left(\frac{(W-D)-(\mu_W-\mu_D)}{\sqrt{\sigma_W^2+\sigma_D^2}} > \frac{5{,}000-(-2{,}000)}{\sqrt{25\cdot 10^6+16\cdot 10^6}}\right) = P\left(Z > \frac{7\cdot 10^3}{\sqrt{41\cdot 10^6}}\right) = P(Z > 1.0932)$$
Consulting the table: We can bind the probability as follows:

$$P(Z > 1.09) > P(Z > 1.0932) > P(Z > 1.10)$$
$$1-P(Z \le 1.09) > P(Z > 1.0932) > 1-P(Z \le 1.10)$$
$$1-0.8621 > P(Z > 1.0932) > 1-0.8643$$
$$0.1379 > P(Z > 1.0932) > 0.1357$$

Then, $0.1357 < P(W > D+5{,}000) < 0.1379$.
(c) Rewriting the event: The event can easily be rewritten in terms of this known distribution:

$$P\left(\frac{X+1}{2.325} < 1\right) = P(X+1 < 2.325) = P(X < 2.325-1) = P(X < 1.325)$$

Consulting the table: Finally, it is enough to consult the table of the t distribution. The quantity 1.325 is in
our table of lower-tail probabilities, so $P(X < 1.325) = 0.900$.
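The three probabilities can also be computed with R:

1 - pnorm((55000 - 5*10000) / sqrt(5 * 5000^2))              # (a): about 0.33
1 - pnorm((5000 - (10000 - 12000)) / sqrt(5000^2 + 4000^2))  # (b): about 0.14
pt(1.325, df = 20)                                           # (c): about 0.90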
Conclusion: For a week, the probability that the five withdrawals will add up to more than ₤55,000 is
around 0.33. For a particular day, the probability that withdrawals will exceed deposits by more than ₤5,000 is
around 0.13. For a particular month, the probability that profits will be smaller than one million pounds is
0.9, that is, quite high.
My notes:
Exercise 6pe-p
To study the mean of a population variable X, μ = E(X), a simple random sample of size n is considered.
Imagine that we do not trust the first and the last data, so we think about using the statistic
$$\tilde{X} = \frac{1}{n-2}\sum_{j=2}^{n-1} X_j = \frac{1}{n-2}\left(X_2+X_3+\cdots+X_{n-1}\right) = \frac{X_2+X_3+\cdots+X_{n-1}}{n-2}$$
Calculate the expectation and the variance of this statistic. Calculate the mean square error (MSE) and its
limit when n tends to infinity. Study the consistency. Compare the previous error with that of the ordinary
sample mean.
Discussion: The statement of this exercise is mathematical. Here we are interested in the mean. The quantity
X is dimensionless. We cannot apply the definitions directly; the mean and the variance of X̃ must be written
in terms of the mean and the variance of X by applying the basic properties of these measures.
Expectation and variance: The basic properties of the mean and the variance are applied:

$$E(\tilde{X}) = E\left(\frac{1}{n-2}(X_2+X_3+\cdots+X_{n-1})\right) = \frac{1}{n-2}\big(E(X)+\cdots+E(X)\big) = \frac{1}{n-2}(n-2)\mu = \mu$$

$$Var(\tilde{X}) = Var\left(\frac{1}{n-2}(X_2+X_3+\cdots+X_{n-1})\right) = \frac{1}{(n-2)^2}\sum_{j=2}^{n-1} Var(X_j) = \frac{1}{(n-2)^2}(n-2)\sigma^2 = \frac{\sigma^2}{n-2}$$
When n increases, that is, when the sample consists of more and more data, the limits are, respectively:

$$\lim_{n\to\infty} E(\tilde{X}) = \lim_{n\to\infty}\mu = \mu \quad\text{and}\quad \lim_{n\to\infty} Var(\tilde{X}) = \lim_{n\to\infty}\frac{\sigma^2}{n-2} = 0$$
Comparison of errors:

$$MSE(\tilde{X}) = \frac{\sigma^2}{n-2} \qquad\qquad MSE(\bar{X}) = \frac{\sigma^2}{n}$$

Since σ² appears in the two positive quantities, by looking at the coefficients it is easy to see that

$$MSE(\bar{X}) < MSE(\tilde{X})$$
(for n larger than 2). This result is due to the fact that the sample mean uses all the data available, though only
the number of data—not their quality, since all of them are supposed to follow the same distribution—is
considered in calculating the mean square error. In the limit, –2 is negligible. We can plot the coefficients
(they are also the mean square errors when σ=1).
# Grid of values for 'n'
n = seq(from=3,to=10,by=1)
# The two sequences of coefficients
coeff1 = 1/(n-2)
coeff2 = 1/n
# The plot
allValues = c(coeff1, coeff2)
yLim = c(min(allValues), max(allValues));
x11(); par(mfcol=c(1,3))
plot(n, coeff1, xlim=c(min(n),max(n)), ylim = yLim, xlab=' ', ylab=' ', main='Coefficients 1', type='l')
plot(n, coeff2, xlim=c(min(n),max(n)), ylim = yLim, xlab=' ', ylab=' ', main='Coefficients 2', type='b')
plot(n, coeff1, xlim=c(min(n),max(n)), ylim = yLim, xlab=' ', ylab=' ', main='All coefficients', type='l')
points(n, coeff2, type='b')
Conclusion: X̃ is a consistent estimator of μ, so it is appropriate for estimating μ. Nevertheless, when nothing
suggests removing data, it is better to keep them all in the sample.
Advanced theory: The estimator in the statement is the usual sample mean when the sample has n–2 data
instead of n (leaving out these two data can be seen as a sort of data treatment implemented in the method, not
in the previous analysis of data). When any of the two left out data is not trustable, using this estimator makes
sense; otherwise, it does not exploit the information available efficiently. On the other hand, the sample mean
can be affected by tiny or huge values (outliers). To make the sample mean robust, this estimator is sometimes
considered after ordering the data from the smallest to the largest; if X(j) is the j-th datum in the sample already
reordered:
$$\tilde{X} = \frac{1}{n-2}\sum_{j=2}^{n-1} X_{(j)} = \frac{1}{n-2}\left(X_{(2)}+X_{(3)}+\cdots+X_{(n-1)}\right)$$
This new robust estimator of the population mean μ is called trimmed sample mean, and any number of data
can be left out—not only two.
Exercise 7pe-p
A population variable X follows the χ2 distribution with κ degrees of freedom. We consider a statistic T that
uses the information contained in the simple random sample X = (X1, X2,...,Xn). If
T ( X )=T ( X 1 , X 2 , ... , X n )=2 X̄ −1 ,
calculate its expectation and variance. Calculate the mean square error of T. As an estimator of twice the
mean of the population law, is T a consistent estimator?
Hint: If X follows the χ2 distribution with κ degrees of freedom, μ = E(X) = κ and σ2 = Var(X) = 2κ.
Discussion: Even though a population is mentioned, this statement is mathematical. To calculate the value of
these two properties of the sampling distribution of T, we have to apply the general properties of the
expectation and the variance. The knowledge about the distribution of X will be used in the last steps. T is a
dimensionless quantity. The mean square error is defined in terms of these quantities.
Expectation or mean:

$$E(T(\mathbf{X})) = E\left(2\left[\frac{1}{n}\sum_{j=1}^n X_j\right]-1\right) = \frac{2}{n}E\left(\sum_{j=1}^n X_j\right) - E(1) = \frac{2}{n}\sum_{j=1}^n E(X_j) - 1 = \frac{2}{n}\,n\,E(X) - 1 = 2\kappa-1$$

(since μ = E(X) = κ).
Variance:
Var(T(X)) = Var( 2·(1/n) Σ_{j=1}^n X_j − 1 ) = Var( (2/n) Σ_{j=1}^n X_j ) = (4/n²) Var( Σ_{j=1}^n X_j ) = (4/n²) Σ_{j=1}^n Var(X_j) = (4/n²) n Var(X) = 8κ/n

(By the independence of the X_j —simple random sample— and since σ² = Var(X) = 2κ.)
Mean square error: Since b(T) = E(T) − 2E(X) = (2κ−1) − 2κ = −1, then

MSE(T) = b(T)² + Var(T) = 1 + 8κ/n → 1 (as n → ∞)
Consistency: Although the variance of T tends to zero when n increases, the bias does not (thus, T is asymptotically biased). Hence, the mean square error does not tend to zero either, and nothing can be said about the consistency in probability in this way (although we can say that T is not consistent in mean of order two).
Conclusion: Since the mean square error tends to 1, in general T is not a “good” estimator of 2μ even for
many data.
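A short Monte Carlo sketch of this behaviour (κ, the sample sizes and the number of replications are arbitrary choices, not part of the statement):
# Monte Carlo: the mean square error of T = 2*mean(X) - 1 tends to 1 + 8*kappa/n -> 1
kappa = 5; M = 10000
for (n in c(10, 100, 1000)) {
  T = replicate(M, 2*mean(rchisq(n, df=kappa)) - 1)
  print(c(n, mean((T - 2*kappa)^2), 1 + 8*kappa/n))   # empirical vs theoretical MSE
}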
My notes:
Discussion: This statement is basically mathematical. The relative efficiency is defined in terms of the mean
square error of the estimators.
E(μ̂₁) = E( (1/2)X₁ + (1/2)X₂ ) = (1/2)E(X₁) + (1/2)E(X₂) = (1/2)μ + (1/2)μ = μ

E(μ̂₂) = E( (1/3)X₁ + (2/3)X₂ ) = (1/3)E(X₁) + (2/3)E(X₂) = (1/3)μ + (2/3)μ = μ

Var(μ̂₁) = (1/4 + 1/4)σ² = σ²/2   Var(μ̂₂) = (1/9 + 4/9)σ² = (5/9)σ²
Conclusion: Both estimators are unbiased, while the first has smaller variance; then, the first is preferred. We have not mathematically proved that this first estimator minimizes the variance, so we cannot say that it is an efficient estimator.
Exercise 9pe-p
The mean μ = E(X) of any population can be estimated from a simple random sample of size n through X̄.
Prove that:
(a) This estimator is always consistent.
(b) For X normally distributed (normal population), this estimator is efficient.
Discussion: This statement is theoretical. The first section of this exercise needs calculations similar to
those of previous exercises. To prove the efficiency, we have to apply its definition.
(a) Consistency: The expectation of the sample mean is always—for any population—the population mean.
Nevertheless, we repeat the calculations:
E(X̄) = E( (1/n) Σ_{j=1}^n X_j ) = (1/n) E( Σ_{j=1}^n X_j ) = (1/n) Σ_{j=1}^n E(X_j) = (1/n) n E(X) = E(X) = μ
The variance of the sample mean is always—for any population—the population variance divided by n. We
repeat the calculations too:
(By the independence of the X_j —simple random sample—)

Var(X̄) = Var( (1/n) Σ_{j=1}^n X_j ) = (1/n²) Var( Σ_{j=1}^n X_j ) = (1/n²) Σ_{j=1}^n Var(X_j) = (1/n²) n Var(X) = σ²/n
The bias is b(X̄) = E(X̄) − μ = 0. We prove the consistency (in probability) by using the sufficient—but not necessary—condition (consistency in mean of order two):

lim_{n→∞} MSE(X̄) = lim_{n→∞} [ b(X̄)² + Var(X̄) ] = lim_{n→∞} [ 0 + σ²/n ] = 0
Then, it is consistent in mean of order two and therefore in probability.
(b) Efficiency: It is necessary to prove that the two conditions of the definition are fulfilled:
i. The expectation of X̄ is always μ = E(X), that is, X̄ is always an unbiased estimator of μ.
ii. X̄ has minimum variance, which happens—because of a theoretical result—when Var(X̄) attains the Cramér-Rao's lower bound

1 / ( n · E[ ( ∂log[f(X;θ)]/∂θ )² ] )

where θ = μ in this case, and f(x;θ) is the probability function of the population law where the nonrandom variable x is substituted by the random variable X (otherwise, it is not possible to talk about expectation, since f(x;θ) is not random when θ is a parameter).
The unbiasedness is proved. On the other hand, we compute the Cramér-Rao's lower bound step by step:
(1) Function (with X in place of x):

f(X;μ) = (1/√(2πσ²)) e^(−(X−μ)²/(2σ²))

(2)–(3) Logarithm and derivative:

log[f(X;μ)] = −(1/2)log(2πσ²) − (X−μ)²/(2σ²) → ∂log[f(X;μ)]/∂μ = (X−μ)/σ²

(4) Expectation:

E[ ( ∂log[f(X;μ)]/∂μ )² ] = E[ ( (X−μ)/σ² )² ] = (1/σ⁴) E[(X−μ)²] = (1/σ⁴) Var(X) = σ²/σ⁴ = 1/σ²
(5) Cramér-Rao's lower bound:

1 / ( n · E[ ( ∂log[f(X;μ)]/∂μ )² ] ) = 1 / ( n · (1/σ²) ) = σ²/n
The variance of the estimator, calculated in section (a), attains the bound and hence the estimator has minimum variance. Since both conditions are fulfilled, the efficiency is proved.
Conclusion: We have proved that the sample mean X is always—for any population—a consistent estimator
of the population mean μ. For a normal population, it is also efficient.
Advanced theory: When log[f(x;θ)] is twice differentiable with respect to θ, the Cramér-Rao's bound can
equivalently be written as
−1 / ( n · E[ ∂²log[f(X;θ)]/∂θ² ] )
Concerning the regularity conditions, Wikipedia refers (http://en.wikipedia.org/wiki/Fisher_information) to eq. (2.5.16) of Theory of Point Estimation, Lehmann, E. L. and G. Casella, 1998. Springer. Let us assume that this alternative expression can be applied; then, step (3) would be

∂²(log[f(X;μ)])/∂μ² = ∂/∂μ ( (X−μ)/σ² ) = (1/σ²)·(−1) = −1/σ²
step (4) would be
E[ ∂²log[f(X;μ)]/∂μ² ] = E[ −1/σ² ] = −1/σ²
and, finally, step (5) would be
−1 / ( n · E[ ∂²log[f(X;μ)]/∂μ² ] ) = −1 / ( n · (−1/σ²) ) = σ²/n
We would have obtained the same result with easier calculations, although the fulfillment of the regularity conditions must be verified beforehand.
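A quick numerical sketch of this result (the parameter values and sizes are arbitrary):
# Monte Carlo: the variance of the sample mean attains the Cramer-Rao bound sigma^2/n
mu = 1; sigma = 2; n = 30; M = 20000
xbar = replicate(M, mean(rnorm(n, mean=mu, sd=sigma)))
c(var(xbar), sigma^2/n)    # both values should be close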
My notes:
Discussion: This statement is mathematical. We should know the density function of the continuous uniform
distribution, although it could also be deduced from the fact that all possible values have the same probability.
The quantity X is dimensionless.
(a) Density function: For this distribution, all values have the same probability, so the density function must be a flat curve over its support. For the case θ > 2 the figure (omitted here) is similar for any other θ.
(b1) Bias: By applying a property of the sample mean and the information of the statement,

E(X̄) = E(X) = θ − 1/2 → b(X̄) = E(X̄) − θ = θ − 1/2 − θ = −1/2
→ lim_{n→∞} b(X̄) = lim_{n→∞} (−1/2) = −1/2
(It is asymptotically biased.) Since one condition of the pair is not verified, it is not necessary to check the other, and neither the fulfillment of the consistency in probability nor the opposite can be proved in this way (though the estimator is not consistent in the mean-square sense).
(c1) Unbiasedness: In the previous section it has been proved that X̄ is a biased estimator of θ. The first condition does not hold, and hence it is not necessary to check the second one. The conclusion is that X̄ is not an efficient estimator of θ.
(d1) Bias: By applying a property of the sample mean and the information of the statement,

E(θ̂) = E(X̄) + 1/2 = θ − 1/2 + 1/2 = θ → b(θ̂) = E(θ̂) − θ = θ − θ = 0 → lim_{n→∞} b(θ̂) = lim_{n→∞} 0 = 0
(d2) Variance: By applying a property of the sample mean and the information of the statement,

Var(θ̂) = Var( X̄ + 1/2 ) = Var(X̄) = Var(X)/n = 3/(4n) → lim_{n→∞} Var(θ̂) = lim_{n→∞} 3/(4n) = 0
As a conclusion, the mean square error (MSE) tends to zero and hence the proposed estimator θ̂ = X̄ + 1/2 is a consistent—in mean square error and hence in probability—estimator of θ.
Conclusion: We could prove neither the consistency nor the efficiency. Nevertheless, the bias has allowed
us to build an unbiased, consistent estimator of the parameter. The efficiency of this new estimator could be
studied, but it is not required in the statement.
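The moments used above (E(X) = θ − 1/2 and Var(X)/n = 3/(4n)) are consistent, for instance, with a continuous uniform distribution on (θ−2, θ+1); assuming that support, an assumption made here only for illustration, a simulation sketch is:
# Simulation sketch; the support (theta-2, theta+1) is an assumed example
theta = 5; n = 50; M = 10000
xbar = replicate(M, mean(runif(n, min=theta-2, max=theta+1)))
mean(xbar) - theta          # close to the bias -1/2
mean(xbar + 1/2) - theta    # close to 0: the corrected estimator is unbiased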
My notes:
Exercise 11pe-p
A population random quantity X is supposed to follow a geometric distribution. Let X = (X1,...,Xn) be a simple
random sample. By applying the factorization theorem below, find a sufficient statistic T(X) = T(X1,...,Xn) for
the parameter. Give explanations.
Discussion: The factorization theorem can be applied both to prove that a given statistic is sufficient and to
find sufficient statistics. On the other hand, for the distribution involved we know that
Likelihood function:
L(X;η) = ∏_{j=1}^n f(X_j;η) = f(X₁;η)·f(X₂;η)⋯f(X_n;η) = η(1−η)^(X₁−1) · η(1−η)^(X₂−1) ⋯ η(1−η)^(X_n−1)
Theorem (factorization): the statistic T(X) is sufficient for the parameter if and only if the likelihood function can be written as L(X;η) = g(T(X);η)·h(X), where h does not depend on the parameter.
We must try allocating each term of the likelihood function:
➔ ηⁿ depends only on the parameter, not on the data X_j. Then, it would be part of g.
➔ (1−η)^(Σ_{j=1}^n X_j − n) depends on both the parameter and the data X_j, and these two kinds of information neither are mixed nor can mathematically be separated. Then, it would be part of g too, and the only possible sufficient statistic, if the theorem holds, is T = Σ_{j=1}^n X_j.
By considering g(T(X);η) = ηⁿ·(1−η)^(Σ_{j=1}^n X_j − n) and h(X) = 1, the theorem holds and hence the statistic T(X) = Σ_{j=1}^n X_j is sufficient for studying η. The idea behind this kind of statistics is that they “summarize the important information (about the parameter)” contained in the sample. In fact, the statistic T has essentially the same information as any one-to-one transformation of it, particularly the sample mean (1/n) T(X) = (1/n) Σ_{j=1}^n X_j = X̄.
Conclusion: The factorization theorem has been used to find a sufficient statistic (for the parameter). Since
the total sum appears, we complete the expression to write the result in terms of the sample mean. Both
statistics contain the same information about the parameter of the distribution.
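A tiny numerical sketch of what sufficiency means (the samples and the value of η are arbitrary): two samples of the same size and the same total sum produce exactly the same likelihood, so they carry the same information about η.
# Sufficiency sketch: the likelihood depends on the data only through their sum
eta = 0.3
loglik = function(x, eta) sum(log(eta) + (x - 1)*log(1 - eta))
x1 = c(1, 2, 6); x2 = c(3, 3, 3)     # same size and same sum (9)
c(loglik(x1, eta), loglik(x2, eta))  # identical values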
My notes:
(C) For normal populations: V², s², S², VX²/VY², sX²/sY², SX²/SY²
Suppose that the two populations are independent. Study the consistency in mean of order two and then the
consistency in probability.
Discussion: In this exercise, the most important estimators are involved. The basic properties of the expectation and the variance allow us to calculate the mean square error. In most cases, the estimators will be completed so that a proper quantity (with known sampling distribution) appears, whose properties can then be used. Although the estimators of the third section can be used for any X and Y, the calculations for normally distributed variables are easier due to the use of additional information—the knowledge about statistics and their sampling distribution. Thus, the results of this section are based on the normality of the variables X and Y. (Some of the quantities are also valid for any variables.)
Fortunately, the limits of the two-variable functions—sequences, really—that appear in this exercise can easily be solved either by decomposing them into two limits of one-variable functions or by bounding the two-variable sequences. That the limits are studied when nX and nY tend to infinity facilitates the calculations (e.g. a constant like −2 is negligible when it appears in a factor). It is both sufficient and necessary that the sample sizes tend to infinity—see the mathematical appendix.
(c1) For the variance of the sample V²
By using nV²/σ² ~ χ²_n and the properties of the chi-squared distribution,

E(V²) = E( (σ²/n)·(nV²/σ²) ) = (σ²/n) E(nV²/σ²) = (σ²/n)·n = σ²

Var(V²) = Var( (σ²/n)·(nV²/σ²) ) = (σ⁴/n²) Var(nV²/σ²) = (σ⁴/n²)·2n = 2σ⁴/n

MSE(V²) = [ E(V²) − σ² ]² + Var(V²) = 2σ⁴/n
Then,
• The estimator V² is unbiased for σ², whatever the sample size.
• The estimator V² is consistent (in mean of order two and therefore in probability) for σ², since lim_{n→∞} MSE(V²) = lim_{n→∞} 2σ⁴/n = 0.
In another exercise, this estimator is compared with the other two estimators of the variance. (For the
expectation, it is easy to find in literature direct calculations that lead to the same value for any variables—not
necessarily normal.)
(c2) For the quotient between the variances of the samples VX²/VY²

By using T = (VX² σY²)/(VY² σX²) ~ F_{nX,nY} and the properties of the F distribution,

E(VX²/VY²) = (σX²/σY²) E( (VX² σY²)/(VY² σX²) ) = (σX²/σY²) · nY/(nY−2) = ( nY/(nY−2) ) σX²/σY²  (nY > 2)
Var(VX²/VY²) = Var( (σX²/σY²)·(VX² σY²)/(VY² σX²) ) = (σX²/σY²)² Var( (VX² σY²)/(VY² σX²) ) = ( 2nY²(nX+nY−2) / (nX(nY−2)²(nY−4)) ) σX⁴/σY⁴  (nY > 4)
MSE(VX²/VY²) = [ E(VX²/VY²) − σX²/σY² ]² + Var(VX²/VY²) = [ (nY/(nY−2)) σX²/σY² − σX²/σY² ]² + ( 2nY²(nX+nY−2) / (nX(nY−2)²(nY−4)) ) σX⁴/σY⁴
= { [ nY/(nY−2) − 1 ]² + 2nY²(nX+nY−2) / (nX(nY−2)²(nY−4)) } σX⁴/σY⁴  (nY > 4)
Then,
• The estimator VX²/VY² is biased for σX²/σY², but it is asymptotically unbiased since

lim_{nX,nY→∞} E(VX²/VY²) = lim_{nY→∞} ( nY/(nY−2) ) σX²/σY² = (σX²/σY²) lim_{nY→∞} 1/(1 − 2/nY) = σX²/σY²

Mathematically, only nY must tend to infinity. Statistically, since populations can be named and allocated in either order, it is deduced that both sample sizes must tend to infinity. In fact, it is both sufficient and necessary that the two sample sizes tend to infinity—see the mathematical appendix.
• The estimator VX²/VY² is consistent (in mean of order two and therefore in probability) for σX²/σY², since it is asymptotically unbiased and

lim_{nX,nY→∞} Var(VX²/VY²) = (σX⁴/σY⁴) lim_{nX,nY→∞} 2nY²(nX+nY−2) / (nX(nY−2)²(nY−4)) = (σX⁴/σY⁴) lim_{nX,nY→∞} [ 2( 1/nY + 1/nX − 2/(nX nY) ) ] / [ (1 − 2/nY)²(1 − 4/nY) ] = 0

The numerator tends to zero if and only if so do both sample sizes. In short, it is both sufficient and necessary that the two sample sizes tend to infinity—this limit has been studied in the mathematical appendix.
(c3) For the sample variance s²
By using ns²/σ² ~ χ²_{n−1} and the properties of the chi-squared distribution,

E(s²) = E( (σ²/n)·(ns²/σ²) ) = (σ²/n) E(ns²/σ²) = (σ²/n)(n−1) = ( (n−1)/n ) σ²

Var(s²) = Var( (σ²/n)·(ns²/σ²) ) = (σ⁴/n²) Var(ns²/σ²) = (σ⁴/n²)·2(n−1) = 2(n−1)σ⁴/n²
MSE(s²) = [ E(s²) − σ² ]² + Var(s²) = [ ((n−1)/n)σ² − σ² ]² + 2(n−1)σ⁴/n² = ( 2/n − 1/n² ) σ⁴
Then,
• The estimator s² is biased but asymptotically unbiased (for σ²), since

lim_{n→∞} E(s²) = σ² lim_{n→∞} (n−1)/n = σ² lim_{n→∞} (1 − 1/n) = σ²

It is both sufficient and necessary that the sample size tend to infinity—see the mathematical appendix.
• The estimator s² is consistent (in mean of order two and therefore in probability) for σ², since

lim_{n→∞} MSE(s²) = lim_{n→∞} ( 2/n − 1/n² ) σ⁴ = 0

It is both sufficient and necessary that the sample size tend to infinity—see the mathematical appendix.
In another exercise, this estimator is compared with the other two estimators of the variance. (For the
expectation, it is easy to find in literature direct calculations that lead to the same value for any variables—not
necessarily normal.)
(c4) For the quotient between the sample variances sX²/sY²

By using T = (SX² σY²)/(SY² σX²) = (nX(nY−1) sX² σY²)/(nY(nX−1) sY² σX²) ~ F_{nX−1,nY−1} and the properties of the F distribution,
E(sX²/sY²) = ( nY(nX−1)σX² / (nX(nY−1)σY²) ) E( (nX(nY−1) sX² σY²)/(nY(nX−1) sY² σX²) ) = ( nY(nX−1)σX² / (nX(nY−1)σY²) ) · (nY−1)/((nY−1)−2) = ( nY(nX−1)/(nX(nY−3)) ) σX²/σY²  (nY−1 > 2)
Var(sX²/sY²) = ( nY²(nX−1)²σX⁴ / (nX²(nY−1)²σY⁴) ) Var( (nX(nY−1) sX² σY²)/(nY(nX−1) sY² σX²) ) = ( nY²(nX−1)²σX⁴ / (nX²(nY−1)²σY⁴) ) · 2(nY−1)²(nX−1+nY−1−2) / ( (nX−1)((nY−1)−2)²((nY−1)−4) ) = ( 2nY²(nX−1)(nX+nY−4) / (nX²(nY−3)²(nY−5)) ) σX⁴/σY⁴  (nY−1 > 4)
MSE(sX²/sY²) = [ E(sX²/sY²) − σX²/σY² ]² + Var(sX²/sY²) = [ ( nY(nX−1)/(nX(nY−3)) ) σX²/σY² − σX²/σY² ]² + ( 2nY²(nX−1)(nX+nY−4) / (nX²(nY−3)²(nY−5)) ) σX⁴/σY⁴
= { [ nY(nX−1)/(nX(nY−3)) − 1 ]² + 2nY²(nX−1)(nX+nY−4) / (nX²(nY−3)²(nY−5)) } σX⁴/σY⁴  (nY−1 > 4)
Then,
• The estimator sX²/sY² is biased for σX²/σY², but it is asymptotically unbiased since

lim_{nX,nY→∞} E(sX²/sY²) = lim_{nX,nY→∞} ( nY(nX−1)/(nX(nY−3)) ) σX²/σY² = (σX²/σY²) lim_{nX,nY→∞} (nX nY − nY)/(nX nY − 3nX) = (σX²/σY²) lim_{nX,nY→∞} (1 − 1/nX)/(1 − 3/nY) = σX²/σY²

It is both sufficient and necessary that the two sample sizes tend to infinity—see the mathematical appendix.
• The estimator sX²/sY² is consistent (in mean of order two and therefore in probability) for σX²/σY², as it is asymptotically unbiased and

lim_{nX,nY→∞} Var(sX²/sY²) = lim_{nX,nY→∞} ( 2nY²(nX−1)(nX+nY−4) / (nX²(nY−3)²(nY−5)) ) σX⁴/σY⁴ = (σX⁴/σY⁴) lim_{nX,nY→∞} [ 2(1 − 1/nX)( 1/nX + 1/nY − 4/(nX nY) ) ] / [ (1 − 3/nY)²(1 − 5/nY) ] = 0

It is both sufficient and necessary that the two sample sizes tend to infinity—see the mathematical appendix.
In another exercise, this estimator is compared with the other two estimators of the quotient of variances.
(c5) For the sample quasivariance S²
By using (n−1)S²/σ² ~ χ²_{n−1} and the properties of the chi-squared distribution,

E(S²) = E( (σ²/(n−1))·((n−1)S²/σ²) ) = (σ²/(n−1)) E((n−1)S²/σ²) = (σ²/(n−1))(n−1) = σ²

Var(S²) = Var( (σ²/(n−1))·((n−1)S²/σ²) ) = (σ⁴/(n−1)²) Var((n−1)S²/σ²) = (σ⁴/(n−1)²)·2(n−1) = 2σ⁴/(n−1)

MSE(S²) = [ E(S²) − σ² ]² + Var(S²) = 2σ⁴/(n−1)
Then,
• The estimator S² is unbiased for σ², whatever the sample size.
• The estimator S² is consistent (in mean of order two and therefore in probability) for σ², since

lim_{n→∞} MSE(S²) = lim_{n→∞} 2σ⁴/(n−1) = 0

It is both sufficient and necessary that the sample size tend to infinity—see the mathematical appendix.
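The expectations and the ordering of the mean square errors of the three individual estimators can be checked numerically; a minimal Monte Carlo sketch (all values arbitrary):
# Monte Carlo for V^2 (population mean known), s^2 and S^2 under normality
mu = 0; sigma = 1; n = 10; M = 20000
mseV = mses = mseS = numeric(M)
for (m in 1:M) {
  x = rnorm(n, mean=mu, sd=sigma)
  V2 = mean((x - mu)^2)       # uses the known population mean
  S2 = var(x)                 # quasivariance (denominator n-1)
  s2 = (n-1)/n * S2           # sample variance (denominator n)
  mseV[m] = (V2 - sigma^2)^2; mses[m] = (s2 - sigma^2)^2; mseS[m] = (S2 - sigma^2)^2
}
c(mean(mses), mean(mseV), mean(mseS))   # should reflect 2/n - 1/n^2 < 2/n < 2/(n-1)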
(c6) For the quotient between the sample quasivariances SX²/SY²

By using T = (SX² σY²)/(SY² σX²) ~ F_{nX−1,nY−1} and the properties of the F distribution,

E(SX²/SY²) = (σX²/σY²) E( (SX² σY²)/(SY² σX²) ) = (σX²/σY²) · (nY−1)/((nY−1)−2) = ( (nY−1)/(nY−3) ) σX²/σY²  (nY−1 > 2)

Var(SX²/SY²) = (σX⁴/σY⁴) Var( (SX² σY²)/(SY² σX²) ) = (σX⁴/σY⁴) · 2(nY−1)²(nX−1+nY−1−2) / ( (nX−1)((nY−1)−2)²((nY−1)−4) ) = ( 2(nY−1)²(nX+nY−4) / ((nX−1)(nY−3)²(nY−5)) ) σX⁴/σY⁴  (nY−1 > 4)
MSE(SX²/SY²) = [ E(SX²/SY²) − σX²/σY² ]² + Var(SX²/SY²) = [ ((nY−1)/(nY−3)) σX²/σY² − σX²/σY² ]² + ( 2(nY−1)²(nX+nY−4) / ((nX−1)(nY−3)²(nY−5)) ) σX⁴/σY⁴
= { [ (nY−1)/(nY−3) − 1 ]² + 2(nY−1)²(nX+nY−4) / ((nX−1)(nY−3)²(nY−5)) } σX⁴/σY⁴  (nY−1 > 4)
Then,
• The estimator SX²/SY² is biased for σX²/σY², but it is asymptotically unbiased since

lim_{nX,nY→∞} E(SX²/SY²) = lim_{nX,nY→∞} ( (nY−1)/(nY−3) ) σX²/σY² = (σX²/σY²) lim_{nY→∞} (1 − 1/nY)/(1 − 3/nY) = σX²/σY²

Mathematically, only nY must tend to infinity. Statistically, since populations can be named and allocated in either order, it is deduced that both sample sizes must tend to infinity. In fact, it is both sufficient and necessary that the two sample sizes tend to infinity—see the mathematical appendix.
• The estimator SX²/SY² is consistent (in mean of order two and therefore in probability) for σX²/σY², as it is asymptotically unbiased and

lim_{nX,nY→∞} Var(SX²/SY²) = lim_{nX,nY→∞} ( 2(nY−1)²(nX+nY−4) / ((nX−1)(nY−3)²(nY−5)) ) σX⁴/σY⁴ = (σX⁴/σY⁴) lim_{nX,nY→∞} [ 2(1 − 1/nY)²( 1/nX + 1/nY − 4/(nX nY) ) ] / [ (1 − 1/nX)(1 − 3/nY)²(1 − 5/nY) ] = 0

It is both sufficient and necessary that the two sample sizes tend to infinity—see the mathematical appendix.
In another exercise, this estimator is compared with the other two estimators of the quotient of variances.
Conclusion: For the most important estimators, the mean square error has been calculated either directly (in a few cases) or by making a proper statistic appear. The consistencies in mean of order two and in probability have been proved.
My notes:
(B) For V², s², S² and for VX²/VY², sX²/sY², SX²/SY² (consider only the case nX = n = nY)
In the second section, suppose that the populations are independent.
Discussion: The expressions of the mean square error of these estimators have been calculated in another exercise. Comparing the coefficients is easy in some cases, but sequences may sometimes cross one another and the comparisons must be done analytically—by solving equalities and inequalities—or graphically. We plot the sequences (lines between dots are used to facilitate the identification).
The mean square errors were found for static situations, but the idea of limit involves dynamic situations. By using a computer, it is also possible to study—either analytically or graphically—the asymptotic behaviour of the estimators (but it is not a “whole mathematical proof”). It is worth noticing that the formulas and results of this exercise are valid for normal populations (because of the theoretical results on which they are based); in the general case, the expressions for the mean square error of these estimators are more complex. For two populations, there are infinitely many mathematical ways for the two sample sizes to tend to infinity (see the figure); the case nX = n = nY, in the last figure, will be considered.
MSE(V²) = (2/n) σ⁴   MSE(s²) = (2/n − 1/n²) σ⁴   MSE(S²) = ( 2/(n−1) ) σ⁴
Since σ⁴ appears in all these positive quantities, by looking at the coefficients it is easy to see that, for n larger than two,

MSE(s²) < MSE(V²) < MSE(S²)

That is, the sequences—indexed by n—do not cross one another. We can plot the coefficients (they are also the mean square errors when σ=1).
# Grid of values for 'n'
n = seq(from=2,to=10,by=1)
# The three sequences of coefficients
coeff1 = 2/n
coeff2 = 2/n - 1/(n^2)
coeff3 = 2/(n-1)
# The plot
allValues = c(coeff1, coeff2, coeff3)
yLim = c(min(allValues), max(allValues));
x11(); par(mfcol=c(1,4))
plot(n, coeff1, xlim=c(min(n),max(n)), ylim = yLim, xlab=' ', ylab=' ', main='Coefficients 1', type='l')
plot(n, coeff2, xlim=c(min(n),max(n)), ylim = yLim, xlab=' ', ylab=' ', main='Coefficients 2', type='b')
plot(n, coeff3, xlim=c(min(n),max(n)), ylim = yLim, xlab=' ', ylab=' ', main='Coefficients 3', type='b')
plot(n, coeff1, xlim=c(min(n),max(n)), ylim = yLim, xlab=' ', ylab=' ', main='All coefficients', type='l')
points(n, coeff2, type='b')
points(n, coeff3, type='b')
Asymptotically, the three estimators behave similarly, since 2/n − 1/n² ≈ 2/n ≈ 2/(n−1).
(B) For VX²/VY², sX²/sY² and SX²/SY²
The expressions of their mean square error, when nX = n = nY, are:
MSE(VX²/VY²) = { [ n/(n−2) − 1 ]² + 2n²(n+n−2) / (n(n−2)²(n−4)) } σX⁴/σY⁴ = { [ n/(n−2) − 1 ]² + 4n(n−1) / ((n−2)²(n−4)) } σX⁴/σY⁴  (n > 4)
MSE(sX²/sY²) = { [ n(n−1)/(n(n−3)) − 1 ]² + 2n²(n−1)(n+n−4) / (n²(n−3)²(n−5)) } σX⁴/σY⁴ = { [ (n−1)/(n−3) − 1 ]² + 4(n−1)(n−2) / ((n−3)²(n−5)) } σX⁴/σY⁴  (n−1 > 4)
MSE(SX²/SY²) = { [ (n−1)/(n−3) − 1 ]² + 2(n−1)²(n+n−4) / ((n−1)(n−3)²(n−5)) } σX⁴/σY⁴ = { [ (n−1)/(n−3) − 1 ]² + 4(n−1)(n−2) / ((n−3)²(n−5)) } σX⁴/σY⁴  (n−1 > 4)
For equal sample sizes, the mean square error of the last two estimators is the same (but they may behave
differently under other criteria different to the mean square error, e.g. even their expectation). We can plot the
coefficients (they are also the mean square errors when σX = σY), for n > 5.
This shows that, for normal populations and samples of sizes nX = n = nY, it seems that

MSE(VX²/VY²) ≤? MSE(sX²/sY²) = MSE(SX²/SY²)

(the question mark indicates that the inequality is still to be proved)
and the sequences do not cross one another. Really, a figure is not a mathematical proof, so we do the following calculations:

[ n/(n−2) − 1 ]² + 4n(n−1)/((n−2)²(n−4)) ≤? [ (n−1)/(n−3) − 1 ]² + 4(n−1)(n−2)/((n−3)²(n−5))

[ 4(n−4) + 4n(n−1) ] / ((n−2)²(n−4)) ≤? [ 4(n−5) + 4(n−1)(n−2) ] / ((n−3)²(n−5)) ↔ (n−4+n²−n) / ((n−2)²(n−4)) ≤? (n²−2n−3) / ((n−3)²(n−5))

(n−2)(n+2) / ((n−2)²(n−4)) ≤? (n−3)(n+1) / ((n−3)²(n−5)) ↔ (n+2)(n−3)(n−5) ≤? (n+1)(n−2)(n−4)

n³−6n²−n+30 ≤? n³−5n²+2n+8 ↔ 22 ≤? n(n+3)

This last inequality is true for n ≥ 4, since it is true for n = 4 and the right-hand side increases with n. Thus, we can guarantee that, for n > 5,
MSE(VX²/VY²) ≤ MSE(sX²/sY²) = MSE(SX²/SY²)
Asymptotically, by using infinites,

lim_{nX,nY→∞} MSE(VX²/VY²) = lim_{nX,nY→∞} { [ nY/(nY−2) − 1 ]² + 2nY²(nX+nY−2) / (nX(nY−2)²(nY−4)) } σX⁴/σY⁴ = lim_{nX,nY→∞} { [ nY/nY − 1 ]² + 2nY²(nX+nY) / (nX nY³) } σX⁴/σY⁴ = lim_{nX,nY→∞} ( 2(nX+nY) / (nX nY) ) σX⁴/σY⁴ = 0
lim_{nX,nY→∞} MSE(sX²/sY²) = lim_{nX,nY→∞} { [ nY nX/(nX nY) − 1 ]² + 2nY² nX(nX+nY) / (nX² nY³) } σX⁴/σY⁴ = lim_{nX,nY→∞} ( 2(nX+nY) / (nX nY) ) σX⁴/σY⁴ = 0
lim_{nX,nY→∞} MSE(SX²/SY²) = lim_{nX,nY→∞} { [ (nY−1)/(nY−3) − 1 ]² + 2(nY−1)²(nX+nY−4) / ((nX−1)(nY−3)²(nY−5)) } σX⁴/σY⁴ = lim_{nX,nY→∞} { [ nY/nY − 1 ]² + 2nY²(nX+nY) / (nX nY³) } σX⁴/σY⁴ = lim_{nX,nY→∞} ( 2(nX+nY) / (nX nY) ) σX⁴/σY⁴ = 0
The three estimators behave similarly, since the quantitative behaviour of their mean square errors is characterized by the same limit, namely:

lim_{nX,nY→∞} ( 2(nX+nY) / (nX nY) ) σX⁴/σY⁴ = 0.

(It is worth noticing that this asymptotic behaviour arises when the limits are solved by using infinites—it cannot be seen when the limits are solved in other ways.)
Conclusion: The expressions of the mean square error of these estimators allow us to compare them, to study their consistency and even their rate of convergence. We have proved the following result:
Proposition
(1) For a normal population,
MSE ( s 2 ) < MSE (V 2 ) < MSE (S 2 )
(2) For two independent normal populations, when nX = n = nY
MSE(VX²/VY²) ≤ MSE(sX²/sY²) = MSE(SX²/SY²)
Note: For one population, V² has higher error than s², even if the information about the value of the population mean μ is used by the former while it is estimated in the other two estimators. For two populations, the information about the value of the two population means μX and μY is used in the first quotient while they must be estimated in the other two estimators. Either way, the population mean in itself does not play an important role in studying the variance, which is based on relative distances, but any estimation using the same data reduces the amount of information available—and the degrees of freedom—by one unit.
Again, it is worth noticing that there are in general several matters to be considered in selecting among different estimators of the same quantity:
(a) The error can be measured by using a quantity different to the mean square error.
(b) For large sample sizes, the differences provided by the formulas above may be negligible.
(c) The computational or manual effort in calculating the quantities must also be taken into account—not all of them require the same number of operations.
(d) We may have some quantities already available.
(d) We may have some quantities already available.
My notes:
Discussion: The expressions of the mean square error of the basic estimators involved in this exercise have been calculated in another exercise, and they will be used in calculating the mean square errors of the new estimators. The errors are calculated for static situations, but limits are studied in dynamic situations. Comparing the coefficients is easy in some cases, but sequences can sometimes cross one another and the comparisons must be done analytically—by solving equalities and inequalities—or graphically. By using a computer, it is also possible to study—either analytically or graphically—the behaviour of the estimators. The results obtained here are valid for two independent Bernoulli populations and two independent normal populations, respectively. On the other hand, we must find the expression of the error for the new estimators based on semisums:
MSE( (1/2)(θ̂₁+θ̂₂) ) = [ E( (1/2)(θ̂₁+θ̂₂) ) − θ ]² + Var( (1/2)(θ̂₁+θ̂₂) )

and, for unbiased estimators (independent, as the populations are),

MSE( (1/2)(θ̂₁+θ̂₂) ) = 0 + (1/4)[ Var(θ̂₁) + Var(θ̂₂) ]
(A) For Bernoulli populations: (1/2)(η̂X + η̂Y) and η̂p
(a1) For the semisum of the sample proportions (1/2)(η̂X + η̂Y)
By using previous results and that μ = η and σ² = η(1−η),

E( (1/2)(η̂X + η̂Y) ) = (1/2)[ E(η̂X) + E(η̂Y) ] = (1/2)(ηX + ηY) = η

MSE( (1/2)(η̂X + η̂Y) ) = [ (1/2)(ηX+ηY) − η ]² + (1/4)( ηX(1−ηX)/nX + ηY(1−ηY)/nY ) = (1/4)( 1/nX + 1/nY ) η(1−η)
Then,
• The estimator (1/2)(η̂X + η̂Y) is unbiased for η, whatever the sample sizes.
• The estimator (1/2)(η̂X + η̂Y) is consistent (in the mean-square sense and therefore in probability) for η, since

lim_{nX,nY→∞} MSE( (1/2)(η̂X + η̂Y) ) = lim_{nX,nY→∞} (1/4)( 1/nX + 1/nY ) η(1−η) = 0

It is both sufficient and necessary that both sample sizes tend to infinity—see the mathematical appendix.
(a2) For the pooled sample proportion η̂p
By using previous results,

E(η̂p) = ( 1/(nX+nY) ) [ nX E(η̂X) + nY E(η̂Y) ] = (nX ηX + nY ηY)/(nX+nY) = η

Var(η̂p) = ( 1/(nX+nY)² ) [ nX² Var(η̂X) + nY² Var(η̂Y) ] = ( nX ηX(1−ηX) + nY ηY(1−ηY) )/(nX+nY)² = ( 1/(nX+nY) ) η(1−η)

MSE(η̂p) = [ (nX ηX + nY ηY)/(nX+nY) − η ]² + ( nX ηX(1−ηX) + nY ηY(1−ηY) )/(nX+nY)² = ( 1/(nX+nY) ) η(1−η)
Then,
• The estimator η̂p is unbiased for η, whatever the sample sizes.
• The estimator η̂p is consistent (in mean of order two and therefore in probability) for η, since

lim_{nX,nY→∞} MSE(η̂p) = lim_{nX,nY→∞} η(1−η)/(nX+nY) = 0

If the mean square error is compared with those of the two populations, we can see that the new denominator is the sum of both sample sizes. Again, it is worth noticing that it is sufficient and necessary that at least one sample size tend to infinity, but not both. In this case, the denominator tends to infinity. The interpretation of this fact is that, in estimating, one sample can do “the whole work.”
(a3) Comparison of (1/2)(η̂X + η̂Y) and η̂p
Case nX = n = nY

MSE( (1/2)(η̂X + η̂Y) ) = η(1−η)/(2n) = MSE(η̂p)

In fact, by looking at the expressions of the estimators themselves, η̂p = (1/2)(η̂X + η̂Y) in this case.
General case
The expressions of their mean square error are (the sample proportion is unbiased):
Then

(1/4)( 1/nX + 1/nY ) ≤? 1/(nX+nY) ↔ (nX+nY)·(nX+nY)/(nX nY) ≤? 4 ↔ nX² + nY² + 2nX nY ≤? 4nX nY ↔ (nX−nY)² ≤? 0

which can hold only with equality, when nX = nY.
Then, the pooled estimator is always better than or equal to the semisum of the sample proportions. Both estimators have the same mean square error—though their behaviour may be different under criteria other than the mean square error—only when nX = nY. Thus, (nX−nY)² can be seen as a measure of the convenience of using the pooled sample proportion, since it shows how different the two errors are. The inequality also shows a symmetric situation, in the sense that it does not matter which sample size is bigger: the measure depends on the difference. We have proved the following result:
Proposition
For two independent Bernoulli populations with the same parameter, the pooled sample proportion
has smaller or equal mean square error than the semisum of the sample proportions. Besides, both
are equivalent only when the sample sizes are equal.
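A quick numerical check of this proposition over a grid of sample sizes (a sketch; the grid is arbitrary):
# Verifying MSE(pooled) <= MSE(semisum), with equality only when nX = nY
nX = rep(2:30, times=29); nY = rep(2:30, each=29)
semisum = (1/nX + 1/nY)/4              # coefficient of eta(1-eta)
pooled  = 1/(nX + nY)                  # coefficient of eta(1-eta)
all(pooled <= semisum + 1e-15)                       # TRUE
all((abs(pooled - semisum) < 1e-12) == (nX == nY))   # TRUE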
We can plot the coefficients (they are also the mean square errors when η(1−η)=1) for a sequence of sample sizes, indexed by k, such that nY(k) = 2nX(k), for example (but this is only one possible way for the sample sizes to tend to infinity):
# Grid of values for 'n'
c = 2
n = seq(from=2,to=10,by=1)
# The sequences of coefficients
coeff1 = (1 + 1/c)/(4*n)
coeff2 = 1/((1+c)*n)
# The plot
allValues = c(coeff1, coeff2)
yLim = c(min(allValues), max(allValues));
x11(); par(mfcol=c(1,3))
plot(n, coeff1, xlim=c(min(n),max(n)), ylim = yLim, xlab=' ', ylab=' ', main='Coefficients 1', type='l')
plot(n, coeff2, xlim=c(min(n),max(n)), ylim = yLim, xlab=' ', ylab=' ', main='Coefficients 2', type='b')
plot(n, coeff1, xlim=c(min(n),max(n)), ylim = yLim, xlab=' ', ylab=' ', main='All coefficients', type='l')
points(n, coeff2, type='b')
The reader can repeat this figure by using values closer to and farther from 1 than c(k) = 2.
(b1) For the semisum (1/2)(VX² + VY²)
By using previous results,

MSE( (1/2)(VX² + VY²) ) = [ (1/2)(σX² + σY²) − σ² ]² + (1/2)( σX⁴/nX + σY⁴/nY ) = (1/2)( 1/nX + 1/nY ) σ⁴
Then,
• The estimator (1/2)(VX² + VY²) is unbiased for σ², whatever the sample sizes.
• The estimator (1/2)(VX² + VY²) is consistent (in the mean-square sense and therefore in probability) for σ², since

lim_{nX,nY→∞} MSE( (1/2)(VX² + VY²) ) = lim_{nX,nY→∞} (1/2)( 1/nX + 1/nY ) σ⁴ = 0

It is both sufficient and necessary that both sample sizes tend to infinity—see the mathematical appendix.
(b2) For the semisum of the sample variances (1/2)(sX² + sY²)
By using previous results,

E( (1/2)(sX² + sY²) ) = (1/2)[ E(sX²) + E(sY²) ] = (1/2)[ ((nX−1)/nX)σX² + ((nY−1)/nY)σY² ] = (1/2)( (nX−1)/nX + (nY−1)/nY ) σ²
Var( (1/2)(sX² + sY²) ) = (1/4)[ Var(sX²) + Var(sY²) ] = (1/4)[ 2(nX−1)σX⁴/nX² + 2(nY−1)σY⁴/nY² ] = (1/2)( (nX−1)/nX² + (nY−1)/nY² ) σ⁴
1
(
MSE (s2X + s 2Y ) =
2 2 nX
σX+ ) [(
1 n X −1 2 n Y −1 2
nY
σY −σ 2 +
1 n X −1 4 n Y −1 4
2 n2X
σ X + 2 σY
nY ) ] [ ]
2
=
[(
1 n X −1 n Y −1 2
2 nX
+
nY
σ −σ 2 +
1 n X −1 n Y −1 4
2 n 2X
+ 2 σ
nY ) ] [ ]
{[ [ ]} [ ]
2
n X nY2 −nY2 + n2X nY −n2X (n X + nY )2 n X n2Y −n 2Y + n2X nY −n2X 4
= −
1 1 1
+
2 n X nY ( )] +2
4 n 2X n2Y
σ=
4
4 n 2X nY2
+2
4 n 2X n 2Y
σ
[ ] [ ]
2 2 2 2 2
2 n X n Y + 2 n X nY + 2 n X nY −n X −n Y 4 2 n X nY (n X + nY )−(n X −nY )
= 2 2
σ= 2 2
σ4
4n n Y X 4n n X Y
[ ] [ ]
2 2
1 n X + nY ( n X −n Y ) 4 1 1 1 ( n X −n Y )
= − σ= + − σ4
2 n X nY 2 2
2 n X nY 2 n X n Y
2 2
2 n X nY
Then,
• The estimator (1/2)(sX² + sY²) is biased but asymptotically unbiased for σ², since

lim_{nX,nY→∞} E( (1/2)(sX² + sY²) ) = σ² lim_{nX,nY→∞} (1/2)( (nX−1)/nX + (nY−1)/nY ) = σ² (1/2)(1+1) = σ²

It is both sufficient and necessary that the two sample sizes tend to infinity—see the mathematical appendix.
• The estimator (1/2)(sX² + sY²) is consistent (in the mean-square sense and therefore in probability) for σ², because it is asymptotically unbiased and

lim_{nX,nY→∞} Var( (1/2)(sX² + sY²) ) = σ⁴ lim_{nX,nY→∞} (1/2)( (nX−1)/nX² + (nY−1)/nY² ) = 0

Again, it is both sufficient and necessary that the two sample sizes tend to infinity—see the mathematical appendix.
(b3) For the semisum of the sample quasivariances (1/2)(SX² + SY²)
By using previous results,

E( (1/2)(SX² + SY²) ) = (1/2)[ E(SX²) + E(SY²) ] = (1/2)( σX² + σY² ) = σ²

MSE( (1/2)(SX² + SY²) ) = [ (1/2)(σX² + σY²) − σ² ]² + (1/2)( σX⁴/(nX−1) + σY⁴/(nY−1) ) = (1/2)( 1/(nX−1) + 1/(nY−1) ) σ⁴
Then,
• The estimator (1/2)(SX² + SY²) is unbiased for σ², whatever the sample sizes.
• The estimator (1/2)(SX² + SY²) is consistent (in the mean-square sense and therefore in probability) for σ², since

lim_{nX,nY→∞} MSE( (1/2)(SX² + SY²) ) = lim_{nX,nY→∞} (1/2)( 1/(nX−1) + 1/(nY−1) ) σ⁴ = 0

It is both sufficient and necessary that both sample sizes tend to infinity—see the mathematical appendix.
For the pooled sample variance sp²:

MSE(sp²) = [ ( (nX+nY−2)/(nX+nY) ) σ² − σ² ]² + ( 2(nX+nY−2)/(nX+nY)² ) σ⁴ = ( (nX+nY−2−nX−nY)² + 2(nX+nY−2) )/(nX+nY)² σ⁴ = ( 2/(nX+nY) ) σ⁴
Then,
• The estimator sp² is biased for σ², but asymptotically unbiased:

lim_{nX,nY→∞} E(sp²) = lim_{nX,nY→∞} ( (nX+nY−2)/(nX+nY) ) σ² = σ²
(The calculation above for the mean suggests that a –2 in the denominator of the definition would
provide an unbiased estimator—see the estimator in the following section.)
• The estimator sp² is consistent (in mean of order two and therefore in probability) for σ², since

lim_{nX,nY→∞} MSE(sp²) = σ⁴ lim_{nX,nY→∞} 2/(nX+nY) = 0

It is worth noticing that it is sufficient and necessary that at least one sample size tend to infinity, but not both. In this case, the denominator tends to infinity. The interpretation of this fact is that, in estimating, one sample can do “the whole work.”
For the pooled sample quasivariance Sp²:

MSE(Sp²) = [ ( (nX−1)σX² + (nY−1)σY² )/(nX+nY−2) − σ² ]² + ( 2/(nX+nY−2) ) σ⁴ = ( 2/(nX+nY−2) ) σ⁴
Then,
• The estimator Sp² is unbiased for σ², whatever the sample sizes.
• The estimator Sp² is consistent (in mean of order two and therefore in probability) for σ², since lim_{nX,nY→∞} MSE(Sp²) = σ⁴ lim_{nX,nY→∞} 2/(nX+nY−2) = 0; again, at least one sample size tending to infinity is sufficient and necessary.
(b7) Comparison of (1/2)(VX²+VY²), (1/2)(sX²+sY²), (1/2)(SX²+SY²), Vp², sp² and Sp²
Case nX = n = nY
MSE( (1/2)(VX²+VY²) ) = (1/2)(2/n) σ⁴ = (1/n) σ⁴
MSE( (1/2)(sX²+sY²) ) = ( (1/2)(2/n) − 0 ) σ⁴ = (1/n) σ⁴
MSE( (1/2)(SX²+SY²) ) = (1/2)( 2/(n−1) ) σ⁴ = ( 1/(n−1) ) σ⁴
MSE(Vp²) = ( 2/(2n) ) σ⁴ = (1/n) σ⁴
MSE(sp²) = ( 2/(2n) ) σ⁴ = (1/n) σ⁴
MSE(Sp²) = ( 2/(2n−2) ) σ⁴ = ( 1/(n−1) ) σ⁴
Since σ⁴ appears in all these positive quantities, by looking at the coefficients it is easy to see the relation

MSE( (1/2)(sX²+sY²) ) = MSE( (1/2)(VX²+VY²) ) = MSE(Vp²) = MSE(sp²) < MSE(Sp²) = MSE( (1/2)(SX²+SY²) )

(For individual estimators, the order MSE(s²) < MSE(V²) < MSE(S²) was obtained in another exercise.) This relation has been obtained for the case nX = n = nY and (independent) normal populations. We can plot the coefficients (they are also the mean square errors when σ=1).
# Grid of values for 'n'
n = seq(from=10,to=20,by=1)
# The three sequences of coefficients
coeff1 = 1/n
coeff2 = coeff1
coeff3 = 1/(n-1)
coeff4 = coeff1
coeff5 = coeff1
coeff6 = coeff3
# The plot
allValues = c(coeff1, coeff2, coeff3, coeff4, coeff5, coeff6)
yLim = c(min(allValues), max(allValues));
x11(); par(mfcol=c(1,7))
plot(n, coeff1, xlim=c(min(n),max(n)), ylim = yLim, xlab=' ', ylab=' ', main='Coefficients 1', type='l')
plot(n, coeff2, xlim=c(min(n),max(n)), ylim = yLim, xlab=' ', ylab=' ', main='Coefficients 2', type='l')
plot(n, coeff3, xlim=c(min(n),max(n)), ylim = yLim, xlab=' ', ylab=' ', main='Coefficients 3', type='b')
plot(n, coeff4, xlim=c(min(n),max(n)), ylim = yLim, xlab=' ', ylab=' ', main='Coefficients 4', type='l')
plot(n, coeff5, xlim=c(min(n),max(n)), ylim = yLim, xlab=' ', ylab=' ', main='Coefficients 5', type='l')
plot(n, coeff6, xlim=c(min(n),max(n)), ylim = yLim, xlab=' ', ylab=' ', main='Coefficients 6', type='b')
plot(n, coeff1, xlim=c(min(n),max(n)), ylim = yLim, xlab=' ', ylab=' ', main='All coefficients', type='l')
points(n, coeff2, type='l')
points(n, coeff3, type='b')
points(n, coeff4, type='l')
points(n, coeff5, type='l')
points(n, coeff6, type='b')
General case
The expressions of their mean square error are:
MSE( (1/2)(VX²+VY²) ) = (1/2)( 1/nX + 1/nY ) σ⁴
MSE( (1/2)(sX²+sY²) ) = ( (1/2)( 1/nX + 1/nY ) − (nX−nY)²/(2nX nY)² ) σ⁴
MSE( (1/2)(SX²+SY²) ) = (1/2)( 1/(nX−1) + 1/(nY−1) ) σ⁴
MSE(Vp²) = ( 2/(nX+nY) ) σ⁴
MSE(sp²) = ( 2/(nX+nY) ) σ⁴
MSE(Sp²) = ( 2/(nX+nY−2) ) σ⁴
We have simplified the expressions as much as possible, and now a general comparison can be tackled by doing some pairwise comparisons. Firstly, by looking at the coefficients,

MSE( (1/2)(sX²+sY²) ) ≤ MSE( (1/2)(VX²+VY²) ) < MSE( (1/2)(SX²+SY²) )

and the equality is reached only when nX = n = nY. On the other hand,

MSE(Vp²) = MSE(sp²) < MSE(Sp²)
Now, we would like to allocate Vp², sp² and Sp² in the first chain. To compare Vp² and sp² with (1/2)(VX²+VY²),

2/(nX+nY) ≤ (1/2)( 1/nX + 1/nY ) ↔ 4nX nY ≤ (nX+nY)² ↔ 4nX nY ≤ nX² + nY² + 2nX nY ↔ 0 ≤ (nX−nY)²
which always holds, so these two pooled estimators never have larger mean square error than this semisum. Analogously, comparing Sp² with the same semisum leads to:

MSE(Sp²) ≤ MSE( (1/2)(VX²+VY²) ) if 2(nX+nY) ≤ (nX−nY)²
MSE(Sp²) ≥ MSE( (1/2)(VX²+VY²) ) if 2(nX+nY) ≥ (nX−nY)²
Intuitively, in the region around the bisector line the difference of the sample sizes is small, and therefore the pooled sample variance Sp² is worse; on the other hand, in the complementary region the square of the difference is bigger than twice the sum of the sizes, and, therefore, the pooled sample variance is better. The frontier seems to be parabolic. Some work can be done to find the frontier determined by the equality and the two regions on both sides—this is done in the mathematical appendix. Now, we write some brute-force lines for the computer to plot the points of the frontier:
N = 100
vectorNx = vector(mode="numeric", length=0)
vectorNy = vector(mode="numeric", length=0)
for (nx in 1:N)
{
for (ny in 1:N)
{
if (2*(nx+ny)==(nx-ny)^2) { vectorNx = c(vectorNx, nx); vectorNy = c(vectorNy, ny) }
}
}
plot(vectorNx, vectorNy, xlim = c(0,N+1), ylim = c(0,N+1), xlab='nx', ylab='ny', main=paste('Frontier of the region'), type='p')
To compare Sp² with (1/2)(SX² + SY²),

2/(nX+nY−2) ≤ (1/2)( 1/(nX−1) + 1/(nY−1) ) ↔ 4(nX−1)(nY−1) ≤ (nX+nY−2)²

and the equality is attained only if the sample sizes are the same.
We can summarize all the results of this section in the following statement:
(e) MSE(Sp²) ≤ MSE( (1/2)(VX²+VY²) ) if 2(nX+nY) ≤ (nX−nY)²
    MSE(Sp²) ≥ MSE( (1/2)(VX²+VY²) ) if 2(nX+nY) ≥ (nX−nY)²
(f) MSE(Sp²) ≤ MSE( (1/2)(SX²+SY²) )
Note: I have tried to compare Vp², sp² and Sp² with (1/2)(sX² + sY²), but I have not managed to solve the inequalities. On the other hand, these relations show that, for two independent normal populations, there exist estimators with smaller mean square error than the pooled sample variance Sp². Nevertheless, there are other criteria different to the mean square error, and, additionally, the pooled sample variance has also some advantages (see the advanced theory at the end).
Conclusion: For some pooled estimators, the mean square errors have been calculated either directly or by making a proper statistic appear. The consistencies in mean of order two and in probability have been proved. By using theoretical expressions for the mean square error, the behaviour of the pooled estimators for the proportion (Bernoulli populations) and for the variance (normal populations) has been compared with “natural” estimators consisting of the semisum of the individual estimators for each population.
Once more, it is worth noticing that there are in general several matters to be considered in selecting
among different estimators of the same quantity:
(a) The error can be measured by using a quantity different to the mean square error.
(b) For large sample sizes, the differences provided by the formulas above may be negligible.
(c) The computational or manual effort in calculating the quantities must also be taken into account—not all of them require the same number of operations.
(d) We may have some quantities already available.
Advanced Theory: The previous estimators can be written as a sum ωX θ̂X + ωY θ̂Y with weights ω = (ωX, ωY) such that ωX + ωY = 1. As regards the interpretation of the weights, they can be seen as a measure of the importance that each estimator is given in the global formula. For some weights that depend on the sample sizes, it is possible for one estimator to acquire all the importance when the sample sizes increase in the proper way. On the contrary, when the weights are constant the possible effect—positive or negative—of each estimator on the global one is fixed, whatever the sample sizes.
My notes:
Discussion: This statement is mathematical. The assumptions are supposed to have been checked. We are given the density function of the distribution of X (a dimensionless quantity). The exercise involves two methods of estimation, the definition of the bias, the mean square error and the sufficient condition for the consistency (in probability). The first two population moments are provided.
Note: If E(X) and E(X2) had not been given in the statement, they could have been calculated by applying the definition and solving the integrals,
E(X) = ∫_{−∞}^{+∞} x f(x;θ) dx = ∫₀^θ x · 2(θ−x)/θ² dx = (2/θ²)( ∫₀^θ θx dx − ∫₀^θ x² dx ) = (2/θ²)( θ [x²/2]₀^θ − [x³/3]₀^θ ) = (2/θ²)( θ³/2 − θ³/3 ) = (2/θ²)(θ³/6) = θ/3
E(X²) = ∫_{−∞}^{+∞} x² f(x;θ) dx = ∫₀^θ x² · 2(θ−x)/θ² dx = (2/θ²)( ∫₀^θ θx² dx − ∫₀^θ x³ dx ) = (2/θ²)( θ [x³/3]₀^θ − [x⁴/4]₀^θ ) = (2/θ²)( θ⁴/3 − θ⁴/4 ) = (2/θ²)(θ⁴/12) = θ²/6
(a) Method of the moments
(a1) Population and sample moments
The distribution has only one parameter, so one equation suffices. By using the information in the hint, μ₁(θ) = E(X) = θ/3 and m₁(x₁,...,x_n) = x̄, so the equation μ₁(θ) = m₁ yields θ = 3x̄. The estimator is obtained after substituting the lower-case letters x_j by upper-case letters X_j:

θ̂_M = (3/n) Σ_{j=1}^n X_j = 3X̄
We do not usually apply the definition MSE ( θ̂ M ) = E ( ( θ̂ M −θ)2 ) but a property derived from it, for which
we need to calculate the variance:
Var(θ̂_M) = Var(3X̄) = 3² Var(X)/n = (9/n)( E(X²) − E(X)² ) = (9/n)( θ²/6 − θ²/9 ) = (9/n)(θ²/18) = θ²/(2n)

where we have used the properties of the variance, a property of the sample mean and the information in the statement. Then

MSE(θ̂_M) = b(θ̂_M)² + Var(θ̂_M) = 0 + θ²/(2n) = θ²/(2n)
(c) Consistency: We try applying the sufficient condition lim_{n→∞} MSE(θ̂) = 0 or, equivalently, lim_{n→∞} b(θ̂) = 0 and lim_{n→∞} Var(θ̂) = 0. Since the bias is identically zero and lim_{n→∞} Var(θ̂_M) = lim_{n→∞} θ²/(2n) = 0, the estimator θ̂_M is consistent (in mean of order two and therefore in probability).
To obtain estimators of the variance, since σ² = Var(X) = θ²/18, the plug-in principle suggests

σ̂²_M = θ̂²_M/18 = (3X̄)²/18 = X̄²/2   σ̂²_ML = ?

(no maximum likelihood estimator of θ was obtained, so this method induces no estimator of σ²).
Conclusion: The method of the moments is applied to obtain an estimator that is unbiased for any sample size n and has good behaviour for large n (many data). The maximum likelihood method cannot be applied, since it is difficult to optimize the likelihood function by considering either its expression or the behaviour of the density function.
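A small simulation sketch of the estimator (θ, n and the number of replications are arbitrary; samples are drawn with the inverse of the distribution function, F(x) = 1 − (1 − x/θ)² for this density):
# Monte Carlo for the method-of-moments estimator
theta = 2; n = 100; M = 10000
rtri = function(n, theta) theta * (1 - sqrt(1 - runif(n)))   # inverse-CDF sampling
thetaM = replicate(M, 3 * mean(rtri(n, theta)))
mean(thetaM)                   # close to theta (unbiasedness)
c(var(thetaM), theta^2/(2*n))  # Monte Carlo variance vs theoretical value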
My notes:
Exercise 2pe
Let X be a random variable following the Rayleigh distribution, whose probability (density) function is

f(x;θ) = (x/θ²) e^(−x²/(2θ²)),  x ≥ 0  (θ > 0)

such that E(X) = θ√(π/2) and Var(X) = ((4−π)/2) θ². Let X = (X₁,...,X_n) be a simple random sample.
Discussion: This is a theoretical exercise where we must apply two methods of point estimation. The basic
properties must be considered for the estimator obtained through the first method.
Note: If E(X) had not been given in the statement, it could have been calculated by applying integration by parts (since polynomials and exponentials are functions “of different type”):

E(X) = ∫₀^{+∞} x (x/θ²) e^(−x²/(2θ²)) dx = [ −x e^(−x²/(2θ²)) ]₀^∞ + ∫₀^∞ e^(−x²/(2θ²)) dx = 0 + √(2θ²) ∫₀^∞ e^(−t²) dt = √(2θ²) (√π/2) = θ √(π/2)

where ∫ u(x)·v'(x) dx = u(x)·v(x) − ∫ u'(x)·v(x) dx has been used with
• u = x → u' = 1
• v' = (x/θ²) e^(−x²/(2θ²)) → v = ∫ (x/θ²) e^(−x²/(2θ²)) dx = −e^(−x²/(2θ²))
Then, we have applied the change

x/√(2θ²) = t → x = t√(2θ²) → dx = √(2θ²) dt
We calculate the variance by using the first two moments. For the second moment, we can apply integration by parts again (the exponent decreases one unit each time):

E(X²) = ∫₀^∞ x² (x/θ²) e^(−x²/(2θ²)) dx = [ −x² e^(−x²/(2θ²)) ]₀^∞ + ∫₀^∞ 2x e^(−x²/(2θ²)) dx = 2θ²

where ∫ u(x)·v'(x) dx = u(x)·v(x) − ∫ u'(x)·v(x) dx has been used with
• u = x² → u' = 2x
• v' = (x/θ²) e^(−x²/(2θ²)) → v = −e^(−x²/(2θ²))

so that Var(X) = E(X²) − E(X)² = 2θ² − (π/2)θ² = ((4−π)/2) θ²
μ₁(θ) = x̄ → θ√(π/2) = x̄ → θ = √(2/π) x̄ → θ̂_M = √(2/π) X̄
(b) Bias, mean square error and consistency
Bias: E(θ̂_M) = √(2/π) E(X̄) = √(2/π) θ√(π/2) = θ, so b(θ̂_M) = 0.

Variance: Var(θ̂_M) = (2/π) Var(X̄) = (2/π) Var(X)/n = ((4−π)/(πn)) θ²

Mean square error: MSE(θ̂_M) = b(θ̂_M)² + Var(θ̂_M) = 0 + ((4−π)/(πn)) θ² = ((4−π)/(πn)) θ²

Consistency: lim_{n→∞} MSE(θ̂_M) = lim_{n→∞} ((4−π)/(πn)) θ² = 0 and therefore θ̂_M is consistent (for θ).
(b) Maximum likelihood method
Likelihood function:

L(X;θ) = ∏_{j=1}^n (x_j/θ²) e^(−x_j²/(2θ²)) = ( ∏_{j=1}^n x_j ) θ^(−2n) e^(−Σ_{j=1}^n x_j²/(2θ²))

Log-likelihood function:
To facilitate the differentiation, θ^(2n) is moved to the numerator and a property of the logarithm is applied.

log( L(X;θ) ) = log( ∏_{j=1}^n x_j ) − Σ_{j=1}^n x_j²/(2θ²) + log( θ^(−2n) ) = log( ∏_{j=1}^n x_j ) − Σ_{j=1}^n x_j²/(2θ²) − 2n log(θ)

The necessary condition for an extreme value gives

0 = d/dθ log( L(X;θ) ) = Σ_{j=1}^n x_j²/θ³ − 2n/θ → θ₀² = Σ_{j=1}^n x_j²/(2n)

and the second derivative is

d²/dθ² log( L(X;θ) ) = d/dθ ( Σ_{j=1}^n x_j²/θ³ − 2n/θ ) = −3 Σ_{j=1}^n x_j²/θ⁴ + 2n/θ²
The first term is negative and the second is positive, but it is difficult to check qualitatively whether the
second is larger in absolute value than the first. Then, the extreme obtained is substituted:
d²/dθ² log( L(X;θ) ) |_{θ² = Σ x_j²/(2n)} = −3 Σ_{j=1}^n x_j² · 4n²/(Σ_{j=1}^n x_j²)² + 2n · 2n/Σ_{j=1}^n x_j² = −12n²/Σ_{j=1}^n x_j² + 4n²/Σ_{j=1}^n x_j² = −8n²/Σ_{j=1}^n x_j² < 0

so the candidate is a maximum. The estimator is

θ̂_ML = √( Σ_{j=1}^n X_j² / (2n) )
Discussion: The Rayleigh distribution is one of the few cases for which the two methods provide different
estimators of the parameter. In the first case, we could easily calculate the mean and the variance, as the
estimator was linear in Xj; nevertheless, in the second case the nonlinearities Xj2 and the square root make
those calculations difficult.
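A Monte Carlo sketch comparing the two estimators (all values arbitrary; Rayleigh samples are drawn via the inverse of the distribution function, X = θ√(−2 log U)):
# Monte Carlo comparison of the moment and maximum likelihood estimators
theta = 1.5; n = 50; M = 10000
rrayleigh = function(n, theta) theta * sqrt(-2 * log(runif(n)))
tM  = replicate(M, sqrt(2/pi) * mean(rrayleigh(n, theta)))
tML = replicate(M, sqrt(sum(rrayleigh(n, theta)^2) / (2*n)))
c(mean((tM - theta)^2), mean((tML - theta)^2))   # ML tends to show slightly smaller MSE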
My notes:
Exercise 3pe
Before commercializing a new model of light bulb, a deep statistical study on its duration (measured in days,
d) must be carried out. The population variable duration is expected to follow the exponential probability
model:

f(x;λ) = λ e^(−λx),  x ≥ 0  (λ > 0)

Now you must prove that X̄ is an efficient estimator of θ = λ⁻¹ (with this notation you can easily calculate the derivative d/dθ, while only experts can calculate d/dλ⁻¹).
(g) The empirical part of the study, based on the measurement of 55 independent light bulbs, has yielded a total sum of Σ_{j=1}^{55} x_j = 598 d. Introduce this information in the expressions obtained in previous sections to give final estimates of λ.
(h) Give an estimate of the mean μ = E(X).
Hint: For section (c), apply the factorization theorem and make it clear which the two parts are. In the theorem: (1) g and h are nonnegative; (2) T cannot depend on θ; (3) g depends only on the sample and the parameter, and it depends on the sample through T; (4) h can be 1; and (5) since h is any function of the sample, it may involve T.
Discussion: First of all, the supposition that the exponential distribution can reasonably be used to model the
variable duration should be tested. One aim of this exercise is to show how many methods and properties
involved in previous exercises can be involved in the same statistical analysis. The quality of the estimators obtained is also studied.
(a1) Population and sample moments: The population distribution has only one parameter, so one equation
suffices. The first-order moments of the model X and the sample x are, respectively,
μ₁(λ) = E(X) = 1/λ and m₁(x₁, x₂,..., x_n) = (1/n) Σ_{j=1}^n x_j = x̄
(a2) System of equations: Since the parameter of interest λ appears in the first moment of X, the solution is:
μ₁(λ) = m₁(x₁, x₂,..., x_n) → 1/λ = (1/n) Σ_{j=1}^n x_j = x̄ → λ = ( (1/n) Σ_{j=1}^n x_j )⁻¹ = 1/x̄
(a3) The estimator:
λ̂_M = ( (1/n) Σ_{j=1}^n X_j )⁻¹ = 1/X̄
(b1) Likelihood function: For an exponential random variable the density function is f (x ; λ)=λ e−λ x , so
we write the product and join the terms that are similar
L(x₁, x₂,..., x_n;λ) = ∏_{j=1}^n f(x_j;λ) = ∏_{j=1}^n λ e^(−λx_j) = λ e^(−λx₁) · λ e^(−λx₂) ⋯ λ e^(−λx_n) = λⁿ e^(−λ Σ_{j=1}^n x_j)
(b2) Optimization problem: The logarithm function is applied to make calculations easier
log[ L(x₁, x₂,..., x_n;λ) ] = log[λⁿ] + log[ e^(−λ Σ_{j=1}^n x_j) ] = n·log[λ] − λ·Σ_{j=1}^n x_j

The population distribution has only one parameter, and hence a one-dimensional function must be maximized. To find the local or relative extreme values, the necessary condition is:

0 = d/dλ log[ L(x₁, x₂,..., x_n;λ) ] = n/λ − Σ_{j=1}^n x_j → λ₀ = ( (1/n) Σ_{j=1}^n x_j )⁻¹ = 1/x̄

To verify that the only candidate is a (local) maximum, the sufficient condition is:

d²/dλ² log[ L(x₁, x₂,..., x_n;λ) ] = −n/λ² < 0

which holds for any value, particularly for λ₀ = 1/x̄.
(b3) The estimator:
λ̂_ML = ( (1/n) Σ_{j=1}^n X_j )⁻¹ = 1/X̄
mathematically both types of information; then, this term would be part of g too. Moreover, the only candidate to be a sufficient statistic is T(X) = T(X₁,...,X_n) = Σ_{j=1}^n X_j.
Since the condition holds for g(T(X₁, X₂,..., X_n);λ) = λⁿ e^(−λ Σ_{j=1}^n X_j) and h(X₁, X₂,..., X_n) = 1, the statistic T(X) = T(X₁,...,X_n) = Σ_{j=1}^n X_j is sufficient. This means that it “summarizes the important information (about the parameter)” contained in the sample. The previous statistic contains the same information as any one-to-one transformation of it, concretely the sample mean (1/n) T(X) = (1/n) Σ_{j=1}^n X_j = X̄.
(d1) Unbiasedness: By applying a property of the sample mean and the information of the statement,
E(X̄) = E(X) = 1/λ → b(X̄) = E(X̄) − λ = 1/λ − λ ≠ 0

The first condition does not hold for all values of λ, and hence it is not necessary to check the second one.

Note: The previous bias is zero when 1/λ − λ = 0 ↔ λ = ±√1 → λ = 1 (for f(x) to be a probability function, λ must be positive, so the solution −1 is not taken into account). Thus, when λ = 1, the estimator may still be efficient if the second condition holds.
(e2) Variance: By applying a property of the sample mean and the information of the statement,
Var(X̄) = Var(X)/n = 1/(λ²·n) → lim_{n→∞} Var(X̄) = lim_{n→∞} 1/(λ²·n) = 0
As a conclusion, the mean square error (MSE) tends to zero, which is sufficient—but not necessary—for the
consistency (in probability).
(f2) Minimum variance: We compare the variance and the Cramér-Rao's bound. The variance is:

Var(X̄) = Var(X)/n = θ²/n
On the other hand, the bound is calculated step by step:
i. Function (with X in place of x):

f(X;θ) = (1/θ) e^(−X/θ)

ii. Logarithm of the function:

log[f(X;θ)] = log(θ⁻¹) + log( e^(−X/θ) ) = −log(θ) − X/θ

iii.–iv. Derivative and expectation:

E[ ( ∂log[f(X;θ)]/∂θ )² ] = E[ ( −1/θ + X/θ² )² ] = E[ ( (X−θ)/θ² )² ] = Var(X)/θ⁴ = θ²/θ⁴ = 1/θ²

v. Theoretical Cramér-Rao's lower bound:

1 / ( n · E[ ( ∂log[f(X;θ)]/∂θ )² ] ) = 1 / ( n · (1/θ²) ) = θ²/n
The variance of the estimator attains the bound, so the estimator has minimum variance. The fulfillment of the two conditions proves that X̄ is an efficient estimator of λ⁻¹ = θ.
(g) Estimation of λ
It is necessary to use the only information available: Σ_{j=1}^{55} x_j = 598 d.

From the method of the moments: λ̂_M = ( (1/n) Σ_{j=1}^n x_j )⁻¹ = ( (1/55)·598 d )⁻¹ = 0.09197 d⁻¹.

From the maximum likelihood method, since the same estimator was obtained: λ̂_ML = 0.09197 d⁻¹.
(h) Estimation of μ
Since μ = E(X) = 1/λ, an estimator of λ induces, by applying the plug-in principle, an estimator of μ:

μ̂ = 1/λ̂ = x̄ = (1/55)·598 d = 10.87 d
Conclusion: We can see that for the exponential model the two methods provide the same estimator for λ.
The estimator obtained has been used to obtain an estimator of the population mean. The mean duration
estimate of the new model of light bulb was 10.87 days. On the other hand, some desirable properties of the
estimator have been proved. A different, equivalent notation has been used to facilitate the proof of one of
these properties, which emphasizes the importance of the notation in doing calculations.
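These final computations can be reproduced in R (only the total sum and the sample size of the statement are used):
# Estimates of sections (g) and (h)
n = 55; total = 598        # measured in days
lambdaHat = n/total        # 0.09197 per day
muHat = total/n            # 10.87 days
c(lambdaHat, muHat)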
My notes:
Remark 2ci: Since there is an infinite number of pairs of quantiles a1 and a2 such that P (a 1≤T ≤a2 )=1−α , those
determining tails of probability α/2 are considered by convention. This criterion is also applied for two-tailed hypothesis tests.
Remark 3ci: When the Central Limit Theorem can be applied, asymptotic results on averages are relatively independent of the initial population. Therefore, in some exercises there are no suppositions on the distribution of the population variables.
Exercise 1ci-m
To forecast the yearly inflation (in percent, %), a simple random sample has been gathered:
1.5 2.1 1.9 2.3 2.5 3.2 3.0
It is assumed that the variable inflation follows a normal distribution.
(a) By using these data, construct a 99% confidence interval for the mean of the inflation.
(b) Experts have the opinion that the previous interval is too wide, and they want a total length of a unit.
Find the level of confidence for this new interval.
(c) Construct a confidence interval of 90% for the standard deviation.
Discussion: The intervals will be built by applying the method of the pivot, and then the expression of the
margin of error is determined. Since variances are nonnegative by definition and the positive branch of the
square root function is strictly increasing, the interval for the standard deviation is obtained by applying the
square root to the interval for the variance.
Sample information
Theoretical (simple random) sample: X1,..., X7 s.r.s. → n = 7
Empirical sample: x1,..., x7 → 1.5 2.1 1.9 2.3 2.5 3.2 3.0
In this exercise, we know the values of the sample xi. This allows calculating any quantity we want.
(a) Confidence interval for the mean: To choose the proper pivot, we take into account:
• The variable of interest follows a normal distribution.
• The population variance σ2 is unknown, so it must be estimated by the sample (quasi)variance.
• The sample size is small, n = 7, so we should not think about the asymptotic framework.
From a table of statistics (e.g. in [T]), the pivot

T(X;μ) = (X̄−μ)/√(S²/n) ~ t_{n−1}

is selected. Then

1−α = P( −r_{α/2} ≤ (X̄−μ)/√(S²/n) ≤ +r_{α/2} ) = P( −r_{α/2}√(S²/n) ≤ X̄−μ ≤ +r_{α/2}√(S²/n) )
= P( −X̄−r_{α/2}√(S²/n) ≤ −μ ≤ −X̄+r_{α/2}√(S²/n) ) = P( X̄+r_{α/2}√(S²/n) ≥ μ ≥ X̄−r_{α/2}√(S²/n) )

so

I_{1−α} = [ X̄ − r_{α/2}√(S²/n) , X̄ + r_{α/2}√(S²/n) ]
where r α / 2 is the quantile such that P(T > r α/2 )=α /2. Let us calculate the quantities in the formula:
• x̄ = (1/7) Σ_{j=1}^7 x_j = 2.36 %
• The level of confidence is 99%, and hence α = 0.01. The quantile is found in the table of the t distribution with κ = 7−1 degrees of freedom: r_{α/2} = r_{0.01/2} = r_{0.005} = 3.71
• By using the data, S² = (1/(7−1)) Σ_{j=1}^7 (x_j − x̄)² = (1/6)[ (1.5% − 2.36%)² + ⋯ + (3.0% − 2.36%)² ] = 0.36 %²
• Finally, n = 7
I_{0.99} = [ 2.36% − 3.71·√(0.36%²/7) , 2.36% + 3.71·√(0.36%²/7) ] = [1.52% , 3.20%]

whose length is 3.20% − 1.52% = 1.68%.
(b) Confidence level: The length of the interval, the distance between the two endpoints, is twice the margin
of error when T follows a symmetric distribution.
L = ( X̄ + r_{α/2}√(S²/n) ) − ( X̄ − r_{α/2}√(S²/n) ) = 2 r_{α/2}√(S²/n)

In this section L is given and α must be found; nevertheless, it is necessary to find r_{α/2} first.

r_{α/2} = L√n/(2S) = (1%·√7)/(2·0.6%) = 2.20

> 1-pt(2.20, 7-1)
[1] 0.03505109

In the table of the t law it is found that α/2 = 0.035, so α = 0.07 and 1−α = 0.93. The confidence level is 93%.
(c) Confidence interval for the standard deviation: To choose the new statistic:
• The variable of interest follows a normal distribution.
• The quantity of interest is the standard deviation σ.
• The population mean μ is unknown.
• The sample size is small, n = 7, so we should not think about the asymptotic framework.
From a table of statistics (e.g. in [T]), the proper pivot T(X;σ) = (n−1)S²/σ² ~ χ²_{n−1} is selected, which leads to the interval for the variance

I_{1−α} = [ (n−1)S²/r_{α/2} , (n−1)S²/l_{α/2} ]

where r_{α/2} and l_{α/2} are the quantiles of the chi-squared distribution with 6 degrees of freedom leaving probability α/2 in the right and left tails, respectively. By applying the square root,

I_{0.9} = [ √(6·0.36%²/12.6) , √(6·0.36%²/1.64) ] = [0.414% , 1.148%]
Conclusion: The length in section (b) is smaller than in section (a), that is, the interval is narrower and the
confidence is smaller.
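Both sections can be checked with R (the data are those of the statement):
# Section (a): 99% interval for the mean
x = c(1.5, 2.1, 1.9, 2.3, 2.5, 3.2, 3.0)
c(mean(x), var(x))                    # approximately 2.36 and 0.36
qt(1 - 0.01/2, df=6)                  # 3.707, the quantile used above
t.test(x, conf.level=0.99)$conf.int   # interval for the mean
# Section (c): 90% interval for the standard deviation
sqrt(6*var(x)/qchisq(c(0.95, 0.05), df=6))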
My notes:
Exercise 2ci-m
In the library of a university, the mean duration (in days, d) of the borrowing period seems to be 20 d. A simple random sample of 100 books is analysed, and the values 18 d and 8 d² are obtained for the sample mean and the sample variance, respectively. Construct a 99% confidence interval for the mean duration of the borrowings to check if the initial population value is inside.
Discussion: For so many data, asymptotic results are considered. The method of the pivotal quantity can
also be applied. The dimension of the variable duration is time, while the unit of measurement is days.
Sample information:
Theoretical (simple random) sample: X1,...,X100 s.r.s. → n = 100
Empirical sample: x1,...,x100 → $\bar{x} = 18d$, $s^2 = 8d^2$
The values xj of the sample are unknown; instead, the evaluation of some statistics is given. These quantities must be sufficient for the calculations, and, therefore, formulas must be written in terms of $\bar{X}$ and $S^2$.
The statistic

$$T(X;\mu) = \frac{\bar{X}-\mu}{\sqrt{S^2/n}} \xrightarrow{d} N(0,1)$$

is chosen, where S² is the sample quasivariance. By applying the method of the pivotal quantity:
$$1-\alpha = P(l_{\alpha/2} \le T(X;\mu) \le r_{\alpha/2}) = P\left(-r_{\alpha/2} \le \frac{\bar{X}-\mu}{\sqrt{S^2/n}} \le +r_{\alpha/2}\right) = P\left(\bar{X}-r_{\alpha/2}\sqrt{\frac{S^2}{n}} \le \mu \le \bar{X}+r_{\alpha/2}\sqrt{\frac{S^2}{n}}\right)$$

Then, the interval is

$$I_{1-\alpha} = \left[\bar{X}-r_{\alpha/2}\sqrt{\frac{S^2}{n}},\ \bar{X}+r_{\alpha/2}\sqrt{\frac{S^2}{n}}\right]$$

where $r_{\alpha/2}$ is the quantile such that $P(Z > r_{\alpha/2}) = \alpha/2$.
The interval is

$$I_{0.99} = \left[18d - 2.58\sqrt{\frac{8.1d^2}{100}},\ 18d + 2.58\sqrt{\frac{8.1d^2}{100}}\right] = [17.27d,\ 18.73d]$$

where $S^2 = \frac{n}{n-1}s^2 = \frac{100\cdot 8d^2}{99} \approx 8.1d^2$ has been used.
Conclusion: With 99% confidence, the mean duration of the borrowings belongs to the interval obtained. The initial value μ = 20d is not inside this high-confidence interval, that is, it is not
supported by the data. (Remember: statistical results depend on: the assumptions, the methods, the certainty
and the data.)
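A minimal R sketch of the computation (the variable names are ours):

n = 100; xbar = 18; s2 = 8        # sample variance, divisor n
S2 = n*s2/(n-1)                   # quasivariance, ~8.1 d^2
r = qnorm(1-0.01/2)               # 2.58
xbar + c(-1,1) * r * sqrt(S2/n)   # [17.27, 18.73]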
My notes:
Exercise 3ci-m
The accounting firm Price Waterhouse periodically monitors the U.S. Postal Service's performance. One parameter of interest is the percentage of mail delivered on time. In a simple random sample of 332,000 mail items, 282,200 were delivered on time. Build a 99% confidence interval for the proportion of mail delivered on time.
Discussion: The population is characterized by a Bernoulli variable, since for each item there are only two
possible values. We must construct a confidence interval for the proportion (a percent is a proportion
expressed in a 0-to-100 scale). Proportions have no dimension.
Confidence interval
For this kind of population and amount of data, we use the statistic:

$$T(X;\eta) = \frac{\hat{\eta}-\eta}{\sqrt{\dfrac{?(1-?)}{n}}} \xrightarrow{d} N(0,1)$$

where ? is substituted by η or $\hat{\eta}$. For confidence intervals η is unknown and no value is supposed, and hence it is estimated through the sample proportion. By applying the method of the pivot:
$$1-\alpha = P(l_{\alpha/2} \le T(X;\eta) \le r_{\alpha/2}) = P\left(-r_{\alpha/2} \le \frac{\hat{\eta}-\eta}{\sqrt{\hat{\eta}(1-\hat{\eta})/n}} \le +r_{\alpha/2}\right) = P\left(\hat{\eta}-r_{\alpha/2}\sqrt{\frac{\hat{\eta}(1-\hat{\eta})}{n}} \le \eta \le \hat{\eta}+r_{\alpha/2}\sqrt{\frac{\hat{\eta}(1-\hat{\eta})}{n}}\right)$$

Then, the interval is

$$I_{1-\alpha} = \left[\hat{\eta}-r_{\alpha/2}\sqrt{\frac{\hat{\eta}(1-\hat{\eta})}{n}},\ \hat{\eta}+r_{\alpha/2}\sqrt{\frac{\hat{\eta}(1-\hat{\eta})}{n}}\right]$$
Substitution: We calculate the quantities in the formula,
• n = 332000
• $\hat{\eta} = \frac{282200}{332000} = 0.850$
• 99% → 1–α = 0.99 → α = 0.01 → α/2 = 0.005 → $r_{\alpha/2} = r_{0.005} = l_{0.995} = 2.58$
So

$$I_{0.99} = \left[0.850 - 2.58\sqrt{\frac{0.850(1-0.850)}{332000}},\ 0.850 + 2.58\sqrt{\frac{0.850(1-0.850)}{332000}}\right] = [0.848,\ 0.852]$$
Conclusion: With a confidence of 0.99, measured in a 0-to-1 scale, the value of η will be in the interval [0.848, 0.852].
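A minimal R sketch (variable names ours):

n = 332000; p = 282200/n          # 0.850
r = qnorm(1-0.01/2)               # 2.58
p + c(-1,1) * r * sqrt(p*(1-p)/n) # [0.848, 0.852]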
My notes:
Exercise 4ci-m
Two independent groups, A and B, consist of 100 people each, all of whom have a disease. A serum is given to group A but not to group B, which are termed treatment and control groups, respectively; otherwise, the two
groups are treated identically. Two simple random samples have yielded that in the two groups, 75 and 65
people, respectively, recover from the disease. To study the effect of the serum, build a 95% confidence
interval for the difference ηA–ηB. Does the interval contain the case ηA = ηB?
Discussion: There are two independent Bernoulli populations. The interval for the difference of proportions is built by applying the method of the pivot. Proportions are, by definition, dimensionless quantities.
The statistic

$$T(A,B) = \frac{(\hat{\eta}_A-\hat{\eta}_B)-(\eta_A-\eta_B)}{\sqrt{\dfrac{\hat{\eta}_A(1-\hat{\eta}_A)}{n_A}+\dfrac{\hat{\eta}_B(1-\hat{\eta}_B)}{n_B}}} \xrightarrow{d} N(0,1)$$

leads, by the method of the pivot, to the interval

$$I_{1-\alpha} = \left[(\hat{\eta}_A-\hat{\eta}_B) - r_{\alpha/2}\sqrt{\frac{\hat{\eta}_A(1-\hat{\eta}_A)}{n_A}+\frac{\hat{\eta}_B(1-\hat{\eta}_B)}{n_B}},\ (\hat{\eta}_A-\hat{\eta}_B) + r_{\alpha/2}\sqrt{\frac{\hat{\eta}_A(1-\hat{\eta}_A)}{n_A}+\frac{\hat{\eta}_B(1-\hat{\eta}_B)}{n_B}}\right]$$

where $r_{\alpha/2}$ is the value of the standard normal distribution such that $P(Z > r_{\alpha/2}) = \alpha/2$. Substituting $\hat{\eta}_A = 75/100 = 0.75$, $\hat{\eta}_B = 65/100 = 0.65$, $n_A = n_B = 100$ and $r_{0.025} = 1.96$,

$$I_{0.95} = (0.75-0.65) \mp 1.96\sqrt{\frac{0.75(1-0.75)}{100}+\frac{0.65(1-0.65)}{100}} = [-0.0263,\ 0.2263]$$
Conclusion: The lack-of-effect case (ηA = ηB) cannot be excluded when the decision has 95% confidence.
Since η ∈( 0,1), any “reasonable” estimator of η should provide values in this range or close to it. Because
of the natural uncertainty of the sampling process (randomness and variability), in this case the smaller
endpoint of the interval was –0.0263, which can be interpreted as being 0. When an interval of high
confidence is far from 0, the case ηA = ηB can clearly be discarded or rejected. Finally, it is important to notice
that a confidence interval can be used to make decisions about hypotheses on the parameter values—it is
equivalent to a two-sided hypothesis test, as the interval is also two-sided. (Remember: statistical results
depend on: the assumptions, the methods, the certainty and the data.)
Advanced theory: When the assumption ηA = η = ηB seems reasonable (notice that this case is included in the 95% confidence interval just calculated), it makes sense to try to estimate the common variance of the estimator as well as possible. This can be done by using the pooled sample proportion $\hat{\eta}_p = \frac{n_A\hat{\eta}_A + n_B\hat{\eta}_B}{n_A+n_B}$ in estimating η(1–η) for the denominator; nonetheless, the pooled estimator should not be considered in the numerator, as $(\hat{\eta}_p-\hat{\eta}_p)=0$ whatever the data are. The statistic would be:
$$\tilde{T}(A,B) = \frac{(\hat{\eta}_A-\hat{\eta}_B)-(\eta_A-\eta_B)}{\sqrt{\dfrac{\hat{\eta}_p(1-\hat{\eta}_p)}{n_A}+\dfrac{\hat{\eta}_p(1-\hat{\eta}_p)}{n_B}}} \xrightarrow{d} N(0,1)$$

$$\tilde{I}_{1-\alpha} = \left[(\hat{\eta}_A-\hat{\eta}_B) - r_{\alpha/2}\sqrt{\frac{\hat{\eta}_p(1-\hat{\eta}_p)}{n_A}+\frac{\hat{\eta}_p(1-\hat{\eta}_p)}{n_B}},\ (\hat{\eta}_A-\hat{\eta}_B) + r_{\alpha/2}\sqrt{\frac{\hat{\eta}_p(1-\hat{\eta}_p)}{n_A}+\frac{\hat{\eta}_p(1-\hat{\eta}_p)}{n_B}}\right]$$
The quantities involved in the previous formula are
• nA = 100 and nB = 100
• $\hat{\eta}_p = \frac{100\cdot 0.75 + 100\cdot 0.65}{100+100} = 0.70$
Then,

$$\tilde{I}_{0.95} = (0.75-0.65) \mp 1.96\sqrt{\frac{2\cdot 0.70(1-0.70)}{100}} = [-0.0270,\ 0.227]$$
One way to measure how different the results are consists in directly comparing the length—twice the margin
of error—in both cases:
$$L = 0.226-(-0.0263) = 0.2523 \qquad \tilde{L} = 0.227-(-0.0270) = 0.254$$

Even if the latter length is larger, it is theoretically more trustworthy than the former when ηA = η = ηB is true.
The general expressions of these lengths can be found too:

$$L = 2r_{\alpha/2}\sqrt{\frac{\hat{\eta}_A(1-\hat{\eta}_A)}{n_A}+\frac{\hat{\eta}_B(1-\hat{\eta}_B)}{n_B}} \qquad \tilde{L} = 2r_{\alpha/2}\sqrt{\frac{\hat{\eta}_p(1-\hat{\eta}_p)}{n_A}+\frac{\hat{\eta}_p(1-\hat{\eta}_p)}{n_B}}$$
Another way to measure how different the results are can be based on comparing the statistics:

$$\tilde{T}(A,B) = \frac{(\hat{\eta}_A-\hat{\eta}_B)-(\eta_A-\eta_B)}{\sqrt{\dfrac{\hat{\eta}_p(1-\hat{\eta}_p)}{n_A}+\dfrac{\hat{\eta}_p(1-\hat{\eta}_p)}{n_B}}} = T(A,B)\cdot\frac{\sqrt{\dfrac{\hat{\eta}_A(1-\hat{\eta}_A)}{n_A}+\dfrac{\hat{\eta}_B(1-\hat{\eta}_B)}{n_B}}}{\sqrt{\dfrac{\hat{\eta}_p(1-\hat{\eta}_p)}{n_A}+\dfrac{\hat{\eta}_p(1-\hat{\eta}_p)}{n_B}}}$$

so that

$$\frac{\tilde{T}}{T} = \frac{\sqrt{\dfrac{\hat{\eta}_A(1-\hat{\eta}_A)}{n_A}+\dfrac{\hat{\eta}_B(1-\hat{\eta}_B)}{n_B}}}{\sqrt{\dfrac{\hat{\eta}_p(1-\hat{\eta}_p)}{n_A}+\dfrac{\hat{\eta}_p(1-\hat{\eta}_p)}{n_B}}} = \frac{L}{\tilde{L}} \qquad (\text{so } L\cdot T = \tilde{L}\cdot\tilde{T})$$
Thus, the quantity

$$\frac{\sqrt{\dfrac{\hat{\eta}_A(1-\hat{\eta}_A)}{n}+\dfrac{\hat{\eta}_B(1-\hat{\eta}_B)}{n}}}{\sqrt{\dfrac{\hat{\eta}_p(1-\hat{\eta}_p)}{n}+\dfrac{\hat{\eta}_p(1-\hat{\eta}_p)}{n}}} = \frac{\sqrt{\hat{\eta}_A(1-\hat{\eta}_A)+\hat{\eta}_B(1-\hat{\eta}_B)}}{\sqrt{2\,\hat{\eta}_p(1-\hat{\eta}_p)}} = 0.994$$
can be seen as a measure of the effect of using the pooled sample proportion. This effect is small in this exercise, but it could be larger in other situations. As regards the case ηA = η = ηB, it is also included in this interval, which is not remarkable since it has been used as an assumption; nevertheless, the exclusion of this case would have contradicted the initial assumption.
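A minimal R sketch of both intervals and of the measure above (variable names ours):

nA = 100; nB = 100; pA = 0.75; pB = 0.65
r = qnorm(1-0.05/2)                       # 1.96
se = sqrt(pA*(1-pA)/nA + pB*(1-pB)/nB)    # unpooled standard error
pp = (nA*pA + nB*pB)/(nA+nB)              # pooled proportion, 0.70
sep = sqrt(pp*(1-pp)/nA + pp*(1-pp)/nB)   # pooled standard error
(pA-pB) + c(-1,1) * r * se                # [-0.0263, 0.2263]
(pA-pB) + c(-1,1) * r * sep               # [-0.0270, 0.2270]
se/sep                                    # 0.994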
My notes:
Remark 5ci: Once there is a discrete quantity in an equation, the unknown cannot take any possible value. This implies that, strictly speaking, equalities like

$$E = r_{\alpha/2}\sqrt{\frac{\sigma^2}{n}} \qquad \frac{\sigma^2}{nE^2} = \alpha$$

may never be fulfilled for continuous E, α, σ and discrete n. Solving the equality and rounding the result upward is an alternative to solving the inequalities

$$E_g \ge E = r_{\alpha/2}\sqrt{\frac{\sigma^2}{n}} \qquad \frac{\sigma^2}{nE_g^2} \le \alpha$$

where the purpose is to find the minimum n for which the (possibly discrete values of the) margin of error is smaller than or equal to the given precision Eg.
Exercise 1ci-s
The lengths (in millimeters, mm) of metal rods produced by an industrial process are normally distributed
with a standard deviation of 1.8mm. Based on a simple random sample of nine observations from this
population, the 99% confidence interval was found for the population mean length to extend from 194.65mm
to 197.75mm. Suppose that a production manager believes that the interval is too wide for practical use and,
instead, requires a 99% confidence interval extending no further than 0.50mm on each side of the sample
mean. How large a sample is needed to achieve such an interval? Apply both the method based on the confidence interval and the method based on Chebyshev's inequality.
(From: Statistics for Business and Economics, Newbold, P., W.L. Carlson and B.M. Thorne, Pearson.)
Discussion: There is one normal population with known standard deviation. By using a sample of nine
elements, a 99% confidence interval was built, I1 = [194.65mm, 197.75mm], of length 197.75mm – 194.65mm
= 3.1mm and margin of error 3.1mm/2 = 1.55mm. A narrower interval is desired, and the number of data
necessary in the new sample must be calculated. More data will be necessary for the new margin of error to be
smaller (0.50 < 1.55) while the other quantities—standard deviation and confidence—are the same.
Sample information:
Theoretical (simple random) sample: X1,..., Xn s.r.s. (the lengths of n rods are taken)
Margin of error:
We need the expression of the margin of error. If we do not remember it, we can apply the method of the pivot
to take the expression from the formula of the interval.
$$I_{1-\alpha} = \left[\bar{X}-r_{\alpha/2}\sqrt{\frac{\sigma^2}{n}},\ \bar{X}+r_{\alpha/2}\sqrt{\frac{\sigma^2}{n}}\right]$$

If we remember the expression, we can use it directly. Either way, the margin of error (for one normal population with known variance) is

$$E = r_{\alpha/2}\sqrt{\frac{\sigma^2}{n}}$$
Sample size
Method based on the confidence interval: We want the margin of error E to be smaller than or equal to the given Eg,
$$E_g \ge E = r_{\alpha/2}\sqrt{\frac{\sigma^2}{n}} \ \rightarrow\ E_g^2 \ge r_{\alpha/2}^2\frac{\sigma^2}{n} \ \rightarrow\ n \ge r_{\alpha/2}^2\frac{\sigma^2}{E_g^2} = \left(2.58\,\frac{1.8\text{mm}}{0.5\text{mm}}\right)^2 = 86.27 \ \rightarrow\ n \ge 87$$

since $r_{\alpha/2} = r_{0.01/2} = r_{0.005} = 2.58$. (The inequality changes neither when multiplying or dividing by positive quantities nor when squaring, while it changes when inverting.)
Method based on Chebyshev's inequality: For unbiased estimators, it holds that:

$$P(|\hat{\theta}-\theta| \ge E) = P(|\hat{\theta}-E(\hat{\theta})| \ge E) \le \frac{Var(\hat{\theta})}{E^2} \le \alpha$$

so, with $Var(\hat{\theta}) = Var(\bar{X}) = \frac{\sigma^2}{n}$,

$$\frac{1}{n}\frac{\sigma^2}{E_g^2} \le \alpha \ \rightarrow\ n \ge \frac{1}{\alpha}\frac{\sigma^2}{E_g^2} = \frac{1.8^2\text{mm}^2}{0.01\cdot 0.5^2\text{mm}^2} = 1296 \ \rightarrow\ n \ge 1296$$
Conclusion: At least n data are necessary to guarantee that the margin of error is at most 0.50mm (this margin can be thought of as “the maximum error in probability”, in the sense that the distance or error $|\theta-\hat{\theta}|$ will be smaller than Eg with a probability of 1–α = 0.99, but larger with a probability of α = 0.01). Any number of data larger than n would guarantee—and go beyond—the precision desired. As expected, more data are necessary (87 > 9) to increase the accuracy (narrower interval) with the same confidence. The minimum sample sizes provided by the two methods are quite different (see remark 4ci). (Remember: statistical results depend on: the assumptions, the methods, the certainty and the data.)
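A minimal R sketch of the two sample-size calculations (variable names ours):

sigma = 1.8; Eg = 0.5; alpha = 0.01
r = 2.58                          # qnorm(1-alpha/2), rounded as in the tables
ceiling((r*sigma/Eg)^2)           # 87, confidence-interval method
ceiling(sigma^2/(alpha*Eg^2))     # 1296, Chebyshev method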
My notes:
Sample information:
Theoretical (simple random) sample: X1,..., X9 s.r.s. (the marks of nine students are to be taken) → n = 9
Empirical sample: x1,...,x9 → $\sum_{j=1}^{9} x_j = 1{,}098$, $\sum_{j=1}^{9} x_j^2 = 138{,}148$ (the marks have been taken)
We can see that the sample values xj themselves are unknown in this exercise; instead, information calculated
from them is provided; this information must be sufficient for carrying out the calculations.
a) Method of the pivotal quantity: To choose the proper statistic with which the confidence interval is
calculated, we take into account that:
• The variable follows a normal distribution
• We are given the value of the population standard deviation σ
• The sample size is small, n = 9, so asymptotic formulas cannot be applied
the statistic

$$T(X;\mu) = \frac{\bar{X}-\mu}{\sqrt{\sigma^2/n}} \sim N(0,1)$$

is selected. Then

$$1-\alpha = P(l_{\alpha/2} \le T(X;\mu) \le r_{\alpha/2}) = P\left(-r_{\alpha/2} \le \frac{\bar{X}-\mu}{\sqrt{\sigma^2/n}} \le +r_{\alpha/2}\right) = P\left(\bar{X}-r_{\alpha/2}\sqrt{\frac{\sigma^2}{n}} \le \mu \le \bar{X}+r_{\alpha/2}\sqrt{\frac{\sigma^2}{n}}\right)$$

so

$$I_{1-\alpha} = \left[\bar{X}-r_{\alpha/2}\sqrt{\frac{\sigma^2}{n}},\ \bar{X}+r_{\alpha/2}\sqrt{\frac{\sigma^2}{n}}\right]$$
where r α / 2 is the value of the standard normal distribution verifying P( Z>r α /2 )=α / 2 , that is, the value
such that an area equal to α /2 is on the right (upper tail).
$$I_{0.9} = \left[122 - 1.645\,\frac{28.2}{\sqrt{9}},\ 122 + 1.645\,\frac{28.2}{\sqrt{9}}\right] = [106.54,\ 137.46]$$

where $\bar{x} = \frac{1}{9}\sum_{j=1}^{9}x_j = \frac{1098}{9} = 122$.
b) Length of the interval: To answer this question it is possible to argue that, when all the parameters but the
length are fixed, if higher certainty is desired it is necessary to widen the interval, that is, to increase the
distance between the two endpoints. The formal way to justify this idea consists in using the formula of the
interval:
$$L = \left(\bar{X}+r_{\alpha/2}\sqrt{\frac{\sigma^2}{n}}\right) - \left(\bar{X}-r_{\alpha/2}\sqrt{\frac{\sigma^2}{n}}\right) = 2r_{\alpha/2}\sqrt{\frac{\sigma^2}{n}}$$
Now, if σ and n remain unchanged, to study how L changes with α it is enough to see how the quantile
“moves”. For the 95% interval:
• α = 0.05 → α decreases with respect to the value in section (a)
• Now r α/ 2 must leave less area (probability) on the right → r α/ 2 increases → L increases
In short, when the tails (α) get smaller the interval (1–α) gets wider, and vice versa.
c) Sample size:
Method based on the confidence interval: Now the 90% confidence interval of the first section is revisited.
For given α and Lg, the value of n must be found. From the expression of the length,

$$L_g \ge L = 2r_{\alpha/2}\sqrt{\frac{\sigma^2}{n}} \ \rightarrow\ L_g^2 \ge 4r_{\alpha/2}^2\frac{\sigma^2}{n} \ \rightarrow\ n \ge \left(2r_{\alpha/2}\frac{\sigma}{L_g}\right)^2 = \left(2\cdot 1.645\,\frac{28.2}{10}\right)^2 = 86.08 \ \rightarrow\ n \ge 87$$

(Only when inverting must the inequality be changed.)
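A minimal R sketch for this exercise (variable names ours):

sigma = 28.2; n = 9; xbar = 1098/9        # 122
r = qnorm(1-0.1/2)                        # 1.645, 90% confidence
xbar + c(-1,1) * r * sigma/sqrt(n)        # [106.54, 137.46]
Lg = 10
ceiling((2*r*sigma/Lg)^2)                 # 87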
My notes:
Discussion: For 64 data, asymptotic results can be applied. The method of the pivotal quantity will be
applied. The role of the number 100 is no other than being part of the units in which the data are measured.
For the second section, additional suppositions—added by myself—are considered; in a real-world situation
they should be evaluated.
Sample information:
Theoretical (simple random) sample: C = (C1,...,C64) s.r.s. → n = 64
Empirical sample: c = (c1,...,c64) → c̄=9.36u , s=1.4 u
The values cj of the sample are unknown; instead, the evaluation of some statistics is given. These quantities must be sufficient for the calculations, so formulas must involve $\bar{C}$ and $s^2$.
The statistic

$$T(C;\mu) = \frac{\bar{C}-\mu}{\sqrt{S^2/n}} \xrightarrow{d} N(0,1)$$

is chosen, where S² will be calculated by applying the relation $n s^2 = (n-1)S^2$. By applying the method of the pivot:
$$1-\alpha = P(l_{\alpha/2} \le T(C;\mu) \le r_{\alpha/2}) = P\left(-r_{\alpha/2} \le \frac{\bar{C}-\mu}{\sqrt{S^2/n}} \le +r_{\alpha/2}\right) = P\left(\bar{C}-r_{\alpha/2}\sqrt{\frac{S^2}{n}} \le \mu \le \bar{C}+r_{\alpha/2}\sqrt{\frac{S^2}{n}}\right)$$

Then, the confidence interval is

$$I_{1-\alpha} = \left[\bar{C}-r_{\alpha/2}\sqrt{\frac{S^2}{n}},\ \bar{C}+r_{\alpha/2}\sqrt{\frac{S^2}{n}}\right]$$

where $r_{\alpha/2}$ is the quantile such that $P(Z > r_{\alpha/2}) = \alpha/2$.
The interval is

$$I_{0.96} = \left[9.36u - 2.054\sqrt{\frac{1.99u^2}{64}},\ 9.36u + 2.054\sqrt{\frac{1.99u^2}{64}}\right] = [9.00u,\ 9.72u]$$

where $S^2 = \frac{n}{n-1}s^2 = \frac{64\cdot 1.4^2u^2}{63} \approx 1.99u^2$.
From a table of statistics (e.g. in [T]), the following pivot is selected (now the exact sampling distribution is known, instead of the asymptotic distribution):

$$T(C;\mu) = \frac{\bar{C}-\mu}{\sqrt{\sigma^2/n}} \sim N(0,1)$$
By doing calculations similar to those of the previous section or exercise, the interval is

$$I_{1-\alpha} = \left[\bar{C}-r_{\alpha/2}\sqrt{\frac{\sigma^2}{n}},\ \bar{C}+r_{\alpha/2}\sqrt{\frac{\sigma^2}{n}}\right]$$

from which the expression of the margin of error is obtained, namely $E = r_{\alpha/2}\sqrt{\sigma^2/n}$. Values can be substituted either before or after working the inequality; this time let us use numbers from the beginning:

$$E_g = \frac{1}{4}u \ge E = 2.054\sqrt{\frac{2u^2}{n}} \ \rightarrow\ \frac{1}{4^2}u^2 \ge 2.054^2\,\frac{2u^2}{n} \ \rightarrow\ n \ge 4^2\cdot 2.054^2\cdot 2 = 135.01 \ \rightarrow\ n \ge 136$$

(When inverting, the inequality must be changed.)
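A minimal R sketch of both sections (variable names ours; σ² = 2u² in the second part):

n = 64; cbar = 9.36; s2 = 1.4^2
S2 = n*s2/(n-1)                  # quasivariance, ~1.99 u^2
r = 2.054                        # qnorm(1-0.04/2), rounded as in the tables
cbar + c(-1,1)*r*sqrt(S2/n)      # [9.00, 9.72]
ceiling(r^2*2/(1/4)^2)           # 136, with sigma^2 = 2u^2 and Eg = 1/4 u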
My notes:
Exercise 3ci
You have been hired by a consortium of dairy farmers to conduct a survey about the consumption of milk.
Based on results from a pilot study, assume that σ = 8.7oz. Suppose that the amount of milk is normally
distributed. If you want to estimate the mean amount of milk consumed daily by adults:
(a) How many adults must you survey if you want 95% confidence that your sample mean is in error by no more than 0.5oz? Apply both the method based on the confidence interval and the method based on Chebyshev's inequality.
(b) Calculate the margin of error if the number of data in the sample were twice the minimum (rounded) value that you obtained. Is the margin of error now half the value it was?
(Based on an exercise of: Elementary Statistics. Triola M.F. Pearson.)
Discussion: There is one normal population with known standard deviation. In both sections, the answer can
be found by using the expression of the margin of error.
Sample information:
Theoretical (simple random) sample: X1,...,Xn s.r.s. (the amount is measured for n adults)
If we remember the expression, we can use it directly. Either way, the margin of error (for one normal population with known variance) is:

$$E = r_{\alpha/2}\sqrt{\frac{\sigma^2}{n}}$$
(a) Sample size
Method based on the confidence interval: The equation involves four quantities, and we can calculate any of them once the others are known. Here:

$$E_g \ge E = r_{\alpha/2}\sqrt{\frac{\sigma^2}{n}} \ \rightarrow\ E_g^2 \ge r_{\alpha/2}^2\frac{\sigma^2}{n} \ \rightarrow\ n \ge r_{\alpha/2}^2\frac{\sigma^2}{E_g^2} = \left(1.96\,\frac{8.7\text{oz}}{0.5\text{oz}}\right)^2 = 1163.08 \ \rightarrow\ n \ge 1164$$

since $r_{\alpha/2} = r_{0.05/2} = r_{0.025} = 1.96$. (The inequality changes neither when multiplying or dividing by positive quantities nor when squaring, while it changes when inverting.)
Method based on Chebyshev's inequality: As in exercise 1ci-s,

$$\frac{\sigma^2}{nE_g^2} \le \alpha \ \rightarrow\ n \ge \frac{1}{\alpha}\frac{\sigma^2}{E_g^2} = \frac{8.7^2\text{oz}^2}{0.05\cdot 0.5^2\text{oz}^2} = 6055.2 \ \rightarrow\ n \ge 6056$$
(b) Margin of error: With twice the minimum (rounded) number of data,

$$E = r_{\alpha/2}\sqrt{\frac{\sigma^2}{2n}} = 1.96\sqrt{\frac{8.7^2\text{oz}^2}{2\cdot 1164}} = 0.3534\text{oz}$$

When the sample size is doubled, the margin of error is not reduced by half but by less than this amount:

$$\tilde{E} = r_{\alpha/2}\sqrt{\frac{\sigma^2}{2n}} = \frac{1}{\sqrt{2}}\,r_{\alpha/2}\sqrt{\frac{\sigma^2}{n}} = \frac{E}{\sqrt{2}} = \frac{0.5\text{oz}}{\sqrt{2}} = 0.3535\text{oz}$$
Now it is easy to see that if the sample size is multiplied by 2, the margin of error is divided by √2. Besides,
more generally:
Proposition
For the confidence interval estimation of the mean of a normal population with known variance,
based on the method of the pivot, when the sample size is multiplied by any scalar c the margin of
error is divided by √c.
(Notice that the real margin of error after rounding n upward is slightly smaller than 0.5oz; that is why there is a small difference between the results of both ways.)
Conclusion: At least 1164 or 6056 data (depending on the method) are necessary to guarantee that the margin of error is at most 0.5oz (this margin can be thought of as “the maximum error in probability”, in the sense that the distance or error $|\theta-\hat{\theta}|$ will be smaller than Eg with a probability of 1–α = 0.95, but larger with a probability of α = 0.05).
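A minimal R sketch of both sections (variable names ours):

sigma = 8.7; Eg = 0.5; alpha = 0.05
r = qnorm(1-alpha/2)             # 1.96
n = ceiling((r*sigma/Eg)^2); n   # 1164
ceiling(sigma^2/(alpha*Eg^2))    # 6056, Chebyshev method
r*sigma/sqrt(2*n)                # 0.3534 oz = E/sqrt(2), not E/2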
My notes:
Exercise 4ci
A company makes two products, A and B, that can be considered independent and whose demands follow the
distributions N(μA, σA2=702u2) and N(μB, σB2=602u2), respectively. After analysing 500 shops, the two simple
random samples yield a = 156 and b = 128.
(a) Build 95 and 98 percent confidence intervals for the difference between the population means.
(b) What are the margins of error? If sales are measured in the unit u = number of boxes, what is the unit of measure of the margin of error?
(c) If a margin of error equal to 10 is desired, how many shops are necessary? Apply both the method based on the confidence interval and the method based on Chebyshev's inequality.
(d) If only product A is considered, as if product B had not been analysed, how many shops are necessary
to guarantee a margin of error equal to 10? Again, apply the two methods.
LINGUISTIC NOTE (From: Longman Dictionary of Common Errors. Turton, N.D., and J.B.Heaton. Longman.)
company. an organization that makes or sells goods or that sells services: 'My father works for an insurance company.' 'IBM is one of the
biggest companies in the electronics industry.'
factory. a place where goods such as furniture, carpets, curtains, clothes, plates, toys, bicycles, sports equipment, drinks and packaged
food are produced: 'The company's UK factory produces 500 golf trolleys a week.'
industry. (1) all the people, factories, companies etc involved in a major area of production: 'the steel industry', 'the clothing industry'
(2) all industries considered together as a single thing: 'Industry has developed rapidly over the years at the expense of agriculture.'
mill. (1) a place where a particular type of material is made: 'a cotton mill', 'a textile mill', 'a steel mill', 'a paper mill' (2) a place where
flour is made from grain: 'a flour mill'
plant. a factory or building where vehicles, engines, weapons, heavy machinery, drugs or industrial chemicals are produced, where chemical processes are carried out, or where power is generated: 'Vauxhall-Opel's UK car plants', 'Honda's new engine plant at Swindon', 'a sewage plant', 'a wood treatment plant', 'ICI's ₤100m plant', 'the Sellafield nuclear reprocessing plant in Cumbria'
works. an industrial building where materials such as cement, steel, and bricks are produced, or where industrial processes are carried
out: 'The drop in car and van sales has led to redundancies in the country's steel works.'
Discussion: The supposition that the normal distribution is appropriate to model both variables should be statistically proved. The independence of the two populations should be tested as well. The method of the pivot will be applied. After obtaining the theoretical expression of the interval, it is possible to argue about the relation confidence-length. Given the length of the interval, the expression allows us to calculate the minimum number of data necessary. The numbers of units demanded can be seen as dimensionless quantities. An approximation is implicitly being used in this exercise, since the number of units demanded is a discrete variable while the normal distribution is continuous.
(a1) The statistic

$$T(A,B;\mu_A,\mu_B) = \frac{(\bar{A}-\bar{B})-(\mu_A-\mu_B)}{\sqrt{\dfrac{\sigma_A^2}{n_A}+\dfrac{\sigma_B^2}{n_B}}} \sim N(0,1)$$

(a2) Event rewriting

$$1-\alpha = \cdots = P\left((\bar{A}-\bar{B})-r_{\alpha/2}\sqrt{\frac{\sigma_A^2}{n_A}+\frac{\sigma_B^2}{n_B}} \le \mu_A-\mu_B \le (\bar{A}-\bar{B})+r_{\alpha/2}\sqrt{\frac{\sigma_A^2}{n_A}+\frac{\sigma_B^2}{n_B}}\right)$$

(a3) The interval

$$I_{1-\alpha} = \left[(\bar{A}-\bar{B})-r_{\alpha/2}\sqrt{\frac{\sigma_A^2}{n_A}+\frac{\sigma_B^2}{n_B}},\ (\bar{A}-\bar{B})+r_{\alpha/2}\sqrt{\frac{\sigma_A^2}{n_A}+\frac{\sigma_B^2}{n_B}}\right]$$
Thus, at 95%

$$I_{0.95} = \left[(156-128)-1.96\sqrt{\frac{70^2}{500}+\frac{60^2}{500}},\ (156-128)+1.96\sqrt{\frac{70^2}{500}+\frac{60^2}{500}}\right] = [19.92,\ 36.08]$$
and at 98%

$$I_{0.98} = \left[(156-128)-2.326\sqrt{\frac{70^2}{500}+\frac{60^2}{500}},\ (156-128)+2.326\sqrt{\frac{70^2}{500}+\frac{60^2}{500}}\right] = [18.41,\ 37.59]$$

(b) Margins of error: At 95%,

$$E_{0.95} = r_{\alpha/2}\sqrt{\frac{\sigma_A^2}{n_A}+\frac{\sigma_B^2}{n_B}} = 1.96\sqrt{\frac{70^2u^2}{500}+\frac{60^2u^2}{500}} = 1.96\sqrt{\frac{70^2}{500}+\frac{60^2}{500}}\sqrt{u^2} = 8.08u$$

and at 98%

$$E_{0.98} = 2.326\sqrt{\frac{70^2u^2}{500}+\frac{60^2u^2}{500}} = 9.59u$$

Since the demands are measured in u, the margin of error is measured in u too.
(c) Sample size: With a common sample size nA = n = nB,

$$E_g \ge E = r_{\alpha/2}\sqrt{\frac{\sigma_A^2}{n}+\frac{\sigma_B^2}{n}} \ \rightarrow\ E_g^2 \ge r_{\alpha/2}^2\,\frac{\sigma_A^2+\sigma_B^2}{n} \ \rightarrow\ n \ge r_{\alpha/2}^2\,\frac{\sigma_A^2+\sigma_B^2}{E_g^2} = 1.96^2\,\frac{70^2+60^2}{10^2} = 326.5 \ \rightarrow\ n \ge 327$$

at 95% (at 98%, $n \ge 2.326^2\cdot 8500/100 = 459.9 \rightarrow n \ge 460$).
(d) For only one product,

$$I_{1-\alpha} = \left[\bar{A}-r_{\alpha/2}\sqrt{\frac{\sigma_A^2}{n_A}},\ \bar{A}+r_{\alpha/2}\sqrt{\frac{\sigma_A^2}{n_A}}\right] \qquad\text{and}\qquad E = r_{\alpha/2}\sqrt{\frac{\sigma_A^2}{n_A}}$$

(Note that this case can be thought of as a particular case where the second population has values B = 0, μB=0 and σB²=0.) Then,

$$E_g \ge E = r_{\alpha/2}\sqrt{\frac{\sigma_A^2}{n_A}} \ \rightarrow\ E_g^2 \ge r_{\alpha/2}^2\frac{\sigma_A^2}{n_A} \ \rightarrow\ n_A \ge r_{\alpha/2}^2\frac{\sigma_A^2}{E_g^2} = 1.96^2\,\frac{70^2}{10^2} = 188.2 \ \rightarrow\ n_A \ge 189$$

at 95%.
Conclusion: As expected, when the probability of the tails α decreases the margin of error—and hence the
length—increases. For either one or two products and given the margin of error, the more confidence (less
significance) we want the more data we need. Since 500 shops were really considered to attain this margin of
error, there has been a waste of time and money—fewer shops would have sufficed for the desired accuracy
(95% or 98%). When two independent quantities are added or subtracted, the error or uncertainty of the result
can be as large as the total of the two individual errors or uncertainties; this also holds for random quantities
(if they are dependent, a correction term—covariance—appears); for this reason, to guarantee the same
margin of error, more data are necessary in each of the two samples—notice that for two populations the
minimum value is larger than or equal to the sum of the minimum values that would be necessary for each
population individually (for the same precision and confidence). The minimum sample sizes provided by the
two methods are quite different (see remark 4ci). (Remember: statistical results depend on: the assumptions,
the methods, the certainty and the data.)
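A minimal R sketch of sections (c) and (d), including the Chebyshev variants (variable names ours):

sA = 70; sB = 60; Eg = 10; alpha = 0.05
r95 = 1.96; r98 = 2.326            # normal quantiles, rounded as in the tables
# (c) two products, common sample size n per product
ceiling(r95^2*(sA^2+sB^2)/Eg^2)    # 327 at 95%
ceiling(r98^2*(sA^2+sB^2)/Eg^2)    # 460 at 98%
ceiling((sA^2+sB^2)/(alpha*Eg^2))  # 1700, Chebyshev at 95%
# (d) product A alone
ceiling(r95^2*sA^2/Eg^2)           # 189 at 95%
ceiling(sA^2/(alpha*Eg^2))         # 980, Chebyshev at 95%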
My notes:
Remark 2ht: The quantities α, p-value, β, 1–β and φ are probabilities, so their values must be between 0 and 1.
Remark 3ht: For two-tailed tests, since there is an infinite number of pairs of quantiles such that P (a 1≤T 0≤a2 )=1−α ,
those that determine tails of probability α/2 are considered by convention. This criterion is also applied for confidence intervals.
Remark 4ht: To apply the second methodology, binding the p-value is sometimes enough to compare it with α. To do that, the
proper closest value included in the table is used.
Remark 5ht: In calculating the p-value for two-tailed tests, by convention the probability of the tail determined by T0(x,y) is
doubled. When T0(X,Y) follows an asymmetric distribution, it is difficult to identify the tail if the value of T0(x,y) is close to the
median. In fact, knowing the median is not necessary, since if we select the wrong tail, twice its probability will be greater than 1
and we will realize that the other tail must have been considered. Alternatively, it is always possible to calculate the two
probabilities (on the left and on the right) and double the minimum of them (this is useful in writing code for software programs).
Remark 6ht: When more than one test can be applied to make a decision about the same hypotheses, the most powerful should be
considered (if it exists).
Remark 7ht: After making a decision, it is possible to evaluate the strength with which it was made: for the first methodology, by comparing the distance from the statistic to the critical values—or, better, the area between this set of values and the density function of T0—and, for the second methodology, by looking at the magnitude of the p-value.
Remark 8ht: For small sample sizes, n=2 or n=3, the critical region—obtained by applying any methodology—can be plotted in the two- or three-dimensional space.
[HT] Parametric
Remark 9ht: There are four types of pair of hypotheses:
(1) simple versus simple
(2) simple versus one-sided composite
(3) one-sided composite versus one-sided composite
(4) simple versus two-sided composite
We will directly apply Neyman-Pearson's lemma for the first case. When the solution of the first case does not depend upon any
particular value of the parameter θ1 under H1, the same test will be uniformly most powerful for the second case. In addition, when
there is a uniformly most powerful test for the second case, it will also be uniformly most powerful for the third case.
Remark 10ht: Given H0 and α, different decisions can be made for one- and two-tailed tests. That is why: (i) describing the details of the framework is of great importance in Statistics; and (ii) as a general rule, all trustworthy information must be used, which implies that a one-sided test should be used when there is information that strongly suggests so—compare the estimate calculated from the sample with the hypothesized values.
Remark 11ht: For parametric tests, $\alpha(\theta) = P(\text{Reject } H_0 \mid \theta\in\Theta_0)$ and $1-\beta(\theta) = P(\text{Reject } H_0 \mid \theta\in\Theta_1)$, so to plot the power function $\phi(\theta) = P(\text{Reject } H_0 \mid \theta\in\Theta_0\cup\Theta_1)$ it is usually enough to enter $\theta\in\Theta_0$ in the analytical expression of $1-\beta(\theta)$. This is the method that we have used in some exercises where the computer has been used.
Remark 12ht: A reasonable testing process should verify that
1−β(θ1 )=P (T 0 ∈Rc ∣ θ∈Θ1) > P (T 0 ∈ Rc ∣ θ∈Θ0 ) = α(θ 0 )
with 1–β(θ1) ≈ α(θ0) when θ1 ≈ θ0. This can be noticed in the power functions plotted in some exercises, where there is a local
minimum at θ0.
Remark 13ht: Since one-sided tests are, in its range of parameter values, more powerful than the corresponding two-sided test, the
best way of testing an equality consists in accepting it when it is compared with the two types of inequality. Similarly, the best way
[HT-p] Based on T
Exercise 1ht-T
The lifetime of a machine (measured in years, y) follows a normal distribution with variance equal to 4y². A simple random sample of size 100 yields a sample mean equal to 1.3y. Test the null hypothesis that the
population mean is equal to 1.5y, by applying a two-tailed test with 5 percent significance level. What is the
type I error? Calculate the type II error when the population mean is 2y. Find the general expression of the
type II error and then use a computer to plot the power function.
Discussion: First of all, the supposition that the normal distribution reasonably explains the lifetime of the
machine should be evaluated by using proper statistical techniques. Nevertheless, the purpose of this exercise
is basically to apply the decision-making methodologies.
Statistic: Since
• There is one normal population
• The population variance is known
the statistic

$$T(X;\mu) = \frac{\bar{X}-\mu}{\sqrt{\sigma^2/n}} \sim N(0,1)$$

is selected from a table of statistics (e.g. in [T]). Two particular cases of T will be used:

$$T_0(X) = \frac{\bar{X}-\mu_0}{\sqrt{\sigma^2/n}} \sim N(0,1) \qquad\text{and}\qquad T_1(X) = \frac{\bar{X}-\mu_1}{\sqrt{\sigma^2/n}} \sim N(0,1)$$

To apply any of the two methodologies, the value of T0 at the specific sample x = (x1,...,x100) is necessary:

$$T_0(x) = \frac{\bar{x}-\mu_0}{\sqrt{\sigma^2/n}} = \frac{1.3-1.5}{\sqrt{4/100}} = \frac{-0.2\cdot 10}{2} = -1$$
$$\frac{\alpha(1.5)}{2} = P(T_0(X) < a_1) \rightarrow a_1 = l_{\alpha/2} = -1.96 \qquad \frac{\alpha(1.5)}{2} = P(T_0(X) > a_2) \rightarrow a_2 = r_{\alpha/2} = +1.96$$

$$\rightarrow R_c = \{T_0(X) < -1.96\}\cup\{T_0(X) > +1.96\} = \{|T_0(X)| > 1.96\}$$
The decision is: T 0 ( x)=−1 → T 0 ( x)∉ Rc → H0 is not rejected.
Type II error: To calculate β, we have to work under H1, that is, with T1. Nonetheless, the critical region is
expressed in terms of T0. Thus, the mathematical trick of adding and subtracting the same quantity is applied:
β(μ 1) = P(Type II error) = P ( Accept H 0 ∣ H 1 true) = P (T 0 ( X )∉ Rc ∣ H 1 )= P (∣T 0 ( X )∣≤1.96 ∣ H 1)
$$= P(-1.96 \le T_0(X) \le +1.96 \mid H_1) = P\left(-1.96 \le \frac{\bar{X}-\mu_0}{\sqrt{\sigma^2/n}} \le +1.96 \,\Big|\, H_1\right)$$

$$= P\left(-1.96 \le \frac{\bar{X}-\mu_1+\mu_1-\mu_0}{\sqrt{\sigma^2/n}} \le +1.96 \,\Big|\, H_1\right) = P\left(-1.96-\frac{\mu_1-\mu_0}{\sqrt{\sigma^2/n}} \le T_1(X) \le +1.96-\frac{\mu_1-\mu_0}{\sqrt{\sigma^2/n}}\right)$$

$$= P\left(T_1(X) \le +1.96-\frac{\mu_1-\mu_0}{\sqrt{\sigma^2/n}}\right) - P\left(T_1(X) < -1.96-\frac{\mu_1-\mu_0}{\sqrt{\sigma^2/n}}\right)$$
For the particular value μ1 = 2,
> pnorm(-0.54,0,1)-pnorm(-4.46,0,1)
β(2) = P ( T 1 ( X )≤−0.54 )− P ( T 1 ( X )<−4.46 ) =0.29 [1] 0.2945944
By using a computer, many more values μ1 ≠ 2 can be considered so as to numerically determine the power
curve 1–β(μ1) of the test and to plot the power function.
$$\phi(\mu) = P(\text{Reject } H_0) = \begin{cases} \alpha(\mu) & \text{if } \mu\in\Theta_0 \\ 1-\beta(\mu) & \text{if } \mu\in\Theta_1 \end{cases}$$
# Population
variance = 4
# Sample and inference
n = 100
alpha = 0.05
theta0 = 1.5 # Value under the null hypothesis H0
q = qnorm(1-alpha/2,0,1)
# Power function (completing the block as in the analogous exercises below)
theta1 = seq(from=0,to=3,0.01)
paramSpace = sort(unique(c(theta1,theta0)))
shift = (paramSpace-theta0)/sqrt(variance/n)
PowerFunction = 1 - pnorm(q-shift,0,1) + pnorm(-q-shift,0,1)
plot(paramSpace, PowerFunction, xlab='Theta', ylab='Probability of rejecting theta0', main='Power Function', type='l')
Conclusion: The hypothesis that 1.5y is the mean of the distribution of the lifetime is not rejected. As
expected, when the true value is supposed to be 2, far from 1.5, the probability of rejecting 1.5 is 1–β(2) =
0.71, that is, high. This value has been calculated by hand; additionally, after finding the analytical expression
of the curve 1–β, also by hand, the computer allows the power function to be plotted. This theoretical curve,
not depending on the sample information, is symmetric with respect to μ0 = 1.5. (Remember: statistical results
depend on: the assumptions, the methods, the certainty and the data.)
My notes:
Exercise 2ht-T
A company produces electric devices operated by a thermostatic control. The standard deviation of
the temperature at which these controls actually operate should not exceed 2.0ºF. For a simple
random sample of 20 of these controls, the sample quasi-standard deviation of operating
temperatures was 2.39ºF. Stating any assumptions you need (write them), test at the 5% level the null
hypothesis that the population standard deviation is not larger than 2.0ºF against the alternative that
it is. Apply the two methodologies and calculate the type II error at σ²=4.5ºF². Use a computer to plot the power function. On the other hand, between the two alternative hypotheses H1: σ = σ1 > 2 and H1: σ = σ1 ≠ 2, which one would you have selected? Why?
Hint: Be careful to use S² and σ² wherever you work with a variance instead of a standard deviation.
(Based on an exercise of Statistics for Business and Economics. Newbold, P., W.L. Carlson and B.M. Thorne. Pearson.)
LINGUISTIC NOTE (From: Longman Dictionary of Common Errors. Turton, N.D., and J.B. Heaton. Longman.)
actual = real (as opposed what is believed, planned or expected): 'People think he is over fifty but his actual age is forty-eight.' 'Although
buses are supposed to run every fifteen minutes, the actual waiting time can be up to an hour.'
present/current = happening or existing now: 'No one can drive that car in its present condition.' 'Her current boyfriend works for Shell.'
LINGUISTIC NOTE (From: Common Errors in English Usage. Brians, P. William, James & Co.)
“Device” is a noun. A can-opener is a device. “Devise” is a verb. You can devise a plan for opening a can with a sharp rock instead. Only
in law is “devise” properly used as a noun, meaning something deeded in a will.
Hypothesis test

$$H_0: \sigma \le \sigma_0 = 2 \qquad H_1: \sigma > 2$$

Statistic: Since there is one normal population and the population mean is unknown, the statistic $T(X;\sigma^2) = \frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}$ is selected from a table of statistics (e.g. in [T]). Then,

$$T_0(x) = \frac{(n-1)S^2}{\sigma_0^2} = \frac{(20-1)\cdot 2.39^2\text{ºF}^2}{4\text{ºF}^2} = 27.13$$
Decision: To determine the rejection region, under H0, the critical value a is found by applying the definition
of type I error, with α = 0.05 at σ02 = 4ºF2 :
α (4) = P (Type I error ) = P ( Reject H 0 ∣ H 0 true)= P (T ( X ;θ)∈ Rc ∣ H 0 ) = P (T 0 (X )>a)
→ a=r α=r 0.05=30.14 → Rc = {T 0 ( X )>30.14 }
To make the final decision: T 0 ( x)=27.13 < 30.14 → T 0 ( x)∉ Rc → H0 is not rejected.
The second methodology requires the calculation of the p-value:

$$pV = P(X\text{ more rejecting than }x \mid H_0\text{ true}) = P(T_0(X) > 27.13) \approx 0.10 > 0.05 = \alpha \ \rightarrow\ H_0\text{ is not rejected.}$$
Type II error: To calculate β, we have to work under H1, that is, with T1. Since the critical region is already expressed in terms of T0, the mathematical trick of multiplying and dividing by the same quantity is applied:

$$\beta(\sigma_1^2) = P(\text{Type II error}) = P(\text{Accept } H_0 \mid H_1\text{ true}) = P(T_0(X) \le 30.14 \mid H_1) = P\left(\frac{(n-1)S^2}{\sigma_1^2} \le 30.14\,\frac{\sigma_0^2}{\sigma_1^2} \,\Big|\, H_1\right) = P\left(T_1(X) \le \frac{30.14\cdot\sigma_0^2}{\sigma_1^2}\right)$$

For the particular value σ1² = 4.5ºF², $\beta(4.5) = P(T_1(X) \le 30.14\cdot 4/4.5) = P(T_1(X) \le 26.79) \approx 0.89$.
By using a computer, many other values σ12 ≠ 4.5ºF2 can be considered so as to numerically determine the
power curve 1–β(σ12) of the test and to plot the power function.
$$\phi(\sigma^2) = P(\text{Reject } H_0) = \begin{cases} \alpha(\sigma^2) & \text{if } \sigma^2\in\Theta_0 \\ 1-\beta(\sigma^2) & \text{if } \sigma^2\in\Theta_1 \end{cases}$$
# Sample and inference
n = 20
alpha = 0.05
theta0 = 4 # Value under the null hypothesis H0
q = qchisq(1-alpha,n-1)
theta1 = seq(from=4,to=15,0.01)
paramSpace = sort(unique(c(theta1,theta0)))
PowerFunction = 1 - pchisq(q*theta0/paramSpace, n-1)
plot(paramSpace, PowerFunction, xlab='Theta', ylab='Probability of rejecting theta0', main='Power Function', type='l')
Conclusion: The null hypothesis H 0 : σ=σ 0 ≤ 2 is not rejected. When any of these factors is different,
the decision might be the opposite. As regards the most appropriate alternative hypothesis, the value of S
suggests that the test with σ1 > 2 is more powerful than the test with σ1 ≠ 2 (the test with σ1 < 2 against
the equality would be the least powerful as both the methodologies—H0 is the default hypothesis—and the
data “tend to help H0”). (Remember: statistical results depend on: the assumptions, the methods, the certainty
and the data.)
My notes:
Exercise 3ht-T
Let X = (X1,...,Xn) be a simple random sample with 25 data taken from a normal population variable X. The sample information is summarized in the value s² = 5.53 of the sample variance.
Discussion: The supposition that the normal distribution is appropriate to model X should be statistically
proved. This statement is theoretical.
$$T_0(x) = \frac{n s^2}{\sigma_0^2} = \frac{1}{4}\left[\sum_{j=1}^{25} x_j^2 - \frac{1}{25}\left(\sum_{k=1}^{25} x_k\right)^2\right] = \frac{25\cdot 5.53}{4} = 34.56$$

where, to calculate the sample variance, the general property $s^2 = \frac{1}{n}\sum_{j=1}^{n}X_j^2 - \left(\frac{1}{n}\sum_{j=1}^{n}X_j\right)^2$ has been used.
Decision: To determine the rejection region, under H0, the critical value a is found by applying the definition of type I error, with α = 0.05 at σ0² = 4:

$$\alpha(4) = P(\text{Type I error}) = P(\text{Reject } H_0 \mid H_0\text{ true}) = P(T(X;\theta)\in R_c \mid H_0) = P(T_0(X) > a)$$

$$\rightarrow a = r_{0.05} = 36.4 \ \rightarrow\ R_c = \{T_0(X) > 36.4\}$$

To make the final decision: $T_0(x) = 34.56 < 36.4 \rightarrow T_0(x)\notin R_c \rightarrow$ H0 is not rejected.
Type II error: To calculate β, we have to work under H1, that is, with T1. Since the critical region is expressed
in terms of T0, the mathematical trick of multiplying and dividing by same quantity is applied:
β(σ12) = P (Type II error ) = P ( Accept H 0 ∣ H 1 true)= P (T 0 ( X )∉R c ∣ H 1) = P (T 0 ( X )≤36.4 ∣ H 1)
$$= P\left(\frac{n s^2}{\sigma_0^2} \le 36.4 \,\Big|\, H_1\right) = P\left(\frac{n s^2}{\sigma_1^2} \le 36.4\,\frac{\sigma_0^2}{\sigma_1^2} \,\Big|\, H_1\right) = P\left(T_1(X) \le \frac{36.4\cdot\sigma_0^2}{\sigma_1^2}\right)$$
For the particular value σ1² = 5,

$$\beta(5) = P\left(T_1(X) \le \frac{36.4\cdot 4}{5}\right) = P(T_1(X) \le 29.12) = 0.78$$
> pchisq(29.12, 25-1)
[1] 0.7843527
By using a computer, many other values σ12 ≠ 5 can be considered so as to numerically determine the power
curve 1–β(σ12) of the test and to plot the power function.
$$\phi(\sigma^2) = P(\text{Reject } H_0) = \begin{cases} \alpha(\sigma^2) & \text{if } \sigma^2\in\Theta_0 \\ 1-\beta(\sigma^2) & \text{if } \sigma^2\in\Theta_1 \end{cases}$$
# Sample and inference
n = 25
alpha = 0.05
theta0 = 4 # Value under the null hypothesis H0
q = qchisq(1-alpha,n-1)
theta1 = seq(from=4,to=15,0.01)
paramSpace = sort(unique(c(theta1,theta0)))
PowerFunction = 1 - pchisq(q*theta0/paramSpace, n-1)
plot(paramSpace, PowerFunction, xlab='Theta', ylab='Probability of rejecting theta0', main='Power Function', type='l')
Decision: Now there are two tails, determined by two critical values a1 and a2 that are found by applying the
definition of type I error, with α = 0.05 at σ02 = 4, and the criterion of leaving half the probability in each tail:
α (4)= P(Type I error )=P ( Reject H 0 ∣ H 0 true)= P(T ( X ; θ)∈R c ∣ H 0 )=P (T 0 ( X )<a 1)+ P (T 0 ( X )>a 2 )
We always consider two tails with the same probability,
$$\frac{\alpha(4)}{2} = P(T_0(X) < a_1) \rightarrow a_1 = l_{\alpha/2} = 12.4 \qquad \frac{\alpha(4)}{2} = P(T_0(X) > a_2) \rightarrow a_2 = r_{\alpha/2} = 39.4$$

$$\rightarrow R_c = \{T_0(X) < 12.4\}\cup\{T_0(X) > 39.4\}$$
To make the final decision: T 0 ( x)=34.56 → T 0 ( x)∉Rc → H0 is not rejected
To base the decision on the p-value, we calculate twice the probability of the tail determined by T0(x) = 34.56:

$$pV = 2\,P(T_0(X) > 34.56) \approx 0.15 > 0.05 = \alpha \ \rightarrow\ H_0\text{ is not rejected.}$$

Type II error: Working under H1, the critical region being expressed in terms of T0,

$$\beta(\sigma_1^2) = P(12.4 \le T_0(X) \le 39.4 \mid H_1) = P\left(\frac{n s^2}{\sigma_1^2} \le \frac{39.4\cdot\sigma_0^2}{\sigma_1^2} \,\Big|\, H_1\right) - P\left(\frac{n s^2}{\sigma_1^2} < \frac{12.4\cdot\sigma_0^2}{\sigma_1^2} \,\Big|\, H_1\right) = P\left(T_1(X) \le \frac{39.4\cdot\sigma_0^2}{\sigma_1^2}\right) - P\left(T_1(X) < \frac{12.4\cdot\sigma_0^2}{\sigma_1^2}\right)$$
For the particular value σ1² = 5,

$$\beta(5) = P(T_1(X) \le 31.52) - P(T_1(X) < 9.92) = 0.86 - 0.0051 = 0.85$$

> pchisq(c(9.92, 31.52), 25-1)
[1] 0.00513123 0.86065162
Comparison of the power functions: For the one-tailed test, the power of the test at σ12 = 5 is 1–β(5) =
1–0.78 = 0.22, while for the two-tailed test it is 1–β(5) = 1–0.85 = 0.15. As expected, this latter test has
smaller power (higher type II error), since in the former test additional information is being used when one tail
is previously discarded. Now we compare the power functions of the two tests graphically, for the common
values (> 4), by using the code
# Sample and inference
n = 25
alpha = 0.05
theta0 = 4 # Value under the null hypothesis H0
q = qchisq(c(alpha/2,1-alpha/2),25-1)
theta1 = seq(from=0,to=15,0.01)
paramSpace1 = sort(unique(c(theta1,theta0)))
PowerFunction1 = 1 - pchisq(q[2]*theta0/paramSpace1, n-1) +
pchisq(q[1]*theta0/paramSpace1, n-1)
q = qchisq(1-alpha,n-1)
theta1 = seq(from=4,to=15,0.01)
paramSpace2 = sort(unique(c(theta1,theta0)))
PowerFunction2 = 1 - pchisq(q*theta0/paramSpace2, n-1)
plot(paramSpace1, PowerFunction1, xlim=c(0,15), xlab='Theta',
ylab='Probability of rejecting theta0', main='Power Function', type='l')
lines(paramSpace2, PowerFunction2, lty=2)
It can be noticed that the curve of the one-sided test is over the curve of the two-sided test for any σ² > 4.
Conclusion: The hypothesis that the population variance is equal to 4 is not rejected in either of the two sections. Although it has not happened in this case, different decisions may be made for the one- and two-tailed cases. (Remember: statistical results depend on: the assumptions, the methods, the certainty and the data.)
My notes:
Exercise 4ht-T
Imagine that you are hired as a cook. Not an ordinary one but a “statistical cook.” For a normal population,
in testing the two hypotheses
$$H_0: \sigma^2 = \sigma_0^2 = 4 \qquad H_1: \sigma^2 = \sigma_1^2 > 4$$

the data (sample x of size n = 11 such that S²=7.6u²) and the significance (α=0.05) have led to rejecting the null hypothesis because

$$r_{0.05} = 18.31 < T_0(x) = 19$$
Since the chef—your boss—wants the null hypothesis H0 not to be rejected, find three different ways to
scientifically make the opposite decision by changing any of the previous factors. Give qualitative
explanations and, if possible, quantitative ones.
Discussion: Metaphorically, Statistics can be thought of as the kitchen with its utensils and appliances, the
first two factors as the recipe, and the next three items as the ingredients—if H1, α or x are inappropriate, there
is little to do and it does not matter how good the kitchen, the recipe and you are. Our statistical knowledge
allows us to change only the last three elements. The statistic to study the variance of a normal population is

$$T(X) = \frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1} \qquad\text{so, under } H_0,\quad T_0(x) = \frac{(n-1)S^2}{\sigma_0^2} = \frac{(11-1)\cdot 7.6u^2}{4u^2} = \frac{76}{4} = 19.$$
Quantitative reasoning: The previous qualitative explanations can be supported with calculations.
A) For the two-tailed test, now the critical value would be r 0.05 /2=r 0.025=20.48 . Then
T 0 ( x)=19 < 20.48=r 0.025 → T 0 ( x)∉Rc → H0 is not rejected.
B) The same effect is obtained if, for the original one-tailed H1, the significance is taken to be 0.025
instead of 0.05. Any other value smaller than 0.025 would lead to the same result. Is 0.025—suggested
by the previous item—the smallest possible value? The answer is found by using the p-value, since it is sometimes defined as the smallest significance level at which the null hypothesis is rejected. Then, since

$$pV = P(X\text{ more rejecting than }x \mid H_0\text{ true}) = P(T_0(X) > 19) = 0.0403$$

> 1 - pchisq(19, 11-1)
[1] 0.04026268

any significance level α smaller than pV = 0.0403 leads to not rejecting H0.
Conclusion: This exercise highlights how careful one must be in either writing or reading statistical works.
My notes:
Discussion: In a real-world situation, suppositions should be proved. We must pay careful attention to the
details: the sample quasivariance is provided for one group, while the sample variance is given for the other.
the statistic

$$T(X,Y;\sigma_X,\sigma_Y) = \frac{S_X^2/\sigma_X^2}{S_Y^2/\sigma_Y^2} = \frac{S_X^2}{S_Y^2}\,\frac{\sigma_Y^2}{\sigma_X^2} \sim F_{n_X-1,\,n_Y-1}$$

is selected from a table of statistics (e.g. in [T]). It will be used in two forms (we can write σX²/σY² = θ1):

$$T_0(X,Y) = \frac{S_X^2}{S_Y^2} \sim F_{n_X-1,\,n_Y-1} \qquad\text{and}\qquad T_1(X,Y) = \frac{1}{\theta_1}\,\frac{S_X^2}{S_Y^2} \sim F_{n_X-1,\,n_Y-1}$$
(On the other hand, the pooled sample variance Sp2 should not be considered even under H0: σX = σ = σY, as
T 0=( S 2p /S 2p )=1 whatever the data are.) To apply any of the two methodologies we need to evaluate T0 at
the samples x and y:
$$T_0(x,y) = \frac{S_X^2}{S_Y^2} = \frac{6.8}{\frac{10}{10-1}\,7.1} = 0.86$$
Since we were given the sample quasivariance of population X, but the sample variance of population Y, the
general property n s 2 = (n−1) S 2 has been used to calculate SY2.
Decision: To determine the critical region, under H0, the critical value a is found by applying the definition of
type I error, with α = 0.1 at θ0 = 1:
α (1)= P (Type I error )=P (Reject H 0 ∣ H 0 true)= P (T ( X , Y )<a ∣ H 0 )=P (T 0 ( X , Y )< a)
$$0.1 = P(T_0(X,Y) < a) = P\left(\frac{1}{T_0(X,Y)} > \frac{1}{a}\right) \ \rightarrow\ \frac{1}{a} = 2.35$$

(From the definition of the F distribution, it is easy to see that if X follows an Fk1,k2 then 1/X follows an Fk2,k1. We use this property to consult our table.)

$$\rightarrow a = r_{1-\alpha} = \frac{1}{2.35} = 0.43 \ \rightarrow\ R_c = \{T_0(X,Y) < 0.43\}$$
To make the final decision about the hypotheses:
T 0 ( x , y )=0.86 → T 0 ( x)∉ Rc → H0 is not rejected.
The second methodology requires the calculation of the p-value:
pV =P ( X ,Y more rejecting than x , y ∣ H 0 true)
=P (T 0 (X , Y )<T 0 ( x , y))=P (T 0 ( X , Y )< 0.86)=0.41
> pf(0.86, 11-1, 10-1)
→ pV =0.41> 0.1=α → H0 is not rejected. [1] 0.406005
Power function: To calculate β, we have to work under H1, that is, with T1. Since in this case the critical
region is already expressed in terms of T0, the mathematical trick of multiplying and dividing by the same
quantity is applied:
β(θ1 ) = P (Type II error) = P( Accept H 0 ∣ H 1 true) = P (T 0 ( X )∉ Rc ∣ H 1 ) = P (T 0 ( X )≥0.43 ∣ H 1 )
$$= P\left(\frac{S_X^2}{S_Y^2} \ge 0.43 \,\Big|\, H_1\right) = P\left(\frac{1}{\theta_1}\frac{S_X^2}{S_Y^2} \ge \frac{0.43}{\theta_1} \,\Big|\, H_1\right) = P\left(T_1(X,Y) \ge \frac{0.43}{\theta_1}\right) = 1 - P\left(T_1(X,Y) < \frac{0.43}{\theta_1}\right)$$
) ( )
By using a computer, many values θ1 can be considered so as to determine the power curve 1–β(θ1) of the test
and to plot the power function.
$$\phi(\theta) = P(\text{Reject } H_0) = \begin{cases} \alpha(\theta) & \text{if } \theta\in\Theta_0 \\ 1-\beta(\theta) & \text{if } \theta\in\Theta_1 \end{cases}$$
# Sample and inference
nx = 11; ny = 10
alpha = 0.1
theta0 = 1
q = qf(alpha,nx-1,ny-1)
theta1 = seq(from=0,to=1,0.01)
paramSpace = sort(unique(c(theta1,theta0)))
PowerFunction = pf(q/paramSpace, nx-1, ny-1)
plot(paramSpace, PowerFunction, xlab='Theta', ylab='Probability of rejecting theta0', main='Power Function', type='l')
Decision: To apply the methodology based on the rejection region, the critical value a is found by applying
the definition of type I error, with α = 0.1 at θ0 = 1:
α (1)= P (Type I error )=P ( Reject H 0 ∣ H 0 true)= P (T ( X , Y )>a ∣ H 0 )=P (T 0 ( X , Y )> a)
→ a=r α=2.42 → Rc = {T 0 ( X , Y )> 2.42 }
The final decision is: T 0 ( x , y )=0.86 → T 0 ( x)∉ Rc → H0 is not rejected.
The second methodology requires the calculation of the p-value:
pV =P ( X ,Y more rejecting than x , y | H 0 true)= P(T 0 ( X , Y )>T 0 ( x , y ))
=P (T 0 (X , Y )> 0.86)= 1−0.41=0.59 > pf(0.86, 11-1, 10-1)
[1] 0.406005
→ pV =0.59> 0.1=α → H0 is not rejected.
Type II error: Working under H1 as before,

$$\beta(\theta_1) = P\left(\frac{S_X^2}{S_Y^2} \le 2.42 \,\Big|\, H_1\right) = P\left(\frac{1}{\theta_1}\frac{S_X^2}{S_Y^2} \le \frac{2.42}{\theta_1} \,\Big|\, H_1\right) = P\left(T_1(X,Y) \le \frac{2.42}{\theta_1}\right)$$
By using a computer, many values θ1 can be considered so as to plot the power function.
Decision: For the first methodology, the critical region must be determined by applying the definition of type I
error, with α = 0.1 at θ1 = 1, and the criterion of leaving half the probability in each tail:
α (1)= P (Type I error )= P( Reject H 0 | H 0 true)=P (T 0 ( X ,Y )<a 1)+ P (T 0 ( X , Y )>a2 )
$$\frac{\alpha(1)}{2} = P(T_0(X,Y) < a_1) \rightarrow a_1 = l_{\alpha/2} = 0.33 \qquad \frac{\alpha(1)}{2} = P(T_0(X,Y) > a_2) \rightarrow a_2 = r_{\alpha/2} = 3.14$$

> qf(c(0.05, 0.95), 11-1, 10-1)
[1] 0.3310838 3.1372801

$$\rightarrow R_c = \{T_0(X,Y) < 0.33\}\cup\{T_0(X,Y) > 3.14\}$$
The decision is: $T_0(x,y) = 0.86 \rightarrow T_0(x,y)\notin R_c \rightarrow$ H0 is not rejected. For the type II error,

$$\beta(\theta_1) = P\left(\frac{0.33}{\theta_1} \le T_1(X,Y) \le \frac{3.14}{\theta_1}\right) = P\left(T_1(X,Y) \le \frac{3.14}{\theta_1}\right) - P\left(T_1(X,Y) < \frac{0.33}{\theta_1}\right)$$
By using a computer, many values θ1 can be considered in order to plot the power function.
# Sample and inference
nx = 11; ny = 10
alpha = 0.1
theta0 = 1
q = qf(c(alpha/2, 1-alpha/2),nx-1,ny-1)
theta1 = seq(from=0,to=15,0.01)
paramSpace = sort(unique(c(theta1,theta0)))
PowerFunction = 1 - pf(q[2]/paramSpace, nx-1, ny-1) + pf(q[1]/paramSpace, nx-1, ny-1)
plot(paramSpace, PowerFunction, xlab='Theta', ylab='Probability of rejecting theta0', main='Power Function', type='l')
Comparison of the power functions: Now we compare the power functions of the three tests
graphically, by using the code
# Sample and inference
nx = 11; ny = 10
alpha = 0.1
theta0 = 1
q = qf(c(alpha/2, 1-alpha/2),nx-1,ny-1)
theta1 = seq(from=0,to=15,0.01)
paramSpace1 = sort(unique(c(theta1,theta0)))
PowerFunction1 = 1 - pf(q[2]/paramSpace1, nx-1, ny-1) + pf(q[1]/paramSpace1, nx-1, ny-1)
q = qf(alpha,nx-1,ny-1)
theta1 = seq(from=0,to=1,0.01)
paramSpace2 = sort(unique(c(theta1,theta0)))
PowerFunction2 = pf(q/paramSpace2, nx-1, ny-1)
q = qf(1-alpha,nx-1,ny-1)
# Completing the block with the third curve, following the pattern above
theta1 = seq(from=1,to=15,0.01)
paramSpace3 = sort(unique(c(theta1,theta0)))
PowerFunction3 = 1 - pf(q/paramSpace3, nx-1, ny-1)
plot(paramSpace1, PowerFunction1, xlim=c(0,15), xlab='Theta',
ylab='Probability of rejecting theta0', main='Power Function', type='l')
lines(paramSpace2, PowerFunction2, lty=2)
lines(paramSpace3, PowerFunction3, lty=3)
It can be seen that the curves of the one-sided tests are over the curve of the two-sided test for any θ1—in its
region each one-sided test has more power than the two-sided test, since additional information is used when
one tail is discarded. Then, any of the two one-sided tests is uniformly more powerful than the two-sided test
in their respective common domains.
Conclusion: The hypothesis that the population variance is equal in the two biological populations is not
rejected when tested against any of the three alternative hypotheses. Although it has not happened in this case,
different decisions can be made for the one- and two-tailed tests. In this exercise, the empirical value T0(x) =
SX2/SY2 = 0.86 suggests the alternative hypothesis H1: σX2/σY2 < 1. (Remember: statistical results depend on: the
assumptions, the methods, the certainty and the data.)
My notes:
Exercise 6ht-T
Two simple random samples of 700 citizens of Italy and Russia yielded, respectively, that 53% of Italian
people and 47% of Russian people wish to visit Spain within the next ten years. Should we conclude, with confidence 0.99, that the Italians' desire is higher than the Russians'? Determine the critical region and make a decision. What is the type I error? Calculate the p-value and apply the methodology based on the p-value to
decision. What is the type I error? Calculate the p-value and apply the methodology based on the p-value to
make a decision.
1) Allocate the question in the null hypothesis. Calculate the type II error for the value –0.1.
2) Allocate the question in the alternative hypothesis. Calculate the type II error for the value +0.1.
Use a computer to plot the power function.
Statistic: The statistic to compare the two proportions is

$$T(I,R) = \frac{(\hat{\eta}_I-\hat{\eta}_R)-(\eta_I-\eta_R)}{\sqrt{\dfrac{?_I(1-?_I)}{n_I}+\dfrac{?_R(1-?_R)}{n_R}}} \xrightarrow{d} N(0,1)$$

where each ? must be substituted by the best possible information: supposed or estimated. Two particular versions of this statistic will be used:

$$T_0(I,R) = \frac{(\hat{\eta}_I-\hat{\eta}_R)-\theta_0}{\sqrt{\dfrac{\hat{\eta}_I(1-\hat{\eta}_I)}{n_I}+\dfrac{\hat{\eta}_R(1-\hat{\eta}_R)}{n_R}}} \xrightarrow{d} N(0,1) \qquad\text{and}\qquad T_1(I,R) = \frac{(\hat{\eta}_I-\hat{\eta}_R)-\theta_1}{\sqrt{\dfrac{\hat{\eta}_I(1-\hat{\eta}_I)}{n_I}+\dfrac{\hat{\eta}_R(1-\hat{\eta}_R)}{n_R}}} \xrightarrow{d} N(0,1)$$
To determine the critical region or to calculate the p-value, both under H0, we need the value of the statistic for
the particular samples available:
$$T_0(i,r) = \frac{(0.53-0.47)-0}{\sqrt{\dfrac{0.53(1-0.53)}{700}+\dfrac{0.47(1-0.47)}{700}}} = 2.25$$
1) Question in H0
Hypotheses: If we want to allocate the question in the null hypothesis to reject it only when the data strongly
suggest so,
H 0 : ηI −ηR = θ 0 ≥ 0 and H 1 : ηI −ηR = θ1 < 0
By looking at the alternative hypothesis, we deduce the form of the critical region:
Decision: To apply the first methodology, the critical value a that determines the rejection region is found by
applying the definition of type I error, with the value α = 1 – 0.99 = 0.01 at θ0 = 0:
α (0) = P (Type I error) = P( Reject H 0 | H 0 true) = P( T (I , R)∈R c | H 0 )= P (T 0 ( I , R)<a)
→ a=l 0.01=−2.326 → Rc = {T 0 ( I , R)<−2.326}
The decision is: T 0 ( i , r )=2.25 → T 0 (i , r )∉ Rc → H0 is not rejected.
As regards the value of the type I error, it is α by definition. The second methodology is based on the
calculation of the p-value:
pV =P ( I , R more rejecting than i , r | H 0 true)= P (T 0 ( I , R) < T 0 (i , r ))
=P (T 0 (I , R) < 2.25)=0.988
→ pV =0.988 > 0.01=α → H0 is not rejected.
Type II error: To calculate β, we have to work under H1. Since the critical region is expressed in terms of T0
and we must use T1, we are going to apply the mathematical trick of adding and subtracting the same quantity:
$$\beta(\theta_1) = P(\text{Type II error}) = P(\text{Accept } H_0 \mid H_1\text{ true}) = P(T_0(I,R)\notin R_c \mid H_1) = P\left(\frac{(\hat{\eta}_I-\hat{\eta}_R)-\theta_0}{\sqrt{\dfrac{\hat{\eta}_I(1-\hat{\eta}_I)}{n_I}+\dfrac{\hat{\eta}_R(1-\hat{\eta}_R)}{n_R}}} \ge -2.326 \,\Big|\, H_1\right)$$

$$= P\left(\frac{(\hat{\eta}_I-\hat{\eta}_R)+0-\theta_1}{\sqrt{\dfrac{\hat{\eta}_I(1-\hat{\eta}_I)}{n_I}+\dfrac{\hat{\eta}_R(1-\hat{\eta}_R)}{n_R}}} + \frac{\theta_1}{\sqrt{\dfrac{\hat{\eta}_I(1-\hat{\eta}_I)}{n_I}+\dfrac{\hat{\eta}_R(1-\hat{\eta}_R)}{n_R}}} \ge -2.326 \,\Big|\, H_1\right) = P\left(T_1(I,R) \ge -2.326 - \frac{\theta_1}{\sqrt{\dfrac{0.53(1-0.53)}{700}+\dfrac{0.47(1-0.47)}{700}}}\right)$$

For the particular value θ1 = –0.1,

$$\beta(-0.1) = P\left(T_1(I,R) \ge -2.326 + \frac{0.1}{0.0267}\right) = P(T_1(I,R) \ge 1.42) = 0.078$$
2) Question in H1
Hypotheses: If we want to allocate the question in the alternative hypothesis to accept it only when the data
strongly suggest so,
H 0 : ηI −ηR = θ 0 ≤ 0 and H 1 : ηI −ηR = θ1 > 0
By looking at the alternative hypothesis, we deduce the form of the critical region, $R_c = \{T_0(I,R) > a\}$.
The quantity c by which a exceeds θ0 can be thought of as a margin over θ0 not to exclude cases where ηI – ηR = θ0 = 0 really holds while values slightly larger than θ0 are due to mere random effects.
Decision: To apply the first methodology, the critical value a is calculated as follows:
α (0)= P (Type I error) = P( Reject H 0 | H 0 true) = P( T (I , R)∈R c | H 0 )= P (T 0 ( I , R)>a)
→ a=r 0.01=2.326 → Rc = {T 0 ( I , R)> 2.326 }
The decision is: T 0 ( i , r )=2.25 → T 0 (i , r )∉ Rc → H0 is not rejected.
The second methodology requires the calculation of the p-value:

$$pV = P(I,R\text{ more rejecting than }i,r \mid H_0\text{ true}) = P(T_0(I,R) > 2.25) = 0.0122 > 0.01 = \alpha \ \rightarrow\ H_0\text{ is not rejected.}$$

Type II error: Working under H1 as in the previous section,

$$\beta(\theta_1) = P(T_0(I,R)\notin R_c \mid H_1) = P\left(\frac{(\hat{\eta}_I-\hat{\eta}_R)-\theta_0}{\sqrt{\dfrac{\hat{\eta}_I(1-\hat{\eta}_I)}{n_I}+\dfrac{\hat{\eta}_R(1-\hat{\eta}_R)}{n_R}}} \le 2.326 \,\Big|\, H_1\right) = P\left(T_1(I,R) \le 2.326 - \frac{\theta_1}{\sqrt{\dfrac{0.53(1-0.53)}{700}+\dfrac{0.47(1-0.47)}{700}}}\right)$$

For the particular value θ1 = +0.1,

$$\beta(0.1) = P\left(T_1(I,R) \le 2.326 - \frac{0.1}{0.0267}\right) = P(T_1(I,R) \le -1.42) = 0.078$$
By using a computer, many more values θ1 ≠ 0.1 can be considered so as to numerically determine the power
of the test curve 1–β(θ1) and to plot the power function.
$$\phi(\theta) = P(\text{Reject } H_0) = \begin{cases} \alpha(\theta) & \text{if } \theta\in\Theta_0 \\ 1-\beta(\theta) & \text{if } \theta\in\Theta_1 \end{cases}$$
Conclusion: The hypothesis that the two proportions are equal is not rejected when the question is allocated
in either the alternative or the null hypothesis (the best way of testing an equality). That is, it seems that both
populations wish to visit Spain with the same desire. The sample information η^ I =0.53 and η^ R =0.47
suggested the alternative hypothesis H1: ηI – ηR > 0. The two power functions show how symmetric the
situations are. (Remember: statistical results depend on: the assumptions, the methods, the certainty and the
data.)
Advanced theory: Under the hypothesis H0: ηI = η = ηR, it makes sense to try to estimate the common variance η(1–η) of the estimator—in the denominator—as well as possible. This can be done by using the pooled sample proportion $\hat{\eta}_p = \frac{n_I\hat{\eta}_I + n_R\hat{\eta}_R}{n_I+n_R}$. Nevertheless, the pooled estimator should not be considered in the numerator, since $(\hat{\eta}_p-\hat{\eta}_p)=0$ whatever the data are. Now, the statistic under the null hypothesis is:
$$\tilde{T}_0(I,R) = \frac{(\hat{\eta}_I-\hat{\eta}_R)-\theta_0}{\sqrt{\dfrac{\hat{\eta}_p(1-\hat{\eta}_p)}{n_I}+\dfrac{\hat{\eta}_p(1-\hat{\eta}_p)}{n_R}}} = T_0(I,R)\cdot\frac{\sqrt{\dfrac{\hat{\eta}_I(1-\hat{\eta}_I)}{n_I}+\dfrac{\hat{\eta}_R(1-\hat{\eta}_R)}{n_R}}}{\sqrt{\dfrac{\hat{\eta}_p(1-\hat{\eta}_p)}{n_I}+\dfrac{\hat{\eta}_p(1-\hat{\eta}_p)}{n_R}}} \xrightarrow{d} N(0,1)$$

Then,

$$\hat{\eta}_p = \frac{700\cdot 0.53 + 700\cdot 0.47}{700+700} = \frac{0.53+0.47}{1+1} = \frac{1}{2} = 0.5 \ \rightarrow\ \frac{\sqrt{\dfrac{\hat{\eta}_I(1-\hat{\eta}_I)}{n_I}+\dfrac{\hat{\eta}_R(1-\hat{\eta}_R)}{n_R}}}{\sqrt{\dfrac{\hat{\eta}_p(1-\hat{\eta}_p)}{n_I}+\dfrac{\hat{\eta}_p(1-\hat{\eta}_p)}{n_R}}} = 0.9981983$$
Repeating the previous calculations with $\tilde{T}_0$,

$$\beta(\theta_1) = P(\tilde{T}_0(I,R)\notin R_c \mid H_1) = P\left(\frac{(\hat{\eta}_I-\hat{\eta}_R)-\theta_0}{\sqrt{\dfrac{\hat{\eta}_p(1-\hat{\eta}_p)}{n_I}+\dfrac{\hat{\eta}_p(1-\hat{\eta}_p)}{n_R}}} \ge -2.326 \,\Big|\, H_1\right)$$

$$= P\left(\frac{(\hat{\eta}_I-\hat{\eta}_R)-0-\theta_1+\theta_1}{\sqrt{\dfrac{\hat{\eta}_I(1-\hat{\eta}_I)}{n_I}+\dfrac{\hat{\eta}_R(1-\hat{\eta}_R)}{n_R}}} \ge -2.326\cdot\frac{\sqrt{\dfrac{\hat{\eta}_p(1-\hat{\eta}_p)}{n_I}+\dfrac{\hat{\eta}_p(1-\hat{\eta}_p)}{n_R}}}{\sqrt{\dfrac{\hat{\eta}_I(1-\hat{\eta}_I)}{n_I}+\dfrac{\hat{\eta}_R(1-\hat{\eta}_R)}{n_R}}} \,\Big|\, H_1\right)$$

$$= P\left(T_1(I,R) \ge -2.326\cdot 1.002 - \frac{\theta_1}{\sqrt{\dfrac{0.53(1-0.53)}{700}+\dfrac{0.47(1-0.47)}{700}}}\right) = P\left(T_1(I,R) \ge -2.330 - \frac{\theta_1}{\sqrt{\dfrac{0.53(1-0.53)}{700}+\dfrac{0.47(1-0.47)}{700}}}\right)$$

For the particular value θ1 = –0.1,

$$\beta(-0.1) = P\left(T_1(I,R) \ge -2.330 + \frac{0.1}{0.0267}\right) = P(T_1(I,R) \ge 1.41) = 0.079.$$
Similarly for section (b).
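A minimal R sketch of the main quantities in this exercise (variable names ours):

nI = 700; nR = 700; pI = 0.53; pR = 0.47
se = sqrt(pI*(1-pI)/nI + pR*(1-pR)/nR)
(pI-pR)/se                           # T0 = 2.25
pnorm((pI-pR)/se)                    # p-value in section 1): 0.988
1 - pnorm((pI-pR)/se)                # p-value in section 2): 0.012
1 - pnorm(-qnorm(0.99) - (-0.1)/se)  # beta(-0.1), ~0.078
pnorm(qnorm(0.99) - 0.1/se)          # beta(+0.1), ~0.078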
My notes:
[HT-p] Based on Λ
Exercise 1ht-Λ
A random quantity X follows a Poisson distribution. Let X = (X1,...,Xn) be a simple random sample. By
applying the results involving Neyman-Pearson's lemma and the likelihood ratio, study the critical region
(estimator that arises and form) for the following pairs of hypotheses.
Discussion: This is a theoretical exercise where no assumption should be evaluated. First of all, Neyman-
-Pearson's lemma will be applied. We expect the maximum-likelihood estimator of the parameter—calculated
in a previous exercise—and the “usual” critical region form to appear. If the critical region does not depend on
any particular value θ1, the uniformly most powerful test will have been found.
Hypothesis test
{ H 0: λ = λ0
H 1: λ = λ1
$$L(X;\lambda) = \prod_{j=1}^{n}\frac{e^{-\lambda}\lambda^{X_j}}{X_j!} = \frac{e^{-n\lambda}\,\lambda^{\sum_{j=1}^{n}X_j}}{\prod_{j=1}^{n}X_j!} \qquad\text{and}\qquad \Lambda(X;\lambda_0,\lambda_1) = \frac{L(X;\lambda_0)}{L(X;\lambda_1)} = \left(\frac{\lambda_0}{\lambda_1}\right)^{\sum_{j=1}^{n}X_j} e^{-n(\lambda_0-\lambda_1)}$$
Rejection region:

$$R_c = \{\Lambda < k\} = \left\{\left(\frac{\lambda_0}{\lambda_1}\right)^{\sum_{j=1}^{n}X_j} e^{-n(\lambda_0-\lambda_1)} < k\right\} = \left\{\left(\sum_{j=1}^{n}X_j\right)\log\left(\frac{\lambda_0}{\lambda_1}\right) - n(\lambda_0-\lambda_1) < \log(k)\right\} = \left\{n\bar{X}\log\left(\frac{\lambda_0}{\lambda_1}\right) < \log(k) + n(\lambda_0-\lambda_1)\right\}$$
• if $\lambda_1 < \lambda_0$ then $\log(\lambda_0/\lambda_1) > 0$ and hence $R_c = \left\{\bar{X} < \dfrac{\log(k)+n(\lambda_0-\lambda_1)}{n\log(\lambda_0/\lambda_1)}\right\}$
• if $\lambda_1 > \lambda_0$ then $\log(\lambda_0/\lambda_1) < 0$ and hence $R_c = \left\{\bar{X} > \dfrac{\log(k)+n(\lambda_0-\lambda_1)}{n\log(\lambda_0/\lambda_1)}\right\}$
This suggests the estimator X̄ =λ̂ ML (calculated in a previous exercise) and regions of the form
Rc = {Λ< k } = ⋯= { λ̂ ML <c }= ⋯= {T 0 < a } or Rc = {Λ< k } = ⋯= { λ^ ML >c }= ⋯= {T 0 > a }
Hypothesis tests
{ H 0 : λ = λ0
H 1 : λ = λ 1> λ 0 { H 0 : λ = λ0
H 1 : λ = λ 1< λ 0
In applying the methodologies, and given α, the same critical value c or a will be obtained for any λ1 since it only depends upon λ0 through $\hat{\lambda}_{ML}$ or T0:

$$\alpha = P(\text{Type I error}) = P(T_0 < a) \qquad\text{or}\qquad \alpha = P(\text{Type I error}) = P(T_0 > a)$$

This implies that the uniformly most powerful test has been found.
Hypothesis tests
{ H 0 : λ ≤ λ0
H 1 : λ = λ 1> λ 0 { H 0 : λ ≥ λ0
H 1 : λ = λ 1< λ 0
A uniformly most powerful test for H 0 : λ = λ0 is also uniformly most powerful for H 0 : λ ≤ λ0 .
Hypothesis test: The calculations below correspond to an exponential population, whose likelihood is $L(X;\lambda) = \lambda^n e^{-\lambda\sum_{j=1}^{n}X_j}$, with

$$H_0: \lambda = \lambda_0 \qquad H_1: \lambda = \lambda_1$$

$$\Lambda(X;\lambda_0,\lambda_1) = \frac{L(X;\lambda_0)}{L(X;\lambda_1)} = \left(\frac{\lambda_0}{\lambda_1}\right)^{n} e^{-(\lambda_0-\lambda_1)\sum_{j=1}^{n}X_j}$$

Rejection region:

$$R_c = \{\Lambda < k\} = \left\{n\log\left(\frac{\lambda_0}{\lambda_1}\right) - (\lambda_0-\lambda_1)\sum_{j=1}^{n}X_j < \log(k)\right\} = \left\{(\lambda_1-\lambda_0)\,n\bar{X} < \log(k) - n\log\left(\frac{\lambda_0}{\lambda_1}\right)\right\}$$
Now it is necessary that $\lambda_1\neq\lambda_0$ and
• if $\lambda_1 < \lambda_0$ then $(\lambda_1-\lambda_0) < 0$ and $R_c = \left\{\bar{X} > \dfrac{\log(k)-n\log(\lambda_0/\lambda_1)}{n(\lambda_1-\lambda_0)}\right\}$
• if $\lambda_1 > \lambda_0$ then $(\lambda_1-\lambda_0) > 0$ and $R_c = \left\{\bar{X} < \dfrac{\log(k)-n\log(\lambda_0/\lambda_1)}{n(\lambda_1-\lambda_0)}\right\}$
This suggests the estimator $\frac{1}{\bar{X}} = \hat{\lambda}_{ML}$ (calculated in a previous exercise) and regions of the form

$$R_c = \{\Lambda < k\} = \cdots = \{\hat{\lambda}_{ML} < c\} = \cdots = \{T_0 < a\} \qquad\text{or}\qquad R_c = \{\Lambda < k\} = \cdots = \{\hat{\lambda}_{ML} > c\} = \cdots = \{T_0 > a\}$$
Hypothesis tests
{ H 0 : λ = λ0
H 1 : λ = λ 1> λ 0 { H 0 : λ = λ0
H 1 : λ = λ 1< λ 0
In applying the methodologies, and given α, the same critical value c or a will be obtained for any λ1 since it only depends upon λ0 through $\hat{\lambda}_{ML}$ or T0. This implies that the uniformly most powerful test has been found.
Hypothesis tests
{ H 0 : λ ≤ λ0
H 1 : λ = λ 1>λ 0 { H 0 : λ ≥ λ0
H 1 : λ = λ 1<λ 0
A uniformly most powerful test for H 0 : λ = λ0 is also uniformly most powerful for H 0 : λ ≤ λ0 .
Hypothesis test
{ H 0 : η= η0
H 1 : η= η1
$$L(X;\eta) = \eta^{\sum_{j=1}^{n}X_j}(1-\eta)^{n-\sum_{j=1}^{n}X_j} \qquad\text{and}\qquad \Lambda(X;\eta_0,\eta_1) = \frac{L(X;\eta_0)}{L(X;\eta_1)} = \left(\frac{\eta_0}{\eta_1}\right)^{\sum_{j=1}^{n}X_j}\left(\frac{1-\eta_0}{1-\eta_1}\right)^{n-\sum_{j=1}^{n}X_j}$$
Rejection region:

$$R_c = \{\Lambda < k\} = \left\{\left(\sum_{j=1}^{n}X_j\right)\log\left(\frac{\eta_0}{\eta_1}\right) + \left(n-\sum_{j=1}^{n}X_j\right)\log\left(\frac{1-\eta_0}{1-\eta_1}\right) < \log(k)\right\}$$

$$= \left\{\left(\sum_{j=1}^{n}X_j\right)\left[\log\left(\frac{\eta_0}{\eta_1}\right) - \log\left(\frac{1-\eta_0}{1-\eta_1}\right)\right] < \log(k) - n\log\left(\frac{1-\eta_0}{1-\eta_1}\right)\right\} = \left\{n\bar{X}\log\left(\frac{\eta_0(1-\eta_1)}{\eta_1(1-\eta_0)}\right) < \log(k) - n\log\left(\frac{1-\eta_0}{1-\eta_1}\right)\right\}$$
{ }
1−η0
( )
log ( k )−n log
1−η1
• if η1 < η0 then log
( η0 (1−η1)
η1( 1−η0) )
> 0 and Rc = X
̄ <
n log ( )
η0 (1−η1)
η1(1−η0)
{ ) }
1−η0
log( k )−n log ( )
1−η1
• if η1 > η0 then log
( η0 (1−η1)
η1( 1−η0) ) ̄ >
<0 and Rc = X
n log ( η0 (1−η1)
η1(1−η0)
Hypothesis tests
H0: η = η0 vs H1: η = η1 > η0,  and  H0: η = η0 vs H1: η = η1 < η0
In applying the methodologies, and given α, the same critical value c or a will be obtained for any η1 since it
only depends upon η0 through η^ ML or T0:
α=P (Type I error)= P (T 0 <a) or α=P (Type I error)=P (T 0 > a)
This implies that the uniformly most powerful test has been found.
Hypothesis tests
H0: η ≤ η0 vs H1: η = η1 > η0,  and  H0: η ≥ η0 vs H1: η = η1 < η0
A uniformly most powerful test for H 0 : η = η0 is also uniformly most powerful for H 0 : η ≤ η0 .
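Here too the critical value is exact, since under H0 the statistic T0 = Σ Xj follows a Bin(n, η0) distribution. A minimal R sketch, where n = 50, η0 = 0.5 and α = 0.05 are illustrative values:
# Critical value for H0: eta <= eta0 vs H1: eta > eta0
# Under H0, sum(X) ~ Bin(n, eta0); reject H0 when sum(X) > a
n = 50; eta0 = 0.5; alpha = 0.05     # illustrative values
a = qbinom(1 - alpha, n, eta0)
1 - pbinom(a, n, eta0)               # attained size (below alpha: binomial is discrete)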
Hypothesis test
H0: μ = μ0  vs  H1: μ = μ1

$$L(X;\mu)=\left(\frac{1}{\sqrt{2\pi\sigma^{2}}}\right)^{n}e^{-\frac{1}{2\sigma^{2}}\sum_{j=1}^{n}(X_j-\mu)^{2}}$$
and
$$\Lambda(X;\mu_0,\mu_1)=\frac{L(X;\mu_0)}{L(X;\mu_1)}
=e^{-\frac{1}{2\sigma^{2}}\left[\sum_{j=1}^{n}(X_j-\mu_0)^{2}-\sum_{j=1}^{n}(X_j-\mu_1)^{2}\right]}
=e^{-\frac{1}{2\sigma^{2}}\left[n(\mu_0^{2}-\mu_1^{2})-2(\mu_0-\mu_1)\sum_{j=1}^{n}X_j\right]}
=e^{\frac{(\mu_0-\mu_1)}{\sigma^{2}}\sum_{j=1}^{n}X_j-\frac{n(\mu_0^{2}-\mu_1^{2})}{2\sigma^{2}}}$$
Rejection region:
$$R_c=\{\Lambda<k\}=\left\{e^{\frac{(\mu_0-\mu_1)}{\sigma^{2}}\sum_{j=1}^{n}X_j-\frac{n(\mu_0^{2}-\mu_1^{2})}{2\sigma^{2}}}<k\right\}
=\left\{\frac{(\mu_0-\mu_1)}{\sigma^{2}}\sum_{j=1}^{n}X_j-\frac{n(\mu_0^{2}-\mu_1^{2})}{2\sigma^{2}}<\log(k)\right\}$$
$$=\left\{(\mu_0-\mu_1)\left(\sum_{j=1}^{n}X_j\right)<\log(k)\,\sigma^{2}+\frac{n}{2}(\mu_0^{2}-\mu_1^{2})\right\}
=\left\{(\mu_0-\mu_1)\,n\bar{X}<\log(k)\,\sigma^{2}+\frac{n}{2}(\mu_0^{2}-\mu_1^{2})\right\}$$
• if μ1 < μ0 then (μ0−μ1) > 0 and
$$R_c=\left\{\bar{X}<\frac{\log(k)\,\sigma^{2}+\frac{n}{2}(\mu_0^{2}-\mu_1^{2})}{n(\mu_0-\mu_1)}\right\}$$
• if μ1 > μ0 then (μ0−μ1) < 0 and
$$R_c=\left\{\bar{X}>\frac{\log(k)\,\sigma^{2}+\frac{n}{2}(\mu_0^{2}-\mu_1^{2})}{n(\mu_0-\mu_1)}\right\}$$
Hypothesis tests
H0: μ = μ0 vs H1: μ = μ1 > μ0,  and  H0: μ = μ0 vs H1: μ = μ1 < μ0
In applying the methodologies, and given α, the same critical value c or a will be obtained for any μ1 since it
only depends upon μ0 through μ^ ML or T0:
α=P (Type I error)= P (T 0 <a) or α=P (Type I error)=P (T 0 > a)
This implies that the uniformly most powerful test has been found.
Hypothesis tests
H0: μ ≤ μ0 vs H1: μ = μ1 > μ0,  and  H0: μ ≥ μ0 vs H1: μ = μ1 < μ0
A uniformly most powerful test for H 0 : μ = μ 0 is also uniformly most powerful for H 0 : μ ≤ μ 0 .
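In the normal case with known σ, X̄ ∼ N(μ0, σ²/n) under H0 and the critical value comes directly from a normal quantile. A minimal R sketch, where μ0 = 0, σ = 1, n = 25 and α = 0.05 are illustrative values:
# Critical value for H0: mu <= mu0 vs H1: mu > mu0 (sigma known)
# Under H0, mean(X) ~ N(mu0, sigma^2/n); reject H0 when mean(X) > c
mu0 = 0; sigma = 1; n = 25; alpha = 0.05   # illustrative values
c = qnorm(1 - alpha, mean = mu0, sd = sigma/sqrt(n))
1 - pnorm(c, mu0, sigma/sqrt(n))           # size equals alpha exactly (continuous case)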
Conclusion: Well-known theoretical results have been applied to study the optimal form of the critical
region for different pairs of hypotheses. Since both the likelihood ratio and the maximum likelihood estimator
use the likelihood function, the critical region of the tests can be expressed in terms of this estimator.
My notes:
Discussion: The analysis of variance can be applied when populations are normally distributed and their
variances are equal, that is, X_p ∼ N(μ_p, σ_p²) with σ_p = σ, ∀p. These suppositions should be evaluated
(this will be done at the end of the exercise). If the equality of the means is rejected, additional analyses would
be necessary to identify which means are different—this information is not provided by the analysis of
variance. On the other hand, the calculations involved in this analysis are so tedious that almost everybody
uses the computer. Finally, the unit of measurement of the index u is unknown to us.
Statistic: There is one factor identifying the population out of the three possible ones (we do not consider
other magazines), so a one-factor fixed-effects analysis will be applied. The statistic is
$$T(X_{SA},X_{FO},X_{NY})=\frac{MSG}{MSW}\quad\text{with}\quad T_0=\frac{MSG}{MSW}\sim F_{P-1,\,n-P}\equiv F_{3-1,\,18-3}\equiv F_{2,15}$$
Some calculations are necessary to evaluate the statistic T(x_SA, x_FO, x_NY). First of all, we look at the
three sample means:
$$\bar{X}_{SA}=\frac{1}{6}\sum_{j=1}^{6}X_{SA,j}=\frac{15.75u+\cdots+8.20u}{6}=10.97u$$
$$\bar{X}_{FO}=\frac{1}{6}\sum_{j=1}^{6}X_{FO,j}=\frac{12.63u+\cdots+9.42u}{6}=10.68u$$
$$\bar{X}_{NY}=\frac{1}{6}\sum_{j=1}^{6}X_{NY,j}=\frac{9.27u+\cdots+5.66u}{6}=7.35u$$
The magnitude of the first and the third seems quite different, which suggests that the population means may
be different. Nevertheless, we should not trust intuition.
$$\bar{X}=\frac{1}{n}\sum_{j=1}^{n}X_j=\frac{15.75u+\cdots+5.66u}{18}=9.67u$$
$$SSG=\sum_{p=1}^{P}n_p(\bar{X}_p-\bar{X})^{2}=n_{SA}(\bar{X}_{SA}-\bar{X})^{2}+n_{FO}(\bar{X}_{FO}-\bar{X})^{2}+n_{NY}(\bar{X}_{NY}-\bar{X})^{2}$$
$$=6\,(10.97u-9.67u)^{2}+6\,(10.68u-9.67u)^{2}+6\,(7.35u-9.67u)^{2}=48.53u^{2}$$
$$MSG=\frac{1}{P-1}SSG=\frac{48.53u^{2}}{3-1}=24.26u^{2}$$
$$SSW=\sum_{p=1}^{P}\sum_{j=1}^{n_p}(X_{p,j}-\bar{X}_p)^{2}=\sum_{j=1}^{6}(X_{SA,j}-\bar{X}_{SA})^{2}+\sum_{j=1}^{6}(X_{FO,j}-\bar{X}_{FO})^{2}+\sum_{j=1}^{6}(X_{NY,j}-\bar{X}_{NY})^{2}$$
$$=(15.75u-10.97u)^{2}+\cdots+(8.20u-10.97u)^{2}$$
$$+(12.63u-10.68u)^{2}+\cdots+(9.42u-10.68u)^{2}$$
$$+(9.27u-7.35u)^{2}+\cdots+(5.66u-7.35u)^{2}$$
$$=52.22u^{2}$$
$$MSW=\frac{1}{n-P}SSW=\frac{52.22u^{2}}{18-3}=3.48u^{2}$$
and, finally,
$$T_0(x_{SA},x_{FO},x_{NY})=\frac{MSG}{MSW}=\frac{24.26u^{2}}{3.48u^{2}}=6.97$$
Decision: Finally, it is necessary to check whether this region “suggested by H0” is compatible with the value that
the data provide for the statistic. If they are not compatible because the value seems extreme when the
hypothesis is true, we will trust the data and reject the hypothesis H0.
Since T 0 ( x SA , x FO , x NY )=6.97 > 6.359 → T 0 ( x)∈Rc → H0 is rejected.
The second methodology is based on the calculation of the p-value:
pV = P((X_SA, X_FO, X_NY) more rejecting than (x_SA, x_FO, x_NY) | H0 true)
= P(T0(X_SA, X_FO, X_NY) > T0(x_SA, x_FO, x_NY)) = P(T0 > 6.97) = 0.0072
→ pV = 0.0072 < 0.01 = α → H0 is rejected.
> 1-pf(6.97, 2, 15)
[1] 0.007235116
Conclusion: As suggested by the sample means, the population means of the three magazines are not equal
with a confidence of 0.99, measured in a 0-to-1 scale. Pairwise comparisons could be applied to identify the
differences.
# To calculate the sample mean of the three groups and the total sample mean
mean(SA) ; mean(FO) ; mean(NY) ; mean(Data)
# To calculate the measures and the statistic (for large datasets, the previous means should have been saved)
SSG = 6*((mean(SA) - mean(Data))^2) + 6*((mean(FO) - mean(Data))^2) + 6*((mean(NY) - mean(Data))^2)
MSG = SSG/(3-1)
SSW = sum((SA - mean(SA))^2) + sum((FO - mean(FO))^2) + sum((NY - mean(NY))^2)
MSW = SSW/(18-3)
T0 = MSG/MSW
# To find the quantile 'a' that determines the critical region
a = qf(0.99, 2, 15)
# To calculate the p-value
pValue = 1 - pf(T0, 2, 15)
(In the console, write the name of a quantity to print its value.)
(Compare these quantities with those obtained in the previous calculations.) An equivalent way of applying
the analysis of variance with R consists in substituting the lines
# To apply a one-factor analysis of variance
objectAV = aov(Data ~ Group)
# To print the table with the results
summary(objectAV)
by the lines
# To fit a linear regression model
Model = lm(Data ~ Group)
# To apply and print the analysis of variance
anova(Model)
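The snippets above assume that the vectors SA, FO and NY (the six measurements of each magazine) already exist in the R session; the stacked vector Data and the factor Group can then be built as follows (the data values themselves are not reproduced in this document):
# SA, FO and NY must contain the six measurements of each magazine
Data = c(SA, FO, NY)                                # stacked responses
Group = factor(rep(c("SA", "FO", "NY"), each = 6))  # population labels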
My notes:
Exercise 1ht-np
Occupational Hazards. The following table is based on data from the U.S. Department of Labor, Bureau of
Labor Statistics.
                          Police   Cashiers   Taxi Drivers   Guards
Homicide                    82        107           70          59
Cause of death other
than homicide               92          9           29          42
                                                                       n = 490
A) Use the data in the table, coming from a simple random sample, to test the claim that occupation is
independent of whether the cause of death was homicide. Use a significance level α = 0.05 and apply a
nonparametric chi-square test.
B) Does any particular occupation appear to be most prone to homicides? If so, which one?
(Based on an exercise of Essentials of Statistics, Mario F. Triola, Pearson)
LINGUISTIC NOTE (From: Longman Dictionary of Common Errors. Turton, N.D., and J.B.Heaton. Longman.)
job. Your job is what you do to earn your living: 'You'll never get a job if you don't have any qualifications.' 'She'd like to change her job
but can't find anything better.' Your job is also the particular type of work that you do: 'John's new job sounds really interesting.' 'I know
she works for the BBC but I'm not sure what job she does.' A job may be full-time or part-time (NOT half-time or half-day): 'All she
could get was a part-time job at a petrol station.'
do (for a living). When you want to know about the type of work that someone does, the usual questions are What do you do? What
does she do for a living? etc 'What does your father do?' - 'He's a police inspector.'
occupation. Occupation and job have similar meanings. However, occupation is far less common than job and is used mainly in formal
and official styles: 'Please give brief details of your employment history and present occupation.' 'People in manual occupations seem to
suffer less from stress.'
post/position. The particular job that you have in a company or organization is your post or position: 'She's been appointed to the post of
deputy principal.' 'He's applied for the position of sales manager.' Post and position are used mainly in formal styles and often refer to
jobs which have a lot of responsibility.
career. Your career is your working life, or the series of jobs that you have during your working life: 'The scandal brought his career in
politics to a sudden end.' 'Later on in his career, he became first secretary at the British Embassy in Washington.' Your career is also the
particular kind of work for which you are trained and that you intend to do for a long time: 'I wanted to find out more about careers in
publishing.'
trade. A trade is a type of work in which you do or make things with your hands: 'Most of the men had worked in skilled trades such as
carpentry or printing.' 'My grandfather was a bricklayer by trade.'
profession. A profession is a type of work such as medicine, teaching, or law which requires a high level of training or education: 'Until
recently, medicine has been a male-dominated profession.' 'She entered the teaching profession in 1987.'
LINGUISTIC NOTE (From: The Careful Writer: A Modern Guide to English Usage. Bernstein, T.M. Atheneum)
occupations. The words people use affectionately, humorously, or disparagingly to describe their own occupations are their own affair.
They may say, “I'm in show business” (or, more likely, “show biz”), or “I'm in the advertising racket,” or “I'm in the oil game,” or “I'm in
the garment line.” But outsiders should use more caution, more discretion, and more precision. For instance, it is improper to write, “Mr.
Danaher has been in the law business in Washington.” Law is a profession. Similarly, to say someone is “in the teaching game” would
undoubtedly give offense to teachers. Unless there is some special reason to be slangy or colloquial, the advisable thing to do is to accord
every occupation the dignity it deserves.
Statistic: Since we have to apply a test of independence, from a table of statistics (e.g. in [T]) we select
$$T_0(X)=\sum_{l=1}^{L}\sum_{k=1}^{K}\frac{(N_{lk}-\hat{e}_{lk})^{2}}{\hat{e}_{lk}}\;\xrightarrow{d}\;\chi^{2}_{(L-1)(K-1)}$$
for L and K classes, respectively.
Hypotheses: The null hypothesis supposes that the two variables are independent,
H 0 : X , Y independent and H 1 : X , Y dependent
or, probabilistically,
H 0 : f ( x , y)= f X ( x)⋅ f Y ( y ) and H 1 : f ( x , y )≠ f X ( x )⋅ f Y ( y )
This implies that the probability at any cell is the product of the marginal probabilities of its row and column.
Note that two underlying probability distributions are supposed for X and Y, although we do not care about
them, and we will directly estimate the probabilities from the empirical table.
Instead of using the computer, we can use the last value in our table to bound the p-value (statisticians usually
want to discover its value, while here we only want to check whether or not it is smaller than α):
pV = P(T0(X) > 65.52) < P(T0(X) > 11.3) = 0.01 → pV < 0.01 < 0.05 = α → H0 is rejected
Conclusion: The hypothesis that the two variables are independent is rejected. This means that there seems
to be an association between occupation and cause of death. (Remember: statistical results depend on: the
assumptions, the methods, the certainty and the data.)
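A quick check with R's chisq.test, assuming the table has been read correctly from the statement (for tables larger than 2×2 no continuity correction is applied); the statistic should match the value 65.52 used above:
# Columns: Police, Cashiers, Taxi Drivers, Guards; rows: homicide / other causes
deaths = matrix(c(82, 107, 70, 59,
                  92,   9, 29, 42), nrow = 2, byrow = TRUE)
chisq.test(deaths)   # chi-squared statistic with (2-1)*(4-1) = 3 degrees of freedom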
My notes:
Exercise 2ht-np
World War II Bomb Hits in London. To carry out an analysis, South London was divided into 576 areas. For
the variable N ≡ number of bombs in the k-th area (any), a simple random sample (x1,...,x576) was gathered
and grouped in the following table:
EMPIRICAL
Number of Bombs      0     1     2     3     4    5 or more
Number of Regions   229   211   93    35    7    1            n = 576
Data taken from: An application of the Poisson distribution. Clarke, R. D. Journal of the Institute of Actuaries [JIA] (1946) 72: 481
http://www.actuaries.org.uk/research-and-resources/documents/application-poisson-distribution
Discussion: We must apply the chi-square methodology to study whether the data statistically fit the models
specified. In the second section, a value for the parameter is given. For this probability model, we have to
calculate or estimate the probabilities in order to obtain the expected absolute frequencies. Finally, by using
the statistic T0 we will compare the two tables and make a decision.
Statistic: Since we have to apply a goodness-of-fit test, from a table of statistics (e.g. in [T]) we select
$$T_0(X)=\sum_{k=1}^{K}\frac{(N_k-\hat{e}_k)^{2}}{\hat{e}_k}\;\xrightarrow{d}\;\chi^{2}_{K-s-1}$$
With the estimate λ̂ = x̄ = 0.93, the probabilities of the model are:

Poisson (λ = 0.93)
Values          0       1       2       3       4        5 or more
Probabilities   0.395   0.367   0.17    0.0529  0.0123   0.00270      1
We have really done the calculations with the programming language R. By using a calculator, some
quantities may be slightly different due to technical effects (number of decimal digits, accuracy, etc).
To guarantee the quality of the chi-square methodology, the expected absolute frequencies are usually required
to be at least five (≥5). For this reason, we merge the last two classes in both the empirical and the
expected tables.
EMPIRICAL
Number of Bombs      0     1     2     3     4 or more
Number of Regions   229   211   93    35    7+1=8         n = 576
For this kind of test, the critical region always has the form Rc = {T0(X) > a}.
Decision: There are K = 5 classes (after merging two of them) and s = 1 estimation, so
$$T_0(X)\;\xrightarrow{d}\;\chi^{2}_{K-s-1}\equiv\chi^{2}_{5-1-1}\equiv\chi^{2}_{3}$$
If we apply the methodology based on the critical region, the necessary quantile a is calculated from the
definition of the type I error, with the given α = 0.05:
α=P (Type I error)= P ( Reject H 0 ∣ H 0 true)= P( T 0 ( X )∈Rc )≈P (T 0 (X )>a)
→ a = r_α = 7.81 → Rc = {T0(X) > 7.81}
> qchisq(1-0.05, 3)
[1] 7.814728
Then, the decision is: T0(x) = 1.019 < 7.81 → T0(x) ∉ Rc → H0 is not rejected.
If we apply the alternative methodology based on the p-value,
Poisson (λ = 0.8)
Values          0       1       2       3       4         5 or more
Probabilities   0.449   0.359   0.144   0.0383  0.00767   0.00141      1
As in the previous case, we merge the last two classes so that all the expected absolute frequencies are large
enough:

EMPIRICAL
Number of Bombs      0     1     2     3     4 or more
Number of Regions   229   211   93    35    7+1=8         n = 576
so
Conclusion: The hypothesis that bomb hits can reasonably be modeled by using the Poisson family has not
been rejected. In this case, data provided an estimate λ̂ = 0.93. Nevertheless, when the value λ = 0.8 is
imposed, the hypothesis that bomb hits can be modeled by using a Pois(λ=0.8) model is rejected. This proves
that:
i. Even a quite reasonable model may not fit the data if inappropriate parameter values are considered.
This emphasizes the importance of using good parameter estimation methods.
ii. Estimating the parameter value was better than fixing a value close to the estimate. As statisticians say:
“let the data talk”. This highlights the necessity of testing all suppositions, which implies that
nonparametric procedures should sometimes be applied before the parametric ones: in this case, before
supposing that the Poisson family is proper and imposing a value for the parameter, the whole Poisson
family must be considered.
(Remember: statistical results depend on: the assumptions, the methods, the certainty and the data.)
Advanced theory: Mendenhall, W., D.D. Wackerly and R.L. Scheaffer say (Mathematical Statistics with
Applications, Duxbury Press) that the expected absolute frequencies can be as low as 1 for some situations,
according to Cochran, W.G., “The χ2 Test of Goodness of Fit”, Annals of Mathematical Statistics, 23 (1952)
pp. 315-345. To take the most advantage of this exercise, we repeat the previous calculations without merging
the last two classes.
(1) Fit to the Poisson family
We evaluate T0, which is necessary to apply any of the two methodologies.
$$T_0(x)=\frac{(229-227.26)^{2}}{227.26}+\cdots+\frac{(1-1.55)^{2}}{1.55}=1.167$$
Now there are K = 6 classes and s = 1 estimation, so T0(X) →d χ²_{K−s−1} ≡ χ²_{6−1−1} ≡ χ²_4. If we apply the
methodology based on the critical region, a = qchisq(0.95, 4) = 9.49 and Rc = {T0(X) > 9.49}.
Then, the decision is: T 0 ( x)=1.167 < 9.49 → T 0 ( x)∉Rc → H0 is not rejected.
If we apply the alternative methodology based on the p-value,
In both sections the same decisions have been made, which implies that this is one of those situations where
merging the last two classes does not seem essential.
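The first fit can be reproduced in R; the sketch assumes, as needed to recover λ̂ ≈ 0.93, that the single region of the '5 or more' class is counted as 5 bombs when computing the sample mean:
obs = c(229, 211, 93, 35, 7, 1)                # observed absolute frequencies
n = sum(obs)                                    # 576 regions
lambdaHat = sum((0:5) * obs) / n                # sample mean, approx. 0.93
probs = c(dpois(0:4, lambdaHat), 1 - ppois(4, lambdaHat))
expected = n * probs                            # expected absolute frequencies
T0 = sum((obs - expected)^2 / expected)         # approx. 1.167 (unmerged classes)
qchisq(0.95, 6 - 1 - 1)                         # critical value 9.49
1 - pchisq(T0, 4)                               # p-value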
My notes:
Exercise 3ht-np
Three financial products have been commercialized and the presence of interest in them has been registered
for some individuals. It is possible to imagine different situations where the following data could have been
obtained.
            Product 1   Product 2   Product 3
Group 1        10          18           9        37
Group 2        20          13          15        48
               30          31          24        85
(a) A simple random sample of 48 people of the second group was classified according to the
variable product; test at α = 0.01 whether this variable follows the distribution determined by the
sample of the first group.
Discussion: In this exercise, the same table is looked at as containing data obtained from three different
schemes. The chi-square methodology will be applied in all sections through three kinds of test: goodness-of-
-fit, independence and homogeneity. In the first case, a probability distribution F0 is specified, while in the last
two cases the underlying distributions have no interest by themselves.
Hypotheses: For a nonparametric goodness-of-fit test, the null hypothesis assumes that the theoretical
probabilities of the second group follow the probabilities determined by the sample of the first group. If Fk
represents the distribution of the variable product in the k-th population,
H 0: F 2 ∼ F 1 and H 1 : F 2 ∼ F ≠F 1
The variable of the first group determines the following distribution F1:
Value          1       2       3
Probability   10/37   18/37   9/37
Now, under H0 the formula e_k = n·p_k allows us to fill in the expected table: 48·(10/37) = 12.97,
48·(18/37) = 23.35 and 48·(9/37) = 11.68.
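A short R check of this section; it reproduces the rejection at α = 0.01 mentioned in the advanced conclusion below:
obs2 = c(20, 13, 15)                       # group 2 counts
p1 = c(10, 18, 9) / 37                     # distribution determined by group 1
expected = 48 * p1
T0 = sum((obs2 - expected)^2 / expected)   # approx. 9.3
qchisq(0.99, 3 - 1)                        # critical value 9.21; T0 > 9.21, H0 rejected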
Hypotheses: For a nonparametric independence test, the null hypothesis assumes that the probability at any
cell is the product of the marginal probabilities of its row and column,
H0: X, Y independent and H1: X, Y dependent
or, probabilistically,
H0: f(x,y) = f_X(x)·f_Y(y) and H1: f(x,y) ≠ f_X(x)·f_Y(y)
Under H0, the formula $\hat{e}_{lk}=n\hat{p}_{lk}=n\,\hat{p}_{l\cdot}\,\hat{p}_{\cdot k}=n\,\frac{N_{l\cdot}}{n}\,\frac{N_{\cdot k}}{n}$ allows us to fill in the expected table:
Then,
$$T_0(x)=\frac{\left(10-\frac{37\cdot30}{85}\right)^{2}}{\frac{37\cdot30}{85}}+\cdots+\frac{\left(15-\frac{48\cdot24}{85}\right)^{2}}{\frac{48\cdot24}{85}}=4.29$$
For this kind of test, the critical region always has the form Rc = {T0(X) > a}.
Hypotheses: For a nonparametric homogeneity test, the null hypothesis assumes that the marginal
probabilities in any column are the same for the two groups, that is, are independent of the group or stratum.
This means that the variable of interest X follows the same probability distribution in each (sub)group or
stratum. If G represents the variable group, mathematically,
H 0 : F ( x∣ G)= F ( x) and H 1 : F ( x∣ G)≠F (x )
Under H0, the formula $\hat{e}_{lk}=n_l\,\hat{p}_{lk}=n_l\,\hat{p}_{\cdot k}=n_l\,\frac{N_{\cdot k}}{n}$ allows us to fill in the expected table:
$$T_0(x)=\frac{\left(10-37\,\frac{30}{85}\right)^{2}}{37\,\frac{30}{85}}+\cdots+\frac{\left(15-48\,\frac{24}{85}\right)^{2}}{48\,\frac{24}{85}}=4.29$$
For this kind of test, the critical region always has the form Rc = {T0(X) > a}.
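Both the independence and the homogeneity computations reduce to the same statistic, which can be verified with chisq.test:
interest = matrix(c(10, 18, 9,
                    20, 13, 15), nrow = 2, byrow = TRUE)
chisq.test(interest)   # X-squared approx. 4.29, df = (2-1)*(3-1) = 2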
Conclusion (advanced): Neither the independence nor the homogeneity has been rejected, while the
hypothesis supposing that the variable product follows in population 2 the distribution determined by the
sample of group 1 has been rejected. On the one hand, the distribution determined by one sample, involved
in section (a), is in general different from the common supposed underlying distribution involved in section (b),
which is estimated by using the samples of both groups. Thus, it can be thought that this underlying
distribution “is between the two samples”, which justifies the decisions made in (a), (b) and (c).
Group 2 has more weight in determining that distribution, since it has more elements. It is worth noticing the
similarity between the independence and the homogeneity tests: same distribution and evaluation for the
statistic, same critical region, et cetera. (As regards the application of the methodologies, bounding the p-value
is sometimes enough to discover whether it is smaller than α or not, but in general statisticians want to find its
value.)
My notes:
Discussion: In this exercise, no supposition should be evaluated: in (a) because the Bernoulli model is “the
only proper one” to model a coin, and in (b) and (c) because they involve nonparametric tests. The sections of
this exercise need the same calculations as in previous exercises.
Statistic: From a table of statistics (e.g. in [T]), since the population variable is Bernoulli and the asymptotic
framework can be considered (since n is large), the statistic
$$T(X;\eta)=\frac{\hat{\eta}-\eta}{\sqrt{\frac{?(1-?)}{n}}}\;\xrightarrow{d}\;N(0,1)$$
is selected, where the symbol ? is substituted by the best information available. In testing hypotheses, it will
be used in two forms:
$$T_0(X)=\frac{\hat{\eta}-\eta_0}{\sqrt{\frac{\eta_0(1-\eta_0)}{n}}}\;\xrightarrow{d}\;N(0,1)
\quad\text{and}\quad
T_1(X)=\frac{\hat{\eta}-\eta_1}{\sqrt{\frac{\eta_1(1-\eta_1)}{n}}}\;\xrightarrow{d}\;N(0,1)$$
where the supposed knowledge about the value of η is used in the denominators to estimate the variance (we
do not have nor suppose this information when T is used to build a confidence interval, or for tests with two
populations). Regardless of the methodology to be applied, the following value will be necessary:
$$T_0(x)=\frac{\frac{50{,}347}{100{,}000}-\frac{1}{2}}{\sqrt{\frac{\frac{1}{2}\left(1-\frac{1}{2}\right)}{100{,}000}}}=2.19$$
where η0 = 1/2 when the coin is supposed to be fair.
Hypotheses: Since a parametric test must be applied, the coin—dichotomic situation—is modeled by a
Bernoulli random variable, and the hypotheses are
H0: η = η0 = 1/2 and H1: η = η1 ≠ 1/2
Note that the question is about the value of the parameter η while the Bernoulli distribution is supposed under
both hypotheses; in some nonparametric tests, this distribution is not even supposed (although the Bernoulli
model is here the reasonable way of modeling a coin).
Decision: To determine Rc, the quantiles are calculated from the type I error with α = 0.1 at η0 = 1/2:
α(1/2) = P(Type I error) = P(Reject H0 | H0 true) = P(T(X;θ) ∈ Rc | H0) = P(|T0(X)| > a)
→ a = r_{α/2} = 1.645 → Rc = {|T0(X)| > 1.645}
Thus, the decision is: T0(x) = 2.19 > 1.645 → T0(x) ∈ Rc → H0 is rejected.
If we apply the methodology based on the p-value,
pV = P ( X more rejecting than x ∣ H 0 true)=P (∣T 0 ( X )∣>∣T 0 ( x)∣)
= 2⋅P (T 0 ( X )<−2.19)=2⋅0.0143=0.0248
→ pV =0.0248 < 0.1=α → H0 is rejected.
Power function: To calculate β, we have to work under H1. Since in this case the critical region is already
expressed in terms of T0 and we must use T1, we apply the mathematical tricks of multiplying and dividing by
the same quantity and of adding and subtracting the same quantity:
$$\beta(\eta_1)=P(\text{Type II error})=P(\text{Accept } H_0\mid H_1\text{ true})=P(T_0(X)\notin R_c\mid H_1)=P(|T_0(X)|\leq 1.645\mid H_1)$$
$$=P\left(-1.645\leq\frac{\hat{\eta}-\eta_0}{\sqrt{\frac{\eta_0(1-\eta_0)}{n}}}\leq+1.645\;\Bigg|\;H_1\right)
=P\left(-1.645\,\frac{\sqrt{\eta_0(1-\eta_0)}}{\sqrt{\eta_1(1-\eta_1)}}\leq\frac{\hat{\eta}-\eta_0}{\sqrt{\frac{\eta_1(1-\eta_1)}{n}}}\leq+1.645\,\frac{\sqrt{\eta_0(1-\eta_0)}}{\sqrt{\eta_1(1-\eta_1)}}\;\Bigg|\;H_1\right)$$
$$=P\left(-1.645\,\frac{\sqrt{\eta_0(1-\eta_0)}}{\sqrt{\eta_1(1-\eta_1)}}-\frac{\eta_1-\eta_0}{\sqrt{\frac{\eta_1(1-\eta_1)}{n}}}\leq\frac{\hat{\eta}-\eta_1}{\sqrt{\frac{\eta_1(1-\eta_1)}{n}}}\leq+1.645\,\frac{\sqrt{\eta_0(1-\eta_0)}}{\sqrt{\eta_1(1-\eta_1)}}-\frac{\eta_1-\eta_0}{\sqrt{\frac{\eta_1(1-\eta_1)}{n}}}\;\Bigg|\;H_1\right)$$
By using a computer, many values η1 can be considered to plot the power function
$$\phi(\eta)=P(\text{Reject } H_0)=\begin{cases}\alpha(\eta) & \text{if } \eta\in\Theta_0\\ 1-\beta(\eta) & \text{if } \eta\in\Theta_1\end{cases}$$
# Sample and inference
n = 100000
alpha = 0.1
theta0 = 0.5 # Value under the null hypothesis H0
q = qnorm(c(alpha/2, 1-alpha/2),0,1)
theta1 = seq(from=0,to=1,0.01)
paramSpace = sort(unique(c(theta1,theta0)))
PowerFunction = 1-pnorm((q[2]*sqrt(theta0*(1-theta0))-sqrt(n)*(paramSpace-theta0))/sqrt(paramSpace*(1-paramSpace)),0,1) +
pnorm((q[1]*sqrt(theta0*(1-theta0))-sqrt(n)*(paramSpace-theta0))/sqrt(paramSpace*(1-paramSpace)),0,1)
plot(paramSpace, PowerFunction, xlab='Theta', ylab='Probability of rejecting theta0', main='Power Function', type='l')
Statistic: To apply a goodness-of-fit test, from a table of statistics (e.g. in [T]) we select
$$T_0(X)=\sum_{k=1}^{K}\frac{(N_k-\hat{e}_k)^{2}}{\hat{e}_k}\;\xrightarrow{d}\;\chi^{2}_{K-s-1}$$
where there are K classes, and s parameters have to be estimated to determine F0 and hence the probabilities.
Hypotheses: For a nonparametric goodness-of-fit test, the null hypothesis supposes that the sample was
generated by a Bernoulli distribution with η0 = 1/2, while the alternative hypothesis supposes that it was
generated by a different distribution (Bernoulli or not, although this distribution is here “the reasonable way”
of modeling a coin).
$$H_0: X\sim F_0=B\!\left(\tfrac{1}{2}\right)\quad\text{and}\quad H_1: X\sim F\neq B\!\left(\tfrac{1}{2}\right)$$
For the distribution F0, the table of probabilities is
Value         –1 (tail)   +1 (head)
Probability      1/2         1/2
and, under H0, the formula e_k = n·p_k = n·P_θ(k-th class) = 100,000·(1/2) = 50,000 allows us to fill in the
expected table: 50,000 and 50,000.
Decision: There are K = 2 classes and s = 0 (no parameter has been estimated), so
$$T_0(X)\;\xrightarrow{d}\;\chi^{2}_{K-s-1}\equiv\chi^{2}_{2-1-0}\equiv\chi^{2}_{1}$$
If we apply the methodology based on the critical region, the definition of type I error, with α = 0.1, is applied
to calculate the quantile a:
α=P (Type I error)= P ( Reject H 0 ∣ H 0 true)= P( T 0 ( X )∈ Rc )≈P (T 0 (X )>a)
→ a=r α=2.71 → Rc = {T 0 ( X )> 2.71}
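The evaluation of the statistic can be reproduced in R; note that it equals the square of the value T0(x) = 2.19 obtained with the parametric statistic (see remark (3) in the conclusion):
obs = c(49653, 50347)                     # tails and heads in 100,000 tosses
expected = c(50000, 50000)
T0 = sum((obs - expected)^2 / expected)   # approx. 4.82 > 2.71, so H0 is rejected
qchisq(1 - 0.1, 1)                        # critical value 2.71
1 - pchisq(T0, 1)                         # p-value, approx. 0.028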
Statistic: To apply a position sign test, from a table of statistics (e.g. in [T]) we select
$$T_0(X)=\#\{X_j-\theta_0>0\}\sim Bin\left(n,\,P(X_j>\theta_0)\right)$$
Here θ0 = 0 and P(Xj > 0) = 1/2, so Me(T0) = E(T0) = n/2.
Hypotheses: For a nonparametric position test, if head and tail are equivalently translated into the numbers
+1 and –1, respectively, the hypotheses are
H 0 : Me( X ) = θ 0 = 0 and H 1 : Me( X ) = θ1 ≠ 0
For these hypotheses,
$$\alpha=P\left(\left|T_0(X)-\frac{n}{2}\right|>a\right)=P\left(\frac{\left|T_0(X)-\frac{n}{2}\right|}{\sqrt{n\,\frac{1}{2}\left(1-\frac{1}{2}\right)}}>\frac{a}{\sqrt{n\,\frac{1}{2}\left(1-\frac{1}{2}\right)}}\right)\approx P\left(|Z|>\frac{2a}{\sqrt{n}}\right)$$
$$\rightarrow\; r_{\alpha/2}=1.645=\frac{2a}{\sqrt{n}}\;\rightarrow\; a\approx 1.645\,\frac{\sqrt{100{,}000}}{2}\approx 260.097\;\rightarrow\; R_c=\left\{\left|T_0(X)-\frac{n}{2}\right|>260.097\right\}$$
The final decision is: ∣T 0 (x )−100,000/ 2∣=347 > 260.097=a → T 0 ( x)∈Rc → H0 is rejected.
If we apply the methodology based on the p-value,
$$pV=P\left(\left|T_0(X)-\frac{n}{2}\right|>\left|T_0(x)-\frac{n}{2}\right|\right)
=P\left(\frac{\left|T_0(X)-\frac{n}{2}\right|}{\sqrt{n\,\frac{1}{2}\left(1-\frac{1}{2}\right)}}>\frac{|50{,}347-50{,}000|}{\sqrt{100{,}000\cdot\frac{1}{2}\left(1-\frac{1}{2}\right)}}\right)
\approx P(|Z|>2.19)$$
$$=2\cdot P(Z<-2.19)=2\cdot 0.0143=0.0248$$
→ pV = 0.0248 < 0.1=α → H0 is rejected.
Conclusion: (1) In this case the three different tests agree to make the same decision, but this may not
happen in other situations. When it is possible to compare the power functions and there exists a uniformly
most powerful test, the decision of the most powerful should be considered. In general, (proper) parametric
tests are expected to have more power than the nonparametric ones in testing the same hypotheses. (2) With
two classes, the chi-square test does not distinguish any two distributions such that the two class probabilities
are (½, ½), that is, in this case the test provides a decision about the symmetry of the distribution (chi-square
tests work with class probabilities, not with the distributions themselves). (3) In this exercise the parametric
test and the nonparametric test of the signs are essentially the same. (Remember: statistical results depend on:
the assumptions, the methods, the certainty and the data.)
My notes:
Discussion: The pilot statistical study mentioned in the statement should cover the evaluation of all
suppositions. The hypothesis that σM = σF should be evaluated as well. The interval will be built by applying
the method of the pivot.
$$T(M,F;\mu_M,\mu_F)=\frac{(\bar{M}-\bar{F})-(\mu_M-\mu_F)}{\sqrt{\frac{S_M^2}{n_M}+\frac{S_F^2}{n_F}}}\sim t_{\kappa}
\quad\text{with}\quad
\kappa=\frac{\left(\frac{S_M^2}{n_M}+\frac{S_F^2}{n_F}\right)^{2}}{\frac{1}{n_M-1}\left(\frac{S_M^2}{n_M}\right)^{2}+\frac{1}{n_F-1}\left(\frac{S_F^2}{n_F}\right)^{2}}$$
$$T(M,F;\mu_M,\mu_F)=\frac{(\bar{M}-\bar{F})-(\mu_M-\mu_F)}{\sqrt{\frac{S_p^2}{n_M}+\frac{S_p^2}{n_F}}}\sim t_{n_M+n_F-2}
\quad\text{with}\quad
S_p^2=\frac{n_M s_M^2+n_F s_F^2}{n_M+n_F-2}=\frac{(n_M-1)S_M^2+(n_F-1)S_F^2}{n_M+n_F-2}$$
$$T(M,F;\sigma_M,\sigma_F)=\frac{S_M^2/\sigma_M^2}{S_F^2/\sigma_F^2}\sim F_{n_M-1,\,n_F-1}$$
Because of the information available, the first and the second statistics allow studying M – F (the second for
the particular case where σM = σF), while the third allows studying σM/σF.
$$P(\bar{M}-\bar{F}\leq 1.27)=P\left(\frac{(\bar{M}-\bar{F})-(\mu_M-\mu_F)}{\sqrt{\frac{S_M^2}{n_M}+\frac{S_F^2}{n_F}}}\leq\frac{1.27-(\mu_M-\mu_F)}{\sqrt{\frac{S_M^2}{n_M}+\frac{S_F^2}{n_F}}}\right)
=P\left(T\leq\frac{1.27-(14.2-13.5)}{\sqrt{\frac{4.99}{53}+\frac{5.02}{49}}}\right)=P(T\leq 1.29)$$
with T ~ t_κ where
$$\kappa=\frac{\left(\frac{S_M^2}{n_M}+\frac{S_F^2}{n_F}\right)^{2}}{\frac{1}{n_M-1}\left(\frac{S_M^2}{n_M}\right)^{2}+\frac{1}{n_F-1}\left(\frac{S_F^2}{n_F}\right)^{2}}
=\frac{\left(\frac{4.99}{53}+\frac{5.02}{49}\right)^{2}}{\frac{1}{53-1}\left(\frac{4.99}{53}\right)^{2}+\frac{1}{49-1}\left(\frac{5.02}{49}\right)^{2}}=99.33$$
Should we round this value downward, κ = 99, or upward, κ = 100? We will use this exercise to show that
➢ For large values of κ1 and κ2, the t distribution provides close values
➢ For a large value of κ, the t distribution provides values close to those of the standard normal
distribution (the tκ distribution tends with κ to the standard normal distribution)
By using the programming language R:
• If we round κ = 99.33 down to 99, the probability is
> pt(1.29, 99)
[1] 0.8999721
• If we round κ = 99.33 up to 100, the probability is
> pt(1.29, 100)
[1] 0.8999871
On the other hand, when the variances are supposed to be equal they can and should be estimated jointly by
using the pooled sample variance.
$$S_p^2=\frac{(n_M-1)S_M^2+(n_F-1)S_F^2}{n_M+n_F-2}=\frac{(53-1)\cdot 4.99\$^2+(49-1)\cdot 5.02\$^2}{53+49-2}=5.0044\$^2\approx 5\$^2$$
Then,
$$P(\bar{M}-\bar{F}\leq 1.27)=P\left(\frac{(\bar{M}-\bar{F})-(\mu_M-\mu_F)}{\sqrt{\frac{S_p^2}{n_M}+\frac{S_p^2}{n_F}}}\leq\frac{1.27-(14.2-13.5)}{\sqrt{\frac{5}{53}+\frac{5}{49}}}\right)=P(T\leq 1.29)$$
$$I_{1-\alpha}=\left[\frac{S_M^2}{r_{\alpha/2}\,S_F^2},\;\frac{S_M^2}{l_{\alpha/2}\,S_F^2}\right]
\quad\text{and then}\quad
I_{1-\alpha}=\left[\sqrt{\frac{S_M^2}{r_{\alpha/2}\,S_F^2}},\;\sqrt{\frac{S_M^2}{l_{\alpha/2}\,S_F^2}}\right]$$
In the calculations, multiplying by a quantity and inverting can be applied in either order.
Then
$$I_{0.95}=\left[\sqrt{\frac{4.99}{1.76\cdot 5.02}},\;\sqrt{\frac{4.99}{0.57\cdot 5.02}}\right]=[0.75,\,1.32]$$
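These computations can be checked in R; qf returns the exact F quantiles behind the rounded table values 1.76 and 0.57:
sM2 = 4.99; sF2 = 5.02; nM = 53; nF = 49
# Welch's degrees of freedom
kappa = (sM2/nM + sF2/nF)^2 /
        ((sM2/nM)^2/(nM - 1) + (sF2/nF)^2/(nF - 1))   # approx. 99.33
pt(1.29, 99); pt(1.29, 100)                            # both approx. 0.90
# 95% confidence interval for sigmaM/sigmaF
r = qf(0.975, nM - 1, nF - 1)
l = qf(0.025, nM - 1, nF - 1)
sqrt(c(sM2/(r*sF2), sM2/(l*sF2)))                      # approx. [0.75, 1.32]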
Conclusion: First of all, in this case there is very little difference between the two ways of estimating the
variance. On the other hand, as the variances are related through a quotient, the interpretation is not direct: the
dimensionless, multiplicative factor c in σM2 = cσF2 is, with 95% confidence, in the interval obtained. The
interval (with dimensionless endpoints) contains the value 1, so it may happen that the variability of the
amount of money spent is the same for males and females—we cannot reject this hypothesis (note that
confidence intervals can be used to make decisions). (Remember: statistical results depend on: the
assumptions, the methods, the certainty and the data.)
My notes:
Exercise 2pe-ci-ht
The electric light bulbs of manufacturer X have a mean lifetime of 1400 hours (h), while those of
manufacturer Y have a mean lifetime of 1200h. Simple random samples of 125 bulbs of each brand are tested.
From these datasets the sample quasivariances Sx2 = 156h2 and Sy2 = 159h2 are computed. If manufacturers
are supposed to be independent and their lifetimes are supposed to be normally distributed:
a) Build a 99% confidence interval for the quotient of standard deviations σX/σY. Is the value σX/σY=1,
that is, the case σX=σY, included in the interval?
b) By using the proper statistic T, find k such that P(X̄ − Ȳ ≤ k) = 0.4.
Hint: (i) Firstly, build an interval for the quotient σX2/σY2; secondly, apply the positive square root function. (ii) If a random variable
ξ follows a F124, 124 then P(ξ ≤ 0.628) = 0.005 and P(ξ ≤ 1.59) = 0.995. (iii) If ξ follows a t248, then P(ξ ≤ –0.25) = 0.4
(Based on an exercise of Statistics, Spiegel, M.R., and L.J. Stephens, McGraw–Hill.)
LINGUISTIC NOTE (From: Longman Dictionary of Common Errors. Turton, N.D., and J.B.Heaton. Longman.)
electric means carrying, producing, produced by, powered by, or charged with electricity: 'an electric wire', 'an electric generator', 'an
electric shock', 'an electric current', 'an electric light bulb', 'an electric toaster'. For machines and devices that are powered by electricity
but do not have transistors, microchips, valves, etc, use electric (NOT electronic): 'an electric guitar', 'an electric train set', 'an electric
razor'.
Discussion: There are two independent normal populations. All suppositions should be evaluated. Their
means are known while their variances are estimated from samples of size 125. A 99% confidence interval for
σX/σY is required. The interval will be built by applying the method of the pivot. If the value σX/σY=1 belongs
to this interval of confidence 0.99, the probability of the second section can reasonably be calculated under the
supposition σX=σ=σY—this implies that the common variance σ2 is jointly estimated by using the pooled
sample quasivariance Sp2. On the other hand, this exercise shows the natural order in which the statistical
techniques must sometimes be applied in practice: the supposition σX=σY is empirically supported—by
applying a confidence interval or a hypothesis test—before using it in calculating the probability. Since the
standard deviations have the same units of measurement as the data (hours), their quotient is dimensionless,
and so are the endpoints of the interval.
where $V_X^2=\frac{1}{n}\sum_{j=1}^{n}(X_j-\mu)^2$ and $S_X^2=\frac{1}{n-1}\sum_{j=1}^{n}(X_j-\bar{X})^2$, respectively (similarly for population Y).
We would use the first if we were given V²X and V²Y or we had enough information to calculate them (we
know the means but not the data themselves). In this exercise we can use only the second statistic.
$$I_{1-\alpha}=\left[\frac{S_X^2}{r_{\alpha/2}\,S_Y^2},\;\frac{S_X^2}{l_{\alpha/2}\,S_Y^2}\right]
\quad\text{and}\quad
I_{1-\alpha}=\left[\sqrt{\frac{S_X^2}{r_{\alpha/2}\,S_Y^2}},\;\sqrt{\frac{S_X^2}{l_{\alpha/2}\,S_Y^2}}\right]$$
$$I_{0.99}=\left[\sqrt{\frac{156h^2}{1.59\cdot 159h^2}},\;\sqrt{\frac{156h^2}{0.628\cdot 159h^2}}\right]=[0.786,\,1.25]$$
The value σX/σY=1 is in the interval of confidence 0.99 (99%), so the supposition σX=σY is strongly supported.
(b) Probability
To work with the difference of the means of two independent normal populations when σX = σY, we consider:
$$T(X,Y;\mu_X,\mu_Y)=\frac{(\bar{X}-\bar{Y})-(\mu_X-\mu_Y)}{\sqrt{\frac{S_p^2}{n_X}+\frac{S_p^2}{n_Y}}}\sim t_{n_X+n_Y-2}$$
where $S_p^2=\frac{(n_X-1)S_X^2+(n_Y-1)S_Y^2}{n_X+n_Y-2}=\frac{124\cdot 156h^2+124\cdot 159h^2}{125+125-2}=157.5h^2$ is the pooled sample quasivariance. Then,
$$P(\bar{X}-\bar{Y}\leq k)=P\left(\frac{(\bar{X}-\bar{Y})-(\mu_X-\mu_Y)}{\sqrt{\frac{S_p^2}{n_X}+\frac{S_p^2}{n_Y}}}\leq\frac{k-(1400h-1200h)}{\sqrt{\frac{157.5}{125}+\frac{157.5}{125}}}\right)=0.4$$
Now, by using the information in (iii) of the hint,
$$l_{0.4}=-0.25=\frac{k-(1400h-1200h)}{\sqrt{\frac{157.5h^2}{125}+\frac{157.5h^2}{125}}}
\;\rightarrow\;
k=200h-0.25\sqrt{\frac{2\cdot 157.5h^2}{125}}=199.60h$$
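A quick R verification; qt returns the exact t quantile behind the rounded value −0.25 of the hint:
Sp2 = (124*156 + 124*159) / (125 + 125 - 2)   # pooled quasivariance, 157.5 h^2
se = sqrt(Sp2/125 + Sp2/125)
k = 200 + qt(0.4, 248) * se                    # approx. 199.6 h
# 99% confidence interval for sigmaX/sigmaY
sqrt(c(156/(qf(0.995, 124, 124)*159), 156/(qf(0.005, 124, 124)*159)))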
Conclusion: A confidence interval has been obtained for the quotient of the standard deviations. The
dimensionless value of θ = σX/σY is between 0.786 and 1.250 with confidence 99%; alternatively, as the
standard deviations are related through a quotient, an equivalent interpretation is the following: the
(dimensionless) multiplicative factor θ in σX=θ·σY is, with 99% confidence, in the interval obtained. Since the
value θ = 1 is in this high-confidence interval, it may happen that the variability of the two lifetimes is the
same—we cannot reject this hypothesis (note that confidence intervals can be used to make decisions);
besides, it is reasonable to use the supposition σX=σY in calculating the probability of the second section. If
any two simple random samples of size 125 were considered, the difference of the sample means would be
smaller than 199.60h with a probability of 0.4. Once two particular samples are substituted, randomness is not
involved any more and the inequality x̄ − ȳ ≤ k = 199.60 is true or false. The endpoints of the interval have
no dimension, like the quotient σX/σY or the multiplicative factor c. (Remember: statistical results depend on:
the assumptions, the methods, the certainty and the data.)
My notes:
Discussion: In this exercise, no supposition should be evaluated. The number 30 plays a role only in
defining the population under study. The Bernoulli model is “the only proper one” to register the presence-
-absence of a condition. Percents must be rewritten in a 0-to-1 scale. Since the default option is that the
proportion has not changed, the equality is allocated in the null hypothesis. On the other hand, proportions are
dimensionless by definition.
Statistic: From a table of statistics (e.g. in [T]), the statistic
$$T(X;\eta)=\frac{\hat{\eta}-\eta}{\sqrt{\frac{?(1-?)}{n}}}\;\xrightarrow{d}\;N(0,1)$$
is selected, where the symbol ? is substituted by the best information available: η or η̂. In testing
hypotheses, it will be used in two forms:
$$T_0(X)=\frac{\hat{\eta}-\eta_0}{\sqrt{\frac{\eta_0(1-\eta_0)}{n}}}\;\xrightarrow{d}\;N(0,1)
\quad\text{and}\quad
T_1(X)=\frac{\hat{\eta}-\eta_1}{\sqrt{\frac{\eta_1(1-\eta_1)}{n}}}\;\xrightarrow{d}\;N(0,1)$$
where the supposed knowledge about the value of η is used in the denominators to estimate the variance (we
do not have this information when T is used to build a confidence interval, like in the next section).
Regardless of the testing methodology to be applied, the evaluation of the statistic is necessary to make the
decision. Since η0 = 0.25,
$$T_0(x)=\frac{\frac{34}{120}-0.25}{\sqrt{\frac{0.25(1-0.25)}{120}}}=0.843$$
Hypotheses: H0: η = η0 = 0.25 and H1: η = η1 > 0.25. The critical region has the form
$$R_c=\{\hat{\eta}>c\}=\left\{\frac{\hat{\eta}-\eta_0}{\sqrt{\frac{\eta_0(1-\eta_0)}{n}}}>\frac{c-\eta_0}{\sqrt{\frac{\eta_0(1-\eta_0)}{n}}}\right\}=\{T_0>a\}$$
Decision: To determine Rc, the quantile is calculated from the type I error with α = 0.1 at η0 = 0.25:
α(0.25) = P(Type I error) = P(Reject H0 | H0 true) = P(T0 > a)
→ a = r_{0.1} = l_{0.9} = 1.28 → Rc = {T0(X) > 1.28}
Now, the decision is: T0(x) = 0.843 < 1.28 → T0(x) ∉ Rc → H0 is not rejected.
p-value: pV = P(T0(X) > T0(x)) = P(T0(X) > 0.843) = 0.20 > 0.1 = α → H0 is not rejected.
Type II error: To calculate β, we have to work under H1. Since the critical region has been expressed in terms
of T0, and we must use T1, we could apply the mathematical trick of adding and subtracting the same quantity.
Nevertheless, this way is useful when the value c in Rc = {η̂ > c} has not been calculated yet; now, since we
have been told that Rc = {η̂ > 0.3}, it is easier to directly standardize with η1:
$$\beta(\eta_1)=P(\text{Type II error})=P(\text{Accept } H_0\mid H_1\text{ true})=P(T_0(X)\notin R_c\mid H_1)=P(\hat{\eta}\leq 0.3\mid H_1)$$
$$=P\left(\frac{\hat{\eta}-\eta_1}{\sqrt{\frac{\eta_1(1-\eta_1)}{n}}}\leq\frac{0.3-\eta_1}{\sqrt{\frac{\eta_1(1-\eta_1)}{n}}}\;\Bigg|\;H_1\right)=P\left(T_1\leq\frac{0.3-\eta_1}{\sqrt{\frac{\eta_1(1-\eta_1)}{n}}}\right)$$
For the particular value η1 = 0.35,
$$\beta(0.35)=P\left(T_1\leq\frac{0.3-0.35}{\sqrt{\frac{0.35(1-0.35)}{120}}}\right)=P(T_1\leq -1.15)=0.125$$
> pnorm(-1.15,0,1)
[1] 0.125
By using a computer, many more values η1 ≠ 0.35 can be considered to plot the power function
$$\phi(\eta)=P(\text{Reject } H_0)=\begin{cases}\alpha(\eta) & \text{if } \eta\in\Theta_0\\ 1-\beta(\eta) & \text{if } \eta\in\Theta_1\end{cases}$$
# Sample and inference
n = 120
alpha = 0.1
theta0 = 0.25 # Value under the null hypothesis H0
c = 0.3
theta1 = seq(from=0.25,to=1,0.01)
paramSpace = sort(unique(c(theta1,theta0)))
PowerFunction = 1 - pnorm((c-paramSpace)/sqrt(paramSpace*(1-paramSpace)/n),0,1)
plot(paramSpace, PowerFunction, xlab='Theta', ylab='Probability of rejecting theta0', main='Power Function', type='l')
Statistic: We use the same statistic
$$T(X;\eta)=\frac{\hat{\eta}-\eta}{\sqrt{\frac{?(1-?)}{n}}}\;\xrightarrow{d}\;N(0,1)$$
where the symbol ? is substituted by the best information available. In testing hypotheses we were also
studying the unknown quantity η, although it was provisionally supposed to be known under the hypotheses;
for confidence intervals, we are not working under any hypothesis and η must be estimated in the
denominator:
$$T(X;\eta)=\frac{\hat{\eta}-\eta}{\sqrt{\frac{\hat{\eta}(1-\hat{\eta})}{n}}}\;\xrightarrow{d}\;N(0,1)$$
̂
The interval is obtained with the same calculations as in previous exercises involving a Bernoulli population,
$$I_{1-\alpha}=\left[\hat{\eta}-r_{\alpha/2}\sqrt{\frac{\hat{\eta}(1-\hat{\eta})}{n}},\;\hat{\eta}+r_{\alpha/2}\sqrt{\frac{\hat{\eta}(1-\hat{\eta})}{n}}\right]$$
where r α / 2 is the value of the standard normal distribution such that P( Z>r α/2 )=α / 2. By using
• n = 120.
• Sample proportion: η̂ = 34/120 = 0.283.
• 90% → 1–α = 0.9 → α = 0.1 → α/2 = 0.05 → r_{0.05} = l_{0.95} = 1.645.
the particular interval (for these data) appears
$$I_{0.9}=\left[0.283-1.645\sqrt{\frac{0.283(1-0.283)}{120}},\;0.283+1.645\sqrt{\frac{0.283(1-0.283)}{120}}\right]=[0.215,\,0.351]$$
Thinking about the interval as an acceptance region, since η0 = 0.25 ∈ I, the hypothesis that η may still be
0.25 is not rejected.
Conclusion: With confidence 90%, the proportion of births by mothers of over 30 years of age seems to be
0.25 at most. The same decision is still made by considering the confidence interval that would correspond to
My notes:
Exercise 4pe-ci-ht
A random quantity X is supposed to follow a distribution whose probability function is, for θ > 0,
$$f(x;\theta)=\begin{cases}\theta x^{\theta-1} & \text{if } 0\leq x\leq 1\\ 0 & \text{otherwise}\end{cases}$$
A) Apply the method of the moments to find an estimator of the parameter θ.
B) Apply the maximum likelihood method to find an estimator of the parameter θ.
C) Use the estimators obtained to build others for the mean μ and the variance σ2.
D) Let X = (X1,...,Xn) be a simple random sample. By applying the results involving Neyman-Pearson's
lemma and the likelihood ratio, study the critical region for the following pairs of hypotheses.
H0: θ = θ0 vs H1: θ = θ1
H0: θ = θ0 vs H1: θ = θ1 > θ0
H0: θ = θ0 vs H1: θ = θ1 < θ0
H0: θ ≤ θ0 vs H1: θ = θ1 > θ0
H0: θ ≥ θ0 vs H1: θ = θ1 < θ0
Hint: Use that E(X) = θ/(θ+1) and E(X2) = θ/(θ+2).
Discussion: This statement is basically mathematical. The random variable X is dimensionless. (This
probability distribution, with standard power function density, is a particular case of the Beta distribution.)
Note: If E(X) had not been given in the statement, it could have been calculated by integrating:
$$E(X)=\int_{-\infty}^{+\infty}x\,f(x;\theta)\,dx=\int_{0}^{1}x\,\theta x^{\theta-1}\,dx=\theta\int_{0}^{1}x^{\theta}\,dx=\theta\left[\frac{x^{\theta+1}}{\theta+1}\right]_{0}^{1}=\frac{\theta}{\theta+1}$$
Besides, E(X²) could have been calculated as follows:
$$E(X^2)=\int_{-\infty}^{+\infty}x^2\,f(x;\theta)\,dx=\int_{0}^{1}x^2\,\theta x^{\theta-1}\,dx=\theta\int_{0}^{1}x^{\theta+1}\,dx=\theta\left[\frac{x^{\theta+2}}{\theta+2}\right]_{0}^{1}=\frac{\theta}{\theta+2}$$
Now,
$$\mu=E(X)=\frac{\theta}{\theta+1}
\quad\text{and}\quad
\sigma^2=Var(X)=E(X^2)-E(X)^2=\frac{\theta}{\theta+2}-\left(\frac{\theta}{\theta+1}\right)^{2}=\frac{\theta}{(\theta+2)(\theta+1)^{2}}$$
A) Method of the moments
a1) Population and sample moments: There is only one parameter—one equation is needed. The first-order
moments of the model X and the sample x are, respectively,
$$\mu_1(\theta)=E(X)=\frac{\theta}{\theta+1}
\quad\text{and}\quad
m_1(x_1,x_2,...,x_n)=\frac{1}{n}\sum_{j=1}^{n}x_j=\bar{x}$$
a2) System of equations: Since the parameter of interest θ appears in the first-order moment of X, the first
equation suffices:
$$\mu_1(\theta)=m_1(x_1,x_2,...,x_n)\;\rightarrow\;\frac{\theta}{\theta+1}=\frac{1}{n}\sum_{j=1}^{n}x_j=\bar{x}\;\rightarrow\;\theta=\theta\bar{x}+\bar{x}\;\rightarrow\;\theta=\frac{\bar{x}}{1-\bar{x}}$$
B) Maximum likelihood method
b1) Likelihood function: For this probability distribution, the density function is f(x;θ) = θx^{θ−1}, so
$$L(x_1,x_2,...,x_n;\theta)=\prod_{j=1}^{n}f(x_j;\theta)=\prod_{j=1}^{n}\theta x_j^{\theta-1}=\theta^{n}\left(\prod_{j=1}^{n}x_j\right)^{\theta-1}$$
b2) Optimization problem: The logarithm function is applied to make calculations easier
$$\log[L(x_1,x_2,...,x_n;\theta)]=n\log(\theta)+(\theta-1)\log\left(\prod_{j=1}^{n}x_j\right)$$
To find the local or relative extreme values, the necessary condition is:
$$0=\frac{d}{d\theta}\log[L(x_1,x_2,...,x_n;\theta)]=\frac{n}{\theta}+\log\left(\prod_{j=1}^{n}x_j\right)
\;\rightarrow\;
\theta_0=-\frac{n}{\log\left(\prod_{j=1}^{n}x_j\right)}$$
To verify that the only candidate is a (local) maximum, the sufficient condition is:
$$\frac{d^2}{d\theta^2}\log[L(x_1,x_2,...,x_n;\theta)]=\frac{d}{d\theta}\left[\frac{n}{\theta}+\log\left(\prod_{j=1}^{n}x_j\right)\right]=-\frac{n}{\theta^{2}}<0$$
The second derivative is always negative, also at the value θ0.
C) Estimation of μ and σ²
c1) For the mean: By applying the plug-in principle,
From the method of the moments:
$$\hat{\mu}_M=\frac{\hat{\theta}_M}{\hat{\theta}_M+1}=\frac{\frac{\bar{X}}{1-\bar{X}}}{\frac{\bar{X}}{1-\bar{X}}+1}=\frac{\frac{\bar{X}}{1-\bar{X}}}{\frac{\bar{X}+1-\bar{X}}{1-\bar{X}}}=\bar{X}$$
From the maximum likelihood method:
$$\hat{\mu}_{ML}=\frac{\hat{\theta}_{ML}}{\hat{\theta}_{ML}+1}=\frac{\frac{-n}{\log\left(\prod_{j=1}^{n}X_j\right)}}{\frac{-n}{\log\left(\prod_{j=1}^{n}X_j\right)}+1}=\frac{n}{n-\log\left(\prod_{j=1}^{n}X_j\right)}$$
c2) For the variance: Instead of substituting in the large expression of σ2, we use functional notation
From the method of the moments: σ^ 2M =σ 2 ( θ^ M ) , with σ2 (θ) and θ^ M given above.
From the maximum likelihood method: σ^ 2ML =σ2 ( θ^ ML ), with σ2 (θ) and θ^ ML given above.
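Since θx^(θ−1) on [0,1] is the Beta(θ,1) density, both estimators can be checked quickly by simulation; θ = 2 and n = 1000 are illustrative values:
set.seed(1)
theta = 2; n = 1000                  # illustrative values
x = rbeta(n, theta, 1)               # theta*x^(theta-1) is the Beta(theta,1) density
thetaM  = mean(x) / (1 - mean(x))    # method of moments
thetaML = -n / sum(log(x))           # maximum likelihood
c(thetaM, thetaML)                   # both should be close to theta = 2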
D) Critical region
$$L(X;\theta)=\theta^{n}\left(\prod_{j=1}^{n}X_j\right)^{\theta-1}
\quad\text{and}\quad
\Lambda(X;\theta_0,\theta_1)=\frac{L(X;\theta_0)}{L(X;\theta_1)}=\left(\frac{\theta_0}{\theta_1}\right)^{n}\left(\prod_{j=1}^{n}X_j\right)^{\theta_0-\theta_1}$$
Then, the critical or rejection region is
$$R_c=\{\Lambda<k\}=\left\{\left(\frac{\theta_0}{\theta_1}\right)^{n}\left(\prod_{j=1}^{n}X_j\right)^{\theta_0-\theta_1}<k\right\}
=\left\{n\log\left(\frac{\theta_0}{\theta_1}\right)+(\theta_0-\theta_1)\log\left(\prod_{j=1}^{n}X_j\right)<\log(k)\right\}$$
$$=\left\{(\theta_0-\theta_1)\log\left(\prod_{j=1}^{n}X_j\right)<\log(k)-n\log\left(\frac{\theta_0}{\theta_1}\right)\right\}$$
Since each Xj takes values in (0,1), log(∏Xj) < 0, and in terms of $\hat{\theta}_{ML}=-n/\log\left(\prod_{j=1}^{n}X_j\right)$,
$$=\left\{\frac{1}{\theta_0-\theta_1}\cdot\frac{-n}{\log\left(\prod_{j=1}^{n}X_j\right)}<\frac{-n}{\log(k)-n\log\left(\frac{\theta_0}{\theta_1}\right)}\right\}
=\left\{\hat{\theta}_{ML}\cdot\frac{1}{\theta_0-\theta_1}<\frac{-n}{\log(k)-n\log\left(\frac{\theta_0}{\theta_1}\right)}\right\}$$
Now it is necessary that θ1 ≠ θ0 and
• if θ1 < θ0 then (θ0−θ1) > 0 and hence
$$R_c=\left\{\hat{\theta}_{ML}<\frac{-n(\theta_0-\theta_1)}{\log(k)-n\log\left(\frac{\theta_0}{\theta_1}\right)}\right\}$$
• if θ1 > θ0 then (θ0−θ1) < 0 and hence
$$R_c=\left\{\hat{\theta}_{ML}>\frac{-n(\theta_0-\theta_1)}{\log(k)-n\log\left(\frac{\theta_0}{\theta_1}\right)}\right\}$$
Hypothesis tests
H0: θ = θ0 vs H1: θ = θ1 > θ0,  and  H0: θ = θ0 vs H1: θ = θ1 < θ0
In applying the methodologies, the same critical value c will be obtained for any θ1 since it only depends upon
θ0 through θ̂_ML: α = P(Type I error) = P(θ̂_ML < c) or α = P(Type I error) = P(θ̂_ML > c). This implies that
the uniformly most powerful test has been found.
Hypothesis tests
H0: θ ≤ θ0 vs H1: θ = θ1 > θ0,  and  H0: θ ≥ θ0 vs H1: θ = θ1 < θ0
A uniformly most powerful test for H 0 : θ = θ0 is also uniformly most powerful for H 0 : θ ≤ θ0 .
Conclusion: For the probability distribution determined by the function given, two methods of point
estimation have been applied. In this case, the two methods provide different estimators. By applying the
plug-in principle, estimators of the mean and the variance have also been obtained. The form of the critical
region has been studied by applying Neyman-Pearson's lemma and the likelihood ratio.
Additional Exercises
Exercise 1ae
Assume that the height (in centimeters, cm) of any student of a group follows a normal distribution with
variance 55cm2. If a simple random sample of 25 students is considered, calculate the probability that the
sample quasivariance will be bigger than 64.625cm2.
Discussion: In this exercise, the supposition that the normal distribution reasonably explains the variable
height should be evaluated by using proper statistical techniques.
Identification of the variable and selection of the statistic: The variable is the height, the
population distribution is normal, the sample size is 25, and we are asked for the probability of an event
expressed in terms of one of the usual statistics: P (S 2 > 64.625).
Search for a known distribution: Since we do not know the sampling distribution of S2, we cannot
calculate this probability directly. Instead, just after reading 'sample quasivariance' we should think about the
following theoretical result
$$T=\frac{(n-1)S^2}{\sigma^2}\sim\chi^{2}_{n-1},
\quad\text{or, in this case,}\quad
T=\frac{(25-1)S^2}{55cm^2}\sim\chi^{2}_{25-1}$$
Rewriting the event: The event has to be rewritten by completing some terms until (the dimensionless
statistic) T appears. Additionally, since the table of the χ² distribution gives lower-tail probabilities P(X ≤ x), it
is necessary to consider the complementary event:
$$P(S^2>64.625)=P\left(\frac{(25-1)S^2}{55cm^2}>\frac{(25-1)\,64.625cm^2}{55cm^2}\right)=P(T>28.2)=1-P(T\leq 28.2)=1-0.75=0.25$$
In these calculations, one property of the transformations has been applied: multiplying or dividing by a
positive quantity does not modify an inequality.
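The same probability in R:
1 - pchisq((25 - 1) * 64.625 / 55, 25 - 1)   # = 0.25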
Conclusion: The probability of the event is 0.25. This means that S2 will sometimes take a value bigger than
64.625cm2, when evaluated at specific data x coming from the population distribution.
My notes:
Exercise 2ae
Let X be a random variable with probability function
$$f(x;\theta)=\frac{\theta x^{\theta-1}}{3^{\theta}},\quad x\in[0,3]$$
Discussion: This statement is mathematical. Although it is given, the expectation of X could be calculated as
follows:
$$\mu_1(\theta)=E(X)=\int_{-\infty}^{+\infty}x\,f(x;\theta)\,dx=\int_{0}^{3}x\,\frac{\theta x^{\theta-1}}{3^{\theta}}\,dx=\frac{\theta}{3^{\theta}}\left[\frac{x^{\theta+1}}{\theta+1}\right]_{0}^{3}=\frac{\theta}{3^{\theta}}\,\frac{3^{\theta+1}}{\theta+1}=\frac{3\theta}{\theta+1}$$
Method of the moments
System of equations: Since the parameter θ appears in the first-order moment of X, the first equation is
sufficient to apply the method:
$$\mu_1(\theta)=m_1(x_1,x_2,...,x_n)\;\rightarrow\;\frac{3\theta}{\theta+1}=\frac{1}{n}\sum_{j=1}^{n}x_j=\bar{x}\;\rightarrow\;3\theta=\theta\bar{x}+\bar{x}\;\rightarrow\;\theta(3-\bar{x})=\bar{x}\;\rightarrow\;\theta=\frac{\bar{x}}{3-\bar{x}}$$
The estimator:
$$\hat{\theta}_M=\frac{\bar{X}}{3-\bar{X}}$$
My notes:
Exercise 3ae
A poll of 1000 individuals, being a simple random sample, over the age of 65 years was taken to determine
the percent of the population in this age group who had an Internet connection. It was found that 387 of the
1000 had one. Find a 95% confidence interval for η.
(Taken from an exercise of Statistics, Spiegel, M.R., and L.J. Stephens, McGraw-Hill)
Discussion: Asymptotic results can be applied for this large sample of a Bernoulli population. The cutoff
age value determines the population of the statistical analysis, but it plays no other role. Both η and η^ are
dimensionless.
Identification of the variable: Having the connection or not is a dichotomic situation; then
X ≡ Connected (an individual)? X ~ Bern(η)
$$T(X;\eta)=\frac{\hat{\eta}-\eta}{\sqrt{\frac{\hat{\eta}(1-\hat{\eta})}{n}}}\;\xrightarrow{d}\;N(0,1)$$
$$1-\alpha=P(l_{\alpha/2}\leq T(X;\eta)\leq r_{\alpha/2})=P\left(-r_{\alpha/2}\leq\frac{\hat{\eta}-\eta}{\sqrt{\frac{\hat{\eta}(1-\hat{\eta})}{n}}}\leq +r_{\alpha/2}\right)$$
$$=P\left(-r_{\alpha/2}\sqrt{\frac{\hat{\eta}(1-\hat{\eta})}{n}}\leq\hat{\eta}-\eta\leq +r_{\alpha/2}\sqrt{\frac{\hat{\eta}(1-\hat{\eta})}{n}}\right)
=P\left(-\hat{\eta}-r_{\alpha/2}\sqrt{\frac{\hat{\eta}(1-\hat{\eta})}{n}}\leq -\eta\leq -\hat{\eta}+r_{\alpha/2}\sqrt{\frac{\hat{\eta}(1-\hat{\eta})}{n}}\right)$$
$$=P\left(\hat{\eta}+r_{\alpha/2}\sqrt{\frac{\hat{\eta}(1-\hat{\eta})}{n}}\geq\eta\geq\hat{\eta}-r_{\alpha/2}\sqrt{\frac{\hat{\eta}(1-\hat{\eta})}{n}}\right)$$
(3) The interval: Then,
$$I_{1-\alpha}=\left[\hat{\eta}-r_{\alpha/2}\sqrt{\frac{\hat{\eta}(1-\hat{\eta})}{n}},\;\hat{\eta}+r_{\alpha/2}\sqrt{\frac{\hat{\eta}(1-\hat{\eta})}{n}}\right]$$
where r α / 2 is the value of the standard normal distribution verifying P( Z> r α /2 )=α /2.
$$I_{0.95}=\left[0.387-1.96\sqrt{\frac{0.387(1-0.387)}{1000}},\;0.387+1.96\sqrt{\frac{0.387(1-0.387)}{1000}}\right]=[0.357,\,0.417]$$
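The same interval in R:
etaHat = 387/1000
etaHat + c(-1, 1) * qnorm(0.975) * sqrt(etaHat*(1 - etaHat)/1000)   # [0.357, 0.417]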
Conclusion: The unknown proportion of individuals over the age of 65 years with an Internet connection is
inside the range [0.357, 0.417] with confidence 0.95, and outside the interval with confidence 0.05.
Perhaps a 0-to-100 scale facilitates the interpretation: the percent of individuals is in [35.7%, 41.7%] with
95% confidence. Proportions and probabilities are always dimensionless quantities, even when expressed as
percents.
My notes:
Exercise 4ae
A company is interested in studying its clients' behaviour. For this purpose, the mean time between
consecutive demands of service is modeled by a random variable whose density function is:
$$f(x;\theta)=\frac{1}{\theta}\,e^{-\frac{x-2}{\theta}},\quad x\geq 2,\;(\theta>0)$$
Discussion: The two sections are based on the calculation of the mean and the variance of the estimator
given in the statement. Then, the formulas of the bias and the mean square error must be used. Finally, the
limit of the mean square error is studied.
Unbiasedness: The estimator is unbiased, as the expression of the mean shows. Alternatively, we calculate
the bias
b ( θ^ M )= E(θ^ M )−θ=θ−θ=0
Conclusion: The calculations of the mean and the variance are quite easy. They show that the estimator is
unbiased and, if the variance is finite, consistent.
Advanced Theory: If E(X) had not been given in the statement, it could have been calculated by
applying integration by parts (since polynomials and exponentials are functions “of different type”):
$$E(X)=\int_{-\infty}^{+\infty}x\,f(x;\theta)\,dx=\int_{2}^{\infty}x\,\frac{1}{\theta}e^{-\frac{x-2}{\theta}}\,dx=\left[-x\,e^{-\frac{x-2}{\theta}}\right]_{2}^{\infty}-\int_{2}^{\infty}1\cdot\left(-e^{-\frac{x-2}{\theta}}\right)dx$$
$$=\left[-x\,e^{-\frac{x-2}{\theta}}-\theta\,e^{-\frac{x-2}{\theta}}\right]_{2}^{\infty}=\left[-(x+\theta)\,e^{-\frac{x-2}{\theta}}\right]_{2}^{\infty}=2+\theta.$$
That $\int u(x)\,v'(x)\,dx=u(x)\,v(x)-\int u'(x)\,v(x)\,dx$ has been used with
• u = x → u' = 1
• $v'=\frac{1}{\theta}e^{-\frac{x-2}{\theta}}\;\rightarrow\;v=\int\frac{1}{\theta}e^{-\frac{x-2}{\theta}}\,dx=-e^{-\frac{x-2}{\theta}}$
On the other hand, e^x changes faster than x^k for any k. To calculate E(X²):
$$E(X^2)=\int_{2}^{\infty}x^{2}\,\frac{1}{\theta}e^{-\frac{x-2}{\theta}}\,dx=\left[-x^{2}e^{-\frac{x-2}{\theta}}\right]_{2}^{\infty}+2\theta\int_{2}^{\infty}x\,\frac{1}{\theta}e^{-\frac{x-2}{\theta}}\,dx=(2^{2}-0)+2\theta\mu=4+2\theta(2+\theta)=2\theta^{2}+4\theta+4.$$
Again, integration by parts has been applied: $\int u(x)\,v'(x)\,dx=u(x)\,v(x)-\int u'(x)\,v(x)\,dx$ with
• u = x² → u' = 2x
• $v'=\frac{1}{\theta}e^{-\frac{x-2}{\theta}}\;\rightarrow\;v=\int\frac{1}{\theta}e^{-\frac{x-2}{\theta}}\,dx=-e^{-\frac{x-2}{\theta}}$
My notes:
Exercise 5ae
Is There Intelligent Life on Other Planets? In a 1997 Marist Institute survey of 935 randomly selected
Americans, 60% of the sample answered “yes” to the question “Do you think there is intelligent life on other
planets?” (http://maristpoll.marist.edu/tag/mipo/). Let's use this sample estimate to calculate a 90%
confidence interval for the proportion of all Americans who believe there is intelligent life on other planets.
What are the margin of error and the length of the interval?
(From Mind on Statistics. Utts, J.M., and R.F. Heckard. Thomson)
LINGUISTIC NOTE (From: Common Errors in English Usage. Brians, P. William, James & Co.)
American. Many Canadians and Latin Americans are understandably irritated when U.S. citizens refer to themselves simply as
“Americans.” Canadians (and only Canadians) use the term “North American” to include themselves in a two-member group with their
neighbor to the south, though geographers usually include Mexico in North America. When addressing an international audience
composed largely of people from the Americas, it is wise to consider their sensitivities.
However, it is pointless to try to ban this usage in all contexts. Outside of the Americas, “American” is universally understood to refer
to things relating to the U.S. There is no good substitute. Brazilians, Argentineans, and Canadians all have unique terms to refer to
themselves. None of them refer routinely to themselves as “Americans” outside of contexts like the “Organization of American States.”
Frank Lloyd Wright promoted “Usonian,” but it never caught on. For better or worse, “American” is standard English for “citizen or
resident of the United States of America.”
Discussion: There are several complementary pieces of information in the statement that help us to identify
the distribution of the population variable X (Bernoulli distribution) and select the proper statistic T:
(a) The meaning of the question—for each item there are two possible values: “yes” or “no”.
(b) The value 60% suggests that this is a proportion expressed in percent.
(c) The words Let's use this sample estimate and confidence interval for the proportion.
Thus, we must construct a confidence interval for the proportion η (a percent is a proportion expressed in a
0-to-100 scale) of one Bernoulli population. The sample information available consists of two data: the sample
size n = 935 and the sample proportion η̂ = 0.6. The relation between these quantities is the following:
$$\hat{\eta}=\frac{1}{n}\sum_{j=1}^{n}X_j=\frac{\#\,1\text{'s}}{n}\left(=\frac{\#\text{ Yeses}}{n}\right).$$
Confidence interval
For this kind of population and amount of data, we use the statistic:
$$T(X;\eta)=\frac{\hat{\eta}-\eta}{\sqrt{\frac{?(1-?)}{n}}}\;\xrightarrow{d}\;N(0,1)$$
where ? is substituted by η or η̂. For confidence intervals η is unknown and no value is supposed, and
hence it is estimated through the sample proportion. By applying the method of the pivot:
$$1-\alpha=P(l_{\alpha/2}\leq T(X;\eta)\leq r_{\alpha/2})=P\left(-r_{\alpha/2}\leq\frac{\hat{\eta}-\eta}{\sqrt{\frac{\hat{\eta}(1-\hat{\eta})}{n}}}\leq +r_{\alpha/2}\right)$$
$$=P\left(-r_{\alpha/2}\sqrt{\frac{\hat{\eta}(1-\hat{\eta})}{n}}\leq\hat{\eta}-\eta\leq +r_{\alpha/2}\sqrt{\frac{\hat{\eta}(1-\hat{\eta})}{n}}\right)
=P\left(-\hat{\eta}-r_{\alpha/2}\sqrt{\frac{\hat{\eta}(1-\hat{\eta})}{n}}\leq -\eta\leq -\hat{\eta}+r_{\alpha/2}\sqrt{\frac{\hat{\eta}(1-\hat{\eta})}{n}}\right)$$
$$=P\left(\hat{\eta}+r_{\alpha/2}\sqrt{\frac{\hat{\eta}(1-\hat{\eta})}{n}}\geq\eta\geq\hat{\eta}-r_{\alpha/2}\sqrt{\frac{\hat{\eta}(1-\hat{\eta})}{n}}\right)$$
$$I_{1-\alpha}=\left[\hat{\eta}-r_{\alpha/2}\sqrt{\frac{\hat{\eta}(1-\hat{\eta})}{n}},\;\hat{\eta}+r_{\alpha/2}\sqrt{\frac{\hat{\eta}(1-\hat{\eta})}{n}}\right]$$
Substitution: We calculate the quantities in the formula,
• n = 935
• η=0.6
^
• 90% → 1–α = 0.90 → α = 0.10 → α/2 = 0.05 → r α /2=r 0.05=l 0.95=1.645
So
$$I_{0.9}=\left[0.6-1.645\sqrt{\frac{0.6(1-0.6)}{935}},\;0.6+1.645\sqrt{\frac{0.6(1-0.6)}{935}}\right]=[0.574,\,0.626]$$
The margin of error is
$$E=r_{\alpha/2}\sqrt{\frac{\hat{\eta}(1-\hat{\eta})}{n}}=1.645\sqrt{\frac{0.6(1-0.6)}{935}}=0.0264$$
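In R, the margin of error and the length of the interval (twice the margin of error):
E = qnorm(0.95) * sqrt(0.6 * (1 - 0.6) / 935)   # margin of error, approx. 0.0264
2 * E                                            # length of the interval, approx. 0.053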
Conclusion: Since the population proportion is in the interval (0,1) by definition, the values obtained seem
reasonable. Both endpoints are over 0.5, which means that most US citizens think there is intelligent life on
other planets. With a confidence of 0.90, measured in a 0-to-1 scale, the value of η will be in the interval
obtained. As regards the methodology applied, on average it provides a right interval 90% of the times.
Nonetheless, frequently we do not know the real η and therefore we will never know if the method has failed
or not.
My notes:
Exercise 6ae
It is desired to know the proportion η of female students at university. To that end, a simple random sample of
n students is to be gathered. Obtain the estimators η^ M and η^ ML for that proportion, by applying the
method of the moments and the maximum likelihood method.
Discussion: This statement is mathematical, really. Although it is given in the statement, the expectation of
X could be calculated as follows:
$$\mu_1(\eta)=E(X)=\sum_{\Omega}x\,f(x;\eta)=\sum_{x=0}^{1}x\,\eta^{x}(1-\eta)^{1-x}=0\cdot 1\cdot(1-\eta)+1\cdot\eta\cdot 1=\eta$$
Population and sample centered moments: The probability distribution has one parameter. The first-order
moments are
1 n
μ1 (η)=E ( X )=η and m1 ( x 1 , x 2 ,... , x n )= ∑ j =1 x j= x̄
n
System of equations: Since the parameter η appears in the first-order moment of X, the first equation is
sufficient to apply the method:
$$\mu_1(\eta)=m_1(x_1,x_2,...,x_n)\;\rightarrow\;\eta=\frac{1}{n}\sum_{j=1}^{n}x_j=\bar{x}$$
The estimator:
η^ M = X̄
Likelihood function: For this distribution the mass function is f(x;η) = η^x(1−η)^{1−x}, so
$$L(x_1,x_2,...,x_n;\eta)=\prod_{j=1}^{n}f(x_j;\eta)=\eta^{x_1}(1-\eta)^{1-x_1}\cdots\eta^{x_n}(1-\eta)^{1-x_n}=\eta^{\sum_{j=1}^{n}x_j}(1-\eta)^{\,n-\sum_{j=1}^{n}x_j}$$
since 1 ≥ xj and therefore $n\geq\sum_{j=1}^{n}x_j\;\leftrightarrow\;n-\sum_{j=1}^{n}x_j\geq 0$. This holds for any value, including η0.
The estimator:
η^ ML = X̄
My notes:
Discussion: The distribution considered has two parameters, though one of them is known.
Method of the moments
System of equations: Since the parameter of interest λ2 appears in the first-order population moment of X, the
first equation is enough to apply the method:
$$\mu_1(\lambda_2)=m_1(x_1,x_2,...,x_n)\;\rightarrow\;\frac{\lambda_2}{2}=\frac{1}{n}\sum_{j=1}^{n}x_j=\bar{x}\;\rightarrow\;\lambda_2=2\bar{x}$$
The estimator:
$$\hat{\lambda}_2=2\bar{X}$$
Conclusion: To estimate the parameter λ2, the method of the moments suggests twice the sample mean.
My notes:
Exercise 8ae
Plastic sheets produced by a machine are constantly monitored for possible fluctuations in thickness
(measured in millimeters, mm). If the true variance in thicknesses exceeds 2.25 square millimeters, there is
cause for concern about product quality. The production process continues while the variance seems smaller
than the cutoff. Thickness measurements for a simple random sample of 10 sheets produced in a particular
shift were taken, giving the following results:
(226, 226, 227, 226, 225, 228, 225, 226, 229, 227)
Test, at the 5% significance level, the hypothesis that the population variance is smaller than 2.25mm².
Suppose that thickness is normally distributed. Calculate the type II error β(2), find the general expression of
β(σ2) and plot the power function.
(Based on an exercise of: Statistics for Business and Economics, Newbold, P., W.L. Carlson and B.M. Thorne, Pearson.)
Discussion: The supposition of normality should be evaluated. This statistical problem requires us to study
the variance of a normal population; concretely, to apply a hypothesis test to see whether or not the
value considered as reasonable has been exceeded. For large samples, we are usually given some quantities already
calculated; here we are given the crude data, from which we can calculate any quantity. The hypothesis is
allocated at H1 for the production process to continue only when high quality sheets are being made (and for
the equality to be in H0).
The statistic and its value: for the variance of a normal population, under H₀,
T₀(X) = (n−1)S²/σ₀² ∼ χ²_{n−1} and T₀(x) = (10−1)·1.61 mm²/2.25 mm² = 6.44
where s² = 1.61 mm² is the sample quasivariance of the data.
Hypotheses and form of the critical region: H₀: σ² ≥ σ₀² = 2.25 and H₁: σ² < 2.25. Small values of the statistic support H₁, so the critical region has the form Rc = {T₀(x) ≤ χ²_{α;n−1}} = {T₀(x) ≤ 3.33}, where 3.33 = qchisq(0.05, 10-1).
Decision: Finally, it is necessary to check whether this region "suggested by H₀" is compatible with the value that
the data provide for the statistic. If they are not compatible, because the value seems extreme when the
hypothesis is true, we will trust the data and reject the hypothesis H₀.
Since T 0 ( x)=6.44 > 3.33 → T 0 ( x)∉Rc → H0 is not rejected.
The second methodology is based on the calculation of the p-value:
pV = P(X more rejecting than x | H₀ true) = P(T₀(X) < T₀(x)) = P(T₀ < 6.44) = 0.305
> pchisq(6.44, 10-1)
[1] 0.3047995
→ pV = 0.305 > 0.05 = α → H₀ is not rejected.
Type II error and power function: To calculate β, we have to work under H₁, that is, with T₁(X) = (n−1)S²/σ₁² ∼ χ²_{n−1}. Since the
critical region is expressed in terms of T₀, the mathematical trick of multiplying and dividing by the same quantity
is applied:
β(σ₁²) = P(T₀(X) ≥ 3.33 | H₁) = P((n−1)S²/σ₀² ≥ 3.33 | H₁) = P((n−1)S²/σ₁² ≥ 3.33·σ₀²/σ₁² | H₁) = P(T₁(X) ≥ 3.33·σ₀²/σ₁²)
For the particular value σ₁² = 2,
β(2) = P(T₁(X) ≥ 3.33·2.25/2) = P(T₁(X) ≥ 3.75) = 0.927
> 1 - pchisq(3.75, 10-1)
[1] 0.9270832
By using a computer, many other values σ₁² ≠ 2 can be considered so as to numerically determine the power of
the test 1−β(σ₁²) and to plot the power function
ϕ(σ²) = P(Reject H₀) = α(σ²) if σ² ∈ Θ₀ ; 1−β(σ²) if σ² ∈ Θ₁
# Sample and inference
n = 10
alpha = 0.05
theta0 = 2.25                                # value under the null hypothesis H0
q = qchisq(alpha, n-1)                       # critical value: reject H0 when T0 <= q
theta1 = seq(from=0, to=2.25, by=0.01)       # values of sigma^2 under H1
paramSpace = sort(unique(c(theta1, theta0)))
PowerFunction = pchisq(q*theta0/paramSpace, n-1)   # P(reject H0 | sigma^2)
plot(paramSpace, PowerFunction, xlab='Theta', ylab='Probability of rejecting theta0', main='Power Function', type='l')
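The value β(2) = 0.927 can also be checked by simulation (a sketch: the seed, the number of replications and the population mean are arbitrary, since the statistic does not depend on μ):
set.seed(1)
B = 10000
noRejections = replicate(B, {
  x = rnorm(10, mean=226.5, sd=sqrt(2))   # samples generated under H1 with sigma^2 = 2
  T0 = (10-1)*var(x)/2.25                 # var() is the sample quasivariance in R
  T0 > qchisq(0.05, 10-1)                 # TRUE when H0 is not rejected
})
mean(noRejections)                        # close to beta(2) = 0.927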
Conclusion: Since H₀ is not rejected at the 5% significance level, the data do not allow us to claim that the real
value of σ² is smaller than 2.25 mm²; that is, there is still cause for concern about the quality of the product. On
average, when H₀ is true the methodology wrongly rejects it only 5% of the time; however, since frequently we
do not know the true value of σ², we never know whether the decision is right or not.
My notes:
Exercise 9ae
If 132 of 200 male voters and 90 of 159 female voters favor a certain candidate running for governor, build a 99% confidence interval for the difference ηM − ηF between the proportions of male and female voters who favor the candidate.
Discussion: There are two independent Bernoulli populations whose proportions must be compared
(the populations would not be independent if, for example, males and females had been selected from the
same couples or families). The value 1 has been used to count the number of voters who favor the candidate.
The method of the pivot will be used.
The (approximate) pivot involves the standard error √(η̂M(1−η̂M)/nM + η̂F(1−η̂F)/nF), and it leads to the interval
I₁₋α = [ (η̂M − η̂F) − r_{α/2}·√(η̂M(1−η̂M)/nM + η̂F(1−η̂F)/nF) , (η̂M − η̂F) + r_{α/2}·√(η̂M(1−η̂M)/nM + η̂F(1−η̂F)/nF) ]
where r_{α/2} is the value of the standard normal distribution such that P(Z > r_{α/2}) = α/2.
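With the data of the statement, the interval can be computed in R (a minimal sketch of the calculation above):
nM = 200; nF = 159
etaM = 132/nM; etaF = 90/nF                      # sample proportions
r = qnorm(1 - 0.01/2)                            # r_{alpha/2} for 99% confidence
se = sqrt(etaM*(1-etaM)/nM + etaF*(1-etaF)/nF)   # standard error of the difference
(etaM - etaF) + c(-1, 1)*r*se                    # approximately (-0.03906, 0.22698)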
Conclusion: The case ηM = ηF cannot formally be excluded when the decision is made with 99%
confidence. Since each proportion lies in (0,1), the difference ηM − ηF lies in (−1,1), and any "reasonable"
estimate of it should be in this range; because of the natural uncertainty of the sampling process (randomness
and variability), in this case the smallest endpoint of the interval was −0.03906, so the value 0 cannot be
discarded. When an interval of high confidence is far from 0, the case ηM = ηF can clearly be rejected.
Finally, it is important to notice that a confidence interval can be used to make decisions about hypotheses
on the parameter values.
My notes:
Exercise 10ae
For two Bernoulli populations with the same parameter, prove that the pooled sample proportion is an
unbiased estimator of the population proportion. For two normal populations, prove that the pooled sample
variance is an unbiased estimator of the population variance.
Discussion: It is necessary to calculate the expectation of the pooled sample proportion by using its
expression and the basic properties of the mean. Alternatively, the most general pooled sample variance can be
used. For Bernoulli populations, the mean and the variance can be written as μ = η and σ² = η(1−η).
Mean of η̂p: This estimator can be used when ηX = η = ηY. On the other hand, E(η̂) = E(X̄) = η.
E(η̂p) = E( (nX·η̂X + nY·η̂Y)/(nX + nY) ) = (1/(nX+nY))·[nX·E(η̂X) + nY·E(η̂Y)] = (1/(nX+nY))·(nX + nY)·η = η
Then, the bias is b(η̂p) = E(η̂p) − η = η − η = 0.
Mean of S²p: This estimator can be used when σX² = σ² = σY². On the other hand, E(S²) = σ².
E(S²p) = E( ((nX−1)·SX² + (nY−1)·SY²)/(nX + nY − 2) ) = ((nX−1)·E(SX²) + (nY−1)·E(SY²))/(nX + nY − 2) = ((nX − 1 + nY − 1)/(nX + nY − 2))·σ² = σ²
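Both results can also be illustrated by simulation; the following sketch checks the unbiasedness of the pooled sample variance (all the numerical values are arbitrary):
set.seed(1)
nX = 5; nY = 8; sigma2 = 4
Sp2 = replicate(20000, {
  x = rnorm(nX, 0, sqrt(sigma2)); y = rnorm(nY, 10, sqrt(sigma2))
  ((nX-1)*var(x) + (nY-1)*var(y))/(nX + nY - 2)   # pooled sample variance
})
mean(Sp2)                                         # close to sigma2 = 4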
My notes:
Discussion: In calculating the minimum sample size, the only case we consider (in our subject) is that of
one normal population with known standard deviation. Thus, we can suppose that this is the distribution of X.
Sample information:
Theoretical (simple random) sample: X1,..., Xn s.r.s. (the time measurement of n rotations will be considered)
Margin of error:
We need the expression of the margin of error. If we do not remember it, we can apply the method of the pivot
to obtain it from the formula of the interval
I₁₋α = [ X̄ − r_{α/2}·√(σ²/n) , X̄ + r_{α/2}·√(σ²/n) ]
Either way, the margin of error (for one normal population with known variance) is
E = r_{α/2}·√(σ²/n)
Sample size
Method based on the confidence interval: We want the margin of error E to be smaller than or equal to the
given Eg,
Eg ≥ E = r_{α/2}·√(σ²/n) → Eg² ≥ r²_{α/2}·σ²/n → n ≥ (r_{α/2}·σ/Eg)² = (1.96 · 1.6 min / 0.50 min)² = 6.272² = 39.3 → n ≥ 40
since r_{α/2} = r_{0.05/2} = r_{0.025} = l_{0.975} = 1.96. (The inequality changes neither when multiplying or dividing
by positive quantities nor when squaring, since both sides are positive.)
Conclusion: At least n = 40 data are necessary to guarantee that the margin of error is 0.50 min at most. Any
number of data larger than n would guarantee—and go beyond—the precision desired. (This margin can be
thought of as "the maximum error in probability", in the sense that the distance or error |θ̂ − θ| will be
smaller than E with a probability of 1−α = 0.95, but larger with a probability of α = 0.05.)
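The calculation can be reproduced in R:
ceiling((qnorm(0.975)*1.6/0.50)^2)   # 40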
My notes:
Exercise 12ae
To estimate the average tree height of a forest, a simple random sample with 20 elements is considered, giving a sample mean of 14.70 u and a sample quasistandard deviation of 6.34 u. Build the 95% confidence interval for the mean height and find the margin of error.
Discussion: In this exercise, the supposition that the normal distribution reasonably explains the variable
height should be evaluated by using proper statistical techniques. To build the interval and find the margin of
error, the method of the pivotal quantity will be applied.
To apply this method, we need a statistic with known distribution, easy to manage and involving μ. From a
table of statistics (e.g. in [T]), we select
T(X; μ) = (X̄ − μ)/√(S²/n) ∼ t_{n−1}
where X = (X₁, X₂, …, Xₙ) is a simple random sample, S² is the sample quasivariance and t_κ denotes the t
distribution with κ degrees of freedom.
1−α = P(l_{α/2} ≤ T(X; μ) ≤ r_{α/2}) = P( −r_{α/2} ≤ (X̄−μ)/√(S²/n) ≤ +r_{α/2} )
= P( −r_{α/2}·√(S²/n) ≤ X̄−μ ≤ +r_{α/2}·√(S²/n) )
= P( −X̄−r_{α/2}·√(S²/n) ≤ −μ ≤ −X̄+r_{α/2}·√(S²/n) ) = P( X̄+r_{α/2}·√(S²/n) ≥ μ ≥ X̄−r_{α/2}·√(S²/n) )
(3) The interval:
I₁₋α = [ X̄ − r_{α/2}·√(S²/n) , X̄ + r_{α/2}·√(S²/n) ] = X̄ ∓ r_{α/2}·√(S²/n)
Note: We have simplified the notation, but it is important to notice that the quantities rα/2 and S depend on the sample size n.
To use this general formula with the specific data we have, the quantiles of the t distribution with κ = n–1 =
20–1 = 19 degrees of freedom are necessary
95% → 0.95 = 1–α → α = 0.05
In the table of the t distribution, we must search for the quantile provided for the probability p = 1−α/2 = 0.975 in
a lower-tail probability table, or p = α/2 = 0.025 in an upper-tail probability table; if a two-tailed table is used,
the quantile given for p = 1−α = 0.950 must be used. Whichever table is used, the quantile is 2.093. Finally,
I₀ = x̄ ∓ r_{0.05/2}·√(s²/20) = 14.70 u ∓ 2.093 · (6.34 u/√20) = 14.70 u ∓ 2.97 u = [11.73 u, 17.67 u]
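The same interval can be obtained from the summary statistics with R:
n = 20; xbar = 14.70; s = 6.34
xbar + c(-1, 1)*qt(0.975, n-1)*s/sqrt(n)   # approximately (11.73, 17.67)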
Conclusion: With 95% confidence we can say that the mean tree height is in the interval obtained. The
margin of error, which is expressed in the same unit of measure as the data, can be thought of as the maximum
distance—when the interval contains the true value—between the real unknown mean and the middle point of the
interval, that is, "the maximum error in probability".
My notes:
Some Reminders
● Markov's Inequality. Chebyshev's Inequality. For any (real) random variable X, any (real) function
h(x) taking nonnegative values, and any (real) positive a > 0,
E(h(X)) = ∫_Ω h(x)dP = ∫_{h(X)<a} h(x)dP + ∫_{h(X)≥a} h(x)dP ≥ ∫_{h(X)≥a} h(x)dP ≥ a·P(h(X) ≥ a)
so that P(h(X) ≥ a) ≤ E(h(X))/a (Markov's inequality). Taking h(x) = (x−μ)² and a = (kσ)², it follows that
P(|X−μ| ≥ kσ) ≤ 1/k² (Chebyshev's inequality). Interpretation of the case k = 2: the probability that X takes a value farther from the mean μ than twice
the standard deviation 2σ is 0.25 at most.
All these inequalities are true whichever the probability distribution of X, and the proof above is
based on bounding in a rough way. They are nonparametric or distribution-free inequalities. As a
consequence, it seems reasonable to expect that there will be "more powerful" inequalities either
when additional or stronger nonparametric results are used or when a parametric approach is
considered (for example, in calculating the minimum sample size necessary to guarantee a given
precision, we can also apply methods using statistics T based on asymptotic or parametric results).
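The roughness of the bound can be illustrated by simulation (a sketch with an arbitrary distribution, here the exponential with mean μ = 1 and standard deviation σ = 1):
set.seed(1)
x = rexp(100000, rate=1)
mean(abs(x - 1) >= 2)   # about 0.05, well below the Chebyshev bound 0.25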
● Generating Functions. (This section has been extracted from Probability and Random Processes.
Grimmett, G., and D. Stirzaker. Oxford University Press, 3rd ed.) In Probability, generating functions
are useful tools to work with—e.g. when convolutions or sums of independent variables are
considered. Let a = (a0, a1, a2,...) be a sequence. The simplest one is the (ordinary) generating function
of a, defined
G_a(t) = Σ_{i=0}^{∞} a_i·t^i , t ∈ ℝ for which the sum converges
The sequence may in principle be reconstructed from the function by setting a_j = G_a^{(j)}(0)/j!. This
function is especially useful when ai are probabilities. The exponential generating function of a is
G_a(t) = Σ_{j=0}^{∞} a_j·t^j/j! , t ∈ ℝ for which the sum converges
On the other hand, the probability generating function of a random variable X taking nonnegative
integer values is defined as
G(t) = E(t^X) , t ∈ ℝ for which there is convergence
(Some authors give a definition for z∈ℂ , and the radius of convergence is one at least) “There are
two major applications of probability generating functions: in calculating moments, and in calculating
the distributions of sums of independent variables.”
Theorem: E(X) = G′(1) and, more generally, E(X(X−1)⋯(X−k+1)) = G^{(k)}(1).
"Of course, G^{(k)}(1) is shorthand for lim_{s↑1} G^{(k)}(s) whenever the radius of convergence of G is 1."
Particularly, to calculate the first two raw moments:
E(X) = G^{(1)}(1)
E(X(X−1)) = E(X²) − E(X) = G^{(2)}(1) → E(X²) = G^{(2)}(1) + E(X) = G^{(2)}(1) + G^{(1)}(1)
“If you are more interested in the moments of X than in its mass function, you may prefer to work not
with G but with the function M” called moment generating function and defined by
M(t) = G(e^t) = E(e^{tX}) , t ∈ ℝ for which there is convergence
It is, under convergence, the exponential generating function of the moments E(X^k). It holds that
Theorem: E(X) = M′(0) and, more generally, E(X^k) = M^{(k)}(0).
Particularly, to calculate the first two raw moments,
E(X) = M^{(1)}(0)
E(X²) = M^{(2)}(0)
Finally, the characteristic function of a random variable X is defined by
φ(t) = E(e^{itX}) , t ∈ ℝ , i = √(−1)
Theorem: (a) If φ^{(k)}(0) exists, then E(|X^k|) < ∞ if k is even and E(|X^{k−1}|) < ∞ if k is odd.
(b) If E(|X^k|) < ∞, then φ^{(k)}(0) = i^k·E(X^k), so E(X^k) = φ^{(k)}(0)/i^k.
Then, to calculate the first two crude moments,
E(X) = φ^{(1)}(0)/i
E(X²) = φ^{(2)}(0)/i²
Existence: Techniques for series and integrals must be used to determine the values of t∈ℝ that
guarantee the convergence and hence the existence of the generating function.
When possible, we drop the subindex of the functions to simplify the notation. The reader can consult
the literature on Probability to see whether it is allowed to differentiate inside the series or the
integrals, which is equivalent to differentiating inside the expectation. On the other hand, there are other
generating functions in the literature: joint probability generating function, joint characteristic function,
cumulant generating function, et cetera.
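As a small numerical illustration of the theorem E(X) = G′(1)—using the Poisson probability generating function G(t) = e^{λ(t−1)} obtained later in this document—the derivative can be approximated by a finite difference:
lambda = 2.7
G = function(t) exp(lambda*(t-1))   # Poisson pgf
h = 1e-6
(G(1) - G(1-h))/h                   # approximates G'(1) = lambda = 2.7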
Discussion: Several distributions, discrete and continuous, are involved in this exercise. Different ways can
be considered to find the answers: the probability function f(x), the probability tables or a statistical software
program. Sometimes events need to be rewritten or decomposed. For discrete distributions, tables can contain
either individual {X=x} or cumulative {X≤x} (or {X>x}) probabilities; for continuous distributions, only
cumulative probabilities.
(a) The parameter value is λ = 2.7, and for the Poisson distribution the possible values are always 0, 1, 2... If
the table provides cumulative probabilities of the form P(X≤x),
P (1≤ X <3)=P ( X ≤2)− P( X ≤0)=⋯
If the table provides individual probabilities,
P (1≤ X <3)=P ( X =1)+ P( X =2)=0.1815+0.2450=0.4265
By using the mass function,
P(1 ≤ X < 3) = P(X=1) + P(X=2) = (2.7¹/1!)·e^{−2.7} + (2.7²/2!)·e^{−2.7} = 0.1814549 + 0.2449641 = 0.426419
Finally, by using the statistical software program R, whose function gives cumulative probabilities,
> ppois(2, 2.7) - ppois(0, 2.7)
[1] 0.426419
(b) The parameter values are κ = 11 and η = 0.3, so the possible values are 0, 1, 2,..., 11. If the table of the
binomial distribution gives individual probabilities P(X = x),
P ( X ≤2)=P ( X =0)+ P ( X =1)+ P ( X =2)=0.0198+ 0.0932+0.1998=0.3128
If cumulative probabilities were given in the table, the probability P ( X ≤2) would be provided directly. On
the other hand, the mass function can be used too,
P(X ≤ 2) = P(X=0) + P(X=1) + P(X=2) = C(11,0)·0.3⁰·(1−0.3)^{11−0} + C(11,1)·0.3¹·(1−0.3)^{11−1} + C(11,2)·0.3²·(1−0.3)^{11−2}
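With R, whose functions give individual and cumulative probabilities:
sum(dbinom(0:2, 11, 0.3))   # 0.3127 (the table value 0.3128 comes from rounded entries)
pbinom(2, 11, 0.3)          # the same, directly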
(c) The parameter value is κ = 6, so the possible values are 0, 1, 2,..., 6. This probability distribution is so
simple that no table is needed. Since the event can be decomposed into two disjoint elementary outcomes,
P(X ∈ {2, 5}) = P(X=2) + P(X=5) = 1/6 + 1/6 = 2/6 = 1/3
To plot the probability function
values = seq(1, 6)
probabilities = rep(1/6, length(values))
plot(values, probabilities, type='h', xlab='x', ylab='P(X=x)')
(d) The parameter values are κ1 = 2 and κ2 = 5, so the possible values are the real numbers in the interval [2,5]
(or with open endpoints, depending on the definition for the uniform distribution that you are considering). No
table is necessary for this distribution, and if we realize that 3.5 is the middle value between 2 and 5 no
calculation is needed either,
P ( X ≥3.5)=0.5
If not, we can use the density function,
P(X ≥ 3.5) = ∫_{3.5}^{5} 1/(5−2) dx = (1/3)·(5−3.5) = 1.5/3 = 0.5
To plot the density function
values = seq(2, 5, by=0.01)                  # fine grid over the support [2, 5]
probabilities = rep(1/(5-2), length(values))
plot(values, probabilities, type='l', xlab='x', ylab='f(x)')
Writing the event in terms of +1.7 is necessary when the table contains only positive quantiles. The
standardization can be applied before or after considering the complementary event. If we try solving the
integral of the density
f(x) = (1/√(2πσ²))·e^{−(x−μ)²/(2σ²)}
we should remember that the antiderivative of e^{−x²} cannot be expressed in terms of elementary functions, and that the definite integral of f(x) can be solved exactly only for some
limits of integration but it can always be solved numerically. On the other hand, by using the statistical
software program R, whose function contains cumulative probabilities for events of the form {X<x},
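for instance, for the standardized value +1.7 mentioned above:
pnorm(1.7)       # P(Z < 1.7) = 0.9554
1 - pnorm(1.7)   # P(Z > 1.7) = 0.0446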
(f) The parameter value is κ = 16. The set of possible values is always composed of all positive real numbers.
Most tables of the chi-square distribution provide the probability of events of the form P(X>x). In this case, it
is necessary to consider the complementary event before looking for the quantile:
P ( X ≤a)=0.025 ↔ P ( X >a)=1−0.025=0.975 → a = 6.91
We do not use the density function, as it is too complex. By using the statistical software program R, whose
function gives quantiles for events of the form {X<x},
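qchisq(0.025, 16)   # approximately 6.91, matching the table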
(g) Now the parameter value is κ = 27. A variable enjoying the t distribution can take any real value. Most
tables of this distribution provide the probability of events of the form P(X>x). In this case, it is not necessary
to rewrite the event:
P ( X > a)=0.1 → a = 1.314
The density function is too complex to be used. The statistical software program R allows doing this too (its
function provides quantiles for events of the form {X<x}):
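qt(1 - 0.1, 27)   # approximately 1.314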
(h) The parameter values for this F distribution are κ1 = 10 and κ2 = 8. The possible values are always all
positive real numbers. Again, most tables of this distribution provide the probability for events of the form
{X>x}, so:
P ( X >5.81)=0.01
The density function is also complex. Finally, by using the computer,
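qf(1 - 0.01, 10, 8)   # approximately 5.81
1 - pf(5.81, 10, 8)   # approximately 0.01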
(j) Since the parameter value is κ = 12, after decomposing the event into two disjoint tails
P ({ X ≤1.356 }∪{ X >3.055 })=P ({ X ≤1.356 })+P ({ X >3.055 })
=1− P ({ X > 1.356})+P ({ X >3.055 })=1−0.1+0.005=0.905
The density function is also complex. Finally,
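pt(1.356, 12) + (1 - pt(3.055, 12))   # approximately 0.905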
My notes:
Exercise 2pt
Weekly maintenance costs (measured in dollars, $) for a certain factory, recorded over a long period of time
and adjusted for inflation, tend to have an approximately normal distribution with an average of $420 and a
standard deviation of $30. If $450 is budgeted for next week, what is an approximate probability that this
budgeted figure will be exceeded?
(Taken from Mathematical Statistics with Applications. W. Mendenhall, D.D. Wackerly and R.L. Scheaffer. Duxbury Press)
Discussion: We need to extract the mathematical information from the statement. There is a quantity, the
weekly maintenance costs, say C, that can be assumed to follow the distribution
C ∼ N(μ = $420, σ = $30) or, in terms of the variance, C ∼ N(μ = $420, σ² = 30² $² = 900 $²)
(In practice, this supposition should be evaluated.) We are asked for the probability P(C > 450). Since C
does not follow a standard normal distribution, we standardize both sides of the inequality, by using
μ = E(C) = $420 and σ² = Var(C) = 900 $², to be able to use the table of the standard normal distribution:
P(C > 450) = P( (C−420)/30 > (450−420)/30 ) = P(Z > 1) = 1 − Φ(1) = 1 − 0.8413 = 0.1587
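The same probability with R, both directly and after standardizing:
1 - pnorm(450, mean=420, sd=30)   # 0.1587
1 - pnorm(1)                      # the same: P(Z > 1)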
My notes:
Discussion: Different methods can be applied to calculate the first two moments. We have practiced as
many of them as possible, both to learn as much as possible and to compare their difficulty; besides, some of
them are more powerful than others. Some of these calculations are advanced. To work with characteristic
functions, the definitions and rules of the analysis for complex functions of a real variable must be considered,
and even some calculations may be easier if we work with the theory for complex functions of a complex
variable. Most of these definitions and rules are “natural generalizations” of those of real analysis, but we
must be careful not to apply them without the necessary justification.
This function exists for any t. Now, the usual definitions and rules of the mathematical analysis for real
functions of a real variable imply that
E(X) = G^{(1)}(1) = [η]_{t=1} = η
E(X²) = G^{(2)}(1) + E(X) = [0]_{t=1} + η = η
This function exists for any real t. Because of the mathematical real analysis,
E(X) = M^{(1)}(0) = [η·e^t]_{t=0} = η
This complex function exists for any real t. Complex analysis is considered to do,
E(X) = φ^{(1)}(0)/i = [η·e^{it}·i]_{t=0}/i = ηi/i = η
This way can also be used to calculate the variance easily, but not to calculate the second moment:
σ² = Var(X) = Var( Σ_{i=1}^{κ} Y_i ) = Σ_{i=1}^{κ} Var(Y_i) = κ·η(1−η)
G(t) = E(t^X) = Σ_{x=0}^{κ} t^x·C(κ,x)·η^x·(1−η)^{κ−x} = (1−η)^κ·Σ_{x=0}^{κ} C(κ,x)·(ηt/(1−η))^x = [ (1−η)·(1 + ηt/(1−η)) ]^κ = (1−η+ηt)^κ
where the binomial theorem (see the appendixes of Mathematics) has been applied. Alternatively, this function
can also be calculated by looking at X as a sum of Bernoulli variables Yj and applying a property for
probability generating functions of a sum of independent random variables,
G(t) = [G_Y(t)]^κ = (1−η+ηt)^κ
This function exists for any t. Again, complex analysis allows us to do
E(X) = G^{(1)}(1) = [κ·(1−η+ηt)^{κ−1}·η]_{t=1} = κ·1^{κ−1}·η = κη
E(X²) = G^{(2)}(1) + E(X) = [κ(κ−1)·(1−η+ηt)^{κ−2}·η²]_{t=1} + κη = κ(κ−1)η² + κη = κη(κη−η+1)
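A quick numerical check of these two moments (the values κ = 11 and η = 0.3 are arbitrary):
kappa = 11; eta = 0.3
x = 0:kappa
sum(x*dbinom(x, kappa, eta))     # kappa*eta = 3.3
sum(x^2*dbinom(x, kappa, eta))   # kappa*eta*(kappa*eta - eta + 1) = 13.2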
M(t) = E(e^{tX}) = ⋯ = [ (1−η)·(1 + ηe^t/(1−η)) ]^κ = (1−η+ηe^t)^κ
Again, it is also possible to look at X as a sum of Bernoulli variables Yj and apply a property for moment
generating functions of a sum of independent random variables. Either way,
E(X) = M^{(1)}(0) = [κ·(1−η+ηe^t)^{κ−1}·ηe^t]_{t=0} = κη
E(X²) = M^{(2)}(0) = [κ(κ−1)·(1−η+ηe^t)^{κ−2}·(ηe^t)² + κ·(1−η+ηe^t)^{κ−1}·ηe^t]_{t=0} = κ(κ−1)η² + κη = κη(κη−η+1)
φ(t) = E(e^{itX}) = Σ_{x=0}^{κ} e^{itx}·C(κ,x)·η^x·(1−η)^{κ−x} = (1−η)^κ·Σ_{x=0}^{κ} C(κ,x)·(ηe^{it}/(1−η))^x = [ (1−η)·(1 + ηe^{it}/(1−η)) ]^κ = (1−η+ηe^{it})^κ
Once more, by looking at X as a sum of Bernoulli variables Yj and applying a property for characteristic
functions of a sum of independent random variables,
φ(t) = [φ_Y(t)]^κ = (1−η+ηe^{it})^κ
This complex function exists for any real t. Again, complex analysis is considered in doing
E(X²) = φ^{(2)}(0)/i² = [κ(κ−1)·(1−η+ηe^{it})^{κ−2}·(ηe^{it}i)² + κ·(1−η+ηe^{it})^{κ−1}·ηe^{it}i²]_{t=0}/i²
= (κ(κ−1)η²i² + κηi²)/i² = κη(κη−η+1)
As an example, I include a way to calculate E(X) that I found. To prove that any moment of order r is finite or,
equivalently, that the series is (absolutely) convergent, we apply the ratio test for nonnegative series. Then,
E(X) = Σ_{x=1}^{∞} x·η·(1−η)^{x−1} = η·[ Σ_{x=0}^{∞} (1−η)^x ]·[1+(1−η)+⋯] = η·[ Σ_{x=0}^{∞} (1−η)^x ]² = η·( 1/(1−(1−η)) )² = η/η² = 1/η
where the formula of the geometric sequence (see the appendixes of Mathematics) has been used.
Alternatively, μ can be calculated by applying the formula available in the literature for arithmetico-geometric
series.
E(X²) = G^{(2)}(1) + E(X) = [ η·2[1−(1−η)t]·(1−η) / [1−(1−η)t]⁴ ]_{t=1} + 1/η = [ 2η(1−η) / [1−(1−η)t]³ ]_{t=1} + 1/η
= 2(1−η)/η² + 1/η = (2−η)/η²
E(X²) = M^{(2)}(0) = [ (ηe^t·[1−(1−η)e^t]² − ηe^t·2[1−(1−η)e^t]·(−(1−η)e^t)) / [1−(1−η)e^t]⁴ ]_{t=0}
= [ ηe^t·[1−(1−η)e^t+2(1−η)e^t] / [1−(1−η)e^t]³ ]_{t=0} = [ ηe^t·[1+(1−η)e^t] / [1−(1−η)e^t]³ ]_{t=0} = η(2−η)/η³ = (2−η)/η²
E(X) = φ^{(1)}(0)/i = (1/i)·[ (ηe^{it}i·[1−(1−η)e^{it}] − ηe^{it}·(−(1−η)e^{it}i)) / [1−(1−η)e^{it}]² ]_{t=0}
= (1/i)·[ ηe^{it}i·[1−(1−η)e^{it}+(1−η)e^{it}] / [1−(1−η)e^{it}]² ]_{t=0} = (1/i)·[ ηe^{it}i / [1−(1−η)e^{it}]² ]_{t=0} = (1/i)·(ηi/η²) = 1/η
E(X²) = φ^{(2)}(0)/i² = (1/i²)·[ (ηe^{it}i²·[1−(1−η)e^{it}]² − ηe^{it}i²·2[1−(1−η)e^{it}]·(−(1−η)e^{it})) / [1−(1−η)e^{it}]⁴ ]_{t=0}
= (1/i²)·[ ηe^{it}i²·[1+(1−η)e^{it}] / [1−(1−η)e^{it}]³ ]_{t=0} = (1/i²)·(ηi²(2−η)/η³) = (2−η)/η²
σ² = Var(X) = E(X²) − E(X)² = (2−η)/η² − (1/η)² = (1−η)/η²
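A numerical check of these quantities (the value η = 0.4 is arbitrary; the series is truncated, which is harmless given its fast convergence):
eta = 0.4
x = 1:2000
sum(x*eta*(1-eta)^(x-1))     # E(X) = 1/eta = 2.5
sum(x^2*eta*(1-eta)^(x-1))   # E(X^2) = (2-eta)/eta^2 = 10
# variance: 10 - 2.5^2 = 3.75 = (1-eta)/eta^2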
Advanced theory: Additional way 1: In Cálculo de probabilidades I, by Vélez, R., and V. Hernández, UNED,
the first four moments are calculated as follows (I write the calculations for the first two moments, with the
notation we are using)
E(X) = Σ_{x=1}^{∞} x·η·(1−η)^{x−1} = η·Σ_{x=1}^{∞} x·(1−η)^{x−1} = η·(d/d(1−η))( Σ_{x=1}^{∞} (1−η)^x ) = η·(d/d(1−η))( (1−η)/(1−(1−η)) )
= η·( 1·[1−(1−η)] − (1−η)·(−1) ) / [1−(1−η)]² = η·(1/η²) = 1/η
E(X²) = Σ_{x=1}^{∞} x²·η·(1−η)^{x−1} = η·Σ_{x=1}^{∞} (x+1)x·(1−η)^{x−1} − η·Σ_{x=1}^{∞} x·(1−η)^{x−1}
= η·(d/d(1−η))( Σ_{x=1}^{∞} (x+1)·(1−η)^x ) − E(X) = η·(d/d(1−η))( [2(1−η)·[1−(1−η)] − (1−η)²·(−1)] / [1−(1−η)]² ) − E(X)
= η·(d/d(1−η))( [2(1−η) − 2(1−η)² + (1−η)²] / [1−(1−η)]² ) − E(X) = η·(d/d(1−η))( [2(1−η) − (1−η)²] / [1−(1−η)]² ) − E(X)
= η·( [2−2(1−η)]·[1−(1−η)]² − [2(1−η)−(1−η)²]·2[1−(1−η)]·(−1) ) / [1−(1−η)]⁴ − 1/η
= η·( 2[1−(1−η)]² + 2[2(1−η)−(1−η)²] ) / [1−(1−η)]³ − 1/η
= (2η² + 4(1−η) − 2(1−η)²)/η² − η/η² = (2η² + 4 − 4η − 2 − 2η² + 4η − η)/η² = (2−η)/η²
(We have already justified the convergence of the series involved.) Additional way 2: In trying to find a way
based on calculating the main part of the series by using an ordinary differential equation, as I had previously
done for the Poisson distribution (in the next section), I found the following way, which is essentially the same as
the additional way above. A series can be differentiated and integrated term by term inside the circle of
convergence (the radius of convergence was one, which includes all possible values of η). The expression of
the mean suggests the following definition for g(η):
E(X) = Σ_{x=1}^{∞} x·η·(1−η)^{x−1} = η·g(η) → g(η) = Σ_{x=1}^{∞} x·(1−η)^{x−1}
and it follows, since g is a well-behaved function of η, that
G(η) = ∫ g(η)dη = Σ_{x=1}^{∞} ∫ x·(1−η)^{x−1}dη = −Σ_{x=1}^{∞} (1−η)^x + c = −(1−η)/(1−(1−η)) + c = (η−1)/η + c
I spent some time searching for a differential equation... and I found this integral one. Now, by solving it,
g(η) = G′(η) = (η−(η−1))/η² + 0 = 1/η²
(This is a general method to calculate some infinite series.) Finally, the mean is
E(X) = η·g(η) = η·(1/η²) = 1/η
For the second moment, we define
E(X²) = Σ_{x=1}^{∞} x²·η·(1−η)^{x−1} = η·g(η) → g(η) = Σ_{x=1}^{∞} x²·(1−η)^{x−1}
and it follows that
G(η) = ∫ g(η)dη = Σ_{x=1}^{∞} x·∫ x·(1−η)^{x−1}dη = −Σ_{x=1}^{∞} x·(1−η)^x + c
= −((1−η)/η)·Σ_{x=1}^{∞} x·η·(1−η)^{x−1} + c = c + (η−1)/η²
Now, by solving this trivial integral equation,
g(η) = G′(η) = 0 + (η² − (η−1)·2η)/η⁴ = (η² − 2η² + 2η)/η⁴ = (2−η)/η³
Finally, the second moment is
E(X²) = η·g(η) = η·(2−η)/η³ = (2−η)/η²
lim_{x→∞} |a_{x+1}/a_x| = lim_{x→∞} | ((x+1)^r·λ^{x+1}·e^{−λ}/(x+1)!) / (x^r·λ^x·e^{−λ}/x!) | = lim_{x→∞} ((x+1)/x)^r · |λ|/(x+1) = 0 < 1
This implies that ∞ > Σ_{x=0}^{∞} x^r·(λ^x/x!)·e^{−λ} = e^{−λ}·Σ_{x=0}^{∞} x^r·λ^x/x!. Once the (absolute) convergence has been proved,
the rules of “the usual arithmetic for finite quantities” could be applied. Nevertheless, working with factorial
numbers in series makes it easy to prove the convergence but difficult to find the value.
lim_{x→∞} |a_{x+1}/a_x| = lim_{x→∞} | ((tλ)^{x+1}/(x+1)!) / ((tλ)^x/x!) | = lim_{x→∞} |tλ|/(x+1) = 0 < 1 .
Now, the definitions and rules of the mathematical analysis for real functions of a real variable,
E(X) = G^{(1)}(1) = [e^{λ(t−1)}·λ]_{t=1} = λ
E(X²) = M^{(2)}(0) = [e^{λ(e^t−1)}·(λe^t)² + e^{λ(e^t−1)}·λe^t]_{t=0} = [e^{λ(e^t−1)}·λe^t·(λe^t+1)]_{t=0} = λ(λ+1) = λ² + λ
lim_{x→∞} |a_{x+1}/a_x| = lim_{x→∞} | ((e^{it}λ)^{x+1}/(x+1)!) / ((e^{it}λ)^x/x!) | = lim_{x→∞} |e^{it}λ|/(x+1) = 0 < 1 .
The definitions and rules of the analysis for complex functions have been applied in the previous calculations
(they are similar to those for real functions of real variable). Now, by using the analysis for complex functions
of one real variable,
Advanced theory: Additional way 1: In finding ways, I found the following one. A series can be
differentiated and integrated term by term inside its circle of convergence. The limit calculated at the
beginning was the same for any λ, so the radius of convergence is infinite when the series is looked at as
a function of λ. The expression of the mean suggests the following definition for g(λ):
E(X) = Σ_{x=0}^{∞} x·(λ^x/x!)·e^{−λ} = e^{−λ}·g(λ) → g(λ) = Σ_{x=0}^{∞} x·λ^x/x!
and it follows, since g is a well-behaved function of λ, that
g′(λ) = Σ_{x=1}^{∞} x·λ^{x−1}/(x−1)! = Σ_{x=1}^{∞} (1+(x−1))·λ^{x−1}/(x−1)! = Σ_{x=1}^{∞} λ^{x−1}/(x−1)! + Σ_{x=1}^{∞} (x−1)·λ^{x−1}/(x−1)! = e^λ + g(λ)
Now, we solve the first-order ordinary differential equation g′(λ) − g(λ) = e^λ.
Homogeneous equation:
g′(λ) − g(λ) = 0 → dg/dλ = g → (1/g)dg = dλ → log(g) = λ + k → g_h(λ) = e^{λ+k} = c·e^λ
Particular solution: We apply, for example, the method of variation of parameters or constants. Substituting in
the equation g(λ) = c(λ)·e^λ and g′(λ) = c′(λ)·e^λ + c(λ)·e^λ,
c′(λ)·e^λ + c(λ)·e^λ − c(λ)·e^λ = e^λ → c′(λ) = 1 → c(λ) = λ → g_p(λ) = λ·e^λ
Any g(λ) given by the previous expression verifies the differential equation, so an additional condition is
necessary to determine the value of c. The initial definition implies that g(0) = 0, so c = 0. Finally, the mean is
E(X) = e^{−λ}·g(λ) = e^{−λ}·λ·e^λ = λ
(The same can be done to calculate some infinite series.) For the second moment, we define
E(X²) = Σ_{x=0}^{∞} x²·(λ^x/x!)·e^{−λ} = e^{−λ}·g(λ) → g(λ) = Σ_{x=0}^{∞} x²·λ^x/x!
and it follows, since g is a well-behaved function of λ, that
g′(λ) = Σ_{x=1}^{∞} x²·λ^{x−1}/(x−1)! = Σ_{x=1}^{∞} [1+(x−1)²+2(x−1)]·λ^{x−1}/(x−1)!
= Σ_{x=1}^{∞} λ^{x−1}/(x−1)! + Σ_{x=1}^{∞} (x−1)²·λ^{x−1}/(x−1)! + 2·Σ_{x=1}^{∞} (x−1)·λ^{x−1}/(x−1)! = e^λ + g(λ) + 2·λ·e^λ
(The expression of the expectation of X has been used in the last term.) Thus, the function we are looking for
verifies the first-order ordinary differential equation g′(λ) − g(λ) = e^λ·(1+2λ).
Homogeneous equation: This equation is the same, so g_h(λ) = e^{λ+k} = c·e^λ.
Particular solution: By applying the same method,
c′(λ)·e^λ + c(λ)·e^λ − c(λ)·e^λ = e^λ·(1+2λ) → c′(λ) = 1+2λ → c(λ) = λ+λ² → g_p(λ) = (λ+λ²)·e^λ
Again, the initial definition implies that g(0) = 0, so c = 0, and the second moment is E(X²) = e^{−λ}·(λ+λ²)·e^λ = λ² + λ, which coincides with the value obtained above.
E(X) = ∫₀^{+∞} x·λ·e^{−λx} dx = [−x·e^{−λx}]₀^{+∞} − ∫₀^{+∞} −e^{−λx} dx = [−x·e^{−λx} − (1/λ)·e^{−λx}]₀^{+∞} = (1/λ) − 0 = 1/λ
Where the formula ∫ u (x )⋅v ' (x )dx=u (x )⋅v (x )−∫ u ' (x)⋅v (x )dx of integration by parts has been applied
φ(t) = E(e^{itX}) = ∫₀^{+∞} e^{itx}·λ·e^{−λx} dx = λ·lim_{M→∞} [ e^{z(it−λ)}/(it−λ) ]_{z=0}^{z=M} = (λ/(it−λ))·lim_{M→∞} [e^{M(it−λ)} − 1] = (λ/(it−λ))·lim_{M→∞} [e^{−Mλ}·e^{iMt} − 1] = λ/(λ−it)
This function exists for any real t, since it−λ ≠ 0 (dividing by zero is not allowed). In the previous
calculation, the fact that the complex integrand is differentiable has been used to calculate the (line) complex integral
by using an antiderivative and the equivalent of Barrow's rule. Now, the definitions and rules of the
analysis for complex functions of a real variable must be considered to do
E(X) = φ^{(1)}(0)/i = (1/i)·[ −λ·(−i)/(λ−it)² ]_{t=0} = (1/i)·(λi/λ²) = 1/λ
E ( X )=∫−∞ x e 2σ
dx=∫−∞ (t+μ) e 2σ
dt
√2 π σ2 2
√ 2 π σ 2
2
t t
+∞ 1 − +∞ 1 − 2 2
=∫−∞ t e 2σ
dt + μ ∫ e 2σ
dt = 0+μ⋅1 = μ
√2 π σ 2 −∞
√2 π σ 2
E(X²) = ∫_{−∞}^{+∞} (t+μ)²·(1/√(2πσ²))·e^{−t²/(2σ²)} dt = ∫_{−∞}^{+∞} t²·(1/√(2πσ²))·e^{−t²/(2σ²)} dt + μ²·1 + 2μ·0
= (1/√(2πσ²))·σ²·√(2πσ²) + μ² = σ² + μ²
since
∫_{−∞}^{+∞} t²·e^{−t²/(2σ²)} dt = ∫_{−∞}^{+∞} t·(t·e^{−t²/(2σ²)}) dt = [−t·σ²·e^{−t²/(2σ²)}]_{−∞}^{+∞} + σ²·∫_{−∞}^{+∞} e^{−t²/(2σ²)} dt = (0−0) + σ²·√(2σ²)·∫_{−∞}^{+∞} e^{−u²} du = σ²·√(2πσ²)
Firstly, we have applied integration by parts with
• u = t → u′ = 1
• v′ = t·e^{−t²/(2σ²)} → v = ∫ t·e^{−t²/(2σ²)} dt = −σ²·e^{−t²/(2σ²)}
(Again, the function e^x changes faster than x^k, for any k.) Then, we have applied the change
t/√(2σ²) = u → t = u·√(2σ²) → dt = du·√(2σ²)
and the well-known result ∫_{−∞}^{+∞} e^{−x²} dx = √π (see the appendix of Mathematics). On the other hand, these
integrals converge for any real t.
M(t) = E(e^{tX}) = (1/√(2πσ²))·∫_{−∞}^{+∞} e^{xt}·e^{−(x−μ)²/(2σ²)} dx = e^{(1/2)t(2μ+σ²t)}
since
∫_{−∞}^{+∞} e^{xt}·e^{−(x−μ)²/(2σ²)} dx = ∫_{−∞}^{+∞} e^{−(1/(2σ²))[−2σ²tx+x²+μ²−2μx]} dx = ∫_{−∞}^{+∞} e^{−(1/(2σ²)){x²−2x[σ²t+μ]+μ²}} dx
= ∫_{−∞}^{+∞} e^{−(1/(2σ²)){(x−[σ²t+μ])²−[σ²t+μ]²+μ²}} dx = e^{−(1/(2σ²)){μ²−[σ²t+μ]²}}·∫_{−∞}^{+∞} e^{−((x−[σ²t+μ])/√(2σ²))²} dx
= e^{−(1/(2σ²))(μ−[σ²t+μ])(μ+[σ²t+μ])}·∫_{−∞}^{+∞} e^{−u²}·√(2σ²) du = e^{−(1/(2σ²))[−σ²t][2μ+σ²t]}·√(2πσ²) = e^{(1/2)t(2μ+σ²t)}·√(2πσ²)
This function exists for any real t. Now,
E(X) = M^{(1)}(0) = [ e^{(1/2)t(2μ+σ²t)}·(1/2)·(2μ+2σ²t) ]_{t=0} = [ e^{(1/2)t(2μ+σ²t)}·(μ+σ²t) ]_{t=0} = μ
For the characteristic function, the improper integral is understood as
∫_{−∞}^{+∞} e^{itx}·e^{−(x−μ)²/(2σ²)} dx = lim_{M→∞} ∫_{−M}^{+M} e^{itx}·e^{−(x−μ)²/(2σ²)} dx
Because of the rules of complex analysis, these calculations are similar—but based on new definitions and
properties—to those of previous sections. What is much different is the way of solving the integral. Now we
cannot find an antiderivative of the integrand—as we did for the exponential distribution—and therefore we
must think of calculating the integral by considering a contour containing the points
{x−μ−iσ²t, −M ≤ x ≤ +M}. The integral of a complex function is null for any closed contour within the
domain in which the function is differentiable. We consider the rectangular contour made of this segment, the parallel real segment, and the two vertical segments joining their endpoints.
We are interested in the limit when M increases. For the first (vertical) integral,
| ∫₀^{tσ²} e^{−(1/(2σ²))(M−μ)²}·e^{(1/(2σ²))(γ−tσ²)²}·e^{−(1/(2σ²))[i·2(M−μ)(γ−tσ²)]} dγ | ≤ ∫₀^{tσ²} e^{−(1/(2σ²))(M−μ)²}·e^{(1/(2σ²))(γ−tσ²)²} dγ = e^{−(1/(2σ²))(M−μ)²}·∫₀^{tσ²} e^{(1/(2σ²))(γ−tσ²)²} dγ →_{M→∞} 0
since |e^{ic}| = |cos(c)+i·sin(c)| = 1, ∀c ∈ ℝ, and the last integral is finite (the integrand is a continuous
function and the interval of integration is compact) and does not depend on M. For the second integral,
∫_{−M}^{+M} e^{−(1/(2σ²))(γ−μ)²} dγ = ∫_{(−M−μ)/√(2σ²)}^{(+M−μ)/√(2σ²)} e^{−u²}·√(2σ²) du →_{M→∞} √(2σ²)·∫_{−∞}^{+∞} e^{−u²} du = √(2πσ²)
where the change
(γ−μ)/√(2σ²) = u → γ = u·√(2σ²)+μ → dγ = du·√(2σ²) and (−M−μ)/√(2σ²) ≤ (γ−μ)/√(2σ²) ≤ (+M−μ)/√(2σ²)
has been applied. Finally, for the third (vertical) integral,
| ∫₀^{tσ²} e^{−(1/(2σ²))(M+μ)²}·e^{(1/(2σ²))γ²}·e^{−(1/(2σ²))i·2(M+μ)γ} dγ | ≤ e^{−(1/(2σ²))(M+μ)²}·∫₀^{tσ²} e^{(1/(2σ²))γ²} dγ →_{M→∞} 0
Again, the last integral is finite and does not depend on M. In short,
φ(t) = (1/√(2πσ²))·∫_{−∞}^{+∞} e^{itx−(x−μ)²/(2σ²)} dx = (1/√(2πσ²))·lim_{M→∞} ∫_{−M}^{+M} e^{itx−(x−μ)²/(2σ²)} dx
= e^{(1/2)it[2μ+σ²it]}·(1/√(2πσ²))·lim_{M→∞} ∫_{−M}^{+M} e^{−(1/(2σ²))(x−μ−iσ²t)²} dx = e^{(1/2)it[2μ+σ²it]}·(√(2πσ²)/√(2πσ²)) = e^{(1/2)it[2μ+σ²it]}
This function exists for any real t. (The reader can notice that the correct way is slightly longer.) Now,
E(X) = φ^{(1)}(0)/i = (1/i)·[ e^{(1/2)it[2μ+σ²it]}·(1/2)·i·(2μ+2σ²it) ]_{t=0} = (1/i)·[ e^{(1/2)it[2μ+σ²it]}·i·(μ+iσ²t) ]_{t=0} = iμ/i = μ
Conclusion: To calculate the moments of a probability distribution, different methods can be considered,
some of them considerably more difficult than others. The characteristic function is a complex function of a real
variable, which requires theoretical justifications from complex analysis that we must be aware of.
My notes:
[Ap] Mathematics
Remark 1m: The exponential function e^x changes faster than any monomial x^k, for any k.
Remark 2m: In complex analysis, there are frequently definitions and properties analogous to those of real analysis. Nevertheless,
one must take care before applying them.
Remark 3m: Theoretically, quantities like proportions (sometimes expressed in per cent), rates, statistics, etc., are dimensionless. To
interpret a numerical quantity, it is necessary to know the framework in which it is being used. For example, 0.98% and 0.98%² are
different: the second must be interpreted as √(0.98%²) = 0.99%. Thus, to track how such quantities are transformed, the use of a symbol may
be useful.
Remark 4m: In working with expressions—equations, inequations, sums, limits, integrals, etc.—special attention must be paid
when 0 or ∞ appears. For example, even if two limits (series, integrals, etc.) do not exist, their summation (difference, quotient,
product, etc.) may exist:
lim_{n→∞} n³ = ∞ and lim_{n→∞} n⁴ = ∞, but lim_{n→∞} n³/n⁴ = 0; or ∫₁^{∞} (1/x) dx does not exist while ∫₁^{∞} (1/x)·(1/x) dx does.
On the other hand, many paradoxes (e.g. Zeno's ones) are based on some wrong step (here, cancelling the factor 0 or ∞):
0 = 0 ↔ 0·2 = 0·3, yet 2 ≠ 3; and ∞ = ∞ ↔ ∞·2 = ∞·3, yet 2 ≠ 3
Readers of advanced sections may want to check some theoretical details related to the following items (the
very basic theory is not itemized).
Some Reminders
Real Analysis
For real functions of one or several real variables.
● Binomial Theorem.
(x+y)^n = Σ_{j=0}^{n} C(n,j)·x^j·y^{n−j} or, equivalently, (x+y)^n = Σ_{j=0}^{n} C(n,j)·x^{n−j}·y^j
● Limits: infinitesimal and infinite quantities.
● Integration: methods (integration by substitution, integration by parts, etc.), Fubini's theorem, line
integral.
● Series: convergence, criteria of convergence, radius of convergence, differentiability and integrability,
Taylor series, representation of the exponential function, power series. Concretely, when the criterion
of the quotient is applied to study the convergence, the radius of convergence is defined as:
lim_{m→∞} |a_{m+1}/a_m| = lim_{m→∞} |c_{m+1}·x^{m+1}|/|c_m·x^m| = |x|·lim_{m→∞} |c_{m+1}|/|c_m| < 1 → |x| < lim_{m→∞} |c_m|/|c_{m+1}| = r
(Similarly for the criterion of the root.)
Complex Analysis
For complex functions of one real variable.
● Limits: definitions and basic properties.
● Differentiation: definitions and basic properties.
● Integration: definitions and basic properties, antiderivatives and Barrow's rule.
Limits
Frequently, we need to deal with limits of sequences and functions. For sequences, any variable or index (say
n) and the quantity of interest (say Q) can take values in a countable set of discrete positive values, even for
multidimensional situations: the countable product of countable sets is a countable set. Calculations are easier
when there is some monotony—since "the small steps determine the whole way"—or symmetry. For example, the
summation and the product increase when any term increases, or both, while the difference and the quotient
may increase or decrease depending on which term increases by one unit, since the two terms do not affect
the total expression in the same direction.
Techniques
In calculating limits, firstly we try to mentally substitute the value of the variable in the sequence or
function. This is frequently enough to solve it, although we can do some formal calculations (especially if we
are not totally sure about the value). When the previous substitution leads to one of the following cases
∞−∞ , ∞·0 , ∞/∞ , 0/0 , 1^∞ , ∞⁰ , and 0⁰
we talk about indeterminate forms (we have not written possible variations of the signs or positions, e.g. 0·∞,
–∞+∞, or –0/0). The value depends on the particular case, since one term can be "faster" than the other in
tending to its limit.
Limits in Statistics
Since the sample sizes of populations are positive integer numbers, in Statistics we have to deal with limits of
sequences frequently.
One-Variable Limits: The variable n takes values in ℕ. For this variable, there is a unique natural way for
n to tend to infinite by increasing it one unit at a time. There is a total order in the set ℕ, which is countable.
In Statistics, we are usually interested only in any possible nondecreasing sequences of values for n, which
can be seen as a possible sequence of schemes where more and more data are added.
Two-Variable Limits: A pair of values (nX, nY) can be seen as a point in ℕ x ℕ . There are infinite ways for
nX and nY to tend to infinite by increasing any of them, or both, one unit at a time. There is not a total order in
the product space ℕ x ℕ , though it is still a countable set. Again, in Statistics we are usually interested only
in any possible nondecreasing sequence of pairs of values (nX,nY), which can be seen as a sequence of schemes
where more and more data are added.
In this document, we have to work with easy limits or indeterminate forms like ∞/∞ involving
polynomials. For the latter type of limit, we look at the terms with the highest exponents and we multiply and
divide the quotient by the proper monomial so as to identify the negligible terms, which formally can be
seen as the use of infinites. We will also mention other techniques.
One-Variable Limits: Any possible sequence of values for the sample size, say n(k), can be seen as a
subsequence of the most complete set ℕ of possible values n(k) = k. We are specially interested in
nondecreasing sequences of values n(k) ≤ n(k+1).
The evaluation of any one-dimensional quantity at a subsequence, Q(n(k)), can be seen as a subsequence of
Q(k). If this sequence converges, any subsequence like that must converge. The opposite is not true, since we
can find nonconvergent Q(k) with a convergent subsequence Q(n(k)). The following result can be found in
literature.
Theorem
For a real function f of a real variable x, defined on ℝ̄ = ℝ∪{∞}, if a is an accumulation point the
following conditions are equivalent:
(i) limx→a f(x) = L
(ii) For any sequence (in the domain) such that limk→∞ x(k) = a, it holds that limk→∞ f(x(k)) = L
A sequence is a particular case of real function of a real variable, and ∞ is an accumulation point in ℝ̄.
Two-Variable Limits: Any possible sequence of values (nX(k),nY(k)) can be seen as a path s(k) in the most
complete set ℕ x ℕ of possible values (k1,k2). Again, we are specially interested in nondecreasing sequences
of values nX(k) ≤ nX(k+1) and nY(k) ≤ nY(k+1).
Exercise 1m (*)
Prove that
∫_{−∞}^{+∞} e^{−x²} dx = √π , ∫_{−∞}^{+∞} e^{−ax²} dx = √(π/a) , and ∫₀^{+∞} e^{−x²} dx = √π/2
Discussion: The integrand is a continuous function. We remember that e^{−x²} has no elementary antiderivative but it is
still possible to calculate definite integrals for some domains. As regards the limits of integration, the domain
is infinite and we must deal with improper integrals.
(a) Finiteness: Firstly, we prove that the integral is finite, so as not to be working with the equality of two infinite
quantities (something "really dangerous").
∫_{−∞}^{+∞} e^{−x²} dx = ∫_{|x|<1} e^{−x²} dx + ∫_{|x|≥1} e^{−x²} dx ≤ ∫_{−1}^{+1} 1 dx + 2·∫_{+1}^{∞} e^{−x} dx = 2 + 2·[−e^{−x}]_{x=1}^{∞} = 2 + 2e^{−1} < ∞
since
• If 0 ≤ |x| < 1 then 0 ≤ x² < 1 and e⁰ ≤ e^{x²} < e¹, and hence 1 = e⁰ ≥ e^{−x²} > e^{−1}
• If |x| ≥ 1 then x² ≥ |x|, and hence e^{−x²} ≤ e^{−|x|}
• For an even function, the integral between −k and +k is twice the integral between 0 and +k.
(b) Jump to a two-dimensional space:
I² = I·I = ∫_{−∞}^{+∞} e^{−x²} dx · ∫_{−∞}^{+∞} e^{−y²} dy = ∫_{−∞}^{+∞}∫_{−∞}^{+∞} e^{−(x²+y²)} dx dy = ∫₀^{+∞}∫₀^{2π} e^{−[ρ²cos(θ)²+ρ²sin(θ)²]}·ρ dθ dρ
= ∫₀^{+∞}∫₀^{2π} e^{−ρ²}·ρ dθ dρ = 2π·∫₀^{+∞} e^{−ρ²}·ρ dρ = 2π·[−e^{−ρ²}/2]₀^{∞} = π
where the change to polar coordinates has Jacobian
|J| = | ∂x/∂ρ ∂x/∂θ ; ∂y/∂ρ ∂y/∂θ | = | cos(θ) −ρ·sin(θ) ; sin(θ) ρ·cos(θ) | = ρ·cos(θ)² + ρ·sin(θ)² = ρ
Come back to a one-dimensional space: Finally,
I = √π , that is, ∫_{−∞}^{+∞} e^{−x²} dx = √π
(c) On the other hand, since f(x) = e^{−x²} = e^{−(−x)²} = f(−x) is an even function,
∫₀^{+∞} e^{−x²} dx = (1/2)·∫_{−∞}^{+∞} e^{−x²} dx = √π/2
An alternative proof uses the gamma function Γ(p) = ∫₀^{+∞} e^{−x}·x^{p−1} dx and the fact that Γ(1/2) = √π. Now,
by applying the change of variable x² = t, for t ≥ 0, which implies that x = √t and hence dx = dt/(2√t),
∫₀^{+∞} e^{−x²} dx = (1/2)·∫₀^{+∞} e^{−t}·t^{−1/2} dt = (1/2)·Γ(1/2) = √π/2 .
Conclusion: To be allowed to apply the version of Fubini's theorem for improper integrals, the finiteness
of the first integral has been proved first. The integral of section (a) is used to calculate the others,
respectively by applying a change of variables and by considering the even character of the integrand.
About the proof based on the multiple integration: Proof by Siméon Denis Poisson (1781–1840), according to El omnipresente
número π, Zhúkov, A.V., URSS. I had found this proof in many books, including the previous reference (for the integral in section b
with a=1/2). I have written the bound of the integral. About the proof based on the gamma function: I have found this proof in
Problemas de oposiciones: Matemáticas (Vol. 6), De Diego y otros, Editorial Deimos. In this textbook, the integral in section c is
solved by using the two approaches.
My notes:
Discussion: We have to study several limits. Firstly, we try to substitute the value to which the variable
tends in the expression of the quantity in the limit. If we are lucky, the value is found and the formal
calculations are done later; if not, techniques to solve the indeterminate forms must be applied.
(1) lim_{n→∞} (a_k·n^k + a_{k−1}·n^{k−1} + ⋯ + a₁·n + a₀), where the a_j are constants
Way 0: Intuitively, the term with the largest exponent leads the growth when n tends to infinite. Then,
lim_{n→∞} (a_k·n^k + a_{k−1}·n^{k−1} + ⋯ + a₁·n + a₀) = −∞ if a_k < 0 ; +∞ if a_k > 0
(2) lim_{n→∞} 1/(n+c), where c is a constant
Way 0: Intuitively, the denominator tends to infinite while the numerator does not. (For huge n, the value of c
is negligible.) Then, the limit is zero.
Way 1: Formally, we divide the numerator and the denominator (all their terms) by n.
lim_{n→∞} 1/(n+c) = lim_{n→∞} (n⁻¹·1)/(n⁻¹·(n+c)) = lim_{n→∞} (1/n)/(1+c/n) = 0
Necessity: lim_{n→∞} 1/(n+c) = 0 requires that n → ∞.
If not, that is, if ∃M>0 such that n < M < ∞, then 1/(n+c) > 1/(M+c) > 0 and the limit could not be zero.
(3) lim_{n→∞} (an+b)/(cn+d), where a, b, c and d are constants
Way 0: (This limit includes the previous one.) The quotient is an indeterminate form. Intuitively, the numerator
increases like an and the denominator like cn. (The terms b and d are negligible for huge n.) Then, the
quotient tends to a/c.
Way 1: Formally, we divide the numerator and the denominator (all their terms) by n.
lim_{n→∞} (an+b)/(cn+d) = lim_{n→∞} (n⁻¹·(an+b))/(n⁻¹·(cn+d)) = lim_{n→∞} (a+b/n)/(c+d/n) = a/c
Necessity: lim_{n→∞} (an+b)/(cn+d) = a/c requires that n → ∞.
If not, that is, if ∃M>0 such that n < M < ∞, then
| (an+b)/(cn+d) − a/c | = | (acn+bc−acn−ad)/(c(cn+d)) | ≥ |bc−ad| / (|c|·(|c|M+|d|)) > 0
and the limit could not be a/c... unless the original quotient was always equal to this value. Notice that when
the previous numerator is zero,
ad = bc ↔ a/c = λ = b/d ↔ { a = λc , b = λd } ↔ (an+b)/(cn+d) = λ(cn+d)/(cn+d) = λ = a/c ↔ (an+b)/(cn+d) − a/c = 0
that is, in this case the function is really a constant. In the initial statement, the condition |a b; c d| ≠ 0 could
have been added for the polynomials an+b and cn+d to be independent.
(4) lim_{n→∞} (a·n^{k₁} + b(n))/(c·n^{k₂} + d(n)), where a and c are constants and b(n) and d(n) are polynomials whose degrees are
smaller than k₁ and k₂, respectively
Way 1: Formally, we divide the numerator and the denominator (all their terms) by the power of n with the
highest degree among all the terms in the quotient (if there were products, we should imagine how the
monomials are). For example, for the case k₁ < k₂,
lim_{n→∞} (a·n^{k₁}+b(n))/(c·n^{k₂}+d(n)) = lim_{n→∞} (n^{−k₂}·(a·n^{k₁}+b(n)))/(n^{−k₂}·(c·n^{k₂}+d(n))) = lim_{n→∞} ( (a + b(n)/n^{k₁})/n^{k₂−k₁} ) / ( c + d(n)/n^{k₂} ) = 0
Way 2: By using infinites, since b(n) and d(n) are negligible for huge n,
lim_{n→∞} (a·n^{k₁}+b(n))/(c·n^{k₂}+d(n)) = lim_{n→∞} (a·n^{k₁})/(c·n^{k₂}) = lim_{n→∞} (a/c)·n^{k₁−k₂} = { 0 if k₁ < k₂ ; a/c if k₁ = k₂ ; −∞ if k₁ > k₂ and a/c < 0 ; +∞ if k₁ > k₂ and a/c > 0 }
(5) lim_{n→∞} (a/n + b/n²)/(c/n³), where a, b and c are constants
Way 0: The quotient is an indeterminate form. Intuitively, the numerator decreases like a/n (the slowest term) and
the denominator like c/n³, so the denominator is smaller and smaller with respect to the numerator and, as a
consequence, the limit is −∞ or +∞ depending on whether a/c is negative or positive, respectively.
Way 1: Formally, it is always possible to multiply or divide the numerator and the denominator (all their
monomials, if they are summations, or any element, if they are products) by the power of n with the
appropriate exponent. Then we can do
lim_{n→∞} (a/n + b/n²)/(c/n³) = lim_{n→∞} (n³·(a/n + b/n²))/(n³·(c/n³)) = lim_{n→∞} (a·n² + b·n)/c = { −∞ if a/c < 0 ; +∞ if a/c > 0 }
or, by using infinites (b/n² is negligible against a/n),
lim_{n→∞} (a/n)/(c/n³) = lim_{n→∞} a·n²/c , with the same result.
Conclusion: We have studied the limits proposed. Some of them were almost trivial, while others involved
indeterminate forms like 0/0 or ∞/∞. All the cases were quotients of polynomials, so the limits of the former
form have been transformed into limits of the latter form. To solve these cases, the technique of multiplying
and dividing by the same quantity has sufficed (there are other techniques, e.g. L'Hôpital's rule).
Additional examples
lim_{n→∞} 1/(n−1) = 0 or lim_{n→∞} 1/(n−1) = lim_{n→∞} 1/n = 0
lim_{n→∞} (2/n² − 1/n) = 0
My notes:
Exercise 3m (*)
Study the following limits of sequences of two variables, among others:
(2) lim_{nX→∞,nY→∞} (nX·nY) and lim_{nX→∞,nY→∞} nX/nY
(3) lim_{nX→∞,nY→∞} (nX·nY)/nX and lim_{nX→∞,nY→∞} nX/(nX·nY)
Discussion: We have to study several limits of two-variable sequences. Firstly, we try to substitute the value
to which the variables tend in the expression of the quantity in the limit. If we are lucky, the value is found
and the formal calculations are done later; if not, techniques to solve the indeterminate forms must be applied.
These limits may be quite more difficult than those for one variable, since we need to prove that the value
does not depend on the particular way in which the sample sizes tend to infinite (if the limit exists or is infinite)
or find two ways such that different values are obtained (then the limit does not exist).
Way 0: Intuitively, the first limit is infinite while the second does not exist, since it depends on which variable
increases faster.
Way 1: For the first limit to be infinite, it is necessary and sufficient that one variable tends to infinite, say nX:
lim_{nX→∞,nY→∞} (nX + nY) > lim_{nX→∞} nX = ∞
For the second limit, it is enough to see that different values are obtained for different paths: s₁(k) = (k², k) and s₂(k) = (k, k).
(2) lim_{nX→∞,nY→∞} (nX·nY) and lim_{nX→∞,nY→∞} nX/nY
Way 0: Intuitively, the first limit is infinite while the second does not exist, since it depends on which variable
increases faster.
Way 1: For the first limit to be infinite, it is necessary and sufficient that one variable tends to infinite, say nX:
lim_{nX→∞,nY→∞} (nX·nY) > lim_{nX→∞} nX = ∞
For the second limit, it is enough to see that different values are obtained for different paths: s₁(k) = (k², k) and s₂(k) = (k, k),
lim_{s₁(k)} nX/nY = lim_{k→∞} k²/k = ∞ and lim_{s₂(k)} nX/nY = lim_{k→∞} k/k = 1
(3) lim_{nX→∞,nY→∞} (nX·nY)/nX and lim_{nX→∞,nY→∞} nX/(nX·nY)
Way 0: Even if the expression can be simplified, we use this case to show that the product of increasing terms
increases faster than any of its terms, and the new rate is the product of the two rates (the exponents are
added). The quotient is an indeterminate form. The first limit seems infinite and the second zero.
A product of increasing terms that are bigger than one increases faster than any of its terms. The second limit
can also be seen as the inverse of the first. The sufficiency and the necessity in these limits are determined by
the behaviour of nY: the first limit is infinite and the second is zero if and only if nY tends to infinite.
(4) lim_{nX→∞,nY→∞} ((nX+a)(nY+b))/(nX+c) and lim_{nX→∞,nY→∞} (nX+c)/((nX+a)(nY+b))
Way 0: The quotient is an indeterminate form. Intuitively, the product of increasing terms increases faster than
any of its terms, and the new rate is the product of the two rates (the exponents are added). The constants are
negligible when they are added to or subtracted from a power. The first limit seems infinite and the second
zero.
Way 1: Formally, we multiply the numerator and the denominator (all their monomials, if they are summations,
or any element, if they are products) by the product of the powers of nX and nY with the highest exponents
lim_{nX→∞,nY→∞} ((nX+a)(nY+b))/(nX+c) = lim_{nX→∞,nY→∞} (nX⁻¹·nY⁻¹·(nX+a)(nY+b))/(nX⁻¹·nY⁻¹·(nX+c)) = lim_{nX→∞,nY→∞} ((1+a/nX)(1+b/nY))/(1/nY + c/(nX·nY)) = ∞
The second limit can also be seen as the inverse of the first, by changing the letter of the constants, so we do
not repeat the calculations. The sufficiency and the necessity in these limits are determined by nY: the first limit
is infinite and the second is zero if and only if nY tends to infinite.
(5) lim_{nX→∞,nY→∞} ((1/nX)·(1/nY))/(1/nX) and lim_{nX→∞,nY→∞} (1/nX)/((1/nX)·(1/nY))
Way 0: Even if the expression can be simplified, we use this case to show that the product of decreasing terms
decreases faster than any of its terms, and the new rate is the product of the two (the exponents are added).
The quotient is an indeterminate form. The first limit seems zero and the second infinite.
A product of decreasing terms that are smaller than one decreases faster than any of its terms. The second
limit can also be seen as the inverse of the first. The sufficiency and the necessity in these limits are determined
by the behaviour of nY: the first limit is zero and the second is infinite if and only if nY tends to infinite.
(6) lim_{nX→∞,nY→∞} (nX+nY)/nX and lim_{nX→∞,nY→∞} nX/(nX+nY)
Way 0: Since
lim_{nX→∞,nY→∞} (nX+nY)/nX = lim_{nX→∞,nY→∞} (1 + nY/nX) = ? and lim_{nX→∞,nY→∞} nX/(nX+nY) = lim_{nX→∞,nY→∞} 1/(1 + nY/nX) = ?
and we have seen that the limits of the new quotients do not exist, it seems that none of the limits exists.
Formally, we could consider the same paths as we considered there. The second limit can also be seen as the
inverse of the first.
(7) lim_{nX→∞,nY→∞} (1/((nX+a)(nY+b)))/(1/(nX+c)) and lim_{nX→∞,nY→∞} (1/(nX+c))/(1/((nX+a)(nY+b)))
Way 1: Formally, we multiply the numerator and the denominator (all their monomials, if they are summations,
or any element, if they are products) by the product of the powers of nX and nY with the highest exponents
lim_{nX→∞,nY→∞} (1/((nX+a)(nY+b)))/(1/(nX+c)) = lim_{nX→∞,nY→∞} (nX+c)/((nX+a)(nY+b)) = 0
The second limit can also be seen as the inverse of the first, by changing the letter of the constants, so we do
not repeat the calculations. As regards the sufficiency and the necessity in these limits, it is determined by the
behaviour of nY: the first limit is zero and the second is infinite if and only if nY tends to infinite.
(8) lim_{nX→∞,nY→∞} ((1/nX)+(1/nY))/(1/nX) and lim_{nX→∞,nY→∞} (1/nX)/((1/nX)+(1/nY))
Way 0: The quotient is an indeterminate form. Intuitively, a sum of decreasing terms decreases like the
slowest one, while the other becomes negligible. Thus, the first limit would be one if the fastest is nY and infinite if the
fastest is nX; and, if both are equal, the limits are two and one over two, respectively. In short, it seems these
limits do not exist.
lim_{nX→∞,nY→∞} (1/nX)/((1/nX)+(1/nY)) = lim_{nX→∞,nY→∞} (nX·nY·(1/nX))/(nX·nY·((1/nX)+(1/nY))) = lim_{nX→∞,nY→∞} nY/(nY+nX) = ?
The second limit can also be seen as the inverse of the first.
(9) lim_{nX→∞,nY→∞} (nX+nY)/(nX·nY) and lim_{nX→∞,nY→∞} (nX·nY)/(nX+nY)
The limit appears in the variance of the estimators of σX²/σY². We solve it in two simple ways, although other
ways are considered as an "intellectual exercise."
Way 1: Since
lim_{nX→∞,nY→∞} (nX·nY)/(nX+nY) = lim_{nX→∞,nY→∞} 1/((1/nY)+(1/nX)) = ∞
and, for the first limit, lim_{nX→∞,nY→∞} (nX+nY)/(nX·nY) = lim_{nX→∞,nY→∞} ((1/nY)+(1/nX)) = 0+0 = 0.
It is sufficient and necessary that both variables tend to infinite. For the necessity, if nX < M < +∞ then
(nX+nY)/(nX·nY) = 1/nY + 1/nX > 1/M > 0
and the limit could not be zero.
Way 2: Firstly, let us suppose, without loss of generality, that nX ≤ nY. Then
0 ≤ lim_{nX→∞,nY→∞} (nX+nY)/(nX·nY) ≤ lim_{nX→∞,nY→∞} (2nY)/(nX·nY) = lim_{nX→∞} 2/nX = 0
(nY has been dropped from the numerator and the denominator; it is not that an iterated limit is being
calculated). Nonetheless, this solution does not consider those paths (for the sample sizes) that cross the
bisector line, that is, when none of the sizes is uniformly behind the other. To complete the proof it is enough
to use again the symmetry of the expression with respect to the two variables (it is the same if we switch
them): for any sequence of values for (nX,nY) crossing the bisector line, an equivalent sequence—in the sense
that the sequence takes the same values—either above or behind the bisector line can be considered by
looking at the bisector line as a mirror or a barrier.
Way 3: Polar coordinates can also be used to study this limit. For any sequence s(k) = (nX(k), nY(k)),
nX(k) = ρ(k)·cos[α(k)] and nY(k) = ρ(k)·sin[α(k)] , with 0 < ρ(k) < ∞ and 0 < α(k) < π/2 ,
where ρ(k) = √(nX(k)² + nY(k)²) and α(k) = arctg( nY(k)/nX(k) ).
A mathematical characterization of a sequence s(k) corresponding to sample sizes that tend to infinite can be
ρ(k) → ∞
in such a way that even when cos[α(k)] → 0 or sin[α(k)] → 0 the products nX(k) = ρ(k)·cos[α(k)]
and nY(k) = ρ(k)·sin[α(k)] still tend to infinite. Then, the limit is calculated as follows
lim_{k→∞} ρ(k)·(cos[α(k)]+sin[α(k)]) / (ρ(k)²·cos[α(k)]·sin[α(k)]) ≤ lim_{k→∞} 2/(ρ(k)·cos[α(k)]·sin[α(k)]) = 0
The only cases that could cause trouble would be those for which either the cosine or the sine tends to zero
(the other tends to one). Nevertheless, the characterization above shows that the denominator would still tend
to infinite. Finally, as regards the necessity, let us suppose, without loss of generality, that nX ≤ M < ∞. Then,
since ρ(k) → ∞ it must be cos[α(k)] → 0 in such a way that nX(k) = ρ(k)·cos[α(k)] ≤ M. As a
consequence,
lim_{k→∞} ρ(k)·(cos[α(k)]+sin[α(k)]) / (ρ(k)²·cos[α(k)]·sin[α(k)]) ≥ lim_{k→∞} (cos[α(k)]+sin[α(k)])/(M·sin[α(k)]) = (0+1)/M = 1/M > 0
(10) lim_{nX→∞,nY→∞} (nX−nY)/(nX·nY) and lim_{nX→∞,nY→∞} (nX·nY)/(nX−nY)
Way 0: Intuitively, the limit of the difference does not exist, since it takes different values that depend on the
path; but the difference—or the summation, in the previous section—is so much smaller than the product that the
first limit seems zero while the second seems infinite. Formally, we can do calculations as for the previous
limit, for example
lim_{nX→∞,nY→∞} (nX−nY)/(nX·nY) = lim_{nX→∞,nY→∞} nX/(nX·nY) − lim_{nX→∞,nY→∞} nY/(nX·nY) = lim_{nX→∞,nY→∞} 1/nY − lim_{nX→∞,nY→∞} 1/nX = 0−0 = 0
Conclusion: We have studied the limits proposed. Some of them were almost trivial, while others involved
indeterminate forms like 0/0 or ∞/∞. Most cases were quotients of polynomials, so the limits of the former
form have been transformed into limits of the latter form. To solve these cases, the technique of multiplying
and dividing by the same quantity has sufficed (there are other techniques, e.g. L'Hôpital's rule). Other
techniques have been applied too.
Additional Examples: Several limits have been solved in the exercises—look for limit in the final index.
My notes:
Exercise 4m (*)
For two positive integers nX and nY, find the (discrete) frontier and the two regions determined by the equality
2(nX+nY) = (nX−nY)²
Discussion: Both sides of the expression are symmetric with respect to the variables, meaning that they are
the same if the two variables are switched. This implies that the frontier we are looking for is symmetric with
respect to the bisector line. The square suggests a parabolic curve, while
2(nX+nY) = (nX−nY)² ↔ 2(1+nX·nY) = (nX−1)² + (nY−1)²
suggests a sort of transformation of a conic curve.
Intuitively, in the region around the bisector line, the difference of the variables is small and therefore
the right-hand side of the original equality is smaller than the left-hand side; obviously, the other region lies on
the other side of the (discrete) frontier.
Purely computational approach: In a previous exercise we wrote some brute-force lines for the
computer to plot the points of the frontier. Here we use the same code to plot the inner region (see the figures
below)
# Collect the points (nx, ny) in the inner region plus the frontier,
# that is, those where 2(nx + ny) >= (nx - ny)^2
N = 100
vectorNx = vector(mode="numeric", length=0)
vectorNy = vector(mode="numeric", length=0)
for (nx in 1:N)
{
  for (ny in 1:N)
  {
    if (2*(nx+ny) >= (nx-ny)^2) { vectorNx = c(vectorNx, nx); vectorNy = c(vectorNy, ny) }
  }
}
plot(vectorNx, vectorNy, xlim=c(0,N+1), ylim=c(0,N+1), xlab='nx', ylab='ny', main='Regions', type='p')
Algebraical-computational approach: Before using the computer, we can do some algebraical work:
$$n_X^2 + n_Y^2 - 2 n_X n_Y = 2 n_X + 2 n_Y \;\leftrightarrow\; n_Y^2 - 2(n_X + 1) n_Y + n_X(n_X - 2) = 0$$
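Seen as a quadratic equation in nY, the quadratic formula gives nY = (nX + 1) ± √(4nX + 1), since (nX + 1)² − nX(nX − 2) = 4nX + 1; hence the integer points of the frontier are those for which 4nX + 1 is a perfect square. A few lines of R reproduce this (a sketch of ours, complementing the brute-force code above):
# Frontier via the quadratic formula: ny = (nx + 1) +/- sqrt(4*nx + 1)
N = 100
nx = 1:N
disc = 4*nx + 1
ok = sqrt(disc) == round(sqrt(disc))   # perfect-square discriminant
nyPlus = (nx + 1) + sqrt(disc)
nyMinus = (nx + 1) - sqrt(disc)
cbind(nx, nyPlus, nyMinus)[ok, ]       # frontier points (discard any ny < 1)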
The change of variables u = nX − nY, v = nX + nY, that is, C1(a1, a2) = (a1 − a2, a1 + a2), has matrix
$\begin{pmatrix} 1 & -1 \\ 1 & 1 \end{pmatrix}$; this matrix reminds us of a rotation in the plane (although movements have orthonormal matrices and
the previous one is only orthogonal, with columns orthogonal but not unitary). Let us have a look at how a triangle, a rigid polygon, is transformed.
To confirm that C1 is a rotation plus a dilatation (homothetic transformation), or vice versa, we consider the
distances between points, the linearity, and a rotation of the axes. First, if
First, if
$$A = (a_1, a_2) \rightarrow \tilde{A} = (a_1 - a_2,\, a_1 + a_2), \qquad B = (b_1, b_2) \rightarrow \tilde{B} = (b_1 - b_2,\, b_1 + b_2),$$
then
$$d(\tilde{A}, \tilde{B}) = \sqrt{[(b_1 - b_2) - (a_1 - a_2)]^2 + [(b_1 + b_2) - (a_1 + a_2)]^2} = \sqrt{[(b_1 - a_1) - (b_2 - a_2)]^2 + [(b_1 - a_1) + (b_2 - a_2)]^2}$$
$$= \sqrt{2(b_1 - a_1)^2 + 2(b_2 - a_2)^2} = \sqrt{2}\, d(A, B),$$
so all distances are multiplied by the same factor √2. Second, as regards the linearity, the point
$$\lambda B + (1 - \lambda) A = (\lambda b_1 + (1 - \lambda) a_1,\, \lambda b_2 + (1 - \lambda) a_2)$$
determines the line containing A and B if λ ∈ℝ and the segment from A to B if λ ∈[0,1] . It is transformed
as follows
$$C_1(\lambda b_1 + (1-\lambda) a_1,\, \lambda b_2 + (1-\lambda) a_2) = (\lambda b_1 + (1-\lambda) a_1 - \lambda b_2 - (1-\lambda) a_2,\, \lambda b_1 + (1-\lambda) a_1 + \lambda b_2 + (1-\lambda) a_2)$$
$$= (\lambda (b_1 - b_2) + (1-\lambda)(a_1 - a_2),\, \lambda (b_1 + b_2) + (1-\lambda)(a_1 + a_2)) = \lambda\, C_1(b_1, b_2) + (1-\lambda)\, C_1(a_1, a_2)$$
(similarly for C2). This expression determines the line containing C1(A) and C1(B) if λ ∈ ℝ and the segment
from C1(A) to C1(B) if λ ∈ [0,1]. Third, as regards the rotation of axes, the following formulas are
general.
For a rotation of the axes through an angle α (rotation sinistrorsum),
$$\begin{cases} \vec{e}_1 = \cos\alpha\, \tilde{e}_1 + \sin\alpha\, \tilde{e}_2 \\ \vec{e}_2 = -\sin\alpha\, \tilde{e}_1 + \cos\alpha\, \tilde{e}_2 \end{cases}$$
For α = π/4,
$$\begin{pmatrix} \vec{e}_1 \\ \vec{e}_2 \end{pmatrix} = \begin{pmatrix} \cos\frac{\pi}{4} & \sin\frac{\pi}{4} \\ -\sin\frac{\pi}{4} & \cos\frac{\pi}{4} \end{pmatrix} \begin{pmatrix} \tilde{e}_1 \\ \tilde{e}_2 \end{pmatrix} = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \\ -1 & 1 \end{pmatrix} \begin{pmatrix} \tilde{e}_1 \\ \tilde{e}_2 \end{pmatrix}$$
Any point P = (x, y) is transformed through
$$\frac{1}{\sqrt{2}} \begin{pmatrix} 1 & -1 \\ 1 & 1 \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} = \frac{1}{\sqrt{2}} \begin{pmatrix} x - y \\ x + y \end{pmatrix} = \begin{pmatrix} u \\ v \end{pmatrix}.$$
The matrix $M = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & -1 \\ 1 & 1 \end{pmatrix}$ is orthogonal, which means that $M M^t = I = M^t M$ and implies that $M^{-1} = M^t$. Then
$$\begin{pmatrix} \tilde{e}_1 \\ \tilde{e}_2 \end{pmatrix} = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & -1 \\ 1 & 1 \end{pmatrix} \begin{pmatrix} \vec{e}_1 \\ \vec{e}_2 \end{pmatrix}.$$
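As a quick check (our own lines, not part of the original solution), we can verify numerically that M is orthogonal and that C1 multiplies distances by √2:
# M is orthogonal and C1 scales every distance by sqrt(2)
M = matrix(c(1, 1, -1, 1), nrow=2) / sqrt(2)   # columns (1,1) and (-1,1)
round(M %*% t(M), 10)                           # identity matrix
C1 = function(p) c(p[1] - p[2], p[1] + p[2])
A = c(1, 2); B = c(4, 6)
sqrt(sum((C1(B) - C1(A))^2)) / sqrt(sum((B - A)^2))   # equals sqrt(2)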
Conclusion: We have applied different approaches to study the frontier and the two regions determined by
the given equality. Fortunately, nowadays the computer allows us to do this work even without any deep
theoretical study (change of variable, transformation, et cetera).
My notes:
Basic Measures
$$\mu = E(X) = \sum_\Omega x_i\, f(x_i) \ \text{(discrete)}, \qquad \mu = E(X) = \int_\Omega x\, f(x)\, dx \ \text{(continuous)}$$
$$\sigma^2 = Var(X) = E([X - \mu]^2) = \cdots = E(X^2) - \mu^2$$
Basic Estimators
$$\bar{X} = \frac{1}{n} \sum_{i=1}^n X_i \qquad s^2 = \frac{1}{n} \sum_{i=1}^n (X_i - \bar{X})^2 = \cdots = \frac{1}{n} \sum_{i=1}^n X_i^2 - \bar{X}^2$$
$$S^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X})^2, \qquad n s^2 = (n-1) S^2, \qquad S_p^2 = \frac{n_X s_X^2 + n_Y s_Y^2}{n_X + n_Y - 2} = \frac{(n_X - 1) S_X^2 + (n_Y - 1) S_Y^2}{n_X + n_Y - 2}$$
$$V^2 = \frac{1}{n} \sum_{i=1}^n (X_i - \mu)^2 \qquad \hat{\eta} = \frac{\sum_{i=1}^n X_i}{n} \qquad \hat{\eta}_p = \frac{n_X \hat{\eta}_X + n_Y \hat{\eta}_Y}{n_X + n_Y}$$
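For illustration, these estimators can be computed in R for a simulated sample (the data are hypothetical; note that var() returns the quasivariance S², with denominator n − 1):
# Basic estimators for a simulated normal sample
set.seed(1)
x = rnorm(10, mean=5, sd=2)
n = length(x)
xbar = mean(x)               # sample mean
S2 = var(x)                  # sample quasivariance (denominator n-1)
s2 = (n - 1) * S2 / n        # sample variance (denominator n)
mu = 5                       # usable only when mu is known
V2 = mean((x - mu)^2)        # estimator V^2 for known mu
c(xbar=xbar, s2=s2, S2=S2, V2=V2)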
1 population                              2 populations
Parameter                  Estimator      Parameter                        Estimator
μ                          X̄              μX – μY                          X̄ – Ȳ
σ²  (μ known)              V²             σX²/σY²  (μX, μY known)          VX² / VY²
σ²  (μ unknown)            s² or S²       σX²/σY²  (μX, μY unknown)        sX²/sY² or SX²/SY²
η                          η̂              ηX – ηY                          η̂X – η̂Y
Parameter | Statistic (normal populations)

μ (σ² unknown): $T(X;\mu) = \dfrac{\bar{X} - \mu}{\sqrt{S^2/n}} \sim t_{n-1}$

σ² (μ known): $T(X;\sigma) = \dfrac{n V^2}{\sigma^2} \sim \chi^2_n$

σ² (μ unknown): $T(X;\sigma) = \dfrac{n s^2}{\sigma^2} = \dfrac{(n-1) S^2}{\sigma^2} \sim \chi^2_{n-1}$

μX–μY (σX², σY² known): $(\bar{X} - \bar{Y}) \sim N\!\left(\mu_X - \mu_Y,\ \sqrt{\dfrac{\sigma_X^2}{n_X} + \dfrac{\sigma_Y^2}{n_Y}}\right)$

μX–μY (σX², σY² unknown): $T(X,Y;\mu_X,\mu_Y) = \dfrac{(\bar{X} - \bar{Y}) - (\mu_X - \mu_Y)}{\sqrt{\dfrac{S_X^2}{n_X} + \dfrac{S_Y^2}{n_Y}}} \sim t_k$, where k is the closest integer to $\dfrac{(S_X^2/n_X + S_Y^2/n_Y)^2}{(S_X^2/n_X)^2/(n_X-1) + (S_Y^2/n_Y)^2/(n_Y-1)}$ (Welch's approximation)

σX²/σY² (μX, μY known): $T(X,Y;\sigma_X,\sigma_Y) = \dfrac{\frac{n_X V_X^2}{n_X \sigma_X^2}}{\frac{n_Y V_Y^2}{n_Y \sigma_Y^2}} = \dfrac{V_X^2/\sigma_X^2}{V_Y^2/\sigma_Y^2} = \dfrac{V_X^2 \sigma_Y^2}{V_Y^2 \sigma_X^2} \sim F_{n_X,\, n_Y}$

σX²/σY² (μX, μY unknown): $T(X,Y;\sigma_X,\sigma_Y) = \dfrac{\frac{(n_X-1) S_X^2}{(n_X-1)\sigma_X^2}}{\frac{(n_Y-1) S_Y^2}{(n_Y-1)\sigma_Y^2}} = \dfrac{S_X^2/\sigma_X^2}{S_Y^2/\sigma_Y^2} = \dfrac{S_X^2 \sigma_Y^2}{S_Y^2 \sigma_X^2} \sim F_{n_X-1,\, n_Y-1}$
1 population, large n
Parameter | Statistic

μ: $T(X;\mu) = \dfrac{\bar{X} - \mu}{\sqrt{?/n}} \xrightarrow{d} N(0,1)$, $\sum_{i=1}^n X_i \xrightarrow{d} N(n\mu,\ \sqrt{n \cdot ?})$, $\bar{X} \xrightarrow{d} N(\mu,\ \sqrt{?/n})$, where ? is substituted by σ², S² or s²

η: $T(X;\eta) = \dfrac{\hat{\eta} - \eta}{\sqrt{?(1-?)/n}} \xrightarrow{d} N(0,1)$, $\hat{\eta} \xrightarrow{d} N\!\left(\eta,\ \sqrt{\dfrac{?(1-?)}{n}}\right)$, where ? is substituted by η or η̂

2 populations, large nX and nY

μX–μY: $(\bar{X} - \bar{Y}) \xrightarrow{d} N\!\left(\mu_X - \mu_Y,\ \sqrt{\dfrac{?_X}{n_X} + \dfrac{?_Y}{n_Y}}\right)$

ηX–ηY: $T(X,Y;\eta_X,\eta_Y) = \dfrac{(\hat{\eta}_X - \hat{\eta}_Y) - (\eta_X - \eta_Y)}{\sqrt{\dfrac{?_X(1-?_X)}{n_X} + \dfrac{?_Y(1-?_Y)}{n_Y}}} \xrightarrow{d} N(0,1)$, where for each population ? is substituted by η or η̂
Remark 1T: For normal populations, the rules that govern the addition and subtraction imply that
$$\bar{X} \sim N\!\left(\mu_x,\ \sqrt{\sigma_x^2/n_x}\right), \qquad \bar{Y} \sim N\!\left(\mu_y,\ \sqrt{\sigma_y^2/n_y}\right), \qquad \text{and hence} \qquad \bar{X} \mp \bar{Y} \sim N\!\left(\mu_x \mp \mu_y,\ \sqrt{\frac{\sigma_x^2}{n_x} + \frac{\sigma_y^2}{n_y}}\right).$$
The tables include results combining these rules with a standardization or studentization. We are usually interested in comparing the means of the two
populations, for which the difference is considered; nevertheless, the addition can also be considered, with the same standard deviation
$$\sqrt{\frac{\sigma_X^2}{n_X} + \frac{\sigma_Y^2}{n_Y}}.$$
On the other hand, since the quality of estimators (e.g. measured through the mean square error) increases with the sample size, when the
parameters of the two populations are supposed to be equal the samples should be merged to estimate the parameter jointly (especially for small nx and
ny). Then, under the hypothesis σx = σy, the pooled sample quasivariance should be used through the statistic:
$$T(X,Y;\mu_X,\mu_Y) = \frac{(\bar{X} - \bar{Y}) - (\mu_X - \mu_Y)}{\sqrt{\dfrac{S_p^2}{n_X} + \dfrac{S_p^2}{n_Y}}} \sim t_{n_X + n_Y - 2}$$
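For illustration, the pooled statistic can be computed with a few lines of R (hypothetical data; the built-in t.test with var.equal=TRUE uses the same pooled quasivariance):
# Pooled two-sample t statistic under the hypothesis sigma_x = sigma_y
set.seed(2)
x = rnorm(8, 10, 2); y = rnorm(12, 10, 2)
nx = length(x); ny = length(y)
Sp2 = ((nx - 1)*var(x) + (ny - 1)*var(y)) / (nx + ny - 2)
T0 = (mean(x) - mean(y)) / sqrt(Sp2/nx + Sp2/ny)   # ~ t_{nx+ny-2} under H0
T0
t.test(x, y, var.equal=TRUE)$statistic              # same value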
Remark 2T: For any populations with finite mean and variance, one version of the Central Limit Theorem implies that
$$\bar{X} \xrightarrow{d} N\!\left(\mu_x,\ \sqrt{\sigma_X^2/n_x}\right), \qquad \bar{Y} \xrightarrow{d} N\!\left(\mu_y,\ \sqrt{\sigma_Y^2/n_y}\right), \qquad \text{and hence} \qquad \bar{X} \mp \bar{Y} \xrightarrow{d} N\!\left(\mu_x \mp \mu_y,\ \sqrt{\frac{\sigma_X^2}{n_x} + \frac{\sigma_Y^2}{n_y}}\right),$$
where the rules that govern the convergence (in distribution) of the addition (and subtraction) of sequences of random variables (see a text on
Probability Theory) and the rules that govern the addition and subtraction of normally distributed variables are applied. We are usually interested in
comparing the means of the two populations, for which the difference is considered; nevertheless, the addition can also be considered with
$$\frac{(\bar{X} \mp \bar{Y}) - (\mu_X \mp \mu_Y)}{\sqrt{\dfrac{?_x}{n_x} + \dfrac{?_y}{n_y}}} \xrightarrow{d} N(0,1) \qquad \text{and, for a Bernoulli population,} \qquad \frac{(\hat{\eta}_X \mp \hat{\eta}_Y) - (\eta_X \mp \eta_Y)}{\sqrt{\dfrac{?_X(1-?_X)}{n_X} + \dfrac{?_Y(1-?_Y)}{n_Y}}} \xrightarrow{d} N(0,1).$$
Besides, variances can be estimated when they are unknown. By applying theorems in section 2.2 of Approximation Theorems of Mathematical
Statistics, by R.J. Serfling, John Wiley & Sons, and sections 7.2 and 7.3 of Probability and Random Processes, by G. Grimmett and D. Stirzaker,
Oxford University Press,
$$\frac{\bar{X} - \mu}{\sqrt{S^2/n}} = \frac{1}{\sqrt{S^2/\sigma^2}} \cdot \frac{\bar{X} - \mu}{\sqrt{\sigma^2/n}} \xrightarrow{d} 1 \cdot N(0,1) = N(0,1) \qquad \text{and} \qquad \frac{\hat{\eta} - \eta}{\sqrt{\dfrac{\hat{\eta}(1-\hat{\eta})}{n}}} = \frac{1}{\sqrt{\dfrac{\hat{\eta}(1-\hat{\eta})}{\eta(1-\eta)}}} \cdot \frac{\hat{\eta} - \eta}{\sqrt{\dfrac{\eta(1-\eta)}{n}}} \xrightarrow{d} N(0,1).$$
d
Similarly for two populations. From the first convergence it is deduced that t n−1 → N (0,1). On the other hand, when the parameters of two
populations are supposed to be equal the samples should be merged to estimate the parameter jointly (especially for medium nx and ny). Then, under
the hypothesis σx = σy the pooled sample quasivariance should be used—although in some cases its effect is negligible—through the statistic:
̄ −Ȳ )−(μ X −μY )
(X d
T ( X , Y ; μ X ,μ Y )= → N (0,1)
√ S 2p S 2p
+
n X nY
For a Bernoulli population, under the hypothesis ηx = ηy, the pooled sample proportion should be used (although in some cases the effect is
negligible) in the denominator of the statistic:
$$T(X,Y;\eta_X,\eta_Y) = \frac{(\hat{\eta}_X - \hat{\eta}_Y) - (\eta_X - \eta_Y)}{\sqrt{\dfrac{\hat{\eta}_p(1-\hat{\eta}_p)}{n_X} + \dfrac{\hat{\eta}_p(1-\hat{\eta}_p)}{n_Y}}} \xrightarrow{d} N(0,1).$$
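For illustration, with hypothetical counts the pooled-proportion statistic is computed as follows (our own sketch):
# Pooled-proportion statistic under the hypothesis eta_x = eta_y
xSucc = 45; nx = 100    # successes and sample size, population X
ySucc = 52; ny = 120    # successes and sample size, population Y
etaX = xSucc/nx; etaY = ySucc/ny
etaP = (xSucc + ySucc) / (nx + ny)    # pooled sample proportion
Z0 = (etaX - etaY) / sqrt(etaP*(1 - etaP)*(1/nx + 1/ny))
Z0                                    # compare with the N(0,1) quantiles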
Remark 3T: In the last tables, the best information available should be used in place of the symbol ?.
Remark 4T: The Bernoulli population is a particular case for which μ = η and σ² = η·(1−η), so X̄ = η̂. When the variance σ² is
directly estimated without estimating η, $\hat{\sigma}^2$ is used in place of the product ?(1−?).
Remark 5T: Once an interval for the variance is obtained, P(a1 < σ² < a2), since the positive square root is a strictly increasing function (and
therefore it preserves the order between two values), an interval for the standard deviation is given by P(√a1 < σ < √a2). (Notice that, for a
reasonable initial interval, 0 < a1.) Similarly for the quotient of two variances σX²/σY².
Parameters | Statistic

θ (1 dimension): $\Lambda = \dfrac{L(X;\theta_0)}{L(X;\theta_1)}$

θ (r dimensions): $\Lambda = \dfrac{L(X;\hat{\theta}_0)}{L(X;\hat{\theta})}$. Asymptotically, $-2\ln(\Lambda) \xrightarrow{d} \chi^2_r$
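For illustration, here is a minimal R sketch of the asymptotic likelihood ratio test for a Bernoulli sample with H0: η = 0.5 (simulated data; r = 1 dimension):
# -2 ln(Lambda) for a Bernoulli sample, H0: eta = 0.5
set.seed(3)
x = rbinom(50, size=1, prob=0.6)
eta0 = 0.5; etaHat = mean(x)                 # restricted and unrestricted estimates
logL = function(eta) sum(dbinom(x, 1, eta, log=TRUE))
lambda = -2 * (logL(eta0) - logL(etaHat))    # -2 ln(Lambda)
lambda > qchisq(0.95, df=1)                  # reject H0 at alpha = 0.05?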
Homogeneity (K classes, L samples): data
$$X_{11}, \ldots, X_{1 n_1}; \quad X_{21}, \ldots, X_{2 n_2}; \quad \ldots; \quad X_{L1}, \ldots, X_{L n_L}$$
H0: the samples come from the same model.
$$T_0(X) = \sum_{i=1}^{L} \sum_{j=1}^{K} \frac{(N_{ij} - \hat{e}_{ij})^2}{\hat{e}_{ij}} \xrightarrow{d} \chi^2_{KL-(L+K-1)} = \chi^2_{(K-1)(L-1)}, \qquad \text{where } \hat{e}_{ij} = n_i \hat{p}_{ij} = n_i \hat{p}_j = n_i \frac{N_{\cdot j}}{n}$$
Independence (KL classes, 2 variables): data $(X_1, Y_1), \ldots, (X_n, Y_n)$. H0: the bivariate sample comes from two independent models.
$$T_0(X,Y) = \sum_{i=1}^{K} \sum_{j=1}^{L} \frac{(N_{ij} - \hat{e}_{ij})^2}{\hat{e}_{ij}} \xrightarrow{d} \chi^2_{KL-(L-1+K-1+1)} = \chi^2_{(K-1)(L-1)}, \qquad \text{where } \hat{e}_{ij} = n \hat{p}_{ij} = n \hat{p}_i \hat{p}_j = n \frac{N_{i\cdot}}{n} \frac{N_{\cdot j}}{n}$$
Remark 6T: Although for different theoretical reasons, for the practical estimation of eij the same mnemonic rule can be used in both
homogeneity and independence tests: for each position, multiply the absolute frequencies of the row and the column and divide by the total number
of elements n.
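For illustration, R's built-in chisq.test applies this rule; the contingency table below is hypothetical:
# Independence test on a hypothetical 2x3 contingency table
tab = matrix(c(20, 30, 25, 35, 15, 25), nrow=2)
out = chisq.test(tab)
out$expected     # row total * column total / n, as in the mnemonic rule
out$statistic    # T0
out$parameter    # degrees of freedom: (K-1)(L-1)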
Homogeneity (2 samples): data
$$X_1, \ldots, X_{n_X}; \qquad Y_1, \ldots, Y_{n_Y}$$
with empirical distribution functions
$$F_{n_X}(t) = \frac{1}{n_X}\,\mathrm{Number}\{X_i \le t\}, \qquad F_{n_Y}(t) = \frac{1}{n_Y}\,\mathrm{Number}\{Y_i \le t\}$$
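For illustration, the empirical distribution functions and the corresponding two-sample Kolmogorov–Smirnov test are available in R (simulated data of ours):
# Empirical distribution functions and two-sample Kolmogorov-Smirnov test
set.seed(4)
x = rnorm(30); y = rnorm(40, mean=0.5)
Fx = ecdf(x); Fy = ecdf(y)   # empirical distribution functions
Fx(0); Fy(0)                 # proportions of observations <= t = 0
ks.test(x, y)                # compares Fx and Fy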
Other Tests: Data | Null Hypothesis | Statistic

(of Position) 1 sample; a position measure Q (e.g. the median) and a value q0. H0: Q = q0.
$$T_0(X) = \sum_i R_i\, \mathbf{1}\{X_i - q_0 > 0\}, \qquad \tilde{T}_0(X) = \frac{T_0(X) - \mu}{\sqrt{\sigma^2}} \to N(0,1)$$
where Ri is the rank of |Xi − q0|, with
$$\mu = \frac{n(n+1)}{4}, \qquad \sigma^2 = \frac{n(n+1)(2n+1)}{24},$$
and using the table of the standard normal distribution.
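For illustration, the statistic and its normal approximation can be computed directly and compared with R's built-in wilcox.test (simulated data; q0 = 10 is a hypothetical value of ours):
# Position test for the median, H0: Q = q0 = 10
set.seed(5)
x = rnorm(25, mean=10.8, sd=2)
q0 = 10
n = length(x)
T0 = sum(rank(abs(x - q0))[x - q0 > 0])   # ranks of |Xi - q0| for positive differences
Ztilde = (T0 - n*(n + 1)/4) / sqrt(n*(n + 1)*(2*n + 1)/24)
Ztilde                                    # compare with N(0,1)
wilcox.test(x, mu=q0)                     # built-in version of the test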
Remark 7s: In the statistics, the parameter of interest is the unknown for confidence intervals while it is supposed to be known for hypothesis tests.
Remark 8s: Usually the estimators involved in the statistic T (like s, S, ...) and the quantiles (like a, ...) also depend on the sample size n, although
the notation is simplified.
Remark 9s: For big sample sizes, when the Central Limit Theorem can be applied to T or its standardization, quantiles or probabilities that are not
tabulated can be approximated: p is directly calculated given a, while, given p, a is calculated from the quantile z of the standard normal
distribution:
$$p = P(T \le a) = P\!\left(Z \le \frac{a - E(T)}{\sqrt{Var(T)}}\right), \qquad z = \frac{a - E(T)}{\sqrt{Var(T)}}, \qquad a = E(T) + z\sqrt{Var(T)}$$
This is used in the asymptotic approximations proposed in the tests of the last table.
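For illustration, with the mean and variance of the previous position statistic (n = 25 is an arbitrary choice of ours):
# Approximating a non-tabulated quantile and probability through N(0,1)
n = 25
ET = n*(n + 1)/4
VT = n*(n + 1)*(2*n + 1)/24
ET + qnorm(0.95) * sqrt(VT)      # approximate quantile a for p = 0.95
pnorm((200 - ET)/sqrt(VT))       # approximate p = P(T <= 200)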
Remark 10s: To consider the approximations, sample sizes bigger than 20 have been proposed in the last table, although it is possible to find other
cutoff values in the literature (like 8, 10 or 30); in practice, there is no severe change at any value.
Remark 11s: The goodness-of-fit chi-square test can also be used to test position measures: by considering two classes with probabilities (p,1–p).
Remark 12s: To test the symmetry of a distribution, the position tests can be used.
Remark 13s: Although different types of test can be applied to evaluate the same hypotheses H0 and H1 with the same α (type I error), their quality
is usually different, and β (type II error) should be taken into account. A global comparison can be done by using their power functions.
My notes:
(Taken from: Newbold, P., W. Carlson and B. Thorne. Statistics for Business and Economics. Pearson-Prentice Hall.)
algebra, 4m
analysis
complex, 3pt
real, 1m, 2m, 3m, 4m
analysis of variance, 1ht-av
ANOVA → analysis of variance
asymptoticness, 3pe-p, 1ci-m, 2ci-m, 3ci-m, 4ci-m, 2ci
(see also 'consistency')
basic estimators, 12pe-p, 13pe-p
(see also 'sample mean', 'sample variance', 'sample quasivariance', 'sample proportion')
Bernoulli distribution, 1pe-m, 3pe-p, 12pe-p, 14pe-p, 3ci-m, 4ci-m, 6ht-T, 1ht-Λ, 1ht, 3pe-ci-ht, 3pt
(see also 'binomial distribution')
bind, 1m
binomial distribution, 1pe-m, 1pt, 3pt
bound, 5pe-p, 1m
(see also Cramér-Rao's lower bound)
characteristic function, 3pt
Chebyshev's inequality, 1ci-s, 1ci, 2ci, 3ci, 4ci
chi-square distribution, 7pe-p, 1pt
chi-square tests,
goodness-of-fit, 2ht-np, 3ht-np, 1ht
homogeneity, 3ht-np
independence, 1ht-np, 3ht-np
completion, 2pe-p, 4pe-p, 5pe-p
standardization, 1pe-p, 3pe-p, 4pe-p, 2pt
complex analysis, 3pt
confidence intervals, 1ci-m, 2ci-m, 3ci-m, 4ci-m, 1ci-s, 1ci, 2ci, 3ci, 4ci, 1pe-ci-ht, 2pe-ci-ht, 3pe-ci-ht
consistency, 6pe-p, 7pe-p, 9pe-p, 10pe-p, 12pe-p, 13pe-p, 14pe-p, 1pe, 2pe, 3pe
convergence → rate of convergence
cook → statistical cook
coordinates
rectangular, 4m
polar, 1m, 3m
Cramér-Rao's lower bound, 9pe-p
critical region, 1ht-T, 2ht-T, 3ht-T, 4ht-T, 5ht-T, 6ht-T, 1ht-Λ, 1ht-av, 1ht-np, 2ht-np, 3ht-np, 1ht, 3pe-ci-ht, 4pe-ci-ht
critical values → critical region
density function → probability function
(see the continuous probability distributions)
differential equation, 3pt
efficiency, 9pe-p, 10pe-p, 3pe
(see also 'relative efficiency')
exponential distribution, 3pe, 1ht-Λ, 3pt
two-parameter (or translated), 6pe-m
exponential function, 1m
factorization theorem, 11pe-p, 3pe
F distribution, 1pt
frontier, 4m
Fubini's theorem, 1m
generating functions
→ probability generating function
→ moments generating function
→ characteristic function
My notes: