Ef 22 Ene
1. (5 points) A survey conducted in 2020 among companies from the European Union, the United Kingdom and
the US, included a question regarding their plans to modify their investments in that year due to the COVID-19
pandemic. Among 536 companies from the European Union, 234 answered that they planned to reduce their
investments due to COVID. Answer the following questions, based on this information:
(a) (1 point) Obtain a 99% confidence interval for the proportion of European Union companies that planned
to reduce their investments in 2020. Comment on your assumptions and result, and justify them.
(b) (0.5 points) Without conducting any additional computations, comment if you would reject the hypothesis
that the proportion of EU companies that planned to reduce their investments in 2020 is equal to 50%, for
a significance level of 1%. Justify your answer.
(c) (1 point) You wish to study if EU companies have a more negative view of their investments due to COVID-
19 than those in the US. For a significance level of 1%, you are asked if you have sufficient evidence to reject
the hypothesis that the proportion of EU companies planning to reduce their investments due to COVID-19
in 2020 is larger than 48% (the value for US companies in the survey). Define your null and alternative
hypotheses, find the p-value for the test and comment on your conclusion.
(d) (1 point) Obtain the power of the preceding test if the true value of p, the proportion of EU companies
planning to reduce their investments due to COVID-19 in 2020, were equal to 0.45.
The preceding survey included a question on the impact of corporate law and regulations (for each country)
on corporate investment decisions. The following values sum up the survey information corresponding to the
proportions of companies, measured as percentages and aggregated by country, which considered these regulations
a serious obstacle for their investment plans. The values correspond to the surveys conducted in 2019 (values
denoted as xi ) and 2020 (values denoted as yi ):
$$\sum_{i=1}^{27} x_i = 739.13, \quad \sum_{i=1}^{27} x_i^2 = 28514.47, \quad \sum_{i=1}^{27} y_i = 666.24, \quad \sum_{i=1}^{27} y_i^2 = 22494.69$$
Assume that we have two simple random samples. It has been verified that the data follow (approximately)
normal distributions with the same variance.
(e) (1 point) Conduct a hypothesis test, at a 5% significance level, to determine if the (population) mean of
the percentages over the different countries has decreased from 2019 to 2020 (that is, if the companies perceive
fewer obstacles for their investment plans), as opposed to the commonly accepted assumption that this
percentage mean has not changed from one year to the next. Define the null and alternative hypotheses,
find the rejection region for this test and comment on your conclusions.
To conduct this test, take into account that you have two paired samples. It also holds that:
(f) (0.5 points) Indicate if the following statements are true or false. Justify your answers; your score for this
question will be based on your justifications.
i. In a one-sided hypothesis test for the mean, H0 : µ ≤ 5, you are told that the power of the test for
µ = 7 equals 0.75. Then, if the true value of µ were equal to 8, the probability of failing to reject the
null hypothesis cannot be larger than 0.25.
ii. The value of the power function for a test H0 : µ = µ0 with normal data and unknown variance cannot
be smaller than its significance level α, for any value of the parameter µ.
Solution.
(a) The given sample has a sufficiently large size (n = 536 ≫ 30) to assume that the CLT offers a reasonable
approximation. In this case, the formula for a confidence interval for the population proportion is given by
$$\mathrm{CI}_{\alpha}(p) = \hat{p} \mp z_{\alpha/2} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}.$$
The values we need to replace in this formula are
$$n = 536, \quad \hat{p} = 234/536 = 0.437, \quad \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} = \sqrt{\frac{0.437 \times 0.563}{536}} = 0.0214, \quad z_{0.005} = 2.576,$$
and the requested interval is given by:
$$\mathrm{CI}_{0.99}(p) = 0.437 \mp 2.576 \times 0.0214 \approx [0.382 ;\ 0.492].$$
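As a numerical cross-check of the interval in part (a), here is a minimal Python sketch that uses only the standard library. The variable names are illustrative and not part of the exam. Because the script works with unrounded intermediate values, its bounds may differ in the third decimal from a hand computation based on the rounded figures 0.437 and 0.0214.

```python
from math import sqrt
from statistics import NormalDist

n = 536           # EU companies in the survey
successes = 234   # companies planning to reduce investments
conf_level = 0.99

p_hat = successes / n
se = sqrt(p_hat * (1 - p_hat) / n)                  # estimated standard error
z = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)  # z_{0.005}

ci_low, ci_high = p_hat - z * se, p_hat + z * se
print(f"p_hat = {p_hat:.3f}, se = {se:.4f}, z = {z:.3f}")
print(f"99% CI: [{ci_low:.3f}, {ci_high:.3f}]")
print("0.5 inside interval:", ci_low <= 0.5 <= ci_high)
```

The last line reports whether the reference value 0.5 falls inside the interval, which is the comparison that part (b) relies on.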
(b) The reference value given in the statement (p0 = 0.5) does not lie within the confidence interval we have
just computed. As a consequence, for a significance level of 1% (under the assumptions we have introduced)
we can reject the hypothesis that the proportion of all companies in the EU that were planning to invest
less in 2020 due to COVID was equal to 50%.
This observation follows from
$$p_0 > \hat{p} + z_{\alpha/2} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \iff t = \frac{\hat{p} - p_0}{\sqrt{\hat{p}(1-\hat{p})/n}} < -z_{\alpha/2}.$$
(c) The null and alternative hypotheses for this test are
H0 : p ≥ p0 = 0.48
H1 : p < 0.48
As we assume we can apply the CLT, the statistic for our test is
$$T = \frac{\hat{P} - p_0}{\sqrt{p_0(1-p_0)/n}} \sim N(0, 1).$$
From the values we have already computed, the statistic for this test takes the value (under the null
hypothesis)
$$t = \frac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}} = \frac{0.437 - 0.48}{\sqrt{0.48 \times 0.52/536}} = -1.993.$$
We obtain
p-value = Pr(Z < t) = Pr(Z < −1.993) = 0.0232
As this p-value is larger than 0.01, we fail to reject H0 . As a consequence, for a significance level of 0.01 we
conclude that we do not have sufficient evidence to believe that the proportion of companies in the EU that
had decided to invest less in 2020 due to COVID was lower than 48% (lower than the estimated proportion
for the USA).
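The test statistic and p-value of part (c) can be reproduced with a short Python sketch (standard library only; the variable names are illustrative). It evaluates the statistic twice: once with the proportion rounded to 0.437, as in the hand computation, and once with the unrounded proportion, to show that the conclusion at the 1% level does not depend on the rounding.

```python
from math import sqrt
from statistics import NormalDist

n, p0, alpha = 536, 0.48, 0.01
se0 = sqrt(p0 * (1 - p0) / n)      # standard error under H0

def left_tailed_test(p_hat):
    """Return (statistic, p-value) for H0: p >= p0 against H1: p < p0."""
    t = (p_hat - p0) / se0
    return t, NormalDist().cdf(t)

t_rounded, p_rounded = left_tailed_test(0.437)    # rounded proportion
t_exact, p_exact = left_tailed_test(234 / n)      # unrounded proportion

print(f"rounded:   t = {t_rounded:.3f}, p-value = {p_rounded:.4f}")
print(f"unrounded: t = {t_exact:.3f}, p-value = {p_exact:.4f}")
print("reject H0 at 1%:", p_rounded < alpha, p_exact < alpha)
```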
(d) To obtain the requested power (approximately, using the CLT), we apply the definition of the power function
for this test. Under our assumptions
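$$\begin{aligned}
\text{power}(p = 0.45) &= \Pr(\text{reject } H_0 \mid p = 0.45) = \Pr\left( \frac{\hat{P} - p_0}{\sqrt{p_0(1-p_0)/n}} < -z_{0.01} \;\middle|\; p = 0.45 \right) \\
&= \Pr\left( \frac{\hat{P} - p + p - p_0}{\sqrt{p(1-p)/n}} \sqrt{\frac{p(1-p)}{p_0(1-p_0)}} < -z_{0.01} \;\middle|\; p = 0.45 \right) \\
&= \Pr\left( Z < -z_{0.01} \sqrt{\frac{p_0(1-p_0)}{p(1-p)}} + \frac{p_0 - p}{\sqrt{p(1-p)/n}} \;\middle|\; p = 0.45 \right)
\end{aligned}$$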
where Z denotes a standard normal random variable, and we have taken into account that the statistic
$$\frac{\hat{P} - p}{\sqrt{p(1-p)/n}} \sim N(0, 1)$$
follows (approximately) a standard normal distribution when p = 0.45.
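The last expression in the power derivation can be evaluated numerically. The following Python sketch (standard library only; names are illustrative) plugs in n = 536, p0 = 0.48, p = 0.45 and a 1% significance level. It is meant as a check of the formula under the CLT approximation, not as the official solution value.

```python
from math import sqrt
from statistics import NormalDist

n, p0, p_true, alpha = 536, 0.48, 0.45, 0.01
std_normal = NormalDist()
z_alpha = std_normal.inv_cdf(1 - alpha)          # z_{0.01}

# Terms of Pr(Z < -z * sqrt(p0(1-p0)/(p(1-p))) + (p0 - p)/sqrt(p(1-p)/n))
scale = sqrt(p0 * (1 - p0) / (p_true * (1 - p_true)))
shift = (p0 - p_true) / sqrt(p_true * (1 - p_true) / n)
power = std_normal.cdf(-z_alpha * scale + shift)

print(f"z_alpha = {z_alpha:.3f}, scale = {scale:.4f}, shift = {shift:.4f}")
print(f"approximate power at p = {p_true}: {power:.4f}")
```

Under these inputs the power comes out well below 0.5, which reflects how close p = 0.45 is to p0 = 0.48 relative to the standard error.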
(e) We define the requested test as
H0 : µY ≥ µX
H1 : µY < µX
As the two samples are assumed to be paired samples, to conduct this test we define a new random variable
D = X − Y . From the assumptions in the description of the problem, this random variable will follow a
normal distribution with unknown variance. The test we wish to conduct can be written as
H0 : µD ≤ 0
H1 : µD > 0
The test statistic and its distribution in this case are given by
$$T = \frac{\bar{D} - \mu_D}{s_D / \sqrt{n_D}} \sim t_{n_D - 1}.$$
$$\bar{d} = \bar{x} - \bar{y} = 2.700,$$
$$s_D^2 = s_X^2 + s_Y^2 - 2\,\mathrm{cov}(x, y) = 318.487 + 232.879 - 2 \times 255.59 = 40.179, \quad s_D = 6.339$$
$$t = \frac{\bar{d}}{s_D / \sqrt{n_D}} = \frac{2.7}{6.339 / \sqrt{27}} = 2.213$$
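The quantities of the paired test can be recomputed from the sums in the statement with a short Python sketch (standard library only; names are illustrative). Two inputs are assumptions taken from outside the sums: the covariance 255.59 is the value used in the computation above, and the critical value 1.706 for t_{26;0.05} is taken from a standard Student t table rather than from the text. Small differences in the last decimals relative to the hand computation come from rounding.

```python
from math import sqrt

n = 27
sum_x, sum_x2 = 739.13, 28514.47   # 2019 values
sum_y, sum_y2 = 666.24, 22494.69   # 2020 values
cov_xy = 255.59    # sample covariance as used in the solution (assumed input)
t_crit = 1.706     # t_{26;0.05} from a Student t table (assumed input)

mean_x, mean_y = sum_x / n, sum_y / n
var_x = (sum_x2 - n * mean_x**2) / (n - 1)
var_y = (sum_y2 - n * mean_y**2) / (n - 1)

d_bar = mean_x - mean_y                    # mean of D = X - Y
var_d = var_x + var_y - 2 * cov_xy         # variance of the differences
t_stat = d_bar / sqrt(var_d / n)

print(f"s2_X = {var_x:.3f}, s2_Y = {var_y:.3f}, s2_D = {var_d:.3f}")
print(f"d_bar = {d_bar:.3f}, t = {t_stat:.3f}")
print("t in rejection region (t > t_crit):", t_stat > t_crit)
```

With these inputs the statistic exceeds the table value, which would point towards rejecting H0 : µD ≤ 0 at the 5% level.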
where we have used the property that the probability of a type II error is one minus the power, for a
parameter value that lies within the alternative hypothesis.
ii. TRUE. The power is an increasing function of the true value of the parameter as we move away from
the region under the null hypothesis, in this case when we move away from µ = µ0 . As a consequence,
its minimum value is attained when µ = µ0 . It also holds that for µ = µ0 , the value of the power
function equals the probability of rejecting the null hypothesis when it is true. That is, it is equal to
the significance level α, the probability of a type I error.
2. (5 points) You are interested in studying the relationship between the gross domestic product per capita for
different European countries and the values of different socioeconomic variables for these countries. You have
collected data from EUROSTAT for 2019 and different variables of interest corresponding to 32 countries. In
particular, to study the impact of the educational level of the population on the income level, you have obtained
information on the GDP per capita (measured in thousands of euros per capita), as your dependent variable Y ,
and the percentage of the population with higher education degrees (the percentage of the population between
25 and 34 years old with a University degree), as your independent variable X1 . For the data collected for these
two variables it holds that:
$$\sum_{i=1}^{32} x_i = 1366.8, \quad \sum_{i=1}^{32} x_i^2 = 60599.66, \quad \sum_{i=1}^{32} y_i = 1036.39, \quad \sum_{i=1}^{32} y_i^2 = 48067.99$$
$$\sum_{i=1}^{32} x_i y_i = 47451.37$$
We assume that these variables satisfy the assumptions for the simple linear regression model. Answer the
following questions:
(a) (0.5 points) Compute the value of the correlation coefficient for these two variables. Interpret this value.
(b) (0.5 points) Compute the least squares estimates for the coefficients for this simple linear regression model
from the preceding data. Interpret these values.
We are told that the sum of the squares of the residuals for this linear model is 9934.46.
(c) (0.5 points) Test if the variable Y depends linearly on X1 , for a significance level of 5%. Justify your answer.
Would you reach the same conclusion for a significance level of 1%?
(d) (0.5 points) Compute the value of the coefficient of determination for this model and interpret this value.
(e) (0.75 points) Obtain a point estimate and a 95% confidence interval for the forecast corresponding to a
value of X1 equal to 46.5 (the value corresponding to Spain). Justify and interpret your answer.
To improve our understanding of the impact of different variables on the value of the GDP per capita, you
have decided to expand the preceding model by taking into account two additional independent variables: X2 ,
the percentage of the total employment corresponding to knowledge intensive services, and X3 , measuring the
energy consumption per capita in the country (as equivalent tons of oil per person per year). Some of the results
obtained after fitting this multiple linear regression model are shown in the following table:
The values shown as “—” will have to be computed to answer some of the following questions:
(f) (0.5 points) Obtain the ANOVA table for this model, if you are told that the sum of squares of the residuals
for this expanded model is 4326.70.
(g) (0.75 points) Indicate which variables in this multiple linear regression model are significant, for a
significance level of 1%. If the p-value for the F ratio is $1.63 \times 10^{-7}$, comment if this model is globally
significant.
(h) (0.5 points) Compute the value of the coefficient of multiple determination for this linear regression model.
Interpret this value and compare it with the one obtained for the simple linear regression model in the first
part of this exercise.
(i) (0.5 points) For a simple linear regression model, indicate if the following statements are true or false.
Justify your answers; your score for this question will be based on your justifications.
i. You are told that, for a given value of n, the estimate for the slope of the linear regression model takes
a very small value (as an absolute value). In this case, the model can only be significant if the variance
of the independent variable is very large.
ii. If the value of the (sample) correlation coefficient between two variables, cor(x, y), is negative, then the
sign of the slope estimate β̂1 must also be negative. (We assume that neither of the two variances, $s_X^2$
and $s_Y^2$, is equal to zero.)
Solution.
(a) To compute the correlation coefficient we start by obtaining the following values:
We have
$$\mathrm{cov}(x, y) = \frac{47451.37 - 32 \times 42.712 \times 32.387}{31} = 102.728, \quad \mathrm{cor}(x, y) = \frac{\mathrm{cov}(x, y)}{\sqrt{s_X^2 s_Y^2}} = 0.561$$
and we have that the correlation coefficient is positive, but not close to one. Thus, we conclude that
these two variables present a weak positive linear correlation, indicating the possible existence of a linear
relationship between both variables.
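The sample means, variances, covariance and correlation follow directly from the five sums in the statement. A minimal Python sketch of that computation (standard library only; names are illustrative):

```python
from math import sqrt

n = 32
sum_x, sum_x2 = 1366.8, 60599.66    # X1: % with higher education
sum_y, sum_y2 = 1036.39, 48067.99   # Y: GDP per capita (thousand euros)
sum_xy = 47451.37

mean_x, mean_y = sum_x / n, sum_y / n
var_x = (sum_x2 - n * mean_x**2) / (n - 1)         # s_X^2
var_y = (sum_y2 - n * mean_y**2) / (n - 1)         # s_Y^2
cov_xy = (sum_xy - n * mean_x * mean_y) / (n - 1)
cor_xy = cov_xy / sqrt(var_x * var_y)

print(f"mean_x = {mean_x:.3f}, mean_y = {mean_y:.3f}")
print(f"s2_X = {var_x:.3f}, s2_Y = {var_y:.3f}")
print(f"cov = {cov_xy:.3f}, cor = {cor_xy:.3f}")
```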
(b) We apply the formulas for the least squares estimators to obtain the following estimates for the parameters
of the linear regression model,
$$\hat{\beta}_1 = \frac{\mathrm{cov}(x, y)}{s_X^2} = \frac{102.728}{71.620} = 1.434,$$
$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 32.387 - 1.434 \times 42.712 = -28.878$$
The value of β̂1 represents our estimate for the expected value of the change in the GDP per capita when
the percentage of the population between 25 and 34 years old with a higher education degree increases by
one unit. In this case, our estimate is 1434 euros. The value of β̂0 is the expected value of the GDP per
capita when the percentage of the population between 25 and 34 years old with a higher education degree
equals 0. In this case we have a negative value, which has no reasonable interpretation.
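The least squares estimates can be checked from the same sums with a short Python sketch (standard library only; names are illustrative). It works with unrounded intermediate values, so the intercept may differ slightly in the third decimal from the hand computation.

```python
n = 32
sum_x, sum_x2 = 1366.8, 60599.66
sum_y = 1036.39
sum_xy = 47451.37

mean_x, mean_y = sum_x / n, sum_y / n
sxx = sum_x2 - n * mean_x**2          # (n - 1) * s_X^2
sxy = sum_xy - n * mean_x * mean_y    # (n - 1) * cov(x, y)

beta1_hat = sxy / sxx                 # slope: cov(x, y) / s_X^2
beta0_hat = mean_y - beta1_hat * mean_x

print(f"beta1_hat = {beta1_hat:.3f} (thousand euros per percentage point)")
print(f"beta0_hat = {beta0_hat:.3f}")
```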
(c) To test if the variable Y depends linearly on X1 , for a significance level of 5%, we will conduct the following
hypothesis test:
H0 : β1 = 0
H1 : β1 ̸= 0
and the critical value is $t_{30;0.025} = 2.042$. Our sample lies in the rejection region, as $t > t_{30;0.025}$, and as
a consequence we reject the null hypothesis for the indicated significance level. That is, we conclude that
there is a significant linear relationship between the two variables of interest.
For a significance level of 1% the critical value is $t_{30;0.005} = 2.750$, and the value of our test statistic still
lies within the rejection region. Thus, we again conclude that there exists a significant linear relationship
between the two variables of interest for this second significance level.
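The value of the slope statistic can be reconstructed from quantities given elsewhere in this exercise. The sketch below (Python, standard library only; names are illustrative) assumes the usual form of the statistic under H0, namely the slope estimate divided by its standard error sqrt(s_R^2 / ((n − 1) s_X^2)), and compares it with the two critical values quoted above.

```python
from math import sqrt

n = 32
beta1_hat = 1.434      # slope estimate from part (b)
var_x = 71.620         # s_X^2
ss_res = 9934.46       # residual sum of squares given in the statement
t_crit_5, t_crit_1 = 2.042, 2.750   # t_{30;0.025} and t_{30;0.005} from the text

s2_res = ss_res / (n - 2)                      # residual variance s_R^2
se_beta1 = sqrt(s2_res / ((n - 1) * var_x))    # standard error of the slope
t_stat = beta1_hat / se_beta1                  # statistic under H0: beta1 = 0

print(f"s2_R = {s2_res:.3f}, se(beta1) = {se_beta1:.4f}, t = {t_stat:.3f}")
print("reject at 5%:", abs(t_stat) > t_crit_5)
print("reject at 1%:", abs(t_stat) > t_crit_1)
```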
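(d) The value of the coefficient of determination $R^2$ is given by
$$R^2 = 1 - \frac{SCR}{SCT} = 1 - \frac{9934.46}{14502.237} = 0.315,$$
where $SCT = (n-1)s_Y^2 = 14502.237$ is the total sum of squares.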
This value indicates that the linear model is able to explain 31.5% of the variability in the dependent
variable. But note that 68.5% of this variability is left unexplained by the model, corresponding to the
values of the residuals in this model.
(e) The point estimator and the estimate for the forecast corresponding to x0 = 46.5 are given by
Replacing the values from the preceding questions, and using t30;0.025 = 2.042, we have
$$\mathrm{CI}_{0.95}(y_0) = 37.82 \pm 2.042 \sqrt{331.149} \sqrt{1 + \frac{1}{32} + \frac{(46.5 - 42.712)^2}{31 \times 71.620}} = [-0.039 ;\ 75.678]$$
If we repeated this estimation experiment many times, the observed value of the dependent variable
corresponding to the given value of X would lie within the resulting interval 95% of the time.
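The forecast and its interval can be recomputed with a short Python sketch (standard library only; names are illustrative). It uses the rounded inputs quoted in this solution and writes the point forecast as ȳ + β̂1 (x0 − x̄), which is algebraically the same as β̂0 + β̂1 x0. The bounds may differ from the values above in the second or third decimal because of rounding in the inputs, in particular in the critical value.

```python
from math import sqrt

n, x0 = 32, 46.5
mean_x, mean_y = 42.712, 32.387
var_x = 71.620          # s_X^2
beta1_hat = 1.434
s2_res = 331.149        # residual variance s_R^2
t_crit = 2.042          # t_{30;0.025} as quoted in the text

y0_hat = mean_y + beta1_hat * (x0 - mean_x)    # point forecast
half_width = t_crit * sqrt(s2_res) * sqrt(
    1 + 1 / n + (x0 - mean_x) ** 2 / ((n - 1) * var_x)
)
pi_low, pi_high = y0_hat - half_width, y0_hat + half_width

print(f"forecast = {y0_hat:.2f}")
print(f"95% interval: [{pi_low:.3f}, {pi_high:.3f}]")
```

The interval is very wide relative to the forecast, which is consistent with the large residual variance of the simple model.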
(f) To generate the ANOVA table we note that
The degrees of freedom are k = 3 for the model and n − k − 1 = 28 for the residuals. The mean squares
are given by SCM/k = 3391.85 and SCR/(n − k − 1) = $s_R^2$ = 154.52. Finally, F is the ratio
of the two preceding values, F = (SCM/k)/(SCR/(n − k − 1)) = 21.95. The resulting ANOVA table is
ANOVA table
Source       Degrees of freedom   Sums of squares   Mean squares   F ratio
Regr.        3                    10175.541         3391.85        21.95
Residuals    28                   4326.70           154.52
Total        31                   14502.237
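The entries of the ANOVA table can be reproduced from the total and residual sums of squares. A minimal Python sketch (standard library only; names are illustrative), assuming the decomposition SCT = SCM + SCR:

```python
n, k = 32, 3
ss_total = 14502.237    # SCT, total sum of squares from the table
ss_res = 4326.70        # SCR, residual sum of squares from the statement

ss_model = ss_total - ss_res       # SCM
df_model, df_res = k, n - k - 1
ms_model = ss_model / df_model     # mean square of the regression
ms_res = ss_res / df_res           # residual variance s_R^2
f_ratio = ms_model / ms_res

print(f"Regr.      df = {df_model:2d}  SS = {ss_model:10.3f}  MS = {ms_model:8.2f}  F = {f_ratio:.2f}")
print(f"Residuals  df = {df_res:2d}  SS = {ss_res:10.3f}  MS = {ms_res:8.2f}")
print(f"Total      df = {n - 1:2d}  SS = {ss_total:10.3f}")
```

The regression sum of squares obtained by subtraction may differ from the tabulated 10175.541 in the third decimal, an effect of rounding in the inputs.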
(g) The missing value in the table for the t statistic of the variable X2 (the value of the statistic for the
significance test of this variable) is given by
$$t = \frac{\hat{\beta}_2}{s(\hat{\beta}_2)} = \frac{1.8516}{0.4039} = 4.584$$
This statistic follows a Student t distribution with n − k − 1 = 28 degrees of freedom. The p-value for the
significance test is
$$\text{p-value} = 2 \Pr(T_{n-k-1} > t) = 2 \Pr(T_{28} > 4.584)$$
From the tables, this p-value lies in the interval (0; 0.001] (the exact p-value is $8.65 \times 10^{-5}$), implying that
this coefficient is significant for a 1% significance level. X2 is the only significant variable, as the coefficients
for the other two independent variables have very large p-values (much higher than 0.01), implying that
they are not significant.
Our conclusion is that the employment level in knowledge intensive services has a very significant impact
on the GDP per capita level of a country.
The model is also globally significant, as the p-value corresponding to the F ratio, $1.63 \times 10^{-7}$, is very
small, much lower than 1% ($1.63 \times 10^{-7} \ll 0.01$).
(h) The coefficient of multiple determination for this linear regression model can be obtained from
$$R^2 = 1 - \frac{SCR}{SCT} = 1 - \frac{4326.70}{14502.237} = 0.702.$$
This value implies that 70.2% of the total variability in the values of Y can be explained by the multiple
linear regression model. This model has a much higher explanatory power than the simple linear regression
model in the first part of the exercise, as this initial model had R2 = 0.315.
The values of the employment in knowledge intensive services provide much better information to explain
the values of the GDP per capita of a country than the values for the higher education level of its population,
for example.
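The two coefficients of determination compared above can be computed side by side. A minimal Python sketch (standard library only; names are illustrative), assuming both models share the total sum of squares 14502.237 from the ANOVA table:

```python
ss_total = 14502.237      # SCT, common to both models
ss_res_simple = 9934.46   # residual sum of squares, simple model (X1 only)
ss_res_multiple = 4326.70 # residual sum of squares, model with X1, X2, X3

r2_simple = 1 - ss_res_simple / ss_total
r2_multiple = 1 - ss_res_multiple / ss_total

print(f"R2 (simple)   = {r2_simple:.3f}")
print(f"R2 (multiple) = {r2_multiple:.3f}")
print(f"gain in explained variability = {r2_multiple - r2_simple:.3f}")
```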
(i) The answers are:
i. FALSE. The statistic used to conduct inference on the slope of the regression line is
$$T = \frac{\hat{\beta}_1 - \beta_1}{\sqrt{\dfrac{s_R^2}{(n-1)s_X^2}}}$$
For the model to be significant we need |t| to be sufficiently large (computed under β1 = 0), so that we
reject the null hypothesis of the significance test. If β̂1 takes a small value, t may still be large
if $s_X^2$ is sufficiently large, if $s_R^2$ is sufficiently small (even when $s_X^2$ is not large), or both.
ii. TRUE. The correlation coefficient has the same sign as the covariance, because
$$\mathrm{cor}(x, y) = \frac{\mathrm{cov}(x, y)}{s_X s_Y},$$
and $s_X > 0$, $s_Y > 0$. Also, the estimate of the slope for the linear model has the same sign as the
covariance, as
$$\hat{\beta}_1 = \frac{\mathrm{cov}(x, y)}{s_X^2},$$
and $s_X^2 > 0$.