Examiners' Commentaries 2020: ST104a Statistics 1
Important note
This commentary reflects the examination and assessment arrangements for this course in the
academic year 2019–20. The format and structure of the examination may change in future years,
and any such changes will be publicised on the virtual learning environment (VLE).
Unless otherwise stated, all cross-references will be to the latest version of the subject guide (2019).
You should always attempt to use the most recent edition of any Essential reading textbook, even if
the commentary and/or online reading list and/or subject guide refer to an earlier edition. If
different editions of Essential reading are listed, please check the VLE for reading supplements – if
none are available, please use the contents list and index of the new edition to find the relevant
section.
General remarks
Learning outcomes
At the end of the half course and having completed the Essential reading and activities you should:
be familiar with the key ideas of statistics that are accessible to a candidate with a
moderate mathematical competence
be able to routinely apply a variety of methods for explaining, summarising and presenting
data and interpreting results clearly using appropriate diagrams, titles and labels when
required
be able to summarise the ideas of randomness and variability, and the way in which these
link to probability theory to allow the systematic and logical collection of statistical
techniques of great practical importance in many applied areas
have a grounding in probability theory and some grasp of the most common statistical
methods
be able to perform inference to test the significance of common measures such as means and
proportions and conduct chi-squared tests of contingency tables
be able to use simple linear regression and correlation analysis and know when it is
appropriate to do so.
You have two hours to complete this paper, which is in two parts. The first part, Section A, is
compulsory; it comprises several subquestions and accounts for 50 per cent of the total marks.
Section B contains three questions, each worth 25 per cent, from which you are asked to choose two.
Remember that each of the Section B questions is likely to cover more than one topic. In 2020, for
example, the first part of Question 2 related to a contingency table while the second part covered
aspects of sampling design. In Question 3, the first part covered correlation and linear regression
while the second part related to statistical inference for the difference between two population
means. Finally, in Question 4, the first part required a boxplot while the second part related to
statistical inference for paired data. This means that it is really important that you make sure you
have a reasonable idea of what topics are covered before you start work on the paper! We suggest
you divide your time as follows during the examination:
Spend the first 10 minutes annotating the paper. Note the topics covered in each question
and subquestion.
Allow yourself 45 minutes for Section A. Do not allow yourself to get stuck on any one
question, but do not just give up after two minutes!
Once you have chosen your two Section B questions, give them about 25 minutes each.
This leaves you with 15 minutes. Do not leave the examination hall at this point! Check
over any questions you may not have completely finished. Make sure you have labelled and
given a title to any tables or diagrams which were required and, if you did more than the
two questions required in Section B, decide which one to delete. Remember that only two of
your answers will be given credit in Section B and that you must choose which these are!
The examiners are looking for very simple demonstrations from you. They want to be sure that you:
have covered the syllabus as described and explained in the subject guide
know the basic formulae given there and when and how to use them
understand and answer the questions set.
You are not expected to write long essays where explanations or descriptions of sampling design
are required, and note-form answers are acceptable. However, clear and accurate language, both
mathematical and written, is expected and marked. The explanations below and in the specific
commentaries for the papers for each zone should make these requirements clear.
The most important thing you can do is answer the question set! This may sound very simple, but
these are some of the things that candidates did not do, though asked, in the 2019 examinations!
Remember the following.
If you are asked to label a diagram (which is almost always the case!), please do so. Writing
‘Histogram’, ‘Stem-and-leaf diagram’, ‘Boxplot’ or ‘Scatter diagram’ in itself is insufficient.
What do the data describe? What are the units? What are the x-axis and y-axis?
If you are specifically asked to perform a hypothesis test, or calculate a confidence interval,
do so. It is not acceptable to do one rather than the other! If you are asked to use a 5%
significance level, this is what will be marked.
Do not waste time calculating things which are not required by the examiners. If you are
asked to find the line of best fit, you will get no extra marks if you calculate the correlation
coefficient as well. If you are asked to use the confidence interval you have just calculated to
comment on the results, carrying out an additional hypothesis test will not gain you marks.
When performing calculations, try to use as many decimal places as possible in intermediate
steps to reach the most accurate solution. We advise keeping at least two decimal places
in general and at least three decimal places when calculating probabilities.
How should you use the specific comments on each question given in the
Examiners’ commentaries?
We hope that you find these useful. For each question and subquestion, they give:
further guidance for each question on the points made in the last section
the answers, or keys to the answers, which the examiners were looking for
the relevant detailed reference to Newbold, P., W.L. Carlson and B.M. Thorne Statistics for
Business and Economics. (London: Prentice–Hall, 2012) eighth edition [ISBN
9780273767060] and the subject guide
where appropriate, suggested activities from the subject guide which should help you to
prepare, and similar questions from Newbold et al. (2012).
Any further references you might need are given in the part of the subject guide to which you are
referred for each answer.
It was noted recently that a small number of candidates appeared to be memorising answers from
previous years’ Examiners’ commentaries, for example plots, and produced the exact same image of
them without looking at the current year’s examination paper questions! Note that this is very easy
to spot. The Examiners’ commentaries should be used as a guide to practise on sample examination
questions and it is pointless to attempt to memorise them.
Many candidates are disappointed to find that their examination performance is poorer than they
expected. This may be due to a number of reasons, but one particular failing is ‘question
spotting’, that is, confining your examination preparation to a few questions and/or topics which
have come up in past papers for the course. This can have serious consequences.
We recognise that candidates might not cover all topics in the syllabus in the same depth, but you
need to be aware that examiners are free to set questions on any aspect of the syllabus. This
means that you need to study enough of the syllabus to enable you to answer the required number of
examination questions.
The syllabus can be found in the Course information sheet available on the VLE. You should read
the syllabus carefully and ensure that you cover sufficient material in preparation for the
examination. Examiners will vary the topics and questions from year to year and may well set
questions that have not appeared in past papers. Examination papers may legitimately include
questions on any topic in the syllabus. So, although past papers can be helpful during your revision,
you cannot assume that topics or specific questions that have come up in past examinations will
occur again.
If you rely on a question-spotting strategy, it is likely you will find yourself in difficulties
when you sit the examination. We strongly advise you not to adopt this strategy.
Candidates should answer THREE questions: all parts of Section A (50 marks in total) and TWO
questions from Section B (25 marks each). Candidates are strongly advised to divide their
time accordingly.
Section A
Question 1
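(a) Suppose that $x_1 = \sqrt{4}$, $x_2 = \sqrt{5}$, $x_3 = 4.5$, $x_4 = -0.8$, and $y_1 = \sqrt{4}$, $y_2 = \sqrt{5}$,
$y_3 = 0$, $y_4 = 2$. Calculate the following quantities:
\[
\text{i. } \sum_{i=1}^{2} x_i y_i \qquad \text{ii. } \sum_{i=3}^{4} x_i^3 y_i^3 \qquad \text{iii. } x_1^4 + \sum_{i=1}^{2} \frac{x_i^2}{y_i^4}.
\]
(6 marks)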
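i. We have:
\[
\sum_{i=1}^{2} x_i y_i = \sqrt{4} \times \sqrt{4} + \sqrt{5} \times \sqrt{5} = 4 + 5 = 9.
\]
ii. We have:
\[
\sum_{i=3}^{4} x_i^3 y_i^3 = (4.5)^3 \times 0^3 + (-0.8)^3 \times 2^3 = 0 + (-0.512) \times 8 = -4.096.
\]
iii. We have:
\[
x_1^4 + \sum_{i=1}^{2} \frac{x_i^2}{y_i^4} = (\sqrt{4})^4 + \left( \frac{(\sqrt{4})^2}{(\sqrt{4})^4} + \frac{(\sqrt{5})^2}{(\sqrt{5})^4} \right) = 16 + \frac{4}{16} + \frac{5}{25} = 16.45.
\]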
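The arithmetic in part (a) can be checked numerically. The following is a minimal Python sketch; the variable names are illustrative and are not part of the examination paper.

```python
import math

# Values from Question 1(a); list index 0 corresponds to i = 1.
x = [math.sqrt(4), math.sqrt(5), 4.5, -0.8]
y = [math.sqrt(4), math.sqrt(5), 0.0, 2.0]

# i. Sum over i = 1, 2 of x_i * y_i
part_i = sum(x[i] * y[i] for i in range(0, 2))

# ii. Sum over i = 3, 4 of x_i^3 * y_i^3
part_ii = sum(x[i] ** 3 * y[i] ** 3 for i in range(2, 4))

# iii. x_1^4 plus the sum over i = 1, 2 of x_i^2 / y_i^4
part_iii = x[0] ** 4 + sum(x[i] ** 2 / y[i] ** 4 for i in range(0, 2))

print(round(part_i, 3), round(part_ii, 3), round(part_iii, 3))
```

Note how the summation limits change between the parts: reading the limits carefully is the main point of this question.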
(b) Classify each one of the following variables as either measurable (continuous) or
categorical. If a variable is categorical, further classify it as either nominal or
ordinal. Justify your answer. (No marks will be awarded without a justification.)
i. Class of cabin on an airliner: ‘first class’, ‘business class’ and ‘economy class’.
ii. The carbon dioxide emissions of an airliner.
iii. The nationalities of airline passengers.
(6 marks)
(c) State whether the following are true or false and give a brief explanation. (No
marks will be awarded for a simple true/false answer.)
i. Skewness of a distribution cannot be determined from a boxplot.
ii. For any event A, it is possible for P (A) + P (Ac ) < 1, where Ac denotes the
complement set.
iii. A standardised normal random variable has a standard deviation equal to its
variance.
iii. True. A standardised normal random variable has a variance of 1, and hence a standard
deviation of 1.
iv. False. A lower (higher) level of confidence means a lower (higher) confidence coefficient,
hence a narrower (wider) confidence interval.
v. False. A cross-sectional design collects data at one moment in time (or a longitudinal
design collects data over time).
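The point made in the answer to iv. can be illustrated numerically: for a fixed standard error, the half-width of a normal-based confidence interval grows with the confidence level. The sketch below assumes a standard error of 1.0, an arbitrary value chosen only for illustration and not taken from the paper.

```python
from statistics import NormalDist

std_error = 1.0  # hypothetical standard error, chosen only for illustration

half_widths = {}
for level in (0.90, 0.95, 0.99):
    # Two-sided interval: (1 - level) / 2 probability in each tail.
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    half_widths[level] = z * std_error
    print(f"{level:.0%} confidence: z = {z:.3f}, half-width = {half_widths[level]:.3f}")
```

The confidence coefficient z increases with the confidence level, so the interval widens.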
(d) A quota sample of total size n = 500 is to be selected. The researcher has
decided to use age group as a quota control. It is known that the composition of
age groups in the population is:
ii. We have:
\[
E(Y^2) = \sum_y y^2 p(y) = 2^2 \times 0.1 + 4^2 \times 0.3 + 6^2 \times 0.4 + 8^2 \times 0.2 = 32.4
\]
hence $\mathrm{Var}(Y) = E(Y^2) - (E(Y))^2 = 32.4 - (5.4)^2 = 3.24$, so the standard deviation is
$\sqrt{3.24} = 1.8$.
Several candidates here mistook $E(Y^2)$ for the variance or even the standard deviation.
Make sure you can confidently distinguish between these. Note also that an alternative
method to find the variance is through the formula $\sum_y (y - \mu)^2 p(y)$, where $\mu$ was found
in part i.
iii. Since:
E(Y ) ± σY ⇒ 5.4 ± 1.8 ⇒ [3.6, 7.2]
this means we require P (Y = 4) + P (Y = 6) = 0.3 + 0.4 = 0.7.
iv. Since the probability masses are not equal for each value of Y , this is not a discrete
uniform distribution.
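The calculations for this probability distribution can be reproduced with a short Python sketch. The probability function is the one used in the solution above; the variable names are illustrative.

```python
import math

# Probability function of Y as used in the solution above.
pmf = {2: 0.1, 4: 0.3, 6: 0.4, 8: 0.2}

mean = sum(y * p for y, p in pmf.items())            # E(Y)
mean_sq = sum(y ** 2 * p for y, p in pmf.items())    # E(Y^2)
variance = mean_sq - mean ** 2                       # Var(Y) = E(Y^2) - (E(Y))^2
sd = math.sqrt(variance)

# Alternative route to the variance: sum of (y - mu)^2 p(y).
variance_alt = sum((y - mean) ** 2 * p for y, p in pmf.items())

# Probability that Y lies within one standard deviation of its mean.
within_one_sd = sum(p for y, p in pmf.items() if mean - sd <= y <= mean + sd)

print(round(mean, 4), round(variance, 4), round(sd, 4), round(within_one_sd, 4))
```

Both routes to the variance agree, which is a useful check in the examination if time allows.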
(g) Two cards are chosen at random from a standard 52-card deck. Consider the
events:
iii. Are the events A and B mutually exclusive? Explain your answer.
(2 marks)
iv. Are the events A and B independent? Explain your answer.
(3 marks)
ii. Since both events can occur simultaneously, by part i. P (A ∩ B) > 0, hence A and B
are not mutually exclusive.
iii. A and B are mutually exclusive – if A occurs, then without replacement B cannot occur.
iv. A and B are not independent. If A occurs, then B cannot occur, but if A does not occur,
then P (B) = 1/51.
Section B
Answer two out of the three questions from this section (25 marks each).
Question 2
(a) A random sample of 3,586 students from various areas is taken to compare their
performance in Physics, based on the final examination mark (low: below 40%,
medium: between 40% and 60%, high: above 60%), in relation to the amount of
private tutoring they receive. The results are summarised in the table below.
                          Performance in Physics
Tutoring             Low         Medium       High           Total
No tutoring          46 (11%)    168 (41%)    196 (48%)      410 (100%)
Some tutoring        100 (5%)    572 (31%)    1,148 (63%)    1,820 (100%)
Frequent tutoring    32 (2%)     248 (18%)    1,076 (79%)    1,356 (100%)
Total                178 (5%)    988 (28%)    2,420 (67%)    3,586 (100%)
i. Based on the data in the table, and without doing a significance test, how
would you describe the relationship between receiving private tutoring and
performance in Physics?
ii. Calculate the χ² statistic and use it to test for independence between private
tutoring and performance in Physics, using a 10% significance level. What do you
conclude?
iii. Would you conclude that private tutoring improves the performance in
Physics? Briefly justify your answer.
(13 marks)
ii. Set out the null hypothesis that there is no association between tutoring frequency and
performance in Physics against the alternative that there is an association. Be careful to get these
the correct way round!
H0 : No association between tutoring frequency and performance in Physics.
vs.
H1 : Association between tutoring frequency and performance in Physics.
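Work out the expected values to obtain the table below:
                     Low        Medium     High
No tutoring          20.3514    112.962    276.687
Some tutoring        90.3402    501.439    1,228.22
Frequent tutoring    67.3084    373.6      915.092
The test statistic formula is:
\[
\sum_{i=1}^{3} \sum_{j=1}^{3} \frac{(O_{i,j} - E_{i,j})^2}{E_{i,j}} \sim \chi^2_{(r-1)(c-1)}
\]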
which gives a test statistic value of 187.913. This is a 3 × 3 contingency table so the
degrees of freedom are (3 − 1) × (3 − 1) = 4.
For α = 0.10, the critical value is $\chi^2_{0.10,\,4} = 7.779$, hence we reject H0 at the 10%
significance level. We conclude that there is evidence of an association between receiving
tutoring and performance in Physics.
Many candidates looked up the tables incorrectly and so failed to follow through their
earlier accurate work.
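The expected values and the test statistic can be reproduced from the observed counts alone. The following is a minimal Python sketch; the critical value is hard-coded from statistical tables for the 10% significance level requested in the question, and the variable names are illustrative.

```python
# Observed counts: rows are tutoring levels, columns are Low, Medium, High.
observed = [
    [46, 168, 196],     # No tutoring
    [100, 572, 1148],   # Some tutoring
    [32, 248, 1076],    # Frequent tutoring
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count under independence: row total * column total / grand total.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Chi-squared statistic: sum of (O - E)^2 / E over all cells.
chi_sq = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(observed) - 1) * (len(observed[0]) - 1)

critical_10pct = 7.779  # upper 10% point of chi-squared with 4 df, from tables
print(f"chi-squared = {chi_sq:.3f}, df = {df}, reject H0: {chi_sq > critical_10pct}")
```

The statistic is far beyond the critical value, so the conclusion does not depend on small rounding differences in the expected values.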
iii. Some reference to correlation and causation is expected here. For example, there is an
association but it could be that those who receive private tutoring may come from
families where there is a strong motivation to do well in school and this is the cause of
better performance in Physics.
(b) i. You have been asked to design a nationwide survey in your country to find
out about the use of Twitter by university students. Provide a probability
sampling scheme and a sampling frame that you would like to use. Identify a
potential source of selection bias that may occur and discuss how this issue
can be addressed.
ii. Describe how you would adapt the sampling design of part i. to create a
longitudinal survey.
(12 marks)
• Sampling scheme: State a scheme (any probability random sampling scheme would
do) and provide a justification. For example, if you went for clustering discuss the
area of the country, or if stratified discuss stratification factors: gender, subject of
study etc., and why these schemes would be advantageous.
• Source of selection bias: Selection bias will arise from the omission of students
at universities whose admission records are more difficult to obtain.
• Way to address it: Reset the target population group to match what the sampling
frame is actually providing.
Example answer 2:
• Sampling frame: Note that despite the target group of ‘university students’, it may
be good to also look at people who are not at a university but otherwise similar (say
aged 18–22). If this is the case, you might suggest using an electoral register and
sampling from this list.
• Sampling scheme: State a scheme (any probability random sampling scheme would
do) and provide a justification. For example, if you went for clustering discuss the
area of the country, or if stratified discuss stratification factors: gender, subject of
study etc., and why these schemes would be advantageous.
• Source of selection bias: Selection bias will arise from the omission of those who are
not on electoral registers.
• Way to address it: Reset the target population group to match what the sampling
frame is actually providing.
ii. Again, an indicative answer would contain the following statement.
‘A cohort of people aged 18 will be chosen (both at a university and not) and will be
re-surveyed each year’ and some critical discussion highlighting potential issues such as
participant dropout, or potential advantages such as identifying the year of study where
use of Twitter becomes more prevalent.
Question 3
(a) A farmer would like to investigate the relationship between the yield of
apple trees and the amount of weeds found in their roots. For this reason, ten
apple trees of the same type were randomly selected and the amount of weeds
in their roots (x grams) was recorded together with their yield (y kilograms).
Tree                  #1   #2   #3   #4   #5   #6   #7   #8   #9   #10
Weeds in roots (x)    30   28   32   25   25   24   22   24   35   40
Yield (y)             25   30   27   40   42   41   50   45   30   25
The summary statistics for these data are:
Sum of x data: 285 Sum of the squares of x data: 8,419
Sum of y data: 355 Sum of the squares of y data: 13,349
Sum of the products of x and y data: 9,718
i. Draw a scatter diagram of these data on the graph paper provided. Label the
diagram carefully.
ii. Calculate the sample correlation coefficient. Interpret your findings.
iii. Calculate the least squares line of y on x and draw the line on the scatter
diagram.
iv. Using the equation you found in iii., obtain the predicted yield of an apple
tree with 37 grams of weeds in its roots. Do you think this value is
realistic? Justify your answer.
v. Briefly comment on the suitability of the simple linear regression model for
the data of this question.
(13 marks)
ii. The summary statistics can be substituted into the formula for the correlation (make
sure you know which one it is!) to obtain the value −0.8492. An interpretation of this
value is the following: the data suggest that the greater the amount of weeds in the
roots, the lower the yield of the apple trees. The fact that the value is close to −1
suggests that this is a strong, negative linear relationship.
Many candidates did not mention all three words (strong, negative, linear). Note that all
of these words provide useful information on interpreting the relationship and are hence
required to obtain full marks.
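The correlation coefficient can be obtained directly from the summary statistics given in the question. The sketch below is illustrative; it takes n = 10, the number of (x, y) pairs listed in the data table.

```python
import math

# Summary statistics given in the question.
n = 10  # number of (x, y) pairs listed in the data table
sum_x, sum_y = 285, 355
sum_x2, sum_y2 = 8419, 13349
sum_xy = 9718

x_bar, y_bar = sum_x / n, sum_y / n

# Corrected sums of squares and cross-products.
s_xy = sum_xy - n * x_bar * y_bar
s_xx = sum_x2 - n * x_bar ** 2
s_yy = sum_y2 - n * y_bar ** 2

r = s_xy / math.sqrt(s_xx * s_yy)
print(f"r = {r:.4f}")
```

Keeping full precision in the intermediate quantities, as here, avoids the rounding errors that cost some candidates accuracy marks.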
iii. The regression line can be written by the equation $\hat{y} = a + bx$ or $y = a + bx + \varepsilon$. The
formula for b is:
\[
b = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{\sum x_i^2 - n\bar{x}^2}
\]
and by substituting the summary statistics we get b = −1.3474. The formula for a is:
\[
a = \bar{y} - b\bar{x}
\]
and we get a = 73.9005. Hence the regression line can be written as:
\[
\hat{y} = 73.9005 - 1.3474x.
\]
this line on the scatter diagram; instead they drew an approximate line trying to go
around the points but without reference to the above equation. No marks were awarded
in such cases.
iv. In this case one can note in the scatter diagram that the points seem to be ‘scattered’
around a straight line. Hence a linear regression model does seem to be a good model.
According to the model, the predicted yield is 73.9005 − 1.3474 × 37 = 24.047 kilograms.
Many candidates did not provide units here. It is essential to do so in order to obtain full
marks. To assess whether this prediction is realistic, good answers commented on the
validity of the prediction with statements such as ‘the point is very close to the limits of the
data’ or ‘the relationship does not appear to be linear’.
v. Some discussion is expected here, on the non-linear association implied by the scatter
diagram.
(b) A company wants to check the quality of its customer service regarding web
chat enquiries. More specifically, the manager wants to compare the waiting
times until each enquiry was answered during the years 2013 and 2012.
Unfortunately, extensive records of the company are not available, and he can
only check a random sample of web chat enquiries within these two years. The
available data, measured in minutes of waiting times, are provided below:
H0 : µA = µB vs. H1 : µA ≠ µB .
The test statistic value is 2.3225 (or 2.3624 if pooled variance used).
The sample size is quite large, hence the standard normal distribution can be used due to
the central limit theorem. Nevertheless the use of a t60 distribution is also reasonable.
The critical values at the 5% significance level are ±1.96 (2.00 if a t60 distribution is
used), hence we reject the null hypothesis. If we take a (smaller) α such as a 1%
significance level, the critical values are ±2.576 (2.66 if a t60 distribution is used), so we
do not reject H0 . We conclude that there is moderate evidence of a difference in the
mean waiting times between the two years.
Some candidates stated assumptions in this part that were not made in part i. Marks
were not awarded in such cases. Also, some other candidates just copied the phrase
‘assumption about equal variances’ and naturally were not awarded any marks. One
should state whether the calculations were based on the assumption that unknown
variances are equal or unequal.
Regarding the assumption which is most likely to be violated the following statement is
an indicative answer: ‘Independent samples assumption is most likely to be violated
since the sampling was done in two successive years’.
H0 : µA = µB vs. H1 : µA > µB
Also make sure to get the correct z-values: ≈ 1.645 for a 5% significance level, and ≈ 2.32
for a 1% significance level (1.671 and 2.39, respectively, if a t60 distribution is used).
Based on these, the result is borderline highly significant (moderately significant will also
do), i.e. the mean waiting time in 2013 was less than that of 2012.
Question 4
(a) i. Carefully construct a boxplot to display the following annual before tax
earnings for the employees of a company, measured in £000s:
35, 26, 22, 24, 21, 57, 36, 35, 29, 47, 30 and 36.
ii. Based on the shape of the box plot you have drawn, describe the distribution
of the data.
iii. Name two other types of graphical displays that would be suitable to
represent the data. Briefly explain your choices.
iv. Provide a hypothesis test statistic for the hypothesis that the mean income is
equal to £33,000. Comment on the suitability of this test to the data of this
question.
(13 marks)
Note that in order to draw it you will need to calculate the median and lower/upper
quartiles. These in turn will allow you to determine the outlier limits, the
extreme outlier limits and the whiskers. These calculations are summarised below.
• Quartiles: 25, 32.5, 36, hence IQR = 36 − 25 = 11 (should be consistent with the Q1
and Q3 values).
• Outlier limits: lower is Q1 − 1.5 × IQR = 8.5, upper is Q3 + 1.5 × IQR = 52.5.
• Outlier: 57.
Marks were also awarded for accurately drawing the figure. Note that the numbers
added in the figure above were for illustration purposes.
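The boxplot calculations can be sketched in Python as follows. The quartile convention used here (medians of the lower and upper halves of the ordered data) is an assumption that reproduces the values above; other conventions, including those built into some software, can give slightly different quartiles.

```python
from statistics import median

earnings = [35, 26, 22, 24, 21, 57, 36, 35, 29, 47, 30, 36]  # in £000s
data = sorted(earnings)
n = len(data)

# Assumed quartile convention: medians of the lower and upper halves
# of the ordered data (the middle value is excluded when n is odd).
half = n // 2
lower_half = data[:half]
upper_half = data[half:] if n % 2 == 0 else data[half + 1:]

q1, q2, q3 = median(lower_half), median(data), median(upper_half)
iqr = q3 - q1
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

outliers = [v for v in data if v < lower_limit or v > upper_limit]
# Whiskers extend to the most extreme observations inside the limits.
inside = [v for v in data if lower_limit <= v <= upper_limit]
whiskers = (min(inside), max(inside))

print(q1, q2, q3, iqr, lower_limit, upper_limit, outliers, whiskers)
```

The whiskers stop at the most extreme observations inside the outlier limits, not at the limits themselves.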
ii. The distribution of the data appears to be positively/right-skewed. This is also
supported by the fact that the mean is larger than the median.
iii. The variable income is measurable, hence these graphs are suitable for displaying the
distribution of such variables:
∗ histogram
∗ stem-and-leaf diagram
∗ dot plot.
iv. A suitable test statistic would be (x̄ − 33)/(S/√n), where S is the sample standard deviation and n = 12, which assumes the data to be normally
distributed. Nevertheless, the assumption of normality here is questionable due to the
presence of skewness. The t distribution is a better choice although still symmetric.
Taking a log-transform may help.
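For illustration, the value of such a test statistic can be computed as below. This sketch assumes the usual one-sample form, with the sample standard deviation divided by the square root of n in the denominator; that reading of the denominator is an assumption, and the variable names are illustrative.

```python
import math
from statistics import mean, stdev

earnings = [35, 26, 22, 24, 21, 57, 36, 35, 29, 47, 30, 36]  # in £000s
mu_0 = 33  # hypothesised mean, in £000s

n = len(earnings)
x_bar = mean(earnings)
s = stdev(earnings)             # sample standard deviation (divisor n - 1)
std_error = s / math.sqrt(n)    # estimated standard error of the sample mean
t_stat = (x_bar - mu_0) / std_error

print(f"mean = {x_bar:.3f}, s = {s:.3f}, t = {t_stat:.3f} on {n - 1} df")
```

The sample mean is very close to the hypothesised value, so the statistic is small; the comments above about skewness still apply to the validity of the test.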
(b) A new fitness programme is devised for people who would like to lose weight.
Each participant’s weight in kg was measured before and after the fitness
programme to see if it is effective in reducing weight. The following data
were obtained.
Participant #1 #2 #3 #4 #5 #6 #7 #8
Before 86 65 74 55 93 52 66 67
After 81 62 71 49 83 51 61 63
5, 3, 3, 6, 10, 1, 5 and 4.
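The differences, and a paired-sample t statistic based on them, can be sketched in Python as follows. This is the standard paired t calculation under the null hypothesis of zero mean difference, offered as an illustration rather than as the examiners' exact working; the variable names are illustrative.

```python
import math
from statistics import mean, stdev

before = [86, 65, 74, 55, 93, 52, 66, 67]
after = [81, 62, 71, 49, 83, 51, 61, 63]

# Differences computed as before - after, so positive values mean weight loss.
diffs = [b - a for b, a in zip(before, after)]

n = len(diffs)
d_bar = mean(diffs)
s_d = stdev(diffs)                      # sample standard deviation of the differences
t_stat = d_bar / (s_d / math.sqrt(n))   # tests H0: mean difference = 0

print(diffs)
print(f"mean difference = {d_bar:.3f}, s_d = {s_d:.3f}, t = {t_stat:.3f} on {n - 1} df")
```

Working with the differences turns the paired problem into a one-sample problem, which is why the two samples are not treated as independent.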
Note that it is perfectly acceptable to compute ‘after − before’; the signs are simply reversed in
this case (i.e. all negative in this instance). We have: