ST3189 - Machine Learning - 2020 Examiners Commentaries
Important note
This commentary reflects the examination and assessment arrangements for this course in the
academic year 2019–20. The format and structure of the examination may change in future years,
and any such changes will be publicised on the virtual learning environment (VLE).
Unless otherwise stated, all cross-references will be to the latest version of the course (2018). You
should always attempt to use the most recent edition of any Essential reading textbook, even if the
commentary and/or online reading list and/or subject guide refer to an earlier edition. If different
editions of Essential reading are listed, please check the VLE for reading supplements – if none are
available, please use the contents list and index of the new edition to find the relevant section.
General remarks
Learning outcomes
At the end of the course and having completed the essential reading and activities you should be
able to:
Planning your time in the examination

You have two hours to complete this paper, which consists of four compulsory questions. Remember that each of these questions is likely to cover more than one topic. This means that it is really important to make sure you have a reasonable idea of which topics are covered before you start work on the paper! We suggest you divide your time as follows during the examination:
• Spend the first 10 minutes annotating the paper. Note the topics covered in each question and subquestion.
• Allow yourself 25 minutes for each question. Do not allow yourself to get stuck on any one question, but do not just give up after two minutes!
• This leaves you with 10 minutes. Do not leave the examination hall at this point! Check over any questions you may not have completely finished. Make sure you have labelled and given a title to any tables or diagrams which were required.
The examiners are looking for very simple demonstrations from you. They want to be sure that you:

• have covered the syllabus as described and explained in the course material
• know the basic concepts given and, more importantly, when and how to use them
• understand and answer the questions set.
You are not expected to write long essays with lengthy explanations. However, clear and accurate
language, both mathematical and written, is expected and marked. The explanations below and in
the specific commentary for the examination paper should make these requirements clear.
The most important thing you can do is answer the question set! This may sound very simple, but these are things that candidates often fail to do, even though they are asked! Remember:
• If you are asked to label a diagram (which is almost always the case!), please do so. What do the data describe? What are the units? What are the x and y axes?
• Do not waste time calculating things which are not required by the examiners.
• When making calculations, try to use as many decimal places as possible to reach the most accurate solution. It is advisable to keep at least two decimal places in general, and at least three decimal places when calculating probabilities.
How should you use the specific comments on each question given in the Examiners' commentaries?
We hope that you find these useful. For each question and subquestion, they give:

• further guidance for each question on the points made in the last section
• the answers, or keys to the answers, which the examiners were looking for
• where appropriate, suggested activities from the course material which should help you to prepare, as well as similar questions.
Any further references you might need are given in the part of the course to which you are referred
for each answer.
It is generally noted in similar examination papers that a small number of candidates appear to memorise answers from previous years' Examiners' commentaries – for example, plots – and reproduce them exactly without looking at the examination questions at all! Note that this is very easy to spot. The Examiners' commentaries should be used as a guide for practising on sample examination questions; it is pointless to attempt to memorise them.
Many candidates are disappointed to find that their examination performance is poorer than they
expected. This may be due to a number of reasons, but one particular failing is ‘question
spotting’, that is, confining your examination preparation to a few questions and/or topics which
have come up in past papers for the course. This can have serious consequences.
We recognise that candidates might not cover all topics in the syllabus in the same depth, but you
need to be aware that examiners are free to set questions on any aspect of the syllabus. This
means that you need to study enough of the syllabus to enable you to answer the required number of
examination questions.
The syllabus can be found in the Course information sheet available on the VLE. You should read
the syllabus carefully and ensure that you cover sufficient material in preparation for the
examination. Examiners will vary the topics and questions from year to year and may well set
questions that have not appeared in past papers. Examination papers may legitimately include
questions on any topic in the syllabus. So, although past papers can be helpful during your revision,
you cannot assume that topics or specific questions that have come up in past examinations will
occur again.
If you rely on a question-spotting strategy, it is likely you will find yourself in difficulties
when you sit the examination. We strongly advise you not to adopt this strategy.
Comments on specific questions
Candidates should answer all FOUR questions. All questions carry equal marks.
Question 1
(a) Indicate whether the following statements are true or false. Briefly justify your
answers.
i. Other things equal, a classifier trained on less training data is more likely to
overfit.
(3 marks)
ii. Consider the lasso approach to perform variable selection when regressing S&P 500 index daily returns on the daily returns of a number of selected stocks. One can apply the basic 5-fold or 10-fold cross-validation method to choose the optimal tuning parameter, $\hat\lambda$, and consistently get good predictive performance.
(3 marks)
iii. To perform variable selection, one can use the information criterion approach, including AIC and BIC, to select the model size. Compared with AIC, BIC tends to select a larger model size when the sample size n becomes sufficiently large.
(3 marks)
Reading for this question

Part i. is on overfitting and training/test error, which can be found in several sections of the James et al. textbook, such as Section 2.2 and Chapter 5 on resampling methods. Part ii. is on cross-validation, which is covered in Section 5.1 of the same book. Part iii. is on the Bayesian and Akaike information criteria, which can again be found in several places throughout the textbook; for example, page 78.
Approaching the question
Remember that the justification has to be one sentence, so organise your thoughts accordingly and avoid lengthy answers. Some 'good answers' are provided below. Note that there can be more than one 'correct' answer to some of these questions.
i. True. As the training sample size decreases, the relative model complexity increases and hence the classifier is more likely to overfit the data.

ii. False. Theoretically, the basic K-fold cross-validation approach cannot be applied to time-dependent data. In practice one can still apply it, but the out-of-sample predictive performance may be poor.

iii. False. Compared with AIC, BIC corresponds to a larger penalty term, since log n > 2 for sufficiently large n (n ≥ 8), so minimising the BIC results in a smaller model size.
Overall, candidates did well on parts i. and iii., but not so well on part ii.
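To make the point in part ii. concrete, the sketch below contrasts basic K-fold cross-validation with a rolling-origin (time-ordered) scheme when tuning the lasso. It is only an illustration: scikit-learn is assumed, and simulated data stand in for real S&P 500 and stock returns.

```python
# Sketch: tuning the lasso on time-ordered data (scikit-learn assumed).
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import KFold, TimeSeriesSplit

rng = np.random.default_rng(0)
n, p = 500, 20
X = rng.standard_normal((n, p))                 # stand-in for stock returns
y = X[:, :3] @ np.array([0.5, -0.3, 0.2]) + 0.1 * rng.standard_normal(n)

# Basic K-fold CV lets "future" observations inform past predictions,
# which is invalid for time-dependent data (look-ahead bias).
basic = LassoCV(cv=KFold(n_splits=5)).fit(X, y)

# Rolling-origin CV always validates on observations that come after
# the training fold, respecting the time ordering.
rolling = LassoCV(cv=TimeSeriesSplit(n_splits=5)).fit(X, y)

print(basic.alpha_, rolling.alpha_)             # the selected lambda values
```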
(b) Consider the penalised optimisation problem:

$$\hat\theta = \arg\min_\theta \left\{ \frac{1}{2}(z - \theta)^2 + p_\lambda(|\theta|) \right\}. \tag{1}$$

i. Take $p_\lambda(|\theta|) = \lambda|\theta|$, where $\lambda \ge 0$. Show that the solution to equation (1) is:

$$\hat\theta = \operatorname{sign}(z)\,(|z| - \lambda)\, I(|z| > \lambda)$$

where sign(z) denotes the sign of z and I(·) denotes an indicator function, which is 1 if the statement in (·) is true and 0 otherwise.

(6 marks)

ii. Take $p_\lambda(|\theta|) = \lambda_1\theta^2 + \lambda_2|\theta|$, where $\lambda_1, \lambda_2 \ge 0$ are two tuning parameters. What is the solution to equation (1)?

(Hint: Rewrite the optimisation problem in terms of the form in equation (1) and then apply the result in part i.)

(6 marks)
Approaching the question

i. • We have $\hat\theta = 0$ if:

$$\frac{1}{2}z^2 \le \frac{1}{2}(z - \theta)^2 + \lambda\theta$$

i.e. we have:

$$z \le \frac{\theta}{2} + \lambda.$$

• We will have $\hat\theta > 0$ if we consider taking the first derivative of:

$$\frac{1}{2}(z - \theta)^2 + \lambda\theta$$

with respect to $\theta$, hence obtaining:

$$\hat\theta - z + \lambda = 0 \;\Rightarrow\; \hat\theta = z - \lambda > 0.$$

• Similarly, we will have $\hat\theta < 0$ if we consider taking the first derivative of:

$$\frac{1}{2}(z - \theta)^2 - \lambda\theta$$

with respect to $\theta$, hence obtaining:

$$\hat\theta - z - \lambda = 0 \;\Rightarrow\; \hat\theta = z + \lambda < 0.$$

Combining the three cases gives $\hat\theta = \operatorname{sign}(z)(|z| - \lambda)I(|z| > \lambda)$.
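As a quick numerical sanity check (not part of the original answer), the soft-thresholding formula can be compared with a grid minimisation of the objective; NumPy is assumed and the values of z and λ are illustrative.

```python
# Sketch: verify theta_hat = sign(z) * (|z| - lam) * I(|z| > lam).
import numpy as np

def soft_threshold(z, lam):
    return np.sign(z) * max(abs(z) - lam, 0.0)

z, lam = 1.7, 0.6
theta = np.linspace(-5.0, 5.0, 200001)          # fine grid over theta
objective = 0.5 * (z - theta) ** 2 + lam * np.abs(theta)
print(theta[np.argmin(objective)], soft_threshold(z, lam))  # both 1.1
```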
ii. Rewrite the optimisation problem in terms of the form in part i. We have:

$$\frac{1}{2}(z - \theta)^2 + \lambda_1\theta^2 + \lambda_2|\theta| = \frac{1}{2}\left( (1 + 2\lambda_1)\theta^2 - 2\theta z + \frac{z^2}{1 + 2\lambda_1} \right) + \lambda_2|\theta| + \text{constant}$$

$$= \text{constant} + \frac{1}{2}\left( \frac{z}{\sqrt{1 + 2\lambda_1}} - \sqrt{1 + 2\lambda_1}\,\theta \right)^2 + \frac{\lambda_2}{\sqrt{1 + 2\lambda_1}}\,\sqrt{1 + 2\lambda_1}\,|\theta|$$

$$= \text{constant} + \frac{1}{2}\left( \tilde z - \tilde\theta \right)^2 + \tilde\lambda\,|\tilde\theta|$$

where:

$$\tilde z = \frac{z}{\sqrt{1 + 2\lambda_1}}, \qquad \tilde\theta = \sqrt{1 + 2\lambda_1}\,\theta \qquad \text{and} \qquad \tilde\lambda = \frac{\lambda_2}{\sqrt{1 + 2\lambda_1}}.$$

Applying the result in part i., we have:

$$\sqrt{1 + 2\lambda_1}\,\hat\theta = \operatorname{sign}\!\left( \frac{z}{\sqrt{1 + 2\lambda_1}} \right) \frac{|z| - \lambda_2}{\sqrt{1 + 2\lambda_1}}\; I\!\left( \frac{|z|}{\sqrt{1 + 2\lambda_1}} > \frac{\lambda_2}{\sqrt{1 + 2\lambda_1}} \right)$$

i.e.:

$$\hat\theta = \operatorname{sign}(z)\, \frac{|z| - \lambda_2}{1 + 2\lambda_1}\; I(|z| > \lambda_2).$$
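The part ii. closed form can be checked the same way, again as a sketch with illustrative values of z, λ1 and λ2.

```python
# Sketch: verify theta_hat = sign(z) * (|z| - lam2) / (1 + 2*lam1) * I(|z| > lam2).
import numpy as np

z, lam1, lam2 = 2.0, 0.5, 0.8
theta = np.linspace(-5.0, 5.0, 400001)
objective = 0.5 * (z - theta) ** 2 + lam1 * theta ** 2 + lam2 * np.abs(theta)
closed_form = np.sign(z) * max(abs(z) - lam2, 0.0) / (1 + 2 * lam1)
print(theta[np.argmin(objective)], closed_form)  # both 0.6
```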
Note: These questions were rather more difficult from an algebraic point of view and were not answered well by all candidates. However, overall there were no major problems with this question.
(c) The table below is the training data with n = 6 observations of a 3-dimensional input $x = (x_1, x_2, x_3)^T$ and a qualitative output y (the colour is blue or red).
i    x1    x2    x3    y
1     0     3     0    red
2     2     0     0    red
3     0     1     3    red
4     0     1     2    blue
5    −1     0     1    blue
6     1     1     1    blue
Question 2

Consider $n$ independent observations $x = (x_1, \ldots, x_n)$ from the model with density $f(x \mid \theta) = \theta x \exp(-\theta x^2/2)$, for $x > 0$ and parameter $\theta > 0$.

(a) Assign the Gamma(α, β) prior on θ, for α, β > 0, and derive the corresponding posterior distribution.
(5 marks)
(b) Derive the Jeffreys prior for θ. Use it to obtain the corresponding posterior
distribution.
(7 marks)
(c) Provide a normal approximation to the posterior distribution of θ that
converges to the true posterior as the sample size increases.
(4 marks)
(d) Assuming a quadratic error loss function, find the Bayes’ estimator for each of
the posteriors in parts (a), (b) and (c).
(4 marks)
(e) Let y represent a future observation from the same model. Find the posterior
predictive distribution of y for one of the posteriors in parts (a) or (b).
(5 marks)
This question is examining Bayesian inference which can be found in Block 4 of the VLE section
of the course. Read the parts on ‘Bayesian Inference Essentials’ and ‘Bayesian Inference
Examples’. Exercises 1–6 as well as Exercise 2 of the mock examination are relevant for practice
(try to do a few of them).
Approaching the question

(a) The posterior is proportional to the likelihood times the prior:

$$\pi(\theta \mid x) \propto \theta^n \exp\left( -0.5\,\theta \sum_{i=1}^n x_i^2 \right) \times \theta^{\alpha - 1} \exp(-\beta\theta) = \theta^{n + \alpha - 1} \exp\left( -\theta \left( \beta + 0.5 \sum_{i=1}^n x_i^2 \right) \right)$$

which can be recognised as the kernel of the $\text{Gamma}\!\left( n + \alpha,\; \beta + 0.5 \sum_{i=1}^n x_i^2 \right)$ distribution.
(b) The log-likelihood is $\ell(\theta \mid x) = n \log\theta + \sum_{i=1}^n \log x_i - 0.5\,\theta \sum_{i=1}^n x_i^2$, so the Fisher information is $I(\theta) = -\mathrm{E}\left[ \partial^2 \ell / \partial\theta^2 \right] = n/\theta^2$ and the Jeffreys prior is $\pi_J(\theta) \propto \sqrt{I(\theta)} \propto 1/\theta$. The corresponding posterior is:

$$\pi_J(\theta \mid x) \propto \theta^{n - 1} \exp\left( -0.5\,\theta \sum_{i=1}^n x_i^2 \right)$$

which can be recognised as the kernel of the $\text{Gamma}\!\left( n,\; 0.5 \sum_{i=1}^n x_i^2 \right)$ distribution.
(c) We will need the mode of the likelihood, $\theta_M$, and the Fisher information. The latter was derived in part (b) and is equal to $n/\theta^2$. For the former we set:

$$\frac{n}{\theta_M} - 0.5 \sum_{i=1}^n x_i^2 = 0$$

or else:

$$\theta_M = \frac{2n}{\sum_{i=1}^n x_i^2}$$

since:

$$\frac{\partial^2 \ell(\theta \mid x)}{\partial\theta^2} < 0.$$

The requested normal approximation to the posterior has:

$$\text{mean} = \frac{2n}{\sum_{i=1}^n x_i^2} \qquad \text{and} \qquad \text{variance} = \frac{\theta_M^2}{n}.$$
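The quality of this approximation is easy to examine numerically. The sketch below (SciPy assumed; data simulated) compares the exact Jeffreys-prior posterior from part (b) with the normal approximation, using the fact that $f(x \mid \theta) = \theta x \exp(-\theta x^2/2)$ is a Rayleigh density with scale $1/\sqrt{\theta}$.

```python
# Sketch: exact Gamma posterior vs. its normal approximation.
import numpy as np
from scipy import stats

theta_true, n = 2.0, 200
rng = np.random.default_rng(1)
x = stats.rayleigh.rvs(scale=1/np.sqrt(theta_true), size=n, random_state=rng)

s = 0.5 * np.sum(x ** 2)
posterior = stats.gamma(a=n, scale=1/s)        # Gamma(n, 0.5 * sum x_i^2)
theta_m = n / s                                # mode of the likelihood
approx = stats.norm(loc=theta_m, scale=theta_m/np.sqrt(n))

# The two central 95% intervals nearly coincide for large n.
print(posterior.ppf([0.025, 0.975]))
print(approx.ppf([0.025, 0.975]))
```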
(d) Under quadratic error loss, the Bayes' estimator is the posterior mean in each of these cases. For part (a) it is equal to:

$$\frac{n + \alpha}{\beta + 0.5 \sum_{i=1}^n x_i^2}.$$

For parts (b) and (c) it is equal to $2n \Big/ \sum_{i=1}^n x_i^2$: the mean of the $\text{Gamma}\!\left( n,\, 0.5 \sum_{i=1}^n x_i^2 \right)$ posterior and of its normal approximation, respectively.
(e) Consider the posterior $\pi_J(\theta \mid x)$ corresponding to the Jeffreys prior. The predictive distribution for $y > 0$ and $x_i > 0$, for $i = 1, \ldots, n$, is:

$$f(y \mid x) = \int_\Theta f(y \mid \theta, x)\, \pi_J(\theta \mid x)\, d\theta = \int_\Theta f(y \mid \theta)\, \pi_J(\theta \mid x)\, d\theta$$

$$= \int_0^\infty \theta y \exp(-0.5\,\theta y^2)\, \frac{\left( 0.5 \sum_{i=1}^n x_i^2 \right)^n}{\Gamma(n)}\, \theta^{n - 1} \exp\left( -0.5\,\theta \sum_{i=1}^n x_i^2 \right) d\theta$$

$$= \frac{y \left( \sum_{i=1}^n x_i^2 \right)^n}{2^n\, \Gamma(n)} \int_0^\infty \theta^{(n+1) - 1} \exp\left( -\frac{\theta}{2}\left( y^2 + \sum_{i=1}^n x_i^2 \right) \right) d\theta$$

$$= \frac{y \left( \sum_{i=1}^n x_i^2 \right)^n}{2^n\, \Gamma(n)} \cdot \frac{\Gamma(n + 1)\, 2^{n+1}}{\left( y^2 + \sum_{i=1}^n x_i^2 \right)^{n+1}} = \frac{2n\, y \left( \sum_{i=1}^n x_i^2 \right)^n}{\left( y^2 + \sum_{i=1}^n x_i^2 \right)^{n+1}}.$$
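A Monte Carlo average of $f(y \mid \theta)$ over posterior draws provides a sanity check of this closed form; the sketch below assumes SciPy and uses illustrative values for $n$, $\sum x_i^2$ and $y$.

```python
# Sketch: Monte Carlo check of the posterior predictive density.
import numpy as np
from scipy import stats

n, sum_x2, y0 = 50, 60.0, 0.9                   # illustrative values
rng = np.random.default_rng(2)
theta = stats.gamma(a=n, scale=2/sum_x2).rvs(size=200_000, random_state=rng)

# Average f(y0 | theta) over draws from the Jeffreys-prior posterior.
mc = np.mean(theta * y0 * np.exp(-0.5 * theta * y0 ** 2))
# Closed form, evaluated in log space to avoid overflow for large n.
closed = np.exp(np.log(2 * n * y0) + n * np.log(sum_x2)
                - (n + 1) * np.log(y0 ** 2 + sum_x2))
print(mc, closed)                               # the two should agree
```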
Question 3

(a) A bootstrap sample is obtained by sampling n observations with replacement from an original sample of n observations.

i. Show that the probability that the jth observation is not in the bootstrap sample is:

$$\left( 1 - \frac{1}{n} \right)^{n}.$$
ii. Provide the probability that the jth observation is in the bootstrap sample
for n = 5.
iii. Name a potential use for the observations that are not in the bootstrap
sample.
iv. Suppose that we want to predict Y for a particular value of the predictor X
based on a machine learning method. Describe how we can estimate the
standard deviation of our prediction.
(12 marks)
Approaching the question

i. Each of the n draws is made with replacement, so a single draw misses the jth observation with probability 1 − 1/n. Since the draws are independent, the probability that the jth observation appears in none of them is (1 − 1/n)^n.

ii. For n = 5, the probability that the jth observation is in the bootstrap sample is 1 − (1 − 1/5)^5 = 1 − 0.32768 = 0.67232.

iii. They can serve as a test sample. This will provide the out-of-bag error.

iv. If we use some machine learning method to make a prediction for the response Y for a particular value of the predictor X, we might estimate the standard deviation of our prediction by using the bootstrap approach. The bootstrap works by repeatedly sampling observations (with replacement) from the original dataset B times, for some large value of B, each time fitting a new model and subsequently obtaining the root mean squared error by contrasting the predictions produced by each of these B models with the actual observations.
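A minimal sketch of the part iv. idea follows, assuming scikit-learn; linear regression stands in for an arbitrary machine learning method, and this variant summarises the spread of the B bootstrap predictions at the chosen value of X.

```python
# Sketch: bootstrap standard deviation of a prediction at a fixed x0.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 100
X = rng.uniform(0, 5, size=(n, 1))
y = 1.0 + 2.0 * X[:, 0] + rng.standard_normal(n)
x0 = np.array([[2.5]])                     # the value of X of interest

B = 1000
preds = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n, size=n)       # resample rows with replacement
    model = LinearRegression().fit(X[idx], y[idx])
    preds[b] = model.predict(x0)[0]

print(preds.std(ddof=1))                   # bootstrap SD of the prediction
```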
(b) The tree in Figure 1 provides a regression tree based on a dataset regarding fuel consumption (in miles per gallon) for 392 cars. The variables appearing in the regression tree are cylinders: the number of cylinders (between 4 and 8), and horsepower: engine horsepower.
i. Provide an interpretation of this tree.
(6 marks)
ii. Create a diagram that represents the partition of the predictors’ space
according to the tree of Figure 1.
(7 marks)
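Although the expected answer here is a verbal interpretation and a hand-drawn partition diagram, it may help to see how such a tree is produced. The sketch below assumes scikit-learn and uses simulated Auto-style data, so its splits will not match Figure 1.

```python
# Sketch: fit and print a small regression tree on simulated car data.
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(3)
n = 392
cylinders = rng.integers(4, 9, size=n)
horsepower = rng.uniform(46, 230, size=n)
mpg = 45 - 2.5 * cylinders - 0.08 * horsepower + rng.standard_normal(n)

X = np.column_stack([cylinders, horsepower])
tree = DecisionTreeRegressor(max_depth=2).fit(X, mpg)
# Each leaf is a rectangle in the (cylinders, horsepower) plane; its
# prediction is the mean mpg of the training cars in that rectangle.
print(export_text(tree, feature_names=["cylinders", "horsepower"]))
```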
Question 4
(a) Some of the most commonly-used types of linkage in hierarchical clustering are:
• Single linkage: distance between clusters is the minimum distance between
any pair of points from two clusters.
• Complete linkage: distance between clusters is the maximum distance
between any pair of points from two clusters.
• Average linkage: distance between clusters is the average distance between
any pair of points from two clusters.
i. Which of the three types of linkage described above would most likely result
in clusters most similar to those given by k-means?
(4 marks)
ii. Consider the data in Figure 2(a). What would be the result if we extract two
clusters from the tree given by hierarchical clustering on this dataset using
single linkage? Describe your answer in terms of the labels 1–4 given to the
four ‘clumps’ in the data. Do the same for complete linkage and average
linkage.
(5 marks)
iii. Which of the three types of linkage (if any) would successfully separate the two 'half-moons' in Figure 2(b)? What about Figure 2(c)? Briefly explain your answer.

(6 marks)
i. Average linkage, since, like k-means, it is based on average distances between and within clusters.

ii. Average linkage and complete linkage would assign 'clumps' 1 and 3 to the first cluster and 2 and 4 to the second. Single linkage would assign 1 and 2 to one cluster, and 3 and 4 to the other.
iii. Single linkage would successfully separate the two moons in Figure 2(b): it would give {1, 2} and {3, 4}, whereas complete linkage and average linkage would cluster 1 and 4 together first. None of the methods would work in Figure 2(c), since even for single linkage there are some points connecting the two moons.
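The half-moons behaviour is easy to reproduce. The sketch below assumes scikit-learn, with make_moons standing in for the data of Figure 2(b).

```python
# Sketch: single vs. complete vs. average linkage on two half-moons.
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.03, random_state=0)
for linkage in ["single", "complete", "average"]:
    labels = AgglomerativeClustering(n_clusters=2, linkage=linkage).fit_predict(X)
    # Single linkage chains along each moon and separates the two;
    # complete and average linkage cut across the moons instead.
    print(linkage, labels[:10])
```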
and some calculations show that the Bayes' decision rule is given by:

$$\begin{cases} \text{Class 1} & \text{if } 3x^2 - 4x - 4\log 2 > 0 \\ \text{Class 2} & \text{otherwise.} \end{cases}$$