ST3189 - Machine Learning - 2020 Examiners Commentaries


Examiners’ commentaries 2020


ST3189 Machine learning

Important note

This commentary reflects the examination and assessment arrangements for this course in the
academic year 2019–20. The format and structure of the examination may change in future years,
and any such changes will be publicised on the virtual learning environment (VLE).

Information about the subject guide and the Essential reading references

Unless otherwise stated, all cross-references will be to the latest version of the course (2018). You
should always attempt to use the most recent edition of any Essential reading textbook, even if the
commentary and/or online reading list and/or subject guide refer to an earlier edition. If different
editions of Essential reading are listed, please check the VLE for reading supplements – if none are
available, please use the contents list and index of the new edition to find the relevant section.

General remarks

Learning outcomes

At the end of the course and having completed the essential reading and activities you should be
able to:

develop an understanding of the process to learn from data


be familiar with a wide variety of algorithmic and model-based methods to extract
information from data
apply and evaluate suitable methods to various datasets by model selection and predictive
performance evaluation.

Planning your time in the examination

You have two hours to complete this paper, which consists of four compulsory questions. Remember
that each of these questions is likely to cover more than one topic. This means that it is really
important that you make sure you have a reasonable idea of what topics are covered before you start
work on the paper! We suggest you divide your time as follows during the examination:

Spend the first 10 minutes annotating the paper. Note the topics covered in each question
and subquestion.
Allow yourself 25 minutes for each question. Do not allow yourself to get stuck on any one
question, but do not just give up after two minutes!


This leaves you with 10 minutes. Do not leave the examination hall at this point! Check
over any questions you may not have completely finished. Make sure you have labelled and
given a title to any tables or diagrams which were required.

What are the examiners looking for?

The examiners are looking for very simple demonstrations from you. They want to be sure that you:

have covered the syllabus as described and explained in the course material
know the basic concepts given and, more importantly, when and how to use them
understand and answer the questions set.

You are not expected to write long essays with lengthy explanations. However, clear and accurate
language, both mathematical and written, is expected and marked. The explanations below and in
the specific commentary for the examination paper should make these requirements clear.

Key steps to improvement

The most important thing you can do is answer the question set! This may sound very simple, but
these are some of the things that candidates often fail to do, even though they are asked! Remember:

If you are asked to label a diagram (which is almost always the case!), please do so. What
do the data describe? What are the units? What are the x and y axes?
Do not waste time calculating things which are not required by the examiners.
When making calculations try to use as many decimal places as possible to reach the most
accurate solution. It is advised to have at least two decimal places in general and at least
three decimal places when calculating probabilities.

How should you use the specific comments on each question given in the
Examiners’ commentaries?

We hope that you find these useful. For each question and subquestion, they give:

further guidance for each question on the points made in the last section
the answers, or keys to the answers, which the examiners were looking for
where appropriate, suggested activities from the course material which should help you to
prepare, as well as similar questions.

Any further references you might need are given in the part of the course to which you are referred
for each answer.

Memorising from the Examiners’ commentaries

It is generally noted in similar examination papers that a small number of candidates appear to be
memorising answers from previous years’ Examiners’ commentaries – for example, plots – and,
therefore, produce the exact same image of them without looking at the examination questions at
all! Note that this is very easy to spot. The Examiners’ commentaries should be used as a guide to
practise on sample examination questions and it is pointless to attempt to memorise them.


Examination revision strategy

Many candidates are disappointed to find that their examination performance is poorer than they
expected. This may be due to a number of reasons, but one particular failing is ‘question
spotting’, that is, confining your examination preparation to a few questions and/or topics which
have come up in past papers for the course. This can have serious consequences.

We recognise that candidates might not cover all topics in the syllabus in the same depth, but you
need to be aware that examiners are free to set questions on any aspect of the syllabus. This
means that you need to study enough of the syllabus to enable you to answer the required number of
examination questions.

The syllabus can be found in the Course information sheet available on the VLE. You should read
the syllabus carefully and ensure that you cover sufficient material in preparation for the
examination. Examiners will vary the topics and questions from year to year and may well set
questions that have not appeared in past papers. Examination papers may legitimately include
questions on any topic in the syllabus. So, although past papers can be helpful during your revision,
you cannot assume that topics or specific questions that have come up in past examinations will
occur again.

If you rely on a question-spotting strategy, it is likely you will find yourself in difficulties
when you sit the examination. We strongly advise you not to adopt this strategy.


Comments on specific questions

Candidates should answer all FOUR questions. All questions carry equal marks.

Answer all parts of the following questions.

Question 1

(a) Indicate whether the following statements are true or false. Briefly justify your
answers.
i. Other things equal, a classifier trained on less training data is more likely to
overfit.
(3 marks)
ii. Consider the lasso approach to perform variable selection when regressing
S&P 500 index daily returns on the daily returns of a number of selected
stocks. One can apply the basic 5-fold or 10-fold cross-validation method to
choose the optimal tuning parameter, λ̂, and consistently get good predictive
performance.
(3 marks)
iii. To perform variable selection, one can use the information criterion approach
to select the model size, including AIC and BIC. Compared with AIC, BIC
tends to select a larger model size when the sample size n becomes
sufficiently large.
(3 marks)

Reading for this question


This question refers to several techniques and machine learning concepts of the course,
requiring some basic understanding of them. More specifically, part i. is on training and test
error, which can be found in several sections of the James et al. textbook, such as Section 2.2
and Chapter 5 on resampling methods. Part ii. is on cross-validation, which is covered in
Section 5.1 of the same book. Part iii. is on the Bayesian and Akaike information criteria,
which can be found in several places throughout the textbook, for example, page 78.
Approaching the question
Remember that the justification has to be one sentence so organise your thoughts
accordingly and avoid lengthy answers. Some ‘good answers’ are provided below. Note that
there can be more than one ‘correct’ answer in some of these questions.
i. True. As the training sample size decreases, the relative model complexity increases and
hence it is more likely to overfit the data.
ii. False. Theoretically, the basic K-fold cross-validation approach cannot be applied to
time-dependent data. In practice, one can apply the approach, but the resulting predictive
performance may be poor.
iii. False. Compared with AIC, BIC corresponds to a larger penalty term, as log n > 2 for
sufficiently large n (n ≥ 8), so minimising the BIC would result in a smaller model size.
Overall, candidates did well on parts i. and iii., but not so well on part ii.
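As a practical illustration of part ii. (not part of the required answer; the sample size and the use of scikit-learn are assumptions made for the example), a time-series split keeps every validation observation after its training window, whereas basic K-fold cross-validation does not:

```python
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit

# Illustrative only: 20 time-ordered observations, e.g. daily returns.
indices = np.arange(20)

# Basic K-fold CV ignores the time ordering: in every fold except the last,
# the training set contains observations that occur after the validation set.
print([train.max() < val.min() for train, val in KFold(n_splits=5).split(indices)])
# [False, False, False, False, True]

# TimeSeriesSplit always validates on observations that come after the
# training window, respecting the dependence structure in the data.
print([train.max() < val.min() for train, val in TimeSeriesSplit(n_splits=5).split(indices)])
# [True, True, True, True, True]
```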

(b) Consider minimising the following penalised sum of squared errors:

$$\frac{1}{2}(z - \theta)^2 + p_\lambda(|\theta|) \qquad (1)$$

over θ ∈ R, where R is the real line and λ ≥ 0 is the tuning parameter.

i. Take $p_\lambda(|\theta|) = \lambda|\theta|$ and show that the solution to (1) is:

$$\hat{\theta} = \operatorname{sign}(z)(|z| - \lambda)\,I(|z| > \lambda)$$

where sign(z) denotes the sign of z and I(·) denotes an indicator function,
which is 1 if the statement in (·) is true and 0 otherwise.
(6 marks)

ii. Take $p_\lambda(|\theta|) = \lambda_1\theta^2 + \lambda_2|\theta|$, where $\lambda_1, \lambda_2 \ge 0$ are two tuning parameters.
What is the solution to equation (1)?
(Hint: Rewrite the optimisation problem in terms of the form in equation
(1) and then apply the result in part i.)
(6 marks)

Reading for this question


This question examines the main idea in shrinkage methods such as Lasso and Ridge
regression which can be found in Section 6.2 of the James et al. textbook. For practice check
also Exercise 1 of the mock examination paper.
Approaching the question
The calculations follow the same idea as in finding a maximum likelihood estimator, except
that this time an additional penalty term is added to the log-likelihood. The working is
given below.
i. The optimisation problem becomes:

$$\frac{1}{2}(z - \theta)^2 + \lambda|\theta|.$$

• We have $\hat{\theta} = 0$ if:

$$\frac{1}{2}z^2 \le \frac{1}{2}(z - \theta)^2 + \lambda\theta, \quad \text{i.e.} \quad z \le \frac{1}{2}\theta + \lambda$$

for all θ > 0, or:

$$\frac{1}{2}z^2 \le \frac{1}{2}(z - \theta)^2 - \lambda\theta, \quad \text{i.e.} \quad z \ge \frac{1}{2}\theta - \lambda$$

for all θ < 0. Hence $\hat{\theta} = 0$ if $-\lambda \le z \le \lambda$.

• Otherwise, we will have $\hat{\theta} > 0$ if we take the first derivative of:

$$\frac{1}{2}(z - \theta)^2 + \lambda\theta$$

with respect to θ and set it to zero, obtaining:

$$\hat{\theta} - z + \lambda = 0 \;\Rightarrow\; \hat{\theta} = z - \lambda > 0.$$

• Similarly, we will have $\hat{\theta} < 0$ if we take the first derivative of:

$$\frac{1}{2}(z - \theta)^2 - \lambda\theta$$

with respect to θ and set it to zero, obtaining:

$$\hat{\theta} - z - \lambda = 0 \;\Rightarrow\; \hat{\theta} = z + \lambda < 0.$$

Combining the above results yields the solution.

ii. Rewrite the optimisation problem in terms of the form in part i. We have:

$$\frac{1}{2}(z - \theta)^2 + \lambda_1\theta^2 + \lambda_2|\theta|
= \frac{1}{2}\left[(1 + 2\lambda_1)\theta^2 - 2\theta z + \frac{z^2}{1 + 2\lambda_1}\right] + \lambda_2|\theta| + \text{constant}$$

$$= \text{constant} + \frac{1}{2}\left(\frac{z}{\sqrt{1 + 2\lambda_1}} - \sqrt{1 + 2\lambda_1}\,\theta\right)^2 + \frac{\lambda_2}{\sqrt{1 + 2\lambda_1}}\left|\sqrt{1 + 2\lambda_1}\,\theta\right|$$

$$= \text{constant} + \frac{1}{2}(\tilde{z} - \tilde{\theta})^2 + \tilde{\lambda}|\tilde{\theta}|$$

where:

$$\tilde{z} = \frac{z}{\sqrt{1 + 2\lambda_1}}, \qquad \tilde{\theta} = \sqrt{1 + 2\lambda_1}\,\theta \qquad \text{and} \qquad \tilde{\lambda} = \frac{\lambda_2}{\sqrt{1 + 2\lambda_1}}.$$

Applying the result in part i., we have:

$$\sqrt{1 + 2\lambda_1}\,\hat{\theta} = \operatorname{sign}\left(\frac{z}{\sqrt{1 + 2\lambda_1}}\right)\frac{|z| - \lambda_2}{\sqrt{1 + 2\lambda_1}}\,I\left(\frac{|z|}{\sqrt{1 + 2\lambda_1}} > \frac{\lambda_2}{\sqrt{1 + 2\lambda_1}}\right).$$

Hence the solution is:

$$\hat{\theta} = \frac{1}{1 + 2\lambda_1}\,\operatorname{sign}(z)(|z| - \lambda_2)\,I(|z| > \lambda_2).$$

Note: These questions were rather more difficult from an algebraic point of view and were
not answered well by all candidates. However, overall there were no major problems with
this question.
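The closed-form solutions in parts i. and ii. can be checked numerically. The sketch below is illustrative only (the test values of z, λ1 and λ2 are arbitrary); it compares the closed-form expressions with a brute-force grid search over θ.

```python
import numpy as np

def soft_threshold(z, lam):
    """Closed-form minimiser of 0.5*(z - theta)**2 + lam*|theta| (part i.)."""
    return np.sign(z) * max(abs(z) - lam, 0.0)

def penalised_solution(z, lam1, lam2):
    """Closed-form minimiser of 0.5*(z - theta)**2 + lam1*theta**2 + lam2*|theta| (part ii.)."""
    return soft_threshold(z, lam2) / (1.0 + 2.0 * lam1)

# Brute-force check on a fine grid of theta values.
z, lam1, lam2 = 1.7, 0.4, 0.6
grid = np.linspace(-5, 5, 200001)
objective = 0.5 * (z - grid) ** 2 + lam1 * grid ** 2 + lam2 * np.abs(grid)
print(grid[np.argmin(objective)])          # approx. 0.6111
print(penalised_solution(z, lam1, lam2))   # (1.7 - 0.6) / (1 + 0.8) = 0.6111...
```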

(c) The table below is the training data with n = 6 observations of a 3-dimensional
input $x = (x_1, x_2, x_3)^T$ and a qualitative output y (the colour is blue or red).


i x1 x2 x3 y
1 0 3 0 red
2 2 0 0 red
3 0 1 3 red
4 0 1 2 blue
5 −1 0 1 blue
6 1 1 1 blue

i. Compute the Euclidean distance between each observation in the training
data and the test point $x^* = (0, 0, 0)^T$.
(2 marks)
ii. If we use k-nearest neighbours with k = 3, what is the predicted colour
corresponding to the test point x∗ ?
(2 marks)

Reading for this question


This question contains material regarding the K-nearest neighbours technique which can be
found in Section 2.2 page 39 of the James et al. textbook. Check the worked example in this
section and Exercise 2.7 for practice.
Approaching the question
i. This part only requires calculating the Euclidean distances. These are given below:
i    x1   x2   x3   y      distance
1     0    3    0   red    3
2     2    0    0   red    2
3     0    1    3   red    √10
4     0    1    2   blue   √5
5    −1    0    1   blue   √2
6     1    1    1   blue   √3
ii. The three closest observations are 5, 6, 2. Hence, since two of them are blue and one is
red, the predicted colour is blue.
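The distance calculation and the majority vote in part ii. can be reproduced with a few lines of code. This is a minimal sketch for checking the arithmetic, not something required in the examination.

```python
import numpy as np
from collections import Counter

# Training data from the question: columns x1, x2, x3 and the colour label y.
X = np.array([[0, 3, 0], [2, 0, 0], [0, 1, 3], [0, 1, 2], [-1, 0, 1], [1, 1, 1]])
y = ["red", "red", "red", "blue", "blue", "blue"]
x_star = np.array([0, 0, 0])

# Part i.: Euclidean distance of every training observation to the test point.
distances = np.linalg.norm(X - x_star, axis=1)
print(np.round(distances, 3))   # [3.    2.    3.162 2.236 1.414 1.732]

# Part ii.: majority vote among the k = 3 nearest neighbours.
nearest = np.argsort(distances)[:3]                       # observations 5, 6 and 2
prediction = Counter(y[i] for i in nearest).most_common(1)[0][0]
print(prediction)                                         # 'blue'
```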

Question 2

Consider a sample $x = (x_1, \ldots, x_n)$ of independent random variables, identically
distributed from the Rayleigh(θ) distribution. The probability density function for
each $x_i$ is:

$$f(x_i \mid \theta) = \theta x_i \exp(-0.5\theta x_i^2) \quad \text{for } x_i > 0$$

where θ > 0 is an unknown parameter.

(a) Assign the Gamma(α, β) prior on θ, for α, β > 0 and derive the corresponding
posterior distribution.
(5 marks)
(b) Derive the Jeffreys prior for θ. Use it to obtain the corresponding posterior
distribution.
(7 marks)
(c) Provide a normal approximation to the posterior distribution of θ that
converges to the true posterior as the sample size increases.
(4 marks)
(d) Assuming a quadratic error loss function, find the Bayes’ estimator for each of
the posteriors in parts (a), (b) and (c).
(4 marks)


(e) Let y represent a future observation from the same model. Find the posterior
predictive distribution of y for one of the posteriors in parts (a) or (b).
(5 marks)

Reading for this question

This question is examining Bayesian inference which can be found in Block 4 of the VLE section
of the course. Read the parts on ‘Bayesian Inference Essentials’ and ‘Bayesian Inference
Examples’. Exercises 1–6 as well as Exercise 2 of the mock examination are relevant for practice
(try to do a few of them).

Approaching the question

(a) The likelihood can be written as:

$$L(\theta \mid x) = f(x \mid \theta) = \prod_{i=1}^{n} \theta x_i \exp(-0.5\theta x_i^2) \propto \theta^n \exp\left(-0.5\theta\sum_{i=1}^{n} x_i^2\right).$$

Consider a Gamma(α, β) prior for θ. We can write:

$$\pi(\theta) \propto \theta^{\alpha - 1}\exp(-\beta\theta).$$

Hence the corresponding posterior will be proportional to:

$$\pi(\theta \mid x) \propto \theta^{\alpha - 1}\exp(-\beta\theta)\,\theta^{n}\exp\left(-0.5\theta\sum_{i=1}^{n} x_i^2\right)
= \theta^{n + \alpha - 1}\exp\left(-\theta\left(\beta + 0.5\sum_{i=1}^{n} x_i^2\right)\right)$$

which can be recognised as the kernel of the $\text{Gamma}\left(n + \alpha,\ \beta + 0.5\sum_{i=1}^{n} x_i^2\right)$ distribution.
i=1

(b) The log-likelihood function can be written as:

$$\ell(\theta \mid x) = n\log(\theta) - 0.5\theta\sum_{i=1}^{n} x_i^2.$$

In order to find the Jeffreys prior we calculate:

$$\frac{\partial\ell(\theta \mid x)}{\partial\theta} = \frac{n}{\theta} - 0.5\sum_{i=1}^{n} x_i^2, \qquad \frac{\partial^2\ell(\theta \mid x)}{\partial\theta^2} = -\frac{n}{\theta^2}$$

and the Fisher information:

$$I(\theta) = -E\left(-\frac{n}{\theta^2}\right) = \frac{n}{\theta^2}.$$

The Jeffreys prior is:

$$\pi^J(\theta) \propto I(\theta)^{1/2} \propto \theta^{-1}.$$

Using the Jeffreys prior, the corresponding posterior $\pi^J(\theta \mid x)$ is proportional to:

$$\pi^J(\theta \mid x) \propto \theta^{n - 1}\exp\left(-0.5\theta\sum_{i=1}^{n} x_i^2\right)$$

which can be recognised as the kernel of the $\text{Gamma}\left(n,\ 0.5\sum_{i=1}^{n} x_i^2\right)$ distribution.


(c) We will need the mode of the likelihood, $\theta_M$, and the Fisher information. The latter was
derived in part (b) and is equal to $n/\theta^2$. For the former we set:

$$\frac{n}{\theta_M} - 0.5\sum_{i=1}^{n} x_i^2 = 0$$

or else:

$$\theta_M = \frac{2n}{\sum_{i=1}^{n} x_i^2}$$

since:

$$\frac{\partial^2\ell(\theta \mid x)}{\partial\theta^2} < 0.$$

The requested normal approximation to the posterior has:

$$\text{mean} = \theta_M = \frac{2n}{\sum_{i=1}^{n} x_i^2} \qquad \text{and} \qquad \text{variance} = I(\theta_M)^{-1} = \frac{\theta_M^2}{n}.$$

(d) The Bayes’ estimator is the posterior mean in each of these cases. For part (a) it is equal to:

$$\frac{n + \alpha}{\beta + 0.5\sum_{i=1}^{n} x_i^2}$$

whereas for both parts (b) and (c) it is equal to:

$$\frac{2n}{\sum_{i=1}^{n} x_i^2}.$$
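The conjugate updates in parts (a), (b) and (d) are easy to check by simulation. The sketch below is illustrative only (the true parameter value, the prior hyperparameters and the seed are arbitrary choices, not part of the answer); it uses the fact that numpy parametrises the Rayleigh distribution by a scale σ with θ = 1/σ².

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate from the Rayleigh(theta) density theta * x * exp(-0.5 * theta * x^2).
theta_true = 2.0
n = 200
x = rng.rayleigh(scale=1.0 / np.sqrt(theta_true), size=n)
s = 0.5 * np.sum(x ** 2)

# Part (a): Gamma(alpha, beta) prior gives a Gamma(n + alpha, beta + s) posterior.
alpha, beta = 2.0, 1.0
print("posterior mean, Gamma prior:   ", (n + alpha) / (beta + s))

# Part (b): Jeffreys prior gives a Gamma(n, s) posterior.
print("posterior mean, Jeffreys prior:", n / s)

# Part (c): normal approximation N(theta_M, theta_M^2 / n) with theta_M = 2n / sum(x_i^2).
theta_M = n / s
print("normal approximation: mean =", theta_M, " variance =", theta_M ** 2 / n)
```

Both posterior means (the Bayes’ estimators of part (d)) should be close to the true value θ = 2 for a sample of this size.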

(e) Consider the posterior $\pi^J(\theta \mid x)$ corresponding to the Jeffreys prior. The predictive
distribution for y > 0 and $x_i > 0$, for $i = 1, \ldots, n$, is:

$$f(y \mid x) = \int_\Theta f(y \mid \theta, x)\,\pi^J(\theta \mid x)\,d\theta = \int_\Theta f(y \mid \theta)\,\pi^J(\theta \mid x)\,d\theta$$

$$= \int_0^\infty \theta y \exp(-0.5\theta y^2)\,\frac{\left(0.5\sum_{i=1}^{n} x_i^2\right)^n}{\Gamma(n)}\,\theta^{n - 1}\exp\left(-0.5\theta\sum_{i=1}^{n} x_i^2\right)d\theta$$

$$= \frac{y\left(\sum_{i=1}^{n} x_i^2\right)^n}{2^n\,\Gamma(n)}\int_0^\infty \theta^{(n + 1) - 1}\exp\left(-\frac{\theta}{2}\left(y^2 + \sum_{i=1}^{n} x_i^2\right)\right)d\theta$$

$$= \frac{y\left(\sum_{i=1}^{n} x_i^2\right)^n}{2^n\,\Gamma(n)}\cdot\frac{\Gamma(n + 1)\,2^{n + 1}}{\left(y^2 + \sum_{i=1}^{n} x_i^2\right)^{n + 1}}.$$


Question 3

(a) Given a random data sample consisting of n observations, consider a bootstrap
sample generated from it.
i. Describe how the bootstrap sample is obtained and argue that the
probability that the jth observation is not in the bootstrap sample is:

$$\left(1 - \frac{1}{n}\right)^n.$$

ii. Provide the probability that the jth observation is in the bootstrap sample
for n = 5.
iii. Name a potential use for the observations that are not in the bootstrap
sample.
iv. Suppose that we want to predict Y for a particular value of the predictor X
based on a machine learning method. Describe how we can estimate the
standard deviation of our prediction.
(12 marks)

Reading for this question


This question requires basic understanding of the bootstrap resampling procedure.
Suggested reading is Section 5.2 of the James et al. textbook.
Approaching the question
i. The bootstrap sample is obtained by independently sampling n times, with replacement,
from the units of the original sample. At each of these n draws the probability of sampling
the jth observation is 1/n, hence the probability of not sampling it is 1 − 1/n. As the
procedure is repeated n times independently, the probability that the jth observation is
never sampled is:

$$\left(1 - \frac{1}{n}\right)^n.$$
ii. The probability of the jth observation being in the sample is:

$$1 - \left(1 - \frac{1}{5}\right)^5 = 1 - \left(\frac{4}{5}\right)^5 = 0.672.$$

iii. They can serve as a test sample. This will provide the out-of-bag error.
iv. If we use some machine learning method to make a prediction for the response Y for a
particular value of the predictor X we might estimate the standard deviation of our
prediction by using the bootstrap approach. The bootstrap approach works by
repeatedly sampling observations (with replacement) from the original data set B times,
for some large value of B, each time fitting a new model and subsequently obtaining the
square root of the MSE by contrasting the predictions produced by each of these B models
with the actual observations.
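A minimal sketch of a bootstrap procedure of this kind is given below. It is illustrative only: the data-generating process, the choice of a simple linear fit as the 'machine learning method', the prediction point and the number of replications B are all assumptions made for the example. Each bootstrap sample refits the model, and the spread of the resulting predictions estimates the standard deviation of the prediction.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative data set standing in for the original sample.
n = 100
X = rng.uniform(0, 10, size=n)
Y = 2.0 + 0.5 * X + rng.normal(scale=1.0, size=n)

x_new = 4.0     # the particular value of the predictor X we want to predict at
B = 1000        # number of bootstrap replications

preds = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n, size=n)           # sample n observations with replacement
    coef = np.polyfit(X[idx], Y[idx], deg=1)   # refit the model on the bootstrap sample
    preds[b] = np.polyval(coef, x_new)         # prediction at x_new from this refit

# Standard deviation of the B bootstrap predictions.
print(preds.std(ddof=1))
```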

(b) The tree in Figure 1 (on the next page) provides a regression tree based on a
dataset regarding fuel consumption (in miles per gallon) for 392 cars. The
variables appearing in the regression tree are cylinders: number of cylinders
(between 4 and 8) and horsepower: engine horsepower.
i. Provide an interpretation of this tree.
(6 marks)
ii. Create a diagram that represents the partition of the predictors’ space
according to the tree of Figure 1.
(7 marks)


Figure 1: Regression tree for Question 3 (b).

Reading for this question


This part targets regression and classification trees presented in Chapter 8 of the James et
al. textbook, and more specifically in Sections 8.1.1 and 8.1.2. Check also Exercise 8.4 from
the same book as well as Exercise 3 (b) of the mock examination paper.
Approaching the question
This was overall a well-answered question. A brief example of a ‘good’ answer is given below.
i. The average consumption (in miles per gallon) in cars with 5 or fewer cylinders is generally
lower.
More specifically, if the horsepower is less than 70.5 we get 33.67 miles per gallon,
whereas when the horsepower is higher than 70.5 the miles per gallon drop to 26.68.
However, for 6 or more cylinders the consumption is higher. If the horsepower is less
than 139.5 we get fewer miles per gallon, i.e. 19.60. Finally, the maximum consumption
(fewest miles per gallon) is obtained for cars with 6 or more cylinders and horsepower
larger than 139.5.
ii. The requested diagram showing the partition of the predictors’ space according to this
tree is provided in the figure below.

Partition of the predictors’ space according to the tree in Figure 1.
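The prediction rule implied by this tree can also be written out directly, which makes the partition of the (cylinders, horsepower) space explicit. The sketch below is illustrative: three of the leaf values are those quoted above, but the commentary does not give the value for cars with 6 or more cylinders and horsepower above 139.5, so that leaf is a hypothetical placeholder.

```python
LEAF_HIGH_HP = 15.0  # hypothetical placeholder: value not stated in the commentary

def predict_mpg(cylinders: int, horsepower: float) -> float:
    """Prediction rule implied by the regression tree in Figure 1."""
    if cylinders <= 5:                                # first split: 5 or fewer cylinders
        return 33.67 if horsepower < 70.5 else 26.68
    else:                                             # 6 or more cylinders
        return 19.60 if horsepower < 139.5 else LEAF_HIGH_HP

# Each branch corresponds to one rectangle in the (cylinders, horsepower) plane,
# which is exactly the partition asked for in part ii.
print(predict_mpg(4, 65))     # 33.67
print(predict_mpg(8, 150))    # 15.0 (placeholder leaf)
```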


Question 4

(a) Some of the most commonly-used types of linkage in hierarchical clustering are:
• Single linkage: distance between clusters is the minimum distance between
any pair of points from two clusters.
• Complete linkage: distance between clusters is the maximum distance
between any pair of points from two clusters.
• Average linkage: distance between clusters is the average distance between
any pair of points from two clusters.

i. Which of the three types of linkage described above would most likely result
in clusters most similar to those given by k-means?
(4 marks)

Figure 2: For Question 4 (a).

ii. Consider the data in Figure 2(a). What would be the result if we extract two
clusters from the tree given by hierarchical clustering on this dataset using
single linkage? Describe your answer in terms of the labels 1–4 given to the
four ‘clumps’ in the data. Do the same for complete linkage and average
linkage.
(5 marks)
iii. Which of the three types of linkage (if any) would successfully separate the
two ‘half-moons’ in Figure 2(b)? What about Figure 2(c)? Briefly explain
your answer.
(6 marks)

Reading for this question


This is a question on the technique of hierarchical clustering, which is covered in Section 10.2
of the James et al. textbook. The first part also relates to k-means, which is explained in
detail in Section 10.3.1 of the same textbook. For practice, check the exercises in
Section 10.4 of that book.

Approaching the question

i. Average linkage since, like k-means, it is based on average distances between and within
clusters.

ii. Average linkage and complete linkage would assign ‘clumps’ 1 and 3 to the first cluster
and 2 and 4 to the second. Single linkage would assign 1 and 2 to one cluster, 3 and 4 to
the other.

iii. Single linkage would successfully separate the two moons in Figure 2 (b). See the figure
below. Using single linkage would give {1, 2} and {3, 4}; however, complete linkage and
average linkage would cluster 1 and 4 first. None of the methods would work in Figure 2
(c), since even for single linkage, there are some dots connecting the two moons.
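This behaviour is easy to reproduce. The sketch below is an illustration only (the sample size, noise level and use of scikit-learn are assumptions, not part of the required answer): on two well-separated half-moons, single linkage typically recovers the two moons exactly, while complete and average linkage typically do not.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two well-separated half-moons, analogous to Figure 2(b).
X, labels = make_moons(n_samples=300, noise=0.03, random_state=0)

for linkage in ["single", "complete", "average"]:
    found = AgglomerativeClustering(n_clusters=2, linkage=linkage).fit_predict(X)
    # Agreement with the true moon labels (1.0 means the moons are recovered exactly).
    print(linkage, adjusted_rand_score(labels, found))
```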


(b) Consider a binary classification problem, Y = 1 (class 1) and Y = 2 (class 2).
Suppose that X | Y = 1 is from a normal distribution $N(2, 1^2)$ and X | Y = 2 is
from the normal distribution $N(1, (0.5)^2)$. In addition, P(Y = 1) = 1/3 and
P(Y = 2) = 2/3. Derive the Bayes’ decision rule that classifies regions to classes
1 and 2 for this problem.
(10 marks)

Reading for this question


This is an exercise related to the techniques of linear and quadratic discriminant
analysis. These techniques can be studied in Section 4.4 (mainly 4.4.1 and 4.4.2) of the
James et al. textbook. Nevertheless, the question can also be worked out without specific reference to
these techniques by simply applying Bayes’ theorem and standard probability distribution
theory. See Exercises 4.6 and 4.7 of James et al. for practice.
Approaching the question
Note that:

$$g_j(x) = (2\pi\sigma_j^2)^{-1/2}\exp\left(-\frac{(x - \mu_j)^2}{2\sigma_j^2}\right) \quad \text{for } j = 1, 2, \qquad \text{and} \quad \pi_1 = \frac{1}{3},\ \pi_2 = \frac{2}{3}.$$

Then the Bayes’ decision rule is given by:

$$\begin{cases}\text{Class 1} & \text{if } \pi_1 g_1(x) > \pi_2 g_2(x)\\ \text{Class 2} & \text{otherwise.}\end{cases}$$

Solve for $\pi_1 g_1(x) > \pi_2 g_2(x)$, that is:

$$\log\pi_1 + \log(g_1(x)) > \log\pi_2 + \log(g_2(x)),$$

i.e. (the common $(2\pi)^{-1/2}$ factor cancels):

$$\log\frac{1}{3} - \frac{(x - 2)^2}{2} > \log\frac{2}{3} + \log 2 - 2(x - 1)^2,$$

and rearranging shows that the Bayes’ decision rule is given by:

$$\begin{cases}\text{Class 1} & \text{if } 3x^2 - 4x - 4\log 2 > 0\\ \text{Class 2} & \text{otherwise.}\end{cases}$$
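A quick numerical check of the derived rule (not required in the examination; the grid of x values is arbitrary) compares it with the definition of the Bayes' classifier evaluated directly from the two normal densities:

```python
import numpy as np
from scipy.stats import norm

def bayes_rule_direct(x):
    """Classify by comparing pi_1 * g_1(x) with pi_2 * g_2(x) directly."""
    p1 = (1 / 3) * norm.pdf(x, loc=2, scale=1.0)
    p2 = (2 / 3) * norm.pdf(x, loc=1, scale=0.5)
    return 1 if p1 > p2 else 2

def bayes_rule_derived(x):
    """Classify with the derived rule: class 1 iff 3x^2 - 4x - 4*log(2) > 0."""
    return 1 if 3 * x ** 2 - 4 * x - 4 * np.log(2) > 0 else 2

xs = np.linspace(-5, 5, 2001)
print(all(bayes_rule_direct(x) == bayes_rule_derived(x) for x in xs))  # True
```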
