02450 Exam Fall 2016 Solutions


Technical University of Denmark

Written examination: 16 December 2016, 9 AM - 1 PM.

Course name: Introduction to Machine Learning and Data Mining.

Course number: 02450.

Aids allowed: All aids permitted.

Exam duration: 4 hours.

Weighting: The individual questions are weighted equally.

You must either use the electronic file or the form on this page to hand in your answers but not both. We
strongly encourage you to hand in your answers digitally using the electronic file. If you hand
in using the form on this page, please write your name and student number clearly.
The exam is multiple choice. All questions have four possible answers marked by the letters A, B, C, and D as
well as the answer “Don’t know” marked by the letter E. Correct answer gives 3 points, wrong answer gives -1
point, and “Don’t know” (E) gives 0 points.
The individual questions are answered by filling in the answer fields with one of the letters A, B, C, D, or E.

Answers:

1 2 3 4 5 6 7 8 9 10

A A B C D B A C A B

11 12 13 14 15 16 17 18 19 20

D D A C D D D B A B

21 22 23 24 25 26 27

B D B D D C B

Name:

Student number:

PLEASE HAND IN YOUR ANSWERS DIGITALLY.

USE ONLY THIS PAGE FOR HAND IN IF YOU ARE UNABLE TO HAND IN DIGITALLY.

No. Attribute description Abbrev.
x1 Area A
x2 Perimeter P
x3 Length of kernel L
x4 Width of kernel W
y Seed type

Table 1: The attributes of the Seeds data set taken from http://archive.ics.uci.edu/ml/datasets/seeds.
The output is given by the type of seed, i.e. y=1
corresponds to Kama, y=2 corresponds to Rosa, and
y=3 corresponds to Canadian.

Question 1. We will consider data on wheat kernels based on 70 observations of each of the three classes of seed types, i.e., Kama, Rosa, and Canadian. The original data contains seven attributes; however, we presently only consider the four attributes given in Table 1. Considering the attributes described in the table and visualized using boxplots in Figure 1, which one of the following statements is correct?

A. All the attributes x1, x2, x3, and x4 are continuous and ratio.

B. The output variable y is ordinal.

C. Rosa and Canadian do not appear to differ in terms of area (A).

D. The observations pertaining to Kama appear to contain clear outliers that must be removed.

E. Don't know.

Figure 1: Boxplot of the data visualized separately for each of the three types of seeds; Kama, Rosa, and Canadian.

Solution 1. Zero means absence of the attribute for all four attributes (i.e. zero area means no area, etc.), it is meaningful to talk about one attribute value being twice as large as another, and all the values are continuous, so answer A is correct. y is nominal, indicating class membership, but not ordinal: in general we cannot argue that one type of seed is better or higher than another, only whether a seed is different from another. It does appear from the boxplot that Rosa and Canadian differ in terms of area, such that all Rosa seeds have larger area than Canadian seeds. Finally, although the boxplots indicate observations that fall beyond the whiskers, these observations should not be removed unless there is strong justification for doing so, which the boxplot alone does not provide.

Question 2. A principal component analysis (PCA) is carried out on the standardized attributes x1–x4, forming the standardized matrix X̃, resulting in the following S and V matrices obtained from a singular value decomposition:

S = [ 28.4   0     0     0
       0     5.5   0     0
       0     0     1.2   0
       0     0     0     0.5 ],

V = [ -0.51   0.11  -0.39  -0.76
      -0.51  -0.13  -0.58   0.62
      -0.49  -0.69   0.53  -0.05
      -0.49   0.71   0.47   0.19 ].

Which one of the following statements is correct?

A. The first principal component accounts for more than 95% of the variance.

B. The two first principal components account for more than 99.9% of the variance.

C. The fourth principal component accounts for more than 0.05% of the variance.

D. The attributes are not correlated as the data has been standardized.

E. Don't know.

Question 3. The data projected onto the two first principal components (as defined in Question 2) is given in Figure 2, where each class is indicated using different markers and colors. Which one of the following statements pertaining to the PCA is correct?

A. A relatively long and narrow seed kernel will provide a large positive projection onto the second principal component.

B. The first principal component pertains to the general size of seeds.

C. A seed that has relatively small area and perimeter but large length and width of kernel will have a negative projection onto the third principal component.

D. As the third and fourth principal components account for a low amount of the variance in the data, this is a difficult classification task.

E. Don't know.

Figure 2: Data projected onto the first and second principal components.

Solution 2. The variation explained by each principal component is given by σi² / Σi′ σi′². As such we find:

VarExpPC1 = 28.4² / (28.4² + 5.5² + 1.2² + 0.5²) = 0.9619
VarExpPC2 =  5.5² / (28.4² + 5.5² + 1.2² + 0.5²) = 0.0361
VarExpPC3 =  1.2² / (28.4² + 5.5² + 1.2² + 0.5²) = 0.0017
VarExpPC4 =  0.5² / (28.4² + 5.5² + 1.2² + 0.5²) = 0.0003

As such, the first principal component accounts for more than 95% of the variance, while the first two principal components account for 0.9619 + 0.0361 = 0.9980, which is less than 99.9% of the variance. The fourth principal component accounts for 0.03%, which is less than 0.05%. As the first principal component accounts for more than 95% of the variance, the attributes are in fact highly correlated, not the opposite.

Solution 3. For the second principal component we have v2ᵀ = [0.11 −0.13 −0.69 0.71], so a relatively long (large positive x3) and relatively narrow (large negative x4) kernel will have a large negative projection onto this component. As the coefficients of the first principal component are all negative and of roughly the same magnitude for the four attributes, this component appears to capture the general size of the seed, such that relatively large area, perimeter, kernel length, and kernel width give a negative projection, and vice versa. The third principal component is defined by v3ᵀ = [−0.39 −0.58 0.53 0.47], so relatively small area and perimeter (negative x1 and x2) but a large kernel (i.e. positive x3 and x4) will give a positive projection onto this component. The PCA does not take information about the classes into account and therefore does not necessarily reflect features relevant for classification. In particular, the singular values are uninformed by the classes, as this information is not available to the PCA.
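For reference, the variance-explained calculation in Solution 2 can be reproduced with a few lines of NumPy (a minimal sketch using the singular values from Question 2):

```python
import numpy as np

# Singular values from the diagonal of S in Question 2.
s = np.array([28.4, 5.5, 1.2, 0.5])

# Variance explained per principal component: sigma_i^2 / sum_i' sigma_i'^2.
var_explained = s**2 / np.sum(s**2)
print(var_explained)             # approx [0.9619, 0.0361, 0.0017, 0.0003]
print(np.cumsum(var_explained))  # first two PCs together: approx 0.9980
```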

Question 4. A decision tree is fitted to the data
projected onto the four principal components. At the
root of the tree a split according to the projection of the
standardized data onto the first principal component
being larger than 0 is considered, i.e. x̃n v1 ≥ 0. For impurity we will use the classification error given by I(v) = 1 − max_c p(c|v). Before the split we have 70
Kama, 70 Rosa, and 70 Canadian and after the split:

- 24 Kama, 70 Rosa, 0 Canadian below zero in the projection onto v1.

- 46 Kama, 0 Rosa, 70 Canadian above or equal to zero in the projection onto v1.

What is the purity gain of this split?

A. -1.0148

B. 0.0148

C. 0.3333

D. 0.6666

E. Don't know.

Solution 4. The purity gain is given by

Δ = I(parent) − Σ_{j=1}^{2} [N(vj)/N] · I(vj),   where   I(v) = 1 − max_c p(c|v).

Inserting for the split defined by the projection onto the first principal component being greater than or equal to zero, we obtain

Δ = (1 − 70/210) − [ (94/210)·(1 − 70/94) + (116/210)·(1 − 70/116) ] = 2/3 − 1/3 = 1/3.

Figure 3: Decision boundaries for four different classifiers trained on the Seeds data projected onto the first two principal components.

Question 5. Four different classifiers are trained on the data projected onto the first two principal components (i.e., using the first and second principal components as features) and the decision boundary for each of the four classifiers is given in Figure 3. Which one of the following statements is correct?

A. Classifier 1 is a decision tree, Classifier 2 is an artificial neural network with three hidden units, Classifier 3 is a multinomial regression model, and Classifier 4 is a 3-nearest neighbor classifier.

B. Classifier 1 is an artificial neural network with three hidden units, Classifier 2 is a multinomial regression model, Classifier 3 is a 3-nearest neighbor classifier, and Classifier 4 is a decision tree.

C. Classifier 1 is an artificial neural network with three hidden units, Classifier 2 is a multinomial regression model, Classifier 3 is a decision tree, and Classifier 4 is a 3-nearest neighbor classifier.

D. Classifier 1 is a multinomial regression model, Classifier 2 is an artificial neural network with three hidden units, Classifier 3 is a decision tree, and Classifier 4 is a 3-nearest neighbor classifier.

E. Don't know.
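As a quick numerical check of the purity gain computed in Solution 4, a minimal sketch (class counts taken from Question 4):

```python
# Classification-error impurity and the purity gain of the root split in Question 4.
def impurity(counts):
    return 1 - max(counts) / sum(counts)

parent = [70, 70, 70]   # Kama, Rosa, Canadian before the split
left   = [24, 70, 0]    # projection onto v1 below zero
right  = [46, 0, 70]    # projection onto v1 above or equal to zero

n = sum(parent)
gain = impurity(parent) - (sum(left) / n * impurity(left)
                           + sum(right) / n * impurity(right))
print(gain)  # 1/3 = 0.3333...
```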

O1 O2 O3 O4 O5 O6 O7 O8 O9
O1 0 0.534 1.257 1.671 1.090 1.315 1.484 1.253 1.418
O2 0.534 0 0.727 2.119 1.526 1.689 1.214 0.997 1.056
O3 1.257 0.727 0 2.809 2.220 2.342 1.088 0.965 0.807
O4 1.671 2.119 2.809 0 0.601 0.540 3.135 2.908 3.087
O5 1.090 1.526 2.220 0.601 0 0.331 2.563 2.338 2.500
O6 1.315 1.689 2.342 0.540 0.331 0 2.797 2.567 2.708
O7 1.484 1.214 1.088 3.135 2.563 2.797 0 0.275 0.298
O8 1.253 0.997 0.965 2.908 2.338 2.567 0.275 0 0.343
O9 1.418 1.056 0.807 3.087 2.500 2.708 0.298 0.343 0

Table 2: Pairwise Euclidean distance between nine observations in the Seeds data. Black observations (i.e.,
O1, O2, O3) are observations corresponding to Kama
seeds, red observations (i.e., O4, O5, O6) are observa-
tions corresponding to Rosa seeds, and blue observa-
tions (i.e., O7, O8, O9) are observations corresponding
to Canadian seeds.
Figure 4: The confusion matrix of a 3-nearest neighbor
classifier used to predict the Seeds data.

Solution 5. The decision boundary of Classifier 1 is formed by straight lines, thus it is the multinomial regression model. Classifier 2 has smooth, non-linear boundaries, thus it is the artificial neural network. Classifier 3 has axis-aligned boundaries corresponding to the decision tree, leaving Classifier 4 as the 3-nearest neighbour classifier due to its very complex and non-smooth boundaries.
Question 6. The data is split in half and a KNN classifier is used to predict the test set based on the training set for K = 3. The confusion matrix of the KNN classifier is given in Figure 4. What is the accuracy of the classifier?

A. 0.1429

B. 0.8571

C. 0.8911

D. 0.9574

E. Don't know.

Solution 6. The accuracy is given by the number of correctly classified observations out of the total number of classified observations, which is accuracy = (31 + 30 + 29) / (31 + 30 + 29 + 1 + 3 + 5 + 6) = 90/105 = 0.8571.

Figure 5: Four different dendrograms derived from the distances between the nine observations in Table 2.

Question 7. In Table 2 is given the pairwise Euclidean distances between nine observations of the Seeds data. A hierarchical clustering is used to cluster these nine observations using complete (i.e., maximum) linkage. Which one of the dendrograms given in Figure 5 corresponds to the clustering?

A. Dendrogram 1.

B. Dendrogram 2.

C. Dendrogram 3.

D. Dendrogram 4.

E. Don't know.

Solution 7. In complete linkage, clusters are merged according to the maximal distance between observations in the two clusters. The dendrogram grows by first merging O7 and O8 at 0.275, then O5 and O6 at level 0.331, then {O7, O8} with O9 at 0.343, then O1 and O2 at level 0.534, then O4 with {O5, O6} at 0.601, then O3 with {O7, O8, O9} at level 1.088, then {O1, O2} with {O3, O7, O8, O9} at 1.484, and finally {O1, O2, O3, O7, O8, O9} with {O4, O5, O6} at 3.135. Only Dendrogram 1 has these properties.

Question 8. We will consider thresholding Dendrogram 4 at the level of three clusters. We recall that the Rand index, also denoted the simple matching coefficient (SMC), between the true labels and the extracted clusters is given by:

SMC = (f00 + f11) / K,

where f00 is the number of object pairs in different classes assigned to different clusters, f11 is the number of object pairs in the same class assigned to the same cluster, and K = N(N − 1)/2 is the total number of object pairs, where N is the number of observations considered. What is the above SMC between the true labeling of the observations into the three classes Kama, Rosa, and Canadian, and the clustering defined by thresholding Dendrogram 4 at the level of three clusters?

A. 0.7500

B. 0.7778

C. 0.8611

D. 1.0000

E. Don't know.

Solution 8. When thresholding the clustering we obtain the cluster indices [1 1 3 2 2 2 3 3 3]ᵀ, whereas the true class labels are [1 1 1 2 2 2 3 3 3]ᵀ. From this we obtain:

Total number of object pairs: K = 9(9 − 1)/2 = 36
f00 = 2·(3 + 3) + 3·(3 + 1) = 24
f11 = 2·(2 − 1)/2 + 3·(3 − 1)/2 + 3·(3 − 1)/2 + 1·(1 − 1)/2 = 7
SMC = (f00 + f11)/K = (24 + 7)/36 = 0.8611
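The pair counting in Solution 8 is easy to get wrong by hand; a small sketch that counts f00 and f11 directly from the two labelings:

```python
from itertools import combinations

# Cluster indices from cutting Dendrogram 4 at three clusters, and the true class labels.
clusters = [1, 1, 3, 2, 2, 2, 3, 3, 3]
labels   = [1, 1, 1, 2, 2, 2, 3, 3, 3]

f00 = f11 = 0
for i, j in combinations(range(len(labels)), 2):
    same_class   = labels[i] == labels[j]
    same_cluster = clusters[i] == clusters[j]
    f11 += same_class and same_cluster
    f00 += (not same_class) and (not same_cluster)

K = len(labels) * (len(labels) - 1) // 2
print(f00, f11, (f00 + f11) / K)  # 24 7 0.8611...
```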

Question 9. To determine the type of seed of an observation we will use a k-nearest neighbor (KNN) classifier to predict each of the nine observations based on the Euclidean distances between the observations given in Table 2. We will use leave-one-out cross-validation for the KNN in order to classify the nine considered observations using a two-nearest neighbor classifier, i.e. K = 2. For tied classes we will classify the observation according to its closest observation. The analysis will be based only on the data given in Table 2. Which one of the following statements is correct?

A. All the observations will be correctly classified.

B. One of the observations will be misclassified.

C. Two of the observations will be misclassified.

D. Three of the observations will be misclassified.

E. Don't know.

Solution 9.
N(O1, 2) = {O2, O5}; as O2 is closest, O1 will be correctly classified as Kama.
N(O2, 2) = {O1, O3}, and O2 will be correctly classified as Kama.
N(O3, 2) = {O2, O9}; as O2 is closest, O3 will be correctly classified as Kama.
N(O4, 2) = {O6, O5}, and O4 will be correctly classified as Rosa.
N(O5, 2) = {O6, O4}, and O5 will be correctly classified as Rosa.
N(O6, 2) = {O5, O4}, and O6 will be correctly classified as Rosa.
N(O7, 2) = {O8, O9}, and O7 will be correctly classified as Canadian.
N(O8, 2) = {O7, O9}, and O8 will be correctly classified as Canadian.
N(O9, 2) = {O7, O8}, and O9 will be correctly classified as Canadian.

Question 10. We suspect that observation O4 may be an outlier. In order to assess if this is the case we would like to calculate the average relative KNN density based on the observations given in Table 2 only. We recall that the KNN density and average relative density for the observation xi are given by:

density_{X\i}(xi, K) = 1 / ( (1/K) Σ_{x′ ∈ N_{X\i}(xi, K)} d(xi, x′) ),

ard_X(xi, K) = density_{X\i}(xi, K) / ( (1/K) Σ_{xj ∈ N_{X\i}(xi, K)} density_{X\j}(xj, K) ),

where N_{X\i}(xi, K) is the set of K nearest neighbors of observation xi excluding the i'th observation, and ard_X(xi, K) is the average relative density of xi using K nearest neighbors. Based on the data in Table 2, what is the average relative density for observation O4 for K = 1 nearest neighbors?

A. 0.54

B. 0.61

C. 1.63

D. 1.85

E. Don't know.

Solution 10.

density(xO4, 1) = ( (1/1) · 0.540 )⁻¹ = 1.8519

density(xO6, 1) = ( (1/1) · 0.331 )⁻¹ = 3.0211

a.r.d.(xO4, 1) = density(xO4, 1) / ( (1/1) · density(xO6, 1) ) = 1.8519 / 3.0211 = 0.61
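A minimal sketch of the calculation in Solution 10; only the two distances it needs from Table 2 are used:

```python
# Average relative KNN density of O4 for K = 1 (Solution 10).
K = 1
d_O4_O6 = 0.540   # O6 is the nearest neighbour of O4
d_O6_O5 = 0.331   # O5 is the nearest neighbour of O6

density_O4 = 1 / (d_O4_O6 / K)
density_O6 = 1 / (d_O6_O5 / K)
ard_O4 = density_O4 / (density_O6 / K)
print(density_O4, density_O6, ard_O4)  # approx 1.85, 3.02, 0.61
```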

Figure 6: The nine observations considered in Table 2 projected onto the first principal component (the location of the projection is given in parenthesis).

Question 11. We will consider the nine observations projected onto the first principal component given in Figure 6. We will cluster this data using k-means with Euclidean distance into three clusters (i.e., k=3) and initialize the k-means algorithm with centroids located at observations O4, O6, and O5. Which one of the following statements is correct?

A. The converged solution will be {O4}, {O6}, {O1, O2, O3, O5, O6, O7, O8, O9}.

B. The converged solution will be {O4, O5, O6}, {O1, O2, O3}, {O7, O8, O9}.

C. The converged solution will be {O4, O5, O6}, {O1, O2}, {O3, O7, O8, O9}.

D. The converged solution will be {O4}, {O5, O6}, {O1, O2, O3, O7, O8, O9}.

E. Don't know.

Solution 11. With the described initialization, observation O4 will be assigned to the cluster located at O4, observation O6 will be assigned to the cluster located at O6, and the remaining observations {O1, O2, O3, O5, O7, O8, O9} will be assigned to the cluster located at O5. Thus, only the cluster located at O5 will change location, and its location is updated to (−1.5 − 0.4 + 0.0 + 0.6 + 0.8 + 1.0 + 1.1)/7 = 0.2286. For this new location, O5 is closer to the cluster located at O6 than to the cluster located at 0.2286, resulting in the updated clustering {O4}, {O5, O6}, {O1, O2, O3, O7, O8, O9}. Thus the second cluster will change location to (−1.7 − 1.5)/2 = −1.6, whereas the third cluster will change location to (−0.4 + 0.0 + 0.6 + 0.8 + 1.0 + 1.1)/6 = 0.5167. As O1 is still closest to cluster 3, there is no change of assignment and the k-means procedure has converged.

Feature(s)                Training     Test
                          Error Rate   Error Rate
No features               0.6667       0.6667
x1                        0.1143       0.1524
x2                        0.1143       0.1143
x3                        0.2190       0.1714
x4                        0.1524       0.1714
x1 and x2                 0.0952       0.1619
x1 and x3                 0.1143       0.1619
x1 and x4                 0.1143       0.1619
x2 and x3                 0.1238       0.1333
x2 and x4                 0.1048       0.1429
x3 and x4                 0.1143       0.1619
x1 and x2 and x3          0.0571       0.1714
x1 and x2 and x4          0.1048       0.1619
x1 and x3 and x4          0.0857       0.1619
x2 and x3 and x4          0.0762       0.1524
x1 and x2 and x3 and x4   0.0667       0.1810

Table 3: Error rate for the training and test set when using multinomial regression to predict the type of seed using different combinations of the four attributes (x1–x4) based on the hold-out method with 50% of the observations held out for testing.

Question 12. A multinomial regression classifier is trained using different combinations of the four attributes x1, x2, x3, and x4. Table 3 gives the training and test performance of the multinomial regression classifier when trained using different combinations of the four attributes. Which one of the following statements is correct?

A. Forward selection will result in a better model being selected than backward selection.

B. Neither forward nor backward selection will identify the optimal feature combination for this problem.

C. Backward selection will use a model that includes three features.

D. Forward selection will select only one feature.

E. Don't know.

Solution 12. Using forward and backward selection we would like to minimize the test error rate. Thus, forward selection would first select x2, which has the lowest test error rate. As no combination of features including
x2 leads to an improvement in the test error rate, the forward selection method terminates. Backward selection starts with all features, and an improvement is found by removing feature x1, providing a test error rate of 0.1524 for x2, x3, and x4. Subsequently, x4 is removed, providing a test error rate of 0.1333 for the features x2 and x3. Finally x3 is removed, as having only feature x2 provides an error rate of 0.1143, and the method terminates.

Question 13. We would like to investigate if we can predict the width of a seed kernel (x4) based on the area (x1), perimeter (x2), and length of kernel (x3). For this purpose regularized least squares regression is applied based on minimizing with respect to w the cost function:

E(w) = Σn (xn4 − [1 xn1 xn2 xn3] w)² + λ wᵀw,

where xnm denotes the m'th feature of the n'th observation, and a 1 is concatenated to the data to account for the bias term. We will consider the following four different values of λ: λ1 = 1, λ2 = 10, λ3 = 100, and λ4 = 1000. We obtain the following four different solutions for w, here given in random order of the values of λ considered:

wa = [ 0.0538, 0.0558, 0.1861, −0.0596 ]ᵀ,
wb = [ 0.0089, 0.0931, 0.1093,  0.0417 ]ᵀ,
wc = [ 0.2811, 0.0445, 0.3379, −0.4626 ]ᵀ,
wd = [ 0.0167, 0.0698, 0.1354,  0.0403 ]ᵀ.

Which one of the following solutions to w corresponds to the correct value of λ?

A. wa corresponds to λ2.

B. wb corresponds to λ2.

C. wc corresponds to λ2.

D. wd corresponds to λ2.

E. Don't know.

Solution 13. As we increase λ we put more and more emphasis on the regularization penalization term ‖w‖2². Thus, by evaluating this term for each solution vector we obtain: ‖wa‖2² = 0.0442, ‖wb‖2² = 0.0224, ‖wc‖2² = 0.4092, ‖wd‖2² = 0.0251. Sorting by these norm values we find that wc corresponds to λ1, wa corresponds to λ2, wd corresponds to λ3, and wb corresponds to λ4.
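A minimal check of the ordering argument in Solution 13 (weight vectors copied from Question 13): larger λ shrinks the solution, so sorting by decreasing squared norm orders the vectors by increasing λ.

```python
import numpy as np

# Squared 2-norms of the four candidate weight vectors from Question 13.
w = {"wa": [0.0538, 0.0558, 0.1861, -0.0596],
     "wb": [0.0089, 0.0931, 0.1093, 0.0417],
     "wc": [0.2811, 0.0445, 0.3379, -0.4626],
     "wd": [0.0167, 0.0698, 0.1354, 0.0403]}

norms = {name: float(np.sum(np.square(v))) for name, v in w.items()}
for name, n2 in sorted(norms.items(), key=lambda kv: -kv[1]):
    print(name, round(n2, 4))  # wc 0.4092, wa 0.0442, wd 0.0251, wb 0.0224
```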

No. Attribute description
x1  Occurrence of nausea
x2  Lumbar pain
x3  Urine pushing
x4  Micturition pains
x5  Burn/itch/swell urethra outlet
y   Inflammation of urinary bladder

Table 4: The attributes considered from the study on acute inflammation (taken from https://archive.ics.uci.edu/ml/datasets/Acute+Inflammations). The attributes x1–x5 and y are binary, where we use 1 for true and 0 for false.

Question 14. In a study of acute inflammation we would like to predict urinary bladder inflammation (the data is taken from https://archive.ics.uci.edu/ml/datasets/Acute+Inflammations). We will consider a subset of the attributes; these attributes are given in Table 4. From the study we have

- 49.17 pct. of the persons have inflammation of the urinary bladder.

- 32.20 pct. of the persons that have inflammation of the urinary bladder have occurrence of nausea.

- 16.39 pct. of the persons that do not have inflammation of the urinary bladder have occurrence of nausea.

What is the probability that a person that has occurrence of nausea, i.e. x1 = 1, has inflammation of the urinary bladder, i.e. y = 1, according to this study?

A. 15.83 %

B. 32.20 %

C. 65.52 %

D. 98.82 %

E. Don't know.

Solution 14. According to Bayes' theorem we have:

P(y = 1 | x1 = 1) = P(x1 = 1 | y = 1) P(y = 1) / P(x1 = 1)
                  = P(x1 = 1 | y = 1) P(y = 1) / [ P(x1 = 1 | y = 1) P(y = 1) + P(x1 = 1 | y = 0) P(y = 0) ]
                  = 0.3220 · 0.4917 / ( 0.3220 · 0.4917 + 0.1639 · (1 − 0.4917) )
                  = 0.6552

     x1 x2 x3 x4 x5 y
P1    1  1  1  1  0  1
P2    0  0  0  0  0  0
P3    1  1  0  1  0  0
P4    0  1  1  0  1  0
P5    1  1  1  1  1  1
P6    0  0  0  0  0  0
P7    1  1  0  1  0  0
P8    0  1  1  0  1  0
P9    1  1  1  1  0  1
P10   0  1  1  0  1  0
P11   0  0  0  0  0  0
P12   1  1  0  1  0  0
P13   0  1  1  0  1  0
P14   0  1  1  0  1  0

Table 5: Provided in the above table are the last 14 observations of the acute inflammation data.

Question 15. We will consider a subset of the acute inflammation data given by the last 14 observations provided in Table 5. We will consider this dataset a market basket with 14 persons (P1–P14) denoting the customers and six items, denoted x1–x5 and y, corresponding to the five input attributes and the output variable, respectively, of the features described in Table 4. What are all frequent itemsets with support greater than 40%?

A. {x1}, {x2}, {x3}, {x4}, {x5}, {x1, x2}, {x2, x3}, {x2, x4}, {x2, x5}, {x3, x5}.

B. {x1}, {x2}, {x3}, {x4}, {x5}, {x1, x2}, {x1, x4}, {x2, x3}, {x2, x4}, {x2, x5}, {x3, x5}.

C. {x1}, {x2}, {x3}, {x4}, {x5}, {x1, x2}, {x1, x4}, {x2, x3}, {x2, x4}, {x2, x5}, {x3, x5}, {x2, x3, x5}.

D. {x1}, {x2}, {x3}, {x4}, {x5}, {x1, x2}, {x1, x4}, {x2, x3}, {x2, x4}, {x2, x5}, {x3, x5}, {x1, x2, x4}, {x2, x3, x5}.

E. Don't know.

Solution 15. For a set to have support greater than 40%, the set must occur in at least 0.4 · 14 = 5.6, i.e. 6, of the 14 transactions. The itemsets that have this property are {x1}, {x2}, {x3}, {x4}, {x5}, {x1, x2}, {x1, x4}, {x2, x3}, {x2, x4}, {x2, x5}, {x3, x5}, {x1, x2, x4}, {x2, x3, x5}.
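For reference, a brute-force sketch of the support counting in Solution 15, using the transactions from Table 5:

```python
from itertools import combinations

# Table 5: columns x1..x5 and y for persons P1..P14.
items = ["x1", "x2", "x3", "x4", "x5", "y"]
rows = [
    [1, 1, 1, 1, 0, 1], [0, 0, 0, 0, 0, 0], [1, 1, 0, 1, 0, 0],
    [0, 1, 1, 0, 1, 0], [1, 1, 1, 1, 1, 1], [0, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 0, 0], [0, 1, 1, 0, 1, 0], [1, 1, 1, 1, 0, 1],
    [0, 1, 1, 0, 1, 0], [0, 0, 0, 0, 0, 0], [1, 1, 0, 1, 0, 0],
    [0, 1, 1, 0, 1, 0], [0, 1, 1, 0, 1, 0],
]
baskets = [{items[j] for j, v in enumerate(r) if v} for r in rows]

def support(itemset):
    return sum(itemset <= basket for basket in baskets) / len(baskets)

# Enumerate all non-empty itemsets and keep those with support above 40%.
frequent = [c for k in range(1, len(items) + 1)
            for c in combinations(items, k) if support(set(c)) > 0.4]
print(frequent)  # the 13 itemsets listed in Solution 15
```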

Question 16. What is the confidence of the association rule {x1, x2, x3, x4, x5} → {y}?

A. 0.0%

B. 7.1%

C. 21.4%

D. 100.0%

E. Don't know.

Solution 16. The confidence is given as

P(y = 1 | x1 = 1, x2 = 1, x3 = 1, x4 = 1, x5 = 1)
  = P(y = 1, x1 = 1, x2 = 1, x3 = 1, x4 = 1, x5 = 1) / P(x1 = 1, x2 = 1, x3 = 1, x4 = 1, x5 = 1)
  = (1/14) / (1/14) = 1 = 100.0%.

Question 17. We would like to predict whether a subject has inflammation of urinary bladder (y = 1) or not (y = 0) using the data in Table 5 and the attributes x1 and x2 only. We will apply a Naïve Bayes classifier that assumes independence between the two attributes given y. Given that a person has x1 = 1 and x2 = 1, what is the probability that the person has an inflammation of urinary bladder (y = 1) according to the Naïve Bayes classifier?

A. 1/14

B. 3/14

C. 1/2

D. 11/19

E. Don't know.

Solution 17. According to the Naïve Bayes classifier we have

P(y = 1 | x1 = 1, x2 = 1)
  = P(x1 = 1 | y = 1) P(x2 = 1 | y = 1) P(y = 1) / [ P(x1 = 1 | y = 1) P(x2 = 1 | y = 1) P(y = 1) + P(x1 = 1 | y = 0) P(x2 = 1 | y = 0) P(y = 0) ]
  = (3/3 · 3/3 · 3/14) / (3/3 · 3/3 · 3/14 + 3/11 · 8/11 · 11/14)
  = 11/19.
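A small sketch of the Naïve Bayes computation in Solution 17, with the conditional probabilities estimated by counting in Table 5:

```python
from fractions import Fraction as F

# Counts from Table 5: 3 of the 14 persons have y = 1; all 3 have x1 = 1 and x2 = 1.
# Among the 11 persons with y = 0, 3 have x1 = 1 and 8 have x2 = 1.
p_y1, p_y0 = F(3, 14), F(11, 14)
p_x1_y1, p_x2_y1 = F(3, 3), F(3, 3)
p_x1_y0, p_x2_y0 = F(3, 11), F(8, 11)

numerator = p_x1_y1 * p_x2_y1 * p_y1
denominator = numerator + p_x1_y0 * p_x2_y0 * p_y0
print(numerator / denominator)  # 11/19
```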

Question 18. Considering the data in Table 5, we will use x1 to classify whether a subject has inflamma-
tion of urinary bladder (y = 1) or not (y = 0). We
will quantify how useful x1 is for this purpose by cal-
culating the area under curve (AUC) of the receiver
operator characteristic (ROC). Which one of the ROC
curves given in Figure 7 corresponds to using the fea-
ture x1 to determine if a subject has inflammation of
urinary bladder?

A. The curve with AUC=0.636.

B. The curve with AUC=0.864.

C. The curve with AUC=0.909.

D. The curve with AUC=1.000.

E. Don’t know.

Solution 18. The ROC is defined by considering all conceivable thresholds and plotting the true positive
rate (TPR) against the false positive rate (FPR). For a threshold larger than 1 we have that TPR = FPR = 0. For the threshold at 1 we have that 3 of the 11 observations having y = 0 have x1 = 1, thus FPR = 3/11, whereas all three of the three observations with y = 1 have x1 = 1, thus TPR = 1. Lowering the threshold to 0, all 11 of the 11 observations for which y = 0 become false positives, i.e. FPR = 1, and all three of the three observations where y = 1 are true positives, thus TPR = 1, giving an AUC = 0.864.

Figure 7: Four different receiver operating characteristic (ROC) curves and their corresponding area under the curve (AUC).

Question 19. Considering the data in Table 5, we will calculate the similarity between P1, given as the vector r = [1 1 1 1 0 1], and P3, given by the vector s = [1 1 0 1 0 0], using the Jaccard, Simple Matching Coefficient, and Cosine similarity given respectively by:

J(r, s) = f11 / (M − f00),

SMC(r, s) = (f11 + f00) / M,

cos(r, s) = f11 / (‖r‖2 ‖s‖2).

Which one of the following statements regarding the similarity of r and s is correct?

A. J(r, s) < SMC(r, s)

B. J(r, s) > cos(r, s)

C. SMC(r, s) > cos(r, s)

D. cos(r, s) = 3/15

E. Don't know.

Solution 19. For r and s we have:

J(r, s) = f11 / (M − f00) = 3/5 = 0.6000,

SMC(r, s) = (f11 + f00) / M = 4/6 = 0.6667,

cos(r, s) = f11 / (‖r‖2 ‖s‖2) = 3 / (√5 · √3) = 0.7746.

Thus, J(P1, P3) < SMC(P1, P3) is the only correct statement.
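A minimal sketch of the three similarity measures evaluated in Solution 19:

```python
import numpy as np

r = np.array([1, 1, 1, 1, 0, 1])   # P1
s = np.array([1, 1, 0, 1, 0, 0])   # P3

M = len(r)
f11 = int(np.sum((r == 1) & (s == 1)))
f00 = int(np.sum((r == 0) & (s == 0)))

jaccard = f11 / (M - f00)
smc = (f11 + f00) / M
cosine = f11 / (np.linalg.norm(r) * np.linalg.norm(s))
print(jaccard, smc, cosine)  # approx 0.60, 0.667, 0.775
```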

Figure 8: The architecture of the considered neural network having one hidden layer.

Question 20. A neural network is trained to separate persons with urinary inflammation (y = 1) from persons not having urinary inflammation based on the features x1 and x2. The structure of the neural network is outlined in Figure 8. The activation function used for all six neurons n1, n2, n3, n4, n5, and n6 is the rectified linear unit

f(x) = x if x > 0, and 0 otherwise.

The neural network has no biases, i.e. all the biases of all units are zero. The weights of the network are:

w13 = 0.5,  w14 = 0.5,   w15 = −0.5,
w23 = 0.5,  w24 = −0.5,  w25 = 0.25,
w36 = 0.25, w46 = −0.25, w56 = 0.25.

What will be the output (ŷ) of the neural network for an observation having x1 = 1 and x2 = 1?

A. 0

B. 0.25

C. 0.75

D. 1

E. Don't know.

Figure 9: A dataset with five classes given respectively by a large circle and four smaller circles.

Question 21. We will consider the dataset with five classes given in Figure 9, defined respectively by the four inner circles and the larger outer circle. We will cluster this dataset using hierarchical clustering. What would be a suitable measure of proximity and linkage in order to perfectly separate the five classes into five clusters?

A. Average linkage using the 2-norm (i.e. ‖x − y‖2) as proximity measure.

B. Single linkage using the 1-norm (i.e. ‖x − y‖1) as proximity measure.

C. Complete linkage using the 2-norm (i.e. ‖x − y‖2) as proximity measure.

D. Complete linkage using the 1-norm (i.e. ‖x − y‖1) as proximity measure.

E. Don't know.

Solution 20. The outputs of the neurons in the hidden layer will be:

n3: f(0.5 · 1 + 0.5 · 1) = f(1) = 1,
n4: f(0.5 · 1 − 0.5 · 1) = f(0) = 0,
n5: f(−0.5 · 1 + 0.25 · 1) = f(−0.25) = 0.

The output of the output neuron will therefore be:

n6: f(0.25 · 1 − 0.25 · 0 + 0.25 · 0) = f(0.25) = 0.25.
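The forward pass in Solution 20 can be written out in a few lines (a sketch; the weight matrices simply collect the wij given in Question 20):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

x = np.array([1.0, 1.0])                    # x1 = 1, x2 = 1
W_hidden = np.array([[0.5, 0.5, -0.5],      # w13, w14, w15
                     [0.5, -0.5, 0.25]])    # w23, w24, w25
w_out = np.array([0.25, -0.25, 0.25])       # w36, w46, w56

h = relu(x @ W_hidden)    # hidden activations n3, n4, n5 -> [1, 0, 0]
y_hat = relu(h @ w_out)   # output of n6 -> 0.25
print(h, y_hat)
```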

Solution 21. The choice of norm will not generally influence the results much in this example, as observations that are close in 1-norm will also be close in 2-norm. However, the linkage function will heavily influence the results. As all clusters have the property that each observation is closer to at least one observation within its own cluster than to any observation in another cluster, a contiguity-based approach will be well suited. Hence, single linkage clustering will perfectly separate the classes whereas the other approaches will fail.

Figure 10: A two class classification problem.

Question 22. Consider the dataset with two classes given in Figure 10. Which one of the following decisions would lead to a perfect separation of the two classes?

A. If ‖x‖1 ≤ 1/4 and ‖x‖2 ≤ 3/8 then black dot, otherwise red plus.

B. If ‖x‖2 ≤ 3/8 and ‖x‖∞ ≤ 1/4 then black dot, otherwise red plus.

C. If ‖x‖1 ≤ 1/2 and ‖x‖∞ ≤ 1/2 then black dot, otherwise red plus.

D. If ‖x‖1 ≤ 1/2 and ‖x‖∞ ≤ 3/8 then black dot, otherwise red plus.

E. Don't know.

Solution 22. The decision boundary is formed by a hexagonal shape. Inspecting the position of the edges, it is seen that the decision boundary traverses the coordinates (0, 3/8) and (0.25, 0.25), corresponding to points where ‖x‖∞ = 3/8 and ‖x‖1 = 1/2, respectively.

Figure 11: 10,000 data observations drawn from a Gaussian Mixture Model (GMM).

Question 23. Consider the 10,000 observations drawn from a Gaussian Mixture Model (GMM) shown in Figure 11. We will in the following use

N(x | μ, Σ) = 1 / ( (2π)^(M/2) |Σ|^(1/2) ) · exp( −(1/2) (x − μ)ᵀ Σ⁻¹ (x − μ) )

to denote the multivariate normal distribution with mean μ and covariance matrix Σ. Which one of the following GMM densities best characterizes the data?

A. p(x) = 0.5 · N(x | [2; 4], [1 0.8; 0.8 1]) + 0.5 · N(x | [−2; 4], [1 −0.8; −0.8 1])

B. p(x) = 0.05 · N(x | [0; 0], [3 0; 0 3]) + 0.475 · N(x | [2; 4], [1 0.8; 0.8 1]) + 0.475 · N(x | [−2; 4], [1 −0.8; −0.8 1])

C. p(x) = 0.5 · N(x | [0; 0], [3 0; 0 3]) + 0.25 · N(x | [2; 4], [1 0.8; 0.8 1]) + 0.25 · N(x | [−2; 4], [1 −0.8; −0.8 1])

D. p(x) = 0.1 · N(x | [0; 0], [3 0; 0 3]) + 0.45 · N(x | [2; 4], [1 −0.8; −0.8 1]) + 0.45 · N(x | [−2; 4], [1 0.8; 0.8 1])

E. Don't know.

Solution 23. There are three clusters, and the centroids of these clusters are [0; 0], [2; 4], and [−2; 4]. The cluster at [0; 0] is not very dense and should therefore have a very low weight; furthermore, the cluster at [2; 4] has positive covariance whereas the cluster at [−2; 4] has negative covariance. These properties only hold for the answer option

p(x) = 0.05 · N(x | [0; 0], [3 0; 0 3]) + 0.475 · N(x | [2; 4], [1 0.8; 0.8 1]) + 0.475 · N(x | [−2; 4], [1 −0.8; −0.8 1]),

i.e. option B.
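A sketch (assuming NumPy and SciPy are available) of the mixture density selected in Solution 23; sampling 10,000 points from it should reproduce the qualitative picture in Figure 11:

```python
import numpy as np
from scipy.stats import multivariate_normal

# The GMM from answer option B.
weights = [0.05, 0.475, 0.475]
means = [np.array([0.0, 0.0]), np.array([2.0, 4.0]), np.array([-2.0, 4.0])]
covs = [np.array([[3.0, 0.0], [0.0, 3.0]]),
        np.array([[1.0, 0.8], [0.8, 1.0]]),
        np.array([[1.0, -0.8], [-0.8, 1.0]])]

def gmm_density(x):
    # p(x) = sum_k w_k * N(x | mu_k, Sigma_k)
    return sum(w * multivariate_normal(m, c).pdf(x)
               for w, m, c in zip(weights, means, covs))

# Sample by first drawing a component index, then drawing from that Gaussian.
rng = np.random.default_rng(0)
comp = rng.choice(3, size=10_000, p=weights)
samples = np.array([rng.multivariate_normal(means[k], covs[k]) for k in comp])
print(gmm_density([2.0, 4.0]), samples.shape)
```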
4

Question 24. We will consider a very large dataset with 100 million observations and ten features, i.e. N = 100,000,000 and M = 10. We would like to perform two-level cross-validation in order to select between 3 different settings of the parameters of a model (inner fold) and estimate the generalization error (outer fold). We are only allowed to train maximally 65 models in total. Which one of the following procedures satisfies this constraint?

A. Five-fold cross-validation in both the outer and inner folds.

B. Leave-one-out cross-validation for the outer fold and hold-out with 50% for the inner fold.

C. Ten-fold cross-validation for the outer fold and two-fold cross-validation for the inner fold.

D. Two-fold cross-validation for the outer fold and ten-fold cross-validation for the inner fold.

E. Don't know.

Solution 24. In the inner fold we have to train as many models as we have folds in order to identify the optimal parameters. We then use the optimal parameters to train a model on the entire training set in order to evaluate this model on the test set defined in the outer fold. Thus, for each outer fold we need to train the number of inner folds times the number of parameter settings, plus one model (the model trained on the entire training data); multiplied by the number of outer folds this gives K1(K2 · S + 1) models. This gives:

Five-fold cross-validation in both the outer and inner folds: 5(5 · 3 + 1) = 80 models.
Leave-one-out cross-validation for the outer fold and hold-out for the inner fold: 100,000,000(1 · 3 + 1) = 400,000,000 models.
Ten-fold cross-validation for the outer fold and two-fold cross-validation for the inner fold: 10(2 · 3 + 1) = 70 models.
Two-fold cross-validation for the outer fold and ten-fold cross-validation for the inner fold: 2(10 · 3 + 1) = 62 models.

Question 25. We recall that the AdaBoost algorithm is given by updating the weight of the i'th data observation (wi) based on the classifier ft at round t according to:

wi(t + 1) = w̃i(t + 1) / Σ_{j=1}^{N} w̃j(t + 1),   where

w̃i(t + 1) = wi(t) · e^(−αt)  if ft(xi) = yi,
w̃i(t + 1) = wi(t) · e^(αt)   if ft(xi) ≠ yi.

Here αt = (1/2) log((1 − εt)/εt) (where log is the natural logarithm) and εt = Σ_{i=1}^{N} wi (1 − δ_{ft(xi),yi}), where δ_{ft(xi),yi} = 1 if ft(xi) = yi and zero otherwise. Initially the weights are uniform across samples, i.e. w1 = w2 = ... = wN = 1/N, where N is the number of observations.

A dataset is sampled with replacement from this uniform distribution and the classifier is trained on this sampled data. Using this trained classifier, 5 of the original 25 observations are misclassified. What will the updated weights be for these misclassified observations according to the AdaBoost algorithm?

A. 0.02

B. 0.025

C. 0.08

D. 0.1

E. Don't know.

Solution 25. As we have 5 misclassified observations, the weighted error rate will be ε1 = 5/25 = 1/5, thus αt = (1/2) log((1 − 1/5)/(1/5)) = (1/2) log 4 = 0.6931. Thus for correctly classified observations we have w̃j(t + 1) = wj(t) e^(−αt) = (1/25) e^(−0.6931), and for incorrectly classified observations w̃j(t + 1) = wj(t) e^(αt) = (1/25) e^(+0.6931). We thereby obtain for the updated weights of the misclassified observations:

wi(t + 1) = (1/25) e^(+0.6931) / ( (20/25) e^(−0.6931) + (5/25) e^(+0.6931) ) = 0.1.

Question 26. For which of the following purposes is cross-validation the least well suited?

A. Select the number of hidden units in artificial neural networks (ANN).

B. Select the width of the Gaussian kernel in kernel density estimation (KDE).

C. Select the observations that minimize the training error.

D. Select the number of neighbors in KNN classification.

E. Don't know.

Solution 26. Cross-validation can trivially be used to select the number of hidden units in an ANN and the number of nearest neighbors in KNN classification by evaluating the performance predicting the output in these supervised
learning problems. As we have seen in the course, cross-validation can also be used to select the width of the kernel in kernel density estimation. Cross-validation is used to quantify a model's generalization through the use of test sets, not to minimize the training error.

Question 27. Which of the following statements regarding ensemble methods is correct?

A. In ensemble methods it is important that the different trained classifiers perform very similarly.

B. In Random Forest, features are randomly sampled at each node of the tree.

C. Random Forest is the same as fitting several decision trees and classifying according to the tree for which the leaf has the highest purity.

D. In bagging, the output class labels are randomly changed to introduce noise for robustness.

E. Don't know.

Solution 27. Using ensemble methods, it is important that the different classifiers are not the same but as independent as possible. In Random Forest, m < M features are indeed randomly sampled at each node of the tree. Majority voting is used for the classification, not the tree with the highest purity of its leaf. In bagging, observations are uniformly sampled with replacement, and there is no additional emphasis on misclassified observations.
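Finally, a minimal sketch of the single AdaBoost weight update worked through in Solution 25:

```python
import numpy as np

# Round-1 AdaBoost update from Solution 25: N = 25 observations with uniform
# initial weights; 5 of them are misclassified by the first classifier.
N, n_miss = 25, 5
w = np.full(N, 1 / N)
misclassified = np.zeros(N, dtype=bool)
misclassified[:n_miss] = True            # which 5 are wrong does not matter here

eps = w[misclassified].sum()             # weighted error rate = 1/5
alpha = 0.5 * np.log((1 - eps) / eps)    # approx 0.6931

w_tilde = w * np.exp(np.where(misclassified, alpha, -alpha))
w_new = w_tilde / w_tilde.sum()
print(w_new[misclassified][0], w_new[~misclassified][0])  # 0.1 and 0.025
```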

