02450ex Fall2016 Sol
You must either use the electronic file or the form on this page to hand in your answers, but not both. We
strongly encourage you to hand in your answers digitally using the electronic file. If you hand
in using the form on this page, please write your name and student number clearly.
The exam is multiple choice. All questions have four possible answers marked by the letters A, B, C, and D, as
well as the answer "Don't know" marked by the letter E. A correct answer gives 3 points, a wrong answer gives -1
point, and "Don't know" (E) gives 0 points.
The individual questions are answered by filling in the answer fields with one of the letters A, B, C, D, or E.
Answers:
Questions 1-10:  A A B C D B A C A B
Questions 11-20: D D A C D D D B A B
Questions 21-27: B D B D D C B
Name:
Student number:
No. Attribute description Abbrev.
x1 Area A
x2 Perimeter P
x3 Length of kernel L
x4 Width of kernel W
y Seed type
Question 3. The data projected onto the first two principal components (as defined in Question 2)
is given in Figure 2, where each class is indicated using different markers and colors. Which one of the following
statements pertaining to the PCA is correct?
E. Don’t know.
Solution 2. The variation explained by each principal component is given by σ_i² / Σ_{i'} σ_{i'}². As such we find:

VarExpPC1 = 28.4² / (28.4² + 5.5² + 1.2² + 0.5²) = 0.9619   (1)
VarExpPC2 = 5.5² / (28.4² + 5.5² + 1.2² + 0.5²) = 0.0361    (2)
VarExpPC3 = 1.2² / (28.4² + 5.5² + 1.2² + 0.5²) = 0.0017    (3)
VarExpPC4 = 0.5² / (28.4² + 5.5² + 1.2² + 0.5²) = 0.0003    (4)

As such the first PC accounts for more than 95% of the variance, and the first two principal components
account for 0.9619 + 0.0361 = 0.9980, which is less than 99.9% of the variance. The fourth principal component
accounts for 0.03%, which is less than 0.05%. As the first principal component accounts for more than 95%
of the variance, the attributes are indeed very correlated and not the opposite.

Solution 3. As we for the second principal component have v2⊤ = [0.11 −0.13 −0.69 0.71], a relatively long
(large positive x3) and relatively narrow (large negative x4) seed will have a large negative projection onto this
component. As the coefficients for the first principal component are all negative and generally have the same
magnitude for the four attributes, this component appears to capture the general property of size of the seed, such
that a relatively large area, perimeter, kernel length and width will provide a negative projection and vice
versa. The third principal component is defined by v3⊤ = [−0.39 −0.58 0.53 0.47], thus relatively small area
and perimeter (negative x1 and x2) but a large kernel (i.e. positive x3 and x4) will have a positive projection onto
this component. The PCA does not take information about the classes into account and therefore does not
necessarily reflect features relevant for classification. In particular, the singular values are uninformed by the
classes as this information is not available in the PCA analysis.
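The variance-explained figures above are easy to verify numerically. A minimal sketch in Python (numpy assumed; the singular values 28.4, 5.5, 1.2 and 0.5 are the ones quoted in Solution 2):

```python
import numpy as np

# Singular values of the standardized Seeds data (quoted in Solution 2).
sigma = np.array([28.4, 5.5, 1.2, 0.5])

# Variance explained by each PC: sigma_i^2 / sum_i' sigma_i'^2.
var_explained = sigma**2 / np.sum(sigma**2)

print(np.round(var_explained, 4))             # [0.9619 0.0361 0.0017 0.0003]
print(np.round(np.cumsum(var_explained), 4))  # first two PCs explain 0.998 of the variance
```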
Question 4. A decision tree is fitted to the data projected onto the four principal components. At the
root of the tree a split according to the projection of the standardized data onto the first principal component
being larger than 0 is considered, i.e. x̃_n v_1 ≥ 0. For impurity we will use the classification error given by
I(v) = 1 − max_c p(c|v). Before the split we have 70 Kama, 70 Rosa, and 70 Canadian and after the split:

A. −1.0148
B. 0.0148
C. 0.3333

Figure 3: Decision boundaries for four different classifiers trained on the Seeds data projected onto the first
two principal components.
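For reference, the purity gain of a candidate split under the classification-error impurity I(v) = 1 − max_c p(c|v) used in Question 4 can be computed as sketched below; the class counts after the split are placeholders, since the counts from the exam's split are not reproduced in this text:

```python
import numpy as np

def impurity(counts):
    """Classification-error impurity I(v) = 1 - max_c p(c|v)."""
    counts = np.asarray(counts, dtype=float)
    return 1.0 - counts.max() / counts.sum()

def purity_gain(root_counts, branch_counts):
    """Gain = I(root) - sum_k N(v_k)/N(root) * I(v_k)."""
    n_root = float(np.sum(root_counts))
    weighted = sum(np.sum(b) / n_root * impurity(b) for b in branch_counts)
    return impurity(root_counts) - weighted

root = [70, 70, 70]                      # 70 Kama, 70 Rosa, 70 Canadian before the split
branches = [[60, 10, 5], [10, 60, 65]]   # placeholder counts after the split
print(purity_gain(root, branches))
```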
O1 O2 O3 O4 O5 O6 O7 O8 O9
O1 0 0.534 1.257 1.671 1.090 1.315 1.484 1.253 1.418
O2 0.534 0 0.727 2.119 1.526 1.689 1.214 0.997 1.056
O3 1.257 0.727 0 2.809 2.220 2.342 1.088 0.965 0.807
O4 1.671 2.119 2.809 0 0.601 0.540 3.135 2.908 3.087
O5 1.090 1.526 2.220 0.601 0 0.331 2.563 2.338 2.500
O6 1.315 1.689 2.342 0.540 0.331 0 2.797 2.567 2.708
O7 1.484 1.214 1.088 3.135 2.563 2.797 0 0.275 0.298
O8 1.253 0.997 0.965 2.908 2.338 2.567 0.275 0 0.343
O9 1.418 1.056 0.807 3.087 2.500 2.708 0.298 0.343 0
Table 2: Pairwise Euclidean distances between the nine observations O1-O9 of the Seeds data.
A. 0.1429
B. 0.8571
C. 0.8911
D. 0.9574
E. Don't know.

Solution 6. The accuracy is given by the number of correctly classified observations out of the total classified
observations, which is accuracy = (31 + 30 + 29)/(31 + 30 + 29 + 1 + 3 + 5 + 6) = 90/105 = 0.8571.

Figure 5: Four different dendrograms derived from the distances between the nine observations in Table 2.

Question 7. In Table 2 is given the pairwise Euclidean distances between nine observations of the
Seeds data. A hierarchical clustering is used to cluster these nine observations using complete (i.e., maximum)
linkage. Which one of the dendrograms given in Figure 5 corresponds to the clustering?

A. Dendrogram 1.
B. Dendrogram 2.
C. Dendrogram 3.
D. Dendrogram 4.
E. Don’t know.
Solution 7. In complete linkage, clusters are merged according to the maximal distance between observations
within each cluster. The dendrogram grows by first merging O7 and O8 at 0.275, then O5 and O6 at level
0.331, then {O7, O8} with O9 at 0.343, then O1 and O2 at level 0.534, then O4 with {O5, O6} at 0.601,
then O3 with {O7, O8, O9} at level 1.088, then {O1, O2} with {O3, O7, O8, O9} at 1.484, and finally
{O1, O2, O3, O7, O8, O9} with {O4, O5, O6} at 3.135. Only Dendrogram 1 has these properties.

Question 8. We will consider thresholding Dendrogram 4 at the level of three clusters. We recall that
the Rand index, also denoted the simple matching coefficient (SMC), between the true labels and the extracted
clusters is given by:

SMC = (f11 + f00) / K,
A. 0.7500
B. 0.7778
C. 0.8611
D. 1.0000
E. Don’t know.
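Both the merge sequence described in Solution 7 and an SMC of the kind asked for in Question 8 can be checked with scipy; a sketch, where the seed types of O1-O9 are taken from Solution 9 and the dendrogram that is cut is the complete-linkage one from Solution 7 (not necessarily the dendrogram referred to in Question 8):

```python
import numpy as np
from itertools import combinations
from scipy.cluster.hierarchy import linkage, fcluster

# Condensed form of Table 2 (upper triangle, row by row: d(O1,O2), d(O1,O3), ..., d(O8,O9)).
d = np.array([0.534, 1.257, 1.671, 1.090, 1.315, 1.484, 1.253, 1.418,
              0.727, 2.119, 1.526, 1.689, 1.214, 0.997, 1.056,
              2.809, 2.220, 2.342, 1.088, 0.965, 0.807,
              0.601, 0.540, 3.135, 2.908, 3.087,
              0.331, 2.563, 2.338, 2.500,
              2.797, 2.567, 2.708,
              0.275, 0.298,
              0.343])

Z = linkage(d, method='complete')                  # complete (maximum) linkage, as in Solution 7
print(Z[:, 2])                                     # merge levels: 0.275, 0.331, 0.343, 0.534, ...
clusters = fcluster(Z, t=3, criterion='maxclust')  # cut the dendrogram at three clusters

# Seed types of O1-O9 from Solution 9: Kama, Rosa, Canadian.
labels = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])

# Rand index / SMC: fraction of pairs of observations on which the two partitions agree.
f11 = f00 = 0
for i, j in combinations(range(9), 2):
    same_label, same_cluster = labels[i] == labels[j], clusters[i] == clusters[j]
    f11 += int(same_label and same_cluster)
    f00 += int((not same_label) and (not same_cluster))
print((f11 + f00) / 36)                            # 36 = 9*8/2 pairs in total
```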
Question 9. To determine the type of seed of an observation we will use a k-nearest neighbor (KNN)
classifier to predict each of the nine observations based on the Euclidean distance between the observations
given in Table 2. We will use leave-one-out cross-validation for the KNN in order to classify the nine
considered observations using a two-nearest neighbor classifier, i.e. K = 2. For tied classes we will classify
the observation according to its closest observation. The analysis will be based only on the data given in
Table 2. Which one of the following statements is correct?

A. All the observations will be correctly classified.
B. One of the observations will be misclassified.
C. Two of the observations will be misclassified.
D. Three of the observations will be misclassified.
E. Don't know.

Solution 9. N(O1, 2) = {O2, O5}; as O2 is closest it will be correctly classified as Kama.
N(O2, 2) = {O1, O3} and will be correctly classified as Kama.
N(O3, 2) = {O2, O9}; as O2 is closest it will be correctly classified as Kama.
N(O4, 2) = {O6, O5} and will be correctly classified as Rosa.
N(O5, 2) = {O6, O4} and will be correctly classified as Rosa.
N(O6, 2) = {O5, O4} and will be correctly classified as Rosa.
N(O7, 2) = {O8, O9} and will be correctly classified as Canadian.
N(O8, 2) = {O7, O9} and will be correctly classified as Canadian.
N(O9, 2) = {O7, O8} and will be correctly classified as Canadian.

Question 10. We suspect that observation O4 may be an outlier. In order to assess if this is the case
we would like to calculate the average relative KNN density based on the observations given in Table 2 only.
We recall that the KNN density and average relative density for the observation x_i are given by:

density_{X\i}(x_i, K) = 1 / ( (1/K) Σ_{x' ∈ N_{X\i}(x_i, K)} d(x_i, x') ),

ard_X(x_i, K) = density_{X\i}(x_i, K) / ( (1/K) Σ_{x_j ∈ N_{X\i}(x_i, K)} density_{X\j}(x_j, K) ),

where N_{X\i}(x_i, K) is the set of K nearest neighbors of observation x_i excluding the i'th observation, and
ard_X(x_i, K) is the average relative density of x_i using K nearest neighbors. Based on the data in Table 2,
what is the average relative density for observation O4 for K = 1 nearest neighbors?

A. 0.54
B. 0.61
C. 1.63
D. 1.85
E. Don't know.

Solution 10.

density(x_O4, 1) = ((1/1) · 0.540)^(−1) = 1.8519
density(x_O6, 1) = ((1/1) · 0.331)^(−1) = 3.0211
ard(x_O4, 1) = density(x_O4, 1) / ((1/1) · density(x_O6, 1)) = 1.8519 / 3.0211 = 0.61
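Solution 10 can be reproduced directly from the distance matrix; a sketch where the condensed vector lists the upper triangle of Table 2 row by row (numpy/scipy assumed):

```python
import numpy as np
from scipy.spatial.distance import squareform

# Upper triangle of Table 2, row by row: d(O1,O2), d(O1,O3), ..., d(O8,O9).
d_condensed = [0.534, 1.257, 1.671, 1.090, 1.315, 1.484, 1.253, 1.418,
               0.727, 2.119, 1.526, 1.689, 1.214, 0.997, 1.056,
               2.809, 2.220, 2.342, 1.088, 0.965, 0.807,
               0.601, 0.540, 3.135, 2.908, 3.087,
               0.331, 2.563, 2.338, 2.500,
               2.797, 2.567, 2.708,
               0.275, 0.298,
               0.343]
D = squareform(np.array(d_condensed))   # 9 x 9 symmetric distance matrix

def knn_density(D, i, K):
    """Leave-one-out KNN density: 1 / (mean distance to the K nearest neighbours of i)."""
    dists = np.delete(D[i], i)
    others = np.delete(np.arange(len(D)), i)
    order = np.argsort(dists)[:K]
    return 1.0 / dists[order].mean(), others[order]

def avg_rel_density(D, i, K):
    """ard(x_i, K): density of i divided by the mean density of its K nearest neighbours."""
    dens_i, neighbours = knn_density(D, i, K)
    return dens_i / np.mean([knn_density(D, j, K)[0] for j in neighbours])

print(avg_rel_density(D, i=3, K=1))     # O4 is row 3 (0-based); approx 1.8519/3.0211 = 0.61
```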
Figure 6: The nine observations considered in Table 2 projected onto the first principal component (the
location of the projection is given in parentheses).

Question 11. We will consider the nine observations projected onto the first principal component given in
Figure 6. We will cluster this data using k-means with Euclidean distance into three clusters (i.e., k = 3) and
initialize the k-means algorithm with centroids located at observations O4, O6, and O5. Which one of the
following statements is correct?

A. The converged solution will be {O4}, {O6}, {O1, O2, O3, O5, O7, O8, O9}.
B. The converged solution will be {O4, O5, O6}, {O1, O2, O3}, {O7, O8, O9}.
C. The converged solution will be {O4, O5, O6}, {O1, O2}, {O3, O7, O8, O9}.
D. The converged solution will be {O4}, {O5, O6}, {O1, O2, O3, O7, O8, O9}.
E. Don't know.

Solution 11. With the described initialization, observation O4 will be assigned to the cluster located
at O4, observation O6 will be assigned to the cluster located at O6, and the remaining observations
{O1, O2, O3, O5, O7, O8, O9} will be assigned to the cluster located at O5. Thus, only the cluster located at
O5 will change location, and its location is updated to (−1.5 − 0.4 + 0.0 + 0.6 + 0.8 + 1.0 + 1.1)/7 = 0.2286.
For this new location O5 is closer to the cluster located at O6 than to the cluster located at 0.2286, resulting
in the updated clustering {O4}, {O5, O6}, {O1, O2, O3, O7, O8, O9}. Thus the second cluster will change
location to (−1.7 − 1.5)/2 = −1.6 whereas the third cluster will change location to
(−0.4 + 0.0 + 0.6 + 0.8 + 1.0 + 1.1)/6 = 0.5167. As O1 is still closest to cluster 3 there is no change of
assignment and the k-means procedure has converged.

Feature(s)                Training Error Rate   Test Error Rate
No features               0.6667                0.6667
x1                        0.1143                0.1524
x2                        0.1143                0.1143
x3                        0.2190                0.1714
x4                        0.1524                0.1714
x1 and x2                 0.0952                0.1619
x1 and x3                 0.1143                0.1619
x1 and x4                 0.1143                0.1619
x2 and x3                 0.1238                0.1333
x2 and x4                 0.1048                0.1429
x3 and x4                 0.1143                0.1619
x1 and x2 and x3          0.0571                0.1714
x1 and x2 and x4          0.1048                0.1619
x1 and x3 and x4          0.0857                0.1619
x2 and x3 and x4          0.0762                0.1524
x1 and x2 and x3 and x4   0.0667                0.1810

Table 3: Error rate for the training and test set when using multinomial regression to predict the type of seed
using different combinations of the four attributes (x1-x4) based on the hold-out method with 50 % of the
observations held out for testing.

Question 12. A multinomial regression classifier is trained using different combinations of the four
attributes x1, x2, x3, and x4. Table 3 gives the training and test performance of the multinomial regression
classifier when trained using different combinations of the four attributes. Which one of the following
statements is correct?

A. Forward selection will result in a better model being selected than backward selection.
B. Neither forward nor backward selection will identify the optimal feature combination for this problem.
C. Backward selection will use a model that includes three features.
D. Forward selection will select only one feature.
E. Don't know.
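The forward/backward selection walk through Table 3 can be traced programmatically; a sketch where the dictionary simply transcribes the test error rates of Table 3, and selection is assumed to be driven by the test error rate:

```python
# Test error rates from Table 3, keyed by feature subset (1 = x1, ..., 4 = x4).
test_error = {
    frozenset(): 0.6667,
    frozenset({1}): 0.1524, frozenset({2}): 0.1143,
    frozenset({3}): 0.1714, frozenset({4}): 0.1714,
    frozenset({1, 2}): 0.1619, frozenset({1, 3}): 0.1619, frozenset({1, 4}): 0.1619,
    frozenset({2, 3}): 0.1333, frozenset({2, 4}): 0.1429, frozenset({3, 4}): 0.1619,
    frozenset({1, 2, 3}): 0.1714, frozenset({1, 2, 4}): 0.1619,
    frozenset({1, 3, 4}): 0.1619, frozenset({2, 3, 4}): 0.1524,
    frozenset({1, 2, 3, 4}): 0.1810,
}

def forward_selection(features=frozenset({1, 2, 3, 4})):
    selected = frozenset()
    while features - selected:
        best = min((selected | {f} for f in features - selected), key=test_error.get)
        if test_error[best] >= test_error[selected]:
            break                      # no improvement from adding a feature: stop
        selected = best
    return selected

def backward_selection(features=frozenset({1, 2, 3, 4})):
    selected = frozenset(features)
    while selected:
        best = min((selected - {f} for f in selected), key=test_error.get)
        if test_error[best] >= test_error[selected]:
            break                      # no improvement from removing a feature: stop
        selected = best
    return selected

print(sorted(forward_selection()))     # [2]  -> only x2 is selected
print(sorted(backward_selection()))    # [2]  -> x1, then x4, then x3 are removed
```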
Solution 12. Forward selection starts with no features and first selects x2, as it has the lowest test error
rate (0.1143) of the single features. As adding any of the remaining features to x2 does not lead to
improvement in the test error rate, the forward selection method terminates. Backward selection starts with
all features, and an improvement is found by removing feature x1, providing a test error rate of 0.1524 for
x2 and x3 and x4. Subsequently, x4 is removed, providing a test error rate of 0.1333 for the features x2 and
x3. Finally x3 is removed, as having only feature x2 provides an error rate of 0.1143, and the method
terminates.

Question 13. We would like to investigate if we can predict the width of a seed kernel (x4) based on the
area (x1), perimeter (x2), and length of kernel (x3). For this purpose regularized least squares regression is
applied based on minimizing with respect to w the cost function:

E(w) = Σ_n (x_{n4} − [1 x_{n1} x_{n2} x_{n3}] w)² + λ w⊤w,

       −0.0596          0.0417
        0.2811          0.0167
wc =    0.0445 ,  wd =   0.0698 .
        0.3379          0.1354
       −0.4626          0.0403

A. wa corresponds to λ2.
B. wb corresponds to λ2.
C. wc corresponds to λ2.
D. wd corresponds to λ2.
E. Don't know.
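For illustration, the cost function above has the closed-form minimiser w = (X̃⊤X̃ + λI)⁻¹ X̃⊤y, where X̃ contains the column of ones; a sketch on synthetic data (chosen arbitrarily here) showing how the weights shrink as λ grows:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for (area, perimeter, length) predicting kernel width.
X = rng.normal(size=(100, 3))
y = 0.5 * X[:, 0] - 0.3 * X[:, 1] + 0.8 * X[:, 2] + rng.normal(scale=0.1, size=100)
X_tilde = np.hstack([np.ones((100, 1)), X])     # design matrix rows [1 x1 x2 x3]

def ridge(X_tilde, y, lam):
    """Minimiser of sum_n (y_n - X_tilde_n w)^2 + lam * w'w (closed-form solution)."""
    return np.linalg.solve(X_tilde.T @ X_tilde + lam * np.eye(X_tilde.shape[1]),
                           X_tilde.T @ y)

for lam in [0.01, 1.0, 100.0, 10000.0]:
    print(lam, np.round(ridge(X_tilde, y, lam), 4))   # weights shrink towards zero with lam
```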
No.  Attribute description
x1   Occurrence of nausea
x2   Lumbar pain
x3   Urine pushing
x4   Micturition pains
x5   Burn/itch/swell urethra outlet
y    Inflammation of urinary bladder

Table 4: The attributes considered from the study on acute inflammation (taken from
https://archive.ics.uci.edu/ml/datasets/Acute+Inflammations). The attributes x1-x5 and y are
binary where we use 1 for true and 0 for false.

      x1  x2  x3  x4  x5  y
P1    1   1   1   1   0   1
P2    0   0   0   0   0   0
P3    1   1   0   1   0   0
P4    0   1   1   0   1   0
P5    1   1   1   1   1   1
P6    0   0   0   0   0   0
P7    1   1   0   1   0   0
P8    0   1   1   0   1   0
P9    1   1   1   1   0   1
P10   0   1   1   0   1   0
P11   0   0   0   0   0   0
P12   1   1   0   1   0   0
P13   0   1   1   0   1   0
P14   0   1   1   0   1   0
Table 5: Provided in the above table are the last 14 observations of the acute inflammation data.

Question 14. In a study of acute inflammation we would like to predict urinary bladder inflammation
(the data is taken from https://archive.ics.uci.edu/ml/datasets/Acute+Inflammations). We will consider a
subset of the attributes; these attributes are given in Table 4. From the study we have

49.17 pct. of the persons have inflammation of the urinary bladder.
32.20 pct. of the persons that have inflammation of the urinary bladder have occurrence of nausea.
16.39 pct. of the persons that do not have inflammation of the urinary bladder have occurrence of nausea.

What is the probability that a person that has occurrence of nausea, i.e. x1 = 1, has inflammation of the
urinary bladder, i.e. y = 1, according to this study?

A. 15.83 %
B. 32.20 %
C. 65.52 %
D. 98.82 %
E. Don't know.

Solution 14. According to Bayes' theorem we have:

P(y = 1 | x1 = 1) = P(x1 = 1 | y = 1) P(y = 1) / P(x1 = 1)
= P(x1 = 1 | y = 1) P(y = 1) / [ P(x1 = 1 | y = 1) P(y = 1) + P(x1 = 1 | y = 0) P(y = 0) ]
= 0.3220 · 0.4917 / (0.3220 · 0.4917 + 0.1639 · (1 − 0.4917))
= 0.6552

Question 15. We will consider a subset of the acute inflammation data given by the last 14 observations
provided in Table 5. We will consider this dataset a market basket with 14 persons (P1-P14) denoting the
customers and six items denoted x1-x5 and y corresponding to the five input attributes and output
variable respectively of the features described in Table 4. What are all frequent itemsets with support greater
than 40%?

A. {x1}, {x2}, {x3}, {x4}, {x5}, {x1, x2}, {x2, x3}, {x2, x4}, {x2, x5}, {x3, x5}.
B. {x1}, {x2}, {x3}, {x4}, {x5}, {x1, x2}, {x1, x4}, {x2, x3}, {x2, x4}, {x2, x5}, {x3, x5}.
C. {x1}, {x2}, {x3}, {x4}, {x5}, {x1, x2}, {x1, x4}, {x2, x3}, {x2, x4}, {x2, x5}, {x3, x5}, {x2, x3, x5}.
D. {x1}, {x2}, {x3}, {x4}, {x5}, {x1, x2}, {x1, x4}, {x2, x3}, {x2, x4}, {x2, x5}, {x3, x5}, {x1, x2, x4}, {x2, x3, x5}.
E. Don't know.

Solution 15. For a set to have support more than 40% the set must occur at least 0.4 · 14 = 5.6, i.e. 6
out of the 14 times. All the itemsets that have this property are {x1}, {x2}, {x3}, {x4}, {x5}, {x1, x2},
{x1, x4}, {x2, x3}, {x2, x4}, {x2, x5}, {x3, x5}, {x1, x2, x4}, {x2, x3, x5}.
Question 16. What is the confidence of the association rule {x1, x2, x3, x4, x5} → {y}?

A. 0.0 %
B. 7.1 %
C. 21.4 %
D. 100.0 %
E. Don't know.

Solution 16. The confidence is given as

P(y = 1 | x1 = 1, x2 = 1, x3 = 1, x4 = 1, x5 = 1)
= P(y = 1, x1 = 1, x2 = 1, x3 = 1, x4 = 1, x5 = 1) / P(x1 = 1, x2 = 1, x3 = 1, x4 = 1, x5 = 1)
= (1/14) / (1/14) = 1 = 100.0 %

Question 17. We would like to predict whether a subject has inflammation of urinary bladder (y = 1)
or not (y = 0) using the data in Table 5 and the attributes x1 and x2 only. We will apply a Naïve Bayes
classifier that assumes independence between the two attributes given y. Given that a person has x1 = 1
and x2 = 1, what is the probability that the person has an inflammation of urinary bladder (y = 1) according
to the Naïve Bayes classifier?

A. 1/14
B. 3/14
C. 1/2
D. 11/19
E. Don't know.

Solution 17. According to the Naïve Bayes classifier we have

P(y = 1 | x1 = 1, x2 = 1)
= P(x1 = 1 | y = 1) P(x2 = 1 | y = 1) P(y = 1) / [ P(x1 = 1 | y = 1) P(x2 = 1 | y = 1) P(y = 1) + P(x1 = 1 | y = 0) P(x2 = 1 | y = 0) P(y = 0) ]
= (3/3 · 3/3 · 3/14) / (3/3 · 3/3 · 3/14 + 3/11 · 8/11 · 11/14)
= 11/19.
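Solution 17 can be checked by counting directly in Table 5; a small sketch using only the columns x1, x2 and y of the 14 observations:

```python
# Columns (x1, x2, y) of the 14 observations in Table 5.
data = [(1, 1, 1), (0, 0, 0), (1, 1, 0), (0, 1, 0), (1, 1, 1), (0, 0, 0), (1, 1, 0),
        (0, 1, 0), (1, 1, 1), (0, 1, 0), (0, 0, 0), (1, 1, 0), (0, 1, 0), (0, 1, 0)]

def prob(pred, cond=lambda r: True):
    """Empirical conditional probability P(pred | cond) from the table."""
    rows = [r for r in data if cond(r)]
    return sum(pred(r) for r in rows) / len(rows)

p_y1 = prob(lambda r: r[2] == 1)                                  # P(y=1) = 3/14
num = prob(lambda r: r[0] == 1, lambda r: r[2] == 1) * \
      prob(lambda r: r[1] == 1, lambda r: r[2] == 1) * p_y1
den = num + prob(lambda r: r[0] == 1, lambda r: r[2] == 0) * \
            prob(lambda r: r[1] == 1, lambda r: r[2] == 0) * (1 - p_y1)
print(num / den)                                                  # 11/19 = 0.5789...
```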
We recall that

J(r, s) = f11 / (M − f00),
SMC(r, s) = (f11 + f00) / M,
cos(r, s) = f11 / (‖r‖2 ‖s‖2).

Which one of the following statements regarding the similarity of r and s is correct?

D. cos(r, s) = 3/15
E. Don't know.
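The three measures recalled above are straightforward to compute for binary vectors; a sketch with placeholder vectors (the actual r and s from the exam are not reproduced in this text):

```python
import numpy as np

def binary_similarities(r, s):
    """Jaccard, SMC and cosine similarity for two binary vectors of length M."""
    r, s = np.asarray(r), np.asarray(s)
    M = len(r)
    f11 = np.sum((r == 1) & (s == 1))
    f00 = np.sum((r == 0) & (s == 0))
    jaccard = f11 / (M - f00)
    smc = (f11 + f00) / M
    cosine = f11 / (np.linalg.norm(r) * np.linalg.norm(s))
    return jaccard, smc, cosine

# Placeholder binary vectors purely for illustration.
r = [1, 0, 1, 1, 0, 0, 1, 0]
s = [1, 1, 0, 1, 0, 0, 1, 1]
print(binary_similarities(r, s))
```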
Figure 8: The architecture of the considered neural network having one hidden layer.

Question 20. The neural network has no biases, i.e. all the biases of all units are zero. The weights of the
network are:

w13 = 0.5, w14 = 0.5, w15 = −0.5,
w23 = 0.5, w24 = −0.5, w25 = 0.25,
w36 = 0.25, w46 = −0.25, w56 = 0.25.

What will be the output (ŷ) of the neural network for an observation having x1 = 1 and x2 = 1?

A. 0
B. 0.25
C. 0.75
D. 1
E. Don't know.

Solution 20. The output of the neurons in the hidden layer will be:

n3: f(0.5 · 1 + 0.5 · 1) = f(1) = 1,
n4: f(0.5 · 1 − 0.5 · 1) = f(0) = 0,
n5: f(−0.5 · 1 + 0.25 · 1) = f(−0.25) = 0.

The output of the output neuron will therefore be:

n6: f(0.25 · 1 − 0.25 · 0 + 0.25 · 0) = f(0.25) = 0.25.

Question 21. We will consider the dataset with five classes given in Figure 9, defined respectively by the
four inner circles and the larger outer circle. We will cluster this dataset using hierarchical clustering. What
would be a suitable measure of proximity and linkage in order to perfectly separate the five classes into five
clusters?

A. Average linkage using the 2-norm (i.e. ‖x − y‖2) as proximity measure.
B. Single linkage using the 1-norm (i.e. ‖x − y‖1) as proximity measure.
C. Complete linkage using the 2-norm (i.e. ‖x − y‖2) as proximity measure.
D. Complete linkage using the 1-norm (i.e. ‖x − y‖1) as proximity measure.
E. Don't know.
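The forward pass in Solution 20 above can be written compactly in matrix form; a sketch assuming the transfer function f is the rectified linear function, which is consistent with the values f(1) = 1, f(0) = 0, f(−0.25) = 0 and f(0.25) = 0.25 used in the solution:

```python
import numpy as np

def f(a):
    """Rectified linear transfer function (assumed; matches the values in Solution 20)."""
    return np.maximum(a, 0.0)

# Weights of the network from Question 20 (no biases).
W_hidden = np.array([[0.5,  0.5, -0.5],    # w13, w14, w15
                     [0.5, -0.5,  0.25]])  # w23, w24, w25
w_out = np.array([0.25, -0.25, 0.25])      # w36, w46, w56

x = np.array([1.0, 1.0])                   # x1 = 1, x2 = 1
hidden = f(x @ W_hidden)                   # activations of n3, n4, n5
y_hat = f(hidden @ w_out)                  # activation of n6
print(hidden, y_hat)                       # [1. 0. 0.] 0.25
```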
Solution 22. The decision boundary is formed by a hexagonal shape. Inspecting the position of the edges,
it is seen that the decision boundary traverses the coordinates (0, 3/8) and (0.25, 0.25), corresponding to
points where ‖x‖∞ = 3/8 and ‖x‖1 = 1/2, respectively.
Question 23. Consider the 10,000 observations drawn from a Gaussian Mixture Model (GMM) shown
in Figure 11. We will in the following use

N(x | µ, Σ) = 1 / ((2π)^(M/2) |Σ|^(1/2)) · exp(−(1/2) (x − µ)⊤ Σ⁻¹ (x − µ))

to denote the multivariate normal distribution with mean µ and covariance matrix Σ. Which one of the
following GMM densities best characterizes the data?

A.
p(x) = 0.5 · N(x | [2; 4], [1 0.8; 0.8 1]) + 0.5 · N(x | [−2; 4], [1 −0.8; −0.8 1])

B.
p(x) = 0.05 · N(x | [0; 0], [3 0; 0 3]) + 0.475 · N(x | [2; 4], [1 0.8; 0.8 1]) + 0.475 · N(x | [−2; 4], [1 −0.8; −0.8 1])

C.
p(x) = 0.5 · N(x | [0; 0], [3 0; 0 3]) + 0.25 · N(x | [2; 4], [1 0.8; 0.8 1]) + 0.25 · N(x | [−2; 4], [1 −0.8; −0.8 1])

D.
p(x) = 0.1 · N(x | [0; 0], [3 0; 0 3]) + 0.45 · N(x | [2; 4], [1 −0.8; −0.8 1]) + 0.45 · N(x | [−2; 4], [1 0.8; 0.8 1])

E. Don't know.

Solution 23. The cluster centered at [−2; 4] has negative covariance. This property only holds for the
answer option:

p(x) = 0.05 · N(x | [0; 0], [3 0; 0 3]) + 0.475 · N(x | [2; 4], [1 0.8; 0.8 1]) + 0.475 · N(x | [−2; 4], [1 −0.8; −0.8 1]).
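A mixture density of this form is simple to write down and evaluate; a sketch using the weights, means and covariances of answer option B (scipy's multivariate_normal assumed):

```python
import numpy as np
from scipy.stats import multivariate_normal

# Mixture from option B: a broad background component at the origin and two
# correlated components at (2, 4) and (-2, 4) with opposite covariance signs.
components = [
    (0.05,  [0, 0],  [[3, 0],    [0, 3]]),
    (0.475, [2, 4],  [[1, 0.8],  [0.8, 1]]),
    (0.475, [-2, 4], [[1, -0.8], [-0.8, 1]]),
]

def gmm_density(x):
    """p(x) = sum_k pi_k * N(x | mu_k, Sigma_k)."""
    return sum(w * multivariate_normal(mean=m, cov=S).pdf(x) for w, m, S in components)

print(gmm_density([2, 4]), gmm_density([-2, 4]), gmm_density([0, 0]))
```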
Question 24. We will consider a very large dataset with 100 million observations and ten features, i.e.
N = 100,000,000 and M = 10. We would like to perform two-level cross-validation in order to select between 3
different settings of the parameters of a model (inner fold) and estimate the generalization error (outer fold).
We are only allowed to train maximally 65 models in total. Which one of the following procedures satisfies
this constraint?

A. Five-fold cross-validation in both the outer and inner folds.
B. Leave-one-out cross-validation for the outer fold and hold-out 50 % for the inner fold.
C. Ten-fold cross-validation for the outer fold and two-fold cross-validation for the inner fold.
D. Two-fold cross-validation for the outer fold and ten-fold cross-validation for the inner fold.
E. Don't know.

Here αt = (1/2) log((1 − εt)/εt) (where log is the natural logarithm) and εt = Σ_{i=1}^{N} w_i (1 − δ_{ft(xi), yi}),
where δ_{ft(xi), yi} = 1 if ft(xi) = yi and zero otherwise. Initially the weights are uniform across samples,
i.e. w1 = w2 = ... = wN = 1/N where N is the number of observations.
A dataset is sampled with replacement from this uniform distribution and the classifier is trained on
this sampled data. Using this trained classifier 5 of the original 25 observations are misclassified. What
will the updated weights be for these misclassified observations according to the AdaBoost algorithm?

A. 0.02
B. 0.025
C. 0.08
D. 0.1
E. Don't know.
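The AdaBoost update asked about above can be worked through numerically; a sketch assuming the standard normalized update, in which a misclassified observation is reweighted by exp(αt), a correctly classified one by exp(−αt), and the weights are then normalized to sum to one:

```python
import numpy as np

N, n_mis = 25, 5
w = np.full(N, 1.0 / N)                       # initial uniform weights, w_i = 1/25
misclassified = np.zeros(N, dtype=bool)
misclassified[:n_mis] = True                  # 5 of the 25 observations are misclassified

eps = np.sum(w[misclassified])                # epsilon_t = 5/25 = 0.2
alpha = 0.5 * np.log((1 - eps) / eps)         # alpha_t = 0.5 * ln(4) ~ 0.693

w_new = w * np.exp(np.where(misclassified, alpha, -alpha))
w_new /= w_new.sum()                          # normalise so the weights sum to one

# Misclassified observations -> 0.1, correctly classified observations -> 0.025.
print(alpha, w_new[misclassified][0], w_new[~misclassified][0])
```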
learning problems. As we have seen in the course, cross-validation can also be used to quantify the width of
the kernel density estimator. Cross-validation is used to quantify a model's generalization through the use of
test sets and not to minimize the training error.
Question 27. Which of the following statements
regarding ensemble methods is correct?
E. Don’t know.