02450 Exam Fall 2017: Solutions


Technical University of Denmark

Written examination: 19 December 2017, 9 AM - 1 PM.

Course name: Introduction to Machine Learning and Data Mining.

Course number: 02450.

Aids allowed: All aids permitted.

Exam duration: 4 hours.

Weighting: The individual questions are weighted equally.

You must either use the electronic file or the form on this page to hand in your answers, but not both. We
strongly encourage you to hand in your answers digitally using the electronic file. If you hand
in using the form on this page, please write your name and student number clearly.
The exam is multiple choice. All questions have four possible answers marked by the letters A, B, C, and D as
well as the answer “Don’t know” marked by the letter E. Correct answer gives 3 points, wrong answer gives -1
point, and “Don’t know” (E) gives 0 points.
The individual questions are answered by filling in the answer fields with one of the letters A, B, C, D, or E.

Answers:

1 2 3 4 5 6 7 8 9 10

D C C B C A B B A D

11 12 13 14 15 16 17 18 19 20

B C A C A D A C D B

21 22 23 24 25 26 27

A C A A B C A

Name:

Student number:

PLEASE HAND IN YOUR ANSWERS DIGITALLY.
USE ONLY THIS PAGE FOR HAND-IN IF YOU ARE UNABLE TO HAND IN DIGITALLY.

No.  Attribute description                                       Abbrev.
x1   Height (in feet)                                            Height
x2   Weight (in pounds)                                          Weight
x3   Percent of successful field goals (out of 100 attempted)    FG
x4   Percent of successful free throws (out of 100 attempted)    FT
y    Average points scored per game                              PT

Table 1: The attributes of the Basketball dataset


contains 54 observations of basketball players in terms
of their height, weight and performance. The output y
provides each player’s average points scored per game.

Question 1. We will consider a basketball dataset


containing 54 National Basketball Association (NBA) Figure 1: Boxplot of the four attributes x1 –x4 after
basketball players and their performance1 . For brevity standardizing the data (i.e., subtracting the mean of
this dataset will be denoted the Basketball dataset each attribute and dividing each attribute by its stan-
in the following. In Table 1 the attributes of the dard deviation). The worst performing player (i.e., the
data as well as the output attribute y defined by each player with lowest output value y) is indicated by a
player’s average points scored per game are given. In black circle with and asteric inside on top of each box-
Figure 1 is given a boxplot of the four attributes x1 –x4 plot along with his standardized value of each attribute
after standardizing the data, i.e. subtracting the mean given as black digits.
of each attribute and dividing each attribute by its
standard deviation. The worst performing player (i.e.,
the player with lowest value of y) is indicated by a black
circle with an asteric inside. Considering the attributes
described in Table 1 and the boxplot in Figure 1 which
one of the following statements regarding the attributes
x1 –x4 and output variable y is correct?
The value of x4 is positioned exactly at the smallest
A. The player with worst performance has Weight
value for which the observation is still within the
(i.e., x2 ) that is between the 25th and 50th per-
25th percentile subtracted 1.5 times the interquartile
centile.
range and therefore not an outlier as the lower whisker
B. The player with worst performance has FT (i.e., extends to this observation. In order to be ratio zero
x4 ) so low it is deemed an outlier. has to mean absence of what is being measured. As 0
percent of successful field goals and free throws implies
C. The input attributes FT and FG (i.e., x3 and x4 ) absence of having scored these are ratio attributes and
are both ordinal variables. not just ordinal. We can here also talk about 25 %
being half as frequent as 50 % etc. As zero average
D. The output y is ratio.
points scored per game implies absence of scoring this
E. Don’t know. output variable is also ratio.

Solution 1. Inspecting the boxplot we observe that


the value of x2 is within the 50th and 75th percentiles.
1
The dataset is taken from
http://college.cengage.com/mathematics/brase
/understandable statistics/7e/students/datasets/mlr/
frames/frame.html

Question 2. A principal component analysis (PCA) is carried out on the standardized attributes x1–x4, forming the standardized matrix X̃. The squared Frobenius norm of the standardized matrix is given by ‖X̃‖²_F = 212. A singular value decomposition is applied to the matrix X̃ and we find that the first three singular values are σ1 = 11.1, σ2 = 7.2, σ3 = 5.2. What is the value of the fourth singular value σ4?

A. σ4 = 1.2

B. σ4 = 2.3

C. σ4 = 3.1

D. σ4 = 9.9

E. Don’t know.

Figure 2: The Basketball data projected onto the first and second principal component. In the plot four observations are highlighted, denoted A, B, C, and D, of which one corresponds to the player with lowest output value y given by the circle with an asterisk inside in Figure 1.

Solution 2. The variance explained by the i'th principal component is given by σi² / Σ_{i'} σ_{i'}² = σi² / ‖X̃‖²_F. Thus, Σ_{i'} σ_{i'}² = ‖X̃‖²_F, and from this we know that

σ4 = √(‖X̃‖²_F − Σ_{i'=1}^{3} σ_{i'}²) = √(212 − 11.1² − 7.2² − 5.2²) = 3.1.
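As a quick numerical check of Solution 2, a minimal sketch in Python using only the singular values and the squared Frobenius norm given in the question:

```python
import numpy as np

# Squared Frobenius norm of the standardized data matrix (given in the question).
frob_sq = 212.0
# First three singular values (given in the question).
sigma = np.array([11.1, 7.2, 5.2])

# The squared singular values must sum to the squared Frobenius norm,
# so the fourth singular value is the square root of the remainder.
sigma4 = np.sqrt(frob_sq - np.sum(sigma**2))
print(round(sigma4, 1))  # -> 3.1
```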

Question 3. From the singular value decomposition of X̃ we further obtain the following V matrix:

V = [ −0.60   0.02  −0.41   0.69
      −0.61   0.00  −0.33  −0.72
      −0.46   0.46   0.76   0.04
       0.25   0.89  −0.39  −0.04 ].
The data projected onto the first two principal components is given in Figure 2, including four observations denoted A, B, C, and D that are marked by a black circle with an asterisk inside. One of these four observations corresponds to the worst performing player indicated by a similar black circle with an asterisk inside in the boxplot of Figure 1. Which one of the four observations in Figure 2 corresponds to the worst performing player indicated in the boxplot of Figure 1?

A. Observation A.

B. Observation B.

C. Observation C.

D. Observation D.

E. Don’t know.
Solution 3. From the boxplot we know that the worst performing player in terms of the output value y has the standardized observation vector x̃* = [0.68 0.66 −0.67 −1.47] and will thus have the projection onto the two first principal components given by

[0.68 0.66 −0.67 −1.47] [ −0.60   0.02
                          −0.61   0.00
                          −0.46   0.46
                           0.25   0.89 ] = [−0.8699 −1.6029].

Thus, in the projection the observation will be located at (−0.8699, −1.6029), which corresponds to the observation denoted C.
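A minimal sketch of this projection in Python, using the first two columns of V and the standardized observation vector from the solution:

```python
import numpy as np

# First two columns of V and the standardized observation from Solution 3.
V2 = np.array([[-0.60,  0.02],
               [-0.61,  0.00],
               [-0.46,  0.46],
               [ 0.25,  0.89]])
x_std = np.array([0.68, 0.66, -0.67, -1.47])

# Project the observation onto the first two principal components.
print(x_std @ V2)  # -> [-0.8699 -1.6029], i.e. observation C in Figure 2
```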
Question 4. A least squares linear regression model is trained using different combinations of the four attributes x1, x2, x3, and x4 in order to predict the average points scored per game y. Table 2 provides the training and test root-mean-square error (RMSE = √((1/N) Σ_{i=1}^{N} (yi − ŷi)²)) performance of the least squares linear regression model when trained using different combinations of the four attributes. Which one of the following statements is correct?

A. Forward selection will terminate when all features x1–x4 are included in the feature set.

B. The solution identified by Forward selection will be worse than the solution identified by Backward selection.

C. Forward selection will terminate using two features in the feature set.

D. Backward selection will terminate using two features in the feature set.

E. Don’t know.

Solution 4. Forward selection will select x4 with test performance 5.6845; including additional features will not improve performance and the procedure will therefore terminate with the feature set {x4}. Backward selection will remove feature x2, terminating at the feature set {x1, x3, x4} with test performance 5.5099, since removing additional features only increases the test error. The solution found by forward selection (5.6845) is thus worse than the one found by backward selection (5.5099).

Feature(s)                 Training RMSE   Test RMSE
No features                5.8977          5.8505
x1                         5.8760          6.0035
x2                         5.8841          5.9037
x3                         5.1832          5.9272
x4                         5.8727          5.6845
x1 and x2                  5.6272          7.4558
x1 and x3                  5.1482          5.6409
x1 and x4                  5.8451          5.8269
x2 and x3                  5.0483          5.6656
x2 and x4                  5.8660          5.7461
x3 and x4                  5.1125          5.7390
x1 and x2 and x3           4.9836          6.2823
x1 and x2 and x4           5.6261          7.3888
x1 and x3 and x4           5.0839          5.5099
x2 and x3 and x4           5.0113          5.5605
x1 and x2 and x3 and x4    4.9645          6.0892

Table 2: Root-mean-square error (RMSE) for the training and test set when using least squares regression to predict average points scored per game y using different combinations of the four attributes (x1–x4).
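For illustration of Solution 4, a minimal sketch of the greedy forward-selection loop run over the test RMSEs in Table 2; the `test_rmse` table and the helper function are our own names, not part of the exam:

```python
# Test RMSEs from Table 2, keyed by feature subset.
test_rmse = {
    frozenset(): 5.8505,
    frozenset({"x1"}): 6.0035, frozenset({"x2"}): 5.9037,
    frozenset({"x3"}): 5.9272, frozenset({"x4"}): 5.6845,
    frozenset({"x1", "x2"}): 7.4558, frozenset({"x1", "x3"}): 5.6409,
    frozenset({"x1", "x4"}): 5.8269, frozenset({"x2", "x3"}): 5.6656,
    frozenset({"x2", "x4"}): 5.7461, frozenset({"x3", "x4"}): 5.7390,
    frozenset({"x1", "x2", "x3"}): 6.2823, frozenset({"x1", "x2", "x4"}): 7.3888,
    frozenset({"x1", "x3", "x4"}): 5.5099, frozenset({"x2", "x3", "x4"}): 5.5605,
    frozenset({"x1", "x2", "x3", "x4"}): 6.0892,
}

def forward_select(table):
    features = {"x1", "x2", "x3", "x4"}
    current = frozenset()
    while True:
        # Try adding each remaining feature; keep the best candidate.
        candidates = [current | {f} for f in features - current]
        if not candidates:
            return current
        best = min(candidates, key=table.get)
        if table[best] >= table[current]:
            return current  # no candidate improves the error: stop
        current = best

print(sorted(forward_select(test_rmse)))  # -> ['x4']
```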

Question 5. We will encode the output attribute y in terms of three different classes, i.e. low performing players having performance below the 33.3 percentile, mid-performing players having performance in the range of the 33.3 percentile to the 66.6 percentile, and high-performing players having performance above the 66.6 percentile (i.e., each of the three classes will contain approximately one third of the data). We presently consider only the attributes FG and FT. Consider the three Gaussian distributions given in Figure 3, where each Gaussian is fitted to each of the three classes separately. We recall that the multivariate Gaussian distribution is given by:

N(x|µ, Σ) = 1/((2π)^{M/2} |Σ|^{1/2}) exp(−½ (x − µ)ᵀ Σ⁻¹ (x − µ)),

with mean µ and covariance matrix Σ. The three fitted covariances (in arbitrary order) are given by

Σa = [0.0035 0.0003; 0.0003 0.0030], Σb = [0.0028 −0.0013; −0.0013 0.0191], and Σc = [0.0020 0.0001; 0.0001 0.0061].

The Gaussians are plotted in Figure 3 in terms of lines indicating where N(x|µ, Σ) = 5, N(x|µ, Σ) = 10, and N(x|µ, Σ) = 20, thus defining ellipsoidal shapes at three different levels of the density functions. Which of the three classes correspond to which of the three covariance matrices Σa, Σb, and Σc?
A. Σa corresponds to the low performing class, Σb corresponds to the mid performing class, Σc corresponds to the high performing class.

B. Σa corresponds to the low performing class, Σc corresponds to the mid performing class, Σb corresponds to the high performing class.

C. Σc corresponds to the low performing class, Σa corresponds to the mid performing class, Σb corresponds to the high performing class.

E. Don’t know.

Figure 3: The 54 observations plotted in terms of percentage of successful field goals (FG) against percentage of successful free throws (FT) for low, mid and high performing players, respectively indicated by black plusses, red squares, and blue crosses. For each of these three classes a multivariate Gaussian distribution is fitted and the lines corresponding to density values of 5, 10, and 20 are plotted in black, red, and blue respectively.

Solution 5. Inspecting the covariance matrices indicated by the iso-line contours of each Gaussian at N(x|µ, Σ) = 5, N(x|µ, Σ) = 10, and N(x|µ, Σ) = 20, we observe that the high performing (i.e., blue) class has negative covariance, which is only the case for Σb. Of the low performing and mid-performing classes we observe that the low performing class has a larger variance in the second dimension (i.e., the x4 direction) than the mid-performing class; thus, as Σc(2,2) = 0.0061 > 0.0030 = Σa(2,2), we have that Σc corresponds to the low performing and Σa to the mid performing class.

Question 6. A decision tree is fitted to the data considering as output whether the basketball player was in the group performing low, mid, or high according to splitting the output value in terms of the 33.3 and 66.6 percentiles, as explained in the previous question. At the root of the tree it is considered to split according to Height (i.e., x1), considering relatively short, medium, and tall players based on splitting x1 also according to its 33.3 and 66.6 percentiles. For impurity we will use the Gini given by I(v) = 1 − Σ_c p(c|v)². Before the split, we have 18 low, 18 mid, and 18 high performing players, and after the split we have:

- Of the 18 short players, 6 have low, 9 have mid, and 3 have high performance.
- Of the 20 medium height players, 4 have low, 6 have mid, and 10 have high performance.
- Of the 16 tall players, 8 have low, 3 have mid, and 5 have high performance.

Which statement regarding the purity gain ∆ of the split is correct?

A. ∆ = 0.0505

B. ∆ = 0.1667

C. ∆ = 0.3333

D. ∆ = 0.6667

E. Don’t know.

Solution 6. The purity gain is given by

∆ = I(r) − Σ_{k=1}^{K} (N(vk)/N) I(vk),   where I(v) = 1 − Σ_c p(c|v)².

Evaluating the purity gain for the split we have:

∆ = (1 − ((18/54)² + (18/54)² + (18/54)²))
    − [ (18/54)(1 − ((6/18)² + (9/18)² + (3/18)²))
      + (20/54)(1 − ((4/20)² + (6/20)² + (10/20)²))
      + (16/54)(1 − ((8/16)² + (3/16)² + (5/16)²)) ]
  = 0.0505.
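A small sketch of this computation in Python, with the class counts taken from the question:

```python
def gini(counts):
    """Gini impurity I(v) = 1 - sum_c p(c|v)^2 for class counts at a node."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

root = [18, 18, 18]                            # low/mid/high before the split
branches = [[6, 9, 3], [4, 6, 10], [8, 3, 5]]  # short/medium/tall after the split

n_root = sum(root)
# Purity gain: root impurity minus the weighted impurity of the branches.
gain = gini(root) - sum(sum(b) / n_root * gini(b) for b in branches)
print(round(gain, 4))  # -> 0.0505
```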
Figure 4: Considering the attribute Height for 10 observations in the Basketball data, we inspect whether each of the 10 players, here denoted O1, O2, ..., O10, has a relatively high percentage of successful field goals (FG>45%), indicated in black and considered the positive class (i.e., observations O5, O6, O7, and O10), or a relatively low percentage of successful field goals (FG≤45%), indicated in red and considered the negative class (i.e., observations O1, O2, O3, O4, O8, and O9).

Figure 5: Four different receiver operator characteristic (ROC) curves and their corresponding area under curve (AUC) values.

Question 7. We suspect that a basketball player's Height (i.e., x1) is predictive of whether the player is successful with his field goals FG (i.e., x3). To quantify whether Height is predictive of FG, we will evaluate the area under curve (AUC) of the receiver operator characteristic (ROC) using the feature Height to discriminate between FG>45% (positive class) or FG≤45% (negative class), considering the data given in Figure 4. Which one of the receiver operator characteristic (ROC) curves given in Figure 5 corresponds to the correct ROC curve?

A. The curve having AUC=0.625

B. The curve having AUC=0.750

C. The curve having AUC=0.875

D. The curve having AUC=0.917

E. Don’t know.

Solution 7. There are a total of 4 positive and 6 negative observations. When lowering the threshold for predicting high performance based on the value of Height, we observe that the first observation to be above the threshold is O10, which belongs to the positive class; thus TPR=1/4, FPR=0/6. Subsequently we get two observations from the negative class, thus TPR=1/4, FPR=2/6, and then three positive observations above the threshold, i.e. TPR=4/4, FPR=2/6. Lowering the threshold further we obtain the remaining negative observations such that TPR=4/4, FPR=6/6. The only curve having this property is the curve with AUC=0.750.
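A minimal sketch tracing out this ROC curve numerically; the Height values for O1–O10 are read off Figure 4 as they appear in the later solutions:

```python
import numpy as np

# Heights of O1..O10 as used in the solutions; 1 marks the positive class (FG>45%).
height = np.array([5.7, 6.0, 6.2, 6.3, 6.4, 6.6, 6.7, 6.9, 7.0, 7.4])
label  = np.array([0,   0,   0,   0,   1,   1,   1,   0,   0,   1])

# Sweep the threshold from high to low and trace out (FPR, TPR) pairs.
order = np.argsort(-height)
tpr, fpr = [0.0], [0.0]
tp = fp = 0
for i in order:
    tp += label[i]
    fp += 1 - label[i]
    tpr.append(tp / label.sum())
    fpr.append(fp / (len(label) - label.sum()))

auc = np.trapz(tpr, fpr)  # area under the piecewise-linear ROC curve
print(round(auc, 3))  # -> 0.75
```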
Question 8. We will consider the ten observations of the Basketball dataset given in Figure 4. We will cluster this data using k-means with Euclidean distance into two clusters (i.e., k = 2). Which one of the following solutions constitutes a converged solution in the k-means clustering procedure?

A. {O1}, {O2, O3, O4, O5, O6, O7, O8, O9, O10}.

B. {O1, O2, O3, O4, O5}, {O6, O7, O8, O9, O10}.

C. {O1, O2, O3, O4, O5, O6, O7}, {O8, O9, O10}.

D. {O1, O2, O3, O4, O5, O6, O7, O8, O9}, {O10}.

E. Don’t know.

Solution 8. The solution {O1}, {O2, O3, O4, O5, O6, O7, O8, O9, O10} has centroids at 5.7 and (6.0+6.2+6.3+6.4+6.6+6.7+6.9+7.0+7.4)/9 = 6.6111. As such, O2 is closer to the centroid at 5.7 than to the one at 6.6111 and will thus be reassigned to that centroid; hence this is not a converged solution. The solution {O1, O2, O3, O4, O5}, {O6, O7, O8, O9, O10} has centroids at (5.7+6.0+6.2+6.3+6.4)/5 = 6.12 and (6.6+6.7+6.9+7.0+7.4)/5 = 6.92. As such, O5 is closer to the centroid at 6.12, O6 is closer to the centroid at 6.92, and no observation is closer to the other cluster's centroid; this thus forms a converged solution. The solution {O1, O2, O3, O4, O5, O6, O7}, {O8, O9, O10} has centroids at (5.7+6.0+6.2+6.3+6.4+6.6+6.7)/7 = 6.2714 and (6.9+7.0+7.4)/3 = 7.1. As such, O7 is closer to the centroid at 7.1 than to the one at 6.2714 and will thus be reassigned; hence this is also not a converged solution. The solution {O1, O2, O3, O4, O5, O6, O7, O8, O9}, {O10} has centroids at 6.4222 and 7.4. As such, O9 is closer to 7.4 than to 6.4222 and will thus be reassigned; hence this is also not a converged solution.

Question 9. We suspect that observation O10 may be an outlier. In order to assess if this is the case, we would like to calculate the average relative KNN density based on Euclidean distance and the observations given in Figure 4 only. We recall that the KNN density and average relative density (ard) for the observation xi are given by:

density_{X\i}(xi, K) = 1 / ( (1/K) Σ_{x' ∈ N_{X\i}(xi, K)} d(xi, x') ),

ard_X(xi, K) = density_{X\i}(xi, K) / ( (1/K) Σ_{xj ∈ N_{X\i}(xi, K)} density_{X\j}(xj, K) ),

where N_{X\i}(xi, K) is the set of K nearest neighbors of observation xi excluding the i'th observation (if observations are tied in terms of their distances to an observation, the observation with smallest observation number will be selected). Based on considering only the attribute Height and the ten observations in Figure 4, what is the average relative density for observation O10 for K = 3 nearest neighbors?

A. 0.409

B. 0.500

C. 0.533

D. 1.875

E. Don’t know.

Solution 9.

density(x_O10, 3) = ( (1/3)(0.4 + 0.5 + 0.7) )⁻¹ = 1.8750
density(x_O9, 3)  = ( (1/3)(0.1 + 0.3 + 0.4) )⁻¹ = 3.7500
density(x_O8, 3)  = ( (1/3)(0.1 + 0.2 + 0.3) )⁻¹ = 5
density(x_O7, 3)  = ( (1/3)(0.1 + 0.2 + 0.3) )⁻¹ = 5

ard(x_O10, 3) = density(x_O10, 3) / ( (1/3)(density(x_O9, 3) + density(x_O8, 3) + density(x_O7, 3)) )
              = 1.8750 / ( (1/3)(3.75 + 5 + 5) ) = 0.4091.
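A minimal sketch of Solution 9's computation; the stable sort mimics the question's tie-breaking rule of preferring the lowest observation number:

```python
import numpy as np

# Heights of O1..O10 (Figure 4); distances are 1-D Euclidean (absolute differences).
height = np.array([5.7, 6.0, 6.2, 6.3, 6.4, 6.6, 6.7, 6.9, 7.0, 7.4])
K = 3

def knn_density(i, K):
    """KNN density of observation i: inverse mean distance to its K nearest neighbors."""
    d = np.abs(height - height[i])
    d[i] = np.inf                          # exclude the observation itself
    nn = np.argsort(d, kind="stable")[:K]  # stable sort breaks ties by lowest index
    return 1.0 / d[nn].mean(), nn

def ard(i, K):
    """Average relative density: own density over the mean density of the K neighbors."""
    dens_i, nn = knn_density(i, K)
    return dens_i / np.mean([knn_density(j, K)[0] for j in nn])

print(round(ard(9, K), 4))  # O10 is index 9 -> 0.4091
```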

Figure 6: Four different dendrograms derived using the Euclidean distance between the 10 observations based on the attribute Height only. The value of Height for each of the 10 observations can be found in Figure 4. Red observations correspond to low values of FG and black observations to high values of FG.

Question 10. We will consider the Euclidean distance between the observations in Figure 4 based on the attribute Height only (i.e., the Euclidean distance between observation O1 and O2 is √((5.7 − 6.0)²) = 0.3). A hierarchical clustering is used to cluster the observations based on their distances to each other using average linkage (when there are ties in the agglomerative procedure, the clusters containing the smallest observation numbers merge first). Which one of the dendrograms given in Figure 6 corresponds to the clustering?

A. Dendrogram 1.

B. Dendrogram 2.

C. Dendrogram 3.

D. Dendrogram 4.

E. Don’t know.

Solution 10. Initially, O3, O4 will merge, O6, O7 will merge, and O8, O9 will merge, all at the level of 0.1. Subsequently, O5 will merge onto {O3,O4} at the level of (0.1+0.2)/2 = 0.15. Next, O1, O2 merge at 0.3, and, tied at the same level, {O6,O7} merges with {O8,O9} at (0.3+0.4+0.2+0.3)/4 = 0.3 (by the tie-breaking rule the pair containing O1 merges first). Subsequently {O1,O2} merges with {O3,O4,O5} at the level of (0.5+0.6+0.7+0.2+0.3+0.4)/6 = 0.45, and then O10 merges with {O6,O7,O8,O9} at the level of (0.8+0.7+0.5+0.4)/4 = 0.6. The only dendrogram having these properties is dendrogram 4, and we can thus rule out the other dendrograms.
Question 11. We will cut dendrogram 2 at the level of two clusters and evaluate this clustering in terms of its correspondence with the class label information, in which O1, O2, O3, O4, O8, and O9 correspond to low values of FG whereas O5, O6, O7, and O10 correspond to high values of FG. We recall that the Rand index, also denoted the simple matching coefficient (SMC), between the true labels and the extracted clusters is given by R = (f11 + f00)/K, where f11 is the number of object pairs in the same class assigned to the same cluster, f00 is the number of object pairs in different classes assigned to different clusters, and K = N(N − 1)/2 is the total number of object pairs, where N is the number of observations considered. What is the value of R between the true labeling of the observations in terms of high and low FG values and the two clusters?

A. 0.3226

B. 0.5333

C. 0.5778

D. 0.6222

E. Don’t know.

Solution 11. The cluster indices are given by the vector [2 2 2 2 2 1 1 1 1 1]ᵀ, whereas the true class labels are given by the vector [1 1 1 1 2 2 2 1 1 2]ᵀ. From this we obtain: the total number of object pairs is K = 10(10 − 1)/2 = 45; f00 = 4·3 + 1·2 = 14; f11 = 4·(4−1)/2 + 1·(1−1)/2 + 3·(3−1)/2 + 2·(2−1)/2 = 10; and R = (f11 + f00)/K = (10 + 14)/45 = 24/45 ≈ 0.5333.
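A minimal sketch counting the pair types for the Rand index directly:

```python
from itertools import combinations

# Two-cluster cut of dendrogram 2 and the true low/high FG labels for O1..O10.
cluster = [2, 2, 2, 2, 2, 1, 1, 1, 1, 1]
label   = [1, 1, 1, 1, 2, 2, 2, 1, 1, 2]

f11 = f00 = 0
for i, j in combinations(range(len(label)), 2):
    same_class = label[i] == label[j]
    same_cluster = cluster[i] == cluster[j]
    f11 += same_class and same_cluster
    f00 += (not same_class) and (not same_cluster)

K = len(label) * (len(label) - 1) // 2
print(f11, f00, round((f11 + f00) / K, 4))  # -> 10 14 0.5333
```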
Figure 7: A Gaussian mixture model (GMM) with three clusters fitted to the 10 observations based only on the attribute Height. The overall probability density is given in gray and in blue the contribution from each of the three clusters to the density.

Question 12. We will fit a Gaussian mixture model (GMM) with three clusters to the 10 observations given in Figure 4. The fitted density is given in Figure 7, in which the overall density is given in gray and the contribution of each Gaussian is given in blue. We recall that the Gaussian mixture model for 1-dimensional data is given by:

p(x) = Σ_k wk (1/√(2πσk²)) exp(−(1/(2σk²))(x − µk)²).

For the clustering we have:
w1 = 0.37, w2 = 0.29, w3 = 0.34,
µ1 = 6.12, µ2 = 6.55, µ3 = 6.93,
σ1² = 0.09, σ2² = 0.13, σ3² = 0.12.

What is the probability that observation O8 is assigned to cluster 2 according to the GMM?

A. 0.20

B. 0.29

C. 0.33

D. 0.37

E. Don’t know.

Solution 12. The probability that the k'th cluster generated the observation O8 is given by wk N(6.9|µk, σk²), and we thus have:

p(x8 = 6.9, z8 = 1) = 0.37 · (1/√(2π·0.09)) exp(−(1/(2·0.09))(6.9 − 6.12)²) = 0.0168
p(x8 = 6.9, z8 = 2) = 0.29 · (1/√(2π·0.13)) exp(−(1/(2·0.13))(6.9 − 6.55)²) = 0.2003
p(x8 = 6.9, z8 = 3) = 0.34 · (1/√(2π·0.12)) exp(−(1/(2·0.12))(6.9 − 6.93)²) = 0.3901

p(z8 = 2|x8 = 6.9) = p(x8 = 6.9, z8 = 2) / Σ_{k'} p(x8 = 6.9, z8 = k')
                   = 0.2003/(0.0168 + 0.2003 + 0.3901) = 0.33.
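A minimal sketch of this responsibility computation with the mixture parameters from the question:

```python
import numpy as np

# Mixture parameters from the question and the Height of O8.
w   = np.array([0.37, 0.29, 0.34])
mu  = np.array([6.12, 6.55, 6.93])
var = np.array([0.09, 0.13, 0.12])
x8 = 6.9

# Joint p(x8, z8=k) = w_k * N(x8 | mu_k, var_k), then normalize to get the posterior.
joint = w * np.exp(-(x8 - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
posterior = joint / joint.sum()
print(np.round(posterior, 2))  # -> [0.03 0.33 0.64]
```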

Figure 8: Confusion matrix based on a classifier's predictions of high or low success rate of field goals (i.e., FG>45% considered the positive class and FG≤45% considered the negative class, respectively).

Question 13. We will consider a simple classifier that predicts the 54 basketball players as having a high success rate of field goals (FG>45%, considered the positive class) if they are taller than 6.65 foot and low otherwise (FG≤45%, considered the negative class). The confusion matrix of the classifier is given in Figure 8. Which statement regarding the classifier is correct?

A. The recall of the classifier is 60.0 %.

B. The precision of the classifier is 61.1 %.

C. The accuracy of the classifier is 66.7 %.

D. The dataset is perfectly balanced.

E. Don’t know.

Solution 13. The recall of the classifier is TP/(TP+FN) = 18/(18+12) = 60.0%. The precision of the classifier is TP/(TP+FP) = 18/(18+9) = 66.7%. The accuracy of the classifier is (TP+TN)/(TP+FP+TN+FN) = (18+15)/54 = 61.1%. There are 30 positive examples and 24 negative examples in the test set, so the dataset is not perfectly balanced.

Question 14. The National Basketball Association (NBA) is the top basketball league in the USA, and all males playing in the NBA earn several million dollars a year or more, giving them all a very high salary. In the USA we will assume that approximately 0.2% of the male population not playing in the NBA makes a similarly very high salary. Furthermore, we will assume two out of a million American males are playing in the NBA. Assuming the above, what is the probability that a male in the USA making such a very high salary plays in the NBA?

A. 0.0002%

B. 0.0010%

C. 0.0999%

D. 0.2002%

E. Don’t know.

Solution 14. What we are interested in is

P(NBA|Very high salary)
= P(Very high salary|NBA) P(NBA) / (P(Very high salary|NBA) P(NBA) + P(Very high salary|not NBA) P(not NBA))
= (1 · 2/1000000) / (1 · 2/1000000 + 0.002 · (1 − 2/1000000))
= 2/(2 + 0.002 · 999998) = 0.0999%.
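A minimal sketch of this Bayes' rule computation with the numbers from the question:

```python
# Bayes' rule with the numbers from Question 14.
p_nba = 2 / 1_000_000          # prior: fraction of US males in the NBA
p_vhs_given_nba = 1.0          # every NBA player has a very high salary
p_vhs_given_not = 0.002        # 0.2% of non-NBA males have a very high salary

posterior = (p_vhs_given_nba * p_nba) / (
    p_vhs_given_nba * p_nba + p_vhs_given_not * (1 - p_nba)
)
print(f"{posterior:.4%}")  # -> 0.0999%
```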

Figure 9: Decision boundaries for four different classifiers trained on the Basketball dataset considering the features Height and Weight. Gray regions classify into red crosses whereas white regions classify into black plusses.

Question 15. Four different classifiers are trained on the Basketball dataset considering the features Height and Weight in order to predict if the percentage of successful field goals is high (FG>45%) or low (FG≤45%). The decision boundary for each of the four classifiers is given in Figure 9. Which one of the following statements is correct?

A. Classifier 1 corresponds to logistic regression considering as input x1 and x2, Classifier 2 is a 3-nearest neighbor classifier using Euclidean distance, Classifier 3 is a decision tree including three decisions, Classifier 4 corresponds to a logistic regression considering as input x1, x2, and x1·x2.

B. Classifier 1 is a 3-nearest neighbor classifier using Euclidean distance, Classifier 2 is a decision tree including three decisions, Classifier 3 corresponds to logistic regression considering as input x1 and x2, Classifier 4 corresponds to a logistic regression considering as input x1, x2, and x1·x2.

C. Classifier 1 corresponds to a logistic regression considering as input x1, x2, and x1·x2, Classifier 2 is a 3-nearest neighbor classifier using Euclidean distance, Classifier 3 is a decision tree including three decisions, Classifier 4 corresponds to logistic regression considering as input x1 and x2.

D. Classifier 1 is a decision tree including three decisions, Classifier 2 corresponds to logistic regression considering as input x1, x2, and x1·x2, Classifier 3 corresponds to a logistic regression considering as input x1 and x2, Classifier 4 is a 3-nearest neighbor classifier using Euclidean distance.

E. Don’t know.

Solution 15. The decision boundary of Classifier 1 is a straight line and thus conforms to a logistic regression model using only the features x1 and x2 as inputs. Classifier 2 is a 3-nearest neighbor classifier: when using Euclidean distance, the scale of Weight has much more variance than that of Height, thereby heavily influencing the distance measure. Classifier 3 is a decision tree with two vertical and one horizontal line corresponding to three decisions. Classifier 4 is non-linear and smooth, corresponding to a logistic regression including a transformed variable, i.e. x1·x2.
Question 16. We will consider an artificial neural network (ANN) trained to predict the average score of a player (i.e., y). The ANN is based on the model:

f(x, w) = w0^(2) + Σ_{j=1}^{2} wj^(2) h^(1)([1 x] wj^(1)),

where h^(1)(x) = max(x, 0) is the rectified linear function used as activation function in the hidden layer (i.e., positive values are returned and negative values are set to zero). We will consider an ANN with two hidden units in the hidden layer defined by:

w1^(1) = [21.78 −1.65 0 −13.26 −8.46]ᵀ, w2^(1) = [−9.60 −0.44 0.01 14.54 9.50]ᵀ,

and w0^(2) = 2.84, w1^(2) = 3.25, and w2^(2) = 3.46. What is the predicted average score of a basketball player with observation vector x* = [6.8 225 0.44 0.68]?

A. 1.00

B. 3.74

C. 8.21

D. 11.54

E. Don’t know.

Solution 16. The output is given by:

f(x, w) = 2.84 + 3.25 · max([1 6.8 225 0.44 0.68] · w1^(1), 0) + 3.46 · max([1 6.8 225 0.44 0.68] · w2^(1), 0)
        = 2.84 + 3.25 · max(−1.027, 0) + 3.46 · max(2.516, 0)
        = 11.54.

Question 17. Which statement regarding cross-validation is correct?

A. An advantage of five-fold cross-validation over three-fold cross-validation is that the datasets used for training are larger.

B. The more data used for training a model, the more we can expect the model to overfit to the training data.

C. 10-fold cross-validation is more accurate but also more computationally expensive than leave-one-out cross-validation.

D. When upsampling data in order to avoid class imbalance issues, the same observations should be included in the training and test set such that the training and test sets reflect the same properties.

E. Don’t know.

Solution 17. When using k-fold cross-validation, 1/k of the data is used for testing and (k−1)/k of the data for training during each fold. As such, five-fold cross-validation uses larger training sets than three-fold cross-validation. The more data used for training a model, the less we can expect overfitting to occur, as the model will be less prone to fit to specific aspects of the training set. 10-fold cross-validation should not be more accurate and, in particular, it is not more expensive than leave-one-out cross-validation; leave-one-out cross-validation is more expensive, as we have to use as many folds as we have observations and each model is trained on a larger training set. When upsampling the data it is important that the same observations do not occur in both the training and test set, as we would otherwise be training the model on parts of the test set and thereby fitting the model also to test data.
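Returning to Solution 16, a minimal sketch of the forward pass with the weights given in the question:

```python
import numpy as np

# First-layer weight vectors (bias first) and second-layer weights from Question 16.
w1 = np.array([21.78, -1.65, 0.0, -13.26, -8.46])
w2 = np.array([-9.60, -0.44, 0.01, 14.54, 9.50])
w_out = np.array([2.84, 3.25, 3.46])  # w0(2), w1(2), w2(2)

x = np.array([6.8, 225, 0.44, 0.68])
xb = np.concatenate(([1.0], x))  # prepend 1 for the bias term

relu = lambda a: np.maximum(a, 0.0)
hidden = relu(np.array([xb @ w1, xb @ w2]))
y_hat = w_out[0] + w_out[1:] @ hidden
print(round(y_hat, 2))  # -> 11.54
```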

      HL  HH  WL  WH  FG≤45%  FG>45%  FT≤75%  FT>75%
O1     1   0   1   0     1       0       1       0
O2     1   0   1   0     1       0       1       0
O3     1   0   1   0     1       0       1       0
O4     1   0   1   0     1       0       0       1
O5     1   0   1   0     0       1       0       1
O6     1   0   0   1     0       1       1       0
O7     0   1   1   0     0       1       0       1
O8     0   1   1   0     1       0       0       1
O9     0   1   0   1     1       0       1       0
O10    0   1   0   1     0       1       1       0

Table 3: The ten considered observations of the Basketball dataset binarized considering the attributes x1–x4. The attributes x1 and x2 are binarized according to whether they are below or above the median value of the attribute for the entire dataset of 54 observations. x3 and x4 are respectively thresholded at a success rate of 45% for field goals (FG) and a success rate of 75% for free throws (FT). The ten observations are color coded in terms of average points scored per game (y) being in the low range {O2, O3, O6, O9}, mid-range {O1, O4}, and high range {O5, O7, O8, O10}.

Question 18. Consider the dataset in Table 3 as a market basket problem with observations O1–O10 corresponding to customers and HL, HH, WL, WH, FG≤45%, FG>45%, FT≤75%, and FT>75% corresponding to items. What are all frequent itemsets with support greater than 35%?

A. {HL}, {HH}, {WL}, {FG≤45%}, {FG>45%}, {FT≤75%}, and {FT>75%}.

B. {HL}, {HH}, {WL}, {FG≤45%}, {FG>45%}, {FT≤75%}, {FT>75%}, {HL,WL}, {HL,FG≤45%}, {HL,FT≤75%}, {WL,FG≤45%}, {WL,FT>75%}, and {FG≤45%,FT≤75%}.

C. {HL}, {HH}, {WL}, {FG≤45%}, {FG>45%}, {FT≤75%}, {FT>75%}, {HL,WL}, {HL,FG≤45%}, {HL,FT≤75%}, {WL,FG≤45%}, {WL,FT>75%}, {FG≤45%,FT≤75%}, and {HL,WL,FG≤45%}.

D. {HL}, {HH}, {WL}, {FG≤45%}, {FG>45%}, {FT≤75%}, {FT>75%}, {HL,WL}, {HL,FG≤45%}, {HL,FT≤75%}, {WL,FG≤45%}, {WL,FT>75%}, {FG≤45%,FT≤75%}, {HL,WL,FG≤45%}, and {HL,WL,FT≤75%}.

E. Don’t know.

Solution 18. For a set to have support greater than 35%, the set must occur in at least 0.35 · 10 = 3.5, i.e. 4, out of the 10 transactions. All the itemsets that have this property are: {HL}, {HH}, {WL}, {FG≤45%}, {FG>45%}, {FT≤75%}, {FT>75%}, {HL,WL}, {HL,FG≤45%}, {HL,FT≤75%}, {WL,FG≤45%}, {WL,FT>75%}, {FG≤45%,FT≤75%}, and {HL,WL,FG≤45%}.
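A minimal brute-force sketch counting supports over Table 3; the item names are ASCII stand-ins for the subscripted labels:

```python
from itertools import combinations

items = ["HL", "HH", "WL", "WH", "FG<=45", "FG>45", "FT<=75", "FT>75"]
rows = [  # Table 3, observations O1..O10
    [1,0,1,0,1,0,1,0], [1,0,1,0,1,0,1,0], [1,0,1,0,1,0,1,0],
    [1,0,1,0,1,0,0,1], [1,0,1,0,0,1,0,1], [1,0,0,1,0,1,1,0],
    [0,1,1,0,0,1,0,1], [0,1,1,0,1,0,0,1], [0,1,0,1,1,0,1,0],
    [0,1,0,1,0,1,1,0],
]
baskets = [{items[j] for j, v in enumerate(r) if v} for r in rows]

# Brute-force support count over all itemsets of size 1-3 (no larger itemset
# can reach 4 occurrences here; 8 items is small enough to enumerate).
for size in (1, 2, 3):
    for cand in combinations(items, size):
        support = sum(set(cand) <= b for b in baskets) / len(baskets)
        if support > 0.35:
            print(set(cand), support)
```

The printed itemsets match answer C.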

Question 19. We consider again the data in Table 3 as a market basket problem. What is the confidence of the association rule {HL, WL} → {FG≤45%, FT≤75%}?

A. 30 %

B. 40 %

C. 50 %

D. 60 %

E. Don’t know.

Solution 19. The confidence is given as

P(FG≤45%, FT≤75% | HL, WL) = P(FG≤45%, FT≤75%, HL, WL) / P(HL, WL) = (3/10)/(5/10) = 3/5 = 60%.

Question 20. We would like to predict whether a basketball player has a high average score using the data in Table 3. We will apply a Naïve Bayes classifier that assumes independence between the attributes given the class label (i.e., the class label is given by the average points scored per game being low (black color), in the mid-range (red color) or high (blue color) respectively in the table). Given that a basketball player is relatively tall (HH = 1) and relatively light weight (WL = 1), what is the probability that the basketball player will have a high average score according to the Naïve Bayes classifier derived from the data in Table 3?

A. 9/16

B. 9/11

C. 3/4

D. 1

E. Don’t know.

Solution 20. Let HAS, MAS, and LAS denote high, mid-range, and low average score respectively. According to the Naïve Bayes classifier we have

P(HAS|HH = 1, WL = 1)
= P(HH = 1|HAS) P(WL = 1|HAS) P(HAS) / [ P(HH = 1|LAS) P(WL = 1|LAS) P(LAS) + P(HH = 1|MAS) P(WL = 1|MAS) P(MAS) + P(HH = 1|HAS) P(WL = 1|HAS) P(HAS) ]
= (3/4 · 3/4 · 4/10) / (1/4 · 2/4 · 4/10 + 0 · 2/2 · 2/10 + 3/4 · 3/4 · 4/10)
= (9/40) / (2/40 + 0 + 9/40) = 9/11.

Question 21. Considering the data in Table 3, we will use a 3-nearest neighbor classifier to classify observation O10 (i.e., with binary observation vector [0 1 0 1 0 1 1 0]) based on observations O1–O9. We will classify according to the three neighboring observations with highest similarity according to the Jaccard (J) measure of similarity given by J(a, b) = f11/(M − f00), where f11 and f00 are the number of one-matches and zero-matches respectively and M is the total number of binary features. Which one of the following statements is correct?

A. O10 will be classified as black.

B. O10 will be classified as blue.

C. O10 will be classified as red.

D. The classifier will be tied between the classes black and blue.

E. Don’t know.
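Before moving on, a quick numerical check of Solutions 19 and 20 (a minimal sketch in plain Python; the counts are read off Table 3):

```python
# Solution 19: confidence of {HL, WL} -> {FG<=45%, FT<=75%}.
n_antecedent = 5   # O1-O5 contain both HL and WL
n_rule = 3         # O1-O3 additionally contain FG<=45% and FT<=75%
print(n_rule / n_antecedent)  # -> 0.6, i.e. 60%

# Solution 20: Naive Bayes posterior of a high average score given HH=1, WL=1.
p_class = {"low": 4/10, "mid": 2/10, "high": 4/10}
p_hh = {"low": 1/4, "mid": 0/2, "high": 3/4}   # P(HH=1 | class)
p_wl = {"low": 2/4, "mid": 2/2, "high": 3/4}   # P(WL=1 | class)
joint = {c: p_hh[c] * p_wl[c] * p_class[c] for c in p_class}
print(joint["high"] / sum(joint.values()))     # -> 0.8181..., i.e. 9/11
```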
Solution 21. For O10 we have:

J(O10, O1) = 1/(8 − 1) = 1/7
J(O10, O2) = 1/(8 − 1) = 1/7
J(O10, O3) = 1/(8 − 1) = 1/7
J(O10, O4) = 0/(8 − 0) = 0
J(O10, O5) = 1/(8 − 1) = 1/7
J(O10, O6) = 3/(8 − 3) = 3/5
J(O10, O7) = 2/(8 − 2) = 1/3
J(O10, O8) = 1/(8 − 1) = 1/7
J(O10, O9) = 3/(8 − 3) = 3/5

Hence the three nearest neighbors are O6 and O9 (both 3/5) and O7 (1/3); with two black and one blue observation, according to majority voting the observation will be classified as black.
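A minimal sketch computing these Jaccard similarities directly from the rows of Table 3:

```python
import numpy as np

rows = np.array([  # Table 3, observations O1..O9
    [1,0,1,0,1,0,1,0], [1,0,1,0,1,0,1,0], [1,0,1,0,1,0,1,0],
    [1,0,1,0,1,0,0,1], [1,0,1,0,0,1,0,1], [1,0,0,1,0,1,1,0],
    [0,1,1,0,0,1,0,1], [0,1,1,0,1,0,0,1], [0,1,0,1,1,0,1,0],
])
o10 = np.array([0,1,0,1,0,1,1,0])

def jaccard(a, b):
    f11 = np.sum((a == 1) & (b == 1))  # one-matches
    f00 = np.sum((a == 0) & (b == 0))  # zero-matches
    return f11 / (len(a) - f00)

print([round(jaccard(r, o10), 3) for r in rows])
# Three most similar: O6 and O9 (0.6) and O7 (0.333); majority vote over
# their labels (black, black, blue) classifies O10 as black.
```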
Figure 10: Decision boundaries for two rounds of boosting considering a logistic regression model using the features Height and Weight and the 10 observations also considered previously in Figure 4. The gray region indicates that an observation will be classified as a red cross, the white regions that it will be classified as a black plus.

Question 22. We will consider classifying the 10 observations considered in Figure 4 using logistic regression and boosting by AdaBoost (notice, the AdaBoost algorithm uses the natural logarithm). For this purpose we include only two boosting rounds, considering only the features Height (i.e., x1) and Weight (i.e., x2) as inputs. In the first round the data is sampled with equal probability wi = 1/10 for i = {1, ..., 10} and the logistic regression model with decision boundary given to the left of Figure 10 is trained. A new dataset is subsequently sampled and a new logistic regression classifier, given to the right of Figure 10, is trained. Based on these two rounds of the AdaBoost algorithm, what will an observation located at x1 = 6 and x2 = 240 be classified as?

A. The two classes will be tied for the AdaBoost procedure.

B. The observation will be classified as red cross.

C. The observation will be classified as black plus.

D. The weights w1, ..., w10 are changed in round 1 of the AdaBoost procedure.

E. Don’t know.

Solution 22. We have for the first round that the weighted error rate is ε1 = 5/10, with associated α1 = ½ log((1 − ε1)/ε1) = ½ log 1 = 0. The updated weights will thus be unchanged, as e⁰ = 1. In the next round ε2 = 2/10 and thus α2 = ½ log((1 − 2/10)/(2/10)) = ½ log 4 = 0.693. When determining the class we weight each classifier by the importance αt of round t. However, as the first round has α1 = 0, this round has zero weight in the voting, and the classification will thus be based solely on the second round classifier, for which α2 = 0.693. This classifier will deem the observation at x1 = 6 and x2 = 240 to belong to the class of black plusses, as the decision region there is white.
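A minimal sketch of the weighted vote; the first classifier's vote is shown as red (−1) purely for concreteness, since its weight α1 = 0 makes it irrelevant:

```python
import numpy as np

# Importance of each boosting round: alpha_t = 0.5 * ln((1 - eps_t) / eps_t).
eps = np.array([5 / 10, 2 / 10])   # weighted error rates of rounds 1 and 2
alpha = 0.5 * np.log((1 - eps) / eps)
print(np.round(alpha, 3))  # -> [0.    0.693]

# Combined vote at (x1, x2) = (6, 240): the second classifier says black plus (+1),
# read off the white region in the right panel of Figure 10; the first vote
# carries zero weight either way.
votes = np.array([-1, +1])
print(np.sign(alpha @ votes))  # -> 1.0, i.e. the black-plus class
```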

Figure 11: Root-mean-square error (RMSE) curves as a function of the regularization strength λ for regularized least squares regression predicting the average points scored per game y based on the attributes x1–x4.

Question 23. Using the 54 observations of the Basketball dataset we would like to predict the average points scored per game (y) based on the four features (x1–x4). For this purpose we consider regularized least squares regression, which minimizes with respect to w the following cost function:

E(w) = Σ_n (yn − [1 xn1 xn2 xn3 xn4] w)² + λ wᵀw,

where xnm denotes the m'th feature of the n'th observation, and a 1 is concatenated to the data to account for the bias term. We consider 20 different values of λ and use leave-one-out cross-validation to quantify the performance of each of these different values of λ. The results of the leave-one-out cross-validation performance are given in Figure 11. Inspecting the model for the value of λ = 0.6952, the following model is identified:

f(x) = 2.76 − 0.37x1 + 0.01x2 + 7.67x3 + 7.67x4.

Which one of the following statements is correct?

A. In Figure 11 the blue curve with circles corresponds to the training error, whereas the black curve with crosses corresponds to the test error.

B. According to the model defined for λ = 0.6952, increasing a player's height will increase his average points scored per game.

C. There is no optimal way of choosing λ, since increasing λ reduces the variance but increases the bias.

D. As we increase λ the 2-norm of the weight vector w will also increase.

E. Don’t know.

Solution 23. The blue curve monotonically increases with λ, reflecting a worse fit to the training set as we increase λ. Using regularization we can reduce the variance by introducing bias, and the black curve indicates an optimal tradeoff at around λ = 10^{−0.8}, as reflected by the test error in the black curve being minimal there. As we increase λ we penalize the weights according to the squared 2-norm more and more, and thus the 2-norm will be reduced. Finally, according to the fitted model we observe that the coefficient in front of x1 (Height) is negative, thus indicating that an increase in height will reduce the model's prediction of average points scored per game.
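A minimal sketch of the leave-one-out procedure for this regularized cost function; since the exam's 54×4 data matrix is not reproduced here, the example runs on synthetic stand-in data with similar dimensions:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize sum (y - [1 X]w)^2 + lam * w'w; returns the weight vector w."""
    Xb = np.hstack([np.ones((len(X), 1)), X])  # prepend the bias column
    return np.linalg.solve(Xb.T @ Xb + lam * np.eye(Xb.shape[1]), Xb.T @ y)

def loo_rmse(X, y, lam):
    """Leave-one-out cross-validation RMSE for a single value of lam."""
    errs = []
    for i in range(len(X)):
        mask = np.arange(len(X)) != i
        w = ridge_fit(X[mask], y[mask], lam)
        errs.append(y[i] - np.concatenate(([1.0], X[i])) @ w)
    return np.sqrt(np.mean(np.square(errs)))

# Synthetic stand-in data (hypothetical; not the exam's Basketball data).
rng = np.random.default_rng(0)
X = rng.normal(size=(54, 4))
y = X @ np.array([-0.4, 0.01, 7.7, 7.7]) + rng.normal(scale=5, size=54)
for lam in [0.01, 0.6952, 100.0]:
    print(lam, round(loo_rmse(X, y, lam), 3))
```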

Question 24. We will again consider the ridge regression described in the previous question. Which one of the following statements is correct?

A. Exhaustively evaluating all combinations of features would require the fitting of fewer models than the proposed ridge-regression procedure.

B. To generate the test curve we need to make predictions from a total of 1060 different models.

C. We can obtain an unbiased estimate of the generalization error of the best performing model from Figure 11.

D. The ridge regression model will be non-linear as the model includes regularization.

E. Don’t know.

Solution 24. Exhaustively evaluating all feature combinations of four features would require evaluating 2^4 = 16 models, whereas we currently consider 20 different models. For each of the 20 values of λ we have to estimate 54 models according to the leave-one-out procedure (i.e., 54 times we leave out an observation, as the dataset contains 54 observations); thus we need a total of 20 · 54 = 1080 different models to make the necessary predictions. In order to get an unbiased estimate of the generalization error of the best model we would need two-layer cross-validation (we currently only have one level of cross-validation). The ridge regression model will not be non-linear but remains linear in the input data regardless of the regularization.

Figure 12: A clustering problem containing four clusters indicated by black crosses, red circles, magenta plusses and blue stars.
Question 25. Consider the clustering problem given in Figure 12. Which clustering approach is most suited for correctly separating the data into the four groups indicated by black crosses, red circles, magenta plusses, and blue asterisks?

A. A well-separated clustering approach.

B. A contiguity-based clustering approach.

C. A center-based clustering approach.

D. A conceptual clustering approach.

E. Don’t know.

Solution 25. As each observation in a cluster is closer to at least one other observation in its own cluster than to any observation in another cluster, a contiguity-based clustering approach is most suited.

Question 26. Which one of the following statements is correct?

A. Multinomial regression can only handle classification problems where the problem is to classify between two classes.

B. Decision trees return the probability that an observation is in a given class.

C. k-means, Gaussian Mixture Models (GMM) and Artificial Neural Networks (ANN) are all prone to local minima issues, and thus it is recommended to run the procedures using multiple initializations.

D. The accuracy is a good performance measure when facing severe class imbalance issues in a two class classification problem.

E. Don’t know.
Solution 26. Multinomial regression is a generalization of two class logistic regression to handle multiple classes. Decision trees do not return probabilities of being in each class but hard-assign observations to the classes based on majority voting in each terminal leaf. k-means, Gaussian Mixture Models and Artificial Neural Networks (ANN) are indeed all prone to local minima, and it is therefore advised to use multiple restarts, selecting the initialization with the best solution. Accuracy is not a good performance measure when facing severe class-imbalance issues, as we may trivially obtain a very high accuracy simply by always predicting the largest class. The AUC of the receiver operator characteristic would here be more appropriate, as it is not influenced by the relative sizes of the two classes.

Figure 13: A two class classification problem with red crosses (i.e., x) and black plusses (i.e., +) constituting the two classes, as well as the associated decision boundaries of the two classes indicated respectively by gray and white regions.

Figure 14: A decision tree with four decisions (A, B, C, and D) forming the decision boundaries given in Figure 13 if adequately defined.

Question 27. We will consider the two class classification problem given in Figure 13, in which the goal is to separate red crosses (i.e., x) from black plusses (i.e., +) based on the decision boundaries in gray and white indicated in the top panel of the figure. Which one of the following procedures based on the decision tree given in Figure 14 will perfectly separate the two classes?

A. A = ‖x − [0; 0.5]‖∞ < 0.5, B = x1 < 0.75, C = ‖x − [0.75; 0.5]‖2 < 0.25, D = ‖x − [0.75; 0.5]‖1 < 0.25.

B. A = ‖x − [0; 0.5]‖1 < 0.5, B = x1 < 0.75, C = ‖x − [0.75; 0.5]‖2 < 0.25, D = ‖x − [0.75; 0.5]‖∞ < 0.25.

C. A = ‖x − [0; 0.5]‖∞ < 0.5, B = x1 < 0.75, C = ‖x − [0.5; 0.75]‖2 < 0.25, D = ‖x − [0.5; 0.75]‖1 < 0.25.

D. A = x1 < 0.75, …

E. Don’t know.

Solution 27. All observations for which x1 < 0.5 are red crosses, which can be captured by the initial decision A = ‖x − [0; 0.5]‖∞ < 0.5. For the remaining observations it appears two different norms are at play depending on whether x1 < 0.75 or not, thus B = x1 < 0.75. If x1 < 0.75 we observe that decision C should have a circular shape defined by C = ‖x − [0.75; 0.5]‖2 < 0.25, whereas if x1 ≥ 0.75 we have a diamond shape defined by D = ‖x − [0.75; 0.5]‖1 < 0.25. The other solutions will not similarly correctly define the decision boundaries.
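A minimal sketch of the decisions in answer A, showing how the three norms carve out a square, a circle, and a diamond; the assignment of the inner regions to the red-cross class is our assumption for illustration, read off the description of Figure 13:

```python
import numpy as np

def decide(x):
    """Evaluate the decisions of answer A for a point x = (x1, x2)."""
    x = np.asarray(x, dtype=float)
    A = np.max(np.abs(x - [0.0, 0.5])) < 0.5       # infinity-norm ball (a square)
    if A:
        return "red cross"
    B = x[0] < 0.75
    C = np.linalg.norm(x - [0.75, 0.5], 2) < 0.25  # 2-norm ball (a circle)
    D = np.sum(np.abs(x - [0.75, 0.5])) < 0.25     # 1-norm ball (a diamond)
    inside = C if B else D
    # Assumed leaf labels: inside the circle/diamond -> red cross, else black plus.
    return "red cross" if inside else "black plus"

for p in [(0.2, 0.6), (0.6, 0.5), (0.9, 0.5), (0.95, 0.9)]:
    print(p, decide(p))
```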
