02450ex Fall2017 Sol
You must either use the electronic file or the form on this page to hand in your answers, but not both. We
strongly encourage you to hand in your answers digitally using the electronic file. If you hand
in using the form on this page, please write your name and student number clearly.
The exam is multiple choice. All questions have four possible answers marked by the letters A, B, C, and D, as
well as the answer “Don’t know” marked by the letter E. A correct answer gives 3 points, a wrong answer gives −1
point, and “Don’t know” (E) gives 0 points.
The individual questions are answered by filling in the answer fields with one of the letters A, B, C, D, or E.
Answers:
1 2 3 4 5 6 7 8 9 10
D C C B C A B B A D
11 12 13 14 15 16 17 18 19 20
B C A C A D A C D B
21 22 23 24 25 26 27
A C A A B C A
Name:
Student number:
No.  Attribute description                                        Abbrev.
x1   Height (in feet)                                             Height
x2   Weight (in pounds)                                           Weight
x3   Percent of successful field goals (out of 100 attempted)     FG
x4   Percent of successful free throws (out of 100 attempted)     FT
y    Average points scored per game                               PT
Question 2. A principal component analysis (PCA) is carried out on the standardized attributes x1–x4,
forming the standardized matrix X̃. The squared Frobenius norm of the standardized matrix is given
by $\|\tilde{X}\|_F^2 = 212$. A singular value decomposition is applied to the matrix X̃ and we find that the first
three singular values are σ1 = 11.1, σ2 = 7.2, σ3 = 5.2.
What is the value of the fourth singular value σ4?
A. σ4 = 1.2
B. σ4 = 2.3
C. σ4 = 3.1
D. σ4 = 9.9
E. Don’t know.
Solution 2. The variance explained by the i'th principal component is given by
$\frac{\sigma_i^2}{\sum_{i'} \sigma_{i'}^2} = \frac{\sigma_i^2}{\|\tilde{X}\|_F^2}$.
Thus, $\sum_{i'} \sigma_{i'}^2 = \|\tilde{X}\|_F^2$, and from this we know that
$\sigma_4 = \sqrt{\|\tilde{X}\|_F^2 - \sum_{i'=1}^{3} \sigma_{i'}^2} = \sqrt{212 - 11.1^2 - 7.2^2 - 5.2^2} \approx 3.15.$

Figure 2: The Basketball data projected onto the first and second principal component. In the plot four
observations are highlighted, denoted A, B, C, and D, of which one corresponds to the
player with the lowest output value y, given by the circle.
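The computation in Solution 2 can be verified numerically; a minimal sketch (variable names are ours, not part of the exam):

```python
import numpy as np

# Squared Frobenius norm of the standardized data matrix and the
# first three singular values, as given in the question.
frob_sq = 212.0
sigma = np.array([11.1, 7.2, 5.2])

# The squared Frobenius norm equals the sum of all squared singular
# values, so whatever variance is left belongs to sigma_4.
sigma4 = np.sqrt(frob_sq - np.sum(sigma**2))
print(sigma4)  # ~3.148, i.e. answer C
```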
A. Observation A.
B. Observation B.
C. Observation C.
D. Observation D.
E. Don’t know.
Solution 3. From the boxplot we know that the worst performing player in terms of the output value y has
the standardized observation vector x̃∗ = [0.68 0.66 −0.67 −1.47] and will thus have the projection onto the
two first principal components given by

$[0.68\;\; 0.66\;\; {-0.67}\;\; {-1.47}] \begin{bmatrix} -0.60 & 0.02 \\ -0.61 & 0.00 \\ -0.46 & 0.46 \\ 0.25 & 0.89 \end{bmatrix} = [-0.8699\;\; {-1.6029}].$

Thus, in the projection the observation will be located at (−0.8699, −1.6029), which corresponds to the observation denoted C.

Question 4. A least squares linear regression model is trained using different combinations of the
four attributes x1, x2, x3, and x4 in order to predict the average points scored per game y. Table 2
provides the training and test root-mean-square error
($\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2}$)
performance of the least squares linear regression model when trained using different combinations of the
four attributes. Which one of the following statements is correct?

A. Forward selection will terminate when all features x1–x4 are included in the feature set.

B. The solution identified by forward selection will be worse than the solution identified by backward selection.

E. Don’t know.
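The projection in Solution 3 can be checked numerically; a minimal sketch (the 0.00 entry in the second column of V is recovered from the stated projection values, so treat it as approximate):

```python
import numpy as np

# Standardized observation for the worst performing player and the
# first two columns of V from the SVD.
x_std = np.array([0.68, 0.66, -0.67, -1.47])
V2 = np.array([[-0.60, 0.02],
               [-0.61, 0.00],
               [-0.46, 0.46],
               [ 0.25, 0.89]])

b = x_std @ V2  # projection onto the first two principal components
print(b)  # [-0.8699, -1.6029] -> observation C
```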
Table 2:

Feature(s)               Training RMSE   Test RMSE
No features              5.8977          5.8505
x1                       5.8760          6.0035
x2                       5.8841          5.9037
x3                       5.1832          5.9272
x4                       5.8727          5.6845
x1 and x2                5.6272          7.4558
x1 and x3                5.1482          5.6409
x1 and x4                5.8451          5.8269
x2 and x3                5.0483          5.6656
x2 and x4                5.8660          5.7461
x3 and x4                5.1125          5.7390
x1 and x2 and x3         4.9836          6.2823
x1 and x2 and x4         5.6261          7.3888
x1 and x3 and x4         5.0839          5.5099
x2 and x3 and x4         5.0113          5.5605
x1 and x2 and x3 and x4  4.9645          6.0892

Solution 5.
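Both greedy searches from Question 4 can be traced directly on Table 2. A minimal sketch, assuming the search is guided by the test RMSE column (the helper greedy is ours, not part of the exam):

```python
# Test RMSE for every feature subset from Table 2, keyed by frozenset.
test_rmse = {
    frozenset(): 5.8505,
    frozenset({1}): 6.0035, frozenset({2}): 5.9037,
    frozenset({3}): 5.9272, frozenset({4}): 5.6845,
    frozenset({1, 2}): 7.4558, frozenset({1, 3}): 5.6409,
    frozenset({1, 4}): 5.8269, frozenset({2, 3}): 5.6656,
    frozenset({2, 4}): 5.7461, frozenset({3, 4}): 5.7390,
    frozenset({1, 2, 3}): 6.2823, frozenset({1, 2, 4}): 7.3888,
    frozenset({1, 3, 4}): 5.5099, frozenset({2, 3, 4}): 5.5605,
    frozenset({1, 2, 3, 4}): 6.0892,
}

def greedy(current, forward=True):
    """Greedily add (or remove) one feature while the test RMSE improves."""
    while True:
        if forward:
            candidates = [current | {f} for f in {1, 2, 3, 4} - current]
        else:
            candidates = [current - {f} for f in current]
        if not candidates:
            return current
        best = min(candidates, key=lambda c: test_rmse[frozenset(c)])
        if test_rmse[frozenset(best)] >= test_rmse[frozenset(current)]:
            return current
        current = best

print(greedy(set()))                # forward:  {4}       (test RMSE 5.6845)
print(greedy({1, 2, 3, 4}, False))  # backward: {1, 3, 4} (test RMSE 5.5099)
```

Backward selection ends with a better test RMSE than forward selection, consistent with answer B to Question 4.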
B. ∆ = 0.1667
C. ∆ = 0.3333
D. ∆ = 0.6667
E. Don’t know.
where
$I(v) = 1 - \sum_c p(c|v)^2.$

Figure 3: The 54 observations plotted in terms of percentage of successful field goals (FG) against
percentage of successful free throws (FT) for low, mid, and high performing players, respectively indicated by
black plusses, red squares, and blue crosses. For each of these three classes a multivariate Gaussian distribution
is fitted and the lines corresponding to density values of 5, 10, and 20 are plotted in black, red, and blue
respectively.

Solution 6. Evaluating the purity gain for the split we have:

$\Delta = \Bigl(1 - \bigl((\tfrac{18}{54})^2 + (\tfrac{18}{54})^2 + (\tfrac{18}{54})^2\bigr)\Bigr)
- \Bigl[\tfrac{18}{54}\bigl(1 - ((\tfrac{6}{18})^2 + (\tfrac{9}{18})^2 + (\tfrac{3}{18})^2)\bigr)
+ \tfrac{20}{54}\bigl(1 - ((\tfrac{4}{20})^2 + (\tfrac{6}{20})^2 + (\tfrac{10}{20})^2)\bigr)
+ \tfrac{16}{54}\bigl(1 - ((\tfrac{8}{16})^2 + (\tfrac{3}{16})^2 + (\tfrac{5}{16})^2)\bigr)\Bigr]
= 0.0505$
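The purity gain can be verified numerically; a minimal sketch using the class counts from the solution:

```python
def gini(counts):
    """Gini impurity I(v) = 1 - sum_c p(c|v)^2 for class counts at a node."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

root = [18, 18, 18]                            # class counts before the split
children = [[6, 9, 3], [4, 6, 10], [8, 3, 5]]  # counts in the three branches

n_root = sum(root)
delta = gini(root) - sum(sum(ch) / n_root * gini(ch) for ch in children)
print(round(delta, 4))  # 0.0505
```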
Figure 4: Considering the attribute Height for 10 observations in the Basketball data we inspect whether
each of the 10 players, here denoted O1, O2, . . . , O10, has a relatively high percentage of successful field
goals (FG>45%), indicated in black and considered the positive class (i.e., observations O5, O6, O7, and
O10), or a relatively low percentage of successful field goals (FG≤45%), indicated in red and considered the
negative class (i.e., observations O1, O2, O3, O4, O8, and O9).

Solution 7. There are a total of 4 positive and 6 negative observations. When lowering the threshold for
predicting high performance based on the value of Height we observe that the first observation to be above the
threshold is O10, which belongs to the positive class, thus TPR = 1/4, FPR = 0/6. Subsequently we get two
observations from the negative class, thus TPR = 1/4, FPR = 2/6, and then three positive observations above
the threshold, i.e., TPR = 4/4, FPR = 2/6. Lowering the threshold further we obtain the remaining
negative observations such that TPR = 4/4, FPR = 6/6. The only curve having this property is the curve with
AUC = 0.750.
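The AUC follows from the trapezoidal rule over the ROC points derived above; a minimal sketch:

```python
import numpy as np

# ROC points (FPR, TPR) traced in Solution 7 while lowering the threshold.
fpr = np.array([0.0, 0.0, 2/6, 2/6, 1.0])
tpr = np.array([0.0, 1/4, 1/4, 4/4, 4/4])

# Trapezoidal rule: area under the piecewise-linear ROC curve.
auc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)
print(auc)  # 0.75
```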
C. {O1, O2, O3, O4, O5, O6, O7}, {O8, O9, O10}.

D. {O1, O2, O3, O4, O5, O6, O7, O8, O9}, {O10}.

E. Don’t know.

Solution 8. The solution {O1}, {O2, O3, O4, O5, O6, O7, O8, O9, O10} has centroids at 5.7 and
(6.0+6.2+6.3+6.4+6.6+6.7+6.9+7.0+7.4)/9 = 6.6111. As such, O2 is closer to the centroid at 5.7 than
to 6.6111 and will thus be reassigned to this centroid; hence this is not a converged solution. The solution
{O1, O2, O3, O4, O5}, {O6, O7, O8, O9, O10} has centroids at (5.7+6.0+6.2+6.3+6.4)/5 = 6.12 and
(6.6+6.7+6.9+7.0+7.4)/5 = 6.92. As such, O5 is closer to the centroid at 6.12 and O6 is closer to the centroid
at 6.92. This will thus form a converged solution. The solution {O1, O2, O3, O4, O5, O6, O7}, {O8, O9, O10}
has centroids at (5.7+6.0+6.2+6.3+6.4+6.6+6.7)/7 = 6.2714 and (6.9+7.0+7.4)/3 = 7.1. As such, O7 is
closer to the centroid at 7.1 than the one at 6.2714 and will thus be reassigned to this centroid; hence, this is
also not a converged solution. The solution {O1, O2, O3, O4, O5, O6, O7, O8, O9}, {O10} has centroids
at 6.4222 and 7.4. As such, O9 is closer to 7.4 than 6.4222 and will thus be reassigned to this centroid;
hence, this is also not a converged solution.
Question 9. We suspect that observation O10 may be an outlier. In order to assess if this is the
case we would like to calculate the average relative KNN density based on Euclidean distance and the
observations given in Figure 4 only. We recall that the KNN density and average relative density (a.r.d.)
for the observation $x_i$ are given by:

$\mathrm{density}_{X\setminus i}(x_i, K) = \frac{1}{\frac{1}{K}\sum_{x' \in N_{X\setminus i}(x_i, K)} d(x_i, x')},$

$\mathrm{ard}_{X}(x_i, K) = \frac{\mathrm{density}_{X\setminus i}(x_i, K)}{\frac{1}{K}\sum_{x_j \in N_{X\setminus i}(x_i, K)} \mathrm{density}_{X\setminus j}(x_j, K)}.$

A. 0.409

B. 0.500

C. 0.533

D. 1.875

E. Don’t know.

Solution 9.

$\mathrm{density}(x_{O10}, 3) = \bigl(\tfrac{1}{3}(0.4 + 0.5 + 0.7)\bigr)^{-1} = 1.8750$
$\mathrm{density}(x_{O9}, 3) = \bigl(\tfrac{1}{3}(0.1 + 0.3 + 0.4)\bigr)^{-1} = 3.7500$
$\mathrm{density}(x_{O8}, 3) = \bigl(\tfrac{1}{3}(0.1 + 0.2 + 0.3)\bigr)^{-1} = 5$
$\mathrm{density}(x_{O7}, 3) = \bigl(\tfrac{1}{3}(0.1 + 0.2 + 0.3)\bigr)^{-1} = 5$

$\mathrm{a.r.d.}(x_{O10}, 3) = \frac{\mathrm{density}(x_{O10}, 3)}{\frac{1}{3}\bigl(\mathrm{density}(x_{O9}, 3) + \mathrm{density}(x_{O8}, 3) + \mathrm{density}(x_{O7}, 3)\bigr)} = \frac{1.8750}{\frac{1}{3}(3.75 + 5 + 5)} = 0.4091$
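The same numbers follow from a direct implementation of the two definitions; a minimal sketch (indices 0-9 correspond to O1-O10):

```python
import numpy as np

heights = np.array([5.7, 6.0, 6.2, 6.3, 6.4, 6.6, 6.7, 6.9, 7.0, 7.4])
K = 3

def knn_density(i):
    """Inverse of the mean distance to the K nearest neighbours of point i."""
    d = np.sort(np.abs(heights - heights[i]))[1:K + 1]  # skip distance to itself
    return 1.0 / d.mean()

def ard(i):
    """Average relative density: own density over mean density of the K neighbours."""
    neighbours = np.argsort(np.abs(heights - heights[i]))[1:K + 1]
    return knn_density(i) / np.mean([knn_density(j) for j in neighbours])

print(ard(9))  # O10: 1.875 / mean(3.75, 5, 5) = 0.4091 -> answer A
```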
Solution 10. Using group-average linkage, {O1, O2} will merge with {O3, O4, O5} at
the level of (0.5+0.6+0.7+0.2+0.3+0.4)/6 = 0.45, and then O10 will merge with {O6, O7, O8, O9} at the level
of (0.8+0.7+0.5+0.4)/4 = 0.6. The only dendrogram having these properties is dendrogram 4, and we can
thus rule out the other dendrograms.
A. Dendrogram 1.
B. Dendrogram 2.
C. Dendrogram 3.
D. Dendrogram 4.
E. Don’t know.
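The merge levels can be reproduced with standard tooling; a sketch using group-average linkage on the ten heights from Solution 8:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

heights = np.array([[5.7], [6.0], [6.2], [6.3], [6.4],
                    [6.6], [6.7], [6.9], [7.0], [7.4]])

# Group-average linkage; the third column of Z holds the merge levels
# that the correct dendrogram must reproduce (per Solution 10, O10
# should join {O6, O7, O8, O9} at level 0.6).
Z = linkage(pdist(heights), method='average')
print(np.round(Z, 3))
```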
Question 11. We will cut dendrogram 2 at the level of two clusters and evaluate this clustering in terms of
its correspondence with the class label information in which O1, O2, O3, O4, O8, and O9 correspond to low
values of FG whereas O5, O6, O7, and O10 correspond to high values of FG. We recall that the Rand index,
also denoted the simple matching coefficient (SMC), between the true labels and the extracted clusters is
given by $R = \frac{f_{11} + f_{00}}{K}$, where $f_{11}$ is the number of object pairs in the same class assigned to the same cluster,
$f_{00}$ is the number of object pairs in different classes assigned to different clusters, and $K = N(N-1)/2$
is the total number of object pairs, where N is the number of observations considered. What is the value
of R between the true labeling of the observations in terms of high and low FG values and the two clusters?

A. 0.3226

B. 0.5333

C. 0.5778

D. 0.6222

E. Don’t know.

Solution 11. The cluster indices are given by the vector $[2\;2\;2\;2\;2\;1\;1\;1\;1\;1]^\top$, whereas the true class
labels are given by the vector $[1\;1\;1\;1\;2\;2\;2\;1\;1\;2]^\top$. From this, we obtain: the total number of object pairs is
$K = 10(10 - 1)/2 = 45$,
$f_{00} = 4 \cdot 3 + 1 \cdot 2 = 14$,
$f_{11} = 4(4-1)/2 + 1(1-1)/2 + 3(3-1)/2 + 2(2-1)/2 = 10$,
$R = \frac{f_{11} + f_{00}}{K} = \frac{10 + 14}{45} = 24/45 \approx 0.5333.$

Figure 7: A Gaussian mixture model (GMM) with three clusters fitted to the 10 observations based only
on the attribute Height. The overall probability density is given in gray and in blue the contribution from
each of the three clusters to the density.

Question 12. We will fit a Gaussian mixture model (GMM) with three clusters to the 10 observations given
in Figure 4. The fitted density is given in Figure 7, in which the overall density is given in gray and the
contribution of each Gaussian is given in blue. We recall that the Gaussian mixture model for 1-dimensional
data is given by:
$p(x) = \sum_k w_k \frac{1}{\sqrt{2\pi\sigma_k^2}} \exp\!\Bigl(-\frac{1}{2\sigma_k^2}(x - \mu_k)^2\Bigr).$
For the clustering we have:
w1 = 0.37, w2 = 0.29, w3 = 0.34,
µ1 = 6.12, µ2 = 6.55, µ3 = 6.93,
σ1² = 0.09, σ2² = 0.13, σ3² = 0.12.
What is the probability that observation O8 is assigned to cluster 2 according to the GMM?

A. 0.20

B. 0.29

C. 0.33

D. 0.37
E. Don’t know.
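Solution 12 is not included in this extract, but the assignment probability follows from the responsibilities of the fitted mixture; a minimal sketch (O8's height of 6.9 is taken from Solution 8):

```python
import numpy as np
from scipy.stats import norm

# Mixture parameters from Question 12 and the height of O8.
w  = np.array([0.37, 0.29, 0.34])
mu = np.array([6.12, 6.55, 6.93])
s2 = np.array([0.09, 0.13, 0.12])
x_o8 = 6.9

# Responsibility (posterior assignment probability) of each cluster.
joint = w * norm.pdf(x_o8, loc=mu, scale=np.sqrt(s2))
print(joint / joint.sum())  # [~0.03, ~0.33, ~0.64] -> cluster 2: 0.33 (C)
```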
Figure 8: Confusion matrix based on a classifier’s predictions of high or low success rate of field goals
(i.e., FG>45% considered the positive class and FG≤45% considered the negative class, respectively).

Question 13. We will consider a simple classifier that predicts the 54 basketball players as having a high
success rate of field goals (FG>45%, considered the positive class) if they are taller than 6.65 feet and
low otherwise (FG≤45%, considered the negative class). The confusion matrix of the classifier is given
in Figure 8. Which statement regarding the classifier is correct?

E. Don’t know.

Question 14. The National Basketball Association (NBA) is the top basketball league in the USA, and
all males playing in the NBA earn several million dollars a year or more, giving them all a
very high salary. In the USA we will assume that approximately 0.2 % of the male population not playing in
the NBA makes a similarly very high salary. Furthermore, we will assume two out of a million American
males are playing in the NBA. Assuming the above, what is the probability that a male in the USA making
such a very high salary plays in the NBA?

A. 0.0002%

B. 0.0010%

C. 0.0999%

D. 0.2002%

E. Don’t know.

Solution 14. What we are interested in is
$P(\mathrm{NBA}\mid \mathrm{VHS}) = \frac{P(\mathrm{VHS}\mid \mathrm{NBA})\,P(\mathrm{NBA})}{P(\mathrm{VHS}\mid \mathrm{NBA})\,P(\mathrm{NBA}) + P(\mathrm{VHS}\mid \neg\mathrm{NBA})\,P(\neg\mathrm{NBA})} = \frac{1 \cdot 2/1000000}{1 \cdot 2/1000000 + 0.002 \cdot (1 - 2/1000000)} = \frac{2}{2 + 0.002 \cdot 999998} = 0.0999\,\%,$
where VHS denotes a very high salary.
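Bayes' rule from Solution 14 can be checked in a couple of lines; a minimal sketch:

```python
# Bayes' rule for P(NBA | very high salary), with the rates assumed
# in Question 14.
p_nba = 2 / 1_000_000      # prior: two in a million males play in the NBA
p_vhs_given_nba = 1.0      # every NBA player has a very high salary
p_vhs_given_not = 0.002    # 0.2% of non-NBA males have a very high salary

posterior = (p_vhs_given_nba * p_nba) / (
    p_vhs_given_nba * p_nba + p_vhs_given_not * (1 - p_nba))
print(f"{posterior:.4%}")  # 0.0999% -> answer C
```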
Solution 15. The decision boundary of classifier 1 is a straight line and thus conforms to a logistic regression
model using only the features x1 and x2 as inputs. Classifier 2 is a 3-nearest neighbor classifier; when
using Euclidean distance the scale of Weight has much more variance than that of Height, thereby heavily
influencing the distance measure. Classifier 3 is a decision tree with two vertical and one horizontal line
corresponding to three decisions. Classifier 4 is non-linear and smooth, corresponding to a logistic regression
including a transformed variable, i.e., x1 · x2.
Solution 16. Evaluating the neural network with rectified linear units at the observation [6.8 225 0.44 0.68]
(prepending a 1 to account for the bias term) we obtain:

$f(x) = 2.84 + 3.25 \max\bigl([1\;6.8\;225\;0.44\;0.68]\, w^{(1)}, 0\bigr) + 3.46 \max\bigl([1\;6.8\;225\;0.44\;0.68]\, [-9.60\;{-0.44}\;0.01\;14.54\;9.50]^\top, 0\bigr)$
$= 2.84 + 3.25 \max(-1.027, 0) + 3.46 \max(2.516, 0)$
$= 11.54$
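The forward pass can be replayed numerically. A sketch; the first hidden unit's weight vector $w^{(1)}$ is not fully legible in this extract, so its stated pre-activation of −1.027 is used directly:

```python
import numpy as np

# Input with a leading 1 for the bias term, as in Solution 16.
x = np.array([1, 6.8, 225, 0.44, 0.68])

# Second hidden unit's weights are legible in the solution; the first
# unit's are not, but its pre-activation is stated to be -1.027.
w2 = np.array([-9.60, -0.44, 0.01, 14.54, 9.50])
z1 = -1.027
z2 = x @ w2                      # = 2.516

f = 2.84 + 3.25 * max(z1, 0) + 3.46 * max(z2, 0)
print(round(f, 2))  # 11.54
```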
Table 3:

      HL  HH  WL  WH  FG≤45%  FG>45%  FT≤75%  FT>75%
O1     1   0   1   0    1       0       1       0
O2     1   0   1   0    1       0       1       0
O3     1   0   1   0    1       0       1       0
O4     1   0   1   0    1       0       0       1
O5     1   0   1   0    0       1       0       1
O6     1   0   0   1    0       1       1       0
O7     0   1   1   0    0       1       0       1
O8     0   1   1   0    1       0       0       1
O9     0   1   0   1    1       0       1       0
O10    0   1   0   1    0       1       1       0

Solution 18. An itemset has the required support if it occurs at least 4 out of the 10 times. All the itemsets that have this
property are:
{HL}, {HH}, {WL}, {FG≤45%}, {FG>45%}, {FT≤75%}, {FT>75%}, {HL, WL}, {HL, FG≤45%},
{HL, FT≤75%}, {WL, FG≤45%}, {WL, FT>75%}, {FG≤45%, FT≤75%}, and {HL, WL, FG≤45%}.

E. Don’t know.
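The itemset list can be verified by brute force over Table 3; a minimal sketch (item names and the support helper are ours):

```python
from itertools import combinations

items = ['HL', 'HH', 'WL', 'WH', 'FG<=45', 'FG>45', 'FT<=75', 'FT>75']
table = [  # Table 3, one row per observation O1..O10
    [1,0,1,0,1,0,1,0], [1,0,1,0,1,0,1,0], [1,0,1,0,1,0,1,0],
    [1,0,1,0,1,0,0,1], [1,0,1,0,0,1,0,1], [1,0,0,1,0,1,1,0],
    [0,1,1,0,0,1,0,1], [0,1,1,0,1,0,0,1], [0,1,0,1,1,0,1,0],
    [0,1,0,1,0,1,1,0],
]
baskets = [{items[j] for j, v in enumerate(row) if v} for row in table]

def support(itemset):
    """Number of observations (out of 10) containing every item in the set."""
    return sum(itemset <= basket for basket in baskets)

# Brute-force all itemsets with support >= 4, as listed in Solution 18.
for size in (1, 2, 3):
    for combo in combinations(items, size):
        if support(set(combo)) >= 4:
            print(set(combo), support(set(combo)))
```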
Question 19. We consider again the data in Table 3 as a market basket problem. What is the confidence of
the association rule {HL, WL} → {FG≤45%, FT≤75%}?

A. 30 %

B. 40 %

C. 50 %

D. 60 %

E. Don’t know.

Solution 19. The confidence is given as
$\mathrm{conf} = \frac{\mathrm{support}(\{H_L, W_L, FG_{\leq 45\%}, FT_{\leq 75\%}\})}{\mathrm{support}(\{H_L, W_L\})} = \frac{3}{5} = 60\,\%.$

B. 9/11

C. 3/4

D. 1

E. Don’t know.

Solution 20. Let HAS denote high average score. According to the Naïve Bayes classifier we have

$P(\mathrm{HAS}\mid H_H = 1, W_L = 1) = \frac{P(H_H = 1\mid \mathrm{HAS})\,P(W_L = 1\mid \mathrm{HAS})\,P(\mathrm{HAS})}{\sum_{c \in \{\mathrm{LAS},\,\mathrm{MAS},\,\mathrm{HAS}\}} P(H_H = 1\mid c)\,P(W_L = 1\mid c)\,P(c)} = \frac{3/4 \cdot 3/4 \cdot 4/10}{1/4 \cdot 2/4 \cdot 4/10 + 0 \cdot 2/2 \cdot 2/10 + 3/4 \cdot 3/4 \cdot 4/10} = \frac{9/40}{2/40 + 0 + 9/40} = 9/11.$

D. The classifier will be tied between the classes black and blue.

E. Don’t know.
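Solution 20's posterior can be checked directly from its stated terms; a minimal sketch:

```python
# Naive Bayes terms from Solution 20: class priors and the two
# conditional probabilities for HH = 1 and WL = 1.
classes = {
    'LAS': (4/10, 1/4, 2/4),   # (P(class), P(HH=1|class), P(WL=1|class))
    'MAS': (2/10, 0.0, 2/2),
    'HAS': (4/10, 3/4, 3/4),
}

joint = {c: p * p_hh * p_wl for c, (p, p_hh, p_wl) in classes.items()}
posterior = joint['HAS'] / sum(joint.values())
print(posterior)  # 9/11 = 0.8181... -> answer B
```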
Solution 21. For O10 we have:

$J(O10, O1) = \frac{1}{8-1} = \frac{1}{7}$
$J(O10, O2) = \frac{1}{8-1} = \frac{1}{7}$
$J(O10, O3) = \frac{1}{8-1} = \frac{1}{7}$
$J(O10, O4) = \frac{0}{8-0} = 0$
$J(O10, O5) = \frac{1}{8-1} = \frac{1}{7}$
$J(O10, O6) = \frac{3}{8-3} = \frac{3}{5}$
$J(O10, O7) = \frac{3}{8-3} = \frac{3}{5}$
$J(O10, O8) = \frac{1}{8-1} = \frac{1}{7}$
$J(O10, O9) = \frac{3}{8-3} = \frac{3}{5}$

Hence the three nearest neighbors are O6, O7, and O9 with two black and one blue observation; thus,
according to majority voting, the observation will be classified as black.

Figure 10: Decision boundaries for two rounds of boosting considering a logistic regression model using
the features Height and Weight and the 10 observations also considered previously in Figure 4. Gray regions
indicate that the observation will be classified as red crosses, white regions that the observation will be
classified as black plusses.

Question 22. We will consider classifying the 10 observations considered in Figure 4 using logistic regression
and boosting by AdaBoost (notice, the AdaBoost algorithm uses the natural logarithm). For this purpose
we include only two boosting rounds considering only the features Height (i.e., x1) and Weight (i.e., x2)
as inputs. In the first round the data is sampled with equal probability wi = 1/10 for i ∈ {1, . . . , 10} and the
logistic regression model with decision boundary given to the left of Figure 10 is trained. A new dataset is
subsequently sampled and a new logistic regression classifier, given to the right of Figure 10, is trained. Based on
these two rounds of the AdaBoost algorithm, what will an observation located at x1 = 6 and x2 = 240 be
classified as?

E. Don’t know.
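Solution 21's Jaccard similarities follow directly from Table 3; a minimal sketch using J = f11/(M − f00) with M = 8 binary attributes:

```python
import numpy as np

rows = np.array([  # binarized Table 3, rows O1..O10
    [1,0,1,0,1,0,1,0], [1,0,1,0,1,0,1,0], [1,0,1,0,1,0,1,0],
    [1,0,1,0,1,0,0,1], [1,0,1,0,0,1,0,1], [1,0,0,1,0,1,1,0],
    [0,1,1,0,0,1,0,1], [0,1,1,0,1,0,0,1], [0,1,0,1,1,0,1,0],
    [0,1,0,1,0,1,1,0],
])

def jaccard(a, b):
    """f11 / (M - f00): matching ones over attributes not jointly zero."""
    f11 = np.sum((a == 1) & (b == 1))
    f00 = np.sum((a == 0) & (b == 0))
    return f11 / (len(a) - f00)

sims = [jaccard(rows[9], rows[i]) for i in range(9)]  # O10 vs O1..O9
print(np.round(sims, 3))  # 3/5 = 0.6 for O6, O7 and O9 -> the 3 neighbours
```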
Question 23. Using the 54 observations of the Basketball dataset we would like to predict the average
points scored per game (y) based on the four features (x1–x4). For this purpose we consider regularized least
squares regression, which minimizes with respect to w the following cost function:
$E(w) = \sum_n \bigl(y_n - [1\; x_{n1}\; x_{n2}\; x_{n3}\; x_{n4}]\,w\bigr)^2 + \lambda w^\top w,$
where $x_{nm}$ denotes the m'th feature of the n'th observation, and a 1 is concatenated to the data to account for
the bias term. We consider 20 different values of λ and use leave-one-out cross-validation to quantify the
performance of each of these different values of λ. The results of the leave-one-out cross-validation
are given in Figure 11. Inspecting the model for the value of λ = 0.6952, the following model is identified:
$f(x) = 2.76 - 0.37x_1 + 0.01x_2 + 7.67x_3 + 7.67x_4.$
Which one of the following statements is correct?

Figure 11: Root-mean-square error (RMSE) curves as a function of the regularization strength λ for regularized
least squares regression predicting the average points scored per game y based on the attributes x1–x4.

A. In Figure 11 the blue curve with circles corresponds to the training error whereas the black curve with
crosses corresponds to the test error.
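The estimator itself has a closed form. A minimal sketch of ridge regression as defined by the stated cost function; the 54-observation dataset is not reproduced here, so random data stands in:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize sum_n (y_n - [1 x_n] w)^2 + lam * w'w, with the bias
    included in w as in the stated cost function."""
    Xt = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend the 1 column
    d = Xt.shape[1]
    return np.linalg.solve(Xt.T @ Xt + lam * np.eye(d), Xt.T @ y)

# Tiny illustration on placeholder data of the same shape.
rng = np.random.default_rng(0)
X = rng.normal(size=(54, 4))
y = rng.normal(size=54)
print(ridge_fit(X, y, lam=0.6952))  # five coefficients: bias + x1..x4
```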
Question 24. We will again consider the ridge regression described in the previous question. Which
one of the following statements is correct?

E. Don’t know.

Solution 25. As the observation in each cluster is at least closest to one other observation in its cluster than …

Solution 26. Multinomial regression is a generalization of two-class logistic regression to handle multiple
classes. Decision trees do not return probabilities of being in each class but hard-assign observations to
the classes based on majority voting in each terminal leaf. K-means, Gaussian mixture models, and artificial
neural networks (ANN) are indeed all prone to local minima, and it is therefore advised to use multiple
restarts, selecting the initialization with the best solution. Accuracy is not a good performance measure
when facing severe class-imbalance issues, as we may trivially obtain a very high accuracy simply by always
predicting the majority class. The AUC of the receiver operating characteristic would here be more appropriate,
as it is not influenced by the relative sizes of the two classes.

Figure 13: A two-class classification problem with red crosses (i.e., x) and black plusses (i.e., +) constituting
the two classes, as well as the associated decision boundaries of the two classes indicated respectively by
gray and white regions.

Figure 14: A decision tree with four decisions (A, B, C, and D) forming the decision boundaries given in
Figure 13 if adequately defined.

Question 27. We will consider the two-class classification problem given in Figure 13, in which the goal
is to separate red crosses (i.e., x) from black plusses (i.e., +) based on the decision boundaries in gray and
white indicated in the top panel of the figure. Which one of the following procedures based on the decision
tree given in Figure 14 will perfectly separate the two classes?

A. $A = \bigl\|x - \begin{bmatrix} 0 \\ 0.5 \end{bmatrix}\bigr\|_\infty < 0.5$,
$B = x_1 < 0.75$,
$C = \bigl\|x - \begin{bmatrix} 0.75 \\ 0.5 \end{bmatrix}\bigr\|_2 < 0.25$,
$D = \bigl\|x - \begin{bmatrix} 0.75 \\ 0.5 \end{bmatrix}\bigr\|_1 < 0.25$.

B. $A = \bigl\|x - \begin{bmatrix} 0 \\ 0.5 \end{bmatrix}\bigr\|_1 < 0.5$,
$B = x_1 < 0.75$,
$C = \bigl\|x - \begin{bmatrix} 0.75 \\ 0.5 \end{bmatrix}\bigr\|_2 < 0.25$,
$D = \bigl\|x - \begin{bmatrix} 0.75 \\ 0.5 \end{bmatrix}\bigr\|_\infty < 0.25$.

C. $A = \bigl\|x - \begin{bmatrix} 0 \\ 0.5 \end{bmatrix}\bigr\|_\infty < 0.5$,
$B = x_1 < 0.75$,
$C = \bigl\|x - \begin{bmatrix} 0.5 \\ 0.75 \end{bmatrix}\bigr\|_2 < 0.25$,
$D = \bigl\|x - \begin{bmatrix} 0.5 \\ 0.75 \end{bmatrix}\bigr\|_1 < 0.25$.
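The four decisions in option A can be expressed directly as norm tests. A sketch; how the four outcomes combine into leaf classes follows the tree in Figure 14, which is not reproduced here, so classify simply returns the four booleans:

```python
import numpy as np

def classify(x, inf_center=(0.0, 0.5), ball_center=(0.75, 0.5)):
    """Decision tests from option A: an inf-norm box, a vertical split,
    a Euclidean disc and a 1-norm diamond (centers as stated)."""
    x = np.asarray(x, dtype=float)
    A = np.max(np.abs(x - inf_center)) < 0.5       # ||.||_inf ball
    B = x[0] < 0.75                                # axis-aligned split
    C = np.linalg.norm(x - ball_center, 2) < 0.25  # ||.||_2 ball
    D = np.linalg.norm(x - ball_center, 1) < 0.25  # ||.||_1 ball
    return A, B, C, D

print(classify([0.2, 0.6]))   # inside the inf-norm box around (0, 0.5)
print(classify([0.75, 0.6]))  # inside the 2-norm disc around (0.75, 0.5)
```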