
CIS 520, Machine Learning, Fall 2015: Assignment 3

Solutions

Instructions. Please write up your responses to the following problems clearly and concisely. We encourage
you to write up your responses using LaTeX; we have provided a LaTeX template, available on Canvas, to
make this easier. Submit your answers in PDF form to Canvas. We will not accept paper copies
of the homework.

Collaboration. You are allowed and encouraged to work together. You may discuss the homework to
understand the problem and reach a solution in groups of up to two students. However, each student
must write down the solution independently, and without referring to written notes from the joint session.
In addition, each student must write on the problem set the names of the people with whom
you collaborated. You must understand the solution well enough in order to reconstruct it by yourself.
(This is for your own benefit: you have to take the exams alone.)

1 Linear Regression and LOOCV [30 points]


In the last homework, you learned about using cross validation as a way to estimate the true error of a
learning algorithm. A solution that provides an almost unbiased estimate of this true error is Leave-One-
Out Cross Validation (LOOCV), but it can take a really long time to compute the LOOCV error. In this
problem, you will derive an algorithm for efficiently computing the LOOCV error for linear regression using
the Hat Matrix.[1]
Assume that there are n given training examples, (X1 , Y1 ), (X2 , Y2 ), . . . , (Xn , Yn ), where each input data
point Xi has m real-valued features. The goal of regression is to learn to predict Y from X. The linear
regression model assumes that the output Y is a weighted linear combination of the input features with
weights given by w, plus some Gaussian noise.
We can write this in matrix form by stacking the data points as the rows of a matrix X so that xij is the
j-th feature of the i-th data point. Then, writing Y, w, and ε as column vectors, we can express the linear regression model in matrix form as follows:

\[
Y = Xw + \epsilon,
\]

where

\[
Y = \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix}, \quad
X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1m} \\ x_{21} & x_{22} & \cdots & x_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nm} \end{bmatrix}, \quad
w = \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_m \end{bmatrix}, \quad
\epsilon = \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix}.
\]
Assume that εᵢ is normally distributed with variance σ². We saw in class that the maximum likelihood estimate of the model parameters w (which also happens to minimize the sum of squared prediction errors) is given by the Normal equation:

\[
\hat{w} = (X^T X)^{-1} X^T Y
\]

[1] Unfortunately, such an efficient algorithm may not be easily found for other learning methods.

Define Ŷ to be the vector of predictions using ŵ if we were to plug in the original training set X:

\[
\hat{Y} = X\hat{w} = X(X^T X)^{-1} X^T Y = HY
\]

where we define H = X(X^T X)^{-1} X^T (H is often called the Hat Matrix).


As mentioned above, ŵ also minimizes the sum of squared errors:
\[
\mathrm{SSE} = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2
\]

Now recall that the Leave-One-Out Cross Validation score is defined to be:

\[
\mathrm{LOOCV} = \sum_{i=1}^{n} \left(Y_i - \hat{Y}_i^{(-i)}\right)^2
\]

where Ŷ^(−i) is the estimator of Y after removing the i-th observation (i.e., it minimizes \(\sum_{j \neq i} (Y_j - \hat{Y}_j^{(-i)})^2\)).

1. [2 points] To begin with, we should consider when it is possible to compute ŵ in this framework.
(a) [1 point] Suppose m > n. Is ŵ well-defined? Why or why not?
Hint: Recall that the rank of a matrix is equal to the number of linearly independent rows,
which is also equal to the number of linearly independent columns. Use the fact that for two
matrices A and B which can be multiplied to form the product AB, it must be the case that
rank(AB) ≤ min (rank(A), rank(B)). Furthermore, recall that a square matrix is invertible if and
only if it is full-rank.
(b) [1 point] Suppose m ≤ n. Give a condition on X which guarantees that ŵ will not be well-
defined and explain why not. (Don’t assume X is a square matrix.)
For the rest of question 1, assume ŵ is well-defined.

F SOLUTION: a) No. We use the rule from linear algebra that

\[
\mathrm{rank}(AB) \le \min\left(\mathrm{rank}(A), \mathrm{rank}(B)\right).
\]

The rank of a matrix is not greater than its smallest dimension, in this case n, and X^T has the same rank as X. Therefore,

\[
\mathrm{rank}(X^T X) \le \mathrm{rank}(X) \le n < m.
\]

Therefore, X^T X is not full-rank, and so it is not invertible. It follows that ŵ is not well-defined.

b) Two or more columns of X are linearly dependent, OR the rank of X is less than m, OR similar. The key relation is again rank(X^T X) ≤ rank(X), plus the fact that X^T X can be inverted only if it is full-rank. It is not correct to just say that X is singular, since this assumes X is square. It is not correct to give the condition that two or more rows of X are linearly dependent, since this is guaranteed for the case m < n.
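As a quick numerical illustration (a minimal sketch with made-up data, not part of the original solution), one can check both failure modes in MATLAB: when m > n, or when a column is duplicated, X'*X is rank-deficient and the normal equations have no unique solution.

    % Illustration (hypothetical data): when is (X'*X) invertible?
    n = 10; m = 3;
    X = randn(n, m);                 % n > m, independent columns: full rank
    fprintf('rank(X''*X) = %d (m = %d)\n', rank(X'*X), m);

    Xdup = [X, X(:,1)];              % duplicate a column: rank deficient
    fprintf('rank with duplicated column = %d (m = %d)\n', rank(Xdup'*Xdup), m + 1);

    Xwide = randn(3, 5);             % m > n: rank(X'*X) <= n < m
    fprintf('rank for m > n case = %d (m = %d)\n', rank(Xwide'*Xwide), 5);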

2. [3 points] What is the complexity of computing the LOOCV score naively? (The naive algorithm is
to loop through each point, performing a regression on the n − 1 remaining points at each iteration.)
Hint: The complexity of matrix inversion for a k × k matrix is O(k³). (There are faster algorithms out there, but for simplicity we'll assume that we are using the naive O(k³) algorithm.)

F SOLUTION: Let X^(−i) denote X without the i-th example (row), and let Y^(−i) be Y without the i-th element. For each point, we need to compute ŵ^(−i) (ŵ computed from X^(−i) and Y^(−i)) and dot it with a single point Xi. Note that it is not necessary to compute the hat matrix H each time, since H is designed to give predictions for all examples and we only need the prediction for Xi.
The expression for ŵ^(−i) is:

\[
\hat{w}^{(-i)} = \left((X^{(-i)})^T X^{(-i)}\right)^{-1} (X^{(-i)})^T Y^{(-i)}
\]

Recall that multiplying an n × m matrix by an m × p matrix costs O(nmp) flops (operations, in this case multiplications and additions). Forming X^T X takes m²n flops, inverting it takes m³ flops, multiplying by X^T takes m²n flops, and multiplying by Y takes nm flops.
We need to multiply this by n, since LOOCV loops through each training point; doing so gives O(m²n² + nm³) complexity.
1 point for using ŵ, 2 points for the correct expression.
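For concreteness, a minimal MATLAB sketch of this naive procedure (not part of the original solution; it assumes X (n × m) and Y (n × 1) are already in the workspace):

    % Naive LOOCV for linear regression: refit on the n-1 remaining points each time.
    [n, ~] = size(X);
    loocv = 0;
    for i = 1:n
        idx = [1:i-1, i+1:n];                       % leave out example i
        Xi = X(idx, :);  Yi = Y(idx);
        w_i = (Xi' * Xi) \ (Xi' * Yi);              % normal equations on n-1 points
        yhat_i = X(i, :) * w_i;                     % predict the held-out point
        loocv = loocv + (Y(i) - yhat_i)^2;
    end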

3. [3 points] Write Ŷi in terms of the elements of H and Y . You may find it useful to use shorthand
such as Hab to denote the entry in row a, column b of H.

F SOLUTION: \(\hat{Y}_i = \sum_{j=1}^{n} H_{ij} Y_j\)

4. [4 points] Show that Ŷ^(−i) is also the estimator which minimizes SSE for Z, where

\[
Z_j =
\begin{cases}
Y_j, & j \neq i \\
\hat{Y}_i^{(-i)}, & j = i
\end{cases}
\]

Hint: Try to start by writing an expression for the SSE of Z; it should look very similar to the definition of SSE for Y that was given in the introduction section of this question. Then, manipulate terms until you can argue that substituting Ŷ^(−i) for Ẑ would minimize this expression.

F SOLUTION: We want to show that the estimator Ẑ that minimizes SSE for Z is equal to Ŷ^(−i).

\[
\begin{aligned}
\mathrm{SSE}(Z) &= \sum_{j=1}^{n} (Z_j - \hat{Z}_j)^2 \\
&= (Z_i - \hat{Z}_i)^2 + \sum_{j \neq i} (Z_j - \hat{Z}_j)^2 \\
&= \left(\hat{Y}_i^{(-i)} - \hat{Z}_i\right)^2 + \sum_{j \neq i} (Y_j - \hat{Z}_j)^2
\end{aligned}
\]

Consider minimizing each of the two terms separately. Using the fact that Ŷ^(−i) minimizes \(\sum_{j \neq i} (Y_j - \hat{Y}_j^{(-i)})^2\), we see that Ŷ^(−i) minimizes the right-hand term. Furthermore, the choice Ẑi = Ŷi^(−i) makes the left-hand term zero and therefore minimizes it.
So Ŷ^(−i) is also the estimator that minimizes SSE(Z). This is certainly a valid estimator, i.e. we can obtain it with linear regression, because the model ŵ^(−i) which produces the n − 1 terms of Ŷ^(−i) in the sum is the same model which predicts Ŷi^(−i) from the remaining observation Xi.

5. [6 points] Write Ŷi^(−i) in terms of H and Z. By definition, Ŷi^(−i) = Zi, but give an answer that includes both H and Z.

F SOLUTION: Since Ŷ^(−i) is the min-SSE estimator for Z just as Ŷ is for Y, we can write it (see 1.3) as

\[
\hat{Y}_i^{(-i)} = \sum_{j=1}^{n} H_{ij} Z_j.
\]

Big picture: how would you think to come up with this? Possible answer: we have two expressions for
the LOOCV, one given in the intro to the question, and one which we derive finally in 1.7 (the last part
of this question). To get from the original formula to the 1.7 formula, we remove the estimates for each
observation predicted by each of the n regressions with n−1 observations, and we insert estimates predicted
from just one regression and values from one hat matrix computation. So, it turns out that the predictions
from each of the n regressions are completely encoded in the hat matrix and the single regression with n
observations, and it turns out that the trade-off in computational complexity is favorable.
1.5 is sort of the breakthrough moment where we can write the predictions from each of the n smaller
regressions completely in terms of things we know without doing those n regressions. So if you were trying
to figure out how to do this from the beginning, you might be trying to find exactly this.

6. [6 points] Show that Ŷi − Ŷi^(−i) = Hii Yi − Hii Ŷi^(−i), where Hii denotes the i-th element along the diagonal of H.
Hint: Use the results from parts 3 and 5. Substitute Zj with Yj for j ≠ i, and replace Zi = Ŷi^(−i) using the definition of Z in part 4.

F SOLUTION: From parts 3 and 5, we can write

\[
\begin{aligned}
\hat{Y}_i - \hat{Y}_i^{(-i)} &= \sum_{j=1}^{n} H_{ij} Y_j - \sum_{j=1}^{n} H_{ij} Z_j \\
&= \sum_{j \neq i} H_{ij} Y_j + H_{ii} Y_i - \sum_{j \neq i} H_{ij} Z_j - H_{ii} Z_i \\
&= \sum_{j \neq i} H_{ij} Y_j + H_{ii} Y_i - \sum_{j \neq i} H_{ij} Y_j - H_{ii} \hat{Y}_i^{(-i)} \\
&= H_{ii} Y_i - H_{ii} \hat{Y}_i^{(-i)}
\end{aligned}
\]

7. [6 points] Show that

\[
\mathrm{LOOCV} = \sum_{i=1}^{n} \left( \frac{Y_i - \hat{Y}_i}{1 - H_{ii}} \right)^2
\]

What is the algorithmic complexity of computing the LOOCV score using this formula? Note: We see from this formula that the diagonal elements of H somehow indicate the impact that each particular observation has on the result of the regression.

F SOLUTION: Part 6 says that Ŷi − Ŷi^(−i) = Hii Yi − Hii Ŷi^(−i). We can rearrange this to

\[
\begin{aligned}
\hat{Y}_i^{(-i)} - H_{ii} \hat{Y}_i^{(-i)} &= \hat{Y}_i - H_{ii} Y_i \\
\hat{Y}_i^{(-i)} &= \frac{\hat{Y}_i - H_{ii} Y_i}{1 - H_{ii}}
\end{aligned}
\]

Now the LOOCV error becomes

\[
\begin{aligned}
\mathrm{LOOCV} &= \sum_{i=1}^{n} \left(Y_i - \hat{Y}_i^{(-i)}\right)^2 \\
&= \sum_{i} \left( Y_i - \frac{\hat{Y}_i - H_{ii} Y_i}{1 - H_{ii}} \right)^2 \\
&= \sum_{i} \left( \frac{(1 - H_{ii}) Y_i - \hat{Y}_i + H_{ii} Y_i}{1 - H_{ii}} \right)^2 \\
&= \sum_{i} \left( \frac{Y_i - \hat{Y}_i}{1 - H_{ii}} \right)^2
\end{aligned}
\]

Evaluating LOOCV using this formula can be done in O(m²n + m³) time, since it only requires one regression (and we don't need to compute all of H to compute the Hii).
3 points for showing the LOOCV equation, 3 points for the correct complexity.
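A matching MATLAB sketch of the efficient computation (ours, not part of the original solution; it again assumes X and Y are in the workspace, and obtains the diagonal of H without forming the full matrix):

    % LOOCV via the hat-matrix identity: one regression, then reuse H_ii.
    XtX_inv = inv(X' * X);
    w_hat   = XtX_inv * (X' * Y);           % single full regression
    Y_hat   = X * w_hat;
    h       = sum((X * XtX_inv) .* X, 2);   % h(i) = H_ii, computed row by row
    loocv   = sum(((Y - Y_hat) ./ (1 - h)).^2);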

2 Logistic regression and Naive Bayes [20 points]


A common debate in machine learning has been over generative versus discriminative models for classification.
In this question we will explore this issue by considering Naive Bayes and logistic regression.

1. [4 points] For input X and output Y , which of the following is the objective function optimized
by (i) Naive Bayes, and (ii) logistic regression?

(a) Pr(Y )/Pr(X)


(b) Pr(X)/Pr(Y )
(c) Pr(Y | X)
(d) Pr(Y )
(e) Pr(X)
(f) Pr(Y )Pr(X)
(g) Pr(X, Y )
(h) None of the above (provide the correct formula in this case)

F SOLUTION: (i) (g) Pr(X, Y); (ii) (c) Pr(Y | X).

2. [16 points] Recall from the suggested reading that “the discriminative analog of Naive Bayes is
logistic regression.” This means that the parametric form of P (Y | X) used by logistic regression is
implied by the assumptions of a Naive Bayes classifier, for some specific class-conditional densities. In
class you will see how to prove this for a Gaussian Naive Bayes classifier for continuous input values.
Can you prove the same for binary inputs? Assume Xi and Y are both binary. Assume that Xi | Y = j
is Bernoulli(θij ), where j ∈ {0, 1}, and Y is Bernoulli(π).
Hint: Start by using Bayes Rule and the assumptions of Naive Bayes to express the objective function
for logistic regression in terms of the given quantities θij and π.

F SOLUTION: Our goal in this problem is to transform the Naive Bayes class conditional probability given in (1) into the form of the logistic regression class conditional probability given in (2).

\[
\mathrm{Pr}_{NB}(Y = 1 \mid X) = \frac{\Pr(Y = 1) \prod_{i=1}^{n} \Pr(X_i \mid Y = 1)}{\Pr(Y = 0) \prod_{i=1}^{n} \Pr(X_i \mid Y = 0) + \Pr(Y = 1) \prod_{i=1}^{n} \Pr(X_i \mid Y = 1)} \tag{1}
\]

\[
\mathrm{Pr}_{LR}(Y = 1 \mid X) = \frac{1}{1 + \exp\left(-w_0 - \sum_{i=1}^{n} w_i X_i\right)} \tag{2}
\]

We begin by applying a few basic algebraic transformations to (1) to obtain (3).

\[
\begin{aligned}
\mathrm{Pr}_{NB}(Y = 1 \mid X) &= \frac{1}{1 + \frac{\Pr(Y = 0) \prod_{i=1}^{n} \Pr(X_i \mid Y = 0)}{\Pr(Y = 1) \prod_{i=1}^{n} \Pr(X_i \mid Y = 1)}} \\
&= \frac{1}{1 + \exp\left( \ln \frac{\Pr(Y = 0) \prod_{i=1}^{n} \Pr(X_i \mid Y = 0)}{\Pr(Y = 1) \prod_{i=1}^{n} \Pr(X_i \mid Y = 1)} \right)} \\
&= \frac{1}{1 + \exp\left( \ln \frac{\Pr(Y = 0)}{\Pr(Y = 1)} + \sum_{i=1}^{n} \ln \frac{\Pr(X_i \mid Y = 0)}{\Pr(X_i \mid Y = 1)} \right)}
\end{aligned}
\tag{3}
\]

The probability mass functions of the Bernoulli distributions of Y and (Xi | Y = j) are given by

\[
\Pr(Y = \alpha) = \pi^{\alpha} (1 - \pi)^{1 - \alpha}
\quad \text{and} \quad
\Pr(X_i = \beta \mid Y = j) = \theta_{ij}^{\beta} (1 - \theta_{ij})^{1 - \beta}.
\]

Therefore we can rewrite (3) as (4) by substituting the above probability mass functions.

\[
\mathrm{Pr}_{NB}(Y = 1 \mid X) = \frac{1}{1 + \exp\left( \ln \frac{1 - \pi}{\pi} + \sum_{i=1}^{n} \ln \frac{\theta_{i0}^{X_i} (1 - \theta_{i0})^{1 - X_i}}{\theta_{i1}^{X_i} (1 - \theta_{i1})^{1 - X_i}} \right)} \tag{4}
\]

With some more algebraic manipulation we can transform (4) into (5).

\[
\begin{aligned}
\mathrm{Pr}_{NB}(Y = 1 \mid X) &= \frac{1}{1 + \exp\left( \ln \frac{1 - \pi}{\pi} + \sum_{i=1}^{n} \left[ X_i \ln \frac{\theta_{i0}}{\theta_{i1}} + (1 - X_i) \ln \frac{1 - \theta_{i0}}{1 - \theta_{i1}} \right] \right)} \\
&= \frac{1}{1 + \exp\left( \ln \frac{1 - \pi}{\pi} + \sum_{i=1}^{n} \ln \frac{1 - \theta_{i0}}{1 - \theta_{i1}} + \sum_{i=1}^{n} X_i \ln \frac{\theta_{i0} (1 - \theta_{i1})}{\theta_{i1} (1 - \theta_{i0})} \right)}
\end{aligned}
\tag{5}
\]

We now recognize that if we define w0 and wi as in (6) and (7), we can write (5) in the form of (2).

\[
w_0 = -\ln\left( \frac{1 - \pi}{\pi} \right) - \sum_{i=1}^{n} \ln\left( \frac{1 - \theta_{i0}}{1 - \theta_{i1}} \right) \tag{6}
\]

\[
w_i = -\ln\left( \frac{\theta_{i0} (1 - \theta_{i1})}{\theta_{i1} (1 - \theta_{i0})} \right) \tag{7}
\]

We may therefore conclude that the assumptions made by the Naive Bayes classifier over binary random
variables imply a decision rule that has the parametric form of logistic regression.

COMMON MISTAKE 1: Incorrect expressions for the conditional probabilities P(X | Y).

COMMON MISTAKE 2: Forgetting that logistic regression has the form \(\frac{1}{1 + \exp\{-w^T x\}}\), so that w0 and wi need to be negated.
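As a sanity check (a minimal sketch with hypothetical parameter values for π, θ, and x, not part of the assignment), one can verify numerically that the weights in (6) and (7) make the Naive Bayes and logistic-regression forms agree:

    % Numerical check of the Naive Bayes -> logistic regression mapping (hypothetical values).
    pi_    = 0.4;                     % Pr(Y = 1)
    theta0 = [0.2, 0.7, 0.5];         % Pr(X_i = 1 | Y = 0)
    theta1 = [0.8, 0.3, 0.9];         % Pr(X_i = 1 | Y = 1)
    x      = [1, 0, 1];               % an arbitrary binary input

    % Naive Bayes posterior, equation (1)
    lik1 = pi_       * prod(theta1.^x .* (1 - theta1).^(1 - x));
    lik0 = (1 - pi_) * prod(theta0.^x .* (1 - theta0).^(1 - x));
    p_nb = lik1 / (lik0 + lik1);

    % Logistic-regression form (2) with the weights from (6) and (7)
    w0 = -log((1 - pi_) / pi_) - sum(log((1 - theta0) ./ (1 - theta1)));
    w  = -log((theta0 .* (1 - theta1)) ./ (theta1 .* (1 - theta0)));
    p_lr = 1 / (1 + exp(-w0 - sum(w .* x)));

    fprintf('Pr_NB = %.6f, Pr_LR = %.6f\n', p_nb, p_lr);   % the two should match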

3 Double-counting the evidence [30 points]
1. [2 points] Consider the two class problem where class label y ∈ {T, F } and each training example X
has 2 binary attributes X1 , X2 ∈ {T, F }. How many parameters will you need to know/evaluate if you
are to classify an example using the Naive Bayes classifier? Keep in mind that since the probability of
all possible events has to sum to 1, knowing the probabilities of all except one event implies knowledge
of the final event’s probability already. (Don’t include such final events in your count.)

F SOLUTION: The Naive Bayes classifier learns the conditional probabilities Pr(X1 | Y ) and
Pr(X2 | Y ) as well as the class probability Pr(Y ). To represent Pr(Y ) we need only one parameter π
because Pr(Y = T ) + Pr(Y = F ) = 1. The Pr(Xi | Y ) can be represented using only Pr(Xi = T |
Y = T ) = θi1 and Pr(Xi = T | Y = F ) = θi0 . We do not need additional parameters to represent
Pr(Xi = F | Y = F ) = 1 − θi0 or Pr(Xi = F | Y = T ) = 1 − θi1 . Therefore we need 1 parameter for
Pr(Y ), 2 parameters for Pr(X1 |Y ), and 2 parameters for Pr(X2 |Y ) resulting in a total of 5 parameters.

2. [2 points] Let the class prior be Pr(Y = T ) = 0.5 and also let Pr(X1 = T | Y = T ) = 0.8,
Pr(X1 = F | Y = F ) = 0.7, Pr(X2 = T | Y = T ) = 0.5, and Pr(X2 = F | Y = F ) = 0.9. (Note:
Questions 3.2 - 3.4 all use these probabilities.) So, attribute X1 provides slightly stronger evidence
about the class label than X2 . Assume X1 and X2 are truly independent given Y . Write down the
Naive Bayes decision rule given X1 = x1 and X2 = x2 . Write your answer as a table listing the value
of the decision, call it f (X1 , X2 ), for each of the 4 settings for X1 , X2 .

F SOLUTION: The mathematical form of the decision rule for the Naive Bayes classifier is given in (8), where 1(·) is the indicator function. Notice that in (8) the sum of log ratios is used rather than a direct comparison of Pr(Y = T | X) > Pr(Y = F | X). While both are equivalent, the sum-of-logs form is often more numerically stable and easier to work with.

\[
f(X_1, X_2) = \mathbb{1}\left( \ln \frac{\Pr(Y = 1)}{\Pr(Y = 0)} + \sum_{i=1}^{n} \ln \frac{\Pr(X_i \mid Y = 1)}{\Pr(X_i \mid Y = 0)} > 0 \right) \tag{8}
\]

X1  X2  Y   Pr(X1, X2, Y)   PrNB(Y | X1, X2)   f(X1, X2)
0   0   0   0.315           0.863014           0
0   0   1   0.05            0.136986           0
0   1   0   0.035           0.411765           1
0   1   1   0.05            0.588235           1
1   0   0   0.135           0.402985           1
1   0   1   0.2             0.597015           1
1   1   0   0.015           0.0697674          1
1   1   1   0.2             0.930233           1

Table 1: Decision rule for the Naive Bayes classifier. The column containing f(X1, X2) contains the actual decision rule. For simplicity I have also included the joint and conditional probabilities as Pr(X1, X2, Y) and Pr(Y | X1, X2) respectively. The joint probabilities corresponding to incorrect predictions (rows where f(X1, X2) ≠ Y) are 0.05, 0.035, 0.135, and 0.015.

An alternative and for this problem more informative representation of the decision rule for binary variables
is given in ??.
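For reference, a short MATLAB sketch (our own, not part of the original solution) that reproduces the posteriors and decisions in Table 1 from the probabilities given in 3.2:

    % Reproduce Table 1: posteriors and Naive Bayes decisions for 3.2.
    pY1    = 0.5;                    % Pr(Y = T)
    pX1_Y1 = 0.8;  pX1_Y0 = 1 - 0.7; % Pr(X1 = T | Y = T), Pr(X1 = T | Y = F)
    pX2_Y1 = 0.5;  pX2_Y0 = 1 - 0.9; % Pr(X2 = T | Y = T), Pr(X2 = T | Y = F)

    for x1 = [0 1]
        for x2 = [0 1]
            % class-conditional likelihoods of (x1, x2)
            l1 = (pX1_Y1^x1 * (1-pX1_Y1)^(1-x1)) * (pX2_Y1^x2 * (1-pX2_Y1)^(1-x2));
            l0 = (pX1_Y0^x1 * (1-pX1_Y0)^(1-x1)) * (pX2_Y0^x2 * (1-pX2_Y0)^(1-x2));
            post1 = pY1*l1 / (pY1*l1 + (1-pY1)*l0);   % Pr(Y = T | x1, x2)
            f = post1 > 0.5;                          % Naive Bayes decision
            fprintf('x1=%d x2=%d  Pr(Y=T|x)=%.6f  f=%d\n', x1, x2, post1, f);
        end
    end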

3. [8 points] For the Naive Bayes decision function f(X1, X2), the error rate is:

\[
\sum_{X_1, X_2, Y} \mathbb{1}\left( Y \neq f(X_1, X_2) \right) P(X_1, X_2, Y).
\]

For this question, we will assume that the true data distribution is exactly the same as the Naive Bayes
distribution, so we can write P (X1 , X2 , Y ) as P (Y )P (X1 | Y )P (X2 | Y ).

(a) [2 points] Show that if Naive Bayes uses both attributes, X1 and X2 , the error rate is 0.235.

F SOLUTION: To compute the error rate using X1 and X2 we sum Pr(X1, X2, Y) over all the rows of Table 1 where f(X1, X2) ≠ Y, obtaining 0.05 + 0.035 + 0.135 + 0.015 = 0.235.

(b) [2 points] What is the error rate using only X1 ?

F SOLUTION: If we consider only X1 we get the following decision table:

X1  Y   Σ_{x2} Pr(X1, x2, Y)   PrNB(Y | X1)   f(X1)
0   0   0.35                   0.777778       0
0   1   0.1                    0.222222       0
1   0   0.15                   0.272727       1
1   1   0.4                    0.727273       1

and an error rate of 0.1 + 0.15 = 0.25, which is greater than that of the original classifier f(X1, X2).

(c) [2 points] What is the error rate using only X2 ?

F SOLUTION: If we use only X2 we obtain the decision table:

X2  Y   Σ_{x1} Pr(x1, X2, Y)   PrNB(Y | X2)   f(X2)
0   0   0.45                   0.642857       0
0   1   0.25                   0.357143       0
1   0   0.05                   0.166667       1
1   1   0.25                   0.833333       1

and an error rate of 0.25 + 0.05 = 0.30, which is greater than that of the original classifier f(X1, X2).

(d) [2 points] Give a conceptual explanation for why the error rate is (choose one) lower/higher
using X1 and X2 together as opposed to using only a single attribute.

F SOLUTION: We see that a Naive Bayes classifier that uses both variables performs better than
a Naive Bayes classifier that uses only one of the two. This implies that there is unique (uncorrelated)
information in each variable.

4. [5 points] Now, suppose that we create a new attribute X3 , which is an exact copy of X2 . So, for
every training example, attributes X2 and X3 have the same value, X2 = X3 .

(a) [2 points] Are X2 and X3 conditionally independent given Y ?

F SOLUTION: First let's consider whether X2 and X3 are conditionally independent given Y. This is quite simple: knowing the value of X2 completely determines the value of X3. Therefore X2 and X3 are clearly NOT conditionally independent given Y.

(b) [3 points] What is the error rate of Naive Bayes now, using X1 , X2 , and X3 ? The predicted Y
should be computed using the (possibly incorrect) assumption of conditional independence, and
the error rate should be computed using the true probabilities.

F SOLUTION: To work out the error rate of the Naive Bayes classifier we construct another
table:
X1  X2  X3  Y   Pr(X1, X2, X3, Y)   PrNB(Y | X1, X2, X3)   f(X1, X2, X3)
0   0   0   0   0.315               0.918963               0
0   0   0   1   0.05                0.0810373              0
0   0   1   0   0                   0.557522               0
0   0   1   1   0                   0.442478               0
0   1   0   0   0                   0.557522               0
0   1   0   1   0                   0.442478               0
0   1   1   0   0.035               0.122807               1
0   1   1   1   0.05                0.877193               1
1   0   0   0   0.135               0.548533               0
1   0   0   1   0.2                 0.451467               0
1   0   1   0   0                   0.118943               1
1   0   1   1   0                   0.881057               1
1   1   0   0   0                   0.118943               1
1   1   0   1   0                   0.881057               1
1   1   1   0   0.015               0.0147783              1
1   1   1   1   0.2                 0.985222               1
The error rate is obtained by summing the probability of all events in which the classifier predicts the
wrong value for Y, which is 0.05 + 0.035 + 0.20 + 0.015 = 0.3. The astute reader will notice that all events where X2 ≠ X3 occur with zero probability and could have been skipped in the table. The
only thing that has changed by introducing X3 is that for X1 = 1 and X2 = 0 the new Naive Bayes
classifier using three variables predicts 0 rather than 1. Consequently, the error sum contains 0.20
rather than 0.135 resulting in a higher error rate. We see that the Naive Bayes classifier performs
worse when we add highly correlated (uninformative) features. Naturally, we would not expect Naive
Bayes to perform better by simply replicating existing features.
Another, simpler way of looking at the solution is as follows. We can imagine sampling (X1 , X2 , X3 , Y )
in a 2 step process: first, (X1 , X2 , Y ) is sampled according to P . Then X3 is deterministically assigned
the value of X2 . Thus, P (X1 , X2 , X3 , Y | X2 = X3 ) = P (X1 , X2 , Y ). In other words, introducing
X3 does not change the true underlying distribution of the data when considering X1 and X2 . What
does change? The decision rule changes, so that effectively X2 is the only variable used. We already
computed the error rate of X2 , which is 0.3.

 COMMON MISTAKE 1: Many did not see that the “true” distribution no longer follows the
conditional independence assumption following the introduction of X3 , and found that the error rate
decreased.
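A brief MATLAB check of this result (our own sketch, using the probabilities from 3.2 with the duplicated feature X3 = X2): it loops over all outcomes and accumulates the true probability mass wherever the three-feature Naive Bayes prediction disagrees with Y.

    % Error rate of the three-feature Naive Bayes classifier when X3 = X2.
    pY1   = 0.5;
    pX_Y1 = [0.8, 0.5, 0.5];          % Pr(Xi = T | Y = T) for X1, X2, X3 (X3 copies X2)
    pX_Y0 = [0.3, 0.1, 0.1];          % Pr(Xi = T | Y = F)

    err = 0;
    for x1 = [0 1]
        for x2 = [0 1]
            for y = [0 1]
                x = [x1, x2, x2];                          % X3 is always equal to X2
                if y == 1, pc = pX_Y1; else, pc = pX_Y0; end
                % true joint probability only involves X1 and X2
                p_true = (pY1^y) * ((1 - pY1)^(1 - y)) ...
                       * prod(pc(1:2).^x(1:2) .* (1 - pc(1:2)).^(1 - x(1:2)));
                % Naive Bayes (incorrectly) treats all three features as independent
                s1 = pY1       * prod(pX_Y1.^x .* (1 - pX_Y1).^(1 - x));
                s0 = (1 - pY1) * prod(pX_Y0.^x .* (1 - pX_Y0).^(1 - x));
                err = err + ((s1 > s0) ~= y) * p_true;     % add mass where prediction != Y
            end
        end
    end
    fprintf('Error rate using X1, X2, X3: %.3f\n', err);   % should print 0.300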

5. [2 points] Why does Naive Bayes perform worse with the addition of X3 ? (Hint: What assumption
does Naive Bayes make about the inputs?)

F SOLUTION: Naive Bayes performs worse because the addition of X3 breaks the conditional independence assumption while not introducing any additional information. As a consequence, the classifier over-counts X2, partially ignoring X1.

6. [3 points] Does logistic regression suffer from the same problem? Explain why or why not.

F SOLUTION: Logistic regression does not use the same conditional independence assumptions and therefore does not suffer when features are highly correlated. An optimal logistic regression classifier would set w2′ = w3′ = w2/2, where w2 is the coefficient of X2 obtained by training only on X1 and X2, and w2′ and w3′ are the coefficients of X2 and X3 obtained by training on X1, X2, and X3.

7. [8 points]: In spite of the above fact, we will see that in some examples Naive Bayes doesn't do too badly. Consider the above example, i.e. your features are X1, X2, which are truly independent given Y, and a third feature X3 = X2. Suppose you are now given an example with X1 = T and X2 = F. You are also given the probabilities Pr(Y = T | X1 = T) = p and Pr(Y = T | X2 = F) = q, and P(Y = T) = 0.5. (Note: You should not use the probabilities from 3.2-3.4 in your solutions to the following.)

(a) Prove that the decision rule is p ≥ (1 − q)² / (q² + (1 − q)²) by applying Bayes rule again.

F SOLUTION: The decision rule for Naive Bayes is: predict Y = T if

\[
\Pr(Y = T) \prod_i \Pr(X_i = x_i \mid Y = T) \ge \Pr(Y = F) \prod_i \Pr(X_i = x_i \mid Y = F)
\]

But we know that Pr(Y = T) = Pr(Y = F), so

\[
\prod_i \Pr(X_i = x_i \mid Y = T) \ge \prod_i \Pr(X_i = x_i \mid Y = F)
\]

Using Bayes rule again (Pr(Xi = xi | Y) = Pr(Y | Xi = xi) Pr(Xi = xi) / Pr(Y)), we get

\[
\prod_i \frac{\Pr(Y = T \mid X_i = x_i)\,\Pr(X_i = x_i)}{\Pr(Y = T)} \ge \prod_i \frac{\Pr(Y = F \mid X_i = x_i)\,\Pr(X_i = x_i)}{\Pr(Y = F)}
\]

Now, the \(\prod_i \Pr(X_i = x_i)\) parts cancel from both sides. Also, we again note Pr(Y = T) = Pr(Y = F), and therefore the prior factors cancel from both sides as well. Hence we now have

\[
\prod_i \Pr(Y = T \mid X_i = x_i) \ge \prod_i \Pr(Y = F \mid X_i = x_i)
\]

Setting x1 = T and x3 = x2 = F, we find:

\[
\Pr(Y = T \mid X_1 = T)\,\Pr(Y = T \mid X_2 = F)\,\Pr(Y = T \mid X_3 = F) \ge \Pr(Y = F \mid X_1 = T)\,\Pr(Y = F \mid X_2 = F)\,\Pr(Y = F \mid X_3 = F)
\]

Plugging in the values given in the question, we now obtain:

\[
p q^2 \ge (1 - p)(1 - q)^2
\]

Thus the "Naive Bayes" decision rule is given by

\[
\text{Predict } Y = T \iff p \ge \frac{(1 - q)^2}{q^2 + (1 - q)^2}
\]

(b) What is the true decision rule?

F SOLUTION: The actual decision rule doesn't take X3 into consideration, as we know that X1 and X2 are truly independent given Y. Therefore, we predict true if and only if Pr(Y = T | X1 = T, X2 = F) ≥ Pr(Y = F | X1 = T, X2 = F).
Following similar calculations as before, we have

\[
\begin{aligned}
p q &\ge (1 - p)(1 - q) \\
p &\ge 1 - q
\end{aligned}
\]

Thus the "real" decision rule is given by

\[
\text{Predict } Y = T \iff p \ge 1 - q
\]

(c) Plot the two decision boundaries (vary q between 0 and 1) and highlight the region where Naive
Bayes makes mistakes.

Figure 1: True and Naive Bayes decision boundary

F SOLUTION: The curved line is for Naive Bayes, and the straight line shows the true decision boundary. The shaded part in Figure 1 shows the region where the Naive Bayes decision differs from the true decision.
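A short MATLAB sketch (ours, not the original figure-generating code) that reproduces Figure 1 by plotting both boundaries over q ∈ [0, 1] and shading the region between them, where the two rules disagree:

    % Plot the true boundary p = 1 - q and the Naive Bayes boundary
    % p = (1-q)^2 / (q^2 + (1-q)^2); shade where the two decisions disagree.
    q = linspace(0, 1, 500);
    p_true = 1 - q;
    p_nb   = (1 - q).^2 ./ (q.^2 + (1 - q).^2);

    figure; hold on;
    % the disagreement region lies between the two curves
    fill([q, fliplr(q)], [p_true, fliplr(p_nb)], [0.9 0.9 0.9], 'EdgeColor', 'none');
    plot(q, p_true, 'b-',  'LineWidth', 1.5);
    plot(q, p_nb,   'r--', 'LineWidth', 1.5);
    xlabel('q'); ylabel('p');
    legend('disagreement region', 'true boundary: p = 1 - q', 'Naive Bayes boundary');
    hold off;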

4 Feature Selection [20 points]


We saw in class that one can use a variety of regularization penalties in linear regression.

\[
\hat{w} = \arg\min_{w} \; \|Y - Xw\|_2^2 + \lambda \|w\|_p^p
\]

Consider the three cases, p = 0, 1, and 2. We want to know what effect these different penalties have on
estimates of w.
Let’s see this using a simple problem. Use the provided data (data.mat). Assume the constant term in
the regression is zero, and assume λ = 1, except, of course, for question (1). You don’t need to write code

that solves these problems in their full generality; instead, feel free to use matlab functions to do the main
calculations, and then just do a primitive search over parameter space by plugging in a few different values.
Matlab function fminsearch will be helpful. (Note: If you are not familiar with function handles, please
review the code from HW 2 or see Matlab documentation.)

1. [3 points] If we assume that the response variable y is distributed according to y ∼ N(w · x, σ²), then what is the MLE estimate ŵ_MLE of w?

F SOLUTION: Let r = (Y − Xw). Setting the gradient of rᵀr to zero:

\[
\begin{aligned}
\frac{\partial\, r^T r}{\partial w} &= -2 X^T (Y - Xw) = 0 \\
w &= (X^T X)^{-1} X^T Y \\
w &= [0.8891, -0.8260, 4.1902]
\end{aligned}
\]

2. [2 points] Given λ = 1, what is ŵ for p = 2?

F SOLUTION: The closed-form solution is w = (X^T X + λI)^{-1} X^T Y. We use fminsearch in MATLAB to solve for the solution:

w = [0.8646, −0.8210, 4.1219]

3. [2 points] Given λ = 1, what is ŵ for p = 1?

F SOLUTION: We can use either lasso or fminsearch in MATLAB. Note that lasso in MATLAB doesn't minimize the same objective function.

          OBJ1 (fminsearch)                 OBJ1 (lasso)
dataset   w = [0.8749, −0.8182, 4.1829]     w = [0, −0.2755, 4.1902]
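A minimal sketch of the fminsearch approach used in 4.2 and 4.3 (it assumes data.mat provides variables named X and Y, which may differ in the actual file; the starting point and objective handle are our own choices):

    % Fit w by directly minimizing the penalized objective with fminsearch (lambda = 1).
    load('data.mat');                     % assumed to provide X and Y
    lambda = 1;
    p = 1;                                % set p = 2 for the ridge case in 4.2

    obj   = @(w) sum((Y - X*w).^2) + lambda * sum(abs(w).^p);
    w0    = (X'*X) \ (X'*Y);              % start the search from the MLE estimate
    w_hat = fminsearch(obj, w0);
    disp(w_hat');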

4. [4 points] Given λ = 1, what is ŵ for p = 0? Note that since the L0 norm is not a "real" norm, the penalty expression is a little different:

\[
\hat{w} = \arg\min_{w} \; \|Y - Xw\|_2^2 + \lambda \|w\|_0
\]

Also, for the L0 norm you have to solve all combinatorial cases separately, where certain components of w are set to zero, and then add the L0 penalty accordingly. There are 8 cases for 3 unknown wi.

F SOLUTION: Solving each of the 8 combinatorial cases (each subset of the components of w set to zero) and adding the L0 penalty to each, the best objective is obtained with no components zeroed out:

w = [0.8891, −0.8260, 4.1902]
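A sketch of the combinatorial L0 search described above (same assumptions about X and Y being loaded; the subset enumeration via bitget is our own illustration):

    % L0-penalized regression by brute force: try every subset of nonzero coefficients.
    lambda   = 1;
    m        = size(X, 2);
    best_obj = inf;  best_w = zeros(m, 1);
    for s = 0:2^m - 1
        S = find(bitget(s, 1:m));                        % indices allowed to be nonzero
        w = zeros(m, 1);
        if ~isempty(S)
            w(S) = (X(:,S)' * X(:,S)) \ (X(:,S)' * Y);   % least squares on that subset
        end
        obj = sum((Y - X*w).^2) + lambda * numel(S);     % SSE + lambda * (number of nonzeros)
        if obj < best_obj
            best_obj = obj;  best_w = w;
        end
    end
    disp(best_w');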

5. [4 points] Write a paragraph describing the relation between the estimates of w in the four cases,
explaining why that makes sense given the different penalties.

F SOLUTION: The MLE estimate simply tries to minimize the residual error without enforcing any
prior beliefs about w. Regularizing w under different norms corresponds to different prior beliefs on how
we expect the w to be. The L2 norm corresponds to the belief that parameters w follow a Gaussian
distribution with mean 0; this will tend to shrink all the weights by a constant factor. None will be zeroed
out. Penalizing the L1 norm decreases all the parameters w1 , w2 , w3 toward zero by a constant additive
amount. If a parameter would be pushed past zero, it is zeroed out. In this problem, MATLAB's lasso zeroed out one coefficient. Finally, the L0 norm encourages a sparse solution, but does not shrink those parameters that are
not zeroed out. Thus the nonzero coefficients are larger than under L1 or L2.

6. [5 points] When λ > 0, we make a trade-off between minimizing the sum of squared errors and the
magnitude of ŵ. In the following questions, we will explore this trade-off further. For the following,
use the same data from data.mat.
(a) [1 point] For the MLE estimate of w (as in 4.1), write down the value of the ratio ||ŵ_MLE||_2^2 / ||Y − Xŵ_MLE||_2^2.

F SOLUTION: ||ŵ_MLE||_2^2 / ||Y − Xŵ_MLE||_2^2 ≈ 19.03/3104.8 ≈ 0.0061.

(b) i. [1 point] Suppose the assumptions of linear regression are satisfied. Let’s say that with
N training samples (assume N >> P , where P is the number of features), you compute
ŵM LE . Then let’s say you do the same, this time with 2N training samples. How do you
expect ||Y − X ŵM LE ||22 to change when going from N to 2N samples? When N >> P , does
this sum of squared errors for linear regression directly depend on the number of training
samples?

F SOLUTION: The SSE will approximately double when N is doubled. Since the assumptions
of linear regression are satisfied, doubling the amount of training data will not dramatically change
the model when N >> P , so we expect approximately twice the SSE with twice the number of
summands. Yes, SSE depends directly on N since SSE is a sum over N squared error terms (one
for each training sample).

ii. [1 point] Likewise, if you double the number of training samples, how do you expect ||ŵ_MLE||_2^2 to change? Does ||ŵ_MLE||_2^2 for linear regression directly depend on the number of training samples in the large-N limit?

F SOLUTION: In the large-N limit, we expect ||ŵ_MLE||_2^2 to change barely at all when N is doubled. No, ||ŵ_MLE||_2^2 does not depend directly on N, since it is a sum over the P elements of the vector ŵ_MLE.

(c) [1 point] Using any method (e.g. trial and error, random search, etc.), find a value of λ for
which the estimate ŵ satisfies
0.8 < ||ŵ||_2^2 / ||ŵ_MLE||_2^2 < 0.9.

F SOLUTION: Any λ between 3.25 and 7.10.

(d) [1 point] Using any method (e.g. trial and error, random search, etc.), find a value of λ for
which the estimate ŵ satisfies
0.4 < ||ŵ||_2^2 / ||ŵ_MLE||_2^2 < 0.5.

F SOLUTION: Any λ between 25.1 and 35.3.
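One way to locate such λ values is a simple grid search (a sketch only; it assumes the p = 2 penalty from 4.2 and that X and Y are loaded; the grid spacing is our own choice):

    % Grid search over lambda for the ratio ||w_hat||_2^2 / ||w_MLE||_2^2 (p = 2 assumed).
    w_mle   = (X'*X) \ (X'*Y);
    lambdas = 0.5:0.25:50;
    ratios  = zeros(size(lambdas));
    for k = 1:numel(lambdas)
        w = (X'*X + lambdas(k)*eye(size(X,2))) \ (X'*Y);   % ridge estimate at this lambda
        ratios(k) = sum(w.^2) / sum(w_mle.^2);
    end
    disp(lambdas(ratios > 0.8 & ratios < 0.9));   % candidate lambdas for part (c)
    disp(lambdas(ratios > 0.4 & ratios < 0.5));   % candidate lambdas for part (d)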

