Homework 3 Solutions
Instructions. Please write up your responses to the following problems clearly and concisely. We encourage you to write up your responses using LaTeX; we have provided a LaTeX template, available on Canvas, to make this easier. Submit your answers in PDF form to Canvas. We will not accept paper copies of the homework.
Collaboration. You are allowed and encouraged to work together. You may discuss the homework in order to understand the problems and reach a solution in groups of up to two students. However, each student must write down the solution independently, and without referring to written notes from the joint session. In addition, each student must write on the problem set the names of the people with whom they collaborated. You must understand the solution well enough to reconstruct it by yourself. (This is for your own benefit: you have to take the exams alone.)
Consider the linear regression model
\[
Y = Xw + \epsilon,
\]
where
\[
Y = \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix}, \qquad
X = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1m} \\ x_{21} & x_{22} & \cdots & x_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nm} \end{pmatrix}, \qquad
w = \begin{pmatrix} w_1 \\ w_2 \\ \vdots \\ w_m \end{pmatrix}, \qquad \text{and} \qquad
\epsilon = \begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{pmatrix}.
\]
Assume that ε_i is normally distributed with variance σ². We saw in class that the maximum likelihood estimate of the model parameters w (which also happens to minimize the sum of squared prediction errors) is given by the Normal equation:
\[
\hat{w} = (X^T X)^{-1} X^T Y.
\]
(Footnote: Unfortunately, such an efficient algorithm may not be easily found for other learning methods.)
Define Ŷ to be the vector of predictions using ŵ if we were to plug in the original training set X:
\[
\hat{Y} = X\hat{w} = X(X^T X)^{-1} X^T Y = HY,
\]
where H = X(X^T X)^{-1} X^T is the hat matrix.
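As a small illustration of these quantities, here is a NumPy sketch on synthetic data (illustrative only; the data and variable names are not from the assignment):

```python
import numpy as np

# Illustrative synthetic data: n examples, m features.
rng = np.random.default_rng(0)
n, m = 20, 3
X = rng.normal(size=(n, m))
w_true = np.array([1.0, -2.0, 0.5])
Y = X @ w_true + 0.1 * rng.normal(size=n)

# Normal equation: w_hat = (X^T X)^{-1} X^T Y.
# (np.linalg.solve is preferred over forming the inverse explicitly.)
w_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Hat matrix H = X (X^T X)^{-1} X^T, so that Y_hat = H Y.
H = X @ np.linalg.solve(X.T @ X, X.T)
Y_hat = H @ Y
assert np.allclose(Y_hat, X @ w_hat)
```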
Now recall that the Leave-One-Out Cross Validation score is defined to be:
\[
\mathrm{LOOCV} = \sum_{i=1}^{n} \bigl(Y_i - \hat{Y}_i^{(-i)}\bigr)^2,
\]
where Ŷ^{(−i)} is the estimator of Y after removing the i-th observation (i.e., it minimizes \(\sum_{j \neq i} (Y_j - \hat{Y}_j^{(-i)})^2\)).
1. [2 points] To begin with, we should consider when it is possible to compute ŵ in this framework.
(a) [1 point] Suppose m > n. Is ŵ well-defined? Why or why not?
Hint: Recall that the rank of a matrix is equal to the number of linearly independent rows,
which is also equal to the number of linearly independent columns. Use the fact that for two
matrices A and B which can be multiplied to form the product AB, it must be the case that
rank(AB) ≤ min (rank(A), rank(B)). Furthermore, recall that a square matrix is invertible if and
only if it is full-rank.
(b) [1 point] Suppose m ≤ n. Give a condition on X which guarantees that ŵ will not be well-
defined and explain why not. (Don’t assume X is a square matrix.)
For the rest of question 1, assume ŵ is well-defined.
F SOLUTION: (a) The rank of a matrix is at most its smallest dimension, in this case n, and X^T has the same rank as X. Therefore,
\[
\mathrm{rank}(X^T X) \le \mathrm{rank}(X) \le n < m.
\]
Since X^T X is m × m, it is not full-rank, and so it is not invertible. It follows that ŵ is not well-defined.
(b) Two or more columns of X are linearly dependent, or equivalently the rank of X is less than m (or any similar condition). The key relation is again
\[
\mathrm{rank}(X^T X) \le \mathrm{rank}(X),
\]
plus the fact that X^T X can be inverted only if it is full-rank. It is not correct to simply say that X is singular, since this assumes X is square. It is not correct to give the condition that two or more rows of X are linearly dependent, since this is guaranteed whenever m < n.
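A quick numerical illustration of the rank argument (synthetic, illustrative data): when m > n, or when columns of X are linearly dependent, X^T X is singular and the Normal equation has no unique solution.

```python
import numpy as np

rng = np.random.default_rng(1)

# Case m > n: more features than examples.
X_wide = rng.normal(size=(4, 6))               # n = 4, m = 6
G = X_wide.T @ X_wide                          # 6 x 6 matrix X^T X
print(np.linalg.matrix_rank(G))                # at most 4 < 6, so G is singular

# Case m <= n with a duplicated column: columns are linearly dependent.
X_dup = rng.normal(size=(10, 3))
X_dup[:, 2] = X_dup[:, 1]                      # third column copies the second
print(np.linalg.matrix_rank(X_dup.T @ X_dup))  # 2 < 3, again singular
```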
2. [3 points] What is the complexity of computing the LOOCV score naively? (The naive algorithm is to loop through each point, performing a regression on the n − 1 remaining points at each iteration.)
Hint: The complexity of matrix inversion for a k × k matrix is O(k³). (There are faster algorithms out there, but for simplicity we'll assume that we are using the naive O(k³) algorithm.)
F SOLUTION: Let X^{(−i)} denote X without the i-th example (row), and let Y^{(−i)} be Y without the i-th element. For each point, we need to compute ŵ^{(−i)} (ŵ computed from X^{(−i)} and Y^{(−i)}) and dot it with a single point X_i. Note that it is not necessary to compute the hat matrix H each time, since H is designed to give predictions for all examples and we only need the prediction for X_i.
The expression for ŵ^{(−i)} is:
\[
\hat{w}^{(-i)} = \bigl(X^{(-i)T} X^{(-i)}\bigr)^{-1} X^{(-i)T} Y^{(-i)}.
\]
Recall that multiplying an n × m matrix by an m × p matrix costs O(nmp) flops (operations, in this case multiplications and additions). Forming X^T X takes m²n flops, inverting it takes m³ flops, multiplying by X^T takes m²n flops, and multiplying by Y takes nm flops.
We need to multiply this by n since LOOCV loops through each training point, giving O(m²n² + nm³) complexity.
(1 point for using ŵ, 2 points for the correct expression.)
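For concreteness, a minimal NumPy sketch of this naive procedure on synthetic data (illustrative only; not part of the assignment):

```python
import numpy as np

def loocv_naive(X, Y):
    """Naive LOOCV: refit the regression n times, each time leaving one point out.

    Each iteration costs O(m^2 n + m^3), so the whole loop is O(m^2 n^2 + n m^3).
    """
    n = X.shape[0]
    score = 0.0
    for i in range(n):
        mask = np.arange(n) != i
        X_i, Y_i = X[mask], Y[mask]                      # X^(-i), Y^(-i)
        w_i = np.linalg.solve(X_i.T @ X_i, X_i.T @ Y_i)  # w_hat^(-i)
        score += (Y[i] - X[i] @ w_i) ** 2                # (Y_i - Y_hat_i^(-i))^2
    return score

# Illustrative use on synthetic data:
rng = np.random.default_rng(2)
X = rng.normal(size=(30, 4))
Y = X @ np.array([1.0, 0.0, -1.0, 2.0]) + 0.1 * rng.normal(size=30)
print(loocv_naive(X, Y))
```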
3. [3 points] Write Ŷ_i in terms of the elements of H and Y. You may find it useful to use shorthand such as H_ab to denote the entry in row a, column b of H.
F SOLUTION:
\[
\hat{Y}_i = \sum_{j=1}^{n} H_{ij} Y_j.
\]
4. [4 points] Show that Ŷ^{(−i)} is also the estimator which minimizes SSE for Z, where
\[
Z_j = \begin{cases} Y_j, & j \neq i \\ \hat{Y}_i^{(-i)}, & j = i. \end{cases}
\]
Hint: Try to start by writing an expression for the SSE of Z; it should look very similar to the definition
of SSE for Y that was given in the introduction section of this question. Then, manipulate terms until
you can argue that substituting Ŷ (−i) for Ẑ would minimize this expression.
F SOLUTION: We want to show that the estimator Ẑj that minimizes SSE for Z is equal to Ŷ (−i) .
\begin{align*}
\mathrm{SSE}(Z) &= \sum_{j=1}^{n} (Z_j - \hat{Z}_j)^2 \\
&= (Z_i - \hat{Z}_i)^2 + \sum_{j \neq i} (Z_j - \hat{Z}_j)^2 \\
&= \bigl(\hat{Y}_i^{(-i)} - \hat{Z}_i\bigr)^2 + \sum_{j \neq i} (Y_j - \hat{Z}_j)^2.
\end{align*}
Consider minimizing each of the two terms separately. Using the fact that Ŷ^{(−i)} minimizes \(\sum_{j \neq i} (Y_j - \hat{Y}_j^{(-i)})^2\), we see that Ŷ^{(−i)} minimizes the right-hand term. Furthermore, the choice Ẑ_i = Ŷ_i^{(−i)} makes the left-hand term zero and therefore minimizes it.
So Ŷ^{(−i)} is also the estimator that minimizes SSE(Z). This is certainly a valid estimator, i.e., we can obtain it with linear regression, because the model w^{(−i)} which produces the n − 1 terms of Ŷ^{(−i)} in the sum is the same model which predicts Ŷ_i^{(−i)} from the remaining observation X_i.
5. [6 points] Write Ŷ_i^{(−i)} in terms of H and Z. By definition, Ŷ_i^{(−i)} = Z_i, but give an answer that includes both H and Z.
F SOLUTION: Since Ŷ^{(−i)} is the min-SSE estimator for Z, just as Ŷ is for Y, we can write it (see 1.3) as
\[
\hat{Y}_i^{(-i)} = \sum_{j=1}^{n} H_{ij} Z_j.
\]
Big picture: how would you think to come up with this? Possible answer: we have two expressions for
the LOOCV, one given in the intro to the question, and one which we derive finally in 1.7 (the last part
of this question). To get from the original formula to the 1.7 formula, we remove the estimates for each
observation predicted by each of the n regressions with n−1 observations, and we insert estimates predicted
from just one regression and values from one hat matrix computation. So, it turns out that the predictions
from each of the n regressions are completely encoded in the hat matrix and the single regression with n
observations, and it turns out that the trade-off in computational complexity is favorable.
1.5 is sort of the breakthrough moment where we can write the predictions from each of the n smaller
regressions completely in terms of things we know without doing those n regressions. So if you were trying
to figure out how to do this from the beginning, you might be trying to find exactly this.
6. [6 points] Show that Ŷ_i − Ŷ_i^{(−i)} = H_ii Y_i − H_ii Ŷ_i^{(−i)}, where H_ii denotes the i-th element along the diagonal of H.
Hint: Use the results from the previous parts: in the expression from part 5, substitute Z_j with Y_j for j ≠ i and Z_i with Ŷ_i^{(−i)} using the definition of Z from part 4, and then use the expression for Ŷ_i from part 3.
F SOLUTION: From part 5 and the definition of Z,
\[
\hat{Y}_i^{(-i)} = \sum_{j=1}^{n} H_{ij} Z_j
= \sum_{j \neq i} H_{ij} Y_j + H_{ii} \hat{Y}_i^{(-i)}
= \hat{Y}_i - H_{ii} Y_i + H_{ii} \hat{Y}_i^{(-i)},
\]
where the last step uses Ŷ_i = Σ_j H_ij Y_j from part 3. Rearranging gives Ŷ_i − Ŷ_i^{(−i)} = H_ii Y_i − H_ii Ŷ_i^{(−i)}, as required. We can rearrange this further to
\begin{align*}
\hat{Y}_i^{(-i)} - H_{ii} \hat{Y}_i^{(-i)} &= \hat{Y}_i - H_{ii} Y_i \\
\hat{Y}_i^{(-i)} &= \frac{\hat{Y}_i - H_{ii} Y_i}{1 - H_{ii}}.
\end{align*}
Now the LOOCV error becomes
\begin{align*}
\mathrm{LOOCV} &= \sum_{i=1}^{n} \bigl(Y_i - \hat{Y}_i^{(-i)}\bigr)^2 \\
&= \sum_i \left(Y_i - \frac{\hat{Y}_i - H_{ii} Y_i}{1 - H_{ii}}\right)^2 \\
&= \sum_i \left(\frac{(1 - H_{ii})Y_i - \hat{Y}_i + H_{ii} Y_i}{1 - H_{ii}}\right)^2 \\
&= \sum_i \left(\frac{Y_i - \hat{Y}_i}{1 - H_{ii}}\right)^2.
\end{align*}
Evaluating LOOCV using this formula can be done in O(m²n + m³) time, since it only requires one regression (and we don't need to compute all of H to obtain the H_ii).
3 points for showing the LOOCV equation, 3 points for the correct complexity.
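As a sanity check on this identity, here is a short NumPy sketch (synthetic, purely illustrative data) that computes LOOCV both with the shortcut formula and by direct leave-one-out refitting; the two agree.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 50, 5
X = rng.normal(size=(n, m))
Y = X @ rng.normal(size=m) + 0.2 * rng.normal(size=n)

# Shortcut: LOOCV = sum_i ((Y_i - Yhat_i) / (1 - H_ii))^2, one regression only.
XtX_inv_Xt = np.linalg.solve(X.T @ X, X.T)       # (X^T X)^{-1} X^T, shape (m, n)
Y_hat = X @ (XtX_inv_Xt @ Y)                     # predictions from the single full regression
h = np.einsum('ij,ji->i', X, XtX_inv_Xt)         # diagonal entries H_ii, without forming all of H
loocv_fast = np.sum(((Y - Y_hat) / (1.0 - h)) ** 2)

# Direct computation for comparison: refit with each point held out.
loocv_direct = 0.0
for i in range(n):
    mask = np.arange(n) != i
    w_i = np.linalg.solve(X[mask].T @ X[mask], X[mask].T @ Y[mask])
    loocv_direct += (Y[i] - X[i] @ w_i) ** 2

print(loocv_fast, loocv_direct)                  # equal up to floating-point error
```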
1. [4 points] For input X and output Y , which of the following is the objective function optimized
by (i) Naive Bayes, and (ii) logistic regression?
2. [16 points] Recall from the suggested reading that “the discriminative analog of Naive Bayes is
logistic regression.” This means that the parametric form of P (Y | X) used by logistic regression is
implied by the assumptions of a Naive Bayes classifier, for some specific class-conditional densities. In
class you will see how to prove this for a Gaussian Naive Bayes classifier for continuous input values.
Can you prove the same for binary inputs? Assume Xi and Y are both binary. Assume that Xi | Y = j
is Bernoulli(θij ), where j ∈ {0, 1}, and Y is Bernoulli(π).
Hint: Start by using Bayes Rule and the assumptions of Naive Bayes to express the objective function
for logistic regression in terms of the given quantities θij and π.
F SOLUTION: Our goal in this problem is to transform the Naive Bayes class conditional probability given in (1) into the form of the logistic regression class conditional probability given in (2).
\begin{align}
\Pr_{NB}(Y = 1 \mid X) &= \frac{\Pr(Y = 1) \prod_{i=1}^{n} \Pr(X_i \mid Y = 1)}{\Pr(Y = 0) \prod_{i=1}^{n} \Pr(X_i \mid Y = 0) + \Pr(Y = 1) \prod_{i=1}^{n} \Pr(X_i \mid Y = 1)} \tag{1} \\
\Pr_{LR}(Y = 1 \mid X) &= \frac{1}{1 + \exp\bigl(-w_0 - \sum_{i=1}^{n} w_i X_i\bigr)} \tag{2}
\end{align}
Dividing the numerator and denominator of (1) by the numerator gives
\begin{align}
\Pr_{NB}(Y = 1 \mid X) &= \frac{1}{1 + \exp\left(\ln \frac{\Pr(Y=0) \prod_{i=1}^{n} \Pr(X_i \mid Y=0)}{\Pr(Y=1) \prod_{i=1}^{n} \Pr(X_i \mid Y=1)}\right)} \notag \\
&= \frac{1}{1 + \exp\left(\ln \frac{\Pr(Y=0)}{\Pr(Y=1)} + \sum_{i=1}^{n} \ln \frac{\Pr(X_i \mid Y=0)}{\Pr(X_i \mid Y=1)}\right)}. \tag{3}
\end{align}
The probability mass functions of the Bernoulli distributions for Y and (X_i | Y = j) are given by:
\[
\Pr(Y = \alpha) = \pi^{\alpha} (1 - \pi)^{1 - \alpha}
\quad \text{and} \quad
\Pr(X_i = \beta \mid Y = j) = \theta_{ij}^{\beta} (1 - \theta_{ij})^{1 - \beta}.
\]
Substituting these into (3), each log ratio is linear in X_i:
\[
\ln \frac{\Pr(X_i \mid Y = 0)}{\Pr(X_i \mid Y = 1)}
= X_i \ln \frac{\theta_{i0}}{\theta_{i1}} + (1 - X_i) \ln \frac{1 - \theta_{i0}}{1 - \theta_{i1}},
\]
so (3) has exactly the form of (2) with
\[
w_0 = \ln \frac{\pi}{1 - \pi} + \sum_{i=1}^{n} \ln \frac{1 - \theta_{i1}}{1 - \theta_{i0}}
\quad \text{and} \quad
w_i = \ln \frac{\theta_{i1} (1 - \theta_{i0})}{\theta_{i0} (1 - \theta_{i1})}.
\]
We may therefore conclude that the assumptions made by the Naive Bayes classifier over binary random variables imply a decision rule that has the parametric form of logistic regression.
COMMON MISTAKE 2: Forgetting that logistic regression has the form 1/(1 + exp{−wᵀx}), so that w_0 and w_i need to be negated.
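As a numerical sanity check of this equivalence, the following sketch (random parameters, purely illustrative) compares the Bernoulli Naive Bayes posterior with the logistic form using the weights derived above:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5                                             # number of binary features
pi = 0.4                                          # Pr(Y = 1)
theta = rng.uniform(0.05, 0.95, size=(n, 2))      # theta[i, j] = Pr(X_i = 1 | Y = j)
x = rng.integers(0, 2, size=n)                    # one binary input vector

# Naive Bayes posterior Pr(Y = 1 | x), computed directly from the model.
lik = lambda j: np.prod(theta[:, j] ** x * (1 - theta[:, j]) ** (1 - x))
p_nb = pi * lik(1) / (pi * lik(1) + (1 - pi) * lik(0))

# Logistic form with the weights derived above.
w0 = np.log(pi / (1 - pi)) + np.sum(np.log((1 - theta[:, 1]) / (1 - theta[:, 0])))
w = np.log(theta[:, 1] * (1 - theta[:, 0]) / (theta[:, 0] * (1 - theta[:, 1])))
p_lr = 1.0 / (1.0 + np.exp(-w0 - w @ x))

print(p_nb, p_lr)                                 # identical up to floating-point error
```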
3 Double-counting the evidence [30 points]
1. [2 points] Consider the two class problem where class label y ∈ {T, F } and each training example X
has 2 binary attributes X1 , X2 ∈ {T, F }. How many parameters will you need to know/evaluate if you
are to classify an example using the Naive Bayes classifier? Keep in mind that since the probability of
all possible events has to sum to 1, knowing the probabilities of all except one event implies knowledge
of the final event’s probability already. (Don’t include such final events in your count.)
F SOLUTION: The Naive Bayes classifier learns the conditional probabilities Pr(X1 | Y) and Pr(X2 | Y) as well as the class probability Pr(Y). To represent Pr(Y) we need only one parameter π, because Pr(Y = T) + Pr(Y = F) = 1. Each Pr(Xi | Y) can be represented using only Pr(Xi = T | Y = T) = θ_{i1} and Pr(Xi = T | Y = F) = θ_{i0}; we do not need additional parameters to represent Pr(Xi = F | Y = F) = 1 − θ_{i0} or Pr(Xi = F | Y = T) = 1 − θ_{i1}. Therefore we need 1 parameter for Pr(Y), 2 parameters for Pr(X1 | Y), and 2 parameters for Pr(X2 | Y), resulting in a total of 5 parameters.
2. [2 points] Let the class prior be Pr(Y = T ) = 0.5 and also let Pr(X1 = T | Y = T ) = 0.8,
Pr(X1 = F | Y = F ) = 0.7, Pr(X2 = T | Y = T ) = 0.5, and Pr(X2 = F | Y = F ) = 0.9. (Note:
Questions 3.2 - 3.4 all use these probabilities.) So, attribute X1 provides slightly stronger evidence
about the class label than X2 . Assume X1 and X2 are truly independent given Y . Write down the
Naive Bayes decision rule given X1 = x1 and X2 = x2 . Write your answer as a table listing the value
of the decision, call it f (X1 , X2 ), for each of the 4 settings for X1 , X2 .
F SOLUTION: The mathematical form of the decision rule for the Naive Bayes classifier is given in (8), where 1(·) is the indicator function. Notice that in (8) the sum of log ratios is used rather than a direct comparison of Pr(Y = T | X) > Pr(Y = F | X). While both are equivalent, the sum-of-logs form is often more numerically stable and easier to work with.
\[
f(X_1, X_2) = \mathbb{1}\left( \ln \frac{\Pr(Y = 1)}{\Pr(Y = 0)} + \sum_{i=1}^{n} \ln \frac{\Pr(X_i \mid Y = 1)}{\Pr(X_i \mid Y = 0)} > 0 \right) \tag{8}
\]
Table 1: Decision rule for the Naive Bayes classifier. The column containing f(X1, X2) contains the actual decision rule. For simplicity I have also included the joint and conditional probabilities as Pr(X1, X2, Y) and Pr(Y | X1, X2) respectively. Joint probabilities corresponding to incorrect predictions have been colored yellow.
An alternative and for this problem more informative representation of the decision rule for binary variables
is given in ??.
3. [8 points] For the Naive Bayes decision function f (X1 , X2 ), the error rate is:
\[
\sum_{X_1, X_2, Y} \mathbb{1}\bigl(Y \neq f(X_1, X_2)\bigr)\, P(X_1, X_2, Y).
\]
For this question, we will assume that the true data distribution is exactly the same as the Naive Bayes
distribution, so we can write P (X1 , X2 , Y ) as P (Y )P (X1 | Y )P (X2 | Y ).
(a) [2 points] Show that if Naive Bayes uses both attributes, X1 and X2 , the error rate is 0.235.
F SOLUTION: To compute the error rate using X1 and X2, we sum Pr(X1, X2, Y) over all the rows of Table 1 where f(X1, X2) ≠ Y, obtaining 0.05 + 0.035 + 0.135 + 0.015 = 0.235.
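To reproduce Table 1 and the 0.235 figure, here is a short Python sketch that enumerates the four input settings using the probabilities from 3.2 (a verification aid, not part of the original solution; T is encoded as 1 and F as 0):

```python
from itertools import product

# Probabilities from question 3.2 (T encoded as 1, F as 0).
p_y = {1: 0.5, 0: 0.5}                                       # Pr(Y)
p_x1 = {(1, 1): 0.8, (0, 1): 0.2, (1, 0): 0.3, (0, 0): 0.7}  # Pr(X1 = x | Y = y)
p_x2 = {(1, 1): 0.5, (0, 1): 0.5, (1, 0): 0.1, (0, 0): 0.9}  # Pr(X2 = x | Y = y)

def joint(x1, x2, y):
    return p_y[y] * p_x1[(x1, y)] * p_x2[(x2, y)]

error = 0.0
for x1, x2 in product([0, 1], repeat=2):
    # Naive Bayes decision: predict the class with the larger joint probability.
    f = 1 if joint(x1, x2, 1) > joint(x1, x2, 0) else 0
    for y in (0, 1):
        print(x1, x2, y, joint(x1, x2, y), f)
        if f != y:
            error += joint(x1, x2, y)

print("error rate:", error)                                  # 0.235
```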
(d) [2 points] Give a conceptual explanation for why the error rate is (choose one) lower/higher
using X1 and X2 together as opposed to using only a single attribute.
F SOLUTION: We see that a Naive Bayes classifier that uses both variables performs better than
a Naive Bayes classifier that uses only one of the two. This implies that there is unique (uncorrelated)
information in each variable.
4. [5 points] Now, suppose that we create a new attribute X3 , which is an exact copy of X2 . So, for
every training example, attributes X2 and X3 have the same value, X2 = X3 .
F SOLUTION: First, let's consider whether X2 and X3 are conditionally independent given Y. This is quite simple: knowing the value of X2 completely determines the value of X3. Therefore X2 and X3 are clearly NOT conditionally independent given Y.
(b) [3 points] What is the error rate of Naive Bayes now, using X1 , X2 , and X3 ? The predicted Y
should be computed using the (possibly incorrect) assumption of conditional independence, and
the error rate should be computed using the true probabilities.
F SOLUTION: To work out the error rate of the Naive Bayes classifier we construct another
table:
X1 X2 X3 Y Pr(X1 , X2 , X3 , Y ) PrN B (Y | X1 , X2 , X3 ) f (X1 , X2 , X3 )
0 0 0 0 0.315 0.918963 0
0 0 0 1 0.05 0.0810373 0
0 0 1 0 0 0.557522 0
0 0 1 1 0 0.442478 0
0 1 0 0 0 0.557522 0
0 1 0 1 0 0.442478 0
0 1 1 0 0.035 0.122807 1
0 1 1 1 0.05 0.877193 1
1 0 0 0 0.135 0.548533 0
1 0 0 1 0.2 0.451467 0
1 0 1 0 0 0.118943 1
1 0 1 1 0 0.881057 1
1 1 0 0 0 0.118943 1
1 1 0 1 0 0.881057 1
1 1 1 0 0.015 0.0147783 1
1 1 1 1 0.2 0.985222 1
The error rate is obtained by summing the probability of all events in which the classifier predicts the wrong value for Y, which is 0.05 + 0.035 + 0.20 + 0.015 = 0.3. The astute reader will notice that all events where X2 ≠ X3 occur with zero probability and could have been skipped in the table. The only thing that has changed by introducing X3 is that for X1 = 1 and X2 = 0 the new Naive Bayes classifier using three variables predicts 0 rather than 1. Consequently, the error sum contains 0.20 rather than 0.135, resulting in a higher error rate. We see that the Naive Bayes classifier performs worse when we add highly correlated (uninformative) features. Naturally, we would not expect Naive Bayes to perform better by simply replicating existing features.
Another, simpler way of looking at the solution is as follows. We can imagine sampling (X1, X2, X3, Y) in a two-step process: first, (X1, X2, Y) is sampled according to P; then X3 is deterministically assigned the value of X2. Thus P(X1, X2, X3, Y | X2 = X3) = P(X1, X2, Y). In other words, introducing X3 does not change the true underlying distribution of the data when considering X1 and X2. What does change? The decision rule changes, so that effectively X2 is the only variable used. We already computed the error rate of using X2 alone, which is 0.3.
COMMON MISTAKE 1: Many did not see that the “true” distribution no longer follows the
conditional independence assumption following the introduction of X3 , and found that the error rate
decreased.
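The same enumeration, extended with the duplicated feature, verifies the 0.3 error rate (again a verification aid; T is encoded as 1 and F as 0):

```python
from itertools import product

p_y = {1: 0.5, 0: 0.5}
p_x1 = {(1, 1): 0.8, (0, 1): 0.2, (1, 0): 0.3, (0, 0): 0.7}
p_x2 = {(1, 1): 0.5, (0, 1): 0.5, (1, 0): 0.1, (0, 0): 0.9}

error = 0.0
for x1, x2 in product([0, 1], repeat=2):
    x3 = x2                                              # X3 is an exact copy of X2
    # Naive Bayes (incorrectly) treats X3 as a third independent feature,
    # so Pr(X3 | Y) contributes a second copy of Pr(X2 | Y) to the score.
    score = {y: p_y[y] * p_x1[(x1, y)] * p_x2[(x2, y)] * p_x2[(x3, y)] for y in (0, 1)}
    f = 1 if score[1] > score[0] else 0
    # The error mass is measured under the TRUE distribution, which ignores X3.
    for y in (0, 1):
        if f != y:
            error += p_y[y] * p_x1[(x1, y)] * p_x2[(x2, y)]

print("error rate with X1, X2, X3:", error)              # 0.3
```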
5. [2 points] Why does Naive Bayes perform worse with the addition of X3 ? (Hint: What assumption
does Naive Bayes make about the inputs?)
F SOLUTION: Naive Bayes performs worse because the addition of X3 breaks the conditional independence assumption while not introducing any additional information. As a consequence, the classifier over-counts X2, partially ignoring X1.
6. [3 points] Does logistic regression suffer from the same problem? Explain why or why not.
F SOLUTION: Logistic regression does not make the same conditional independence assumptions and therefore does not suffer in the presence of highly correlated features. An optimal logistic regression classifier would set w2′ = w3′ = w2/2, where w2 is the coefficient of X2 obtained by training only on X1 and X2, and w2′ and w3′ are the coefficients obtained by training on X1, X2, and X3.
7. [8 points] In spite of the above fact, we will see that in some examples Naive Bayes doesn't do too badly. Consider the above example, i.e., your features are X1 and X2, which are truly independent given Y, and a third feature X3 = X2. Suppose you are now given an example with X1 = T and X2 = F. You are also given the probabilities Pr(Y = T | X1 = T) = p, Pr(Y = T | X2 = F) = q, and Pr(Y = T) = 0.5. (Note: You should not use the probabilities from 3.2-3.4 in your solutions to the following.)
(a) Prove that the decision rule is p ≥ (1 − q)² / (q² + (1 − q)²) by applying Bayes rule again.
F SOLUTION: Since Pr(Y = T) = 0.5, Bayes rule gives Pr(X1 = T | Y = T) / Pr(X1 = T | Y = F) = p/(1 − p) and Pr(X2 = F | Y = T) / Pr(X2 = F | Y = F) = q/(1 − q). Naive Bayes counts X2 twice (once more through X3), so it predicts Y = T if and only if (p/(1 − p)) (q/(1 − q))² ≥ 1, i.e.,
\begin{align*}
p q^2 &\ge (1-p)(1-q)^2 \\
\text{Predict } Y = T \iff p &\ge \frac{(1-q)^2}{q^2 + (1-q)^2}.
\end{align*}
F SOLUTION: The actual decision rule doesn't take X3 into consideration, as we know that X1 and X2 are truly independent given Y. Therefore, we predict true if and only if Pr(Y = T | X1 = T, X2 = F) ≥ Pr(Y = F | X1 = T, X2 = F).
Following similar calculations as before, we have
\begin{align*}
p q &\ge (1-p)(1-q) \\
\text{Predict } Y = T \iff p &\ge 1 - q.
\end{align*}
(c) Plot the two decision boundaries (vary q between 0 and 1) and highlight the region where Naive
Bayes makes mistakes.
F SOLUTION: The curved line is the Naive Bayes decision boundary, and the straight line is the true decision boundary. The shaded region of the figure shows where the Naive Bayes decision differs from the true decision.
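A short matplotlib sketch of the two boundaries (the Naive Bayes curve and the true line p = 1 − q), shading the region where the two rules disagree:

```python
import numpy as np
import matplotlib.pyplot as plt

q = np.linspace(0.0, 1.0, 500)
true_boundary = 1.0 - q                               # true rule: predict T iff p >= 1 - q
nb_boundary = (1 - q) ** 2 / (q ** 2 + (1 - q) ** 2)  # NB rule: p >= (1-q)^2 / (q^2 + (1-q)^2)

plt.plot(q, true_boundary, label="true boundary: p = 1 - q")
plt.plot(q, nb_boundary, label="Naive Bayes boundary")
# Naive Bayes disagrees with the true rule when p falls between the two curves.
plt.fill_between(q, np.minimum(true_boundary, nb_boundary),
                 np.maximum(true_boundary, nb_boundary), alpha=0.3,
                 label="Naive Bayes makes mistakes")
plt.xlabel("q"); plt.ylabel("p"); plt.legend(); plt.show()
```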
Consider the three cases, p = 0, 1, and 2. We want to know what effect these different penalties have on
estimates of w.
Let's see this using a simple problem. Use the provided data (data.mat). Assume the constant term in the regression is zero, and assume λ = 1, except, of course, for question (1). You don't need to write code that solves these problems in their full generality; instead, feel free to use Matlab functions to do the main calculations, and then just do a primitive search over parameter space by plugging in a few different values. The Matlab function fminsearch will be helpful. (Note: If you are not familiar with function handles, please review the code from HW 2 or see the Matlab documentation.)
4. [4 points] Given λ = 1, what is ŵ for p = 0? Note that since the L0 norm is not a "real" norm, the penalty expression is a little different: it counts the number of nonzero components of w rather than summing powers of their magnitudes.
F SOLUTION: For the L0 norm, we have to solve each combinatorial case separately: fix some subset of the components of w to zero, minimize the squared error over the remaining components, and then add the L0 penalty (λ times the number of nonzero components) accordingly. There are 2³ = 8 cases for the 3 unknown w_i, and ŵ is the best of these.
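A sketch of this combinatorial search in Python (the original assignment used Matlab; here scipy.io reads data.mat, and the variable names "X" and "Y" inside the file are an assumption):

```python
from itertools import product
import numpy as np
from scipy.io import loadmat

# Assumption: data.mat contains a design matrix X (n x 3) and targets Y (n,).
data = loadmat("data.mat")
X, Y = data["X"], data["Y"].ravel()

lam = 1.0
best = (np.inf, None)
# Enumerate all 2^3 = 8 on/off patterns for the three coefficients.
for pattern in product([False, True], repeat=X.shape[1]):
    idx = [i for i, on in enumerate(pattern) if on]
    w = np.zeros(X.shape[1])
    if idx:
        # Least-squares fit using only the "on" columns; the rest stay at zero.
        coef, _, _, _ = np.linalg.lstsq(X[:, idx], Y, rcond=None)
        w[idx] = coef
    objective = np.sum((Y - X @ w) ** 2) + lam * len(idx)   # SSE + lambda * ||w||_0
    if objective < best[0]:
        best = (objective, w)

print("best objective:", best[0], "w_hat:", best[1])
```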
5. [4 points] Write a paragraph describing the relation between the estimates of w in the four cases,
explaining why that makes sense given the different penalties.
F SOLUTION: The MLE estimate simply tries to minimize the residual error without enforcing any prior beliefs about w. Regularizing w under different norms corresponds to different prior beliefs about what we expect w to be. The L2 norm corresponds to the belief that the parameters w follow a Gaussian distribution with mean 0; this will tend to shrink all the weights by a constant factor, and none will be zeroed out. Penalizing the L1 norm decreases all the parameters w1, w2, w3 toward zero by a constant additive amount; parameters that would be pushed past zero are zeroed out (in this problem, two were set to zero). Finally, the L0 norm encourages a sparse solution, but does not shrink those parameters that are not zeroed out, so the nonzero coefficients are larger than under L1 or L2.
6. [5 points] When λ > 0, we make a trade-off between minimizing the sum of squared errors and the
magnitude of ŵ. In the following questions, we will explore this trade-off further. For the following,
use the same data from data.mat.
(a) [1 point] For the MLE estimate of w (as in 4.1), write down the value of the ratio ‖ŵ_MLE‖₂² / ‖Y − Xŵ_MLE‖₂².
(b) i. [1 point] Suppose the assumptions of linear regression are satisfied. Let's say that with N training samples (assume N ≫ P, where P is the number of features), you compute ŵ_MLE. Then let's say you do the same, this time with 2N training samples. How do you expect ‖Y − Xŵ_MLE‖₂² to change when going from N to 2N samples? When N ≫ P, does this sum of squared errors for linear regression directly depend on the number of training samples?
F SOLUTION: The SSE will approximately double when N is doubled. Since the assumptions
of linear regression are satisfied, doubling the amount of training data will not dramatically change
the model when N >> P , so we expect approximately twice the SSE with twice the number of
summands. Yes, SSE depends directly on N since SSE is a sum over N squared error terms (one
for each training sample).
ii. [1 point] Likewise, if you double the number of training samples, how do you expect ‖ŵ_MLE‖₂² to change? Does ‖ŵ_MLE‖₂² for linear regression directly depend on the number of training samples in the large-N limit?
F SOLUTION: In the large-N limit, we expect ‖ŵ_MLE‖₂² to change barely at all when N is doubled. No, ‖ŵ_MLE‖₂² does not depend directly on N, since it is a sum over the P elements of the vector ŵ_MLE.
(c) [1 point] Using any method (e.g. trial and error, random search, etc.), find a value of λ for which the estimate ŵ satisfies
\[
0.8 < \|\hat{w}\|_2^2 \,/\, \|\hat{w}_{MLE}\|_2^2 < 0.9.
\]
(d) [1 point] Using any method (e.g. trial and error, random search, etc.), find a value of λ for which the estimate ŵ satisfies
\[
0.4 < \|\hat{w}\|_2^2 \,/\, \|\hat{w}_{MLE}\|_2^2 < 0.5.
\]
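One way to search for such a λ, sketched in Python under the assumptions that data.mat contains X and Y and that the penalized objective is ‖Y − Xw‖₂² + λ‖w‖₂², so that ŵ(λ) = (XᵀX + λI)⁻¹XᵀY:

```python
import numpy as np
from scipy.io import loadmat

# Assumption: data.mat contains X and Y; the ridge objective is ||Y - Xw||^2 + lambda * ||w||^2.
data = loadmat("data.mat")
X, Y = data["X"], data["Y"].ravel()
m = X.shape[1]

w_mle = np.linalg.solve(X.T @ X, X.T @ Y)

def ratio(lam):
    w = np.linalg.solve(X.T @ X + lam * np.eye(m), X.T @ Y)   # ridge estimate w_hat(lambda)
    return (w @ w) / (w_mle @ w_mle)                          # ||w_hat||^2 / ||w_hat_MLE||^2

# Primitive grid search, in the spirit of "plug in a few different values".
targets = {"(c)": (0.8, 0.9), "(d)": (0.4, 0.5)}
for name, (lo, hi) in targets.items():
    for lam in np.linspace(0.0, 100.0, 2001):
        r = ratio(lam)
        if lo < r < hi:
            print(f"part {name}: lambda = {lam:.2f}, ratio = {r:.3f}")
            break
```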
F SOLUTION: Any λ between 25.1 and 35.3.