
Statistical Learning, Fall 2020

Homework exercise 3
Due date: 22 December in class

1. ESL 4.2: Similarity of LDA and linear regression for two classes
In this problem you will show that for two classes, linear regression leads to the same discriminating
direction as LDA, but not to the exact same classification rule in general.
The derivations for this problem are rather lengthy. Consider part (b) (finding the linear regression
direction) to be extra credit. If you cannot prove one of the steps, comment on its geometric interpretation
instead and move on to the next step.
2. Short intuition problems
Choose and explain briefly. If you need additional assumptions to reach your conclusion, specify them.

(a) What is not an advantage of using logistic loss over using squared error loss with 0-1 coding for
2-class classification?
i. That the expected prediction error is minimized by correctly predicting P(Y | X).
ii. That it has a natural probabilistic generalization to K > 2 classes.
iii. That its predictions are always legal probabilities in the range (0, 1).
(b) In the generative 2-class classification models LDA and QDA, what type of distribution does
P(Y | X = x) have?
i. Unknown
ii. Gaussian
iii. Bernoulli
(c) We mentioned in class that Naive Bayes assumes $P(x \mid Y = g) = \prod_{j=1}^{p} P_j(x_j \mid Y = g)$. In what
situation would you expect this simplifying assumption to be most useful?
i. Small number of predictors, not highly correlated.
ii. Small number of predictors, highly correlated between them.
iii. Large number of predictors, not highly correlated.
iv. Large number of predictors, many highly correlated between them.

3. Equivalence of selecting “reference class” in multinomial logistic regression


In class we defined the logistic model as:
\begin{align*}
\log \frac{P(G = 1 \mid X)}{P(G = K \mid X)} &= X^T \beta_1 \\
&\;\;\vdots \\
\log \frac{P(G = K - 1 \mid X)}{P(G = K \mid X)} &= X^T \beta_{K-1},
\end{align*}

with resulting probabilities:

\begin{align*}
P(G = k \mid X) &= \frac{\exp\{X^T \beta_k\}}{1 + \sum_{l < K} \exp\{X^T \beta_l\}}, \qquad k < K, \\
P(G = K \mid X) &= \frac{1}{1 + \sum_{l < K} \exp\{X^T \beta_l\}}.
\end{align*}

Show that if we choose a different class in the denominator, we can obtain the same set of probabilities
by a different set of linear models (i.e., values of β). Hence the two representations are equivalent in
the probabilities they yield.
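As a purely numerical sanity check (not a substitute for the proof), the short R sketch below evaluates the probabilities above for one arbitrary x and set of coefficients, once with class K as the reference and once after shifting every coefficient vector so that class 1 is the reference; the two sets of probabilities agree. All object names here are made up for the example.

# Numerical sanity check only: changing the reference class changes the
# coefficients but not the implied probabilities.
set.seed(1)
K <- 3; p <- 4
x <- rnorm(p)                                # one feature vector X
B <- matrix(rnorm(p * (K - 1)), nrow = p)    # columns beta_1, ..., beta_{K-1} (reference class K)

class_probs <- function(x, B_full) {
  # B_full: p x K matrix, one column per class; the reference class column is all zeros
  u <- exp(drop(t(B_full) %*% x))
  u / sum(u)
}

B_refK <- cbind(B, 0)              # reference class K: beta_K = 0, as in the model above
B_ref1 <- B_refK - B_refK[, 1]     # subtract beta_1 from every column: class 1 becomes the reference
all.equal(class_probs(x, B_refK), class_probs(x, B_ref1))   # TRUE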
4. Separability and optimal separators
ESL 4.5: Show that the solution of logistic regression is undefined if the data are separable.

5. (* A real challenge¹)
In the separable case, consider adding a small amount of ridge-type regularization to the likelihood:
\[
\hat{\beta}(\lambda) = \arg\min_{\beta} \left[ -l(\beta; X, y) + \lambda \sum_j \beta_j^2 \right],
\]
where $l(\beta; X, y)$ is the standard logistic log-likelihood.


Show that $\hat{\beta}(\lambda) / \|\hat{\beta}(\lambda)\|_2$ converges to the support vector machine solution (the margin-maximizing hyperplane) as $\lambda \to 0$.
Hint: You may find the equivalent formulation of the SVM in equation (4.44) of ESL useful (equation
(4.48) in the book's second edition).
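If it helps to build intuition, one can explore the claim empirically before proving it. The R sketch below is only an illustration, not a proof: it assumes the glmnet and e1071 packages, uses a toy separable data set, and all object names are made up. It fits ridge-penalized logistic regression for a decreasing sequence of lambdas and compares the normalized coefficient direction to a (large-cost, approximately hard-margin) linear SVM direction.

# Empirical exploration only. Assumes the glmnet and e1071 packages.
library(glmnet)   # ridge-penalized logistic regression
library(e1071)    # linear SVM

set.seed(1)
n <- 40; p <- 2
x <- rbind(matrix(rnorm(n * p, mean =  2), n, p),
           matrix(rnorm(n * p, mean = -2), n, p))   # two well-separated clouds
y <- rep(c(1, 0), each = n)

# Large cost approximates the hard-margin (separable) SVM
svm_fit <- svm(x, factor(y), kernel = "linear", cost = 1e5, scale = FALSE)
w_svm <- drop(t(svm_fit$coefs) %*% svm_fit$SV)
w_svm <- w_svm / sqrt(sum(w_svm^2))

# Ridge-penalized logistic regression for decreasing lambda
for (lam in c(1, 1e-2, 1e-4, 1e-6)) {
  fit <- glmnet(x, y, family = "binomial", alpha = 0, lambda = lam,
                standardize = FALSE)
  b <- as.numeric(coef(fit))[-1]          # drop the intercept
  b <- b / sqrt(sum(b^2))
  cat(sprintf("lambda = %g, angle to SVM direction = %.4f rad\n",
              lam, acos(abs(sum(b * w_svm)))))
}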
6. Playing around with trees
Run a variety of tree-based algorithms on our competition data and show their performance. Compare:
• Small tree without pruning
• Large tree without pruning
• Large tree after pruning with 1-SE rule
• Bagging/RF on small trees (100 iterations)
• Bagging/RF large trees (100 iterations)
Do this under five-fold cross-validation on our competition training set, and use the results of the five
different folds to calculate confidence intervals for performance. Plot all the results in a reasonable
way (e.g., using boxplot()) and comment on them. Explain your choices of "small" and "large".
Hints: (a) Start early, since bagging may take a while to run. (b) Use the code from class, which
implements much of this, as a starting point; a rough outline of the pieces is also sketched below.
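The outline below is only a minimal sketch of the building blocks, assuming the rpart and randomForest packages and a placeholder data frame train with response column y; the actual competition data and the class code should replace these.

# Rough outline only; the course code covers this more completely.
library(rpart)
library(randomForest)

# "Small" vs "large" trees, controlled through rpart's complexity/depth settings
small_tree <- rpart(y ~ ., data = train, control = rpart.control(maxdepth = 3))
large_tree <- rpart(y ~ ., data = train,
                    control = rpart.control(cp = 0, minsplit = 5))

# Prune the large tree with the 1-SE rule, using rpart's internal CV table
cptab  <- large_tree$cptable
best   <- which.min(cptab[, "xerror"])
thresh <- cptab[best, "xerror"] + cptab[best, "xstd"]
cp_1se <- cptab[min(which(cptab[, "xerror"] <= thresh)), "CP"]
pruned_tree <- prune(large_tree, cp = cp_1se)

# Bagging is random forest with mtry = number of predictors;
# maxnodes gives rough control of tree size
p <- ncol(train) - 1
bag_small <- randomForest(y ~ ., data = train, ntree = 100, mtry = p, maxnodes = 8)
rf_large  <- randomForest(y ~ ., data = train, ntree = 100)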

¹ +50 points extra credit for an original solution; +20 points for finding a solution in the literature and explaining it clearly; +5 points for finding and citing it only.
