Activity 7
Ivan Korolev∗
1. In Sections 5.3.2 and 5.3.3, we saw that the cv.glm() function can be used in order
to compute the LOOCV test error estimate. Alternatively, one could compute those
quantities using just the glm() and predict.glm() functions, and a for loop. You
will now take this approach in order to compute the LOOCV error for a simple logistic
regression model on the Weekly data set. Recall that in the context of classification
problems, the LOOCV error is given in (5.4).
(a) Fit a logistic regression model that predicts Direction using Lag1 and Lag2.
(b) Fit a logistic regression model that predicts Direction using Lag1 and Lag2 using
all but the first observation.
(c) Use the model from (b) to predict the direction of the first observation. You can do
this by predicting that the first observation will go up if P(Direction="Up"|Lag1,
Lag2) > 0.5. Was this observation correctly classified?
(d) Write a for loop from i = 1 to i = n, where n is the number of observations in
the data set, that performs each of the following steps:
i. Fit a logistic regression model using all but the ith observation to predict
Direction using Lag1 and Lag2.
ii. Compute the posterior probability of the market moving up for the ith observation.
∗ Department of Economics, Binghamton University. E-mail: [email protected]. The problems are borrowed from ISLR.
iii. Use the posterior probability for the ith observation in order to predict whether
or not the market moves up.
iv. Determine whether or not an error was made in predicting the direction for
the ith observation. If an error was made, then indicate this as a 1, and
otherwise indicate it as a 0.
(e) Take the average of the n numbers obtained in (d)iv in order to obtain the LOOCV
estimate for the test error. Comment on the results.
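The loop described in (d) and the average in (e) can be sketched as follows. This is a minimal sketch, not a reference solution: to keep it runnable without extra packages it uses a simulated data frame, here called toy (a hypothetical stand-in with the same column names as Weekly), and the coefficients used to simulate it are arbitrary.

```r
# Hypothetical stand-in for Weekly: simulated Lag1, Lag2 and a two-level Direction.
set.seed(1)
n <- 60
toy <- data.frame(Lag1 = rnorm(n), Lag2 = rnorm(n))
p_true <- plogis(0.2 + 0.5 * toy$Lag2)          # arbitrary simulation model
toy$Direction <- factor(ifelse(runif(n) < p_true, "Up", "Down"),
                        levels = c("Down", "Up"))

loocv_errors <- integer(n)
for (i in seq_len(n)) {
  # i. fit on all but the ith observation
  fit <- glm(Direction ~ Lag1 + Lag2, data = toy[-i, ], family = binomial)
  # ii. posterior probability of "Up" for the held-out observation
  p_up <- predict(fit, newdata = toy[i, ], type = "response")
  # iii. classify with the 0.5 threshold
  pred <- ifelse(p_up > 0.5, "Up", "Down")
  # iv. record 1 for an error, 0 otherwise
  loocv_errors[i] <- as.integer(pred != as.character(toy$Direction[i]))
}

# (e) LOOCV estimate of the test error
loocv_rate <- mean(loocv_errors)
print(loocv_rate)
```

With the ISLR package loaded, replacing toy with Weekly (and n with nrow(Weekly)) should give the quantity the exercise asks for, assuming Direction has levels "Down" and "Up" in that order.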
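2. We will now perform cross-validation on a simulated data set.
(a) Generate a simulated data set as follows:
> set.seed(1)
> x=rnorm(100)
> y=x-2*x^2+rnorm(100)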
In this data set, what is n and what is p? Write out the model used to generate
the data in equation form.
(b) Create a scatterplot of X against Y. Comment on what you find.
(c) Set a random seed, and then compute the LOOCV errors that result from fitting
the following four models using least squares:
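i. $Y = \beta_0 + \beta_1 X + \varepsilon$
ii. $Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \varepsilon$
iii. $Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \varepsilon$
iv. $Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \beta_4 X^4 + \varepsilon$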
Note that you may find it helpful to use the data.frame() function to create a single
data set containing both X and Y.
(d) Repeat (c) using another random seed, and report your results. Are your results
the same as what you got in (c)? Why?
(e) Which of the models in (c) had the smallest LOOCV error? Is this what you
expected? Explain your answer.
(f) Comment on the statistical significance of the coefficient estimates that result
from fitting each of the models in (c) using least squares. Do these results agree
with the conclusions drawn based on the cross-validation results?
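One way to compute the LOOCV errors for the four polynomial models is sketched below. Rather than calling cv.glm(), this sketch swaps in the leverage identity for least squares, under which the LOOCV mean squared error equals the average of (residual / (1 - leverage))^2 from a single fit. The helper name loocv_lm and the data frame name sim are assumptions made for illustration.

```r
# Simulated data as generated in the exercise
set.seed(1)
x <- rnorm(100)
y <- x - 2 * x^2 + rnorm(100)
sim <- data.frame(x = x, y = y)

# Hypothetical helper: LOOCV MSE of a degree-d polynomial least squares fit,
# computed from one fit via the leverage (hat value) identity.
loocv_lm <- function(degree, data) {
  fit <- lm(y ~ poly(x, degree), data = data)
  mean((residuals(fit) / (1 - hatvalues(fit)))^2)
}

cv_err <- sapply(1:4, loocv_lm, data = sim)
names(cv_err) <- paste0("degree_", 1:4)
print(cv_err)
```

Because nothing in this computation is random once the data are fixed, rerunning it under a different seed (without regenerating x and y) returns the same four numbers, which bears on the question asked in (d).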
3. We will now consider the Boston housing data set, from the MASS library.
(a) Based on this data set, provide an estimate for the population mean of medv. Call
this estimate µ̂.
(b) Provide an estimate of the standard error of µ̂. Interpret this result. Hint: We can
compute the standard error of the sample mean by dividing the sample standard
deviation by the square root of the number of observations.
(c) Now estimate the standard error of µ̂ using the bootstrap. How does this compare
to your answer from (b)?
(d) Based on your bootstrap estimate from (c), provide a 95% confidence interval for
the mean of medv. Compare it to the results obtained using t.test(Boston$medv).
Hint: You can approximate a 95% confidence interval using the formula [µ̂ −
2SE(µ̂), µ̂ + 2SE(µ̂)].
(e) Based on this data set, provide an estimate, $\hat{\mu}_{med}$, for the median value of medv in
the population.
(f) We now would like to estimate the standard error of $\hat{\mu}_{med}$. Unfortunately, there
is no simple formula for computing the standard error of the median. Instead,
estimate the standard error of the median using the bootstrap. Comment on your
findings.
(g) Based on this data set, provide an estimate for the tenth percentile of medv in
Boston suburbs. Call this quantity $\hat{\mu}_{0.1}$. (You can use the quantile() function.)
(h) Use the bootstrap to estimate the standard error of $\hat{\mu}_{0.1}$. Comment on your
findings.
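The bootstrap steps in (c), (d), (f), and (h) share one pattern: resample the data with replacement, recompute the statistic, and take the standard deviation of the recomputed values. A minimal sketch follows. The helper boot_se is a hypothetical name, and medv_like is a simulated stand-in vector (not the Boston data) so that the block runs with base R alone.

```r
# Hypothetical helper: bootstrap standard error of stat(x) from B resamples.
boot_se <- function(x, stat, B = 1000) {
  reps <- replicate(B, stat(sample(x, length(x), replace = TRUE)))
  sd(reps)
}

set.seed(1)
# Simulated stand-in for Boston$medv (arbitrary right-skewed values)
medv_like <- rlnorm(506, meanlog = 3, sdlog = 0.4)

# (b) formula-based standard error of the sample mean
se_formula <- sd(medv_like) / sqrt(length(medv_like))

# (c) bootstrap standard error of the mean
se_mean <- boot_se(medv_like, mean)

# (d) approximate 95% interval: estimate plus or minus 2 standard errors
ci_mean <- mean(medv_like) + c(-2, 2) * se_mean

# (f), (h) the same helper applied to the median and the tenth percentile
se_median <- boot_se(medv_like, median)
se_q10 <- boot_se(medv_like, function(v) quantile(v, 0.1))

print(c(formula = se_formula, boot_mean = se_mean,
        boot_median = se_median, boot_q10 = se_q10))
```

For the exercise itself, medv_like would be replaced by Boston$medv after library(MASS). The boot() function from the boot package is an alternative to the hand-rolled helper.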
4. In this exercise, we will generate simulated data, and will then use this data to perform
best subset selection.
(a) Use the rnorm() function to generate a predictor X of length n = 100, as well as
a noise vector ε of length n = 100.
(b) Generate a response vector Y of length n = 100 according to the model
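$$Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \varepsilon,$$
where $\beta_0$, $\beta_1$, $\beta_2$, and $\beta_3$ are constants of your choice.
(c) Use the regsubsets() function to perform best subset selection in order to choose
the best model containing the predictors $X, X^2, \ldots, X^{10}$. What is the best model
obtained according to $C_p$, BIC, and adjusted $R^2$? Show some plots to provide
evidence for your answer, and report the coefficients of the best model obtained.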
Note that you will need to use the data.frame() function to create a single data set
containing both X and Y.
(d) Repeat (c), using forward stepwise selection and also using backwards stepwise
selection. How does your answer compare to the results in (c)?
(e) Now fit a lasso model to the simulated data, again using $X, X^2, \ldots, X^{10}$ as
predictors. Use cross-validation to select the optimal value of λ. Create plots of the
cross-validation error as a function of λ. Report the resulting coefficient estimates,
and discuss the results obtained.
(f) Now generate a response vector Y according to the model
$$Y = \beta_0 + \beta_7 X^7 + \varepsilon$$
and perform best subset selection and the lasso. Discuss the results obtained.
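To make the idea of best subset selection over the ten powers of X concrete, the sketch below runs the exhaustive search by hand in base R and picks the subset with the lowest BIC. It is a swapped-in illustration, not the regsubsets() workflow from the leaps package, and it does not cover the lasso. The coefficient values in beta are hypothetical, since the exercise leaves them to your choice.

```r
set.seed(1)
n <- 100
x <- rnorm(n)
eps <- rnorm(n)
beta <- c(1, 2, -3, 0.5)   # hypothetical constants beta_0, ..., beta_3
y <- beta[1] + beta[2] * x + beta[3] * x^2 + beta[4] * x^3 + eps

# Predictors X^1, ..., X^10 in a single data frame with the response
powers <- sapply(1:10, function(k) x^k)
colnames(powers) <- paste0("x", 1:10)
dat <- data.frame(y = y, powers)

# Exhaustive search over all 2^10 - 1 = 1023 non-empty subsets, scored by BIC
best <- list(bic = Inf, vars = NULL)
for (k in 1:10) {
  for (vars in combn(colnames(powers), k, simplify = FALSE)) {
    fit <- lm(reformulate(vars, response = "y"), data = dat)
    b <- BIC(fit)
    if (b < best$bic) best <- list(bic = b, vars = vars)
  }
}

best_fit <- lm(reformulate(best$vars, response = "y"), data = dat)
print(best$vars)
print(coef(best_fit))
```

Scoring by Cp or adjusted R^2 instead only changes the line that computes b. With regsubsets(), the same search is available through the nvmax = 10 argument and summary() of the fitted object.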
5. In this exercise, we will predict the number of applications received using the other
variables in the College data set.
(a) Split the data set into a training set and a test set.
(b) Fit a linear model using least squares on the training set, and report the test error
obtained.
(c) Fit a ridge regression model on the training set, with λ chosen by cross-validation.
Report the test error obtained.
(d) Fit a lasso model on the training set, with λ chosen by cross-validation. Report
the test error obtained, along with the number of non-zero coefficient estimates.
(e) Fit a PCR model on the training set, with M chosen by cross-validation. Report
the test error obtained, along with the value of M selected by cross-validation.
(f) Fit a PLS model on the training set, with M chosen by cross-validation. Report
the test error obtained, along with the value of M selected by cross-validation.
(g) Comment on the results obtained. How accurately can we predict the number
of college applications received? Is there much difference among the test errors
resulting from these five approaches?
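Parts (a) and (b) set up the scaffold that every later part reuses: one train/test split, a model fit on the training rows only, and a test error computed on the held-out rows. A minimal sketch of that scaffold is given below. It uses a simulated data frame, college_like, as a hypothetical stand-in for College (only the column names Apps, Accept, and Enroll are borrowed from it), and an even split, which is an arbitrary choice.

```r
set.seed(1)
n <- 200
# Simulated stand-in for College; the generating coefficients are arbitrary.
college_like <- data.frame(Accept = rnorm(n, 2000, 500),
                           Enroll = rnorm(n, 800, 200))
college_like$Apps <- 100 + 1.4 * college_like$Accept +
  0.3 * college_like$Enroll + rnorm(n, sd = 300)

# (a) split the rows into a training half and a test half
train_idx <- sample(n, n / 2)
train <- college_like[train_idx, ]
test <- college_like[-train_idx, ]

# (b) least squares on the training set, error measured on the test set
fit_ols <- lm(Apps ~ ., data = train)
pred <- predict(fit_ols, newdata = test)
test_mse <- mean((test$Apps - pred)^2)

# Test R^2 relative to predicting the training mean, useful for part (g)
baseline_mse <- mean((test$Apps - mean(train$Apps))^2)
test_r2 <- 1 - test_mse / baseline_mse
print(c(test_mse = test_mse, test_r2 = test_r2))
```

Keeping train_idx fixed across the ridge, lasso, PCR, and PLS fits means the five test errors are measured on the same held-out rows and can be compared directly.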
6. We have seen that as the number of features used in a model increases, the training
error will necessarily decrease, but the test error may not. We will now explore this in
a simulated data set.
(a) Generate a data set with p = 20 features, n = 1,000 observations, and an associated
quantitative response vector generated according to the model
$$Y = X'\beta + \varepsilon = \sum_{j=1}^{p} X_j \beta_j + \varepsilon,$$
7. We will now try to predict per capita crime rate in the Boston data set.
(a) Try out some of the regression methods explored in this chapter, such as best
subset selection, the lasso, ridge regression, and PCR. Present and discuss results
for the approaches that you consider.
(b) Propose a model (or set of models) that seem to perform well on this data set,
and justify your answer. Make sure that you are evaluating model performance
using validation set error, cross-validation, or some other reasonable alternative,
as opposed to using training error.
(c) Does your chosen model involve all of the features in the data set? Why or why
not?