Homework 6
(576, 3.93) (635, 3.30) (558, 2.81) (578, 3.03) (666, 3.44)
(580, 3.07) (555, 3.00) (661, 3.43) (651, 3.36) (605, 3.13)
(653, 3.12) (575, 2.74) (545, 2.76) (572, 2.88) (594, 2.96)
(a) (5 points) How many instances of spam versus regular emails are there in the data?
How many data points are there? How many features are there?
Note: some values may be missing; you may fill these in with zero.
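The counting in part (a) could be sketched as follows. This is a minimal illustration in Python rather than R, and it assumes that each row is a list of strings whose last column is the class label (1 = spam, 0 = regular email). The function name `summarize` and the toy rows are hypothetical and are not part of the assignment data.

```python
from collections import Counter

def summarize(rows):
    """Fill missing entries with 0, then count points, features, and labels.

    Assumption: each row is a list of strings, and the last column is the
    class label (1 = spam, 0 = regular email).
    """
    cleaned = []
    for row in rows:
        # Treat empty strings and "NA" as missing and replace them with zero.
        cleaned.append([0.0 if v.strip() in ("", "NA") else float(v) for v in row])
    n_points = len(cleaned)
    n_features = len(cleaned[0]) - 1 if cleaned else 0  # exclude the label column
    label_counts = Counter(int(r[-1]) for r in cleaned)
    return n_points, n_features, label_counts

# Hypothetical toy data: two features plus a label column.
toy_rows = [["0.1", "", "1"], ["0.0", "2.5", "0"], ["NA", "1.0", "0"]]
n_points, n_features, label_counts = summarize(toy_rows)
print(n_points, n_features, dict(label_counts))
```

On the real data set the same three quantities answer the three questions above; only the loading step and the position of the label column would need to match the actual file.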
(b) (10 points) Build a classification tree model (also known as the CART model). In
R, this can be done using library(rpart). In your answer, you should report the fitted
tree model as a tree plot, similar to the one shown on page 16 of the “Random forest”
lecture. In R, this plot can be produced with the prp function in library(rpart.plot).
(c) (15 points) Also build a random forest model. In R, this can be done using
library(randomForest).
Now partition the data, using the first 80% for training and the remaining 20% for
testing. Your task is to compare and report the test error of your classification tree
and random forest models on the testing data. To report your results, please
try different tree sizes. Plot the curve of test error versus the number of trees used
in the random forest, similar to the one in our lecture.
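The ordered 80/20 partition and the test error can be sketched independently of any modeling library. The example below is a minimal Python illustration, not the R workflow the assignment refers to. The helper names `split_first_fraction` and `misclassification_rate` are hypothetical, and test error is assumed to mean the fraction of misclassified test points.

```python
def split_first_fraction(data, train_fraction=0.8):
    """Use the first train_fraction of the rows for training, the rest for testing."""
    cut = int(len(data) * train_fraction)
    return data[:cut], data[cut:]

def misclassification_rate(y_true, y_pred):
    """Fraction of predictions that differ from the true labels."""
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)

# Hypothetical toy example: ten rows, and four made-up test predictions.
rows = list(range(10))
train, test = split_first_fraction(rows)
err = misclassification_rate([1, 0, 1, 1], [1, 1, 1, 0])
print(len(train), len(test), err)
```

Because the split takes the first 80% of rows in their stored order, no shuffling is applied here. Computing `misclassification_rate` once per forest size gives the points for the test-error curve.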
(a) Assume the prior distribution as in our lecture, $\pi(\theta) = 2\cos^2(4\pi\theta)$. Generate samples
from the posterior distribution $\pi(\theta|Y)$. Discretize $\theta$ on the uniform grid of points
$\{0, 1/10, \ldots, 1\}$. Run the chain for n = 100, 500, 1000, and 5000 time steps, respectively.
For each n, compare the empirical distribution with the desired posterior
distribution $\pi(\theta|Y)$. (Hint: you may use ergodicity: the distribution of states
can be estimated from one sample path when the number of time steps is large, e.g.
500.)
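One possible way to set up such a chain is a Metropolis sampler on the grid, sketched below in Python. The likelihood is not stated in this part of the assignment, so the code assumes a binomial likelihood with hypothetical data (7 successes in 10 trials); replace `likelihood` with the model from the lecture. The uniform proposal over grid points and the total variation comparison are also illustrative choices, not prescribed by the assignment.

```python
import math
import random
from collections import Counter

GRID = [k / 10 for k in range(11)]  # theta in {0, 1/10, ..., 1}

def prior(theta):
    """Prior from the problem statement: 2 cos^2(4 pi theta)."""
    return 2 * math.cos(4 * math.pi * theta) ** 2

def likelihood(theta, successes=7, trials=10):
    """ASSUMPTION: binomial likelihood (up to a constant) with made-up data."""
    return theta ** successes * (1 - theta) ** (trials - successes)

def unnormalized_posterior(theta):
    return prior(theta) * likelihood(theta)

def exact_posterior():
    """Normalize prior * likelihood over the grid to get the target distribution."""
    weights = [unnormalized_posterior(t) for t in GRID]
    total = sum(weights)
    return [w / total for w in weights]

def metropolis(n_steps, seed=0):
    """Metropolis chain on grid indices with a symmetric uniform proposal."""
    rng = random.Random(seed)
    state = 5  # start at theta = 0.5, which has positive posterior mass
    path = []
    for _ in range(n_steps):
        proposal = rng.randrange(len(GRID))
        ratio = (unnormalized_posterior(GRID[proposal])
                 / unnormalized_posterior(GRID[state]))
        if rng.random() < min(1.0, ratio):
            state = proposal  # accept; otherwise stay at the current state
        path.append(state)
    return path

def empirical_distribution(path):
    """Fraction of time steps the chain spent at each grid point."""
    counts = Counter(path)
    return [counts[i] / len(path) for i in range(len(GRID))]

target = exact_posterior()
for n in (100, 500, 1000, 5000):
    emp = empirical_distribution(metropolis(n))
    # Total variation distance between empirical and exact posterior.
    tv = 0.5 * sum(abs(e - t) for e, t in zip(emp, target))
    print(n, round(tv, 3))
```

The printed distance gives one numerical summary of how close each run is to the target; plotting `emp` against `target` for each n would serve the same purpose visually.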
(b) Following from the previous question, evaluate the mean of the posterior distribution
(this gives an estimator for the parameter value), and $E_{\pi(\theta|Y)}\{[\theta - 1/2]^2\} = \int (\theta - 1/2)^2 \, \pi(\theta|Y) \, d\theta$.
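Given a sample path of θ values from the chain, both quantities can be approximated by ergodic averages. The sketch below is a minimal Python illustration. The function name `posterior_summaries` and the sample values are hypothetical, and the averages are only reliable when the path is long enough for the chain to have mixed.

```python
def posterior_summaries(theta_samples):
    """Monte Carlo estimates of E[theta | Y] and E[(theta - 1/2)^2 | Y]."""
    n = len(theta_samples)
    mean = sum(theta_samples) / n
    mean_sq_dev_from_half = sum((t - 0.5) ** 2 for t in theta_samples) / n
    return mean, mean_sq_dev_from_half

# Hypothetical short sample path of theta values on the grid.
samples = [0.5, 0.7, 0.7, 0.8, 0.5]
post_mean, post_msd = posterior_summaries(samples)
print(post_mean, post_msd)
```

On the discretized grid the integral reduces to a sum over grid points, so the same quantities can also be computed exactly from the normalized posterior weights and compared with these sample averages.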