Tarea Nro 9 Diego Vasquez

Download as docx, pdf, or txt
Download as docx, pdf, or txt
You are on page 1of 7

Tarea Nro 9

Diego Vásquez Coronado

> ## Pregunta 1a: Missing values


> dim(Hitters)
[1] 322 20
> Hitters <- na.omit(Hitters)
> dim(Hitters)
[1] 263 20
> library(leaps)
e. Aplicar los métodos de selección de variables Best Subset Selection, Forward Stepwiese Selection y
Backward Stepwiese Selection considerando que debe incluirse un máximo de 15 variables predictoras, que
la variable ChmRun debe estar presente en todas las propuestas y que las variables CRBI y Hits deben
excluirse de todas las propuestas.
i. Identificar los modelos elegidos usando los indicadores Cp de Mallows para Best Subset Selection, BIC para
Forward Stepwiese Selection y R2 ajustado para Backward Subset Selection.
ii. Identificar las variables elegidas con cada uno de los indicadores.
iii. Comparar los modelos anteriores usando validación cruzada 10 y los indicadores RMSE, RSquare y MAE.
Usar set.seed(987), set.seed(654) y
set.seed(321).
Aplicar los métodos de selección de variables Best Subset Selection,
> Hitters.bss <- regsubsets(Salary ~ ., data = Hitters, nvmax = 15,
+ force.in= "CHmRun", force.out = c("CRBI", "Hits"),
+ method = "exhaustive")
> bss <- summary(Hitters.bss)
> bss
Subset selection object
Call: regsubsets.formula(Salary ~ ., data = Hitters, nvmax = 15, force.in = "CHmRun",
force.out = c("CRBI", "Hits"), method = "exhaustive")
19 Variables (and intercept)
Forced in Forced out
CHmRun FALSE FALSE
AtBat FALSE TRUE
HmRun FALSE FALSE
Runs FALSE FALSE
RBI FALSE FALSE
Walks FALSE FALSE
Years FALSE FALSE
CAtBat FALSE FALSE
CHits FALSE FALSE
CRuns TRUE FALSE
CWalks FALSE FALSE
LeagueN FALSE TRUE
DivisionW FALSE FALSE
PutOuts FALSE FALSE
Assists FALSE FALSE
Errors FALSE FALSE
NewLeagueN FALSE FALSE
Hits FALSE FALSE
CRBI FALSE FALSE

1 subsets of each size up to 15


Selection Algorithm: exhaustive
CHmRun AtBat HmRun Runs RBI Walks Years CAtBat CHits CRuns CWalks LeagueN
2 ( 1 ) "*" " " " " "*" " " " " " " " " " " " " " " " "
3 ( 1 ) "*" " " " " "*" " " " " " " " " "*" " " " " " "
4 ( 1 ) "*" " " " " " " " " "*" " " "*" "*" " " " " " "
5 ( 1 ) "*" " " " " " " " " "*" " " "*" "*" " " " " " "
6 ( 1 ) "*" " " " " " " " " "*" " " "*" "*" " " " " " "
7 ( 1 ) "*" " " " " " " " " "*" " " "*" "*" " " "*" " "
8 ( 1 ) "*" " " " " " " " " "*" " " "*" "*" " " "*" " "
9 ( 1 ) "*" " " " " " " " " "*" " " "*" "*" " " "*" " "
10 ( 1 ) "*" "*" " " "*" " " "*" " " "*" "*" " " "*" " "
11 ( 1 ) "*" "*" " " "*" " " "*" " " "*" "*" " " "*" " "
12 ( 1 ) "*" "*" " " "*" " " "*" " " "*" "*" " " "*" "*"
13 ( 1 ) "*" "*" " " "*" " " "*" " " "*" "*" " " "*" "*"
14 ( 1 ) "*" "*" " " "*" " " "*" " " "*" "*" "*" "*" "*"
15 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*" "*" "*" "*"
DivisionW PutOuts Assists Errors NewLeagueN Hits CRBI
2 (1) "" "" "" "" "" "" ""
3 (1) "" "" "" "" "" "" ""
4 (1) "" "" "" "" "" "" ""
5 (1) "" "*" "" "" "" "" ""
6 ( 1 ) "*" "*" "" "" "" "" ""
7 ( 1 ) "*" "*" "" "" "" "" ""
8 ( 1 ) "*" "*" "*" " " " " "" ""
9 ( 1 ) "*" "*" "*" "*" " " "" ""
10 ( 1 ) "*" "*" "*" " " " " "" ""
11 ( 1 ) "*" "*" "*" "*" " " "" ""
12 ( 1 ) "*" "*" "*" "*" " " "" ""
13 ( 1 ) "*" "*" "*" "*" "*" "" ""
14 ( 1 ) "*" "*" "*" "*" "*" "" ""
15 ( 1 ) "*" "*" "*" "*" " " "" ""

Identificar los modelos elegidos usando los indicadores Cp de Mallows


para Best Subset Selection
> plot(bss$cp, xlab = "Numero de variables predictoras", ylab = "Cp",
+ type = 'l')
> which.min(bss$cp)
[1] 7
> points(7, bss$cp[7], col = "red", cex = 2, pch = 20)

> #identificando las variables predictores


> coef(Hitters.bss, 7)
(Intercept) CHmRun Walks CAtBat CHits CWalks DivisionW PutOuts Assists
111.4685913 1.7719874 5.1614320 -0.3905963 1.6447356 -0.3432042 -136.6295273 0.2541381
0.2091329

> ## Pregunta 1b: Forward


> Hitters.fwss <- regsubsets(Salary ~ ., data = Hitters, nvmax = 15,
+ force.in= "CHmRun", force.out = c("CRBI", "Hits"),
+ method = "forward")
> fwss <- summary(Hitters.fwss)
> fwss
Subset selection object
Call: regsubsets.formula(Salary ~ ., data = Hitters, nvmax = 15, force.in = "CHmRun",
force.out = c("CRBI", "Hits"), method = "forward")
19 Variables (and intercept)
Forced in Forced out
CHmRun FALSE FALSE
AtBat FALSE TRUE
HmRun FALSE FALSE
Runs FALSE FALSE
RBI FALSE FALSE
Walks FALSE FALSE
Years FALSE FALSE
CAtBat FALSE FALSE
CHits FALSE FALSE
CRuns TRUE FALSE
CWalks FALSE FALSE
LeagueN FALSE TRUE
DivisionW FALSE FALSE
PutOuts FALSE FALSE
Assists FALSE FALSE
Errors FALSE FALSE
NewLeagueN FALSE FALSE
Hits FALSE FALSE
CRBI FALSE FALSE

1 subsets of each size up to 15


Selection Algorithm: forward
CHmRun AtBat HmRun Runs RBI Walks Years CAtBat CHits CRuns CWalks LeagueN DivisionW PutOuts Assists Errors
2 ( 1 ) "*" " " " " "*" " " " " " " " " " " " " " " " " " " "" "" ""
3 ( 1 ) "*" " " " " "*" " " " " " " " " "*" " " " " " " " " "" "" ""
4 ( 1 ) "*" " " " " "*" " " " " " " "*" "*" " " " " " " " " "" "" ""
5 ( 1 ) "*" " " " " "*" " " " " " " "*" "*" " " " " " " " " "*" " " " "
6 ( 1 ) "*" " " " " "*" " " " " " " "*" "*" " " " " " " "*" "*" " " " "
7 ( 1 ) "*" " " " " "*" " " "*" " " "*" "*" " " " " " " "*" "*" " " " "
8 ( 1 ) "*" " " " " "*" " " "*" " " "*" "*" " " " " " " "*" "*" "*" " "
9 ( 1 ) "*" " " " " "*" " " "*" " " "*" "*" " " " " " " "*" "*" "*" "*"
10 ( 1 ) "*" " " " " "*" " " "*" " " "*" "*" " " "*" " " "*" "*" "*" "*"
11 ( 1 ) "*" "*" " " "*" " " "*" " " "*" "*" " " "*" " " "*" "*" "*" "*"
12 ( 1 ) "*" "*" " " "*" " " "*" " " "*" "*" " " "*" "*" "*" "*" "*" "*"
13 ( 1 ) "*" "*" " " "*" " " "*" " " "*" "*" " " "*" "*" "*" "*" "*" "*"
14 ( 1 ) "*" "*" " " "*" " " "*" " " "*" "*" "*" "*" "*" "*" "*" "*" "*"
15 ( 1 ) "*" "*" " " "*" "*" "*" " " "*" "*" "*" "*" "*" "*" "*" "*" "*"
NewLeagueN Hits CRBI
2 (1) "" "" ""
3 (1) "" "" ""
4 (1) "" "" ""
5 (1) "" "" ""
6 (1) "" "" ""
7 (1) "" "" ""
8 (1) "" "" ""
9 (1) "" "" ""
10 ( 1 ) " " "" ""
11 ( 1 ) " " "" ""
12 ( 1 ) " " "" ""
13 ( 1 ) "*" "" ""
14 ( 1 ) "*" "" ""
15 ( 1 ) "*" "" ""

> #identificando las variables predictores forward con BIC


> plot(fwss$bic, xlab = "Numero de variables predictoras", ylab = "bic",
+ type = 'l')
> which.min(fwss$bic)
[1] 6
> plot(fwss$bic, xlab = "Numero de variables predictoras", ylab = "bic",
+ type = 'l')
> points(6, fwss$bic[6], col = "red", cex = 2, pch = 20)

> #identificando las variables predictores


> coef(Hitters.fwss, 6)
(Intercept) CHmRun Runs Walks CAtBat CHits
108.4349879 1.3589571 1.9205511 3.1613449 -0.3983183 1.5918323
DivisionW PutOuts
-128.4782101 0.2500407
> ## Pregunta 1b: Backward
> Hitters.bwss <- regsubsets(Salary ~ ., data = Hitters, nvmax = 15,
+ force.in= "CHmRun", force.out = c("CRBI", "Hits"),
+ method = "backward")
> bwss <- summary(Hitters.bwss)
> bwss
Subset selection object
Call: regsubsets.formula(Salary ~ ., data = Hitters, nvmax = 15, force.in = "CHmRun",
force.out = c("CRBI", "Hits"), method = "backward")
19 Variables (and intercept)
Forced in Forced out
CHmRun FALSE FALSE
AtBat FALSE TRUE
HmRun FALSE FALSE
Runs FALSE FALSE
RBI FALSE FALSE
Walks FALSE FALSE
Years FALSE FALSE
CAtBat FALSE FALSE
CHits FALSE FALSE
CRuns TRUE FALSE
CWalks FALSE FALSE
LeagueN FALSE TRUE
DivisionW FALSE FALSE
PutOuts FALSE FALSE
Assists FALSE FALSE
Errors FALSE FALSE
NewLeagueN FALSE FALSE
Hits FALSE FALSE
CRBI FALSE FALSE

1 subsets of each size up to 15


Selection Algorithm: backward
CHmRun AtBat HmRun Runs RBI Walks Years CAtBat CHits CRuns CWalks LeagueN DivisionW PutOuts
Assists Errors
2 ( 1 ) "*" " " " " " " " " "*" " " " " " " " " " " " " " " "" "" ""
3 ( 1 ) "*" " " " " " " " " "*" " " " " "*" " " " " " " " " "" "" ""
4 ( 1 ) "*" " " " " " " " " "*" " " "*" "*" " " " " " " " " "" "" ""
5 ( 1 ) "*" " " " " " " " " "*" " " "*" "*" " " " " " " " " "*" " " " "
6 ( 1 ) "*" " " " " " " " " "*" " " "*" "*" " " " " " " "*" "*" " " " "
7 ( 1 ) "*" " " " " " " " " "*" " " "*" "*" " " "*" " " "*" "*" " " " "
8 ( 1 ) "*" " " " " " " " " "*" " " "*" "*" " " "*" " " "*" "*" "*" " "
9 ( 1 ) "*" " " " " "*" " " "*" " " "*" "*" " " "*" " " "*" "*" "*" " "
10 ( 1 ) "*" "*" " " "*" " " "*" " " "*" "*" " " "*" " " "*" "*" "*" " "
11 ( 1 ) "*" "*" " " "*" " " "*" " " "*" "*" " " "*" " " "*" "*" "*" "*"
12 ( 1 ) "*" "*" " " "*" " " "*" " " "*" "*" " " "*" "*" "*" "*" "*" "*"
13 ( 1 ) "*" "*" " " "*" " " "*" " " "*" "*" "*" "*" "*" "*" "*" "*" "*"
14 ( 1 ) "*" "*" " " "*" "*" "*" " " "*" "*" "*" "*" "*" "*" "*" "*" "*"
15 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*" "*" "*" "*" "*" "*" "*" "*"
NewLeagueN Hits CRBI
2 (1) "" "" ""
3 (1) "" "" ""
4 (1) "" "" ""
5 (1) "" "" ""
6 (1) "" "" ""
7 (1) "" "" ""
8 (1) "" "" ""
9 (1) "" "" ""
10 ( 1 ) " " "" ""
11 ( 1 ) " " "" ""
12 ( 1 ) " " "" ""
13 ( 1 ) " " "" ""
14 ( 1 ) " " "" ""
15 ( 1 ) " " "" ""
Indetificando variable con R2 Ajustado
> plot(bwss$adjr2, xlab = "Numero de variables predictoras", ylab = "adjr2",
+ type = 'l')
> which.max(bwss$adjr2)
[1] 11
> points(11, bwss$adjr2[11], col = "red", cex = 2, pch = 20)
> #identificando las variables predictores
> coef(Hitters.bwss, 11)
(Intercept) CHmRun AtBat Runs Walks CAtBat CHits CWalks LeagueN
132.6742935 1.8472415 -0.5120391 3.6677486 4.3428468 -0.3716263 1.5932798 -0.3968683
41.6605248
DivisionW PutOuts Assists Errors
-126.9637948 0.2779534 0.4540928 -5.4391937

> library(caret)
Loading required package: ggplot2
Loading required package: lattice
Entrenando el dataset
> RNGkind(sample.kind = "Rejection")
> train.cont <- trainControl(method = "cv", number = 10)

> # Modelo con variables predictoras (M7)


> set.seed(987) # Valor semilla
)
> set.seed(987)
> m7 <- train(Salary ~ CHmRun + Walks + CAtBat + CHits + CWalks + Division + PutOuts + Assists, data =
Hitters, method = "lm",
+ trControl = train.cont)
> m7
Linear Regression

263 samples
8 predictor

No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 235, 237, 237, 237, 238, 236, ...
Resampling results:

RMSE Rsquared MAE


318.4263 0.5163972 231.2525

> # Valor semilla


> set.seed(321)
> m11 <- train(Salary ~ CHmRun + AtBat + Runs + Walks + CAtBat + CHits + CWalks + League + Division +
PutOuts + Assists + Errors, data = Hitters, method = "lm",
+ trControl = train.cont)
> m11
Linear Regression

263 samples
12 predictor

No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 236, 236, 236, 237, 235, 236, ...
Resampling results:

RMSE Rsquared MAE


332.3181 0.4895825 238.3441

# Valor semilla
> set.seed(654)
> m6 <- train(Salary ~ CHmRun + Runs + Walks CAtBat + CHits + Division + PutOuts, data = Hitters, method =
"lm",
Error: unexpected symbol in "m6 <- train(Salary ~ CHmRun + Runs + Walks CAtBat"
> m6 <- train(Salary ~ CHmRun + Runs + Walks + CAtBat + CHits + Division + PutOuts, data = Hitters, method
= "lm",
+ trControl = train.cont)
> m6
Linear Regression

263 samples
7 predictor

No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 238, 236, 237, 236, 237, 236, ...
Resampling results:

RMSE Rsquared MAE


330.4754 0.5216176 237.4343

Tuning parameter 'intercept' was held constant at a value of TRUE


> #Comparando
> m7$results
intercept RMSE Rsquared MAE RMSESD RsquaredSD MAESD
1 TRUE 318.4263 0.5163972 231.2525 98.27995 0.1905345 52.99614
> m6$results
intercept RMSE Rsquared MAE RMSESD RsquaredSD MAESD
1 TRUE 330.4754 0.5216176 237.4343 116.0167 0.2301842 60.13615
> m11$results
intercept RMSE Rsquared MAE RMSESD RsquaredSD MAESD
1 TRUE 332.3181 0.4895825 238.3441 79.76433 0.2397632 39.05211

Podriamos inferior que el modelo M7 es el mejor

> #Modelo de regresion Lasso


> x <- model.matrix(Salary ~ ., Hitters)[ , -1]
> y <- Hitters$Salary
> Hitters.lasso <- glmnet(x, y, alpha = 1)
Error in glmnet(x, y, alpha = 1) : could not find function "glmnet"
> ### Pregunta 1c: Regresion lasso
> library(glmnet)
Loading required package: Matrix
Loaded glmnet 4.1-6
> Hitters.lasso <- glmnet(x, y, alpha = 1)
> plot(Hitters.lasso, xvar = "lambda")
> par(cex = 0.75)
> legend("topright", legend = colnames(x), lty = c(1, 1), col = 1:8)
> set.seed(8926)
> cv.opt <- cv.glmnet(x, y, alpha = 1, type.measure = "mse", nfolds = 10)
> lambda.opt <- cv.opt$lambda.min
> lambda.opt
[1] 2.674375
> Hitters.lasso.opt <- glmnet(x, y, alpha = 1, lambda = lambda.opt)
> coef(Cancer.lasso.opt)
Error in coef(Cancer.lasso.opt) : object 'Cancer.lasso.opt' not found
> coef(Hitters.lasso.opt)
20 x 1 sparse Matrix of class "dgCMatrix"
s0
(Intercept) 123.7234962
AtBat -1.5680400
Hits 5.7096674
HmRun .
Runs .
RBI .
Walks 4.7612441
Years -9.3707554
CAtBat .
CHits .
CHmRun 0.5582041
CRuns 0.6754650
CRBI 0.3712480
CWalks -0.5401597
LeagueN 32.4027644
DivisionW -118.9948582
PutOuts 0.2733357
Assists 0.1754315
Errors -2.0298102
NewLeagueN .
Las variables subrayadas son las que quedaron fuera del modelo luego de aplicar la regresión lasso

You might also like