Statistical Models in S
The template is a multiple linear regression model:

    y = β0 + β1 x1 + β2 x2 + ... + βp xp + e

where the errors e are independent N(0, σ²).
Model formulae
General form:  yvar ~ term1 + term2 + ...

Examples:
  y ~ x                        Simple regression
  y ~ 1 + x                    Explicit intercept
  y ~ -1 + x                   Through the origin
  y ~ x + x^2                  Quadratic regression
  y ~ x1 + x2 + x3             Multiple regression
  y ~ G + x1 + x2              Parallel regressions
  y ~ G/(x1 + x2)              Separate regressions
  sqrt(Hard) ~ Dens + Dens^2   Transformed
  y ~ x + B + N*P              Factorial design with covariate
  y ~ . - X1                   All variables except X1
  . ~ . + A:B                  Add interaction (update)
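Not from the original slides: a quick way to check what a formula means is to expand it into its term labels (standard S/R functions). The nesting operator, for example, expands into a main effect plus within-group slopes:

    f <- y ~ G/(x1 + x2)            # nesting: separate slopes within G
    attr(terms(f), "term.labels")   # "G" "G:x1" "G:x2"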
  summary(obj)                   analysis summary
  predict(obj, newdata = ndat)   predict for new data
  deviance(obj)                  residual sum of squares
The Janka data. Hardness and density of 36 species of Australian hardwood samples
[Figure: scatterplot of the Janka data, Hardness (500-3000) against Density (30-70)]
> janka$d <- scale(janka$Dens, scale = F)
> jank.1 <- lm(Hard ~ d, janka)
> jank.2 <- update(jank.1, . ~ . + I(d^2))
> jank.3 <- update(jank.2, . ~ . + I(d^3))
> summary(jank.1)$coef
               Value Std. Error t value Pr(>|t|)
(Intercept) 1469.472    30.5099  48.164        0
d             57.507     2.2785  25.238        0
> summary(jank.2)$coef
                 Value Std. Error t value Pr(>|t|)
(Intercept) 1378.19661   38.93951 35.3933 0.000000
d             55.99764    2.06614 27.1026 0.000000
I(d^2)         0.50908    0.15672  3.2483 0.002669
> round(summary(jank.3)$coef, 4)
                Value Std. Error t value Pr(>|t|)
(Intercept) 1379.1028    39.4775 34.9339   0.0000
d             53.9610     5.0746 10.6336   0.0000
I(d^2)         0.4864     0.1668  2.9151   0.0064
I(d^3)         0.0060     0.0135  0.4405   0.6625

Why is this so? Does it matter very much?
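Not part of the original slides: a quick check of why the cubic term inflates the standard error of d. Centring removes most of the correlation between d and d^2, but d and d^3 remain highly correlated, so adding the cubic destabilises the linear coefficient:

    d <- as.vector(scale(janka$Dens, scale = F))     # centred density, as above
    round(cor(cbind(d = d, d2 = d^2, d3 = d^3)), 3)  # d and d^3 nearly collinear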
Janka data: Studentized residuals against fitted values for the quadratic model
Janka data: Sorted studentized residuals against normal scores for the quadratic model
> boxcox(jank.2, lambda = seq(-0.25, 1, len = 20))
[Figure: Box-Cox profile log-likelihood against lambda; the maximum is near lambda = 0.15 and the 95% interval runs from about -0.19 to 0.51]
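The 95% interval covers lambda = 0 (the log transformation) but excludes lambda = 1, so a log-scale re-fit is the natural next step. A sketch, assuming this is the fit behind the residual plot that follows:

    jank.lg <- update(jank.2, log(Hard) ~ .)   # same terms, log response
    summary(jank.lg)$coef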
[Figure: residuals (log trans.) against fitted values for the log-scale fit]
[Figure: Janka data against Dens (30-70)]
> dropterm(bigm, test = "F")
Single term deletions
Model:
Yield ~ Year + Rain0 + Temp1 + Rain1 + Temp2 + Rain2 + Temp3 + Rain3 + Temp4
       Df Sum of Sq    RSS    AIC F Value   Pr(F)
<none>              1404.8 143.79
Year    1    1326.4 2731.2 163.73  21.715 0.00011
Rain0   1     203.6 1608.4 146.25   3.333 0.08092
Temp1   1      70.2 1475.0 143.40   1.149 0.29495
Rain1   1      33.2 1438.0 142.56   0.543 0.46869
Temp2   1      43.2 1448.0 142.79   0.707 0.40905
Rain2   1     209.2 1614.0 146.37   3.425 0.07710
Temp3   1       0.3 1405.1 141.80   0.005 0.94652
Rain3   1       9.5 1414.4 142.01   0.156 0.69655
Temp4   1      58.6 1463.5 143.14   0.960 0.33738
> smallm <- update(bigm, . ~ Year)
> addterm(smallm, bigm, test = "F")
Single term additions
Model:
Yield ~ Year
       Df Sum of Sq    RSS    AIC F Value
<none>              2429.8 145.87
Rain0   1    138.65 2291.1 145.93  1.8155
Temp1   1     30.52 2399.3 147.45  0.3816
Rain1   1     47.88 2381.9 147.21  0.6031
Temp2   1     16.45 2413.3 147.64  0.2045
Rain2   1    518.88 1910.9 139.94  8.1461
Temp3   1    229.14 2200.6 144.60  3.1238
Rain3   1    149.78 2280.0 145.77  1.9708
Temp4   1    445.11 1984.7 141.19  6.7282
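Not in the original slides: dropterm/addterm steps like these can be automated with stepAIC from MASS. A sketch, with bigm as above and Year kept in every model:

    library(MASS)
    iowa.step <- stepAIC(bigm, scope = list(lower = ~ Year), trace = FALSE)
    iowa.step$anova    # summary of the steps taken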
[Figure: petrol data, Yield (%) against End point by sample (trellis panels, e.g. H and I)]
Random effects
> pet.re1 <- lme(Y ~ EP, petrol, random = ~ 1 + EP | No)
> summary(pet.re1)
.....
    AIC    BIC  logLik
 184.77 193.18 -86.387
Random effects:
 Formula: ~ 1 + EP | No
 Structure: General positive-definite
              StdDev Corr
(Intercept) 4.823890 (Inter
EP          0.010143 1
Residual    1.778504
Fixed effects: Y ~ EP
              Value Std.Error DF t-value p-value
(Intercept) -31.990    2.3617 21 -13.545  <.0001
EP            0.155    0.0062 21  24.848  <.0001
Correlation:
.....
[Figure: petrol data, Yield (%) against End point for samples A-I, with separate least-squares lines and random-effects (BLUP) lines]
Note how the random effect (red, broken) lines are less variable than the separate least squares lines.
[Figure: budworm data, fraction dead against Dose (0-30) for females and males]
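The model bud.1 is fitted on a slide not reproduced here. A sketch of how it would be set up, assuming a data frame budworm with columns Dead, Alive, Sex and Logdose:

    bud.1 <- glm(cbind(Dead, Alive) ~ Sex*Logdose,
                 family = binomial, data = budworm)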
> bud.0 <- update(bud.1, . ~ . - Sex:Logdose)
GLM linear loop 1: deviance = 6.8074
GLM linear loop 2: deviance = 6.7571
GLM linear loop 3: deviance = 6.7571
GLM linear loop 4: deviance = 6.7571
> anova(bud.0, bud.1, test = "Chisq")
Analysis of Deviance Table
Response: cbind(Dead, Alive)
          Terms Resid. Df Resid. Dev         Test Df Deviance Pr(Chi)
1 Sex + Logdose         9     6.7571
2 Sex * Logdose         8     4.9937 +Sex:Logdose  1   1.7633 0.18421
[Figure: budworm data with fitted curves from bud1 and bud0; fraction dead against Dose, females (+) and males (o)]
> jank.g1 <- glm(Hard ~ Dens, quasi(link = sqrt, variance = "mu^2"),
                 janka, trace = T)
GLM linear loop 1: deviance = 0.3335
GLM linear loop 2: deviance = 0.3288
GLM linear loop 3: deviance = 0.3288
> range(janka$Dens)
[1] 24.7 69.1
> newdat <- data.frame(Dens = 24:70)
> p1 <- predict(jank.g1, newdat, type = "response", se = T)
> names(p1)
[1] "fit"            "se.fit"         "residual.scale"
[4] "df"
> ul <- p1$fit + 2*p1$se.fit
> ll <- p1$fit - 2*p1$se.fit
> mn <- p1$fit
> matplot(newdat$Dens, cbind(ll, mn, ul), xlab = "Density",
          ylab = "Hardness", col = c(2,3,2), lty = c(4,1,4), type = "l")
> points(janka$Dens, janka$Hard, pch = 16, col = 6)
[Figure: Janka data with fitted mean and approximate 95% pointwise bands; Hardness (500-3000) against Density (30-70)]
First consider a Poisson "full" model:
> quine.1 <- glm(Days ~ Eth*Sex*Age*Lrn, poisson, quine, trace = T)
GLM linear loop 4: deviance = 1173.9
> quine.1$df
[1] 118
Overdispersion. Now a NB model with the same mean structure (and default log link):
> quine.n1 <- glm.nb(Days ~ Eth*Sex*Age*Lrn, quine, trace = T)
Theta( 1 ) = 1.92836 , 2(Ls - Lm) = 167.453
> dropterm(quine.n1, test = "Chisq")
Single term deletions
Model:
Days ~ Eth * Sex * Age * Lrn
                Df    AIC    LRT Pr(Chi)
<none>             1095.3
Eth:Sex:Age:Lrn  2 1092.7 1.4038 0.49563
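The models quine.m and quine.up used below are defined on slides not reproduced here. A plausible reconstruction, assuming quine.m is the main-effects NB model and quine.up the full factorial scope:

    quine.m  <- glm.nb(Days ~ Eth + Sex + Age + Lrn, quine)
    quine.up <- quine.n1    # full factorial model from above, as the upper scope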
> addterm(quine.m, quine.up, test = "Chisq")
Single term additions
Model:
Days ~ Eth + Sex + Age + Lrn
        Df    AIC    LRT
<none>     1107.2
Eth:Sex  1 1108.3  0.881
Eth:Age  3 1102.7 10.463
Sex:Age  3 1100.4 12.801
Eth:Lrn  1 1106.1  3.029
Sex:Lrn  1 1109.0  0.115
Age:Lrn  2 1110.0  1.161

> quine.m <- update(quine.m, . ~ . + Sex:Age)
> addterm(quine.m, quine.up, test = "Chisq")
Single term additions
Model:
Days ~ Eth + Sex + Age + Lrn + Sex:Age
        Df    AIC    LRT Pr(Chi)
<none>     1100.4
Eth:Sex  1 1101.1  1.216 0.27010
Eth:Age  3 1094.7 11.664 0.00863
Eth:Lrn  1 1099.7  2.686 0.10123
Sex:Lrn  1 1100.9  1.431 0.23167
Age:Lrn  2 1100.3  4.074 0.13043
> quine.m <- update(quine.m, . ~ . + Eth:Age)
> addterm(quine.m, quine.up, test = "Chisq")
Single term additions
Model:
Days ~ Eth + Sex + Age + Lrn + Sex:Age + Eth:Age
        Df    AIC    LRT
<none>     1094.7
Eth:Sex  1 1095.4 1.2393
Eth:Lrn  1 1096.7 0.0004
Sex:Lrn  1 1094.7 1.9656
Age:Lrn  2 1094.1 4.6230
> dropterm(quine.m, test = "Chisq")
Single term deletions
Model:
Days ~ Eth + Sex + Age + Lrn + Sex:Age + Eth:Age
        Df    AIC    LRT  Pr(Chi)
<none>     1094.7
Lrn      1 1097.9  5.166 0.023028
Sex:Age  3 1102.7 14.002 0.002902
Eth:Age
The final model (from this approach) can be written in the form Days ~ Lrn + Age/(Eth + Sex) suggesting that Eth and Sex have proportional effects nested within Age, and Lrn has the same proportional effect on all means independent of Age, Sex and Eth.
The best way to appreciate the fitted model is to graph the components.
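The fit iowa.gm1 itself comes from a slide not reproduced here. A sketch consistent with the four components plotted below; the data frame name iowheat and the smoothing defaults are assumptions:

    library(mgcv)    # or the gam library in S-PLUS
    iowa.gm1 <- gam(Yield ~ s(Year) + s(Rain0) + s(Rain2) + s(Temp4),
                    data = iowheat)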
> par(mfrow = c(2,2))
> plot(iowa.gm1, se = T, ylim = c(-25, 25))
[Figure: additive components s(Year), s(Rain0), s(Rain2), s(Temp4) with pointwise standard-error bands]
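The fits Iowa.fakeGAM and Iowa.fakePol plotted below are also from slides not reproduced here. Sketches consistent with the panel labels on the next two figures; fitting the parametric terms through gam so that plot.gam can display the components is an assumption:

    library(splines)
    Iowa.fakeGAM <- gam(Yield ~ ns(Year, 3) + ns(Rain0, 3) +
                                ns(Rain2, 3) + ns(Temp4, 3), data = iowheat)
    Iowa.fakePol <- gam(Yield ~ poly(Year, 3) + poly(Rain0, 3) +
                                poly(Rain2, 3) + poly(Temp4, 3), data = iowheat)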
> par(oma = c(0,0,4,0))
> plot.gam(Iowa.fakeGAM, se = T, ylim = c(-25, 25))
> mtext("Additive components of a simple spline model", outer = T, cex = 2)
> plot.gam(Iowa.fakePol, se = T, ylim = c(-25, 25))
> mtext("Additive components of a cubic polynomial model", outer = T, cex = 2)
[Figure: additive components ns(Year, 3), ns(Rain0, 3), ns(Rain2, 3), ns(Temp4, 3) with standard-error bands]
[Figure: additive components poly(Year, 3), poly(Rain0, 3), poly(Rain2, 3), poly(Temp4, 3) with standard-error bands]
First, re-fit the preferred NB model:
> quine.nb <- glm.nb(Days ~ Lrn + Age/(Eth + Sex), quine)
Now make a local copy of quine, add the additional component and fit the (large) random effects model:
> quine <- quine
> quine$Child <- factor(1:nrow(quine))
> library(MASS, first = T)
> quine.re <- glmmPQL(Days ~ Lrn + Age/(Eth + Sex),
      family = poisson(link = log), data = quine,
      random = ~ 1 | Child, maxit = 20)
     AgeF0Eth AgeF1Eth AgeF2Eth  AgeF3Eth AgeF0Sex AgeF1Sex
NB   0.012582 -0.74511  -1.2082 -0.040114 -0.54904 -0.52050
RE  -0.030215 -0.76261  -1.1759 -0.079303 -0.61111 -0.44163
    AgeF2Sex AgeF3Sex
NB   0.66164  0.66438
RE   0.66571  0.77235
> se.re <- sqrt(diag(quine.re$varFix))
> se.nb <- sqrt(diag(vcov(quine.nb)))
> rbind(se.re, se.nb)
         [,1]    [,2]    [,3]    [,4]    [,5]    [,6]
se.re 0.28100 0.16063 0.33832 0.36867 0.35762 0.29025
se.nb 0.31464 0.18377 0.37982 0.41264 0.40101 0.32811
         [,7]    [,8]    [,9]   [,10]   [,11]   [,12]
se.re 0.22531 0.23852 0.26244 0.30130 0.24281 0.25532
se.nb 0.25926 0.26941 0.29324 0.33882 0.28577 0.28857
        [,13]
se.re 0.26538
se.nb 0.29553
> plot(b.nb/se.nb, b.re/se.re)
> abline(0, 1, col = 3, lty = 4)
[Figure: b.re/se.re against b.nb/se.nb, with the unit line]
Comparing predictions
The negative binomial fitted values should be comparable with the "level 0" predictions from the GLMM, i.e. fixed effects only. The predict generic function predicts on the link scale. For the GLMM we need a correction to go from the log scale to the direct scale (cf. the lognormal distribution: if X ~ N(μ, σ²) then E[exp(X)] = exp(μ + σ²/2)). An approximate correction (on the log scale) is σ²/2, where σ² is the variance of the random effects (or BLUPs).
> fv.nb <- fitted(quine.nb)
> fv.re <- exp(predict(quine.re, level = 0) +
               var(random.effects(quine.re))/2)
> plot(fv.nb, fv.re)
> abline(0, 1, col = 5, lty = 4)
GLMM vs NB predictions on the original scale.
[Figure: fv.re against fv.nb (both roughly 10-50), with the unit line]
Non-linear regression
A generalization of linear regression: normality and equal variance are retained, but linearity of the parameters is relaxed. The model generally comes from a fairly secure theory; it is not often appropriate for empirical work. Estimation is still by least squares (= maximum likelihood), but the sum-of-squares surface is not necessarily quadratic. The theory is approximate and relies on the SSQ surface being nearly quadratic in a region around the minimum.
DATA DESCRIPTION:

    T = β1 V / (W - β2) + E

where β1 and β2 are unknown parameters to be estimated, and E is error. The data frame contains the following components:

ARGUMENTS:
  Viscosity   Viscosity of fluid
  Wt          Actuating weight
  Time        Time taken

SOURCE:
E. J. Williams (1959) Regression Analysis. Wiley.
A simple example
Stormer viscometer data (data frame stormer in MASS).
> names(stormer)
[1] "Viscosity" "Wt"        "Time"
Model:

    T = βV/(W - θ) + error

Ignoring error and re-arranging gives:

    WT ≈ βV + θT

Fit this relationship with ordinary least squares to get initial values for β and θ.
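A sketch of that two-step recipe (the names fm0 and storm.fm are illustrative; the self-starting version on the next slides packages the same idea):

    fm0 <- lm(Wt*Time ~ Viscosity + Time - 1, data = stormer)  # WT ~ βV + θT
    b0  <- coef(fm0)                                           # rough β and θ
    storm.fm <- nls(Time ~ b*Viscosity/(Wt - th), data = stormer,
                    start = list(b = b0[[1]], th = b0[[2]]), trace = T)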
Self-starting models
Allows the starting procedure to be encoded into the model function. (Somewhat arcane but very powerful.)
> library(nls)    # R only
> library(MASS)
> data(stormer)   # R only
> storm.init <- function(mCall, data, LHS) {
      v <- eval(mCall[["V"]], data)
      w <- eval(mCall[["W"]], data)
      t <- eval(LHS, data)
      b <- lsfit(cbind(v, t), t * w, int = F)$coef
      names(b) <- mCall[c("b", "t")]
      b
  }
> NLSstormer <- selfStart(~ b*V/(W - t), storm.init, c("b", "t"))
> args(NLSstormer)
function(V, W, b, t) NULL
> tst <- nls(Time ~ NLSstormer(Viscosity, Wt, beta, theta),
             stormer, trace = T)
885.365 : 28.8755 2.84373
825.110 : 29.3935 2.23328
825.051 : 29.4013 2.21823
Bootstrapping is no big deal:
> tst$call$trace <- NULL
> B <- matrix(NA, 500, 2)
> r <- scale(resid(tst), scale = F)   # mean correct
> f <- fitted(tst)
> for(i in 1:500) {
      v <- f + sample(r, rep = T)
      B[i, ] <- try(coef(update(tst, v ~ .)))   # guard!
  }
> cbind(Coef = colMeans(B), SD = colStdevs(B))  # colStdevs is S-PLUS; in R use apply(B, 2, sd)
        Coef      SD
[1,] 29.4215 0.64390
[2,]  2.2204 0.46051
Variables:
  Strip: identifier of muscle (animal)
  Conc: CaCl2 concentrations used to soak the section
Model:

    L = α_Strip + β_Strip exp(-Conc/θ) + error

Note that the α's and β's are (very many) linear parameters. We use this fact strongly.
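The call itself is on a slide not reproduced here; a sketch with indexed parameters, assuming vectors a and b of per-strip starting values and a scalar th from a preliminary fit:

    musc.2 <- nls(Length ~ a[Strip] + b[Strip]*exp(-Conc/th), muscle,
                  start = list(a = a, b = b, th = th), trace = T)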
Converges in one step, now. Note that with indexed parameters, the starting values must be given in a list (with names).
[Figure: muscle data, Length against Conc for strips S01-S21 (trellis panels)]
The vectors a, b and th come from the previous fit; there is no need to supply initial values for the random effects (though you may).
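The musc.re fit used on the next slide is not reproduced in the original; one plausible form of the call (nlme library):

    library(nlme)
    musc.re <- nlme(Length ~ a + b*exp(-Conc/th), data = muscle,
                    fixed = a + b + th ~ 1,
                    random = a + b ~ 1 | Strip,
                    start = c(a = mean(a), b = mean(b), th = th))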
A composite plot
> newdat$L2 <- predict(musc.re, newdat)
> xyplot(Length ~ Conc | Strip, muscle, subscripts = T,
      panel = function(x, y, subscripts, ...) {
          panel.xyplot(x, y, ...)
          ws <- as(muscle$Strip[subscripts[1]], "character")
          wf <- which(newdat$Strip == ws)
          xx <- newdat$Conc[wf]
          yy <- newdat$Length[wf]
          lines(xx, yy, col = 3)
          yy <- newdat$L2[wf]
          lines(xx, yy, lty = 4, col = 4)
      }, ylim = range(newdat$Length, muscle$Length), as.table = T)
[Figure: muscle data with fitted curves from the fixed-effects nls fit and the nlme random-effects fit; Length against Conc for strips S01-S21]
[Figure: schematic classification tree with splits such as A < 3, B < 2, B < 1, A < 1, C < 10 and D = fg, and Male/Female leaves]
Regression trees
Suppose Y has a distribution with mean depending on several determining variables, not necessarily continuous.
A regression tree is a decision tree that partitions the determining variable space into non-overlapping prediction regions.
Within each prediction region the response variable Y is predicted by a constant.
The deviance in this case is exactly the same as for regression: the residual sum of squares.
With enough regions (nodes) the training sample can clearly be reproduced exactly, but predictions will be inaccurate. With too few nodes predictions may be seriously biased. How many nodes should we use? This is the key question for all of tree modelling (equivalent to setting the tuning constant).
The function tree() implements an algorithm to produce a binary tree (regression or classification) and returns a tree object. The formula is of the simple form R ~ X1 + X2 + ..., where R is the target variable and X1, X2, ... are the determining variables. If R is a factor the result is a classification tree; if R is numeric the result is a regression tree.
tree or rpart?
In S-PLUS there is a native tree library about which V&R have some reservations; it is useful, though. rpart is a library written by Beth Atkinson and Terry Therneau of the Mayo Clinic, Rochester, MN. It is much closer to the spirit of the original CART algorithm of Breiman et al., and it is now supplied with both S-PLUS and R. In R there is a tree library that is an S-PLUS look-alike, but, we think, better in some respects. rpart is the more flexible: it allows various splitting criteria and different model bases (survival trees, for example). rpart is probably the better package, but tree is acceptable, and some things such as cross-validation are easier with tree. In this discussion we (nevertheless) largely use tree!
minsize = k specifies that the groups may be as small as k.
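The tree behind the next two slides is fitted on a slide not reproduced here; a minimal sketch (the name jank.t1 is used later):

    jank.t1 <- tree(Hard ~ Dens, janka)
    summary(jank.t1)              # the slide reports 5 terminal nodes
    plot(jank.t1); text(jank.t1)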
The tree has 5 terminal nodes. We can take two views of the model that complement each other.
[Figure: regression tree for the Janka data with splits such as Dens < 62.9 and Dens < 56.25; leaf means include 2000 and 2900]
> partition.tree(jank.t1, xlim = c(24, 70))
> attach(janka)
> points(Dens, Hard, col = 5, cex = 1.25)
> segments(Dens, Hard, Dens, predict(jank.t1), col = 6, lty = 4)
[Figure: partition.tree step function over the Janka data; Hard (500-3000) against Dens (30-70)]
"Height" "M.I"
We consider predicting Sex from the other variables:
- Remove cases with missing values.
- Split the data set into "training" and "test" sets.
- Build the model on the training set; test it on the test set.
- Look at simple "bagging" to improve stability and predictions.
Preliminary manipulations
We start with a bit of data cleaning.
> dim(survey)
[1] 237  12
> count.nas <- function(x) sum(is.na(x))
> sapply(survey, count.nas)
   Sex Wr.Hnd NW.Hnd  W.Hnd   Fold  Pulse   Clap   Exer  Smoke Height
     1      1      1      1      0     45      1      0      1     28
   M.I    Age
    28      0
> Srvy <- na.omit(survey)
> dim(Srvy)
[1] 168  12
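The full tree plotted next comes from a fit not shown in the extract; a sketch (the name surv.t1 is used in the pruning step later):

    surv.t1 <- tree(Sex ~ ., Srvy)
    plot(surv.t1); text(surv.t1)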
[Figure: full classification tree for Sex, with splits on Height, NW.Hnd, Wr.Hnd, Pulse and Age; root split Height < 175.13]
Cross-validation
A technique for using internal evidence to gauge the size of tree warranted by the data. Random sections are omitted and the remainder used to construct a tree. The omitted section is then predicted from the remainder, and various criteria (deviance, error rate) are used to assess the efficacy.
> obj <- cv.tree(tree.obj, FUN = functn, ...)
> plot(obj)
Currently functn must be either prune.tree or shrink.tree (the default). It determines the protocol by which the sequence of trees tested is generated. The MASS library also has prune.misclass, for classification trees only.
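A sketch of the call behind the next plot (surv.t1 as fitted above; prune.misclass as mentioned in the text):

    surv.cv <- cv.tree(surv.t1, FUN = prune.misclass)
    plot(surv.cv)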
[Figure: cross-validated misclassification (roughly 40-100) against tree size (up to 14), with cost-complexity values on the top axis]
Pruning

> surv.t2 <- prune.tree(surv.t1, best = 6)
> plot(surv.t2)
> text(surv.t2, col = 5)

[Figure: pruned tree with splits including Height < 175.13 and NW.Hnd < 19.35, leaves labelled Female/Male]
rpart counterpart
> rp1 <- rpart(Sex ~ ., Srvy, minsplit = 10)
> plot(rp1)
> text(rp1)
> plotcp(rp1)
See next slide. This suggests a very small tree: two nodes. (Some repeat tests might be necessary.)
> rp2 <- prune(rp1, cp = 0.29)
> plot(rp2)
> text(rp2)
[Figure: plotcp output for the rpart tree with minsplit = 10; internal cross-validation and the 'one SE' rule]
Confusion matrices
Now to check the effectiveness of the pruned tree on the training sample. Should be pretty good!
> table(predict(surv.t2, type = "class"), Srvy$Sex)
         Female Male
  Female     79   11
  Male        5   73
First build the model.
> w <- sample(nrow(Srvy), nrow(Srvy)/2)
> Surv0 <- Srvy[w, ]
> Surv1 <- Srvy[-w, ]
> surv.t3 <- tree(Sex ~ ., Surv0)
Check on the training set:
> table(predict(surv.t3, type = "class"), Surv0$Sex)
         Female Male
  Female     42    2
  Male        3   37
Test on the test set:
> table(predict(surv.t3, Surv1, type = "class"), Surv1$Sex)
         Female Male
  Female     34    6
  Male        5   39
Now build an over-fitted model on the training set and check its performance:
> surv.t3 <- tree(Sex ~ ., Surv0, minsize = 4)
> table(predict(surv.t3, type = "class"), Surv0$Sex)
         Female Male
  Female     44    1
  Male        1   38
Far too good! How does it look on the test set?
> table(predict(surv.t3, Surv1, type = "class"), Surv1$Sex)
         Female Male
  Female     31    9
  Male        8   36
Bagging requires fitting the same model to a sequence of 'bootstrapped' data frames. It is convenient to have a function to do it, as below.
> jig <- function(dataFrame) dataFrame[sample(nrow(dataFrame), rep = T), ]
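The fitting loop itself is not shown in the extract; a sketch growing 100 deliberately over-fitted trees on bootstrap copies of the training set (the name bag is used below):

    bag <- vector("list", 100)
    for (i in 1:100)
        bag[[i]] <- tree(Sex ~ ., jig(Surv0), minsize = 4)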
Next find the sequence of predictions for each model using the test set. It is convenient to have the result as characters.
> bag.pred <- lapply(bag, predict, newdata = Surv1, type = "class")
> bag.pred <- lapply(bag.pred, as.character)
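The combination step is also not shown in the original; a hypothetical sketch taking a simple majority vote over the bagged predictions and comparing with the test-set truth:

    votes <- do.call(cbind, bag.pred)    # one column per bootstrap tree
    maj <- apply(votes, 1, function(v) names(which.max(table(v))))
    table(maj, as.character(Surv1$Sex))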
Geographical Identification
> dgroups <- cutree(agdredge, k = 4)
> table(dgroups)
dgroups
 1  2  3  4
60 46 33  7
> attach(DredgeX)
> library(MASS)
> eqscplot(Longitude, Latitude, type = "n", axes = F, xlab = "", ylab = "")
> points(Longitude, Latitude, pch = dgroups,
         col = c("red", "yellow", "green", "blue")[dgroups])
> library(Aus)   # Nick Ellis
> Aus(add = T, col = "brown")