
R MANUAL

PRACTICING STATISTICS:
GUIDED INVESTIGATIONS FOR THE SECOND COURSE

SHONDA KUIPER
Grinnell College

JEFFREY SKLAR
California Polytechnic State University


The author and publisher of this book have used their best efforts in preparing this book. These efforts include the development, research, and testing of the theories and programs to determine their effectiveness. The author and publisher make no warranty of any kind, expressed or implied, with regard to these programs or the documentation contained in this book. The author and publisher shall not be liable in any event for incidental or consequential damages in connection with, or arising out of, the furnishing, performance, or use of these programs. Reproduced by Pearson from electronic files supplied by the author. Copyright 2013 Pearson Education, Inc. Publishing as Pearson, 75 Arlington Street, Boston, MA 02116. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior written permission of the publisher. Printed in the United States of America. ISBN-13: 978-0-321-78481-0 ISBN-10: 0-321-78481-2

www.pearsonhighered.com

Contents

Getting Started with R ................ i
Chapter 1: Randomization Tests: Schistosomiasis ................ 1
Chapter 2: The Two-Sample t-test, Regression, and ANOVA: Making Connections ................ 7
Chapter 3: Multiple Regression: How Much is Your Car Worth? ................ 15
Chapter 4: Designing Factorial Experiments: Microwave Popcorn ................ 25
Chapter 5: Block, Split-Plot and Repeated Measure Designs: What Influences Memory? ................ 29
Chapter 6: Categorical Data Analysis: Is a Tumor Malignant or Benign? ................ 35

Research Project: Infant Handling in Female Baboons ............................................................. 37


Chapter 7: Logistic Regression: The Space Shuttle Challenger ................ 43
Chapter 8: Poisson Log-Linear Regression: Detecting Cancer Clusters ................ 49
Chapter 9: Survival Analysis: Melting Chocolate Chips ................ 55
Chapter 10: Principal Component Analysis: Stock Market Values ................ 59
Chapter 11: Bayesian Data Analysis: What Colors Come in Your M&M's Candy Bag? ................ 67


Getting Started with R


This guide provides information for getting started in R. A detailed introduction to the R software is available at: http://cran.r-project.org/doc/manuals/R-intro.html

About R
R is a powerful and commonly used statistical software package that can be downloaded for free at http://www.r-project.org/. Instructions on how to download the software are available on the website.

Typing commands and entering data


When you open the R program, commands can be entered one line at a time into R at the prompt symbol, >, or they can be entered into an R script file and run from there. To run code at the prompt symbol, simply type the code and press Enter. To run code from a script file in your R session:
a) Go to File > New script.
b) A new window called "Untitled - R Editor" will open.
c) You can write and edit R code in this editor (this is very convenient if you have several lines of code to run).
d) To run the code, highlight the appropriate region and click the "Run line or selection" icon next to the "Save script" icon.
e) To name and save the script file, go to File > Save as, and give the file any name you want with the extension .R at the end.
For example, in the R console window, you can type 3 + 8 and R will display the value 11.
> 3+8
[1] 11
To store the output of data or a function, use either an equal sign (=) or the special assignment operator <-. For example, we can assign x the value 3+8:
> x <- 3+8
> x
[1] 11
Each time we type the lowercase x, it will be given the value 11. In R, data are stored as vectors, and we can also create a vector of data:
> X <- c(16,10,7,9,3)
> x
[1] 11
> X
[1] 16 10  7  9  3
Note that R is case sensitive: we have defined the lowercase x = 11, while the uppercase X is a variable with 5 observations.


Calculating Descriptive Statistics


The following examples illustrate several statistical functions applied to the vector X we have just created:
> mean(X) #computes the mean of the values in X
[1] 9
> sd(X) #computes the (sample) standard deviation
[1] 4.743416
> var(X) #computes the (sample) variance
[1] 22.5

> summary(X) #computes the five number summary and the mean
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      3       7       9       9      10      16
Note that the characters to the right of the # symbol are not interpreted by R; they form comments to help us remember what a particular line does. Comments are optional, but often useful.

Constructing Graphical Summaries


Histograms and boxplots can be constructed using the hist() and boxplot() functions. For example, hist(X) and boxplot(X) will create a histogram and boxplot of the measurements in X.

Creating Scatterplots
The plot() command is used to create a scatterplot displaying the values of two quantitative variables. For example, define a vector Y:
> Y <- c(4,6,2,8,9)
Now plot Y versus the vector X we have created:
> plot(X,Y)
Various options can be included in the plot() command for adding axis labels, titles, etc. For example:
> plot(X,Y, type="b", main="Sample Plot in R", xlab="X Value")
Type help(plot) for more information on plotting options.

Writing Functions
Some functions have no arguments (no input values) and thus always give the same output. For example:
> hello.world <- function() print("hello world")
> hello.world()
[1] "hello world"
Most functions we write will have an argument. In the following example, the input/argument is Mike, but any name would work. The output depends on which name is typed into the second line:
> hello.someone <- function(name) print(paste("hello",name))
> hello.someone("Mike")
[1] "hello Mike"
These examples are short functions that can be written on one line. Longer functions should first be created in an R script file (see the instructions above for opening a new script file), and then activated.


To activate the function, highlight the code and click the "Run line or selection" icon. For example, a function to construct a confidence interval for a population mean could be written as:

simple.t.ci <- function(x,conf.level=0.95) {
  n <- length(x)
  xbar <- mean(x)
  alpha <- 1 - conf.level
  tstar <- qt(1-alpha/2, df=n-1) #t critical value; the degrees of freedom must be supplied to qt()
  sigma <- sd(x)
  SE <- sigma/sqrt(n)
  xbar + c(-tstar*SE,tstar*SE)
}

Now, assuming that the values in the vector X are a random sample from a normal population, we can find a 95% confidence interval for the population mean:
> simple.t.ci(X)
[1]  3.110271 14.889729
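The optional conf.level argument can be overridden to produce other confidence levels; for example, the call below (an added illustration, not part of the original text) returns a 99% interval for the same data:
> simple.t.ci(X, conf.level=0.99) #wider interval, since a larger t critical value is used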

Generating Random Data


The following command generates a random vector of 50 observations from a normal distribution with mean 10 and standard deviation 2:
> x <- rnorm(50, mean = 10, sd = 2)
We can also create a vector of 50 observations from a uniform(2,10) distribution:
> y <- runif(50,2,10)
Notice that each time these commands are run, new random data are generated. Random data can also be incorporated into functions:
> Y <- 2*X + 5 + rnorm(5,0,2) #adds a random term (from a N(0,2) distribution) to each value
> plot(X,Y)
> abline(lm(Y ~ X)) #adds a linear regression line to the plot

Importing and Exporting Data Sets


You can create a data file using Windows Notepad by opening a new Notepad document, typing the following data, and saving the file as weightdata.txt:

Height Weight
62 135
71 265
67 165
69 210
65 155

This file can be saved in any folder you create, such as C:/Courses/STAT320. We will use the term FILE PATH to represent the location of the folder, so in our example FILE PATH/weightdata.txt could be, for example, C:/Courses/STAT320/weightdata.txt. We can read the data file into our R session. The option header=T states that each column of data has a label. Once the data set is read into R, it can also be attached so that the variables in the data set can be accessed directly by name:
> data1 <- read.table("FILE PATH/weightdata.txt",header=T)
> attach(data1)


The variables Height and Weight can now be examined simply by typing their names. If the attach() command is not used, then the data set name followed by a $ sign must be typed before the variable name, e.g. data1$Height. For example:
> mean(Height) #can be used when the data is attached
> mean(data1$Height) #should be used when the data is not attached
CSV files can also be read into a data set using the read.csv() command. For example, if the weight and height data contained in weightdata.txt are saved in a .csv file, weightdata.csv, then the file can be read into your R session by:
> mydata <- read.csv("FILE PATH/weightdata.csv", header=T)
Once a data set has been read into your R session, you can analyze the data with any available R function (try a few of the previous functions, such as summary(data1)). It is common practice to detach a data set once you are finished analyzing it, to avoid confusion with variables that have the same names as those in the currently attached data set. For example:
> detach(data1)
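As an optional check (not part of the original instructions), the head() and str() functions are handy for confirming that a file was read in as expected:
> head(data1) #prints the first few rows of the data set
> str(data1)  #shows each variable's type and the number of observations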

Exporting Files
If you have created a new data set in R or altered an existing one, and want to save it to your computer, you can export a data set created in your R session using the write.table() command. The second argument gives the location and name of the file to create. For example,
> write.table(data1,"FILE PATH/myfolder/data1.txt")
will export the data set data1 to a file named data1.txt inside a folder on your computer called myfolder.
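A data set can also be exported in CSV format with the write.csv() command; the file name below is just an illustration:
> write.csv(data1, "FILE PATH/weightdata2.csv", row.names=FALSE) #row.names=FALSE omits the row numbers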

Handling Missing Data


A missing value within a data set is indicated by NA, which stands for "not available." Some R functions will not work when missing values are present. There are a couple of ways to handle missing data. One approach is to delete each row in a data set that contains a missing value, using the na.omit() function. For example, consider a slight modification of the original data in data1 where the height of the first subject is missing:

Height Weight
NA 135
71 265
67 165
69 210
65 155

Name the modified data set data2 and assume it has been read into your R session. Then using na.omit() will return the following:
> na.omit(data2)
  Height Weight
2     71    265
3     67    165
4     69    210
5     65    155
Another option is the na.rm=TRUE argument, which tells a function to omit the NA values when computing its result. For example, the following code gives the mean of the first column of data2 after omitting the missing value:
> mean(data2[,1],na.rm=TRUE)
[1] 68
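As an optional preliminary step (not part of the original text), the is.na() function can be used to locate missing values before deciding how to handle them:
> is.na(data2$Height)      #returns TRUE wherever a height is missing
> sum(is.na(data2$Height)) #counts the missing heights
[1] 1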


Supplemental Packages
R packages are collections of functions used to perform specific techniques and types of analyses. Many packages are not immediately available when you start your R session. These packages need to be activated (loaded) in your R session, or downloaded from the internet and then loaded. To load a pre-installed package, use the library() command. For example, to load the survival package which contains several functions for performing survival analysis, type: library(survival). Other packages may need to be downloaded to the library folder within the R software folder before the library() command will work.
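If a package has not yet been installed on your computer, it can usually be downloaded from within R given an internet connection; for example:
> install.packages("survival") #downloads and installs the package (you may be asked to choose a mirror)
> library(survival)            #loads the installed package into the session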

Additional Commands for Managing Data and Accessing Help


The following commands are used to obtain information about, and manage, available data sets, functions, and library packages:
To delete a data set, type: remove(nameofdataset)
To list all available built-in datasets, type: data()
To list all data sets in a given library package, type: data(package="package name"). For example, typing data(package="survival") will list all data sets within the survival analysis package.
To get help on available R commands, type: help(nameoffunction)
To search for particular R commands, datasets, or libraries by topic, type: help.search("nameoftopic")

Ending Your R Session


When you are done with your session, type the command q() (for quit). You will be asked if you want to save your workspace image. Click Yes if you want to save all data sets and functions created during your session; if you click No, you will lose any data sets created during the session. You may also be asked to save the changes made to your script file.


Some Other Useful R Functions and Commands


The following table provides a list of several R functions and commands for creating numerical and graphical summaries, as well as some common hypothesis testing procedures. Specific formatting and options for each function can be obtained using the help() command for each function.

Descriptive Statistics
  The length of a vector of observations (x):                length(x)
  The mean of a vector (x):                                  mean(x)
  The standard deviation of a vector (x):                    sd(x)
  The variance of a vector (x):                              var(x)
  Calculate quantiles of a vector (x):                       quantile(x)
  Calculate simple linear regression coefficients (y on x):  lm(formula = y ~ x)

Plots
  Create a histogram of a vector of observations (x):        hist(x)
  Normal probability (quantile) plot of a vector (x):        qqnorm(x)
  Bar graphs for categorical data:                           barplot()
  Create a matrix of scatterplots for variables x1, x2, x3:  pairs(cbind(x1,x2,x3))
  Create a boxplot of a vector (x):                          boxplot(x)
  Calculate the correlation between two vectors x1 and x2:   cor(x1,x2)
  Create a stem-and-leaf plot of a vector (x):               stem(x)
  One window holding the next 6 plots, filled row by row:    par(mfrow=c(2,3))
  One window holding the next 6 plots, filled column by column: par(mfcol=c(2,3))

Hypothesis Tests
  Contingency table tests and goodness-of-fit tests:         chisq.test()
  One- and two-sample t-tests (including paired samples):    t.test()

Probabilities and Random Variables
  Probability that a normal random variable is less than x:  pnorm()
  Quantile from a normal distribution:                       qnorm()
  Random observations from a normal distribution:            rnorm()
  Random observations from a t distribution:                 rt()
  Probability that a t(n-1) random variable is less than x:  pt()
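As a quick illustration of the probability functions in the table, the following calls verify two familiar standard normal facts (these example calls are not part of the original table):
> pnorm(1.96)  #P(Z < 1.96)
[1] 0.9750021
> qnorm(0.975) #97.5th percentile of the standard normal
[1] 1.959964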


Chapter 1 Randomization Tests: Schistosomiasis


Activities
2. Read in the worm count data and attach it to your session:
worm <- read.table("FILE PATH/Worm Count.txt",header=TRUE)
attach(worm)
To obtain the summary statistics for each of the groups of mice:
Female Treatment Group:
c(summary(FemaleTrt),"Stand Dev"=sd(FemaleTrt))
Female Control Group:
c(summary(FemaleCtl),"Stand Dev"=sd(FemaleCtl))
Male Treatment Group:
c(summary(MaleTrt),"Stand Dev"=sd(MaleTrt))
Male Control Group:
c(summary(MaleCtl),"Stand Dev"=sd(MaleCtl))
To detach the worm data set:
detach(worm)

7-12. The R code below will simulate 10,000 random allocations. Note that R code can be entered one line at a time into R at the prompt symbol, >. Alternatively, you can enter the code into a .txt file (using Notepad, for example) and then cut and paste all the code into R at once to have R execute it. The cut-and-paste approach has advantages, primarily that you can try the code with a small number of iterations and then run it for real with a large number. The characters from the # symbol to the end of a line are not interpreted by R; they form comments that help us remember what a particular line does. Comments are optional, but often useful. After running the code, the results vector contains reps simulated mean differences and p.value is a constant containing the empirical p-value.

reps <- 10000
ctl.fem <- c(16,10,10,7,17)
trt.fem <- c(1,2,2,10,7)
results <- numeric(reps) # establish a vector of the right length, all 0s initially
x <- c(trt.fem, ctl.fem)
for (i in 1:reps) {
  temp <- sample(x)
  results[i] <- mean(temp[1:5])-mean(temp[6:10])
}
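Before computing the p-value, it can help to verify the observed statistic that the cutoff 7.6 below refers to; this check (not in the original text) is just arithmetic on the data entered above:
mean(ctl.fem) - mean(trt.fem) #= 60/5 - 22/5 = 12 - 4.4 = 7.6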

9.

p.value <- sum(results >= 7.6) / reps
p.value


10. hist(results,xlab="Control Mean - Treatment Mean")

13. p.value <- (sum(results <= -7.6)+sum(results >= 7.6)) / reps
p.value

14. Number of simulations resulting in a difference greater than or equal to 7.6:
sum(results >= 7.6)
Number of simulations resulting in a difference less than or equal to -7.6:
sum(results <= -7.6)

Using R Functions
When you want to use a string of R code repeatedly, making slight modifications each time, it is most efficient to use an R function. The code below illustrates the idea in the context of randomization tests. You can find more general information about writing R functions in the documentation available at http://cran.us.r-project.org/. Figure 1 displays R code for defining a function that calculates a randomization test on two vectors of quantitative data from a randomly allocated, two-group experiment. The first line is the crucial line; it assigns the name randtest to a new function. The general syntax for defining a function is: function.name <- function( )
_____________________________________________________________________
randtest <- function(x,y,fun=mean,reps=1000) {
  n <- length(x)
  m <- length(y)
  data <- c(x,y)
  results <- numeric(reps)
  for (i in 1:reps) {
    temp <- sample(data)
    results[i] <- fun(temp[1:n])-fun(temp[(n+1):(n+m)])
  }
  greater.p <- sum(results >= (fun(x)-fun(y)))/reps
  less.p <- sum(results <= (fun(x)-fun(y)))/reps
  a <- abs(fun(x)-fun(y))
  two.sided.p <- sum(abs(results)>=a)/reps
  temp <- c(greater.p, less.p, two.sided.p)
  names(temp) <- c("p.greater.than", "p.less.than", "two.sided.p")
  return(list(temp,results,a))
}
_____________________________________________________________________
Figure 1: R code for defining a function to perform a randomization test on two vectors of data, x and y. The output of the function is a list of length three: the first entry contains the less-than and greater-than one-sided p-values and a two-sided p-value; the second entry is the vector of length reps representing the approximate randomization distribution; and the third entry is the observed test statistic, which by default is the difference in means between x and y (this choice can be changed through the fun argument). To use the function, we recommend assigning the output to a variable so that the component of interest (p-value or randomization distribution vector) can be extracted. The greater-than p-value is used if we expect the first argument to give bigger values than the second argument, and vice versa for the less-than p-value.


Inside the parentheses go the various arguments or inputs needed for the function to work. The randtest definition calls for 4 arguments: the first two listed, x and y, are required arguments and represent the two vectors we will perform a randomization test on. The third and fourth arguments, fun and reps, are optional arguments, both of which are given default values (mean and 1000, respectively). To use randtest, enter into R a line of the form: randtest(vector1, vector2, <optional arguments>) The <optional arguments> in the syntax can be left blank if using the mean and 1000 reps is what you want, but you can override these defaults. The first two arguments will necessarily be interpreted as our two data vectors. For example: randtest(ctl.fem, trt.fem, fun=median, reps=10000) would call the function using the median as the statistic and 10,000 reps.

9, 13. Cut and paste the R code in Figure 1 into your R session. This defines the randtest function within R. Now enter this code to obtain the p-values:
foo <- randtest(ctl.fem, trt.fem)
foo[[1]]
Compare the results to those obtained above. Note: Because the function randtest produces three p-values, a vector of length reps, and the observed statistic in a list of length 3, it is prudent to assign the results to a variable (here called foo) and then ask R just for the first item in the list, which is the vector of p-values. This tactic suppresses the printing of the lengthy vector. If you want the randomization distribution vector, refer to it as foo[[2]].

Extended Activities
17. Read the data set Age Discrim into your R session:
age <- read.table("FILE PATH/Age Discrim.txt",header=TRUE)
Create two vectors of ages corresponding to those who were and were not laid off:
yes.lay <- age[age[,2]==1,][,1]
no.lay <- age[age[,2]==0,][,1]
Use the randtest function to conduct the permutation test:
age.perm <- randtest(yes.lay,no.lay,fun=mean,reps=1000)
To obtain the empirical p-value of the permutation test:
age.perm[[1]]
Select the p-value under the "p.greater.than" label, since we are trying to determine whether the mean age of people who were laid off is higher than the mean age of people who were not laid off.

18. Compare the median age of those who were and were not laid off:
age.perm <- randtest(yes.lay,no.lay,fun=median,reps=1000)
age.perm[[1]]
Select the p-value under the "p.greater.than" label, since we are trying to determine whether the median age of people who were laid off is higher than the median age of people who were not laid off.


20. The following R code defines a function, called randtest.paired, to perform a randomization test on two vectors of data, x and y, from a matched pairs design. The output of the function is a list of three p-values, plus the vector, results, that represents the empirical randomization distribution. To use the function, use syntax like: output <- randtest.paired(x,y). You can also change the default number of simulations of 1000 by specifying reps = in the function call. The variable output then contains a list with two elements: the first is a vector of three p-values; the second is a vector of length reps. To unpack these pieces, use output[[1]] for the three p-values and output[[2]] for the vector giving the empirical randomization distribution.

randtest.paired <- function(x,y,fun=mean,reps=1000) {
  data <- dats <- cbind(x,y)
  observed <- fun(x)-fun(y)
  results <- numeric(reps)
  n <- length(data[,1])
  for (i in 1:reps) {
    for (j in 1:n) {
      dats[j,] <- sample(c(-1,1),1)*data[j,]
    }
    results[i] <- (fun(dats[,1]))-(fun(dats[,2]))
  }
  greater.p <- sum(results>=observed)/reps
  less.p <- sum(results<=observed)/reps
  two.sided.p <- sum(abs(results)>=abs(observed))/reps
  temp <- c(greater.p,less.p,two.sided.p)
  names(temp) <- c("p.greater.than","p.less.than","two.sided.p")
  return(list(temp,results))
}

Read in the Music data and then attach it to your R session:
music <- read.table("FILE PATH/Music.txt",header=TRUE)
attach(music)
We'll use randtest.paired with 10,000 simulations:
music.rand.pair <- randtest.paired(Fastdiff,Slowdiff,fun=mean,reps=10000)
a) To construct a histogram of the mean differences:
hist(music.rand.pair[[2]],xlab="Fastdiff-Slowdiff (beats per minute)")
You can shade the area under the histogram to the right of 1.86. This shaded area represents the empirical p-value for the test.
b) To obtain the empirical p-value for the upper-tailed test, observe the p-value under the "p.greater.than" label:
music.rand.pair[[1]]

21. Read in the ChiSq data:
chisq <- read.table("FILE PATH/ChiSq.txt",header=FALSE)
Construct and observe the histogram of the sampling distribution of X-bar:
reps <- 1000
means <- numeric(reps)
for (i in 1:reps) {
  means[i] <- mean(sample(chisq[,1],40))
}
hist(means,main="Sampling Distribution of X-bar",xlab="Sample Mean")
Mean and standard deviation of the sampling distribution of X-bar:
mean(means)
sd(means)

22. Construct the bootstrap distribution of X-bar:
samp.chisq <- sample(chisq[,1],40)
reps <- 1000
means1 <- numeric(reps)
for (i in 1:reps) {
  means1[i] <- mean(sample(samp.chisq,40,replace=TRUE))
}
hist(means1,main="Bootstrap Distribution of X-bar",xlab="Resample Mean")

23. Construct the sampling distribution and bootstrap distribution of the sample standard deviation:
par(mfrow=c(1,2))
reps <- 1000
sds <- numeric(reps)
for (i in 1:reps) {
  sds[i] <- sd(sample(chisq[,1],40))
}
hist(sds,main="Sampling Distribution of the Sample SD",xlab="Sample SD")
samp.chisq <- sample(chisq[,1],40)
reps <- 1000
sds1 <- numeric(reps)
for (i in 1:reps) {
  sds1[i] <- sd(sample(samp.chisq,40,replace=TRUE))
}
hist(sds1,main="Bootstrap Distribution of the Sample SD",xlab="Resample SD")

24. Read in the MedSalaries data:
med <- read.table("FILE PATH/MedSalaries.txt",header=FALSE)
a) Construct the bootstrap distribution of the sample mean:
samp.size <- 100
samp.med <- sample(med[,1],samp.size)
reps <- 1000
means <- numeric(reps)
for (i in 1:reps) {
  means[i] <- mean(sample(samp.med,samp.size,replace=TRUE))
}
hist(means,main="Bootstrap Distribution of X-bar",xlab="Resample Mean")


95% Bootstrap t Confidence Interval for the population mean:
Lower Limit: mean(means)-qt(.975,samp.size-1)*sd(means)
Upper Limit: mean(means)+qt(.975,samp.size-1)*sd(means)
95% Bootstrap Percentile Confidence Interval for the population mean:
Lower Limit: quantile(means,.025)
Upper Limit: quantile(means,.975)
b) Construct the bootstrap distribution of the sample standard deviation:
samp.med <- sample(med[,1],samp.size)
reps <- 1000
sds <- numeric(reps)
for (i in 1:reps) {
  sds[i] <- sd(sample(samp.med,samp.size,replace=TRUE))
}
hist(sds,main="Bootstrap Distribution of the Sample SD",xlab="Resample SD")
95% Bootstrap t Confidence Interval for the population standard deviation:
Lower Limit: mean(sds)-qt(.975,samp.size-1)*sd(sds)
Upper Limit: mean(sds)+qt(.975,samp.size-1)*sd(sds)
95% Bootstrap Percentile Confidence Interval for the population standard deviation:
Lower Limit: quantile(sds,.025)
Upper Limit: quantile(sds,.975)

25. Read in the Baseball data set:
baseball <- read.table("FILE PATH/Baseball.txt",header=TRUE)
Create two vectors containing the salaries of pitchers and first basemen:
pitcher <- baseball[baseball$Position=="Pitcher",]$Salary
firstbase <- baseball[baseball$Position=="FirstBaseman",]$Salary
Perform the Wilcoxon Rank Sum Test:
wilcox.test(pitcher,firstbase,exact=FALSE)
Note that the test statistic is different from the one given in the chapter (W = 18). R uses a variant of the rank sum test statistic; however, the p-value matches that from Minitab.

26. Two-sided p-value for the rank sum test based on the normal approximation (W = 18, with mean 27.5 and standard deviation 4.787):
2*pnorm(18,27.5,4.787)

27. Two-sample t-test to compare the average salaries of pitchers and first basemen, not assuming equal variances: t.test(pitcher,firstbase)

28. Kruskal-Wallis test to determine if the distribution of salaries differs by position: kruskal.test(Salary~Position,data=baseball)


Chapter 2 The Two-Sample t-test, Regression, and ANOVA: Making Connections


Activities
4. Read data set Games1 into your R session and name it games1:
games1 <- read.table("FILE PATH/Games1.txt",header=TRUE)
Create a boxplot to compare the times for the color and standard games:
boxplot(Time~Type,data=games1,ylab="Seconds")
Create two vectors of Time observations for the Standard and Color groups:
Standard <- games1[games1$Type=="Standard",]$Time
Color <- games1[games1$Type=="Color",]$Time
Mean and standard deviation of the time to win the standard game:
mean(Standard)
sd(Standard)
Mean and standard deviation of the time to win the color game:
mean(Color)
sd(Color)

6.

Sample size, 12th observation, and 12th residual for the Standard group:
length(Standard)
Standard[12]
Standard[12]-mean(Standard)
Sample size, 12th observation, and 12th residual for the Color group:
length(Color)
Color[12]
Color[12]-mean(Color)

7.

Create a vector of residuals for all observations:
residuals <- c(Standard-mean(Standard),Color-mean(Color))
Create a histogram of the residuals:
hist(residuals)

8.

Compute the ratio of the maximum of the sample standard deviations for the Standard and Color groups to the minimum of sample standard deviations for the Standard and Color groups: max(c(sd(Standard),sd(Color)))/min(c(sd(Standard),sd(Color)))

9.

Note: for efficiency, the approach to this Question is identical to the approach in Question 11. Create a dummy variable where a value of 1 corresponds to the color distracter game: D1 <- (games1$Type=="Color")*1


Regress Time on the dummy variable D1 and name the regression object games.reg1:
games.reg1 <- lm(Time~D1,data=games1)
Obtain the vector of residuals:
residuals <- games.reg1$res
Plot the residuals versus the order in which the data were collected:
plot(games1$studentID,residuals,type="l",xlab="Order")
points(games1$studentID,residuals)
abline(h=0) #horizontal reference line at zero

10. Perform a two-sample t-test to compare average winning time between the Standard and Color groups (assuming equal variances): t.test(Time~Type,data=games1,var.equal = TRUE)

11. Create a dummy variable where a value of 1 corresponds to the color distracter game:
D1 <- (games1$Type=="Color")*1
Regress Time on the dummy variable D1 and name the regression object games.reg1:
games.reg1 <- lm(Time~D1,data=games1)
Estimated coefficients of the model of Time regressed on D1:
games.reg1$coef

12. The partial output from the following code contains the t-statistic and p-value for the test of the hypothesis that β1 = 0 versus the alternative β1 ≠ 0:
summary(games.reg1)
The 95% confidence interval for β1 is computed by:
Lower Limit: 2.55 - qt(.975,nrow(games1)-2)*1.1154
Upper Limit: 2.55 + qt(.975,nrow(games1)-2)*1.1154
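As a check on the hand computation, R's built-in confint() function (an alternative not mentioned in the original text) produces the same interval up to rounding:
confint(games.reg1, level=0.95) #confidence intervals for both regression coefficients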

13. Now create a dummy variable where a value of 1 corresponds to the Standard game:
D1 <- (games1$Type=="Standard")*1
Regress Time on the dummy variable D1 and name the regression object games.reg2:
games.reg2 <- lm(Time~D1,data=games1)
Estimated coefficients of the model of Time regressed on D1:
games.reg2$coef
The partial output from the following code contains the t-statistic and p-value for the test of the hypothesis that β1 = 0 versus the alternative β1 ≠ 0:
summary(games.reg2)
The 95% confidence interval for β1 is computed by:
Lower Limit: -2.55 - qt(.975,nrow(games1)-2)*1.1154
Upper Limit: -2.55 + qt(.975,nrow(games1)-2)*1.1154


14. Obtain the residuals from the regression equation obtained in Question 11. First, create a dummy variable where a value of 1 corresponds to the color distracter game:
D1 <- (games1$Type=="Color")*1
Now obtain the residuals from the regression model using D1 as the explanatory variable:
residuals <- lm(Time~D1,data=games1)$res
Construct a histogram of the residuals and a residual-versus-order plot:
par(mfrow=c(1,2))
hist(residuals)
plot(games1$studentID,residuals,xlab="Order",ylab="Residuals")
lines(games1$studentID,residuals)
abline(h=0) #horizontal reference line at zero

15. Create a scatterplot of Time versus Type of game with a superimposed regression line:
plot(D1,games1$Time)
abline(lm(Time~D1,data=games1)$coef)

19. Compute the grand mean and the two group means (ȳ.., ȳ1., and ȳ2., respectively):
mean(games1$Time)
mean(Standard)
mean(Color)

20. Estimate the effect size for the color distracter game:
mean(Color)-mean(games1$Time)
Estimate the effect size for the standard game:
mean(Standard)-mean(games1$Time)

21. Construct the main effects plot of game type:
par(mfrow=c(1,1))
plot(c(0,1),c(mean(Color),mean(Standard)),xlim=c(-.1,1.1),ylim=c(35.2,38.5),
     ylab="Mean Time",pch=16,xlab="Type",main="Main Effects Plot")
lines(c(0,1),c(mean(Color),mean(Standard)))
text(c(0,1),c(35.2,35.2),c("Color","Standard"))

22. Residual for the 20th observation from the standard group: Standard[20]-mean(Standard)

23. Obtain the results (F-statistic and p-value) of an analysis of variance: summary(aov(Time~Type,data=games1))

25. sqrt(5.2268)


26. Create a vector of residuals from the analysis of variance model:
residuals <- aov(Time~Type,data=games1)$res
Construct a histogram of the residuals, a plot of the residuals versus type of game, and a plot of the residuals versus order (all in one graphics window):
par(mfrow=c(1,3))
hist(residuals,xlab="Residuals")
plot(games1$Type,residuals,ylab="Residuals")
plot(games1$studentID,residuals,xlab="Order",ylab="Residuals")
lines(games1$studentID,residuals)
abline(h=0) #horizontal reference line at zero

Extended Activities
28. c) First, create a vector of the observations given in the question:
obs <- c(14,11,17,15,13)
Construct a normal probability plot of obs:
par(mfrow=c(1,1))
qqnorm(obs)
Note that in R the normal probability plot is constructed so that the percentiles of the standard normal distribution are on the horizontal axis and the values of the sample are on the vertical axis.

29. Read data set Normal into your session and rename it normal:
normal <- read.table("FILE PATH/Normal.txt",header=TRUE)
a) Create a histogram and normal probability plot for the first column of normal:
par(mfrow=c(1,2))
hist(normal[,1])
qqnorm(normal[,1])
qqline(normal[,1])
b) Column two of normal contains the same values as column one, except the five largest values have been doubled. Create a histogram and normal probability plot for the second column of normal:
par(mfrow=c(1,2))
hist(normal[,2])
qqnorm(normal[,2])
qqline(normal[,2])
c) Column three of normal contains the same values as column one, except the five smallest values have been doubled. Create a histogram and normal probability plot for the third column of normal:
par(mfrow=c(1,2))
hist(normal[,3])
qqnorm(normal[,3])
qqline(normal[,3])


30. a) Create a normal probability plot of the residuals from the games1 data in Question 26:
residuals <- aov(Time~Type,data=games1)$res
qqnorm(residuals)
qqline(residuals)
b-c) The following code constructs normal probability plots for 9 random samples of 40 observations taken from a standard normal distribution:
par(mfrow=c(3,3))
for (i in 1:9) {
  samp <- rnorm(40)
  qqnorm(samp)
}

31. Read in data set Emission:
emission <- read.table("FILE PATH/Emission.txt",header=TRUE)
a) Side-by-side boxplots of emission level by year:
par(mfrow=c(1,1))
boxplot(Emission~Year,data=emission)
Mean and standard deviation of hydrocarbon emission levels by year:
Pre63:
mean(emission[emission$Year=="Pre63",]$Emission)
sd(emission[emission$Year=="Pre63",]$Emission)
Yr63to67:
mean(emission[emission$Year=="Yr63to67",]$Emission)
sd(emission[emission$Year=="Yr63to67",]$Emission)
Yr68to69:
mean(emission[emission$Year=="Yr68to69",]$Emission)
sd(emission[emission$Year=="Yr68to69",]$Emission)
Yr70to71:
mean(emission[emission$Year=="Yr70to71",]$Emission)
sd(emission[emission$Year=="Yr70to71",]$Emission)
Yr72to74:
mean(emission[emission$Year=="Yr72to74",]$Emission)
sd(emission[emission$Year=="Yr72to74",]$Emission)
b) Take a log transformation of the Emission variable, and append it to the original emission data set:
log.em <- log(emission$Emission)
emission <- cbind(emission,log.em)
Side-by-side boxplots of log-emission level by year:
par(mfrow=c(1,1))
boxplot(log.em~Year,data=emission)
Mean and standard deviation of log-emission levels by year:
Pre63:
mean(emission[emission$Year=="Pre63",]$log.em)


sd(emission[emission$Year=="Pre63",]$log.em)
Yr63to67:
mean(emission[emission$Year=="Yr63to67",]$log.em)
sd(emission[emission$Year=="Yr63to67",]$log.em)
Yr68to69:
mean(emission[emission$Year=="Yr68to69",]$log.em)
sd(emission[emission$Year=="Yr68to69",]$log.em)
Yr70to71:
mean(emission[emission$Year=="Yr70to71",]$log.em)
sd(emission[emission$Year=="Yr70to71",]$log.em)
Yr72to74:
mean(emission[emission$Year=="Yr72to74",]$log.em)
sd(emission[emission$Year=="Yr72to74",]$log.em)
c) ANOVA summary table with F-test statistic and p-value:
summary(aov(log.em~Year,data=emission))

32. Read in the Weight data: weight <- read.table("FILE PATH/Weight.txt",header=TRUE,sep="\t") Follow the instructions for Question 33 to complete Parts a & b for Question 32.

33. Read in the RegTrans data and rename it regtrans:
regtrans <- read.table("FILE PATH/RegTrans.txt",header=TRUE)
Parts a and b are discussed for the first data set in the regtrans file; they can easily be replicated for the remaining data sets in the file. Fit the regression model of Y1 on X1:
reg1 <- lm(regtrans[,2]~regtrans[,1])
a) Scatterplot of X1 versus Y1 with a superimposed regression line:
plot(regtrans[,1],regtrans[,2])
abline(reg1$coef)
Create a vector of residuals from the fitted regression model:
residuals <- reg1$res
Plot of the residuals versus the explanatory variable:
plot(regtrans[,1],residuals,xlab="X")
Plot of the residuals versus fitted values:
plot(reg1$fit,residuals,xlab="Fitted Values")
b) Construct a normal probability plot of the residuals:
qqnorm(residuals)


c)

Based on the plot of the residuals versus the explanatory variable, we'll try a quadratic transformation of the explanatory variable:
x.sq <- regtrans[,1]^2
Fit a quadratic regression model that includes the lower-order and quadratic terms:
quad1 <- lm(regtrans[,2]~regtrans[,1]+x.sq)
Now let's examine residual plots of the quadratic model. Obtain the vector of residuals from the quadratic model:
residuals <- quad1$res
Plot of the residuals versus fitted values:
plot(quad1$fit,residuals,xlab="Fitted Values")
Construct a normal probability plot of the residuals:
qqnorm(residuals)
qqline(residuals)

34. Compute the p-value using the t-distribution in R, assuming a test statistic equal to 2.2862 with 38 degrees of freedom: 2*(1-pt(2.2862,38))

35. Follow the instructions provided for Question 34 to find the p-value.

37. To obtain the ANOVA summary table and extract the various sums of squares and mean squares: summary(aov(Time~Type,data=games1))

38. (Sample) variance of the completion times: var(games1$Time) Then the total sum of squares is: (40-1)*var(games1$Time)

39. Use R to find the p-value of the ANOVA F-test, assuming a test statistic of 5.2268, df(Type) = 1, and df(Error) = 38:
1-pf(5.2268,1,38)

40. Based on the t-test output in R, find the standard error of the regression coefficient. First, create an R object containing the t-test output:
t.test.games <- t.test(Time~Type,data=games1,var.equal=TRUE)
Then the standard error of the regression coefficient can be found by:
(t.test.games$estimate[1]-t.test.games$estimate[2])/t.test.games$statistic
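The value obtained this way can be cross-checked against the regression output from Question 11, where the same standard error appears in the coefficient table (a check not mentioned in the original instructions):
summary(games.reg1)$coef #the "Std. Error" column contains the standard error of each coefficient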


Exercises
E.12. Read in data set Hodgkins:
hodgkins <- read.table("FILE PATH/Hodgkins.txt",header=TRUE)
a) Side-by-side boxplots of bradykininogen levels by subject type:
par(mfrow=c(1,1))
boxplot(Blevel~Type,data=hodgkins)
Mean and standard deviation of bradykininogen levels by subject type:
Normal subjects:
mean(hodgkins[hodgkins$Type=="Normal",]$Blevel)
sd(hodgkins[hodgkins$Type=="Normal",]$Blevel)
Active subjects:
mean(hodgkins[hodgkins$Type=="Active",]$Blevel)
sd(hodgkins[hodgkins$Type=="Active",]$Blevel)
Inactive subjects:
mean(hodgkins[hodgkins$Type=="Inactive",]$Blevel)
sd(hodgkins[hodgkins$Type=="Inactive",]$Blevel)
b) Normal probability plot of the residuals:
residuals <- aov(Blevel~Type,data=hodgkins)$res
qqnorm(residuals)
qqline(residuals)
c) Take a log transformation of the Blevel variable, and append it to the original hodgkins data set:
log.blevel <- log(hodgkins$Blevel)
hodgkins <- cbind(hodgkins,log.blevel)
Side-by-side boxplots of log-Blevel by subject type:
par(mfrow=c(1,1))
boxplot(log.blevel~Type,data=hodgkins)
Means and standard deviations of the log-transformed levels can be computed following the code of previous questions.
d) Normal probability plot of the residuals from the ANOVA with transformed response:
residuals <- aov(log.blevel~Type,data=hodgkins)$res
qqnorm(residuals)
e) ANOVA summary results of the model using the log-transformed response variable:
summary(aov(log.blevel~Type,data=hodgkins))


Chapter 3 Multiple Regression: How Much is Your Car Worth?


Activities
1. Read in the cars data set, and attach the data to your R session:
cars <- read.table("FILE PATH/cars.txt",header=TRUE)
attach(cars)
Create a scatterplot between Mileage and Price:
plot(Mileage,Price)

2.

Create a "regression" object, named cars.lm, that will contain summary information about the fitted regression model of Price on Mileage: cars.lm <- lm(Mileage~Price) The estimated coefficients, R^2, t-statistics, and other results about the model can be extracted from cars.lm using the summary()command: summary(cars.lm)

3.

Create a vector of residuals from the cars.lm model:
residuals <- cars.lm$res
Residual for the first car (Buick Century):
residuals[1]

4.

a-b) Follow the instructions provided for Question 2.
c) Note that the stepwise regression method implemented in R does not use the improvement in R^2 as the basis for adding a new variable to the model; it uses a criterion called AIC, and models with smaller AIC values are preferred. To perform stepwise regression for the cars data in R, first create a regression object that contains all relevant explanatory variables:
full.model <- lm(Price~Cyl+Liter+Doors+Cruise+Sound+Leather+Mileage)
Now use the step function, and store the results in the object step.model:
step.model <- step(full.model)
To examine the final selected model, apply the summary command to step.model:
summary(step.model)
Note that the final model selected contains all explanatory variables except Liter.
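By default step() searches in both directions; its direction argument (part of the standard step() interface) can restrict the search if desired:
step.model2 <- step(full.model, direction="backward") #backward elimination only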

5.

Best subsets regression requires loading the R package leaps: library(leaps)


The function that will be used to perform best subsets regression is also called leaps. The following code can be used to select the "best" models based on the lowest value of Cp. You can find additional information about various options for the leaps function by using the help function: help(leaps). First, create a "design matrix" whose columns contain the values of the explanatory variables of interest:
X <- cbind(Cyl,Liter,Doors,Cruise,Sound,Leather,Mileage)
Now use the leaps function, and store the results in the object best.lm:
best.lm <- leaps(X,Price,method="Cp")
To identify the best models, examine the output of the matrix best.lm$which and the vector best.lm$Cp. The left-most column of best.lm$which displays the number of explanatory variables in the model (note that several models with the same number of explanatory variables are presented). The top row of the matrix (values 1 through 7) lists the explanatory variables of interest, in the order Cyl, Liter, Doors, Cruise, Sound, Leather, and Mileage. Each subsequent row of the matrix shows TRUE if explanatory variable 1 through 7 was selected, and FALSE if it was not. The output in best.lm$Cp contains the Cp values that correspond to the selected models presented in the best.lm$which matrix. For example, the best model with one explanatory variable contains the predictor Cyl and has a Cp value of 171.958749; the best model with 6 explanatory variables does not contain Liter and has a Cp value of 6.824315. Note that it can be difficult to match the Cp value to the actual model selected when you examine the output of best.lm$which and best.lm$Cp separately. An easier way is to examine the output of both simultaneously:
cbind(best.lm$Cp,best.lm$which)
The Cp values then reside next to the selected models.

7.

We will use the regression model that contains all explanatory variables except Liter:
cars.lm2 <- lm(Price~Cyl+Doors+Cruise+Sound+Leather+Mileage)
Create a vector of predicted (fitted) values of Price:
pred.price <- cars.lm2$fitted
Create a vector of residuals:
residuals <- cars.lm2$res
Construct plots of residuals versus fitted values, and plots of residuals versus each of the explanatory variables in the model (each plot with a horizontal line through Y=0):
par(mfrow=c(2,4))
plot(pred.price,residuals)
abline(h=0)
plot(Cyl,residuals)
abline(h=0)
plot(Doors,residuals)
abline(h=0)
plot(Cruise,residuals)
abline(h=0)
plot(Sound,residuals)
abline(h=0)


plot(Leather,residuals)
abline(h=0)
plot(Mileage,residuals)
abline(h=0)

8.

Create regression models with the log-transformed (log base 10) and square-root-transformed Price:
cars.log <- lm(log10(Price)~Cyl+Doors+Cruise+Sound+Leather+Mileage)
cars.sqrt <- lm(sqrt(Price)~Cyl+Doors+Cruise+Sound+Leather+Mileage)
Results of the new models can be obtained using the summary() command, and residual plots can be constructed following code similar to that provided in Question 7. Remember to name the vectors of residuals from the transformed-response models something different from "residuals", which was used in Question 7 (otherwise you will overwrite the original vector of residuals).

9.

Construct a simple linear regression model of log Price (henceforth called TPrice) on Mileage:
cars.log2 <- lm(log10(Price)~Mileage)
Create a vector of residuals for the transformed-response model:
res.log2 <- cars.log2$res
Construct a plot of the residuals versus time/order (with a line connecting the residuals):
par(mfrow=c(1,1))
plot(seq(1:length(res.log2)),res.log2,xlab="Order")
lines(seq(1:length(res.log2)),res.log2)

10. Create a vector of residuals from the log-transformed response model that included the 6 explanatory variables selected in Question 5 (henceforth called the TPrice model):
res.log <- cars.log$res
Construct a plot of the residuals versus time/order:
plot(seq(1:length(res.log)),res.log,xlab="Order")
lines(seq(1:length(res.log)),res.log)

11. Create a vector of fitted values from the TPrice model:
cars.log.fit <- cars.log$fit
Plots of residuals versus fitted values and versus Mileage:
par(mfrow=c(1,2))
plot(cars.log.fit,res.log,xlab="Fitted Values")
plot(Mileage,res.log,xlab="Mileage")
a) To identify the row numbers that correspond to the points on the plot of the residuals versus fitted values:
par(mfrow=c(1,1))
plot(cars.log.fit,res.log,xlab="Fitted Values")
text(cars.log.fit,res.log,labels=seq(1:length(Mileage)))


Find the row numbers of the observations that correspond to the largest cluster of outliers (the convertible Cadillacs):
subset(cars, Make=="Cadillac" & Type=="Convertible")
Note that the row numbers corresponding to the outlying observations are in the left-most column of the output.

12. Fit the TPrice model without the observations corresponding to the largest cluster of outliers, i.e. without the convertible Cadillacs (observations 151-160):
cars.log3 <- lm(log10(Price)~Cyl+Doors+Cruise+Sound+Leather+Mileage, data=cars[-c(151:160),])
Use the summary() command to examine the estimated coefficients, and compare them to those from the model with all the observations.

13. Construct a simple linear regression model of log Price on Mileage:
cars.log2 <- lm(log10(Price)~Mileage)
Create a vector of residuals for the transformed-response model:
res.log2 <- cars.log2$res
Construct a histogram and normal probability plot of the residuals from the model of TPrice regressed on Mileage:
par(mfrow=c(1,2))
hist(res.log2)
qqnorm(res.log2)

14. Results for the following models can be obtained using the summary() command.
a) Fit the regression model of Price on Mileage and Liter:
model1 <- lm(Price~Mileage+Liter)
b) Fit the regression model of Price on Cyl:
model2 <- lm(Price~Cyl)
c) Fit the regression model of Price on Mileage, Liter, and Cyl:
model3 <- lm(Price~Mileage+Liter+Cyl)

16. Plot Cyl versus Liter:
par(mfrow=c(1,1))
plot(Liter,Cyl)
Correlation between Cyl and Liter:
cor(Liter,Cyl)

17. Define the log-transformed response variable, TPrice:
TPrice <- log10(Price)
Construct boxplots of TPrice versus the categorical explanatory variables Make, Model, Trim, and Type:
par(mfrow=c(2,2))
boxplot(TPrice~Make)


boxplot(TPrice~Model)
boxplot(TPrice~Trim)
boxplot(TPrice~Type)

18. Create indicator variables for the categorical variable Make:
DM1 <- (Make=="Buick")*1
DM2 <- (Make=="Cadillac")*1
DM3 <- (Make=="Chevrolet")*1
DM4 <- (Make=="Pontiac")*1
DM5 <- (Make=="SAAB")*1
DM6 <- (Make=="Saturn")*1

19. Construct a linear regression model with response variable TPrice and explanatory variables Mileage, Liter, Cadillac, Chevrolet, Pontiac, SAAB, and Saturn:
cars.log4 <- lm(TPrice~Mileage+Liter+DM2+DM3+DM4+DM5+DM6)
Examine the results of the model using the summary() command.

20. Create indicator variables for the categorical variable Type:
DT1 <- (Type=="Convertible")*1
DT2 <- (Type=="Coupe")*1
DT3 <- (Type=="Hatchback")*1
DT4 <- (Type=="Sedan")*1
DT5 <- (Type=="Wagon")*1

Construct a linear regression model with response variable TPrice and explanatory variables Make (leave out Buick), Type (leave out Hatchback), Liter, Doors, Cruise, Sound, Leather, and Mileage:
cars.log5 <- lm(TPrice~DM2+DM3+DM4+DM5+DM6+DT1+DT2+DT4+DT5+Liter+Doors+Cruise+Sound+Leather+Mileage)
Examine the results of the model using the summary() command. Create a vector of residuals from this model:
res.log5 <- cars.log5$res
Residual plots can be constructed using the instructions in previous problems.

21. Follow the instructions provided for previous activities.

22. Create a subset of the cars data set corresponding to the Chevrolet Cavalier Sedan models:
cav <- cars[Model=="Cavalier",]
cav <- cav[cav$Type=="Sedan",]
Create a regression model to predict Price from Mileage using the Cavalier data:
cav.lm <- lm(Price~Mileage,data=cav)
Obtain the sums of squares:
anova(cav.lm)
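Equivalently, the two subsetting steps can be combined into one call to the base R subset() function (an alternative to the bracket notation above):
cav <- subset(cars, Model=="Cavalier" & Type=="Sedan")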


Extended Activities
23. Create a vector of values of y - y-hat:
res <- cav.lm$res
Create a vector of values of y-hat - y-bar:
reg <- cav.lm$fit-mean(cav$Price)
Examine the result of:
sum(res*reg)
(For a least squares fit with an intercept, this sum is zero up to rounding error: the residuals are orthogonal to the deviations of the fitted values from their mean.)

30. For the Cavalier data, fit the regression model to predict Price from Mileage and Cruise:
cav.lm2 <- lm(Price~Mileage+Cruise,data=cav)
Obtain the F-statistic and p-value using the summary() command:
summary(cav.lm2)

31. Create indicator variables for the categorical variable Trim for the cav data set (leave out LS Sedan 4D):
DTr1 <- (cav$Trim=="LS Sport Sedan 4D")*1
DTr2 <- (cav$Trim=="Sedan 4D")*1
Construct the full regression model that includes Mileage, Cruise, and Trim:
cav.lm3 <- lm(Price~Mileage+Cruise+DTr1+DTr2,data=cav)
To test whether Trim is useful, use the anova() command to compare the full model cav.lm3 to the reduced model cav.lm2:
anova(cav.lm3,cav.lm2)

35. Read the data set 4-8Cyl into your R session:
cyl48 <- read.table("FILE PATH/4-8Cyl.txt",header=TRUE,sep="\t")
Construct the regression model of Price on Mileage and Cyl:
cyl.lm1 <- lm(Price~Mileage+Cyl,data=cyl48)
Construct the regression model of Price on Mileage, Cyl, and the interaction between Mileage and Cyl:
cyl.lm2 <- lm(Price~Mileage*Cyl,data=cyl48)
a) Use the summary() command for each model to determine whether the adjusted R^2 increased.
d) Use the anova() command and follow the code in Question 31 to determine whether the interaction term should be included in the model.

36. Create indicator variables for the categorical variable Make for the cyl48 data set (leave out Pontiac):
Cadillac <- (cyl48$Make=="Cadillac")*1
SAAB <- (cyl48$Make=="SAAB")*1


Construct the regression model of Price on Mileage, Cadillac, and SAAB:
cyl.lm3 <- lm(Price~Mileage+Cadillac+SAAB,data=cyl48)
a) Scatterplot of Price versus Mileage by Make:
plot(cyl48$Mileage,cyl48$Price,xlab="Mileage",ylab="Price")
b) Construct the regression model of Price on Mileage, Cadillac, SAAB, and the interaction terms between Mileage and Cadillac, and between Mileage and SAAB:
cyl.lm4 <- lm(Price~Mileage*Cadillac+Mileage*SAAB,data=cyl48)
Use the anova() command to compare models cyl.lm3 and cyl.lm4 to determine if the interaction terms are significant.
Scatterplot of Price versus Mileage:
plot(cyl48$Mileage,cyl48$Price,xlab="Mileage",ylab="Price")
Superimpose regression lines for the Cadillacs and SAABs. First obtain the estimated coefficients from the interaction model:
coefs <- cyl.lm4$coef
Superimpose the regression line (solid line) corresponding to Cadillacs:
abline(coefs[1]+coefs[3],coefs[2]+coefs[5])
Superimpose the regression line (dashed line) corresponding to SAABs:
abline(coefs[1]+coefs[4],coefs[2]+coefs[6],lty=2)

37. Read the MPG data into your R session:
mpg <- read.table("FILE PATH/MPG.txt",header=TRUE)
Create the regression model to predict mpg from speed and displacement:
mpg.lm <- lm(mpg~speed+displacement,data=mpg)
a) Use the summary() command to examine the results of the fitted model.
b) Create a vector of residuals from the model:
res.mpg <- mpg.lm$res
Construct the plots of residuals versus speed and residuals versus displacement:
par(mfrow=c(1,2))
plot(mpg$speed,res.mpg)
plot(mpg$displacement,res.mpg)
c) Normal probability plot of the residuals:
par(mfrow=c(1,1))
qqnorm(res.mpg)

38. Create the regression model to predict mpg from speed:
mpg.lm2 <- lm(mpg~speed,data=mpg)
a) Use the summary() command to get the coefficients and R^2 values.


b) Create a vector of residuals from the model mpg.lm2:
res.mpg <- mpg.lm2$res
Construct the plots of residuals versus speed and residuals versus displacement:
par(mfrow=c(1,2))
plot(mpg$speed,res.mpg)
plot(mpg$displacement,res.mpg)
c) Normal probability plot of the residuals:
par(mfrow=c(1,1))
qqnorm(res.mpg)

39. Create the regression model to predict mpg from displacement:
mpg.lm3 <- lm(mpg~displacement,data=mpg)
a) Use the summary() command to get the coefficients and R^2 values.
b) Create a vector of residuals from the model mpg.lm3:
res.mpg <- mpg.lm3$res
Construct the plots of residuals versus speed and residuals versus displacement:
par(mfrow=c(1,2))
plot(mpg$speed,res.mpg)
plot(mpg$displacement,res.mpg)

40. Create a quadratic term (speedsq) for speed:
speedsq <- mpg$speed^2
Create the quadratic regression model to predict mpg from speed, displacement, and speedsq:
mpg.lm4 <- lm(mpg~speed+displacement+speedsq,data=mpg)
a) Use the summary() command to get the coefficients and R^2 values.
b) Create a vector of residuals from the model mpg.lm4:
res.mpg <- mpg.lm4$res
Construct the plots of residuals versus speed and residuals versus displacement:
par(mfrow=c(1,2))
plot(mpg$speed,res.mpg)
plot(mpg$displacement,res.mpg)
c) Normal probability plot of the residuals:
par(mfrow=c(1,1))
qqnorm(res.mpg)

41. If you are working in a new R session, you will need to read in the cars data again; be sure to attach it to your session with the attach() command, so that the following code will work. You will also have to create the TPrice (log-transformed Price) variable again. Create quadratic term, Mileagesq, for Mileage: Mileagesq <- Mileage^2


Regression model to predict TPrice from Mileage: cars.lm1 <- lm(TPrice~Mileage) Regression model to predict TPrice from Mileage and Mileagesq: cars.lm2 <- lm(TPrice~Mileage+Mileagesq) a) Use the summary() command to determine the increase in R^2 between the two models. b) Follow the instructions provided for previous activities to construct residual plots.

42. Create the interaction between Mileage and Cyl: MileCyl <- Mileage*Cyl Regression model to predict TPrice from Mileage, Cyl, and MileCyl: cars.lm3 <- lm(TPrice~Mileage+Cyl+MileCyl) Follow the instructions provided for previous activities to determine if the interaction term improves the model, and to create residual plots.
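One way to judge the interaction term is to fit the model without MileCyl and compare with anova(); a sketch, assuming the cars data are attached and cars.lm3 has been fit (the name cars.lm0 is hypothetical):
cars.lm0 <- lm(TPrice~Mileage+Cyl)   # reduced model, no interaction
anova(cars.lm0, cars.lm3)            # a small p-value favors keeping MileCyl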

43. Follow the instructions provided for previous activities.


Chapter 4 Designing Factorial Experiments: Microwave Popcorn


Activities
4. Read the Popcorn data set into your R session and attach it:
popcorn <- read.table("FILE PATH/Popcorn.txt",header=TRUE,sep="\t")
attach(popcorn)
Compute descriptive statistics by group:
tapply(PopRate, Brand:Time, mean)
tapply(PopRate, Brand:Time, median)
tapply(PopRate, Brand:Time, sd)
tapply(PopRate, Brand:Time, IQR)
tapply(PopRate, Brand:Time, range)

Note: When working with the tapply() function, the syntax Brand:Time should be used when both Brand and Time are categorical variables. If the values of Time are input as 105 and 135, then the variable will be considered quantitative. Thus, the levels of Time should be listed as 105 Sec and 135 Sec (see the sketch below for converting a numeric Time to a factor).
Find the difference between the average PopRate for the two cooking times within each brand:
mean(PopRate[(Brand=="Fastco") & (Time=="135 Sec")]) -
  mean(PopRate[(Brand=="Fastco") & (Time=="105 Sec")])
mean(PopRate[(Brand=="Pop Secret") & (Time=="135 Sec")]) -
  mean(PopRate[(Brand=="Pop Secret") & (Time=="105 Sec")])
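If your copy of the data stores Time numerically (105 and 135), one way to convert it to a factor is sketched here; the level labels are assumed to match those used above:
detach(popcorn)
popcorn$Time <- factor(popcorn$Time, levels=c(105,135),
  labels=c("105 Sec","135 Sec"))
attach(popcorn)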

5. The average popping rates for each level of Brand and Time, as well as the overall average PopRate:
mean(PopRate[Brand=="Fastco"])
mean(PopRate[Brand=="Pop Secret"])
mean(PopRate[Time=="105 Sec"])
mean(PopRate[Time=="135 Sec"])
mean(PopRate)

13. The two-way ANOVA is conducted using the aov() command. We will store the ANOVA results in the object pop.aov1: pop.aov1 <- aov(PopRate~Brand+Time+Brand:Time) Use the summary() command to obtain the ANOVA table: summary(pop.aov1)

16. Create individual value plots of the eight factor-level groups:
trt3way <- as.integer(Brand:Time:Microwave)
plot(trt3way, PopRate, pch=20, xaxt="n",
  ylab="PopRate = Kernel Popping Rate (per 100)",
  xlab="Factor-Level Combination (Group)")
axis(1,at=c(1:8),
  labels=c("F105L","F105R","F135L","F135R","PS105L","PS105R","PS135L","PS135R"))


Note: The labels on your plot might not all appear when you first create the graph (R hides some labels to prevent clutter). You can expand the graph window using the mouse to view all labels.
d) Create a vector of residuals from the 3-way ANOVA model:
res <- aov(PopRate~Brand+Time+Microwave+Brand:Microwave+Time:Microwave)$res
Now construct a normal probability plot of the residuals:
qqnorm(res)

17. Calculate the eight group standard deviations:
tapply(PopRate, Brand:Time:Microwave, sd)
To find the maximum and minimum standard deviations:
max(tapply(PopRate, Brand:Time:Microwave, sd))
min(tapply(PopRate, Brand:Time:Microwave, sd))

18. a) The following ANOVA object will contain the results of all six hypotheses: pop.3way.aov <- aov(PopRate ~ Brand + Time + Microwave + Brand:Time + Brand:Microwave + Time:Microwave) Display the results of the tests using the summary() command: summary(pop.3way.aov)

Extended Activities
25. Read in the PaperTowels data and attach it to your R session: paper <- read.table("FILE PATH/PaperTowels.txt",header=TRUE) attach(paper) Calculate means by factor level: meanD <- mean(paper[Brand=="Decorator",]$Strength) meanC <- mean(paper[Brand=="Comfort",]$Strength) mean0 <- mean(paper[as.factor(Water)=="0",]$Strength) mean5 <- mean(paper[as.factor(Water)=="5",]$Strength) mean15 <- mean(paper[as.factor(Water)=="15",]$Strength) Construct the main effects plot: par(mfrow=c(1,2)) plot(c(1,2),c(0,3500),xaxt="n",type="n",xlab="Brand", ylab="Paper Towel Breaking Strength (grams)", main="Main Effects Plot: Brand") axis(1,at=c(1:2),labels=c("Comfort","Decorator")) points(c(1,2),c(meanC,meanD),pch=".",cex=3) lines(c(1,2),c(meanC,meanD),lty=1) plot(c(1,2,3),c(0,1500,3500),xaxt="n",type="n",xlab="Water", ylab="Paper Towel Breaking Strength (grams)", main="Main Effects Plot: Water") axis(1,at=c(1:3),labels=c("0 Drops","5 Drops","15 Drops")) points(c(1,2,3),c(mean0,mean5,mean15),pch=".",cex=3)


lines(c(1,2,3),c(mean0,mean5,mean15),lty=1)

28. Construct the interaction plot:
par(mfrow=c(1,1))
plot(c(1,2),c(0,3500),xaxt="n",type="n",xlab="Brand",
  ylab="Paper Towel Breaking Strength (grams)",main="Interaction Plot")
axis(1,at=c(1:2),labels=c("Comfort","Decorator"))
points(c(1,2),c(3205.7692,2219.2308),pch=".",cex=3)
lines(c(1,2),c(3205.7692,2219.2308),lty=1)
points(c(1,2),c(1711.5385,704.8077),pch=".",cex=3)
lines(c(1,2),c(1711.5385,704.8077),lty=2)
points(c(1,2),c(400.9615,446.1538),pch=".",cex=3)
lines(c(1,2),c(400.9615,446.1538),lty=3)
legend(1.6,3500,c("0 Drops","5 Drops","15 Drops"),lty=1:3)

35. b) Calculate the six group standard deviations: tapply(Strength, Brand:as.factor(Water), sd) c) Calculate the factor-level group means: tapply(Strength, Brand:as.factor(Water), mean)

d) Transform the response variable using the natural log and square root transformations: ln.strength <- log(Strength) sqrt.strength <- sqrt(Strength) Now check the equal variances assumption by first finding the standard deviations of the transformed response for the 6 factor-level groups: tapply(ln.strength, Brand:as.factor(Water), sd) tapply(sqrt.strength, Brand:as.factor(Water), sd)

36. Construct boxplots for the log and square root transformed responses for the 6 factor-level groups: plot(Brand:as.factor(Water),ln.strength) plot(Brand:as.factor(Water),sqrt.strength)

37. Perform the two-way ANOVA on the square root transformed paper towel data, and store the results in the object aov.paper:
aov.paper <- aov(sqrt.strength~Brand+as.factor(Water)+Brand:as.factor(Water))
Examine the ANOVA summary table to obtain the F-statistics and p-values for the appropriate hypothesis tests:
summary(aov.paper)
Create a vector of residuals from the model:
res <- aov.paper$res
Construct a normal probability plot of the residuals to investigate the normality assumption:
qqnorm(res)


Chapter 5 Block, Split-Plot and Repeated Measure Designs: What Influences Memory?
Activities
1. Read in and attach the Memory data into your R session:
memory <- read.table("FILE PATH/Memory.txt",header=TRUE,sep="\t")
attach(memory)
b) Analyze as if the data were from a randomized factorial design:
wrongCRD <- aov(Score ~ Word.List*Distracter)
Examine the results using the summary() command:
summary(wrongCRD)
c) Create side-by-side boxplots:
boxplot(Score ~ reorder(Word.List:Distracter, Score, median),
  ylab="Avg Score (Recalled Words out of 20)")
Examine the standard deviations of the groups to investigate the equal variances assumption:
tapply(Score, Word.List:Distracter, sd)
Note: When working with the tapply() function, the syntax Word.List:Distracter should be used when both Word.List and Distracter are categorical variables.
d) Create a vector of residuals from the model:
res <- wrongCRD$res
Create a normal probability plot of the residuals:
qqnorm(res)
Create main effects plots:
par(mfrow=c(1,2))
plot(c(1:2),seq(7,9,length=2),xaxt="n",type="n",xlab="Word List",
  ylab="Mean Number of Words Recalled")
axis(1,at=c(1:2),labels=c("Abstract","Concrete"))
points(c(1:2),tapply(Score, Word.List, mean))
lines(c(1:2),tapply(Score, Word.List, mean))
plot(c(1:2),seq(7,9,length=2),xaxt="n",type="n",xlab="Distracter",
  ylab="Mean Number of Words Recalled")
axis(1,at=c(1:2),labels=c("Math","Poetry"))
points(c(1:2),tapply(Score, Distracter, mean))
lines(c(1:2),tapply(Score, Distracter, mean))
Create an interaction plot:
par(mfrow=c(1,1))
interaction.plot(Word.List,Distracter,Score,xlab="Word List",
  ylab="Mean Number of Words Recalled")

2. Conduct an ANOVA assuming the data are from a block design:
block <- aov(Score ~ as.factor(Student) + Word.List*Distracter)


Examine the results using the summary() command: summary(block) c) Construct main effects plots: par(mfrow=c(1,3)) plot(c(1:12),seq(5,11,length=12),xaxt="n",type="n",xlab="Student", ylab="Mean Number of Words Recalled") axis(1,at=c(1:12),labels=c(1:12)) points(c(1:12),tapply(Score, as.factor(Student), mean)) lines(c(1:12),tapply(Score, as.factor(Student), mean)) plot(c(1:2),seq(5,11,length=2),xaxt="n",type="n",xlab="Word List", ylab="Mean Number of Words Recalled") axis(1,at=c(1:2),labels=c("Abstract","Concrete")) points(c(1:2),tapply(Score, Word.List, mean)) lines(c(1:2),tapply(Score, Word.List, mean)) plot(c(1:2),seq(5,11,length=2),xaxt="n",type="n",xlab="Distracter", ylab="Mean Number of Words Recalled") axis(1,at=c(1:2),labels=c("Math","Poetry")) points(c(1:2),tapply(Score, Distracter, mean)) lines(c(1:2),tapply(Score, Distracter, mean)) d) To create a normal probability plot of residuals, see Question 1d.

8. Conduct an ANOVA, treating the memory data as a split-plot design:
splitplot <- aov(Score ~ Major + Word.List*Distracter +
  Error(Major/as.factor(Student2)))
Examine the results of the analysis:
summary(splitplot)
To test the significance of the whole-plot factor (Major), compute the F-statistic, which is the ratio of the whole-plot MS to the split-plot MS, and find the corresponding p-value:
F <- 3.1875/8.208
1-pf(F,3,8)
To construct a normal probability plot of the residuals, see Question 1d.
Construct main effects plots:
par(mfrow=c(1,3))
plot(c(1:4),seq(7,9,length=4),xaxt="n",type="n",xlab="Major",
  ylab="Mean Number of Words Recalled")
axis(1,at=c(1:4),labels=levels(memory$Major))
points(c(1:4),tapply(Score, Major, mean))
lines(c(1:4),tapply(Score, Major, mean))
plot(c(1:2),seq(7,9,length=2),xaxt="n",type="n",xlab="Word List",
  ylab="Mean Number of Words Recalled")
axis(1,at=c(1:2),labels=c("Abstract","Concrete"))
points(c(1:2),tapply(Score, Word.List, mean))
lines(c(1:2),tapply(Score, Word.List, mean))


plot(c(1:2),seq(7,9,length=2),xaxt="n",type="n",xlab="Distracter", ylab="Mean Number of Words Recalled") axis(1,at=c(1:2),labels=c("Math","Poetry")) points(c(1:2),tapply(Score, Distracter, mean)) lines(c(1:2),tapply(Score, Distracter, mean)) To construct an interaction plot, follow the steps provided for Question 1d. To construct a Normal probability plot of the residuals, follow the steps provided for Question 1d.

11. Model that incorrectly assumes Student2 is a fixed factor and Student2 is not nested within Major: fixedStu2 <- aov(Score~Major+as.factor(Student2)+Word.List*Distracter) Examine results of incorrect model: summary(fixedStu2)

12. Model that incorrectly assumes Student2 is a fixed factor, but correctly assumes Student2 is nested within Major:
nestedStu2 <- aov(Score~Major+as.factor(Student2)+Word.List*Distracter+
  Error(Major/as.factor(Student2)))
Examine the results of the incorrect model:
summary(nestedStu2)
To test the significance of Major, compute the F-statistic and find the corresponding p-value:
F <- 3.1875/2.44
1-pf(F,3,33)

15. Analyze the memory data using a split-plot design:
splitplot2 <- aov(Score ~ Major + Word.List*Distracter + Major:Word.List +
  Major:Distracter + Error(Major/as.factor(Student2)))
summary(splitplot2)
To test the significance of Major, compute the F-statistic and find the corresponding p-value:
F <- 3.1875/8.208
1-pf(F,3,8)

19. To construct a Normal probability plot of the residuals, follow the steps provided for Question 1d.

20. Individual values plot: allcombs <- as.integer(Major:Word.List:Distracter) plot(allcombs,Score, pch=20, xaxt="n",ylab="Score", xlab="Factor/Level Group", main="Individual Value Plot") axis(1,at=c(1:16),labels=c("CS.ab.m","CS.ab.p","CS.c.m","CS.c.p", "Eng.ab.m","Eng.ab.p","Eng.c.m","Eng.c.p","H.ab.m","H.ab.p","H.c.m", "H.c.p","Ma.ab.m","Ma.ab.p","Ma.c.m","Ma.c.p"),cex.axis=.7,las=2)


21. Individual value plot of subject average score by major: mean.student <- tapply(Score, as.factor(Student), mean) plot(c(1,1,1,2,2,2,3,3,3,4,4,4),mean.student,pch=20,xaxt="n", ylab="Student Average Score", xlab="Major") axis(1,at=c(1:4), labels=c("Math","CS","Hist","Eng"))

Extended Activities
25. Read in and attach the Flowers data into your R session: flowers <- read.table("FILE PATH/Flowers.txt",header=TRUE) attach(flowers) Analyze the block design: flower.block <- aov(Days~as.factor(Store)+Water) Examine the results of the model: summary(flower.block) To construct a Normal probability plot of the residuals, follow the steps provided for Question 1d. Create an individual value plot: allcombs <- as.integer(as.factor(Store):Water) plot(allcombs,Days, pch=20, xaxt="n",ylab="Days", xlab="Factor/Level Group", main="Individual Value Plot") axis(1,at=c(1:9),labels=c("1.asp","1.floral","1.plain","2.asp","2.floral", "2.plain","3.asp","3.floral","3.plain"),cex.axis=.7,las=2)

26. Grand mean: mean(Days) Average for each Water level: tapply(Days, as.factor(Water), mean) Water effects can be computed using the grand mean and Water level means calculated above.
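A minimal sketch of the effect computation (each effect is a level mean minus the grand mean):
tapply(Days, as.factor(Water), mean) - mean(Days)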

31. Read in and attach the Popcorn data into your R session:
popcorn <- read.table("FILE PATH/Popcorn.txt",header=TRUE)
attach(popcorn)
Conduct an ANOVA on the popcorn data:
pop.split <- aov(PopRate~Brand+Temp+Brand:Temp+Error(Brand/as.factor(Box)))
Examine the results:
summary(pop.split)
To construct a Normal probability plot of the residuals, follow the steps provided for Question 1d.


Create an individual value plot: allcombs <- as.integer(Brand:Temp) plot(allcombs,PopRate, pch=20, xaxt="n",ylab="Percent Popped", xlab="Factor/Level Group",main="Individual Value Plot") axis(1,at=c(1:4), labels=c("Exp Frig","Exp Room","Gen Frig","Gen Room")) Create main effects plots: par(mfrow=c(1,2)) plot(c(1:2),seq(83.5,84.5,length=2),xaxt="n",type="n",xlab="Brand", ylab="Mean Percent Popped") axis(1,at=c(1:2),labels=c("Expensive","Generic")) points(c(1:2),tapply(PopRate, Brand, mean)) lines(c(1:2),tapply(PopRate, Brand, mean)) plot(c(1:2),seq(83.5,84.5,length=2),xaxt="n",type="n", xlab="Temp",ylab="Mean Percent Popped") axis(1,at=c(1:2),labels=c("Frig","Room")) points(c(1:2),tapply(PopRate, Temp, mean)) lines(c(1:2),tapply(PopRate, Temp, mean)) Create interaction plot: par(mfrow=c(1,1)) interaction.plot(Temp,Brand,PopRate,xlab="Temp", ylab="Mean Percent Popped")

32. Grand mean: mean(PopRate) Average for each Brand level: tapply(PopRate, Brand, mean) Brand effects can be computed using the grand mean and Brand level means calculated above.
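As a sketch, the Brand effects are the level means minus the grand mean:
tapply(PopRate, Brand, mean) - mean(PopRate)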

33. Follow the instructions provided for Question 32.

34. Obtain the means for all Brand/Temp conditions:
tapply(PopRate,Brand:Temp,mean)
Use these means and those from Questions 32 and 33 to estimate the interaction effects (a sketch follows).
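A sketch of the interaction-effect computation, assuming two Brand levels and two Temp levels, with the Brand:Temp cell means ordered with Temp varying fastest as in the tapply() output above:
grand <- mean(PopRate)
brand.eff <- tapply(PopRate, Brand, mean) - grand
temp.eff <- tapply(PopRate, Temp, mean) - grand
cell.means <- tapply(PopRate, Brand:Temp, mean)
# interaction effect = cell mean - (grand mean + brand effect + temp effect)
cell.means - (grand + rep(brand.eff, each=2) + rep(temp.eff, 2))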

41. Read in and attach the Handwash data: handwash <- read.table("FILE PATH/Handwash.txt",header=TRUE,sep="\t") attach(handwash) Scatterplot of before versus after CFU counts: plot(CFU.Before,CFU.After,xlab="Before CFU Counts",ylab="After CFU Counts")

42. Natural log transform of the explanatory and response variables:


ln.before <- log(CFU.Before)
ln.after <- log(CFU.After)

Perform the ANCOVA: hand.ancova <- lm(ln.after~ln.before+Cleanser) Examine ANCOVA results: anova(hand.ancova)

43. Create a new blocking factor for the Before CFU counts:
Before.block <- 1:30
for (i in 1:30) {
  if (CFU.Before[i] < 52) Before.block[i] <- "Low"
  if (CFU.Before[i] > 51 && CFU.Before[i] < 150) Before.block[i] <- "Medium"
  if (CFU.Before[i] >= 150) Before.block[i] <- "High"
}
Perform an ANOVA on the block design:
hand.block <- aov(ln.after~as.factor(Before.block)+Cleanser)
Examine the ANOVA results:
anova(hand.block)


Chapter 6 Categorical Data Analysis: Is a Tumor Malignant or Benign?


Activities
3. Create a segmented bar graph for the tumor data: tumors <- cbind(c(7/16,9/16),c(17/21,4/21)) barplot(tumors,xlab="Shape",legend=c("Malignant","Benign"), ylab="Proportion Malignant",names.arg=c("Round","Concave"), xlim=c(0,3.8))

6. Simulation to randomly select 21 cancer cell nuclei (representing the concave cells) and count the number that are malignant:
cancer.cells <- c(rep("M",24),rep("B",13))
sum(sample(cancer.cells,21)=="M")
The second line of code provides the number of concave cells that are malignant.

7. The following code provides the number of malignant cells in each of 9 random samples of 21 cells:
nsims <- 9
malig <- 1:nsims
for (i in 1:nsims) {
  cancer.cells <- c(rep("M",24),rep("B",13))
  malig[i] <- sum(sample(cancer.cells,21)=="M")
}
malig

8. In the code provided for Question 7, set nsims = 10000, then construct a histogram of the 10,000 simulated counts:
hist(malig)
Estimate the p-value:
sum(malig>=17)/10000

9. Find P(X=17):
phyper(17,24,13,21)-phyper(16,24,13,21)
In the above code, 24 = number of malignant cells, 13 = number of benign cells, and 21 = number of cells selected at random. By itself, phyper(17,24,13,21) gives P(X<=17). Find P(X=18):
phyper(18,24,13,21)-phyper(17,24,13,21)
P(X=19), P(X=20), and P(X=21) can be found in a similar manner.
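Equivalently, each individual hypergeometric probability can be obtained directly with dhyper(); for example:
dhyper(17,24,13,21)   # P(X=17)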

10. Construct a bar graph of the hypergeometric probabilities:
barplot(phyper(seq(1,21),24,13,21)-phyper(seq(0,20),24,13,21),
  names.arg=seq(1,21),ylab="Probability")


11. Exact p-value, P(X >=17): 1-phyper(16,24,13,21)

12. Assume M = 13 successes and take a sample of size 16. Then P(X>=9) is: 1-phyper(8,13,24,16)

13. Assume M = 13 successes and take a sample of size 21. Then P(X<=4) is: phyper(4,13,24,21)

15. Estimated p-value for the two-sided hypothesis: (sum(malig>=17)+sum(malig<=10))/10000

16. Find the exact p-value for the two-sided hypothesis using the hypergeometric distribution P(X <=10)+P(X >=17): phyper(10,24,13,21)+(1-phyper(16,24,13,21))

19. Conduct the chi-square test for the cancer cell data: cancer <- cbind(c(9,4),c(7,17)) chisq.test(cancer,correct=FALSE) The test statistic and p-value are provided in the output.

Extended Activities
34. Conduct the chi-square goodness-of-fit test on the die experiment: die.rolls <- c(2,5,3,3,8,9) chisq.test(die.rolls,correct=FALSE)


Research Project Infant Handling in Female Baboons


4. The file baboons.txt contains the data needed to carry out the analysis. Save this .txt file to your computer. Open R and use the File menu to change your working directory to the directory where you just saved the data file [in R, select File, then Change dir]. Then read the data file into R and attach it for use as follows:
baboons.df = read.table("baboons.txt", header=T)
attach(baboons.df)
names(baboons.df)
The variable names are Infant, MotherRnk, HandlRnk, Hand.Name, Hand.ID, and Pass.Act.
Re-create Table 1 using R. The following code creates a table that places Infant in rows and Hand.ID in columns and then sums Pass.Act, the total number of interactions for each combination:
tapply(Pass.Act, list(Infant, Hand.ID), sum)
Eleven entries read NA ("missing value"), since mothers are not considered a "handler" of their own infant. We make note of which of the handlers are mothers (i.e., which columns of Table 1 correspond to the mothers), and we will replace all NAs with zeros as in Table 1:
mothers = c(1, 5, 6, 7, 8, 13, 15, 18, 17, 19, 21)
mothers
Pass.Act.Table1 = tapply(Pass.Act, list(Infant, Hand.ID), sum)
Pass.Act.Table1[is.na(Pass.Act.Table1)] = 0
Finally, Table 1 re-created in R:
Pass.Act.Table1
Table 2A is the 3-by-3 table of infant-handler interaction counts by mother/infant rank and handler rank:
tapply(Pass.Act, list(MotherRnk, HandlRnk), sum)
Later, we will need a slightly more complex function to create a 3-by-3 table from a generalized Table 1, so we develop the tools to do that right now. First, make an ordered list of the handler (column) and infant (row) dominance ratings to coincide with Table 1:
col.ranks = c(rep(1,4), rep(2,7), rep(3,12))
col.ranks
row.ranks = col.ranks[mothers]
row.ranks
Next, collapse Table 1 into Table 2A by summing across rows (infants) with the same dominance level and then summing across columns (handlers) with the same dominance level. Here are the sums across rows:
rowsum(Pass.Act.Table1, row.ranks)
While R has a built-in function to sum rows, rowsum, we will need to write a function to sum across columns. Copy the following code into R to create the new function colsum():
colsum = function(x, group) {
  t(rowsum(t(x), group))
}
Although colsum may look complicated, this function really just uses the t() function for matrix transposition (swapping rows and columns) twice so that it can borrow the existing rowsum function to do all of the hard work. The following code will sum across rows and then sum across columns:
colsum(rowsum(Pass.Act.Table1, row.ranks), col.ranks)


Finally, save this matrix compression operation in an R function by copying and pasting in the following R code:
compress = function(Table1, c.ranks, m){
  colsum(rowsum(Table1, c.ranks[m]), c.ranks)
}
Now we can use the compress function to collapse any Table 1 into a smaller table (i.e., Table 2):
Pass.Act.Table2 = compress(Table1=Pass.Act.Table1, c.ranks=col.ranks, m=mothers)
Pass.Act.Table2

5. chitest = chisq.test(Pass.Act.Table2)
names(chitest)
chitest$observed   # This is Table 2A (observed counts)
chitest$expected   # This is Table 2C (expected counts)
chitest$residuals  # This is Table 2D (standardized residuals)
chitest$p.value

6. This code multiplies the observed Table 2 by the appropriate 1s and -1s:
Pass.Act.Table2
as.vector(Pass.Act.Table2)
LTEvector = c(1,1,1,-1,1,1,-1,-1,1)
LTEvector
sum(LTEvector * as.vector(Pass.Act.Table2))
teststat = function(statvector, Table2vector) {
  sum(statvector * Table2vector)
}
Note that this function is general enough to work for a variety of test statistics. Here we specify that we are interested in the LTE statistic by using LTEvector:
teststat(statvector=LTEvector, Table2vector=as.vector(Pass.Act.Table2))

7. f)
ncol.ranks = c(1,3,1,3,3,3,2,2,3,1,3,2,2,3,3,1,3,3,3,2,2,3,2)
nrow.ranks = ncol.ranks[mothers]
Table2 = compress(Table1=Pass.Act.Table1, c.ranks=ncol.ranks, m=mothers)
Table2
teststat(statvector=LTEvector, Table2vector=as.vector(Table2))

8. pcol.ranks = col.ranks   # Set up a new variable to hold permuted column ranks
pcol.ranks[mothers] = sample(col.ranks[mothers])     # Permute mothers' ranks
pcol.ranks[-mothers] = sample(col.ranks[-mothers])   # Permute handlers (non-mothers)
This separate permutation for mothers and non-mothers is needed to maintain the distribution of mothers: one mother of High rank, four mothers of Mid rank, and six mothers of Low rank. Once a mother's new rank is determined, her infant also needs the same new rank. Thus, the rows are determined by the permutation of mothers.


prow.ranks = pcol.ranks[mothers]   # Copy mothers' new ranks to their infants
Compare the original ranks with the permuted ranks to verify that any handler who is also a mother has the same rank as her infant. (This makes this permutation a valid permutation of ranks as described in the first step of the randomization test above.)
col.ranks
pcol.ranks
row.ranks
prow.ranks
pcol.ranks[mothers]
Again, copy and paste the following code into R to create and save a function that we can use to repeat this permutation process over and over again later:
permute = function(c.ranks, m) {
  pcol.ranks = c.ranks
  pcol.ranks[m] = sample(c.ranks[m])
  pcol.ranks[-m] = sample(c.ranks[-m])
  return(pcol.ranks)
}
permute(c.ranks=col.ranks, m=mothers)
We now combine all our previous programming into one function:
perm.test = function(Table1, c.ranks, m, statvector, numperm) {
  # create a matrix with numperm rows, each containing the observed col.ranks
  col.rank.matrix = matrix(c.ranks, ncol=length(c.ranks), nrow=numperm, byrow=T)
  # permute each row of col.rank.matrix to produce perm.matrix,
  # with each row a valid permutation of the column (handler) ranks
  perm.matrix = t(apply(col.rank.matrix, 1, permute, m=m))
  # calculate a Table 2 for each permutation of ranks within perm.matrix;
  # each Table 2 will be in one column
  perm.table2 = apply(perm.matrix, 1, compress, Table1=Table1, m=m)
  # Finish by returning the list of numperm test statistics
  return(apply(perm.table2, 2, teststat, statvector=statvector))
}
The following R code will compute the empirical p-value, based on 1000 iterations of perm.test:
Pass.Act.LTE.perm.distn = perm.test(Table1=Pass.Act.Table1, c.ranks=col.ranks,
  m=mothers, statvector=LTEvector, numperm=1000)
Pass.Act.LTE.p.value = sum(Pass.Act.LTE.perm.distn >=
  teststat(statvector=LTEvector, Table2vector=as.vector(Pass.Act.Table2)))/1000
Pass.Act.LTE.p.value
You can also view the distribution of the LTE statistic:
hist(Pass.Act.LTE.perm.distn)
abline(v=472)
The precise results will vary with each run, because random samples are being generated by R, but here is an explanation of this code. Pass.Act.LTE.perm.distn contains a vector of length 1000, giving the test statistic for each of 1000 iterations of the permutation test steps outlined above. We next asked R for a histogram of this permutation distribution and added a vertical line (with abline(v=472)) at the observed LTE statistic value of 472. We notice that only a few of the values of the LTE statistic resulting from the permutations are at or beyond 472, so we know that the empirical p-value will be small. The estimated p-value is


stored in Pass.Act.LTE.p.value. In fact, the author observed an empirical p-value of 0.015, meaning that 15 of the 1000 iterations had LTE statistics greater than or equal to the observed LTE statistic value of 472. The small p-value gives fairly strong evidence that a result of this magnitude would be hard to explain by chance alone.

9. First recall the LTEvector corresponding to RH1:
LTEvector = c(1,1,1,-1,1,1,-1,-1,1)
Create a new vector, LTvector, corresponding to RH2. Second, run 1000 iterations to obtain Pass.Act.LT.perm.distn by using statvector=LTvector. Finally, determine the p-value by creating a new object, Pass.Act.LT.p.value (a sketch of these steps follows).

Challenge: Categories of infant-handler interaction


The Bentley-Condit article [1] suggests more nuanced analyses than we have heretofore considered in this lab. First, TABLE 1 of [1] categorizes interactions into three different types: 1) Passive: movement to within 1 m of the mother-infant pair with no attempt to handle (the paper calls these interactions Approach); 2) Unsuccessful: movement to within 1 m of the mother-infant pair with an attempted (but not successful) handle (the paper calls these Attempted); or 3) Successful: a successful handle. Table 1 is a sum total of these three types of interactions for each handler-infant pair. For example, of the 13 interactions of handler KM with infant HZ/HQ, 2 were passive, 4 were unsuccessful, and 7 were successful. In baboons.df, the variables Passive, Unsuccess, and Successful refer to these categories of interaction, and the variable Pass.Act is the sum of the three. We can create three tables, one for each category of interaction:
inc.mat.Pass <- tapply(Passive, list(Infant, Hand.ID), sum)    # create the Passive table
inc.mat.Pass[is.na(inc.mat.Pass)] <- 0                         # replace the NAs with zeros
inc.mat.Un <- tapply(Unsuccess, list(Infant, Hand.ID), sum)    # create the Unsuccessful table
inc.mat.Un[is.na(inc.mat.Un)] <- 0                             # replace the NAs with zeros
inc.mat.Succ <- tapply(Successful, list(Infant, Hand.ID), sum) # create the Successful table
inc.mat.Succ[is.na(inc.mat.Succ)] <- 0                         # replace the NAs with zeros
Analysis 1: The researchers were more interested in the research hypotheses RH1 and RH2 with respect to these categories than they were with the overall table of counts given in Table 1. Consider both RH1 and RH2 for the 3 categories of interactions. Run both descriptive analyses and permutation test analyses for the various combinations of research hypothesis and category of interaction (a sketch for one combination follows). The goal is to learn about the relationship, if any, between dominance rank and infant-handling behavior and whether the interaction is successful or not.
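A sketch for one combination (RH1 applied to the Successful interactions), assuming the compress(), teststat(), and perm.test() functions from Questions 4-8 are already loaded:
Succ.Table2 <- compress(Table1=inc.mat.Succ, c.ranks=col.ranks, m=mothers)
Succ.LTE.perm <- perm.test(Table1=inc.mat.Succ, c.ranks=col.ranks, m=mothers,
  statvector=LTEvector, numperm=1000)
sum(Succ.LTE.perm >= teststat(statvector=LTEvector,
  Table2vector=as.vector(Succ.Table2)))/1000
The same pattern applies to inc.mat.Pass and inc.mat.Un, and to the LT statistic for RH2.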


Analysis 2: In [1], the authors ended up dropping the data involving infant 1, the only high-ranking infant in the data set. The strength of conclusions was compromised by this decision. After the publication of [1], the authors took a suggestion of Lunneborg [3] and performed a sensitivity analysis, which had the virtue of looking more carefully at the role of infant 1 and all individual infants. There are actually several ways to go about such a sensitivity analysis, but the one we suggest here is as follows. Consider each infant, one at a time. Set aside that infant from the data set; that infant's mother remains in the data set, but as a handler only, not a mother. Now re-compute the p-value for the reduced data set. Consider the set of 11 p-values thus obtained and consider whether it seems as if any one infant has a great effect on the original results. For example, with infant 1 removed, is the p-value much different from the original p-value or from the p-values obtained in the other 10 one-removed analyses? This determination is subjective, but it does provide a way to ascertain the effect of individuals without simply throwing that infant's data out of the analysis. The R code performs this analysis. The vector p.rem.inf.PA.LTE contains the 11 p-values, removing one infant at a time. We see that infant 1 does have a decidedly larger p-value:
[1] 0.120 0.014 0.059 0.012 0.061 0.024 0.017 0.021 0.037 0.015 0.035
Lunneborg actually suggests looking not at p-values with infants removed but at the test statistic itself. The code does this as well, putting the values of the LTE test statistic in the vector LTE.rem.inf.PA:
0.7142857 0.7611408 0.7218750 0.7651123 0.7168950 0.6672052 0.6781250
0.6650407 0.6429809 0.6750789 0.6526138
Note: These are not actually LTE values; rather, they are the LTE value divided by the sum of all entries in the 3-by-3 table. The reason for normalizing this way is that when infants are removed, the total count of the table changes, which would make comparisons across infants impossible. Since the LTE statistic measures the total count in cells consistent with the research hypothesis minus the total count in cells inconsistent with the research hypothesis, one should interpret this normalized LTE statistic as the difference between the proportion of consistent events and the proportion of inconsistent events. Note that, on this descriptive level, the removal of infant 1 seems to exert less influence on the results: the value with infant 1 removed (0.7142857) is not extreme within the set of 11 values. Now go on to investigate the removal of infants with the LT statistic and also with the different categories of interaction. Summarize your findings.


Chapter 7 Logistic Regression: The Space Shuttle Challenger


How to Perform Logistic Regression with R
The glm function is used to perform logistic regression analysis in R. Similar to standard linear regression, there are three main arguments that need to be provided to glm: 1) A formula with the binary response variable on the left side of the ~ and the explanatory variable on the right side of the ~ sign. If there is more than one explanatory variable to include in the model, then place a + between them. Interaction terms can be included by placing a : between the variables for which an interaction will be created. 2) The argument family=binomial indicates that logistic regression, rather than linear regression, will be performed. 3) An argument to indicate which data set will be used. Other options are available, and you can type help(glm) to learn more about the function.
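As an illustration only (the data frame mydata and the variables y, x1, and x2 are hypothetical), a call combining these arguments might look like:
fit <- glm(y ~ x1 + x2 + x1:x2, family=binomial, data=mydata)
summary(fit)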

Activities
2. Read in and attach the Shuttle data into your R session: shuttle <- read.table("FILE PATH/Shuttle.TXT",header=TRUE,sep="\t") attach(shuttle) Create a scatterplot of Temperature versus Success (0 = not a successful launch, 1 = successful launch): plot(Temperature,Success,xlab="Temperature (Fahrenheit)", ylab="Success (0/1)") Create a side-by-side boxplot of Temperature by Success: boxplot(Temperature~Success)

3. Create a scatterplot of Temperature versus Success (0 = not a successful launch, 1 = successful launch) with a least squares regression line superimposed:
plot(Temperature,Success,xlab="Temperature (Fahrenheit)",ylab="Success (0/1)")
abline(lm(Success~Temperature)$coef[1],lm(Success~Temperature)$coef[2])
Obtain the least squares estimates of the slope and y-intercept:
lm(Success~Temperature)$coef
These estimates can be used to predict the response values when the temperature is 60 degrees and 85 degrees.

5. To plot relationships between π and X using values of β0 = -10, -5 and values of β1 = 0.5, 1, and 1.5, type the following lines of code:
par(mfrow=c(2,3))
Beta0 <- c(-10,-5)
Beta1 <- c(.5,1,1.5)
X <- seq(0,30,by=.1)


for (i in 1:2) {
  for (j in 1:3) {
    pi <- exp(Beta0[i] + Beta1[j]*X)/(1+exp(Beta0[i] + Beta1[j]*X))
    plot(X,pi,type="l",ylim=c(0,1),ylab = expression(pi))
    title(main=bquote(paste(beta[0]==.(Beta0[i]), " ",
      beta[1]==.(Beta1[j]))),cex=.7)
  }
}
To plot relationships between π and X using values of β0 = 10, 5 and values of β1 = -0.5, -1, and -1.5, type the following lines of code:
par(mfrow=c(2,3))
Beta0 <- c(10,5)
Beta1 <- c(-.5,-1,-1.5)
X <- seq(0,30,by=.1)
for (i in 1:2) {
  for (j in 1:3) {
    pi <- exp(Beta0[i] + Beta1[j]*X)/(1+exp(Beta0[i] + Beta1[j]*X))
    plot(X,pi,type="l",ylim=c(0,1),ylab = expression(pi))
    title(main=bquote(paste(beta[0]==.(Beta0[i]), " ",
      beta[1]==.(Beta1[j]))),cex=.7)
  }
}

6. To obtain the maximum likelihood estimates for β0 and β1, we first create a logistic regression object in R, named shuttle.logistic1, that will contain several key aspects of the model, including parameter estimates, standard errors, and additional information that will be used later. Type in the following code:
shuttle.logistic1 <- glm(Success~Temperature,family=binomial)
Now we can examine the output in shuttle.logistic1 to obtain the maximum likelihood estimates using the summary() function. Type:
summary(shuttle.logistic1)
In the output, the values in the column labeled Estimate are the MLEs for β0 and β1, i.e. they are b0 and b1, respectively.

11. Probabilities from the logistic regression model using the MLEs (from Question 6):
prob.mle <- exp(-15.043+.232*Temperature)/(1+exp(-15.043+.232*Temperature))
Probabilities from a logistic regression model using the least squares estimates:
ls.coef <- lm(Success~Temperature)$coef
prob.ls <- exp(ls.coef[1]+ls.coef[2]*Temperature)/
  (1+exp(ls.coef[1]+ls.coef[2]*Temperature))
Create a plot of the two logistic regression models:
par(mfrow=c(1,2))
plot(Temperature,prob.mle,xlab="Temperature",ylab="Probability",
  main="Parameters Estimated by MLE")


plot(Temperature,prob.ls,xlab="Temperature",ylab="Probability", main="Parameters Estimated by LS")

13. To fit the logistic regression model where a 1 indicates O-ring failure, we can use (1-Success) as the new response variable. The new logistic regression object will be called shuttle.logistic2: shuttle.logistic2 <- glm((1-Success)~Temperature,family=binomial) Now examine the output in shuttle.logistic2 to observe the maximum likelihood estimates using the summary() function. Check the column labeled Estimate to find the MLEs.

14. We will use the summary output for shuttle.logistic1 to conduct the likelihood ratio test (LRT). First, compute the test statistic for the LRT, which we'll name G:
G <- shuttle.logistic1$null.deviance-shuttle.logistic1$deviance
Next, we'll compute the degrees of freedom for the test statistic:
df <- shuttle.logistic1$df.null-shuttle.logistic1$df.residual
Note that df gives us the number of extra parameters that need to be estimated in the full model vs. the null model. Finally, we can compute a p-value for the test using the following code:
1-pchisq(G,df)

Extended Activities
15. Read in and attach the Cancer2 data into your R session:
cancer <- read.table("FILE PATH/Cancer2.TXT",header=TRUE,sep="\t")
attach(cancer)
The logistic regression object corresponding to the model with binary response variable Malignant and explanatory variables radius and concavity will be saved as cancer.logistic1:
cancer.logistic1 <- glm(Malignant~radius+concavity,family=binomial,data=cancer)
The parameter estimates for the model are obtained using the summary() function. The residual deviance can be obtained with the code:
cancer.logistic1$deviance
The null deviance can be obtained with the code:
cancer.logistic1$null.deviance
The LRT statistic and degrees of freedom are found using the code:
G <- cancer.logistic1$null.deviance-cancer.logistic1$deviance
df <- cancer.logistic1$df.null-cancer.logistic1$df.residual
The p-value for the LRT:
1-pchisq(G,df)
Probabilities of malignant cells can be computed by plugging appropriate values of the explanatory variables into the estimated logistic regression function and solving for P(Malignant), as sketched below.
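A sketch of the plug-in computation; the values radius = 14 and concavity = 0.10 are hypothetical and should be replaced by the values asked for in the text:
b <- cancer.logistic1$coef
eta <- b[1] + b[2]*14 + b[3]*0.10
exp(eta)/(1+exp(eta))   # estimated P(Malignant)
The built-in predict() function gives the same result:
predict(cancer.logistic1, newdata=data.frame(radius=14, concavity=0.10),
  type="response")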


16. The logistic regression object corresponding to the model with binary response variable Malignant and explanatory variable radius will be saved as cancer.logistic2:
cancer.logistic2 <- glm(Malignant~radius,family=binomial)
The remainder of the question is identical to that of Question 15; however, be sure to replace cancer.logistic1 with cancer.logistic2 when extracting relevant information from the model results.

18. To fit the logistic regression model where a 1 indicates a benign mass, we can use (1-Malignant) as the new response variable. The new logistic regression object will be called cancer.logistic3: cancer.logistic3 <- glm((1-Malignant)~radius+concavity,family=binomial) The logistic regression output can be examined with: summary(cancer.logistic3)

22. Goodman-Kruskal Gamma and Kendall's Tau-a can be obtained easily in R, but a separate R package must be loaded (and possibly downloaded first), and then the logistic model must be re-fit to the data using the function lrm. The output for lrm is slightly different from that obtained using the glm function. Load the R package Design:
library(Design)
Re-fit the logistic regression using the lrm function:
lrm(Success~Temperature)
The output is as follows:

Logistic Regression Model

lrm(formula = Success ~ Temperature)

Frequencies of Responses
 0  1
 7 16

Frequencies of Missing Values Due to Each Variable
    Success Temperature
          1           0

 Obs  Max Deriv  Model L.R.  d.f.  P       C      Dxy    Gamma  Tau-a  R2     Brier
  23  2e-04      7.95        1     0.0048  0.781  0.562  0.589  0.249  0.413  0.139

              Coef      S.E.    Wald Z  P
Intercept   -15.0429    7.3786  -2.04   0.0415
Temperature   0.2322    0.1082   2.14   0.0320

Goodman-Kruskal Gamma and Kendall's Tau-a are identified in the model statistics above (Gamma = 0.589, Tau-a = 0.249).


26. Read in and attach the Cancercells data into your R session:
cancer.group <- read.table("FILE PATH/Cancercells.txt",header=TRUE,sep="\t")
attach(cancer.group)
Fitting a logistic regression model to binomial counts is slightly different from what was done earlier. First, create a two-column matrix containing the counts of successes (cells that are malignant) in the first column and the counts of failures (cells that are benign) in the second column:
cells <- cbind(Malignant,Benign)
Now fit the logistic regression model:
cancer.group1 <- glm(cells~med.rad,family=binomial)
The results of the fit can be examined with the summary() function, and the LRT can be performed by following the instructions of Question 15.

27. a) Create a vector of the Pearson residuals: pearson.res <- residuals(cancer.group1,type="pearson") b) Construct a normal probability plot of the Pearson residuals: qqnorm(pearson.res) c) Create a scatterplot of the mean radius versus Pearson residuals: plot(med.rad,pearson.res)

28. a) Create a vector of the deviance residuals:
dev.res <- residuals(cancer.group1,type="deviance")
b) Construct a normal probability plot of the deviance residuals:
qqnorm(dev.res)
c) Create a scatterplot of the mean radius versus the deviance residuals:
plot(med.rad,dev.res)

31. There is no easily accessible code for the Pearson or deviance goodness-of-fit tests in R for ungrouped data when there are duplicate values of the explanatory variable. As stated in the text, these tests are not reliable when there are few observations at each level of the explanatory variable. There is available code to perform the Hosmer-Lemeshow test. The following function to perform the Hosmer-Lemeshow test was developed by Peter D. M. Macdonald, McMaster University:
hosmerlem <- function (y, yhat, g = 10) {
  cutyhat <- cut(yhat, breaks = quantile(yhat, probs = seq(0, 1, 1/g)),
    include.lowest = T)
  obs <- xtabs(cbind(1 - y, y) ~ cutyhat)
  expect <- xtabs(cbind(1 - yhat, yhat) ~ cutyhat)
  chisq <- sum((obs - expect)^2/expect)
  P <- 1 - pchisq(chisq, g - 2)
  c("X^2" = chisq, Df = g - 2, "P(>Chi)" = P)
}


In the code, y is the binary response variable, yhat corresponds to the fitted probabilities, and g is the number of groups (the default is 10 groups). We obtain the fitted probabilities from the logistic regression model, then apply the hosmerlem() function:
cancer.glm.fits <- glm(Malignant~radius,family=binomial)$fit
hosmerlem(Malignant,cancer.glm.fits)

32. Adjust g in the previous question.

33. See previous questions for residual plots. There is no easily accessible code for delta and leverage graphs.

34. a) Plot the log-likelihood function: pi <- seq(.01,.99,length=100) loglik <- 5*log(pi)+7*log(1-pi) plot(pi,loglik,xlab="Pi",ylab="Log-Likelihood Function")


Chapter 8 Poisson Log-Linear Regression: Detecting Cancer Clusters


How to Perform Log-Linear Regression with R
The glm function is used to perform log-linear regression analysis in R. There are three or four main arguments that need to be provided to glm: 1) A formula with a count response variable on the left side of the ~ and the explanatory variable or variables on the right side of the ~ sign. If there is more than one explanatory variable to include in the model, then place a + between them. Interaction terms can be included by placing a : between the variables for which an interaction will be created. 2) The argument family=poisson to indicate that log-linear regression, rather than linear regression, will be performed. 3) An argument to indicate which data set will be used. 4) In addition, an offset term may need to be specified if the response variable is a count per unit of exposure, for example the number of persons in a geographic region or the number of persons in different age groups. For a given exposure variable, the offset is specified as offset=log(exposure) in the glm() function. Other options are available, and you can type help(glm) to learn more about the function.
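As an illustration only (the data frame mydata and the variables y, x, and exposure are hypothetical), a rate model combining all four arguments might look like:
rate.fit <- glm(y ~ x, family=poisson, offset=log(exposure), data=mydata)
summary(rate.fit)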

Activities
9. Here is example R code for simulating the cancer status of 1,138 individuals with an average of 25 years of residence. We will use runif() to draw a random number between zero and one for each person-year of exposure (and repeat 10,000 times). Create 10,000 communities of 1,138 people, and let's assume that every person lived in Randolph for 25 years. Each community has a cancer rate equal to the national average of 326 cases/100,000 person-years: n.sims <- 10000 n.people <- 1138 n.years <- 25 rate <- 326/100000 n.person.years <- n.people * n.years Create an array to hold the 10,000 simulated cancer counts, and get a histogram of the simulation results to see how the Bartlett-Green Acres neighborhood compares: cases <- rep(0, n.sims) for (sim in 1:n.sims) { x <- runif(n.person.years) cases[sim] <- sum(x < rate) } summary(cases) hist(cases) To obtain the simulated p-value, add up the number of simulated neighborhoods at least as extreme as Randolph, MA (at least 67 cases), and divide by number of simulations to get proportion: pvalue <- sum(cases >= 67) / n.sims pvalue
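Each simulated count above is a Binomial(n.person.years, rate) draw, so an equivalent, vectorized version of the loop is:
cases <- rbinom(n.sims, n.person.years, rate)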


10. Repeat the steps outlined for the previous question, but change one line so that the exposure is 12.5 years per person on average: n.years = 12.5

15. Read in the BGA_CTR data into your R session: bga_ctr <- read.table("FILE PATH/bga_ctr.TXT",header=TRUE) Create a subset of the data corresponding to the BGA location: BGA <- subset(bga_ctr, location=="BGA") Fit the Poisson regression model to the BGA cancer rates with median age as the covariate: bga.glm <- glm(cases ~ median_age, family=poisson,offset=log(exposure),data=BGA) The offset term offset=log(exposure)take care of the different exposure levels (person-years) for the age groups. We can inspect the estimated coefficients by using the summary() command: summary(bga.glm)

17. Create a subset of the data corresponding to the CTR location: CTR <- subset(bga_ctr, location=="CTR") Fit the Poisson regression model to the CTR cancer rates with median age as the covariate: ctr.glm <- glm(cases ~ median_age, family=poisson, offset=log(exposure), data=CTR) Use the summary() command to view the results of the fitted model.

20. Fit the Poisson regression model to the BGA and CTR cancer rates with median age as the covariate: bgactr.glm1 <- glm(cases ~ median_age, family=poisson, offset=log(exposure), data=bga_ctr) Use the summary() command to view the results of the fitted model.

21. Fit the Poisson regression model to the BGA and CTR cancer rates with location as the covariate: bgactr.glm2 <- glm(cases ~ as.factor(location), family=poisson, offset=log(exposure),data=bga_ctr)

22. Fit the Poisson regression model to the BGA and CTR data with age and location as the covariates: bgactr.glm3 <- glm(cases ~ median_age+as.factor(location), family=poisson, offset=log(exposure), data=bga_ctr) Use the summary() command to view the results of the fitted model.

25. Create a graph of the logarithm of the incidence rates versus median age by location:
par(mfrow=c(1,1))
plot(CTR$median_age,log(CTR$rate),xlab="Median Age",
  ylab="Log Incidence Rate",ylim=c(3,8))
points(BGA$median_age,log(BGA$rate),pch=19)
legend(30,7,c("BGA","CTR"),pch=c(19,1))


27. Fit the Poisson regression model to the BGA and CTR data with covariate age, location, and an interaction between age and location: bgactr.glm4 <- glm(cases ~ median_age+as.factor(location)+ median_age:as.factor(location), family=poisson, offset=log(exposure), data=bga_ctr)

31. Compute the LRT statistic: LRT <- bgactr.glm3$deviance-bgactr.glm4$deviance Obtain the p-value for the LRT: 1-pchisq(LRT,1)

33. To fit the Poisson model to the CTR data that includes a quadratic term for Age: ctr.glm2 <- glm(cases ~ median_age + I(median_age^2), family=poisson, offset=log(exposure),data=CTR) Examine the summary output for the deviance statistic.

34. Residuals of the interaction model: round(residuals(bgactr.glm4,type="deviance"),4) Plot of deviance residuals versus median age: plot(bga_ctr$median_age,residuals(bgactr.glm4),xlab="Median Age", ylab="Deviance Residual") Fit the interaction model plus the quadratic term for age: bgactr.glm5 <- glm(cases ~ median_age+as.factor(location) +median_age:as.factor(location) +I(median_age^2),family=poisson, offset=log(exposure),data=bga_ctr) The LRT statistic is: LRT <- bgactr.glm4$deviance-bgactr.glm5$deviance The corresponding p-value of the LRT is: 1-pchisq(LRT,1)

Extended Activities
44. Read in the Smoking data: smoking <- read.table("FILE PATH/Smoking.txt",header=TRUE) Fit a Poisson regression model to the smoking data: smoke.glm <- glm(Number ~ Location,family=poisson,data=smoking)

49. Pearson chi-square statistic:
sum(residuals(smoke.glm,type="pearson")^2)


50. Read in the gala data:
gala <- read.table("FILE PATH/gala.txt",header=TRUE)
Create plots of the logarithm of the observed counts versus each of the potential covariates:
plot(gala$area,log(gala$species),xlab="Area",ylab="Log Species Count")
plot(gala$elevation,log(gala$species),xlab="Elevation",ylab="Log Species Count")
plot(gala$nearest,log(gala$species),xlab="Nearest",ylab="Log Species Count")
plot(gala$scruz,log(gala$species),xlab="Scruz",ylab="Log Species Count")
plot(gala$adjacent,log(gala$species),xlab="Adjacent",ylab="Log Species Count")

51. Plot the log-observed counts versus the logarithm of the covariates:
plot(log(gala$area),log(gala$species),xlab="Log Area",
  ylab="Log Species Count")
plot(log(gala$elevation),log(gala$species),xlab="Log Elevation",
  ylab="Log Species Count")
plot(log(gala$nearest),log(gala$species),xlab="Log Nearest",
  ylab="Log Species Count")
plot(log(gala$scruz),log(gala$species),xlab="Log Scruz",
  ylab="Log Species Count")
plot(log(gala$adjacent),log(gala$species),xlab="Log Adjacent",
  ylab="Log Species Count")

52. Fit a Poisson regression model to the total species count with covariates log(area) & log(elevation): gala.glm <- glm(species~log(area)+log(elevation),family=poisson,data=gala) The LRT statistic to determine if at least one of the covariates is significant: LRT <- gala.glm$null.deviance-gala.glm$deviance P-value for the LRT: 1-pchisq(LRT,2)

53. Fit a Poisson regression model to the total species count with covariates log(area), log(elevation), nearest, scruz, and adjacent: gala.glm2 <- glm(species~nearest+scruz+adjacent+log(area)+log(elevation), family=poisson,data=gala) The LRT statistic is: LRT <- gala.glm$deviance - gala.glm2$deviance P-value is: 1-pchisq(LRT,3)

55. Compute the correlations between all pairs of covariates (use the logarithm of area and elevation):
cor(cbind(log(gala$area), log(gala$elevation), gala$nearest, gala$scruz,
  gala$adjacent))


58. To fit an overdispersed Poisson regression model, choose the option family=quasipoisson and leave everything else the same. The parameter estimates will be identical to those in Problem 53, but the standard errors (and hence the Wald statistics) will be adjusted by the estimated overdispersion parameter:
gala.glm2 <- glm(species~nearest+scruz+adjacent+log(area)+log(elevation),
  family=quasipoisson,data=gala)
Use the summary() command, and compare the results to those in Problem 53. You will notice that the estimated coefficients are the same, but the p-values are all different.
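The estimated overdispersion parameter itself can be extracted from the summary object; a minimal sketch:
summary(gala.glm2)$dispersion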


Chapter 9 Survival Analysis: Melting Chocolate Chips


How to Perform Survival Analysis with R
To analyze survival data in R, you will need to download the package survival and load it with the command library(survival) in your R session. This package contains many of the survival analysis routines you will need to solve the Activities and Extended Activities. Unfortunately, there are currently no R functions in the survival package that compute and plot the estimated hazard function or the cumulative hazard function, so we have supplied two R functions for this purpose. The function plot.haz() will plot the estimated hazard function and plot.chaz() will plot the estimated cumulative hazard function when supplied with a survfit object (see the instructions for the related Activities and Extended Activities below). You will need to load this code in your R session. The easiest way to do this is to open the files plot.haz.R and plot.chaz.R in your R session, highlight the code, and click the Run line or selection icon. This process makes the functions available to use. To apply them to survival data, supply a survfit object (e.g. KM.obj) to either function. See the related Extended Activities below. For the questions that require class activity data, we will use the MeltingChipsJS data set.
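If the files are saved on your computer, an alternative to highlighting and running the code is to source() the files directly (adjust the path as needed):
source("FILE PATH/plot.haz.R")
source("FILE PATH/plot.chaz.R")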

Activities
19. Read in the class data or MeltingChipsJS data into R session: meltchips <- read.table("FILE PATH/MeltingChipsJS.txt",header=TRUE) Construct the Kaplan-Meier curves for the white and milk chocolate chips and store them in the survfit object named KM.obj: KM.obj <- survfit(Surv(Time,Censor)~Type,data=meltchips) Plot the Kaplan-Meier curves on the same graph: plot(KM.obj,lty=1:2,xlab="Time",ylab="Survival Probability",ylim=c(0,1)) Add a legend to the plot to distinguish between the two groups: legend(1,.2,c("Milk","White"),lty=1:2)

25. The estimated median survival time can be obtained using the print() command: print(KM.obj) To compute the estimated mean survival times, use the print() command and include the following options: print(KM.obj,print.rmean=TRUE,rmean="individual")

29. First, create a dataset with milk chocolate chips only: milkchips <- meltchips[meltchips$Type=="Milk",] Construct the Kaplan-Meier curve for the milk chocolate chips and store it in KM.obj: KM.obj <- survfit(Surv(Time,Censor)~Type,data=milkchips,conf.type="plain")


Plot the Kaplan-Meier curve with the pointwise confidence intervals for the survival probabilities on the same graph: plot(KM.obj,xlab="Time",ylab="Survival Probability",ylim=c(0,1)) You can examine the values of the 95% confidence intervals using the summary() command: summary(KM.obj) Note that the upper limits of the confidence intervals are truncated at 1.

31. Construct the Kaplan-Meier curves for the white and milk chocolate chips and store them in KM.obj: KM.obj <- survfit(Surv(Time,Censor)~Type,data=meltchips,conf.type="plain") Plot the Kaplan-Meier curves on the same graph: plot(KM.obj,lty=1:2,xlab="Time",ylab="Survival Probability",ylim=c(0,1), conf.int=TRUE) Add a legend to the plot to distinguish between the two groups: legend(1,.2,c("Milk","White"),lty=1:2) The values of the pointwise confidence intervals for the survival probabilities can also be examined using the summary() command.

36. Perform the log-rank test to compare the melting experiences of the white and milk chocolate chips: survdiff(Surv(Time,Censor)~Type,data=meltchips) The test statistic and p-value are provided at the bottom of the output.

Extended Activities
43. Read in the Firstdrink data: firstdrink <- read.table("FILE PATH/Firstdrink.txt",header=TRUE) Construct the Kaplan-Meier curve and store it in KM.obj: KM.obj <- survfit(Surv(Age,Censor)~1,data=firstdrink,conf.type="plain")

44. Use the R function plot.haz() shown here to compute the estimated hazard rates and/or plot the estimated hazard function:

    plot.haz <- function(KM.obj,plot="TRUE") {
      ti <- summary(KM.obj)$time
      di <- summary(KM.obj)$n.event
      ni <- summary(KM.obj)$n.risk
      #Est Hazard Function
      est.haz <- 1:(length(ti))
      for (i in 1:(length(ti)-1))
        est.haz[i] <- di[i]/(ni[i]*(ti[i+1]-ti[i]))
      est.haz[length(ti)] <- est.haz[length(ti)-1]
      if (plot=="TRUE") {
        plot(ti,est.haz,type="s",xlab="Time",ylab="Hazard Rate",
             main=expression(paste(hat(h),(t)[KM])))
      }
      return(list(est.haz=est.haz,time=ti))
    }

Use KM.obj from Question 43:

    plot.haz(KM.obj)

45. Read in the Graduate data:

    graduate <- read.table("FILE PATH/Graduate.txt",header=TRUE)

First construct the Kaplan-Meier curve and store it in KM.obj:

    KM.obj <- survfit(Surv(Years,Censor)~1,data=graduate,conf.type="plain")

Now plot the estimated hazard function using the plot.haz() function:

    plot.haz(KM.obj)

49. Read in the Rearrest data:

    rearrest <- read.table("FILE PATH/Rearrest.txt",header=TRUE)

First construct the Kaplan-Meier curve and store it in KM.obj:

    KM.obj <- survfit(Surv(months,censor)~1,data=rearrest,conf.type="plain")

The estimated cumulative hazard function is plotted using the following plot.chaz() function:

    plot.chaz <- function(KM.obj,plot="TRUE") {
      ti <- summary(KM.obj)$time
      di <- summary(KM.obj)$n.event
      ni <- summary(KM.obj)$n.risk
      #Est Cumulative Hazard Function at the event times
      est.cum.haz <- 1:(length(ti))
      for (i in 1:(length(ti)))
        est.cum.haz[i] <- sum(di[1:i]/ni[1:i])
      #Cumulative hazard at every time in the survfit object, for plotting
      plot.chaz <- 1:length(KM.obj$time)
      for (i in 1:length(plot.chaz))
        plot.chaz[i] <- sum((KM.obj)$n.event[1:i]/(KM.obj)$n.risk[1:i])
      if (plot=="TRUE") {
        plot((KM.obj)$time,plot.chaz,type="s",xlab="Time",
             ylab="Cumulative Hazard",main=expression(paste(hat(H),(t)["NA"])))
      }
      return(list(est.chaz=plot.chaz,time=(KM.obj)$time))
    }

Now plot the estimated hazard and cumulative hazard functions using the plot.haz() and plot.chaz() functions:

    par(mfrow=c(1,2))
    plot.haz(KM.obj)
    plot.chaz(KM.obj)


Chapter 10: Principal Component Analysis: Stock Market Values


Activities
1. Open the 2006Stocks dataset and attach it to your R session:

    stocks <- read.table("FILE PATH/2006Stocks.txt",header=TRUE,sep="\t")
    attach(stocks)

Create time series plots of the Dow and S&P, with both time series on the same graph:

    lablist <- as.vector(Date)[seq(1,251,by=45)]
    plot(Dow,axes=FALSE,type="l",ylab="Stock Market Value",xlab="Time",
         ylim=c(1000,13000),col=2)
    lines(SP,lty=2,col=3)
    box()
    axis(1,at=seq(1,251,by=45),labels=FALSE)
    axis(2,at=seq(0,13000,by=1000),labels=TRUE)
    text(seq(1,251,by=45),par("usr")[3],pos=1,labels=lablist,xpd=TRUE)
    legend(175,7000,c("Dow Jones","S&P"),lty=1:2,col=2:3)

2. Standardize the Dow and S&P columns (i.e., for each element in the Dow column, subtract the Dow mean and divide by the Dow standard deviation). Save the standardized Dow values in a new column labeled Z1 and the standardized S&P values in Z2.

a) Create time series plots of the standardized Dow and S&P, Z1 and Z2, with both time series on the same graph:

    Z1 <- scale(Dow)  #Standardize DJ
    Z2 <- scale(SP)   #Standardize SP
    lablist <- as.vector(Date)[seq(1,251,by=45)]
    plot(Z1,axes=FALSE,type="l",ylab="Stock Market Value",xlab="Time",
         ylim=c(-3,3),col=2)
    lines(Z2,lty=2,col=3)
    box()
    axis(1,at=seq(1,251,by=45),labels=FALSE)
    axis(2,at=seq(-3,3,by=1),labels=TRUE)
    text(seq(1,251,by=45),par("usr")[3],pos=1,labels=lablist,xpd=TRUE)
    legend(2,2,c("Z1 (Dow)","Z2 (S&P)"),lty=1:2,col=2:3)
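To make the definition concrete, the standardization can also be done by hand; this sketch should agree with scale() up to floating-point rounding (Z1.byhand is an illustrative name, not part of the activity):

    Z1.byhand <- (Dow - mean(Dow))/sd(Dow)          #Subtract mean, divide by sd
    all.equal(as.vector(Z1),as.vector(Z1.byhand))   #Expect TRUE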

3. Later sections will show that the first principal component is calculated as PC1 = Y1 = 0.707Z1 + 0.707Z2.

a) Calculate PC1 and submit a time series plot of the standardized Dow and S&P closing values (Z1, Z2) and the first principal component (PC1):

    PC1 <- 0.707*Z1 + 0.707*Z2
    lablist <- as.vector(Date)[seq(1,251,by=45)]
    plot(Z1,axes=FALSE,type="l",ylab="Stock Market Value",xlab="Time",
         ylim=c(-3,3),col=2,lwd=2)
    lines(Z2,lty=2,col=3,lwd=2)
    lines(PC1,lty=3,col=4,lwd=2)
    box()
    axis(1,at=seq(1,251,by=45),labels=FALSE)
    axis(2,at=seq(-3,3,by=1),labels=TRUE)
    text(seq(1,251,by=45),par("usr")[3],pos=1,labels=lablist,xpd=TRUE)
    legend(2,2,c("Z1 (Dow)","Z2 (S&P)","PC1"),lty=1:3,col=2:4,lwd=2)

4. First, create Z1, Z2, and Z3:

    Z1 <- scale(Dow)     #Standardize DJ
    Z2 <- scale(SP)      #Standardize SP
    Z3 <- scale(Nasdaq)  #Standardize Nasdaq

a)

    C1 <- 1*Z1 + 0*Z2 + 0*Z3
    lablist <- as.vector(Date)[seq(1,251,by=45)]
    plot(Z1,axes=FALSE,type="l",ylab="Stock Market Value",xlab="Time",
         ylim=c(-3,3),col=2,lwd=2)
    lines(Z2,lty=2,col=3,lwd=2)
    lines(Z3,lty=3,col=4,lwd=2)
    lines(C1,lty=4,col=5,lwd=2)
    box()
    axis(1,at=seq(1,251,by=45),labels=FALSE)
    axis(2,at=seq(-3,3,by=1),labels=TRUE)
    text(seq(1,251,by=45),par("usr")[3],pos=1,labels=lablist,xpd=TRUE)
    legend(2,2.75,c("Z1 (Dow)","Z2 (S&P)","Z3 (Nasdaq)","C1"),
           lty=1:4,col=2:5,lwd=2)

b)

    C2 <- 1*Z1 - 1*Z2 + 0*Z3
    lablist <- as.vector(Date)[seq(1,251,by=45)]
    plot(Z1,axes=FALSE,type="l",ylab="Stock Market Value",xlab="Time",
         ylim=c(-3,3),col=2,lwd=2)
    lines(Z2,lty=2,col=3,lwd=2)
    lines(Z3,lty=3,col=4,lwd=2)
    lines(C2,lty=4,col=5,lwd=2)
    box()
    axis(1,at=seq(1,251,by=45),labels=FALSE)
    axis(2,at=seq(-3,3,by=1),labels=TRUE)
    text(seq(1,251,by=45),par("usr")[3],pos=1,labels=lablist,xpd=TRUE)
    legend(2,2.75,c("Z1 (Dow)","Z2 (S&P)","Z3 (Nasdaq)","C2"),
           lty=1:4,col=2:5,lwd=2)

c)

    C3 <- 1/3*Z1 + 1/3*Z2 + 1/3*Z3
    lablist <- as.vector(Date)[seq(1,251,by=45)]
    plot(Z1,axes=FALSE,type="l",ylab="Stock Market Value",xlab="Time",
         ylim=c(-3,3),col=2,lwd=2)
    lines(Z2,lty=2,col=3,lwd=2)
    lines(Z3,lty=3,col=4,lwd=2)
    lines(C3,lty=4,col=5,lwd=2)
    box()
    axis(1,at=seq(1,251,by=45),labels=FALSE)
    axis(2,at=seq(-3,3,by=1),labels=TRUE)
    text(seq(1,251,by=45),par("usr")[3],pos=1,labels=lablist,xpd=TRUE)
    legend(2,2.75,c("Z1 (Dow)","Z2 (S&P)","Z3 (Nasdaq)","C3"),
           lty=1:4,col=2:5,lwd=2)

d)

    C4 <- 0*Z1 + 0*Z2 - 1*Z3
    lablist <- as.vector(Date)[seq(1,251,by=45)]
    plot(Z1,axes=FALSE,type="l",ylab="Stock Market Value",xlab="Time",
         ylim=c(-3,3),col=2,lwd=2)
    lines(Z2,lty=2,col=3,lwd=2)
    lines(Z3,lty=3,col=4,lwd=2)
    lines(C4,lty=4,col=5,lwd=2)
    box()
    axis(1,at=seq(1,251,by=45),labels=FALSE)
    axis(2,at=seq(-3,3,by=1),labels=TRUE)
    text(seq(1,251,by=45),par("usr")[3],pos=1,labels=lablist,xpd=TRUE)
    legend(2,2.75,c("Z1 (Dow)","Z2 (S&P)","Z3 (Nasdaq)","C4"),
           lty=1:4,col=2:5,lwd=2)

5. Calculate the 2 by 2 correlation matrix for the Dow and S&P variables:

    Corr1 <- cor(cbind(Dow,SP))

Type "Corr1" to examine the correlation matrix.

6. Calculate the 2 by 2 correlation matrix for the Dow and Nasdaq variables:

    Corr2 <- cor(cbind(Dow,Nasdaq))

Type "Corr2" to examine the correlation matrix.

9. Use R to calculate the eigenvalues and eigenvectors of the correlation matrix R (for the Dow and S&P variables). Eigenvalues of Corr1:

    eigen(Corr1)$values

(Normalized) eigenvectors of Corr1:

    eigen(Corr1)$vectors

Note that the signs of the eigenvector coefficients may differ from those computed using Minitab.
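As a quick check on what eigen() returns, each eigenvalue/eigenvector pair should satisfy the defining equation R v = λ v. A minimal sketch using the first pair from Corr1:

    e <- eigen(Corr1)
    Corr1 %*% e$vectors[,1]      #Left-hand side: R v
    e$values[1] * e$vectors[,1]  #Right-hand side: lambda v; the two should match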

10. Calculate the principal components:

    PC1 <- .707*Z1+.707*Z2
    PC2 <- -.707*Z1+.707*Z2

a) Create a scatterplot of PC1 versus PC2:

    plot(PC1,PC2)

Calculate the correlation between PC1 and PC2:

    cor(PC1,PC2)

b) Calculate the variances of PC1 and PC2:

    var(PC1)
    var(PC2)
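The variances in part b) should be approximately equal to the eigenvalues of Corr1 from Question 9 (only approximately, because the loadings were rounded to .707). A quick check:

    c(var(PC1),var(PC2))  #Compare with the eigenvalues below
    eigen(Corr1)$values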

11. Create a time series plot of Z1, Z2, PC1, and PC2:

    lablist <- as.vector(Date)[seq(1,251,by=45)]
    plot(Z1,axes=FALSE,type="l",ylab="Stock Market Value",xlab="Time",
         ylim=c(-3,3),col=2,lwd=2)
    lines(Z2,lty=2,col=3,lwd=2)
    lines(PC1,lty=3,col=4,lwd=2)
    lines(PC2,lty=4,col=5,lwd=2)
    box()
    axis(1,at=seq(1,251,by=45),labels=FALSE)
    axis(2,at=seq(-3,3,by=1),labels=TRUE)
    text(seq(1,251,by=45),par("usr")[3],pos=1,labels=lablist,xpd=TRUE)
    legend(2,2.75,c("Z1 (Dow)","Z2 (S&P)","PC1","PC2"),lty=1:4,col=2:5,lwd=2)

12. Standardize the Nasdaq column into a new column, Z3, and conduct a principal component analysis on all three variables:

    Z1 <- scale(Dow)     #Standardize DJ
    Z2 <- scale(SP)      #Standardize SP
    Z3 <- scale(Nasdaq)  #Standardize Nasdaq

a) To perform a principal component analysis, we can either find the eigenvalues and eigenvectors of the correlation matrix R, or we can use the R function princomp(). For this problem, use the princomp() function:

    pc.stocks <- princomp(cbind(Z1,Z2,Z3))  #Creates a PCA object
    round(loadings(pc.stocks),3)            #Round the values to 3 decimal places

The columns of the loadings output are the eigenvectors of the correlation matrix R. Compute the eigenvalues of the correlation matrix R:

    R <- cor(cbind(Dow,SP,Nasdaq))
    eigen(R)$values

b) Calculate the values (scores) of the three principal components:

    PC1 <- pc.stocks$scores[,1]
    PC2 <- pc.stocks$scores[,2]
    PC3 <- pc.stocks$scores[,3]
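As an aside, princomp() can do the standardizing itself: calling it on the raw columns with cor=TRUE bases the analysis on the correlation matrix, which should reproduce the loadings above (possibly with signs flipped). A sketch, assuming Dow, SP, and Nasdaq are still attached:

    pc.stocks2 <- princomp(cbind(Dow,SP,Nasdaq),cor=TRUE)
    round(loadings(pc.stocks2),3)  #Same loadings, up to sign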

13. For this question, the R package scatterplot3d is required and must be installed. Create a 3-D plot of the data:

    library(scatterplot3d)
    s3d <- scatterplot3d(Z1,Z2,Z3,angle=105)
    #Plot 1st PC
    s3d$points3d(c(-0.5821417,0),c(-0.6080418,0),c(-0.5398113,0),
                 type="l",col="blue",lwd=2)
    #Plot 2nd PC
    s3d$points3d(c(-0.5377453,0),c(-0.2100566,0),c(0.8165208,0),
                 type="l",col="red",lwd=2)
    #Plot 3rd PC
    s3d$points3d(c(0.6098697,0),c(-0.7656118,0),c(0.2046889,0),
                 type="l",col="green",lwd=2)
    legend(1,0,c("PC1","PC2","PC3"),lty=1,col=c("blue","red","green"),lwd=2)
    #Rotate the plot (highlight the following "for loop" and run)
    for (i in 1:460) {
      s3d <- scatterplot3d(Z1,Z2,Z3,angle=i)
      s3d$points3d(c(-0.5821417,0),c(-0.6080418,0),c(-0.5398113,0),
                   type="l",col="blue",lwd=2)
      s3d$points3d(c(-0.5377453,0),c(-0.2100566,0),c(0.8165208,0),
                   type="l",col="red",lwd=2)
      s3d$points3d(c(0.6098697,0),c(-0.7656118,0),c(0.2046889,0),
                   type="l",col="green",lwd=2)
      legend(1,0,c("PC1","PC2","PC3"),lty=1,col=c("blue","red","green"),lwd=2)
    }

14. Create a time series plot with Z1, Z2, Z3, PC1, PC2, and PC3:

    lablist <- as.vector(Date)[seq(1,251,by=45)]
    plot(-PC1,axes=FALSE,type="l",ylab="Stock Market Value",xlab="Time",
         ylim=c(-4,4),col=2,lwd=2)
    lines(PC2,lty=2,col=3,lwd=2)
    lines(PC3,lty=3,col=4,lwd=2)
    lines(Z1,lty=4,col=5,lwd=2)
    lines(Z2,lty=5,col=6,lwd=2)
    lines(Z3,lty=6,col=7,lwd=2)
    box()
    axis(1,at=seq(1,251,by=45),labels=FALSE)
    axis(2,at=seq(-3,3,by=1),labels=TRUE)
    text(seq(1,251,by=45),par("usr")[3],pos=1,labels=lablist,xpd=TRUE)
    legend(2,4,c("PC1","PC2","PC3","Z1","Z2","Z3"),lty=1:6,col=2:7,lwd=2)

15. Percentage of the variability explained by the first principal component:

    var(PC1)/(var(PC1)+var(PC2)+var(PC3))

Percentage of the variability explained by the first two principal components combined:

    (var(PC1)+var(PC2))/(var(PC1)+var(PC2)+var(PC3))
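These proportions can also be read directly from the princomp object created in Question 12; summary() reports the proportion of variance and the cumulative proportion for each component:

    summary(pc.stocks)  #See the "Proportion of Variance" row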

Extended Activities
16. For this question, we'll perform PCA by finding the eigenvectors and eigenvalues of the covariance matrix (i.e., using the unstandardized data). First, calculate the covariance matrix S:

    S <- cov(cbind(Dow,SP,Nasdaq))

Eigenvalues and eigenvectors of S:

    eigen(S)$values
    eigen(S)$vectors

17. First compute the values of PC1 (requires some matrix algebra):

    PC1 <- cbind(Dow,SP,Nasdaq)%*%eigen(S)$vectors[,1]

Create a time series plot of the original data (X1, X2, and X3) and the first principal component, PC1:

    lablist <- as.vector(Date)[seq(1,251,by=45)]
    plot(PC1,axes=FALSE,type="l",ylab="Stock Market Value",xlab="Time",
         col=2,lwd=2,ylim=c(-700,12500))
    lines(Dow,lty=2,col=3,lwd=2)
    lines(SP,lty=3,col=4,lwd=2)
    lines(Nasdaq,lty=4,col=5,lwd=2)
    box()
    axis(1,at=seq(1,251,by=45),labels=FALSE)
    axis(2,at=seq(-700,12500,by=1000),labels=TRUE)
    text(seq(1,251,by=45),par("usr")[3],pos=1,labels=lablist,xpd=TRUE)
    legend(2,8000,c("PC1","Dow","SP","Nasdaq"),lty=1:4,col=2:5,lwd=2)

18. Multiply the Nasdaq values by 5000:

    NewNasdaq <- 5000*Nasdaq

a) Conduct PCA using the covariance matrix on the Dow, S&P, and NewNasdaq. First, calculate the new covariance matrix Snew:

    Snew <- cov(cbind(Dow,SP,NewNasdaq))

Eigenvalues and eigenvectors of Snew:

    eigen(Snew)$values
    eigen(Snew)$vectors

b) Conduct PCA using the correlation matrix on the Dow, S&P, and NewNasdaq. First, calculate the new correlation matrix Rnew:

    Rnew <- cor(cbind(Dow,SP,NewNasdaq))

Eigenvalues and eigenvectors of Rnew:

    eigen(Rnew)$values
    eigen(Rnew)$vectors
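Part b) illustrates that correlation-based PCA is scale invariant: multiplying Nasdaq by a constant leaves the correlations unchanged, so Rnew should reproduce the eigenvalues of R from Question 12. A quick check, assuming R is still in your session:

    all.equal(eigen(Rnew)$values,eigen(R)$values)  #Expect TRUE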

19. Read the Versicolor dataset into your R session:

    versicolor <- read.table("FILE PATH/Versicolor.txt",header=TRUE,sep="\t")

Perform a PCA on the versicolor data using the correlation matrix. First create a PCA object using the princomp() function:

    versi.pca <- princomp(versicolor,cor=TRUE)

a) The four eigenvalues can be obtained using the syntax:

    evalues <- (versi.pca$sdev)^2

b) Percentage of variation explained by the first PC:

    evalues[1]/sum(evalues)

c) Percentage of variation explained by the first two PCs combined:

    sum(evalues[1:2])/sum(evalues)

d) Construct a screeplot:

    screeplot(versi.pca)

20. Percentage of variation in the original data explained by the third component:

    evalues[3]/sum(evalues)

21. Create a loadings plot. First store the eigenvectors (loadings) of the correlation matrix in evectors:

    evectors <- eigen(cor(versicolor))$vectors  #Eigenvectors of R

Then plot the loadings for the third and fourth components:

    plot(evectors[,3],-evectors[,4],xlab="PC3",ylab="PC4")
    lines(c(0,evectors[,3][1]),c(0,-evectors[,4][1]),col=1)
    lines(c(0,evectors[,3][2]),c(0,-evectors[,4][2]),col=2)
    lines(c(0,evectors[,3][3]),c(0,-evectors[,4][3]),col=3)
    lines(c(0,evectors[,3][4]),c(0,-evectors[,4][4]),col=4)


24. Read the Cars dataset into your R session and attach it:

    cars <- read.table("FILE PATH/Cars.txt",header=TRUE,sep="\t")
    attach(cars)

Perform PCA on liters (Liter) and cylinders (Cyl), and store the results in cars.pca:

    cars.pca <- princomp(cbind(Liter,Cyl),cor=TRUE)

a) Create a new variable (PC1) that is a linear combination of liters and cylinders:

    PC1 <- cbind(Liter,Cyl)%*%loadings(cars.pca)[,1]

Use the first principal component, plus Mileage, Buick, Cadillac, Chevrolet, Pontiac, and SAAB in a regression model to predict the natural log of retail price, LnPrice:

    cars.lm1 <- glm(LnPrice~PC1+Mileage+Buick+Cadillac+Chevrolet+Pontiac+SAAB)

Examine the output of cars.lm1 with the syntax:

    summary(cars.lm1)

b) Use Liter, Cyl, Mileage, Buick, Cadillac, Chevrolet, Pontiac, and SAAB in a regression model to predict the natural log of retail price, LnPrice:

    cars.lm2 <- glm(LnPrice~Liter+Cyl+Mileage+Buick+Cadillac+Chevrolet+Pontiac+SAAB)

Examine the output of cars.lm2 with the syntax:

    summary(cars.lm2)


Chapter 11: Bayesian Data Analysis: What Colors Come in Your M&M's Candy Bag?
Activities
2. If using the MMs dataset, then read the data into your R session:

    mms <- read.table("FILE PATH/MMs.txt",header=TRUE,sep="\t")

Plot the relative frequency estimates:

    plot(mms$Total.Number.MMs,mms$Proportion.B.or.O,type="l",
         xlab="Sample Size (Number of M&M's)",
         ylab="Proportion of Brown or Orange M&M's")

Extended Activities
34. To find the parameters of the beta prior distribution, use the R function beta.param():

    beta.param <- function(min,max) {
      n <- 100
      alpha <- round(seq(1,100,length=n),2)
      beta <- round(seq(1,100,length=n),2)
      p1.min <- p2.max <- .1
      tol <- seq(0,.1,length=50)
      for (k in 1:length(tol)) {
        for (i in 1:n) {
          for (j in 1:n) {
            q1 <- qbeta(p1.min,alpha[i],beta[j])
            q2 <- qbeta((1-p2.max),alpha[i],beta[j])
            if ((round(q1,2)>(round(min,2)-tol[k]) &&
                 round(q1,2)<(round(min,2)+tol[k])) &&
                (round(q2,2)>(round(max,2)-tol[k]) &&
                 round(q2,2)<(round(max,2)+tol[k])))
              return(list(alpha=alpha[i],beta=beta[j]))
          }
        }
      }
    }

Within the parentheses, enter the values of min and max separated by a comma. For the skeptic, min = .21 and max = .29, so the parameters for the Beta prior are found with the following syntax:

    beta.param(.21,.29)

The following output will appear, providing the parameters of the prior distribution:

    > beta.param(.21,.29)
    $alpha
    [1] 27
    $beta
    [1] 81
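As a sanity check on the returned parameters: beta.param() searches for a Beta distribution whose 10th and 90th percentiles land near the supplied min and max, so for the skeptic the percentiles of a Beta(27,81) should fall near the requested .21 and .29 (within the function's tolerance):

    qbeta(c(.1,.9),27,81)  #10th and 90th percentiles; expect roughly .21 and .29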


36. For the open-minded individual, min = .21 and max = .5, so the parameters for the Beta prior are found with the following syntax:

    beta.param(.21,.5)

For the believer, min = .3 and max = .7, so the parameters for the Beta prior are found with the following code:

    beta.param(.3,.7)

38. Plot the prior and posterior distributions for the open-minded individual and believer:

    par(mfrow=c(1,2))
    x <- seq(0,1,length=500)
    plot(x,dbeta(x,24,43),type="n",ylim=c(0,7),
         xlab=expression(pi*" = Population Proportion of Hits"),
         ylab="Probability Density Function",main="Open-Minded Individual")
    lines(x,dbeta(x,6,11))
    lines(x,dbeta(x,24,43),lty=2)
    legend(.45,6,c(expression("Prior Distribution for "*pi),
           expression("Posterior Distribution for "*pi)),lty=1:2,cex=.7)
    plot(x,dbeta(x,23,37),type="n",ylim=c(0,7),
         xlab=expression(pi*" = Population Proportion of Hits"),
         ylab="Probability Density Function",main="Believer")
    lines(x,dbeta(x,5,5))
    lines(x,dbeta(x,23,37),lty=2)
    legend(.45,6,c(expression("Prior Distribution for "*pi),
           expression("Posterior Distribution for "*pi)),lty=1:2,cex=.7)

41. The qbeta() function is used to find quantiles of the Beta(α, β) distribution. To find the value of L on the Beta(24,33) posterior distribution such that .025 area is to the left of L:

    qbeta(.025,24,33)

To find the value of U on the Beta(24,33) posterior distribution such that .025 area is to the right of U:

    qbeta(.975,24,33)

To find the limits L and U for other beta distributions, simply follow the above syntax and supply the appropriate values of α and β.

45. Find the posterior probability that π < .5:

    pbeta(.5,24,33)

46. To find the 95% credible intervals, follow the syntax provided in Question #41.
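For instance, both limits of the 95% credible interval for the Beta(24,33) posterior can be obtained in a single call:

    qbeta(c(.025,.975),24,33)  #2.5th and 97.5th percentiles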
