SAS Primer
for
Applied Linear Regression, Third Edition
Using SAS
Contents

Introduction
    0.1 Organization of this primer
    0.2 Data files
        0.2.1 Documentation
        0.2.2 SAS data library
        0.2.3 Getting the data in text files
        0.2.4 An exceptional file
    0.3 Scripts
    0.4 The very basics
        0.4.1 Reading a data file
        0.4.2 Saving text output and graphs
        0.4.3 Normal, F, t and χ² tables
    0.5 Abbreviations to remember
    0.6 Copyright and Printing this Primer

1 Scatterplots and Regression

2 Simple Linear Regression

3 Multiple Regression
    3.1 Adding a term to a simple linear regression model
    3.2 The Multiple Linear Regression Model
    3.3 Terms and Predictors
    3.4 Ordinary least squares
    3.5 The analysis of variance
    3.6 Predictions and fitted values

4 Drawing Conclusions
    4.1 Understanding parameter estimates
        4.1.1 Rate of change
        4.1.2 Sign of estimates
        4.1.3 Interpretation depends on other terms in the mean function
        4.1.4 Rank deficient and over-parameterized models
    4.2 Experimentation versus observation
    4.3 Sampling from a normal population
    4.4 More on R²
    4.5 Missing data
    4.6 Computationally intensive methods

7 Transformations
    7.1 Transformations and scatterplots
        7.1.1 Power transformations
        7.1.2 Transforming only the predictor variable
        7.1.3 Transforming the response only
        7.1.4 The Box and Cox method
    7.2 Transformations and scatterplot matrices
        7.2.1 The 1D estimation result and linearly related predictors
        7.2.2 Automatic choice of transformation of the predictors
    7.3 Transforming the response
    7.4 Transformations of non-positive variables

    8.3 Nonconstant variance
        8.3.1 Variance Stabilizing Transformations
        8.3.2 A diagnostic for nonconstant variance
        8.3.3 Additional comments
    8.4 Graphs for model assessment
        8.4.1 Checking mean functions
        8.4.2 Checking variance functions

10 Variable Selection
    10.1 The Active Terms
        10.1.1 Collinearity
        10.1.2 Collinearity and variances
    10.2 Variable selection
        10.2.1 Information criteria
        10.2.2 Computationally intensive criteria
        10.2.3 Using subject-matter knowledge
    10.3 Computational methods
        10.3.1 Subset selection overstates significance
    10.4 Windmills
        10.4.1 Six mean functions
        10.4.2 A computationally intensive approach

11 Nonlinear Regression
    11.1 Estimation for nonlinear mean functions
    11.2 Inference assuming large samples

12 Logistic Regression
    12.1 Binomial Regression
        12.1.1 Mean Functions for Binomial Regression
    12.2 Fitting Logistic Regression
        12.2.1 One-predictor example
        12.2.2 Many Terms
        12.2.3 Deviance
        12.2.4 Goodness of Fit Tests
    12.3 Binomial Random Variables
        12.3.1 Maximum likelihood estimation
        12.3.2 The Log-likelihood for Logistic Regression
    12.4 Generalized linear models

References

Index
0 Introduction
This computer primer supplements the book Applied Linear Regression (alr),
third edition, by Sanford Weisberg, published by John Wiley & Sons in 2005.
It shows you how to do the analyses discussed in alr using one of several
general-purpose programs that are widely available throughout the world. All
the programs have capabilities well beyond the uses described here. Different
programs are likely to suit different users. We expect to update the primer
periodically, so check www.stat.umn.edu/alr to see if you have the most recent
version. The versions are indicated by the date shown on the cover page of
the primer.
Our purpose is largely limited to using the packages with alr, and we will
not attempt to provide a complete introduction to the packages. If you are
new to the package you are using you will probably need additional reference
material.
There are a number of methods discussed in alr that are not (as yet)
a standard part of statistical analysis, and some methods are not possible
without writing your own programs to supplement the package you choose.
The exceptions to this rule are R and S-Plus. For these two packages we have
written functions you can easily download and use for nearly everything in the
book.
Here are the programs for which primers are available.
R is a command line statistical package, which means that the user types
a statement requesting a computation or a graph, and it is executed
immediately. You will be able to use a package of functions for R that
will let you use all the methods discussed in alr; we used R when writing
the book.
R also has a programming language that allows automating repetitive
tasks. R is a favorite program among academic statisticians because
it is free, works on Windows, Linux/Unix and Macintosh, and can be
used in a great variety of problems. There is also a large literature
developing on using R for statistical problems. The main website for
R is www.r-project.org. From this website you can get to the page for
downloading R by clicking on the link for CRAN, or, in the US, going to
cran.us.r-project.org.
Documentation is available for R on-line, from the website, and in several
books. We can strongly recommend two books. The book by Fox (2002)
provides a fairly gentle introduction to R with emphasis on regression.
We will from time to time make use of some of the functions discussed in
Fox's book that are not in the base R program. A more comprehensive
introduction to R is Venables and Ripley (2002), and we will use the
notation vr[3.1], for example, to refer to Section 3.1 of that book.
Venables and Ripley has more computerese than does Fox's book, but
its coverage is greater and you will be able to use this book for more than
linear regression. Other books on R include Verzani (2005), Maindonald
and Braun (2002), Venables and Smith (2002), and Dalgaard (2002). We
used R Version 2.0.0 on Windows and Linux to write the package. A
new version of R is released twice a year, so the version you use will
probably be newer. If you have a fast internet connection, downloading
and upgrading R is easy, and you should do it regularly.
S-Plus is very similar to R, and most commands that work in R also work in
S-Plus. Both are variants of a statistical language called S that was
written at Bell Laboratories before the breakup of AT&T. Unlike R, S-Plus is a commercial product, which means that it is not free, although
there is a free student version available at elms03.e-academy.com/splus.
The website of the publisher is www.insightful.com/products/splus. A
library of functions very similar to those for R is also available that will
make S-Plus useful for all the methods discussed in alr.
S-Plus has a well-developed graphical user interface or GUI. Many new
users of S-Plus are likely to learn to use this program through the GUI,
not through the command-line interface. In this primer, however, we
make no use of the GUI.
If you are using S-Plus on a Windows machine, you probably have the
manuals that came with the program. If you are using Linux/Unix, you
may not have the manuals. In either case the manuals are available
online; for Windows see the Help Online Manuals, and for Linux/Unix
use
> cd Splus SHOME/doc
> ls

[Fig. 0.1: SAS windows.]

and see the pdf documents there. Chambers and Hastie (1993) provides
the basics of fitting models with S languages like S-Plus and R. For a
more general reference, we again recommend Fox (2002) and Venables
and Ripley (2002), as we did for R. We used S-Plus Version 6.0 Release
1 for Linux, and S-Plus 6.2 for Windows. Newer versions of both are
available.
SAS is the largest and most widely distributed statistical package in both
industry and education. SAS also has a GUI. When you start SAS,
you get the collection of windows shown in Figure 0.1. While it is
possible to do some data analysis using the SAS GUI, the strength of this
program is in the ability to write SAS programs, in the editor window,
and then submit them for execution, with output returned in an output
window. We will therefore view SAS as a batch system, and concentrate
mostly on writing SAS commands to be executed. The website for SAS
is www.sas.com.
challenge: try to figure out how to do what is suggested in the text, and write
your own primer! Send us a PDF file ([email protected]) and we will add
it to our website, or link to yours.
One program missing from the list of programs for regression analysis is
Microsoft's spreadsheet program Excel. While a few of the methods described
in the book can be computed or graphed in Excel, most would require great
endurance and patience on the part of the user. There are many add-on
statistics programs for Excel, and one of these may be useful for comprehensive
regression analysis; we don't know. If something works for you, please let us
know!
A final package for regression that we should mention is called Arc. Like
R, Arc is free software. It is available from www.stat.umn.edu/arc. Like JMP
and SPSS it is based around a graphical user interface, so most computations
are done via point-and-click. Arc also includes access to a complete computer
language, although the language, lisp, is considerably harder to learn than the
S or SAS languages. Arc includes all the methods described in the book. The
use of Arc is described in Cook and Weisberg (1999), so we will not discuss it
further here; see also Weisberg (2005).
0.1 ORGANIZATION OF THIS PRIMER
The primer often refers to specific problems or sections in alr using notation
like alr[3.2] or alr[A.5], for a reference to Section 3.2 or Appendix A.5,
alr[P3.1] for Problem 3.1, alr[F1.1] for Figure 1.1, alr[E2.6] for an equation and alr[T2.1] for a table. Reference to, for example, Figure 7.1, would
refer to a figure in this primer, not to alr. Chapters, sections, and homework
problems are numbered in this primer as they are in alr. Consequently, the
section headings in this primer refer to the material in alr, and not necessarily
the material in the primer. Many of the sections in this primer don't have any
material because that section doesn't introduce any new issues with regard to
computing. The index should help you navigate through the primer.
There are four versions of this primer, one for R and S-Plus, and one for
each of the other packages. All versions are available for free as PDF files at
www.stat.umn.edu/alr.
Anything you need to type into the program will always be in this font.
Output from a program depends on the program, but should be clear from
context. We will write File to suggest selecting the menu called File, and
Transform → Recode to suggest selecting an item called Recode from a menu
called Transform. You will sometimes need to push a button in a dialog,
and we will write push ok to mean click on the button marked OK. For
non-English versions of some of the programs, the menus may have different
names, and we apologize in advance for any confusion this causes.
[Table 0.1: the data file htwt.txt, with ten cases on the variables Ht and Wt.]
0.2 DATA FILES
0.2.1 Documentation
Documentation for nearly all of the data files is contained in alr; look
in the index for the first reference to a data file. Separate documentation can be found in the file alr3data.pdf in PDF format at the web site
www.stat.umn.edu/alr.
The data are available in a package for R, in a library for S-Plus and for SAS,
and as a directory of files in special format for JMP and SPSS. In addition,
the files are available as plain text files that can be used with these, or any
other, program. Table 0.1 shows a copy of one of the smallest data files called
htwt.txt, and described in alr[P3.1]. This file has two variables, named Ht
and Wt, and ten cases, or rows in the data file. The largest file is wm5.txt with
62,040 cases and 14 variables. This latter file is so large that it is handled
differently from the others; see Section 0.2.4.
A few of the data files have missing values, and these are generally indicated
in the file by a place-holder in the place of the missing value. For example, for
R and S-Plus, the placeholder is NA, while for SAS it is a period (.). Different
programs handle missing values a little differently; we will discuss this further
when we get to the first data set with a missing value in Section 4.5.
0.2.2 SAS data library
The data for use with SAS are provided in a special SAS format, or as plain
data files. Instructions for getting and installing the SAS library are given on
the SAS page at www.stat.umn.edu/alr.
If you follow the directions on the web site, you will create a data library
called alr3 that will always be present when you start SAS. Most procedures
in SAS have a required data keyword that tells the program where to find the
data to be used. For example, the simple SAS program
proc means data = alr3.heights;
run;
tells SAS to run the means procedure using the data set heights in the library
alr3. To run this program, you type it into the editor window, and then select
and submit it; see Section 0.4.1.
0.2.3 Getting the data in text files
You can download the data as a directory of plain text files, or as individual
files; see www.stat.umn.edu/alr/data. Missing values in these files are indicated with a ?. If your program does not use this missing value character, you
may need to substitute a different character using an editor.
0.2.4 An exceptional file
0.3 SCRIPTS
For R, S-Plus, and SAS, we have prepared script files that can be used while
reading this primer. For R and S-Plus, the scripts will reproduce nearly every
computation shown in alr; indeed, these scripts were used to do the calculations in the first place. For SAS, the scripts correspond to the discussion
given in this primer, but will not reproduce everything in alr. The scripts
can be downloaded from www.stat.umn.edu/alr for R, S-Plus or SAS.
Although both JMP and SPSS have scripting or programming languages, we
have not prepared scripts for these programs. Some of the methods discussed
in alr are not possible in these programs without the use of scripts, and so
we encourage readers to write scripts in these languages that implement these
ideas. Topics that require scripts include bootstrapping and computer intensive methods, alr[4.6]; partial one-dimensional models, alr[6.4], inverse response plots, alr[7.1, 7.3], multivariate Box-Cox transformations, alr[7.2],
Yeo-Johnson transformations, alr[7.4], and heteroscedasticity tests, alr[8.3.2].
There are several other places where usability could be improved with a script.
If you write scripts you would like to share with others, let me know
([email protected]) and I'll make a link to them or add them to the website.
0.4 THE VERY BASICS
Before you can begin doing any useful computing, you need to be able to read
data into the program, and after you are done you need to be able to save
and print output and graphs. All the programs are a little different in how
they handle input and output, and we give some of the details here.
0.4.1 Reading a data file
Using the wizard is equivalent to writing this SAS program and then executing
it. Like most SAS programs: (1) all statements end with a semicolon ;, (2)
the program consists of blocks that begin with a proc to start a procedure and
end with a run; to execute it, and (3) several statements are included in
the procedure call, each ending with a semicolon. There are a few exceptions
to these rules that we will encounter later.
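A minimal sketch of the kind of import program the wizard writes follows; the data set name, file path, and delimiter here are placeholder assumptions of ours, not taken from the original:

```
PROC IMPORT OUT=htwt DATAFILE="htwt.txt" DBMS=DLM REPLACE;
   DELIMITER=' ';   /* fields separated by blanks (assumed) */
   GETNAMES=YES;    /* first row holds the variable names */
RUN;
```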
SAS is not case-sensitive, meaning that you can type programs in capitals,
as shown above, using lower-case letters, or any combination of them you
like.
0.4.2 Saving text output and graphs
All the programs have many ways of saving text output and graphs. We will
make no attempt to be comprehensive here.
SAS Some SAS procedures produce a lot of output. SAS output is by default
centered on the page, and assumes that you have paper with 132 columns.
You can, however, change these defaults to other values. For example, the
following line at the beginning of a SAS program
options nocenter linesize=75 pagesize=66;
0.4.3 Normal, F, t and χ² tables
alr does not include tables for looking up critical values and significance
levels for standard distributions like the t, F and χ². Although these values
can be computed with any of the programs we discuss in the primers, doing
so is easy only with R and S-Plus. Also, the computation is fairly easy with
Microsoft Excel. Table 0.2 shows the functions you need using Excel.
SAS SAS allows computing critical values and significance
levels for a very long list of distributions. Two functions are used, called CDF
Table 0.2 Functions for computing p-values and critical values using Microsoft Excel.
The definitions for these functions are not consistent, sometimes corresponding to
two-tailed tests, sometimes giving upper tails, and sometimes lower tails. Read the
definitions carefully. The algorithms used to compute probability functions in Excel
are of dubious quality, but for the purpose of determining p-values or critical values,
they should be adequate; see Knüsel (2005) for more discussion.

Function             What it does
normsinv(p)          Standard normal quantile, the q with P(Z ≤ q) = p
normsdist(q)         Lower-tail probability P(Z ≤ q) for the standard normal
tinv(p,df)           Two-tailed critical value, the q with P(|t(df)| ≥ q) = p
tdist(q,df,tails)    Tail probability for t(df), with tails = 1 or 2
finv(p,df1,df2)      Critical value, the q with P(F(df1,df2) ≥ q) = p
fdist(q,df1,df2)     Upper-tail probability P(F(df1,df2) ≥ q)
chiinv(p,df)         Critical value, the q with P(χ²(df) ≥ q) = p
chidist(q,df)        Upper-tail probability P(χ²(df) ≥ q)
and quantile. These functions are used inside the data step in SAS, and so
you need to write a short SAS program to get tabled values. For example,
the following program returns the p-value from a χ²(25) test with value of the
test statistic equal to 32.5, and also the critical value at the 95% level for this
test.
data qt;
pval = 1-CDF('chisquare',32.5,25);
critval = quantile('chisquare',.95,25);
output;
proc print data=qt; run;
pval        critval
0.14405     37.6525
giving the p-value of .14, and the critical value 37.6, for the test.
To get F or t significance levels, use functions similar to the following:
1-CDF('F',value,df1,df2)       F p-values
quantile('F',level,df1,df2)    F critical values
1-CDF('t',value,df)            t p-values
quantile('t',level,df)         t critical values
For more information, go to the Index tab in SAS on-line help, and type either
quantile or cdf.
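Following the pattern of the chi-squared example above, a short data-step sketch for t and F tables; the test-statistic values and degrees of freedom here are made-up illustrations:

```
data tab;
   t_pval = 1-CDF('t',2.1,20);        /* upper-tail p-value for t(20) */
   t_crit = quantile('t',.975,20);    /* two-sided 5% critical value */
   F_pval = 1-CDF('F',4.3,3,25);      /* upper-tail p-value for F(3,25) */
   F_crit = quantile('F',.95,3,25);   /* 5% critical value for F(3,25) */
   output;
proc print data=tab; run;
```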
0.5 ABBREVIATIONS TO REMEMBER
alr refers to the textbook, Weisberg (2005). vr refers to Venables and Ripley
(2002), our primary reference for R and S-Plus. jmp-start refers to Sall,
Creighton and Lehman (2005), the primary reference for JMP. Information
typed by the user looks like this. References to menu items look like File
or Transform → Recode. The name of a button to push in a dialog uses this
font.
0.6 COPYRIGHT AND PRINTING THIS PRIMER
1 Scatterplots and Regression
1.1 SCATTERPLOTS
As with most SAS examples in this primer, we assume that you have the
data from alr available as a SAS library called alr3; see Section 0.2.2. The
phrase data=alr3.heights specifies that the data will come from the data set
called heights in the alr3 library. The phrase plot Dheight*Mheight draws
the plot with Mheight on the horizontal axis and Dheight on the vertical. run;
tells SAS to execute the procedure, and quit; tells SAS to close the graph,
meaning that it will not be enhanced by adding more points or lines. Simple
graphs like this one require only a very short SAS program. The output from
the above program is shown in Figure 1.1. We saved this file as a PostScript
file because that is required by our typesetting program. SAS postscript files
seem to be of low quality.
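In outline, the program just described is the following sketch, reconstructed from the description; details of the original listing may differ:

```
goptions reset=all;
proc gplot data=alr3.heights;
   plot Dheight*Mheight;   /* Dheight vertical, Mheight horizontal */
run; quit;
```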
You can also use the menus to draw simple graphs like this one. Select
Solutions → Analyze → Interactive data analysis, and in the resulting window, select the library alr3 and the data set heights in that library. The data
will appear in a spreadsheet. Select Analyze → Scatter plot to draw this plot.
An advantage to this method of drawing the plot is that the spreadsheet and
the plot will be linked, meaning that selecting points on the plot will highlight
the corresponding values in the spreadsheet. This can be useful for identifying
unusual points.
Returning to proc gplot, you can add commands that will modify the
appearance of the plot. These options must be added before the call to proc
gplot. Continuing with alr[F1.1],
goptions reset=all;
title ;
symbol v=dot c=black h=.1;
axis1 label=('Mheight');
axis2 label=(a=90 'Dheight');
proc gplot data=alr3.heights;
plot Dheight*Mheight /haxis=axis1 haxis=55 to 75 hminor=0
     vaxis=axis2 vaxis=55 to 75 vminor=0;
run; quit;
We use the goptions command to reset all options to their default values so
left-over options from a previous graph won't appear in the current one. The
next four commands set the title, the plotting symbol, and the labels for the
two axes. The vertical axis label has been rotated ninety degrees. In the gplot
command, the haxis and vaxis keywords are set to axis1 and axis2 to get the
variable labels. The remainder of the command sets the tick marks and ensures
that both axes extend from 55 to 75.
Similar SAS code to produce a version of alr[F1.2] illustrates using the
where statement:
goptions reset=all;
proc gplot data=alr3.heights;
plot Dheight*Mheight /haxis=axis1 haxis=55 to 75 hminor=0
vaxis=axis2 vaxis=55 to 75 vminor=0;
where (Mheight ge 57.5) & (Mheight le 58.5) |
(Mheight ge 62.5) & (Mheight le 63.5) |
(Mheight ge 67.5) & (Mheight le 68.5);
run; quit;
Inside proc gplot, specify the data set by data=dataset-name. The first argument Dheight will be the vertical axis variable, and the second argument
Mheight will be on the horizontal axis. As with many SAS commands, options
for a command like plot are added after a /. We let the horizontal axis be
axis1, and the vertical axis be axis2. The where statement uses syntax similar
to the C language. The logical operators & ("and") and | ("or") are used to
specify ranges for plotting.
A few reminders:
1. All statements end with ;.
2. Always use goptions reset=all; to remove all existing graphical setup
before plotting.
3. Statements for changing graphical options like symbol have to be specified before proc gplot is called.
4. Nearly all SAS commands end with run;.
alr[F1.3-1.4] includes both points and fitted lines. For alr[F1.3a], we
need to get the ols fitted line. For alr[F1.3b], we need to compute residuals.
This can be done by calling proc reg in SAS, which also allows graphing:
goptions reset=all;
symbol v=circle h=1;
proc reg data=alr3.forbes;
model Pressure=Temp;
plot Pressure*Temp;
plot residual.*Temp;
run;
We define circle as the plotting symbol in the symbol statement. For the model
statement, the syntax is model response=terms;. The plot command in proc
reg draws graphs of regression quantities, with the syntax vertical*horizontal.
With simple regression, the plot of the response versus the single predictor
adds the ols line automatically. The second plot makes use of the keyword
residual. to get the residuals. Do not forget the dot after keywords!
alr[F1.5] includes both an ols line and a line that joins the mean lengths
at each age. Although this plot seems simple, the SAS program we wrote to
produce it is fairly long.
The plot consists of three parts: the points, the ols fitted line, and the line
joining the means for each value of Age. We couldn't think of an obvious way
to get this last line, and so we used a trick. The loess smoother, discussed in
alr[A.5], can be used for this purpose by setting the smoothing parameter to
a very small number, like .1. The call to proc loess computes the mean
Length for each Age, saved in a data set called m1. Confusingly, you use an
ods statement, not output, to save the output from proc loess. Next, proc reg
is used to get the regression of Length on Age. Its output is saved using the
output statement, not ods; this output includes the input data and the
predicted values in a new data set called m2.
We now have the information for joining the points in the data set m1
and the data for fitting ols in m2. The data statement combines them with
alr3.wblake in another data set called m3. A few graphics commands are
then used. The symbol1, symbol2 and symbol3 statements define plotting
symbols; you can define symbol4, symbol5 and so on. The option i=join tells
SAS to connect the points using straight lines. The option l=1 gives a solid
connecting line, and l=2 a dashed one. proc sort is used to arrange the data
set in ascending order (the default) of the variable Age, which makes sure
that we do not get a mess when we connect data points to get the ols line.
After customizing each symbol, we can write ols*Age=1 Pred*Age=2
Length*Age=3 in the plot statement, which means: use the first symbol
definition for the plot of ols versus Age, the second for the loess Pred versus
Age, and the third for the scatterplot of Length versus Age. proc gplot is
then called with the data set m3, and the overlay option puts all three plots
on a single set of axes.
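Pieced together from the description above, the whole program might look like the following sketch; the symbol options and the loess call are our reconstruction, and the original listing may differ in details:

```
proc loess data=alr3.wblake;
   model Length=Age /smooth=.1;       /* tiny span: means at each Age */
   ods output OutputStatistics=m1;    /* ods, not output, for proc loess */
run;
proc reg data=alr3.wblake;
   model Length=Age;
   output out=m2 predicted=ols;       /* ols fitted values plus input data */
run;
data m3; set m1; set m2; set alr3.wblake;   /* combine the pieces */
proc sort data=m3; by Age; run;
goptions reset=all;
symbol1 i=join l=1 v=none;   /* ols line: solid */
symbol2 i=join l=2 v=none;   /* line joining the means: dashed */
symbol3 v=dot;               /* the points */
proc gplot data=m3;
   plot ols*Age=1 Pred*Age=2 Length*Age=3 /overlay;
run; quit;
```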
The only part of alr[F1.6] that is different from previous graphs is the horizontal line. Getting this line is surprisingly difficult. We recommend that
you define one=1 in the data step and then fit a linear regression of Late
on one. All the fitted values for this regression will be equal to the overall
average.
alr[F1.7] is easier to obtain in SAS, because the variable S takes on values
1, 2, 3, which can be used directly in the plot statement in proc gplot:
goptions reset=all;
proc gplot data=alr3.turkey;
plot gain*A=S;
run; quit;
We write the plot statement as plot y*x=z;, which allows us to make a scatterplot of y versus x with a separate symbol for each value of the variable z.
1.2 MEAN FUNCTIONS
SAS alr[F1.8] adds the line y=x to the plot. We fit a linear regression
model first, save the predicted values to a data set, which is called m1 here,
using the output statement. After this, we call proc gplot to get a version
of alr[F1.8]. The SAS code is given below:
proc reg data=alr3.heights;
model Dheight=Mheight;
output out=m1 predicted=Dhat;
run;
goptions reset=all;
proc gplot data=m1;
plot (Dhat Mheight Dheight)*Mheight /overlay ;
run; quit;
1.3 VARIANCE FUNCTIONS
1.4 SUMMARY GRAPH
1.5 TOOLS FOR LOOKING AT SCATTERPLOTS
SAS proc loess can be used to add a loess smoother to a graph. The following produces a version of alr[F1.10]:
proc reg data=alr3.heights;
model Dheight=Mheight;
output out=m1 predicted=Dhat;
run;
proc loess data=alr3.heights;
model Dheight=Mheight /smooth=.6;
ods output OutputStatistics=myout;
run;
data myout; set myout; set m1;
proc sort data=myout; by Mheight; run;
goptions reset=all;
proc gplot data=myout;
plot (Dhat Pred DepVar)*Mheight /overlay;
run; quit;
1.6 SCATTERPLOT MATRICES
SAS This is the first example we have encountered that requires transformation of some of the variables to be graphed. This is done by creating a new
data set that consists of the old variables plus the transformed variables. The
transformations are done in the SAS data step.
Scatterplot matrices are obtained using proc insight. For the fuel2001
data, here is the program:
data m1;
set alr3.fuel2001;
Dlic=Drivers*1000/Pop;
Fuel=FuelC*1000/Pop;
logMiles=log2(Miles);
goptions reset=all;
proc insight data=m1;
scatter Tax Dlic income logMiles Fuel
*Tax Dlic income logMiles Fuel;
run;
In the data step, we created a new data set called m1 that includes all of
alr3.fuel2001 from the set statement, and the three additional variables computed from transformations; note the use of log2 for base-two logarithms. We
called proc insight using this new data set.
Solutions → Analysis → Interactive Data Analysis also starts proc insight.
You can also transform variables from within this procedure.
Graphs created with proc insight are interactive. If you click the mouse
on a point in this graph, the point will be highlighted in all frames of the
scatterplot matrix, and its case number will be displayed on the graph.
[Fig. 1.6: scatterplot matrix drawn with proc insight.]
Problems
1.1. Boxplots would be useful in a problem like this because they display level
(median) and variability (distance between the quartiles) simultaneously.
SAS SAS has a proc boxplot procedure. The generic syntax is as
follows:
proc boxplot data=alr3.wblake;
plot Length*Age;
run;
The above SAS code means that a separate boxplot of Length will be provided
at each value of Age.
As an alternative, you can use the SAS Interactive Data Analysis tool
to obtain boxplots. After you launch it with a data set, you can choose
Analyze → Box Plot/Mosaic Plot(Y) from the menu. Then you click on Length
once and click the Y button once. Similarly, you can move the variable Age to
the box under the X button. After you click the OK button, a series of boxplots
of Length for each value of Age are printed on the screen.
Just a reminder: Even if you choose to use SAS Interactive Data Analysis
tool, you must have the data available in advance, either using a SAS data
step, or using a data file from a library.
To get the standard deviation at each value of Age, you can call proc means
or proc summary (which is exactly the same as proc means except that it does
not show output) and put Age in the class statement. The keyword for
retrieving standard deviation is std.
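For example, a call along these lines (a sketch of ours, not from the original) prints the mean and standard deviation of Length at each value of Age:

```
proc means data=alr3.wblake mean std;
   class Age;     /* one group per value of Age */
   var Length;
run;
```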
1.2.
SAS To change the size of an image in SAS, you can move your mouse cursor
to the bottom-right edge of the graph window. When the cursor becomes a
double arrow, click and hold the left button of your mouse, then drag the edge
to any size.
1.3.
SAS In SAS, log2(Fertility) returns the base-two logarithm. If you need
transformations, use the procedure outlined in Section 1.6 of the primer.
2 Simple Linear Regression
2.1 ORDINARY LEAST SQUARES ESTIMATION
All the computations for simple regression depend on only a few summary
statistics; the formulas are given in the text, and in this section we show how
to do the computations step-by-step. All computer packages will do these
computations automatically, as we show in Section 2.6.
2.2
SAS The procedure proc means can be used to get summary statistics; see
Table 2.1.
proc means data=alr3.forbes;
var Temp Lpres;
run;
Variable     N           Mean        Std Dev        Minimum        Maximum
Temp        17    202.9529412      5.7596786    194.3000000    212.2000000
Lpres       17    139.6052941      5.1707955    131.7900000    147.8000000
This chapter illustrates the computations that are usually hidden by regression programs. You can follow along with the calculations with the interactive
matrix language in SAS, called proc iml.
data f;
set alr3.forbes ;
one=1;
proc iml;
use f;
read all var {one Temp Pressure Lpres}
where (Temp^=. & Pressure^=. & Lpres^=.) into X;
size=nrow(X); W=X`*X;
Tempbar=W[1,2]/size; Lpresbar=W[1,4]/size; Y=j(size,2,0);
Y[,1]=X[,2]-Tempbar; Y[,2]=X[,4]-Lpresbar;
fcov=Y`*Y; RSS=fcov[2,2]-fcov[1,2]**2/fcov[1,1];
sigmahat2=RSS/(size-2); sigmahat=sqrt(sigmahat2);
b1=fcov[1,2]/fcov[1,1];
b0=Lpresbar-b1*Tempbar;
title 'PROC IML results';
print Tempbar Lpresbar fcov b1 b0 RSS sigmahat2;
create forbes_stat var{Tempbar Lpresbar fcov b1 b0 RSS sigmahat2};
append var{Tempbar Lpresbar fcov b1 b0 RSS sigmahat2};
close forbes_stat;
quit;
[Table 2.2: printed output from proc iml, showing TEMPBAR, LPRESBAR, FCOV, B1, B0, RSS and SIGMAHAT2.]
We start by creating a new data set called f that has all the variables in
alr3.forbes plus a new variable called one that repeats the value 1 for all
cases in the data set, so it is a vector of length 17.
The keywords used in proc iml include use, read all var, where and into.
These load all the variables of interest into the procedure so that we can compute various summary statistics. The variables in the braces are collected into
a new matrix variable we called X, whose columns represent the variables in
the order they are entered. The where statement guarantees that all calculations will be done on non-missing values only. ^= stands for "not equal
to", and is equivalent to the operator ne (not equal) outside proc iml. The
operator X`*X denotes the matrix product X′X. Use the backquote, located at
the upper left corner of the computer keyboard, not the usual single quote, to
denote the transpose. An alternative way of specifying the transpose of a matrix X
is by t(X). The function j(size,2,0) creates a matrix of dimension size by 2
of all zeroes. The operator **2 is for exponentiating, here squaring. The
print statement asks SAS to display some results, and the create and append
statements save some of the variables into a data set called forbes stat. The
create statement defines a structure of the data set called forbes stat, and
the append statement fills in values. If some variables do not have the same
length, then . will be used in the new data set for missing values. The close
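The arithmetic in the proc iml session is just the familiar summary-statistic formulas for simple regression: b1 = SXY/SXX, b0 = ybar − b1*xbar, and RSS = SYY − SXY²/SXX. As a check outside SAS, the same calculations can be sketched in Python; the small data set here is invented for illustration and is not the Forbes data.

```python
# Simple-regression estimates from summary statistics, mirroring the
# proc iml arithmetic above (illustrative data, not the Forbes values).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)

xbar = sum(x) / n
ybar = sum(y) / n
SXX = sum((xi - xbar) ** 2 for xi in x)                       # fcov[1,1]
SXY = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))  # fcov[1,2]
SYY = sum((yi - ybar) ** 2 for yi in y)                       # fcov[2,2]

b1 = SXY / SXX                  # slope estimate
b0 = ybar - b1 * xbar           # intercept estimate
RSS = SYY - SXY ** 2 / SXX      # residual sum of squares
sigmahat2 = RSS / (n - 2)       # estimate of the error variance
```

The identity RSS = SYY − SXY²/SXX can be confirmed by computing the residual sum of squares directly from the fitted line.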
2.3  ESTIMATING σ²

SAS From proc iml, we can easily get the variance estimate as sigmahat2
in the output of proc iml above. RSS is also given above from proc iml. See
the last section.
2.4  PROPERTIES OF LEAST SQUARES ESTIMATES

2.5  ESTIMATED VARIANCES
The estimated variances of coefficient estimates are computed using the summary statistics we have already obtained. These will also be computed automatically by linear regression fitting procedures, as shown in the next section.
2.6  COMPARING MODELS: THE ANALYSIS OF VARIANCE

Computing the analysis of variance and F test by hand requires only the value
of RSS and of SSreg = SYY − RSS. We can then follow the outline given in
alr[2.6].
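For the Forbes regression, the quantities reported by proc reg (Table 2.3) can be combined by hand exactly as in alr[2.6]; here is a quick Python check using the printed values of SSreg and σ̂² = RSS/(n − 2).

```python
# Overall F test assembled from the proc reg summaries for the Forbes
# data: SSreg with 1 df and the error mean square with 15 df.
SSreg = 425.63910       # regression sum of squares (1 df)
sigmahat2 = 0.14366     # error mean square, RSS/(n - 2), with 15 df
RSS = 15 * sigmahat2    # residual sum of squares

F = (SSreg / 1) / sigmahat2
```

The computed F agrees with the value 2962.79 printed by proc reg up to rounding, and RSS matches the "Sum of Squared Residuals" reported later.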
SAS There are several SAS procedures that can do simple linear regression.
The most basic procedure is proc reg, which prints at minimum the summary
of the fit and the ANOVA table. For the Forbes data,
proc reg data = alr3.forbes;
model Lpres = Temp;
run;
is the smallest program for fitting a regression. This will give the output shown
in Table 2.3.
There are literally dozens of options you can add to the model statement
that will modify output. For example, consider the following five statements:
model Lpres=Temp;
model Lpres=Temp/noprint;
model Lpres=Temp/noint;
model Lpres=Temp/all;
model Lpres=Temp/CLB XPX I R;

Table 2.3  Default output from proc reg for the Forbes data.

                       Analysis of Variance
Source             DF   Mean Square   F Value   Pr > F
Model               1     425.63910   2962.79   <.0001
Error              15       0.14366
Corrected Total    16

Root MSE           0.37903    R-Square   0.9950
Dependent Mean   139.60529    Adj R-Sq   0.9946
Coeff Var          0.27150

                       Parameter Estimates
                 Parameter    Standard
Variable    DF    Estimate       Error   t Value   Pr > |t|
Intercept    1   -42.13778     3.34020    -12.62     <.0001
Temp         1     0.89549     0.01645     54.43     <.0001
All but the third will fit the same regression, but they differ in what is printed.
The first will print the default output in Table 2.3. The second will produce
no printed output, and is useful if followed by an output statement to save
regression summaries for some other purpose. The third will fit a mean function with no intercept. The fourth will produce pages and pages of output
of everything the program knows about; you probably don't want to use this
option very often (or ever). The last will print the default output plus confidence
intervals for each of the regression coefficients, the matrix X'X, its inverse
(X'X)^-1, and a printed version of the residuals. See the documentation for proc
reg for a complete list of options.
Remember that transformations must be done in a data step, not in proc
reg.
proc glm Another important SAS procedure for fitting linear models is
proc glm, which stands for the general linear model. proc glm is more
general than proc reg because it allows class variables, which is the name
that SAS uses for factors. Also, proc glm has several different approaches
to fitting ANOVA tables. The approach to ANOVA discussed in alr[3.5] is
sequential ANOVA, called Type I in SAS. The Wald or t-tests discussed in
alr are equivalent to SAS Type II, with no factors present. SAS Type III is
not recommended.
2.7  R2

2.8  CONFIDENCE INTERVALS AND TESTS
Confidence intervals and tests can be computed using the formulas in alr[2.8],
in much the same way as the previous computations were done.
SAS The SAS procedure proc reg will give confidence intervals for parameter estimates by adding the option clb to the model statement, so, for example,
model Lpres=Temp/clb alpha=.01;
To get predictions and confidence intervals at unobserved values of the predictors, you need to append new data to the data set, with a missing indicator (a
period) for the response. In the Forbes data, for example, to get predictions
at Temp = 210, 220, use
data forbes2;
input Temp;
cards;
210
220
;
data forbes2;
set alr3.forbes forbes2;
proc reg data=forbes2 alpha=.05;
model Lpres=Temp/cli;
run;
Table 2.4  Comparison of proc reg and proc glm. The rows of the table cover
allowed terms (class variables, called factors in alr; polynomial terms and
interactions; other transformations, which require a data step; random effects,
for which proc mixed is recommended instead), fitting (ols and wls, no
intercept, covariance of estimates, ANOVA tables, R2, add/delete variables
and refit, submodel selection), prediction/fitted values and prediction at new
values, and miscellaneous model diagnostics (diagnostic statistics and
graphical diagnostics).
Output Statistics

         Dep Var   Predicted    Std Error
Obs        Lpres       Value  Mean Predict      95% CL Predict     Residual
  1     131.7900    132.0357       0.1667   131.1532   132.9183      -0.2457
  2     131.7900    131.8566       0.1695   130.9717   132.7416      -0.0666
  3     135.0200    135.0804       0.1239   134.2304   135.9304      -0.0604
  4     135.5500    135.5282       0.1186   134.6817   136.3747       0.0218
  5     136.4600    136.4237       0.1089   135.5831   137.2642       0.0363
  6     136.8300    136.8714       0.1048   136.0332   137.7096      -0.0414
  7     137.8200    137.7669       0.0979   136.9325   138.6013       0.0531
  8     138.0000    137.9460       0.0969   137.1122   138.7798       0.0540
  9     138.0600    138.2146       0.0954   137.3816   139.0477      -0.1546
 10     138.0400    138.1251       0.0959   137.2918   138.9584      -0.0851
 11     140.0400    140.1847       0.0925   139.3531   141.0163      -0.1447
 12     142.4400    141.0802       0.0958   140.2469   141.9135       1.3598
 13     145.4700    145.4681       0.1416   144.6057   146.3306     0.001856
 14     144.3400    144.6622       0.1307   143.8076   145.5168      -0.3222
 15     146.3000    146.5427       0.1571   145.6682   147.4173      -0.2427
 16     147.5400    147.6173       0.1735   146.7288   148.5059      -0.0773
 17     147.8000    147.8860       0.1777   146.9937   148.7783      -0.0860
 18            .    145.9159       0.1480   145.0486   146.7831            .
 19            .    154.8708       0.2951   153.8469   155.8947            .

Sum of Residuals                  -3.3378E-13
Sum of Squared Residuals              2.15493
Predicted Residual SS (PRESS)         2.52585
The new data for the input variable Temp are entered after the cards statement; we need to use the same variable name as in the forbes data set so that
these new values will be appended to the correct column.
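Since a fitted value is just b0 + b1 × Temp, the two new predictions can be checked by hand from the coefficient estimates printed by proc reg; a small Python check:

```python
# Fitted values at the two appended temperatures, using the Forbes
# coefficient estimates (-42.13778 and 0.89549) from proc reg.
b0, b1 = -42.13778, 0.89549
pred = [b0 + b1 * t for t in (210.0, 220.0)]
```

The results agree with the predicted values 145.9159 and 154.8708 shown in the output statistics, up to rounding of the printed coefficients.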
To save the predicted values and associated confidence intervals for later
computation, you can use the output statement in proc reg. Here the information
is saved to a data set called m1:
data forbes2;
input Temp;
cards;
210
220
;
data forbes2;
set alr3.forbes forbes2;
proc reg data=forbes2 alpha=.05;
model Lpres=Temp/noprint;
output out=m1 predicted=pred lcl=lwr ucl=upr;
run;
2.9
THE RESIDUALS
SAS The SAS procedure proc reg computes the residuals for the linear
model we fit. A plot of residuals versus fitted values can be obtained using
the plot statement in proc reg. The keywords are residual., the ordinary residuals, and predicted.; don't forget the trailing periods. The /nostat nomodel
option suppresses the model fit information on the graph.
proc reg data=alr3.forbes;
model Lpres=Temp /noprint;
output out=m1 residual=res predicted=pred;
plot residual.*predicted. /nostat nomodel; *the residual plot;
run;
proc print data=work.m1 ;
run;
The output statement was used to save the residuals and fitted values in a
data set called m1. We then used proc print to print m1, but you could use
this data set in other graphs or for other purposes.
Problems
2.2.
SAS In order to do part 2.2.3, you need to save the fitted values from the
two regression models to two data sets, then apply a SAS graphical procedure
to the combined data set.
For part 2.2.5, you also need to save the predicted values and prediction
standard errors, then use a data step to combine these data and the Hooker
data.
For part 2.2.6, you first need to combine the Forbes data and the Hooker
data, with the response in the Forbes data set to missing. Then you fit the
regression model with this combined data set. Since SAS provides predictions
for the data with missing response as well, the computation of z-scores is
still straightforward in this part.
2.7.
SAS A linear regression model without an intercept can be fit in SAS by
adding the noint option to the model statement.
2.10.
SAS A where statement, of the form where condition;, can be used to select the cases of
interest.
3
Multiple Regression
3.1  ADDING A TERM TO A SIMPLE LINEAR REGRESSION MODEL
SAS Using the fuel2001 data, we first need to transform variables and create
a new data set that includes them. To draw an added-variable plot in SAS,
we use proc reg twice to get the sets of residuals that are needed. Using the
fuel2001 data,
data fuel;
set alr3.fuel2001;
Dlic=Drivers*1000/Pop;
Fuel=FuelC*1000/Pop;
Income=Income/1000;
logMiles=log2(Miles);
goptions reset=all;
title 'Added Variable Plot for Tax';
proc reg data=fuel;
model Fuel Tax=Dlic Income logMiles;
output out=m1 residual=res1 res2;
run;
proc reg data=m1;
model res1=res2;
plot res1*res2;
run;
The first proc reg has a model statement with two responses on the left side
of the equal sign. This means that two regression models will be fit, one
for each response. The output specification assigns m1 to be the name of the
Table 3.1  Summary statistics from proc means for the fuel2001 data.

The MEANS Procedure

Variable     N          Mean       Std Dev       Minimum       Maximum
Tax         51    20.1549020     4.5447360     7.5000000    29.0000000
Dlic        51   903.6778311    72.8578005   700.1952729       1075.29
Income      51    28.4039020     4.4516372    20.9930000    40.6400000
logMiles    51    15.7451611     1.4867142    10.5830828    18.1982868
Fuel        51   613.1288080    88.9599980   317.4923972   842.7917524
output, and the names for the two sets of residuals are res1 and res2. The
added-variable plot is just the plot of res1 versus res2. We did this using plot
inside the next proc reg; in this way, the fitted line and summary statistics
are automatically added to the plot.
A low-quality version of all the added-variable plots (produced in the output window rather than a graphics window) can be produced using the partial
option:
proc reg data=fuel;
model Fuel = Tax Dlic Income logMiles/partial;
run;
3.2  THE MULTIPLE LINEAR REGRESSION MODEL

3.3  TERMS AND PREDICTORS
SAS The standard summary statistics in SAS can be retrieved using proc
means, and the output is in Table 3.1.
data fuel;
set alr3.fuel2001;
Dlic=Drivers*1000/Pop;
Fuel=FuelC*1000/Pop;
Income=Income/1000;
logMiles=log2(Miles);
proc means data=fuel;
var Tax Dlic Income logMiles Fuel;
run;
We will use proc iml, the interactive matrix language that we used in the
last chapter, for several calculations and covariances. First, we get the sample
correlations, starting with the fuel data set that we have just created.
proc iml;
use fuel;
read all var {Tax Dlic Income logMiles Fuel} into X;
n=nrow(X);
Xbar=X[:,];                      *column means;
Y=X-repeat(Xbar,n,1);            *centered data matrix;
XCOV=Y`*Y/(n-1);                 *sample covariance matrix;
D=diag(1/sqrt(vecdiag(XCOV)));
XCORR=D*XCOV*D;                  *sample correlation matrix;
print XCORR XCOV;
quit;
The rows and columns are in the order Tax, Dlic, Income, logMiles, Fuel:

XCORR
         1  -0.085844  -0.010685  -0.043737  -0.259447
 -0.085844          1  -0.175961  0.0305907  0.4685063
 -0.010685  -0.175961          1  -0.295851  -0.464405
 -0.043737  0.0305907  -0.295851          1  0.4220323
 -0.259447  0.4685063  -0.464405  0.4220323          1

XCOV
 20.654625    -28.4247  -0.216173  -0.295519  -104.8944
  -28.4247   5308.2591  -57.07045  3.3135435  3036.5905
 -0.216173   -57.07045  19.817074  -1.958037  -183.9126
 -0.295519   3.3135435  -1.958037  2.2103191  55.817191
 -104.8944   3036.5905  -183.9126  55.817191  7913.8812
3.4  ORDINARY LEAST SQUARES

SAS Continuing with proc iml to do matrix calculations, the ols estimates can be computed from (X'X)^-1 X'Y; results are shown in Table 3.3.
data fuel;
set alr3.fuel2001;
Dlic=Drivers*1000/Pop;
Fuel=FuelC*1000/Pop;
Income=Income/1000;
logMiles=log2(Miles);
one=1;
proc iml;
use fuel;
read all var {one Tax Dlic Income logMiles} where (one^=. &
Tax^=. & Dlic^=. & Income^=. & logMiles^=. & Fuel^=.) into X;
read all var {Fuel} where (one^=. &
Tax^=. & Dlic^=. & Income^=. & logMiles^=. & fuel^=.) into Y;
beta_hat=inv(t(X)*X)*t(X)*Y;
print beta_hat;
quit;
Table 3.3 ols using proc iml for the fuel2001 data.
BETA_HAT
154.19284
-4.227983
0.4718712
-6.135331
18.545275
The ols estimates can be obtained using proc reg or proc glm. The output
from proc glm is given in Table 3.4. The syntax is very similar to that for
simple linear regression.
proc glm data=fuel;
model Fuel=Tax Dlic Income logMiles;
run;
The only difference between fitting simple and multiple regression is in the
specification of the model. All terms in the mean function appear to the right
of the equal sign, separated by white space. The output for using proc reg
would be similar, but would not include the extensive ANOVA tables shown.
3.5  THE ANALYSIS OF VARIANCE
SAS We recommend using proc glm for getting sequential analysis variance
tables. For example,
proc glm data=fuel;
model Fuel=Tax Dlic Income logMiles/ss1;
run;
produces the output shown in Table 3.4. This output gives the overall ANOVA,
followed by the sequential ANOVA discussed in alr, called Type I ANOVA by
SAS. By using the option /ss1, we have suppressed SAS's default behavior of
printing what it calls Type III sums of squares; these are not recommended
(Nelder, 1977).
The order of fitting is determined by the order of the terms in the model
statement. Thus
proc glm data=fuel;
model Fuel=logMiles Income Tax Dlic/ss1;
run;
would fit logMiles first, then Income, Tax and finally Dlic.
Table 3.4  Anova for nested models in proc glm for the fuel2001 data.

The GLM Procedure
Dependent Variable: Fuel

                          Sum of
Source             DF    Squares       Mean Square   F Value   Pr > F
Model               4   201994.0473     50498.5118     11.99   <.0001
Error              46   193700.0152      4210.8699
Corrected Total    50   395694.0624

R-Square   Coeff Var   Root MSE   Fuel Mean
0.510480    10.58362   64.89122    613.1288

Source      DF     Type I SS   Mean Square   F Value   Pr > F
Tax          1   26635.27578   26635.27578      6.33   0.0155
Dlic         1   79377.52778   79377.52778     18.85   <.0001
Income       1   61408.17553   61408.17553     14.58   0.0004
logMiles     1   34573.06818   34573.06818      8.21   0.0063

                               Standard
Parameter       Estimate          Error   t Value   Pr > |t|
Intercept    154.1928446    194.9061606      0.79     0.4329
Tax           -4.2279832      2.0301211     -2.08     0.0429
Dlic           0.4718712      0.1285134      3.67     0.0006
Income        -6.1353310      2.1936336     -2.80     0.0075
logMiles      18.5452745      6.4721745      2.87     0.0063
3.6  PREDICTIONS AND FITTED VALUES

We created a data file called fuel2 consisting of two cases with values for four
variables. We then redefined fuel2 by appending it to the bottom of fuel; all
Table 3.5  Predictions and prediction standard errors at new values from proc reg
for the fuel2001 data.

The REG Procedure
Model: MODEL1
Dependent Variable: Fuel
... ... (omitted)

Output Statistics
         Dep Var   Predicted    Std Error
Obs         Fuel       Value  Mean Predict      95% CL Predict     Residual
  1     690.2644    727.2652       20.2801   590.4156   864.1147    -37.0007
  2     514.2792    677.4242       32.8388   531.0318   823.8166   -163.1450
... ... (omitted)
 50     581.7937    593.1223       18.5974   457.2446   729.0000    -11.3286
 51     842.7918    659.2930       18.7829   523.3120   795.2740    183.4988
 52            .       12008          3942       4072      19944           .
 53            .       12718          4209       4244      21192           .
... ... (omitted)
other variables are missing for these two cases. We then fit the regression,
and obtain predictions for the two added cases.
Problems
4
Drawing Conclusions
The first three sections of this chapter do not introduce any new computational
methods; everything you need is based on what has been covered in previous
chapters. The last two sections, on missing data and on computationally
intensive methods, introduce new computing issues.
4.1  UNDERSTANDING PARAMETER ESTIMATES

4.1.1  Rate of change

4.1.2  Sign of estimates

4.1.3  Interpretation depends on other terms in the mean function

4.1.4  Rank deficient and over-parameterized models
SAS SAS provides several diagnostics when you try to fit a mean function
in which some terms are linear combinations of others. Table 4.1 provides the
output that parallels the discussion in alr[4.1.4].
data BGS;
set alr3.BGSgirls;
DW9=WT9-WT2;
DW18=WT18-WT9;
proc reg data=BGS;
model Soma=WT2 DW9 DW18 WT9 WT18;
run;
Table 4.1  Output from an over-parameterized model for the BGSgirls data.

The REG Procedure
Model: MODEL1
Dependent Variable: SOMA

Number of Observations Read   70
Number of Observations Used   70

                    Analysis of Variance
                          Sum of      Mean
Source             DF    Squares    Square   F Value   Pr > F
Model               3   25.35982   8.45327     28.67   <.0001
Error              66   19.45803   0.29482
Corrected Total    69   44.81786

Root MSE           0.54297   R-Square   0.5658
Dependent Mean     4.77857   Adj R-Sq   0.5461
Coeff Var         11.36264

WT9  = WT2 + DW9
WT18 = WT2 + DW9 + DW18

                    Parameter Estimates
                Parameter   Standard
Variable    DF   Estimate      Error   t Value   Pr > |t|
Intercept    1    1.59210    0.67425      2.36     0.0212
WT2          B   -0.01106    0.05194     -0.21     0.8321
DW9          B    0.10459    0.01570      6.66     <.0001
DW18         B    0.04834    0.01060      4.56     <.0001
WT9          0          0          .         .          .
WT18         0          0          .         .          .
Besides providing the correct ANOVA and other estimates, SAS correctly
reports the linear combinations of the terms in the mean function. It then
gives the estimates of parameters in the manner suggested in the text, by
deleting additional terms that are exact linear combinations of previous terms
in the mean function. SAS reports B degrees of freedom for the terms
that are part of the exact linear combinations, to indicate that the fit would
have been different had the terms been entered in a different order.
SAS uses the term model for the mean function; in alr, a model consists
of the mean function and any other assumptions required in a problem.
If proc glm had been used in place of proc reg, the resulting output would
have been similar, although the exact text of the diagnostic warnings is different.
4.2  EXPERIMENTATION VERSUS OBSERVATION

4.3  SAMPLING FROM A NORMAL POPULATION

4.4  MORE ON R2

4.5  MISSING DATA
The data files that are included with alr use NA as a place holder for
missing values. Some packages may not recognize this as a missing value
indicator, and so you may need to change this character using an editor to
the appropriate character for your program.
SAS SAS can detect missing values coded with strings like NA, ., or ?, or any
other string that is not compatible with the rest of the values of the same variable(s). You
do not need to tell SAS to skip those cases when you fit models, since it will
do so automatically. As with most other programs, SAS will use only cases that do
not have missing values on any of the variables used in a particular procedure.
data sleep;
set alr3.sleep1;
logSWS=log(SWS);
logBodyWt=log(BodyWt);
logLife=log(Life);
logGP=log(GP);
proc reg data=sleep;
model logSWS=logBodyWt logLife logGP;
run;
proc reg data=sleep;
model logSWS=logBodyWt logGP;
var logLife;
run;
The way you ask SAS to fit models with a particular subset of cases is by using
the var statement. In the example above, the cases used will correspond to all
cases observed on log(Life), even though that term is not in the model. You
can guarantee that all models are fit to the same cases by putting all terms
of interest in the var statement.
The SAS procedures proc mi and proc mianalyze provide tools for working
with missing data beyond the scope of alr; see Rubin (1987).
4.6  COMPUTATIONALLY INTENSIVE METHODS
SAS SAS does not have predefined procedures for the bootstrap or other
simulation methods, so you need to use a macro for this purpose. The macro
below is included in the SAS script file for this chapter. Macro writing can
be much more complicated than most of the other SAS programming we
discuss in this primer, so many readers may wish to skip this section. In
our example, we use the transact data, which has n = 261 observations, and
will produce a bootstrap similar to alr[4.6.1]. Here is the macro, with some
gentle annotation:
%let n=261;                                     *transact has 261 rows;
%macro boots(B=1, seed=1234, outputdata=temp);  *start macro definition;
%do i=1 %to &B;                                 *repeat B times;
data m1;                                        *data step to create data set m1;
rownum=int(ranuni(&seed*&i)*&n)+1;              *select a row number at random;
set alr3.transact point=rownum;                 *add row rownum to data set m1;
j+1;                                            *increment row counter;
if j>&n then STOP;                              *stop when sample size reaches n;
run;
*still in the loop, use proc reg to fit with the bootstrap sample;
proc reg data=m1 noprint outest=outests (keep=INTERCEPT T1 T2);
model Time=T1 T2;
run;
*keep coef. estimates and then use proc append to save them;
proc append base=&outputdata data=outests; run;
%end;                                           *end the do loop;
%mend boots;                                    *end the macro;
*do the bootstrap B=999 times, and save as bootout;
%boots(B=999, seed=2004, outputdata=bootout);
If you want to use this macro with a different data set, you must: (1) change
the definition of n; (2) substitute a different name for the data file, and (3)
change the model and term names to the correct ones for your data set.
Running the macro with B = 999 takes several minutes on a fast Windows
computer.
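The logic of the macro, resample n rows with replacement, refit, and save the coefficient estimates, is not specific to SAS. For readers who want to see the algorithm without the macro language, here is a minimal Python sketch on invented (x, y) data, tracking only the slope and using B = 199 replications to keep it fast:

```python
import random

# Case-resampling bootstrap for a straight-line fit: the same idea as
# the SAS macro above, on made-up data (not the transact data).
def fit_line(pairs):
    n = len(pairs)
    xbar = sum(x for x, _ in pairs) / n
    ybar = sum(y for _, y in pairs) / n
    sxx = sum((x - xbar) ** 2 for x, _ in pairs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in pairs)
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1        # (intercept, slope)

random.seed(2004)
data = [(x, 2.0 + 0.5 * x + random.gauss(0, 1)) for x in range(1, 31)]

B = 199
boot_slopes = []
for _ in range(B):
    sample = [random.choice(data) for _ in data]   # resample cases
    boot_slopes.append(fit_line(sample)[1])
boot_slopes.sort()
# approximate 2.5 and 97.5 percentiles of the bootstrap distribution
lower, upper = boot_slopes[4], boot_slopes[194]
```

The sorted replications play the same role as the percentiles reported by proc univariate below.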
The output from the last line is a new data set called work.bootout
with B = 999 rows. The ith row contains the estimates of the three parameters
from the ith bootstrap replication; the macro does no summarization. You
can summarize the data using any of several tools. The simple program
proc univariate data=work.bootout;
histogram;
run;
will produce histograms for each coefficient estimate separately, along with a prodigious
amount of output that includes the percentiles needed to get the bootstrap
confidence intervals. For one of the coefficients, for example, the bootstrap mean
and the 2.5 and 97.5 percentiles were

beta_mean    beta2_5    beta97_5
  0.30745    0.23519     0.40662
5
Weights, Lack of Fit, and
More
5.1  WEIGHTED LEAST SQUARES
SAS The weight statement in proc reg or in proc glm is used for wls.
data phy;
set alr3.physics;
w=1/SD**2;
proc reg data=phy;
model y=x;
weight w;
run;
Since the weights are the inverses of the squared SDs, we need to compute them in a
data step before calling proc reg.
Prediction intervals and standard errors with wls depend on the weight
assigned to the future value. This means that the standard error of prediction
is sepred = (σ̂²/w* + sefit(ŷ|X = x*)²)^(1/2), where w* is the weight assigned to the future
observation. The following SAS code obtains a prediction interval for x* = .1 and
w* = .04, and it uses the append-data technique we have used previously. The
cli option on the model statement will generate prediction intervals.
data phy2;
input x w;
cards;
.1 .04
;
data phy2; set phy phy2;
proc reg data=phy2;
model y=x/cli;
weight w;
run;
In Table 5.1 the Std Error Mean Predict is sefit. The 95% confidence intervals for prediction correctly use sepred.
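The sepred formula is easy to apply directly. The following Python fragment shows it with hypothetical values of σ̂², w*, and sefit (not the physics fit), just to make the arithmetic concrete:

```python
import math

# Standard error of prediction under wls:
#   sepred = sqrt(sigmahat^2 / w_star + sefit^2)
# All three inputs below are hypothetical, for illustration only.
sigmahat2 = 4.0     # estimated error variance (hypothetical)
w_star = 0.04       # weight assigned to the future observation
sefit = 3.0         # standard error of the fitted value (hypothetical)

sepred = math.sqrt(sigmahat2 / w_star + sefit ** 2)
```

With these inputs the variance of prediction is 4.0/0.04 + 9 = 109, so sepred = sqrt(109).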
Table 5.1  Predictions and prediction confidence intervals for the physics data.

The REG Procedure
Model: MODEL1
Dependent Variable: y

Output Statistics
       Weight   Dep Var   Predicted    Std Error
Obs  Variable         y       Value  Mean Predict      95% CL Predict     Residual
  1  0.003460  367.0000    331.6115        9.7215   262.9116   400.3113    35.3885
  2    0.0123  311.0000    300.8230        7.2080   262.6362   339.0098    10.1770
  3    0.0123  295.0000    281.7129        5.7614   244.8555   318.5704    13.2871
  4    0.0204  268.0000    267.9112        4.8259   238.9482   296.8742     0.0888
  5    0.0204  253.0000    258.3562        4.2688   229.8621   286.8502    -5.3562
  6    0.0278  239.0000    247.2086        3.7634   222.7009   271.7164    -8.2086
  7    0.0278  220.0000    233.9377        3.4542   209.6733   258.2021   -13.9377
  8    0.0278  213.0000    218.5435        3.5893   194.1751   242.9120    -5.5435
  9    0.0400  193.0000    193.0634        4.7764   171.0153   215.1116    -0.0634
 10    0.0400  192.0000    180.3234        5.6291   157.2300   203.4167    11.6766
 11    0.0400         .    201.5568        4.2832   180.0542   223.0593          .

Sum of Residuals                        0
Sum of Squared Residuals         21.95265
Predicted Residual SS (PRESS)    41.05411
5.1.1
SAS To run polynomial regression, create polynomial terms in the data step,
like squared and cubic terms, and then use proc reg to fit a model with
the original variables as well as these higher-order terms. Alternatively, you
can use proc glm and specify the polynomials in the model statement. For a
polynomial of degree two:
proc glm data=phy;
model y=x x*x/ss1;
weight w;
run; quit;
The ANOVA tables with Type I SS, the sequential sum of squares described
in alr are shown in Table 5.2.
Table 5.2  wls quadratic regression model with proc glm for the physics data.

Source             DF   F Value   Pr > F
Model               2    391.41   <.0001
Error               7
Corrected Total     9

R-Square   Coeff Var   Root MSE     y Mean
0.991137    0.295019   0.678815   230.0922

Source   DF     Type I SS   Mean Square   F Value   Pr > F
x         1   341.9913694   341.9913694    742.18   <.0001
x*x       1    18.7271174    18.7271174     40.64   0.0004

                              Standard
Parameter      Estimate          Error   t Value   Pr > |t|
Intercept    183.830465      6.4590630     28.46     <.0001
x              0.970902     85.3687565      0.01     0.9912
x*x         1597.504726    250.5868546      6.38     0.0004
5.1.2  Additional comments

5.2  TESTING FOR LACK OF FIT, VARIANCE KNOWN
data qt;
pval = 1-probchi(30,25);
critval = cinv(.95,25);
output;
proc print data=qt; run;
pval
0.22429
critval
37.6525
The first value is the p-value based on the χ²(25) distribution when the value of
the test is 30, while the second is the critical value for the χ²(25) distribution
at level α = 1 − .95 = 0.05. Similar functions are available for virtually
all standard distributions. For example, to get a two-tailed p-value from
the t(30) distribution, if the value of the statistic is x, use the statement
2*(1-probt(abs(x),30)).
5.3  TESTING FOR LACK OF FIT, VARIANCE UNKNOWN
SAS If we have one predictor x, we make use of the fact that the sum of
squares for lack of fit is the residual sum of squares for the regression of the
response on x treated as a factor. Thus, if we fit a model first with x and
then with x as a factor, the (sequential) F-test for the factor will be the F-test
for lack of fit:
data temp;
input x y;
x2=x;
cards;
1 2.55
1 2.75
1 2.57
2 2.40
3 4.19
3 4.70
4 3.81
4 4.87
4 2.93
4 4.52
;
run;
proc glm data=temp;
class x2;
model y=x x2/ss1;
run; quit;
The SAS output in Table 5.3 shows that the p-value is about 0.18, providing
no evidence for lack of fit. In a mean function with several terms, finding the
Table 5.3  Sequential ANOVA for the lack-of-fit test.

                         Sum of
Source             DF   Squares       Mean Square   F Value   Pr > F
Model               3   6.42749833    2.14249944       5.45   0.0378
Error               6   2.35839167    0.39306528
Corrected Total     9   8.78589000

Source   DF    Type I SS   Mean Square   F Value   Pr > F
x         1   4.56925025    4.56925025     11.62   0.0143
x2        2   1.85824808    0.92912404      2.36   0.1750
sum of squares for lack of fit is more of a challenge, and is omitted here. Code
for this purpose is provided with the R and S-Plus languages.
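The lack-of-fit F statistic in Table 5.3 is just the ratio of the lack-of-fit mean square (the x2 line, x treated as a factor after x) to the pure-error mean square; a one-line Python check using the printed values:

```python
# Lack-of-fit F statistic from the Type I ANOVA in Table 5.3.
ms_lack = 0.92912404   # lack-of-fit mean square (x2 line), 2 df
ms_pure = 0.39306528   # pure-error mean square, 6 df
F = ms_lack / ms_pure
```

The ratio reproduces the F value of 2.36 shown in the table.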
5.4  GENERAL F TESTING
SAS SAS has a very general language using either proc reg or proc glm for
performing specific F -tests. Using the fuel consumption data, proc reg can
be used with the test statement:
data fuel;
set alr3.fuel2001;
Dlic=Drivers*1000/Pop;
Fuel=FuelC*1000/Pop;
Income=Income/1000;
logMiles=log2(Miles);
proc reg data=fuel;
model Fuel=Tax Dlic Income logMiles;
test Dlic=Income=0;
test Dlic=Income=Tax=0;
run;
The second test, for example, compares the null hypothesis
NH: E(Fuel|X) = β0 + β3 log(Miles)
with the alternative
AH: E(Fuel|X) = β0 + β1 Tax + β2 Dlic + β3 log(Miles) + β4 Income
The output for these tests is shown in Table 5.4. proc glm also has a test
statement, but the syntax is different and the tests are more general. If your
Table 5.4  Anova for nested models with proc reg for the fuel2001 data.

Test 1 Results for Dependent Variable Fuel
                         Mean
Source         DF      Square   F Value   Pr > F
Numerator       2       54246     12.88   <.0001
Denominator    46  4210.86989

Test 2 Results for Dependent Variable Fuel
                         Mean
Source         DF      Square   F Value   Pr > F
Numerator       3       43839     10.41   <.0001
Denominator    46  4210.86989
mean function includes factors (or class variables), then the usual tests of main
effects and interactions will be produced by the program. For this purpose
we once again suggest using Type II tests, so, for example,
proc glm data=fuel;
model Fuel=Tax Dlic Income logMiles/ss2;
run;
will give an ANOVA table that has the correct F-tests for each term adjusted
for each other term in the model.
5.5  JOINT CONFIDENCE REGIONS
SAS The code below will reproduce the confidence ellipse in alr[F5.3]. It
is included here for completeness, but is sufficiently complex that it is not of
much use for data analysis, and so this may be skipped.
proc reg data=alr3.UN2 alpha=.05 outest=m1 tableout covout;
model logFertility=logPPgdp Purban /noprint;
run;
proc iml;
use m1;
read all var {_RMSE_ logPPgdp Purban} into X;
sigmahat2=X[1,1]**2; *extract the variance estimate;
X=X[,2:3];
beta1=X[1,1]; *extract the coefficient estimate;
beta2=X[1,2]; *extract the coefficient estimate;
se=X[2,];
cov=X[8:9,]; *extract the covariance;
invcov=inv(cov); *the inverse of the covariance matrix;
z_val=probit(.975)*{1 -1};
int1=beta1+z_val*se[1];
int2=beta2+z_val*se[2];
one=j(1,2,1);
To find the two main axes of the ellipse, the code uses the eigenvalue
decomposition routine built into SAS/IML: call eigen(V, E, A). The keyword call
has to be used before eigen; V is used to save the eigenvalues, E is
used to save the eigenvectors, and both appear before the matrix A that you want to decompose.
Before we save the output, namely the points on the ellipse and the confidence
limits for each of the two variables, we multiply a two-vector of ones by
the beta estimates. This is simply for drawing the one-dimensional confidence
intervals later on. Without doing this, SAS will only draw a single point,
namely one endpoint of the interval, due to the way the output
from proc iml is saved. If you are not convinced, try it again without the
multiplication and you will see what happens. The program could
be converted to a SAS macro.
Problems
5.3.
The bootstrap used in this problem is different from the bootstrap discussed
in alr[4.6] because rather than resampling cases we are resampling residuals.
Here is the general outline of the method:
1. Fit the model of interest to the data. In this case, the model is just the
simple linear regression of the response y on the predictor x. Compute
the test statistic of interest, given by alr[E5.23]. Save the fitted values
ŷ and the residuals ê.
2. A bootstrap sample will consist of the original x and a new y*, where
y* = ŷ + e*. The ith element of e* is obtained by sampling from ê with
replacement.
3. Given the bootstrap data (x, y*), compute and save alr[E5.23].
4. Repeat steps 2 and 3 B times, and summarize results.
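The outline above can be sketched outside SAS as well. Here is a minimal Python version on made-up data, with the slope standing in for the statistic of alr[E5.23]:

```python
import random

# Residual-resampling bootstrap: keep x fixed, resample residuals,
# rebuild y*, and refit. The tracked statistic is just the slope, as a
# stand-in for alr[E5.23]; the data are invented for illustration.
def fit_line(xs, ys):
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1

random.seed(1234)
xs = list(range(1, 21))
ys = [1.0 + 0.3 * x + random.gauss(0, 0.5) for x in xs]

b0, b1 = fit_line(xs, ys)                       # step 1: fit, save...
fitted = [b0 + b1 * x for x in xs]              # ...fitted values
resid = [y - f for y, f in zip(ys, fitted)]     # ...and residuals

stats = []
for _ in range(199):                            # step 4: repeat B times
    estar = [random.choice(resid) for _ in xs]  # step 2: resample residuals
    ystar = [f + e for f, e in zip(fitted, estar)]
    stats.append(fit_line(xs, ystar)[1])        # step 3: recompute statistic
```

The collection stats is then summarized, for example with its percentiles, just as in the case-resampling bootstrap of Chapter 4.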
6
Polynomials and Factors
6.1  POLYNOMIAL REGRESSION

The uniform random number generator ranuni(seed) is used, with −1 indicating a system-chosen random seed. You can also change the multipliers, the
values 0.2 and 0.5 used in the program, to control the magnitude of jittering.
There are at least two options for doing polynomial regression. One is to
use proc reg with polynomial terms defined in the data step; the other is to use proc
glm, where polynomial terms can be specified in the model statement directly. For
example, x*x is a quadratic term. Improved computational accuracy can be
obtained using orthogonal polynomial regression in proc orthoreg; we will not
discuss this latter option further.
6.1.1
SAS Polynomial regression with several predictors can be fit with proc
reg with pre-defined higher-order terms, or with proc glm, which allows the
definition of higher-order terms in the model statement. The following uses
proc glm to fit a full quadratic model. See the output in Table 6.1.
proc glm data=alr3.cakes;
model Y=X1 X1*X1 X2 X2*X2 X1*X2/ss1;
run; quit;
Table 6.1  Full quadratic polynomial fit with proc glm for the cakes data.

                          Sum of
Source             DF    Squares       Mean Square   F Value   Pr > F
Model               5   27.20482661    5.44096532      29.60   <.0001
Error               8    1.47074482    0.18384310
Corrected Total    13   28.67557143

R-Square   Coeff Var   Root MSE     Y Mean
0.948711    6.100376   0.428769   7.028571

Source   DF     Type I SS   Mean Square   F Value   Pr > F
X1        1    4.32315980    4.32315980     23.52   0.0013
X1*X1     1    2.13083104    2.13083104     11.59   0.0093
X2        1    7.43319538    7.43319538     40.43   0.0002
X2*X2     1   10.54541539   10.54541539     57.36   <.0001
X1*X2     1    2.77222500    2.77222500     15.08   0.0047

                               Standard
Parameter       Estimate          Error   t Value   Pr > |t|
Intercept   -2204.484988    241.5807402     -9.13     <.0001
X1             25.917558      4.6589114      5.56     0.0005
X1*X1          -0.156875      0.0394457     -3.98     0.0041
X2              9.918267      1.1665592      8.50     <.0001
X2*X2          -0.011950      0.0015778     -7.57     <.0001
X1*X2          -0.041625      0.0107192     -3.88     0.0047
To draw alr[F6.3a], we again use proc glm to generate the predicted values, then use proc gplot to plot all the points on one graph. Once again, the
program for this plot is very long, and uses proc iml, the interactive matrix
language.
proc iml; *create new data;
x1=j(3,50,.); x2=j(1,150,.);
x1[1,]=32+(0:49)*(38-32)/49; x2[1:50]=j(1,50,340);
x1[2,]=32+(0:49)*(38-32)/49; x2[51:100]=j(1,50,350);
x1[3,]=32+(0:49)*(38-32)/49; x2[101:150]=j(1,50,360);
create newdata var {x1 x2};
append var {x1 x2};
close newdata; quit;
data cakes; set newdata alr3.cakes;
POLYNOMIAL REGRESSION
55
To get the estimated response curves, we need the predicted values. Thus
we first generate these new predictor values, then obtain the predictions via
the predicted option in proc glm. Then we draw the three estimated response curves
from the output data set m1. The command j(3,50,.) gives a 3 by 50 matrix
of missing values (.), and the command (0:49) gives a sequence of
integers from 0 to 49.
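The grid that the proc iml code builds, 50 equally spaced values of X1 from 32 to 38 for each of X2 = 340, 350, 360, can be written in a couple of lines of Python for comparison:

```python
# The same 150-point prediction grid as the proc iml code: 50 equally
# spaced X1 values from 32 to 38, repeated for three X2 settings.
x1_grid = [32 + i * (38 - 32) / 49 for i in range(50)]
new_points = [(x1, x2) for x2 in (340, 350, 360) for x1 in x1_grid]
```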
6.1.2
to center and scale X2 before fitting. We set Z2 = (X2 − 350)/10, and then
fit the mean function

E(Y|Z2) = γ0 + γ1 Z2 + γ2 Z2²
        = γ0 + γ1 ((X2 − 350)/10) + γ2 ((X2 − 350)/10)²

Differentiating this last equation with respect to X2 and setting the result to zero, we find that
in this parameterization the maximum will occur at xm = 350 − 5 γ1/γ2.
Here is the SAS code:
data cakes;
set alr3.cakes;
Z2 = (X2-350)/10;
            Parameter Estimates

            Standard
Estimate       Error   DF   t Value   Pr > |t|
  7.6838      0.3075   14     24.99     <.0001
  0.9639      0.3201   14      3.01     0.0093
 -1.1467      0.3322   14     -3.45     0.0039
  0.8196      0.3098   14      2.65     0.0192

            Additional Estimates

                              Standard
Label              Estimate      Error   DF    Lower    Upper
350-5*gam1/gam2      354.20     1.8519   14   350.23   358.17
The point estimate agrees with the value computed with proc iml, but the
standard error is smaller. The discrepancy is caused by the computation of the
estimate of variance. When using proc iml, we used the estimate of σ² computed in the usual way as the RSS divided by its df, which is 11 for this
problem. proc nlmixed estimates σ² by the RSS divided by n, which for this
problem is 14. If we multiply the standard error by √(14/11), we get 2.0892,
essentially agreeing with the solution computed with proc iml. The lesson
to be learned is that different programs do different things, and you need to
know what your program is doing.
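The rescaling argument is a one-line check; the numbers below come from the output above:

```python
import math

# proc iml estimated sigma^2 as RSS/df with df = 11; proc nlmixed used RSS/n
# with n = 14. Standard errors scale with sigma, so converting the nlmixed
# standard error to the iml convention multiplies it by sqrt(14/11).
se_nlmixed = 1.8519
se_iml = se_nlmixed * math.sqrt(14 / 11)
print(round(se_iml, 4))   # 2.0892, essentially the proc iml value
```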
6.1.3 Fractional polynomials

6.2 FACTORS
Factors are a slippery topic because different computer programs will handle
them in different ways. In particular, while SAS and SPSS use the same
default for defining factors, JMP, R and S-Plus all use different defaults. A
factor represents a qualitative variable with, say, a levels by a − 1 (or, if no
intercept is in the model, possibly a) dummy variables. alr[E6.16] describes
one method for defining the dummy variables, using the following rules:
1. If a factor A has a levels, create a dummy variables U1, . . . , Ua, such that
Uj has the value one when the level of A is j, and value zero everywhere
else.
2. Obtain a set of a − 1 dummy variables to represent factor A by dropping
one of the dummy variables. For example, using the default coding in
Design Matrix

A  B  mu  A1  A2  A3  B1  B2  B3
1  1   1   1   0   0   1   0   0
1  2   1   1   0   0   0   1   0
2  3   1   0   1   0   0   0   1
3  1   1   0   0   1   1   0   0
3  2   1   0   0   1   0   1   0
3  3   1   0   0   1   0   0   1
R, the first dummy variable U1 is dropped, while in SAS and SPSS the
last dummy variable is dropped.
3. JMP and S-Plus use a completely different method.
Most of the discussion in alr assumes the R default for defining dummy
variables.
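The two rules, and the SAS/SPSS versus R defaults, can be sketched in a few lines of Python (the function names here are ours, not SAS's):

```python
def dummies(levels, value):
    # Rule 1: one indicator per level of the factor, in sorted level order.
    cats = sorted(set(levels))
    return [1 if value == c else 0 for c in cats]

def coded(levels, value, drop="last"):
    # Rule 2: drop one indicator. SAS and SPSS drop the last dummy
    # variable; R drops the first.
    u = dummies(levels, value)
    return u[:-1] if drop == "last" else u[1:]

A = [1, 1, 2, 2, 3, 3]
sas_coding = [coded(A, v) for v in A]              # last level is baseline
r_coding = [coded(A, v, drop="first") for v in A]  # first level is baseline
```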
SAS One way to work with factors is to create your own dummy variables
and work with them in place of the factors. For some users, this may make
computations and results clearer. When using proc reg, there is no choice
because this procedure does not recognize factors. Dummy variables can be
created with if statements in a data step. For example, if D=1 then D1=1;
else D1=0; if D=2 then D2=1; else D2=0; creates two dummy variables. The
dummies can then be used in place of factors to fit regression models in proc reg.
With proc glm, you can use class variables, which treats all the variables
listed in the class statement as factors, so you can fit models with those factors
directly. Think of a class variable with m levels as a collection of m dummy
variables, one for each level. In fitting, one of the dummies is usually redundant, and SAS will drop the one corresponding to the last level. The default
order of the levels is their alphabetical order; this order can be controlled
with the order= option in the proc glm statement. Table 6.2 shows the default
class variable parameterization in proc glm. If you fit a mean function using
model Y = A B;, from Table 6.2 the columns for A3 and B3 would not be used
in fitting.
6.2.1
No other predictors
SAS To obtain alr[T6.1a], which is part of Table 6.3, one can use the
following SAS code:
proc glm data=alr3.sleep1;
class D;
model TS=D /ss1 noint solution clparm;
run;
We used the ss1 option to get only Type I sums of squares, the noint option
to fit with no intercept, the solution option to print estimates, and the clparm
option to print confidence intervals for the estimates.
Table 6.3 Complete output from proc glm with only one factor for the sleep1 data.
The GLM Procedure
Dependent Variable: TS

Source              DF   Sum of Squares   Mean Square   F Value   Pr > F
Model                5      6891.717825   1378.343565     97.09   <.0001
Error               53       752.412175     14.196456
Uncorrected Total   58      7644.130000

R-Square   Coeff Var   Root MSE    TS Mean
0.378001    35.77238   3.767818   10.53276

Source   DF   Type I SS     Mean Square   F Value   Pr > F
D         5   6891.717825   1378.343565     97.09   <.0001

Parameter   Estimate      Standard Error   t Value   Pr > |t|   95% Confidence Limits
D 1         13.08333333   0.88808333         14.73   <.0001     11.30206374   14.86460292
D 2         11.75000000   1.00699185         11.67   <.0001      9.73023014   13.76976986
D 3         10.31000000   1.19148882          8.65   <.0001      7.92017607   12.69982393
D 4          8.81111111   1.25593949          7.02   <.0001      6.29201550   11.33020672
D 5          4.07142857   1.42410153          2.86   0.0061      1.21504264    6.9278145
alr[T6.1b] uses the default parameterization for R, which drops the first
level by default. If you fit in SAS without the noint option, you will get output
similar to alr[T6.1b], except the baseline level will be the last level, not
the first level. You could get exactly the same estimates as in alr[T6.1b]
using the following:
data s2;
set alr3.sleep1;
D2 = D;
if D=1 then D2=6;
output;
proc glm data=s2;
class D2;
model TS=D2/solution;
run;
60
We have recoded the factor D so that its first level sorts as the alphabetically last
level, and it becomes the baseline as in the text.
6.2.2 Adding a predictor: comparing regression lines
SAS Suppose Y is the generic name of the response variable, X the generic
name of the continuous predictor, and D is a SAS class variable. The four
models described in alr[6.2.2] can be fit using proc glm with the following
model statements:
model Y = X D X*D;
model Y = D X*D/noint;
model Y = X D;
model Y = X D/noint;
model Y = X D*X;
model Y = D*X;
model Y = X;
The different choices for the model statements will use different parameters
to specify the same mean function. For example, assuming D has 3 levels,
and D1, D2 and D3 are, respectively, dummy variables for levels 1, 2, 3 of D,
model Y = X D X*D; will specify

E(Y|X) = β0 + β1 X + β2 D1 + β3 D2 + β4 D1 X + β5 D2 X

while model Y = D X*D/noint; specifies

E(Y|X) = β1 D1 + β2 D2 + β3 D3 + β4 D1 X + β5 D2 X + β6 D3 X

Both of these will have the same residual sum of squares, and that is all that
is needed for the F-tests.
For the sleep data, the following four models can be fit:
data sleep;
set alr3.sleep1;
logBodyWt=log2(BodyWt);
proc glm data=sleep; *general model;
class D;
model TS=D D*logBodyWt /noint;
run; quit;
proc glm data=sleep; *parallel model;
class D;
model TS=D logBodyWt /noint;
run; quit;
proc glm data=sleep; *common intercept;
class D;
model TS=D*logBodyWt;
run; quit;
proc glm data=sleep; *coincident lines;
class D;
model TS=logBodyWt;
run; quit;
Given the residual sums of squares from these models, computing the F -test
by hand is probably easiest.
6.3 MANY FACTORS

6.4 PARTIAL ONE-DIMENSIONAL MEAN FUNCTIONS

alr[F6.8] is much more compelling in color, and is shown here as Figure 6.2.
SAS Since POD models are really nonlinear regression models, they can
be fit in SAS using the general procedure proc nlmixed. This procedure is
sufficiently general that we can use POD models with other types of errors,
such as binomial errors or Poisson errors; see Cook and Weisberg (2004). The
output is given by Table 6.4 with the iteration history omitted.
Call proc nlmixed to fit the POD model; starting values for the parameters are
required, and you may need better ones than those shown here.
proc nlmixed data=alr3.ais;
parms beta0=1 beta1=1 beta2=1 beta3=1 beta4=1 eta1=1 sig2=1;
g=beta2*Ht+beta3*Wt+beta4*RCC;
LBM_exp=beta0+beta1*Sex+g+eta1*Sex*g;
model LBM~normal(LBM_exp,sig2);
predict g out=m1;
predict LBM_exp out=m2;
run; quit;
Table 6.4 Edited output of the POD model with proc nlmixed for the ais data.
Parameter Estimates

Param   Estimate   Standard Error   Upper     Gradient
beta0   -14.6563   6.3486           -2.1382   0.000016
beta1    12.8472   3.6889           20.1208   0.00001
beta2     0.1463   0.03376           0.2128   0.002007
beta3     0.7093   0.02380           0.7563   0.000578
beta4     0.7248   0.5768            1.8621   0.000051
eta1     -0.2587   0.03402          -0.1917   0.001157
sig2      5.8708   0.5842            7.0227   -0.00006
The following will plot separately for each group in the POD direction.
data m1; set m1; g=Pred; keep LBM g Sex;
data pod; set m1; set m2; LBM_exp=Pred; keep LBM g LBM_exp Sex;
proc sort data=pod; by g;
goptions reset=all;
symbol1 v=point c=black i=join l=1;
symbol2 v=point c=red i=join l=2;
axis2 label=(a=90 'LBM');
axis3 label=(a=90 );
proc gplot data=pod;
plot LBM_exp*g=Sex /vaxis=axis2 hminor=0 vminor=0;
plot2 LBM*g=Sex /vaxis=axis3 vminor=0 nolegend;
run; quit;
Fig. 6.3 Summary graph for the POD mean function for the ais data.
6.5 RANDOM COEFFICIENT MODELS
SAS Random coefficient models can be fit in SAS using proc mixed. We
illustrate with the chloride data and the output is shown in Table 6.5 and
Table 6.6.
proc mixed data=alr3.chloride method=reml covtest;
class Marsh Type;
model Cl=Type Month /noint cl;
random INTERCEPT /type=un subject=Marsh;
run; quit;
proc mixed data=alr3.chloride method=reml covtest;
class Marsh Type;
model Cl=Type Month /noint cl;
random Type /type=un subject=Marsh;
run; quit;
Each of the two mixed procedures gives parameter estimates (with confidence limits), the estimated covariance matrix, and model fit statistics such as the −2 log-likelihood.
method=reml specifies that the method used to estimate the parameters is REML
(REstricted Maximum Likelihood). covtest tells SAS to output the standard errors for the covariance parameter estimates. The option cl gives Wald confidence intervals for parameter estimates (for fixed effects). The keyword
INTERCEPT in the random statement refers to random intercept for each level of
subject=Marsh. Similarly, random Type /subject=Marsh assumes random Type
Table 6.5 Random intercept for each level of Marsh with proc mixed for the
chloride data.
Cov Parm   Pr Z
UN(1,1)    0.0415
Residual   0.0005

Fit Statistics: -2 Res Log Likelihood 219.7, AIC 223.7, AICC 224.2, BIC 224.1

Effects

DF   t Value   Pr > |t|   Alpha   Lower      Upper
22     -0.71   0.4832     0.05    -21.5084   10.5007
22      6.36   <.0001     0.05     30.3610   59.7751
22      3.50   0.0020     0.05      0.7549    2.9527
effect within each level of Marsh. Option type=UN specifies an unstructured covariance matrix. The default is type=VC, which refers to variance components.
SAS does not provide an ANOVA comparison for nested models, and it does
not provide confidence limits for covariance parameter estimates. The ANOVA
model comparison is made by a likelihood ratio test, which compares the difference in −2 log-likelihood with a χ² distribution (the corresponding degrees
of freedom is equal to the difference in the number of parameters between the
large model and the small model). The output gives the change in −2 log-likelihood from 219.7 to 214.1, which equals 5.6, with two degrees of freedom.
Hence p-value = 1-probchi(5.6,2) = 0.061. SAS gives p-values for covariance
parameter estimates based on the normal distribution, which is appropriate only
in large samples.
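Because the χ² survival function with 2 df has the closed form exp(−x/2), the side calculation is easy to verify outside SAS:

```python
import math

# 1 - probchi(5.6, 2) in SAS; with 2 df, P(chi-square > x) = exp(-x/2).
pvalue = math.exp(-5.6 / 2)
print(round(pvalue, 3))   # 0.061, matching the text
```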
Table 6.6 Random intercept and random Type effect for each level of Marsh with
proc mixed for the chloride data.
Cov Parm   Pr Z
UN(1,1)    0.3623
UN(2,1)    .
UN(2,2)    0.0878
Residual   0.0004

Fit Statistics
-2 Res Log Likelihood      214.1
AIC (smaller is better)    222.1
AICC (smaller is better)   223.8
BIC (smaller is better)    222.9
7
Transformations
7.1 TRANSFORMATIONS AND SCATTERPLOTS

7.1.1 Power transformations

7.1.2 Transforming only the predictor variable
From the three regressions, we keep the predicted values, called Hat, Hat1 and
Hat2, and then combine them into another data set with a data step. The
remainder of the code sets up and draws the plot.
Rather than relying on visualization to select a transformation, we can
use a numerical method. The idea is to use least squares to estimate the
parameters in the mean function:
E(Y|X = x) = θ0 + θ1 ψS(x, λ)    (7.1)

This will estimate θ0, θ1 and λ simultaneously. This can be done using nonlinear least squares (see Chapter 11), with a program similar to the following:
Nonlinear least squares is fit using proc nlin. In this procedure, the user must
specify starting values for the parameters, and for this we set lam=1, and the
values for beta0 and beta1 were obtained by first fitting the regression of Dbh
on Height using proc reg. The model statement matches (7.1), substituting
the definition of ψS(x, λ). The output from this fit is, in part,
Parameter   Estimate   Approx Std Error
th0         -378.0     249.8
th1          90.8127    79.5272
lam           0.0479     0.1522
from which we estimate λ to be about 0.05, with an interval estimate of between about −0.25 and +0.35. Any transformation in this range is likely to be
useful, and since λ = 0 is included, we would use the logarithmic transformation.
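An alternative to proc nlin is to profile over λ: for each λ on a grid, fit the regression of the response on ψS(x, λ) by ordinary least squares and keep the λ minimizing the residual sum of squares. A self-contained Python sketch on synthetic data (not the ufcwc data):

```python
import math
import random

def psi_s(x, lam):
    # Scaled power transformation: (x^lam - 1)/lam, with log(x) at lam = 0.
    return math.log(x) if abs(lam) < 1e-12 else (x ** lam - 1) / lam

def rss_line(x, y):
    # Residual sum of squares from the least squares line of y on x.
    n = len(x); mx = sum(x) / n; my = sum(y) / n
    sxx = sum((v - mx) ** 2 for v in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sum((b - my) ** 2 for b in y) - sxy ** 2 / sxx

rng = random.Random(1)
x = [rng.uniform(1, 20) for _ in range(200)]
y = [3 + 10 * math.log(xi) + rng.gauss(0, 0.5) for xi in x]  # true lambda = 0

grid = [l / 100 for l in range(-100, 101)]
lam_hat = min(grid, key=lambda l: rss_line([psi_s(xi, l) for xi in x], y))
assert abs(lam_hat) <= 0.25      # estimate is near the true value of 0
```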
The following program will use the output from the last program to draw
a graph of the data with the fitted line superimposed.
proc sort data=nlinout; by Dbh;
goptions reset=all;
symbol1 v=circle h=1 c=black;
symbol2 v=point h=1 c=black i=join l=1;
proc gplot data=nlinout;
plot Height*Dbh=1 yhat*Dbh=2/overlay;
run; quit;
proc sort is used to order the data according to the values on the horizontal
axis of the plot, and the remaining statements draw the plot.
7.1.3 Transforming the response only
SAS The inverse response plots like alr[F7.2] are very inconvenient in SAS
(although writing a macro that would automate this process would make an
interesting and useful student project); a sample program that will give this
plot is included in the script file for this chapter. SAS does have a procedure
that implements the Box-Cox method, however.
7.1.4 The Box and Cox method
SAS The procedure proc transreg implements the Box-Cox procedure for
selecting a response transformation; the other options in this procedure are
not covered in alr. Here is the syntax for getting Box-Cox transformations,
using the highway data:
data highway;
set alr3.highway;
logLen=log2(Len); logADT=log2(ADT);
logTrks=log2(Trks); logSigs1=log2((Len*Sigs+1)/Len);
proc transreg data=highway;
model boxcox(Rate /convenient lambda=-1 to 1 by .005)=
identity(logLen logADT logTrks Slim Shld logSigs1);
run;
[Table 7.1: edited proc transreg Box-Cox output, garbled in extraction. For each Lambda on the grid the output lists R-Square (here between 0.67 and 0.70) and the log likelihood; an asterisk marks powers inside the confidence interval, < marks the best power, and + marks a convenient power.]
The required keyword identity specifies no further transformation of the predictors. In the example, zero (the log transformation) is a convenient power if it
is in the confidence interval, and is labelled by the +. As usual, we have transformed
predictors in the data step, although proc transreg does have some facility
for this within the command. The response is specified by boxcox(Rate). The
keywords specify marking a convenient power, and also the search range for
λ, here between −1 and 1 by steps of .005. Table 7.1 shows the result using
the Box-Cox method, where the Lambda denoted by a < is optimal, and
the Lambda denoted by a + is a convenient choice.
7.2 TRANSFORMATIONS AND SCATTERPLOT MATRICES
The scatterplot matrix is the central graphical object in learning about regression models. You should draw them all the time; all problems with many
continuous predictors should start with one.
SAS Use Solutions → Analysis → Interactive Data Analysis to get a scatterplot
matrix. The equivalent proc insight code is
proc insight data=alr3.highway;
scatter Rate Len ADT Trks Slim Shld Sigs
*Rate Len ADT Trks Slim Shld Sigs;
run;
You must type the variable names twice so that you can get all the pairwise
scatterplots. The symbol * connects the variables to be plotted on the vertical
axis and the variables to be plotted on the horizontal axis. And there is no
easy procedure for adding loess smooth curves and ols regression lines to the
scatterplot matrix.
7.2.1 The 1D estimation result and linearly related predictors

7.2.2 Automatic choice of transformation of the predictors

7.3 TRANSFORMING THE RESPONSE
SAS One can follow Section 7.1.4 on the Box-Cox method in SAS proc
transreg to obtain a Box-Cox transformation power for the response.
An alternative is to do a grid search over an interval of interest. You need
to define power transformations for the response in the data step, then fit
multiple models for the transformed responses and pick the power that gives
you the smallest RSS.
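For RSS values to be comparable across powers, the transformed responses must be put on a common scale; the usual device is the Box-Cox family scaled by the geometric mean. A Python illustration of the grid search on synthetic data (this is not the transreg computation):

```python
import math
import random

def boxcox_scaled(y, lam):
    # Box-Cox transformation scaled by the geometric mean of y so that
    # residual sums of squares are comparable across values of lambda.
    gm = math.exp(sum(math.log(v) for v in y) / len(y))
    if abs(lam) < 1e-12:
        return [gm * math.log(v) for v in y]
    return [(v ** lam - 1) / (lam * gm ** (lam - 1)) for v in y]

def rss_line(x, z):
    n = len(x); mx = sum(x) / n; mz = sum(z) / n
    sxx = sum((v - mx) ** 2 for v in x)
    sxz = sum((a - mx) * (b - mz) for a, b in zip(x, z))
    return sum((b - mz) ** 2 for b in z) - sxz ** 2 / sxx

rng = random.Random(7)
x = [rng.uniform(0, 10) for _ in range(300)]
y = [math.exp(0.3 * xi + rng.gauss(0, 0.3)) for xi in x]  # log scale is right

grid = [l / 200 for l in range(-200, 201)]   # -1 to 1 by .005, as in transreg
lam_hat = min(grid, key=lambda l: rss_line(x, boxcox_scaled(y, l)))
assert abs(lam_hat) <= 0.25      # close to the log transformation, lambda = 0
```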
7.4 TRANSFORMATIONS OF NON-POSITIVE VARIABLES
8
Regression Diagnostics:
Residuals
8.1 THE RESIDUALS
SAS Options to proc glm and proc reg are used to compute and save diagnostic quantities to a data file. Additionally, residuals and related quantities
can be used to draw simple graphs inside these procedures. The following
program
proc reg data=alr3.ufcwc;
model Height=Dbh;
output out=m1 predicted=pred residual=res h=hat press=pres;
run;
will fit the regression mean function specified, and create a new data set
(work.m1) consisting of the original data, plus new variables called pred for
the predicted values, res for the residuals, hat for the leverages, and pres for
the PRESS residuals. Table 8.1 gives relevant diagnostic quantities that can
be saved.
You can also use residuals and related statistics within the call to proc reg
or proc glm. In particular
proc reg data=alr3.ufcwc;
model Height=Dbh;
plot residual.*predicted.;
plot press.*Dbh;
run;
will produce two scatterplots, the first of residuals versus fitted values, and
the second of PRESS versus Dbh. (To plot a keyword like residual, you
must append a period to the end of the name.)

Table 8.1 Selected quantities with a value for every case that can be saved using the
output statement in proc reg or proc glm.

SAS name    Quantity
residuals   residuals
student     standardized residuals
rstudent    studentized residuals, computed with the current case deleted
cookd       Cook's distance
h           leverages, the diagonal of the hat matrix
press       PRESS residuals
stdi        standard error of an individual prediction
stdp        standard error of the fitted (mean) value
predicted   fitted values

A disadvantage to using this
plotting mechanism is that you can't add smoothers or other enhancements.
Also,
proc reg data=alr3.ufcwc;
model Height=Dbh;
print all;
run;
will print lists of many diagnostic statistics. This should only be used with
very small sample sizes!
8.1.1 Difference between ê and e

8.1.2 The hat matrix
SAS To find the leverages hii , the diagonals of the hat matrix, using proc
reg, type:
data ufcwc;
set alr3.ufcwc;
proc reg data=ufcwc;
model Height=Dbh;
output out=m1 h=hat;
run;
This code saves the leverages, so they are available to other procedures for
summarization or graphing.
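The leverages saved by h= are the diagonal of the hat matrix H = X(X'X)⁻¹X'; a small numpy sketch on synthetic data (not the ufcwc file):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 30
x = rng.uniform(10.0, 50.0, size=n)
X = np.column_stack([np.ones(n), x])     # design matrix with an intercept

H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)

# Two standard identities: the leverages sum to the number of regression
# coefficients, and for simple regression h_ii = 1/n + (x_i - xbar)^2/Sxx.
sxx = ((x - x.mean()) ** 2).sum()
assert abs(h.sum() - 2.0) < 1e-8
assert np.allclose(h, 1.0 / n + (x - x.mean()) ** 2 / sxx)
```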
8.1.3
8.1.4
8.1.5
8.1.6
SAS Solutions → Analysis → Interactive Data Analysis starts proc insight, which
can be used to produce multiple residual plots. You can also start this procedure with the following program.
data fuel2001;
set alr3.fuel2001;
Dlic=Drivers*1000/Pop;
Fuel=FuelC*1000/Pop;
Income=Income/1000;
logMiles=log2(Miles);
proc insight data=fuel2001;
fit Fuel=Tax Dlic Income logMiles;
scatter R_Fuel*Tax Dlic Income logMiles P_Fuel /label=State;
scatter Fuel*P_Fuel /label=State;
run;
The label option allows the user to specify a labelling variable. The variables
R_yname for residuals and P_yname for fitted values are computed by the fit
statement. These graphs are interactive, and you can use the mouse to identify
the labels for individual cases.
8.2 TESTING FOR CURVATURE
SAS The following SAS code will reproduce alr[T8.1], as shown in Table 8.2.
data fuel2001;
set alr3.fuel2001;
Dlic=Drivers*1000/Pop;
Fuel=FuelC*1000/Pop;
Income=Income/1000;
logMiles=log2(Miles);
proc reg data=fuel2001; *this proc reg is simply to retrieve fitted values;
model Fuel=Tax Dlic Income logMiles /noprint;
output out=m1 predicted=pred;
run;
data m1; *this data step is to add quadratic terms to the old data set;
set m1;
Tax2=Tax**2; Dlic2=Dlic**2;
Income2=Income**2; logMiles2=logMiles**2;
pred2=pred**2;
proc reg data=m1; *fit models with one quadratic term added each time;
model Fuel=Tax Tax2 Dlic Income logMiles;
model Fuel=Tax Dlic Dlic2 Income logMiles;
model Fuel=Tax Dlic Income Income2 logMiles;
model Fuel=Tax Dlic Income logMiles logMiles2;
model Fuel=Tax Dlic Income logMiles pred2;
ods output ParameterEstimates=est;
*save the parameter estimates and t-test results;
run;
data est; set est; Term=scan(Variable,1,'2');
data est; set est; where Term ne Variable;
*keep the test statistics for added terms only;
if Term='pred' then do;
Probt=2*probnorm(-abs(tValue)); Term='Tukey test'; end;
proc print data=est; var Term tValue Probt; run;
Term         tValue   Probt
Tax          -1.08    0.2874
Dlic         -1.92    0.0610
Income       -0.08    0.9334
logMiles     -1.35    0.1846
Tukey test   -1.45    0.1482
8.3 NONCONSTANT VARIANCE

8.3.1 Variance Stabilizing Transformations

8.3.2 A diagnostic for nonconstant variance
SAS The score test for nonconstant variance can be computed in SAS using
the proc model method; the score test is the same as the Breusch-Pagan test.
The use of proc model is relatively complex and is beyond the scope of this
primer.
If you don't mind doing a side calculation by hand, you can compute the
score test using the following program:
proc reg data=alr3.snowgeese;
model photo=obs1;
output out=m1 residual=res predicted=fit;
run;
data m2;
set m1;
u = res*res;
proc reg data=m2;
model u=obs1;
run;
With only one term in the mean function beyond the intercept, the score test
for variance proportional to the mean is the same as the score test for variance
as a function of the single predictor, so both of the above tests will have the
same value (81.414) for the test statistic.
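The side calculation the text alludes to is the Cook–Weisberg score statistic: with σ̃² = RSS/n, run the auxiliary regression of the squared residuals on the variance terms and take SSreg/(2σ̃⁴), comparing with χ² on 1 df. A Python sketch on synthetic heteroscedastic data (not the snow geese data):

```python
import random

def fit_line(x, y):
    # Ordinary least squares for a straight line.
    n = len(x); mx = sum(x) / n; my = sum(y) / n
    sxx = sum((v - mx) ** 2 for v in x)
    b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return my - b1 * mx, b1

rng = random.Random(3)
x = [rng.uniform(1, 10) for _ in range(200)]
y = [2 + 3 * xi + rng.gauss(0, 0.5 * xi) for xi in x]  # sd grows with x

b0, b1 = fit_line(x, y)
e2 = [(yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y)]
n = len(x)
sigma2 = sum(e2) / n                    # RSS/n, the MLE of sigma^2

# Auxiliary regression of e^2 on x; SSreg/(2*sigma2^2) is the score statistic.
a0, a1 = fit_line(x, e2)
mbar = sum(e2) / n
ssreg = sum((a0 + a1 * xi - mbar) ** 2 for xi in x)
score = ssreg / (2 * sigma2 ** 2)
assert score > 3.84    # rejects constant variance at the 5% level here
```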
The SAS data step can be used to provide statistical tables. For example,
the score test described above has value 81.414 and 1 df. Although computation of the significance level for this test is unnecessary because the value of
the statistic is so large, we illustrate the computation:
data temp;
pvalue = 1-probchi(81.414,1);
proc print data=temp; run;
We provide in the scripts for this chapter a program that uses proc iml to
get the score test without the side calculation.
8.3.3 Additional comments

8.4 GRAPHS FOR MODEL ASSESSMENT

8.4.1 Checking mean functions
For a model with multiple predictors, the following SAS code generates
mean checking plots alr[F8.13] for the UN2 data:
proc reg data=alr3.UN2;
Fig. 8.1 SAS mean checking plot for the ufcwc data.
8.4.2 Checking variance functions
SAS Variance checking plots also require several steps, and are generated
following alr[A.5]. One important step in the SAS code below is proc sql,
which appends a single-number data set to another larger data set. The
general form of its create table statement is: create table newdata-name as
select variable-list from olddata-name1, olddata-name2 <,olddata-name3,...>;.
The * in the create table statement refers to all available variables in the
selected data sets. What the procedure below does is assign the single
value (for example, sigma in anova1) to each row of the larger data set (for
example, ols1). By doing this, we can then use the appended variable in later
calculations. The following will reproduce alr[F8.15A] and alr[F8.15B] for
the UN2 data:
data UN2;
set alr3.UN2;
Purban2=Purban**2;
goptions reset=all;
symbol1 v=circle h=1 c=black;
symbol2 v=point c=black i=join l=33;
symbol3 v=point c=black i=join l=1;
axis1 label=('Fitted values, original model');
axis2 label=('Fitted values, quadratic term added');
axis3 label=(a=90 'log(Fertility)');
*------------------------------------------------------;
*BEGIN plotting alr[F8.15A];
proc reg data=UN2;
model logFertility=logPPgdp Purban;
output out=ols1 predicted=OLS_pred;
ods output Anova=anova1;
run;
Fig. 8.2
9
Outliers and Influence

9.1 OUTLIERS

9.1.1 An outlier test
SAS Both proc glm and proc reg can compute the studentized residuals
rstudent that are used as the basis of the outlier test. Finding the maximum
absolute residual, and then using the Bonferroni inequality as outlined in alr,
requires writing a complex SAS program, which we describe below.
9.1.2
9.1.3
absrstu=abs(rstu);
proc sort data=m1; by descending absrstu; run;
data m1; set m1 (obs=10); p=min(1,193*2*(1-probt(absrstu,190-1)));
proc print data=m1; run;
We first use proc reg to get the Studentized residuals, ti . In the next data
step, we create the absolute values |ti| of rstudent and a variable casenum that
gives the observation number. This latter variable can serve as an identifier
in problems in which the points are not labelled. proc sort ordered the data
according to the values of |ti |. The next data step sets p equal to the Bonferroni p-values; you must supply the sample size, n = 193 in the example,
and the number of df for error, 190 in the example, to suit your problem.
The statistics for the 10 cases with the largest |ti | are then printed:
Obs   logPPgdp    logFertility   Purban   Locality        absrstu   p
1      8.434628   0.33647224     41       Moldova         2.73895   1
2      9.424166   0.13976194     67       Armenia         2.69971   1
3      9.575539   0.13976194     68       Ukraine         2.63789   1
4     10.663558   0.09531018     67       Bulgaria        2.38718   1
5     10.125413   0.26236426     43       Bosnia-Herzeg   2.34340   1
6      9.231221   0.33647224     57       Georgia         2.32188   1
7     12.857398   1.60140574     77       Oman            2.30490   1
8      9.157347   0.3435897      61       N.Korea         2.29487   1
9     10.249113   0.18232156     70       Belarus         2.27244   1
10    11.94398    1.773256       49       Equatorial.Gu   2.19654   1
None of the Studentized residuals are particularly large in this example, and
all the Bonferroni-adjusted p-values are equal to one.
For problems that are not too large, you can identify the largest Studentized
residual simply by examining a list of them. Suppose, for example, that
n = 100, p = 5, and the largest absolute Studentized residual has the value
3.83. You can compute the significance level for the outlier test from
data temp;
pvalue = min(1,100*2*(1-probt(3.83,200-5-1)));
proc print data=temp; run;
Giving the following output:
Obs   pvalue
1     0.017292
9.1.4 Additional comments

9.2 INFLUENCE OF CASES
SAS You can obtain various influence statistics from SAS output. For example, the following SAS code will give you the raw residuals, studentized residuals,
leverages and Cook's distances:
proc reg data=alr3.UN2;
The reweight and refit options in proc reg provide a simple mechanism
for refitting after deleting a few cases. In the following example, we fit the
mean function log(Fertility) = β0 + β1 log(PPgdp) + β2 Purban using the UN2
data. We first delete case 2 and refit. We then replace it, and remove cases
1, 3 and 6. Finally, we restore all the deleted cases.
data UN2;
set alr3.UN2;
retain casenum 0;
casenum=casenum+1;
proc reg data=UN2;
model logFertility=logPPgdp Purban;
run;
reweight casenum=2; refit; print; run; *delete case 2;
reweight undo; *restore the last deleted case;
reweight casenum=1;
reweight casenum=3;
reweight casenum=6;
refit; print; run; *delete case 1,3,6 from the data set;
reweight allobs /reset; *restore all previously deleted cases;
The key statements here are reweight condition(s) and refit. reweight
changes the weights on the selected cases. You can specify the new (common)
weight for the selected cases that satisfy certain condition(s). We choose to
omit the weight=value option because the default is to set the weights of those
cases to zero. Another option for the reweight statement is reset, as in
the last reweight statement in the code above. What that statement does is
to reset the weights of allobs (namely, all observations) to their initial values.
The statement reweight undo; will undo the last reweight statement.
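The effect of reweight with the default zero weight can be mimicked by weighted least squares, where a deleted case simply gets weight 0; a numpy sketch (not SAS code):

```python
import numpy as np

rng = np.random.default_rng(11)
n = 25
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([2.0, 1.5]) + rng.normal(size=n)

def wls(X, y, w):
    # Weighted least squares via rescaling rows by sqrt(w).
    sw = np.sqrt(w)
    return np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]

w = np.ones(n)
w[1] = 0.0                                # like "reweight casenum=2;"
beta_w = wls(X, y, w)

# Same answer as literally deleting case 2 and refitting.
keep = np.arange(n) != 1
beta_d = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
assert np.allclose(beta_w, beta_d)
```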
9.2.1 Cook's distance

SAS The plot of Cook's distance versus observation number using the rat
data can be obtained as follows:
goptions reset=all;
proc reg data=alr3.rat;
model y=BodyWt LiverWt Dose;
output out=m1 cookd=cookd;
plot cookd.*obs.;
run;
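The quantity saved by cookd= can be reproduced from the residuals and leverages, since Cook's distance satisfies Di = ei² hii / (p′ σ̂² (1 − hii)²); a numpy check on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

beta = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ beta
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
s2 = e @ e / (n - p)                       # usual estimate of sigma^2

D = e ** 2 * h / (p * s2 * (1 - h) ** 2)   # Cook's distance for every case

# Check case 0 against the definition based on deleting the case.
keep = np.arange(n) != 0
b0 = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
d = beta - b0
assert abs(D[0] - d @ (X.T @ X) @ d / (p * s2)) < 1e-6
```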
9.2.2 Magnitude of Di

9.2.3 Computing Di

9.2.4

9.3 NORMALITY ASSUMPTION
SAS QQ plots can be drawn with the plot statement in proc reg.
data heights;
infile "c:/My Documents/alr3/heights.txt" firstobs=2;
input Mheight Dheight;
goptions reset=all;
*********************************;
**qqnorm plot for heights data***;
*********************************;
proc reg data=heights;
model Dheight=Mheight;
title 'QQ Plot';
plot r.*nqq. /noline mse cframe=ligr;
run;
*********************************;
**qqnorm plot for transact data**;
*********************************;
data transact;
infile "c:/My Documents/alr3/transact.txt" firstobs=2;
input T1 T2 Time;
proc reg data=transact;
model Time=T1 T2;
title 'QQ Plot';
plot r.*nqq. /noline mse cframe=ligr;
run;
The mysterious keyword nqq. is simply the normal quantiles. The three options in the plot statement are used to, respectively, delete the horizontal
reference line, display the mean squared error statistic on the graph, and change
the background of the graph from white to grey.
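If you want the quantiles behind nqq. outside SAS, they can be computed from normal plotting positions; the Blom positions used below are an assumption, not necessarily SAS's exact convention:

```python
from statistics import NormalDist

def normal_quantiles(r):
    # Pair each residual with a normal quantile via Blom's plotting
    # positions (rank - 3/8)/(n + 1/4).
    n = len(r)
    order = sorted(range(n), key=lambda i: r[i])
    q = [0.0] * n
    for rank, i in enumerate(order, start=1):
        q[i] = NormalDist().inv_cdf((rank - 0.375) / (n + 0.25))
    return q

res = [0.5, -1.2, 0.1, 2.0, -0.3]
q = normal_quantiles(res)
assert q[3] == max(q) and q[1] == min(q)   # ordering is preserved
```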
Fig. 9.1 QQnorm plots of residuals for the heights data (left) and for the transact
data (right).
10
Variable Selection
10.1 THE ACTIVE TERMS
The first example in this chapter uses randomly generated data. This can be
helpful in trying to understand issues against a background where we know
the right answers. Generating random data is possible in most statistical
packages, though doing so may not be easy or intuitive.
SAS The function rannor is used in the data step to generate standard normal (pseudo) random numbers. rannor has an argument seed that
can be any positive integer. By setting the seed, you will always use the exact
same random numbers. ("How can they be random if you always get the
same ones?") If seed is set to −1, the system selects the starting values for the
random numbers. Only one seed can be specified per data step. The data step
generates data one observation at a time, so you need a do loop to get a
vector of random numbers. An output statement must be included in the loop,
or else only the last generated value will be kept.
The following SAS program generates one hundred pseudo-random numbers
for each xi s and e, then y is generated as a function of x1 , x2 and e. Finally,
proc reg is called to fit a linear model of y on all the xi s, as in alr[10.1].
See the fitted model in Table 10.1.
data case1; *create a dataset called case1;
seed=20040622; *use this seed to get the same random numbers;
do i=1 to 100; *repeat to get 100 observations;
x1=rannor(seed);
x2=rannor(seed);
x3=rannor(seed);
x4=rannor(seed);
e=rannor(seed);
y=1+x1+x2+e;
output; *REQUIRED to keep what you just did;
end; *all do loops end with end;
* the next line prints the first 3 and last 3 observations;
proc print data=case1; where (i le 3 | i ge 98); run;
proc reg data=case1;
model y=x1 x2 x3 x4/vif;
run;
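The point about seeds—the same seed always reproduces the same stream—carries over to any generator; a Python sketch of the same simulation:

```python
import random

def make_case1(seed, n=100):
    # Mimic the data step: four standard normal predictors and an error,
    # with y depending only on x1 and x2.
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = [rng.gauss(0, 1) for _ in range(4)]
        e = rng.gauss(0, 1)
        rows.append((x, 1 + x[0] + x[1] + e))
    return rows

a = make_case1(20040622)
b = make_case1(20040622)
assert a == b                 # same seed, same "random" numbers
assert a != make_case1(1)     # a different seed gives a different stream
```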
Table 10.1 Parameter Estimates

Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|   Variance Inflation
Intercept    1    0.99662             0.10738             9.28   <.0001     0
x1           1    0.97910             0.10870             9.01   <.0001     1.01447
x2           1    0.85222             0.10990             7.75   <.0001     1.01517
x3           1    0.01066             0.10833             0.10   0.9218     1.00133
x4           1   -0.06222             0.11079            -0.56   0.5757     1.01860
The response y is generated from the same function of xi s and e, and we fit
the same linear model of y on xi s, as in Case 1. See output in Table 10.2.
Unlike the results in the book, in this particular random data set the active
predictors are correctly identified, as will sometimes happen. As correlations
get bigger, the frequency of correct identification gets lower.
Table 10.2 Parameter Estimates

Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|   Variance Inflation
Intercept    1   -0.00338             0.10738            -0.03   0.9749     0
X1           1    0.94667             0.34410             2.75   0.0071     10.16563
X2           1    0.66293             0.36497             1.82   0.0725     11.19670
X3           1    0.03413             0.34695             0.10   0.9218     10.15189
X4           1   -0.19926             0.35480            -0.56   0.5757     11.16227

10.1.1 Collinearity
SAS Variance inflation factors are computed in proc reg by adding the /
vif option to the model statement. In the two simulations, the VIF are all
close to one for the first simulation, and equal to about 11 in the second; see
output in Tables 10.1-10.2.
proc reg data=case1;
model y=x1 x2 x3 x4 /vif;
run;
proc reg data=case2;
model y=x1 x2 x3 x4 /vif;
run;
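The variance inflation factor for a term is 1/(1 − R²j), where R²j comes from regressing that term on all the others; with just two predictors R²j is their squared correlation, so the simulation's behavior is easy to mimic in Python:

```python
import math
import random

rng = random.Random(9)
x1 = [rng.gauss(0, 1) for _ in range(500)]
# x2 is built to have correlation about 0.95 with x1.
x2 = [0.95 * a + math.sqrt(1 - 0.95 ** 2) * rng.gauss(0, 1) for a in x1]

def corr(u, v):
    n = len(u); mu = sum(u) / n; mv = sum(v) / n
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    return suv / math.sqrt(suu * svv)

vif = 1 / (1 - corr(x1, x2) ** 2)   # same for both terms in this case
assert vif > 5                       # strong collinearity inflates variances
```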
10.1.2

10.2

10.2.1

10.2.2

Table 10.3

_CP_   _AIC_      _BIC_      _SBC_
14     -65.6115   -48.5587   -42.3216
6      -67.9866   -63.8709   -58.0052

10.3 COMPUTATIONAL METHODS
SAS The SAS procedure proc reg has an option selection that is added to
the model statement for choosing a numerical procedure for selecting a subset
of terms. While several choices are available for this option, the most relevant
seem to be forward for forward selection, backward for backward elimination,
and cp, to find the subsets that minimize Cp. This latter option uses the
Furnival and Wilson (1974) algorithm discussed in alr[10.3]. Since proc reg
does not permit factors, using Cp is essentially the same as using any of the
other criteria discussed in alr. If you have factors, you must code them
yourself using dummy variables (an example is below), and you apparently
can't force terms into every mean function, or add and delete terms as a group.
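For a candidate subset with p coefficients, Mallows' statistic is Cp = RSSp/σ̂² + 2p − n, with σ̂² taken from the largest mean function; a tiny Python illustration with made-up RSS values (these are not the highway numbers):

```python
# Cp = RSS_p / sigma2_full + 2*p - n; subsets with small Cp are preferred.
n = 39                  # number of cases
sigma2_full = 0.40      # hypothetical estimate of sigma^2 from the full model

def cp(rss, p):
    return rss / sigma2_full + 2 * p - n

candidates = {"A": (16.0, 4), "B": (14.8, 6), "C": (14.6, 9)}
scores = {name: cp(rss, p) for name, (rss, p) in candidates.items()}
best = min(scores, key=scores.get)
print(best, round(scores[best], 1))   # A 9.0
```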
The first example uses the Furnival and Wilson algorithm to find subsets
that minimize Mallows Cp . As in alr[10.3], we use the highway data. We
start by coding the factor Hwy as a set of dummy variables.
goptions reset=all;
data highway;
set alr3.highway;
logLen=log2(Len); logADT=log2(ADT);
logTrks=log2(Trks); logSigs1=log2((Len*Sigs+1)/Len);
logRate=log2(Rate);
if (Hwy eq 1) then Hwy1=1; else Hwy1=0;
if (Hwy eq 2) then Hwy2=1; else Hwy2=0;
if (Hwy eq 3) then Hwy3=1; else Hwy3=0;
proc reg data=highway;
model logRate=logADT logTrks Lane Acpt logSigs1 Itg Slim
              Shld Lwid logLen Hwy1 Hwy2 Hwy3
              /selection=cp best=10 aic;
run;
We have requested the 10 subsets with minimum Cp , and have also requested
that AIC be printed for each of these subsets. The output is shown in Table 10.4. Many of these subsets include one, but not all, of the dummy
variables for Hwy, so these are not likely to be of much interest.
Table 10.4

Number in
   Model     C(p)  R-Square       AIC  Variables in Model
       7   3.4857    0.7789  -75.3600  logADT logTrks Acpt logSigs1 Slim logLen Hwy1
       5   3.5203    0.7453  -73.8303  logSigs1 Slim logLen Lwid Hwy2
       5   3.5544    0.7450  -73.7868  logADT logSigs1 Slim logLen Hwy1
       7   3.5744    0.7782  -75.2297  logADT Acpt logSigs1 Slim logLen Hwy1 Hwy2
       5   3.6065    0.7445  -73.7204  logSigs1 Slim logLen Shld Hwy2
       6   3.6289    0.7611  -74.3254  logADT logTrks logSigs1 Slim logLen Hwy1
       6   3.7028    0.7604  -74.2249  logSigs1 Itg Slim logLen Hwy2 Hwy3
       7   3.8174    0.7762  -74.8747  logADT logSigs1 Slim logLen Lwid Hwy2 Hwy3
       7   3.9278    0.7753  -74.7143  logADT logSigs1 Slim logLen Hwy1 Hwy2 Hwy3
       7   3.9477    0.7751  -74.6855  logADT logSigs1 Slim logLen Lwid Hwy1 Hwy2
The summary of forward selection from proc reg is given in Table 10.5.
Table 10.5

                                  Var      Number  Partial    Model
Obs  Model   Dependent  Step  Entered          In  RSquare  Rsquare       Cp  FValue   ProbF
  1  MODEL1  logRate       1  Slim             1   0.4765   0.4765  27.7232   33.68  <.0001
  2  MODEL1  logRate       2  logLen           2   0.1629   0.6394  10.2021   16.27  0.0003
  3  MODEL1  logRate       3  Acpt             3   0.0354   0.6748   7.9587    3.81  0.0589
  4  MODEL1  logRate       4  logTrks          4   0.0212   0.6961   7.4145    2.38  0.1325
  5  MODEL1  logRate       5  Hwy2             5   0.0192   0.7152   7.1199    2.22  0.1458
  6  MODEL1  logRate       6  logSigs1         6   0.0386   0.7539   4.4903    5.02  0.0321
  7  MODEL1  logRate       7  logADT           7   0.0158   0.7697   4.5933    2.13  0.1544
  8  MODEL1  logRate       8  Hwy3             8   0.0171   0.7868   4.5427    2.41  0.1312

10.3.1

10.4 Windmills
10.4.1
10.4.2
The data for the windmill example in alr[10.4.2] are not included with the
alr3 library, and must be downloaded separately from www.stat.umn.edu/alr.
SAS The SAS code used to compute alr[F10.1] is given in the script for
this chapter.
11
Nonlinear Regression
11.1
11.2
SAS SAS includes several procedures and tools for fitting nonlinear regression models. We will discuss only two of them, proc nlin and proc nlmixed.
proc nlmixed is newer and more general, and so we will mainly discuss using
this procedure. An example of using proc nlin is also given. The syntax in
these two procedures is much more complicated than is the syntax for proc
reg or proc glm because these two programs are more general.
We begin by illustrating the fitting of the mean function

E(Gain|A) = θ1 + θ2(1 − exp(−θ3 A))                                (11.1)
Table 11.1

                     Iteration History

Iter   Calls    NegLogLike        Diff     MaxGrad      Slope
   1       3    733.498721    1022.171     238.637    -2272.58
   2       4    658.723273    74.77545    120.8772    -502.002
 ...     ...           ...         ...         ...         ...
  24      46    152.343619    0.000015    0.000243    -0.00003
  25      48    152.343619    2.268E-8    0.000019    -4.24E-8

NOTE: GCONV convergence criterion satisfied.

           Fit Statistics
-2 Log Likelihood                 304.7
AIC (smaller is better)           312.7
AICC (smaller is better)          314.0
BIC (smaller is better)           318.9

                              Parameter Estimates

                      Standard
Parameter   Estimate     Error   DF   t Value   Pr > |t|   Alpha    Lower    Upper   Gradient
th1           622.96    5.6064   35    111.12     <.0001    0.05   611.58   634.34   5.031E-6
th2           178.25   10.9596   35     16.26     <.0001    0.05   156.00   200.50   1.557E-6
th3           7.1222    1.1071   35      6.43     <.0001    0.05   4.8748   9.3697   0.000019
sig2          353.35   84.4669   35      4.18     0.0002    0.05   181.88   524.83    -3.1E-7

Correlation Matrix of Parameter Estimates (sig2 column)

Row   Parameter       sig2
  1   th1         0.000012
  2   th2         -3.17E-6
  3   th3         -4.21E-6
  4   sig2          1.0000
The parms statement specifies the parameters that will appear in the model.
These include the θs, which we have called th1, th2, and th3, and also the
variance σ² that we call sig2. Unlike proc reg, you must specify starting
values for the parameter estimates, as discussed in alr[11.2], and if the
starting values are poor, the computational algorithm may fail to find the
estimates. Next, we define a mean function fitGain to be equal to the kernel
mean function, a rewriting of (11.1) using notation that SAS understands.
The model statement specifies
the whole model, not just the mean function. This corresponds to the use of
the term model in alr as referring to the mean function plus any other
assumptions. For the example, the model assumes that Gain is normally
distributed with mean fitGain and constant variance sig2. The option corr
in the proc nlmixed statement tells SAS to display the correlation matrix of the
parameter estimates. proc nlmixed permits many other distributions besides
the normal. In addition, it has another statement called random for specifying
random effects that we will not be using in this primer, except for computing
the random coefficient models presented in Section 6.5. Finally, the predict
statement is used to get a prediction and prediction standard error for each
observation. The option out saves the output as a data file that is required
to draw the graph of the fitted function. The output in Table 11.1 includes
the parameter estimates and the estimated correlation matrix. The remaining
part of the program uses proc gplot to graph the fitted curve and the data
shown in Figure 11.1.
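The kernel mean function itself is easy to evaluate by hand. A small Python sketch (not SAS), using the estimates from Table 11.1, shows the role of the asymptote θ1 + θ2:

```python
import math

def gain_mean(a, th1, th2, th3):
    """Kernel mean function E(Gain|A) = th1 + th2*(1 - exp(-th3*A))."""
    return th1 + th2 * (1.0 - math.exp(-th3 * a))

# Estimates from Table 11.1: the curve starts at th1 when A = 0 and
# rises toward the asymptote th1 + th2 = 801.21 as A increases.
th1, th2, th3 = 622.96, 178.25, 7.1222
print(gain_mean(0.0, th1, th2, th3))   # 622.96
print(round(gain_mean(0.25, th1, th2, th3), 1))
```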
The next SAS program will give a lack-of-fit test, as shown in Table 11.2,
for the above nonlinear model of Gain on A, using the turk0 data:
goptions reset=all;
data turk0;
infile "c:/My Documents/alr3/turk0.txt" firstobs=2;
input A Gain;
proc nlmixed data=turk0;
parms th1=620 th2=200 th3=10 sig2=10;
fitGain=th1+th2*(1-exp(-th3*A));
model Gain~Normal(fitGain,sig2);
ods output ParameterEstimates=a1;
run; quit;
data a1; set a1; where (Parameter="sig2"); SS=Estimate*DF; DF=32;
proc glm data=turk0;
class A;
model Gain=A;
ods output OverallAnova=a2;
run; quit;
data a2; set a2; where (Source="Error");
data lof (keep=DF SS); set a1 a2;
proc sort data=lof; by descending DF;
proc iml;
use lof;
read all var {DF} into resDF;
read all var {SS} into resSS;
DF=j(2,1,.); Sumsq=j(2,1,.); F=j(2,1,.); pval=j(2,1,.);
DF[2]=resDF[1]-resDF[2];
Sumsq[2]=resSS[1]-resSS[2];
F[2]=Sumsq[2]/DF[2]/(resSS[2]/resDF[2]);
pval[2]=1-probF(F[2],DF[2],resDF[2]);
title "lack of fit test";
print resDF resSS DF Sumsq F pval;
quit;
proc iml is used to do the F-test comparing the nonlinear model to the simple
one-way ANOVA, since proc nlmixed has neither a class statement nor a test
statement.
Table 11.2    Lack of fit test for a nonlinear model with the turk0 data.

lack of fit test

RESDF      RESSS   DF      SUMSQ          F       PVAL
   32   12367.34    .          .          .          .
   29     9823.6    3  2543.7399  2.5031033  0.0789298
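The F statistic in Table 11.2 is the usual lack-of-fit ratio, and it can be verified directly from the printed residual df and sums of squares. A quick check in Python (not SAS):

```python
# Residual df and SS for the nonlinear fit and for the one-way ANOVA
# ("pure error") fit, as printed in Table 11.2.
df_nl, ss_nl = 32, 12367.34
df_pe, ss_pe = 29, 9823.6

df_lof = df_nl - df_pe
ss_lof = ss_nl - ss_pe
f_stat = (ss_lof / df_lof) / (ss_pe / df_pe)
print(df_lof, round(ss_lof, 2), round(f_stat, 4))
```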
For fitting nonlinear models with factors, you need to define dummy variables
for the factor variables in the data step. Then you can fit nonlinear models
with these dummies. We fit four different models and compare them with F-tests.
Since proc nlmixed does not provide correct error degrees of freedom,
we present such an example using proc nlin for the turkey data. The syntax
for nlin is a little simpler for nonlinear regression models because the method
is much less general. The parms statement is the same as in proc nlmixed.
The model statement in nlin specifies only the mean function, since constant
variance and normality are assumed by this procedure.
goptions reset=all;
data turkey;
set alr3.turkey;
S1=(S eq 1);
S2=(S eq 2);
S3=(S eq 3);
w=sqrt(m);
wGain=w*Gain;
proc nlin data=turkey; *most general;
parms th11=620 th12=620 th13=620
th21=200 th22=200 th23=200
th31=10 th32=10 th33=10;
model wGain=w*(S1*(th11+th21*(1-exp(-th31*A)))+
S2*(th12+th22*(1-exp(-th32*A)))+
S3*(th13+th23*(1-exp(-th33*A))));
ods output Anova=a1;
run;
proc nlin data=turkey; *common intercept;
parms th1=620
th21=200 th22=200 th23=200
th31=10 th32=10 th33=10;
model wGain=w*(th1+
S1*(th21*(1-exp(-th31*A)))+
S2*(th22*(1-exp(-th32*A)))+
S3*(th23*(1-exp(-th33*A))));
ods output Anova=a2;
run;
proc nlin data=turkey; *common intercept and asymptote;
parms th1=620 th2=200
th31=10 th32=10 th33=10;
model wGain=w*(th1+th2*(
S1*(1-exp(-th31*A))+
S2*(1-exp(-th32*A))+
S3*(1-exp(-th33*A))));
ods output Anova=a3;
run;
proc nlin data=turkey; *common regression;
parms th1=620 th2=200 th3=10;
model wGain=w*(th1+th2*(1-exp(-th3*A)));
ods output Anova=a4;
run;
**************************************;
***Anova table for model 1, 2 and 4***;
**************************************;
data anova (keep=DF SS); set a1 a2 a4; where (Source eq "Residual");
proc sort data=anova; by descending DF;
proc iml;
use anova;
read all var {DF} into resDF;
read all var {SS} into resSS;
DF=j(3,1,.); Sumsq=j(3,1,.); F=j(3,1,.); pval=j(3,1,.);
DF[2]=resDF[1]-resDF[2]; DF[3]=resDF[2]-resDF[3];
Sumsq[2]=resSS[1]-resSS[2]; Sumsq[3]=resSS[2]-resSS[3];
F[2]=Sumsq[2]/DF[2]/(resSS[2]/resDF[2]);
F[3]=Sumsq[3]/DF[3]/(resSS[3]/resDF[3]);
pval[2]=1-probF(F[2],DF[2],resDF[2]);
pval[3]=1-probF(F[3],DF[3],resDF[3]);
title "Analysis of Variance Table";
print resDF resSS DF Sumsq F pval;
quit;
**************************************;
***Anova table for model 1, 3 and 4***;
**************************************;
data anova (keep=DF SS); set a1 a3 a4; where (Source eq "Residual");
proc sort data=anova; by descending DF;
proc iml;
use anova;
read all var {DF} into resDF;
read all var {SS} into resSS;
DF=j(3,1,.); Sumsq=j(3,1,.); F=j(3,1,.); pval=j(3,1,.);
DF[2]=resDF[1]-resDF[2]; DF[3]=resDF[2]-resDF[3];
Sumsq[2]=resSS[1]-resSS[2]; Sumsq[3]=resSS[2]-resSS[3];
F[2]=Sumsq[2]/DF[2]/(resSS[2]/resDF[2]);
F[3]=Sumsq[3]/DF[3]/(resSS[3]/resDF[3]);
pval[2]=1-probF(F[2],DF[2],resDF[2]);
pval[3]=1-probF(F[3],DF[3],resDF[3]);
title "Analysis of Variance Table";
print resDF resSS DF Sumsq F pval;
quit;
We fit weighted nonlinear regression models with one continuous variable and
one factor. Models 1, 2, 3, and 4 are the most general model, the common
intercept model, the common intercept and asymptote model, and the common
regression model, respectively. We first code the dummies S1, S2 and S3 for S,
and create the weighted response wGain in the data step, using m as the
weights. Then we fit the four different nonlinear models and run two ANOVAs,
one comparing models 1, 2, and 4, and the other comparing models 1, 3, and 4.
See the ANOVA tables in Table 11.3. Finally, we compute all the F-values and
p-values in proc iml.
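The F statistics in Table 11.3 can be checked by hand from the residual df and sums of squares; the following Python sketch (not SAS) performs the same computation as the proc iml step for the first panel:

```python
# Residual df and residual SS for models 4 (common regression),
# 2 (common intercept), and 1 (most general), ordered by descending
# df as in the proc sort step; values are from Table 11.3.
res = [(10, 4326.0797), (6, 2040.0059), (4, 1151.1523)]

fs = []
for (df_big, ss_big), (df_small, ss_small) in zip(res, res[1:]):
    # F for comparing the smaller model to the next larger one.
    f = ((ss_big - ss_small) / (df_big - df_small)) / (ss_small / df_small)
    fs.append(f)

# Matches the F column of Table 11.3: 1.6809318 and 1.5442849.
print([round(f, 4) for f in fs])
```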
Table 11.3 ANOVA tables for comparing four different nonlinear models with the
turkey data.

***ANOVA table for comparing model 1, 2 and 4***

Analysis of Variance Table

RESDF       RESSS   DF      SUMSQ          F       PVAL
   10   4326.0797    .          .          .          .
    6   2040.0059    4  2286.0738  1.6809318  0.2710972
    4   1151.1523    2  888.85359  1.5442849  0.3184218

***ANOVA table for comparing model 1, 3 and 4***

Analysis of Variance Table

RESDF       RESSS   DF      SUMSQ          F       PVAL
   10   4326.0797    .          .          .          .
    8   2568.3882    2  1757.6915  2.7374233  0.1242409
    4   1151.1523    4  1417.2359  1.2311454  0.4225784

11.3 BOOTSTRAP INFERENCE
SAS We give code for the bootstrap discussed in alr[11.3]. See also the
documentation for the macro boots for more information. The result with
B=999 and seed=2004 is given in Table 11.4.
*******************************************;
***A similar macro appeared in chapter 4***;
*******************************************;
%let n=39; *#obs in the data;
%macro nlsboot(B=1, seed=1234, outputdata=temp);
%do i=1 %to &B;
*the following command tells SAS to clear both;
*the log window and the output window;
dm "log; clear; output; clear;";
data analysis;
choice=int(ranuni(&seed*&i)*&n)+1;
set alr3.segreg point=choice;
j+1;
if j>&n then STOP;
proc nlmixed data=analysis;
parms th0=70 th1=.5 gamma=40 sig2=1;
fitC=th0+th1*max(0,Temp-gamma);
model C~Normal(fitC,sig2);
ods output ParameterEstimates=pe;
run; quit;
proc sql; create table outests as (select
sum(Estimate*(Parameter="th0")) as th0,
sum(Estimate*(Parameter="th1")) as th1,
sum(Estimate*(Parameter="gamma")) as gamma from pe);
quit;
Table 11.4 Summary of bootstrap estimation for the segreg data with proc nlmixed,
based on 999 resamples.

               th0         th1       gamma
Mean     74.757751   0.6059323   42.874274
SD       1.6226234   0.1274196   5.1164431
2.5%     71.472186   0.4577865          36
97.5%    77.657143   0.9673802   54.599069
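The 2.5% and 97.5% rows of Table 11.4 are percentile bootstrap limits: sort the B replicates and read off the empirical quantiles. A Python sketch (not SAS) of that summary step, using simulated replicates purely for illustration:

```python
import random

def percentile_ci(replicates, alpha=0.05):
    """Percentile bootstrap interval: the alpha/2 and 1 - alpha/2
    empirical quantiles of the bootstrap replicates."""
    s = sorted(replicates)
    lo = s[int(len(s) * alpha / 2)]
    hi = s[int(len(s) * (1 - alpha / 2)) - 1]
    return lo, hi

# Hypothetical replicates of a single parameter; a real run would
# collect the B estimates produced inside the macro loop.
random.seed(2004)
boot = [random.gauss(42.9, 5.1) for _ in range(999)]
lo, hi = percentile_ci(boot)
print(round(lo, 1), round(hi, 1))
```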
11.4 REFERENCES
12
Logistic Regression
Both logistic regression and the normal linear models that we have discussed
in earlier chapters are examples of generalized linear models. Many programs,
including SAS, R, and S-Plus, have procedures that can be applied to any
generalized linear model. Both JMP and SPSS seem to have separate procedures for logistic regression. There is a possible source of confusion in the
name. Both SPSS and SAS use the name general linear model to indicate a
relatively complex linear model, possibly with continuous terms, covariates,
interactions, and possibly even random effects, but with normal errors. Thus
the general linear model is a special case of the generalized linear model.
12.1
12.1.1
12.2
BINOMIAL REGRESSION
Mean Functions for Binomial Regression
FITTING LOGISTIC REGRESSION
SAS There is more than one way to fit logistic regression models in SAS,
using, for example, proc logistic and proc genmod. We choose to use proc
genmod, which not only fits binary regression models but also other generalized
linear models such as Poisson and gamma regression (and linear models,
too). This procedure fits a logistic model with the option
/dist=binomial link=logit. When SAS fits a logistic regression model with a
response y equal to 0 or 1, and terms x, it will by default model Prob(y = 0|x),
which is not standard. You can get SAS to model Prob(y = 1|x) by adding
the option descending after the proc genmod statement; see examples in the
next section.
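With the logit link, the linear predictor is converted to a probability by the inverse logit. A small Python sketch (not SAS) of the logistic mean function:

```python
import math

def inv_logit(eta):
    """Logistic mean function: Prob(y = 1|x) = 1/(1 + exp(-eta)),
    where eta is the linear predictor."""
    return 1.0 / (1.0 + math.exp(-eta))

print(inv_logit(0.0))            # log-odds of zero gives probability 0.5
print(round(inv_logit(2.0), 3))
```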
12.2.1
One-predictor example
SAS First we fit logistic regression with one term. In this example, the
variable y contains only two different levels, 0 and 1. The option descending
in the proc genmod statement tells SAS to model Prob(y = 1|x) instead of
Prob(y = 0|x).
data blowBF;
set alr3.blowBF;
logD=log2(D);
proc genmod descending data=blowBF;
model y=logD /dist=binomial link=logit;
run; quit;
Edited output is given in Table 12.1. All the values shown are described in
alr, except the Wald 95% confidence limits, which are simply given by the
estimate plus or minus 1.96 times the standard error of the estimate.
Table 12.1 Output from proc genmod for the blowBF data.

        Criteria For Assessing Goodness Of Fit

Criterion               DF        Value    Value/DF
Deviance               657     655.2420      0.9973
Scaled Deviance        657     655.2420      0.9973
Pearson Chi-Square     657     677.4535      1.0311
Scaled Pearson X2      657     677.4535      1.0311
Log Likelihood                -327.6210

Parameter    DF
Intercept     1
logD          1
Scale         0
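The Wald limits are easy to reproduce by hand. A Python sketch (not SAS), with a hypothetical estimate and standard error:

```python
def wald_limits(estimate, se, z=1.96):
    """Wald 95% confidence limits: estimate +/- z * standard error."""
    return estimate - z * se, estimate + z * se

# Hypothetical estimate and standard error.
lo, hi = wald_limits(0.5, 0.1)
print(round(lo, 3), round(hi, 3))
```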
12.2.2 Many Terms
SAS Like proc glm, proc genmod permits the use of factors and interactions
in model statements. In the example below, a class statement is added to
tell SAS to treat Class, Age and Sex as factors. We also illustrate the use of
contrast statements to make comparisons between levels of a factor or between
terms.1 Choose a name for the contrast, in quotation marks. Then if you
want to test different levels of a factor, give its name followed by contrast
coefficients; the number of coefficients should equal the number of levels of
the factor. If you want to test the parameters for two continuous variables, say
x1 = x2, you would use a contrast statement like contrast "x1?=x2" x1 1
x2 -1 /wald;. The wald option tells SAS to use the Wald chi-square statistic,
which presumably is computed by squaring the estimated contrast divided by
its standard error. The following binomial model uses the titanic data, and
the output is given in Table 12.2.
goptions reset=all;
proc genmod data=alr3.titanic;
class Class Age Sex;
model Surv/N=Class Age Sex / dist=binomial link=logit;
contrast "Class crew?=Class first" Class 1 -1 0 0 /wald;
contrast "Class second?=Class third" Class 0 0 1 -1 /wald;
contrast "Female?=Male" Sex 1 -1 /wald;
run; quit;
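The Wald chi-square for a single parameter or contrast can be checked by hand by squaring the ratio of the estimate to its standard error. A Python sketch (not SAS), using the rounded Sex Female values printed in Table 12.2:

```python
def wald_chisq(estimate, se):
    """Wald chi-square statistic: the squared ratio of an estimate to
    its standard error, referred to a chi-squared(1) distribution."""
    return (estimate / se) ** 2

# Rounded estimate and standard error for Sex Female in Table 12.2;
# the result agrees with the printed 297.07 up to rounding.
print(round(wald_chisq(2.4201, 0.1404), 1))
```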
Here are two models listed in alr[T12.2]. In
the output, Wald chi-square tests are automatically done for each parameter.
We use the blowBF data, and the output from the second proc genmod is given
in Table 12.3.
goptions reset=all;
data blowBF;
set alr3.blowBF;
logD=log2(D);
proc genmod descending data=blowBF;
model y=logD S /dist=binomial link=logit;
output out=many1 p=pred_prob;
run; quit;
proc genmod descending data=blowBF;
model y=logD S logD*S /dist=binomial link=logit;
output out=many2 p=pred_prob;
run; quit;
Programs for jittered scatterplots and for alr[F12.5] are given in the script
for this chapter.
Table 12.2 Edited output from a binomial model using proc genmod for the titanic
data.

The GENMOD Procedure

              Model Information

Data Set                      ALR3.TITANIC
Distribution                      Binomial
Link Function                        Logit
Response Variable (Events)            Surv
Response Variable (Trials)               N

Number of Observations Read             14
Number of Observations Used             14
Number of Events                       711
Number of Trials                      2201

        Criteria For Assessing Goodness Of Fit

Criterion               DF        Value    Value/DF
Deviance                 8     112.5666     14.0708
Scaled Deviance          8     112.5666     14.0708
Pearson Chi-Square       8     103.8296     12.9787
Scaled Pearson X2        8     103.8296     12.9787
Log Likelihood                -1105.0306

                                Standard    Wald 95% Confidence      Chi-
Parameter          DF  Estimate    Error          Limits           Square   P-value
Intercept           1   -1.0924    0.2370   -1.5569    -0.6279      21.24    <.0001
Class    Crew       1    0.9201    0.1486    0.6288     1.2113      38.34    <.0001
Class    First      1    1.7778    0.1716    1.4415     2.1140     107.37    <.0001
Class    Second     1    0.7597    0.1764    0.4140     1.1053      18.56    <.0001
Class    Third      0    0.0000    0.0000    0.0000     0.0000          .         .
Age      Adult      1   -1.0615    0.2440   -1.5397    -0.5833      18.92    <.0001
Age      Child      0    0.0000    0.0000    0.0000     0.0000          .         .
Sex      Female     1    2.4201    0.1404    2.1449     2.6953     297.07    <.0001
Sex      Male       0    0.0000    0.0000    0.0000     0.0000          .         .
Scale               0    1.0000    0.0000    1.0000     1.0000

                      Contrast Results

Contrast                      DF  ChiSquare  Pr > ChiSq  Type
Class crew?=Class first        1      29.71      <.0001  Wald
Class second?=Class third      1      18.56      <.0001  Wald
Female?=Male                   1     297.07      <.0001  Wald
Table 12.3 Output from proc genmod with interaction for the blowBF data.

Criteria For Assessing Goodness Of Fit

Criterion             Value/DF
Deviance                0.8271
Scaled Deviance         0.8271
Pearson Chi-Square      1.3521
Scaled Pearson X2       1.3521

Algorithm converged.

Parameter    DF
Intercept     1
logD          1
S             1
logD*S        1
Scale         0

12.2.3 Deviance
SAS In the following example, we retrieve the deviance and the error df
from the output of proc genmod. We then store this information in the data
set test, and we use proc iml to do all of the chi-square tests. The output is
formatted so that the ANOVA results are presented in Table 12.4.
goptions reset=all;
data blowBF;
set alr3.blowBF;
logD=log2(D);
proc genmod descending data=blowBF;
model y=logD / dist=binomial link=logit;
ods output ModelFit=m1;
run; quit;
proc genmod descending data=blowBF;
model y=logD S / dist=binomial link=logit;
ods output ModelFit=m2;
run; quit;
proc genmod descending data=blowBF;
model y=logD S logD*S/ dist=binomial link=logit;
ods output ModelFit=m3;
run; quit;
data test (keep=DF value);
set m1 m2 m3; where (Criterion eq "Deviance");
proc iml;
use test;
read all var _all_ into x; *_all_ stands for all variables in data test;
resDF=x[,1]; resDev=x[,2];
Table 12.4 ANOVA for comparing 3 logistic models using proc iml for the blowBF
data.

Analysis of Variance Table for model m1, m2 and m3

MODEL   RESDF     RESDEV   DF        DEV       PVAL
m1        657    655.242    .          .          .
m2        656  563.90095    1  91.341046          0
m3        655  541.74561    1  22.155343  2.5146E-6
As a reminder, the keyword _all_ that appears in the proc iml step above
simply tells SAS to read all variables in the referenced data set into this
procedure.
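The p-values in Table 12.4 compare each change in deviance to a chi-squared distribution. For 1 df the upper tail has a closed form in terms of the complementary error function, so the printed values can be checked by hand; a Python sketch (not SAS):

```python
import math

def chisq1_pvalue(g2):
    """Upper tail probability of a chi-squared(1) variate, using the
    identity P(X > x) = erfc(sqrt(x/2))."""
    return math.erfc(math.sqrt(g2 / 2.0))

# Change in deviance from adding logD*S (Table 12.4): 22.155 on 1 df.
print(chisq1_pvalue(22.155343))
```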
12.2.4
SAS There are several ways to test for lack-of-fit. One is the ratio
Deviance/ErrorDF from proc genmod, which is treated as a criterion for
assessing goodness of fit. An example is given below, with output shown in
Table 12.5:
goptions reset=all;
data blowBF;
set alr3.blowBF;
logD=log2(D);
proc genmod descending data=blowBF;
model y=logD / dist=binomial link=logit;
run; quit;
Table 12.5 Criteria for assessing goodness of fit from proc genmod for the blowBF
data.

The GENMOD Procedure

        Criteria For Assessing Goodness Of Fit

Criterion               DF        Value    Value/DF
Deviance               657     655.2420      0.9973
Scaled Deviance        657     655.2420      0.9973
Pearson Chi-Square     657     677.4535      1.0311
Scaled Pearson X2      657     677.4535      1.0311
Log Likelihood                -327.6210
12.3
12.3.1
12.3.2
12.4
Problems
SAS 12.5.1. SAS proc insight, or equivalently the Interactive Data Analysis
tool, can be used to assign colors to data points. After you launch the
Interactive Data Analysis tool, choose Edit → Windows → Tools, which will open
a graphical panel as a new window. You can hold and drag your mouse cursor
to select data points in a rectangular region on any graph, and then set the
color of the selected points by clicking one of the color buttons in the Tools
window. SAS keeps track of the color, and the points will keep these colors in
any subsequent graphs. Hint: The histogram of the response variable Y may help.
References
11. Knüsel, Leo (2005). On the accuracy of statistical distributions in Microsoft
Excel 2003. Computational Statistics and Data Analysis, 48, 445–449.
12. Maindonald, J. and Braun, J. (2003). Data Analysis and Graphics Using R.
Cambridge: Cambridge University Press.
13. Muller, K. and Fetterman, B. (2003). Regression and ANOVA: An Integrated
Approach Using SAS Software. Cary, NC: SAS Institute, Inc., and New York:
Wiley.
14. Nelder, J. (1977). A reformulation of linear models. Journal of the Royal
Statistical Society, A140, 48–77.
15. Pinheiro, J. and Bates, D. (2000). Mixed-Effects Models in S and S-PLUS. New
York: Springer.
16. Rubin, D. B. (1987). Multiple Imputation for Nonresponse in Surveys. New
York: John Wiley & Sons, Inc.
17. Sall, J., Creighton, L. and Lehman, A. (2005). JMP Start Statistics, third
edition. Cary, NC: SAS Institute, and Pacific Grove, CA: Duxbury. Referred
to as jmp-start.
18. SPSS (2003). SPSS Base 12.0 User's Guide. Chicago, IL: SPSS, Inc.
19. Thisted, R. (1988). Elements of Statistical Computing. New York: Chapman &
Hall.
20. Venables, W. and Ripley, B. (2000). S Programming. New York: Springer.
21. Venables, W. and Ripley, B. (2002). Modern Applied Statistics with S, 4th
edition. New York: Springer. Referred to as vr.
22. Venables, W. and Smith, D. (2002). An Introduction to R. Network Theory,
Ltd.
23. Verzani, John (2005). Using R for Introductory Statistics. Boca Raton: Chapman & Hall.
24. Weisberg, S. (2005). Applied Linear Regression, third edition. New York: Wiley.
Referred to as alr.
25. Weisberg, S. (2005). Lost opportunities: Why we need a variety of statistical
languages. Journal of Statistical Software, 13(1), www.jstatsoft.org.
Index
Analyze
Box Plot/Mosaic Plot(Y), 21
Distribution, 43
Scatter plot, 14
Arc, 5
Case-sensitive, 9
Chi-squared tables, 9
Cholesky decomposition, 92
Class variables, 26
CRAN, 2
Data files, 6
Documentation, 6
Missing values in, 6
wm5.txt, 7
Edit
Windows
Tools, 113
Excel, 5
Factor, 57
File
Export as Image, 9
Import data file, 8
Save As, 9
F tables, 9
Graphical user interface (GUI), 3
Help
Online Manuals, 2
Interactive, 20
Loess, 19, 72
Macro, 42
Macros, 4
Missing values, 6, 41
Normal tables, 9
Ordinary least squares, 23
Partial regression plot, 34
Polynomial regression, 53
Prediction, 27, 47
Proc import, 8
Random coefficient model, 63
REML, 63
Residuals, 30
SAS
Anova
SAS sums of squares, 36
append, 25
axis1, 15
axis2, 15
betainv(p, a, b), 27
boots, 105
call, 51
cards, 29
CDF, 9
chiprob, 47
cinv(p, df), 27
circle, 16
cl, 63
class, 22, 58, 102, 109
clb, 27
cli, 27, 45
close, 25
clparm, 59
contrast, 109
corr, 101
covtest, 63
create, 25
create table, 81
data, 15, 19–20, 47, 58, 86, 91
descending, 108
dist, 107
do, 91
eigen, 51
estimate, 56
finv(p, ndf, ddf), 27
gaminv(p, a), 27
goptions, 15–16
input, 29
into, 24
inv, 35
label, 76
link, 107
method, 63
mixed, 63
model, 16, 25, 27, 30, 33, 36, 45, 53–54,
60, 95, 101
nlin, 103
noint, 30
ods, 17, 85
ods output, 19
order, 58
output, 17–18, 26, 29–30, 33, 91
overlay, 17
parms, 100
partial, 34
plot, 16–18, 30, 88
predict, 101
predicted., 30
predicted, 55
print, 25
probit(p), 27
proc boxplot, 21
proc genmod, 107–113
proc glm, 26–28, 36, 41, 45–47, 49, 53–55,
58–60, 73, 85, 99, 109
proc gplot, 13–18, 54, 78, 101
proc iml, 24–25, 34–36, 52, 54–55, 57, 78,
92, 102, 104, 111–112
proc insight, 20, 71, 75, 113
proc loess, 17, 19, 81
proc logistic, 107
proc means, 22–23, 34
proc mi, 41
proc mianalyze, 41
proc mixed, 28, 63–65
proc model, 77