Arba Minch University College of Natural Sciences Department of Statistics
DEPARTMENT OF STATISTICS
2018/19
AMU, college of natural sciences, Department of Statistics: Lecture Notes on Introduction to SAS
1. INTRODUCTION
1.1. Meaning and functions of SAS
What is SAS?
The Statistical Analysis System (SAS) is software with an integrated set of programs for
manipulating, analyzing, and presenting data. At the heart of SAS is a programming language
composed of statements that specify how data are to be processed and analyzed. The statements
correspond to operations to be performed on the data or instructions about the analysis. A SAS
program consists of a sequence of SAS statements grouped together into blocks, referred to as
“steps.” There are two basic kinds of step: data steps and procedure (proc) steps. A data
step is used to prepare data for analysis. It creates a SAS data set and may reorganize the data
and modify it in the process. A proc step is used to perform a particular type of analysis, or
statistical test, on the data in a SAS data set. To program effectively using SAS, you need to
understand basic concepts about SAS programs and the SAS files that they process. In
particular, you need to be familiar with SAS data sets.
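The two kinds of step can be seen in a very small program. The following is only an illustrative sketch: the data set name scores, the variables id and score, and the data values are hypothetical and are not part of the course data.

```sas
/* DATA step: create a small SAS data set from in-line data */
data scores;                /* hypothetical data set name */
  input id score;           /* two numeric variables */
  datalines;
1 72
2 85
3 64
;
run;

/* PROC step: analyse the data set created above */
proc means data=scores;
  var score;
run;
```

The DATA step prepares the data; the PROC step then performs the analysis on the resulting data set.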
STARTING SAS: Double-click on the SAS icon on the desktop or use the start menu
and find SAS 9.2 (English) as shown below:
Quitting SAS
Statistical Computing-II Stat 2082
• Use Alt + F4
You will see the following warning upon exiting:
At the top, below the SAS title bar, is the menu bar. On the line below that is the tool bar with
the command bar at its left end. The tool bar consists of buttons that perform frequently used
commands. The command bar allows one to type in less frequently used commands. At the
bottom, the status line comprises a message area with the current directory and editor cursor
position at the right. Double-clicking on the current directory allows it to be changed.
Briefly, the main windows in SAS and their purposes are as follows.
i. Editor: The Editor window is for typing in, editing, and running programs. When a SAS
program is run, two types of output are generated: the log and the procedure output, and
these are displayed in the Log and Output windows.
ii. Log: The Log window shows the SAS statements that have been submitted together with
information about the execution of the program, including warning and error messages.
iii. Output: The Output window shows the printed results of any procedures. It is here that
the results of any statistical analyses are shown.
iv. Results: The Results window is effectively a graphical index to the Output window
useful for navigating around large amounts of procedure output. Right-clicking on a
procedure, or section of output, allows that portion of the output to be viewed, printed,
deleted, or saved to file.
v. Explorer: The Explorer window allows the contents of SAS data sets and libraries to be
examined interactively, by double-clicking on them.
Comments in SAS
Comments, with any number or type of characters, can be inserted anywhere in your program;
they are ignored by SAS during processing. There are two ways of writing comments in
SAS: a comment can start with an asterisk (*) and end with a semicolon (;), or start with a
slash asterisk (/*) and end with an asterisk slash (*/).
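Both comment styles are illustrated in the short sketch below. The data set demo and the variable x are hypothetical and serve only to give the comments something to annotate.

```sas
* This comment starts with an asterisk and ends at the first semicolon;

/* This comment is enclosed in slash-asterisk pairs.
   It may span several lines and may contain semicolons; like this. */

data demo;          /* hypothetical data set, for illustration only */
  x = 1;            /* a comment can follow a statement on the same line */
run;
```

Because an asterisk-style comment ends at the first semicolon, the slash-asterisk style is the safer choice when the comment text itself contains a semicolon.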
Errors in SAS statement
Each time a step is executed, SAS generates a log of the processing activities and the results of
the processing. The SAS log collects messages about the processing of SAS programs and about
any errors that occur. Hence, when your program fails to execute properly (e.g., no output is
displayed), check the LOG window as SAS will give an error message in the LOG window;
typically, the error message is specific enough for you to figure out and fix the error. It is good
practice to always check the LOG window for any errors; SAS sometimes executes program
statements even after an error is encountered (unless you check the LOG window, you will have
the impression that everything is fine).
Some Tips for Preventing and Correcting Errors
Before submitting a program:
1. Check that each statement ends with a semicolon.
2. Check that all opening and closing quotes match.
3. Check any statement that does not begin with a keyword (blue, or navy blue) or a variable
name (black).
1.5. Basic Building Blocks in writing SAS program
As discussed in section 1.1 above, a SAS program consists of a sequence of SAS statements
grouped together into blocks, referred to as “steps.” There are two basic kinds of step, data
steps and proc steps; each is described below together with its general syntax:
i. DATA step
It creates the SAS data set, which is then passed to the PROC step for processing. It begins with
the keyword DATA and is used to create, read, or modify a SAS data set. Before data can be
analyzed in SAS, they need to be read into a SAS data set. Creating a SAS data set for
subsequent analysis is the primary function of the data step. The data can be “raw” data or come
from a previously created SAS data set. A data step is also used to manipulate, or reorganize the
data. This can range from relatively simple operations (e.g., transforming variables) to more
complex restructuring of the data. In many practical situations, organizing and preprocessing the
data takes up a large portion of the overall time and effort. The power and flexibility of SAS for
such data manipulation is one of its great strengths. We begin by describing how to create SAS
data sets from raw data as an example in data step.
General Structure/syntax of a DATA Step is:
DATA dataname ;
INPUT or INFILE ;
DATALINES or CARDS ;
... ... ...
... ... ...
... ... ...
;
RUN ;
Note that a DATA step uses up to four different SAS statements, namely DATA,
INPUT, INFILE, and CARDS (or DATALINES). Each of these statements is described as
follows:
DATA Statement
The DATA statement is usually the first statement in a SAS job. It begins with the word
DATA and is followed by a name that you choose for the data set. Data set names must begin
with a letter or an underscore, and can be no more than 32 characters in length.
INPUT Statement
• Each line of data in a SAS program can be an observation.
• Each value in this observation represents a variable, and the INPUT statement is used to
name these variables.
• The INPUT statement follows the DATA statement.
As long as one blank space is inserted between each variable value in an observation, SAS
reads them as separate values.
CARDS or DATALINES Statement
• When data is entered as an internal part of a SAS program, the CARDS statement
immediately precedes the data lines.
• It is simply entered as CARDS and tells the SAS system that the data follows.
• Note that when a CARDS statement is used, the line length cannot exceed 80 characters.
INFILE Statement
• Data may also be imported from a disk or tape into your SAS program.
• In this case, both the computer operating system and SAS must know where the data can be
found.
• The INFILE statement goes before the INPUT statement.
• It consists of INFILE followed by the file reference name and this identifies the name of the
file to be used. For example, if you are using a file called STUDENTS, the statement would
be: INFILE STUDENTS;
Note that: The CARDS and INFILE statements are not used together and when using the
INFILE statement, the computer's operating system must be told where the data can be
located.
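One way of telling SAS where the file is located is a FILENAME statement, which associates a file reference name (fileref) with a physical file. The sketch below assumes a hypothetical path, data set name, and variable names; only the fileref STUDENTS comes from the text above.

```sas
/* Associate the fileref STUDENTS with an external file.
   The path below is hypothetical; replace it with your own. */
filename students 'C:\mydata\students.txt';

data studentdata;            /* hypothetical data set name */
  infile students;           /* refer to the file by its fileref */
  input name $ age;          /* hypothetical variables */
run;
```

Alternatively, the quoted path can be written directly in the INFILE statement, as in the later examples of these notes.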
Example 1: Create a SAS data set from the weights of a random sample of five patients, measured
before and after medication at a given hospital, as shown in the table below:
Ptnt 1 2 3 4 5
Wgt1 81 71 65 66 59
Wgt2 85 76 79 67 62
/* SAS program */
DATA weight;
INPUT ptnt wgt1 wgt2;
DATALINES;
1 81 85
2 71 76
3 65 79
4 66 67
5 59 62
;
RUN;
This DATA step creates a data set called “weight” (weight.sas7bdat).
• The keyword INPUT gives the names of the 3 variables in the data set.
• The keyword DATALINES (or CARDS) indicates the start of the data values. There are 5 data
lines; thus, there will be 5 observations in this data set.
Note: There are NO semicolons at the end of each line of the data values, but there is a single
semicolon after all the data lines.
• The keyword RUN tells SAS to execute the block of statements.
the data set. A second form of variable list can be used where a set of variables have names of
the form score1, score2, … score10. That is, there are ten variables with the root score in
common and ending in the digits 1 to 10. In this case, they can be referred to by the variable list
score1 - score10 and do not need to be contiguous in the data set. In general, SAS can handle
up to 32,767 variables in a single data set, however, the number of observations is limited only
by your computer’s capacity.
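A numbered variable list of this kind can be used wherever a list of variables is expected. The following sketch assumes a hypothetical data set quiz with made-up values; the names score1 to score10 follow the text above.

```sas
data quiz;                          /* hypothetical data set */
  input id score1-score10;          /* declares score1, score2, ..., score10 */
  total = sum(of score1-score10);   /* the same list used inside a function */
  datalines;
1 5 6 7 8 9 4 5 6 7 8
2 9 9 8 7 6 5 4 3 2 1
;
run;

proc print data=quiz;
  var id score1-score10 total;      /* and in a VAR statement */
run;
```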
Rules for SAS Variable Names
• Names can be 32 characters or fewer in length.
• Names must start with an English letter (A to Z) or an underscore (_). Subsequent characters
can be letters, numeric digits (0 to 9), or underscores.
• Names can contain upper- and/or lowercase letters.
• Names cannot contain blanks and other special characters such as %, $, !, #, and @.
• Certain names are reserved for use by SAS, e.g. _N_, _TYPE_ and _NAME_. Similarly,
logical operators such as gt, lt, and, eq should not be used as variable names.
Examples of illegal names:
1000seedwt Does not begin with a letter or underscore
Bodyfat% Contains an illegal character
Weight of cow Contains blank spaces
Missing Data
Some entries or fields in your data set may be missing. In a SAS data set, the representation of a
missing entry depends on the type of variable: a missing numeric value is
represented by a single period (.), whereas a missing character value is represented by a blank.
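As a small illustration, the sketch below reads two records, one of which has a missing numeric value, and flags it. The data set name misdemo and the variable flag are hypothetical; the two records are taken from the weight club data used later in these notes.

```sas
data misdemo;                          /* hypothetical data set name */
  input idno team $ weightnow;
  /* a missing numeric value can be tested against a single period */
  if weightnow = . then flag = 'missing';
  else flag = 'present';
  datalines;
1221 yellow .
1095 blue 127
;
run;

proc print data=misdemo;
run;
```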
1.8. Reading Data into SAS
Data can be supplied to SAS by typing them into the program (practical only for small data sets), by reading an
external data file with the SAS statement infile 'file location'; or by writing a data-generating
program (useful for simulations).
a) Reading Raw Data into SAS
The data are typed directly into the PROGRAM EDITOR and read into SAS using the
DATALINES or CARDS statement in a DATA step. The values can also be copied
and pasted from elsewhere (e.g., a website, a text file, a Word document). This approach is useful only
for small to medium-sized data sets and can get messy when working with large data sets. There
are several styles of INPUT statement that can be used to read raw data. The most common
INPUT specifications are as follows.
List Input specification
The variables are listed after the INPUT keyword in the order in which they appear in the file, and the data are read in
that order. List input is used when the raw data values are separated by spaces.
All missing values must be indicated by a period. If a variable is of character type, place a $ after the
variable name; for instance, a list input specification looks like: INPUT Name $ City $ Age
Height Weight Sex $; This is the most common way of creating SAS data sets.
Example 1: list input specification method of creating SAS data set
DATA weight;
INPUT ptnt name $ wgt1 wgt2;
DATALINES;
1 Shaw 101 95
2 Serrano 91 96
3 Nance 95 89
4 Sinha 86 87
5 Henderson 89 82
;
RUN;
Example 2: Write and save the following data into a text file (wgtclub1.txt) in Notepad. Read in
these data using the INFILE statement to create a SAS data set called “wghtclub”, with the
following 4 variables: idno, team, startweight, weightnow. Also, print out the resulting SAS data
set.
1023 red 189 165
1049 yellow 145 124
1219 red 210 192
1246 yellow 194 177
1078 red 127 118
1221 yellow 220 .
1095 blue 135 127
1157 green 155 141
data wghtclub;
infile 'D:\all in ones\Computing-II\wgtclub1.txt';
input idno team $ startweight weightnow;
run;
Proc print data=wghtclub; run;
b) Creating SAS data sets by importing Excel data saved as a CSV file
Example 1: The following data refers to the seasonal wheat yield per acre at eight different
locations, all having roughly the same quality soil. The data relate the wheat yield at each
location to the seasonal amount of rainfall and the amount of fertilizer used per acre.
RF(inches) 15.4 18.2 17.6 18.4 24 25.2 30.3 31
Fertilizer amt(pound/acre) 100 85 95 140 150 100 120 80
Wheat yield 46.6 45.7 50.4 66.5 82.1 63.7 75.8 58.9
• Write and save the above data as csv file (wheat.csv) in excel. Read in these data using the
INFILE statement to create a SAS data set called “wheat”, with the following three variables:
RF, Amtfert, wheatyield. Also, print out the resulting SAS data set.
/*SAS Program to import csv file after entering in excel and saving as csv file*/
DATA wheat;
INFILE "D:\all in ones\Computing-II\practice SAS\wheat.csv" dlm=',';
INPUT RF Amtfert wheatyield;
RUN;
Proc print data=wheat; run;
❖ To read data from a SAS data set, rather than from a raw data file, the set statement is used in
place of the infile and input statements. The statement
data wgtclub2;
set wghtclub;
run;
This creates a new SAS data set, wgtclub2, by reading in the data from wghtclub. The new data
set may also have the same name as the old one, for example if the data statement above were replaced
with data wghtclub;. This would normally be done in a data step that also modifies the data in
some way.
To open the SAS data set itself, follow these steps:
1. Go to the EXPLORER window and double-click on Libraries.
2. Double-click on the library in which the file was created (this is either the specific libref you
assigned for the session or WORK).
3. Once you are within the pertinent library, double-click on the SAS data set you want to
check. This will open the SAS data set in a spreadsheet-type window.
1.9. SAS Files and Data Libraries
Every SAS file is stored in a SAS library, which is a collection of SAS files. A SAS data library
is the highest level of organization for information within SAS. SAS libraries have different
implementations depending on your operating environment, but a library usually corresponds to
the level of organization that your host operating system uses to access and store files.
Depending on the library name that you use when you create a file, you can store SAS files
temporarily or permanently. When you create a SAS data set, it is stored temporarily in a SAS
Data Library. You can store it permanently by assigning a particular location on your computer
to a SAS Data Library.
Storing Temporary SAS Files
If you don't specify a library name when you create a file (or if you specify the library name
Work), the file is stored in the temporary SAS data library. When you end the session, the
temporary library and all of its files are deleted. In general, temporary SAS Files:
– exist only during the current SAS session
– are stored in a special SAS library called WORK
– are automatically erased when you exit SAS
– have one-level names. This is the default case.
Storing files permanently:
To store files permanently in a SAS data library, you specify a library name other than the
default library name Work. To reference a permanent SAS data set in your SAS programs, you
use a two-level name: libref.filename. In the two-level name, libref is the name of the SAS data
library that contains the file, and filename is the name of the file itself. A period separates the
libref and filename.
Figure 3: A SAS file with a two-level name of the form libref.filename
For example, in our sample program, Clinic.Admit is the two-level name for the SAS data set
Admit, which is stored in the library named Clinic. When the sample program creates Clinic.Admit2, it
stores the new Admit2 data set permanently in the SAS library Clinic. Hence, all SAS data sets
have a two-level name of the form libref.filename. Level 1 is the libref, which points
to a particular location, and level 2 is the actual filename.
Note that: if the libref is not explicitly stated, by default a temporary library (libref WORK)
will be used. SAS data sets in the WORK library will NOT be permanently saved and will be
erased at the end of the current session.
Example 2: The following DATA statements are equivalent: Create temporary SAS Datasets
DATA distance;
DATA WORK.distance;
• Although the first version does not use a two-level name, SAS automatically assigns the two-
level name WORK.distance.
• If you want this data set to be stored permanently in the folder c:\student, then use:
LIBNAME cls 'c:\student';
DATA cls.distance;
Example 3: Creating permanent SAS Datasets
libname db "D:\all in ones\Computing-II";
data db.wghtclub;
set wghtclub;
run;
In the sample SAS code, the libname statement assigns the libref db to the directory
'D:\all in ones\Computing-II'. Thereafter, a SAS data set name prefixed with 'db.' refers to a data
set stored in that directory. When used on a data statement, the effect is to create a SAS data set
in that directory. The data step reads data from the temporary SAS data set wghtclub and stores it
in a permanent data set of the same name.
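Once stored, a permanent data set can be used in a later session after the libref has been assigned again, since a libref lasts only for the current session. The following is a minimal sketch that reuses the libref db and the directory from Example 3; it assumes that the directory exists on your computer and already contains the wghtclub data set.

```sas
/* Re-assign the libref at the start of the new session */
libname db "D:\all in ones\Computing-II";

/* Use the permanent data set through its two-level name */
proc print data=db.wghtclub;
run;

/* List the SAS data sets stored in the library */
proc contents data=db._all_ nods;
run;
```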
2. DATA MANIPULATION USING SAS
Data manipulation involves creating, formatting and retrieving data sets by SAS. These can be
accomplished through data entered internally or externally. Internal data are lines embedded in
the SAS program, while external data are contained in a separate file. SAS can subset, split,
merge, concatenate, transpose, and aggregate data sets into formats appropriate for subsequent
analysis. Hence, an existing SAS data set can be modified by, for instance, creating new
variables from the original variables, removing or renaming some of the variables, or deleting some
observations from the data set. We first create the original SAS data set and then create a
new one from it, with the necessary modifications, using the SET statement:
SET name1 name2 . . . ;
The SET statement tells SAS to read from the enumerated SAS data files name1 name2 . . . and
uses them to build a new SAS data set. Some possible actions for data modification and their
corresponding SAS statements include:
– keep a selection of variables:
KEEP var1 var2 ... ;
– delete a selection of variables
DROP var1 var2 ... ;
– rename a variable
RENAME oldname=newname ;
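These statements can be combined in a single DATA step. The sketch below assumes that the data set weight from Example 1 of section 1.5 (variables ptnt, wgt1, wgt2) exists in the current session; the names weight2, change, before, and after are hypothetical.

```sas
data weight2;                          /* hypothetical new data set */
  set weight;                          /* read the existing data set */
  change = wgt2 - wgt1;                /* new variable built from the originals */
  rename wgt1 = before wgt2 = after;   /* new names apply to the output data set */
  drop ptnt;                           /* leave the patient number out */
run;

proc print data=weight2;
run;
```

Note that RENAME and DROP take effect when the observation is written out, so inside the DATA step the variables are still referred to by their original names.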
the merge list that contains the variable last. To merge two or more files, all of the files must
have a common variable, such as an ID number. To accomplish a one-to-one
match merge, the files to be merged must first be sorted by the common variable using
PROC SORT. The observations are then joined by using the MERGE
statement, together with a BY statement, in a separate DATA step that creates a new file. The IN= data set option enables you
to track the source file of each merged record, i.e. to keep only the observations from the desired file. Make sure
you have the records of your choice from both files. An example will make this clearer.
Example 1 : one-to-one match merging files in SAS
data baskini;
input id $ gender $ height;
cards;
001 M 72
002 F 68
003 F 74
004 M 69
005 F 67
;
run;
data baskage;
input id $ age;
cards;
001 19
002 17
003 18
004 20
005 17
;
run;
To merge above two Datasets
data newbask;
merge baskini baskage;
by id;
run;
proc print;
run;
SAS Output of merged datasets
BASKET Data Set
Obs id gender height age
1 001 M 72 19
2 002 F 68 17
3 003 F 74 18
4 004 M 69 20
5 005 F 67 17
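The sorting step and the IN= option described above can be sketched as follows, using the same two data sets. The output name bothbask and the flag variables inini and inage are hypothetical.

```sas
/* Sort both files by the common variable before merging */
proc sort data=baskini; by id; run;
proc sort data=baskage; by id; run;

data bothbask;                              /* hypothetical output name */
  merge baskini (in=inini) baskage (in=inage);
  by id;
  /* inini and inage are temporary 0/1 flags showing which file
     contributed to the current observation */
  if inini and inage;                       /* keep ids present in both files */
run;

proc print data=bothbask;
run;
```

Replacing the subsetting condition with if inini; would instead keep every id that appears in baskini, whether or not it has a match in baskage.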
Example 2: demographic details from a questionnaire may need to be combined with the results
of laboratory tests. To deal with this situation, the data are read into separate SAS data sets and
then combined using a merge with a unique subject identifier as a key. Assuming the data have
been read into two data sets, demographics and labtests, and that both data sets contain the
subject identifier idnumber, they can be combined as follows:
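A minimal sketch of such a merge is given here. It assumes that the data sets demographics and labtests already exist, as described above; the name combined for the merged data set is hypothetical.

```sas
/* Both data sets must be sorted by the key before a match merge */
proc sort data=demographics; by idnumber; run;
proc sort data=labtests; by idnumber; run;

data combined;                  /* hypothetical name for the merged data set */
  merge demographics labtests;
  by idnumber;
run;
```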
RELBIN = 1 if RELIABLE <= 3
RELBIN = 0 otherwise
• The corresponding SAS code would then be:
data cars2;
set cars;
if reliable <= 3 then relbin=1;
else relbin=0;
run;
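One point to keep in mind with this kind of recoding is that SAS treats a missing numeric value as smaller than any number, so the condition reliable <= 3 is also true when reliable is missing. The following variant is a sketch of one way to keep such observations missing; whether the cars data actually contain missing values of reliable is not stated here.

```sas
data cars2;
  set cars;
  if reliable = . then relbin = .;          /* keep missing values missing */
  else if reliable <= 3 then relbin = 1;
  else relbin = 0;
run;
```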
SAS Output
➢ PROC PRINT: prints the observations in a SAS data set, using all or some variables.
➢ PROC SORT: sorts the observations in a SAS data set by one or more variables.
➢ PROC CONTENTS: provides a description of a SAS data set
❑ PROC PRINT
The print procedure prints the observations in a SAS data set, using all or some variables.
proc print data=... ;
run;
• If you do not want all the variables in the data set to be printed, you can use the var
statement to name the variables to be printed.
❑ Example (basket data)
proc print data=basket;
var match points;
run;
❑ PROC SORT
• The sort procedure sorts the observations in a SAS data set by one or more variables.
proc sort data=... out=...;
by ...;
run;
• Specify the name of the data set to be sorted with the data= option.
• The out= option puts the newly sorted version of the data in a new data set. Without
this option, the original data set is replaced by the sorted version.
• proc sort sorts the observations by the variables listed in the by statement.
❑ Example (basket data)
proc sort data=basket out=basketsort;
by gender;
run;
proc print data=basketsort;
run;
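Sorting by more than one variable, or in descending order, follows the same pattern. The sketch below uses the merged data set newbask from the merge example in chapter 2 (variables id, gender, height, age); the output name newbasksort is hypothetical.

```sas
proc sort data=newbask out=newbasksort;
  by gender descending height;   /* gender ascending, then height from tallest to shortest */
run;

proc print data=newbasksort;
run;
```

The keyword descending applies only to the variable that immediately follows it.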
❑ PROC CONTENTS
The contents procedure provides a description of a SAS data set. It gives information on a SAS
data set, including the name of the data set, the number of observations, the names of variables,
the type of each variable (numeric-num or character-char), and any labels or formats that have
been assigned to variables.
proc contents data=...;
run;
• You just need to specify the name of the data set you want to describe.
❑ Example (basket data)
proc contents data=basket;
run;
central value). On the other hand qualitatively measured variables are summarized by
counts/proportions.
a) FREQ procedure to tabulate data for qualitative variables
The FREQ procedure produces one-way to n-way frequency and contingency (cross-tabulation)
tables. In addition to summarizing data in tabular form, for contingency tables PROC
FREQ can compute various statistics to examine the relationship between two classification
variables. For n-way tables, PROC FREQ provides stratified analysis by computing statistics
across, as well as within, strata. To summarize and display data in a one-way
table, use the TABLES statement to specify the variables to be included in the frequency counts.
These are typically variables that have a limited number of distinct values. The general form of a
PROC FREQ step with a TABLES statement to produce a one-way table is:
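A minimal sketch of such a step is shown here. It assumes that the data set wghtclub created in section 1.8 is available in the current session and tabulates its character variable team.

```sas
/* General pattern:
   proc freq data=SAS-data-set;
     tables variable(s);
   run;                            */
proc freq data=wghtclub;
  tables team;                     /* one-way frequency table of team */
run;
```

Listing two variables joined by an asterisk in the TABLES statement, for example tables a*b;, requests a two-way (cross-tabulation) table instead.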
In a cross tabular report, the values of the first variable in the TABLES statement form the rows
of the frequency table and the values of the second variable form the columns.
b) PROC FREQ step and contingency table
i. PROC FREQ with raw data to calculate chi-square statistic
Recall the discussion of PROC FREQ in the section above, where we
mentioned that it has additional features for carrying out the chi-square test. Although this
section is about descriptive statistics, those additional features of the PROC FREQ step
are discussed here. For example, suppose that for 100 observations
in a SAS data set named one we have a categorical variable sex taking the values 0=male and
1=female and a categorical variable religion taking the values 0=Orthodox, 1=Protestant, and
2=Muslim. Then, the statements
proc freq data=one;
tables sex*religion;
run;
record the counts in the six cells of a table, with sex indexing the rows and religion indexing the
columns. In addition to tabulating the data in a two-way table, the association between the two variables
can be assessed with a chi-square test of association, which is requested with
the chisq option in the tables statement. Hence, the following SAS statements produce the results
of the chi-square test of association between those two variables.
proc freq data=one;
tables sex*religion/chisq;
run;
Example 1: Consider the following data on the gender and satisfaction of 21 randomly selected
second-year students of the AMU Department of Statistics, where f = female and m = male for gender
and Y = yes and N = no for satisfaction with the department.
Gender f m m m m f f m f m f m m f m f f m m f f
Satisfaction Y Y Y Y N Y N Y Y Y N N N Y Y Y Y Y Y N Y
Then: a) enter the data in SAS; b) describe the data using appropriate summary statistics; c) test, at the 5% level of
significance, whether there is any association between the gender of students and satisfaction with the
department.
/* SAS program for creating the SAS data set g2stat */
Data g2stat;
Input Gender $ Satisfaction $;
Datalines;
f Y
m Y
m Y
m Y
m N
f Y
f N
m Y
f Y
m Y
f N
m N
m N
f Y
m Y
f Y
f Y
m Y
m Y
f N
f Y
;
Run;
/* SAS program for cross tabulate data*/
proc freq data=g2stat;
tables Gender*Satisfaction;
run;
/* SAS program to perform chi-square test of association*/
proc freq data=g2stat;
tables Gender*Satisfaction/chisq expected;
run;
ii. PROC FREQ step with tabulated data to calculate chi-square statistic
If the data come to you already tabulated, then you must use the weight statement in proc freq
together with the chisq option in the tables statement to compute the chi-square statistic to carry
out chi-square test of association as shown in examples below.
Example 2: A random sample of 17096 households from a certain city XYZ is asked for their opinion on early
marriage. The survey results, by sex group, are shown below:
Opinion Men Women
No 5550 8232
Yes 1630 1684
Test the hypothesis that opinion on early marriage is independent of the sex group of
households at the 5% level.
/* SAS program to conduct a chi-square test of association */
data one;
input Opinion $ gender $ count;
cards;
yes men 1630
yes women 1684
no men 5550
no women 8232
;
run;
proc freq data=one;
weight count;
tables Opinion*gender / chisq;
run;
Example 3: A random sample of statistics department students at AMU is asked their opinion on
a proposed core curriculum change; the results of the survey are shown below:
Class Favoring Opposing
Freshman 140 100
Junior 80 130
Senior 70 110
Enter the data in SAS and test the hypothesis that the opinion on the changes is independent of
class standing at 5% level.
/* SAS program for cross tabulate data*/
data retabulate;
input class $ Opinion$ count;
cards;
Freshman Favoring 140
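The listing above breaks off after the first data line. A complete version might look like the following sketch: the counts are taken from the table in this example, while the PROC FREQ step (a weight statement together with the chisq option) is an assumption based on the approach described at the start of this subsection.

```sas
data retabulate;
  input class $ Opinion $ count;
  cards;
Freshman Favoring 140
Freshman Opposing 100
Junior Favoring 80
Junior Opposing 130
Senior Favoring 70
Senior Opposing 110
;
run;

proc freq data=retabulate;
  weight count;                          /* each record stands for 'count' students */
  tables class*Opinion / chisq expected; /* chi-square test and expected counts */
run;
```

The hypothesis of independence is rejected at the 5% level if the p-value reported for the chi-square statistic is below 0.05.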
By default, PROC MEANS analyzes every numeric variable in the SAS data set, prints the
statistics N, MEAN, STD, MIN, and MAX and excludes missing values before calculating
statistics. The VAR statement identifies the analysis variables and the CLASS statement in the
MEANS procedure groups the observations of the SAS data set for analysis.
PROC MEANS data=SAS-dataset;
Var SAS quantitative variables;
Class SAS qualitative variables;
Run;
As mentioned, the default output of a PROC MEANS step is the number of observations, mean, standard deviation,
minimum, and maximum of each variable; other statistics can be requested through the
options of PROC MEANS. For example, if we require the mean, standard deviation, median,
coefficient of variation, skewness, and kurtosis, then we can write:
PROC MEANS mean std median cv skewness kurtosis data=SAS-dataset;
Var SAS variables;
Run;
To control the maximum number of decimal places for PROC MEANS to use in printing results,
use the MAXDEC= option in the PROC MEANS statement. General form of the PROC MEANS
statement with the MAXDEC= option is:
PROC MEANS data=SAS-dataset MAXDEC=number;
Run;
Example 1: Consider the following data on the height (in cm), weight (in kg), and gender of
a random sample of 10 patients from the Arba Minch general hospital emergency room.
Height 173 179 197 195 173 184 162 169 164 168
Weight 57 58 62 84 64 74 57 55 56 60
Gender F F F M F M F F M M
a) Create SAS dataset named ptdata from the above variables
Data ptdata; /* SAS program to create a temporary SAS data set */
Input Height Weight Gender $;
Datalines;
173 57 F
179 58 F
167 62 F
195 84 M
173 64 F
184 74 M
162 57 F
169 55 F
164 56 M
168 60 M
;
run;
b) Compute descriptive summary statistics for weight and height of patients
/*Summarize the patient data*/
Proc Means data=ptdata;
var Height Weight;
Run;
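The options described above can be combined in a single step. The following is a minimal sketch, assuming the ptdata data set created in part (a); the particular statistics and MAXDEC=2 are chosen for illustration only.

```sas
/* Selected statistics for height and weight, by gender, to 2 decimals */
proc means data=ptdata mean std median cv maxdec=2;
   class Gender;        /* one set of statistics per gender group */
   var Height Weight;   /* analysis variables */
run;
```

Adding the CLASS statement gives one row of statistics per gender for each analysis variable, which is what the by-gender interpretation needs.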
The MEANS Procedure
Interpret the above descriptive statistics of the weights and heights of patients by gender group.
b) PROC SUMMARY Steps
Descriptive statistics for a given data set can also be obtained with PROC SUMMARY. PROC
SUMMARY uses the same syntax as PROC MEANS, but it does not produce a printed report by
default. To produce the report, add the PRINT option as shown below. Without the PRINT option,
PROC SUMMARY displays nothing, and an OUTPUT statement is needed to save the summary to a
SAS data set.
PROC SUMMARY data=SAS-dataset print;
Run;
As in the PROC MEANS step, if we want the mean, standard deviation, median, coefficient of
variation, skewness and kurtosis, we can list them in the PROC SUMMARY statement:
PROC SUMMARY mean std median cv skewness kurtosis data=SAS-dataset print;
Var SAS variables;
Run;
The VAR statement identifies the analysis variables, and the CLASS statement can also be
included in PROC SUMMARY to group the observations of the SAS data set for analysis.
Example 1: Consider the above data on the height, weight and gender group of patients for a
random sample of 10 from Arba Minch general hospital emergency room.
a) Compute descriptive summary statistics for weight and height of patients
/*Summarize the patient data*/
Proc summary data=ptdata print;
var Height Weight;
Run;
The SUMMARY Procedure
b) Compute the descriptive summary statistics for weight and height of patients by gender
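A minimal sketch for part (b), assuming ptdata is still available in the WORK library. The CLASS statement requests the statistics separately for each gender. The output data set name ptsummary and the names mean_height and mean_weight are hypothetical, chosen only for illustration.

```sas
/* Summary statistics of height and weight by gender */
proc summary data=ptdata print mean std min max;
   class Gender;                   /* grouping variable */
   var Height Weight;              /* analysis variables */
   output out=ptsummary            /* also save the group means */
          mean=mean_height mean_weight;
run;
```

The PRINT option displays the report; the OUTPUT statement additionally stores the group means so that a later step can use them.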
Furthermore, PROC UNIVARIATE can be used to display the data as a histogram, as shown.
Proc univariate data=ptdata;
var weight;
histogram/normal;
run;
This produces, as part of the output, a histogram with a fitted normal curve.
c) Bar and Pie Charts/PROC GCHART
Sometimes it is useful to show frequencies in a graphical display. With SAS, you have several
options: First, there is SAS procedure called GCHART, which is part of the SAS/GRAPH
collection of procedures. PROC GCHART is used to produce vertical or horizontal bar charts
and pie charts of categorical variables. General form of the PROC GCHART statement:
PROC GCHART DATA=SAS-dataset;
HBAR chart-variable . . . </options>; /* Produce a horizontal bar chart*/
VBAR chart-variable . . . </options>; /*Produce a vertical bar chart*/
PIE chart-variable . . . </options>; /* Produce a pie chart*/
Run;
Example 1: Consider the above data on the height(in cm), weight(in kg) and gender group of
patients for a random sample of 10 from Arba Minch general hospital emergency room.
Then, construct a pie chart and a simple bar chart for the gender of patients.
PROC GCHART DATA=ptdata;
pie gender; /* pie chart of gender */
vbar gender; /* vertical bar chart of gender */
run;
quit;
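proc sort data=ptdata; by gender; run; /* PROC BOXPLOT expects data sorted by the group variable */
proc boxplot data=ptdata;
plot weight*gender;
title "Side-by-Side Boxplot Using Proc Boxplot";
run;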
Example 1: For the data on the height(in cm), weight(in kg) and gender group of patients for a
random sample of 10 from Arba Minch general hospital emergency room, construct a simple
scatter plot between weight and height from a SAS data set called ptdata
symbol1 value=dot; /* plot each observation as a filled dot */
proc gplot data=ptdata;
plot weight*height;
run;
quit;
This produces the simple scatter plot of weight against height.
Interpret the relationship between weight and height of the patients from the scatter plot.
Create a SAS data set, then test whether the data support the null hypothesis that the mean
expenditure is 101.75 against the alternative that it differs from 101.75.
data studdata;
input expenditure;
cards;
140
125
150
124
143
170
125
94
127
53
;
Run;
proc ttest data=studdata h0=101.75;
var expenditure;
run;
The TTEST Procedure
Statistics
Variable      N    Lower CL Mean   Mean    Upper CL Mean   Lower CL Std Dev   Std Dev   Upper CL Std Dev   Std Err   Minimum   Maximum
expenditure   10   102.04          125.1   148.16          22.169             32.23     58.839             10.192    53        170
T-Tests
Variable      DF   t Value   Pr > |t|
expenditure   9    2.29      0.0477
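The t value and p-value in the table can be checked by hand from the reported mean and standard error. The data step below is a small sketch of that arithmetic; the variable names are illustrative and not part of the original program.

```sas
/* Recompute the one-sample t statistic from the PROC TTEST output */
data _null_;
   xbar = 125.1;     /* sample mean from the output above */
   mu0  = 101.75;    /* hypothesized mean (H0= option) */
   se   = 10.192;    /* standard error from the output above */
   df   = 9;         /* n - 1 */
   t = (xbar - mu0) / se;
   p = 2 * (1 - probt(abs(t), df));   /* two-sided p-value */
   put t=6.2 p=6.4;                   /* results are written to the SAS log */
run;
```

Since the reported p-value of 0.0477 is below 0.05, the null hypothesis is rejected at the 5% level.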
;
run;
proc print data=oneprop;run;
/* SAS program to compute test statistic*/
proc freq data=oneprop order=data;
tables handtype/binomial(p=0.2);
weight cases;
run;
Create a SAS data set and test, at the 5% level of significance, whether the mean change in
blood pressure differs between the two types of drug.
/* SAS program to create SAS dataset*/
data indettest;
input changeBP Drugtype;
cards;
18.2 1
20.1 1
17.6 1
16.8 1
18.8 1
19.7 1
19.1 1
19.4 2
22.7 2
19.1 2
18.4 2
25.9 2
20.4 2
21.7 2
;
run;
/* SAS program to compute independent sample t test*/
proc ttest data= indettest;
var changeBP;
class Drugtype;
run;
Interpret the above SAS output with respect to equality of means and variances of change in
blood pressure for the two drugs.
ii. In the case of dependent samples/ PROC TTEST
The paired t-test compares the means of the same group at different times (e.g. before and after
an event); paired samples occur when there are two measurements on the same experimental unit.
In SAS, this can be done with PROC TTEST and a PAIRED statement that lists the pair of variables
joined with the ' * ' symbol.
Example 1: In 10 women the systolic blood pressure (mm Hg) is measured at the beginning of a
clinical trial. Afterwards they have a fertility treatment with hormones. During this treatment
they are again measured.
Create SAS dataset and test whether the mean systolic blood pressure of women differs between
the two measurements at 5% level of significance.
/*SAS program to create SAS data set*/
data pairedttest;
input before during ;
cards;
115 128
112 115
107 106
119 128
115 122
138 145
126 132
105 109
104 102
115 117
;
run;
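The paired test described above can be sketched as follows, assuming the pairedttest data set just created. ALPHA=0.05 is written out only to make the 5% level explicit; it is also the default.

```sas
/* Paired t-test of systolic blood pressure: before vs. during treatment */
proc ttest data=pairedttest alpha=0.05;
   paired before*during;   /* tests H0: mean(before - during) = 0 */
run;
```

PROC TTEST reports the mean difference, its confidence interval, and the t statistic with n - 1 = 9 degrees of freedom.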
Interpret the above SAS output with respect to equality of means of systolic blood pressure of
women between the two measurements.
f) Inference about Two Population Proportions/Proc FREQ
PROC FREQ can be used to test the equality of two population proportions. Given sample
proportions from two independent random samples, a significance test of H0: p1 = p2 against
HA: p1 < p2, p1 > p2, or p1 ≠ p2 can be carried out with PROC FREQ in SAS.
Example 1: In the year 2001, a poll of 600 people found that 250 supported early marriage. A
2003 poll of 500 found 250 in support. Do a test of significance to see whether the difference in
proportions is statistically significant.
/* SAS program to create SAS dataset from the given information*/
/* row: 1 = 2001 poll, 2 = 2003 poll; column: 1 = support, 2 = do not support */
data twoprop;
input row column count;
cards;
1 1 250
1 2 350
2 1 250
2 2 250
;
Run;
/* SAS program to perform a test and 95% CI for difference in proportions*/
proc freq data= twoprop;
tables row*column/riskdiff chisq;
weight count;
run;
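As a cross-check on the PROC FREQ output, the pooled two-proportion z statistic can be computed directly from the poll counts. This is a sketch of the arithmetic only; the variable names are illustrative.

```sas
/* Pooled z test for H0: p1 = p2, using the counts from the two polls */
data _null_;
   x1 = 250; n1 = 600;    /* 2001 poll: supporters, sample size */
   x2 = 250; n2 = 500;    /* 2003 poll: supporters, sample size */
   p1 = x1 / n1;
   p2 = x2 / n2;
   pooled = (x1 + x2) / (n1 + n2);                     /* pooled proportion under H0 */
   se = sqrt(pooled * (1 - pooled) * (1/n1 + 1/n2));   /* standard error under H0 */
   z = (p1 - p2) / se;
   pvalue = 2 * (1 - probnorm(abs(z)));                /* two-sided p-value */
   put p1= p2= z= pvalue=;                             /* written to the SAS log */
run;
```

For a 2 x 2 table, the square of this z statistic equals the Pearson chi-square statistic requested by the CHISQ option, so the two approaches lead to the same two-sided conclusion.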
Interpret the above SAS output with respect to equality of proportions of people who supported
early marriage in the two years.
g) Correlation and linear regression analysis using SAS
a) Correlation analysis using SAS / PROC CORR
While a scatter plot is a convenient graphical method for assessing whether there is any
relationship between two variables, we would also like to assess the relationship numerically.
The correlation coefficient provides a numerical summary of the degree to which a linear
relationship exists between two quantitative variables, and it can be calculated using PROC
CORR in SAS. To produce correlations only for specific variables, list them in a VAR statement
within the PROC CORR step.
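A minimal sketch of a PROC CORR step with a VAR statement. The data set name mlrdata and the variable names Income, Age and Hourworked are taken from the regression programs later in these notes and are assumed to apply to the example that follows.

```sas
/* Pearson correlations among selected variables */
proc corr data=mlrdata;
   var Income Age Hourworked;   /* only these variables are correlated */
run;
```

By default, PROC CORR prints each Pearson correlation together with the p-value for testing that the population correlation is zero.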
Example 1: Consider the following data collected to investigate the relationship between income,
age and hours worked per day for households from a given town XYZ.
Test the significance of the correlation coefficient and interpret it from the above SAS output.
In the above SAS program, if the var statement is omitted, then correlations are computed
between all numeric variables in the data set. If you want to produce correlations only for
specific combinations, then include variables to be analyzed under var statement as above.
(c) /* To compute correlation matrix for the data*/
proc corr data=mlrdata;
run;
Test the significance of the correlation coefficients between income and age and between income
and hours worked per day, and interpret any that are significant from the above SAS output.
ATST(Y) 586 461.75 491.1 565 462 532 477.6 515.2 493 528.3 575.9 532.5 530.5
Age(X) 4.4 14 10.1 6.7 11.5 9.6 12.4 8.9 11.1 7.75 5.5 8.6 7.2
a) Create a SAS data set and fit regression equation that relates average total sleep time and
age of children
/* SAS program to create SAS data set*/
Data SLRM;
Input ATST Age;
Datalines;
586 4.4
461.75 14
491.1 10.1
565 6.7
462 11.5
532 9.6
477.6 12.4
515.2 8.9
493 11.1
528.3 7.75
575.9 5.5
532.5 8.6
530.5 7.2
;
Run;
proc reg data = SLRM;
model ATST = Age;
run;
The program above gives the following simple linear regression output.
Test the significance of the model and of the coefficient of age at the 5% level of significance,
and interpret the result.
b) Check assumptions of normality of error terms and homoscedasticity by constructing a
normal probability plot of the residuals and plot the residual versus fitted values and
interpret them.
We now generate a new dataset called OUTREG1 that contains all of the original variables, plus
the predicted value for each observation (PREDICT) and the residual (RESID), to check the
assumptions of normality of error terms and homoscedasticity.
/*Check assumptions of normality and homoscedasticity for the fitted model above*/
/*To compute residuals and predicted values of ATST*/
proc reg data = SLRM;
model ATST = Age;
output out=outreg1 p=PREDICT r=RESID; /* save predicted values and residuals */
run;
quit;
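The normality check on the residuals can be sketched as follows. This assumes that the data set outreg1 described above has been created by an OUTPUT statement in PROC REG and contains the residual variable RESID.

```sas
/* Normality tests and a normal probability plot for the residuals */
/* (assumes outreg1 with variable RESID already exists)            */
proc univariate data=outreg1 normal;
   var RESID;
   probplot RESID / normal(mu=est sigma=est);   /* reference line from estimated mean and sd */
run;
```

The NORMAL option adds formal tests of normality to the output; points lying close to the reference line in the probability plot support the normality assumption.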
Interpret the above SAS outputs with respect to the normality assumption for the error terms.
/* to check for homoscedasticity of error terms*/
symbol1 value=dot; /* plot each observation as a filled dot */
proc gplot data=outreg1;
Plot RESID*PREDICT;
run;
quit;
Interpret the above SAS output with respect to the homoscedasticity assumption for the error terms.
(d) What will be the average total sleep time of individuals aged 3, 4 and 16 years?
/* SAS Program to make prediction on ATST given age =3,4 and 16*/
data SLRM2;
set SLRM;
output; /* to place new observations at bottom after last value of age 7.2*/
if age=7.2 then do;
age = 3;ATST= .; output;/* to add new observations at bottom of original dataset*/
age = 4; ATST= .; output;/* to add new observations at bottom of original dataset*/
age= 16; ATST= .;output;
end;
run;
proc reg data=SLRM2 alpha=0.05;
model ATST = Age / clm; /*clm is used to get predicted value for new observation*/
run; quit;
Example 2: Consider the above data collected to investigate the relationship between income,
age and hours worked per day for households from a given town XYZ, then
(a) Fit a multiple linear regression model that relates the income of households to their age and
hours worked per day
(b) Check the assumptions of normality, homoscedasticity and no multicollinearity for the fitted
model in (a)
(c) What will be the income of a household for age=54 and hours worked per day=4?
Solution
(a) /* To fit multiple linear regression model that relates the income of
households with their age and hours worked per day*/
proc reg data=mlrdata;
model Income=Age Hourworked;
run;
Test the significance of the model and of the individual regression coefficients, and interpret the outputs.
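For part (b), the diagnostics can be sketched as follows, assuming the same mlrdata data set. The VIF option requests variance inflation factors for the multicollinearity check, and the OUTPUT statement saves residuals for the normality and homoscedasticity checks. The names outreg1, PREDICT and RESID follow the earlier simple regression example.

```sas
/* Multiple regression with variance inflation factors and saved residuals */
proc reg data=mlrdata;
   model Income = Age Hourworked / vif;    /* VIF for each predictor */
   output out=outreg1 p=PREDICT r=RESID;   /* predicted values and residuals */
run;
quit;

/* Normality tests for the residuals */
proc univariate data=outreg1 normal;
   var RESID;
run;
```

Large variance inflation factors (a common rule of thumb is values above 10) suggest multicollinearity among the predictors.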
Interpret the above SAS output with respect to the assumptions of normality and no
multicollinearity among predictors.
/* to check for homoscedasticity of error terms*/
symbol1 value=dot; /* plot each observation as a filled dot */
proc gplot data=outreg1;
Plot RESID*PREDICT;
run;
quit;
Interpret the above SAS output with respect to the assumption of homoscedasticity for the error term.
(c) /* SAS Program to make prediction on income given age = 54 and Hourworked = 4*/
data mlrdata2;
set mlrdata end=last; /* last = 1 when the final observation is read */
output; /* keep every original observation */
if last then do;
age = 54; Hourworked = 4; income = .; output; /* add the new observation at the
bottom of the original dataset */
end;
run;
proc reg data= mlrdata2;
model income= age Hourworked/ clm; /* predicted values are printed with the
confidence limits, including for the new observation */
run; quit;
In the above SAS programs, the CLM option tells PROC REG to print confidence limits for the
expected (mean) value of the response for each observation, together with the predicted values;
the CLI option gives prediction limits for an individual predicted value. The ALPHA= option
specifies the alpha level for the confidence intervals. By default the alpha level is 0.05. The
P option causes the observed, predicted, and residual values to be printed for each observation
that does not contain missing values.
The OUTPUT statement can be used to create a SAS data set that contains all the input data, as
well as predicted values, confidence limits, residuals, and regression diagnostics. The form of the
OUTPUT statement is:
OUTPUT OUT=datasetname keyword=name ;
The keywords are used to specify which values to store in the output data set. Useful keyword
options include PREDICTED (or P) for predicted values, and RESIDUAL (or R) for residual
values. An example of an OUTPUT statement is:
OUTPUT OUT=stats P=pred R=res;
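In context, the OUTPUT statement sits inside the PROC REG step after the MODEL statement. The sketch below reuses the SLRM data from the earlier example and also stores confidence limits for the mean response through the LCLM= and UCLM= keywords. The names stats, pred, res, lower and upper are illustrative.

```sas
/* Save predicted values, residuals and 95% limits for the mean response */
proc reg data=SLRM;
   model ATST = Age;
   output out=stats p=pred r=res lclm=lower uclm=upper;
run;
quit;

/* Inspect the new data set */
proc print data=stats;
run;
```

The data set stats contains the original variables ATST and Age plus the four new columns, ready for use in plotting or diagnostic procedures.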
The OUTPUT statement is useful when creating a data set that will be used later by another SAS
procedure (such as PROC PLOT). A BY statement can be used with PROC REG or PROC GLM to
obtain separate analyses of observations in groups defined by the BY variables. When a BY
statement appears, the procedure expects the data to be sorted in the order of the BY variables.
Qualitative predictors in linear regression model
When fitting a linear regression, qualitative predictor variables such as gender, marital status
or religion can be included in the model through so-called indicator variables or dummy
variables, which take on the values 0 or 1 to identify the categories of the qualitative
predictor. In general, for a categorical variable with k levels you need to create (k − 1) dummy
variables and expect (k − 1) regression coefficients in the model; the remaining category
serves as the reference for interpreting the results.
Example 1: The following data are plasma total cholesterol levels (in mg/100ml), weights (in kg),
ages (in years) and frequency of exercise per week (no exercise, exercise sometimes or exercise
often) for a sample of 25 patients suffering from hyperlipoproteinemia before drug therapy.
Total cholesterol(y) Weight(x1) Age(x2) Freq exercises(x3)
354 84 46 No
190 73 20 Some times
405 65 52 Some times
263 70 30 No
451 76 57 Some times
302 69 25 often
288 63 28 often
385 72 36 No
402 79 57 often
365 75 44 No
209 27 24 Some times
290 89 31 Some times
346 65 52 No
254 57 23 No
395 59 60 Some times
434 69 48 often
220 60 34 Some times
374 79 51 Some times
308 75 50 Some times
220 82 34 often
311 59 46 Some times
181 67 23 No
274 85 37 often
303 55 40 No
244 63 30 often
Then, fit a multiple regression model that relates total cholesterol level to weight, age and
frequency of exercise per week.
/*Fitting Regression model with Categorical Predictor in SAS by using either Proc Reg or Proc
Glm*/
data MLRdataCatpr;
/*Y=Cholesterol level, x1=weight, x2=age, x3=frequency of exercises*/
input Y X1 X2 X3 $;
datalines;
354 84 46 No
190 73 20 Esometim
405 65 52 Esometim
263 70 30 No
451 76 57 Esometim
302 69 25 often
288 63 28 often
385 72 36 No
402 79 57 often
365 75 44 No
209 27 24 Esometim
290 89 31 Esometim
346 65 52 No
254 57 23 No
395 59 60 Esometim
434 69 48 often
220 60 34 Esometim
374 79 51 often
308 75 50 Esometim
220 82 34 often
311 59 46 Esometim
181 67 23 No
274 85 37 often
303 55 40 No
244 63 30 often
;
run;
proc print data=MLRdataCatpr;
run;
/* Create dummy variable for frequency of exercise to use PROC REG procedure in SAS*/
data MLRdataCatpr2;
set MLRdataCatpr;
if x3='No' then noex=1;
else noex=0;
if x3='Esometim' then somex=1;
else somex=0;
/* Frequency of Exercise=Often used as Reference*/
run;
proc print data=MLRdataCatpr2;run;
/* Mlrmodel By Proc Reg Procedure after Creating Dummies for Categorical Predictor
Frequency of Exercise*/
PROC REG DATA=MLRdataCatpr2;
MODEL Y=X1 X2 NOEX SOMEX; /* often category is taken as reference*/
RUN;
/* MLR MODEL BY PROC GLM PROCEDURE THROUGH CLASS STATEMENT */
PROC GLM DATA=MLRdataCatpr;
CLASS X3;
MODEL Y=X1 X2 X3/SS3;
RUN;
/* PROC GLM automatically provides the ANOVA information that the test statement would
provide */
/* To also request the parameter estimates, add the SOLUTION option to the MODEL
statement */
/* MLRMODEL BY PROC GLM PROCEDURE*/
PROC GLM DATA=MLRdataCatpr;
CLASS X3;
MODEL Y=X1 X2 X3/ SOLUTION;
RUN;
h) Generalized Linear Models using SAS/ Proc GENMOD
The goal of modeling is to find the best-fitting and most parsimonious model to describe the
relationship between an outcome (dependent or response) variable and a set of independent
(predictor or explanatory) variables. The most common example is the usual linear regression
model, where the outcome variable is continuous. In many cases, however, the outcome variable
is discrete or categorical, and generalized linear models are often used instead. Fitting a
generalized linear model requires only additional options to specify the probability distribution
and the link function. The following choices of family and link function are available:
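In PROC GENMOD, the distribution and link are set with the DIST= and LINK= options of the MODEL statement. The following is a minimal sketch for a binary response, assuming the savingdata data set and the variables saving, gender and edulev that appear in the PROC LOGISTIC program later in this section.

```sas
/* Binary logistic regression as a generalized linear model */
proc genmod data=savingdata descending;   /* DESCENDING models the higher response level */
   class gender edulev;                   /* categorical predictors */
   model saving = gender edulev / dist=binomial link=logit;
run;
```

Changing DIST= and LINK= (for example, DIST=POISSON with LINK=LOG for count data) fits other members of the generalized linear model family with the same syntax.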
Test the significance of the individual regression coefficients and identify the factors affecting
the saving habit of households.
You can also fit the multiple binary logistic regression model with PROC LOGISTIC, as shown
below, and obtain results similar to those from PROC GENMOD.
/*SAS Program to fit multiple binary logistic regression with proc logistic option*/
proc logistic descending data= savingdata;
class gender edulev / param=ref; /* reference coding, to match the PROC GENMOD estimates */
model saving= gender edulev;
run;
i) ANOVA Models using SAS/ PROC ANOVA or PROC GLM
a) One-way Analysis of Variance
This deals with methods for making inferences about the relationship between a single numeric
response variable and a single categorical explanatory variable. For this we use PROC GLM or
PROC ANOVA. A disadvantage of PROC ANOVA is that residual analysis is limited, since it does
not allow an OUTPUT statement. Because checking assumptions is important in a statistical
analysis, we prefer to use PROC GLM.
Example 1: An experiment was conducted to determine the effects of three types of fertilizer on
the first year growth rate of wheat seedlings as data shown below.
Types of fertilizer growth rate
A 11 13 16 10
B 15 17 20 12
C 10 15 13 10
a. Enter the data in SAS and create temporary SAS dataset named oneanova
b. Do the data provide sufficient evidence to indicate a difference in the average growth
rate depending on the type of fertilizer?
c. If the difference in mean growth rate is statistically significant perform pair wise
comparison and draw your conclusions.
(a) /* SAS Program to create SAS dataset */
Data oneanova;
Input growth fert;
Datalines;
11 1
13 1
16 1
10 1
15 2
17 2
20 2
12 2
10 3
15 3
13 3
10 3
; run;
(b) /* SAS Program to carry out one-way ANOVA and multiple comparison*/
PROC ANOVA DATA=oneanova;
CLASS fert;
MODEL growth = fert;
MEANS fert / TUKEY CLDIFF;
RUN;
In the above program, the SAS dataset oneanova has a variable fert containing the fertilizer
types (A, B, and C, coded 1, 2, and 3) and a variable growth containing the first year growth
rate of wheat seedlings. Assume a number of plots were planted with the three types of fertilizer
and the first year growth of wheat seedlings was measured. The procedure specifies a one-factor
model, testing whether the levels of fert have different effects on the first year growth of
wheat seedlings.
The MEANS statement:
The MEANS statement computes means and performs tests for the indicated effects. Possible
options include:
BON      Bonferroni t tests
SCHEFFE  Scheffe's multiple-comparison procedure
T        Fisher's LSD (least significant difference)
TUKEY    Tukey's studentized range test
CLM      presents output from the BON, SCHEFFE, and T tests as confidence intervals for the
         mean of each level
CLDIFF   presents output from the BON, SCHEFFE, TUKEY and T tests as confidence intervals
         for pairwise differences between means (the default when cell sizes are unequal)
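Because PROC ANOVA has no OUTPUT statement, the same one-way analysis can be run in PROC GLM when residuals are needed. The following is a minimal sketch, assuming the oneanova data set from part (a); the names anovares, fitted and resid are hypothetical.

```sas
/* One-way ANOVA in PROC GLM, saving residuals for assumption checks */
proc glm data=oneanova;
   class fert;
   model growth = fert;
   means fert / tukey cldiff;                /* pairwise comparisons, as before */
   output out=anovares p=fitted r=resid;     /* fitted values and residuals */
run;
quit;

/* Normality tests on the residuals */
proc univariate data=anovares normal;
   var resid;
run;
```

The ANOVA table and Tukey comparisons match those from PROC ANOVA for these balanced data; the extra data set allows the normality and equal-variance assumptions to be examined.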
Example 2: In an experiment to determine the effect of nutrition on the attention spans of
elementary school students, 15 students were randomly assigned to three meal plans (five
students per plan): no breakfast, light breakfast and full breakfast. Their attention spans
(in minutes) were recorded during a morning reading period.
No breakfast   Light breakfast   Full breakfast
8              14                10
7              16                12
9              12                16
13             17                15
10             11                12
a. Enter the data in SAS
/* SAS Program to enter data/create SAS dataset*/
Data oneanova;
Input attention breakfast;
Datalines;
8 1
7 1
9 1
13 1
10 1
14 2
16 2
12 2
17 2
11 2
10 3
12 3
16 3
15 3
12 3
; run;
b. Do the data provide sufficient evidence to indicate a difference in the average attention
spans depending on the type of breakfast eaten by the students?
/* SAS program to carry out one way ANOVA and multiple comparison*/
PROC ANOVA DATA=oneanova;
CLASS breakfast;
MODEL attention = breakfast;
MEANS breakfast / TUKEY CLDIFF;
RUN;
Interpret the above SAS output with respect to the difference in average attention spans
depending on the type of breakfast eaten by the students.
c. If the difference in mean attention span is statistically significant perform pair wise
comparison and draw your conclusions.
Identify the pairs of means responsible for rejection of the null hypothesis in the above ANOVA
result.
b) Two-way Analysis of Variance
This deals with methods for making inferences about the relationship between a single numeric
response variable and two categorical explanatory variables. In one-way ANOVA we discussed
the effect of only one factor; we now pass on to two-way ANOVA. In effect, two-way ANOVA
deals with two factors, whose treatments are arranged along the rows and columns. In such a
situation it is logical to think in terms of the treatment combination received by each
experimental unit. It is also assumed that each factor contributes to the response without the
influence of the other. Finally, the analysis is done in such a way that the main effects of the
column and row factors and, where it is included in the model, their interaction are estimated
and tested.
Example 1: It is of interest to investigate whether four different forms of a standardized reading
test are in fact equivalent. To this end, a total of twenty students is randomly selected from five
different schools (four from each) to do these tests. The heterogeneity in experimental units
(students) that is present since they are from different schools is removed by blocking. Within a
block (the four students from school j), the treatments (type of form) should be randomly
assigned. The resulting achievements of the students are given as follows:
Form   School 1   School 2   School 3   School 4   School 5
1      75         73         59         69         84
2      83         72         56         70         92
3      86         61         53         72         88
4      73         67         62         79         95
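The table above can be entered in long format, with one observation per student. The following is a minimal sketch, assuming the data set name twoanova and the variable names school, form and mark that the PROC ANOVA step for this example uses.

```sas
/* Read the form-by-school table row by row into long format */
data twoanova;
   do form = 1 to 4;           /* rows of the table */
      do school = 1 to 5;      /* columns of the table */
         input mark @@;        /* read the next value, holding the input line */
         output;               /* one observation per (form, school) cell */
      end;
   end;
datalines;
75 73 59 69 84
83 72 56 70 92
86 61 53 72 88
73 67 62 79 95
;
run;
```

The resulting data set has 20 observations with the variables form, school and mark.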
; run;
/* SAS Program to perform two-way ANOVA without Interaction and multiple
comparison*/
PROC ANOVA DATA=twoanova;
CLASS school form;
MODEL mark= school form;
MEANS form / TUKEY CLDIFF;
RUN;
Answer the above questions from (b) to (d) based on the above SAS output.
Example 2: A researcher is interested in investigating the effect of two lifestyle factors (sodium
intake in the diet and smoking habit) on systolic blood pressure. She/he took a random sample of
24 individuals and determined their smoking habit, the amount of sodium in their diet and their
systolic blood pressure (in mmHg); the data from the experiment are shown below:
• Does the average systolic blood pressure depend on sodium intake and smoking habit?
• Is there an interaction between sodium intake and smoking habit?
/* SAS Program to enter data and perform two-way ANOVA with interaction */
Data twoanovaint;
Input SBP Sointake smoking;
Datalines;
129 1 1
125 1 1
129 1 1
132 1 1
125 1 1
128 1 1
140 1 2
126 1 2
120 1 2
137 1 2
142 1 2
147 1 2
132 2 1
137 2 1
130 2 1
148 2 1
154 2 1
158 2 1
165 2 2
152 2 2
140 2 2
167 2 2
142 2 2
177 2 2
; run;
PROC ANOVA DATA=twoanovaint;
CLASS Sointake smoking;
MODEL SBP= Sointake smoking Sointake*smoking;
MEANS Sointake smoking / TUKEY CLDIFF;
RUN;
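When judging the interaction, it helps to look at the mean response in each combination of the two factors. A minimal sketch, assuming the twoanovaint data set created above:

```sas
/* Cell means of systolic blood pressure for each factor combination */
proc means data=twoanovaint n mean std maxdec=2;
   class Sointake smoking;   /* 2 x 2 = 4 cells, 6 observations each */
   var SBP;
run;
```

If the difference in mean SBP between the smoking groups is about the same at both sodium levels, there is little evidence of interaction; clearly unequal differences point to an interaction effect.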