Introduction To Statistical Modeling
By
AUGUST, 2015
Contents
Table of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv
Acknowledgement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
1 UNIT 1
OVERVIEW OF STATISTICAL MODELS 1
1.1.4 Event . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.1.5 Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2 UNIT 2
BUILDING STATISTICAL MODELS 28
2.1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3 UNIT 3
SIMPLE LINEAR REGRESSION 41
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.4 Interpretation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.6 CORRELATION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4 CHAPTER FOUR
Multiple Regression 54
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
ACKNOWLEDGEMENT

1 UNIT 1
OVERVIEW OF STATISTICAL MODELS
Consider the data John collected over a period of 5 years on the heights of maize
plants. The plants were descended from the same parents and planted at the same
time. Half of the plants were self-fertilized, and half were cross-fertilized, and the
purpose of the experiment was to compare their heights. To this end John planted
them in pairs in different pots. The table below gives the resulting heights. All but
two of the differences between pairs in the fourth column of the table are positive,
which suggests that cross-fertilized plants are taller than self-fertilized ones.
Exercise 1 Enter the data in the table above in SPSS. You should define the variables in the variable view. For example, your variable definitions and entered data may look like those in the figure below.
Now construct a box plot using SPSS to see whether the crossed maize plants are taller than the pure-bred ones. (Hint: use the breed variable as a categorizing variable. You should get a graph like the one below.)
Discussion of the exercise Suppose you want to estimate the average height increase in the experiment conducted by John. You can let Y = µ + ε be the height of a self-fertilized plant, where µ is the mean of Y and ε is a random error with mean 0 and variance σ².
You could also let X = µ + η + ε be the height of a crossed plant, where η is another unknown parameter. The most obvious question is whether η = 0.
Another consideration is that John planted the pairs of plants in the same pot to control for the effects of other factors, such as the fertility of the soil.
If we were willing to assume that ε has a given distribution, then the distributions of Y and X would be completely specified once the parameters µ and η were known, giving a parametric model. Otherwise, the model would be called a non-parametric model.
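For concreteness, here is a minimal simulation sketch of the parametric version of this model in Python (the text itself works in SPSS; the values of µ, η and σ below are purely illustrative, and normal errors are assumed):

import numpy as np

rng = np.random.default_rng(1)
mu, eta, sigma = 18.0, 2.6, 2.0   # hypothetical parameter values, for illustration only
n = 15                            # number of plant pairs

y_self = mu + rng.normal(0.0, sigma, n)         # Y = mu + eps  (self-fertilized heights)
x_cross = mu + eta + rng.normal(0.0, sigma, n)  # X = mu + eta + eps  (cross-fertilized heights)

diffs = x_cross - y_self   # pairing cancels mu, so the differences carry information about eta
print(diffs.mean())        # a natural estimate of eta; compare it with 0

If the mean difference is far from zero relative to its variability, that is evidence that η ≠ 0, which is exactly the question posed above.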
The focus of interest in the exercise is the relation between the height of a plant
and something that can be controlled by the experimenter, namely whether it is
self or cross-fertilized. The essence of the model is to regard the height as random
with a distribution that depends on the type of fertilization, which is fixed for each
plant. The variable of primary interest, in this instance height, is called the re-
sponse, and the variable on which it depends, the type of fertilization, is called an
explanatory variable or a covariate. Many questions arising in data analysis involve
the dependence of one or more variables on another or others, but virtually limitless
complications can arise.
To test whether a monetary incentive improves performance, an instructor gives students a math test. Before taking the test, half of the students were told that they would receive K1000 for every correct answer. The other half was not given a monetary incentive. The number of correct answers was recorded for each student.
Dependent variable: number of correct answers (this is the score each person receives)
Scale for the dependent variable: ratio
Exercise 2 Take note that the steps outlined and explained above are used in the planning stage of statistical modeling.
2. Which of the steps above simply define and which simply design? Explain your answers.
3. Collect data on the mass of a chicken over three weeks from any farmer who keeps chickens. Whether or not you have actually collected the data, identify the variables in the data that you are to collect. What will be their level of measurement?
(a) A social psychologist thinks that people are more likely to conform to a
large crowd than to a single person. To test this hypothesis, the psychol-
ogist had either one person or five people stand on a busy walking path
on campus and look up. (note: people who are in cahoots with the ex-
perimenter are called confederates). The psychologist stood nearby and
counted the number of people passing by who looked up and the number
who did not look up. Identify the variables in the data that you are to
collect. What will be their level of measurement?
(d) Previous research has shown that playing music helps plants grow taller.
But, does the type of music matter? Does the volume of the music matter?
To test this, seedlings were assigned to a specific music group (country, rock, classical). Then within each of these groups, the music was played at either a low, medium, or high volume. At the end of one month, each plant's height was recorded.
(e) Harvester ants often strip a bush of all of its leaves. Some people believe
this helps the plant grow thicker, healthier stems. In an experiment, a
student stripped off all the leaves from a set of plants. In a second set
of identical plants, the student allowed ants to strip off the plants' leaves. The student measures the plants' stem thickness (in mm) 4 weeks later.
(g) You want to test a new drug that supposedly prevents sneezing in people
allergic to grass. You randomly assign 1/2 the participants to the drug
group and the rest to a placebo control. One half hour later, you have
them sit in a room filled with the grass they are allergic to. You record the
total number of sneezes over the next 30 minutes.
(i) Which method of wound closure produces the least noticeable scarring 12 weeks later: stitches, staples, or steri-strips? A researcher randomly assigns 10 patients to each method. Degree of scarring is measured on a 6-point Likert scale, where 1 = no visible scar and 6 = extremely visible scar.
(j) Which method of learning brain anatomy is more effective: using a coloring book for the brain or the rap song method? Students in the respective groups are scored on an anatomy test after using their method for 2 weeks.
(k) A florist wants to see if Product X or Product Y will extend the life of
cut flowers so that they last longer. Longevity is measured by rating the
health of the flower from 1 (dead) to 10 (no visible deterioration).
(l) Within a classroom setting, subjects were asked to listen to a guest in-
structor. All subjects were given a description of the instructor before
class. Some subjects read a description containing the phrase "People who know him consider him to be a rather cold person...", while other people read a description where the word "warm" was substituted for the word "cold" (otherwise, the descriptions were identical). After the lecture, subjects
were asked to rate the instructor. Subjects who were told the instructor
was warm gave him more favorable ratings compared to subjects who were
told that the instructor was cold. Instructor rating was assessed using a 5
point Likert scale.
(n) Subjects read about a woman who used a particular title, and then rated
her on a number of traits. When the woman used the title Ms. rather
Definition 2 A sample space is the set that contains all the possible outcomes of a
random experiment.
Example 3 If a coin is tossed once, the set of all possible outcomes is {Head, Tail} = {H, T} = Ω
1.1.4 Event
Example 4 A coin is tossed once. Write down all the possible events.
{Head, Tail}
{Head}
{Tail}
{}
1.1.5 Probability
(a) 0 ≤ P(E) ≤ 1
(b) P(Ω) = 1
(c) If E1, E2, …, Ei, … are mutually exclusive events, then P(∪_{i=1}^∞ Ei) = Σ_{i=1}^∞ P(Ei)
Note: An event is usually denoted by the symbol E. The set that contains all possible events, that is, the power set of the sample space, is the largest sigma field. The discussion of sigma fields is beyond the scope of this topic.
Probability is the limit of the ratio of the number of experimental outcomes of interest to the total number of trials as the total number of trials tends to infinity. In other words, if |Ω| is the total number of trials of a random experiment and |E| the total number of occurrences of an event E, then the probability of the event E is given by
P(E) = lim_{|Ω|→∞} |E|/|Ω|
Answer
2. Highlight cell A1 and move the cursor to its lower right corner until a cross appears, then drag the cross down to cell A1000.
Exercise 3 Use Microsoft Excel to simulate the rolling of a die. Check the validity of the assumption that the probability that a die turns up a 3 is 1/6 ≈ 0.1667. Hint: modify the steps in Example 5. This exercise should help you understand the definition of probability.
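The same check can be done outside Excel. Below is a minimal Python sketch (not part of the text's Excel exercise; the sample size is arbitrary) that simulates rolls of a fair die and computes the relative frequency of a 3:

import numpy as np

rng = np.random.default_rng(42)
rolls = rng.integers(1, 7, size=100_000)   # simulated fair-die rolls, values 1..6
prop_three = np.mean(rolls == 3)           # relative frequency of the event {3}
print(prop_three)                          # close to 1/6 ≈ 0.1667

As the number of trials grows, the printed proportion settles near 1/6, which is exactly the limiting-ratio definition of probability given above.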
Definition 5 A random variable is a function that maps a sample space into a set
of real numbers.
Answer
X(ω) = 1 if ω = H, and X(ω) = 0 if ω = T.
Answer
Ω = {HH, HT, TH, TT}
Let the random variable X(ω) represent the number of heads; then we must have X(HH) = 2, X(HT) = 1, X(TT) = 0, X(TH) = 1.
You can also write your answer as the table below:
ω          HH   HT   TH   TT
X(ω) = x    2    1    1    0
The probability distribution of X is then:
Number of Heads   0      1      2
Probability       0.25   0.50   0.25
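This distribution can be derived mechanically by enumerating the sample space. The following short Python sketch (illustrative, not from the text) does so for two tosses:

from itertools import product
from collections import Counter

outcomes = list(product("HT", repeat=2))          # sample space {HH, HT, TH, TT}
counts = Counter(w.count("H") for w in outcomes)  # X(w) = number of heads
for x in sorted(counts):
    print(x, counts[x] / len(outcomes))           # 0: 0.25, 1: 0.5, 2: 0.25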
Example 10 The sample mean, X̄ = (1/n) Σ_{i=1}^n xi, is a statistic which is used to estimate the population mean, µ, which is a parameter.
A general parameter is usually denoted by the symbol θ. For example, we can say θ = µ or θ = σ².
Having discussed the basic terms in Section 1.1 above, we can now safely define the term statistical model. However, before we define the term, let us consider the following example.
Example 11 Suppose the probability distribution given in this table is the distribution of the ages of students in your class.
Age           30     41     52
Probability   0.25   0.50   0.25
If your class has 1000 students, how many students do you expect to be aged 30, 41, or 52?
Answer
Age                  30    41    52
Number of Students   250   500   250
If you went somewhere to collect data and found that the data you collected looked like the one in Table 1.2, you would naturally conclude that its probability distribution must be the one in Table 1.1.
Similarly, consider another situation in which you are counting the number of girls per family in a sample of 9 families in Zomba, and you have collected the following data set.
Family Id         1   2   3   4   5   6   7    8    9
Number of Girls   4   4   7   8   7   8   10   12   5
Then there must be a function that is responsible for the generation of this data set. We will consider such a function to be a probability distribution. There is one true probability distribution that is responsible for the generation of this data. Since we do not know which probability distribution it is, we can list infinitely many functions that could possibly fit this data set.
This discussion should help you to understand the three possible definitions of sta-
tistical model below.
Definition 11 A statistical model is the pair (Ω, P), where Ω is the sample space and P is the set of possible probability distributions on the sample space.
Exercise 5 (i) Describe the main principle behind statistical modeling with refer-
ence to the definitions given in this section.
(ii) In your own words explain how a model can be parametric, non-parametric or semi-parametric.
(i) The random variable X is said to follow the normal distribution with mean µ and variance σ² if and only if its probability density function (pdf) is
f(x) = (1/(σ√(2π))) exp(−(1/2)((x − µ)/σ)²),   x ∈ R
(ii) The random variable X is said to follow the Poisson distribution with mean λ if and only if its probability mass function (pmf) is
Pr(X = k) = e^(−λ) λ^k / k!,   k = 0, 1, 2, 3, …
(iii) The random variable X is said to follow the binomial distribution with parameters (n, p) if and only if its probability mass function (pmf) is
P(X = x) = (n choose x) p^x (1 − p)^(n−x),   x = 0, 1, 2, …, n
(iv) The random variable X is said to follow an exponential distribution with parameter λ > 0 if and only if its probability density function is
f(x) = λe^(−λx) for x ≥ 0, and f(x) = 0 for x < 0.
All four models in subsection 1.3.1 are parametric, since for each of these models there is a finite set Θ that contains all the parameters of the model, so that specifying Θ is as good as specifying the probability distribution of the model.
Example 12 Consider the normal distribution model in subsection 1.3.1, and specify the set Θ. For instance, one may take Θ = {(µ, σ²) : µ ∈ R, σ² > 0}.
Exercise 6 Repeat this example for the remaining models in subsection 1.3.1.
F(t) = 1 − exp(−∫₀ᵗ λ₀(u) e^(βx) du),
where x is the covariate vector, and β and λ₀(u) are unknown parameters, Θ = (β, λ₀(u)). Here β is finite-dimensional and is of interest; λ₀(u) is an unknown non-negative function of time (known as the baseline hazard function) and is often a nuisance parameter. The collection of possible candidates for λ₀(u) is infinite-dimensional.
(b) Which type of model is the standard normal distribution? Justify your answer.
We would like to differentiate the terms general and generalized linear models. To understand the difference, let us consider the following examples and exercises. We urge the reader to go through the examples and the exercises before attempting to differentiate these terms.
Yi = 9.5 + 2.5Xi + εi
(i) Yi is the value of the response variable in the ith trial, which is the mass of a goat
(ii) Xi is a known constant, namely, the value of the independent variable in the ith trial
(iii) εi is a random variable with mean E(εi) = 0 and variance σ²(εi) = σ²; the εi's are uncorrelated, so that σ(εi, εj) = 0 for all i, j with i ≠ j.
(a) Explain the reason why the model so constructed is a statistical model.
(b) Suppose a 45-week-old goat has a mass of 108. What is the expected mass of the goat at this age? What is the value of the error term at this age?
Answer
(a) Since εi is random and 9.5 + 2.5Xi is a constant, Yi is a random variable and hence the model is a statistical model.
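For part (b), a sketch of the computation using the stated model: the expected mass at 45 weeks is
E(Y) = 9.5 + 2.5(45) = 122
so the value of the error term at this age is ε = 108 − 122 = −14.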
NB The message in this is that the function f(Xi) = 9.5 + 2.5Xi is not a statistical model, but the function g(Xi) = 9.5 + 2.5Xi + εi is a statistical model, based on the assumptions given in the problem. This illustrates one role of the random error in statistical modeling.
Exercise 8 Using the same assumptions as in the example, we can generalize John's model as
Yi = β0 + β1 Xi + εi
Find:
(i) E(Yi)
(ii) σ(Yi, Yj)
(iii) σ²(Yi)
(i) Calculate the expected salary for an employee whose beginning salary is
1000. If the estimated random error for this employee is -920, calculate his
current salary.
(ii) John discovered that the current salary of the employee whose beginning
salary is 2000 is 5800. Calculate the expected current salary and the value
of the error term.
Remarks
Now the model which has been suggested by John can be considered to be of the
form
Y i = β0 + β1 X i + εi
That is, linear regression models have to satisfy two key assumptions: first, the error terms are independent and identically distributed (iid), each following a normal distribution with zero mean and variance σ²; second, the matrix X has to be non-random and of full column rank. However, a common confusion is that, if we assume the error terms are normally distributed, the second assumption seems to imply that all the explanatory variables are independent of one another. To resolve this puzzle we generalize the assumptions that John makes as follows.
3. Normality. We assume that the error terms are normally distributed. While it is possible to fit a linear regression model to data where the errors are not normally distributed (much like it is possible to fit a linear regression model to data that clearly do not follow a linear trend), this is inadvisable, and linear regression is then seen as inappropriate. There are other regression models (generalized linear models is the term that is typically used) that are appropriate in cases where the error terms are not normally distributed.
While it makes sense for X to be of full rank, this does not necessarily need to be the
case. There are numerous benefits to X being of full rank and it allows for maximum
interpretability. However, one can conduct inference on parameters without X being
of full rank. There are also methods that are designed to take non-independent IVs
and project them such that they will be independent in your analysis.
For our example above, we can assume that the error terms are iid Normal(0, σ²), but only if this assumption makes sense. If you know from the subject matter or from your data that the assumptions of independence, normality, or equality of variances are violated, then perhaps a linear regression model is not appropriate. In this case, we would suggest looking into ways to transform your data to ensure the conditions are met, or researching different types of models (e.g. generalized linear models) that are designed to account for data that do not satisfy the four linear-model assumptions mentioned above.
Generalized linear models extend the last assumptions. They generalize the possi-
ble distributions of the residuals to a family of distributions called the exponential
family. This family includes the normal as well as the binomial, Poisson, negative
binomial, and gamma distributions, among others. Common examples are logistic,
Poisson, and probit models.
When you change the distribution of the residuals, it turns out that the relationship
between Y and the model parameters is no longer linear. However, for each distri-
bution in the exponential family, there exists at least one function f (µ) of the mean
of Y whose relationship with the model parameters is linear. This function is called
the link function.
The link function you choose will depend on which distribution you are choosing for
the outcome variable. For example, a binomial residual can use a probit or a logit
link function. A Poisson residual uses a log link function.
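To make the idea of a link function concrete, here is a minimal Python sketch (illustrative only; the text itself works in SPSS and Stata, and the parameter values below are made up) that fits a Poisson generalized linear model with its log link using the statsmodels package:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(0, 2, size=200)
X = sm.add_constant(x)              # design matrix with an intercept column
mu = np.exp(0.5 + 1.2 * x)          # log link: log(mu) = 0.5 + 1.2x
y = rng.poisson(mu)                 # Poisson response with mean mu

model = sm.GLM(y, X, family=sm.families.Poisson())  # log link is the default
result = model.fit()
print(result.params)                # estimates close to (0.5, 1.2)

The relationship between y and the parameters is not linear, but log(µ) is linear in the parameters, which is exactly what the link function provides.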
Exercise 9 State 3 differences between the general linear model and the generalized
linear model.
Introduction
Factor analysis is designed for interval data, although it can also be used for ordinal
data (e.g. scores assigned to Likert scales). The variables used in factor analysis
should be linearly related to each other. This can be checked by looking at scatter-
plots of pairs of variables. Obviously the variables must also be at least moderately
correlated to each other, otherwise the number of factors will be almost the same
as the number of original variables, which means that carrying out a factor analysis
would be pointless.
(b) State two major assumptions of factor analysis. Read the last paragraph to un-
derstand this.
The factor analysis model can be written algebraically as follows. If you have p variables X1, X2, …, Xp measured on a sample of n subjects, then variable i can be written as a linear combination of m factors F1, F2, …, Fm where, as explained above, m < p. Thus,
Xi = a_{i1} F1 + a_{i2} F2 + ⋯ + a_{im} Fm + εi
where the a_{ij}'s are the factor loadings (or scores) for variable i and εi is the part of variable Xi that cannot be explained by the factors.
(a) Define factor analysis Model. Read and understand the last paragraph.
(a) Define regression analysis. Read and understand the last paragraph.
Path analysis is an extension of the regression model. In a path analysis model, two or more causal models derived from the correlation matrix are compared. The path of the model is shown by a square and an arrow, which shows the causation. Regression weights are predicted by the model. Then the goodness-of-fit statistic is calculated in order to assess the fit of the model.
A path model is a diagram which shows the independent, intermediate, and dependent variables. A single-headed arrow shows the cause for the independent, intermediate and dependent variables. A double-headed arrow shows the covariance between two variables.
Exogenous variables have their causes lying outside the model, while endogenous variables are determined by variables within the model.
The model above shows the factors that affect weight loss. Study the figure to
understand path modeling.
In each example, the researcher believes, based on theory and empirical research, that sets of variables define the constructs that are hypothesized to be related in a certain way. The goal of SEM analysis is to determine the extent to which the theoretical model is supported by sample data. If the sample data support the theoretical model, then more complex theoretical models can be hypothesized. If the sample data do not support the theoretical model, then either the original model can be modified and tested again, or other theoretical models need to be developed and tested.
Many kinds of data, including observational data collected in the human and biolog-
ical sciences, have a hierarchical or clustered structure. For example, children with
the same parents tend to be more alike in their physical and mental characteristics
than individuals chosen at random from the population at large. Individuals may be
further nested within geographical areas or institutions such as schools or employ-
ers. Multilevel data structures also arise in longitudinal studies where an individual's responses over time are correlated with each other.
Multilevel models recognise the existence of such data hierarchies by allowing for
residual components at each level in the hierarchy. For example, a two-level model
which allows for grouping of child outcomes within schools would include residu-
als at the child and school level. Thus the residual variance is partitioned into a
between-school component (the variance of the school-level residuals) and a within-
school component (the variance of the child-level residuals). The school residuals, often called school effects, represent unobserved school characteristics that affect child outcomes.
The level signifies the position of a unit of observation within the hierarchy.
4. Make current salary your y-variable, beginning salary your x-variable, and em-
ployment category your column variable.
5. Click OK. (Hint: You must obtain a graph like the one in Figure 1.1.)
2. To extend our model beyond a single category, we need to allow for the variation
in patterns among different subjects. For example, a quick glance at Figure
1.1 shows that some of the subjects are consistently better than others.
3. To make our model more realistic, we allow the intercept in Model 1.1 to vary from subject to subject. Writing (Current Salary)ij for the ith current salary in the jth category, we have
4. Notice that the intercept β0j now has a subscript j, indicating that it will
vary from subject to subject(=category). We now assume that the individual
intercepts follow a Normal distribution with variance τ0 . This gives the model.
5. Model 1.2 accounts for the variation in the individual measurements on a single subject, while Model 1.3 accounts for the variation from one subject to another. The combination of these two models gives what is known as a multilevel model.
Exercise 15 Now using substitution combine models 1.3 and 1.2. What do you get?
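Models 1.2 and 1.3 are not reproduced in this extract, so the following substitution is only a sketch assuming their usual forms, consistent with steps 3 and 4 above:
(Current Salary)ij = β0j + β1 xij + εij        (assumed form of Model 1.2)
β0j = β0 + u0j, with u0j ∼ N(0, τ0)            (assumed form of Model 1.3)
Substituting the second equation into the first gives
(Current Salary)ij = β0 + β1 xij + u0j + εij   (1.4)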
Equation 1.4 is an example of a multilevel model. The feature that distinguishes this model from an ordinary regression model is the presence of two random variables: the measurement-level random variable εij and the subject-level random variable u0j. Because multilevel models contain a mix of fixed effects and random effects, they are sometimes known as mixed-effects models.
(b) State the difference between a regression model and a multilevel model.
2.1.1 Introduction
There are many types of statistical models, as discussed above. Each type may require a different approach for it to be built. However, the following steps are needed for any statistical model that can be built, as far as modern statistical research is concerned.
The order in which the steps are followed is immaterial and, in addition, some steps can be skipped depending on the type of data used in modeling or the proposed model to be built or fitted. Hence, we list all the steps in this subsection and then explain them in the next subsections. Any statement inside brackets () can be ignored, because it is simply supplementary.
2. Designing the study: This involves specifying the type of study design that you will adopt. (In statistical modeling, study design mostly means the type of randomization and sampling criteria that will be adopted, the most common being
This affects the validity of the assumptions that you can use in building a model. For example, if randomization has been applied, you can assume that the sample consists of independent observations. Make sure that you understand this discussion.
It's absolutely vital that you know the level of measurement of each response and predictor variable, because they determine both the type of information you can get from your model and the family of models that is appropriate.
Now the following practical exercises should help you understand the use of STATA in exploring data, and the last 4 steps outlined above. We are using Stata 12, but similar steps can be used in higher versions of Stata.
(a) You must be able to identify the five main windows of the Stata program: RESULTS, REVIEW, COMMANDS, PROPERTIES, and VARIABLES. Discuss with your friend what you think are the uses of the windows you can see.
i. Now close some of the windows. We believe you are able to close the review, the variables and the properties windows.
ii. Try to reopen the windows. There are many ways of doing this. One way is this: press Ctrl and hold it, and while holding it press 3, 4 and 5. Try other ways of reopening the windows.
i. Go to edit
ii. Select preferences
iii. Select load preference set
iv. Choose your preferred layout: for example, choose the combined layout.
i. Try to change the size of the windows in the layout chosen above.
ii. Follow the same steps as in (c), but select save preference set instead; name it myview and click OK.
iii. Try to load any other layout and then reload the layout you created.
iv. You should be able to delete this layout.
(b) Open the SPSS 20 program and use it to open the Employee data that is found in the Program Files folder. (These are the possible steps you might have followed: File, Open, Data, look in C, Program Files (x86), IBM, SPSS, Statistics, 20, Samples, English, Employeedata.sav, Open.)
(c) Save this data using the name Employee and the Stata version 6 extension in the folder you have created. Make sure that you are able to do this.
Exercise 17 Create an Excel data file and then convert this data file into Stata format.
(a) Using the do-file editor and opening a log book. NB: Comments may be added to programs in three ways: begin the line with *; begin the comment with //; or place the comment between /* and */ delimiters. Follow the following steps:
i. Open the do file (by running the command doedit in the command window) and save it as basics in the same folder stataintro.
ii. Change the directory, by typing and running the following command
in the do file
cd"C:\Users\Administrator\Desktop\stataintro"
iii. Open log file by typing and running the following command in the do
file
log using mylog.log // opening log book
iv. Type and run the following command in the do file
use employee, clear
v. Type and run the following command in the do file
table jobcat, contents(freq mean salbegin mean salary)
vi. Now close the log book by typing and running the following command
in the command window
log close
NB: logical operators in Stata: != (or ~=), >, <=, >=, ==, & (and), | (or)
1) [sum newsalary]
2) [gen salcat=.]
3) [replace salcat=1 if newsalary < 300]
4) [replace salcat=2 if newsalary >=305 & newsalary <1000]
5) [replace salcat=3 if newsalary >=1005]
6) [tab salcat]
7) [codebook salcat]
8) Drop the missing value [drop if salcat==.]
Example 18 Run the following commands in the same do file you have been
using in the above exercises
-----------------------------------------------------
      salary |     Coef.   Std. Err.       t     P>|t|
-------------+---------------------------------------
    salbegin |   1.90945    .0474097    40.28    0.000
       _cons |  1928.206    888.6799     2.17    0.031
-----------------------------------------------------
In this case you have fitted a straight-line model (a simple linear regression), and the scatter plot below suggests that we can assume a straight-line relationship between the variables. Now let y = the current salary of the employee and x = the beginning salary of the employee; then the table is telling us that the model is
y = 1.90945x + 1928.206
Exercise 21 From the discussion in the examples and exercises above, you should be able to explain why each of the following steps is important in statistical modeling.
(e) These four steps constitute a stage known as the initial modeling stage. Which of these steps do you think are just preparatory and which are just exploratory? Explain your answers.
(f) Repeat Example 18 for the variable you created in Exercise 18 and the beginning salary.
10. Refine the model and check the model fit: If you are doing a truly
exploratory analysis, or if the point of the model is pure prediction, you can
use some sort of stepwise approach to determine the best predictors.
(a) Test, and possibly drop, interactions and quadratic terms, or explore other types of non-linearity.
Because you already investigated the right family of models in stage one, thoroughly investigated your variables in Step 8, and correctly specified your model in Step 10, you should not have big surprises here. Rather, this step will be about confirming, checking, and refining. But what you learn here can send you back to any of those steps for further refinement.
12. Check for and resolve data issues Steps 11 and 12 are often done together,
or perhaps back and forth. This is where you check for data issues that can
affect the model, but are not exactly assumptions.
Data issues are about the data, not the model, but they occur within the context of the model.
(a) Multicollinearity
Once again, data issues don't appear until you have chosen variables and put them in the model.
You may not notice data issues or misspecified predictors until you interpret the coefficients. Then you find something like a very large standard error or a coefficient with a sign opposite to what you expected, sending you back to previous steps.
Model specification involves three distinct stages. It is the first and most critical of all stages in modeling: our estimates of the parameters of a model and our interpretation of them depend on the correct specification of the model.
Consequently, problems can arise whenever we misspecify a model. There are two
basic types of specification errors. In the first, we misspecify a model by including
in the regression equation an independent variable that is theoretically irrelevant.
In the second, we misspecify the model by excluding from the model equation an
independent variable that is theoretically relevant.
Specification errors
There are basically two types of specification errors.
(a) Misspecification of the model by including in the regression equation an independent variable that is theoretically irrelevant.
(b) Misspecification of the model by excluding from the model equation an independent variable that is theoretically relevant.
Model estimation deals with estimating the values of parameters in a model, based on measured/empirical data that has a random component. The model parameters describe an underlying physical setting in such a way that their values affect the distribution of the measured data. An estimator attempts to approximate the unknown parameters using the measurements.
(a) The probabilistic approach which assumes that the measured data is random
with probability distribution dependent on the parameters of interest.
(b) The set-membership approach assumes that the measured data vector belongs
to a set which depends on the parameter vector.
As you have noted, estimating a model simply means estimating the parameters in the model. There are two main ways of estimating parameters: maximum likelihood estimation and least squares estimation.
Solution
L = ∏_{i=1}^n θe^(θxi) = θⁿ e^(θ Σ_{i} xi)
Solution
The probability density function of the exponential distribution is defined as
f(x; λ) = λe^(−λx) if x ≥ 0, and f(x; λ) = 0 if x < 0
Its likelihood function is
L(λ, x1, …, xn) = ∏_{i=1}^n f(xi, λ) = ∏_{i=1}^n λe^(−λxi) = λⁿ e^(−λ Σ_{i=1}^n xi)
To maximize it we solve
d ln(L(λ, x1, …, xn))/dλ = 0
for λ. Since
d ln(L(λ, x1, …, xn))/dλ = d ln(λⁿ e^(−λ Σ_{i=1}^n xi))/dλ   (2.1)
                         = d(n ln(λ) − λ Σ_{i=1}^n xi)/dλ    (2.2)
                         = n/λ − Σ_{i=1}^n xi                (2.3)
Finally we get
λ̂ = n / Σ_{i=1}^n xi
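To see the estimator behave as derived, here is a small Python sketch (illustrative, not from the text) that compares the closed-form MLE n/Σxi with a direct numerical maximization of the log-likelihood:

import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
x = rng.exponential(scale=1 / 2.5, size=10_000)  # simulated data, true lambda = 2.5

closed_form = len(x) / x.sum()                   # lambda-hat = n / sum(x_i)

def nll(lam):
    # negative log-likelihood of the exponential model
    return -(len(x) * np.log(lam) - lam * x.sum())

numeric = minimize_scalar(nll, bounds=(1e-6, 100), method="bounded").x
print(closed_form, numeric)                      # both close to 2.5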
f(x|θ) = 1/(θ2 − θ1) for θ1 ≤ x ≤ θ2, and 0 otherwise.
Suppose that θ1 and θ2 are unknown. Find L(θ1, θ2).
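As a sketch of the expected answer (not worked in the text): each observation contributes a factor 1/(θ2 − θ1) whenever it lies in [θ1, θ2], so
L(θ1, θ2) = ∏_{i=1}^n f(xi|θ) = (θ2 − θ1)^(−n) if θ1 ≤ min(xi) and max(xi) ≤ θ2, and 0 otherwise.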
f(x; θ) = (θ + 1)x^θ
3.1 Introduction
Example 21 Suppose we want to model the relationship between the height and speed of athletes. In this case height is an explanatory variable and speed is the response variable.
A basic simple linear regression model has only one independent variable. It is also called the prediction equation. The model can be stated as follows:
Yi = β0 + β1 Xi + εi
where Yi is the value of the response variable in the ith trial; β0 is the intercept of the line; β1 is the slope (gradient) of the line; β0 and β1 are parameters; Xi is the value of the independent variable in the ith trial; and εi is a random error term with mean E{εi} = 0 and variance var(εi) = σ².
(a) εi are independent (uncorrelated), i.e. covariance(εi, εj) = 0 for all i and j, i ≠ j;
(c) Xi are precisely measured (are not random variables) and continuous.
(d) Yi is a continuous random variable. Thus you can conclude that E(Yi) = β0 + β1 Xi, because E{εi} = 0, and var(Yi) = σ². Hence Yi ∼ N(β0 + β1 Xi, σ²).
Line of best fit: the line that minimizes the sum of squares of vertical deviations in the Yi's. We use the least squares estimation (LSE) method to estimate the model parameters β0 and β1. The method involves finding good estimators of β0 and β1 that minimize the sum of squares of vertical deviations. That is to say, for each sample observation (xi, yi), the method of LSE considers the deviation of yi from its expected value.
Let S = Σ_{i=1}^n (yi − (β0 + β1 xi))². To estimate β0 and β1, we use the following steps:
(a) Differentiate S partially with respect to β0 and β1.
(b) Set ∂S/∂β0 = 0 and ∂S/∂β1 = 0 and solve for β̂0 and β̂1 respectively.
After solving for β̂1 and β̂0 we get
β̂0 = ȳ − [Σ_{i=1}^n (yi − ȳ)(xi − x̄) / Σ_{i=1}^n (xi − x̄)²] x̄ = ȳ − β̂1 x̄.
What is the value of β̂1?
We use sample data to estimate the parameters β0 and β1. The model becomes ŷi = β̂0 + β̂1 xi, the regression estimator for sample data.
3.4 Interpretation
The fitted model ŷi = β̂0 + β̂1 xi means that for any unit change in xi, ŷi changes by β̂1 units.
Example 23 Table 3.1 shows speed data for 12 kindergarten children and their body weights.
(a) Find β0 and β1 and fit a regression line that best fits the data.
speed (m/min)   5.4   3.4   6.3   3.2   7.5   8.1   9.1   11.5  12.1  14.7  18.5  8.1
weight (kg)     8.7   9.2   11.2  11.5  11.6  11.6  12.3  13.7  4.7   15.2  17.5  18.1
Solution
(a) β̂0 = 2.673, β̂1 = 0.522. Therefore, the fitted regression model is:
ŷi = 2.673 + 0.522xi
(b) Thus the speed of a child weighing 10 kg is ŷi = 2.673 + 0.522(10) = 7.89 m/min.
Interpretation: For any unit change in the body weight of the child, the speed increases by 0.522 m/min.
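The estimates above can be reproduced with a few lines of Python (illustrative; the text itself uses SPSS and Stata), applying the least squares formulas from the previous section to the data of Table 3.1:

import numpy as np

weight = np.array([8.7, 9.2, 11.2, 11.5, 11.6, 11.6, 12.3, 13.7, 4.7, 15.2, 17.5, 18.1])
speed = np.array([5.4, 3.4, 6.3, 3.2, 7.5, 8.1, 9.1, 11.5, 12.1, 14.7, 18.5, 8.1])

# least squares estimates: b1 = Sxy / Sxx, b0 = y-bar - b1 * x-bar
b1 = ((weight - weight.mean()) * (speed - speed.mean())).sum() / ((weight - weight.mean()) ** 2).sum()
b0 = speed.mean() - b1 * weight.mean()
print(b0, b1)   # approximately 2.673 and 0.522, matching the solution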
4. Click Speed and then click Arrow next to the Dependent: box.
5. Click Weight, and then click the arrow next to the Independent(s):box (At this
point your screen should look like the one in figure 3.1).
6. Click Statistics and make sure that Estimates and Model fit are checked
7. Click Continue
8. Click Ok
Example 25 Transforming random variables. We can predict the speed of the child at a given weight. To do this we are going to transform our data and generate a new variable called Predicted-speed. Procedure:
3. Click Type and Label box, and then type predicted speed of the child in the box
titled Label
4. Click Continue
6. Click OK at the bottom and go back to our data. SPSS has now computed a new variable with the predicted speed at a given body weight of the child.
Your data will now look like the one in the figure below. You can also work out the same problem in STATA.
1. Save the SPSS file you created above as a Stata file. Follow the following procedures to do that.
4. Choose the location where your stata file is to be saved e.g Desktop
5. On Save as type, click on the arrow down at the far right and choose Stata Version 6 (*.dta). By now you should have a Stata file.
Table 3.2: Daily temperature and number of patrons visiting a public swimming pool over 20 days
Exercise 23 Table 3.2 shows the daily temperature and the number of patrons visiting a public swimming pool over 20 days.
(a) Compute β0 and β1 and write down the Prediction Equation manually.
(b) In SPSS enter the data, and follow the above procedures to compute
(i) β0 and β1 .
(c) Do the same problem in STATA and follow the above procedures
3.6 CORRELATION
This is a measure of the degree of linear relationship between two (quantitative) random variables. A linear relationship can also be shown using scatter plots.
1. In Stata data file you saved, type the command scatter Speed Weight. This will
give you the Scatter plot below.
5. Click OK.
Figure 3.2: A scatter plot of child speed and body weight in STATA
Figure 3.3: A scatter plot of child speed and body weight in SPSS
The above pattern shows a linear relationship between the speed and weight of the child.
Suppose we have a simple random sample of paired observations, (x1, y1), (x2, y2), …, (xn, yn), with the xi's coming from population characteristic X and the yi's coming from population characteristic Y. Then the population correlation is:
ρxy = Cov(x, y)/√(Var(x) Var(y)) = Σ(X − X̄)(Y − Ȳ)/√(Σ(X − X̄)² Σ(Y − Ȳ)²) = [Σ_{r=1}^N (Xr − X̄)(Yr − Ȳ)/N]/(SX SY)
Propositions: −1 ≤ Corr(X, Y) ≤ 1, and for the sample correlation, −1 ≤ rxy ≤ 1.
Interpretation:
Example 27 Using the data in the previous example on child speed and weight, show whether there is a linear relationship between the speed of the child and its body weight.
(a) Use the formula given above to find rxy and draw a conclusion on whether there is a linear relationship between the speed and weight of the child.
(c) Drag Speed and Weight to the Variables box. Make sure Pearson is marked.
(e) In STATA type the following command: corr Speed Weight. You will have the following output.
Thus r = 0.429 implies that there is a weak positive linear relationship between the speed and body weight of the child.
HYPOTHESIS TESTING ABOUT THE POPULATION CORRELATION (ρxy)
This involves testing hypotheses about the population correlation coefficient ρxy using the sample correlation r. We define H0 and H1 as H0: ρxy = 0 (no linear relationship between X and Y) and H1: ρxy ≠ 0 (there is a significant linear relationship between X and Y); alternatively H1: ρxy > 0 (for a positive relationship) or H1: ρxy < 0 (for a negative relationship).
The test statistic is t = r√(n − 2)/√(1 − r²) ∼ T_{n−2}, at significance level α.
Example 28 Using the above exercise on correlation, from the output below we find that t = 1.50 with p-value 0.164. Since t = 1.50 < 2.201, the critical value at α = 0.05, we have no evidence to reject H0; hence there is no significant linear relationship between the speed and weight of the child.
------------------------------------------------------------------------------
Speed | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
Weight | .5218265 .3476604 1.50 0.164 -.2528092 1.296462
_cons | 2.673217 4.389698 0.61 0.556 -7.107639 12.45407
------------------------------------------------------------------------------
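The r and t values quoted above can be verified with a short Python sketch (illustrative, not from the text), using the data of Example 23:

import numpy as np

weight = np.array([8.7, 9.2, 11.2, 11.5, 11.6, 11.6, 12.3, 13.7, 4.7, 15.2, 17.5, 18.1])
speed = np.array([5.4, 3.4, 6.3, 3.2, 7.5, 8.1, 9.1, 11.5, 12.1, 14.7, 18.5, 8.1])

r = np.corrcoef(weight, speed)[0, 1]        # sample correlation, about 0.429
n = len(weight)
t = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)  # test statistic, about 1.50
print(r, t)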
This is the measure of the total variation accounted for by the regression of the dependent variable on the independent variable. It is given as
r² = S²xy / (Sxx Syy)
It is interpreted as follows: large values of r² imply high variation in y due to changes in X, and vice versa (the bigger the value of R-squared, the bigger the variation in Y that has been explained by X).
This deals with measuring the lack of fit, to check whether the straight line best fits the data.
Lack of fit
The key statistic used is the residual sum of squares, Σ(yi − ŷi)². If the residual sum of squares is large, then there is lack of fit. The residual sum of squares consists of two components.
The pure error sum of squares is estimated using replicate observations (several different values of y at the same x). The y's are assumed independent for a fixed x. The pure error sum of squares is given by Σᵢ Σⱼ (yij − ȳi·)², where ȳi· is the mean of the replicate observations at xi.
The estimate of σ² (pure error) = (pure error sum of squares)/(Σᵢ ni − k), where k is the number of distinct x values.
The degrees of freedom for the lack of fit sum of squares is k − 2, where k is the number of sets of replicates.
Hypothesis testing about model appropriateness
H0: the straight-line model is appropriate.
H1: the straight-line model is not appropriate.
Rejecting H0 implies that the lack of fit contributes a large value to the residual sum of squares compared to the pure error sum of squares.
The test statistic is F = MS(lack of fit)/MS(pure error) ∼ F(k − 2, Σᵢ ni − k) at level α/2.
Example 29 Using the data on child speed and body weight analysed previously, computing the residual sum of squares, pure error sum of squares, lack of fit sum of squares and F-statistic gives the following results.
Residual SS = 185.857, pure error SS = 41.871, and lack of fit SS = 185.857 − 41.871 = 143.986. F-statistic = 18.544, F(1, 10, 0.025) = 2.25. Thus F-statistic > F(1, 10, 0.025); we reject H0 and conclude that the straight-line model above does not appropriately fit the data values.
Exercise 24 Using the data on number of patrons visiting the public swimming pool
and temperature, test whether the straight line model is appropriate or not.
• Open the SPSS file you saved with data on public swimming pool provided in
this book.
• Click on Statistics and make sure that the following are marked, Estimates,
model fit and R-squared change.
NB: In STATA use the following command: regress No_patrons Temperature
This will give you an ANOVA table with the values of the residual SS and pure error SS, from which you can compute the lack of fit sum of squares and F(k − 2, Σᵢ ni − k), and then make a decision about the model fit.
4.1 Introduction
We discuss some examples to show how multiple regression analysis can be used to handle scenarios that cannot be solved by simple regression. We begin with an example on a simple variant of the wage equation for obtaining the effect of education on hourly wage, y,
y = β0 + β1x1 + β2x2 + µ    (1)
where x1 is education and x2 is years of labour market experience. Thus wage is de-
termined by two explanatory or independent variables, namely, education and labour
experience and by other unobserved variables, which are contained in µ. We are still
interested in the effect of education level on wage, holding fixed all other factors af-
fecting wage. This means we are interested in the parameter β1 . When we compare
equation (1) with a simple regression relating wage to education level, the equation
(1) effectively takes experience out of the error term (noise term) and puts it explic-
itly in the equation. Because experience appears in the equation, its coefficient, β2 ,
measures the effect of experience on wage, which is also of some interest.
As another example, we consider the problem of explaining the effect of per-student expenditure on average standardized test scores at high school level. Suppose that the average test score depends on funding, average family income and other unobservable factors (noise).
Generally, we can write a model with two independent variables, illustrating the two previous examples, as
y = β0 + β1x1 + β2x2 + µ
where β1 measures the change in y with respect to x1 and β2 the change in y with respect to x2, holding the other factors fixed. β0 is the value of the outcome when both explanatory variables have values of zero. β1 and β2 are also called gradients for the explanatory variables. These gradients are the regression coefficients, which tell you how much change in the outcome, y, is predicted by a unit change in that explanatory variable.
Though it can be difficult to visualize a linear model with two explanatory variables, we may add a third axis to the two-way Cartesian plane and view the fitted model as a plane through the scatterplot, as shown below.
Once we are in the context of multiple regression, there is no need to stop with two
independent variables when there are more than two explanatory variables explaining
a particular variable. If we happen to have other explanatory variables that are
affecting the response, all we have to do is to add these explanatory variables into
the linear equation. Multiple regression analysis allows many observed factors to
affect y. In the wage example, we may add the amount of job training, years of tenure with the current employer, measures of ability, and even demographic variables like number of siblings or mother's education. The general multiple linear regression
model can be written in the population as
y = β0 + β1 x1 + β2 x2 + β3 x3 + ... + βk xk + µ
where β0 is the intercept (as pointed out above), β1 is the parameter associated with independent variable x1, β2 is the parameter associated with independent variable x2, and so on. Since there are k independent variables and an intercept, the equation above contains k + 1 unknown population parameters. The variable µ is the error term or disturbance or noise term. It contains variables or factors other than x1, x2, x3, …, xk that affect y. No matter how many factors we incorporate in our model,
there will always be factors we cannot include, and these are collectively contained
in µ. This random term is assumed to be normally distributed with mean zero and
variance σ 2 . Terminology for multiple regression is similar to that for simple regres-
sion and is given in the table below
Although the model formulation appears to be a simple generalization of the model
with one independent variable, the inclusion of several independent variables creates
a new concept in the interpretation of the regression coefficients. Thus, when applying a multiple regression model, we must know how to interpret the parameters.
In multiple regression we are interested in what happens when each variable is varied
one at a time, while not changing the values of any others. This is in contrast to
performing several simple linear regressions, using each of these variables in turn,
but where each regression ignores what may be occurring with the other variables.
Therefore, in multiple regression, the coefficient attached to each independent vari-
able should measure the average change in the response variable associated with
changes in that independent variable, while all other independent variables remain
fixed. This is the standard interpretation for a regression coefficient in a multiple
regression model.
Suppose manager salary (salary) is related to company sales (sales) and manager tenure (manten) with the firm by
log(salary) = β0 + β1 log(sales) + β2 manten + β3 manten² + µ
This fits into the multiple regression model with three variables (k = 3) by defining y = log(salary), x1 = log(sales), x2 = manten and x3 = manten². The parameter β1 is the elasticity of salary with respect to sales when all other variables are kept constant. If β3 = 0, then 100β2 is approximately the percentage increase in salary when manten increases by one year, with all other factors constant. However, when β3 ≠ 0, the effect of manten on salary is more complicated to describe.
The following are the assumptions of Multiple Regression with k variables written
above
• Constant variance: The variance of the µi's is constant for all values of the xi's. This is detected by residual plots of ej = yj − ŷj against ŷj or the xi's. If these residual plots show a rectangular band, we can assume constant variance. Otherwise, non-constant variance exists and must be corrected.
• The variables xj are considered fixed quantities (not random variables); that is, the only randomness in y comes from the error term µ.
In simple linear regression, the F test from the ANOVA table is equivalent to the two-
sided test of the hypothesis that the slope of the regression line is zero. For multiple
regression, there is a corresponding ANOVA F test, but it tests the hypothesis that
all regression coefficients (except the intercept 0) are zero. The ANOVA Table for
Multiple Regression is given below
where SSR = Σ(ŷi − ȳ)², SSE = Σ(yi − ŷi)² and SST = Σ(yi − ȳ)². MSR is the variance attributed to the model and MSE is the variance that is unaccounted for (due to error).
H0 : β1 = β2 = β3 = ... = βk = 0
H1: βj ≠ 0 for at least one j, j = 1, 2, 3, …, k
We test the general null hypothesis, H0 , by calculating the F ratio and comparing
it with the critical point F(k,n−k−1,1−α) where k is as defined above, n is sample size,
and α is significance level that is preselected. We reject H0 if computed F exceeds
F(k,n−k−1,1−α) in value. Rejection of H0 implies that at least one of the regressor
variables x1 , x2 , ..., xk contributes significantly to the model. The testing procedure
involves analysis of variance partitioning of the total of the sum of squares SST into
a sum of squares due to the model or regression and a sum of squares due to residual
(or error), say SST = SSR + SSE .
The more variation in the response the regressors explain, the larger SSR becomes
and the smaller SSE becomes. This means that M SR becomes larger and M SE
smaller and therefore the quotient F becomes larger. Thus, small values of F support the null hypothesis and large values of F provide evidence against the null hypothesis and in favor of the alternative hypothesis.
We treat sales as the dependent variable y, and target population and per capita
discretionary income as independent variables x1 and x2 , respectively in an explo-
ration of feasibility of predicting district sales from target population and per capita
discretionary income. The regression model is
yi = β0 + β1 xi1 + β2 xi2 + µ
The table below shows an SPSS output with the values of β0, β1 and β2. This shows that β0 = 3.453, the coefficient of xi1, target population, is β1 = 0.496, and the coefficient of xi2 is β2 = 0.009. Having computed these coefficients, we now have a regression equation that can be used to predict district sales:
ŷ = 3.453 + 0.496xi1 + 0.009xi2
The model says that the mean jar sales are expected to increase by 0.496 gross when
the target population increases by 1 thousand, holding per capita discretionary in-
come constant, and that mean jar sales are expected to increase by 0.009 gross when
per capita discretionary income increases by 1 Kwacha, holding target population
constant.
H0: β1 = β2 = 0 against H1: at least one of β1, β2 is nonzero
The table below shows the SPSS output of the ANOVA F-test carried out on the data given at the beginning of this example.
From the table, SSR = 53844.716, SSE = 56.884 and SST = 53901.600, with degrees of freedom (df) 2, 12 and 14 respectively. We now calculate the mean squares: MSR = SSR/2 = 26922.358 and MSE = SSE/12 = 4.740.
The F-ratio is F = MSR/MSE = 26922.358/4.740 = 5679.466.
For a significance level of α = 0.05, we find F(2,12,0.95) = 3.89. Since F > F(k,n−k−1,1−α), we reject our null hypothesis and conclude that target population or per capita discretionary income (at least one of them) contributes significantly to the sales. Alternatively, we could use the p-value to test the hypothesis. From the ANOVA table, the reported significance is 0.00, which is less than our preselected significance level of 0.05. Therefore, we reject the null hypothesis and conclude in the same way.
R² = SSR/SST = 1 − SSE/SST
A large value of R² does not necessarily mean the regression model is a good (useful) one. Adding new variables to the model can never decrease the amount of variance explained, i.e. it will always increase R², regardless of whether the additional variable is statistically significant or not. This is because SSE can never become larger with more independent variables, and SST is always the same for a given set of responses. Thus, it is possible for models that have large values of R² to yield poor predictions of new observations or estimates of the mean response.
Because R² always increases as we add terms to the model, it is sometimes suggested that a modified measure be used that adjusts for the number of independent variables:
Ra² = 1 − [SSE/(n − k)] / [SST/(n − 1)] = 1 − (1 − R²)(n − 1)/(n − k)
Actually, the adjusted R² statistic will not always increase as we add variables to the model. In fact, if unnecessary terms are added, the value of adjusted R² will often decrease, because the decrease in SSE may be more than offset by the loss of degrees of freedom in the denominator (n − k).
Coefficient of Multiple Correlation: R is the positive square root of R²; thus R = √R².
R² = 1 − SSE/SST = 1 − 56.884/53901.600 = 0.9989
Thus, when two independent variables, target population and per capita discre-
tionary income are considered, the variation in sales is reduced by 99.9 percent.
√
The coefficient of multiple correlation is R = 0.9989 = 0.999
If the overall test shows that the model as a whole is statistically significant as a predictor of the response, we want to know which of the regressors in the model are statistically significant predictors of the response. We are frequently interested
in testing hypotheses on the individual regression coefficient. Such tests would be
useful in determining the value of each regressor variable in the regression model.
For example, a model might be more effective with the inclusion of the additional
variables or perhaps with deletion of one or more of the variables of the model.
Adding a variable to the regression model always causes the sum of squares for regres-
sion to increase and the error sum of squares to decrease. We must decide whether the
increase in the regression sum of squares is sufficient to warrant using the additional
variable in the model. Furthermore, adding an unimportant variable to the model may actually increase the mean square error, thereby decreasing the usefulness of the model.
The hypotheses for testing the significance of any individual regression coefficient,
say βj are
H0 : βj = 0 against H1 : βj 6= 0
T = β̂j/SE(β̂j) = β̂j/σ(β̂j) ∼ t(n−k); we reject H0 if |T| > t(1 − α/2, n − k)
There are two general applications for multiple regression (MR): prediction and ex-
planation. These roughly correspond to two differing goals in research: being able to
make valid projections concerning an outcome for a particular individual (prediction),
or attempting to understand a phenomenon by examining a variable’s correlates on a
group level (explanation). When one uses Multiple Regression model for prediction,
one is using a sample to create a regression equation that would optimally predict
a particular phenomenon within a particular population which is our goal in this
section of the chapter.
Hypothetically, a regression equation is created to predict eighth-grade achievement test scores from fifth-grade variables, such as family socioeconomic status, race, and sex.
VIFj = 1/(1 − Rj²)
where Rj² is the multiple coefficient of determination for the regression model that relates Xj to all other independent variables in the set. If Rj² = 0, which says the jth variable is not related to the other predictor variables, then VIFj = 1. On the other hand, if Rj² > 0, which says Xj is linearly related to other predictor variables, then 1 − Rj² is less than 1, making VIFj > 1. Both the largest variance inflation factor among the independent variables and the mean of the variance inflation factors are used to detect multicollinearity, for example:
• The largest variance inflation factor greater than 10 (which means the largest Rj² is greater than 0.9)
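As an illustration (not from the text), variance inflation factors can be computed in Python with the statsmodels package; the data below are simulated, with x2 deliberately made nearly collinear with x1:

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=100)   # nearly collinear with x1
x3 = rng.normal(size=100)

X = sm.add_constant(np.column_stack([x1, x2, x3]))
for j in range(1, X.shape[1]):               # skip the constant column
    print(j, variance_inflation_factor(X, j))   # VIFs for x1 and x2 far above 1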
For a given regression model, could some of the predictors be eliminated without
sacrificing too much in the way of fit? Conversely, would it be worthwhile to add
a certain set of new predictors to a given regression model? The partial F Test is
designed in such a way that these questions are answered by comparing two models
for the same response variable. The extra sum of squares is used to measure two
things. Firstly, it measures the marginal increase in the error of sum of squares when
one or more predictors are deleted from the model. Secondly, it measures marginal
reduction in the error sum of squares when one or more predictors are added to the
model.
The partial F test assesses whether addition of any specific independent variable,
given others already in the model, significantly contributes to the prediction of y.
The test therefore allows for elimination of variables that are of no help in predicting
y and thus enables one to reduce the set of possible independent variables to an
economical set of predictors.
The model containing all k predictors is called the full model:
y = β0 + β1 x1 + β2 x2 + ... + βk xk + ε.
The reduced model contains only the first l predictors (l < k):
y = β0 + β1 x1 + β2 x2 + ... + βl xl + ε.
We estimate the linear regression for each of the two models and then look at the
error sum of squares (SSE) from the ANOVA table of each. We want to test whether we
can safely drop the regressors x(l+1), ..., xk from the model. If these predictors do
not add significantly to the model, then dropping them will make the model simpler
and thus preferable.
To perform the partial F test concerning the variables x(l+1), ..., xk, given that
the variables x1, x2, ..., xl are already in the model, we must first compute the
extra sum of squares from adding x(l+1), ..., xk given x1, x2, ..., xl, which we
place in our ANOVA table. This sum of squares is computed by the formula
SSR(x(l+1), ..., xk | x1, ..., xl) = SSE(x1, ..., xl) − SSE(x1, ..., xk),
and the corresponding partial F statistic is
F = [SSR(x(l+1), ..., xk | x1, ..., xl) / (k − l)] / [SSE(x1, ..., xk) / (n − k − 1)].
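A hedged Python sketch of this test is given below. It assumes the full design
matrix carries an intercept column plus k predictors, the reduced one an intercept
plus the first l predictors, and that the reader supplies the data.

import numpy as np
from scipy import stats

def sse(X, y):
    # Error sum of squares from an OLS fit (X should include a column of ones)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return resid @ resid

def partial_f_test(X_full, X_reduced, y, alpha=0.05):
    n = len(y)
    k = X_full.shape[1] - 1            # predictors in the full model
    l = X_reduced.shape[1] - 1         # predictors in the reduced model
    sse_r, sse_f = sse(X_reduced, y), sse(X_full, y)
    extra_ss = sse_r - sse_f           # extra sum of squares
    f_stat = (extra_ss / (k - l)) / (sse_f / (n - k - 1))
    f_crit = stats.f.ppf(1 - alpha, k - l, n - k - 1)
    return f_stat, f_stat > f_crit     # True means keep the full model

A value of the statistic above the critical point indicates that the extra
regressors x(l+1), ..., xk contribute significantly and should be retained.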
The all-possible-regressions method is designed to run every possible regression
between the dependent variable and all possible subsets of the explanatory variables.
Suppose we want to explain the dependent variable WAGE, and to aid in our study we
have three explanatory variables: years of education (ED), the age of the worker
(AGE) and the years of experience of the worker (EXP). Using this data set, there
are a total of 2³ = 8 possible regressions which could be run in order to explain
WAGE:
• No variables: WAGE = β0
• One variable: WAGE = β0 + β1 ED; WAGE = β0 + β1 AGE; WAGE = β0 + β1 EXP
• Two variables: WAGE = β0 + β1 ED + β2 AGE; WAGE = β0 + β1 ED + β2 EXP;
WAGE = β0 + β1 AGE + β2 EXP
• All three variables: WAGE = β0 + β1 ED + β2 AGE + β3 EXP
Running all the possible regressions allows the researcher to analyze the summary
statistics of every one of them. The choice of which criterion to use to select the
appropriate model is left to the researcher. Two commonly used criteria in choosing
between different regressions are:
• Using the adjusted R²
• Using the Cp statistic
Cp measures the total mean error of the fitted values of the regression. The total
mean error involves two parts: one that results from random errors and one resulting
from bias. When there is no bias, the expected value of Cp is E(Cp) = p, where p is
equal to the K + 1 coefficient estimates in the regression. A good regression will
therefore have a low value of Cp, near K + 1. If the regression generates a large Cp,
then the mean square error of the fit is large, indicating that the regression is a
poor fit and/or has bias. Cp is computed using the following formula:
Cp = SSEp / MSEF − (n − 2p),
where SSEp is the error sum of squares for the regression with p = K + 1 coefficients
to be estimated and MSEF is the mean square error for the model that includes all
possible explanatory variables.
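The following Python sketch ties the all-possible-regressions idea to the Cp
criterion. The names ED, AGE, EXP follow the WAGE example above, but the data
arrays are placeholders the reader must supply, and the final sort by Cp is simply
one convenient way to scan the candidates.

from itertools import combinations
import numpy as np

def fit_sse(X, y):
    # Error sum of squares from an OLS fit
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ coef
    return r @ r

def all_subsets_cp(X, y, names):
    n, K = X.shape
    ones = np.ones((n, 1))
    # MSE_F comes from the full model with all K predictors (p = K + 1 coefficients)
    mse_full = fit_sse(np.hstack([ones, X]), y) / (n - K - 1)
    results = []
    for size in range(K + 1):                    # 2^K subsets in total
        for idx in combinations(range(K), size):
            Xp = np.hstack([ones, X[:, list(idx)]])
            p = len(idx) + 1                     # coefficients incl. intercept
            cp = fit_sse(Xp, y) / mse_full - (n - 2 * p)
            results.append(([names[i] for i in idx], p, cp))
    # A good subset has Cp low and close to p
    return sorted(results, key=lambda r: r[2])

For K = 3 explanatory variables, such as ED, AGE and EXP, this enumerates the 8
regressions listed above and reports Cp for each.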
EXERCISE
1. Sixteen observations on two explanatory variables, x1 and x2, and a response y
are given in the table below.
i xi1 xi2 yi
1 4 2 64
2 4 4 73
3 4 2 61
4 4 4 76
5 6 2 72
6 6 4 80
7 6 2 71
8 6 4 83
9 8 2 83
10 8 4 89
11 8 2 86
12 8 4 93
13 10 2 88
14 10 4 95
15 10 2 94
16 10 4 100
(a) Fit the regression model to the data. State the estimated regression func-
tion and interpret β1.
(b) Assume the regression model with independent normal error terms is ap-
propriate.
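For readers working outside SPSS, one possible way (not prescribed by the text) to
carry out part (a) in Python is sketched below, using the sixteen rows of the table.

import numpy as np

# The (x_i1, x_i2, y_i) triples from the table above
x1 = np.array([4, 4, 4, 4, 6, 6, 6, 6, 8, 8, 8, 8, 10, 10, 10, 10])
x2 = np.array([2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4])
y = np.array([64, 73, 61, 76, 72, 80, 71, 83, 83, 89, 86, 93, 88, 95, 94, 100])

# Design matrix with an intercept column, then ordinary least squares
X = np.column_stack([np.ones_like(x1), x1, x2]).astype(float)
beta_hat, *_ = np.linalg.lstsq(X, y.astype(float), rcond=None)
print(beta_hat)   # b0, b1, b2 of the fitted function y-hat = b0 + b1*x1 + b2*x2
# b1 estimates the mean change in y per unit increase in x1, holding x2 fixed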
2. The crime rate in 47 states in the USA was reported together with possible
factors that might influence it. The factors recorded are as follows;
• X4 : police expenditure (in dollars) per person by state and local govern-
ment in 1960 (expend60)
• X5 : police expenditure (in dollars) per person by state and local govern-
ment in 1959 (expend59)
• X6 : labour force participation rate per 1000 civilian urban males in the
age group 14-24 (LabForce)
• X10 : unemployment rate of urban males per 1000 in the age group 14-24
(unemploy)
• X11 : unemployment rate of urban males per 1000 in the age group 35-59
(unemploy35)
• X12 : the median value of family income or transferable goods and assets
(unit: 10 dollars) (income)
• X13 : the number of families per 1000 earning below one-half of the median
income. (poverty)
• Y : Crime rate
(c) What does your test imply about all explanatory variables?
(e) Now perform regression analysis with police expenditure in 1959. Inter-
pret the output.
How does one determine the most economical model? One method
is to consider the analysis of variance table of the full model remaining
after excluding those variables which cause multicollinearity. In
the present data set the variable X5 (police expenditure in 1959) was ex-
cluded, leaving 12 explanatory variables. In the resulting regression analysis
one examines the table of t-ratios and the corresponding p-values.
Those variables which have a significant t-ratio are selected for inclusion in
the sub-set. (Recall that a significant t-ratio for a coefficient means that a
slope is present.) Next the Analysis of Variance table is examined in the col-
umn headed SEQ SS. The amount in each row of this column tells us the
contribution made by the respective explanatory variable to the Regression
Sum of Squares. One verifies that the explanatory variables selected to form
the sub-set are indeed making a sizable contribution to the Regression Sum of
Squares. (Recall that for any explanatory variable the larger the Regression
Sum of Squares, the smaller the Residual Sum of Squares, since
TSS = ESS + RSS. A small Residual Sum of Squares means that the data
points are not so scattered, and that there is greater clustering about the
line of least squares.)
An alternative strategy is based on adding or dropping one variable at a time
from a given model. The idea is to compare the current model with a new
model obtained by adding or deleting an explanatory variable. Call the
smaller model (i.e. the one with fewer variables) Model I and the bigger
model Model II. One can compute the F statistic (called the partial F) as
partial F = (RSS of Model I − RSS of Model II) / (RSS of Model II / degrees
of freedom of Model II).
If the partial F value exceeds the critical F value, the added variable
contributes significantly and the bigger model is preferred.
Once the important β coefficients have been identified, the next step is to de-
termine their relative roles in explaining the variability in the outcome variable
Y. The actual numeric value of a β coefficient is not a guide. For example,
the β coefficient for StateSth is 7.2 and that of Popn is 0.324. This does not
mean that StateSth is roughly 22 times more important than Popn. This is
because of the unit of measurement: Popn is measured in actual numbers and
StateSth is categorical. The t statistic provides the true measure. The t
statistic for StateSth is 0.39 and that for Popn is 2.10. In the regression
analysis with 5 explanatory variables the sub-set was chosen in the manner
just described; for comparison, the regression analysis with the 7 remaining
variables can also be examined.
• The P values are to be discounted because they do not take into account
the very many tests that have been carried out.
(b) Adjusted R-sq.
Although similar to R-sq., it takes into account the number of explanatory
variables p in the model and the number of subjects n. (R-sq. = Regression
Sum of Squares / Total Sum of Squares.)
Adjusted R-sq. = 1 − [(n − 1)/(n − p − 1)](1 − R-sq.).
A short computational sketch of this formula appears after item (d) below.
(d) Cp statistic
In general, among candidate models we select the one with the small-
est C-p value, and where the value of C-p is closest to p, the number of
parameters in the model (i.e. intercept plus the number of explanatory
variables in the model).
Cp = Residual Sum of Squaresp / Mean Residual Sum of Squaresm − (n − 2p),
where Residual Sum of Squaresp = RSS for the model with p parameters
including the intercept and Mean Residual Sum of Squaresm = Mean RSS
with all the predictors.
Cp is commonly used to select a subset of explanatory variables after
an initial regression analysis employing all possible explanatory variables.
One then selects the smallest model that produces a C-p value near to p,
the number of parameters. A small C-p means a small variance in estimating
the regression coefficients; in other words, a precise model is achieved, and
adding more explanatory variables is unlikely to improve the precision
any further. Simple models are always preferable because they are easy to
interpret and less prone to multicollinearity. In the example of
the best-subset regression analysis provided, step 6, with a C-p value of 3.1,
has been highlighted as giving the best subset. The corresponding R-sq.
value is 74.8%, the adjusted R-sq. value is 71%, and s = 20.827. Caution
needs to be exercised when resorting to automatic selection procedures
like Stepwise and Best subsets: these procedures are machine led and do not
take into account the practical importance or biological plausibility of the
predictors.
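As noted under (b) above, a minimal Python sketch of the adjusted R-sq. formula
(illustrative only, with inputs supplied by the reader) is:

def adjusted_r_squared(r2, n, p):
    # r2: ordinary R-sq; n: number of subjects;
    # p: number of explanatory variables in the model
    return 1 - (n - 1) / (n - p - 1) * (1 - r2)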
If the explanatory variables are correlated with each other, assessment of
the importance of each of them becomes ambiguous.
3. Having obtained a regression equation, how can one judge whether
adding one or more explanatory variables would improve prediction
of the response?
Improvement in R-sq., and particularly in adjusted R-sq., provides a useful guide.
The addition of new explanatory variables can affect the relative contributions
of those variables already in the equation. In the selection of additional vari-
ables, the research question and the theoretical rationale behind it must guide
the researcher. To leave the selection entirely to the software would be a mis-
take.