Introduction To Statistical Modeling
By
AUGUST, 2015
Contents
Table of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv
Acknowledgement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
1 UNIT 1
OVERVIEW OF STATISTICAL MODELS 1
1.1.4 Event . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.1.5 Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2 UNIT 2
BUILDING STATISTICAL MODELS 28
2.1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3 UNIT 3
SIMPLE LINEAR REGRESSION 41
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.4 Interpretation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.6 CORRELATION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4 CHAPTER FOUR
Multiple Regression 54
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
ACKNOWLEDGEMENT

1 UNIT 1
OVERVIEW OF STATISTICAL MODELS
Consider the data John collected over a period of 5 years on the heights of maize
plants. The plants were descended from the same parents and planted at the same
time. Half of the plants were self-fertilized, and half were cross-fertilized, and the
purpose of the experiment was to compare their heights. To this end John planted
them in pairs in different pots. The table below gives the resulting heights. All but
two of the differences between pairs in the fourth column of the table are positive,
which suggests that cross-fertilized plants are taller than self-fertilized ones.
Exercise 1 Enter the data in the table above in SPSS. You should define the variables in the variable view. For example, your variable definitions and entered data may look like those in the figure below.
Now construct a box plot using SPSS to see whether the crossed maize plants are taller than the pure-bred ones. (Hint: use the breed variable as a categorizing variable. You should get a graph like the one below.)
Discussion of the exercise Suppose you want to estimate the average height increase in the experiment conducted by John. You can let Y = µ + ε be the height of a self-fertilized plant, where µ is the mean of Y and ε is a random error with mean 0 and variance σ².
You could also let X = µ + η + ε be the height of a crossed plant, where η is another unknown parameter. The most obvious question is whether η = 0.
Another consideration is that John planted the pairs of plants in the same pot to control for the effects of other factors, such as the fertility of the soil.
If we were willing to assume that ε has a given distribution, then the distributions of Y and X would be completely specified once the parameters µ and η were known, giving a parametric model. Otherwise, the model would be called a non-parametric model.
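For concreteness, here is a minimal simulation sketch of the parametric version of this model in Python (the text itself works in SPSS; the values of µ, η and σ below are purely illustrative, and normal errors are assumed):

import numpy as np

rng = np.random.default_rng(1)
mu, eta, sigma = 18.0, 2.6, 2.0   # hypothetical parameter values, for illustration only
n = 15                            # number of plant pairs

y_self = mu + rng.normal(0.0, sigma, n)         # Y = mu + eps  (self-fertilized heights)
x_cross = mu + eta + rng.normal(0.0, sigma, n)  # X = mu + eta + eps  (cross-fertilized heights)

diffs = x_cross - y_self   # pairing cancels mu, so the differences carry information about eta
print(diffs.mean())        # a natural estimate of eta; compare it with 0

If the mean difference is far from zero relative to its variability, that is evidence that η ≠ 0, which is exactly the question posed above.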
The focus of interest in the exercise is the relation between the height of a plant
and something that can be controlled by the experimenter, namely whether it is
self or cross-fertilized. The essence of the model is to regard the height as random
with a distribution that depends on the type of fertilization, which is fixed for each
plant. The variable of primary interest, in this instance height, is called the re-
sponse, and the variable on which it depends, the type of fertilization, is called an
explanatory variable or a covariate. Many questions arising in data analysis involve
the dependence of one or more variables on another or others, but virtually limitless
complications can arise.
To test whether a monetary incentive improves performance, an instructor gives students a math test. Before taking the test, half of the students were told that they would receive K1000 for every correct answer. The other half was not given a monetary incentive. The number of correct answers was recorded for each student.
Dependent variable: number of correct answers (this is the score each person receives)
Scale for the dependent variable: ratio
Exercise 2 Take note that the steps outlined and explained above are used in the planning stage of statistical modeling.
2. Which of the steps above simply define and which simply design? Explain your answers.
3. Collect data on the mass of a chicken over three weeks from any farmer who keeps chickens. Whether or not you have actually collected the data, identify the variables in the data that you are to collect. What will be their level of measurement?
(a) A social psychologist thinks that people are more likely to conform to a
large crowd than to a single person. To test this hypothesis, the psychol-
ogist had either one person or five people stand on a busy walking path
on campus and look up. (note: people who are in cahoots with the ex-
perimenter are called confederates). The psychologist stood nearby and
counted the number of people passing by who looked up and the number
who did not look up. Identify the variables in the data that you are to
collect. What will be their level of measurement?
(d) Previous research has shown that playing music helps plants grow taller.
But, does the type of music matter? Does the volume of the music matter?
To test this, seedlings were assigned to a specific music group (country, rock, classical). Then within each of these groups, the music was played at either a low, medium, or high volume. At the end of one month, each plant's height was recorded.
(e) Harvester ants often strip a bush of all of its leaves. Some people believe
this helps the plant grow thicker, healthier stems. In an experiment, a
student stripped off all the leaves from a set of plants. In a second set
of identical plants, the student allowed ants to strip off the plants' leaves. The student measures the plants' stem thickness (in mm) 4 weeks later.
(g) You want to test a new drug that supposedly prevents sneezing in people
allergic to grass. You randomly assign 1/2 the participants to the drug
group and the rest to a placebo control. One half hour later, you have
them sit in a room filled with the grass they are allergic to. You record the
total number of sneezes over the next 30 minutes.
(i) Which method of wound closure produces the least noticeable scarring 12 weeks later: stitches, staples, or steri-strips? A researcher randomly assigns 10 patients to each method. Degree of scarring is measured on a 6-point Likert scale, where 1 = no visible scar and 6 = extremely visible scar.
(j) Which method of learning brain anatomy is more effective: using a coloring book for the brain or the rap song method? Students in the respective groups are scored on an anatomy test after using their method for 2 weeks.
(k) A florist wants to see if Product X or Product Y will extend the life of
cut flowers so that they last longer. Longevity is measured by rating the
health of the flower from 1 (dead) to 10 (no visible deterioration).
(l) Within a classroom setting, subjects were asked to listen to a guest in-
structor. All subjects were given a description of the instructor before
class. Some subjects read a description containing the phrase "People who know him consider him to be a rather cold person...", while other people read a description where the word "warm" was substituted for the word "cold" (otherwise, the descriptions were identical). After the lecture, subjects
were asked to rate the instructor. Subjects who were told the instructor
was warm gave him more favorable ratings compared to subjects who were
told that the instructor was cold. Instructor rating was assessed using a 5
point Likert scale.
(n) Subjects read about a woman who used a particular title, and then rated
her on a number of traits. When the woman used the title Ms. rather
Definition 2 A sample space is the set that contains all the possible outcomes of a
random experiment.
Example 3 If a coin is tossed once, the set of all possible outcomes is {Head, Tail} = {H, T} = Ω
1.1.4 Event
Example 4 A coin is tossed once. Write down all the possible events.
{Head, Tail}
{Head}
{Tail}
{}
1.1.5 Probability
(a) 0 ≤ P(E) ≤ 1
(b) P(Ω) = 1
(c) If E1, E2, …, Ei, … are mutually exclusive events, then P(∪_{i=1}^∞ Ei) = Σ_{i=1}^∞ P(Ei)
Note: An event is usually denoted by the symbol E. The set that contains all possible events, that is, the power set of the sample space, is the largest sigma field. The discussion of sigma fields is beyond the scope of this topic.
Probability is the limit of the ratio of the number of experimental outcomes of interest to the total number of trials as the total number of trials tends to infinity. In other words, if |Ω| is the total number of trials of a random experiment and |E| the total number of occurrences of an event E, then the probability of the event E is given by
P(E) = lim_{|Ω|→∞} |E|/|Ω|
Answer
2. Highlight cell A1 and move the cursor to its lower right corner until a cross appears, then drag the cross down to cell A1000.
Exercise 3 Use Microsoft Excel to simulate the rolling of a die. Check the validity of the assumption that the probability that a die turns up a 3 is 1/6 ≈ 0.1667. Hint: modify the steps in Example 5. This exercise should help you understand the definition of probability.
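The same check can be done outside Excel. Below is a minimal Python sketch (not part of the text's Excel exercise; the sample size is arbitrary) that simulates rolls of a fair die and computes the relative frequency of a 3:

import numpy as np

rng = np.random.default_rng(42)
rolls = rng.integers(1, 7, size=100_000)   # simulated fair-die rolls, values 1..6
prop_three = np.mean(rolls == 3)           # relative frequency of the event {3}
print(prop_three)                          # close to 1/6 ≈ 0.1667

As the number of trials grows, the printed proportion settles near 1/6, which is exactly the limiting-ratio definition of probability given above.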
Definition 5 A random variable is a function that maps a sample space into a set
of real numbers.
Answer
X(ω) = 1 if ω = H, and X(ω) = 0 if ω = T.
Answer
Ω = {HH, HT, TH, TT}
Let the random variable X(ω) represent the number of heads; then we must have X(HH) = 2, X(HT) = 1, X(TT) = 0, X(TH) = 1.
You can also write your answer as the table below:
ω          HH   HT   TH   TT
X(ω) = x    2    1    1    0
The probability distribution of X is then:
Number of Heads   0      1      2
Probability       0.25   0.50   0.25
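This distribution can be derived mechanically by enumerating the sample space. The following short Python sketch (illustrative, not from the text) does so for two tosses:

from itertools import product
from collections import Counter

outcomes = list(product("HT", repeat=2))          # sample space {HH, HT, TH, TT}
counts = Counter(w.count("H") for w in outcomes)  # X(w) = number of heads
for x in sorted(counts):
    print(x, counts[x] / len(outcomes))           # 0: 0.25, 1: 0.5, 2: 0.25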
Example 10 The sample mean, X̄ = (1/n) Σ_{i=1}^n xi, is a statistic which is used to estimate the population mean, µ, which is a parameter.
A general parameter is usually denoted by the symbol θ. For example, we can say θ = µ or θ = σ².
Having discussed the basic terms in Section 1.1 above, we can now safely define the term statistical model. However, before we define the term, let us consider the following example.
Example 11 Suppose the probability distribution given in this table is the distribution of the ages of students in your class.
Age           30     41     52
Probability   0.25   0.50   0.25
If your class has 1000 students, how many students do you expect to be aged 30, 41, or 52?
Answer
Age                  30    41    52
Number of Students   250   500   250
If you went somewhere to collect data and found that the data you collected looked like the one in Table 1.2, you would naturally conclude that its probability distribution must be the one in Table 1.1.
Similarly, consider another situation in which you are counting the number of girls per family in a sample of 9 families in Zomba, and you have collected the following data set.
Family Id         1   2   3   4   5   6   7    8    9
Number of Girls   4   4   7   8   7   8   10   12   5
Then there must be a function that is responsible for the generation of this data set. We will consider such a function to be a probability distribution. There is one true probability distribution that is responsible for the generation of this data. Since we do not know which probability distribution it is, we can list infinitely many functions that could possibly fit this data set.
This discussion should help you to understand the three possible definitions of sta-
tistical model below.
Definition 11 A statistical model is the pair (Ω, P), where Ω is the sample space and P is the set of possible probability distributions on the sample space.
Exercise 5 (i) Describe the main principle behind statistical modeling with refer-
ence to the definitions given in this section.
(ii) In your own words explain how a model can be parametric, non-parametric or semi-parametric.
(i) The random variable X is said to follow the normal distribution with mean µ and variance σ² if and only if its probability density function (pdf) is
f(x) = (1/(σ√(2π))) exp(−(1/2)((x − µ)/σ)²),   x ∈ R
(ii) The random variable X is said to follow the Poisson distribution with mean λ if and only if its probability mass function (pmf) is
Pr(X = k) = e^(−λ) λ^k / k!,   k = 0, 1, 2, 3, …
(iii) The random variable X is said to follow the binomial distribution with parameters (n, p) if and only if its probability mass function (pmf) is
P(X = x) = (n choose x) p^x (1 − p)^(n−x),   x = 0, 1, 2, …, n
(iv) The random variable X is said to follow an exponential distribution with parameter λ > 0 if and only if its probability density function is
f(x) = λe^(−λx) for x ≥ 0, and f(x) = 0 for x < 0.
All four models in subsection 1.3.1 are parametric, since for each of these models there is a finite set Θ that contains all the parameters of the model, so that specifying Θ is as good as specifying the probability distribution of the model.
Example 12 Consider the normal distribution model in subsection 1.3.1, and specify the set Θ. For instance, one may take Θ = {(µ, σ²) : µ ∈ R, σ² > 0}.
Exercise 6 Repeat this example for the remaining models in subsection 1.3.1.
F(t) = 1 − exp(−∫₀ᵗ λ₀(u) e^(βx) du),
where x is the covariate vector, and β and λ₀(u) are unknown parameters, Θ = (β, λ₀(u)). Here β is finite-dimensional and is of interest; λ₀(u) is an unknown non-negative function of time (known as the baseline hazard function) and is often a nuisance parameter. The collection of possible candidates for λ₀(u) is infinite-dimensional.
(b) Which type of model is the standard normal distribution? Justify your answer.
We would like to differentiate the terms general and generalized linear models. To understand the difference, let us consider the following examples and exercises. We urge the reader to go through the examples and the exercises before attempting to differentiate these terms.
Yi = 9.5 + 2.5Xi + εi
(i) Yi is the value of the response variable in the ith trial, which is the mass of a goat
(ii) Xi is a known constant, namely, the value of the independent variable in the ith trial
(iii) εi is a random variable with mean E(εi) = 0 and variance σ²(εi) = σ²; the εi's are uncorrelated, so that σ(εi, εj) = 0 for all i, j with i ≠ j.
(a) Explain the reason why the model so constructed is a statistical model.
(b) Suppose a 45-week-old goat has a mass of 108. What is the expected mass of the goat at this age? What is the value of the error term at this age?
Answer
(a) Since εi is random and 9.5 + 2.5Xi is a constant, Yi is a random variable and hence the model is a statistical model.
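For part (b), a sketch of the computation using the stated model: the expected mass at 45 weeks is
E(Y) = 9.5 + 2.5(45) = 122
so the value of the error term at this age is ε = 108 − 122 = −14.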
NB The message in this is that the function f(Xi) = 9.5 + 2.5Xi is not a statistical model, but the function g(Xi) = 9.5 + 2.5Xi + εi is a statistical model, based on the assumptions given in the problem. This illustrates one role of the random error in statistical modeling.
Exercise 8 Using the same assumptions as in the example, we can generalize John's model as
Yi = β0 + β1 Xi + εi
Find:
(i) E(Yi)
(ii) σ(Yi, Yj)
(iii) σ²(Yi)
(i) Calculate the expected salary for an employee whose beginning salary is
1000. If the estimated random error for this employee is -920, calculate his
current salary.
(ii) John discovered that the current salary of the employee whose beginning
salary is 2000 is 5800. Calculate the expected current salary and the value
of the error term.
Remarks
Now the model which has been suggested by John can be considered to be of the
form
Y i = β0 + β1 X i + εi
That is, linear regression models have to satisfy two key assumptions: first, the error terms are independent and identically distributed (iid), each following a normal distribution with zero mean and variance σ²; second, the matrix X has to be non-random and of full column rank. However, a common confusion is that, if we assume the error terms are normally distributed, the second assumption seems to imply that all the explanatory variables are independent of one another. To resolve this puzzle we generalize the assumptions that John makes as follows.
3. Normality. We assume that the error terms are normally distributed. While it is possible to fit a linear regression model to data where the errors are not normally distributed (much like it is possible to fit a linear regression model to data that clearly do not follow a linear trend), this is inadvisable, and linear regression is then seen as inappropriate. There are other regression models (generalized linear models is the term that is typically used) that are appropriate in cases where the error terms are not normally distributed.
While it makes sense for X to be of full rank, this does not necessarily need to be the
case. There are numerous benefits to X being of full rank and it allows for maximum
interpretability. However, one can conduct inference on parameters without X being
of full rank. There are also methods that are designed to take non-independent IVs
and project them such that they will be independent in your analysis.
For our example above, we can assume that the error terms are iid Normal(0, σ²), but only if this assumption makes sense. If you know from the subject matter or from your data that the assumptions of independence, normality, or equality of variances are violated, then perhaps a linear regression model is not appropriate. In this case, we would suggest looking into ways to transform your data to ensure the conditions are met, or researching different types of models (e.g. generalized linear models) that are designed to account for data that do not satisfy the four linear-model assumptions mentioned above.
Generalized linear models extend the last assumptions. They generalize the possi-
ble distributions of the residuals to a family of distributions called the exponential
family. This family includes the normal as well as the binomial, Poisson, negative
binomial, and gamma distributions, among others. Common examples are logistic,
Poisson, and probit models.
When you change the distribution of the residuals, it turns out that the relationship
between Y and the model parameters is no longer linear. However, for each distri-
bution in the exponential family, there exists at least one function f (µ) of the mean
of Y whose relationship with the model parameters is linear. This function is called
the link function.
The link function you choose will depend on which distribution you are choosing for
the outcome variable. For example, a binomial residual can use a probit or a logit
link function. A Poisson residual uses a log link function.
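To make the idea of a link function concrete, here is a minimal Python sketch (illustrative only; the text itself works in SPSS and Stata, and the parameter values below are made up) that fits a Poisson generalized linear model with its log link using the statsmodels package:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(0, 2, size=200)
X = sm.add_constant(x)              # design matrix with an intercept column
mu = np.exp(0.5 + 1.2 * x)          # log link: log(mu) = 0.5 + 1.2x
y = rng.poisson(mu)                 # Poisson response with mean mu

model = sm.GLM(y, X, family=sm.families.Poisson())  # log link is the default
result = model.fit()
print(result.params)                # estimates close to (0.5, 1.2)

The relationship between y and the parameters is not linear, but log(µ) is linear in the parameters, which is exactly what the link function provides.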
Exercise 9 State 3 differences between the general linear model and the generalized
linear model.
Introduction
Factor analysis is designed for interval data, although it can also be used for ordinal
data (e.g. scores assigned to Likert scales). The variables used in factor analysis
should be linearly related to each other. This can be checked by looking at scatter-
plots of pairs of variables. Obviously the variables must also be at least moderately
correlated to each other, otherwise the number of factors will be almost the same
as the number of original variables, which means that carrying out a factor analysis
would be pointless.
(b) State two major assumptions of factor analysis. Read the last paragraph to un-
derstand this.
The factor analysis model can be written algebraically as follows. If you have p variables X1, X2, …, Xp measured on a sample of n subjects, then variable i can be written as a linear combination of m factors F1, F2, …, Fm where, as explained above, m < p. Thus,
Xi = a_{i1} F1 + a_{i2} F2 + ⋯ + a_{im} Fm + εi
where the a_{ij}'s are the factor loadings (or scores) for variable i and εi is the part of variable Xi that cannot be explained by the factors.
(a) Define factor analysis Model. Read and understand the last paragraph.
(a) Define regression analysis. Read and understand the last paragraph.
Path analysis is an extension of the regression model. In a path analysis model, two or more causal models derived from the correlation matrix are compared. The path of the model is shown by a square and an arrow, which shows the causation. Regression weights are predicted by the model. Then the goodness-of-fit statistic is calculated in order to assess the fit of the model.
A path model is a diagram which shows the independent, intermediate, and dependent variables. A single-headed arrow shows the cause for the independent, intermediate and dependent variables. A double-headed arrow shows the covariance between two variables.
Exogenous variables have their causes lying outside the model, while endogenous variables are determined by variables within the model.
The model above shows the factors that affect weight loss. Study the figure to
understand path modeling.
In each example, the researcher believes, based on theory and empirical research, that sets of variables define the constructs that are hypothesized to be related in a certain way. The goal of SEM analysis is to determine the extent to which the theoretical model is supported by sample data. If the sample data support the theoretical model, then more complex theoretical models can be hypothesized. If the sample data do not support the theoretical model, then either the original model can be modified and tested again, or other theoretical models need to be developed and tested.
Many kinds of data, including observational data collected in the human and biolog-
ical sciences, have a hierarchical or clustered structure. For example, children with
the same parents tend to be more alike in their physical and mental characteristics
than individuals chosen at random from the population at large. Individuals may be
further nested within geographical areas or institutions such as schools or employ-
ers. Multilevel data structures also arise in longitudinal studies where an individual's responses over time are correlated with each other.
Multilevel models recognise the existence of such data hierarchies by allowing for
residual components at each level in the hierarchy. For example, a two-level model
which allows for grouping of child outcomes within schools would include residu-
als at the child and school level. Thus the residual variance is partitioned into a
between-school component (the variance of the school-level residuals) and a within-
school component (the variance of the child-level residuals). The school residuals, often called school effects, represent unobserved school characteristics that affect child outcomes.
The level signifies the position of a unit of observation within the hierarchy.
4. Make current salary your y-variable, beginning salary your x-variable, and em-
ployment category your column variable.
5. Click OK. (Hint: You must obtain a graph like the one in Figure 1.1.)
2. To extend our model beyond a single category, we need to allow for the variation
in patterns among different subjects. For example, a quick glance at Figure
1.1 shows that some of the subjects are consistently better than others.
3. To make our model more realistic, we allow the intercept in Model 1.1 to vary from subject to subject. Writing (Current Salary)ij for the ith current salary in the jth category, we have
4. Notice that the intercept β0j now has a subscript j, indicating that it will
vary from subject to subject(=category). We now assume that the individual
intercepts follow a Normal distribution with variance τ0 . This gives the model.
5. Model 1.2 accounts for the variation in the individual measurements on a single subject, while Model 1.3 accounts for the variation from one subject to another. The combination of these two models gives what is known as a multilevel model.
Exercise 15 Now using substitution combine models 1.3 and 1.2. What do you get?
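Models 1.2 and 1.3 are not reproduced in this extract, so the following substitution is only a sketch assuming their usual forms, consistent with steps 3 and 4 above:
(Current Salary)ij = β0j + β1 xij + εij        (assumed form of Model 1.2)
β0j = β0 + u0j, with u0j ∼ N(0, τ0)            (assumed form of Model 1.3)
Substituting the second equation into the first gives
(Current Salary)ij = β0 + β1 xij + u0j + εij   (1.4)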
Equation 1.4 is an example of a multilevel model. The feature that distinguishes this model from an ordinary regression model is the presence of two random variables: the measurement-level random variable εij and the subject-level random variable u0j. Because multilevel models contain a mix of fixed effects and random effects, they are sometimes known as mixed-effects models.
(b) State the difference between a regression model and a multilevel model.
2.1.1 Introduction
There are many types of statistical models, as discussed above. Each type may require a different approach for it to be built. However, the following steps are needed for any statistical model that can be built, as far as modern statistical research is concerned.
The order in which the steps are followed is immaterial and, in addition, some steps can be skipped depending on the type of data used in modeling or the proposed model to be built or fitted. Hence, we list all the steps in this subsection and then explain them in the next subsections. Any statement inside brackets () can be ignored, because it is simply supplementary.
2. Designing the study: This involves specifying the type of study design that you will adopt. (In statistical modeling, study design mostly means the type of randomization and sampling criteria that will be adopted, the most common being
This affects the validity of the assumptions that you can use in building a model. For example, if randomization has been applied, you can assume that the sample consists of independent observations. Make sure that you understand this discussion.
It's absolutely vital that you know the level of measurement of each response and predictor variable, because they determine both the type of information you can get from your model and the family of models that is appropriate.
Now the following practical exercises should help you understand the use of STATA in exploring data, and the last 4 steps outlined above. We are using Stata 12, but similar steps can be used in higher versions of Stata.
(a) You must be able to identify the five main windows of the Stata program: RESULTS, REVIEW, COMMANDS, PROPERTIES, and VARIABLES. Discuss with your friend what you think are the uses of the windows you can see.
i. Now close some of the windows. We believe you are able to close the review, the variables and the properties windows.
ii. Try to reopen the windows. There are many ways of doing this. One way is this: press Ctrl and hold it, and while holding it press 3, 4 and 5. Try other ways of reopening the windows.
i. Go to edit
ii. Select preferences
iii. Select load preference set
iv. Choose your preferred layout: for example, choose the combined layout.
i. Try to change the size of the windows in the layout chosen above.
ii. Follow the same steps as in (c), but select save preference set instead; name it myview and click OK.
iii. Try to load any other layout and then reload the layout you created.
iv. You should be able to delete this layout.
(b) Open the SPSS 20 program and use it to open the Employee data that is found in the Program Files folder. (These are the possible steps you might have followed: File, Open, Data, look in C, Program Files (x86), IBM, SPSS, Statistics, 20, Samples, English, Employeedata.sav, Open.)
(c) Save this data using the name Employee and the Stata version 6 extension in the folder you have created. Make sure that you are able to do this.
Exercise 17 Create an Excel data file and then convert this data file into Stata format.
(a) Using the do-file editor and opening a log book. NB: Comments may be added to programs in three ways: begin the line with *; begin the comment with //; or place the comment between /* and */ delimiters. Follow the following steps:
i. Open the do file (by running the command doedit in the command window) and save it as basics in the same folder stataintro.
ii. Change the directory, by typing and running the following command
in the do file
cd"C:\Users\Administrator\Desktop\stataintro"
iii. Open log file by typing and running the following command in the do
file
log using mylog.log // opening log book
iv. Type and run the following command in the do file
use employee, clear
v. Type and run the following command in the do file
table jobcat, contents(freq mean salbegin mean salary)
vi. Now close the log book by typing and running the following command
in the command window
log close
NB: logical operators in Stata: != (or ~=), >, <=, >=, ==, & (and), | (or)
1) [sum newsalary]
2) [gen salcat=.]
3) [replace salcat=1 if newsalary < 300]
4) [replace salcat=2 if newsalary >=305 & newsalary <1000]
5) [replace salcat=3 if newsalary >=1005]
6) [tab salcat]
7) [codebook salcat]
8) Drop the missing value [drop if salcat==.]
Example 18 Run the following commands in the same do file you have been
using in the above exercises
-----------------------------------------------------
      salary |     Coef.   Std. Err.       t     P>|t|
-------------+---------------------------------------
    salbegin |   1.90945    .0474097    40.28    0.000
       _cons |  1928.206    888.6799     2.17    0.031
-----------------------------------------------------
In this case you have fitted a straight-line model (a simple linear regression), and the scatter plot below suggests that we can assume a straight-line relationship between the variables. Now let y = the current salary of the employee and x = the beginning salary of the employee; then the table is telling us that the model is
y = 1.90945x + 1928.206
Exercise 21 From the discussion in the examples and exercises above, you should be able to explain why each of the following steps is important in statistical modeling.
(e) These four steps constitute a stage known as the initial modeling stage. Which of these steps do you think are just preparatory and which are just exploratory? Explain your answers.
(f) Repeat Example 18 for the variable you created in Exercise 18 and the beginning salary.
10. Refine the model and check the model fit: If you are doing a truly
exploratory analysis, or if the point of the model is pure prediction, you can
use some sort of stepwise approach to determine the best predictors.
(a) Test, and possibly drop, interactions and quadratic terms, or explore other types of non-linearity.
Because you already investigated the right family of models in stage one, thoroughly investigated your variables in Step 8, and correctly specified your model in Step 10, you should not have big surprises here. Rather, this step will be about confirming, checking, and refining. But what you learn here can send you back to any of those steps for further refinement.
12. Check for and resolve data issues Steps 11 and 12 are often done together,
or perhaps back and forth. This is where you check for data issues that can
affect the model, but are not exactly assumptions.
Data issues are about the data, not the model, but they occur within the context of the model.
(a) Multicollinearity
Once again, data issues don't appear until you have chosen variables and put them in the model.
You may not notice data issues or misspecified predictors until you interpret the coefficients. Then you find something like a very large standard error or a coefficient with a sign opposite to what you expected, sending you back to previous steps.
Model specification involves three distinct stages. It is the first and most critical of all stages in modeling: our estimates of the parameters of a model and our interpretation of them depend on the correct specification of the model.
Consequently, problems can arise whenever we misspecify a model. There are two
basic types of specification errors. In the first, we misspecify a model by including
in the regression equation an independent variable that is theoretically irrelevant.
In the second, we misspecify the model by excluding from the model equation an
independent variable that is theoretically relevant.
Specification errors
There are basically two types of specification errors.
(a) Misspecification of the model by including in the regression equation an independent variable that is theoretically irrelevant.
(b) Misspecification of the model by excluding from the model equation an independent variable that is theoretically relevant.
Model estimation deals with estimating the values of parameters in a model, based on measured/empirical data that has a random component. The model parameters describe an underlying physical setting in such a way that their values affect the distribution of the measured data. An estimator attempts to approximate the unknown parameters using the measurements.
(a) The probabilistic approach which assumes that the measured data is random
with probability distribution dependent on the parameters of interest.
(b) The set-membership approach assumes that the measured data vector belongs
to a set which depends on the parameter vector.
As you have noted, estimating a model simply means estimating the parameters in the model. There are two main ways of estimating parameters: maximum likelihood estimation and least squares estimation.
Solution
L = ∏_{i=1}^n θe^(θxi) = θⁿ e^(θ Σ_{i} xi)
Solution
The probability density function of the exponential distribution is defined as
f(x; λ) = λe^(−λx) if x ≥ 0, and f(x; λ) = 0 if x < 0
Its likelihood function is
L(λ, x1, …, xn) = ∏_{i=1}^n f(xi, λ) = ∏_{i=1}^n λe^(−λxi) = λⁿ e^(−λ Σ_{i=1}^n xi)
To maximize it we solve
d ln(L(λ, x1, …, xn))/dλ = 0
for λ. Since
d ln(L(λ, x1, …, xn))/dλ = d ln(λⁿ e^(−λ Σ_{i=1}^n xi))/dλ   (2.1)
                         = d(n ln(λ) − λ Σ_{i=1}^n xi)/dλ    (2.2)
                         = n/λ − Σ_{i=1}^n xi                (2.3)
Finally we get
λ̂ = n / Σ_{i=1}^n xi
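To see the estimator behave as derived, here is a small Python sketch (illustrative, not from the text) that compares the closed-form MLE n/Σxi with a direct numerical maximization of the log-likelihood:

import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
x = rng.exponential(scale=1 / 2.5, size=10_000)  # simulated data, true lambda = 2.5

closed_form = len(x) / x.sum()                   # lambda-hat = n / sum(x_i)

def nll(lam):
    # negative log-likelihood of the exponential model
    return -(len(x) * np.log(lam) - lam * x.sum())

numeric = minimize_scalar(nll, bounds=(1e-6, 100), method="bounded").x
print(closed_form, numeric)                      # both close to 2.5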
f(x|θ) = 1/(θ2 − θ1) for θ1 ≤ x ≤ θ2, and 0 otherwise.
Suppose that θ1 and θ2 are unknown. Find L(θ1, θ2).
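As a sketch of the expected answer (not worked in the text): each observation contributes a factor 1/(θ2 − θ1) whenever it lies in [θ1, θ2], so
L(θ1, θ2) = ∏_{i=1}^n f(xi|θ) = (θ2 − θ1)^(−n) if θ1 ≤ min(xi) and max(xi) ≤ θ2, and 0 otherwise.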
f(x; θ) = (θ + 1)x^θ
3.1 Introduction
Example 21 Suppose we want to model the relationship between the height and speed of athletes. In this case height is an explanatory variable and speed is the response variable.
A basic simple linear regression model has only one independent variable. It is also called the prediction equation. The model can be stated as follows:
Yi = β0 + β1 Xi + εi
where Yi is the value of the response variable in the ith trial; β0 is the intercept of the line; β1 is the slope (gradient) of the line; β0 and β1 are parameters; Xi is the value of the independent variable in the ith trial; and εi is a random error term with mean E{εi} = 0 and variance var(εi) = σ².
(a) εi are independent (uncorrelated), i.e. covariance(εi, εj) = 0 for all i and j, i ≠ j;
(c) Xi are precisely measured (are not random variables) and continuous.
(d) Yi is a continuous random variable. Thus you can conclude that E(Yi) = β0 + β1 Xi, because E{εi} = 0, and var(Yi) = σ². Hence Yi ∼ N(β0 + β1 Xi, σ²).
Line of best fit: the line that minimizes the sum of squares of vertical deviations in the Yi's. We use the least squares estimation (LSE) method to estimate the model parameters β0 and β1. The method involves finding good estimators of β0 and β1 that minimize the sum of squares of vertical deviations. That is to say, for each sample observation (xi, yi), the method of LSE considers the deviation of yi from its expected value.
Let S = Σ_{i=1}^n (yi − (β0 + β1 xi))². To estimate β0 and β1, we use the following steps:
(a) Differentiate S partially with respect to β0 and β1.
(b) Set ∂S/∂β0 = 0 and ∂S/∂β1 = 0 and solve for β̂0 and β̂1 respectively.
After solving for β̂1 and β̂0 we get
β̂0 = ȳ − [Σ_{i=1}^n (yi − ȳ)(xi − x̄) / Σ_{i=1}^n (xi − x̄)²] x̄ = ȳ − β̂1 x̄.
What is the value of β̂1?
We use sample data to estimate the parameters β0 and β1. The model becomes ŷi = β̂0 + β̂1 xi, the regression estimator for sample data.
3.4 Interpretation
The fitted model ŷi = β̂0 + β̂1 xi means that for any unit change in xi, ŷi changes by β̂1 units.
Example 23 Table 3.1 shows speed data for 12 kindergarten children and their body weights.
(a) Find β0 and β1 and fit a regression line that best fits the data.
speed (m/min)   5.4   3.4   6.3   3.2   7.5   8.1   9.1   11.5  12.1  14.7  18.5  8.1
weight (kg)     8.7   9.2   11.2  11.5  11.6  11.6  12.3  13.7  4.7   15.2  17.5  18.1
Solution
(a) β̂0 = 2.673, β̂1 = 0.522. Therefore, the fitted regression model is:
ŷi = 2.673 + 0.522xi
(b) Thus the speed of a child weighing 10 kg is ŷi = 2.673 + 0.522(10) = 7.89 m/min.
Interpretation: For any unit change in the body weight of the child, the speed increases by 0.522 m/min.
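The estimates above can be reproduced with a few lines of Python (illustrative; the text itself uses SPSS and Stata), applying the least squares formulas from the previous section to the data of Table 3.1:

import numpy as np

weight = np.array([8.7, 9.2, 11.2, 11.5, 11.6, 11.6, 12.3, 13.7, 4.7, 15.2, 17.5, 18.1])
speed = np.array([5.4, 3.4, 6.3, 3.2, 7.5, 8.1, 9.1, 11.5, 12.1, 14.7, 18.5, 8.1])

# least squares estimates: b1 = Sxy / Sxx, b0 = y-bar - b1 * x-bar
b1 = ((weight - weight.mean()) * (speed - speed.mean())).sum() / ((weight - weight.mean()) ** 2).sum()
b0 = speed.mean() - b1 * weight.mean()
print(b0, b1)   # approximately 2.673 and 0.522, matching the solution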
4. Click Speed and then click Arrow next to the Dependent: box.
5. Click Weight, and then click the arrow next to the Independent(s):box (At this
point your screen should look like the one in figure 3.1).
6. Click Statistics and make sure that Estimates and Model fit are checked
7. Click Continue
8. Click Ok
Example 25 Transforming random variables. We can predict the speed of the child at a given weight. To do this we are going to transform our data and generate a new variable called Predicted-speed. Procedure:
3. Click Type and Label box, and then type predicted speed of the child in the box
titled Label
4. Click Continue
6. Click OK at the bottom and go back to our data. SPSS has now computed a new variable with the predicted speed at a given body weight of the child.
Your data will now look like the one in the figure below. You can also work out the same problem in STATA.
1. Save the SPSS file you created above as a Stata file. Follow the following procedures to do that.
4. Choose the location where your stata file is to be saved e.g Desktop
5. On Save as type, click on the arrow down at the far right and choose Stata Version 6 (*.dta). By now you should have a Stata file.
Table 3.2: Daily temperature and number of patrons visiting a public swimming pool over 20 days
Exercise 23 Table 3.2 shows the daily temperature and the number of patrons visiting a public swimming pool over 20 days.
(a) Compute β0 and β1 and write down the Prediction Equation manually.
(b) In SPSS enter the data, and follow the above procedures to compute
(i) β0 and β1 .
(c) Do the same problem in STATA and follow the above procedures
3.6 CORRELATION
This is a measure of the degree of linear relationship between two (quantitative) random variables. A linear relationship can also be shown using scatter plots.
1. In Stata data file you saved, type the command scatter Speed Weight. This will
give you the Scatter plot below.
5. Click OK.
Figure 3.2: A scatter plot of child speed and body weight in STATA
Figure 3.3: A scatter plot of child speed and body weight in SPSS
The above pattern shows a linear relationship between the speed and weight of the child.
Suppose we have a simple random sample of paired observations, (x1, y1), (x2, y2), …, (xn, yn), with the xi's coming from population characteristic X and the yi's coming from population characteristic Y. Then the population correlation is:
ρxy = Cov(x, y)/√(Var(x) Var(y)) = Σ(X − X̄)(Y − Ȳ)/√(Σ(X − X̄)² Σ(Y − Ȳ)²) = [Σ_{r=1}^N (Xr − X̄)(Yr − Ȳ)/N]/(SX SY)
Propositions: −1 ≤ Corr(X, Y) ≤ 1, and for the sample correlation, −1 ≤ rxy ≤ 1.
Interpretation:
Example 27 Using the data in the previous example on child speed and weight, show whether there is a linear relationship between the speed of the child and its body weight.
(a) Use the formula given above to find rxy and draw a conclusion on whether there is a linear relationship between the speed and weight of the child.
(c) Drag Speed and Weight to the Variables box. Make sure Pearson is marked.
(e) In STATA type the following command: corr Speed Weight. You will have the following output.
Thus r = 0.429 implies that there is a weak positive linear relationship between the speed and body weight of the child.
HYPOTHESIS TESTING ABOUT THE POPULATION CORRELATION (ρxy)
This involves testing hypotheses about the population correlation coefficient ρxy using the sample correlation r. We define H0 and H1 as H0: ρxy = 0 (no linear relationship between X and Y) and H1: ρxy ≠ 0 (there is a significant linear relationship between X and Y); alternatively H1: ρxy > 0 (for a positive relationship) or H1: ρxy < 0 (for a negative relationship).
The test statistic is t = r√(n − 2)/√(1 − r²) ∼ T_{n−2}, at significance level α.
Example 28 Using the above exercise on correlation, from the output below we find that t = 1.50 with p-value 0.164. Since t = 1.50 < 2.201, the critical value at α = 0.05, we have no evidence to reject H0; hence there is no significant linear relationship between the speed and weight of the child.
------------------------------------------------------------------------------
Speed | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
Weight | .5218265 .3476604 1.50 0.164 -.2528092 1.296462
_cons | 2.673217 4.389698 0.61 0.556 -7.107639 12.45407
------------------------------------------------------------------------------
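The r and t values quoted above can be verified with a short Python sketch (illustrative, not from the text), using the data of Example 23:

import numpy as np

weight = np.array([8.7, 9.2, 11.2, 11.5, 11.6, 11.6, 12.3, 13.7, 4.7, 15.2, 17.5, 18.1])
speed = np.array([5.4, 3.4, 6.3, 3.2, 7.5, 8.1, 9.1, 11.5, 12.1, 14.7, 18.5, 8.1])

r = np.corrcoef(weight, speed)[0, 1]        # sample correlation, about 0.429
n = len(weight)
t = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)  # test statistic, about 1.50
print(r, t)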
This is the measure of the total variation accounted for by the regression of the dependent variable on the independent variable. It is given as
r² = S²xy / (Sxx Syy)
It is interpreted as follows: large values of r² imply high variation in y due to changes in X, and vice versa (the bigger the value of R-squared, the bigger the variation in Y that has been explained by X).
This deals with measuring the lack of fit, to check whether the straight line best fits the data.
Lack of fit
The key statistic used is the residual sum of squares, Σ(yi − ŷi)². If the residual sum of squares is large, then there is lack of fit. The residual sum of squares consists of two components.
The pure error sum of squares is estimated using replicate observations (several different values of y at the same x). The y's are assumed independent for a fixed x. The pure error sum of squares is given by Σᵢ Σⱼ (yij − ȳi·)², where ȳi· is the mean of the replicate observations at xi.
The estimate of σ² (pure error) = (pure error sum of squares)/(Σᵢ ni − k), where k is the number of distinct x values.
The degrees of freedom for the lack of fit sum of squares is k − 2, where k is the number of sets of replicates.
Hypothesis testing about model appropriateness
H0: the straight-line model is appropriate.
H1: the straight-line model is not appropriate.
Rejecting H0 implies that the lack of fit contributes a large value to the residual sum of squares compared to the pure error sum of squares.
The test statistic is F = MS(lack of fit)/MS(pure error) ∼ F(k − 2, Σᵢ ni − k) at level α/2.
Example 29 Using the data on child speed and body weight analysed previously, computing the residual sum of squares, pure error sum of squares, lack of fit sum of squares and F-statistic gives the following results.
Residual SS = 185.857, pure error SS = 41.871, and lack of fit SS = 185.857 − 41.871 = 143.986. F-statistic = 18.544, F(1, 10, 0.025) = 2.25. Thus F-statistic > F(1, 10, 0.025); we reject H0 and conclude that the straight-line model above does not appropriately fit the data values.
Exercise 24 Using the data on number of patrons visiting the public swimming pool
and temperature, test whether the straight line model is appropriate or not.
• Open the SPSS file you saved with data on public swimming pool provided in
this book.
• Click on Statistics and make sure that the following are marked, Estimates,
model fit and R-squared change.
NB: In STATA use the following command: regress No_patrons Temperature
This will give you an ANOVA table with the values of the residual SS and pure error SS, from which you can compute the lack of fit sum of squares and F(k − 2, Σᵢ ni − k), and then make a decision about the model fit.
4.1 Introduction
We discuss some examples to show how multiple regression analysis can be used to handle scenarios that cannot be solved by simple regression. We begin with an example on a simple variant of the wage equation for obtaining the effect of education on hourly wage, y,
y = β0 + β1x1 + β2x2 + µ    (1)
where x1 is education and x2 is years of labour market experience. Thus wage is de-
termined by two explanatory or independent variables, namely, education and labour
experience and by other unobserved variables, which are contained in µ. We are still
interested in the effect of education level on wage, holding fixed all other factors af-
fecting wage. This means we are interested in the parameter β1 . When we compare
equation (1) with a simple regression relating wage to education level, the equation
(1) effectively takes experience out of the error term (noise term) and puts it explic-
itly in the equation. Because experience appears in the equation, its coefficient, β2 ,
measures the effect of experience on wage, which is also of some interest.
As another example, we consider the problem of explaining the effect of per-student expenditure on average standardized test scores at high school level. Suppose that the average test score depends on funding, average family income and other unobservable factors (noise).
Generally, we can write a model with two independent variables, illustrating the two previous examples, as
y = β0 + β1x1 + β2x2 + µ
where β1 measures the change in y with respect to x1 and β2 the change in y with respect to x2, holding the other factors fixed. β0 is the value of the outcome when both explanatory variables have values of zero. β1 and β2 are also called gradients for the explanatory variables. These gradients are the regression coefficients, which tell you how much change in the outcome, y, is predicted by a unit change in that explanatory variable.
Though it can be difficult to visualize a linear model with two explanatory variables, we may add a third axis to the two-way Cartesian plane and view the fitted model as a plane through the scatterplot, as shown below.
Once we are in the context of multiple regression, there is no need to stop with two
independent variables when there are more than two explanatory variables explaining
a particular variable. If we happen to have other explanatory variables that are
affecting the response, all we have to do is to add these explanatory variables into
the linear equation. Multiple regression analysis allows many observed factors to
affect y. In the wage example, we may add the amount of job training, years of tenure with the current employer, measures of ability, and even demographic variables like number of siblings or mother's education. The general multiple linear regression
model can be written in the population as
y = β0 + β1 x1 + β2 x2 + β3 x3 + ... + βk xk + µ
where β0 is the intercept (as pointed out above), β1 is the parameter associated with independent variable x1, β2 is the parameter associated with independent variable x2, and so on. Since there are k independent variables and an intercept, the equation above contains k + 1 unknown population parameters. The variable µ is the error term or disturbance or noise term. It contains variables or factors other than x1, x2, x3, …, xk that affect y. No matter how many factors we incorporate in our model,
there will always be factors we cannot include, and these are collectively contained
in µ. This random term is assumed to be normally distributed with mean zero and
variance σ 2 . Terminology for multiple regression is similar to that for simple regres-
sion and is given in the table below
Although the model formulation appears to be a simple generalization of the model
with one independent variable, the inclusion of several independent variables creates
a new concept in the interpretation of the regression coefficients. Thus, when applying a multiple regression model, we must know how to interpret the parameters.
In multiple regression we are interested in what happens when each variable is varied
one at a time, while not changing the values of any others. This is in contrast to
performing several simple linear regressions, using each of these variables in turn,
but where each regression ignores what may be occurring with the other variables.
Therefore, in multiple regression, the coefficient attached to each independent vari-
able should measure the average change in the response variable associated with
changes in that independent variable, while all other independent variables remain
fixed. This is the standard interpretation for a regression coefficient in a multiple
regression model.
Suppose manager salary (salary) is related to company sales (sales) and manager tenure (manten) with the firm by
log(salary) = β0 + β1 log(sales) + β2 manten + β3 manten² + µ
This fits into the multiple regression model with three variables (k = 3) by defining y = log(salary), x1 = log(sales), x2 = manten and x3 = manten². The parameter β1 is the elasticity of salary with respect to sales when all other variables are kept constant. If β3 = 0, then 100β2 is approximately the percentage increase in salary when manten increases by one year, with all other factors constant. However, when β3 ≠ 0, the effect of manten on salary is more complicated to describe.
The following are the assumptions of Multiple Regression with k variables written
above
• Constant variance: The variance of the µi's is constant for all values of the xi's. This is detected by residual plots of ej = yj − ŷj against ŷj or the xi's. If these residual plots show a rectangular band, we can assume constant variance. Otherwise, non-constant variance exists and must be corrected.
• The variables xj are considered fixed quantities (not random variables); that is, the only randomness in y comes from the error term µ.
In simple linear regression, the F test from the ANOVA table is equivalent to the two-
sided test of the hypothesis that the slope of the regression line is zero. For multiple
regression, there is a corresponding ANOVA F test, but it tests the hypothesis that
all regression coefficients (except the intercept 0) are zero. The ANOVA Table for
Multiple Regression is given below
where SSR = Σ(ŷi − ȳ)², SSE = Σ(yi − ŷi)² and SST = Σ(yi − ȳ)². MSR is the variance attributed to the model and MSE is the variance that is unaccounted for (due to error).
H0 : β1 = β2 = β3 = ... = βk = 0
H1: βj ≠ 0 for at least one j, j = 1, 2, 3, …, k
We test the general null hypothesis, H0 , by calculating the F ratio and comparing
it with the critical point F(k,n−k−1,1−α) where k is as defined above, n is sample size,
and α is significance level that is preselected. We reject H0 if computed F exceeds
F(k,n−k−1,1−α) in value. Rejection of H0 implies that at least one of the regressor
variables x1 , x2 , ..., xk contributes significantly to the model. The testing procedure
involves analysis of variance partitioning of the total of the sum of squares SST into
a sum of squares due to the model or regression and a sum of squares due to residual
(or error), say SST = SSR + SSE .
The more variation in the response the regressors explain, the larger SSR becomes
and the smaller SSE becomes. This means that M SR becomes larger and M SE
smaller and therefore the quotient F becomes larger. Thus, small values of F support the null hypothesis and large values of F provide evidence against the null hypothesis and in favor of the alternative hypothesis.
We treat sales as the dependent variable y, and target population and per capita
discretionary income as independent variables x1 and x2 , respectively in an explo-
ration of feasibility of predicting district sales from target population and per capita
discretionary income. The regression model is
yi = β0 + β1 xi1 + β2 xi2 + µ
The table below shows an SPSS output with the values of β0, β1 and β2. This shows that β0 = 3.453, the coefficient of xi1, target population, is β1 = 0.496, and the coefficient of xi2 is β2 = 0.009. Having computed these coefficients, we now have a regression equation that can be used to predict district sales:
ŷ = 3.453 + 0.496xi1 + 0.009xi2
The model says that the mean jar sales are expected to increase by 0.496 gross when
the target population increases by 1 thousand, holding per capita discretionary in-
come constant, and that mean jar sales are expected to increase by 0.009 gross when
per capita discretionary income increases by 1 Kwacha, holding target population
constant.
H0: β1 = β2 = 0 against H1: at least one of β1, β2 is nonzero
The table below shows the SPSS output of the ANOVA F-test carried out on the data given at the beginning of this example.
From the table, SSR = 53844.716, SSE = 56.884 and SST = 53901.600, with degrees of freedom (df) 2, 12 and 14 respectively. We now calculate the mean squares: MSR = SSR/2 = 26922.358 and MSE = SSE/12 = 4.740.
The F-ratio is F = MSR/MSE = 26922.358/4.740 = 5679.466.
For a significance level of α = 0.05, we find F(2,12,0.95) = 3.89. Since F > F(k,n−k−1,1−α), we reject our null hypothesis and conclude that target population or per capita discretionary income (at least one of them) contributes significantly to the sales. Alternatively, we could use the p-value to test the hypothesis. From the ANOVA table, the reported significance is 0.00, which is less than our preselected significance level of 0.05. Therefore, we reject the null hypothesis and conclude in the same way.
R² = SSR/SST = 1 − SSE/SST
A large value of R² does not necessarily mean the regression model is a good (useful) one. Adding new variables to the model can never decrease the amount of variance explained, i.e. it will always increase R², regardless of whether the additional variable is statistically significant or not. This is because SSE can never become larger with more independent variables, and SST is always the same for a given set of responses. Thus, it is possible for models that have large values of R² to yield poor predictions of new observations or estimates of the mean response.
Because R² always increases as we add terms to the model, it is sometimes suggested that a modified measure be used that adjusts for the number of independent variables:
Ra² = 1 − [SSE/(n − k)] / [SST/(n − 1)] = 1 − (1 − R²)(n − 1)/(n − k)
Actually, the adjusted R² statistic will not always increase as we add variables to the model. In fact, if unnecessary terms are added, the value of adjusted R² will often decrease, because the decrease in SSE may be more than offset by the loss of degrees of freedom in the denominator (n − k).
Coefficient of Multiple Correlation: R is the positive square root of R²; thus R = √R².
R² = 1 − SSE/SST = 1 − 56.884/53901.600 = 0.9989
Thus, when two independent variables, target population and per capita discre-
tionary income are considered, the variation in sales is reduced by 99.9 percent.
√
The coefficient of multiple correlation is R = 0.9989 = 0.999
If the overall test shows that the model as a whole is statistically significant as a predictor of the response, we want to know which of the regressors in the model are statistically significant predictors of the response. We are frequently interested
in testing hypotheses on the individual regression coefficient. Such tests would be
useful in determining the value of each regressor variable in the regression model.
For example, a model might be more effective with the inclusion of the additional
variables or perhaps with deletion of one or more of the variables of the model.
Adding a variable to the regression model always causes the sum of squares for regres-
sion to increase and the error sum of squares to decrease. We must decide whether the
increase in the regression sum of squares is sufficient to warrant using the additional
variable in the model. Furthermore, adding an unimportant variable to the model may actually increase the mean square error, thereby decreasing the usefulness of the model.
The hypotheses for testing the significance of any individual regression coefficient,
say βj are
H0 : βj = 0 against H1 : βj 6= 0
T = β̂j/SE(β̂j) = β̂j/σ(β̂j) ∼ t(n−k); we reject H0 if |T| > t(1 − α/2, n − k)
There are two general applications for multiple regression (MR): prediction and ex-
planation. These roughly correspond to two differing goals in research: being able to
make valid projections concerning an outcome for a particular individual (prediction),
or attempting to understand a phenomenon by examining a variable’s correlates on a
group level (explanation). When one uses Multiple Regression model for prediction,
one is using a sample to create a regression equation that would optimally predict
a particular phenomenon within a particular population which is our goal in this
section of the chapter.
Hypothetically, a regression equation is created to predict eighth-grade achievement test scores from fifth-grade variables, such as family socioeconomic status, race, and sex.
VIFj = 1/(1 − Rj²)
where Rj² is the multiple coefficient of determination for the regression model that relates Xj to all other independent variables in the set. If Rj² = 0, which says the jth variable is not related to the other predictor variables, then VIFj = 1. On the other hand, if Rj² > 0, which says Xj is linearly related to other predictor variables, then 1 − Rj² is less than 1, making VIFj > 1. Both the largest variance inflation factor among the independent variables and the mean of the variance inflation factors are used to detect multicollinearity, for example:
• The largest variance inflation factor greater than 10 (which means the largest Rj² is greater than 0.9)
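As an illustration (not from the text), variance inflation factors can be computed in Python with the statsmodels package; the data below are simulated, with x2 deliberately made nearly collinear with x1:

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=100)   # nearly collinear with x1
x3 = rng.normal(size=100)

X = sm.add_constant(np.column_stack([x1, x2, x3]))
for j in range(1, X.shape[1]):               # skip the constant column
    print(j, variance_inflation_factor(X, j))   # VIFs for x1 and x2 far above 1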
For a given regression model, could some of the predictors be eliminated without
sacrificing too much in the way of fit? Conversely, would it be worthwhile to add
a certain set of new predictors to a given regression model? The partial F Test is
designed in such a way that these questions are answered by comparing two models
for the same response variable. The extra sum of squares is used to measure two
things. Firstly, it measures the marginal increase in the error of sum of squares when
one or more predictors are deleted from the model. Secondly, it measures marginal
reduction in the error sum of squares when one or more predictors are added to the
model.
The partial F test assesses whether addition of any specific independent variable,
given others already in the model, significantly contributes to the prediction of y.
The test therefore allows for elimination of variables that are of no help in predicting
y and thus enables one to reduce the set of possible independent variables to an
economical set of predictors.
The model containing all k predictors is called the full model:
y = β0 + β1 x1 + β2 x2 + ... + βk xk + ε.
The reduced model contains only the first l predictors (l < k):
y = β0 + β1 x1 + β2 x2 + ... + βl xl + ε.
We estimate the linear regression for each of the two models and then look at the
error sum of squares (SSE) from the ANOVA table of each. We want to test whether we
can safely drop the regressors x(l+1), ..., xk from the model. If these predictors do
not add significantly to the model, then dropping them will make the model simpler
and thus preferable.
To perform the partial F test concerning the variables x(l+1), ..., xk, given that
the variables x1, x2, ..., xl are already in the model, we must first compute the
extra sum of squares from adding x(l+1), ..., xk given x1, x2, ..., xl, which we
place in our ANOVA table. This sum of squares is computed by the formula
SSR(x(l+1), ..., xk | x1, ..., xl) = SSE(x1, ..., xl) − SSE(x1, ..., xk),
and the corresponding partial F statistic is
F = [SSR(x(l+1), ..., xk | x1, ..., xl) / (k − l)] / [SSE(x1, ..., xk) / (n − k − 1)].
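A hedged Python sketch of this test is given below. It assumes the full design
matrix carries an intercept column plus k predictors, the reduced one an intercept
plus the first l predictors, and that the reader supplies the data.

import numpy as np
from scipy import stats

def sse(X, y):
    # Error sum of squares from an OLS fit (X should include a column of ones)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return resid @ resid

def partial_f_test(X_full, X_reduced, y, alpha=0.05):
    n = len(y)
    k = X_full.shape[1] - 1            # predictors in the full model
    l = X_reduced.shape[1] - 1         # predictors in the reduced model
    sse_r, sse_f = sse(X_reduced, y), sse(X_full, y)
    extra_ss = sse_r - sse_f           # extra sum of squares
    f_stat = (extra_ss / (k - l)) / (sse_f / (n - k - 1))
    f_crit = stats.f.ppf(1 - alpha, k - l, n - k - 1)
    return f_stat, f_stat > f_crit     # True means keep the full model

A value of the statistic above the critical point indicates that the extra
regressors x(l+1), ..., xk contribute significantly and should be retained.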
The all-possible-regressions method is designed to run every possible regression
between the dependent variable and all possible subsets of the explanatory variables.
Suppose we want to explain the dependent variable WAGE, and to aid in our study we
have three explanatory variables: years of education (ED), the age of the worker
(AGE) and the years of experience of the worker (EXP). Using this data set, there
are a total of 2³ = 8 possible regressions which could be run in order to explain
WAGE:
• No variables: WAGE = β0
• One variable: WAGE = β0 + β1 ED; WAGE = β0 + β1 AGE; WAGE = β0 + β1 EXP
• Two variables: WAGE = β0 + β1 ED + β2 AGE; WAGE = β0 + β1 ED + β2 EXP;
WAGE = β0 + β1 AGE + β2 EXP
• All three variables: WAGE = β0 + β1 ED + β2 AGE + β3 EXP
Running all the possible regressions allows the researcher to analyze the summary
statistics of every one of them. The choice of which criterion to use to select the
appropriate model is left to the researcher. Two commonly used criteria in choosing
between different regressions are:
• Using the adjusted R²
• Using the Cp statistic
Cp measures the total mean error of the fitted values of the regression. The total
mean error involves two parts: one that results from random errors and one resulting
from bias. When there is no bias, the expected value of Cp is E(Cp) = p, where p is
equal to the K + 1 coefficient estimates in the regression. A good regression will
therefore have a low value of Cp, near K + 1. If the regression generates a large Cp,
then the mean square error of the fit is large, indicating that the regression is a
poor fit and/or has bias. Cp is computed using the following formula:
Cp = SSEp / MSEF − (n − 2p),
where SSEp is the error sum of squares for the regression with p = K + 1 coefficients
to be estimated and MSEF is the mean square error for the model that includes all
possible explanatory variables.
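The following Python sketch ties the all-possible-regressions idea to the Cp
criterion. The names ED, AGE, EXP follow the WAGE example above, but the data
arrays are placeholders the reader must supply, and the final sort by Cp is simply
one convenient way to scan the candidates.

from itertools import combinations
import numpy as np

def fit_sse(X, y):
    # Error sum of squares from an OLS fit
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ coef
    return r @ r

def all_subsets_cp(X, y, names):
    n, K = X.shape
    ones = np.ones((n, 1))
    # MSE_F comes from the full model with all K predictors (p = K + 1 coefficients)
    mse_full = fit_sse(np.hstack([ones, X]), y) / (n - K - 1)
    results = []
    for size in range(K + 1):                    # 2^K subsets in total
        for idx in combinations(range(K), size):
            Xp = np.hstack([ones, X[:, list(idx)]])
            p = len(idx) + 1                     # coefficients incl. intercept
            cp = fit_sse(Xp, y) / mse_full - (n - 2 * p)
            results.append(([names[i] for i in idx], p, cp))
    # A good subset has Cp low and close to p
    return sorted(results, key=lambda r: r[2])

For K = 3 explanatory variables, such as ED, AGE and EXP, this enumerates the 8
regressions listed above and reports Cp for each.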
EXERCISE
1. Sixteen observations on two explanatory variables, x1 and x2, and a response y
are given in the table below.
i xi1 xi2 yi
1 4 2 64
2 4 4 73
3 4 2 61
4 4 4 76
5 6 2 72
6 6 4 80
7 6 2 71
8 6 4 83
9 8 2 83
10 8 4 89
11 8 2 86
12 8 4 93
13 10 2 88
14 10 4 95
15 10 2 94
16 10 4 100
(a) Fit the regression model to the data. State the estimated regression func-
tion and interpret β1.
(b) Assume the regression model with independent normal error terms is ap-
propriate.
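For readers working outside SPSS, one possible way (not prescribed by the text) to
carry out part (a) in Python is sketched below, using the sixteen rows of the table.

import numpy as np

# The (x_i1, x_i2, y_i) triples from the table above
x1 = np.array([4, 4, 4, 4, 6, 6, 6, 6, 8, 8, 8, 8, 10, 10, 10, 10])
x2 = np.array([2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4])
y = np.array([64, 73, 61, 76, 72, 80, 71, 83, 83, 89, 86, 93, 88, 95, 94, 100])

# Design matrix with an intercept column, then ordinary least squares
X = np.column_stack([np.ones_like(x1), x1, x2]).astype(float)
beta_hat, *_ = np.linalg.lstsq(X, y.astype(float), rcond=None)
print(beta_hat)   # b0, b1, b2 of the fitted function y-hat = b0 + b1*x1 + b2*x2
# b1 estimates the mean change in y per unit increase in x1, holding x2 fixed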
2. The crime rate in 47 states in the USA was reported together with possible
factors that might influence it. The factors recorded are as follows;
• X4 : police expenditure (in dollars) per person by state and local govern-
ment in 1960 (expend60)
• X5 : police expenditure (in dollars) per person by state and local govern-
ment in 1959 (expend59)
• X6 : labour force participation rate per 1000 civilian urban males in the
age group 14-24 (LabForce)
• X10 : unemployment rate of urban males per 1000 in the age group 14-24
(unemploy)
• X11 : unemployment rate of urban males per 1000 in the age group 35-59
(unemploy35)
• X12 : the median value of family income or transferable goods and assets
(unit: 10 dollars) (income)
• X13 : the number of families per 1000 earning below one-half of the median
income. (poverty)
• Y : Crime rate
(c) What does your test imply about all explanatory variables?
(e) Now perform regression analysis with police expenditure in 1959. Inter-
pret the output.
How does one determine the most economical model? One method
is to consider the analysis of variance table of the full model remaining
after excluding those variables which cause multicollinearity. In
the present data set the variable X5 (police expenditure in 1959) was ex-
cluded, leaving 12 explanatory variables. In the resulting regression analysis
one examines the table of t-ratios and the corresponding p-values.
Those variables which have a significant t-ratio are selected for inclusion in
the sub-set. (Recall that a significant t-ratio for a coefficient means that a
slope is present.) Next the Analysis of Variance table is examined in the col-
umn headed SEQ SS. The amount in each row of this column tells us the
contribution made by the respective explanatory variable to the Regression
Sum of Squares. One verifies that the explanatory variables selected to form
the sub-set are indeed making a sizable contribution to the Regression Sum of
Squares. (Recall that for any explanatory variable the larger the Regression
Sum of Squares, the smaller the Residual Sum of Squares, since
TSS = ESS + RSS. A small Residual Sum of Squares means that the data
points are not so scattered, and that there is greater clustering about the
line of least squares.)
An alternative strategy is based on adding or dropping one variable at a time
from a given model. The idea is to compare the current model with a new
model obtained by adding or deleting an explanatory variable. Call the
smaller model (i.e. the one with fewer variables) Model I and the bigger
model Model II. One can compute the F statistic (called the partial F) as
partial F = (RSS of Model I − RSS of Model II) / (RSS of Model II / degrees
of freedom of Model II).
If the partial F value exceeds the critical F value, the added variable
contributes significantly and the bigger model is preferred.
Once the important β coefficients have been identified, the next step is to de-
termine their relative roles in explaining the variability in the outcome variable
Y. The actual numeric value of a β coefficient is not a guide. For example,
the β coefficient for StateSth is 7.2 and that of Popn is 0.324. This does not
mean that StateSth is roughly 22 times more important than Popn. This is
because of the unit of measurement: Popn is measured in actual numbers and
StateSth is categorical. The t statistic provides the true measure. The t
statistic for StateSth is 0.39 and that for Popn is 2.10. In the regression
analysis with 5 explanatory variables the sub-set was chosen in the manner
just described; for comparison, the regression analysis with the 7 remaining
variables can also be examined.
• The P values are to be discounted because they do not take into account
the very many tests that have been carried out.
(b) Adjusted R-sq.
Although similar to R-sq., it takes into account the number of explanatory
variables p in the model and the number of subjects n. (R-sq. = Regression
Sum of Squares / Total Sum of Squares.)
Adjusted R-sq. = 1 − [(n − 1)/(n − p − 1)](1 − R-sq.).
A short computational sketch of this formula appears after item (d) below.
(d) Cp statistic
In general, among candidate models we select the one with the small-
est C-p value, and where the value of C-p is closest to p, the number of
parameters in the model (i.e. intercept plus the number of explanatory
variables in the model).
Cp = Residual Sum of Squaresp / Mean Residual Sum of Squaresm − (n − 2p),
where Residual Sum of Squaresp = RSS for the model with p parameters
including the intercept and Mean Residual Sum of Squaresm = Mean RSS
with all the predictors.
Cp is commonly used to select a subset of explanatory variables after
an initial regression analysis employing all possible explanatory variables.
One then selects the smallest model that produces a C-p value near to p,
the number of parameters. A small C-p means a small variance in estimating
the regression coefficients; in other words, a precise model is achieved, and
adding more explanatory variables is unlikely to improve the precision
any further. Simple models are always preferable because they are easy to
interpret and less prone to multicollinearity. In the example of
the best-subset regression analysis provided, step 6, with a C-p value of 3.1,
has been highlighted as giving the best subset. The corresponding R-sq.
value is 74.8%, the adjusted R-sq. value is 71%, and s = 20.827. Caution
needs to be exercised when resorting to automatic selection procedures
like Stepwise and Best subsets: these procedures are machine led and do not
take into account the practical importance or biological plausibility of the
predictors.
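As noted under (b) above, a minimal Python sketch of the adjusted R-sq. formula
(illustrative only, with inputs supplied by the reader) is:

def adjusted_r_squared(r2, n, p):
    # r2: ordinary R-sq; n: number of subjects;
    # p: number of explanatory variables in the model
    return 1 - (n - 1) / (n - p - 1) * (1 - r2)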
If the explanatory variables are correlated with each other, assessment of
the importance of each of them becomes ambiguous.
3. Having obtained a regression equation, how can one judge whether
adding one or more explanatory variables would improve prediction
of the response?
Improvement in R-sq., and particularly in adjusted R-sq., provides a useful guide.
The addition of new explanatory variables can affect the relative contributions
of those variables already in the equation. In the selection of additional vari-
ables, the research question and the theoretical rationale behind it must guide
the researcher. To leave the selection entirely to the software would be a mis-
take.