Chapter Two: Simple Linear Regression
Regression analysis is concerned with the study of the dependence of one variable (the
dependent variable) on one or more other variables (the explanatory variables) with a view to
estimating and/or predicting the (population) mean or average value of the former in terms of the
known or fixed (in repeated sampling) values of the latter. For example: an economist may be
interested in studying the dependence of personal consumption expenditure on after tax or
disposable real personal income. Such an analysis may be helpful in estimating the marginal
propensity to consume (MPC), that is, average change in consumption expenditure for, say, a
dollar’s worth of change in real income. So, regression analysis helps to estimate or predict the
average value of one variable on the basis of the fixed values of other variables. In other words, when a single variable is used to estimate the value of an unknown variable, the method is referred to as simple regression analysis. In summary, the key idea behind regression analysis is
the statistical dependence of one variable, the dependent variable, on one or more other
variables, the explanatory variables. The objective of such analysis is to estimate and/or predict
the mean or average value of the dependent variable on the basis of the known or fixed values of
the explanatory variables.
Terminology
If we are studying the dependence of a variable on only a single explanatory variable, such as
that of consumption expenditure on real income, such a study is known as simple or two-variable
regression analysis. However, if we are studying the dependence of one variable on more than
one explanatory variable, as in the dependence of crop yield on rainfall, temperature, soil fertility, and fertilizer, it is known as multiple regression analysis. In other words, in two-variable regression there is only one explanatory variable, whereas in multiple regression there is more than one. In both cases, the most common estimator is the Ordinary Least Squares (OLS) estimator.
Simple linear regression
Simple linear regression, or two-variable regression analysis, is rarely adequate in practice, but it presents the fundamental ideas of regression analysis as simply as possible, and some of these ideas can be illustrated with the aid of two-dimensional graphs. On the other hand, the more general multiple regression analysis, in which the regressand is related to two or more regressors, is in many ways a logical extension of the two-variable case. The simple model is written as:
Yi = β0 + β1Xi + ui ------------------- (1)
Where: Y = the dependent variable (also called the regressand, endogenous, response, or explained variable); β0 = the constant or intercept term; β1 = the slope coefficient, which tells how Y changes for a one-unit increase in X; X = the independent variable (also called the regressor, exogenous, predictor, explanatory, control, or covariate variable); and u = the error or disturbance term, which contains all factors other than X that affect Y. This model is called the simple or two-variable regression model; it shows how Y varies with X. Regression analysis addresses the question: how do we estimate β0 and β1 by minimizing the sum of squared residuals?
Example: Suppose the relationship between expenditure (Y) and income (X) of households is
expressed as: Y = 0.6X + 120. Here, on the basis of income, we can predict expenditure. For
instance, if the income of a certain household is 1500 Birr, then the estimated expenditure will
be: expenditure = 0.6(1500) + 120 = 1020 Birr. Note that since expenditure is estimated on the
basis of income, expenditure is the dependent variable and income is the independent variable.
The error term
Consider the above model: Y = 0.6X + 120. This functional relationship is deterministic or exact,
that is, given income we can determine the exact expenditure of a household. But in reality this
rarely happens: different households with the same income are not expected to spend equal amounts due to habit persistence, geographical and time variation, etc. Thus, we should express the regression model as: Yi = α + βXi + ei, where ei is the random error term (also called the disturbance term). The main reasons for including the error term are:
Omitted variables: a model is a simplification of reality. It is not always possible to include
all relevant variables in a functional form. For instance, we may construct a model relating
demand and price of a commodity. But demand is influenced not only by its own price: the income of consumers, the prices of substitutes, and several other variables also influence it. The omission of these variables from the model introduces an error.
Measurement error: inaccuracy in collection and measurement of sample data.
Sampling error: Consider a model relating consumption (Y) with income (X) of households. The sample we randomly choose to examine the relationship may turn out to be predominantly poor households, or predominantly male- or female-headed households, while households in remote areas may not be included in the sample at all. In such cases, our estimates of α and β from this sample may not be as good as those from a balanced sample group.
Note that the size of the error (ei) is not fixed: it is non-deterministic, or stochastic (probabilistic), in nature. This in turn implies that Yi is also probabilistic in nature. Thus, the
probability distribution of Yi and its characteristics are determined by the values of Xi and by
the probability distribution of ei. Thus, a full specification of a regression model should include
a specification of the probability distribution of the disturbance (error) term. This information is
given by what we call basic assumptions or assumptions of the classical linear regression model
(CLRM).
Consider the model:
Yi = α + βXi + ei , i = 1, 2, . . ., n
Here, the subscript i refers to the ith observation. In the CLRM, Yi and Xi are observable while ei is unobservable. If i refers to a point or period of time, we speak of time series data. On the other hand, if i refers to the ith individual, object, geographical region, etc., we speak of cross-sectional data.
If our objective is to estimate β1 and β2 only, the method of OLS discussed below will suffice. But in regression analysis our objective is not only to obtain the estimated values of β1 and β2 but also to draw inferences about the true β1 and β2. For example, we would like to know how close β̂1 and β̂2 are to their counterparts in the population, or how close Ŷi is to the true E(Y|Xi). Notice that E(Y|Xi) is the same as β1 + β2Xi. To that end, we must not only specify the functional form of the model, but also make certain assumptions about the manner in which Yi are generated. To see why this requirement is needed, look at the PRF¹: Yi = β1 + β2Xi + ui. It shows that Yi depends on both Xi and ui. Therefore, unless we are specific about how Xi and ui are created or generated, there is no way we can make any statistical inference about the Yi and also, as we shall see, about β1 and β2. Thus, the assumptions made about the Xi variable(s) and the error terms are extremely critical to the valid interpretation of the regression estimates.
¹ PRF stands for Population Regression Function.
The Classical Linear Regression Model (CLRM), which is the cornerstone of most econometric theory, makes 10 assumptions. We first discuss these assumptions in the context of the two-variable regression model and later extend them to multiple regression models, that is, models in which there is more than one regressor.
Assumption 1: Linear in parameters. The regression model is linear in the parameters. Linear-in-parameters means that no parameter appears as an exponent or is multiplied or divided by another parameter, regardless of whether the relationship between Y and X is linear or nonlinear. Since linear-in-parameter regression models are the starting point of the CLRM, we will maintain this assumption throughout this teaching material. Keep in mind that the regressand Y and the regressor X themselves may be nonlinear: Yi = β1 + β2Xi + ui.
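Two standard illustrations of this distinction, written in LaTeX notation (supplementary examples, not from the chapter's data):

    % linear in parameters, even though nonlinear in the variable X:
    Y_i = \beta_1 + \beta_2 X_i^2 + u_i
    % nonlinear in parameters, since \beta_2 enters as a square:
    Y_i = \beta_1 + \beta_2^2 X_i + u_i

The first model satisfies Assumption 1; the second does not, even though X enters it linearly.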
Assumption 2: X values are fixed in repeated sampling. Values taken by the regressor X are
considered fixed in repeated samples. More technically, X is assumed to be non-stochastic. What
this means is that our regression analysis is conditional regression analysis, that is, conditional on
the given values of the regressor(s) X. In experiments, where researchers can control the explanatory variables (regressors), the explanatory variables are non-stochastic. For instance, to study the response of maize yield to fertilizer level, the researcher may purposively set the regressor values at 0 kg/ha, 25 kg/ha, 50 kg/ha, and 100 kg/ha. Here the variable, fertilizer level, is non-stochastic.
Assumption 3: Zero mean value of disturbance ui. Given the value of X, the mean, or expected
value of the random disturbance term ui is zero. Technically, the conditional mean value of ui is zero. Symbolically, we have E(ui|Xi) = 0. It states that the mean value of ui, conditional upon the
given Xi, is zero. Geometrically, this assumption can be pictured as in Figure 1, which shows a
few values of the variable X and the Y populations associated with each of them. As shown, each
Y population corresponding to a given X is distributed around its mean value (shown by the
circled points on the PRF) with some Y values above the mean and some below it. The distances
above and below the mean values are nothing but the ui and what this assumption requires is that
the average or mean value of these deviations corresponding to any given X should be zero.
This assumption says that the factors not explicitly included in the model, and therefore
subsumed in ui, do not systematically affect the mean value of Y; so to speak, the positive ui
values cancel out the negative ui values so that their average or mean effect on Y is zero. Note that the assumption E(ui|Xi) = 0 implies that E(Yi|Xi) = β1 + β2Xi. Therefore, the two assumptions are equivalent.
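The equivalence follows in one line by taking conditional expectations of the PRF (a supplementary step, in LaTeX notation):

    E(Y_i \mid X_i) = \beta_1 + \beta_2 X_i + E(u_i \mid X_i) = \beta_1 + \beta_2 X_i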
Assumption 4: Homoskedasticity or equal variance of ui. Given the value of X, the variance of ui is the same for all observations; that is, the conditional variance of ui is some positive constant number equal to σ². Symbolically, var(ui|Xi) = σ². Technically, this represents the assumption of homoskedasticity, or equal (homo) spread (scedasticity), or equal variance. The word comes from the Greek verb skedanime, which means to disperse or scatter. Stated differently, it means that the Y populations corresponding to various X values have the same variance. Put simply, the variation around the regression line (the line of the average relationship between Y and X) is the same across the X values; it neither increases nor decreases as X varies. Diagrammatically, the situation is as depicted in Figure 2.
Figure 2: Homoskedasticity
In contrast, consider Figure 3, where the conditional variance of the Y population varies with X. This situation is known appropriately as heteroskedasticity, or unequal spread, or variance. Symbolically, it can be written as var(ui|Xi) = σi².
Figure 3: Heteroskedasticity
Notice the subscript i on σ² in var(ui|Xi) = σi², which indicates that the variance of the Y population is no longer constant.
To make the difference between the two situations clear, let Y represent weekly consumption expenditure and X weekly income. Figures 2 and 3 show that as income increases the average consumption expenditure also increases. But in Figure 2 the variance of consumption expenditure remains the same at all levels of income, whereas in Figure 3 it increases with income. In other words, richer families on average consume more than poorer families, but there is also more variability in the consumption expenditure of the former.
To understand the rationale behind this assumption, refer to Figure 3, which shows that var(u|X1) < var(u|X2) < ··· < var(u|Xn). Therefore, the likelihood is that the Y observations coming from the population with X = X1 would be closer to the PRF than those coming from populations corresponding to X = X2, X = X3, and so on. In short, not all Y values corresponding to the various X's will be equally reliable, reliability being judged by how closely or distantly the Y values are distributed around their means, that is, the points on the PRF. If this is in fact the case, would we not prefer to sample from those Y populations that are closer to their mean than from those that are widely spread? But doing so might restrict the variation we obtain across X values.
By invoking Assumption 4, we are saying that at this stage all Y values corresponding to the various X's are equally important. In Unit Six we shall see what happens if this is not the case, that is, when there is heteroskedasticity. Note that Assumption 4 implies that the conditional variances of Yi are also homoskedastic. That is, var(Yi|Xi) = σ². Of course, the unconditional variance of Y is σY².
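The distinction is easy to see in a small simulation. The following is a minimal sketch with illustrative parameter values (only numpy is assumed): it draws errors with a constant spread and errors whose spread grows with X, then compares their variability in the lower and upper halves of the X range.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 200
    x = rng.uniform(1, 10, n)

    u_homo = rng.normal(0, 2.0, n)   # homoskedastic: same sigma at every X
    u_het = rng.normal(0, 0.5 * x)   # heteroskedastic: sigma_i grows with X

    lo, hi = x < 5, x >= 5
    print(u_homo[lo].std(), u_homo[hi].std())   # roughly equal spreads
    print(u_het[lo].std(), u_het[hi].std())     # spread increases with X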
Assumption 5: No autocorrelation between the disturbances. Given any two X values, Xi and Xj (i ≠ j), the correlation between any two disturbances ui and uj (i ≠ j) is zero. Symbolically, cov(ui, uj | Xi, Xj) = 0 for i ≠ j.
Figure 4: Patterns of correlation among the disturbances. (a) positive serial correlation; (b)
negative serial correlation; (c) zero correlation.
One can explain this assumption as follows. Suppose in our PRF (Yt = β1 + β2Xt + ut) that ut and ut−1 are positively correlated. Then Yt depends not only on Xt but also on ut−1, for ut−1 to some extent determines ut. By invoking Assumption 5, we are saying that we will consider only the systematic effect, if any, of Xt on Yt and not worry about the other influences that might act on Y as a result of the possible intercorrelations among the u's.
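A small simulation makes the violated case concrete. This minimal sketch (illustrative values; only numpy is assumed) generates disturbances with ut = 0.8·ut−1 + et, the positive serial correlation of panel (a), and shows that neighboring disturbances are strongly correlated rather than uncorrelated:

    import numpy as np

    rng = np.random.default_rng(1)
    n, rho = 500, 0.8                  # rho > 0: positive serial correlation
    e = rng.normal(0, 1, n)

    u = np.zeros(n)
    for t in range(1, n):
        u[t] = rho * u[t - 1] + e[t]   # u_t partly determined by u_{t-1}

    # correlation between u_t and u_{t-1}: close to rho, not zero
    print(np.corrcoef(u[1:], u[:-1])[0, 1])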
Assumption 6: Zero covariance between ui and Xi. The disturbance ui and the explanatory variable Xi are uncorrelated. The rationale for this assumption is as follows: when we expressed the PRF, we assumed that X and u (which may represent the influence of all the omitted variables) have separate (and additive) influences on Y. But if X and u are correlated, it is not possible to assess their individual effects on Y. Thus, if X and u are positively correlated, X increases when u increases and decreases when u decreases. In either case, it is difficult to isolate the influence of X and of u on Y.
Assumption 6 is automatically fulfilled if the X variable is non-random or non-stochastic and Assumption 3 holds, for in that case cov(ui, Xi) = E{[Xi − E(Xi)][ui − E(ui)]} = [Xi − E(Xi)]E[ui − E(ui)] = 0, since the X's are non-stochastic. Since we have assumed that our X variable is not only non-stochastic but also assumes fixed values in repeated samples, Assumption 6 is not very critical for us; it is stated here merely to point out that the regression theory presented in this model holds true even if the X's are stochastic or random, provided they are independent of, or at least uncorrelated with, the disturbances ui.
Assumption 7: The number of observations n must be greater than the number of parameters to
be estimated. Alternatively, the number of observations n must be greater than the number of explanatory variables.
Assumption 8: Variability in X values. The X values in a given sample must not all be the same.
Technically, var(X) must be a finite positive number.
Assumption 9: The regression model is correctly specified. Alternatively, there is no
specification bias or error in the model used in empirical analysis.
The classical econometric methodology assumes, implicitly if not explicitly, that the model used in empirical analysis is correctly specified. This assumption can be explained informally as follows. An econometric investigation begins with the specification of the econometric model underlying the phenomenon of interest. Some important questions that arise in the specification of the model include the following: (1) What variables should be included in the model? (2) What is the functional form of the model? Is it linear in the parameters, the variables, or both? (3) What are the probabilistic assumptions made about the Yi, the Xi, and the ui entering the model? These are extremely important questions: by omitting important variables from the model, by choosing the wrong functional form, or by making wrong stochastic assumptions about the variables of the model, the validity of interpreting the estimated regression will be highly questionable.
Our discussion of the assumptions underlying the classical linear regression model is now complete. It is important to note that all these assumptions pertain to the PRF only and not to the SRF². But it is interesting to observe that the method of least squares discussed below has some properties that are similar to the assumptions we have made about the PRF. For example, the finding that Σûi = 0 is similar to the assumption that E(ui|Xi) = 0. Likewise, the finding that ΣûiXi = 0 is similar to the assumption that cov(ui, Xi) = 0. It is comforting to note that the method of least squares thus tries to "duplicate" some of the assumptions we have imposed on the PRF. Of course, the SRF does not duplicate all the assumptions of the CLRM. As we will show later, although cov(ui, uj) = 0 (i ≠ j) by assumption, it is not true that the sample cov(ûi, ûj) = 0 (i ≠ j). As a matter of fact, we will show later that the residuals are not only autocorrelated but also heteroskedastic. When we go beyond the two-variable model and consider multiple regression models, that is, models containing several regressors, we shall add the following assumption.
² SRF indicates the Sample Regression Function.
Assumption 10: There is no perfect multicollinearity. That is, there are no perfect linear
relationships among the explanatory variables.
Method of estimation
Specifying the model and stating its underlying assumptions are the first stage of any econometric application. The next step is the estimation of the numerical values of the parameters of economic relationships. The parameters of the simple linear regression model can be estimated by various methods. The most commonly used methods are:
1. The least squares method (OLS)
2. The maximum likelihood method (MLM)
3. The method of moments (MM)
4. The Bayesian estimation technique
5. The free-hand method
6. The semi-average method
Here, we deal with the OLS method of estimation.
In the regression model
Yi = β1 + β2Xi + ui ------------------- (1)
the values of the parameters β1 and β2 are not known. The disturbance term ui is a proxy for all those variables that are omitted from the model but that collectively affect Y. The PRF is not directly observable, so when the parameters are estimated from a sample of size n we obtain the Sample Regression Function (SRF):
Ŷi = β̂1 + β̂2Xi ------------------- (2)
Yi = β̂1 + β̂2Xi + ûi ------------------- (3)
But how is the SRF itself determined? To see this, let us proceed as follows. First, express equation (3) as:
ûi = Yi − Ŷi = Yi − β̂1 − β̂2Xi ------------------- (4)
Equation (4) shows that the residuals are simply the differences between the actual and estimated Y values. Now, given n pairs of observations on Y and X, we would like to determine the SRF in such a manner that it is as close as possible to the actual Y. To this end, we may adopt the following criterion: choose the SRF in such a way that the sum of the residuals Σûi = Σ(Yi − Ŷi) is as small as possible. If we adopt this criterion, Figure 5 shows that the residuals û2 and û3 as well as the residuals û1 and û4 receive the same weight in the sum (û1 + û2 + û3 + û4), although the first two residuals are much closer to the SRF than the latter two. In other words, all the residuals receive equal importance no matter how close or how widely scattered the individual observations are from the SRF.
A consequence of this is that it is quite possible for the algebraic sum of the ûi to be small (even zero) while the ûi are widely scattered about the SRF. To see this, let û1, û2, û3, and û4 in Figure 5 assume the values 10, −2, +2, and −10, respectively. The algebraic sum of these residuals is zero, although û1 and û4 are scattered more widely around the SRF than û2 and û3.
We can avoid this problem if we adopt the least-squares criterion, which states that the SRF should be fixed in such a way that the sum of squared residuals
Σûi² = Σ(Yi − Ŷi)² = Σ(Yi − β̂1 − β̂2Xi)² ------------------- (5)
is as small as possible. Why square the residuals? Squaring gives more weight to large residuals (outliers): once squared, a large residual stands out clearly, and remedial measures can be taken if needed. That is, by squaring ûi, this method gives more weight to residuals such as û1 and û4 in Figure 5 than to residuals such as û2 and û3. Our aim is then to determine the equation of the estimating line in such a way that the error in estimation is minimized.
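The numerical illustration above is easy to verify. A minimal Python sketch (the tightly clustered alternative values are hypothetical, chosen only for contrast):

    u_wide = [10, -2, 2, -10]    # the widely scattered residuals from Figure 5
    u_tight = [1, -2, 2, -1]     # a hypothetical tightly clustered set with the same sum

    print(sum(u_wide), sum(u_tight))     # 0 0 -> the plain sum cannot tell them apart
    print(sum(e ** 2 for e in u_wide),   # 208
          sum(e ** 2 for e in u_tight))  # 10 -> the sum of squares can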
Differentiating the sum of squared residuals with respect to β̂1 and β̂2 and setting the results to zero yields the two normal equations:
ΣYi = nβ̂1 + β̂2ΣXi ------------------- (6)
ΣXiYi = β̂1ΣXi + β̂2ΣXi² ------------------- (7)
Equation (6) is the first normal equation and (7) is the second normal equation. Before solving for β̂1 and β̂2, we should know where equations (6) and (7) came from; the proof was shown on the whiteboard in class. Solving the normal equations simultaneously, or using matrix algebra, we obtain:
β̂2 = (nΣXiYi − ΣXiΣYi) / (nΣXi² − (ΣXi)²) = Σxiyi / Σxi² ------------------- (8)
β̂1 = Ȳ − β̂2X̄ ------------------- (9)
where xi = Xi − X̄ and yi = Yi − Ȳ denote deviations from the sample means.
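For reference, here is a sketch of that whiteboard derivation in LaTeX notation (standard least-squares algebra, reconstructed rather than copied from the class notes):

    \frac{\partial \sum \hat{u}_i^2}{\partial \hat{\beta}_1}
        = -2 \sum (Y_i - \hat{\beta}_1 - \hat{\beta}_2 X_i) = 0
        \;\Rightarrow\; \sum Y_i = n\hat{\beta}_1 + \hat{\beta}_2 \sum X_i \quad (6)

    \frac{\partial \sum \hat{u}_i^2}{\partial \hat{\beta}_2}
        = -2 \sum X_i (Y_i - \hat{\beta}_1 - \hat{\beta}_2 X_i) = 0
        \;\Rightarrow\; \sum X_i Y_i = \hat{\beta}_1 \sum X_i + \hat{\beta}_2 \sum X_i^2 \quad (7)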
Example 1: Consider the following data on the level of Teff yield (Yi), measured in Qt/ha, and the amount of labor hours worked (Xi). Calculate the values of β̂1 and β̂2 and interpret the results.
N:  1   2   3   4   5   6   7   8   9   10
Yi: 11  10  12  6   10  7   9   10  11  10
Xi: 10  7   10  5   8   8   6   7   9   10
Applying equations (8) and (9), we obtain β̂1 = 3.6 and β̂2 = 0.75. So the regression function between Teff yield and labor hours is: Ŷi = 3.6 + 0.75Xi. That is, each additional labor hour is associated with a 0.75 Qt/ha increase in Teff yield, and the estimated yield at zero labor hours is 3.6 Qt/ha.
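These figures can be verified with a short script. A minimal sketch in plain Python (no external libraries), following equations (8) and (9):

    Y = [11, 10, 12, 6, 10, 7, 9, 10, 11, 10]   # Teff yield, Qt/ha
    X = [10, 7, 10, 5, 8, 8, 6, 7, 9, 10]       # labor hours

    n = len(Y)
    x_bar, y_bar = sum(X) / n, sum(Y) / n

    Sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(X, Y))  # sum of x_i*y_i
    Sxx = sum((xi - x_bar) ** 2 for xi in X)                        # sum of x_i^2

    beta2 = Sxy / Sxx               # slope, equation (8): 0.75
    beta1 = y_bar - beta2 * x_bar   # intercept, equation (9): 3.6
    print(beta1, beta2)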
Example 2: Based on hypothetical data on weekly family consumption expenditure (Y) and weekly income (X): i) calculate the constant term and the marginal propensity to consume; ii) interpret the calculated coefficients.
Interpretation:
β̂2 = 0.51: family consumption expenditure increases by 0.51 units when family weekly income increases by one unit; this is the marginal propensity to consume.
β̂1 = 24.45: family consumption expenditure will be 24.45 units when family weekly income is zero; that is, at zero income the family's consumption expenditure does not depend on weekly income but is covered from other sources or funds, such as remittances or pensions.
Therefore, the regression function between weekly family consumption expenditure and weekly income is: Ŷi = 24.45 + 0.51Xi.
Gauss–Markov Theorem
The ideal or optimum properties that the OLS estimates possess may be summarized by the well-known Gauss–Markov theorem. According to this theorem, under the basic assumptions of the CLRM, the least-squares estimators are linear, unbiased, and have minimum variance. In other words, the Gauss–Markov theorem states: "Given the assumptions of the classical linear regression model, the least-squares estimators, in the class of linear unbiased estimators, have minimum variance; that is, they are the Best Linear Unbiased Estimators (BLUE)."
In general, based on the Gauss–Markov theorem, the defining statistical properties of the OLS estimator are:
linearity,
unbiasedness, and
minimum variance.
That is, an OLS estimator such as β̂1 is said to be the best linear unbiased estimator (BLUE) of β1 if these three properties hold. An estimator that is unbiased and has minimum variance is efficient. But if one or more of the CLRM assumptions fail, the OLS estimators are no longer best; other estimators can then outperform them.
Equations (8) and (9) show that the least-squares estimates are a function of the sample data. But since the data are likely to change from sample to sample, the estimates will change ipso facto.
³ The standard error is nothing but the standard deviation of the sampling distribution of the estimator, and the sampling distribution of an estimator is simply a probability or frequency distribution of the estimator, that is, a distribution of the set of values of the estimator obtained from all possible samples of the same size from a given population. Sampling distributions are used to draw inferences about the values of the population parameters on the basis of the values of the estimators calculated from one or more samples (Gujarati, 2004, p. 81).
The variance of the disturbance term, σ², is not known and must be estimated from the residuals. The following derivation shows that dividing the residual sum of squares by n − 2 yields an unbiased estimator of σ². In deviation form, the fitted values of the SRF can be written as:
ŷi = β̂2xi ------------------- (20)
where xi = Xi − X̄ and ŷi = Ŷi − Ȳ. Since the mean of the fitted values equals the mean of the actual values, subtracting the fitted from the actual values in deviation form gives the residuals as:
ûi = (ui − ū) − (β̂2 − β2)xi ------------------- (22)
Taking the summation of the squares of both sides of equation (22):
Σûi² = Σ(ui − ū)² − 2(β̂2 − β2)Σxi(ui − ū) + (β̂2 − β2)²Σxi²
Taking the expected value:
E(Σûi²) = E[Σ(ui − ū)²] − 2E[(β̂2 − β2)Σxi(ui − ū)] + E[(β̂2 − β2)²Σxi²] ------------------- (23)
We evaluate the three terms on the right-hand side, labeled A, C, and B, in turn.
A. E[Σ(ui − ū)²] = E[Σui² − n·ū²]
 = ΣE(ui²) − (1/n)E[(Σui)²]
 = nσ² − (1/n)E[Σui² + 2Σ(i<j) uiuj]
 = nσ² − (1/n)[nσ² + 2ΣE(uiuj)], but E(uiuj) = 0 for i ≠ j by the no-autocorrelation assumption of the CLRM
 = nσ² − σ²
 = (n − 1)σ² ------------------- (24)
B. E[(β̂2 − β2)²Σxi²] = Σxi² · E(β̂2 − β2)²
We know from equation (12) that var(β̂2) = E(β̂2 − β2)² = σ²/Σxi². Substituting this in place of E(β̂2 − β2)²:
Σxi² · E(β̂2 − β2)² = Σxi² · σ²/Σxi² = σ² ------------------- (25)
C. −2E[(β̂2 − β2)Σxi(ui − ū)] = −2E[(β̂2 − β2)(Σxiui − ūΣxi)]
 = −2E[(β̂2 − β2)Σxiui], since Σxi = 0
 = −2E[(Σxiui)²/Σxi²], using β̂2 − β2 = Σxiui/Σxi²
 = −2σ² ------------------- (26)
Substituting equations (24), (25), and (26) in place of A, B, and C in equation (23), we get:
E(Σûi²) = (n − 1)σ² + σ² − 2σ² = (n − 2)σ²
so that E[Σûi²/(n − 2)] = σ². Hence,
σ̂² = Σûi²/(n − 2) ------------------- (27)
is an unbiased estimator of the true variance of the error term σ².
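Continuing the Example 1 sketch, the estimator in equation (27) can be computed directly in plain Python (using the estimates β̂1 = 3.6 and β̂2 = 0.75 obtained earlier):

    Y = [11, 10, 12, 6, 10, 7, 9, 10, 11, 10]
    X = [10, 7, 10, 5, 8, 8, 6, 7, 9, 10]
    beta1, beta2 = 3.6, 0.75                  # OLS estimates from Example 1

    resid = [yi - (beta1 + beta2 * xi) for xi, yi in zip(X, Y)]
    rss = sum(e ** 2 for e in resid)          # residual sum of squares
    n, k = len(Y), 2                          # k = number of estimated parameters
    sigma2_hat = rss / (n - k)                # divide by n - 2, as derived above
    print(sigma2_hat)                         # about 1.83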
Hypothesis testing
Now we have all the estimates required for hypothesis testing. The estimates of the parameters are obtained from samples, and estimates made from samples are prone to error. As a result, we need to test the significance of the estimates and determine the degree of confidence in order to assess their validity. What are the null hypothesis and the alternative hypothesis? The null hypothesis is stated in terms of the true (population) value of a parameter such as β2; it assumes that the independent variable X has no statistically significant effect on the dependent variable Y. Therefore, H0: β2 = 0. To verify whether the null hypothesis is true or false, we set it against an alternative hypothesis, which we evaluate using the sample estimate β̂2. Hence, H1: β2 ≠ 0. Three approaches to hypothesis testing are discussed in this reading manual.
For the intercept β1, the hypotheses are:
H0: β1 = 0
H1: β1 ≠ 0
Decision: applying the standard-error test (reject H0 when the standard error is less than half the absolute value of the estimate), ½ × 24.453 = 12.23 is not greater than 18.141, the standard error of β̂1, so we do not reject the null hypothesis.
Interpretation: at the α = 5% level of significance, the constant term is not statistically significant in affecting the dependent variable (Y) in the estimated model.
For the slope β2, the hypotheses are:
H0: β2 = 0
H1: β2 ≠ 0
Decision: since ½ × 0.5091 = 0.2546 is greater than 0.0357, the standard error of β̂2, we reject the null hypothesis.
Interpretation: at the α = 5% level of significance, weekly income has a statistically significant effect on weekly family consumption expenditure.
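The decision rule applied above can be written out explicitly. A minimal sketch, assuming that 18.141 and 0.0357 are the standard errors of β̂1 and β̂2 reported for Example 2:

    def se_test(estimate, se):
        # standard-error rule of thumb: reject H0 (beta = 0) when se < |estimate| / 2,
        # which is equivalent to |t| = |estimate| / se > 2
        return "reject H0" if se < abs(estimate) / 2 else "do not reject H0"

    print("intercept:", se_test(24.453, 18.141))   # -> do not reject H0
    print("slope:", se_test(0.5091, 0.0357))       # -> reject H0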
In the regression context, r² is a more meaningful measure than r, for the former tells us the proportion of variation in the dependent variable explained by the explanatory variable(s) and therefore provides an overall measure of the extent to which the variation in one variable determines the variation in the other. The latter does not have such value (moreover, as we shall see, the interpretation of r (= R) in a multiple regression model, which we shall study in detail in Unit Four, is of dubious value). The correlation coefficient measures the strength of (linear) association. For example, we may be interested in finding the correlation (coefficient) between smoking and lung cancer, between scores on statistics and mathematics examinations, between high school grades and college grades, and so on. In regression analysis, as already noted, we are not primarily interested in such a measure. Instead, we try to estimate or predict the average value of one variable on the basis of the fixed values of other variables. Thus, we may want to know whether we can predict the average score on a statistics examination by knowing a student's score on a mathematics examination.
So far we were concerned with the problem of estimating regression coefficients, their standard
errors, and some of their properties. We now consider the goodness of fit of the fitted regression
line to a set of data; that is, we shall find out how “well” the sample regression line fits the data.
If all the observations were to lie on the regression line, we would obtain a "perfect" fit, but this is rarely the case: generally there will be some positive ûi and some negative ûi. What we hope for is that these residuals around the regression line are as small as possible. The coefficient of determination r² (two-variable case) or R² (multiple regression) is a summary measure that tells how well the sample regression line fits the data. It can be computed as r² = ESS/TSS = 1 − Σûi²/Σ(Yi − Ȳ)², the ratio of the explained sum of squares to the total sum of squares.
Example: To measure the goodness of fit, consider our hypothetical data on weekly family consumption expenditure (Y) and weekly income (X).
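Since the consumption data table is not reproduced above, the Teff data from Example 1 can serve as a stand-in to show the computation. A minimal sketch in plain Python:

    Y = [11, 10, 12, 6, 10, 7, 9, 10, 11, 10]
    X = [10, 7, 10, 5, 8, 8, 6, 7, 9, 10]
    beta1, beta2 = 3.6, 0.75

    y_bar = sum(Y) / len(Y)
    tss = sum((yi - y_bar) ** 2 for yi in Y)    # total sum of squares
    rss = sum((yi - (beta1 + beta2 * xi)) ** 2  # residual sum of squares
              for xi, yi in zip(X, Y))

    r2 = 1 - rss / tss    # share of the variation in Y explained by X
    print(round(r2, 3))   # about 0.518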