Econometrics


Atlas Technology and Business College

Hawassa Campus
Department of Business Management
MBA Program

Econometrics
By
Mekasha T. (Ph.D. Scholar)
2022
References
✓ Damodar Gujarati, 2004. Basic Econometrics, 4e.
✓ Jeffrey Wooldridge. Introductory Econometrics: A Modern Approach, 2e.
✓ William H. Greene, 2002. Econometric Analysis, 5e.
✓ G.S. Maddala, 1992. Introduction to Econometrics, 2e.
Introduction
• Economic theories: suggest the existence of many r/ships among economic variables
- Microeconomics: demand and supply models (the quantities demanded and supplied of a good depend on its price)
- Macroeconomics:
o investment function - explains how the amount of aggregate investment in the economy changes as the rate of interest changes; and
o consumption function - relates aggregate consumption to the level of aggregate disposable income.
Cont’d
• Questions we might be interested in:
▪ If the price of a commodity changes by a certain magnitude, by how much would the quantity demanded of that commodity change?
▪ Given that we know the value of one variable, can we forecast or predict the corresponding value of another?
• The field of knowledge w/h helps us to carry out such measurement and evaluation of economic theories in empirical terms is known as econometrics.
Introduction: What is Econometrics?
• Econometrics: measurement in economics.
• Econometrics means “economic measurement.”
• It is the application of statistical and mathematical methods to the analysis of economic data with the purpose of giving empirical content to economic theories and verifying or refuting them.
• The techniques of econometrics consist of a blend/combination of economic theory, mathematical modelling and statistical analysis.
Definitions given by different authors
• Gujarati (2003) … the social science in which the tools of economic
theory, mathematics, and statistical inference are applied to the analysis
of economic phenomena. Thus, it is concerned with empirical
determination of economic laws.
• Maddala (1992) …the application of statistical and mathematical
methods to the analysis of economic data, with a purpose of giving
empirical content to economic theories and verifying them or refuting
them.
• Wooldridge (2004) … econometrics is based upon statistical methods
for estimating economic relationships, testing economic theories, and
evaluating and implementing government and business policy.
• Verbeek (2008) … explained econometrics as the interaction of
economic theory, observed data and statistical methods.
• Greene (2003) … is a unification of the theoretical-quantitative and the
empirical-quantitative approach to economic problems.
Cont’d
• Generally,
o econometrics is concerned with “economics measurement”
and quantitative analysis of economic phenomenon.
o Econometrics is the application of statistical and mathematical
methods
- to the analysis of economic data,
- with a purpose of giving empirical content to economic theories
and
- verifying or refuting them.
Economic and Econometric Models
▪ What is a model?
• A model is a simplified representation of a real-world process.
• It is important for explaining complex real-world phenomena.
• For instance, “the demand for oranges depends on the price of oranges”, keeping other things constant, is a simplified representation.
• Why do we hold other factors constant?
o because there are other variables that determine the demand for oranges.
o to easily understand, communicate, and test empirically with data with the help of the model.
Cont’d
• An economic model - a set of assumptions that
approximately describes the behavior of an economy.
• An econometric model - a set of behavioral equations
derived from the economic model. These equations involve
some observed variables and some unobserved variable
(disturbances).
- It is used to test the empirical validity of the economic model
and to make forecasts or for policy analysis.
Why a Separate Discipline?
• As the preceding definitions suggest, econometrics is an
amalgam of economic theory, mathematical economics,
economic statistics, and mathematical statistics.
• Economic theory: makes statements or hypotheses that are mostly qualitative in nature, while econometrics gives empirical content to most economic theory (it helps us to quantify economic theory).
• Example: ceteris paribus, a negative r/ship b/n price and quantity demanded.
- but the theory by itself doesn’t provide any numerical measures.
Cont’d
• Mathematical economics: expresses economic theory in mathematical form (employs mathematical symbolism) without empirical verification of the theory, while econometrics is mainly interested in the latter (empirical verification)
- neither economic theory nor mathematical economics allows for random elements which might affect the r/ship and make it stochastic
- furthermore, they do not provide numerical values for the coefficients of the r/ship.
Cont’d
• Econometrics methods are designed to
- take into account random disturbances which create
deviations from the exact behavioral patterns suggested by
economic theory and mathematical economics.
- provide numerical values of the coefficients of economic
phenomena.
Cont’d
• Economic Statistics: is mainly concerned with collecting, processing and presenting economic data in the form of charts and tables, and perhaps detecting some r/ships b/n various economic magnitudes.
• it is mainly a descriptive aspect of economics.
- it does not provide explanations of the development of the various variables,
- it does not provide measurement of the parameters of economic r/ships,
- it doesn’t go further to test economic theory, and
- it is not concerned with using the collected data to test economic theories.
Cont’d
• Mathematical statistics: deals with methods of measurement, which are developed on the basis of controlled experiments in laboratories.
- it deals with statistics from a mathematical point of view using probability theory, but econometricians need special methods to handle the economic data generating process.
To sum-up…
• Econometrics is an amalgam of economic theory,
mathematical economics, economic statistics, and
mathematical statistics.
- Meaning: it borrows methods from statistics and
mathematics for estimating economic r/ships, testing
economic theories, and evaluating and implementing
government and business policy whenever possible.
• Econometrics is mainly concerned with:
✓Estimation of r/ships from sample data
…To sum-up…
✓Hypotheses testing about how variables are
related
▪ the existence of r/ships between variables
▪ the direction of the r/ships b/n the DV and its hypothesized
observable determinants
▪ the magnitude of the r/ships b/n a DV and the IDVs thought
to determine it
• Thus, the main task of econometrics is estimation
and hypothesis testing.
Methodology of Econometrics
1. Statement of theory or hypothesis
2. Specification of the mathematical model of the theory
3. Specification of the econometric/statistical model of the
theory
4. Obtaining Data
5. Estimating the parameters of the Econometric Model
6. Hypothesis Testing
7. Forecasting or Prediction
8. Using model for control or policy purposes
Description of the steps
1. Statement of theory or hypothesis
• A set of assumptions that describes the behavior of an economic phenomenon.
- Example: Keynes stated that “consumption increases as income increases, but not as much as the increase in income”.
- It means that “the marginal propensity to consume (MPC) for a unit change in income is greater than zero but less than unity”.
Cont’d
2.Formulation (specification) of the mathematical model of the
theory:
- a set of equations derived from the economic models
- although Keynes postulated a positive r/ship b/n consumption and
income, a mathematical economist might suggest the following form of
consumption function:
Y = β1 + β2X ; 0 < β2 < 1
Y = consumption expenditure
X = income
β1 and β2 are parameters; β1 is the intercept and β2 the slope coefficient
Cont’d
3. Specification of the econometric model of the theory
• Due to the inexact r/ship b/n economic variables, the econometrician would modify the deterministic consumption function as follows:
Y = β1 + β2X + u ; 0 < β2 < 1
where
- Y = consumption expenditure; X = income;
- β1 and β2 are parameters; β1 is the intercept and β2 the slope coefficient;
- u is the disturbance term or error term (a random or stochastic variable)

4. Collection of relevant data on the variables implied by the econometric model.
• To estimate the econometric model (to obtain the numerical values of β1 and β2), we need data.
• Example: Y = Personal consumption expenditure
X = Gross Domestic Product; both in billions of US Dollars
Cont’d
Year Y X
1980 2447.1 3776.3
1981 2476.9 3843.1
1982 2503.7 3760.3
1983 2619.4 3906.6
1984 2746.1 4148.5
1985 2865.8 4279.8
1986 2969.1 4404.5
1987 3052.2 4539.9
1988 3162.4 4718.6
1989 3223.3 4838
1990 3260.4 4877.5
1991 3240.8 4821
Cont’d
5. Estimation of the model parameters.

Ŷ = -231.8 + 0.7194X
• MPC is about 0.72, meaning that for the sample period, when real income increases by 1 USD, real consumption expenditure increases, on average, by about 72 cents.
• Note: a hat symbol (^) above a variable signifies an estimator of the relevant population value.
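As a sketch, the step-5 estimates can be reproduced from the data table above with the deviation-form OLS formulas (numpy is assumed available; variable names are illustrative):

```python
import numpy as np

# GDP (X) and personal consumption expenditure (Y), 1980-1991, billions of USD
X = np.array([3776.3, 3843.1, 3760.3, 3906.6, 4148.5, 4279.8,
              4404.5, 4539.9, 4718.6, 4838.0, 4877.5, 4821.0])
Y = np.array([2447.1, 2476.9, 2503.7, 2619.4, 2746.1, 2865.8,
              2969.1, 3052.2, 3162.4, 3223.3, 3260.4, 3240.8])

# Deviation-form OLS: slope = sum(x*y)/sum(x^2), intercept = Ybar - slope*Xbar
x = X - X.mean()
y = Y - Y.mean()
beta2 = (x * y).sum() / (x ** 2).sum()   # estimated slope (the MPC)
beta1 = Y.mean() - beta2 * X.mean()      # estimated intercept

print(f"Y^ = {beta1:.1f} + {beta2:.4f} X")   # Y^ = -231.8 + 0.7194 X
```

Running this recovers the same fitted line reported on the slide, Ŷ = -231.8 + 0.7194X.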
Cont’d
6. Conduct hypothesis tests to verify whether:
- the specification of the model is correct and
- the model assumptions are valid.
• Do the estimates meet the expectations of the theory that is being tested? Is MPC < 1 statistically? If so, this may support Keynes’ theory.
• Confirmation or refutation of economic theories based on sample evidence is the object of statistical inference (hypothesis testing).
Based on step (6)
Cont’d
• If the model failed to pass the specification test and
diagnostic checking step, then one has to revise the
specification of the econometric model (or new
specification)
• If the model passes the specification testing and diagnostic
checking steps, then one has to proceed with testing any
hypothesis of interest
- Example: which of the explanatory variables significantly affect
the response (endogenous) variable?
Cont’d
7. Forecasting or Prediction
- given future value(s) of X, what is the future value of Y?
- what is the forecasted consumption expenditure in 2025 if income (GDP) increases to $6,000 billion?
Ŷ = -231.8 + 0.7194(6000) = 4084.6
• Income Multiplier M = 1/(1 – MPC) ≈ 3.57. A decrease (increase) of $1 in investment will eventually lead to about a $3.57 decrease (increase) in income.
Cont’d
• Suppose the government decided to reduce income tax. What
will be the effect of such a policy on income and consumption
expenditure and ultimately on employment?
• Assume that due to the proposed policy change investment
expenditure increased. What will be its effect on the
economy?
• Knowing MPC, one can predict the future course of income,
consumption expenditure, and employment following a
change in the government’s fiscal policies.
• The critical value in this computation is MPC, for the multiplier
(M) depends on it.
Cont’d
8. Using model for control or policy purposes
Suppose that the government believes a $4,000 billion level of consumption expenditure keeps the inflation rate at 10%.
Y = 4000 = -231.8 + 0.7194X  ⇒  X ≈ 5882
• With MPC ≈ 0.72, an income of about $5,882 billion will produce an expenditure of $4,000 billion.
• Through fiscal and monetary policy, the government can manipulate the control variable X to achieve the desired level of the target variable Y.
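Steps 7 and 8 are plain arithmetic on the fitted line; a minimal sketch using the estimated coefficients from step 5:

```python
# Fitted consumption function from step 5
beta1, beta2 = -231.8, 0.7194        # intercept and MPC

# Step 7 (forecasting): consumption when income (GDP) reaches $6,000 billion
y_hat = beta1 + beta2 * 6000
print(round(y_hat, 1))               # 4084.6

# Income multiplier M = 1/(1 - MPC)
M = 1 / (1 - beta2)
print(round(M, 2))                   # ~3.56 (3.57 when MPC is rounded to 0.72)

# Step 8 (control): income needed to hit a target consumption of $4,000 billion
x_needed = (4000 - beta1) / beta2
print(round(x_needed))               # 5882
```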
GOALS OF ECONOMETRICS
We can distinguish three main goals of econometrics:
1.Analysis, i.e. testing of economic theory
2.Policy-making, i.e. supplying numerical estimates of the coefficients of
economic relationships, which may be then used for decision- making;
D = α + β1I + β2Ex + β3PI + β4PEx + Ui
3. Forecasting, i.e. using the numerical estimates of the coefficients in order
to forecast the future values of the economic magnitudes.
Ŷ = −261.09 + 0.2453Xi
• The goals are not mutually exclusive.
• Successful econometric applications should really include some
combination of the three aims.
The Structure of Economic Data
A. Cross-section data:
- this data set consists of a sample of individuals, HHs,
firms, cities, states, countries, or a variety of other units,
taken at a given point in time.
- an important feature is that we assume the data have been obtained by random sampling from the underlying population
- these types of data are important for testing
microeconomic hypotheses and evaluating economic
policies
Cont’d
• Partial list of hypothetical cross sectional data collected
at one period
Cont’d
B. Time-series data
• this data set consists of observations on a variable or
several variables over time at certain regular time
interval
• Examples of time series data:
stock prices,
money supply,
consumer price index,
gross domestic product,
annual homicide rates, and
automobile sales figures
Cont’d
• Time is an important dimension - because:
- past events can influence future events and
- lags in behaviour are prevalent in the social sciences
• The chronological ordering of observations in a time
series conveys potentially important information
Cont’d
• Example: Ethiopian Money supply and consumer price index
(CPI) from 2005 to 2014
Cont’d
C. A Panel or Longitudinal Data
• Consists of a time series for each cross-sectional
member in the data set.
• A panel data set contains repeated observations over
the same units (individuals, households, firms),
collected over a number of periods.
• Data sets that have both cross-sectional and time series
dimensions are being used more and more often in
empirical research.
• In fact, data with cross-sectional and time series
aspects can often shed light on important policy
questions.
Cont’d
• To collect panel/longitudinal data we follow the same
individuals, families, firms, cities, states, or
whatever, across time.
• Example - a panel data set on individual wages, hours,
education, and other factors is collected by randomly
selecting people from a population at a given point in
time.
• Then, these same people are interviewed at several
subsequent points in time.
• This gives us data on wages, hours, education, and so
on, for the same group of people in different years
Cont’d
• Example: A Two-Year Panel Data Set on City Crime Statistics
Cont’d
D. Pooled Cross Section Data
• Is a randomly sampled cross sections of individuals at
different points in time
• Example: suppose that two cross-sectional HH surveys are
taken in Ethiopia, one in 2005 and one in 2008.
• In 2005, a random sample of HHs is surveyed for variables
such as income, savings, family size, and so on
• In 2008, a new random sample of HHs is taken using the
same survey questions.
Cont’d
• In order to increase our sample size, we can form a pooled cross section by combining the two years’ surveys.
• B/c random samples are taken in each year, it would be a fluke if the same HH appeared in the sample in both years.
Cont’d
E. Non-experimental vs Experimental data
a. Non-experimental data are obtained from observations of a system that is not subject to experimental control.
b. Experimental data are obtained from controlled experiments in a laboratory.
Cont’d
F. Qualitative versus quantitative data
• The data may be quantitative (e.g. exchange rates, stock prices, number of shares outstanding, GDP, inflation, …) or qualitative (e.g. gender, color, race, religion, etc.).
The End!
Chapter Two
Simple Linear Regression Analysis
(A Two Variable Regression Model)
Introduction
• SLRM can be used to study the r/ship b/n two
variables
• In single-equation regression (SLR) models:
one variable, so called the DV, is expressed as a linear
function of another variable, called the explanatory
variable and
it is assumed implicitly that causal r/ships, if any, b/n
the DV and explanatory variables flow in one direction
only, namely, from the explanatory variables to the DV.
Concept of Regression Function
• Regression analysis is concerned with
- study of the dependence of one variable, the DV,
on one or more other variables called explanatory
variables (the dependency of DV on one or more IDV/s),
…with a view of estimating and/or predicting the
(population) mean or average value of the former in
terms of the known or fixed values of the latter
…it is concerned with describing and evaluating the
r/ship b/n DV and IDV(s).
Cont’d
Example:
Cont’d
• So, regression analysis is concerned with describing and evaluating the
r/ship b/n a given variable (the DV) and one or more variables which
are assumed to influence the given variable (IDVs or explanatory
variables).
• The simplest economic r/ship is presented through a two-variable
model (simple linear regression model) which is given by:
Y= a + bX
• Where
- a and b are unknown parameters (also called regression coefficients) that we
estimate using sample data.
- Y is the DV and
- X is the IDV.
Terminology for SLRM
Cont’d
• Example: Suppose the r/ship b/n expenditure (Y) and income (X) of
HHs is expressed as: Y = 0.6X + 120
• Here, on the basis of income, we can predict expenditure.
• For instance, if the income of a certain HH is birr 1,500, then the
estimated expenditure will be
Expenditure = 0.6 (1500)+120 = Birr 1,020
• Note that, since expenditure is estimated on the basis of income,
expenditure is the DV and income is the IDV.
Stochastic and Non-stochastic Relationships
• A r/ship b/n X and Y, characterized as Y = f(X) is said to be
deterministic or non-stochastic if for each value of the IDV (X)
there is one and only one corresponding value of DV (Y).
• On the other hand, a r/ship b/n X and Y is said to be stochastic
if for a particular value of X there is a whole probabilistic
distribution of values of Y.
- in such a case, for any given value of X, the DV Y assumes some
specific value only with some probability.
- let’s illustrate the distinction b/n stochastic and non stochastic
r/ships with the help of a supply function.
Cont’d
• Assuming that the supply for a certain commodity depends on its price (other determinants taken to be constant) and the function being linear, the r/ship can be expressed as:
Q = a + bP
• The above r/ship b/n P and Q shows that for a particular value of P
there is only one corresponding value of Q.
• This is a deterministic (non-stochastic) r/ship since for each price
there is always only one corresponding quantity supplied.
- this implies that
all the variation in Y is due solely to variation in X, and
there are no other factors affecting the dependent variable (Q)
Cont’d
• In deterministic/non-stochastic r/ship:
- all the points of price-quantity pairs, if plotted on a two-
dimensional plane, would fall on a straight line
- however, if we gather observations on the quantity actually
supplied in the market at various prices and we plot them on a
diagram we see that they do not fall on a straight line
Cont’d
The error term:
• Consider the previous model: Y = 0.6X + 120. This r/ship is deterministic or exact; that is, given income we can determine the exact expenditure of a HH.
• But in reality this rarely happens: different HHs with the same income are not expected to spend equal amounts, due to habit persistence, geographical and time variation, etc.
• Thus, we should express the regression model as:
Yi = α + βXi + εi
where εi is the random error term (also called the disturbance term)
Why do we need to include the stochastic (random) component, for
example in the consumption function?
1.Omission of variables: leads to a misspecification problem. For example, income is not the only determinant of consumption.
2.Vagueness of theory: the theory, if any, determining the behavior of Y may
be, and often is, incomplete. We might know for certain that weekly income X
influences weekly consumption expenditure Y, but we might be ignorant or
unsure about the other variables affecting Y. Therefore, ui may be used as a
substitute for all the excluded or omitted variables from the model
3.There may be measurement error in collecting data. We may use poor
proxy variables, inaccuracy in collection and measurement of sample data.
Cont’d
4. The functional form may not be correct.
5. Erratic (random/unpredictable) human behaviour - even if we
succeed in introducing all the relevant variables into the model,
there is bound to be some “intrinsic” randomness in individual Y’s
that cannot be explained no matter how hard we try. The
disturbances, the u’s, may very well reflect this intrinsic
randomness.
6. Error of aggregation - the sum of the parts may be different from
the whole.
Cont’d
7.Sampling error: Consider a model relating Consumption (Y) with income
(X) of HHs. The sample we randomly choose to examine the r/ship may
turn out to be predominantly poor HHs. In such cases, our estimation of
α and β from this sample may not be as good as that from a balanced
sample group.
8.Unavailability of data: Even if we know what some of the excluded
variables are and therefore consider a multiple regression rather than a
simple regression, we may not have quantitative information about these
variables.
Cont’d
• Thus, a full specification of a regression model should include a
specification of the probability distribution of the disturbance (error) term.
This information is given by what we call basic assumptions of the Classic
Linear Regression Model (CLRM).
• Consider the model
Yi = α + βXi + εi , i = 1, 2, ..., n
Here the subscript i refers to the ith observation. In the CLRM, Yi and Xi are observable while εi is not.
• If i refers to some point or period of time, then we speak of time series
data.
• On the other hand, if i refers to the ith individual, object, geographical
region, etc., then we speak of cross-sectional data.
Assumption of the Classical Linear
Regression Model
• The linear regression model is based on certain assumptions, some of which refer to
▪ the distribution of the random variable ε,
▪ the r/ship b/n ε and the explanatory variables, and
▪ the r/ship b/n the explanatory variables themselves.
• We will group the assumptions into two categories: (a) stochastic assumptions, (b) other assumptions.
Cont’d
• Assumption 1: The model is linear in parameters
- the model should be linear in the parameters regardless of whether the explanatory variables and the DV are linear or not
- in other words, the regression model is linear in the parameters, though it may or may not be linear in the variables, e.g.
Yi = α + βXi + εi
- and the deterministic component (α + βXi) and the stochastic component (εi) are additive
Example
Cont’d
• Assumption 2: Zero Mean Value of Disturbance ui
- This means that for each value of X, ε may assume various values, some
greater than zero and some smaller than zero, but if we consider all the
possible values of ε, for any given value of X, they would have an
average value equal to zero.
Cont’d
- Given the value of X, the mean, or expected, value of the random disturbance
term ui is zero.
- Technically, the conditional mean value of ui is zero.
- Symbolically, we have E(ui/Xi) = 0 or
- All that this assumption says is that the factors not explicitly included in the
model, and therefore subsumed in ui , do not systematically affect the mean
value of Y; so to speak, the positive ui values cancel out the negative ui
values so that their average or mean effect on Y is zero.
- In passing, note that the assumption E(ui/Xi) = 0 implies that E(Yi/Xi) = β1 + β2Xi.
- Therefore, the two assumptions are equivalent.
Cont’d
• Assumption 3: Fixed X Values or X Values
Independent of the Error Term
• The disturbance term is not correlated with the explanatory variable(s).
The u’s and the X’s do not tend to vary together; their covariance is zero.
- X values are fixed in repeated sampling
- Values taken by the regressor X may be considered fixed in
repeated samples (the case of fixed regressor) or they may be
sampled along with the dependent variable Y (the case of
stochastic regressor).
- in both cases it is assumed that the X variable(s) and the error term are independent, that is: cov(ui, Xi) = 0
Cont’d
• Assumption 4: Homoscedasticity or Constant
Variance of ui
- the variance of the error or disturbance term is constant regardless of the value of X. Symbolically: var(ui/Xi) = σ² for all i
Cont’d
• Assumption 5: No Autocorrelation/no serial
correlation between the Disturbances
cov(ui, uj / Xi, Xj) = 0 for all i ≠ j
• where i and j are two different observations and cov(.) means covariance.
Cont’d
• Assumption 6: The number of observations, n, must
be greater than the number of parameters to be
estimated
- Alternatively, the number of observations must be greater than the number of
explanatory variables.
Cont’d
• Assumption 7: The Nature of X Variables
- the X values in a given sample must not all be the same. Technically, var(X) must be a positive number.
- furthermore, there can be no outliers in the values of the X variable, that is, values that are very large in relation to the rest of the observations.
- this assumption states that the values of X must vary; if all values of X are identical, it is impossible to estimate the coefficients of the model.
Cont’d
• Assumption 8: The error term ui is normally
distributed
- in conjunction with assumptions 3, 4, and 5 this implies that ui is independently and normally distributed with mean zero and a common variance: ui ~ N(0, σ²)
Cont’d
- The concept of population regression function (PRF)

- The concept of Sample regression function (SRF)


Methods of Estimation
• After specifying the model and stating its underlying
assumptions, the next step is the estimation of the
numerical values of the parameters of economic r/ships.
• The parameters of the simple linear regression model can be
estimated by the three most commonly used methods:
- Ordinary least square method (OLS)
- Method of moments (MM)
- Maximum likelihood method (MLM)
The Ordinary Least square (OLS) method of Estimation
In the regression model Yi = α + βXi + εi, the values of the parameters α and β are not known. When they are estimated from a sample of size n, we obtain the sample regression line given by:
Ŷi = α̂ + β̂Xi , i = 1, 2, ..., n
where α and β are estimated by α̂ and β̂, respectively, and Ŷi is the estimated value of Yi.
The dominating and powerful estimation method for the parameters (or regression coefficients) α and β is the method of least squares.
The deviations between the observed and estimated values of Y are called the residuals ε̂i, that is:
ε̂i = Yi − Ŷi , i = 1, 2, ..., n
Cont’d
The magnitude of the residuals is the vertical distance b/n the actual observed points and the estimating line (see the figure below).
The estimating line will have a ‘good fit’ if it minimizes the error b/n the estimated points on the line and the actual observed points that were used to draw it.
Our aim is then to determine the equation of such an estimating line in such a way that the error in estimation is minimized.
Cont’d
β̂ = [nΣXiYi − (ΣXi)(ΣYi)] / [nΣXi² − (ΣXi)²] = [ΣXiYi − nX̄Ȳ] / [ΣXi² − nX̄²]

α̂ = Ȳ − β̂X̄

where X̄ and Ȳ are the mean values of the independent and dependent variables, respectively, that is X̄ = (1/n)ΣXi and Ȳ = (1/n)ΣYi
Cont’d
• OR, in deviation form:
β̂ = Σxiyi / Σxi²
• The expression above for estimating the parameter coefficient is termed the formula in deviation form.
Cont’d
α̂ and β̂ are said to be the ordinary least-squares (OLS) estimators of α and β, respectively. The line Ŷi = α̂ + β̂Xi is called the least squares line, or the estimated regression line of Y on X.

Note: Model in deviation form
The OLS estimator of β is:
β̂ = Σxiyi / Σxi²
Example
Cont’d
• Required:
- based on the given information, estimate the regression equation
Cont’d
• The Coefficient of Determination (R² – explained variation as a percentage of the total variation)
Yi = b0 + b1Xi + Ui

[Variation in Yi] = [Systematic variation] + [Random variation]
[Variation in Yi] = [Explained variation] + [Unexplained variation]
Cont’d
Total Sum of Squares = Regression (Explained) Sum of Squares + Error (Residual) Sum of Squares
TSS = RSS + ESS
Cont’d
• In other words, the Total Sum of Square (TSS) is decomposed in to
Regression (explained) Sum of Square (RSS) and Error (residual or
unexplained) Sum of Square (ESS).
TSS= RSS + ESS
Computation formulas
• The TSS is a measure of the dispersion of the observed values of Y about their mean. It is computed as:
TSS = Σ(Yi − Ȳ)² = Σyi²
• The regression (explained) sum of squares (RSS) measures the amount of the total variability in the observed values of Y that is accounted for by the linear r/ship b/n the observed values of X and Y. It is computed as:
RSS = Σ(Ŷi − Ȳ)² = β̂² Σ(Xi − X̄)² = β̂² Σxi²
Cont’d
• The error (residual or unexplained) sum of squares (ESS) measures the amount of the total variability in the observed values of Y about the regression line. It is computed as:
ESS = Σ(Yi − Ŷi)² = TSS − RSS
• If a regression equation does a good job of describing the r/ship b/n two variables, the explained sum of squares should constitute a large portion of the total sum of squares.
• Thus, it would be of interest to determine the magnitude of this proportion by computing the ratio of the explained sum of squares to the total sum of squares. This proportion is called the sample coefficient of determination, R². That is:
R² = RSS/TSS = 1 − ESS/TSS

Equivalently, R² = β̂ Σxiyi / Σyi² , where xi = Xi − X̄ and yi = Yi − Ȳ.
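The decomposition TSS = RSS + ESS can be checked numerically; a small sketch (the helper name and the illustrative data are made up for this example):

```python
import numpy as np

def sums_of_squares(X, Y):
    """Return (TSS, RSS, ESS, R2) for a simple linear regression of Y on X."""
    x, y = X - X.mean(), Y - Y.mean()
    beta = (x * y).sum() / (x ** 2).sum()   # OLS slope in deviation form
    TSS = (y ** 2).sum()                    # total sum of squares
    RSS = beta ** 2 * (x ** 2).sum()        # regression (explained) SS
    ESS = TSS - RSS                         # error (residual) SS
    return TSS, RSS, ESS, RSS / TSS

# hypothetical data, chosen only to illustrate the decomposition
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
TSS, RSS, ESS, R2 = sums_of_squares(X, Y)
assert np.isclose(TSS, RSS + ESS)           # the decomposition holds exactly
print(round(R2, 4))                         # 0.9976
```

Note that this file follows the slides’ labelling (RSS = regression SS, ESS = error SS), which is the reverse of some textbooks.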
Cont’d
• Note
1) The proportion of the total variation in the dependent variable (Y) that is explained by changes in the independent variable (X), or by the regression line, is equal to: R² × 100%.
2) The proportion of the total variation in the dependent variable (Y) that is due to factors other than X (e.g., due to excluded variables, chance, etc.) is equal to: (1 − R²) × 100%.
Test for the coefficient of determination (R²)
The largest value that R² can assume is 1 (in which case all observations fall on the regression line), and the smallest it can assume is zero.
Cont’d
A low value of R² is an indication that:
▪ X is a poor explanatory variable, in the sense that variation in X leaves Y unaffected, or
▪ While X is a relevant variable, its influence on Y is weak compared to some other variables that are omitted from the regression equation, or
▪ The regression equation is mis-specified (e.g., an exponential r/ship might be more appropriate).
Thus, small values of R² cast doubt on the usefulness of the regression equation. We do not, however, pass final judgment on the equation until it has been subjected to an objective statistical test. Such a test is accomplished by means of analysis of variance (ANOVA), which enables us to test the significance of R² (i.e. the adequacy of the linear regression model).
The ANOVA table for simple linear Regression

Source of Variation | Sum of Squares | Degrees of freedom | Mean Square | Variance ratio
Regression          | RSS            | 1                  | RSS/1       | Fcal = (RSS/1) / (ESS/(n−2))
Residual            | ESS            | n−2                | ESS/(n−2)   |
Total               | TSS            | n−1                |             |

To test for the significance of R², we compare the variance ratio with the critical value from the F distribution with 1 and (n−2) degrees of freedom in the numerator and denominator, respectively, for a given significance level α.
Decision: if the calculated variance ratio exceeds the tabulated value, that is, if Fcal > F(1, n−2), we then conclude that R² is significant (or that the linear regression model is adequate).
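Instead of looking up an F table, the critical value can be obtained from software; a sketch assuming SciPy is available:

```python
from scipy.stats import f

# Critical value F(1, n-2) at significance level alpha = 0.05, for n = 16
alpha, n = 0.05, 16
f_crit = f.ppf(1 - alpha, dfn=1, dfd=n - 2)
print(round(f_crit, 2))   # 4.60

# Decision rule: reject "no linear relationship" when Fcal > f_crit
```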
Cont’d
Note: the F test is designed to test the significance of all variables, or a set of variables, in a regression model. In the two-variable model, however, it is used to test the explanatory power of a single variable (X) and, at the same time, it is equivalent to the test of significance of R².
Illustrative Example
Consider the following data on the pctg rate of change in electricity consumption (millions KWH) (Y) and the rate of change in the price of electricity (Birr/KWH) (X) for the years 1979-1994.
Summary statistics (note here that xi = Xi − X̄ and yi = Yi − Ȳ):
n = 16, X̄ = 1.280625, Ȳ = 23.42688, Σxi² = 92.20109, Σyi² = 13228.7, Σxiyi = −779.235
Cont’d
• Estimation of regression coefficients
The slope β̂ and the intercept α̂ are computed as:
β̂ = Σxy/Σx² = −779.235/92.20109 = −8.45147
α̂ = Ȳ − β̂X̄ = 23.42688 − (−8.45147)(1.280625) = 34.25004
Therefore, the estimated regression equation is:
Ŷ = α̂ + β̂X → Ŷ = 34.25004 − 8.45147X
Test of model adequacy
TSS = Σ(Yi − Ȳ)² = Σyi² = 13228.7
RSS = Σ(Ŷi − Ȳ)² = β̂² Σ(Xi − X̄)² = β̂² Σxi² = (−8.45147)²(92.20109) = 6585.679
Cont’d
ESS = TSS − RSS = 13228.7 − 6585.679 = 6643.016
R² = RSS/TSS = 6585.679/13228.7 = 0.4978
Thus, we can conclude that:
• About 50% of the variation in electricity consumption is due to changes in the price of electricity.
• The remaining 50% of the variation in electricity consumption is not due to changes in the price of electricity, but instead due to chance and other factors not included in the model.
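The sums-of-squares decomposition can be sketched in Python as below. Note the deck's naming convention, which is the reverse of some textbooks: RSS here is the regression (explained) sum of squares and ESS the residual sum of squares.

```python
# Sums of squares for the electricity example, using the deck's naming:
# RSS = regression (explained) sum of squares, ESS = residual sum of squares.
sum_xy, sum_x2, sum_y2 = -779.235, 92.20109, 13228.7

beta_hat = sum_xy / sum_x2
tss = sum_y2                    # total variation in Y
rss = beta_hat ** 2 * sum_x2    # variation explained by the regression
ess = tss - rss                 # unexplained (residual) variation
r_squared = rss / tss           # about 0.4978, as on the slide

print(round(rss, 2), round(ess, 2), round(r_squared, 4))
```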
ANOVA table

Source of Variation | Sum of Squares | Degrees of freedom | Mean Square          | Variance ratio
Regression          | RSS = 6585.679 | 1                  | RSS/1 = 6585.679     | Fcal = (RSS/1)/(ESS/(n−2)) = 13.87916
Residual            | ESS = 6643.016 | 16−2 = 14          | ESS/(n−2) = 474.5011 |
Total               | TSS = 13228.7  | 16−1 = 15          |                      |

Critical value: Fα(1, n−2) = F0.05(1, 14) = 4.60
Cont’d
• Decision: Since the calculated variance ratio exceeds the critical value, we reject the null hypothesis of no linear r/ship b/n price and consumption of electricity at the 5% level of significance.
• Thus, we conclude that R² is significant, that is, the linear regression model is adequate and is useful for prediction purposes.
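The F test just described can be sketched as follows, reusing the ANOVA figures from the example (the critical value 4.60 is the tabulated F0.05(1, 14) quoted in the slides, not computed here):

```python
# F test of model adequacy using the ANOVA figures from the example.
rss, ess, n = 6585.679, 6643.016, 16

f_cal = (rss / 1) / (ess / (n - 2))   # variance ratio, about 13.879
f_crit = 4.60                          # F_0.05(1, 14), from the F table

# Reject H0 (no linear relationship) when the variance ratio
# exceeds the tabulated critical value.
print(f_cal > f_crit)  # prints True: reject H0, the model is adequate
```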
Estimation of the standard error of β and test of its significance
• An unbiased estimator of the error variance σ² is given by:
σ̂² = (1/(n−2)) Σêi² = ESS/(n−2) = 6643.016/(16−2) = 474.5011
Thus, an unbiased estimator of Var(β̂) is given by:
V̂(β̂) = σ̂²/Σxi² = 474.5011/92.20109 = 5.146372
The standard error of β̂ is:
s.e.(β̂) = √V̂(β̂) = √5.146372 = 2.268562
The hypothesis of interest is:
H0: β = 0
H1: β ≠ 0
We calculate the test statistic:
t = β̂/s.e.(β̂) = −8.45147/2.268562 = −3.72548
Cont’d
• For α = 0.05, the critical value from the Student's t distribution with (n−2) degrees of freedom is: tα/2(n−2) = t0.025(14) = 2.14479
• Decision: Since |t| > tα/2(n−2), we reject the null hypothesis, and conclude that β is significantly different from zero.
- In other words, the price of electricity significantly and negatively affects
electricity consumption.
• The interpretation of the estimated regression coefficient β̂ = −8.45147 is that for a one percent drop (increase) in the growth rate of the price of electricity, there is an 8.45 percent increase (decrease) in the growth rate of electricity consumption.
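The standard error and t statistic computed above can be reproduced as a short sketch (the critical value 2.14479 is the tabulated t0.025(14) quoted in the slides):

```python
import math

# Standard error of the slope and its t statistic, electricity example.
ess, n, sum_x2 = 6643.016, 16, 92.20109
beta_hat = -8.45147

sigma2_hat = ess / (n - 2)                 # error variance estimate, ~474.50
se_beta = math.sqrt(sigma2_hat / sum_x2)   # standard error of beta-hat, ~2.2686
t_cal = beta_hat / se_beta                 # test statistic for H0: beta = 0
t_crit = 2.14479                           # t_0.025(14), from the t table

print(abs(t_cal) > t_crit)  # prints True: beta differs significantly from zero
```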
Illustration - II
(Slides showing the RSS and ESS computations for a second example; the detailed content is not recoverable in this copy.)
Standard error test, Student’s t-test and Confidence interval
I. Standard error test
• To decide whether the estimates are significantly different from zero,
i.e. whether the sample from which they have been estimated might have
come from a population whose true parameters are zero
Decision rule (at roughly the 5% level): if s.e.(β̂) > |β̂|/2, accept the null hypothesis, i.e. the estimate is not statistically significant; if s.e.(β̂) < |β̂|/2, reject the null hypothesis, i.e. the estimate is statistically significant.
The acceptance or rejection of the null hypothesis has definite
economic meaning
• Namely, the acceptance of the null hypothesis (the
slope parameter is zero) implies that the explanatory
variable to which this estimate relates does not in fact
influence the dependent variable Y and should not be
included in the function, since the conducted test
provided evidence that changes in X leave Y unaffected.
• In other words, acceptance of H0 implies that the r/ship between Y and X is in fact zero, i.e. there is no r/ship b/n X and Y.
Example
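A minimal sketch of the standard error test applied to the electricity example, assuming the conventional rule of thumb that an estimate is significant when its standard error is less than half its absolute value:

```python
# Standard error test for the electricity example. The rule of thumb
# used here (significant when s.e. < |estimate|/2, at roughly the 5%
# level) is an assumption, stated as in introductory notes of this kind.
beta_hat = -8.45147
se_beta = 2.268562

significant = se_beta < abs(beta_hat) / 2
print(significant)  # prints True: s.e. (2.27) < |beta|/2 (4.23)
```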
ii) Student’s t-test
• Like the standard error test, this test is also important to test whether coefficients are significantly different from zero or not. We can formulate the hypothesis for the slope coefficient as follows:
H0: β = 0 against H1: β ≠ 0
• In order to test this hypothesis we need to form the test function relevant for this case. We know that the sample estimator β̂ is normally distributed with mean β and standard error s.e.(β̂).
• We can derive the t-value of the OLS estimator of β as:
t = (β̂ − β)/s.e.(β̂), which follows a t distribution with n−2 degrees of freedom.
Cont’d
• Step 2: Choose the level of significance, often denoted by α.
- This is also sometimes called the size of the test; it determines the region where we will reject or not reject the null hypothesis that we are testing.
• The level of significance is the probability of making a 'wrong' decision, i.e. the probability of rejecting the hypothesis when it is actually true, or the probability of committing a Type I error.
• It is customary in econometric research to choose a 10%, 5% or 1% level of significance.
- A 5% level of significance means that in making our decision we allow (tolerate) being 'wrong' five times out of a hundred, i.e. rejecting the hypothesis when it is actually true.
Cont’d
• Step 3: Check whether it is a one-tail or a two-tail test.
- If the inequality sign in the alternative hypothesis is ≠, then it implies a two-tail test: divide the chosen level of significance by two, then decide the critical value of t.
- But if the inequality sign is either > or <, then it indicates a one-tail test, and there is no need to divide the chosen level of significance by two to obtain the critical value from the t-table.
Cont’d
• Step 4: Obtain critical value of t
- we need some tabulated distribution with which to compare the
estimated test statistics
- test statistics derived in this way can be shown to follow a t-
distribution with n-2 degrees of freedom
- as the number of degrees of freedom increases, we need to be less
cautious in our approach since we can be more sure that our
results are robust.
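The four steps above can be collected into a small helper function. This is a sketch, not part of the original slides: `t_test_two_tailed` is a hypothetical name, and the critical value is assumed to have been looked up in a t-table beforehand.

```python
def t_test_two_tailed(beta_hat, se_beta, t_crit):
    """Steps 1-4 of the two-tailed t test for H0: beta = 0 vs H1: beta != 0.

    t_crit is the tabulated t_{alpha/2}(n-2) value, so the chosen
    significance level (step 2) and the alpha/2 division for a
    two-tail test (step 3) are reflected in its choice.
    """
    t_cal = beta_hat / se_beta          # step 4: compute the test statistic
    return t_cal, abs(t_cal) > t_crit   # True in second slot -> reject H0

# Electricity example: reject H0 at the 5% level.
t_cal, reject = t_test_two_tailed(-8.45147, 2.268562, 2.14479)
print(round(t_cal, 4), reject)
```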
Cont’d
• Example: Consider our previous consumption-income regression result:
Cont’d
• We can summarize the t-test of significance approach to hypothesis testing as follows: state the null and alternative hypotheses, choose the level of significance, determine whether the test is one-tail or two-tail, obtain the critical value of t, and reject H0 if the calculated t exceeds the critical value in absolute terms.
Confidence interval
• Rejection of the null hypothesis doesn't mean that our estimates α̂ and β̂ are the correct estimates of the true population parameters α and β.
• It simply means that our estimates come from a sample drawn from a population whose parameters α and β are different from zero.
• In order to define how close the estimate to the true parameter,
we must construct confidence interval for the true parameter,
- in other words we must establish limiting values around the
estimate with in which the true parameter is expected to lie within
a certain “degree of confidence”
Cont’d
• In this respect we say that with a given probability the population
parameter will be within the defined confidence interval (confidence
limits).
• We choose a probability in advance and refer to it as confidence level
(interval coefficient). It is customarily in econometrics to choose the
95% confidence level. This means that in repeated sampling the
confidence limits, computed from the sample, would include the
true population parameter in 95% of the cases. In the other 5% of the
cases the population parameter will fall outside the confidence
interval.
• In a two-tail test at the α level of significance, the probability that the t-value lies between the critical values −tc and tc is 1−α, where tc is taken at n−2 degrees of freedom.
Cont’d
• For the slope coefficient, the (1−α)100% confidence interval is therefore:
β̂ − tc·s.e.(β̂) ≤ β ≤ β̂ + tc·s.e.(β̂)
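Under the discussion above, a 95% confidence interval for the slope in the electricity example can be sketched as follows (the critical value 2.14479 is the tabulated t0.025(14) used earlier in the deck):

```python
# 95% confidence interval for the slope, electricity example.
beta_hat, se_beta = -8.45147, 2.268562
t_crit = 2.14479   # t_0.025(14), from the t table

lower = beta_hat - t_crit * se_beta
upper = beta_hat + t_crit * se_beta

# The interval lies entirely below zero, agreeing with the earlier
# rejection of H0: beta = 0 at the 5% level.
print(f"95% CI for beta: ({lower:.3f}, {upper:.3f})")
```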