Unitb - II - Linear Probability, Logit and Probit
DEPARTMENT OF STATISTICS
THIDA THAN
M. Econ (Statistics)
(Roll No. 1)
MARCH 2010
CONTENTS
ACKNOWLEDGEMENTS
ABSTRACT
ABBREVIATIONS
Chapter I INTRODUCTION
Chapter II MODEL SPECIFICATION AND ESTIMATION
2.1 Linear Probability Model (LPM)
2.1.1 Functional Form
2.1.2 Examination of the Assumption of ui
2.1.3 Estimation
2.2 Logit Model
2.2.1 Functional Form
2.2.2 Features
2.2.3 Estimation
2.3 Probit Model
2.3.1 Functional Form
2.3.2 Estimation
2.4 Comparison of Models
Chapter III DIAGNOSTIC STATISTICS FOR QUALITATIVE RESPONSE MODELS
3.1 Z Statistic
3.2 Likelihood Ratio (LR) Statistic
3.3 R2 Statistic
3.4 Predictive Quality
3.5 Analysis of Residuals
3.5.1 Standardized Residuals and Consequences of Heteroscedasticity
3.5.2 Likelihood Ratio Test for Heteroscedasticity
3.5.3 Lagrange Multiplier Test for Heteroscedasticity
Chapter IV APPLICATION OF LINEAR PROBABILITY, LOGIT AND PROBIT MODELS
4.1 Introduction
4.2 Models for Child's Weight Colour
4.3 Results
Chapter V CONCLUSION
REFERENCES
CHAPTER I
INTRODUCTION
There are several methods for measuring the relationship among economic variables. The simplest methods are correlation analysis and regression analysis. Regression analysis was first developed by Sir Francis Galton, a well-known British anthropologist and meteorologist, in the latter part of the 19th century. It is a statistical methodology that utilizes the relation between two or more variables so that one variable can be predicted from the other, or others. This methodology is widely used in business, the social and behavioral sciences, the biological sciences, and many other disciplines.
In many regression models the regressand (the dependent or response variable), say Y, is quantitative, whereas the explanatory variables are quantitative, qualitative (dummy), or a mixture thereof. In much research work, however, researchers often face situations where the dependent variable of interest is qualitative in nature. The regressand Y may have two, three, or more possible qualitative outcomes. Models in which the regressand Y is a qualitative variable are called qualitative response models. These models are valuable in the analysis of survey data. The simplest possible qualitative response regression model is the binary model, in which the regressand has only two possible qualitative outcomes and can therefore be represented by a binary indicator variable taking the values 0 and 1. The regressand is then said to be a binary or dichotomous variable, and the models developed for such situations are called binary response models.
Both theoretical and empirical considerations suggest that when the response variable is binary, the shape of the response function will frequently be curvilinear. The response function is shaped either as a tilted S or as a reverse tilted S, and it is approximately linear except at the ends. Such response functions are often referred to as sigmoidal.
In a model where Y is quantitative, the objective is to estimate its expected, or mean, value given the values of the regressors, that is, E(Yi∣X1i, X2i, X3i, …, Xki), where the X's are regressors that may be quantitative, qualitative, or both. In models where Y is qualitative, the objective is to find the probability of something happening. Hence, qualitative response regression models are often known as probability models. Qualitative response models have been used extensively in biometric applications for a much longer time than they have been used in economic applications.
Among the qualitative response models, the linear probability, logit and probit (also known as normit) models are studied in this paper. The objectives of this paper are to study:
(1) how to develop the qualitative response models;
(2) how to estimate the qualitative response models; and
(3) how to evaluate the qualitative response models.
First, the nature of qualitative response models is introduced in Chapter I. The specification and estimation procedures of the qualitative response models are discussed in Chapter II. Then, in Chapter III, diagnostic statistics for qualitative response models are discussed, and the applications of the models are studied in Chapter IV. Finally, the important characteristics of the models and the findings are summarized in Chapter V.
CHAPTER II
MODEL SPECIFICATION AND ESTIMATION
In this Chapter some of the qualitative response models are considered for a
binary response variable. Among the binary response models, linear probability, logit,
and probit (normit) models are discussed in the following sub-sections.
Assume that the model contains a constant term, that is, Xi1 = 1 for all individuals. Each regression coefficient is interpreted in terms of the probability of being in the category of interest on Y. Hence, β2 represents the change in the probability for each unit increase in Xi2, net of the other covariates, and so on.
That is, the variance of the error term in the LPM is heteroscedastic. Since πi = E(Yi∣Xi) = ∑ βk Xik, the variance of ui ultimately depends on the values of X and hence is not homoscedastic.
2.1.3 Estimation
For a model with heteroscedastic error disturbances it can be assumed that each error term ui is normally distributed with variance σi², where the variance Var(ui) = E(ui²) = σi² is not constant over observations. When heteroscedasticity is present, ordinary least squares estimation places more weight on the observations with large error variances than on those with small error variances. In the presence of heteroscedasticity, the OLS estimators, although unbiased, are not efficient; that is, they do not have minimum variance. If heteroscedasticity is present, the appropriate estimation technique is the weighted least squares estimation procedure, which can be derived from the maximum likelihood function.
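Consider the simple linear probability model

$$Y_i = \beta_1 + \beta_2 X_i + u_i .$$

The weighted least squares estimator of the slope is

$$\hat{\beta}_2 = \frac{\sum x_i y_i/\sigma_i^2}{\sum x_i^2/\sigma_i^2} = \frac{\sum (x_i/\sigma_i)(y_i/\sigma_i)}{\sum (x_i/\sigma_i)^2} = \frac{\sum x_i^* y_i^*}{\sum (x_i^*)^2}, \quad \text{where } x_i^* = \frac{x_i}{\sigma_i},\; y_i^* = \frac{y_i}{\sigma_i},$$

and lowercase letters denote deviations from the (weighted) sample means. This is the OLS estimator of the transformed model with

$$Y_i^* = \frac{Y_i}{\sigma_i}, \quad X_i^* = \frac{X_i}{\sigma_i}, \quad u_i^* = \frac{u_i}{\sigma_i},$$

where

$$\mathrm{Var}(u_i^*) = \mathrm{Var}\!\left(\frac{u_i}{\sigma_i}\right) = \frac{1}{\sigma_i^2}\,\mathrm{Var}(u_i) = \frac{\sigma_i^2}{\sigma_i^2} = 1 .$$

Now, the new error term is homoscedastic.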
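The transformation described above (dividing each observation by the standard deviation of its error term and then applying OLS) can be sketched in Python. This is only a minimal illustration under the assumption that the σi are known; the function name `wls_by_transformation` and the sample numbers are hypothetical and are not taken from the study.

```python
def wls_by_transformation(x, y, sigma):
    """Estimate Y = b1 + b2*X + u by OLS on the transformed model
    Y/sigma = b1*(1/sigma) + b2*(X/sigma) + u/sigma.

    sigma holds the (assumed known) standard deviation of each error term.
    """
    c = [1.0 / s for s in sigma]                 # transformed constant term
    xs = [xi / s for xi, s in zip(x, sigma)]     # X* = X / sigma
    ys = [yi / s for yi, s in zip(y, sigma)]     # Y* = Y / sigma

    # Normal equations of the regression of Y* on (1/sigma, X*), no extra intercept
    s_cc = sum(a * a for a in c)
    s_cx = sum(a * b for a, b in zip(c, xs))
    s_xx = sum(b * b for b in xs)
    s_cy = sum(a * b for a, b in zip(c, ys))
    s_xy = sum(a * b for a, b in zip(xs, ys))

    det = s_cc * s_xx - s_cx ** 2
    b1 = (s_cy * s_xx - s_cx * s_xy) / det
    b2 = (s_cc * s_xy - s_cx * s_cy) / det
    return b1, b2


# Hypothetical data: observations with larger sigma receive less weight.
b1, b2 = wls_by_transformation([1, 2, 3, 4], [5, 8, 11, 14], [1, 2, 1, 3])
print(b1, b2)
```

Because every variable, including the constant, is divided by σi, the transformed regression has no separate intercept; the coefficient on 1/σi plays the role of β1.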
Since there are many situations in which the relative magnitude of the error variances is not known, it is important to consider special cases in which sufficient sample information is available to make reasonable guesses of the true error variances. One possibility is the existence of a relationship between the error variances and the values of an explanatory variable in the regression model. Specifically, assume that

$$\mathrm{Var}(u_i) = C X_i^2 ,$$

where C is a nonzero constant. For the transformed error term $u_i^* = u_i / X_i$,

$$\mathrm{Var}(u_i^*) = \mathrm{Var}\!\left(\frac{u_i}{X_i}\right) = \frac{1}{X_i^2}\,\mathrm{Var}(u_i) = \frac{C X_i^2}{X_i^2} = C .$$

Now, the error term ui* is homoscedastic.
The LPM is plagued by problems, such as
(1) non-normality of ui,
(2) heteroscedasticity of ui,
(3) the possibility of Ŷi lying outside the 0-1 range, and
(4) the generally lower R2 values.
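$$\pi_i = \frac{\exp\left(\sum_k \beta_k X_{ik}\right)}{1 + \exp\left(\sum_k \beta_k X_{ik}\right)}$$

or, in vector notation,

$$\pi_i = \frac{\exp(X_i'\beta)}{1 + \exp(X_i'\beta)} \tag{2.2.1}$$

Letting $Z_i = \sum_k \beta_k X_{ik}$,

$$\pi_i = \frac{e^{Z_i}}{1 + e^{Z_i}} = \frac{1}{1 + e^{-Z_i}} \tag{2.2.2}$$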
2.2.2 Features
The features of the logit model are as follows;
(1) Logistic regression effects can be expressed in terms of percent changes in
the odds. Odds ratios are useful in estimating changes in the probability of
event occurrence with changes in predictors once a baseline probability has
been calculated.
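$$\pi_i = \frac{e^{Z_i}}{1 + e^{Z_i}}$$

$$1 - \pi_i = 1 - \frac{e^{Z_i}}{1 + e^{Z_i}} = \frac{1 + e^{Z_i} - e^{Z_i}}{1 + e^{Z_i}} = \frac{1}{1 + e^{Z_i}} \tag{2.2.3}$$

The ratio of Equation (2.2.2) to (2.2.3),

$$\frac{\pi_i}{1 - \pi_i} = \left(\frac{e^{Z_i}}{1 + e^{Z_i}}\right) \Big/ \left(\frac{1}{1 + e^{Z_i}}\right) = e^{Z_i}, \tag{2.2.4}$$

can be called the odds ratio. Taking the natural logarithm of the odds ratio gives the logit

$$L_i = \ln\left(\frac{\pi_i}{1 - \pi_i}\right) = Z_i = \sum_k \beta_k X_{ik} \tag{2.2.5}$$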
The logit L goes from −∞ to +∞ as π goes from 0 to 1. That is, although the probabilities (of necessity) lie between 0 and 1, the logits are not so bounded.
(2) Although L is linear in X, the probabilities themselves are not. This property is in contrast with the LPM, where the probabilities increase linearly with X.
(3) If L, the logit, is positive, it means that when the value of the regressor(s) increases, the odds that the regressand equals 1 (meaning some event of interest happens) increase. If L is negative, the odds that the regressand equals 1 decrease as the value of X increases. To put it differently, the logit becomes negative and increasingly large in magnitude as the odds ratio decreases from 1 to 0, and becomes increasingly large and positive as the odds ratio increases from 1 to infinity.
(4) More formally, the interpretation of the logit model given in Equation (2.2.5) is as follows: β2, the slope, measures the change in L for a unit change in X. The intercept β1 is the value of the log odds in favor of the event occurring when X is zero.
(5) If the aim is to estimate not the odds in favor of the event but the probability of the event itself, this can be done directly from Equation (2.2.2) once the estimates of β1 and β2 are available.
(6) Whereas the LPM assumes that πi is linearly related to Xi, the logit model assumes that the log of the odds ratio is linearly related to Xi.
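The relationship among the probability, the odds ratio and the logit in Equations (2.2.2), (2.2.4) and (2.2.5) can be checked numerically. The following is a small illustrative sketch; the function names and the chosen values of Z are hypothetical.

```python
import math


def logistic(z):
    """Equation (2.2.2): probability pi = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def odds(p):
    """Equation (2.2.4): odds ratio pi / (1 - pi)."""
    return p / (1.0 - p)


def logit(p):
    """Equation (2.2.5): log of the odds ratio."""
    return math.log(odds(p))


# The probability stays inside (0, 1) while the logit is unbounded and equals Z.
for z in (-3.0, -1.0, 0.0, 1.0, 3.0):
    p = logistic(z)
    print(f"Z = {z:+.1f}  pi = {p:.4f}  odds = {odds(p):.4f}  logit = {logit(p):+.4f}")
```

The printed logit reproduces Z in every row, which illustrates feature (2): L is linear in Z (and hence in X), whereas the probability is not.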
2.2.3 Estimation
A logistic response function is either monotonic increasing or
monotonic decreasing, depending on the sign of the slope coefficients. It can be
linearized easily. Logistic response functions, like the other response functions which
have been considered are used for describing the nature of the relationship between
the mean response and one (or more) predictor variable (s). They are also used for
making predictions. The weighted least squares and maximum likelihood estimation
procedures can be used to estimate the parameters of the logistic response function.
For estimation purposes, consider Equation (2.2.5), that is,

$$L_i = \ln\left(\frac{\pi_i}{1 - \pi_i}\right) = \sum_k \beta_k X_{ik} \tag{2.2.6}$$

The estimation of the logit Li in the above equation depends on the type of data available, which may be
(1) data at the individual, or micro, level, or
(2) grouped or replicated data.
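Individual data
Let Yi = 1 if the event occurs and Yi = 0 otherwise, with
P(Yi = 1) = πi
P(Yi = 0) = 1 − πi
The probability distribution of Yi can then be represented as

$$f_i(Y_i) = \pi_i^{Y_i}(1 - \pi_i)^{1 - Y_i}; \quad Y_i = 0, 1; \quad i = 1, \ldots, n \tag{2.2.7}$$

Since the Yi observations are independent, the logarithm of their joint probability function is

$$\log_e g(Y_1, \ldots, Y_n) = \log_e \prod_{i=1}^{n} \pi_i^{Y_i}(1 - \pi_i)^{1 - Y_i} = \log_e \prod_{i=1}^{n} \left(\frac{\pi_i}{1 - \pi_i}\right)^{Y_i}(1 - \pi_i) = \sum_{i=1}^{n} Y_i \log_e\left(\frac{\pi_i}{1 - \pi_i}\right) + \sum_{i=1}^{n} \log_e(1 - \pi_i) \tag{2.2.9}$$

Since E(Yi) = πi for a binary variable, it follows from Equation (2.2.1) that $1 - \pi_i = [1 + \exp(\sum_k \beta_k X_{ik})]^{-1}$, and, according to Equation (2.2.5), Equation (2.2.9) can be expressed as follows:

$$\log_e L(\beta) = \sum_{i=1}^{n} Y_i \left(\sum_k \beta_k X_{ik}\right) - \sum_{i=1}^{n} \log_e\left[1 + \exp\left(\sum_k \beta_k X_{ik}\right)\right] \tag{2.2.10}$$

where L(β) replaces g(Y1, …, Yn) to show explicitly that the function can be viewed as the likelihood function of the parameters to be estimated, given the sample observations.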
Equation (2.2.10) can be expressed more clearly as follows;
No closed-form solution exists for the values of β in Equation (2.2.10) that maximize the log likelihood function, so numerical search procedures are required. There are many widely used numerical search procedures; one of these employs iteratively reweighted least squares.
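As an illustration of such a numerical search, the sketch below maximizes the log likelihood of Equation (2.2.10) for the two-variable case (a constant and one regressor) by Newton-Raphson updates, which for the logit model coincide with iteratively reweighted least squares. It is a simplified sketch and not the routine used by any particular statistical package; the function names and the small data set are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_likelihood(b1, b2, x, y):
    """Equation (2.2.10) with a constant term and a single regressor."""
    return sum(yi * (b1 + b2 * xi) - math.log(1.0 + math.exp(b1 + b2 * xi))
               for xi, yi in zip(x, y))


def fit_logit(x, y, max_iter=50, tol=1e-10):
    """Maximum likelihood estimates (b1, b2) by Newton-Raphson iterations."""
    b1, b2 = 0.0, 0.0
    for _ in range(max_iter):
        p = [sigmoid(b1 + b2 * xi) for xi in x]
        w = [pi * (1.0 - pi) for pi in p]          # weights pi(1 - pi)
        # Score vector (first derivatives of the log likelihood)
        g1 = sum(yi - pi for yi, pi in zip(y, p))
        g2 = sum((yi - pi) * xi for yi, pi, xi in zip(y, p, x))
        # Information matrix (negative of the second derivatives)
        a11 = sum(w)
        a12 = sum(wi * xi for wi, xi in zip(w, x))
        a22 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = a11 * a22 - a12 * a12
        # Newton step: solve the 2x2 system information * d = score
        d1 = (a22 * g1 - a12 * g2) / det
        d2 = (a11 * g2 - a12 * g1) / det
        b1, b2 = b1 + d1, b2 + d2
        if abs(d1) + abs(d2) < tol:
            break
    return b1, b2


# Hypothetical individual-level data (not perfectly separated by x).
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [0, 0, 1, 0, 1, 0, 1, 1]
b1, b2 = fit_logit(x, y)
print(b1, b2, log_likelihood(b1, b2, x, y))
```

The search only converges when the data are not perfectly separated by the regressor; with perfect separation the likelihood has no finite maximum.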
Once the maximum likelihood estimates are found, these values are
substituted into the response function in Equation (2.2.1) to obtain the fitted response
function.
The fitted logit model is as follows:

$$\hat{\pi}_i = \frac{\exp\left(\sum_k b_k X_{ik}\right)}{1 + \exp\left(\sum_k b_k X_{ik}\right)} \tag{2.2.12}$$

where the bk are the maximum likelihood estimates of the βk.
Once the fitted logit model has been obtained, the usual next steps are to examine the appropriateness of the fitted response function and, if the fit is good, to make a variety of inferences and predictions.
that is, the relative frequency can be used as an estimate of the true πi corresponding to each Xi. If Ni is fairly large, $\hat{\pi}_i$ will be a reasonably good estimate of πi. Using $\hat{\pi}_i$, the estimated logit $\hat{L}_i = \ln[\hat{\pi}_i / (1 - \hat{\pi}_i)]$ can be obtained, which will be a fairly good estimate of the true logit Li if the number of observations Ni at each Xi is reasonably large.
that is, ui follows the normal distribution with zero mean and variance equal to 1/[Ni πi(1 − πi)]. Therefore, as in the case of the LPM, the disturbance term in the logit model is heteroscedastic. Thus, instead of OLS, weighted least squares (WLS) should be used. For empirical purposes, replace the unknown πi by $\hat{\pi}_i$ and use

$$\hat{\sigma}^2 = \frac{1}{N_i \hat{\pi}_i (1 - \hat{\pi}_i)}$$

as the estimator of σ².
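For grouped data, the weighted least squares procedure described above can be sketched as follows. The sketch assumes one regressor plus a constant, uses the relative frequency as the estimate of πi, and weights each group by the reciprocal of the estimated variance, that is, by Ni π̂i(1 − π̂i). The function name and the data are hypothetical.

```python
import math


def grouped_logit_wls(x, n_total, n_events):
    """WLS estimates (b1, b2) of the logit L = b1 + b2*X from grouped data.

    x        : value of the regressor for each group
    n_total  : number of observations N_i in each group
    n_events : number of observations with Y = 1 in each group
    Every group needs 0 < n_events < n_total so that the logit is finite.
    """
    p_hat = [r / n for r, n in zip(n_events, n_total)]       # relative frequencies
    l_hat = [math.log(p / (1.0 - p)) for p in p_hat]         # estimated logits
    w = [n * p * (1.0 - p) for n, p in zip(n_total, p_hat)]  # 1 / estimated variance

    # Weighted normal equations for a straight line
    sw = sum(w)
    swx = sum(wi * xi for wi, xi in zip(w, x))
    swl = sum(wi * li for wi, li in zip(w, l_hat))
    swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    swxl = sum(wi * xi * li for wi, xi, li in zip(w, x, l_hat))

    b2 = (sw * swxl - swx * swl) / (sw * swxx - swx ** 2)
    b1 = (swl - b2 * swx) / sw
    return b1, b2


# Hypothetical grouped data: three groups of 10 observations each.
print(grouped_logit_wls([1, 2, 3], [10, 10, 10], [2, 5, 8]))
```

Minimizing the weighted sum of squares is equivalent to running OLS after multiplying each group's observation by the square root of its weight.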
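$$I_i = \sum_k \beta_k X_{ik} \tag{2.3.1}$$

$$\pi_i = P(Y = 1 \mid X) = P(I_i^* \le I_i) = F(I_i) \tag{2.3.2}$$

where P(Y = 1∣X) means the probability that an event occurs given the value(s) of X, the explanatory variable(s), with Z ~ N(0, σ²). F is the standard normal cumulative distribution function. The functional form of the probit model in the two-variable case is

$$F(I_i) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{I_i} e^{-z^2/2}\, dz = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\sum_k \beta_k X_{ik}} e^{-z^2/2}\, dz \tag{2.3.3}$$

where

$$I_i = \sum_k \beta_k X_{ik}$$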
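The integral in Equation (2.3.3) has no closed form, but it can be evaluated through the error function. The sketch below computes the probit probability for a given index; the function names and the coefficient values are hypothetical and serve only as an illustration.

```python
import math


def std_normal_cdf(t):
    """Standard normal CDF: (1/sqrt(2*pi)) * integral from -inf to t of exp(-z^2/2) dz."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))


def probit_probability(betas, xs):
    """P(Y = 1 | X) = F(I) with the index I = sum of beta_k * X_k."""
    index = sum(b * x for b, x in zip(betas, xs))
    return std_normal_cdf(index)


# Hypothetical coefficients (constant and slope) and regressor values (1, X).
for x_value in (-2.0, 0.0, 2.0):
    print(x_value, probit_probability([0.5, 1.0], [1.0, x_value]))
```

An index of zero gives a probability of 0.5, and a negative index gives a probability below 0.5.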
2.3.2 Estimation
Once the estimated Ii is obtained, estimating the βk is relatively straightforward. Since the normal equivalent deviate (n.e.d.), or Ii, will be negative whenever πi < 0.5, in practice the number 5 is added to the n.e.d. and the result is called a probit. The probit model is also constructed by assuming that a particular density underlies the data. Hence, this model is typically estimated using maximum likelihood rather than least squares.
Data for the probit model may also be of two types:
(a) grouped data and
(b) ungrouped or individual data.
As in the case of the logit model, a nonlinear estimating procedure based on the
method of maximum likelihood can be used to estimate the probit model.
normal and π²/3 for the logistic distribution, where π ≈ 22/7. Therefore, if the probit
CHAPTER III
Some diagnostic statistics for qualitative response models, namely the t-test (Z-test), the likelihood ratio test, goodness-of-fit (R2), the predictive quality (classification table and hit rate), and the analysis of the residuals (in particular an LM test for heteroscedasticity), are presented in this Chapter.
3.1 Z Statistic
The significance of individual explanatory variables can be tested by the usual
t-test. The sample size should be sufficiently large to rely on the asymptotic
expressions for the standard errors, and the t-test statistic then follows approximately
the standard normal distribution. Since the method of maximum likelihood is generally
a large sample method, the estimated standard errors are asymptotic. As a result,
instead of using the t statistic to evaluate the statistical significance of a coefficient, the (standard normal) Z statistic has to be used.
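The Z statistic is the estimated coefficient divided by its asymptotic standard error, and its two-sided p-value is taken from the standard normal distribution. A minimal sketch follows; the function name and the numbers are hypothetical.

```python
import math


def z_test(coefficient, standard_error):
    """Return the Z statistic and its two-sided p-value under the standard normal."""
    z = coefficient / standard_error
    # Two-sided p-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Hypothetical estimate and standard error.
z, p = z_test(0.50, 0.20)
print(f"z = {z:.3f}, p-value = {p:.4f}")
```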
where L0 is the likelihood function when all parameters except the intercept are set to zero and L1 is the likelihood function of the model of interest. Sometimes this is used as a measure similar to the R2 of linear regression models. Joint parameter restrictions can be tested by the likelihood ratio test.
3.3 R2 Statistic
A goodness-of-fit measure is a summary statistic indicating the accuracy with which the model approximates the observed data, like the R2 measure in the linear regression model. In the linear regression model, R2 is the most commonly used measure for assessing the discriminatory power of the model. R2 possesses three properties. First, it is standardized to fall in the range (0, 1), equaling 0 when the model affords no predictive efficacy over the marginal mean and equaling 1 when the model perfectly accounts for, or discriminates among, the responses. Second, it is nondecreasing in X, meaning that it cannot decrease as regressors are added to the model. Third, it can be interpreted as the proportion of variation in the response accounted for by the regression.
In the case in which the dependent variable is qualitative, accuracy can be judged either in terms of the fit between the calculated probabilities and observed response frequencies or in terms of the model's ability to forecast observed responses. Contrary to the linear regression model, there is no single measure for the goodness-of-fit in qualitative response models, and a variety of measures exists for nonlinear models.
Often, goodness-of-fit measures are implicitly or explicitly based on comparison with a model that contains only a constant as explanatory variable. A first goodness-of-fit measure, defined by Amemiya (1981), is known as pseudo-R2 and is given by

$$\text{pseudo-}R^2 = 1 - \frac{1}{1 + 2(\log L_1 - \log L_0)/n}$$

where n is the number of observations. An alternative measure is the McFadden R2,

$$R^2_{MCF} = 1 - \frac{\log L_1}{\log L_0}$$

which is sometimes referred to as the likelihood ratio index. Like R2, R2MCF also ranges between 0 and 1.
Another comparatively simple measure of goodness of fit is the count R2, which is defined as:

$$\text{Count } R^2 = \frac{\text{number of correct predictions}}{\text{total number of observations}}$$

Since the regressand in the model takes a value of 1 or zero, the number of correct predictions can be counted. If the predicted probability is greater than 0.5, it is classified as 1, but if it is less than 0.5, it is classified as 0.
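The goodness-of-fit measures of this section can be computed directly from the two log-likelihood values and the predicted probabilities. The sketch below assumes the standard definition of the McFadden R2, 1 − log L1 / log L0, and it classifies a predicted probability of exactly 0.5 as 0, a case the rule above leaves open. The function names and the numbers are hypothetical.

```python
def amemiya_pseudo_r2(loglik_model, loglik_null, n):
    """Pseudo-R2 = 1 - 1 / (1 + 2 (log L1 - log L0) / n)."""
    return 1.0 - 1.0 / (1.0 + 2.0 * (loglik_model - loglik_null) / n)


def mcfadden_r2(loglik_model, loglik_null):
    """McFadden R2 (likelihood ratio index) = 1 - log L1 / log L0."""
    return 1.0 - loglik_model / loglik_null


def count_r2(y, predicted_prob, cutoff=0.5):
    """Share of observations whose predicted class (probability > cutoff) equals y."""
    predicted = [1 if p > cutoff else 0 for p in predicted_prob]
    correct = sum(1 for yi, pi in zip(y, predicted) if yi == pi)
    return correct / len(y)


# Hypothetical log-likelihood values and predictions.
print(amemiya_pseudo_r2(-50.0, -60.0, 100))
print(mcfadden_r2(-50.0, -60.0))
print(count_r2([1, 0, 1, 0], [0.9, 0.2, 0.4, 0.6]))
```

Both likelihood-based measures equal 0 when the model does not improve on the constant-only model, that is, when log L1 = log L0.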
$$z = \frac{h - q}{\sqrt{q(1 - q)/n}} = \frac{nh - nq}{\sqrt{nq(1 - q)}}$$

is large enough (larger than 1.64 at the 5 per cent significance level). In practice, q = p² + (1 − p)² is unknown and is estimated by p̂² + (1 − p̂)², where p̂ is the fraction of successes in the sample. In the above expression for the z-test, nh is the total number of correct predictions in the sample and nq is the expected number of correct random predictions.
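The z-test for the hit rate can be sketched as follows, taking h as the fraction of correct predictions and estimating q from the fraction of successes as described above. The function name and the counts are hypothetical.

```python
import math


def hit_rate_test(n, n_correct, n_success):
    """Return the hit rate h, the estimated q, and the z statistic.

    n         : sample size
    n_correct : number of correct predictions (n * h)
    n_success : number of observations with y = 1
    """
    h = n_correct / n
    p_hat = n_success / n                      # fraction of successes
    q_hat = p_hat ** 2 + (1.0 - p_hat) ** 2    # expected hit rate of random predictions
    z = (h - q_hat) / math.sqrt(q_hat * (1.0 - q_hat) / n)
    return h, q_hat, z


# Hypothetical sample: 100 observations, 70 predicted correctly, 50 successes.
h, q_hat, z = hit_rate_test(100, 70, 50)
print(h, q_hat, z)   # the predictions beat random guessing at the 5% level if z > 1.64
```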
$$u_i^* = \frac{y_i - \hat{\pi}_i}{\sqrt{\hat{\pi}_i (1 - \hat{\pi}_i)}} \tag{3.5.1}$$

$$\sigma_i = e^{z_i'\gamma}$$
with zi a vector of observed variables. The constant term should not be included in this vector because the scale parameter of a binary response model should be fixed, independent of the data. Assume that the density function f (the derivative of F) is symmetric, that is, f(t) = f(−t). It then follows that

$$P[y_i = 1] = P[u_i \ge -x_i'\beta] = P[(u_i/\sigma_i) \ge -x_i'\beta/\sigma_i] = P[(u_i/\sigma_i) \le x_i'\beta/\sigma_i] = F(x_i'\beta/\sigma_i),$$

so that

$$P[y_i = 1] = F\left(x_i'\beta / e^{z_i'\gamma}\right) \tag{3.5.2}$$
The null hypothesis of homoskedasticity corresponds to the parameter restriction H0: γ = 0. This hypothesis can be tested by the LR-test. The unrestricted likelihood function is obtained from the log-likelihood by replacing the term $\pi_i = F(x_i'\beta)$ by $\pi_i = F(x_i'\beta / e^{z_i'\gamma})$.

$$u_i = y_i - \hat{\pi}_i = y_i - F(x_i'b)$$
As a second step, regress the residuals ui on the gradient of the non-linear model $P(y_i = 1) = F(x_i'\beta / e^{z_i'\gamma})$, taking into account that the residuals are heteroskedastic. This amounts to applying (feasible) weighted least squares, that is, OLS after division for the ith observation by the (estimated) standard deviation. The variance of the 'error term' yi − πi is Var(yi − πi) = Var(yi) = πi(1 − πi). πi is replaced by $\hat{\pi}_i$ obtained in the first step, so that the weight of the ith observation in WLS is given by $1/\sqrt{\hat{\pi}_i(1 - \hat{\pi}_i)}$. Further, the gradient of the function $F(x_i'\beta / e^{z_i'\gamma})$ in Equation (3.5.2), when evaluated at γ = 0, is given by

$$\frac{\partial F(x_i'\beta / e^{z_i'\gamma})}{\partial \beta} = f(x_i'\beta)\, x_i, \qquad \frac{\partial F(x_i'\beta / e^{z_i'\gamma})}{\partial \gamma} = -f(x_i'\beta)\, x_i'\beta\, z_i .$$

Therefore, the required auxiliary regression in this second step can be written in terms of the standardized residuals as

$$u_i^* = \frac{y_i - \hat{\pi}_i}{\sqrt{\hat{\pi}_i(1 - \hat{\pi}_i)}} = \frac{f(x_i'b)}{\sqrt{\hat{\pi}_i(1 - \hat{\pi}_i)}}\, x_i'\delta_1 + \frac{f(x_i'b)\, x_i'b}{\sqrt{\hat{\pi}_i(1 - \hat{\pi}_i)}}\, z_i'\delta_2 + \eta_i . \tag{3.5.3}$$
CHAPTER IV
4.1 Introduction
In this chapter, the application of the linear probability, logit and probit models is demonstrated using survey data. The survey data used in this chapter were provided by Ma Moe Sandar Oo, who collected the data for her Master of Public Administration thesis. The data are the responses of the mothers of 300 children under 3 years of age in Thingungyun Township. The weights of the children were assessed from the standard weight chart used by the Township Health Center. There are four different colours (red, yellow, green, white) on this chart to represent the condition of a child's weight. Red represents a weight that reflects severe malnutrition. Yellow stands for moderate malnutrition, and green signifies good condition. The white zone shows another form of malnutrition, which is known as overweight. In general, malnutrition can be defined as underweight in developing countries, where it is a serious public health problem that has been linked to a substantial increase in the risk of morbidity and mortality. The term malnutrition refers to both over-nutrition and under-nutrition; it is a general term for a medical condition caused by an improper or inadequate diet and nutrition. In this study, if the child's weight colour is green, the child is classified as nutrition, and if the child's weight colour is yellow or red, the child is classified as malnutrition. The white colour case is very rare in Myanmar, so it is omitted from this study.
From the collected information, the mother's age, mother's education level and child's weight colour variables are used to develop the models. Mother's education level is divided into 4 categories: primary, middle, high, and graduate. Child's weight colour is divided into 3 categories: green, yellow, and red. To estimate the models, mother's age and mother's education level are used as independent variables and child's weight colour is used as the dependent variable.
where ui is the disturbance term and the unknown parameters β1, β2, β3, β4 and β5 in the LPM are estimated by the weighted least squares method using the Statistical Package for the Social Sciences (SPSS). It is assumed that the variance of ui is proportional to the variable MAGEi.
$$\pi_i = \Pr(Y = 1 \mid X) = \Pr(I_i^* \le I_i) = F(\beta_1 + \beta_2\, MAGE_i + \beta_3\, MEDU_1 + \beta_4\, MEDU_2 + \beta_5\, MEDU_3)$$
4.3 Results
The estimated models and their results are described in this section. The estimated standard errors (se) and computed p-values are shown in parentheses.
The results imply that the variables MAGEi and MEDU3 are important factors in explaining the changes in the probability of child's nutrition. If the mother's age increases by 1 year, with the mother's education at high school level held unchanged, the probability of child's nutrition will decrease by about 1.2%. If the mother's education is at high school level, with the mother's age held unchanged, the probability of child's nutrition will increase by 21.7%.
Logit Model

$$L = \ln\left(\frac{\hat{\pi}}{1 - \hat{\pi}}\right)$$
According to the p-values, each variable except MEDU3 is significant at the 1% level, and χ2 = 64.241 indicates that the whole model is highly significant. The insignificant variable MEDU3 is excluded, and the model for child's weight colour is re-estimated with the variables MAGEi, MEDU1, and MEDU2. The re-estimated model is as follows:
$$L = \ln\left(\frac{\hat{\pi}}{1 - \hat{\pi}}\right)$$
From the re-estimated logit model, MAGEi, MEDU1 and MEDU2 are found to be important factors in explaining the changes in the log of odds for child's nutrition. With other factors held unchanged, an increase of 1 year in the mother's age is expected to decrease the log of odds for child's nutrition by about 0.25. Moreover, if the mother's education is at primary school level, the log of odds for child's nutrition is expected to decrease by 2.488, and if the mother's education is at middle school level, it is expected to decrease by 2.86.
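A change in the log of odds can be restated as a percent change in the odds through 100 × (exp(coefficient) − 1). The short sketch below applies this conversion to the three changes in the log of odds reported above; the function name is hypothetical.

```python
import math


def percent_change_in_odds(log_odds_change):
    """Percent change in the odds implied by a given change in the log of odds."""
    return (math.exp(log_odds_change) - 1.0) * 100.0


# Changes in the log of odds reported for the re-estimated logit model.
for name, change in [("MAGE", -0.25), ("MEDU1", -2.488), ("MEDU2", -2.86)]:
    print(f"{name}: {percent_change_in_odds(change):.1f}% change in the odds")
```

Rounded to whole numbers, these are decreases of about 22%, 92% and 94% in the odds.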
Probit Model
Ii = -2.658 - 0.002 MAGEi + 0.103 MEDU1 + 0.017 MEDU2 + 0.050 MEDU3
se        (0.164)  (0.005)  (0.098)  (0.079)  (0.056)
p-values  (0.000)  (0.763)  (0.291)  (0.826)  (0.446)
According to the p-values, all the explanatory variables are insignificant at both the 1% and 10% levels.
In summarizing the results and findings of the estimated models, the diagnostic statistics such as the p-values, computed F-values and computed χ2 values indicate that the LPM and logit model are significant models.
From the estimated LPM and logit models, the variables mother's age and mother's education are important factors in explaining the changes in child's nutrition.
For the estimated models, the count R2 value is high, whereas the McFadden R2 value and pseudo R2 are low. Although these R2 values are not directly comparable, they can give some idea about the orders of magnitude. Besides, one should not overplay the importance of goodness of fit in models where the regressand is dichotomous. The estimated R2 may seem rather low, but in view of the large sample size, this R2 is still significant on the basis of the F test.
CHAPTER V
CONCLUSION
In this paper, qualitative response models: linear probability, logit, and probit
models in which the dependent variable involves only two qualitative choices are
studied together with their specification and estimation procedure. These models are
valuable in the analysis of survey data. The important characteristics of this study are
as follows:
1. Qualitative response regression models refer to models in which the response variable, or regressand, is not quantitative or measured on an interval scale.
2. The simplest possible qualitative response regression model is the binary model, in which the regressand is of the yes/no or presence/absence type.
3. The simplest possible binary regression model is the linear probability model (LPM), in which the binary response variable is regressed on the relevant explanatory variables by using the standard OLS methodology. Simplicity may not be a virtue here, for the LPM suffers from several estimation problems. Even if some of the estimation problems can be overcome, the fundamental weakness of the LPM is that it assumes that the probability of something happening increases linearly with the level of the regressor. This very restrictive assumption can be avoided by using the logit and probit models.
4. In the logit model the dependent variable is the log of the odds ratio, which is a linear function of the regressors. The probability function that underlies the logit model is the logistic distribution. If the data are available in grouped form, OLS can be used to estimate the parameters of the logit model, provided the heteroscedastic nature of the error term is taken into account explicitly. If the data are available at the individual, or micro, level, nonlinear-in-the-parameters estimating procedures, such as the method of maximum likelihood, can be used.
5. If the normal distribution is chosen as the appropriate probability distribution, then the probit model can be used. This model is mathematically a bit difficult as it involves integrals.
6. The estimated model can be interpreted in terms of the signs and significance of the estimated coefficients. The model can be evaluated in different ways, by using diagnostic tests (t or Z-test, LR-test) and by measuring the model quality (goodness of fit R2).
As an application, these models are developed and estimated by using SPSS computer software with the survey data of the mothers of 300 children in Thingungyun Township.
The findings are as follows:
(1) According to the computed F value and χ2 value, the LPM and logit models are significant but the probit model is not.
(2) In the estimated LPM, the variables mother's age and mother's education are found to be important factors in explaining the child's nutrition. From the estimated model, with other factors held unchanged, an increase of 1 year in the mother's age will decrease the probability of child's nutrition by about 1.2%.
(3) In the estimated logit model, the mother's age and mother's education are found to be important factors in explaining the child's nutrition. From the estimated model, with other factors held unchanged, an increase of 1 year in the mother's age will decrease the odds for child's nutrition by about 22%. If the mother's education is at primary school level, the odds for child's nutrition are expected to decrease by about 92%, and if the mother's education is at middle school level, they are expected to decrease by about 94%.
REFERENCES
4. Gujarati, D.N. and Sangeetha (2008) "Basic Econometrics", Fourth Edition, McGraw-Hill Publishing Company Ltd.
7. Neter, J., Kutner, M.H., Nachtsheim, C.J. and Wasserman, W. (1996) "Applied Linear Statistical Models", Fourth Edition, McGraw-Hill.