UNIT 17   REGRESSION ANALYSIS
Structure
17.0 Objectives
17.1 Introduction
17.2 Simple Linear Regression
17.2.1 Objectives of Regression Analysis
17.2.2 Assumptions Underlying Regression Analysis
17.2.3 Estimation of Parameters
17.2.4 Fit of the Regression Model
17.3 Multiple Regression
17.3.1 Objectives of Multiple Regression
17.3.2 Estimation of Parameters
17.3.3 Fit of the Regression Model
17.3.4 Multicollinearity
17.3.5 Stepwise Regression
17.3.6 Regression with Qualitative Explanatory Variables
17.4 Examples
17.4.1 Example of Simple Linear Regression Through Origin
17.4.2 Example of Simple Linear Regression
17.4.3 Example of Multiple Regression Analysis
17.4.4 Example of Stepwise Regression Analysis
17.4.5 Example of Regression Analysis with Qualitative Variables
17.5 Appendices
17.6 Summary
17.7 Answers to Self Check Exercises
17.8 Key Words
17.9 References and Further Reading
17.0 OBJECTIVES
After going through this Unit, you will be able to:
17.1 INTRODUCTION
Regression analysis is one of the most commonly used statistical techniques
in social, behavioral and physical sciences. Its main objective is to explore the
relationship between a dependent variable (alternatively called criterion variable)
and one or more independent variables (alternatively called predictor or
explanatory variables). Linear regression explores relationships that can be
readily described by straight lines or their generalization to many dimensions.
A large number of problems can be solved by linear regression, and even more
by means of transformation of the original variables that result in linear
relationships among the transformed variables.
It is assumed that the predicted values from multiple regression are linear combinations of the predictor variables. Therefore, the general form of a prediction equation from multiple regression is as follows:
Y = A + BX + E

where
Y = criterion variable
X = predictor variable
A = intercept: the predicted value of Y when all the predictors are zero. The intercept, A, is so called because it intercepts the Y-axis. It estimates the average value of Y when X = 0.
B = regression coefficient (slope)
E = residual, i.e. the difference between the observed (Y) and predicted (Ŷ) values of Y
p = number of predictors
[Figure: scatter plot of observations against Year (1900-1920)]
• Description
• Coefficient Estimation
• Prediction
The prime concern here is to predict the dependent variable from the value of an independent variable. For example, if we know the number of publications of an institution over different periods, the objective could be to predict the number of publications in a particular year in the future. However, the prediction depends upon several crucial assumptions. Hence, instead of a point estimate, i.e. a single value, we should compute an interval estimate, i.e. a range of values within which the predicted value would lie with a given probability. This range is called a "confidence interval". We will discuss this aspect later in this module.
17.2.2 Assumptions Underlying Regression Analysis
The regression model is based on the following assumptions:
3) The variance of the error term is constant for all values of the independent variable, X. This is the assumption of homoscedasticity. If a plot of the residuals shows a roughly rectangular shape, we can assume constant variance. On the other hand, if the residual plot shows an increasing or decreasing wedge or a bowtie shape, non-constant variance (heteroscedasticity) exists and must be corrected. (A minimal residual-diagnostics sketch follows this list.)
4) The residuals are assumed to be uncorrelated with one another, that is, there is no autocorrelation. This implies that the Y's are also uncorrelated: because the observations Y1, Y2, ..., Yn are a random sample, they are mutually independent, and hence the error terms are also mutually independent.
7) When hypothesis tests and confidence limits are to be used, the residuals
are assumed to follow the normal distribution.
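These assumptions can be examined from the residuals themselves. The following is a minimal sketch of such checks in Python (NumPy/SciPy); the data are hypothetical, made up purely for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical data: X = year index, Y = publication counts
X = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
Y = np.array([12, 15, 13, 19, 22, 24, 23, 29, 31, 34], dtype=float)

# Fit a simple linear regression and obtain the residuals
B, A = np.polyfit(X, Y, 1)            # slope, intercept
residuals = Y - (A + B * X)

# Homoscedasticity: the spread of residuals should not change with X;
# here we simply compare the spread in the lower and upper halves of X.
lower, upper = residuals[: len(X) // 2], residuals[len(X) // 2:]
print("residual SD (lower half):", lower.std(ddof=1))
print("residual SD (upper half):", upper.std(ddof=1))

# No autocorrelation: the lag-1 correlation of residuals should be near zero
print("lag-1 autocorrelation:", np.corrcoef(residuals[:-1], residuals[1:])[0, 1])

# Normality (needed for hypothesis tests and confidence limits): Shapiro-Wilk test
print("Shapiro-Wilk p-value:", stats.shapiro(residuals).pvalue)
```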
Fig. 2: Scatter plot of publication output and cooperation links of 44 major countries
(excluding USA) and the fitted regression line
Given a set of values Xi of the predictor and the assumed regression model, the i-th residual is defined as the difference between the i-th observation Yi and the predicted value Ŷi:

di = Yi − Ŷi

Ŷi = A + BXi

where

B = Σ(Xi − X̄)(Yi − Ȳ) / Σ(Xi − X̄)²

and

A = Ȳ − BX̄

Here X̄ and Ȳ denote the sample means of X and Y, and Ŷ denotes the predicted value of Y for a given X.
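These least-squares formulas translate directly into code. The following is a minimal sketch in Python (NumPy); the (X, Y) values are hypothetical, used only to show the computation.

```python
import numpy as np

# Hypothetical sample of (X, Y) pairs
X = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
Y = np.array([3.1, 5.4, 7.2, 9.9, 11.8])

Xbar, Ybar = X.mean(), Y.mean()

# B = sum((X - Xbar)(Y - Ybar)) / sum((X - Xbar)^2)
B = np.sum((X - Xbar) * (Y - Ybar)) / np.sum((X - Xbar) ** 2)
# A = Ybar - B * Xbar
A = Ybar - B * Xbar

Y_hat = A + B * X        # predicted values
residuals = Y - Y_hat    # d_i = Y_i - Yhat_i
print("A =", A, "B =", B)
```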
Since we do not know the population parameters, we have to estimate them from the sample. The symbol S is used for the estimate of σ. The estimate of σ², called the residual mean square, is computed by the following formula:

S² = Σ(Yi − Ŷi)² / (n − 2)
The number (n − 2), called the residual degrees of freedom, is the sample size minus the number of parameters (in this case, there are two parameters, A and B).
The square root of the Residual Mean Square (RMS) is called the standard error of the estimate and is denoted by S. In effect, it indicates the reliability of the estimating equation. The standard errors of A and B are:

SE(B) = S / √Σ(Xi − X̄)²

SE(A) = S √(1/n + X̄² / Σ(Xi − X̄)²)
Standardized means that for each datum the mean is subtracted and the result
divided by the standard deviation. The result is that both X and Y have mean
= 0 and standard deviation = 1.
The fit of the regression model can be assessed by computing the correlation between the observed (Yobs) and predicted (Ŷ) values of Y. The greater the correlation, the better the fit of the regression model. The correlation coefficient is denoted by R. The square of the correlation coefficient (R²) is called the Coefficient of Determination (COD).
The total sum of squares for Y can be partitioned as

SST = SSR + SSE

which means that the sum of squares for Y is divided into two components: (i) the sum of squares explained by the regression (SSR) and (ii) the sum of squares of error (SSE). The ratio SSR/SST is the proportion explained and is equal to R². The greater the proportion explained, the better the fit of the regression model. The coefficient of determination is computed as follows:

R² = (SST − SSE) / SST
For testing the null hypothesis H0: B = 0, it is expedient to represent the results of the regression analysis in the form of an analysis of variance (ANOVA). Obviously, a large residual mean square indicates poor fit. If the residual mean square is large, the value of F will be low and the F ratio may become statistically non-significant. If the F ratio is statistically significant, it implies that the null hypothesis H0: B = 0 is rejected.
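Continuing the sketch above, the residual mean square, R² and the ANOVA F ratio for a simple regression can be computed as follows (Python with NumPy/SciPy; the data are again hypothetical, and the variable names follow the text).

```python
import numpy as np
from scipy import stats

X = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
Y = np.array([3.1, 5.4, 7.2, 9.9, 11.8])
n = len(Y)

B = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
A = Y.mean() - B * X.mean()
Y_hat = A + B * X

SSE = np.sum((Y - Y_hat) ** 2)       # sum of squares of error
SST = np.sum((Y - Y.mean()) ** 2)    # total sum of squares
SSR = SST - SSE                      # sum of squares due to regression

RMS = SSE / (n - 2)                  # residual mean square (estimate of sigma^2)
S = np.sqrt(RMS)                     # standard error of the estimate
R2 = SSR / SST                       # coefficient of determination

F = (SSR / 1) / RMS                  # ANOVA F ratio with (1, n - 2) df
p_value = stats.f.sf(F, 1, n - 2)    # test of H0: B = 0
print("R^2 =", R2, "S =", S, "F =", F, "p =", p_value)
```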
Y = B0 + B1X1 + B2X2 + ... + BpXp + ε

where Ŷ is the predicted score, X1 is the score on the first predictor variable, X2 is the score on the second, and so on. The Y-intercept is B0. The regression coefficients B1, ..., Bp are analogous to the slope in simple regression. The εi are values of an unobserved error term, and the unknown parameters B0, B1, ..., Bp are constants to be estimated from the data.

The parameters B0, B1, ..., Bp can be estimated using the least squares procedure, which minimizes the sum of squares of errors:

SSE = Σ (Yi − B0 − B1X1i − B2X2i − ... − BpXpi)², summed over i = 1, ..., n.
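In practice this minimization is carried out numerically by solving a least-squares problem. A minimal sketch in Python (NumPy), using a small hypothetical data set with two predictors:

```python
import numpy as np

# Hypothetical data: n = 6 cases, p = 2 predictors
X1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
X2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
Y  = np.array([3.9, 5.1, 9.2, 9.8, 14.1, 14.9])

# Design matrix with a leading column of ones for the intercept B0
X = np.column_stack([np.ones_like(X1), X1, X2])

# Least-squares estimates of (B0, B1, B2): minimize the sum of squared errors
coeffs, _, _, _ = np.linalg.lstsq(X, Y, rcond=None)
B0, B1, B2 = coeffs

Y_hat = X @ coeffs
SSE = np.sum((Y - Y_hat) ** 2)
print("B0 =", B0, "B1 =", B1, "B2 =", B2, "SSE =", SSE)
```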
Geometrical Representation
[Figure: geometrical decomposition of the total deviation into explained and residual components]

Adjusted R² = 1 − [SSE / (n − p − 1)] / [SST / (n − 1)] = 1 − [(n − 1) / (n − p − 1)] (1 − R²)
Adjusted R-Square is an adjustment for the fact that when one has a large number of independent variables, it is possible that R² will become artificially high simply because some independent variables' chance variations "explain" small parts of the variance of the dependent variable. At the extreme, when there are as many independent variables as cases in the sample, R² will always be 1.0. The adjustment to the formula arbitrarily lowers R² as the number of independent variables increases. When the number of independent variables is small, R² and adjusted R² will be close. When there are several independent variables, adjusted R² will be noticeably lower.
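Stated as code, the adjustment is a one-line function. This sketch assumes R², n and p are already known; the numbers in the example call are illustrative only (roughly matching the worked example later in this unit, with about 79% explained variance, 29 cases and 2 predictors).

```python
def adjusted_r_squared(r2: float, n: int, p: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# For instance: R^2 = 0.79 with n = 29 cases and p = 2 predictors
print(adjusted_r_squared(0.79, 29, 2))
```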
The overall goodness of fit of the regression model (i.e. whether the regression model is at all helpful in predicting the values of Y) can be evaluated using an F-test in the format of an analysis of variance.
Under the null hypothesis H0: β1 = β2 = ... = βp = 0, the statistic

F = [SSR / p] / [SSE / (n − p − 1)] = MSR / MSE

follows an F distribution with p and (n − p − 1) degrees of freedom.
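The same quantities give the overall F test in code. A small sketch (Python/SciPy); SSR, SSE, n and p are assumed to be already computed, and the example call uses made-up values.

```python
from scipy import stats

def overall_f_test(ssr: float, sse: float, n: int, p: int):
    """Overall F test: F = MSR / MSE with (p, n - p - 1) degrees of freedom."""
    msr = ssr / p
    mse = sse / (n - p - 1)
    f = msr / mse
    p_value = stats.f.sf(f, p, n - p - 1)
    return f, p_value

# Hypothetical usage:
print(overall_f_test(ssr=930.0, sse=251.0, n=29, p=2))
```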
Standardized means that for each datum the mean is subtracted and the result
divided by the standard deviation. The result is that all variables have a mean
of 0 and a standard deviation of 1. This enables comparison of variables of
differing magnitudes and dispersions. Only standardized b-coefficients (beta
weights) can be compared to judge relative predictive power of independent
variables.
The estimated model
Regression coefficients for standardized data are denoted as Beta (β). Beta is the average amount by which the dependent variable increases when the independent variable increases by one standard deviation and the other independent variables are held constant. The ratio of the betas is the ratio of the predictive importance of the independent variables. Note that the betas will change if variables or interaction terms are added to or deleted from the equation; reordering the variables without adding or deleting will not affect the values of beta.
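Beta weights can be obtained by standardizing all variables before fitting, as in this minimal sketch (Python/NumPy, hypothetical data; with standardized variables the intercept is zero, so no constant column is needed).

```python
import numpy as np

def zscore(v):
    return (v - v.mean()) / v.std(ddof=1)   # mean 0, standard deviation 1

# Hypothetical data with two predictors
X1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
X2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
Y  = np.array([3.9, 5.1, 9.2, 9.8, 14.1, 14.9])

Z = np.column_stack([zscore(X1), zscore(X2)])
betas, _, _, _ = np.linalg.lstsq(Z, zscore(Y), rcond=None)
print("beta weights:", betas)   # directly comparable across predictors
```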
17.3.4 Multicollinearity
In regression analysis, a fundamental assumption is that the predictors (regressors) are not highly correlated. When several predictors are highly correlated, the problem is called multicollinearity or collinearity. Multicollinearity is the existence of near-linear relationships among the set of independent variables. When variables are related in this way, we say they are linearly dependent on each other, because a straight regression line fits closely through the data points of those variables. Collinearity simply means co-dependence.
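One common diagnostic, which also appears in the output tables later in this unit, is the variance inflation factor (VIF): each predictor is regressed on the remaining predictors, and VIF = 1 / (1 − R²) of that auxiliary regression, with tolerance = 1 / VIF. The following is a minimal sketch (Python/NumPy) on hypothetical data in which the third predictor is nearly a linear combination of the first two.

```python
import numpy as np

def vif(X: np.ndarray) -> np.ndarray:
    """Variance inflation factor for each column of the predictor matrix X
    (no constant column); VIF_j = 1 / (1 - R_j^2)."""
    n, p = X.shape
    vifs = np.empty(p)
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])
        coef, _, _, _ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)
        vifs[j] = 1.0 / (1.0 - r2)
    return vifs

# Hypothetical predictors; X3 is almost a linear combination of X1 and X2
rng = np.random.default_rng(0)
X1 = rng.normal(size=50)
X2 = rng.normal(size=50)
X3 = X1 + X2 + rng.normal(scale=0.05, size=50)
print(vif(np.column_stack([X1, X2, X3])))   # large VIFs flag collinearity
```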
Detection of Multicollinearity
Correction of Multicollinearity
The most commonly used criterion for the addition or deletion of variables in stepwise regression is based on the partial F-statistic:

F = [(SSE_Reduced − SSE_Full) / q] / [SSE_Full / (n − p − 1)]

The suffix 'Full' refers to the larger model with p explanatory variables, whereas the suffix 'Reduced' refers to the reduced model with (p − q) explanatory variables.
Forward Selection
Backward Elimination
The backward elimination procedure begins with all the variables in the model and proceeds by eliminating the least useful variable, one at a time. A variable whose partial F p-value is greater than a prescribed value, POUT, is the least useful variable and is therefore removed from the regression model. The process continues until no variable can be removed according to the elimination criterion.
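The counterpart forward-selection loop, driven by the partial F criterion described above, can be sketched as follows (Python/NumPy). This is only an illustrative sketch: the F-to-enter threshold, here called FIN, and the helper function names are assumptions, not part of the original text.

```python
import numpy as np

def sse_of(X_cols, Y):
    """SSE of an ordinary least-squares fit of Y on the given columns plus an intercept."""
    n = len(Y)
    A = np.ones((n, 1)) if not X_cols else np.column_stack([np.ones(n)] + X_cols)
    coef, _, _, _ = np.linalg.lstsq(A, Y, rcond=None)
    resid = Y - A @ coef
    return resid @ resid

def forward_selection(candidates, Y, FIN=4.0):
    """At each step, add the candidate variable whose partial F exceeds FIN the most."""
    selected, remaining = [], list(range(len(candidates)))
    n = len(Y)
    while remaining:
        sse_reduced = sse_of([candidates[i] for i in selected], Y)
        best, best_F = None, FIN
        for i in remaining:
            cols = [candidates[j] for j in selected] + [candidates[i]]
            sse_full = sse_of(cols, Y)
            # Partial F for adding one variable (q = 1)
            F = (sse_reduced - sse_full) / (sse_full / (n - len(cols) - 1))
            if F > best_F:
                best, best_F = i, F
        if best is None:
            break            # no candidate passes the F-to-enter criterion
        selected.append(best)
        remaining.remove(best)
    return selected
```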
Stepwise Procedure
Dichotomous Variables
Y = A + BX

where

Y = income of an individual, and
X = a dichotomous variable, coded as
    0 if female
    1 otherwise

The estimated value of Y is

Ŷ = A         if X = 0
Ŷ = A + B     if X = 1
Since our best estimate for a given sample is the sample mean, A is estimated as the average income for females and A + B is estimated as the average income for males. The regression coefficient B is therefore equal to:

B = Ȳmale − Ȳfemale
In effect, females are considered as the reference group and males' income is
measured by how much it differs from females' income.
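This is easy to verify numerically: with a 0/1 predictor, the fitted intercept equals the mean of the reference group and the slope equals the difference between the two group means. A minimal sketch (Python/NumPy) with hypothetical income data:

```python
import numpy as np

# Hypothetical incomes; X = 0 for females (reference group), 1 for males
X = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)
Y = np.array([30, 32, 35, 33, 40, 42, 39, 43], dtype=float)

B, A = np.polyfit(X, Y, 1)      # slope, intercept from ordinary least squares
print("A (mean income of females):", A, "==", Y[X == 0].mean())
print("B (male mean minus female mean):", B, "==", Y[X == 1].mean() - Y[X == 0].mean())
```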
Polytomous Variables
Consider, for example, the relationship between the time spent by an academic
scientist on teaching and his rank.
Y = A + BX
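For a polytomous variable such as rank (Professor, Reader, Lecturer), one category is taken as the reference and each remaining category is represented by a 0/1 dummy variable, exactly as in the Prof and Reader columns of the academic-scientists data set in the Appendices. A minimal sketch (Python/NumPy) with hypothetical teaching-time data, taking Lecturer as the reference category:

```python
import numpy as np

# Hypothetical data: rank coded 1 = Professor, 2 = Reader, 3 = Lecturer;
# Y = percentage of time spent on teaching
rank = np.array([1, 1, 2, 2, 3, 3, 1, 2, 3, 3])
Y    = np.array([40, 35, 50, 45, 60, 65, 30, 55, 70, 62], dtype=float)

# Dummy coding with Lecturer (rank 3) as the reference category
prof   = (rank == 1).astype(float)
reader = (rank == 2).astype(float)

X = np.column_stack([np.ones_like(Y), prof, reader])
coef, _, _, _ = np.linalg.lstsq(X, Y, rcond=None)
A, B_prof, B_reader = coef

print("A        = mean teaching time of Lecturers:", A)
print("B_prof   = Professor mean minus Lecturer mean:", B_prof)
print("B_reader = Reader mean minus Lecturer mean:", B_reader)
```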
17.4 EXAMPLES
17.4.1 Example of Simple Linear Regression (No-intercept Model)
Research Question: How does the research effort of a country affect its participation in the international cooperation network?

Dataset: Appendix 1
Variables Entered/Removed(b,c)

Model 1: Variables Entered: PUBLIC(a); Method: Enter
Model Summary
a) For regression through the origin (the no-intercept model), R² measures the proportion of the variability in the dependent variable about the origin explained by regression. This CANNOT be compared to R² for models which include an intercept.
b) Predictors: PUBLIC
ANOVA(c,d)

Model 1      Sum of Squares        df    Mean Square        F        Sig.
Regression   25527876817.716        1    25527876817.716    298.586  .000(a)
Residual      3761818085.284       44       85495865.575
Total        29289694903.000(b)    45
a) Predictors: PUBLIC
b) This total sum of squares is not corrected for the constant because the constant
is zero for regression through the origin.
c) Dependent Variable: COOP_LIN
d) Linear Regression through the Origin
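Regression through the origin simply drops the constant term, so the design matrix contains only the predictor. The following is a minimal sketch of this kind of analysis in Python (NumPy); PUBLIC and COOP_LIN here are small hypothetical vectors, not the actual data set of Appendix 1.

```python
import numpy as np

# Hypothetical publication counts and cooperation links
PUBLIC   = np.array([100.0, 250.0, 400.0, 800.0, 1500.0])
COOP_LIN = np.array([180.0, 430.0, 790.0, 1500.0, 2900.0])

# No-intercept model: COOP_LIN = B * PUBLIC + E
B, _, _, _ = np.linalg.lstsq(PUBLIC.reshape(-1, 1), COOP_LIN, rcond=None)
fitted = PUBLIC * B[0]
residuals = COOP_LIN - fitted

# For the no-intercept model, sums of squares are measured about the origin,
# so this R^2 is not comparable with that of a model containing an intercept.
SSE = residuals @ residuals
SST0 = COOP_LIN @ COOP_LIN          # uncorrected total sum of squares
R2 = 1.0 - SSE / SST0
print("B =", B[0], "R^2 (about the origin) =", R2)
```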
Coefficients(a,b)

Model | Unstandardized Coefficients (B, Std. Error) | Standardized Coefficients (Beta) | t | Sig.
[Figure: scatter plot of COOP_LIN against PUBLIC with the fitted regression line through the origin]
Comments
Notes
Variables Entered/Removed(b)
Model Summary
ANOVA(b)
Total 2400.171 34
Coefficients(a)
Variables Entered/Removed(b)
b) Dependent Variable: V4
Model Summary(b)
b) Dependent Variable: V4
ANOVA(b)
Total 1181.092 28
b) Dependent Variable: V4
Coefficients(a)
Model | Unstandardized Coefficients (B, Std. Error) | Standardized Coefficients (Beta) | t | Sig. | 95% Confidence Interval for B (Lower Bound, Upper Bound) | Collinearity Statistics (Tolerance, VIF)

1  (Constant)   B = 31.913   Std. Error = 14.072   t = 2.268   Sig. = .034   Lower Bound = 2.649   Upper Bound = 61.177
a) Dependent Variable: V4
Residuals Statistics(a)
a) Dependent Variable: V4
Comments
2) The value of R² indicates that about 79% of the variance in the dependent variable is explained by the regression model.
3) The F ratio in the analysis of variance is used to test the null hypothesis that the population R² is zero. The F ratio is statistically highly significant, implying rejection of the null hypothesis.
V3 = -.359 V2 + .156 V7
6) Collinearity Statistics:
• The values of tolerance level indicate that none of the predictors has
tolerance level less than .01
a) Dependent Variable: V4
Model Summary(c)
a) Predictors: (Constant), V6
c) Dependent Variable: V4
ANOVA(c)
Total 1181.092 28
Total 1181.092 28
a) Predictors: (Constant), V6
b) Predictors: (Constant), V6, V2
c) Dependent Variable: V4
Excluded Variables

Model | Beta In | t | Sig. | Partial Correlation | Collinearity Statistics (Tolerance, VIF, Minimum Tolerance)
Residuals Statistics(a)
Dataset: Appendix 4
Variables Entered/Removed(b)
Model Summary

Model | R | R Square | Adjusted R Square | Std. Error of the Estimate
ANOVA(b)
Total 33285.600 89
Coefficients(a)
Unstandardized Standardized
Coefficients Coefficients
Comments
17.5 APPENDICES
Appendix - 1.1
Data sets
Appendix - 1.3

Data on Predictors of Poverty
Description: Data on indicators of poverty
Number of cases = 30 counties
Number of variables = 8
These variables are defined below:
V1 = Population change
V2 = Number of persons employed in agriculture
V3 = Percentage of families below poverty level
V4 = Residential and farm property tax rate
V5 = Percentage of residences with telephones
V6 = Percentage of rural population
V7 = Median age
V8 = Number of African Americans
Appendix - 1.4

County   V1     V2     V3    V4    V5   V6   V7    V8
1 13.7 400 19.0 1.09 82 75 33.5 360
2 -0.8 710 26.2 1.01 66 100 32.8 193
3 9.6 1610 18.1 0.40 80 70 33.4 3080
4 40.0 500 15.4 0.93 74 100 27.8 592
5 8.4 640 29.0 0.92 65 74 27.9 2
6 3.5 920 21.6 0.59 64 73 33.2 230
7 3.0 1890 21.9 0.63 82 52 30.8 3978
8 7.1 3040 18.9 0.49 85 50 32.4 9816
9 13.0 2730 21.1 0.71 78 71 29.2 1137
10 10.7 1850 23.8 0.93 74 71 28.7 992
11 -16.2 2920 40.5 0.51 69 64 25.1 10723
12 6.6 1070 21.6 0.80 85 58 35.9 3129
13 21.9 160 25.4 0.74 69 100 31.4 338
14 17.8 380 19.7 0.44 83 72 30.1 516
15 -11.8 1140 38.0 0.81 54 100 34.1 12
16 7.5 690 30.1 1.05 65 100 30.5 104
17 3.7 1170 24.8 0.73 76 70 30.0 430
18 1.6 1280 30.3 0.65 67 81 32.4 1240
19 8.4 2270 19.5 0.48 85 39 28.7 20446
20 2.7 960 15.6 0.72 84 58 33.4 1863
21 5.6 1710 17.2 0.62 84 42 29.9 8035
22 12.7 1410 18.4 0.84 86 36 23.3 10620
23 -4.8 200 27.3 0.73 66 100 27.5 211
24 16.5 960 19.2 0.45 74 91 29.5 133
25 15.2 11500 16.8 1.00 87 6 25.4 266159
26 11.6 1380 13.2 0.63 85 44 28.8 2432
27 4.9 530 29.7 0.54 70 100 33.1 932
28 1.1 370 19.8 0.98 75 53 30.8 7
29 3.8 440 27.7 0.46 48 100 28.4 208
30 19.0 1630 20.5 0.68 83 72 30.4 1732
Involvement of academic scientists in different activities

Description: Subset of data derived from the database: Profile and productivity of academic scientists in India. The subset includes the following items:
• Professional Rank (3 categories: coded as Professor =1, Reader =2, Lecturer=3)
• Percentage of time spent on Teaching, Research and Supervision of Doctoral
Students (Data on other activities is suppressed)
• Respondents
Faculty members of Science, Engineering, Medicine and Agriculture Departments
of 20 Universities in India.
• Sample size: 10% random sample of 1073 respondents.

Time spent by academic scientists on different activities (Percentage)
Rank   Teach   Research   Supvn   Prof   Reader   (Prof and Reader are dummy variables)
2.00 40.00 30.00 20.00 .00 1.00
3.00 40.00 40.00 10.00 .00 .00
3.00 65.00 20.00 .00 .00 .00
2.00 50.00 10.00 20.00 .00 1.00
3.00 30.00 30.00 25.00 .00 .00
1.00 40.00 20.00 15.00 1.00 .00
1.00 90.00 .00 .00 1.00 .00
1.00 60.00 10.00 10.00 1.00 .00
1.00 20.00 20.00 10.00 1.00 .00
2.00 30.00 10.00 25.00 .00 1.00
1.00 50.00 20.00 10.00 1.00 .00
2.00 50.00 10.00 15.00 .00 1.00
2.00 22.00 22.00 22.00 .00 1.00
2.00 30.00 50.00 15.00 .00 1.00
3.00 60.00 15.00 15.00 .00 .00
1.00 10.00 15.00 25.00 1.00 .00
1.00 30.00 30.00 20.00 1.00 .00
1.00 45.00 8.00 30.00 1.00 .00
2.00 40.00 10.00 20.00 .00 1.00
1.00 35.00 10.00 20.00 1.00 .00
1.00 20.00 20.00 20.00 1.00 .00
2.00 30.00 30.00 40.00 .00 1.00
1.00 30.00 5.00 10.00 1.00 .00
1.00 20.00 20.00 20.00 1.00 .00
1.00 40.00 15.00 25.00 1.00 .00
1.00 40.00 25.00 20.00 1.00 .00
1.00 60.00 30.00 10.00 1.00 .00
2.00 20.00 35.00 20.00 .00 1.00
2.00 40.00 20.00 20.00 .00 1.00
1.00 30.00 20.00 20.00 1.00 .00
1.00 40.00 25.00 25.00 1.00 .00
3.00 50.00 30.00 10.00 .00 .00
3.00 75.00 20.00 .00 .00 .00
2.00 30.00 40.00 25.00 .00 1.00
2.00 20.00 25.00 40.00 .00 1.00
3.00 50.00 45.00 .00 .00 .00
2.00 50.00 10.00 30.00 .00 1.00
1.00 50.00 25.00 25.00 1.00 .00
3.00 50.00 40.00 .00 .00 .00
2.00 50.00 15.00 10.00 .00 1.00
1.00 80.00 5.00 5.00 1.00 .00
1.00 40.00 10.00 5.00 1.00 .00
3.00 60.00 35.00 5.00 .00 .00
1.00 20.00 20.00 30.00 1.00 .00
1.00 40.00 20.00 .00 1.00 .00
3.00 40.00 30.00 10.00 .00 .00
2.00 40.00 40.00 15.00 .00 1.00
3.00 50.00 20.00 10.00 .00 .00
2.00 60.00 10.00 .00 .00 1.00
2.00 60.00 10.00 20.00 .00 1.00
1.00 50.00 15.00 10.00 1.00 .00
2.00 50.00 40.00 .00 .00 1.00
SOURCE: NAGPAUL, P.S., Profile and productivity of academic scientists in India, National Institute of Science, Technology and Society, New Delhi (India)
17.6 SUMMARY
Regression is a statistical technique that uses the association between variables
as a means of prediction. In the simplest case, we consider two variables, the
independent variable and the dependent variable. The independent variable is
used to predict changes in the dependent variable. Multiple regression is an
extension of simple linear regression. In multiple regression, we consider
more than one independent variable and assess the combined ability of the
independent predictors to account for changes in the dependent variable. Typical
outcome of regression analysis is an equation or "model" that represents the
relationship between a dependent variable and independent variable(s). This
model is derived by minimizing the sum of squares of deviations between the
observed and predicted values of the dependent variable. Procedures for
assessing the goodness of fit of the regression model and its parameters are
discussed. Underlying assumptions and consequences of their violation are
briefly discussed to indicate the possible pitfalls of regression analysis without
understanding the basic principles.
Some examples are presented and computer outputs are interpreted to familiarize
the students with the practical aspects of regression analysis.
17.7 ANSWERS TO SELF CHECK EXERCISES

4) Residuals are the difference between the observed values and those predicted by the regression equation.
17.8 KEY WORDS

Ordinary Least Squares : This method derives its name from the criterion used to draw the best-fit regression line: a line such that the sum of the squared deviations of all the points from the line is minimized.
17.9 REFERENCES AND FURTHER READING

Montgomery, D.C. [et al.]. Introduction to Linear Regression Analysis. 3rd ed. New York: John Wiley.