UNIT 17   REGRESSION ANALYSIS
Structure
17.0 Objectives
17.1 Introduction
17.2 Simple Linear Regression
17.2.1 Objectives of Regression Analysis
17.2.2 Assumptions Underlying Regression Analysis
17.2.3 Estimation of Parameters
17.2.4 Fit of the Regression Model
17.3 Multiple Regression
17.3.1 Objectives of Multiple Regression
17.3.2 Estimation of Parameters
17.3.3 Fit of the Regression Model
17.3.4 Multicollinearity
17.3.5 Stepwise Regression
17.3.6 Regression with Qualitative Explanatory Variables
17.4 Examples
17.4.1 Example of Simple Linear Regression Through Origin
17.4.2 Example of Simple Linear Regression
17.4.3 Example of Multiple Regression Analysis
17.4.4 Example of Stepwise Regression Analysis
17.4.5 Example of Regression Analysis with Qualitative Variables
17.5 Appendices
17.6 Summary
17.7 Answers to Self Check Exercises
17.8 Key Words
17.9 References and Further Reading
17.0 OBJECTIVES
After going through this Unit, you will be able to:
17.1 INTRODUCTION
Regression analysis is one of the most commonly used statistical techniques
in social, behavioral and physical sciences. Its main objective is to explore the
relationship between a dependent variable (alternatively called criterion variable)
and one or more independent variables (alternatively called predictor or
explanatory variables). Linear regression explores relationships that can be
readily described by straight lines or their generalization to many dimensions.
A large number of problems can be solved by linear regression, and even more
by means of transformation of the original variables that result in linear
relationships among the transformed variables.
It is assumed that the predicted values from multiple regression are linear combinations of the predictor variables. Therefore, the general form of a prediction equation from multiple regression is as follows:
Y = A + BX + E

where
Y = criterion variable
X = predictor variable
A = intercept: the predicted value of Y when all the predictors are zero. The intercept, A, is so called because it intercepts the Y-axis. It estimates the average value of Y when X = 0.
B = regression coefficient (slope)
E = residual, i.e. the difference between the observed (Y) and predicted (Ŷ) values of Y
p = number of predictors
[Figure: scatter plot of observations against Year (1900-1920)]
• Description
• Coefficient Estimation
• Prediction
The prime concern here is to predict the dependent variable from the value of an independent variable. For example, if we know the number of publications of an institution over different periods, the objective could be to predict the number of publications in a particular year in the future. However, the prediction depends upon several crucial assumptions. Hence, instead of a point estimate, i.e. a single value, we should compute an interval estimate, i.e. a range of values within which the predicted value would lie with a given probability. This range is called a "confidence interval". We will discuss this aspect later in this module.
17.2.2 Assumptions Underlying Regression Analysis
The regression model is based on the following assumptions:
3) The variance of the error term is constant for all values of the independent variable, X. This is the assumption of homoscedasticity. If a plot of the residuals shows a roughly rectangular shape, we can assume constant variance. On the other hand, if the residual plot shows an increasing or decreasing wedge or a bowtie shape, non-constant variance (heteroscedasticity) exists and must be corrected. (A minimal residual-diagnostics sketch follows this list.)
4) The residuals are assumed to be uncorrelated with one another, that is, there is no autocorrelation. This implies that the Y's are also uncorrelated: because the observations Y1, Y2, ..., Yn are a random sample, they are mutually independent, and hence the error terms are also mutually independent.
7) When hypothesis tests and confidence limits are to be used, the residuals
are assumed to follow the normal distribution.
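These assumptions can be examined from the residuals themselves. The following is a minimal sketch of such checks in Python (NumPy/SciPy); the data are hypothetical, made up purely for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical data: X = year index, Y = publication counts
X = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
Y = np.array([12, 15, 13, 19, 22, 24, 23, 29, 31, 34], dtype=float)

# Fit a simple linear regression and obtain the residuals
B, A = np.polyfit(X, Y, 1)            # slope, intercept
residuals = Y - (A + B * X)

# Homoscedasticity: the spread of residuals should not change with X;
# here we simply compare the spread in the lower and upper halves of X.
lower, upper = residuals[: len(X) // 2], residuals[len(X) // 2:]
print("residual SD (lower half):", lower.std(ddof=1))
print("residual SD (upper half):", upper.std(ddof=1))

# No autocorrelation: the lag-1 correlation of residuals should be near zero
print("lag-1 autocorrelation:", np.corrcoef(residuals[:-1], residuals[1:])[0, 1])

# Normality (needed for hypothesis tests and confidence limits): Shapiro-Wilk test
print("Shapiro-Wilk p-value:", stats.shapiro(residuals).pvalue)
```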
Fig. 2: Scatter plot of publication output and cooperation links of 44 major countries
(excluding USA) and the fitted regression line
Given a set of values Xi of the predictor and the assumed regression model, the i-th residual is defined as the difference between the i-th observation Yi and the predicted value Ŷi:

di = Yi − Ŷi

Ŷi = A + BXi

where

B = Σ(Xi − X̄)(Yi − Ȳ) / Σ(Xi − X̄)²

and

A = Ȳ − BX̄

Here X̄ and Ȳ denote the sample means of X and Y, and Ŷ denotes the predicted value of Y for a given X.
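These least-squares formulas translate directly into code. The following is a minimal sketch in Python (NumPy); the (X, Y) values are hypothetical, used only to show the computation.

```python
import numpy as np

# Hypothetical sample of (X, Y) pairs
X = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
Y = np.array([3.1, 5.4, 7.2, 9.9, 11.8])

Xbar, Ybar = X.mean(), Y.mean()

# B = sum((X - Xbar)(Y - Ybar)) / sum((X - Xbar)^2)
B = np.sum((X - Xbar) * (Y - Ybar)) / np.sum((X - Xbar) ** 2)
# A = Ybar - B * Xbar
A = Ybar - B * Xbar

Y_hat = A + B * X        # predicted values
residuals = Y - Y_hat    # d_i = Y_i - Yhat_i
print("A =", A, "B =", B)
```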
Since we do not know the population parameters, we have to estimate them from the sample. The symbol S is used for the estimate of σ. The estimate of σ², called the residual mean square, is computed by the following formula:

S² = Σ(Yi − Ŷi)² / (n − 2)
The number (n − 2), called the residual degrees of freedom, is the sample size minus the number of parameters (in this case, there are two parameters, A and B).
The square root of the Residual Mean Square (RMS) is called the standard error of the estimate and is denoted by S. In effect, it indicates the reliability of the estimating equation. The standard errors of A and B are:

SE(B) = S / √Σ(Xi − X̄)²

SE(A) = S √(1/n + X̄² / Σ(Xi − X̄)²)
Standardized means that for each datum the mean is subtracted and the result
divided by the standard deviation. The result is that both X and Y have mean
= 0 and standard deviation = 1.
The fit of the regression model can be assessed by computing the correlation between the observed (Yobs) and predicted (Ŷ) values of Y. The greater the correlation, the better the fit of the regression model. The correlation coefficient is denoted by R. The square of the correlation coefficient (R²) is called the Coefficient of Determination (COD).
The total sum of squares for Y can be partitioned as

SST = SSR + SSE

which means that the sum of squares for Y is divided into two components: (i) the sum of squares explained by the regression (SSR) and (ii) the sum of squares of error (SSE). The ratio SSR/SST is the proportion explained and is equal to R². The greater the proportion explained, the better the fit of the regression model. The coefficient of determination is computed as follows:

R² = (SST − SSE) / SST
For testing the null hypothesis H0: B = 0, it is expedient to represent the results of the regression analysis in the form of an analysis of variance (ANOVA). Obviously, a large residual mean square indicates poor fit. If the residual mean square is large, the value of F will be low and the F ratio may become statistically non-significant. If the F ratio is statistically significant, it implies that the null hypothesis H0: B = 0 is rejected.
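Continuing the sketch above, the residual mean square, R² and the ANOVA F ratio for a simple regression can be computed as follows (Python with NumPy/SciPy; the data are again hypothetical, and the variable names follow the text).

```python
import numpy as np
from scipy import stats

X = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
Y = np.array([3.1, 5.4, 7.2, 9.9, 11.8])
n = len(Y)

B = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
A = Y.mean() - B * X.mean()
Y_hat = A + B * X

SSE = np.sum((Y - Y_hat) ** 2)       # sum of squares of error
SST = np.sum((Y - Y.mean()) ** 2)    # total sum of squares
SSR = SST - SSE                      # sum of squares due to regression

RMS = SSE / (n - 2)                  # residual mean square (estimate of sigma^2)
S = np.sqrt(RMS)                     # standard error of the estimate
R2 = SSR / SST                       # coefficient of determination

F = (SSR / 1) / RMS                  # ANOVA F ratio with (1, n - 2) df
p_value = stats.f.sf(F, 1, n - 2)    # test of H0: B = 0
print("R^2 =", R2, "S =", S, "F =", F, "p =", p_value)
```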
Y = B0 + B1X1 + B2X2 + ... + BpXp + ε

where Ŷ is the predicted score, X1 is the score on the first predictor variable, X2 is the score on the second, and so on. The Y-intercept is B0. The regression coefficients B1, ..., Bp are analogous to the slope in simple regression. The εi are values of an unobserved error term, and the unknown parameters B0, B1, ..., Bp are constants to be estimated from the data.

The parameters B0, B1, ..., Bp can be estimated using the least squares procedure, which minimizes the sum of squares of errors:

SSE = Σ (Yi − B0 − B1X1i − B2X2i − ... − BpXpi)², summed over i = 1, ..., n.
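In practice this minimization is carried out numerically by solving a least-squares problem. A minimal sketch in Python (NumPy), using a small hypothetical data set with two predictors:

```python
import numpy as np

# Hypothetical data: n = 6 cases, p = 2 predictors
X1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
X2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
Y  = np.array([3.9, 5.1, 9.2, 9.8, 14.1, 14.9])

# Design matrix with a leading column of ones for the intercept B0
X = np.column_stack([np.ones_like(X1), X1, X2])

# Least-squares estimates of (B0, B1, B2): minimize the sum of squared errors
coeffs, _, _, _ = np.linalg.lstsq(X, Y, rcond=None)
B0, B1, B2 = coeffs

Y_hat = X @ coeffs
SSE = np.sum((Y - Y_hat) ** 2)
print("B0 =", B0, "B1 =", B1, "B2 =", B2, "SSE =", SSE)
```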
Geometrical Representation
[Figure: geometrical decomposition of the total deviation into explained and residual components]

Adjusted R² = 1 − [SSE / (n − p − 1)] / [SST / (n − 1)] = 1 − [(n − 1) / (n − p − 1)] (1 − R²)
Adjusted R-Square is an adjustment for the fact that when one has a large number of independent variables, it is possible that R² will become artificially high simply because some independent variables' chance variations "explain" small parts of the variance of the dependent variable. At the extreme, when there are as many independent variables as cases in the sample, R² will always be 1.0. The adjustment to the formula arbitrarily lowers R² as the number of independent variables increases. When the number of independent variables is small, R² and adjusted R² will be close. When there are several independent variables, adjusted R² will be noticeably lower.
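Stated as code, the adjustment is a one-line function. This sketch assumes R², n and p are already known; the numbers in the example call are illustrative only (roughly matching the worked example later in this unit, with about 79% explained variance, 29 cases and 2 predictors).

```python
def adjusted_r_squared(r2: float, n: int, p: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# For instance: R^2 = 0.79 with n = 29 cases and p = 2 predictors
print(adjusted_r_squared(0.79, 29, 2))
```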
The overall goodness of fit of the regression model (i.e. whether the regression model is at all helpful in predicting the values of Y) can be evaluated using an F-test in the format of an analysis of variance.
Under the null hypothesis H0: β1 = β2 = ... = βp = 0, the statistic

F = [SSR / p] / [SSE / (n − p − 1)] = MSR / MSE

follows an F distribution with p and (n − p − 1) degrees of freedom.
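The same quantities give the overall F test in code. A small sketch (Python/SciPy); SSR, SSE, n and p are assumed to be already computed, and the example call uses made-up values.

```python
from scipy import stats

def overall_f_test(ssr: float, sse: float, n: int, p: int):
    """Overall F test: F = MSR / MSE with (p, n - p - 1) degrees of freedom."""
    msr = ssr / p
    mse = sse / (n - p - 1)
    f = msr / mse
    p_value = stats.f.sf(f, p, n - p - 1)
    return f, p_value

# Hypothetical usage:
print(overall_f_test(ssr=930.0, sse=251.0, n=29, p=2))
```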
Standardized means that for each datum the mean is subtracted and the result
divided by the standard deviation. The result is that all variables have a mean
of 0 and a standard deviation of 1. This enables comparison of variables of
differing magnitudes and dispersions. Only standardized b-coefficients (beta
weights) can be compared to judge relative predictive power of independent
variables.
The estimated model
Regression coefficients for standardized data are denoted as Beta (β). Beta is the average amount by which the dependent variable increases when the independent variable increases by one standard deviation and the other independent variables are held constant. The ratio of the betas is the ratio of the predictive importance of the independent variables. Note that the betas will change if variables or interaction terms are added to or deleted from the equation; reordering the variables without adding or deleting will not affect the values of beta.
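Beta weights can be obtained by standardizing all variables before fitting, as in this minimal sketch (Python/NumPy, hypothetical data; with standardized variables the intercept is zero, so no constant column is needed).

```python
import numpy as np

def zscore(v):
    return (v - v.mean()) / v.std(ddof=1)   # mean 0, standard deviation 1

# Hypothetical data with two predictors
X1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
X2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
Y  = np.array([3.9, 5.1, 9.2, 9.8, 14.1, 14.9])

Z = np.column_stack([zscore(X1), zscore(X2)])
betas, _, _, _ = np.linalg.lstsq(Z, zscore(Y), rcond=None)
print("beta weights:", betas)   # directly comparable across predictors
```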
17.3.4 Multicollinearity
In regression analysis, a fundamental assumption is that the predictors (regressors) are not highly correlated. When several predictors are highly correlated, the problem is called multicollinearity or collinearity. Multicollinearity is the existence of near-linear relationships among the set of independent variables. When variables are related in this way, we say they are linearly dependent on each other, because a straight regression line fits closely through the data points of those variables. Collinearity simply means co-dependence.
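One common diagnostic, which also appears in the output tables later in this unit, is the variance inflation factor (VIF): each predictor is regressed on the remaining predictors, and VIF = 1 / (1 − R²) of that auxiliary regression, with tolerance = 1 / VIF. The following is a minimal sketch (Python/NumPy) on hypothetical data in which the third predictor is nearly a linear combination of the first two.

```python
import numpy as np

def vif(X: np.ndarray) -> np.ndarray:
    """Variance inflation factor for each column of the predictor matrix X
    (no constant column); VIF_j = 1 / (1 - R_j^2)."""
    n, p = X.shape
    vifs = np.empty(p)
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])
        coef, _, _, _ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)
        vifs[j] = 1.0 / (1.0 - r2)
    return vifs

# Hypothetical predictors; X3 is almost a linear combination of X1 and X2
rng = np.random.default_rng(0)
X1 = rng.normal(size=50)
X2 = rng.normal(size=50)
X3 = X1 + X2 + rng.normal(scale=0.05, size=50)
print(vif(np.column_stack([X1, X2, X3])))   # large VIFs flag collinearity
```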
Detection of Multicollinearity
Correction of Multicollinearity
The most commonly used criterion for the addition or deletion of variables in stepwise regression is based on the partial F-statistic:

F = [(SSE_Reduced − SSE_Full) / q] / [SSE_Full / (n − p − 1)]

The suffix 'Full' refers to the larger model with p explanatory variables, whereas the suffix 'Reduced' refers to the reduced model with (p − q) explanatory variables.
Forward Selection
Backward Elimination
The backward elimination procedure begins with all the variables in the model and proceeds by eliminating the least useful variable, one at a time. A variable whose partial F p-value is greater than a prescribed value, POUT, is the least useful variable and is therefore removed from the regression model. The process continues until no variable can be removed according to the elimination criterion.
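The counterpart forward-selection loop, driven by the partial F criterion described above, can be sketched as follows (Python/NumPy). This is only an illustrative sketch: the F-to-enter threshold, here called FIN, and the helper function names are assumptions, not part of the original text.

```python
import numpy as np

def sse_of(X_cols, Y):
    """SSE of an ordinary least-squares fit of Y on the given columns plus an intercept."""
    n = len(Y)
    A = np.ones((n, 1)) if not X_cols else np.column_stack([np.ones(n)] + X_cols)
    coef, _, _, _ = np.linalg.lstsq(A, Y, rcond=None)
    resid = Y - A @ coef
    return resid @ resid

def forward_selection(candidates, Y, FIN=4.0):
    """At each step, add the candidate variable whose partial F exceeds FIN the most."""
    selected, remaining = [], list(range(len(candidates)))
    n = len(Y)
    while remaining:
        sse_reduced = sse_of([candidates[i] for i in selected], Y)
        best, best_F = None, FIN
        for i in remaining:
            cols = [candidates[j] for j in selected] + [candidates[i]]
            sse_full = sse_of(cols, Y)
            # Partial F for adding one variable (q = 1)
            F = (sse_reduced - sse_full) / (sse_full / (n - len(cols) - 1))
            if F > best_F:
                best, best_F = i, F
        if best is None:
            break            # no candidate passes the F-to-enter criterion
        selected.append(best)
        remaining.remove(best)
    return selected
```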
Stepwise Procedure
Dichotomous Variables
Y = A + BX

where

Y = income of an individual, and
X = a dichotomous variable, coded as
    0 if female
    1 otherwise

The estimated value of Y is

Ŷ = A         if X = 0
Ŷ = A + B     if X = 1
Since our best estimate for a given sample is the sample mean, A is estimated as the average income for females and A + B is estimated as the average income for males. The regression coefficient B is therefore equal to:

B = Ȳmale − Ȳfemale
In effect, females are considered as the reference group and males' income is
measured by how much it differs from females' income.
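This is easy to verify numerically: with a 0/1 predictor, the fitted intercept equals the mean of the reference group and the slope equals the difference between the two group means. A minimal sketch (Python/NumPy) with hypothetical income data:

```python
import numpy as np

# Hypothetical incomes; X = 0 for females (reference group), 1 for males
X = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)
Y = np.array([30, 32, 35, 33, 40, 42, 39, 43], dtype=float)

B, A = np.polyfit(X, Y, 1)      # slope, intercept from ordinary least squares
print("A (mean income of females):", A, "==", Y[X == 0].mean())
print("B (male mean minus female mean):", B, "==", Y[X == 1].mean() - Y[X == 0].mean())
```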
Polytomous Variables
Consider, for example, the relationship between the time spent by an academic
scientist on teaching and his rank.
Y = A + BX
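For a polytomous variable such as rank (Professor, Reader, Lecturer), one category is taken as the reference and each remaining category is represented by a 0/1 dummy variable, exactly as in the Prof and Reader columns of the academic-scientists data set in the Appendices. A minimal sketch (Python/NumPy) with hypothetical teaching-time data, taking Lecturer as the reference category:

```python
import numpy as np

# Hypothetical data: rank coded 1 = Professor, 2 = Reader, 3 = Lecturer;
# Y = percentage of time spent on teaching
rank = np.array([1, 1, 2, 2, 3, 3, 1, 2, 3, 3])
Y    = np.array([40, 35, 50, 45, 60, 65, 30, 55, 70, 62], dtype=float)

# Dummy coding with Lecturer (rank 3) as the reference category
prof   = (rank == 1).astype(float)
reader = (rank == 2).astype(float)

X = np.column_stack([np.ones_like(Y), prof, reader])
coef, _, _, _ = np.linalg.lstsq(X, Y, rcond=None)
A, B_prof, B_reader = coef

print("A        = mean teaching time of Lecturers:", A)
print("B_prof   = Professor mean minus Lecturer mean:", B_prof)
print("B_reader = Reader mean minus Lecturer mean:", B_reader)
```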
17.4 EXAMPLES
17.4.1 Example of Simple Linear Regression (No-intercept Model)
Research Question: How does the research effort of a country affect its participation in the international cooperation network?

Dataset: Appendix 1
Variables Entered/Removed(b,c)

Model 1: Variables Entered: PUBLIC(a); Method: Enter
Model Summary
a) For regression through the origin (the no-intercept model), R² measures the proportion of the variability in the dependent variable about the origin explained by regression. This CANNOT be compared to R² for models which include an intercept.
b) Predictors: PUBLIC
ANOVA(c,d)

Model 1      Sum of Squares        df    Mean Square        F        Sig.
Regression   25527876817.716        1    25527876817.716    298.586  .000(a)
Residual      3761818085.284       44       85495865.575
Total        29289694903.000(b)    45
a) Predictors: PUBLIC
b) This total sum of squares is not corrected for the constant because the constant
is zero for regression through the origin.
c) Dependent Variable: COOP_LIN
d) Linear Regression through the Origin
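Regression through the origin simply drops the constant term, so the design matrix contains only the predictor. The following is a minimal sketch of this kind of analysis in Python (NumPy); PUBLIC and COOP_LIN here are small hypothetical vectors, not the actual data set of Appendix 1.

```python
import numpy as np

# Hypothetical publication counts and cooperation links
PUBLIC   = np.array([100.0, 250.0, 400.0, 800.0, 1500.0])
COOP_LIN = np.array([180.0, 430.0, 790.0, 1500.0, 2900.0])

# No-intercept model: COOP_LIN = B * PUBLIC + E
B, _, _, _ = np.linalg.lstsq(PUBLIC.reshape(-1, 1), COOP_LIN, rcond=None)
fitted = PUBLIC * B[0]
residuals = COOP_LIN - fitted

# For the no-intercept model, sums of squares are measured about the origin,
# so this R^2 is not comparable with that of a model containing an intercept.
SSE = residuals @ residuals
SST0 = COOP_LIN @ COOP_LIN          # uncorrected total sum of squares
R2 = 1.0 - SSE / SST0
print("B =", B[0], "R^2 (about the origin) =", R2)
```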
Coefficients(a,b)

Model | Unstandardized Coefficients (B, Std. Error) | Standardized Coefficients (Beta) | t | Sig.
[Figure: scatter plot of COOP_LIN against PUBLIC with the fitted regression line through the origin]
Comments
Notes
Variables Entered/Removed(b)
Model Summary
ANOVA(b)
Total 2400.171 34
Coefficients(a)
Variables Entered/Removed(b)
b) Dependent Variable: V4
Model Summary(b)
b) Dependent Variable: V4
ANOVA(b)
Total 1181.092 28
b) Dependent Variable: V4
Coefficients(a)
Model | Unstandardized Coefficients (B, Std. Error) | Standardized Coefficients (Beta) | t | Sig. | 95% Confidence Interval for B (Lower Bound, Upper Bound) | Collinearity Statistics (Tolerance, VIF)

1  (Constant)   B = 31.913   Std. Error = 14.072   t = 2.268   Sig. = .034   Lower Bound = 2.649   Upper Bound = 61.177
a) Dependent Variable: V4
Residuals Statistics(a)
a) Dependent Variable: V4
Comments
2) The value of R² indicates that about 79% of the variance in the dependent variable is explained by the regression model.
3) The F ratio in the analysis of variance is used to test the null hypothesis that the population R² is zero. The F ratio is statistically highly significant, implying rejection of the null hypothesis.
V3 = -.359 V2 + .156 V7
6) Collinearity Statistics:
• The values of tolerance level indicate that none of the predictors has
tolerance level less than .01
a) Dependent Variable: V4
Model Summary(c)
a) Predictors: (Constant), V6
c) Dependent Variable: V4
ANOVA(c)
Total 1181.092 28
Total 1181.092 28
a) Predictors: (Constant), V6
b) Predictors: (Constant), V6, V2
c) Dependent Variable: V4
Excluded Variables

Model | Beta In | t | Sig. | Partial Correlation | Collinearity Statistics (Tolerance, VIF, Minimum Tolerance)
Residuals Statistics(a)
Dataset: Appendix 4
Variables Entered/Removed(b)
Model Summary

Model | R | R Square | Adjusted R Square | Std. Error of the Estimate
ANOVA(b)
Total 33285.600 89
Coefficients(a)
Unstandardized Standardized
Coefficients Coefficients
Comments
17.5 APPENDICES
Appendix - 1.1
Data sets
Appendix - 1.3

Data on Predictors of Poverty
Description: Data on indicators of poverty
Number of cases = 30 counties
Number of variables = 8
These variables are defined below:
V1 = Population change
V2 = Number of persons employed in agriculture
V3 = Percentage of families below poverty level
V4 = Residential and farm property tax rate
V5 = Percentage of residences with telephones
V6 = Percentage of rural population
V7 = Median age
V8 = Number of African Americans
Appendix - 1.4

County   V1     V2     V3    V4    V5   V6   V7    V8
1 13.7 400 19.0 1.09 82 75 33.5 360
2 -0.8 710 26.2 1.01 66 100 32.8 193
3 9.6 1610 18.1 0.40 80 70 33.4 3080
4 40.0 500 15.4 0.93 74 100 27.8 592
5 8.4 640 29.0 0.92 65 74 27.9 2
6 3.5 920 21.6 0.59 64 73 33.2 230
7 3.0 1890 21.9 0.63 82 52 30.8 3978
8 7.1 3040 18.9 0.49 85 50 32.4 9816
9 13.0 2730 21.1 0.71 78 71 29.2 1137
10 10.7 1850 23.8 0.93 74 71 28.7 992
11 -16.2 2920 40.5 0.51 69 64 25.1 10723
12 6.6 1070 21.6 0.80 85 58 35.9 3129
13 21.9 160 25.4 0.74 69 100 31.4 338
14 17.8 380 19.7 0.44 83 72 30.1 516
15 -11.8 1140 38.0 0.81 54 100 34.1 12
16 7.5 690 30.1 1.05 65 100 30.5 104
17 3.7 1170 24.8 0.73 76 70 30.0 430
18 1.6 1280 30.3 0.65 67 81 32.4 1240
19 8.4 2270 19.5 0.48 85 39 28.7 20446
20 2.7 960 15.6 0.72 84 58 33.4 1863
21 5.6 1710 17.2 0.62 84 42 29.9 8035
22 12.7 1410 18.4 0.84 86 36 23.3 10620
23 -4.8 200 27.3 0.73 66 100 27.5 211
24 16.5 960 19.2 0.45 74 91 29.5 133
25 15.2 11500 16.8 1.00 87 6 25.4 266159
26 11.6 1380 13.2 0.63 85 44 28.8 2432
27 4.9 530 29.7 0.54 70 100 33.1 932
28 1.1 370 19.8 0.98 75 53 30.8 7
29 3.8 440 27.7 0.46 48 100 28.4 208
30 19.0 1630 20.5 0.68 83 72 30.4 1732
Involvement of academic scientists in different activities

Description: Subset of data derived from the database: Profile and productivity of academic scientists in India. The subset includes the following items:
• Professional Rank (3 categories: coded as Professor =1, Reader =2, Lecturer=3)
• Percentage of time spent on Teaching, Research and Supervision of Doctoral
Students (Data on other activities is suppressed)
• Respondents
Faculty members of Science, Engineering, Medicine and Agriculture Departments
of 20 Universities in India.
• Sample size: 10% random sample of 1073 respondents.

Time spent by academic scientists on different activities (Percentage)
Rank   Teach   Research   Supvn   Prof   Reader   (Prof and Reader are dummy variables)
2.00 40.00 30.00 20.00 .00 1.00
3.00 40.00 40.00 10.00 .00 .00
3.00 65.00 20.00 .00 .00 .00
2.00 50.00 10.00 20.00 .00 1.00
3.00 30.00 30.00 25.00 .00 .00
1.00 40.00 20.00 15.00 1.00 .00
1.00 90.00 .00 .00 1.00 .00
1.00 60.00 10.00 10.00 1.00 .00
1.00 20.00 20.00 10.00 1.00 .00
2.00 30.00 10.00 25.00 .00 1.00
1.00 50.00 20.00 10.00 1.00 .00
2.00 50.00 10.00 15.00 .00 1.00
2.00 22.00 22.00 22.00 .00 1.00
2.00 30.00 50.00 15.00 .00 1.00
3.00 60.00 15.00 15.00 .00 .00
1.00 10.00 15.00 25.00 1.00 .00
1.00 30.00 30.00 20.00 1.00 .00
1.00 45.00 8.00 30.00 1.00 .00
2.00 40.00 10.00 20.00 .00 1.00
1.00 35.00 10.00 20.00 1.00 .00
1.00 20.00 20.00 20.00 1.00 .00
2.00 30.00 30.00 40.00 .00 1.00
1.00 30.00 5.00 10.00 1.00 .00
1.00 20.00 20.00 20.00 1.00 .00
1.00 40.00 15.00 25.00 1.00 .00
1.00 40.00 25.00 20.00 1.00 .00
1.00 60.00 30.00 10.00 1.00 .00
2.00 20.00 35.00 20.00 .00 1.00
2.00 40.00 20.00 20.00 .00 1.00
1.00 30.00 20.00 20.00 1.00 .00
1.00 40.00 25.00 25.00 1.00 .00
3.00 50.00 30.00 10.00 .00 .00
3.00 75.00 20.00 .00 .00 .00
2.00 30.00 40.00 25.00 .00 1.00
2.00 20.00 25.00 40.00 .00 1.00
3.00 50.00 45.00 .00 .00 .00
2.00 50.00 10.00 30.00 .00 1.00
1.00 50.00 25.00 25.00 1.00 .00
3.00 50.00 40.00 .00 .00 .00
2.00 50.00 15.00 10.00 .00 1.00
1.00 80.00 5.00 5.00 1.00 .00
1.00 40.00 10.00 5.00 1.00 .00
3.00 60.00 35.00 5.00 .00 .00
1.00 20.00 20.00 30.00 1.00 .00
1.00 40.00 20.00 .00 1.00 .00
3.00 40.00 30.00 10.00 .00 .00
2.00 40.00 40.00 15.00 .00 1.00
3.00 50.00 20.00 10.00 .00 .00
2.00 60.00 10.00 .00 .00 1.00
2.00 60.00 10.00 20.00 .00 1.00
1.00 50.00 15.00 10.00 1.00 .00
2.00 50.00 40.00 .00 .00 1.00
SOURCE: NAGPAUL, P.S., Profile and productivity of academic scientists in India, National Institute of Science, Technology and Society, New Delhi (India)
17.6 SUMMARY
Regression is a statistical technique that uses the association between variables
as a means of prediction. In the simplest case, we consider two variables, the
independent variable and the dependent variable. The independent variable is
used to predict changes in the dependent variable. Multiple regression is an
extension of simple linear regression. In multiple regression, we consider
more than one independent variable and assess the combined ability of the
independent predictors to account for changes in the dependent variable. Typical
outcome of regression analysis is an equation or "model" that represents the
relationship between a dependent variable and independent variable(s). This
model is derived by minimizing the sum of squares of deviations between the
observed and predicted values of the dependent variable. Procedures for
assessing the goodness of fit of the regression model and its parameters are
discussed. Underlying assumptions and consequences of their violation are
briefly discussed to indicate the possible pitfalls of regression analysis without
understanding the basic principles.
Some examples are presented and computer outputs are interpreted to familiarize
the students with the practical aspects of regression analysis.
17.7 ANSWERS TO SELF CHECK EXERCISES

4) Residuals are the difference between the observed values and those predicted by the regression equation.
17.8 KEY WORDS

Ordinary Least Squares : This method derives its name from the criterion used to draw the best-fit regression line: a line such that the sum of the squared deviations of all the points from the line is minimized.
17.9 REFERENCES AND FURTHER READING

Montgomery, D.C. [et al.]. Introduction to Linear Regression Analysis. 3rd ed. New York: John Wiley.