Coventry Simple Linear Regression
Coventry Simple Linear Regression
Coventry Simple Linear Regression
Research question type: When wanting to predict or explain one variable in terms of another
Example 1:
A dietetics student wants to look at the relationship between calcium intake and knowledge about
calcium in sports science students. Further she wants to know if knowledge about calcium can be
used to predict calcium intake of the students. Table 1 shows the data she collected.
Research question: Does knowledge about calcium predict calcium intake in sports science
students?
In this example there is a single predictor variable (knowledge about calcium) for one response
variable (calcium intake).
It can be seen from the scatter plot in Figure 1(i) that the calcium intake seems to increase as the
knowledge scores increase, and that, although there is some variation, the relationship roughly
follows a straight line (described as linear).
A straight line can be written in terms of its two variables, its gradient (slope) and where it crosses
the vertical axis (intercept), plus an allowance for the unexplained variation (e). See Figure 1(ii),
which shows a fitted line, sometimes termed the line of best fit, calculated from all the given data.
Step 1: Draw a scatter plot of the data to check for an underlying straight line relationship. If
there is NO underlying straight line apparent, ask for advice on how to proceed.
Unexplained
variation, e
A scatter plot can be drawn in SPSS, using the Graphs> Chart Builder options
- choose Scatter/Dot
- drag the Simple Scatter plot into the plotting region
- drag the Predictor (independent) variable (in this case Knowledge score) into the X-axis box
- drag the Response (dependent) variable (in this case Calcium intake) into the Y-axis box
- click OK
Step 2: The fitted line can be added to the chart from the SPSS Chart Editor:
- double-click on the chart in the SPSS Output window to open the Chart Editor
[The above charts have also had changes made to the marker and text size]
- choose Fit Line at Total from the Elements menu – make sure Linear Fit Method is selected in
the Properties window
- close the Chart Editor window to update the chart in the Output window.
The fitted line gives an idea of how much variation there is in the observed values compared to
the line. The variation is assumed to be just in the response variable, so all the unexplained
variation is shown by the difference between the plotted point and point on the fitted line from
the value on the horizontal axis.
Look at Figure 1(ii) and follow a Knowledge score of 40, up to the observed value at 800, and
compare this to value on the line (following the arrow) at approximately 930. So there is about
130 mg/day discrepancy (residual). Other discrepancies are smaller or larger.
Further exploration of these residuals can be carried-out to check the validity of the regression
model – ask for more details.
2
Step 3: Estimating the model
Estimates of the intercept and slope can
be made from the observed data using a
technique called Simple Linear
Regression, which aims to get a model
such that the line fits the data in the 'best'
way.
- choose Analyse>Regression>Linear –
see right
- move the response variable (Calcium
intake) into the Dependent box, and
- move the predictor variable
(Knowledge score) into the
Independent(s) box
- OK
Results:
Table 2 shows some of the output from the regression analysis
a
Table 2: Coefficients
Unstandardized 95.0% Confidence
Coefficients Interval for B
Model t Sig.
Std. Lower Upper
B
Error Bound Bound
(Constant) 373.743 55.067 6.787 .000 258.051 489.435
1 Knowledge score
13.897 1.748 7.951 .000 10.225 17.569
(out of 50)
The estimates of the intercept and slope are given in the 'B' column:
Intercept = Constant = 373.743, and
Slope = Knowledge score = 13.897
The p-value (< 0.001) next to the knowledge score implies that this variable is significant in
explaining calcium consumption.
Conclusion:
Our estimated model is (on average):
For prediction, knowledge scores should lie in the range of the data collected on knowledge.
3
Confidence Intervals:
Obviously this model is subject to uncertainty, as the observed points did not all lie on a perfect
straight line, so the coefficients for the intercept and slope are only estimates of the 'true' value.
Confidence intervals (CIs) can be calculated for these values to give a range of possible values;
choose the Statistics button in the Linear Regression dialog box, and check the Confidence
Intervals box, Level(%) = 95. This means that If we were to do this experiment 100 times, 95 times
the true value for the intercept and slope would lie in the 95% CI.
95% CIs were calculated for the two coefficients in this example (Table 2 above) as:
intercept: CI=(260, 490), slope: CI=(10, 18), both corrected to 2 sig fig.
In other words, we expect that the true estimates are in the intervals (260, 490) and (10, 18).
These CIs should be inspected in context by the reader for practical importance/usefulness.
More than one predictor variable can be used in a model with the same provisos as above – see
Multiple Linear Regression. There are also other regression modelling techniques for data not
considered to be at continuous/interval/ratio level.
Comments:
This value can be interpreted as 78% (or 77%) of variation in calcium intake can be explained by
knowledge score. A 95% CI can be calculated for this value, too. [But not in SPSS]
The remaining 22% (or 23%) arises from other variables not taken into account in the analysis.
Note we CANNOT assume that knowledge about calcium CAUSES the increase in calcium intake.
Pearson's coefficient of linear correlation R = 0.882 is also given in the output. It is a measure of
the strength of the linear relationship between the predictor and response variable.