Topic 03: Correlation and Regression


Correlation and Regression
Cal State Northridge, 427
Ainsworth
Major Points - Correlation
- Questions answered by correlation
- Scatterplots
- An example
- The correlation coefficient
- Other kinds of correlations
- Factors affecting correlations
- Testing for significance
The Question
- Are two variables related?
  - Does one increase as the other increases? (e.g. skills and income)
  - Does one decrease as the other increases? (e.g. health problems and nutrition)
- How can we get a numerical measure of the degree of relationship?
Scatterplots
- AKA scatter diagram or scattergram.
- Graphically depicts the relationship between two variables in two-dimensional space.
Direct Relationship
[Scatterplot: Video Games and Alcohol Consumption. X-axis: Average Hours of Video Games Per Week (0-25); Y-axis: Average Number of Alcoholic Drinks Per Week (0-20).]
Inverse Relationship
[Scatterplot: Video Games and Test Score. X-axis: Average Hours of Video Games Per Week (0-20); Y-axis: Exam Score (0-100).]
An Example
- Does smoking cigarettes increase systolic blood pressure?
- Plotting number of cigarettes smoked per day against systolic blood pressure
- Fairly moderate relationship
- Relationship is positive
Trend?
[Scatterplot: SMOKING (x-axis, 0-30) vs. SYSTOLIC blood pressure (y-axis, 100-170).]
Smoking and BP
- Note the relationship is moderate, but real.
- Why do we care about the relationship?
- What would we conclude if there were no relationship?
- What if the relationship were near perfect?
- What if the relationship were negative?
Heart Disease and Cigarettes
- Data on heart disease and cigarette smoking in 21 developed countries (Landwehr and Watkins, 1987)
- Data have been rounded for computational convenience; the results were not affected.
The Data

Country   Cigarettes   CHD
1         11           26
2          9           21
3          9           24
4          9           21
5          8           19
6          8           13
7          8           19
8          6           11
9          6           23
10         5           15
11         5           13
12         5            4
13         5           18
14         5           12
15         5            3
16         4           11
17         4           15
18         4            6
19         3           13
20         3            4
21         3           14

Surprisingly, the U.S. is the first country on the list: the country with the highest consumption and the highest mortality.
Scatterplot of Heart Disease
- CHD mortality goes on the ordinate (Y axis). Why?
- Cigarette consumption goes on the abscissa (X axis). Why?
- What does each dot represent?
- Best-fitting line included for clarity
[Scatterplot: Cigarette Consumption per Adult per Day (x-axis, 2-12) vs. CHD mortality (y-axis, 0-30), with the best-fitting line; one labeled point is {X = 6, Y = 11}.]
What Does the Scatterplot Show?
- As smoking increases, so does coronary heart disease mortality.
- Relationship looks strong
- Not all data points are on the line.
  - This gives us “residuals” or “errors of prediction”
  - To be discussed later
Correlation
- Co-relation
- The relationship between two variables
- Measured with a correlation coefficient
- Most popularly seen correlation coefficient: Pearson Product-Moment Correlation
Types of Correlation
- Positive correlation
  - High values of X tend to be associated with high values of Y.
  - As X increases, Y increases.
- Negative correlation
  - High values of X tend to be associated with low values of Y.
  - As X increases, Y decreases.
- No correlation
  - No consistent tendency for values on Y to increase or decrease as X increases.
Correlation Coefficient
- A measure of degree of relationship.
- Between -1 and +1
- Sign refers to direction.
- Based on covariance
  - Measure of the degree to which large scores on X go with large scores on Y, and small scores on X go with small scores on Y
  - Think of it as variance, but with 2 variables instead of 1 (What does that mean??)
Covariance
- Remember that variance is:

  Var_X = \frac{\sum (X - \bar{X})^2}{N-1} = \frac{\sum (X - \bar{X})(X - \bar{X})}{N-1}

- The formula for covariance is:

  Cov_{XY} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{N-1}

- How this works, and why?
- When would Cov_XY be large and positive? Large and negative?
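To make the formula concrete, here is a minimal Python sketch (our own illustration, not part of the slides) computing Cov_XY for the 21-country smoking/CHD data used in the example that follows:

```python
# Minimal sketch: covariance of the 21-country smoking/CHD data
# (rounded values from the slides). Uses N - 1 in the denominator.
x = [11, 9, 9, 9, 8, 8, 8, 6, 6, 5, 5, 5, 5, 5, 5, 4, 4, 4, 3, 3, 3]  # cigarettes
y = [26, 21, 24, 21, 19, 13, 19, 11, 23, 15, 13, 4, 18, 12, 3, 11, 15, 6, 13, 4, 14]  # CHD

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

# Sum of cross-products of deviations, divided by N - 1
cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)
print(round(cov_xy, 2))  # ~11.13 (the slides' per-row rounding gives 11.12)
```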
Example

Country   X (Cig.)   Y (CHD)   (X - X̄)   (Y - Ȳ)   (X - X̄)(Y - Ȳ)
1         11         26          5.05       11.48        57.97
2          9         21          3.05        6.48        19.76
3          9         24          3.05        9.48        28.91
4          9         21          3.05        6.48        19.76
5          8         19          2.05        4.48         9.18
6          8         13          2.05       -1.52        -3.12
7          8         19          2.05        4.48         9.18
8          6         11          0.05       -3.52        -0.18
9          6         23          0.05        8.48         0.42
10         5         15         -0.95        0.48        -0.46
11         5         13         -0.95       -1.52         1.44
12         5          4         -0.95      -10.52         9.99
13         5         18         -0.95        3.48        -3.31
14         5         12         -0.95       -2.52         2.39
15         5          3         -0.95      -11.52        10.94
16         4         11         -1.95       -3.52         6.86
17         4         15         -1.95        0.48        -0.94
18         4          6         -1.95       -8.52        16.61
19         3         13         -2.95       -1.52         4.48
20         3          4         -2.95      -10.52        31.03
21         3         14         -2.95       -0.52         1.53
Mean      5.95      14.52
SD        2.33       6.69
Sum                                                     222.44
Example

  Cov_{cig&CHD} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{N-1} = \frac{222.44}{21-1} = 11.12

- What the heck is a covariance?
- I thought we were talking about correlation?
Correlation Coefficient
- Pearson’s Product-Moment Correlation
- Symbolized by r
- Covariance ÷ (product of the 2 SDs):

  r = \frac{Cov_{XY}}{s_X s_Y}

- Correlation is a standardized covariance
Calculation for Example
- Cov_XY = 11.12
- s_X = 2.33
- s_Y = 6.69

  r = \frac{Cov_{XY}}{s_X s_Y} = \frac{11.12}{(2.33)(6.69)} = \frac{11.12}{15.59} = .713
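A short sketch (ours) confirming the hand calculation by standardizing the covariance:

```python
import statistics

x = [11, 9, 9, 9, 8, 8, 8, 6, 6, 5, 5, 5, 5, 5, 5, 4, 4, 4, 3, 3, 3]
y = [26, 21, 24, 21, 19, 13, 19, 11, 23, 15, 13, 4, 18, 12, 3, 11, 15, 6, 13, 4, 14]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

# r = covariance divided by the product of the two standard deviations
r = cov_xy / (statistics.stdev(x) * statistics.stdev(y))
print(round(r, 3))  # ~0.713
```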
Example
- Correlation = .713
- Sign is positive. Why?
- If the sign were negative, what would it mean?
  - It would not alter the degree of relationship.
Other Calculations
- Z-score method:

  r = \frac{\sum z_X z_Y}{N-1}

- Computational (raw score) method:

  r = \frac{N\sum XY - \sum X \sum Y}{\sqrt{[N\sum X^2 - (\sum X)^2][N\sum Y^2 - (\sum Y)^2]}}
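Both alternative methods give the same answer as the definitional formula; a quick sketch (ours) checks this on the smoking/CHD data:

```python
import math
import statistics

x = [11, 9, 9, 9, 8, 8, 8, 6, 6, 5, 5, 5, 5, 5, 5, 4, 4, 4, 3, 3, 3]
y = [26, 21, 24, 21, 19, 13, 19, 11, 23, 15, 13, 4, 18, 12, 3, 11, 15, 6, 13, 4, 14]
n = len(x)
mx, sx = statistics.mean(x), statistics.stdev(x)
my, sy = statistics.mean(y), statistics.stdev(y)

# z-score method: r = sum(z_x * z_y) / (N - 1)
zx = [(v - mx) / sx for v in x]
zy = [(v - my) / sy for v in y]
r_z = sum(a * b for a, b in zip(zx, zy)) / (n - 1)

# computational (raw score) method
num = n * sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y)
den = math.sqrt((n * sum(v**2 for v in x) - sum(x)**2) *
                (n * sum(v**2 for v in y) - sum(y)**2))
r_raw = num / den

print(round(r_z, 3), round(r_raw, 3))  # both ~0.713
```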
Other Kinds of Correlation
- Spearman rank-order correlation coefficient (r_sp)
  - Used with 2 ranked/ordinal variables
  - Uses the same Pearson formula

Attractiveness   Symmetry
3                2
4                6
1                1
2                3
5                4
6                5

r_sp = 0.77
Other Kinds of Correlation
- Point-biserial correlation coefficient (r_pb)
  - Used with one continuous scale and one nominal, ordinal, or dichotomous scale
  - Uses the same Pearson formula

Attractiveness   Date?
3                0
4                0
1                1
2                1
5                1
6                0

r_pb = -0.49
Other Kinds of Correlation
- Phi coefficient (φ)
  - Used with two dichotomous scales
  - Uses the same Pearson formula

Attractiveness   Date?
0                0
1                0
1                1
1                1
0                0
1                1

φ = 0.71
Factors Affecting r
- Range restrictions
  - Looking at only a small portion of the total scatterplot (i.e., a smaller portion of the scores’ variability) decreases r.
  - Reducing variability reduces r.
- Nonlinearity
  - The Pearson r (and its relatives) measures the degree of linear relationship between two variables.
  - If a strong nonlinear relationship exists, r will provide a low, or at least inaccurate, measure of the true relationship.
Factors Affecting r
- Heterogeneous subsamples
  - Everyday examples (e.g. height and weight using both men and women)
- Outliers
  - Can either overestimate or underestimate the correlation
Countries With Low Consumption: Data With Restricted Range
[Scatterplot truncated at 5 cigarettes per day: Cigarette Consumption per Adult per Day (x-axis, 2.5-5.5) vs. CHD Mortality per 10,000 (y-axis).]
Truncation
Non-linearity
Heterogeneous Samples
Outliers
[Illustrative scatterplots for each of these four factors.]
Testing Correlations
- So you have a correlation. Now what?
- In terms of magnitude, how big is big?
  - Small correlations in large samples are “big.”
  - Large correlations in small samples aren’t always “big.”
- Depends upon the magnitude of the correlation coefficient AND the size of your sample.
Testing r
- Population parameter = ρ
- Null hypothesis H0: ρ = 0
  - Test of linear independence
  - What would a true null mean here?
  - What would a false null mean here?
- Alternative hypothesis (H1): ρ ≠ 0
  - Two-tailed
Tables of Significance
- We can convert r to t and test for significance:

  t = r\sqrt{\frac{N-2}{1-r^2}}

  where df = N - 2
Tables of Significance
- In our example r was .713
- N - 2 = 21 - 2 = 19

  t = r\sqrt{\frac{N-2}{1-r^2}} = .713\sqrt{\frac{19}{1-.713^2}} = .713\sqrt{\frac{19}{.4916}} = 4.43

- t_crit(19) = 2.09
- Since 4.43 is larger than 2.09, reject ρ = 0.
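A minimal sketch (ours) of the r-to-t conversion:

```python
import math

# t test for r = .713 with N = 21 (df = N - 2 = 19)
r, n = 0.713, 21
t = r * math.sqrt((n - 2) / (1 - r**2))
print(round(t, 2))  # ~4.43, which exceeds t_crit(19) = 2.09: reject rho = 0
```

With a single predictor, t² of this test equals the F for the overall regression model shown later (4.43² ≈ 19.6).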
Computer Printout
- Printout gives the test of significance.

Correlations
                                CIGARET    CHD
CIGARET   Pearson Correlation   1          .713**
          Sig. (2-tailed)       .          .000
          N                     21         21
CHD       Pearson Correlation   .713**     1
          Sig. (2-tailed)       .000       .
          N                     21         21
**. Correlation is significant at the 0.01 level (2-tailed).
Regression

What is regression?
- How do we predict one variable from another?
- How does one variable change as the other changes?
- Influence
Linear Regression
- A technique we use to predict the most likely score on one variable from scores on another variable
- Uses the nature of the relationship (i.e. correlation) between two variables to enhance your prediction
Linear Regression: Parts
- Y: the variable you are predicting (i.e. the dependent variable)
- X: the variable you are using to predict (i.e. the independent variable)
- Ŷ: your predictions (also known as Y′)
Why Do We Care?
- We may want to make a prediction.
- More likely, we want to understand the relationship.
  - How fast does CHD mortality rise with a one-unit increase in smoking?
- Note: we speak about predicting, but often don’t actually predict.
An Example
- Cigarettes and CHD mortality again
- Data repeated on the next slide
- We want to predict the level of CHD mortality in a country averaging 10 cigarettes per day.
The Data

Country   Cigarettes   CHD
1         11           26
2          9           21
3          9           24
4          9           21
5          8           19
6          8           13
7          8           19
8          6           11
9          6           23
10         5           15
11         5           13
12         5            4
13         5           18
14         5           12
15         5            3
16         4           11
17         4           15
18         4            6
19         3           13
20         3            4
21         3           14

Based on the data we have, what would we predict the rate of CHD to be in a country that smoked 10 cigarettes on average? First, we need to establish a prediction of CHD from smoking…
[Scatterplot with regression line: Cigarette Consumption per Adult per Day (x-axis, 2-12) vs. CHD mortality (y-axis, 0-30). For a country that smokes 6 C/A/D, the regression line predicts a CHD rate of about 14.]
Regression Line
- Formula:

  \hat{Y} = bX + a

- Ŷ = the predicted value of Y (e.g. CHD mortality)
- X = the predictor variable (e.g. average cig./adult/country)
Regression Coefficients
- The “coefficients” are a and b
- b = slope
  - Change in predicted Y for a one-unit change in X
- a = intercept
  - Value of Ŷ when X = 0
Calculation
- Slope:

  b = \frac{Cov_{XY}}{s_X^2}  or  b = r\left(\frac{s_Y}{s_X}\right)  or  b = \frac{N\sum XY - \sum X \sum Y}{N\sum X^2 - (\sum X)^2}

- Intercept:

  a = \bar{Y} - b\bar{X}
For Our Data
- Cov_XY = 11.12
- s²_X = 2.33² = 5.447
- b = 11.12/5.447 = 2.042
- a = 14.524 - 2.042 × 5.952 = 2.37
- See the SPSS printout on the next slide.
- Answers are not exact due to rounding error and the desire to match SPSS.
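A sketch (ours) recovering the slope and intercept directly from the data:

```python
import statistics

x = [11, 9, 9, 9, 8, 8, 8, 6, 6, 5, 5, 5, 5, 5, 5, 4, 4, 4, 3, 3, 3]
y = [26, 21, 24, 21, 19, 13, 19, 11, 23, 15, 13, 4, 18, 12, 3, 11, 15, 6, 13, 4, 14]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

b = cov_xy / statistics.variance(x)  # slope = Cov_XY / s_X^2
a = mean_y - b * mean_x              # intercept = Ybar - b * Xbar
print(round(b, 3), round(a, 3))      # ~2.042, ~2.367
```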
SPSS Printout
[SPSS coefficients table; values discussed below.]
Note:
- The values we obtained are shown on the printout.
- The intercept is the value in the B column labeled “constant.”
- The slope is the value in the B column labeled by the name of the predictor variable.
Making a Prediction
- Second, once we know the relationship we can predict:

  \hat{Y} = bX + a = 2.042X + 2.367
  \hat{Y} = 2.042(10) + 2.367 = 22.787

- We predict that about 22.79 people per 10,000 will die of CHD in a country with an average of 10 C/A/D.
Accuracy of Prediction
- Finnish smokers smoke 6 C/A/D
- We predict:

  \hat{Y} = bX + a = 2.042X + 2.367
  \hat{Y} = 2.042(6) + 2.367 = 14.619

- They actually have 23 deaths/10,000
- Our error (“residual”) = 23 - 14.619 = 8.38, a large error
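The same prediction and residual in a few lines of Python (our illustration; `predict` is a hypothetical helper):

```python
# Sketch: prediction and residual from the fitted line Y' = 2.042X + 2.367
b, a = 2.042, 2.367

def predict(cigs):
    return b * cigs + a

print(round(predict(10), 2))  # ~22.79 CHD deaths per 10,000
residual = 23 - predict(6)    # Finland: observed 23, predicted ~14.62
print(round(residual, 2))     # ~8.38, a large error
```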
[Scatterplot with regression line: Cigarette Consumption per Adult per Day (x-axis, 2-12) vs. CHD Mortality per 10,000 (y-axis, 0-30). The height of the line at a given X is the prediction; the vertical distance from a point to the line is the residual.]
Residuals
- When we predict Ŷ for a given X, we will sometimes be in error.
- Y - Ŷ for any X is an error of estimate
  - Also known as: a residual
- We want Σ(Y - Ŷ) to be as small as possible.
  - BUT, there are infinitely many lines that can do this.
  - Just draw ANY line that goes through the means of the X and Y values.
- Minimize errors of estimate… How?
Minimizing Residuals
- Again, the problem lies with this definition of the mean:

  \sum (X - \bar{X}) = 0

- So, how do we get rid of the 0’s?
  - Square them.
Regression Line: A Mathematical Definition
- The regression line is the line which, when drawn through your data set, produces the smallest value of:

  \sum (Y - \hat{Y})^2

- Called the sum of squared residuals, or SS_residual
- The regression line is also called a “least squares line.”
Summarizing Errors of Prediction
- Residual variance
  - The variability of the observations around the predicted values

  s_{Y-\hat{Y}}^2 = \frac{\sum (Y_i - \hat{Y}_i)^2}{N-2} = \frac{SS_{residual}}{N-2}
Standard Error of Estimate
- Standard error of estimate
  - The standard deviation of the errors of prediction (the residuals)

  s_{Y-\hat{Y}} = \sqrt{\frac{\sum (Y_i - \hat{Y}_i)^2}{N-2}} = \sqrt{\frac{SS_{residual}}{N-2}}

- A common measure of the accuracy of our predictions
- We want it to be as small as possible.
Example

Country   X (Cig.)   Y (CHD)   Y′        (Y - Y′)   (Y - Y′)²
1         11         26        24.829     1.171       1.371
2          9         21        20.745     0.255       0.065
3          9         24        20.745     3.255      10.595
4          9         21        20.745     0.255       0.065
5          8         19        18.703     0.297       0.088
6          8         13        18.703    -5.703      32.524
7          8         19        18.703     0.297       0.088
8          6         11        14.619    -3.619      13.097
9          6         23        14.619     8.381      70.241
10         5         15        12.577     2.423       5.871
11         5         13        12.577     0.423       0.179
12         5          4        12.577    -8.577      73.565
13         5         18        12.577     5.423      29.409
14         5         12        12.577    -0.577       0.333
15         5          3        12.577    -9.577      91.719
16         4         11        10.535     0.465       0.216
17         4         15        10.535     4.465      19.936
18         4          6        10.535    -4.535      20.566
19         3         13         8.493     4.507      20.313
20         3          4         8.493    -4.493      20.187
21         3         14         8.493     5.507      30.327
Mean      5.952     14.524
SD        2.334      6.690
Sum                            0.04      440.757

  s_{Y-\hat{Y}}^2 = \frac{\sum (Y_i - \hat{Y}_i)^2}{N-2} = \frac{440.756}{21-2} = 23.198

  s_{Y-\hat{Y}} = \sqrt{\frac{440.756}{21-2}} = \sqrt{23.198} = 4.816
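A sketch (ours) reproducing SS_residual and the standard error of estimate from the fitted line:

```python
import math

x = [11, 9, 9, 9, 8, 8, 8, 6, 6, 5, 5, 5, 5, 5, 5, 4, 4, 4, 3, 3, 3]
y = [26, 21, 24, 21, 19, 13, 19, 11, 23, 15, 13, 4, 18, 12, 3, 11, 15, 6, 13, 4, 14]
b, a = 2.042, 2.367  # slope and intercept from the earlier calculation

residuals = [yi - (b * xi + a) for xi, yi in zip(x, y)]
ss_residual = sum(e**2 for e in residuals)
s_est = math.sqrt(ss_residual / (len(x) - 2))  # standard error of estimate
print(round(ss_residual, 3), round(s_est, 3))  # ~440.757, ~4.816
```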
Regression and Z Scores
- When your data are standardized (linearly transformed to z-scores), the slope of the regression line is called β (beta).
- DO NOT confuse this β with the β associated with Type II errors. They’re different.
- When we have one predictor, r = β.
- Z_Ŷ = βZ_X, since a now equals 0.
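A sketch (ours) verifying that with one predictor the standardized slope equals r:

```python
import statistics

x = [11, 9, 9, 9, 8, 8, 8, 6, 6, 5, 5, 5, 5, 5, 5, 4, 4, 4, 3, 3, 3]
y = [26, 21, 24, 21, 19, 13, 19, 11, 23, 15, 13, 4, 18, 12, 3, 11, 15, 6, 13, 4, 14]

mx, sx = statistics.mean(x), statistics.stdev(x)
my, sy = statistics.mean(y), statistics.stdev(y)
zx = [(v - mx) / sx for v in x]
zy = [(v - my) / sy for v in y]

# Least-squares slope of z_y on z_x; the intercept is 0 because both
# standardized variables have mean 0.
beta = sum(a * b for a, b in zip(zx, zy)) / sum(a * a for a in zx)
print(round(beta, 3))  # ~0.713, i.e. beta = r with one predictor
```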


Partitioning Variability
- Sums of squared deviations
  - Total:

  SS_{total} = \sum (Y - \bar{Y})^2

  - Regression:

  SS_{regression} = \sum (\hat{Y} - \bar{Y})^2

  - Residual (already covered):

  SS_{residual} = \sum (Y - \hat{Y})^2

- SS_total = SS_regression + SS_residual
Partitioning Variability
- Degrees of freedom
  - Total: df_total = N - 1
  - Regression: df_regression = number of predictors
  - Residual: df_residual = df_total - df_regression
- df_total = df_regression + df_residual
Partitioning Variability
- Variance (or mean square)
  - Total variance: s²_total = SS_total / df_total
  - Regression variance: s²_regression = SS_regression / df_regression
  - Residual variance: s²_residual = SS_residual / df_residual
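A sketch (ours) computing all three sums of squares and the mean squares, and checking the partition:

```python
x = [11, 9, 9, 9, 8, 8, 8, 6, 6, 5, 5, 5, 5, 5, 5, 4, 4, 4, 3, 3, 3]
y = [26, 21, 24, 21, 19, 13, 19, 11, 23, 15, 13, 4, 18, 12, 3, 11, 15, 6, 13, 4, 14]
b, a = 2.042, 2.367
n = len(y)
mean_y = sum(y) / n
y_hat = [b * xi + a for xi in x]

ss_total = sum((yi - mean_y)**2 for yi in y)
ss_regression = sum((yh - mean_y)**2 for yh in y_hat)
ss_residual = sum((yi - yh)**2 for yi, yh in zip(y, y_hat))
# ss_regression + ss_residual ~= ss_total (up to rounding in b and a)

ms_regression = ss_regression / 1    # df_regression = 1 predictor
ms_residual = ss_residual / (n - 2)  # df_residual = N - 2 = 19
print(round(ms_regression, 1), round(ms_residual, 3))  # ~454.3, ~23.198
```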
Example

Country   X (Cig.)   Y (CHD)   Y′       (Y - Y′)   (Y - Y′)²   (Y′ - Ȳ)²   (Y - Ȳ)²
1         11         26        24.829    1.171       1.371     106.193     131.699
2          9         21        20.745    0.255       0.065      38.701      41.939
3          9         24        20.745    3.255      10.595      38.701      89.795
4          9         21        20.745    0.255       0.065      38.701      41.939
5          8         19        18.703    0.297       0.088      17.464      20.035
6          8         13        18.703   -5.703      32.524      17.464       2.323
7          8         19        18.703    0.297       0.088      17.464      20.035
8          6         11        14.619   -3.619      13.097       0.009      12.419
9          6         23        14.619    8.381      70.241       0.009      71.843
10         5         15        12.577    2.423       5.871       3.791       0.227
11         5         13        12.577    0.423       0.179       3.791       2.323
12         5          4        12.577   -8.577      73.565       3.791     110.755
13         5         18        12.577    5.423      29.409       3.791      12.083
14         5         12        12.577   -0.577       0.333       3.791       6.371
15         5          3        12.577   -9.577      91.719       3.791     132.803
16         4         11        10.535    0.465       0.216      15.912      12.419
17         4         15        10.535    4.465      19.936      15.912       0.227
18         4          6        10.535   -4.535      20.566      15.912      72.659
19         3         13         8.493    4.507      20.313      36.373       2.323
20         3          4         8.493   -4.493      20.187      36.373     110.755
21         3         14         8.493    5.507      30.327      36.373       0.275
Mean      5.952     14.524
SD        2.334      6.690
Sum                            0.04     440.757     454.307     895.247

Y′ = (2.04 × X) + 2.37
Example

  SS_{total} = \sum (Y - \bar{Y})^2 = 895.247;  df_{total} = 21 - 1 = 20
  SS_{regression} = \sum (\hat{Y} - \bar{Y})^2 = 454.307;  df_{regression} = 1 (only 1 predictor)
  SS_{residual} = \sum (Y - \hat{Y})^2 = 440.757;  df_{residual} = 20 - 1 = 19

  s_{total}^2 = \frac{\sum (Y - \bar{Y})^2}{N-1} = \frac{895.247}{20} = 44.762
  s_{regression}^2 = \frac{\sum (\hat{Y} - \bar{Y})^2}{1} = \frac{454.307}{1} = 454.307
  s_{residual}^2 = \frac{\sum (Y - \hat{Y})^2}{N-2} = \frac{440.757}{19} = 23.198

Note: s_{residual}^2 = s_{Y-\hat{Y}}^2
Coefficient of Determination
- A measure of the percent of predictable variability:

  r^2 = the correlation squared, or  r^2 = \frac{SS_{regression}}{SS_Y}

- The percentage of the total variability in Y explained by X
r² for Our Example
- r = .713
- r² = .713² = .508

  r^2 = \frac{SS_{regression}}{SS_Y} = \frac{454.307}{895.247} = .507

- Approximately 50% of the variability in the incidence of CHD mortality is associated with variability in smoking.
Coefficient of Alienation
- Defined as 1 - r², or:

  1 - r^2 = \frac{SS_{residual}}{SS_Y}

- Example: 1 - .508 = .492

  1 - r^2 = \frac{SS_{residual}}{SS_Y} = \frac{440.757}{895.247} = .492
r², SS, and s_{Y-Y′}
- r² × SS_total = SS_regression
- (1 - r²) × SS_total = SS_residual
- We can also use r² to calculate the standard error of estimate as:

  s_{Y-\hat{Y}} = s_Y\sqrt{(1-r^2)\left(\frac{N-1}{N-2}\right)} = 6.690\sqrt{(.492)\left(\frac{20}{19}\right)} = 4.816
Testing Overall Model
- We can test the overall prediction of the model by forming the ratio:

  F = \frac{s_{regression}^2}{s_{residual}^2}

- If the calculated F value is larger than the tabled value (F-table), we have a significant prediction.
Testing Overall Model
- Example:

  F = \frac{s_{regression}^2}{s_{residual}^2} = \frac{454.307}{23.198} = 19.58

- F-critical is found using two things: df_regression (numerator) and df_residual (denominator).
- From the F-table, F_crit(1, 19) = 4.38.
- 19.58 > 4.38: the overall prediction is significant.
- Should all sound familiar…
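A minimal sketch of the F ratio (assuming SciPy is available for the p value):

```python
from scipy.stats import f  # F distribution, for the p value

ms_regression, ms_residual = 454.307, 23.198
F = ms_regression / ms_residual
p = f.sf(F, 1, 19)  # dfn = df_regression, dfd = df_residual
print(round(F, 2), round(p, 4))  # ~19.58, p ~ .0003 (SPSS prints .000)
```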
SPSS Output

Model Summary
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .713a   .508       .482                4.81640
a. Predictors: (Constant), CIGARETT

ANOVA(b)
Model            Sum of Squares   df   Mean Square   F        Sig.
1   Regression   454.482          1    454.482       19.592   .000a
    Residual     440.757          19   23.198
    Total        895.238          20
a. Predictors: (Constant), CIGARETT
b. Dependent Variable: CHD
Testing Slope and Intercept
- The regression coefficients can be tested for significance.
- Each coefficient divided by its standard error equals a t value that can also be looked up in a t-table.
- Each coefficient is tested against 0.
Testing the Slope
- With only 1 predictor, the standard error for the slope is:

  se_b = \frac{s_{Y-\hat{Y}}}{s_X\sqrt{N-1}}

- For our example:

  se_b = \frac{4.816}{2.334\sqrt{21-1}} = \frac{4.816}{10.438} = .461
Testing Slope and Intercept
- These are given in the computer printout as a t test.
Testing
- The t values in the second column from the right are tests on the slope and intercept.
- The associated p values are next to them.
- The slope is significantly different from zero, but the intercept is not.
- Why do we care?
Testing
- What does it mean if the slope is not significant?
  - How does that relate to the test on r?
- What if the intercept is not significant?
- Does a significant slope mean we predict quite well?
