Topic 03: Correlation and Regression


Correlation and Regression
Cal State Northridge, 427
Ainsworth
Major Points - Correlation
- Questions answered by correlation
- Scatterplots
- An example
- The correlation coefficient
- Other kinds of correlations
- Factors affecting correlations
- Testing for significance
The Question
- Are two variables related?
  - Does one increase as the other increases? (e.g. skills and income)
  - Does one decrease as the other increases? (e.g. health problems and nutrition)
- How can we get a numerical measure of the degree of relationship?
Scatterplots
- AKA scatter diagram or scattergram.
- Graphically depicts the relationship between two variables in two-dimensional space.
Direct Relationship
[Scatterplot: Video Games and Alcohol Consumption. X-axis: Average Hours of Video Games Per Week (0-25); Y-axis: Average Number of Alcoholic Drinks Per Week (0-20).]
Inverse Relationship
[Scatterplot: Video Games and Test Score. X-axis: Average Hours of Video Games Per Week (0-20); Y-axis: Exam Score (0-100).]
An Example
- Does smoking cigarettes increase systolic blood pressure?
- Plotting number of cigarettes smoked per day against systolic blood pressure
- Fairly moderate relationship
- Relationship is positive
Trend?
[Scatterplot: SMOKING (x-axis, 0-30) vs. SYSTOLIC blood pressure (y-axis, 100-170).]
Smoking and BP
- Note the relationship is moderate, but real.
- Why do we care about the relationship?
- What would we conclude if there were no relationship?
- What if the relationship were near perfect?
- What if the relationship were negative?
Heart Disease and Cigarettes
- Data on heart disease and cigarette smoking in 21 developed countries (Landwehr and Watkins, 1987)
- Data have been rounded for computational convenience; the results were not affected.
The Data

Country   Cigarettes   CHD
1         11           26
2          9           21
3          9           24
4          9           21
5          8           19
6          8           13
7          8           19
8          6           11
9          6           23
10         5           15
11         5           13
12         5            4
13         5           18
14         5           12
15         5            3
16         4           11
17         4           15
18         4            6
19         3           13
20         3            4
21         3           14

Surprisingly, the U.S. is the first country on the list: the country with the highest consumption and the highest mortality.
Scatterplot of Heart Disease
- CHD mortality goes on the ordinate (Y axis). Why?
- Cigarette consumption goes on the abscissa (X axis). Why?
- What does each dot represent?
- Best-fitting line included for clarity
[Scatterplot: Cigarette Consumption per Adult per Day (x-axis, 2-12) vs. CHD mortality (y-axis, 0-30), with the best-fitting line; one labeled point is {X = 6, Y = 11}.]
What Does the Scatterplot Show?
- As smoking increases, so does coronary heart disease mortality.
- Relationship looks strong
- Not all data points are on the line.
  - This gives us “residuals” or “errors of prediction”
  - To be discussed later
Correlation
- Co-relation
- The relationship between two variables
- Measured with a correlation coefficient
- Most popularly seen correlation coefficient: Pearson Product-Moment Correlation
Types of Correlation
- Positive correlation
  - High values of X tend to be associated with high values of Y.
  - As X increases, Y increases.
- Negative correlation
  - High values of X tend to be associated with low values of Y.
  - As X increases, Y decreases.
- No correlation
  - No consistent tendency for values on Y to increase or decrease as X increases.
Correlation Coefficient
- A measure of degree of relationship.
- Between -1 and +1
- Sign refers to direction.
- Based on covariance
  - Measure of the degree to which large scores on X go with large scores on Y, and small scores on X go with small scores on Y
  - Think of it as variance, but with 2 variables instead of 1 (What does that mean??)
Covariance
- Remember that variance is:

  Var_X = \frac{\sum (X - \bar{X})^2}{N-1} = \frac{\sum (X - \bar{X})(X - \bar{X})}{N-1}

- The formula for covariance is:

  Cov_{XY} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{N-1}

- How this works, and why?
- When would Cov_XY be large and positive? Large and negative?
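To make the formula concrete, here is a minimal Python sketch (our own illustration, not part of the slides) computing Cov_XY for the 21-country smoking/CHD data used in the example that follows:

```python
# Minimal sketch: covariance of the 21-country smoking/CHD data
# (rounded values from the slides). Uses N - 1 in the denominator.
x = [11, 9, 9, 9, 8, 8, 8, 6, 6, 5, 5, 5, 5, 5, 5, 4, 4, 4, 3, 3, 3]  # cigarettes
y = [26, 21, 24, 21, 19, 13, 19, 11, 23, 15, 13, 4, 18, 12, 3, 11, 15, 6, 13, 4, 14]  # CHD

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

# Sum of cross-products of deviations, divided by N - 1
cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)
print(round(cov_xy, 2))  # ~11.13 (the slides' per-row rounding gives 11.12)
```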
Example

Country   X (Cig.)   Y (CHD)   (X - X̄)   (Y - Ȳ)   (X - X̄)(Y - Ȳ)
1         11         26          5.05       11.48        57.97
2          9         21          3.05        6.48        19.76
3          9         24          3.05        9.48        28.91
4          9         21          3.05        6.48        19.76
5          8         19          2.05        4.48         9.18
6          8         13          2.05       -1.52        -3.12
7          8         19          2.05        4.48         9.18
8          6         11          0.05       -3.52        -0.18
9          6         23          0.05        8.48         0.42
10         5         15         -0.95        0.48        -0.46
11         5         13         -0.95       -1.52         1.44
12         5          4         -0.95      -10.52         9.99
13         5         18         -0.95        3.48        -3.31
14         5         12         -0.95       -2.52         2.39
15         5          3         -0.95      -11.52        10.94
16         4         11         -1.95       -3.52         6.86
17         4         15         -1.95        0.48        -0.94
18         4          6         -1.95       -8.52        16.61
19         3         13         -2.95       -1.52         4.48
20         3          4         -2.95      -10.52        31.03
21         3         14         -2.95       -0.52         1.53
Mean      5.95      14.52
SD        2.33       6.69
Sum                                                     222.44
Example

  Cov_{cig&CHD} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{N-1} = \frac{222.44}{21-1} = 11.12

- What the heck is a covariance?
- I thought we were talking about correlation?
Correlation Coefficient
- Pearson’s Product-Moment Correlation
- Symbolized by r
- Covariance ÷ (product of the 2 SDs):

  r = \frac{Cov_{XY}}{s_X s_Y}

- Correlation is a standardized covariance
Calculation for Example
- Cov_XY = 11.12
- s_X = 2.33
- s_Y = 6.69

  r = \frac{Cov_{XY}}{s_X s_Y} = \frac{11.12}{(2.33)(6.69)} = \frac{11.12}{15.59} = .713
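A short sketch (ours) confirming the hand calculation by standardizing the covariance:

```python
import statistics

x = [11, 9, 9, 9, 8, 8, 8, 6, 6, 5, 5, 5, 5, 5, 5, 4, 4, 4, 3, 3, 3]
y = [26, 21, 24, 21, 19, 13, 19, 11, 23, 15, 13, 4, 18, 12, 3, 11, 15, 6, 13, 4, 14]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

# r = covariance divided by the product of the two standard deviations
r = cov_xy / (statistics.stdev(x) * statistics.stdev(y))
print(round(r, 3))  # ~0.713
```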
Example
- Correlation = .713
- Sign is positive. Why?
- If the sign were negative, what would it mean?
  - It would not alter the degree of relationship.
Other Calculations
- Z-score method:

  r = \frac{\sum z_X z_Y}{N-1}

- Computational (raw score) method:

  r = \frac{N\sum XY - \sum X \sum Y}{\sqrt{[N\sum X^2 - (\sum X)^2][N\sum Y^2 - (\sum Y)^2]}}
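Both alternative methods give the same answer as the definitional formula; a quick sketch (ours) checks this on the smoking/CHD data:

```python
import math
import statistics

x = [11, 9, 9, 9, 8, 8, 8, 6, 6, 5, 5, 5, 5, 5, 5, 4, 4, 4, 3, 3, 3]
y = [26, 21, 24, 21, 19, 13, 19, 11, 23, 15, 13, 4, 18, 12, 3, 11, 15, 6, 13, 4, 14]
n = len(x)
mx, sx = statistics.mean(x), statistics.stdev(x)
my, sy = statistics.mean(y), statistics.stdev(y)

# z-score method: r = sum(z_x * z_y) / (N - 1)
zx = [(v - mx) / sx for v in x]
zy = [(v - my) / sy for v in y]
r_z = sum(a * b for a, b in zip(zx, zy)) / (n - 1)

# computational (raw score) method
num = n * sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y)
den = math.sqrt((n * sum(v**2 for v in x) - sum(x)**2) *
                (n * sum(v**2 for v in y) - sum(y)**2))
r_raw = num / den

print(round(r_z, 3), round(r_raw, 3))  # both ~0.713
```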
Other Kinds of Correlation
- Spearman rank-order correlation coefficient (r_sp)
  - Used with 2 ranked/ordinal variables
  - Uses the same Pearson formula

Attractiveness   Symmetry
3                2
4                6
1                1
2                3
5                4
6                5

r_sp = 0.77
Other Kinds of Correlation
- Point-biserial correlation coefficient (r_pb)
  - Used with one continuous scale and one nominal, ordinal, or dichotomous scale
  - Uses the same Pearson formula

Attractiveness   Date?
3                0
4                0
1                1
2                1
5                1
6                0

r_pb = -0.49
Other Kinds of Correlation
- Phi coefficient (φ)
  - Used with two dichotomous scales
  - Uses the same Pearson formula

Attractiveness   Date?
0                0
1                0
1                1
1                1
0                0
1                1

φ = 0.71
Factors Affecting r
- Range restrictions
  - Looking at only a small portion of the total scatterplot (i.e., a smaller portion of the scores’ variability) decreases r.
  - Reducing variability reduces r.
- Nonlinearity
  - The Pearson r (and its relatives) measures the degree of linear relationship between two variables.
  - If a strong nonlinear relationship exists, r will provide a low, or at least inaccurate, measure of the true relationship.
Factors Affecting r
- Heterogeneous subsamples
  - Everyday examples (e.g. height and weight using both men and women)
- Outliers
  - Can either overestimate or underestimate the correlation
Countries With Low Consumption: Data With Restricted Range
[Scatterplot truncated at 5 cigarettes per day: Cigarette Consumption per Adult per Day (x-axis, 2.5-5.5) vs. CHD Mortality per 10,000 (y-axis).]
Truncation
Non-linearity
Heterogeneous Samples
Outliers
[Illustrative scatterplots for each of these four factors.]
Testing Correlations
- So you have a correlation. Now what?
- In terms of magnitude, how big is big?
  - Small correlations in large samples are “big.”
  - Large correlations in small samples aren’t always “big.”
- Depends upon the magnitude of the correlation coefficient AND the size of your sample.
Testing r
- Population parameter = ρ
- Null hypothesis H0: ρ = 0
  - Test of linear independence
  - What would a true null mean here?
  - What would a false null mean here?
- Alternative hypothesis (H1): ρ ≠ 0
  - Two-tailed
Tables of Significance
- We can convert r to t and test for significance:

  t = r\sqrt{\frac{N-2}{1-r^2}}

  where df = N - 2
Tables of Significance
- In our example r was .713
- N - 2 = 21 - 2 = 19

  t = r\sqrt{\frac{N-2}{1-r^2}} = .713\sqrt{\frac{19}{1-.713^2}} = .713\sqrt{\frac{19}{.4916}} = 4.43

- t_crit(19) = 2.09
- Since 4.43 is larger than 2.09, reject ρ = 0.
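A minimal sketch (ours) of the r-to-t conversion:

```python
import math

# t test for r = .713 with N = 21 (df = N - 2 = 19)
r, n = 0.713, 21
t = r * math.sqrt((n - 2) / (1 - r**2))
print(round(t, 2))  # ~4.43, which exceeds t_crit(19) = 2.09: reject rho = 0
```

With a single predictor, t² of this test equals the F for the overall regression model shown later (4.43² ≈ 19.6).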
Computer Printout
- Printout gives the test of significance.

Correlations
                                CIGARET    CHD
CIGARET   Pearson Correlation   1          .713**
          Sig. (2-tailed)       .          .000
          N                     21         21
CHD       Pearson Correlation   .713**     1
          Sig. (2-tailed)       .000       .
          N                     21         21
**. Correlation is significant at the 0.01 level (2-tailed).
Regression

What is regression?
- How do we predict one variable from another?
- How does one variable change as the other changes?
- Influence
Linear Regression
- A technique we use to predict the most likely score on one variable from scores on another variable
- Uses the nature of the relationship (i.e. correlation) between two variables to enhance your prediction
Linear Regression: Parts
- Y: the variable you are predicting (i.e. the dependent variable)
- X: the variable you are using to predict (i.e. the independent variable)
- Ŷ: your predictions (also known as Y′)
Why Do We Care?
- We may want to make a prediction.
- More likely, we want to understand the relationship.
  - How fast does CHD mortality rise with a one-unit increase in smoking?
- Note: we speak about predicting, but often don’t actually predict.
An Example
- Cigarettes and CHD mortality again
- Data repeated on the next slide
- We want to predict the level of CHD mortality in a country averaging 10 cigarettes per day.
The Data

Country   Cigarettes   CHD
1         11           26
2          9           21
3          9           24
4          9           21
5          8           19
6          8           13
7          8           19
8          6           11
9          6           23
10         5           15
11         5           13
12         5            4
13         5           18
14         5           12
15         5            3
16         4           11
17         4           15
18         4            6
19         3           13
20         3            4
21         3           14

Based on the data we have, what would we predict the rate of CHD to be in a country that smoked 10 cigarettes on average? First, we need to establish a prediction of CHD from smoking…
[Scatterplot with regression line: Cigarette Consumption per Adult per Day (x-axis, 2-12) vs. CHD mortality (y-axis, 0-30). For a country that smokes 6 C/A/D, the regression line predicts a CHD rate of about 14.]
Regression Line
- Formula:

  \hat{Y} = bX + a

- Ŷ = the predicted value of Y (e.g. CHD mortality)
- X = the predictor variable (e.g. average cig./adult/country)
Regression Coefficients
- The “coefficients” are a and b
- b = slope
  - Change in predicted Y for a one-unit change in X
- a = intercept
  - Value of Ŷ when X = 0
Calculation
- Slope:

  b = \frac{Cov_{XY}}{s_X^2}  or  b = r\left(\frac{s_Y}{s_X}\right)  or  b = \frac{N\sum XY - \sum X \sum Y}{N\sum X^2 - (\sum X)^2}

- Intercept:

  a = \bar{Y} - b\bar{X}
For Our Data
- Cov_XY = 11.12
- s²_X = 2.33² = 5.447
- b = 11.12/5.447 = 2.042
- a = 14.524 - 2.042 × 5.952 = 2.37
- See the SPSS printout on the next slide.
- Answers are not exact due to rounding error and the desire to match SPSS.
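A sketch (ours) recovering the slope and intercept directly from the data:

```python
import statistics

x = [11, 9, 9, 9, 8, 8, 8, 6, 6, 5, 5, 5, 5, 5, 5, 4, 4, 4, 3, 3, 3]
y = [26, 21, 24, 21, 19, 13, 19, 11, 23, 15, 13, 4, 18, 12, 3, 11, 15, 6, 13, 4, 14]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

b = cov_xy / statistics.variance(x)  # slope = Cov_XY / s_X^2
a = mean_y - b * mean_x              # intercept = Ybar - b * Xbar
print(round(b, 3), round(a, 3))      # ~2.042, ~2.367
```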
SPSS Printout
[SPSS coefficients table; values discussed below.]
Note:
- The values we obtained are shown on the printout.
- The intercept is the value in the B column labeled “constant.”
- The slope is the value in the B column labeled by the name of the predictor variable.
Making a Prediction
- Second, once we know the relationship we can predict:

  \hat{Y} = bX + a = 2.042X + 2.367
  \hat{Y} = 2.042(10) + 2.367 = 22.787

- We predict that about 22.79 people per 10,000 will die of CHD in a country with an average of 10 C/A/D.
Accuracy of Prediction
- Finnish smokers smoke 6 C/A/D
- We predict:

  \hat{Y} = bX + a = 2.042X + 2.367
  \hat{Y} = 2.042(6) + 2.367 = 14.619

- They actually have 23 deaths/10,000
- Our error (“residual”) = 23 - 14.619 = 8.38, a large error
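The same prediction and residual in a few lines of Python (our illustration; `predict` is a hypothetical helper):

```python
# Sketch: prediction and residual from the fitted line Y' = 2.042X + 2.367
b, a = 2.042, 2.367

def predict(cigs):
    return b * cigs + a

print(round(predict(10), 2))  # ~22.79 CHD deaths per 10,000
residual = 23 - predict(6)    # Finland: observed 23, predicted ~14.62
print(round(residual, 2))     # ~8.38, a large error
```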
[Scatterplot with regression line: Cigarette Consumption per Adult per Day (x-axis, 2-12) vs. CHD Mortality per 10,000 (y-axis, 0-30). The height of the line at a given X is the prediction; the vertical distance from a point to the line is the residual.]
Residuals
- When we predict Ŷ for a given X, we will sometimes be in error.
- Y - Ŷ for any X is an error of estimate
  - Also known as: a residual
- We want Σ(Y - Ŷ) to be as small as possible.
  - BUT, there are infinitely many lines that can do this.
  - Just draw ANY line that goes through the means of the X and Y values.
- Minimize errors of estimate… How?
Minimizing Residuals
- Again, the problem lies with this definition of the mean:

  \sum (X - \bar{X}) = 0

- So, how do we get rid of the 0’s?
  - Square them.
Regression Line: A Mathematical Definition
- The regression line is the line which, when drawn through your data set, produces the smallest value of:

  \sum (Y - \hat{Y})^2

- Called the sum of squared residuals, or SS_residual
- The regression line is also called a “least squares line.”
Summarizing Errors of Prediction
- Residual variance
  - The variability of the observations around the predicted values

  s_{Y-\hat{Y}}^2 = \frac{\sum (Y_i - \hat{Y}_i)^2}{N-2} = \frac{SS_{residual}}{N-2}
Standard Error of Estimate
- Standard error of estimate
  - The standard deviation of the errors of prediction (the residuals)

  s_{Y-\hat{Y}} = \sqrt{\frac{\sum (Y_i - \hat{Y}_i)^2}{N-2}} = \sqrt{\frac{SS_{residual}}{N-2}}

- A common measure of the accuracy of our predictions
- We want it to be as small as possible.
Example

Country   X (Cig.)   Y (CHD)   Y′        (Y - Y′)   (Y - Y′)²
1         11         26        24.829     1.171       1.371
2          9         21        20.745     0.255       0.065
3          9         24        20.745     3.255      10.595
4          9         21        20.745     0.255       0.065
5          8         19        18.703     0.297       0.088
6          8         13        18.703    -5.703      32.524
7          8         19        18.703     0.297       0.088
8          6         11        14.619    -3.619      13.097
9          6         23        14.619     8.381      70.241
10         5         15        12.577     2.423       5.871
11         5         13        12.577     0.423       0.179
12         5          4        12.577    -8.577      73.565
13         5         18        12.577     5.423      29.409
14         5         12        12.577    -0.577       0.333
15         5          3        12.577    -9.577      91.719
16         4         11        10.535     0.465       0.216
17         4         15        10.535     4.465      19.936
18         4          6        10.535    -4.535      20.566
19         3         13         8.493     4.507      20.313
20         3          4         8.493    -4.493      20.187
21         3         14         8.493     5.507      30.327
Mean      5.952     14.524
SD        2.334      6.690
Sum                            0.04      440.757

  s_{Y-\hat{Y}}^2 = \frac{\sum (Y_i - \hat{Y}_i)^2}{N-2} = \frac{440.756}{21-2} = 23.198

  s_{Y-\hat{Y}} = \sqrt{\frac{440.756}{21-2}} = \sqrt{23.198} = 4.816
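A sketch (ours) reproducing SS_residual and the standard error of estimate from the fitted line:

```python
import math

x = [11, 9, 9, 9, 8, 8, 8, 6, 6, 5, 5, 5, 5, 5, 5, 4, 4, 4, 3, 3, 3]
y = [26, 21, 24, 21, 19, 13, 19, 11, 23, 15, 13, 4, 18, 12, 3, 11, 15, 6, 13, 4, 14]
b, a = 2.042, 2.367  # slope and intercept from the earlier calculation

residuals = [yi - (b * xi + a) for xi, yi in zip(x, y)]
ss_residual = sum(e**2 for e in residuals)
s_est = math.sqrt(ss_residual / (len(x) - 2))  # standard error of estimate
print(round(ss_residual, 3), round(s_est, 3))  # ~440.757, ~4.816
```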
Regression and Z Scores
- When your data are standardized (linearly transformed to z-scores), the slope of the regression line is called β (beta).
- DO NOT confuse this β with the β associated with Type II errors. They’re different.
- When we have one predictor, r = β.
- Z_Ŷ = βZ_X, since a now equals 0.
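A sketch (ours) verifying that with one predictor the standardized slope equals r:

```python
import statistics

x = [11, 9, 9, 9, 8, 8, 8, 6, 6, 5, 5, 5, 5, 5, 5, 4, 4, 4, 3, 3, 3]
y = [26, 21, 24, 21, 19, 13, 19, 11, 23, 15, 13, 4, 18, 12, 3, 11, 15, 6, 13, 4, 14]

mx, sx = statistics.mean(x), statistics.stdev(x)
my, sy = statistics.mean(y), statistics.stdev(y)
zx = [(v - mx) / sx for v in x]
zy = [(v - my) / sy for v in y]

# Least-squares slope of z_y on z_x; the intercept is 0 because both
# standardized variables have mean 0.
beta = sum(a * b for a, b in zip(zx, zy)) / sum(a * a for a in zx)
print(round(beta, 3))  # ~0.713, i.e. beta = r with one predictor
```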


Partitioning Variability
- Sums of squared deviations
  - Total:

  SS_{total} = \sum (Y - \bar{Y})^2

  - Regression:

  SS_{regression} = \sum (\hat{Y} - \bar{Y})^2

  - Residual (already covered):

  SS_{residual} = \sum (Y - \hat{Y})^2

- SS_total = SS_regression + SS_residual
Partitioning Variability
- Degrees of freedom
  - Total: df_total = N - 1
  - Regression: df_regression = number of predictors
  - Residual: df_residual = df_total - df_regression
- df_total = df_regression + df_residual
Partitioning Variability
- Variance (or mean square)
  - Total variance: s²_total = SS_total / df_total
  - Regression variance: s²_regression = SS_regression / df_regression
  - Residual variance: s²_residual = SS_residual / df_residual
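A sketch (ours) computing all three sums of squares and the mean squares, and checking the partition:

```python
x = [11, 9, 9, 9, 8, 8, 8, 6, 6, 5, 5, 5, 5, 5, 5, 4, 4, 4, 3, 3, 3]
y = [26, 21, 24, 21, 19, 13, 19, 11, 23, 15, 13, 4, 18, 12, 3, 11, 15, 6, 13, 4, 14]
b, a = 2.042, 2.367
n = len(y)
mean_y = sum(y) / n
y_hat = [b * xi + a for xi in x]

ss_total = sum((yi - mean_y)**2 for yi in y)
ss_regression = sum((yh - mean_y)**2 for yh in y_hat)
ss_residual = sum((yi - yh)**2 for yi, yh in zip(y, y_hat))
# ss_regression + ss_residual ~= ss_total (up to rounding in b and a)

ms_regression = ss_regression / 1    # df_regression = 1 predictor
ms_residual = ss_residual / (n - 2)  # df_residual = N - 2 = 19
print(round(ms_regression, 1), round(ms_residual, 3))  # ~454.3, ~23.198
```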
Example

Country   X (Cig.)   Y (CHD)   Y′       (Y - Y′)   (Y - Y′)²   (Y′ - Ȳ)²   (Y - Ȳ)²
1         11         26        24.829    1.171       1.371     106.193     131.699
2          9         21        20.745    0.255       0.065      38.701      41.939
3          9         24        20.745    3.255      10.595      38.701      89.795
4          9         21        20.745    0.255       0.065      38.701      41.939
5          8         19        18.703    0.297       0.088      17.464      20.035
6          8         13        18.703   -5.703      32.524      17.464       2.323
7          8         19        18.703    0.297       0.088      17.464      20.035
8          6         11        14.619   -3.619      13.097       0.009      12.419
9          6         23        14.619    8.381      70.241       0.009      71.843
10         5         15        12.577    2.423       5.871       3.791       0.227
11         5         13        12.577    0.423       0.179       3.791       2.323
12         5          4        12.577   -8.577      73.565       3.791     110.755
13         5         18        12.577    5.423      29.409       3.791      12.083
14         5         12        12.577   -0.577       0.333       3.791       6.371
15         5          3        12.577   -9.577      91.719       3.791     132.803
16         4         11        10.535    0.465       0.216      15.912      12.419
17         4         15        10.535    4.465      19.936      15.912       0.227
18         4          6        10.535   -4.535      20.566      15.912      72.659
19         3         13         8.493    4.507      20.313      36.373       2.323
20         3          4         8.493   -4.493      20.187      36.373     110.755
21         3         14         8.493    5.507      30.327      36.373       0.275
Mean      5.952     14.524
SD        2.334      6.690
Sum                            0.04     440.757     454.307     895.247

Y′ = (2.04 × X) + 2.37
Example

  SS_{total} = \sum (Y - \bar{Y})^2 = 895.247;  df_{total} = 21 - 1 = 20
  SS_{regression} = \sum (\hat{Y} - \bar{Y})^2 = 454.307;  df_{regression} = 1 (only 1 predictor)
  SS_{residual} = \sum (Y - \hat{Y})^2 = 440.757;  df_{residual} = 20 - 1 = 19

  s_{total}^2 = \frac{\sum (Y - \bar{Y})^2}{N-1} = \frac{895.247}{20} = 44.762
  s_{regression}^2 = \frac{\sum (\hat{Y} - \bar{Y})^2}{1} = \frac{454.307}{1} = 454.307
  s_{residual}^2 = \frac{\sum (Y - \hat{Y})^2}{N-2} = \frac{440.757}{19} = 23.198

Note: s_{residual}^2 = s_{Y-\hat{Y}}^2
Coefficient of Determination
- A measure of the percent of predictable variability:

  r^2 = the correlation squared, or  r^2 = \frac{SS_{regression}}{SS_Y}

- The percentage of the total variability in Y explained by X
r² for Our Example
- r = .713
- r² = .713² = .508

  r^2 = \frac{SS_{regression}}{SS_Y} = \frac{454.307}{895.247} = .507

- Approximately 50% of the variability in the incidence of CHD mortality is associated with variability in smoking.
Coefficient of Alienation
- Defined as 1 - r², or:

  1 - r^2 = \frac{SS_{residual}}{SS_Y}

- Example: 1 - .508 = .492

  1 - r^2 = \frac{SS_{residual}}{SS_Y} = \frac{440.757}{895.247} = .492
r², SS, and s_{Y-Y′}
- r² × SS_total = SS_regression
- (1 - r²) × SS_total = SS_residual
- We can also use r² to calculate the standard error of estimate as:

  s_{Y-\hat{Y}} = s_Y\sqrt{(1-r^2)\left(\frac{N-1}{N-2}\right)} = 6.690\sqrt{(.492)\left(\frac{20}{19}\right)} = 4.816
Testing Overall Model
- We can test the overall prediction of the model by forming the ratio:

  F = \frac{s_{regression}^2}{s_{residual}^2}

- If the calculated F value is larger than the tabled value (F-table), we have a significant prediction.
Testing Overall Model
- Example:

  F = \frac{s_{regression}^2}{s_{residual}^2} = \frac{454.307}{23.198} = 19.58

- F-critical is found using two things: df_regression (numerator) and df_residual (denominator).
- From the F-table, F_crit(1, 19) = 4.38.
- 19.58 > 4.38: the overall prediction is significant.
- Should all sound familiar…
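A minimal sketch of the F ratio (assuming SciPy is available for the p value):

```python
from scipy.stats import f  # F distribution, for the p value

ms_regression, ms_residual = 454.307, 23.198
F = ms_regression / ms_residual
p = f.sf(F, 1, 19)  # dfn = df_regression, dfd = df_residual
print(round(F, 2), round(p, 4))  # ~19.58, p ~ .0003 (SPSS prints .000)
```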
SPSS Output

Model Summary
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .713a   .508       .482                4.81640
a. Predictors: (Constant), CIGARETT

ANOVA(b)
Model            Sum of Squares   df   Mean Square   F        Sig.
1   Regression   454.482          1    454.482       19.592   .000a
    Residual     440.757          19   23.198
    Total        895.238          20
a. Predictors: (Constant), CIGARETT
b. Dependent Variable: CHD
Testing Slope and Intercept
- The regression coefficients can be tested for significance.
- Each coefficient divided by its standard error equals a t value that can also be looked up in a t-table.
- Each coefficient is tested against 0.
Testing the Slope
- With only 1 predictor, the standard error for the slope is:

  se_b = \frac{s_{Y-\hat{Y}}}{s_X\sqrt{N-1}}

- For our example:

  se_b = \frac{4.816}{2.334\sqrt{21-1}} = \frac{4.816}{10.438} = .461
Testing Slope and Intercept
- These are given in the computer printout as a t test.
Testing
- The t values in the second column from the right are tests on the slope and intercept.
- The associated p values are next to them.
- The slope is significantly different from zero, but the intercept is not.
- Why do we care?
Testing
- What does it mean if the slope is not significant?
  - How does that relate to the test on r?
- What if the intercept is not significant?
- Does a significant slope mean we predict quite well?
