Regression PDF


1 Correlation and Regression Analysis

In this section we investigate the relationship between two continuous variables, such
as height and weight, the concentration of an injected drug and heart rate, or the consumption
level of some nutrient and weight gain.
The tools used to explore this relationship are regression and correlation analysis.
These tools can be used to find out whether the outcome of one variable depends on the value of
the other variable.
Regression and correlation analysis can also be used to describe the nature and strength of the
relationship between two continuous variables.

1.1 Scatterplot
The first step in the investigation of the relationship between two continuous variables is a
scatterplot!
Create a scatterplot for the two variables and evaluate the quality of the relationship.
Example:
Does the number of years invested in schooling pay off in the job market?
Apparently so – the better educated you are, the more money you will earn. The data in
the following table give the median annual income of full-time workers age 25 or older by the
number of years of schooling completed.

x = Years of Schooling    y = Salary (dollars)
8                         18,000
10                        20,500
12                        25,000
14                        28,100
16                        34,500
19                        39,700
Start off by creating a scatterplot for X and Y.

The scatterplot shows a strong, positive, linear association between years and salary.

Questions to be answered with the help of the scatterplot:

1. Does a relationship exist that can be described by a straight line (which means is there
a linear relationship)?

2. Is there a relationship that is not linear?

3. Does the scatterplot look like a shapeless cloud? If so, there is no relationship between the
two variables and the analysis stops at this point.

1.2 Correlation

If the scatterplot shows a reasonably linear relationship (straight line), calculate Pearson’s
correlation coefficient to evaluate the strength of the linear relationship.

Notation:
Let \((x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\) denote a sample of (x, y) pairs.

Definition:
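Given the following sums of squares

\[ S_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n} \]

\[ S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n} \]

\[ S_{yy} = \sum y^2 - \frac{(\sum y)^2}{n} \]

Pearson’s Correlation Coefficient can be calculated as:

\[ r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} \]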

Pearson’s correlation coefficient (named after Karl Pearson, 1857-1936) is a number between -1
and 1 that measures the strength of a linear relationship between two continuous variables.
The absolute value of the coefficient measures how closely the variables are related: the
closer it is to 1, the closer the relationship. As a rule of thumb, a correlation coefficient above
0.8 in absolute value indicates a strong correlation between the variables.
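
The definition can be checked numerically. The following is a minimal Python sketch, not part of the original notes; the function names sums_of_squares and pearson_r are chosen here for illustration, and the data are the years of schooling and salaries from the example table.

```python
from math import sqrt

def sums_of_squares(xs, ys):
    """Return (Sxy, Sxx, Syy) using the computational formulas above."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    s_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    s_yy = sum(y * y for y in ys) - sum_y ** 2 / n
    return s_xy, s_xx, s_yy

def pearson_r(xs, ys):
    """Pearson's correlation coefficient r = Sxy / sqrt(Sxx * Syy)."""
    s_xy, s_xx, s_yy = sums_of_squares(xs, ys)
    return s_xy / sqrt(s_xx * s_yy)

# Years of schooling and median salary from the example table.
years = [8, 10, 12, 14, 16, 19]
salary = [18000, 20500, 25000, 28100, 34500, 39700]

print(round(pearson_r(years, salary), 3))  # 0.994
```

Data lying exactly on a rising line give r = 1, and data lying exactly on a falling line give r = -1.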

Data patterns and Pearson’s Correlation Coefficient

The sign of the correlation coefficient tells you the direction of the relationship. A positive
coefficient means that one variable increases when the other increases; a negative coefficient
means that one variable decreases when the other increases.

Continue Example:
Calculate Pearson’s correlation coefficient for years and salary. First find x̄ = 13.17, sx = 4.02
and ȳ = 27633, sy = 8290.
xi = Years of Schooling    yi = Salary (dollars)    xi · yi    xi²    yi²
8                          18,000                   144,000    64     324,000,000
10                         20,500                   205,000    100    420,250,000
12                         25,000                   300,000    144    625,000,000
14                         28,100                   393,400    196    789,610,000
16                         34,500                   552,000    256    1,190,250,000
19                         39,700                   754,300    361    1,576,090,000
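So that \(\sum x_i = 79\), \(\sum y_i = 165800\), \(\sum x_i y_i = 2348700\), \(\sum x_i^2 = 1121\), \(\sum y_i^2 = 4{,}925{,}200{,}000\).
This leads to

\[ S_{xy} = \sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n} = 2348700 - \frac{(79)(165800)}{6} = 165666.67 \]

\[ S_{xx} = \sum x_i^2 - \frac{(\sum x_i)^2}{n} = 1121 - \frac{(79)^2}{6} = 80.8333 \]

\[ S_{yy} = \sum y_i^2 - \frac{(\sum y_i)^2}{n} = 4{,}925{,}200{,}000 - \frac{(165800)^2}{6} = 343593333.333 \]

So that

\[ r = \frac{S_{xy}}{\sqrt{S_{xx} \cdot S_{yy}}} = \frac{165666.67}{\sqrt{80.8333 \cdot 343593333.333}} = 0.994. \]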

The Pearson correlation coefficient of years of schooling and salary is r = 0.994.

A correlation of 0.994 is very high and shows a strong, positive, linear association between
years of schooling and salary.

1.3 Linear Regression


In the example we might want to predict the expected salary for different numbers of years of
schooling, or calculate the increase in salary for every additional year of schooling. For this
purpose we can do a regression analysis.

Terms and Definition: If we want to use a variable x to draw conclusions concerning a variable y:

y is called the dependent or response variable.

x is called the independent, predictor, or explanatory variable.


If the relationship between two variables is linear, it can be summarized by a straight line. A
straight line can be described by an equation:

y = a + bx
a is called the intercept and b the slope of the equation.
The slope is the amount by which y increases when x increases by 1 unit.

Fitting a straight line

Given data points \((x_i, y_i)\), a and b shall now be chosen in such a way that the corresponding
straight line has the “best fit” for the given data.
The criterion for “best fit” used in regression analysis is the sum of the squared vertical
differences between the data points and the line itself, that is, the y deviations.
For data points \((x_i, y_i)\), 1 ≤ i ≤ n, this can be written as

\[ \min_{a,b} \sum_{i=1}^{n} \bigl(y_i - (a + b x_i)\bigr)^2 \]

In words: minimize the sum by choosing the appropriate parameters a and b.


The resulting line is called the least squares line or sample regression line.

Once the problem is stated, it can be solved mathematically. The results are formulas for
calculating the best parameters:

\[ b = \frac{S_{xy}}{S_{xx}} \quad \text{and} \quad a = \bar{y} - b \cdot \bar{x}. \]
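
The notes do not spell out how these formulas follow from the minimization problem. One standard route, sketched here under the assumption that the x values are not all equal (so that Sxx > 0), is to set the partial derivatives of the sum of squares to zero. The symbol Q is introduced only for this sketch.

```latex
% Q(a, b) is the sum of squared y deviations to be minimized.
Q(a,b) = \sum_{i=1}^{n} \bigl(y_i - (a + b x_i)\bigr)^2

% Derivative with respect to the intercept a:
\frac{\partial Q}{\partial a} = -2 \sum_{i=1}^{n} (y_i - a - b x_i) = 0
\;\Longrightarrow\; a = \bar{y} - b\,\bar{x}

% Derivative with respect to the slope b:
\frac{\partial Q}{\partial b} = -2 \sum_{i=1}^{n} x_i (y_i - a - b x_i) = 0
\;\Longrightarrow\; \sum x_i y_i - a \sum x_i - b \sum x_i^2 = 0

% Substitute a = \bar{y} - b\bar{x} and use \sum x_i = n\bar{x}:
\sum x_i y_i - n\bar{x}\bar{y} = b \Bigl(\sum x_i^2 - n\bar{x}^2\Bigr)
\;\Longrightarrow\; S_{xy} = b\, S_{xx}
\;\Longrightarrow\; b = \frac{S_{xy}}{S_{xx}}
```

The last step uses that the computational formulas for Sxy and Sxx can be rewritten with the means, since (Σx)(Σy)/n = n x̄ ȳ and (Σx)²/n = n x̄².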

Write the equation of the least squares line as

ŷ = a + bx

ŷ gives an estimate for y for a given value of x.
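
The formulas for a and b translate directly into a few lines of code. This is a minimal Python sketch with hypothetical function names (least_squares_line, predict) and a small made-up data set whose points lie exactly on y = 1 + 2x, so the expected result is easy to verify by hand.

```python
def least_squares_line(xs, ys):
    """Return (a, b) of the least squares line y-hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    s_xx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    b = s_xy / s_xx          # slope
    a = y_bar - b * x_bar    # intercept
    return a, b

def predict(a, b, x):
    """Estimate y for a given x from the fitted line."""
    return a + b * x

# Made-up illustration data lying exactly on y = 1 + 2x.
xs = [1, 2, 3]
ys = [3, 5, 7]

a, b = least_squares_line(xs, ys)
print(a, b)               # 1.0 2.0
print(predict(a, b, 4))   # 9.0
```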

Continue Example:
Since the salary and the years of schooling show such a strong linear relationship and the salary
can be viewed as depending on the years of schooling, do a linear regression analysis with the
salary as the response variable and the years of schooling as the predictor variable.
Calculate

\[ b = \frac{S_{xy}}{S_{xx}} = \frac{165666.67}{80.8333} = 2049.48 \quad \text{and} \quad a = \bar{y} - b\bar{x} = 648.45 \]

where a is computed from the unrounded values of ȳ, b and x̄ (rounding them first shifts the
result slightly).
Our result is the least squares line

ŷ = a + bx = 648.45 + 2049.48 x
The slope equals $2049.48, that is, for every additional year of schooling the average salary
increases by this amount.
To estimate the average salary after 18 years of schooling we calculate ŷ with x = 18

ŷ = 648.45 + 2049.48 · 18 ≈ $37,539

Don’t use the regression line for values outside the range of the observed values. The
model has only been shown to be valid for the given range.
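
The warning about extrapolation can be built into a prediction helper. The following Python sketch is one possible way to do this, not a method from the notes; the names fit_line and predict_in_range are hypothetical, and the data are the example's years of schooling and salaries.

```python
years = [8, 10, 12, 14, 16, 19]
salary = [18000, 20500, 25000, 28100, 34500, 39700]

def fit_line(xs, ys):
    """Least squares intercept a and slope b for y-hat = a + b*x."""
    n = len(xs)
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    s_xx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    b = s_xy / s_xx
    a = sum(ys) / n - b * sum(xs) / n
    return a, b

def predict_in_range(x, xs, a, b):
    """Return a + b*x, refusing x values outside the observed range of xs."""
    lo, hi = min(xs), max(xs)
    if not lo <= x <= hi:
        raise ValueError(f"x={x} is outside the observed range [{lo}, {hi}]")
    return a + b * x

a, b = fit_line(years, salary)
print(predict_in_range(18, years, a, b))   # inside 8..19: returns an estimate

try:
    predict_in_range(25, years, a, b)      # outside 8..19: rejected
except ValueError as err:
    print(err)
```

Raising an error is a deliberately strict choice; a softer variant could return the value together with a warning.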

Properties of the regression or least squares line


1. The least squares line always passes through the balance point (x̄, ȳ) of the data set.

2. The regression line of y on x should not be used to predict x, since it is not the line that
minimizes the sum of squared x deviations.
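
Both properties can be checked numerically on the example data. In this Python sketch (the helper name fit_line is chosen here for illustration), the line of salary on years is fitted first, and then the roles of x and y are swapped to fit years on salary.

```python
years = [8, 10, 12, 14, 16, 19]
salary = [18000, 20500, 25000, 28100, 34500, 39700]

def fit_line(xs, ys):
    """Least squares intercept and slope for regressing ys on xs."""
    n = len(xs)
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    s_xx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    slope = s_xy / s_xx
    intercept = sum(ys) / n - slope * sum(xs) / n
    return intercept, slope

a, b = fit_line(years, salary)            # salary regressed on years
x_bar = sum(years) / len(years)
y_bar = sum(salary) / len(salary)

# Property 1: the fitted value at x-bar equals y-bar (up to rounding error).
print(abs((a + b * x_bar) - y_bar) < 1e-6)   # True

# Property 2: regressing years on salary gives a different line than
# solving the salary-on-years line for x.
c, d = fit_line(salary, years)            # years = c + d * salary
print(1 / b)   # slope obtained by inverting the y-on-x line
print(d)       # slope of the x-on-y least squares line
```

The two printed slopes differ. They would coincide only if all points lay exactly on a line, since their ratio b · d equals r².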

Assessing the fit of a line


Once the least squares line has been obtained, it is natural to examine how effectively the line
summarizes the relationship between x and y.
The first question that has to be answered is whether a line is an appropriate way to summarize
the relationship. In order to answer this question, we calculate the coefficient of determination
r².

Definition: The coefficient of determination for the regression of y on x is

\[ r^2 = \frac{S_{xy}^2}{S_{xx} S_{yy}}, \]

the square of Pearson’s Correlation Coefficient.
It gives the proportion of variation in y that can be attributed to a linear relationship between
x and y.
As a rule of thumb, if r² is greater than 0.8, the model has a good fit and can be used to
calculate reliable predictions of the dependent variable from the independent variable.

In the example, the variable Years of Schooling explains r² = 98.8% of the variation in the
variable Salary, which is very high. The plot showed that the data points lie almost on a
straight line.
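
The reading of r² as a proportion of explained variation can be made concrete by comparing the total variation in y with the variation left over around the fitted line. The sketch below is a Python illustration under that interpretation; the function name r_squared and the names sst and sse are assumptions of this example, not notation from the notes.

```python
years = [8, 10, 12, 14, 16, 19]
salary = [18000, 20500, 25000, 28100, 34500, 39700]

def r_squared(xs, ys):
    """Proportion of the variation in ys explained by the least squares line."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    s_xx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    # Total variation of y around its mean (this equals Syy).
    sst = sum((y - y_bar) ** 2 for y in ys)
    # Variation of y left unexplained by the fitted line.
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return 1 - sse / sst

print(round(r_squared(years, salary), 3))  # 0.988
```

For a least squares line this quantity agrees with the squared correlation coefficient from the definition above.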

Use the least squares line for predicting the annual salary of a person with 13 years of schooling.

ŷ(13) = a + b · 13 = 648.45 + 2049.48 · 13 ≈ $27,292
This is just a point estimate. From the other parts of the class, we know that a confidence
interval can be found that gives more information.
