Regression PDF
In this section we investigate the relationship between two continuous variables, such
as height and weight, the concentration of an injected drug and heart rate, or the consumption
level of some nutrient and weight gain.
The tools used to explore this relationship are regression and correlation analysis.
They can be used to find out whether the value of one variable depends on the value of
the other, and to describe the nature and strength of the
relationship between the two variables.
1.1 Scatterplot
The first step in the investigation of the relationship between two continuous variables is a
scatterplot!
Create a scatterplot for the two variables and evaluate the quality of the relationship.
Example:
Does the number of years invested in schooling pay off in the job market?
Apparently so – the better educated you are, the more money you will earn. The data in
the following table give the median annual income of full-time workers age 25 or older by the
number of years of schooling completed.
The scatterplot shows a strong, positive, linear association between years and salary.
1. Does a relationship exist that can be described by a straight line (that is, is there
a linear relationship)?
3. If the scatterplot of the variables looks like a cloud, there is no relationship between the two
variables and one would stop at this point.
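A plotting library would normally be used for this step. As a dependency-free sketch, the helper `text_scatter` below (a hypothetical name, not part of these notes) draws a rough character-grid scatterplot. The values are the years of schooling and salary pairs from the example, and the helper assumes that neither variable is constant.

```python
def text_scatter(xs, ys, width=40, height=12):
    """Return a rough character-grid scatterplot of the (x, y) pairs."""
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    grid = [[" "] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        # Scale each point into grid coordinates (column, row).
        col = round((x - x_min) / (x_max - x_min) * (width - 1))
        row = round((y - y_min) / (y_max - y_min) * (height - 1))
        # The first grid row is printed at the top, so flip vertically.
        grid[height - 1 - row][col] = "*"
    return "\n".join("".join(line).rstrip() for line in grid)


years = [8, 10, 12, 14, 16, 19]
salary = [18000, 20500, 25000, 28100, 34500, 39700]
print(text_scatter(years, salary))
```

The points should climb from the lower left to the upper right, which is what a positive linear association looks like.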
1.2 Correlation
If the scatterplot shows a reasonable linear relationship (straight line) calculate Pearson’s
correlation coefficient to evaluate the strength of the linear relationship.
Notation:
Let \( (x_1, y_1), (x_2, y_2), \dots, (x_n, y_n) \) denote a sample of \( (x, y) \) pairs.
Definition:
Given the following sums of squares
\[ S_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n} \]
\[ S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n} \]
\[ S_{yy} = \sum y^2 - \frac{(\sum y)^2}{n} \]
Pearson’s Correlation Coefficient can be calculated as:
\[ r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} \]
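The definition translates directly into code. The following is a minimal sketch; the function names `sums_of_squares` and `pearson_r` and the small data sets are made up for illustration, and the code assumes neither variable is constant (otherwise the denominator is zero).

```python
from math import sqrt


def sums_of_squares(xs, ys):
    """Return (Sxy, Sxx, Syy) using the computational formulas."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    s_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    s_yy = sum(y * y for y in ys) - sum_y ** 2 / n
    return s_xy, s_xx, s_yy


def pearson_r(xs, ys):
    """Pearson's correlation coefficient r = Sxy / sqrt(Sxx * Syy)."""
    s_xy, s_xx, s_yy = sums_of_squares(xs, ys)
    return s_xy / sqrt(s_xx * s_yy)


# Made-up illustration data: points lying exactly on a rising or falling line.
r_up = pearson_r([1, 2, 3, 4], [3, 5, 7, 9])
r_down = pearson_r([1, 2, 3, 4], [9, 7, 5, 3])
print(r_up, r_down)
```

Points that lie exactly on a rising line give a value at the upper end of the range, and points on a falling line give a value at the lower end.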
Pearson’s correlation coefficient (named after Karl Pearson, 1857-1936) is a number between -1
and 1 that measures the strength of a linear relationship between two continuous variables.
The absolute value of the coefficient measures how closely the variables are related: the
closer it is to 1, the closer the relationship. An absolute value over 0.8 indicates a strong
correlation between the variables.
The sign of the correlation coefficient indicates the direction of the relationship. A positive
(negative) coefficient means that one variable increases (decreases) when the other increases.
Continue Example:
Calculate Pearson’s correlation coefficient for years and salary. First find x̄ = 13.17, sx = 4.02
and ȳ = 27633, sy = 8290.
xi = Years of Schooling   yi = Salary (dollars)   xi · yi    xi^2    yi^2
 8                        18,000                  144,000      64      324,000,000
10                        20,500                  205,000     100      420,250,000
12                        25,000                  300,000     144      625,000,000
14                        28,100                  393,400     196      789,610,000
16                        34,500                  552,000     256    1,190,250,000
19                        39,700                  754,300     361    1,576,090,000
So that \( \sum x_i = 79 \), \( \sum y_i = 165800 \), \( \sum x_i y_i = 2348700 \), \( \sum x_i^2 = 1121 \), \( \sum y_i^2 = 4{,}925{,}200{,}000 \).
This leads to
\[ S_{xy} = \sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n} = 2348700 - \frac{(79)(165800)}{6} = 165666.65 \]
\[ S_{xx} = \sum x_i^2 - \frac{(\sum x_i)^2}{n} = 1121 - \frac{(79)^2}{6} = 80.8333 \]
\[ S_{yy} = \sum y_i^2 - \frac{(\sum y_i)^2}{n} = 4{,}925{,}200{,}000 - \frac{(165800)^2}{6} = 343593333.333 \]
So that
\[ r = \frac{S_{xy}}{\sqrt{S_{xx} \cdot S_{yy}}} = \frac{165666.65}{\sqrt{80.8333 \cdot 343593333.333}} = 0.994. \]
The Pearson correlation coefficient of years of schooling and salary is r = 0.994.
A correlation of 0.994 is very high and shows a strong, positive, linear association between
years of schooling and salary.
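The hand calculation can be checked with a short script. This is a sketch that recomputes the column totals and the sums of squares from the six data pairs in the table; the variable names are illustrative.

```python
from math import sqrt

years = [8, 10, 12, 14, 16, 19]
salary = [18000, 20500, 25000, 28100, 34500, 39700]
n = len(years)

# Column totals from the table.
sum_x = sum(years)
sum_y = sum(salary)
sum_xy = sum(x * y for x, y in zip(years, salary))
sum_x2 = sum(x * x for x in years)
sum_y2 = sum(y * y for y in salary)

# Sums of squares and the correlation coefficient.
s_xy = sum_xy - sum_x * sum_y / n
s_xx = sum_x2 - sum_x ** 2 / n
s_yy = sum_y2 - sum_y ** 2 / n
r = s_xy / sqrt(s_xx * s_yy)
print(f"Sxy={s_xy:.2f}  Sxx={s_xx:.4f}  Syy={s_yy:.3f}  r={r:.3f}")
```

Because the script carries full floating-point precision, the last printed digits can differ slightly from a hand calculation that rounds along the way.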
A straight line relating y to x is described by the equation
\[ y = a + bx \]
where a is called the intercept and b the slope of the equation.
The slope is the amount by which y changes when x increases by 1 unit.
Given data points (xi , yi ), a and b are chosen so that the corresponding straight
line has the “best fit” for the given data.
The criterion for “best fit” used in regression analysis is the sum of the squared vertical differences
between the data points and the line, that is, the y deviations. The best line makes this sum as small as possible.
For data points \( (x_i, y_i) \), \( 1 \le i \le n \), this can be written as
\[ \min_{a,b} \sum_{i=1}^{n} (y_i - (a + b x_i))^2 \]
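Once the problem is stated this way it can be solved mathematically, and the result is a pair of formulas
for the best parameters:
\[ b = \frac{S_{xy}}{S_{xx}} \quad \text{and} \quad a = \bar{y} - b \cdot \bar{x}. \]
The resulting least squares line is
\[ \hat{y} = a + bx \]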
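As a sketch of where the formulas for a and b come from (a standard calculus argument that the notes do not spell out), write Q(a, b) for the sum of squares to be minimised and set both partial derivatives to zero:

```latex
\begin{align*}
Q(a,b) &= \sum_{i=1}^{n} (y_i - a - b x_i)^2 \\
\frac{\partial Q}{\partial a} &= -2 \sum_{i=1}^{n} (y_i - a - b x_i) = 0
  \;\Longrightarrow\; a = \bar{y} - b\,\bar{x} \\
\frac{\partial Q}{\partial b} &= -2 \sum_{i=1}^{n} x_i (y_i - a - b x_i) = 0
  \;\Longrightarrow\; \sum x_i y_i = a \sum x_i + b \sum x_i^2 \\
\sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n}
  &= b \left( \sum x_i^2 - \frac{(\sum x_i)^2}{n} \right)
  \;\Longrightarrow\; b = \frac{S_{xy}}{S_{xx}}
\end{align*}
```

The last line is obtained by substituting a = ȳ − b x̄ into the second condition and using x̄ = (Σ xi)/n and ȳ = (Σ yi)/n.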
Continue Example:
Since the salary and the years of schooling show such a strong linear relationship and the salary
can be viewed as depending on the years of schooling, do a linear regression analysis with the
salary as the response variable and the years of schooling as the predictor variable.
Calculate
\[ b = \frac{S_{xy}}{S_{xx}} = \frac{165666.65}{80.8333} = 2050.28 \quad \text{and} \quad a = \bar{y} - b\bar{x} = 27633 - 2050.28 \cdot 13.17 = 630.81 \]
Our result is the least squares line
\[ \hat{y} = a + bx = 630.81 + 2050.28\,x \]
The slope equals $2050.28; that is, for every additional year of schooling the average salary
increases by this amount.
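The fit can be reproduced in a few lines. This is a minimal sketch; the function name `least_squares_line` is an illustrative choice rather than anything defined in these notes.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the least squares line y-hat = a + b*x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    s_xx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    b = s_xy / s_xx          # slope
    a = y_bar - b * x_bar    # intercept
    return a, b


years = [8, 10, 12, 14, 16, 19]
salary = [18000, 20500, 25000, 28100, 34500, 39700]
a, b = least_squares_line(years, salary)
print(f"a = {a:.2f}, b = {b:.2f}")
```

Carried at full precision, the formulas give b ≈ 2049.48 and a ≈ 648.45, which differ slightly from the figures quoted in the example.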
To estimate the average salary after 18 years of schooling we calculate ŷ with x = 18:
\[ \hat{y}(18) = 630.81 + 2050.28 \cdot 18 = 37535.85 \]
1. Don’t use the regression line for values outside the range of the observed values. The
model has only been shown to be valid for the given range.
2. The regression line of y on x should not be used to predict x, since it is not the line that
minimizes the sum of squared x deviations.
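The second point can be illustrated numerically. In the sketch below, the helper `slope` (a hypothetical name) fits the line in both directions for the schooling data; solving the x-on-y line for salary implies a different change in dollars per year than the y-on-x slope.

```python
def slope(us, vs):
    """Least squares slope for regressing vs on us: Suv / Suu."""
    n = len(us)
    s_uv = sum(u * v for u, v in zip(us, vs)) - sum(us) * sum(vs) / n
    s_uu = sum(u * u for u in us) - sum(us) ** 2 / n
    return s_uv / s_uu


years = [8, 10, 12, 14, 16, 19]
salary = [18000, 20500, 25000, 28100, 34500, 39700]

b_y_on_x = slope(years, salary)   # dollars per year of schooling
b_x_on_y = slope(salary, years)   # years of schooling per dollar

# Inverting the x-on-y slope gives a different "dollars per year" figure
# than the y-on-x slope, so the two lines do not coincide.
print(f"y on x slope:          {b_y_on_x:.2f}")
print(f"inverted x on y slope: {1 / b_x_on_y:.2f}")
```

The two figures agree only when the points lie exactly on a line. In general the product of the two slopes equals r², so the gap widens as the correlation weakens.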
In the example, the variable Years of Schooling explains r2 = 98.8% of the variation in the
variable Salary, which is very high. The plot showed that the data points lie almost on a
straight line.
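One way to see what "explained variation" means is to compare the total variation in salary with the variation left over around the fitted line. The sketch below assumes the usual definition of r² as one minus the ratio of residual to total sum of squares; the names `ss_total` and `ss_resid` are illustrative. It uses the deviation form of the sums of squares, which is algebraically the same as the computational form.

```python
years = [8, 10, 12, 14, 16, 19]
salary = [18000, 20500, 25000, 28100, 34500, 39700]
n = len(years)
x_bar, y_bar = sum(years) / n, sum(salary) / n

# Least squares coefficients.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(years, salary))
s_xx = sum((x - x_bar) ** 2 for x in years)
b = s_xy / s_xx
a = y_bar - b * x_bar

# Total variation in salary and the part left unexplained by the line.
ss_total = sum((y - y_bar) ** 2 for y in salary)
ss_resid = sum((y - (a + b * x)) ** 2 for x, y in zip(years, salary))
r_squared = 1 - ss_resid / ss_total
print(f"r^2 = {r_squared:.3f}")
```

The result agrees with squaring the correlation coefficient from the example.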
Use the least squares line for predicting the annual salary of a person with 13 years of schooling.
ŷ(13) = a + b · 13 = 630.81 + 2050.28 · 13 = $27284.45
This is only a point estimate. From the other parts of the class we know that a confidence interval
can be found that gives more information.
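The prediction step, together with the warning against extrapolation, can be sketched as a small function. The name `predict` and the choice to raise an error outside the observed range are illustrative assumptions. The coefficients and the range of 8 to 19 years are taken from the worked example.

```python
def predict(x, a, b, x_min, x_max):
    """Evaluate y-hat = a + b*x, refusing to extrapolate beyond the data."""
    if not x_min <= x <= x_max:
        raise ValueError(f"x = {x} lies outside the observed range [{x_min}, {x_max}]")
    return a + b * x


# Coefficients and observed range taken from the worked example.
a, b = 630.81, 2050.28
y_hat_13 = predict(13, a, b, x_min=8, x_max=19)
print(f"{y_hat_13:.2f}")
```

Calling `predict(25, a, b, x_min=8, x_max=19)` raises a `ValueError`, because 25 years lies beyond the observed data.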