Introduction to Regression
Recall: Correlation
With correlation, our aim is to see whether there is any relationship
between x and y.
How about taking a step further?
Can we use one variable to predict another?
The Aim of Regression
Why regression?
ANOVA, z- and t-tests, and chi-square tests allow us to compare means
across multiple groups (the levels of a categorical predictor).
Limitations:
The biggest limitation is that we cannot include continuous predictors (Age, Weight,
Income, Time, etc.).
ANOVA does not build a predictive model; it tells us where differences exist but does
not quantify those differences.
Regression Line
(Un)Standardized Betas
𝜷 is reserved for standardized predictors, while b is used for unstandardized ones.
b values are what we get by default when we use our variables in their raw state. They tell us the unit
change in our dependent variable (in its raw units) for a unit change in our predictor variable
(in its raw units).
Reference: https://www3.nd.edu/~rwilliam/stats1/x92.pdf
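As a rough sketch of the distinction (assuming a hypothetical data frame data with outcome y and predictor x), standardized betas can be obtained by z-scoring both variables with scale() before fitting:
# Unstandardized b: change in y (raw units) per one-unit change in x
m_raw <- lm(y ~ x, data = data)
coef(m_raw)
# Standardized beta: change in y (in SDs) per one-SD change in x
m_std <- lm(scale(y) ~ scale(x), data = data)
coef(m_std)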
We try to identify the values of b0 and b1 that produce the best-fitting linear
function.
So, we use the observed data to find the values of b0 and b1 that
minimize the distances between the observed values (Y) and the predicted values (Ŷ).
In general, we choose the regression coefficient (b1) and the regression intercept
(b0) that define the regression line minimizing the sum of squared residuals:
min Σ[Y − (b0 + b1X)]² = min Σ(Y − Ŷ)²
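A minimal simulated sketch (all names hypothetical) showing that lm() returns exactly the intercept and slope that minimize this sum of squared residuals:
# Simulated data so the example is self-contained
set.seed(1)
x <- rnorm(50)
y <- 2 + 3 * x + rnorm(50)      # true intercept 2, true slope 3
fit <- lm(y ~ x)                # least-squares estimates of b0 and b1
sum(resid(fit)^2)               # the minimized sum of squared residuals
sum((y - (1.5 + 3.5 * x))^2)    # any other line gives a larger value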
Regression Coefficients
Overall Equation:
Ŷ = b0 + b1X
Things to Check
Second, we want to know whether our parameters (b0, b1, b2, etc.) are significant.
A bad predictor means that a unit change in the predictor results in no change in the
predicted value of the outcome.
T-test
The null hypothesis of the t-test is that b is zero.
For each parameter: t = b / SE(b)
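Continuing the simulated sketch above, the t value that R reports for each parameter is just the estimate divided by its standard error:
out <- summary(fit)$coefficients          # Estimate, Std. Error, t value, Pr(>|t|)
out[, "t value"]                          # as reported by R
out[, "Estimate"] / out[, "Std. Error"]   # the same values computed by hand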
R-square in Regression
How do we know if the model is good in terms of how much variability we can
explain?
R-square
The value of R-square will be between 0 and 1. Why?
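One way to see why, continuing the simulated sketch above: R-square is the proportion of the total variation in Y that the model explains.
ss_res <- sum(resid(fit)^2)      # unexplained (residual) variation
ss_tot <- sum((y - mean(y))^2)   # total variation in Y
1 - ss_res / ss_tot              # equals summary(fit)$r.squared
Because the residual sum of squares lies between 0 and the total sum of squares for a least-squares fit with an intercept, the ratio must land between 0 and 1.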
Regression: Assumptions
5. Homoscedasticity (no heteroscedasticity)
Equal residual variances along the regression line
6. Independent errors
Errors are not correlated with one another
7. Normally distributed errors
The residuals in the model are normally distributed with a mean of zero
8. Independence
All the values of the outcome variable are independent of one another
9. Linearity
The mean value of the outcome variable for each increment of the predictor lies along a straight line
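Many of these assumptions can be screened at once with R's built-in diagnostics; a minimal sketch, assuming a fitted lm object named model:
par(mfrow = c(2, 2))   # show all four diagnostic plots at once
plot(model)            # residuals vs fitted, Q-Q, scale-location, leverage
par(mfrow = c(1, 1))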
Scenario 1
A current study is interested in determining the effect of a new drug on the
number of pictures recalled. Prior research had established a strong correlation
between painting skills and picture recall. Therefore, individual differences in
painting skills were controlled to produce a more sensitive test of the treatment
effect.
What test should we use?
What are the IV and DV?
Scenario 2
Suppose we have data on blood pressure levels for males and females and we
wish to see if one sex has higher/lower blood pressure than the other.
What test should we use?
What are the IV and DV?
Scenario 3
Check assumptions:
The type of variable:
Is it appropriate?
Non-zero variance:
Use a graph or look at the variance directly.
# Density plots: a zero-variance variable would show as a single spike
plot(density(data$Income))
plot(density(data$PI))
plot(density(data$MS))
Regression: Assumptions
Multicollinearity:
We have only one predictor here.
Predictors are not correlated with extraneous variables.
Homoscedasticity (no heteroscedasticity):
The variance around the regression line should be equal across values of x.
library(car)  # scatterplot() comes from the car package
scatterplot(data$Income, data$PI)
Regression: Assumptions
Independent errors:
We have no issue because we assume a random sample with no repeated
measurement.
Normally distributed errors:
qqnorm(model$residuals)  # Q-Q plot of the residuals
shapiro.test(model$residuals)
Independence:
Satisfied by the random-sample design noted above.
Linearity:
scatterplot(data$Income, data$PI)
Regression: Application
Based on these results, we are now able to fit the regression model.
The expected regression model is: Ŷ = b0 + b1X
We can replace the intercept (b0) and slope (b1) with our estimated values.
Our model is:
Ŷ = (−0.0083) + (0.0000001097)X
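A sketch of how this model could be fit and used in R, assuming the data frame and variable names from the assumption checks above (the income value passed to predict() is a made-up example):
model <- lm(PI ~ Income, data = data)
coef(model)                                           # b0 (intercept) and b1 (slope)
predict(model, newdata = data.frame(Income = 50000))  # predicted PI for one new case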
Regression: Summary Write Up
Quadratic:
Ŷ = b0 + b1X + b2X²
Cubic:
Ŷ = b0 + b1X + b2X² + b3X³
Lower Order Trends
When we include higher-order trends, it is important that we do not remove their lower-order
components.
For example, if we wanted to add a quadratic trend to the model, we would need to
keep the linear trend in as well, even if it's not significant.
What does that mean?
To include a cubic trend we must include the quadratic and linear trends as
well.
The reason for this is that if we do not include the lower-order trends, the higher-order
trends will partly capture those effects, which would bias our estimates.
A potential issue that will pop up as you include trends beyond linear is multicollinearity
(predictors sharing some of the same predictive variance – Age and Income); see the sketch below.
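A hedged sketch of how these trends might be fit in R (reusing the hypothetical x and y from earlier sketches); poly() with raw = TRUE includes every lower-order term automatically, while the default orthogonal polynomials also reduce the multicollinearity among the trend terms:
# Quadratic trend: the linear term is kept alongside the squared term
m_quad  <- lm(y ~ poly(x, 2, raw = TRUE))
# Cubic trend: linear + quadratic + cubic terms all included
m_cubic <- lm(y ~ poly(x, 3, raw = TRUE))
# Orthogonal polynomials (the default) de-correlate the trend terms
m_orth  <- lm(y ~ poly(x, 3))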
Non-Linear Trends: Example