Introduction to Regression
Recall: Correlation
With correlation, our aim is to see whether there is any relationship
between x and y.
How about taking a step further?
Can we use one variable to predict another?
The Aim of Regression
Why regression?
ANOVA, z- and t-tests, and chi-square tests allow us to compare means
across multiple groups (the levels of a categorical predictor).
Limitations:
The biggest limitation is that we cannot include continuous predictors (Age, Weight,
Income, Time, etc.).
ANOVA does not build a predictive model; it tells us where differences exist but does
not quantify those differences.
Regression Line
(Un)Standardized Betas
𝜷 is reserved for standardized predictors, while b is used for unstandardized ones.
b values are what we get by default when we use our variables in their raw state. They tell us the unit
change in our dependent variable (in its raw units) for a unit change in our predictor variable
(in its raw units).
Reference: https://www3.nd.edu/~rwilliam/stats1/x92.pdf
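As a rough sketch of the distinction (assuming a hypothetical data frame data with outcome y and predictor x), standardized betas can be obtained by z-scoring both variables with scale() before fitting:
# Unstandardized b: change in y (raw units) per one-unit change in x
m_raw <- lm(y ~ x, data = data)
coef(m_raw)
# Standardized beta: change in y (in SDs) per one-SD change in x
m_std <- lm(scale(y) ~ scale(x), data = data)
coef(m_std)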
We try to identify the values of b0 and b1 that produce the best-fitting linear
function.
So, we use the observed data to find the values of b0 and b1 that
minimize the distances between the observed values (Y) and the predicted values (Ŷ).
In general, we choose the regression coefficient (b1) and the regression intercept
(b0) that define the regression line minimizing the sum of squared residuals:
min Σ[Y − (b0 + b1X)]² = min Σ(Y − Ŷ)²
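A minimal simulated sketch (all names hypothetical) showing that lm() returns exactly the intercept and slope that minimize this sum of squared residuals:
# Simulated data so the example is self-contained
set.seed(1)
x <- rnorm(50)
y <- 2 + 3 * x + rnorm(50)      # true intercept 2, true slope 3
fit <- lm(y ~ x)                # least-squares estimates of b0 and b1
sum(resid(fit)^2)               # the minimized sum of squared residuals
sum((y - (1.5 + 3.5 * x))^2)    # any other line gives a larger value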
Regression Coefficients
Overall Equation:
Ŷ = b0 + b1X
Things to Check
Second, we want to know whether our parameters (b0, b1, b2, etc.) are significant.
A bad predictor means that a unit change in the predictor results in no change in the
predicted value of the outcome.
T-test
The null hypothesis of the t-test is that b is zero.
For each parameter: t = b / SE(b)
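Continuing the simulated sketch above, the t value that R reports for each parameter is just the estimate divided by its standard error:
out <- summary(fit)$coefficients          # Estimate, Std. Error, t value, Pr(>|t|)
out[, "t value"]                          # as reported by R
out[, "Estimate"] / out[, "Std. Error"]   # the same values computed by hand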
R-square in Regression
How do we know if the model is good in terms of how much variability we can
explain?
R-square
The value of R-square will be between 0 and 1. Why?
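One way to see why, continuing the simulated sketch above: R-square is the proportion of the total variation in Y that the model explains.
ss_res <- sum(resid(fit)^2)      # unexplained (residual) variation
ss_tot <- sum((y - mean(y))^2)   # total variation in Y
1 - ss_res / ss_tot              # equals summary(fit)$r.squared
Because the residual sum of squares lies between 0 and the total sum of squares for a least-squares fit with an intercept, the ratio must land between 0 and 1.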
Regression: Assumptions
5. Homoscedasticity (no heteroscedasticity)
Equal residual variances along the regression line
6. Independent errors
Errors are not correlated with one another
7. Normally distributed errors
The residuals in the model are normally distributed with a mean of zero
8. Independence
All the values of the outcome variable are independent of one another
9. Linearity
The mean value of the outcome variable for each increment of the predictor lies along a straight line
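Many of these assumptions can be screened at once with R's built-in diagnostics; a minimal sketch, assuming a fitted lm object named model:
par(mfrow = c(2, 2))   # show all four diagnostic plots at once
plot(model)            # residuals vs fitted, Q-Q, scale-location, leverage
par(mfrow = c(1, 1))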
Scenario 1
A current study is interested in determining the effect of a new drug on the
number of pictures recalled. Prior research had established a strong correlation
between painting skills and picture recall. Therefore, individual differences in
painting skills were controlled to produce a more sensitive test of the treatment
effect.
What test should we use?
What are the IV and DV?
Scenario 2
Suppose we have data on blood pressure levels for males and females and we
wish to see if one sex has higher/lower blood pressure than the other.
What test should we use?
What are the IV and DV?
Scenario 3
Check assumptions:
The type of variable:
Is it appropriate?
Non-zero variance:
Use a graph or look at the variance directly.
# Density plots: a zero-variance variable would show as a single spike
plot(density(data$Income))
plot(density(data$PI))
plot(density(data$MS))
Regression: Assumptions
Multicollinearity:
We have only one predictor here.
Predictors are not correlated with extraneous variables.
Homoscedasticity (no heteroscedasticity):
The variance around the regression line should be equal across values of x.
library(car)  # scatterplot() comes from the car package
scatterplot(data$Income, data$PI)
Regression: Assumptions
Independent errors:
We have no issue because we assume a random sample with no repeated
measurement.
Normally distributed errors:
qqnorm(model$residuals)  # Q-Q plot of the residuals
shapiro.test(model$residuals)
Independence:
Satisfied by the random-sample design noted above.
Linearity:
scatterplot(data$Income, data$PI)
Regression: Application
Based on these results, we are now able to fit the regression model.
The expected regression model is: Ŷ = b0 + b1X
We can replace the intercept (b0) and slope (b1) with our estimated values.
Our model is:
Ŷ = (−0.0083) + (0.0000001097)X
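A sketch of how this model could be fit and used in R, assuming the data frame and variable names from the assumption checks above (the income value passed to predict() is a made-up example):
model <- lm(PI ~ Income, data = data)
coef(model)                                           # b0 (intercept) and b1 (slope)
predict(model, newdata = data.frame(Income = 50000))  # predicted PI for one new case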
Regression: Summary Write Up
Quadratic:
Ŷ = b0 + b1X + b2X²
Cubic:
Ŷ = b0 + b1X + b2X² + b3X³
Lower Order Trends
When we include higher-order trends, it is important that we do not remove their lower-order
components.
For example, if we wanted to add a quadratic trend to the model, we would need to
keep the linear trend in as well, even if it's not significant.
What does that mean?
To include a cubic trend we must include the quadratic and linear trends as
well.
The reason for this is that if we do not include the lower-order trends, the higher-order
trends will partly capture those effects, which would bias our estimates.
A potential issue that will pop up as you include trends beyond linear is multicollinearity
(predictors sharing some of the same predictive variance – Age and Income); see the sketch below.
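A hedged sketch of how these trends might be fit in R (reusing the hypothetical x and y from earlier sketches); poly() with raw = TRUE includes every lower-order term automatically, while the default orthogonal polynomials also reduce the multicollinearity among the trend terms:
# Quadratic trend: the linear term is kept alongside the squared term
m_quad  <- lm(y ~ poly(x, 2, raw = TRUE))
# Cubic trend: linear + quadratic + cubic terms all included
m_cubic <- lm(y ~ poly(x, 3, raw = TRUE))
# Orthogonal polynomials (the default) de-correlate the trend terms
m_orth  <- lm(y ~ poly(x, 3))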
Non-Linear Trends: Example