Statistical Thinking - Regression
In regression we model the conditional expectation of the response given the attributes: E(y | x_1, x_2, ..., x_m).
We use these models when we believe that the response variable y can be
modeled by the independent variables x_1, ..., x_m.
To perform this type of analysis we need a dataset consisting of n observations
that include both the response variable and each of the attributes.
Fitting a regression function is the process of inferring a hypothesis function h
from the data that allows us to predict unknown values of y using the values of
the attributes.
[Diagram: training observations (x_1, ..., x_m, y) are fed to a learning algorithm, which outputs a hypothesis function y ← h(x_1, ..., x_m).]
y_i = β_0 + β_1 x_i + ε_i   ∀i
The parameter β0 represents the intercept of the line (the value of y when x is
zero).
The parameter β_1 is the slope and represents the change in y when we vary the
value of x. The greater the magnitude of this parameter, the stronger the linear
relationship between the variables.
The ε_i values correspond to the errors or residuals associated with the model.
We have to find a linear function or straight line h_β that gives us an
estimate ŷ of y for any value of x with the minimum expected error.
h(x) = β0 + β1 x
The ordinary least squares method allows us to estimate β̂0 and β̂1 by
minimizing the sum of squared errors (SSE) of the observed data.
Suppose we have n observations of y and x; we compute the sum of squared
errors (SSE) as follows:
SSE = Σ_{i=1}^{n} (y_i − h(x_i))² = Σ_{i=1}^{n} (y_i − β_0 − β_1 x_i)²   (1)
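As a quick illustration, a minimal R sketch (with assumed synthetic data, not part of the Howell1 example) of evaluating the SSE for candidate parameter values:
# Minimal sketch: SSE as a function of candidate (beta0, beta1) on synthetic data
set.seed(123)
x <- runif(50, 0, 10)
y <- 2 + 0.5 * x + rnorm(50)          # data generated from a known line plus noise
sse <- function(beta0, beta1) sum((y - beta0 - beta1 * x)^2)
sse(2, 0.5)   # SSE near the true parameters
sse(0, 0)     # a worse candidate gives a larger SSE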
To find the parameters that minimize the error we calculate the partial derivatives
of SSE with respect to β_0 and β_1, set the derivatives equal to zero, and
solve for the parameter values.
∂SSE/∂β_0 = −2 Σ_{i=1}^{n} (y_i − β_0 − β_1 x_i) = 0   (2)
∂SSE/∂β_1 = −2 Σ_{i=1}^{n} (y_i − β_0 − β_1 x_i) x_i = 0   (3)
From the above system of equations (the normal equations) the following solutions are obtained:
β̂_1 = Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ) / Σ_{i=1}^{n} (x_i − x̄)²   (4)
β̂_0 = ȳ − β̂_1 x̄
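A short sketch (again on assumed synthetic data) of computing β̂_1 and β̂_0 directly from these formulas and checking them against lm:
# Closed-form OLS estimates on synthetic data, compared with lm
set.seed(123)
x <- runif(50, 0, 10); y <- 2 + 0.5 * x + rnorm(50)
beta1.hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
beta0.hat <- mean(y) - beta1.hat * mean(x)
c(beta0.hat, beta1.hat)
coef(lm(y ~ x))   # should match the manual estimates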
The regression sum of squares (SSM), on the other hand, indicates the
variability of the model’s predictions with respect to the mean:
SSM = Σ_{i=1}^{n} (ŷ_i − ȳ)²
It can be proved that all the above sums of squares are related (SST = SSE + SSM,
where SST = Σ_{i=1}^{n} (y_i − ȳ)² is the total sum of squares), which gives the
coefficient of determination:
R² = SSM/SST = 1 − SSE/SST = 1 − var(ε)/var(y)   (9)
It is often interpreted as the “variance explained” by the model.
The coefficient usually takes values between 0 and 1, and the closer its value is to
1 the higher the quality of the model[1].
The value of R² is equivalent to the squared linear (Pearson's) correlation
between y and ŷ:
R² = cor(y, ŷ)²
[1] It can take negative values when the predictions are worse than using ȳ for all
predictions.
Assumptions of the Linear Model
Whenever we fit a linear model we are implicitly making certain assumptions about the
data.
Assumptions
1 Linearity: the response variable is linearly related to the attributes.
2 Normality: errors have a zero-mean normal distribution: ε_i ∼ N(0, σ²).
3 Homoscedasticity: errors have constant variance (same value of σ 2 ).
4 Independence: errors are independent of each other.
Considering the above assumptions, we are saying that the probability density
function (PDF) of the errors is a Gaussian with zero mean and constant variance σ²:
PDF(ε_i) = (1 / (√(2π) σ)) exp(−ε_i² / (2σ²))
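A minimal sketch (using an assumed model fitted on synthetic data) of how these assumptions are commonly checked visually in R:
# Sketch: visual checks of the error assumptions for a fitted model
set.seed(1)
x <- rnorm(100); y <- 1 + 2 * x + rnorm(100)
fit <- lm(y ~ x)
qqnorm(residuals(fit)); qqline(residuals(fit))   # normality of the residuals
plot(fitted(fit), residuals(fit))                # homoscedasticity: no visible pattern expected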
In order to get the standard error for a specific regression parameter estimate,
SEβx , we need to rescale the standard error of the model by the square root of
the sum of squares of the X variable:
SE_β̂x = SE_model / √(Σ_{i=1}^{n} (x_i − x̄)²)
The corresponding t-statistic compares the estimate with its value under the null
hypothesis β_H0 (usually zero) and has n − p degrees of freedom, where p is the
number of parameters:
t_{n−p} = (β̂ − β_H0) / SE_β̂x = β̂ / SE_β̂x
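As a sketch (with assumed synthetic data), we can reproduce these quantities by hand and compare them with what summary reports:
# Sketch: standard error and t-statistic of the slope, computed manually
set.seed(1)
x <- rnorm(100); y <- 2 + 3 * x + rnorm(100)
fit <- lm(y ~ x)
se.model <- sqrt(sum(residuals(fit)^2) / (length(y) - 2))   # residual standard error (p = 2 parameters)
se.beta1 <- se.model / sqrt(sum((x - mean(x))^2))
t.beta1  <- coef(fit)["x"] / se.beta1                       # t-statistic under H0: beta1 = 0
summary(fit)$coefficients["x", c("Std. Error", "t value")]  # should match se.beta1 and t.beta1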
Later we will see that R automatically reports the t-statistics and p-values of all
coefficients of a linear model.
This allows us to determine whether the linear relationship between the two
variables (y and x) is significant.
We can see that there is a positive correlation between height and age.
Let’s filter out the non-adult examples because we know that height is strongly
correlated with age before adulthood.
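The examples below assume that d already holds the Howell1 !Kung census data; a minimal sketch of how it could be loaded (assuming the rethinking package, which ships this dataset):
# Assumed data loading: Howell1 !Kung census data from the rethinking package
library(rethinking)
data(Howell1)
d <- Howell1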
d2 <- d[ d$age >= 18 , ]
height(weight) = β0 + β1 ∗ weight
In R, linear models are created with the command lm, which receives formulas
of the form y ~ x (y = f(x)).
> reg1<-lm(height~weight,d2)
> reg1
Call:
lm(formula = height ~ weight, data = d2)
Coefficients:
(Intercept) weight
113.879 0.905
We can see that the coefficients of the model are β0 = 113.879 and β1 = 0.905.
The estimate of β0 = 113.879 indicates that a person of weight 0 should be
around 114 cm tall, which we know is nonsense, but it is what our linear model
believes.
Since β1 is a slope, the value 0.905 can be read as: a person 1 kg heavier is
expected to be about 0.9 cm taller.
We can directly access the coefficients and store them in a variable:
> reg1.coef<-reg1$coefficients
> reg1.coef
(Intercept) weight
113.8793936 0.9050291
> summary(reg1)
Call:
lm(formula = height ~ weight, data = d2)
Residuals:
Min 1Q Median 3Q Max
-19.7464 -2.8835 0.0222 3.1424 14.7744
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 113.87939 1.91107 59.59 <2e-16 ***
weight 0.90503 0.04205 21.52 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
We can see that β0 and β1 are both statistically significantly different from zero.
We see that the coefficient of determination R² has a value of 0.57, which is not
great but acceptable.
We can conclude that weight, while providing useful information to model part of
the variability of the height of the !Kung people, is not enough to build a
highly reliable model.
We can store the results of the command summary in a variable then access the
coefficient of determination:
> sum.reg1<-summary(reg1)
> sum.reg1$r.squared
[1] 0.5696444
We can also access the fitted values, which are the values predicted by the model
for the data used:
> reg1$fitted.values
1 2 3 4 5 6
157.1630 146.9001 142.7180 161.8839 151.2362 170.8895
We can now check that all the definitions given for R² are equivalent.
> SSE<-sum(reg1$residuals^2)
> SST<-sum((d2$height-mean(d2$height))^2)
> SSM<-sum((reg1$fitted.values-mean(d2$height))^2)
> SSM/SST
[1] 0.5696444
> 1-SSE/SST
[1] 0.5696444
> 1-var(reg1$residuals)/var(d2$height)
[1] 0.5696444
> cor(d2$height,reg1$fitted.values)^2
[1] 0.5696444
Suppose now that we know the weight for two !Kung people but we don’t know
their height.
We could use our linear model to predict the height of these two people.
To do this in R we must use the command predict.lm which receives the
linear model and a data.frame with the new data:
> new.weights<-data.frame(weight=c(50,62))
> predict.lm(object=reg1,newdata=new.weights)
1 2
159.1308 169.9912
> # this is equivalent to:
> reg1.coef[1]+reg1.coef[2]*new.weights[1:2,]
[1] 159.1308 169.9912
Y = Xβ + ε
where Y = (y_1, y_2, ..., y_n)ᵀ is the vector of responses, β = (β_0, β_1, ..., β_m)ᵀ is the
vector of parameters, ε = (ε_1, ε_2, ..., ε_n)ᵀ is the vector of errors, and X is the
n × (m+1) design matrix of attributes whose first column is a column of ones (for the intercept).
Using matrix notation, we can see that the sum of squared errors (SSE) can be
expressed as:
SSE = (Y − Xβ)ᵀ(Y − Xβ)
Minimizing this expression by differentiating with respect to β and setting the
derivative equal to zero leads to the normal equations:
β̂ = (XᵀX)⁻¹ XᵀY
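A sketch of solving the normal equations directly in R for the adult data d2 (from before) and comparing the result with lm:
# Sketch: normal equations by hand for height ~ weight on the adult data d2
X <- cbind(1, d2$weight)                      # design matrix with a column of ones
Y <- d2$height
beta.hat <- solve(t(X) %*% X) %*% t(X) %*% Y  # (X^T X)^{-1} X^T Y
beta.hat
coef(reg1)                                    # should give the same values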
Now we will study a multiple linear regression for the Howell1 data.
Let’s add the variable age as an additional predictor for height.
We know that age is good at predicting height for non-adults so we will work with
the original dataset.
Let's fit the following multivariate linear model:
height(weight, age) = β0 + β1 ∗ weight + β2 ∗ age
> reg2<-lm(height~weight+age,d)
> summary(reg2)
Call:
lm(formula = height ~ weight + age, data = d)
Residuals:
Min 1Q Median 3Q Max
-29.0350 -5.4228 0.7333 6.4919 19.6964
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 75.96329 1.04216 72.890 < 2e-16 ***
weight 1.65710 0.03656 45.324 < 2e-16 ***
age 0.11212 0.02594 4.322 1.84e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
When we had a simple regression we could see the fitted model as a line.
Now that we have two independent variables we can see the fitted model as a
plane.
If we had more independent variables our model would be a hyper-plane.
We can plot the plane of our linear model of two independent variables and one
dependent variable in R as follows:
library("scatterplot3d")
s3d <- scatterplot3d(d[,c("weight","age","height")],
type="h", highlight.3d=TRUE,
angle=55, scale.y=0.7, pch=16,
main="height~weight+age")
s3d$plane3d(reg2, lty.box = "solid")
To compare models with different numbers of attributes, the adjusted coefficient of
determination penalizes R² by the number of predictors:
R̄² = 1 − (1 − R²) (n − 1) / (n − m − 1)
where n is the number of examples and m is the number of attributes.
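A quick sketch of computing this adjusted value by hand for reg2 and checking it against what summary reports:
# Sketch: adjusted R^2 for reg2 (height ~ weight + age), computed manually
n <- nrow(d)                 # number of examples
m <- 2                       # number of predictors (weight and age)
r2 <- summary(reg2)$r.squared
1 - (1 - r2) * (n - 1) / (n - m - 1)
summary(reg2)$adj.r.squared  # should match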
y_i = β_0 + β_1 x_i + β_2 x_i² + ε_i   ∀i
The additional term uses the square of xi to construct a parabola, rather than a
perfectly straight line.
The new parameter β2 measures the curvature of the relationship.
Let's fit a polynomial regression for the height variable using a parabolic model
for weight.
Because the square of a large number can be very large, we are going to
standardize the weight by subtracting the mean and dividing by the standard
deviation.
d$weight_s <-(d$weight-mean(d$weight))/sd(d$weight)
> reg4<-lm(height~weight_s+I(weight_s^2),d)
> reg4
Call:
lm(formula = height ~ weight_s + I(weight_s^2), data = d)
Coefficients:
(Intercept) weight_s I(weight_s^2)
146.660 21.415 -8.412
We can also visualize the fitted parabola:
weight.seq <- seq( from=-2.2 , to=2 , length.out=30 )
new.weights<-data.frame("weight_s"=weight.seq,
"I(weight_s^2)"=weight.seq^2)
h.pred <- predict.lm(object=reg4,newdata=new.weights)
plot( height ~ weight_s , d, col="red" )
lines( weight.seq , h.pred )
[Figure: scatter plot of height versus weight_s with the fitted parabola overlaid.]
height(male) = β0 + β1 ∗ male
Here “male” is a binary or dummy variable (it takes the value 1 when the person
is male and 0 otherwise).
In R, it is important to make sure that our categorical variables are represented
as factors:
> d$male<-as.factor(d$male)
> reg5<-lm(height~male,d)
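As a quick check, we can inspect the design matrix to see how R encodes the factor internally as a 0/1 dummy column (matching the coefficient name male1 below):
# Inspect the design matrix: the factor male becomes a 0/1 dummy column named male1
head(model.matrix(height ~ male, data = d))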
The coefficients of the model are:
> reg5$coefficients
(Intercept) male1
134.630278 7.690759
Here β0 is the average height of females and β1 is the average height difference
between males and females:
> sum(reg5$coefficients)
[1] 142.321
> means<-tapply(d$height,d$male,mean)
> means
0 1
134.6303 142.3210
> means[2]-means[1]
1
7.690759
> summary(reg5)
Call:
lm(formula = height ~ male, data = d)
Residuals:
Min 1Q Median 3Q Max
-81.87 -12.73 12.65 18.33 36.75
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 134.630 1.615 83.366 < 2e-16 ***
male1 7.691 2.350 3.273 0.00113 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The p-value of the statistical significance test for β1 in reg5 is the same as that
obtained from an unpaired two-sample t-test with equal variance:
> t.test(d$height~d$male, var.equal=T)
Notice that we must set var.equal=T to get the same results, since the linear
model assumes that the variance is constant.
We can also fit a multivariate linear model with both numeric and categorical
variables:
height(weight,male) = β0 + β1 ∗ weight + β2 ∗ male
> reg6<-lm(height~weight+male,d)
> reg6
Call:
lm(formula = height ~ weight + male, data = d)
Coefficients:
(Intercept) weight male1
75.5489 1.7664 -0.3971
In this model we use the same slope β1 relating height to weight for both groups.
The coefficient β2 (male1) indicates an expected height difference between
males and females once the “weight” is known.
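As a quick sketch, we can visualize the two parallel lines implied by reg6 (same slope, intercepts shifted by the male1 coefficient):
# Sketch: the two parallel fitted lines implied by reg6
plot(height ~ weight, data = d, col = ifelse(d$male == "1", "blue", "red"))
cf <- coef(reg6)
abline(a = cf["(Intercept)"], b = cf["weight"], col = "red")                  # females (male = 0)
abline(a = cf["(Intercept)"] + cf["male1"], b = cf["weight"], col = "blue")   # males (male = 1)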
In many cases we want to fit a model using lines with separate slopes for each
category.
This is done using an interaction, which in our case is just an additional
parameter associated with the product between the category and the predictor:
height_int(weight, male) = β_0^(i) + β_1^(i) ∗ weight + β_2^(i) ∗ male + β_3^(i) ∗ weight ∗ male
To understand why the interaction considers a different slope for each category
we must study the above equation for the cases (male=1) and (male=0).
If male = 1, the expression takes the following form:
(β_0^(i) + β_2^(i)) + (β_1^(i) + β_3^(i)) ∗ weight
which is essentially a linear model with an intercept of (β_0^(i) + β_2^(i)) and a slope of
(β_1^(i) + β_3^(i)).
If male = 0:
β_0^(i) + β_1^(i) ∗ weight
which is another linear model with an intercept of β_0^(i) and a slope of β_1^(i).
In R, interactions are denoted by using the * or : symbol in the formula.
> reg7<-lm(height~weight+male+weight:male,d)
> # or lm(height~weight*male,d)
> reg7
Call:
lm(formula = height ~ weight + male + weight:male, data = d)
Coefficients:
(Intercept) weight male1 weight:male1
74.2536 1.8051 2.0683 -0.0695
We can find many relationships between the coefficients of the model with
interaction and the two independent models (one fitted for males and one for females).
All of them can be deduced by cancelling out terms in the formula of height_int for
the female cases (male = 0).
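The two independent models are referred to below as reg8 (males only) and reg9 (females only); a minimal sketch of how they could have been fitted (the names and subsetting are assumptions based on the identifiers used in the output below):
# Assumed definitions of the two separate per-group models used below
reg8 <- lm(height ~ weight, data = d[d$male == "1", ])  # males only
reg9 <- lm(height ~ weight, data = d[d$male == "0", ])  # females only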
First, the intercept of the model with the interaction height_int is the same as the
intercept of the model for females: β_0^(i) = β_0^(f):
> reg7$coefficients["(Intercept)"]
(Intercept)
74.25359
> reg9$coefficient["(Intercept)"]
(Intercept)
74.25359
The same applies for the slope of weight in height_int and height_female: β_1^(i) = β_1^(f):
> reg7$coefficients["weight"]
weight
1.805118
> reg9$coefficient["weight"]
weight
1.805118
The interaction slope from height_int encodes the difference in weight slopes
between both groups: β_3^(i) = β_1^(m) − β_1^(f):
> reg7$coefficients["weight:male1"]
weight:male1
-0.06949648
> reg8$coefficient["weight"]-reg9$coefficient["weight"]
weight
-0.06949648
Finally, the category coefficient β_2^(i) from height_int corresponds to the difference
of intercepts between height_male and height_female: β_2^(i) = β_0^(m) − β_0^(f):
> reg7$coefficients["male1"]
male1
2.068278
> reg8$coefficients["(Intercept)"]-reg9$coefficients["(Intercept)"]
(Intercept)
2.068278
We can use a simple linear model to describe the relation between two variables
and to decide whether that relationship is statistically significant.
In addition, the model allows us to predict the value of the dependent variable
given some new value(s) of the independent variable(s).
Most importantly, a multivariate linear model allows us to build models that
incorporate multiple independent variables.
An interaction allows the regression model to fit different slopes for different
categories.
McElreath, R. (2020).
Statistical rethinking: A Bayesian course with examples in R and Stan.
CRC press.
Poldrack, R. A. (2019).
Statistical thinking for the 21st century.
https://statsthinking21.org/.
Wasserman, L. (2013).
All of statistics: a concise course in statistical inference.
Springer Science & Business Media.