Statistical Thinking - Regression
In regression we model the conditional expectation of the response given the attributes: E(y | x_1, x_2, ..., x_m).
We use these models when we believe that the response variable y can be
modeled by the independent variables x_1, ..., x_m.
To perform this type of analysis we need a dataset consisting of n observations
that include both the response variable and each of the attributes.
Fitting a regression function is the process of inferring a hypothesis function h
from the data that allows us to predict unknown values of y using the values of
the attributes.
[Diagram: training observations (x_1, ..., x_m, y) are fed to a learning algorithm, which outputs a hypothesis function y ← h(x_1, ..., x_m).]
y_i = β_0 + β_1 x_i + ε_i   ∀i
The parameter β0 represents the intercept of the line (the value of y when x is
zero).
The parameter β_1 is the slope and represents the change in y when we vary the
value of x. The greater the magnitude of this parameter, the stronger the linear
relationship between the variables.
The ε_i values correspond to the errors or residuals associated with the model.
We have to find a linear function or straight line h_β that gives us an
estimate ŷ of y for any value of x with the minimum expected error.
h(x) = β0 + β1 x
The ordinary least squares method allows us to estimate β̂0 and β̂1 by
minimizing the sum of squared errors (SSE) of the observed data.
Suppose we have n observations of y and x; we compute the sum of squared
errors (SSE) as follows:
SSE = Σ_{i=1}^{n} (y_i − h(x_i))² = Σ_{i=1}^{n} (y_i − β_0 − β_1 x_i)²   (1)
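As a quick illustration, a minimal R sketch (with assumed synthetic data, not part of the Howell1 example) of evaluating the SSE for candidate parameter values:
# Minimal sketch: SSE as a function of candidate (beta0, beta1) on synthetic data
set.seed(123)
x <- runif(50, 0, 10)
y <- 2 + 0.5 * x + rnorm(50)          # data generated from a known line plus noise
sse <- function(beta0, beta1) sum((y - beta0 - beta1 * x)^2)
sse(2, 0.5)   # SSE near the true parameters
sse(0, 0)     # a worse candidate gives a larger SSE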
To find the parameters that minimize the error we calculate the partial derivatives
of SSE with respect to β_0 and β_1, set the derivatives equal to zero, and
solve for the parameter values.
∂SSE/∂β_0 = −2 Σ_{i=1}^{n} (y_i − β_0 − β_1 x_i) = 0   (2)
∂SSE/∂β_1 = −2 Σ_{i=1}^{n} (y_i − β_0 − β_1 x_i) x_i = 0   (3)
From the above system of equations (the normal equations) the following solutions are obtained:
β̂_1 = Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ) / Σ_{i=1}^{n} (x_i − x̄)²   (4)
β̂_0 = ȳ − β̂_1 x̄
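A short sketch (again on assumed synthetic data) of computing β̂_1 and β̂_0 directly from these formulas and checking them against lm:
# Closed-form OLS estimates on synthetic data, compared with lm
set.seed(123)
x <- runif(50, 0, 10); y <- 2 + 0.5 * x + rnorm(50)
beta1.hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
beta0.hat <- mean(y) - beta1.hat * mean(x)
c(beta0.hat, beta1.hat)
coef(lm(y ~ x))   # should match the manual estimates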
The regression sum of squares (SSM), on the other hand, indicates the
variability of the model’s predictions with respect to the mean:
SSM = Σ_{i=1}^{n} (ŷ_i − ȳ)²
It can be proved that all the above sums of squares are related (SST = SSE + SSM,
where SST = Σ_{i=1}^{n} (y_i − ȳ)² is the total sum of squares), which gives the
coefficient of determination:
R² = SSM/SST = 1 − SSE/SST = 1 − var(ε)/var(y)   (9)
It is often interpreted as the “variance explained” by the model.
The coefficient usually takes values between 0 and 1, and the closer its value is to
1 the higher the quality of the model[1].
The value of R² is equivalent to the squared linear (Pearson's) correlation
between y and ŷ:
R² = cor(y, ŷ)²
[1] It can take negative values when the predictions are worse than using ȳ for all
predictions.
Assumptions of the Linear Model
Whenever we fit a linear model we are implicitly making certain assumptions about the
data.
Assumptions
1 Linearity: the response variable is linearly related to the attributes.
2 Normality: errors have a zero-mean normal distribution: ε_i ∼ N(0, σ²).
3 Homoscedasticity: errors have constant variance (same value of σ 2 ).
4 Independence: errors are independent of each other.
Considering the above assumptions, we are saying that the probability density
function (PDF) of the errors is a Gaussian with zero mean and constant variance σ²:
PDF(ε_i) = (1 / (√(2π) σ)) exp(−ε_i² / (2σ²))
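A minimal sketch (using an assumed model fitted on synthetic data) of how these assumptions are commonly checked visually in R:
# Sketch: visual checks of the error assumptions for a fitted model
set.seed(1)
x <- rnorm(100); y <- 1 + 2 * x + rnorm(100)
fit <- lm(y ~ x)
qqnorm(residuals(fit)); qqline(residuals(fit))   # normality of the residuals
plot(fitted(fit), residuals(fit))                # homoscedasticity: no visible pattern expected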
In order to get the standard error for a specific regression parameter estimate,
SEβx , we need to rescale the standard error of the model by the square root of
the sum of squares of the X variable:
SE_β̂x = SE_model / √(Σ_{i=1}^{n} (x_i − x̄)²)
The corresponding t-statistic compares the estimate with its value under the null
hypothesis β_H0 (usually zero) and has n − p degrees of freedom, where p is the
number of parameters:
t_{n−p} = (β̂ − β_H0) / SE_β̂x = β̂ / SE_β̂x
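As a sketch (with assumed synthetic data), we can reproduce these quantities by hand and compare them with what summary reports:
# Sketch: standard error and t-statistic of the slope, computed manually
set.seed(1)
x <- rnorm(100); y <- 2 + 3 * x + rnorm(100)
fit <- lm(y ~ x)
se.model <- sqrt(sum(residuals(fit)^2) / (length(y) - 2))   # residual standard error (p = 2 parameters)
se.beta1 <- se.model / sqrt(sum((x - mean(x))^2))
t.beta1  <- coef(fit)["x"] / se.beta1                       # t-statistic under H0: beta1 = 0
summary(fit)$coefficients["x", c("Std. Error", "t value")]  # should match se.beta1 and t.beta1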
Later we will see that R automatically reports the t-statistics and p-values of all
coefficients of a linear model.
This allows us to determine whether the linear relationship between the two
variables (y and x) is significant.
We can see that there is a positive correlation between height and age.
Let’s filter out the non-adult examples because we know that height is strongly
correlated with age before adulthood.
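The examples below assume that d already holds the Howell1 !Kung census data; a minimal sketch of how it could be loaded (assuming the rethinking package, which ships this dataset):
# Assumed data loading: Howell1 !Kung census data from the rethinking package
library(rethinking)
data(Howell1)
d <- Howell1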
d2 <- d[ d$age >= 18 , ]
height(weight) = β0 + β1 ∗ weight
In R, linear models are created with the command lm, which receives formulas
of the form y ~ x (y = f(x)).
> reg1<-lm(height~weight,d2)
> reg1
Call:
lm(formula = height ~ weight, data = d2)
Coefficients:
(Intercept) weight
113.879 0.905
We can see that the coefficients of the model are β0 = 113.879 and β1 = 0.905.
The estimate of β0 = 113.879 indicates that a person of weight 0 should be
around 114 cm tall, which we know is nonsense, but it is what our linear model
believes.
Since β1 is a slope, the value 0.905 can be read as: a person 1 kg heavier is
expected to be about 0.9 cm taller.
We can directly access the coefficients and store them in a variable:
> reg1.coef<-reg1$coefficients
> reg1.coef
(Intercept) weight
113.8793936 0.9050291
> summary(reg1)
Call:
lm(formula = height ~ weight, data = d2)
Residuals:
Min 1Q Median 3Q Max
-19.7464 -2.8835 0.0222 3.1424 14.7744
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 113.87939 1.91107 59.59 <2e-16 ***
weight 0.90503 0.04205 21.52 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
We can see that β0 and β1 are both statistically significantly different from zero.
We see that the coefficient of determination R² has a value of 0.57, which is not
great but acceptable.
We can conclude that weight, while providing useful information to model part of
the variability of the height of the !Kung people, is not enough to build a
highly reliable model.
We can store the results of the command summary in a variable then access the
coefficient of determination:
> sum.reg1<-summary(reg1)
> sum.reg1$r.squared
[1] 0.5696444
We can also access the fitted values, which are the values predicted by the model
for the data used:
> reg1$fitted.values
1 2 3 4 5 6
157.1630 146.9001 142.7180 161.8839 151.2362 170.8895
We can now check that all the definitions given for R² are equivalent.
> SSE<-sum(reg1$residuals^2)
> SST<-sum((d2$height-mean(d2$height))^2)
> SSM<-sum((reg1$fitted.values-mean(d2$height))^2)
> SSM/SST
[1] 0.5696444
> 1-SSE/SST
[1] 0.5696444
> 1-var(reg1$residuals)/var(d2$height)
[1] 0.5696444
> cor(d2$height,reg1$fitted.values)^2
[1] 0.5696444
Suppose now that we know the weight for two !Kung people but we don’t know
their height.
We could use our linear model to predict the height of these two people.
To do this in R we must use the command predict.lm which receives the
linear model and a data.frame with the new data:
> new.weights<-data.frame(weight=c(50,62))
> predict.lm(object=reg1,newdata=new.weights)
1 2
159.1308 169.9912
> # this is equivalent to:
> reg1.coef[1]+reg1.coef[2]*new.weights[1:2,]
[1] 159.1308 169.9912
Y = Xβ + ε
where Y = (y_1, y_2, ..., y_n)ᵀ is the vector of responses, β = (β_0, β_1, ..., β_m)ᵀ is the
vector of parameters, ε = (ε_1, ε_2, ..., ε_n)ᵀ is the vector of errors, and X is the
n × (m+1) design matrix of attributes whose first column is a column of ones (for the intercept).
Using matrix notation, we can see that the sum of squared errors (SSE) can be
expressed as:
SSE = (Y − Xβ)ᵀ(Y − Xβ)
Minimizing this expression by differentiating with respect to β and setting the
derivative equal to zero leads to the normal equations:
β̂ = (XᵀX)⁻¹ XᵀY
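A sketch of solving the normal equations directly in R for the adult data d2 (from before) and comparing the result with lm:
# Sketch: normal equations by hand for height ~ weight on the adult data d2
X <- cbind(1, d2$weight)                      # design matrix with a column of ones
Y <- d2$height
beta.hat <- solve(t(X) %*% X) %*% t(X) %*% Y  # (X^T X)^{-1} X^T Y
beta.hat
coef(reg1)                                    # should give the same values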
Now we will study a multiple linear regression for the Howell1 data.
Let’s add the variable age as an additional predictor for height.
We know that age is good at predicting height for non-adults so we will work with
the original dataset.
Let's fit the following multivariate linear model:
height(weight, age) = β0 + β1 ∗ weight + β2 ∗ age
> reg2<-lm(height~weight+age,d)
> summary(reg2)
Call:
lm(formula = height ~ weight + age, data = d)
Residuals:
Min 1Q Median 3Q Max
-29.0350 -5.4228 0.7333 6.4919 19.6964
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 75.96329 1.04216 72.890 < 2e-16 ***
weight 1.65710 0.03656 45.324 < 2e-16 ***
age 0.11212 0.02594 4.322 1.84e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
When we had a simple regression we could see the fitted model as a line.
Now that we have two independent variables we can see the fitted model as a
plane.
If we had more independent variables our model would be a hyper-plane.
We can plot the plane of our linear model of two independent variables and one
dependent variable in R as follows:
library("scatterplot3d")
s3d <- scatterplot3d(d[,c("weight","age","height")],
type="h", highlight.3d=TRUE,
angle=55, scale.y=0.7, pch=16,
main="height~weight+age")
s3d$plane3d(reg2, lty.box = "solid")
To compare models with different numbers of attributes, the adjusted coefficient of
determination penalizes R² by the number of predictors:
R̄² = 1 − (1 − R²) (n − 1) / (n − m − 1)
where n is the number of examples and m is the number of attributes.
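A quick sketch of computing this adjusted value by hand for reg2 and checking it against what summary reports:
# Sketch: adjusted R^2 for reg2 (height ~ weight + age), computed manually
n <- nrow(d)                 # number of examples
m <- 2                       # number of predictors (weight and age)
r2 <- summary(reg2)$r.squared
1 - (1 - r2) * (n - 1) / (n - m - 1)
summary(reg2)$adj.r.squared  # should match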
y_i = β_0 + β_1 x_i + β_2 x_i² + ε_i   ∀i
The additional term uses the square of xi to construct a parabola, rather than a
perfectly straight line.
The new parameter β2 measures the curvature of the relationship.
Let's fit a polynomial regression for the height variable using a parabolic model
for weight.
Because the square of a large number can be very large, we are going to
standardize the weight by subtracting the mean and dividing by the standard
deviation.
d$weight_s <-(d$weight-mean(d$weight))/sd(d$weight)
> reg4<-lm(height~weight_s+I(weight_s^2),d)
> reg4
Call:
lm(formula = height ~ weight_s + I(weight_s^2), data = d)
Coefficients:
(Intercept) weight_s I(weight_s^2)
146.660 21.415 -8.412
We can also visualize the fitted parabola:
weight.seq <- seq( from=-2.2 , to=2 , length.out=30 )
new.weights<-data.frame("weight_s"=weight.seq,
"I(weight_s^2)"=weight.seq^2)
h.pred <- predict.lm(object=reg4,newdata=new.weights)
plot( height ~ weight_s , d, col="red" )
lines( weight.seq , h.pred )
[Figure: scatter plot of height versus weight_s with the fitted parabola overlaid.]
height(male) = β0 + β1 ∗ male
Here “male” is a binary or dummy variable (it takes the value 1 when the person
is male and 0 otherwise).
In R, it is important to make sure that our categorical variables are represented
as factors:
> d$male<-as.factor(d$male)
> reg5<-lm(height~male,d)
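As a quick check, we can inspect the design matrix to see how R encodes the factor internally as a 0/1 dummy column (matching the coefficient name male1 below):
# Inspect the design matrix: the factor male becomes a 0/1 dummy column named male1
head(model.matrix(height ~ male, data = d))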
The coefficients of the model are:
> reg5$coefficients
(Intercept) male1
134.630278 7.690759
Here β0 is the average height of females and β1 is the average height difference
between males and females:
> sum(reg5$coefficients)
[1] 142.321
> means<-tapply(d$height,d$male,mean)
> means
0 1
134.6303 142.3210
> means[2]-means[1]
1
7.690759
> summary(reg5)
Call:
lm(formula = height ~ male, data = d)
Residuals:
Min 1Q Median 3Q Max
-81.87 -12.73 12.65 18.33 36.75
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 134.630 1.615 83.366 < 2e-16 ***
male1 7.691 2.350 3.273 0.00113 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The p-value of the statistical significance test for β1 in reg5 is the same as that
obtained from an unpaired two-sample t-test with equal variance:
> t.test(d$height~d$male, var.equal=T)
Notice that we must set var.equal=T to get the same results, since the linear
model assumes that the variance is constant.
We can also fit a multivariate linear model with both numeric and categorical
variables:
height(weight,male) = β0 + β1 ∗ weight + β2 ∗ male
> reg6<-lm(height~weight+male,d)
> reg6
Call:
lm(formula = height ~ weight + male, data = d)
Coefficients:
(Intercept) weight male1
75.5489 1.7664 -0.3971
In this model we use the same slope β1 relating height to weight for both groups.
The coefficient β2 (male1) indicates an expected height difference between
males and females once the “weight” is known.
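As a quick sketch, we can visualize the two parallel lines implied by reg6 (same slope, intercepts shifted by the male1 coefficient):
# Sketch: the two parallel fitted lines implied by reg6
plot(height ~ weight, data = d, col = ifelse(d$male == "1", "blue", "red"))
cf <- coef(reg6)
abline(a = cf["(Intercept)"], b = cf["weight"], col = "red")                  # females (male = 0)
abline(a = cf["(Intercept)"] + cf["male1"], b = cf["weight"], col = "blue")   # males (male = 1)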
In many cases we want to fit a model using lines with separate slopes for each
category.
This is done using an interaction, which in our case is just an additional
parameter associated with the product between the category and the predictor:
height_int(weight, male) = β_0^(i) + β_1^(i) ∗ weight + β_2^(i) ∗ male + β_3^(i) ∗ weight ∗ male
To understand why the interaction considers a different slope for each category
we must study the above equation for the cases (male=1) and (male=0).
If male = 1, the expression takes the following form:
(β_0^(i) + β_2^(i)) + (β_1^(i) + β_3^(i)) ∗ weight
which is essentially a linear model with an intercept of (β_0^(i) + β_2^(i)) and a slope of
(β_1^(i) + β_3^(i)).
If male = 0:
β_0^(i) + β_1^(i) ∗ weight
which is another linear model with an intercept of β_0^(i) and a slope of β_1^(i).
In R, interactions are denoted by using the * or : symbol in the formula.
> reg7<-lm(height~weight+male+weight:male,d)
> # or lm(height~weight*male,d)
> reg7
Call:
lm(formula = height ~ weight + male + weight:male, data = d)
Coefficients:
(Intercept) weight male1 weight:male1
74.2536 1.8051 2.0683 -0.0695
We can find many relationships between the coefficients of the model with
interaction and the two independent models (one fitted for males and one for females).
All of them can be deduced by cancelling out terms in the formula of height_int for
the female cases (male = 0).
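The two independent models are referred to below as reg8 (males only) and reg9 (females only); a minimal sketch of how they could have been fitted (the names and subsetting are assumptions based on the identifiers used in the output below):
# Assumed definitions of the two separate per-group models used below
reg8 <- lm(height ~ weight, data = d[d$male == "1", ])  # males only
reg9 <- lm(height ~ weight, data = d[d$male == "0", ])  # females only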
First, the intercept of the model with the interaction height_int is the same as the
intercept of the model for females: β_0^(i) = β_0^(f):
> reg7$coefficients["(Intercept)"]
(Intercept)
74.25359
> reg9$coefficient["(Intercept)"]
(Intercept)
74.25359
The same applies for the slope of weight in height_int and height_female: β_1^(i) = β_1^(f):
> reg7$coefficients["weight"]
weight
1.805118
> reg9$coefficient["weight"]
weight
1.805118
The interaction slope from height_int encodes the difference in weight slopes
between both groups: β_3^(i) = β_1^(m) − β_1^(f):
> reg7$coefficients["weight:male1"]
weight:male1
-0.06949648
> reg8$coefficient["weight"]-reg9$coefficient["weight"]
weight
-0.06949648
Finally, the category coefficient β_2^(i) from height_int corresponds to the difference
of intercepts between height_male and height_female: β_2^(i) = β_0^(m) − β_0^(f):
> reg7$coefficients["male1"]
male1
2.068278
> reg8$coefficients["(Intercept)"]-reg9$coefficients["(Intercept)"]
(Intercept)
2.068278
We can use a simple linear model to describe the relation between two variables
and to decide whether that relationship is statistically significant.
In addition, the model allows us to predict the value of the dependent variable
given some new value(s) of the independent variable(s).
Most importantly, a multivariate linear model allows us to build models that
incorporate multiple independent variables.
An interaction allows the regression model to fit different slopes for different
categories.
McElreath, R. (2020).
Statistical rethinking: A Bayesian course with examples in R and Stan.
CRC press.
Poldrack, R. A. (2019).
Statistical thinking for the 21st century.
https://statsthinking21.org/.
Wasserman, L. (2013).
All of statistics: a concise course in statistical inference.
Springer Science & Business Media.