Lecture 2


Big Data and Machine Learning

Lecture Slides 2: Linear Regression

University of Queensland
Outline

I Linear regression.
I Accuracy of linear regression.
I Problems with the linear regression model.
I Nearest-neighbor regression.
I First comparison of parametric (linear regression) vs.
nonparametric (nearest neighbor) learning.
Supervised learning setup
I Given a random variable X , predict another variable Y .
I Example:
I Y = Sales.
I X = Advertising.
I Solution:
I Learn a function f̂ from the data.
I Given input X, output

Y = f̂(X)

I Simplest candidate function: linear function.

f (X ) = β0 + β1 X

I β0 and β1 are called parameters.


I They are unknown constants to be learned.
I This is called a parametric learning problem.
Simple linear regression

I Assume that the relationship between Y and X is given
approximately by

Y ≈ f(X) = β0 + β1 X

I More precisely, the deviation of Y from f(X) is modeled by a
random error (or noise) term ε:

Y = f(X) + ε
  = β0 + β1 X + ε
How to learn the linear regression model?

I Learning f in general is hard:


I For each given x, you should be able to output f(x): infinitely
many values to learn!
I But if f is a line, you only need two points to pin it down!
I Learning f(x) = β0 + β1 x only requires learning β0 and β1.
I Objective:
I Given a sample y1, y2, ..., yn and x1, ..., xn,
I find two values β̂0 and β̂1,
I that best fit the sample.
How to find the best fit?
I Look at a possible line and find the residuals

ei = yi − β̂0 − β̂1 xi

I Magnitude of the residuals:


∑ᵢ₌₁ⁿ ei²

I This is called the sum of squared residuals.


Least Squares Minimization

I Objective function:

Q(β0, β1) = ∑ᵢ₌₁ⁿ (yi − β0 − β1 xi)² = ∑ᵢ₌₁ⁿ εi²

I Find the two numbers β0 and β1 that make Q as small as
possible.
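To make this concrete, here is a minimal Python sketch (not part of the original slides) that minimizes Q numerically on made-up data, assuming numpy and scipy are available; the closed-form least-squares solution appears on the next slides.

```python
import numpy as np
from scipy.optimize import minimize

# Toy data (made up for illustration): y is roughly 2 + 3x plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 + 3.0 * x + rng.normal(scale=1.0, size=50)

def Q(beta):
    """Sum of squared residuals Q(beta0, beta1)."""
    beta0, beta1 = beta
    return np.sum((y - beta0 - beta1 * x) ** 2)

# Numerical minimization of Q; for this quadratic objective the result
# matches the closed-form solution up to numerical tolerance.
result = minimize(Q, x0=[0.0, 0.0])
print(result.x)  # approximately (2, 3)
```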
Linear regression model in matrix form 1

I Matrices of output data, input data and residuals:

y = (y1, ..., yi, ..., yn)ᵀ,   ε = (ε1, ..., εi, ..., εn)ᵀ,
X = the n × 2 matrix whose i-th row is (1, xi).

I Vector of coefficients:

β = (β0, β1)ᵀ
Linear regression model in matrix form 2

I Original equations (i is the observation index)

y1 = β0 + β1 x1 + ε1
⋮
yi = β0 + β1 xi + εi
⋮
yn = β0 + β1 xn + εn

I Equations in matrix form

y = Xβ + ε
Least Squares Minimization 2

I The solution is very simple when written in matrix form.


β̂ = (β̂0, β̂1)ᵀ = (XᵀX)⁻¹Xᵀy

I As β̂ is random, its accuracy is given by its variance:

Var(β̂) = σ²(XᵀX)⁻¹


I σ² is the variance of the error term ε.
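A minimal numpy sketch of these two formulas on made-up data (the data-generating values and variable names are assumptions for illustration only):

```python
import numpy as np

# Toy data (illustrative only).
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 3.0 * x + rng.normal(scale=1.5, size=100)

# Design matrix with a column of ones for the intercept.
X = np.column_stack([np.ones_like(x), x])

# Least-squares estimate: beta_hat = (X'X)^{-1} X'y.
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y

# Estimate sigma^2 from the residuals (n - 2 degrees of freedom here),
# then Var(beta_hat) = sigma^2 (X'X)^{-1}.
residuals = y - X @ beta_hat
sigma2_hat = residuals @ residuals / (len(y) - X.shape[1])
var_beta_hat = sigma2_hat * XtX_inv

print(beta_hat)                        # roughly (2, 3)
print(np.sqrt(np.diag(var_beta_hat)))  # standard errors
```

In practice np.linalg.solve or np.linalg.lstsq is preferred over forming the explicit inverse, but the explicit form mirrors the slide.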
Multiple Linear regression: 2 inputs
I More than one input

I Y = β0 + β1 X1 + β2 X2 + ε
Multiple Linear regression

I Using matrix notation, going to p inputs is straightforward:


X is now the n × (p + 1) matrix whose i-th row is (1, xi,1, ..., xi,p).
I All the formulas in matrix form are the same!

y = Xβ + ε
β̂ = (XᵀX)⁻¹Xᵀy
Var(β̂) = σ²(XᵀX)⁻¹
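The same recipe with p = 2 made-up inputs, to show that only the construction of X changes (a sketch, not the course's official code):

```python
import numpy as np

# Toy data with p = 2 inputs (illustrative only).
rng = np.random.default_rng(2)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(scale=1.0, size=n)

# Same recipe as before: just add more columns to X.
X = np.column_stack([np.ones(n), x1, x2])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)  # solves (X'X) b = X'y
print(beta_hat)  # roughly (1, 2, -0.5)
```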
How good is the regression model?

I Does adding an input Xj make sense?


I How many inputs do we need?
I How good is a linear regression model?
I Is there a better model?
Prediction and fit

I From least squares, we get β̂.


I Given x0 , the prediction is

f̂(x0) = x0 β̂
      = β̂0 + β̂1 x0,1 + · · · + β̂p x0,p
I We can now compute
I Mean squared error. (It is linked to R².)
I Test MSE. (Based on x0 in the test sample.)
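A sketch of training versus test MSE on a made-up train/test split (all names and data here are illustrative, not from the slides):

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares fit; X already contains the intercept column."""
    return np.linalg.solve(X.T @ X, X.T @ y)

def mse(X, y, beta_hat):
    """Mean squared error of predictions X @ beta_hat against y."""
    return np.mean((y - X @ beta_hat) ** 2)

# Toy data split into a training and a test sample (illustrative only).
rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=300)
y = 1.0 + 0.5 * x + rng.normal(scale=1.0, size=300)
X = np.column_stack([np.ones_like(x), x])

train, test = slice(0, 200), slice(200, 300)
beta_hat = fit_ols(X[train], y[train])
print("train MSE:", mse(X[train], y[train], beta_hat))
print("test MSE: ", mse(X[test], y[test], beta_hat))
```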
Special input data
I Example: a binary input:

studenti = 1 if i is a student, 0 otherwise
I Regression function:

balancei = β0 + β1 incomei + β2 studenti + εi
         = (β0 + β2) + β1 incomei + εi   if i is a student
         = β0 + β1 incomei + εi          otherwise
Discrete variable coding
I Example: a discrete input in the regression of Y on X

X ∈ {red, blue, green}

I Binary variable coding:


I Z1 = 1 if X = red and 0 otherwise.
I Z2 = 1 if X = blue and 0 otherwise.
I Z3 = 1 if X = green and 0 otherwise.
I Dummy variable trap
I Regression Y = β0 + β1 Z1 + β2 Z2 + β3 Z3 + ε
I 4 unknowns with three equations: Identification problem.
I E [Y |X = red] = β0 + β1
I E [Y |X = blue] = β0 + β2
I E [Y |X = green] = β0 + β3
I Drop one of the three variables Z1 , Z2 and Z3 .
Discrete variable coding (continued)

I Drop Z1 and use red as a base category.


I Contrast function:
        red  blue  green
red      1    0     0
blue     0    1     0
green    0    0     1
I 3 unknowns with three equations in regression
Y = β0 + β2 Z2 + β3 Z3 + ε.
I E [Y |X = red] = β0
I E [Y |X = blue] = β0 + β2
I E [Y |X = green] = β0 + β3
I Interpretation of coefficients: deviations from base category.
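As an illustration of base-category coding, the sketch below assumes pandas is available and uses pd.get_dummies with drop_first to drop the red indicator; the data and effect sizes are made up:

```python
import numpy as np
import pandas as pd

# Toy data with a three-level categorical input (illustrative only).
rng = np.random.default_rng(4)
color = rng.choice(["red", "blue", "green"], size=300)
effect = {"red": 0.0, "blue": 2.0, "green": -1.0}
y = 5.0 + np.array([effect[c] for c in color]) + rng.normal(size=300)

# Base-category coding: list "red" first so drop_first makes it the base.
Z = pd.get_dummies(pd.Categorical(color, categories=["red", "blue", "green"]),
                   drop_first=True).astype(float)
X = np.column_stack([np.ones(len(y)), Z.to_numpy()])

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(Z.columns.tolist())  # ['blue', 'green']
print(beta_hat)            # approx (5, 2, -1): intercept = E[Y | red],
                           # the others are deviations from the red base
```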
Discrete variable coding (continued)
I Drop the variable representing X = green and use sum
contrasts coding.
I Contrast function:
        red  blue
red      1    0
blue     0    1
green   -1   -1
I 3 unknowns with three equations in regression
Y = β0 + β1 Z1 + β2 Z2 + ε.
I E [Y |X = red] = β0 + β1
I E [Y |X = blue] = β0 + β2
I E [Y |X = green] = β0 − β1 − β2
I Interpretation of coefficients: Average effect

(E[Y|X = red] + E[Y|X = blue] + E[Y|X = green]) / 3 = 3β0 / 3 = β0
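A corresponding sketch of sum-contrast coding built by hand on the same kind of made-up data (again an illustration, not the slides' code):

```python
import numpy as np

# Toy data: three color groups with different means (illustrative only).
rng = np.random.default_rng(5)
color = rng.choice(["red", "blue", "green"], size=300)
effect = {"red": 0.0, "blue": 2.0, "green": -1.0}
y = 5.0 + np.array([effect[c] for c in color]) + rng.normal(size=300)

# Sum contrasts: Z1 = 1 for red, -1 for green, 0 otherwise; Z2 likewise for blue.
z1 = np.where(color == "red", 1.0, np.where(color == "green", -1.0, 0.0))
z2 = np.where(color == "blue", 1.0, np.where(color == "green", -1.0, 0.0))
X = np.column_stack([np.ones(len(y)), z1, z2])

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # beta0 ≈ average of the three group means (≈ 5.33 here)
```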
Polynomial regression
I Example from the Auto data: a degree-2 polynomial.

mpgi = β0 + β1 horsepoweri + β2 horsepoweri² + εi
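A sketch of the degree-2 fit; it assumes a file Auto.csv with mpg and horsepower columns is available locally (the filename and loading step are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical loading step: adjust the path to wherever the Auto data lives.
auto = pd.read_csv("Auto.csv", na_values="?").dropna()

hp = auto["horsepower"].to_numpy(dtype=float)
mpg = auto["mpg"].to_numpy(dtype=float)

# A degree-2 polynomial regression is still linear in the parameters:
# just add horsepower^2 as an extra column of X.
X = np.column_stack([np.ones_like(hp), hp, hp ** 2])
beta_hat = np.linalg.solve(X.T @ X, X.T @ mpg)
print(beta_hat)  # (beta0_hat, beta1_hat, beta2_hat)
```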


List of potential problems

I The relationship between inputs and outputs is nonlinear.


I Correlation of errors. (Time series data, panel data.)
I The variance of the error term is not constant.
I Outliers and high-leverage points.
I Collinear inputs.
I Endogeneity → causal inference issues.
Causal Inference and other pitfalls

I Simpson’s Paradox

I Interpretability and policy recommendations.


Nearest neighbor regression
I Nearest neighbor regression

I N0 is the set of K nearest neighbors to x0

f̂(x0) = (1/K) ∑_{xi ∈ N0} yi
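A minimal K-nearest-neighbor regression sketch for a one-dimensional input, written directly from the formula above (the data and function name are illustrative assumptions):

```python
import numpy as np

def knn_regress(x0, x_train, y_train, k=5):
    """Average y over the k training points whose x is closest to x0."""
    dist = np.abs(x_train - x0)       # 1-D inputs; use a norm in general
    neighbors = np.argsort(dist)[:k]  # indices of the k nearest neighbors
    return y_train[neighbors].mean()

# Toy data (illustrative only): a nonlinear relationship.
rng = np.random.default_rng(6)
x_train = rng.uniform(0, 10, size=200)
y_train = np.sin(x_train) + rng.normal(scale=0.2, size=200)

print(knn_regress(5.0, x_train, y_train, k=10))  # close to sin(5) ≈ -0.96
```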
Comparison of parametric and nonparametric regression

I Curse of dimensionality
