Lecture 2


Big Data and Machine Learning

Lecture Slides 2: Linear Regression

University of Queensland
Outline

I Linear regression.
I Accuracy of linear regression.
I Problems with the linear regression model.
I Nearest-neighbor regression.
I First comparison of parametric (linear regression) vs.
nonparametric (nearest neighbor) learning.
Supervised learning setup
I Given a random variable X , predict another variable Y .
I Example:
I Y = Sales.
I X = Advertising.
I Solution:
I Learn a function f̂ from the data.
I Given input X, output

Y = f̂(X)

I Simplest candidate function: linear function.

f (X ) = β0 + β1 X

I β0 and β1 are called parameters.


I They are unknown constants to be learned.
I This is called a parametric learning problem.
Simple linear regression

I Assume that the relationship between Y and X is given
approximately by

Y ≈ f(X) = β0 + β1 X

I More precisely, the deviation of Y from f(X) is modeled by a
random error (or noise) term ε:

Y = f(X) + ε
  = β0 + β1 X + ε
How to learn the linear regression model?

I Learning f in general is hard:


I For each given x, you should be able to output f(x): infinitely
many values to learn!
I But if f is a line, you only need two points to pin it down!
I Learning f(x) = β0 + β1 x only requires learning β0 and β1.
I Objective:
I Given a sample y1, y2, ..., yn and x1, ..., xn,
I find two values β̂0 and β̂1,
I that best fit the sample.
How to find the best fit?
I Look at a possible line and find the residuals

ei = yi − β̂0 − β̂1 xi

I Magnitude of the residuals:


∑ᵢ₌₁ⁿ ei²

I This is called the sum of squared residuals.


Least Squares Minimization

I Objective function:

Q(β0, β1) = ∑ᵢ₌₁ⁿ (yi − β0 − β1 xi)² = ∑ᵢ₌₁ⁿ εi²

I Find the two numbers β0 and β1 that make Q as small as
possible.
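To make this concrete, here is a minimal Python sketch (not part of the original slides) that minimizes Q numerically on made-up data, assuming numpy and scipy are available; the closed-form least-squares solution appears on the next slides.

```python
import numpy as np
from scipy.optimize import minimize

# Toy data (made up for illustration): y is roughly 2 + 3x plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 + 3.0 * x + rng.normal(scale=1.0, size=50)

def Q(beta):
    """Sum of squared residuals Q(beta0, beta1)."""
    beta0, beta1 = beta
    return np.sum((y - beta0 - beta1 * x) ** 2)

# Numerical minimization of Q; for this quadratic objective the result
# matches the closed-form solution up to numerical tolerance.
result = minimize(Q, x0=[0.0, 0.0])
print(result.x)  # approximately (2, 3)
```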
Linear regression model in matrix form 1

I Matrices of output data, input data and residuals:

y = (y1, ..., yi, ..., yn)ᵀ,   ε = (ε1, ..., εi, ..., εn)ᵀ,
X = the n × 2 matrix whose i-th row is (1, xi).

I Vector of coefficients:

β = (β0, β1)ᵀ
Linear regression model in matrix form 2

I Original equations (i is the observation index)

y1 = β0 + β1 x1 + ε1
⋮
yi = β0 + β1 xi + εi
⋮
yn = β0 + β1 xn + εn

I Equations in matrix form

y = Xβ + ε
Least Squares Minimization 2

I The solution is very simple when written in matrix form.


β̂ = (β̂0, β̂1)ᵀ = (XᵀX)⁻¹Xᵀy

I As β̂ is random, its accuracy is given by its variance:

Var(β̂) = σ²(XᵀX)⁻¹


I σ² is the variance of the error term ε.
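A minimal numpy sketch of these two formulas on made-up data (the data-generating values and variable names are assumptions for illustration only):

```python
import numpy as np

# Toy data (illustrative only).
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 3.0 * x + rng.normal(scale=1.5, size=100)

# Design matrix with a column of ones for the intercept.
X = np.column_stack([np.ones_like(x), x])

# Least-squares estimate: beta_hat = (X'X)^{-1} X'y.
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y

# Estimate sigma^2 from the residuals (n - 2 degrees of freedom here),
# then Var(beta_hat) = sigma^2 (X'X)^{-1}.
residuals = y - X @ beta_hat
sigma2_hat = residuals @ residuals / (len(y) - X.shape[1])
var_beta_hat = sigma2_hat * XtX_inv

print(beta_hat)                        # roughly (2, 3)
print(np.sqrt(np.diag(var_beta_hat)))  # standard errors
```

In practice np.linalg.solve or np.linalg.lstsq is preferred over forming the explicit inverse, but the explicit form mirrors the slide.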
Multiple Linear regression: 2 inputs
I More than one input

I Y = β0 + β1 X1 + β2 X2 + ε
Multiple Linear regression

I Using matrix notation, going to p inputs is straightforward:


X is now the n × (p + 1) matrix whose i-th row is (1, xi,1, ..., xi,p).
I All the formulas in matrix form are the same!

y = Xβ + ε
β̂ = (XᵀX)⁻¹Xᵀy
Var(β̂) = σ²(XᵀX)⁻¹
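The same recipe with p = 2 made-up inputs, to show that only the construction of X changes (a sketch, not the course's official code):

```python
import numpy as np

# Toy data with p = 2 inputs (illustrative only).
rng = np.random.default_rng(2)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(scale=1.0, size=n)

# Same recipe as before: just add more columns to X.
X = np.column_stack([np.ones(n), x1, x2])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)  # solves (X'X) b = X'y
print(beta_hat)  # roughly (1, 2, -0.5)
```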
How good is the regression model?

I Does adding an input Xj make sense?


I How many inputs do we need?
I How good is a linear regression model?
I Is there a better model?
Prediction and fit

I From least squares, we get β̂.


I Given x0 , the prediction is

f̂(x0) = x0 β̂
      = β̂0 + β̂1 x0,1 + · · · + β̂p x0,p
I We can now compute
I Mean squared error. (It is linked to R².)
I Test MSE. (Based on x0 in the test sample.)
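A sketch of training versus test MSE on a made-up train/test split (all names and data here are illustrative, not from the slides):

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares fit; X already contains the intercept column."""
    return np.linalg.solve(X.T @ X, X.T @ y)

def mse(X, y, beta_hat):
    """Mean squared error of predictions X @ beta_hat against y."""
    return np.mean((y - X @ beta_hat) ** 2)

# Toy data split into a training and a test sample (illustrative only).
rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=300)
y = 1.0 + 0.5 * x + rng.normal(scale=1.0, size=300)
X = np.column_stack([np.ones_like(x), x])

train, test = slice(0, 200), slice(200, 300)
beta_hat = fit_ols(X[train], y[train])
print("train MSE:", mse(X[train], y[train], beta_hat))
print("test MSE: ", mse(X[test], y[test], beta_hat))
```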
Special input data
I Example: a binary input:

studenti = 1 if i is a student, 0 otherwise
I Regression function:

balancei = β0 + β1 incomei + β2 studenti + εi
         = (β0 + β2) + β1 incomei + εi   if i is a student
         = β0 + β1 incomei + εi          otherwise
Discrete variable coding
I Example: a discrete input in the regression of Y on X

X ∈ {red, blue, green}

I Binary variable coding:


I Z1 = 1 if X = red and 0 otherwise.
I Z2 = 1 if X = blue and 0 otherwise.
I Z3 = 1 if X = green and 0 otherwise.
I Dummy variable trap
I Regression Y = β0 + β1 Z1 + β2 Z2 + β3 Z3 + ε
I 4 unknowns with three equations: Identification problem.
I E [Y |X = red] = β0 + β1
I E [Y |X = blue] = β0 + β2
I E [Y |X = green] = β0 + β3
I Drop one of the three variables Z1 , Z2 and Z3 .
Discrete variable coding (continued)

I Drop Z1 and use red as a base category.


I Contrast function:
        red  blue  green
red      1    0     0
blue     0    1     0
green    0    0     1
I 3 unknowns with three equations in regression
Y = β0 + β2 Z2 + β3 Z3 + ε.
I E [Y |X = red] = β0
I E [Y |X = blue] = β0 + β2
I E [Y |X = green] = β0 + β3
I Interpretation of coefficients: deviations from base category.
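As an illustration of base-category coding, the sketch below assumes pandas is available and uses pd.get_dummies with drop_first to drop the red indicator; the data and effect sizes are made up:

```python
import numpy as np
import pandas as pd

# Toy data with a three-level categorical input (illustrative only).
rng = np.random.default_rng(4)
color = rng.choice(["red", "blue", "green"], size=300)
effect = {"red": 0.0, "blue": 2.0, "green": -1.0}
y = 5.0 + np.array([effect[c] for c in color]) + rng.normal(size=300)

# Base-category coding: list "red" first so drop_first makes it the base.
Z = pd.get_dummies(pd.Categorical(color, categories=["red", "blue", "green"]),
                   drop_first=True).astype(float)
X = np.column_stack([np.ones(len(y)), Z.to_numpy()])

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(Z.columns.tolist())  # ['blue', 'green']
print(beta_hat)            # approx (5, 2, -1): intercept = E[Y | red],
                           # the others are deviations from the red base
```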
Discrete variable coding (continued)
I Drop the variable representing X = green and use sum
contrasts coding.
I Contrast function:
        red  blue
red      1    0
blue     0    1
green   -1   -1
I 3 unknowns with three equations in regression
Y = β0 + β1 Z1 + β2 Z2 + ε.
I E [Y |X = red] = β0 + β1
I E [Y |X = blue] = β0 + β2
I E [Y |X = green] = β0 − β1 − β2
I Interpretation of coefficients: Average effect

(E[Y|X = red] + E[Y|X = blue] + E[Y|X = green]) / 3 = 3β0 / 3 = β0
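A corresponding sketch of sum-contrast coding built by hand on the same kind of made-up data (again an illustration, not the slides' code):

```python
import numpy as np

# Toy data: three color groups with different means (illustrative only).
rng = np.random.default_rng(5)
color = rng.choice(["red", "blue", "green"], size=300)
effect = {"red": 0.0, "blue": 2.0, "green": -1.0}
y = 5.0 + np.array([effect[c] for c in color]) + rng.normal(size=300)

# Sum contrasts: Z1 = 1 for red, -1 for green, 0 otherwise; Z2 likewise for blue.
z1 = np.where(color == "red", 1.0, np.where(color == "green", -1.0, 0.0))
z2 = np.where(color == "blue", 1.0, np.where(color == "green", -1.0, 0.0))
X = np.column_stack([np.ones(len(y)), z1, z2])

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # beta0 ≈ average of the three group means (≈ 5.33 here)
```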
Polynomial regression
I Example from the Auto data: a degree-2 polynomial.

mpgi = β0 + β1 horsepoweri + β2 horsepoweri² + εi
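A sketch of the degree-2 fit; it assumes a file Auto.csv with mpg and horsepower columns is available locally (the filename and loading step are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical loading step: adjust the path to wherever the Auto data lives.
auto = pd.read_csv("Auto.csv", na_values="?").dropna()

hp = auto["horsepower"].to_numpy(dtype=float)
mpg = auto["mpg"].to_numpy(dtype=float)

# A degree-2 polynomial regression is still linear in the parameters:
# just add horsepower^2 as an extra column of X.
X = np.column_stack([np.ones_like(hp), hp, hp ** 2])
beta_hat = np.linalg.solve(X.T @ X, X.T @ mpg)
print(beta_hat)  # (beta0_hat, beta1_hat, beta2_hat)
```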


List of potential problems

I The relationship between inputs and outputs is nonlinear.


I Correlation of errors. (Time series data, panel data.)
I The variance of the error term is not constant.
I Outliers and high-leverage points.
I Collinear inputs.
I Endogeneity → causal inference issues.
Causal Inference and other pitfalls

I Simpson’s Paradox

I Interpretability and policy recommendations.


Nearest neighbor regression
I Nearest neighbor regression

I N0 is the set of K nearest neighbors to x0

f̂(x0) = (1/K) ∑_{xi ∈ N0} yi
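A minimal K-nearest-neighbor regression sketch for a one-dimensional input, written directly from the formula above (the data and function name are illustrative assumptions):

```python
import numpy as np

def knn_regress(x0, x_train, y_train, k=5):
    """Average y over the k training points whose x is closest to x0."""
    dist = np.abs(x_train - x0)       # 1-D inputs; use a norm in general
    neighbors = np.argsort(dist)[:k]  # indices of the k nearest neighbors
    return y_train[neighbors].mean()

# Toy data (illustrative only): a nonlinear relationship.
rng = np.random.default_rng(6)
x_train = rng.uniform(0, 10, size=200)
y_train = np.sin(x_train) + rng.normal(scale=0.2, size=200)

print(knn_regress(5.0, x_train, y_train, k=10))  # close to sin(5) ≈ -0.96
```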
Comparison of parametric and nonparametric regression

I Curse of dimensionality
