Lecture 3


ECE553/653 Neural Networks

Linear Regression

1
Previously …
• Linear regression problem

• Feature maps

• Overfitting and underfitting


– Training/test dataset split
– Capacity of model family
– Bias-variance tradeoff

2
Linear Regression Problem
• Input: Dataset $Z = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, where $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$
• Output: A linear function $f_\beta(x) = \beta^T x$ that minimizes the MSE:

$$L(\beta; Z) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \beta^T x_i)^2$$

3
Feature Maps
General strategy:
• Model family $F = \{f_\beta\}_\beta$
• Loss function $L(\beta; Z)$

Linear regression with feature map:
• Linear functions over a given feature map $\phi: \mathcal{X} \to \mathbb{R}^d$: $F = \{f_\beta(x) = \beta^T \phi(x)\}$
• MSE: $L(\beta; Z) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \beta^T \phi(x_i))^2$

4
Bias-Variance Tradeoff
• Capacity of a model family captures
“complexity” of data it can fit
– Higher capacity -> more likely to overfit (model
family has high variance)
– Lower capacity -> more likely to underfit (model
family has high bias)
• For linear regression, capacity corresponds to
feature dimension

5
Bias-Variance Tradeoff

6
Bias-Variance Tradeoff

7
Example of Underfitting/Overfitting
• Exploratory Data Analysis

8
Example of Underfitting/Overfitting
• Exploratory Data Analysis

9
Example of Underfitting/Overfitting
• Using 'sklearn.preprocessing.MinMaxScaler' for data normalization:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$
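A minimal sketch of this normalization with MinMaxScaler; the small feature arrays below are hypothetical:

```python
# A minimal sketch of min-max normalization with scikit-learn.
# The small feature arrays below are hypothetical.
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[60.0], [70.0], [80.0], [85.0], [95.0]])
X_test = np.array([[75.0]])

scaler = MinMaxScaler()                          # rescales each feature to [0, 1]
X_train_scaled = scaler.fit_transform(X_train)   # x_min and x_max are computed from the training data
X_test_scaled = scaler.transform(X_test)         # the same transformation is reused at test time
```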

10
Example of Underfitting/Overfitting
• Exploratory Data Analysis

11
Example of Underfitting/Overfitting
• Exploratory Data Analysis

12
Example of Underfitting/Overfitting
• Linear regression uses only the first-order features $x$:
$y = w^T x + b$
• Polynomial regression uses higher-order combination features $x'$ of $x$:
$y = w^T x' + b$
• For example, the degree-2 polynomial features of $x = [x_1, x_2]^T$ are $x' = [1, x_1, x_2, x_1^2, x_1 x_2, x_2^2]^T$.
13
Example of Underfitting/Overfitting
• We have features $x = (x_1, x_2)$ to predict $y$.
• In linear regression, we have

$$w^* = \arg\min_w \left(y - [w_1, w_2]\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} - b\right)^2$$

• In polynomial regression, if we use the degree-2 polynomial features of $x$, we have

$$w^* = \arg\min_w \left(y - [w_1, w_2, \ldots, w_6]\begin{bmatrix} 1 \\ x_1 \\ x_2 \\ x_1^2 \\ x_1 x_2 \\ x_2^2 \end{bmatrix} - b\right)^2$$
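A minimal sketch of this degree-2 feature expansion using scikit-learn's PolynomialFeatures:

```python
# A minimal sketch of degree-2 polynomial feature expansion with scikit-learn.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0]])              # one example with features (x1, x2)
poly = PolynomialFeatures(degree=2)     # includes the bias feature 1 by default
X_poly = poly.fit_transform(X)
print(X_poly)                           # [[1. 2. 3. 4. 6. 9.]] -> [1, x1, x2, x1^2, x1*x2, x2^2]
```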

14
Example of Underfitting/Overfitting

• Degree-1

• Degree-3

• Degree-6
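The plots on this slide are not reproduced here; as a rough sketch, fits like these could be generated on a synthetic 1-D dataset (the data-generating function below is an illustrative assumption):

```python
# A sketch comparing polynomial degrees 1, 3, and 6 on a synthetic 1-D dataset
# (the data-generating function here is an illustrative assumption).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=(30, 1))
y = np.sin(2 * np.pi * x).ravel() + 0.1 * rng.standard_normal(30)

for degree in (1, 3, 6):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x, y)
    print(degree, model.score(x, y))    # training R^2 increases with degree, even when overfitting
```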

15
Agenda
• Regularization
– Strategy to address the bias-variance tradeoff
– By example: Linear regression with 𝐿2
regularization

• Minimizing the MSE Loss


– Closed-form solution
– Gradient descent

16
Recall: Mean Squared Error Loss
• Mean squared error loss for linear regression:

$$L(\beta; Z) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \beta^T x_i)^2$$

17
Linear Regression with 𝐿2 Regularization
• Original loss + regularization:

$$L(\beta; Z) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \beta^T x_i)^2 + \lambda \cdot \|\beta\|_2^2 = \frac{1}{n} \sum_{i=1}^{n} (y_i - \beta^T x_i)^2 + \lambda \sum_{j=1}^{D} \beta_j^2$$

• $\lambda$ is a hyperparameter that must be tuned (satisfies $\lambda \geq 0$)
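As a concrete sketch, the regularized loss above can be minimized in closed form by setting its gradient to zero; the data and the value of lam in the snippet are synthetic assumptions (scikit-learn's Ridge fits the same kind of model with a rescaled penalty):

```python
# A minimal sketch of L2-regularized linear regression in closed form with NumPy
# (the synthetic data and the value of lam below are assumptions for illustration).
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))                                   # n = 50 examples, d = 5 features
Y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.standard_normal(50)

lam = 0.1                                                          # regularization strength, lambda >= 0
n, d = X.shape
# Setting the gradient of (1/n)||Y - X beta||^2 + lam * ||beta||^2 to zero gives
# (X^T X / n + lam * I) beta = X^T Y / n:
beta_hat = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ Y / n)
```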

18
Intuition on 𝐿2 Regularization
• Equivalently, the regularizer is the squared $L_2$ norm of $\beta$:

$$\lambda \sum_{j=1}^{D} \beta_j^2 = \lambda \|\beta\|_2^2 = \lambda \|\beta - 0\|_2^2$$

• "Pulling" $\beta$ to zero
– "Pulls" more as $\lambda$ becomes larger

19
Intuition on 𝐿2 Regularization
• Why does it help?
– Encourages “simple” functions
– As 𝜆 → ∞, obtain 𝛽 = 0
– Use 𝜆 to tune bias-variance tradeoff

20
Bias-Variance Tradeoff for Regularization

21
Intuition on 𝐿2 Regularization
[Figure: contours of the original loss in the $(\beta_1, \beta_2)$ plane, together with the $L_2$ regularization term centered at the origin.]
• One point minimizes the original loss (equivalently, the full loss when $\lambda = 0$); the regularization term alone is minimized at $\beta = 0$ (the limit $\lambda \to \infty$).
• The loss varies greatly in one direction, so that direction is penalized more.
• The full loss is minimized at the point where the gradients of the two terms balance; the tradeoff depends on the choice of $\lambda$.
22
Feature Standardization
• Unregularized linear regression is invariant to
feature scaling
– Suppose we scale $x_{ij} \to 2x_{ij}$ for all examples
– Without regularization, simply use $\beta_j \to \beta_j / 2$
• Not true for regularized regression
– The penalty $(\beta_j / 2)^2$ is scaled by $1/4$

23
Feature Standardization
• Solution: Rescale features to zero mean and
unit variance

• Must use the same transformation during training and for prediction
– Compute the standardization on the training data and apply it to the test data
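A minimal sketch with scikit-learn's StandardScaler; the arrays are hypothetical, and the key point is that the scaler is fit on the training data only:

```python
# A minimal sketch of standardizing features to zero mean and unit variance
# (the small arrays are hypothetical).
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[95.0], [85.0], [80.0], [70.0], [60.0]])
X_test = np.array([[75.0]])

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)   # mean and variance are estimated on the training data only
X_test_std = scaler.transform(X_test)         # the same transformation is applied at prediction time
```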

24
General Regularization Strategy
• Original loss + regularization:

$$L_{\text{reg}}(\beta; Z) = L(\beta; Z) + \lambda \cdot R(\beta)$$

– Offers a way to express a preference for "simple" functions in the family
– Typically, the regularizer $R$ is independent of the data

25
Hyperparameter Tuning
• 𝜆 is a hyperparameter that must be tuned
(satisfies 𝜆 ≥ 0 )
• Naïve strategy: Try a few different candidates
𝜆 and choose the one that minimizes the test
loss
• Problem: We may overfit the test set!
– Major problem if we have more hyperparameters

26
Training/Validation/Test Split
• Goal: Choose best hyperparameter 𝜆
– Can also compare different model families, feature
maps, etc.
• Solution: Optimize $\lambda$ on a validation dataset
– Rule of thumb: 60/20/20 split

Given data 𝑍

Training data 𝑍train Val data 𝑍val Test data 𝑍test


27
Basic Cross Validation Algorithm
• Step 1: Split 𝑍 into 𝑍train, 𝑍val and 𝑍test

Training data 𝑍train Val data 𝑍val Test data 𝑍test

• Step 2: For $t \in \{1, \ldots, h\}$:
– Step 2a: Run linear regression with $Z_{\text{train}}$ and $\lambda_t$ to obtain $\hat{\beta}(Z_{\text{train}}, \lambda_t)$
– Step 2b: Evaluate the validation loss $L_t^{\text{val}} = L(\hat{\beta}(Z_{\text{train}}, \lambda_t); Z_{\text{val}})$
• Step 3: Use the best $\lambda_t$
– Choose $t' = \arg\min_t L_t^{\text{val}}$, i.e., the candidate with the lowest validation loss
– Re-run linear regression with $Z_{\text{train}}$ and $\lambda_{t'}$ to obtain $\hat{\beta}(Z_{\text{train}}, \lambda_{t'})$

28
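A minimal sketch of this procedure, reusing the ridge closed form from the earlier snippet; the dataset and the candidate values of lambda are hypothetical:

```python
# A minimal sketch of the training/validation/test procedure above, using the
# ridge closed form from earlier; the data and candidate lambdas are hypothetical.
import numpy as np

def ridge_fit(X, Y, lam):
    n, d = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ Y / n)

def mse(beta, X, Y):
    return np.mean((Y - X @ beta) ** 2)

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
Y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.standard_normal(100)

# Step 1: 60/20/20 train/validation/test split
X_tr, Y_tr = X[:60], Y[:60]
X_val, Y_val = X[60:80], Y[60:80]
X_te, Y_te = X[80:], Y[80:]

# Steps 2-3: fit one model per candidate lambda, keep the one with the lowest validation loss
lambdas = [0.0, 0.01, 0.1, 1.0, 10.0]
val_losses = [mse(ridge_fit(X_tr, Y_tr, lam), X_val, Y_val) for lam in lambdas]
best_lam = lambdas[int(np.argmin(val_losses))]
beta_hat = ridge_fit(X_tr, Y_tr, best_lam)
print(best_lam, mse(beta_hat, X_te, Y_te))    # test loss is evaluated only once, at the very end
```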
Alternative Cross-Validation Algorithms
• If $Z$ is small, then splitting it can reduce performance
– In Step 3, can re-train using both $Z_{\text{train}}$ and $Z_{\text{val}}$
• Alternative: $k$-fold cross-validation (e.g., $k = 3$), as sketched below
– Split $Z$ into $Z_{\text{train}}$ and $Z_{\text{test}}$
– Split $Z_{\text{train}}$ into $k$ disjoint sets $Z_{\text{val}}^{s}$
– Use the $\lambda'$ that works best on average across the folds $s \in \{1, \ldots, k\}$ of $Z_{\text{train}}$
– Typically chooses a better $\lambda'$ than the strategy above

29
Example: 3-Fold Cross Validation
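A minimal sketch of what 3-fold cross-validation over candidate lambdas might look like with scikit-learn's KFold and Ridge; the data and the candidate values are hypothetical:

```python
# A minimal sketch of 3-fold cross-validation for choosing lambda with scikit-learn
# (data and candidate lambdas are hypothetical; Ridge uses an equivalent L2 penalty).
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.standard_normal((90, 5))
Y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.standard_normal(90)

lambdas = [0.01, 0.1, 1.0, 10.0]
kf = KFold(n_splits=3, shuffle=True, random_state=0)

avg_losses = []
for lam in lambdas:
    fold_losses = []
    for train_idx, val_idx in kf.split(X):
        model = Ridge(alpha=lam).fit(X[train_idx], Y[train_idx])
        fold_losses.append(np.mean((Y[val_idx] - model.predict(X[val_idx])) ** 2))
    avg_losses.append(np.mean(fold_losses))       # average validation loss over the 3 folds

best_lam = lambdas[int(np.argmin(avg_losses))]
final_model = Ridge(alpha=best_lam).fit(X, Y)     # re-fit on all of Z_train with the chosen lambda
```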

30
𝐿1 Regularization
• Can we minimize $\|\beta\|_0 = |\{j : \beta_j \neq 0\}|$?
– That is, the number of nonzero components of $\beta$
– Improves interpretability (automatic feature selection!)
– Also serves as a strong regularizer
• Challenge: $\|\beta\|_0$ is not differentiable, making it hard to optimize
– $L_1$ regularization instead penalizes $\|\beta\|_1 = \sum_j |\beta_j|$, a convex surrogate that still encourages sparsity

31
Intuition on 𝐿1 Regularization
[Figure: contours of the original loss in the $(\beta_1, \beta_2)$ plane, together with the diamond-shaped $L_1$ ball centered at the origin.]
• One point minimizes the original loss (equivalently, the full loss when $\lambda = 0$); the regularization term alone is minimized at $\beta = 0$ (the limit $\lambda \to \infty$).
• The minimizer of the full loss lies at a corner of the diamond → sparse ($\beta_1 = 0$)!

$$L(\beta; Z) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \beta^T x_i)^2 + \lambda \sum_{j=1}^{D} |\beta_j|$$

32
𝐿1 Regularization for Feature Selection
• Step 1: Construct a lot of features and add to
feature map
• Step 2: Use 𝐿1 regularized regression to
“select” subset of features
– I.e., coefficient $\beta_j \neq 0$ → feature $j$ is selected
• Optional: Remove unselected features from
the feature map and run vanilla linear
regression
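A minimal sketch of $L_1$-regularized (lasso) regression used this way; the synthetic data below assumes only two of ten features are relevant:

```python
# A minimal sketch of L1-regularized regression (lasso) used for feature selection.
# The synthetic data has 10 features, only 2 of which actually matter.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
true_beta = np.zeros(10)
true_beta[[2, 7]] = [3.0, -1.5]
Y = X @ true_beta + 0.1 * rng.standard_normal(100)

lasso = Lasso(alpha=0.1).fit(X, Y)
selected = np.nonzero(lasso.coef_)[0]   # features j with coefficient beta_j != 0 are "selected"
print(selected)                         # ideally recovers features 2 and 7
```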

33
Agenda
• Regularization
– Strategy to address the bias-variance tradeoff
– By example: Linear regression with 𝐿2
regularization

• Minimizing the MSE Loss


– Closed-form solution
– Gradient descent

34
Minimizing the MSE Loss
• Recall that linear regression minimizes the loss
$$L(\beta; Z) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \beta^T x_i)^2$$

• Closed-form solution: Compute using matrix operations
• Optimization-based solution: Search over candidate $\beta$

35
Vectorizing Linear Regression

$$Y \approx X\beta, \qquad Y = \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix}, \quad X = \begin{bmatrix} x_{1,1} & \cdots & x_{1,d} \\ \vdots & \ddots & \vdots \\ x_{n,1} & \cdots & x_{n,d} \end{bmatrix}, \quad \beta = \begin{bmatrix} \beta_1 \\ \vdots \\ \beta_d \end{bmatrix}$$

36
Vectorizing Mean Squared Error
$$\begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix} \approx \begin{bmatrix} f_\beta(x_1) \\ \vdots \\ f_\beta(x_n) \end{bmatrix}$$

$$L(\beta; Z) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \beta^T x_i)^2 = \frac{1}{n} \|Y - X\beta\|_2^2, \qquad \text{where } \|z\|_2^2 = \sum_{i=1}^{n} z_i^2$$

37
Strategy 1: Closed-Form Solution

• The gradient is

$$\nabla_\beta L(\beta; Z) = \nabla_\beta \frac{1}{n} \|Y - X\beta\|_2^2 = \frac{1}{n} \nabla_\beta (Y - X\beta)^T (Y - X\beta) = -\frac{2}{n} X^T (Y - X\beta) = -\frac{2}{n} X^T Y + \frac{2}{n} X^T X \beta$$

38
Strategy 1: Closed-Form Solution
• The gradient is

$$\nabla_\beta L(\beta; Z) = \nabla_\beta \frac{1}{n} \|Y - X\beta\|_2^2 = -\frac{2}{n} X^T Y + \frac{2}{n} X^T X \beta$$

• Setting $\nabla_\beta L(\beta; Z) = 0$, we have $X^T X \hat{\beta} = X^T Y$

39
Strategy 1: Closed-Form Solution
• Setting $\nabla_\beta L(\beta; Z) = 0$, we have $X^T X \hat{\beta} = X^T Y$

• Assuming $X^T X$ is invertible, we have

$$\hat{\beta}(Z) = (X^T X)^{-1} X^T Y$$
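A minimal sketch of this closed-form solution in NumPy on synthetic data; solving the normal equations directly is generally preferred over forming the inverse explicitly:

```python
# A minimal sketch of the closed-form solution with NumPy on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
Y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.standard_normal(100)

# Solve the normal equations X^T X beta = X^T Y; np.linalg.solve avoids forming the inverse explicitly
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
# A more numerically robust alternative: np.linalg.lstsq(X, Y, rcond=None)[0]
```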

40
Closed-Form Solution for Vanilla Regression Model

• Given $L(\beta; Z) = \sum_{i=1}^{n} (y_i - x_i\beta_1 - \beta_2)^2$, we have

$$\beta_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \beta_2 = \bar{y} - \beta_1 \bar{x}$$

where $\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$ and $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$.

41
Proof
• Obtain $\beta_2$:

$$L(\beta; Z) = \sum_{i=1}^{n} (y_i - x_i\beta_1 - \beta_2)^2$$
$$\Rightarrow \frac{\partial L(\beta; Z)}{\partial \beta_2} = \sum_{i=1}^{n} 2 (y_i - x_i\beta_1 - \beta_2)(-1) = 0$$
$$\Rightarrow \sum_{i=1}^{n} (y_i - \beta_1 x_i - \beta_2) = 0$$
$$\Rightarrow \sum_{i=1}^{n} y_i - \beta_1 \sum_{i=1}^{n} x_i - \sum_{i=1}^{n} \beta_2 = 0$$
$$\Rightarrow \beta_2 = \frac{1}{n} \sum_{i=1}^{n} y_i - \beta_1 \frac{1}{n} \sum_{i=1}^{n} x_i = \bar{y} - \beta_1 \bar{x}$$
42
Proof
• Obtain $\beta_1$:

$$\frac{\partial L(\beta; Z)}{\partial \beta_1} = \sum_{i=1}^{n} 2 (y_i - x_i\beta_1 - \beta_2)(-x_i) = 0$$
$$\Rightarrow \sum_{i=1}^{n} (y_i x_i - \beta_1 x_i^2 - \beta_2 x_i) = 0$$

43
Proof
• Obtain $\beta_1$:

$$\frac{\partial L(\beta; Z)}{\partial \beta_1} = \sum_{i=1}^{n} 2 (y_i - x_i\beta_1 - \beta_2)(-x_i) = 0$$
$$\Rightarrow \sum_{i=1}^{n} (y_i x_i - \beta_1 x_i^2 - \beta_2 x_i) = 0$$

• Substituting $\beta_2 = \bar{y} - \beta_1 \bar{x}$:

$$\Rightarrow \sum_{i=1}^{n} (y_i x_i - \beta_1 x_i^2 - \bar{y} x_i + \beta_1 \bar{x} x_i) = 0$$
$$\Rightarrow \sum_{i=1}^{n} (y_i x_i - \bar{y} x_i) - \beta_1 \sum_{i=1}^{n} (x_i^2 - \bar{x} x_i) = 0$$
$$\Rightarrow \beta_1 = \frac{\sum_{i=1}^{n} (y_i x_i - \bar{y} x_i)}{\sum_{i=1}^{n} (x_i^2 - \bar{x} x_i)}$$

44
Proof
• Obtain $\beta_1$ (continuing from the previous slide):

$$\beta_1 = \frac{\sum_{i=1}^{n} (y_i x_i - \bar{y} x_i)}{\sum_{i=1}^{n} (x_i^2 - \bar{x} x_i)}$$

Since $\sum_{i=1}^{n} x_i \bar{y} = n\bar{x}\bar{y}$ and $\sum_{i=1}^{n} x_i \bar{x} = n\bar{x}^2$, we have ...

45
Proof
• Obtain $\beta_1$ (continued):

$$\beta_1 = \frac{\sum_{i=1}^{n} (y_i x_i - \bar{y} x_i)}{\sum_{i=1}^{n} (x_i^2 - \bar{x} x_i)}$$

Since $\sum_{i=1}^{n} y_i \bar{x} = n\bar{x}\bar{y}$ and $\sum_{i=1}^{n} x_i \bar{x} = n\bar{x}^2$, we have

$$\beta_1 = \frac{\sum_{i=1}^{n} (y_i x_i - \bar{y} x_i - \bar{x} y_i + \bar{x}\bar{y})}{\sum_{i=1}^{n} (x_i^2 - \bar{x} x_i - \bar{x} x_i + \bar{x}^2)} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$
46
Example

• Five randomly selected students took a math


test before they began their statistics course.

• The Statistics Department has two questions.


– What linear regression equation best predicts
statistics performance, based on test scores?

– If a student made an 80 on the test, what grade


would we expect her to make in statistics?
Example

• $x_i$ is the score on the test.
• $y_i$ is the statistics grade.
• $y_i = \beta_1 x_i + \beta_2$
Student 𝒙𝒊 𝒚𝒊
1 95 85
2 85 95
3 80 70
4 70 65
5 60 70
Sum 390 385
Mean 78 77
Example

• $x_i$ is the score on the aptitude test.
• $y_i$ is the statistics grade.

$$\beta_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \beta_2 = \bar{y} - \beta_1 \bar{x}$$

Student 𝒙𝒊 𝒚𝒊
1 95 85
2 85 95
3 80 70
4 70 65
5 60 70
Sum 390 385
Mean 78 77
Example

• $x_i$ is the score on the aptitude test.
• $y_i$ is the statistics grade.

$$\beta_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \beta_2 = \bar{y} - \beta_1 \bar{x}$$

Student   x_i   y_i   x_i − x̄   y_i − ȳ
1 95 85 17 8
2 85 95 7 18
3 80 70 2 -7
4 70 65 -8 -12
5 60 70 -18 -7
Sum 390 385
Mean 78 77
Example

• $x_i$ is the score on the aptitude test.
• $y_i$ is the statistics grade.

$$\beta_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \beta_2 = \bar{y} - \beta_1 \bar{x}$$

Student   x_i   y_i   (x_i − x̄)(y_i − ȳ)
1 95 85 136
2 85 95 126
3 80 70 -14
4 70 65 96
5 60 70 126
Sum 390 385 470
Mean 78 77
Example

• $x_i$ is the score on the aptitude test.
• $y_i$ is the statistics grade.

$$\beta_1 = \frac{470}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \beta_2 = \bar{y} - \beta_1 \bar{x}$$

Student   x_i   y_i   (x_i − x̄)(y_i − ȳ)
1 95 85 136
2 85 95 126
3 80 70 -14
4 70 65 96
5 60 70 126
Sum 390 385 470
Mean 78 77
Example

• $x_i$ is the score on the aptitude test.
• $y_i$ is the statistics grade.

$$\beta_1 = \frac{470}{730} = 0.644, \qquad \beta_2 = \bar{y} - \beta_1 \bar{x}$$

Student   x_i   y_i   (x_i − x̄)(y_i − ȳ)   (x_i − x̄)²
1 95 85 136 289
2 85 95 126 49
3 80 70 -14 4
4 70 65 96 64
5 60 70 126 324
Sum 390 385 470 730
Mean 78 77
Example

• $x_i$ is the score on the aptitude test.
• $y_i$ is the statistics grade.

$$\beta_1 = \frac{470}{730} = 0.644, \qquad \beta_2 = 77 - 0.644 \times 78 = 26.768$$

Student   x_i   y_i   (x_i − x̄)(y_i − ȳ)   (x_i − x̄)²
1 95 85 136 289
2 85 95 126 49
3 80 70 -14 4
4 70 65 96 64
5 60 70 126 324
Sum 390 385 470 730
Mean 78 77
Example

• What linear regression equation best predicts


statistics performance, based on test scores?

$$y_i = \beta_1 x_i + \beta_2, \qquad \beta_1 = \frac{470}{730} = 0.644, \qquad \beta_2 = 77 - 0.644 \times 78 = 26.768$$
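As a quick sanity check, the computation above can be reproduced in a few lines of NumPy:

```python
# A quick NumPy check of the worked example above (five test scores and statistics grades).
import numpy as np

x = np.array([95.0, 85.0, 80.0, 70.0, 60.0])
y = np.array([85.0, 95.0, 70.0, 65.0, 70.0])

beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta2 = y.mean() - beta1 * x.mean()
print(beta1, beta2)            # approximately 0.644 and 26.768
print(beta1 * 80 + beta2)      # predicted statistics grade for a test score of 80, about 78.3
```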
Example

• If a student made an 80 on the test, what


grade would we expect her to make in
statistics?
$$y = 0.644 \times 80 + 26.768 = 78.288$$
– Note: do not use values for the independent
variable that are outside the range of values
used to create the equation.
Note on Invertibility
• Closed-form solution only unique if 𝑋 𝑇 𝑋 is
invertible
– Otherwise, multiple solutions exist
– Example:

$$\begin{bmatrix} 1 & 1 \\ 2 & 2 \end{bmatrix} \begin{bmatrix} \hat{\beta}_1 \\ \hat{\beta}_2 \end{bmatrix} = \begin{bmatrix} 2 \\ 4 \end{bmatrix}$$

– Any $\hat{\beta}$ with $\hat{\beta}_2 = 2 - \hat{\beta}_1$ is a solution

57
When Can This Happen?
• Case 1: Fewer data examples than feature
dimension (i.e., 𝑛 < 𝑑)
– Solution 1: Remove features so 𝑑 ≤ 𝑛
– Solution 2: Collect more data until 𝑑 ≤ 𝑛

• Case 2: Some feature is a linear combination


of the others
– Solution 1: Remove linearly dependent features
– Solution 2: Use 𝐿2 regularization

58
Shortcomings of Closed-Form Solution

• Computing $\hat{\beta}(Z) = (X^T X)^{-1} X^T Y$ can be challenging

• Computing $(X^T X)^{-1}$ is $O(d^3)$
– $d = 10^4$ features → $O(10^{12})$ operations
– Even storing $X^T X$ requires a lot of memory

59
Shortcomings of Closed-Form Solution
• Numerical accuracy issues due to “ill-
conditioning”
– $X^T X$ is "barely" invertible
– Then, $(X^T X)^{-1}$ has large variance along some dimension
– Regularization helps

60
Agenda
• Regularization
– Strategy to address the bias-variance tradeoff
– By example: Linear regression with 𝐿2
regularization

• Minimizing the MSE Loss


– Closed-form solution
– Gradient descent

61
