Lecture 3
Linear Regression
Previously …
• Linear regression problem
• Feature maps
Linear Regression Problem
• Input: Dataset $Z = \{(x_1, y_1), \dots, (x_n, y_n)\}$, where $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$
• Output: A linear function $f_\beta(x) = \beta^T x$ that minimizes the MSE:

$$L(\beta; Z) = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - \beta^T x_i\right)^2$$
Feature Maps
General strategy:
• Model family: $F = \{f_\beta\}_\beta$
• Loss function: $L(\beta; Z)$

Linear regression with a feature map:
• Linear functions over a given feature map $\phi: \mathcal{X} \to \mathbb{R}^d$, i.e., $f_\beta(x) = \beta^T \phi(x)$
• MSE: $L(\beta; Z) = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - \beta^T \phi(x_i)\right)^2$
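As an illustration, here is a minimal sketch (not the lecture's code) of linear regression over a hand-built feature map; the particular map $\phi(x) = [x, x^2, \sin x]$ and the synthetic data are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def phi(x):
    # Hypothetical feature map: phi(x) = [x, x^2, sin(x)] for scalar x
    return np.column_stack([x, x ** 2, np.sin(x)])

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=100)
y = 0.5 * x ** 2 + np.sin(x) + rng.normal(scale=0.1, size=100)

# Linear regression in the feature space: f_beta(x) = beta^T phi(x)
model = LinearRegression().fit(phi(x), y)
print(model.coef_, model.intercept_)
```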
Bias-Variance Tradeoff
• Capacity of a model family captures
“complexity” of data it can fit
– Higher capacity -> more likely to overfit (model
family has high variance)
– Lower capacity -> more likely to underfit (model
family has high bias)
• For linear regression, capacity corresponds to
feature dimension
Example of Underfitting/Overfitting
• Exploratory Data Analysis
Example of Underfitting/Overfitting
• Using 'sklearn.preprocessing.MinMaxScaler' for data normalization:

$$x \leftarrow \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$
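A minimal sketch of min-max scaling with scikit-learn; the toy feature matrix below is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up feature matrix with two columns on different scales
X = np.array([[95.0, 1.2],
              [85.0, 3.4],
              [80.0, 0.7],
              [70.0, 2.9],
              [60.0, 1.8]])

scaler = MinMaxScaler()              # rescales each column to [0, 1]
X_scaled = scaler.fit_transform(X)   # (x - x_min) / (x_max - x_min) per column
print(X_scaled)
```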
Example of Underfitting/Overfitting
• Linear regression uses only the first-order features $x$:
  $$y = w^T x + b$$
• Polynomial regression uses higher-order combination features $x'$ of $x$:
  $$y = w^T x' + b$$
  For example, the degree-2 polynomial features of $x = (x_1, x_2)^T$ are $x' = [1, x_1, x_2, x_1^2, x_1 x_2, x_2^2]^T$.
Example of Underfitting/Overfitting
• We have features $x = (x_1, x_2)$ to predict $y$.
• In linear regression, we have
  $$w^* = \arg\min_w \left(y - [w_1, w_2]\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} - b\right)^2$$
• In polynomial regression, if we use the degree-2 polynomial features of $x$, we have
  $$w^* = \arg\min_w \left(y - [w_1, w_2, w_3, \dots, w_6]\begin{bmatrix} 1 \\ x_1 \\ x_2 \\ x_1^2 \\ x_1 x_2 \\ x_2^2 \end{bmatrix} - b\right)^2$$
Example of Underfitting/Overfitting
• Degree-1
• Degree-3
• Degree-6
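A minimal sketch (not the lecture's exact code) comparing polynomial fits of degree 1, 3, and 6 on synthetic 1-D data; the low degree tends to underfit and the high degree tends to overfit.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=30)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=30)  # noisy nonlinear data
X = x.reshape(-1, 1)

for degree in (1, 3, 6):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X, y)
    print(f"degree={degree}: train MSE = {mean_squared_error(y, model.predict(X)):.3f}")
```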
Agenda
• Regularization
  – Strategy to address the bias-variance tradeoff
  – By example: Linear regression with $L_2$ regularization
Recall: Mean Squared Error Loss
• Mean squared error loss for linear regression:

$$L(\beta; Z) = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - \beta^T x_i\right)^2$$
Linear Regression with 𝐿2 Regularization
• Original loss + regularization:

$$L(\beta; Z) = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - \beta^T x_i\right)^2 + \lambda \cdot \|\beta\|_2^2
             = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - \beta^T x_i\right)^2 + \lambda \sum_{j=1}^{D} \beta_j^2$$
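A minimal sketch of this objective using scikit-learn's Ridge on synthetic data; note that Ridge minimizes $\|Y - X\beta\|_2^2 + \alpha\|\beta\|_2^2$ (no $1/n$ averaging), so its alpha plays the role of $n\lambda$ in the loss above.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
beta_true = np.array([1.0, -2.0, 0.0, 0.5, 3.0])
Y = X @ beta_true + rng.normal(scale=0.1, size=50)

# alpha is the regularization strength (larger alpha pulls beta toward zero)
model = Ridge(alpha=1.0, fit_intercept=False).fit(X, Y)
print(model.coef_)
```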
Intuition on 𝐿2 Regularization
• Equivalently, the regularization term is $\lambda$ times the squared $L_2$ norm of $\beta$:

$$\lambda \sum_{j=1}^{D} \beta_j^2 = \lambda \|\beta\|_2^2 = \lambda \|\beta - 0\|_2^2$$

• "Pulling" $\beta$ to zero
  – "Pulls" more as $\lambda$ becomes larger
Intuition on 𝐿2 Regularization
• Why does it help?
– Encourages “simple” functions
– As 𝜆 → ∞, obtain 𝛽 = 0
– Use 𝜆 to tune bias-variance tradeoff
Bias-Variance Tradeoff for Regularization
Intuition on 𝐿2 Regularization
[Figure: contours of the loss over $(\beta_1, \beta_2)$. The loss varies greatly in one direction, so that direction is penalized more. Marked points: the minimizer of the original loss (the solution when $\lambda = 0$), the minimizer of the regularization term (the solution as $\lambda \to \infty$), and the minimizer of the full loss in between.]
• At the minimizer of the full loss, the gradients of the original loss and the regularization term balance
• The tradeoff depends on the choice of $\lambda$
Feature Standardization
• Unregularized linear regression is invariant to feature scaling
  – Suppose we scale $x_{ij} \to 2x_{ij}$ for all examples
  – Without regularization, simply use $\beta_j \to \beta_j / 2$
Feature Standardization
• Solution: Rescale features to zero mean and unit variance
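A minimal sketch of feature standardization with scikit-learn on a made-up feature matrix.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.random.default_rng(0).normal(loc=5.0, scale=3.0, size=(100, 4))

scaler = StandardScaler()
X_std = scaler.fit_transform(X)     # per column: (x - mean) / std

print(X_std.mean(axis=0).round(6))  # approximately 0 for every column
print(X_std.std(axis=0).round(6))   # approximately 1 for every column
```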
General Regularization Strategy
• Original loss + regularization:

$$L_{\text{reg}}(\beta; Z) = L(\beta; Z) + \lambda \cdot R(\beta)$$
Hyperparameter Tuning
• 𝜆 is a hyperparameter that must be tuned
(satisfies 𝜆 ≥ 0 )
• Naïve strategy: Try a few different candidates
𝜆 and choose the one that minimizes the test
loss
• Problem: We may overfit the test set!
– Major problem if we have more hyperparameters
Training/Validation/Test Split
• Goal: Choose the best hyperparameter $\lambda$
  – Can also compare different model families, feature maps, etc.
• Solution: Optimize $\lambda$ on a validation set
  – Rule of thumb: 60/20/20 split

Given data $Z$:
• Step 1: Split $Z$ into $Z_{\text{train}}$, $Z_{\text{val}}$, and $Z_{\text{test}}$, and pick candidate values $\lambda_1, \dots, \lambda_h$
• Step 2: For $t \in \{1, \dots, h\}$:
  • Step 2a: Run linear regression with $Z_{\text{train}}$ and $\lambda_t$ to obtain $\hat{\beta}(Z_{\text{train}}, \lambda_t)$
  • Step 2b: Evaluate the validation loss $L^{\text{val}}_t = L(\hat{\beta}(Z_{\text{train}}, \lambda_t); Z_{\text{val}})$
• Step 3: Use the best $\lambda_t$
  • Choose $t' = \arg\min_t L^{\text{val}}_t$ (lowest validation loss)
  • Re-run linear regression with $Z_{\text{train}}$ and $\lambda_{t'}$ to obtain $\hat{\beta}(Z_{\text{train}}, \lambda_{t'})$
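A minimal sketch of this procedure on synthetic data; Ridge(alpha=...) stands in for "linear regression with $L_2$ regularization", with alpha playing the role of $\lambda$.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=0.5, size=200)

# Step 1: 60/20/20 train/validation/test split and candidate lambdas
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)
candidates = [0.01, 0.1, 1.0, 10.0]

# Step 2: fit on the training set, evaluate on the validation set
val_losses = [mean_squared_error(y_val, Ridge(alpha=lam).fit(X_train, y_train).predict(X_val))
              for lam in candidates]

# Step 3: refit with the best lambda and report the test loss once
best = candidates[int(np.argmin(val_losses))]
final = Ridge(alpha=best).fit(X_train, y_train)
print("best lambda:", best, "test MSE:", mean_squared_error(y_test, final.predict(X_test)))
```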
Alternative Cross-Validation Algorithms
• If $Z$ is small, then splitting it can reduce performance
  – Can use both $Z_{\text{train}}$ and $Z_{\text{val}}$ in Step 3 (i.e., re-fit on their union)
Example: 3-Fold Cross Validation
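A minimal sketch of 3-fold cross-validation for choosing $\lambda$ (again using Ridge's alpha as a stand-in), on synthetic data.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(90, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=0.5, size=90)

best_lam, best_score = None, -np.inf
for lam in [0.01, 0.1, 1.0, 10.0]:
    # 3 folds: each fold serves as the validation set once; scores are negative MSEs
    scores = cross_val_score(Ridge(alpha=lam), X, y, cv=3, scoring="neg_mean_squared_error")
    if scores.mean() > best_score:
        best_lam, best_score = lam, scores.mean()
print("best lambda by 3-fold CV:", best_lam)
```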
𝐿1 Regularization
• Can we minimize $\|\beta\|_0 = |\{j : \beta_j \neq 0\}|$?
  – That is, the number of nonzero components of $\beta$
  – Improves interpretability (automatic feature selection!)
  – Also serves as a strong regularizer
  – Minimizing $\|\beta\|_0$ directly is intractable, so $L_1$ regularization uses $\|\beta\|_1 = \sum_j |\beta_j|$ as a surrogate
Intuition on 𝐿1 Regularization
[Figure: contours of the loss over $(\beta_1, \beta_2)$, with the minimizer of the original loss (the solution when $\lambda = 0$) and the minimizer of the regularization term at the origin (the solution as $\lambda \to \infty$).]

$$L(\beta; Z) = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - \beta^T x_i\right)^2 + \lambda \sum_{j=1}^{D} |\beta_j|$$
𝐿1 Regularization for Feature Selection
• Step 1: Construct a lot of features and add them to the feature map
• Step 2: Use $L_1$-regularized regression to "select" a subset of features
  – I.e., coefficient $\beta_j \neq 0$ -> feature $j$ is selected (see the sketch below)
• Optional: Remove the unselected features from the feature map and run vanilla linear regression
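A minimal sketch of $L_1$-regularized regression with scikit-learn's Lasso (which minimizes $\frac{1}{2n}\|Y - X\beta\|_2^2 + \alpha\|\beta\|_1$) on synthetic data where only a few features matter.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
beta_true = np.zeros(20)
beta_true[[0, 3, 7]] = [2.0, -1.5, 3.0]       # only 3 truly relevant features
y = X @ beta_true + rng.normal(scale=0.1, size=100)

model = Lasso(alpha=0.1).fit(X, y)
print("selected features:", np.flatnonzero(model.coef_))  # indices with beta_j != 0
```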
Agenda
• Regularization
  – Strategy to address the bias-variance tradeoff
  – By example: Linear regression with $L_2$ regularization
Minimizing the MSE Loss
• Recall that linear regression minimizes the loss

$$L(\beta; Z) = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - \beta^T x_i\right)^2$$
Vectorizing Linear Regression
$$Y \approx X\beta, \qquad
Y = \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix}, \quad
X = \begin{bmatrix} x_{1,1} & \cdots & x_{1,d} \\ \vdots & \ddots & \vdots \\ x_{n,1} & \cdots & x_{n,d} \end{bmatrix}, \quad
\beta = \begin{bmatrix} \beta_1 \\ \vdots \\ \beta_d \end{bmatrix}$$
Vectorizing Mean Squared Error
$$\begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix} \approx \begin{bmatrix} f_\beta(x_1) \\ \vdots \\ f_\beta(x_n) \end{bmatrix}$$

$$L(\beta; Z) = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - \beta^T x_i\right)^2 = \frac{1}{n} \|Y - X\beta\|_2^2,
\qquad \text{where } \|z\|_2^2 = \sum_{i=1}^{n} z_i^2$$
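A minimal sketch checking that the summation form and the vectorized form of the MSE agree, on synthetic data.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
beta = np.array([1.0, -2.0, 0.5])
Y = X @ beta + rng.normal(scale=0.1, size=100)

# Summation form: (1/n) * sum_i (y_i - beta^T x_i)^2
loss_sum = sum((Y[i] - beta @ X[i]) ** 2 for i in range(len(Y))) / len(Y)
# Vectorized form: (1/n) * ||Y - X beta||_2^2
loss_vec = np.sum((Y - X @ beta) ** 2) / len(Y)
print(loss_sum, loss_vec)   # identical up to floating-point error
```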
Strategy 1: Closed-Form Solution
• The gradient is

$$\nabla_\beta L(\beta; Z) = \nabla_\beta \frac{1}{n}\|Y - X\beta\|_2^2 = \frac{1}{n} \nabla_\beta \left[(Y - X\beta)^T (Y - X\beta)\right]$$
$$= \frac{2}{n} \left[\nabla_\beta (Y - X\beta)\right]^T (Y - X\beta)$$
$$= -\frac{2}{n} X^T (Y - X\beta)$$
$$= -\frac{2}{n} X^T Y + \frac{2}{n} X^T X \beta$$
Strategy 1: Closed-Form Solution
• The gradient is

$$\nabla_\beta L(\beta; Z) = \nabla_\beta \frac{1}{n}\|Y - X\beta\|_2^2 = -\frac{2}{n} X^T Y + \frac{2}{n} X^T X \beta$$
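A minimal sketch that checks this analytic gradient against a finite-difference approximation on synthetic data.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
Y = rng.normal(size=50)
beta = rng.normal(size=4)

def loss(b):
    # L(beta; Z) = (1/n) ||Y - X beta||_2^2
    return np.mean((Y - X @ b) ** 2)

grad_analytic = -(2 / len(Y)) * X.T @ (Y - X @ beta)

eps = 1e-6
grad_numeric = np.array([(loss(beta + eps * e) - loss(beta - eps * e)) / (2 * eps)
                         for e in np.eye(4)])
print(np.max(np.abs(grad_analytic - grad_numeric)))   # should be ~1e-8 or smaller
```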
Strategy 1: Closed-Form Solution
• Setting $\nabla_\beta L(\beta; Z) = 0$, we have $X^T X \hat{\beta} = X^T Y$, so

$$\hat{\beta}(Z) = (X^T X)^{-1} X^T Y$$
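A minimal sketch of the closed-form solution on synthetic data; np.linalg.solve is used rather than forming the inverse explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
beta_true = np.array([1.0, -2.0, 0.5])
Y = X @ beta_true + rng.normal(scale=0.1, size=100)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)   # solves (X^T X) beta = X^T Y
print(beta_hat)                                 # close to beta_true
```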
Closed-Form Solution for Vanilla Regression Model
For the simple one-dimensional model $y = \beta_1 x + \beta_2$:

$$\beta_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad
\beta_2 = \bar{y} - \beta_1 \bar{x}$$

where $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$ and $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$.
Proof
• Obtain $\beta_2$:

$$L(\beta; Z) = \sum_{i=1}^{n} (y_i - x_i\beta_1 - \beta_2)^2$$
$$\Rightarrow \frac{\partial L(\beta; Z)}{\partial \beta_2} = \sum_{i=1}^{n} 2(y_i - x_i\beta_1 - \beta_2)(-1) = 0$$
$$\Rightarrow \sum_{i=1}^{n} (y_i - \beta_1 x_i - \beta_2) = 0$$
$$\Rightarrow \sum_{i=1}^{n} y_i - \beta_1 \sum_{i=1}^{n} x_i - n\beta_2 = 0$$
$$\Rightarrow \beta_2 = \frac{1}{n}\sum_{i=1}^{n} y_i - \beta_1 \cdot \frac{1}{n}\sum_{i=1}^{n} x_i$$
$$\Rightarrow \beta_2 = \bar{y} - \beta_1 \bar{x}$$
Proof
• Obtain $\beta_1$:

$$\frac{\partial L(\beta; Z)}{\partial \beta_1} = \sum_{i=1}^{n} 2(y_i - x_i\beta_1 - \beta_2)(-x_i) = 0$$
$$\Rightarrow \sum_{i=1}^{n} \left(y_i x_i - \beta_1 x_i^2 - \beta_2 x_i\right) = 0$$

Substituting $\beta_2 = \bar{y} - \beta_1 \bar{x}$:

$$\Rightarrow \sum_{i=1}^{n} \left(y_i x_i - \beta_1 x_i^2 - \bar{y} x_i + \beta_1 \bar{x} x_i\right) = 0$$
$$\Rightarrow \sum_{i=1}^{n} \left(y_i x_i - \bar{y} x_i\right) - \beta_1 \sum_{i=1}^{n} \left(x_i^2 - \bar{x} x_i\right) = 0$$
$$\Rightarrow \beta_1 = \frac{\sum_{i=1}^{n} \left(y_i x_i - \bar{y} x_i\right)}{\sum_{i=1}^{n} \left(x_i^2 - \bar{x} x_i\right)}$$
Example

Student | xi  | yi  | xi − x̄ | yi − ȳ | (xi − x̄)(yi − ȳ)
1       | 95  | 85  |  17     |   8     | 136
2       | 85  | 95  |   7     |  18     | 126
3       | 80  | 70  |   2     |  -7     | -14
4       | 70  | 65  |  -8     | -12     |  96
5       | 60  | 70  | -18     |  -7     | 126
Sum     | 390 | 385 |         |         | 470
Mean    | 78  | 77  |         |         |

Fit $y_i = \beta_1 x_i + \beta_2$:

$$\beta_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{470}{730} \approx 0.644$$
$$\beta_2 = \bar{y} - \beta_1 \bar{x} = 77 - 0.644 \times 78 = 26.768$$
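A minimal sketch verifying this hand computation numerically.

```python
import numpy as np

x = np.array([95.0, 85.0, 80.0, 70.0, 60.0])
y = np.array([85.0, 95.0, 70.0, 65.0, 70.0])

beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta2 = y.mean() - beta1 * x.mean()
print(beta1, beta2)   # approximately 0.644 and 26.8
```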
When Can This Happen?
• I.e., when is $X^T X$ not invertible, so that the closed-form solution fails?
• Case 1: Fewer data examples than feature dimension (i.e., $n < d$)
  – Solution 1: Remove features so that $d \le n$
  – Solution 2: Collect more data until $d \le n$
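A minimal sketch showing that $X^T X$ is rank-deficient (hence not invertible) when $n < d$, on random data.

```python
import numpy as np

X = np.random.default_rng(0).normal(size=(3, 5))   # n = 3 examples, d = 5 features
G = X.T @ X                                         # 5 x 5 matrix
print(np.linalg.matrix_rank(G))                     # at most 3 < 5, so G is singular
```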
Shortcomings of Closed-Form Solution
• Computing $\hat{\beta}(Z) = (X^T X)^{-1} X^T Y$ can be challenging
• Computing $(X^T X)^{-1}$ is $O(d^3)$
  – $d = 10^4$ features -> $O(10^{12})$
  – Even storing $X^T X$ requires a lot of memory
Shortcomings of Closed-Form Solution
• Numerical accuracy issues due to "ill-conditioning"
  – $X^T X$ is "barely" invertible
  – Then, the solution $(X^T X)^{-1} X^T Y$ has large variance along some directions
  – Regularization helps
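A minimal sketch of ill-conditioning and how the $L_2$ penalty helps: two nearly collinear columns make $X^T X$ close to singular, and adding $\lambda I$ before solving stabilizes the estimate (the data here is made up).

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
X = np.column_stack([x1, x1 + 1e-6 * rng.normal(size=100)])  # nearly collinear columns
Y = X @ np.array([1.0, 1.0]) + rng.normal(scale=0.1, size=100)

print(np.linalg.cond(X.T @ X))            # huge condition number: "barely" invertible

lam = 0.1
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ Y)
print(beta_ridge)                          # well-behaved regularized estimate
```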
Agenda
• Regularization
  – Strategy to address the bias-variance tradeoff
  – By example: Linear regression with $L_2$ regularization