Lecture 3
Linear Regression
Previously …
• Linear regression problem
• Feature maps
Linear Regression Problem
• Input: Dataset $Z = \{(x_1, y_1), \dots, (x_n, y_n)\}$, where $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$
• Output: A linear function $f_\beta(x) = \beta^T x$ that minimizes the MSE:

$$L(\beta; Z) = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - \beta^T x_i\right)^2$$
Feature Maps
General strategy:
• Model family: $F = \{f_\beta\}_\beta$
• Loss function: $L(\beta; Z)$

Linear regression with a feature map:
• Linear functions over a given feature map $\phi: \mathcal{X} \to \mathbb{R}^d$, i.e., $f_\beta(x) = \beta^T \phi(x)$
• MSE: $L(\beta; Z) = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - \beta^T \phi(x_i)\right)^2$
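As an illustration, here is a minimal sketch (not the lecture's code) of linear regression over a hand-built feature map; the particular map $\phi(x) = [x, x^2, \sin x]$ and the synthetic data are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def phi(x):
    # Hypothetical feature map: phi(x) = [x, x^2, sin(x)] for scalar x
    return np.column_stack([x, x ** 2, np.sin(x)])

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=100)
y = 0.5 * x ** 2 + np.sin(x) + rng.normal(scale=0.1, size=100)

# Linear regression in the feature space: f_beta(x) = beta^T phi(x)
model = LinearRegression().fit(phi(x), y)
print(model.coef_, model.intercept_)
```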
Bias-Variance Tradeoff
• Capacity of a model family captures
“complexity” of data it can fit
– Higher capacity -> more likely to overfit (model
family has high variance)
– Lower capacity -> more likely to underfit (model
family has high bias)
• For linear regression, capacity corresponds to
feature dimension
Example of Underfitting/Overfitting
• Exploratory Data Analysis
Example of Underfitting/Overfitting
• Using 'sklearn.preprocessing.MinMaxScaler' for data normalization:

$$x \leftarrow \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$
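A minimal sketch of min-max scaling with scikit-learn; the toy feature matrix below is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up feature matrix with two columns on different scales
X = np.array([[95.0, 1.2],
              [85.0, 3.4],
              [80.0, 0.7],
              [70.0, 2.9],
              [60.0, 1.8]])

scaler = MinMaxScaler()              # rescales each column to [0, 1]
X_scaled = scaler.fit_transform(X)   # (x - x_min) / (x_max - x_min) per column
print(X_scaled)
```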
Example of Underfitting/Overfitting
• Linear regression uses only the first-order features $x$:
  $$y = w^T x + b$$
• Polynomial regression uses higher-order combination features $x'$ of $x$:
  $$y = w^T x' + b$$
  For example, the degree-2 polynomial features of $x = (x_1, x_2)^T$ are $x' = [1, x_1, x_2, x_1^2, x_1 x_2, x_2^2]^T$.
Example of Underfitting/Overfitting
• We have features $x = (x_1, x_2)$ to predict $y$.
• In linear regression, we have
  $$w^* = \arg\min_w \left(y - [w_1, w_2]\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} - b\right)^2$$
• In polynomial regression, if we use the degree-2 polynomial features of $x$, we have
  $$w^* = \arg\min_w \left(y - [w_1, w_2, w_3, \dots, w_6]\begin{bmatrix} 1 \\ x_1 \\ x_2 \\ x_1^2 \\ x_1 x_2 \\ x_2^2 \end{bmatrix} - b\right)^2$$
Example of Underfitting/Overfitting
• Degree-1
• Degree-3
• Degree-6
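A minimal sketch (not the lecture's exact code) comparing polynomial fits of degree 1, 3, and 6 on synthetic 1-D data; the low degree tends to underfit and the high degree tends to overfit.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=30)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=30)  # noisy nonlinear data
X = x.reshape(-1, 1)

for degree in (1, 3, 6):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X, y)
    print(f"degree={degree}: train MSE = {mean_squared_error(y, model.predict(X)):.3f}")
```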
Agenda
• Regularization
  – Strategy to address the bias-variance tradeoff
  – By example: Linear regression with $L_2$ regularization
Recall: Mean Squared Error Loss
• Mean squared error loss for linear regression:

$$L(\beta; Z) = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - \beta^T x_i\right)^2$$
Linear Regression with 𝐿2 Regularization
• Original loss + regularization:

$$L(\beta; Z) = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - \beta^T x_i\right)^2 + \lambda \cdot \|\beta\|_2^2
             = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - \beta^T x_i\right)^2 + \lambda \sum_{j=1}^{D} \beta_j^2$$
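A minimal sketch of this objective using scikit-learn's Ridge on synthetic data; note that Ridge minimizes $\|Y - X\beta\|_2^2 + \alpha\|\beta\|_2^2$ (no $1/n$ averaging), so its alpha plays the role of $n\lambda$ in the loss above.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
beta_true = np.array([1.0, -2.0, 0.0, 0.5, 3.0])
Y = X @ beta_true + rng.normal(scale=0.1, size=50)

# alpha is the regularization strength (larger alpha pulls beta toward zero)
model = Ridge(alpha=1.0, fit_intercept=False).fit(X, Y)
print(model.coef_)
```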
Intuition on 𝐿2 Regularization
• Equivalently, the regularization term is $\lambda$ times the squared $L_2$ norm of $\beta$:

$$\lambda \sum_{j=1}^{D} \beta_j^2 = \lambda \|\beta\|_2^2 = \lambda \|\beta - 0\|_2^2$$

• "Pulling" $\beta$ to zero
  – "Pulls" more as $\lambda$ becomes larger
Intuition on 𝐿2 Regularization
• Why does it help?
– Encourages “simple” functions
– As 𝜆 → ∞, obtain 𝛽 = 0
– Use 𝜆 to tune bias-variance tradeoff
Bias-Variance Tradeoff for Regularization
Intuition on 𝐿2 Regularization
[Figure: contours of the loss over $(\beta_1, \beta_2)$. The loss varies greatly in one direction, so that direction is penalized more. Marked points: the minimizer of the original loss (the solution when $\lambda = 0$), the minimizer of the regularization term (the solution as $\lambda \to \infty$), and the minimizer of the full loss in between.]
• At the minimizer of the full loss, the gradients of the original loss and the regularization term balance
• The tradeoff depends on the choice of $\lambda$
Feature Standardization
• Unregularized linear regression is invariant to feature scaling
  – Suppose we scale $x_{ij} \to 2x_{ij}$ for all examples
  – Without regularization, simply use $\beta_j \to \beta_j / 2$
Feature Standardization
• Solution: Rescale features to zero mean and unit variance
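A minimal sketch of feature standardization with scikit-learn on a made-up feature matrix.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.random.default_rng(0).normal(loc=5.0, scale=3.0, size=(100, 4))

scaler = StandardScaler()
X_std = scaler.fit_transform(X)     # per column: (x - mean) / std

print(X_std.mean(axis=0).round(6))  # approximately 0 for every column
print(X_std.std(axis=0).round(6))   # approximately 1 for every column
```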
General Regularization Strategy
• Original loss + regularization:

$$L_{\text{reg}}(\beta; Z) = L(\beta; Z) + \lambda \cdot R(\beta)$$
Hyperparameter Tuning
• 𝜆 is a hyperparameter that must be tuned
(satisfies 𝜆 ≥ 0 )
• Naïve strategy: Try a few different candidates
𝜆 and choose the one that minimizes the test
loss
• Problem: We may overfit the test set!
– Major problem if we have more hyperparameters
Training/Validation/Test Split
• Goal: Choose the best hyperparameter $\lambda$
  – Can also compare different model families, feature maps, etc.
• Solution: Optimize $\lambda$ on a validation set
  – Rule of thumb: 60/20/20 split

Given data $Z$:
• Step 1: Split $Z$ into $Z_{\text{train}}$, $Z_{\text{val}}$, and $Z_{\text{test}}$, and pick candidate values $\lambda_1, \dots, \lambda_h$
• Step 2: For $t \in \{1, \dots, h\}$:
  • Step 2a: Run linear regression with $Z_{\text{train}}$ and $\lambda_t$ to obtain $\hat{\beta}(Z_{\text{train}}, \lambda_t)$
  • Step 2b: Evaluate the validation loss $L^{\text{val}}_t = L(\hat{\beta}(Z_{\text{train}}, \lambda_t); Z_{\text{val}})$
• Step 3: Use the best $\lambda_t$
  • Choose $t' = \arg\min_t L^{\text{val}}_t$ (lowest validation loss)
  • Re-run linear regression with $Z_{\text{train}}$ and $\lambda_{t'}$ to obtain $\hat{\beta}(Z_{\text{train}}, \lambda_{t'})$
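A minimal sketch of this procedure on synthetic data; Ridge(alpha=...) stands in for "linear regression with $L_2$ regularization", with alpha playing the role of $\lambda$.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=0.5, size=200)

# Step 1: 60/20/20 train/validation/test split and candidate lambdas
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)
candidates = [0.01, 0.1, 1.0, 10.0]

# Step 2: fit on the training set, evaluate on the validation set
val_losses = [mean_squared_error(y_val, Ridge(alpha=lam).fit(X_train, y_train).predict(X_val))
              for lam in candidates]

# Step 3: refit with the best lambda and report the test loss once
best = candidates[int(np.argmin(val_losses))]
final = Ridge(alpha=best).fit(X_train, y_train)
print("best lambda:", best, "test MSE:", mean_squared_error(y_test, final.predict(X_test)))
```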
Alternative Cross-Validation Algorithms
• If $Z$ is small, then splitting it can reduce performance
  – Can use both $Z_{\text{train}}$ and $Z_{\text{val}}$ in Step 3 (i.e., re-fit on their union)
Example: 3-Fold Cross Validation
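A minimal sketch of 3-fold cross-validation for choosing $\lambda$ (again using Ridge's alpha as a stand-in), on synthetic data.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(90, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=0.5, size=90)

best_lam, best_score = None, -np.inf
for lam in [0.01, 0.1, 1.0, 10.0]:
    # 3 folds: each fold serves as the validation set once; scores are negative MSEs
    scores = cross_val_score(Ridge(alpha=lam), X, y, cv=3, scoring="neg_mean_squared_error")
    if scores.mean() > best_score:
        best_lam, best_score = lam, scores.mean()
print("best lambda by 3-fold CV:", best_lam)
```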
𝐿1 Regularization
• Can we minimize $\|\beta\|_0 = |\{j : \beta_j \neq 0\}|$?
  – That is, the number of nonzero components of $\beta$
  – Improves interpretability (automatic feature selection!)
  – Also serves as a strong regularizer
  – Minimizing $\|\beta\|_0$ directly is intractable, so $L_1$ regularization uses $\|\beta\|_1 = \sum_j |\beta_j|$ as a surrogate
Intuition on 𝐿1 Regularization
[Figure: contours of the loss over $(\beta_1, \beta_2)$, with the minimizer of the original loss (the solution when $\lambda = 0$) and the minimizer of the regularization term at the origin (the solution as $\lambda \to \infty$).]

$$L(\beta; Z) = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - \beta^T x_i\right)^2 + \lambda \sum_{j=1}^{D} |\beta_j|$$
𝐿1 Regularization for Feature Selection
• Step 1: Construct a lot of features and add them to the feature map
• Step 2: Use $L_1$-regularized regression to "select" a subset of features
  – I.e., coefficient $\beta_j \neq 0$ -> feature $j$ is selected (see the sketch below)
• Optional: Remove the unselected features from the feature map and run vanilla linear regression
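A minimal sketch of $L_1$-regularized regression with scikit-learn's Lasso (which minimizes $\frac{1}{2n}\|Y - X\beta\|_2^2 + \alpha\|\beta\|_1$) on synthetic data where only a few features matter.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
beta_true = np.zeros(20)
beta_true[[0, 3, 7]] = [2.0, -1.5, 3.0]       # only 3 truly relevant features
y = X @ beta_true + rng.normal(scale=0.1, size=100)

model = Lasso(alpha=0.1).fit(X, y)
print("selected features:", np.flatnonzero(model.coef_))  # indices with beta_j != 0
```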
Agenda
• Regularization
  – Strategy to address the bias-variance tradeoff
  – By example: Linear regression with $L_2$ regularization
Minimizing the MSE Loss
• Recall that linear regression minimizes the loss

$$L(\beta; Z) = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - \beta^T x_i\right)^2$$
Vectorizing Linear Regression
$$Y \approx X\beta, \qquad
Y = \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix}, \quad
X = \begin{bmatrix} x_{1,1} & \cdots & x_{1,d} \\ \vdots & \ddots & \vdots \\ x_{n,1} & \cdots & x_{n,d} \end{bmatrix}, \quad
\beta = \begin{bmatrix} \beta_1 \\ \vdots \\ \beta_d \end{bmatrix}$$
Vectorizing Mean Squared Error
$$\begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix} \approx \begin{bmatrix} f_\beta(x_1) \\ \vdots \\ f_\beta(x_n) \end{bmatrix}$$

$$L(\beta; Z) = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - \beta^T x_i\right)^2 = \frac{1}{n} \|Y - X\beta\|_2^2,
\qquad \text{where } \|z\|_2^2 = \sum_{i=1}^{n} z_i^2$$
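A minimal sketch checking that the summation form and the vectorized form of the MSE agree, on synthetic data.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
beta = np.array([1.0, -2.0, 0.5])
Y = X @ beta + rng.normal(scale=0.1, size=100)

# Summation form: (1/n) * sum_i (y_i - beta^T x_i)^2
loss_sum = sum((Y[i] - beta @ X[i]) ** 2 for i in range(len(Y))) / len(Y)
# Vectorized form: (1/n) * ||Y - X beta||_2^2
loss_vec = np.sum((Y - X @ beta) ** 2) / len(Y)
print(loss_sum, loss_vec)   # identical up to floating-point error
```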
Strategy 1: Closed-Form Solution
• The gradient is

$$\nabla_\beta L(\beta; Z) = \nabla_\beta \frac{1}{n}\|Y - X\beta\|_2^2 = \frac{1}{n} \nabla_\beta \left[(Y - X\beta)^T (Y - X\beta)\right]$$
$$= \frac{2}{n} \left[\nabla_\beta (Y - X\beta)\right]^T (Y - X\beta)$$
$$= -\frac{2}{n} X^T (Y - X\beta)$$
$$= -\frac{2}{n} X^T Y + \frac{2}{n} X^T X \beta$$
Strategy 1: Closed-Form Solution
• The gradient is

$$\nabla_\beta L(\beta; Z) = \nabla_\beta \frac{1}{n}\|Y - X\beta\|_2^2 = -\frac{2}{n} X^T Y + \frac{2}{n} X^T X \beta$$
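A minimal sketch that checks this analytic gradient against a finite-difference approximation on synthetic data.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
Y = rng.normal(size=50)
beta = rng.normal(size=4)

def loss(b):
    # L(beta; Z) = (1/n) ||Y - X beta||_2^2
    return np.mean((Y - X @ b) ** 2)

grad_analytic = -(2 / len(Y)) * X.T @ (Y - X @ beta)

eps = 1e-6
grad_numeric = np.array([(loss(beta + eps * e) - loss(beta - eps * e)) / (2 * eps)
                         for e in np.eye(4)])
print(np.max(np.abs(grad_analytic - grad_numeric)))   # should be ~1e-8 or smaller
```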
Strategy 1: Closed-Form Solution
• Setting $\nabla_\beta L(\beta; Z) = 0$, we have $X^T X \hat{\beta} = X^T Y$, so

$$\hat{\beta}(Z) = (X^T X)^{-1} X^T Y$$
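A minimal sketch of the closed-form solution on synthetic data; np.linalg.solve is used rather than forming the inverse explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
beta_true = np.array([1.0, -2.0, 0.5])
Y = X @ beta_true + rng.normal(scale=0.1, size=100)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)   # solves (X^T X) beta = X^T Y
print(beta_hat)                                 # close to beta_true
```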
Closed-Form Solution for Vanilla Regression Model
For the simple one-dimensional model $y = \beta_1 x + \beta_2$:

$$\beta_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad
\beta_2 = \bar{y} - \beta_1 \bar{x}$$

where $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$ and $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$.
Proof
• Obtain $\beta_2$:

$$L(\beta; Z) = \sum_{i=1}^{n} (y_i - x_i\beta_1 - \beta_2)^2$$
$$\Rightarrow \frac{\partial L(\beta; Z)}{\partial \beta_2} = \sum_{i=1}^{n} 2(y_i - x_i\beta_1 - \beta_2)(-1) = 0$$
$$\Rightarrow \sum_{i=1}^{n} (y_i - \beta_1 x_i - \beta_2) = 0$$
$$\Rightarrow \sum_{i=1}^{n} y_i - \beta_1 \sum_{i=1}^{n} x_i - n\beta_2 = 0$$
$$\Rightarrow \beta_2 = \frac{1}{n}\sum_{i=1}^{n} y_i - \beta_1 \cdot \frac{1}{n}\sum_{i=1}^{n} x_i$$
$$\Rightarrow \beta_2 = \bar{y} - \beta_1 \bar{x}$$
Proof
• Obtain $\beta_1$:

$$\frac{\partial L(\beta; Z)}{\partial \beta_1} = \sum_{i=1}^{n} 2(y_i - x_i\beta_1 - \beta_2)(-x_i) = 0$$
$$\Rightarrow \sum_{i=1}^{n} \left(y_i x_i - \beta_1 x_i^2 - \beta_2 x_i\right) = 0$$

Substituting $\beta_2 = \bar{y} - \beta_1 \bar{x}$:

$$\Rightarrow \sum_{i=1}^{n} \left(y_i x_i - \beta_1 x_i^2 - \bar{y} x_i + \beta_1 \bar{x} x_i\right) = 0$$
$$\Rightarrow \sum_{i=1}^{n} \left(y_i x_i - \bar{y} x_i\right) - \beta_1 \sum_{i=1}^{n} \left(x_i^2 - \bar{x} x_i\right) = 0$$
$$\Rightarrow \beta_1 = \frac{\sum_{i=1}^{n} \left(y_i x_i - \bar{y} x_i\right)}{\sum_{i=1}^{n} \left(x_i^2 - \bar{x} x_i\right)}$$
Example

Student | xi  | yi  | xi − x̄ | yi − ȳ | (xi − x̄)(yi − ȳ)
1       | 95  | 85  |  17     |   8     | 136
2       | 85  | 95  |   7     |  18     | 126
3       | 80  | 70  |   2     |  -7     | -14
4       | 70  | 65  |  -8     | -12     |  96
5       | 60  | 70  | -18     |  -7     | 126
Sum     | 390 | 385 |         |         | 470
Mean    | 78  | 77  |         |         |

Fit $y_i = \beta_1 x_i + \beta_2$:

$$\beta_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{470}{730} \approx 0.644$$
$$\beta_2 = \bar{y} - \beta_1 \bar{x} = 77 - 0.644 \times 78 = 26.768$$
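A minimal sketch verifying this hand computation numerically.

```python
import numpy as np

x = np.array([95.0, 85.0, 80.0, 70.0, 60.0])
y = np.array([85.0, 95.0, 70.0, 65.0, 70.0])

beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta2 = y.mean() - beta1 * x.mean()
print(beta1, beta2)   # approximately 0.644 and 26.8
```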
When Can This Happen?
• I.e., when is $X^T X$ not invertible, so that the closed-form solution fails?
• Case 1: Fewer data examples than feature dimension (i.e., $n < d$)
  – Solution 1: Remove features so that $d \le n$
  – Solution 2: Collect more data until $d \le n$
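A minimal sketch showing that $X^T X$ is rank-deficient (hence not invertible) when $n < d$, on random data.

```python
import numpy as np

X = np.random.default_rng(0).normal(size=(3, 5))   # n = 3 examples, d = 5 features
G = X.T @ X                                         # 5 x 5 matrix
print(np.linalg.matrix_rank(G))                     # at most 3 < 5, so G is singular
```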
Shortcomings of Closed-Form Solution
• Computing $\hat{\beta}(Z) = (X^T X)^{-1} X^T Y$ can be challenging
• Computing $(X^T X)^{-1}$ is $O(d^3)$
  – $d = 10^4$ features -> $O(10^{12})$
  – Even storing $X^T X$ requires a lot of memory
Shortcomings of Closed-Form Solution
• Numerical accuracy issues due to "ill-conditioning"
  – $X^T X$ is "barely" invertible
  – Then, the solution $(X^T X)^{-1} X^T Y$ has large variance along some directions
  – Regularization helps
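A minimal sketch of ill-conditioning and how the $L_2$ penalty helps: two nearly collinear columns make $X^T X$ close to singular, and adding $\lambda I$ before solving stabilizes the estimate (the data here is made up).

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
X = np.column_stack([x1, x1 + 1e-6 * rng.normal(size=100)])  # nearly collinear columns
Y = X @ np.array([1.0, 1.0]) + rng.normal(scale=0.1, size=100)

print(np.linalg.cond(X.T @ X))            # huge condition number: "barely" invertible

lam = 0.1
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ Y)
print(beta_ridge)                          # well-behaved regularized estimate
```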
Agenda
• Regularization
  – Strategy to address the bias-variance tradeoff
  – By example: Linear regression with $L_2$ regularization