ML L6 Linear Regression
19CSE305
L-T-P-C: 3-0-3-4
Lecture 6
Linear Regression
What is Linear Regression?
Learning
◦ A supervised algorithm that learns from a set of training samples.
◦ Each training sample has one or more input values and a single output
value.
◦ The algorithm learns the line, plane or hyper-plane that best fits the
training samples.
◦ Linear regression is a linear model, i.e. a model that assumes a linear
relationship between the input variables (x) and the single output variable
(y).
Prediction
◦ Use the learned line, plane or hyper-plane to predict the output value for
any input sample.
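As a quick illustration of learning a line from training samples and then predicting, here is a minimal Python sketch using NumPy's polyfit on the training-set values shown later in this lecture (2104→460, 1416→232, 1534→315, 852→178); the 1500 ft² query is the one from the figure below.

import numpy as np

# Toy training samples: one input (house size) and one output (price in 1000s).
X = np.array([852, 1416, 1534, 2104], dtype=float)
y = np.array([178, 232, 315, 460], dtype=float)

# Learn the best-fitting line y = p1*x + p0 (least squares).
p1, p0 = np.polyfit(X, y, deg=1)

# Predict the output for a new input sample.
x_new = 1500.0
print("predicted price (in 1000s):", p1 * x_new + p0)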
Regression
• Regression is a method of modelling a target value based on independent
predictors.
• This method is mostly used for forecasting and for finding out cause-and-effect
relationships between variables.
[Figure: scatter plot of house Price (in 1000s of dollars) versus Size (feet²); the example asks for the price of a 1500 ft² house. Caption: Supervised Learning Regression Problem.]
How do we predict with only one variable?
"Goodness of fit" is based on the variability of the observed values (here, the tip amounts) around the fitted line.
How do we predict with only one variable?
Measuring the deviation: squared residuals or the Sum of Squared Errors (SSE).
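As a worked illustration, here is a minimal sketch of the SSE, reusing the illustrative housing values from the earlier example; the fitted line comes from NumPy's polyfit.

import numpy as np

# Illustrative data and fitted line (see the earlier sketch).
X = np.array([852, 1416, 1534, 2104], dtype=float)
y = np.array([178, 232, 315, 460], dtype=float)
p1, p0 = np.polyfit(X, y, deg=1)

# Residuals: observed values minus values predicted by the fitted line.
residuals = y - (p1 * X + p0)

# Sum of Squared Errors (SSE): the quantity the fitted line minimizes.
sse = np.sum(residuals ** 2)
print("SSE:", sse)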
SIMPLE LINEAR REGRESSION
[Diagram: Training Set → Learning Algorithm → hypothesis h; the size of a house (X) is fed into h to produce the estimated price (Y), i.e. Y = h(X).]
Training Set:
Size in feet² (x) | Price ($) in 1000's (y)
2104 | 460
1416 | 232
1534 | 315
852  | 178
…    | …
Hypothesis: h(x) = P0 + P1·x
P0, P1: parameters
How do we choose P0, P1?
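Before answering that, here is a minimal sketch of the hypothesis as a Python function, with hand-picked (not fitted) parameter values just to show how P0 and P1 shape the line.

def h(x, p0, p1):
    """Simple linear regression hypothesis: h(x) = p0 + p1*x."""
    return p0 + p1 * x

# Different parameter choices give different lines.
print(h(1500, p0=0.0, p1=0.2))    # line through the origin, slope 0.2
print(h(1500, p0=50.0, p1=0.1))   # intercept 50, gentler slope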
EFFECT OF PARAMETERS
[Figure: three small plots of a line over 0 ≤ x ≤ 3, y from 0 to 3, showing how different choices of the parameters P0 and P1 change the intercept and slope.]
Choose parameters P1,P0 so that the predicted value h(x) is close to Y for
our training examples (X,Y).
The sum of the squares of the residual errors is called the Residual Sum of Squares (RSS).
The average variation of the points around the fitted regression line is called the Residual Standard Error (RSE).
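A minimal sketch of computing RSS and RSE in Python; the RSE formula dividing by n − 2 degrees of freedom is the conventional definition for simple linear regression and is an assumption here, since the slide only describes RSE informally.

import numpy as np

# Illustrative data and a fitted line (any OLS fit would do).
x = np.array([852, 1416, 1534, 2104], dtype=float)
y = np.array([178, 232, 315, 460], dtype=float)
p1, p0 = np.polyfit(x, y, deg=1)

residuals = y - (p0 + p1 * x)

# Residual Sum of Squares (RSS).
rss = np.sum(residuals ** 2)

# Residual Standard Error (RSE): average spread of points around the line.
n = len(x)
rse = np.sqrt(rss / (n - 2))
print("RSS:", rss, "RSE:", rse)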
COST FUNCTION
Hypothesis: h(x) = P0 + P1·x
Choose the parameters P0, P1 so that the predicted value h(x) is close to y for our training examples (x, y), i.e. minimize the cost function
\[ J(P_0, P_1) = \frac{1}{2n}\sum_{i=1}^{n}\bigl(h(x_i) - y_i\bigr)^2 \]
[Figure: scatter plot of Salary versus Years of Experience (x-axis 0–18); visible data points include (16, 90), (11, 58), (1, 8) and (9, 54).]
ORDINARY LEAST SQUARES (OLS)
[Figure: scatter of Salary versus Years of Experience (x from 0 to 18, y from 0 to about 100) with the fitted OLS line ŷ = 4.8·X + 9.15 drawn through the points.]
Predicted values from the fitted line:
Years of Experience | Predicted value (4.8·X + 9.15)
 2 | 18.75
 3 | 23.55
 5 | 33.15
13 | 71.55
 8 | 47.55
16 | 85.95
11 | 61.95
 1 | 13.95
 9 | 52.35
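OLS chooses the slope and intercept in closed form: slope = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)² and intercept = ȳ − slope·x̄. A minimal Python sketch follows; the (years, salary) pairs are made-up stand-ins, since the slide shows only the fitted line, and with the slide's own data this procedure yields the line ŷ = 4.8·X + 9.15 shown above.

import numpy as np

def ols_fit(x, y):
    """Closed-form ordinary least squares for a single predictor."""
    x_mean, y_mean = x.mean(), y.mean()
    slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    intercept = y_mean - slope * x_mean
    return slope, intercept

# Hypothetical (years of experience, salary) pairs for illustration only.
years = np.array([1, 2, 3, 5, 8, 9, 11, 13, 16], dtype=float)
salary = np.array([8, 20, 24, 35, 46, 54, 58, 70, 90], dtype=float)

m, c = ols_fit(years, salary)
print(f"fitted line: y = {m:.2f}*x + {c:.2f}")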
ORDINARY LEAST SQUARES (OLS)
ALGORITHM
Example data:
Internal Exam | External Exam
15 | 49
23 | 63
18 | 58
23 | 60
24 | 58
22 | 61
22 | 60
19 | 63
19 | 60
16 | 52
24 | 62
11 | 30
24 | 59
16 | 49
23 | 68
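Applying the same closed-form formulas to the internal/external exam marks in the table above is a one-step computation; this is a minimal sketch, and the resulting coefficients are printed by the code rather than quoted here.

import numpy as np

# Internal (x) and External (y) exam marks from the table above.
x = np.array([15, 23, 18, 23, 24, 22, 22, 19, 19, 16, 24, 11, 24, 16, 23], dtype=float)
y = np.array([49, 63, 58, 60, 58, 61, 60, 63, 60, 52, 62, 30, 59, 49, 68], dtype=float)

# OLS slope and intercept via the closed-form formulas.
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()
print(f"external ≈ {slope:.2f} * internal + {intercept:.2f}")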
ORDINARY LEAST SQUARES (OLS)
x y
1 2
2 1
3 3
4 6
5 9
6 11
7 13
8 15
9 17
10 20
ORDINARY LEAST SQUARES (OLS)
Sample Question
The 1/2 is a constant that helps cancel the 2 in the derivative of the function when doing the calculations for gradient descent; the goal remains to minimize the error between the predicted value and the actual value.
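As a short worked step (standard calculus, not shown on the slide), the 1/2 cancels when the cost is differentiated:
\[
J(P_1) = \frac{1}{2n}\sum_{i=1}^{n}\bigl(P_1 x_i - y_i\bigr)^2
\quad\Longrightarrow\quad
\frac{\partial J}{\partial P_1} = \frac{1}{2n}\sum_{i=1}^{n} 2\bigl(P_1 x_i - y_i\bigr)x_i
= \frac{1}{n}\sum_{i=1}^{n}\bigl(P_1 x_i - y_i\bigr)x_i .
\]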
HYPOTHESIS VS COST FUNCTION
Hypothesis: h(x) = P1·x (assume P0 = 0)
Parameter: P1
Cost function:
\[ J(P_1) = \frac{1}{2n}\sum_{i=1}^{n}\bigl(P_1 x_i - y_i\bigr)^2 \]
Goal: minimize J(P1) with respect to P1.
[Figure: for each choice of P1, the left panel plots h(x) against x (for fixed P1, h is a function of x) and the right panel plots J against P1 (a function of the parameter P1); the slides repeat this pair of panels for several values of P1.]
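A minimal sketch of evaluating J(P1) for several candidate values of P1; the (1, 1), (2, 2), (3, 3) training set is an assumed toy example matching the 0–3 axes of the plots.

import numpy as np

# Toy training set (assumed for illustration): y = x exactly.
x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])

def J(p1, x, y):
    """Cost function J(P1) = (1/2n) * sum((P1*x_i - y_i)^2), with P0 fixed at 0."""
    n = len(x)
    return np.sum((p1 * x - y) ** 2) / (2 * n)

# J is smallest at the P1 that makes h(x) = P1*x pass through the data.
for p1 in [0.0, 0.5, 1.0, 1.5]:
    print(f"P1 = {p1}: J = {J(p1, x, y):.3f}")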
THE GRADIENT DESCENT ALGORITHM
• Gradient descent is an iterative optimization algorithm to find the
minimum of a function.
• Here that function is the Loss Function.
• Think of a person walking down a valley: they take large steps when the slope is steep and small steps when the slope is gentler.
• Each next position is chosen from the current position, and the process stops when the person reaches the bottom of the valley, which is the goal.
THE GRADIENT DESCENT ALGORITHM
Let P0 = c and P1 = m.
Step 1:
Initially let m = 0 and c = 0. Let L be the learning rate, which controls how much the values of m and c change with each step. L is typically a small value such as 0.0001 for good accuracy.
Step 2:
Calculate the partial derivative of the loss function with respect to m, and plug the current values of x, y, m and c into it to obtain the derivative value Dₘ.
THE GRADIENT DESCENT ALGORITHM
Let P0 = c and P1 = m.
Step 3:
Dₘ is the value of the partial derivative with respect to m. Similarly, compute the partial derivative with respect to c to obtain Dc.
Step 4:
Now update the current values of m and c:
m = m − L·Dₘ,  c = c − L·Dc
Step 5:
Repeat this process until the loss function becomes very small or ideally 0 (which means zero error). The values of m and c that we are left with are the optimum values.
m can be considered the current position of the person walking down the valley.
When the slope is steeper (D is larger) they take longer steps, and when it is less steep (D is smaller) they take smaller steps.
Finally they arrive at the bottom of the valley, which corresponds to the minimum loss (ideally 0).
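A minimal sketch of Steps 1–5 in Python; the dataset, learning rate and iteration count are illustrative, and the derivative expressions in the comments use the mean-squared-error form of the loss.

import numpy as np

# Illustrative data: y is roughly 2*x + 1 with a little noise.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.0, 10.8])

m, c = 0.0, 0.0     # Step 1: initial slope and intercept
L = 0.01            # learning rate
n = len(x)

for _ in range(10000):
    y_pred = m * x + c
    # Steps 2-3: partial derivatives of the MSE loss (1/n) * sum((y - y_pred)^2)
    D_m = (-2.0 / n) * np.sum(x * (y - y_pred))
    D_c = (-2.0 / n) * np.sum(y - y_pred)
    # Step 4: update m and c against the gradient
    m -= L * D_m
    c -= L * D_c

# Step 5: after enough iterations m and c approach the least-squares values.
print(f"m = {m:.3f}, c = {c:.3f}")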
Metrics to check the goodness of fit
The coefficient of determination (R²) is a measure of how well the regression line fits the data.
The value of R² lies between 0 and 1 and is the proportion of variation explained by the regression model.
R² is a rough indicator of the worth of the regression model.
R² is the square of the correlation coefficient r (R² = r²).
Metrics
Sl. No | Term | Equation
1 | Total variation | y − mean(y)
2 | Explained variation | h(x) − mean(y)
3 | Unexplained variation | y − h(x)
The coefficient of determination is the ratio of the sum of squares due to regression to the total sum of squares:
\[ R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST} \]
where SST is the total sum of squares, SSR the sum of squares due to regression (explained variation), and SSE the sum of squared errors (unexplained variation).
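A minimal sketch of computing SST, SSE, SSR and R² in Python on illustrative data:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.0, 10.8])

# Fit the line and compute predictions.
p1, p0 = np.polyfit(x, y, deg=1)
y_pred = p0 + p1 * x

# Total, unexplained and explained sums of squares.
sst = np.sum((y - y.mean()) ** 2)          # total variation
sse = np.sum((y - y_pred) ** 2)            # unexplained variation
ssr = np.sum((y_pred - y.mean()) ** 2)     # explained variation

r2 = 1.0 - sse / sst                       # equivalently ssr / sst for an OLS fit
print("R^2:", r2)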
Advantages:
Simple to use
Works well with all dataset sizes
Gives information about features
Issues of Linear Regression
•Assumes the data is independent
•Outliers have a large effect
•Boundaries are linear
•Prone to overfitting
Solution
Regularization refers to techniques that are used to
calibrate machine learning models in order to minimize the
adjusted loss function and prevent overfitting or
underfitting.
Using regularization, the machine learning model can be fit so that it generalizes well to a given test set, and hence the errors on it are reduced.
Regularization
Regularization is used to introduce bias to the model and to
decrease the variance.
This can be achieved by modifying the loss function with
a penalty term which effectively shrinks the estimates of the
coefficients.
Therefore these types of methods within the framework of
regression are also called “shrinkage” methods or “penalized
regression” methods.
Impact of Regularization
Lasso regression
LASSO stands for Least Absolute Shrinkage and Selection
Operator.
Lasso regression, or L1 regularization, is a technique that augments the cost function with a penalty term equal to the sum of the absolute values of the non-intercept weights of the linear regression.
Ridge regression
Ridge regression, or L2 regularization, likewise adds a penalty to the cost function. The only difference is that the penalty is calculated using the squared values of the non-intercept weights of the linear regression.
L1 and L2 regularization
Lasso regression performs L1 regularization, which adds a penalty equal
to the absolute value of the magnitude of coefficients.
This type of regularization can result in sparse models with few
coefficients;
Some coefficients can become exactly zero and be eliminated from the model.
Larger penalties result in coefficient values closer to zero, which is ideal
for producing simpler models.
On the other hand, L2 regularization (e.g. Ridge regression) doesn’t
result in the elimination of coefficients or sparse models.
This makes the Lasso far easier to interpret than the Ridge.
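A minimal sketch of both penalties using scikit-learn's Lasso and Ridge estimators; the dataset, number of features and alpha values are illustrative, and alpha controls the strength of the penalty.

import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Illustrative data: 100 samples, 5 features, only the first two are informative.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

# L1 penalty: sum of |coefficients|; can drive some coefficients to exactly zero.
lasso = Lasso(alpha=0.1).fit(X, y)

# L2 penalty: sum of squared coefficients; shrinks coefficients but keeps them nonzero.
ridge = Ridge(alpha=0.1).fit(X, y)

print("lasso coefficients:", lasso.coef_)   # expect zeros for the uninformative features
print("ridge coefficients:", ridge.coef_)   # small but nonzero values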