ML L6 Linear Regression
19CSE305
L-T-P-C: 3-0-3-4
Lecture 6
Linear Regression
What is Linear Regression?
Learning
◦ A supervised algorithm that learns from a set of training samples.
◦ Each training sample has one or more input values and a single output
value.
◦ The algorithm learns the line, plane or hyper-plane that best fits the
training samples.
◦ Linear regression is a linear model, i.e. a model that assumes a linear
relationship between the input variables (x) and the single output variable
(y).
Prediction
◦ Use the learned line, plane or hyper-plane to predict the output value for
any input sample.
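As a quick illustration of learning a line from training samples and then predicting, here is a minimal Python sketch using NumPy's polyfit on the training-set values shown later in this lecture (2104→460, 1416→232, 1534→315, 852→178); the 1500 ft² query is the one from the figure below.

import numpy as np

# Toy training samples: one input (house size) and one output (price in 1000s).
X = np.array([852, 1416, 1534, 2104], dtype=float)
y = np.array([178, 232, 315, 460], dtype=float)

# Learn the best-fitting line y = p1*x + p0 (least squares).
p1, p0 = np.polyfit(X, y, deg=1)

# Predict the output for a new input sample.
x_new = 1500.0
print("predicted price (in 1000s):", p1 * x_new + p0)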
Regression
• Regression is a method of modelling a target value based on independent
predictors.
• This method is mostly used for forecasting and for finding out cause-and-effect
relationships between variables.
[Figure: scatter plot of house Price (in 1000s of dollars) versus Size (feet²); the example asks for the price of a 1500 ft² house. Caption: Supervised Learning Regression Problem.]
How do we predict with only one variable?
"Goodness of fit" is based on the variability of the observed values (here, the tip amounts) around the fitted line.
How do we predict with only one variable?
Measuring the deviation: squared residuals or the Sum of Squared Errors (SSE).
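As a worked illustration, here is a minimal sketch of the SSE, reusing the illustrative housing values from the earlier example; the fitted line comes from NumPy's polyfit.

import numpy as np

# Illustrative data and fitted line (see the earlier sketch).
X = np.array([852, 1416, 1534, 2104], dtype=float)
y = np.array([178, 232, 315, 460], dtype=float)
p1, p0 = np.polyfit(X, y, deg=1)

# Residuals: observed values minus values predicted by the fitted line.
residuals = y - (p1 * X + p0)

# Sum of Squared Errors (SSE): the quantity the fitted line minimizes.
sse = np.sum(residuals ** 2)
print("SSE:", sse)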
SIMPLE LINEAR REGRESSION
[Diagram: Training Set → Learning Algorithm → hypothesis h; the size of a house (X) is fed into h to produce the estimated price (Y), i.e. Y = h(X).]
Training Set:
Size in feet² (x) | Price ($) in 1000's (y)
2104 | 460
1416 | 232
1534 | 315
852  | 178
…    | …
Hypothesis: h(x) = P0 + P1·x
P0, P1: parameters
How do we choose P0, P1?
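Before answering that, here is a minimal sketch of the hypothesis as a Python function, with hand-picked (not fitted) parameter values just to show how P0 and P1 shape the line.

def h(x, p0, p1):
    """Simple linear regression hypothesis: h(x) = p0 + p1*x."""
    return p0 + p1 * x

# Different parameter choices give different lines.
print(h(1500, p0=0.0, p1=0.2))    # line through the origin, slope 0.2
print(h(1500, p0=50.0, p1=0.1))   # intercept 50, gentler slope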
EFFECT OF PARAMETERS
[Figure: three small plots of a line over 0 ≤ x ≤ 3, y from 0 to 3, showing how different choices of the parameters P0 and P1 change the intercept and slope.]
Choose parameters P1,P0 so that the predicted value h(x) is close to Y for
our training examples (X,Y).
The sum of the squares of the residual errors is called the Residual Sum of Squares (RSS).
The average variation of the points around the fitted regression line is called the Residual Standard Error (RSE).
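A minimal sketch of computing RSS and RSE in Python; the RSE formula dividing by n − 2 degrees of freedom is the conventional definition for simple linear regression and is an assumption here, since the slide only describes RSE informally.

import numpy as np

# Illustrative data and a fitted line (any OLS fit would do).
x = np.array([852, 1416, 1534, 2104], dtype=float)
y = np.array([178, 232, 315, 460], dtype=float)
p1, p0 = np.polyfit(x, y, deg=1)

residuals = y - (p0 + p1 * x)

# Residual Sum of Squares (RSS).
rss = np.sum(residuals ** 2)

# Residual Standard Error (RSE): average spread of points around the line.
n = len(x)
rse = np.sqrt(rss / (n - 2))
print("RSS:", rss, "RSE:", rse)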
COST FUNCTION
Hypothesis: h(x) = P0 + P1·x
Choose the parameters P0, P1 so that the predicted value h(x) is close to y for our training examples (x, y), i.e. minimize the cost function
\[ J(P_0, P_1) = \frac{1}{2n}\sum_{i=1}^{n}\bigl(h(x_i) - y_i\bigr)^2 \]
[Figure: scatter plot of Salary versus Years of Experience (x-axis 0–18); visible data points include (16, 90), (11, 58), (1, 8) and (9, 54).]
ORDINARY LEAST SQUARES (OLS)
[Figure: scatter of Salary versus Years of Experience (x from 0 to 18, y from 0 to about 100) with the fitted OLS line ŷ = 4.8·X + 9.15 drawn through the points.]
Predicted values from the fitted line:
Years of Experience | Predicted value (4.8·X + 9.15)
 2 | 18.75
 3 | 23.55
 5 | 33.15
13 | 71.55
 8 | 47.55
16 | 85.95
11 | 61.95
 1 | 13.95
 9 | 52.35
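OLS chooses the slope and intercept in closed form: slope = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)² and intercept = ȳ − slope·x̄. A minimal Python sketch follows; the (years, salary) pairs are made-up stand-ins, since the slide shows only the fitted line, and with the slide's own data this procedure yields the line ŷ = 4.8·X + 9.15 shown above.

import numpy as np

def ols_fit(x, y):
    """Closed-form ordinary least squares for a single predictor."""
    x_mean, y_mean = x.mean(), y.mean()
    slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    intercept = y_mean - slope * x_mean
    return slope, intercept

# Hypothetical (years of experience, salary) pairs for illustration only.
years = np.array([1, 2, 3, 5, 8, 9, 11, 13, 16], dtype=float)
salary = np.array([8, 20, 24, 35, 46, 54, 58, 70, 90], dtype=float)

m, c = ols_fit(years, salary)
print(f"fitted line: y = {m:.2f}*x + {c:.2f}")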
ORDINARY LEAST SQUARES (OLS)
ALGORITHM
Example data:
Internal Exam | External Exam
15 | 49
23 | 63
18 | 58
23 | 60
24 | 58
22 | 61
22 | 60
19 | 63
19 | 60
16 | 52
24 | 62
11 | 30
24 | 59
16 | 49
23 | 68
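Applying the same closed-form formulas to the internal/external exam marks in the table above is a one-step computation; this is a minimal sketch, and the resulting coefficients are printed by the code rather than quoted here.

import numpy as np

# Internal (x) and External (y) exam marks from the table above.
x = np.array([15, 23, 18, 23, 24, 22, 22, 19, 19, 16, 24, 11, 24, 16, 23], dtype=float)
y = np.array([49, 63, 58, 60, 58, 61, 60, 63, 60, 52, 62, 30, 59, 49, 68], dtype=float)

# OLS slope and intercept via the closed-form formulas.
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()
print(f"external ≈ {slope:.2f} * internal + {intercept:.2f}")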
ORDINARY LEAST SQUARES (OLS)
x y
1 2
2 1
3 3
4 6
5 9
6 11
7 13
8 15
9 17
10 20
ORDINARY LEAST SQUARES (OLS)
Sample Question
The 1/2 is a constant that helps cancel the 2 in the derivative of the function when doing the calculations for gradient descent; the goal remains to minimize the error between the predicted value and the actual value.
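As a short worked step (standard calculus, not shown on the slide), the 1/2 cancels when the cost is differentiated:
\[
J(P_1) = \frac{1}{2n}\sum_{i=1}^{n}\bigl(P_1 x_i - y_i\bigr)^2
\quad\Longrightarrow\quad
\frac{\partial J}{\partial P_1} = \frac{1}{2n}\sum_{i=1}^{n} 2\bigl(P_1 x_i - y_i\bigr)x_i
= \frac{1}{n}\sum_{i=1}^{n}\bigl(P_1 x_i - y_i\bigr)x_i .
\]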
HYPOTHESIS VS COST FUNCTION
Hypothesis: h(x) = P1·x (assume P0 = 0)
Parameter: P1
Cost function:
\[ J(P_1) = \frac{1}{2n}\sum_{i=1}^{n}\bigl(P_1 x_i - y_i\bigr)^2 \]
Goal: minimize J(P1) with respect to P1.
[Figure: for each choice of P1, the left panel plots h(x) against x (for fixed P1, h is a function of x) and the right panel plots J against P1 (a function of the parameter P1); the slides repeat this pair of panels for several values of P1.]
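A minimal sketch of evaluating J(P1) for several candidate values of P1; the (1, 1), (2, 2), (3, 3) training set is an assumed toy example matching the 0–3 axes of the plots.

import numpy as np

# Toy training set (assumed for illustration): y = x exactly.
x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])

def J(p1, x, y):
    """Cost function J(P1) = (1/2n) * sum((P1*x_i - y_i)^2), with P0 fixed at 0."""
    n = len(x)
    return np.sum((p1 * x - y) ** 2) / (2 * n)

# J is smallest at the P1 that makes h(x) = P1*x pass through the data.
for p1 in [0.0, 0.5, 1.0, 1.5]:
    print(f"P1 = {p1}: J = {J(p1, x, y):.3f}")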
THE GRADIENT DESCENT ALGORITHM
• Gradient descent is an iterative optimization algorithm to find the
minimum of a function.
• Here that function is the Loss Function.
• Think of a person walking down a valley: they take large steps when the slope is steep and small steps when the slope is gentler.
• Each next position is chosen from the current position, and the process stops when the person reaches the bottom of the valley, which is the goal.
THE GRADIENT DESCENT ALGORITHM
Let P0 = c and P1 = m.
Step 1:
Initially let m = 0 and c = 0. Let L be the learning rate, which controls how much the values of m and c change with each step. L is typically a small value such as 0.0001 for good accuracy.
Step 2:
Calculate the partial derivative of the loss function with respect to m, and plug the current values of x, y, m and c into it to obtain the derivative value Dₘ.
THE GRADIENT DESCENT ALGORITHM
Let P0 = c and P1 = m.
Step 3:
Dₘ is the value of the partial derivative with respect to m. Similarly, compute the partial derivative with respect to c to obtain Dc.
Step 4:
Now update the current values of m and c:
m = m − L·Dₘ,  c = c − L·Dc
Step 5:
Repeat this process until the loss function becomes very small or ideally 0 (which means zero error). The values of m and c that we are left with are the optimum values.
m can be considered the current position of the person walking down the valley.
When the slope is steeper (D is larger) they take longer steps, and when it is less steep (D is smaller) they take smaller steps.
Finally they arrive at the bottom of the valley, which corresponds to the minimum loss (ideally 0).
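A minimal sketch of Steps 1–5 in Python; the dataset, learning rate and iteration count are illustrative, and the derivative expressions in the comments use the mean-squared-error form of the loss.

import numpy as np

# Illustrative data: y is roughly 2*x + 1 with a little noise.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.0, 10.8])

m, c = 0.0, 0.0     # Step 1: initial slope and intercept
L = 0.01            # learning rate
n = len(x)

for _ in range(10000):
    y_pred = m * x + c
    # Steps 2-3: partial derivatives of the MSE loss (1/n) * sum((y - y_pred)^2)
    D_m = (-2.0 / n) * np.sum(x * (y - y_pred))
    D_c = (-2.0 / n) * np.sum(y - y_pred)
    # Step 4: update m and c against the gradient
    m -= L * D_m
    c -= L * D_c

# Step 5: after enough iterations m and c approach the least-squares values.
print(f"m = {m:.3f}, c = {c:.3f}")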
Metrics to check the goodness of fit
The coefficient of determination (R²) is a measure of how well the regression line fits the data.
The value of R² lies between 0 and 1 and is the proportion of variation explained by the regression model.
R² is a rough indicator of the worth of the regression model.
R² is the square of the correlation coefficient r (R² = r²).
Metrics
Sl. No | Term | Equation
1 | Total variation | y − mean(y)
2 | Explained variation | h(x) − mean(y)
3 | Unexplained variation | y − h(x)
The coefficient of determination is the ratio of the sum of squares due to regression to the total sum of squares:
\[ R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST} \]
where SST is the total sum of squares, SSR the sum of squares due to regression (explained variation), and SSE the sum of squared errors (unexplained variation).
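A minimal sketch of computing SST, SSE, SSR and R² in Python on illustrative data:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.0, 10.8])

# Fit the line and compute predictions.
p1, p0 = np.polyfit(x, y, deg=1)
y_pred = p0 + p1 * x

# Total, unexplained and explained sums of squares.
sst = np.sum((y - y.mean()) ** 2)          # total variation
sse = np.sum((y - y_pred) ** 2)            # unexplained variation
ssr = np.sum((y_pred - y.mean()) ** 2)     # explained variation

r2 = 1.0 - sse / sst                       # equivalently ssr / sst for an OLS fit
print("R^2:", r2)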
Advantages:
Simple to use
Works well with all dataset sizes
Gives information about features
Issues of Linear Regression
•Assumes the data is independent
•Outliers have a large effect
•Boundaries are linear
•Prone to overfitting
Solution
Regularization refers to techniques that are used to
calibrate machine learning models in order to minimize the
adjusted loss function and prevent overfitting or
underfitting.
Using regularization, the machine learning model can be fit so that it generalizes well to a given test set, and hence the errors on it are reduced.
Regularization
Regularization is used to introduce bias to the model and to
decrease the variance.
This can be achieved by modifying the loss function with
a penalty term which effectively shrinks the estimates of the
coefficients.
Therefore these types of methods within the framework of
regression are also called “shrinkage” methods or “penalized
regression” methods.
Impact of Regularization
Lasso regression
LASSO stands for Least Absolute Shrinkage and Selection
Operator.
Lasso regression, or L1 regularization, is a technique that augments the cost function with a penalty term equal to the sum of the absolute values of the non-intercept weights of the linear regression.
Ridge regression
Ridge regression, or L2 regularization, likewise adds a penalty to the cost function. The only difference is that the penalty is calculated using the squared values of the non-intercept weights of the linear regression.
L1 and L2 regularization
Lasso regression performs L1 regularization, which adds a penalty equal
to the absolute value of the magnitude of coefficients.
This type of regularization can result in sparse models with few
coefficients;
Some coefficients can become exactly zero and be eliminated from the model.
Larger penalties result in coefficient values closer to zero, which is ideal
for producing simpler models.
On the other hand, L2 regularization (e.g. Ridge regression) doesn’t
result in the elimination of coefficients or sparse models.
This makes the Lasso far easier to interpret than the Ridge.
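A minimal sketch of both penalties using scikit-learn's Lasso and Ridge estimators; the dataset, number of features and alpha values are illustrative, and alpha controls the strength of the penalty.

import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Illustrative data: 100 samples, 5 features, only the first two are informative.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

# L1 penalty: sum of |coefficients|; can drive some coefficients to exactly zero.
lasso = Lasso(alpha=0.1).fit(X, y)

# L2 penalty: sum of squared coefficients; shrinks coefficients but keeps them nonzero.
ridge = Ridge(alpha=0.1).fit(X, y)

print("lasso coefficients:", lasso.coef_)   # expect zeros for the uninformative features
print("ridge coefficients:", ridge.coef_)   # small but nonzero values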