Lab Manual 05
Machine Learning
Dated: 13th Feb, 2024 to 17th Feb, 2024
Semester: 2024
Objectives:
• Regularization
• Overfitting
• Underfitting
• Balanced fit
Introduction
Let's say you are trying to predict the number of matches a player wins based on age. Usually, as any sports person gets older, the number of matches won decreases, so the data might follow a distribution like the one in Figure 1. To model it, you can fit a simple linear regression with an equation of the form ŷ = θ₀ + θ₁x, where θ₀ and θ₁ are constants. But you can see that this straight line does not describe all the data points accurately: it finds the best fit it can as a straight line, yet many points drift away from it, and if your test data points lie in those regions, the line is not an accurate representation of the data distribution.

Alternatively, you can build a model like the one in Figure 2, where the curve passes through all the data points. In that case the equation is a high-order polynomial that predicts matches won from the age of a person. The issue here is that the equation is really complicated: the line zigzags just to pass through every training point, so when new data points arrive, this model does not generalize the distribution well either.

What would be better is a curve like the one in Figure 3, which is a balance between the two cases in Figure 1 and Figure 2. For new data points, the model in Figure 3 gives better predictions.
Overfitting:
If you train a model too much, so that it fits your training dataset too closely, it will perform poorly on the testing dataset.
The idea here is to shrink the parameters θ₀, θ₁, θ₂, θ₃, θ₄. If you can keep these parameters small, you get a better equation for your prediction function. How do we do that? Recall that in linear regression we calculate the mean squared error: during training we pass each sample, compute the prediction ŷ from the randomly initialized weights, and compare it with the true value y. This is the MSE:

MSE = (1/n) Σᵢ (yᵢ − ŷᵢ)²

Here each xᵢ can include higher-order polynomial terms of your feature, which in our case is the age of a person. (By the way, MSE is the training loss, and we want to minimize the value of this error on each iteration.)
L2 Regularization
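The manual's equation for this section appears to be missing, so we state the standard formulation: L2 regularization adds the sum of the squared parameters to the loss, scaled by a hyperparameter λ. Large coefficients now increase the cost, which pushes training to keep them small:

Cost = MSE + λ Σⱼ θⱼ²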
L1 Regularization
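L1 regularization (again stated in its standard form, since the original equation is missing) instead penalizes the absolute values of the parameters, which tends to drive some coefficients exactly to zero:

Cost = MSE + λ Σⱼ |θⱼ|

Below is a small synthetic-data demo; the data-generation lines are assumed, since the manual shows only the plotting calls.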
import numpy as np
import matplotlib.pyplot as plt
# generate noisy nonlinear data (this generation step is missing in the manual and assumed here)
np.random.seed(0)
X = np.sort(10 * np.random.rand(40, 1), axis=0)
y = np.sin(X).ravel() + 0.3 * np.random.randn(40)
plt.scatter(X, y)
plt.xlabel('X')
plt.ylabel('y')
plt.title('Synthetic Data')
plt.show()
Underfitting Example:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
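The manual stops at the imports here; a minimal completion, assuming the synthetic X and y generated above, fits a plain straight line to the nonlinear data:

underfit_model = LinearRegression()
underfit_model.fit(X, y)
# a straight line cannot follow the sine-shaped data, so the error stays high
print('Training MSE:', mean_squared_error(y, underfit_model.predict(X)))
plt.scatter(X, y)
plt.plot(X, underfit_model.predict(X), color='red', label='Underfitting Model')
plt.xlabel('X')
plt.ylabel('y')
plt.title('Underfitting Example')
plt.legend()
plt.show()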
Overfitting Example:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
# a very high polynomial degree makes the curve chase the noise in the training points
overfit_model = make_pipeline(PolynomialFeatures(degree=15), LinearRegression())
overfit_model.fit(X, y)
X_new = np.linspace(X.min(), X.max(), 200).reshape(-1, 1)
plt.scatter(X, y)
plt.plot(X_new, overfit_model.predict(X_new), color='green', label='Overfitting Model')
plt.xlabel('X')
plt.ylabel('y')
plt.title('Overfitting Example')
plt.legend()
plt.show()
from sklearn.linear_model import Lasso, Ridge
# L1 Regularization (Lasso)
lasso_model = Lasso(alpha=0.1)
lasso_model.fit(X, y)
# L2 Regularization (Ridge)
ridge_model = Ridge(alpha=0.1)
ridge_model.fit(X, y)
plt.scatter(X, y)
plt.plot(X, lasso_model.predict(X), color='orange', label='L1 Regularization (Lasso)')
plt.plot(X, ridge_model.predict(X), color='purple', label='L2 Regularization (Ridge)')
plt.xlabel('X')
plt.ylabel('y')
plt.title('L1 and L2 Regularization Example')
plt.legend()
plt.show()
Coding Section
Step 1:
# import libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
Step 2:
We are going to use the Melbourne House Price Dataset, where we'll predict house prices
based on various features.
https://www.kaggle.com/anthonypino/melbourne-housing-market
Step 3:
# read dataset
dataset = pd.read_csv('./Melbourne_housing_FULL.csv')
dataset.head()
Step 4:
# number of unique values in each column
dataset.nunique()
Step 5:
# let's use limited columns which makes more sense for serving our purpose
cols_to_use = ['Suburb', 'Rooms', 'Type', 'Method', 'SellerG',
'Regionname', 'Propertycount',
'Distance', 'CouncilArea', 'Bedroom2', 'Bathroom', 'Car',
'Landsize', 'BuildingArea', 'Price']
dataset = dataset[cols_to_use]
Step 6:
dataset.head()
dataset.shape
Step 7:
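The manual leaves this step empty; a natural step before the imputation in Step 8 (our assumption) is to count the missing values in each column:

# check missing values per column (assumed step)
dataset.isna().sum()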
Step 8:
# other continuous features can be imputed with the mean for faster results,
# since our focus is on reducing overfitting using Lasso and Ridge regression
dataset['Landsize'] = dataset['Landsize'].fillna(dataset.Landsize.mean())
dataset['BuildingArea'] = dataset['BuildingArea'].fillna(dataset.BuildingArea.mean())
Step 9:
Drop the remaining NA rows. Since Price is our target variable, we won't impute it.
dataset.dropna(inplace=True)
Step 10:
dataset.shape
Step 11:
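The manual leaves this step empty as well. Since the selected columns include categorical features (Suburb, Type, Method, SellerG, Regionname, CouncilArea), a reasonable guess is that this step one-hot encodes them so the linear models can consume them:

# one-hot encode categorical columns (assumed step)
dataset = pd.get_dummies(dataset, drop_first=True)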
Step 12:
dataset.head()
Step 13:
X = dataset.drop('Price', axis=1)
y = dataset['Price']
Step 14:
Let's train our Linear Regression model on the training dataset and check its accuracy on the test set.
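The split-and-fit code is not shown in the manual; a minimal sketch (the 70/30 split and random_state are our assumptions) is:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
# hold out 30% of the data for testing (assumed split)
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.3, random_state=2)
reg = LinearRegression()
reg.fit(train_X, train_y)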
reg.score(test_X, test_y)
reg.score(train_X, train_y)
Here the training score is 68% but the test score is only 13.85%, which is very low.
Plain linear regression is clearly overfitting the data, so let's try other models.
from sklearn.linear_model import Lasso, Ridge
# the alpha, max_iter, and tol values below are illustrative, not prescribed by the manual
lasso_reg = Lasso(alpha=50, max_iter=100, tol=0.1)
lasso_reg.fit(train_X, train_y)
lasso_reg.score(test_X, test_y)
lasso_reg.score(train_X, train_y)
ridge_reg = Ridge(alpha=50, max_iter=100, tol=0.1)
ridge_reg.fit(train_X, train_y)
ridge_reg.score(test_X, test_y)
ridge_reg.score(train_X, train_y)
We see that Lasso and Ridge regularization prove beneficial when our simple linear regression
model overfits. The contrast may not always be this stark, but the improvement is significant in
most cases. Note that L1 and L2 regularization are used in neural networks too.
Tasks
1. Run all the pieces of code and write up your understanding. Clearly demonstrate the
underfitting and overfitting problems in your own words. For every chunk of your code, you
are supposed to write your understanding.
2. Imagine you are working on a project to develop a model that predicts the grades of
students based on various features such as hours of study, attendance, and extracurricular
activities. You have collected a dataset with these features and corresponding grades for a
group of students.
Dataset:
• Features:
o Hours of study per week (numeric)
o Attendance percentage (numeric)
o Number of extracurricular activities (numeric)
• Target:
o Grades (numeric)
Your goal is to build a machine learning model to predict the grades accurately.
Task 3
Problem Description:
You are working with the famous Titanic dataset from Kaggle, which contains information about
passengers on the Titanic, including whether they survived or not. Your goal is to build a
machine learning model to predict the survival of passengers based on various features.
Dataset: https://www.kaggle.com/competitions/titanic
The target variable is Survived, indicating whether the passenger survived (1) or not (0).
Thanks