Lab Manual 05


UNIVERSITY OF ENGINEERING AND TECHNOLOGY, TAXILA

FACULTY OF TELECOMMUNICATION AND INFORMATION ENGINEERING


COMPUTER ENGINEERING DEPARTMENT

Machine Learning

L1 and L2 Regularization | Lasso, Ridge Regression

Dated:
13th Feb, 2024 to 17th Feb, 2024

Semester:
2024

Lab Instructor: Sheharyar Khan



Objectives:

The objectives of this session are:

• Regularization
• Overfitting
• Underfitting
• Balanced fitting

Introduction

Let's say you are trying to predict the number of matches a player wins based on their age. Usually, as any sportsperson gets older, the number of matches won decreases. So you can have the types of distribution shown below.

Figure 1 (Underfit)        Figure 2 (Overfit)        Figure 3 (Balanced)

To build a model, you could create a simple linear regression model, with an equation like

ŷ = θ₀ + θ₁ · age

where θ₀ and θ₁ are just constants. But you can see that this straight line does not accurately describe all the data points: it finds the best fit in terms of a straight line, yet the points drift away from it, and if you have test data points lying off the line, it is not a very accurate representation of the data distribution (Figure 1). Alternatively, you can build a model like the one in Figure 2, where we try to draw a line that passes through all of the data points. In that case the equation is a higher-order polynomial,

ŷ = θ₀ + θ₁x + θ₂x² + … + θₙxⁿ

where you are trying to predict matches won from the age of a person. The issue here is that the equation is overly complicated: the line is a zigzag that merely passes through all the training points, and if you get new data points, it again fails to generalize the distribution well. What works better is a line like the one in Figure 3, which is a balance between the two extreme cases in Figures 1 and 2: if we have new data points, the model in Figure 3 gives better predictions.


Overfitting:

If you train a model too much and fit it too closely to your training dataset, then you will have issues with the testing dataset: the model memorizes the training points instead of learning the general pattern.

How to Reduce Overfitting?

The idea here is to shrink your parameters θ₀, θ₁, θ₂, θ₃, … If you can keep these parameters smaller, then you get a better-behaved prediction function. How do we do that? Earlier, in linear regression, we calculated the mean squared error: when we run training we pass in each sample, calculate y_predicted from the randomly initialized weights, and compare it with the true value. This is what we call the MSE.

Mean Squared Error:

MSE = (1/n) · Σ (yᵢ − ŷᵢ)²

Here y_predicted (ŷ) is actually:

ŷ = θ₀ + θ₁x₁ + θ₂x₂ + … + θₙxₙ

where each xᵢ could be a higher-order polynomial term and x₁, x₂, … are your features, so in our case they come from the age of a person. Keep this MSE function in mind (by the way, MSE is used in training, and we want to minimize the value of this error on each iteration).

L2 Regularization

In the above equation, what if we add a penalty term?

Cost = (1/n) · Σ (yᵢ − ŷᵢ)² + λ · Σ θⱼ²

The higher the θ values, the higher this cost becomes, so minimizing it pushes the parameters toward smaller values. The hyperparameter λ controls how strongly the parameters are penalized.
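To make this concrete, here is a minimal sketch of the regularized cost (the data points and λ value are made up for illustration, not from the lab): the fit is identical in both calls, but the cost grows once the L2 penalty is added.

import numpy as np

# hypothetical toy data: age vs. matches won (made-up values)
X = np.array([20.0, 24.0, 28.0, 32.0, 36.0])
y = np.array([30.0, 34.0, 27.0, 22.0, 15.0])

def ridge_cost(theta0, theta1, lam):
    y_pred = theta0 + theta1 * X          # linear prediction
    mse = np.mean((y - y_pred) ** 2)      # plain MSE
    penalty = lam * theta1 ** 2           # L2 penalty (the bias term is usually not penalized)
    return mse + penalty

print(ridge_cost(50.0, -0.9, lam=0.0))    # cost = MSE only
print(ridge_cost(50.0, -0.9, lam=10.0))   # same fit, larger cost because of the penalty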

L1 Regularization
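The idea is the same, but the penalty uses absolute values instead of squares:

Cost = (1/n) · Σ (yᵢ − ŷᵢ)² + λ · Σ |θⱼ|

Unlike L2, the L1 penalty can drive some coefficients exactly to zero, which effectively performs feature selection.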

Example Dataset Code (Synthetic Dataset)

import numpy as np
import matplotlib.pyplot as plt

# Generating synthetic data


np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

# Plotting the synthetic data


plt.scatter(X, y)


plt.xlabel('X')
plt.ylabel('y')
plt.title('Synthetic Data')
plt.show()

Underfitting Example:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Fit a linear regression model (underfitting)


underfit_model = LinearRegression()
underfit_model.fit(X, y)

# Plotting the underfitting model


plt.scatter(X, y)
plt.plot(X, underfit_model.predict(X), color='red', label='Underfitting Model')
plt.xlabel('X')
plt.ylabel('y')
plt.title('Underfitting Example')
plt.legend()
plt.show()

# Evaluate underfitting model


underfit_mse = mean_squared_error(y, underfit_model.predict(X))
print(f'Underfitting Model MSE: {underfit_mse}')

Overfitting Example:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

# Create polynomial features for overfitting


degree = 10
poly_features = PolynomialFeatures(degree=degree, include_bias=False)
X_poly = poly_features.fit_transform(X)

# Fit a polynomial regression model (overfitting)


overfit_model = LinearRegression()
overfit_model.fit(X_poly, y)

# Plotting the overfitting model


X_new = np.linspace(0, 2, 100).reshape(100, 1)
X_new_poly = poly_features.transform(X_new)


plt.scatter(X, y)
plt.plot(X_new, overfit_model.predict(X_new_poly), color='green', label='Overfitting Model')
plt.xlabel('X')
plt.ylabel('y')
plt.title('Overfitting Example')
plt.legend()
plt.show()

# Evaluate overfitting model


overfit_mse = mean_squared_error(y, overfit_model.predict(X_poly))
print(f'Overfitting Model MSE: {overfit_mse}')

Balanced Fitting Example:


# Fit a well-balanced model
balanced_model = LinearRegression()
balanced_model.fit(X, y)

# Plotting the balanced model


plt.scatter(X, y)
plt.plot(X, balanced_model.predict(X), color='blue', label='Balanced Model')
plt.xlabel('X')
plt.ylabel('y')
plt.title('Balanced Fitting Example')
plt.legend()
plt.show()

# Evaluate balanced model


balanced_mse = mean_squared_error(y, balanced_model.predict(X))
print(f'Balanced Model MSE: {balanced_mse}')

L1 and L2 Regularization Example:


from sklearn.linear_model import Lasso, Ridge

# L1 Regularization (Lasso)
lasso_model = Lasso(alpha=0.1)
lasso_model.fit(X, y)

# L2 Regularization (Ridge)
ridge_model = Ridge(alpha=0.1)
ridge_model.fit(X, y)

# Plotting L1 and L2 regularization models


plt.scatter(X, y)
plt.plot(X, lasso_model.predict(X), color='orange', label='L1 Regularization (Lasso)')
plt.plot(X, ridge_model.predict(X), color='purple', label='L2 Regularization (Ridge)')
plt.xlabel('X')
plt.ylabel('y')
plt.title('L1 and L2 Regularization Example')
plt.legend()
plt.show()

# Evaluate L1 and L2 regularization models


lasso_mse = mean_squared_error(y, lasso_model.predict(X))
ridge_mse = mean_squared_error(y, ridge_model.predict(X))

print(f'L1 Regularization (Lasso) Model MSE: {lasso_mse}')


print(f'L2 Regularization (Ridge) Model MSE: {ridge_mse}')
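To see the regularization strength at work, you can sweep alpha and watch Lasso zero out coefficients on the polynomial features from the overfitting example above (the alpha values and max_iter here are illustrative choices):

# Sweep the regularization strength; a stronger alpha zeroes out more coefficients.
# Reuses X_poly and y from the overfitting example above.
for alpha in [0.001, 0.1, 10]:
    model = Lasso(alpha=alpha, max_iter=10000)
    model.fit(X_poly, y.ravel())
    n_zero = int(np.sum(model.coef_ == 0))
    print(f'alpha={alpha}: {n_zero} of {len(model.coef_)} coefficients are exactly 0')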

Coding Section

Step 1: Open Google Colab, then paste and run the following.

# import libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

Step 2:

# Suppress Warnings for clean notebook


import warnings
warnings.filterwarnings('ignore')

Step 3:

We are going to use the Melbourne Housing Market dataset, where we'll predict house prices
based on various features.

The Dataset Link is

https://www.kaggle.com/anthonypino/melbourne-housing-market


# read dataset
dataset = pd.read_csv('./Melbourne_housing_FULL.csv')
dataset.head()

Step 4:

dataset.nunique()

Step 5:

# let's use a limited set of columns that better serve our purpose
cols_to_use = ['Suburb', 'Rooms', 'Type', 'Method', 'SellerG',
'Regionname', 'Propertycount',
'Distance', 'CouncilArea', 'Bedroom2', 'Bathroom', 'Car',
'Landsize', 'BuildingArea', 'Price']
dataset = dataset[cols_to_use]

Step 6:

dataset.head()
dataset.shape

Step 7:

Checking for NaN values


dataset.isna().sum()

Step 8

Handling Missing values


# Some features' missing values can be treated as zero (another class for
# NA values, or the absence of that feature):
# e.g. 0 for Propertycount and Bedroom2 will refer to another class of NA values
# e.g. 0 for the Car feature will mean that there's no car parking with the house
cols_to_fill_zero = ['Propertycount', 'Distance', 'Bedroom2', 'Bathroom', 'Car']
dataset[cols_to_fill_zero] = dataset[cols_to_fill_zero].fillna(0)

# other continuous features can be imputed with the mean for faster results,
# since our focus is on reducing overfitting using Lasso and Ridge regression
dataset['Landsize'] = dataset['Landsize'].fillna(dataset.Landsize.mean())
dataset['BuildingArea'] = dataset['BuildingArea'].fillna(dataset.BuildingArea.mean())

Step 9

Drop rows with NA in Price; since it's our target variable, we won't impute it.

dataset.dropna(inplace=True)

Step 10

dataset.shape

Step 11:

Let's one hot encode the categorical features


dataset = pd.get_dummies(dataset, drop_first=True)
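As a quick illustration of what get_dummies does (a toy frame, not the housing data):

tiny = pd.DataFrame({'Type': ['h', 'u', 't', 'h']})
print(pd.get_dummies(tiny, drop_first=True))
# drop_first=True drops the first dummy column ('Type_h' here) because its
# value is implied by the others; this avoids perfectly collinear columns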

Step 12:

dataset.head()


Step 13

Let's split our dataset into training and test sets

X = dataset.drop('Price', axis=1)
y = dataset['Price']

from sklearn.model_selection import train_test_split


train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.3, random_state=2)

Step 14

Let's train our linear regression model on the training dataset and check the R² score on the test set

from sklearn.linear_model import LinearRegression


reg = LinearRegression().fit(train_X, train_y)

reg.score(test_X, test_y)

reg.score(train_X, train_y)

Here the training score is 68% but the test score is 13.85%, which is very low.

Plain linear regression is clearly overfitting the data, so let's try other models.

Using Lasso (L1 Regularized) Regression Model

from sklearn import linear_model


# alpha sets the regularization strength; max_iter and tol relax the
# optimizer's stopping criteria so training finishes quickly
lasso_reg = linear_model.Lasso(alpha=50, max_iter=100, tol=0.1)


lasso_reg.fit(train_X, train_y)

lasso_reg.score(test_X, test_y)

lasso_reg.score(train_X, train_y)

Using Ridge (L2 Regularized) Regression Model

from sklearn.linear_model import Ridge


# same regularization strength as Lasso for a fair comparison
ridge_reg = Ridge(alpha=50, max_iter=100, tol=0.1)
ridge_reg.fit(train_X, train_y)

ridge_reg.score(test_X, test_y)

ridge_reg.score(train_X, train_y)

We see that Lasso and Ridge regularization prove beneficial when our simple linear regression
model overfits. The contrast in these results may not be dramatic here, but it is significant in
most cases. Note that L1 and L2 regularization are used in neural networks too.
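For example, scikit-learn's own neural network models expose an L2 penalty through the same kind of alpha parameter (a minimal sketch reusing train_X/test_X from above; MLPRegressor and its settings are an illustrative choice, not part of the original lab):

from sklearn.neural_network import MLPRegressor

# alpha is the L2 penalty applied to the network's weights
nn_reg = MLPRegressor(hidden_layer_sizes=(32,), alpha=1.0, max_iter=500)
nn_reg.fit(train_X, train_y)
print(nn_reg.score(test_X, test_y))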

Tasks

1. Run all the pieces of code and write up your understanding. Clearly demonstrate the
underfitting and overfitting problems in your own words. For every chunk of code, you are
expected to write your understanding.

2. Imagine you are working on a project to develop a model that predicts the grades of
students based on various features such as hours of study, attendance, and extracurricular
activities. You have collected a dataset with these features and corresponding grades for a
group of students.


Dataset:

• Features:
o Hours of study per week (numeric)
o Attendance percentage (numeric)
o Number of extracurricular activities (numeric)
• Target:
o Grades (numeric)

Your goal is to build a machine learning model to predict the grades accurately.

| Hours of Study | Attendance (%) | Extracurricular Activities | Grades |
|----------------|----------------|----------------------------|--------|
| 5              | 85             | 2                          | 75     |
| 8              | 92             | 3                          | 90     |
| 4              | 80             | 1                          | 60     |
| 7              | 88             | 2                          | 80     |
| 6              | 90             | 1                          | 70     |
| 10             | 95             | 3                          | 95     |
| 3              | 75             | 1                          | 50     |
| 9              | 94             | 2                          | 85     |
| 5              | 82             | 1                          | 65     |
| 8              | 89             | 3                          | 88     |
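A possible starting point is to load the table into a DataFrame and compare a plain linear model against Ridge and Lasso (a sketch only; the column names and alpha values are suggestions, not a required solution):

import pandas as pd
from sklearn.linear_model import LinearRegression, Lasso, Ridge

data = pd.DataFrame({
    'Hours': [5, 8, 4, 7, 6, 10, 3, 9, 5, 8],
    'Attendance': [85, 92, 80, 88, 90, 95, 75, 94, 82, 89],
    'Activities': [2, 3, 1, 2, 1, 3, 1, 2, 1, 3],
    'Grades': [75, 90, 60, 80, 70, 95, 50, 85, 65, 88],
})
X = data.drop('Grades', axis=1)
y = data['Grades']

# Compare the three models on the same data
for name, model in [('Linear', LinearRegression()),
                    ('Ridge', Ridge(alpha=1.0)),
                    ('Lasso', Lasso(alpha=1.0))]:
    model.fit(X, y)
    print(f'{name} R^2: {model.score(X, y):.3f}')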

Task 3

Problem Description:

You are working with the famous Titanic dataset from Kaggle, which contains information about
passengers on the Titanic, including whether they survived or not. Your goal is to build a
machine learning model to predict the survival of passengers based on various features.

Dataset: https://www.kaggle.com/competitions/titanic

The dataset contains the following features:

• Pclass: Passenger class (1st, 2nd, or 3rd)
• Sex: Gender of the passenger
• Age: Age of the passenger
• SibSp: Number of siblings/spouses aboard
• Parch: Number of parents/children aboard
• Fare: Fare paid for the ticket
• Embarked: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

The target variable is Survived, indicating whether the passenger survived (1) or not (0).
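A minimal starting sketch, assuming you have downloaded train.csv from the competition page (the file name and the simple median imputation are assumptions, not requirements):

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv('train.csv')  # assumed file name from the Kaggle download
features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
X = pd.get_dummies(df[features], drop_first=True)  # one-hot encode Sex and Embarked
X['Age'] = X['Age'].fillna(X['Age'].median())      # impute missing ages
y = df['Survived']

train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.3, random_state=2)
clf = LogisticRegression(max_iter=1000)  # L2-regularized by default (penalty='l2')
clf.fit(train_X, train_y)
print(clf.score(test_X, test_y))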

Thanks
