Lab Manual 05
Machine Learning
Dated: 13th Feb, 2024 to 17th Feb, 2024
Semester: 2024
Objectives:
• Regularization
• Overfitting
• Underfitting
• Balanced fit
Introduction
Let's say you are trying to predict the number of matches a player wins based on age. Usually, as any sports person gets older, the number of matches won decreases, so the data might follow a distribution like the one in Figure 1. To model it, you can fit a simple linear regression with an equation of the form ŷ = θ₀ + θ₁x, where θ₀ and θ₁ are constants. But you can see that this straight line does not describe all the data points accurately: it finds the best fit it can as a straight line, yet many points drift away from it, and if your test data points lie in those regions, the line is not an accurate representation of the data distribution.

Alternatively, you can build a model like the one in Figure 2, where the curve passes through all the data points. In that case the equation is a high-order polynomial that predicts matches won from the age of a person. The issue here is that the equation is really complicated: the line zigzags just to pass through every training point, so when new data points arrive, this model does not generalize the distribution well either.

What would be better is a curve like the one in Figure 3, which is a balance between the two cases in Figure 1 and Figure 2. For new data points, the model in Figure 3 gives better predictions.
Overfitting:
If you train a model too much, so that it fits your training dataset too closely, it will perform poorly on the testing dataset.
The idea here is to shrink the parameters θ₀, θ₁, θ₂, θ₃, θ₄. If you can keep these parameters small, you get a better equation for your prediction function. How do we do that? Recall that in linear regression we calculate the mean squared error: during training we pass each sample, compute the prediction ŷ from the randomly initialized weights, and compare it with the true value y. This is the MSE:

MSE = (1/n) Σᵢ (yᵢ − ŷᵢ)²

Here each xᵢ can include higher-order polynomial terms of your feature, which in our case is the age of a person. (By the way, MSE is the training loss, and we want to minimize the value of this error on each iteration.)
L2 Regularization
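The manual's equation for this section appears to be missing, so we state the standard formulation: L2 regularization adds the sum of the squared parameters to the loss, scaled by a hyperparameter λ. Large coefficients now increase the cost, which pushes training to keep them small:

Cost = MSE + λ Σⱼ θⱼ²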
L1 Regularization
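L1 regularization (again stated in its standard form, since the original equation is missing) instead penalizes the absolute values of the parameters, which tends to drive some coefficients exactly to zero:

Cost = MSE + λ Σⱼ |θⱼ|

Below is a small synthetic-data demo; the data-generation lines are assumed, since the manual shows only the plotting calls.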
import numpy as np
import matplotlib.pyplot as plt
# generate noisy nonlinear data (this generation step is missing in the manual and assumed here)
np.random.seed(0)
X = np.sort(10 * np.random.rand(40, 1), axis=0)
y = np.sin(X).ravel() + 0.3 * np.random.randn(40)
plt.scatter(X, y)
plt.xlabel('X')
plt.ylabel('y')
plt.title('Synthetic Data')
plt.show()
Underfitting Example:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
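The manual stops at the imports here; a minimal completion, assuming the synthetic X and y generated above, fits a plain straight line to the nonlinear data:

underfit_model = LinearRegression()
underfit_model.fit(X, y)
# a straight line cannot follow the sine-shaped data, so the error stays high
print('Training MSE:', mean_squared_error(y, underfit_model.predict(X)))
plt.scatter(X, y)
plt.plot(X, underfit_model.predict(X), color='red', label='Underfitting Model')
plt.xlabel('X')
plt.ylabel('y')
plt.title('Underfitting Example')
plt.legend()
plt.show()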
Overfitting Example:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
# a very high polynomial degree makes the curve chase the noise in the training points
overfit_model = make_pipeline(PolynomialFeatures(degree=15), LinearRegression())
overfit_model.fit(X, y)
X_new = np.linspace(X.min(), X.max(), 200).reshape(-1, 1)
plt.scatter(X, y)
plt.plot(X_new, overfit_model.predict(X_new), color='green', label='Overfitting Model')
plt.xlabel('X')
plt.ylabel('y')
plt.title('Overfitting Example')
plt.legend()
plt.show()
from sklearn.linear_model import Lasso, Ridge
# L1 Regularization (Lasso)
lasso_model = Lasso(alpha=0.1)
lasso_model.fit(X, y)
# L2 Regularization (Ridge)
ridge_model = Ridge(alpha=0.1)
ridge_model.fit(X, y)
plt.scatter(X, y)
plt.plot(X, lasso_model.predict(X), color='orange', label='L1 Regularization (Lasso)')
plt.plot(X, ridge_model.predict(X), color='purple', label='L2 Regularization (Ridge)')
plt.xlabel('X')
plt.ylabel('y')
plt.title('L1 and L2 Regularization Example')
plt.legend()
plt.show()
Coding Section
Step 1:
# import libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
Step 2:
We are going to use the Melbourne House Price Dataset, where we'll predict house prices
based on various features.
https://www.kaggle.com/anthonypino/melbourne-housing-market
Step 3:
# read dataset
dataset = pd.read_csv('./Melbourne_housing_FULL.csv')
dataset.head()
Step 4:
# number of unique values in each column
dataset.nunique()
Step 5:
# let's use limited columns which makes more sense for serving our purpose
cols_to_use = ['Suburb', 'Rooms', 'Type', 'Method', 'SellerG',
'Regionname', 'Propertycount',
'Distance', 'CouncilArea', 'Bedroom2', 'Bathroom', 'Car',
'Landsize', 'BuildingArea', 'Price']
dataset = dataset[cols_to_use]
Step 6:
dataset.head()
dataset.shape
Step 7:
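The manual leaves this step empty; a natural step before the imputation in Step 8 (our assumption) is to count the missing values in each column:

# check missing values per column (assumed step)
dataset.isna().sum()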
Step 8:
# other continuous features can be imputed with the mean for faster results,
# since our focus is on reducing overfitting using Lasso and Ridge regression
dataset['Landsize'] = dataset['Landsize'].fillna(dataset.Landsize.mean())
dataset['BuildingArea'] = dataset['BuildingArea'].fillna(dataset.BuildingArea.mean())
Step 9:
Drop the remaining NA rows. Since Price is our target variable, we won't impute it.
dataset.dropna(inplace=True)
Step 10:
dataset.shape
Step 11:
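The manual leaves this step empty as well. Since the selected columns include categorical features (Suburb, Type, Method, SellerG, Regionname, CouncilArea), a reasonable guess is that this step one-hot encodes them so the linear models can consume them:

# one-hot encode categorical columns (assumed step)
dataset = pd.get_dummies(dataset, drop_first=True)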
Step 12:
dataset.head()
Step 13:
X = dataset.drop('Price', axis=1)
y = dataset['Price']
Step 14:
Let's train our Linear Regression model on the training dataset and check its accuracy on the test set.
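The split-and-fit code is not shown in the manual; a minimal sketch (the 70/30 split and random_state are our assumptions) is:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
# hold out 30% of the data for testing (assumed split)
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.3, random_state=2)
reg = LinearRegression()
reg.fit(train_X, train_y)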
reg.score(test_X, test_y)
reg.score(train_X, train_y)
Here the training score is 68% but the test score is only 13.85%, which is very low.
Plain linear regression is clearly overfitting the data, so let's try other models.
from sklearn.linear_model import Lasso, Ridge
# the alpha, max_iter, and tol values below are illustrative, not prescribed by the manual
lasso_reg = Lasso(alpha=50, max_iter=100, tol=0.1)
lasso_reg.fit(train_X, train_y)
lasso_reg.score(test_X, test_y)
lasso_reg.score(train_X, train_y)
ridge_reg = Ridge(alpha=50, max_iter=100, tol=0.1)
ridge_reg.fit(train_X, train_y)
ridge_reg.score(test_X, test_y)
ridge_reg.score(train_X, train_y)
We see that Lasso and Ridge regularization prove beneficial when our simple linear regression
model overfits. The contrast may not always be this stark, but the improvement is significant in
most cases. Note that L1 and L2 regularization are used in neural networks too.
Tasks
1. Run all the pieces of code and write up your understanding. Clearly demonstrate the
underfitting and overfitting problems in your own words. For every chunk of your code, you
are supposed to write your understanding.
2. Imagine you are working on a project to develop a model that predicts the grades of
students based on various features such as hours of study, attendance, and extracurricular
activities. You have collected a dataset with these features and corresponding grades for a
group of students.
Dataset:
• Features:
o Hours of study per week (numeric)
o Attendance percentage (numeric)
o Number of extracurricular activities (numeric)
• Target:
o Grades (numeric)
Your goal is to build a machine learning model to predict the grades accurately.
Task 3
Problem Description:
You are working with the famous Titanic dataset from Kaggle, which contains information about
passengers on the Titanic, including whether they survived or not. Your goal is to build a
machine learning model to predict the survival of passengers based on various features.
Dataset: https://www.kaggle.com/competitions/titanic
The target variable is Survived, indicating whether the passenger survived (1) or not (0).
Thanks