Lab Manual 04


UNIVERSITY OF ENGINEERING AND TECHNOLOGY, TAXILA

FACULTY OF TELECOMMUNICATION AND INFORMATION ENGINEERING


COMPUTER ENGINEERING DEPARTMENT

Machine Learning

Logistic Regression

Dated:
29th Jan, 2024 to 2nd Feb, 2024

Semester:
2024

Lab Instructor: Sheharyar Khan



Objectives:-

The objectives of this session are:

1. Train/Test
2. Save and Load trained Model
3. Pickle and sklearn joblib
4. Logistic Regression

What is Train/Test

Train/Test is a method to measure the accuracy of your model.

It is called Train/Test because you split the data set into two sets: a training set and a testing set.

80% for training, and 20% for testing.

You train the model using the training set.

You test the model using the testing set.

Start With a Data Set

Start with a data set you want to test.

Our data set illustrates 100 customers in a shop, and their shopping habits.

import numpy
import matplotlib.pyplot as plt
numpy.random.seed(2)

x = numpy.random.normal(3, 1, 100)
y = numpy.random.normal(150, 40, 100) / x

plt.scatter(x, y)
plt.show()

Split Into Train/Test

The training set should be a random selection of 80% of the original data.

The testing set should be the remaining 20%.


train_x = x[:80]
train_y = y[:80]

test_x = x[80:]
test_y = y[80:]
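
The slices above take the first 80 points for training, which is a random selection here only because the data was generated in random order. For data that arrives sorted, a shuffled split is safer. A minimal sketch, assuming scikit-learn is installed; the test_size and random_state values are illustrative choices, not part of the original lab:

```python
import numpy
from sklearn.model_selection import train_test_split

numpy.random.seed(2)
x = numpy.random.normal(3, 1, 100)
y = numpy.random.normal(150, 40, 100) / x

# Shuffle the data, then keep 80% for training and 20% for testing.
# random_state (an arbitrary value) fixes the shuffle so the split is reproducible.
train_x, test_x, train_y, test_y = train_test_split(
    x, y, test_size=0.2, random_state=42
)

print(len(train_x), len(test_x))
```

The rest of this manual keeps the simple slicing approach, so the numbers you get from this sketch may differ from the examples that follow.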

Display the Training Set

Display the same scatter plot with the training set:

plt.scatter(train_x, train_y)
plt.show()

Display the Testing Set

To make sure the testing set is not completely different, we will take a look at the testing set as
well.

plt.scatter(test_x, test_y)
plt.show()

Example

Draw a polynomial regression line through the data points:

import numpy
import matplotlib.pyplot as plt
numpy.random.seed(2)

x = numpy.random.normal(3, 1, 100)
y = numpy.random.normal(150, 40, 100) / x

train_x = x[:80]
train_y = y[:80]

test_x = x[80:]
test_y = y[80:]

mymodel = numpy.poly1d(numpy.polyfit(train_x, train_y, 4))

myline = numpy.linspace(0, 6, 100)


plt.scatter(train_x, train_y)
plt.plot(myline, mymodel(myline))
plt.show()

The result can back my suggestion of the data set fitting a polynomial regression, even though it
would give us some weird results if we try to predict values outside of the data set. Example: the
line indicates that a customer spending 6 minutes in the shop would make a purchase worth 200.
That is probably a sign of overfitting.

But what about the R-squared score? The R-squared score is a good indicator of how well my
data set is fitting the model.

R2

Remember R2, also known as R-squared?

It measures the relationship between the x axis and the y axis. A value of 0 means no relationship, and 1 means totally related. Note that r2_score() can also return a negative value when the model fits worse than always predicting the mean of y.

The sklearn module has a method called r2_score() that will help us find this relationship.

In this case we would like to measure the relationship between the minutes a customer stays in
the shop and how much money they spend.
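
To make the definition concrete, R-squared can be computed by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below is only an illustration (the variable names ss_res, ss_tot and manual_r2 are my own) and compares the hand-computed value with r2_score():

```python
import numpy
from sklearn.metrics import r2_score

numpy.random.seed(2)
x = numpy.random.normal(3, 1, 100)
y = numpy.random.normal(150, 40, 100) / x

# Fit a degree-4 polynomial, as in the other examples of this lab
mymodel = numpy.poly1d(numpy.polyfit(x, y, 4))
predicted = mymodel(x)

# R-squared = 1 - (residual sum of squares / total sum of squares)
ss_res = numpy.sum((y - predicted) ** 2)
ss_tot = numpy.sum((y - numpy.mean(y)) ** 2)
manual_r2 = 1 - ss_res / ss_tot

print(manual_r2)
print(r2_score(y, predicted))
```

Both printed values should agree up to floating point rounding.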

Example

How well does my training data fit in a polynomial regression?

import numpy
from sklearn.metrics import r2_score
numpy.random.seed(2)

x = numpy.random.normal(3, 1, 100)
y = numpy.random.normal(150, 40, 100) / x

train_x = x[:80]
train_y = y[:80]

test_x = x[80:]
test_y = y[80:]


mymodel = numpy.poly1d(numpy.polyfit(train_x, train_y, 4))

r2 = r2_score(train_y, mymodel(train_x))

print(r2)

Bring in the Testing Set

Now we have made a model that is OK, at least when it comes to training data.

Now we want to test the model with the testing data as well, to see if it gives us the same result.

Example

Let us find the R2 score when using testing data:

import numpy
from sklearn.metrics import r2_score
numpy.random.seed(2)

x = numpy.random.normal(3, 1, 100)
y = numpy.random.normal(150, 40, 100) / x

train_x = x[:80]
train_y = y[:80]

test_x = x[80:]
test_y = y[80:]

mymodel = numpy.poly1d(numpy.polyfit(train_x, train_y, 4))

r2 = r2_score(test_y, mymodel(test_x))

print(r2)
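
Comparing the two scores side by side for several polynomial degrees can help you spot overfitting. The following is a rough sketch, not part of the original lab; the range of degrees (1 to 6) is an arbitrary assumption:

```python
import numpy
from sklearn.metrics import r2_score

numpy.random.seed(2)
x = numpy.random.normal(3, 1, 100)
y = numpy.random.normal(150, 40, 100) / x

train_x, train_y = x[:80], y[:80]
test_x, test_y = x[80:], y[80:]

train_scores = []
test_scores = []

# Fit one polynomial per degree on the training set only,
# then score it on both the training set and the testing set.
for degree in range(1, 7):
    model = numpy.poly1d(numpy.polyfit(train_x, train_y, degree))
    train_scores.append(r2_score(train_y, model(train_x)))
    test_scores.append(r2_score(test_y, model(test_x)))
    print(degree, round(train_scores[-1], 3), round(test_scores[-1], 3))
```

In principle the training score can only stay level or rise as the degree grows, so a testing score that falls while the training score keeps rising suggests the model is overfitting.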

Predict Values

Now that we have established that our model is OK, we can start predicting new values.

Example

How much money will a buying customer spend, if she or he stays in the shop for 5 minutes?


print(mymodel(5))

Save and Load trained Model

Solving a problem in ML typically consists of two steps. The first step is to train a model using your training dataset. The second step is to ask your questions of the trained model, which, somewhat like a human brain, gives you the answers.

The training dataset is often huge, because a model generally becomes more accurate as the amount of training data increases. It is like football training: the more you train, the better you become at the game. When the training dataset is gigabytes in size, the training step becomes time consuming.

If you save the trained model to a file, you can later use that same model to make the actual predictions. So, you don’t need to train it every time you want to ask these questions.
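
The idea can be sketched without any file at all by serializing a trained model to bytes and restoring it. The area and price values below are hypothetical toy data chosen only for illustration; they are not the lab's home prices file:

```python
import pickle

import numpy
from sklearn import linear_model

# Hypothetical toy data: area in square feet and price (illustrative values only)
area = numpy.array([2600, 3000, 3200, 3600, 4000]).reshape(-1, 1)
price = numpy.array([550000, 565000, 610000, 680000, 725000])

# Step 1: train the model once
model = linear_model.LinearRegression()
model.fit(area, price)

# Serialize the trained model to bytes, then restore it
saved_bytes = pickle.dumps(model)
restored = pickle.loads(saved_bytes)

# Step 2: the restored model answers questions without retraining
print(model.predict([[5000]]))
print(restored.predict([[5000]]))
```

Only unpickle data you trust, because loading a pickle can execute arbitrary code.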

Quick Task Load Saved Model Example

Open the Linear Regression Python file that predicts home prices and make these changes.


In that file, add a new code cell in Google Colab and write the code below:

import pickle

# Save the trained model to a file (binary write mode)
with open('model_pickle', 'wb') as file:
    pickle.dump(model, file)

# Load the model back from the file (binary read mode)
with open('model_pickle', 'rb') as file:
    mp = pickle.load(file)

mp.coef_

mp.intercept_

mp.predict([[5000]])

Save Trained Model Using joblib (Second Way to save Model)

# sklearn.externals.joblib was removed in scikit-learn 0.23; import joblib directly
import joblib

joblib.dump(model, 'model_joblib')

Load Saved Model

mj = joblib.load('model_joblib')

mj.coef_

mj.intercept_

mj.predict([[5000]])
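
Both approaches can be tried together in one self-contained cell. This is a minimal sketch under stated assumptions: the toy area and price values are hypothetical, and a temporary folder is used so no files are left behind:

```python
import os
import pickle
import tempfile

import joblib
import numpy
from sklearn import linear_model

# Hypothetical toy data: area in square feet and price (illustrative values only)
area = numpy.array([2600, 3000, 3200, 3600, 4000]).reshape(-1, 1)
price = numpy.array([550000, 565000, 610000, 680000, 725000])

model = linear_model.LinearRegression()
model.fit(area, price)

with tempfile.TemporaryDirectory() as folder:
    pickle_path = os.path.join(folder, 'model_pickle')
    joblib_path = os.path.join(folder, 'model_joblib')

    # Save the same model in both ways
    with open(pickle_path, 'wb') as file:
        pickle.dump(model, file)
    joblib.dump(model, joblib_path)

    # Load both copies back
    with open(pickle_path, 'rb') as file:
        mp = pickle.load(file)
    mj = joblib.load(joblib_path)

# Both restored models should give the same prediction
print(mp.predict([[5000]]))
print(mj.predict([[5000]]))
```

Note that pickle works with a file object you open yourself, while joblib.dump() and joblib.load() take the file name directly.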

Question to think about: What is the difference between joblib and pickle?


Logistic Regression

Logistic regression aims to solve classification problems. It does this by predicting categorical
outcomes, unlike linear regression that predicts a continuous outcome.

In the simplest case there are two outcomes, which is called binomial. An example is predicting if a tumor is malignant or benign. Cases with more than two outcomes to classify are called multinomial. A common example for multinomial logistic regression would be predicting the class of an iris flower between 3 different species.

Here we will be using basic logistic regression to predict a binomial variable. This means it has
only two possible outcomes.

How does it work?

In Python we have modules that will do the work for us. Start by importing the NumPy module.

import numpy

Store the independent variables in X.

Store the dependent variable in y.

Below is a sample dataset:

# X represents the size of a tumor in centimeters.
# Note: X has to be reshaped into a column from a row for the
# LogisticRegression() function to work.
X = numpy.array([3.78, 2.44, 2.09, 0.14, 1.72, 1.65, 4.92, 4.37, 4.96, 4.52,
                 3.69, 5.88]).reshape(-1,1)

# y represents whether or not the tumor is cancerous (0 for "No", 1 for "Yes").
y = numpy.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

We will use a method from the sklearn module, so we will have to import that module as well:

from sklearn import linear_model

From the sklearn module we will use the LogisticRegression() method to create a logistic
regression object.

This object has a method called fit() that takes the independent and dependent values as
parameters and fills the regression object with data that describes the relationship:

logr = linear_model.LogisticRegression()
logr.fit(X,y)

Now we have a logistic regression object that is ready to predict whether a tumor is cancerous based on
the tumor size:

# predict if tumor is cancerous where the size is 3.46cm:
predicted = logr.predict(numpy.array([3.46]).reshape(-1,1))

Example
import numpy
from sklearn import linear_model

# Reshaped for Logistic function.
X = numpy.array([3.78, 2.44, 2.09, 0.14, 1.72, 1.65, 4.92, 4.37, 4.96, 4.52, 3.69, 5.88]).reshape(-1,1)
y = numpy.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

logr = linear_model.LogisticRegression()
logr.fit(X,y)

# predict if tumor is cancerous where the size is 3.46cm:
predicted = logr.predict(numpy.array([3.46]).reshape(-1,1))
print(predicted)
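
predict() returns only the class label. If you also want to see how confident the model is, scikit-learn's predict_proba() returns one probability per class. A short sketch follows; the new tumor sizes are hypothetical values picked for illustration:

```python
import numpy
from sklearn import linear_model

X = numpy.array([3.78, 2.44, 2.09, 0.14, 1.72, 1.65,
                 4.92, 4.37, 4.96, 4.52, 3.69, 5.88]).reshape(-1, 1)
y = numpy.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

logr = linear_model.LogisticRegression()
logr.fit(X, y)

# Hypothetical new tumor sizes in centimeters
new_sizes = numpy.array([1.0, 3.46, 5.5]).reshape(-1, 1)

labels = logr.predict(new_sizes)               # class labels, 0 or 1
probabilities = logr.predict_proba(new_sizes)  # columns: P(class 0), P(class 1)

for size, label, prob in zip(new_sizes.ravel(), labels, probabilities[:, 1]):
    print(size, label, round(prob, 2))
```

Each row of predict_proba() sums to 1, and the second column is the estimated probability that the tumor is cancerous.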

Coefficient

In logistic regression the coefficient is the expected change in log-odds of having the outcome
per unit change in X.

This does not have the most intuitive understanding so let's use it to create something that makes
more sense, odds.

Example

See the whole example in action:

import numpy
from sklearn import linear_model

# Reshaped for Logistic function.
X = numpy.array([3.78, 2.44, 2.09, 0.14, 1.72, 1.65, 4.92, 4.37, 4.96, 4.52, 3.69, 5.88]).reshape(-1,1)


y = numpy.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

logr = linear_model.LogisticRegression()
logr.fit(X,y)

log_odds = logr.coef_
odds = numpy.exp(log_odds)

print(odds)

This tells us that as the size of a tumor increases by 1cm, the odds of it being a cancerous tumor
increase by about 4x.
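
One way to check this interpretation is to compute the odds at two sizes that are one unit apart and divide them. The helper odds_of_cancer() below is a hypothetical name introduced only for this sketch, and the sizes 2.0 and 3.0 are arbitrary:

```python
import numpy
from sklearn import linear_model

X = numpy.array([3.78, 2.44, 2.09, 0.14, 1.72, 1.65,
                 4.92, 4.37, 4.96, 4.52, 3.69, 5.88]).reshape(-1, 1)
y = numpy.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

logr = linear_model.LogisticRegression()
logr.fit(X, y)

def odds_of_cancer(size):
    # odds = exp(log-odds) = exp(coefficient * size + intercept)
    log_odds = logr.coef_[0][0] * size + logr.intercept_[0]
    return numpy.exp(log_odds)

# Ratio of the odds for two tumors whose sizes differ by one unit
ratio = odds_of_cancer(3.0) / odds_of_cancer(2.0)

print(ratio)
print(numpy.exp(logr.coef_[0][0]))
```

The two printed numbers should match, because the intercept cancels out in the ratio and only the exponentiated coefficient remains.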

Probability

The coefficient and intercept values can be used to find the probability that each tumor is
cancerous.

Create a function that uses the model's coefficient and intercept values to return a new value.
This new value represents the probability that the given observation is a cancerous tumor:

def logit2prob(logr, x):
    log_odds = logr.coef_ * x + logr.intercept_
    odds = numpy.exp(log_odds)
    probability = odds / (1 + odds)
    return probability

Function Explained

To find the log-odds for each observation, we must first create a formula that looks similar to the
one from linear regression, extracting the coefficient and the intercept.

log_odds = logr.coef_ * x + logr.intercept_

To then convert the log-odds to odds we must exponentiate the log-odds.

odds = numpy.exp(log_odds)

Now that we have the odds, we can convert it to probability by dividing it by 1 plus the odds.

probability = odds / (1 + odds)


Let us now use the function with what we have learned to find out the probability that each
tumor is cancerous.

Example

See the whole example in action:

import numpy
from sklearn import linear_model

X = numpy.array([3.78, 2.44, 2.09, 0.14, 1.72, 1.65, 4.92, 4.37, 4.96, 4.52, 3.69, 5.88]).reshape(-1,1)
y = numpy.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

logr = linear_model.LogisticRegression()
logr.fit(X,y)

def logit2prob(logr, X):
    log_odds = logr.coef_ * X + logr.intercept_
    odds = numpy.exp(log_odds)
    probability = odds / (1 + odds)
    return probability

print(logit2prob(logr, X))

Results Explained

3.78 0.61 The probability that a tumor with the size 3.78cm is cancerous is 61%.

2.44 0.19 The probability that a tumor with the size 2.44cm is cancerous is 19%.

2.09 0.13 The probability that a tumor with the size 2.09cm is cancerous is 13%.
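
As a sanity check, the hand-written function can be compared with scikit-learn's own predict_proba() method. This is a sketch rather than part of the original lab; the names manual and builtin are illustrative:

```python
import numpy
from sklearn import linear_model

X = numpy.array([3.78, 2.44, 2.09, 0.14, 1.72, 1.65,
                 4.92, 4.37, 4.96, 4.52, 3.69, 5.88]).reshape(-1, 1)
y = numpy.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

logr = linear_model.LogisticRegression()
logr.fit(X, y)

def logit2prob(logr, X):
    # log-odds -> odds -> probability
    log_odds = logr.coef_ * X + logr.intercept_
    odds = numpy.exp(log_odds)
    probability = odds / (1 + odds)
    return probability

# Flatten the (12, 1) result so it lines up with predict_proba's second column
manual = logit2prob(logr, X).ravel()
builtin = logr.predict_proba(X)[:, 1]  # probability of class 1 (cancerous)

print(numpy.allclose(manual, builtin))
```

If the two agree, the function reproduces what the library computes internally for the binomial case.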


Task 1

Find the 7_Logistic Regression Python file and upload it to Google Colab along with the data file
Insurance.csv.

Run all the code cells and write your understanding along with the output.

Task 2

Download employee retention dataset from here: https://www.kaggle.com/giripujar/hr-analytics.

1. Do some exploratory data analysis to figure out which variables have a direct and clear
impact on employee retention (i.e. whether they leave the company or continue to work).
2. Plot bar charts showing the impact of employee salaries on retention.
3. Plot bar charts showing the correlation between department and employee retention.
4. Build a logistic regression model using the variables that were narrowed down in step 1.
5. Measure the accuracy of the model.

Task 3: Run All Examples in the Lab Manual
