ML Assignment 4

NAME: KAUSTUBH GAJANAN INDULKAR
CLASS: TE-IT (A)

ROLL NO: 35026
ASSIGNMENT 4
Assignment on Improving Performance of Classifier Models
SMS spam (sometimes called mobile phone spam) is any junk message
delivered to a mobile phone as a text message via the Short Message
Service (SMS). Use a probabilistic approach (Naive Bayes Classifier /
Bayesian Network) to implement an SMS spam filtering system. SMS messages
are categorized as SPAM or HAM using features such as message length,
word count, unique keywords, etc.
Download the dataset from:

http://archive.ics.uci.edu/ml/datasets/sms+spam+collection
This dataset consists of a single text file in which each line holds the
correct class followed by the raw message.
a. Apply data pre-processing (Label Encoding, Data Transformation, ...) techniques if necessary
b. Perform data preparation (Train-Test Split)
c. Apply at least two Machine Learning algorithms and evaluate the models
d. Apply cross-validation, evaluate the models, and compare performance
e. Apply hyperparameter tuning, evaluate the models, and compare performance
import pandas as pd
import numpy as np
df = pd.read_csv('SMSSpamCollection',sep='\t',names=['label','text'])
df
label text
0 ham Go until jurong point, crazy.. Available only ...
1 ham Ok lar... Joking wif u oni...
2 spam Free entry in 2 a wkly comp to win FA Cup fina...
3 ham U dun say so early hor... U c already then say...
4 ham Nah I don't think he goes to usf, he lives aro...
... ... ...
5567 spam This is the 2nd time we have tried 2 contact u...
5568 ham Will ü b going to esplanade fr home?
5569 ham Pity, * was in mood for that. So...any other s...
5570 ham The guy did some bitching but I acted like i'd...
5571 ham Rofl. Its true to its name
[5572 rows x 2 columns]
# Data Preprocessing
df.shape
(5572, 2)
import nltk
nltk.download('stopwords')
[nltk_data] Downloading package stopwords to

[nltk_data] C:\Users\ADMIN\AppData\Roaming\nltk_data...
[nltk_data] Unzipping corpora\stopwords.zip.
True
sent = 'How are you friends?'
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
nltk.download('stopwords')
[nltk_data] Downloading package stopwords to

[nltk_data] Package stopwords is already up-to-date!
True
sent = 'How are you friends?'
nltk.download('punkt')
[nltk_data] Downloading package punkt to

[nltk_data] Unzipping tokenizers\punkt.zip.
True
from nltk.tokenize import word_tokenize

word_tokenize(sent)
['How', 'are', 'you', 'friends', '?']
from nltk.corpus import stopwords

swords = stopwords.words('english')
clean = [word for word in word_tokenize(sent) if word not in swords]
clean
['How', 'friends', '?']
from nltk.stem import PorterStemmer

ps = PorterStemmer()
clean = [ps.stem(word) for word in word_tokenize(sent)
if word not in swords]
clean
['how', 'friend', '?']
sent = 'Hello friends! How are you? We will learning python today'
def clean_text(sent):
    tokens = word_tokenize(sent)
    clean = [word for word in tokens if word.isdigit() or word.isalpha()]
    clean = [ps.stem(word) for word in clean
             if word not in swords]
    return clean
clean_text(sent)
['hello', 'friend', 'how', 'we', 'learn', 'python', 'today']
# Pre-processing
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(analyzer=clean_text)
x = df['text']
y = df['label']
x_new = tfidf.fit_transform(x)
x.shape
(5572,)
x_new.shape
(5572, 6513)
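Task (a) mentions label encoding, while the labels in `y` remain the strings 'ham' and 'spam'. scikit-learn classifiers accept string labels, so this step is optional here. The sketch below only illustrates how the encoding could be done with `LabelEncoder`. The short `labels` list is a made-up stand-in for `df['label']`.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for df['label']; not taken from the dataset.
labels = ['ham', 'spam', 'ham', 'ham', 'spam']

le = LabelEncoder()
# Classes are sorted alphabetically, so ham -> 0 and spam -> 1.
encoded = le.fit_transform(labels)

print(list(le.classes_))
print(list(encoded))

# inverse_transform maps the integers back to the original strings.
decoded = le.inverse_transform(encoded)
```

Keeping the fitted encoder around makes it possible to translate integer predictions back to 'ham' and 'spam' when reporting results.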
# tfidf.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
import seaborn as sns

sns.countplot(x=y)
<Axes: xlabel='label', ylabel='count'>

# Train-test split
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(
    x_new, y, test_size=0.25, random_state=1)
print("Size of split data")

print(f"x_train {x_train.shape}")
print(f"y_train {y_train.shape}")
print(f"x_test {x_test.shape}")
print(f"y_test {y_test.shape}")
Size of split data

x_train (4179, 6513)
y_train (4179,)
x_test (1393, 6513)
y_test (1393,)
from sklearn.naive_bayes import GaussianNB
nb = GaussianNB()
nb.fit(x_train.toarray(),y_train)
y_pred_nb = nb.predict(x_test.toarray())
y_test.value_counts()
label
ham 1208
spam 185
Name: count, dtype: int64
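The test set is imbalanced, which matters when reading the accuracy figures that follow. As a rough check using only the counts printed above, a classifier that always predicts 'ham' would already score about 0.867. The helper name `majority_baseline_accuracy` is illustrative and not part of the notebook.

```python
def majority_baseline_accuracy(class_counts):
    """Accuracy obtained by always predicting the most frequent class."""
    total = sum(class_counts.values())
    return max(class_counts.values()) / total

# Counts copied from y_test.value_counts() above.
test_counts = {'ham': 1208, 'spam': 185}
baseline = majority_baseline_accuracy(test_counts)
print(f"Majority-class baseline accuracy: {baseline:.4f}")
```

Any model's accuracy is best judged against this baseline, alongside the per-class precision and recall.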
from sklearn.metrics import ConfusionMatrixDisplay, accuracy_score

from sklearn.metrics import classification_report
import matplotlib.pyplot as plt
ConfusionMatrixDisplay.from_predictions(y_test,y_pred_nb)
plt.title('Naive bayes')
plt.show()
print(f" Accuracy is {accuracy_score(y_test,y_pred_nb)}")
print(classification_report(y_test,y_pred_nb))
 Accuracy is 0.867910983488873
              precision    recall  f1-score   support

         ham       0.97      0.87      0.92      1208
        spam       0.50      0.84      0.63       185

    accuracy                           0.87      1393
   macro avg       0.74      0.86      0.77      1393
weighted avg       0.91      0.87      0.88      1393
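The spam row of this report can be sanity-checked by hand, since F1 is the harmonic mean of precision and recall. A small sketch using the rounded values printed above (so the result only matches to two decimals); the function name is an illustrative helper:

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall; 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Rounded spam-class values from the Gaussian Naive Bayes report above.
spam_f1 = f1_from_precision_recall(0.50, 0.84)
print(round(spam_f1, 2))  # 0.63
```

A precision of 0.50 means that roughly half of the messages this model flagged as spam were actually ham.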
from sklearn.ensemble import RandomForestClassifier

model_rf = RandomForestClassifier(random_state=1)
model_rf.fit(x_train,y_train)
RandomForestClassifier(random_state=1)
y_pred_rf = model_rf.predict(x_test)
ConfusionMatrixDisplay.from_predictions(y_test,y_pred_rf)
plt.title('Random Forest')
plt.show()
print(f" Accuracy is {accuracy_score(y_test,y_pred_rf)}")
print(classification_report(y_test,y_pred_rf))
 Accuracy is 0.9748743718592965
              precision    recall  f1-score   support

         ham       0.97      1.00      0.99      1208
        spam       1.00      0.81      0.90       185

    accuracy                           0.97      1393
   macro avg       0.99      0.91      0.94      1393
weighted avg       0.98      0.97      0.97      1393
from sklearn.linear_model import LogisticRegression

model_lr = LogisticRegression(random_state=1)
model_lr.fit(x_train,y_train)
y_pred_lr = model_lr.predict(x_test)
ConfusionMatrixDisplay.from_predictions(y_test,y_pred_lr)
plt.title('Logistic regression')
plt.show()
print(f" Accuracy is {accuracy_score(y_test,y_pred_lr)}")
print(classification_report(y_test,y_pred_lr))
 Accuracy is 0.9641062455132807
              precision    recall  f1-score   support

         ham       0.96      1.00      0.98      1208
        spam       0.99      0.74      0.85       185

    accuracy                           0.96      1393
   macro avg       0.97      0.87      0.91      1393
weighted avg       0.96      0.96      0.96      1393
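Task (d) asks for cross-validation, which this notebook performs only implicitly inside `GridSearchCV`. A minimal sketch of an explicit comparison is given here under several assumptions: the tiny `texts` and `labels` lists are invented placeholders for `df['text']` and `df['label']`, the default `TfidfVectorizer` tokenizer replaces `clean_text`, and `MultinomialNB` is swapped in for `GaussianNB` because it accepts sparse TF-IDF matrices directly. Wrapping the vectorizer and the model in a pipeline means the vocabulary is refit on each training fold.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented toy messages; replace with df['text'] and df['label'].
texts = [
    "win a free prize now", "free entry claim your cash prize",
    "urgent call now to claim free reward", "cash prize winner call today",
    "are you coming home for dinner", "see you at the office tomorrow",
    "can we meet after class today", "ok i will call you later",
]
labels = ["spam"] * 4 + ["ham"] * 4

models = {
    "naive_bayes": make_pipeline(TfidfVectorizer(), MultinomialNB()),
    "logistic_regression": make_pipeline(
        TfidfVectorizer(), LogisticRegression(random_state=1)),
}

# Stratified folds keep the ham/spam ratio similar in every fold.
cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=1)

cv_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, texts, labels, cv=cv, scoring="accuracy")
    cv_scores[name] = scores
    print(f"{name}: mean={scores.mean():.3f} std={scores.std():.3f}")
```

On the real data, `cv=5` or `cv=10` would be more typical; the fold count here is small only because the toy set has eight messages.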
# Hyperparameter tuning
from sklearn.model_selection import GridSearchCV
para = {
    'criterion': ['gini', 'entropy', 'log_loss'],
    # 'max_features': ['sqrt', 'log2'],
    # 'random_state': [0, 1, 2, 3, 4],
    'class_weight': ['balanced', 'balanced_subsample']
}
grid = GridSearchCV(model_rf, param_grid=para, cv=5, scoring='accuracy')
grid.fit(x_train,y_train)
GridSearchCV(cv=5, estimator=RandomForestClassifier(random_state=1),
param_grid={'class_weight': ['balanced', 'balanced_subsample'],
'criterion': ['gini', 'entropy', 'log_loss']},
scoring='accuracy')
rf = grid.best_estimator_
y_pred_grid = rf.predict(x_test)
ConfusionMatrixDisplay.from_predictions(y_test,y_pred_grid)
plt.title('Grid Search')
plt.show()
print(f" Accuracy is {accuracy_score(y_test,y_pred_grid)}")
print(classification_report(y_test,y_pred_grid))
 Accuracy is 0.9763101220387652
              precision    recall  f1-score   support

         ham       0.97      1.00      0.99      1208
        spam       1.00      0.82      0.90       185

    accuracy                           0.98      1393
   macro avg       0.99      0.91      0.94      1393
weighted avg       0.98      0.98      0.98      1393
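The grid search above selects parameters by accuracy, which is dominated by the ham class. One possible variation, sketched here under assumptions, is to score candidates by the F1 of the spam class instead. The toy data, the `MultinomialNB` model, and the `alpha` grid are illustrative choices and not the notebook's configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Invented toy messages; replace with df['text'] and df['label'].
texts = [
    "win a free prize now", "free entry claim your cash prize",
    "urgent call now to claim free reward", "cash prize winner call today",
    "are you coming home for dinner", "see you at the office tomorrow",
    "can we meet after class today", "ok i will call you later",
]
labels = ["spam"] * 4 + ["ham"] * 4

pipe = Pipeline([("tfidf", TfidfVectorizer()), ("nb", MultinomialNB())])

# F1 for the 'spam' class; pos_label is needed because labels are strings.
spam_f1_scorer = make_scorer(f1_score, pos_label="spam")

# Step-name prefix "nb__" routes alpha to the MultinomialNB step.
param_grid = {"nb__alpha": [0.1, 0.5, 1.0]}
search = GridSearchCV(
    pipe, param_grid=param_grid, scoring=spam_f1_scorer,
    cv=StratifiedKFold(n_splits=4, shuffle=True, random_state=1))
search.fit(texts, labels)

print(search.best_params_)
print(f"best cross-validated spam F1: {search.best_score_:.3f}")
```

The same scorer could be passed to the Random Forest grid search in place of `scoring='accuracy'` to compare the tuned models on spam detection specifically.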
