ML Assignment 4
ASSIGNMENT 4
SMS spam (sometimes called mobile phone spam) is any junk message delivered
to a mobile phone as text messaging through the Short Message Service (SMS).
Use a probabilistic approach (Naive Bayes Classifier / Bayesian Network) to
implement an SMS spam filtering system. SMS messages are categorized as SPAM
or HAM using features such as message length, word count, unique keywords, etc.
The dataset consists of a single text file in which each line contains the
correct class followed by the raw message.
a. Apply data pre-processing techniques (label encoding, data transformation, ...) if necessary.
b. Perform data preparation (train-test split).
c. Apply at least two machine learning algorithms and evaluate the models.
d. Apply cross-validation, evaluate the models, and compare performance.
e. Apply hyperparameter tuning, evaluate the models, and compare performance.
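The hand-crafted features named in the problem statement (message length, word count, keyword flags) and the label encoding from task (a) could be computed along the following lines. This is only a minimal sketch: the helper names `extract_features` and `encode_label` and the keyword list are hypothetical, chosen for illustration, and are not part of the notebook that follows, which relies on TF-IDF features instead.

```python
# Hypothetical keyword list for illustration only; not derived from the dataset.
SPAM_KEYWORDS = ("free", "win", "prize", "claim")

def extract_features(message):
    """Return simple hand-crafted features for one SMS message."""
    words = message.lower().split()
    return {
        "length": len(message),      # number of characters
        "word_count": len(words),    # whitespace-separated tokens
        # tokens that match a keyword once trailing punctuation is stripped
        "keyword_hits": sum(w.strip(".,!?") in SPAM_KEYWORDS for w in words),
    }

def encode_label(label):
    """Label encoding: ham -> 0, spam -> 1."""
    return {"ham": 0, "spam": 1}[label]

features = extract_features("Free entry to win a prize!")
print(features, encode_label("spam"))
```

Features like these can be stacked next to a text representation, but on their own they discard most of the vocabulary information a Naive Bayes model relies on.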
NAME: KAUSTUBH GAJANAN INDULKAR
CLASS: TE-IT (A)
ROLL NO: 35026
import pandas as pd
import numpy as np
df = pd.read_csv('SMSSpamCollection',sep='\t',names=['label','text'])
df
      label  text
0     ham    Go until jurong point, crazy.. Available only ...
1     ham    Ok lar... Joking wif u oni...
2     spam   Free entry in 2 a wkly comp to win FA Cup fina...
3     ham    U dun say so early hor... U c already then say...
4     ham    Nah I don't think he goes to usf, he lives aro...
...   ...    ...
5567  spam   This is the 2nd time we have tried 2 contact u...
5568  ham    Will ü b going to esplanade fr home?
5569  ham    Pity, * was in mood for that. So...any other s...
5570  ham    The guy did some bitching but I acted like i'd...
5571  ham    Rofl. Its true to its name
# Data Preprocessing
df.shape
(5572, 2)
import nltk
nltk.download('stopwords')
True
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
nltk.download('punkt')
True
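from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# Assumed definitions: the extract omits how swords and ps were created.
swords = stopwords.words('english')  # English stop-word list
ps = PorterStemmer()

sent = 'Hello friends! How are you? We will learning python today'

def clean_text(sent):
    # Tokenize, keep purely alphabetic or numeric tokens,
    # then drop stop words and stem what remains.
    tokens = word_tokenize(sent)
    clean = [word for word in tokens if word.isdigit() or word.isalpha()]
    clean = [ps.stem(word) for word in clean
             if word not in swords]
    return clean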
clean_text(sent)
# Pre-processing
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(analyzer=clean_text)
x = df['text']
y = df['label']
x_new = tfidf.fit_transform(x)
x.shape
(5572,)
x_new.shape
(5572, 6513)
# tfidf.get_feature_names_out()
# Train-test split
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(
    x_new, y, test_size=0.25, random_state=1)

# Gaussian Naive Bayes needs dense input, hence .toarray()
from sklearn.naive_bayes import GaussianNB
nb = GaussianNB()
nb.fit(x_train.toarray(), y_train)
y_pred_nb = nb.predict(x_test.toarray())
y_test.value_counts()
label
ham 1208
spam 185
Name: count, dtype: int64
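import matplotlib.pyplot as plt
from sklearn.metrics import (ConfusionMatrixDisplay, accuracy_score,
                             classification_report)

ConfusionMatrixDisplay.from_predictions(y_test, y_pred_nb)
plt.title('Naive Bayes')
plt.show()
print(f" Accuracy is {accuracy_score(y_test, y_pred_nb)}")
print(classification_report(y_test, y_pred_nb))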
Accuracy is 0.867910983488873
precision recall f1-score support
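GaussianNB treats every TF-IDF column as a normally distributed value. For word-based features, the multinomial form of Naive Bayes is the more common choice. To make the underlying probabilistic rule explicit, the pure-Python sketch below implements multinomial Naive Bayes with Laplace smoothing. It is not the model used in this notebook; the function names `train_nb` and `predict_nb` and the toy corpus are invented for illustration.

```python
import math
from collections import Counter, defaultdict

def train_nb(messages, labels):
    """Collect class counts, per-class word counts, and the vocabulary."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for text, label in zip(messages, labels):
        word_counts[label].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocab

def predict_nb(text, class_counts, word_counts, vocab, alpha=1.0):
    """Return the class with the highest Laplace-smoothed log posterior."""
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)           # log prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue                                 # skip unseen words
            p = (word_counts[label][word] + alpha) / (total_words + alpha * len(vocab))
            score += math.log(p)                         # log likelihood
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy corpus invented for illustration (not taken from SMSSpamCollection).
train_msgs = ["win free prize now", "free entry win cash",
              "are you coming home", "see you at home later"]
train_labels = ["spam", "spam", "ham", "ham"]

model = train_nb(train_msgs, train_labels)
prediction = predict_nb("free cash prize", *model)
print(prediction)
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied, and the smoothing term `alpha` keeps a word that never appeared in one class from forcing that class's probability to zero.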
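from sklearn.ensemble import RandomForestClassifier
model_rf = RandomForestClassifier(random_state=1)
model_rf.fit(x_train, y_train)
RandomForestClassifier(random_state=1)
y_pred_rf = model_rf.predict(x_test)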
ConfusionMatrixDisplay.from_predictions(y_test,y_pred_rf)
plt.title('Random Forest')
plt.show()
print(f" Accuracy is {accuracy_score(y_test,y_pred_rf)}")
print(classification_report(y_test,y_pred_rf))
Accuracy is 0.9748743718592965
precision recall f1-score support
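from sklearn.linear_model import LogisticRegression
# The model definition is missing from the extract; default settings are assumed.
model_lr = LogisticRegression()
model_lr.fit(x_train,y_train)
y_pred_lr = model_lr.predict(x_test)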
ConfusionMatrixDisplay.from_predictions(y_test,y_pred_lr)
plt.title('Logistic regression')
plt.show()
print(f" Accuracy is {accuracy_score(y_test,y_pred_lr)}")
print(classification_report(y_test,y_pred_lr))
Accuracy is 0.9641062455132807
precision recall f1-score support
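Task (d) asks for cross-validation in addition to a single hold-out split. One way to do this, sketched below, is `cross_val_score` over a pipeline, so that the TF-IDF vocabulary is learned inside each training fold rather than on the full dataset. The pipeline with `MultinomialNB` is an illustrative choice, not the notebook's configuration, and the toy corpus is invented; on the real data, `df['text']` and `df['label']` would take the place of `texts` and `labels`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy corpus invented for illustration.
texts = ["win free prize now", "free entry win cash", "claim your free prize",
         "urgent win cash now", "are you coming home", "see you at home later",
         "call me when you are free", "lunch at home today"]
labels = ["spam"] * 4 + ["ham"] * 4

# Vectorizer and classifier are refit together on every training fold.
pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())

# Stratified folds keep the ham/spam ratio the same in every fold.
cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=1)
scores = cross_val_score(pipeline, texts, labels, cv=cv, scoring="accuracy")

print(scores.mean(), scores.std())
```

Reporting the mean and standard deviation across folds gives a steadier basis for comparing models than one accuracy figure from a single split, which matters here because spam is the minority class.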
para = {
    'criterion': ['gini', 'entropy', 'log_loss'],
    # 'max_features': ['sqrt', 'log2'],
    # 'random_state': [0, 1, 2, 3, 4],
    'class_weight': ['balanced', 'balanced_subsample']
}
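from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
# Search settings reconstructed from the GridSearchCV repr printed below.
grid = GridSearchCV(RandomForestClassifier(random_state=1),
                    param_grid=para, cv=5, scoring='accuracy')
grid.fit(x_train,y_train)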
GridSearchCV(cv=5, estimator=RandomForestClassifier(random_state=1),
param_grid={'class_weight': ['balanced', 'balanced_subsample'],
'criterion': ['gini', 'entropy', 'log_loss']},
scoring='accuracy')
rf = grid.best_estimator_
y_pred_grid = rf.predict(x_test)
ConfusionMatrixDisplay.from_predictions(y_test,y_pred_grid)
plt.title('Grid Search')
plt.show()
print(f" Accuracy is {accuracy_score(y_test,y_pred_grid)}")
print(classification_report(y_test,y_pred_grid))
Accuracy is 0.9763101220387652
precision recall f1-score support