
TEXT CLASSIFICATION

A MINI PROJECT REPORT

By

Batch-4

N. Meghana (21JG1A0582) P. Chandana Sai Sree (21JG1A0589)

Under the esteemed guidance of


Mrs. V. Gowtami Annapurna
Assistant Professor
Department of CSE

Department of Computer Science and Engineering


GAYATRI VIDYA PARISHAD COLLEGE OF ENGINEERING FOR WOMEN
[Approved by AICTE NEW DELHI, Affiliated to JNTUK Kakinada]

[Accredited by National Board of Accreditation (NBA) for B.Tech. CSE, ECE & IT – Valid from 2019-22 and 2022-25]

[Accredited by National Assessment and Accreditation Council (NAAC) – Valid from 2022-27]

Kommadi, Madhurawada, Visakhapatnam – 530048

2021–2025
GAYATRI VIDYA PARISHAD COLLEGE OF ENGINEERING FOR WOMEN

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

CERTIFICATE

This is to certify that the mini project report titled “TEXT CLASSIFICATION” is a
bonafide work of the following III/IV B.Tech. students in the Department of Computer Science
and Engineering, Gayatri Vidya Parishad College of Engineering for Women, affiliated to JNT
University, Kakinada, during the academic year 2023-2024, Semester-II.

N. Meghana (21JG1A0582) P. Chandana Sai Sree (21JG1A0589)

Project Mentor
Mrs. V. Gowtami Annapurna

Assistant Professor
ACKNOWLEDGEMENT

The satisfaction that accompanies the successful completion of any task would be
incomplete without the mention of people who made it possible and whose constant guidance
and encouragement crown all the efforts with success.

We feel elated to extend our sincere gratitude to Mrs. V. Gowtami Annapurna,
Assistant Professor, for her encouragement all the way through the project analysis. Her
annotations, suggestions and constructive criticism were key to the successful completion of
this work, and we thank her for providing us with all the required facilities.

We express our deep sense of gratitude and thanks to Dr. P. V. S. Lakshmi
Jagadamba, Professor and Head of the Department of Computer Science and Engineering, for
her guidance and valuable opinions on the project and its development, and for providing lab
sessions and extra hours to complete the project.

We would like to take this opportunity to express our profound sense of gratitude to
Dr. R. K. Goswami, Principal and Dr. G. Sudheer, Vice Principal for allowing us to utilize
the college resources thereby facilitating the successful completion of our thesis.

We are also thankful to the teaching and non-teaching faculty of the Department of
Computer Science and Engineering for their valuable suggestions on our project.

N. Meghana (21JG1A0582) P. Chandana Sai Sree (21JG1A0589)


TABLE OF CONTENTS

Abstract
1. INTRODUCTION
   1.1 Problem Statement
2. LITERATURE REVIEW
3. TECHNOLOGY STACK
   3.1 Datasets/Database Used
   3.2 Pre-Processing Steps
   3.3 Packages Used
   3.4 Software Requirement Specification
4. METHODOLOGY
   4.1 Modules Description
   4.2 Algorithms
   4.3 Model Architecture Diagram
5. IMPLEMENTATION
6. RESULTS / ANALYSIS
   6.1 Output Screens / Results Analysis
7. CONCLUSION & FUTURE SCOPE
8. REFERENCES
ABSTRACT

Text classification is a vital area of natural language processing in which text data is
automatically sorted into a predefined set of classes. Text classification is significant for many
enterprises since it eliminates the need for manual data classification, a more expensive and
time-consuming process. Automated text classification has been considered an essential
method to manage and process the vast amount of documents in digital form that are
widespread and continuously increasing. In general, text classification plays an important role
in information extraction and summarization, text retrieval, and question answering. In this
project, we investigate the application of the random forest algorithm to multi-class text
classification problems. Random forests are an ensemble learning technique that constructs
multiple decision trees during training and aggregates their predictions for classification.

Key words
Text classification, natural language processing, digital forms, information extraction,
summarization, text retrieval, question-answering, random forest algorithm, multi-class
classification, ensemble learning, decision trees, aggregation.

1. INTRODUCTION

1.1 Problem Statement

In the era of big data, text classification using machine learning (ML) algorithms has become
indispensable, particularly in the context of multi-label classification. Multi-label
classification extends traditional binary or single-label classification by accommodating the
complex and diverse nature of textual data, allowing each instance to be associated with
multiple labels simultaneously. Nowadays, industries benefit greatly from developing
automatic systems for extracting usable structured data from unstructured text sources. With
a structured resource, researchers and industry professionals could run relatively simple
queries to retrieve all information related to industrial work. Text classification is the task of
classifying text into different classes based on the text domain. It is a fundamental process in
natural language processing for which a range of tools is available for classifying textual data.
Automatic text classification has been a critical application and research topic since the
inception of digital documents. Textual analytics translates text into numbers, giving
structured data and making it easier to spot trends. The more structured the data, the better
the analysis, and eventually, the better the decisions. Machine learning (ML) is employed for
this purpose; it is a branch of artificial intelligence (AI) that allows computers to operate and
learn even when they are not explicitly programmed.

We focus on text classification within the domain of e-commerce, leveraging ML algorithms
to categorize textual data from an e-commerce dataset into multiple predefined labels. With
the exponential growth of online retail and e-commerce platforms, there is a pressing need to
automatically categorize and manage the vast amount of textual information generated by
product descriptions, customer reviews, and other textual sources. The applications of multi-
label text classification in e-commerce are wide-ranging and impactful, including product
categorization, sentiment analysis of customer reviews, personalized recommendations, and
more. By harnessing ML algorithms, e-commerce businesses can automate and optimize
processes related to product management, customer engagement, and decision-making,
ultimately enhancing the overall shopping experience for consumers.

In this study, we aim to explore and implement state-of-the-art ML algorithms for multi-label
text classification on an e-commerce dataset. By leveraging supervised learning techniques
and NLP models, we seek to develop a robust and accurate multi-label text classification
system capable of handling the specific challenges and nuances of e-commerce data. Through
rigorous experimentation and evaluation, we aim to identify optimal ML algorithms and
methodologies, contributing to advancements in multi-label text classification research within
the context of e-commerce applications. The documents in the text classification model pass
through several steps: the text is preprocessed by converting it to lowercase, removing stop
words, and applying stemming/lemmatization. Using a variety of classifiers and feature
representations, we train models to accurately classify text. The project finds applications in
sentiment analysis, topic categorization, and more. Ultimately, our goal is to develop robust
text classification systems with real-world impact.

2. LITERATURE REVIEW
1. In [6], Xiaoyu Luo (2021), “Efficient English Text Classification Using Selected
Machine Learning Techniques”. The classification is done using SVM, Naïve
Bayes, and Logistic Regression methodologies. After evaluating classifier performance
using precision, recall, and F1-score metrics, it was observed that the Support Vector
Machine (SVM) outperformed the other classifiers on two datasets, while Logistic
Regression demonstrated superior performance on one dataset.
2. In [52], Dhirajj Kumar, Gopesh, Avinash Choubey, Ms. Pratibha Sing (2020),
“Restaurant Review Classification and Analysis”. The classification is done using
Naïve Bayes, Multinomial Naïve Bayes, and Logistic Regression methodologies. Their
evaluation of classifier performance using precision, recall, and F1-score metrics
reveals that the Multinomial Naïve Bayes technique outperforms the other algorithms
on all three evaluation metrics.

3. In [53], Muhammad Nabeel Asim, Muhammad Usman Ghani, Muhammad Ali Ibrahim,
Waqar Mahmood, Sdheraz Ahmad, Andreas Dengel (2020), “Benchmark Performance
of Machine Learning and Deep Learning Based Methodologies for Urdu Text
Document Classification”. They investigate the effectiveness of Naïve Bayes, K-
Nearest Neighbor, and Support Vector Machine algorithms. Their findings indicate
that SVM outperforms Naïve Bayes when utilizing TF-IDF vector representation for
Urdu text document classification.

4. In [21], Emmanouil K. Ikonomskis, Sotiris Kotsiantis, V. Tampakas (2019), “Text
classification using machine learning techniques”. This classification is done using
Naïve Bayes, K-Nearest Neighbour, and Support Vector Machine methodologies. They
highlight that the quality and diversity of the training data significantly impact the
effectiveness of the classifiers, with high-quality training corpora resulting in better
performance.

5. In [4], Kapil Sethi, Ankit Gupta, Gaurav Gupta, Varun Jaiswal (2017), “Comparative
Analysis of Machine Learning Algorithms on Different Datasets”. The methodologies
they used are Neural Network, K-Nearest Neighbor, and Support Vector Machine.
After performing the evaluation, they found that SVM outperforms the other
algorithms and that the model is useful in medicine, governance, and other fields.
6. In [54], Jasleen Kaur, Dr. Jatinder Kumar, R. Saini (2015), “A Study of Text
Classification Natural Language Processing Algorithms for Indian Languages”. They
used Naïve Bayes, SVM, Artificial Neural Network, and N-gram methods. Their study
reveals that supervised machine learning algorithms, such as Naïve Bayes, SVM, and
Artificial Neural Network, outperform unsupervised machine learning algorithms for
text classification tasks in Indian languages.

7. In [55], Muhammad Abid, Asad Habib, Jawab Shahid, Jawad Ashraf (2017), “Urdu
Word Sense Disambiguation Using Machine Learning Approach”. They explore the
effectiveness of Bayes Net Classifier, Support Vector Machine (SVM), and Decision
Tree algorithms. Their study indicates that the Bayes Net Classifier outperforms the
other algorithms in Urdu word sense disambiguation tasks.

8. In [56], Basant Agarwal, Namita Mithal (2016), “Text Classification Using Machine
Learning Methods: A Survey”. They investigate the effectiveness of Naïve Bayes,
Support Vector Machine (SVM), K-Nearest Neighbors (KNN), and Decision Tree
algorithms in classifying textual data. Their survey findings suggest that SVM
performs well for textual documents belonging to a particular category, but its
performance diminishes for multiclass classification tasks. This observation highlights
a limitation of SVM in handling complex classification scenarios involving multiple
classes.

9. In [1], Bao Y. and Ishii N., "Combining Multiple kNN Classifiers for Text
Categorization by Reducts", LNCS 2534, 2002, pp. 340-347, the authors propose
combining multiple kNN classifiers for text categorization using reducts. Their results
show that the combination of multiple kNN classifiers can improve the performance
compared to using a single kNN classifier.
10. In [9], D. E. Johnson, F. J. Oles, T. Zhang, T. Goetz, "A decision-tree-based symbolic
rule induction system for text categorization", IBM Systems Journal, September 2002,
the authors describe a fast decision tree construction algorithm that takes advantage of
the sparsity of text data, and a rule simplification method that converts the decision
tree into a logically equivalent rule set.

3. TECHNOLOGY STACK
3.1 Datasets/Database Used:

Fig: Dataset

• The dataset comprises various product listings from the e-commerce domain,
particularly focusing on four categories: Household, Books, Electronics, and
Clothing & Accessories.
• The e-commerce text classification dataset has 50,425 rows and 2 columns.
• Dataset info:
  RangeIndex: 50425 entries, 0 to 50424
  Data columns (total 2 columns):
   #   Column  Non-Null Count  Dtype
  ---  ------  --------------  -----
   0   label   50425 non-null  object
   1   text    50424 non-null  object
• Null values in each column:
  label    0
  text     1
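
The summary above can be reproduced with a few standard pandas calls; the following is a
minimal sketch, assuming the CSV loads into the 'label' and 'text' columns shown:

Code Snippet:
import pandas as pd

# A minimal sketch for reproducing the dataset summary above.
df = pd.read_csv('ecommerceDataset.csv')

df.info()                          # 50425 entries, 2 columns, dtypes
print(df.isnull().sum())           # null count per column (one null in 'text')
print(df['label'].value_counts())  # number of rows in each category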
3.2 Pre-Processing Steps:

1. Removing Rows with Null Values:

Code Snippet:
df.drop([null_val_ind], axis=0, inplace=True)

This code removes the row(s) containing null values from the DataFrame df. The drop
method is a pandas function used to remove rows or columns based on labels (index or
column names) or indices. By setting axis=0, you're specifying that you want to drop
rows.

2. Remove Stop words and apply lemmatization:

Code Snippet:
nlp = spacy.load('en_core_web_sm')

def preprocess(text):
    doc = nlp(text)
    preprocessed = []
    for token in doc:
        if not token.is_stop and not token.is_punct:
            preprocessed.append(token.lemma_)
    flat = ' '.join(preprocessed)
    return flat

df['preprocessed_text'] = df['text'].apply(preprocess)

Removing Stopwords:
Stopwords are common words in a language that often don't carry significant meaning
and are typically removed from text before processing.

Applying Lemmatization:
Lemmatization is the process of reducing words to their base or root form, called a
lemma.
This preprocessing step removes stopwords and punctuation from the text data and
applies lemmatization to obtain the base form of each token. It helps in reducing noise
in the data and improving the quality of features used for text analysis or modeling.

3. Lower Case the text:

Code Snippet:
df['preprocessed_text'] = df['preprocessed_text'].str.lower()

Converting text to lower case is a common preprocessing step in natural language
processing tasks. It ensures consistency in text data by treating uppercase and lowercase
versions of the same word as identical, which can improve model performance and
simplify subsequent processing steps.

3.3 Packages Used


1. Pandas:
This is a popular library for data analysis and manipulation in Python. It provides data
structures like DataFrames (similar to spreadsheets) for storing and organizing labeled data. It
offers functionalities for data cleaning, filtering, transformation, and analysis.
2. matplotlib.pyplot:
Matplotlib is a comprehensive library for creating static, animated, and interactive
visualizations in Python. The pyplot module provides a MATLAB-like interface for creating
plots and visualizations.

3. seaborn:
Seaborn is a statistical data visualization library based on matplotlib. It provides a
high-level interface for drawing attractive and informative statistical graphics.
4. spaCy:
A high-performance natural language processing library for efficiently processing
large amounts of text. Offers a wide range of NLP capabilities, including tokenization,
lemmatization, and named entity recognition. Allows for easy integration with machine
learning models through its language models.

5. WordCloud:
A library for generating word cloud visualizations from textual data. Provides
customization options for controlling the size, shape, and color of the word cloud. Useful for
quickly identifying the most prominent terms in a corpus of text.
6. scikit-learn:
Used for various machine learning tasks, such as feature extraction with
TfidfVectorizer, target encoding with LabelEncoder, model training and evaluation with
train_test_split, accuracy_score, confusion_matrix, and classification_report.

7. xgboost:
A scalable and efficient implementation of the Gradient Boosting algorithm. Known
for its high performance and ability to handle large-scale data. Widely used in machine
learning competitions and production environments.

8. RandomForestClassifier:
An ensemble learning algorithm that combines multiple decision trees for improved
classification accuracy. Handles both numerical and categorical features effectively.
Provides built-in feature importance estimation, making it useful for interpretability.
9. NaiveBayesClassifier:
A probabilistic machine learning algorithm based on Bayes' theorem. Assumes
independence between features, making it computationally efficient and suitable for text
classification tasks. Provides a simple and interpretable approach to classification, often used
as a baseline model.
10. joblib:
A utility library that provides easy-to-use functions for saving and loading Python
objects. Allows for efficient serialization and deserialization of machine learning models and
other data. Helps in maintaining the reproducibility of the model training and deployment
process.
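
For reference, the sketch below gathers the typical imports for the stack described above in
one place; the module paths are the standard ones for these libraries, though the exact set
used in this project may differ slightly:

Code Snippet:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import spacy
from wordcloud import WordCloud
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from xgboost import XGBClassifier
from joblib import dump, load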

3.4 Software Requirement Specification

The Text Classification system is designed with two key features to facilitate seamless user
interaction and efficient model utilization. Firstly, it offers robust internet connectivity,
enabling users to access the system from anywhere, anytime. This connectivity ensures that
users can easily upload their text data and receive predictions promptly, enhancing accessibility
and convenience. Secondly, the system empowers users to run the classification model and
generate predictions for their desired inputs effortlessly. Through an intuitive and user-friendly

interface, individuals can input text data, initiate the classification process, and obtain accurate
predictions in real-time. These features combine to provide a comprehensive and user-centric
text classification solution, catering to a wide range of applications and user requirements.
1. Software Requirements
Platform: Windows operating system
Programming Platform: Google Colab
Language used: Python 3
2. Hardware Requirements
Processor: Intel i3 or higher
RAM: 4 GB or more
Hard disk: 16 GB or more
GPU: 2 GB

4. METHODOLOGY
4.1 Modules Description:

1. Data/ Text Preprocessing:


Code Snippet:
import pandas as pd
import spacy

df = pd.read_csv('ecommerceDataset.csv')

# Load the spaCy model once, outside the function, so it is not
# re-loaded for every document.
nlp = spacy.load('en_core_web_sm')

# Data preprocessing function
def preprocess(text):
    doc = nlp(text)
    preprocessed = [token.lemma_ for token in doc
                    if not token.is_stop and not token.is_punct]
    return ' '.join(preprocessed)

df['preprocessed_text'] = df['text'].apply(preprocess)
The code starts by importing the necessary libraries and modules for the task. It reads the input
data from a CSV file and performs some initial data exploration, such as checking for null
values and the number of unique values in each column. The preprocess() function uses the
spaCy library to perform text cleaning and normalization, including lemmatization and removal
of stopwords and punctuation. The preprocessed text is then stored in a new column
('preprocessed_text') in the DataFrame.
2. Text Vectorization:
Code Snippet:
vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(df['preprocessed_text'])
Text data cannot be directly fed into machine learning models. This step converts textual data
from documents (sentences or paragraphs) into numerical features that models can understand.
Here, TfidfVectorizer is used to create a TF-IDF (Term Frequency-Inverse Document
Frequency) vector representation. TF-IDF considers the importance of a word based on how
often it appears in a document compared to its overall frequency in the corpus (all documents).
This helps emphasize relevant words and reduce the weight of common words.
The preprocessed text ('preprocessed_text') is extracted and assigned to the X variable. The
target variable ('label') is assigned to the y variable. The TfidfVectorizer from scikit-learn is
used to convert the text data into a numerical feature matrix (X_tfidf). The trained
TfidfVectorizer model is saved to a file using dump().
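For reference, the classic TF-IDF weight of a term t in a document d can be written as

tf-idf(t, d) = tf(t, d) × log(N / df(t))

where tf(t, d) is the number of times t occurs in d, df(t) is the number of documents
containing t, and N is the total number of documents in the corpus. Note that scikit-learn's
TfidfVectorizer computes a smoothed variant of this formula and L2-normalizes each
document vector by default.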
3. Label Encoding:
Code Snippet:
lbl_enc = LabelEncoder()
y_enc = lbl_enc.fit_transform(df['label'])
Machine learning models typically work better with numerical labels. This step uses
LabelEncoder to convert the text labels (e.g., "clothes", "electronics") into numerical values
(e.g., 0, 1). This allows the model to learn the relationships between the text features and the
corresponding categories.
The target labels are encoded into numerical values using LabelEncoder. This step assigns a
unique numerical identifier to each class label, allowing the machine learning algorithms to
interpret and learn from the target labels during model training.
The target variable y is encoded using the LabelEncoder from scikit-learn. The mapping
between the original labels and their encoded values is saved to a file using dump().
4. Splitting the Dataset:
Code Snippet:
X_train, X_test, y_train, y_test = train_test_split(
    X_tfidf, y_enc,
    random_state=123,
    test_size=0.2,
    shuffle=True,
    stratify=y_enc,
)
The dataset is split into training and testing sets using the train_test_split function from the
sklearn.model_selection module. This ensures that the model's performance can be evaluated
on unseen data, helping to assess its generalization ability.
The X_tfidf and y_enc (encoded labels) are split into training and testing sets using
train_test_split() from scikit-learn.
The split is performed with a random state for reproducibility, a test size of 20%, and
stratification to preserve the class distribution. The sizes of the training and testing sets are
printed.
5. Modeling:
Code Snippet:
rf_model = RandomForestClassifier()
rf_model.fit(X_train, y_train)
This step involves building the machine learning model. Here, models are trained using
XGBoostClassifier, RandomForestClassifier and NaiveBayesClassifier. These models learn
patterns from the training data to predict the category (label) for a new piece of text. The
modeling function trains the model, makes predictions on the testing set, and calculates the
accuracy to gauge how well the model performs.
6. Evaluation:
Code Snippet:
rf_ypred = rf_model.predict(X_test)
print("Random Forest Classification Report:")
print(classification_report(y_test, rf_ypred))
After training, the model's performance is evaluated on the testing set. confusion_matrix is used
to visualize how many data points were correctly and incorrectly classified for each category.

Additionally, classification_report provides more detailed metrics like precision, recall, and
F1-score to understand the model's strengths and weaknesses in classifying different categories.
• Precision is the percentage of correct positive predictions relative to total positive
predictions. It measures how accurate the model is in identifying the positive class.
• Recall is the percentage of correct positive predictions relative to total actual positives.
It measures how sensitive the model is in detecting the positive class.
• F1 score is a weighted harmonic mean of precision and recall. It balances both the
precision and recall of the model. The closer to 1, the better the model.
• Support is the number of instances that belong to each class.
To generate a classification report in Python, we can use the classification_report function from
the sklearn.metrics module.
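
As an illustrative sketch (not the exact plotting code used in this project), the confusion
matrix can be rendered as a seaborn heatmap, with the class names taken from the fitted
LabelEncoder:

Code Snippet:
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

# Assumes rf_model, X_test, y_test and lbl_enc from the earlier steps.
rf_conf_matrix = confusion_matrix(y_test, rf_model.predict(X_test))
sns.heatmap(rf_conf_matrix, annot=True, fmt='d',
            xticklabels=lbl_enc.classes_, yticklabels=lbl_enc.classes_)
plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.title('Random Forest Confusion Matrix')
plt.show()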
7. Making Predictions:
Code Snippet:
model = load('rf_model.pkl')   # load() comes from joblib

def classify_text(input_text):
    preprocessed_text = preprocess(input_text)
    tfidf_vector = vectorizer.transform([preprocessed_text])
    predicted_label_enc = model.predict(tfidf_vector)[0]
    predicted_label = list(label_mapping.keys())[
        list(label_mapping.values()).index(predicted_label_enc)]
    return predicted_label

# Collect user input
input_text = input("Enter your text: ")
predicted_label = classify_text(input_text)
# Display the predicted label
print("Predicted Label:", predicted_label)
Once a model is trained and evaluated, it can be used to make predictions on new, unseen data.
The predict function takes a new piece of text as input, preprocesses it using the same steps as
before, converts it into a TF-IDF vector, and feeds it to the trained model. Finally, it decodes
the predicted numerical label back to the original category name using the label mapping.
The trained model, TfidfVectorizer, and label mapping are loaded from the saved files. The
preprocess() function is defined to preprocess new input text. The classify_text() function is
defined to take an input text, preprocess it, and use the loaded model to predict the class label.
The user is prompted to enter text, and the predicted label is displayed.

4.2 Algorithms:
For the text classification task applied to an e-commerce dataset, a variety of machine learning
algorithms were employed to effectively categorize textual data into relevant classes.

• RANDOM FOREST:

Random forest is a machine learning algorithm that is used for classification and to
solve regression problems. The algorithm works based on decision trees. Random
forest creates an uncorrelated forest of trees whose collective prediction is more
accurate than that of any single tree.

RF is an ensemble learning method that operates by constructing a multitude of
decision trees at training time and outputting the mode of the classes for classification.
It's robust against overfitting and tends to handle high-dimensional data well, making
it suitable for text classification tasks like this one.

The Random Forest (RF) classifiers are suitable for dealing with the high dimensional
noisy data in text classification. An RF model comprises a set of decision trees each
of which is trained using random subsets of features. Given an instance, the prediction
by the RF is obtained via majority voting of the predictions of all the trees in the forest.
However, different test instances would have different values for the features used in
the trees and the trees should contribute differently to the predictions. This diverse
contribution of the trees is not considered in traditional RFs.

When it comes to making predictions, each decision tree in the Random Forest casts
its vote. For classification tasks, the final prediction is determined by the mode (most
frequent prediction) across all the trees. In regression tasks, the average of the
individual tree predictions is taken. This internal voting mechanism ensures a balanced
and collective decision-making process.
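
As a sketch, the hyperparameters below illustrate the knobs that control the ensemble (the
number of trees and the per-split feature subset); the values are illustrative assumptions, not
the settings used in this project, which relied on scikit-learn defaults:

Code Snippet:
from sklearn.ensemble import RandomForestClassifier

# Illustrative hyperparameters (assumed values, not project settings).
rf_model = RandomForestClassifier(
    n_estimators=200,     # number of trees whose votes are aggregated
    max_features='sqrt',  # random feature subset considered at each split
    random_state=123,
)
rf_model.fit(X_train, y_train)       # X_train: TF-IDF features from earlier
rf_ypred = rf_model.predict(X_test)  # majority vote across all trees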

Fig: Random Forest Algorithm

• XGBOOST (Extreme Gradient Boosting):

XGBoost (eXtreme Gradient Boosting) is a popular machine learning algorithm used
for various tasks, including text classification. In text classification, XGBoost can be
used to predict the class label of a given text document based on its contents.

The basic idea behind XGBoost is to use decision trees as base models and then to
build an ensemble of these trees to improve the accuracy of predictions. In the context
of text classification, each tree takes a set of features extracted from the text document
as input and outputs a predicted class label. XGBoost then trains multiple trees on the
training data and aggregates their predictions to produce the final prediction for each
text document.

One of the advantages of XGBoost is that it can handle large amounts of data and can
be run on parallel computing systems, making it a fast and efficient algorithm for text
classification. Additionally, XGBoost also supports various regularization techniques
to prevent overfitting, which is a common problem in text classification due to the
large number of features often present in text data.

To implement XGBoost for text classification, the first step is to preprocess the text
data and extract meaningful features from it, such as term frequency-inverse document
frequency (TF-IDF) values. These features are then used as input to the XGBoost
model, which is trained on the preprocessed data. Finally, the trained model can be
used to predict the class label for new text documents.
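
A minimal sketch of this on the TF-IDF features from earlier; the hyperparameter values
below are illustrative assumptions rather than tuned settings:

Code Snippet:
from xgboost import XGBClassifier

# Illustrative settings (assumptions); regularization and shrinkage help
# counter overfitting on high-dimensional TF-IDF features.
xgb_model = XGBClassifier(
    n_estimators=300,   # boosting rounds: trees added sequentially
    learning_rate=0.1,  # shrinkage applied to each new tree's contribution
    max_depth=6,        # depth of each base tree
    reg_lambda=1.0,     # L2 regularization on leaf weights
    n_jobs=-1,          # use all cores
)
xgb_model.fit(X_train, y_train)  # y_train must be integer-encoded labels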

Fig: XG Boost Algorithm

• Naive Bayes:

NB is a probabilistic classifier based on Bayes' theorem with a strong assumption of
independence among features. Despite its simplicity, NB classifiers are remarkably
effective in text classification tasks, especially when dealing with high-dimensional
data like text documents. It's computationally efficient and requires a small amount of
training data, making it suitable for large datasets.

Generative classifiers use the posterior probability of the documents belonging to
different classes in order to classify text. Which class the document belongs to is based
on the word presence in the documents. Naive Bayes is a traditional machine learning
technique that is based on Bayes’ theorem, that calculates the posterior probability
P(A|B). The equation for Bayes’ theorem is given below.

P(A|B) = P(B|A) P(A) / P(B)

Multinomial Naive Bayes: Multinomial Naive Bayes is an instance of the Naive Bayes
model, which uses a multinomial distribution of the features and is common to use for
text classification.

The Naive Bayes model computes the class probabilities for a given text document;
Multinomial Naive Bayes is the variant used in this project. The total set of classes is
denoted by C, and Multinomial Naive Bayes assigns a document di to the class with
the highest probability P(c|di), where c ∈ C. This is obtained using Bayes’ theorem:

P(c|di) = P(di|c) P(c) / P(di)
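
A minimal Multinomial Naive Bayes sketch on the same TF-IDF features; alpha is the
Laplace smoothing term, and its value here is an illustrative assumption:

Code Snippet:
from sklearn.naive_bayes import MultinomialNB

nb_model = MultinomialNB(alpha=1.0)  # Laplace smoothing for unseen terms
nb_model.fit(X_train, y_train)
# Posterior probabilities P(c|d) for the first test document
print(nb_model.predict_proba(X_test[:1]))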

Fig: Naïve Bayes Algorithm

4.3 Model Architecture Diagram:

Fig: Flow chart for Text Classification

The image depicts a flow diagram of the text classification process, which involves several
steps:
Text Data: The raw text data that needs to be classified.
Tokenization: This step breaks down the text data into individual tokens (words or phrases).
Pre-Processing:
Removal of Punctuations: Removes punctuation marks from the text.
Removal of Stopwords: Removes common words that do not carry much meaning (e.g., "the",
"a", "is").
Stemming & Lemmatization: Converts words to their base or root form to reduce vocabulary
size.
Feature Engineering:
Bag of Words: Representing the text as a vector of word frequencies.
TF-IDF (Term Frequency-Inverse Document Frequency): A more advanced text representation
method that considers the importance of each word.
Model Building: The processed text data is then used to build a machine learning or deep
learning model for text classification.
Model Evaluation: The performance of the built model is evaluated using appropriate metrics.
Model Refinement: Based on the evaluation, the model is further refined and improved.
This flow diagram provides a high-level overview of the typical text classification process,
highlighting the key steps involved in transforming the raw text data into a format suitable for
model building and evaluation.
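
This flow maps naturally onto a scikit-learn Pipeline; the sketch below is one possible
arrangement (assuming the preprocessed text column and encoded labels from Section 4.1),
not the exact code used in this project:

Code Snippet:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier

# Chains feature engineering and model building into one estimator.
text_clf = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', RandomForestClassifier(random_state=123)),
])
text_clf.fit(df['preprocessed_text'], y_enc)
print(text_clf.predict(['lightweight cotton kurta with printed design']))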

5. IMPLEMENTATION
#Importing required libraries
import pandas as pd
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from xgboost import XGBClassifier
#Loading Dataset
ds_name = 'Ecommerce text classification'
df = pd.read_csv('ecommerceDataset.csv')
#Dropping Null Values
null_val_ind = df[df['text'].isnull()].index[0]
df.drop([null_val_ind], axis=0, inplace=True)
#Text Preprocessing
nlp = spacy.load('en_core_web_sm')
def preprocess(text):
    doc = nlp(text)
    preprocessed = []
    for token in doc:
        if not token.is_stop and not token.is_punct:
            preprocessed.append(token.lemma_)
    flat = ' '.join(preprocessed)
    return flat
df['preprocessed_text'] = df['text'].apply(preprocess)
df['preprocessed_text'] = df['preprocessed_text'].str.lower()
#Defining feature and target variables
X = df['preprocessed_text']
y = df['label']
#Vectorization and label encoding
vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(X.values)
lbl_enc = LabelEncoder()
y_enc = lbl_enc.fit_transform(y.values)
#Mapping the labels
label_mapping = {label: encoded for label, encoded in
                 zip(lbl_enc.classes_, lbl_enc.transform(lbl_enc.classes_))}
#Splitting dataset
X_train, X_test, y_train, y_test = train_test_split(
    X_tfidf, y_enc, random_state=123, test_size=0.2,
    shuffle=True, stratify=y_enc)
#Modeling
def modeling(model, X_train, y_train, X_test, y_test):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    score = accuracy_score(y_test, y_pred)
    print(f"Model Accuracy: {score*100}%")
    return y_pred
xgboost_model = XGBClassifier()
xgboost_ypred = modeling(xgboost_model, X_train, y_train, X_test, y_test)
rf_model = RandomForestClassifier()
rf_ypred = modeling(rf_model, X_train, y_train, X_test, y_test)
nb_model = MultinomialNB()
nb_ypred = modeling(nb_model, X_train, y_train, X_test, y_test)
#Confusion Matrix
xgboost_conf_matrix = confusion_matrix(y_test, xgboost_ypred)
rf_conf_matrix = confusion_matrix(y_test, rf_ypred)
naive_conf_matrix = confusion_matrix(y_test, nb_ypred)
#Classification Report
def class_report(y_test, y_pred):
    print(classification_report(y_test, y_pred))
print("Random Forest Classification Report:")
class_report(y_test, rf_ypred)
#Predicting the output for test data
#(predict(), loaded_vectorizer, loaded_label_mapping and model are the helper
# and objects reloaded from the files saved earlier with joblib)
text = df['text'][4848]
true_label = df['label'][4848]
print(text)
print(f"True Label is {true_label}")
print("-"*50)
print("Model Results:")
predict(text, loaded_vectorizer, loaded_label_mapping, model)
#Predicting the output of user desired input
def classify_text(input_text):
    preprocessed_text = preprocess(input_text)
    tfidf_vector = vectorizer.transform([preprocessed_text])
    predicted_label_enc = model.predict(tfidf_vector)[0]
    predicted_label = list(label_mapping.keys())[
        list(label_mapping.values()).index(predicted_label_enc)]
    return predicted_label
input_text = input("Enter your text: ")
predicted_label = classify_text(input_text)
print("Predicted Label:", predicted_label)

6. RESULTS / ANALYSIS
6.1 Output Screens / Results Analysis:

Output Screens:

Fig: Label Mapping


Fig: User desired input

Fig: Converting into Lower Case letters

Fig: Data Distribution

In this section, we present the results and analysis of our text classification. Leveraging
machine learning algorithms and natural language processing techniques, our aim was to

develop an accurate and robust classification model capable of effectively categorizing text
data from an e-commerce dataset.
The results and analysis are presented below, encompassing key performance metrics such as
accuracy, precision, recall, F1-score, the confusion matrix, and ROC curves.
We calculated the classification report for each algorithm, providing detailed performance
metrics for each class. The classification report includes precision, recall, F1-score, and support
for each class, allowing for a comprehensive evaluation of the model's performance across
different categories.
Algorithm            Accuracy   Class          Precision   Recall   F1-Score   Support
Random Forest (RF)   97.511     0              0.98        0.98     0.98        2364
                                1              0.98        0.98     0.98        1734
                                2              0.98        0.96     0.97        2124
                                3              0.97        0.98     0.98        3863
                                accuracy                            0.98       10085
                                macro avg      0.98        0.97     0.98       10085
                                weighted avg   0.98        0.98     0.98       10085
XG Boost             95.865     0              0.95        0.96     0.96        2364
                                1              0.97        0.97     0.97        1734
                                2              0.97        0.94     0.95        2124
                                3              0.95        0.97     0.96        3863
                                accuracy                            0.96       10085
                                macro avg      0.96        0.96     0.96       10085
                                weighted avg   0.96        0.96     0.96       10085
Naïve Bayes          94.526     0              0.97        0.92     0.95        2364
                                1              0.98        0.95     0.97        1734
                                2              0.96        0.91     0.94        2124
                                3              0.91        0.98     0.94        3863
                                accuracy                            0.95       10085
                                macro avg      0.96        0.94     0.95       10085
                                weighted avg   0.95        0.95     0.95       10085
Table: Summary of the Classification Report for each Algorithm

ROC Curves: The Receiver Operating Characteristic (ROC) curve is a graphical representation
of the performance of a binary classification model across various decision thresholds. It plots
the true positive rate (sensitivity) against the false positive rate (1 - specificity) for different
threshold values.
True Positive Rate (TPR): Also known as sensitivity or recall, TPR measures the proportion of
positive instances that are correctly identified by the model. It is calculated as:
TPR = True Positives / (True Positives + False Negatives)
False Positive Rate (FPR): FPR measures the proportion of negative instances that are
incorrectly classified as positive by the model. It is calculated as:
FPR = False Positives / (False Positives + True Negatives)
Area Under the ROC Curve (AUC-ROC): AUC-ROC quantifies the overall performance of the
classifier across all possible decision thresholds. It represents the probability that the classifier
will rank a randomly chosen positive instance higher than a randomly chosen negative instance.
AUC values range from 0 to 1, where higher values indicate better performance. AUC = 0.5
indicates random guessing, while AUC = 1 represents perfect classification.
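
Because this task has four classes, the ROC curves in the figure below are computed
one-vs-rest; the following is a sketch of how they could be produced, assuming the fitted
rf_model and the train/test split from earlier:

Code Snippet:
import matplotlib.pyplot as plt
from sklearn.preprocessing import label_binarize
from sklearn.metrics import roc_curve, auc

classes = [0, 1, 2, 3]                               # encoded class labels
y_test_bin = label_binarize(y_test, classes=classes)
y_score = rf_model.predict_proba(X_test)             # per-class probabilities

for i in classes:
    fpr, tpr, _ = roc_curve(y_test_bin[:, i], y_score[:, i])
    plt.plot(fpr, tpr, label=f'class {i} (AUC = {auc(fpr, tpr):.2f})')

plt.plot([0, 1], [0, 1], linestyle='--')  # chance line (AUC = 0.5)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()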

Fig: ROC Curve for each class label in e-commerce text classification

Confusion Matrix: A confusion matrix summarizes the performance of a machine learning
model on a set of test data. It displays the number of accurate and inaccurate instances
based on the model’s predictions.

Fig: Confusion Matrix

7. CONCLUSION & FUTURE SCOPE
Conclusion:

In conclusion, this project explored the application of the random forest algorithm for
multi-class text classification tasks in the field of natural language processing. Through our
investigation, we have demonstrated the effectiveness of random forests with an accuracy of
97% in automatically sorting text data into predefined classes. By leveraging ensemble learning
techniques and aggregating predictions from multiple decision trees, random forests offer
robust performance in handling text classification tasks.
Throughout the project, we have evaluated the performance of the random forest algorithm
using metrics such as accuracy, precision, recall, and F1-score. Our experiments have
demonstrated promising results, indicating the suitability of random forests for multi-class text
classification problems.
Future Scope:

In the realm of text classification, the future holds promising avenues for exploration and
improvement. Optimizing the random forest algorithm through hyperparameter tuning and
advanced feature engineering techniques presents an opportunity to enhance classification
performance. As deep learning continues to advance, investigating deep neural network
architectures like CNNs and RNNs for text classification holds promise for achieving state-of-
the-art results. Moreover, applying text classification techniques to domain-specific
applications, coupled with efforts to ensure scalability, efficiency, interpretability, and
explainability, will contribute to the development of practical and impactful solutions for
industries dealing with large volumes of textual data.
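
As one concrete starting point for the hyperparameter tuning mentioned above, a grid search
over a few random forest parameters could look like the sketch below; the grid values are
illustrative assumptions, not tested settings:

Code Snippet:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Illustrative grid (assumed values, not tested settings).
param_grid = {
    'n_estimators': [100, 200, 400],
    'max_features': ['sqrt', 'log2'],
    'min_samples_leaf': [1, 2, 5],
}
search = GridSearchCV(RandomForestClassifier(random_state=123),
                      param_grid, cv=3, scoring='f1_macro', n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)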

8. REFERENCES

1. [6] Xiaoyu Luo (2021), “Efficient English Text Classification Using Selected Machine
Learning Techniques”, Alexandria Engineering Journal, published 1 June 2021.

2. [52] Dhirajj Kumar, Gopesh, Avinash Choubey, Ms. Pratibha Sing (2020), “Restaurant
Review Classification and Analysis”, Alexandria Engineering Journal, published 25
March 2022.

3. [53] Muhammad Nabeel Asim, Muhammad Usman Ghani, Muhammad Ali Ibrahim,
Waqar Mahmood, Sdheraz Ahmad, Andreas Dengel (2020), “Benchmark Performance
of Machine Learning and Deep Learning Based Methodologies for Urdu Text
Document Classification”, Neural Computing and Applications,
DOI:10.1007/s00521-020-05321-8, 24 September 2020.

4. [21] Emmanouil K. Ikonomskis, Sotiris Kotsiantis, V. Tampakas (2019), “Text
Classification Using Machine Learning Techniques”, Computer Science, Corpus ID:
267938927, published 2005.

5. [4] Kapil Sethi, Ankit Gupta, Gaurav Gupta, Varun Jaiswal (2017), “Comparative
Analysis of Machine Learning Algorithms on Different Datasets”, Circulation in
Computer Science, International Conference on Innovations in Computing, Mohali,
April 2019.

6. [54] Jasleen Kaur, Dr. Jatinder Kumar, R. Saini (2015), “A Study of Text Classification
Natural Language Processing Algorithms for Indian Languages”, VNSGU Journal of
Science and Technology, Vol. 4, No. 1, pp. 162-167, ISSN: 0975-5446, July 2015.

7. [55] Muhammad Abid, Asad Habib, Jawab Shahid, Jawad Ashraf (2017), “Urdu Word
Sense Disambiguation Using Machine Learning Approach”, Cluster Computing 21(4),
DOI:10.1007/s10586-017-0918-0, March 2018.

8. [56] Basant Agarwal, Namita Mithal (2016), “Text Classification Using Machine
Learning Methods: A Survey”, Advances in Intelligent Systems and Computing 236,
pp. 701-709, DOI:10.1007/978-81-322-1602-5_75, in Proceedings of the Second
International Conference on Soft Computing for Problem Solving (SocProS 2012),
December 28-30, 2012, February 2014.

9. [1] Bao Y. and Ishii N., “Combining Multiple kNN Classifiers for Text Categorization
by Reducts”, LNCS 2534, 2002, pp. 340-347.

10. [9] D. E. Johnson, F. J. Oles, T. Zhang, T. Goetz, “A Decision-Tree-Based Symbolic
Rule Induction System for Text Categorization”, IBM Systems Journal, September
2002.

11. [11] Ke H., Shaoping M., “Text Categorization Based on Concept Indexing and
Principal Component Analysis”, Proc. TENCON 2002 Conference on Computers,
Communications, Control and Power Engineering, 2002, pp. 51-56.

12. [12] Kehagias A., Petridis V., Kaburlasos V., Fragkou P., “A Comparison of Word- and
Sense-Based Text Categorization Using Several Classification Algorithms”, JIIS,
Volume 21, Issue 3, 2003, pp. 227-247.

13. [13] B. Kessler, G. Nunberg, and H. Schutze, “Automatic Detection of Text Genre”, in
Proceedings of the Thirty-Fifth ACL and EACL, pp. 32-38, 1997.

14. [14] Kim S. B., Rim H. C., Yook D. S. and Lim H. S., “Effective Methods for Improving
Naive Bayes Text Classifiers”, LNAI 2417, 2002, pp. 414-423.

15. [15] Klopotek M. and Woch M., “Very Large Bayesian Networks in Text
Classification”, ICCS 2003, LNCS 2657, 2003, pp. 397-406.

