GAYATRI VIDYA PARISHAD COLLEGE OF ENGINEERING FOR WOMEN
[Accredited by National Board of Accreditation (NBA) for B.Tech. CSE, ECE & IT – Valid from 2019-22 and 2022-25]
By
Batch-4
2021–2025
CERTIFICATE
This is to certify that the mini project report titled “TEXT CLASSIFICATION” is a
bonafide work of the following III/IV B.Tech. students in the Department of Computer Science
and Engineering, Gayatri Vidya Parishad College of Engineering for Women, affiliated to JNT
University, Kakinada, during the academic year 2023-2024, Semester-II.
Project Mentor
Mrs. V. Gowtami Annapurna
Assistant Professor
ACKNOWLEDGEMENT
The satisfaction that accompanies the successful completion of any task would be
incomplete without the mention of people who made it possible and whose constant guidance
and encouragement crown all the efforts with success.
We would like to take this opportunity to express our profound sense of gratitude to
Dr. R. K. Goswami, Principal and Dr. G. Sudheer, Vice Principal for allowing us to utilize
the college resources thereby facilitating the successful completion of our thesis.
We are also thankful to both the teaching and non-teaching faculty of the Department of
Computer Science and Engineering for giving valuable suggestions for our project.
TOPICS
ABSTRACT
1. INTRODUCTION
2. LITERATURE REVIEW
3. TECHNOLOGY STACK
4. METHODOLOGY
5. IMPLEMENTATION
6. RESULTS / ANALYSIS
7. CONCLUSION & FUTURE SCOPE
8. REFERENCES
ABSTRACT
Text classification is a vital area of natural language processing (NLP) in which text
data is automatically sorted into a predefined set of classes. Text classification is
significant for many enterprises because it eliminates the need for manual data classification, a
more expensive and time-consuming process. Automated text classification has long been
considered a key method for managing and processing the vast and continuously growing
volume of documents in digital form. In general, text classification plays an
important role in information extraction and summarization, text retrieval, and question
answering. In this project, we investigate the application of the random forest algorithm to
multi-class text classification problems. Random forests are an ensemble learning technique
that constructs multiple decision trees during training and aggregates their predictions for
classification.
Keywords:
Text classification, natural language processing, digital forms, information extraction,
summarization, text retrieval, question-answering, random forest algorithm, multi-class
classification, ensemble learning, decision trees, aggregation.
1. INTRODUCTION
In the era of big data, text classification using machine learning (ML) algorithms has become
indispensable, particularly in the context of multi-label classification. Multi-label
classification extends traditional binary or single-label classification by accommodating the
complex and diverse nature of textual data, allowing each instance to be associated with
multiple labels simultaneously. Nowadays, industries benefit greatly from developing
automatic systems for extracting usable structured data from unstructured text sources.
With such a structured resource, researchers and industry professionals can run relatively
simple queries to retrieve all information related to industrial work. Text classification is
the task of assigning text to different classes based on its domain. It is a fundamental process in
natural language processing, for which a range of tools is available for classifying textual data.
Automatic text classification has always been a critical application and research topic since
the inception of digital documents. Textual analytics translates text into numbers, yielding
structured data and making it easier to spot trends. The more structured the data, the better
the analysis, and ultimately the better the decisions. Machine learning (ML), a branch of
artificial intelligence (AI) that allows computers to operate and learn even when they are not
explicitly programmed, is employed for this purpose.
In this study, we aim to explore and implement state-of-the-art ML algorithms for multi-label
text classification on an e-commerce dataset. By leveraging supervised learning techniques
and NLP models, we seek to develop a robust and accurate multi-label text classification
system capable of handling the specific challenges and nuances of e-commerce data. Through
rigorous experimentation and evaluation, we aim to identify optimal ML algorithms and
methodologies, contributing to advancements in multi-label text classification research within
the context of e-commerce applications. The documents in the text classification model
pass through several steps: text is preprocessed by converting it to lowercase, removing
stop words, and applying stemming/lemmatization. Using a variety of classifiers and feature
representations, we train models to classify text accurately. The project finds applications in
sentiment analysis, topic categorization, and more. Ultimately, our goal is to develop robust
text classification systems with real-world impact.
2. LITERATURE REVIEW
1. In [6], Xiaoyu Luo (2021), “Efficient English Text Classification Using Selected
Machine Learning Techniques”. The classification is performed using SVM, Naïve
Bayes, and Logistic Regression. After evaluating classifier performance using
precision, recall, and F1-score metrics, it was observed that Support Vector
Machine (SVM) outperformed the other classifiers on two datasets, while Logistic
Regression demonstrated superior performance on one dataset.
2. In [52], Dhirajj Kumar, Gopesh, Avinash Choubey, Ms. Pratibha Sing (2020),
“Restaurant Review Classification and Analysis”. The classification is performed using
Naïve Bayes, Multinomial Naïve Bayes, and Logistic Regression. Their evaluation of
classifier performance reveals that Multinomial Naïve Bayes outperforms the other
algorithms in terms of precision, recall, and F1-score.
5. In [4], Kapil Sethi, Ankit Gupta, Gaurav Gupta, Varun Jaiswal (2017), “Comparative
Analysis of Machine Learning Algorithms on Different Datasets”. The methodologies
they used are Neural Network, K-Nearest Neighbor, and Support Vector Machine. After
performing the evaluation, they found that SVM outperforms the other algorithms and
that the model is useful in medicine, governance, and other fields.
6. In[54], Jasleen Kaur, Dr. Jatinder Kumar, R Saini(2015), “A Study Of Text
Classification Natural Language Processing Algorithms for Indian Language“. They
used Naïve Bayes, SVM, Artificial Neural Network, N-gram. Their study reveals that
supervised machine learning algorithms, such as Naïve Bayes, SVM, and Artificial
Neural Network, outperform unsupervised machine learning algorithms for text
classification tasks in Indian languages.
8. In [56], Basant Agarwal, Namita Mithal (2016), “Text Classification Using Machine
Learning Methods: A Survey”. They investigate the effectiveness of Naïve Bayes,
Support Vector Machine (SVM), K-Nearest Neighbors (KNN), and Decision Tree
algorithms in classifying textual data. Their survey findings suggest that SVM
performs well for textual documents belonging to a particular category, but its
performance diminishes for multiclass classification tasks. This observation highlights
a limitation of SVM in handling complex classification scenarios involving multiple
classes.
9. In [1], Bao Y. and Ishii N., "Combining Multiple kNN Classifiers for Text
Categorization by Reducts", LNCS 2534, 2002, pp. 340-347, the authors propose
combining multiple kNN classifiers for text categorization using reducts. Their results
show that the combination of multiple kNN classifiers can improve the performance
compared to using a single kNN classifier.
10. In [9], D. E. Johnson, F. J. Oles, T. Zhang, T. Goetz, "A decision-tree-based symbolic
rule induction system for text categorization", IBM Systems Journal, September 2002,
the authors describe a fast decision tree construction algorithm that takes advantage of
the sparsity of text data, and a rule simplification method that converts the decision
tree into a logically equivalent rule set.
3. TECHNOLOGY STACK
3.1 Datasets/Database Used:
Fig: Dataset
• The dataset comprises various product listings from the e-commerce domain, covering
four categories: Household, Books, Electronics, and Clothing & Accessories.
• The Ecommerce text classification dataset has 50425 rows and 2 columns.
• The Ecommerce text classification dataset info:
RangeIndex: 50425 entries, 0 to 50424
Data columns (total 2 columns):
 #   Column   Non-Null Count   Dtype
---  ------   --------------   -----
 0   label    50425 non-null   object
 1   text     50424 non-null   object
• Null values in each column of the Ecommerce text classification dataset:
label    0
text     1
3.2 Pre-Processing Steps:
1. Drop null values:
Code Snippet:
null_val_ind = df[df['text'].isnull()].index[0]  # index of the single row with a null 'text'
df.drop([null_val_ind], axis=0, inplace=True)
This code removes the row containing the null value from the DataFrame df. The drop
method is a pandas function used to remove rows or columns based on labels (index or
column names). By setting axis=0, you specify that you want to drop rows.
2. Remove Stop words and apply lemmatization:
Code Snippet:
import spacy

nlp = spacy.load('en_core_web_sm')

def preprocess(text):
    doc = nlp(text)
    preprocessed = []
    for token in doc:
        if not token.is_stop and not token.is_punct:
            preprocessed.append(token.lemma_)
    flat = ' '.join(preprocessed)
    return flat

df['preprocessed_text'] = df['text'].apply(preprocess)
Removing Stopwords:
Stopwords are common words in a language that often don't carry significant meaning
and are typically removed from text before processing.
Applying Lemmatization:
Lemmatization is the process of reducing words to their base or root form, called a
lemma.
This preprocessing step removes stopwords and punctuation from the text data and
applies lemmatization to obtain the base form of each token. It helps in reducing noise
in the data and improving the quality of features used for text analysis or modeling.
3. Convert text to lowercase:
Code Snippet:
df['preprocessed_text'] = df['preprocessed_text'].str.lower()
3.3 Libraries Used:
1. seaborn:
Seaborn is a statistical data visualization library based on matplotlib. It provides a
high-level interface for drawing attractive and informative statistical graphics.
2. spaCy:
A high-performance natural language processing library for efficiently processing
large amounts of text. Offers a wide range of NLP capabilities, including tokenization,
lemmatization, and named entity recognition. Allows for easy integration with machine
learning models through its language models.
3. WordCloud:
A library for generating word cloud visualizations from textual data. Provides
customization options for controlling the size, shape, and color of the word cloud. Useful for
quickly identifying the most prominent terms in a corpus of text.
4. scikit-learn:
Used for various machine learning tasks, such as feature extraction with
TfidfVectorizer, target encoding with LabelEncoder, model training and evaluation with
train_test_split, accuracy_score, confusion_matrix, and classification_report.
5. xgboost:
A scalable and efficient implementation of the Gradient Boosting algorithm. Known
for its high performance and ability to handle large-scale data. Widely used in machine
learning competitions and production environments.
6. RandomForestClassifier:
An ensemble learning algorithm that combines multiple decision trees for improved
classification accuracy. Handles both numerical and categorical features effectively.
Provides built-in feature importance estimation, making it useful for interpretability.
7. NaiveBayesClassifier:
A probabilistic machine learning algorithm based on Bayes' theorem. Assumes
independence between features, making it computationally efficient and suitable for text
classification tasks. Provides a simple and interpretable approach to classification, often used
as a baseline model.
8. joblib:
A utility library that provides easy-to-use functions for saving and loading Python
objects. Allows for efficient serialization and deserialization of machine learning models and
other data. Helps in maintaining the reproducibility of the model training and deployment
process.
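As an illustration of how joblib could be used in this project, the following sketch saves and reloads the trained Random Forest model and the fitted TfidfVectorizer (the file names are assumptions, matching the rf_model.pkl used later in this report):
Illustrative Code Sketch:
from joblib import dump, load
# Persist the trained model and the fitted vectorizer to disk
dump(rf_model, 'rf_model.pkl')
dump(vectorizer, 'tfidf_vectorizer.pkl')
# Reload them later for inference without retraining
model = load('rf_model.pkl')
vectorizer = load('tfidf_vectorizer.pkl')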
The Text Classification system is designed with two key features to facilitate seamless user
interaction and efficient model utilization. Firstly, it offers robust internet connectivity,
enabling users to access the system from anywhere, anytime. This connectivity ensures that
users can easily upload their text data and receive predictions promptly, enhancing accessibility
and convenience. Secondly, the system empowers users to run the classification model and
generate predictions for their desired inputs effortlessly. Through an intuitive and user-friendly
interface, individuals can input text data, initiate the classification process, and obtain accurate
predictions in real-time. These features combine to provide a comprehensive and user-centric
text classification solution, catering to a wide range of applications and user requirements.
1. Software Requirements
Platform: Windows operating system
Programming Platform: Google Colab
Language used: Python 3
2. Hardware Requirements
Processor: Intel Core i3
RAM: 4 GB or more
Hard disk: 16 GB or more
GPU: 2 GB
4. METHODOLOGY
4.1 Modules Description:
Additionally, classification_report provides more detailed metrics like precision, recall, and
F1-score to understand the model's strengths and weaknesses in classifying different categories.
• Precision is the percentage of correct positive predictions relative to total positive
predictions. It measures how accurate the model is in identifying the positive class.
• Recall is the percentage of correct positive predictions relative to total actual positives.
It measures how sensitive the model is in detecting the positive class.
• F1 score is the harmonic mean of precision and recall. It balances both the precision
and recall of the model; the closer to 1, the better the model.
• Support is the number of instances that belong to each class.
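In terms of confusion-matrix counts, these metrics can be written as (a standard formulation, consistent with the TPR/FPR definitions in Section 6):
Precision = True Positives / (True Positives + False Positives)
Recall = True Positives / (True Positives + False Negatives)
F1 Score = 2 × (Precision × Recall) / (Precision + Recall)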
To generate a classification report in Python, we can use the classification_report function from
the sklearn.metrics module.
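For example, the report for the Random Forest predictions could be generated as follows (variable names follow the implementation section):
Illustrative Code Sketch:
from sklearn.metrics import classification_report
# Per-class precision, recall, F1-score and support for the Random Forest predictions
print(classification_report(y_test, rf_ypred))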
7. Making Predictions:
Code Snippet:
from joblib import load

model = load('rf_model.pkl')

def classify_text(input_text):
    preprocessed_text = preprocess(input_text)
    tfidf_vector = vectorizer.transform([preprocessed_text])
    predicted_label_enc = model.predict(tfidf_vector)[0]
    predicted_label = list(label_mapping.keys())[list(label_mapping.values()).index(predicted_label_enc)]
    return predicted_label

# Collect user input
input_text = input("Enter your text: ")
predicted_label = classify_text(input_text)

# Display the predicted label
print("Predicted Label:", predicted_label)
Once a model is trained and evaluated, it can be used to make predictions on new, unseen data.
The classify_text function takes a new piece of text as input, preprocesses it using the same
steps as before, converts it into a TF-IDF vector, and feeds it to the trained model. Finally, it
decodes the predicted numerical label back to the original category name using the label mapping.
The trained model, TfidfVectorizer, and label mapping are loaded from the saved files. The
preprocess() function is defined to preprocess new input text. The classify_text() function is
defined to take an input text, preprocess it, and use the loaded model to predict the class label.
The user is prompted to enter text, and the predicted label is displayed.
4.2 Algorithms:
For text classification applied to an e-commerce dataset, a variety of machine learning
algorithms were employed to effectively categorize textual data into relevant classes.
• RANDOM FOREST:
Random Forest (RF) classifiers are well suited to the high-dimensional, noisy data
encountered in text classification. An RF model comprises a set of decision trees, each
of which is trained using random subsets of features. Given an instance, the RF prediction
is obtained via majority voting over the predictions of all the trees in the forest.
However, different test instances have different values for the features used in
the trees, so the trees should contribute differently to the predictions; this diverse
contribution of the trees is not considered in traditional RFs.
When it comes to making predictions, each decision tree in the Random Forest casts
its vote. For classification tasks, the final prediction is determined by the mode (most
frequent prediction) across all the trees. In regression tasks, the average of the
individual tree predictions is taken. This internal voting mechanism ensures a balanced
and collective decision-making process.
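A minimal sketch of this voting behaviour on the TF-IDF features follows (hyperparameters are illustrative defaults; variable names follow the implementation section):
Illustrative Code Sketch:
from sklearn.ensemble import RandomForestClassifier
# 100 trees, each trained on a bootstrap sample with random feature subsets
rf = RandomForestClassifier(n_estimators=100, random_state=123)
rf.fit(X_train, y_train)
# predict() returns the majority-vote class for each instance;
# predict_proba() exposes the per-class probabilities averaged over all trees
print(rf.predict(X_test[:1]), rf.predict_proba(X_test[:1]))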
• XGBOOST (Extreme Gradient Boosting):
The basic idea behind XGBoost is to use decision trees as base models and then to
build an ensemble of these trees to improve the accuracy of predictions. In the context
of text classification, each tree takes a set of features extracted from the text document
as input and outputs a predicted class label. XGBoost then trains multiple trees on the
training data and aggregates their predictions to produce the final prediction for each
text document.
One of the advantages of XGBoost is that it can handle large amounts of data and can
be run on parallel computing systems, making it a fast and efficient algorithm for text
classification. Additionally, XGBoost also supports various regularization techniques
to prevent overfitting, which is a common problem in text classification due to the
large number of features often present in text data.
To implement XGBoost for text classification, the first step is to preprocess the text
data and extract meaningful features from it, such as term frequency-inverse document
frequency (TF-IDF) values. These features are then used as input to the XGBoost
model, which is trained on the preprocessed data. Finally, the trained model can be
used to predict the class label for new text documents.
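These three steps can be sketched as follows (an illustrative configuration, not the exact settings used in this project; XGBoost accepts the sparse TF-IDF matrix directly):
Illustrative Code Sketch:
from xgboost import XGBClassifier
# Gradient-boosted decision trees over TF-IDF features;
# y_train must hold the integer-encoded class labels
xgb = XGBClassifier(n_estimators=100, max_depth=6, learning_rate=0.3)
xgb.fit(X_train, y_train)
print(xgb.predict(X_test[:5]))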
• NAIVE BAYES:
Naive Bayes is a probabilistic classifier based on Bayes' theorem:
P(A|B) = P(B|A) P(A) / P(B)
Multinomial Naive Bayes: Multinomial Naive Bayes is an instance of the Naive Bayes
model which uses a multinomial distribution over the features and is commonly used for
text classification.
The Naive Bayes model computes the class probabilities for a given text document;
Multinomial Naive Bayes is the variant used in this project. The total set of classes is
denoted by C, and Multinomial Naive Bayes assigns a document di to the class
with the highest probability P(c|di), where c ∈ C. This follows from Bayes'
theorem and is given by:
P(c|di) = P(di|c) P(c) / P(di)
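A minimal sketch of Multinomial Naive Bayes on the TF-IDF features (alpha is the standard Laplace smoothing parameter; variable names follow the implementation section):
Illustrative Code Sketch:
from sklearn.naive_bayes import MultinomialNB
# Estimates the class priors P(c) and per-class feature likelihoods from the training data
nb = MultinomialNB(alpha=1.0)
nb.fit(X_train, y_train)
# predict_proba returns P(c|di) for every class c; predict picks the argmax
print(nb.predict_proba(X_test[:1]), nb.predict(X_test[:1]))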
4.3 Model Architecture Diagram:
The image depicts a flow diagram of the text classification process, which involves several
steps:
Text Data: The raw text data that needs to be classified.
Tokenization: This step breaks down the text data into individual tokens (words or phrases).
Pre-Processing:
Removal of Punctuations: Removes punctuation marks from the text.
Removal of Stopwords: Removes common words that do not carry much meaning (e.g., "the",
"a", "is").
Stemming & Lemmatization: Converts words to their base or root form to reduce vocabulary
size.
Feature Engineering:
Bag of Words: Representing the text as a vector of word frequencies.
TF-IDF (Term Frequency-Inverse Document Frequency): A more advanced text representation
method that considers the importance of each word.
Model Building: The processed text data is then used to build a machine learning or deep
learning model for text classification.
Model Evaluation: The performance of the built model is evaluated using appropriate metrics.
Model Refinement: Based on the evaluation, the model is further refined and improved.
This flow diagram provides a high-level overview of the typical text classification process,
highlighting the key steps involved in transforming the raw text data into a format suitable for
model building and evaluation.
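As an illustrative sketch (not the exact code of this project), the same flow can be expressed as a scikit-learn Pipeline that chains feature engineering and model building:
Illustrative Code Sketch:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier

# Lowercasing and stop-word removal are folded into the vectorizer;
# TF-IDF feature engineering feeds directly into model building
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(lowercase=True, stop_words='english')),
    ('clf', RandomForestClassifier(random_state=123)),
])
pipeline.fit(X, y)  # X: raw text documents, y: class labels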
5. IMPLEMENTATION
#Imports
import pandas as pd
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from xgboost import XGBClassifier

#Loading Dataset
ds_name = 'Ecommerce text classification'
df = pd.read_csv('ecommerceDataset.csv')

#Dropping Null Values
null_val_ind = df[df['text'].isnull()].index[0]
df.drop([null_val_ind], axis=0, inplace=True)
#Text Preprocessing
nlp = spacy.load('en_core_web_sm')
def preprocess(text):
    doc = nlp(text)
    preprocessed = []
    for token in doc:
        if not token.is_stop and not token.is_punct:
            preprocessed.append(token.lemma_)
    flat = ' '.join(preprocessed)
    return flat
df['preprocessed_text'] = df['text'].apply(preprocess)
df['preprocessed_text'] = df['preprocessed_text'].str.lower()
#Defining feature and target variables
X = df['preprocessed_text']
y = df['label']
#Tokenization
vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(X.values)
lbl_enc = LabelEncoder()
y_enc = lbl_enc.fit_transform(y.values)
#Mapping the labels
label_mapping = {label: encoded for label, encoded in zip(lbl_enc.classes_, lbl_enc.transform(lbl_enc.classes_))}
#Splitting dataset
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, y_enc, random_state=123,
test_size=0.2, shuffle=True, stratify=y_enc)
#Modeling
def modeling(model, X_train, y_train, X_test, y_test):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    score = accuracy_score(y_test, y_pred)
    print(f"Model Accuracy: {score*100}%")
    return y_pred
xgboost_model = XGBClassifier()
xgboost_ypred = modeling(xgboost_model, X_train, y_train, X_test, y_test)
rf_model = RandomForestClassifier()
rf_ypred = modeling(rf_model, X_train, y_train, X_test, y_test)
nb_model = MultinomialNB()
nb_ypred = modeling(nb_model, X_train, y_train, X_test, y_test)
#Confusion Matrix
xgboost_conf_matrix = confusion_matrix(y_test, xgboost_ypred)
rf_conf_matrix = confusion_matrix(y_test, rf_ypred)
naive_conf_matrix=confusion_matrix(y_test, nb_ypred)
#Classification Report
def class_report(y_test, y_pred):
print(classification_report(y_test, y_pred))
print("Random Forest Classification Report:")
class_report(y_test, rf_ypred)
#Classifying text with the trained Random Forest model
def classify_text(input_text):
    preprocessed_text = preprocess(input_text)
    tfidf_vector = vectorizer.transform([preprocessed_text])
    predicted_label_enc = rf_model.predict(tfidf_vector)[0]
    predicted_label = list(label_mapping.keys())[list(label_mapping.values()).index(predicted_label_enc)]
    return predicted_label

#Predicting the output for test data
text = df['text'][4848]
true_label = df['label'][4848]
print(text)
print(f"True Label is {true_label}")
print("-"*50)
print("Model Results:")
print("Predicted Label:", classify_text(text))

#Predicting the output of user desired input
input_text = input("Enter your text: ")
predicted_label = classify_text(input_text)
print("Predicted Label:", predicted_label)
6. RESULTS / ANALYSIS
6.1 Output Screens / Results Analysis:
Output Screens:
In this section, we present the results and analysis of our text classification. Leveraging
machine learning algorithms and natural language processing techniques, our aim was to
develop an accurate and robust classification model capable of effectively categorizing text
data from an e-commerce dataset.
The results and analysis are presented below, encompassing key performance metrics such as
accuracy, precision, recall, F1-score, the confusion matrix, and the ROC curve.
We calculated the classification report for each algorithm, providing detailed performance
metrics for each class. The classification report includes precision, recall, F1-score, and support
for each class, allowing for a comprehensive evaluation of the model's performance across
different categories.
Algorithm            Accuracy   Class          Precision   Recall   F1 Score   Support
Random Forest (RF)   97.511     0              0.98        0.98     0.98       2364
                                1              0.98        0.98     0.98       1734
                                2              0.98        0.96     0.97       2124
                                3              0.97        0.98     0.98       3863
                                accuracy                            0.98       10085
                                macro avg      0.98        0.97     0.98       10085
                                weighted avg   0.98        0.98     0.98       10085
XG Boost             95.865     0              0.95        0.96     0.96       2364
                                1              0.97        0.97     0.97       1734
                                2              0.97        0.94     0.95       2124
                                3              0.95        0.97     0.96       3863
                                accuracy                            0.96       10085
                                macro avg      0.96        0.96     0.96       10085
                                weighted avg   0.96        0.96     0.96       10085
Naïve Bayes          94.526     0              0.97        0.92     0.95       2364
                                1              0.98        0.95     0.97       1734
                                2              0.96        0.91     0.94       2124
                                3              0.91        0.98     0.94       3863
                                accuracy                            0.95       10085
                                macro avg      0.96        0.94     0.95       10085
                                weighted avg   0.95        0.95     0.95       10085
Table: Summary of Classification Report for each Algorithm
ROC Curves: The Receiver Operating Characteristic (ROC) curve is a graphical representation
of the performance of a binary classification model across various decision thresholds. It plots
the true positive rate (sensitivity) against the false positive rate (1 - specificity) for different
threshold values.
True Positive Rate (TPR): Also known as sensitivity or recall, TPR measures the proportion of
positive instances that are correctly identified by the model. It is calculated as:
TPR = True Positives / (True Positives + False Negatives)
False Positive Rate (FPR): FPR measures the proportion of negative instances that are
incorrectly classified as positive by the model. It is calculated as:
FPR = False Positives / (False Positives + True Negatives)
Area Under the ROC Curve (AUC-ROC): AUC-ROC quantifies the overall performance of the
classifier across all possible decision thresholds. It represents the probability that the classifier
will rank a randomly chosen positive instance higher than a randomly chosen negative instance.
AUC values range from 0 to 1, where higher values indicate better performance. AUC = 0.5
indicates random guessing, while AUC = 1 represents perfect classification.
Fig: ROC Curve for each class label in e-commerce text classification
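As a sketch of how these per-class curves could be computed with scikit-learn (a one-vs-rest binarization of the four encoded labels; variable names follow the implementation section):
Illustrative Code Sketch:
from sklearn.metrics import roc_curve, auc
from sklearn.preprocessing import label_binarize

# One-vs-rest ROC: treat each class in turn as the positive class
y_test_bin = label_binarize(y_test, classes=[0, 1, 2, 3])
y_score = rf_model.predict_proba(X_test)
for i in range(4):
    fpr, tpr, _ = roc_curve(y_test_bin[:, i], y_score[:, i])
    print(f"Class {i}: AUC = {auc(fpr, tpr):.3f}")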
7. CONCLUSION & FUTURE SCOPE
Conclusion:
In conclusion, this project explored the application of the random forest algorithm for
multi-class text classification tasks in the field of natural language processing. Through our
investigation, we have demonstrated the effectiveness of random forests with an accuracy of
97% in automatically sorting text data into predefined classes. By leveraging ensemble learning
techniques and aggregating predictions from multiple decision trees, random forests offer
robust performance in handling text classification tasks.
Throughout the project, we have evaluated the performance of the random forest algorithm
using metrics such as accuracy, precision, recall, and F1-score. Our experiments have
demonstrated promising results, indicating the suitability of random forests for multi-class text
classification problems.
Future Scope:
In the realm of text classification, the future holds promising avenues for exploration and
improvement. Optimizing the random forest algorithm through hyperparameter tuning and
advanced feature engineering techniques presents an opportunity to enhance classification
performance. As deep learning continues to advance, investigating deep neural network
architectures like CNNs and RNNs for text classification holds promise for achieving state-of-
the-art results. Moreover, applying text classification techniques to domain-specific
applications, coupled with efforts to ensure scalability, efficiency, interpretability, and
explainability, will contribute to the development of practical and impactful solutions for
industries dealing with large volumes of textual data.
8. REFERENCES
1. [6] Xiaoyu Luo (2021), “Efficient English Text Classification Using Selected Machine
Learning Techniques”, Alexandria Engineering Journal, published 1 June 2021.
2. [52] Dhirajj Kumar, Gopesh, Avinash Choubey, Ms. Pratibha Sing (2020), “Restaurant
Review Classification and Analysis”, Alexandria Engineering Journal, published 25
March 2022.
5. [4] Kapil Sethi, Ankit Gupta, Gaurav Gupta, Varun Jaiswal (2017), “Comparative
Analysis of Machine Learning Algorithms on Different Datasets”, Circulation in
Computer Science, International Conference on Innovations in Computing, Mohali,
April 2019.
6. [54] Jasleen Kaur, Dr. Jatinder Kumar, R. Saini (2015), “A Study of Text Classification
Natural Language Processing Algorithms for Indian Language”, VNSGU Journal of
Science and Technology, Vol. 4, No. 1, pp. 162-167, ISSN: 0975-5446, July 2015.
7. [55] Muhammad Abid, Asad Habib, Jawab Shahid, Jawad Ashraf (2017), “Urdu Word
Sense Disambiguation Using Machine Learning Approach”, Cluster Computing 21(4),
DOI: 10.1007/s10586-017-0918-0, March 2018.
9. [1] Bao Y. and Ishii N., “Combining Multiple kNN Classifiers for Text Categorization
by Reducts”, LNCS 2534, 2002, pp. 340-347.
10. [9] D. E. Johnson, F. J. Oles, T. Zhang, T. Goetz, “A Decision-Tree-Based Symbolic
Rule Induction System for Text Categorization”, IBM Systems Journal, September 2002.
11. [11] Ke H., Shaoping M., “Text Categorization Based on Concept Indexing and Principal
Component Analysis”, Proc. TENCON 2002 Conference on Computers, Communications,
Control and Power Engineering, 2002, pp. 51-56.
12. [12] Kehagias A., Petridis V., Kaburlasos V., Fragkou P., “A Comparison of Word- and
Sense-Based Text Categorization Using Several Classification Algorithms”, JIIS,
Volume 21, Issue 3, 2003, pp. 227-247.
13. [13] B. Kessler, G. Nunberg, and H. Schutze, “Automatic Detection of Text Genre”, in
Proceedings of the Thirty-Fifth ACL and EACL, pp. 32-38, 1997.
14. [14] Kim S. B., Rim H. C., Yook D. S. and Lim H. S., “Effective Methods for Improving
Naive Bayes Text Classifiers”, LNAI 2417, 2002, pp. 414-423.
15. [15] Klopotek M. and Woch M., “Very Large Bayesian Networks in Text
Classification”, ICCS 2003, LNCS 2657, 2003, pp. 397-406.