Questions
Introduction:
In this lab assessment, we will classify text with two Naive Bayes classifiers: the Multinomial Naive Bayes classifier and the Complement Naive Bayes classifier.
The dataset is a collection of newsgroup documents. The 20 newsgroups collection has become a popular dataset for experiments in text applications of machine learning techniques, such as text classification and text clustering. It includes 20,000 messages.
Questions:
1. Describe the key steps for data preparation and feature extraction (1 mark).
Preparing the data to be feature-ready for use in the model requires several steps:
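A minimal sketch of a typical preparation pipeline for this task, shown on a tiny in-memory corpus (an assumption on my part; the lab itself would load the full dataset, e.g. with scikit-learn's `fetch_20newsgroups`):

```python
# Hedged sketch of data preparation and feature extraction on toy documents.
# The actual lab would vectorize the full 20 newsgroups training split.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "The goalie made a great save in the hockey game",
    "Public key encryption protects the message",
]

# Tokenize, lowercase, drop English stop words, and weight terms by TF-IDF.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

print(X.shape)  # (2, vocabulary size)
```

The resulting sparse matrix `X` is what the Naive Bayes classifiers are trained on.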
• Precision: the model classified sci.crypt best, with a precision of 93%, and talk.religion.misc worst, with 57%. This means that 93% of the messages predicted as sci.crypt actually belong to that category.
• Recall: the highest recall was for rec.sport.hockey, at 96%. This means that 96% of the actual rec.sport.hockey messages were correctly identified by the model.
• F1-score: the best balance belonged to rec.sport.baseball, at 91%. This means that both precision and recall were high for this class.
• Support: the number of actual instances of each class in the test set.
• Accuracy: the overall accuracy of the model is 77%. This means that 77% of all instances in the test set were correctly classified across all classes.
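As a quick illustration of how these metrics are computed (a toy example with made-up labels, not the lab's actual data), here is the per-class arithmetic the classification report performs:

```python
# Toy example: precision, recall, F1, and support for one class ("crypt"),
# computed the same way a classification report does.
y_true = ["crypt", "crypt", "crypt", "hockey", "hockey", "religion"]
y_pred = ["crypt", "crypt", "hockey", "hockey", "hockey", "crypt"]

cls = "crypt"
tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))  # true positives
fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))  # false negatives

precision = tp / (tp + fp)  # of predicted "crypt", how many are right
recall = tp / (tp + fn)     # of actual "crypt", how many were found
f1 = 2 * precision * recall / (precision + recall)
support = sum(t == cls for t in y_true)  # number of actual "crypt" instances
```

In practice the lab would get these numbers from `sklearn.metrics.classification_report`.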
3. Plot the confusion matrix for your classification result. Find the pair of classes that
confuses the classifier most. Is this result consistent with your expectation? (1 mark).
The classifier is confused most often between alt.atheism and talk.religion.misc. This is evident in the plot, as there are 68 misclassifications for this pair.
The result of the confusion matrix is consistent with my expectations. alt.atheism and talk.religion.misc cover comparable topics that can easily be mixed up by the classifier. My expectation was also based on the classification report: talk.religion.misc had the lowest F1-score at 43%, and alt.atheism had one of the lowest at 66%.
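The confusion matrix itself is just a table of (actual, predicted) counts; a small sketch on toy labels (the plotting step, e.g. with matplotlib's `imshow` or scikit-learn's `ConfusionMatrixDisplay`, is an assumed tool and not shown):

```python
# Toy sketch: build a confusion matrix by counting (actual, predicted) pairs.
labels = ["atheism", "religion"]
y_true = ["atheism", "atheism", "religion", "religion", "atheism"]
y_pred = ["atheism", "religion", "religion", "atheism", "atheism"]

idx = {c: i for i, c in enumerate(labels)}
cm = [[0] * len(labels) for _ in labels]
for t, p in zip(y_true, y_pred):
    cm[idx[t]][idx[p]] += 1  # rows = actual class, columns = predicted class

# Off-diagonal cells count the confusions between a pair of classes.
pair_confusions = cm[0][1] + cm[1][0]
```

Summing the two off-diagonal cells for a class pair is how the most-confused pair (alt.atheism / talk.religion.misc, 68 cases) is found.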
4. Based on the confusion matrix, report the individual accuracy scores for each class (1
mark).
We take the "support" value from the classification report and the number of correctly classified instances from the diagonal of the confusion matrix, then compute: correctly classified / support.
Ex: alt.atheism = 171 / 233 = 0.73
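The calculation above can be sketched directly, using the alt.atheism numbers quoted in the answer:

```python
# Per-class accuracy = correctly classified (the class's diagonal entry in the
# confusion matrix) divided by support (the row total for that class).
correct = 171  # diagonal entry for alt.atheism, as quoted above
support = 233  # total actual alt.atheism instances
per_class_accuracy = correct / support
print(round(per_class_accuracy, 2))  # 0.73
```

Repeating this for every row of the confusion matrix gives the full set of per-class accuracies.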
The Multinomial Naive Bayes classifier performs slightly better than the Complement Naive Bayes classifier: MNB reaches an accuracy of 77%, whereas CNB reaches 74%. This slight advantage is evident in most per-class scores.
CNB performs better on only a couple of the computer-related classes.
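The comparison can be sketched as follows, on a tiny toy corpus (an illustration only; the lab's 77% vs. 74% figures come from the full 20 newsgroups test split):

```python
# Hedged sketch: train and score MultinomialNB and ComplementNB side by side.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import ComplementNB, MultinomialNB

train_docs = ["puck goal hockey", "puck ice hockey",
              "key cipher crypt", "cipher crypt code"]
train_y = ["hockey", "hockey", "crypt", "crypt"]
test_docs = ["hockey goal", "crypt key"]
test_y = ["hockey", "crypt"]

vec = CountVectorizer()
Xtr = vec.fit_transform(train_docs)
Xte = vec.transform(test_docs)

accs = {}
for clf in (MultinomialNB(), ComplementNB()):
    # score() returns mean accuracy on the held-out documents.
    accs[type(clf).__name__] = clf.fit(Xtr, train_y).score(Xte, test_y)
print(accs)
```

ComplementNB estimates each class's parameters from the *complement* of that class's documents, which is why it tends to help on imbalanced classes; here the classes are balanced, so any difference is incidental.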