Lab Assessment 1

OSAID KASPROWICZ (8142397)

Introduction:
In this lab assessment, we classify text with Naïve Bayes, using both the Multinomial
Naive Bayes classifier and the Complement Naive Bayes classifier.

The dataset is a collection of newsgroup documents. The 20 Newsgroups collection has become a
popular dataset for experiments in text applications of machine learning techniques, such as
text classification and text clustering. It contains roughly 20,000 messages.

Questions:
1. Describe the key steps for data preparation and feature extraction (1 mark).

Preparing the data so that it is feature-ready and usable by the model requires several steps:

1.1. Get the folders’ directory


1.2. Split the data into test and train with a test percentage of 25%
1.3. During the preprocessing step, we:
1.3.1. Remove all tabs from the sentences
1.3.2. Remove all punctuation except apostrophes
1.3.3. Remove any empty strings
1.3.4. Unquote words
1.3.5. Convert all the text to lower case
1.3.6. Remove words that are shorter than 3 letters
1.4. We remove a list of stop words that are predefined.
1.5. We tokenize sentences into words.
1.6. Remove the metadata that comes at the beginning of the files
1.7. We flatten the list into a 1-D array
1.8. Extract the unique words from the list
1.9. Sort the words based on frequency
1.10. Assign the labels to the folders (classes)
1.11. Convert the class labels to a numerical format
1.12. Convert the lists to NumPy arrays.
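The text-cleaning steps above could be sketched roughly as follows; the stop-word subset and the helper name `preprocess` are illustrative, not the actual lab code:

```python
import re
from urllib.parse import unquote

STOP_WORDS = {"the", "and", "for", "that", "with"}  # illustrative subset of a predefined list

def preprocess(text):
    """Clean one document following the preparation steps (illustrative sketch)."""
    text = text.replace("\t", " ")                       # remove all tabs
    text = unquote(text)                                 # unquote %-encoded words
    text = re.sub(r"[^\w\s']", " ", text)                # strip punctuation except apostrophes
    tokens = [w.lower() for w in text.split() if w]      # lower-case, drop empty strings
    tokens = [w for w in tokens if len(w) >= 3]          # drop words shorter than 3 letters
    tokens = [w for w in tokens if w not in STOP_WORDS]  # remove predefined stop words
    return tokens

print(preprocess("The QUICK\tbrown fox, it's %20running!"))
# → ['quick', 'brown', 'fox', "it's", 'running']
```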
2. Report the overall classification results, including precision, recall, and f1-score.
Explain the meaning of these criteria (1 mark).

Category                    Precision   Recall   F1-score   Support

alt.atheism                    0.61      0.73      0.66       233
comp.graphics                  0.60      0.66      0.63       253
comp.os.ms-windows.misc        0.73      0.65      0.69       249
comp.sys.ibm.pc.hardware       0.66      0.72      0.69       240
comp.sys.mac.hardware          0.69      0.78      0.73       236
comp.windows.x                 0.78      0.72      0.75       240
misc.forsale                   0.80      0.76      0.78       261
rec.autos                      0.81      0.81      0.81       269
rec.motorcycles                0.82      0.90      0.86       284
rec.sport.baseball             0.91      0.90      0.91       248
rec.sport.hockey               0.87      0.96      0.91       231
sci.crypt                      0.93      0.86      0.89       233
sci.electronics                0.77      0.70      0.74       244
sci.med                        0.90      0.86      0.88       256
sci.space                      0.88      0.83      0.85       246
soc.religion.christian         0.77      0.83      0.80       252
talk.politics.guns             0.68      0.83      0.75       249
talk.politics.mideast          0.90      0.83      0.86       281
talk.politics.misc             0.63      0.61      0.62       259
talk.religion.misc             0.57      0.35      0.43       236
accuracy                                           0.77      5000
macro avg                      0.77      0.76      0.76      5000
weighted avg                   0.77      0.77      0.76      5000

• Precision: the model classified sci.crypt best, with a precision of 93%; the lowest was
talk.religion.misc at 57%. A precision of 93% means that 93% of the documents predicted
as sci.crypt actually belong to that category.
• Recall: the highest recall was for rec.sport.hockey at 96%. This means that 96% of the
actual rec.sport.hockey documents were correctly identified by the model.
• F1-Score: the best balance between precision and recall belonged to rec.sport.baseball,
with an F1-score of 91%. This means both precision and recall were high for this class.
• Support: support indicates the number of instances of each class in the test set.
• Accuracy: the overall accuracy of the model is 77%. This means that 77% of all
instances in the test set were correctly classified across all classes.
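The three metrics can be computed from the prediction counts of a single class; the counts below (`tp`, `fp`, `fn`) are assumed example numbers, not figures from the report:

```python
# Hypothetical counts for one class, to illustrate the metric definitions
tp, fp, fn = 93, 7, 14  # true positives, false positives, false negatives (assumed)

precision = tp / (tp + fp)   # of the documents predicted as this class, how many were right
recall    = tp / (tp + fn)   # of the documents actually in this class, how many were found
f1        = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(round(precision, 2), round(recall, 2), round(f1, 2))
# → 0.93 0.87 0.9
```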
3. Plot the confusion matrix for your classification result. Find the pair of classes that
confuses the classifier most. Is this result consistent with your expectation? (1 mark).

The classifier is confused most often between alt.atheism and talk.religion.misc. This is
evident in the plot, where the confusion matrix shows 68 incorrect classifications for this pair.

The result of the confusion matrix is consistent with my expectations. alt.atheism and
talk.religion.misc cover overlapping topics that the classifier can easily mix up. My
expectation was also supported by the classification report: talk.religion.misc had the lowest
F1-score at 43%, and alt.atheism had one of the lowest at 66%.
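Finding the most-confused pair amounts to locating the largest off-diagonal entry of the confusion matrix. A minimal sketch with a made-up 3-class matrix (the 68 mirrors the alt.atheism/talk.religion.misc count, but the other entries are invented):

```python
import numpy as np

# Toy confusion matrix (rows = true class, cols = predicted class); invented numbers
cm = np.array([[171,   5,  40],
               [  8, 200,  10],
               [ 68,  12, 150]])

off = cm.copy()
np.fill_diagonal(off, 0)  # zero out correct predictions so only errors remain
i, j = np.unravel_index(off.argmax(), off.shape)
print(f"most confused: true class {i} predicted as class {j} ({off[i, j]} errors)")
# → most confused: true class 2 predicted as class 0 (68 errors)
```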

4. Based on the confusion matrix, report the individual accuracy scores for each class (1
mark).
We take the number of correctly classified instances for each class from the diagonal of the
confusion matrix and divide it by that class's "support" from the classification report:
per-class accuracy = correctly classified / support.
Ex: alt.atheism = 171 / 233 = 0.73

alt.atheism: 0.73              comp.graphics: 0.66
comp.os.ms-windows.misc: 0.65  comp.sys.ibm.pc.hardware: 0.72
comp.sys.mac.hardware: 0.78    comp.windows.x: 0.72
misc.forsale: 0.76             rec.autos: 0.81
rec.motorcycles: 0.90          rec.sport.baseball: 0.90
rec.sport.hockey: 0.96         sci.crypt: 0.86
sci.electronics: 0.70          sci.med: 0.86
sci.space: 0.83                soc.religion.christian: 0.83
talk.politics.guns: 0.83       talk.politics.mideast: 0.83
talk.politics.misc: 0.61       talk.religion.misc: 0.35
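The per-class accuracies can be computed in one step from the confusion matrix: divide the diagonal by the row sums (the supports). A sketch with a toy 3-class matrix whose first row reproduces the alt.atheism example (171 correct out of 233):

```python
import numpy as np

# Toy confusion matrix (rows = true class); only the first row matches the report
cm = np.array([[171,  30,  32],
               [ 10, 180,  20],
               [ 15,  25, 160]])

support = cm.sum(axis=1)                # number of instances of each true class
per_class_acc = np.diag(cm) / support   # correctly classified / support
print(np.round(per_class_acc, 2))
# → [0.73 0.86 0.8 ]
```

Computed this way, the per-class accuracy is the same quantity as the per-class recall in the classification report.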
5. Train a Complement Naive Bayes classifier and compare its classification results with
those of Multinomial Naive Bayes (1 mark).

Category                    Precision   Recall   F1-score   Support

alt.atheism                    0.58      0.65      0.61       233
comp.graphics                  0.64      0.68      0.66       253
comp.os.ms-windows.misc        0.76      0.59      0.67       249
comp.sys.ibm.pc.hardware       0.59      0.71      0.64       240
comp.sys.mac.hardware          0.83      0.69      0.75       236
comp.windows.x                 0.63      0.83      0.72       240
misc.forsale                   0.79      0.59      0.68       261
rec.autos                      0.82      0.80      0.81       269
rec.motorcycles                0.90      0.87      0.88       284
rec.sport.baseball             0.90      0.81      0.86       248
rec.sport.hockey               0.73      0.99      0.84       231
sci.crypt                      0.83      0.91      0.87       233
sci.electronics                0.79      0.60      0.69       244
sci.med                        0.86      0.86      0.86       256
sci.space                      0.84      0.85      0.84       246
soc.religion.christian         0.63      0.86      0.73       252
talk.politics.guns             0.66      0.78      0.71       249
talk.politics.mideast          0.74      0.93      0.82       281
talk.politics.misc             0.71      0.55      0.62       259
talk.religion.misc             0.63      0.20      0.30       236
accuracy                                           0.74      5000
macro avg                      0.74      0.74      0.73      5000
weighted avg                   0.75      0.74      0.73      5000

The Multinomial Naive Bayes classifier performs slightly better than the Complement
Naive Bayes classifier: MNB reaches an accuracy of 77%, whereas CNB reaches 74%.
This slight advantage is evident in most of the per-class scores.

CNB performs better in only a couple of the computer-related classes.
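A minimal sketch of how the two classifiers can be trained and compared with scikit-learn; the tiny toy corpus and its labels below are invented stand-ins for the 20 Newsgroups data, not the actual lab code:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import ComplementNB, MultinomialNB

# Invented toy corpus standing in for the 20 Newsgroups documents
docs   = ["hockey game tonight", "graphics card driver",
          "religion and faith",  "ice hockey playoffs"]
labels = ["sport", "comp", "religion", "sport"]

vec = CountVectorizer()
X = vec.fit_transform(docs)           # bag-of-words count features

mnb = MultinomialNB().fit(X, labels)
cnb = ComplementNB().fit(X, labels)   # designed to cope better with imbalanced classes

test_doc = vec.transform(["hockey playoffs game"])
print(mnb.predict(test_doc)[0], cnb.predict(test_doc)[0])
# → sport sport
```

On the real dataset, accuracy for each model would then be compared via `classification_report` on the held-out 25% test split.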
