Logistic Regression
August 6, 2019
0.0.1 Question 1c
Discuss one thing you notice that is different between the two emails that might relate to the identification
of spam.
It seems that many of the spam emails contain HTML.
0.0.2 Question 3a
Create a bar chart like the one above comparing the proportion of spam and ham emails containing certain
words. Choose a set of words that are different from the ones above, but also have different proportions for
the two classes. Make sure to only consider emails from train.
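One way to get the numbers behind such a bar chart is sketched below. This is a minimal illustration, not the assignment's reference solution: the toy emails, the chosen words, and the helper names `word_proportions` and `plot_word_proportions` are all assumptions made for the example, and a word is counted by simple substring match on the lowercased text.

```python
def word_proportions(texts, labels, words):
    """Return {word: (ham_proportion, spam_proportion)} for each word.

    A word counts once per email if it appears anywhere in the lowercased text
    (substring match). Assumes both classes contain at least one email.
    """
    result = {}
    for word in words:
        hits = {0: 0, 1: 0}    # emails containing the word, per class
        totals = {0: 0, 1: 0}  # emails per class
        for text, label in zip(texts, labels):
            totals[label] += 1
            if word in text.lower():
                hits[label] += 1
        result[word] = (hits[0] / totals[0], hits[1] / totals[1])
    return result


def plot_word_proportions(proportions):
    """Draw a grouped bar chart of the proportions (requires matplotlib)."""
    import matplotlib.pyplot as plt

    words = list(proportions)
    positions = list(range(len(words)))
    width = 0.4
    plt.bar([p - width / 2 for p in positions],
            [proportions[w][0] for w in words], width, label="ham")
    plt.bar([p + width / 2 for p in positions],
            [proportions[w][1] for w in words], width, label="spam")
    plt.xticks(positions, words)
    plt.ylabel("Proportion of emails")
    plt.legend()
    plt.show()


# Hypothetical toy data standing in for the training emails (1 = spam, 0 = ham).
toy_texts = [
    "Free offer, click now",
    "Meeting notes attached",
    "Click here for a free prize",
    "Lunch tomorrow?",
]
toy_labels = [1, 0, 1, 0]

proportions = word_proportions(toy_texts, toy_labels, ["free", "click", "meeting"])
print(proportions)
```

Calling `plot_word_proportions(proportions)` in a notebook cell would draw the chart. With the real data, the texts and labels would come from the training split only.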
0.0.3 Question 3b
Create a class conditional density plot like the one above (using sns.distplot), comparing the distribution of
the length of spam emails to the distribution of the length of ham emails in the training set. Set the x-axis
limit from 0 to 50000.
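A rough sketch of this plot follows. Note that it swaps in `sns.kdeplot` for `sns.distplot`, since `distplot` is deprecated in recent seaborn releases; the toy emails and the helper names `lengths_by_class` and `plot_length_densities` are assumptions made for illustration only.

```python
def lengths_by_class(texts, labels):
    """Group email lengths (in characters) by label: 0 = ham, 1 = spam."""
    lengths = {0: [], 1: []}
    for text, label in zip(texts, labels):
        lengths[label].append(len(text))
    return lengths


def plot_length_densities(lengths, x_max=50000):
    """Overlay one density curve per class (requires seaborn and matplotlib)."""
    import matplotlib.pyplot as plt
    import seaborn as sns

    sns.kdeplot(lengths[0], label="ham")
    sns.kdeplot(lengths[1], label="spam")
    plt.xlim(0, x_max)  # x-axis limit requested in the question
    plt.xlabel("Length of email (characters)")
    plt.ylabel("Density")
    plt.legend()
    plt.show()


# Hypothetical toy data standing in for the training emails.
toy_texts = ["short ham", "a much longer spam message " * 3, "ham again", "spam!"]
toy_labels = [0, 1, 0, 1]

lengths = lengths_by_class(toy_texts, toy_labels)
print(lengths)
```

In a notebook, `plot_length_densities(lengths)` would draw the two curves on shared axes so the class-conditional shapes can be compared directly.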
0.0.4 Question 6c
Provide brief explanations of the results from 6a and 6b. Why do we observe each of these values (FP, FN,
accuracy, recall)?
Nothing is labeled as spam, which is why there are no false positives. Every spam email is labeled as ham,
so the number of false negatives equals the number of spam emails in the training set. The accuracy equals
the proportion of ham emails in the training set, because those are the only emails classified correctly.
Lastly, the recall is 0 because the classifier identifies no spam at all.
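The reasoning above can be checked on a small example. The sketch below uses made-up labels (not the assignment's data) and a hypothetical helper `confusion_counts` to show what a classifier that always predicts 0 produces.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 = spam and 0 = ham."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


# Hypothetical labels: 2 spam emails out of 8.
y_true = [0, 0, 0, 1, 0, 1, 0, 0]
y_pred = [0] * len(y_true)  # the zero predictor labels everything as ham

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn)

print(fp, fn, accuracy, recall)  # 0 2 0.75 0.0
```

Here the false negatives equal the number of spam emails, the accuracy equals the share of ham (6 of 8), and the recall is 0.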
0.0.5 Question 6e
Are there more false positives or false negatives when using the logistic regression classifier from Question
5?
There are more false negatives.
0.0.6 Question 6f
1. Our logistic regression classifier got 75.6% prediction accuracy (number of correct predictions / total).
How does this compare with predicting 0 for every email?
2. Given the word features we gave you above, name one reason this classifier is performing poorly.
Hint: Think about how prevalent these words are in the email set.
3. Which of these two classifiers would you prefer for a spam filter and why? Describe your reasoning
and relate it to at least one of the evaluation metrics you have computed so far.
0.0.7 Question 7: Feature/Model Selection Process
In the following cell, describe the process of improving your model. You should use at least 2-3 sentences
each to address the following questions:
To find a better model, I found the 5000 most common words in the spam emails and in the ham emails,
then took the differences between the two and sorted by the largest difference. After that, I used the top
103 words to train my model. I also tried comparing the length of the email header and the length of the
email text for spam and ham emails, but the distributions were similar in shape, both roughly normal. I was
surprised that the word "the" would be a viable feature, because it seems like a word that would appear in
both spam and ham emails.
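The word-selection step described above might look roughly like the following. This is only one possible reading: it assumes "most common" means the fraction of emails containing a word and "difference" means the absolute gap between the two classes. The helper names and the toy emails are hypothetical.

```python
from collections import Counter


def document_frequencies(texts, vocab_size):
    """Fraction of emails containing each of the `vocab_size` most common words."""
    counts = Counter()
    for text in texts:
        counts.update(set(text.lower().split()))  # count each word once per email
    return {word: n / len(texts) for word, n in counts.most_common(vocab_size)}


def select_words(spam_texts, ham_texts, vocab_size=5000, top_k=103):
    """Rank words by the gap between their spam and ham frequencies."""
    spam_freq = document_frequencies(spam_texts, vocab_size)
    ham_freq = document_frequencies(ham_texts, vocab_size)
    vocabulary = set(spam_freq) | set(ham_freq)
    gap = {w: abs(spam_freq.get(w, 0.0) - ham_freq.get(w, 0.0)) for w in vocabulary}
    # Largest gap first; ties are broken alphabetically so the result is stable.
    return sorted(vocabulary, key=lambda w: (-gap[w], w))[:top_k]


# Hypothetical toy emails, not the assignment's data.
spam = ["win money now", "win a prize now"]
ham = ["see you at the meeting", "the report is ready now"]

top_words = select_words(spam, ham, top_k=2)
print(top_words)  # ['the', 'win']
```

In this toy data, a very common word such as "the" ranks highly simply because it appears in every ham email and in no spam email, which is the kind of effect that can make a seemingly neutral word a usable feature.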
Generate your visualization in the cell below and provide your description in a comment.
accuracy: 0.9386396912019167
0.0.8 Question 9: ROC Curve
In most cases we cannot eliminate both false positives and false negatives, so we have to compromise.
For example, in the case of cancer screenings, false negatives are comparatively worse than false positives:
a false negative means that a patient might not discover a disease until it’s too late to treat, while a false
positive means that a patient will probably have to take another screening.
Recall that logistic regression calculates the probability that an example belongs to a certain class. Then,
to classify an example we say that an email is spam if our classifier gives it ≥ 0.5 probability of being spam.
However, we can adjust that cutoff: we can say that an email is spam only if our classifier gives it ≥ 0.7
probability of being spam, for example. This is how we can trade off false positives and false negatives.
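To make the effect of the cutoff concrete, here is a small sketch with made-up predicted probabilities (they do not come from the assignment's classifier). The helpers `classify` and `false_positives_and_negatives` are hypothetical names.

```python
def classify(probabilities, cutoff):
    """Label an email as spam (1) when its predicted probability is >= cutoff."""
    return [1 if p >= cutoff else 0 for p in probabilities]


def false_positives_and_negatives(y_true, y_pred):
    """Return (false positives, false negatives) for labels where 1 = spam."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return fp, fn


# Hypothetical true labels and predicted spam probabilities.
y_true = [0, 0, 1, 1, 0, 1]
probs = [0.1, 0.6, 0.65, 0.9, 0.3, 0.4]

at_half = false_positives_and_negatives(y_true, classify(probs, 0.5))
at_seven_tenths = false_positives_and_negatives(y_true, classify(probs, 0.7))

print(at_half)          # (1, 1)
print(at_seven_tenths)  # (0, 2)
```

Raising the cutoff from 0.5 to 0.7 removes the one false positive in this example but adds a false negative.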
The ROC curve shows this trade-off for each possible cutoff probability. In the cell below, plot an ROC
curve for your final classifier (the one you use to make predictions for Kaggle). Refer to the Lecture 20
notebook to see how to plot an ROC curve.
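As a rough illustration of what an ROC curve contains, the sketch below computes one (false positive rate, true positive rate) point per cutoff by hand. The labels and probabilities are made up, and `roc_points` and `plot_roc` are hypothetical helpers; this is not the Lecture 20 notebook's code. scikit-learn's `sklearn.metrics.roc_curve` computes comparable quantities.

```python
def roc_points(y_true, probabilities):
    """Return (false_positive_rate, true_positive_rate) pairs, one per cutoff.

    Assumes y_true contains at least one positive (1) and one negative (0).
    """
    positives = sum(y_true)
    negatives = len(y_true) - positives
    # Sweep cutoffs from above the highest score down to the lowest score.
    cutoffs = [float("inf")] + sorted(set(probabilities), reverse=True)
    points = []
    for cutoff in cutoffs:
        tp = sum(1 for t, p in zip(y_true, probabilities) if p >= cutoff and t == 1)
        fp = sum(1 for t, p in zip(y_true, probabilities) if p >= cutoff and t == 0)
        points.append((fp / negatives, tp / positives))
    return points


def plot_roc(points):
    """Draw the curve (requires matplotlib)."""
    import matplotlib.pyplot as plt

    plt.plot([fpr for fpr, _ in points], [tpr for _, tpr in points])
    plt.xlabel("False positive rate")
    plt.ylabel("True positive rate")
    plt.show()


# Hypothetical true labels and predicted spam probabilities.
y_true = [0, 0, 1, 1, 0, 1]
probs = [0.1, 0.6, 0.65, 0.9, 0.3, 0.4]

points = roc_points(y_true, probs)
print(points)
```

The first point is (0, 0), where nothing is labeled spam, and the last is (1, 1), where everything is; `plot_roc(points)` connects the points in between.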