Logistic Regression


Notebook

August 6, 2019

0.0.1 Question 1c
Discuss one thing you notice that is different between the two emails that might relate to the identification
of spam.
It seems that a lot of the spam emails contain HTML markup, while the ham emails are mostly plain text.
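
A quick way to check this observation (a hypothetical snippet; it assumes the assignment's training DataFrame, here called train, with an 'email' text column and a 0/1 'spam' column) is to compare how often 'html' appears in each class:

# Fraction of emails in each class whose body mentions 'html'
# (train with 'email'/'spam' columns is an assumption from later cells)
print(train.groupby('spam')['email'].apply(lambda e: e.str.contains('html').mean()))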

0.0.2 Question 3a
Create a bar chart like the one above comparing the proportion of spam and ham emails containing certain
words. Choose a set of words that are different from the ones above, but also have different proportions for
the two classes. Make sure to only consider emails from train.

In [202]: train = train.reset_index(drop=True) # We must do this in order to preserve the ordering of emails

other_words = ['body', 'business', 'html', 'money', 'offer', 'please']


training_set = words_in_texts(other_words, train['email'])
data = pd.DataFrame(data = training_set, columns = other_words)
data['label'] = train['spam']
sns.barplot(x = "variable", y = "value", hue = "label", data = (data.replace({'label': {0 : 'H
plt.title("Frequency of Words in Spam/Ham Emails")
plt.xlabel('Words')
plt.ylabel('Proportion of Emails')
plt.ylim([0, 1])
plt.legend()

Out[202]: <matplotlib.legend.Legend at 0x7f10bf47ac18>
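
The cell above relies on words_in_texts, which is defined earlier in the assignment. For reference, here is a minimal sketch of what such a function might look like; this is an assumption, not necessarily the exact implementation used in the notebook:

import numpy as np

def words_in_texts(words, texts):
    # Return a 2D 0/1 array where entry (i, j) indicates whether
    # words[j] appears as a substring of texts[i] (texts is a pandas Series).
    return np.array([texts.str.contains(word) for word in words]).astype(int).T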

0.0.3 Question 3b
Create a class conditional density plot like the one above (using sns.distplot), comparing the distribution of
the length of spam emails to the distribution of the length of ham emails in the training set. Set the x-axis
limit from 0 to 50000.

In [203]: train2 = train.copy()


train2['length'] = train2['email'].str.len()
sns.distplot(train2.loc[train2['spam'] == 0, 'length'],hist=False, label='Ham')
sns.distplot(train2.loc[train2['spam'] == 1, 'length'],hist=False, label='Spam')
plt.xlabel('Length of email body')
plt.ylabel('Distribution')
plt.xlim((0,50000))

Out[203]: (0, 50000)
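
Note that sns.distplot was deprecated in seaborn 0.11. In newer versions, the same hist=False density comparison can be drawn with sns.kdeplot (a sketch assuming the train2 frame above):

sns.kdeplot(train2.loc[train2['spam'] == 0, 'length'], label='Ham')
sns.kdeplot(train2.loc[train2['spam'] == 1, 'length'], label='Spam')
plt.legend()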

0.0.4 Question 6c
Provide brief explanations of the results from 6a and 6b. Why do we observe each of these values (FP, FN,
accuracy, recall)?
Nothing is labeled as spam, which is why there are no false positives. All the spam emails are
labeled as ham, hence the false negatives being equal in number to the training set's spam
emails. The accuracy looks reasonable only because ham emails make up most of the training set,
and recall is 0 because the classifier never catches a single spam email.
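
These values follow directly from the definitions. A minimal sketch of the computation for an all-ham predictor, assuming 0/1 labels in Y_train as elsewhere in the notebook:

import numpy as np

y_hat = np.zeros(len(Y_train))                    # predict ham (0) for every email
fp = np.sum((y_hat == 1) & (Y_train == 0))        # 0: spam is never predicted
fn = np.sum((y_hat == 0) & (Y_train == 1))        # equals the number of spam emails
accuracy = np.mean(y_hat == Y_train)              # the proportion of ham emails
recall = np.sum((y_hat == 1) & (Y_train == 1)) / np.sum(Y_train == 1)  # 0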

0.0.5 Question 6e
Are there more false positives or false negatives when using the logistic regression classifier from Question
5?
There are more false negatives than false positives.
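
The two counts can be checked directly from the Question 5 classifier's predictions (a sketch assuming the fitted model and the X_train/Y_train arrays from earlier cells):

import numpy as np

y_pred = model.predict(X_train)
print('false positives:', np.sum((y_pred == 1) & (Y_train == 0)))
print('false negatives:', np.sum((y_pred == 0) & (Y_train == 1)))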

0.0.6 Question 6f
1. Our logistic regression classifier got 75.6% prediction accuracy (number of correct predictions / total).
How does this compare with predicting 0 for every email?
2. Given the word features we gave you above, name one reason this classifier is performing poorly.
Hint: Think about how prevalent these words are in the email set.
3. Which of these two classifiers would you prefer for a spam filter and why? Describe your reasoning
and relate it to at least one of the evaluation metrics you have computed so far.

1. Predicting 0 for every email is slightly worse than the logistic regression classifier, but
not by much (a quick check is sketched after this list).

2. There are a ton of zeros in the rows of X_train because the given words rarely appear in the
emails, so the features carry very little signal.
3. I would prefer to look through all my emails myself (i.e., the classifier that predicts ham
for everything), because marking an important email as spam could be devastating, and the
chance of that happening would worry me even at a false-alarm rate of 2%.
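
The baseline mentioned in part 1 is just the fraction of ham in the training labels; a one-line check (assuming Y_train as above):

print('always-ham accuracy:', np.mean(Y_train == 0))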

0.0.7 Question 7: Feature/Model Selection Process
In the following cell, describe the process of improving your model. You should use at least 2-3
sentences each to address the following questions:

1. How did you find better features for your model?


2. What did you try that worked / didn’t work?
3. What was surprising in your search for good features?

To find a better model, I found the 5000 most common words in each of the spam and the ham
emails, took the differences of their counts, and sorted by the highest difference. I then used
the top 103 words to train my model. I tried comparing the length of the email header and the
length of the email body for spam and ham emails; however, the distributions seemed pretty
similar in shape, both roughly normal, so those features did not help. I was surprised that the
word "the" would be a viable feature, because it seems like a word that would appear in both
spam and ham emails.
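
A minimal sketch of the word-frequency search described above; the names and the whitespace tokenization are assumptions, and the original notebook's exact procedure may differ:

from collections import Counter

spam_counts = Counter(' '.join(train.loc[train['spam'] == 1, 'email']).split())
ham_counts = Counter(' '.join(train.loc[train['spam'] == 0, 'email']).split())

# Keep the 5000 most common words in each class, then rank words by how much
# more often they appear in spam than in ham and keep the top 103 as features
top_spam = dict(spam_counts.most_common(5000))
top_ham = dict(ham_counts.most_common(5000))
diff = {w: top_spam.get(w, 0) - top_ham.get(w, 0) for w in set(top_spam) | set(top_ham)}
features = sorted(diff, key=diff.get, reverse=True)[:103]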

Generate your visualization in the cell below and provide your description in a comment.

In [245]: X_2d = np.dot(X_train, vt[:2, :].T)


plt.figure(figsize=(9, 6))
plt.title("PC2 vs. PC1 for Emails Words")
plt.xlabel("Email Words PC1")
plt.ylabel("Email Words PC2")
sns.scatterplot(X_2d[:, 0], X_2d[:, 1], hue = Y_train);
model = LogisticRegression()
model.fit(X_train, Y_train)
print('accuracy: ', model.score(X_train, Y_train))
# Here is a graph representing the 2 words that have the highest correlation between the spam
# and ham labels. It implies that my method of finding the difference between the most common
# words of the spam and ham emails and choosing the words with the highest values does a
# sufficient job at classifying emails.

accuracy: 0.9386396912019167
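
The vt used above is not defined in this excerpt; it is presumably the matrix of right singular vectors from an SVD of the centered design matrix computed in an earlier cell, roughly:

# Assumed origin of vt: SVD of the centered training matrix
X_centered = X_train - np.mean(X_train, axis=0)
u, s, vt = np.linalg.svd(X_centered, full_matrices=False)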

0.0.8 Question 9: ROC Curve
In most cases we won’t be able to avoid all false positives and all false negatives, so we have to compromise.
For example, in the case of cancer screenings, false negatives are comparatively worse than false positives
— a false negative means that a patient might not discover a disease until it’s too late to treat, while a false
positive means that a patient will probably have to take another screening.
Recall that logistic regression calculates the probability that an example belongs to a certain class. Then,
to classify an example we say that an email is spam if our classifier gives it ≥ 0.5 probability of being spam.
However, we can adjust that cutoff: we can say that an email is spam only if our classifier gives it ≥ 0.7
probability of being spam, for example. This is how we can trade off false positives and false negatives.
The ROC curve shows this trade-off for each possible cutoff probability. In the cell below, plot an ROC
curve for your final classifier (the one you use to make predictions for Kaggle). Refer to the Lecture 20
notebook to see how to plot an ROC curve.

In [218]: y_hat = model.predict_proba(X_train)[:, 1]


y_hat

Out[218]: array([0.01556992, 0.00403072, 0.04029417, ..., 0.03553556, 0.00025957,
          0.00161681])
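
The cell above only computes the predicted probabilities; the ROC plot itself can be drawn from them with scikit-learn's roc_curve (a sketch assuming Y_train and the matplotlib setup from earlier cells):

from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(Y_train, y_hat)
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate (Recall)')
plt.title('ROC Curve for the Final Classifier');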

