Week 6a
Mahmmoud Mahdi
Precision, Recall, and F measure
The 2-by-2 confusion matrix
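The standard layout, with the usual cell names:

                     gold positive           gold negative
  system positive    true positive (tp)      false positive (fp)
  system negative    false negative (fn)     true negative (tn)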
Evaluation: Accuracy
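In terms of those cells, accuracy is the fraction of all decisions that are correct:

  Accuracy = (tp + tn) / (tp + fp + fn + tn)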
Why don't we use accuracy as our metric?
Imagine we saw 1 million tweets
○ 100 of them talked about Delicious Pie Co.
○ 999,900 talked about something else
We could build a dumb classifier that just labels every tweet
"not about pie"
○ It would get 99.99% accuracy!!! Wow!!!!
○ But useless! Doesn't return the tweets we are looking for!
○ That's why we use precision and recall instead
Evaluation: Precision
% of items the system detected (i.e., items the system labeled
as positive) that are in fact positive (according to the human
gold labels)
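Equivalently, in terms of the confusion matrix:

  Precision = tp / (tp + fp)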
Evaluation: Recall
% of items actually present in the input that were correctly
identified by the system.
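Equivalently:

  Recall = tp / (tp + fn)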
Why Precision and Recall?
Our dumb pie-classifier
○ Just label nothing as "about pie"
Accuracy = 99.99%, but Recall = 0
○ (it doesn't get any of the 100 pie tweets)
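A minimal sketch of this arithmetic in Python (the counts come from the example above; treating the undefined precision as 0 is a choice made here, and the F1 line is the standard harmonic mean of precision and recall):

  # The dumb "not about pie" classifier on 1,000,000 tweets:
  # it never labels anything positive, so tp = fp = 0.
  tp, fp = 0, 0
  fn = 100        # the 100 real pie tweets it misses
  tn = 999_900    # everything else, correctly ignored

  accuracy = (tp + tn) / (tp + fp + fn + tn)
  precision = tp / (tp + fp) if (tp + fp) else 0.0  # no positive predictions at all
  recall = tp / (tp + fn)
  f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

  print(f"accuracy  = {accuracy:.4%}")   # 99.9900%
  print(f"precision = {precision}")      # 0.0 (undefined, treated as 0 here)
  print(f"recall    = {recall}")         # 0.0 -- none of the 100 pie tweets found
  print(f"f1        = {f1}")             # 0.0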
Given:
○ Classifiers A and B
○ Metric M: M(A,x) is the performance of A on testset x
○ 𝛿(x): the performance difference between A, B on x:
■ 𝛿(x) = M(A,x) – M(B,x)
○ We want to know if 𝛿(x)>0, meaning A is better than B
○ 𝛿(x) is called the effect size
○ Suppose we look and see that 𝛿(x) is positive. Are we done?
○ No! This might be just an accident of this one test set, or some
circumstance of the experiment.
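A toy sketch of the effect size (everything here is made up for illustration: the scores, the metric, and the system names):

  # Effect size: delta(x) = M(A, x) - M(B, x) on one test set x.
  def effect_size(metric, system_a, system_b, test_set):
      return metric(system_a, test_set) - metric(system_b, test_set)

  # Toy stand-ins: pretend these are F1 scores measured on test set x.
  scores = {"A": 0.74, "B": 0.71}
  metric = lambda system, test_set: scores[system]

  delta_x = effect_size(metric, "A", "B", test_set="x")
  print(delta_x > 0)  # True here -- but that alone doesn't prove A is better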
Statistical Hypothesis Testing
Consider two hypotheses:
• H0 (the null hypothesis): A isn't better than B
• H1: A is better than B
We want to rule out H0
We create a random variable X ranging over test sets
And ask: how likely is it, if H0 is true, that among these test
sets we would see the 𝛿(x) we did see?
• Formalized as the p-value:
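In the usual formalization, the p-value is the probability, if H0 were true, of seeing an effect size at least as big as the one we observed:

  p-value(x) = P( 𝛿(X) ≥ 𝛿(x) | H0 is true )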
Statistical Hypothesis Testing