08 Classifier Evaluation


Model Evaluation

Instructor: Saravanan Thirumuruganathan


Model Evaluation
• Metrics for Performance Evaluation
• How to evaluate the performance of a model?
• Methods for Performance Evaluation
• How to obtain reliable estimates?
• Methods for Model Comparison
• How to compare the relative performance among competing models?
• Debugging ML Models
• Error Analysis
Metrics for Performance Evaluation

• Training objective / cost function is a proxy for real world objective

• Metrics help capture a business goal into a quantitative target


• Why can’t you use metrics as cost function?

• Focus on the predictive capability of a model


• Rather than how long it takes to classify or to build the model, scalability, etc.

CS 229 slides from Yining Chen


Metrics for Performance Evaluation
• Helps organize ML team effort towards that target.
• Generally in the form of improving that metric on the dev set.

• Useful to quantify the “gap” between:


• Desired performance and baseline (estimate effort initially).
• Desired performance and current performance.
• Measure progress over time.

• Useful for lower level tasks and debugging (e.g. diagnosing bias vs
variance).

CS 229 slides from Yining Chen


Metrics for Performance Evaluation
• Confusion Matrix

• Point metrics:
• Accuracy
• Precision, Recall, F-score
• Sensitivity, Specificity

• Summary metrics: AU-ROC, AU-PRC


Confusion Matrix

                       PREDICTED CLASS
                       Class=Yes    Class=No
ACTUAL    Class=Yes    a (TP)       b (FN)
CLASS     Class=No     c (FP)       d (TN)

a: TP (true positive)
b: FN (false negative)
c: FP (false positive)
d: TN (true negative)
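
A minimal sketch of building this matrix with scikit-learn; the label vectors below are made-up toy values, not from the lecture.

from sklearn.metrics import confusion_matrix

y_true = [1, 1, 0, 1, 0, 0, 1, 0]   # actual class (1 = Yes, 0 = No)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # predicted class

# Rows are actual classes, columns are predicted classes.
# With labels=[1, 0] the layout matches the slide: [[TP, FN], [FP, TN]].
cm = confusion_matrix(y_true, y_pred, labels=[1, 0])
tp, fn, fp, tn = cm.ravel()
print(cm)
print("TP:", tp, "FN:", fn, "FP:", fp, "TN:", tn)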
Metrics for Performance Evaluation
• ML model for classifying passengers as COVID positive or negative.
• True Positive (TP): A passenger who is classified as COVID positive and
is actually positive.
• False Negative (FN): A passenger who is classified as not COVID
positive (negative) and is actually COVID positive.
• True Negative (TN): A passenger who is classified as not COVID
positive (negative) and is actually not COVID positive (negative).
• False Positive (FP): A passenger who is classified as COVID positive
and is actually not COVID positive (negative).
Type I and Type II Errors
Performance Evaluation Primitives
TPR = TP / P = TP / (TP + FN) = 1 − FNR

FPR = FP / N = FP / (FP + TN) = 1 − TNR

FNR = FN / P = FN / (TP + FN) = 1 − TPR

TNR = TN / N = TN / (TN + FP) = 1 − FPR
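
A small helper that computes the four rates from raw counts; the counts used in the example call are arbitrary illustrative values.

def rates(tp, fn, fp, tn):
    p, n = tp + fn, fp + tn          # total actual positives / negatives
    return {
        "TPR": tp / p,               # = 1 - FNR
        "FNR": fn / p,               # = 1 - TPR
        "FPR": fp / n,               # = 1 - TNR
        "TNR": tn / n,               # = 1 - FPR
    }

print(rates(tp=80, fn=20, fp=30, tn=70))
# {'TPR': 0.8, 'FNR': 0.2, 'FPR': 0.3, 'TNR': 0.7}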
Accuracy
                       PREDICTED CLASS
                       Class=Yes    Class=No
ACTUAL    Class=Yes    a (TP)       b (FN)
CLASS     Class=No     c (FP)       d (TN)

Most widely used metric:

Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)
Limitation of Accuracy Metric
• Accuracy is a baseline metric and almost always incomplete
• It only makes sense when your dataset and classifier satisfy many conditions

• When should you not use Accuracy?


1. Imbalanced Datasets
   Problem: In cases where one class vastly outnumbers the others (i.e., class imbalance), accuracy can give misleading results. For example, in a dataset where 95% of the instances belong to one class, a model that always predicts the majority class will have 95% accuracy, even though it may be failing to correctly identify the minority class.
   Impact: High accuracy in such scenarios does not reflect the poor performance on the minority class, which may be the more important class in the context (e.g., detecting rare diseases or fraud).

2. Does Not Differentiate Between Types of Errors
   Problem: Accuracy treats all errors equally and does not distinguish between false positives and false negatives.
   Impact: In applications where the cost of one type of error is higher than the other (e.g., in medical diagnosis, where false negatives could be life-threatening), accuracy is insufficient to evaluate the real-world performance of the model.

3. Insensitive to Class Distribution
   Problem: Accuracy does not account for the distribution of classes in the dataset. It focuses only on the overall number of correct predictions.
   Impact: In scenarios where precision or recall is more important for specific classes (e.g., identifying a rare event), accuracy fails to provide insights into how well the model is performing for those important cases.

4. Cannot Handle Multi-Class Problems Well
   Problem: Accuracy becomes less informative in multi-class classification problems, where there are more than two classes, especially if some classes are very underrepresented.
   Impact: While accuracy may still provide a general sense of performance, it does not reveal which classes are being misclassified, making it less useful in diagnosing model issues.

5. Lacks Information on Prediction Confidence
   Problem: Accuracy does not take into account the confidence of the model's predictions. A model that gives highly confident but incorrect predictions can have the same accuracy as one that gives uncertain but correct predictions.
   Impact: In real-world scenarios where decisions are based on the certainty of predictions (e.g., in risk assessment), accuracy alone does not provide sufficient information.
Limitations of Accuracy Metric
• Consider a 2-class problem
• Number of Class 0 examples = 9990
• Number of Class 1 examples = 10

• If model predicts everything to be class 0


Accuracy is 9990/10000 = 99.9 %
• Accuracy is misleading as model does not detect any class 1 example

• Advice: Do not use accuracy for imbalanced datasets
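
A sketch of the 9990-vs-10 example: a model that always predicts class 0 reaches 99.9% accuracy while never finding a single class 1 example.

import numpy as np
from sklearn.metrics import accuracy_score, recall_score

y_true = np.array([0] * 9990 + [1] * 10)
y_pred = np.zeros_like(y_true)          # model always predicts class 0

print(accuracy_score(y_true, y_pred))   # 0.999
print(recall_score(y_true, y_pred))     # 0.0 -- no class 1 example detected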


Limitations of Accuracy Metric
• Gives equal weight to both true positives and true negatives

Diagnosing a patient who does not have cancer as having cancer


Vs
Diagnosing a patient who has cancer as not having cancer

Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)
Limitations of Accuracy Metric
Q: Which model will you ship in a recruiting agency?

• Job resume filtering machine learning model 1:


• 10% error rate on whole dataset
• 40% error rate on female profiles

• Job resume filtering machine learning model 2:


• 18% error rate on whole dataset
• 25% error rate on female profiles
Possible Solution: Cost Matrix

                       PREDICTED CLASS
C(i|j)                 Class=Yes     Class=No
ACTUAL    Class=Yes    C(Yes|Yes)    C(No|Yes)
CLASS     Class=No     C(Yes|No)     C(No|No)

• C(i|j): Cost of misclassifying a class j example as class i


Computing Cost of Classification

Cost Matrix            PREDICTED CLASS
C(i|j)                 +       -
ACTUAL    +            -1      100
CLASS     -             1        0

Model M1               PREDICTED CLASS
                       +       -
ACTUAL    +            150      40
CLASS     -             60     250

Accuracy = 80%
Cost = 3910

Model M2               PREDICTED CLASS
                       +       -
ACTUAL    +            250      45
CLASS     -              5     200

Accuracy = 90%
Cost = 4255
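
A short check of these numbers, multiplying each confusion-matrix cell by the corresponding cost (rows = actual class, columns = predicted class, ordered [+, -]):

import numpy as np

cost = np.array([[-1, 100],   # actual +: C(+|+), C(-|+)
                 [  1,   0]]) # actual -: C(+|-), C(-|-)

m1 = np.array([[150,  40],
               [ 60, 250]])
m2 = np.array([[250,  45],
               [  5, 200]])

for name, cm in [("M1", m1), ("M2", m2)]:
    accuracy = np.trace(cm) / cm.sum()
    total_cost = (cm * cost).sum()
    print(name, accuracy, total_cost)   # M1: 0.8, 3910   M2: 0.9, 4255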
Cost vs Accuracy

Accuracy is proportional to cost if:
1. C(Yes|No) = C(No|Yes) = q
2. C(Yes|Yes) = C(No|No) = p

Count                  PREDICTED CLASS
                       Class=Yes    Class=No
ACTUAL    Class=Yes    a            b
CLASS     Class=No     c            d

N = a + b + c + d
Accuracy = (a + d) / N

Cost                   PREDICTED CLASS
                       Class=Yes    Class=No
ACTUAL    Class=Yes    p            q
CLASS     Class=No     q            p

Cost = p(a + d) + q(b + c)
     = p(a + d) + q(N − a − d)
     = qN − (q − p)(a + d)
     = N [q − (q − p) × Accuracy]
Weighted Accuracy

Weighted Accuracy (WA) = (w1·TP + w4·TN) / (w1·TP + w2·FN + w3·FP + w4·TN)

• General purpose metric


• Interpretation is a bit tricky
Cost-Sensitive Measures
Precision (p) = TP / (TP + FP)

Recall (r) = TP / (TP + FN)
Cost-Sensitive Measures

Harmonic mean of precision and recall:

F-score (F) = 2pr / (p + r) = 2TP / (2TP + FN + FP)
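
Precision, recall, and F1 via scikit-learn; the label vectors are toy values for illustration.

from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

p = precision_score(y_true, y_pred)   # TP / (TP + FP)
r = recall_score(y_true, y_pred)      # TP / (TP + FN)
f = f1_score(y_true, y_pred)          # 2pr / (p + r)
print(p, r, f)                        # 0.75 0.75 0.75 for these toy labels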
Cost-Sensitive Measures
• Precision is biased towards C(Yes|Yes) & C(Yes|No)
• Recall is biased towards C(Yes|Yes) & C(No|Yes)

• F-measure is biased towards all except C(No|No)


Sensitivity and Specificity
• Used more often in biomedical fields

Sensitivity = TPR = Recall = TP / P = TP / (TP + FN)

Specificity = TNR = TN / N = TN / (TN + FP)
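
As a small sketch (toy labels): sensitivity is recall of the positive class, and specificity can be obtained as recall of the negative class by setting pos_label=0.

from sklearn.metrics import recall_score

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]

sensitivity = recall_score(y_true, y_pred, pos_label=1)  # TP / (TP + FN)
specificity = recall_score(y_true, y_pred, pos_label=0)  # TN / (TN + FP)
print(sensitivity, specificity)   # 0.666..., 0.8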
Matthew's Correlation Coefficient
• A specific case of a linear correlation coefficient (Pearson r) for a
  binary classification setting
• Useful in imbalanced class settings
• MCC is bounded between +1 (perfect agreement between ground truth and
  predicted outcome) and −1 (perfect inverse or negative correlation); a
  value of 0 denotes a random prediction.

MCC = (TP·TN − FP·FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))
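
MCC is available directly in scikit-learn; the labels below are the same toy values as before.

from sklearn.metrics import matthews_corrcoef

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]
print(matthews_corrcoef(y_true, y_pred))  # ~0.47 for these toy labels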
Confusion Matrix for Multi-Class Settings
• Confusion matrices are traditionally defined for binary class problems,
but they can readily be generalized to multi-class settings

Sebastian Raschka/STAT 451/Lecture 12


Balanced vs Average Per-Class (APC) Accuracy

Sebastian Raschka/STAT 451/Lecture 12


Balanced vs Average Per-Class (APC) Accuracy

Sebastian Raschka/STAT 451/Lecture 12
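
The slide figures are not reproduced in this text version. As a rough sketch under common definitions (balanced accuracy as macro-averaged recall, average per-class accuracy as the mean of one-vs-rest accuracies), with made-up 3-class labels:

import numpy as np
from sklearn.metrics import balanced_accuracy_score

y_true = np.array([0, 0, 0, 0, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 0, 1, 0, 1, 0, 2, 2, 1, 2])

# Balanced accuracy = mean of per-class recalls (macro-averaged recall).
print(balanced_accuracy_score(y_true, y_pred))   # ~0.667

# Average per-class accuracy: one-vs-rest accuracy per class, then the mean.
apc = np.mean([np.mean((y_true == c) == (y_pred == c)) for c in np.unique(y_true)])
print(apc)                                       # 0.8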


Metrics for Multi-Class Classification
• We can define formulas for the other metrics, like precision and recall,
analogously

• Precision for class A = TP_ClassA / (TP_ClassA + FP_ClassA)

• Recall for class A = TP_ClassA / (TP_ClassA + FN_ClassA)
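
A hedged sketch with scikit-learn, using made-up labels for three classes; average=None returns one value per class.

from sklearn.metrics import precision_score, recall_score

y_true = ["A", "A", "B", "B", "B", "C", "C", "A"]
y_pred = ["A", "B", "B", "B", "C", "C", "C", "A"]

# Per-class values, in the order given by labels=.
print(precision_score(y_true, y_pred, labels=["A", "B", "C"], average=None))
print(recall_score(y_true, y_pred, labels=["A", "B", "C"], average=None))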
Macro-Average of Precision and Recall

https://www.evidentlyai.com/classification-metrics/multi-class-metrics
Micro-Average of Precision and Recall

https://www.evidentlyai.com/classification-metrics/multi-class-metrics
Micro- vs. Macro-Averaging
• A macro-average will compute the metric independently for each class and
then take the average
• Gives equal weight to each class
• Macro-averaging treats each class equally. It can be useful when all
classes are equally important and you want to know how well the classifier
performs on average across them. It is also useful when you have an
imbalanced dataset and want to ensure each class contributes equally to the
final evaluation.

• A micro-average will aggregate the contributions of all classes to
compute the average metric
• Gives equal weight to each instance

• Prefer micro-average when there is class imbalance
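
As a small illustration (toy labels, not from the lecture), scikit-learn exposes both behaviours through the average argument:

from sklearn.metrics import precision_score

y_true = ["A", "A", "B", "B", "B", "C", "C", "A"]
y_pred = ["A", "B", "B", "B", "C", "C", "C", "A"]

print(precision_score(y_true, y_pred, average="macro"))  # mean of per-class precisions
print(precision_score(y_true, y_pred, average="micro"))  # pools all TP/FP across classes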
Micro- vs. Macro-Averaging

Dan Jurafsky, Speech and Language Processing, Chapter 4 slides


Micro- vs. Macro-Averaging

Dan Jurafsky, Speech and Language Processing, Chapter 4 slides


Micro vs. Macro Averaging
• Class A: 1 TP and 1 FP
• Class B: 10 TP and 90 FP
• Class C: 1 TP and 1 FP
• Class D: 1 TP and 1 FP

• Pr(A) = Pr(C) = Pr(D) = 0.5 and Pr(B) = 0.1

• Macro average = (0.5 × 3 + 0.1) / 4 = 0.4
• Micro average = (1 + 10 + 1 + 1) / (2 + 100 + 2 + 2) = 13/106 ≈ 0.123
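
Checking this example directly from the TP/FP counts with NumPy (macro averages the per-class precisions, micro pools the counts first):

import numpy as np

tp = np.array([1, 10, 1, 1])    # classes A, B, C, D
fp = np.array([1, 90, 1, 1])

per_class_precision = tp / (tp + fp)
print(per_class_precision)                 # [0.5, 0.1, 0.5, 0.5]
print(per_class_precision.mean())          # macro average = 0.4
print(tp.sum() / (tp.sum() + fp.sum()))    # micro average = 13/106 ~= 0.123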
Performance Trade-offs
• So far, we assumed that the classifier outputs the class labels

• But most classifiers output a richer value


• Logistic regression outputs a probability score
• Other classifiers output possibly uncalibrated scores

• What can we do to achieve trade-offs?


Performance Trade-offs via Thresholds

• Logistic Regression
• Classify as class 1 if output >= 0.5
• Classify as class 0 if output < 0.5

• What happens if we change the threshold?
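
A minimal sketch of thresholding a probability score; the scores here are invented (e.g., as returned by LogisticRegression.predict_proba).

import numpy as np

scores = np.array([0.1, 0.4, 0.55, 0.8, 0.95])

for threshold in (0.5, 0.9):
    labels = (scores >= threshold).astype(int)
    print(threshold, labels)
# 0.5 -> [0 0 1 1 1]
# 0.9 -> [0 0 0 0 1]   (higher threshold: fewer positive predictions)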


Impact of Changing Thresholds
• Suppose we are training a classifier to detect cancer

• Does using a threshold of 0.5 make sense?


Impact of Changing Thresholds
• Suppose we are training a classifier to detect cancer

• Suppose we want to classify y = 1 (disease) only if we are very confident

• Solution: Increase the threshold to say 0.9


• Higher precision
• Lower recall
• Why?
Impact of Changing Thresholds
• Suppose we are training a classifier to detect cancer

• Suppose we want to avoid missing too many cases of the disease (avoid
false negatives)

• Solution: Keep the threshold low


• Higher recall
• Lower precision
• Why?
Threshold Scanning

CS229 Yining Chen
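
The scanning plot itself is not reproduced here; as a sketch of the idea, scikit-learn's precision_recall_curve sweeps the threshold over the scores and reports precision/recall at each step (toy labels and scores below).

from sklearn.metrics import precision_recall_curve

y_true = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.1, 0.35, 0.4, 0.55, 0.6, 0.8, 0.85, 0.9]

precision, recall, thresholds = precision_recall_curve(y_true, scores)
# zip stops at the shorter thresholds array, dropping the final (p=1, r=0) endpoint.
for p, r, t in zip(precision, recall, thresholds):
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")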
