08 Classifier Evaluation


Model Evaluation

Instructor: Saravanan Thirumuruganathan


Model Evaluation
• Metrics for Performance Evaluation
• How to evaluate the performance of a model?
• Methods for Performance Evaluation
• How to obtain reliable estimates?
• Methods for Model Comparison
• How to compare the relative performance among competing models?
• Debugging ML Models
• Error Analysis
Metrics for Performance Evaluation

• Training objective / cost function is a proxy for real world objective

• Metrics help capture a business goal into a quantitative target


• Why can’t you use metrics as cost function?

• Focus on the predictive capability of a model


• Rather than how long it takes to classify or to build the model, scalability, etc.

CS 229 slides from Yining Chen


Metrics for Performance Evaluation
• Helps organize ML team effort towards that target.
• Generally in the form of improving that metric on the dev set.

• Useful to quantify the “gap” between:


• Desired performance and baseline (estimate effort initially).
• Desired performance and current performance.
• Measure progress over time.

• Useful for lower level tasks and debugging (e.g. diagnosing bias vs
variance).

CS 229 slides from Yining Chen


Metrics for Performance Evaluation
• Confusion Matrix

• Point metrics:
• Accuracy
• Precision, Recall, F-score
• Sensitivity, Specificity

• Summary metrics: AU-ROC, AU-PRC


Confusion Matrix

                       PREDICTED CLASS
                       Class=Yes    Class=No
ACTUAL    Class=Yes    a (TP)       b (FN)
CLASS     Class=No     c (FP)       d (TN)

a: TP (true positive)
b: FN (false negative)
c: FP (false positive)
d: TN (true negative)
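
A minimal sketch of building this matrix with scikit-learn; the label vectors below are made-up toy values, not from the lecture.

from sklearn.metrics import confusion_matrix

y_true = [1, 1, 0, 1, 0, 0, 1, 0]   # actual class (1 = Yes, 0 = No)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # predicted class

# Rows are actual classes, columns are predicted classes.
# With labels=[1, 0] the layout matches the slide: [[TP, FN], [FP, TN]].
cm = confusion_matrix(y_true, y_pred, labels=[1, 0])
tp, fn, fp, tn = cm.ravel()
print(cm)
print("TP:", tp, "FN:", fn, "FP:", fp, "TN:", tn)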
Metrics for Performance Evaluation
• ML model for classifying passengers as COVID positive or negative.
• True Positive (TP): A passenger who is classified as COVID positive and
is actually positive.
• False Negative (FN): A passenger who is classified as not COVID
positive (negative) and is actually COVID positive.
• True Negative (TN): A passenger who is classified as not COVID
positive (negative) and is actually not COVID positive (negative).
• False Positive (FP): A passenger who is classified as COVID positive
and is actually not COVID positive (negative).
Type I and Type II Errors
Performance Evaluation Primitives
TPR = TP / P = TP / (TP + FN) = 1 − FNR

FPR = FP / N = FP / (FP + TN) = 1 − TNR

FNR = FN / P = FN / (TP + FN) = 1 − TPR

TNR = TN / N = TN / (TN + FP) = 1 − FPR
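
A small helper that computes the four rates from raw counts; the counts used in the example call are arbitrary illustrative values.

def rates(tp, fn, fp, tn):
    p, n = tp + fn, fp + tn          # total actual positives / negatives
    return {
        "TPR": tp / p,               # = 1 - FNR
        "FNR": fn / p,               # = 1 - TPR
        "FPR": fp / n,               # = 1 - TNR
        "TNR": tn / n,               # = 1 - FPR
    }

print(rates(tp=80, fn=20, fp=30, tn=70))
# {'TPR': 0.8, 'FNR': 0.2, 'FPR': 0.3, 'TNR': 0.7}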
Accuracy
                       PREDICTED CLASS
                       Class=Yes    Class=No
ACTUAL    Class=Yes    a (TP)       b (FN)
CLASS     Class=No     c (FP)       d (TN)

Most widely used metric:

Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)
Limitation of Accuracy Metric
• Accuracy is a baseline metric and almost always incomplete
• It only makes sense when your dataset and classifier satisfy many conditions

• When should you not use Accuracy?


1. Imbalanced Datasets
   Problem: In cases where one class vastly outnumbers the others (i.e., class imbalance), accuracy can give misleading results. For example, in a dataset where 95% of the instances belong to one class, a model that always predicts the majority class will have 95% accuracy, even though it may be failing to correctly identify the minority class.
   Impact: High accuracy in such scenarios does not reflect the poor performance on the minority class, which may be the more important class in the context (e.g., detecting rare diseases or fraud).

2. Does Not Differentiate Between Types of Errors
   Problem: Accuracy treats all errors equally and does not distinguish between false positives and false negatives.
   Impact: In applications where the cost of one type of error is higher than the other (e.g., in medical diagnosis, where false negatives could be life-threatening), accuracy is insufficient to evaluate the real-world performance of the model.

3. Insensitive to Class Distribution
   Problem: Accuracy does not account for the distribution of classes in the dataset. It focuses only on the overall number of correct predictions.
   Impact: In scenarios where precision or recall is more important for specific classes (e.g., identifying a rare event), accuracy fails to provide insights into how well the model is performing for those important cases.

4. Cannot Handle Multi-Class Problems Well
   Problem: Accuracy becomes less informative in multi-class classification problems, where there are more than two classes, especially if some classes are very underrepresented.
   Impact: While accuracy may still provide a general sense of performance, it does not reveal which classes are being misclassified, making it less useful in diagnosing model issues.

5. Lacks Information on Prediction Confidence
   Problem: Accuracy does not take into account the confidence of the model's predictions. A model that gives highly confident but incorrect predictions can have the same accuracy as one that gives uncertain but correct predictions.
   Impact: In real-world scenarios where decisions are based on the certainty of predictions (e.g., in risk assessment), accuracy alone does not provide sufficient information.
Limitations of Accuracy Metric
• Consider a 2-class problem
• Number of Class 0 examples = 9990
• Number of Class 1 examples = 10

• If model predicts everything to be class 0


Accuracy is 9990/10000 = 99.9 %
• Accuracy is misleading as model does not detect any class 1 example

• Advice: Do not use accuracy for imbalanced datasets
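
A sketch of the 9990-vs-10 example: a model that always predicts class 0 reaches 99.9% accuracy while never finding a single class 1 example.

import numpy as np
from sklearn.metrics import accuracy_score, recall_score

y_true = np.array([0] * 9990 + [1] * 10)
y_pred = np.zeros_like(y_true)          # model always predicts class 0

print(accuracy_score(y_true, y_pred))   # 0.999
print(recall_score(y_true, y_pred))     # 0.0 -- no class 1 example detected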


Limitations of Accuracy Metric
• Gives equal weight to both true positives and true negatives

Diagnosing a patient who does not have cancer as having cancer


Vs
Diagnosing a patient who has cancer as not having cancer

Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)
Limitations of Accuracy Metric
Q: Which model will you ship in a recruiting agency?

• Job resume filtering machine learning model 1:


• 10% error rate on whole dataset
• 40% error rate on female profiles

• Job resume filtering machine learning model 2:


• 18% error rate on whole dataset
• 25% error rate on female profiles
Possible Solution: Cost Matrix

                       PREDICTED CLASS
C(i|j)                 Class=Yes     Class=No
ACTUAL    Class=Yes    C(Yes|Yes)    C(No|Yes)
CLASS     Class=No     C(Yes|No)     C(No|No)

• C(i|j): Cost of misclassifying a class j example as class i


Computing Cost of Classification

Cost Matrix            PREDICTED CLASS
C(i|j)                 +       -
ACTUAL    +            -1      100
CLASS     -             1        0

Model M1               PREDICTED CLASS
                       +       -
ACTUAL    +            150      40
CLASS     -             60     250

Accuracy = 80%
Cost = 3910

Model M2               PREDICTED CLASS
                       +       -
ACTUAL    +            250      45
CLASS     -              5     200

Accuracy = 90%
Cost = 4255
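
A short check of these numbers, multiplying each confusion-matrix cell by the corresponding cost (rows = actual class, columns = predicted class, ordered [+, -]):

import numpy as np

cost = np.array([[-1, 100],   # actual +: C(+|+), C(-|+)
                 [  1,   0]]) # actual -: C(+|-), C(-|-)

m1 = np.array([[150,  40],
               [ 60, 250]])
m2 = np.array([[250,  45],
               [  5, 200]])

for name, cm in [("M1", m1), ("M2", m2)]:
    accuracy = np.trace(cm) / cm.sum()
    total_cost = (cm * cost).sum()
    print(name, accuracy, total_cost)   # M1: 0.8, 3910   M2: 0.9, 4255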
Cost vs Accuracy

Accuracy is proportional to cost if:
1. C(Yes|No) = C(No|Yes) = q
2. C(Yes|Yes) = C(No|No) = p

Count                  PREDICTED CLASS
                       Class=Yes    Class=No
ACTUAL    Class=Yes    a            b
CLASS     Class=No     c            d

N = a + b + c + d
Accuracy = (a + d) / N

Cost                   PREDICTED CLASS
                       Class=Yes    Class=No
ACTUAL    Class=Yes    p            q
CLASS     Class=No     q            p

Cost = p(a + d) + q(b + c)
     = p(a + d) + q(N − a − d)
     = qN − (q − p)(a + d)
     = N [q − (q − p) × Accuracy]
Weighted Accuracy

Weighted Accuracy (WA) = (w1·TP + w4·TN) / (w1·TP + w2·FN + w3·FP + w4·TN)

• General purpose metric


• Interpretation is a bit tricky
Cost-Sensitive Measures
Precision (p) = TP / (TP + FP)

Recall (r) = TP / (TP + FN)
Cost-Sensitive Measures

Harmonic mean of precision and recall:

F-score (F) = 2pr / (p + r) = 2TP / (2TP + FN + FP)
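
Precision, recall, and F1 via scikit-learn; the label vectors are toy values for illustration.

from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

p = precision_score(y_true, y_pred)   # TP / (TP + FP)
r = recall_score(y_true, y_pred)      # TP / (TP + FN)
f = f1_score(y_true, y_pred)          # 2pr / (p + r)
print(p, r, f)                        # 0.75 0.75 0.75 for these toy labels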
Cost-Sensitive Measures
• Precision is biased towards C(Yes|Yes) & C(Yes|No)
• Recall is biased towards C(Yes|Yes) & C(No|Yes)

• F-measure is biased towards all except C(No|No)


Sensitivity and Specificity
• Used more often in biomedical fields

Sensitivity = TPR = Recall = TP / P = TP / (TP + FN)

Specificity = TNR = TN / N = TN / (TN + FP)
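
As a small sketch (toy labels): sensitivity is recall of the positive class, and specificity can be obtained as recall of the negative class by setting pos_label=0.

from sklearn.metrics import recall_score

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]

sensitivity = recall_score(y_true, y_pred, pos_label=1)  # TP / (TP + FN)
specificity = recall_score(y_true, y_pred, pos_label=0)  # TN / (TN + FP)
print(sensitivity, specificity)   # 0.666..., 0.8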
Matthew's Correlation Coefficient
• A specific case of a linear correlation coefficient (Pearson r) for a
  binary classification setting
• Useful in imbalanced class settings
• MCC is bounded between +1 (perfect agreement between ground truth and
  predicted outcome) and −1 (perfect inverse or negative correlation); a
  value of 0 denotes a random prediction.

MCC = (TP·TN − FP·FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))
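
MCC is available directly in scikit-learn; the labels below are the same toy values as before.

from sklearn.metrics import matthews_corrcoef

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]
print(matthews_corrcoef(y_true, y_pred))  # ~0.47 for these toy labels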
Confusion Matrix for Multi-Class Settings
• Confusion matrices are traditionally defined for binary class problems,
but they can readily be generalized to multi-class settings

Sebastian Raschka/STAT 451/Lecture 12


Balanced vs Average Per-Class (APC) Accuracy

Sebastian Raschka/STAT 451/Lecture 12


Balanced vs Average Per-Class (APC) Accuracy

Sebastian Raschka/STAT 451/Lecture 12
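
The slide figures are not reproduced in this text version. As a rough sketch under common definitions (balanced accuracy as macro-averaged recall, average per-class accuracy as the mean of one-vs-rest accuracies), with made-up 3-class labels:

import numpy as np
from sklearn.metrics import balanced_accuracy_score

y_true = np.array([0, 0, 0, 0, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 0, 1, 0, 1, 0, 2, 2, 1, 2])

# Balanced accuracy = mean of per-class recalls (macro-averaged recall).
print(balanced_accuracy_score(y_true, y_pred))   # ~0.667

# Average per-class accuracy: one-vs-rest accuracy per class, then the mean.
apc = np.mean([np.mean((y_true == c) == (y_pred == c)) for c in np.unique(y_true)])
print(apc)                                       # 0.8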


Metrics for Multi-Class Classification
• We can define formulas for the other metrics, like precision and recall,
analogously

• Precision for class A = TP_ClassA / (TP_ClassA + FP_ClassA)

• Recall for class A = TP_ClassA / (TP_ClassA + FN_ClassA)
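
A hedged sketch with scikit-learn, using made-up labels for three classes; average=None returns one value per class.

from sklearn.metrics import precision_score, recall_score

y_true = ["A", "A", "B", "B", "B", "C", "C", "A"]
y_pred = ["A", "B", "B", "B", "C", "C", "C", "A"]

# Per-class values, in the order given by labels=.
print(precision_score(y_true, y_pred, labels=["A", "B", "C"], average=None))
print(recall_score(y_true, y_pred, labels=["A", "B", "C"], average=None))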
Macro-Average of Precision and Recall

https://www.evidentlyai.com/classification-metrics/multi-class-metrics
Micro-Average of Precision and Recall

https://www.evidentlyai.com/classification-metrics/multi-class-metrics
Micro- vs. Macro-Averaging
• A macro-average will compute the metric independently for each class and
then take the average
• Gives equal weight to each class
• Macro-averaging treats each class equally. It can be useful when all
classes are equally important and you want to know how well the classifier
performs on average across them. It is also useful when you have an
imbalanced dataset and want to ensure each class contributes equally to the
final evaluation.

• A micro-average will aggregate the contributions of all classes to
compute the average metric
• Gives equal weight to each instance

• Prefer micro-average when there is class imbalance
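
As a small illustration (toy labels, not from the lecture), scikit-learn exposes both behaviours through the average argument:

from sklearn.metrics import precision_score

y_true = ["A", "A", "B", "B", "B", "C", "C", "A"]
y_pred = ["A", "B", "B", "B", "C", "C", "C", "A"]

print(precision_score(y_true, y_pred, average="macro"))  # mean of per-class precisions
print(precision_score(y_true, y_pred, average="micro"))  # pools all TP/FP across classes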
Micro- vs. Macro-Averaging

Dan Jurafsky, Speech and Language Processing, Chapter 4 slides


Micro- vs. Macro-Averaging

Dan Jurafsky, Speech and Language Processing, Chapter 4 slides


Micro vs. Macro Averaging
• Class A: 1 TP and 1 FP
• Class B: 10 TP and 90 FP
• Class C: 1 TP and 1 FP
• Class D: 1 TP and 1 FP

• Pr(A) = Pr(C) = Pr(D) = 0.5 and Pr(B) = 0.1

• Macro average = (0.5 × 3 + 0.1) / 4 = 0.4
• Micro average = (1 + 10 + 1 + 1) / (2 + 100 + 2 + 2) = 13/106 ≈ 0.123
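
Checking this example directly from the TP/FP counts with NumPy (macro averages the per-class precisions, micro pools the counts first):

import numpy as np

tp = np.array([1, 10, 1, 1])    # classes A, B, C, D
fp = np.array([1, 90, 1, 1])

per_class_precision = tp / (tp + fp)
print(per_class_precision)                 # [0.5, 0.1, 0.5, 0.5]
print(per_class_precision.mean())          # macro average = 0.4
print(tp.sum() / (tp.sum() + fp.sum()))    # micro average = 13/106 ~= 0.123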
Performance Trade-offs
• So far, we assumed that the classifier outputs the class labels

• But most classifiers output a richer value


• Logistic regression outputs a probability score
• Other classifiers output possibly uncalibrated scores

• What can we do to achieve trade-offs?


Performance Trade-offs via Thresholds

• Logistic Regression
• Classify as class 1 if output >= 0.5
• Classify as class 0 if output < 0.5

• What happens if we change the threshold?
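
A minimal sketch of thresholding a probability score; the scores here are invented (e.g., as returned by LogisticRegression.predict_proba).

import numpy as np

scores = np.array([0.1, 0.4, 0.55, 0.8, 0.95])

for threshold in (0.5, 0.9):
    labels = (scores >= threshold).astype(int)
    print(threshold, labels)
# 0.5 -> [0 0 1 1 1]
# 0.9 -> [0 0 0 0 1]   (higher threshold: fewer positive predictions)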


Impact of Changing Thresholds
• Suppose we are training a classifier to detect cancer

• Does using a threshold of 0.5 make sense?


Impact of Changing Thresholds
• Suppose we are training a classifier to detect cancer

• Suppose we want to classify y = 1 (disease) only if we are very confident

• Solution: Increase the threshold to say 0.9


• Higher precision
• Lower recall
• Why?
Impact of Changing Thresholds
• Suppose we are training a classifier to detect cancer

• Suppose we want to avoid missing too many cases of the disease (avoid
false negatives)

• Solution: Keep the threshold low


• Higher recall
• Lower precision
• Why?
Threshold Scanning

CS229 Yining Chen
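
The scanning plot itself is not reproduced here; as a sketch of the idea, scikit-learn's precision_recall_curve sweeps the threshold over the scores and reports precision/recall at each step (toy labels and scores below).

from sklearn.metrics import precision_recall_curve

y_true = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.1, 0.35, 0.4, 0.55, 0.6, 0.8, 0.85, 0.9]

precision, recall, thresholds = precision_recall_curve(y_true, scores)
# zip stops at the shorter thresholds array, dropping the final (p=1, r=0) endpoint.
for p, r, t in zip(precision, recall, thresholds):
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")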
