Beyond Binary Classification


Beyond binary classification

The slides are closely adapted from Subhransu Maji’s slides


Learning with imbalanced data
• One class might be rare (e.g., face detection)
• Mistakes on the rare class cost more:
‣ cost of misclassifying y=+1 is α (> 1)
‣ cost of misclassifying y=-1 is 1
• Why? Because we want a better F-score (or average precision)

binary classification ⟹ α-weighted binary classification

Suppose we have an algorithm to train a binary classifier; can
we use it to train the α-weighted version?
Training by sub-sampling
• Input: an α-weighted dataset D^w.  Output: a sample drawn as if from a binary dataset D^b

• While true
‣ Sample an example (x, y) from D^w
‣ Sample u uniformly from [0, 1]
‣ If y = +1 or u < 1/α
➡ return (x, y)

the sub-sampling algorithm: we have sub-sampled the negatives by a factor of α (each negative is kept only with probability 1/α)

Claim
binary classification on the sub-sampled data ⟹ α-weighted binary classification on the original data
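A minimal Python sketch of this rejection-sampling reduction (the names subsample_draw, weighted_data, and alpha are illustrative, not from the slides): positives are always returned, negatives only with probability 1/α, and the resulting draws can be handed to any off-the-shelf binary learner.

```python
import random

def subsample_draw(weighted_data, alpha):
    """Return one (x, y) draw as if from the sub-sampled binary dataset.

    weighted_data: list of (x, y) pairs with y in {+1, -1}.
    alpha: cost of misclassifying a positive (alpha > 1).
    """
    while True:
        x, y = random.choice(weighted_data)
        # Keep every positive; keep a negative only with probability 1/alpha.
        if y == +1 or random.random() < 1.0 / alpha:
            return x, y

def make_binary_dataset(weighted_data, alpha, n_samples):
    """Collect draws to train an ordinary (unweighted) binary classifier on."""
    return [subsample_draw(weighted_data, alpha) for _ in range(n_samples)]
```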
Modifying training
• To train the α-weighted classifier, simply:
‣ Subsample negatives and train a binary classifier.
‣ Alternatively, supersample positives and train a binary classifier.
‣ Which one is better?
• For some learners we don’t need to keep copies of the positives
‣ Decision tree
➡ Modify accuracy to the weighted version
‣ kNN classifier
➡ Take weighted votes during prediction (see the sketch after this list)
‣ Perceptron?

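For the kNN bullet above, a minimal sketch of weighted voting at prediction time. It assumes numpy, Euclidean distance, and an illustrative value of alpha; none of these specifics come from the slides.

```python
import numpy as np

def weighted_knn_predict(X_train, y_train, x, k=5, alpha=10.0):
    """k-NN prediction where each positive neighbor's vote counts alpha times.

    X_train: (n, d) array of features, y_train: array of labels in {+1, -1}.
    """
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distance to every training point
    neighbors = y_train[np.argsort(dists)[:k]]    # labels of the k nearest neighbors
    # Positives vote with weight alpha, negatives with weight 1.
    vote = np.sum(np.where(neighbors == +1, alpha, -1.0))
    return +1 if vote > 0 else -1
```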
Multi-class classification
• Labels come from a set of K different classes.
• Some classifiers are inherently multi-class —
‣ kNN classifiers: vote among the K labels, pick the one with the highest vote
(break ties arbitrarily)
‣ Decision trees: use multi-class histograms to determine the best feature to
split on. At the leaves, predict the most frequent label.
• Question: can we take a binary classifier and turn it into multi-class?

One-vs-all (OVA) classifier
• Train K classifiers, each to distinguish one class from the rest
• Prediction: pick the class with the highest score: ŷ = argmax_k f_k(x)

(f_k is the score function of the k-th classifier)

• Example
‣ Perceptron: f_k(x) = w_k · x (the score of the k-th perceptron)
➡ May have to calibrate the weights (e.g., fix the norm to 1) since we are
comparing the scores of classifiers
➡ In practice, doing this right is tricky when there are a large number of classes

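A small sketch of the OVA recipe, assuming a hypothetical train_binary(X, labels) helper that returns a real-valued scoring function (e.g., a perceptron returning w_k · x); the function names are illustrative.

```python
import numpy as np

def train_ova(X, y, num_classes, train_binary):
    """Train K scorers; scorer k separates class k (+1) from the rest (-1)."""
    return [train_binary(X, np.where(y == k, +1, -1)) for k in range(num_classes)]

def predict_ova(scorers, x):
    # Pick the class whose scorer gives the highest score; this comparison is
    # only meaningful if the scores are calibrated (e.g., weights normalized).
    return int(np.argmax([f(x) for f in scorers]))
```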
One-vs-one (OVO) classifier
• Train K(K-1)/2 classifiers, each to distinguish one class from another
• Each classifier votes for the winning class in a pair
• The class with most votes wins

• Example
‣ Perceptron: the classifier for the pair (i, j) votes for class i if w_ij · x > 0 and for class j otherwise

➡ Calibration is not an issue since we are taking the sign of the score

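The OVO counterpart, again with a hypothetical train_binary helper; only the sign of each pairwise score is used, so calibration is not needed.

```python
from itertools import combinations
import numpy as np

def train_ovo(X, y, num_classes, train_binary):
    """Train K(K-1)/2 classifiers, one for each pair of classes."""
    models = {}
    for i, j in combinations(range(num_classes), 2):
        mask = (y == i) | (y == j)
        labels = np.where(y[mask] == i, +1, -1)   # +1 means "class i wins"
        models[(i, j)] = train_binary(X[mask], labels)
    return models

def predict_ovo(models, x, num_classes):
    votes = np.zeros(num_classes)
    for (i, j), f in models.items():
        votes[i if f(x) > 0 else j] += 1          # each pairwise classifier casts one vote
    return int(np.argmax(votes))                  # ties broken arbitrarily
```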
Classification tree
• Needs only O(log K) classifier evaluations at inference time

[Figure: a binary tree of SVMs. The root SVM decides between {1, 3} and {2, 4}; each child SVM then separates its own pair, giving the leaves 3, 1, 2, 4.]
Directed acyclic graph (DAG) classifier
• DAG SVM [Platt et al., NIPS 2000]
‣ Faster testing: O(K) instead of O(K(K-1)/2)
‣ Has some theoretical guarantees

Figure from Platt et al.


Ranking

Ranking
• Input: query (e.g. “cats”)
• Output: a sorted list of items

• How should we measure performance?


• The loss function is trickier than in the binary classification case
‣ Example 1: All items in the first page should be relevant
‣ Example 2: All relevant items should be ahead of irrelevant items

Learning to rank
• For simplicity, let's assume we are learning to rank for a given query.
• Learning to rank:
‣ Input: a list of items
‣ Output: a function that takes a set of items and returns a sorted list

• Approaches
‣ Pointwise approach:
➡ Assumes that each document has a numerical score.
➡ Learn a model to predict the score (e.g. linear regression).
‣ Pairwise approach:
➡ Ranking is approximated by a classification problem.
➡ Learn a binary classifier that can tell which item is better given a pair.

Naive rank train
• Create a dataset D with binary labels (x_ij is a feature vector comparing items i and j)
‣ Initialize: D ← {}
‣ For every i and j such that i ≠ j
➡ If item i is more relevant than item j
• Add a positive point: (x_ij, +1)
➡ If item i is less relevant than item j
• Add a negative point: (x_ij, -1)

• Learn a binary classifier f on D

• Ranking
‣ Initialize: score_i ← 0 for every item i
‣ For every i and j such that i ≠ j
➡ Calculate prediction: ŷ ← f(x_ij)
➡ Update scores: score_i ← score_i + ŷ, score_j ← score_j − ŷ
‣ Return the items sorted by decreasing score
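A sketch of naive rank train in Python; the helpers make_pair_features and train_binary are placeholders for whatever comparison features and binary learner are used, not names from the slides.

```python
import numpy as np

def naive_rank_train(items, relevance, make_pair_features, train_binary):
    """Build the pairwise dataset D and train a binary classifier on it."""
    X, Y = [], []
    for i in range(len(items)):
        for j in range(len(items)):
            if i == j or relevance[i] == relevance[j]:
                continue
            X.append(make_pair_features(items[i], items[j]))
            Y.append(+1 if relevance[i] > relevance[j] else -1)
    return train_binary(np.array(X), np.array(Y))

def naive_rank_predict(items, make_pair_features, classifier):
    """Score each item by summing the pairwise predictions, then sort."""
    scores = np.zeros(len(items))
    for i in range(len(items)):
        for j in range(len(items)):
            if i == j:
                continue
            y_hat = classifier(make_pair_features(items[i], items[j]))  # +1 or -1
            scores[i] += y_hat
            scores[j] -= y_hat
    return list(np.argsort(-scores))  # indices of the items, best first
```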
Problems with naive ranking
• Naive rank train works well for bipartite ranking problems
‣ Where the goal is to predict whether an item is relevant or not. There is no
notion of an item being more relevant than another.
• A better strategy is to account for the positions of the items in the list
• Denote a ranking by σ, where σᵤ is the position of item u
‣ If item u appears before item v, we have: σᵤ < σᵥ
• Let the space of all permutations of M objects be: Σ_M
• A ranking function maps M items to a permutation: f : X → Σ_M
• A cost function ω (omega)
‣ ω(i, j) is the cost of placing an item at position j when it should be at position i
• Ranking loss: ℓ(σ, σ̂) = sum over pairs u ≠ v with σᵤ < σᵥ of [σ̂ᵥ < σ̂ᵤ] · ω(σᵤ, σᵥ), i.e., ω summed over every pair that the predicted ranking σ̂ puts in the wrong order
ω-rank loss functions
• To be a valid loss function, ω must be:
‣ Symmetric: ω(i, j) = ω(j, i)
‣ Monotonic: if i < j < k or i > j > k, then ω(i, j) ≤ ω(i, k)
‣ Satisfy the triangle inequality: ω(i, j) + ω(j, k) ≥ ω(i, k)

• Examples:
‣ Kemeny loss: ω(i, j) = 1 whenever i ≠ j (every swapped pair costs the same)

‣ Top-K loss: ω(i, j) = 1 if i ≠ j and min(i, j) ≤ K, and 0 otherwise (only mistakes involving the top K positions count)
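The two example cost functions, and the ranking loss they induce, written out as a sketch; positions are assumed to be 1-indexed, and the function names are illustrative.

```python
def kemeny_omega(i, j):
    """Every swapped pair costs 1, no matter where it occurs."""
    return 0.0 if i == j else 1.0

def top_k_omega(i, j, K=10):
    """A swapped pair costs 1 only if it involves one of the top-K positions."""
    return 1.0 if i != j and min(i, j) <= K else 0.0

def omega_rank_loss(sigma, sigma_hat, omega):
    """Sum omega over pairs that the predicted ranking puts in the wrong order."""
    loss = 0.0
    for u in range(len(sigma)):
        for v in range(len(sigma)):
            # sigma[u] < sigma[v]: u should be ranked above v; the prediction disagrees.
            if sigma[u] < sigma[v] and sigma_hat[v] < sigma_hat[u]:
                loss += omega(sigma[u], sigma[v])
    return loss
```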
ω-rank train
• Create a dataset D with binary labels and weights (x_ij is a feature vector comparing items i and j)
‣ Initialize: D ← {}
‣ For every i and j such that i ≠ j
➡ If σᵢ < σⱼ (item i is more relevant)
• Add a positive point: (x_ij, +1) with weight ω(σᵢ, σⱼ)
➡ If σᵢ > σⱼ (item j is more relevant)
• Add a negative point: (x_ij, -1) with weight ω(σᵢ, σⱼ)

• Learn a binary classifier f on D (each instance has a weight)

• Ranking
‣ Initialize: score_i ← 0 for every item i
‣ For every i and j such that i ≠ j
➡ Calculate prediction: ŷ ← f(x_ij)
➡ Update scores: score_i ← score_i + ŷ, score_j ← score_j − ŷ
‣ Return the items sorted by decreasing score
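The only change from naive rank train is that each pairwise example carries a weight ω(σᵢ, σⱼ). A short sketch of the dataset construction (the resulting triples then go to any binary learner that accepts instance weights):

```python
def omega_rank_dataset(items, sigma, make_pair_features, omega):
    """Return (x_ij, label, weight) triples; sigma[i] is item i's true position."""
    data = []
    for i in range(len(items)):
        for j in range(len(items)):
            if i == j or sigma[i] == sigma[j]:
                continue
            label = +1 if sigma[i] < sigma[j] else -1   # smaller position means more relevant
            weight = omega(sigma[i], sigma[j])          # how much this pair matters
            data.append((make_pair_features(items[i], items[j]), label, weight))
    return data
```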
Structured SVM
Predicting Syntactic Trees

Slide from Dan Roth from UIUC


Parsing
We map x and y into another feature space with a mapping function f(x, y)

Slide from Dan Roth from UIUC


Original SVM
• minimize ||w||²  subject to  y_i (w · x_i) ≥ 1 for every training example i

Structured SVM
• minimize ||w||²  subject to  w · f(x_i, y_i) − w · f(x_i, y) ≥ 1 for every example i and every wrong answer y ≠ y_i

• We want the score for the right answer to be at least 1 better than
any other possible answer.

• Note that the space of possible answers y can be huge.


‣ So the inference, finding the y that maximizes w · f(x, y), may be difficult

• Not all wrong answers are “equal”: some are closer to the desired answer, so the margin (or penalty) we demand can be made to depend on how wrong the answer is
Structured SVM

• Note that the space of possible answers y can be huge.


‣ So the inference may be difficult:
‣ What about learning?

• We know that only a subset of those constraints are support vectors


‣ So if we are given the solution w, we can find the active constraints
• What if we don’t know w? Can we search for those constraints iteratively?
How?
Structured SVM

• Algorithm:
‣ (assuming we have a reasonable number of support vectors)
‣ 1. Learn a w given some constraints
➡ can be a random sample initially
‣ 2. Apply w to data and get most violating constraints
➡ Find the output ŷ that maximizes the score w · f(x, ŷ) among the wrong answers
➡ A data point will not have a constraint if all possible solutions satisfy the margin.
‣ 3. Add them to the constraints, and go to step 1.
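A rough sketch of this constraint-generation loop, under strong assumptions: the candidate outputs are few enough to enumerate, phi(x, y) plays the role of the joint feature map f(x, y), and fit_svm_with_constraints stands in for a real QP solver returning a numpy weight vector. All three names are hypothetical.

```python
import numpy as np

def structured_svm_train(examples, candidate_outputs, phi, fit_svm_with_constraints,
                         max_rounds=20):
    """Iteratively add the most-violated margin constraints and re-fit w."""
    constraints = []                                   # (x, y_true, y_wrong) triples
    w = fit_svm_with_constraints(constraints)          # step 1: learn w from current constraints
    for _ in range(max_rounds):
        new_constraints = []
        for x, y in examples:
            # Step 2: most violating wrong answer under the current w.
            y_hat = max((y2 for y2 in candidate_outputs if y2 != y),
                        key=lambda y2: w @ phi(x, y2))
            # Keep it only if the margin of 1 is actually violated.
            if w @ phi(x, y) - w @ phi(x, y_hat) < 1:
                new_constraints.append((x, y, y_hat))
        if not new_constraints:                        # every example satisfies the margin
            break
        constraints.extend(new_constraints)            # step 3: add them and repeat
        w = fit_svm_with_constraints(constraints)
    return w
```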
Structured SVM

• What is a good mapping function for multi-class SVM?


‣ Will it solve the score calibration problem?

[Figure: the weight vector w is a concatenation of per-class weight vectors, w for class 1 through w for class n.]
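One natural mapping, sketched below: copy the input features into the block of a long vector that corresponds to class y, so that w is exactly the concatenation of per-class weight vectors shown in the figure. This is a standard construction, not taken verbatim from the slides.

```python
import numpy as np

def multiclass_phi(x, y, num_classes):
    """Joint feature map: x is copied into the block for class y, zeros elsewhere.

    With w = [w_1; ...; w_K] stacked the same way, w . phi(x, y) = w_y . x,
    so argmax_y of w . phi(x, y) is the usual one-weight-vector-per-class rule.
    """
    d = len(x)
    phi = np.zeros(num_classes * d)
    phi[y * d:(y + 1) * d] = x
    return phi

def multiclass_predict(w, x, num_classes):
    return int(np.argmax([w @ multiclass_phi(x, y, num_classes)
                          for y in range(num_classes)]))
```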
Example: Deformable part models (DPM)
Human pose estimation

Sample results

Example: Deformable part models (DPM)

Human pose estimation, face pose estimation, object detection

Felzenszwalb, Girshick, McAllester, Ramanan. "Object Detection with Discriminatively Trained Part-Based Models" TPAMI 2010
Yang & Ramanan, "Articulated Pose Estimation using Flexible Mixtures of Parts" CVPR 2011
Zhu & Ramanan, "Face Detection, Pose Estimation, and Landmark Localization in the Wild", CVPR 2012

Collective classification
• Predicting multiple correlated variables

[Figure: an input image with its features and the corresponding output labeling; the objective scores the joint assignment of labels.]
Collective classification
• Predicting multiple correlated variables

• Independent predictions can be noisy
• Use the labels of nearby vertices as features, e.g., a histogram of the labels in a 5x5 neighborhood
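A sketch of those neighborhood-label features, assuming the current predictions are stored as a 2-D integer label array; the window size and normalization are illustrative choices.

```python
import numpy as np

def label_histogram_features(label_map, i, j, num_labels, window=5):
    """Histogram of predicted labels in a window x window patch around (i, j).

    These counts can be appended to a vertex's own features so that the next
    round of classification sees what its neighbors were labeled.
    """
    r = window // 2
    patch = label_map[max(0, i - r):i + r + 1, max(0, j - r):j + r + 1]
    hist = np.bincount(patch.ravel(), minlength=num_labels).astype(float)
    return hist / hist.sum()
```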
Stacking classifiers
• Train two classifiers
• First one is trained to predict output from the input
• The second is trained on the input and the output of the first classifier

Stacking classifiers
• Train a stack of N classifiers
• The i-th classifier is trained on the input plus the outputs of the previous i-1 classifiers

• Overfitting is an issue: the classifiers are accurate on training data but not on
test data, leading to a cascade of overconfident classifiers
• Solution: held-out data


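A two-classifier sketch of stacking with held-out data. Splitting the data in half is a crude stand-in for proper held-out or cross-validated predictions, and train_classifier is a hypothetical helper that returns a callable model.

```python
import numpy as np

def train_two_stage(X, y, train_classifier):
    """Fit the first classifier on one half of the data, then fit the second on
    the other half, using the first classifier's held-out predictions as an
    extra feature (so the second never sees overconfident training-set outputs)."""
    half = len(X) // 2
    first = train_classifier(X[:half], y[:half])
    preds = np.array([first(x) for x in X[half:]])    # predictions on held-out half
    X_second = np.column_stack([X[half:], preds])     # input + first classifier's output
    second = train_classifier(X_second, y[half:])
    return first, second

def predict_two_stage(first, second, x):
    return second(np.append(x, first(x)))
```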
Summary
• Learning with imbalanced data
‣ Implicit and explicit sampling can be used to train binary classifiers for the
weighted loss case
• Beyond binary classification
‣ Multi-class classification
➡ Some classifiers are inherently multi-class

➡ Others can be built by combining binary classifiers with one-vs-one or one-vs-all methods

‣ Ranking
➡ Ranking loss functions to capture distance between permutations

➡ Pointwise and pairwise methods

‣ Structured prediction
➡ Change the SVM to predict structured outputs rather than binary ones.

‣ Collective classification
➡ Stacking classifiers trained with held-out data

Slides credit
• Slides are closely adapted from the CIML book by Hal Daumé III and Subhransu
Maji’s course.
• Images for collective classification are from the PASCAL VOC dataset
‣ http://pascallin.ecs.soton.ac.uk/challenges/VOC/
• Some of the discussion is based on Wikipedia
‣ http://en.wikipedia.org/wiki/Learning_to_rank

