Beyond Binary Classification
Sub-sampling algorithm
• We have sub-sampled the negatives by a factor of 1/α (α is the weight on the positives)
• While true
‣ Sample an example (x, y) from the weighted problem
‣ Sample u uniformly from [0, 1]
‣ If y = +1 or u < 1/α
➡ return (x, y)
• Claim: α-weighted binary classification reduces to (unweighted) binary classification via this sub-sampling
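A minimal Python sketch of this sub-sampling reduction, assuming labels in {+1, −1} and α > 1; the names `weighted_examples` and `alpha` are illustrative, not from the slides.

```python
import random

def subsample_stream(weighted_examples, alpha):
    """Sketch of the sub-sampling reduction: keep every positive example,
    keep each negative with probability 1/alpha, so that an alpha-weighted
    binary problem is turned into an ordinary (unweighted) one."""
    for x, y in weighted_examples:
        if y == +1 or random.random() < 1.0 / alpha:
            yield x, y
```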
Modifying training
• To train, simply:
‣ Subsample negatives and train a binary classifier.
‣ Alternatively, supersample positives and train a binary classifier.
‣ Which one is better?
• For some learners we don’t need to keep copies of the positives
‣ Decision tree
➡ Modify accuracy to the weighted version
‣ kNN classifier
➡ Take weighted votes during prediction
‣ Perceptron?
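As an illustration of the implicit-weighting idea above, here is a hedged sketch of kNN with weighted votes. The helper names and the weighting scheme (weight α for positives, 1 for negatives) are assumptions, not from the slides.

```python
import numpy as np

def weighted_knn_predict(X_train, y_train, weights, x, k=5):
    """Weighted-vote kNN sketch: instead of duplicating positive examples,
    each training point carries a weight (e.g. alpha for positives, 1 for
    negatives) and votes with that weight. Labels are assumed in {+1, -1}."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    # Sum the signed, weighted votes of the k nearest neighbours.
    score = sum(weights[i] * y_train[i] for i in nearest)
    return +1 if score >= 0 else -1
```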
Multi-class classification
• The label is one of K possible classes.
• Some classifiers are inherently multi-class —
‣ kNN classifiers: vote among the K labels, pick the one with the highest vote
(break ties arbitrarily)
‣ Decision trees: use multi-class histograms to determine the best feature to split on. At the leaves, predict the most frequent label.
• Question: can we take a binary classifier and turn it into multi-class?
One-vs-all (OVA) classifier
• Train K classifiers, each to distinguish one class from the rest
• Prediction: pick the class with the highest score: ŷ = argmaxₖ fₖ(x), where fₖ is the score function of the k-th classifier
• Example
‣ Perceptron:
➡ May have to calibrate the weights (e.g., fix the norm to 1) since we are comparing scores across different classifiers
➡ In practice, doing this right is tricky when there are a large number of classes
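A small sketch of the one-vs-all recipe; `train_binary` is a hypothetical routine that returns a scoring function, and the scores are assumed to be comparable (calibrated) across classifiers.

```python
import numpy as np

def ova_train(X, y, K, train_binary):
    """One-vs-all sketch: train K binary classifiers, the k-th one
    separating class k from all other classes."""
    return [train_binary(X, np.where(y == k, +1, -1)) for k in range(K)]

def ova_predict(scorers, x):
    # Pick the class whose classifier assigns the highest score.
    return int(np.argmax([f(x) for f in scorers]))
```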
One-vs-one (OVO) classifier
• Train K(K-1)/2 classifiers, each to distinguish one class from another
• Each classifier votes for the winning class in a pair
• The class with the most votes wins
• Example
‣ Perceptron:
➡ Calibration is not an issue since we are taking the sign of the score
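A sketch of one-vs-one voting; `pairwise[(i, j)]` is a hypothetical binary classifier trained to separate class i (positive) from class j (negative). Ties are broken arbitrarily.

```python
from collections import Counter
from itertools import combinations

def ovo_predict(pairwise, x, K):
    """One-vs-one sketch: each of the K(K-1)/2 pairwise classifiers casts
    one vote for the class it prefers; the most-voted class wins."""
    votes = Counter()
    for i, j in combinations(range(K), 2):
        votes[i if pairwise[(i, j)](x) > 0 else j] += 1
    return votes.most_common(1)[0][0]
```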
Classification tree
• Needs only O(log K) classifier evaluations at inference time
[Figure: a tree of SVMs. The root SVM decides between {1, 3} and {2, 4}; each child SVM then picks the final class (3 vs 1, or 2 vs 4).]
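A rough sketch of inference in such a tree of binary classifiers; the `Node` structure below is illustrative, not from the slides.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Node:
    # Internal nodes carry a binary classifier and two children; leaves carry a label.
    label: Optional[int] = None
    clf: Optional[Callable] = None
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def tree_predict(node: Node, x) -> int:
    """Walk down the tree of binary classifiers: for a balanced tree over K
    classes this evaluates only about log2(K) classifiers per query."""
    while node.clf is not None:
        node = node.left if node.clf(x) > 0 else node.right
    return node.label
```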
Directed acyclic graph (DAG) classifier
• DAG SVM [Platt et al., NIPS 2000]
‣ Faster testing: O(K) instead of O(K(K-1)/2)
‣ Has some theoretical guarantees
Ranking
• Input: query (e.g. “cats”)
• Output: a sorted list of items
Learning to rank
• For simplicity, let's assume we are learning to rank for a single, fixed query.
• Learning to rank:
‣ Input: a list of items
‣ Output: a function that takes a set of items and returns a sorted list
• Approaches
‣ Pointwise approach:
➡ Assumes that each document has a numerical score.
➡ Learn a model to predict the score (e.g. linear regression).
‣ Pairwise approach:
➡ Ranking is approximated by a classification problem.
➡ Learn a binary classifier that can tell which item is better given a pair.
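For the pointwise approach, a minimal sketch: score each item independently with a learned linear model (however it was fit, e.g. by least squares) and sort by score. The names `items`, `features`, and `w` are illustrative.

```python
import numpy as np

def pointwise_rank(items, features, w):
    """Pointwise ranking sketch: one score per item, sorted descending."""
    scores = features @ w            # linear score for each item
    order = np.argsort(-scores)      # highest score first
    return [items[i] for i in order]
```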
Naive rank train
• Create a dataset with binary labels, where the feature vector x_ij encodes the comparison of item i and item j
‣ Initialize: D ← ∅
‣ For every i and j such that i ≠ j
➡ If item i is more relevant than item j
• Add a positive point: (x_ij, +1)
➡ If item i is less relevant than item j
• Add a negative point: (x_ij, −1)
➡ Update scores: at prediction time, each pair x_ij is classified, the preferred item's score goes up and the other's goes down, and items are ranked by score
• Train a binary classifier on D (a sketch follows below)
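A sketch of naive rank training as a reduction to binary classification; `pair_features(i, j)` and `binary_train` are hypothetical helpers, and `relevance[i]` is the ground-truth relevance of item i.

```python
import numpy as np

def naive_rank_train(items, relevance, pair_features, binary_train):
    """Build one binary example per ordered pair of items with different
    relevance, then hand the dataset to any binary learner."""
    X, y = [], []
    for i in range(len(items)):
        for j in range(len(items)):
            if i == j or relevance[i] == relevance[j]:
                continue
            X.append(pair_features(i, j))
            y.append(+1 if relevance[i] > relevance[j] else -1)
    return binary_train(np.array(X), np.array(y))
```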
Problems with naive ranking
• Naive rank train works well for bipartite ranking problems
‣ Where the goal is to predict whether an item is relevant or not. There is no
notion of an item being more relevant than another.
• A better strategy is to account for the positions of the items in the list
• Denote a ranking by a permutation σ, where σᵤ is the position of item u
‣ If item u appears before item v, we have: σᵤ < σᵥ
• Let the space of all permutations of M objects be: Σ_M
• A ranking function maps M items to a permutation: f : X → Σ_M
• A cost function ω
‣ The cost of placing the item that belongs at position i at position j: ω(i, j)
• Ranking loss: ℓ(σ, σ̂) = Σ_{u ≠ v} [σᵤ < σᵥ] · [σ̂ᵥ < σ̂ᵤ] · ω(σᵤ, σᵥ)
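A direct transcription of this ranking loss into Python, assuming `sigma[u]` holds item u's true position, `sigma_hat[u]` its predicted position, and `omega(i, j)` the position-swap cost (all names are illustrative).

```python
def omega_rank_loss(sigma, sigma_hat, omega):
    """Sum omega(sigma_u, sigma_v) over all pairs that the true ranking
    orders as u before v but the predicted ranking swaps."""
    M = len(sigma)
    loss = 0.0
    for u in range(M):
        for v in range(M):
            if u != v and sigma[u] < sigma[v] and sigma_hat[v] < sigma_hat[u]:
                loss += omega(sigma[u], sigma[v])
    return loss
```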
ω-rank loss functions
• To be a valid loss function ω must be:
‣ Symmetric: ω(i, j) = ω(j, i)
‣ Monotonic: if i < j ≤ k or i > j ≥ k, then ω(i, j) ≤ ω(i, k)
‣ Satisfy the triangle inequality: ω(i, j) + ω(j, k) ≥ ω(i, k)
• Examples:
‣ Kemeny loss: ω(i, j) = 1 for all i ≠ j
‣ Top-K loss: ω(i, j) = 1 if i ≠ j and min(i, j) ≤ K, and 0 otherwise
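Hedged examples of the two ω functions above; positions are assumed to be 1-indexed here, which is an assumption of this sketch.

```python
def kemeny_omega(i, j):
    """Kemeny cost: every mis-ordered pair costs 1."""
    return 1.0 if i != j else 0.0

def make_top_k_omega(K):
    """Top-K cost: a swap only matters if it touches the top K positions."""
    def omega(i, j):
        return 1.0 if i != j and min(i, j) <= K else 0.0
    return omega
```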
ω-rank train
• Create a dataset with binary labels, where the feature vector x_ij encodes the comparison of item i and item j
‣ Initialize: D ← ∅
‣ For every i and j such that i ≠ j
➡ If σᵢ < σⱼ (item i is more relevant)
• Add a positive point: (x_ij, +1) with weight ω(σᵢ, σⱼ)
➡ If σᵢ > σⱼ (item j is more relevant)
• Add a negative point: (x_ij, −1) with weight ω(σᵢ, σⱼ)
➡ Update scores: as in the naive version, predictions on pairs update item scores at test time
• Train a weighted binary classifier on D (a sketch follows below)
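A sketch of ω-rank training, identical to the naive version except that each pair example carries the weight ω(σᵢ, σⱼ); the helper names are hypothetical.

```python
def omega_rank_train(items, sigma, pair_features, omega, weighted_binary_train):
    """Weighted pairwise reduction: mistakes on important position swaps
    (large omega) cost more to the binary learner."""
    X, y, w = [], [], []
    for i in range(len(items)):
        for j in range(len(items)):
            if i == j or sigma[i] == sigma[j]:
                continue
            X.append(pair_features(i, j))
            y.append(+1 if sigma[i] < sigma[j] else -1)
            w.append(omega(sigma[i], sigma[j]))
    return weighted_binary_train(X, y, w)
```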
Structured SVM
Predicting Syntactic Trees
• We want the score for the right answer to be at least 1 better than
any other possible answer.
• Not all wrong answers are “equal”: some are closer to the desired answer so
…
Structured SVM
• Algorithm:
‣ (assuming we have a reasonable number of support vectors)
‣ 1. Learn a w given some constraints
➡ can be a random sample initially
‣ 2. Apply w to data and get most violating constraints
➡ Find the output ŷ that maximizes the loss-augmented score, i.e., the most violated constraint
➡ A data point will not have a constraint if all possible solutions satisfy the margin.
‣ 3. Add them to the constraints, and go to step 1.
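A schematic of this constraint-generation loop; `solve_qp` and `loss_augmented_infer` stand in for the real optimizer and the loss-augmented argmax over structured outputs, and are assumptions of this sketch rather than the slides' notation.

```python
def structured_svm_train(data, solve_qp, loss_augmented_infer, n_rounds=10):
    """Constraint generation: solve the QP on the current working set,
    then add the most violated constraint for each training example."""
    constraints = []                 # working set; may start from a random sample
    w = solve_qp(constraints)
    for _ in range(n_rounds):
        new = []
        for x, y in data:
            y_hat = loss_augmented_infer(w, x, y)   # most violating output
            if y_hat is not None:                   # None: margin already satisfied
                new.append((x, y, y_hat))
        if not new:
            break
        constraints.extend(new)
        w = solve_qp(constraints)                   # back to step 1 with more constraints
    return w
```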
Structured SVM
[Figure: separate weight vectors, w for class 1 through w for class n.]
Example: Deformable part models (DPM)
Human pose estimation
Sample results
Example: Deformable part models (DPM)
Felzenszwalb, Girshick, McAllester, Ramanan. "Object Detection with Discriminatively Trained Part-Based Models" TPAMI 2010
Yang & Ramanan, "Articulated Pose Estimation using Flexible Mixtures of Parts" CVPR 2011
Zhu & Ramanan, "Face Detection, Pose Estimation, and Landmark Localization in the Wild", CVPR 2012
Collective classification
• Predicting multiple correlated variables
[Figure: input features and output labels for an image, with an objective defined jointly over all the labels.]
Collective classification
• Predicting multiple correlated variables
‣ Use the labels of nearby vertices as features, e.g., a histogram of the labels in a 5×5 neighborhood
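A sketch of that neighborhood-label feature: a histogram of the current predicted labels in a 5×5 window around each pixel. The array layout and names are assumptions; labels are assumed to lie in [0, num_classes).

```python
import numpy as np

def label_histogram_features(pred_labels, num_classes, radius=2):
    """For each pixel, count the predicted labels in its (2*radius+1)^2
    neighborhood (a 5x5 window for radius=2); edge pixels use whatever
    part of the window falls inside the image."""
    H, W = pred_labels.shape
    feats = np.zeros((H, W, num_classes))
    for r in range(H):
        for c in range(W):
            window = pred_labels[max(0, r - radius):r + radius + 1,
                                 max(0, c - radius):c + radius + 1]
            feats[r, c] = np.bincount(window.ravel(), minlength=num_classes)
    return feats
```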
Stacking classifiers
• Train two classifiers
• First one is trained to predict output from the input
• Second is trained on the input and the output of first classifier
Stacking classifiers
• Train a stack of N classifiers
• The i-th classifier is trained on the input plus the outputs of the previous i−1 classifiers
• Overfitting is an issue: the classifiers are accurate on training data but not on test data, leading to a cascade of overconfident classifiers
• Solution: held-out data
…
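A minimal sketch of stacking; in practice the predictions fed to later levels should come from held-out (e.g., cross-validated) models to avoid the overfitting cascade described above, a step this sketch skips. The base-learner helpers `train_base` and `predict_base` are hypothetical, and `X` is assumed to be a 2-D feature array.

```python
import numpy as np

def train_stack(X, y, n_levels, train_base, predict_base):
    """Each level is trained on the original features augmented with the
    previous levels' predictions; returns the list of trained models."""
    models, feats = [], X
    for _ in range(n_levels):
        model = train_base(feats, y)
        preds = predict_base(model, feats).reshape(len(X), -1)
        feats = np.hstack([feats, preds])   # append predictions as new features
        models.append(model)
    return models
```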
Summary
• Learning with imbalanced data
‣ Implicit and explicit sampling can be used to train binary classifiers for the
weighted loss case
• Beyond binary classification
‣ Multi-class classification
➡ Some classifiers are inherently multi-class
‣ Ranking
➡ Ranking loss functions to capture distance between permutations
‣ Structured prediction
➡ Change the SVM to predict structured outputs rather than binary ones
‣ Collective classification
➡ Stacking classifiers trained with held-out data
Slides credit
• Slides are closely adapted from the CIML book by Hal Daumé III and Subhransu Maji's course.
• Images for collective classification are from the PASCAL VOC dataset
‣ http://pascallin.ecs.soton.ac.uk/challenges/VOC/
• Some of the discussion is based on Wikipedia
‣ http://en.wikipedia.org/wiki/Learning_to_rank