ML.4 – Classification Techniques (Weeks 5, 6, 7)


Nhân bản – Phụng sự – Khai phóng

Chapter 4

Classification Techniques
Machine Learning
CONTENTS

• Classification Problems
• Classification Algorithms
• K-Nearest Neighbors
• Naïve Bayes Classification
• Decision Tree
• Support Vector Machines
• Metrics to measure Classification Performance

Machine Learning 2
Chapter Content
➢ Classification Problems
➢ Classification Algorithms
○ K-Nearest Neighbors
○ Naïve Bayes Classification
○ Decision Tree
○ Support Vector Machines
➢ Metrics to measure Classification Performance

Machine Learning 3
Classification Problem

➢ Classification is a supervised task that requires the use of machine learning algorithms that learn how to assign a class label to examples from the problem domain

Machine Learning 4
Classification Problem
➢ Classification requires a training dataset with many
examples of inputs and outputs from which to learn.
➢ An ML model will use the training dataset to work out how to best map examples of input data to specific class labels.
➢ The training dataset must be sufficiently representative
of the problem and have many examples of each class
label.

Machine Learning 5
Example

• Can we predict the class of the new example?

Machine Learning 6
Classification Examples

● Email Spam Detection
● Speech Recognition
● Identification of Cancer Tumor Cells
● Drug Classification
● Biometric Identification, etc.

Machine Learning 7
Binary Classification
➢ Binary classification classifies the input data into two mutually exclusive categories.
➢ The training data in such a situation is labeled in a binary format: true and false; positive and negative; 0 and 1; spam and not spam; etc., depending on the problem being tackled.

Machine Learning 8
Multi-Class Classification
➢ Multi-class classification is where each data sample is
assigned one and only one label from more than two
classes.

Machine Learning 9
Multi-Label Classification
➢ Multi-label classification refers to predicting zero or more labels for each data sample.
➢ Examples: auto-tagging in Natural Language Processing, where a given text can contain multiple topics, or in computer vision, where an image can contain multiple objects.

Machine Learning 10
Example

????

Machine Learning 11
Chapter Content
➢ Classification Problems
➢ Classification Algorithms
○ K-Nearest Neighbors
○ Naïve Bayes Classification
○ Decision Tree
○ Support Vector Machines
➢ Metrics to measure Classification Performance

Machine Learning 12
K-Nearest Neighbor (K-NN)
➔ Given training data D = {(x1, y1), …, (xN, yN)} and a test point x
➔ Prediction Rule: Look at the K most similar training examples
➔ For classification: assign the majority class label (majority voting)
➔ For regression: assign the average response
➔ The algorithm requires:
◆ Parameter K: the number of nearest neighbors to look for
◆ A distance function: to compute the similarity between examples
➔ Special Case: 1-Nearest Neighbor
Machine Learning 13
K-Nearest Neighbor (K-NN): Steps

➔ Compute the test point’s distance from each training point
➔ Sort the distances in ascending (or descending) order
➔ Use the sorted distances to select the K nearest neighbors
➔ Use majority rule (for classification) or averaging (for regression), as in the sketch below
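A minimal Python sketch of these steps, assuming Euclidean distance and majority voting (the function name and toy data are illustrative, not from the slides):

```python
# K-NN prediction: distance computation, sorting, neighbor selection, majority vote.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k=3):
    # 1. Compute the test point's distance from each training point (Euclidean)
    distances = np.linalg.norm(X_train - x_test, axis=1)
    # 2.-3. Sort the distances and select the K nearest neighbors
    nearest = np.argsort(distances)[:k]
    # 4. Majority vote over the neighbors' labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy usage example
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([4.9, 5.1]), k=3))  # -> 1
```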

Machine Learning 14
K-Nearest Neighbor (K-NN)

➔ K-NN is called a non-parametric method
➔ Unlike other supervised learning algorithms, K-Nearest Neighbors doesn’t learn an explicit mapping f from the training data
➔ It simply uses the training data at test time to make predictions

Machine Learning 15
K-Nearest Neighbor (K-NN): Computing the Distance

➔ The K-NN algorithm requires computing the distance of the test example from each of the training examples.
➔ There are several ways to compute distances.
➔ The choice depends on the type of the features in the data.
◆ Real-valued features (xi ∈ R^D): Euclidean distance is commonly used.
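For reference, the standard Euclidean distance between a test point x and a training point x′ with D real-valued features (the slide’s own equation image is not reproduced here) is

$$d(\mathbf{x}, \mathbf{x}') = \lVert \mathbf{x} - \mathbf{x}' \rVert_2 = \sqrt{\sum_{d=1}^{D} (x_d - x'_d)^2}$$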

Machine Learning 16
K-Nearest Neighbor (K-NN): Computing the Distance

➔ Some other distance measures
◆ Binary-valued features
● Use the Hamming distance
● The Hamming distance counts the number of features where the two examples disagree
◆ Mixed feature types (some real-valued and some binary-valued)?
● Can use mixed distance measures
● E.g., Euclidean for the real-valued part, Hamming for the binary part
◆ Can also assign weights to features
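The standard forms of these distances (a reconstruction; the slide’s equation images are not reproduced here) are

$$d_{\text{Hamming}}(\mathbf{x}, \mathbf{x}') = \sum_{d=1}^{D} \mathbb{1}[x_d \neq x'_d], \qquad d_{\text{weighted}}(\mathbf{x}, \mathbf{x}') = \sqrt{\sum_{d=1}^{D} w_d\,(x_d - x'_d)^2}$$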

Machine Learning 17
K-Nearest Neighbor (K-NN): Choice of K

➔ Small K
◆ Creates many small regions for each class
◆ May lead to non-smooth decision boundaries and overfitting
➔ Large K
◆ Creates fewer, larger regions
◆ Usually leads to smoother decision boundaries (caution: a too-smooth decision boundary can underfit)
➔ Choosing K
◆ Often data dependent and heuristic based
◆ Or use cross-validation (on some held-out data), as in the sketch below
◆ In general, a K that is too small or too big is bad!
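A minimal sketch of choosing K by cross-validation, assuming scikit-learn is available (the dataset and the candidate values of K are illustrative):

```python
# Pick K by cross-validated accuracy.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
scores = {}
for k in [1, 3, 5, 7, 9, 15]:                 # candidate values of K
    model = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(model, X, y, cv=5).mean()  # 5-fold CV accuracy

best_k = max(scores, key=scores.get)
print(scores, "-> best K:", best_k)
```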

Machine Learning 18
K-Nearest Neighbor (K-NN): Properties
➔ What’s nice
◆ Simple and intuitive; easily implementable
◆ Asymptotically consistent (a theoretical property)
● With infinite training data and large enough K, K-NN approaches the
best possible classifier (Bayes optimal)
➔ What’s not so nice…
◆ Store all the training data in memory even at test time
● Can be memory intensive for large training datasets
● An example of non-parametric, or memory/instance-based methods
● Different from parametric, model-based learning models
◆ Expensive at test time: O(ND) computations for each test point
● Have to search through all training data to find nearest neighbors
● Distance computations with N training points (D features each)
◆ Sensitive to noisy features

Machine Learning 19
Chapter Content
➢ Classification Problems
➢ Classification Algorithms
○ K-Nearest Neighbors
○ Naïve Bayes Classification
○ Decision Tree
○ Support Vector Machines
➢ Metrics to measure Classification Performance

Machine Learning 20
Naïve Bayes Classification: Rule

➔ Bayes Rule: P(A|B) = P(B|A) P(A) / P(B)

➔ Prior: P(A), the probability of the hypothesis before observing the evidence
➔ Posterior: P(A|B), the probability of the hypothesis after observing the evidence

“…by no means merely a curious speculation in the doctrine of chances, but necessary to be solved in order to a sure foundation for all our reasonings concerning past facts, and what is likely to be hereafter…. necessary to be considered by any that would give a clear account of the strength of analogical or inductive reasoning…”
Machine Learning 21
Naïve Bayes Classification: Rule

➔ Bayes Rule: P(A|B) = P(B|A) P(A) / P(B)

➔ A = you have the flu; B = you just coughed

➔ What is P(flu|cough) = P(A|B)?
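A worked example with assumed numbers (they are not given in the slides), just to illustrate the mechanics: suppose P(flu) = 0.05, P(cough | flu) = 0.8, and P(cough) = 0.2. Then

$$P(\text{flu}\mid\text{cough}) = \frac{P(\text{cough}\mid\text{flu})\,P(\text{flu})}{P(\text{cough})} = \frac{0.8 \times 0.05}{0.2} = 0.2$$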

Machine Learning 22
Naïve Bayes Classification: in a Nutshell

➔ Bayes Rule: P(Y | X1, …, Xn) = P(X1, …, Xn | Y) P(Y) / P(X1, …, Xn)

➔ If Xi and Xj are conditionally independent given Y, for all i ≠ j, then P(X1, …, Xn | Y) = ∏i P(Xi | Y)

➔ So, to pick the most probable Y for a new example x = (x1, …, xn), predict ŷ = argmaxy P(y) ∏i P(xi | y)
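A minimal sketch of this decision rule using scikit-learn’s GaussianNB, which assumes Gaussian class-conditional densities (one common variant; the dataset is illustrative):

```python
# Naive Bayes: pick the class y maximizing P(y) * prod_i P(x_i | y).
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
clf = GaussianNB()            # each P(x_i | y) modeled as a Gaussian
clf.fit(X, y)                 # estimates the priors P(y) and per-class means/variances
print(clf.predict(X[:3]))                 # predicted classes for the first three examples
print(clf.predict_proba(X[:3]).round(3))  # posterior probabilities P(y | x)
```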

Machine Learning 23
Chapter Content
➢ Classification Problems
➢ Classification Algorithms
○ K-Nearest Neighbors
○ Naïve Bayes Classification
○ Decision Tree
○ Support Vector Machines
➢ Metrics to measure Classification Performance

Machine Learning 24
Decision Tree
Example: learn the concept PlayTennis (i.e., decide whether our friend will play tennis or not on a given day)
Simple Training Data Set

Machine Learning 25
Decision Tree
➔ Each internal node: test one (discrete-valued) attribute Xi
➔ Each branch from a node: corresponds to one of the possible values of Xi
➔ Each leaf node: predict Y (or P(Y=1|x ∊ leaf))
➔ Example: a decision tree for f: PlayTennis?

E.g., x = (Outlook=Sunny, Temperature=Hot, Humidity=Normal, Wind=High), f(x) = Yes.
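The tree itself appears as a figure in the slides; written as nested tests, the commonly used PlayTennis tree (assumed here, since the figure is not reproduced) looks like this in Python:

```python
# The classic PlayTennis decision tree (assumed to match the slide's figure).
def play_tennis(outlook, humidity, wind):
    if outlook == "Sunny":
        return "Yes" if humidity == "Normal" else "No"   # internal node tests Humidity
    elif outlook == "Overcast":
        return "Yes"                                     # leaf: always play
    else:  # outlook == "Rain"
        return "Yes" if wind == "Weak" else "No"         # internal node tests Wind

# The example from the slide: Outlook=Sunny, Humidity=Normal (Wind is not tested on this path)
print(play_tennis("Sunny", "Normal", "High"))  # -> "Yes"
```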

Machine Learning 26
Decision Tree
➔ Suppose X = ⟨x1, …, xn⟩, where the xi are Boolean-valued variables
➔ How would you represent the following functions as decision trees?

Machine Learning 27
Decision Tree
➔ Input: labeled training examples {(x(i), y(i))} of an unknown target function f
◆ Examples are described by their values on some set of features or attributes
● E.g., 4 attributes: Humidity, Wind, Outlook, Temp
● Set of possible instances X (a.k.a. the instance space)
◆ Unknown target function f : X → Y
● e.g., Y = {0, 1} label space
● e.g., 1 if we play tennis on this day, else 0
➔ Output: hypothesis h ∈ H that (best) approximates the target function f
◆ Set of function hypotheses H = { h | h : X → Y }
● each hypothesis h is a decision tree
Machine Learning 28
Chapter Content
➢ Classification Problems
➢ Classification Algorithms
○ K-Nearest Neighbors
○ Naïve Bayes Classification
○ Decision Tree
○ Support Vector Machines
➢ Metrics to measure Classification Performance

Machine Learning 29
Support Vector Machines
➔ If the margin is large, the number of mistakes the Perceptron makes is small (independent of the dimension of the ambient space)!
➔ A large margin can help prevent overfitting.
◆ If the margin 𝛾 is large and the algorithm produces a large-margin classifier, then the amount of data needed depends only on R/𝛾 [Bartlett & Shawe-Taylor ’99].
➔ Idea: directly search for a large-margin classifier!
Support Vector Machines (SVMs)

Machine Learning 30
Support Vector Machines: Geometric Margin
➔ Definition: The margin of an example 𝑥 w.r.t. a linear separator 𝑤 is the distance from 𝑥 to the plane 𝑤 ⋅ 𝑥 = 0.
➔ Definition: The margin 𝛾𝑤 of a set of examples 𝑆 w.r.t. a linear separator 𝑤 is the smallest margin over points 𝑥 ∈ 𝑆.
➔ Definition: The margin 𝛾 of a set of examples 𝑆 is the maximum 𝛾𝑤 over all linear separators 𝑤.
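Written as formulas (a standard formalization of the three definitions above; the slides state them only in words):

$$\gamma_w(x) = \frac{|w \cdot x|}{\lVert w \rVert_2}, \qquad \gamma_w(S) = \min_{x \in S} \gamma_w(x), \qquad \gamma(S) = \max_{w} \gamma_w(S)$$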

Machine Learning 31
Support Vector Machines: Geometric Margin
➔ Directly optimize for the maximum margin separator: SVMs
➔ First, assume we know a lower bound on the margin 𝜸

Input: 𝛾, S = {(x1, 𝑦1), …, (xm, 𝑦m)}
Find: some w where:
• ||w||2 = 1
• For all i, 𝑦𝑖 𝑤 ⋅ 𝑥𝑖 ≥ 𝛾
Output: w, a separator of margin 𝛾 over S

The case where the data is truly linearly separable by margin 𝛾

Machine Learning 32
Support Vector Machines: Geometric Margin
➔ Directly optimize for the maximum margin separator: SVMs
E.g., search for the best possible 𝜸

Input: 𝛾, S = {(x1, 𝑦1), …, (xm, 𝑦m)}
Find: some w where:
• ||w||2 = 1
• For all i, 𝑦𝑖 𝑤 ⋅ 𝑥𝑖 ≥ 𝛾
Output: w, a separator of margin 𝛾 over S

The case where the data is truly linearly separable by margin 𝛾

Machine Learning 33
Support Vector Machines: Geometric Margin
➔ Directly optimize for the maximum margin separator: SVMs

Input: 𝛾, S = {(x1, 𝑦1), …, (xm, 𝑦m)}
Maximize 𝛾 under the constraints:
• ||w||2 = 1
• For all i, 𝑦𝑖 𝑤 ⋅ 𝑥𝑖 ≥ 𝛾

(the objective function here is the margin 𝛾)

A famous example of constrained optimization is linear programming, where the objective function is linear and the constraints are linear (in)equalities.

Machine Learning 34
Support Vector Machines: Geometric Margin
➔ Directly optimize for the maximum margin separator: SVMs

Input: 𝛾, S = {(x1, 𝑦1), …, (xm, 𝑦m)}
Maximize 𝛾 under the constraints:
• ||w||2 = 1
• For all i, 𝑦𝑖 𝑤 ⋅ 𝑥𝑖 ≥ 𝛾

This constraint is non-linear. In fact, it is even non-convex.
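Solving this optimization in practice is left to SVM solvers; a minimal sketch with scikit-learn’s SVC on illustrative toy data (a very large C approximates the hard-margin case):

```python
# Fit a maximum-margin linear separator with an off-the-shelf SVM solver.
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data: two clusters labeled -1 and +1
X = np.array([[1.0, 1.0], [1.5, 0.5], [4.0, 4.0], [4.5, 3.5]])
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel="linear", C=1e6)    # large C ~ hard-margin SVM
clf.fit(X, y)
print(clf.support_vectors_)          # the points that attain the minimum margin
print(clf.coef_, clf.intercept_)     # the separator w and bias b
```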

Machine Learning 35
Chapter Content
➢ Classification Problems
➢ Classification Algorithms
○ K-Nearest Neighbors
○ Naïve Bayes Classification
○ Decision Tree
○ Support Vector Machines
➢ Metrics to measure Classification Performance

Machine Learning 36
Metrics to measure Classification Performance
➢ Confusion Matrix
➢ Accuracy
➢ Precision
➢ Recall
➢ F1 score

Machine Learning 37
Confusion Matrix
➢ A confusion matrix describes the performance of a classification model on a set of test data for which the true values (actual classes) are known.
Binary-class Classification
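The matrix itself is shown as an image in the slides; the standard binary layout it refers to (a reconstruction, not copied from the slide) is:

                 Predicted: No        Predicted: Yes
Actual: No       TN (true negative)   FP (false positive)
Actual: Yes      FN (false negative)  TP (true positive)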

Machine Learning 38
Example
A confusion matrix for a binary classifier for disease diagnosis, in which "yes" means the person has the disease and "no" means they don't.

Machine Learning 39
Confusion Matrix
Multi-class Classification

Machine Learning 40
Confusion Matrix
Multi-label Classification

Machine Learning 41
The role of the confusion matrix
➢ It evaluates the performance of classification models when they make predictions on test data.
➢ It measures how good the generated classification model is.
➢ It is used to calculate the different metrics for evaluating the model, such as accuracy, precision, etc.

Machine Learning 42
Accuracy
➢ Accuracy is the ratio of the number of correct predictions to the total number of predictions.
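In terms of the confusion-matrix counts (the slide’s formula image is not reproduced here; this is the standard definition):

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$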

Machine Learning 43
Accuracy in Binary Classification

Accuracy = (100 + 50) / 165 ≈ 0.91 (150 correct predictions out of 165 test examples in the confusion matrix shown on the slide)

Machine Learning 44
Accuracy in Multi-class Classification
Multi-class Classification

(Figure: accuracy computed from a multi-class confusion matrix)

Machine Learning 45
Accuracy in Multi-label Classification

Machine Learning 46
When to use Accuracy?
➢ Use the Accuracy metric when the target variable classes in the data are approximately balanced (unbiased).
➢ In the case of imbalanced data (biased), where one class is much larger than another, accuracy can be highly misleading.

Machine Learning 47
Example
Consider a model for disease prediction in which, out of 100 persons, only 5 have the disease and 95 don't. If the classifier predicts "no disease" for everyone, the Accuracy is 95% (TP = 0; TN = 95; FP = 0; FN = 5; 100 examples in total)
=> the high accuracy is misleading: the model never detects the disease.
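To see why accuracy misleads here, compare it with recall for the disease class (recall is defined later in this chapter), computed from the counts above:

$$\text{Accuracy} = \frac{0 + 95}{100} = 0.95, \qquad \text{Recall}_{\text{disease}} = \frac{TP}{TP + FN} = \frac{0}{0 + 5} = 0$$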

Machine Learning 48
Example
60% of the images in a fruit image dataset are of Apple and 40% are of Mango (approximately balanced data).
If the classifier predicts whether an image is of Apple or Mango, an accuracy of 97% is credible.

Machine Learning 49
Precision
➢ Precision measures how good the model is at correctly identifying the positive class: of all the examples predicted as positive, the fraction that are actually positive.
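The standard formula (the slide’s equation image is not reproduced here):

$$\text{Precision} = \frac{TP}{TP + FP}$$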

Machine Learning 50
Precision in Multi-class Classification

(Figure: multi-class confusion matrix of Actual Labels vs. Predicted Labels, from which per-class precision is computed)

Machine Learning 51
Recall
➢ Recall measures how good the model is at correctly predicting all the positive observations in the dataset: of all the actual positives, the fraction that the model identifies.
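The standard formula (the slide’s equation image is not reproduced here):

$$\text{Recall} = \frac{TP}{TP + FN}$$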

Machine Learning 52
Recall in Multi-class Classification
(Figure: multi-class confusion matrix of Actual Labels vs. Predicted Labels, from which per-class recall is computed)

Machine Learning 53
F1 Score
➢ F1 Score (F-measure) is the harmonic mean of precision and recall.
➢ It is used to compare the performance of two classifiers and determine which one produces better results.
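The standard formula (the slide’s equation image is not reproduced here):

$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

A minimal sketch computing all of the above metrics with scikit-learn (the true and predicted labels below are made up for illustration):

```python
# Confusion matrix, accuracy, precision, recall and F1 for a binary classifier.
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score, f1_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # actual classes (made-up)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]   # model predictions (made-up)

print(confusion_matrix(y_true, y_pred))    # rows = actual, columns = predicted
print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1 score :", f1_score(y_true, y_pred))
```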

Machine Learning 54
SUMMARY

• Classification Problems
• Classification Algorithms
• K-Nearest Neighbors
• Naïve Bayes Classification
• Decision Tree
• Support Vector Machines
• Metrics to measure Classification Performance

Machine Learning 55
Nhân bản – Phụng sự – Khai phóng

Enjoy the Course…!

Machine Learning 56
