ML.4 – Classification Techniques (Weeks 5, 6, 7)


Nhân bản – Phụng sự – Khai phóng

Chapter 4

Classification Techniques
Machine Learning
CONTENTS

• Classification Problems
• Classification Algorithms
• K-Nearest Neighbors
• Naïve Bayes Classification
• Decision Tree
• Support Vector Machines
• Metrics to measure Classification Performance

Machine Learning 2
Chapter Content
➢ Classification Problems
➢ Classification Algorithms
○ K-Nearest Neighbors
○ Naïve Bayes Classification
○ Decision Tree
○ Support Vector Machines
➢ Metrics to measure Classification Performance

Machine Learning 3
Classification Problem

➢ Classification is a supervised task that requires the use of machine learning algorithms that learn how to assign a class label to examples from the problem domain

Machine Learning 4
Classification Problem
➢ Classification requires a training dataset with many
examples of inputs and outputs from which to learn.
➢ An ML model will use the training dataset to work out how to best map examples of input data to specific class labels.
➢ The training dataset must be sufficiently representative
of the problem and have many examples of each class
label.

Machine Learning 5
Example

• Can we predict the class of the new example?

Machine Learning 6
Classification Examples

● Email Spam Detection
● Speech Recognition
● Identification of Cancer Tumor Cells
● Drug Classification
● Biometric Identification, etc.

Machine Learning 7
Binary Classification
➢ Binary classification classifies the input data into two mutually exclusive categories.
➢ The training data in such a situation is labeled in a binary format: true and false; positive and negative; 0 and 1; spam and not spam; etc., depending on the problem being tackled.

Machine Learning 8
Multi-Class Classification
➢ Multi-class classification is where each data sample is
assigned one and only one label from more than two
classes.

Machine Learning 9
Multi-Label Classification
➢ Multi-label classification refers to predicting zero or more labels for each data sample.
➢ Examples: auto-tagging in Natural Language Processing, where a given text can contain multiple topics, or in computer vision, where an image can contain multiple objects.

Machine Learning 10
Example

????

Machine Learning 11
Chapter Content
➢ Classification Problems
➢ Classification Algorithms
○ K-Nearest Neighbors
○ Naïve Bayes Classification
○ Decision Tree
○ Support Vector Machines
➢ Metrics to measure Classification Performance

Machine Learning 12
K-Nearest Neighbor (K-NN)
➔ Given training data D = {(x1, y1), …, (xN, yN)} and a test point x
➔ Prediction Rule: Look at the K most similar training examples
➔ For classification: assign the majority class label (majority voting)
➔ For regression: assign the average response
➔ The algorithm requires:
◆ Parameter K: the number of nearest neighbors to look for
◆ A distance function: to compute the similarity between examples
➔ Special Case: 1-Nearest Neighbor
Machine Learning 13
K-Nearest Neighbor (K-NN): Steps

➔ Compute the test point’s distance from each training point
➔ Sort the distances in ascending (or descending) order
➔ Use the sorted distances to select the K nearest neighbors
➔ Use majority rule (for classification) or averaging (for regression), as in the sketch below
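A minimal Python sketch of these steps, assuming Euclidean distance and majority voting (the function name and toy data are illustrative, not from the slides):

```python
# K-NN prediction: distance computation, sorting, neighbor selection, majority vote.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k=3):
    # 1. Compute the test point's distance from each training point (Euclidean)
    distances = np.linalg.norm(X_train - x_test, axis=1)
    # 2.-3. Sort the distances and select the K nearest neighbors
    nearest = np.argsort(distances)[:k]
    # 4. Majority vote over the neighbors' labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy usage example
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([4.9, 5.1]), k=3))  # -> 1
```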

Machine Learning 14
K-Nearest Neighbor (K-NN)

➔ K-NN is called a non-parametric method
➔ Unlike other supervised learning algorithms, K-Nearest Neighbors doesn’t learn an explicit mapping f from the training data
➔ It simply uses the training data at test time to make predictions

Machine Learning 15
K-Nearest Neighbor (K-NN): Computing the Distance

➔ The K-NN algorithm requires computing the distance of the test example from each of the training examples.
➔ There are several ways to compute distances.
➔ The choice depends on the type of the features in the data.
◆ Real-valued features (xi ∈ R^D): Euclidean distance is commonly used.
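For reference, the standard Euclidean distance between a test point x and a training point x′ with D real-valued features (the slide’s own equation image is not reproduced here) is

$$d(\mathbf{x}, \mathbf{x}') = \lVert \mathbf{x} - \mathbf{x}' \rVert_2 = \sqrt{\sum_{d=1}^{D} (x_d - x'_d)^2}$$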

Machine Learning 16
K-Nearest Neighbor (K-NN): Computing the Distance

➔ Some other distance measures
◆ Binary-valued features
● Use the Hamming distance
● The Hamming distance counts the number of features where the two examples disagree
◆ Mixed feature types (some real-valued and some binary-valued)?
● Can use mixed distance measures
● E.g., Euclidean for the real-valued part, Hamming for the binary part
◆ Can also assign weights to features
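The standard forms of these distances (a reconstruction; the slide’s equation images are not reproduced here) are

$$d_{\text{Hamming}}(\mathbf{x}, \mathbf{x}') = \sum_{d=1}^{D} \mathbb{1}[x_d \neq x'_d], \qquad d_{\text{weighted}}(\mathbf{x}, \mathbf{x}') = \sqrt{\sum_{d=1}^{D} w_d\,(x_d - x'_d)^2}$$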

Machine Learning 17
K-Nearest Neighbor (K-NN): Choice of K

➔ Small K
◆ Creates many small regions for each class
◆ May lead to non-smooth decision boundaries and overfitting
➔ Large K
◆ Creates fewer, larger regions
◆ Usually leads to smoother decision boundaries (caution: a too-smooth decision boundary can underfit)
➔ Choosing K
◆ Often data dependent and heuristic based
◆ Or use cross-validation (on some held-out data), as in the sketch below
◆ In general, a K that is too small or too big is bad!
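A minimal sketch of choosing K by cross-validation, assuming scikit-learn is available (the dataset and the candidate values of K are illustrative):

```python
# Pick K by cross-validated accuracy.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
scores = {}
for k in [1, 3, 5, 7, 9, 15]:                 # candidate values of K
    model = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(model, X, y, cv=5).mean()  # 5-fold CV accuracy

best_k = max(scores, key=scores.get)
print(scores, "-> best K:", best_k)
```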

Machine Learning 18
K-Nearest Neighbor (K-NN): Properties
➔ What’s nice
◆ Simple and intuitive; easily implementable
◆ Asymptotically consistent (a theoretical property)
● With infinite training data and large enough K, K-NN approaches the
best possible classifier (Bayes optimal)
➔ What’s not so nice…
◆ Store all the training data in memory even at test time
● Can be memory intensive for large training datasets
● An example of non-parametric, or memory/instance-based methods
● Different from parametric, model-based learning models
◆ Expensive at test time: O(ND) computations for each test point
● Have to search through all training data to find nearest neighbors
● Distance computations with N training points (D features each)
◆ Sensitive to noisy features

Machine Learning 19
Chapter Content
➢ Classification Problems
➢ Classification Algorithms
○ K-Nearest Neighbors
○ Naïve Bayes Classification
○ Decision Tree
○ Support Vector Machines
➢ Metrics to measure Classification Performance

Machine Learning 20
Naïve Bayes Classification: Rule

➔ Bayes Rule: P(A|B) = P(B|A) P(A) / P(B)

➔ Prior: P(A), the probability of the hypothesis before observing the evidence
➔ Posterior: P(A|B), the probability of the hypothesis after observing the evidence

“…by no means merely a curious speculation in the doctrine of chances, but necessary to be solved in order to a sure foundation for all our reasonings concerning past facts, and what is likely to be hereafter…. necessary to be considered by any that would give a clear account of the strength of analogical or inductive reasoning…”
Machine Learning 21
Naïve Bayes Classification: Rule

➔ Bayes Rule: P(A|B) = P(B|A) P(A) / P(B)

➔ A = you have the flu; B = you just coughed

➔ What is P(flu|cough) = P(A|B)?
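A worked example with assumed numbers (they are not given in the slides), just to illustrate the mechanics: suppose P(flu) = 0.05, P(cough | flu) = 0.8, and P(cough) = 0.2. Then

$$P(\text{flu}\mid\text{cough}) = \frac{P(\text{cough}\mid\text{flu})\,P(\text{flu})}{P(\text{cough})} = \frac{0.8 \times 0.05}{0.2} = 0.2$$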

Machine Learning 22
Naïve Bayes Classification: in a Nutshell

➔ Bayes Rule: P(Y | X1, …, Xn) = P(X1, …, Xn | Y) P(Y) / P(X1, …, Xn)

➔ If Xi and Xj are conditionally independent given Y, for all i ≠ j, then P(X1, …, Xn | Y) = ∏i P(Xi | Y)

➔ So, to pick the most probable Y for a new example x = (x1, …, xn), predict ŷ = argmaxy P(y) ∏i P(xi | y)
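A minimal sketch of this decision rule using scikit-learn’s GaussianNB, which assumes Gaussian class-conditional densities (one common variant; the dataset is illustrative):

```python
# Naive Bayes: pick the class y maximizing P(y) * prod_i P(x_i | y).
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
clf = GaussianNB()            # each P(x_i | y) modeled as a Gaussian
clf.fit(X, y)                 # estimates the priors P(y) and per-class means/variances
print(clf.predict(X[:3]))                 # predicted classes for the first three examples
print(clf.predict_proba(X[:3]).round(3))  # posterior probabilities P(y | x)
```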

Machine Learning 23
Chapter Content
➢ Classification Problems
➢ Classification Algorithms
○ K-Nearest Neighbors
○ Naïve Bayes Classification
○ Decision Tree
○ Support Vector Machines
➢ Metrics to measure Classification Performance

Machine Learning 24
Decision Tree
Example: learn the concept PlayTennis (i.e., decide whether our friend will play tennis or not on a given day)
Simple Training Data Set

Machine Learning 25
Decision Tree
➔ Each internal node: test one (discrete-valued) attribute Xi
➔ Each branch from a node: corresponds to one of the possible values of Xi
➔ Each leaf node: predict Y (or P(Y=1|x ∊ leaf))
➔ Example: a decision tree for f: PlayTennis?

E.g., x = (Outlook=Sunny, Temperature=Hot, Humidity=Normal, Wind=High), f(x) = Yes.
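The tree itself appears as a figure in the slides; written as nested tests, the commonly used PlayTennis tree (assumed here, since the figure is not reproduced) looks like this in Python:

```python
# The classic PlayTennis decision tree (assumed to match the slide's figure).
def play_tennis(outlook, humidity, wind):
    if outlook == "Sunny":
        return "Yes" if humidity == "Normal" else "No"   # internal node tests Humidity
    elif outlook == "Overcast":
        return "Yes"                                     # leaf: always play
    else:  # outlook == "Rain"
        return "Yes" if wind == "Weak" else "No"         # internal node tests Wind

# The example from the slide: Outlook=Sunny, Humidity=Normal (Wind is not tested on this path)
print(play_tennis("Sunny", "Normal", "High"))  # -> "Yes"
```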

Machine Learning 26
Decision Tree
➔ Suppose X = ⟨x1, …, xn⟩, where the xi are Boolean-valued variables
➔ How would you represent the following functions as decision trees?

Machine Learning 27
Decision Tree
➔ Input: labeled training examples {(x(i), y(i))} of an unknown target function f
◆ Examples are described by their values on some set of features or attributes
● E.g., 4 attributes: Humidity, Wind, Outlook, Temp
● Set of possible instances X (a.k.a. the instance space)
◆ Unknown target function f : X → Y
● e.g., Y = {0, 1} label space
● e.g., 1 if we play tennis on this day, else 0
➔ Output: hypothesis h ∈ H that (best) approximates the target function f
◆ Set of function hypotheses H = { h | h : X → Y }
● each hypothesis h is a decision tree
Machine Learning 28
Chapter Content
➢ Classification Problems
➢ Classification Algorithms
○ K-Nearest Neighbors
○ Naïve Bayes Classification
○ Decision Tree
○ Support Vector Machines
➢ Metrics to measure Classification Performance

Machine Learning 29
Support Vector Machines
➔ If the margin is large, the number of mistakes the Perceptron makes is small (independent of the dimension of the ambient space)!
➔ A large margin can help prevent overfitting.
◆ If the margin 𝛾 is large and the algorithm produces a large-margin classifier, then the amount of data needed depends only on R/𝛾 [Bartlett & Shawe-Taylor ’99].
➔ Idea: directly search for a large-margin classifier!
Support Vector Machines (SVMs)

Machine Learning 30
Support Vector Machines: Geometric Margin
➔ Definition: The margin of an example 𝑥 w.r.t. a linear separator 𝑤 is the distance from 𝑥 to the plane 𝑤 ⋅ 𝑥 = 0.
➔ Definition: The margin 𝛾𝑤 of a set of examples 𝑆 w.r.t. a linear separator 𝑤 is the smallest margin over points 𝑥 ∈ 𝑆.
➔ Definition: The margin 𝛾 of a set of examples 𝑆 is the maximum 𝛾𝑤 over all linear separators 𝑤.
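Written as formulas (a standard formalization of the three definitions above; the slides state them only in words):

$$\gamma_w(x) = \frac{|w \cdot x|}{\lVert w \rVert_2}, \qquad \gamma_w(S) = \min_{x \in S} \gamma_w(x), \qquad \gamma(S) = \max_{w} \gamma_w(S)$$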

Machine Learning 31
Support Vector Machines: Geometric Margin
➔ Directly optimize for the maximum margin separator: SVMs
➔ First, assume we know a lower bound on the margin 𝜸

Input: 𝛾, S = {(x1, 𝑦1), …, (xm, 𝑦m)}
Find: some w where:
• ||w||2 = 1
• For all i, 𝑦𝑖 𝑤 ⋅ 𝑥𝑖 ≥ 𝛾
Output: w, a separator of margin 𝛾 over S

The case where the data is truly linearly separable by margin 𝛾

Machine Learning 32
Support Vector Machines: Geometric Margin
➔ Directly optimize for the maximum margin separator: SVMs
E.g., search for the best possible 𝜸

Input: 𝛾, S = {(x1, 𝑦1), …, (xm, 𝑦m)}
Find: some w where:
• ||w||2 = 1
• For all i, 𝑦𝑖 𝑤 ⋅ 𝑥𝑖 ≥ 𝛾
Output: w, a separator of margin 𝛾 over S

The case where the data is truly linearly separable by margin 𝛾

Machine Learning 33
Support Vector Machines: Geometric Margin
➔ Directly optimize for the maximum margin separator: SVMs

Input: 𝛾, S = {(x1, 𝑦1), …, (xm, 𝑦m)}
Maximize 𝛾 under the constraints:
• ||w||2 = 1
• For all i, 𝑦𝑖 𝑤 ⋅ 𝑥𝑖 ≥ 𝛾

(the objective function here is the margin 𝛾)

A famous example of constrained optimization is linear programming, where the objective function is linear and the constraints are linear (in)equalities.

Machine Learning 34
Support Vector Machines: Geometric Margin
➔ Directly optimize for the maximum margin separator: SVMs

Input: 𝛾, S = {(x1, 𝑦1), …, (xm, 𝑦m)}
Maximize 𝛾 under the constraints:
• ||w||2 = 1
• For all i, 𝑦𝑖 𝑤 ⋅ 𝑥𝑖 ≥ 𝛾

This constraint is non-linear. In fact, it is even non-convex.
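Solving this optimization in practice is left to SVM solvers; a minimal sketch with scikit-learn’s SVC on illustrative toy data (a very large C approximates the hard-margin case):

```python
# Fit a maximum-margin linear separator with an off-the-shelf SVM solver.
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data: two clusters labeled -1 and +1
X = np.array([[1.0, 1.0], [1.5, 0.5], [4.0, 4.0], [4.5, 3.5]])
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel="linear", C=1e6)    # large C ~ hard-margin SVM
clf.fit(X, y)
print(clf.support_vectors_)          # the points that attain the minimum margin
print(clf.coef_, clf.intercept_)     # the separator w and bias b
```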

Machine Learning 35
Chapter Content
➢ Classification Problems
➢ Classification Algorithms
○ K-Nearest Neighbors
○ Naïve Bayes Classification
○ Decision Tree
○ Support Vector Machines
➢ Metrics to measure Classification Performance

Machine Learning 36
Metrics to measure Classification Performance
➢ Confusion Matrix
➢ Accuracy
➢ Precision
➢ Recall
➢ F1 score

Machine Learning 37
Confusion Matrix
➢ A confusion matrix describes the performance of a classification model on a set of test data for which the true values (actual classes) are known.
Binary-class Classification
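The matrix itself is shown as an image in the slides; the standard binary layout it refers to (a reconstruction, not copied from the slide) is:

                 Predicted: No        Predicted: Yes
Actual: No       TN (true negative)   FP (false positive)
Actual: Yes      FN (false negative)  TP (true positive)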

Machine Learning 38
Example
A confusion matrix for a binary classifier for disease diagnosis, in which "yes" means the person has the disease and "no" means they don't.

Machine Learning 39
Confusion Matrix
Multi-class Classification

Machine Learning 40
Confusion Matrix
Multi-label Classification

Machine Learning 41
The role of the confusion matrix
➢ It evaluates the performance of classification models when they make predictions on test data.
➢ It measures how good the generated classification model is.
➢ It is used to calculate the different metrics for evaluating the model, such as accuracy, precision, etc.

Machine Learning 42
Accuracy
➢ Accuracy is the ratio of the number of correct predictions to the total number of predictions.
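In terms of the confusion-matrix counts (the slide’s formula image is not reproduced here; this is the standard definition):

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$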

Machine Learning 43
Accuracy in Binary Classification

Accuracy = (100 + 50) / 165 ≈ 0.91 (150 correct predictions out of 165 test examples in the confusion matrix shown on the slide)

Machine Learning 44
Accuracy in Multi-class Classification
Multi-class Classification

(Figure: accuracy computed from a multi-class confusion matrix)

Machine Learning 45
Accuracy in Multi-label Classification

Machine Learning 46
When to use Accuracy?
➢ Use the Accuracy metric when the target variable classes in the data are approximately balanced (unbiased).
➢ In the case of imbalanced data (biased), where one class is much larger than another, accuracy can be highly misleading.

Machine Learning 47
Example
Consider a model for disease prediction in which, out of 100 persons, only 5 have the disease and 95 don't. If the classifier predicts "no disease" for everyone, the Accuracy is 95% (TP = 0; TN = 95; FP = 0; FN = 5; 100 examples in total)
=> the high accuracy is misleading: the model never detects the disease.
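To see why accuracy misleads here, compare it with recall for the disease class (recall is defined later in this chapter), computed from the counts above:

$$\text{Accuracy} = \frac{0 + 95}{100} = 0.95, \qquad \text{Recall}_{\text{disease}} = \frac{TP}{TP + FN} = \frac{0}{0 + 5} = 0$$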

Machine Learning 48
Example
60% of the images in a fruit image dataset are of Apple and 40% are of Mango (approximately balanced data).
If the classifier predicts whether an image is of Apple or Mango, an accuracy of 97% is credible.

Machine Learning 49
Precision
➢ Precision measures how good the model is at correctly identifying the positive class: of all the examples predicted as positive, the fraction that are actually positive.
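The standard formula (the slide’s equation image is not reproduced here):

$$\text{Precision} = \frac{TP}{TP + FP}$$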

Machine Learning 50
Precision in Multi-class Classification

(Figure: multi-class confusion matrix of Actual Labels vs. Predicted Labels, from which per-class precision is computed)

Machine Learning 51
Recall
➢ Recall measures how good the model is at correctly predicting all the positive observations in the dataset: of all the actual positives, the fraction that the model identifies.
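The standard formula (the slide’s equation image is not reproduced here):

$$\text{Recall} = \frac{TP}{TP + FN}$$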

Machine Learning 52
Recall in Multi-class Classification
(Figure: multi-class confusion matrix of Actual Labels vs. Predicted Labels, from which per-class recall is computed)

Machine Learning 53
F1 Score
➢ F1 Score (F-measure) is the harmonic mean of precision and recall.
➢ It is used to compare the performance of two classifiers and determine which one produces better results.
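The standard formula (the slide’s equation image is not reproduced here):

$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

A minimal sketch computing all of the above metrics with scikit-learn (the true and predicted labels below are made up for illustration):

```python
# Confusion matrix, accuracy, precision, recall and F1 for a binary classifier.
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score, f1_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # actual classes (made-up)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]   # model predictions (made-up)

print(confusion_matrix(y_true, y_pred))    # rows = actual, columns = predicted
print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1 score :", f1_score(y_true, y_pred))
```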

Machine Learning 54
SUMMARY

• Classification Problems
• Classification Algorithms
• K-Nearest Neighbors
• Naïve Bayes Classification
• Decision Tree
• Support Vector Machines
• Metrics to measure Classification Performance

Machine Learning 55
Nhân bản – Phụng sự – Khai phóng

Enjoy the Course…!

Machine Learning 56
