Chapter 6 ML Classifications

Download as pdf or txt
Download as pdf or txt
You are on page 1of 51

FEM 2063 - Data Analytics

CHAPTER 6
Machine Learning Classification Methods

1
Overview
Machine Learning Classification Methods

➢k-Nearest Neighbours (kNN)

➢Support Vector Machines (SVM)

2
KNN

Tell me about your friends(who your neighbors are) and I will tell you
who you are.
KNN

KNN – Different names

• K-Nearest Neighbours
• Memory-Based Reasoning
• Example-Based Reasoning
• Instance-Based Learning
• Lazy Learning
KNN
What is it?

• A powerful classification algorithm used in pattern recognition used today.

• K nearest neighbors stores all available cases and classifies new cases based
on a similarity measure (e.g distance function)

• Categorize query points based on their distance to points in a training data


set . You can use various metrics to determine the distance.

• A non-parametric lazy learning algorithm (An Instance based Learning


method).
KNN

Classification Approach

• An object (a new instance) is classified


by a majority votes for its neighbor
classes.
• The object is assigned to the most
common class amongst its K nearest
neighbors (measured by a distant
function )
KNN
KNN
Distance Measure
KNN

Distance measure for Continuous Variables

Distance Functions
k
• Euclidean  i i
( x
i =1
− y ) 2

• Manhattan x
i =1
i − yi
1
 k 
  ( xi − yi ) 
q q
• Minkowski
 i =1 
KNN
Distance Between Neighbours

• Calculate the distance between new example (E) and all examples in the
training set
• Euclidean distance between two examples.
– X = [x1,x2,x3,..,xn]
– Y = [y1,y2,y3,...,yn]
– The Euclidean distance between X and Y is defined as:
k
D( X , Y ) =  i i
( x
i =1
− y ) 2
KNN
K-Nearest Neighbour Algorithm
• All the instances correspond to points in an n-dimensional feature space.
• Each instance is represented with a set of numerical attributes.
• Each of the training data consists of a set of vectors and a class label associated
with each vector.
• Classification is done by comparing feature vectors of different K nearest points.
• Select the K-nearest examples to E in the training set.
• Assign E to the most common class among its K-nearest neighbours.
KNN
3-KNN: Example(1)
KNN
How to choose K?
• If K is too small, it is sensitive to noise points.
• Larger K works well. But too large K may include majority points
from other classes.

• Rule of thumb is k  n, n is number of examples


KNN
KNN

(a) 1-nearest neighbour (b) 2-nearest neighbour (c) 3-nearest neighbour

K-nearest neighbors of a record x are data points that have the k smallest
distance to x
KNN
KNN Feature Weighting
• Scale each feature by its importance for classification

• Can use our prior knowledge about which features are more important
• Can learn the weights wk using cross‐validation
KNN

Feature Normalization
• Distance between neighbors could be dominated by some attributes
with relatively large numbers.
e.g., income of customers in our previous example.

• Arises when two features are in different scales.


• Important to normalize those features.
– Mapping values to numbers between 0 – 1.
KNN
Nominal/Categorical Data

• Distance works naturally with numerical attributes.


• Binary value categorical data attributes can be regarded as 1 or 0.
k
DH =  xi − yi
Hamming Distance i =1

x= yD=0
x  y  D =1
X Y Distance
Male Male 0
Male Female 1
KNN
KNN Classification
KNN
KNN Classification – Distance
KNN
KNN Classification – Standardized Distance
KNN

Strengths of KNN

• Very simple and intuitive.

• Can be applied to the data from any distribution.

• Good classification if the number of samples is large enough.


KNN

Weaknesses of KNN

• Takes more time to classify a new example.


need to calculate and compare distance from new example to all other examples.

• Choosing k may be tricky.

• Need large number of samples for accuracy.


KNN
Example
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
x = np.array([3000, 1000, 2700, 1500, 1200,1400, 4200, 580, 7390, 245]).reshape(-1, 1)
y = np.array([1, 0, 1, 0, 0, 0, 1, 0, 1, 0])
xtest = np.array([1970, 2845, 4656, 5800,900]).reshape(-1, 1)
ytest = np.array([0, 1, 1, 1, 1])

#define the model to be used


model = KNeighborsClassifier(n_neighbors=3)

#fit the (training) data


model.fit(x, y)
ypred = model.predict(xtest)
print(ypred)

Output: [0 1 1 1 0]
Overview
Machine Learning Classification Methods

➢k-Nearest Neighbours (kNN)

➢Support Vector Machines (SVM)

25
SVM
Introduction
Given N labeled empirical data:

(x1, y1 ), , (x N , yN )  X  {−1,+1}
where X is the set of input data in D and yi are the labels.
x2 yi = 1
yi = −1

c1 c2
Domain X
x1
SVM

Introduction Unit normal vector

w w
x2 yi = 1
yi = −1

c1 c2
x Decision plan
Domain X

x1
SVM

Linear Classifiers
x f yest
f(x,w,b) = sign(w x + b)
denotes +1 w x + b>0
denotes -1

How would you classify this data?

w x + b<0
SVM

Linear Classifiers
x f yest
f(x,w,b) = sign(w x + b)
denotes +1
denotes -1

How would you


classify this data?
SVM

Linear Classifiers
x f yest
f(x,w,b) = sign(w x + b)
denotes +1
denotes -1

How would you


classify this data?
SVM

Linear Classifiers
x f yest
f(x,w,b) = sign(w x + b)
denotes +1
denotes -1

Any of these would be fine..

..but which is best?


SVM

Linear Classifiers x f yest


denotes +1 f(x,w,b) = sign(w x + b)
denotes -1

How would you classify this data?

Misclassified
to +1 class
SVM
Classifier Margin x f yest
denotes +1 f(x,w,b) = sign(w x + b)
denotes -1

Define the margin of a


linear classifier as the
width that the boundary
could be increased by
before hitting a datapoint.
SVM
Maximum Margin x f yest
denotes +1
denotes -1 f(x,w,b) = sign(w x + b)

The maximum margin


Support Vectors linear classifier is the
are those
datapoints that
linear classifier with the
the margin maximum margin.
pushes up
against
This is the simplest kind
of SVM (Called an LSVM)

Linear SVM
SVM
Good Decision Boundary: Margin Should Be Large
The decision boundary should be as far away from the data of both classes as
possible
We should maximize the margin, m

Class 2

m?
Class 1

35
SVM

Training set is linearly separable

H 2 = z, w • z − b = 1 H1 = z, w • z − b = −1


z i  H i , i = 1,2
w 1
d ( H1 , H 2 ) = • ( z 2 − z1 ) = (w • z 2 − w • z1 )
w w
1
= (1 + b − (−1 + b))
w
2
d(H1, H 2 ) =
w
36
SVM
Linear SVM

w w
x2

Therefore, maximizing
the margin is equivalent
w  x + b = +1 to minimizing ||w||2.
wx +b = 0

w  x + b = −1 x1
SVM
The Optimization Problem
• Let {x1, ..., xn} be our data set and let yi  {1,-1} be the class label of xi

• The decision boundary should classify all points correctly

• A constrained optimization problem

38
SVM
The Optimization Problem
• We can transform the problem to its dual using Lagrange multipliers

• This is a quadratic programming (QP) problem


Global maximum of ai can always be found

• w can be recovered by
39
SVM
Characteristics of the Solution

• Many of the ai are zero


• w is a linear combination of a small number of data
• Sparse representation

• xi with non-zero ai are called support vectors (SV)


• The decision boundary is determined only by the SV
• Let tj (j=1, ..., s) be the indices of the s support vectors. We can write

40
SVM
Characteristics of the solution

For testing with a new data z

Compute

and classify z as class 1 if the sum is positive, and class 2 otherwise

41
SVM
A Geometrical Interpretation
Class 2

a10=0
a8=0.6

a7=0
a2=0
a5=0

a1=0.8
a4=0
a6=1.4

a9=0
a3=0
Class 1

42
SVM
If Nonlinear?
Use of mapping through
Kernal function


43
SVM

Examples of kernel functions.

Polynomial kernel K ( xi , x j ) = ( xi • x j + 1) 2
2
x j − xi

Radial Base Function (RBF) K ( xi , x j ) = e 22

x j − xi

Laplace K ( xi , x j ) = e 
SVM
Dataset with noise
denotes +1 ◼ Hard Margin: require all data points be
denotes -1 correctly classified, No training error

◼ What if the training set is noisy?


- Solution 1: use very powerful kernels

OVERFITTING!
SVM
Hard Margin v.s. Soft Margin
◼ The formulation:
Find w and b such that
Φ(w) =½ wTw is minimized and for all {(xi ,yi)}
yi (wTxi + b) ≥ 1

◼ The formulation incorporating slack variables:


Find w and b such that
Φ(w) =½ wTw + CΣξi is minimized and for all {(xi ,yi)}
yi (wTxi + b) ≥ 1- ξi and ξi ≥ 0 for all i

◼ Parameter C can be viewed as a way to control overfitting.


SVM
More than 2 classes?
The SVM as defined works for K = 2 classes. What do we do if we have
K > 2 classes?
SVM

Strengths

• It has a unique solution

• It can deal with non-linearity


SVM

Weaknesses

• It is sensitive to noise: a relatively small number of mislabeled examples


can dramatically decrease the performance

• The choice of the kernel is not obvious!


KNN
Example
from sklearn import svm
from sklearn.svm import SVC
import numpy as np

x = np.array([3000, 1000, 2700, 1500, 1200,1400, 4200, 580, 7390, 245]).reshape(-1, 1)


y = np.array([1, 0, 1, 0, 0, 0, 1, 0, 1, 0])
xtest = np.array([1970, 2845, 4656, 5800,900]).reshape(-1, 1)
ytest = np.array([0, 1, 1, 1, 1])

#define the model to be used


model = SVC(kernel='linear’)

#fit the (training) data


model.fit(x, y)
ypred = model.predict(xtest)
print(ypred)

Output: [0 1 1 1 0]
References
https://www.slideshare.net/tilanigunawardena/k-nearest-neighbors
https://sikoried.github.io/sequence-learning/09-sequence-kernels/intro_svm_new.pdf

You might also like