Dr. S. Vairachilai
Department of CSE
CVR College of Engineering
Mangalpalli, Telangana
Topics to be Covered
Classification
• KNN: K-Nearest Neighbor
• Examples
Types of Classification Algorithms
Parametric Algorithms vs Non-Parametric Algorithms

Parametric Algorithms
• Examples: Linear Regression, LDA (Linear Discriminant Analysis), Perceptron, Logistic Regression, Naive Bayes
• What needs to be learned is a fixed set of model parameters.
• The number of parameters is fixed with respect to the sample size.
• Make strong assumptions about the mapping of the input variables to the output variable; in turn they are faster to train and require less data, but may not be as powerful.
• Advantages: simple, fast, need less data.

Non-Parametric Algorithms
• Examples: K-Nearest Neighbour, Decision Tree, Support Vector Machine (SVM)
• Do not have a fixed number of parameters in the model.
• The number of parameters can grow with the sample size.
• Make few or no assumptions about the target function; in turn they require a lot more data, are slower to train and have higher model complexity, but can result in more powerful models.
• Advantages: flexibility, power, performance.
KNN – K Nearest Neighbors
• KNN classifies a data point based on how its neighbors are classified.
• KNN is also known as Memory-Based Reasoning, Example-Based Reasoning, Instance-Based Learning, and Lazy Learning.
• K Nearest Neighbors stores all available cases and classifies new cases based on a similarity measure (distance function).

Lazy Learner vs Eager Learner
• Lazy Learner (e.g., K-Nearest Neighbor): does little or no work on the training data; less training time, but more time spent in predicting.
• Eager Learner (e.g., Decision Tree, Naive Bayes, Artificial Neural Networks, Support Vector Machines): does more work on the training data; takes a long time to train, but less time to predict.
K Nearest Neighbors Classification Approach
• An object (a new instance) is classified by a majority vote of its neighbors' classes.
• The object is assigned to the most common class amongst its K nearest neighbors (measured by a distance function).
• KNN is a supervised machine learning algorithm that classifies a new data point into the target class based on a similarity measure (e.g., distance functions).
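A minimal sketch of this approach in plain Python (the function and variable names are illustrative, not from the slides): compute the distance from the query to every stored training point, keep the K closest, and return the majority class among them.

import math
from collections import Counter

def knn_predict(train_X, train_y, query, k):
    """Classify `query` by a majority vote among its k nearest training points."""
    # Euclidean distance from the query to every stored training point
    # (math.dist requires Python 3.8+)
    distances = [(math.dist(query, x), label) for x, label in zip(train_X, train_y)]
    # Keep the k closest neighbors
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote over their class labels
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]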
KNN: k Nearest Neighbor
• k is the number of neighbors considered.
• How to select k:
  k < sqrt(n), where n is the number of samples.
KNN: k Nearest Neighbor
Euclidean Distance

d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}

where n is the number of dimensions, x is the data point from the dataset, and y is the new data point (to be predicted).

Example: Point P1 = (1, 4), Point P2 = (5, 1)
Euclidean Distance = sqrt((5 - 1)^2 + (4 - 1)^2) = sqrt(16 + 9) = sqrt(25) = 5
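This small example can be checked with a few lines of Python (a sketch; the point names are taken from the slide):

import math

p1 = (1, 4)
p2 = (5, 1)

# Squared coordinate-wise differences, summed and square-rooted
d = math.sqrt(sum((a - b) ** 2 for a, b in zip(p1, p2)))
print(d)   # 5.0, matching the worked example above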
Manhattan Distance

d = \sum_{i=1}^{n} |x_i - y_i|

Example: two points, the red point (4, 4) and the green point (1, 1):
d = |4 - 1| + |4 - 1| = 6

Manhattan distance is preferred over Euclidean distance when we have a case of high dimensionality.
Minkowski Distance

d = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}

Minkowski distance is the generalized distance metric:
• p = 1 gives the Manhattan distance
• p = 2 gives the Euclidean distance

where n is the number of dimensions, x is the data point from the dataset, and y is the new data point (to be predicted).
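A short sketch showing that the two earlier examples fall out of the Minkowski formula for p = 1 and p = 2 (the function name is illustrative):

def minkowski(x, y, p):
    """Generalized Minkowski distance between two equal-length points."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

print(minkowski((4, 4), (1, 1), p=1))   # 6.0  -> the Manhattan example above
print(minkowski((1, 4), (5, 1), p=2))   # 5.0  -> the Euclidean example above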
KNN: k Nearest Neighbor
Sample Dataset
Example 1:
We have data from questionnaires (asking people's opinion) and from objective testing, with two attributes (acid durability and strength), to classify whether a special paper tissue is good or bad.

X1 Acid Durability (seconds)   X2 Strength (kg/square meter)   Y
7                              7                               Bad
7                              4                               Bad
3                              4                               Good
1                              4                               Good
K Nearest Neighbor
The factory now produces a new tissue that passes the laboratory test with X1 = 3 and X2 = 7. Without any expensive survey, can we guess what the classification of this new tissue is?

Step 1: Determine the parameter k, the number of nearest neighbors.
n = 4
sqrt(n) = sqrt(4) = 2
K = 2
K Nearest Neighbor
Step 2: Calculate the distance between the query instance (X1 = 3, X2 = 7) and all the training samples, using

d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}

d = sqrt((7 - 3)^2 + (7 - 7)^2) = sqrt(16) = 4
d = sqrt((7 - 3)^2 + (4 - 7)^2) = sqrt(25) = 5
d = sqrt((3 - 3)^2 + (4 - 7)^2) = sqrt(9)  = 3
d = sqrt((1 - 3)^2 + (4 - 7)^2) = sqrt(13) = 3.6

X1   X2   Y      ED
7    7    Bad    4
7    4    Bad    5
3    4    Good   3
1    4    Good   3.6

Step 3: Sort the training samples by Euclidean distance.

X1   X2   Y      ED
3    4    Good   3
1    4    Good   3.6
7    7    Bad    4
7    4    Bad    5

Step 4: Use a simple majority of the categories of the nearest neighbors as the prediction for the query instance X1 = 3 & X2 = 7.
• With K = 2, both nearest neighbors are Good, so the query instance X1 = 3 & X2 = 7 belongs to the class Good.
• The tissue paper is good.
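The whole worked example can be reproduced with a short script (a sketch; the variable names are illustrative). It prints the same distances (4, 5, 3, ~3.6) and the same prediction:

import math
from collections import Counter

# Tissue dataset: (X1 acid durability, X2 strength) -> class
data = [((7, 7), "Bad"), ((7, 4), "Bad"), ((3, 4), "Good"), ((1, 4), "Good")]
query = (3, 7)

# Step 1: choose k at most sqrt(n)
k = int(math.sqrt(len(data)))                       # sqrt(4) = 2

# Step 2: Euclidean distance from the query to every training sample
scored = [(math.dist(query, x), label) for x, label in data]
print(scored)                                       # 4.0, 5.0, 3.0, ~3.61

# Steps 3-4: sort by distance and take a majority vote among the k nearest
nearest = sorted(scored)[:k]
prediction = Counter(label for _, label in nearest).most_common(1)[0][0]
print(prediction)                                   # 'Good'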
Sample Dataset
Example 2: Classify whether a person will default on a loan, given Age and Loan amount.

Age   Loan     Default
25    40000    N
35    60000    N
45    80000    N
20    20000    N
35    120000   N
52    18000    N
23    95000    Y
40    62000    Y
60    100000   Y
48    220000   Y
33    150000   Y

48    142000   ?   (query instance)
K Nearest Neighbor
Calculate the Euclidean distance between the query instance (Age = 48, Loan = 142000) and all the training samples, using

d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}

Age   Loan     Default   ED
25    40000    N         102000
35    60000    N         82000
45    80000    N         62000
20    20000    N         122000
35    120000   N         22000
52    18000    N         124000
23    95000    Y         47000
40    62000    Y         80000
60    100000   Y         42000
48    220000   Y         78000
33    150000   Y         8000

48    142000   ?   (query instance)

Because Age and Loan are on very different scales, the Loan attribute dominates these raw distances.
K Nearest Neighbor
After min-max normalization of Age and Loan, the same distances become:

Age     Loan   Default   ED
0.125   0.11   N         0.7652
0.375   0.21   N         0.5200
0.625   0.31   N         0.3160
0       0.01   N         0.9245
0.375   0.50   N         0.3428
0.8     0.00   N         0.6220
0.075   0.38   Y         0.6669
0.5     0.22   Y         0.4437
1       0.41   Y         0.3650
0.7     1.00   Y         0.3861
0.325   0.65   Y         0.3771

0.7     0.61   ?   (query instance)
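The two distance columns above can be reproduced with a short sketch (variable and function names are illustrative; the normalized distances may differ slightly from the table, which rounds the scaled coordinates first):

import math

# Loan-default dataset: (Age, Loan) -> Default
data = [((25, 40000), "N"), ((35, 60000), "N"), ((45, 80000), "N"),
        ((20, 20000), "N"), ((35, 120000), "N"), ((52, 18000), "N"),
        ((23, 95000), "Y"), ((40, 62000), "Y"), ((60, 100000), "Y"),
        ((48, 220000), "Y"), ((33, 150000), "Y")]
query = (48, 142000)

# Raw Euclidean distances: dominated by the Loan attribute
for x, label in data:
    print(label, round(math.dist(query, x)))

# Min-max normalization of each attribute to [0, 1]
ages  = [x[0] for x, _ in data]
loans = [x[1] for x, _ in data]

def scale(value, column):
    return (value - min(column)) / (max(column) - min(column))

norm_query = (scale(query[0], ages), scale(query[1], loans))
for (age, loan), label in data:
    norm_point = (scale(age, ages), scale(loan, loans))
    print(label, round(math.dist(norm_query, norm_point), 4))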
Normalization of Data
• Why scaling is required in KNN and K-Means:
  https://medium.com/analytics-vidhya/why-is-scaling-required-in-knn-and-k-means-8129e4d88ed7
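A sketch of the same idea with scikit-learn (assuming KNeighborsClassifier and MinMaxScaler; k = 3 is an illustrative choice, not from the slides), contrasting KNN on the loan data with and without scaling:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler

X = [[25, 40000], [35, 60000], [45, 80000], [20, 20000], [35, 120000],
     [52, 18000], [23, 95000], [40, 62000], [60, 100000], [48, 220000],
     [33, 150000]]
y = ["N", "N", "N", "N", "N", "N", "Y", "Y", "Y", "Y", "Y"]
query = [[48, 142000]]

# Without scaling, the Loan attribute dominates the Euclidean distance
knn_raw = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(knn_raw.predict(query))

# With min-max scaling, both attributes contribute comparably,
# which can change the nearest neighbors and hence the prediction
scaler = MinMaxScaler().fit(X)
knn_scaled = KNeighborsClassifier(n_neighbors=3).fit(scaler.transform(X), y)
print(knn_scaled.predict(scaler.transform(query)))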
