Dr. S. Vairachilai
Department of CSE
CVR College of Engineering
Mangalpalli, Telangana
Topics to be Covered
• Classification
• KNN: K Nearest Neighbor
• Examples
Types of Classification Algorithms
Parametric Algorithms
• All we need to know is the model's parameters; the number of parameters is fixed with respect to the sample size.
• Make large assumptions about the mapping of the input variables to the output variable, and in turn are faster to train and require less data, but may not be as powerful.
• Advantages: simple, fast, need less data.
• Examples: Linear Regression, LDA (Linear Discriminant Analysis), Perceptron, Logistic Regression, Naive Bayes.
Non-Parametric Algorithms
• Do not have a fixed number of parameters in the model; the number of parameters can grow with the sample size.
• Make few or no assumptions about the target function, and in turn require a lot more data, are slower to train, and have a higher model complexity, but can result in more powerful models.
• Advantages: flexibility, power, performance.
• Examples: K-Nearest Neighbour, Decision Tree, Support Vector Machine (SVM).
KNN – K Nearest Neighbors
• KNN classifies a data point based on how its neighbors are classified.
• K Nearest Neighbors is also known as Memory-Based Reasoning, Example-Based Reasoning, Instance-Based Learning, and Lazy Learning.
• Lazy learners (e.g., K-Nearest Neighbor) do little work on the training data: less training time, but more time in predicting.
• Eager learners (e.g., Decision Tree, Naive Bayes, Artificial Neural Networks, Support Vector Machines) do most of their work on the training data: they take a long time to train but less time to predict.
• K Nearest Neighbors stores all available cases and classifies new cases based on a similarity measure (distance function).
K Nearest Neighbors Classification Approach
• An object (a new instance) is classified by a majority vote of its neighbors' classes.
• How to select k: choose k < sqrt(n), where n is the number of training samples.
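A minimal sketch of the majority-vote step in Python (the helper name majority_vote is my own, not from the slides):

```python
from collections import Counter

def majority_vote(neighbor_labels):
    """Return the most frequent class label among the k nearest neighbors."""
    return Counter(neighbor_labels).most_common(1)[0][0]

# e.g. two of the three nearest neighbors are "Good"
print(majority_vote(["Good", "Good", "Bad"]))  # Good
```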
Euclidean distance:

$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
Point P1 = (1, 4)
Point P2 = (5, 1)
Euclidean distance $= \sqrt{(5-1)^2 + (4-1)^2} = \sqrt{16 + 9} = \sqrt{25} = 5$
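A minimal Python sketch of this calculation (the function name euclidean_distance is illustrative):

```python
import math

def euclidean_distance(x, y):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

p1 = (1, 4)
p2 = (5, 1)
print(euclidean_distance(p1, p2))  # 5.0
```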
Manhattan distance

$d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$

For the two points in the image, red (4, 4) and green (1, 1):
d = |4 - 1| + |4 - 1| = 6

Manhattan distance is preferred over Euclidean distance when we have a case of high dimensionality.
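The same kind of sketch for Manhattan distance, reproducing the example above:

```python
def manhattan_distance(x, y):
    """Sum of absolute coordinate differences (L1 / city-block distance)."""
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

print(manhattan_distance((4, 4), (1, 1)))  # 6
```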
We have data from questionnaires (asking people's opinions) and objective testing, with two attributes (acid durability and strength), to classify whether a special paper tissue is good or bad.

X1 (Acid Durability)   X2 (Strength)   Y
7                      7               Bad
7                      4               Bad
3                      4               Good
1                      4               Good
K Nearest Neighbor
Now the factory produces a new tissue that passes the laboratory test with X1 = 3 and X2 = 7. Without any expensive survey, can we guess the classification of this new tissue?
Step 1: Determine K.
n = 4
sqrt(n) = sqrt(4) = 2
K = 2
K Nearest Neighbor
Step 2: Calculate the distance between the query instance (X1 = 3, X2 = 7) and all the training samples, then sort by distance.

X1   X2   Y      ED
3    4    Good   3
1    4    Good   3.6
7    7    Bad    4
7    4    Bad    5

With K = 2, both nearest neighbors are Good, so the query instance X1 = 3 & X2 = 7 belongs to the class Good. The tissue paper is good.
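Putting steps 1 and 2 together, a minimal end-to-end sketch of the tissue example (variable names are illustrative, not from the slides):

```python
import math

# Training data: (acid durability, strength) -> label
samples = [((7, 7), "Bad"), ((7, 4), "Bad"), ((3, 4), "Good"), ((1, 4), "Good")]
query = (3, 7)
k = 2  # k < sqrt(n) with n = 4

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Sort training samples by distance to the query and keep the k nearest
neighbors = sorted(samples, key=lambda s: euclidean(s[0], query))[:k]

# Majority vote among the k nearest labels
labels = [label for _, label in neighbors]
print(max(set(labels), key=labels.count))  # Good
```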
Sample Dataset
Example 2:

Age   Loan     Default
25    40000    N
35    60000    N
45    80000    N
20    20000    N
35    120000   N
52    18000    N
23    95000    Y
40    62000    Y
60    100000   Y
48    220000   Y
33    150000   Y
48    142000   ?
K Nearest Neighbor
Euclidean distances (ED) from each training sample to the query (Age = 48, Loan = 142000), computed first on the raw data and then on min-max normalized data:

Raw data:
Age   Loan     Default   ED
25    40000    N         102000
35    60000    N         82000
45    80000    N         62000
20    20000    N         122000
35    120000   N         22000
52    18000    N         124000
23    95000    Y         47000
40    62000    Y         80000
60    100000   Y         42000
48    220000   Y         78000
33    150000   Y         8000
48    142000   ?

Normalized data:
Age     Loan   Default   ED
0.125   0.11   N         0.7652
0.375   0.21   N         0.5200
0.625   0.31   N         0.3160
0       0.01   N         0.9245
0.375   0.50   N         0.3428
0.8     0.00   N         0.6220
0.075   0.38   Y         0.6669
0.5     0.22   Y         0.4437
1       0.41   Y         0.3650
0.7     1.00   Y         0.3861
0.325   0.65   Y         0.3771
0.7     0.61   ?
K Nearest Neighbor
The ED values above come from the Euclidean distance formula applied to the (Age, Loan) pairs:

$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
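A sketch reproducing this example end to end. Min-max normalization and k = 3 (from k < sqrt(11) ≈ 3.3) are assumptions consistent with the tables above; the slides do not state k for this example:

```python
import math

# (Age, Loan, Default) rows from Example 2
data = [
    (25, 40000, "N"), (35, 60000, "N"), (45, 80000, "N"), (20, 20000, "N"),
    (35, 120000, "N"), (52, 18000, "N"), (23, 95000, "Y"), (40, 62000, "Y"),
    (60, 100000, "Y"), (48, 220000, "Y"), (33, 150000, "Y"),
]
query = (48, 142000)

# Min-max normalize each attribute to [0, 1] using the training ranges
ages = [r[0] for r in data]
loans = [r[1] for r in data]

def norm(v, lo, hi):
    return (v - lo) / (hi - lo)

scaled = [
    (norm(a, min(ages), max(ages)), norm(l, min(loans), max(loans)), y)
    for a, l, y in data
]
q = (norm(query[0], min(ages), max(ages)), norm(query[1], min(loans), max(loans)))

# k nearest neighbors by Euclidean distance on the scaled values
k = 3  # assumed: k < sqrt(11) ≈ 3.3
nearest = sorted(scaled, key=lambda r: math.hypot(r[0] - q[0], r[1] - q[1]))[:k]
labels = [y for _, _, y in nearest]
print(max(set(labels), key=labels.count))  # majority vote -> N
```

Note that on the raw values the three smallest distances are 8000 (Y), 22000 (N), and 42000 (Y), which would flip the majority vote to Y; this sensitivity to feature scale is exactly what the normalization reference below addresses.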
Normalization of data
• https://medium.com/analytics-vidhya/why-is-scaling-required-in-knn-and-k-means-8129e4d88ed7
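For comparison, a hedged sketch of the same pipeline using scikit-learn's standard MinMaxScaler and KNeighborsClassifier (assuming scikit-learn is available):

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler

X = [[25, 40000], [35, 60000], [45, 80000], [20, 20000], [35, 120000],
     [52, 18000], [23, 95000], [40, 62000], [60, 100000], [48, 220000],
     [33, 150000]]
y = ["N", "N", "N", "N", "N", "N", "Y", "Y", "Y", "Y", "Y"]

# Scale features to [0, 1] before computing distances
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_scaled, y)
print(knn.predict(scaler.transform([[48, 142000]])))  # ['N']
```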