Dr. S. Vairachilai
Department of CSE
CVR College of Engineering
Mangalpalli, Telangana
Topics to be Covered
• Classification
• KNN: K Nearest Neighbor
• Examples
Types of Classification Algorithms
Parametric Algorithms
• All we need to know is the model's parameters; the number of parameters is fixed with respect to the sample size.
• Make large assumptions about the mapping of the input variables to the output variable, and in turn are faster to train and require less data, but may not be as powerful.
• Advantages: simple, fast, need less data.
• Examples: Linear Regression, LDA (Linear Discriminant Analysis), Perceptron, Logistic Regression, Naive Bayes.
Non-Parametric Algorithms
• Do not have a fixed number of parameters in the model; the number of parameters can grow with the sample size.
• Make few or no assumptions about the target function, and in turn require a lot more data, are slower to train, and have a higher model complexity, but can result in more powerful models.
• Advantages: flexibility, power, performance.
• Examples: K-Nearest Neighbour, Decision Tree, Support Vector Machine (SVM).
KNN – K Nearest Neighbors
• KNN classifies a data point based on how its neighbors are classified.
• K Nearest Neighbors is also known as Memory-Based Reasoning, Example-Based Reasoning, Instance-Based Learning, and Lazy Learning.
• Lazy learners (e.g., K-Nearest Neighbor) do little work on the training data: less training time, but more time in predicting.
• Eager learners (e.g., Decision Tree, Naive Bayes, Artificial Neural Networks, Support Vector Machines) do most of their work on the training data: they take a long time to train but less time to predict.
• K Nearest Neighbors stores all available cases and classifies new cases based on a similarity measure (distance function).
K Nearest Neighbors Classification Approach
• An object (a new instance) is classified by a majority vote of its neighbors' classes.
• How to select k: choose k < sqrt(n), where n is the number of training samples.
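A minimal sketch of the majority-vote step in Python (the helper name majority_vote is my own, not from the slides):

```python
from collections import Counter

def majority_vote(neighbor_labels):
    """Return the most frequent class label among the k nearest neighbors."""
    return Counter(neighbor_labels).most_common(1)[0][0]

# e.g. two of the three nearest neighbors are "Good"
print(majority_vote(["Good", "Good", "Bad"]))  # Good
```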
Euclidean distance:

$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
Point P1 = (1, 4)
Point P2 = (5, 1)
Euclidean distance $= \sqrt{(5-1)^2 + (4-1)^2} = \sqrt{16 + 9} = \sqrt{25} = 5$
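A minimal Python sketch of this calculation (the function name euclidean_distance is illustrative):

```python
import math

def euclidean_distance(x, y):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

p1 = (1, 4)
p2 = (5, 1)
print(euclidean_distance(p1, p2))  # 5.0
```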
Manhattan distance

$d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$

For the two points in the image, red (4, 4) and green (1, 1):
d = |4 - 1| + |4 - 1| = 6

Manhattan distance is preferred over Euclidean distance when we have a case of high dimensionality.
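The same kind of sketch for Manhattan distance, reproducing the example above:

```python
def manhattan_distance(x, y):
    """Sum of absolute coordinate differences (L1 / city-block distance)."""
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

print(manhattan_distance((4, 4), (1, 1)))  # 6
```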
We have data from questionnaires (asking people's opinions) and objective testing, with two attributes (acid durability and strength), to classify whether a special paper tissue is good or bad.

X1 (Acid Durability)   X2 (Strength)   Y
7                      7               Bad
7                      4               Bad
3                      4               Good
1                      4               Good
K Nearest Neighbor
Now the factory produces a new tissue that passes the laboratory test with X1 = 3 and X2 = 7. Without any expensive survey, can we guess the classification of this new tissue?
Step 1: Determine K.
n = 4
sqrt(n) = sqrt(4) = 2
K = 2
K Nearest Neighbor
Step 2: Calculate the distance between the query instance (X1 = 3, X2 = 7) and all the training samples, then sort by distance.

X1   X2   Y      ED
3    4    Good   3
1    4    Good   3.6
7    7    Bad    4
7    4    Bad    5

With K = 2, both nearest neighbors are Good, so the query instance X1 = 3 & X2 = 7 belongs to the class Good. The tissue paper is good.
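Putting steps 1 and 2 together, a minimal end-to-end sketch of the tissue example (variable names are illustrative, not from the slides):

```python
import math

# Training data: (acid durability, strength) -> label
samples = [((7, 7), "Bad"), ((7, 4), "Bad"), ((3, 4), "Good"), ((1, 4), "Good")]
query = (3, 7)
k = 2  # k < sqrt(n) with n = 4

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Sort training samples by distance to the query and keep the k nearest
neighbors = sorted(samples, key=lambda s: euclidean(s[0], query))[:k]

# Majority vote among the k nearest labels
labels = [label for _, label in neighbors]
print(max(set(labels), key=labels.count))  # Good
```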
Sample Dataset
Example 2:

Age   Loan     Default
25    40000    N
35    60000    N
45    80000    N
20    20000    N
35    120000   N
52    18000    N
23    95000    Y
40    62000    Y
60    100000   Y
48    220000   Y
33    150000   Y
48    142000   ?
K Nearest Neighbor
Euclidean distances (ED) from each training sample to the query (Age = 48, Loan = 142000), computed first on the raw data and then on min-max normalized data:

Raw data:
Age   Loan     Default   ED
25    40000    N         102000
35    60000    N         82000
45    80000    N         62000
20    20000    N         122000
35    120000   N         22000
52    18000    N         124000
23    95000    Y         47000
40    62000    Y         80000
60    100000   Y         42000
48    220000   Y         78000
33    150000   Y         8000
48    142000   ?

Normalized data:
Age     Loan   Default   ED
0.125   0.11   N         0.7652
0.375   0.21   N         0.5200
0.625   0.31   N         0.3160
0       0.01   N         0.9245
0.375   0.50   N         0.3428
0.8     0.00   N         0.6220
0.075   0.38   Y         0.6669
0.5     0.22   Y         0.4437
1       0.41   Y         0.3650
0.7     1.00   Y         0.3861
0.325   0.65   Y         0.3771
0.7     0.61   ?
K Nearest Neighbor
The ED values above come from the Euclidean distance formula applied to the (Age, Loan) pairs:

$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
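A sketch reproducing this example end to end. Min-max normalization and k = 3 (from k < sqrt(11) ≈ 3.3) are assumptions consistent with the tables above; the slides do not state k for this example:

```python
import math

# (Age, Loan, Default) rows from Example 2
data = [
    (25, 40000, "N"), (35, 60000, "N"), (45, 80000, "N"), (20, 20000, "N"),
    (35, 120000, "N"), (52, 18000, "N"), (23, 95000, "Y"), (40, 62000, "Y"),
    (60, 100000, "Y"), (48, 220000, "Y"), (33, 150000, "Y"),
]
query = (48, 142000)

# Min-max normalize each attribute to [0, 1] using the training ranges
ages = [r[0] for r in data]
loans = [r[1] for r in data]

def norm(v, lo, hi):
    return (v - lo) / (hi - lo)

scaled = [
    (norm(a, min(ages), max(ages)), norm(l, min(loans), max(loans)), y)
    for a, l, y in data
]
q = (norm(query[0], min(ages), max(ages)), norm(query[1], min(loans), max(loans)))

# k nearest neighbors by Euclidean distance on the scaled values
k = 3  # assumed: k < sqrt(11) ≈ 3.3
nearest = sorted(scaled, key=lambda r: math.hypot(r[0] - q[0], r[1] - q[1]))[:k]
labels = [y for _, _, y in nearest]
print(max(set(labels), key=labels.count))  # majority vote -> N
```

Note that on the raw values the three smallest distances are 8000 (Y), 22000 (N), and 42000 (Y), which would flip the majority vote to Y; this sensitivity to feature scale is exactly what the normalization reference below addresses.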
Normalization of data
• https://medium.com/analytics-vidhya/why-is-scaling-required-in-knn-and-k-means-8129e4d88ed7
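For comparison, a hedged sketch of the same pipeline using scikit-learn's standard MinMaxScaler and KNeighborsClassifier (assuming scikit-learn is available):

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler

X = [[25, 40000], [35, 60000], [45, 80000], [20, 20000], [35, 120000],
     [52, 18000], [23, 95000], [40, 62000], [60, 100000], [48, 220000],
     [33, 150000]]
y = ["N", "N", "N", "N", "N", "N", "Y", "Y", "Y", "Y", "Y"]

# Scale features to [0, 1] before computing distances
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_scaled, y)
print(knn.predict(scaler.transform([[48, 142000]])))  # ['N']
```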