k-NN (Classification)
• A powerful classification algorithm used in pattern recognition.
• k-nearest neighbors stores all available cases and classifies new cases based on a similarity measure (e.g., a distance function).
• k should be an odd number (to avoid ties in the vote).
(Illustration: training points from two classes, A and B, with an unlabeled query point "?" to be assigned to one of them.)
(Illustration: the distance from the test record to every training record is computed to find its nearest neighbors.)
• Euclidean distance:
  $D(X, Y) = \sqrt{\sum_{i=1}^{p} (x_i - y_i)^2}$
• Manhattan distance:
  $D(X, Y) = \sum_{i=1}^{p} |x_i - y_i|$
• Minkowski distance:
  $D(X, Y) = \left( \sum_{i=1}^{p} |x_i - y_i|^q \right)^{1/q}$
(A small code sketch of these three metrics follows.)
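A minimal sketch of the three distance functions in plain Python; the function names are illustrative, not from any particular library:

```python
import math

def euclidean(x, y):
    # square root of the sum of squared coordinate differences
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def manhattan(x, y):
    # sum of absolute coordinate differences
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

def minkowski(x, y, q=2):
    # generalizes both: q = 1 gives Manhattan, q = 2 gives Euclidean
    return sum(abs(xi - yi) ** q for xi, yi in zip(x, y)) ** (1 / q)

print(euclidean([35, 35, 3], [37, 50, 2]))  # 15.165..., as in the customer example below
```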
• All instances correspond to points in a p-dimensional feature space.
• Each instance is represented by a set of numerical attributes.
• Each training example consists of a feature vector and an associated class label.
• Classification is done by comparing the feature vector of a new example E with those of the training set.
• Select the k examples nearest to E in the training set.
• Assign E to the most common class among its k nearest neighbors (a sketch of this procedure follows).
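A minimal sketch of this procedure in plain Python, assuming Euclidean distance and a simple majority vote; the names are illustrative:

```python
import math
from collections import Counter

def knn_classify(query, training_set, k=3):
    """training_set: a list of (feature_vector, class_label) pairs."""
    # 1. Compute the distance from the query to every stored training example
    dists = [(math.dist(query, x), label) for x, label in training_set]
    # 2. Select the k nearest examples
    nearest = sorted(dists, key=lambda t: t[0])[:k]
    # 3. Assign the most common class among the k nearest neighbors
    return Counter(label for _, label in nearest).most_common(1)[0][0]
```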
Customer | Age | Income | No. credit cards | Class | Distance from John
George   | 35  | 35K    | 3                | No    | sqrt[(35-37)² + (35-50)² + (3-2)²] = 15.16
Rachel   | 22  | 50K    | 2                | Yes   | sqrt[(22-37)² + (50-50)² + (2-2)²] = 15
Steve    | 63  | 200K   | 1                | No    | sqrt[(63-37)² + (200-50)² + (1-2)²] = 152.23
Tom      | 59  | 170K   | 1                | No    | sqrt[(59-37)² + (170-50)² + (1-2)²] = 122
Anne     | 25  | 40K    | 4                | Yes   | sqrt[(25-37)² + (40-50)² + (4-2)²] = 15.74
John     | 37  | 50K    | 2                | ?     |
With k = 3, John's nearest neighbors are Rachel (Yes), George (No), and Anne (Yes), so John is classified as YES.
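The same result can be reproduced with scikit-learn, assuming it is installed; income is given in thousands and k = 3 is chosen to match the example:

```python
from sklearn.neighbors import KNeighborsClassifier

# Features: [age, income (in K), number of credit cards]
X = [[35, 35, 3], [22, 50, 2], [63, 200, 1], [59, 170, 1], [25, 40, 4]]
y = ["No", "Yes", "No", "No", "Yes"]

model = KNeighborsClassifier(n_neighbors=3)  # Euclidean distance by default
model.fit(X, y)
print(model.predict([[37, 50, 2]]))  # John -> ['Yes']
```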
Main problem: how to choose k
• Rule of thumb: $k < \sqrt{n}$, where n is the number of training examples.
(Illustration: decision boundaries for k = 1, k = 3, and k = 31.)
• k acts as a smoother: larger values of k give smoother decision boundaries.
• Scale each feature by its importance for classification (a short sketch follows):
  $D(X, Y) = \sqrt{\sum_{i=1}^{p} w_i (x_i - y_i)^2}$
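A small sketch of the feature-weighted distance above; the weight vector w is assumed to be given (e.g., from domain knowledge or a validation search):

```python
import math

def weighted_euclidean(x, y, w):
    # each squared difference is scaled by that feature's importance w[i]
    return math.sqrt(sum(wi * (xi - yi) ** 2 for wi, xi, yi in zip(w, x, y)))
```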
• The distance between neighbors can be dominated by attributes with relatively large numeric ranges.
  – e.g., the income of customers in our previous example.
• Min-max normalization rescales every attribute to [0, 1] (sketched below):
  $X_s = \dfrac{X - \min X}{\max X - \min X}$
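A minimal sketch of min-max scaling for one feature column, in plain Python:

```python
def min_max_scale(column):
    lo, hi = min(column), max(column)
    # map every value into [0, 1]; guard against a constant column
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in column]

ages = [25, 35, 45, 20, 35, 52, 23, 40, 60, 48, 33]
print(min_max_scale(ages)[:3])  # [0.125, 0.375, 0.625], matching the normalized table below
```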
• Distance works naturally with numerical attributes.
• Binary categorical attributes can be encoded as 1 or 0 and compared with the Hamming distance (sketched below):
  $D_H(X, Y) = \sum_{i=1}^{p} |x_i - y_i|$, where $|x_i - y_i| = 0$ if $x_i = y_i$ and $1$ if $x_i \neq y_i$.

No. | X    | Y      | Distance
1   | Male | Male   | 0
2   | Male | Female | 1
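A minimal sketch of the Hamming distance for categorical records; the function name is illustrative:

```python
def hamming(x, y):
    # count the attributes on which the two records disagree
    return sum(1 for xi, yi in zip(x, y) if xi != yi)

print(hamming(["Male"], ["Male"]))    # 0
print(hamming(["Male"], ["Female"]))  # 1
```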
(Scatter plot: Loan amount ($0 to $250,000) versus Age (0 to 70), with Default and Non-Default customers marked.)
Age | Loan     | Default | Distance
25  | $40,000  | N       | 102,000
35  | $60,000  | N       | 82,000
45  | $80,000  | N       | 62,000
20  | $20,000  | N       | 122,000
35  | $120,000 | N       | 22,000
52  | $18,000  | N       | 124,000
23  | $95,000  | Y       | 47,000
40  | $62,000  | Y       | 80,000
60  | $100,000 | Y       | 42,000
48  | $220,000 | Y       | 78,000
33  | $150,000 | Y       | 8,000
48  | $142,000 | ?       |

Euclidean distance: $D = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$
Variable scale problem: the raw distances are dominated by the Loan attribute, so normalization is needed.
Age   | Loan | Default | Distance
0.125 | 0.11 | N       | 0.7652
0.375 | 0.21 | N       | 0.5200
0.625 | 0.31 | N       | 0.3160
0     | 0.01 | N       | 0.9245
0.375 | 0.50 | N       | 0.3428
0.8   | 0.00 | N       | 0.6220
0.075 | 0.38 | Y       | 0.6669
0.5   | 0.22 | Y       | 0.4437
1     | 0.41 | Y       | 0.3650
0.7   | 1.00 | Y       | 0.3861
0.325 | 0.65 | Y       | 0.3771
0.7   | 0.61 | ?       |

Standardized variable (min-max): $X_s = \dfrac{X - \min X}{\max X - \min X}$
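A short sketch of the scale effect on this example, using the table values and a 1-nearest-neighbor query for simplicity: with raw units the nearest neighbor is decided almost entirely by Loan, while after min-max scaling a different neighbor wins.

```python
import math

data = [(25, 40000, "N"), (35, 60000, "N"), (45, 80000, "N"), (20, 20000, "N"),
        (35, 120000, "N"), (52, 18000, "N"), (23, 95000, "Y"), (40, 62000, "Y"),
        (60, 100000, "Y"), (48, 220000, "Y"), (33, 150000, "Y")]
query = (48, 142000)

def nearest(points, q):
    return min(points, key=lambda r: math.dist(r[:2], q))

print(nearest(data, query))  # (33, 150000, 'Y'): the raw distance is dominated by Loan

def scale(v, col):
    lo, hi = min(col), max(col)
    return (v - lo) / (hi - lo)

ages, loans = [r[0] for r in data], [r[1] for r in data]
scaled = [(scale(a, ages), scale(l, loans), c) for a, l, c in data]
scaled_query = (scale(query[0], ages), scale(query[1], loans))

print(nearest(scaled, scaled_query))  # (0.625, 0.3069..., 'N'): a different nearest neighbor
```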
• Strengths
  – Makes no assumptions about the underlying data distribution.
  – Very simple and intuitive.
• k-NN estimates the conditional probability Pr(y | x) from the class labels of the k nearest neighbors.
• Note: bias is about accuracy, variance is about reliability; overfitting is a variance problem.
• Weight the contribution of each close neighbor based on its distance.
• Weight function:
  $w(\boldsymbol{x}, \boldsymbol{x}_i) = \dfrac{\exp\left(-\|\boldsymbol{x} - \boldsymbol{x}_i\|_2^2\right)}{\sum_{j \in \phi_x} \exp\left(-\|\boldsymbol{x} - \boldsymbol{x}_j\|_2^2\right)}$, for all $i \in \phi_x$ (the set of k nearest neighbors of $\boldsymbol{x}$).
• The weights then sum to one:
  $\sum_{j \in \phi_x} w(\boldsymbol{x}, \boldsymbol{x}_j) = 1$
• Prediction for $\boldsymbol{x}_{new}$:
  $y_{new} = \sum_{j \in \phi_{x_{new}}} w(\boldsymbol{x}_{new}, \boldsymbol{x}_j)\, y_j$
• Another notation, with inverse squared-distance weights:
  $\hat{f}(\boldsymbol{x}_{new}) = \dfrac{\sum_{j \in \phi_{x_{new}}} w_j\, f(x_j)}{\sum_{j \in \phi_{x_{new}}} w_j}$, where $w_j = \dfrac{1}{\|\boldsymbol{x}_{new} - \boldsymbol{x}_j\|_2^2}$
• Other variants (a code sketch follows this list):
  – Different weight functions
  – Variable-weighted k-NN
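A minimal sketch of the inverse squared-distance weighting above for a numeric target, in plain Python; the small epsilon guarding against a zero distance is an implementation choice, not part of the formula:

```python
import math

def weighted_knn_predict(query, training_set, k=3, eps=1e-12):
    """training_set: a list of (feature_vector, target_value) pairs."""
    # take the k nearest neighbors of the query
    nearest = sorted(training_set, key=lambda t: math.dist(query, t[0]))[:k]
    # w_j = 1 / ||x_new - x_j||^2, then a weighted average of the neighbors' targets
    weights = [1.0 / (math.dist(query, x) ** 2 + eps) for x, _ in nearest]
    return sum(w * y for w, (_, y) in zip(weights, nearest)) / sum(weights)
```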
• Parametric distribution models are restricted to specific
forms, which may not always be suitable; for example,
consider modelling a multimodal distribution with a single,
unimodal model.
• Nonparametric approaches make few assumptions about the overall shape of the distribution being modelled.
Dimensionality reduction
1. Feature selection
2. Feature extraction (latent variables)
• Imagine instances described by 20 attributes, but only 2 are relevant to the target function.
• Curse of dimensionality: the nearest neighbor is easily misled when X is high-dimensional.
  – "Neighborhoods" grow apart.
• Consider N data points uniformly distributed in a p-dimensional unit ball centered at the origin.
• Consider the NN estimate at the origin. The median distance from the origin to the closest data point is
  $d(p, N) = \left(1 - 2^{-1/N}\right)^{1/p} \approx 1 - \dfrac{\log N}{p}$,
  so for a fixed number of samples N, this distance grows with the number of features p (a small sketch follows).
• E.g., N = 10, p = 2: d ≈ 1 − 1/2 = 0.5, i.e., the closest neighbor is already halfway to the boundary of the ball.
(Illustration: how a fixed number of points fills the space in 1, 2, and 3 dimensions.)
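A small sketch that evaluates the exact formula above (rather than the approximation) for a few dimensions, to see the nearest neighbor drifting toward the boundary of the unit ball:

```python
def d(p, N):
    # distance from the origin to the nearest of N uniform points in a p-dimensional unit ball
    return (1 - 2 ** (-1 / N)) ** (1 / p)

for p in (2, 10, 100):
    print(p, round(d(p, 500), 3))  # 0.037, 0.518, 0.936: ever closer to the surface as p grows
```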
• Given a data set with $N_l$ data points from class $C_l$ and $\sum_l N_l = N$, we have the density estimate
  $p(\boldsymbol{x}) = \dfrac{L}{NV}$
  where $L$ is the number of data points in the region $R$ around $\boldsymbol{x}$, and $V$ is the volume of that region ($\int_R p(\boldsymbol{x})\, d\boldsymbol{x} \approx p(\boldsymbol{x})\, V$).
• Correspondingly, the class-conditional density is
  $p(\boldsymbol{x} \mid C_l) = \dfrac{L_l}{N_l V}$
  where $L_l$ is the number of data points from class $C_l$ in that region.
• Since $p(C_l) = N_l / N$, Bayes' theorem gives
  $p(C_l \mid \boldsymbol{x}) = \dfrac{p(\boldsymbol{x} \mid C_l)\, p(C_l)}{p(\boldsymbol{x})} = \dfrac{L_l}{L}$
• A class label corresponds to a set of points which belong to some region R in the feature space. If you draw sample points from the actual probability distribution, p(x), independently, then the probability of drawing a sample from that class is
  $P = \int_R p(\boldsymbol{x})\, d\boldsymbol{x}$
• What if you have 𝑁 points? The probability that 𝐿 points of
those 𝑁 points fall in the region 𝑅 follows the binomial
distribution,
  $\mathrm{Prob}(L) = \binom{N}{L} P^L (1 - P)^{N - L}$
• As 𝑁 → ∞ this distribution is sharply peaked, so that the
probability can be approximated by its mean value 𝐿/𝑁. An
additional approximation is that the probability distribution
over 𝑅 remains approximately constant, so that one can
approximate the integral by,
  $P = \int_R p(\boldsymbol{x})\, d\boldsymbol{x} \approx p(\boldsymbol{x})\, V$
  where V is the total volume of the region. Under these approximations, $p(\boldsymbol{x}) \approx \dfrac{L}{NV}$.
• Now, if we had several classes, we could repeat the same
analysis for each one, which would give us,
  $p(\boldsymbol{x} \mid C_l) = \dfrac{L_l}{N_l V}$
• where $L_l$ is the number of points from class $C_l$ which fall within that region and $N_l$ is the total number of points belonging to class $C_l$. Notice that $\sum_l N_l = N$.
• Repeating the analysis with the binomial distribution, it is easy to see that we can estimate the prior as $p(C_l) = N_l / N$.
• Using Bayes' rule (a code sketch follows),
  $p(C_l \mid \boldsymbol{x}) = \dfrac{p(\boldsymbol{x} \mid C_l)\, p(C_l)}{p(\boldsymbol{x})} = \dfrac{L_l}{L}$
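A minimal sketch of this posterior estimate: take the L nearest neighbors of x and use the class fractions L_l / L as p(C_l | x). Plain Python; the names are illustrative:

```python
import math
from collections import Counter

def knn_posterior(query, training_set, L=10):
    """training_set: a list of (feature_vector, class_label); returns {class: L_l / L}."""
    nearest = sorted(training_set, key=lambda t: math.dist(query, t[0]))[:L]
    counts = Counter(label for _, label in nearest)   # L_l per class
    return {c: counts[c] / L for c in counts}         # p(C_l | x) = L_l / L
```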