
k-Nearest Neighbors

(Classification)

Instructor: Junghye Lee


School of Management Engineering
[email protected]
• Tell me about your friends (who your neighbors are)
and I will tell you who you are.

• A powerful classification algorithm used in pattern recognition.
• k-nearest neighbors stores all available cases and classifies new
  cases based on a similarity measure (e.g., a distance function).
• A non-parametric, lazy learning algorithm (i.e., instance-based
  learning, memory-based reasoning, example-based reasoning).
Note: k should be an odd number (to avoid tied votes).

[Figure: (a) the 1-nearest neighbor, (b) the 2-nearest neighbors, and
(c) the 3-nearest neighbors of a record x.]

• The k-nearest neighbors of a record 𝒙 are the data points that
  have the k smallest distances to 𝒙.
• An object (a new instance) is classified by a majority vote of
  its neighbors' classes.
• The object is assigned to the most common class among its
  k-nearest neighbors (measured by a distance function).

[Figure: a query point "?" surrounded by training points from
classes A and B.]
[Figure: to classify a test record, compute its distance to all
training records, then choose the k "nearest" records.]
• Euclidean

  D(X, Y) = \sqrt{\sum_{i=1}^{p} (x_i - y_i)^2}

• Manhattan

  D(X, Y) = \sum_{i=1}^{p} |x_i - y_i|

• Minkowski (of order q)

  D(X, Y) = \left( \sum_{i=1}^{p} |x_i - y_i|^q \right)^{1/q}

These distance functions are only for continuous variables.
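The three distance functions above can be expressed compactly in code. A minimal sketch in Python (NumPy assumed; the function name minkowski is illustrative, not from the slides):

    import numpy as np

    def minkowski(x, y, q=2):
        """Minkowski distance of order q between two numeric vectors.
        q = 2 gives the Euclidean distance, q = 1 the Manhattan distance."""
        x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
        return np.sum(np.abs(x - y) ** q) ** (1.0 / q)

    # The two special cases listed above:
    euclidean = minkowski([35, 35, 3], [37, 50, 2], q=2)  # ~15.17
    manhattan = minkowski([35, 35, 3], [37, 50, 2], q=1)  # 18.0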


• Calculate the distance between the new example E and all examples
  in the training set.
• Euclidean distance between two examples
  X = [x_1, x_2, x_3, ..., x_p]
  Y = [y_1, y_2, y_3, ..., y_p]

  – The Euclidean distance between X and Y is defined as

    D(X, Y) = \sqrt{\sum_{i=1}^{p} (x_i - y_i)^2}

• All instances correspond to points in a p-dimensional feature space.
• Each instance is represented by a set of numerical attributes.
• Each training example consists of a feature vector and the class
  label associated with that vector.
• Classification is done by comparing the feature vector of the new
  example with those of the k nearest training points.
• Select the k examples in the training set nearest to E.
• Assign E to the most common class among its k-nearest neighbors.
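Putting these steps together, a minimal k-NN classifier could look like the following sketch (plain NumPy; the name knn_predict and its signature are illustrative, not from the slides):

    import numpy as np
    from collections import Counter

    def knn_predict(X_train, y_train, x_new, k=3):
        """Classify x_new by a majority vote of its k nearest training
        points, using the Euclidean distance."""
        X_train = np.asarray(X_train, dtype=float)
        x_new = np.asarray(x_new, dtype=float)
        # distance from the new example to every training example
        dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
        # indices of the k smallest distances
        nearest = np.argsort(dists)[:k]
        # majority vote among the neighbors' class labels
        votes = Counter(y_train[i] for i in nearest)
        return votes.most_common(1)[0][0]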

Customer   Age   Income   No. credit cards   Class   Distance from John
George     35    35K      3                  No      sqrt[(35-37)^2 + (35-50)^2 + (3-2)^2] = 15.16
Rachel     22    50K      2                  Yes     sqrt[(22-37)^2 + (50-50)^2 + (2-2)^2] = 15
Steve      63    200K     1                  No      sqrt[(63-37)^2 + (200-50)^2 + (1-2)^2] = 152.23
Tom        59    170K     1                  No      sqrt[(59-37)^2 + (170-50)^2 + (1-2)^2] = 122
Anne       25    40K      4                  Yes     sqrt[(25-37)^2 + (40-50)^2 + (4-2)^2] = 15.74
John       37    50K      2                  ?

John's nearest neighbors are Rachel (Yes), George (No), and Anne (Yes),
so with k = 3 John is classified as YES.

Main Problem

• If k is too small, the classifier is sensitive to noise points.

• A larger k works well, but if k is too large the neighborhood may
  include points from other classes.

• A rule of thumb: k < \sqrt{n}, where n is the number of examples.

[Figure: k-NN results for k = 1, k = 3, and k = 31.]

• 𝑘 acts as a smoother.
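One common way to pick k in practice is to compare several candidate values on held-out data and keep the most accurate one. A rough sketch, reusing NumPy and the hypothetical knn_predict helper from above:

    def choose_k(X_train, y_train, X_val, y_val, candidates=(1, 3, 5, 7, 9)):
        """Return the candidate k with the highest validation accuracy."""
        def accuracy(k):
            preds = [knn_predict(X_train, y_train, x, k=k) for x in X_val]
            return np.mean([p == y for p, y in zip(preds, y_val)])
        return max(candidates, key=accuracy)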
• Scale each feature by its importance for classification:

  D(X, Y) = \sqrt{\sum_{i=1}^{p} w_i (x_i - y_i)^2}

  1. We can use prior knowledge about which features are more
     important.
  2. We can learn the weights w_i using cross-validation.
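A per-feature weight is a one-line change to the Euclidean distance; a sketch (NumPy assumed; the weight values below are made up purely for illustration):

    def weighted_euclidean(x, y, w):
        """Euclidean distance with per-feature importance weights w_i."""
        x, y, w = (np.asarray(a, dtype=float) for a in (x, y, w))
        return np.sqrt(np.sum(w * (x - y) ** 2))

    # e.g. down-weight income so it does not dominate age and card count
    weighted_euclidean([35, 35, 3], [37, 50, 2], w=[1.0, 0.01, 1.0])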

• The distance between neighbors can be dominated by attributes with
  relatively large values,
  – e.g., the income of customers in the previous example.
• This arises when features are measured on different scales.
• It is important to normalize those features,
  – mapping values to numbers between 0 and 1:

  X_s = \frac{X - \min X}{\max X - \min X}
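Min-max normalization can be applied column-wise before computing distances; a minimal sketch (NumPy assumed; the function name is illustrative):

    def min_max_scale(X):
        """Rescale every column of X to the [0, 1] range."""
        X = np.asarray(X, dtype=float)
        mins, maxs = X.min(axis=0), X.max(axis=0)
        return (X - mins) / (maxs - mins)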

• Distance works naturally with numerical attributes.
• Binary categorical attributes can be coded as 1 or 0 and compared
  with the Hamming distance:

  D_H(X, Y) = \sum_{i=1}^{p} |x_i - y_i|, where
  x_i = y_i ⇒ |x_i - y_i| = 0 and x_i ≠ y_i ⇒ |x_i - y_i| = 1.

  No.   X      Y        Distance
  1     Male   Male     0
  2     Male   Female   1
[Figure: scatter plot of Loan ($) versus Age for Default and
Non-Default customers.]
Age   Loan       Default   Distance from (48, $142,000)
25    $40,000    N         102,000
35    $60,000    N          82,000
45    $80,000    N          62,000
20    $20,000    N         122,000
35    $120,000   N          22,000
52    $18,000    N         124,000
23    $95,000    Y          47,000
40    $62,000    Y          80,000
60    $100,000   Y          42,000
48    $220,000   Y          78,000
33    $150,000   Y           8,000

48    $142,000   ?

Euclidean distance:  D = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}

With unnormalized data the Loan values dominate the distance (a
variable-scale problem), so normalization is needed before applying
k-NN.
Age     Loan   Default   Distance from (0.7, 0.61)
0.125   0.11   N         0.7652
0.375   0.21   N         0.5200
0.625   0.31   N         0.3160
0       0.01   N         0.9245
0.375   0.50   N         0.3428
0.8     0.00   N         0.6220
0.075   0.38   Y         0.6669
0.5     0.22   Y         0.4437
1       0.41   Y         0.3650
0.7     1.00   Y         0.3861
0.325   0.65   Y         0.3771

0.7     0.61   ?

Standardized variable:  X_s = \frac{X - \min X}{\max X - \min X}
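The normalized table can be reproduced with the min_max_scale sketch from earlier (an assumed helper, not from the slides); note how the nearest neighbor is no longer decided by the Loan column alone:

    data = np.array([[25,  40000], [35,  60000], [45,  80000], [20,  20000],
                     [35, 120000], [52,  18000], [23,  95000], [40,  62000],
                     [60, 100000], [48, 220000], [33, 150000],
                     [48, 142000]])              # last row is the query
    scaled = min_max_scale(data)                 # both columns now in [0, 1]
    query, train = scaled[-1], scaled[:-1]
    dists = np.sqrt(((train - query) ** 2).sum(axis=1))
    print(dists.round(4))    # matches the Distance column above
    print(dists.argmin())    # 2 -> Age 45, Loan $80,000 (a Non-Default case)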
• Strengths
  – Very simple and intuitive.
  – Can be applied to data from any distribution (no distributional
    assumptions).
  – Good classification if the number of samples is large enough.
  – Training is very fast.
  – Can learn complex (non-linear) target functions.
  – Does not lose information.

• Weaknesses
  – Takes more time to classify a new example:
    • need to calculate and compare the distance from the new example
      to all other examples.
  – Choosing k may be tricky.
  – Needs a large number of samples for accuracy.
  – Easily fooled by irrelevant attributes (feature weighting helps).
• k-NN estimates the conditional probability Pr(y|𝒙) as
  – the count of data points of class y in the neighborhood of 𝒙
    (divided by the neighborhood size).

• Bias and variance trade-off
  – A small neighborhood → large variance → unreliable estimation
  – A large neighborhood → large bias → inaccurate estimation
  (Intuition: bias relates to the accuracy of the estimate, variance
  to its reliability; a very small neighborhood tends to overfit.)
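The neighborhood estimate of Pr(y|x) is just the class frequency among the k nearest neighbors; a sketch along the lines of the earlier helpers (NumPy and Counter as before; names are illustrative):

    def knn_class_probs(X_train, y_train, x_new, k=5):
        """Estimate Pr(y | x_new) as the fraction of each class among
        the k nearest training points."""
        X_train = np.asarray(X_train, dtype=float)
        dists = np.sqrt(((X_train - np.asarray(x_new, dtype=float)) ** 2).sum(axis=1))
        nearest = np.argsort(dists)[:k]
        counts = Counter(y_train[i] for i in nearest)
        return {label: count / k for label, count in counts.items()}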
• Weight the contribution of each close neighbor based on its
  distance.
• Weight function:

  w(\mathbf{x}, \mathbf{x}_i) = \frac{\exp(-\|\mathbf{x} - \mathbf{x}_i\|_2^2)}{\sum_{j \in \phi_x} \exp(-\|\mathbf{x} - \mathbf{x}_j\|_2^2)}

  for all i \in \phi_x, where \phi_x is the set of the k nearest
  neighbors of \mathbf{x}.
• The weights over the k neighbors sum to one:

  \sum_{j \in \phi_x} w(\mathbf{x}, \mathbf{x}_j) = 1

• Prediction for \mathbf{x}_{new}:

  y_{new} = \sum_{j \in \phi_{x_{new}}} w(\mathbf{x}_{new}, \mathbf{x}_j) \, y_j

• Another notation:

  \hat{f}(\mathbf{x}_{new}) = \frac{\sum_{j \in \phi_{x_{new}}} w_j f(x_j)}{\sum_{j \in \phi_{x_{new}}} w_j},
  \quad \text{where } w_j = \frac{1}{\|\mathbf{x}_{new} - \mathbf{x}_j\|_2^2}
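A sketch of the second ("another notation") form with inverse-squared-distance weights; the small epsilon is my addition (not on the slide) so that a query coinciding with a training point does not divide by zero:

    def weighted_knn_predict(X_train, y_train, x_new, k=3, eps=1e-12):
        """Predict a numeric target as the weighted average of the k nearest
        neighbors, with weights w_j = 1 / ||x_new - x_j||^2."""
        X_train = np.asarray(X_train, dtype=float)
        y_train = np.asarray(y_train, dtype=float)
        d2 = ((X_train - np.asarray(x_new, dtype=float)) ** 2).sum(axis=1)
        nearest = np.argsort(d2)[:k]
        w = 1.0 / (d2[nearest] + eps)
        return np.sum(w * y_train[nearest]) / np.sum(w)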

• Other variants
– Different weight functions
– Variable-weighted k-NN
• Parametric distribution models are restricted to specific forms,
  which may not always be suitable; for example, consider modelling
  a multimodal distribution with a single, unimodal model.
• Nonparametric approaches make few assumptions about the overall
  shape of the distribution being modelled.
(Note: dimensionality reduction, via feature selection or feature
extraction to latent variables, helps with the problem described
below.)
• Imagine instances described by 20 attributes, but only 2 are
  relevant to the target function.
• Curse of dimensionality: the nearest neighbor is easily misled
  when X is high-dimensional.
  – "Neighborhoods" grow apart.
• Consider N data points uniformly distributed in a p-dimensional
  unit ball centered at the origin.
• Consider the NN estimate at the origin. The median distance from
  the origin to the closest data point is

  d(p, N) = \left( 1 - 2^{-1/N} \right)^{1/p} \approx 1 - \frac{\log N}{p}

  (N = number of samples, p = number of features)

  E.g. 1) N = 10, p = 2:  d ≈ 0.5
  E.g. 2) N = 10, p = 50: d ≈ 0.98

  That is, with a fixed number of samples, the nearest neighbor
  drifts toward the boundary of the ball as the dimension p grows.
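A quick numerical check of the exact expression above; it gives somewhat smaller values than the slide's approximate examples, but the message is the same: with N fixed, the nearest neighbor ends up close to the boundary as p grows.

    def median_nn_distance(p, N):
        """Median distance from the origin to the nearest of N points drawn
        uniformly from a p-dimensional unit ball."""
        return (1.0 - 2.0 ** (-1.0 / N)) ** (1.0 / p)

    for p in (2, 10, 50, 200):
        print(p, round(median_nn_distance(p, N=10), 3))
    # prints: 2 0.259, 10 0.763, 50 0.947, 200 0.987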


• Assume 5000 points uniformly distributed in the unit hypercube,
  and we want to apply 5-NN.
• Suppose our query point is at the origin and we want to capture
  its 5 nearest neighbors, i.e., a fraction 5/5000 ≈ 0.001 of the data:
  – In 1 dim.: we must cover an interval of length 0.001 on average.
  – In 2 dim.: we must cover a square with edge 0.001^{1/2} ≈ 0.03 to
    contain 0.001 of the volume.
  – In d dim.: we must cover an edge of length 0.001^{1/d}.

[Figure: the regions needed to capture 0.1% of the volume in 1, 2,
and 3 dimensions.]
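The required edge length 0.001^{1/d} approaches 1 quickly; a tiny check:

    for d in (1, 2, 3, 10, 100):
        print(d, round(0.001 ** (1 / d), 3))
    # prints: 1 0.001, 2 0.032, 3 0.1, 10 0.501, 100 0.933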
• Given a data set with N_l data points from class C_l and
  \sum_l N_l = N, we have

  p(\mathbf{x}) = \frac{L}{N V}

  where L is the number of data points in the region around \mathbf{x}
  and V is the total volume of that region
  (\int_R p(\mathbf{x}) \, d\mathbf{x} \approx p(\mathbf{x}) V).

• Correspondingly,

  p(\mathbf{x} \mid C_l) = \frac{L_l}{N_l V}

  where L_l is the number of data points from class C_l in the region
  around \mathbf{x}.

• Since p(C_l) = N_l / N, Bayes' theorem gives

  p(C_l \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid C_l)\, p(C_l)}{p(\mathbf{x})} = \frac{L_l}{L}
• A class label corresponds to a set of points which belong to some
  region R in the feature space. If you draw sample points from the
  actual probability distribution p(\mathbf{x}) independently, then
  the probability of drawing a sample from that region is

  P = \int_R p(\mathbf{x}) \, d\mathbf{x}

• What if you have N points? The probability that L of those N
  points fall in the region R follows the binomial distribution,

  \mathrm{Prob}(L) = \binom{N}{L} P^L (1 - P)^{N - L}
• As N → ∞ this distribution becomes sharply peaked, so the
  probability P can be approximated by the mean value L/N. An
  additional approximation is that the probability density is roughly
  constant over R, so that the integral can be approximated by

  P = \int_R p(\mathbf{x}) \, d\mathbf{x} \approx p(\mathbf{x}) V

  where V is the total volume of the region. Under these
  approximations, p(\mathbf{x}) \approx \frac{L}{N V}.
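A quick numeric illustration of the "sharply peaked" claim, with arbitrary values N = 5000 and P = 0.001 (so the mean count is N·P = 5):

    from math import comb

    N, P = 5000, 0.001
    probs = [comb(N, L) * P**L * (1 - P)**(N - L) for L in range(21)]
    print(max(range(21), key=lambda L: probs[L]))  # 5, the most likely count
    print(round(sum(probs[3:8]), 2))               # ~0.74 of all probability mass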
• Now, if we had several classes, we could repeat the same analysis
  for each one, which gives

  p(\mathbf{x} \mid C_l) = \frac{L_l}{N_l V}

  where L_l is the number of points from class C_l which fall within
  that region and N_l is the total number of points belonging to
  class C_l. Notice that \sum_l N_l = N.
• Repeating the analysis with the binomial distribution, it is easy
  to see that we can estimate the prior as p(C_l) = N_l / N.
• Using Bayes' rule,

  p(C_l \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid C_l)\, p(C_l)}{p(\mathbf{x})} = \frac{L_l}{L}
