CSCE-421 Machine Learning

2. KNN

Instructor: Guni Sharon, classes: TR 3:55-5:10, HRBB 124

1


Assignments
• Overdue:
• Programming assignment 0: Python Warmup (due Monday, Sep 6)
• Due:
• Written assignment 1: K Nearest Neighbors + Linear algebra (due Monday,
Sep 13)

2
Classification problem
• Find a hypothesis h, from the set of possible hypotheses H, that
minimizes the classification error on the provided data set D
• That’s easy: a lookup table that memorizes D achieves zero training error

• We want to generalize, i.e., minimize the expected classification
error on samples from the true distribution, P(X,Y)
• That is, find h* = argmin_{h∈H} E_{(x,y)∼P}[ 1[h(x) ≠ y] ]
• Assuming our data set D comes from the same distribution as P
• K-fold cross validation estimates this expected error (see the sketch below)

3
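A minimal numpy sketch of k-fold cross validation for estimating the expected error (the classifier interface with `fit`/`predict` is an assumption, not from the slides):

```python
import numpy as np

def k_fold_error(model, X, y, k=5, seed=0):
    """Average validation error over k train/validation splits."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model.fit(X[train], y[train])        # assumed sklearn-style interface
        errors.append(np.mean(model.predict(X[val]) != y[val]))
    return np.mean(errors)
```

Any classifier exposing fit/predict (e.g., a K-NN implementation like the sketch later in these slides) can be plugged in.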
The best we can hope for
• If we knew P(X,Y), specifically P(y|x)
• We can easily compute the most likely label, denoted y* = argmax_y P(y|x)

• AKA, the Bayes optimal classifier: h_opt(x) = argmax_y P(y|x)


• What is the expected error rate (binary loss) for h_opt?

4
The best we can hope for
• If we knew P(X,Y), specifically P(y|x)
• We can easily compute the most likely label, denoted y* = argmax_y P(y|x)

• AKA, the Bayes optimal classifier: h_opt(x) = argmax_y P(y|x)


• What is the expected error rate (binary loss) for h_opt?

• Lower bound of the error rate: ε_BayesOpt = 1 − P(y*|x)


• With the same feature representation, no classifier can obtain better accuracy

5
Optimal error rate
• Assume that we have a classifier for predicting if it is going to rain
tomorrow
• Assume that if the atmospheric pressure = 0.95 bar, there is a
probability of 0.8 that it will rain the next day
• What is the best prediction we can make if we know that the atmospheric
pressure = 0.95 bar and don’t know anything else?
• What is our expected error rate for that prediction? (worked out below)
• Suboptimality of a classification model can be defined with respect to
this optimal error rate

6
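A worked answer to the quiz above, assuming only the stated probability P(rain | pressure = 0.95) = 0.8:

```latex
y^{*} = \arg\max_{y} P(y \mid \text{pressure}=0.95) = \text{rain},
\qquad
\epsilon_{\mathrm{opt}} = 1 - P(y^{*} \mid \text{pressure}=0.95) = 1 - 0.8 = 0.2
```

So even the Bayes optimal classifier is wrong 20% of the time here; a model's suboptimality is measured against that floor.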
Worst-case accuracy
• The best accuracy we can hope for is that of the Bayes optimal
classifier
• What is the baseline accuracy that we should never underperform?
• That of a constant classifier, which predicts the same label regardless of
the feature vector
• The best constant in classification is the most common label in the
training set
• You should show that your classifier outperforms the best constant
classifier (see the baseline sketch below)
Best constant classifier error: 1 − P(ŷ | ·), where ŷ is the most common label
Your classifier error: should fall between these two
Bayes optimal classifier error: 1 − P(y* | x)
7
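A minimal numpy sketch of this constant baseline (array names are illustrative):

```python
import numpy as np

def constant_baseline(y_train, num_test):
    """Predict the most common training label for every test point."""
    labels, counts = np.unique(y_train, return_counts=True)
    return np.full(num_test, labels[np.argmax(counts)])

# A classifier is only interesting if it beats this accuracy
y_train = np.array([0, 0, 1, 0, 1, 0])      # most common label: 0
y_test = np.array([0, 1, 0, 0])
pred = constant_baseline(y_train, len(y_test))
print("baseline accuracy:", np.mean(pred == y_test))   # 0.75
```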
An intuitive classification model
• P(X,Y) is unknown, but a data set D is provided
• Generalize beyond the observed data…
• Guess the class of the unknown point
• What do you assume about the data?

8
K-nearest neighbors
• Assign the majority class among the
k-nearest neighbors
• K=2 ?
• K=17 ?

9
10
K-nearest neighbors
• Assign the majority class among the
k-nearest neighbors
• K=2 ?
• K=17 ?

• How do we define “nearest” to x?

• Obviously, the k points with the smallest distance to x… but how do we define

the distance function dist(x, x′)? (see the sketch below)

11
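A minimal numpy sketch of k-NN classification as described above (Euclidean distance by default; names are illustrative):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    """Return the majority label among the k nearest training points."""
    dists = np.linalg.norm(X_train - x_query, axis=1)   # Euclidean distances
    nearest = np.argsort(dists)[:k]                      # indices of the k nearest
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy usage
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.9, 0.8], [1.0, 1.0]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.2, 0.1]), k=3))   # -> 0
```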
Distance/similarity metric
• Minkowski distance: dist_p(x, x′) = ( Σ_i |x_i − x′_i|^p )^(1/p)

• p is a model (K-NN) parameter (see the sketch below)


• p = 1 (Manhattan distance)… why “Manhattan”?
• p = 2 (Euclidean distance)

12
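A small sketch of the Minkowski distance with p as the parameter (function name is illustrative):

```python
import numpy as np

def minkowski_dist(x, z, p=2):
    """Minkowski distance: (sum_i |x_i - z_i|^p)^(1/p)."""
    return np.sum(np.abs(x - z) ** p) ** (1.0 / p)

x, z = np.array([0.0, 0.0]), np.array([3.0, 4.0])
print(minkowski_dist(x, z, p=1))   # 7.0 (Manhattan)
print(minkowski_dist(x, z, p=2))   # 5.0 (Euclidean)
```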
Minkowski distance quiz
• Which blue circle is closest to the red x?
[Figure: a red x and two blue circles, labeled 1 and 2]
13
Feature scaling
• Distance based algorithms are affected by the scale of the variables
Index   Age (X)   Income (X)   House owner (Y)
1       25        45,000       0
2       34        56,000       1
3       27        90,000       0
4       23        51,000       0
5       42        105,000      1

• Which x seems most similar to ?


• Which x is closest to (Euclidean distance)?
14
Feature scaling - Normalization
• Avoid bias towards variables with higher magnitude
• Standard score: z = (x − μ) / σ
• Represents the feature value in units of standard deviations from the mean
• Works well for populations that are normally distributed
• Min-max feature scaling: x′ = (x − min) / (max − min)
• Also called unity-based normalization
• Maps all feature values into the [0, 1] range (see the sketch below)

15
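A minimal numpy sketch of both scalers, applied column-wise (names are illustrative):

```python
import numpy as np

def standard_score(X):
    """Z-score each column: (x - mean) / std."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def min_max_scale(X):
    """Map each column into [0, 1]: (x - min) / (max - min)."""
    return (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Age and income from the table above
X = np.array([[25, 45_000], [34, 56_000], [27, 90_000],
              [23, 51_000], [42, 105_000]], dtype=float)
print(np.round(min_max_scale(X), 2))   # approximately matches the table on the next slide
```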
Feature scaling
• Min-max feature scaling
Original:
Index   Age (X)   Income (X)   House owner (Y)
1       25        45,000       0
2       34        56,000       1
3       27        90,000       0
4       23        51,000       0
5       42        105,000      1

Min-max scaled:
Index   Age (X)   Income (X)   House owner (Y)
1       0.1       0            0
2       0.58      0.18         1
3       0.21      0.75         0
4       0         0.1          0
5       1         1            1

• Which x seems most similar to ?


• Which x is closest to (Euclidean distance)?
16
Accuracy guarantees for K-NN
• Cover and Hart 1967: as the number of samples n → ∞, the 1-NN error for binary
classification is no more than twice the error of the Bayes optimal classifier

17
Accuracy guarantees for 1-NN
• Let x_NN be the nearest neighbor of our test point x
• As n → ∞, dist(x_NN, x) → 0, i.e., x_NN → x
• This means the nearest neighbor is (in the limit) identical to x
• 1-NN returns the label of x_NN
• What is the probability that y_NN ≠ y, the true label of x (binary classification)?

1. P(error) = P(y*|x)·(1 − P(y*|x_NN)) + (1 − P(y*|x))·P(y*|x_NN)
2. At x_NN = x, this equals 2·P(y*|x)·(1 − P(y*|x)) ≤ 2·(1 − P(y*|x))

18
Accuracy guarantees for 1-NN
• Let x_NN be the nearest neighbor of our test point x
• As n → ∞, dist(x_NN, x) → 0, i.e., x_NN → x
• This means the nearest neighbor is (in the limit) identical to x
• 1-NN returns the label of x_NN
• What is the probability that y_NN ≠ y, the true label of x (binary classification)?

1. P(error) = P(y*|x)·(1 − P(y*|x_NN)) + (1 − P(y*|x))·P(y*|x_NN)
2. At x_NN = x, this equals 2·P(y*|x)·(1 − P(y*|x)) ≤ 2·(1 − P(y*|x)) (derivation sketch below)

Good news: K-NN has impressive accuracy guarantees
Bad news: only for a large enough n. Q: How large?
19
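A compact version of the argument behind the factor-of-two bound (a sketch, written for the asymptotic setting above where x_NN → x):

```latex
\epsilon_{1\text{-}\mathrm{NN}}
  = P(y^{*}\mid x)\bigl(1 - P(y^{*}\mid x_{\mathrm{NN}})\bigr)
  + \bigl(1 - P(y^{*}\mid x)\bigr)P(y^{*}\mid x_{\mathrm{NN}})
  \;\xrightarrow{\,x_{\mathrm{NN}} \to x\,}\;
  2\,P(y^{*}\mid x)\bigl(1 - P(y^{*}\mid x)\bigr)
  \;\le\; 2\bigl(1 - P(y^{*}\mid x)\bigr)
  = 2\,\epsilon_{\mathrm{BayesOpt}}
```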
Required number of samples
• Assume a 1-dimensional feature vector drawn from [0, 1]
• Accurate predictions require k neighbors within distance ℓ
• Add n samples (uniform distribution)
• An interval of size ℓ covers an ℓ fraction of the state space and should include
n·ℓ of the (uniform) samples in expectation
• So we need n·ℓ ≥ k, i.e., n ≥ k/ℓ samples (e.g., ℓ = 0.01 requires n ≥ 100·k)

[Figure: the unit interval from 0 to 1 with a sub-interval of length ℓ]
20
Required number of samples
• Now consider a 2D feature vector: x ∈ [0, 1]²
• A square with sides of size ℓ covers an ℓ² fraction of the state space
and should include n·ℓ² of the (uniform) samples in expectation,
so n ≥ k/ℓ²

[Figure: a unit square (side 1) containing a small square of side ℓ]
21
Required number of samples
• Now consider a 3D feature vector: x ∈ [0, 1]³
• A cube with sides of size ℓ covers an ℓ³ fraction of the state space and
should include n·ℓ³ of the (uniform) samples in expectation

• For the general case (d dimensions): n ≥ k/ℓ^d, which grows exponentially
with d (see the sketch below)

22
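A tiny sketch of how quickly n ≥ k/ℓ^d blows up (the chosen k and ℓ are illustrative):

```python
# Required samples n >= k / l**d for k neighbors within an l-sided cube
k, l = 10, 0.1        # illustrative: 10 neighbors, cube edge length 0.1
for d in (1, 2, 3, 10, 100):
    print(f"d = {d:3d}  ->  n >= {k / l**d:.3g}")
```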
Curse of dimensionality
[Figure: required number of samples n as a function of the dimension d; n grows exponentially in d]
23
Curse of dimensionality
• Frequency of pairwise Euclidean distances between randomly distributed
points within d-dimensional unit hypercubes
• As d grows, the pairwise distances concentrate around a common value
• K-NN is meaningless for large d (see the simulation sketch below)

24
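A small simulation sketch of this effect (sample sizes and dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100                                    # random points per dimensionality
for d in (2, 10, 100, 1000):
    X = rng.random((n, d))                 # uniform points in the d-dim unit cube
    diffs = X[:, None, :] - X[None, :, :]  # all pairwise differences
    dists = np.linalg.norm(diffs, axis=-1)[np.triu_indices(n, k=1)]
    # a ratio near 1 means the nearest point is barely nearer than the farthest
    print(f"d = {d:4d}   min/max pairwise distance: {dists.min() / dists.max():.2f}")
```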
Curse of dimensionality
• How many features are in a 1.3MP RGB image?
• ~4M dimensions (1.3M pixels × 3 color channels)
• We would need on the order of k/ℓ^4,000,000 samples
• Not enough atoms in the universe
• Say we could get all these samples, what is the computational
complexity of inference?
• Naive 1-NN computes a distance to every stored sample: O(n·d) per test point

• We need to consider other approaches that can scale better


• We will revisit (next class) the linear classifier from Lecture 1

25
What did we learn?
• Classification problem
• Performance bounds (best: the Bayes optimal classifier; worst: the constant
classifier)
• K-nearest neighbors
• Can reach optimal performance with enough samples and a suitable k
• “Enough samples” grows exponentially with the number of dimensions (curse
of dimensionality)
• Computationally intensive for large training sets

26
What next?
• Class:
• linear classifier (perceptron)
• Assignments:
• Written assignment 1: K Nearest Neighbors + Linear algebra (due Monday,
Sep 13)

27
