STAT 451: Introduction To Machine Learning Lecture Notes

STAT 451: Introduction to Machine Learning
Lecture Notes
Sebastian Raschka
Department of Statistics
University of Wisconsin–Madison
Fall 2021
Contents
2 Nearest Neighbor Methods 1
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
2.1.1 Key concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
2.1.2 Nearest Neighbor Classification In Context . . . . . . . . . . . . . . . 2
2.1.3 Common Use Cases of k NN . . . . . . . . . . . . . . . . . . . . . . . . 3
2.2 Nearest Neighbor Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.3 Nearest Neighbor Decision Boundary . . . . . . . . . . . . . . . . . . . . . . . 4
2.4 k -Nearest Neighbor Classification and Regression . . . . . . . . . . . . . . . . 5
2.4.1 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.4.2 Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.5 Curse of Dimensionality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.6 Computational Complexity and the Big-O Notation . . . . . . . . . . . . . . 8
2.6.1 Big-O of k NN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.7 Improving Computational Performance . . . . . . . . . . . . . . . . . . . . . . 10
2.7.1 Naive k NN Algorithm in Pseudocode . . . . . . . . . . . . . . . . . . . 10
2.7.2 Data Structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.7.3 Dimensionality Reduction . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.7.4 Faster Distance Metric/Heuristic . . . . . . . . . . . . . . . . . . . . . 12
2.7.5 “Pruning” . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.7.6 Parallelizing k NN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.8 Distance measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.8.1 Discrete Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.8.2 Feature Weighting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.9 Distance-weighted k NN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.10 Improving Predictive Performance . . . . . . . . . . . . . . . . . . . . . . . . 16
2.11 Error Bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.12 k NN from a Bayesian Perspective . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.13 Advantages and Disadvantages of Using k NN . . . . . . . . . . . . . . . . . . 20
2.14 Other Forms of Instance-based Learning . . . . . . . . . . . . . . . . . . . . . 20
2.14.1 Locally Weighted Regression . . . . . . . . . . . . . . . . . . . . . . . 20
2.14.2 Kernel Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2
STAT 451: Introduction to Machine Learning
Lecture Notes
Sebastian Raschka
Department of Statistics
University of Wisconsin–Madison
Fall 2021
2 Nearest Neighbor Methods
2.1 Introduction
Nearest neighbor algorithms are among the “simplest” supervised machine learning algo-
rithms and have been well studied in the field of pattern recognition over the last century.
While nearest neighbor algorithms are not as popular as they once were, they are still widely
used in practice, and I highly recommend that you are at least considering the k-Nearest
Neighbor algorithm in classification projects as a predictive performance benchmark – es-
pecially, when you are trying to develop more sophisticated models.
In this lecture, we focus mostly on two algorithms, the Nearest Neighbor (NN) algorithm
and the k -Nearest Neighbor (k NN) algorithm. NN is just a special case of k NN where k = 1.
To avoid making this text unnecessarily convoluted, we only use the abbreviation NN if we
talk about concepts that do not apply to k NN in general. Otherwise, we will use k NN to
refer to nearest neighbor algorithms in general, regardless of the value of k.
2.1.1 Key concepts
While k NN is a universal function approximator under certain conditions, the underlying

concept is relatively simple. k NN is an algorithm for supervised learning that simply stores
the labeled training examples,
hx[i] , y [i] i 2 D (|D| = n), (1)
during the training phase. For this reason, k NN is also called a lazy learning algorithm.
What it means to be a lazy learning algorithm is that the processing of the training examples
is postponed until making predictions 1 – again, the training consists literally of just storing
the training data.
Then, to make a prediction (class label or continuous target), a trained kNN model finds
the k nearest neighbors of a query point and computes the class label (classification) or
1 When you are reading recent literature, note that the prediction step is now often called ”inference” in
the machine learning community. Note that in the context of machine learning, the term ”inference” is not
to be confused with how we use the term ”inference” in statistics, though.
a
c
? ?
Sebastian Raschka STAT451 FS21. L02: Nearest Neighbor Methods Page 2
b
continuous target (regression) based on the k nearest (most “similar”) points. The exact
mechanics will be explained in the next sections. However, the overall idea is that instead of
approximating the target function f (x) = y globally, during each prediction, k NN approxi-
mates the target function locally. In practice, it is easier to learn to approximate a function
locally than globally.
? ?
Figure 1: Illustration of the nearest neighbor classification algorithm in two dimensions (features
x1 and x2 ). In the left subpanel, the training examples are shown as blue dots, and a query point
that we want to classify is shown as a question mark. In the right subpanel, the class labels are
annotated via blue squares and red triangle symbols. The dashed line indicates the nearest neighbor
of the query point, assuming a Euclidean distance metric. The predicted class label is the class
label of the closest data point in the training set (here: class 0, i.e., blue square).
2.1.2 Nearest Neighbor Classification In Context
The previous lecture notes document introduced di↵erent kinds of categorization schemes,
which may be helpful for understanding and distinguishing di↵erent types of machine learn-
ing algorithms.
To recap, the categories we discussed were
C
• eager vs lazy;
• batch vs online;
B
• parametric vs nonparametric;
A
• discriminative vs generative.
Since k NN does not have an explicit training step and defers all of the computation until
prediction, we already determined that k NN is a lazy algorithm.
Further, instead of devising one global model or approximation of the target function, for
each di↵erent data point, there is a di↵erent local approximation, which depends on the
data point itself as well as the training data points. Since the prediction is based on a D
comparison of a query point with data points in the training set (rather than Ba global
model), k NN is also categorized as instance-based (or “memory-based”) method. While
k NN is a lazy instance-based learning algorithm, an example of an eager instance-based
learning algorithm would be the support vector machine (which is not covered in this course
due to time constraints)
Lastly, because we do not make any assumption about the functional form of the k NN
algorithm, a k NN model is also considered a nonparametric model. However, categorizing
k NN as either discriminative or generative is not as straightforward as for other algorithms.
Under certain assumptions, we can estimate the conditional probability that a given data
point belongs to a given class as well as the marginal probability for a feature given a training
dataset (more details are provided in the section on “k NN from a Bayesian Perspective”
later). However, since k NN does not explicitly try to model the data generating process
but models the posterior probabilities, p(f (x) = i|x), directly, k NN is usually considered a
discriminative model.
2.1.3 Common Use Cases of k NN
While neural networks are gaining popularity in the computer vision and pattern recognition
field, one area where k -nearest neighbors models are still commonly and successfully being
used is in the intersection between computer vision, pattern classification, and biometrics
(e.g., to make predictions based on extracted geometrical features2 ).
Other common use cases include recommender systems (via collaborative filtering3 ) and
outlier detection4 .
2.2 Nearest Neighbor Algorithm
After introducing the overall concept of the nearest neighbor algorithms, this section provides
a more formal or technical description of the 1-nearest neighbor (NN) algorithm.
Training algorithm:
for i = 1, ..., n in the n-dimensional training dataset D (|D| = n):
• store training example hx[i] , y [i] i
Prediction algorithm 5 :
closest point := None
closest distance := 1
• for i = 1, ..., n:
– current distance := d(x[i] , x[q] )

– if current distance < closest distance:
⇤ closest distance := current distance
⇤ target value of closest point := y [i]
The prediction produced by the 1NN model,h(x[q] ), is the target value of the closest point.
Unless noted otherwise, the default distance metric (in the context of this lecture) of nearest
neighbor algorithms is the Euclidean distance (also called L2 distance), which computes the
distance between two points, x[a] and x[b] :
v
uX
u m ⇣ [a] [b]
⌘2
d(x , x ) = t
[a] [b]
xj xj . (2)
j=1
2 Asmaa Sabet Anwar, Kareem Kamal A Ghany, and Hesham Elmahdy. “Human ear recognition using
geometrical features extraction”. In: Procedia Computer Science 65 (2015), pp. 529–537.
3 Youngki Park et al. “Reversed CF: A fast collaborative filtering algorithm using a k-nearest neighbor
graph”. In: Expert Systems with Applications 42.8 (2015), pp. 4022–4028.
4 Guilherme O Campos et al. “On the evaluation of unsupervised outlier detection: measures, datasets,
and an empirical study”. In: Data Mining and Knowledge Discovery 30.4 (2016), pp. 891–927.
5 We use ”:=” as an assignment operator.
2.3 Nearest Neighbor Decision Boundary
In this section, we build some intuition for the decision boundary of the NN classification
model. Assuming a Euclidean distance metric, the decision boundary between any two
training examples a and b is a straight line. If a query point is located on the decision
boundary, this means its equidistant from both training example a and b.
While the decision boundary between a pair of points is a straight line, the decision boundary
of the NN model on a global level, considering the whole training set, is a set of connected,
convex polyhedra. All points within a polyhedron are closest to the training example inside,
and all points outside the polyhedron are closer to a di↵erent training example.
a a c
Figure 2: Illustration of the plane-partitioning of a two-dimensional dataset (features x1 and x2 )

via linear segments between two training examples (a & b, a & c, and c & d) and the resulting
Voronoi diagram (lower right corner).
This partitioning of regions on a plane in 2D is also called “Voronoi diagram” or “Voronoi

tessellation.” (You may remember from geometry classes that given a discrete set of points,
a Voronoi diagram can also be obtained by a process known as Delaunay triangulation6 , i.e.,
by connecting the centers of the circumcircles.)
While each linear segment is equidistant from two di↵erent training examples, a vertex (or
node) in the Voronoi diagram is equidistant to three training examples. Then, to draw the
decision boundary of a two-dimensional nearest neighbor classifier, we take the union of the
pair-wise decision boundaries of instances of the same class.
6 https://en.wikipedia.org/wiki/Delaunay˙triangulation.
Figure 3: Illustration of the nearest neighbor decision boundary as the union of the polyhedra of
training examples belonging to the same class.
2.4 k -Nearest Neighbor Classification and Regression
Previously, we described the NN algorithm, which makes a prediction by assigning the class
label or continuous target value of the most similar training example to the query point
(where similarity is typically measured using the Euclidean distance metric for continuous
features).
Instead of basing the prediction of the single, most similar training example, k NN considers
the k nearest neighbors when predicting a class label (in classification) or a continuous target
value (in regression).
2.4.1 Classification
In the classification setting, the common k NN model is to predicts the target class label as
the class label that is most often represented among the k most similar training examples
for a given query point. In other words, the class label can be considered as the “mode” of
the k training labels or the outcome of a “plurality voting.” Note that in literature, k NN
classification is often described as a “majority voting.” While the authors usually mean
the right thing, the term “majority voting” usually refers to a reference value of >50%
for making a decision7 . In the case of binary predictions (classification problems with two
classes), there is always a majority or a tie. Hence, a majority vote is also automatically
a plurality vote. However, in multi-class settings, we do not require a majority to make a
prediction via k NN. For example, in a three-class setting a frequency > 13 ( approx 33.3%)
could already enough to assign a class label.
7 Interestingly, the term “plurality vote” (in North America) is called ”relative majority” in the United
Kingdom, according to https://en.wikipedia.org/wiki/Plurality (voting)

A
y:
Majority vote:
Plurality vote:
B
y:
Majority vote: None
Plurality vote:
Figure 4: Illustration of plurality and majority voting.
Remember that the NN prediction rule (recall that we defined NN as the special case of
k NN with k = 1) is the same for both classification or regression. However, in k NN we have
two distinct prediction algorithms:
• Plurality voting among the k nearest neighbors for classification.

• Averaging the continuous target variables of the k nearest neighbors for regression.
More formally, assume we have a target function f (x) = y that assigns a class label y 2
{1, . . . , t} to a training example,
f : Rm ! {1, ..., t}. (3)
(Usually, we use the letter k to denote the number of classes in this course, but in the context
of kNN, it would be too confusing.)
Assuming we identified the k nearest neighbors (Dk ✓ D) of a query point x[q] ,
Dk = {hx[1] , f x[1] i, . . . , hx[k] , f x[k] i}, (4)
we can define the k NN hypothesis as
k
X
h(x[q] ) = arg max (y, f (x[i] )). (5)
y2{1,...,t}
i=1
Here, denotes the Kronecker Delta function
(
1, if a = b,
(a, b) = (6)
0, 6 b.
if a =
Or, in simpler notation, if you remember the “mode” from introductory statistics classes:
h(x[t] ) = mode f x[1] , . . . , f x[k] . (7)
A common distance metric to identify the k nearest neighbors Dk is the Euclidean distance
measure, v
uX
u m ⇣ [a] [b]
⌘2
d(x , x ) = t
[a] [b]
xj xj , (8)
j=1
which is a pairwise distance metric that computes the distance between two data points x[a]
and x[b] over the m input features.
Figure 5: Illustration of k NN for a 3-class problem with k = 5.
2.4.2 Regression
The general concept of k NN for regression is the same as for classification: first, we find
the k nearest neighbors in the dataset; second, we make a prediction based on the labels
of the k nearest neighbors. However, in regression, the target function is a real- instead of
discrete-valued function,
f : Rm ! R. (9)
A common approach for computing the continuous target is to compute the mean or average
target value over the k nearest neighbors,
k
[t] 1X
h x = f x[i] . (10)
k i=1
As an alternative to averaging the target values of the k nearest neighbors to predict the
label of a query point, it is also not uncommon to use the median instead.
2.5 Curse of Dimensionality
The k NN algorithm is particularly susceptible to the curse of dimensionality 8 . In machine

learning, the curse of dimensionality refers to scenarios with a fixed number of training ex-
amples but an increasing number of dimensions and range of feature values in each dimension
in a high-dimensional feature space.
In k NN an increasing number of dimensions becomes increasingly problematic because the
more dimensions we add, the larger the volume in the hyperspace needs to be to capture a
8 David L Donoho et al. “High-dimensional data analysis: The curses and blessings of dimensionality”.
In: AMS math challenges lecture 1.2000 (2000), p. 32.

fixed number of neighbors. As the volume grows larger and larger, the “neighbors” become
less and less “similar” to the query point as they are now all relatively distant from the query
point considering all di↵erent dimensions that are included when computing the pairwise
distances.
For example, consider a single dimension with unit length (range [0, 1]). Now, if we consider
100 training examples that are uniformly distributed, we expect one training example located
at each 0.01th unit along the [0, 1] interval or axis. So, to consider the three nearest neighbors
of a query point, we expect to cover 3/100 of the feature axis. However, if we add a second
dimension, the expected interval length that is required to include the same amount of data
(3 neighbors) now increases to 0.031/2 (we now have a unit rectangle). In other words,
instead of requiring 0.03 ⇥ 100% = 3% of the space to include 3 neighbors in 1D, we now
need to consider 0.031/2 ⇥ 100% = 17.3% of a 2D space to cover the same amount of data
points – the density decreases with the number of dimensions. In 10 dimensions, that’s
now 0.031/10 = 70.4% of the hypervolume we need to consider to include three neighbors
on average. You can see that in high dimensions we need to take a large portion of the
hypervolume into consideration (assuming a fixed number of training examples) to find k
nearest neighbors, and then these so-called “neighbors” may not be particularly “close” to
the query point anymore.
2.6 Computational Complexity and the Big-O Notation
The Big-O notation is used in both mathematics and computer science to study the asymp-
totic behavior of functions, i.e., the asymptotic upper bounds. In the context of algorithms
in computer science, the Big-O notation is most commonly used to measure the time com-
plexity or runtime of an algorithm for the worst case scenario. (Often, it is also used to
measure memory requirements.)
Since Big-O notation and complexity theory, in general, are areas of research in computer
science, we will not go into too much detail in this course. However, you should at least be
familiar with the basic concepts, since it is an essential component for the study of machine
learning algorithms.
Notation Name
O(1) Constant
O(log n) Logarithmic
O(n) Linear
O(n log n) Log Linear
O(n2 ) Quadratic
O(n3 ) Cubic
O(nc ) Higher-level polynomial
O(2n ) Exponential
Table 1: Most common Big-O runtime notations listed in increasing order (slowest-growing func-
tions first).
Figure 6: An illustration of the growth rates of common functions.
Note that in “Big-O” analysis, we only consider the most dominant term, as the other terms
and constants become insignificant asymptotically. For example, consider the function
f (x) = 14x2 10x + 25. (11)
The worst case complexity of this function is O(x2 ), since x2 is the dominant term.
Next, consider the example
f (x) = (2x + 8) log2 (x + 9). (12)
In “Big-O” notation, that is O(x log x). Note that it does not need to distinguish between
di↵erent bases of the logarithms, e.g., log10 , or log2 , since we can regard these just as a
scalar factor given the conversion
log2 (x) = log10 (x)/ log10 (2), (13)
1
where log10 (2) is just a scaling factor.
Lastly, consider this naive example of implementing matrix multiplication in Python:
A = [[1, 2, 3],
[2, 3, 4]]
B = [[5, 8],
[6, 9],
[7, 10]]
def matrixmultiply (A, B):
C = [[0 for row in range(len(A))]

for col in range(len(B[0]))]
for row_a in range(len(A)):

for col_b in range(len(B[0])):
for col_a in range(len(A[0])):
C[row_a][col_b] += \
A[row_a][col_a] * B[col_a][col_b]
return C
matrixmultiply(A, B)
Result:
[[38, 56],
[56, 83]]
Due to the three nested for -loops, the runtime complexity of this function is O(n3 ).
2.6.1 Big-O of k NN
For the brute-force neighbor search of the k NN algorithm, we have a time complexity of
O(n ⇥ m), where n is the number of training examples and m is the number of dimensions in
the training set. For simplicity, assuming n m, the complexity of the brute-force nearest
neighbor search is O(n). In the next section, we will briefly go over a few strategies to
improve the runtime of the k NN model.
2.7 Improving Computational Performance
2.7.1 Naive k NN Algorithm in Pseudocode
Below are two naive approaches (Variant A and Variant B) for finding the k nearest neighbors
of a query point x[q] .
Variant A
Dk := {}
while |Dk | < k:
• closest distance := 1
• for i = 1, ..., n, 8i 2
/ Dk :

– if current distance < closest distance:
⇤ closest distance := current distance
⇤ closest point := x[i]
• add closest point to Dk
Variant B
Dk := D
while |Dk | > k:
• largest distance := 0
• for i = 1, ..., n 8i 2 Dk :

– if current distance > largest distance:
⇤ largest distance := current distance
⇤ farthest point := x[i]
• remove farthest point from Dk
Using a Priority Queue

Both Variant A and Variant B are expensive algorithms, O(k ⇥ n) and O((n k) ⇥ n),
respectively . However, with a simple trick, we can improve the nearest neighbor search to
O(k log(n)). For instance, we could implement a priority queue using a heap data structure
9
.
We initialize the heap with the k arbitrary points from the training dataset based on their
distances to the query point. Then, as we iterate through the dataset to find the first
nearest neighbor of the query point, at each step, we make a comparison with the points
and distances in the heap. If the point with the largest stored distance in the heap is
farther away from the query point that the current point under consideration, we remove
the farthest point from the heap and insert the current point. Once we finished one iteration
over the training dataset, we now have a set of the k nearest neighbors.
2.7.2 Data Structures
Di↵erent data structures have been developed to improve the computational performance
of k NN during prediction. In particular, the idea is to be smarter about identifying the k
nearest neighbors. Instead of comparing each training example in the training set to a given
query point, approaches have been developed to partition the search space most efficiently.
The details of these data structures are beyond the scope of this lecture since they require
some background in computer science and data structures, but interested students are en-
couraged to read the literature referenced in this section.
Bucketing
The simplest approach is “bucketing”10 . Here, we divide the search space into identical,
similarly-sized cells (or buckets), that resemble a grid (picture a 2D grid 2-dimensional
hyperspace or plane).
KD-Tree
A KD-Tree11 , which stands for k -dimensional search tree, is a generalization of binary
search trees. KD-Trees data structures have a time complexity of O(log(n)) on average (but
O(n) in the worst case) or better and work well in relatively low dimensions. KD-Trees
also partition the search space perpendicular to the feature axes in a Cartesian coordinate
system. However, with a large number of features, KD-Trees become increasingly inefficient,
and alternative data structures, such as Ball-Trees, should be considered.12
Ball-Tree
In contrast to the KD-Tree approach, the Ball-Tree13 partitioning algorithms are based on
the construction of hyperspheres instead of cubes. While Ball-Tree algorithms are generally
9 A heap is a special case of a binary search tree with a structure that makes lookups more efficient.
You are not expected to now how heaps work in the exam, but you are encouraged to learn more about
this data structure. A good overview is provided on Wikipedia with links to primary sources: https:
//en.wikipedia.org/wiki/Heap %28data structure%29
10 Ronald L Rivest. “On the Optimality of Elia’s Algorithm for Performing Best-Match Searches.” In:
IFIP Congress. 1974, pp. 678–681.

11 Jon Louis Bentley. “Multidimensional binary search trees used for associative searching”. In: Commu-
nications of the ACM 18.9 (1975), pp. 509–517.

12 Note that software implementations such as the ighborsClassifier in the Scikit-learn library has a
”method=’auto’” default setting that chooses the most appropriate data structure automatically.
13 Stephen M Omohundro. Five balltree construction algorithms. International Computer Science Institute
Berkeley, 1989.
more expensive to run than KD-Trees, the algorithms address some of the shortcomings of
the KD-Tree datastructure and are more efficient in higher dimensions.
Note that these data structures or space partitioning algorithms come each with their own
set of hyperparameters (e.g., the leaf size, or settings related to the leaf size). Detailed
discussions of the di↵erent data structures for efficient data structures are beyond the scope
of this class.
2.7.3 Dimensionality Reduction
Next, to help reduce the e↵ect of the curse of dimensionality, dimensionality reduction strate-
gies are also useful for speeding up the nearest neighbor search by making the computation
of the pair-wise distances “cheaper.” There are two approaches to dimensionality reduction:
• Feature Selection (e.g., Sequential Forward Selection)
• Feature Extraction (e.g., Principal Component Analysis)
We will cover both feature selection and feature extraction as separate topics later in this
course.
2.7.4 Faster Distance Metric/Heuristic
k NN is compatible with any pairwise distance metric. However, the choice of the distance
metric a↵ects the runtime performance of the algorithm. For instance, computing the Maha-
lanobis distance is much more expensive than calculating the more straightforward Euclidean
distance.
2.7.5 “Pruning”
There are di↵erent kinds of “pruning” approaches that we could use to speed up the k NN
algorithm. For example, editing and prototype selection.
Editing
In edited k NN, we permanently remove data points that do not a↵ect the decision boundary.
For example, consider a single data point (aka “outlier”) surrounded by many data points
from a di↵erent class. If we perform a k NN prediction, this single data point will not
influence the class label prediction in plurality voting; hence, we can safely remove it.
Figure 7: Illustration of k NN editing, where we can remove points from the training set that do
not influence the predictions. For example, consider a 3NN model. On the left, the two points
enclosed in dashed lines would not a↵ect the decision boundary as “outliers.” Similarly, points of
the “right” class that are very far away from the decision boundary, as shown in the right subpanel,
do not influence the decision boundary and hence could be removed for efficiency concerning data
storage or the number of distance computations.
Prototypes
Another strategy (somewhat related to KMeans, a clustering algorithm that we will cover
towards the end of this course), is to replace selected data points by prototypes that sum-
marize multiple data points in dense regions.
2.7.6 Parallelizing k NN
k NN is one of these algorithms that are very easy to parallelize. There are many di↵erent
ways to do that. For instance, we could use distributed approaches like map-reduce and place
subsets of the training datasetsEuclidean
on di↵erent machines for theEuclidean
distance computations. Further,
distance=1
the distance computations themselves can be carried outdistance=1
using parallel computations on
multiple processors via CPUs or GPUs14 .
c c
? ?
a a b
2.8 Distance measures b
There are many distance metrics or measures we can use to select k nearest neighbors. There
is no “best” distance measure, and the choice is highly context- or problem-dependent.
Euclidean
distance=1 Manhattan
distance=1
a
c
? ?
Figure 8: The phrase “nearest” is ambiguous and depends on the distance metric we use.
For continuous features, the probably most common distance metric is the Euclidean dis-
14 For example, the RapidsAI ecosystem provided implementations of machine learning al-
gorithms that are parallelized and executed faster on GPUs; https://medium.com/rapids-ai/
accelerating-k-nearest-neighbors-600x-using-rapids-cuml-82725d56401e
tance. Another popular choice is the Manhattan distance,

m
X
d(x[a] , x[b] ) = x[a] x[b] , (14)
j=1
which emphasizes di↵erences between “distant” feature vectors or outliers less than the
Euclidean distance.
A generalization of the Euclidean or Manhattan distance is the so-called Minkowski distance,
" m ⇣
# p1
X ⌘p
[a] [b] [a] [b]
d(x , x ) = x x , (15)
j=1
which is equal to the Euclidean distance if p = 2 and equal to the Manhattan distanced if
p = 1.
The Mahalanobis distance would be another good choice for a distance metric as it considers
the variance of the di↵erent features as well as the covariance among them. However, one
downside of using more “sophisticated” distance metrics is that it also typically negatively
impacts computational efficiency. For instance, the Mahalanobis distance is substantially
more challenging to implement efficiently, for example, if we consider running k NN in a
distributed-fashion, as it requires the covariances as a scaling term.
2.8.1 Discrete Features
Minkowski-based distance metrics can be used for discrete and continuous features. One
example would be the so-called Hamming distance, which really is just the Manhattan
distance applied to binary feature vectors:
m
X
[a] [b]
d(x , x ) = x[a] x[b] . (16)
j=1
The Hamming distance is also called overlap metric, because it essentially measures how
many positions in two vectors are di↵erent. For instance, considering two vectors
2 3 2 3
0 0
617 617
6 7 6 7
a=6 7
617 , b = 607
6 7 (17)
415 415
1 1
The Hamming distance is one, because the vectors only di↵er in one position.
As we discussed in class, If we are working with vectors containing word counts of documents
and we want to measure the similarity (or distance) between two documents, cosine similarity
could be a metric that is more appropriate than, for example, the Euclidean distance. The
cosine similarity15 is defined as the the dot-product between two vectors normalized by their
magnitude:
a> b
cos(✓) = (18)
||a|| · ||b||
To provide some intuition for why this measure could be more indicative of the similarity
of two document vectors, consider a document a in which we duplicated every sentence a0 .
15 Please note that while cosine similarity can be useful to measure distances in *k*NN, it is not a proper
distance metric as it violates the triangle-inequality, d(a, c)  d(a, b) + d(b, c).

Then, the cosine similarity to another document vector b would be the same for a and a0 ,
which is not true for, for example, the dot product.
While document-word counts were used as an illustrative example above, please do not
worry about the details at this point as we will discover text analysis in a separate lecture
in this course – if time permits.
Also, an important consideration when we are talking about discrete or categorical features
is whether the features “categories” are on an ordinal or nominal scale. We will discuss this
in detail in a feature lecture on “Data Preprocessing.”
Euclidean Euclidean
distance=1 distance=1
c c
? ?
a a b
b
Figure 9: Next, to choosing an appropriate distance metric, feature scaling is another important
consideration, which is illustrated in this figure. The right subpanel illustrates the e↵ect of scaling
the x1 axis by a factor of 2 on finding the nearest neighbor viaEuclidean
Euclidean distance.
distance=1 Manhattan
distance=1
a
2.8.2 Feature Weighting
c
? ?
Furthermore, we can modify distances metrics by adding a weight to each feature dimension,
b
which is equivalent to feature scaling. In the case of the Euclidean distance, this would look
as follows: v
uX ⇣ ⌘2
um [a] [b]
dw (x , x ) = t
[a] [b]
w j xj xj , (19)
j=1
where wj 2 R.
To implement this efficiently in code, we can express the weighting as a transformation
matrix, where the transformation matrix is a diagonal matrix consisting of the m weight
coefficients for the m features:
W 2 Rm⇥m = diag(w1 , w2 , ..., wm ). (20)
In particular, note that we can express the standard Euclidean distance as a dot product of
the distance vector x:
q
d(x[a] , x[b] ) = (x[a] x[b] )T (x[a] x[b] ). (21)
Then, the distance-weighted Euclidean distance can be expressed as a follows:

? ?
q
dw (x[a] , x[b] ) = (x[a] x[b] )T W(x[a] x[b] ). (22)
2.9 Distance-weighted k NN
A variant of k NN is distance-weighted k NN. In “regular” k NN, the all k neighbors participate

similarly in the plurality voting or averaging. However, especially if the radius enclosing a
set of neighbors is large, we may want to give a stronger weights to neighbors that are
“closer” to the query point. For instance, we can assign a weight w to the neighbors in k NN
classification,
k
X
h(x[t] ) = arg max w[i] (j, f (x[i] )). (23)
j2{1,...,p}
i=1
Simiarly, we can define the following equation for k NN regression:

Pk
w[i] f (x[i] )
h(x[t] ) = i=1
Pk . (24)
[i]
i=1 w
As described by Tom Mitchell16 , a popular weighting scheme is using the inverse squared
distance
1
w[i] = , (25)
d(x[i] , x[t] )2
where h(x) = f (x) is used for an exact match. Other strategies include adding a small
constant to the denominator for avoiding zero-division errors.
Also, using a weighting scheme as described above, we can also turn the k NN algorithm into
a global method by considering all data points instead of k, as defined by Shepard17 .
2.10 Improving Predictive Performance
There are several di↵erent ways of how we can improve the predictive performance of the
k NN algorithms:
• Choosing the value of k.
• Scaling of the feature axes.

• Choice of distance measure.
• Weighting of the distance measure.
However, other techniques such as “editing” (removing noisy data points) and so forth also
a↵ect the generalization performance (i.e., the predictive performance on unseen, that is,
non-training data) of k NN.
Choosing between di↵erent settings of an algorithm is also known as “hyperparameter tun-
ing” or “model selection,” which is typically performed by cross-validation.
16 T.M. Mitchell. “Machine Learning”. In: McGraw-Hill International Editions. McGraw-Hill, 1997.
Chap. 8. isbn: 9780071154673. url: https://books.google.com/books?id=EoYBngEACAAJ.

17 Donald Shepard. “A two-dimensional interpolation function for irregularly-spaced data”. In: Proceed-
ings of the 1968 23rd ACM national conference. ACM. 1968, pp. 517–524.
Training Data
Classification
Accuracy Test Data
Model Complexity/
Value of k
Small k Large k
High complexity
Low complexity
High variance
Low variance
Low bias High bias
Figure 10: By changing the value of k, we a↵ect the complexity of a k NN model. In practice,
we try to find a good trade-o↵ between high bias (the model is not complex enough to fit the data
well) and high variance (the model fits the training data too closely). We will discuss overfitting
and the bias-variance trade-o↵ in more detail in future lectures.
In particular, model selection helps us with reducing the e↵ect of the curse of dimensionality
and overfitting to the training data, if done properly. We will discuss model selection and
cross-validation later in this course.
In practice, values of k between 3-15 seem reasonable choices. Also, if we are working on
binary classification problems, it is a good idea to avoid even numbers for k to prevent ties.
Figure 11: An illustration of the e↵ect of changing the value of k on a simple toy dataset.
2.11 Error Bounds
Cover and Hart have proved that the 1NN algorithm is guaranteed to be no worse than twice
the Bayes error18 . The Bayes error is the minimum possible error that can be achieved; we
will discuss the Bayes error in future lectures. We will not cover the proof of the error bounds
in class, but interested students are encouraged to read more in Chapter 4 (Nonparametric
18 Thomas Cover and Peter Hart. “Nearest neighbor pattern classification”. In: IEEE transactions on
information theory 13.1 (1967), pp. 21–27.

Techniques) of Duda, Hart, and Stork’s Pattern Classification book19 .
2.12 k NN from a Bayesian Perspective
As a statistics student, you may be interested in looking at k NN from a Bayesian perspective.

Duda, Hard, and Stork provide an excellent theoretical foundation together with helpful
illustrations in their book “Pattern Classification20 ” (Chapter 4, Section 4.4) – this is out of
the scope of this class but recommended if you are interested. If you are not familiar with
Bayes theorem, yet (no worries if you are not, we will cover it later in this course), do not
worry about it, and please feel free to skip this section.
In my attempt below, I try to describe the k NN concept from a Bayesian perspective using
our familiar notation.
First, recall Bayes’ theorem
P (B|A) ⇥ P (A)
P (A|B) = . (26)
P (B)
In our context, that is
Likelihood ⇥ Prior Prob.
Posterior Prob. = (27)
Marginal Prob.
or
p(x|y = i) ⇥ p(y = i)
P (y = i|x) = . (28)
p(x)
Suppose we have a dataset D with n data points and t classes. Di denotes the subset of
data points with class label y = i, i 2 {1, ..., t}.
Now, to classify a new data point x, we draw a tight sphere around the k nearest neighbors
of x. This sphere has the volume V .
Then
ki
p x|y = i = (29)
|Di | ⇥ V
is a density estimate of class i, where ki denote the k neighbors with class label i. The
marginal probability, p(x) can then be computed as
k
p(x) = . (30)
|D| ⇥ V
The class priors are then computed as
|Di |
p(y = i) = . (31)
|D|
Combining the equations, we get the posterior probability that the data point x belongs to
class i:
p(x|y = i) ⇥ p(y = i) ki
P (y = i|x) = = . (32)
p(x) k
Stable approximation under the assumption that
|D|, k ! 1 (33)
and
k
! 0. (34)
|D|
19 Richard O Duda, Peter E Hart, and David G Stork. Pattern classification. John Wiley & Sons, 2012.
20 Duda, Hart, and Stork, Pattern classification.
2.13 Advantages and Disadvantages of Using k NN
One of the most significant advantages of k NN is that it is relatively easy to implement and
interpret. Also, with its approach to approximate complex global functions locally, it can
be a powerful predictive model.
The downsides are that k NN is very sensitive to the curse of dimensionality and expensive
to compute with a O(n) prediction step – however, smart implementations and use of data
structures such as KD-trees and Ball-trees can make k NN substantially more efficient.
Compared to other machine learning algorithms, the basic k NN algorithm has relatively few
hyperparameters, namely k and the distance metric; however, the choice of an appropriate
distance metric is not always obvious. Additional hyperparameters are added if we consider
rescaling the feature axes and weighting neighbors by their distance from the query point,
for example.
2.14 Other Forms of Instance-based Learning
2.14.1 Locally Weighted Regression
Another popular example of instance-based learning is locally weighted regression. The idea
behind locally weighted regression is straightforward. Similar to k NN, training examples
neighboring a query point are selected. Then, a regression model is fit to that selected
set of training examples to predict the value of a given data point. Essentially, we are
approximating the target function locally, which is easier than learning a hypothesis that
approximates the target function well on a global scale.
The regression model could have any form, really; however, linear regression models are
popular in the context of locally weighted regression because it is simple, computationally
efficient, and performs sufficiently well for approximating the local neighborhood of a given
data point.
Interestingly, the concept behind locally weighted regression has been recently used to ap-
proximate decision functions locally to help interpret decisions made by “complex” models
(random forests, multi-layer networks, etc.), for example, via the LIME technique (LIME is
an acronym for Local Interpretable Model-Agnostic Explanations)21 .
2.14.2 Kernel Methods
The term kernel may be a bit confusing due to its varied use, but in the context of machine
learning, kernel methods are usually associated with what is known as the “kernel trick.”
The probably most widely used kernel method is the support vector machine (its standard
form can be interpreted as a special case using a linear kernel). We will discuss this in more
detail later in this course if time permits. In a nutshell, kernel methods use a kernel (as
mentioned earlier, a “similarity function”) to measure the similarity or distance between
pairs of data points, and the “trick” refers to an implicit transformation into a higher
dimensional space where a linear classifier can separate the data points. However, while
SVM can be considered an instance-based learner, it is not a “lazy” learner such as k NN.
21 Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. “Why should i trust you?: Explaining the
predictions of any classifier”. In: Proceedings of the 22nd ACM SIGKDD international conference on
knowledge discovery and data mining. ACM. 2016, pp. 1135–1144.

STAT 451: Introduction To Machine Learning Lecture Notes

Uploaded by

Copyright:

Available Formats

STAT 451: Introduction To Machine Learning Lecture Notes

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

STAT 451: Introduction To Machine Learning Lecture Notes

Uploaded by

Copyright:

Available Formats

STAT 451: Introduction to Machine Learning

2 Nearest Neighbor Methods

2.1.1 Key concepts

While k NN is a universal function approximator under certain conditions, the underlying

hx[i] , y [i] i 2 D (|D| = n), (1)

2.1.2 Nearest Neighbor Classification In Context

2.1.3 Common Use Cases of k NN

2.2 Nearest Neighbor Algorithm

• store training example hx[i] , y [i] i

– current distance := d(x[i] , x[q] )

2.3 Nearest Neighbor Decision Boundary

Figure 2: Illustration of the plane-partitioning of a two-dimensional dataset (features x1 and x2 )

This partitioning of regions on a plane in 2D is also called “Voronoi diagram” or “Voronoi

2.4 k -Nearest Neighbor Classification and Regression

Kingdom, according to https://en.wikipedia.org/wiki/Plurality (voting)

Figure 4: Illustration of plurality and majority voting.

• Plurality voting among the k nearest neighbors for classification.

we can define the k NN hypothesis as

Figure 5: Illustration of k NN for a 3-class problem with k = 5.

2.5 Curse of Dimensionality

The k NN algorithm is particularly susceptible to the curse of dimensionality 8 . In machine

In: AMS math challenges lecture 1.2000 (2000), p. 32.

2.6 Computational Complexity and the Big-O Notation

Figure 6: An illustration of the growth rates of common functions.

def matrixmultiply (A, B):

C = [[0 for row in range(len(A))]

for row_a in range(len(A)):

2.7 Improving Computational Performance

2.7.1 Naive k NN Algorithm in Pseudocode

– current distance := d(x[i] , x[q] )

• add closest point to Dk

– current distance := d(x[i] , x[q] )

• remove farthest point from Dk

Using a Priority Queue

2.7.2 Data Structures

IFIP Congress. 1974, pp. 678–681.

nications of the ACM 18.9 (1975), pp. 509–517.

2.7.3 Dimensionality Reduction

• Feature Selection (e.g., Sequential Forward Selection)

• Feature Extraction (e.g., Principal Component Analysis)

2.7.4 Faster Distance Metric/Heuristic

tance. Another popular choice is the Manhattan distance,

2.8.1 Discrete Features

distance metric as it violates the triangle-inequality, d(a, c)  d(a, b) + d(b, c).

W 2 Rm⇥m = diag(w1 , w2 , ..., wm ). (20)

Then, the distance-weighted Euclidean distance can be expressed as a follows:

A variant of k NN is distance-weighted k NN. In “regular” k NN, the all k neighbors participate

Simiarly, we can define the following equation for k NN regression:

2.10 Improving Predictive Performance

• Choosing the value of k.

• Scaling of the feature axes.

Chap. 8. isbn: 9780071154673. url: https://books.google.com/books?id=EoYBngEACAAJ.

Accuracy Test Data

Low bias High bias

2.11 Error Bounds

information theory 13.1 (1967), pp. 21–27.

Techniques) of Duda, Hart, and Stork’s Pattern Classification book19 .

2.12 k NN from a Bayesian Perspective

As a statistics student, you may be interested in looking at k NN from a Bayesian perspective.