Entropy 26 00039 v2
Entropy 26 00039 v2
Entropy 26 00039 v2
Article
Probabilistic Nearest Neighbors Classification
Bruno Fava 1 , Paulo C. Marques F. 2, * and Hedibert F. Lopes 2
Abstract: Analysis of the currently established Bayesian nearest neighbors classification model
points to a connection between the computation of its normalizing constant and issues of NP-
completeness. An alternative predictive model constructed by aggregating the predictive distributions
of simpler nonlocal models is proposed, and analytic expressions for the normalizing constants of
these nonlocal models are derived, ensuring polynomial time computation without approximations.
Experiments with synthetic and real datasets showcase the predictive performance of the proposed
predictive model.
1. Introduction
The now classic nearest neighbors classification algorithm, introduced in a 1951 techni-
cal report by Fix and Hodges (reprinted in [1]), marked one of the early successes of machine
learning research. The basic idea is that, given some notion of proximity between pairs of
observations, the class of a new sample unit is determined by majority voting among its k
nearest neighbors in the training sample [2,3]. A natural question is whether it is possible
to develop a probabilistic model that captures the essence of the mechanism contained in
the classic nearest neighbors algorithm while adding proper uncertainty quantification of
Citation: Fava, B.; F., P.C.F.; Lopes,
predictions made by the model. In a pioneering paper, Holmes and Adams [4] defined a
H.F. Probabilistic Nearest Neighbors
probabilistic nearest neighbors model specifying a set of conditional distributions. A few
Classification. Entropy 2024, 26, 39.
years later, Cucala et al. [5] pointed out the incompatibility of the conditional distributions
https://doi.org/10.3390/e26010039
specified by Holmes and Adams, which do not define a proper joint model distribution. As
Academic Editors: Carlos Alberto De an alternative, Cucala et al. developed their own nearest neighbors classification model,
Bragança Pereira, Paulo Canas defining directly a proper, Boltzmann-like joint distribution. A major difficulty with the
Rodrigues and Mark Andrew Gannon Cucala et al. model is the fact that its likelihood function involves a seemingly intractable
Received: 29 November 2023
normalizing constant. Consequently, in order to perform a Bayesian analysis of their model,
Revised: 23 December 2023 the authors engaged in a tour de force of Monte Carlo techniques, with varied computational
Accepted: 28 December 2023 complexity and approximation quality.
Published: 30 December 2023 In this paper, we introduce an alternative probabilistic nearest neighbors predictive
model constructed from an aggregation of simpler models whose normalizing constants
can be exactly summed in polynomial time. We begin by reviewing the Cucala et al. model
in Section 2, showing by an elementary argument that the computational complexity of
Copyright: © 2023 by the authors. the exact summation of its normalizing constant is directly tied to the concept of NP-
Licensee MDPI, Basel, Switzerland.
completeness [6]. The necessary concepts from the theory of computational complexity are
This article is an open access article
briefly reviewed. Section 3 introduces a family of nonlocal models, whose joint distributions
distributed under the terms and
take into account only the interactions between each sample unit and its r-th nearest
conditions of the Creative Commons
neighbor. For each nonlocal model, we derive an analytic expression for its normalizing
Attribution (CC BY) license (https://
constant, which can be computed exactly in polynomial time. The nonlocal models are
creativecommons.org/licenses/by/
4.0/).
combined in Section 4, yielding a predictive distribution that does not rely on costly Monte
Carlo approximations. We run experiments with synthetic and real datasets, showing that
our model achieves the predictive performance of the Cucala et al. model, with a more
manageable computational cost. We present our conclusions in Section 5.
From this definition, we see that direct (brute force) summation of Z ( β, k) would involve
an exponential number of operations (O( Ln )). The much more subtle question about
the possible existence of an algorithm that would allow us to exactly compute Z ( β, k) in
polynomial time is addressed in Section 2.3.
In their paper [5], the authors relied on a series of techniques to implement Markov
Chain Monte Carlo (MCMC) frameworks in the presence of the seemingly intractable model
normalizing constant Z ( β, k ). They developed solutions based on pseudo-likelihood [7],
path sampling [8,9] (which essentially approximates Z ( β, k) using a computationally inten-
sive process, for each value of the pair ( β, k) appearing in the iterations of the underlying
MCMC procedure) and the Møller et al. auxiliary variable method [10]. Although there
is currently no publicly available source code for further experimentation, at the end of
Section 3.4 the authors report computation times ranging from twenty minutes to more
Entropy 2024, 26, 39 3 of 12
than one week for the different methods using compiled code. We refer the reader to [5] for
the technical details.
n k n
1 n
T (y) = ∑ ∑ I(yi = y[i]r ) = ∑ aij I(yi = y j ) =
2 i,j∑
bij I(yi = y j ),
i =1 r =1 i,j=1 =1
for every y ∈ {0, 1}n . Furthermore, each y ∈ {0, 1}n corresponds to a cut of the graph G if
we define the disjoint sets of vertices V0 = {i ∈ E : yi = 0} and V1 = {i ∈ E : yi = 1}. The
respective cut size is given by:
1 n
2 i,j∑
cut-size(y) = bij I(yi ̸= y j ).
=1
it follows that !
1 n
2 i,j∑
cut-size(y) = bij − T ( y ). (∗)
=1
Figure 1 gives an example for a specific neighborhood structure involving the three nearest
neighbors with respect to Euclidean distance.
Figure 1. Weighted undirected nonplanar graphs associated with a specific neighborhood structure,
determined by the three nearest neighbors according to Euclidean distance, and the sizes of three
different cuts, obtained by summing the weights of all edges linking two vertices of different colors.
On each graph, light gray and black edges have weights with values one and two, respectively,
according to the adjacency matrix B. Black and white vertices correspond to class labels being equal
to one and zero, respectively.
Entropy 2024, 26, 39 5 of 12
By grouping each possible value of T (y) in the sum over y ∈ {0, 1}n appearing in the
definition of Z ( β, k), we obtain the alternative polynomial representation
nk
Z ( β, k ) = ∑ dt zt ,
t =0
in which z = e β/k and dt = ∑y∈{0,1}n I( T (y) = t), for t = 0, 1, . . . , nk. Note that dt is the
number of vectors y ∈ {0, 1}n such that T (y) = t,and from (∗) we have that dt is also the
1 n
number of possible cuts of the graph G with size 2 ∑i,j=1 bij − t.
Suppose that we have found a way to sum Z ( β, k) in polynomial time on a deter-
ministic computer, for every possible value of β and k and any specified neighborhood
structure. By polynomial interpolation (see [13]), we would be able to compute the value
of each coefficient dt in polynomial time, thus determining the number of cuts of G with
all possible sizes, thereby solving any maximum cut decision problem associated with the
graph G. In other words: the existence of a polynomial time algorithm to sum Z ( β, k) for an
arbitrary neighborhood structure on a deterministic computer would imply that P is equal
to NP.
with parameter β r ≥ 0.
In line with what was pointed out in our discussion of the normalizing constant
Z ( β, k) of the Cucala et al. model, brute force computation of Zr ( β r ) is also hopeless for
the nonlocal models, requiring the summation of an exponential number of terms (O( Ln )).
However, the much simpler topology associated with the neighborhood structure of a
nonlocal model can be exploited to give us a path to sum Zr ( β r ) analytically, resulting in
an expression that can be computed exactly in polynomial time on an ordinary computer.
Throughout the remainder of this section, our goal is to derive a tractable closed form
for the normalizing constant Zr ( β r ). For the r-th nonlocal model, consider the directed
graph G = (V, E) representing the associated neighborhood structure of a given training
sample. For i = 1, . . . , n, each vertex i ∈ V corresponds to one training sample unit, and the
existence of an oriented edge (i, j) ∈ E, represented pictorially by an arrow pointing from i
to j, means that the j-th sample unit is the r-th nearest neighbor of the i-th sample unit.
Entropy 2024, 26, 39 6 of 12
Since
L L L
∑ n := ∑ ∑ ··· ∑,
y ∈L y1 =1 y2 =1 y n =1
For each subgraph, starting at some vertex and following the arrows pointing to each
subsequent vertex, if we return to the first vertex after m steps, we say that the subgraph has
a simple cycle of size m. The outdegree of a vertex is the number of arrows pointing from it
to other vertices; the indegree of a vertex is defined analogously. Figure 3 depicts the fact
that each subgraph has exactly one simple cycle: in a subgraph without simple cycles, there
would be at least one vertex with outdegree equal to zero. Moreover, a subgraph with more
than one simple cycle would have at least one vertex in one of the simple cycles pointing to
a vertex in another simple cycle, implying that such a vertex would have outdegree equal
to two. Both cases contradict the fact that every vertex of each subgraph has outdegree
equal to one, since each sample unit has exactly one r-th nearest neighbor.
Figure 2. Distance matrix D and directed graphs showing the neighborhood structure for the nonlocal
models with r = 1 and r = 2.
Entropy 2024, 26, 39 7 of 12
Figure 3. Two impossible directed graphs for a nonlocal model. On the left graph, the white vertex
has outdegree equal to zero. On the right graph, the white vertex has outdegree equal to two. Both
graphs contradict the fact that each training sample unit has exactly one r-th nearest neighbor. The
conclusion is that in a directed graph describing the neighborhood structure of a nonlocal model each
subgraph contains exactly one simple cycle.
Figure 4 portrays the reduction process used to perform the summations for one
subgraph. For each vertex with indegree equal to zero, we sum over the correspondent
yi and remove the vertex from the graph. We repeat this process until we are left with a
summation over the vertices forming the simple cycle. The summation for each vertex i
with indegree equal to zero in this reduction process gives the factor
L
∑ exp( β r I(yi = y[i]r )) = e βr + L − 1,
y i =1
because—and this is a crucial aspect of the reduction process—in this sum the indicator
is equal to one just for a single term, and it is equal to zero for all the remaining L − 1
terms, whatever the value of y[i]r . Summation over the vertices forming the simple cycle is
performed as follows. Relabeling the indexes of the sample units if necessary, suppose that
the vertices forming a simple cycle of size m are labeled as 1, 2, . . . , m. Defining the matrix
S = (s a,b ) by s a,b = exp( β r I( a = b)), we have
L L L
∑ ∑ ··· ∑ exp( β r I(y1 = y2 )) × exp( β r I(y2 = y3 )) × · · · × exp( β r I(ym = y1 ))
y1 =1 y2 =1 y m =1
L L L L
= ∑ ∑ ··· ∑ sy1 ,y2 × sy2 ,y3 × · · · × sym ,y1 = ∑ (Sm )y1 ,y1 = Tr(Sm ).
y1 =1 y2 =1 y m =1 y1 =1
det(S − λI ) = (e βr − 1 − λ) L−1 (e βr + L − 1 − λ) = 0,
yielding
λ1 = λ2 = · · · = λ L−1 = e βr − 1, and λ L = e βr + L − 1.
(r )
For the r-th nonlocal model, let cm be the number of simple cycles of size m, considering all
(r )
the associated subgraphs. Algorithm 1 shows how to compute cm , for r = 1, . . . , n − 1 and
m = 2, . . . , n, in polynomial time. Taking into account all the subgraphs, and multiplying
all the factors, we arrive at the final expression:
n (r ) n c(mr)
Zr ( β r ) = (e βr + L − 1)n−∑m=2 m×cm × ∏ (e βr + L − 1)m + ( L − 1)(e βr − 1)m .
m =2
Entropy 2024, 26, 39 8 of 12
Figure 4. Example of the reduction process for the summation on each subgraph.
Algorithm 1 Count the occurrences of simple cycles of different sizes on the directed
subgraphs representing the neighborhood structures of all nonlocal models
Require: Neighborhood brackets {[i ]r : i = 1, . . . , n; r = 1, . . . , n − 1}.
4. Predictive Model
The nonlocal models developed in Section 3 will be the building blocks of our
probabilistic nearest neighbors classification model. Introducing a hyperparameter k =
1, . . . , n − 1, the first k nonlocal models can be combined; the heuristic being that a “super-
position” of nonlocal models could work as a model whose neighborhood structure takes
into account the sets of k nearest neighbors of each sample unit. Section 4.1 explains the
aggregation procedure leading to the predictive distribution of the combined model. The
results in Section 4.2 showcase the computational cost and the predictive performance of
the new model, examining the same synthetic and real datasets explored in [5].
Entropy 2024, 26, 39 9 of 12
Using this estimate, we define the predictive distribution of the r-th nonlocal model as
1 k
k r∑
p(k) (yn+1 | xn+1 , x, y) := pr (yn+1 | xn+1 , x, y).
=1
The value of the hyperparameter k is chosen using leave-one-out cross validation [15] on
the training sample. From now on, we refer to this predictive model as the probabilistic
nearest neighbors classification model (pnnclass, for short).
An open source software library implementing the predictive model described in this
section is publicly available at https://github.com/paulocmarquesf/pnnclass (access on
29 December 2023). It is an R [16] library, having internal routines written in C++ [17] with
the help of Rcpp [18].
Figure 5. Training sample for the Ripley synthetic dataset. Circles and triangles indicate the two
possible values of the response variable. The heatmap represents the probabilities of class “triangle”
given by the predictive model pnnclass, ranging from pure cyan (probability zero) to pure orange
(probability one).
Figure 6. ROC curves for the Ripley and the Pima Indian datasets. Annotated summaries correspond
to the probability threshold maximizing the sum of the classifier sensitivity and specificity.
The third and final example considered in [5] was the multiclass Forensic Glass
fragments dataset described in [19]. In this dataset, we have nine numerical predictors, and
the response variable was coalesced into four classes. With randomly split training and
testing samples of sizes 89 and 96, respectively, the pnnclass predictive model achieves a
testing error rate of 28.04%, which is better than the error rate of 29% reported in [5] for
this dataset. Performing 100 replications, the median running time using the pnnclass
library was 9.15 milliseconds. Since in this dataset the predictor space is nine-dimensional,
in Figure 7 we use Principal Component Analysis (PCA) as a visualization tool to depict
the predictive performance on the the testing sample (PCA played no role in the modeling
process). The confusion matrix for the Forensic Glass testing sample is given in Table 1.
Table 2 gives more detailed summaries for the running times on the three datasets. Table 3
summarizes the predictive performance comparison between the pnnclass and the Cucala
et al. models for the three datasets. For comparison, we added the test errors of a Random
Forest [21,22] constructed by bootstrap aggregating 500 classification trees.
Entropy 2024, 26, 39 11 of 12
Figure 7. Testing sample scores on the first two principal components of the predictors in the Forensic
Glass dataset. The different shapes indicate the four possible values of the response variable. The
colors blue or red mean, respectively, that the corresponding testing sample units were classified
correctly or incorrectly.
Observed Class
0 1 2 3
0 29 2 0 5
1 2 27 1 7
Predicted class
2 0 1 14 2
3 7 3 0 7
Table 2. Descriptive summaries for the running times in milliseconds of the pnnclass predictive
model library based on 100 replications.
Table 3. Testing errors for the three datasets. In the third column, we report the smallest errors
achieved in [5] considering all Monte Carlo approximations implemented by the authors.
5. Concluding Remarks
This work can be seen as the construction of a probabilistic predictive model based
on the concept of nearest neighbors, wherein we shifted the algorithmic complexity of the
solution from an intractable class into the realm of polynomial time computation. It would
be interesting to explore how the ideas discussed in this paper could be adapted to assist
in the analysis of spatial data models or models with graph-type data in general, since
these models may pose similar challenges related to the presence of intractable normalizing
constants in the corresponding likelihood functions.
Entropy 2024, 26, 39 12 of 12
Author Contributions: Conceptualization, B.F., P.C.M.F. and H.F.L.; Methodology, B.F., P.C.M.F. and
H.F.L.; Software, B.F., P.C.M.F. and H.F.L. All authors have read and agreed to the published version
of the manuscript.
Funding: Hedibert F. Lopes receives support from FAPESP (Fundação de Amparo à Pesquisa do
Estado de São Paulo) through Grant 2018/04654-9.
Data Availability Statement: The data used in the paper is available through the software library
developed by the authors.
Conflicts of Interest: The authors declare no conflicts of interest.
References
1. Fix, E.; Hodges, J.L. Discriminatory analysis—Nonparametric discrimination: Consistency properties. Int. Stat. Rev. 1989,
57, 238–247. [CrossRef]
2. Devroye, L.; Györfi, L.; Lugosi, G. A Probabilistic Theory of Pattern Recognition; Springer: Berlin, Germany, 2013; Volume 31.
3. Biau, G.; Devroye, L. Lectures on the Nearest Neighbor Method; Springer: Berlin, Germany, 2015; Volume 246.
4. Holmes, C.; Adams, N. A probabilistic nearest neighbour method for statistical pattern recognition. J. R. Stat. Soc. B Stat.
Methodol. 2002, 64, 295–306. [CrossRef]
5. Cucala, L.; Marin, J.M.; Robert, C.P.; Titterington, D.M. A Bayesian Reassessment of Nearest-Neighbor Classification. J. Am. Stat.
Assoc. 2009, 104, 263–273. [CrossRef]
6. Garey, M.R.; Johnson, D.S. Computers and Intractability: A Guide to the Theory of NP-Completeness; W. H. Freeman: New York, NY,
USA, 1979.
7. Besag, J. Spatial Interaction and the Statistical Analysis of Lattice Systems. J. R. Stat. Soc. B Stat. Methodol. 1974, 36, 192–236.
[CrossRef]
8. Ogata, Y. A Monte Carlo method for high dimensional integration. Numer. Math. 1989, 55, 137–157. [CrossRef]
9. Gelman, A.; Meng, X.L. Simulating Normalizing Constants: From Importance Sampling to Bridge Sampling to Path Sampling.
Stat. Sci. 1998, 13, 163–185. [CrossRef]
10. Møller, J.; Pettitt, A.N.; Reeves, R.; Berthelsen, K.K. An Efficient Markov Chain Monte Carlo Method for Distributions with
Intractable Normalising Constants. Biometrika 2006, 93, 451–458. [CrossRef]
11. Cook, S.A. The Complexity of Theorem-Proving Procedures. In Proceedings of the STOC ’71: Third Annual ACM Symposium on
Theory of Computing, Shaker Heights, OH, USA, 3–May 1971; Association for Computing Machinery: New York, NY, USA, 1971;
pp. 151–158. [CrossRef]
12. Karp, R. Reducibility Among Combinatorial Problems. In Complexity of Computer Computations; Miller, R., Thatcher, J., Eds.;
Plenum Press: New York, NY, USA, 1972; pp. 85–103.
13. Rivlin, T. An Introduction to the Approximation of Functions; Blaisdell Publishing Company: Waltham, MA, USA, 1969.
14. Strang, G. Linear Algebra and Its Applications; Thomson, Brooks/Cole: Pacific Grove, CA, USA, 2006.
15. Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed.; Springer:
Berlin, Germany, 2009.
16. R Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria,
2017.
17. Stroustrup, B. The C++ Programming Language, 4th ed.; Addison-Wesley Professional: Reading, MA, USA, 2013.
18. Eddelbuettel, D. Seamless R and C++ Integration with Rcpp; Springer: Berlin, Germany, 2013.
19. Ripley, B.D. Pattern Recognition and Neural Networks; Cambridge University Press: Cambridge, UK, 1996.
[CrossRef]
20. Venables, W.N.; Ripley, B.D. Modern Applied Statistics with S, 4th ed.; Springer: New York, NY, USA, 2002; ISBN 0-387-95457-0.
21. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32. [CrossRef]
22. Marques F., P.C. Confidence intervals for the random forest generalization error. Pattern Recognit. Lett. 2022, 158, 171–175.
[CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.