Neural Networks Use Distance Metrics
Alan Oursland
November 2024
Abstract
We present empirical evidence that neural networks with ReLU and Absolute Value ac-
tivations learn distance-based representations. We independently manipulate both distance
and intensity properties of internal activations in trained models, finding that both archi-
tectures are highly sensitive to small distance-based perturbations while maintaining robust
performance under large intensity-based perturbations. These findings challenge the prevail-
ing intensity-based interpretation of neural network activations and offer new insights into
their learning and decision-making processes.
1 Introduction
The foundation for interpreting neural network activations as indicators of feature strength
can be traced back to the pioneering work of McCulloch and Pitts in 1943 [McCulloch and
Pitts, 1943], who introduced the concept of artificial neurons with a threshold for activation.
This concept, where larger outputs signify stronger representations, was further developed by
Rosenblatt’s 1958 perceptron model [Rosenblatt, 1958] and has persisted through the evolution
of neural networks and deep learning [Schmidhuber, 2015]. Throughout this evolution, the field
has largely upheld this interpretation that larger activation values indicate stronger feature
presence – what we term an intensity metric. However, despite the remarkable success achieved
through this lens, the statistical principles underlying neural network feature learning remain
incompletely understood [Lipton, 2018].
This work builds on our recent theoretical framework [Oursland, 2024] that proposed neural
networks might naturally learn to compute statistical distance metrics, specifically the Maha-
lanobis distance [Mahalanobis, 1936]. Our analysis suggested that smaller node activations,
rather than larger ones, might correspond to stronger feature representations. While this pre-
vious work established a mathematical relationship between neural network linear layers and
the Mahalanobis distance, we need empirical evidence to determine whether networks actually
employ these distance-based representations in practice.
We use systematic perturbation analysis [Szegedy et al., 2013, Goodfellow et al., 2014] to
provide empirical evidence supporting the distance metric theory proposed in our previous work.
Using the MNIST dataset [LeCun et al., 1998], we modify trained models by independently
manipulating distance and intensity properties of network activations. By analyzing how these
perturbations affect model performance, we identify which properties – distance or intensity –
drive network behavior. Our investigation focuses on two key questions:
• Do neural networks naturally learn to measure distances rather than intensities when
processing data distributions?
• How do different activation functions (ReLU and Absolute Value) affect the type of statistical measures learned by the network?
The implementation for this work can be found at https://github.com/alanoursland/neural_networks_use_distance_metrics.
Our results show that networks with both ReLU and Absolute Value activations are highly
sensitive to distance-based perturbations while maintaining robust performance under intensity
perturbations, supporting the hypothesis that they utilize distance-based metrics. These findings
not only validate our theoretical framework but also suggest new approaches for understanding
and improving neural network architectures.
2 Prior Work
In 1943, McCulloch and Pitts introduced a computational model of a neuron to explore logical
equations in biological brains [McCulloch and Pitts, 1943]. Their definition TRUE = (Wx > b)
marks the beginning of our path using intensity metrics. Rosenblatt adapted this into an
activation value y = f(Wx + b) in 1957 with the perceptron, further solidifying the intensity
metric interpretation [Rosenblatt, 1957].
The development of multilayer perceptrons (MLPs) and the backpropagation algorithm en-
abled the training of deeper networks with continuous activation functions. [Rumelhart et al.,
1986, LeCun et al., 1989, Hornik et al., 1989] The interpretation of activations continued to focus
on larger values as being more salient, reflected in visualizations of activations and analyses of
feature maps, where stronger activations were highlighted. [Zeiler and Fergus, 2014, Yosinski
et al., 2015, Olah et al., 2017, Erhan et al., 2009]
The rise of deep learning, with the widespread adoption of ReLU and its variants, further
reinforced the intensity metric interpretation by emphasizing the importance of large, positive
activations. [Nair and Hinton, 2010, Glorot et al., 2011] Visualization techniques, such as saliency
maps and Class Activation Mapping (CAM), often focused on highlighting regions with high
activations. [Simonyan et al., 2013, Zhou et al., 2016] Similarly, attention mechanisms, which
assign weights to different parts of the input, often rely on the magnitude of these weights as
indicators of importance. [Bahdanau et al., 2014, Vaswani et al., 2017]
While the intensity metric interpretation has been dominant, recent work has highlighted
its limitations. [Rudin, 2019] Considering the relationships between activations, particularly
through distance metrics, offers a promising avenue for understanding neural network repre-
sentations. [Goodfellow et al., 2014, Madry et al., 2017, Szegedy et al., 2013] Distance-based
methods, such as Radial Basis Function (RBF) networks and Siamese networks, demonstrate the
potential of incorporating distance computations into neural network architectures and inter-
pretation. [Broomhead and Lowe, 1988, Bromley et al., 1994, Schroff et al., 2015] This approach
could lead to more nuanced and effective representations.
3 Background
In our previous work, Interpreting Neural Networks through Mahalanobis Distance, we estab-
lished a mathematical link between linear nodes with absolute value activation functions and
statistical distance metrics. [Oursland, 2024] This framework suggests that neural networks may
naturally learn to measure distances rather than intensities.
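As a brief illustration of that link (a sketch following [Oursland, 2024], stated here under a Gaussian assumption, with µ the data mean and Σ = V ΛV⊤ the eigendecomposition of its covariance), the Mahalanobis distance contributed by a single principal component v_i with eigenvalue λ_i can be written as a linear node followed by an absolute value:
\[
d_i(x) = \left| \lambda_i^{-1/2}\, v_i^\top (x - \mu) \right| = \left| W_i x + b_i \right|,
\qquad W_i = \lambda_i^{-1/2} v_i^\top,\quad b_i = -\lambda_i^{-1/2} v_i^\top \mu .
\]
Under this reading, a small activation means the input lies close to the component's mean, which is the sense of "distance" probed in our experiments.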
We explore this idea using the MNIST dataset, a well-known digit recognition benchmark that
offers a structured environment for examining neural network behavior [LeCun et al., 1998].
MNIST’s clear feature structure and abundant prior research make it ideal for investigating core
properties of neural network learning. A distance metric quantifies how far an input is from a
learned statistical property of the data [Deza and Deza, 2009], while an intensity metric reflects
a confidence level: larger values indicate higher certainty that the input belongs to the node's
feature set. This dual interpretation of a node's output, either as a measure of distance or as
confidence in feature presence, can help us understand the nature of neural network learning.
For instance, an intensity filter could be viewed as a disjunctive distance metric that measures
how close an input is to everything the target feature is not.
This contrast raises a central question:
• What evidence would convincingly demonstrate which interpretation better reflects network operation?
This question informs our experimental design, which uses controlled perturbations to test
the nature of the learned features. By independently manipulating the distance and intensity
properties of network activations, we can determine which aspects truly drive network behavior.
Our investigation focuses not on proving specific mathematical relationships but on demon-
strating that distance-based properties, rather than intensity-based properties, govern network
performance. This approach aims to improve our understanding of how neural networks process
information and may lead to more effective network design and analysis methods [Montavon
et al., 2018, Samek et al., 2019].
4 Experimental Design
To empirically investigate whether neural networks naturally learn distance-based features,
we designed systematic perturbation experiments to differentiate between distance-based and
intensity-based feature learning. This experimental framework directly compares these two
interpretations by examining how learned features respond to specific modifications of their ac-
tivation patterns. We hypothesize that perturbing the "true representation" will result in a drop
in model accuracy.
4.1 Model and Training
We train a basic feedforward model on the MNIST dataset to test our hypotheses. Our goal
is to obtain a robust model for perturbation analysis, not to optimize model accuracy. The
network processes MNIST digits through a linear layer, a custom perturbation layer, and the
activation function under study (ReLU or Absolute Value), followed by a linear classification layer.
The perturbation layer is a custom module designed to control activation patterns using
three fixed parameters: a multiplicative factor (scale), a translational offset (offset), and a
clipping threshold (clip). During training, these parameters remain fixed (scale = 1, offset = 0,
clip = ∞), ensuring the layer does not influence the network's learning. During perturbation
testing, these parameters are modified to probe the network's learned features. For each input
x, the perturbation layer applies the following operation: y = min(scale · x + offset, clip), where
scale, offset, and clip are adjustable for each unit.
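A minimal PyTorch sketch of such a layer (the class name and buffer-based design are illustrative, not taken from the released code):

```python
import torch
import torch.nn as nn

class PerturbationLayer(nn.Module):
    """Applies y = min(scale * x + offset, clip) element-wise, per unit.

    During training the parameters stay at their identity values
    (scale=1, offset=0, clip=inf), so the layer is a no-op; they are
    overwritten only when probing a trained model.
    """
    def __init__(self, num_units: int):
        super().__init__()
        # Fixed (non-learnable) parameters, one value per hidden unit.
        self.register_buffer("scale", torch.ones(num_units))
        self.register_buffer("offset", torch.zeros(num_units))
        self.register_buffer("clip", torch.full((num_units,), float("inf")))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.minimum(self.scale * x + self.offset, self.clip)
```

During a perturbation test, the scale, offset, and clip buffers are overwritten with per-unit values derived from each unit's observed activation range, as described in Section 4.2.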
The model was trained on the entire MNIST dataset (rather than using minibatches) for
5000 epochs using Stochastic Gradient Descent (learning rate = 0.001, loss = cross-entropy).
Data normalization used µ = 0.1307, σ = 0.3081. To support statistical comparison across runs,
we repeated each experiment 20 times.
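A sketch of this training setup, reusing the PerturbationLayer sketched above (the hidden width, Abs module, and data-loading details are illustrative assumptions; only the optimizer, loss, epoch count, and normalization constants come from the text):

```python
import torch
import torch.nn as nn
from torchvision import datasets, transforms

class Abs(nn.Module):
    """Absolute-value activation used in the second model variant."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x.abs()

# Hypothetical model assembly (hidden width is illustrative); the
# perturbation layer is kept at identity settings during training.
hidden = 128
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, hidden),
    PerturbationLayer(hidden),   # no-op while training
    nn.ReLU(),                   # swap in Abs() for the second variant
    nn.Linear(hidden, 10),
)

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,)),  # constants from the text
])
train = datasets.MNIST("data", train=True, download=True, transform=transform)
x = torch.stack([img for img, _ in train])        # entire training set
y = torch.tensor([label for _, label in train])   # used as one full batch

optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(5000):                         # full-batch updates
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
```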
4.2 Evaluation
Perturbation ranges were selected to span a broad spectrum to ensure comprehensive evaluation,
and the ranges overlap to facilitate direct comparison between distance and intensity metrics.
All percentages are relative to each individual node's activation range over the input set. The
intensity (scale) and cutoff (clip) perturbations range over [1%, 1000%]; the offset perturbation
ranges over [−200%, 100%].
For each test, we select a percentage in the perturbation range, calculate and apply scale, offset,
and clip, evaluate on the entire training set, and record the resulting accuracy.
We use the training set, and not the test set, to observe how perturbations affect the features
learned during training. Changes in accuracy indicate reliance on the perturbed feature type,
while stable accuracy suggests that the features are not critical to the model’s decisions. The use
of the training set ensures a comprehensive assessment with a sufficient number of data points.
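One way to implement the offset sweep, reusing the model and full-batch tensors from the training sketch above. The definition of a unit's range as the max-minus-min of its pre-activations, and the percentage grid in the usage comment, are assumptions; the released code may differ:

```python
import torch

@torch.no_grad()
def accuracy(model, x, y):
    return (model(x).argmax(dim=1) == y).float().mean().item()

@torch.no_grad()
def offset_sweep(model, perturb, x, y, percentages):
    """Shift each unit's pre-activation by a percentage of its observed range."""
    # Per-unit activation range at the perturbation layer's input,
    # measured over the training set (model[:2] is Flatten + first Linear).
    pre = model[:2](x)
    unit_range = pre.max(dim=0).values - pre.min(dim=0).values

    results = {}
    for pct in percentages:
        perturb.offset.copy_(unit_range * pct / 100.0)
        results[pct] = accuracy(model, x, y)
    perturb.offset.zero_()      # restore identity behaviour afterwards
    return results

# e.g. offset_sweep(model, model[2], x, y, [-10, -5, -3, -1, 0, 1, 3, 5, 10])
```

The scale and clip sweeps follow the same pattern, writing per-unit values into perturb.scale or perturb.clip instead of perturb.offset.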
5 Results
Our experiments provide strong empirical support for the theory that the tested models (Abs and
ReLU) primarily utilize distance metrics, rather than intensity metrics, for classification. This
means that the models rely on features residing near the decision boundaries for classification,
rather than in regions with high activation magnitudes. As shown in Table 1, both models
achieved high accuracy on MNIST before perturbation testing [LeCun et al., 1998].
Figure 1: Effects of intensity scaling and distance offset perturbations on model accuracy. Shaded
regions represent 95% confidence intervals across 20 runs.
Consistent with theory, both models resist intensity perturbations but are sensitive to dis-
tance ones (Figure 1). Specifically, both models maintain their baseline accuracy (approximately
98% for ReLU and 99% for Abs) across a wide range of intensity scaling (from 10% to 200%
of the original output range) and threshold clipping (from 50% of the maximum activation and
above). The minor fluctuations in accuracy observed within these ranges were small and not
statistically significant (p > 0.05), as detailed in Table 2 and Table 3. This robustness to inten-
sity perturbations suggests that the models are not heavily reliant on the absolute magnitude
of activations, or intensity metrics, for classification. This aligns with findings in adversarial
example literature, where imperceptible perturbations can drastically alter model predictions
[Szegedy et al., 2013, Goodfellow et al., 2014].
In contrast, both models exhibit a rapid decline in accuracy with relatively small distance
offset perturbations. ReLU maintains its baseline accuracy over an offset range from -3% to
+2% of the activation range, while the Abs model is even more sensitive, falling below 99%
accuracy outside of -1% to +1%. These findings, presented in detail in Table 4, underscore the
importance of distance metrics, particularly the distances to decision boundaries, in the learned
representations for accurate classification.
The high p-values associated with the intensity perturbations (see Appendix A) further
support our hypothesis. These non-significant results indicate that the observed variations in
accuracy under intensity changes are likely attributable to random fluctuations rather than a
systematic effect of the perturbations. This reinforces the notion that the models prioritize
distance metrics over intensity metrics, focusing on the features close to decision boundaries for
classification.
6 Discussion
We explore how ReLU and Abs activations represent features within a distance metric inter-
pretation. Figure 2 illustrates the key differences in how these activation functions process
information. In the pre-activation space (Figures 2a and 2b), both models can learn similar
linear projections of input features. ReLU is driven to minimize the activation of the target
feature {c}, so its decision boundary ends up positioned at the positive edge of that feature's
distribution. Abs positions the decision boundary through the mean, or possibly the median, of
the data. After activation, ReLU sets all features on its dark side to the minimum possible
distance: zero. Abs folds the space, moving all distributions on the negative side to the positive
side. The ReLU activated node selects for features {a, b, c}. The folding operation of the Abs
activated node results in {c} being the sole feature with the smallest activation value.
(a) ReLU pre-activation projection (b) Abs pre-activation projection
Figure 2: This series of figures illustrates how linear nodes process features using ReLU and
Absolute Value activation functions. Each blue peak represents a feature (a-e), with the red
dashed line showing the decision boundary. The top row shows features after linear projection
but before activation. The bottom row shows how ReLU and Absolute Value functions transform
these projections, highlighting their distinct effects on feature space.
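A small numerical illustration of this difference (the pre-activation values are made up; the labels mirror the feature clusters a-e of Figure 2):

```python
import torch

# Hypothetical pre-activation values of one linear node for the five
# feature clusters a-e, ordered along the node's projection axis.
pre = torch.tensor([-2.0, -1.0, 0.1, 1.0, 2.0])  # a, b, c, d, e

relu_out = torch.relu(pre)   # tensor([0.0, 0.0, 0.1, 1.0, 2.0])
abs_out = pre.abs()          # tensor([2.0, 1.0, 0.1, 1.0, 2.0])

# Reading small activations as "close to the decision boundary":
# ReLU maps a and b to exactly zero and c to nearly zero, selecting {a, b, c};
# Abs folds the negative side over, leaving c as the unique closest feature.
```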
(a) ReLU Negative Offset (b) Abs Negative Offset
Figure 3: Effects of decision boundary offsets on feature representation. Negative offsets (top
row) and positive offsets (bottom row) demonstrate how shifting the decision boundary affects
feature selection in ReLU and Abs activated nodes.
One possible explanation for this robustness to intensity scaling could be the normalization effect
of the LogSoftmax operation within the cross-entropy loss function [Bridle, 1990]. By renormalizing the
output values, LogSoftmax might mitigate the impact of scaling on the relative differences be-
tween activations, potentially masking any effects on intensity-based features. However, this
does not explain the performance drop observed when activations are scaled down to the mag-
nitudes associated with distance features, suggesting a complex interplay between scaling and
the different types of learned features.
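For reference, two standard properties of the softmax family bear on this hypothesis (these are generic facts about the output layer, not measurements on our trained models): log-softmax is invariant to adding a constant to every logit, and multiplying the logits by a positive constant changes their sharpness but not their ranking:
\[
\operatorname{logsoftmax}(z + c\mathbf{1}) = \operatorname{logsoftmax}(z),
\qquad
\arg\max_i \operatorname{softmax}(s z)_i = \arg\max_i z_i \quad (s > 0).
\]
To the extent that scaling the hidden activations approximately rescales the logits (exactly so when the final layer has no bias), the predicted class, and therefore accuracy, would be unchanged even though the loss value is not.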
Definitively ruling out intensity-based interpretations is difficult due to the lack of a widely
accepted definition of what constitutes an intensity feature. This ambiguity has persisted despite
decades of research, with various interpretations proposed but no
consensus reached. Some studies suggest that intensity features are indicated by maximum acti-
vation values, as seen in the foundational work on artificial neurons and perceptrons [McCulloch
and Pitts, 1943, Rosenblatt, 1958]. Others propose that intensity features might be defined by
activation values falling within a specific range, aligning with the concept of confidence intervals
or thresholds.
The absence of a clear mathematical foundation for intensity metrics further complicates the
matter. Distance metrics like Euclidean and Mahalanobis distances have well-defined statistical
measures with clear linear formulations [Deza and Deza, 2009, Mahalanobis, 1936]. However,
we find no equivalent statistical measure for intensity that can be expressed through a linear
equation. This lack of a concrete mathematical basis makes it challenging to design experiments
that definitively target and assess intensity features.
Our scaling experiments highlight this difficulty. One might expect that doubling a strong
signal (high activation) should make it stronger, yet our networks maintain consistent behavior
under scaling. If we propose that relative values between nodes preserve intensity information,
this begins to sound suspiciously like a distance metric.
The distance features in the network are easily explained as a Mahalanobis distance of a
principal component as described in [Oursland, 2024]. But what is the statistical meaning be-
hind the intensity features? It implies a complement to the principal component, a principal
disponent consisting of an antivector, antivalue, and an unmean. I don’t think that principal
disponents are real. What looks like an intensity metric is really a distance metric that matches
everything except the large value. Perhaps statistical network interpretation has stymied re-
searchers because we have been looking for the mathematical equivalent of Bigfoot or the Loch
Ness Monster.
7 Conclusion
This paper provides empirical validation for the theoretical connection between neural networks
and Mahalanobis distance proposed in [Oursland, 2024]. Through systematic perturbation anal-
ysis, we demonstrated that neural networks with different activation functions implement dis-
tinct forms of distance-based computation, offering new insights into their learning and decision-
making processes.
Our experiments show that both architectures are sensitive to distance perturbations but
resistant to intensity perturbations. This supports the idea that neural networks learn through
distance-based representations. The Abs network’s performance degrades more dramatically
with small offsets than the ReLU network’s performance. This may be because the Abs network
relies on precise distance measurements, while the ReLU network uses a multi-feature approach.
Both architectures maintain consistent performance under scaling perturbations, which ap-
pears to support distance-based rather than intensity-based computation. However, the lack of
a precise mathematical definition for intensity metrics makes it difficult to definitively rule out
intensity-based interpretations. This limitation highlights a broader challenge in the field: we
cannot fully disprove a concept that lacks rigorous mathematical formulation.
These results provide empirical support for the theory that linear nodes naturally learn
to generate distance metrics. However, more work is needed to strengthen this theoretical
framework, particularly in understanding how these distance computations compose through
deeper networks and interact across multiple layers. The evidence presented here suggests that
distance metrics may provide a more fruitful framework for understanding and interpreting
neural networks than traditional intensity-based interpretations.
References
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly
learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
Jane Bromley, Isabelle Guyon, Yann LeCun, Eduard Säckinger, and Roopak Shah. Signature
verification using a "siamese" time delay neural network. In Advances in neural information
processing systems, pages 737–744, 1994.
David S Broomhead and David Lowe. Radial basis functions, multi-variable functional inter-
polation and adaptive networks. Royal Signals and Radar Establishment Malvern (United
Kingdom), 1988.
Michel Marie Deza and Elena Deza. Encyclopedia of distances. Springer, 2009.
Dumitru Erhan, Yoshua Bengio, Aaron Courville, and Pascal Vincent. Visualizing higher-layer
features of a deep network. University of Montreal, 1341(3):1, 2009.
Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. In
Proceedings of the fourteenth international conference on artificial intelligence and statistics,
pages 315–323, 2011.
Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adver-
sarial examples. arXiv preprint arXiv:1412.6572, 2014.
Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are
universal approximators. Neural networks, 2(5):359–366, 1989.
Yann LeCun, Bernhard Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne
Hubbard, and Lawrence D Jackel. Backpropagation applied to handwritten zip code recogni-
tion. Neural computation, 1(4):541–551, 1989.
Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied
to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian
Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint
arXiv:1706.06083, 2017.
Warren S McCulloch and Walter Pitts. A logical calculus of the ideas immanent in nervous
activity. The bulletin of mathematical biophysics, 5(4):115–133, 1943.
Grégoire Montavon, Wojciech Samek, and Klaus-Robert Müller. Methods for interpreting and
understanding deep neural networks. Digital Signal Processing, 73:1–15, 2018.
Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted boltzmann machines.
Proceedings of the 27th international conference on machine learning (ICML-10), pages 807–
814, 2010.
Chris Olah, Alexander Mordvintsev, and Ludwig Schubert. Feature visualization. Distill, 2(11):
e7, 2017.
Alan Oursland. Interpreting neural networks through mahalanobis distance. arXiv preprint
arXiv:2410.19352, 2024.
Frank Rosenblatt. The perceptron: A perceiving and recognizing automaton. Technical Report
85-460-1, Cornell Aeronautical Laboratory, 1957.
Frank Rosenblatt. The perceptron: a probabilistic model for information storage and organiza-
tion in the brain. Psychological review, 65(6):386, 1958.
Cynthia Rudin. Stop explaining black box machine learning models for high stakes decisions
and use interpretable models instead. Nature Machine Intelligence, 1(5):206–215, 2019.
Wojciech Samek, Grégoire Montavon, Andrea Vedaldi, Lars Kai Hansen, and Klaus-Robert
Müller. Explainable ai: interpreting, explaining and visualizing deep learning. Springer, 2019.
Jürgen Schmidhuber. Deep learning in neural networks: An overview. Neural networks, 61:
85–117, 2015.
Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for
face recognition and clustering. In Proceedings of the IEEE conference on computer vision
and pattern recognition, pages 815–823, 2015.
Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks:
Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034,
2013.
Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian
Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint
arXiv:1312.6199, 2013.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information
processing systems, 30, 2017.
Jason Yosinski, Jeff Clune, Anh Nguyen, Thomas Fuchs, and Hod Lipson. Understanding neural
networks through deep visualization. In International conference on machine learning, pages
1576–1585. PMLR, 2015.
Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks.
European conference on computer vision, pages 818–833, 2014.
Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Learning
deep features for discriminative localization. Proceedings of the IEEE conference on computer
vision and pattern recognition, pages 2921–2929, 2016.
A Statistical Tables
A.1 Baseline Performance
The following table presents the detailed performance metrics for both architectures across 20
training runs:
Table 1: Baseline model performance averaged across 20 training runs (mean ± standard devi-
ation).
Table 2: Effects of intensity scaling on model accuracy. Scale values are shown as percentages
of the original range.
Table 3: Effects of intensity cutoff on model accuracy. Cutoff values are shown as percentages
of the maximum activation.
Offset Abs ReLU
Change Acc (%) T-stat P-value Acc (%) T-stat P-value
-200% 11.04 49.6 1.4e-21 18.77 211.3 1.7e-33
-100% 12.60 48.8 1.9e-21 41.68 78.9 2.2e-25
-75% 14.79 47.1 3.9e-21 55.16 60.2 3.8e-23
-50% 21.78 43.7 1.6e-20 72.90 43.4 1.8e-20
-25% 49.03 32.0 5.3e-18 90.24 33.1 2.9e-18
-10% 86.25 -31.5 7.3e-18 96.55 -36.7 4.1e-19
-5% 95.49 -39.5 1.1e-19 97.78 -38.8 1.5e-19
-3% 97.86 -41.3 4.6e-20 98.13 -39.3 1.1e-19
-2% 98.95 -42.1 3.2e-20 98.24 -39.4 1.1e-19
-1% 99.82 -42.7 2.4e-20 98.31 -39.8 8.9e-20
Baseline 99.99 -42.8 2.3e-20 98.33 -40.0 8.3e-20
+1% 99.81 -42.7 2.4e-20 98.26 -39.6 9.8e-20
+2% 98.85 -42.0 3.3e-20 98.14 -39.3 1.2e-19
+3% 97.70 -41.0 5.2e-20 97.99 -38.5 1.7e-19
+5% 95.00 -38.3 1.8e-19 97.62 -36.6 4.5e-19
+10% 81.40 -19.5 4.9e-14 96.31 -28.3 5.3e-17
+25% 23.43 15.0 5.2e-12 81.60 14.7 8.1e-12
+50% 11.14 29.7 2.2e-17 32.54 64.3 1.1e-23
+75% 9.81 42.8 2.3e-20 13.94 61.1 2.8e-23
+100% 9.78 44.6 1.1e-20 9.41 493.8 1.7e-40
Table 4: Effects of distance offset on model accuracy. Offset values are shown as percentages of
the activation range.