Learning and organization of memory for evolving patterns
Oskar H Schnaack,1, 2 Luca Peliti,3 and Armita Nourmohammad1, 2, 4, ∗
arXiv:2106.02186v1 [physics.bio-ph] 4 Jun 2021
1 Max Planck Institute for Dynamics and Self-organization, Am Faßberg 17, 37077 Göttingen, Germany
2 Department of Physics, University of Washington, 3910 15th Ave Northeast, Seattle, WA 98195, USA
3 Santa Marinella Research Institute, 00058 Santa Marinella, Italy
4 Fred Hutchinson Cancer Research Center, 1100 Fairview Ave N, Seattle, WA 98109, USA
(Dated: June 7, 2021)
∗ Correspondence should be addressed to: [email protected].
Storing memory for molecular recognition is an efficient strategy for responding to external stimuli.
Biological processes use different strategies to store memory. In the olfactory cortex, synaptic connections form when stimulated by an odor, and establish distributed memory that can be retrieved
upon re-exposure. In contrast, the immune system encodes specialized memory by diverse receptors that recognize a multitude of evolving pathogens. Despite the mechanistic differences between
the olfactory and the immune memory, these systems can still be viewed as different information
encoding strategies. Here, we present a theoretical framework with artificial neural networks to characterize optimal memory strategies for both static and dynamic (evolving) patterns. Our approach
is a generalization of the energy-based Hopfield model in which memory is stored as a network’s
energy minima. We find that while classical Hopfield networks with distributed memory can efficiently encode a memory of static patterns, they are inadequate against evolving patterns. To follow
an evolving pattern, we show that a distributed network should use a higher learning rate, which
in turn, can distort the energy landscape associated with the stored memory attractors. Specifically, narrow connecting paths emerge between memory attractors, leading to misclassification of
evolving patterns. We demonstrate that compartmentalized networks with specialized subnetworks
are the optimal solutions to memory storage for evolving patterns. We postulate that evolution of
pathogens may be the reason for the immune system to encode a focused memory, in contrast to
the distributed memory used in the olfactory cortex that interacts with mixtures of static odors.
I. INTRODUCTION
Storing memory for molecular recognition is an efficient strategy for sensing and response to external stimuli. Apart from the cortical memory in the nervous system, molecular memory is also an integral part of the
immune response, present in a broad range of organisms
from the CRISPR-Cas system in bacteria [1–3] to adaptive immunity in vertebrates [4–6]. In all of these systems, a molecular encounter is encoded as a memory and
is later retrieved and activated in response to a similar
stimulus, be it a pathogenic reinfection or a re-exposure
to a pheromone. Despite this high-level similarity, the
immune system and the synaptic nervous system utilize
vastly distinct molecular mechanisms for storage and retrieval of their memory.
Memory storage, and in particular, associative memory in the hippocampus and olfactory cortex has been
a focus of theoretical and computational studies in neuroscience [7–11]. In the case of the olfactory cortex, the
input is a combinatorial pattern produced by olfactory receptors which recognize the constituent mono-molecules
of a given odor. A given odor, composed of many mono-molecules at different concentrations [12–14] drawn from a space of about 10^4 distinct mono-molecules, reacts with
olfactory receptors (a total of ∼ 300–1000 in mammals [15–19]), resulting in a distributed signal and spatiotemporal pattern in the olfactory bulb [20]. This pattern
is transmitted to the olfactory cortex, which serves as a
pattern recognition device and enables an organism to
distinguish orders of magnitude more odors than the number of olfactory receptors [21–23]. The synaptic connections in the cortex are formed as they are co-stimulated by a given odor pattern, thus forming an associative memory that can be retrieved in future exposures [7–11, 24].
Artificial neural networks that store auto-associative
memory are used to model the olfactory cortex. In these
networks, input patterns trigger interactions between encoding nodes. The ensemble of interactive nodes keeps
a robust memory of the pattern since they can be simultaneously stimulated upon re-exposure and thus a
stored pattern can be recovered by just knowing part
of its content. This mode of memory storage resembles the co-activation of synaptic connections in a cortex.
Energy-based models, such as Hopfield neural networks
with Hebbian update rules [25], are among such systems,
in which memory is stored as the network’s energy minima [26]; see Fig. 1. The connection between the standard Hopfield network and the real synaptic neural networks has been a subject of debate over the past decades.
Still, the Hopfield network provides a simple and solvable
coarse-grained model of the synaptic network, relevant
for working memory in the olfactory system and the hippocampus [24].
Immune memory is encoded very differently from associative memory in the nervous system and the olfactory
cortex. First, unlike olfactory receptors, immune receptors are extremely diverse and can specifically recognize
pathogenic molecules without the need for a distributed
and combinatorial encoding. In vertebrates, for example,
the adaptive immune system consists of tens of billions
of diverse B- and T-cells, whose unique surface receptors
are generated through genomic rearrangement, mutation,
and selection, and can recognize and mount specific responses against the multitude of pathogens [5]. Immune
cells activated in response to a pathogen can differentiate into memory cells, which are long lived and can
more efficiently respond upon reinfections. As in most
molecular interactions, immune-pathogen recognition is
cross-reactive, which would allow memory receptors to
recognize slightly evolved forms of the pathogen [5, 27–
32]. Nonetheless, unlike the distributed memory in the
olfactory cortex, the receptors encoding immune memory are focused and can only interact with pathogens
with limited evolutionary divergence from the primary
infection, in response to which memory was originally
generated [5].
There is no question that there are vast mechanistic
and molecular differences between how memory is stored
in the olfactory system compared to the immune system. However, we can still ask which key features of the
two systems prompt such distinct information encoding
strategies for their respective distributed versus specialized memory. To probe the emergence of distinct memory strategies, we propose a generalized Hopfield model
that can learn and store memory against both the static
and the dynamic (evolving) patterns. We formulate this
problem as an optimization task to find a strategy (i.e.,
learning rate and network structure) that confers the highest accuracy for memory retrieval in a network (Fig. 1).
In contrast to the static case, we show that a distributed memory in the style of a classical Hopfield
model [26] fails to efficiently work for evolving patterns. We show that the optimal learning rate should
increase with faster evolution of patterns, so that a network can follow the dynamics of the evolving patterns.
This heightened learning rate tends to carve narrow
connecting paths (mountain passes) between the memory attractors of a network’s energy landscape, through
which patterns can equilibrate in and be associated with
a wrong memory. To overcome this misclassification,
we demonstrate that specialized memory compartments
emerge in a neural network as an optimal solution to
efficiently recognize and retrieve a memory of out-of-equilibrium evolving patterns. Our results suggest that
evolution of pathogenic patterns may be one of the key
reasons why the immune system encodes a focused memory, as opposed to the distributed memory used in the
olfactory system, for which molecular mixtures largely
present static patterns. Beyond this biological intuition, our model offers a principle-based analytical framework to study learning and memory generation in out-of-equilibrium dynamical systems.
II. RESULTS

A. Model of working memory for evolving patterns
To probe memory strategies against different types of
stimuli, we propose a generalized energy-based model
of associative memory, in which a Hopfield-like neural network is able to learn and subsequently recognize binary patterns. This neural network is characterized by an energy landscape and memory is stored
as the network’s energy minima. We encode the target of recognition (stimulus) in a binary vector σ (pattern) with L entries: σ = (σ1 , . . . , σL ), with σi = ±1,
∀i (Fig. 1A). To store associative memory, we define a
fully connected network represented by an interaction matrix J = (J_{i,j}) of size L × L, and use a Hopfield-like energy function (Hamiltonian) to describe pattern recognition, E_J(σ) = −(1/2L) Σ_{i,j} J_{i,j} σ_i σ_j ≡ −(1/2) ⟨σ|J|σ⟩ [26] (Fig. 1C). Here, we use a short-hand notation to denote the normalized pattern vector by |σ⟩ ≡ (1/√L) σ and its transpose by ⟨σ|, resulting in a normalized scalar product ⟨σ|σ′⟩ ≡ (1/L) Σ_i σ_i σ′_i and a matrix product ⟨σ|J|σ⟩ ≡ (1/L) Σ_{i,j} σ_i J_{i,j} σ_j.
The network undergoes a learning process, during
which different patterns are presented sequentially and
in random order (Fig. 1B). As a pattern σ α is presented,
the interaction matrix J is updated according to the following Hebbian update rule [33]:

$$J_{i,j} \longrightarrow J'_{i,j} = \begin{cases} (1-\lambda)\,J_{i,j} + \lambda\,\sigma_i^\alpha\sigma_j^\alpha, & \text{if } i\neq j;\\ 0, & \text{otherwise.}\end{cases} \qquad (1)$$
Here λ is the learning rate. In this model, the memorized patterns are represented by energy minima associated with the matrix J. We consider the case where
the number N of different pattern classes is below the
Hopfield capacity of the network (i.e., N ≲ 0.14 L; see
refs. [26, 34, 35]).
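To make the learning dynamics concrete, the update of eq. 1 can be written in a few lines of NumPy. This is a minimal illustrative sketch (not the authors' code; the function names and array conventions are our own):

```python
import numpy as np

def hebbian_update(J, sigma, lam):
    """Hebbian update of eq. 1: J' = (1 - lambda) J + lambda sigma sigma^T,
    with the diagonal kept at zero.
    J: (L, L) coupling matrix; sigma: (L,) vector of +/-1 spins; lam: learning rate."""
    J_new = (1.0 - lam) * J + lam * np.outer(sigma, sigma)
    np.fill_diagonal(J_new, 0.0)
    return J_new

def energy(J, sigma):
    """Recognition energy E_J(sigma) = -(1/2L) sum_ij J_ij sigma_i sigma_j."""
    return -0.5 / len(sigma) * sigma @ J @ sigma
```

Repeatedly presenting patterns drawn at random from the N classes and applying hebbian_update reproduces the learning dynamics sketched in Fig. 1B.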
With the update rule in eq. 1, the network develops energy minima as associative memory close to each of the previously presented pattern classes σ^α (α ∈ {1, . . . , N}) (Fig. 1C). Although the network also has minima close to the negated patterns, i.e., to −σ^α, they do not play any role in what follows. To find an associative memory we let a presented pattern σ^α equilibrate in the energy landscape, whereby we accept spin-flips σ^α → σ̃^α with a probability min(1, e^{−βH (E_J(σ̃) − E_J(σ))}), where βH is the inverse equilibration (Hopfield) temperature (Appendix A). In the low-temperature regime (i.e., high βH), equilibration in networks with working memory drives a presented pattern σ^α towards a similar attractor σ^α_att, reflecting the memory associated with the corresponding energy minimum (Fig. 1C). This similarity is measured by the overlap q^α ≡ ⟨σ^α_att|σ^α⟩ and determines the accuracy of the associative memory.
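The equilibration step described above amounts to a standard single-spin-flip Metropolis search in the energy landscape of J. A schematic sketch, assuming the energy convention defined above (helper names are ours):

```python
import numpy as np

def retrieve(J, sigma, beta_H, n_steps=200_000, rng=None):
    """Equilibrate a presented pattern by single spin-flips with Metropolis
    acceptance and return the attractor and its overlap q = <sigma_att|sigma>."""
    rng = np.random.default_rng() if rng is None else rng
    L = len(sigma)
    s = sigma.copy()
    for _ in range(n_steps):
        i = rng.integers(L)
        # energy change of flipping spin i (J is symmetric with zero diagonal)
        dE = 2.0 / L * s[i] * (J[i] @ s)
        if dE <= 0 or rng.random() < np.exp(-beta_H * dE):
            s[i] = -s[i]
    q = (s @ sigma) / L   # overlap with the originally presented pattern
    return s, q
```

In the low-temperature limit (large βH), essentially only downhill moves are accepted and the pattern descends into the nearest attractor.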
Unlike the classical cases of pattern recognition by
Hopfield networks, we assume that patterns can evolve
FIG. 1. Model of working memory for evolving patterns. (A) The targets of recognition are encoded by binary vectors {σ} of length L. Patterns can evolve over time with a mutation rate µ, denoting the fraction of spin-flips in a pattern per network update event. (B) The Hebbian learning rule is shown for a network J, which is presented a set of N patterns {σ^α} (colors) over time. At each step, one pattern σ^α is randomly presented to the network and the network is updated with learning rate λ (eq. 1). (C) The energy landscapes of networks with distributed memory and optimal learning rate are shown for static (left) and evolving (right) patterns. The equipotential lines are shown in the bottom 2D plane. The energy minima correspond to memory attractors. For static patterns (left), equilibration in the network's energy landscape drives a pattern towards its associated memory attractor, resulting in an accurate reconstruction of the pattern. For evolving patterns (right), the heightened optimal learning rate of the network results in the emergence of connecting paths (mountain passes) between the energy minima. The equilibration process can drive a pattern through a mountain pass towards a wrong memory attractor, resulting in pattern misclassification. (D) A network with distributed memory (left) is compared to a specialized network with multiple compartments (right). To find an associative memory, a presented pattern σ^α with energy E(J, σ^α) in network J equilibrates with inverse temperature βH in the network's energy landscape and falls into an energy attractor σ^α_att. Memory retrieval is a two-step process in a compartmentalized network (right): first, the sub-network J^i is chosen with a probability P_i ∼ exp[−βS E^i(J^i, σ^α)], where βS is the inverse temperature for this decision; second, the pattern equilibrates within the sub-network and falls into an energy attractor σ^α_att.
over time with a rate µ that reflects the average number
of spin-flips in a given pattern per network’s update event
(Fig. 1A). Therefore, the expected number of spin-flips
in a given pattern between two encounters is µeff = N µ,
as two successive encounters of the same pattern are on
average separated by N − 1 encounters (and updates) of
the network with the other patterns. We work in the
regime where the mutation rate µ is small enough such
that the evolved patterns stemming from a common ancestor σ α (t0 ) at time t0 (i.e., the members of the class
α) remain more similar to each other than to members
of the other classes (i.e., µN L ≪ L/2).
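As an illustration of the evolutionary dynamics assumed here, a pattern with mutation rate µ can be evolved by flipping a small random subset of its spins at every network update; one possible sketch (the Poisson choice of the number of flips is our own assumption):

```python
import numpy as np

def mutate(sigma, mu, rng):
    """Flip on average mu*L randomly chosen spins of a +/-1 pattern,
    i.e., a fraction mu of spin-flips per network update event."""
    L = len(sigma)
    n_flips = min(rng.poisson(mu * L), L)
    idx = rng.choice(L, size=n_flips, replace=False)
    new_sigma = sigma.copy()
    new_sigma[idx] *= -1
    return new_sigma
```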
The special case of static patterns (µeff = 0) can reflect
distinct odor molecules, for which associative memory is
stored in the olfactory cortex. On the other hand, the
distinct pattern classes in the dynamic case (µeff > 0)
can be attributed to different types of evolving pathogens
(e.g., influenza, HIV, etc), and the patterns within a class
as different variants of a given pathogen. In our model,
we will use the mutation rate as an order parameter to
characterize the different phases of memory strategies in
biological systems.
B. Optimal learning rate for evolving patterns
In the classical Hopfield model (µeff = 0) the learning rate λ is set to very small values for the network
to efficiently learn the patterns [33]. For evolving patterns, the learning rate should be tuned so that the network can efficiently update the memory retained from the prior learning steps.
FIG. 2. Reduced performance of Hopfield networks in retrieving memory of evolving patterns. (A) The optimal performance of a network, Q* ≡ Q(λ*) (eq. 2), is shown as a function of the effective mutation rate µeff = Nµ. The solid lines show the simulation results for networks encountering different numbers of patterns (colors). The black dotted line shows the naïve expectation for the performance based solely on the evolutionary divergence of the patterns, Q0 ≈ 1 − 2µeff, and the colored dashed lines show the expected performance after accounting for the memory lag g_lag, Q_lag ≈ 1 − 2 g_lag µeff; see Fig. S1 for more details. (B) The optimal learning rate λ* is shown as a function of the effective mutation rate. The solid lines are the numerical estimates and dashed lines show the theoretical predictions (eq. 4). (C) The mean energy obtained by simulations of randomly ordered patterns (solid lines) and the analytical approximation (eq. 3) for ordered patterns (dotted lines) are shown. Error bars show the standard error from the independent realizations (Appendix A). The color code for the number of presented patterns is consistent across panels, and the length of patterns is set to L = 800.
At each encounter, the overlap q^α(t; λ) = ⟨σ^α_att(t; λ)|σ^α(t)⟩ between a pattern σ^α(t) and the attractor of the associated energy minimum σ^α_att(t; λ) determines the accuracy of pattern recognition; the parameter λ explicitly indicates the dependency of the network's energy landscape on the learning rate. We declare pattern recognition as successful if the accuracy of reconstruction (overlap) is larger than a set threshold, q^α(t) ≥ 0.8, but our results are insensitive to the exact value of this threshold (Appendix A and Fig. S3). We define a network's performance as the asymptotic accuracy of its associative memory averaged
over the ensemble of pattern classes (Fig. 2A),
$$Q(\lambda) \equiv \mathbb{E}\left[q^\alpha(t;\lambda)\right] \simeq \lim_{T\to\infty}\frac{1}{T}\sum_{t=0}^{T}\frac{1}{N}\sum_{\alpha=1}^{N}\langle\sigma^\alpha_{\mathrm{att}}(t)|\sigma^\alpha(t)\rangle \qquad (2)$$

The expectation E[·] is an empirical average over the ensemble of presented pattern classes over time, which in the stationary state approaches the asymptotic average of the argument. The optimal learning rate is determined by maximizing the network's performance, λ* = argmax_λ Q(λ).

The optimal learning rate increases with growing mutation rate so that a network can follow the evolving patterns (Fig. 2B). Although it is difficult to analytically calculate the optimal learning rate, we can use an approximate approach and find the learning rate that minimizes the expected energy of the patterns E[E_{λ,ρ}(J, σ)], assuming that patterns are shown to the network in a fixed order (Appendix B). In this case, the expected energy is given by

$$\mathbb{E}\left[E_{\lambda,\rho}(J,\sigma)\right] = -\frac{L-1}{2}\times\frac{\lambda}{1-\lambda}\times\frac{1}{\rho^{-2N}(1-\lambda)^{-N}-1}, \qquad (3)$$

where ρ^N ≡ (1 − 2µ)^N ≈ 1 − 2µeff is the upper bound for the overlap between a pattern and its evolved form, when separated by the other N − 1 patterns that are shown in between. The expected energy grows slowly with increasing mutation rate (i.e., with decreasing overlap q), and the approximation in eq. 3 agrees very well with the numerical estimates for the scenario where patterns are shown in a random order (Fig. 2C). In the regime where memory can still be associated with the evolved patterns (µeff ≪ 0.5), minimization of the expected energy (eq. 3) results in an optimal learning rate,

$$\lambda^*(\mu) = \sqrt{8\mu/(N-1)}, \qquad (4)$$

that scales with the square root of the mutation rate. Notably, this theoretical approximation agrees well with the numerical estimates (Fig. 2B).
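Equations 3 and 4 can be checked against each other numerically by scanning the learning rate and locating the minimum of the expected energy; a short sketch under the stated approximations (the parameter values below are only an example):

```python
import numpy as np

def expected_energy(lam, mu, N, L):
    """Expected pattern energy of eq. 3 (patterns shown in a fixed order)."""
    rho2N = (1.0 - 2.0 * mu) ** (2 * N)
    return -(L - 1) / 2.0 * lam / (1.0 - lam) / (1.0 / (rho2N * (1.0 - lam) ** N) - 1.0)

def predicted_optimal_rate(mu, N):
    """Leading-order optimum of eq. 4: lambda* = sqrt(8 mu / (N - 1))."""
    return np.sqrt(8.0 * mu / (N - 1))

mu, N, L = 1e-4, 32, 800
lams = np.linspace(1e-4, 0.2, 2000)
lam_from_energy = lams[np.argmin([expected_energy(l, mu, N, L) for l in lams])]
print(lam_from_energy, predicted_optimal_rate(mu, N))  # the two agree to leading order
```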
C. Reduced accuracy of distributed associative memory against evolving patterns
Despite using an optimized learning rate, a network’s
accuracy in pattern retrieval Q(λ) decays much faster
than the naïve expectation solely based on the evolutionary divergence of patterns between two encounters with a given class (i.e., Q0 = (1 − 2µ)^N ≈ 1 − 2µeff); see Fig. 2A.
There are two reasons for this reduced accuracy: (i) the
lag in the network’s memory against evolving patterns,
and (ii) misclassification of presented patterns.
The memory attractors associated with a given pattern class can lag behind and only reflect the older
patterns presented prior to the most recent encounter
of the network with the specified class. We characterize this lag g_lag by identifying the previous version of the pattern that has the maximum overlap with the network's energy landscape at a given time t: g_lag = argmax_{g≥0} E[⟨σ(t − gN)|J(t)|σ(t − gN)⟩] (Appendix B.2). g_lag measures time in units of N (i.e., the effective separation time of patterns of the same class). An increase in the optimal learning rate reduces the time lag and enables the network to follow the evolving patterns more closely (Fig. S1). The accuracy of the memory subject to such a lag decays as Q_lag = ρ^{g_lag N} ≈ 1 − 2 g_lag µeff, which is faster than the naïve expectation (i.e., 1 − 2µeff);
see Fig. 2A. This memory lag explains the loss of performance for patterns that are still reconstructed by the
network’s memory attractors (i.e., those with q α > 0.8;
Fig. S1A). However, the overall performance of the network Q(λ) remains lower than the expectation obtained
by taking into account this time lag (Fig. 2A)—a discrepancy that leads us to the second root of reduction in
accuracy, i.e., pattern misclassification.
As the learning rate increases, the structure of the network’s energy landscape changes. In particular, we see
that with large learning rates a few narrow paths emerge
between the memory attractors of networks (Fig. 1C). As
a result, the equilibration process for pattern retrieval
can drive a presented pattern through the connecting
paths towards a wrong memory attractor (i.e., one with
a small overlap ⟨σ_att|σ⟩), which leads to pattern misclassification (Fig. S2A and Fig. S3A,C). These paths are
narrow as there are only a few possible spin-flips (mutations) that can drive a pattern from one valley to another during equilibration (Fig. S3B,D and Fig. S4A,C).
In other words, a large learning rate carves narrow mountain passes in the network’s energy landscape (Fig. 1C),
resulting in a growing fraction of patterns being misclassified. Interestingly, pattern misclassification occurs
even in the absence of mutations for networks with an
increased learning rate (Fig. S2A). This suggests that
mutations only indirectly contribute to the misclassification of memory, as they necessitate a larger learning rate
for the networks to optimally operate, which in turn results in the emergence of mountain passes in the energy
landscape.
To understand the memory misclassification, particularly for patterns with moderately low (i.e., non-random)
energy (Fig. 2C), we use spectral decomposition to characterize the relative positioning of patterns in the energy
landscape (Appendix C). The vector representing each pattern, |σ⟩, can be expressed in terms of the network's eigenvectors {Φ^i}, |σ⟩ = Σ_i m_i |Φ^i⟩, where the overlap m_i ≡ ⟨Φ^i|σ⟩ is the ith component of the pattern in the
network’s coordinate system. During equilibration, we
flip individual spins in a pattern and accept the flips
based on their contribution to the recognition energy.
We can view these spin-flips as rotations of the pattern
in the space spanned by the eigenvectors of the network.
Stability of a pattern depends on whether these rotations
could carry the pattern from its original subspace over to
an alternative region associated with a different energy
minimum.
There are two key factors that modulate the stabil-
ity of a pattern in a network. The dimensionality of the
subspace in which a pattern resides, i.e., support of a
pattern by the network’s eigenvectors, is one of the key
determining factors for pattern stability. We quantify the support of a pattern σ using the participation ratio π(σ) = (Σ_i m_i²)² / Σ_i m_i⁴ [36, 37], which counts the
number of eigenvectors that substantially overlap with
the pattern. A small support π(σ) ≈ 1 indicates that
the pattern is spanned by only a few eigenvectors and
is restricted to a small sub-space, whereas a large support indicates that the pattern is orthogonal to only a
few eigenvectors. As the learning rate increases, patterns
lie in lower dimensional sub-spaces supported by only a
few eigenvectors (Fig. S4B,D). This effect is exacerbated by the fact that the energy gaps between the eigenstates of the network also broaden with increasing learning rate
(Fig. S5). The combination of a smaller support for patterns and a larger energy gap in networks with increased
learning rate leads to the destabilization of patterns by
enabling the spin-flips during equilibration to drive a pattern from one subspace to another, through the mountain
passes carved within the landscape; see Appendix C and
Fig. S6 for the exact analytical criteria for pattern stability.
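The participation ratio used above can be evaluated directly from the spectrum of J; a brief sketch (computed here over all eigenvectors of J, which is our own simplification):

```python
import numpy as np

def participation_ratio(J, sigma):
    """pi(sigma) = (sum_i m_i^2)^2 / sum_i m_i^4 with m_i = <Phi_i|sigma>,
    counting the eigenvectors of J that substantially overlap with sigma."""
    _, eigvecs = np.linalg.eigh(J)                 # columns are eigenvectors Phi_i
    m = eigvecs.T @ (sigma / np.sqrt(len(sigma)))  # overlaps m_i
    m2 = m ** 2
    return m2.sum() ** 2 / (m2 ** 2).sum()
```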
D. Compartmentalized learning and memory storage
Hopfield-like networks can store accurate associative
memory for static patterns. However, these networks fail
to perform and store retrievable associative memory for
evolving patterns (e.g. pathogens), even when learning
is done at an optimal rate (Fig. 2). To overcome this
difficulty, we propose to store memory in compartmentalized networks, with C sub-networks of size Lc (i.e., the
number of nodes in a sub-network). Each compartment
(sub-network) can store a few of the total of N pattern classes without interference from the other compartments (Fig. 1D).
Recognition of a pattern σ in a compartmentalized network involves a two-step process (Fig. 1D): First, we
choose a sub-network J i associated with compartment i
with a probability P_i = exp[−βS E(J^i, σ)]/𝒩, where βS is the inverse temperature for this decision and 𝒩 is the
normalizing factor. Once the compartment is chosen, we
follow the recipe for pattern retrieval in the energy landscape of the associated sub-network, whereby a pattern
equilibrates into a memory attractor.
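The two-step retrieval can be sketched as follows; the softmax over compartment energies mirrors eq. S1 of Appendix A, and `retrieve` stands for the hypothetical Metropolis helper sketched earlier in this section:

```python
import numpy as np

def choose_compartment(subnets, sigma, beta_S, rng):
    """Pick a sub-network index i with probability P_i ~ exp[-beta_S E(J_i, sigma)]."""
    L = len(sigma)
    E = np.array([-0.5 / L * sigma @ J @ sigma for J in subnets])
    w = np.exp(-beta_S * (E - E.min()))            # numerically stabilized softmax
    return rng.choice(len(subnets), p=w / w.sum())

def retrieve_compartmentalized(subnets, sigma, beta_S, beta_Hc, rng):
    """Step 1: choose a compartment; step 2: equilibrate within it."""
    i = choose_compartment(subnets, sigma, beta_S, rng)
    attractor, q = retrieve(subnets[i], sigma, beta_Hc, rng=rng)
    return i, attractor, q
```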
On average, each compartment stores a memory for
Nc = N/C pattern classes. To keep networks with different numbers of compartments C comparable, we scale the size of each compartment Lc to keep C × Lc = constant, which keeps the (Hopfield) capacity of the network α = Nc/Lc invariant under compartmentalization. Moreover, the mutation rate experienced by each sub-network scales with the number of compartments, µc = Cµ, which keeps the effective mutation rate µeff = Nc µc invariant under compartmentalization. As a result, the optimal learning rate (eq. 4) scales with the number of compartments C as λ*_c = √(8µc/(Nc − 1)) ≈ C λ*_1. However, since updates are restricted to sub-networks of size Lc at a time, the expected amount of updates within a network, Lc λc, remains invariant under compartmentalization. Lastly, since the change in energy from a single spin-flip scales as ∆E ∼ 1/Lc, we introduce the scaled Hopfield temperature βHc ≡ C βH to make the equilibration process comparable across networks with different numbers of compartments. No such scaling is necessary for βS.

By restricting the networks to satisfy these scaling relationships, we are left with two independent variables: (i) the number of compartments C, and (ii) the learning rate λc, which together define a memory strategy {C, λc}. A memory strategy can then be optimized to achieve maximum accuracy for retrieving an associative memory for evolving patterns with a given effective mutation rate µeff.
E. Phases of learning and memory production
Pattern retrieval can be stochastic due to the noise
in choosing the right compartment from the C subnetworks (tuned by the inverse temperature βS ), or the
noise in equilibrating into the right memory attractor in
the energy landscape of the chosen sub-network (tuned
by the Hopfield inverse temperature βHc). We use mutual information to quantify the accuracy of pattern-compartment association, where larger values indicate a more accurate association; see Appendix A and Fig. 4.
The optimal performance Q∗ determines the overall
accuracy of memory retrieval, which depends on both
finding the right compartment and equilibrating into
the correct memory attractor. The amplitudes of intra- versus inter-compartment stochasticity determine the optimal strategy {C*, λ*_c} used for learning and retrieval
of patterns with a specified mutation rate. Varying the
corresponding inverse temperatures (βHc , βS ) results in
three distinct phases of optimal memory storage.
i. Small intra- and inter-compartment noise (βHc ≫ 1,
βS ≫ 1). In this regime, neither the compartment
choice nor the pattern retrieval within a compartment
are subject to strong noise. As a result, networks
are functional with working memory and the optimal
strategies can achieve the highest overall performance.
For small mutation rates, we see that all networks
perform equally well and can achieve almost perfect
performance, irrespective of their number of compartments (Figs. 3A, 4A,B). As the mutation rate increases,
networks with a larger number of compartments show
a more favorable performance, and the 1-to-1 specialized network, in which each pattern is stored in
a separate compartment (i.e., N = C), reaches the
optimal performance 1 − 2µeff (Figs. 3A, 4C,D). As
FIG. 3. Compartmentalized memory storage. (A) The optimal performance is shown as a function of the effective mutation rate (similar to Fig. 2A) for networks with different numbers of compartments C (colors), ranging from a network with distributed memory (C = 1, blue) to a 1-to-1 compartmentalized network (C = N, red). (B) The optimal (scaled) learning rate λ*_c/C is shown as a function of the effective mutation rate for networks with different numbers of compartments (colors as in (A)). Full lines show the numerical estimates and the dashed line is the analytical approximation, λ*_c = √(8µc/(Nc − 1)) ≈ C λ*_1. The scaled learning rates collapse onto the analytical approximation for all networks except the 1-to-1 compartmentalized network (red), for which the maximal learning rate λ ≈ 1 is used and each compartment is fully updated upon an encounter with a new version of a pattern. The number of presented patterns is set to N = 32. We keep L × C = const., with L = 800 used for the network with C = 1.
predicted by the theory, the optimal learning rate for compartmentalized networks scales with the mutation rate as λ*_c ∼ µc^{1/2}, except for the 1-to-1 network, in which λ*_c → 1 and sub-networks are steadily updated upon an
encounter with a pattern (Fig. 3B). This rapid update
is expected since there is no interference between the
stored memories in the 1-to-1 network, and a steady
update can keep the stored memory in each sub-network
close to its associated pattern class without disrupting
the other energy minima.
ii. Small intra- and large inter-compartment noise
(βHc ≫ 1, βS ≪ 1). In this regime there is low noise
for equilibration within a compartment but a high
level of noise in choosing the right compartment. The
optimal strategy in this regime is to store patterns
in a single network with a distributed memory, since
FIG. 4. Phases of learning and memory production. Different optimal memory strategies are shown. (A) The heatmap
shows the optimal memory performance Q∗ as a function of the (scaled) Hopfield inverse temperature βHc = βH · C and the
inverse temperature associated with compartmentalization βS for networks learning and retrieving a memory of static patterns
(µ = 0); colors indicated in the color bar. The optimal performance is achieved by using the optimal strategy (i.e., learning
rate λ∗c and the number of compartments c∗ ) for networks at each value of βHc and βS . The three phases of accurate, partial,
and no-memory are indicated. (B) The heatmap shows the memory strategies for the optimal number of compartments (colors
as in the legend) corresponding to the memory performance shown in (A). We limit the optimization to the possible number of
compartments indicated in the legend to keep N/C an integer. The dashed region corresponds to the case where all strategies
perform equally well. Regions of distributed memory (C = 1) and the 1-to-1 specialized memory (C = N ) are indicated. The
top panel shows the optimal performance Q∗ of different strategies as a function of the Hopfield inverse temperature βHc . The
right panel shows the mutual information MI(Σ, C) between the set of pattern classes Σ ≡ {σ α } and the set of all compartments
C normalized by the entropy of the compartments H(C) as a function of the inverse temperature βS ; see Appendix A.3. This
normalized mutual information quantifies the ability of the system to assign a pattern to the correct compartment. (C-D)
Similar to (A-B) but for evolving patterns with the effective mutation rate µeff = 0.01. The number of presented patterns is
set to N = 32 (all panels). Similar to Fig. 3 we keep L × C = const., with L = 800 used for networks with C = 1.
identifying the correct compartment is difficult due
to noise (Fig. 4B,D). For static patterns this strategy
corresponds to the classical Hopfield model with a
high accuracy (Figs. 2A, 4A,B). On the other hand,
for evolving patterns this strategy results in a partial
memory (Fig. 4C,D) due to the reduced accuracy
of the distributed associative memory, as shown in
Fig. 2A. Interestingly, the transition between the optimal strategy with highly specific (compartmentalized)
memory for evolving patterns in the first regime and the
generalized (distributed) memory in this regime is very
sharp (Fig. 4D). This sharp transition suggests that
depending on the noise in choosing the compartments
βS , an optimal strategy either stores memory in a 1-to-1
specialized fashion (C = N ) or in a distributed generalized fashion (C = 1), but no intermediate solution (i.e.,
quasi-specialized memory with 1 < C < N ) is desirable.
iii. Large intra-compartment noise (βHc < 1). In
this regime there is a high level of noise in equilibration within a network and memory cannot be reliably
retrieved (Fig. 4A,C), regardless of the compartmentalization temperature βS . However, the impact of the equilibration noise βHc on the accuracy of memory retrieval
depends on the degree of compartmentalization. For the
1-to-1 specialized network (C = N ), the transition between the high and the low accuracy is smooth and occurs at βHc = 1, below which no memory attractor can be
found. As we increase the equilibration noise (decrease
βHc ), the networks with distributed memory (C < N )
show two-step transitions, with a plateau in the range of 1/Nc ≲ βHc ≲ 1. Similar to the 1-to-1 network, the first
transition at βHc ≈ 1 results in the reduced accuracy of
the networks’ memory retrieval. At this transition point,
the networks’ learning rate λc approaches its maximum
value 1 (Fig. S7), which implies that the memory is stored
(and steadily updated) for only C < N patterns (i.e., one
pattern per sub-network). Due to the invariance of the
networks’ mean energy under compartmentalization, the
depth of the energy minima associated with the stored
memory in each sub-network scales as N/C, resulting in
deeper and more stable energy minima in networks with
smaller number of compartments C. Therefore, as the
noise increases (i.e., βHc decreases), we observe a gradient in transition from partial retrieval to a no-memory
state at βH ≈ 1/Nc , with the most compartmentalized
network (larger C) transitioning the fastest, reflecting the
shallowness of its energy minima.
Taken together, the optimal strategy leading to working memory depends on whether a network is trained to
learn and retrieve dynamic (evolving) patterns or static
patterns. Specifically, we see that the 1-to-1 specialized network is the unique optimal solution for storing
working memory for evolving patterns, whereas the distributed generalized memory (i.e., the classical Hopfield
network) performs equally well in learning and retrieval
of memory for static patterns. The contrast between
these memory strategies can shed light on the distinct
molecular mechanisms utilized by different biological systems to store memory.
III. DISCUSSION
Storing and retrieving memory from prior molecular
interactions is an efficient scheme to sense and respond to
external stimuli. Here, we introduced a flexible energy-based network model that can adopt different memory
strategies, including distributed memory, similar to the
classical Hopfield network, or compartmentalized memory. The learning rate and the number of compartments
in a network define a memory strategy, and we probed
the efficacy of different strategies for static and dynamic
patterns. We found that Hopfield-like networks with distributed memory are highly accurate in storing associative memory for static patterns. However, these networks
fail to reliably store retrievable associative memory for
evolving patterns, even when learning is done at an optimal rate.
To achieve a high accuracy, we showed that a retrievable memory for evolving patterns should be compartmentalized, where each pattern class is stored in a separate sub-network. In addition, we found a sharp transition between the different phases of working memory
(i.e., compartmentalized and distributed memory), suggesting that intermediate solutions (i.e., quasi-specialized
memory) are sub-optimal against evolving patterns.
The contrast between these memory strategies is reflective of the distinct molecular mechanisms used for
memory storage in the adaptive immune system and in
the olfactory cortex. In particular, the memory of odor
complexes, which can be assumed as static, is stored in a
distributed fashion in the olfactory cortex [7–11, 24]. On
the other hand, the adaptive immune system, which encounters evolving pathogens, allocates distinct immune
cells (i.e., compartments) to store a memory for different
types of pathogens (e.g. different variants of influenza
or HIV)—a strategy that resembles that of the 1-to-1
specialized networks [5, 27–32]. Our results suggest that
pathogenic evolution may be one of the reasons for the
immune system to encode a specialized memory, as opposed to the distributed memory used in the olfactory
system.
The increase in the optimal learning rate in anticipation of patterns’ evolution significantly changes the structure of the energy landscape for associative memory. In
particular, we found the emergence of narrow connectors
(mountain passes) between the memory attractors of a
network, which destabilize the equilibration process and
significantly reduce the accuracy of memory retrieval. Indeed, tuning the learning rate as a hyper-parameter is
one of the challenges of current machine learning algorithms with deep neural networks (DNNs) [38, 39]. The
goal is to navigate the tradeoff between the speed (i.e.,
rate of convergence) and accuracy without overshooting
during optimization. It will be interesting to see how
the insights developed in this work can inform rational
approaches to choose an optimal learning rate in optimization tasks with DNNs.
Machine learning algorithms with DNNs [38] and modified Hopfield networks [40–43] are able to accurately
classify hierarchically correlated patterns, where different
objects can be organized into an ultrametric tree based
on some specified relations of similarity. For example,
faces of cats and dogs have the oval shape in common
but they branch out in the ultrametric tree according
to the organism-specific features, e.g., whiskers in a cat,
and the cat branch can then further diversify based on
the breed-specific features. A classification algorithm can
use these hierarchical relations to find features common
among members of a given sub-type (cats) that can distinguish them from another sub-type (dogs). Although
evolving patterns within each class are correlated, the
random evolutionary dynamics of these patterns does not build a hierarchical structure where a pattern class branches into two sub-classes that share a common ancestral root. Therefore, the optimal memory strategies
found here for evolving patterns are distinct from those
of the hierarchically correlated patterns. It will be interesting to see how our approaches can be implemented in
DNNs to classify dynamic and evolving patterns.
ACKNOWLEDGEMENT
This work has been supported by the DFG grant
(SFB1310) for Predictability in Evolution and the MPRG
funding through the Max Planck Society. O.H.S. also
acknowledges funding from Georg-August University
School of Science (GAUSS) and the Fulbright foundation.
Appendix A: Computational procedures
A1. Initialization of the network
A network J (with elements J_{ij}) is presented with N random (orthogonal) patterns |σ^α⟩ (with α = 1, . . . , N), with entries σ^α_i ∈ {−1, 1}, reflecting the N pattern classes. For a network with C compartments (with 1 ≤ C ≤ N), we initialize each sub-network J^s at time t0 as

$$J^s_{i,j}(t_0) = \frac{1}{N/C}\sum_{\alpha\in A_s}\sigma^\alpha_i\sigma^\alpha_j, \qquad J^s_{ii}(t_0) = 0;$$

here, A_s is a set of N/C randomly chosen (without replacement) patterns initially assigned to the compartment (sub-network) s. We then let
the network undergo an initial learning process. At each step an arbitrary pattern σ ν is presented to the network and
a sub-network J^s is chosen for an update with a probability

$$P_s = \frac{\exp\left[-\beta_S\,E\left(J^s(t), \sigma^\nu(t)\right)\right]}{\sum_{r=1}^{C}\exp\left[-\beta_S\,E\left(J^r(t), \sigma^\nu(t)\right)\right]}, \qquad \mathrm{(S1)}$$

where the energy is defined as

$$E\left(J^s(t), \sigma^\nu(t)\right) = \frac{-1}{2L}\sum_{i,j} J^s_{i,j}(t)\,\sigma^\nu_i(t)\,\sigma^\nu_j(t) \equiv \frac{-1}{2}\,\langle\sigma^\nu(t)|J^s(t)|\sigma^\nu(t)\rangle \qquad \mathrm{(S2)}$$
and βS is the inverse temperature associated with choosing the right compartment. We then update the selected sub-network J^s, using the Hebbian update rule,

$$J^s_{i,j}(t+1) = \begin{cases} (1-\lambda)\,J^s_{i,j}(t) + \lambda\,\sigma^\nu_i\sigma^\nu_j, & \text{if } i\neq j;\\ 0, & \text{otherwise.}\end{cases} \qquad \mathrm{(S3)}$$
For dynamic patterns, the presented patterns undergo evolution with mutation rate µ, which reflects the average
number of spin flips in a given pattern per network’s update event (Fig. 1).
Our goal is to study the memory retrieval problem in a network that has reached its steady state. The state of a network J(t_n) at the time step n can be traced back to the initial state J(t_0) as,

$$J(t_n) = (1-\lambda)^n J(t_0) + \lambda\sum_{i=1}^{n}(1-\lambda)^{n-i}\,|\sigma(t_i)\rangle\langle\sigma(t_i)| \qquad \mathrm{(S4)}$$

The contribution of the initial state J(t_0) to the state of the network at time t_n decays as (1 − λ)^n (eq. S4). Therefore, we choose the number of steps to reach the steady state as n_stat. = max[10N, 2C ceil(log 10^{−5}/log(1 − λ))]. This criterion ensures that (1 − λ)^{n_stat.} ≤ 10^{−5} and the memory of the initial state J(t_0) is removed from the network J(t).
We will then use this updated network to collect the statistics for memory retrieval. To report a specific quantity
from the network (e.g., the energy), we pool the nstat. samples collected from each of the 50 realizations.
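Putting the pieces of this section together, the initialization and training protocol can be summarized by the schematic loop below. This is a simplified, self-contained sketch with our own variable names; in particular, all sub-networks here act on full-length patterns, and mutations are drawn independently per spin:

```python
import numpy as np

def train_to_steady_state(L, N, C, lam, mu, beta_S, seed=0):
    """Initialize C sub-networks with N/C pattern classes each and run the
    learning dynamics until the initial condition has decayed away."""
    rng = np.random.default_rng(seed)
    patterns = rng.choice([-1, 1], size=(N, L))
    assignment = rng.permutation(N).reshape(C, N // C)    # N/C classes per compartment
    subnets = []
    for s in range(C):
        J = sum(np.outer(patterns[a], patterns[a]) for a in assignment[s]) / (N // C)
        np.fill_diagonal(J, 0.0)
        subnets.append(J)

    energy = lambda J, x: -0.5 / L * x @ J @ x
    # number of steps such that (1 - lam)^n_stat <= 1e-5 (see text)
    n_stat = max(10 * N, 2 * C * int(np.ceil(np.log(1e-5) / np.log(1.0 - lam))))
    for _ in range(n_stat):
        # every pattern accumulates on average mu*L spin-flips per update event
        patterns[rng.random(size=(N, L)) < mu] *= -1
        nu = rng.integers(N)                              # present a random class
        E = np.array([energy(J, patterns[nu]) for J in subnets])
        w = np.exp(-beta_S * (E - E.min()))
        s = rng.choice(C, p=w / w.sum())                  # compartment choice, eq. S1
        subnets[s] = (1.0 - lam) * subnets[s] + lam * np.outer(patterns[nu], patterns[nu])
        np.fill_diagonal(subnets[s], 0.0)                 # Hebbian update, eq. S3
    return subnets, patterns
```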
A2. Pattern retrieval from associative memory
Once the trained network approaches a stationary state, we collect the statistics of the stored memory.
To find a memory attractor σ^α_att for a given pattern σ^α we use a Metropolis algorithm in the energy landscape E(J^s, σ^α) (eq. S2). To do so, we make spin-flips in a presented pattern σ^α → σ̃^α and accept a spin-flip with probability

$$P(\sigma^\alpha \to \tilde\sigma^\alpha) = \min\left(1,\, e^{-\beta_H\,\Delta E}\right), \qquad \mathrm{(S5)}$$

where ∆E = E(J^s, σ̃^α) − E(J^s, σ^α) and βH is the inverse (Hopfield) temperature for pattern retrieval in the network
(see Fig. 1). We repeat this step for 2 × 10^6 steps, which is sufficient to find a minimum of the landscape (see Fig. S3).
For systems with more than one compartment C, we first choose a compartment according to eq. S1, and then
perform the Metropolis algorithm within the associated compartment.
After finding the energy minima, we update the systems for n′_stat. = max[2 × 10^3, n_stat.] steps. At each step we
present patterns as described above and collect the statistics of the recognition energy E(J s (t), σ α (t)) between a
presented pattern σ α and the memory compartment J s (t), assigned according to eq. S1. These measurements are
used to describe the energy statistics (Figs. 2,S2) of the patterns and the accuracy of pattern-compartment association
(Fig. 4B,D). After the n′stat. steps, we again use the Metropolis algorithm to find the memory attractors associated
with the presented patterns. We repeat this analysis for 50 independent realizations of the initializing pattern classes
{σ α (t0 )}, for each set of parameters {L, N, C, λ, µ, βS , βH }.
When calculating the mean performance Q of a strategy (see Figs. 2, 3, 4, S7), we set the overlap between attractor and pattern, q^α = |⟨σ^α_att|σ^α⟩|, equal to zero when patterns are not recognized (q^α < 0.8). As a result, systems can only achieve a non-zero performance if they recognize some of the patterns. This choice eliminates the finite-size effect of a random overlap ∼ 1/√L between an attractor and a pattern (see Fig. S3). This correction is especially
effect of a random overlap ∼ 1/ L between an attractor and a pattern (see Fig. S3). This correction is especially
important when comparing systems with different sub-network sizes (Lc ≡ L/C) in the βH < 1 regime (Figs. 4,S7),
where random overlaps for small Lc could otherwise result in a larger mean performance compared to larger systems
that correctly reconstruct a fraction of the patterns.
A3. Accuracy of pattern-compartment association
We use the mutual information MI(Σ, C) between the set of pattern classes Σ ≡ {σ α } and the set of all compartments
C to quantify the accuracy in associating a presented pattern with the correct compartment,
$$\mathrm{MI}(\Sigma, \mathcal{C}) = H(\mathcal{C}) - H(\mathcal{C}|\Sigma) = -\sum_{c\in\mathcal{C}} P(c)\log P(c) - \left[-\sum_{\sigma^\alpha\in\Sigma} P(\sigma^\alpha)\sum_{c\in\mathcal{C}} P(c|\sigma^\alpha)\log P(c|\sigma^\alpha)\right]. \qquad \mathrm{(S6)}$$
Here H(C) and H(C|Σ) are the entropy of the compartments, and the conditional entropy of the compartments given
the presented patterns, respectively. If chosen randomly, the entropy associated with choosing a compartment is
H^random(C) = log C. The mutual information (eq. S6) measures the reduction in the entropy of compartments due
to the association between the patterns and the compartments, measured by the conditional entropy H(C|Σ). Figure
4B,D shows the mutual information MI(Σ, C) scaled by its upper bound H(C), in order to compare networks with a
different number of compartments.
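The normalized mutual information of eq. S6 can be estimated from the empirical pattern-to-compartment assignment counts; a short sketch (our own helper, using natural logarithms):

```python
import numpy as np

def normalized_mutual_information(counts):
    """counts[a, c]: number of times pattern class a was assigned to compartment c.
    Returns MI(Sigma, C) / H(C), the quantity plotted in Fig. 4B,D."""
    joint = counts / counts.sum()                  # P(sigma^alpha, c)
    p_c = joint.sum(axis=0)                        # P(c)
    p_a = joint.sum(axis=1)                        # P(sigma^alpha)
    H_c = -np.sum(p_c[p_c > 0] * np.log(p_c[p_c > 0]))
    H_c_given_sigma = 0.0                          # conditional entropy H(C|Sigma)
    for a in range(counts.shape[0]):
        if p_a[a] > 0:
            cond = joint[a] / p_a[a]
            cond = cond[cond > 0]
            H_c_given_sigma -= p_a[a] * np.sum(cond * np.log(cond))
    return (H_c - H_c_given_sigma) / H_c
```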
Appendix B: Estimating energy and optimal learning rate for working memory
B1. Approximate solution for optimal learning rate
The optimal learning rate is determined by maximizing the network’s performance Q(λ) (eq. 2) against evolving
patterns with a specified mutation rate:

$$\lambda^* = \underset{\lambda}{\mathrm{argmax}}\; Q(\lambda) \qquad \mathrm{(S1)}$$
We can numerically estimate the optimal learning rate as defined in eq. S1; see Figs. 2, 3. However, the exact analytical evaluation of the optimal learning rate is difficult, and we use an approximate approach and find the learning rate that minimizes the expected energy of the patterns in the stationary state, E[E_{λ,ρ}(J, σ)], assuming that patterns are shown to the network in a fixed order. Here, the subscripts explicitly indicate the learning rate of the network λ and the evolutionary overlap of the pattern ρ. To evaluate an analytical approximation for the energy, we first evaluate the state of the network J(t) at time step t, given all the prior encounters of the network with patterns shown in a fixed order.
$$\frac{1}{L}\left(J(t)+\mathbb{1}\right) = \lambda\sum_{j=1}^{\infty}(1-\lambda)^{(j-1)}\,|\sigma(t-j)\rangle\langle\sigma(t-j)| \qquad \mathrm{(S2)}$$

$$= \lambda\underbrace{\sum_{i=1}^{\infty}(1-\lambda)^{(i-1)N}}_{\text{sum over time (generations, }i)}\;\underbrace{\sum_{\alpha=1}^{N}(1-\lambda)^{\alpha-1}\,|\sigma^\alpha(t-\alpha-(i-1)N)\rangle\langle\sigma^\alpha(t-\alpha-(i-1)N)|}_{\text{sum over }N\text{ pattern classes}} \qquad \mathrm{(S3)}$$

$$= \lambda\underbrace{\sum_{\alpha=1}^{N}}_{\text{sum over patterns}}\;\underbrace{\sum_{i=0}^{\infty}(1-\lambda)^{(\alpha-1)+iN}\,|\sigma^\alpha(t-\alpha-iN)\rangle\langle\sigma^\alpha(t-\alpha-iN)|}_{\text{sum over time}}. \qquad \mathrm{(S4)}$$

Here, we referred to the (normalized) pattern vector from the class α presented to the network at time step t by |σ^α(t)⟩ ≡ (1/√L) σ^α(t). Without a loss of generality, we assumed that the last pattern presented to the network at time step t − 1 is from the first pattern class |σ^1(t − 1)⟩, which enabled us to split the sum in eq. S2 into two separate summations over pattern classes and N time-step generations (eq. S3). Adding the identity matrix 𝟙 on the left-hand side of eq. S2 assures that the diagonal elements vanish, as defined in eq. S3.
The mean energy of the patterns, which in our setup is the energy of the pattern from the Nth class at time t, follows

$$\mathbb{E}\left[E_{\lambda,\rho}(J,\sigma)\right] = \mathbb{E}\left[-\frac{1}{2}\langle\sigma^N(t)|J(t)|\sigma^N(t)\rangle\right]$$
$$= \mathbb{E}\left[-\frac{L-1}{2}\,\lambda\sum_{\alpha=1}^{N}\sum_{i=0}^{\infty}(1-\lambda)^{(\alpha-1)+iN}\,\langle\sigma^N(t)|\sigma^\alpha(t-\alpha-iN)\rangle\langle\sigma^\alpha(t-\alpha-iN)|\sigma^N(t)\rangle\right]. \qquad \mathrm{(S5)}$$

Since the pattern families are orthogonal to each other, we can express the overlap between patterns at different times as ⟨σ^α(t1)|σ^ν(t2)⟩ = δ_{α,ν} (1 − 2µ)^{|t2−t1|} ≡ δ_{α,ν} ρ^{|t2−t1|}, and simplify the energy function in eq. S5,

$$\mathbb{E}\left[E_{\lambda,\rho}(J,\sigma)\right] = -\frac{L-1}{2}\,\lambda\sum_{i=0}^{\infty}(1-\lambda)^{(N-1)+iN}\,\rho^{2(N+iN)}$$
$$= -\frac{L-1}{2}\,\lambda(1-\lambda)^{(N-1)}\rho^{2N}\sum_{i=0}^{\infty}\left[(1-\lambda)^{N}\rho^{2N}\right]^{i}$$
$$= -\frac{L-1}{2}\,\frac{\lambda\,(1-\lambda)^{(N-1)}\rho^{2N}}{1-(1-\lambda)^{N}\rho^{2N}}. \qquad \mathrm{(S6)}$$
Since accurate pattern retrieval depends on the depth of the energy valley for the associative memory, we will use the expected energy of the patterns as a proxy for the performance of the network. We can find the approximate optimal learning rate that minimizes the expected energy by setting ∂E[E_{λ,ρ}(J, σ)]/∂λ = 0, which results in

$$(1-2\mu)^{2N} = (1-\lambda^*)^{-N}(1-N\lambda^*) \;\Longrightarrow\; 1 - 4N\mu + O(\mu^2) = 1 + \frac{1}{2}(N-N^2)(\lambda^*)^2 + O(\lambda^3);$$
$$\Longrightarrow\; \lambda^*(\mu) \simeq \sqrt{8\mu/(N-1)}, \qquad \mathrm{(S7)}$$

where we used the fact that both the mutation rate µ and the learning rate λ are small, and therefore expanded eq. S7 up to the leading orders in these parameters.
In addition, eq. S7 establishes an upper bound for the learning rate: λ < 1/N. Therefore, our expansion in mutation rate (eq. S7) is only valid for 8µ < 1/N, or equivalently for µeff = Nµ < 12.5%; the rates used in our analyses lie far below these upper bounds.
B2. Lag of memory against evolving patterns
The memory attractors associated with a given pattern class can lag behind and only reflect the older patterns presented prior to the most recent encounter of the network with the specified class. As a result, the upper bound for the performance of a network, Q_lag = ρ^{g_lag N} ≈ 1 − 2 g_lag µeff, is determined by both the evolutionary divergence of patterns between two encounters, µeff, and the number of generations g_lag by which the stored memory lags behind; we measure g_lag in units of generations, where one generation is defined as the average time between a network's encounters with the same pattern class, i.e., N. We characterize this lag g_lag by identifying the past pattern (at time t − g_lag N) that has the maximum overlap with the network's energy landscape at a given time t:

$$g_{\mathrm{lag}} = \underset{g\geq 0}{\mathrm{argmax}}\;\mathbb{E}\left[\langle\sigma(t-gN)|J(t)|\sigma(t-gN)\rangle\right] \equiv \underset{g\geq 0}{\mathrm{argmin}}\;\mathbb{E}\left[E_{\mathrm{lag}}(g)\right] \qquad \mathrm{(S8)}$$
where we introduced the expected lagged energy E[E_lag(g)]. Here, the vector |σ(t)⟩ refers to the pattern σ presented
to the network at time t, which can be from any of the pattern classes. Because of the translational symmetry in
time in the stationary state, the lagged energy can also be expressed in terms of the overlap between a pattern at
time t and the network at a later time t + gN . We evaluate the lagged energy by substituting the expression for the
network’s state J(t + gN ) from eq. S2, which entails,
$$\frac{2}{L-1}\,\mathbb{E}\left[E_{\mathrm{lag}}(g)\right] = -\frac{1}{L-1}\,\mathbb{E}\left[\langle\sigma(t)|J(t+gN)|\sigma(t)\rangle\right] \qquad \mathrm{(S9)}$$
$$= -\mathbb{E}\left[(1-\lambda)^{Ng}\,\frac{1}{L-1}\langle\sigma(t)|J(t)|\sigma(t)\rangle + \lambda\sum_{j=0}^{gN-1}(1-\lambda)^{gN-1-j}\,\langle\sigma(t)|\sigma(t+j)\rangle^{2}\right] \qquad \mathrm{(S10)}$$
$$= \frac{2}{L-1}(1-\lambda)^{Ng}\,\mathbb{E}\left[E_{\lambda,\rho}(J,\sigma)\right] - \lambda\sum_{i=0}^{g-1}\sum_{\alpha=0}^{N-1}(1-\lambda)^{gN-1-Ni-\alpha}\,\langle\sigma^{N}(t)|\sigma^{N-\alpha}(t+Ni+\alpha)\rangle^{2} \qquad \mathrm{(S11)}$$
$$= \frac{2}{L-1}(1-\lambda)^{Ng}\,\mathbb{E}\left[E_{\lambda,\rho}(J,\sigma)\right] - \lambda\sum_{i=0}^{g-1}(1-\lambda)^{gN-1-Ni}\,\rho^{2Ni} \qquad \mathrm{(S12)}$$
$$= -\lambda\left[\frac{(1-\lambda)^{N(g+1)-1}\,\rho^{2N}}{1-(1-\lambda)^{N}\rho^{2N}} + \frac{(1-\lambda)^{N(g+1)-1}-(1-\lambda)^{N-1}\rho^{2Ng}}{(1-\lambda)^{N}-\rho^{2N}}\right]. \qquad \mathrm{(S13)}$$
Here, we used the expression for the network's matrix J in eq. S4 to arrive at eq. S10, and then followed the procedure introduced in eq. S3 to arrive at the double summation in eq. S11. We then used the equation for the pattern overlap ⟨σ^α(t1)|σ^ν(t2)⟩ = δ_{α,ν} ρ^{|t2−t1|} to reduce the sum in eq. S12, and arrived at the result in eq. S13 by evaluating the geometric sum and substituting the empirical average for the energy E[E_{λ,ρ}(J, σ)] from eq. S6.
We probe this lagged memory by looking at the performance Q for patterns that are correctly associated with their memory attractors (i.e., those with ⟨σ_att|σ⟩ > 0.8). As shown in Fig. S1, for a broad parameter regime, the mean performance for these correctly associated patterns agrees well with the theoretical expectation Q_lag = ρ^{g_lag N}, which is lower than the naïve expectation Q0.
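The closed-form lagged energy of eq. S13 can be evaluated directly and minimized over g to obtain g_lag; a brief sketch (the overall factor 2/(L − 1) does not affect the argmin):

```python
import numpy as np

def scaled_lagged_energy(g, lam, mu, N):
    """(2/(L-1)) E[E_lag(g)] from eq. S13."""
    rho2N = (1.0 - 2.0 * mu) ** (2 * N)
    a = (1.0 - lam) ** (N * (g + 1) - 1)
    term1 = a * rho2N / (1.0 - (1.0 - lam) ** N * rho2N)
    term2 = (a - (1.0 - lam) ** (N - 1) * rho2N ** g) / ((1.0 - lam) ** N - rho2N)
    return -lam * (term1 + term2)

def memory_lag(lam, mu, N, g_max=50):
    """g_lag = argmin_g E[E_lag(g)] (eq. S8), searched over g = 0, ..., g_max."""
    values = [scaled_lagged_energy(g, lam, mu, N) for g in range(g_max + 1)]
    return int(np.argmin(values))
```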
Appendix C: Structure of the energy landscape for working memory
C1. Formation of mountain passes in the energy landscape of memory for evolving patterns
As shown in Fig. 1, large learning rates in networks with memory for evolving patterns result in the emergence of
narrow connecting paths between the minima of the energy landscape. We refer to these narrow connecting paths as
mountain passes. In pattern retrieval, the Monte Carlo search can drive a pattern out of one energy minimum into
another minimum and potentially lead to pattern misclassification.
We use two features of the energy landscape to probe the emergence of the mountain passes.
First, we show that if a pattern is misclassified, it has fallen into a memory attractor associated with another pattern class and not into a spurious energy minimum. To do so, we compare the overlap of the attractor with the original pattern, |⟨σ^α_att|σ^α⟩| (i.e., the reconstruction performance of the patterns), with the maximal overlap of the attractor with all other patterns, max_{ν≠α} |⟨σ^α_att|σ^ν⟩|. Indeed, as shown in Fig. S3A for evolving patterns, the memory attractors associated with 99.4% of the originally stored patterns have either a large overlap with the correct pattern or with one of the other previously presented patterns. 71.3% of the patterns are correctly classified (stable patterns in sector I in Fig. S3A), whereas 28.1% of them are associated with a secondary energy minimum after equilibration (unstable patterns in sector II in Fig. S3A). A very small fraction of patterns (< 1%) fall into local minima given by linear combinations of the presented patterns (sector IV in Fig. S3A). These minima are well known in the classical Hopfield model [44, 45]. Moreover, we see that equilibration of a random pattern (i.e., a pattern orthogonal to all the presented classes) in the energy landscape leads to memory attractors for one of the originally presented pattern classes. The majority of these random patterns lie in sector II of Fig. S3A, i.e., they have a small overlap
with the network since they are orthogonal to the originally presented pattern classes, and they fall into one of the
existing memory attractors after equilibration.
Second, we characterize the possible paths for a pattern to move from one valley to another during equilibration,
using a Monte Carlo algorithm with the Metropolis acceptance probability,

$$\rho(\sigma\to\sigma') = \min\left(1,\, e^{-\beta\left(E(J,\sigma')-E(J,\sigma)\right)}\right). \qquad \mathrm{(S1)}$$
We estimate the number of beneficial spin-flips (i.e., open paths) that decrease the energy of a pattern at the start
of equilibration (Fig. S3B). The average number of open paths is smaller for stable patterns compared to the unstable
patterns that are misclassified during retrieval (Fig. S3B). However, the distributions for the number of open paths largely overlap for stable and unstable patterns. Therefore, the local energy landscapes of stable and unstable patterns are quite similar, and it is difficult to discriminate between them solely based on the local gradients in the landscape.
Fig. S4A shows that the average number of beneficial spin-flips grows with the mutation rate of the patterns but
this number is comparable for stable and unstable patterns. Moreover, the unstable stored patterns (blue) have far
fewer open paths available to them during equilibration compared to random patterns (red) that are presented to
the network for the first time (Figs. S3B & S4A). Notably, on average half of the spin-flips reduce the energy for
random patterns, irrespective of the mutation rate. This indicates that even though previously presented pattern
classes are statistically distinct from random patterns, they can still become unstable, especially in networks which
are presented with evolving patterns.
It should be noted that the evolution of the patterns only indirectly contributes to the misclassification of memory, as it necessitates a larger learning rate for the networks to operate optimally, which in turn results in the emergence of mountain passes. To clearly demonstrate this effect, Figs. S3C,D and S4D show the misclassification behavior for a network trained to store memory for static patterns, while using a larger learning rate that is optimized for evolving
patterns. Indeed, we see that pattern misclassification in this case is consistent with the existence of mountain passes
in the network’s energy landscape.
C2. Spectral decomposition of the energy landscape
We use spectral decomposition of the energy landscape to characterize the relative positioning and the stability of
patterns in the landscape. As shown in Figs. S3, S4, destabilization of patterns due to equilibration over mountain
passes occurs in networks with high learning rates, even for static patterns. Therefore, we focus on how the learning
rate impacts the spectral decomposition of the energy landscape in networks presented with static patterns. This
simplification will enable us to analytically probe the structure of the energy landscape, which we will compare with
numerical results for evolving patterns.
We can represent the network J (of size L × L) that stores a memory of N static patterns with N non-trivial eigenvectors |Φ^i⟩ with corresponding eigenvalues Γ_i, and L − N degenerate eigenvectors |Ψ^k⟩ with corresponding trivial eigenvalues γ_k = γ = −1:
\[
J = \sum_{i=1}^{N} \Gamma_i \,|\Phi^i\rangle\langle\Phi^i| \;+\; \sum_{k=1}^{L-N} \gamma_k \,|\Psi^k\rangle\langle\Psi^k| . \tag{S2}
\]
The non-trivial eigenvectors span the space of the presented patterns, for which the recognition energy can be expressed by
\[
E(J, \sigma^\alpha) = -\frac{1}{2}\sum_{i=1}^{N} \Gamma_i \,\langle\sigma^\alpha|\Phi^i\rangle \langle\Phi^i|\sigma^\alpha\rangle . \tag{S3}
\]
An arbitrary configuration χ can in general have components orthogonal to the N eigenvectors |Φ^i⟩, as it points to a vertex of the hypercube, and should be expressed in terms of all the eigenvectors {Φ^1, ..., Φ^N, Ψ^1, ..., Ψ^{L−N}}:
\[
E(J, \chi) = -\frac{1}{2}\Bigg[\underbrace{\sum_{i=1}^{N} \Gamma_i \,\langle\chi|\Phi^i\rangle\langle\Phi^i|\chi\rangle}_{\text{stored patterns}} \;+\; \underbrace{\sum_{k=1}^{L-N} \gamma_k \,\langle\chi|\Psi^k\rangle\langle\Psi^k|\chi\rangle}_{\text{trivial space}}\Bigg] . \tag{S4}
\]
Any spin-flip in a pattern (e.g., during equilibration) can be understood as a rotation in the eigenspace of the network (eq. S4). As a first step in characterizing these rotations, we remind ourselves of the identity
\[
|\chi\rangle = \sum_{i=1}^{N} \langle\Phi^i|\chi\rangle \,|\Phi^i\rangle \;+\; \sum_{k=1}^{L-N} \langle\Psi^k|\chi\rangle \,|\Psi^k\rangle , \tag{S5}
\]
with the normalization condition
\[
\sum_{i=1}^{N} \langle\Phi^i|\chi\rangle^2 \;+\; \sum_{k=1}^{L-N} \langle\Psi^k|\chi\rangle^2 = 1 . \tag{S6}
\]
In addition, since the diagonal elements of the network are set to J_ii = 0 (eq. S3), the eigenvalues should sum to zero, or alternatively,
\[
\sum_{i=1}^{N} \Gamma_i = -\sum_{k=1}^{L-N} \gamma_k = L - N . \tag{S7}
\]
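As a numerical illustration of this decomposition and of the sum rule in eq. S7, consider a Hebbian-like coupling matrix built from N random ±1 patterns, normalized so that the removed diagonal equals one; this construction is an assumption for illustration only and is not the trained network of the main text.

import numpy as np

rng = np.random.default_rng(1)
L, N = 800, 32
patterns = rng.choice([-1.0, 1.0], size=(N, L))

J = (patterns.T @ patterns) / N        # Hebbian-like couplings (illustrative normalization)
np.fill_diagonal(J, 0.0)               # J_ii = 0

eigvals, eigvecs = np.linalg.eigh(J)   # eigenvalues in ascending order
gamma, Gamma = eigvals[:L - N], eigvals[L - N:]

print(np.allclose(gamma, -1.0))                        # L - N trivial eigenvalues equal to -1
print(np.isclose(Gamma.sum(), L - N))                  # eq. S7: non-trivial eigenvalues sum to L - N
print(np.isclose(eigvals.sum(), 0.0, atol=1e-6))       # all eigenvalues sum to zero (zero trace)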
To assess the stability of a pattern σ^ν, we compare its recognition energy E(J, σ^ν) with the energy of the rotated pattern after a spin-flip, E(J, σ̃^ν). To do so, we first consider a simple scenario, where we assume that the pattern σ^ν has a large overlap with one dominant non-trivial eigenvector Φ^A (i.e., ⟨σ^ν|Φ^A⟩² = m² ≈ 1). The other components of the pattern can be expressed in terms of the remaining N − 1 non-trivial eigenvectors with mean squared overlap (1 − m²)/(N − 1). The expansion of the recognition energy for the presented pattern is restricted to the N non-trivial directions (eq. S4), resulting in
\[
E(J, \sigma^\nu) = -\frac{1}{2} m^2 \Gamma_A - \frac{1}{2}\sum_{i\neq A} \frac{1-m^2}{N-1}\,\Gamma_i = -\frac{1}{2}\left[ m^2 \Gamma_A + (1-m^2)\,\tilde\Gamma \right] , \tag{S8}
\]
where Γ̃ = (1/(N − 1)) Σ_{i≠A} Γ_i = (N Γ̄ − Γ_A)/(N − 1) is the mean eigenvalue for the non-dominant directions.
A spin-flip (|σ^ν⟩ → |σ̃^ν⟩) can rotate the pattern out of the dominant direction Φ^A and reduce the squared overlap by ε². The rotated pattern |σ̃^ν⟩ in general lies in the L-dimensional space and is not restricted to the N-dimensional (non-trivial) subspace. We first take a mean-field approach in describing the rotation of the pattern after a spin-flip. Because of the normalization condition (eq. S6), the loss in the overlap with the dominant direction should result in an average increase in the overlap with the other L − 1 eigenvectors by ε²/(L − 1). The energy of the rotated pattern after a spin-flip, E(J, σ̃^ν), can be expressed in terms of all the L eigenvectors (eq. S4),
\begin{align}
E(J, \tilde\sigma^\nu) &= -\frac{1}{2}\left[ (m^2 - \epsilon^2)\,\Gamma_A + \sum_{i\neq A}\left(\frac{1-m^2}{N-1} + \frac{\epsilon^2}{L-1}\right)\Gamma_i + \sum_{k}\frac{\epsilon^2}{L-1}\,\gamma_k \right] \nonumber\\
&= E(J, \sigma^\nu) + \frac{\epsilon^2}{2}\left[\Gamma_A - \frac{1}{L-1}\left(\sum_{i\neq A}\Gamma_i + \sum_k \gamma_k\right)\right] \tag{S9}\\
&= E(J, \sigma^\nu) + \frac{\epsilon^2}{2}\,\Gamma_A\left(1 + \frac{1}{L-1}\right) , \tag{S10}
\end{align}
where in eq. S10 we used the fact that the eigenvalues should sum up to zero. On average, a spin-flip |σ^ν⟩ → |σ̃^ν⟩ increases the recognition energy by E(J, σ̃^ν) − E(J, σ^ν) = (ε²/2) Γ_A (1 + O(L^{-1})). This is consistent with the results shown in Figs. S3B,D and Figs. S4A,D, which indicate that the majority of the spin-flips keep a pattern in the original energy minimum and only a few of the spin-flips drive a pattern out of the original attractor.
In the analysis above, we assumed that the reduction in overlap with the dominant eigenvector ε² is absorbed equally by all the other eigenvectors (i.e., the mean-field approach). In this case, the change in energy is equally distributed across the positive and the negative eigenvalues (Γ's and γ's in eq. S9), resulting in an overall increase in the energy due to the reduced overlap with the dominant direction |Φ^A⟩. The destabilizing spin-flips are associated with atypical changes that rotate a pattern onto a secondary non-trivial direction |Φ^B⟩ (with positive eigenvalue Γ_B), as a result of which the total energy could be reduced. To better characterize the conditions under which patterns become unstable, we will introduce a perturbation to the mean-field approach used in eq. S10. We will assume that a spin-flip results in a rotation with a dominant component along a secondary non-trivial direction |Φ^B⟩. Specifically, we will assume that the reduced overlap ε² between the original pattern |σ^ν⟩ and the dominant direction |Φ^A⟩ is distributed in an imbalanced fashion between the other eigenvectors, with a fraction p projected onto a new (non-trivial) direction |Φ^B⟩, while all the other L − 2 directions span the remaining (1 − p)ε². In this case, the energy of the rotated pattern is given by
\begin{align}
E(J, \tilde\sigma^\nu) &= -\frac{1}{2}\left[ (m^2 - \epsilon^2)\,\Gamma_A + \sum_{i\neq A,B}\left(\frac{1-m^2}{N-1} + \frac{(1-p)\epsilon^2}{L-2}\right)\Gamma_i + \left(\frac{1-m^2}{N-1} + p\,\epsilon^2\right)\Gamma_B + \sum_{k}\frac{(1-p)\epsilon^2}{L-2}\,\gamma_k \right] \nonumber\\
&= E(J, \sigma^\nu) + \frac{\epsilon^2}{2}\left(\Gamma_A - p\,\Gamma_B\right) + O\!\left(L^{-1}\right) . \tag{S11}
\end{align}
Therefore, a spin-flip is beneficial if Γ_A < p Γ_B. To further concretize this condition, we will estimate the typical loss ε² and gain pε² in the squared overlap between the pattern and its dominating directions due to rotation by a single spin-flip.
Let us consider a rotation |σ^ν⟩ → |σ̃^ν⟩ by a flip in the n-th spin of the original pattern |σ^ν⟩. This spin flip reduces the original overlap of the pattern, m = ⟨σ^ν|Φ^A⟩, with the dominant direction |Φ^A⟩ by the amount (2/√L) Φ^A_n, where Φ^A_n is the n-th entry of the eigenvector |Φ^A⟩. Since the original overlap is large (i.e., m ≃ 1), all entries of the dominant eigenvector are approximately Φ^A_i ≃ 1/√L, ∀i, resulting in a reduced overlap of the rotated pattern ⟨σ̃^ν|Φ^A⟩ = m − 2/L. Therefore, the loss in the squared overlap ε² by a spin flip is given by
\[
\epsilon^2 = \langle\sigma^\nu|\Phi^A\rangle^2 - \langle\tilde\sigma^\nu|\Phi^A\rangle^2 = m^2 - \left(m^2 - 4\,\frac{m}{L} + \frac{4}{L^2}\right) = 4\,\frac{m}{L} + O\!\left(\frac{1}{L^2}\right) . \tag{S12}
\]
Similarly, we can derive the gain in the squared overlap pε² between the pattern |σ^ν⟩ and the new dominant direction |Φ^B⟩ after a spin-flip. Except for the direction |Φ^A⟩, the expected squared overlap between the original pattern (prior to the spin flip) and any of the non-trivial eigenvectors, including |Φ^B⟩, is ⟨σ^ν|Φ^B⟩² = (1 − m²)/(N − 1). The flip in the n-th spin of the original pattern increases the overlap of the rotated pattern with the new dominant direction |Φ^B⟩ by 2Φ^B_n/√L, where Φ^B_n should be of the order of √(1/L). Therefore, the largest gain in overlap due to a spin-flip is given by
\begin{align}
p\,\epsilon^2 = \langle\tilde\sigma^\nu|\Phi^B\rangle^2 - \langle\sigma^\nu|\Phi^B\rangle^2 &\simeq \left(\frac{1-m^2}{N-1} + 4\sqrt{\frac{1-m^2}{N-1}}\,\frac{\Phi^B_n}{\sqrt{L}} + 4\,\frac{(\Phi^B_n)^2}{L}\right) - \frac{1-m^2}{N-1} \nonumber\\
&= 4\sqrt{\frac{1-m^2}{N-1}}\,\frac{\Phi^B_n}{\sqrt{L}} + O\!\left(\frac{1}{L^2}\right) . \tag{S13}
\end{align}
By using the results from eqs. S12 and S13, we can express the condition for beneficial spin-flips to drive a pattern over the carved mountain passes during equilibration (eq. S11),
\[
\epsilon^2\,\Gamma_A < \epsilon^2 p\,\Gamma_B \quad\longrightarrow\quad \frac{\Gamma_A}{\Gamma_B} < \sqrt{\frac{1-m^2}{m^2}}\,\frac{1}{\sqrt{\alpha}}\,\Phi^B_n , \tag{S14}
\]
where α = N/L. This result suggests that the stability of a pattern depends on how the ratio of the eigenvalues associated with the dominant projections of the pattern before and after the spin-flip, Γ_A/Γ_B, compares to the overlap m of the original pattern with the dominant eigenvector Φ^A and to the change due to the spin-flip, Φ^B_n.
So far, we have constrained our analysis to patterns that have a dominant contribution to only one eigenvector Φ^A. To extend our analysis to patterns which are instead constrained to a small sub-space A spanned by non-trivial eigenvectors, we define the squared pattern overlap with the subspace, m²_A = Σ_{a∈A} ⟨σ^ν|Φ^a⟩², and a weighted averaged eigenvalue in the subspace, Γ_A = Σ_{a∈A} ⟨σ^ν|Φ^a⟩² Γ_a. As a result, the difference in the energy of a pattern before and after a spin-flip (eq. S11) can be extended to E(J, σ^ν) − E(J, σ̃^ν) = (ε²/2)(Γ_A − p Γ_B) + O(L^{-1}). Similarly, the stability condition in eq. S14 can be extended to Γ_A/Γ_B < √((1 − m²_A)/m²_A) (1/√α) Φ^B_n. Patterns that are constrained to larger subspaces are more stable, since the weighted averaged eigenvalue of their containing subspace Γ_A is closer to the mean of all eigenvalues, Γ̄ = 1 − N/L (law of large numbers). Therefore, in these cases a much larger eigenvalue gap (or a broader eigenvalue spectrum) is necessary to satisfy the condition for pattern instability.
Fig. S6 compares the loss in energy along the original dominant direction, ε²Γ_A, to the maximal gain in any of the other directions, ε²pΓ_B, to test the pattern stability criterion presented in eq. S14. To do so, we identify a spin flip n in a secondary direction B that confers the maximal energy gain: ε²pΓ_B ≈ max_{n,B} √((1−m²)/(N−1)) (Φ^B_n/√L) Γ_B. In Fig. S6A,C we specifically focus on the subset of patterns that show a large (squared) overlap with one dominant direction (i.e., m > 0.85). Given that evolving patterns are not constrained to the {Φ} (non-trivial) sub-space, we find a smaller fraction of these patterns to fulfill the condition for such a large overlap m (Fig. S6A), compared to the static patterns (Fig. S6C). Nonetheless, we see that the criterion in eq. S14 can be used to predict the stability of patterns in a network for both static and evolving patterns; note that here we use the same learning rate for both the static and evolving patterns.
We then relax the overlap condition by including all patterns that have a large overlap with a subspace A spanned by up to 10 eigenvectors (i.e., m²_A = Σ_{a∈A} ⟨σ|Φ^a⟩² > 0.85). For these larger subspaces the transition between stable and unstable patterns is no longer exactly given by eq. S14. However, the two contributions ε²Γ_A and ε²pΓ_B still clearly separate the patterns into stable and unstable classes for both static and evolving patterns (Figs. S6B,D). The softening of this condition is expected, as in this regime we can no longer assume that a single spin-flip reduces the overlap with all the eigenvectors in the original subspace. As a result, the effective loss in overlap becomes smaller than ε², and patterns become unstable below the dotted line in Fig. S6B,D.
As the learning rate increases, the gap between the eigenvalues, Γ_B/Γ_A (Fig. S5), becomes larger. At the same time, patterns become more constrained to smaller subspaces (Fig. S4C,D). As a result of these two effects, more patterns satisfy the instability criterion in eq. S14. These patterns are misclassified, as they fall into a wrong energy minimum by equilibrating through the mountain passes carved in the energy landscape of networks with large learning rates.
[1] Labrie SJ, Samson JE, Moineau S (2010) Bacteriophage resistance mechanisms. Nature Rev Microbiol 8: 317–327.
[2] Barrangou R, Marraffini LA (2014) CRISPR-Cas systems: Prokaryotes upgrade to adaptive immunity. Molecular cell 54:
234–244.
[3] Bradde S, Nourmohammad A, Goyal S, Balasubramanian V (2020) The size of the immune repertoire of bacteria. Proc
Natl Acad Sci USA 117: 5144–5151.
[4] Perelson AS, Weisbuch G (1997) Immunology for physicists. Rev Mod Phys 69: 1219–1267.
[5] Janeway C, Travers P, Walport M, Schlomchik M (2001) Immunobiology. The Immune System in Health and Disease.
New York: Garland Science, 5 edition.
[6] Altan-Bonnet G, Mora T, Walczak AM (2020) Quantitative immunology for physicists. Physics Reports 849: 1–83.
[7] Haberly LB, Bower JM (1989) Olfactory cortex: Model circuit for study of associative memory? Trends Neurosci 12:
258–264.
[8] Brennan P, Kaba H, Keverne EB (1990) Olfactory recognition: A simple memory system. Science 250: 1223–1226.
[9] Granger R, Lynch G (1991) Higher olfactory processes: Perceptual learning and memory. Curr Opin Neurobiol 1: 209–214.
[10] Haberly LB (2001) Parallel-distributed processing in olfactory cortex: New insights from morphological and physiological
analysis of neuronal circuitry. Chemical senses 26: 551–576.
[11] Wilson DA, Best AR, Sullivan RM (2004) Plasticity in the olfactory system: Lessons for the neurobiology of memory.
Neuroscientist 10: 513–524.
[12] Raguso RA (2008) Wake Up and Smell the Roses: The Ecology and Evolution of Floral Scent. Annu Rev Ecol Evol Syst
39: 549–569.
[13] Dunkel A, et al. (2014) Nature’s chemical signatures in human olfaction: A foodborne perspective for future biotechnology.
Angew Chem Int Ed 53: 7124–7143.
[14] Beyaert I, Hilker M (2014) Plant odour plumes as mediators of plant-insect interactions: Plant odour plumes. Biol Rev
89: 68–81.
[15] Glusman G, Yanai I, Rubin I, Lancet D (2001) The Complete Human Olfactory Subgenome. Genome Research 11:
685–702.
[16] Bargmann CI (2006) Comparative chemosensation from receptors to ecology. Nature 444: 295–301.
[17] Touhara K, Vosshall LB (2009) Sensing Odorants and Pheromones with Chemosensory Receptors. Annu Rev Physiol 71:
307–332.
[18] Su CY, Menuz K, Carlson JR (2009) Olfactory Perception: Receptors, Cells, and Circuits. Cell 139: 45–59.
[19] Verbeurgt C, et al. (2014) Profiling of Olfactory Receptor Gene Expression in Whole Human Olfactory Mucosa. PLoS
ONE 9: e96333.
[20] Shepherd GM, Greer CA (1998) Olfactory bulb. In: The Synaptic Organization of the Brain, 4th ed., New York, NY, US:
Oxford University Press. pp. 159–203.
[21] Bushdid C, Magnasco MO, Vosshall LB, Keller A (2014) Humans Can Discriminate More than 1 Trillion Olfactory Stimuli.
Science 343: 1370–1372.
[22] Gerkin RC, Castro JB (2015) The number of olfactory stimuli that humans can discriminate is still unknown. eLife 4:
e08127.
[23] Mayhew EJ, et al. (2020) Drawing the Borders of Olfactory Space. preprint, Neuroscience. doi:10.1101/2020.12.04.412254.
URL http://biorxiv.org/lookup/doi/10.1101/2020.12.04.412254.
[24] Lansner A (2009) Associative memory models: From the cell-assembly theory to biophysically detailed cortex simulations.
Trends Neurosci 32: 178–186.
[25] Hebb DO (1949) The Organization of Behavior: A Neuropsychological Theory. New York: Wiley.
[26] Hopfield JJ (1982) Neural networks and physical systems with emergent collective computational abilities. Proc Natl Acad
Sci USA 79: 2554–2558.
[27] Mayer A, Balasubramanian V, Mora T, Walczak AM (2015) How a well-adapted immune system is organized. Proc Natl
Acad Sci USA 112: 5950–5955.
[28] Shinnakasu R, et al. (2016) Regulated selection of germinal-center cells into the memory B cell compartment. Nat Immunol
17: 861–869.
[29] Shinnakasu R, Kurosaki T (2017) Regulation of memory B and plasma cell differentiation. Curr Opin Immunol 45:
126–131.
[30] Mayer A, Balasubramanian V, Walczak AM, Mora T (2019) How a well-adapting immune system remembers. Proc Natl
Acad Sci USA 116: 8815–8823.
[31] Schnaack OH, Nourmohammad A (2021) Optimal evolutionary decision-making to store immune memory. eLife 10: e61346.
[32] Viant C, et al. (2020) Antibody affinity shapes the choice between memory and germinal center B cell fates. Cell 183:
1298–1311.e11.
[33] Mezard M, Nadal JP, Toulouse G (1986) Solvable models of working memories. J Physique 47: 1457-1462.
[34] Amit DJ, Gutfreund H, Sompolinsky H (1985) Storing infinite numbers of patterns in a spin-glass model of neural networks.
Phys Rev Lett 55: 1530–1533.
[35] McEliece R, Posner E, Rodemich E, Venkatesh S (1987) The capacity of the Hopfield associative memory. IEEE Transactions
on Information Theory 33: 461-482.
[36] Bouchaud JP, Mezard M (1997) Universality classes for extreme-value statistics. J Phys A Math Gen 30: 7997–8016.
[37] Derrida B (1997) From random walks to spin glasses. Physica D: Nonlinear Phenomena 107: 186–198.
[38] Goodfellow I, Bengio Y, Courville A (2016) Deep Learning. MIT Press. http://www.deeplearningbook.org.
[39] Mehta P, et al. (2019) A high-bias, low-variance introduction to Machine Learning for physicists. Physics Reports 810:
1–124.
[40] Parga N, Virasoro M (1986) The ultrametric organization of memories in a neural network. J Phys France 47: 1857–1864.
[41] Virasoro MA (1986) Ultrametricity, Hopfield Model and all that. In: Disordered Systems and Biological Organization,
Springer, Berlin, Heidelberg. pp. 197–204.
[42] Gutfreund H (1988) Neural networks with hierarchically correlated patterns. Phys Rev A 37: 570–577.
[43] Tsodyks MV (1990) Hierarchical associative memory in neural networks with low activity level. Mod Phys Lett B 04:
259–265.
[44] Amit DJ, Gutfreund H, Sompolinsky H (1985) Spin-glass models of neural networks. Physical Review A 32: 1007–1018.
[45] Fontanari JF (1990) Generalization in a Hopfield network. J Phys France 51: 2421–2430.
Supplementary Figures

[FIG. S1 panels: (A) optimal performance Q∗ and (B) memory time lag glag [N], each vs. effective mutation rate µeff, for N = 8, 16, 32; panel (A) also shows the curves 1 − 2glag µeff and 1 − 2µeff.]
FIG. S1. Reduced performance of Hopfield networks due to memory delay. (A) The optimal performance Q∗ for patterns that are correctly associated with their memory attractors (i.e., they have an overlap q(σ) = ⟨σatt|σ⟩ > 0.8) is shown as a function of the effective mutation rate µeff. The solid lines show the simulation results for networks encountering a different number of patterns N (colors). The gray dashed line shows the naïve expectation for the performance (Q0 = 1 − 2µeff), and the colored dashed lines show the expected performance after accounting for the memory lag, Qlag = 1 − 2glag µeff. (B) The lag time glag for memory is shown in units of generations [N] as a function of the effective mutation rate for networks encountering a different number of patterns (colors similar to (A)). The networks are trained with a learning rate λ∗(µ) optimized for the mutation rate specified on the x-axis. Other simulation parameters: L = 800.
[FIG. S2 panels: (A) Pwrong, (B) mean energy, and (C) standard error of the energy, each vs. effective mutation rate µeff, for evolving and static patterns.]
FIG. S2. Statistics of static and evolving patterns for networks with different learning rates. We compare the statistics of evolving (green) and static (orange) patterns in networks trained with a learning rate λ∗(µ) optimized for the mutation rate specified on each panel's x-axis; see Fig. 2B for dependency of the optimal learning rate on mutation rate. The reported statistics are (A) the fraction Pwrong of misclassified patterns (i.e., patterns with a small overlap q(σ) = ⟨σatt|σ⟩ < 0.8), (B) the mean energy of the patterns, and (C) the standard error of the energy of the patterns in the network. Simulation parameters: L = 800 and N = 32.
[FIG. S3 panels: (A,C) next best overlap vs. self overlap |⟨σ^α_att(σ^α)|σ^α⟩| for stored patterns σ^α and random patterns χ, with the fraction of patterns in each sector (I–VI) indicated; (B,D) number of beneficial flips σ^α → σ̃^α vs. self overlap.]
FIG. S3. Attractors and equilibration paths in networks. Overlaps of patterns with the networks' attractors are shown both for the patterns σ^α associated with one of the classes that were previously presented to the network during training (blue) and for the random patterns χ that are on average orthogonal to the previously presented classes. (A) The overlap between a presented pattern σ^α and the memory associated with the same pattern class, σ^α_att(σ^α), is shown against the overlap of the pattern with the next best memory attractor associated with any of the other presented pattern classes, max_{ν≠α} |⟨σ^α_att|σ^ν⟩|. Fractions of the previously presented patterns and the random patterns that fall into different sectors of the plot are indicated in blue and red, respectively. Sector I corresponds to patterns that fall into the correct energy attractors (i.e., ⟨σ^α_att|σ^α⟩ ≈ 1). In the limit of large self-overlap, the maximal overlap with any other pattern family is close to zero, and thus no patterns are found in sector III. Patterns with a small self-overlap could fall into three different sectors: Sector II corresponds to misclassified patterns that fall into a valley associated with a different class (max_{ν≠α} |⟨σ^α_att|σ^ν⟩| ≈ 1). Patterns in sectors IV and V fall into local valleys between the minima of two pattern families. These mixture states are well known in the classical Hopfield model [44, 45]. Sector VI indicates patterns that fall into an attractor in the network that does not correspond to any of the previously presented classes. The fact that neither the previously presented patterns nor the random patterns fall into this sector suggests that the network indeed only stores memory of the presented patterns and is not in the glassy regime. (B) The number of beneficial spin-flips for presented patterns at the beginning of equilibration (i.e., the number of open equilibration paths) is shown against the patterns' self-overlap (x-axis in (A)). For stable patterns (sector I) the number of open paths is anticorrelated with the overlap between the attractor and the presented pattern. For unstable patterns (sector II), the number of open paths is on average larger than that of the stable patterns. However, there are fewer paths available to the previously presented patterns compared to the random patterns. In (A,B) patterns evolve with rate µeff = 0.01 and the network's learning rate is optimized accordingly. The sharp transition between sector occupations indicates that our results are insensitive to the classification threshold for self-overlap (currently set to q^α > 0.8), i.e., any threshold value between sectors I and II would result in the same classification of patterns. (C,D) Similar to (A,B) but for static patterns in a network with a similar learning rate to (A,B). Simulation parameters: L = 800 and N = 32.
[FIG. S4 panels: (A,C) mean number of open paths for stable, unstable, and random patterns, and (B,D) participation ratio π(σ) for the lowest-energy pattern σ^1, the highest-energy pattern σ^N, and the mean over all patterns, each vs. effective mutation rate µeff.]
FIG. S4. Open equilibration paths and participation ratio. (A) The mean number of open paths (i.e., the beneficial spin-flips at the beginning of equilibration) is shown for stable, unstable, and random patterns (colors) as a function of the effective mutation rate µeff in networks trained with the optimal learning rate λ∗(µ). (B) The participation ratio π(σ^j) = (Σ_i m²_{i,j})² / Σ_i m⁴_{i,j}, with m_{i,j} = ⟨Φ^i|σ^j⟩, is shown for the pattern σ^1 with the lowest energy (orange) and the pattern σ^N with the highest energy (purple). The mean participation ratio averaged over all patterns is shown in green. (C,D) Similar to (A,B) but for static patterns (µ = 0). The learning rate of the network in this case is tuned to be optimal for the mutation rate specified on the x-axis. Simulation parameters: L = 800 and N = 32.
[FIG. S5 panels: (A,B) eigenvalues vs. effective mutation rate µeff; non-trivial eigenvalues Γ1, Γ10, Γ20, Γ32 (shades of blue) and trivial eigenvalues γ1, γ20, γL−N (shades of red).]
FIG. S5. Eigenvalues of networks with memory against dynamic and static patterns. (A) The first (Γ1), the 10th (Γ10), the 20th (Γ20), and the last (ΓN=32) non-trivial eigenvalues of a network of size L = 800 presented with N = 32 patterns are shown as a function of the patterns' effective mutation rate (different shades of blue). In each case, the network is trained with the optimal learning rate λ∗(µ). The trivial eigenvalues are shown in different shades of red, with their rank indicated in the legend. For small µeff all trivial eigenvalues match the prediction γk = −1, which implies that the network updates fast enough to keep the patterns within the N-dimensional sub-space. For larger mutation rates, some of the trivial eigenvalues deviate from −1, indicating that evolving patterns start spanning a larger sub-space. Moreover, as the mutation rate (or learning rate) increases, the gap between the non-trivial eigenvalues broadens. (B) Similar to (A) but for static patterns in networks trained with a learning rate λ∗(µ) optimized for the mutation rate specified on the panel's x-axis. In contrast to (A), all trivial eigenvalues remain equal to −1 independent of the learning rate, implying that the static patterns remain within the non-trivial N-dimensional sub-space. Similar to (A), the gap between the non-trivial eigenvalues broadens with increasing learning rate.
[FIG. S6 panels: (A–D) possible gain pε²ΓB vs. dominant energy contribution ε²ΓA, with stable patterns (q^α ≈ 1) and unstable patterns (q^α ≈ 0) separated by the stability condition.]
FIG. S6. Stability condition for patterns during equilibration. The stability condition in eq. S14 (dotted line) is used to classify stable (blue) and unstable (red) patterns for (A) the patterns that have a squared overlap with one dominant eigenvector, m² = ⟨Φ^A|σ^ν⟩² > 0.85, and (B) the patterns that are constrained to a small sub-space A spanned by up to 10 non-trivial eigenvectors; in this case, m²_A = Σ_{a∈A} ⟨Φ^a|σ^ν⟩² > 0.85. The shading indicates the number of eigenvectors needed to represent a pattern, from dark (one) to light (ten). (C,D) Similar to (A,B) but for static patterns in networks trained with the same learning rate as in (A,B). In general, more static patterns reach the threshold of m > 0.85, as these patterns remain constrained to the N-dimensional subspace spanned by the non-trivial eigenvectors {Φ^i}. Simulation parameters: N = 32, L = 800, µeff = 0.02, and networks are trained with the optimal learning rate λ∗(µ).
[FIG. S7 panels: (A,C) optimal performance Q∗ and (B,D) optimal learning rate λ∗c, each vs. scaled inverse Hopfield temperature βHc, for C = 1, 2, 4, 8, 16, 32 compartments; (A,B) static patterns, (C,D) evolving patterns.]
FIG. S7. Optimal performance and learning rate at different Hopfield temperatures. (A) The optimal accuracy of compartmentalized networks (i.e., for βS ≫ 1) is shown as a function of the scaled inverse Hopfield temperature βHc for different numbers of compartments C (colors) for static patterns (µeff = 0); see Fig. 4B top. (B) The optimal learning rate λ∗c for each strategy is shown as a function of βHc. In contrast to Fig. 3, the learning rate is not rescaled here and does not collapse for βHc ≫ 1. As the equilibration noise increases (decreasing βHc), networks with distributed memory (C < N) show two-step transitions (A). The first transition occurs at βHc ≃ 1, which results in the reduced accuracy of the networks' memory retrieval. At this transition point, the networks' learning rate λc approaches its maximum value 1 (B). Consequently, memory is only stored for C < N patterns (i.e., one pattern per sub-network), and the optimal performance Q∗ is reduced to approximately C/N (A). The second transition occurs at βHc ≈ 1/Nc = C/N, below which no pattern can be retrieved and the performance approaches zero (A). (C,D) Similar to (A,B) but for evolving patterns with the effective mutation rate µeff = 0.01, similar to Fig. 4D. The number of presented patterns is set to N = 32. Similar to Figs. 3, 4, we keep L · C = const., with L = 800 used for networks with C = 1.