

Learning and organization of memory for evolving patterns

Oskar H. Schnaack,1,2 Luca Peliti,3 and Armita Nourmohammad1,2,4,*

1 Max Planck Institute for Dynamics and Self-organization, Am Faßberg 17, 37077 Göttingen, Germany
2 Department of Physics, University of Washington, 3910 15th Ave Northeast, Seattle, WA 98195, USA
3 Santa Marinella Research Institute, 00058 Santa Marinella, Italy
4 Fred Hutchinson Cancer Research Center, 1100 Fairview Ave N, Seattle, WA 98109, USA

arXiv:2106.02186v1 [physics.bio-ph] 4 Jun 2021 (Dated: June 7, 2021)

* Correspondence should be addressed to: [email protected]

Storing memory for molecular recognition is an efficient strategy for responding to external stimuli. Biological processes use different strategies to store memory. In the olfactory cortex, synaptic connections form when stimulated by an odor, and establish distributed memory that can be retrieved upon re-exposure. In contrast, the immune system encodes specialized memory by diverse receptors that recognize a multitude of evolving pathogens. Despite the mechanistic differences between the olfactory and the immune memory, these systems can still be viewed as different information encoding strategies. Here, we present a theoretical framework with artificial neural networks to characterize optimal memory strategies for both static and dynamic (evolving) patterns. Our approach is a generalization of the energy-based Hopfield model in which memory is stored as a network's energy minima. We find that while classical Hopfield networks with distributed memory can efficiently encode a memory of static patterns, they are inadequate against evolving patterns. To follow an evolving pattern, we show that a distributed network should use a higher learning rate, which in turn can distort the energy landscape associated with the stored memory attractors. Specifically, narrow connecting paths emerge between memory attractors, leading to misclassification of evolving patterns. We demonstrate that compartmentalized networks with specialized subnetworks are the optimal solutions to memory storage for evolving patterns. We postulate that evolution of pathogens may be the reason for the immune system to encode a focused memory, in contrast to the distributed memory used in the olfactory cortex that interacts with mixtures of static odors.

I. INTRODUCTION

Storing memory for molecular recognition is an efficient strategy for sensing and responding to external stimuli. Apart from the cortical memory in the nervous system, molecular memory is also an integral part of the immune response, present in a broad range of organisms, from the CRISPR-Cas system in bacteria [1–3] to adaptive immunity in vertebrates [4–6]. In all of these systems, a molecular encounter is encoded as a memory and is later retrieved and activated in response to a similar stimulus, be it a pathogenic reinfection or a re-exposure to a pheromone. Despite this high-level similarity, the immune system and the synaptic nervous system utilize vastly distinct molecular mechanisms for the storage and retrieval of their memory. Memory storage, and in particular associative memory in the hippocampus and the olfactory cortex, has been a focus of theoretical and computational studies in neuroscience [7–11]. In the case of the olfactory cortex, the input is a combinatorial pattern produced by olfactory receptors, which recognize the constituent mono-molecules of a given odor.
A given odor is composed of many mono-molecules at different concentrations [12–14], drawn from a space of about 10^4 distinct mono-molecules. It reacts with olfactory receptors (a total of ∼300–1000 in mammals [15–19]), resulting in a distributed signal and a spatiotemporal pattern in the olfactory bulb [20]. This pattern is transmitted to the olfactory cortex, which serves as a pattern recognition device and enables an organism to distinguish orders of magnitude more odors than the number of olfactory receptors [21–23]. The synaptic connections in the cortex are formed as they are co-stimulated by a given odor pattern, thus forming an associative memory that can be retrieved in future exposures [7–11, 24]. Artificial neural networks that store auto-associative memory are used to model the olfactory cortex. In these networks, input patterns trigger interactions between encoding nodes. The ensemble of interacting nodes keeps a robust memory of the pattern, since the nodes can be simultaneously stimulated upon re-exposure, and a stored pattern can thus be recovered from only part of its content. This mode of memory storage resembles the co-activation of synaptic connections in a cortex. Energy-based models, such as Hopfield neural networks with Hebbian update rules [25], are among such systems, in which memory is stored as the network's energy minima [26]; see Fig. 1. The connection between the standard Hopfield network and real synaptic neural networks has been a subject of debate over the past decades. Still, the Hopfield network provides a simple and solvable coarse-grained model of the synaptic network, relevant for working memory in the olfactory system and the hippocampus [24].

Immune memory is encoded very differently from associative memory in the nervous system and the olfactory cortex. First, unlike olfactory receptors, immune receptors are extremely diverse and can specifically recognize pathogenic molecules without the need for a distributed and combinatorial encoding. In vertebrates, for example, the adaptive immune system consists of tens of billions of diverse B- and T-cells, whose unique surface receptors are generated through genomic rearrangement, mutation, and selection, and can recognize and mount specific responses against a multitude of pathogens [5]. Immune cells activated in response to a pathogen can differentiate into memory cells, which are long-lived and can respond more efficiently upon reinfection. As in most molecular interactions, immune-pathogen recognition is cross-reactive, which allows memory receptors to recognize slightly evolved forms of the pathogen [5, 27–32]. Nonetheless, unlike the distributed memory in the olfactory cortex, the receptors encoding immune memory are focused and can only interact with pathogens with limited evolutionary divergence from the primary infection, in response to which the memory was originally generated [5].

There is no question that there are vast mechanistic and molecular differences between how memory is stored in the olfactory system compared to the immune system. However, we can still ask which key features of the two systems prompt such distinct information encoding strategies for their respective distributed versus specialized memory. To probe the emergence of distinct memory strategies, we propose a generalized Hopfield model that can learn and store memory for both static and dynamic (evolving) patterns.
We formulate this problem as an optimization task to find a strategy (i.e., learning rate and network structure) that confers the highest accuracy for memory retrieval in a network (Fig. 1). In contrast to the static case, we show that distributed memory in the style of a classical Hopfield model [26] fails to work efficiently for evolving patterns. We show that the optimal learning rate should increase with faster evolution of patterns, so that a network can follow the dynamics of the evolving patterns. This heightened learning rate tends to carve narrow connecting paths (mountain passes) between the memory attractors of a network's energy landscape, through which patterns can equilibrate and be associated with a wrong memory. To overcome this misclassification, we demonstrate that specialized memory compartments emerge in a neural network as an optimal solution to efficiently recognize and retrieve a memory of out-of-equilibrium evolving patterns. Our results suggest that evolution of pathogenic patterns may be one of the key reasons why the immune system encodes a focused memory, as opposed to the distributed memory used in the olfactory system, for which molecular mixtures largely present static patterns. Beyond this biological intuition, our model offers a principle-based analytical framework to study learning and memory generation in out-of-equilibrium dynamical systems.

II. RESULTS

A. Model of working memory for evolving patterns

To probe memory strategies against different types of stimuli, we propose a generalized energy-based model of associative memory, in which a Hopfield-like neural network is able to learn and subsequently recognize binary patterns. This neural network is characterized by an energy landscape, and memory is stored as the network's energy minima. We encode the target of recognition (stimulus) in a binary vector σ (pattern) with L entries: σ = (σ_1, ..., σ_L), with σ_i = ±1, ∀i (Fig. 1A). To store associative memory, we define a fully connected network represented by an interaction matrix J = (J_{i,j}) of size L × L, and use a Hopfield-like energy function (Hamiltonian) to describe pattern recognition,

E_J(σ) = −(1/2L) Σ_{i,j} J_{i,j} σ_i σ_j ≡ −(1/2) ⟨σ|J|σ⟩  [26]

(Fig. 1C). Here, we used a short-hand notation to denote the normalized pattern vector by |σ⟩ ≡ σ/√L and its transpose by ⟨σ|, resulting in a normalized scalar product ⟨σ|σ'⟩ ≡ (1/L) Σ_i σ_i σ'_i and a matrix product ⟨σ|J|σ⟩ ≡ (1/L) Σ_{i,j} σ_i J_{i,j} σ_j.

The network undergoes a learning process, during which different patterns are presented sequentially and in random order (Fig. 1B). As a pattern σ^α is presented, the interaction matrix J is updated according to the following Hebbian update rule [33]:

J_{i,j} → J'_{i,j} = (1 − λ) J_{i,j} + λ σ^α_i σ^α_j,  if i ≠ j;   J'_{i,j} = 0,  otherwise.   (1)

Here λ is the learning rate. In this model, the memorized patterns are represented by energy minima associated with the matrix J. We consider the case where the number N of different pattern classes is below the Hopfield capacity of the network (i.e., N ≲ 0.14 L; see refs. [26, 34, 35]). With the update rule in eq. 1, the network develops energy minima as associative memory close to each of the previously presented pattern classes σ^α (α ∈ {1, ..., N}) (Fig. 1C). Although the network also has minima close to the negated patterns, i.e., to −σ^α, these do not play any role in what follows.
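To make the recognition energy and the update rule of eq. 1 concrete, a minimal NumPy sketch is given below; this is an illustration on our part, not the authors' code, and the system size, number of patterns, and learning rate are arbitrary toy values rather than the parameters used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def energy(J, sigma):
    """Hopfield-like recognition energy E_J(sigma) = -(1/2L) * sum_ij J_ij sigma_i sigma_j."""
    L = sigma.size
    return -0.5 * sigma @ J @ sigma / L

def hebbian_update(J, sigma, lam):
    """Hebbian rule of eq. 1: decay old couplings by (1 - lambda), imprint the shown pattern, keep J_ii = 0."""
    J_new = (1.0 - lam) * J + lam * np.outer(sigma, sigma)
    np.fill_diagonal(J_new, 0.0)
    return J_new

# toy usage: present N random patterns repeatedly, in random order, to a network of size L
L, N, lam = 200, 8, 0.05
patterns = rng.choice([-1, 1], size=(N, L))
J = np.zeros((L, L))
for _ in range(50 * N):
    J = hebbian_update(J, patterns[rng.integers(N)], lam)
print([round(energy(J, p), 2) for p in patterns])  # stored patterns sit in low-energy minima
```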
To find an associative memory, we let a presented pattern σ^α equilibrate in the energy landscape, whereby we accept spin-flips σ^α → σ̃^α with probability min[1, e^{−β_H (E_J(σ̃) − E_J(σ))}], where β_H is the inverse equilibration (Hopfield) temperature (Appendix A). In the low-temperature regime (i.e., high β_H), equilibration in networks with working memory drives a presented pattern σ^α towards a similar attractor σ^α_att, reflecting the memory associated with the corresponding energy minimum (Fig. 1C). This similarity is measured by the overlap q^α ≡ ⟨σ^α_att|σ^α⟩ and determines the accuracy of the associative memory.

FIG. 1. Model of working memory for evolving patterns. (A) The targets of recognition are encoded by binary vectors {σ} of length L. Patterns can evolve over time with a mutation rate μ, denoting the fraction of spin-flips in a pattern per network update event. (B) The Hebbian learning rule is shown for a network J, which is presented a set of N patterns {σ^α} (colors) over time. At each step, one pattern σ^α is randomly presented to the network and the network is updated with learning rate λ (eq. 1). (C) The energy landscapes of networks with distributed memory and the optimal learning rate for static (left) and evolving (right) patterns are shown. The equipotential lines are shown in the bottom 2D plane. The energy minima correspond to memory attractors. For static patterns (left), equilibration in the network's energy landscape drives a pattern towards its associated memory attractor, resulting in an accurate reconstruction of the pattern. For evolving patterns (right), the heightened optimal learning rate of the network results in the emergence of connecting paths (mountain passes) between the energy minima. The equilibration process can drive a pattern through a mountain pass towards a wrong memory attractor, resulting in pattern misclassification. (D) A network with distributed memory (left) is compared to a specialized network with multiple compartments (right). To find an associative memory, a presented pattern σ^α with energy E(J, σ^α) in network J equilibrates with inverse temperature β_H in the network's energy landscape and falls into an energy attractor σ^α_att. Memory retrieval is a two-step process in a compartmentalized network (right): first, the sub-network J^i is chosen with a probability P_i ∼ exp[−β_S E^i(J^i, σ^α)], where β_S is the inverse temperature for this decision; second, the pattern equilibrates within the sub-network and falls into an energy attractor σ^α_att.

Unlike the classical cases of pattern recognition by Hopfield networks, we assume that patterns can evolve over time with a rate μ that reflects the average number of spin-flips in a given pattern per network update event (Fig. 1A). Therefore, the expected number of spin-flips in a given pattern between two encounters is μ_eff = Nμ, as two successive encounters of the same pattern are on average separated by N − 1 encounters (and updates) of the network with the other patterns.
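The retrieval and evolution steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the energy normalization follows the definition of E_J(σ) given earlier, the mutation step assumes each spin flips independently with probability μ (so a fraction μ of spins flips on average per update event), and the number of Metropolis steps is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(1)

def retrieve(J, sigma, beta_H, n_steps=20000):
    """Metropolis equilibration: propose single spin-flips, accept with min(1, exp(-beta_H * dE))."""
    L = sigma.size
    s = sigma.copy()
    for _ in range(n_steps):
        i = rng.integers(L)
        dE = 2.0 * s[i] * (J[i] @ s) / L   # energy change of flipping spin i (J is symmetric, J_ii = 0)
        if dE <= 0 or rng.random() < np.exp(-beta_H * dE):
            s[i] = -s[i]
    return s

def mutate(sigma, mu):
    """Evolve a pattern: flip each spin independently with probability mu."""
    flips = rng.random(sigma.size) < mu
    return np.where(flips, -sigma, sigma)

def overlap(a, b):
    """Normalized overlap q = <a|b>."""
    return float(a @ b) / a.size
```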
We work in the regime where the mutation rate μ is small enough that the evolved patterns stemming from a common ancestor σ^α(t_0) at time t_0 (i.e., the members of the class α) remain more similar to each other than to members of the other classes (i.e., μNL ≪ L/2). The special case of static patterns (μ_eff = 0) can reflect distinct odor molecules, for which associative memory is stored in the olfactory cortex. On the other hand, the distinct pattern classes in the dynamic case (μ_eff > 0) can be attributed to different types of evolving pathogens (e.g., influenza, HIV, etc.), and the patterns within a class to different variants of a given pathogen. In our model, we will use the mutation rate as an order parameter to characterize the different phases of memory strategies in biological systems.

B. Optimal learning rate for evolving patterns

In the classical Hopfield model (μ_eff = 0) the learning rate λ is set to very small values for the network to efficiently learn the patterns [33]. For evolving patterns, the learning rate should be tuned so the network can efficiently update the memory retained from the prior learning steps. At each encounter, the overlap q^α(t; λ) = ⟨σ^α_att(t; λ)|σ^α(t)⟩ between a pattern σ^α(t) and the attractor of the associated energy minimum, σ^α_att(t; λ), determines the accuracy of pattern recognition; the parameter λ explicitly indicates the dependency of the network's energy landscape on the learning rate. We declare pattern recognition successful if the accuracy of reconstruction (overlap) is larger than a set threshold, q^α(t) ≥ 0.8, but our results are insensitive to the exact value of this threshold (Appendix A and Fig. S3). We define a network's performance as the asymptotic accuracy of its associative memory averaged over the ensemble of pattern classes (Fig. 2A),

Q(λ) ≡ E[q^α(t; λ)] ≃ lim_{T→∞} (1/T) Σ_{t=0}^{T} (1/N) Σ_{α=1}^{N} ⟨σ^α_att(t)|σ^α(t)⟩   (2)

The expectation E[·] is an empirical average over the ensemble of presented pattern classes over time, which in the stationary state approaches the asymptotic average of the argument. The optimal learning rate is determined by maximizing the network's performance, λ* = argmax_λ Q(λ). The optimal learning rate increases with growing mutation rate, so that a network can follow the evolving patterns (Fig. 2B). Although it is difficult to calculate the optimal learning rate analytically, we can use an approximate approach and find the learning rate that minimizes the expected energy of the patterns E[E_{λ,ρ}(J, σ)], assuming that patterns are shown to the network in a fixed order (Appendix B). In this case, the expected energy is given by

E[E_{λ,ρ}(J, σ)] = −(L − 1)/2 × λ/(1 − λ) × 1/(ρ^{−2N}(1 − λ)^{−N} − 1),   (3)

where ρ^N ≡ (1 − 2μ)^N ≈ 1 − 2μ_eff is the upper bound for the overlap between a pattern and its evolved form, when separated by the other N − 1 patterns that are shown in between. The expected energy grows slowly with increasing mutation rate (i.e., with decreasing overlap q), and the approximation in eq. 3 agrees very well with the numerical estimates for the scenario where patterns are shown in a random order (Fig. 2C). In the regime where memory can still be associated with the evolved patterns (μ_eff ≪ 0.5), minimization of the expected energy (eq. 3) results in an optimal learning rate

λ*(μ) = √(8μ/(N − 1))   (4)

that scales with the square root of the mutation rate. Notably, this theoretical approximation agrees well with the numerical estimates (Fig. 2B).

FIG. 2. Reduced performance of Hopfield networks in retrieving memory of evolving patterns. (A) The optimal performance of a network, Q* ≡ Q(λ*) (eq. 2), is shown as a function of the effective mutation rate μ_eff = Nμ. The solid lines show the simulation results for networks encountering different numbers of patterns (colors). The black dotted line shows the naïve expectation for the performance based solely on the evolutionary divergence of the patterns, Q_0 ≈ 1 − 2μ_eff, and the colored dashed lines show the expected performance after accounting for the memory lag g_lag, Q_lag ≈ 1 − 2 g_lag μ_eff; see Fig. S1 for more details. (B) The optimal learning rate λ* is shown as a function of the effective mutation rate. The solid lines are the numerical estimates and the dashed lines show the theoretical predictions (eq. 4). (C) The mean energy obtained by simulations of randomly ordered patterns (solid lines) and the analytical approximation (eq. 3) for ordered patterns (dotted lines) are shown. Error bars show the standard error from the independent realizations (Appendix A). The color code for the number of presented patterns is consistent across panels, and the length of patterns is set to L = 800.

C. Reduced accuracy of distributed associative memory against evolving patterns

Despite using an optimized learning rate, a network's accuracy in pattern retrieval Q(λ) decays much faster than the naïve expectation based solely on the evolutionary divergence of patterns between two encounters with a given class (i.e., Q_0 = (1 − 2μ)^N ≈ 1 − 2μ_eff); see Fig. 2A. There are two reasons for this reduced accuracy: (i) the lag in the network's memory against evolving patterns, and (ii) misclassification of presented patterns.

The memory attractors associated with a given pattern class can lag behind and only reflect the older patterns presented prior to the most recent encounter of the network with the specified class. We characterize this lag g_lag by identifying a previous version of the pattern that has the maximum overlap with the network's energy landscape at a given time t: g_lag = argmax_{g≥0} E[⟨σ(t − gN)|J(t)|σ(t − gN)⟩] (Appendix B.2). g_lag measures time in units of N (i.e., the effective separation time of patterns of the same class). An increase in the optimal learning rate reduces the time lag and enables the network to follow the evolving patterns more closely (Fig. S1). The accuracy of the memory subject to such a lag decays as Q_lag = ρ^{g_lag N} ≈ 1 − 2 g_lag μ_eff, which is faster than the naïve expectation (i.e., 1 − 2μ_eff); see Fig. 2A. This memory lag explains the loss of performance for patterns that are still reconstructed by the network's memory attractors (i.e., those with q^α > 0.8; Fig. S1A). However, the overall performance of the network Q(λ) remains lower than the expectation obtained by taking this time lag into account (Fig. 2A); this discrepancy leads us to the second source of reduced accuracy, i.e., pattern misclassification. As the learning rate increases, the structure of the network's energy landscape changes.
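Before turning to how the energy landscape changes, the scaling of eq. 4 can be checked numerically by minimizing the expected energy of eq. 3 over λ on a grid, restricted to λ < 1/N (the validity bound noted in Appendix B). The following is a small sketch with assumed toy parameters, not a reproduction of the paper's analysis.

```python
import numpy as np

def expected_energy(lam, mu, N, L):
    """Expected pattern energy of eq. 3 (patterns shown in a fixed order)."""
    rho2N = (1.0 - 2.0 * mu) ** (2 * N)
    return -0.5 * (L - 1) * lam / (1.0 - lam) / (1.0 / (rho2N * (1.0 - lam) ** N) - 1.0)

N, L = 32, 800
lams = np.linspace(1e-6, 1.0 / N - 1e-6, 200000)   # eq. 3 has its minimum below 1/N
for mu_eff in (1e-4, 1e-3, 1e-2):
    mu = mu_eff / N
    lam_star = lams[np.argmin(expected_energy(lams, mu, N, L))]
    print(f"mu_eff={mu_eff:.0e}: grid minimum {lam_star:.4f}, "
          f"eq. 4 approximation {np.sqrt(8 * mu / (N - 1)):.4f}")
```

For small mutation rates the grid minimum and the square-root approximation of eq. 4 are close, as expected from the leading-order expansion in Appendix B.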
In particular, we see that with large learning rates a few narrow paths emerge between the memory attractors of the network (Fig. 1C). As a result, the equilibration process for pattern retrieval can drive a presented pattern through these connecting paths towards a wrong memory attractor (i.e., one with a small overlap ⟨σ_att|σ⟩), which leads to pattern misclassification (Fig. S2A and Fig. S3A,C). These paths are narrow, as there are only a few possible spin-flips (mutations) that can drive a pattern from one valley to another during equilibration (Fig. S3B,D and Fig. S4A,C). In other words, a large learning rate carves narrow mountain passes into the network's energy landscape (Fig. 1C), resulting in a growing fraction of patterns being misclassified. Interestingly, pattern misclassification occurs even in the absence of mutations for networks with an increased learning rate (Fig. S2A). This suggests that mutations contribute only indirectly to the misclassification of memory: they necessitate a larger learning rate for the networks to operate optimally, which in turn results in the emergence of mountain passes in the energy landscape.

To understand memory misclassification, particularly for patterns with moderately low (i.e., non-random) energy (Fig. 2C), we use spectral decomposition to characterize the relative positioning of patterns in the energy landscape (Appendix C). The vector representing each pattern |σ⟩ can be expressed in terms of the network's eigenvectors {Φ_i}, |σ⟩ = Σ_i m_i |Φ_i⟩, where the overlap m_i ≡ ⟨Φ_i|σ⟩ is the i-th component of the pattern in the network's coordinate system. During equilibration, we flip individual spins in a pattern and accept the flips based on their contribution to the recognition energy. We can view these spin-flips as rotations of the pattern in the space spanned by the eigenvectors of the network. Stability of a pattern depends on whether these rotations can carry the pattern from its original subspace over to an alternative region associated with a different energy minimum.

There are two key factors that modulate the stability of a pattern in a network. The dimensionality of the subspace in which a pattern resides, i.e., the support of a pattern by the network's eigenvectors, is one of the key determining factors for pattern stability. We quantify the support of a pattern σ using the participation ratio π(σ) = (Σ_i m_i^2)^2 / Σ_i m_i^4 [36, 37], which counts the number of eigenvectors that substantially overlap with the pattern. A small support π(σ) ≈ 1 indicates that the pattern is spanned by only a few eigenvectors and is restricted to a small subspace, whereas a large support indicates that the pattern is orthogonal to only a few eigenvectors. As the learning rate increases, patterns lie in lower-dimensional subspaces supported by only a few eigenvectors (Fig. S4B,D). This effect is exacerbated by the fact that the energy gaps between the eigenstates of the network also broaden with increasing learning rate (Fig. S5). The combination of a smaller support for patterns and a larger energy gap in networks with increased learning rate destabilizes patterns by enabling spin-flips during equilibration to drive a pattern from one subspace to another, through the mountain passes carved within the landscape; see Appendix C and Fig. S6 for the exact analytical criteria for pattern stability.
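The participation ratio defined above can be computed directly from the eigenvectors of J. A minimal sketch follows; it sums over all eigenvectors of the (symmetric) network, which is an assumption on our part, since the text does not state whether the sum is restricted to the non-trivial eigenvectors, and the toy network below is ours.

```python
import numpy as np

def participation_ratio(J, sigma):
    """Support of a pattern: pi = (sum_i m_i^2)^2 / sum_i m_i^4, with m_i = <Phi_i|sigma>."""
    _, eigvecs = np.linalg.eigh(J)                    # columns are eigenvectors of the symmetric J
    m = eigvecs.T @ (sigma / np.sqrt(sigma.size))     # overlaps with the normalized pattern
    return (m @ m) ** 2 / np.sum(m ** 4)

# toy usage: stored patterns of a small Hebbian network are supported by few eigenvectors
rng = np.random.default_rng(5)
L = 200
pats = rng.choice([-1, 1], size=(5, L))
J = sum(np.outer(p, p) for p in pats) / 5.0
np.fill_diagonal(J, 0.0)
print([round(participation_ratio(J, p), 1) for p in pats])
```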
D. Compartmentalized learning and memory storage

Hopfield-like networks can store accurate associative memory for static patterns. However, these networks fail to perform and store retrievable associative memory for evolving patterns (e.g., pathogens), even when learning is done at an optimal rate (Fig. 2). To overcome this difficulty, we propose to store memory in compartmentalized networks, with C sub-networks of size L_c (i.e., the number of nodes in a sub-network). Each compartment (sub-network) can store a few of the total of N pattern classes without interference from the other compartments (Fig. 1D).

Recognition of a pattern σ in a compartmentalized network involves a two-step process (Fig. 1D): First, we choose the sub-network J^i associated with compartment i with probability P_i = exp[−β_S E(J^i, σ)]/𝒩, where β_S is the inverse temperature for this decision and 𝒩 is the normalizing factor. Once the compartment is chosen, we follow the recipe for pattern retrieval in the energy landscape of the associated sub-network, whereby a pattern equilibrates into a memory attractor.

On average, each compartment stores a memory for N_c = N/C pattern classes. To keep networks with different numbers of compartments C comparable, we scale the size of each compartment L_c to keep C × L_c = constant, which keeps the (Hopfield) capacity of the network α = N_c/L_c invariant under compartmentalization. Moreover, the mutation rate experienced by each sub-network scales with the number of compartments, μ_c = Cμ, which keeps the effective mutation rate μ_eff = N_c μ_c invariant under compartmentalization. As a result, the optimal learning rate (eq. 4) scales with the number of compartments C as λ*_c = √(8μ_c/(N_c − 1)) ≈ C λ*_1. However, since updates are restricted to sub-networks of size L_c at a time, the expected amount of update within a network, L_c λ_c, remains invariant under compartmentalization. Lastly, since the change in energy due to a single spin-flip scales as ΔE ∼ 1/L_c, we introduce the scaled Hopfield temperature β_Hc ≡ Cβ_H to make the equilibration process comparable across networks with different numbers of compartments. No such scaling is necessary for β_S. By restricting the networks to satisfy the aforementioned scaling relationships, we are left with two independent variables, (i) the number of compartments C and (ii) the learning rate λ_c, which define a memory strategy {C, λ_c}. A memory strategy can then be optimized to achieve maximum accuracy for retrieving an associative memory of evolving patterns with a given effective mutation rate μ_eff.

FIG. 3. Compartmentalized memory storage. (A) The optimal performance is shown as a function of the effective mutation rate (similar to Fig. 2A) for networks with different numbers of compartments C (colors), ranging from a network with distributed memory, C = 1 (blue), to a 1-to-1 compartmentalized network, C = N (red). (B) The optimal (scaled) learning rate λ*_c/C is shown as a function of the effective mutation rate for networks with different numbers of compartments (colors as in (A)). Full lines show the numerical estimates and the dashed line is the analytical approximation, λ*_c = √(8μ_c/(N_c − 1)) ≈ C λ*_1. The scaled learning rates collapse onto the analytical approximation for all networks except the 1-to-1 compartmentalized network (red), where the maximal learning rate λ ≈ 1 is used and each compartment is fully updated upon an encounter with a new version of a pattern. The number of presented patterns is set to N = 32. We keep L × C = const., with L = 800 used for the network with C = 1.

E. Phases of learning and memory production

Pattern retrieval can be stochastic due to the noise in choosing the right compartment among the C sub-networks (tuned by the inverse temperature β_S), or the noise in equilibrating into the right memory attractor in the energy landscape of the chosen sub-network (tuned by the Hopfield inverse temperature β_Hc). We use mutual information to quantify the accuracy of pattern-compartment association, where larger values indicate a more accurate association; see Appendix A and Fig. 4. The optimal performance Q* determines the overall accuracy of memory retrieval, which depends on both finding the right compartment and equilibrating into the correct memory attractor. The amplitudes of intra- versus inter-compartment stochasticity determine the optimal strategy {C*, λ*_c} used for learning and retrieval of patterns with a specified mutation rate. Varying the corresponding inverse temperatures (β_Hc, β_S) results in three distinct phases of optimal memory storage.

i. Small intra- and inter-compartment noise (β_Hc ≫ 1, β_S ≫ 1). In this regime, neither the compartment choice nor the pattern retrieval within a compartment is subject to strong noise. As a result, networks are functional with working memory and the optimal strategies can achieve the highest overall performance. For small mutation rates, we see that all networks perform equally well and can achieve almost perfect performance, irrespective of their number of compartments (Figs. 3A, 4A,B). As the mutation rate increases, networks with a larger number of compartments show a more favorable performance, and the 1-to-1 specialized network, in which each pattern is stored in a separate compartment (i.e., N = C), reaches the optimal performance 1 − 2μ_eff (Figs. 3A, 4C,D). As predicted by the theory, the optimal learning rate for compartmentalized networks scales with the mutation rate as λ*_c ∼ μ_c^{1/2}, except for the 1-to-1 network, in which λ*_c → 1 and sub-networks are steadily updated upon an encounter with a pattern (Fig. 3B). This rapid update is expected since there is no interference between the stored memories in the 1-to-1 network, and a steady update can keep the stored memory in each sub-network close to its associated pattern class without disrupting the other energy minima.

ii. Small intra- and large inter-compartment noise (β_Hc ≫ 1, β_S ≪ 1). In this regime there is low noise for equilibration within a compartment but a high level of noise in choosing the right compartment. The optimal strategy in this regime is to store patterns in a single network with a distributed memory, since identifying the correct compartment is difficult due to noise (Fig. 4B,D). For static patterns this strategy corresponds to the classical Hopfield model with a high accuracy (Figs. 2A, 4A,B). On the other hand, for evolving patterns this strategy results in a partial memory (Fig. 4C,D) due to the reduced accuracy of the distributed associative memory, as shown in Fig. 2A. Interestingly, the transition between the optimal strategy with highly specific (compartmentalized) memory for evolving patterns in the first regime and the generalized (distributed) memory in this regime is very sharp (Fig. 4D). This sharp transition suggests that, depending on the noise in choosing the compartments β_S, an optimal strategy either stores memory in a 1-to-1 specialized fashion (C = N) or in a distributed, generalized fashion (C = 1), but no intermediate solution (i.e., quasi-specialized memory with 1 < C < N) is desirable.

FIG. 4. Phases of learning and memory production. Different optimal memory strategies are shown. (A) The heatmap shows the optimal memory performance Q* as a function of the (scaled) Hopfield inverse temperature β_Hc = β_H · C and the inverse temperature associated with compartmentalization β_S, for networks learning and retrieving a memory of static patterns (μ = 0); colors indicated in the color bar. The optimal performance is achieved by using the optimal strategy (i.e., learning rate λ*_c and number of compartments C*) for networks at each value of β_Hc and β_S. The three phases (accurate memory, partial memory, and no memory) are indicated. (B) The heatmap shows the optimal number of compartments (colors as in the legend) corresponding to the memory performance shown in (A). We limit the optimization to the numbers of compartments indicated in the legend to keep N/C an integer. The dashed region corresponds to the case where all strategies perform equally well. Regions of distributed memory (C = 1) and of 1-to-1 specialized memory (C = N) are indicated. The top panel shows the optimal performance Q* of different strategies as a function of the Hopfield inverse temperature β_Hc. The right panel shows the mutual information MI(Σ, C) between the set of pattern classes Σ ≡ {σ^α} and the set of all compartments C, normalized by the entropy of the compartments H(C), as a function of the inverse temperature β_S; see Appendix A.3. This normalized mutual information quantifies the ability of the system to assign a pattern to the correct compartment. (C-D) Similar to (A-B) but for evolving patterns with the effective mutation rate μ_eff = 0.01. The number of presented patterns is set to N = 32 (all panels). Similar to Fig. 3, we keep L × C = const., with L = 800 used for networks with C = 1.

iii. Large intra-compartment noise (β_Hc < 1). In this regime there is a high level of noise in equilibration within a network and memory cannot be reliably retrieved (Fig. 4A,C), regardless of the compartmentalization temperature β_S. However, the impact of the equilibration noise β_Hc on the accuracy of memory retrieval depends on the degree of compartmentalization. For the 1-to-1 specialized network (C = N), the transition between high and low accuracy is smooth and occurs at β_Hc = 1, below which no memory attractor can be found.
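As an aside, the two-step retrieval described in this section (a softmax choice of compartment set by β_S, followed by equilibration within the chosen sub-network) can be sketched as below. This is a self-contained toy illustration; the function names and the assumption that all sub-networks act on patterns of the same length are ours, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(2)

def energy(J, sigma):
    return -0.5 * sigma @ J @ sigma / sigma.size

def choose_compartment(subnets, sigma, beta_S):
    """Step 1: pick sub-network i with probability proportional to exp(-beta_S * E(J_i, sigma))."""
    E = np.array([energy(J, sigma) for J in subnets])
    w = np.exp(-beta_S * (E - E.min()))        # shift energies for numerical stability
    return rng.choice(len(subnets), p=w / w.sum())

def retrieve_in_compartment(subnets, sigma, beta_S, beta_Hc, n_steps=20000):
    """Step 2: Metropolis equilibration within the chosen compartment."""
    J = subnets[choose_compartment(subnets, sigma, beta_S)]
    s, L = sigma.copy(), sigma.size
    for _ in range(n_steps):
        i = rng.integers(L)
        dE = 2.0 * s[i] * (J[i] @ s) / L
        if dE <= 0 or rng.random() < np.exp(-beta_Hc * dE):
            s[i] = -s[i]
    return s
```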
As we increase the equilibration noise (decrease βHc ), the networks with distributed memory (C < N ) show two-step transitions, with a plateau in the range of 1/Nc . βHc . 1. Similar to the 1-to-1 network, the first transition at βHc ≈ 1 results in the reduced accuracy of the networks’ memory retrieval. At this transition point, the networks’ learning rate λc approaches its maximum value 1 (Fig. S7), which implies that the memory is stored (and steadily updated) for only C < N patterns (i.e., one pattern per sub-network). Due to the invariance of the networks’ mean energy under compartmentalization, the depth of the energy minima associated with the stored memory in each sub-network scales as N/C, resulting in deeper and more stable energy minima in networks with smaller number of compartments C. Therefore, as the noise increases (i.e., βHc decreases), we observe a gradient in transition from partial retrieval to a no-memory state at βH ≈ 1/Nc , with the most compartmentalized network (larger C) transitioning the fastest, reflecting the shallowness of its energy minima. Taken together, the optimal strategy leading to working memory depends on whether a network is trained to learn and retrieve dynamic (evolving) patterns or static patterns. Specifically, we see that the 1-to-1 specialized network is the unique optimal solution for storing working memory for evolving patterns, whereas the distributed generalized memory (i.e., the classical Hopfield network) performs equally well in learning and retrieval of memory for static patterns. The contrast between these memory strategies can shed light on the distinct molecular mechanisms utilized by different biological systems to store memory. III. DISCUSSION Storing and retrieving memory from prior molecular interactions is an efficient scheme to sense and respond to external stimuli. Here, we introduced a flexible energybased network model that can adopt different memory strategies, including distributed memory, similar to the classical Hopfield network, or compartmentalized memory. The learning rate and the number of compartments in a network define a memory strategy, and we probed the efficacy of different strategies for static and dynamic patterns. We found that Hopfield-like networks with distributed memory are highly accurate in storing associative memory for static patterns. However, these networks fail to reliably store retrievable associative memory for evolving patterns, even when learning is done at an optimal rate. To achieve a high accuracy, we showed that a retrievable memory for evolving patterns should be compartmentalized, where each pattern class is stored in a separate sub-network. In addition, we found a sharp transition between the different phases of working memory (i.e., compartmentalized and distributed memory), suggesting that intermediate solutions (i.e., quasi-specialized memory) are sub-optimal against evolving patterns. The contrast between these memory strategies is reflective of the distinct molecular mechanisms used for memory storage in the adaptive immune system and in the olfactory cortex. In particular, the memory of odor complexes, which can be assumed as static, is stored in a distributed fashion in the olfactory cortex [7–11, 24]. On the other hand, the adaptive immune system, which encounters evolving pathogens, allocates distinct immune cells (i.e., compartments) to store a memory for different types of pathogens (e.g. 
different variants of influenza or HIV), a strategy that resembles that of the 1-to-1 specialized networks [5, 27–32]. Our results suggest that pathogenic evolution may be one of the reasons for the immune system to encode a specialized memory, as opposed to the distributed memory used in the olfactory system.

The increase in the optimal learning rate in anticipation of patterns' evolution significantly changes the structure of the energy landscape for associative memory. In particular, we found the emergence of narrow connectors (mountain passes) between the memory attractors of a network, which destabilize the equilibration process and significantly reduce the accuracy of memory retrieval. Indeed, tuning the learning rate as a hyper-parameter is one of the challenges of current machine learning algorithms with deep neural networks (DNNs) [38, 39]. The goal is to navigate the tradeoff between speed (i.e., rate of convergence) and accuracy without overshooting during optimization. It will be interesting to see how the insights developed in this work can inform rational approaches to choosing an optimal learning rate in optimization tasks with DNNs.

Machine learning algorithms with DNNs [38] and modified Hopfield networks [40–43] are able to accurately classify hierarchically correlated patterns, where different objects can be organized into an ultrametric tree based on some specified relations of similarity. For example, faces of cats and dogs have the oval shape in common, but they branch out in the ultrametric tree according to organism-specific features, e.g., whiskers in a cat, and the cat branch can then further diversify based on breed-specific features. A classification algorithm can use these hierarchical relations to find features common among members of a given sub-type (cats) that distinguish them from another sub-type (dogs). Although evolving patterns within each class are correlated, the random evolutionary dynamics of these patterns does not build a hierarchical structure in which a pattern class branches into two sub-classes that share a common ancestral root. Therefore, the optimal memory strategies found here for evolving patterns are distinct from those for hierarchically correlated patterns. It will be interesting to see how our approaches can be implemented in DNNs to classify dynamic and evolving patterns.

ACKNOWLEDGEMENT

This work has been supported by the DFG grant (SFB1310) for Predictability in Evolution and the MPRG funding through the Max Planck Society. O.H.S. also acknowledges funding from the Georg-August University School of Science (GAUSS) and the Fulbright foundation.

Appendix A: Computational procedures

A1. Initialization of the network

A network J (with elements J_ij) is presented with N random (orthogonal) patterns |σ^α⟩ (with α = 1, ..., N), with entries σ^α_i ∈ {−1, 1}, reflecting the N pattern classes. For a network with C compartments (with 1 ≤ C ≤ N), we initialize each sub-network J^s at time t_0 as J^s_{i,j}(t_0) = (1/(N/C)) Σ_{α∈A_s} σ^α_i σ^α_j and J^s_{ii}(t_0) = 0; here, A_s is a set of N/C randomly chosen (without replacement) patterns initially assigned to the compartment (sub-network) s. We then let the network undergo an initial learning process.
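A minimal sketch of this initialization is given below. The helper name and the toy sizes are ours; for simplicity each sub-network here is imprinted with full-length patterns, whereas in the paper the sub-network size is scaled as L_c = L/C.

```python
import numpy as np

rng = np.random.default_rng(3)

def init_compartments(patterns, C):
    """Initialize C sub-networks: J^s(t0) = (C/N) * sum_{alpha in A_s} outer(sigma^alpha, sigma^alpha),
    with zero diagonal and the N/C pattern classes of A_s assigned without replacement."""
    N, L = patterns.shape
    assert N % C == 0, "N/C must be an integer"
    order = rng.permutation(N)
    subnets = []
    for s in range(C):
        A_s = patterns[order[s * (N // C):(s + 1) * (N // C)]]
        J = (C / N) * sum(np.outer(p, p) for p in A_s)
        np.fill_diagonal(J, 0.0)
        subnets.append(J)
    return subnets

# e.g. 32 toy pattern classes of length 100, distributed over 4 compartments
subnets = init_compartments(rng.choice([-1, 1], size=(32, 100)), C=4)
```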
At each step an arbitrary pattern σ ν is presented to the network and a sub-network J s is chosen for an update with a probability exp [−βS E (J s (t), σ ν (t))] Ps = P C , s ν r=1 exp [−βS E (J (t), σ (t))] where the energy is defined as, E (J s (t), σ ν (t)) = −1 X s −1 ν Ji,j (t)σiν (t)σjν (t) ≡ hσ (t)|J s (t)|σ ν (t)i 2L i,j 2 (S1) (S2) and βS is the inverse temperature associated with choosing the right compartment. We then update the selected sub-network J s , using the Hebbian update rule, ( s (1 − λ) Ji,j (t) + λ σiν σjν , if i 6= j; s Ji,j (t + 1) = (S3) 0, otherwise. For dynamic patterns, the presented patterns undergo evolution with mutation rate µ, which reflects the average number of spin flips in a given pattern per network’s update event (Fig. 1). Our goal is to study the memory retrieval problem in a network that has reached its steady state. The state of a network J(tn ) at the time step n can be traced back to the initial state J(t0 ) as, J(tn ) = (1 − λ)n J(t0 ) + λ n X i=1 (1 − λ)n−i |σ(ti )i hσ(ti )| (S4) n The contribution of the initial state J(t0 ) to the state of the network at time thn decays as (1 (eq. S4).  − λ)−5 i log 10 Therefore, we choose the number of steps to reach the steady state as nstat. = max 10N, 2C ceil log(1−λ) . This criteria ensures that (1 − λ)nstat. ≤ 105 and the memory of the initial state J(t0 ) is removed from the network J(t). We will then use this updated network to collect the statistics for memory retrieval. To report a specific quantity from the network (e.g., the energy), we pool the nstat. samples collected from each of the 50 realizations. A2. Pattern retrieval from associative memory Once the trained network approaches a stationary state, we collect the statistics of the stored memory. α To find a memory attractor σatt for a given pattern σ α we use a Metropolis algorithm in the energy landscape s α E(J , σ ) (eq. S2). To do so, we make spin-flips in a presented pattern σ α → σ̃ α and accept a spin-flip with probability  P (σ α → σ̃ α ) = min 1, e−βH ∆E , (S5) where ∆E = E(J s , σ̃ α ) − E(J s , σ α ) and βH is the inverse (Hopfield) temperature for pattern retrieval in the network (see Fig. 1). We repeat this step for 2 × 106 steps, which is sufficient to find a minimum of the landscape (see Fig. S3). For systems with more than one compartment C, we first choose a compartment according to eq. S1, and then perform the Metropolis algorithm within the associated compartment. After finding the energy minima, we update the systems for n′stat. = max[2 · 103 , nstat. ] steps. At each step we present patterns as described above and collect the statistics of the recognition energy E(J s (t), σ α (t)) between a presented pattern σ α and the memory compartment J s (t), assigned according to eq. S1. These measurements are 11 used to describe the energy statistics (Figs. 2,S2) of the patterns and the accuracy of pattern-compartment association (Fig. 4B,D). After the n′stat. steps, we again use the Metropolis algorithm to find the memory attractors associated with the presented patterns. We repeat this analysis for 50 independent realizations of the initializing pattern classes {σ α (t0 )}, for each set of parameters {L, N, C, λ, µ, βS , βH }. When calculating the mean performance Q of a strategy (see Figs. 2,3,4,S7), we set the overlap between attractor α and pattern q α = | hσatt |σ α i | equal to zero when patterns are not recognized (q α < 0.8). 
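The thresholded performance just described (overlaps below 0.8 set to zero before averaging) can be written compactly as below; the array shapes (N patterns of length L, with a matching array of retrieved attractors) are an assumed convention of this sketch.

```python
import numpy as np

def performance(attractors, patterns, q_threshold=0.8):
    """Mean performance Q: per-pattern overlap q = |<sigma_att|sigma>|, zeroed when q < q_threshold."""
    q = np.abs(np.einsum("ij,ij->i", attractors, patterns)) / patterns.shape[1]
    return float(np.mean(np.where(q >= q_threshold, q, 0.0)))
```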
As a result, systems can only achieve a non-zero performance if they recognize some of the patterns. This choice eliminates the finite size √ effect of a random overlap ∼ 1/ L between an attractor and a pattern (see Fig. S3). This correction is especially important when comparing systems with different sub-network sizes (Lc ≡ L/C) in the βH < 1 regime (Figs. 4,S7), where random overlaps for small Lc could otherwise result in a larger mean performance compared to larger systems that correctly reconstruct a fraction of the patterns. A3. Accuracy of pattern-compartment association We use the mutual information MI(Σ, C) between the set of pattern classes Σ ≡ {σ α } and the set of all compartments C to quantify the accuracy in associating a presented pattern with the correct compartment, MI(Σ, C) = H(C) − H(C|Σ) =− X c∈C " P (c) log P (c) − − X σ α ∈Σ P (σ α ) X c∈C # P (c|σ α ) log P (c|σ α ) . (S6) Here H(C) and H(C|Σ) are the entropy of the compartments, and the conditional entropy of the compartments given the presented patterns, respectively. If chosen randomly, the entropy associated with choosing a compartment is H random (C) = log C. The mutual information (eq. S6) measures the reduction in the entropy of compartments due to the association between the patterns and the compartments, measured by the conditional entropy H(C|Σ). Figure 4B,D shows the mutual information MI(Σ, C) scaled by its upper bound H(C), in order to compare networks with a different number of compartments. 12 Appendix B: Estimating energy and optimal learning rate for working memory B1. Approximate solution for optimal learning rate The optimal learning rate is determined by maximizing the network’s performance Q(λ) (eq. 2) against evolving patterns with a specified mutation rate: λ∗ = argmax Q(λ) (S1) λ We can numerically estimate the optimal learning rate as defined eq. S1; see Figs. 2,3. However, the exact analytical evaluation of the optimal learning rate is difficult and we use an approximate  approach  and find the learning rate that minimizes the expected energy of the patterns in the stationary state E Eλ,ρ (J, σ) , assuming that patterns are shown to the network at a fixed order. Here, the subscripts explicitly indicate the learning rate of the network λ, and the evolutionary overlap of the pattern ρ. To evaluate an analytical approximation for the energy, we first evaluate the state of the network J(t) at time step t, given all the prior encounters of the networks with patterns shown at a fixed order. ∞ X 1 (1 − λ)(j−1) |σ(t − j)i hσ(t − j)| (J(t) + 1) = λ L j=1 =λ ∞ X i=1 =λ (1 − λ)(i−1)N α=1 | | ∞ N X X α=1 i=0 | N X (S2) (1 − λ)α−1 |σ α (t − α − (i − 1)N )i hσ α (t − α − (i − 1)N )| {z sum over N pattern classes {z sum over time (generations, i) (1 − λ)(α−1)+iN |σ α (t − α − iN )i hσ α (t − α − iN )| . {z {z | (S3) } } (S4) } } sum over time sum over patterns Here, we referred to the (normalized) pattern vector from the class α presented to the network at time step t by |σ α (t)i ≡ √1L σ α (t). Without a loss of generality, we assumed that the last pattern presented to the network at time step t − 1 is from the first pattern class |σ 1 (t − 1)i, which enabled us to split the sum in eq. S2 into two separate summations over pattern classes and N time-steps generations (eq. S3). Adding the identity matrix 1 on the left-hand side of eq. S2 assures that the diagonal elements vanish, as defined in eq. S3. 
The mean energy of the patterns, which in our setup is the energy of the pattern from the N th class at time t, follows     1 N N E Eλ,ρ (J, σ) = E − hσ (t)|J(t)|σ (t)i 2 # " (S5) N ∞ L−1 XX (α−1)+iN N α α N =E − λ (1 − λ) hσ (t)|σ (t − α − iN )i hσ (t − α − iN )|σ (t)i . 2 α=1 i=0 Since the pattern families are orthogonal to each other, we can express the overlap between patterns at different times as hσ α (t1 )|σ ν (t2 )i = δα,ν (1 − 2µ)|t2 −t1 | ≡ δα,ν ρ|t2 −t1 | , and simplify the energy function in eq. S5, ∞   L−1 X E Eλ,ρ (J, σ) = − λ (1 − λ)(N −1)+iN ρ2(N +iN ) 2 i=0 =− =− ∞ X i L−1 λ(1 − λ)(N −1) ρ2N (1 − λ)N ρ2N 2 i=0 L − 1 (1 − λ)(N −1) ρ2N . λ 2 1 − (1 − λ)N ρ2N (S6) 13 Since accurate pattern retrieval depends on the depth of the energy valley for the associative memory, we will use the expected energy of the patterns as a proxy for the performance of the network.  We can find the approximate optimal learning rate that minimized the expected energy by setting ∂E Eλ,ρ (J, σ) /∂λ = 0, which results in 1 (1 − 2µ)2N = (1 − λ∗ )−N (1 − N λ∗ ) =⇒ 1 − 4N µ + O(µ2 ) = 1 + (N − N 2 )(λ∗ )2 + O(λ3 ); 2 p ∗ =⇒ λ (µ) ≃ 8µ/(N − 1). (S7) where we used the fact that both the mutation rate µ and the learning rate λ are small, and therefore, expanded eq. S7 up to the leading orders in these parameters. In addition, eq. S7 establishes an upper bound for the learning rate: λ < N1 . Therefore, our expansion in mutation rate (eq. S7) is only valid for 8µ < N1 , or equivalently for µeff = N µ < 12.5%; the rates used in our analyses lie far below these upper bounds. B2. Lag of memory against evolving patterns The memory attractors associated with a given pattern class can lag behind and only reflect the older patterns presented prior to the most recent encounter of the network with the specified class. As a result, the upper bound for the performance of a network Qlag = ρglag N ≈ 1 − 2glag µeff is determined by both the evolutionary divergence of patterns between two encounters µeff and number of generations glag by which the stored memory lags behind; we measure glag in units of generations; one generation is defined as the average time between a network’s encounter with the same pattern class i.e., N . We characterize this lag glag by identifying the past pattern (at time t − glag N ) that has the maximum overlap with the network’s energy landscape at given time t:   glag = argmax E [hσ(t − g N )|J(t)|σ(t − g N )i] ≡ argmin E Elag (g) (S8) g≥0 g≥0   where we introduced the expected lagged energy E Elag (g) . Here, the vector |σ(t)i refers to the pattern σ presented to the network at time t, which can be from any of the pattern classes. Because of the translational symmetry in time in the stationary state, the lagged energy can also be expressed in terms of the overlap between a pattern at time t and the network at a later time t + gN . We evaluate the lagged energy by substituting the expression for the network’s state J(t + gN ) from eq. S2, which entails, 2 1 E [Elag (g)] = − E [hσ(t)|J(t + g N )|σ(t)i] L−1 L−1   gN −1 X 1 2 (1 − λ)gN −1−j hσ(t)|σ(t + j)i  = −E  (1 − λ)N g hσ(t)|J(t)|σ(t)i + λ L−1 j=0 (S9) (S10) g−1 N −1 X X   2 2 Ng (1 − λ)gN −1−N i−α hσ N (t)|σ N −α (t + N i + α)i (S11) (1 − λ) E Eλ,ρ (J, σ) − λ = L−1 i=0 α=0 g−1 X   2 (1 − λ)gN −1−N i ρ2N i (1 − λ)N g E Eλ,ρ (J, σ) − λ L−1 i=0   N (g+1)−1 N (g+1) 2N (1 − λ) − (1 − λ)N −1 ρ2N g (1 − λ) ρ + . = −λ 1 − (1 − λ)N ρ2N (1 − λ)N − ρ2N = (S12) (S13) Here, we used the expression of the network’s matrix J in eq. S4 to arrive at eq. 
S10, and then followed the procedure introduced in eq. S3 to arrived at the double-summation in eq. S11. We then used the equation for pattern overlap hσ α (t1 )|σ ν (t2 )i = δα,ν ρ|t2 −t1 | to reduce the sum in eq. S12 and arrived in eq. S13 by evaluating the  at the result  geometric sum and substituting the empirical average for the energy E Eλ,ρ (J, σ) from eq. S6. We probe this lagged memory by looking at the performance Q for patterns that are correctly associated with their memory attractors (i.e., those with hσatt |σi > 0.8). As shown in Fig. S1, for a broad parameter regime, the mean performance for these correctly associated patterns agrees well with the theoretical expectation Qlag = ρglag N , which is lower than the naive expectation Q0 . 14 Appendix C: Structure of the energy landscape for working memory C1. Formation of mountain passes in the energy landscape of memory for evolving patterns As shown in Fig. 1, large learning rates in networks with memory for evolving patterns result in the emergence of narrow connecting paths between the minima of the energy landscape. We refer to these narrow connecting paths as mountain passes. In pattern retrieval, the Monte Carlo search can drive a pattern out of one energy minimum into another minimum and potentially lead to pattern misclassification. We use two features of the energy landscape to probe the emergence of the mountain passes. First, we show that if a pattern is misclassified, it has fallen into a memory attractor associated with another pattern class and not a spuriously made energy minima. To do so, we compare the overlap of the attractor with the α original pattern | hσatt |σ α i | (i.e., the reconstruction performance of the patterns) with the maximal overlap of the α attractor with all other patterns maxν6=α | hσatt |σ ν i |. Indeed, as shown in Fig. S3A for evolving patterns, the memory attractors associated with 99.4% of the originally stored patterns have either a large overlap with the correct pattern or with one of the other previously presented pattern. 71.3% of the patterns are correctly classified (stable patterns in sector I in Fig. S3A), whereas 28.1% of them are associated with a secondary energy minima after equilibration (unstable patterns in sector II in Fig. S3A). A very small fraction of patterns (< 1%) fall into local minima given by the linear combinations of the presented patterns (sector IV in Fig. S3A). These minima are well-known in the classical Hopfield model [44, 45]. Moreover, we see that equilibration of a random pattern (i.e., a pattern orthogonal to all the presented classes) in the energy landscape leads to memory attractors for one of the originally presented pattern classes. The majority of these random patterns lie in sector II of Fig. S3A), i.e., they have a small overlap with the network since they are orthogonal to the originally presented pattern classes, and they fall into one of the existing memory attractors after equilibration. Second, we characterize the possible paths for a pattern to move from one valley to another during equilibration, using Monte Carlo algorithm with the Metropolis acceptance probability,   ′ ρ(σ → σ ′ ) = min 1, e−β(E(J,σ )−E(J,σ)) (S1) We estimate the number of beneficial spin-flips (i.e., open paths) that decrease the energy of a pattern at the start of equilibration (Fig. S3B). The average number of open paths is smaller for stable patterns compared to the unstable patterns that are be miscalssified during retrieval (Fig. S3B). 
The average number of open paths is smaller for stable patterns than for the unstable patterns that are misclassified during retrieval (Fig. S3B). However, the distributions of the number of open paths largely overlap for stable and unstable patterns. Therefore, the local energy landscapes of stable and unstable patterns are quite similar, and it is difficult to discriminate between them solely based on the local gradients in the landscape. Fig. S4A shows that the average number of beneficial spin-flips grows with the mutation rate of the patterns, but this number is comparable for stable and unstable patterns. Moreover, the unstable stored patterns (blue) have far fewer open paths available to them during equilibration compared to random patterns (red) that are presented to the network for the first time (Figs. S3B & S4A). Notably, on average half of the spin-flips reduce the energy of random patterns, irrespective of the mutation rate. This indicates that even though previously presented pattern classes are statistically distinct from random patterns, they can still become unstable, especially in networks which are presented with evolving patterns.

It should be noted that the evolution of the patterns only indirectly contributes to the misclassification of memory, as it necessitates a larger learning rate for the networks to operate optimally, which in turn results in the emergence of mountain passes. To clearly demonstrate this effect, Figs. S3C,D and S4D show the misclassification behavior for a network trained to store memory for static patterns, while using a larger learning rate that is optimized for evolving patterns. Indeed, we see that pattern misclassification in this case is consistent with the existence of mountain passes in the network's energy landscape.

C2. Spectral decomposition of the energy landscape

We use spectral decomposition of the energy landscape to characterize the relative positioning and the stability of patterns in the landscape. As shown in Figs. S3, S4, destabilization of patterns due to equilibration over mountain passes occurs in networks with high learning rates, even for static patterns. Therefore, we focus on how the learning rate impacts the spectral decomposition of the energy landscape in networks presented with static patterns. This simplification will enable us to analytically probe the structure of the energy landscape, which we will compare with numerical results for evolving patterns.

We can represent the network J (of size L × L) that stores a memory of N static patterns with N non-trivial eigenvectors |\Phi^i\rangle with corresponding eigenvalues \Gamma_i, and L − N degenerate eigenvectors |\Psi^k\rangle with corresponding trivial eigenvalues \gamma_k = \gamma = -1:

J = \sum_{i=1}^{N}\Gamma_i\,|\Phi^i\rangle\langle\Phi^i| + \sum_{k=1}^{L-N}\gamma_k\,|\Psi^k\rangle\langle\Psi^k|.   (S2)

The non-trivial eigenvectors span the space of the presented patterns, for which the recognition energy can be expressed by

E(J,\sigma^{\alpha}) = -\frac{1}{2}\sum_{i=1}^{N}\Gamma_i\,\langle\sigma^{\alpha}|\Phi^i\rangle\langle\Phi^i|\sigma^{\alpha}\rangle.   (S3)

An arbitrary configuration \chi in general can have components orthogonal to the N eigenvectors |\Phi^i\rangle, as it points to a vertex of the hypercube, and should be expressed in terms of all the eigenvectors \{\Phi^1,\dots,\Phi^N,\Psi^1,\dots,\Psi^{L-N}\}:

E(J,\chi) = -\frac{1}{2}\Bigg[\underbrace{\sum_{i=1}^{N}\Gamma_i\,\langle\chi|\Phi^i\rangle\langle\Phi^i|\chi\rangle}_{\text{stored patterns}} + \underbrace{\sum_{k=1}^{L-N}\gamma\,\langle\chi|\Psi^k\rangle\langle\Psi^k|\chi\rangle}_{\text{trivial space}}\Bigg].   (S4)

Any spin-flip in a pattern (e.g., during equilibration) can be understood as a rotation in the eigenspace of the network (eq. S4).
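The decomposition in eqs. S2–S4 can be checked directly with a standard eigendecomposition. The sketch below uses a simple Hebbian-like network as a stand-in for the trained J of the main text (its normalization is an illustrative choice) and verifies that the spectral expansion of eq. S4 reproduces the recognition energy −½⟨χ|J|χ⟩.

```python
import numpy as np

def spectral_energy(J, chi):
    """Recognition energy of eq. S4 evaluated from the eigendecomposition of J.

    E(J, chi) = -1/2 * sum_k lambda_k <chi|phi_k>^2, summed over all L eigenpairs
    (the N non-trivial and the L-N trivial directions).
    """
    eigvals, eigvecs = np.linalg.eigh(J)      # columns of eigvecs are the eigenvectors
    overlaps = eigvecs.T @ chi                # <phi_k|chi> for every eigenvector
    return -0.5 * np.sum(eigvals * overlaps ** 2)

# Placeholder Hebbian-like network built from a few random +/-1 patterns.
rng = np.random.default_rng(1)
L, N = 200, 8
patterns = rng.choice([-1.0, 1.0], size=(N, L))
J = sum(np.outer(p, p) for p in patterns) / N
np.fill_diagonal(J, 0.0)

chi = patterns[0]
direct = -0.5 * chi @ J @ chi
print(f"direct energy {direct:.3f}  vs  spectral expansion {spectral_energy(J, chi):.3f}")
```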
As a first step in characterizing these rotations, we remind ourselves of the identity

|\chi\rangle = \sum_{i=1}^{N}\langle\Phi^i|\chi\rangle\,|\Phi^i\rangle + \sum_{k=1}^{L-N}\langle\Psi^k|\chi\rangle\,|\Psi^k\rangle,   (S5)

with the normalization condition

\sum_{i=1}^{N}\langle\Phi^i|\chi\rangle^{2} + \sum_{k=1}^{L-N}\langle\Psi^k|\chi\rangle^{2} = 1.   (S6)

In addition, since the diagonal elements of the network are set to J_{ii} = 0 (eq. S3), the eigenvalues should sum to zero, or alternatively,

\sum_{i=1}^{N}\Gamma_i = -\sum_{k=1}^{L-N}\gamma_k = L-N.   (S7)

To assess the stability of a pattern \sigma^{\nu}, we compare its recognition energy E(J,\sigma^{\nu}) with the energy of the rotated pattern after a spin-flip, E(J,\tilde\sigma^{\nu}). To do so, we first consider a simple scenario, where we assume that the pattern \sigma^{\nu} has a large overlap with one dominant non-trivial eigenvector \Phi^A (i.e., \langle\sigma^{\nu}|\Phi^A\rangle^{2} = m^{2} \approx 1). The other components of the pattern can be expressed in terms of the remaining N-1 non-trivial eigenvectors with mean squared overlap \frac{1-m^{2}}{N-1}. The expansion of the recognition energy for the presented pattern is restricted to the N non-trivial directions (eq. S4), resulting in

E(J,\sigma^{\nu}) = -\frac{1}{2}\left[m^{2}\Gamma_A + \sum_{i\neq A}\frac{1-m^{2}}{N-1}\,\Gamma_i\right] = -\frac{1}{2}\left[m^{2}\Gamma_A + (1-m^{2})\tilde\Gamma\right],   (S8)

where \tilde\Gamma = \frac{1}{N-1}\sum_{i\neq A}\Gamma_i = \frac{N\bar\Gamma-\Gamma_A}{N-1} is the mean eigenvalue for the non-dominant directions.

A spin-flip (|\sigma^{\nu}\rangle \to |\tilde\sigma^{\nu}\rangle) can rotate the pattern out of the dominant direction \Phi^A and reduce the squared overlap by \epsilon^{2}. The rotated pattern |\tilde\sigma^{\nu}\rangle in general lies in the L-dimensional space and is not restricted to the N-dimensional (non-trivial) subspace. We first take a mean-field approach in describing the rotation of the pattern after a spin-flip. Because of the normalization condition (eq. S6), the loss in the overlap with the dominant direction should result in an average increase in the overlap with the other L-1 eigenvectors by \frac{\epsilon^{2}}{L-1}. The energy of the rotated pattern after a spin-flip, E(J,\tilde\sigma^{\nu}), can be expressed in terms of all the L eigenvectors (eq. S4),

E(J,\tilde\sigma^{\nu}) = -\frac{1}{2}\left[(m^{2}-\epsilon^{2})\Gamma_A + \sum_{i\neq A}\left(\frac{1-m^{2}}{N-1}+\frac{\epsilon^{2}}{L-1}\right)\Gamma_i + \sum_{k}\frac{\epsilon^{2}}{L-1}\,\gamma_k\right]   (S9)
 = E(J,\sigma^{\nu}) + \frac{\epsilon^{2}}{2}\left[\Gamma_A - \frac{1}{L-1}\left(\sum_{i\neq A}\Gamma_i + \sum_{k}\gamma_k\right)\right]
 = E(J,\sigma^{\nu}) + \frac{\epsilon^{2}}{2}\,\Gamma_A\left(1+\frac{1}{L-1}\right),   (S10)

where in eq. S10 we used the fact that the eigenvalues sum up to zero. On average, a spin-flip |\sigma^{\nu}\rangle\to|\tilde\sigma^{\nu}\rangle increases the recognition energy by E(J,\tilde\sigma^{\nu})-E(J,\sigma^{\nu}) = \frac{\epsilon^{2}}{2}\,\Gamma_A\left(1+O(L^{-1})\right). This is consistent with the results shown in Figs. S3B,D and Figs. S4A,D, which indicate that the majority of the spin-flips keep a pattern in the original energy minimum and only a few of the spin-flips drive a pattern out of the original attractor.
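The mean-field estimate of eq. S10 can be checked numerically. In the sketch below (a placeholder Hebbian-like network; configurations are unit-normalized so that the squared overlaps with the eigenvectors sum to one, as in eq. S6), the exact energy change of every single spin-flip is compared with the estimate (ε²/2)Γ_A(1 + 1/(L−1)); the two should agree on average, up to the corrections neglected in the mean-field treatment.

```python
import numpy as np

rng = np.random.default_rng(2)
L, N = 400, 8
patterns = rng.choice([-1.0, 1.0], size=(N, L))
J = sum(np.outer(p, p) for p in patterns) / N
np.fill_diagonal(J, 0.0)

eigvals, eigvecs = np.linalg.eigh(J)

def energy(s):
    return -0.5 * s @ J @ s

sigma = patterns[0] / np.sqrt(L)              # unit-normalized stored pattern (eq. S6)
A = int(np.argmax((eigvecs.T @ sigma) ** 2))  # dominant eigen-direction of the pattern
Gamma_A = eigvals[A]

dE_exact, dE_meanfield = [], []
for n in range(L):
    tilde = sigma.copy()
    tilde[n] *= -1                            # single spin-flip
    eps2 = (eigvecs[:, A] @ sigma) ** 2 - (eigvecs[:, A] @ tilde) ** 2
    dE_exact.append(energy(tilde) - energy(sigma))
    dE_meanfield.append(0.5 * eps2 * Gamma_A * (1 + 1 / (L - 1)))

print("mean exact dE:      ", np.mean(dE_exact))
print("mean mean-field dE: ", np.mean(dE_meanfield))
```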
In the analysis above, we assumed that the reduction in overlap with the dominant eigenvector \epsilon^{2} is absorbed equally by all the other eigenvectors (i.e., the mean-field approach). In this case, the change in energy is equally distributed across the positive and the negative eigenvalues (\Gamma's and \gamma's in eq. S9), resulting in an overall increase in the energy due to the reduced overlap with the dominant direction |\Phi^A\rangle. The destabilizing spin-flips are associated with atypical changes that rotate a pattern onto a secondary non-trivial direction |\Phi^B\rangle (with positive eigenvalue \Gamma_B), as a result of which the total energy could be reduced.

To better characterize the conditions under which patterns become unstable, we will introduce a perturbation to the mean-field approach used in eq. S10. We will assume that a spin-flip results in a rotation with a dominant component along a secondary non-trivial direction |\Phi^B\rangle. Specifically, we will assume that the reduced overlap \epsilon^{2} between the original pattern |\sigma^{\nu}\rangle and the dominant direction |\Phi^A\rangle is distributed in an imbalanced fashion between the other eigenvectors, with a fraction p projected onto a new (non-trivial) direction |\Phi^B\rangle, while all the other L-2 directions span the remaining (1-p)\epsilon^{2}. In this case, the energy of the rotated pattern is given by

E(J,\tilde\sigma^{\nu}) = -\frac{1}{2}\left[(m^{2}-\epsilon^{2})\Gamma_A + \sum_{i\neq A,B}\left(\frac{1-m^{2}}{N-1}+\frac{(1-p)\epsilon^{2}}{L-2}\right)\Gamma_i + \left(\frac{1-m^{2}}{N-1}+p\epsilon^{2}\right)\Gamma_B + \sum_{k}\frac{(1-p)\epsilon^{2}}{L-2}\,\gamma_k\right]
 = E(J,\sigma^{\nu}) + \frac{\epsilon^{2}}{2}\left(\Gamma_A - p\,\Gamma_B\right) + O(L^{-1}).   (S11)

Therefore, a spin-flip is beneficial if \Gamma_A < p\,\Gamma_B. To further concretize this condition, we will estimate the typical loss \epsilon^{2} and gain p\epsilon^{2} in the squared overlap between the pattern and its dominating directions due to rotation by a single spin-flip.

Let us consider a rotation |\sigma^{\nu}\rangle\to|\tilde\sigma^{\nu}\rangle by a flip in the n-th spin of the original pattern |\sigma^{\nu}\rangle. This spin-flip reduces the original overlap of the pattern m = \langle\sigma^{\nu}|\Phi^A\rangle with the dominant direction |\Phi^A\rangle by the amount \frac{2}{\sqrt{L}}\Phi^A_n, where \Phi^A_n is the n-th entry of the eigenvector |\Phi^A\rangle. Since the original overlap is large (i.e., m \simeq 1), all entries of the dominant eigenvector are approximately \Phi^A_i \simeq 1/\sqrt{L},\ \forall i, resulting in a reduced overlap of the rotated pattern \langle\tilde\sigma^{\nu}|\Phi^A\rangle = m - \frac{2}{L}. Therefore, the loss in the squared overlap \epsilon^{2} by a spin-flip is given by

\epsilon^{2} = \langle\sigma^{\nu}|\Phi^A\rangle^{2} - \langle\tilde\sigma^{\nu}|\Phi^A\rangle^{2} = m^{2} - \left(m^{2} - 4\frac{m}{L} + \frac{4}{L^{2}}\right) = 4\frac{m}{L} + O(L^{-2}).   (S12)

Similarly, we can derive the gain in the squared overlap p\epsilon^{2} between the pattern |\sigma^{\nu}\rangle and the new dominant direction |\Phi^B\rangle after a spin-flip. Except for the direction |\Phi^A\rangle, the expected squared overlap between the original pattern (prior to the spin-flip) and any of the non-trivial eigenvectors, including |\Phi^B\rangle, is \langle\sigma^{\nu}|\Phi^B\rangle^{2} = \frac{1-m^{2}}{N-1}. The flip in the n-th spin of the original pattern increases the overlap of the rotated pattern with the new dominant direction |\Phi^B\rangle by \frac{2\Phi^B_n}{\sqrt{L}}, where \Phi^B_n should be of the order of 1/\sqrt{L}. Therefore, the largest gain in overlap due to a spin-flip is given by

p\epsilon^{2} = \langle\tilde\sigma^{\nu}|\Phi^B\rangle^{2} - \langle\sigma^{\nu}|\Phi^B\rangle^{2} \simeq \left(\frac{1-m^{2}}{N-1} + 4\sqrt{\frac{1-m^{2}}{N-1}}\,\frac{\Phi^B_n}{\sqrt{L}} + 4\frac{(\Phi^B_n)^{2}}{L}\right) - \frac{1-m^{2}}{N-1} = 4\sqrt{\frac{1-m^{2}}{N-1}}\,\frac{\Phi^B_n}{\sqrt{L}} + O(L^{-2}).   (S13)

By using the results from eqs. S12 and S13, we can express the condition for beneficial spin-flips to drive a pattern over the carved mountain passes during equilibration (eq. S11),

\epsilon^{2}\Gamma_A < \epsilon^{2}p\,\Gamma_B \;\longrightarrow\; \frac{\Gamma_A}{\Gamma_B} < \sqrt{\frac{1-m^{2}}{m^{2}}}\,\frac{1}{\sqrt{\alpha}}\,\Phi^B_n,   (S14)

where \alpha = N/L. This result suggests that the stability of a pattern depends on how the ratio of the eigenvalues associated with the dominant projections of the pattern before and after the spin-flip, \Gamma_A/\Gamma_B, compares to the overlap m of the original pattern with the dominant eigenvector \Phi^A and the change due to the spin-flip, \Phi^B_n.

So far, we have constrained our analysis to patterns that have a dominant contribution to only one eigenvector \Phi^A. To extend our analysis to patterns which are instead constrained to a small sub-space \mathcal{A} spanned by non-trivial eigenvectors, we define the squared pattern overlap with the subspace m^{2}_{\mathcal{A}} = \sum_{a\in\mathcal{A}}\langle\sigma^{\nu}|\Phi^a\rangle^{2} and a weighted average eigenvalue in the subspace \Gamma_{\mathcal{A}} = \sum_{a\in\mathcal{A}}\langle\sigma^{\nu}|\Phi^a\rangle^{2}\,\Gamma_a. As a result, the difference in the energy of a pattern before and after a spin-flip (eq. S11) can be extended to E(J,\tilde\sigma^{\nu}) - E(J,\sigma^{\nu}) = \frac{\epsilon^{2}}{2}\left(\Gamma_{\mathcal{A}} - p\,\Gamma_B\right) + O(L^{-1}). Similarly, the stability condition in eq. S14 can be extended to \Gamma_{\mathcal{A}}/\Gamma_B < \sqrt{\frac{1-m^{2}_{\mathcal{A}}}{m^{2}_{\mathcal{A}}}}\,\frac{1}{\sqrt{\alpha}}\,\Phi^B_n.
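In the spirit of Fig. S6, the criterion of eq. S14 can be evaluated pattern by pattern by comparing the loss ε²Γ_A along the dominant direction (eq. S12) with the maximal gain p ε²Γ_B over all spin-flips n and secondary directions B (eq. S13). The sketch below does this for a placeholder Hebbian-like network; for such a well-separated static example all patterns should come out stable, and the comparison only becomes discriminative once the eigenvalue spectrum broadens at large learning rates.

```python
import numpy as np

rng = np.random.default_rng(3)
L, N = 400, 8
patterns = rng.choice([-1.0, 1.0], size=(N, L))
J = sum(np.outer(p, p) for p in patterns) / N
np.fill_diagonal(J, 0.0)

eigvals, eigvecs = np.linalg.eigh(J)
nontrivial = np.argsort(eigvals)[-N:]             # the N largest (non-trivial) eigenvalues

for idx, p in enumerate(patterns):
    sigma = p / np.sqrt(L)                        # unit-normalized pattern
    overlaps = eigvecs[:, nontrivial].T @ sigma   # <Phi^i|sigma> for the non-trivial directions
    A = nontrivial[int(np.argmax(overlaps ** 2))]
    m2 = (eigvecs[:, A] @ sigma) ** 2

    loss = 4 * np.sqrt(m2) / L * eigvals[A]       # eps^2 * Gamma_A, with eps^2 ~ 4m/L (eq. S12)

    gain = 0.0                                    # max_{n,B} p*eps^2 * Gamma_B (eq. S13)
    for B in nontrivial:
        if B == A:
            continue
        phi_max = np.max(np.abs(eigvecs[:, B]))   # largest entry |Phi^B_n| over spin positions n
        gain = max(gain, 4 * np.sqrt((1 - m2) / (N - 1)) * phi_max / np.sqrt(L) * eigvals[B])

    print(f"pattern {idx}: loss {loss:.3f}, max gain {gain:.3f} ->",
          "unstable" if loss < gain else "stable")
```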
Patterns that are constrained to larger subspaces are more stable, since the weighted average eigenvalue of their containing subspace \Gamma_{\mathcal{A}} is closer to the mean of the non-trivial eigenvalues \bar\Gamma = (L-N)/N (law of large numbers). Therefore, in these cases a much larger eigenvalue gap (or a broader eigenvalue spectrum) is necessary to satisfy the condition for pattern instability.

Fig. S6 compares the loss in energy along the original dominant direction, \epsilon^{2}\Gamma_A, to the maximal gain in any of the other directions, \epsilon^{2}p\Gamma_B, to test the pattern stability criterion presented in eq. S14. To do so, we identify a spin-flip n in a secondary direction B that confers the maximal energy gain: \epsilon^{2}p\Gamma_B \approx \max_{n,B} 4\sqrt{\frac{1-m^{2}}{N-1}}\,\frac{\Phi^B_n}{\sqrt{L}}\,\Gamma_B. In Fig. S6A,C we specifically focus on the subset of patterns that show a large (squared) overlap with one dominant direction (i.e., m > 0.85). Given that evolving patterns are not constrained to the \{\Phi\} (non-trivial) sub-space, we find a smaller fraction of these patterns to fulfill the condition for such a large overlap m (Fig. S6A), compared to the static patterns (Fig. S6C). Nonetheless, we see that the criterion in eq. S14 can be used to predict the stability of patterns in a network for both static and evolving patterns; note that here we use the same learning rate for both the static and evolving patterns.

We then relax the overlap condition by including all patterns that have a large overlap with a subspace \mathcal{A} spanned by up to 10 eigenvectors (i.e., m^{2}_{\mathcal{A}} = \sum_{a\in\mathcal{A}}\langle\sigma|\Phi^a\rangle^{2} > 0.85). For these larger subspaces, the transition between stable and unstable patterns is no longer exactly given by eq. S14. However, the two contributions \epsilon^{2}\Gamma_{\mathcal{A}} and \epsilon^{2}p\Gamma_B still clearly separate the patterns into stable and unstable classes for both static and evolving patterns (Figs. S6B,D). The softening of this condition is expected, as in this regime we can no longer assume that a single spin-flip can reduce the overlap with all the eigenvectors in the original subspace. As a result, the effective loss in overlap becomes smaller than \epsilon^{2}, and patterns become unstable below the dotted line in Figs. S6B,D.

As the learning rate increases, the gap between the eigenvalues \Gamma_B/\Gamma_A (Fig. S5) becomes larger. At the same time, patterns become more constrained to smaller subspaces (Fig. S4C,D). As a result of these two effects, more patterns satisfy the instability criterion in eq. S14. These patterns are misclassified as they fall into a wrong energy minimum by equilibrating through the mountain passes carved in the energy landscape of networks with large learning rates.

[1] Labrie SJ, Samson JE, Moineau S (2010) Bacteriophage resistance mechanisms. Nature Rev Microbiol 8: 317–327.
[2] Barrangou R, Marraffini LA (2014) CRISPR-Cas systems: Prokaryotes upgrade to adaptive immunity. Molecular Cell 54: 234–244.
[3] Bradde S, Nourmohammad A, Goyal S, Balasubramanian V (2020) The size of the immune repertoire of bacteria. Proc Natl Acad Sci USA 117: 5144–5151.
[4] Perelson AS, Weisbuch G (1997) Immunology for physicists. Rev Mod Phys 69: 1219–1267.
[5] Janeway C, Travers P, Walport M, Schlomchik M (2001) Immunobiology: The Immune System in Health and Disease. New York: Garland Science, 5th edition.
[6] Altan-Bonnet G, Mora T, Walczak AM (2020) Quantitative immunology for physicists. Physics Reports 849: 1–83.
[7] Haberly LB, Bower JM (1989) Olfactory cortex: Model circuit for study of associative memory? Trends Neurosci 12: 258–264.
[8] Brennan P, Kaba H, Keverne EB (1990) Olfactory recognition: A simple memory system. Science 250: 1223–1226.
[9] Granger R, Lynch G (1991) Higher olfactory processes: Perceptual learning and memory. Curr Opin Neurobiol 1: 209–214.
[10] Haberly LB (2001) Parallel-distributed processing in olfactory cortex: New insights from morphological and physiological analysis of neuronal circuitry. Chemical Senses 26: 551–576.
[11] Wilson DA, Best AR, Sullivan RM (2004) Plasticity in the olfactory system: Lessons for the neurobiology of memory. Neuroscientist 10: 513–524.
[12] Raguso RA (2008) Wake Up and Smell the Roses: The Ecology and Evolution of Floral Scent. Annu Rev Ecol Evol Syst 39: 549–569.
[13] Dunkel A, et al. (2014) Nature's chemical signatures in human olfaction: A foodborne perspective for future biotechnology. Angew Chem Int Ed 53: 7124–7143.
[14] Beyaert I, Hilker M (2014) Plant odour plumes as mediators of plant-insect interactions. Biol Rev 89: 68–81.
[15] Glusman G, Yanai I, Rubin I, Lancet D (2001) The Complete Human Olfactory Subgenome. Genome Research 11: 685–702.
[16] Bargmann CI (2006) Comparative chemosensation from receptors to ecology. Nature 444: 295–301.
[17] Touhara K, Vosshall LB (2009) Sensing Odorants and Pheromones with Chemosensory Receptors. Annu Rev Physiol 71: 307–332.
[18] Su CY, Menuz K, Carlson JR (2009) Olfactory Perception: Receptors, Cells, and Circuits. Cell 139: 45–59.
[19] Verbeurgt C, et al. (2014) Profiling of Olfactory Receptor Gene Expression in Whole Human Olfactory Mucosa. PLoS ONE 9: e96333.
[20] Shepherd GM, Greer CA (1998) Olfactory bulb. In: The Synaptic Organization of the Brain, 4th ed., New York, NY, US: Oxford University Press. pp. 159–203.
[21] Bushdid C, Magnasco MO, Vosshall LB, Keller A (2014) Humans Can Discriminate More than 1 Trillion Olfactory Stimuli. Science 343: 1370–1372.
[22] Gerkin RC, Castro JB (2015) The number of olfactory stimuli that humans can discriminate is still unknown. eLife 4: e08127.
[23] Mayhew EJ, et al. (2020) Drawing the Borders of Olfactory Space. bioRxiv preprint. doi:10.1101/2020.12.04.412254. URL http://biorxiv.org/lookup/doi/10.1101/2020.12.04.412254.
[24] Lansner A (2009) Associative memory models: From the cell-assembly theory to biophysically detailed cortex simulations. Trends Neurosci 32: 178–186.
[25] Hebb DO (1949) The Organization of Behavior: A Neuropsychological Theory. New York: Wiley.
[26] Hopfield JJ (1982) Neural networks and physical systems with emergent collective computational abilities. Proc Natl Acad Sci USA 79: 2554–2558.
[27] Mayer A, Balasubramanian V, Mora T, Walczak AM (2015) How a well-adapted immune system is organized. Proc Natl Acad Sci USA 112: 5950–5955.
[28] Shinnakasu R, et al. (2016) Regulated selection of germinal-center cells into the memory B cell compartment. Nat Immunol 17: 861–869.
[29] Shinnakasu R, Kurosaki T (2017) Regulation of memory B and plasma cell differentiation. Curr Opin Immunol 45: 126–131.
[30] Mayer A, Balasubramanian V, Walczak AM, Mora T (2019) How a well-adapting immune system remembers. Proc Natl Acad Sci USA 116: 8815–8823.
[31] Schnaack OH, Nourmohammad A (2021) Optimal evolutionary decision-making to store immune memory. eLife 10: e61346.
[32] Viant C, et al. (2020) Antibody affinity shapes the choice between memory and germinal center B cell fates. Cell 183: 1298–1311.e11.
[33] Mezard M, Nadal JP, Toulouse G (1986) Solvable models of working memories. J Physique 47: 1457–1462.
[34] Amit DJ, Gutfreund H, Sompolinsky H (1985) Storing infinite numbers of patterns in a spin-glass model of neural networks. Phys Rev Lett 55: 1530–1533.
[35] McEliece R, Posner E, Rodemich E, Venkatesh S (1987) The capacity of the Hopfield associative memory. IEEE Transactions on Information Theory 33: 461–482.
[36] Bouchaud JP, Mezard M (1997) Universality classes for extreme-value statistics. J Phys A Math Gen 30: 7997–8016.
[37] Derrida B (1997) From random walks to spin glasses. Physica D: Nonlinear Phenomena 107: 186–198.
[38] Goodfellow I, Bengio Y, Courville A (2016) Deep Learning. MIT Press. http://www.deeplearningbook.org.
[39] Mehta P, et al. (2019) A high-bias, low-variance introduction to Machine Learning for physicists. Physics Reports 810: 1–124.
[40] Parga N, Virasoro M (1986) The ultrametric organization of memories in a neural network. J Phys France 47: 1857–1864.
[41] Virasoro MA (1986) Ultrametricity, Hopfield Model and all that. In: Disordered Systems and Biological Organization, Springer, Berlin, Heidelberg. pp. 197–204.
[42] Gutfreund H (1988) Neural networks with hierarchically correlated patterns. Phys Rev A 37: 570–577.
[43] Tsodyks MV (1990) Hierarchical associative memory in neural networks with low activity level. Mod Phys Lett B 04: 259–265.
[44] Amit DJ, Gutfreund H, Sompolinsky H (1985) Spin-glass models of neural networks. Physical Review A 32: 1007–1018.
[45] Fontanari JF (1990) Generalization in a Hopfield network. J Phys France 51: 2421–2430.

Supplementary Figures

FIG. S1. Reduced performance of Hopfield networks due to memory delay. (A) The optimal performance Q∗ for patterns that are correctly associated with their memory attractors (i.e., they have an overlap q(σ) = ⟨σ_att|σ⟩ > 0.8) is shown as a function of the effective mutation rate µ_eff. The solid lines show the simulation results for networks encountering a different number of patterns N (colors). The gray dashed line shows the naïve expectation for the performance (Q_0 = 1 − 2µ_eff), and the colored dashed lines show the expected performance after accounting for the memory lag, Q_lag = 1 − 2g_lag µ_eff. (B) The lag time g_lag for memory is shown in units of generations [N] as a function of the effective mutation rate for networks encountering a different number of patterns (colors similar to (A)). The networks are trained with a learning rate λ∗(µ) optimized for the mutation rate specified on the x-axis. Other simulation parameters: L = 800.

FIG. S2. Statistics of static and evolving patterns for networks with different learning rates. We compare the statistics of evolving (green) and static (orange) patterns in networks trained with a learning rate λ∗(µ) optimized for the mutation rate specified on each panel's x-axis; see Fig. 2B for the dependency of the optimal learning rate on the mutation rate. The reported statistics are (A) the fraction P_wrong of misclassified patterns (i.e., patterns with a small overlap q(σ) = ⟨σ_att|σ⟩ < 0.8), (B) the mean energy of the patterns, and (C) the standard error of the energy of the patterns in the network. Simulation parameters: L = 800 and N = 32.
FIG. S3. Attractors and equilibration paths in networks. The overlap of patterns with the networks' attractors is shown both for the patterns σ^α associated with one of the classes that were previously presented to the network during training (blue) and for the random patterns χ that are on average orthogonal to the previously presented classes (red). (A) The overlap between a presented pattern σ^α and the memory attractor associated with the same pattern class, σ_att(σ^α), is shown against the overlap of the pattern with the next best memory attractor associated with any of the other presented pattern classes, max_{ν≠α}|⟨σ_att|σ^ν⟩|. Fractions of the previously presented patterns and the random patterns that fall into different sectors of the plot are indicated in blue and red, respectively. Sector I corresponds to patterns that fall into the correct energy attractors (i.e., ⟨σ_att^α|σ^α⟩ ≈ 1). In the limit of large self-overlap, the maximal overlap to any other pattern family is close to zero, and thus no patterns are found in sector III. Patterns with a small self-overlap could fall into three different sectors: sector II corresponds to misclassified patterns that fall into a valley associated with a different class (max_{ν≠α}|⟨σ_att^α|σ^ν⟩| ≈ 1). Patterns in sectors IV and V fall into local valleys between the minima of two pattern families. These mixture states are well known in the classical Hopfield model [44, 45]. Sector VI indicates patterns that fall into an attractor in the network that does not correspond to any of the previously presented classes. The fact that neither the previously presented patterns nor the random patterns fall into this sector suggests that the network indeed only stores memory of the presented patterns and is not in the glassy regime. (B) The number of beneficial spin-flips for presented patterns at the beginning of equilibration (i.e., the number of open equilibration paths) is shown against the patterns' self-overlap (x-axis in (A)). For stable patterns (sector I), the number of open paths is anticorrelated with the overlap between the attractor and the presented pattern. For unstable patterns (sector II), the number of open paths is on average larger than that of the stable patterns. However, there are fewer paths available to the previously presented patterns compared to the random patterns. In (A,B) patterns evolve with rate µ_eff = 0.01 and the network's learning rate is optimized accordingly. The sharp transition between sector occupations indicates that our results are insensitive to the classification threshold for self-overlap (currently set to q^α > 0.8), i.e., any threshold value between sectors I and II would result in the same classification of patterns. (C,D) Similar to (A,B) but for static patterns in a network with a similar learning rate to (A,B). Simulation parameters: L = 800 and N = 32.
FIG. S4. Open equilibration paths and participation ratio. (A) The mean number of open paths (i.e., the beneficial spin-flips at the beginning of equilibration) is shown for stable, unstable, and random patterns (colors) as a function of the effective mutation rate µ_eff in networks trained with the optimal learning rate λ∗(µ). (B) The participation ratio π(σ^j) = (Σ_i m²_{i,j})² / Σ_i m⁴_{i,j}, with m_{i,j} = ⟨Φ^i|σ^j⟩, is shown for the pattern σ^1 with the lowest energy (orange) and the pattern σ^N with the highest energy (purple). The mean participation ratio averaged over all patterns is shown in green. (C,D) Similar to (A,B) but for static patterns (µ = 0). The learning rate of the network in this case is tuned to be optimal for the mutation rate specified on the x-axis. Simulation parameters: L = 800 and N = 32.

FIG. S5. Eigenvalues of networks with memory against dynamic and static patterns. (A) The first (Γ_1), the 10th (Γ_10), the 20th (Γ_20), and the last (Γ_{N=32}) non-trivial eigenvalues of a network of size L = 800 presented with N = 32 patterns are shown as a function of the patterns' effective mutation rate (different shades of blue). In each case, the network is trained with the optimal learning rate λ∗(µ). The trivial eigenvalues are shown in different shades of red, with their rank (γ_1, γ_20, γ_{L−N}) indicated in the legend. For small µ_eff all trivial eigenvalues match the prediction γ_k = −1, which implies that the network updates fast enough to keep the patterns within the N-dimensional sub-space. For larger mutation rates, some of the trivial eigenvalues deviate from −1, indicating that evolving patterns start spanning a larger sub-space. Moreover, as the mutation rate (or learning rate) increases, the gap between the non-trivial eigenvalues broadens. (B) Similar to (A) but for static patterns in networks trained with a learning rate λ∗(µ) optimized for the mutation rate specified on the panel's x-axis. In contrast to (A), all trivial eigenvalues remain equal to −1 independent of the learning rate, implying that the static patterns remain within the non-trivial N-dimensional sub-space. Similar to (A), the gap between the non-trivial eigenvalues broadens with increasing learning rate.
FIG. S6. Stability condition for patterns during equilibration. The stability condition in eq. S14 (dotted line) is used to classify stable (blue) and unstable (red) patterns for (A) the patterns that have a squared overlap with one dominant eigenvector m² = ⟨Φ^A|σ^ν⟩² > 0.85, and (B) the patterns that are constrained to a small sub-space 𝒜 spanned by up to 10 non-trivial eigenvectors; in this case, m²_𝒜 = Σ_{a∈𝒜}⟨Φ^a|σ^ν⟩² > 0.85. The shading indicates the number of eigenvectors needed to represent a pattern, from dark (one) to light (ten). (C,D) Similar to (A,B) but for static patterns in networks trained with the same learning rate as in (A,B). In general, more static patterns reach the threshold of m > 0.85, as these patterns remain constrained to the N-dimensional subspace spanned by the non-trivial eigenvectors {Φ^i}. Simulation parameters: N = 32, L = 800, µ_eff = 0.02, and networks are trained with the optimal learning rate λ∗(µ).

FIG. S7. Optimal performance and learning rate at different Hopfield temperatures. (A) The optimal accuracy of compartmentalized networks (i.e., for β_S ≫ 1) is shown as a function of the scaled inverse Hopfield temperature βHc for different numbers of compartments C (colors) for static patterns (µ_eff = 0); see Fig. 4B, top. (B) The optimal learning rate λ∗_c for each strategy is shown as a function of βHc. In contrast to Fig. 3, the learning rate is not rescaled here and does not collapse for βHc ≫ 1. As the equilibration noise increases (decreasing βHc), networks with distributed memory (C < N) show two-step transitions (A). The first transition occurs at βHc ≃ 1, which results in the reduced accuracy of the networks' memory retrieval. At this transition point, the networks' learning rate λ_c approaches its maximum value 1 (B). Consequently, memory is only stored for C (< N) patterns (i.e., one pattern per sub-network) and the optimal performance Q∗ is reduced to approximately C/N (A). The second transition occurs at βHc ≈ 1/N_c = C/N, below which no pattern can be retrieved and the performance approaches zero (A). (C,D) Similar to (A,B) but for evolving patterns with the effective mutation rate µ_eff = 0.01, similar to Fig. 4D. The number of presented patterns is set to N = 32. Similar to Figs. 3 and 4, we keep L · C = const., with L = 800 used for networks with C = 1.