
Prospective Learning: Back to the Future

arXiv:2201.07372v1 [cs.LG] 19 Jan 2022

The Future Learning Collective

Abstract. Research on both natural intelligence (NI) and artificial intelligence (AI) generally assumes that the future resembles the past: intelligent agents or systems (what we call 'intelligence') observe and act on the world, then use this experience to act on future experiences of the same kind. We call this 'retrospective learning'. For example, an intelligence may see a set of pictures of objects, along with their names, and learn to name them. A retrospective learning intelligence would merely be able to name more pictures of the same objects. We argue that this is not what true intelligence is about. In many real world problems, both NIs and AIs will have to learn for an uncertain future. Both must update their internal models to be useful for future tasks, such as naming fundamentally new objects and using these objects effectively in a new context or to achieve previously unencountered goals. This ability to learn for the future we call 'prospective learning'. We articulate four relevant factors that jointly define prospective learning. Continual learning enables intelligences to remember those aspects of the past which they believe will be most useful in the future. Prospective constraints (including biases and priors) facilitate the intelligence finding general solutions that will be applicable to future problems. Curiosity motivates taking actions that inform future decision making, including in previously unmet situations. Causal estimation enables learning the structure of relations that guide choosing actions for specific outcomes, even when the specific action-outcome contingencies have never been observed before. We argue that a paradigm shift from retrospective to prospective learning will enable the communities that study intelligence to unite and overcome existing bottlenecks to more effectively explain, augment, and engineer intelligences.

"No man ever steps in the same river twice. For it's not the same river and he's not the same man."
Heraclitus

1 Introduction

The goal of learning is ultimately about optimizing future performance. Intelligences, entities with intelligence, be they natural intelligent (NI) agents or artificial intelligent (AI) systems, are challenged when the future is different from the past. For cases where the future is just like the past, what we call retrospective learning, AI has developed exceptionally successful techniques (leveraging advances in statistics and machine learning) to solve the learning problem. The NI field, which includes human and non-human animal cognition (and sometimes is interpreted even more broadly), also has explanations for the process of learning at both algorithmic and implementation levels. Arguably, both fields have very satisfying theories for many ecologically valid behaviors, such as identifying common objects, controlling the movement of limbs through space in relatively simple environments, and understanding spoken words. Ongoing theory development in both AI and NI highlights why retrospective learning is perceived to be so advanced. However, the most evolutionarily important problems, such as those relating to life or death situations, mate choices, or child rearing, have a fundamentally different structure. These tend to be novel experiences with sparse information, low probability, and high consequential value.
Problems with this structure also sink AI systems, such as when a self-driving car's computer vision system is trained in cities and fails to recognize a tractor trailer crossing the road in the country [1], or when medical diagnosis systems are applied to under-represented samples [2]. We argue that many of the most interesting and important problems for intelligences are those that require being good at identifying apparent 'unicorn' experiences, realizing the (potentially complex) ways in which they are similar to past experiences, and then adjusting appropriately. These problems can happen far in the future, and they require extrapolation well outside of the previously encountered distribution of experiences. We call this phenomenon prospective learning, which typifies a large and important class of open research problems in both NI and AI. Here we make the case that the study of intelligence, in both biological and human-made systems, needs to go back to where the goal of learning has always been: we need to bring learning back to the future. If we want to explain and augment animal intelligence, including humans, and to engineer more intelligent machines, it is time to change how we look at the problem of learning by understanding that it is fundamentally a future-oriented process. This has deep implications both for how we understand the origins of behavior and for how we think about the fundamental computational structure of learning in general.

1.1 What is Prospective Learning?

Any intelligence engaged in prospective learning can make effective decisions in novel scenarios, with minimal data and resources, by extrapolating from the past and postulating active solutions via an internal model of the world. This allows it to out-compete other intelligences in its corresponding niche, by leveraging the four capabilities below.

1. Continual learning enables an intelligence to remember those aspects of the past that it believes will be relevant in the future [3].
2. Constraints (including biases and priors) facilitate finding general solutions that will apply to future problems and thus make learning possible. While the general problem is intractable (or even uncomputable) [4], built-in priors, inductive biases, and other constraints regularize the search space [4, 5].
3. Curiosity motivates taking actions that inform future decisions, including in previously unmet situations. Because intelligences do not know a priori which action-outcome contingencies are currently available, or how previous contingencies have changed, they should explore the world to gather additional information [6].
4. Causal estimation enables learning the structure of relations that guide choosing actions for specific outcomes, even when the specific action-outcome contingencies have never been observed before. Intelligences in the wild are not merely perception machines; rather they take actions to maximize the probability of achieving the most desirable outcomes, and to learn what effects arise from what causes [7].

What makes these capabilities so critical for prospective learning? First, the above list of four desiderata yields intelligences that are capable of generalizing and adapting to future environments that differ from the past in ecologically predictable ways.
Although cognitive scientists have not yet explicitly conducted careful experiments to quantify prospective learning capabilities holistically, there is ample evidence in the animal kingdom for each of these capabilities, across species, and even across phyla. We provide such evidence in Section 2. Second, each of the above capabilities, on its own, has already been recognized as an important open problem in the AI community, which has begun formalizing the problem and designing solutions to address it. This enables designing and conducting careful psychophysical experiments to assess and quantify them. In Section 3 we provide a brief background formalizing retrospective learning, to provide a contrast with, and show its limitations relative to, a sketch of what formalizing prospective learning would entail in Section 4. Finally, we conclude in Section 5 by proposing how modifying our thinking about learning to be for the future will transform natural and artificial intelligences, as well as society more generally.

1.2 A sketch of our framework

Before proceeding any further, we must first provide a simple sketch to contextualize our thinking about the problem. This is illustrated in Figure 1.

[Figure 1: Our conceptual understanding of prospective learning. W: world state. X: input (sensory data). Y: output (response/action). h: hypothesis. g: learner. Subscript n: now. Subscript f: future. NOW: the task and context that matters right now. FUTURE: the task and context that will matter at some point of time in the future. Note that g, the learning algorithm, is fixed and unchanging, while continually updating the hypothesis. Black arrows denote inputs and outputs. White arrows indicate conceptual linkages.]

We envision an external world (W, top) and an internal model of the world, also called a hypothesis (h, bottom). Both of these are evolving in time and dependent on the past and each other in various ways. Critically, this partial dependency between the past and a distinct future is the primary difference between prospective learning and traditional learning approaches (see Section 3 for further details). At any given time (now, indicated by subscript n), the intelligence receives some input from the world (X_n). The intelligence contains a continual learning algorithm (g). The goal of that algorithm is to leverage the new data to update the intelligence's current hypothesis (h_n) without catastrophically forgetting the past, and ideally even improving upon previously acquired skills or capabilities. The hypothesis that is created is selected from a constrained set of hypotheses that are inductively biased towards those that also generalize well to problems the intelligence is likely to confront in the future. Curiosity motivates gathering more information that could be useful now, or in the future. Based on all available information, the intelligence makes a decision about how to respond or act (Y). Those actions causally impact the world. This process of acquiring input, learning from the past, updating hypotheses, and acting to gain rewards or information, is relevant and repeats itself in the far future (indicated by subscript f).
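To make the schematic concrete, the following is a minimal toy sketch of the loop in Figure 1, written by us purely for illustration; the world dynamics, the single-weight hypothesis, and the update rule are all hypothetical stand-ins, not an implementation from the paper.

```python
import random

def transition(state, action):
    # Toy world dynamics (hypothetical): the action causally nudges W's state.
    return 0.9 * state + action + random.gauss(0, 0.1)

def learner_g(h, x, outcome):
    # g: a fixed learning rule that updates the hypothesis h (here a single weight).
    # A prospective g would also weigh usefulness for future tasks, not just fit to now.
    pred = h * x
    return h + 0.1 * (outcome - pred) * x

h, world = 0.0, 1.0                  # h_n: current hypothesis; W: world state
for t in range(100):                 # "now" ... eventually "future"
    x = world                        # X_n: observation from the world
    y = h * x                        # Y_n: response chosen by the current hypothesis
    world = transition(world, y)     # the action causally impacts W
    outcome = world                  # feedback observed afterwards
    h = learner_g(h, x, outcome)     # g updates h_n -> h_f; g itself never changes
```

In this sketch only the retrospective part of the loop is implemented; curiosity (choosing y to gain information), constraints (restricting what h can be), and causal estimation would enter by changing how y is chosen and how g updates h.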
The central premise of our work here is that by understanding how NIs achieve (or fail at) each of these four capabilities, and describing them in terms of an AI formalism, we can overcome existing bottlenecks in explaining, augmenting, and engineering both natural and artificial intelligences.

2 Evidence for prospective learning in natural intelligences

NIs (which we limit here to any organism with a brain or brain-like structure) always learn for the future, because the future is where survival, competition, feeding, and reproduction happen. That is not to say that NIs are perfect at it, or even particularly efficient. Rather, we argue that prospective abilities are successful just enough to bolster evolutionary fitness so as to be reinforced over time. In many ways, brains appear to be explicitly built (i.e., evolved) to make predictions about the future [8]. NIs learn abstract, transferable knowledge that can be flexibly applied to new problems, rather than greedily optimizing for the current situation (for review see Gershman et al. [9], Raby and Clayton [10]). In the field of ecology, this process is part of what is called 'future planning' [11] or 'prospective cognition' [10], both of which describe the ability of animals to engage in 'mental time travel' [12] by projecting themselves into the future and imagining possible events, anticipating unseen challenges. Given our focus on prospective learning, here we will focus on the learning aspects of future planning and prospective cognition. While prospective abilities have classically been thought of as a uniquely human trait [13], we now know that many other NIs can do this. Bonobos and orangutans socially learn to construct tools for termite fishing, not for immediate use, but instead to carry with them in anticipation of future use [14]. This tool construction and use also extends beyond primates. Corvids collect materials to build tools that solve novel tasks [15]. This building of novel tools can be seen as a form of prospective learning. Here the experience of learning novel objects (e.g., pliability of specific twigs, inspecting glass bottles) transfers to novel applications in the future (e.g., curving a twig to use as a hook to fish out food from a bottle). It requires that the animal seek out the information (curiosity) to learn the physics of various objects (causality), both biased and facilitated by internal heuristics that limit the space of hypotheses for how to configure the tool (constraints), and extend this knowledge to produce new behavioral patterns (continual learning) within the constraints of the inferred structure of the world. Another manifestation of future learning is food caching, seen in both mammals (e.g., squirrels and rats) as well as birds. Western scrub-jays not only have a complex spatial memory of food cache locations, but can flexibly adapt their caching to anticipate future needs [16]. Experiments on these scrub-jays have shown that they will stop caching food in locations where it gets degraded by weather or is stolen by a competitor [17, 18]. Indeed, consistent with the idea that these birds are learning, scrub-jays that are caught stealing food from another's cache (i.e., observed by another jay) will re-store the stolen food in private, as if aware that the observing animal will take back the food [19].
This behavior can be considered prospective within a spatial framework, wherein prior experience facilitates learning a unique 'cognitive map' that facilitates innovative zero-shot transfer to creatively solve tasks (e.g., strategic storage and retrieval of food) in a novel future context (i.e., next season) [20]. This spatial learning likely evolved to help animals solve ethologically critical tasks such as navigation and foraging, and to allow animals to use vicarious trial-and-error to imaginatively simulate multiple scenarios [21], generalizing previously learned information to a novel context. But such mechanisms are not limited to navigational problems: many animals have evolved the ability to transform non-spatial variables into a spatial framework, enabling them to solve a broad set of problems using the same computational principles and neural dynamics underlying spatial mapping [22, 23]. Finally, as an animal explores its environment, it can quickly incorporate novel information using the flexibility of the cognitive map, while leaving existing memories largely intact (i.e., without disrupting the weights of the existing network). Learning for the future is seen across phyla as well. Bees (arthropods) can extrapolate environmental cues to not only locate food sources, but communicate this location to hivemates via a 'waggle dance' that conveys where future targets lie with a high degree of accuracy (for review see Menzel [24]). Importantly, bees can also learn novel color-food associations, identifying the color of new high value nectar sources via associative learning in novel foraging environments [25, 26]. This ability also extends to learning novel flower geometries that may indicate high nectar food sources, which are then communicated back to the hive for future visits by other bees [27]. This remarkably sophisticated form of learning for the future happens in an animal with less than a million neurons, and fewer than 10 billion synapses. In contrast, modern deep learning systems, such as GPT-3, have over 100 billion synapses [28] and yet fail at similar forms of sophisticated associative learning. Even in the mollusca phylum, octopuses have been found to perform observational learning, with single-shot accuracy at selecting novel objects by simply watching another octopus perform the task [29]. This rapid form of (continual) learning allows the animal to effectively use an object it has never seen before in new situations (constraints and causality), simply by choosing to play with it (curiosity). Thus, it is learning for the future. Prospective learning has, thus, a very long evolutionary history. Given that arthropods, mollusks, and chordates diverged in evolution 500 million years ago, the observation of prospective learning abilities across these phyla suggests one of two possibilities: 1) prospective learning is an evolutionarily old capacity with a shared neural implementation that exists in very simple nervous systems (and scales with evolution), or 2) prospective learning has independently evolved multiple times with different implementation-level mechanisms (a.k.a., multiple realizability [30, 31]). These two possibilities have different implications for the types of experiments that would inform our understanding of prospective learning in NIs and how we can implement it in artificial systems (see § 5 for further details).

3 The traditional approach to (retrospective) learning
The standard machine learning (ML) formalism dates back to the 1920s, when Fisher wrote one of the first statistics textbooks. In it, he states that "statistics may be regarded as . . . the study of methods of the reduction of data." In other words, he established statistics to describe the past, not predict the future. Shortly thereafter, Glivenko and Cantelli established the fundamental theorem of pattern recognition: given enough data from some distribution, one can eventually estimate any parameter of that distribution [32, 33]. Vapnik and Chervonenkis [34] and then Valiant [35] rediscovered and further elaborated upon these ideas, leading to nearly all of the advancements of modern ML and AI. Here we will highlight the structure of this standard framework for understanding learning as used in AI. As above (Section 1.2), let X be the input to our intelligence (e.g., sensory data) and Y be its output (e.g., an action). We assume those data are sampled from some distribution P_{X,Y} that encapsulates some true but unknown properties of the world. For brevity, we allow P_{X,Y} to also incorporate the causal graph, rather than merely the probabilistic distribution. Let n denote the nth experience, the one that is happening right now. In the classical form of the problem, a learning algorithm g (which we hereafter refer to as a 'learner') takes in the current data sample S := \{(X_i, Y_i)\}_{i=1}^{n} and outputs a hypothesis h_n ∈ H, where the hypothesis h_n : X → Y chooses a response based on the input. The nth sample corresponds to 'now', as described in Section 1.2. The learner chooses a hypothesis, often by optimizing a loss function ℓ that compares the predicted output from any hypothesis, h(X), with the (sometimes unobserved) ground truth output, Y: ℓ(h(X), Y). The goal of the learner is to minimize risk, which is often defined as the expected loss, by integrating over all possible test datasets:

\[
R(h) = \underbrace{\int_{(\mathcal{X},\mathcal{Y})}}_{\text{possible test}} \underbrace{\ell(h(X), Y)}_{\text{loss function}} \; \underbrace{dP_{X,Y}}_{\text{distribution}} .
\]

Note that when we are learning, h is dependent on the past observed (training) dataset. However, when we are designing new learners (e.g., as evolution does), we do not have a particular training dataset available. Therefore we seek to develop learners that work well on whatever training dataset we have. To do so, in retrospective learning, we assume that all data are sampled from the exact same distribution (in supervised learning, this means the train and test sets), and then we can additionally integrate over all possible training datasets to determine the expected risk, E, of what is learned:

\[
\mathcal{E}_{\mathrm{classical}}(h, n, P) = \underbrace{\int_{(\mathcal{X},\mathcal{Y})^n}}_{\text{possible train}} \underbrace{R(h)}_{\text{risk}} \; \underbrace{dP_{(X,Y)^n}}_{\text{distribution}} = \underbrace{\int_{(\mathcal{X},\mathcal{Y})^n}}_{\text{possible train}} \underbrace{\int_{(\mathcal{X},\mathcal{Y})}}_{\text{possible test}} \underbrace{\ell(h(X), Y)}_{\text{loss function}} \; \underbrace{dP_{X,Y}}_{\text{distribution}} \; \underbrace{dP_{(X,Y)^n}}_{\text{distribution}} .
\]

Although both training and test datasets are assumed to be drawn from the same distribution, the two integrals are integrating over two different sets of random variables: the inner integral is integrating over all possible test datasets, and the outer integral is integrating over all possible train datasets. Assuming that the two distributions are identical has enabled retrospective learning to prove a rich set of theorems characterizing the limits of learning.
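As a concrete toy illustration of this retrospective setup, the sketch below trains a hypothesis by empirical risk minimization and then estimates its risk on test data drawn from the same distribution. Everything here (the linear world, the squared loss, the sampling routine) is our own minimal example under stated assumptions, not code from the paper; the final two lines preview what happens when the assumption of a fixed distribution is violated.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n, slope=2.0):
    """Draw n i.i.d. pairs (X, Y) from a fixed world distribution P_{X,Y} (toy)."""
    x = rng.uniform(-1, 1, n)
    y = slope * x + rng.normal(0, 0.1, n)
    return x, y

def learner_g(x, y):
    """g: a least-squares learner returning a linear hypothesis h_n."""
    w = np.sum(x * y) / np.sum(x * x)
    return lambda x_new: w * x_new

def risk(h, x, y):
    """Empirical estimate of R(h) = E[(h(X) - Y)^2] under squared loss."""
    return np.mean((h(x) - y) ** 2)

x_train, y_train = sample(1_000)              # "possible train" data
h = learner_g(x_train, y_train)               # choose h_n from the training sample

x_test, y_test = sample(100_000)              # test data from the SAME distribution
print("retrospective risk:", risk(h, x_test, y_test))    # small: future == past

x_shift, y_shift = sample(100_000, slope=-2.0)            # the world has changed
print("risk after shift:", risk(h, x_shift, y_shift))     # large: future != past
```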
Many learners have been developed based on this assumption, and such algorithms have recently enjoyed a cornucopia of successes, spanning computer vision [36], natural language processing [37], diagnostics [38], protein folding [39], autonomous control [40], and reinforcement learning [41]. The successes of the field, so far, rely on problems amenable to the classical statistical definition of learning in which the data are all sampled under a fixed distributional assumption. This, to be fair, encompasses a wide variety of applied tasks. However, in many real-world data problems, the assumption that the training and test data distributions are the same is grossly inadequate [42]. Recently a number of papers have proposed developing a theory of 'out of distribution' (OOD) learning [43–45], which includes as special cases transfer learning [46], multitask learning [47–49], metalearning [50], and continual [51] and lifelong learning [52]. The key to OOD learning is that now we assume that the test set is drawn from a distribution that differs in some way from the training set distribution. This assumption is an explicit generalization of the classical retrospective learning problem [45]. In OOD problems, the train and test sets can come from different sample spaces, come from different distributions, and be optimized with respect to different loss functions. Thus, rather than designing learners with small expected risk as defined above, in OOD learning we seek learners with small E_OOD (the differences from classical 'in-distribution' retrospective learning are the distinct train and test distributions and the test-specific loss):

\[
\mathcal{E}_{\mathrm{OOD}}\left(h, n, P^{\mathrm{test}}, P^{\mathrm{train}}\right) = \underbrace{\int_{(\mathcal{X},\mathcal{Y})^{\mathrm{train}}}}_{\text{possible train}} \underbrace{\int_{(\mathcal{X},\mathcal{Y})^{\mathrm{test}}}}_{\text{possible test}} \underbrace{\ell^{\mathrm{test}}(h(X), Y)}_{\text{loss function}} \; \underbrace{dP^{\mathrm{test}}_{X,Y}}_{\text{distribution}} \; \underbrace{dP^{\mathrm{train}}_{X,Y}}_{\text{distribution}} .
\]

Note that this expression for the risk permits the case of multiple tasks: both g and h are able to operate on different spaces of inputs and outputs, the inputs to both could include task identifiers or other side information, and the loss would measure performance for different tasks. All of prospective learning builds upon, and generalizes, finding hypotheses that minimize E_OOD. There are various ways in which an intelligence that can solve this OOD problem can still be only a retrospective learner. First, consider continual learning. A learner designed to minimize E_OOD has no incentive to remember anything about the past. In fact, there is no coherent notion of past, because there is no time. However, even if there were time (for example, by assuming the training data is in the past, and the testing data is in the future), there is no mechanism by which anything about the past is retained. Rather, it could all be overwritten. Second, consider constraints. Often in retrospective ML, constraints are imposed on learning algorithms to help find a good solution for the problem at hand with limited resources. These constraints, therefore, do not consider the possibility of other future problems that might be related to the existing problems in certain structured ways, which a prospective learner would. Third, curiosity is not invoked at all. Even if we generalized the above to consider time, curiosity would still only be about gaining information for the current situation, not realizing that there will be future situations that are similar to this one, but distinct along certain predictable dimensions (for example, entropy increases, children grow, etc.).
Fourth, there is no notion of causality in the above equation; the optimization problem is purely one of association and prediction. These limitations of retrospective learning motivate formalizing prospective learning to highlight these four capabilities explicitly.

4 The capabilities that characterize prospective learning

The central hypothesis of this paper is that by posing the problem of learning as being about the future, many of the current problem areas of AI become tractable, and many aspects of NI behaviors can be better understood. Here we spell out the components of systems that learn for the future. While Figure 1 provides a schematic akin to a partially observable Markov decision process [53], it is important to note that prospective learning is not merely a rebranding of Markov decision processes, or reinforcement learning. Specifically, the 'future' in Figure 1 is not the next time step, but rather, some time in the potentially distant future. Moreover, at that time, everything could be different: not merely a time-varying state transition distribution and reward function, but also possibly different input and output spaces. In other words, it could be a completely different environment. Thus, while Markov assumptions may be at play, algorithms designed merely to address a stationary Markov decision process will catastrophically fail in the more general settings considered here. Nonetheless, without further assumptions, the problem would be intractable (or even uncomputable) [4]. Thus, we further assume that the external world W is changing somewhat predictably over time. For example, the distribution of states in a world that operates according to some mechanisms (e.g., sunny weather, cars driving on the right, etc.) may change when one or more of those mechanisms changes (e.g., rainy weather, or driving on the left). [Footnote 1: In other words, the world may evolve due to interventions on causative factors, which we also touch on in Section 4.4.] Continual learning thereby enables the intelligence to store information in h about the past that it believes will be particularly useful in the (far) future. Prospective constraints, h ∈ H′, including inductive biases and priors, contain information about the particular idiosyncratic ways in which both the intelligence and external world are likely to change over time. Such constraints, for example, include the possibility of compositional representations. The constraints also push the hypotheses towards those that are accurate both now and in the future. The actions are therefore not only aimed at exploiting the current environment but also aimed at exploring to gain knowledge useful for future environments and behaviors, reflecting a curiosity about the world. Finally, learning causal, rather than merely associational, relations enables the intelligence to choose actions that lead to the most desirable outcomes, even in complex, previously unencountered, environments. These four capabilities, continual learning, prospective constraints, curiosity, and causality, together form the basis of prospective learning.

4.1 Continual learning

Continual (or lifelong) learning is the process by which intelligences learn for performance in the future, in a way that involves sequentially acquiring new capabilities (e.g., skills and representations) without forgetting—or even improving upon—previously acquired capabilities that are still useful. In general we expect previously learned abilities to be useful again in the future, either in part or in full.
As such, it is clear that an intelligence that can remember useful capabilities, despite learning new behaviors, will outperform (and out-survive) those that do not [54]. However, AI systems often do forget the old upon learning the new, a phenomenon called catastrophic interference or forgetting [3, 55, 56]. Better than merely not forgetting old things, a continual learner improves performance on old things, and even potential future things, upon acquiring new data [49, 52, 57–59]. As such, the ability to do well on future tasks is the hallmark of real learning, and the need to not forget immediately derives from it. An example of successful continual learning in NI is learning to play music. If a person is learning Mozart and then practices arpeggios, having learned the arpeggios will improve their ability to play Mozart, and also their ability to play Bach in the future. When people learn another language, it also improves their comprehension of previously learned languages, making future language learning even easier [60]. The key to successful continual learning is, therefore, to transfer information from data and experiences backwards to previous tasks (called backwards transfer) and forwards to future tasks (called forward transfer) [45, 58, 59]. Humans also have failure modes: sometimes this prior learning can impair future performance, a process known as interference (e.g., [61, 62]). The extent of transfer or interference in future performance depends on both environmental context and internal mechanisms (see Bouton [63]). While continual learning is obviously required for efficient prospective learning, to date there are relatively few studies quantifying forward transfer in NIs, and, as far as we know, none that explicitly quantify backwards transfer [64–66]. Crucially, learning new information does not typically cause animals to forget old information. Nonetheless, while existing AI algorithms have tried to enable both forward and backward transfer, for the most part they have failed [67]. The field is only just beginning to explore effective continual learning strategies [68], including those that explicitly consider non-stationarity of the environment [69, 70]. Traditional retrospective learning starts from a tabula rasa mentality, implicitly assuming that there is only one task (e.g., a single machine learning problem) to be learned [71]. In these classical machine learning scenarios, each data sample is assumed to be sampled independently; this is true even in OOD learning. While online learning [72], sequential estimation [73], and reinforcement learning [74] relax this assumption, traditional variants of those ML disciplines typically assume a slow distributional drift, disallowing discrete jumps. This does not consider the possibility that the far future may strongly depend on the present (e.g., memories from the last time an animal was at a particular location will be useful next time, even if it is far in the future). These previous approaches also typically only consider single-task learning, whereas in continual learning there are typically multiple tasks, and multiple distinct datasets, sometimes each with a different domain. In prospective learning, however, data from the far past can be leveraged to update the internal world model [75, 76]. Here the training and test sets are necessarily coupled by time.
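To make the notions of forward and backward transfer concrete before formalizing the objective, here is a small evaluation sketch (our own illustration under toy assumptions, not the paper's protocol): a learner sees two related regression tasks in sequence, and we compare its error on each task with and without having learned the other.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_task(slope, n=50):
    """A toy regression task; related tasks share structure (similar slope)."""
    x = rng.uniform(-1, 1, n)
    return x, slope * x + rng.normal(0, 0.1, n)

def fit(x, y, w_init=0.0, lam=1.0):
    """Regularized update that keeps the new weight close to the old one,
    a crude stand-in for a continual learner that tries not to forget."""
    return (np.sum(x * y) + lam * w_init) / (np.sum(x * x) + lam)

def error(w, x, y):
    return np.mean((w * x - y) ** 2)

task_a, task_b = make_task(2.0), make_task(2.2)   # the future resembles the past

w_a = fit(*task_a)                  # learn task A
w_ab = fit(*task_b, w_init=w_a)     # then learn task B, starting from A
w_b_scratch = fit(*task_b)          # control: learn B from scratch

# Forward transfer: did learning A help on B, relative to learning B alone?
print("forward transfer:", error(w_b_scratch, *task_b) - error(w_ab, *task_b))
# Backward transfer: positive means B improved A; negative means forgetting.
print("backward transfer:", error(w_a, *task_a) - error(w_ab, *task_a))
```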
This is in contrast to the canonical OOD learning problem, in which the training and testing data lack any notion of time, but similar to classical online, sequential, and reinforcement learning. Here we assume that the future depends to some degree on the past. This dependency can be described by their conditional distribution, P^{future|past}_{(X,Y)}. Crucially, we do not necessarily assume a Markov process, where the future only depends on the recent past, but rather allow for more complex dependencies depending on structural regularities across scenarios. We thus obtain a more general expected risk in the learning-for-the-future scenario:

\[
\mathcal{E}_{\mathrm{continual}}\left(h, n, P^{\mathrm{future,\,past}}\right) = \underbrace{\int_{(\mathcal{X},\mathcal{Y})^{\mathrm{past}}}}_{\text{possible past}} \underbrace{\int_{(\mathcal{X},\mathcal{Y})^{\mathrm{future}}}}_{\text{possible future}} \underbrace{\ell^{\mathrm{future}}(h(X), Y)}_{\text{loss function}} \; \underbrace{dP^{\mathrm{future}\,|\,\mathrm{past}}_{X,Y}}_{\text{distribution}} \; \underbrace{dP^{\mathrm{past}}_{X,Y}}_{\text{distribution}} . \tag{4.1}
\]

Continual learning is thus an immediate consequence of prospective learning. Recent work on continual reinforcement learning [77] can be thought of as devoted to developing algorithms that optimize the above equation, but such efforts typically lack the other capabilities of prospective learning. As we will argue next, continual learning is only non-trivial upon assuming certain computational constraints.

4.2 Constraints

Constraints for prospective learning effectively shrink the hypothesis space to require less data and fewer resources to find solutions to the current problem, which also generalize to potential future problems. Whereas in NI these constraints come from evolution, in AI these constraints are built into the system. Traditionally, constraints come in two forms. Statistical constraints limit the space of hypotheses that are possible in order to enhance statistical efficiency; they reduce the amount of data required to achieve a particular goal. For our purposes, priors and inductive biases are 'soft' statistical constraints. Computational constraints, on the other hand, impose limits on the amount of space and/or time an intelligence can use to learn and make inferences. Such constraints are typically imposed to enhance computational efficiency; that is, to reduce the amount of computation (space and/or time) to achieve a particular error guarantee. Both kinds of constraints, statistical and computational, restrict the search space of effective or available hypotheses, and of the two, statistical constraints likely play a bigger role in prospective learning than computational ones. Moreover, both kinds of constraints can be thought of as different ways to regularize, either explicit regularization (e.g., priors and penalties) or implicit regularization (e.g., early stopping). [Footnote 2: Note that without computational constraints, some aspects of continual learning can trivially be solved with a naive retrospective learner that stores all the data it has ever encountered, and retrains its hypothesis from scratch each time new data arrive [45]. Thus, continual learning is inherently defined by space and/or time constraints, which are present in any real-world intelligence, be it natural or artificial.] There is no way to build an intelligence, either via evolution or from human hands, without it having some constraints, particularly inductive biases (i.e., assumptions that a learner uses to facilitate learning an input-output mapping). For example, most mammals [23, 78, 79], and even some insects [80], excel at learning general relational associations that are acquired in one modality (e.g., space) and applied in another (e.g., social groups). Inductive biases like this often reflect solutions to problems faced by predecessors and learned over evolution. They are often expressed as instincts and emotions that provide motivation to pursue or avoid a course of action, leading to opportunities to learn about relevant aspects of the environment more efficiently.
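As a toy illustration of how a statistical constraint buys sample efficiency, the sketch below compares an unconstrained high-degree polynomial fit with one constrained by a simple penalty (ridge regularization); the specific model, penalty, and data are our own choices for illustration and are not drawn from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)

# A small dataset generated by a simple underlying mechanism (a line plus noise).
x = rng.uniform(-1, 1, 8)
y = 1.5 * x + rng.normal(0, 0.2, 8)
X = np.vander(x, 8, increasing=True)           # degree-7 polynomial features

def fit(X, y, lam):
    """Ridge regression: lam = 0 is the unconstrained learner; lam > 0 encodes
    a prior that small, simple coefficient vectors are more plausible."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

x_test = np.linspace(-1, 1, 200)
X_test = np.vander(x_test, 8, increasing=True)
y_test = 1.5 * x_test                           # the persisting mechanism

for lam in (0.0, 1.0):
    w = fit(X, y, lam)
    mse = np.mean((X_test @ w - y_test) ** 2)
    print(f"lambda={lam}: test error {mse:.3f}")  # the constrained fit generalizes better
```

The point is not the ridge penalty itself but the pattern: with the same small amount of data, the learner whose search space is restricted toward plausible hypotheses recovers the persisting mechanism, while the unconstrained learner fits the noise.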
For example, mammals have a particular interest in moving stimuli, and specifically biologically-relevant motion [81], likely reflecting behaviorally-relevant threats [82]. Both chicks [83] and human babies [84] have biases for parsing visual information into object-like shapes, without extensive experience with objects. Newborn primates are highly attuned to faces [85] and direction of gaze [86], and these biases are believed to facilitate future conceptual [87] and social learning [88]. Thus, within the span of an individual's lifetime, NIs are not purely data-driven learners. Not only is a great deal of information baked in via evolution, but this information is then used to guide prospective learning [89]. AI has a rich history of choosing constrained search spaces, including priors and specific inductive biases, so as to improve performance (e.g., [90]). Perhaps the most well known inductive bias deployed in modern AI solutions is the convolution operation [91], something NIs appear to have discovered hundreds of millions of years before we implemented it in AIs [92]. Such ideas can be generalized in terms of symmetries in the world [93]. Machine learning has developed many techniques to incorporate known invariances into the learning process [94–97], as well as to mathematically quantify how much one can gain by imposing them [98, 99]. In fact, in many cases we may want to think about constraints themselves as something to be learned [100, 101], a process that would unfold over evolutionary timescales for NIs. However, in many areas the true potential of prospective constraints for accelerating learning for the future remains unrealized. For example, as pointed out above, many NIs can learn the component structure of problems (e.g., relations), which accelerates future learning when new contexts have similar underlying compositions (see Whittington et al. [23]). This capability corresponds to zero-shot cross-domain transfer, a challenge unmet by the current state-of-the-art machine learning methods [102]. Why are these constraints important? With a sufficiently general search space, and enough data, space, and time, one can always find a learner that does arbitrarily well [32, 103]. In practice, however, intelligences have finite data (in addition to finite space and time). Moreover, a fundamental theorem of pattern recognition is the arbitrarily slow convergence theorem [104, 105], which states that given a fixed learning algorithm and any sample size N, there always exists a distribution such that the performance of the algorithm is arbitrarily poor whenever n < N [34, 35]. This theorem predates and implies the celebrated no free lunch theorem [106], which states that there is not one algorithm to rule them all; rather, if learner g converges faster than another learner g′ on some problems, then the second learner g′ will converge faster on other problems. In other words, one cannot hope for a general "strong AI" that solves all problems efficiently. Rather, one can search for a learner that efficiently solves problems in a specified family of problems.
Constraints on the search space of hypotheses thereby enable intelligences to solve the problems of interest efficiently, by virtue of encoding some form of prior information and limiting the search space to specific problems. Prospective learners use prospective constraints, that is, constraints that push hypotheses to the most general solution that works for a given problem, such that it can readily be applied to future distinct problems. Formalizing constraints using the above terminology (see Section 3) does not require modifying the objective function of learning. It merely modifies the search space. Specifically, we place constraints on the learner g ∈ G′ ⊂ G, the hypothesis h ∈ H′ ⊂ H, and the assumed joint distribution governing everything, P = P^{future,past} ∈ P′ ⊂ P. The existence of constraints is what makes prospective learning possible, and the quality of these constraints is what decides how effective learning can be.

4.3 Curiosity

We define curiosity for prospective learning as taking actions whose goal is to acquire information that the intelligence expects will be useful in the future (rather than to obtain rewards in the present). Goal-driven decisions can be broken down into a choice between maximizing one of two objective functions [107]: (1) an objective aimed at directly maximizing rewards, R, and (2) an objective aimed at maximizing relevant information, E. For prospective learning, E is needed for making good choices now and in the as-yet-unknown future. In this way the intelligence, at each point of time, decides whether it should dedicate time to learning about the world, thereby maximizing E, or to performing a rewarding behavior, thereby maximizing R. Critically, by being purely about relevant information for the future, objective (2) (i.e., pure curiosity) can maximize information about both current and future states of the world. E can be defined simply as the value of the unknown: the integration over possible futures and the knowledge they may afford. However, this term cannot easily be evaluated. Instead, we know much about what it drives the intelligence to learn: compositional representations, causal relations, and other kinds of invariances that allow us to solve current and future problems. In this way, E ultimately quantifies our understanding of the relevant parts of the world. In humans, curiosity is a defining aspect of early-life development, where children consistently engage in more exploratory and information-driven behavior than adults [6, 108–111]. This drive for directed exploration, particularly important in children, is often focused on learning causal relations—to acquire both forward and reverse causal explanations [112]—and developing models of the world that they can exploit in later development. But curiosity is not limited to humans (for review see Loewenstein [113]). Just like children [114], monkeys show curiosity for counterfactual outcomes [115]. Rats are driven to explore novel stimuli and contexts, even in the absence of rewards [116, 117]. Just like children [118], octopuses appear to learn from playing with novel objects [119], a pure form of curiosity. In fact, even the roundworm C. elegans, an animal with a simple nervous system of only a few hundred neurons, shows evidence of exploration in novel environments [120]. Curiosity is clearly a fundamental drive of behavior in NIs [107].
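The distinction between the two objectives can be sketched with a toy bandit (entirely our own simplification): one policy chooses the option with the highest estimated reward (maximizing R), while a curious policy chooses the option it knows least about, a crude proxy for maximizing information E that will pay off in later decisions. The schedule and the information proxy below are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

true_payoffs = np.array([0.3, 0.5, 0.8])      # unknown to the intelligence
counts = np.ones(3)                           # one fake observation per option
means = np.full(3, 0.5)                       # current reward estimates

def update(a, r):
    counts[a] += 1
    means[a] += (r - means[a]) / counts[a]

for t in range(30):
    exploit = int(np.argmax(means))           # objective (1): maximize reward R now
    curious = int(np.argmin(counts))          # objective (2): probe the least-known
                                              # option, a stand-in for maximizing E
    a = curious if t < 15 else exploit        # crude schedule: learn first, exploit later
    r = rng.binomial(1, true_payoffs[a])
    update(a, r)

print("estimated payoffs:", np.round(means, 2))
print("best option found:", int(np.argmax(means)))
```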
It is well established in the active learning literature that curiosity, i.e., gathering information rather than rewards, can lead to an exponential speed-up in sample size convergence guarantees [121, 122]. Specifically, this means that if a passive learner requires n samples to achieve a particular performance guarantee, then an active learner requires only ln n samples to achieve the same performance. This is important because the scenarios for which prospective learning provides a competitive advantage are those where information is relatively sparse and the outcomes are of high consequential value. So every learning opportunity must really count in these situations. Of course, we cannot expect that either AIs or NIs will perfectly implement prospective learning by integrating E over long time horizons. Instead, we can approximate what we learn about the parts of the world that we will want to take future actions in, which compositional elements (i.e., constraints) exist in this world, and which causal interactions these components have. These properties mean that we can see E as an approximation to how well we can learn from the world. Thus optimal information gathering (i.e., curiosity) relies on learning policies similar to those of reinforcement learning. This may explain why empirical studies in humans show that information seeking relies on circuits that overlap with those for reward learning [123]. Most importantly, this shows how curiosity is innately future-focused. The solution to reinforcement learning (i.e., the Bellman equation) reflects the optimal decision to make to maximize future returns [76, 124]. Thus, in the case of curiosity, this solution is the optimal decision to maximize information in the future. What distinguishes curiosity from reward learning is that learning E informs intelligences, whether NI or AI, about the structure of the world. E provides the necessary knowledge of things like spatial configurations, hierarchical relationships, and contingencies. In other words, to find an optimal curiosity policy we can find an optimal policy today about the structure of the world, regardless of immediate rewards, and solve the optimization problem again tomorrow.

4.4 Causality

Causal estimation is the ability to identify how one event (the cause) produces another (the effect), which is particularly useful for understanding how our actions impact the world. Causal estimation is enabled in practice by assuming that the direct causal relationships are sparse. This sparsity assumption greatly simplifies modeling the world by adding some bias while drastically reducing the search space of hypotheses to learn. While it might be tempting to think that prospective learning boils down to simply learning factorizable probabilistic models of the world, such models are inadequate for prospective learning. This is because probabilistic models are inherently invertible. That is, we can just as easily write the probability of wet grass given that it is raining, P(wet|rain), as the probability that it is raining given wet grass, P(rain|wet). Yet these probabilities do not tell us what would be the effect of intervening on one or the other variable. These probabilistic statements of the world do not convey whether or not increasing P(wet) increases P(rain). According to causal essentialists, such as Pearl [7], to make such statements requires more than a probabilistic model: it requires a causal model. Causal reasoning enables intelligences to transfer information across time.
Specifically, it enables transferring causal mechanisms which, by their very nature, are consistent across environments. This includes environments that have not previously been experienced, thereby transferring out-of-distribution. Thus, causal reasoning, like continual learning, is a qualitative capability, rather than a quantitative improvement, that is necessary for prospective learning. Causal reasoning has long been seen by philosophers as a fundamental feature of human intelligence [125]. While it is not always easy to distinguish causal reasoning from associative learning in animals, many non-human animals have been shown to perform predictive inferences about object-object relationships, allowing them to estimate causal patterns (for review see Völter and Call [126]). For example, great apes [127], monkeys [128], pigs [129], and parrots [130] can use simple sensory cues (e.g., the rattling sound of a shaken cup) to infer outcomes (e.g., the presence of food in the cup), a form of diagnostic inference. However, this form of causal reasoning is inconsistently observed in NIs (see Cheney and Seyfarth [131], Collier-Baker et al. [132]). Other studies have shown that, particularly in social contexts, animals from great apes [133] and baboons [134] to corvids [135] and rats [136, 137] can perform transitive causal inference (i.e., if A → B and B → C, then A → C; for review see Allen [138]). This causal ability has even been observed in insects [139], suggesting that forms of causal inference exist across taxa. The insight driving causal reasoning is that the causal mechanisms in the world tend to persist while correlations are often highly context sensitive [140]. Further, the same causal mechanisms are involved in generating many observations, so that models of these mechanisms are reusable modules in compositional solutions to many problems within the environment. For example, understanding gravity is useful for catching balls as well as for modeling tides and launching rockets. Prospective learning thus crucially benefits from causal models: they are more likely to be useful as they encode real invariances that persist across environments. For example, different variants of COVID will continue to emerge, but certain treatments are likely to be effective for each of them insofar as they act on the mechanism of disease, which remains constant [141]. Such scenarios pose a problem for traditional AI algorithms. Modern retrospective learning machines notoriously fail to learn causal models in all but the most anodyne settings. Some AI researchers have advocated for creating models that can perform causal reasoning, which would help AI systems generalize better to new settings and perform prospective inference [142–144], but this field remains in its infancy. Going back to our formulation of the problem, what this all means is that what matters for future decisions is 'doing Y': intervening on the world by virtue of taking action Y, rather than simply noting the (generalized) correlation between X and Y. Fundamentally, implementing Y simply means returning the value of the hypothesis for a specific X: i.e., do(h(X)). This modification yields an updated term to optimize to achieve prospective learning:

\[
\mathcal{E}_{\mathrm{causal}}(do(h), n, P) = \underbrace{\int_{(\mathcal{X},\mathcal{Y})^n}}_{\text{possible train}} \underbrace{\int_{(\mathcal{X},\mathcal{Y})}}_{\text{possible test}} \underbrace{\ell(do(h(X)), Y)}_{\text{loss function}} \; \underbrace{dP_{X,Y}}_{\text{distribution}} \; \underbrace{dP_{(X,Y)^n}}_{\text{distribution}} .
\]

[Footnote 3: Note that minimizing this equation is close to reinforcement learning objectives, although they typically are not explicitly interested in learning causal models, and therefore the 'do' operator is not typically present in the value or reward functions.]
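To see why purely associational estimates are not enough, consider this small simulation (entirely our own toy, with a made-up confounder): the observational contrast between actions badly misestimates the effect of the action, while intervening on the action (the do-operator, implemented here by randomization) recovers it.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000

# A hidden confounder drives both the observed action and the outcome.
confounder = rng.normal(size=n)
action_obs = (confounder + rng.normal(size=n)) > 0            # observed, not chosen
outcome_obs = 2.0 * confounder + 0.5 * action_obs + rng.normal(size=n)

# Observational (associational) contrast: biased by the confounder.
assoc = outcome_obs[action_obs].mean() - outcome_obs[~action_obs].mean()

# Interventional contrast, do(action): the agent sets the action itself
# (e.g., by randomizing), severing the confounder -> action link.
action_do = rng.random(n) > 0.5
outcome_do = 2.0 * confounder + 0.5 * action_do + rng.normal(size=n)
causal = outcome_do[action_do].mean() - outcome_do[~action_do].mean()

print(f"associational estimate: {assoc:.2f}   (far from the true effect of 0.5)")
print(f"interventional estimate: {causal:.2f}  (close to the true effect of 0.5)")
```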
Crucially, the ability to choose actions, Y, allows the agent to discover causal relations, regardless of the amount of confounding in the outside world. Causality links actions and learner, both by enabling actions that are helpful for learning (e.g., randomized ones), and by enabling learning strategies that are useful for discovering causal aspects of reality (e.g., through quasi-experiments). For our purposes here, we consider all interactions between learning and action strategies to belong to either causality or curiosity.

4.5 Putting it all together

Table 1 provides a summary of the four capabilities that are necessary for prospective learning, examples of how they are expressed in nature, how retrospective learning handles each process, and how a prospective learner would implement it. In it we highlight examples in the literature where the behavior of a prospective learner has been demonstrated in AI, illustrating that the field is already moving somewhat in this direction, though not completely yet. We argue this is because the form of prospective learning has not been carefully defined, as we have attempted to do here. With this gap in mind, we argue that in NIs, evolution has led to the creation of intelligent agents that incorporate the above key components that jointly characterize prospective learning. Continual learning, enabled by constraints and driven by curiosity, allows for the ability to make causal inferences about which actions to take now that lead to better outcomes now and in the far future. In other words, our claim is that evolution led to the creation of NIs that choose a learner which, with each new experience, updates the internal model, g(h_n, X_n, Y_n) → h_f, where each h_f is the solution to

minimize E′(do(h), n, P), subject to g ∈ G′, h ∈ H′, and P ∈ P′,   (1)

where P = P^{future,past}, and the constraints on g, h, and P encode aspects of time, compositionality, and causality (e.g., that the future is dependent on the past via causal mechanisms). The expected risk, E′, for a specified hypothesis, at the current moment, given a set of experiences, is defined by

\[
\mathcal{E}'(do(h), n, P) = \underbrace{\int_{(\mathcal{X},\mathcal{Y})^{\mathrm{past}}}}_{\text{possible past}} \underbrace{\int_{(\mathcal{X},\mathcal{Y})^{\mathrm{future}}}}_{\text{possible future}} \underbrace{\ell^{\mathrm{future}}(do(h(X)), Y)}_{\text{loss function}} \; \underbrace{dP^{\mathrm{future}\,|\,\mathrm{past}}_{X,Y}}_{\text{distribution}} \; \underbrace{dP^{\mathrm{past}}_{X,Y}}_{\text{distribution}} .
\]

This E′ gives the fundamental structure of the prospective learning problem. We argue that although there has been substantial effort in the NI and AI communities to address each of the four capabilities that lead to solving Eq. (1) independently, each remains to be solved at either the theoretical or algorithmic/implementation levels. Solving prospective learning requires a coherent strategy for jointly solving all four of these problems together. While we argue that optimizing Eq. (1) characterizes our belief about what intelligent agents do when performing prospective learning, it is strictly a computational-level problem. It does not, however, answer the question of how they do it. What is the mechanism or algorithm that intelligent agents use to perform prospective learning?
Intriguingly, the implementation of prospective learning in NIs happens in a network (to a first approximation) [145], and most modern state-of-the-art machine learning algorithms are also networks [146]. Moreover, both fields have developed a large body of knowledge in understanding network learning [147–150]. Thus, a key to solving how prospective learning can be implemented relies on understanding how networks learn representations, particularly representations that are important for the future. This is a critical component in explaining, augmenting, and engineering prospective learning. Understanding the role of representations in prospective learning, thus, requires a deep understanding of the nature of internal representations in modern ML frameworks. A fundamental theorem of pattern recognition characterizes sufficient conditions for a learning system to be able to acquire any pattern. Specifically, an intelligent agent must induce a hypothesis such that, for any new input, it only looks at a relatively small amount of data 'local' to that input [151]. In other words, a good learner will map new observations into a new representation, i.e., a stable and unique trace within a memory, such that the inputs that are 'close' together in the external world are also close in their internal representation (see also Sorscher et al. [152], Sims [153]). In networks, this is implemented by any given input only activating a sparse subset of nodes, which is typical in NIs [154] and becoming more common in AIs [155]. Indeed, deep networks can satisfy these criteria [156]. Specifically, deep networks partition feature space into geometric regions called polytopes [157]. The internal representation of any given point then corresponds to which polytope the point is in, and where within that polytope it resides. Inference within a polytope is then simply linear [158]. The success of deep networks is a result of their ability to efficiently learn what counts as 'local' [159]. [Footnote 4: Incidentally, decision forests, such as random forests [160] and gradient boosting trees [161], continue to be the leading ML methods for tabular data [162]. Moreover, they, like deep networks, also partition feature space into polytopes, and then learn a linear function within each polytope [157], suggesting that this approach for learning representations can be effectively realized by many different algorithms and substrates [30].] In prospective learning, in contrast to retrospective learning, what counts as local is also a function of potential future environments. Thus, the key difference between retrospective and prospective representation learning is that the internal representation for prospective learning must trade off between being effective for the current scenario, and being effective for potential future scenarios.
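As a small illustration of the polytope picture (our own toy, not code from the paper), a ReLU network's pattern of active units identifies which linear region an input falls into; all inputs sharing that pattern are handled by the same local linear map.

```python
import numpy as np

rng = np.random.default_rng(5)

# A tiny (untrained) ReLU network: 2 inputs -> 8 hidden units -> 1 output.
W1, b1 = rng.normal(size=(8, 2)), rng.normal(size=8)
w2, b2 = rng.normal(size=8), 0.0

def forward(x):
    pre = W1 @ x + b1
    active = pre > 0                    # the activation pattern = polytope identity
    return w2 @ (pre * active) + b2, tuple(active)

x = np.array([0.2, -0.4])
y, pattern = forward(x)

# Within the polytope defined by `pattern`, the network is exactly linear:
mask = np.array(pattern)
local_w = (w2 * mask) @ W1              # effective local linear map
local_b = (w2 * mask) @ b1 + b2
print("network output:     ", y)
print("local linear output:", local_w @ x + local_b)   # identical for this input
```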
Table 1: The four core capabilities of prospective learning, evidence for their existence in natural intelligence, how retrospective learners deal with (or fail to deal with) each, and how a prospective learner would deal with it. See Figure 1 for notation.

Continual learning
- Learning of: Don't forget the important stuff; h_n → h_f.
- Example in natural intelligence: When people learn a new language we get better at our old one [60].
- Retrospective learning: Learning new information overwrites old information [163].
- Prospective learning: Reuse useful information to facilitate learning new things without interference [59, 164].

Constraints
- Learning of: Regularize via prior knowledge, heuristics, and biases; h ∈ H′.
- Example in natural intelligence: Animals learn to store food in locations that are optimal given local weather conditions [16].
- Retrospective learning: Generic constraints like sparseness enable learning a single task more efficiently [165].
- Prospective learning: Compositional representations can be exponentially reassembled for future scenarios and to compress the past [90].

Curiosity
- Learning of: Go get information about the (expected future) world, instead of just rewards; E(h_n, W_f).
- Example in natural intelligence: Animals explore novel stimuli and contexts, even in the absence of rewards [116, 117].
- Retrospective learning: Use randomness to explore new options with unknown outcomes in the current scenario [166].
- Prospective learning: Seek out information about potential future scenarios [107].

Causality
- Learning of: The world W has sparse causal relationships; do(Y_n) → W_f.
- Example in natural intelligence: Animals can learn if A → B and B → C, then A → C [138].
- Retrospective learning: Learn statistical associations between stimuli [167].
- Prospective learning: Apply causal information to novel situations [168].

5 The future of learning for the future

In many ways, prospective learning has always been a central (though often hidden) goal of both AI and NI research. Both fields offer theoretical and experimental approaches to understand how intelligent agents learn for future behavior. What we are proposing here is a formalization of the structure for how to approach studying prospective learning jointly in NI and AI that will benefit both by establishing a more cohesive synergy between these research areas (see Table 2). Indeed, the history of AI and NI exhibits many beautiful synergies [169, 170]. In the middle of the 20th century cognitive science (NI) and AI started in large part as a unified effort. Early AI work like the McCulloch-Pitts neurons [171] and the Perceptron [172] had strong conceptual links to biological neurons. Neural network models in the Parallel Distributed Processing framework had success in explaining many aspects of intelligent behavior, and there have been strong recent drives to bring deep learning and neuroscience closer together [169, 170, 173].
We argue that the route to solving prospective learning rests on the two fields coming together around three major areas of development.

• Theory: A theory of prospective learning, building on and complementing the theory of retrospective learning, will identify which experiments (in both NIs and AIs) will best fill gaps in our current understanding, while also providing metrics to evaluate progress [174]. A theoretical understanding of prospective learning will also enable the generation of testable mechanistic hypotheses characterizing how intelligent systems can and do prospectively learn.

• Experiments: Carefully designed, ecologically appropriate experiments across species, phyla, and substrates will enable (i) quantifying the limitations and capabilities of existing intelligences with respect to the prospective learning criteria, and (ii) refining the mechanistic hypotheses generated by the theory. NI experiments across taxa will also establish milestones for experiments in AI [175].

• Real-World Evidence: Implementing and deploying AI systems, and observing NIs exhibiting prospective learning 'in the wild', will provide real-world evidence to deepen our understanding. These implementations could be purely software, leverage specialized (neuromorphic) hardware [176, 177], or even include wetware and hybrids [178].

| | Leading approach in retrospective NI | Leading approach in retrospective AI | Proposed approach in prospective NI & AI |
|---|---|---|---|
| Experimental design | Study (typically ethologically inappropriate) behaviors after learning has saturated | Study single task performance vs. sample size | Design experiments to explicitly test each of the 4 capabilities |
| Evaluation criteria | Accuracy at saturation | Accuracy at large sample sizes | Amount of transfer [45] across tasks |
| Algorithms | Simple and intuitive | Ensembling trees or networks | Ensembling representations [59] |
| Theory | Qualitatively characterize learned behavior | Convergence rates for a single task | Convergence rates leveraging data drawn from multiple, sequential, causal tasks [49] |

Table 2: Comparing the approaches to studying retrospective NI, retrospective AI, and a proposed approach for studying prospective intelligences.

An astute reader may wonder how prospective learning relates to reinforcement learning (RL), a subfield of both AI and NI. RL has long worked towards bridging the gap between AI and NI. For example, early AI models of reinforcement learning formalized the phenomenon of an 'eligibility trace' in synaptic connections that may be crucial for resolving the credit assignment problem, i.e., determining which actions lead to a specific feedback signal [179]. Over 30 years later, this AI work informed the design of experiments that led to the discovery of such traces in brains [180, 181]. In RL, through repeated trials of a task usually specified by its corresponding task reward, agents are trained to choose actions at each time instant that maximize those task rewards at future instants [76, 124]. This future-oriented reward maximization objective at first glance bears a resemblance to prospective learning, and deep learning-based RL algorithms, building on decades of research, have recently made great progress towards meeting this challenge [182–186].
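As a concrete reference point for the objective just described, here is a minimal, self-contained sketch of tabular Q-learning on a fabricated toy environment (our illustration; ChainEnv and all names and hyperparameters are assumptions, not code from the cited works). The agent is trained only to maximize the discounted sum of one prespecified task reward; nothing in the update encourages representations that would remain useful if the task were to change, which is precisely the gap prospective learning targets.

```python
# Minimal sketch of the standard (retrospective, single-task) RL objective:
# Q(s, a) estimates the discounted future return r_t + g*r_{t+1} + g^2*r_{t+2} + ...
# for one fixed task reward. ChainEnv is a fabricated toy environment.
import random
from collections import defaultdict

class ChainEnv:
    """Toy 5-state chain: move left (-1) or right (+1); reward 1 only at the right end."""
    actions = [-1, +1]

    def reset(self):
        self.s, self.t = 0, 0
        return self.s

    def step(self, a):
        self.s = min(max(self.s + a, 0), 4)
        self.t += 1
        reward = 1.0 if self.s == 4 else 0.0
        done = (self.s == 4) or (self.t >= 20)
        return self.s, reward, done

def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning: estimate the discounted future return of each (state, action)."""
    Q = defaultdict(float)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection, for the single prespecified task reward
            if random.random() < epsilon:
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda a_: Q[(s, a_)])
            s_next, r, done = env.step(a)
            # bootstrap toward: reward now + discounted estimate of future reward
            future = 0.0 if done else max(Q[(s_next, a_)] for a_ in env.actions)
            Q[(s, a)] += alpha * (r + gamma * future - Q[(s, a)])
            s = s_next
    return Q

Q = q_learning(ChainEnv())
print(max(Q[(0, a)] for a in ChainEnv.actions))  # learned value of the start state for this one task
```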
However, these standard RL algorithms do not truly implement prospective learning: for example, while deep RL agents may perform even better than humans on the games they were trained on, they have extreme difficulty transferring skills across games (see Shao et al. [187]), even to games whose task or rule structures are similar to those in the training set [188, 189]. Rather than optimizing a prespecified task as in RL, prospective learning aims to acquire skills and representations now that will be useful for future tasks whose rewards or other specifications are usually not available in advance. In the example above, a true prospective learner would acquire representations and skills that transfer across many games. As in other machine learning subfields, there are several growing movements within RL that study problems falling under the prospective learning umbrella, including continual multi-task RL [77], hierarchical RL [190–193] that combines low-level skills to solve new tasks, causal RL [194–196], and unsupervised exploration [197–199].

5.1 Advancing Natural Intelligence via Artificial Prospective Learning

Advances in AI's understanding of prospective learning provide a necessary formalism that can be harvested by NI research. AI can produce the critical structure of scientific theories that leads to more rigorous and testable hypotheses around which to build experiments [200]. There is historical evidence that our understanding of NI abilities has conceptually benefited from AI, including examples in which theoretical formalisms from AI inspired accounts of NI [201–204]. Our proposal for prospective learning expands upon this existing interrelation between the fields.

Consider the problem of designing NI experiments to understand prospective learning, rather than cognitive function or retrospective learning. How would such experiments look different? They would demand that the NIs wrestle with each of the four capabilities: continual learning, constraints, curiosity, and causality, which is exactly what NI researchers currently avoid because we cannot readily fit theories to such behaviors. For continual learning, the experiments would present a sequence of tasks, some of which repeat, so that both forward and backward transfer can be quantified (see the first sketch at the end of this subsection). For constraints, tasks would specifically investigate the priors and inductive biases of the animal, rather than its ability to learn over many repeated trials. For curiosity, tasks would require a degree of exploration and include information relevant only for future tasks. For causality, tasks would encode various conditional dependencies that are not causal, as in Simpson's paradox (see the second sketch at the end of this subsection). The ways in which the NIs are evaluated would also be informed by the theory, for example, quantifying the amount of information transferred rather than long-run performance properties [45, 205]. While extensive research in all these areas exists, a deepened dialogue between individuals studying NI and AI will significantly advance their synergy. Importantly, prospective learning provides a scaffolding around which to organize the debate.
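For concreteness, here is a minimal sketch of how forward and backward transfer could be quantified from such a task sequence. It follows one common formalization from the continual learning literature; the accuracy matrix and baseline below are fabricated purely for illustration and are not the specific metrics proposed in [45] or [205].

```python
# Sketch of transfer metrics from a sequential-task experiment (fabricated numbers).
import numpy as np

# R[i, j] = performance on task j after the learner has been trained on tasks 0..i
R = np.array([
    [0.90, 0.55, 0.50],
    [0.85, 0.88, 0.60],
    [0.80, 0.84, 0.91],
])
baseline = np.array([0.50, 0.50, 0.50])   # e.g., performance of an untrained learner
T = R.shape[0]

# Backward transfer: how much later learning helped (or hurt) earlier tasks.
bwt = np.mean([R[T - 1, j] - R[j, j] for j in range(T - 1)])

# Forward transfer: how much earlier learning helped a task before training on it.
fwt = np.mean([R[j - 1, j] - baseline[j] for j in range(1, T)])

print(f"backward transfer: {bwt:+.3f}, forward transfer: {fwt:+.3f}")
```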
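To illustrate the kind of non-causal conditional dependence such tasks could encode, below is a small sketch of a Simpson's-paradox configuration (classic kidney-stone-style counts, used purely as an illustration, not data from this paper): the action that is better within every stratum of a confounder looks worse in aggregate, so a purely associational (retrospective) learner tracking only the marginal statistics would choose the wrong action.

```python
# Simpson's paradox sketch: per-stratum and aggregate success rates disagree.
def rate(successes, trials):
    return successes / trials

strata = {
    "small": {"A": (81, 87),   "B": (234, 270)},
    "large": {"A": (192, 263), "B": (55, 80)},
}

# Within every stratum of the confounder, action A has the higher success rate...
for z, arms in strata.items():
    print(z, {a: round(rate(*sn), 2) for a, sn in arms.items()})

# ...yet the marginal (aggregated) association favors action B.
totals = {a: tuple(sum(x) for x in zip(*(strata[z][a] for z in strata))) for a in "AB"}
print("aggregate", {a: round(rate(*sn), 2) for a, sn in totals.items()})
```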
5.2 Advancing Artificial Intelligence via Natural Prospective Learning

Advances in our understanding of NIs have always been a central driver, if not the central driver, of AI research. Indeed, the fundamental logic of modern computer circuits was directly inspired by McCulloch and Pitts' [171] model of the binary logic in neural circuits. The way that we build AIs often involves a crucial step of observing differences between NIs and AIs, looking to NIs for inspiration about what is missing, and building it into our algorithms. The entire concept of intelligence used by AI derives from considerations of NIs. The concept of prospective learning promises to enable a stronger link between NI and AI, where the many components of the study of NI can be directly ported to the components of AI. Focusing the study of both NI and AI together promises to clarify the logical relations between concepts in the two fields, making them considerably more synergistic.

Over the past few decades, there have been many proposals for how to advance AI, and they all center on capturing the abilities of NIs, primarily humans. These include understanding 'meaning' [206], in particular language, and continuing the current trend of building ever larger deep networks and feeding them ever larger datasets as a means of approaching the depth and complexity seen in biological brains [207]. Others have looked exclusively at human intelligence, going so far as to define AI as the modeling of human intelligence [208]. Indeed, today's rallying cry in AI is for 'artificial general intelligence', commonly defined as the hypothetical ability of an intelligent agent to understand or learn any intellectual task that a human being can [209]. This ongoing influence can be seen in the naming of AI concepts with cognitive science words: Deep Learning is concerned with concepts like curiosity [210], attention [211], memory, and most recently consciousness priors [212].

Our approach fundamentally differs from, and builds upon, those efforts. Yes, human intelligence (or, more accurately, human intellect) is incredibly interesting. Yet NI, and specifically prospective learning, evolved hundreds of millions of years prior to human intelligence and natural language. We therefore argue that prospective learning is much more fundamental than human intellect, and that AI can advance by studying prospective learning in non-human animals (in addition to humans), which are more experimentally accessible. Moreover, we have an existence proof (from evolution) that one can get to human intellect by first building prospective learning. Whether one can side-step prospective learning capabilities and go straight to complex language understanding is an open question.

So how does studying prospective learning in NIs potentially move AI forward? Consider the design of experiments. The study of NI in the lab typically focuses on simple constrained environments, such as two-alternative forced choice paradigms. This contrasts with ecology, whose contributions to our understanding of NI behavior are largely underappreciated, and where NIs have been studied in complex, unconstrained environments for centuries. Like ecology, but in contrast to typical cognitive science and neuroscience approaches, modern AI often investigates the abilities of agents in environments with rich, but complex, structure (e.g., video games, autonomous vehicles). Yet those same AIs (or similar ones) often catastrophically fail in real-world environments [213–217]. Part of this failure to generalize to natural environments is likely due to the fact that the real world places a heavy emphasis on prospective learning abilities, something that most artificial testing environments do not.
Thus, prospective learning provides additional context and motivation to design experiments that transcend boundaries between taxa and substrates, both natural and artificial [218]. We argue that experiments such as those described above in Section 2 for studying NIs can also be ported to study AIs. We can build artificial environments, such as video-game worlds, that place heavy demands on prospective abilities, including learning, and that allow for direct comparison of the abilities of AIs and NIs. Since it is possible to get some non-human NIs to play video games (e.g., monkeys [219], rats [220]), these experiments need not limit the comparisons to humans and AIs alone. We can thus more effectively transfer our understanding of the abilities of NIs into AIs via a unification of tasks.

5.3 What is needed to move forward?

In many ways, the process of doing retrospective learning is simple. It requires the skill sets that many in ML have today: statistics, algorithms, and mathematics. Prospective learning, on the other hand, requires us to reason about potential futures that we have not yet experienced. In other words, we need to do prospective learning in order to understand prospective learning. As such, solving the problem of prospective learning requires a far broader group of people working on the problem. While it sits clearly within the domain of statistics and machine learning, the problem of prospective learning also requires perspectives from well outside these fields, such as biology, ecology, and philosophy. As AI is not only modeling but also shaping the future, it also reminds us of the deep ethical debt intelligence research owes to the society that enables it, and to those who are most directly impacted by it [221, 222].

Acknowledgements

This white paper was supported by an NSF AI Institute Planning award (# 2020312), as well as by Microsoft Research and DARPA. The authors would like to especially thank Kathryn Vogelstein for putting up with endless meetings at the Vogelstein residence in order to make these ideas come to life.

The Future Learning Collective

Joshua T. Vogelstein1YB ; Timothy Verstynen3YB ; Konrad P. Kording2YB ; Leyla Isik1Y ; John W. Krakauer1 ; Ralph Etienne-Cummings1 ; Elizabeth L. Ogburn1 ; Carey E. Priebe1 ; Randal Burns1 ; Kwame Kutten1 ; James J. Knierim1 ; James B. Potash1 ; Thomas Hartung1 ; Lena Smirnova1 ; Paul Worley1 ; Alena Savonenko1 ; Ian Phillips1 ; Michael I. Miller1 ; Rene Vidal1 ; Jeremias Sulam1 ; Adam Charles1 ; Noah J. Cowan1 ; Maxim Bichuch1 ; Archana Venkataraman1 ; Chen Li1 ; Nitish Thakor1 ; Justus M Kebschull1 ; Marilyn Albert1 ; Jinchong Xu1 ; Marshall Hussain Shuler1 ; Brian Caffo1 ; Tilak Ratnanather1 ; Ali Geisa1 ; Seung-Eon Roh1 ; Eva Yezerets1 ; Meghana Madhyastha1 ; Javier J. How1 ; Tyler M. Tomita1 ; Jayanta Dey1 ; Ningyuan (Teresa) Huang1 ; Jong M. Shin1 ; Kaleab Alemayehu Kinfu1 ; Pratik Chaudhari2 ; Ben Baker2 ; Anna Schapiro2 ; Dinesh Jayaraman2 ; Eric Eaton2 ; Michael Platt2 ; Lyle Ungar2 ; Leila Wehbe3 ; Adam Kepecs4 ; Amy Christensen4 ; Onyema Osuagwu5 ; Bing Brunton6 ; Brett Mensh7 ; Alysson R. Muotri8 ; Gabriel Silva8 ; Francesca Puppo8 ; Florian Engert9 ; Elizabeth Hillman10 ; Julia Brown11 ; Chris White12 ; Weiwei Yang12

1 Johns Hopkins University; 2 University of Pennsylvania; 3 Carnegie Mellon University; 4 Washington University, St.
Louis; 5 Morgan State University; 6 University of Washington; 7 Howard Hughes Medical Institute; 8 University of California, San Diego; 9 Harvard University; 10 Columbia University; 11 MindX; 12 Microsoft Research

Y Principal investigators on the NSF AI Institute planning award. B Corresponding authors ([email protected], [email protected], [email protected])

Diversity Statement

By our estimates (using cleanBib), our references contain 11.67% woman(first)/woman(last), 22.15% man/woman, 22.15% woman/man, and 44.03% man/man, and 9.16% author of color (first)/author of color(last), 13.09% white author/author of color, 17.63% author of color/white author, and 60.12% white author/white author.

References

[1] Tracy Hresko Pearl. Fast & furious: the misregulation of driverless cars. NYU Ann. Surv. Am. L., 73:19, 2017.
[2] Agostina J Larrazabal, Nicolás Nieto, Victoria Peterson, Diego H Milone, and Enzo Ferrante. Gender imbalance in medical imaging datasets produces biased classifiers for computer-aided diagnosis. Proceedings of the National Academy of Sciences, 117(23):12592–12594, 2020.
[3] Michael McCloskey and Neal J Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. In Gordon H Bower, editor, Psychology of Learning and Motivation, volume 24, pages 109–165. Academic Press, January 1989.
[4] Patricia Rich, Ronald de Haan, Todd Wareham, and Iris van Rooij. How hard is cognitive science? In Proceedings of the Annual Meeting of the Cognitive Science Society, 43, April 2021.
[5] Chetan Singh Thakur, Jamal Lottier Molin, Gert Cauwenberghs, Giacomo Indiveri, Kundan Kumar, Ning Qiao, Johannes Schemmel, Runchun Wang, Elisabetta Chicca, Jennifer Olson Hasler, et al. Large-scale neuromorphic spiking array processors: A quest to mimic the brain. Frontiers in neuroscience, 12:891, 2018.
[6] Celeste Kidd and Benjamin Y Hayden. The psychology and neuroscience of curiosity. Neuron, 88(3):449–460, November 2015.
[7] Judea Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, first edition, March 2000.
[8] Rodolfo R Llinas. I of the Vortex: From Neurons to Self. MIT Press, February 2002.
[9] Samuel J Gershman, Kenneth A Norman, and Yael Niv. Discovering latent causes in reinforcement learning. Current Opinion in Behavioral Sciences, 5:43–50, October 2015.
[10] C R Raby and N S Clayton. Prospective cognition in animals. Behav. Processes, 80(3):314–324, March 2009.
[11] Anthony Dickinson. Goal-directed behavior and future planning in animals. Animal thinking: Contemporary issues in comparative cognition, pages 79–91, 2011.
[12] Sara J Shettleworth. Planning for breakfast. Nature, 445(7130):825–826, February 2007.
[13] Sara J Shettleworth. Studying mental states is not a research program for comparative cognition. Behav. Brain Sci., 30(3):332–333, June 2007.
[14] Nicholas J Mulcahy and Josep Call. Apes save tools for future use. Science, 312(5776):1038–1040, May 2006.
[15] A M P von Bayern, S Danel, A M I Auersperg, B Mioduszewska, and A Kacelnik. Compound tool construction by new caledonian crows. Sci. Rep., 8(1):15676, October 2018.
[16] Nicola S Clayton, Timothy J Bussey, and Anthony Dickinson. Can animals recall the past and plan for the future? Nat. Rev. Neurosci., 4(8):685–691, August 2003.
[17] Nicola S Clayton, Joanna Dally, James Gilbert, and Anthony Dickinson. Food caching by western scrub-jays (aphelocoma californica) is sensitive to the conditions at recovery. J. Exp. Psychol. Anim. Behav.
Process., 31(2):115–124, April 2005. [18] Selvino R de Kort, Sérgio P C Correia, Dean M Alexis, Anthony Dickinson, and Nicola S Clayton. The control of food-caching behavior by western scrub-jays (aphelocoma californica). J. Exp. Psychol. Anim. Behav. Process., 33(4):361–370, October 2007. [19] Nathan J Emery and Nicola S Clayton. Effects of experience and social context on prospective caching strategies by scrub jays. Nature, 414(6862):443–446, November 2001. [20] Nathaniel H Hunt, Judy Jinn, Lucia F Jacobs, and Robert J Full. Acrobatic squirrels learn to leap and land on tree branches without falling. Science, 373(6555):697–700, August 2021. [21] A David Redish. Vicarious trial and error. Nat. Rev. Neurosci., 17(3):147–159, March 2016. [22] Alexandra O Constantinescu, Jill X O’Reilly, and Timothy E J Behrens. Organizing conceptual knowledge in humans with a gridlike code. Science, 352(6292):1464–1468, June 2016. [23] James C R Whittington, Timothy H Muller, Shirley Mark, Guifen Chen, Caswell Barry, Neil Burgess, and Timothy E J Behrens. The Tolman-Eichenbaum machine: Unifying space and relational memory through generalization in the hippocampal formation. Cell, 183(5):1249–1263.e23, November 2020. [24] Randolf Menzel. A short history of studies on intelligence and brain in honeybees. Apidologie, 52(1):23–34, February 2021. [25] Felicity Muth, Daniel R Papaj, and Anne S Leonard. Colour learning when foraging for nectar and 19 pollen: bees learn two colours at once. Biol. Lett., 11(9):20150628, September 2015. [26] Felicity Muth, Daniel R Papaj, and Anne S Leonard. Bees remember flowers for more than one reason: pollen mediates associative learning. Anim. Behav., 111:93–100, January 2016. [27] Felicity Muth, Tamar Keasar, and Anna Dornhaus. Trading off short-term costs for long-term gains: how do bumblebees decide to learn morphologically complex flowers? Anim. Behav., 101: 191–199, March 2015. [28] Tom B Brown et al. Language Models are Few-Shot Learners, May 2020. [29] G Fiorito and P Scotto. Observational learning in octopus vulgaris. Science, 256(5056):545–547, April 1992. [30] M Chirimuuta. Marr, mayr, and MR: What functionalism should now be about. Philos. Psychol., 31(3):403–418, April 2018. [31] Thomas W Polger and Lawrence A Shapiro. The Multiple Realization Book. Oxford University Press, September 2016. [32] V Glivenko. Sulla determinazione empirica delle leggi di probabilita. Gion. Ist. Ital. Attauri., 4: 92–99, 1933. [33] Francesco Paolo Cantelli. Sulla determinazione empirica delle leggi di probabilita. Giorn. Ist. Ital. Attuari, 4(421-424), 1933. [34] V Vapnik and A Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory Probab. Appl., 16(2):264–280, January 1971. [35] L G Valiant. A theory of the learnable. Commun. ACM, 27(11):1134–1142, November 1984. [36] A Krizhevsky, I Sutskever, and G E Hinton. Imagenet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst., 2012. [37] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL-HLT, January 2019. [38] Scott Mayer McKinney et al. International evaluation of an AI system for breast cancer screening. Nature, 577(7788):89–94, January 2020. [39] Andrew W Senior et al. Improved protein structure prediction using potentials from deep learning. Nature, January 2020. [40] Nick Statt. 
OpenAI’s dota 2 AI steamrolls world champion e-sports team with back-to-back victories. https://www.theverge.com/2019/4/13/18309459/ openai-five-dota-2-finals-ai-bot-competition-og-e-sports-the-international-champion, April 2019. Accessed: 2020-1-25. [41] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel, and Demis Hassabis. Mastering the game of go without human knowledge. Nature, 550(7676):354–359, October 2017. [42] David J Hand. Classifier technology and the illusion of progress. Statistical Science, 21(1):1–14, February 2006. [43] Elan Rosenfeld, Pradeep Kumar Ravikumar, and Andrej Risteski. The risks of invariant risk minimization, September 2020. [44] Martin Arjovsky. Out of distribution generalization in machine learning, March 2021. [45] Ali Geisa, Ronak Mehta, Hayden S Helm, Jayanta Dey, Eric Eaton, Jeffery Dick, Carey E Priebe, and Joshua T Vogelstein. Towards a theory of out-of-distribution learning, September 2021. [46] Stevo Bozinovski and Ante Fulgosi. The influence of pattern similarity and transfer learning upon the training of a base perceptron B2. Proceedings of Symposium Informatica, pages 3–121–5, 20 1976. [47] Samy Bengio, Yoshua Bengio, Jocelyn Cloutier, and Jan Gecsei. On the optimization of a synaptic learning rule. In Preprints Conf. Optimality in Artificial and Biological Neural Networks, volume 2. researchgate.net, 1992. [48] Rich Caruana. Multitask learning. Mach. Learn., 28(1):41–75, July 1997. [49] Jonathan Baxter. A model of inductive bias learning. J. Artif. Intell. Res., 12(1):149–198, March 2000. [50] Jane X Wang. Meta-learning in natural and artificial intelligence, November 2020. [51] M B Ring. Continual learning in reinforcement environments. PhD thesis, University of Texas at Austin, 1994. [52] Sebastian Thrun and Tom M Mitchell. Lifelong robot learning. Rob. Auton. Syst., 15(1):25–46, July 1995. [53] Robin Jaulmes, Joelle Pineau, and Doina Precup. Learning in non-stationary partially observable markov decision processes. In ECML Workshop on Reinforcement Learning in non-stationary environments, volume 25, pages 26–32. ias.informatik.tu-darmstadt.de, 2005. [54] Joan Baez. No woman no cry, 1979. [55] Yoshua Bengio, Mehdi Mirza, Ian Goodfellow, Aaron Courville, and Xia Da. An empirical investigation of catastrophic forgeting in Gradient-Based neural networks, December 2013. [56] Vinay V Ramasesh, Ethan Dyer, and Maithra Raghu. Anatomy of catastrophic forgetting: Hidden representations and task semantics, July 2020. [57] Sebastian Thrun. Is learning the n-th thing any easier than learning the first? In D S Touretzky, M C Mozer, and M E Hasselmo, editors, Advances in Neural Information Processing Systems 8, pages 640–646. MIT Press, 1996. [58] Paul Ruvolo and Eric Eaton. ELLA: An efficient lifelong learning algorithm. In International Conference on Machine Learning, volume 28, pages 507–515, February 2013. [59] Joshua T Vogelstein et al. Ensembling Representations for Synergistic Lifelong Learning with Quasilinear Complexity, April 2020. [60] Alex Boulton and Tom Cobb. Corpus use in language learning: A meta-analysis. Lang. Learn., 67(2):348–393, June 2017. [61] Reuven Dukas. Transfer and interference in bumblebee learning. Anim. Behav., 49(6):1481– 1490, June 1995. [62] M E Bouton. 
Context, time, and memory retrieval in the interference paradigms of pavlovian learning. Psychol. Bull., 114(1):80–99, July 1993. [63] Mark E Bouton. Extinction of instrumental (operant) learning: interference, varieties of context, and mechanisms of contextual control. Psychopharmacology, 236(1):7–19, January 2019. [64] H F Harlow. The formation of learning sets. Psychol. Rev., 56(1):51–65, January 1949. [65] Michelle J Spierings and Carel Ten Cate. Budgerigars and zebra finches differ in how they generalize in an artificial grammar learning experiment. Proc. Natl. Acad. Sci. U. S. A., 113(27): E3977–84, July 2016. [66] V Samborska, J L Butler, M E Walton, T E J Behrens, and others. Complementary task representations in hippocampus and prefrontal cortex for generalising the structure of problems. bioRxiv, 2021. [67] Khurram Javed and Martha White. Meta-Learning representations for continual learning, May 2019. [68] German I Parisi, Ronald Kemker, Jose L Part, Christopher Kanan, and Stefan Wermter. Continual 21 lifelong learning with neural networks: A review. Neural Netw., 113:54–71, May 2019. [69] Richard Kurle, Botond Cseke, Alexej Klushyn, Patrick van der Smagt, and Stephan Günnemann. Continual learning with bayesian neural networks for Non-Stationary data, September 2019. [70] Annie Xie, James Harrison, and Chelsea Finn. Deep reinforcement learning amidst lifelong NonStationarity, June 2020. [71] Steven Pinker. The Blank Slate: The Modern Denial of Human Nature. Penguin Books, reprint edition edition, August 2003. [72] Nicolo Cesa-Bianchi and Gabor Lugosi. Prediction, Learning, and Games. Cambridge University Press, March 2006. [73] Alexander Rakhlin and Karthik Sridharan. Statistical Learning and Sequential Prediction. Massachusetts Institute of Technology, September 2014. [74] Moritz Hardt and Benjamin Recht. Patterns, predictions, and actions: A story about machine learning. https://mlstory.org, 2021. [75] Jürgen Schmidhuber. A possibility for implementing curiosity and boredom in model-building neural controllers. In Proc. of the international conference on simulation of adaptive behavior: From animals to animats, pages 222–227. mediatum.ub.tum.de, 1991. [76] Richard S Sutton and Andrew G Barto. Introduction to Reinforcement Learning. Camgridge: MIT Press, March 1998. [77] Khimya Khetarpal, Matthew Riemer, Irina Rish, and Doina Precup. Towards continual reinforcement learning: A review and perspectives, December 2020. [78] Alex Konkel and Neal J Cohen. Relational memory and the hippocampus: representations and methods. Front. Neurosci., 3(2):166–174, September 2009. [79] Marc W Howard and Howard Eichenbaum. Time and space in the hippocampus. Brain Res., 1621:345–354, September 2015. [80] M Giurfa, S Zhang, A Jenett, R Menzel, and M V Srinivasan. The concepts of ’sameness’ and ’difference’ in an insect. Nature, 410(6831):930–933, April 2001. [81] Takeshi Atsumi, Masakazu Ide, and Makoto Wada. Spontaneous discriminative response to the biological motion displays involving a walking conspecific in mice. Front. Behav. Neurosci., 12: 263, November 2018. [82] Steven L Franconeri and Daniel J Simons. Moving and looming stimuli capture attention. Percept. Psychophys., 65(7):999–1010, October 2003. [83] Samantha M W Wood and Justin N Wood. One-shot object parsing in newborn chicks. J. Exp. Psychol. Gen., March 2021. [84] Elizabeth S Spelke. Principles of object perception. Cogn. Sci., 14(1):29–56, January 1990. [85] Morton J Mendelson, Marshall M Haith, and Patricia S Goldman-Rakic. 
Face scanning and responsiveness to social cues in infant rhesus monkeys. Dev. Psychol., 18(2):222–228, March 1982. [86] M Tomasello, J Call, and B Hare. Five primate species follow the visual gaze of conspecifics. Anim. Behav., 55(4):1063–1069, April 1998. [87] Shimon Ullman, Daniel Harari, and Nimrod Dorfman. From simple innate biases to complex visual concepts. Proc. Natl. Acad. Sci. U. S. A., 109(44):18215–18220, October 2012. [88] Lindsey J Powell, Heather L Kosakowski, and Rebecca Saxe. Social origins of cortical face areas. Trends Cogn. Sci., 22(9):752–763, September 2018. [89] Anthony M Zador. A critique of pure learning and what artificial neural networks can learn from animal brains. Nat. Commun., 10(1):3770, August 2019. 22 [90] Izzeddin Gur, Natasha Jaques, Yingjie Miao, Jongwook Choi, Manoj Tiwari, Honglak Lee, and Aleksandra Faust. Environment generation for zero-shot compositional reinforcement learning. Adv. Neural Inf. Process. Syst., 34, December 2021. [91] Yann LeCun, Patrick Haffner, Léon Bottou, and Yoshua Bengio. Object recognition with GradientBased learning. In Shape, Contour and Grouping in Computer Vision, page 319, Berlin, Heidelberg, January 1999. Springer-Verlag. [92] Michael R Ibbotson, NSC Price, and Nathan A Crowder. On the division of cortical cells into simple and complex types: a comparative viewpoint. Journal of neurophysiology, 93(6):3699– 3702, 2005. [93] Emmy Noether. Invariante variationsprobleme, math-phys. Klasse, pp235-257, 1918. [94] Soledad Villar, David W Hogg, Kate Storey-Fisher, Weichi Yao, and Ben Blum-Smith. Scalars are universal: Equivariant machine learning, structured like classical physics. In Thirty-Fifth Conference on Neural Information Processing Systems, 2021. [95] Risi Kondor. N-body networks: a covariant hierarchical neural network architecture for learning atomic potentials. arXiv preprint arXiv:1803.01588, 2018. [96] Taco Cohen and Max Welling. Group equivariant convolutional networks. In International conference on machine learning, pages 2990–2999. PMLR, 2016. [97] Marc Finzi, Max Welling, and Andrew Gordon Wilson. A practical method for constructing equivariant multilayer perceptrons for arbitrary matrix groups. arXiv preprint arXiv:2104.09459, 2021. [98] Alberto Bietti, Luca Venturi, and Joan Bruna. On the sample complexity of learning with geometric stability. arXiv preprint arXiv:2106.07148, 2021. [99] Song Mei, Theodor Misiakiewicz, and Andrea Montanari. Learning with invariances in random features and kernel models, 2021. [100] Gregory Benton, Marc Finzi, Pavel Izmailov, and Andrew Gordon Wilson. Learning invariances in neural networks. arXiv preprint arXiv:2010.11882, 2020. [101] Evangelos Chatzipantazis, Stefanos Pertigkiozoglou, Kostas Daniilidis, and Edgar Dobriban. Learning augmentation distributions using transformed risk minimization. arXiv preprint arXiv:2111.08190, 2021. [102] Kim Junkyung, Ricci Matthew, and Serre Thomas. Not-So-CLEVR: learning same–different relations strains feedforward neural networks. Interface Focus, 8(4):20180011, August 2018. [103] Howard G Tucker. A generalization of the Glivenko-Cantelli theorem. Ann. Math. Stat., 30(3): 828–830, 1959. [104] Luc Devroye. On arbitrarily slow rates of global convergence in density estimation. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 62(4):475–483, December 1983. [105] A Antos, L Devroye, and L Gyorfi. Lower bounds for bayes error estimation. IEEE Trans. Pattern Anal. Mach. Intell., 21(7):643–645, July 1999. [106] David H Wolpert. 
The supervised learning No-Free-Lunch theorems. In Rajkumar Roy, Mario Köppen, Seppo Ovaska, Takeshi Furuhashi, and Frank Hoffmann, editors, Soft Computing and Industry: Recent Applications, pages 25–42. Springer London, London, 2002. [107] Erik J Peterson and Timothy D Verstynen. A way around the exploration-exploitation dilemma, June 2019. [108] Robert B Pippin. Hegel’s Practical Philosophy: Rational Agency as Ethical Life. Cambridge University Press, 2008. [109] Emily Sumner, Amy X Li, Amy Perfors, Brett Hayes, Danielle Navarro, and Barbara W Sarnecka. The exploration advantage: Children’s instinct to explore allows them to find information that 23 adults miss. psyarxiv.com, 2019. [110] Eliza Kosoy, Jasmine Collins, David M Chan, Sandy Huang, Deepak Pathak, Pulkit Agrawal, John Canny, Alison Gopnik, and Jessica B Hamrick. Exploring exploration: Comparing children with RL agents in unified environments, May 2020. [111] Emily G Liquin and Alison Gopnik. Children are more exploratory and learn more than adults in an approach-avoid task. Cognition, 218:104940, October 2021. [112] Andrew Gelman and Guido Imbens. Why ask why? forward causal inference and reverse causal questions, November 2013. [113] George Loewenstein. The psychology of curiosity: A review and reinterpretation. Psychol. Bull., 116(1):75–98, July 1994. [114] Lily FitzGibbon, Henrike Moll, Julia Carboni, Ryan Lee, and Morteza Dehghani. Counterfactual curiosity in preschool children. J. Exp. Child Psychol., 183:146–157, July 2019. [115] Maya Zhe Wang and Benjamin Y Hayden. Monkeys are curious about counterfactual outcomes, 2019. [116] D E Berlyne. The arousal and satiation of perceptual curiosity in the rat. J. Comp. Physiol. Psychol., 48(4):238–246, August 1955. [117] W N Dember and R W Earl. Analysis of exploratory, manipulatory, and curiosity behaviors. Psychol. Rev., 64(2):91–96, March 1957. [118] Junyi Chu and Laura E Schulz. Play, curiosity, and cognition. Annual Review of Developmental Psychology, December 2020. [119] Michael J Kuba, Tamar Gutnick, and Gordon M Burghardt. Learning from play in octopus. Cephalopod Cognition; Darmaillacq, A. -S. , Dickel, L. , Mather, J. , Eds, pages 57–67, 2014. [120] Adam J Calhoun, Sreekanth H Chalasani, and Tatyana O Sharpee. Maximally informative foraging by caenorhabditis elegans. Elife, 3, December 2014. [121] Maria-Florina Balcan, Alina Beygelzimer, and John Langford. Agnostic active learning. J. Comput. System Sci., 75(1):78–89, January 2009. [122] Maria-Florina Balcan, Steve Hanneke, and Jennifer Wortman Vaughan. The true sample complexity of active learning. Mach. Learn., 80(2):111–139, September 2010. [123] Kenji Kobayashi and Ming Hsu. Common neural code for reward and information value. Proc. Natl. Acad. Sci. U. S. A., 116(26):13061–13066, June 2019. [124] R Bellman. DYNAMIC PROGRAMMING AND LAGRANGE MULTIPLIERS. Proc. Natl. Acad. Sci. U. S. A., 42(10):767–769, October 1956. [125] David Hume. An abstract of a treatise of human nature, 1739. [126] Christoph J Völter and Josep Call. Causal and inferential reasoning in animals. In Josep Call, editor, APA handbook of comparative psychology: Perception, learning, and cognition, Vol, volume 2, pages 643–671. American Psychological Association, xiii, Washington, DC, US, 2017. [127] Josep Call. Inferences about the location of food in the great apes (pan paniscus, pan troglodytes, gorilla gorilla, and pongo pygmaeus). J. Comp. Psychol., 118(2):232–241, June 2004. [128] Lisa A Heimbauer, Rebecca L Antworth, and Michael J Owren. 
Capuchin monkeys (cebus apella) use positive, but not negative, auditory cues to infer food location. Anim. Cogn., 15(1):45–55, January 2012.
[129] Christian Nawroth and Eberhard von Borell. Domestic pigs' (sus scrofa domestica) use of direct and indirect visual and auditory cues in an object choice task. Anim. Cogn., 18(3):757–766, May 2015.
[130] Christian Schloegl, Judith Schmidt, Markus Boeckle, Brigitte M Weiß, and Kurt Kotrschal. Grey parrots use inferential reasoning based on acoustic cues alone. Proc. Biol. Sci., 279(1745):4135–4142, October 2012.
[131] Dorothy L Cheney and Robert M Seyfarth. How Monkeys See the World: Inside the Mind of Another Species. University of Chicago Press, 1990.
[132] Emma Collier-Baker, Joanne M Davis, and Thomas Suddendorf. Do dogs (canis familiaris) understand invisible displacement? J. Comp. Psychol., 118(4):421–433, December 2004.
[133] Katie E Slocombe, Tanja Kaller, Josep Call, and Klaus Zuberbühler. Chimpanzees extract social information from agonistic screams. PLoS One, 5(7):e11473, July 2010.
[134] D L Cheney, R M Seyfarth, and J B Silk. The responses of female baboons (papio cynocephalus ursinus) to anomalous social interactions: evidence for causal reasoning? J. Comp. Psychol., 109(2):134–141, June 1995.
[135] Jorg J M Massen, Andrius Pašukonis, Judith Schmidt, and Thomas Bugnyar. Ravens notice dominance reversals among conspecifics within and outside their social group. Nat. Commun., 5:3679, April 2014.
[136] H Davis. Transitive inference in rats (rattus norvegicus). J. Comp. Psychol., 106(4):342–349, December 1992.
[137] William A Roberts and Maria T Phelps. Transitive inference in rats: A test of the spatial coding hypothesis. Psychol. Sci., 5(6):368–374, November 1994.
[138] Colin Allen. Transitive inference in animals: Reasoning or conditioned associations. Rational animals, pages 175–185, 2006.
[139] Elizabeth A Tibbetts, Jorge Agudelo, Sohini Pandit, and Jessica Riojas. Transitive inference in polistes paper wasps. Biol. Lett., 15(5):20190015, May 2019.
[140] Jonas Peters, Peter Bühlmann, and Nicolai Meinshausen. Causal inference by using invariant prediction: identification and confidence intervals. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78(5):947–1012, October 2016.
[141] Liam Rose, Laura Graham, Allison Koenecke, Michael Powell, Ruoxuan Xiong, Zhu Shen, Kenneth W Kinzler, Chetan Bettegowda, Bert Vogelstein, Maximilian F Konig, Susan Athey, Joshua T Vogelstein, and Todd H Wagner. The association between alpha-1 adrenergic receptor antagonists and in-hospital mortality from COVID-19. medRxiv, December 2020.
[142] Bernhard Schölkopf. Causality for machine learning, November 2019.
[143] Rama K Vasudevan, Maxim Ziatdinov, Lukas Vlcek, and Sergei V Kalinin. Off-the-shelf deep learning is not enough, and requires parsimony, bayesianity, and causality. npj Computational Materials, 7(1):1–6, January 2021.
[144] Bernhard Schölkopf, Francesco Locatello, Stefan Bauer, Nan Rosemary Ke, Nal Kalchbrenner, Anirudh Goyal, and Yoshua Bengio. Towards causal representation learning, February 2021.
[145] Sebastian Seung. Connectome: How the Brain's Wiring Makes Us Who We Are. Houghton Mifflin Harcourt, 2012.
[146] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, November 2016.
[147] Judea Pearl. Bayesian networks, 2011.
[148] Avanti Athreya, Donniell E Fishkind, Minh Tang, Carey E Priebe, Youngser Park, Joshua T Vogelstein, Keith Levin, Vince Lyzinski, Yichen Qin, and Daniel L Sussman. Statistical inference on random dot product graphs: a survey. Journal of Machine Learning Research, 18(226):1–92, 2017.
[149] Carey E Priebe, Cencheng Shen, Ningyuan Huang, and Tianyi Chen. A simple spectral failure mode for graph convolutional networks. IEEE Trans. Pattern Anal. Mach. Intell., PP, August 2021.
[150] Keyulu Xu*, Weihua Hu*, Jure Leskovec, and Stefanie Jegelka. How Powerful are Graph Neural Networks? In International Conference on Learning Representations, September 2018.
[151] Charles J Stone. Consistent Nonparametric Regression. Annals of Statistics, 5(4):595–620, July 1977.
[152] Ben Sorscher, Surya Ganguli, and Haim Sompolinsky. The geometry of concept learning. bioRxiv, 2021.
[153] Chris R Sims. Efficient coding explains the universal law of generalization in human perception. Science, 360(6389):652–656, 2018.
[154] Carey E Priebe, Joshua T Vogelstein, Florian Engert, and Christopher M White. Modern Machine Learning: Partition & Vote, September 2020.
[155] Jonathan Frankle and Michael Carbin. The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks. In International Conference on Learning Representations, September 2018.
[156] Luc Devroye, Laszlo Györfi, and Gabor Lugosi. A probabilistic theory of pattern recognition. Stochastic Modelling and Applied Probability. Springer, New York, NY, 1996 edition, November 2013.
[157] Haoyin Xu et al. When are deep networks really better than decision forests at small sample sizes, and how?, August 2021.
[158] Guido F Montufar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. On the number of linear regions of deep neural networks. In Z Ghahramani, M Welling, C Cortes, N D Lawrence, and K Q Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2924–2932. Curran Associates, Inc., 2014.
[159] Raja Giryes, Guillermo Sapiro, and Alex M Bronstein. Deep Neural Networks with Random Gaussian Weights: A Universal Classification Strategy? IEEE Trans. Signal Process., 64(13):3444–3457, July 2016.
[160] Leo Breiman. Random Forests. Mach. Learn., 45(1):5–32, October 2001.
[161] Robert E Schapire. The strength of weak learnability. Mach. Learn., 5(2):197–227, 1990.
[162] Manuel Fernández-Delgado, Eva Cernadas, Senén Barro, and Dinani Amorim. Do we need hundreds of classifiers to solve real world classification problems. J. Mach. Learn. Res., 15(1):3133–3181, 2014.
[163] R M French. Catastrophic forgetting in connectionist networks. Trends Cogn. Sci., 3(4):128–135, April 1999.
[164] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. Overcoming catastrophic forgetting in neural networks. Proc. Natl. Acad. Sci. U. S. A., 114(13):3521–3526, March 2017.
[165] Bernhard Schölkopf and Alexander J Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2002.
[166] Robert C Wilson, Elizabeth Bonawitz, Vincent D Costa, and R Becket Ebitz.
Balancing exploration and exploitation with information and randomization. Curr Opin Behav Sci, 38:49–56, April 2021. [167] David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. Deterministic policy gradient algorithms. In Eric P Xing and Tony Jebara, editors, Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pages 387–395, Bejing, China, 2014. PMLR. 26 [168] David Lopez-Paz, Robert Nishihara, Soumith Chintala, Bernhard Scholkopf, and Léon Bottou. Discovering causal signals in images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6979–6987. openaccess.thecvf.com, 2017. [169] Demis Hassabis, Dharshan Kumaran, Christopher Summerfield, and Matthew Botvinick. Neuroscience-Inspired artificial intelligence. Neuron, 95(2):245–258, July 2017. [170] Adam H Marblestone, Greg Wayne, and Konrad P Kording. Toward an integration of deep learning and neuroscience. Frontiers in computational neuroscience, 10:94, 2016. [171] Warren S McCulloch and Walter Pitts. A logical calculus of the ideas immanent in nervous activity. Bull. Math. Biophys., 5(4):115–133, December 1943. [172] F Rosenblatt. The perceptron: a probabilistic model for information storage and organization in the brain. Psychol. Rev., 65(6):386–408, November 1958. [173] Blake A Richards, Timothy P Lillicrap, Philippe Beaudoin, Yoshua Bengio, Rafal Bogacz, Amelia Christensen, Claudia Clopath, Rui Ponte Costa, Archy de Berker, Surya Ganguli, et al. A deep learning framework for neuroscience. Nature neuroscience, 22(11):1761–1770, 2019. [174] Olivia Guest and Andrea E Martin. How computational modeling can force theory building in psychological science. Perspect. Psychol. Sci., 16(4):789–802, July 2021. [175] A M Turing. COMPUTING MACHINERY AND INTELLIGENCE. Mind, LIX(236):433–460, October 1950. [176] R Jacob Vogelstein, Udayan Mallik, Joshua T Vogelstein, and Gert Cauwenberghs. Dynamically reconfigurable silicon array of spiking neurons with conductance-based synapses. IEEE Trans. Neural Netw., 18(1):253–265, January 2007. [177] M Davies. Progress in Neuromorphic Computing : Drawing Inspiration from Nature for Gains in AI and Computing. In 2019 International Symposium on VLSI Design, Automation and Test (VLSI-DAT), pages 1– 1, April 2019. [178] G A Silva, A R Muotri, and C White. Understanding the human brain using brain organoids and a Structure-Function theory, 2020. [179] Andrew G Barto, Richard S Sutton, and Charles W Anderson. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Trans. Syst. Man Cybern., SMC-13(5): 834–846, September 1983. [180] Simon D Fisher, Paul B Robertson, Melony J Black, Peter Redgrave, Mark A Sagar, Wickliffe C Abraham, and John N J Reynolds. Reinforcement determines the timing dependence of corticostriatal synaptic plasticity in vivo. Nat. Commun., 8(1):334, August 2017. [181] Tomomi Shindou, Mayumi Shindou, Sakurako Watanabe, and Jeffery Wickens. A silent eligibility trace enables dopamine-dependent synaptic plasticity for reinforcement learning in the mouse striatum. Eur. J. Neurosci., 49(5):726–736, March 2019. [182] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. 
Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, February 2015. [183] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel, and Demis Hassabis. Mastering the game of Go without human knowledge. Nature, 550(7676):354–359, October 2017. 27 [184] Oriol Vinyals et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575(7782):350–354, November 2019. [185] Max Jaderberg et al. Human-level performance in 3D multiplayer games with population-based reinforcement learning. Science, 364(6443):859–865, May 2019. [186] Will Dabney, Zeb Kurth-Nelson, Naoshige Uchida, Clara Kwon Starkweather, Demis Hassabis, Rémi Munos, and Matthew Botvinick. A distributional code for value in dopamine-based reinforcement learning. Nature, 577(7792):671–675, January 2020. [187] K Shao, Z Tang, Y Zhu, N Li, and D Zhao. A survey of deep reinforcement learning in video games. arXiv preprint arXiv:1912.10944, 2019. [188] Erik Peterson, Necati Alp Müyesser, Timothy Verstynen, and Kyle Dunovan. Combining imagination and heuristics to learn strategies that generalize. Neurons, Behavior, Data analysis, and Theory, 3(4):1–19, 2020. [189] Arghyadeep Das, Vedant Shroff, Avi Jain, and Grishma Sharma. Knowledge transfer between similar atari games using deep Q-Networks to improve performance. In 2021 12th International Conference on Computing Communication and Networking Technologies (ICCCNT), pages 1–8. ieeexplore.ieee.org, July 2021. [190] Andrew G Barto and Sridhar Mahadevan. Recent advances in hierarchical reinforcement learning. Discrete event dynamic systems, 13(1):41–77, 2003. [191] Pierre-Luc Bacon, Jean Harb, and Doina Precup. The Option-Critic architecture. AAAI, 31(1), February 2017. [192] Kevin Frans, Jonathan Ho, Xi Chen, Pieter Abbeel, and John Schulman. Meta learning shared hierarchies. arXiv preprint arXiv:1710.09767, 2017. [193] Alexander Sasha Vezhnevets, Simon Osindero, Tom Schaul, Nicolas Heess, Max Jaderberg, David Silver, and Koray Kavukcuoglu. Feudal networks for hierarchical reinforcement learning. In International Conference on Machine Learning, pages 3540–3549. PMLR, 2017. [194] Elias Bareinboim. Causal reinforcement learning. https://crl.causalai.net/, 2022. Accessed: 20221-12. [195] Rosemary Nan Ke, Anirudh Goyal, Jane Wang, Stefan Bauer, Silvia Chiappa, Jovana Mitrovic, Theophane Weber, and Danilo Rezende. Causal learning for decision making (CLDM). https: //causalrlworkshop.github.io/, 2022. Accessed: 2022-1-12. [196] Aurelien Bibaut, Maria Dimakopoulou, Nathan Kallus, Xinkun Nie, Masatoshi Uehara, and Kelly Zhang. Causal sequential decision making workshop. https://nips.cc/Conferences/2021/ ScheduleMultitrack?event=21863, 2022. Accessed: 2022-1-12. [197] Benjamin Eysenbach, Abhishek Gupta, Julian Ibarz, and Sergey Levine. Diversity is all you need: Learning skills without a reward function. arXiv preprint arXiv:1802.06070, 2018. [198] Deepak Pathak, Dhiraj Gandhi, and Abhinav Gupta. Self-supervised exploration via disagreement. In International conference on machine learning, pages 5062–5071. PMLR, 2019. [199] Archit Sharma, Shixiang Gu, Sergey Levine, Vikash Kumar, and Karol Hausman. Dynamicsaware unsupervised discovery of skills. arXiv preprint arXiv:1907.01657, 2019. [200] Iris van Rooij and Giosuè Baggio. 
Theory before the test: How to build High-Verisimilitude explanatory theories in psychological science. Perspect. Psychol. Sci., 16(4):682–697, July 2021. [201] Mark K Singley and John Robert Anderson. The Transfer of Cognitive Skill. Harvard University Press, 1989. [202] Eric Schulz, Joshua B Tenenbaum, David Duvenaud, Maarten Speekenbrink, and Samuel J Gershman. Probing the compositionality of intuitive functions. CBMM, May 2016. 28 [203] Alison Gopnik, Clark Glymour, David M Sobel, Laura E Schulz, Tamar Kushnir, and David Danks. A theory of causal learning in children: causal maps and bayes nets. Psychol. Rev., 111(1):3–32, January 2004. [204] Alison Gopnik and Elizabeth Bonawitz. Bayesian models of child development. Wiley Interdiscip. Rev. Cogn. Sci., 6(2):75–86, March 2015. [205] Natalia Díaz-Rodríguez, Vincenzo Lomonaco, David Filliat, and Davide Maltoni. Don’t forget, there is more than forgetting: new metrics for Continual Learning, October 2018. [206] Melanie Mitchell. On Crashing the Barrier of Meaning in Artificial Intelligence. AI Magazine, 41 (2):86–92, 2020. [207] Rich Sutton. The bitter lesson. http://www.incompleteideas.net/IncIdeas/BitterLesson.html, March 2019. Accessed: 2021-3-16. [208] Tom M Mitchell. Machine Learning. McGraw-Hill Education, 1 edition edition, March 1997. [209] Cassio Pennachin and Ben Goertzel. Contemporary approaches to artificial general intelligence. In Artificial general intelligence, pages 1–30. Springer, 2007. [210] Deepak Pathak, Pulkit Agrawal, Alexei A Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. In International conference on machine learning, pages 2778–2787. PMLR, 2017. [211] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014. [212] Yoshua Bengio. The consciousness prior, September 2017. [213] David Lazer, Ryan Kennedy, Gary King, and Alessandro Vespignani. The Parable of Google Flu: Traps in Big Data Analysis. Science, March 2014. [214] Solon Barocas and Andrew D Selbst. Big data’s disparate impact. Calif. Law Rev., 104:671, 2016. [215] Cathy O’Neil. Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy. Crown, September 2016. [216] James Zou and Londa Schiebinger. AI can be sexist and racist — it’s time to make it fair, 2018. [217] Angharad N Valdivia. Algorithms of Oppression: How Search Engines Reinforce Racism by Safiya Umoja Noble (review). Feminist Formations, 30(3):217–220, 2018. [218] Sam Devlin, Raluca Georgescu, Ida Momennejad, Jaroslaw Rzepecki, Evelyn Zuniga, Gavin Costello, Guy Leroy, Ali Shaw, and Katja Hofmann. Navigation Turing Test (NTT): Learning to evaluate human-like navigation * 1. http://proceedings.mlr.press/v139/devlin21a/devlin21a.pdf, 2021. Accessed: 2022-1-8. [219] Takayuki Hosokawa and Masataka Watanabe. Prefrontal neurons represent winning and losing during competitive video shooting games between monkeys. J. Neurosci., 32(22):7662–7671, May 2012. [220] Viktor Tóth. A neuroengineer’s guide on training rats to play doom. https://medium.com/mindsoft/ rats-play-doom-eb0d9c691547, 2020. [221] Abeba Birhane. Algorithmic injustice: a relational ethics approach. Patterns (N Y), 2(2):100205, February 2021. [222] Cathy O’Neil. Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy. Crown, September 2016. 29