Prospective Learning: Back to the Future
arXiv:2201.07372v1 [cs.LG] 19 Jan 2022
The Future Learning Collective
Abstract. Research on both natural intelligence (NI) and artificial intelligence (AI) generally assumes that the future resembles
the past: intelligent agents or systems (what we call ‘intelligence’) observe and act on the world, then use this
experience to act on future experiences of the same kind. We call this ‘retrospective learning’. For example, an
intelligence may see a set of pictures of objects, along with their names, and learn to name them. A retrospective
learning intelligence would merely be able to name more pictures of the same objects. We argue that this is not
what true intelligence is about. In many real world problems, both NIs and AIs will have to learn for an uncertain
future. Both must update their internal models to be useful for future tasks, such as naming fundamentally new
objects and using these objects effectively in a new context or to achieve previously unencountered goals. This
ability to learn for the future we call ‘prospective learning’. We articulate four relevant factors that jointly define
prospective learning. Continual learning enables intelligences to remember those aspects of the past which
they believe will be most useful in the future. Prospective constraints (including biases and priors) help the
intelligence find general solutions that will be applicable to future problems. Curiosity motivates taking actions
that inform future decision making, including in previously unmet situations. Causal estimation enables learning
the structure of relations that guide choosing actions for specific outcomes, even when the specific action-outcome
contingencies have never been observed before. We argue that a paradigm shift from retrospective to prospective
learning will enable the communities that study intelligence to unite and overcome existing bottlenecks to more
effectively explain, augment, and engineer intelligences.
“No man ever steps in the same river twice. For it’s not the same river and he’s not the same man.” Heraclitus
1 Introduction The goal of learning is ultimately about optimizing future performance. Intelligences—
entities with intelligence, be they naturally intelligent (NI) agents or artificially intelligent (AI) systems—are
challenged when the future is different from the past. For cases where the future is just like the past,
what we call retrospective learning, AI has developed exceptionally successful techniques (leveraging
advances in statistics and machine learning) to solve the learning problem. The NI field, which includes
human and non-human animal cognition (and sometimes is interpreted even more broadly), also has
explanations for the process of learning at both algorithmic and implementation levels. Arguably, both
fields have very satisfying theories for many ecologically valid behaviors, such as identifying common
objects, controlling the movement of limbs through space in relatively simple environments, and understanding spoken words. Ongoing theory development in both AI and NI highlights why retrospective
learning is perceived to be so advanced.
However, the most evolutionarily important problems, such as those relating to life-or-death situations, mate choices, or child rearing, have a fundamentally different structure. These tend to be novel
experiences with sparse information, low probability, and high consequential value. Problems with this
structure also sink AI systems, such as when a self-driving car's computer vision system is trained in
cities and fails to recognize a tractor trailer crossing the road in the country [1], or when medical diagnosis
systems are applied to under-represented samples [2]. We argue that many of the most interesting and
important problems for intelligences are those that require being good at identifying apparent ‘unicorn’
experiences, realizing the (potentially complex) ways in which they are similar to past experiences, and
then adjusting appropriately. These problems can happen far in the future, and they require extrapolation well outside of the previously encountered distribution of experiences. We call this phenomenon
prospective learning, which typifies a large and important class of open research problems in both NI
and AI.
Here we make the case that the study of intelligence, in both biological and human-made systems,
needs to go back to where the goal of learning has always been: we need to bring learning back to the
future. If we want to explain and augment animal intelligence, including humans, and to engineer more
intelligent machines, it is time to change how we look at the problem of learning by understanding that
it is fundamentally a future-oriented process. This has deep implications both for how we understand
the origins of behavior and for how we think about the fundamental computational structure of learning in
general.
1.1 What is Prospective Learning? Any intelligence engaged in prospective learning can make
effective decisions in novel scenarios, with minimal data and resources, by extrapolating from the past
and postulating active solutions via an internal model of the world. This allows it to out-compete
other intelligences in its niche by leveraging the four capabilities below.
1. Continual learning enables an intelligence to remember those aspects of the past that it
believes will be relevant in the future [3].
2. Constraints (including biases and priors) facilitate finding general solutions that will apply to
future problems and thus make learning possible. While the general learning problem is intractable (or even
uncomputable) [4], built-in priors, inductive biases, and other constraints regularize the search space [4, 5].
3. Curiosity motivates taking actions that inform future decisions, including in previously unmet
situations. Because intelligences do not know a priori which action-outcome contingencies are currently available, or how previous contingencies have changed, they should explore the world to gather
additional information [6].
4. Causal estimation enables learning the structure of relations that guide choosing actions for
specific outcomes, even when the specific action-outcome contingencies have never been observed
before. Intelligences in the wild are not merely perception machines; rather they take actions to maximize the probability of achieving the most desirable outcomes, and to learn what effects arise from what
causes [7].
What makes these capabilities so critical for prospective learning? First, the above list of four
desiderata yields intelligences that are capable of generalizing and adapting to future environments that
differ from the past in ecologically predictable ways. Although cognitive scientists have not yet explicitly
conducted careful experiments to quantify prospective learning capabilities holistically, there is ample
evidence in the animal kingdom for each of these capabilities, across species, and even across phyla.
We provide such evidence in Section 2. Second, each of the above capabilities, on its own, has already
been recognized as an important open problem in the AI community, which has begun formalizing
the problem and designing solutions to address it. This enables designing and conducting careful
psychophysical experiments to assess and quantify them. In Section 3 we provide a brief background
formalizing retrospective learning, to provide a contrast and show its limitations as compared to a sketch
of what formalizing prospective learning would entail in Section 4. Finally, we conclude in Section 5 by
proposing how modifying our thinking about learning to be for the future will transform natural and
artificial intelligences, as well as society more generally.
1.2 A sketch of our framework Before proceeding any further, we must first provide a simple sketch
to contextualize our thinking of the problem. This is illustrated in Figure 1. We envision an external
world (W , top) and internal model of the world, also called a hypothesis (h, bottom). Both of these are
evolving in time and dependent on the past and each other in various ways. Critically, this partial dependency between the past and a distinct future is the primary difference between prospective learning
Figure 1: Our conceptual understanding of prospective learning. W: world state. X: input (sensory data). Y: output (response/action). h: hypothesis. g: learner. Subscript n: now. Subscript f: future. NOW: the task and context that matters right now. FUTURE: the task and context that will matter at some point of time in the future. Note that g, the learning algorithm, is fixed and unchanging, while continually updating the hypothesis. Black arrows denote inputs and outputs. White arrows indicate conceptual linkages.
and traditional learning approaches (see Section 3 for further details). At any given time (now, indicated
by subscript n), the intelligence receives some input from the world (Xn ). The intelligence contains
a continual learning algorithm (g ). The goal of that algorithm is to leverage the new data to update
the intelligence’s current hypothesis (hn ) without catastrophically forgetting the past, and ideally even
improving upon previously acquired skills or capabilities. The hypothesis that is created is selected from
a constrained set of hypotheses that are inductively biased towards those that also generalize well to
problems the intelligence is likely to confront in the future. Curiosity motivates gathering more information that could be useful now, or in the future. Based on all available information, the intelligence makes
a decision about to respond or act (Y ). Those actions causally impact the world. This process of
acquiring input, learning from the past, updating hypotheses, and acting to gain rewards or information,
is relevant and repeats itself in the far future (indicated by subscript f ).
The central premise of our work here is that by understanding how NIs achieve (or fail at) each of
these four capabilities, and by describing them in terms of an AI formalism, we can overcome existing
bottlenecks in explaining, augmenting, and engineering both natural and artificial intelligences.
2 Evidence for prospective learning in natural intelligences NIs (which we limit here to any organism with a brain or brain-like structure) always learn for the future, because the future is where
survival, competition, feeding, and reproduction happen. That is not to say that NIs are perfect at it,
or even particularly efficient. Rather, we argue that prospective abilities are successful just enough to
bolster evolutionary fitness so as to be reinforced over time. In many ways, brains appear to be explicitly
built (i.e., evolved) to make predictions about the future [8]. NIs learn abstract, transferable knowledge
that can be flexibly applied to new problems, rather than greedily optimizing for the current situation (for
review see Gershman et al. [9], Raby and Clayton [10]). In the field of ecology, this process is part of
what is called ‘future planning’ [11] or ‘prospective cognition’ [10], both of which describe the ability of
animals to engage in ‘mental time travel’ [12] by projecting themselves into the future and imagining
possible events, anticipating unseen challenges. Given our focus on prospective learning, here we will
focus on the learning aspects of future planning and prospective cognition.
While prospective abilities have classically been thought of as a uniquely human trait [13], we now
know that many other NIs can do this. Bonobos and orangutans socially learn to construct tools for
termite fishing, not for immediate use, but instead to carry with them in anticipation of future use [14].
This tool construction and use also extends beyond primates. Corvids collect materials to build tools
that solve novel tasks [15]. This building of novel tools can be seen as a form of prospective learning.
Here the experience of learning novel objects (e.g., pliability of specific twigs, inspecting glass bottles)
transfers to novel applications in the future (e.g., curving a twig to use as a hook to fish out food from
a bottle). It requires that the animal seek out the information (curiosity) to learn the physics of various
objects (causality), both biased and facilitated by internal heuristics that limit the space of hypotheses for
how to configure the tool (constraints), and extend this knowledge to produce new behavioral patterns
(continual learning) within the constraints of the inferred structure of the world.
Another manifestation of future learning is food caching, seen in both mammals (e.g., squirrels
and rats) as well as birds. Western scrub-jays not only have a complex spatial memory of food cache
locations, but can flexibly adapt their caching to anticipate future needs [16]. Experiments on these
scrub-jays have shown that they will stop caching food in locations where it gets degraded by weather or
is stolen by a competitor [17, 18]. Indeed, consistent with the idea that these birds are learning, scrub-jays that are caught stealing food from another's cache (i.e., observed by another jay) will re-store
the stolen food in private, as if aware that the observing animal will take back the food [19]. This
behavior can be considered prospective within a spatial framework, wherein prior experience supports
learning a unique 'cognitive map' that facilitates innovative zero-shot transfer to creatively solve tasks
(e.g., strategic storage and retrieval of food) in a novel future context (i.e., next season) [20]. This
spatial learning likely evolved to help animals solve ethologically critical tasks such as navigation and
foraging, as well as to allow animals to use vicarious trial-and-error to imaginatively simulate multiple
scenarios [21], generalizing previously learned information to a novel context. But such mechanisms
are not limited to navigational problems: many animals have evolved the ability to transform non-spatial
variables into a spatial framework, enabling them to solve a broad set of problems using the same
computational principles and neural dynamics underlying spatial mapping [22, 23]. Finally, as an animal
explores its environment, it can quickly incorporate novel information using the flexibility of the cognitive
map, while leaving existing memories largely intact (i.e., without disrupting the weights of the existing
network).
Learning for the future is seen across phyla as well. Bees (arthropods) can extrapolate environmental cues not only to locate food sources, but also to communicate this location to hivemates via a 'waggle
dance' that indicates where future targets lie with a high degree of accuracy (for review see Menzel [24]). Importantly,
bees can also learn novel color-food associations, identifying the color of new high value nectar sources
via associative learning in novel foraging environments [25, 26]. This ability also extends to learning
novel flower geometries that may indicate high nectar food sources that are then communicated back
to the hive for future visits by other bees [27]. This remarkably sophisticated form of learning for the
future happens in an animal with less than a million neurons, and fewer than 10 billion synapses. In
contrast, modern deep learning systems, such as GPT-3, have over 100 billion synapses [28] and yet
fail at similar forms of sophisticated associative learning. Even in the mollusca phylum, octopuses have
been found to perform observational learning, with single shot accuracy at selecting novel objects by
simply watching another octopus perform the task [29]. This rapid form of (continual) learning allows the
animal to effectively use an object it has never seen before in new situations (constraints and causality),
simply by choosing to play with it (curiosity). Thus, it is learning for the future.
Prospective learning has, thus, a very long evolutionary history. Given that arthropods, mollusks,
and chordates diverged in evolution 500 million years ago, the observation of prospective learning
abilities across these phyla suggests one of two possibilities: 1) prospective learning is an evolutionarily old capacity with a shared neural implementation that exists in very simple nervous systems (and
scales with evolution), or 2) prospective learning has independently evolved multiple times with different
implementation-level mechanisms (a.k.a., multiple realizability [30, 31]). These two possibilities have
different implications for the types of experiments that would inform our understanding of prospective
learning in NIs and how we can implement it in artificial systems (see § 5 for further details).
3 The traditional approach to (retrospective) learning. The standard machine learning (ML) formalism dates back to the 1920s, when Fisher wrote the first statistics textbook. In it, he states that
“statistics may be regarded as . . . the study of methods of the reduction of data.” In other words, he established statistics to describe the past, not predict the future. Shortly thereafter, Glivenko and Cantelli
established the fundamental theorem of pattern recognition: given enough data from some distribution,
one can eventually estimate any parameter of that distribution [32, 33]. Vapnik and Chervonenkis [34]
and then Valiant [35] rediscovered and further elaborated upon these ideas, leading to nearly all of the
advancements of modern ML and AI. Here we will highlight the structure of this standard framework for
understanding learning as used in AI.
As above (Section 1.2), let X be the input to our intelligence, (e.g., sensory data) and Y be its output
(e.g., an action). We assume those data are sampled from some distribution PX,Y that encapsulates
some true but unknown properties of the world. For brevity, we allow PX,Y to also incorporate the
causal graph, rather than merely the probabilistic distribution. Let n denote the nth experience, the
one that is happening right now. In the classical form of the problem, a learning algorithm g (which
we hereafter refer to as a 'learner') takes in the current data sample $S := \{(X_i, Y_i)\}_{i=1}^{n}$ and outputs a
hypothesis hn ∈ H, where the hypothesis hn : X → Y chooses a response based on the input. The
nth sample corresponds to ‘now’, as described in Section 1.2. The learner chooses a hypothesis, often
by optimizing a loss function ℓ that compares the predicted output of any hypothesis, h(X), with the
(sometimes unobserved) ground-truth output, Y: ℓ(h(X), Y). The goal of the learner is to minimize
risk, which is often defined as the expected loss, by integrating over all possible test datasets:
$$R(h) = \underbrace{\int_{(\mathcal{X},\mathcal{Y})}}_{\text{possible test}} \underbrace{\ell(h(X), Y)}_{\text{loss function}} \, \underbrace{dP_{X,Y}}_{\text{distribution}} .$$
Note that when we are learning, h is dependent on the past observed (training) dataset. However,
when we are designing new learners (e.g., as evolution does), we do not have a particular training
dataset available. Therefore we seek to develop learners that work well on whatever training dataset
we have. To do so, in retrospective learning, we assume that all data are sampled from the exact same
distribution (in supervised learning, this means both the train and test sets), and we can then determine
the expected risk, E, of what is learned by additionally integrating over all possible training datasets:
$$\mathcal{E}_{\text{classical}}(h, n, P) = \underbrace{\int_{(\mathcal{X},\mathcal{Y})^n}}_{\text{possible train}} \underbrace{R(h)}_{\text{risk}} \, \underbrace{dP_{(X,Y)^n}}_{\text{distribution}} = \underbrace{\int_{(\mathcal{X},\mathcal{Y})^n}}_{\text{possible train}} \underbrace{\int_{(\mathcal{X},\mathcal{Y})}}_{\text{possible test}} \underbrace{\ell(h(X), Y)}_{\text{loss function}} \, \underbrace{dP_{X,Y}}_{\text{distribution}} \, \underbrace{dP_{(X,Y)^n}}_{\text{distribution}} .$$
Although both training and test datasets are assumed to be drawn from the same distribution, the
two integrals are integrating over two different sets of random variables: the inner integral is integrating
over all possible test datasets, and the outer integral is integrating over all possible train datasets.
Assuming that the two distributions are identical has enabled retrospective learning to prove a rich set
of theorems characterizing the limits of learning. Many learners have been developed based on this
assumption, and such algorithms have recently enjoyed a cornucopia of successes, spanning computer
vision [36], natural language processing [37], diagnostics [38], protein folding [39], autonomous control
[40], and reinforcement learning [41]. The successes of the field, so far, rely on problems amenable to
the classical statistical definition of learning in which the data are all sampled under a fixed distributional
assumption. This, to be fair, encompasses a wide variety of applied tasks. However, in many real-world
data problems, the assumption that the training and test data distributions are the same is grossly
inadequate [42].
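To make the retrospective setup concrete, the following minimal sketch (ours, not part of the original formalism; the distribution, loss, and learner are toy assumptions) draws training and test data from one fixed distribution, fits a hypothesis with a simple learner g, and estimates the expected risk by averaging the risk over many possible training sets, as in the expression for E_classical above.

```python
# Minimal sketch of classical (retrospective) expected risk. All names and the
# toy linear world below are illustrative assumptions, not the paper's method.
import numpy as np

rng = np.random.default_rng(0)

def sample_world(n):
    """Draw n i.i.d. (X, Y) pairs from a fixed, unknown-to-the-learner P_{X,Y}."""
    X = rng.normal(size=n)
    Y = 2.0 * X + rng.normal(scale=0.5, size=n)    # true linear mechanism
    return X, Y

def learner_g(X, Y):
    """g: map a training sample to a hypothesis h (here, a least-squares slope)."""
    w = np.dot(X, Y) / np.dot(X, X)
    return lambda x: w * x                          # h_n : X -> Y

def risk(h, n_test=100_000):
    """R(h): Monte Carlo estimate of the expected squared loss over P_{X,Y}."""
    X, Y = sample_world(n_test)
    return np.mean((h(X) - Y) ** 2)

# E_classical: average the risk over many possible training sets of size n,
# with train and test drawn from the same distribution.
n = 20
expected_risk = np.mean([risk(learner_g(*sample_world(n))) for _ in range(200)])
print(f"estimated E_classical for n={n}: {expected_risk:.3f}")
```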
Recently a number of papers have proposed developing a theory of ‘out of distribution’ (OOD)
learning [43–45], which includes as special cases transfer learning [46], multitask learning [47–49], meta-learning [50], and continual [51] and lifelong learning [52]. The key to OOD learning is that we now
assume that the test set is drawn from a distribution that differs in some way from the training set distribution. This assumption is an explicit generalization of the classical retrospective learning problem [45].
In OOD problems, the train and test sets can come from different sample spaces, be drawn from different distributions,
and be optimized with respect to different loss functions. Thus, rather than designing learners with small
expected risk as defined above, in OOD learning we seek learners with small EOOD (note the differences
from the classical 'in-distribution' retrospective risk):
$$\mathcal{E}_{\text{OOD}}\!\left(h, n, P^{\text{test}}, P^{\text{train}}\right) = \underbrace{\int_{(\mathcal{X},\mathcal{Y})^{\text{train}}}}_{\text{possible train}} \underbrace{\int_{(\mathcal{X},\mathcal{Y})^{\text{test}}}}_{\text{possible test}} \underbrace{\ell^{\text{test}}(h(X), Y)}_{\text{loss function}} \, \underbrace{dP^{\text{test}}_{X,Y}}_{\text{distribution}} \, \underbrace{dP^{\text{train}}_{X,Y}}_{\text{distribution}} .$$
Note that this expression for the risk permits the case of multiple tasks: both g and h are able to operate
on different spaces of inputs and outputs, the inputs to both could include task identifiers or other side
information, and the loss would measure performance for different tasks. All of prospective learning
builds upon, and generalizes, finding hypotheses that minimize EOOD.
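The following companion sketch (again illustrative; the particular form of shift is an assumption made for the demo) shows what the OOD objective measures: the same hypothesis is evaluated under the training distribution and under a shifted test distribution, and the in-distribution risk no longer predicts performance.

```python
# Minimal sketch of the OOD setting: train under P^train, evaluate under a
# shifted P^test. The shift in input range and mechanism is a toy assumption.
import numpy as np

rng = np.random.default_rng(1)

def sample(n, shift=False):
    X = rng.normal(loc=3.0 if shift else 0.0, size=n)
    slope = 1.0 if shift else 2.0          # the world's mechanism changes
    Y = slope * X + rng.normal(scale=0.5, size=n)
    return X, Y

def fit(X, Y):
    w = np.dot(X, Y) / np.dot(X, X)
    return lambda x: w * x

Xtr, Ytr = sample(500)                     # possible train ~ P^train
h = fit(Xtr, Ytr)

Xte, Yte = sample(10_000)                  # in-distribution test
Xood, Yood = sample(10_000, shift=True)    # shifted test ~ P^test

print("in-distribution risk:", np.mean((h(Xte) - Yte) ** 2))
print("OOD risk:            ", np.mean((h(Xood) - Yood) ** 2))
```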
There are various ways in which an intelligence that can solve this OOD problem can still be only
a retrospective learner. First, consider continual learning. A learner designed to minimize EOOD has
no incentive to remember anything about the past. In fact, there is no coherent notion of past, because
there is no time. However, even if there were time (for example, by assuming the training data is in the
past, and the testing data is in the future), there is no mechanism by which anything about the past
is retained. Rather, it could all be overwritten. Second, consider constraints. Often in retrospective
ML, constraints are imposed on learning algorithms to help find a good solution for the problem at
hand with limited resources. These constraints, therefore, do not consider the possibility of other future
problems that might be related to the existing problems in certain structured ways, which a prospective
learner would. Third, curiosity is not invoked at all. Even if we generalized the above to consider time,
curiosity would still only be about gaining information for the current situation, not realizing that there
will be future situations that are similar to this one, but distinct along certain predictable dimensions
(for example, entropy increases, children grow, etc.). Fourth, there is no notion of causality in the
above equation; the optimization problem is purely one of association and prediction. These limitations
of retrospective learning motivate formalizing prospective learning to highlight these four capabilities
explicitly.
4 The capabilities that characterize prospective learning The central hypothesis of this paper is
that by posing the problem of learning as being about the future, many of the current problem areas
of AI become tractable, and many aspects of NI behaviors can be better understood. Here we spell
out the components of systems that learn for the future. While Figure 1 provides a schematic akin
to a partially observable Markov decision process [53], it is important to note that prospective
learning is not merely a rebranding of Markov decision processes or reinforcement learning. Specifically, the 'future' in Figure 1 is not the next time step, but rather some time in the potentially distant
future. Moreover, at that time, everything could be different, not merely a time-varying state transition
distribution and reward function, but also possibly different input and output spaces. In other words, it
could be a completely different environment. Thus, while Markov assumptions may be at play, algorithms designed merely to address a stationary Markov decision process will catastrophically fail in the
more general settings considered here. Nonetheless, without further assumptions, the problem would
be intractable (or even uncomputable) [4].
Thus, we further assume that the external world W is changing somewhat predictably over time. For
example, the distribution of states in a world that operates according to some mechanisms (e.g., sunny
weather, cars driving on the right, etc.) may change when one or more of those mechanisms changes
(e.g., rainy weather, or driving on the left).¹ Continual learning thereby enables the intelligence to store
information in h about the past that it believes will be particularly useful in the (far) future. Prospective
constraints, h ∈ H′ , including inductive biases and priors, contain information about the particular
idiosyncratic ways in which both the intelligence and external world are likely to change over time.
Such constraints, for example, include the possibility of compositional representations. The constraints
also push the hypotheses towards those that are accurate both now and in the future. The actions
are therefore not only aimed at exploiting the current environment but also aimed at exploring to gain
knowledge useful for future environments and behaviors, reflecting a curiosity about the world. Finally,
learning causal, rather than merely associational relations, enables the intelligence to choose actions
that lead to the most desirable outcomes, even in complex, previously unencountered, environments.
These four capabilities, continual learning, prospective constraints, curiosity, and causality, together
form the basis of prospective learning.
4.1 Continual learning Continual (or lifelong) learning is the process by which intelligences
learn for performance in the future, in a way that involves sequentially acquiring new capabilities (e.g., skills and representations) without forgetting—or even improving upon—previously
acquired capabilities that are still useful. In general we expect previously learned abilities to be useful again in the future, either in part or in full. As such, it is clear that an intelligence that can remember
useful capabilities, despite learning new behaviors, will outperform (and out-survive) those that do not
[54]. However, AI systems often do forget the old upon learning the new, a phenomenon called catastrophic interference or forgetting [3, 55, 56]. Better than merely not forgetting old things, a continual
learner improves performance on old things, and even potential future things, upon acquiring new data
[49, 52, 57–59]. As such, the ability to do well on future tasks is the hallmark of real learning, and the
need to not forget immediately derives from it.

¹ In other words, the world may evolve due to interventions on causative factors, which we also touch on in Section 4.4.
An example of successful continual learning in NI is learning to play music. If a person is learning
Mozart, and then practices arpeggios, having learned the arpeggios will improve their ability to play
Mozart, and also their ability to play Bach in the future. When people learn another language, it also
improves their comprehension of previously learned languages, making future language learning even
easier [60]. The key to successful continual learning is, therefore, to transfer information from data
and experiences backwards to previous tasks (called backwards transfer) and forwards to future tasks
(called forward transfer) [45, 58, 59]. Humans also have failure modes: sometimes this prior learning
can impair future performance, a process known as interference (e.g., [61, 62]). The extent of transfer
or interference in future performance depends on both environmental context and internal mechanisms
(see Bouton [63]). While continual learning is obviously required for efficient prospective learning, to
date there are relatively few studies quantifying forward transfer in NIs, and, as far as we know, none
that explicitly quantify backwards transfer [64–66]. Crucially, learning new information does not typically
cause animals to forget old information.
Nonetheless, while existing AI algorithms have tried to enable both forward and backward transfer,
for the most part they have failed [67]. The field is only just beginning to explore effective continual
learning strategies [68], which includes those that explicitly consider non-stationarity of the environment [69, 70]. Traditional retrospective learning starts from a tabula rasa mentality, implicitly assuming
that there is only one task (e.g., a single machine learning problem) to be learned [71]. In these classical machine learning scenarios, each data sample is assumed to be sampled independently; this is
true even in OOD learning. While online learning [72], sequential estimation [73], and reinforcement
learning [74] relax this assumption, traditional variants of those ML disciplines typically assume a slow
distributional drift, disallowing discrete jumps. This does not consider the possibility that the far future
may strongly depend on the present (e.g., memories from last time an animal was at a particular location will be useful next time, even if it is far in the future). These previous approaches also typically
only consider single-task learning, whereas in continual learning there are typically multiple tasks and
multiple distinct datasets, sometimes each with a different domain.
In prospective learning, however, data from the far past can be leveraged to update the internal
world model [75, 76]. Here the training and test sets are necessarily coupled by time. This is in contrast
to the canonical OOD learning problem, in which the training and testing data lack any notion of time,
but similar to classical online, sequential, and reinforcement learning. Here we assume that the future
depends to some degree on the past. This dependency can be described by their conditional distribution, $P^{\text{future}\,|\,\text{past}}_{(X,Y)}$. Crucially, we do not necessarily assume a Markov process, where the future only
depends on the recent past, but rather allow for more complex dependencies depending on structural
regularities across scenarios. We thus obtain a more general expected risk in the learning for the future
scenario:
$$\mathcal{E}_{\text{continual}}\!\left(h, n, P^{\text{future,past}}\right) = \underbrace{\int_{(\mathcal{X},\mathcal{Y})^{\text{past}}}}_{\text{possible past}} \underbrace{\int_{(\mathcal{X},\mathcal{Y})^{\text{future}}}}_{\text{possible future}} \underbrace{\ell^{\text{future}}(h(X), Y)}_{\text{loss function}} \, \underbrace{dP^{\text{future}\,|\,\text{past}}_{X,Y}}_{\text{distribution}} \, \underbrace{dP^{\text{past}}_{X,Y}}_{\text{distribution}} \tag{4.1}$$
Continual learning is thus an immediate consequence of prospective learning. Recent work on continual reinforcement learning [77] can be thought of as devoted to developing algorithms that optimize
the above equation, but such efforts typically lack the other capabilities of prospective learning. As we
will argue next, continual learning is only non-trivial upon assuming certain computational constraints.
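As a toy illustration of why optimizing the continual objective differs from simply retraining on the newest data, the sketch below (ours; the tasks, the memory-based hypothesis, and the replay budget are all assumptions) compares a learner that overwrites its memory with the current task against one that retains a small replay buffer of the past, and evaluates both when the old task matters again in the future.

```python
# Minimal sketch of catastrophic forgetting vs. a small replay buffer.
import numpy as np

rng = np.random.default_rng(2)

def make_task(lo, hi, n=200):
    """A task is a region of input space with a shared underlying function."""
    X = rng.uniform(lo, hi, size=n)
    return X, np.sin(3 * X) + rng.normal(scale=0.05, size=n)

def nn_predict(Xmem, Ymem, Xq):
    """1-nearest-neighbour hypothesis built from whatever the learner remembers."""
    idx = np.abs(Xq[:, None] - Xmem[None, :]).argmin(axis=1)
    return Ymem[idx]

def risk(Xmem, Ymem, Xq, Yq):
    return np.mean((nn_predict(Xmem, Ymem, Xq) - Yq) ** 2)

X1, Y1 = make_task(-1.0, 0.0)              # past task
X2, Y2 = make_task(0.0, 1.0)               # current task

# Forgetful learner: memory is overwritten by the new task.
# Continual learner: keeps a small replay buffer of the past.
keep = rng.choice(len(X1), size=20, replace=False)
Xc, Yc = np.concatenate([X2, X1[keep]]), np.concatenate([Y2, Y1[keep]])

Xq, Yq = make_task(-1.0, 0.0)              # the old task matters again in the future
print("old-task risk, forgetful:", risk(X2, Y2, Xq, Yq))
print("old-task risk, continual:", risk(Xc, Yc, Xq, Yq))
```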
4.2 Constraints Constraints for prospective learning effectively shrink the hypothesis space
to require less data and fewer resources to find solutions to the current problem, which also
generalize to potential future problems. Whereas in NI these constraints come from evolution, in
AI these constraints are built into the system. Traditionally, constraints come in two forms. Statistical
constraints limit the space of possible hypotheses in order to enhance statistical efficiency; they reduce
the amount of data required to achieve a particular goal. For our purposes, priors and inductive biases
are ‘soft’ statistical constraints. Computational constraints, on the other hand, impose limits on the
amount of space and/or time an intelligence can use to learn and make inferences. Such constraints
are typically imposed to enhance computational efficiency; that is, to reduce the amount of computation
(space and/or time) to achieve a particular error guarantee. Both kinds of constraints, statistical and
computational, restrict the search space of effective or available hypotheses; of the two, statistical
constraints likely play a bigger role in prospective learning than computational ones. Moreover, both kinds of
constraints can be thought of as different ways to regularize, either explicit regularization (e.g., priors
and penalties) or implicit regularization (e.g., early stopping).²
There is no way to build an intelligence, either via evolution or from human hands, without it having
some constraints, particularly inductive biases (i.e., assumptions that a learner uses to facilitate learning
an input-output mapping). For example, most mammals [23, 78, 79], and even some insects [80],
excel at learning general relational associations that are acquired in one modality (e.g., space) and
applied in another (e.g., social groups). Inductive biases like this often reflect solutions to problems
faced by predecessors and learned over evolution. They are often expressed as instincts and emotions
that provide motivation to pursue or avoid a course of action, leading to opportunities to learn about
relevant aspects of the environment more efficiently. For example, mammals have a particular interest
in moving stimuli, and specifically biologically-relevant motion [81], likely reflecting behaviorally-relevant
threats [82]. Both chicks [83] and human babies [84] have biases for parsing visual information into
object-like shapes, without extensive experience with objects. Newborn primates are highly attuned to
faces [85] and direction of gaze [86], and these biases are believed to facilitate future conceptual [87]
and social learning [88]. Thus, within the span of an individual’s lifetime, NIs are not purely data-driven
learners. Not only is a great deal of information baked in via evolution, but this information is then used
to guide prospective learning [89].
AI has a rich history of choosing constrained search spaces, including priors and specific inductive
biases, so as to improve performance (e.g., [90]). Perhaps the most well known inductive bias deployed
in modern AI solutions is the convolution operation [91], something NIs appear to have discovered hundreds of millions of years prior to us implementing them in AIs [92]. Such ideas can be generalized
in terms of symmetries in the world [93]. Machine learning has developed many techniques to incorporate known invariances into the learning process [94–97], as well as to mathematically quantify how
much one can gain by imposing them [98, 99]. In fact, in many cases we may want to think about
constraints themselves as something to be learned [100, 101], a process that would unfold over evolutionary timescales for NIs. However, in many areas the true potential of prospective constraints for
accelerating learning for the future remains unmet. For example, as pointed out above, many NIs can
learn the component structure of problems (e.g., relations), which accelerates future learning when new
contexts have similar underlying compositions (see Whittington et al. [23]). This capability corresponds
to zero-shot cross-domain transfer, a challenge unmet by current state-of-the-art machine learning
methods [102].

² Note that without computational constraints, some aspects of continual learning can trivially be solved by a naive retrospective learner that stores all the data it has ever encountered and retrains its hypothesis from scratch each time new data arrive [45]. Thus, continual learning is inherently defined by space and/or time constraints, which are present in any real-world intelligence, be it natural or artificial.
Why are these constraints important? With a sufficiently general search space, and enough data,
space, and time, one can always find a learner that does arbitrarily well [32, 103]. In practice, however,
intelligences have finite data (in addition to finite space and time). Moreover, a fundamental theorem
of pattern recognition is the arbitrary slow convergence theorem [104, 105], which states that given
a fixed learning algorithm and any sample size N , there always exists a distribution such that the
performance of the algorithm is arbitrarily poor whenever n < N [34, 35]. This theorem predates and
implies the celebrated no free lunch theorem [106], which states that there is not one algorithm to
rule them all; rather, if learner g converges faster than another learner g ′ on some problems, then the
second learner g ′ will converge faster on other problems. In other words, one cannot hope for a general
“strong AI” that solves all problems efficiently. Rather, one can search for a learner that efficiently
solves problems in a specified family of problems. Constraints on the search space of hypotheses
thereby enable intelligences to solve the problems of interest efficiently, by virtue of encoding some
form of prior information and limiting the search space to specific problems. Prospective learners use
prospective constraints, that is, constraints that push hypotheses to the most general solution that works
for a given problem, such that it can readily be applied to future distinct problems.
Formalizing constraints using the above terminology (see Section 3) does not require modifying the
objective function of learning. It merely modifies the search space. Specifically, we place constraints
on the learner g ∈ G′ ⊂ G, the hypothesis h ∈ H′ ⊂ H, and the assumed joint distribution governing
everything, P = P^{future,past} ∈ P′ ⊂ P. The existence of constraints is what makes prospective learning
possible, and the quality of these constraints is what decides how effective learning can be.
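A small sketch of the role of constraints (ours; the polynomial hypothesis classes and the quadratic world are assumptions): with scarce data, searching an essentially unconstrained class overfits, while constraining the search to a smaller class H′ yields a hypothesis that also holds on future draws from the same mechanism.

```python
# Minimal sketch: a constraint on the hypothesis space substitutes for data.
import numpy as np

rng = np.random.default_rng(3)

def world(n):
    X = rng.uniform(-1, 1, size=n)
    return X, X ** 2 + rng.normal(scale=0.1, size=n)   # true quadratic mechanism

Xtr, Ytr = world(10)                     # scarce data
Xte, Yte = world(10_000)                 # the "future" data we care about

for degree in (9, 2):                    # loosely constrained vs. constrained H'
    coef = np.polyfit(Xtr, Ytr, deg=degree)
    test_risk = np.mean((np.polyval(coef, Xte) - Yte) ** 2)
    print(f"degree-{degree} hypothesis, future risk: {test_risk:.3f}")
```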
4.3 Curiosity We define curiosity for prospective learning as taking actions whose goal is to
acquire information that the intelligence expects will be useful in the future (rather than to obtain
rewards in the present). Goal-driven decisions can be broken down into a choice between maximizing
one of two objective functions [107]: (1) an objective aimed at directly maximizing rewards, R, and (2)
an objective aimed at maximizing relevant information, E . For prospective learning E is needed for
making good choices now and in the as-yet-unknown future. In this way the intelligence, at each point
of time, decides if it should dedicate time to learning about the world thereby maximizing E , or to doing
a rewarding behavior thereby maximizing R. Critically, by being purely about relevant information for the
future, objective (2) (i.e., pure curiosity) can maximize information about both current and future states
of the world. E can be defined simply as the value of the unknown, the integration over possible futures
and the knowledge it may afford. However, this term cannot easily be evaluated. Instead, we know
much about what it drives the intelligence to learn: compositional representations, causal relations, and
other kinds of invariances that allow us to solve current and future problems. In this way, E ultimately
quantifies our understanding of the relevant parts of the world.
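The following sketch (ours) illustrates the two objectives on a two-armed Bernoulli bandit with Beta beliefs: the reward objective R selects the arm with the highest posterior mean, while a simple curiosity objective E selects the arm the agent is most uncertain about, using posterior variance as a crude stand-in for expected information gain. The arm probabilities and the alternation between objectives are assumptions made for the demo.

```python
# Minimal sketch of reward-seeking (R) vs. information-seeking (E) choices.
import numpy as np

rng = np.random.default_rng(4)
true_p = np.array([0.7, 0.4])            # unknown to the agent
alpha = np.ones(2)                       # Beta(alpha, beta) belief per arm
beta = np.ones(2)

def posterior_mean():
    return alpha / (alpha + beta)

def posterior_var():
    n = alpha + beta
    return alpha * beta / (n ** 2 * (n + 1))

for t in range(200):
    exploit = t % 2 == 0                 # alternate the two objectives for the demo
    arm = np.argmax(posterior_mean() if exploit else posterior_var())
    reward = rng.random() < true_p[arm]
    alpha[arm] += reward                 # Bayesian update of the belief
    beta[arm] += 1 - reward

print("posterior means:", np.round(posterior_mean(), 2))
print("pulls per arm:  ", (alpha + beta - 2).astype(int))
```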
In humans, curiosity is a defining aspect of early life development, where children consistently
engage in more exploratory and information-driven behavior than adults [6, 108–111]. This drive for directed exploration, particularly important in children, is often focused on learning causal relations—acquiring both forward and reverse causal explanations [112]—and developing models of the world that they
can exploit in later development. But curiosity is not limited to humans (for review see Loewenstein
[113]). Just like children [114], monkeys show curiosity for counterfactual outcomes [115]. Rats are
driven to explore novel stimuli and contexts, even in the absence of rewards [116, 117]. Just like children [118], octopuses appear to learn from playing with novel objects [119], a pure form of curiosity. In
fact, even the roundworm C. elegans, an animal with a simple nervous system of only a few hundred
neurons, shows evidence of exploration in novel environments [120]. Curiosity is clearly a fundamental
drive of behavior in NIs [107].
It is well established in the active learning literature that curiosity, i.e., gathering information rather
than rewards, can lead to an exponential speed-up in sample size convergence guarantees [121, 122].
Specifically, this means that if a passive learner requires n samples to achieve a particular performance
guarantee, then an active learner requires only ln n samples to achieve the same performance. This is
important as the scenarios for which prospective learning provides a competitive advantage are those
where information is relatively sparse and the outcomes are of high consequential value. So every
learning opportunity must really count in these situations. We cannot expect that either AIs or
NIs will perfectly implement prospective learning by integrating E over long time horizons. Instead, we
can approximate what we learn about the parts of the world that we will want to take future actions in,
which compositional elements (i.e., constraints) exist in this world, and which causal interactions these
components have. These properties mean that we can see E as an approximation to how well we
can learn from the world. Thus optimal information gathering (i.e., curiosity) relies on learning
policies similar to those of reinforcement learning. This may explain why empirical studies in humans show that information seeking relies on circuits that overlap with those of reward learning [123]. Most importantly, this shows
how curiosity is innately future focused. The solution to reinforcement learning (i.e., the Bellman equation) reflects the optimal decision to make to maximize future returns [76, 124]. Thus, in the case of
curiosity, this solution is the optimal decision to maximize information in the future. What distinguishes
curiosity from reward learning is that learning E informs intelligences, whether NI or AI, about the structure of the world. E provides the necessary knowledge of things like spatial configurations, hierarchical
relationships, and contingencies. In other words, to find an optimal curiosity policy we can find an
optimal policy today about the structure of the world, regardless of immediate rewards, and solve the
optimization problem again tomorrow.
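To illustrate the exponential speed-up discussed above, the sketch below (ours; the threshold-learning problem and the numbers are assumptions) compares a passive learner that receives random labels with an active, curiosity-driven learner that queries the most informative point: localizing a threshold to precision eps takes on the order of 1/eps passive labels but only about log2(1/eps) active queries.

```python
# Minimal sketch of the passive-vs-active (curious) sample-complexity gap.
import numpy as np

rng = np.random.default_rng(6)
theta, eps = 0.6180339887, 1e-4
label = lambda x: x > theta                 # the oracle the learner can query

# Passive: random labels until the bracket around theta is narrower than eps.
lo, hi, passive_queries = 0.0, 1.0, 0
while hi - lo > eps:
    x = rng.random()
    passive_queries += 1
    if label(x):
        hi = min(hi, x)
    else:
        lo = max(lo, x)

# Active: always query the midpoint of the current bracket (binary search).
lo, hi, active_queries = 0.0, 1.0, 0
while hi - lo > eps:
    mid = (lo + hi) / 2
    active_queries += 1
    lo, hi = (lo, mid) if label(mid) else (mid, hi)

print("passive labels needed:", passive_queries)   # ~ 1/eps in expectation
print("active labels needed: ", active_queries)    # ~ log2(1/eps) = 14
```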
4.4 Causality Causal estimation is the ability to identify how one event (the cause) produces
another (the effect), which is particularly useful for understanding how our actions impact the
world. Causal estimation is enabled in practice by assuming that the direct causal relationships are
sparse. This sparsity assumption greatly simplifies modeling the world: it adds some bias but drastically reduces the search space over hypotheses. While it might be tempting to think that
prospective learning boils down to simply learning factorizable probabilistic models of the world, such
models are inadequate for prospective learning. This is because probabilistic models are inherently
invertible. That is, we can just as easily write the probability of wet grass given that it is raining,
P (wet|rain), as the probability that it is raining given wet grass, P (rain|wet). Yet these probabilities
do not tell us what would be the effect of intervening on one or the other variable. These probabilistic
statements of the world do not convey whether or not increasing P (wet) increases P (rain). According
to causal essentialists, such as Pearl [7], to make such statements requires more than a probabilistic
model: it requires a causal model. Causal reasoning enables intelligences to transfer information across
time. Specifically, it enables transferring causal mechanisms which, by their very nature, are consistent
across environments. This includes environments that have not previously been experienced, thereby
transferring out-of-distribution. Thus, causal reasoning, like continual learning, is a qualitative capability,
rather than a quantitative improvement, that is necessary for prospective learning.
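The rain-and-wet-grass example above can be made concrete with a small simulation (ours; the probabilities are toy assumptions): conditioning on wet grass raises the probability of rain, whereas intervening to make the grass wet, the do-operation, leaves it unchanged.

```python
# Minimal sketch of conditioning vs. intervening in a toy causal model.
import numpy as np

rng = np.random.default_rng(5)
N = 1_000_000

rain = rng.random(N) < 0.3
sprinkler = rng.random(N) < 0.4
wet = rain | sprinkler                     # causal mechanism: wet <- rain OR sprinkler

# Observational: condition on seeing wet grass.
print("P(rain | wet)     ~", rain[wet].mean())

# Interventional: do(wet = 1) overrides the mechanism; upstream causes untouched.
wet_do = np.ones(N, dtype=bool)
print("P(rain | do(wet)) ~", rain[wet_do].mean())   # equals P(rain) ~ 0.3
```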
Causal reasoning has long been seen by philosophers as a fundamental feature of human intelli-
gence [125]. While it is not always easy to distinguish causal reasoning from associative learning in
animals, many non-human animals have been shown to perform predictive inferences about object-object relationships, allowing them to estimate causal patterns (for review see Völter and Call [126]).
For example, great apes [127], monkeys [128], pigs [129], and parrots [130] can use simple sensory
cues (e.g., rattling sound of a shaken cup) to infer outcomes (e.g., presence of food in cup), a form
of diagnostic inference. However, this form of causal reasoning is inconsistently observed in NIs (see
Cheney and Seyfarth [131], Collier-Baker et al. [132]). Other studies have shown that, particularly in
social contexts, animals from great apes [133] and baboons [134] to corvids [135] and rats [136, 137],
can perform transitive causal inference (i.e., if A → B and B → C , then A → C ; for review see Allen
[138]). This causal ability has even been observed in insects [139], suggesting that forms of causal
inference exist across taxa.
The insight driving causal reasoning is that the causal mechanisms in the world tend to persist
while correlations are often highly context sensitive [140]. Further, the same causal mechanisms are
involved in generating many observations, so that models of these mechanisms are reusable modules in
compositional solutions to many problems within the environment. For example, understanding gravity
is useful for catching balls as well as for modeling tides and launching rockets. Prospective learning thus
crucially benefits from causal models: they are more likely to be useful as they encode real invariances
that persist across environments. For example, different variants of COVID will continue to emerge, but
certain treatments are likely to be effective for each of them insofar as they act on the mechanism of
disease which remains constant [141]. Such scenarios pose a problem for traditional AI algorithms.
Modern retrospective learning machines notoriously fail to learn causal models in all but the most
anodyne settings. Some AI researchers have advocated for creating models that can perform causal
reasoning, which would help AI systems generalize better to new settings and perform prospective
inference [142–144], but this field remains in its infancy.
Going back to our formulation of the problem, what this all means is that what matters for future
decisions is ‘doing Y ’: intervening on the world by virtue of taking action Y , rather than simply noting the
(generalized) correlation between X and Y . Fundamentally, implementing Y simply means returning
the value of the hypothesis for a specific X : i.e., do(h(X)). This modification yields an updated term to
optimize to achieve prospective learning:³
$$\mathcal{E}_{\text{causal}}(do(h), n, P) = \underbrace{\int_{(\mathcal{X},\mathcal{Y})^n}}_{\text{possible train}} \underbrace{\int_{(\mathcal{X},\mathcal{Y})}}_{\text{possible test}} \underbrace{\ell(do(h(X)), Y)}_{\text{loss function}} \, \underbrace{dP_{X,Y}}_{\text{distribution}} \, \underbrace{dP_{(X,Y)^n}}_{\text{distribution}} .$$
Crucially, the ability to choose actions, Y , allows the agent to discover causal relations, regardless
of the amount of confounding in the outside world. Causality links actions and learner, both by enabling
actions that are helpful for learning (e.g., randomized ones), and by enabling learning strategies that
are useful for discovering causal aspects of reality (e.g., through quasi-experiments). For our purposes
here, we consider all interactions between learning and action strategies to belong to either causality
or curiosity.
4.5 Putting it all together Table 1 provides a summary of the four capabilities that are necessary
for prospective learning, examples of how they are expressed in nature, how retrospective learning
handles each process, and how a prospective learner would implement it. In it we highlight examples in
the literature where the behavior of a prospective learner has been demonstrated in AI, illustrating the
fact that the field is already moving somewhat in this direction, just not completely yet. We argue
this is because the form of prospective learning has not been carefully defined, as we have
attempted to do here.

³ Note that minimizing this equation is close to reinforcement learning objectives, although these typically are not explicitly interested in learning causal models, and therefore the 'do' operator is not typically present in the value or reward functions.
With this gap in mind, we argue that in NIs, evolution has led to the creation of intelligent agents
that incorporate the above key components that jointly characterize prospective learning. Continual
learning, enabled by constraints and driven by curiosity, allows for the ability to make causal inferences
about which actions to take now that lead to better outcomes now and in the far future. In other words,
our claim is that evolution led to the creation of NIs that choose a learner which, with each new
experience, updates the internal model, g(hn , Xn , Yn ) → hf , where each hf is the solution to
$$\text{minimize } \; \mathcal{E}'(do(h), n, P), \quad \text{subject to } \; g \in \mathcal{G}', \; h \in \mathcal{H}', \; \text{and } P \in \mathcal{P}', \tag{1}$$
where P = P^{future,past}, and the constraints on g, h, and P encode aspects of time, compositionality, and
causality (e.g., that the future is dependent on the past via causal mechanisms). The expected risk, E ′ ,
for a specified hypothesis, at the current moment, given a set of experiences, is defined by
$$\mathcal{E}'(do(h), n, P) = \underbrace{\int_{(\mathcal{X},\mathcal{Y})^{\text{past}}}}_{\text{possible past}} \underbrace{\int_{(\mathcal{X},\mathcal{Y})^{\text{future}}}}_{\text{possible future}} \underbrace{\ell^{\text{future}}(do(h(X)), Y)}_{\text{loss function}} \, \underbrace{dP^{\text{future}\,|\,\text{past}}_{X,Y}}_{\text{distribution}} \, \underbrace{dP^{\text{past}}_{X,Y}}_{\text{distribution}} .$$
This E ′ gives the fundamental structure of the prospective learning problem. We argue that although
there has been substantial effort in the NI and AI communities to address each of the four capabilities
that lead to solving Eq. (1) independently, each remains to be solved at either the theoretical or algorithmic/implementation levels. Solving prospective learning requires a coherent strategy for jointly solving
all four of these problems together.
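To convey the flavor of what a joint solution might look like, the following heavily simplified sketch (ours, not the paper's algorithm; every component is a toy stand-in) wires the four capabilities into the single loop of Figure 1: a fixed learner g continually updates the hypothesis from a bounded memory, searches only a constrained hypothesis class, occasionally acts to gather information about a drifting world rather than to collect reward, and otherwise acts on the world through its current hypothesis.

```python
# Highly schematic sketch of a prospective learning loop (toy stand-ins only).
import numpy as np

rng = np.random.default_rng(7)

class World:
    """A slowly drifting linear world: acting well requires tracking its mechanism."""
    def __init__(self):
        self.slope = 1.0
    def observe(self):
        self.slope += 0.01 * rng.normal()          # the world changes over time
        return rng.uniform(-1, 1)                  # X_n: input from the world
    def outcome(self, x, action):
        return -(action - self.slope * x) ** 2     # reward for matching the mechanism

def g(memory):
    """Fixed learner g: fit h within the constrained class H' (degree-1 polynomials)."""
    X, Y = map(np.array, zip(*memory))
    return np.poly1d(np.polyfit(X, Y, deg=1))

world, memory = World(), [(-1.0, -1.0), (1.0, 1.0)]
total_reward, acting_steps = 0.0, 0
for t in range(500):
    x = world.observe()
    h = g(memory)                                  # h_n: current hypothesis
    if rng.random() < 0.1:                         # curiosity: spend this step learning
        memory.append((x, world.slope * x))        # probing reveals the current mechanism
        memory = memory[-100:]                     # bounded memory: continual, not exhaustive
    else:
        acting_steps += 1
        total_reward += world.outcome(x, h(x))     # act on the world via do(h(X))

print("average reward per acting step:", total_reward / max(acting_steps, 1))
```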
While we argue that optimizing for Eq. (1) characterizes our belief about what intelligent agents do
when performing prospective learning, it is strictly a computational-level problem. It does not, however,
answer the question of how they do it. What is the mechanism or algorithm that intelligent agents use
to perform prospective learning? Intriguingly, the implementation of prospective learning in NIs happens in a network (to a first approximation) [145], and most modern state-of-the-art machine learning
algorithms are also networks [146]. Moreover, both fields have developed a large body of knowledge
in understanding network learning [147–150]. Thus, a key to solving how prospective learning can be
implemented relies on understanding how networks learn representations, particularly representations that are important for the future. This is a critical component in explaining, augmenting, and
engineering prospective learning.
Understanding the role of representations in prospective learning, thus, requires a deep understanding of the nature of internal representations in modern ML frameworks. A fundamental theorem
of pattern recognition characterizes sufficient conditions for a learning system to be able to acquire any
pattern. Specifically, an intelligent agent must induce a hypothesis such that, for any new input, it
only looks at a relatively small amount of data ‘local’ to that input [151]. In other words, a good learner
will map new observations into a new representation, i.e., a stable and unique trace within a memory,
such that the inputs that are ‘close’ together in the external world are also close in their internal representation (see also Sorscher et al. [152], Sims [153]). In networks, this is implemented by any given
input only activating a sparse subset of nodes, which is typical in both NIs [154], and becoming more
common for AIs [155]. Indeed, deep networks can satisfy these criteria [156]. Specifically, deep networks partition feature space into geometric spaces called polytopes [157]. The internal representation
of any given point then corresponds to which polytope the point is in, and where within that polytope
it resides. Inference within a polytope is then simply linear [158]. The success of deep networks is a
result of their ability to efficiently learn what counts as 'local' [159].⁴ In prospective learning, in contrast
to retrospective learning, what counts as local is also a function of potential future environments. Thus,
the key difference between retrospective and prospective representation learning is that the internal
representation for prospective learning must trade-off between being effective for the current scenario,
and being effective for potential future scenarios.
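The polytope picture can be seen directly in a toy network (a sketch with assumed random weights): each input activates a particular subset of ReLU units, that binary pattern indexes the linear region, i.e., the polytope, containing the input, and nearby inputs tend to share it.

```python
# Minimal sketch: activation patterns of a random ReLU layer index polytopes.
import numpy as np

rng = np.random.default_rng(8)
W, b = rng.normal(size=(8, 2)), rng.normal(size=8)   # 8 hidden ReLU units, 2-D input

def activation_pattern(x):
    """Which hidden units fire: this binary code indexes the polytope containing x."""
    return tuple((W @ x + b > 0).astype(int))

x = np.array([0.3, -0.2])
for delta in (0.0, 0.01, 1.0):
    x2 = x + delta
    same = activation_pattern(x2) == activation_pattern(x)
    print(f"offset {delta:>4}: same polytope as x? {same}")
```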
| Core capacities of learning | Example in natural intelligence | Retrospective learning | Prospective learning |
| Continual: don't forget the important stuff, hn → hf | When people learn a new language we get better at our old one [60] | Learning new information overwrites old information [163] | Reuse useful information to facilitate learning new things without interference [59, 164] |
| Constraints: regularize via prior knowledge, heuristics, & biases, h ∈ H′ | Animals learn to store food in locations that are optimal given local weather conditions [16] | Generic constraints like sparseness enable learning a single task more efficiently [165] | Compositional representations can be exponentially reassembled for future scenarios and to compress the past [90] |
| Curiosity: go get information about the (expected future) world, instead of just rewards, E(hn, Wf) | Animals explore novel stimuli and contexts, even in the absence of rewards [116, 117] | Use randomness to explore new options with unknown outcomes in current scenario [166] | Seek out information about potential future scenarios [107] |
| Causality: the world W has sparse causal relationships, do(Yn) → Wf | Animals can learn if A → B and B → C, then A → C [138] | Learn statistical associations between stimuli [167] | Apply causal information to novel situations [168] |
Table 1: The four core capabilities of prospective learning, evidence for their existence in natural intelligence, how retrospective learners deal with (or fail to deal with) each, and how a prospective learner would deal with it. See Figure 1 for notation.
5 The future of learning for the future In many ways, prospective learning has always been a central (though often hidden) goal of both AI and NI research. Both fields offer theoretical and experimental
approaches to understand how intelligent agents learn for future behavior. What we are proposing here
is a formalization of the structure for how to approach studying prospective learning jointly in NI and
AI that will benefit both by establishing a more cohesive synergy between these research areas (see
Table 2). Indeed, the history of AI and NI exhibits many beautiful synergies [169, 170]. In the middle
of the 20th century cognitive science (NI) and AI started in large part as a unified effort. Early AI work
like the McCulloch-Pitts neurons [171] and the Perceptron [172] had strong conceptual links to biological neurons. Neural network models in the Parallel Distributed Processing framework had success in
explaining many aspects of intelligent behavior, and there have been strong recent drives to bring deep learning and neuroscience closer together [169, 170, 173].
Footnote 4: Incidentally, decision forests, such as random forests [160] and gradient boosted trees [161], continue to be the leading ML methods for tabular data [162]. Moreover, like deep networks, they also partition feature space into polytopes and then learn a linear function within each polytope [157], suggesting that this approach to learning representations can be multiply realized by many different algorithms and substrates [30].
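A companion sketch for the decision-forest case (illustrative only; it uses scikit-learn's stock DecisionTreeRegressor and synthetic data): a fitted tree routes every input to a leaf, i.e., a cell of the induced partition, and predicts with a simple function inside that cell. Standard trees fit a constant per cell; the local-linear view referenced above [157] replaces that constant with a linear fit.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(500, 2))
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.normal(size=500)

tree = DecisionTreeRegressor(max_leaf_nodes=20).fit(X, y)

# Each input is routed to one leaf (one cell of the induced partition) ...
leaf_ids = tree.apply(X)
print("number of cells in the partition:", np.unique(leaf_ids).size)

# ... and the prediction is constant within each cell.
for leaf in np.unique(leaf_ids)[:3]:
    preds = tree.predict(X[leaf_ids == leaf])
    assert np.allclose(preds, preds[0])
```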
We believe that our understanding of both AI and NI is significantly held back by the lack of a
coherent framework for prospective learning. NI research requires a framework to analyze the incredible
way in which humans and non-human animals learn for the future. AI can benefit by studying how NIs
solve problems that remain intractable for AI systems. We argue that the route to solving prospective
learning rests on the two fields coming together around three major areas of development.
• Theory: A theory of prospective learning, building on and complementing the theory of retrospective learning, will indicate which experiments (in both NIs and AIs) are most likely to fill gaps in our current understanding, while also providing metrics to
evaluate progress [174]. A theoretical understanding of prospective learning will also enable
the generation of testable mechanistic hypotheses characterizing how intelligent systems can
and do prospectively learn.
• Experiments: Carefully designed, ecologically appropriate experiments across species, phyla,
and substrates, will enable (i) quantifying the limitations and capabilities of existing intelligence
systems with respect to the prospective learning criteria, and (ii) refining the mechanistic hypotheses generated by the theory. NI experiments across taxa will also establish milestones for
experiments in AI [175].
• Real-World Evidence: Implementing and deploying AI systems and observing NIs exhibiting
prospective learning ‘in the wild’ will provide real-world evidence to deepen our understanding.
These implementations could be purely software, involve leveraging specialized (neuromorphic)
hardware [176, 177], or even include wetware and hybrids [178].
An astute reader may wonder how prospective learning relates to reinforcement learning (RL), a
subfield of both AI and NI. RL has long worked towards bridging the gap between AI and NI. For
example, early AI models of reinforcement learning formalized the phenomenon of an ‘eligibility trace’ in
synaptic connections that may be crucial for resolving the credit assignment problem, i.e., determining
which actions lead to a specific feedback signal [179]. Over 30 years later this AI work informed the
design of experiments that led to the discovery of such traces in brains [180, 181].
Table 2: Comparing the approaches to studying retrospective NI, retrospective AI, and a proposed approach for studying prospective intelligences.
• Experimental design. Retrospective NI: study (typically ethologically inappropriate) behaviors after learning has saturated. Retrospective AI: study single-task performance vs. sample size. Prospective NI & AI: design experiments to explicitly test each of the four capabilities.
• Evaluation criteria. Retrospective NI: accuracy at saturation. Retrospective AI: accuracy at large sample sizes. Prospective NI & AI: amount of transfer [45] across tasks.
• Algorithms. Retrospective NI: simple and intuitive. Retrospective AI: ensembling trees or networks. Prospective NI & AI: ensembling representations [59].
• Theory. Retrospective NI: qualitatively characterize learned behavior. Retrospective AI: convergence rates for a single task. Prospective NI & AI: convergence rates leveraging data drawn from multiple, sequential, causal tasks [49].

In RL, through repeated trials of a task, usually specified by its corresponding task reward, agents are trained to choose actions at each time instant that maximize those task rewards at future instants [76, 124]. This future-oriented reward maximization objective at first glance bears resemblance
to prospective learning, and deep learning-based RL algorithms building on decades of research have
recently made great progress towards meeting this challenge [182–186].
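For concreteness, the objective that such agents optimize is the textbook discounted-return criterion [76]; the display below is that standard form (our notation, not notation introduced in this paper):

```latex
\max_{\pi}\; J(\pi) \;=\; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r_{t}\right], \qquad 0 \le \gamma < 1,
```

where the reward function generating r_t is fixed and fully specified during training; as discussed next, prospective learning relaxes exactly this assumption.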
However, these standard RL algorithms do not truly implement prospective learning: for example,
while deep RL agents may do even better than humans in games they were trained on, they have
extreme difficulty transferring skills across games (see Shao et al. [187]), even to games with task or rule structures similar to those in the training set [188, 189]. Rather than optimizing a prespecified task as in RL, prospective learning aims to acquire skills and representations now that will be useful for future
tasks whose rewards or other specifications are not usually available in advance. In the example above,
a true prospective learner would acquire representations and skills that transfer across many games.
As in other machine learning subfields, there are several growing movements within RL that study
problems that would fall under the prospective learning umbrella, including continual multi-task RL [77],
hierarchical RL [190–193] that combines low-level skills to solve new tasks, causal RL [194–196], and
unsupervised exploration [197–199].
5.1 Advancing Natural Intelligence via Artificial Prospective Learning Advances in AI’s understanding of prospective learning provide a necessary formalism that can be harvested by NI research.
AI can produce the critical structure of scientific theories that leads to more rigorous and testable hypotheses on which to build experiments [200]. There is historical evidence that our understanding of NI abilities has conceptually benefited from AI, including several examples in which theoretical formalisms from AI have inspired understandings of NI [201–204]. Our proposal for prospective
learning expands upon the existing interrelation between these fields.
Consider the problem of designing NI experiments to understand prospective learning, rather than
cognitive function or retrospective learning. How would such experiments differ? The experiments would
demand that the NIs wrestle with each of the four capabilities: continual learning, constraints, curiosity,
and causality—exactly what NI researchers currently avoid because we cannot readily fit theories to
such behaviors. For continual learning, the experiments would have a sequence of tasks, some of
which repeat, so that both forward and backward transfer can be quantified. For constraints, tasks
would specifically investigate the priors and inductive biases of the animal, rather than their ability
to learn over many repeated trials. For curiosity, tasks would require a degree of exploration and
include information relevant only for future tasks. For causality, tasks would encode various conditional
dependencies that are not causal, as in Simpson’s paradox. The ways in which the NIs are evaluated
would also be informed by the theory, for example, quantifying the amount of information transferred rather than long-run performance properties [45, 205]. While extensive research in all these areas exists,
a deepened dialogue between individuals studying NI and AI will significantly advance their synergy.
Importantly, prospective learning provides a scaffolding to organize the debate around.
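As one concrete (and deliberately simplified) way to score such experiments, the sketch below computes forward and backward transfer from a matrix of per-task accuracies recorded after each stage of a task sequence. The numbers and the exact metric definitions are illustrative assumptions in the spirit of the continual-learning metrics cited above [45, 205], not the paper's own definitions.

```python
import numpy as np

# R[i, j] = accuracy on task j evaluated after training on tasks 0..i (illustrative numbers).
R = np.array([
    [0.90, 0.55, 0.50],   # after training on task 0
    [0.85, 0.88, 0.60],   # after training on tasks 0-1
    [0.80, 0.84, 0.92],   # after training on tasks 0-2
])
baseline = np.array([0.50, 0.50, 0.50])   # accuracy of an untrained learner on each task

T = R.shape[0]
# Backward transfer: how much earlier tasks change after later learning (negative = forgetting).
bwt = np.mean([R[T - 1, j] - R[j, j] for j in range(T - 1)])
# Forward transfer: how much prior learning helps a task before it is trained on.
fwt = np.mean([R[j - 1, j] - baseline[j] for j in range(1, T)])

print(f"backward transfer: {bwt:+.3f}   forward transfer: {fwt:+.3f}")
```

A purely retrospective evaluation would report only the diagonal of R (accuracy on the task just trained); the off-diagonal structure is what the prospective experiments sketched above are designed to expose.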
5.2 Advancing Artificial Intelligence via Natural Prospective Learning Advances in our understanding of NIs have always been a central driver, if not the central driver, of AI research. Indeed, the
fundamental logic of modern computer circuits was directly inspired by McCulloch and Pitts’ [171]
model of the binary logic in neural circuits. The way that we build AIs often involves a crucial step of
observing differences between NIs and AIs and then looking for inspiration in NIs for what is missing
and building it into our algorithms. The entire concept of intelligence used by AI derives from considerations of NIs. The concept of prospective learning promises to enable a stronger link between NI
and AI, where the many components of the study of NI can be directly ported to the components of AI.
Focusing the study of both NI and AI together promises to clarify the logical relations of concepts in the
two fields, making them considerably more synergistic.
Over the past few decades, there have been many proposals for how to advance AI, and they all
center on capturing the abilities of NIs, primarily humans. These include understanding ‘meaning’ [206],
in particular language, and continuing the current trend of building ever larger deep networks and feeding them ever larger datasets as a means of approaching the depth and complexity seen in biological
brains [207]. Others have looked exclusively at human intelligence, going so far as to even define AI as
modeling human intelligence [208]. Indeed, today’s rallying cry in AI is for ‘artificial general intelligence’,
which is commonly defined as the hypothetical ability of an intelligent agent to understand or learn any
intellectual task that a human being can [209]. This ongoing influence can be seen by the naming of AI
concepts with cognitive science words. Deep Learning is concerned with concepts like curiosity [210],
attention [211], memory, and most recently consciousness priors [212].
Our approach fundamentally differs from, and builds upon, those efforts. Yes, human intelligence—
or, more accurately, human intellect—is incredibly interesting. Yet NI, and specifically prospective learning, evolved hundreds of millions of years prior to human intelligence and natural language. We therefore argue that prospective learning is much more fundamental than human intellect. Thus, AI can advance by studying prospective learning in non-human animals (in addition to human animals), which are often more experimentally accessible. Moreover, we have an existence proof (from evolution)
that one can get to human intellect by first building prospective learning. Whether one can side-step
prospective learning capabilities, and go straight to complex language understanding, is an open question.
So then how does studying prospective learning in NIs potentially move AI forward? Consider the
design of experiments. The study of NI in the lab typically focuses on simple constrained environments,
such as two-alternative forced-choice paradigms. This contrasts with ecology, whose contributions to our understanding of NI behavior are largely underappreciated, and where NIs have been studied in complex, unconstrained environments for centuries. Like ecology,
but in contrast to typical cognitive science and neuroscience approaches, modern AI often investigates
abilities of agents in environments with rich, but complex, structure (e.g., video games, autonomous
vehicles). Yet, those same AIs (or similar ones) often catastrophically fail in real-world environments
[213–217]. Part of this failure to generalize to natural environments is likely due to the fact that the real world places a heavy emphasis on prospective learning abilities, something that most artificial testing
environments do not do. Thus, prospective learning provides additional context and motivation to design
experiments that transcend boundaries between taxa and substrates, both natural and artificial [218].
We argue that experiments such as those described above in Section 2 to study NIs can be ported
to also study AIs. We can build artificial environments, such as video-game worlds, that place heavy
demands on prospective abilities, including learning, that allow for direct comparison of the abilities of
AIs and NIs. Since it is possible to get some non-human NIs to play video games (e.g., monkeys [219],
rats [220]), these experiments do not necessarily limit the comparisons to be between humans and AIs
alone. Thus we can more effectively transfer our understanding of the abilities of NIs into AIs via a
unification of tasks.
5.3 What is needed to move forward? In many ways the process of doing retrospective learning is
simple. It requires the skill sets that many in ML have today: statistics, algorithms, and mathematics.
Prospective learning, on the other hand, requires us to reason about potential futures that we have not
yet experienced. In other words, we need to do prospective learning in order to understand prospective
learning. As such, solving the problem of prospective learning requires a far broader group of people
working on the problem. While it sits clearly within the domain of statistics and machine learning, the
problem of prospective learning also requires perspectives from well outside these fields, such
as biology, ecology, and philosophy. As AI is not only modeling, but also shaping the future, it also
reminds us of the deep ethical debt intelligence research owes to the society that enables it, and to
those who are most directly impacted by it [221, 222].
Acknowledgements This white paper was supported by an NSF AI Institute Planning award (# 2020312),
as well as support from Microsoft Research and DARPA. The authors would especially like to thank
Kathryn Vogelstein for putting up with endless meetings at the Vogelstein residence in order to make
these ideas come to life.
The Future Learning Collective Joshua T. Vogelstein1YB ; Timothy Verstynen3YB ; Konrad P. Kording2YB ;
Leyla Isik1Y ; John W. Krakauer1 ; Ralph Etienne-Cummings1 ; Elizabeth L. Ogburn1 ; Carey E. Priebe1 ;
Randal Burns1 ; Kwame Kutten1 ; James J. Knierim1 ; James B. Potash1 ; Thomas Hartung1 ; Lena
Smirnova1 ; Paul Worley1 ; Alena Savonenko1 ; Ian Phillips1 ; Michael I. Miller1 ; Rene Vidal1 ; Jeremias
Sulam1 ; Adam Charles1 ; Noah J. Cowan1 ; Maxim Bichuch1 ; Archana Venkataraman1 ; Chen Li1 ; Nitish Thakor1 ; Justus M Kebschull1 ; Marilyn Albert1 ; Jinchong Xu1 ; Marshall Hussain Shuler1 ; Brian
Caffo1 ; Tilak Ratnanather1 ; Ali Geisa1 ; Seung-Eon Roh1 ; Eva Yezerets1 ; Meghana Madhyastha1 ;
Javier J. How1 ; Tyler M. Tomita1 ; Jayanta Dey1 ; Ningyuan (Teresa) Huang1 ; Jong M. Shin1 ; Kaleab
Alemayehu Kinfu1 ; Pratik Chaudhari2 ; Ben Baker2 ; Anna Schapiro2 ; Dinesh Jayaraman2 ; Eric Eaton2 ;
Michael Platt2 ; Lyle Ungar2 ; Leila Wehbe3 ; Adam Kepecs4 ; Amy Christensen4 ; Onyema Osuagwu5 ;
Bing Brunton6 ; Brett Mensh7 ; Alysson R. Muotri8 ; Gabriel Silva8 ; Francesca Puppo8 ; Florian Engert9 ;
Elizabeth Hillman10 ; Julia Brown11 ; Chris White12 ; Weiwei Yang12
1 Johns Hopkins University; 2 University of Pennsylvania; 3 Carnegie Mellon University; 4 Washington University, St. Louis; 5 Morgan State University; 6 University of Washington; 7 Howard Hughes Medical Institute; 8 University of California, San Diego; 9 Harvard University; 10 Columbia University; 11 MindX; 12 Microsoft Research
Y Principal investigators on the NSF AI Institute planning award.
B Corresponding authors ([email protected], [email protected], [email protected])
Diversity Statement By our estimates (using cleanBib), our references contain 11.67%
woman(first)/woman(last), 22.15% man/woman, 22.15% woman/man, and 44.03% man/man, and 9.16%
author of color (first)/author of color(last), 13.09% white author/author of color, 17.63% author of color/white
author, and 60.12% white author/white author.
References
[1] Tracy Hresko Pearl. Fast & furious: the misregulation of driverless cars. NYU Ann. Surv. Am. L.,
73:19, 2017.
[2] Agostina J Larrazabal, Nicolás Nieto, Victoria Peterson, Diego H Milone, and Enzo Ferrante.
Gender imbalance in medical imaging datasets produces biased classifiers for computer-aided
diagnosis. Proceedings of the National Academy of Sciences, 117(23):12592–12594, 2020.
[3] Michael McCloskey and Neal J Cohen. Catastrophic interference in connectionist networks: The
sequential learning problem. In Gordon H Bower, editor, Psychology of Learning and Motivation,
volume 24, pages 109–165. Academic Press, January 1989.
[4] Patricia Rich, Ronald de Haan, Todd Wareham, and Iris van Rooij. How hard is cognitive science?
In Proceedings of the Annual Meeting of the Cognitive Science Society, 43, April 2021.
[5] Chetan Singh Thakur, Jamal Lottier Molin, Gert Cauwenberghs, Giacomo Indiveri, Kundan Kumar, Ning Qiao, Johannes Schemmel, Runchun Wang, Elisabetta Chicca, Jennifer Olson Hasler,
et al. Large-scale neuromorphic spiking array processors: A quest to mimic the brain. Frontiers
in neuroscience, 12:891, 2018.
[6] Celeste Kidd and Benjamin Y Hayden. The psychology and neuroscience of curiosity. Neuron,
88(3):449–460, November 2015.
[7] Judea Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, first
edition, March 2000.
[8] Rodolfo R Llinas. I of the Vortex: From Neurons to Self. MIT Press, February 2002.
[9] Samuel J Gershman, Kenneth A Norman, and Yael Niv. Discovering latent causes in reinforcement learning. Current Opinion in Behavioral Sciences, 5:43–50, October 2015.
[10] C R Raby and N S Clayton. Prospective cognition in animals. Behav. Processes, 80(3):314–324,
March 2009.
[11] Anthony Dickinson. Goal-directed behavior and future planning in animals. Animal thinking:
Contemporary issues in comparative cognition, pages 79–91, 2011.
[12] Sara J Shettleworth. Planning for breakfast. Nature, 445(7130):825–826, February 2007.
[13] Sara J Shettleworth. Studying mental states is not a research program for comparative cognition.
Behav. Brain Sci., 30(3):332–333, June 2007.
[14] Nicholas J Mulcahy and Josep Call. Apes save tools for future use. Science, 312(5776):1038–
1040, May 2006.
[15] A M P von Bayern, S Danel, A M I Auersperg, B Mioduszewska, and A Kacelnik. Compound tool
construction by new caledonian crows. Sci. Rep., 8(1):15676, October 2018.
[16] Nicola S Clayton, Timothy J Bussey, and Anthony Dickinson. Can animals recall the past and
plan for the future? Nat. Rev. Neurosci., 4(8):685–691, August 2003.
[17] Nicola S Clayton, Joanna Dally, James Gilbert, and Anthony Dickinson. Food caching by western
scrub-jays (aphelocoma californica) is sensitive to the conditions at recovery. J. Exp. Psychol.
Anim. Behav. Process., 31(2):115–124, April 2005.
[18] Selvino R de Kort, Sérgio P C Correia, Dean M Alexis, Anthony Dickinson, and Nicola S Clayton.
The control of food-caching behavior by western scrub-jays (aphelocoma californica). J. Exp.
Psychol. Anim. Behav. Process., 33(4):361–370, October 2007.
[19] Nathan J Emery and Nicola S Clayton. Effects of experience and social context on prospective
caching strategies by scrub jays. Nature, 414(6862):443–446, November 2001.
[20] Nathaniel H Hunt, Judy Jinn, Lucia F Jacobs, and Robert J Full. Acrobatic squirrels learn to leap
and land on tree branches without falling. Science, 373(6555):697–700, August 2021.
[21] A David Redish. Vicarious trial and error. Nat. Rev. Neurosci., 17(3):147–159, March 2016.
[22] Alexandra O Constantinescu, Jill X O’Reilly, and Timothy E J Behrens. Organizing conceptual
knowledge in humans with a gridlike code. Science, 352(6292):1464–1468, June 2016.
[23] James C R Whittington, Timothy H Muller, Shirley Mark, Guifen Chen, Caswell Barry, Neil
Burgess, and Timothy E J Behrens. The Tolman-Eichenbaum machine: Unifying space and relational memory through generalization in the hippocampal formation. Cell, 183(5):1249–1263.e23,
November 2020.
[24] Randolf Menzel. A short history of studies on intelligence and brain in honeybees. Apidologie,
52(1):23–34, February 2021.
[25] Felicity Muth, Daniel R Papaj, and Anne S Leonard. Colour learning when foraging for nectar and
pollen: bees learn two colours at once. Biol. Lett., 11(9):20150628, September 2015.
[26] Felicity Muth, Daniel R Papaj, and Anne S Leonard. Bees remember flowers for more than one
reason: pollen mediates associative learning. Anim. Behav., 111:93–100, January 2016.
[27] Felicity Muth, Tamar Keasar, and Anna Dornhaus. Trading off short-term costs for long-term
gains: how do bumblebees decide to learn morphologically complex flowers? Anim. Behav., 101:
191–199, March 2015.
[28] Tom B Brown et al. Language Models are Few-Shot Learners, May 2020.
[29] G Fiorito and P Scotto. Observational learning in octopus vulgaris. Science, 256(5056):545–547,
April 1992.
[30] M Chirimuuta. Marr, mayr, and MR: What functionalism should now be about. Philos. Psychol.,
31(3):403–418, April 2018.
[31] Thomas W Polger and Lawrence A Shapiro. The Multiple Realization Book. Oxford University
Press, September 2016.
[32] V Glivenko. Sulla determinazione empirica delle leggi di probabilita. Giorn. Ist. Ital. Attuari, 4:92–99, 1933.
[33] Francesco Paolo Cantelli. Sulla determinazione empirica delle leggi di probabilita. Giorn. Ist. Ital.
Attuari, 4(421-424), 1933.
[34] V Vapnik and A Chervonenkis. On the uniform convergence of relative frequencies of events to
their probabilities. Theory Probab. Appl., 16(2):264–280, January 1971.
[35] L G Valiant. A theory of the learnable. Commun. ACM, 27(11):1134–1142, November 1984.
[36] A Krizhevsky, I Sutskever, and G E Hinton. Imagenet classification with deep convolutional neural
networks. Adv. Neural Inf. Process. Syst., 2012.
[37] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of Deep
Bidirectional Transformers for Language Understanding. In NAACL-HLT, January 2019.
[38] Scott Mayer McKinney et al. International evaluation of an AI system for breast cancer screening.
Nature, 577(7788):89–94, January 2020.
[39] Andrew W Senior et al. Improved protein structure prediction using potentials from deep learning.
Nature, January 2020.
[40] Nick Statt. OpenAI’s dota 2 AI steamrolls world champion e-sports team with back-to-back victories. https://www.theverge.com/2019/4/13/18309459/openai-five-dota-2-finals-ai-bot-competition-og-e-sports-the-international-champion, April 2019. Accessed: 2020-1-25.
[41] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez,
Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan
Hui, Laurent Sifre, George van den Driessche, Thore Graepel, and Demis Hassabis. Mastering
the game of go without human knowledge. Nature, 550(7676):354–359, October 2017.
[42] David J Hand. Classifier technology and the illusion of progress. Statistical Science, 21(1):1–14,
February 2006.
[43] Elan Rosenfeld, Pradeep Kumar Ravikumar, and Andrej Risteski. The risks of invariant risk
minimization, September 2020.
[44] Martin Arjovsky. Out of distribution generalization in machine learning, March 2021.
[45] Ali Geisa, Ronak Mehta, Hayden S Helm, Jayanta Dey, Eric Eaton, Jeffery Dick, Carey E Priebe,
and Joshua T Vogelstein. Towards a theory of out-of-distribution learning, September 2021.
[46] Stevo Bozinovski and Ante Fulgosi. The influence of pattern similarity and transfer learning upon
the training of a base perceptron B2. Proceedings of Symposium Informatica, pages 3–121–5,
1976.
[47] Samy Bengio, Yoshua Bengio, Jocelyn Cloutier, and Jan Gecsei. On the optimization of a synaptic
learning rule. In Preprints Conf. Optimality in Artificial and Biological Neural Networks, volume 2.
researchgate.net, 1992.
[48] Rich Caruana. Multitask learning. Mach. Learn., 28(1):41–75, July 1997.
[49] Jonathan Baxter. A model of inductive bias learning. J. Artif. Intell. Res., 12(1):149–198, March
2000.
[50] Jane X Wang. Meta-learning in natural and artificial intelligence, November 2020.
[51] M B Ring. Continual learning in reinforcement environments. PhD thesis, University of Texas at
Austin, 1994.
[52] Sebastian Thrun and Tom M Mitchell. Lifelong robot learning. Rob. Auton. Syst., 15(1):25–46,
July 1995.
[53] Robin Jaulmes, Joelle Pineau, and Doina Precup. Learning in non-stationary partially observable
markov decision processes. In ECML Workshop on Reinforcement Learning in non-stationary
environments, volume 25, pages 26–32. ias.informatik.tu-darmstadt.de, 2005.
[54] Joan Baez. No woman no cry, 1979.
[55] Yoshua Bengio, Mehdi Mirza, Ian Goodfellow, Aaron Courville, and Xia Da. An empirical investigation of catastrophic forgetting in gradient-based neural networks, December 2013.
[56] Vinay V Ramasesh, Ethan Dyer, and Maithra Raghu. Anatomy of catastrophic forgetting: Hidden
representations and task semantics, July 2020.
[57] Sebastian Thrun. Is learning the n-th thing any easier than learning the first? In D S Touretzky,
M C Mozer, and M E Hasselmo, editors, Advances in Neural Information Processing Systems 8,
pages 640–646. MIT Press, 1996.
[58] Paul Ruvolo and Eric Eaton. ELLA: An efficient lifelong learning algorithm. In International
Conference on Machine Learning, volume 28, pages 507–515, February 2013.
[59] Joshua T Vogelstein et al. Ensembling Representations for Synergistic Lifelong Learning with
Quasilinear Complexity, April 2020.
[60] Alex Boulton and Tom Cobb. Corpus use in language learning: A meta-analysis. Lang. Learn.,
67(2):348–393, June 2017.
[61] Reuven Dukas. Transfer and interference in bumblebee learning. Anim. Behav., 49(6):1481–
1490, June 1995.
[62] M E Bouton. Context, time, and memory retrieval in the interference paradigms of pavlovian
learning. Psychol. Bull., 114(1):80–99, July 1993.
[63] Mark E Bouton. Extinction of instrumental (operant) learning: interference, varieties of context,
and mechanisms of contextual control. Psychopharmacology, 236(1):7–19, January 2019.
[64] H F Harlow. The formation of learning sets. Psychol. Rev., 56(1):51–65, January 1949.
[65] Michelle J Spierings and Carel Ten Cate. Budgerigars and zebra finches differ in how they
generalize in an artificial grammar learning experiment. Proc. Natl. Acad. Sci. U. S. A., 113(27):
E3977–84, July 2016.
[66] V Samborska, J L Butler, M E Walton, T E J Behrens, and others. Complementary task representations in hippocampus and prefrontal cortex for generalising the structure of problems. bioRxiv,
2021.
[67] Khurram Javed and Martha White. Meta-Learning representations for continual learning, May
2019.
[68] German I Parisi, Ronald Kemker, Jose L Part, Christopher Kanan, and Stefan Wermter. Continual
lifelong learning with neural networks: A review. Neural Netw., 113:54–71, May 2019.
[69] Richard Kurle, Botond Cseke, Alexej Klushyn, Patrick van der Smagt, and Stephan Günnemann.
Continual learning with bayesian neural networks for Non-Stationary data, September 2019.
[70] Annie Xie, James Harrison, and Chelsea Finn. Deep reinforcement learning amidst lifelong NonStationarity, June 2020.
[71] Steven Pinker. The Blank Slate: The Modern Denial of Human Nature. Penguin Books, reprint
edition, August 2003.
[72] Nicolo Cesa-Bianchi and Gabor Lugosi. Prediction, Learning, and Games. Cambridge University
Press, March 2006.
[73] Alexander Rakhlin and Karthik Sridharan. Statistical Learning and Sequential Prediction. Massachusetts Institute of Technology, September 2014.
[74] Moritz Hardt and Benjamin Recht. Patterns, predictions, and actions: A story about machine
learning. https://mlstory.org, 2021.
[75] Jürgen Schmidhuber. A possibility for implementing curiosity and boredom in model-building
neural controllers. In Proc. of the international conference on simulation of adaptive behavior:
From animals to animats, pages 222–227. mediatum.ub.tum.de, 1991.
[76] Richard S Sutton and Andrew G Barto. Introduction to Reinforcement Learning. Cambridge: MIT
Press, March 1998.
[77] Khimya Khetarpal, Matthew Riemer, Irina Rish, and Doina Precup. Towards continual reinforcement learning: A review and perspectives, December 2020.
[78] Alex Konkel and Neal J Cohen. Relational memory and the hippocampus: representations and
methods. Front. Neurosci., 3(2):166–174, September 2009.
[79] Marc W Howard and Howard Eichenbaum. Time and space in the hippocampus. Brain Res.,
1621:345–354, September 2015.
[80] M Giurfa, S Zhang, A Jenett, R Menzel, and M V Srinivasan. The concepts of ’sameness’ and
’difference’ in an insect. Nature, 410(6831):930–933, April 2001.
[81] Takeshi Atsumi, Masakazu Ide, and Makoto Wada. Spontaneous discriminative response to the
biological motion displays involving a walking conspecific in mice. Front. Behav. Neurosci., 12:
263, November 2018.
[82] Steven L Franconeri and Daniel J Simons. Moving and looming stimuli capture attention. Percept.
Psychophys., 65(7):999–1010, October 2003.
[83] Samantha M W Wood and Justin N Wood. One-shot object parsing in newborn chicks. J. Exp.
Psychol. Gen., March 2021.
[84] Elizabeth S Spelke. Principles of object perception. Cogn. Sci., 14(1):29–56, January 1990.
[85] Morton J Mendelson, Marshall M Haith, and Patricia S Goldman-Rakic. Face scanning and
responsiveness to social cues in infant rhesus monkeys. Dev. Psychol., 18(2):222–228, March
1982.
[86] M Tomasello, J Call, and B Hare. Five primate species follow the visual gaze of conspecifics.
Anim. Behav., 55(4):1063–1069, April 1998.
[87] Shimon Ullman, Daniel Harari, and Nimrod Dorfman. From simple innate biases to complex
visual concepts. Proc. Natl. Acad. Sci. U. S. A., 109(44):18215–18220, October 2012.
[88] Lindsey J Powell, Heather L Kosakowski, and Rebecca Saxe. Social origins of cortical face areas.
Trends Cogn. Sci., 22(9):752–763, September 2018.
[89] Anthony M Zador. A critique of pure learning and what artificial neural networks can learn from
animal brains. Nat. Commun., 10(1):3770, August 2019.
[90] Izzeddin Gur, Natasha Jaques, Yingjie Miao, Jongwook Choi, Manoj Tiwari, Honglak Lee, and
Aleksandra Faust. Environment generation for zero-shot compositional reinforcement learning.
Adv. Neural Inf. Process. Syst., 34, December 2021.
[91] Yann LeCun, Patrick Haffner, Léon Bottou, and Yoshua Bengio. Object recognition with GradientBased learning. In Shape, Contour and Grouping in Computer Vision, page 319, Berlin, Heidelberg, January 1999. Springer-Verlag.
[92] Michael R Ibbotson, NSC Price, and Nathan A Crowder. On the division of cortical cells into
simple and complex types: a comparative viewpoint. Journal of neurophysiology, 93(6):3699–
3702, 2005.
[93] Emmy Noether. Invariante Variationsprobleme. Nachr. Ges. Wiss. Göttingen, Math.-Phys. Klasse, pages 235–257, 1918.
[94] Soledad Villar, David W Hogg, Kate Storey-Fisher, Weichi Yao, and Ben Blum-Smith. Scalars
are universal: Equivariant machine learning, structured like classical physics. In Thirty-Fifth
Conference on Neural Information Processing Systems, 2021.
[95] Risi Kondor. N-body networks: a covariant hierarchical neural network architecture for learning
atomic potentials. arXiv preprint arXiv:1803.01588, 2018.
[96] Taco Cohen and Max Welling. Group equivariant convolutional networks. In International
conference on machine learning, pages 2990–2999. PMLR, 2016.
[97] Marc Finzi, Max Welling, and Andrew Gordon Wilson. A practical method for constructing equivariant multilayer perceptrons for arbitrary matrix groups. arXiv preprint arXiv:2104.09459, 2021.
[98] Alberto Bietti, Luca Venturi, and Joan Bruna. On the sample complexity of learning with geometric
stability. arXiv preprint arXiv:2106.07148, 2021.
[99] Song Mei, Theodor Misiakiewicz, and Andrea Montanari. Learning with invariances in random
features and kernel models, 2021.
[100] Gregory Benton, Marc Finzi, Pavel Izmailov, and Andrew Gordon Wilson. Learning invariances
in neural networks. arXiv preprint arXiv:2010.11882, 2020.
[101] Evangelos Chatzipantazis, Stefanos Pertigkiozoglou, Kostas Daniilidis, and Edgar Dobriban.
Learning augmentation distributions using transformed risk minimization.
arXiv preprint
arXiv:2111.08190, 2021.
[102] Junkyung Kim, Matthew Ricci, and Thomas Serre. Not-So-CLEVR: learning same–different relations strains feedforward neural networks. Interface Focus, 8(4):20180011, August 2018.
[103] Howard G Tucker. A generalization of the Glivenko-Cantelli theorem. Ann. Math. Stat., 30(3):
828–830, 1959.
[104] Luc Devroye. On arbitrarily slow rates of global convergence in density estimation. Zeitschrift für
Wahrscheinlichkeitstheorie und Verwandte Gebiete, 62(4):475–483, December 1983.
[105] A Antos, L Devroye, and L Gyorfi. Lower bounds for bayes error estimation. IEEE Trans. Pattern
Anal. Mach. Intell., 21(7):643–645, July 1999.
[106] David H Wolpert. The supervised learning No-Free-Lunch theorems. In Rajkumar Roy, Mario
Köppen, Seppo Ovaska, Takeshi Furuhashi, and Frank Hoffmann, editors, Soft Computing and
Industry: Recent Applications, pages 25–42. Springer London, London, 2002.
[107] Erik J Peterson and Timothy D Verstynen. A way around the exploration-exploitation dilemma,
June 2019.
[108] Robert B Pippin. Hegel’s Practical Philosophy: Rational Agency as Ethical Life. Cambridge
University Press, 2008.
[109] Emily Sumner, Amy X Li, Amy Perfors, Brett Hayes, Danielle Navarro, and Barbara W Sarnecka.
The exploration advantage: Children’s instinct to explore allows them to find information that
adults miss. psyarxiv.com, 2019.
[110] Eliza Kosoy, Jasmine Collins, David M Chan, Sandy Huang, Deepak Pathak, Pulkit Agrawal, John
Canny, Alison Gopnik, and Jessica B Hamrick. Exploring exploration: Comparing children with
RL agents in unified environments, May 2020.
[111] Emily G Liquin and Alison Gopnik. Children are more exploratory and learn more than adults in
an approach-avoid task. Cognition, 218:104940, October 2021.
[112] Andrew Gelman and Guido Imbens. Why ask why? forward causal inference and reverse causal
questions, November 2013.
[113] George Loewenstein. The psychology of curiosity: A review and reinterpretation. Psychol. Bull.,
116(1):75–98, July 1994.
[114] Lily FitzGibbon, Henrike Moll, Julia Carboni, Ryan Lee, and Morteza Dehghani. Counterfactual
curiosity in preschool children. J. Exp. Child Psychol., 183:146–157, July 2019.
[115] Maya Zhe Wang and Benjamin Y Hayden. Monkeys are curious about counterfactual outcomes,
2019.
[116] D E Berlyne. The arousal and satiation of perceptual curiosity in the rat. J. Comp. Physiol.
Psychol., 48(4):238–246, August 1955.
[117] W N Dember and R W Earl. Analysis of exploratory, manipulatory, and curiosity behaviors.
Psychol. Rev., 64(2):91–96, March 1957.
[118] Junyi Chu and Laura E Schulz. Play, curiosity, and cognition. Annual Review of Developmental
Psychology, December 2020.
[119] Michael J Kuba, Tamar Gutnick, and Gordon M Burghardt. Learning from play in octopus.
In A-S Darmaillacq, L Dickel, and J Mather, editors, Cephalopod Cognition, pages 57–67, 2014.
[120] Adam J Calhoun, Sreekanth H Chalasani, and Tatyana O Sharpee. Maximally informative foraging by caenorhabditis elegans. Elife, 3, December 2014.
[121] Maria-Florina Balcan, Alina Beygelzimer, and John Langford. Agnostic active learning. J.
Comput. System Sci., 75(1):78–89, January 2009.
[122] Maria-Florina Balcan, Steve Hanneke, and Jennifer Wortman Vaughan. The true sample complexity of active learning. Mach. Learn., 80(2):111–139, September 2010.
[123] Kenji Kobayashi and Ming Hsu. Common neural code for reward and information value. Proc.
Natl. Acad. Sci. U. S. A., 116(26):13061–13066, June 2019.
[124] R Bellman. Dynamic programming and Lagrange multipliers. Proc. Natl. Acad. Sci.
U. S. A., 42(10):767–769, October 1956.
[125] David Hume. An abstract of a treatise of human nature, 1739.
[126] Christoph J Völter and Josep Call. Causal and inferential reasoning in animals. In Josep Call,
editor, APA handbook of comparative psychology: Perception, learning, and cognition, Vol, volume 2, pages 643–671. American Psychological Association, xiii, Washington, DC, US, 2017.
[127] Josep Call. Inferences about the location of food in the great apes (pan paniscus, pan troglodytes,
gorilla gorilla, and pongo pygmaeus). J. Comp. Psychol., 118(2):232–241, June 2004.
[128] Lisa A Heimbauer, Rebecca L Antworth, and Michael J Owren. Capuchin monkeys (cebus apella)
use positive, but not negative, auditory cues to infer food location. Anim. Cogn., 15(1):45–55,
January 2012.
[129] Christian Nawroth and Eberhard von Borell. Domestic pigs’ (sus scrofa domestica) use of direct
and indirect visual and auditory cues in an object choice task. Anim. Cogn., 18(3):757–766, May
2015.
[130] Christian Schloegl, Judith Schmidt, Markus Boeckle, Brigitte M Weiß, and Kurt Kotrschal. Grey parrots use inferential reasoning based on acoustic cues alone. Proc. Biol. Sci., 279(1745):4135–4142, October 2012.
[131] Dorothy L Cheney and Robert M Seyfarth. How Monkeys See the World: Inside the Mind of Another Species. University of Chicago Press, 1990.
[132] Emma Collier-Baker, Joanne M Davis, and Thomas Suddendorf. Do dogs (canis familiaris) understand invisible displacement? J. Comp. Psychol., 118(4):421–433, December 2004.
[133] Katie E Slocombe, Tanja Kaller, Josep Call, and Klaus Zuberbühler. Chimpanzees extract social information from agonistic screams. PLoS One, 5(7):e11473, July 2010.
[134] D L Cheney, R M Seyfarth, and J B Silk. The responses of female baboons (papio cynocephalus ursinus) to anomalous social interactions: evidence for causal reasoning? J. Comp. Psychol., 109(2):134–141, June 1995.
[135] Jorg J M Massen, Andrius Pašukonis, Judith Schmidt, and Thomas Bugnyar. Ravens notice dominance reversals among conspecifics within and outside their social group. Nat. Commun., 5:3679, April 2014.
[136] H Davis. Transitive inference in rats (rattus norvegicus). J. Comp. Psychol., 106(4):342–349, December 1992.
[137] William A Roberts and Maria T Phelps. Transitive inference in rats: A test of the spatial coding hypothesis. Psychol. Sci., 5(6):368–374, November 1994.
[138] Colin Allen. Transitive inference in animals: Reasoning or conditioned associations. Rational animals, pages 175–185, 2006.
[139] Elizabeth A Tibbetts, Jorge Agudelo, Sohini Pandit, and Jessica Riojas. Transitive inference in polistes paper wasps. Biol. Lett., 15(5):20190015, May 2019.
[140] Jonas Peters, Peter Bühlmann, and Nicolai Meinshausen. Causal inference by using invariant prediction: identification and confidence intervals. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 5(78):947–1012, October 2016.
[141] Liam Rose, Laura Graham, Allison Koenecke, Michael Powell, Ruoxuan Xiong, Zhu Shen, Kenneth W Kinzler, Chetan Bettegowda, Bert Vogelstein, Maximilian F Konig, Susan Athey, Joshua T Vogelstein, and Todd H Wagner. The association between alpha-1 adrenergic receptor antagonists and in-hospital mortality from COVID-19. medRxiv, December 2020.
[142] Bernhard Schölkopf. Causality for machine learning, November 2019.
[143] Rama K Vasudevan, Maxim Ziatdinov, Lukas Vlcek, and Sergei V Kalinin. Off-the-shelf deep learning is not enough, and requires parsimony, bayesianity, and causality. npj Computational Materials, 7(1):1–6, January 2021.
[144] Bernhard Schölkopf, Francesco Locatello, Stefan Bauer, Nan Rosemary Ke, Nal Kalchbrenner, Anirudh Goyal, and Yoshua Bengio. Towards causal representation learning, February 2021.
[145] Sebastian Seung. Connectome: How the Brain's Wiring Makes Us Who We Are. Houghton Mifflin Harcourt, 2012.
[146] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, November 2016.
[147] Judea Pearl. Bayesian networks, 2011.
[148] Avanti Athreya, Donniell E Fishkind, Minh Tang, Carey E Priebe, Youngser Park, Joshua T Vogelstein, Keith Levin, Vince Lyzinski, Yichen Qin, and Daniel L Sussman. Statistical inference on random dot product graphs: a survey. Journal of Machine Learning Research, 18(226):1–92, 2017.
[149] Carey E Priebe, Cencheng Shen, Ningyuan Huang, and Tianyi Chen. A simple spectral failure mode for graph convolutional networks. IEEE Trans. Pattern Anal. Mach. Intell., PP, August 2021.
[150] Keyulu Xu*, Weihua Hu*, Jure Leskovec, and Stefanie Jegelka. How Powerful are Graph Neural
Networks? In International Conference on Learning Representations, September 2018.
[151] Charles J Stone. Consistent Nonparametric Regression. Annals of Statistics, 5(4):595–620, July
1977.
[152] Ben Sorscher, Surya Ganguli, and Haim Sompolinsky. The geometry of concept learning.
bioRxiv, 2021.
[153] Chris R Sims. Efficient coding explains the universal law of generalization in human perception.
Science, 360(6389):652–656, 2018.
[154] Carey E Priebe, Joshua T Vogelstein, Florian Engert, and Christopher M White. Modern Machine
Learning: Partition & Vote, September 2020.
[155] Jonathan Frankle and Michael Carbin. The Lottery Ticket Hypothesis: Finding Sparse, Trainable
Neural Networks. In International Conference on Learning Representations, September 2018.
[156] Luc Devroye, Laszlo Györfi, and Gabor Lugosi. A probabilistic theory of pattern recognition. Stochastic Modelling and Applied Probability. Springer, New York, NY, 1996 edition, November 2013.
[157] Haoyin Xu et al. When are deep networks really better than decision forests at small sample
sizes, and how?, August 2021.
[158] Guido F Montufar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. On the number of
linear regions of deep neural networks. In Z Ghahramani, M Welling, C Cortes, N D Lawrence,
and K Q Weinberger, editors, Advances in Neural Information Processing Systems 27, pages
2924–2932. Curran Associates, Inc., 2014.
[159] Raja Giryes, Guillermo Sapiro, and Alex M Bronstein. Deep Neural Networks with Random
Gaussian Weights: A Universal Classification Strategy? IEEE Trans. Signal Process., 64(13):
3444–3457, July 2016.
[160] Leo Breiman. Random Forests. Mach. Learn., 45(1):5–32, October 2001.
[161] Robert E Schapire. The strength of weak learnability. Mach. Learn., 5(2):197–227, 1990.
[162] Manuel Fernández-Delgado, Eva Cernadas, Senén Barro, and Dinani Amorim. Do we need
hundreds of classifiers to solve real world classification problems. J. Mach. Learn. Res., 15(1):
3133–3181, 2014.
[163] R M French. Catastrophic forgetting in connectionist networks. Trends Cogn. Sci., 3(4):128–135,
April 1999.
[164] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis
Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. Overcoming catastrophic
forgetting in neural networks. Proc. Natl. Acad. Sci. U. S. A., 114(13):3521–3526, March 2017.
[165] Bernhard Schölkopf and Alexander J Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2002.
[166] Robert C Wilson, Elizabeth Bonawitz, Vincent D Costa, and R Becket Ebitz. Balancing exploration
and exploitation with information and randomization. Curr Opin Behav Sci, 38:49–56, April 2021.
[167] David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller.
Deterministic policy gradient algorithms. In Eric P Xing and Tony Jebara, editors, Proceedings of
the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine
Learning Research, pages 387–395, Beijing, China, 2014. PMLR.
[168] David Lopez-Paz, Robert Nishihara, Soumith Chintala, Bernhard Scholkopf, and Léon Bottou.
Discovering causal signals in images. In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pages 6979–6987. openaccess.thecvf.com, 2017.
[169] Demis Hassabis, Dharshan Kumaran, Christopher Summerfield, and Matthew Botvinick.
Neuroscience-Inspired artificial intelligence. Neuron, 95(2):245–258, July 2017.
[170] Adam H Marblestone, Greg Wayne, and Konrad P Kording. Toward an integration of deep learning and neuroscience. Frontiers in computational neuroscience, 10:94, 2016.
[171] Warren S McCulloch and Walter Pitts. A logical calculus of the ideas immanent in nervous activity.
Bull. Math. Biophys., 5(4):115–133, December 1943.
[172] F Rosenblatt. The perceptron: a probabilistic model for information storage and organization in
the brain. Psychol. Rev., 65(6):386–408, November 1958.
[173] Blake A Richards, Timothy P Lillicrap, Philippe Beaudoin, Yoshua Bengio, Rafal Bogacz, Amelia
Christensen, Claudia Clopath, Rui Ponte Costa, Archy de Berker, Surya Ganguli, et al. A deep
learning framework for neuroscience. Nature neuroscience, 22(11):1761–1770, 2019.
[174] Olivia Guest and Andrea E Martin. How computational modeling can force theory building in
psychological science. Perspect. Psychol. Sci., 16(4):789–802, July 2021.
[175] A M Turing. Computing machinery and intelligence. Mind, LIX(236):433–460, October 1950.
[176] R Jacob Vogelstein, Udayan Mallik, Joshua T Vogelstein, and Gert Cauwenberghs. Dynamically
reconfigurable silicon array of spiking neurons with conductance-based synapses. IEEE Trans.
Neural Netw., 18(1):253–265, January 2007.
[177] M Davies. Progress in neuromorphic computing: Drawing inspiration from nature for gains in AI and computing. In 2019 International Symposium on VLSI Design, Automation and Test (VLSI-DAT), pages 1–1, April 2019.
[178] G A Silva, A R Muotri, and C White. Understanding the human brain using brain organoids and
a Structure-Function theory, 2020.
[179] Andrew G Barto, Richard S Sutton, and Charles W Anderson. Neuronlike adaptive elements
that can solve difficult learning control problems. IEEE Trans. Syst. Man Cybern., SMC-13(5):
834–846, September 1983.
[180] Simon D Fisher, Paul B Robertson, Melony J Black, Peter Redgrave, Mark A Sagar, Wickliffe C
Abraham, and John N J Reynolds. Reinforcement determines the timing dependence of corticostriatal synaptic plasticity in vivo. Nat. Commun., 8(1):334, August 2017.
[181] Tomomi Shindou, Mayumi Shindou, Sakurako Watanabe, and Jeffery Wickens. A silent eligibility
trace enables dopamine-dependent synaptic plasticity for reinforcement learning in the mouse
striatum. Eur. J. Neurosci., 49(5):726–736, March 2019.
[182] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, Stig Petersen,
Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, February 2015.
[183] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez,
Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan
Hui, Laurent Sifre, George van den Driessche, Thore Graepel, and Demis Hassabis. Mastering
the game of Go without human knowledge. Nature, 550(7676):354–359, October 2017.
[184] Oriol Vinyals et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning.
Nature, 575(7782):350–354, November 2019.
[185] Max Jaderberg et al. Human-level performance in 3D multiplayer games with population-based
reinforcement learning. Science, 364(6443):859–865, May 2019.
[186] Will Dabney, Zeb Kurth-Nelson, Naoshige Uchida, Clara Kwon Starkweather, Demis Hassabis,
Rémi Munos, and Matthew Botvinick. A distributional code for value in dopamine-based reinforcement learning. Nature, 577(7792):671–675, January 2020.
[187] K Shao, Z Tang, Y Zhu, N Li, and D Zhao. A survey of deep reinforcement learning in video
games. arXiv preprint arXiv:1912.10944, 2019.
[188] Erik Peterson, Necati Alp Müyesser, Timothy Verstynen, and Kyle Dunovan. Combining imagination and heuristics to learn strategies that generalize. Neurons, Behavior, Data analysis, and
Theory, 3(4):1–19, 2020.
[189] Arghyadeep Das, Vedant Shroff, Avi Jain, and Grishma Sharma. Knowledge transfer between
similar atari games using deep Q-Networks to improve performance. In 2021 12th International
Conference on Computing Communication and Networking Technologies (ICCCNT), pages 1–8.
ieeexplore.ieee.org, July 2021.
[190] Andrew G Barto and Sridhar Mahadevan. Recent advances in hierarchical reinforcement learning. Discrete event dynamic systems, 13(1):41–77, 2003.
[191] Pierre-Luc Bacon, Jean Harb, and Doina Precup. The Option-Critic architecture. AAAI, 31(1),
February 2017.
[192] Kevin Frans, Jonathan Ho, Xi Chen, Pieter Abbeel, and John Schulman. Meta learning shared
hierarchies. arXiv preprint arXiv:1710.09767, 2017.
[193] Alexander Sasha Vezhnevets, Simon Osindero, Tom Schaul, Nicolas Heess, Max Jaderberg,
David Silver, and Koray Kavukcuoglu. Feudal networks for hierarchical reinforcement learning. In
International Conference on Machine Learning, pages 3540–3549. PMLR, 2017.
[194] Elias Bareinboim. Causal reinforcement learning. https://crl.causalai.net/, 2022. Accessed: 2022-1-12.
[195] Rosemary Nan Ke, Anirudh Goyal, Jane Wang, Stefan Bauer, Silvia Chiappa, Jovana Mitrovic,
Theophane Weber, and Danilo Rezende. Causal learning for decision making (CLDM). https:
//causalrlworkshop.github.io/, 2022. Accessed: 2022-1-12.
[196] Aurelien Bibaut, Maria Dimakopoulou, Nathan Kallus, Xinkun Nie, Masatoshi Uehara, and
Kelly Zhang. Causal sequential decision making workshop. https://nips.cc/Conferences/2021/
ScheduleMultitrack?event=21863, 2022. Accessed: 2022-1-12.
[197] Benjamin Eysenbach, Abhishek Gupta, Julian Ibarz, and Sergey Levine. Diversity is all you need:
Learning skills without a reward function. arXiv preprint arXiv:1802.06070, 2018.
[198] Deepak Pathak, Dhiraj Gandhi, and Abhinav Gupta. Self-supervised exploration via disagreement. In International conference on machine learning, pages 5062–5071. PMLR, 2019.
[199] Archit Sharma, Shixiang Gu, Sergey Levine, Vikash Kumar, and Karol Hausman. Dynamicsaware unsupervised discovery of skills. arXiv preprint arXiv:1907.01657, 2019.
[200] Iris van Rooij and Giosuè Baggio. Theory before the test: How to build High-Verisimilitude explanatory theories in psychological science. Perspect. Psychol. Sci., 16(4):682–697, July 2021.
[201] Mark K Singley and John Robert Anderson. The Transfer of Cognitive Skill. Harvard University
Press, 1989.
[202] Eric Schulz, Joshua B Tenenbaum, David Duvenaud, Maarten Speekenbrink, and Samuel J
Gershman. Probing the compositionality of intuitive functions. CBMM, May 2016.
[203] Alison Gopnik, Clark Glymour, David M Sobel, Laura E Schulz, Tamar Kushnir, and David Danks.
A theory of causal learning in children: causal maps and bayes nets. Psychol. Rev., 111(1):3–32,
January 2004.
[204] Alison Gopnik and Elizabeth Bonawitz. Bayesian models of child development. Wiley Interdiscip.
Rev. Cogn. Sci., 6(2):75–86, March 2015.
[205] Natalia Díaz-Rodríguez, Vincenzo Lomonaco, David Filliat, and Davide Maltoni. Don’t forget,
there is more than forgetting: new metrics for Continual Learning, October 2018.
[206] Melanie Mitchell. On Crashing the Barrier of Meaning in Artificial Intelligence. AI Magazine, 41
(2):86–92, 2020.
[207] Rich Sutton. The bitter lesson. http://www.incompleteideas.net/IncIdeas/BitterLesson.html,
March 2019. Accessed: 2021-3-16.
[208] Tom M Mitchell. Machine Learning. McGraw-Hill Education, 1st edition, March 1997.
[209] Cassio Pennachin and Ben Goertzel. Contemporary approaches to artificial general intelligence.
In Artificial general intelligence, pages 1–30. Springer, 2007.
[210] Deepak Pathak, Pulkit Agrawal, Alexei A Efros, and Trevor Darrell. Curiosity-driven exploration by
self-supervised prediction. In International conference on machine learning, pages 2778–2787.
PMLR, 2017.
[211] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly
learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
[212] Yoshua Bengio. The consciousness prior, September 2017.
[213] David Lazer, Ryan Kennedy, Gary King, and Alessandro Vespignani. The Parable of Google Flu:
Traps in Big Data Analysis. Science, March 2014.
[214] Solon Barocas and Andrew D Selbst. Big data’s disparate impact. Calif. Law Rev., 104:671,
2016.
[215] Cathy O’Neil. Weapons of Math Destruction: How Big Data Increases Inequality and Threatens
Democracy. Crown, September 2016.
[216] James Zou and Londa Schiebinger. AI can be sexist and racist — it’s time to make it fair, 2018.
[217] Angharad N Valdivia. Algorithms of Oppression: How Search Engines Reinforce Racism by
Safiya Umoja Noble (review). Feminist Formations, 30(3):217–220, 2018.
[218] Sam Devlin, Raluca Georgescu, Ida Momennejad, Jaroslaw Rzepecki, Evelyn Zuniga, Gavin
Costello, Guy Leroy, Ali Shaw, and Katja Hofmann. Navigation Turing Test (NTT): Learning to evaluate human-like navigation. http://proceedings.mlr.press/v139/devlin21a/devlin21a.pdf,
2021. Accessed: 2022-1-8.
[219] Takayuki Hosokawa and Masataka Watanabe. Prefrontal neurons represent winning and losing
during competitive video shooting games between monkeys. J. Neurosci., 32(22):7662–7671,
May 2012.
[220] Viktor Tóth. A neuroengineer’s guide on training rats to play doom. https://medium.com/mindsoft/
rats-play-doom-eb0d9c691547, 2020.
[221] Abeba Birhane. Algorithmic injustice: a relational ethics approach. Patterns (N Y), 2(2):100205,
February 2021.
[222] Cathy O’Neil. Weapons of Math Destruction: How Big Data Increases Inequality and Threatens
Democracy. Crown, September 2016.