
Cognition 235 (2023) 105406
Commonsense psychology in human infants and machines

Gala Stojnić a, Kanishk Gandhi b, Shannon Yasuda a, Brenden M. Lake a,c, Moira R. Dillon a,*
a Department of Psychology, New York University, New York, NY, USA
b Department of Computer Science, Stanford University, Palo Alto, CA, USA
c Center for Data Science, New York University, New York, NY, USA
ARTICLE INFO
Keywords: Intuitive psychology; Commonsense psychology; Action understanding; Infancy; Machine common sense; Artificial intelligence
ABSTRACT
Human infants are fascinated by other people. They bring to this fascination a constellation of rich and flexible expectations about the intentions motivating people’s actions. Here we test 11-month-old infants and state-of-the-art learning-driven neural-network models on the “Baby Intuitions Benchmark (BIB),” a suite of tasks challenging both infants and machines to make high-level predictions about the underlying causes of agents’ actions. Infants expected agents’ actions to be directed towards objects, not locations, and infants demonstrated default expectations about agents’ rationally efficient actions towards goals. The neural-network models failed to capture infants’ knowledge. Our work provides a comprehensive framework in which to characterize infants’ commonsense psychology and takes the first step in testing whether human knowledge and human-like artificial intelligence can be built from the foundations cognitive and developmental theories postulate.
* Corresponding author.
E-mail address: [email protected] (M.R. Dillon).
https://doi.org/10.1016/j.cognition.2023.105406
Received 8 September 2022; Received in revised form 8 February 2023; Accepted 9 February 2023
Available online 16 February 2023
0010-0277/© 2023 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
The early-developing ease with which infants know about people (Gergely, Nádasdy, Csibra, & Bíró, 1995; Woodward, 1998), objects (Spelke, 1990; Stahl & Feigenson, 2015), and places (Hermer & Spelke, 1994) is impressive, especially compared with the difficulties machines have had in achieving these simple human competencies (Lake, Ullman, Tenenbaum, & Gershman, 2017; Marcus & Davis, 2019). Such differences between human and artificial intelligence (AI) are critical to address if we aim to create commonsense AI, leading to AI that we better understand and that better understands us.
One of the general challenges of building commonsense AI is deciding what knowledge to start with. A human infant’s foundational knowledge is limited, abstract, and reflects our evolutionary inheritance, yet it can accommodate any context or culture in which that infant might develop (Spelke, 2022; Spelke & Kinzler, 2007). If an aim of AI is to build the flexible, commonsense thinker that human adults become, then machines might need to start like adults do, from the same core abilities as infants, whether achieved through learning-driven or engineered approaches (Botvinick et al., 2017).
Over the past several decades, foundational research on infants’ commonsense psychology, i.e., infants’ understanding of the intentions, goals, preferences, and rationality underlying agents’ actions, has suggested that infants attribute goals to agents and expect agents to pursue goals in rationally efficient ways (Baillargeon, Scott, & Bian, 2016; Gergely et al., 1995; Spelke, 2022; Woodward, 1998). The predictions that support infants’ commonsense psychology are foundational to human social intelligence (Banaji & Gelman, 2013; Jara-Ettinger, Gweon, Schulz, & Tenenbaum, 2016) and could thus inform better commonsense AI, but these predictions are typically missing from machine-learning algorithms, which instead predict actions directly (e.g., churn, clicks, likes, etc.; Griffiths, 2015), and therefore lack flexibility to new contexts and situations.
Nevertheless, research on infants’ commonsense psychology has not yet been evaluated in a framework that could be directly tested against machines’ (let alone built into them) because of non-scalable stimuli, varied task demands, isolated questions, and mixed results. For example, experiments on infants’ commonsense psychology have exemplified agents and their actions using various displays, from live human actors reaching for everyday objects (Woodward, 1998), to live puppets with or without animate features like eyes or fur (Johnson, Slaughter, & Carey, 1998), to highly minimal animations of simple shapes navigating in 2D or 3D worlds (Csibra, Bíró, Koós, & Gergely, 2003; Csibra, Gergely, Bíró, Koós, & Brockbank, 1999). These experiments have also typically focused on individual questions of, e.g., goal (Woodward, 1998) or rationality (Gergely et al., 1995) attribution, although some work has probed, for example, how infants’ inferences about goals and rationality might combine to support notions of consistency, cost, or value (Liu, Ullman, Tenenbaum, & Spelke, 2017; Scott & Baillargeon, 2013).
Different accounts of infants’ knowledge about agents have
suggested that this knowledge: coheres as a unified set of abstract concepts of causal efficacy, efficiency, goal-directedness, and perceptual access (Spelke, 2022); reflects infants’ intuitive understanding of agents’ mental states, which direct their efficient actions consistent with their mental states (Baillargeon et al., 2015; Baillargeon et al., 2016); or emerges from individual achievements rooted in infants’ own action experience (Woodward, 2009; Woodward, Sommerville, & Guajardo, 2001). From this rich experimental and theoretical tradition thus arises the need for a comprehensive framework in which to characterize infants’ knowledge of agents with results on one task comparable with those on another and with results on the suite of tasks comparable across infants and machines. Such a framework can inform both theories of infants’ knowledge and the future of human-like AI.
Here we take a critical step in addressing this need. We provide a comprehensive framework for testing infants’ commonsense psychology by assessing infants’ performance on the “Baby Intuitions Benchmark (BIB),” a suite of six tasks probing commonsense psychology. BIB was designed expressly to allow for testing both infant and machine intelligence alike (Gandhi, Stojnic, Lake, & Dillon, 2021), and fulfilling that intention, here we also directly compare the performance of infants and machines, providing an empirical foundation for building human-like AI.
Fig. 1. Schematic of BIB’s six tasks used in Experiments 1 & 2 (see also Fig. S1). For each task, observers first see eight familiarization trial videos in which an agent
acts consistently in terms of its goals, rationality, or instrumentality. The exact make-up of the grid world and the movement of the agent may vary across trials, as
described in the main text and SI. One example still image per task from a familiarization trial video is shown here. Observers then see expected and unexpected test
trial videos (with the order of these trials varying for infants). Example still images of both test trial videos per task are shown here. All of the videos are available at:
https://osf.io/r98je/.
1. General methods
1.1. Materials
BIB’s tasks include short silent animated videos with simple visuals (Heider & Simmel, 1944), like basic shapes without eyes or limbs, undertaking basic movements in a grid world (Figs. 1 and S1). This design allowed for the stimuli’s scalable procedural generation, which is required for testing machine-learning algorithms, and emphasized the high-level properties of agents (Csibra et al., 1999; Gao, McCarthy, & Scholl, 2010; Johnson & Gilmore, 2003; Meltzoff, 1995), which challenges the limits and abstraction of an observer’s inferential capacity (Kominsky, Lucca, Thomas, Frank, & Hamlin, 2022). This design also presented a novel, overhead navigational context, which required an assumption of agents’ full observability of the grid world and its contents (Baker, Jara-Ettinger, Saxe, & Tenenbaum, 2017; Luo & Baillargeon, 2007; Luo & Johnson, 2009; Rabinowitz, Perbet, Song, Zhang, & Botvinick, 2018).
Importantly, all of BIB’s tasks are presentationally consistent, allowing for comparisons across tasks, without concerns of attributing null effects to varying visual, memory, or other task demands. Instead of focusing on one principle of commonsense psychology, moreover, BIB’s tasks focus on three possible attributions to agents’ actions that an observer could make (goal attribution, rationality attribution, and instrumentality attribution), thereby addressing whether and how such principles of commonsense psychology might cohere.
Using BIB’s environment (Gandhi et al., 2021), we procedurally generated the video stimuli to test infants and computational models and chose the clearest examples of the particular principles of commonsense psychology targeted by each task (Figs. 1 and S1). The first three tasks focus on an observer’s attribution of goals to agents’ actions. The Goal-Directed Task captures the idea that agents’ goals are directed towards objects, not locations. Observers watch an agent repeatedly move to the same one of two objects in approximately the same location in an unchanging grid world during familiarization. At test, observers may be more surprised when the agent moves to a new object in that grid world after the locations of the two objects switch (Woodward, 1998). The Multi-Agent Task asks whether goals are specific to agents. Observers watch an agent move to the same one of two objects during familiarization in a changing grid world, with both objects appearing in varying locations. At test, observers may be more surprised when the original agent versus a new agent moves to a new object (Buresh & Woodward, 2007; Repacholi & Gopnik, 1997). The Inaccessible-Goal Task asks whether agents might form new goals when their existing goals become unattainable. Observers watch an agent move to the same one of two objects during familiarization in a changing grid world, with both objects appearing in varying locations. At test, the grid world changes again such that the agent’s goal object becomes physically inaccessible. Observers may be more surprised when the agent moves to a new object when its prior goal object is accessible versus inaccessible (Luo & Baillargeon, 2007; Scott & Baillargeon, 2013).
The next two tasks focus on an observer’s attribution of rationality to agents’ actions. The Efficient-Agent Task captures the idea that agents act rationally to achieve goals. Observers watch an agent move to an object efficiently around obstacles in an unchanging grid world during familiarization. At test, the object appears in a location in which it had appeared during familiarization, but the grid world has changed such that the obstacles that blocked the object are gone or have been replaced with different obstacles (Gergely et al., 1995; Liu & Spelke, 2017). Observers may be more surprised when the agent moves along a familiar but now inefficient path to the object. The Inefficient-Agent Task asks what expectations observers have about agents who initially move inefficiently in a changing grid world. During familiarization, observers watch an agent move along the same paths to an object as the agent in the Efficient-Agent Task, but this time there are no obstacles in the agent’s way, so the agent’s movements to the object are inefficient. At test, the environment changes as in the Efficient-Agent Task. Observers may either be more surprised when the agent continues to move inefficiently to the object (Liu & Spelke, 2017) or may have no expectations about whether that agent will move efficiently or inefficiently to the object (Gergely et al., 1995).
The last task focuses on an observer’s attribution of instrumentality to agents’ actions. The Instrumental-Action Task captures the idea that agents should only take instrumental actions when necessary. During familiarization, observers watch an agent move first to a key, which it uses to remove a barrier around an object in varying locations, and then to that object. At test, observers may be more surprised when the agent continues to move to the key, instead of directly to the object, when the barrier is no longer blocking the object (Sommerville & Woodward, 2005; Woodward & Sommerville, 2000).
All of the stimuli videos are available at: https://osf.io/r98je/, and additional details about each task are included in the SI.
BIB’s task structure adopts the “violation-of-expectation” looking-time paradigm often used to test infants (Spelke, 1985; Téglás et al., 2011). Observers see a series of familiarization trials that serve to set up an expectation followed by an expected outcome that is perceptually dissimilar to the familiarization but is conceptually consistent and an unexpected outcome that is perceptually similar to the familiarization but is conceptually surprising. This task structure has been used in recent machine-learning benchmarks focusing on common sense (Piloto, Weinstein, Battaglia, & Botvinick, 2022; Shu et al., 2021; Smith et al., 2019) and is advantageous because it both protects against low-level heuristic-based solutions (Spelke, 1985) and allows for an algorithm’s quantitative measure of surprise to be compared with a well-established psychological measure of surprise (Piloto et al., 2022; Stahl & Kibbe, 2022).
2. Infant methods
2.1. Infant design and analyses
In Experiment 1, we collected infants’ responses to two of BIB’s six tasks, the Goal-Directed Task and the Efficient-Agent Task. Mixed-model linear regressions with raw looking time as the dependent variable, outcome (expected versus unexpected) as a fixed effect, and participant as a random-effects intercept evaluated infants’ performance on each task, and an additional regression examined infants’ overall performance across both tasks. To obtain p-values, we ran Type 3 Wald tests on the results of each regression. Experiment 1 focused on these two tasks because the common sense they measured has had consistent findings in the prior literature on infants’ action understanding (Baillargeon et al., 2016; Gergely & Csibra, 2003; Spelke, 2022; Woodward, 2009). Experiment 1 thus aimed to provide initial evidence of infants’ commonsense psychology, as elicited by BIB’s highly minimal displays, in BIB’s fully observable, overhead navigational context, and with BIB’s multiple tasks presented to infants online.
Experiment 2 followed a preregistered design and analysis plan (https://osf.io/p6kba) with replications of the two tasks in Experiment 1 with several improvements, including: automated trial progression; balancing of the side of the goal object across participants in the Goal-Directed Task; and matching of the test-trial lengths within participants in the Efficient-Agent Task. Infants were tested on these two tasks as well as on BIB’s other four tasks outlined above that were not included in Experiment 1.
Following Experiment 1, Experiment 2 evaluated infants’ performance on each task with planned mixed-model linear regressions and Type 3 Wald tests with raw looking time as the dependent variable, outcome (expected versus unexpected) as a fixed effect, and participant as a random-effects intercept. Additional planned regressions examined infants’ overall performance across all six tasks and directly compared their performance on the two tasks focused on agents’ rational actions.
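The per-task regressions described in Section 2.1 can be illustrated with a simplified special case. When every infant in a task contributes exactly one expected and one unexpected looking time, a random-intercept model with outcome as the only fixed effect should reduce to a paired comparison, so that the F statistic for outcome is the squared paired t statistic on (1, n − 1) degrees of freedom. The sketch below assumes complete pairs and a non-negative random-intercept variance estimate; the function name and the looking times are hypothetical, and this is not the authors’ analysis code.

```python
from math import sqrt
from statistics import mean, stdev


def paired_outcome_f(expected, unexpected):
    """Return (F, df1, df2) for the outcome effect from paired looking times.

    Assumption: each participant has exactly one looking time per outcome,
    listed in the same participant order in both sequences.
    """
    if len(expected) != len(unexpected):
        raise ValueError("each participant needs both outcomes")
    # Within-participant differences remove the participant intercept.
    diffs = [u - e for e, u in zip(expected, unexpected)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t * t, 1, n - 1


# Made-up looking times in seconds, one pair per infant.
expected = [8.0, 12.5, 6.0, 15.0, 9.5]
unexpected = [11.0, 13.0, 9.5, 14.0, 16.0]
f_value, df1, df2 = paired_outcome_f(expected, unexpected)
print(f"F({df1}, {df2}) = {f_value:.2f}")
```

The cross-task regressions, which add task and the task by outcome interaction, do not reduce to this paired form and would need a full mixed-model fit.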
2.2. Infant participants
In Experiment 1, typically developing 11-month-old infants (N = 26, Mage = 11.13 months, Range = 10.42 months – 11.83 months; 12 girls) born at ≥37 weeks gestational age were included. They completed the Goal-Directed Task, the Efficient-Agent Task, or both, with half of the infants receiving each task first, totaling N = 48 individual testing sessions and N = 24 sessions per task. An additional four sessions were excluded because infants did not complete the session.
In Experiment 2, typically developing 11-month-old infants (N = 58, Mage = 11.06 months, Range = 10.50 months – 11.50 months; 31 girls) born at ≥37 weeks gestational age were included. Each infant completed at least one of BIB’s tasks, totaling N = 288 individual testing sessions. Following our preregistration, data collection stopped when 32 infants (Mage = 11.09 months, Range = 10.50 months – 11.50 months; 17 girls) completed all six of BIB’s tasks. Tasks were presented in a semi-randomized order using 32 fixed orders that averaged to each task being presented 5.33 times in each ordinal position (range: 4–7 times). All included sessions for each task contributed to the analyses reported here. The final sample sizes for each task were: Goal-Directed Task, N = 48; Multi-Agent Task, N = 49; Inaccessible-Goal Task, N = 47; Efficient-Agent Task, N = 47; Inefficient-Agent Task, N = 49; Instrumental-Action Task, N = 48. The results from the 32 infants who completed all six of BIB’s tasks were consistent with the results reported here and so are reported in the SI.
An additional 37 sessions were excluded because of preregistered exclusion criteria, including: looking time < 1.5 s to at least one test trial and/or two familiarization trials with or without the infant completing the session (16); poor video quality and/or technical failure (18); and caretaker interference (3). An additional two sessions were excluded post hoc for extreme values (> 40 s) to one test outcome, which could artificially inflate the calculation of the sample’s variance. These extreme values were identified through examination of a histogram of the raw looking times across all of the sessions and across all of the tasks by two researchers masked to the task and outcome represented by each value. Exclusions were consistent across tasks: Goal-Directed Task, 5; Multi-Agent Task, 6; Inaccessible-Goal Task, 9; Efficient-Agent Task, 7; Inefficient-Agent Task, 5; Instrumental-Action Task, 7. The total exclusion rate was 11.9%.
Participating families received a $5 Amazon gift card after each testing session and received a bonus gift card of $30 if they completed all six sessions. Prior to participation in session one, we obtained informed consent from the infant’s legal guardian, and we confirmed consent before each subsequent session. The use of human participants for this study was approved by the Institutional Review Board on the Use of Human Subjects at our university.
2.3. Infant procedure
Infants were tested online on Zoom. In the first ten minutes of the first testing session, the experimenter explained to caretakers the instructions for setting up their device and for positioning the infant in front of the screen. We asked caretakers to close their eyes and not communicate with the infant during the stimuli presentation. The experimenter, masked to what trial was being presented and the order of the test trials, coded infants’ looking to the stimuli live from the start of each video and controlled the progression of stimuli using PyHab (Kominsky, 2019) and slides.com. Each trial video was preceded by a 5 s attention grabber (a swirling blob accompanied by a chiming sound, centered on the screen) to focus the infant’s attention to the screen, and each video froze after the agent reached an object. The last frame of the video remained on the screen until infants looked away for 2 s consecutively or for a maximum of 60 s. Testing sessions were recorded through the Zoom recording function, capturing both the infant’s face and the screen presenting the stimuli.
Following our preregistration, a different researcher, masked to the study outcome, what trial was being presented, and the order of the test trials, recoded 48 randomly chosen sessions (25%) from the 32 infants who completed all six tasks. The reliability between the first and second coder was very high (ICC = 0.98).
3. Infant results
Infants’ performance on Experiment 1’s two tasks is displayed in Fig. 2. Infants’ looking time varied by task, with longer looking to the Efficient-Agent versus Goal-Directed Task (F(1, 71) = 9.34, p = .003), reflecting the longer test-trial lengths in the Efficient-Agent Task (see SI). Overall, infants looked longer to the unexpected versus expected outcomes (F(1, 66) = 11.34, p = .001), and there was no task by outcome interaction (F(1, 66) = 0.30, p = .585). Infants were surprised (looked longer) when an agent moved to a new object in the Goal-Directed Task (F(1, 23) = 4.73, p = .040), and they were surprised when an efficient agent later took an inefficient path to an object in the Efficient-Agent Task (F(1, 23) = 2.60, p = .016).
Infants’ performance on Experiment 2’s six tasks is also displayed in Fig. 2. Infants’ looking time varied by task (F(5, 341) = 2.78, p = .018), reflecting the different test-trial lengths of the different tasks (see SI). Overall, infants did not look longer to unexpected versus expected outcomes (F(1, 341) = 2.27, p = .133), but a task by outcome interaction suggested that different tasks elicited different patterns of infants’ looking (F(5, 341) = 2.23, p = .051).
We first considered infants’ performance on Experiment 2’s three tasks that focused on goal attribution: the Goal-Directed, Multi-Agent, and Inaccessible-Goal Tasks. First, consistent with the results in Experiment 1, infants were surprised when an agent moved to a new object in the Goal-Directed Task (F(1, 47) = 4.09, p = .049). Infants presented with a new agent in the Multi-Agent Task, however, did not show a difference in surprise when that agent versus the original agent moved to a new object (F(1, 48) = 3.41, p = .071; with longer looking times to the expected outcome). Infants in the Inaccessible-Goal Task also did not show a difference in surprise when an agent moved to a new object when its goal object was accessible versus inaccessible (F(1, 46) = 0.02, p = .891).
We next considered infants’ performance on the two tasks that focused on rationality attribution: the Efficient-Agent and Inefficient-Agent Tasks. First, consistent with the results in Experiment 1, infants were surprised when an efficient agent later took an inefficient path to an object in the Efficient-Agent Task (F(1, 46) = 7.72, p = .008). Infants in the Inefficient-Agent Task did not show a difference in surprise when an inefficient agent continued to move inefficiently to an object at test (F(1, 48) = 2.51, p = .119). But, when comparing infants’ performance in the Efficient-Agent and Inefficient-Agent Tasks directly, there was no significant task by outcome interaction (F(1, 132) = 0.49, p = .484): We did not find evidence that infants’ surprise at the inefficient agent’s later inefficient action was different from their surprise at the efficient agent’s later inefficient action.
Finally, we considered infants’ instrumentality attribution through their performance on the Instrumental-Action Task. Infants did not show a difference in surprise when the agent moved to the tool as opposed to its goal object when the tool was no longer needed to achieve the goal (F(1, 47) = 0.03, p = .853).
4. Infant discussion
Infants’ successful performance in the Goal-Directed and Efficient-Agent Tasks in both Experiments 1 and 2 suggests that they expect agents’ actions to be goal directed towards objects, not locations, and that they expect agents’ goal-directed actions to be rationally efficient. These results also show that infants’ common sense about the underlying causes of agents’ actions is accessible when testing infants online and is highly abstract: Infants’ expectations are elicited by BIB’s minimal displays and are generalizable to BIB’s novel, overhead navigational context.
Fig. 2. Infants’ raw looking times to the two outcomes in each of BIB’s tasks in Experiments 1 & 2. Gray lines connect the individual looking times (represented by blue and yellow dots) of each infant to each outcome. Red dots connected by red lines indicate the mean looking times to each outcome for each task. Beta coefficients are effect sizes in terms of standard deviations, and statistical analyses are reported in the main text (*p < .05, **p < .01). (For interpretation of the references to colour in this figure legend, please see the online version.)
This latter suggestion is especially striking given infants’ success on the Efficient-Agent Task since obstacles in the grid world blocked an agent’s direct access to the goal object. Given infants’ sensitivity to and use of agents’ perceptual access to objects when making inferences about agents’ actions (Luo & Baillargeon, 2007; Luo & Johnson, 2009), infants evidently appreciated BIB’s blocking obstacles as only physical, not perceptual. With BIB’s context providing no information that these obstacles limit an agent’s perceptual access, infants may have interpreted the obstacles as something that agents could “see over” or “see through.” Future studies could explore how infants appreciate the geometric, physical, and perceptual affordances of such overhead navigational environments.
Infants’ pattern of performance on BIB thus enriches our understanding of their commonsense psychology and raises new questions about the abstract principles that might be inherent to that common sense. Building on questions of infants’ sensitivity to agents’ physical and perceptual access to objects, future versions of the Goal-Directed Task could reveal how having an agent move around obstacles to a goal object instead of taking only straight paths, actions that provide additional cues to agency (Johnson, Shimizu, & Ok, 2007; Luo & Baillargeon, 2005), might bolster infants’ goal attribution in that task. Introducing significant changes to the arrangement of obstacles across the familiarization and test environments in the Goal-Directed Task, moreover, could explore the effects of context changes on goal attribution (Liu & Spelke, 2017; Sommerville & Crane, 2009). These latter results might also shed light on infants’ failures in some of BIB’s other tasks. For example, infants may have failed in the Inaccessible-Goal Task because the arrangement of obstacles changed from familiarization to test, including in a way that affected one object’s physical accessibility. Infants may have found a change in the object’s accessibility itself surprising, or they may not have generalized the agent’s goal to this new test environment with significantly different physical affordances because they interpreted this change as indicating two different places in which the agent was acting (Sommerville & Crane, 2009). The Multi-Agent Task similarly changed the arrangement of obstacles from familiarization to test, although infants may have failed in this task simply because of heightened attention to the new agent, who appeared for the first and only time in the expected outcome (prior studies showing agent-specific goal attribution had presented the new agent in both test outcomes; Buresh & Woodward, 2007).
Changes to the affordances of the environment from familiarization to test may also explain the pattern of findings in the Inefficient-Agent Task, which did not differ from the patterns of findings in the Efficient-Agent Task. In particular, previous literature suggests both that infants do not expect an agent who had previously moved inefficiently to later move efficiently when an obstacle present during familiarization is removed from the test environment (Gergely et al., 1995; Skerry, Carey, & Spelke, 2013) and that infants do expect a previously inefficient agent to later move efficiently if the test environment introduces a new obstacle (Liu & Spelke, 2017). The changes in the number and location of the obstacles across the Inefficient-Agent Task’s familiarization and test environments may have weakly elicited, or elicited in only some infants, this latter, “default” prediction about rationally efficient goal-directed actions for inefficient agents in the Inefficient-Agent Task (Liu & Spelke, 2017). Future versions of the Inefficient-Agent Task could thus focus specifically on the effects of different kinds of changes in the context and in the environment’s affordances on infants’ rationality attribution.
Finally, given infants’ successes in previous tasks probing their understanding of instrumental actions, infants may have failed in BIB’s Instrumental-Action Task because they could not understand the tool object’s causal efficacy (Sommerville, Hildebrand, & Crane, 2008) or the agent’s ultimate goal. Specifically, prior findings suggesting that infants recognize agents’ instrumental actions (e.g., the use of a tool) relied on tools whose causal efficacy was familiar to infants (e.g., pulling a cloth to bring a toy within reach; Piaget, 1953; Sommerville & Woodward, 2005) or on novel tools with which infants were first given direct experience (Sommerville et al., 2008). The tool infants saw in the Instrumental-Action Task was both novel and not something they were given experience with. Future versions of the Instrumental-Action Task might thus introduce state-changes, such as colour changes, to the contacted tools and objects, which, in previous studies, have made the causal efficacy of otherwise novel and inscrutable actions appreciable to young infants (Liu, Brooks, & Spelke, 2019; Skerry et al., 2013).
5. Model methods
5.1. Model design and analyses
To examine whether infants’ intelligence about agents might be reflected in state-of-the-art machine intelligence, we compared infants’ performance on BIB in Experiment 2 to the performance of three learning-driven neural-network models. Following prior work (Gandhi et al., 2021; Rabinowitz et al., 2018), the models formed predictions about an agent’s actions at test based on its actions during familiarization. To obtain a continuous measure of surprise as a correlate of infants’ looking time, we calculated the models’ prediction error for each frame of each outcome and considered the frame with the maximum error. To compare model and infant performance, we then calculated the Z-scored mean surprisal score to each outcome for each model and the Z-scored mean looking time to each outcome for infants. Z-scores were calculated within task. For an unplanned quantitative comparison of the overall similarity between the infants’ and each model’s performance, we evaluated the root mean squared error (RMSE) across BIB’s six tasks using the mean Z-score to the unexpected outcome. We also included a comparison between infants’ performance and a “baseline,” which we gave a surprisal score of “0” for all tasks.
Finally, to confirm that the models’ performance on the specific trials presented to infants was representative of their performance more generally and not due to any idiosyncrasies of the particular videos shown to infants, we also evaluated the models’ accuracy on BIB’s full dataset (Gandhi et al., 2021). Because those results were consistent with the models’ performance on the infant videos and with prior work (Gandhi et al., 2021), they are reported in the SI.
5.2. Model specifications
Learning-driven neural network models have accelerated recent advances in AI (LeCun, Bengio, & Hinton, 2015; Rabinowitz et al., 2018), and so we chose to compare such models’ performance on BIB to infants’. Approaches like reinforcement learning (Sutton & Barto, 2018) and inverse reinforcement learning (Ng & Russell, 2000), for example, have succeeded in learning to control agents and in understanding the actions of agents, but these approaches cannot be used with BIB because they require privileged information, including the ability to actively control agents in the test environment and, in the case of reinforcement learning, receive a reward. Infants engage with stimuli like BIB’s through passive observation, and so we based our modeling on the “Theory of Mind Net (ToMnet)” architecture from Rabinowitz et al. (2018), which is a neural network designed specifically for passive observation that has been shown to make inferences about an agent’s underlying mental states from its behavior.
With this architecture, we tested three models from two classes: behavioral cloning (BC) and video modeling (Gandhi et al., 2021). The models’ schematized architectures are presented in Figs. 3 and S2. Two BC models predicted how an agent would act using the background training as examples of state and action pairs (see Model Training below). To predict the agent’s next action in a test trial, BC combined information from the learned features from the previous frame of a test-trial video along with the learned features in the set of familiarization-trial videos. Video modeling used a similar strategy, architecture, and training procedure, but it aimed to predict the entire next frame of the
[Fig. 3 diagram. Recoverable labels: 8 familiarization trials; LSTM; characteristic (averaged); BC-RNN model: policy MLP, prediction, MSE loss; video model: characteristic tiled to 64 × 64, test frame, U-Net, prediction, MSE loss.]
Fig. 3. Architecture of the video and BC RNN models (Gandhi et al., 2021; Rabinowitz et al., 2018). An agent-characteristic embedding was inferred from the familiarization trials using a recurrent net. This embedding, together with a frame from the test trial, was used to predict the next action of the agent in the case of the BC model and the next frame of the video, using a U-net (Ronneberger, Fischer, & Brox, 2015), in the case of the video model.
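As a rough illustration of how a characteristic embedding of this kind can be assembled, the following Python sketch follows the aggregation steps described in the text: frames are randomly sub-sampled to at most 30 per familiarization trial, embeddings are aggregated within each trial (the average for the MLP encoder, the last step for the RNN encoder), and the per-trial results are then averaged across trials. It is not the authors' implementation. All function names are hypothetical, and plain lists of numbers stand in for the outputs of the learned CNN, MLP, and RNN encoders.

```python
import random

MAX_FRAMES = 30  # the text states that videos were sub-sampled to use up to 30 frames


def subsample(frames, max_frames=MAX_FRAMES, rng=None):
    """Randomly keep at most max_frames frames, preserving temporal order."""
    if len(frames) <= max_frames:
        return list(frames)
    if rng is None:
        rng = random.Random()
    keep = sorted(rng.sample(range(len(frames)), max_frames))
    return [frames[i] for i in keep]


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def trial_embedding(frame_embeddings, encoder):
    """Aggregate per-frame embeddings within one familiarization trial.

    Assumption: frame_embeddings already holds one vector per frame
    (per step for the RNN); the real encoders are learned networks.
    """
    if encoder == "mlp":
        return mean_vector(frame_embeddings)  # average over frames
    if encoder == "rnn":
        return list(frame_embeddings[-1])     # last recurrent step
    raise ValueError("encoder must be 'mlp' or 'rnn'")


def characteristic_embedding(trials, encoder, rng=None):
    """Average the per-trial embeddings across all familiarization trials."""
    per_trial = [trial_embedding(subsample(t, rng=rng), encoder) for t in trials]
    return mean_vector(per_trial)


if __name__ == "__main__":
    # Two toy trials with two 2-dimensional frame embeddings each (made-up values).
    toy_trials = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
    print(characteristic_embedding(toy_trials, "mlp"))
    print(characteristic_embedding(toy_trials, "rnn"))
```

In the models described here, the resulting vector would then be combined with the CNN encoding of the current test frame and passed to a policy MLP (BC models) or a U-net (video model). That step is omitted because its details are given only in the supplementary tables.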
test-trial video rather than just the agent’s next action.

The two BC models differed in their encoding of the familiarization trials. One BC model relied on a simple multi-layer perceptron (MLP) to encode pairs of states and actions independently (Fig. S2), and the other BC model relied on a more complex, bi-directional recurrent neural network (RNN) to sequentially encode pairs of states and actions (Fig. 3). The states were encoded with a convolutional neural network (CNN), which was pretrained using Augmented Temporal Contrast (ATC) (Stooke, Lee, Abbeel, & Laskin, 2020). Table S1 provides the CNN specifications and the ATC data augmentation details. For both the MLP and RNN encoders, the model obtained a characteristic embedding (Rabinowitz et al., 2018) of an agent first by aggregating the embeddings across frames (using the average for the MLP and the last step for the RNN) for each familiarization trial and second by averaging across familiarization trials. When aggregating frames, the videos were randomly sub-sampled to use up to 30 frames. To predict the future actions of the agent, defined as the continuous change in position based on the video (at 3 frames per second), the models combined the characteristic embedding with the current state of the environment (also encoded with the CNN). See Table S2 for the specifications of the BC models.

The one video model sequentially encoded each familiarization trial by passing up to 30 frames through a CNN and then combining them with a bi-directional RNN. The model obtained a characteristic embedding of an agent by averaging the RNN embeddings. The model combined the characteristic embedding with the current state of the environment (specified by the current frame of the video) to predict the next frame of the video (at 3 frames per second) using a U-net architecture (Ronneberger et al., 2015).

5.3. Model training

Prior to being tested, the models were trained on thousands of background examples provided by the BIB dataset (Gandhi et al., 2021) of BIB-like agents exhibiting simple behaviors in a grid world. While the training set included individual components of the test set (e.g., agents’ movement to objects, agents’ consistent object goals, barriers, tools, etc.; see below), success on the test set required models to flexibly combine representations across the different training tasks. Moreover, since training included only expected outcomes, training with labeled videos was not possible. The training otherwise used the same familiarization/test task design as the test set.

In one training task, an agent moved to one object in varying locations in the grid world. In a second training task, two objects were presented in varying locations in the grid world but always very close to the agent; the agent consistently moved to one of the two objects. In a third training task, the agent moved to one object in varying locations in the grid world; at varying points during the familiarization, that agent was substituted by another agent. Finally, in a fourth training task, a green barrier surrounded an agent and a key; the agent retrieved the key to let itself out of the blocked area to move to an object.

We included five runs of each model type with the runs initialized randomly and trained until they converged on the background training. The BC models were trained to minimize mean squared error, and the video model was trained to minimize mean squared error in pixel space. Twenty percent of the background training trials were left out as a validation set, and the models were successful on the validation set in predicting agents’ actions on all of the background training tasks, with low prediction errors. For example, the MSE for the BC models on
the validation set was about 0.03, which is 0.8% of the maximum possible prediction error (4.00). The only exception was that the BC RNN model performed an order of magnitude less well compared to the BC MLP model on the training task in which two objects were presented very close to the agent and the agent consistently moved to just one (see SI).

6. Model results

Fig. 4 displays the Z-scored means of the models’ surprisal scores to the expected and unexpected outcomes for each task (see SI for additional details). The Z-scored means of infants’ looking times in the tasks of Experiment 2 are also displayed. Model performance shows little resemblance to infant performance.

First, to evaluate machines’ goal attribution relative to infants’, we compared infants and models on the Goal-Attribution Task. Unlike infants, who attributed to agents goal objects, not goal locations, the models either attributed to agents goal locations (BC MLP) or neither goal objects nor goal locations (BC RNN, video model). Next, to evaluate machines’ rationality attribution relative to infants’, we compared infants and models on the Efficient-Agent and Inefficient-Agent Tasks. While models attributed rational action to agents in the Efficient-Agent Task (to an even greater degree than did infants), models did not attribute rational action to previously inefficient agents who act in new environments in the Inefficient-Agent Task. Here the models’ performance was nearly orthogonal to infants’, who did attribute rational action to previously inefficient agents who act in new environments.

The comparisons between machine and infant performance on BIB’s other three tasks revealed no instances in which the models demonstrated positive predictions about agents’ actions missing from infants’ predictions. In particular, while infants may have been relatively more surprised at the appearance of the new agent in the expected outcome of the Multi-Agent Task, as described above, the models did not show a difference in surprise across the two outcomes. In the Inaccessible-Goal Task, the video model did appear to be more surprised when the agent moved to a new object when its goal object was accessible, unlike the infants, but given this model’s failure on the Goal-Directed and Multi-Agent Tasks, its performance is unlikely to reflect an understanding of agents’ goal-directed actions towards objects. For example, the model may have learned that the obstacles in the grid world block objects and that agents move to objects. This would lead to a lower surprisal score when an agent moved to the one accessible object compared with when it moved to either one of the accessible objects. Similarly, in the Instrumental-Action Task the models seemed to have succeeded where the infants did not, showing greater surprise when the agent moved to the key when it was unnecessary to do so. But closer investigation of the models’ performance shows that this apparent success is limited to test trials in which the green barrier was absent versus present and inconsequential (see SI). A true understanding of instrumental actions would generalize across the presence or absence of the green barrier at test. The models thus did not understand agents’ instrumental actions.

Finally, the RMSE analysis revealed high values for all infant and model comparisons: BC RNN: 0.319; BC MLP: 0.492; video model: 0.297, suggesting little similarity between infant and model performance. Indeed, these RMSE values were higher than the one obtained by comparing infants’ performance to “baseline” surprisal scores of “0” for all tasks: 0.143.

7. Model discussion

BIB was expressly designed to allow for testing infant and machine intelligence alike (Gandhi et al., 2021), providing an empirical foundation for building human-like AI. While the performance of the models tested here has not previously been compared with human performance (let alone with infant performance), and while models like these are limited in their capacity for flexible generalization to out-of-distribution novel test displays compared with the displays used for their training (a generalization BIB requires and infants excel at), such models have nevertheless accelerated recent advances in AI (Lecun et al., 2015; Rabinowitz et al., 2018). Our comparison reveals that the state-of-the-art “machine theory of mind” captured in such models is indeed missing key principles of commonsense psychology that infants possess.

In particular, while infants expect agents’ goal-directed actions to be towards objects, not locations, models either have no expectations or expect those actions to be towards locations, not objects. And, while infants expect both previously efficient and inefficient agents to exhibit rational and efficient goal-directed actions towards objects in new environments, models only expect previously efficient agents to act efficiently in new environments. Finally, where we were unable to find any predictions that infants might have about the goals of new agents, about agents’ goal objects in new environments, or about novel instrumental actions, models show no additional commonsense psychology.

Our approach of directly comparing infant and machine intelligence allows us to specify what principles of commonsense psychology are present in infants yet missing in machines, thereby inspiring new directions in engineering AI. For example, alternative models based on Bayesian inverse planning have been applied successfully to tasks like BIB by making more explicit abstract inferences about mental states (Baker et al., 2017; Baker, Saxe, & Tenenbaum, 2009; Shu et al., 2021). Nevertheless, extending the Bayesian approach to BIB in particular and to videos in general is not straightforward: A video format does not by itself provide the identification of the agents or objects present in the scene (let alone any relations among them). Recent approaches based on inverse reinforcement learning (Sim & Xu, 2019; Yu, Yu, Finn, & Ermon, 2019) could also be promising, but, as reviewed above, they require online, active sampling from the testing environment, and BIB’s environment, like much of infants’ experience, involves passive viewing. It thus remains an open challenge for learning-driven systems to acquire sufficiently rich, abstract structure from BIB’s training to match infant commonsense intelligence. Nevertheless, setting infant common sense as a benchmark for machine common sense promises to give AI the foundations of human intelligence.

8. General discussion

BIB includes six highly minimal but presentationally consistent tasks focusing on three high-level principles of commonsense psychology: goal attribution; rationality attribution; and instrumentality attribution. Infants’ successes on BIB suggest they have a highly abstract notion of agents’ actions as goal-directed towards objects and a principle of rationality that leads to default expectations of agents’ efficient actions towards goals. These results are consistent with the rich literature on infants’ commonsense psychology (Baillargeon et al., 2015; Baillargeon et al., 2016; Spelke, 2022; Woodward, 2009; Woodward et al., 2001) and synthesize the literature’s findings in a unified framework that can be directly compared with, and perhaps built into, machine intelligence. In addition, BIB uniquely reveals that infants appreciate agents’ actions in a novel, overhead navigational context, here recognizing obstacles as physical but not perceptual barriers to action.

Infants’ failures on BIB suggest that changes to the contexts in which goals are first demonstrated may have significant impacts on infants’ goal and rationality attribution (Liu & Spelke, 2017; Sommerville & Crane, 2009). For example, infants may not generalize an agent’s goal to a test environment with even minimal or inconsequential changes relative to the environment in which the goal was initially demonstrated if those changes suggest that agents are acting in a new place. Regardless of how infants might come to understand the geometry of BIB’s environment, their sensitivity to and use of where an agent is for goal and rationality attribution is apparent. Future studies might thus investigate infants’ use of such geometry for recognizing places based on their shape or navigability even before infants can navigate on their own (Deen
Fig. 4. Z-scored means of the models’ surprisal scores (each model is shown with a different shape and in a different shade of gray) and the Z-scored means of infants’ looking times (shown in red) to the expected and unexpected outcomes in each of BIB’s six tasks in Experiment 2. Models differ from infants in terms of infants’ successful goal and rationality attribution (A), and models show no additional commonsense psychology missing from infants’ performance (B). (For interpretation of the references to colour in this figure legend, please see the online version.)
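To make the two summary statistics used in this comparison concrete, the short Python sketch below shows one conventional way to Z-score a set of scores and to compute the root-mean-square error (RMSE) between two equal-length vectors of Z-scored means. The paper's exact normalization is described in its SI, so the choices here (population standard deviation, hypothetical function names, made-up input values) are assumptions for illustration only and do not reproduce the reported results.

```python
import math
import statistics


def z_scores(values):
    """Z-score values using their own mean and population standard deviation.

    Assumption: the population SD is used; the paper's SI may normalize differently.
    """
    mu = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        raise ValueError("cannot Z-score values with zero variance")
    return [(v - mu) / sd for v in values]


def rmse(a, b):
    """Root-mean-square error between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


if __name__ == "__main__":
    # Made-up per-condition means; these are NOT values from the paper.
    infant_means = z_scores([10.2, 12.9, 9.8, 11.1])
    model_means = z_scores([0.31, 0.29, 0.40, 0.35])
    baseline = [0.0] * len(infant_means)  # "baseline" scores of 0 for every condition

    print("infant vs. model RMSE:   ", rmse(infant_means, model_means))
    print("infant vs. baseline RMSE:", rmse(infant_means, baseline))
```

A model whose RMSE against infants exceeds the infant-versus-baseline RMSE tracks infant performance no better than a model with no expectations at all, which is the comparison the RMSE analysis in the text draws.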
et al., 2017; Kosakowski et al., 2021).

Future work exploring infants’ knowledge about the world could extend our general approach to investigate other aspects of infant commonsense psychology. Because BIB’s tasks are procedurally generated and presentationally consistent, for example, new tasks could easily be incorporated into BIB’s dataset. Future studies might explore expectations of agents’ notions of cost and value (Jara-Ettinger et al., 2016; Liu et al., 2017) or recognition of agents’ actions that might signal potential social partnerships (Meltzoff, 2007; Powell & Spelke, 2013; Schachner & Carey, 2013; Tomasello, 2018). While we show that learning-driven neural-network approaches already fall short of infants’ common sense on BIB’s existing tasks, such expectations will nevertheless become increasingly important for AI too as it becomes further embedded in real-world, multi-agent settings that demand common sense. Extending our approach can ultimately inform comprehensive accounts of infants’ knowledge not only about agents, but also about objects (Lin, Stavans, & Baillargeon, 2022; Spelke, 1990; Stahl & Feigenson, 2015) and places (Hermer & Spelke, 1994), allowing us to more fully describe the origins and development of human common sense and provide an avenue for building the future of human-like AI.

BIB called for an interanimating research program between developmental cognitive science and artificial intelligence. The present work demonstrates that such a program is both possible and generative for both fields. Our work provides a first step in this productive dialogue between the cognitive and computational sciences to test whether knowledge can be built, in human or machine, from the foundations that cognitive and developmental theories postulate.

Credit author statement

GS, KG, BL, and MRD conceptualized the study. GS, KG, and SY curated the data. KG and SY analyzed the data. MRD and BL acquired funding and supervised the study. MRD wrote the original draft. GS, KG, SY, and BL reviewed and edited the draft.

Data availability

Experiment 2 with infants was preregistered on the Open Science Framework (OSF) prior to data collection, and the preregistration is available at https://osf.io/p6kba. The data, code, and materials related to all of the infant testing and the comparison between infant and machine performance are available on the OSF at: https://osf.io/htjc2/. The code related to the model testing is available at: https://github.com/kanishkg/bib-models.

Acknowledgements

This work was supported by a National Science Foundation CAREER Award (DRL1845924; to MRD) and a DARPA grant on Machine Common Sense (HR001119S0005; to MRD and BML). We thank Eli Mitnick for assistance with data collection, Koleen McCrink, David Moore, Lisa Oakes, and Victoria Romero for their feedback on the project’s general aims, and Brian Reilly for his feedback on the project and manuscript. Finally, we thank the generous families who volunteered their time to participate in this research.

Appendix A. Supplementary data

Supplementary data to this article can be found online at https://doi.org/10.1016/j.cognition.2023.105406.

References

Baillargeon, R., Scott, R. M., & Bian, L. (2016). Psychological reasoning in infancy. Annual Review of Psychology, 67(April), 159–186. https://doi.org/10.1146/annurev-psych-010213-115033
Baillargeon, R., Scott, R. M., He, Z., Sloane, S., Setoh, P., Jin, K., … Bian, L. (2015). Psychological and sociomoral reasoning in infancy. In APA handbook of personality and social psychology, volume 1: Attitudes and social cognition (pp. 79–150). https://doi.org/10.1037/14341-003
Baker, C. L., Jara-Ettinger, J., Saxe, R., & Tenenbaum, J. B. (2017). Rational quantitative attribution of beliefs, desires and percepts in human mentalizing. Nature Human Behaviour, 1(4), 1–10. https://doi.org/10.1038/s41562-017-0064
Baker, C. L., Saxe, R., & Tenenbaum, J. B. (2009). Action understanding as inverse planning. Cognition, 113(3), 329–349. https://doi.org/10.1016/j.cognition.2009.07.005
Banaji, M., & Gelman, S. A. (2013). Navigating the social world: What infants, children, and other species can teach us. Oxford University Press.
Botvinick, M., Barrett, D. G. T., Battaglia, P., de Freitas, N., Kumaran, D., Leibo, J. Z., … Hassabis, D. (2017). Building machines that learn and think for themselves. Behavioral and Brain Sciences, 40, Article e255. https://doi.org/10.1017/S0140525X17000048
Buresh, J. S., & Woodward, A. L. (2007). Infants track action goals within and across agents. Cognition, 104(2), 287–314. https://doi.org/10.1016/j.cognition.2006.07.001
Csibra, G., Bíró, S., Koós, O., & Gergely, G. (2003). One-year-old infants use teleological representations of actions productively. Cognitive Science, 27(1), 111–133. https://doi.org/10.1207/s15516709cog2701_4
Csibra, G., Gergely, G., Bíró, S., Koós, O., & Brockbank, M. (1999). Goal attribution without agency cues: The perception of “pure reason” in infancy. Cognition, 72(3), 237–267. https://doi.org/10.1016/S0010-0277(99)00039-6
Deen, B., Richardson, H., Dilks, D. D., Takahashi, A., Keil, B., Wald, L. L., … Saxe, R. (2017). Organization of high-level visual cortex in human infants. Nature Communications, 8. https://doi.org/10.1038/ncomms13995
Gandhi, K., Stojnic, G., Lake, B. M., & Dillon, M. R. (2021). Baby Intuitions Benchmark (BIB): Discerning the goals, preferences, and actions of others. Advances in Neural Information Processing Systems, 34, 9963–9976.
Gao, T., McCarthy, G., & Scholl, B. J. (2010). The wolfpack effect: Perception of animacy irresistibly influences interactive behavior. Psychological Science, 21(12), 1845–1853. https://doi.org/10.1177/0956797610388814
Gergely, G., & Csibra, G. (2003). Teleological reasoning in infancy: The naïve theory of rational action. Trends in Cognitive Sciences, 7(7), 287–292. https://doi.org/10.1016/S1364-6613(03)00128-1
Gergely, G., Nádasdy, Z., Csibra, G., & Bíró, S. (1995). Taking the intentional stance at 12 months of age. Cognition, 56(2), 165–193. https://doi.org/10.1016/0010-0277(95)00661-H
Griffiths, T. L. (2015). Manifesto for a new (computational) cognitive revolution. Cognition, 135, 21–23. https://doi.org/10.1016/j.cognition.2014.11.026
Heider, F., & Simmel, M. (1944). An experimental study of apparent behavior. The American Journal of Psychology, 57(2), 243–259.
Hermer, L., & Spelke, E. S. (1994). A geometric process for spatial reorientation in young children. Nature, 370, 57–59. https://doi.org/10.1038/370057a0
Jara-Ettinger, J., Gweon, H., Schulz, L. E., & Tenenbaum, J. B. (2016). The naïve utility calculus: Computational principles underlying commonsense psychology. Trends in Cognitive Sciences, 20(8), 589–604. https://doi.org/10.1016/j.tics.2016.05.011
Johnson, M. H., & Gilmore, R. O. (2003). Object-centered attention in 8-month-old infants. Developmental Science, 1(2), 221–225. https://doi.org/10.1111/1467-7687.00034
Johnson, S., Slaughter, V., & Carey, S. (1998). Whose gaze will infants follow? The elicitation of gaze following in 12-month-olds. Developmental Science, 1(2), 233–238. https://doi.org/10.1111/1467-7687.00036
Johnson, S. C., Shimizu, Y. A., & Ok, S. J. (2007). Actors and actions: The role of agent behavior in infants’ attribution of goals. Cognitive Development, 22(3), 310–322. https://doi.org/10.1016/j.cogdev.2007.01.002
Kominsky, J. F. (2019). PyHab: Open-source real time infant gaze coding and stimulus presentation software. Infant Behavior and Development, 54(January), 114–119. https://doi.org/10.1016/j.infbeh.2018.11.006
Kominsky, J. F., Lucca, K., Thomas, A. J., Frank, M. C., & Hamlin, J. K. (2022). Simplicity and validity in infant research. Cognitive Development, 63(May), Article 101213. https://doi.org/10.1016/j.cogdev.2022.101213
Kosakowski, H. L., Cohen, M. A., Takahashi, A., Keil, B., Kanwisher, N., & Saxe, R. (2021). Selective responses to faces, scenes, and bodies in the ventral visual pathway of infants. Current Biology. https://doi.org/10.1016/j.cub.2021.10.064
Lake, B. M., Ullman, T. D., Tenenbaum, J. B., & Gershman, S. J. (2017). Building machines that learn and think like people. Behavioral and Brain Sciences, 40, e253. https://doi.org/10.1017/S0140525X16001837
Lecun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436–444. https://doi.org/10.1038/nature14539
Lin, Y., Stavans, M., & Baillargeon, R. (2022). Infants’ physical reasoning and the cognitive architecture that supports it. In O. Houdé, & G. Borst (Eds.), Cambridge handbook of cognitive development (pp. 244–268). Cambridge University Press.
Liu, S., Brooks, N. B., & Spelke, E. S. (2019). Origins of the concepts cause, cost, and goal in prereaching infants. Proceedings of the National Academy of Sciences of the United States of America, 116(36), 17747–17752. https://doi.org/10.1073/pnas.1904410116
Liu, S., & Spelke, E. S. (2017). Six-month-old infants expect agents to minimize the cost of their actions. Cognition, 160, 35–42. https://doi.org/10.1016/j.cognition.2016.12.007
Liu, S., Ullman, T. D., Tenenbaum, J. B., & Spelke, E. S. (2017). Ten-month-old infants infer the value of goals from the costs of actions. Science, 358(6366), 1038. https://doi.org/10.1126/science.aag2132
Luo, Y., & Baillargeon, R. (2005). Can a self-propelled box have a goal? Psychological reasoning in 5-month-old infants. Psychological Science, 16(8), 601–608. https://doi.org/10.1111/j.1467-9280.2005.01582.x
Luo, Y., & Baillargeon, R. (2007). Do 12.5-month-old infants consider what objects others can see when interpreting their actions? Cognition, 105(3), 489–512. https://doi.org/10.1016/j.cognition.2006.10.007
Luo, Y., & Johnson, S. C. (2009). Recognizing the role of perception in action at 6 months. Developmental Science, 12(1), 142–149. https://doi.org/10.1111/j.1467-7687.2008.00741.x
Marcus, G., & Davis, E. (2019). Building machines that learn and think like people. Pantheon Books.
Meltzoff, A. N. (1995). Understanding the intentions of others: Re-enactment of intended acts by 18-month-old children. Developmental Psychology, 31(5), 838–850.
Meltzoff, A. N. (2007). “Like me”: A foundation for social cognition. Developmental Science, 10(1), 126–134. https://doi.org/10.1111/j.1467-7687.2007.00574.x
Ng, A. Y., & Russell, S. (2000). Algorithms for inverse reinforcement learning, 1. International Conference on Machine Learning.
Piaget, J. (1953). The origins of intelligence in children. International Journal of Psychoanalysis, 35, 373–375.
Piloto, L. S., Weinstein, A., Battaglia, P., & Botvinick, M. (2022). Intuitive physics learning in a deep-learning model inspired by developmental psychology. Nature Human Behaviour. https://doi.org/10.1038/s41562-022-01394-8
Powell, L. J., & Spelke, E. S. (2013). Preverbal infants expect members of social groups to act alike. Proceedings of the National Academy of Sciences, 110(41), E3965–E3972. https://doi.org/10.1073/pnas.1304326110
Rabinowitz, N. C., Perbet, F., Song, H. F., Zhang, C., & Botvinick, M. (2018). Machine theory of mind, 10. International Conference on Machine Learning (pp. 6723–6738).
Repacholi, B. M., & Gopnik, A. (1997). Early reasoning about desires: Evidence from 14- and 18-month-olds. Developmental Psychology, 33(1), 12–21. https://doi.org/10.1037/0012-1649.33.1.12
Ronneberger, O., Fischer, P., & Brox, T. (2015). U-net: Convolutional networks for biomedical image segmentation. International Conference on Medical Image Computing and Computer-Assisted Intervention, 9, 16591–16603.
Schachner, A., & Carey, S. (2013). Reasoning about “irrational” actions: When intentional movements cannot be explained, the movements themselves are seen as the goal. Cognition, 129(2), 309–327. https://doi.org/10.1016/j.cognition.2013.07.006
Scott, R. M., & Baillargeon, R. (2013). Do infants really expect agents to act efficiently? A critical test of the rationality principle. Psychological Science, 24(4), 466–474. https://doi.org/10.1177/0956797612457395
Shu, T., Bhandwaldar, A., Gan, C., Smith, K. A., Liu, S., … Gutfreund, D. Ullman (2021). AGENT: A benchmark for core psychological reasoning. International Conference on Machine Learning, 9614–9625.
Sim, Z. L., & Xu, F. (2019). Another look at looking time: Surprise as rational statistical inference. Topics in Cognitive Science, 11(1), 154–163. https://doi.org/10.1111/tops.12393
Skerry, A. E., Carey, S. E., & Spelke, E. S. (2013). First-person action experience reveals sensitivity to action efficiency in prereaching infants. Proceedings of the National Academy of Sciences, 110(46), 18728–18733. https://doi.org/10.1073/pnas.1312322110
Smith, K. A., Mei, L., Yao, S., Wu, J., Spelke, E., Tenenbaum, J. B., & Ullman, T. D. (2019). Modeling expectation violation in intuitive physics with coarse probabilistic object representations. Advances in Neural Information Processing Systems, 32, 1–11.
Sommerville, J. A., & Crane, C. C. (2009). Ten-month-old infants use prior information to identify an actor’s goal. Developmental Science, 12(2), 314–325. https://doi.org/10.1111/j.1467-7687.2008.00787.x
Sommerville, J. A., Hildebrand, E. A., & Crane, C. C. (2008). Experience matters: The impact of doing versus watching on infants’ subsequent perception of tool-use events. Developmental Psychology, 44(5), 1249–1256. https://doi.org/10.1037/a0012296
Sommerville, J. A., & Woodward, A. L. (2005). Pulling out the intentional structure of action: The relation between action processing and action production in infancy. Cognition, 95(1), 1–30. https://doi.org/10.1016/j.cognition.2003.12.004
Spelke, E. S. (1985). Preferential-looking methods as tools for the study of cognition in infancy. In G. Gottlieb, & N. A. Krasnegor (Eds.), Measurement of audition and vision in the first year of postnatal life: A methodological overview (pp. 323–361).
Spelke, E. S. (1990). Principles of object perception. Cognitive Science, 14(1), 29–56.
Spelke, E. S. (2022). What babies know: Core knowledge and composition. Oxford University Press.
Spelke, E. S., & Kinzler, K. D. (2007). Core knowledge. Developmental Science, 10(1), 89–96. https://doi.org/10.1111/j.1467-7687.2007.00569.x
Stahl, A. E., & Feigenson, L. (2015). Observing the unexpected enhances infants’ learning and exploration. Science, 348(6230), 91–94. https://doi.org/10.1126/science.aaa3799
Stahl, A. E., & Kibbe, M. M. (2022). Great expectations: The construct validity of the violation-of-expectation method for studying infant cognition. Infant and Child Development. https://doi.org/10.1002/icd.2359
Stooke, A., Lee, K., Abbeel, P., & Laskin, M. (2020). Decoupling representation learning from reinforcement learning. In International Conference on Machine Learning.
Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction. MIT Press.
Téglás, E., Vul, E., Girotto, V., Gonzalez, M., Tenenbaum, J. B., & Bonatti, L. L. (2011). Pure reasoning in 12-month-old infants as probabilistic inference. Science, 332(6033), 1054–1059. https://doi.org/10.1126/science.1196404
Tomasello, M. (2018). How children come to understand false beliefs: A shared intentionality account. Proceedings of the National Academy of Sciences of the United States of America, 115(34), 8491–8498. https://doi.org/10.1073/pnas.1804761115
Woodward, A. L. (1998). Infants selectively encode the goal object of an actor’s reach. Cognition, 69(1), 1–34. https://doi.org/10.1016/S0010-0277(98)00058-4
Woodward, A. L. (2009). Infants’ grasp of others’ intentions. Current Directions in Psychological Science, 18(1), 53–57. https://doi.org/10.1111/j.1467-8721.2009.01605.x
Woodward, A. L., & Sommerville, J. A. (2000). Twelve-month-old infants interpret action in context. Psychological Science, 11(1), 73–77. https://doi.org/10.1111/1467-9280.00218
Woodward, A. L., Sommerville, J. A., & Guajardo, J. J. (2001). Action. In B. F. Malle, L. J. Moses, & D. A. Baldwin (Eds.), Intentions and intentionality: Foundations of social cognition (pp. 149–169). MIT Press.
Yu, L., Yu, T., Finn, C., & Ermon, S. (2019). Meta-inverse reinforcement learning with probabilistic context variables. Advances in Neural Information Processing Systems, 32, 1–12.