
How to Grow a Mind: Statistics, Structure, and Abstraction

Joshua B. Tenenbaum,1* Charles Kemp,2 Thomas L. Griffiths,3 Noah D. Goodman4


Science 331, 1279 (2011);
DOI: 10.1126/science.1192788

1Department of Brain and Cognitive Sciences, Computer Science and Artificial Intelligence Laboratory (CSAIL), Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, MA 02139, USA. 2Department of Psychology, Carnegie Mellon University, Pittsburgh, PA 15213, USA. 3Department of Psychology, University of California, Berkeley, Berkeley, CA 94720, USA. 4Department of Psychology, Stanford University, Stanford, CA 94305, USA.
*To whom correspondence should be addressed. E-mail: [email protected]

In coming to understand the world—in learning concepts, acquiring language, and grasping causal relations—our minds make inferences that appear to go far beyond the data available. How do we do it? This review describes recent approaches to reverse-engineering human learning and cognitive development and, in parallel, engineering more humanlike machine learning systems. Computational models that perform probabilistic inference over hierarchies of flexibly structured representations can address some of the deepest questions about the nature and origins of human thought: How does abstract knowledge guide learning and reasoning from sparse data? What forms does our knowledge take, across different domains and tasks? And how is that abstract knowledge itself acquired?

The Challenge: How Does the Mind Get So Much from So Little?

For scientists studying how humans come to understand their world, the central challenge is this: How do our minds get so much from so little? We build rich causal models, make strong generalizations, and construct powerful abstractions, whereas the input data are sparse, noisy, and ambiguous—in every way far too limited. A massive mismatch looms between the information coming in through our senses and the outputs of cognition.

Consider the situation of a child learning the meanings of words. Any parent knows, and scientists have confirmed (1, 2), that typical 2-year-olds can learn how to use a new word such as “horse” or “hairbrush” from seeing just a few examples. We know they grasp the meaning, not just the sound, because they generalize: They use the word appropriately (if not always perfectly) in new situations. Viewed as a computation on sensory input data, this is a remarkable feat. Within the infinite landscape of all possible objects, there is an infinite but still highly constrained subset that can be called “horses” and another for “hairbrushes.” How does a child grasp the boundaries of these subsets from seeing just one or a few examples of each? Adults face the challenge of learning entirely novel object concepts less often, but they can be just as good at it (Fig. 1).

Generalization from sparse data is central in learning many aspects of language, such as syntactic constructions or morphological rules (3). It presents most starkly in causal learning: Every statistics class teaches that correlation does not imply causation, yet children routinely infer causal links from just a handful of events (4), far too small a sample to compute even a reliable correlation! Perhaps the deepest accomplishment of cognitive development is the construction of larger-scale systems of knowledge: intuitive theories of physics, psychology, or biology or rule systems for social structure or moral judgment. Building these systems takes years, much longer than learning a single new word or concept, but on this scale too the final product of learning far outstrips the data observed (5–7).

Philosophers have inquired into these puzzles for over two thousand years, most famously as “the problem of induction,” from Plato and Aristotle through Hume, Whewell, and Mill to Carnap, Quine, Goodman, and others in the 20th century (8). Only recently have these questions become accessible to science and engineering by viewing inductive learning as a species of computational problems and the human mind as a natural computer evolved for solving them.

The proposed solutions are, in broad strokes, just what philosophers since Plato have suggested. If the mind goes beyond the data given, another source of information must make up the difference. Some more abstract background knowledge must generate and delimit the hypotheses learners consider, or meaningful generalization would be impossible (9, 10). Psychologists and linguists speak of “constraints”; machine learning and artificial intelligence researchers, “inductive bias”; statisticians, “priors.”

This article reviews recent models of human learning and cognitive development arising at the intersection of these fields. What has come to be known as the “Bayesian” or “probabilistic” approach to reverse-engineering the mind has been heavily influenced by the engineering successes of Bayesian artificial intelligence and machine learning over the past two decades (9, 11) and, in return, has begun to inspire more powerful and more humanlike approaches to machine learning. As with “connectionist” or “neural network” models of cognition (12) in the 1980s (the last moment when all these fields converged on a common paradigm for understanding the mind), the labels “Bayesian” or “probabilistic” are merely placeholders for a set of interrelated principles and theoretical claims. The key ideas can be thought of as proposals for how to answer three central questions:

1) How does abstract knowledge guide learning and inference from sparse data?
2) What forms does abstract knowledge take, across different domains and tasks?
3) How is abstract knowledge itself acquired?

We will illustrate the approach with a focus on two archetypal inductive problems: learning concepts and learning causal relations. We then briefly discuss open challenges for a theory of human cognitive development and conclude with a summary of the approach’s contributions. We will also draw contrasts with two earlier approaches to the origins of knowledge: nativism and associationism (or connectionism). These approaches differ in whether they propose stronger or weaker capacities as the basis for answering the questions above. Bayesian models typically combine richly structured, expressive knowledge representations (question 2) with powerful statistical inference engines (questions 1 and 3), arguing that only a synthesis of sophisticated approaches to both knowledge representation and inductive inference can account for human intelligence. Until recently it was not understood how this fusion could work computationally. Cognitive modelers were forced to choose between two alternatives (13): powerful statistical learning operating over the simplest, unstructured forms of knowledge, such as matrices of associative weights in connectionist accounts of semantic cognition (12, 14), or richly structured symbolic knowledge equipped with only the simplest, nonstatistical forms of learning, checks for logical inconsistency between hypotheses and observed data, as in nativist accounts of language acquisition (15). It appeared necessary to accept either that people’s abstract knowledge is not learned or induced in a nontrivial sense from experience (hence essentially innate) or that human knowledge is not nearly as abstract or structured (as “knowledge-like”) as it seems (hence simply associations). Many developmental researchers rejected this choice altogether and pursued less formal approaches to describing the growing minds of children, under the headings of “constructivism” or the “theory theory” (5). The potential to explain how people can genuinely learn with abstract structured knowledge may be the most distinctive feature of Bayesian models: the biggest reason for their recent popularity (16) and the biggest target of skepticism from their critics (17).



The Role of Abstract Knowledge

Over the past decade, many aspects of higher-level cognition have been illuminated by the mathematics of Bayesian statistics: our sense of similarity (18), representativeness (19), and randomness (20); coincidences as a cue to hidden causes (21); judgments of causal strength (22) and evidential support (23); diagnostic and conditional reasoning (24, 25); and predictions about the future of everyday events (26).

The claim that human minds learn and reason according to Bayesian principles is not a claim that the mind can implement any Bayesian inference. Only those inductive computations that the mind is designed to perform well, where biology has had time and cause to engineer effective and efficient mechanisms, are likely to be understood in Bayesian terms. In addition to the general cognitive abilities just mentioned, Bayesian analyses have shed light on many specific cognitive capacities and modules that result from rapid, reliable, unconscious processing, including perception (27), language (28), memory (29, 30), and sensorimotor systems (31). In contrast, in tasks that require explicit conscious manipulations of probabilities as numerical quantities—a recent cultural invention that few people become fluent with, and only then after sophisticated training—judgments can be notoriously biased away from Bayesian norms (32).

Fig. 1. Human children learning names for object concepts routinely make strong generalizations from just a few examples. The same processes of rapid generalization can be studied in adults learning names for novel objects created with computer graphics. (A) Given these alien objects and three examples (boxed in red) of “tufas” (a word in the alien language), which other objects are tufas? Almost everyone selects just the objects boxed in gray (75). (B) Learning names for categories can be modeled as Bayesian inference over a tree-structured domain representation (2). Objects are placed at the leaves of the tree, and hypotheses about categories that words could label correspond to different branches. Branches at different depths pick out hypotheses at different levels of generality (e.g., Clydesdales, draft horses, horses, animals, or living things). Priors are defined on the basis of branch length, reflecting the distinctiveness of categories. Likelihoods assume that examples are drawn randomly from the branch that the word labels, favoring lower branches that cover the examples tightly; this captures the sense of suspicious coincidence when all examples of a word cluster in the same part of the tree. Combining priors and likelihoods yields posterior probabilities that favor generalizing across the lowest distinctive branch that spans all the observed examples (boxed in gray).

At heart, Bayes’s rule is simply a tool for answering question 1: How does abstract knowledge guide inference from incomplete data? Abstract knowledge is encoded in a probabilistic generative model, a kind of mental model that describes the causal processes in the world giving rise to the learner’s observations as well as unobserved or latent variables that support effective prediction and action if the learner can infer their hidden state. Generative models must be probabilistic to handle the learner’s uncertainty about the true states of latent variables and the true causal processes at work. A generative model is abstract in two senses: It describes not only the specific situation at hand, but also a broader class of situations over which learning should generalize, and it captures in parsimonious form the essential world structure that causes learners’ observations and makes generalization possible.

Bayesian inference gives a rational framework for updating beliefs about latent variables in generative models given observed data (33, 34). Background knowledge is encoded through a constrained space of hypotheses H about possible values for the latent variables, candidate world structures that could explain the observed data. Finer-grained knowledge comes in the “prior probability” P(h), the learner’s degree of belief in a specific hypothesis h prior to (or independent of) the observations. Bayes’s rule updates priors to “posterior probabilities” P(h|d) conditional on the observed data d:

P(h|d) = P(d|h)P(h) / ∑h′∈H P(d|h′)P(h′) ∝ P(d|h)P(h)   (1)

The posterior probability is proportional to the product of the prior probability and the likelihood P(d|h), measuring how expected the data are under hypothesis h, relative to all other hypotheses h′ in H.

To illustrate Bayes’s rule in action, suppose we observe John coughing (d), and we consider three hypotheses as explanations: John has h1, a cold; h2, lung disease; or h3, heartburn. Intuitively only h1 seems compelling. Bayes’s rule explains why. The likelihood favors h1 and h2 over h3: only colds and lung disease cause coughing and thus elevate the probability of the data above baseline. The prior, in contrast, favors h1 and h3 over h2: Colds and heartburn are much more common than lung disease. Bayes’s rule weighs hypotheses according to the product of priors and likelihoods and so yields only explanations like h1 that score highly on both terms.
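The arithmetic behind this diagnosis example is easy to make concrete. The sketch below is our illustration, not code from the article; the prior and likelihood values are invented, chosen only to realize the qualitative pattern described in the text (colds are common, lung disease strongly predicts coughing, heartburn rarely causes it).

```python
# Eq. 1 applied to the "John is coughing" example. All numbers are
# illustrative assumptions, not data from the article.
hypotheses = {
    #                 P(h)   P(d = cough | h)
    "cold":          (0.25,  0.60),
    "lung disease":  (0.001, 0.70),
    "heartburn":     (0.20,  0.01),
}

# Unnormalized score for each hypothesis: likelihood times prior.
scores = {h: prior * like for h, (prior, like) in hypotheses.items()}

# Normalize by the sum over all hypotheses, as in Eq. 1.
z = sum(scores.values())
posterior = {h: s / z for h, s in scores.items()}

for h, p in sorted(posterior.items(), key=lambda kv: -kv[1]):
    print(f"P({h} | cough) = {p:.3f}")
# "cold" dominates (about 0.98): it is the only hypothesis that
# scores highly on both the prior and the likelihood.
```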


The same principles can explain how people learn from sparse data. In concept learning, the data might correspond to several example objects (Fig. 1) and the hypotheses to possible extensions of the concept. Why, given three examples of different kinds of horses, would a child generalize the word “horse” to all and only horses (h1)? Why not h2, “all horses except Clydesdales”; h3, “all animals”; or any other rule consistent with the data? Likelihoods favor the more specific patterns, h1 and h2; it would be a highly suspicious coincidence to draw three random examples that all fall within the smaller sets h1 or h2 if they were actually drawn from the much larger h3 (18). The prior favors h1 and h3, because as more coherent and distinctive categories, they are more likely to be the referents of common words in language (1). Only h1 scores highly on both terms. Likewise, in causal learning, the data could be co-occurrences between events; the hypotheses, possible causal relations linking the events. Likelihoods favor causal links that make the co-occurrence more probable, whereas priors favor links that fit with our background knowledge of what kinds of events are likely to cause which others; for example, a disease (e.g., cold) is more likely to cause a symptom (e.g., coughing) than the other way around.
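The “suspicious coincidence” has a simple quantitative core, often called the size principle: if examples are sampled randomly from a concept’s extension, each example is more probable under smaller hypotheses. A minimal sketch of that logic follows; it is our illustration, and the extension sizes and prior values are arbitrary stand-ins, not estimates from the article.

```python
# Size-principle sketch for the "horse" example. Likelihoods favor
# small extensions; priors favor coherent, nameable categories.
hypotheses = {
    # name                          extension size   prior
    "all horses":                   (100,            0.45),
    "horses except Clydesdales":    (95,             0.05),  # gerrymandered
    "all animals":                   (10_000,         0.50),
}
n_examples = 3  # three random examples, all of them horses

scores = {}
for h, (size, prior) in hypotheses.items():
    likelihood = (1.0 / size) ** n_examples  # random draws from the extension
    scores[h] = prior * likelihood
z = sum(scores.values())

for h, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"P({h} | three horse examples) = {s / z:.4f}")
# "all horses" wins: "all animals" is penalized by the likelihood (a
# suspicious coincidence), the gerrymandered set by its low prior.
```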
The Form of Abstract Knowledge

Abstract knowledge provides essential constraints for learning, but in what form? This is just question 2. For complex cognitive tasks such as concept learning or causal reasoning, it is impossible to simply list every logically possible hypothesis along with its prior and likelihood. Some more sophisticated forms of knowledge representation must underlie the probabilistic generative models needed for Bayesian cognition.

In traditional associative or connectionist approaches, statistical models of learning were defined over large numerical vectors. Learning was seen as estimating strengths in an associative memory, weights in a neural network, or parameters of a high-dimensional nonlinear function (12, 14). Bayesian cognitive models, in contrast, have had most success defining probabilities over more structured symbolic forms of knowledge representation used in computer science and artificial intelligence, such as graphs, grammars, predicate logic, relational schemas, and functional programs. Different forms of representation are used to capture people’s knowledge in different domains and tasks and at different levels of abstraction.

In learning words and concepts from examples, the knowledge that guides both children’s and adults’ generalizations has been well described using probabilistic models defined over tree-structured representations (Fig. 1B) (2, 35). Reasoning about other biological concepts for natural kinds (e.g., given that cows and rhinos have protein X in their muscles, how likely is it that horses or squirrels do?) is also well described by Bayesian models that assume nearby objects in the tree are likely to share properties (36). However, trees are by no means a universal representation. Inferences about other kinds of categories or properties are best captured by using probabilistic models with different forms (Fig. 2): two-dimensional spaces or grids for reasoning about geographic properties of cities, one-dimensional orders for reasoning about values or abilities, or directed networks for causally transmitted properties of species (e.g., diseases) (36).

Knowledge about causes and effects more generally can be expressed in a directed graphical model (9, 11): a graph structure where nodes represent variables and directed edges between nodes represent probabilistic causal links. In a medical setting, for instance (Fig. 3A), nodes might represent whether a patient has a cold, a cough, a fever or other conditions, and the presence or absence of edges indicates that colds tend to cause coughing and fever but not chest pain; lung disease tends to cause coughing and chest pain but not fever; and so on.

Such a causal map represents a simple kind of intuitive theory (4), but learning causal networks from limited data depends on the constraints of more abstract knowledge. For example, learning causal dependencies between medical conditions is enabled by a higher-level framework theory (37) specifying two classes of variables (or nodes), diseases and symptoms, and the tendency for causal relations (or graph edges) to run from diseases to symptoms, rather than within these classes or from symptoms to diseases (Fig. 3, A to C). This abstract framework can be represented by using probabilistic models defined over relational data structures such as graph schemas (9, 38), templates for graphs based on types of nodes, or probabilistic graph grammars (39), similar in spirit to the probabilistic grammars for strings that have become standard for representing linguistic knowledge (28). At the most abstract level, the very concept of causality itself, in the sense of a directed relationship that supports intervention or manipulation by an external agent (40), can be formulated as a set of logical laws expressing constraints on the structure of directed graphs relating actions and observable events (Fig. 3D).

Each of these forms of knowledge makes different kinds of prior distributions natural to define and therefore imposes different constraints on induction. Successful generalization depends on getting these constraints right. Although inductive constraints are often graded, it is easiest to appreciate the effects of qualitative constraints that simply restrict the hypotheses learners can consider (i.e., setting priors for many logically possible hypotheses to zero). For instance, in learning concepts over a domain of n objects, there are 2^n subsets and hence 2^n logically possible hypotheses for the extension of a novel concept. Assuming concepts correspond to the branches of a specific binary tree over the objects, as in Fig. 1B, restricts this space to only n − 1 hypotheses. In learning a causal network over 16 variables, there are roughly 10^46 logically possible hypotheses (directed acyclic graphs), but a framework theory restricting hypotheses to bipartite disease-symptom graphs reduces this to roughly 10^23 hypotheses. Knowing which variables belong to the disease and symptom classes further restricts this to roughly 10^18 networks. The smaller the hypothesis space, the more accurately a learner can be expected to generalize, but only as long as the true structure to be learned remains within or near (in a probabilistic sense) the learner’s hypothesis space (10). It is no coincidence then that our best accounts of people’s mental representations often resemble simpler versions of how scientists represent the same domains, such as tree structures for biological species. A compact description that approximates how the grain of the world actually runs offers the most useful form of constraint on inductive learning.
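These counts can be verified directly. The short computation below is ours, using Robinson’s standard recurrence for counting labeled directed acyclic graphs and assuming, as in Fig. 3A, 6 disease and 10 symptom variables:

```python
from math import comb

n = 16

# Labeled DAGs on m nodes, via Robinson's recurrence:
# a(m) = sum_{k=1..m} (-1)^(k+1) * C(m,k) * 2^(k(m-k)) * a(m-k)
a = [1]
for m in range(1, n + 1):
    a.append(sum((-1) ** (k + 1) * comb(m, k) * 2 ** (k * (m - k)) * a[m - k]
                 for k in range(1, m + 1)))
print(f"all DAGs on 16 variables:      {a[n]:.1e}")  # ~1.2e46

# Bipartite graphs with all edges running from one class to the other,
# summed over every way of splitting the 16 variables into two classes.
unknown_split = sum(comb(n, k) * 2 ** (k * (n - k)) for k in range(n + 1))
print(f"unknown disease/symptom split: {unknown_split:.1e}")  # ~5e23

# Known classes: 6 diseases x 10 symptoms = 60 candidate links.
print(f"known 6/10 split:              {2 ** 60:.1e}")  # ~1.2e18
```

All three magnitudes agree with the rounded figures quoted in the text.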



Fig. 2. Kemp and Tenenbaum (47) showed how the form of structure in a domain can be discovered by using a HBM defined over graph grammars. At the bottom level of the model is a data matrix D of objects and their properties, or similarities between pairs of objects. Each square of the matrix represents whether a given feature (column) is observed for a given object (row). One level up is the structure S, a graph of relations between objects that describes how the features in D are distributed. Intuitively, objects nearby in the graph are expected to share similar feature values; technically, the graph Laplacian parameterizes the inverse covariance of a Gaussian distribution with one dimension per object, and each feature is drawn independently from that distribution. The highest level of abstract principles specifies the form F of structure in the domain, in terms of grammatical rules for growing a graph S of a constrained form out of an initial seed node. Red arrows represent P(S|F) and P(D|S), the conditional probabilities that each level specifies for the level below. A search algorithm attempts to find both the form F and the structure S of that form that jointly maximize the posterior probability P(S,F|D), a function of the product of P(D|S) and P(S|F). (A) Given as data the features of animals, the algorithm finds a tree structure with intuitively sensible categories at multiple scales. (B) The same algorithm discovers that the voting patterns of U.S. Supreme Court judges are best explained by a linear “left-right” spectrum. (C) Subjective similarities among colors are best explained by a circular ring. (D) Given proximities between cities on the globe, the algorithm discovers a cylindrical representation analogous to latitude and longitude: the cross product of a ring and a chain. (E) Given images of realistically synthesized faces varying in two dimensions, race and masculinity, the algorithm successfully recovers the underlying two-dimensional grid structure: a cross product of two chains.
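The generative model at the structure level of Fig. 2 is compact enough to sketch. The code below is our illustration of the caption’s technical note (features drawn from a Gaussian whose inverse covariance is the graph Laplacian); we add a small diagonal term because the raw Laplacian is singular, a detail the published model handles with its own regularization.

```python
import numpy as np

rng = np.random.default_rng(0)

# A ring over 8 objects, like the color-similarity structure in panel (C).
n = 8
adjacency = np.zeros((n, n))
for i in range(n):
    adjacency[i, (i + 1) % n] = adjacency[(i + 1) % n, i] = 1.0

# Graph Laplacian L = D - A, plus a small ridge: L alone is singular.
laplacian = np.diag(adjacency.sum(axis=1)) - adjacency
covariance = np.linalg.inv(laplacian + 0.1 * np.eye(n))

# Each feature (a column of the data matrix D) is an independent draw
# from N(0, covariance), so neighbors on the ring get similar values.
features = rng.multivariate_normal(np.zeros(n), covariance, size=5).T
print(np.round(features, 2))  # rows = objects, columns = features
```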

The Origins of Abstract Knowledge

The need for abstract knowledge and the need to get it right bring us to question 3: How do learners learn what they need to know to make learning possible? How does a child know which tree structure is the right way to organize hypotheses for word learning? At a deeper level, how can a learner know that a given domain of entities and concepts should be represented by using a tree at all, as opposed to a low-dimensional space or some other form? Or, in causal learning, how do people come to correct framework theories such as knowledge of abstract disease and symptom classes of variables with causal links from diseases to symptoms?

The acquisition of abstract knowledge or new inductive constraints is primarily the province of cognitive development (5, 7). For instance, children learning words initially assume a flat, mutually exclusive division of objects into nameable clusters; only later do they discover that categories should be organized into tree-structured hierarchies (Fig. 1B) (41). Such discoveries are also pivotal in scientific progress: Mendeleev launched modern chemistry with his proposal of a periodic structure for the elements. Linnaeus famously proposed that relationships between biological species are best explained by a tree structure, rather than a simpler linear order (premodern Europe’s “great chain of being”) or some other form.

Such structural insights have long been viewed by psychologists and philosophers of science as deeply mysterious in their mechanisms, more magical than computational. Conventional algorithms for unsupervised structure discovery in statistics and machine learning—hierarchical clustering, principal components analysis, multidimensional scaling, clique detection—assume a single fixed form of structure (42). Unlike human children or scientists, they cannot learn multiple forms of structure or discover new forms in novel data. Neither traditional approach to cognitive development has a fully satisfying response: Nativists have assumed that, if different domains of cognition are represented in qualitatively different ways, those forms must be innate (43, 44); connectionists have suggested these representations may be learned but in a generic system of associative weights that at best only approximates trees, causal networks, and other forms of structure people appear to know explicitly (14).

Recently cognitive modelers have begun to answer these challenges by combining the structured knowledge representations described above with state-of-the-art tools from Bayesian statistics. Hierarchical Bayesian models (HBMs) (45) address the origins of hypothesis spaces and priors by positing not just a single level of hypotheses to explain the data but multiple levels: hypothesis spaces of hypothesis spaces, with priors on priors. Each level of a HBM generates a probability distribution on variables at the level below. Bayesian inference across all levels allows hypotheses and priors needed for a specific learning task to themselves be learned at larger or longer time scales, at the same time as they constrain lower-level learning.
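The phrase “priors on priors” can be grounded in the simplest possible HBM. The sketch below is our toy example in the spirit of learned inductive constraints (e.g., ref. 46), not a model from the paper: after seeing a few bags of marbles that are each uniform in color, the learner infers a top-level hypothesis (bags tend to be uniform) and then generalizes about a brand-new bag from a single draw.

```python
from math import lgamma, exp, log

def betaln(x, y):  # log of the Beta function
    return lgamma(x) + lgamma(y) - lgamma(x + y)

# Observed bags: (black, white) counts. Each bag is uniform in color;
# that regularity, not any single bag, is what the top level learns.
bags = [(10, 0), (8, 0), (0, 9), (12, 0), (0, 7)]

# Top-level hypotheses: each bag's color rate theta ~ Beta(a, a).
# Small a pushes bags toward all-black or all-white (uniform bags).
alphas = [0.05, 0.5, 5.0]

def log_marginal(black, white, a):
    # log P(counts | a) with theta integrated out; the binomial
    # coefficient is omitted since it is shared by all values of a.
    return betaln(a + black, a + white) - betaln(a, a)

log_post = [log(1 / len(alphas)) +
            sum(log_marginal(b, w, a) for b, w in bags) for a in alphas]
m = max(log_post)
weights = [exp(lp - m) for lp in log_post]
z = sum(weights)
post = [w / z for w in weights]
print({a: round(p, 3) for a, p in zip(alphas, post)})  # mass on a = 0.05

# Predictive for a NEW bag after a single black draw:
# P(next black | one black, a) = (a + 1) / (2a + 1).
pred = sum(p * (a + 1) / (2 * a + 1) for p, a in zip(post, alphas))
print(f"P(next marble black | one black seen) = {pred:.3f}")  # ~0.95
```

One draw from a new bag supports a confident generalization, because the abstraction (uniformity) was already learned from the other bags.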
In machine learning and artificial intelligence (AI), HBMs have primarily been used for transfer learning: the acquisition of inductive constraints from experience in previous related tasks (46). Transfer learning is critical for humans as well (SOM text and figs. S1 and S2), but here we focus on the role of HBMs in explaining how people acquire the right forms of abstract knowledge.

Kemp and Tenenbaum (36, 47) showed how HBMs defined over graph- and grammar-based representations can discover the form of structure governing similarity in a domain. Structures of different forms—trees, clusters, spaces, rings, orders, and so on—can all be represented as graphs, whereas the abstract principles underlying each form are expressed as simple grammatical rules for growing graphs of that form. Embedded in a hierarchical Bayesian framework, this approach can discover the correct forms of structure (the grammars) for many real-world domains, along with the best structure (the graph) of the appropriate form (Fig. 2). In particular, it can infer that a hierarchical organization for the novel objects in Fig. 1A (such as Fig. 1B) better fits the similarities people see in these objects, compared to alternative representations such as a two-dimensional space.



Fig. 3. HBMs defined over graph schemas can explain how intuitive theories are acquired and used to learn about specific causal relations from limited data (38). (A) A simple medical reasoning domain might be described by relations among 16 variables: The first six encode presence or absence of “diseases” (top row), with causal links to the next 10 “symptoms” (bottom row). This network can also be visualized as a matrix (top right, links shown in black). The causal learning task is to reconstruct this network based on observing data D on the states of these 16 variables in a set of patients. (B) A two-level HBM formalizes bottom-up causal learning or learning with an uninformative prior on networks. The bottom level is the data matrix D. The second level (structure) encodes hypothesized causal networks: a grayscale matrix visualizes the posterior probability that each pairwise causal link exists, conditioned on observing n patients; compare this matrix with the black-and-white ground truth matrix shown in (A). The true causal network can be recovered perfectly only from observing very many patients (n = 1000; not shown). With n = 80, spurious links (gray squares) are inferred, and with n = 20 almost none of the true structure is detected. (C) A three-level nonparametric HBM (48) adds a level of abstract principles, represented by a graph schema. The schema encodes a prior on the level below (causal network structure) that constrains and thereby accelerates causal learning. Both schema and network structure are learned from the same data observed in (B). The schema discovers the disease-symptom framework theory by assigning variables 1 to 6 to class C1, variables 7 to 16 to class C2, and a prior favoring only C1 → C2 links. These assignments, along with the effective number of classes (here, two), are inferred automatically via the Bayesian Occam’s razor. Although this three-level model has many more degrees of freedom than the model in (B), learning is faster and more accurate. With n = 80 patients, the causal network is identified near perfectly. Even n = 20 patients are sufficient to learn the high-level C1 → C2 schema and thereby to limit uncertainty at the network level to just the question of which diseases cause which symptoms. (D) A HBM for learning an abstract theory of causality (62). At the highest level are laws expressed in first-order logic representing the abstract properties of causal relationships, the role of exogenous interventions in defining the direction of causality, and features that may mark an event as an exogenous intervention. These laws place constraints on possible directed graphical models at the level below, which in turn are used to explain patterns of observed events over variables. Given observed events from several different causal systems, each encoded in a distinct data matrix, and a hypothesis space of possible laws at the highest level, the model converges quickly on a correct theory of intervention-based causality and uses that theory to constrain inferences about the specific causal networks underlying the different systems at the level below.

Hierarchical Bayesian models can also be used to learn abstract causal knowledge, such as the framework theory of diseases and symptoms (Fig. 3), and other simple forms of intuitive theories (38). Mansinghka et al. (48) showed how a graph schema representing two classes of variables, diseases and symptoms, and a preference for causal links running from disease to symptom variables can be learned from the same data that support learning causal links between specific diseases and symptoms and be learned just as fast or faster (Fig. 3, B and C). The learned schema in turn dramatically accelerates learning of specific causal relations (the directed graph structure) at the level below.


Getting the big picture first—discovering that diseases cause symptoms before pinning down any specific disease-symptom links—and then using that framework to fill in the gaps of specific knowledge is a distinctively human mode of learning. It figures prominently in children’s development and scientific progress but has not previously fit into the landscape of rational or statistical learning models.

Although this HBM imposes strong and valuable constraints on the hypothesis space of causal networks, it is also extremely flexible: It can discover framework theories defined by any number of variable classes and any pattern of pairwise regularities on how variables in these classes tend to be connected. Not even the number of variable classes (two for the disease-symptom theory) need be known in advance. This is enabled by another state-of-the-art Bayesian tool, known as “infinite” or nonparametric hierarchical modeling. These models posit an unbounded amount of structure, but only finitely many degrees of freedom are actively engaged for a given data set (49). An automatic Occam’s razor embodied in Bayesian inference trades off model complexity and fit to ensure that new structure (in this case, a new class of variables) is introduced only when the data truly require it.
The specific nonparametric distribution on node classes in Fig. 3C is a Chinese restaurant process (CRP), which has been particularly influential in recent machine learning and cognitive modeling. CRP models have given the first principled account of how people form new categories without direct supervision (50, 51): As each stimulus is observed, CRP models (guided by the Bayesian Occam’s razor) infer whether that object is best explained by assimilation to an existing category or by positing a previously unseen category (fig. S3). The CrossCat model extends CRPs to carve domains of objects and their properties into different subdomains or “views,” subsets of properties that can all be explained by a distinct way of organizing the objects (52) (fig. S4). CRPs can be embedded in probabilistic models for language to explain how children discover words in unsegmented speech (53), learn morphological rules (54), and organize word meanings into hierarchical semantic networks (55, 56) (fig. S5). A related but novel nonparametric construction, the Indian buffet process (IBP), explains how new perceptual features can be constructed during object categorization (57, 58).
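The CRP prior itself takes only a few lines. This sketch is ours and shows just the sequential category-assignment process; in a full model like those cited above, each choice would also be weighed by how well the item fits the category (the likelihood), with the Bayesian Occam’s razor arbitrating.

```python
import random

def crp_assignments(n_items, concentration=1.0, seed=1):
    """Sample category assignments from a Chinese restaurant process.

    Each item joins an existing category with probability proportional
    to that category's current size, or opens a new category with
    probability proportional to the concentration parameter.
    """
    rng = random.Random(seed)
    assignments, sizes = [], []
    for _ in range(n_items):
        weights = sizes + [concentration]  # last slot = new category
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(sizes):
            sizes.append(0)  # a previously unseen category is created
        sizes[k] += 1
        assignments.append(k)
    return assignments

print(crp_assignments(20))
# Typical output: a few large categories and a tail of small ones; the
# number of categories grows with the data rather than being fixed.
```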
More generally, nonparametric hierarchical models address the principal challenge human learners face as knowledge grows over a lifetime: balancing constraint and flexibility, or the need to restrict hypotheses available for generalization at any moment with the capacity to expand one’s hypothesis spaces, to learn new ways that the world could work. Placing nonparametric distributions at higher levels of the HBM yields flexible inductive biases for lower levels, whereas the Bayesian Occam’s razor ensures the proper balance of constraint and flexibility as knowledge grows.

Across several case studies of learning abstract knowledge—discovering structural forms, causal framework theories, and other inductive constraints acquired through transfer learning—it has been found that abstractions in HBMs can be learned remarkably fast from relatively little data compared with what is needed for learning at lower levels. This is because each degree of freedom at a higher level of the HBM influences and pools evidence from many variables at levels below. We call this property of HBMs “the blessing of abstraction.” It offers a top-down route to the origins of knowledge that contrasts sharply with the two classic approaches: nativism (59, 60), in which abstract concepts are assumed to be present from birth, and empiricism or associationism (14), in which abstractions are constructed but only approximately, and only slowly in a bottom-up fashion, by layering many experiences on top of each other and filtering out their common elements. Only HBMs thus seem suited to explaining the two most striking features of abstract knowledge in humans: that it can be learned from experience, and that it can be engaged remarkably early in life, serving to constrain more specific learning tasks.

Open Questions

HBMs may answer some questions about the origins of knowledge, but they still leave us wondering: How does it all start? Developmentalists have argued that not everything can be learned, that learning can only get off the ground with some innate stock of abstract concepts such as “agent,” “object,” and “cause” to provide the basic ontology for carving up experience (7, 61). Surely some aspects of mental representation are innate, but without disputing this Bayesian modelers have recently argued that even the most abstract concepts may in principle be learned. For instance, an abstract concept of causality expressed as logical constraints on the structure of directed graphs can be learned from experience in a HBM that generalizes across the network structures of many specific causal systems (Fig. 3D). Following the “blessing of abstraction,” these constraints can be induced from only small samples of each network’s behavior and in turn enable more efficient causal learning for new systems (62). How this analysis extends to other abstract concepts such as agent or object and whether children actually acquire these concepts in such a manner remain open questions.

Although HBMs have addressed the acquisition of simple forms of abstract knowledge, they have only touched on the hardest subjects of cognitive development: framework theories for core common-sense domains such as intuitive physics, psychology, and biology (5–7). First steps have come in explaining developing theories of mind, how children come to understand explicit false beliefs (63) and individual differences in preferences (64), as well as the origins of essentialist theories in intuitive biology and early beliefs about magnetism in intuitive physics (39, 38). The most daunting challenge is that formalizing the full content of intuitive theories appears to require Turing-complete compositional representations, such as probabilistic first-order logic (65, 66) and probabilistic programming languages (67). How to effectively constrain learning with such flexible representations is not at all clear.

Lastly, the project of reverse-engineering the mind must unfold over multiple levels of analysis, only one of which has been our focus here. Marr (68) famously argued for analyses that integrate across three levels: The computational level characterizes the problem that a cognitive system solves and the principles by which its solution can be computed from the available inputs in natural environments; the algorithmic level describes the procedures executed to produce this solution and the representations or data structures over which the algorithms operate; and the implementation level specifies how these algorithms and data structures are instantiated in the circuits of a brain or machine. Many early Bayesian models addressed only the computational level, characterizing cognition in purely functional terms as approximately optimal statistical inference in a given environment, without reference to how the computations are carried out (25, 39, 69). The HBMs of learning and development discussed here target a view between the computational and algorithmic levels: cognition as approximately optimal inference in probabilistic models defined over a learner’s subjective and dynamically growing mental representations of the world’s structure, rather than some objective and fixed world statistics.

Much ongoing work is devoted to pushing Bayesian models down through the algorithmic and implementation levels. The complexity of exact inference in large-scale models implies that these levels can at best approximate Bayesian computations, just as in any working Bayesian AI system (9). The key research questions are as follows: What approximate algorithms does the mind use, how do they relate to engineering approximations in probabilistic AI, and how are they implemented in neural circuits? Much recent work points to Monte Carlo or stochastic sampling–based approximations as a unifying framework for understanding how Bayesian inference may work practically across all these levels, in minds, brains, and machines (70–74). Monte Carlo inference in richly structured models is possible (9, 67) but very slow; constructing more efficient samplers is a major focus of current work.
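As a toy illustration of the sampling idea (ours, and far simpler than the samplers developed in refs. 70-74), a Metropolis algorithm can approximate the posterior from the earlier coughing example using only unnormalized scores P(d|h)P(h); the normalizing sum in Eq. 1 is never computed.

```python
import random

# Unnormalized scores prior * likelihood, with the same illustrative
# numbers as in the earlier coughing sketch.
score = {"cold": 0.25 * 0.60, "lung disease": 0.001 * 0.70,
         "heartburn": 0.20 * 0.01}
states = list(score)

rng = random.Random(0)
current = rng.choice(states)
counts = dict.fromkeys(states, 0)
n_steps = 100_000
for _ in range(n_steps):
    proposal = rng.choice(states)  # symmetric proposal distribution
    # Metropolis acceptance uses only the RATIO of unnormalized scores.
    if rng.random() < min(1.0, score[proposal] / score[current]):
        current = proposal
    counts[current] += 1

for h in states:
    print(f"P({h} | cough) ~ {counts[h] / n_steps:.3f}")
# Converges to the exact posterior (about 0.98 for "cold") without
# ever summing over the hypothesis space.
```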
The biggest remaining obstacle is to understand how structured symbolic knowledge can be represented in neural circuits. Connectionist models sidestep these challenges by denying that brains actually encode such rich knowledge, but this runs counter to the strong consensus in cognitive science and artificial intelligence that symbols and structures are essential for thought. Uncovering their neural basis is arguably the greatest computational challenge in cognitive neuroscience more generally—our modern mind-body problem.



Conclusions

We have outlined an approach to understanding cognition and its origins in terms of Bayesian inference over richly structured, hierarchical generative models. Although we are far from a complete understanding of how human minds work and develop, the Bayesian approach brings us closer in several ways. First is the promise of a unifying mathematical language for framing cognition as the solution to inductive problems and building principled quantitative models of thought with a minimum of free parameters and ad hoc assumptions. Deeper is a framework for understanding why the mind works the way it does, in terms of rational inference adapted to the structure of real-world environments, and what the mind knows about the world, in terms of abstract schemas and intuitive theories revealed only indirectly through how they constrain generalizations.

Most importantly, the Bayesian approach lets us move beyond classic either-or dichotomies that have long shaped and limited debates in cognitive science: “empiricism versus nativism,” “domain-general versus domain-specific,” “logic versus probability,” “symbols versus statistics.” Instead we can ask harder questions of reverse-engineering, with answers potentially rich enough to help us build more humanlike AI systems. How can domain-general mechanisms of learning and representation build domain-specific systems of knowledge? How can structured symbolic knowledge be acquired through statistical learning? The answers emerging suggest new ways to think about the development of a cognitive system. Powerful abstractions can be learned surprisingly quickly, together with or prior to learning the more concrete knowledge they constrain. Structured symbolic representations need not be rigid, static, hard-wired, or brittle. Embedded in a probabilistic framework, they can grow dynamically and robustly in response to the sparse, noisy data of experience.

References and Notes
1. P. Bloom, How Children Learn the Meanings of Words (MIT Press, Cambridge, MA, 2000).
2. F. Xu, J. B. Tenenbaum, Psychol. Rev. 114, 245 (2007).
3. S. Pinker, Words and Rules: The Ingredients of Language (Basic, New York, 1999).
4. A. Gopnik et al., Psychol. Rev. 111, 3 (2004).
5. A. Gopnik, A. N. Meltzoff, Words, Thoughts, and Theories (MIT Press, Cambridge, MA, 1997).
6. S. Carey, Conceptual Change in Childhood (MIT Press, Cambridge, MA, 1985).
7. S. Carey, The Origin of Concepts (Oxford Univ. Press, New York, 2009).
8. P. Godfrey-Smith, Theory and Reality (Univ. of Chicago Press, Chicago, 2003).
9. S. Russell, P. Norvig, Artificial Intelligence: A Modern Approach (Prentice Hall, Upper Saddle River, NJ, 2009).
10. D. McAllester, in Proceedings of the Eleventh Annual Conference on Computational Learning Theory [Association for Computing Machinery (ACM), New York, 1998], p. 234.
11. J. Pearl, Probabilistic Reasoning in Intelligent Systems (Morgan Kaufmann, San Francisco, CA, 1988).
12. J. McClelland, D. Rumelhart, Eds., Parallel Distributed Processing: Explorations in the Microstructure of Cognition (MIT Press, Cambridge, MA, 1986).
13. S. Pinker, How the Mind Works (Norton, New York, 1997).
14. T. Rogers, J. McClelland, Semantic Cognition: A Parallel Distributed Processing Approach (MIT Press, Cambridge, MA, 2004).
15. P. Niyogi, The Computational Nature of Language Learning and Evolution (MIT Press, Cambridge, MA, 2006).
16. T. L. Griffiths, N. Chater, C. Kemp, A. Perfors, J. B. Tenenbaum, Trends Cogn. Sci. 14, 357 (2010).
17. J. L. McClelland et al., Trends Cogn. Sci. 14, 348 (2010).
18. J. B. Tenenbaum, T. L. Griffiths, Behav. Brain Sci. 24, 629 (2001).
19. J. Tenenbaum, T. Griffiths, in Proceedings of the 23rd Annual Conference of the Cognitive Science Society, J. D. Moore, K. Stenning, Eds. (Erlbaum, Mahwah, NJ, 2001), pp. 1036–1041.
20. T. Griffiths, J. Tenenbaum, in Proceedings of the 23rd Annual Conference of the Cognitive Science Society, J. D. Moore, K. Stenning, Eds. (Erlbaum, Mahwah, NJ, 2001), pp. 370–375.
21. T. L. Griffiths, J. B. Tenenbaum, Cognition 103, 180 (2007).
22. H. Lu, A. L. Yuille, M. Liljeholm, P. W. Cheng, K. J. Holyoak, Psychol. Rev. 115, 955 (2008).
23. T. L. Griffiths, J. B. Tenenbaum, Cognit. Psychol. 51, 334 (2005).
24. T. R. Krynski, J. B. Tenenbaum, J. Exp. Psychol. Gen. 136, 430 (2007).
25. M. Oaksford, N. Chater, Trends Cogn. Sci. 5, 349 (2001).
26. T. L. Griffiths, J. B. Tenenbaum, Psychol. Sci. 17, 767 (2006).
27. A. Yuille, D. Kersten, Trends Cogn. Sci. 10, 301 (2006).
28. N. Chater, C. D. Manning, Trends Cogn. Sci. 10, 335 (2006).
29. R. M. Shiffrin, M. Steyvers, Psychon. Bull. Rev. 4, 145 (1997).
30. M. Steyvers, T. L. Griffiths, S. Dennis, Trends Cogn. Sci. 10, 327 (2006).
31. K. P. Körding, D. M. Wolpert, Nature 427, 244 (2004).
32. A. Tversky, D. Kahneman, Science 185, 1124 (1974).
33. E. T. Jaynes, Probability Theory: The Logic of Science (Cambridge Univ. Press, Cambridge, 2003).
34. D. J. C. MacKay, Information Theory, Inference, and Learning Algorithms (Cambridge Univ. Press, Cambridge, 2003).
35. F. Xu, J. B. Tenenbaum, Dev. Sci. 10, 288 (2007).
36. C. Kemp, J. B. Tenenbaum, Psychol. Rev. 116, 20 (2009).
37. H. M. Wellman, S. A. Gelman, Annu. Rev. Psychol. 43, 337 (1992).
38. C. Kemp, J. B. Tenenbaum, S. Niyogi, T. L. Griffiths, Cognition 114, 165 (2010).
39. T. L. Griffiths, J. B. Tenenbaum, in Causal Learning: Psychology, Philosophy, and Computation, A. Gopnik, L. Schulz, Eds. (Oxford Univ. Press, Oxford, 2007), pp. 323–345.
40. J. Woodward, Making Things Happen: A Theory of Causal Explanation (Oxford Univ. Press, Oxford, 2003).
41. E. M. Markman, Categorization and Naming in Children (MIT Press, Cambridge, MA, 1989).
42. R. N. Shepard, Science 210, 390 (1980).
43. N. Chomsky, Rules and Representations (Basil Blackwell, Oxford, 1980).
44. S. Atran, Behav. Brain Sci. 21, 547 (1998).
45. A. Gelman, J. B. Carlin, H. S. Stern, D. B. Rubin, Bayesian Data Analysis (Chapman and Hall, New York, 1995).
46. C. Kemp, A. Perfors, J. B. Tenenbaum, Dev. Sci. 10, 307 (2007).
47. C. Kemp, J. B. Tenenbaum, Proc. Natl. Acad. Sci. U.S.A. 105, 10687 (2008).
48. V. K. Mansinghka, C. Kemp, J. B. Tenenbaum, T. L. Griffiths, in Proceedings of the 22nd Conference on Uncertainty in Artificial Intelligence, R. Dechter, T. Richardson, Eds. (AUAI Press, Arlington, VA, 2006), pp. 324–331.
49. C. Rasmussen, in Advances in Neural Information Processing Systems (MIT Press, Cambridge, MA, 2000), vol. 12, pp. 554–560.
50. J. R. Anderson, Psychol. Rev. 98, 409 (1991).
51. T. L. Griffiths, A. N. Sanborn, K. R. Canini, D. J. Navarro, in The Probabilistic Mind, N. Chater, M. Oaksford, Eds. (Oxford Univ. Press, Oxford, 2008).
52. P. Shafto, C. Kemp, V. Mansinghka, M. Gordon, J. B. Tenenbaum, in Proceedings of the 28th Annual Conference of the Cognitive Science Society (Erlbaum, Mahwah, NJ, 2006), pp. 2146–2151.
53. S. Goldwater, T. L. Griffiths, M. Johnson, Cognition 112, 21 (2009).
54. M. Johnson, T. L. Griffiths, S. Goldwater, in Advances in Neural Information Processing Systems (MIT Press, Cambridge, MA, 2007), vol. 19, pp. 641–648.
55. T. L. Griffiths, M. Steyvers, J. B. Tenenbaum, Psychol. Rev. 114, 211 (2007).
56. D. Blei, T. Griffiths, M. Jordan, J. Assoc. Comput. Mach. 57, 1 (2010).
57. T. L. Griffiths, Z. Ghahramani, in Advances in Neural Information Processing Systems (MIT Press, Cambridge, MA, 2006), vol. 18, pp. 475–482.
58. J. Austerweil, T. L. Griffiths, in Advances in Neural Information Processing Systems (MIT Press, Cambridge, MA, 2009), vol. 21, pp. 97–104.
59. N. Chomsky, Language and Problems of Knowledge: The Managua Lectures (MIT Press, Cambridge, MA, 1986).
60. E. S. Spelke, K. Breinlinger, J. Macomber, K. Jacobson, Psychol. Rev. 99, 605 (1992).
61. S. Pinker, The Stuff of Thought: Language as a Window into Human Nature (Viking, New York, 2007).
62. N. D. Goodman, T. D. Ullman, J. B. Tenenbaum, Psychol. Rev. 118, 110 (2011).
63. N. Goodman et al., in Proceedings of the 28th Annual Conference of the Cognitive Science Society (Erlbaum, Mahwah, NJ, 2006), pp. 1382–1387.
64. C. Lucas, T. Griffiths, F. Xu, C. Fawcett, in Advances in Neural Information Processing Systems (MIT Press, Cambridge, MA, 2009), vol. 21, pp. 985–992.
65. B. Milch, B. Marthi, S. Russell, in ICML 2004 Workshop on Statistical Relational Learning and Its Connections to Other Fields, T. Dietterich, L. Getoor, K. Murphy, Eds. (Omnipress, Banff, Canada, 2004), pp. 67–73.
66. C. Kemp, N. Goodman, J. Tenenbaum, in Proceedings of the 30th Annual Meeting of the Cognitive Science Society (2008), pp. 1606–1611.
67. N. Goodman, V. Mansinghka, D. Roy, K. Bonawitz, J. Tenenbaum, in Proceedings of the 24th Conference on Uncertainty in Artificial Intelligence (AUAI Press, Corvallis, OR, 2008), vol. 22, p. 23.
68. D. Marr, Vision (W. H. Freeman, San Francisco, CA, 1982).
69. J. B. Tenenbaum, T. L. Griffiths, in Advances in Neural Information Processing Systems, T. Leen, T. Dietterich, V. Tresp, Eds. (MIT Press, Cambridge, MA, 2001), vol. 13, pp. 59–65.
70. A. N. Sanborn, T. L. Griffiths, D. J. Navarro, in Proceedings of the 28th Annual Conference of the Cognitive Science Society (Erlbaum, Mahwah, NJ, 2006), pp. 726–731.
71. S. D. Brown, M. Steyvers, Cognit. Psychol. 58, 49 (2009).
72. R. Levy, F. Reali, T. L. Griffiths, in Advances in Neural Information Processing Systems, D. Koller, D. Schuurmans, Y. Bengio, L. Bottou, Eds. (MIT Press, Cambridge, MA, 2009), vol. 21, pp. 937–944.
73. J. Fiser, P. Berkes, G. Orbán, M. Lengyel, Trends Cogn. Sci. 14, 119 (2010).
74. E. Vul, N. D. Goodman, T. L. Griffiths, J. B. Tenenbaum, in Proceedings of the 31st Annual Conference of the Cognitive Science Society (Erlbaum, Mahwah, NJ, 2009), pp. 148–153.
75. L. Schmidt, thesis, Massachusetts Institute of Technology, Cambridge, MA (2009).
76. We gratefully acknowledge the suggestions of R. R. Saxe, M. Bernstein, and J. M. Tenenbaum on this manuscript and the collaboration of N. Chater and A. Yuille on a forthcoming joint book expanding on the methods and perspectives reviewed here. Grant support was provided by the Air Force Office of Scientific Research, Office of Naval Research, Army Research Office, NSF, Defense Advanced Research Projects Agency, Nippon Telephone and Telegraph Communication Sciences Laboratories, Qualcomm, Google, Schlumberger, and the James S. McDonnell Foundation.

Supporting Online Material
www.sciencemag.org/cgi/content/full/331/6022/1279/DC1
SOM Text
Figs. S1 to S5
References

10.1126/science.1192788

