Analogical Reasoning
Module 6, Neuroscience 299, University of California, Berkeley
Ross W Gayler ORCiD: 0000-0003-4679-585X
[email protected]
2021-10-06
Motivation & Objectives
Analogy is really cool and central to cognition
Analogy is a good use case for the unique properties of VSA/HDC
· What makes analogy hard for conventional computing?
· Which VSA/HDC features might help with analogy?
· Not a solved problem
Use a set of attempts at aspects of analogy to highlight some VSA design issues
2/49
Outline
What is analogy?
Why is analogy hard for conventional computing?
VSA design examples:
· Plate - Similarity of hand crafted vectors
· Mikolov et al - Similarity of learned word vectors
· Kanerva - Simple substitution
· Emruli et al - Substitution with lookup
· Gayler & Levy - Settling on substitution
3/49
What is ANALOGY?
ANALOGY ≜ what analogy really is
Whatever it is, ANALOGY as a cognitive phenomenon is a complex, nuanced thing
· Everybody presents a different partial view of ANALOGY
- Tendency to interpret the partial view as the whole thing
- Analogical reasoning
- Proportional analogies
- Grand analogy (analogy as a party trick)
- Please don’t do that
ANALOGY is too big to fit in this lecture, so I will resort to assertions and hand
waving to explain enough of it for current purposes
4/49
Analogy is the core of cognition
Quote from Hofstadter (2006):
analogy-making ≜ the perception of common essence¹ between two things²
¹ In one’s current frame of mind
² Thing ≜ mental thing
See also
Gust et al (2008)
Chalmers et al (1992)
Blokpoel et al (2018)
I will jump off from Blokpoel:
cognition as inference to the best explanation
5/49
Inference to the Best Explanation
The cognitive loop:
Given some inputs (evidence e) and a set of potential explanations (hypotheses H),
find the hypothesis (h ∈ H) that best explains the evidence
Evidence and hypotheses are represented relationally (trees/graphs)
· A bet that natural regularities are “best” captured as transformations
“explains” is interpreted as graph structure matching - (sub)graph isomorphism
· structural similarity = literal similarity | optimal substitution of literals
· analogical “common essence” = common relational structure
Partial structure matching enables inference by carrying structure from one
representation to another (pattern completion via autoassociative memory)
6/49
Where do the hypotheses come from?
Hypotheses are generated from all the agent’s relevant knowledge
The hypothesis space must be open-ended, to allow for explaining novelty
· Hypotheses must be compositional
- Allows infinite productivity
- Allows novel compositions of familiar components
- Like a grammar for hypotheses
· Substitution enables composition (there may be other mechanisms)
7/49
Example: Relational representation
solar system = base structure = hypothesis (on this slide)
atom = target structure = evidence (on this slide)
structural similarity = literal similarity | {sun ↦ nucleus, planet ↦ electron}
Chalmers et al (1992)
8/49
Example: Relational representation of evidence
Blokpoel et al (2018)
9/49
Example: Relational representation of knowledge
Blokpoel et al (2018)
10/49
Example: Analogical augmentation
Blokpoel et al (2018)
11/49
Example: Augmentation of evidence
Blokpoel et al (2018)
12/49
Example: Explanation
Blokpoel et al (2018)
13/49
Why is analogy hard for conventional computing?
Subgraph isomorphism of two graphs is NP-complete (intractable)
· Considers all possible vertex mappings
· The “obvious” approach is brute-force exhaustive enumeration (sketched below)
· Each vertex mapping provides very little information about the adequacy of
the other vertex mappings
Considering all the base structures in the agent’s knowledge is much larger
Considering the transitive closure of analogical augmentations is much larger
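To make the brute-force point concrete, here is a minimal sketch (my own, not from the original slides) of exhaustive structure matching, assuming edges are directed pairs over named vertices. The number of candidate mappings grows factorially with the number of vertices, and each one tells you little about the others.

```python
from itertools import permutations

def structure_matches(base_edges, base_vertices, target_edges, target_vertices):
    """Brute-force structure matching: try every injective mapping of target
    vertices onto base vertices and keep those under which every target edge
    lands on a base edge. Cost grows factorially with the number of vertices."""
    matches = []
    for image in permutations(base_vertices, len(target_vertices)):
        mapping = dict(zip(target_vertices, image))
        if all((mapping[u], mapping[v]) in base_edges for (u, v) in target_edges):
            matches.append(mapping)
    return matches

# Toy example: map the atom (evidence) onto the solar system (hypothesis).
base_edges = {("sun", "planet")}          # e.g. attracts(sun, planet)
target_edges = {("nucleus", "electron")}  # e.g. attracts(nucleus, electron)
print(structure_matches(base_edges, ["sun", "planet"],
                        target_edges, ["nucleus", "electron"]))
# -> [{'nucleus': 'sun', 'electron': 'planet'}]
```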
14/49
Preview: Which VSA features might help?
Hardware parallelism (elementwise operations with small fan-in)
Mathematical parallelism (avoids explicit enumeration)
· The hardware only “sees” the total vector
· Distributive parallelism
- (A + B + C) * ρ(P + Q + R) = A*ρP + A*ρQ + … + B*ρP + B*ρQ + … (see the sketch below)
· Equational parallelism
- T = (A + B + C) = (P + Q + R + S) = (X * Y * Z) = …
· Enables holistic transformations
Substitution is a primitive (via binding)
· Every value is potentially a variable
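A rough illustration of distributive parallelism (my own sketch, assuming MAP-style bipolar vectors: elementwise multiplication as binding, addition as bundling, cyclic shift as ρ). The hardware only multiplies two vectors, yet all nine pairwise bound terms are produced without being enumerated.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 10_000
rand = lambda: rng.choice([-1, 1], size=D)   # random bipolar hypervector
rho = lambda v: np.roll(v, 1)                # permutation = cyclic shift

A, B, C, P, Q, R = (rand() for _ in range(6))

# Binding (*) distributes over bundling (+): the single product of two bundles
# mathematically contains every pairwise binding.
lhs = (A + B + C) * rho(P + Q + R)
rhs = sum(x * rho(y) for x in (A, B, C) for y in (P, Q, R))
print(np.array_equal(lhs, rhs))   # True: identical vectors, no explicit enumeration
```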
15/49
Plate (1994) - Hand crafted similarity
Focus of Chapter 6 of Plate’s thesis (1994) is the use of dot-product similarity as a
measure of structural similarity of representations
Reports experiments with hand-crafted representations aimed at qualitatively
reproducing the results of psychology research into human judgement of
analogical similarity under varying contributions of component similarity to
overall similarity.
My take (very roughly):
· superficial similarity ≈ similarity of the arguments of relations
· structural similarity ≈ similarity of the pattern of relations
but researchers are free to suit the details of their definitions to their needs
16/49
Example stimuli
P (Probe) “Spot bit Jane, causing Jane to flee from Spot”
LS (Literal Similarity) “Fido bit John, causing John to flee from Fido.” (Has both structural
and superficial similarity to the probe P.)
SF (Surface features) “John fled from Fido, causing Fido to bite John.” (Has superficial but
not structural similarity.)
CM (Cross-mapped analogy) “Fred bit Rover, causing Rover to flee from Fred.” (Has both
structural and superficial similarity, but types of corresponding objects are switched.)
AN (Analogy) “Mort bit Felix, causing Felix to flee from Mort.” (Has structural but not
superficial similarity).
FOR (First-order-relations only) “Mort fled from Felix, causing Felix to bite Mort.” (Has
neither structural nor superficial similarity, other than shared predicates.)
Plate (2000)
17/49
Base and token vectors
Plate (1994)
18/49
Stimulus episode representation construction
Probe episode (P): “Spot bit Jane, causing Jane to flee from Spot”
Plate (1994)
Note the addition of “lower level” components into the representations
· These are not strictly necessary for representing the structure
Construction of all the episode representations follows the same scheme
19/49
Dot-product similarity with Probe
Plate (2000)
20/49
Reminders of dot-product similarity properties
sim(A, A′) = sim(A, (A′ + X)) > 0
sim(A, (A ⊗ P)) ≈ 0
sim((A ⊗ P), (A′ ⊗ P)) = sim(A, A′) > 0
sim((A ⊗ P), (A′ ⊗ P′)) = sim(A, A′) × sim(P, P′) > 0
sim((A ⊗ B ⊗ … ⊗ X), (A′ ⊗ B ⊗ … ⊗ X)) = sim(A, A′) > 0
sim((A ⊗ B ⊗ … ⊗ X), (A ⊗ B ⊗ … ⊗ X ⊗ Y)) ≈ 0

If using self-inverse products:
sim((A ⊗ B ⊗ … ⊗ X), (A′ ⊗ B ⊗ … ⊗ X ⊗ Y) ⊗ Y†) = sim(A, A′)
† ⊗Y is equivalent to applying the substitution Y ↦ 1
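A quick numerical check of these properties (my own MAP-style sketch: bipolar vectors, elementwise multiplication as the self-inverse ⊗, cosine as sim; A1 stands in for A′):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 10_000
rand = lambda: rng.choice([-1, 1], size=D)           # random bipolar hypervector
sim = lambda u, v: u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

A, P, Y = rand(), rand(), rand()
A1 = A * rng.choice([1, 1, 1, -1], size=D)           # A' : similar to A (~25% flipped)

print(sim(A, A1))                  # > 0 : similar vectors
print(sim(A, A * P))               # ≈ 0 : binding destroys similarity to its arguments
print(sim(A * P, A1 * P))          # ≈ sim(A, A1) : binding preserves relative similarity
print(sim(A * P, A1 * P * Y))      # ≈ 0 : an extra bound factor orthogonalises
print(sim(A * P, A1 * P * Y * Y))  # = sim(A*P, A1*P) : self-inverse, ⊗Y applies Y ↦ 1
```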
21/49
My interpretation of Plate (1994) Chapter 6
Representation of structure requires product operations,
which destroy the dot-product similarity of the result to its arguments
· structural similarity ≠ only the dot-product similarity of core structure
· Needs something extra
Plate decorates the composite core structures with components to create similarity
· Might be ad hoc (depends on whether it is natural for the construction process)
· Might be missing necessary structure
- Predicates are not represented as unique instances,
- “Spot bit Fido causing Felix to bite John” is ambiguous
- but representing them as unique instances might destroy their similarity
- sim((bite1 ⊗ biteagt ⊗ Spot), (bite2 ⊗ biteagt ⊗ Spot)) ≈ 0 (?)
22/49
Dot-product similarity is local
Dot-product similarity is at the heart of VSA system dynamics
Dot-product similarity is very “local”
· Almost all vectors are quasi-orthogonal to current state vector
· Only a tiny fraction of the vector space has nonzero similarity with state
· Miraculous luck if all directions of interest are local to the state
· Similarity driven dynamics alone can’t select between non-local directions
Relational structure encoded by Multiply and Permute, which are orthogonalising
· Something needs to be done to map relational structure into the local space
so that it can engage the similarity dynamics
23/49
Mikolov et al (2013) - Learned word similarity
Proportional analogy with learned “semantic” vectors for words
· a : a′ :: b : b′
· man : woman :: king : queen
· Vwoman − Vman + Vking = Vqueen (?)
Mikolov et al (2013)
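A hedged sketch of the vector-offset method, using a toy hand-built vocabulary rather than real learned embeddings. Note the exclusion of the cue words from the answer set, which (as Rogers et al discuss on the next slide) does a surprising amount of the work.

```python
import numpy as np

def solve_analogy(a, a1, b, emb):
    """Vector-offset method: return the vocabulary word whose vector is closest to
    v(a') - v(a) + v(b), excluding the three cue words."""
    target = emb[a1] - emb[a] + emb[b]
    cos = lambda u, v: u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    return max((w for w in emb if w not in {a, a1, b}),
               key=lambda w: cos(emb[w], target))

# Toy stand-in for learned word vectors, built from additive features (see slide 26).
rng = np.random.default_rng(0)
person, fvsm, royal = rng.normal(size=(3, 50))
emb = {"man": person - fvsm, "woman": person + fvsm,
       "king": person - fvsm + royal, "queen": person + fvsm + royal}
print(solve_analogy("man", "woman", "king", emb))   # -> queen
```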
24/49
Successful analogy? Not so much
Doesn’t work as well as originally thought (Rogers et al, 2017)
· Relies on excluding b from answer set (or using multiple choice)
· Works best when a, a′ , and b are relatively similar to each other and to b′
· Poor at some classes of relations, e.g. synonymy, antonymy
ANALOGY enables proportional analogy, but proportional analogy ≠ ANALOGY
“semantic” vectors don’t capture SEMANTICS
· Semantic vectors don’t know how to change a flat tyre
· Captures a narrow subset of linguistic regularities induced by SEMANTICS
· ANALOGY engages with SEMANTICS
25/49
Vector semantics and dot-product similarity
Proportional analogy via semantic vectors implies semantics ≡ additive features
Vwoman − Vman + Vking
= (Vperson + VFvsM ) − (Vperson − VFvsM ) + (Vperson − VFvsM + Vroyal )
= Vperson + VFvsM + Vroyal
= Vqueen
Additive features can’t capture SEMANTICS
ANALOGY is about structural relational similarity
· Representing relational structure requires product operators
· Static dot-product similarity structure is driven by additive structure
· Dot-product similarity (by itself) can’t fully capture structural similarity
26/49
Systematic substitution via binding
A binding can be used as a partial function for substitution
The substitution is applied uniformly across the components of the argument
With a commutative, self-inverse product operator, e.g. BSC, MAP
(I won’t discuss non-commutative or non-self-inverse products here):
A ⊗ X ≡ {A ↦ X, X ↦ A}

Apply the substitution by binding it with the argument:

(A ⊗ X) ⊗ (A + A ⊗ B + X ⊗ C + D)
= A ⊗ X ⊗ A + A ⊗ X ⊗ A ⊗ B + A ⊗ X ⊗ X ⊗ C + A ⊗ X ⊗ D
= (A ⊗ A) ⊗ X + (A ⊗ A) ⊗ X ⊗ B + A ⊗ (X ⊗ X) ⊗ C + A ⊗ X ⊗ D
= 1 ⊗ X + 1 ⊗ X ⊗ B + A ⊗ 1 ⊗ C + A ⊗ X ⊗ D
= X + X ⊗ B + A ⊗ C + A ⊗ X ⊗ D

That is, binding with (A ⊗ X) maps
(A + A ⊗ B + X ⊗ C + D) ↦ (X + X ⊗ B + A ⊗ C + A ⊗ X ⊗ D)
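The same algebra checked numerically with a MAP-style sketch of my own (bipolar vectors, elementwise multiplication as the commutative, self-inverse ⊗); not part of the original slides.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 10_000
rand = lambda: rng.choice([-1, 1], size=dim)         # MAP: binding = elementwise product
A, B, C, D, X = (rand() for _ in range(5))

composite    = A + A * B + X * C + D
substitution = A * X                                 # encodes {A ↦ X, X ↦ A}

result   = substitution * composite                  # apply the substitution by binding
expected = X + X * B + A * C + A * X * D
print(np.array_equal(result, expected))              # True: every A became X and vice versa
```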
27/49
Subtlety of binding
Multiple interpretations of bindings (A ⊗ X) depending on how they are used, e.g.:
· key:value pair in a dictionary (note key and value are treated identically)
· variable:value pair (note variable and value are treated identically)
· Virtual feature detector neuron (A = pattern detected, X = identity of neuron)
· Inference rule (A = antecedent, B = consequent)
· Substitution pattern (A = search pattern, B = replacement pattern)
A and X can be arbitrarily complex composites (everything is just a vector), e.g.:
· (A + B + C) ⊗ X = A ⊗ X + B ⊗ X + C ⊗ X
· A ⊗ (X + Y + Z) = A ⊗ X + A ⊗ Y + A ⊗ Z
· A ⊗ B ⊗ X ⊗ Y = A ⊗ (B ⊗ X ⊗ Y) = (A ⊗ X ⊗ Y) ⊗ B = …
28/49
Kanerva (2010) - What is the Dollar of Mexico?
Vustates = Vname ⊗ VUSA + Vcapital ⊗ VWDC + Vcurrency ⊗ VUSD
Vmexico = Vname ⊗ VMEX + Vcapital ⊗ VMXC + Vcurrency ⊗ VMXN
VUM = Vustates ⊗ Vmexico
= VUSA ⊗ VMEX + VWDC ⊗ VMXC + VUSD ⊗ VMXN + noise1
(a sum of filler substitutions - fillers occupying the same role have been bound)
Warning:
noise is the sum of all terms you know will not be important
Apply the USA/Mexico mapping, e.g. “What is the Dollar of Mexico?”
VUM ⊗ VUSD
= (VUSA ⊗ VMEX + VWDC ⊗ VMXC + VUSD ⊗ VMXN + noise1) ⊗ VUSD
= noise2 + noise3 + VUSD ⊗ VMXN ⊗ VUSD + noise1 ⊗ VUSD
≈ VMXN
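A minimal MAP-style sketch of Kanerva's example (my own stand-in, using elementwise multiplication for ⊗ and a codebook comparison for clean-up):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 10_000
rand = lambda: rng.choice([-1, 1], size=dim)
sim = lambda u, v: u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

V = {n: rand() for n in
     ["name", "capital", "currency", "USA", "WDC", "USD", "MEX", "MXC", "MXN"]}

ustates = V["name"] * V["USA"] + V["capital"] * V["WDC"] + V["currency"] * V["USD"]
mexico  = V["name"] * V["MEX"] + V["capital"] * V["MXC"] + V["currency"] * V["MXN"]

UM = ustates * mexico              # sum of filler substitutions + cross-term noise
answer = UM * V["USD"]             # "What is the Dollar of Mexico?"

# Clean up against the codebook: MXN is by far the most similar entry.
print(max(V, key=lambda n: sim(answer, V[n])))   # -> MXN
```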
29/49
Hand crafted substitution
This approach works because of identical roles, e.g. Vcurrency
The role representations are static and enumerated in advance
Great if that’s all you need
· Seriously, if that’s all you need, it’s the best way to do something analogy-like
ANALOGY needs dynamic substitutions chosen in response to the context
30/49
Emruli et al (2013) - Learned substitution
Emruli et al (2013)
“The analogical mapping unit (AMU) which learns mappings of the type xk ↦ yk
from examples and uses bundled mapping vectors stored in the SDM to
calculate the output vector yk” Emruli et al (2013)
31/49
How it works
xk is used as the address for the Sparse Distributed Memory (SDM)
The mapping xk ⊗ yk is used as the value to store in the SDM
Mappings are noncommutative because xk and yk are used differently
Write mode: mappings are written to SDM
Read mode: retrieves average mapping corresponding to xk and applies it to xk
The SDM acts as a sort of averaging memory over similar addresses
Interpolates over mappings
Can’t generate completely novel mappings
Note the “circuit based” approach, including non-VSA components (SDM)
There is usually an amount of plumbing and control to deal with
A purist would make the control distributed (VSA-like) but it’s usual to make the
control localist as an engineering hack
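A caricature of the AMU (my own sketch, not Emruli et al's SDM-based architecture): the SDM is replaced by similarity-weighted read-out over stored (address, mapping vector) pairs, which is enough to show interpolation over similar addresses.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 10_000
rand = lambda: rng.choice([-1, 1], size=dim)
sim = lambda u, v: u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

class ToyAMU:
    """Caricature of the AMU: similarity-weighted read-out over stored mappings."""
    def __init__(self):
        self.store = []

    def write(self, x, y):
        self.store.append((x, x * y))            # store mapping x_k ⊗ y_k at address x_k

    def read(self, x):
        # Bundle stored mappings weighted by address similarity, then apply to the cue.
        mapping = sum(max(sim(x, a), 0.0) * m for a, m in self.store)
        return x * mapping

amu = ToyAMU()
x1, y1, x2, y2 = rand(), rand(), rand(), rand()
amu.write(x1, y1)
amu.write(x2, y2)

cue = x1 * rng.choice([1, 1, 1, -1], size=dim)   # a degraded version of x1
out = amu.read(cue)
print(sim(out, y1), sim(out, y2))                # y1 clearly wins: interpolation, not magic
```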
32/49
Gayler & Levy (2009) - Settling on substitution
Graph isomorphism (not analogical mapping, but a proxy for necessary process)
Find vertex mappings that make the graphs identical
· {A ↦ P, B ↦ Q, C ↦ S, D ↦ R}
· {A ↦ P, B ↦ Q, C ↦ R, D ↦ S}
Gayler & Levy (2009)
33/49
How it works: The long and winding road
The explanation is going to be long winded (sorry)
· What is a graph isomorphism (implementable definition)?
· Localist heuristic to find graph isomorphisms
· VSA distributed implementation of localist method
34/49
Adjacency matrix of a graph
How to represent a graph with a matrix
· Row and column indices correspond to vertices
· Cell entries indicate edges
Gayler & Levy (2009)
35/49
Association graph
The association graph is a graph product of the two graphs
- Vertices correspond to vertex mappings of the two graphs
- Edges correspond to edge existence agreement of the vertex mappings
- A maximal clique corresponds to a maximal isomorphism of the two graphs
Gayler (2009) Melbourne University presentation (edited). Only a subset of vertices and edges shown.
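A localist sketch (mine, not from the paper) of building the association graph from two 0/1 adjacency matrices:

```python
import numpy as np

def association_graph(adj_g, adj_h):
    """Vertices of the association graph are vertex pairs (i, p); two pairs (i, p) and
    (j, q) are joined iff i != j, p != q, and the graphs agree on edge existence:
    adj_g[i, j] == adj_h[p, q]. A maximal clique is a maximal isomorphism."""
    n, m = len(adj_g), len(adj_h)
    w = np.zeros((n * m, n * m))
    for i in range(n):
        for j in range(n):
            for p in range(m):
                for q in range(m):
                    if i != j and p != q and adj_g[i, j] == adj_h[p, q]:
                        w[i * m + p, j * m + q] = 1
    return w

# The two graphs from the example: vertices A,B,C,D with edges B-C, B-D,
# and vertices P,Q,R,S with edges Q-R, Q-S (undirected, so both directions set).
adj = np.zeros((4, 4))
adj[1, 2] = adj[2, 1] = adj[1, 3] = adj[3, 1] = 1
w = association_graph(adj, adj.copy())
print(w.shape)   # (16, 16): one row/column per potential vertex mapping
```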
36/49
How to find a maximal clique of a graph
Replicator equations (from evolutionary game theory)
Also interpretable as Bayesian update
x(t) = prior distribution = support for each possible vertex mapping
x(t + 1) = posterior distribution = support for each possible vertex mapping
w = adjacency matrix of association graph
π(t) = likelihood = multiplicative update to vertex mapping support given w
Gayler (2009) Melbourne University presentation
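A minimal sketch of the replicator update loop as I read the slide above (the jittered initialisation is only there to break ties between equally good isomorphisms):

```python
import numpy as np

def replicator_settle(w, steps=100, seed=0):
    """Replicator dynamics: x is a distribution of support over vertex mappings,
    w is the association-graph adjacency matrix. The multiplicative (Bayes-like)
    update concentrates support on a maximal clique, i.e. a maximal isomorphism."""
    rng = np.random.default_rng(seed)
    x = 1.0 + 0.01 * rng.random(w.shape[0])   # near-uniform prior (jitter breaks ties)
    x /= x.sum()
    for _ in range(steps):
        pi = w @ x       # likelihood: support each mapping receives from its neighbours
        x = x * pi       # multiplicative update
        x /= x.sum()     # renormalise to a distribution
    return x

# Usage with the association graph w built in the earlier sketch:
# x = replicator_settle(w); mappings with surviving support form a consistent isomorphism.
```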
37/49
Replicator equation circuit
Localist representation of mappings (potentially large)
k ≜ number of vertices in each of the two graphs
dim(x) = dim(π) = k²
dim(w) = k² × k²
Gayler (2009) Melbourne University presentation
∧ ≜ elementwise product
38/49
Settling of localist replicator equations
Gayler (2009) Melbourne University presentation
39/49
VSA representations for replicator equations
Vertices: A, B, C, D, P, Q, R, S
Edges: B ⊗ C, B ⊗ D, Q ⊗ R, Q ⊗ S
Graph vertex sets: (A + B + C + D), (P + Q + R + S)
Graph edge sets: (B ⊗ C + B ⊗ D), (Q ⊗ R + Q ⊗ S)

Initial potential vertex mappings = x(t = 1) =
(A + B + C + D) ⊗ (P + Q + R + S) =
A ⊗ P + A ⊗ Q + … + B ⊗ P + B ⊗ Q + … + D ⊗ S

Association graph edges (positive only) = potential edge mappings = w =
(B ⊗ C + B ⊗ D) ⊗ (Q ⊗ R + Q ⊗ S) =
(B ⊗ C ⊗ Q ⊗ R) + (B ⊗ C ⊗ Q ⊗ S) + (B ⊗ D ⊗ Q ⊗ R) + (B ⊗ D ⊗ Q ⊗ S)
40/49
VSA replicator equation circuit
Interpret all vectors as being of the form kV ,
where k is the magnitude/support for V (the unit magnitude direction)
Analog computing: V is the labelled wire, k is the voltage on the wire
Gayler (2009) Melbourne University presentation
41/49
Multiset intersection
∧ ≜ multiset intersection
A multiset is a set with a nonnegative magnitude of membership for each
element (i.e. the magnitudes of the component vectors vary across components)
arg1 = aV1 + bV2 + cV3
arg2 = pV1 + qV2 + rV4
∧(arg1, arg2) = apV1 + bqV2
The elementwise multiplication of the magnitudes of the component vectors
This corresponds to the elementwise multiplication of support for the vertex
mappings in the localist version
Won’t go into the implementation here
42/49
Evidence propagation
Association graph edges have the form: B ⊗ C ⊗ Q ⊗ R
and can be interpreted as mappings:
(B ⊗ C) ⊗ (Q ⊗ R) # interpret as mapping between edges in the graphs
(B ⊗ Q) ⊗ (C ⊗ R) # interpret as mapping between vertex mappings

Association graph edges applied as inference rules:
π = x ⊗ w
= (B ⊗ Q + …) ⊗ (B ⊗ C ⊗ Q ⊗ R + …)
= (vertex mappings) ⊗ (mappings between vertex mappings)
= (k(B ⊗ Q) + …) ⊗ ((B ⊗ Q) ⊗ (C ⊗ R) + …)
= k(C ⊗ R) + …
Interpret (B ⊗ Q) ⊗ (C ⊗ R) as the rule:
“To the extent k that B ⊗ Q is supported as part of the solution,
increase the support for C ⊗ R as part of the solution by k”
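One step of this evidence propagation, checked numerically with a MAP-style sketch (my illustration only; it omits the multiset intersection and normalisation of the full circuit):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 10_000
rand = lambda: rng.choice([-1, 1], size=dim)
sim = lambda u, v: u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

B, C, D, Q, R, S = (rand() for _ in range(6))

x = B * Q                                    # current support: the mapping B ↦ Q
w = (B * Q) * (C * R) + (B * D) * (Q * S)    # two association-graph edges as rules

pi = x * w                                   # evidence propagation by binding
print(sim(pi, C * R))   # clearly > 0: support flows to the consistent mapping C ↦ R
print(sim(pi, C * S))   # ≈ 0: no support for an inconsistent mapping
```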
43/49
Settling of VSA replicator equation circuit
Gayler (2009) Melbourne University presentation
44/49
References / Reading
M. Blokpoel, T. Wareham, P. Haselager, and I. van Rooij (2018) Deep Analogical
Inference as the Origin of Hypotheses. The Journal of Problem Solving
D. J. Chalmers, R. M. French, and D. R. Hofstadter (1992) High-level perception,
representation, and analogy: A critique of artificial intelligence methodology. Journal
of Experimental & Theoretical Artificial Intelligence
B. Emruli, R. W. Gayler, and F. Sandin (2013) Analogical mapping and inference with
binary spatter codes and sparse distributed memory. The 2013 International Joint
Conference on Neural Networks (IJCNN)
B. Emruli and F. Sandin (2014) Analogical Mapping with Sparse Distributed Memory:
A Simple Model that Learns to Generalize from Examples. Cognitive Computation
45/49
R. W. Gayler and S. D. Levy (2009) A Distributed Basis for Analogical Mapping. New
Frontiers in Analogy Research, Proceedings of the Second International
Conference on Analogy, ANALOGY-2009
R. W. Gayler and R. Wales (1998) Connections, Binding, Unification and Analogical
Promiscuity. Advances In Analogy Research: Integration Of Theory And Data From
The Cognitive, Computational, And Neural Sciences
H. Gust, U. Krumnack, K.-U. Kühnberger, and A. Schwering (2008) Analogical
Reasoning: A Core of Cognition. KI - Künstliche Intelligenz
D. R. Hofstadter (2006) Analogy as the core of cognition. Stanford Presidential
Lecture
P. Kanerva (2000) Large Patterns Make Great Symbols: An Example of Learning from
Example. Hybrid Neural Systems
46/49
P. Kanerva (2010) What We Mean when We Say “What’s the Dollar of Mexico?”:
Prototypes and Mapping in Concept Space. Quantum Informatics 2010: AAAI-Fall
2010 Symposium on Quantum Informatics for Cognitive, Social, and Semantic
Processes
S. D. Levy and R. W. Gayler (2009) “Lateral inhibition” in a fully distributed
connectionist architecture. In Proceedings of the Ninth International Conference
on Cognitive Modeling (ICCM 2009)
T. Mikolov, W. Yih, and G. Zweig (2013) Linguistic Regularities in Continuous Space
Word Representations. Proceedings of NAACL-HLT 2013
T. A. Plate (1994) Distributed Representations and Nested Compositional Structure.
PhD Thesis. University of Toronto
T. A. Plate (2000) Analogy retrieval and processing with distributed vector
representations. Expert Systems
47/49
A. Rogers, A. Drozd, and L. Bofang (2017) The (too Many) Problems of Analogical
Reasoning with Word Vectors. Proceedings of the 6th Joint Conference on Lexical
and Computational Semantics (*SEM 2017)
48/49
DOI: 10.5281/zenodo.5552219
This presentation is licensed under a Creative Commons Attribution 4.0 International License
https://creativecommons.org/licenses/by/4.0/
49/49