Analogical Reasoning (video recording)


Video recording of the lecture "Analogical Reasoning" given on 2021-10-06 as Module 6 of Neuroscience 299: Computing with High-Dimensional Vectors at the Redwood Center for Theoretical Neuroscience, University of California, Berkeley.
Video DOI: 10.5281/zenodo.5560797
Slides DOI: 10.5281/zenodo.5552219
Source DOI: 10.5281/zenodo.5561063

Analogical Reasoning
Module 6, Neuroscience 299, University of California, Berkeley
Ross W Gayler
ORCiD: 0000-0003-4679-585X
2021-10-06

Motivation & Objectives
Analogy is really cool and central to cognition
Analogy is a good use case for the unique properties of VSA/HDC
· What makes analogy hard for conventional computing?
· Which VSA/HDC features might help with analogy?
· Not a solved problem
Use a set of attempts at aspects of analogy to highlight some VSA design issues

Outline
What is analogy?
Why is analogy hard for conventional computing?
VSA design examples:
· Plate - Similarity of hand crafted vectors
· Mikolov et al - Similarity of learned word vectors
· Kanerva - Simple substitution
· Emruli et al - Substitution with lookup
· Gayler & Levy - Settling on substitution

What is ANALOGY?
ANALOGY ≜ what analogy really is
Whatever it is, ANALOGY as a cognitive phenomenon is a complex, nuanced thing
· Everybody presents a different partial view of ANALOGY
  - Tendency to interpret the partial view as the whole thing
  - Analogical reasoning
  - Proportional analogies
  - Grand analogy (analogy as a party trick)
  - Please don’t do that
ANALOGY is too big to fit in this lecture, so I will resort to assertions and hand waving to explain enough of it for current purposes

Analogy is the core of cognition
Quote from Hofstadter (2006):
analogy-making ≜ the perception of common essence¹ between two things²
¹ In one’s current frame of mind
² Thing ≜ mental thing
See also
Gust et al (2008)
Chalmers et al (1992)
Blokpoel et al (2018)
I will jump off from Blokpoel: cognition as inference to the best explanation

Inference to the Best Explanation
The cognitive loop: Given some inputs (evidence e) and a set of potential explanations (hypotheses H), find the hypothesis (h) that best explains the evidence
Evidence and hypotheses are represented relationally (trees/graphs)
· A bet that natural regularities are “best” captured as transformations
“explains” is interpreted as graph structure matching - (sub)graph isomorphism
· structural similarity = literal similarity | optimal substitution of literals
· analogical “common essence” = common relational structure
Partial structure matching enables inference by carrying structure from one representation to another (pattern completion via autoassociative memory)

Where do the hypotheses come from?
Hypotheses are generated from all the agent’s relevant knowledge
The hypothesis space must be open-ended, to allow for explaining novelty
· Hypotheses must be compositional
  - Allows infinite productivity
  - Allows novel compositions of familiar components
  - Like a grammar for hypotheses
· Substitution enables composition (there may be other mechanisms)

Example: Relational representation
solar system = base structure = hypothesis (on this slide)
atom = target structure = evidence (on this slide)
structural similarity = literal similarity | {sun ↦ nucleus, planet ↦ electron}
Chalmers et al (1992)

Example: Relational representation of evidence
Blokpoel et al (2018)

Example: Relational representation of knowledge
Blokpoel et al (2018)

Example: Analogical augmentation
Blokpoel et al (2018)

Example: Augmentation of evidence
Blokpoel et al (2018)

Example: Explanation
Blokpoel et al (2018)

Why is analogy hard for conventional computing?
Subgraph isomorphism between two graphs is NP-complete (intractable)
· Considers all possible vertex mappings
· The “obvious” approach is brute-force exhaustive enumeration
· Each vertex mapping provides very little information about the adequacy of the other vertex mappings
Considering all the base structures in the agent’s knowledge is a much larger problem
Considering the transitive closure of analogical augmentations is much larger again

Preview: Which VSA features might help?
Hardware parallelism (elementwise operations with small fan-in)
Mathematical parallelism (avoids explicit enumeration)
· The hardware only “sees” the total vector
· Distributive parallelism
  - (A + B + C) * ρ(P + Q + R) = A*ρP + A*ρQ + … + B*ρP + B*ρQ + …
· Equational parallelism
  - T = (A + B + C) = (P + Q + R + S) = (X * Y * Z) = …
· Enables holistic transformations
Substitution is a primitive (via binding)
· Every value is potentially a variable

Plate (1994) - Hand crafted similarity
Focus of Chapter 6 of Plate’s thesis (1994) is the use of dot-product similarity as a measure of structural similarity of representations
Reports experiments with hand-crafted representations aimed at qualitatively reproducing the results of psychology research into human judgement of analogical similarity under varying contributions of component similarity to overall similarity.
My take:
· superficial similarity ≈ (very roughly) similarity of arguments of relations
· structural similarity ≈ (very roughly) similarity of pattern of relations
but researchers are free to suit the details of their definitions to their needs

Example stimuli
P (Probe) “Spot bit Jane, causing Jane to flee from Spot”
LS (Literal Similarity) “Fido bit John, causing John to flee from Fido.” (Has both structural and superficial similarity to the probe P.)
SF (Surface features) “John fled from Fido, causing Fido to bite John.” (Has superficial but not structural similarity.)
CM (Cross-mapped analogy) “Fred bit Rover, causing Rover to flee from Fred.” (Has both structural and superficial similarity, but types of corresponding objects are switched.)
AN (Analogy) “Mort bit Felix, causing Felix to flee from Mort.” (Has structural but not superficial similarity.)
FOR (First-order-relations only) “Mort fled from Felix, causing Felix to bite Mort.” (Has neither structural nor superficial similarity, other than shared predicates.)
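
The distributive parallelism noted in the preview of VSA features can be checked numerically. The following is a minimal sketch, assuming MAP-style random bipolar vectors, elementwise multiplication for *, and a cyclic shift standing in for the permutation ρ. The dimensionality, the seed, and the helper names rand_vec and rho are assumptions made for illustration and do not come from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 1024  # vector dimensionality (arbitrary choice for this sketch)


def rand_vec():
    # Random bipolar (+1/-1) vector, as used in MAP-style VSAs
    return rng.choice(np.array([-1, 1]), size=d)


def rho(v):
    # A fixed permutation; a cyclic shift is used here as a stand-in
    return np.roll(v, 1)


A, B, C, P, Q, R = (rand_vec() for _ in range(6))

# One elementwise product of two bundles ...
lhs = (A + B + C) * rho(P + Q + R)

# ... equals the sum of all nine pairwise bindings,
# which the single product above never enumerates explicitly
rhs = sum(x * rho(y) for x in (A, B, C) for y in (P, Q, R))

distributes = bool(np.array_equal(lhs, rhs))
print(distributes)
```

The equality is exact here because the permutation is linear and the arithmetic is on integers; the hardware only ever handles the two bundled vectors.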
Plate (2000)

Base and token vectors
Plate (1994)

Stimulus episode representation construction
Probe episode (P): “Spot bit Jane, causing Jane to flee from Spot”
Plate (1994)
Note the addition of “lower level” components into the representations
· These are not strictly necessary for representing the structure
Construction of all the episode representations follows the same scheme

Dot-product similarity with Probe
Plate (2000)

Reminders of dot-product similarity properties
sim(A, A′) = sim(A, (A + X)) > 0
sim(A, (A ⊗ P)) ≈ 0
sim((A ⊗ P), (A′ ⊗ P)) = sim(A, A′) > 0
sim((A ⊗ P), (A′ ⊗ P′)) = sim(A, A′) × sim(P, P′) > 0
sim((A ⊗ B ⊗ … ⊗ X), (A′ ⊗ B ⊗ … ⊗ X)) = sim(A, A′) > 0
sim((A ⊗ B ⊗ … ⊗ X), (A′ ⊗ B ⊗ … ⊗ X ⊗ Y)) ≈ 0
If using self-inverse products:
sim((A ⊗ B ⊗ … ⊗ X), (A′ ⊗ B ⊗ … ⊗ X ⊗ Y) ⊗ Y†) = sim(A, A′)
⊗Y† is equivalent to applying the substitution Y ↦ 1

My interpretation of Plate (1994) Chapter 6
Representation of structure requires product operations, which destroy dot-product similarity of the result to the arguments
· structural similarity ≠ only the dot-product similarity of core structure
· Needs something extra
Plate decorates the composite core structures with components to create similarity
· Might be ad hoc (depends on whether it is natural for the construction process)
· Might be missing necessary structure
  - Predicates are not represented as unique instances,
  - “Spot bit Fido causing Felix to bite John” is ambiguous
  - but representing them as unique instances might destroy their similarity
  - sim((bite_1 ⊗ bite_agt ⊗ Spot), (bite_2 ⊗ bite_agt ⊗ Spot)) ≈? 0

Dot-product similarity is local
Dot-product similarity is at the heart of VSA system dynamics
Dot-product similarity is very “local”
· Almost all vectors are quasi-orthogonal to current state vector
· Only a tiny fraction of the vector space has nonzero similarity with state
· Miraculous luck if all directions of interest are local to the state
· Similarity driven dynamics alone can’t select between non-local directions
Relational structure is encoded by Multiply and Permute, which are orthogonalising
· Something needs to be done to map relational structure into the local space so that it can engage the similarity dynamics

Mikolov et al (2013) - Learned word similarity
Proportional analogy with learned “semantic” vectors for words
· a : a′ :: b : b′
· man : woman :: king : queen
· V_woman − V_man + V_king =? V_queen
Mikolov et al (2013)

Successful analogy? Not so much
Doesn’t work as well as originally thought (Rogers et al, 2017)
· Relies on excluding b from answer set (or using multiple choice)
· Works best when a, a′, and b are relatively similar to each other and to b′
· Poor at some classes of relations, e.g.
synonymy, antonymy
ANALOGY enables proportional analogy, but proportional analogy ≠ ANALOGY
“semantic” vectors don’t capture SEMANTICS
· Semantic vectors don’t know how to change a flat tyre
· Captures a narrow subset of linguistic regularities induced by SEMANTICS
· ANALOGY engages with SEMANTICS

Vector semantics and dot-product similarity
Proportional analogy via semantic vectors implies semantics ≡ additive features
V_woman − V_man + V_king
= (V_person + V_FvsM) − (V_person − V_FvsM) + (V_person − V_FvsM + V_royal)
= V_person + V_FvsM + V_royal
= V_queen
Additive features can’t capture SEMANTICS
ANALOGY is about structural relational similarity
· Representing relational structure requires product operators
· Static dot-product similarity structure is driven by additive structure
· Dot-product similarity (by itself) can’t fully capture structural similarity

Systematic substitution via binding
A binding can be used as a partial function for substitution
The substitution is applied uniformly across the components of the argument
With a commutative, self-inverse product operator, e.g.
BSC, MAP (I won’t discuss non-commutative or non-self-inverse products here):
A ⊗ X ≡ {A ↦ X, X ↦ A}
Apply the substitution by binding it with the argument:
(A ⊗ X) ⊗ (A + A ⊗ B + X ⊗ C + D)
= A ⊗ X ⊗ A + A ⊗ X ⊗ A ⊗ B + A ⊗ X ⊗ X ⊗ C + A ⊗ X ⊗ D
= (A ⊗ A) ⊗ X + (A ⊗ A) ⊗ X ⊗ B + A ⊗ (X ⊗ X) ⊗ C + A ⊗ X ⊗ D
= 1 ⊗ X + 1 ⊗ X ⊗ B + A ⊗ 1 ⊗ C + A ⊗ X ⊗ D
= X + X ⊗ B + A ⊗ C + A ⊗ X ⊗ D
That is, binding with (A ⊗ X) maps (A + A ⊗ B + X ⊗ C + D) ↦ (X + X ⊗ B + A ⊗ C + A ⊗ X ⊗ D)

Subtlety of binding
Multiple interpretations of bindings (A ⊗ X) depending on how they are used, e.g.:
· key:value pair in a dictionary (note key and value are treated identically)
· variable:value pair (note variable and value are treated identically)
· Virtual feature detector neuron (A = pattern detected, X = identity of neuron)
· Inference rule (A = antecedent, B = consequent)
· Substitution pattern (A = search pattern, B = replacement pattern)
A and X can be arbitrarily complex composites (everything is just a vector), e.g.:
· (A + B + C) ⊗ X = A ⊗ X + B ⊗ X + C ⊗ X
· A ⊗ (X + Y + Z) = A ⊗ X + A ⊗ Y + A ⊗ Z
· A ⊗ B ⊗ X ⊗ Y = A ⊗ (B ⊗ X ⊗ Y) = (A ⊗ X ⊗ Y) ⊗ B = …

Kanerva (2010) - What is the Dollar of Mexico?
V_ustates = V_name ⊗ V_USA + V_capital ⊗ V_WDC + V_currency ⊗ V_USD
V_mexico = V_name ⊗ V_MEX + V_capital ⊗ V_MXC + V_currency ⊗ V_MXN
V_{U⊗M} = V_ustates ⊗ V_mexico
= V_USA ⊗ V_MEX + V_WDC ⊗ V_MXC + V_USD ⊗ V_MXN + V_noise1
(a sum of filler substitutions - fillers occupying the same role have been bound)
Warning: noise is the sum of all terms you know will not be important
Apply the USA/Mexico mapping, e.g. “What is the Dollar of Mexico?”
V_{U⊗M} ⊗ V_USD
= (V_USA ⊗ V_MEX + V_WDC ⊗ V_MXC + V_USD ⊗ V_MXN + V_noise1) ⊗ V_USD
= V_noise2 + V_noise3 + V_USD ⊗ V_MXN ⊗ V_USD + V_noise1 ⊗ V_USD
≈ V_MXN

Hand crafted substitution
This approach works because of identical roles, e.g.
V_currency
The role representations are static and enumerated in advance
Great if that’s all you need
· Seriously, if that’s all you need, it’s the best way to do something analogy-like
ANALOGY needs dynamic substitutions chosen in response to the context

Emruli et al (2013) - Learned substitution
“The analogical mapping unit (AMU) which learns mappings of the type x_k ↦ y_k from examples and uses bundled mapping vectors stored in the SDM to calculate the output vector y_k”
Emruli et al (2013)

How it works
x_k is used as the address for the Sparse Distributed Memory (SDM)
The mapping x_k ⊗ y_k is used as the value to store in the SDM
Mappings are noncommutative because x_k and y_k are used differently
Write mode: mappings are written to SDM
Read mode: retrieves average mapping corresponding to x_k and applies it to x_k
SDM acts as a sort of averaging memory over similar addresses
Interpolates over mappings
Can’t generate completely novel mappings
Note the “circuit based” approach, including non-VSA components (SDM)
There is usually an amount of plumbing and control to deal with
A purist would make the control distributed (VSA-like) but it’s usual to make the control localist as an engineering hack

Gayler & Levy (2009) - Settling on substitution
Graph isomorphism (not analogical mapping, but a proxy for necessary process)
Find vertex mappings that make the graphs identical
· {A ↦ P, B ↦ Q, C ↦ R, D ↦ S}
· {A ↦ P, B ↦ Q, C ↦ S, D ↦ R}
Gayler & Levy (2009)

How it works: The long and winding road
The explanation is going to be long winded (sorry)
· What is a graph isomorphism (implementable definition)?
· Localist heuristic to find graph isomorphisms
· VSA distributed implementation of localist method

Adjacency matrix of a graph
How to represent a graph with a matrix
· Row and column indices correspond to vertices
· Cell entries indicate edges
Gayler & Levy (2009)

Association graph
The association graph is a graph product of the two graphs
  - Vertices correspond to vertex mappings of the two graphs
  - Edges correspond to edge existence agreement of the vertex mappings
  - A maximal clique corresponds to a maximal isomorphism of the two graphs
Gayler (2009) Melbourne University presentation (edited). Only a subset of vertices and edges shown.

How to find a maximal clique of a graph
Replicator equations (from evolutionary game theory)
Also interpretable as Bayesian update
x(t) = prior distribution = support for each possible vertex mapping
x(t + 1) = posterior distribution = support for each possible vertex mapping
w = adjacency matrix of association graph
π(t) = likelihood = multiplicative update to vertex mapping support given w
Gayler (2009) Melbourne University presentation

Replicator equation circuit
Localist representation of mappings (potentially large)
k = number of vertices in each of the two graphs
dim(x) = dim(π) = k²
dim(w) = k² × k²
Gayler (2009) Melbourne University presentation
∧ ≜ elementwise product

Settling of localist replicator equations
Gayler (2009) Melbourne University presentation

VSA representations for replicator equations
Vertices: A, B, C, D, P, Q, R, S
Edges: B ⊗ C, B ⊗ D, Q ⊗ R, Q ⊗ S
Graph vertex sets: (A + B + C + D), (P + Q + R + S)
Graph edge sets: (B ⊗ C + B ⊗ D), (Q ⊗ R + Q ⊗ S)
Initial potential vertex mappings = x(t = 1)
= (A + B + C + D) ⊗ (P + Q + R + S)
= A ⊗ P + A ⊗ Q + … + B ⊗ P + B ⊗ Q + … + D ⊗ S
Association graph edges (positive only) = potential edge mappings = w
= (B ⊗ C + B ⊗ D) ⊗ (Q ⊗ R + Q ⊗ S)
= (B ⊗ C ⊗ Q ⊗ R) + (B ⊗ C ⊗ Q ⊗ S) + (B ⊗ D ⊗ Q ⊗ R) + (B ⊗ D ⊗ Q ⊗ S)

VSA replicator equation circuit
Interpret all vectors as being of the form kV, where k is the magnitude/support for V (the unit magnitude direction)
Analog computing: V is the labelled wire, k is the voltage on the wire
Gayler (2009) Melbourne University presentation

Multiset intersection
∧ ≜ multiset intersection
A multiset is a set with a nonnegative magnitude of membership for each element (i.e. the magnitudes of the component vectors vary across components)
arg1 = aV1 + bV2 + cV3
arg2 = pV1 + qV2 + rV4
∧(arg1, arg2) = apV1 + bqV2
The elementwise multiplication of the magnitudes of the component vectors
This corresponds to the elementwise multiplication of support for the vertex mappings in the localist version
Won’t go into the implementation here

Evidence propagation
Association graph edges have the form: B ⊗ C ⊗ Q ⊗ R
and can be interpreted as mappings:
(B ⊗ C) ⊗ (Q ⊗ R) # interpret as mapping between edges in the graphs
(B ⊗ Q) ⊗ (C ⊗ R) # interpret as mapping between vertex mappings
Association graph edges applied as inference rules:
π = x ⊗ w
= (B ⊗ Q + …) ⊗ (B ⊗ C ⊗ Q ⊗ R + …)
= (vertex mappings) ⊗ (mappings between vertex mappings)
= (k(B ⊗ Q) + …) ⊗ ((B ⊗ Q) ⊗ (C ⊗ R) + …)
= k(C ⊗ R) + …
Interpret (B ⊗ Q) ⊗ (C ⊗ R) as the rule: “To the extent k that B ⊗ Q is supported as part of the solution, increase the support for C ⊗ R as part of the solution by k”

Settling of VSA replicator equation circuit
Gayler (2009) Melbourne University presentation

References / Reading
M. Blokpoel, T. Wareham, P. Haselager, and I. van Rooij (2018) Deep Analogical Inference as the Origin of Hypotheses. The Journal of Problem Solving
D. J. Chalmers, R. M. French, and D. R. Hofstadter (1992) High-level perception, representation, and analogy: A critique of artificial intelligence methodology. Journal of Experimental & Theoretical Artificial Intelligence
B. Emruli, R. W. Gayler, and F.
Sandin (2013) Analogical mapping and inference with binary spatter codes and sparse distributed memory. The 2013 International Joint Conference on Neural Networks (IJCNN)
B. Emruli and F. Sandin (2014) Analogical Mapping with Sparse Distributed Memory: A Simple Model that Learns to Generalize from Examples. Cognitive Computation
R. W. Gayler and S. D. Levy (2009) A Distributed Basis for Analogical Mapping. New Frontiers in Analogy Research, Proceedings of the Second International Conference on Analogy, ANALOGY-2009
R. W. Gayler and R. Wales (1998) Connections, Binding, Unification and Analogical Promiscuity. Advances In Analogy Research: Integration Of Theory And Data From The Cognitive, Computational, And Neural Sciences
H. Gust, U. Krumnack, K.-U. Kühnberger, and A. Schwering (2008) Analogical Reasoning: A Core of Cognition. KI - Künstliche Intelligenz
D. R. Hofstadter (2006) Analogy as the core of cognition. Stanford Presidential Lecture
P. Kanerva (2000) Large Patterns Make Great Symbols: An Example of Learning from Example. Hybrid Neural Systems
P. Kanerva (2010) What We Mean when We Say “What’s the Dollar of Mexico?”: Prototypes and Mapping in Concept Space. Quantum Informatics 2010: AAAI-Fall 2010 Symposium on Quantum Informatics for Cognitive, Social, and Semantic Processes
S. D. Levy and R. W. Gayler (2009) “Lateral inhibition” in a fully distributed connectionist architecture. In Proceedings of the Ninth International Conference on Cognitive Modeling (ICCM 2009)
T. Mikolov, W. Yih, and G. Zweig (2013) Linguistic Regularities in Continuous Space Word Representations. Proceedings of NAACL-HLT 2013
T. A. Plate (1994) Distributed Representations and Nested Compositional Structure. PhD Thesis. University of Toronto
T. A. Plate (2000) Analogy retrieval and processing with distributed vector representations. Expert Systems
A. Rogers, A. Drozd, and B. Li (2017) The (too Many) Problems of Analogical Reasoning with Word Vectors. Proceedings of the 6th Joint Conference on Lexical and Computational Semantics (*SEM 2017)

DOI: 10.5281/zenodo.5552219
This presentation is licensed under a Creative Commons Attribution 4.0 International License
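
As a worked illustration of the “Dollar of Mexico” mapping from the Kanerva (2010) slide, here is a minimal sketch. It assumes MAP-style random bipolar vectors with elementwise multiplication as the commutative, self-inverse binding. The dimensionality, the seed, the dictionary V, and the dot-product clean-up over the filler vectors are choices made for this sketch and are not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 10_000  # dimensionality (assumed; large enough that noise terms stay small)

names = ["name", "capital", "currency",   # roles
         "USA", "WDC", "USD",             # United States fillers
         "MEX", "MXC", "MXN"]             # Mexico fillers
V = {n: rng.choice(np.array([-1, 1]), size=d) for n in names}

# Country records: sums of role-filler bindings
ustates = V["name"] * V["USA"] + V["capital"] * V["WDC"] + V["currency"] * V["USD"]
mexico = V["name"] * V["MEX"] + V["capital"] * V["MXC"] + V["currency"] * V["MXN"]

# Binding the two records cancels the shared roles (role * role = 1),
# leaving USA*MEX + WDC*MXC + USD*MXN plus cross-term noise
mapping = ustates * mexico

# "What is the Dollar of Mexico?": apply the mapping to USD
query = mapping * V["USD"]

# Clean-up: compare the noisy result with the known filler vectors
fillers = ["USA", "WDC", "USD", "MEX", "MXC", "MXN"]
sims = {n: int(query @ V[n]) / d for n in fillers}
answer = max(sims, key=sims.get)
print(answer, sims)
```

The similarity for the intended filler is close to 1 and the others are close to 0, because every noise term is a product of unrelated random vectors and is therefore quasi-orthogonal to each filler.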