A Mathematical Theory of Generalization: Part I: David H. Wolpert

Download as pdf or txt
Download as pdf or txt
You are on page 1of 50

Com plex System s 4 (1990) 151-200

A Mathematical Theory of Generalization: Part I

David H. Wolpert
Th eoretical Division, Center for Nonlinear Studies,
Los Alamos National Lab ora tory, Los A lamos , NM, 87545, USA

Abstract. The problem of how b est to generalize fro m a given learn-


ing set of inpu t-output exam ples is cent ral to the fields of neural nets,
statistics, approximation theor y, and ar tificial in tellige nce. T his series
of pap ers inv estigates this pr oble m from wit hin an abstract and model-
indep endent fr am ework and then tests some of t he resulting conc epts
in real- world sit uations. In t his abs tract fram ewor k a gene ralizer is
completely specifie d by a certain countab ly infi nite set of fun ctions ,
so the m at hematics of generali zat ion b ecomes an investigation in to
candida te set s of cri teria governing the behavior of that infinite set
of fun ctions. In the first pap er of thi s series, t he foundations of this
mathematics are spelled out an d some relatively simple generaliza-
tion criteri a are investigated. Elsewhere t he re al-wor ld gener alizing of
sys tems construct ed with these generalization crite ria in mind h ave
been favorably comp ared t o neural nets for several real generali zation
pr oblems , including Sejnowski's problem of reading aloud . T his lead s
to the con clusion that (current) ne ural net s in fact cons titute a poor
means of gen erali zing. In th e second of t his p air of papers, ot her sets
of crit eri a, mor e sophistic at ed than those crit eri a emb odie d in thi s first
series of paper s, are inv esti gat ed. Gen erali zers meeting these mo re so-
phisticated criteria can readil y be approximated on computers. Some
of th ese approxi mati ons em ploy net work structures built via an evo-
lutionar y pr ocess. A pr eliminary and favorable in vestigati on int o the
generali zation behavior of these approximations fini she s the secon d
pap er of thi s series .

Outline of these papers


In section 1 of this p ap er the topi c of ge neralizat ion is dis cu ssed fr om a
ve ry b road per spective . It is argued th at it is t he ir ab ility t o generalize that
consti t u t es t he p r imary reason for curre nt interest in n eural ne ts (even t houg h
suc h n eural nets in fact gener a lize poorly on average, as is demonstrat ed
in [1- 3]) . This se ction goes on to discuss the b en efits that wo uld come fro m
h aving a particularly go od ge neraliz ing algorithm. Sectio n 1 t hen en ds wi th a

@ 1990 Complex Systems Pub lications, Inc.


152 David H. Wolpert

detailed out line of the rest of these pap ers , presented in t erms of the pr eceding
discussion of generalization .
Here and t hroughou t these pap ers, gener alization is assumed to be t ak ing
pl ace wit hout any knowledg e of what the variables involved "r eally mean. "
An abs tract, model-independent formalism is the most rigorous way to deal
wit h this kind of gen eralizing.
Section 2 of this paper begins with a mathematically precise defin ition of
gen er alizers. It t hen goes on to exp lore some of the more basic properties
that can b e required of generalizers and elucidat es som e of the more straight-
for ward m athematical conse quence s of such requirements. Some of these
conse que nces (e.g., no lin ear mathematical model should be used to gen eralize
whe n t he underl ying syste m being mo delled is known to be geometric in
nature) are not intuitively obvious .
The par adigm here and throughout these papers is to make gen eral, broad
requ iremen ts for gen erali zing behavior , and then see what (if any) mathemat-
ical solut ions there are for such requirements . This contrasts to the usual
(extrem ely ad hoc) way of dealing with generalizers, which is to take a con -
crete generalizer and investigate its behavior for assorted tes t probl ems. In
t hese pap ers , the be havior defines t he architecture, not the ot her way aro und.
The other way of trying to build an eng ine which exh ibits "good" gene ral-
ization is, ultimately, dep end ent to a degree on sheer luck.
Sections 1 and 2 of the second paper present and explore some sets of
restri cti on s on generalization behavior which ar e more sophist ica ted than
those found in section 2 of the first pap er. The first of these new restric-
tions , exp lored at length in section 1 of t he second pap er , is th e restriction of
"self-guess ing" in its various formulations. Intuitively, self-guessing refers to
the requirement that if taught with a subset of the full t raining set , the gen -
eralizer should correctly guess t he rest of t he training set . One of the mo re
interesti ng results concern ing self-guessing is th at it is impossible to con-
struct generalization crit eri a which, along wit h self-guessing, specify unique
gen er alization of a learn ing set . (Any particular set of crite ria will always be
eithe r under-restrictive or over-restricti ve.)
Sectio n 2 of the second pap er then discusses the restriction of inform ati on
compactification , which can be viewed as a m athematically precise st at em ent
of Occam 's razor. Particul ar at t ention is drawn t o the fac t (and it s conse-
quen ces) that at present t here is no kno wn way of making an a priori most
reason abl e definition of information measure in a model-independent way.
Finally, section 3 of th e second paper is a partial investigation of t he
real-worl d efficacy of the tools elaborated in the first paper and in the first
two sections of the second paper. References [1- 3J consist of other such in-
vestigations and show that these techniques far outperform backpropagation
[4, 5J and in particular easily beat NETtaik [6J. T he t ests and investigations
pr esented in section 3 of this second paper are intended to be an extension
and ela bo ration of these t ests presented in [1-3 ].
A Math ematical Th eory of Generalization : Part I 153

"T his then is t he measure of a man - that from the parti cu-
lars he can discern the pattern in t he greater whole ." [U. Merre,
from Studies on the Na ture of ManJ

1. Background

Neural nets is a field that has recently started to at t ract int erest from man y
sep ar at e discipl ines, including phys ics (esp ecially condense d mat t er ph ysics) ,
m at hem atics, computer scien ce, neurobiology, and cognitive science . In ad-
dition to the intrinsic mathematical interest in investigating net works of non-
linear mappers , there are several other reason s why neural nets ar e pro ving
so popular . In this section som e of these reasons are discussed , with par tic-
ular attention paid to the reason of ability to generalize . The us efuln ess of
this ab ilit y is discussed, and then an outline of this series of paper s, which
constitute an investiga tion into the problem of generalization , is pr esent ed
from t he p erspective of the discussion of the impor tance of t he ab ility to
gen eralize.

1.1 Reasons for interest in neural nets


Historically neural nets were first investig at ed as a m eans of modeling net-
works of neurons (hence the name) . By int egrating models of neurons with
models ofthe connecti viti es in the human br ain , it was hoped th at something
of how brains function would be un covered [7-10J . Although this ap proach
is so compe lling that it is hard t o im agine a m ature neurobiology not mak -
ing us e of it , to dat e neural nets have not m ad e any novel and subst an ti ve
pr ediction s concern ing br ain fun ct ion whi ch were lat er corr oborated in the
laboratory [I1J (although it can be argued that som e aspec ts of neuromor-
phology, especially in t he hippocampu s, ar e eas iest to understand in te rms of
sim ple Hebbian neural nets [12]) . This lack of m ajor results is probabl y du e
t o the current extre me lack of knowl edge of ju st exact ly how ne ur ons ope rate
and are conne cte d [12 , 13J. Given how easy it is t o render a human br ain
inoperative, for example by slightly varying the amounts of various neuro-
tran smitters [14J, perhaps it should not be sur prising that our inability to
model a human br ain has accompanied our lacking a detailed understanding
of it s operation all t he way down to th e molecu la r level.
After investigating ne urobiology, perhaps the next most obvious use of
neural nets is in the field of artificial intelligence (AI). Simply put, sinc e brains
are intelligent, it is hoped that by modeling them one could bui ld an artificial
intelligen ce. Unfort unately, insofar as neural nets are unabl e to substantially
aid t he field of neurobiology, t hey also ar e un able to substantially aid the field
of AI. Unmotivated hop es of "collective emergent intelligenc e," oft en heard
in connection with the field of neural net s [4], seem to be wishful thinking,
at least at present. No novel principles of how to design intelligent sys te ms
(wh atever that means) have yet b een discovered through investigati ons of
neural net s. Yet , as pointed out by Cla rk [15], it is pr ecisely such novel
154 David H. Wolpert

principl es which must be discovered if neural nets are to make any significant
addit ion to the field of AI.
Cla rk's search for such principles, based on statistical mechanics, touches
on the t hird usual motivation for investigating neural nets, their similarity
to spin glasse s. This is the tradi t ion al entry point into the field for physicists
[16, 17]. Unfortunately, alt hough mu ch h as been learned of how to model
neural net s as spin glass es, no new principles conc erni ng intelligence have
emerge d from this work to dat e.
In addit ion to these three broad reasons for interest in neural nets, there
are several features of neural ne ts whi ch , although of more minor breadth ,
do constitu t e concrete results and not simply motivational generalities. The
first an d perhaps most famous of these grew out of the spin-glass approach
to the field . This is Hopfield 's [18] (and subsequent researchers ' [19, 20))
idea of using neural nets as a means of imp lementing associ at ive memories
in parallel. (See appendix A for a review of how Hopfield 's schem e works .)
Although Hop field's associative memory scheme do es indeed work, it sho uld
be noted that it does not provide any nove l insight s into how to construct
an artificial intelligence. What it does provide is a poss ible means of making
particular ly fas t associa tiv e memories. It constitutes a potential spee d-u p
of something we can already do (build associ ative memories ), but no t an
ent irely nove l skill .
Another contribut ion by Hopfield is t he realization that his associative
memory neu ral nets can be used as a means of finding approximate solutions
to optimization problems quadrati c in t heir arg uments [21]. Recently, the
reported efficacy of this techn ique has come un der at t ack [22]. However ,
even if the attacks are mistaken and the technique ends up proving usefu l, as
with neural net associative memories it will only constitute a potential speed-
up of some thing we can alr eady do (approximate solutions to optimization
pro blems), not an entirely novel skill.
There is one last application of neural nets which, in addition to be -
ing mor e than just a sp eeding up of something we can already im plement
through other means, is act ua lly exhibited by neural nets th at are up and
running right now . This application is the abi lity to gen eralize exhibited by
Boltzmann machines [4, 23], perceptrons [24], and in general all feedforward
neural nets made via backpropagation [4, 5], simu lated annealing [25], or
any other technique which does not try to restrict t he net 's output to the set
of a tt ractors typical of Hopfield-type neural nets (and, ind eed, typica l of all
asso ciative memories ). (See appendix B for a re view of Boltzmann machines
and t he tec hn ique of back-propagation.) T hat is, it is t he ability of such
neur al net s to be t au gh t wit h a learni ng set of input- outpu t pairs , be fed a
novel in pu t as a qu est ion , an d t hen to form a (hopefull y) reason abl e response
to that qu estion, a response whi ch might differ from all of the output s in the
learning set of input-out pu t pairs [26].1 It seems t hat in this ab ility to gen-

lThese nets have t he advantage over Hopfield-t ype associative memories in that th ey
can respond to a query with an output not contained in those which make up the learni ng
set . W it hout such an ability you are not fully generalizing . You are only classifying . In a
A Mat hematical T heory of Generalization: Part I 155

eralize ne ural nets can in fact do som ething t hat few other current systems
can do , or at least wh ich few other current systems can do as well wit h as
broad a range of applicability. Since it is the most widely studied neural net
generalize r, it is the techn iqu e of b ackpropagation that is taken to b e the
archetypical ne ural net generalizer for the purposes of this seri es of p ap er s.
It is the hope of the neural net community that this (or some other) neural
net scheme somehow picks out the important information from the learning
set whe n gene ralizing an d therefore gen eralizes well (see referen ce [4]).

1.2 Generalization

A gene ralizing syst em has very many potential uses, t he most obvious of
which lie in the field of statistics. In dee d, as h as also been po inted out by
Wolp ert [I , 2], Lap ed es an d Fa rber [27], and by Smi th [28], generalizing
neural nets can b e viewed as simply a novel kind of statistica l curve fitting .
Clearly, a solution to the problem of how "best" t o gen erali ze fro m a learning
set wh en no a priori assumptions are made about the typ e of fun cti on do-
in g the gene ralizing would have m any ramifi cati ons for the (n on p ara metric)
stat istics problem of ho w b est to generalize (sic) from ex perimental data to
t he best model to fit the data.
In ad dit ion t o t his po tential application t o statistics (and t he refore all
experime ntal science), a solution to the prob lem of how best to generalize
from a learning set might be he lpful at the t ask of image reconstruction (t he
"learning set" her e being incomplete inform ati on about the full im age) . It
might also be helpful for approximation theory an d t herefore for numerical
analysis. Indeed , at the risk of be ing overspeculative , a successful solution
to the problem of generalization cou ld even be he lpful in the broadest sci-
entific task of all, the task of inferring the "b est " theory to explain a set of
observa tions.?
In addit ion to these ap plications, the ability to gene ralize well also has
m any potential uses in the field of AI. To un der st an d one such use , first
note (as is pointed out in [1] and [2]) t hat in an abstract sense the goal of
AI research is t o create a system which meets certain sets of "human-like"
crite ria . In ge neral, t he broader t he criteria a syst em satisfies, t he better.
When t hese criteria are specified very prec isely - "given a, respond with
b" - then the system created to meet t hem con stitutes a database. As a
next st ep it is straightforward t o in cor porate into the system an algorithm
for m ak in g logical inferences from the provid ed crit eria , resulting in wh at is
essentia lly an expert system. In sofar as dat abases , b eing a set of responses
to queri es, ca n be viewed as in put-output fun ctions over an ap propria te
sp ace, such expert sy ste ms are j ust extensions of the domain of t he or iginal

certain sense , a Hopfield-type assoc iat ive memory is noth ing more t han a crude, par allel,
Bayesian classifier, since its "bas ins of att raction" are nothing mor e th an th e set of vectors
strongly correlat ed wit h one another .
2See reference [29] for an interest ing attempt at solving t his last task which might have
some app lications to the prob lem of generalization .
156 David H. Wolp ert

database. The logical inferen ce algorithm used has "broadene d" the (human-
like) criteri a defining the prop erties of the syst em .
Unfortunately, such logical inference algorithm s can not exte nd the do-
main of the system to include the whol e of the inp u t space. A purely deduc-
ti ve algorit hm can not solve the problem of inductive inferenc e [30]. This is
not a trivial shortcoming. Inso far as it is impossible t o explicitly list all of
th e beh avioral attributes of hu man int elligence, it seems that any ar t ificial
syst em that will be abl e to fully mimic human beh avior will have to extrapo-
late th at full behavior from a cor e su bse t of (ext ernally provid ed) behavioral
attributes. It would therefore seem that any AI syst em t hat can suc cessfu lly
pass the Turing t est will have to be able t o form an extension of an initial
set of behavior criteria . It will have to be able to generalize from the or iginal
criteria."
To understand th e second reason that generalization is cruci al to AI , first
define a system's gene ral inte lligence as a measure of how well it performs at
an arbitrary cognitive t ask for which it is provided limited inform at ion , wh ere
eithe r the task and/or the information is novel. The syst em is explicitly t old
b eforehand how its performan ce is being measured . In an academic set ting ,
the limited information might be a chapter in a mat h book, t he novel task
might be a problem set bas ed on that chap ter , and the p erformance measure
might be the percentage of problems from the problem set done correctly.
At a more primitive level, the limited information m ight be a novel visu al
image, t he t ask might be to pick out any cre at ures which might be pr ed ators
from t hat image, and the p erformance measure might be whether the system
gets eaten or not . Alternatively, the limited information mig ht contain a set
of images with all pr edators in them pointed out , as well as the novel visual
image. The more generally intelligent a sys t em, t he larger the numbe r of
(task, inform at ion ) pairs at which it pe rform s well and the better it performs
at them (see appendix C for a mor e mathem at ically pr ecise formu lation of
this definition of intelligence) .
Now define gen eralization as the ab ilit y of a syst em to t ake a learn ing
set of input-output vector pairs from a particular pair of input and output
spaces, and, based only that learning set , make a "good" guess as to the
outp ut vector corresponding to an input vector not contained in th e learni ng
set. The guessing is exp licitly ind ep endent of any information not cont ain ed
in the learn ing set , such as what the input and output spaces "really mean ."
Loose ly speaking , generalization is th e ability to perceive and extrapola te
pattern s in a learn ing set. It is clea r then that in addition to traits like
facility with logic and the ab ility t o reason by an alogy with other, successfully
completed tasks , the ability to generalize well from limited information (i.e. ,

3Note t hat as opposed to a conventional "inside-out " system in which the internal
structure is externally (and overtly) prov ided, such a successful mimic of hum an behavior
would be "outside-in" in th e sense of only having beh avioral crit eria extern ally prov ided,
wit h t he intern al structure determined implic itly as a means of meet ing those crit eria
(along with th eir extrapolation) .
A Mathematical The ory of Generalization: Part I 157

from a lim it ed learning set ) is a prerequisite for any syste m to be intelligent


over a broad range of tasks."
As a final example of th e wide-ranging utility of be ing able to generalize
well, note that if you could construct a generalizer which , for whate ver reason ,
you t hou ght was optimal for a particular collection of learni ng sets , then
by us ing that generalizer you could defin e a meas ur e of randomness (and
therefore you would be able t o decide which of your learning sets was least
"random" ). The way to measure this gen eralizer-based randomness of a
learn ing set is to t each the gener alizer using var ious subsets of the learning
set and t hen measure the generalization err or on t he res t of the learning set .
T he more random a learning set is with resp ect to the given generalizer, the
mo re slowly the average generalization error rate sho uld fall off as t he size
of t he subset bein g used t o teach the generalizer increases. If the gen er alizer
can discern very little new about the learning set as a whol e as it is given
larger and lar ger samples of that learni ng set, then as far as that gen eralizer
is concerned the learn ing set is effectively "random."
The ob jection h as sometimes been made that however you defin e "good"
generalization you can nev er be assured that a "good" generalizer will pro-
duce the correct gen eralization for an arbitrar y real-world guessing sit uation
[31]. However, not e that all real int elligences (i.e., people) gen erali ze all the
time, in t he broadest sense possible, and that such generalizing seems to
constitute an essential part of their intelligence. Although in fact people
pe rform such "intelligent guessing" surprisingly well, such guessing carries
no ass urance of b eing corr ect. Rather , it is used as a strategy for dealing
wit h an ex ternal environment. Try ing to find rules for generalizing well in a
series of generalizing situations is similar to using a particular game th eory
strategy (1ike minimax say) for a series of different games . No such strat-
egy is guaranteed to give you best resu lts in all games, so you try to find
a strategy which will give as good resu lts as poss ible over a wide variety of
games. Simila rly, the problem of finding a good generalization scheme is to
find a scheme which gives good results over a wide variety of gen eralizing
sit uations .
W hile a lot of work has been done that relates to the problem of gen er-
alization, none has bee n done that directly ad dr esses the issu e its elf:

1. App roximation theory an d reg ular ization t heory [32, 33] can be viewed
as attempts to deal with the problem of generalization when one knows
what the inp ut and output spaces "really mean."
2. The field of machine learn ing [34- 36] consists to a large degree of var ious
schemes for how to gen eralize in certain strictly limi ted cont exts (e.g.,
having Boo lean valued variables) .
3. Bayesian classifica tion [37], and in general all of the work on class ifica-
ti on in image p rocessing [38], is similar to t he problem of gen eraliza tion,
4Note tha t the pro blem of "making a best guess for Imiss> given I p roy and J," discussed
in ap pe ndix C, is a generalization problem.
158 David H. Wolpert

the major distinction between the two being that in classification the
set of possible answ ers to questions is strictly controlled and finit e (in-
deed , oft en hav ing only two elements).
4. Time-series an alysis and prediction theory, especially as extended to
chaotic systems by Farmer [39] (and consequently, using backpropa-
gated neural nets, by Lapedes and Farber [27, 40]), can be viewed as
an ad 110C approach to a limited version of the full problem of general-
ization.
5. Information complexity theory [41J can be viewed as dealing with a
generalization problem modified by adding to the condition t hat they
have to reproduce the learning set some extra, rather stringent pro blem-
dependent conditions on t he allowable generalizations.
6. The work with local languages [42], with grammatical inference [43],
with mathematical criteria for associaters [44], and with finding the
min imal high-level language set of statements to reproduce a learning
set [45] can also all be viewed as dealing with problems related to that
of generalization, but not with the full problem itself.

Not dealing directly with the full problem of generalization, these ap-
proaches do not provide any set of tractable mathematical criteria for good
generalization . Accordingly, the criterion in common use for judging how well
an algorit hm gene ralizes is to simp ly test the performance of the algori t hm
on lot s of learn ing sets created with a "correct" generalization in mind. T he
accuracy of the algor ithm in guessing this "correct" generalization from t he
learni ng set is taken to be a measure of how well it generalizes. (Alt hough the
field of machine learning can make use of more ob jective criteria in measuring
generalization efficacy, since it is restricted to situations in which t he num-
ber of possible input-output mappings is finite, its applicability to real-world
problems is limit ed .) Clearly, this is an unsatisfactory state of affairs, having
littl e if any potential to meet the promise of a full theory of generaliza tion.

1.3 Synopsi s of the rest of these papers


It is due to these inadequa cies in conventional approaches to the problem
of generalization that using neural nets to generalize has exc ited so much
interest . Unfortunately, there does not currently exist any understanding of
which feat ures of neural nets are helpful for generalizing and which are not .
Nonetheless, of all the motivating causes for interest in neural nets list ed in
section 1.1, it is t he abi lity of such nets to generalize which seems to be the
most substantive and which seems to have t he mo st potential.
It was with the idea of exp loring generalization as a more abstract, math-
ematical problem and (hopefully) t hereby improving upon t he performance
of neural nets t hat t he generalization theory of this series of papers was de-
veloped. Generalization theory is an abst ract mathematics of generalizers,
various reasonable criteria that can be required of them, and t he consequence s
A Mathematical Th eory of Gen eralization: Part I 159

of those criteria. Some examp les of such crite ria are invari an ce under vari-
ous kinds of coordinate tran sformations of the input and/or output spaces,
having the gene r aliza tion of part of the learn ing set guess the rest of it (self-
gue ssing) , and information compactificat ion (i.e. , Occ am 's razor). All the
generalizers commonly used at pr esent can be categori zed in terms of which
of t hese and other similar crite ria they meet . Section 2 of this first paper and
sections 1 and 2 of the second paper in this series of papers describe gen er al-
ization theory in more detail, sp elling out the categories created by various
set s of criteria and investigating t he problem of finding a set of criteria which
specify a unique generalization of any learning set .
To qu antify the advantages of neural nets as gen eralizers , it is important
to develop a benchmark for measuring the efficacy of a sys tem's generalizing
from a learning set. Without use of such a benchmark , any claims for neural
nets having "emergent and adaptive intelligen ce" (beyond that which follows
simply from their ability to form associative memories) are vague at best, and
mis leading at worst. Such a benchmark should be very easy to construct,
flexib le, qui ck, and possess simple to an alyz e generalization behavior. (It
is important to realize that the unanalyzability of neural net s is not reason
to b elieve they generalize well, bu t rather is just an imp ed iment to seeing
pr ecisel y how they go abo ut trying to gen eralize.) The gen er alization t heory
crit eria discuss ed in the first pap er whi ch mo st naturally lead to such a ben ch-
m ark are those defining a HERBIE (HEuRisti c BInar y E ngine) . In essence,
a HERBIE ben chmark gener alizer is a simple surface-fitter which creates an
input-output surface which goes through all the point s of the learning set
and has the whol e of the inputs space as its domain. A pr ecise definition of
a certain kind of benchmark HERBIE (a hyperplanar HERBIE) , exploration
of its be havior, and use of it to benchmark the generalization of neural nets
makes up reference [1]. As is discussed in [1- 3J, the hyp erp lanar and other
kinds of HERBIEs have been found to be very helpfu l in ben chmarking gen-
eralization, and actually beat backpropagated neural nets rather decisively
in the t ests p erformed so far. "
Despite it s usefulness as a ben chmark for measuring gen eralization , how-
ever, there is no reason to believe this HERBIE and the crite ria definin g
it will be the best generalizer for an arbitrar y learning set. It is not t he
definitive answer t o the qu estion , "If I have a given learning set and sever al
different algorithms which can generalize from it , whi ch one do I choose?"
Some of the other generalization criteria which const itute a possible answer
to this question ar e those making up self-guessing and information compacti-
fication, which ar e presented in the second pap er. Discussion of some systems
approximately ob eying self-gu essing and information compactification, along
with examples of some t ests of su ch systems make up section 3 of the second
pap er. In addition to utilizing evolut ionary (as opposed to neuro-) biology,
such systems tend t o make use of parallel distributed network structures. In

S"Beats" here meaning what it does in the pr esent-day literature: HERBIE does better
at guessing the "correct" input-output mapping th e researcher had in mind when making
up the learning set.
160 David H. Wolpert

this sense, this series of papers comes full circle, concluding with an explo-
ration of what can be thought of as very sophisticated, feedback, variable
number of iteration, neural nets.

2. Introduction to generalization theory


As mentioned in section 1, a generalizer is any program which tries to
"broaden the criteria" defining a system's behavior by extrapolating from
them and making guesses for appropriate behavior when faced with any sit-
uation, whether or not it is contained in these original criteria fed to the
system. These extrapolations are often "human-like," in that they are not
necessarily made according to the rules of logic. Their being nonlogical ex-
trapolators of the database of the defining criteria allows such generalizers
to make a guess in situations where such a guess is not decidable by logic
alone - the who le of the input space is in their domain. In this section, the
foundations of a mathematical framework for exploring such gene ralizers will
be presented and explored. In particular, various sets of criteria which can
be required of generalizers will be investigated from within this framework.
For the purposes of this invest igat ion , all broadenings of criteria can be
assumed to first transform the space of the criteria into an m +n dimensional
(usually) Euclidean vector space, where m of the dimensions span the input
space, and n of the dimensions span the output space. Such a transformation
can always by done - if need be, every distinct variable in the database can
be assigned its own dimension, and every distinct state of the variable can be
assigned an arbitrary number . More rigorously, by Church's hypothesis any
criteria computable by humans is computable by a Turing machine [46], and
therefore can be digitally encoded - this encoding can serve as the desired
transformation. In this way, every piece of information in the data base is
mapped to a data vector in the vector space, producing a numerical learni ng
set. In general, since the vector space is supposed to be an input-output
space, it's required that every m -dimensional image of an inp ut component of
an element of the data base uniquely specifies the associated n-dimensional
image of the output component of that data base element . All of t hese
considerations also apply to encoding criteria for use by neural nets.
Define the n separated systems as the n vector spaces formed by taking
the Cartesian products of the m -dimensional input space with each of the n
distinct output dimensions. The set of all the resultant data vectors for one
of the separated systems is called the learning set or sometimes the training
set.
The essence of any generalizer, even a neural net generalizer, is to take
each separated system and fit a surface with an infinite domain (i.e., extend-
ing over the who le of the m -dimensional input space) to the data vectors of
that system. More precisely, given an output value for each of m -dimensional
input vectors, i.e., given k m + l-dimensional vectors, a generalizer creates
a surface in the m + l -dimensional space which passes through all k of the
m + I -dimensional vectors, and which has an output value for any and all
A Mathematical Theory of Generalization: Part I 161

ot he r inpu t vectors - the domain of the mapping delineated by t he surface


is the entire input space. To ask a generalizer a question, one takes the
appropriate input vector and reads off the output value on the surface gen-
erated by t he surface-fitting algorithm. The surface's passing throug h t he
n original vectors makes the generalizer a superset of the database (for any
one of the original input vectors it returns the correct output value ), and
the surface's having an m + lth dimensional value for any input coordinates
constitutes the broadening of the orig inal criteria. Different generalizers use
different surface-fitting algo rithms . In ad dition to choosing between different
generalizers, in general there will be an infinite number of transformations
from the data base to n m + 1 dimensional separated systems . Usually it is
best to pick such a transformation with as small an m as possible, to channel
t he info rmation contained in t he data to the de tailed shape of the surface,
which is where we want it.
From an abs tract poi nt of view, the generalization problem then is to first
create broa dly applicable mathematical criteria for whet her or not a given
generalizer performs well the generalization of a given learning set living in
a particular separated system (these criteria for good generalization sho uld
no t be confused wit h the criteria making up the learn ing set). Once th is is
done, a procedure must be created to take as input any learning set an d from
it build a generalizer which generalizes in at least approximate accordance
wit h the generalization criteria, for that learning set . To date, sever al sets of
seemingly reasonable generalization criteria have been exp lored . Although
some such sets have turned out to be over-restrictive (cert ain learning set s
cannot be generalized at all), most are under-restri ctive (t he generalization
of a learning set is not unique) and therefore, ins tead of sp ecifying a unique
gene ralization for any learning set , serve to categorize equivalence sets of
such gene ralizations. In addition to this theoretical work, computer programs
des igne d to approximate some of these (under-restrictive) crite ria sets have
been tes ted. As explained in section 3 of the second paper of t his ser ies,
these programs have usually proven to be good generalizers in such tests, in
the sense t hat they do well at the task of guess ing the parent fu nct ion the
researcher has in mind and from which the elements of t he learning set were
constructed. Such res ults serve as (partial) real-world corroboration of the
reasonableness of these criteria sets.
All of the more sophisticated sets of generalization criteria investigated
so far have made use of the criteria of self-guessing and/or information com-
pactification. Information compactification, which is discussed in sect ion 2
of the second paper, has proven to be under-restrictive. Depending on the
criteria that go wit h it, self-guessing, discussed in section 1 of the second pa-
per, has t urned out to be either over- or under-restrictive. For the fut ur e, it
is hoped t hat some com bination of the two will be discovered whic h is a fully
satisfactory generalizer (i.e ., which specifies a unique generalization of any
learn ing set) . Before discussing these sophisticated kinds of cri te ria, t ho ugh,
it is first necessary to formulate a rigorous definition of t he generalization
from a learning set (of arbitrary cardinality) to a mapping from questions
162 David H. Wolpert

to guesses, and then to investigate some of the simp ler sets of generalization
criter ia . The rest of the current section constitutes such an investi gati on .

2.1 Defin ition of a generalizer


A gene ralizer is a set of cont inuous map pings from learning sets of arbi trary
car dinalit y along with a single question, to an output. More precisely,

(2 .1) An m-dimensional generalizer is a count ab ly infinite set of continuo us


functions fro m a subset of (Rm x R x R '" ) to R, from a subset of
(R'" x R x R'" x R x R m) t o R, etc ., wh ere R denot es t he space of
real nu mb ers and R'" denot es the Carte sian product of m such spaces.
Notationally, an m-dimen sional gener alizer is a set of cont inuous fun c-
tions g{ i} along wit h associa te d domains of definition , i being a nonzero
nat ur al number, and g{i} being from R i(m+l)+m t o R.

T he 9 {i } are defined for all of R i(m+l)+m, up to explicit holes in the space


discusse d be low. The last R '" comp onent of the input to any g{i} is referred
to as the question. The rest of t he com po nents of the inp ut constit u te the
learning set . (Sometimes, as ab ove in the discussion of separat ed systems and
as below in the discussion of expans ions of a learning set , the term "learning
set" will be taken t o mean simp ly t he set of all input- output pairs provided
to t he resear cher , as op posed to th e set pro vided to a particular g{i} . The
context shou ld m ake it clear which defini t ion is bei ng used .) Each R '" x R
input- ou tpu t pair of the learning set is called a datum space. T he numbe r
of datum spaces is the order of the learni ng set. As explain ed above, an y
qu estion in generalization theory can be cast su ch that the image of the
gu essing fun ction is one dimensional. This is why the mapping is taken to
betoR1 .
If desir ed , decom po sition and expans ion of generalizers can be performed
quite eas ily. On e natural way of doing t his is to use a pseudo inner pr oduct ,
m apping t o a count abl y infinite set of real numbers , rather than the usu al
inner product m apping to a single real number. The pseudo inner p roduct
between any t wo gener alizers G and H, (GIH) , is defined as the set of all of
th e (g{i} Ih {i} ), where the 9 {i} are the set of functions making up G and the
h{i} are the set of functions m aking up H . The inner product between the
g{i} and t he h{i} can b e defined as any of the usual inner products amo ngst
(L 2 ) real functions (see appendix D).
Any g{i } can be ap proximate d by a Turing Machine (T M), of course.
W hat is more, any T M can be uniquely specified by a pair of two-dimen sional
g{ i} . Each of the two g{i} t akes a real number and an int eger as its qu est ion :
the current state of the tape an d the head's position on it is given by the
real number , and the internal st ate of the TM is given by the integer. One
of t he g{ i}'s has its output interpreted as the new t ape and head po siti on
an d the other on e has its ou tput int er pr et ed as the new int ern al state of t he
m achine. To run a TM, you j ust feed the outputs of t his pair of g{i}'s back
into themselves as inputs , t er minat ing when the integer ou tput corresponds
A Mathem atical The ory of Generalization: Part I 163

to the hal t ing internal state of the TM.6 The number of pos sible internal
st at es of a given TM sets i, and t he ru le table for the TM sets both t he
learni ng set an d how t he g{i } generalize from t he learning set.
In addit ion to this relation wit h TMs, any scheme for bui lding neural
net s from learning sets can be viewed as a generalizer, though for some such
schemes the req uirement of the cont inuity of the g{i} has to be relaxed . Since
any g{i } can be app roximated by an appropriate gen eralizer, it is also true
t hat any g{i} can be approximated by a neural net . Similar comments hold
for genetic algorit hms and classifier systems [47]. A final, and perhaps most
imp ortan t example of a generalizer is an y statist ical algo rithm for finding a
cur ve which has in some sense the least var iation from a set of data po ints .
In t his case, the learni ng set simply consis ts of the data points.

2.2 Basic generalization criteria


Beyond t hose implicit in the simple definition (2.1), there are several other
restrict ion s which immediately come to mind when t rying to formulate crite-
ria for generaliza t ion . T he first of these is t hat the or der in which the datum
spaces are pr esented sho uld be irr elevant :

(2 .2) Every g{i } is invariant under permu t at ion of the datum spaces.

T his invari ance , being a simple rela beling, is on surer ground than m any
of those which will be descr ibed lat er. A second restrict ion is that the g{ i }
mu st form a dat abase:

(2. 3) If, for any g{i }, the value of the question is t he same as t he valu e of
an R m ent ry in one of the dat um spaces, t he output of t he function is
the corresponding R ent ry from that datum space.

For example, g{2 }(x,y,x',y' ,X) = y for all x ,x',y , and y'. One can
investi gat e generaliza tion schemes in which t he data in the learni ng set is
su sp ect and therefor e should not necessarily be followed , but they will not
be con sidered here. For single-valuedness, (2.3) necess itates t he following
re stricti on on t he domains of the 9 {i }:

(2.4) For any g{i} , if a ny two datum spaces have t he same value for their
R'" ent ries, t hen they must have t he same values for their R entries.

6From this pe rspec tive a TM is simp ly a funct ion taking a two-dimens ional inpu t vector
t o a two-dimensional outpu t vector wh ich is t hen fed bac k into the input. One component
of t he output det ermi nes whe n the process stops , and t he other determines t he out pu t
valu e of th e TM as a who le. Insofar as it is ju st such an iterated map, it is interesting
to note t hat a TM has the p otential to exhibit what is in some sens e t he richest possible
beh avior , na mely chao s. In the sa me spirit, it is interesting to note t hat the Halt ing
Pr obl em in comp uter science seems to parallel t he nonlinear science problem of finding
when a ch aot ic or bit alights wit hin a certain region of t he appropriate state space .
164 David II. Wolpert

For example, g{2}(x , y, x, z,w) where z does not equal y is outside the
dom ain of definit ion for t he generalizer . T his indu ces an odd st ructure on
t he domain of definition of any g{i }, turn ing it int o R i(m H)+m minus an
uncou nt ably infinit e numb er of holes . Wi th t he usual definit ion in R n of
an open se t be ing a union of op en hyperspheres , t he dom ain of definition is
part op en, part closed. For example , if one datum space of a one-dimensional
gene ra lizer has the values (Xl , yl), t hen another datum space has the following
domain of definition: {x E (- oo, x l), Y E (-oo, +oo)} U {x = xl,y = y/} U
{x E (Xl, + 00), y E (-00, + oo)}. No union of open circles can give this
st ructure around the line x = x' , In gen eral, unless otherwi se indicated, t he
domain s of defin it ion of the g{i} will always be con sidered to be the full
Euclidean space minus the holes caused by (2.4).
In add ition to t he restricti ons list ed above, it is usually helpful to impose
the following restrict ion on the domain of t he g{i}:

(2.5) Unless i , the order of t he learn ing set, excee ds m , the dimension of
the generalizer , g{i} is not defined . Even if i > m, g{i} is not defined
if t he values of the R m ent ries of t he datum spaces all lie on the sa me
(m - 1)-dimension al hyp erpl an e.

T he rationale for (2.5) can be seen by taking m = 2. If the input com-


p onen ts of the element s of the learning set are colin ear (i.e., t here are only
two eleme nts in the learning set ), then the learning set prov ides no informa-
t ion for guess ing the ou tpu t s of point s not on the line of t he elements of t he
learn ing set . It provid es no inform ati on t o fix gene ra lization in any direction
per pendicular to t he lin e containing t he learning set. Since in any real-world
scen ar io t he data m aki ng up t he learning set will not be infin itely precise, we
can never be sure that a given question is exactly on t hat line, and therefore
can nev er even be sure that our learn ing set pr ovid es any pe rtinent informa-
t ion for gues sing the output to a question which we think is colin ea r wit h
the lea rn ing set.
From now on , unless expli cit ly st ate d otherwise, the term "generalizer"
will be assumed to imply adhe rence to restrict ions (2.2) t hrough (2.5) . Be-
yond th ese simple restrictions, there are several others which help dist inguish
bet ween different types of generalizers and which suggest avenues t hro ugh
which to attack the pr oblem of find ing a reasonable set of generalization cri-
te ria which fixes uniq ue gen eralization of any learn ing set. The rest of t his
first paper con sists of an exploration of these restrict ions, t heir m athem atical
form ulation s, and their relations to an d interdep endencies with one another.

2. 3 HERBIEs
The first such additional restriction which comes to mind is t o assume in-
var iance of the g{i} under cer tain coordinate transformations. A number
of t he techniques us ed in classificat ion , im age processin g, and t he like ob ey
suc h an invari an ce, under certain condit ions. For example, for certain kinds
of const raints on t he allowed for m of a differential pro babili ty dist rib ut ion
A Math em ati cal Th eory of Generalization: Part I 165

over a E uclidean space ~ , P ( O" E ~ ) , choosing that probability distrib ution


which meet s those constraints and which also m aximi zes t he entropy [48],
- J P(O")ln[P(O")]dO", will pi ck out a dist ribution in a manner which is invari-
ant under all differentiable tr an sformat ions of the space ~. If the space ~ is
trans formed t o a new E uclidean space ~ ' via a transform at ion operator T,
and if the constraint is re-expressed in terms of the space ~ ', then the tech-
nique of m aximizing ent ropy over ~' su bject to the transformed constraint
will pro duce a distribution P' (O"' E ~') which is simply t he image under T
of the distribution produced by maximizing entropy over the orig inal space
~ sub jec t to the origina l cons traint : P' (O"' = TO" )d(T( O" )) = P (O" )d(O"). For
example, this invari anc e will be obeyed if t he const raint on the pro bability
dist ributions is fixing the expe ctation value of th e posit ion in the spaces to
be some const ant.
Of course, it is not true that the techniqu e of maximiz ing ent ropy will
give a probability distribution invari ant under all coordinate transformations
for arbitrary const raint s on that distribution . Perhaps t he most well-known
example of su ch an invari an ce-bre aki ng occurs when t he constraint is the
mi croc anonical distribu tion of classical statistical mec hanics [49]. In this
case the constraint is to fix t he energy of the sys t em t o be some cons tant ,
and maximizing entropy over phase space sub je ct to t his constraint gives
a uniform distribution over a sur face in phase space. In gen eral, however ,
the re-expression of t hat dist ributio n in other spaces will no t be uniform ,
as maximizing entropy over those spaces would require. W ith a cons traint
of this nature the invari anc e un der ar bitrary coordinat e t ransform ation s is
broken . You will make different prediction s using pr obabili ty distrib utions
constructed by maximizing ent ropy over different spaces.
Similarly, no non con st ant g{ i} is invari ant u nder all coordinate transfor-
mations. No g{i} exists for which

for ar bit rary coordinate t ransformations T and T' . However it can be re-
quired (for example) that the g{i } obey a scale invariance. More pr ecisely, if,
for any 9 {i}, in each datum space t he R coordinate is sca led by a real, nonzero
factor k, the output R of that g{i} is scaled by k, Notationally, if g{i } takes
(R m x R x R'" x R .. .) -4 R, t hen it also t akes (R m x kRxRm xkR . .. ) ----> kR
in the sam e way. Intuitively, t his restricti on of sca le invariance says t hat gen-
eraliza tion should not be dep endent on the ou tp ut dimension 's unit s.
A simi lar restriction is to require that if for an y g{i }, for ever y R m en-
t ry (including the quest ion) t here is t he same scaling transformation of the
all the axes (by a nonzero real number), t hen again the outp ut valu e is
un chang ed. In the same spirit it can also be req uired t hat if every R '" en-
try undergoes t he sam e rot ation , parity, or t ransl ati on transformation, then
the output value is unchanged . If we m ake the sa me req uirement of the R
166 David H. Wolpert

entries, we can summarize this restriction of invariance under these coordi-


nat e transformation s by saying

(2 .6) Eve ry g{i} is invariant under any rot at ion , pari ty, translation , or
nonzero scaling tran sformation , applied simultaneously to the all the
R m , including the quest ion, or to all the R , including the output.

For example, for the g{ 2} in a one-dimensional gener alizer ob eying the


input space invari anc es,

g{2} (x ,y , x' ,y' , z) = g{2}( a + k x ,y, a + kx' ,y' ,a + k z) ,


for all x ,x', y, y' ,z, a and nonzero k . Techn ically speaking, for multidimen -
sional input spaces translation invari ance refers to translation by any con-
st ant vect or (not simply to tran slation by a vector proportional to the unit
vect or) , and scaling invari ance refers to scaling of any set of components of
th e input vect ors (not simp ly to scaling of all components simultaneously).
In pr actice this extra freedom is not import ant , however , since almost all of
the cases considered in thes e papers involves one-dimens ional input spaces.
We are now in a pos ition to define t he firs t cat egory of generalizers wit h
which we will work. A HERBIE generalizer is defined to be a generalizer
which obeys t he criteria (2.2) through (2.6). (In addition , in practice the
t erm "HERB IE" will often b e taken to mean a generalizer for wh ich t he
g{i} are straightforward and simple functions of their arg um ents.) As in
referring to neural net as generalizers , algor it hms taking learning sets to
question-guess mappings will often be referred to as HERBIEs even if thei r
g{i} are not everywhere cont inuous functions of their arguments , so t hat t hey
are not even generalizers , te chni cally sp eaking. The context should m ake it
clear when this relaxation of the cont inuity req uirement is t ak ing place. In
addition , some times t he t erm "HE RBIE" will be taken to mean a p articular
kind of HERBIE generalizer. Som et imes the term will just mean the concept
of using HERBIEs to gener alize. Again , as always, t he context should make
th e pr ecise meaning clear.
Intuitively speaking, a HERBIE is a generalizer which, owing to restric-
tion (2.6), makes its guesses based purely on the geometric relationship be -
tween the question and the data vectors making up the learn ing set . W he n-
ever you ar e in a sit uation in which you thinks t he uni ts used should not
matter, or where your zeroes shou ld not matter, t hen you had b etter use a
HERBIE (or an ap prox im at ion to one). If you do not , t hen whether you are
awar e of it or not, you are actually introducing preferred scales, preferr ed
origins, an d t he like, whe n you know t here are none such in the problem at
han d.
Note that the restricti on of a generalizer that it be a HERBIE is not
p owerf ul enough t o for ce uni que generaliza t ion of any learni ng set . For ex-
ample, many of the st andard algorithms for sur face-fit t ing used in statistical
data analysis are HE RBI Es, wit h fitt ing t he lowest order polynomial sur -
face possible to t he dat a vectors and using t hat sur face to determine t he
A Math em atical Theory of Generalization: Part I 167

question-gu ess m apping bei ng perhaps t he most commonly used example.


Ano ther, com plete ly different kind of HERBIE is havin g t he output guessed
by t he gene ralizer be th e ou tput compo nent of the dat a vector closest to th e
qu estion.
Alt hough the individu al restricti on s m ak ing up bein g a (continuous) HE R-
BI E are all quite weak , t aken together t hey ar e enough to force t he following :

(2.7) For any i , if t he values of the R entry of every datum space of a g{i}
of a continuous HERBIE are equal, the output is thi s value, regar dless
of the qu estion.

Proof. Let g{i}(x , y, x', y, . . . , u) = v for an m -dimension al gene ra lizer (x, x'
and u are m -dimen sion al vect ors ). T he R valu e of ea ch datum space is y.
It is assume d that u do es not equal an y of the x , x', etc . If it did , then
we would have v = y immediat ely by (2.3) and would be don e. By (2.6),
g{i }(x , ay + b,x',ay + b, ... ,u) = av + b for all nonzer o real a and b. In
particular, th e equality ho lds for b = V. Therefore g{i }(x, y( a + 1), x', y( a +
1), . .. , u) = av + y. Now take the limi t of both sides as a goes to 0. Since
g{i } is con tinuous, we get g{i}(x ,y , x', V, . . . , u) = v.
There exist a number of other somewhat trivia l properties tha t follow from
a gen eralizer 's being a HERBIE. An example of such a proper ty applies to
a one-dimensional generalizer 's g{2} when the question is h alfway inb et ween
the two eleme nt s of th e learn ing set . Let the resulting guessed ou tpu t be z;
g{2}( x ,y,x' ,y' , (x + x') / 2) = z. Then , using the coordina te transformation
invari an ces we get in order :

g{2}(x, - v, x' , - V', (x + x' )/ 2) = - z ,


g{2}x - x' )/ 2, - v, (x' - x) /2, -y' ,O) = - z ,

g{2}x' - x)/2 , -V ,(x - x')/ 2,-y', O) = - z ,

g{2}(x', -v , x, -V' , (x + x' )/ 2) = - z ,


g{2}(x' ,V' , x, y, (x + x')/2) = - z + V + V'.
Comparing this wit h our starting point, z = y + y' - z , and ther efore z
must equal (y + V')/2. This kind of reasoning can be extended in an obv ious
manner to show t hat any g{2} of a one-dimensional HERBIE must have odd
symmetry about t he po int { X~X' , lL:!f}.
Vie can also make some comments about the der ivatives of anyone of
the functions making up any HERBIE generalizer if we assume that that
function , in addition to being continuous throughout its domain of defini-
tion and meeting the restrictions of requirem ent (2.6) , is also analytic. For
example, if our generalizer is one-dimen sional and we expand g{i} about a
cert ain point (i.e., about a certain set of values X l, YI , . . . , q) and then eval-
uate th at expansion at points whos e differenc e from the expansion point is
168 David H. Wolp ert

a vector (a, 0, a, 0, . .. , a), then by trans lation invariance we kno w that that
expansion's value mus t be the same for all values a. Writ ing out the Taylor
series expansion for g{i}, t his means t hat L~l kiai = 0 for all a, whe re the
k, are linear combinations of derivatives of g{ i } wit h res pect to the varia bles
Xl, X2 , . . . all the way up to q. Taking derivatives with res pect to a of bo th
sides of t his equality an d t hen setting a equal t o 0, we get t hat all t he ki
must equal O. For example ,
k _ Og {2}(XI,Y I,X2'Y2,q) og{2}(-) og{2}(-' .) _ 00 I - 0
I - 0 Xl + 0 X2 + 0q - 0a a = O - .
(In terest ingly eno ugh, it turns out t h at we can gain no addit ional information
from the equations involving k's of order higher t han kl . T his is becau se
t he eq uation kn = 0 follows directly from allowing the expans ion point at
which the derivatives are evaluated to vary, taking various derivat ives of
t he equation kl = 0 (now, wit h the expansion po int varyi ng, viewed as an
equality of functions rather t han as an equality of rea l number s) , and t hen
forming linear combinations of these resultant equat ions .)
Vve can simila rly make use of the the req uire men t of sca ling invariance
along with analyticity of the g{i} to derive restrictions on the derivat ives of
the g{ i}. If we expand all t he {x;} along wit h q by a factor b == 1 + a, where
a =f - 1, then we can again collect powers of a and get, for examp le,
. Og {2}(XI' YI, X2, Y2 , q) og{2}(-") og{2}(- ' .) _ 0
Xl 0 Xl + X2 0 X2 +q 0q - .
(Again, we gain no information from t he equations involving hig her-order
derivatives since these equations follow directly from this equation involving
first order der ivatives. ) Combining this equation wit h t he one above for
translation invariance, we get three equation s of the nature

( q -XI )Og{~}( ... ) + (q - X2/g{~}(- ) =0.


Xl X2
On ly two of these three equations are independent , so we cannot deri ve
the va lues of og{2}(- ' ')/ OXI, og{2}(-' ')/ OX2' and og{2}(- )/oq. How-
ever, given q, Xl, and X2, it is sufficien t to kn ow one of the thr ee deriva t ives
og{2}(-' ')/OXl, og{2}(-' ')/ OX2, or og{2}(-' )/ oq t o get t he other two. So
know ing just one of these three der ivat ives gives all deriva t ives of all powers .
Since we have assumed analyticity, this means that in its dep endence on X l ,
X2 and q, the function 9 {2} is in fact fully specified by anyone of these t hree
first-order derivat ives. In other words , ass uming analytic ity, the two requ ire-
ments of translation invari ance and scaling invariance remo ve 2 degree s of
freedom from g{2} and, in fact, from all the funct ions g{i }, and therefore,
in a sense, from the gene ralizer as a whole.
If our generalizer is mu ltidimensional, then t he requirem ent of rot ation
invariance in t he input sp ace will also lead to res trict ions on the der ivatives .
Res ults similar to all t hese also follow from the re quire ments of translat ion
and scaling invariance in the out put space.
A Mathematical Theory of Generalization: Part I 169

As it turns out , these and all the other ramifications of the requirements
of (2.6) can be derived without recourse to this derivative-based framework.
This is done by using the following geometric argument to enumerate all
possible HERBIEs.
First examine the case where there are two elements in the learning
set and the input space is one -dimensional. Assume YI and Y2, the out-
put components of the two datum spaces, are fixed . We want to see how
g{2}(Xl,Yl,X2,Y2,q) varies with the input components, XI,X2, and q. Our
restrictions on this varying are those of (2.6), namely that if any vector
(a, a, a) is added to the vector (Xl, X2, q), then the output of the generalizer,
g{2}, is unchanged, and similarly, if the vector (XI,X2,q) is multiplied by
any nonzero constant then again g{2} is unchanged. Whatever the g{2}
value at a point (Xl,X2,q), the g{2} value is the same at all points on the
plane containing the two lines formed by all translations and by all scalings
of this point (Xl, X2, q). (We are implicitly assuming that we do not have
Xl = X2 = q, which is the only case where these two plane-delineating lines
are identical and therefore do not specify a unique plane. This assumption
does not make us lose any generality in the argument, since we already know
what the output must be in the case Xl = X2 = q.) In other words, the func-
tion g{2} taking points in the R 3 space of the input components (Xl>X2,q)
to the R space of the generalizer's output is made up of invariant planes.
To calculate the equation of the plane of invariance containing any par-
ticular point (Xl>X2,q), we solve for the coefficients a, b, and c in the plane-
defining equation aXI + bX 2 + Q = C (Xl, X 2 , and Q being the three co-
ordinates constituting our R 3 space) . Since the plane has to be invariant
under identical scalings of all three of the coordinates, we know that c must
equal 0. Since the plane has to be invariant under identical translations of all
three of the coordinates, we know that a + b = - 1. Therefore, our plane is
uniquely defined by the single parameter a, which is set by the values Xl, X2,
and q. Any triplet Xl = X2 = q is contained in all of the planes, regardless
of the value of a, so every plane contains the line going through the origin
and the point (1,1,1 ). In addition, translation invariance means that every
plane contains lines parallel to the vector (1,1,1) but which lie off to the side
of the line going from the origin through (1,1,1). Therefore, looking down
the vector (1, 1, 1) toward the origin, we see all of our planes of invariance
edge-on, and anyone of these planes can be generated from any other one
simp ly by rotating it about this line going from the origin out to (1,1,1).
We could parameterize the planes by this angle of rotation instead of by the
value a if we wished.
Define G(a) to be the function mapping any value of a (i.e., any set
of values Xl> X2, and q) to the output of the function g{2}(XI,1,X2,0,q).
By translation and scaling invariance in the output components of the da -
tum spaces we know that g{2 }(XI,YI,X2,Y2,q) for arbitrary YI and Y2 =
g{2}(XI' 1, X2, 0, q)[YI - Y2]+Y2. In other words, now allowing all arguments of
g{2} to take on arbitrary values, g{2}( xl, YI, X2, Y2, q) = G( a )Yl+[l - G(a )]Y2,
where a is set by Xl, X2, and q, and G(a) is the dependence of g{2} on a
170 David H. Wolpert

for the case where Yl = 1 and Y2 = O. This is a complete enumeration of


all poss ible g{2}'s for one-dime nsional HERBIEs; g{2} for one-dimensional
HERBIEs is fully specified by a one-dimensional function, G(a), in agr eement
wit h our discussion of the derivatives of g{i}'s of HERBIEs.
A similar analysis can be carr ied out for arbitrary g{i} for HERBIE
genera lizers of arbitrary dimension . If th e dimension of the generaliz er is
greater than 1, t hen rotation invarian ce and the like in the input spa ce can
serve to rest rict t he possible form of th e g{i}' s more t han would otherwise
be the case . Note that this kind of geometric reasoning does not require any
assumptions ab out analyticity, or even about continuity. Such consid erations
come in, for example, as rest rictions to be made of G(a); they do not come
in in our concluding that g{2}(Xl ,YllX2,Y2,Q) = G(a)Yl + [1 - G(a)JY2 for
some G(a). Similarly, the rest riction of datum-space interchange symmetry
is also simply a rest riction on the possible forms of G(a).

2.4 LMMGs and upward compatibility


Note t hat th e HERBIE of guessing the closest element of th e learning set is
not continuous, even when the learning set is fixed so that only the question
is allowed to vary. Such a lack of being everywhere continuous is a problem
wit h many real-world HERBIEs. One interesting situation in which such loss
of continuity is common is when two points in the learning set approach one
another. As an example, consider the "fit the lowest possible order polyno-
mia l to t he learn ing set " generalizer . This generalizer is a HERBIE (with the
proviso t ha t t he definition of a generalizer is here being extended to allow
noncontinuity) . Yet as we take the limit of one point in the learning set
approaching another, the shape of the question-output function will, in gen-
eral, depend on the direction in input-output space along which that second
point in the learning set is approaching the first . Unfortunately, this prop-
erty viola tes th e requirement of continuity together with single-valuedness
of the g{i} . The same problem occurs for generalizers which try to fit a
Four ier series to the learning set, and for the hyp erplanar HERBIE gener -
alizer described in [1], even when the surface such a hyperplanar HERBIE
create s from the learning set is "smoothed out" to be continuous with finite
derivative everywhere in the dom ain of definition.
Ind eed, any so-called "linear mathematical mod el" generalizer (LMMG)
which generalizes by fit ti ng the (lowest order possible) elements of a set of
bas is fun ctions to t he element s of the learning set will have this problem with
discontinuities, whether or not it happens to be a HERBIE. The proof of this
is ap pendices E and F (Appendix E being a detailed definition of LMMGs) .
On the other hand, LMMGs have the nice property that if two of the points
in t he learning set are identical, the generalizer makes the same guess as it
would if it had simply been fed the sam e learning set but without one of the
duplica ted datum spaces. (This propert y is formally defined in restriction
(2.8) below). Finally, while on t he subject of LMMGs, it is interesting to
note t hat th ere is only one LMMG which obeys t he invariances required of
A Mathematical Theory of Generalization: Part I 171

HERBIEs (requirement (2.6)). This is the LMMG which works by fitting the
learning set with polynomials. No other basis set offunctions gives an LMMG
which obeys scaling and translation invariance in both input and output.
And even polynomial LMMGs, strictly speaking, do not meet input space
rotation invariance. (The proof of all of this is in appendix G.) What this
means is that whenever someone uses a non-polynomial LMMG to generalize,
whether they mean to or not they have an implicit preferred origin, preferred
scaling dimension, preferred orientation, or perhaps even a combination of
all three. Conversely, if you think the system you are modeling with your
generalizer does not have such preferred spatial characteristics, then under
no circumstances should you use an LMMG as the generalizer.
There are other restrictions besides the various coordinate transformation
invariances of (2.6) and besides the restrictions implicit in the definition
of LMMGs which can be added to the core definition of restrictions (2.1)
through (2.5). For example, along the same lines as requirement (2.3), we
can require upward compatibility of the g{i}:

(2.8) For all g{i},i > m + 1, if the values of the entries of 2 datum spaces
are identical, then the output is equal to the output of g{ i-I} working
on the rest of the learning set and on one of the two identical datum
spaces.

For example, g{3}(x,y,x,y,x',y',z) = g{2}(x,y,x',y', z). Intuitively,


(2.8) is expressing the idea that if we add a new point to the learning set
which tells us nothing new, our guessing should be unchanged. As an ex-
ample, the generalizer which works by fitting the lowest-possible-order poly-
nomial surface to the points of the learning set is upwardly compatible, as
are all LMMGs, for that matter. Sometimes a generalizer meeting restric-
tion (2.8) (along with (2.1) through (2.5) but not necessarily (2.6)) will be
called "semi-proper," whereas a generalizer meeting all of (2.1) through (2.8),
including restriction (2.6), will be referred to as "proper." The reason for
such definitions will become clear below when we discuss self-guessing. Any
proper generalizer is necessarily a HERBIE, though of course not vice versa.
Although LMMGs (for example), while upwardly compatible are not ev-
erywhere continuous, with care it is possible to create fully proper generaliz-
ers; that is, there are HERBIEs which are continuous throughout the whole
domain and which meet restriction (2.8). One example of such a proper
generalizer is the nearest-point-averaged (npa) generalizer, described in ap-
pendix H. In any case, due to their simple and overt generalizing behavior,
HERBIEs are, in general, well-suited to serving as benchmarks for gener-
alization, irrespective of the question of upward compatibility or continuity
when elements of the learning set approach one another. Indeed, the hyper-
planar HERBIE discussed in [1], although overtly nonupwardly compatible
and noncontinuous, was explicitly created to serve as a benchmark for gen-
eralization.
172 David H. Wolpert

2.5 Output linearity


In addition to restrictions (2.6) and (2.8), yet another possible restriction on
generalization which can be added to requirements (2.2) through (2.5) is to
require that the generalizer be output linear.

(2.9) A generalizer is said to be output linear iff for all of the g{i}, if
g{i}(x, y, x', y', u) = v and g{i}(x, z, x', z', ...u) = w, then g{i}(x, ay+
bz,x', ay' + bz', u) = av + bw, for all a, b,x,y,z,x' ,y',z', . . . u.
Examples:

1. g{2} for one-d imensional HERBIEs, shown above to be of the form


g{2}(X1'Y1,X2,Y2,q) = G(a)Y1 + [1 - G(a)]Y2 ' is output linear by in-
spection. All LMMGs (that is, all generalizers which work by summing
functions taken from a basis set in such a way as to reproduce the
learning set) are output linear for all cardinalities of the learning set,
as is shown in ap pendix I. For example, the HERBIE generalizer "fit
the lowest-possible-order polynomial to the learning set" is output lin-
ear. Of cours e, as is discussed in appendix F, such generalizers are not
everywhere continuous in t he domain of definition delineated in (2.4) ,
and are not even HERBIEs, strictly speaking, if the dimension of the
input space is > 1 (see appendix G).
2. The analytic continuation of the generalizer defined by

a variation of the surface-fitter used in [2], is an everywhere (in the


domain of definition) continuous generalizer meeting all the restrictions
of being a HERBIE in full, which is also output linear. This HERBIE,
along with its variations, like

n Yi n 1
g{n}(XI,Y1, .. . ,q) ={ L ( -.) } / {L -(; r -
i=l d q, z,
.)},
i=l q, x,

(where d(.,.) is a metric such that for any real constant k, d(kx, ky)
is proportional to d(x,y)), all of which are output linear, continuous
throughout the domain delineated in (2.4), and intimately related to
radial bas is functions and kernel-density estimators, will generically
be referred to as "metric-based HERBIEs." Note that metric-based
HERBIEs, like npa generalizers, can never guess an answer which either
exceeds the maximum of the learning set or is less than the min imum
of that learning set . This shortcoming is the primary reason for caution
in making use of such generalizers. (To see why this shortcoming holds,
simply rewrite the output of the generalizer as {L:i=l aiYi }/ {L:i=l ai},
where all the a, are positive. It is clear that if you take any but the
A Ma thematical Theory of Generalization: Part I 173

largest of the Yi and increase its value, the value of the output must
increase, going to the value of that maximal Yi when all the other Yi
have been increased to the value of that ma ximal Yi. In other words, the
output is bounded above by max(Yi)' For similar reasons, the output
is also bounded below by min (Yi), which concludes the proof.)
3. All HERBIEs are close to being output linear, in that if

g{i}(x , z , x', Z', . . . u ) = w,


t hen g{i }(x,az,x',az',q) = aw. It is the restriction concerning addi-
tion of output components of the learning set that HERBIEs are capa-
ble of viola ting. (One currently open question is whether an analytic
HERBIE can avoid being output linear .)
4. The npa generalizer of appendix H, which is both a HERBIE and up -
wardly compatible (i.e., is proper), is not output linear.
The restriction of output linearity can be viewed in a number of ways.
On one han d , it can be viewed as a combination of output scaling invariance
toget her with an ou tp ut-summing property. Viewed another way, in a loose
sense, every g{i} of an output linear generalizer is like a tensor field defined
over t he (m X i) -dimensional manifold of all (x, x', . . . u), with the tensor at
any point on the mani fold being a linear mapping from the i-fold Cartesian
product (R x R x . .. ) to R. 7 Alternatively, if the Cartesian product of all the
outp ut components of the datum spaces is taken to itself be a vector space
V, then any g{i} is a dual vector field V* . See [50] and [51] for a general
discussion of linear operators.
Note that restriction (2.9) is comp letely compatible with restriction (2.6).
If b an d z are constants, (2.6) says t hat g{i}(x, ay + bz, x', ay' + bz, . . . , u) =
a(g{i }(x , y, x', y', ... ,u )) + bz for all x,x',y,y', .. . ,u . Plugging into (2.9)
implies t ha t g{i}(x, z , x', z, .., u) = z. This, of course, is exactly (2.7), the
restriction we derived earlier from (2.6) .
T here are a number of properties which must be obeyed by any generalizer
which is output linear. T he first of these is that
(2. 10) Any output linear g{i } can be written as

guess = ho(x , x' , x", . . . , q)y + h1 (x, x', x", . . . , q)y' + ...
To prove (2.10) it suffices to note that by (2.9 ) any output linear gener -
alizer can be writ ten as
g { z' } ( X,Y ,X,Y,x,y,
I I II II
.. . ,q ) = g{i }( x, 1, x' , 0, x" , 0, , q)y
+g{i }(x , O, x' , I , x" , O, ,q )y' + ...
7Rjgorously speaking, tho ugh, g{i} is not really a ten sor field unless mxi , the dimension
of th e manifold, equals 1, the dimension of the spaces whose Cartesian product make s up
th e space on which t he te nsors work. There is also the prob lem, of course, that the ma pp ing
is only linea r from (R x R x R x ...) to R, and not multilinear (see refer ence [51]).
174 David H. Wolpert

assuming the individual g{i} terms on the right hand side exist. Therefore
ho(x, x' , x", . . . ,q) = g{i}(x, 1, x', 0, x", 0, . . . ,q), for exa mple. Unless explic-
it ly noted otherwise, it will always be assumed that t he cardinality of the
learning set exceed s the dimension of the input space and that all t he inp ut
space components of the datum space s, x , x', x", etc., are different from one
another. These two conditions ensur e that those g{i} te rms all exist, so that
we can in deed set ho(x,x', x", .. . ,q) = g{i}(x,l ,x',O,x",O, . .. ,q) ,etc . (see
rest riction 2.5).
If we assume, as we often do, t hat the g{i} are everywhere differentiable,
then t here is a weaker condit ion than (2.9) which results in (2.10) and there-
fore in (2.9). Namely, if all of t he g{i} of a generalizer are differentiable
and obey the prop erty that g{i}(x, y, x' , y', .. . u) + g{i}(x, z, x' , z', . .. u) =
g{i}(x, y + z, x', y' + z' , .. . u ) for all x , y, z, x', y', z', . . . u, then in fact the gen-
eralizer is output linear. To see this simply note that th is property imp lies
that
lim g{i}(x,y + z ,x', O, x" , O, .. . u) - g{i}(x,y,x',O ,x",O, .. . u)
z -+ o Z

. g{i}(x , z , x', 0, x", 0, ... u)


= 11m ::...o.~-,--,--,--,--,--,-_---,-

z -+ o Z

The right-hand side is simp ly some constant, independent of y, so we can


write g{i}(x, y, x', 0, x", 0, . . . u) = yk + k', where k and k' are independent
of y. Since this pr ope rty also implies that

2g{i}(x, y, x', 0, x", 0, . . . u) = g{i}(x, 2y, x', 0, x", 0, . . . u ),


we know th at k' must equa l 0. T his allows us to write

k = g{i}(x, 1, x', 0, x", 0, . .. u).


In a similar manner we see that (for example) g{i}(x,O,x',y',x", O, .. . u) =
y'g{i}(x, 0, x', 1, x", 0, . .. u) . Therefore, pu lling th is all together,
gz
{O}(x, y,x"
,y " " . .. U )
,x ,y, = yg{i}(x, 1, x', 0, x", 0, ... u)
+ y'g{ i}( x, 0, x', 1, x", 0 .. . u) + ...
which is exactly of the form given in (2.10).
As a result of (2.10) , we see overtly that the guess to any question is
if the output values of all the dat um spaces are 0. Req uirement (2.2) of in-

var iance under permutation of the datum space s mean s t hat ho(x , x', .. . )y +
h1 (x , x' , . . .)y' + ... = ho(x' , x , . .. )y' + h1(x',x, .. .)y + .. .. Since thi s must
be true for all y, y', y", etc. (up to the rest rictions of 2.5), we can conclude th at
ho(x , x', x", . . .) = hI(x', x, x", . . .), and that hj(x, x', x", . . .) = hj(x' , x , x", .. .)
for all j > 1. In general then, if we define Tij(hn)(x,x',x", ... ) as t he func-
tion created by interchanging the variables in the ith and jth slots of the
arg ument list before acting on it with hn (the first slot being defined to be
slot 0),
A Mathematical Theory of Generalization: Part I 175

(2.11) Tij = r.; (Tij )2 = 1, t; = 1, Tij(h k) = hk if i =f k and j =f k, and


Tik(h k) = hi.
A number of equalities follow directly from (2.11). For example,

ho(x, x', x") + hO(X",XI,X) = hO(x",x,X /) + hO(X,X",X/)


results from (2.11) . Also (2.11) means that hj = Tjo(ho) and therefore the
output of any output linear generalizer can be written as L:i';;-ol TiO(ho)Yi , or
L:i~l Ti( h)Yi for short (n is the order of the learning set). So the output of
t he generalizer is
yh(x, z', x", . . . , q) + ylh(x l, x, x", . . . , q) + y"h(x", x', x, . . . , q) + ...
To ensure that the equalities of (2.11) cannot place us in a situation where
we have two expressions which appear to be different but in fact are identi-
cal, we must fix a protocol for standardizing the order of the argument list
of h. The protocol used here is to rearrange the variables occurring in the
slots beyond the first in increasing order. So, for example, instead of writing
h(x"', x", z , x', q), we would write the equivalent expression h(x lll, x, x', x", q).
Therefore, we standardize how we write the output of an output-linear gen-
eralizer as
y h( x,x,x
' " ,x"' , ... ,q) + y'h (x', x, x", XIII, . . . , q)
+ y"h(x II ,x,x,x
' I I I , . . . ,q ) + ...
As was mentioned, there is a slight caveat to the equality

hex, x' , x", . . . , q) = g{i}(x, 1, x', 0, x", 0, . .. ,q)


presented just below equation (2.10), which comes from the fact that the g{i}
on t he right-hand side of the equality must exist: the equality holds only so
long as x does not equal any of t he variables x', x", . .. etc., since 9{i} is not
defined for su ch a set of argument values, given that y does not equal any
of t he variables y',y", . . . etc. This does not mean that h(x,x,x", .. . ,q ) is
necessarily undefined, simply th at it does not equal 9 {i}( x, 1, x, 0, x",
(which is undefined) . Indeed, h(x,x,x" , . . . ,q ) can be consistently defined
q) ...,
wit hout any difficulty, assuming that
g{i}(x, 1,x, 1,x",0, ... ,q )
is defined (see 2.5). To do thi s, write the output-linear generalizer

9 {zO}( x, y,x,y,x " , y II ,x II I ,yIII , .. . ,q )


as
yg{ i}( x , 1, x, 1, x", 0, XIII , 0, . . . , q) + y"g{i}(x , 0, x, 0, x" , 1, XIII, 0, ,q )
+ ylllg{i}(x, 0, x, 0, x", 0, XIII, 1, , q)
+
176 David H. Wolpert

Exploiting datum space interchange symmetry between the unprimed and


doubly primed variables , we can also write this expression as
Y"9 {'}(
z x "1
, ,x"1 ) ++
, , x, ,xIII , , .. . , q yg{i}(x",O ,x",O,x,l,x''',O, ,q)
ylllg{i}(x", 0, x", 0, x , 0, XIII, 1, , q)
+
Therefore, equating the coefficients of y, 9 {i}( x , 1, x, 1, x" , 0, XIII, 0, .. . , q)
must = g{i}(x", 0, x", 0, x , 1, XIII, 0, .. . , q). Similarly, equating the coefficients
of ylll ,

g{i}(x, 0, x, 0, x", 0, X III , 1, .. . , q) = g{i}(x", 0, x", 0, x, 0, XIII, 1, .. . ,q).


If we define H(x, x", XIII, . , q) == g{i}(x, 1, x, 1, x", 0, XIII, 0, .. . ,q), the equal-
ity arising from equating the coefficients of y means that we can write the
output of the generalizer as
y H( x,x " ,xIII , ... ,q) + Y"H ( x " ,x,xIII , ... ,q )
+ ylllg{i}(x, 0, x , 0, x"; 0, z '", 1, . .. , q) + ...
On the other hand, we want be able to extend (2.10) to our case where
input- sp ace arguments are equal and have the output of the generalizer equal
2 y h( x,x ,x",x
I I I , ... ,q ) + Y"h( X ff ,x,x,xIII , .. . ,q )
+ yff'h( x III ,x,x,xff , .. . ,q ) + ...
This can be done if we equate h(x, x, x", x"', . . . , q) with

"21 g{'}(
Z x ,l,x,l ,x",O,x
I I I , O, ... ,q.
)

Smce we al den tl y h ave h(x , x ff"


so'evi ,x ,x'" , . .. , q) = H( x, x"'"
,x , , q) ,
we also see that h(x, x" , x", x"', .. . , q) must equal 1/2h(x, x, x" , x'"; , q).
Wh en investigating upwardly compatible output-linear generalizers it is nec-
essary to have at hand all such restrictions on h in the situation where it has
two (or more) of its arguments equal in value.
For output linear HERBIEs y translation symmetry combined with (2.10)
immediately gives L:;~:l Ti(h) = 1. For all the input components of the
learning set different from one another, requirement (2.3) immediately gives
h(q,x',x", ... , q) = 1, and in general hn(x,x', ... ,q, .. . ,q) = 8(n,p), where
p is the number of the slot having the same value (q) as the question. This
is in complete agreement with the other restrictions on h and the hn .
It is important to understand that although g{i} completely specifies h,
and vice versa, all in the rather simple manner described above, h does not
have to satisfy the same restrictions that apply to the g{i} . For example,
for n > 2, we can take x = x' = q, y = y', and reproduction of the learning
set forces h to equal 1/2. However, as was just shown, for x = q = x' + c:,
h must = 1, even if e is infinitesimally small. Since the point x = q = x' is
within the domain of definition even for generalizers satisfying requirement
A Mathematical Theory of Generalization: Part I 177

(2.4), this means that h cannot be a continuous function of its arguments


if the associated g{i} is to also reproduce the learning set and be output
linear. Note however that this does not mean that the full 9 {i} cannot
both be a continuous function of its arguments throughout the whole domain
delineated in (2.4), reproduce the learning set, and be output linear. As has
been mentioned, metric-based HERBIEs are examples of generalizers which
meet all of these requirements.
Given that such continuous output linear generalizers exist, since (as was
shown before) all LMMGs, although output linear, are somewhere discon-
tinuous, we can conclude that LMMGs form a proper subset of the set of
all output linear generalizers. Even ignoring such questions of continuity, we
can still see that not all output linear generalizers are LMMGs since metric
based HERBIEs are full HERBIEs, even being invariant under rotation in
the input space, whereas we know there are no LMMGs which can satisfy
(2.6) in its entirety (see closing arguments in appendix G).
As an aside, it is interesting to note that the kinds of thresholding func -
tions usually used to model neurons in neural nets satisfy all the restrictions
on h for n = 2, output-linear, one-dimensional HERBIEs. In proving this by
enumerating all output linear HERBIEs (see appendix J), it is also shown
how the restriction of output linearity is not enough to force unique gener-
alization of HERBIEs. For purely heuristic reasons, it is therefore interest-
ing to explore an additional criterion which does force unique generalization
of output linear one-dimensional HERBIEs, at least for the n = 2 case.
This criterion is to require that there exists an analytic function f (x) going
through the two points of the learning set such that if the two points of the
learning set had lain anywhere along f( x), then the output guessed for any
question q would be f (q). This amounts to a sort of self-consistency restric-
tion on the guessing" and is automatically satisfied by LMMGs. (As was
mentioned previously, such generalizers usually are not HERBIEs, however.)
More precisely, we require that for any learning set (( a, b), (a', b', there ex-
ists a function f(x) such that f(a) = b, f(a') = b', and in general obeying
h(x, x', q)f(x) + (1 - h(x, x', qf(x') = f(q), which is equivalent to

(2 .12) h( x,x ',q) -- f(q) -f(x')


fix) fix')'

Note that an f(x) obeying equation (2.12) is only fixed up to an overall


scaling and/or translation (i.e ., if f(x) is a solution to (2.12), so is af(x)+b).
It is shown in appendix K that for n = 2 any f(x) obeying (2.12) is a straight
line or an exponential, with the freedom in fixing the scaling and translation
of f(x) allowing it to be made to go through any two points making up a
learning set. If we require that the generalizer be a full HERBIE, input space
scaling invariance says that h(ax, ax', aq) must equal h(x , x', q) for all nonzero
real numbers a, and as a result the exponential solution must be discarded.
8This restriction is somewhat similar to the restriction of strong self-guessing discussed
in the first section of the second paper. The primary difference is that here we are not
making any assumptions about upward compatibility.
178 David H. Wolpert

(The fact that f(x) must be a straight line shouldn't be too surprising, since
we know that in addition to being expressible as it is in (2.12), as is discussed
in app endix J h must be expressible as a function of (q - x')j(x - x'), and
it is hard to think of any way of doing this without having f be linear in
its argument.) Therefore, for any n = 2 learning set, there exists a unique
function f(x) which reproduces the learning set and which obeys (2.12), and
therefore this restriction sets unique generalization of any n = 2 learning
set. Note that this is all markedly similar to the situation encountered when
enumerating all HERBIE LMMGs (see appendix G). In solving for all such
LMMGs we are allowed both exponentials and polynomials as members of
our basis set of functions, unless we require input-space scaling invariance,
in which case the exponential solution must be discarded.

As it turns out , the extension of the restriction (2.12) to n = 3 can't be


met by any generalizer. Ironically, this is because such an extension would
necessitate upward compatibility between the n = 2 and the n = 3 cases .
For n = 3, we write f(q) = h(x, x', x", q)f(x) + h(x',x,x",q)f(x') + (1 -
h(x, x', x", q) -h(x', x, x", q))f(x"). For x = x', this becomes 2h(x, x, x", q) =
(f(q) - f(x") j(f(x) - f(x")), and the exact same analysis as in the n = 2
case means that f(x) must be linear. Therefore we have upward compat-
ibility between n = 2 and n = 3. Now, however, let x i- x' and the
three points of the learning set not be co-linear. By hypothesis, there is
some smooth curve f(x) going through those three points such that g{3}
(x,f( x),x',f(x'),x",f(x"),q) = f(q) for all x,x', x", and q. Now letting this
x' become infinitesimally close to x, continuity of g{3} and upward com-
patibility mean that our f(q) must be a straight line. However, we already
assumed that f(q) goes through three points which are not co-linear, result-
ing in a contradiction. Therefore we cannot extend the restriction embodied
in equation (2.12) to the n = 3 case.

It is interesting to note that away from these points where datum spaces
have identical input values, fitting a parabola to three points constitutes t he
g{3} of an output linear continuous HERBIE obeying (the n = 3 extension
of) equation (2.12). Indeed, fitting with a polynomial constitutes an output
linear, upwardly compatible, (2.12) obeying continuous HERBIE of arbitrary
dimension and order, so long as one stays away from regions where datum
spaces have identical input values . (An interesting and as yet unsolved prob-
lem is to determine all solutions to this partial set of restrictions.) The
Achilles' heel of polynomial fitting, of course, just all other LMMGs, is its
inability to satisfy continuity in the situation where upward compatibility
applies (i.e., when elements of the learning set approach one another). Note,
however, that if we are willing to relax the requirement that generalizers be
continuous, we can easily have fully proper output linear generalizers. A
trivial example of such a generalizer is a HERBIE that guesses as the output
to any given question the output component of the datum vector nearest to
that question.
A Mathematical Theory of Generalization: Pest I 179

This ends the first paper of the series. The next paper cont inues t he
discussion of generalization, using gener alizatio n criteria of a qua litatively
different nature than thos e present ed here.

App endix A.
Hopfield's scheme for using neural nets as associative memories
In Hopfield's schem e each neu ron, hav ing a value of + 1 or -1 , is connected
to every other neuron. Evolution of th e net pr oceeds according to the ru le
O"i(t ) = g[L:j JijO"j(t - 1)], O"i(t) being t he state of t he ith neuron at cycle
number t , and Jij constituting t he connect ions between t he neurons and
usually taken to be symmetric. The fun ction 9 is usually taken to be the
sgn function. In analogy to simple statistical mechanics we can now define a
Hamiltonian H as - L:ij JijO"i(t -l)O"j(t -1) . Let 6.O"i represent th e change
in a, over one cycle. If it is nonzero, then by the definit ion of the evolution
operator 9 the sign of 6.O"i is the same as th e sign of L:j JijO"j evaluated at th e
previous cycle. Therefore 6.O"i L: j JijO"j is positive definit e, 6.H is negati ve
definite under evolution of the system, and H descends monotonically to a
local minimum. To see how this can serve as an associative memory, view the
state of all the neurons in t he net as a (po tenti al) mem ory. We then want to
pick the Jij in such a way t hat if we start with t he syst em in a state close to a
memory we want st ored in the system, then the evolution takes the neur ons
to that stored memory. This behavior can be achieved by using the Hebbian
rul e to set th e Jij as L:m xr Xjm, where xr is t he ith compo nent of t he mth
memory to be "stored." With this Jij> H = - L:m(xm (f) 2 , (f being th e state
of all the neurons at a given time. For rather reason abl e assumptions on t he
distribution of the xm H will be minimized for (f parallel to X'" for some n.
In other words, the if'" serve as stable points of t he evolution of th e syst em,
as desired.

Appendix B.
Boltzmann machines and backpropagation
To understand how neural nets generalizers behave, it is helpful to examine
th e details of their operation . As a first example, consider Boltzmann ma-
chines, which constitute a sort of stochast ic generalization of Hopfield nets .
Instead of viewing the state of the whole net as being the input and output ,
in Boltzmann machines certain neurons are declared to be input and certain
other neurons are declared to be output . Input neurons h ave th eir st ate set
by the outside world exclusively and ou tput neuro ns have their state fed to
the outside world exclusi vely. Therefore connect ivit ies are , in a cert ain sense,
explicitly nons ymmetric. Take the possible states of any given neuron to be
{O, I}. Evolution of t he net pro ceeds according to t he probabilistic rule that
O"i(t) = 1 with probability j(6.Ei) , j bein g the fermi function and 6.Ei, a
function solely of t he other spins in the net , being t he difference between
180 David H. Wolpert

the energy the net would have at cycle t - 1 if neuron i were fixed as a 0
and the energy it would have if neuron i were fixed as a 1. (The Hopfield
net 's evolution is akin to the temperature = 0 case.) This, of course, is just
conventional statistical mechanics. The probability of a particular state iJ
for the whole net is [fii30';=1 P(O"i(t))][fij30'j=o(1 - P(O"j(t)))J, P(O"i(t)) being
the probability of neuron i of the state iJ being a 1 (i.e. !(tlEi ) ) . Taking a
a
particular neuron k which is off in state and turning it on to make state
if, we see that Pc;/PiJ = e-(Ea-EfJ)/kT, just as in Boltzmann statistical me-
chanics (hence the name "Boltzmann machine"). Therefore, as in a Hopfield
net, letting a Boltzmann machine evolve will (likely) cause it to end up in
a local minimum of its energy function. Simulated annealing [25J is often
used in practice to try to "freeze in" that minimum by gradually lowering
the temperature in the fermi function.
Again, just as in the Hopfield net, "teaching" the net in a Boltzmann
machine consists of choosing the energy function to assure certain dynamic
behavior in certain situations. Unlike the Hopfield net, however, here the
behavior to be taught only entails fixing the values of a subset of the neurons,
those which serve as the inputs and the outputs. Along with the fact that
the input neurons do not evolve but remain constant throughout the running
of the machine, this is what allows the Boltzmann machine to be viewed as
a generalizer and respond with outputs completely different from any with
which it was taught. As with Hopfield nets, a magnetic energy function is
usually chosen: Ec; = - Li<j JijO"fO"y + Li OiO"1- Jij is usually taken to be
symmetric. For the Boltzmann machine the determination of the Jij from the
learning set is more complicated than in the Hopfield case. Define a cross-
entropy error function G = Lc; Pt In [Pj) Pi], where a runs over the set of
input and output states with a given input vector fixed on the input units,
pt is the desired probability of state a, and Pi is the actual (equilibrium)
probability of state a when the machine is running. By inspection, when G
is minimized (to 0), pt = Pi and the system will definitely evolve to a state
which has the desired output units when the input units are held fixed at
the corresponding values. A number of techniques can be used to minimize
G, gradient descent being perhaps the most common. When it is desired to
have the learning set taught to the net consist of more than just one input
vector/output vector pair, the G to be minimized is modified by summing
over all elements of the learning set.
Another neural net generalizer which is perhaps the simplest (and cer-
tainly the most widely researched) is the technique of backpropagation. Like
the Boltzmann machine, this net has a subset of the neurons designated in-
put neurons and another subset designated output neurons. Also like the
Boltzmann machine, a fermi function (or, in general, any sigmoidal function)
is used to do the evolving. However, here the neurons do not only take on
discrete values and the evolving is not stochastic: O"i(t) = !(Lj JijO"j(t - 1))
is the updating rule. Usually the Jij are chosen so that the connectivity is
that of layers of neurons feeding forward into one another, starting from the
input layer and proceeding all the way to the output layer. As in the previ -
A Mathematical Theory of Generalization: Part I 181

ously described neural nets, output is when the system reaches a steady st ate,
which in t his case means after n cycles, n being the number of neuron layers.
"Backpropagation" refers to the teaching technique of this scheme. The Jij
are found by minimizing an energy (i.e., error) function via gradient descent .
The energy funct ion is usually chosen to be the sum of the squares of the
differences between desired and actual output states for a given input st at e.
Any Ji j which constitute a zero of the energy function will give the desired
out put for any input chosen from the set of input /output pairs making up
the provided learning set . Since the net is feedforward, it is a simple matter
to wri te down the function relating the Ji j and the ou tput of the net, given
any input . T herefore, although such a function is not directly invertible, it is
t rivially easy to ru n an iterat ive procedure to approximate the Jij which give
zeros of the energy function by analytically calculating the functional form
of the gradient of t he function energy(Ji j ) and then using a steepest descent
pr oced ure . Again as wit h the Boltzmann machines, since a backpropagated
net can resp ond in a novel way to a novel input, it constitutes a generalizer.

A p p endix C.

A precise d efini tion of in t elligence


If desire d, the definit ion of general int elligence in section 1.2 can be made
mor e pr ecise (and, accord ingly, more narrow) . An intelligence task can be
defined in terms of the qui ntuple of a set of possible decisions, 6., a set of
possible results, 0 , one of which is delineated as being the desired result, a set
of pr ovided informat ion , I p r ov , a set of nonprovided information, l reUs., and a
function f taking l reUs., I p rov , and a single element b E 6. to an element w EO .
The intelligen ce task for the syst em is to pick the b which, when fed to f along
wit h I p rov an d l reU s., produces the desired element of O. The system's decision
is required t o be based only on f and I p r ov ' The system is not provided with
l reUss ' (In t he langu age of sect ion 2 of t he second paper, the mapping taking
l reUss an d I pr ov t o the (indu ced) pro ject ion of f taking 6. to 0 is a "method."
This method is found from f , wit h l reUss and I p r ov forming the defining set .)
On the one extreme, f could be relatively independent of the contents of
l reUss ' in which case t he intelligence task essentially reduces to the problem
of inverting f. On the other extreme, f could be a fairly straightforward
fun ction , easy to invert , but highly dependent on the contents of l reUss ' In
t his case, the int elligence task is essentially the problem of making a best
guess for l reUss based on I pro v an d f . Given the concept of an intelligence
task , a syst em's general intelligence can be defined as a measure of how well
the syst em perfor ms at arbi trary intelligence tasks. Many different measures
of that performance can be used, of course. Perhaps the simplest is to tally
the percentage of intelligence tasks at which the system succeeds. Another
possible measure requires a metric giving how close any element in the result
set is t o the desired result. Then a measure of a syst em 's general intelligence
could be t he average distance of the syst em's guess from the desired result.
182 David N. Wolpert

A ppendix D .
Discussion of inner products among ge n e r a liz er s
Some interesting mathematics result from trying to define a vector space in
which generalizers live and then defining a mapping meeting all of the criteria
usually required of an inner product and which takes pairs of vectors from
that vector space to elements of an appropriate field (see reference [50]). One
way to go about this is to have the infinite set of numbers making up t he
"pseudo" inner product serve as the inner product in the vector space of
generalizers. To do this, define the field F to consist of all countably infinite
sets of real numbers: if a E F , then a = {aI, a2, . . .} where all a; E R. For a
and b both E F, we say that a = b iff ai = b, for all i. Similarly, we say that
a > or < b iff a, > or < b, for all i. (Note that with these definitions, it is
no longer true that either a > b, a = b, or a < b for all a and b - the field is
only partially, not totally, ordered.) Add ition in F in performed component
by component: a + b = {al + b1, a2 + ~, .. .}. The add it ive ident ity is simp ly
{O, 0, . . .}. F differs from the standard L 2 Hilbert space in that, be ing a
field, multiplication of elements in F is define d. As wit h addition , elements
in F mu ltiply by component: a X b = {a1b1,a2 b2 " .. }. The mult iplicati ve
identity therefore is {I, 1, . .. }. (Note that if F were to be viewed as an
infinite vector space over R, this definition of multi plicat ion would not b e a
proper tensor operation, since after appropriate rotations a X b could always
be made to be either parallel to a or parallel t o b.) Closure, the existe nce
of multiplicative inverses (for elements of F with no components equal to
0), commutativity, and the distributive law all follow, confirming that F is
indeed a field. Int uit ively, F is simp ly an infinite set of R fields all acting in
parallel- there is no interaction between components of an element of F .
Now that we have an appropriate field, we can define the vector space
in which generalizers live. Simply put, generalizers live in the same space
as the elements of F, except that their components are functions from R"
to R, rather than just elements of R. This space is a vector space over F ,
just as the set of all well-behaved (i.e., D') functions from R" to R is a
vector space over R. a E F multiplied by the generalizer {g{1} ,g{2 }, . . .} =
{alg{l} , a2g{2}, .. .}. Addition of generalizers goes exactly as addition of
element s of F. All the usual restrictions on legal vector spaces follow from
these definitions (see reference [50] again) .
Using Dirac notation, we can now define the inner product between two
generalizers IG > and IH > to be the elemen t of F {(g{l }lh{l} ), (g{2}lh{2}),
. . .}. Except for t he fact that the inner product as defined th is way does not
live in R, it obeys all the usual restrictions on inner products. As a result ,
bo th the t riangle inequalit y and Schwar tz 's inequality hold. An or t hogonal
basis for the sp ace of generalizers consist s of th e generalizer s {B ib 0, 0, . . .}

for all i, {O, B i2 , 0, ... } for all i, etc., where "0" denotes the function , and
B i j denotes a complete orthonormal set of basi s functions for the mappings
from Rj(m+l) +m to R. The elements of this bas is will be generically denot ed
as IBi j . If the i are discrete, we can expand any generalizer IH > as
A Mathematical Theory of Generalization: Part I 183

IH > = Eij IBij > (B ij IH) . A similar expression holds for cont inuous i.
Although easy to work with, it should be noted that these IB ij > are not
normalized. Nor can they be normalized simply by multiplying them by
the appropriate constant, since the multiplicative identity is {I , 1, . . .}, and
therefore the 0 components in the IB ij > make their multiplicative inverses
undefined. To construct an orthonormal basis, we would have to take the
(nonorthogonal) basis {B il, B i2, . . .} and apply Graham-Schmidt.
As opposed to this formalism using an infinite number of components, it
is possible to view generalizers as a vector space over the reals rather than
over F. IG > + IH > still = {g{l} + h{I},g{2} + h{2} , .. .}, but now
alG > = {ag{I},ag{2 }, .. .}. The difficulty with this new formalism is in
trying to define an inner product mapping two generalizers to an element of
the reals in such a way that the magnitude of a generalizer IG > is finite even
if all of its component functions g{i } have nonze ro magnitude. One way to
do this is to define
(GIH ) = lim E?-l (g(i )lh(i )) .
n-+oo n
The usual req uired properties of inner products follow, assuming that the
limit defining t he inner product exist s for the generalizers in question. Un-
fortunately, t he mag nit ude of the inner product of anyone of the basis vec-
tor s IB ij > with another generalizer is O. We can no longer write IH > =
E ij IB ij > (B ij IH). Graham-Schmidt (for example run over the set of ele-
ments from {B il, B i2, .. .} which have well-defined inner products with one
another) has to be used just to form a basis allowing us to write this kind of
expansion, rega rd less of any questions of overall normalization.

App endix E.
Basic p roperties of LMMGs
In this appendix a rigoro us definition of LMMGs is given and a property of
them which is used in appendices G and F is proven.
Make the rigorous definition that an LMMG is a generalizer which works
by taking an orde red basis set of infinitely differentiable nonzero functions,
<PI (T), <P2 (T), . .. E q>, and for the given learning set B forms the linear com-
bination of bas is functions, Ef: l ai<Pi(T), aN =f 0, which has as sma ll N as
po ssible and yet still goes t hrough every element of the B. T he guessed output
of th e generalizer for a question q is given by Ef:l ai<Pi (if). Any generalizer
like "fit the learning set with a power series" or "fit the learni ng set wit h a
sum of Bessel functions" or t he like is an LMMG .
First not e that aN mus t be unique. To see t his, first note that if Ef: l
n,
ai<Pi(T) reproduces a learn ing set of pairs {( i'j, Yj t hen the matrix equation
<p}ai = Yj must hold, where implied sum notation is being used and where
t he matrix <p} is defined to have as elements t he values <Pi (i'j). Now if the
set of {a d which repro duce the learni ng set is not unique, we have that
<p} (ai - a D = 0 where at least one of the ai differs from the corresponding ai
184 David H. Wolpert

Therefore, indexed by i, the vectors rP) are linearly dependent. This means
that t here exist s some value I such that rPj can be written as (3irP) where
(3, = O. Therefore we can write

If in particular aN =1= aN, we can t ake I = N, and therefore have just


derived a formula for a linear combination of the first N - 1 rP;('r') which
reproduces the learning set. Since by hypothesis N is as small as possible,
this is impossible, so aN indeed must equal aN. If the remaining coefficients
are not unique, then we can choose between any two candidate sets {ad
and {a'i} according to any arbitrary method, so long as we are consistent in
ap plying it . Here we will require that we choose the set which has the lowest
abs olute value for the highest i in which they differ. For example, choose
{6, 6, - 3, 7, 2} over {8, 9, 4, 7, 2}, sin ce 1- 31 < 141. (Rigorously speaking, it
is only once we have set such a rule for choosing between two candidate sets
t hat we have complete ly defined the LMMG .)

(E.!) For all N and {ad where aN =1= 0 there exists an N-point learn ing set
such that the LMMG chooses those {ad t o fit the learning set.

Proof. Since the rPi(rj are linearly independent, for any N of the rPi (rj there
does not exist a nonzero N -component vector such that if

for all vectors [rPl (rj, rP2 (rj, ... , rP N(rj ] ~.e., for all rj. In other words, the
vect ors [rPl(rj , rP2(rj, ... , rPN (rj ] span R . Therefore, there exist N vectors
[rPl (rj , rP2(rj , .. . , rPN(rj ] which are linearly independent, i.e. for any N there
exists a set of N input values {fi ll:::; j :::; N} such that the N vectors

are lin early independent. Now for any N and any associated set {ad t he
N -element learning set
N N
{ (rl' L airPi (rl )), . . . , (rN, L ai rPi(fN))
i =l i=l

is such that it lies on th e surface I:~1 airP(rj. If aN =1= 0, then it is poss ible
that th e LMMG will fit the learning set using those {ad and therefore th is
surface I:~1 airP(rj . To ensure th at this surface is the actual one which t he
LMMG will use to fit the learning set, it suffices to note that since all t he
rows of our matrix rP) are linearly independent, there doe s not exist any
set of values {an different from t he set {ai} such that the lear ning set can
be reproduced by the surface I:~1 a;rPi(rj. This comp letes the proof of the
lemma.
A Mathematical Theory of Generalization: Part I 185

Appendix F.
Proof that no LMMG is continuous
See appendix E for a rigorous definition of LMMGs and a proof of the fact
that for any N and any set {ad where aN =f 0, we can always build an
N-element learning set () such that the LMMG fits it with a surface of the
form I:f:l aiJi(i) , where the Ji(i) are taken from the basis set of functions
of the LMMG. Build such a learning set , and indicate its elements by (rj, Yj),
where j ranges from 1 to N . Indicate the surface the LMMG fits to () by
f(i)
To proceed in the proof, it is first necessary to establish the following
lemma:

(F.l) Let J~ indicate N - 1 linearly independent vectors of dimension N,


1 :::; i :::; N - 1, 1 :::; j :::; N . Then there exist two indices a and b such
that if J~ is modified by having all N - 1 element s J~ set equal to the
corresponding N - 1 elements J~, then the new set of vectors J~ is still
linearly independent .

To prove (F.l) make an N x N matrix <Il whose first N - 1 rows are


the vectors J~ and whose last row is a vector linearly independent with all
N - 1 of the J~. We know that det(<Il) can not equal O. However, we can
evaluate that determinant by cofactors of the ent ries in the Nth row. At
least one of these cofactors cannot equal 0, since the full determinant does
not equal O. This means that there is a column index c such that if we make
an (N - 1) x (N - 1) matrix <Il' from the first N - 1 rows of <Il by removing the
cth column from those rows, then det( <Il') =f O. This means that the vectors
making up the rows of <Il' are linearly independent. Now make a new matrix
of N - 1 rows and N columns by taking <Il', duplicating one of its columns,
and appending this column to the right-hand side of <Il'. The vectors formed
by the rows of this new matrix are still linearly independent. However, this
matrix is exactly the one made by modifying J~, as referred to in (F.l) .
Take our original learning set () and modify it continuously by sliding
point a of the learning set (i.e., the point (rj=a' Yj=a)) toward point b (i.e.,
toward the point (rj=b,Yj=b)) along the surface f(i). At any point during
this sliding the LMMG can fit the (modified) learning set with the first N
Ji(rj), just by fitting it with f(i) . Therefore we know that at any point in
the sliding the LMMG will fit the learning set with the first m :::; N of the
Ji(rj).
Now by (F.l), we can choose a and b so that when the two points have
been slid on top of one another, the N - 1 vectors Ji(rj) formed by taking
the first N - 1 J;'s of the LMMG and evaluating them at the N points
of the learning set are still linearly independent. However, we know that
when the two points are on top of one another the first N vectors Ji(rj)
must be linearly dependent, since now the determinant of Ji(rj) = O. The
only way this is possible is if the vector IN(rj) is expressible as a linea r
186 David H. Wolper t

combination of the first N - 1 r/>i(fj ). Since we know that the LMMG can
fit the modified learning set with (at most) N of the first r/> i(fj ), this linear
dependence tells us that the LMMG will in fact fit the modified learning set
with only m :::; (N - 1) of the first r/>i( fj ), when point b has been moved on
top of p oint a.
So we know that there must be a point during this sliding at which the
LMMG suddenly stops fitting the learn ing set with the first N of t he r/>i (fj )
and instead fits it with the first N - 1 of them. However, all of t he r/>;('rj are
linearly independent of one another. This means that any linear combination
of t he first N - 1 of the r/>i (fj) must differ from all linear combinations of t he
first N of the r/>i (fj) by a nonzero amount for some value of f . This means
th at the LMMG's g{N} mus t be a discontinuous function of its arg ume nt s
when t he arguments are at this point of transition from N function fitting
to N - 1 function fitting . This concludes the proof of t he prop osition.

Appendix G .
Proof that only polynomial LMMGs obey (2.6)
See appendix E for a precise definition of LMMGs. To begin our proof that
only polynomial LMMGs obey (2.6), it is easy to show that output translation
invar iance forces r/>1 (f') to = 1. Let D indicate t he support of r/>1 (f'). By
definition of LMMGs, D is nonempty. Now consider any point fE D . Let
our learning set be {f, r/>1 (f') =f O} . Output-translating our learning set by .A
we get the new learning set {f, r/>1 (f') + .A }. By output-translation invariance,
g{l}(f,r/>l(f'),q) = g{l}(f,r/>l (f') + .A ,q) - .A for all if. However, since g{l} is
part of an LMMG, g{l}( f ,r/>l (f'),q) = r/>1 (q). Moreover, out pu t space scalin g
invariance requires that

In other words, r/>1 (q) is a nonzero constant, k, independent of if, and without
loss of generality we can set k = 1.
Now let T>. indicate anyone of the (parameterized) transformations of
requirement (2.6). For example, we indicate the operation of out put space
tr anslation invariance by the (nonlinear) operator T>. where T>. [f( f') ] = .A +
f (f') . Then it follows that

(G.1) For a given bas is set of functions r/>i(f') E qi, if the LMMG ba sed on qi
obeys invariance under T>. then for all sets of real numb ers {ad an d all
N (where aN =f 0) there exists a set {aD and an N' (where a N' =f 0)
such that T>. [L:~l air/>i (f')] = L:f::l a:r/>i (f').
A Mathematical Theory of Generalization: Part I 187

To see (G. l) first recall from appendix E that since the i(r') are lin-
ea rly ind ependent of one another, for all Nand {ad there exists an N-point
learning set such that the LMMG chooses those {a;} to fit the learning set .
Next note that by the hypot hesis of meeting (2.6), th e curve T>. [L:~l aii(r')]
rep roduces the transformation of this learni ng set and also makes t he t rans-
formed quest ion-guess mapping . It is unique in doing this, an d therefore
invariance of the LMMG under T>. means that it must be the same as t he
curve generat ed by ru nning the LMMG off of the transformed learning set.
This LMMG-generat ed curve can always be written as L:;::1 a:i(r'), however,
which proves the supposition.
Note that 1(r') = 1 is comp letely in accord with (G .l) . Also note that if
a learning set {rj,Yj} is fit by an LMMG with the curve L~l aii(r'), then
the learning set {rj,>'Yj} will be fit with L:~l>.aii(r') = >'L~laii(r') .
Therefore LMMGs au tomat ically obey output scaling invariance.
The proof t hat there is only one set <P whose LMMG obey the invariances
of (2.6) and that that set is the set of all polynomia ls works by showing
that there is only one set <P which meet (G.l ) for input space invariances.
Note, however, that (G.1) is only a necessary cond ition for the invariance
of an LMMG un der T>. . Fortunately, in what follows the sufficient condi-
tio n (namely, that the given basis set <P, which obeys T>' [L~l ai i(r')] =
L:;::1 a:i(r; for all transformations T>., is such that the associated LMMG
obeys invariance under all the T>.) will be obvious. Therefore, in the exposi-
tion be low sufficiency will not be explicitly illustrated.
Wit hou t loss of generality we can take N' :::; N in (G.1) when we are
dealing with a single >. - if N' > N, then T;l [L:;::l a:i(r')] = L:~ 1 aii(r'),
and since our (paramet erized) transformations are gro ups, T;l = TN for
some ).' (in practice we usually parameterize the gro up so that >" = ->.
so that our groups are explicitly Lie groups). Therefore by simply flipping
which we call the t ransformed space and which we call the untransformed
space we can have N ' :::; N .
In the analysis below, however, we are going to need to make use of
an even stronger condition on the N ' . Instead of just choosing between
T>. and T_>. to ensure N' :::; N for a single x, we will require N' :::; N for all
>. E [- bo>' ,bo>']' with bo>' > o. In other words, as we move through [-bo>' , bo>']
our choice of T>. or T_>. (to ensu re N ' :::; N) must not change.
This stronger con dit ion is not difficult to meet, fortunately. Since the
i( r') are all linearly independent, as before t here exists a learning set () with
input coordin ates r~ , r-;, . . . , rN such that the LMMG working from this ()
will create a curve L:~ 1 aii(r') where (in particular) aN is nonzero. Now
assume t here exit s an open ball of nonzero radius living in R m N (t he vector
space delineati ng all datum space inputs components for the m-dimensional
LMM G 's g{N}) and centered on the input components of (), such that the
matrix i(i'j)(l :::; i,j :::; N) has nonzero determinant for all sets {i'j} con-
tained within that ball . Then any T>. moving () to any point within that
ball will have N ' :::; N. (aN might = 0, for example, which is why N' is al-
lowed to be < N. T he import ant point is that there is no linear dependence
188 David H. Wolpert

among the transformed i(i'j) which might force N' to be > N.) Therefore
our "stronger condition" will be met if this assumption holds. To prove this
assumption it is sufficient to disprove its negation. To do this, note that if
there does not exist any such ball of nonzero det(i(i'j)), then since we know
that det(i(i'j)) is nonzero at (), there must exist an axis in R m N such that
the set of points along that axis with 0 det( i(i'j)) contains the intervals [x , (})
and (() , y], where x is () shifted a nonzero amount in one direction and y is
() shifted a nonzero amount in the opposite direction. (At first it might be
thought that the statement that no such open ball of nonzero radius encloses
() could be satisfied if, for example, the set of points with 0 det( i(i'j)) in the
vicinity of () were [x, (} ) and either (z,y ] or [z,y], where z is () shifted by a
nonzero amount in the y direction. Note, however, that if this were the case
then any learning set lying between () and z would lie wholly within a ball
of nonzero det( i(i'j)). (I am assuming here that it is only this one axis in
R m n which has the property that det(i(i'j) ) flips from being 0 to nonzero
as you move along it at the point (). The argument if there is more than one
such axis goes similarly.) So for our purposes of finding any such ball we can
safely remove all such situations from consideration .) Now note however that
det (i (i'j)) is a continuous function of the i'j. Therefore it is impossible for it
to be 0 in [x,()) and (8, y] but nonzero at 8. Therefore our assumption of the
existence of such a ball of nonzero det( i( i'j)) is justified, and our "stronger
cond ition" can be satisfied. In what follows we will always take our learning
set to be in the center of such a ball of nonzero det (i (i'j)), and all T>. will
keep us within that ball. (Note that we make no restrictions on the value of
q, however.) Therefore we will always be able to take N' ~ N .
Now examine the invariance of scaling and translation in the input space.
Rearranging (G.1), and assuming our learning set is in the center of a ball of
nonzero det(i (i'j)), we immediately get

(G .2) For a given basis set of functions <1>, if the LMMG ba sed on <1> obey s
(2.6) then there exist s a function K( a, b, n, i) such that n(ar + b) =
L:i=l K (a, b,n, i )i(T) for all r. a and b are assumed close enough to 1
and IT respecti vely to ensure that we are within t he ball mentioned in
the previous paragraph.

Note that although the transformation of the learning sets is restricted


to being within the ball, the restrictions on the i(T) given by (G.2) app ly
for allr.
K (a,b, n,i ) is a differentiable function of a and b everywhere that (G.2)
holds. For example, to see differentiability of K with respect to the compo-
nents of b, re-express (G.2) in an abbreviated format (for the particular case
where a is fixed and close to 1) as n(r + b) = Ki(b)i (T), where implied sum
notation is being used. Then
A Mathematical Theory of Generalization : Part I 189

for any vector r*. In particular, since the >i(T) are all linearly independent,
as before we can choose Nr*'s so that det( >i(r*)) 1- O. This means that there
exists a matrix Ai such that Ai>i(r*) = 51. Therefore

oK/(b) I~ __ o[Ai>n(r* + b)] 1__


obm b=O - obm b=O'

Since the >i(T) are all differentiable functions , the right-hand side of this
vector equation is well defined. This means that for any I, K(a,b,n,l) is a
different iable function of all the components of b, assuming a and bmeet the
conditions given in (G.2). A similar proof holds for showing that K(a, b, n, i)
is a different iable function of a .
This different iab ility of K is what allows us to delineat e the set of all
<I> whose associated LMMG obeys (2.6). To see th is first examine the case
where the input space is one-dimensional. >2(X + b) = K( 1,b,2,2)>2(x ) +
K (1, b,2,1 ). Now since the >i(X) are linearly independent, K (l,O, m, n ) =
5(m, n), the kronecker delta of m and n. This allows us to subtract t erms
like K (1, 0,2, 1) at will, so we can write

lim[ >2(x + b) - >2(X) ]


b--+O b
~ ( )dI{(1,b,2,2) 1 dI{(1,b,2,1),
'+'2 x db b=O + db b=O

The two derivatives with respect to bin this equation are arbitrary constants.
If the first of them = 0, we have the solution >2(X) = Ax + B (A and B
be ing arbitrary constants) . Otherwise >2(X) = Ae Bx + C (A, B, and C
being arbitrary constants). This second solution can be discarded, however,
since scaling invariance would require that for any real number a, >2( ax) =
Ae Bax + C be exp ressib le as a linear combination of Ae Bx + C and 1, which
is impossible. Therefore anyone-dimensional LMMG obeying (2.6) taking
a two-element learni ng set fits that learning set with a straight line. This
result is in accord wit h (2.7).
T hese kin ds of arg uments can be iterated to get all t he other >i(X) :
d>3 (X )
--a:;;- = a>3(x ) + bx + c.
a = 0 gives >3(X) = Ax 2 + Bx + C. a 1- 0 ha s t he solution >3 (X ) =
eax [Je-ax(bx + c)dx + D] . T his solution can not E <I> if D 1- 0, due t o th e
req uiremen t of input space scaling invariance. However , t he remaining t erms
in this solut ion are linearly dependent on >2 (x) an d >l (x), and therefore must
be discarded. Therefore >3 (X ) = Ax 2 + Bx + C. Similarly, via induction
>n(x ) = Ei=o ai xi, or, without loss of generality, >n(x) = z ". Note that if we
were not requiring scaling invariance we would also have as legal <I>'s those
whose individual member functions are polynomials + exponentials.
Now let us consider the case where the input space dimension m is > 1.
>1(T) still (can be taken to ) = 1. Now, however, we get p.d.e.'s rather t han
190 David H. Wolpert

o.d.e.'s for the other <pi(f'). For example, we get a~w') = A i<P2(f') + Bi, where
i ranges over the coordinates of the input space . If, for i = a, Ai = 0, then

(note there is no implied sum over the repeated a's in this equation) . If all
the Ai = 0, <P2(f') = Bi ri + C (implied sum over the repeated i) . Let Ai =I-
for i = a. Then

and as before we cannot meet scaling invariance. So all the Ai must = 0,


and <P2(f') = B r + C.
Next up we get

where this B is the same as the one in the solution for <P2( f'). Without loss
of generality we can set all E, to O. This is because if Eo< did not equal for
a particular a, then

<P3(f') = eIEQTQ] [H(rl,r2, . . . ,ro<_bro<+I, "')
J
+ e[-EQTQ](Fo<B . r + Go<)dro<]'
Scaling invariance forces H(rl' r2, . . . ,ro<_!> ro<+!, " ') to = 0, as usual, so we
would have

which is linearly dependent on <PI (f') and <P2(f'), contrary to the requirement
that ill be a basis.
If in addition to the E; all the F; = 0, then we get <P3(f') = Gr + H. Since
<P3 (f') is linearly independent of <P2 (f') and <PI (f'), Gand B cannot be parallel.
We can similarly set all but the last constant to 0 in the next m - 2 sets
of p.d .e.'s (for <P4(f') through <Pm+!(f') ), and then without loss of generality
choose B, G and all the other constants so that <P2 (f') = rl, <P3 (f') = r2, ... ,
all the way up to <Pm+!(f') = r m.
For the next set of p.d.e.'s (those concerning <Pm+2(f')) we cannot continue
this procedure of setting all but the last constant to 0, since <Pm+2( f') must
be linearly independent of <PI (f') through <Pm+! (f'). Therefore
A Mathematical Theory of Generalization: Part I 191

where for all i, III i- O. This results in 1>m+2 being a quadratic polynomial
of the components of r.
A priori, we are not justified in setting all those constants to 0 in the
p.d.e.'s for 1>3(r') through 1>m+I (r'). It is entirely possible, for example, that
1>3(r') could have one or more of its F, i- 0 (which would result in 1>3(r') being
a quadratic polynomial) and yet all but the last constant in the p.d.e .'s for
1>4(r') could be 0, so that 1>4 (r') would be one of the remaining linear functions.
In other words, the ordering of the elements of ~ is not necessarily according
to the power of the individual polynomials. It is only in the one-dimensional
case where the requirements of scaling and translation invariance force us to
order the basis set of polynomials according to the power of the polynomials.
Indeed, for the multidimensional case it is even possible that a given linear
function appears nowhere in ~, but can be constructed by forming a linear
combination of polynomials from ~ each of which has power> 1.
However, note that there is an important invariance from (2.6) which we
have not yet taken into account: invariance of the mapping under rotation of
the input space . Indeed, no basis set ~ for an input space having dimension
> 1 can obey this invariance. This is because as was shown above, if we
have a two-element learning set, in general it gets fitted with a linear com-
bination of 1>1(r') = 1 and 1>2(r') = jj . r, where jj is an arbitrary constant.
For any such fixed jj though, it is impossible to express the rotation of the
surface fitted to the two-element learning set with a new linear combination
of 1>1(r') = 1 and 1>2(r') = jj . r. Rotation invariance is violated by LMMGs.
However, note that, for example, if the 1>3(r') through 1>m+I (r') are indeed
the linear functions, then for learning sets of cardinality m + 1, we do get
rotation invariance, since such a learning set will be fit with a hyperplane
al + I:~~l airi, and the rotation of such a hyperplane is just another hyper-
plane, a; + I:~~l airi' We can therefore make a watered-down version of the
requirement of rotation invariance: if the ')'(i) are the successive values of N
such that the rotation of any linear combination of the first N elements of
~ is expressible as another linear combination of those first N elements of
~, then we require first that ')'(2) -')'(1) be as small as possible, then that
')'(3) - ')'(2) be as small as possible, and so on through all')'U) - ')'U - 1).
Trivially, ')'(1) = 1. Since 1>2(r') is necessarily linear, to meet this "modified
requirement of rotation invariance" for ')'(2) - ')'(1) we do indeed have to
have 1>3(r') through 1>m+I(r') be linear functions of the rio As above this then
forces 1>m+2( r') to be a quadratic, and so we have to have all the other linearly
independent quadratics (there are m(m - 1)/2 of them) coming next in ~.
Iterating this argument, it is clear that

(G.3) The only basis sets offunctions ~ whose associated LMMGs obey the
invariances of (2.6) (with the requirement of rotation invariance in the
input space modified as above) are the sets of functions
192 David H. Wolpert

where notationally (x, y, . . . , z) means all possible bases which span the
same space as t he functions x, y, . . . , z.

T his concludes t he pro of of the proposition . Again, note that if we do


not make any requirements conce rn ing rotation invariance at all, t hen all we
can say is t hat <I> must be a set of polynomials, arranged in any ord er, which
spans t he space. Similarly, if we do not require scaling invari ance in t he input
space, t he individu al elements of <I> ca n have exponent ials added to th em .

A p p endix H .
The npa generalizer
The npa generalizer begins its calculation of t he output corresponding to
any given question by computing t he input space vector ra from the ques-
tio n to the point in the learning set nearest the question. (In gener al, all
vect ors are here assumed to be rela t ive t o the question.) Next it forms an
m-dimensional open hypersphere cent ered on the question with radius 2lf o l.
Lab el the position vectors of the p oint s from the learning set which fall within
th e hyp ersphere as ri. Assume there are n of them, so 0 ~ i ~ n - 1, where
it is assumed that lril 2 lri-ll for all i. Label the associated outputs by m i '
Let f n be any point 21ral away from the question. Now define the value of
th e scalar field p(rj to be the median of the mj whose associated positions
obe y lfil ~ If I In other words, at any point f, p(rj is the average of the
smallest and the largest m i which lie inward of f . The output value t he
gener alizer guesses for th e question is given by [~~~ I p(rjdmr] / [~~~~o l dmr].
If one of the m i changes by an infinitesimal amount Sm, t hen its greatest
possible effect on the generalizer's guess would be if it's at ra, and is (say)
t he largest m i throughout th e whole hyp ersphere. In thi s case, p(rj every-
where increases by at mos t 8m / 2 (8m if t his point happens to be the only
one in t he hyp ersphere) , and therefore the generalizer's output increase by
8m /2. Con sequently, this generalizer is continuous in the mi . Furthermore,
alt hough p(rj is in general a discontinuous fun ction of f and th e ri, ~~~I dmr
is a cont inuous function of the ri, as is ~~~ol p(rjdmr, since any infinitesimal
variat ion in ri can make a noninfinitesimal change in p(rj only over an in-
finitesim ally thin (hyper)shell. As a result, the output of the gene ralizer is
a continuous fun ction of the ri. Similarly, an infinitesimal variation in the
question, being equ ivalent to an infinitesimal variation in all the ri, can only
result in an infinitesimal vari ation in the output . This verifies requirement
(2.1) of properness . Now note that if we translate all the mi up or down by
some constant k, p(rj is translated by the same k, and t herefore the ou t-
put as a whole is translated by k. Similarly, if we multiply all the mi by a
scaling factor z, t he output gets multiplied by z . Since the output is purely
a function of metric distances, which are tensor scalars, the generalizer is
aut omat ically invariant under rotation, parity and translation. Invarian ce
under nonzero scaling of all th e inputs and the question follows from how t he
A Mathematical Theory of Generalization: Part I 193

output's normalized. This completes the verification of requirement (2.6) .


Requirement (2.2) is immediate, since no a priori ordering is assumed of the
datum spaces. Requirement (2.4) is a restriction on the learning set, not
the generalizer. Given a learning set, as the question approaches any point
in the learning set lral and therefore 21ra l shrink to 0, by requirement (2.4)
and continuity all points within the hypersphere approach the value mo, p(r')
approaches mo everywhere within the hypersphere, and the output of the
generalizer becomes mo. Therefore reproduction of the learning set is as-
sured, and requirement (2.3) is met. As one of the T; becomes arbitrarily
close to T;-l> mi approaches mi-l> and p(r') gets arbitrarily close to what it
would be if the point at ii had never been an element of the learning set.
This verifies requirement (2.8), and therefore properness in general.

Appendix I.
Proof that all LMMGs are output linear
Assume we have one learning set, (}l> consisting of the n pairs

and another learning set (}2 consisting of the n pairs (Xl, YD, ... (x n , y~), where
the vectors X j might be multidimensional, in general. Assume further that
our LMMG fits (}1 with the surface 2:~1 O'.i>i(r') and fits (}2 with 2:;::1 O'.:>i(r') .
So 2:~1 O'.i>i(Xj) = Yj and 2:;::lO'.:>i(Xj) = yj for all n Xj. Without loss of
generality we can take N ~ N'. We can then extend the set of 0'.: to those i

between Nand N' (including N) by simply setting 0'.: = for N ~ i > N '.
So t he LMMG fits (}1 with 2:~1 O'.i>i(r') and fits (}2 with 2:~1 O'.:>i (r'). Now
it is obvious that
N N N
A L O'.i>i(r') + B L O'.;>i(r') = L(AO'.i + BO'.;)>i(r')
i =l i=l i=l

goes through the points {x i Ay j + By]} for all j ~ N and ~ 1. If we can


prove that the LMMG will actually pick these coefficients AO'.i + BO'.: to fit
the learning set ()' consisting of the points {x j, Ay j + Byj }, then we will have
proven that the LMMG is output linear. Since no particular attributes were
assumed of the LMMG, we will have proven in fact that all LMMGs are
output linear .
So first we must prove that it is not possible for the LMMG to fit the
new learning set without using the function >N(r'). (If the LMMG could fit
the learning set without using >N(r') it would do so, by definition of LMMGs
(see appendix E), and therefore would not generalize from ()' with the surface
2:~l(AO'.i + BO'.;)>i(r').) In other words, we need to prove that if (}1 needs
to be fit with >N(r') and (}2 needs to be fit with >N'<N(r'), then ()' = (}1 + (}2
needs to be fit with >N(r'). To prove this assertion it suffices to disprove its
negation. If ()' could be fit using only functions with ordinality < N, after
194 David II. Wolpert

rearranging terms we would see that th ere must exist a set of coefficients fJi '
1 :s: i < N, su ch that
N N-l
I>~i <Pi ( Xj ) = L 13i<Pi (Xj)
i= 1 i =1

for all Xj in 81 However if this were true, then we could have fit 81 with only
th e first N - 1 of the <Pi(rj. This is contrary to our initial assumption that
th e LMMG act ually fit 81 using the first N of the <Pi(rj , so it is true that the
LMMG cannot fit 81 without using <PN(rj.
It was shown in appendix E that the coefficient the LMMG fits to <PN (rj
is unique, so that coefficient must indeed equal Aow+BaN. So now we must
prove that the LMMG will also choose the values Aai<N + Ba ;<N (as t he
coefficient s of the basis functions <Pi<N(rj) to construct a surface which fits
81 To carry out such a proof we need to assume some means for t he LMM G
to choose between two (or more) possible sets of coefficients which both fit a
given learning set and which both have the same N. Here I will assume the
choosing scheme outlined in appendix E.
Examine the (N - l)th coefficient which t he LMMG uses to fit A8 +
B81 : 13N-l. (We alread y know that fJN = AaN + BaN. ) In genera l eit her
the vector {<PN-l (7"1), <PN-l (1"2) , . .. ,<PN-l(r:.n is linea rly independent ofthe
vectors {<Pi(i'i), <Pi (1"2) , ... , <Pi (r:.n,
1 :s: i < N - 1, or it is linearly depen dent
on them. If it is linearly dependent on them, then we can set fJN-l equa l
to o. In light of our choosing method, in point of fact we have no choice
but to choose the fJi such that fJN-l = 0 if this linear dependence holds .
Similarly, if we have this linear dependence then both aN- l and aN- 1 must
= o. Therefore, in t his sit uation, 13N- l does indeed = AaN-l + BaN_I .
Now examine the case where the vector

is linearly independent of all the vectors

Hypo thesize that the 13N-l which reprod uces the learni ng set 81 is not unique ,
and represent the alternative set of coefficients as 13:, 1 :s: i :s: N - 1.
Since by hypothesis of nonuniqueness fJN-l =f 131.-1 and yet z=f:;i1 fJi<Pi(fj ) =
z=f:;i 1 13:<Pi (fj) for all fj in the learning set, we can write

as a linear combination of the vectors

Such an equality, however , is in violation of our assumption of linear indepen-


den ce. Therefore our hypothesis of nonuniqueness must be wrong, and if we
do indeed have linear independence then we can conclude that 13N-l must be
A Mathematical Theory of Generalization: Part I 195

unique. Since we know that the coefficient AaN_l + Ba~ _l is a legal coeffi-
cient ofthe 4>N_lTterm, we can in fact conclude that f3N-l = AaN-l +Ba~_l
for this case of linear independence as well as for the previously explored case
of linear dependence.
These same arguments can be used for all the remaining coefficients the
LMMG fits to A8l + B8 2 The result is clearly that f3i = Aai + Ba; for all i.
This concludes the proof that LMMGs are output linear.

Ap pendix J.
Output linear H ERB IEs a nd n eural nets
To see that neural net neu rons satisfy the restrict ions on h for n = 2, output
linear, one-dimensional HERBIEs, note that for any output linear generaliz-
ers,

h( x + b, x' + b, x" + b, . . . , q + b) h (ax, ax ', axil, . . . , aq)


h( x , x', x", . . . q) ,

due to freedom in picking the y, y', . . . and invariance under scaling and trans-
lating the input space. Since the equality holds for arbitrary a and b, if the
x, x' , x", . .. are not all the same as one another then the value of h is the
same everywhere on the plane in (x, x', x", . . . q) space formed by allowing a
and b to take on all possible values. (For x = x' = ... = q, varying a and b
only traces out a line, not a plane.) For an n = 2, one-dimensional HERBIE,
restriction (2.5) in fact requires that x i: x', so the generalization is indeed
determine d by such iso-h planes. So given any triplet of values X, X' i: X ,
Q in (x, x', q) space, we have a unique iso-h plane. Now any such plane has to
intersect the line x = -1/2, x' = 1/2, as can be seen by solving for a and bin
the pair of equations aX + b = - 1/2, aX'+b = 1/2 (since X i: X ', we always
get such a solution for a and b). Since a and b are therefore uniquely specified
by X and X', Q = aQ +b, the q value of the inte rsection of the plane with the
line x = -1/2, x' = 1/2, is specified uniquely. Because any point on an iso-h
plane completely specifies that plane, we can therefore use (-1 /2,1 /2, Q) to
specify the plane instead of (X,X',Q) - the plane specified by (X,X',Q)
is t he same as the locus of points ( -a' / 2 + b', a' /2 + b', a' Q/2 + b') for all
a' and b'. As a result, t he iso-h pla nes form a one-parameter set, with Q
being t he paramet er. Now all points (x,x' ,q) on any given iso-h plane will
agree on t he valu e for w = 1/ 2 - (q - x )/ (x ' - x ), and t his function speci -
fies a un ique Q valu e for t he int ersecti on of t he plane containing t he points
with t he line x = -1 /2, x' = 1/2. Therefore w uniquely specifies the iso-
h plane. Under interchange of x and x', w becomes - w. Since from before
h( x ' , x, q) = 1 - h(x, x', q), for all x, x' i: x, and q, h( - w ) = 1 - h(w). In par-
tic ular since h( q, x' ,q) = 1, we can write h(w = 1/2) = 1, and h( -1/2) = o.
Also h(O ) = 1/2 . This of course is exactly the behavior exhibited by fermi
fun ct ions (for app ropriate rescaling of the w so that w = 1/2 becomes w = 00
and w = - 1/2 becomes w = - (0 ), step functions and most of the other
196 David H. Wolp ert

t hres holding functions used to model neur ons in neu ral nets. Note t hat since
there exist an infinite number of functions h( w) ob eyin g h( w ) = 1 - h( -w)
and h(1j 2) = 1, the rest rict ion of output linearity is not enoug h t o force
un ique generaliza t ion of HERB IEs. Also note th at most of t his analysis ap-
plies equally well t o the g{i } of HERBIEs seen as fun ctions only over t he
Xl, X2, .. . ,q.
To place the iso-h concept in a lar ger context, view t he direct pr od-
uct of the input components of the datum spaces for a given ord er g{i },
(x, x', x" , . . . , q), as defining a vector space V . T hen t he invariances of re-
striction (2.6), in addition to being viewed as referri ng to each component of
a vecto r in V undergoing coordinate t ransformations wit hin its own pri vate
R m space, ca n also be viewed as referring to coordinat e transformations in
the whole space V . The iso-h planes are planes of invari an ce in V. If we now
construct a space V' with element s (y, y', y", .. . ,guess) in exac t ly th e same
way we const ru ct ed V , then da t um space interchang e symmetr y is an invari-
ance of g{i } under certain simultaneous reflections (those which flip ar ound
t he axes ) in b oth V an d V'. Although it has not yet been don e (m ainly
becau se it 's hard to see what its physical significan ce would be) it might be
mat hemat ically interesting to explore the consequences of requiring that the
g{i} be invariant under mor e gener al sets of simultaneous reflections in V
and V' . For output lineari ty, wher e th e g{i} can be viewed as a sort of dual
vect or field, such an inva rianc e amount s to a symmetry under simultan eous
reflect ions of the underlying ma nifold across which the field is define d and
t he accompanying transformation of the ind ividual vector spaces living on
t ha t manifold.

Appendix K.
Proof that for n = 2, the solution to (2.12) is a straight line or an
exponential
Clearly both a linear f( x) and an exp onential one are solu tions to this new
restrictio n . To prove uniqueness of such these solutions, we want t o solve for
the set of all f (x ) which obey (2.12) and give an h obeying the coordinate
transformation invariances required by t he generalizer's being a HERBIE.
To do this , first exploit the symmet ries allowed all HERBIEs to transl ate
an d scale the input space and then translate the output space so that any
origina l triple of points (x,f(x)), (x',f(x ')), (q,J(q)) becomes (0,0), (a, b),
(c, d), resp ectively, where the original x, x', and q are assumed chosen such
that c> a > 0. Due to the fact that th e value of h is unchanged if its three
arguments are all translated by th e same amount, if q now refers to a point
a little to the right of c,
f(q) -f(q -(c -a)) = d -b = f(c) _l.
(K.1)
f(q-(c-a)) -f(q- c) b f(a)
We can ignore th e case wher e both d and b = 0, since by (2.12) this means
= 0. By cont inuity, f(x) is nonz ero everywhere
f( q) is t he straight line f(q)
A Mathematical Theory of Generalizatio n : Part I 197

in a noninfinitesimal neighborhood of whichever point a or c obeys f (x) -=I- O.


Without loss of gen erality, we will from now on as sume t hat both a and c
h ave b een chos en to be from this region of f (x ) -=I- O. Defining 8 to be q - c,
t he amount by whi ch the input space was t rans la ted in m aki ng (K.1), we can
solve (K. 1) for f (q), getting

f(q) = [J(8 + a) - f( 8)]J(c)/ f(a ) + f(8) .


Now change a -+ a + E, with E small enoug h so that f (a + E) still does not
= O. This results in
f(q) = [J(8 + a + E) - f(8) ]f(~(~ E) + f (8).
Equating the two formulas for f(q),

f(a )[f (8 + a + E) - f (8)] = f(a + E)[J(8 + a) - f(8) ]. (K.2)

This must be true for all 8 and (small enough ) E. Let 8 an d E b oth b e
small and expand both sides of (K.2) to second order in 8 and E. We recover
f(O) = 0 (which we knew already) and for t he 8E t erm get f (a)f"(a) =
f '(a)[f'(a) - 1'(0)]. This must be true for a whole range of a, so setting 1'(0)
equal to a con st ant , a, and let t ing a var y, we get a seco nd-order nonli near
differ ential equation :

f(x)j"( x) - (f'( x)? + af'(x) = O.


Now change variables to u(y) == y'( x)/y( x) . This t ransforms our differenti al
equ at ion to d"y = - ~ y
. This has t he solution u == i. y
= !!y + {3, {3 being an
arbitrary real constant . If (3 = 0, t he solu ti on for y( x) is a straight line. If
{3 do esn 't equal 0, the solution is ie{Jx - a/ {3, wh er e plugging in to equation
(K.2) yie lds i = a/ {3.

References
[1] D. Wolpert, "A benchmark for how well neural nets generalize," Biological
Cyb ernetics , 61 (1989) 303-315.

[2] D. Wolpert, "Using a mathematical theory of generalization to build a gen-


eralizer superior to NETtalk," Neural Networks, in press.

[3] D. Wolp ert, "Generalization, surface-fitting, and network st ructures," Ph.D.


thesis, University of California, 1989.

[4] D.E. Rumelhart, et al., Explorations in the Microstructure of Cogni tion,


Vol. I and II (MIT Press, Cambridge, MA , 1986) .

[5] D.E . Rumelhart , et al., "Learning repr esentations by back-propagating er-


rors," Nature , 323 (1986) 533-536.
198 David H. Wolpert

[6] T .J. Sejnowski and C.R. Rosenberg, NETtalk: A Parallel Network that
Learns to Read Aloud, Johns Hopkins University Electrical Engineering and
Computer Science Techni cal Report JHU / EECS-86/01, 1986.
[7] R. Linsker, "Self-organi zation in a perceptual network," Computer, 3 (1988)
105-117.

[8] T .J . Sejnowski, "Neural net models of visual processing," Short Course on


Computational Neuroscience (Society for Neuroscience, Department of Bio-
physics , Johns Hopkins University, 1987).

[9] D.O . Hebb , The Organization of Behavior (Wiley, New York, 1949).

[10] O.S. McCullough and WQ . Pitts, "A logical calculus immanent in nervous
activity," Bulletin of Mathematical Biophysics, 5 (1943) 115-133.

[11] Andrew Moore, Personal communication, 1988.

[12] B. McNaughton, et al., "Hippocampal synaptic enhancement and informa-


tion storage within a distributed memory system," Trends in Neurosciences,
10, 408-415 (Elsevier Publications, Cambridge, 1987).

[13] The Brain, A Scientifi c American Book (W.H . Freeman, San Francisco,
1979).

[14] P.L. Carlton, A Primer of Behavioral Pharm acology (W .H. Freeman, San
Francisco, 1983).

[15] J .W. Clark, "Statistical mechanics of neural networks," Physics Reports,


158 (1988) 91-157.

[16] D. Amit, et al., "Spin glass models of neural networks," Physical Review A,
32 (1985) 1007-1018.

[17] B. Derrida, et al. , "An exactly soluble asymmetric neural network model,"
Europhysics Letters, 4 (1987) 167-173.

[18] J.J . Hopfield, "Neural networks and physical systems with emergent col-
lective computational abilities," Proceedings of the National Academy of
Science, 79 (1982) 2554-2558.
[19] H. Gutfreund, "Neural networks with hierarchically correlated patterns,"
UCSB ITP preprint NSF-ITP-86-151 (1986).

[20] A. Dembo, et al., "General potential surfaces and neural networks ," Physical
Review A, 37 (1988) 2134-2143.

[21] J.J . Hopfield and D.W. Tank, "Neural computation of decisions in optimiza-
tion problems," Biological Cybernetics, 52 (1985) 141-152.

[22] G.V. Wilson and G.S. Pawley, "On the stability of the traveling salesman
problem algorithm of Hopfield and Tank," Biological Cybernetics, 58 (1988)
63-70.
A Mathematical Theory of Generalization: Part I 199

[23] 1. Parberry, et al ., Relating Boltzman n machines to convent ional m odels


of com p utation , Pennsylvania St at e University Department of Comput er
Science Technical Report CS-87-07, Ma rch , 1987.

[24] M. Minsky and S. P apert, Perceptrons (MIT Press , Cam bridge, MA , 1969).

[25] S. Kirkpatrick, et al ., "Optimization by simulat ed annealin g," Science, 229


(1983) 671-680.

[26] A. Bergmann, et al ., "Breeding intelligent automat a," P ro ceedings of th e


IEEE Conference on Neural Nets (San Diego, 1987).
[27] A. Lap edes and R. Farber, "Non -line ar signal processing using neural nets:
Prediction and system mod eling," Los Alamos preprint LA-UR-87 -2662
(1987) .

[28] T . Smith, Personal communication, 1987.

[29] R. Solomonoff, "A formal theory of inductive inference: I and II," Informa-
tion and Control, 7 (1964) 1, 224.

[30] S. Watanabe, "Inductive ambiguity and t he limit of artificial intelligence,"


Computational Intelligence, 3 (1987) 304-309.
[31] N. Littlestone, A. Bergmann , and P. Caiani e1lo, Per sonal communication,
1988.

[32] T . Poggio and staff, MIT AI Lab, "MIT progress in underst andings," to be
published in Proceedings of th e Image Underst anding Work shop , L. Bauman,
ed . (SAl Corporation, McLean , VA, 1988).

[33] T . Poggio, et al., "Compute r vision and regularization theory," N ature, 317
(1985) 314-319.

[34] Computational Intelligence, 3 (1987) special issue on machine learning.


[35] L.G . Valiant, "A theory of the learnable," Comm unications of the A CM , 27
(1984) 1134-1142.

[36] N. Littlestone, "Lear ning quickl y when irrelevant attributes abound: A new
linear-threshold algorithm," Ma chine Learning, 2 (1988) 285-318.

[37] R. Duda and P. Hart, Pattern classification and scene analysis (Wil ey, New
York , 1973) .

[38] V. Cappellini an d R. Marconi, eds., Advances in Image Processing and Pat -


tern Recognition: Proceedings of the International Conference P isa, Italy,
December, 1985 (Elsevier, 1986) .
[39] J.D . Farmer an d J.J . Sidorowich, "Exploiting chaos to pr edi ct the future an d
reduce noise ," Los Alamos preprint LA-UR-88-901 (1988).

[40] A. Lapedes and R. Farber, "How neural nets work, " Proceedings of th e 1987
IEEE Denver Conference on Neural Net works.
200 David H. Wolpert

[41] E.W. Packel, et al., "Informat ion-based complexit y," Nat ure, 328 (1987)
29-33 .

[42] P. Garcia, et al., "Local langu ages, t he successor method, and a step to-
war ds a general methodology for the inference of regular gramma rs," IEEE
Transactions on Pattern Analysis and Mac hine In telligence, pami-9 (1987)
841-845.

[43] K.S . Fu , et al., "Grammatical infe rence: Int roducti on and survey parts 1 and
2," IEEE Transactions on Systems , Man , and Cy berne tics, 5 (1975) 95-111 ,
409-423.

[44] C. Thompson, et al., "On limits t o perception and pat tern recogni tion ,"
Institute for theoretical physi cs at Santa Bar bara rep ort, UCSB IT P pr eprin t
NSF -ITP-87-127, 1987.

[45] T .M. Mitchell, "Generalization as sear ch," Artificial Intelligence, 18 (1982)


203-226.

[46] J.E. Hopc roft and J .D. Ullman, Introd uction t o A uto m ata Th eory, Lan-
g uages, and Com p utation (Addison-Wesley, 1979).

[47] Holland, J ohn , Adap tation in Nat ural and A rti ficial Sy stem s: A n Introdu c-
tory Analy sis with App lications t o Bi ology, Control, and A rtificial In telli-
gence (University of Michigan P ress, Ann Ar bor , 1975).

[48] E.T. J aynes, "On the rational e of maximum-entropy methods," Proceedings


of the IEEE, 70 (1982) 939-952.

[49] F . Reif, Fun dame ntals of Statistical Physics and Th erm al Physic s (McGraw-
Hill, 1965).

[50] G. Birkhoff an d S. Maclane , A S urvey of Modern A lgebra, 4t h ed. (Macmillan


P ubli shing Co., New York, 1977).

[51] R . Wald, General R elati vity , (Univer sity of Chicago P ress , Chicago, IL,
1984) chapter 2.

You might also like