A Mathematical Theory of Generalization: Part I
David H. Wolpert
Theoretical Division, Center for Nonlinear Studies,
Los Alamos National Laboratory, Los Alamos, NM 87545, USA
detailed outline of the rest of these papers, presented in terms of the preceding discussion of generalization.

Here and throughout these papers, generalization is assumed to be taking place without any knowledge of what the variables involved "really mean." An abstract, model-independent formalism is the most rigorous way to deal with this kind of generalizing.
Section 2 of this paper begins with a mathematically precise definition of generalizers. It then goes on to explore some of the more basic properties that can be required of generalizers and elucidates some of the more straightforward mathematical consequences of such requirements. Some of these consequences (e.g., no linear mathematical model should be used to generalize when the underlying system being modelled is known to be geometric in nature) are not intuitively obvious.
The paradigm here and throughout these papers is to make general, broad requirements for generalizing behavior, and then see what (if any) mathematical solutions there are for such requirements. This contrasts with the usual (extremely ad hoc) way of dealing with generalizers, which is to take a concrete generalizer and investigate its behavior for assorted test problems. In these papers, the behavior defines the architecture, not the other way around. The other way of trying to build an engine which exhibits "good" generalization is, ultimately, dependent to a degree on sheer luck.
Sections 1 and 2 of the second paper present and explore some sets of restrictions on generalization behavior which are more sophisticated than those found in section 2 of the first paper. The first of these new restrictions, explored at length in section 1 of the second paper, is the restriction of "self-guessing" in its various formulations. Intuitively, self-guessing refers to the requirement that if taught with a subset of the full training set, the generalizer should correctly guess the rest of the training set. One of the more interesting results concerning self-guessing is that it is impossible to construct generalization criteria which, along with self-guessing, specify unique generalization of a learning set. (Any particular set of criteria will always be either under-restrictive or over-restrictive.)
Section 2 of the second paper then discusses the restriction of information compactification, which can be viewed as a mathematically precise statement of Occam's razor. Particular attention is drawn to the fact (and its consequences) that at present there is no known way of making an a priori most reasonable definition of information measure in a model-independent way.
Finally, section 3 of the second paper is a partial investigation of the real-world efficacy of the tools elaborated in the first paper and in the first two sections of the second paper. References [1-3] consist of other such investigations and show that these techniques far outperform backpropagation [4, 5] and in particular easily beat NETtalk [6]. The tests and investigations presented in section 3 of this second paper are intended to be an extension and elaboration of the tests presented in [1-3].
"This then is the measure of a man - that from the particulars he can discern the pattern in the greater whole." [U. Merre, from Studies on the Nature of Man]
1. Background
Neural nets is a field that has recently started to attract interest from many separate disciplines, including physics (especially condensed matter physics), mathematics, computer science, neurobiology, and cognitive science. In addition to the intrinsic mathematical interest in investigating networks of nonlinear mappers, there are several other reasons why neural nets are proving so popular. In this section some of these reasons are discussed, with particular attention paid to the reason of ability to generalize. The usefulness of this ability is discussed, and then an outline of this series of papers, which constitute an investigation into the problem of generalization, is presented from the perspective of the discussion of the importance of the ability to generalize.
principles which must be discovered if neural nets are to make any significant addition to the field of AI.
Clark's search for such principles, based on statistical mechanics, touches on the third usual motivation for investigating neural nets, their similarity to spin glasses. This is the traditional entry point into the field for physicists [16, 17]. Unfortunately, although much has been learned of how to model neural nets as spin glasses, no new principles concerning intelligence have emerged from this work to date.
In addition to these three broad reasons for interest in neural nets, there are several features of neural nets which, although of more minor breadth, do constitute concrete results and not simply motivational generalities. The first and perhaps most famous of these grew out of the spin-glass approach to the field. This is Hopfield's [18] (and subsequent researchers' [19, 20]) idea of using neural nets as a means of implementing associative memories in parallel. (See appendix A for a review of how Hopfield's scheme works.) Although Hopfield's associative memory scheme does indeed work, it should be noted that it does not provide any novel insights into how to construct an artificial intelligence. What it does provide is a possible means of making particularly fast associative memories. It constitutes a potential speed-up of something we can already do (build associative memories), but not an entirely novel skill.
Another contribution by Hopfield is the realization that his associative memory neural nets can be used as a means of finding approximate solutions to optimization problems quadratic in their arguments [21]. Recently, the reported efficacy of this technique has come under attack [22]. However, even if the attacks are mistaken and the technique ends up proving useful, as with neural net associative memories it will only constitute a potential speed-up of something we can already do (find approximate solutions to optimization problems), not an entirely novel skill.
There is one last application of neural nets which, in addition to being more than just a speeding up of something we can already implement through other means, is actually exhibited by neural nets that are up and running right now. This application is the ability to generalize exhibited by Boltzmann machines [4, 23], perceptrons [24], and in general all feedforward neural nets made via backpropagation [4, 5], simulated annealing [25], or any other technique which does not try to restrict the net's output to the set of attractors typical of Hopfield-type neural nets (and, indeed, typical of all associative memories). (See appendix B for a review of Boltzmann machines and the technique of back-propagation.) That is, it is the ability of such neural nets to be taught with a learning set of input-output pairs, be fed a novel input as a question, and then to form a (hopefully) reasonable response to that question, a response which might differ from all of the outputs in the learning set of input-output pairs [26].¹ It seems that in this ability to generalize neural nets can in fact do something that few other current systems can do, or at least which few other current systems can do as well with as broad a range of applicability. Since it is the most widely studied neural net generalizer, it is the technique of backpropagation that is taken to be the archetypical neural net generalizer for the purposes of this series of papers. It is the hope of the neural net community that this (or some other) neural net scheme somehow picks out the important information from the learning set when generalizing and therefore generalizes well (see reference [4]).

¹These nets have the advantage over Hopfield-type associative memories in that they can respond to a query with an output not contained in those which make up the learning set. Without such an ability you are not fully generalizing. You are only classifying. In a certain sense, a Hopfield-type associative memory is nothing more than a crude, parallel, Bayesian classifier, since its "basins of attraction" are nothing more than the set of vectors strongly correlated with one another.
1.2 Generalization
A generalizing system has very many potential uses, the most obvious of which lie in the field of statistics. Indeed, as has also been pointed out by Wolpert [1, 2], Lapedes and Farber [27], and by Smith [28], generalizing neural nets can be viewed as simply a novel kind of statistical curve fitting. Clearly, a solution to the problem of how "best" to generalize from a learning set when no a priori assumptions are made about the type of function doing the generalizing would have many ramifications for the (nonparametric) statistics problem of how best to generalize (sic) from experimental data to the best model to fit the data.
In addition to this potential application to statistics (and therefore all experimental science), a solution to the problem of how best to generalize from a learning set might be helpful at the task of image reconstruction (the "learning set" here being incomplete information about the full image). It might also be helpful for approximation theory and therefore for numerical analysis. Indeed, at the risk of being overspeculative, a successful solution to the problem of generalization could even be helpful in the broadest scientific task of all, the task of inferring the "best" theory to explain a set of observations.²
In addition to these applications, the ability to generalize well also has many potential uses in the field of AI. To understand one such use, first note (as is pointed out in [1] and [2]) that in an abstract sense the goal of AI research is to create a system which meets certain sets of "human-like" criteria. In general, the broader the criteria a system satisfies, the better. When these criteria are specified very precisely - "given a, respond with b" - then the system created to meet them constitutes a database. As a next step it is straightforward to incorporate into the system an algorithm for making logical inferences from the provided criteria, resulting in what is essentially an expert system. Insofar as databases, being a set of responses to queries, can be viewed as input-output functions over an appropriate space, such expert systems are just extensions of the domain of the original
²See reference [29] for an interesting attempt at solving this last task which might have some applications to the problem of generalization.
database. The logical inference algorithm used has "broadened" the (human-like) criteria defining the properties of the system.

Unfortunately, such logical inference algorithms cannot extend the domain of the system to include the whole of the input space. A purely deductive algorithm cannot solve the problem of inductive inference [30]. This is not a trivial shortcoming. Insofar as it is impossible to explicitly list all of the behavioral attributes of human intelligence, it seems that any artificial system that will be able to fully mimic human behavior will have to extrapolate that full behavior from a core subset of (externally provided) behavioral attributes. It would therefore seem that any AI system that can successfully pass the Turing test will have to be able to form an extension of an initial set of behavior criteria. It will have to be able to generalize from the original criteria.³
To understand the second reason that generalization is crucial to AI, first define a system's general intelligence as a measure of how well it performs at an arbitrary cognitive task for which it is provided limited information, where either the task and/or the information is novel. The system is explicitly told beforehand how its performance is being measured. In an academic setting, the limited information might be a chapter in a math book, the novel task might be a problem set based on that chapter, and the performance measure might be the percentage of problems from the problem set done correctly. At a more primitive level, the limited information might be a novel visual image, the task might be to pick out any creatures which might be predators from that image, and the performance measure might be whether the system gets eaten or not. Alternatively, the limited information might contain a set of images with all predators in them pointed out, as well as the novel visual image. The more generally intelligent a system, the larger the number of (task, information) pairs at which it performs well and the better it performs at them (see appendix C for a more mathematically precise formulation of this definition of intelligence).
Now define generalization as the ability of a system to take a learning set of input-output vector pairs from a particular pair of input and output spaces, and, based only on that learning set, make a "good" guess as to the output vector corresponding to an input vector not contained in the learning set. The guessing is explicitly independent of any information not contained in the learning set, such as what the input and output spaces "really mean."
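The definition just given can be made concrete with a minimal sketch. The nearest-neighbor rule below is my own illustrative stand-in for "a generalizer," not a method proposed in these papers: it sees only the learning set and the question, with no knowledge of what the variables "really mean."

```python
def nearest_neighbor_generalizer(learning_set, question):
    """A toy generalizer: guess the output of the learning-set input
    closest (in Euclidean distance) to the question."""
    def dist2(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    _, guess = min(learning_set, key=lambda pair: dist2(pair[0], question))
    return guess

# Learning set of input-output vector pairs; the question is a novel input.
learning = [((0.0, 0.0), 1.0), ((1.0, 0.0), 2.0), ((0.0, 1.0), 3.0)]
print(nearest_neighbor_generalizer(learning, (0.9, 0.1)))  # prints 2.0
```

Note that this particular guesser can only return outputs already present in the learning set, so in the strict sense used in these papers it classifies rather than fully generalizes; it is meant only to pin down the interface a generalizer must have.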
Loosely speaking, generalization is the ability to perceive and extrapolate patterns in a learning set. It is clear then that in addition to traits like facility with logic and the ability to reason by analogy with other, successfully completed tasks, the ability to generalize well from limited information (i.e.,
³Note that as opposed to a conventional "inside-out" system in which the internal structure is externally (and overtly) provided, such a successful mimic of human behavior would be "outside-in" in the sense of only having behavioral criteria externally provided, with the internal structure determined implicitly as a means of meeting those criteria (along with their extrapolation).
1. Approximation theory and regularization theory [32, 33] can be viewed as attempts to deal with the problem of generalization when one knows what the input and output spaces "really mean."

2. The field of machine learning [34-36] consists to a large degree of various schemes for how to generalize in certain strictly limited contexts (e.g., having Boolean-valued variables).
3. Bayesian classification [37], and in general all of the work on classification in image processing [38], is similar to the problem of generalization,⁴ the major distinction between the two being that in classification the set of possible answers to questions is strictly controlled and finite (indeed, often having only two elements).

⁴Note that the problem of "making a best guess for I_miss given I_prov and J," discussed in appendix C, is a generalization problem.
4. Time-series analysis and prediction theory, especially as extended to chaotic systems by Farmer [39] (and consequently, using backpropagated neural nets, by Lapedes and Farber [27, 40]), can be viewed as an ad hoc approach to a limited version of the full problem of generalization.

5. Information complexity theory [41] can be viewed as dealing with a generalization problem modified by adding to the condition that they have to reproduce the learning set some extra, rather stringent problem-dependent conditions on the allowable generalizations.
6. The work with local languages [42], with grammatical inference [43], with mathematical criteria for associaters [44], and with finding the minimal high-level language set of statements to reproduce a learning set [45] can also all be viewed as dealing with problems related to that of generalization, but not with the full problem itself.
Not dealing directly with the full problem of generalization, these approaches do not provide any set of tractable mathematical criteria for good generalization. Accordingly, the criterion in common use for judging how well an algorithm generalizes is to simply test the performance of the algorithm on lots of learning sets created with a "correct" generalization in mind. The accuracy of the algorithm in guessing this "correct" generalization from the learning set is taken to be a measure of how well it generalizes. (Although the field of machine learning can make use of more objective criteria in measuring generalization efficacy, since it is restricted to situations in which the number of possible input-output mappings is finite, its applicability to real-world problems is limited.) Clearly, this is an unsatisfactory state of affairs, having little if any potential to meet the promise of a full theory of generalization.
of those criteria. Some examples of such criteria are invariance under various kinds of coordinate transformations of the input and/or output spaces, having the generalization of part of the learning set guess the rest of it (self-guessing), and information compactification (i.e., Occam's razor). All the generalizers commonly used at present can be categorized in terms of which of these and other similar criteria they meet. Section 2 of this first paper and sections 1 and 2 of the second paper in this series of papers describe generalization theory in more detail, spelling out the categories created by various sets of criteria and investigating the problem of finding a set of criteria which specify a unique generalization of any learning set.
To quantify the advantages of neural nets as generalizers, it is important to develop a benchmark for measuring the efficacy of a system's generalizing from a learning set. Without use of such a benchmark, any claims for neural nets having "emergent and adaptive intelligence" (beyond that which follows simply from their ability to form associative memories) are vague at best, and misleading at worst. Such a benchmark should be very easy to construct, flexible, quick, and possess simple-to-analyze generalization behavior. (It is important to realize that the unanalyzability of neural nets is not reason to believe they generalize well, but rather is just an impediment to seeing precisely how they go about trying to generalize.) The generalization theory criteria discussed in the first paper which most naturally lead to such a benchmark are those defining a HERBIE (HEuRistic BInary Engine). In essence, a HERBIE benchmark generalizer is a simple surface-fitter which creates an input-output surface which goes through all the points of the learning set and has the whole of the input space as its domain. A precise definition of a certain kind of benchmark HERBIE (a hyperplanar HERBIE), exploration of its behavior, and use of it to benchmark the generalization of neural nets makes up reference [1]. As is discussed in [1-3], the hyperplanar and other kinds of HERBIEs have been found to be very helpful in benchmarking generalization, and actually beat backpropagated neural nets rather decisively in the tests performed so far.⁵
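The hyperplanar HERBIE itself is defined in reference [1]; as a sketch of the two defining properties just stated (the fitted surface passes through every point of the learning set, and it is defined over the whole input space), here is an inverse-distance-weighted interpolator of my own choosing, not the benchmark from [1]:

```python
def interpolating_herbie(learning_set, question, p=2):
    """A surface through all learning-set points, defined for every question:
    inverse-distance weighting (a stand-in, not the hyperplanar HERBIE)."""
    num, den = 0.0, 0.0
    for x, y in learning_set:
        d2 = sum((a - b) ** 2 for a, b in zip(x, question))
        if d2 == 0.0:
            return y  # the surface passes through every learning-set point
        w = d2 ** (-p / 2.0)
        num += w * y
        den += w
    return num / den

learning = [((0.0,), 0.0), ((1.0,), 1.0)]
print(interpolating_herbie(learning, (0.5,)))  # 0.5, by symmetry
print(interpolating_herbie(learning, (1.0,)))  # exactly 1.0
```

Any surface-fitter with these two properties qualifies as a benchmark generalizer in the sense above; which one is "best" is precisely the question the generalization criteria are meant to address.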
Despite its usefulness as a benchmark for measuring generalization, however, there is no reason to believe this HERBIE and the criteria defining it will be the best generalizer for an arbitrary learning set. It is not the definitive answer to the question, "If I have a given learning set and several different algorithms which can generalize from it, which one do I choose?" Some of the other generalization criteria which constitute a possible answer to this question are those making up self-guessing and information compactification, which are presented in the second paper. Discussion of some systems approximately obeying self-guessing and information compactification, along with examples of some tests of such systems, makes up section 3 of the second paper. In addition to utilizing evolutionary (as opposed to neuro-) biology, such systems tend to make use of parallel distributed network structures. In
⁵"Beats" here meaning what it does in the present-day literature: HERBIE does better at guessing the "correct" input-output mapping the researcher had in mind when making up the learning set.
this sense, this series of papers comes full circle, concluding with an exploration of what can be thought of as very sophisticated, feedback, variable-number-of-iteration neural nets.
to guesses, and then to investigate some of the simpler sets of generalization criteria. The rest of the current section constitutes such an investigation.
to the halting internal state of the TM.⁶ The number of possible internal states of a given TM sets i, and the rule table for the TM sets both the learning set and how the g{i} generalize from the learning set.
In addition to this relation with TMs, any scheme for building neural nets from learning sets can be viewed as a generalizer, though for some such schemes the requirement of the continuity of the g{i} has to be relaxed. Since any g{i} can be approximated by an appropriate generalizer, it is also true that any g{i} can be approximated by a neural net. Similar comments hold for genetic algorithms and classifier systems [47]. A final, and perhaps most important, example of a generalizer is any statistical algorithm for finding a curve which has in some sense the least variation from a set of data points. In this case, the learning set simply consists of the data points.
(2.2) Every g{i} is invariant under permutation of the datum spaces.

This invariance, being a simple relabeling, is on surer ground than many of those which will be described later. A second restriction is that the g{i} must form a database:
(2.3) If, for any g{i}, the value of the question is the same as the value of an R^m entry in one of the datum spaces, the output of the function is the corresponding R entry from that datum space.
For example, g{2}(x, y, x', y', x) = y for all x, x', y, and y'. One can investigate generalization schemes in which the data in the learning set is suspect and therefore should not necessarily be followed, but they will not be considered here. For single-valuedness, (2.3) necessitates the following restriction on the domains of the g{i}:
(2.4) For any g{i}, if any two datum spaces have the same value for their R^m entries, then they must have the same values for their R entries.
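Restrictions (2.3) and (2.4) together say that a generalizer must behave as a consistent database on the learning set itself. A small wrapper (my own illustration, with a hypothetical base generalizer) makes this concrete:

```python
def as_database(base_generalizer):
    """Enforce (2.3)/(2.4): reproduce stored answers, and treat learning sets
    with equal R^m entries but unequal R entries as outside the domain."""
    def wrapped(learning_set, question):
        table = {}
        for x, y in learning_set:
            if x in table and table[x] != y:
                # (2.4): such learning sets lie in the "holes" of the domain
                raise ValueError("outside domain of definition (2.4)")
            table[x] = y
        if question in table:
            return table[question]  # (2.3): answer straight from the data
        return base_generalizer(learning_set, question)
    return wrapped

# Hypothetical base generalizer: just average the R entries.
mean_of_outputs = lambda ls, q: sum(y for _, y in ls) / len(ls)
g = as_database(mean_of_outputs)
print(g([((0.0,), 1.0), ((1.0,), 3.0)], (0.0,)))  # 1.0, by (2.3)
print(g([((0.0,), 1.0), ((1.0,), 3.0)], (0.5,)))  # 2.0, from the base guesser
```

The ValueError branch is the computational face of (2.4): rather than return two different answers to one question, the wrapped generalizer is simply undefined there.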
⁶From this perspective a TM is simply a function taking a two-dimensional input vector to a two-dimensional output vector which is then fed back into the input. One component of the output determines when the process stops, and the other determines the output value of the TM as a whole. Insofar as it is just such an iterated map, it is interesting to note that a TM has the potential to exhibit what is in some sense the richest possible behavior, namely chaos. In the same spirit, it is interesting to note that the Halting Problem in computer science seems to parallel the nonlinear science problem of finding when a chaotic orbit alights within a certain region of the appropriate state space.
For example, g{2}(x, y, x, z, w) where z does not equal y is outside the domain of definition for the generalizer. This induces an odd structure on the domain of definition of any g{i}, turning it into R^{i(m+1)+m} minus an uncountably infinite number of holes. With the usual definition in R^n of an open set being a union of open hyperspheres, the domain of definition is part open, part closed. For example, if one datum space of a one-dimensional generalizer has the values (x', y'), then another datum space has the following domain of definition: {x ∈ (−∞, x'), y ∈ (−∞, +∞)} ∪ {x = x', y = y'} ∪ {x ∈ (x', +∞), y ∈ (−∞, +∞)}. No union of open circles can give this structure around the line x = x'. In general, unless otherwise indicated, the domains of definition of the g{i} will always be considered to be the full Euclidean space minus the holes caused by (2.4).
In addition to the restrictions listed above, it is usually helpful to impose the following restriction on the domain of the g{i}:

(2.5) Unless i, the order of the learning set, exceeds m, the dimension of the generalizer, g{i} is not defined. Even if i > m, g{i} is not defined if the values of the R^m entries of the datum spaces all lie on the same (m − 1)-dimensional hyperplane.
2.3 HERBIEs
The first such additional restriction which comes to mind is to assume invariance of the g{i} under certain coordinate transformations. A number of the techniques used in classification, image processing, and the like obey such an invariance, under certain conditions. For example, for certain kinds of constraints on the allowed form of a differential probability distribution
for arbitrary coordinate transformations T and T'. However, it can be required (for example) that the g{i} obey a scale invariance. More precisely, if, for any g{i}, in each datum space the R coordinate is scaled by a real, nonzero factor k, the output R of that g{i} is scaled by k. Notationally, if g{i} takes (R^m × R × R^m × R × …) → R, then it also takes (R^m × kR × R^m × kR × …) → kR in the same way. Intuitively, this restriction of scale invariance says that generalization should not be dependent on the output dimension's units.
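Any generalizer whose guess is a weighted average of the learning-set R entries, with weights depending only on the R^m entries, satisfies this output-scale invariance automatically, since the guess is then linear in the R entries. A quick numerical check on one such generalizer (my own example, not one from the text):

```python
def weighted_average_g(learning_set, question):
    # Weights depend only on the R^m entries, so the guess is linear in the
    # R entries; scaling every R entry by k therefore scales the output by k.
    ws = [1.0 / (1e-9 + sum((a - b) ** 2 for a, b in zip(x, question)))
          for x, _ in learning_set]
    return sum(w * y for w, (_, y) in zip(ws, learning_set)) / sum(ws)

ls = [((0.0,), 2.0), ((1.0,), 5.0)]
q, k = (0.3,), 3.0
scaled_ls = [(x, k * y) for x, y in ls]  # rescale the output units by k
print(abs(weighted_average_g(scaled_ls, q)
          - k * weighted_average_g(ls, q)) < 1e-9)  # prints True
```

The same linearity argument shows that such weighted-average generalizers also obey the translation part of the output invariance discussed below.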
A similar restriction is to require that if, for any g{i}, for every R^m entry (including the question) there is the same scaling transformation of all the axes (by a nonzero real number), then again the output value is unchanged. In the same spirit it can also be required that if every R^m entry undergoes the same rotation, parity, or translation transformation, then the output value is unchanged. If we make the same requirement of the R
(2.6) Every g{i} is invariant under any rotation, parity, translation, or nonzero scaling transformation, applied simultaneously to all the R^m, including the question, or to all the R, including the output.
(2.7) For any i, if the values of the R entry of every datum space of a g{i} of a continuous HERBIE are equal, the output is this value, regardless of the question.
Proof. Let g{i}(x, y, x', y, …, u) = v for an m-dimensional generalizer (x, x', and u are m-dimensional vectors). The R value of each datum space is y. It is assumed that u does not equal any of the x, x', etc. If it did, then we would have v = y immediately by (2.3) and would be done. By (2.6), g{i}(x, ay + b, x', ay + b, …, u) = av + b for all nonzero real a and b. In particular, the equality holds for b = y. Therefore g{i}(x, y(a + 1), x', y(a + 1), …, u) = av + y. Now take the limit of both sides as a goes to 0. Since g{i} is continuous, the left side goes to g{i}(x, y, x', y, …, u) = v while the right side goes to y, so we get v = y.
There exist a number of other somewhat trivial properties that follow from a generalizer's being a HERBIE. An example of such a property applies to a one-dimensional generalizer's g{2} when the question is halfway in between the two elements of the learning set. Let the resulting guessed output be z: g{2}(x, y, x', y', (x + x')/2) = z. Then, using the coordinate transformation invariances, we get in order:
a vector (a, 0, a, 0, …, a), then by translation invariance we know that that expansion's value must be the same for all values a. Writing out the Taylor series expansion for g{i}, this means that ∑_{i≥1} kᵢ aⁱ = 0 for all a, where the kᵢ are linear combinations of derivatives of g{i} with respect to the variables x₁, x₂, … all the way up to q. Taking derivatives with respect to a of both sides of this equality and then setting a equal to 0, we get that all the kᵢ must equal 0. For example,
k₁ = ∂g{2}(x₁, y₁, x₂, y₂, q)/∂x₁ + ∂g{2}(·)/∂x₂ + ∂g{2}(·)/∂q = ∂g{2}/∂a |_{a=0} = 0.
(Interestingly enough, it turns out that we can gain no additional information from the equations involving k's of order higher than k₁. This is because the equation kₙ = 0 follows directly from allowing the expansion point at which the derivatives are evaluated to vary, taking various derivatives of the equation k₁ = 0 (now, with the expansion point varying, viewed as an equality of functions rather than as an equality of real numbers), and then forming linear combinations of these resultant equations.)
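The constraint k₁ = 0 can be checked numerically on any smooth g{2} obeying the input invariances of (2.6). I use simple linear interpolation as the test case (my choice; it is only one member of the allowed family), and also check the analogous Euler-type constraint that scaling invariance imposes, which is derived next in the text:

```python
def g2(x1, y1, x2, y2, q):
    # Linear interpolation: invariant under common translations and
    # nonzero scalings of (x1, x2, q).
    return y1 + (y2 - y1) * (q - x1) / (x2 - x1)

def partial(f, args, i, h=1e-6):
    """Central finite-difference estimate of df/d(args[i])."""
    hi = list(args); hi[i] += h
    lo = list(args); lo[i] -= h
    return (f(*hi) - f(*lo)) / (2.0 * h)

args = (0.2, 1.0, 1.7, 3.0, 0.9)  # (x1, y1, x2, y2, q)
# Translation invariance: d/dx1 + d/dx2 + d/dq = 0.
k1 = sum(partial(g2, args, i) for i in (0, 2, 4))
# Scaling invariance: x1*d/dx1 + x2*d/dx2 + q*d/dq = 0.
euler = sum(args[i] * partial(g2, args, i) for i in (0, 2, 4))
print(abs(k1) < 1e-6, abs(euler) < 1e-6)  # prints True True
```

Both sums vanish to within finite-difference error, as the invariance argument requires.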
We can similarly make use of the requirement of scaling invariance along with analyticity of the g{i} to derive restrictions on the derivatives of the g{i}. If we expand all the {xᵢ} along with q by a factor b ≡ 1 + a, where a ≠ −1, then we can again collect powers of a and get, for example,
x₁ ∂g{2}(x₁, y₁, x₂, y₂, q)/∂x₁ + x₂ ∂g{2}(·)/∂x₂ + q ∂g{2}(·)/∂q = 0.
(Again, we gain no information from the equations involving higher-order derivatives, since these equations follow directly from this equation involving first-order derivatives.) Combining this equation with the one above for translation invariance, we get three equations of the nature
As it turns out , these and all the other ramifications of the requirements
of (2.6) can be derived without recourse to this derivative-based framework.
This is done by using the following geometric argument to enumerate all
possible HERBIEs.
First examine the case where there are two elements in the learning set and the input space is one-dimensional. Assume y₁ and y₂, the output components of the two datum spaces, are fixed. We want to see how g{2}(x₁, y₁, x₂, y₂, q) varies with the input components, x₁, x₂, and q. Our restrictions on this varying are those of (2.6), namely that if any vector (a, a, a) is added to the vector (x₁, x₂, q), then the output of the generalizer, g{2}, is unchanged, and similarly, if the vector (x₁, x₂, q) is multiplied by any nonzero constant then again g{2} is unchanged. Whatever the g{2} value at a point (x₁, x₂, q), the g{2} value is the same at all points on the plane containing the two lines formed by all translations and by all scalings of this point (x₁, x₂, q). (We are implicitly assuming that we do not have x₁ = x₂ = q, which is the only case where these two plane-delineating lines are identical and therefore do not specify a unique plane. This assumption does not make us lose any generality in the argument, since we already know what the output must be in the case x₁ = x₂ = q.) In other words, the function g{2} taking points in the R³ space of the input components (x₁, x₂, q) to the R space of the generalizer's output is made up of invariant planes.
To calculate the equation of the plane of invariance containing any particular point (x1, x2, q), we solve for the coefficients a, b, and c in the plane-defining equation a·X1 + b·X2 + Q = c (X1, X2, and Q being the three coordinates constituting our R^3 space). Since the plane has to be invariant under identical scalings of all three of the coordinates, we know that c must equal 0. Since the plane has to be invariant under identical translations of all three of the coordinates, we know that a + b = −1. Therefore, our plane is uniquely defined by the single parameter a, which is set by the values x1, x2, and q. Any triplet x1 = x2 = q is contained in all of the planes, regardless of the value of a, so every plane contains the line going through the origin and the point (1, 1, 1). In addition, translation invariance means that every plane contains lines parallel to the vector (1, 1, 1) but which lie off to the side of the line going from the origin through (1, 1, 1). Therefore, looking down the vector (1, 1, 1) toward the origin, we see all of our planes of invariance edge-on, and any one of these planes can be generated from any other one simply by rotating it about this line going from the origin out to (1, 1, 1). We could parameterize the planes by this angle of rotation instead of by the value a if we wished.
Define G(a) to be the function mapping any value of a (i.e., any set of values x1, x2, and q) to the output of the function g{2}(x1, 1, x2, 0, q). By translation and scaling invariance in the output components of the datum spaces we know that g{2}(x1, y1, x2, y2, q) for arbitrary y1 and y2 equals g{2}(x1, 1, x2, 0, q)·[y1 − y2] + y2. In other words, now allowing all arguments of g{2} to take on arbitrary values, g{2}(x1, y1, x2, y2, q) = G(a)·y1 + [1 − G(a)]·y2, where a is set by x1, x2, and q, and G(a) is the dependence of g{2} on a
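The plane parametrization can be made concrete with a short sketch. Linear interpolation again stands in for g{2}; the identification G(a) = −a is specific to that illustrative choice, and `plane_parameter` is a hypothetical helper solving a·X1 + b·X2 + Q = c with c = 0 and a + b = −1.

```python
def plane_parameter(x1, x2, q):
    # a*x1 + (-1 - a)*x2 + q = 0  =>  a*(x1 - x2) = x2 - q.
    return (x2 - q) / (x1 - x2)

def g2_linear(x1, y1, x2, y2, q):
    return y1 + (q - x1) * (y2 - y1) / (x2 - x1)

# For linear interpolation, G(a) = -a (an illustrative identification,
# not a claim from the paper): check g{2} = G(a)*y1 + (1 - G(a))*y2.
x1, y1, x2, y2, q = 0.5, 2.0, 3.0, -1.0, 1.25
a = plane_parameter(x1, x2, q)
G = -a
assert abs(G * y1 + (1 - G) * y2 - g2_linear(x1, y1, x2, y2, q)) < 1e-12

# a is constant on the whole plane of invariance through (x1, x2, q):
t, s = 0.4, 1.9   # arbitrary translation and scaling
assert abs(plane_parameter(s * (x1 + t), s * (x2 + t), s * (q + t)) - a) < 1e-9
```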
HERBIEs (requirement (2.6)). This is the LMMG which works by fitting the
learning set with polynomials. No other basis set of functions gives an LMMG
which obeys scaling and translation invariance in both input and output.
And even polynomial LMMGs, strictly speaking, do not meet input space
rotation invariance. (The proof of all of this is in appendix G.) What this
means is that whenever someone uses a non-polynomial LMMG to generalize,
whether they mean to or not they have an implicit preferred origin, preferred
scaling dimension, preferred orientation, or perhaps even a combination of
all three. Conversely, if you think the system you are modeling with your
generalizer does not have such preferred spatial characteristics, then under
no circumstances should you use an LMMG as the generalizer.
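The claim about implicit preferred spatial characteristics can be demonstrated numerically. The sketch below is an illustration, not the appendix G proof: a two-function LMMG with basis {1, x} is scaling invariant in the input space, while one with basis {1, e^x} is not. `fit2` is a hypothetical helper solving the two-point fit by Cramer's rule.

```python
import math

def fit2(basis, pts):
    # Solve a1*b0(x) + a2*b1(x) = y at the two points (2x2 Cramer's rule).
    (xa, ya), (xb, yb) = pts
    m = [[basis[0](xa), basis[1](xa)], [basis[0](xb), basis[1](xb)]]
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    a1 = (ya * m[1][1] - m[0][1] * yb) / det
    a2 = (m[0][0] * yb - m[1][0] * ya) / det
    return lambda x: a1 * basis[0](x) + a2 * basis[1](x)

poly = [lambda x: 1.0, lambda x: x]
expb = [lambda x: 1.0, lambda x: math.exp(x)]
pts, q, s = [(0.0, 0.0), (1.0, 1.0)], 0.5, 2.0
scaled = [(s * x, y) for x, y in pts]

# Polynomial basis: scaling the input space leaves the guess unchanged...
assert abs(fit2(poly, scaled)(s * q) - fit2(poly, pts)(q)) < 1e-9
# ...but the exponential basis has a preferred scale: the guesses differ.
assert abs(fit2(expb, scaled)(s * q) - fit2(expb, pts)(q)) > 0.05
```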
There are other restrictions besides the various coordinate transformation
invariances of (2.6) and besides the restrictions implicit in the definition
of LMMGs which can be added to the core definition of restrictions (2.1)
through (2.5). For example, along the same lines as requirement (2.3), we
can require upward compatibility of the g{i}:
(2.8) For all g{i}, i > m + 1, if the values of the entries of 2 datum spaces are identical, then the output is equal to the output of g{i−1} working on the rest of the learning set and on one of the two identical datum spaces.
(2.9) A generalizer is said to be output linear iff for all of the g{i}, if g{i}(x, y, x', y', ..., u) = v and g{i}(x, z, x', z', ..., u) = w, then g{i}(x, ay + bz, x', ay' + bz', ..., u) = av + bw, for all a, b, x, y, z, x', y', z', ..., u.
Examples:
g{n}(x1, y1, ..., xn, yn, q) = {Σ_{i=1}^{n} yi / d(q, xi)} / {Σ_{i=1}^{n} 1 / d(q, xi)}
(where d(.,.) is a metric such that for any real constant k, d(kx, ky) is proportional to d(x, y)), all of which are output linear, continuous throughout the domain delineated in (2.4), and intimately related to radial basis functions and kernel-density estimators, will generically be referred to as "metric-based HERBIEs." Note that metric-based HERBIEs, like npa generalizers, can never guess an answer which either exceeds the maximum of the learning set or is less than the minimum of that learning set. This shortcoming is the primary reason for caution in making use of such generalizers. (To see why this shortcoming holds, simply rewrite the output of the generalizer as {Σ_{i=1}^{n} ai·yi} / {Σ_{i=1}^{n} ai}, where all the ai are positive. It is clear that if you take any but the largest of the yi and increase its value, the value of the output must increase, going to the value of that maximal yi when all the other yi have been increased to the value of that maximal yi. In other words, the output is bounded above by max(yi). For similar reasons, the output is also bounded below by min(yi), which concludes the proof.)
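A minimal sketch of a metric-based HERBIE, using d(x, y) = |x − y| as the metric (an assumption made here for illustration), checking the bound just proved. A real implementation would also have to handle the case where the question coincides with a learning-set input, where d = 0.

```python
def metric_herbie(learning_set, q, d=lambda a, b: abs(a - b)):
    # g{n} = sum_i yi/d(q, xi)  /  sum_i 1/d(q, xi).
    # Assumes q differs from every learning-set input (else d = 0 divides).
    ws = [1.0 / d(q, x) for x, _ in learning_set]
    return sum(w * y for w, (_, y) in zip(ws, learning_set)) / sum(ws)

ls = [(0.0, 1.0), (1.0, 4.0), (3.0, 2.0)]
ys = [y for _, y in ls]
for q in (0.2, 0.9, 2.5, 10.0):
    guess = metric_herbie(ls, q)
    # The guess can never leave [min(yi), max(yi)]: no extrapolation.
    assert min(ys) <= guess <= max(ys)
```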
3. All HERBIEs are close to being output linear, in that if

guess = h0(x, x', x'', ..., q)·y + h1(x, x', x'', ..., q)·y' + ...
To prove (2.10) it suffices to note that by (2.9) any output linear generalizer can be written as

g{i}(x, y, x', y', x'', y'', ..., q) = g{i}(x, 1, x', 0, x'', 0, ..., q)·y + g{i}(x, 0, x', 1, x'', 0, ..., q)·y' + ...
7. Rigorously speaking, though, g{i} is not really a tensor field unless m×i, the dimension of the manifold, equals 1, the dimension of the spaces whose Cartesian product makes up the space on which the tensors work. There is also the problem, of course, that the mapping is only linear from (R × R × R × ...) to R, and not multilinear (see reference [51]).
assuming the individual g{i} terms on the right hand side exist. Therefore h0(x, x', x'', ..., q) = g{i}(x, 1, x', 0, x'', 0, ..., q), for example. Unless explicitly noted otherwise, it will always be assumed that the cardinality of the learning set exceeds the dimension of the input space and that all the input space components of the datum spaces, x, x', x'', etc., are different from one another. These two conditions ensure that those g{i} terms all exist, so that we can indeed set h0(x, x', x'', ..., q) = g{i}(x, 1, x', 0, x'', 0, ..., q), etc. (see restriction 2.5).
If we assume, as we often do, that the g{i} are everywhere differentiable, then there is a weaker condition than (2.9) which results in (2.10) and therefore in (2.9). Namely, if all of the g{i} of a generalizer are differentiable and obey the property that g{i}(x, y, x', y', ..., u) + g{i}(x, z, x', z', ..., u) = g{i}(x, y + z, x', y' + z', ..., u) for all x, y, z, x', y', z', ..., u, then in fact the generalizer is output linear. To see this simply note that this property implies that
lim_{z→0} [g{i}(x, y + z, x', 0, x'', 0, ..., u) − g{i}(x, y, x', 0, x'', 0, ..., u)] / z = lim_{z→0} g{i}(x, z, x', 0, x'', 0, ..., u) / z,

so the partial derivative of g{i} with respect to y does not depend on y.
(The fact that f(x) must be a straight line shouldn't be too surprising, since
we know that in addition to being expressible as it is in (2.12), as is discussed in appendix J, h must be expressible as a function of (q − x')/(x − x'), and it is hard to think of any way of doing this without having f be linear in
its argument.) Therefore, for any n = 2 learning set, there exists a unique
function f(x) which reproduces the learning set and which obeys (2.12), and
therefore this restriction sets unique generalization of any n = 2 learning
set. Note that this is all markedly similar to the situation encountered when
enumerating all HERBIE LMMGs (see appendix G). In solving for all such
LMMGs we are allowed both exponentials and polynomials as members of
our basis set of functions, unless we require input-space scaling invariance,
in which case the exponential solution must be discarded.
It is interesting to note that away from these points where datum spaces
have identical input values, fitting a parabola to three points constitutes the
g{3} of an output linear continuous HERBIE obeying (the n = 3 extension
of) equation (2.12). Indeed, fitting with a polynomial constitutes an output
linear, upwardly compatible, (2.12) obeying continuous HERBIE of arbitrary
dimension and order, so long as one stays away from regions where datum
spaces have identical input values . (An interesting and as yet unsolved prob-
lem is to determine all solutions to this partial set of restrictions.) The
Achilles' heel of polynomial fitting, as of all other LMMGs, is its
inability to satisfy continuity in the situation where upward compatibility
applies (i.e., when elements of the learning set approach one another). Note,
however, that if we are willing to relax the requirement that generalizers be
continuous, we can easily have fully proper output linear generalizers. A
trivial example of such a generalizer is a HERBIE that guesses as the output
to any given question the output component of the datum vector nearest to
that question.
A Mathematical Theory of Generalization: Pest I 179
This ends the first paper of the series. The next paper continues the discussion of generalization, using generalization criteria of a qualitatively different nature than those presented here.
Appendix A.
Hopfield's scheme for using neural nets as associative memories
In Hopfield's scheme each neuron, having a value of +1 or −1, is connected to every other neuron. Evolution of the net proceeds according to the rule σi(t) = g[Σj Jij σj(t − 1)], σi(t) being the state of the ith neuron at cycle number t, and Jij constituting the connections between the neurons and usually taken to be symmetric. The function g is usually taken to be the sgn function. In analogy to simple statistical mechanics we can now define a Hamiltonian H as −Σij Jij σi(t − 1) σj(t − 1). Let Δσi represent the change in σi over one cycle. If it is nonzero, then by the definition of the evolution operator g the sign of Δσi is the same as the sign of Σj Jij σj evaluated at the previous cycle. Therefore Δσi Σj Jij σj is positive definite, ΔH is negative definite under evolution of the system, and H descends monotonically to a local minimum. To see how this can serve as an associative memory, view the state of all the neurons in the net as a (potential) memory. We then want to pick the Jij in such a way that if we start with the system in a state close to a memory we want stored in the system, then the evolution takes the neurons to that stored memory. This behavior can be achieved by using the Hebbian rule to set the Jij as Σm x_i^m x_j^m, x_i^m being the ith component of the mth memory to be "stored." With this Jij, H = −Σm (x⃗^m · σ⃗)², σ⃗ being the state of all the neurons at a given time. For rather reasonable assumptions on the distribution of the x⃗^m, H will be minimized for σ⃗ parallel to x⃗^n for some n. In other words, the x⃗^m serve as stable points of the evolution of the system, as desired.
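The scheme above can be sketched in a few lines. The two stored memories and the synchronous update are illustrative choices (Hopfield's original updates are asynchronous); the Hebbian Jij and the sgn evolution rule follow the text.

```python
def sgn(v):
    return 1 if v >= 0 else -1

def hopfield_step(state, J):
    # Synchronous evolution: sigma_i(t) = sgn(sum_j J_ij * sigma_j(t-1)).
    return [sgn(sum(J[i][j] * state[j] for j in range(len(state))))
            for i in range(len(state))]

# Hebbian rule: J_ij = sum_m x_i^m x_j^m, with zero diagonal.
memories = [[1, -1, 1, -1, 1, -1], [1, 1, 1, -1, -1, -1]]
n = len(memories[0])
J = [[0 if i == j else sum(m[i] * m[j] for m in memories)
      for j in range(n)] for i in range(n)]

# Start near the first memory (one bit flipped) and evolve.
state = list(memories[0])
state[2] *= -1
for _ in range(5):
    state = hopfield_step(state, J)
assert state == memories[0]   # the stored memory is recovered
```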
Appendix B.
Boltzmann machines and backpropagation
To understand how neural net generalizers behave, it is helpful to examine
the details of their operation. As a first example, consider Boltzmann machines, which constitute a sort of stochastic generalization of Hopfield nets. Instead of viewing the state of the whole net as being the input and output, in Boltzmann machines certain neurons are declared to be input and certain other neurons are declared to be output. Input neurons have their state set by the outside world exclusively and output neurons have their state fed to the outside world exclusively. Therefore connectivities are, in a certain sense, explicitly nonsymmetric. Take the possible states of any given neuron to be {0, 1}. Evolution of the net proceeds according to the probabilistic rule that σi(t) = 1 with probability f(ΔEi), f being the fermi function and ΔEi, a function solely of the other spins in the net, being the difference between
the energy the net would have at cycle t − 1 if neuron i were fixed as a 0 and the energy it would have if neuron i were fixed as a 1. (The Hopfield net's evolution is akin to the temperature = 0 case.) This, of course, is just conventional statistical mechanics. The probability of a particular state σ⃗ for the whole net is [Π_{i: σi=1} P(σi(t))][Π_{j: σj=0} (1 − P(σj(t)))], P(σi(t)) being the probability of neuron i of the state σ⃗ being a 1 (i.e., f(ΔEi)). Taking a particular neuron k which is off in state α⃗ and turning it on to make state β⃗, we see that P_α⃗ / P_β⃗ = e^{−(E_α − E_β)/kT}, just as in Boltzmann statistical mechanics (hence the name "Boltzmann machine"). Therefore, as in a Hopfield net, letting a Boltzmann machine evolve will (likely) cause it to end up in a local minimum of its energy function. Simulated annealing [25] is often used in practice to try to "freeze in" that minimum by gradually lowering the temperature in the fermi function.
Again, just as in the Hopfield net, "teaching" the net in a Boltzmann
machine consists of choosing the energy function to assure certain dynamic
behavior in certain situations. Unlike the Hopfield net, however, here the
behavior to be taught only entails fixing the values of a subset of the neurons,
those which serve as the inputs and the outputs. Along with the fact that
the input neurons do not evolve but remain constant throughout the running
of the machine, this is what allows the Boltzmann machine to be viewed as
a generalizer and respond with outputs completely different from any with
which it was taught. As with Hopfield nets, a magnetic energy function is
usually chosen: Ec; = - Li<j JijO"fO"y + Li OiO"1- Jij is usually taken to be
symmetric. For the Boltzmann machine the determination of the Jij from the
learning set is more complicated than in the Hopfield case. Define a cross-
entropy error function G = Lc; Pt In [Pj) Pi], where a runs over the set of
input and output states with a given input vector fixed on the input units,
pt is the desired probability of state a, and Pi is the actual (equilibrium)
probability of state a when the machine is running. By inspection, when G
is minimized (to 0), pt = Pi and the system will definitely evolve to a state
which has the desired output units when the input units are held fixed at
the corresponding values. A number of techniques can be used to minimize
G, gradient descent being perhaps the most common. When it is desired to
have the learning set taught to the net consist of more than just one input
vector/output vector pair, the G to be minimized is modified by summing
over all elements of the learning set.
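A sketch of the single-neuron update statistics, under the convention (an assumption made here) that ΔEi is the energy with neuron i clamped to 0 minus the energy with it clamped to 1, and with kT = 1. It checks that the fermi-function update reproduces the Boltzmann ratio quoted above.

```python
import math

def fermi(x, T=1.0):
    return 1.0 / (1.0 + math.exp(-x / T))

def energy(state, J, theta):
    # E = -sum_{i<j} J_ij s_i s_j + sum_i theta_i s_i, states in {0, 1}.
    n = len(state)
    return (-sum(J[i][j] * state[i] * state[j]
                 for i in range(n) for j in range(i + 1, n))
            + sum(theta[i] * state[i] for i in range(n)))

J = [[0, 1, -2], [1, 0, 3], [-2, 3, 0]]      # illustrative symmetric weights
theta = [0.5, -1.0, 0.25]
state = [1, 0, 1]

i = 1
off, on = list(state), list(state)
off[i], on[i] = 0, 1
dE = energy(off, J, theta) - energy(on, J, theta)
p_on = fermi(dE)

# The update reproduces Boltzmann statistics: p/(1-p) = exp(-(E_on - E_off)/kT).
assert abs(p_on / (1 - p_on)
           - math.exp(-(energy(on, J, theta) - energy(off, J, theta)))) < 1e-9
```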
Another neural net generalizer which is perhaps the simplest (and cer-
tainly the most widely researched) is the technique of backpropagation. Like
the Boltzmann machine, this net has a subset of the neurons designated in-
put neurons and another subset designated output neurons. Also like the
Boltzmann machine, a fermi function (or, in general, any sigmoidal function)
is used to do the evolving. However, here the neurons do not only take on
discrete values and the evolving is not stochastic: O"i(t) = !(Lj JijO"j(t - 1))
is the updating rule. Usually the Jij are chosen so that the connectivity is
that of layers of neurons feeding forward into one another, starting from the
input layer and proceeding all the way to the output layer. As in the previ -
ously described neural nets, output is taken when the system reaches a steady state, which in this case means after n cycles, n being the number of neuron layers. "Backpropagation" refers to the teaching technique of this scheme. The Jij are found by minimizing an energy (i.e., error) function via gradient descent. The energy function is usually chosen to be the sum of the squares of the differences between desired and actual output states for a given input state. Any Jij which constitute a zero of the energy function will give the desired output for any input chosen from the set of input/output pairs making up the provided learning set. Since the net is feedforward, it is a simple matter to write down the function relating the Jij and the output of the net, given any input. Therefore, although such a function is not directly invertible, it is trivially easy to run an iterative procedure to approximate the Jij which give zeros of the energy function by analytically calculating the functional form of the gradient of the function energy(Jij) and then using a steepest descent procedure. Again as with the Boltzmann machines, since a backpropagated net can respond in a novel way to a novel input, it constitutes a generalizer.
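A minimal sketch of the teaching procedure: gradient descent on a sum-of-squared-errors energy function. For brevity this uses a single sigmoidal neuron learning the AND function rather than a multilayer net; full backpropagation applies the same chain rule layer by layer. The learning rate and iteration count are illustrative choices.

```python
import math

def f(x):  # fermi / logistic function
    return 1.0 / (1.0 + math.exp(-x))

# Learning set: the AND function on {0,1}^2 (a deliberately simple choice).
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w = [0.0, 0.0, 0.0]   # two weights plus a bias

def out(x, w):
    return f(w[0] * x[0] + w[1] * x[1] + w[2])

# Gradient descent on E = sum over the learning set of (out - y)^2 / 2.
for _ in range(20000):
    grad = [0.0, 0.0, 0.0]
    for x, y in data:
        o = out(x, w)
        d = (o - y) * o * (1 - o)      # chain rule through the sigmoid
        grad[0] += d * x[0]
        grad[1] += d * x[1]
        grad[2] += d
    w = [wi - 1.0 * g for wi, g in zip(w, grad)]

# The trained neuron classifies every element of the learning set correctly.
assert all((out(x, w) > 0.5) == (y == 1) for x, y in data)
```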
Appendix C.

Appendix D.
Discussion of inner products among generalizers
Some interesting mathematics results from trying to define a vector space in which generalizers live and then defining a mapping meeting all of the criteria usually required of an inner product and which takes pairs of vectors from that vector space to elements of an appropriate field (see reference [50]). One way to go about this is to have the infinite set of numbers making up the "pseudo" inner product serve as the inner product in the vector space of generalizers. To do this, define the field F to consist of all countably infinite sets of real numbers: if a ∈ F, then a = {a1, a2, ...} where all ai ∈ R. For a and b both ∈ F, we say that a = b iff ai = bi for all i. Similarly, we say that a > or < b iff ai > or < bi for all i. (Note that with these definitions, it is no longer true that either a > b, a = b, or a < b for all a and b; the field is only partially, not totally, ordered.) Addition in F is performed component by component: a + b = {a1 + b1, a2 + b2, ...}. The additive identity is simply {0, 0, ...}. F differs from the standard L² Hilbert space in that, being a field, multiplication of elements in F is defined. As with addition, elements in F multiply by component: a × b = {a1b1, a2b2, ...}. The multiplicative identity therefore is {1, 1, ...}. (Note that if F were to be viewed as an infinite vector space over R, this definition of multiplication would not be a proper tensor operation, since after appropriate rotations a × b could always be made to be either parallel to a or parallel to b.) Closure, the existence of multiplicative inverses (for elements of F with no components equal to 0), commutativity, and the distributive law all follow, confirming that F is indeed a field. Intuitively, F is simply an infinite set of R fields all acting in parallel; there is no interaction between components of an element of F.

Now that we have an appropriate field, we can define the vector space in which generalizers live. Simply put, generalizers live in the same space as the elements of F, except that their components are functions from R^n to R, rather than just elements of R. This space is a vector space over F, just as the set of all well-behaved (i.e., L²) functions from R^n to R is a vector space over R. a ∈ F multiplied by the generalizer {g{1}, g{2}, ...} = {a1 g{1}, a2 g{2}, ...}. Addition of generalizers goes exactly as addition of elements of F. All the usual restrictions on legal vector spaces follow from these definitions (see reference [50] again).
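The componentwise arithmetic of F can be illustrated with finite truncations of its elements (finite only so the sketch can run; the elements of F itself are countably infinite).

```python
# Length-4 truncations of two elements of F (illustrative values).
k = 4
a = [1.0, 2.0, -3.0, 0.5]
b = [4.0, 0.5, 2.0, -1.0]

add = [ai + bi for ai, bi in zip(a, b)]       # componentwise addition
mul = [ai * bi for ai, bi in zip(a, b)]       # componentwise multiplication
one = [1.0] * k                               # multiplicative identity {1, 1, ...}
inv_b = [1.0 / bi for bi in b]                # defined: no component of b is 0

# Components never interact: identity and inverses hold slot by slot.
assert [mi * ii for mi, ii in zip(b, inv_b)] == one
# The order is only partial: neither a > b nor b > a holds for this pair.
assert not all(ai > bi for ai, bi in zip(a, b))
assert not all(bi > ai for ai, bi in zip(a, b))
```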
Using Dirac notation, we can now define the inner product between two generalizers |G⟩ and |H⟩ to be the element of F {⟨g{1}|h{1}⟩, ⟨g{2}|h{2}⟩, ...}. Except for the fact that the inner product as defined this way does not live in R, it obeys all the usual restrictions on inner products. As a result, both the triangle inequality and Schwarz's inequality hold. An orthogonal basis for the space of generalizers consists of the generalizers {B_i1, 0, 0, ...} for all i, {0, B_i2, 0, ...} for all i, etc., where "0" denotes the zero function, and B_ij denotes a complete orthonormal set of basis functions for the mappings from R^{j(m+1)+m} to R. The elements of this basis will be generically denoted as |B_ij⟩. If the i are discrete, we can expand any generalizer |H⟩ as |H⟩ = Σ_ij |B_ij⟩⟨B_ij|H⟩. A similar expression holds for continuous i. Although easy to work with, it should be noted that these |B_ij⟩ are not normalized. Nor can they be normalized simply by multiplying them by the appropriate constant, since the multiplicative identity is {1, 1, ...}, and therefore the 0 components in the |B_ij⟩ make their multiplicative inverses undefined. To construct an orthonormal basis, we would have to take the (nonorthogonal) basis {B_i1, B_i2, ...} and apply Gram-Schmidt.
As opposed to this formalism using an infinite number of components, it is possible to view generalizers as a vector space over the reals rather than over F. |G⟩ + |H⟩ still = {g{1} + h{1}, g{2} + h{2}, ...}, but now a|G⟩ = {a·g{1}, a·g{2}, ...}. The difficulty with this new formalism is in trying to define an inner product mapping two generalizers to an element of the reals in such a way that the magnitude of a generalizer |G⟩ is finite even if all of its component functions g{i} have nonzero magnitude. One way to do this is to define

⟨G|H⟩ = lim_{n→∞} [Σ_{i=1}^{n} ⟨g{i}|h{i}⟩] / n.

The usual required properties of inner products follow, assuming that the limit defining the inner product exists for the generalizers in question. Unfortunately, the magnitude of the inner product of any one of the basis vectors |B_ij⟩ with another generalizer is 0. We can no longer write |H⟩ = Σ_ij |B_ij⟩⟨B_ij|H⟩. Gram-Schmidt (for example run over the set of elements from {B_i1, B_i2, ...} which have well-defined inner products with one another) has to be used just to form a basis allowing us to write this kind of expansion, regardless of any questions of overall normalization.
Appendix E.
Basic properties of LMMGs
In this appendix a rigorous definition of LMMGs is given and a property of them which is used in appendices G and F is proven.
Make the rigorous definition that an LMMG is a generalizer which works by taking an ordered basis set of infinitely differentiable nonzero functions, φ1(r⃗), φ2(r⃗), ... ∈ Φ, and for the given learning set θ forms the linear combination of basis functions, Σ_{i=1}^{N} ai φi(r⃗), aN ≠ 0, which has as small an N as possible and yet still goes through every element of θ. The guessed output of the generalizer for a question q⃗ is given by Σ_{i=1}^{N} ai φi(q⃗). Any generalizer like "fit the learning set with a power series" or "fit the learning set with a sum of Bessel functions" or the like is an LMMG.
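The definition can be sketched for a one-dimensional input space and the polynomial basis {1, x, x², ...}. The helpers `solve` and `lmmg_fit` are hypothetical names; the search simply tries N = 1, 2, ... and keeps the smallest N whose fit goes through every element of the learning set.

```python
def solve(A, b):
    # Gauss-Jordan elimination with partial pivoting (square systems only).
    n = len(A)
    M = [row[:] + [bv] for row, bv in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [mr - f * mc for mr, mc in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def lmmg_fit(xs, ys, tol=1e-9):
    # Smallest N such that a1*1 + a2*x + ... + aN*x^(N-1) reproduces every
    # element of the learning set (the xs are assumed distinct).
    for N in range(1, len(xs) + 1):
        A = [[x**i for i in range(N)] for x in xs[:N]]
        coef = solve(A, ys[:N])
        if all(abs(sum(c * x**i for i, c in enumerate(coef)) - y) < tol
               for x, y in zip(xs, ys)):
            return coef
    raise ValueError("no exact fit found")

# Three collinear points: the minimal fit is a line (N = 2), not a quadratic.
assert len(lmmg_fit([0.0, 1.0, 2.0], [1.0, 3.0, 5.0])) == 2
# A genuinely curved learning set forces N = 3.
assert len(lmmg_fit([0.0, 1.0, 2.0], [0.0, 1.0, 4.0])) == 3
```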
First note that aN must be unique. To see this, first note that if Σ_{i=1}^{N} ai φi(r⃗) reproduces a learning set of pairs {(r⃗j, yj)}, then the matrix equation φ_j^i ai = yj must hold, where implied sum notation is being used and where the matrix φ_j^i is defined to have as elements the values φi(r⃗j). Now if the set of {ai} which reproduce the learning set is not unique, we have that φ_j^i (ai − ai′) = 0 where at least one of the ai differs from the corresponding ai′. Therefore, indexed by i, the vectors φ_j^i are linearly dependent. This means that there exists some value I such that φ_j^I can be written as βi φ_j^i where βI = 0. Therefore we can write
(E.1) For all N and {ai} where aN ≠ 0 there exists an N-point learning set such that the LMMG chooses those {ai} to fit the learning set.
Proof. Since the φi(r⃗) are linearly independent, for any N of the φi(r⃗) there does not exist a nonzero N-component vector c⃗ such that c⃗ · [φ1(r⃗), φ2(r⃗), ..., φN(r⃗)] = 0 for all r⃗. In other words, the vectors [φ1(r⃗), φ2(r⃗), ..., φN(r⃗)] span R^N. Therefore, there exist N vectors [φ1(r⃗), φ2(r⃗), ..., φN(r⃗)] which are linearly independent, i.e., for any N there exists a set of N input values {r⃗j, 1 ≤ j ≤ N} such that the N vectors [φ1(r⃗j), φ2(r⃗j), ..., φN(r⃗j)] are linearly independent. Now for any N and any associated set {ai} the N-element learning set

{(r⃗1, Σ_{i=1}^{N} ai φi(r⃗1)), ..., (r⃗N, Σ_{i=1}^{N} ai φi(r⃗N))}

is such that it lies on the surface Σ_{i=1}^{N} ai φi(r⃗). If aN ≠ 0, then it is possible that the LMMG will fit the learning set using those {ai} and therefore this surface Σ_{i=1}^{N} ai φi(r⃗). To ensure that this surface is the actual one which the LMMG will use to fit the learning set, it suffices to note that since all the rows of our matrix φ_j^i are linearly independent, there does not exist any set of values {ai′} different from the set {ai} such that the learning set can be reproduced by the surface Σ_{i=1}^{N} ai′ φi(r⃗). This completes the proof of the lemma.
Appendix F.
Proof that no LMMG is continuous
See appendix E for a rigorous definition of LMMGs and a proof of the fact that for any N and any set {ai} where aN ≠ 0, we can always build an N-element learning set θ such that the LMMG fits it with a surface of the form Σ_{i=1}^{N} ai φi(r⃗), where the φi(r⃗) are taken from the basis set of functions of the LMMG. Build such a learning set, and indicate its elements by (r⃗j, yj), where j ranges from 1 to N. Indicate the surface the LMMG fits to θ by f(r⃗).
To proceed in the proof, it is first necessary to establish the following lemma:
combination of the first N − 1 φi(r⃗). Since we know that the LMMG can fit the modified learning set with (at most) N of the first φi(r⃗), this linear dependence tells us that the LMMG will in fact fit the modified learning set with only m ≤ (N − 1) of the first φi(r⃗), when point b has been moved on top of point a.

So we know that there must be a point during this sliding at which the LMMG suddenly stops fitting the learning set with the first N of the φi(r⃗) and instead fits it with the first N − 1 of them. However, all of the φi(r⃗) are linearly independent of one another. This means that any linear combination of the first N − 1 of the φi(r⃗) must differ from all linear combinations of the first N of the φi(r⃗) (with aN ≠ 0) by a nonzero amount for some value of r⃗. This means that the LMMG's g{N} must be a discontinuous function of its arguments when the arguments are at this point of transition from N-function fitting to (N − 1)-function fitting. This concludes the proof of the proposition.
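The simplest concrete discontinuity can be seen in the n = 2 case with basis {1, x}. (This illustrates the continuity failure where learning-set elements approach one another; the proof above covers the general transition from N-function to (N − 1)-function fitting.)

```python
def fit_line(p1, p2):
    # LMMG with basis {1, x}: the unique line through two datum points.
    (x1, y1), (x2, y2) = p1, p2
    slope = (y2 - y1) / (x2 - x1)
    return lambda q: y1 + slope * (q - x1)

q = 1.0
# Slide the second point's input toward the first while its output component
# stays fixed: the guess at q blows up instead of approaching any limit.
guesses = [fit_line((0.0, 0.0), (eps, 1.0))(q) for eps in (0.1, 0.01, 0.001)]
assert guesses[0] < guesses[1] < guesses[2] and guesses[2] > 900.0
```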
Appendix G.
Proof that only polynomial LMMGs obey (2.6)
See appendix E for a precise definition of LMMGs. To begin our proof that only polynomial LMMGs obey (2.6), it is easy to show that output translation invariance forces φ1(r⃗) to equal 1. Let D indicate the support of φ1(r⃗). By definition of LMMGs, D is nonempty. Now consider any point r⃗ ∈ D. Let our learning set be {(r⃗, φ1(r⃗))}, with φ1(r⃗) ≠ 0. Output-translating our learning set by λ we get the new learning set {(r⃗, φ1(r⃗) + λ)}. By output-translation invariance, g{1}(r⃗, φ1(r⃗), q⃗) = g{1}(r⃗, φ1(r⃗) + λ, q⃗) − λ for all q⃗. However, since g{1} is part of an LMMG, g{1}(r⃗, φ1(r⃗), q⃗) = φ1(q⃗). Moreover, output space scaling invariance requires that

In other words, φ1(q⃗) is a nonzero constant, k, independent of q⃗, and without loss of generality we can set k = 1.
Now let T_λ indicate any one of the (parameterized) transformations of requirement (2.6). For example, we indicate the operation of output space translation invariance by the (nonlinear) operator T_λ where T_λ[f(r⃗)] = λ + f(r⃗). Then it follows that

(G.1) For a given basis set of functions φi(r⃗) ∈ Φ, if the LMMG based on Φ obeys invariance under T_λ then for all sets of real numbers {ai} and all N (where aN ≠ 0) there exists a set {ai′} and an N′ (where a′_{N′} ≠ 0) such that T_λ[Σ_{i=1}^{N} ai φi(r⃗)] = Σ_{i=1}^{N′} ai′ φi(r⃗).
To see (G.1) first recall from appendix E that since the φi(r⃗) are linearly independent, for all N and {ai} there exists an N-point learning set such that the LMMG chooses those {ai} to fit the learning set. Next note that by the hypothesis of meeting (2.6), the curve T_λ[Σ_{i=1}^{N} ai φi(r⃗)] reproduces the transformation of this learning set and also makes the transformed question-guess mapping. It is unique in doing this, and therefore invariance of the LMMG under T_λ means that it must be the same as the curve generated by running the LMMG off of the transformed learning set. This LMMG-generated curve can always be written as Σ_{i=1}^{N′} ai′ φi(r⃗), however, which proves the supposition.

Note that φ1(r⃗) = 1 is completely in accord with (G.1). Also note that if a learning set {(r⃗j, yj)} is fit by an LMMG with the curve Σ_{i=1}^{N} ai φi(r⃗), then the learning set {(r⃗j, λyj)} will be fit with Σ_{i=1}^{N} λai φi(r⃗) = λ Σ_{i=1}^{N} ai φi(r⃗). Therefore LMMGs automatically obey output scaling invariance.
The proof that there is only one set Φ whose LMMGs obey the invariances of (2.6), and that that set is the set of all polynomials, works by showing that there is only one set Φ which meets (G.1) for input space invariances. Note, however, that (G.1) is only a necessary condition for the invariance of an LMMG under T_λ. Fortunately, in what follows the sufficient condition (namely, that the given basis set Φ, which obeys T_λ[Σ_{i=1}^{N} ai φi(r⃗)] = Σ_{i=1}^{N′} ai′ φi(r⃗) for all transformations T_λ, is such that the associated LMMG obeys invariance under all the T_λ) will be obvious. Therefore, in the exposition below sufficiency will not be explicitly illustrated.
Without loss of generality we can take N′ ≤ N in (G.1) when we are dealing with a single λ: if N′ > N, then T_λ^{-1}[Σ_{i=1}^{N′} ai′ φi(r⃗)] = Σ_{i=1}^{N} ai φi(r⃗), and since our (parameterized) transformations are groups, T_λ^{-1} = T_{λ′} for some λ′ (in practice we usually parameterize the group so that λ′ = −λ, so that our groups are explicitly Lie groups). Therefore by simply flipping which we call the transformed space and which we call the untransformed space we can have N′ ≤ N.

In the analysis below, however, we are going to need to make use of an even stronger condition on the N′. Instead of just choosing between T_λ and T_{−λ} to ensure N′ ≤ N for a single λ, we will require N′ ≤ N for all λ ∈ [−Δλ, Δλ], with Δλ > 0. In other words, as we move through [−Δλ, Δλ] our choice of T_λ or T_{−λ} (to ensure N′ ≤ N) must not change.
This stronger con dit ion is not difficult to meet, fortunately. Since the
i( r') are all linearly independent, as before t here exists a learning set () with
input coordin ates r~ , r-;, . . . , rN such that the LMMG working from this ()
will create a curve L:~ 1 aii(r') where (in particular) aN is nonzero. Now
assume t here exit s an open ball of nonzero radius living in R m N (t he vector
space delineati ng all datum space inputs components for the m-dimensional
LMM G 's g{N}) and centered on the input components of (), such that the
matrix i(i'j)(l :::; i,j :::; N) has nonzero determinant for all sets {i'j} con-
tained within that ball . Then any T>. moving () to any point within that
ball will have N ' :::; N. (aN might = 0, for example, which is why N' is al-
lowed to be < N. T he import ant point is that there is no linear dependence
188 David H. Wolpert
among the transformed φ_i(r⃗_j) which might force N′ to be > N.) Therefore
our "stronger condition" will be met if this assumption holds. To prove this
assumption it is sufficient to disprove its negation. To do this, note that if
there does not exist any such ball of nonzero det(φ_i(r⃗_j)), then since we know
that det(φ_i(r⃗_j)) is nonzero at θ, there must exist an axis in R^{mN} such that
the set of points along that axis with det(φ_i(r⃗_j)) = 0 contains the intervals [x, θ)
and (θ, y], where x is θ shifted a nonzero amount in one direction and y is
θ shifted a nonzero amount in the opposite direction. (At first it might be
thought that the statement that no such open ball of nonzero radius encloses
θ could be satisfied if, for example, the set of points with det(φ_i(r⃗_j)) = 0 in the
vicinity of θ were [x, θ) and either (z, y] or [z, y], where z is θ shifted by a
nonzero amount in the y direction. Note, however, that if this were the case
then any learning set lying between θ and z would lie wholly within a ball
of nonzero det(φ_i(r⃗_j)). (I am assuming here that it is only this one axis in
R^{mN} which has the property that det(φ_i(r⃗_j)) flips from being 0 to nonzero
as you move along it at the point θ. The argument if there is more than one
such axis goes similarly.) So for our purposes of finding such a ball we can
safely remove all such situations from consideration.) Now note, however, that
det(φ_i(r⃗_j)) is a continuous function of the r⃗_j. Therefore it is impossible for it
to be 0 in [x, θ) and (θ, y] but nonzero at θ. Therefore our assumption of the
existence of such a ball of nonzero det(φ_i(r⃗_j)) is justified, and our "stronger
condition" can be satisfied. In what follows we will always take our learning
set to be in the center of such a ball of nonzero det(φ_i(r⃗_j)), and all T_λ will
keep us within that ball. (Note that we make no restrictions on the value of
q, however.) Therefore we will always be able to take N′ ≤ N.
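The continuity argument above is easy to see concretely. The sketch below is my own illustration, not from the paper: it uses a hypothetical polynomial basis φ_i(x) = x^(i−1) and checks that det φ_i(r_j), being a continuous (polynomial) function of the r_j, stays nonzero throughout a small sampled ball around a learning set at which it is nonzero.

```python
import numpy as np

# Hypothetical basis for illustration: phi_i(x) = x**(i-1), i = 1..N
def design_matrix(points, N):
    """Matrix with entries phi_i(r_j) for the polynomial basis."""
    return np.array([[x**i for x in points] for i in range(N)])

rng = np.random.default_rng(0)
theta = np.array([0.1, 0.7, 1.3, 2.0])   # input coordinates of the learning set
N = len(theta)

d0 = np.linalg.det(design_matrix(theta, N))
assert d0 != 0.0                         # det(phi_i(r_j)) is nonzero at theta

# det is a polynomial (hence continuous) function of the r_j, so it
# remains nonzero in a small enough ball around theta; sample that ball:
for _ in range(1000):
    perturbed = theta + rng.uniform(-1e-3, 1e-3, size=N)
    assert np.linalg.det(design_matrix(perturbed, N)) != 0.0
print("det(phi_i(r_j)) nonzero throughout the sampled ball")
```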
Now examine the invariance of scaling and translation in the input space.
Rearranging (G.1), and assuming our learning set is in the center of a ball of
nonzero det(φ_i(r⃗_j)), we immediately get

(G.2) For a given basis set of functions Φ, if the LMMG based on Φ obeys
(2.6) then there exists a function K(a, b⃗, n, i) such that φ_n(a r⃗ + b⃗) =
Σ_{i=1}^{n} K(a, b⃗, n, i) φ_i(r⃗) for all r⃗. a and b⃗ are assumed close enough to 1
and 0⃗ respectively to ensure that we are within the ball mentioned in
the previous paragraph.
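For a concrete instance of (G.2) (my own illustration, not from the paper): with the hypothetical polynomial basis φ_i(x) = x^(i−1), the binomial theorem gives K(a, b, n, i) = C(n−1, i−1) a^(i−1) b^(n−i), which the sketch below checks numerically.

```python
import math

# For phi_i(x) = x**(i-1), the binomial theorem expands
#   (a*x + b)**(n-1) = sum_i K(a, b, n, i) * x**(i-1)
# with K(a, b, n, i) = C(n-1, i-1) * a**(i-1) * b**(n-i).
def K(a, b, n, i):
    return math.comb(n - 1, i - 1) * a**(i - 1) * b**(n - i)

a, b, n, x = 1.1, 0.3, 5, 1.7
lhs = (a * x + b)**(n - 1)
rhs = sum(K(a, b, n, i) * x**(i - 1) for i in range(1, n + 1))
assert abs(lhs - rhs) < 1e-9
print("phi_n(a*x + b) expands in the first n basis functions")
```

Note also that K(1, 0, n, i) reduces to the Kronecker delta δ(n, i) here, consistent with the observation made below for the general case.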
In particular, evaluating this expansion at a point r⃗*,

φ_n(a r⃗* + b⃗) = Σ_{i=1}^{n} K(a, b⃗, n, i) φ_i(r⃗*)

for any vector r⃗*. Since the φ_i(r⃗) are all linearly independent, as before we
can choose n r⃗*'s so that det(φ_i(r⃗*_j)) ≠ 0. This means that there
exists a matrix A_{lj} such that Σ_j A_{lj} φ_i(r⃗*_j) = δ_{li}. Therefore

K(a, b⃗, n, l) = Σ_j A_{lj} φ_n(a r⃗*_j + b⃗).

Since the φ_i(r⃗) are all differentiable functions, the right-hand side of this
vector equation is well defined. This means that for any l, K(a, b⃗, n, l) is a
differentiable function of all the components of b⃗, assuming a and b⃗ meet the
conditions given in (G.2). A similar proof holds for showing that K(a, b⃗, n, i)
is a differentiable function of a.
This differentiability of K is what allows us to delineate the set of all
Φ whose associated LMMG obeys (2.6). To see this, first examine the case
where the input space is one-dimensional. φ_2(x + b) = K(1, b, 2, 2) φ_2(x) +
K(1, b, 2, 1). Now since the φ_i(x) are linearly independent, K(1, 0, m, n) =
δ(m, n), the Kronecker delta of m and n. This allows us to subtract terms
like K(1, 0, 2, 1) at will, so we can write

dφ_2(x)/dx = [∂K(1, b, 2, 2)/∂b]|_{b=0} φ_2(x) + [∂K(1, b, 2, 1)/∂b]|_{b=0}.

The two derivatives with respect to b in this equation are arbitrary constants.
If the first of them equals 0, we have the solution φ_2(x) = Ax + B (A and B
being arbitrary constants). Otherwise φ_2(x) = A e^{Bx} + C (A, B, and C
being arbitrary constants). This second solution can be discarded, however,
since scaling invariance would require that for any real number a, φ_2(ax) =
A e^{Bax} + C be expressible as a linear combination of A e^{Bx} + C and 1, which
is impossible. Therefore any one-dimensional LMMG obeying (2.6), taking
a two-element learning set, fits that learning set with a straight line. This
result is in accord with (2.7).
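This conclusion can be checked numerically. The sketch below is my own, with made-up data: it fits a two-element learning set with the basis {1, x} and with the basis {1, e^x}; the first fit commutes with translating and scaling the input space, the second does not.

```python
import numpy as np

def lmmg_fit(basis, xs, ys):
    """Solve sum_i a_i * phi_i(x_j) = y_j for the coefficients a_i."""
    M = np.array([[phi(x) for phi in basis] for x in xs])
    return np.linalg.solve(M, ys)

def guess(basis, coeffs, q):
    return sum(a * phi(q) for a, phi in zip(coeffs, basis))

linear = [lambda x: 1.0, lambda x: x]            # the basis {1, x}
expo   = [lambda x: 1.0, lambda x: np.exp(x)]    # the basis {1, e^x}

xs, ys = np.array([0.0, 1.0]), np.array([2.0, 5.0])
q, a, b = 0.3, 2.0, 0.7    # question, and an input-space map x -> a*x + b

for name, basis in [("{1, x}", linear), ("{1, e^x}", expo)]:
    g1 = guess(basis, lmmg_fit(basis, xs, ys), q)                    # fit, then ask q
    g2 = guess(basis, lmmg_fit(basis, a * xs + b, ys), a * q + b)    # transform, fit, ask
    print(name, "invariant under x -> a*x + b:", bool(np.isclose(g1, g2)))
```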
These kinds of arguments can be iterated to get all the other φ_i(x):

dφ_3(x)/dx = a φ_3(x) + bx + c.

a = 0 gives φ_3(x) = Ax² + Bx + C. a ≠ 0 has the solution φ_3(x) =
e^{ax} [∫ e^{−ax}(bx + c) dx + D]. This solution cannot be in Φ if D ≠ 0, due to the
requirement of input space scaling invariance. However, the remaining terms
in this solution are linearly dependent on φ_2(x) and φ_1(x), and therefore must
be discarded. Therefore φ_3(x) = Ax² + Bx + C. Similarly, via induction,
φ_n(x) = Σ_{i=0}^{n−1} a_i x^i, or, without loss of generality, φ_n(x) = x^{n−1}. Note that if we
were not requiring scaling invariance we would also have as legal Φ's those
whose individual member functions are polynomials plus exponentials.
Now let us consider the case where the input space dimension m is > 1.
φ_1(r⃗) still (can be taken to) = 1. Now, however, we get p.d.e.'s rather than
o.d.e.'s for the other φ_i(r⃗). For example, we get ∂φ_2(r⃗)/∂r_i = A_i φ_2(r⃗) + B_i, where
i ranges over the coordinates of the input space. If, for i = α, A_α = 0, then

∂φ_2(r⃗)/∂r_α = B_α

(note there is no implied sum over the repeated α's in this equation). If all
the A_i = 0, φ_2(r⃗) = B_i r_i + C (implied sum over the repeated i). Let A_i ≠ 0
for i = α. Then φ_2(r⃗) would contain a factor of e^{A_α r_α}, which is forbidden
by input space scaling invariance; hence all the A_i = 0 and φ_2(r⃗) = B⃗ · r⃗ + C.
The p.d.e.'s for φ_3(r⃗) read

∂φ_3(r⃗)/∂r_α = E_α φ_3(r⃗) + F_α B⃗ · r⃗ + G_α,

where this B⃗ is the same as the one in the solution for φ_2(r⃗). Without loss
of generality we can set all E_i to 0. This is because if E_α did not equal 0 for
a particular α, then

φ_3(r⃗) = e^{E_α r_α} [H(r_1, r_2, …, r_{α−1}, r_{α+1}, …)
+ ∫ e^{−E_α r_α} (F_α B⃗ · r⃗ + G_α) dr_α].

Scaling invariance forces H(r_1, r_2, …, r_{α−1}, r_{α+1}, …) to = 0, as usual, so we
would have

φ_3(r⃗) = −(F_α B⃗ · r⃗ + G_α)/E_α − F_α B_α / E_α²,

which is linearly dependent on φ_1(r⃗) and φ_2(r⃗), contrary to the requirement
that Φ be a basis.
If in addition to the E_i all the F_i = 0, then we get φ_3(r⃗) = G⃗ · r⃗ + H. Since
φ_3(r⃗) is linearly independent of φ_2(r⃗) and φ_1(r⃗), G⃗ and B⃗ cannot be parallel.
We can similarly set all but the last constant to 0 in the next m − 2 sets
of p.d.e.'s (for φ_4(r⃗) through φ_{m+1}(r⃗)), and then without loss of generality
choose B⃗, G⃗, and all the other constants so that φ_2(r⃗) = r_1, φ_3(r⃗) = r_2, …,
all the way up to φ_{m+1}(r⃗) = r_m.
For the next set of p.d.e.'s (those concerning φ_{m+2}(r⃗)) we cannot continue
this procedure of setting all but the last constant to 0, since φ_{m+2}(r⃗) must
be linearly independent of φ_1(r⃗) through φ_{m+1}(r⃗). Therefore
A Mathematical Theory of Generalization: Part I 191
∂φ_{m+2}(r⃗)/∂r_i = F⃗_i · r⃗ + G_i, where for all i, F⃗_i ≠ 0. This results in φ_{m+2} being a quadratic polynomial
of the components of r⃗.
A priori, we are not justified in setting all those constants to 0 in the
p.d.e.'s for φ_3(r⃗) through φ_{m+1}(r⃗). It is entirely possible, for example, that
φ_3(r⃗) could have one or more of its F_i ≠ 0 (which would result in φ_3(r⃗) being
a quadratic polynomial) and yet all but the last constant in the p.d.e.'s for
φ_4(r⃗) could be 0, so that φ_4(r⃗) would be one of the remaining linear functions.
In other words, the ordering of the elements of Φ is not necessarily according
to the power of the individual polynomials. It is only in the one-dimensional
case where the requirements of scaling and translation invariance force us to
order the basis set of polynomials according to the power of the polynomials.
Indeed, for the multidimensional case it is even possible that a given linear
function appears nowhere in Φ, but can be constructed by forming a linear
combination of polynomials from Φ each of which has power > 1.
However, note that there is an important invariance from (2.6) which we
have not yet taken into account: invariance of the mapping under rotation of
the input space. Indeed, no basis set Φ for an input space having dimension
> 1 can obey this invariance. This is because, as was shown above, if we
have a two-element learning set, in general it gets fitted with a linear combination
of φ_1(r⃗) = 1 and φ_2(r⃗) = B⃗ · r⃗, where B⃗ is an arbitrary constant.
For any such fixed B⃗, though, it is impossible to express the rotation of the
surface fitted to the two-element learning set with a new linear combination
of φ_1(r⃗) = 1 and φ_2(r⃗) = B⃗ · r⃗. Rotation invariance is violated by LMMGs.
However, note that, for example, if the φ_3(r⃗) through φ_{m+1}(r⃗) are indeed
the linear functions, then for learning sets of cardinality m + 1, we do get
rotation invariance, since such a learning set will be fit with a hyperplane
a_0 + Σ_{i=1}^{m} a_i r_i, and the rotation of such a hyperplane is just another hyperplane,
a′_0 + Σ_{i=1}^{m} a′_i r_i. We can therefore make a watered-down version of the
requirement of rotation invariance: if the γ(i) are the successive values of N
such that the rotation of any linear combination of the first N elements of
Φ is expressible as another linear combination of those first N elements of
Φ, then we require first that γ(2) − γ(1) be as small as possible, then that
γ(3) − γ(2) be as small as possible, and so on through all γ(j) − γ(j − 1).
Trivially, γ(1) = 1. Since φ_2(r⃗) is necessarily linear, to meet this "modified
requirement of rotation invariance" for γ(2) − γ(1) we do indeed have to
have φ_3(r⃗) through φ_{m+1}(r⃗) be linear functions of the r_i. As above, this then
forces φ_{m+2}(r⃗) to be a quadratic, and so we have to have all the other linearly
independent quadratics (there are m(m − 1)/2 of them) coming next in Φ.
Iterating this argument, it is clear that

(G.3) The only basis sets of functions Φ whose associated LMMGs obey the
invariances of (2.6) (with the requirement of rotation invariance in the
input space modified as above) are the sets of functions
(1), (r_1, r_2, …, r_m), (the linearly independent quadratics), …,
where notationally (x, y, …, z) means all possible bases which span the
same space as the functions x, y, …, z.
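The hyperplane observation behind the modified rotation-invariance requirement can be checked numerically. In this sketch (my own, with m = 2, the basis {1, r_1, r_2}, and made-up data), rotating the inputs and the question leaves the guess of the fitted plane unchanged.

```python
import numpy as np

def fit_plane(points, outputs):
    """Fit a_0 + a_1*r1 + a_2*r2 through three (r1, r2) -> y pairs."""
    M = np.array([[1.0, p[0], p[1]] for p in points])
    return np.linalg.solve(M, outputs)

def guess(coeffs, q):
    return coeffs[0] + coeffs[1] * q[0] + coeffs[2] * q[1]

theta_in = np.array([[0.0, 0.0], [1.0, 0.2], [0.3, 1.1]])  # learning set inputs
theta_out = np.array([1.0, 2.0, -0.5])                      # learning set outputs
q = np.array([0.4, 0.4])                                    # the question

angle = 0.8
R = np.array([[np.cos(angle), -np.sin(angle)],
              [np.sin(angle),  np.cos(angle)]])             # a rotation of the input space

g1 = guess(fit_plane(theta_in, theta_out), q)
g2 = guess(fit_plane(theta_in @ R.T, theta_out), R @ q)     # rotate inputs and question
assert np.isclose(g1, g2)   # the rotated fit is just the rotated hyperplane
print("rotation invariance holds for the full linear block")
```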
Appendix H.
The npa generalizer
The npa generalizer begins its calculation of the output corresponding to
any given question by computing the input space vector r⃗_0 from the question
to the point in the learning set nearest the question. (In general, all
vectors are here assumed to be relative to the question.) Next it forms an
m-dimensional open hypersphere centered on the question with radius 2|r⃗_0|.
Label the position vectors of the points from the learning set which fall within
the hypersphere as r⃗_i. Assume there are n of them, so 0 ≤ i ≤ n − 1, where
it is assumed that |r⃗_i| ≥ |r⃗_{i−1}| for all i. Label the associated outputs by m_i.
Let r⃗_n be any point 2|r⃗_0| away from the question. Now define the value of
the scalar field ρ(r⃗) to be the median of the m_j whose associated positions
obey |r⃗_j| ≤ |r⃗|. In other words, at any point r⃗, ρ(r⃗) is the average of the
smallest and the largest m_i which lie inward of r⃗. The output value the
generalizer guesses for the question is given by [∫_{|r⃗|<2|r⃗_0|} ρ(r⃗) d^m r] / [∫_{|r⃗|<2|r⃗_0|} d^m r].
If one of the m_i changes by an infinitesimal amount δm, then its greatest
possible effect on the generalizer's guess occurs if its point is at r⃗_0 and it is (say)
the largest m_i throughout the whole hypersphere. In this case, ρ(r⃗) everywhere
increases by at most δm/2 (δm if this point happens to be the only
one in the hypersphere), and therefore the generalizer's output increases by
at most δm/2. Consequently, this generalizer is continuous in the m_i. Furthermore,
although ρ(r⃗) is in general a discontinuous function of r⃗ and the r⃗_i, ∫_{|r⃗|<2|r⃗_0|} d^m r
is a continuous function of the r⃗_i, as is ∫_{|r⃗|<2|r⃗_0|} ρ(r⃗) d^m r, since any infinitesimal
variation in r⃗_i can make a noninfinitesimal change in ρ(r⃗) only over an
infinitesimally thin (hyper)shell. As a result, the output of the generalizer is
a continuous function of the r⃗_i. Similarly, an infinitesimal variation in the
question, being equivalent to an infinitesimal variation in all the r⃗_i, can only
result in an infinitesimal variation in the output. This verifies requirement
(2.1) of properness. Now note that if we translate all the m_i up or down by
some constant k, ρ(r⃗) is translated by the same k, and therefore the output
as a whole is translated by k. Similarly, if we multiply all the m_i by a
scaling factor z, the output gets multiplied by z. Since the output is purely
a function of metric distances, which are tensor scalars, the generalizer is
automatically invariant under rotation, parity, and translation. Invariance
under nonzero scaling of all the inputs and the question follows from how the
radius of the hypersphere is defined in terms of |r⃗_0|.
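As a concrete reading of this procedure, here is a one-dimensional (m = 1) sketch of the npa generalizer. It is my own reconstruction: the quadrature grid and the handling of the region inward of the nearest point are illustrative assumptions, not specified in the text.

```python
import numpy as np

def npa_guess(learn_x, learn_y, question, n_grid=2001):
    """One-dimensional npa generalizer: the guess is the average of the
    midrange field rho over the open interval of radius 2*|r_0| around
    the question (the ratio of the two integrals in the text)."""
    rel = np.asarray(learn_x, dtype=float) - question  # vectors relative to the question
    r0 = np.min(np.abs(rel))                           # distance to the nearest point
    radius = 2.0 * r0
    inside = np.abs(rel) < radius                      # points within the hypersphere
    d = np.abs(rel[inside])
    m = np.asarray(learn_y, dtype=float)[inside]

    def rho(r):
        # average of the smallest and largest m_i lying inward of r
        inward = m[d <= abs(r)]
        if inward.size == 0:   # inward of the nearest point: use its output (my assumption)
            inward = m[d == d.min()]
        return 0.5 * (inward.min() + inward.max())

    grid = np.linspace(-radius, radius, n_grid)
    return np.mean([rho(r) for r in grid])

# With a single learning point, the guess is that point's output:
print(npa_guess([1.0], [3.5], question=0.0))   # -> 3.5
```

Translating all the outputs by a constant k translates the guess by k, and scaling them scales the guess, exactly as argued above.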
Appendix I.
Proof that all LMMGs are output linear
Assume we have one learning set, θ_1, consisting of the n pairs (x⃗_1, y_1), …, (x⃗_n, y_n),
and another learning set θ_2 consisting of the n pairs (x⃗_1, y′_1), …, (x⃗_n, y′_n), where
the vectors x⃗_j might be multidimensional, in general. Assume further that
our LMMG fits θ_1 with the surface Σ_{i=1}^{N} α_i φ_i(r⃗) and fits θ_2 with Σ_{i=1}^{N′} α′_i φ_i(r⃗).
So Σ_{i=1}^{N} α_i φ_i(x⃗_j) = y_j and Σ_{i=1}^{N′} α′_i φ_i(x⃗_j) = y′_j for all n x⃗_j. Without loss of
generality we can take N ≥ N′. We can then extend the set of α′_i to those i
between N′ and N (including N) by simply setting α′_i = 0 for N ≥ i > N′.
So the LMMG fits θ_1 with Σ_{i=1}^{N} α_i φ_i(r⃗) and fits θ_2 with Σ_{i=1}^{N} α′_i φ_i(r⃗). Now
it is obvious that

A Σ_{i=1}^{N} α_i φ_i(r⃗) + B Σ_{i=1}^{N} α′_i φ_i(r⃗) = Σ_{i=1}^{N} (A α_i + B α′_i) φ_i(r⃗),

a surface which passes through all n pairs (x⃗_j, A y_j + B y′_j), i.e., through
A θ_1 + B θ_2. Suppose the LMMG instead fit A θ_1 + B θ_2 using only the first
N − 1 of the φ_i(r⃗). Then, combining that fit with the fit to θ_2 and
rearranging terms, we would see that there must exist a set of coefficients β_i,
1 ≤ i < N, such that

Σ_{i=1}^{N} α_i φ_i(x⃗_j) = Σ_{i=1}^{N−1} β_i φ_i(x⃗_j)

for all x⃗_j in θ_1. However, if this were true, then we could have fit θ_1 with only
the first N − 1 of the φ_i(r⃗). This is contrary to our initial assumption that
the LMMG actually fit θ_1 using the first N of the φ_i(r⃗), so it is true that the
LMMG cannot fit θ_1 without using φ_N(r⃗).
It was shown in appendix E that the coefficient the LMMG fits to φ_N(r⃗)
is unique, so that coefficient must indeed equal A α_N + B α′_N. So now we must
prove that the LMMG will also choose the values A α_{i<N} + B α′_{i<N} (as the
coefficients of the basis functions φ_{i<N}(r⃗)) to construct a surface which fits
A θ_1 + B θ_2. To carry out such a proof we need to assume some means for the LMMG
to choose between two (or more) possible sets of coefficients which both fit a
given learning set and which both have the same N. Here I will assume the
choosing scheme outlined in appendix E.
Examine the (N − 1)th coefficient which the LMMG uses to fit A θ_1 +
B θ_2: β_{N−1}. (We already know that β_N = A α_N + B α′_N.) In general either
the vector {φ_{N−1}(r⃗_1), φ_{N−1}(r⃗_2), …, φ_{N−1}(r⃗_n)} is linearly independent of the
vectors {φ_i(r⃗_1), φ_i(r⃗_2), …, φ_i(r⃗_n)},
1 ≤ i < N − 1, or it is linearly dependent
on them. If it is linearly dependent on them, then we can set β_{N−1} equal
to 0. In light of our choosing method, in point of fact we have no choice
but to choose the β_i such that β_{N−1} = 0 if this linear dependence holds.
Similarly, if we have this linear dependence then both α_{N−1} and α′_{N−1} must
= 0. Therefore, in this situation, β_{N−1} does indeed = A α_{N−1} + B α′_{N−1}.
Now examine the case where the vector {φ_{N−1}(r⃗_1), …, φ_{N−1}(r⃗_n)} is linearly
independent of those vectors.
Hypothesize that the β_{N−1} which reproduces the learning set is not unique,
and represent the alternative set of coefficients as β′_i, 1 ≤ i ≤ N − 1.
Since by hypothesis of nonuniqueness β_{N−1} ≠ β′_{N−1} and yet Σ_{i=1}^{N−1} β_i φ_i(r⃗_j) =
Σ_{i=1}^{N−1} β′_i φ_i(r⃗_j) for all r⃗_j in the learning set, we can write

Σ_{i=1}^{N−1} (β_i − β′_i) φ_i(r⃗_j) = 0 for all r⃗_j in the learning set,

which, since β_{N−1} − β′_{N−1} ≠ 0, would make the vector {φ_{N−1}(r⃗_1), …, φ_{N−1}(r⃗_n)}
linearly dependent on the vectors {φ_i(r⃗_1), …, φ_i(r⃗_n)}, 1 ≤ i < N − 1, contrary
to the case being examined. Hence β_{N−1} is
unique. Since we know that the coefficient A α_{N−1} + B α′_{N−1} is a legal coefficient
of the φ_{N−1}(r⃗) term, we can in fact conclude that β_{N−1} = A α_{N−1} + B α′_{N−1}
for this case of linear independence as well as for the previously explored case
of linear dependence.
These same arguments can be used for all the remaining coefficients the
LMMG fits to A θ_1 + B θ_2. The result is clearly that β_i = A α_i + B α′_i for all i.
This concludes the proof that LMMGs are output linear.
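The result can be checked numerically for a concrete LMMG. The sketch below is my own (a polynomial basis, with learning sets sharing the same inputs chosen so that the full N-function fit is unique, sidestepping the choosing-scheme subtleties): the coefficients fitted to A·θ_1 + B·θ_2 come out as A α_i + B α′_i.

```python
import numpy as np

def lmmg_coeffs(xs, ys):
    """Fit ys with the polynomial basis phi_i(x) = x**(i-1), i = 1..n."""
    M = np.vander(xs, N=len(xs), increasing=True)   # M[j, i] = xs[j]**i
    return np.linalg.solve(M, ys)

xs = np.array([0.0, 0.5, 1.0, 2.0])     # shared learning-set inputs
y1 = np.array([1.0, -0.3, 2.2, 0.4])    # theta_1 outputs
y2 = np.array([0.7, 1.1, -2.0, 3.3])    # theta_2 outputs
A, B = 2.5, -1.25

alpha  = lmmg_coeffs(xs, y1)
alphap = lmmg_coeffs(xs, y2)
beta   = lmmg_coeffs(xs, A * y1 + B * y2)   # fit A*theta_1 + B*theta_2

assert np.allclose(beta, A * alpha + B * alphap)
print("beta_i = A*alpha_i + B*alpha'_i for all i")
```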
Appendix J.
Output linear HERBIEs and neural nets
To see that neural net neurons satisfy the restrictions on h for n = 2, output
linear, one-dimensional HERBIEs, note that for any output linear generalizers,

h(x, x′, x″, …, q) = h(ax + b, ax′ + b, ax″ + b, …, aq + b),

due to freedom in picking the y, y′, … and invariance under scaling and translating
the input space. Since the equality holds for arbitrary a and b, if the
x, x′, x″, … are not all the same as one another then the value of h is the
same everywhere on the plane in (x, x′, x″, …, q) space formed by allowing a
and b to take on all possible values. (For x = x′ = ⋯ = q, varying a and b
only traces out a line, not a plane.) For an n = 2, one-dimensional HERBIE,
restriction (2.5) in fact requires that x ≠ x′, so the generalization is indeed
determined by such iso-h planes. So given any triplet of values X, X′ ≠ X,
Q in (x, x′, q) space, we have a unique iso-h plane. Now any such plane has to
intersect the line x = −1/2, x′ = 1/2, as can be seen by solving for a and b in
the pair of equations aX + b = −1/2, aX′ + b = 1/2 (since X ≠ X′, we always
get such a solution for a and b). Since a and b are therefore uniquely specified
by X and X′, Q̄ = aQ + b, the q value of the intersection of the plane with the
line x = −1/2, x′ = 1/2, is specified uniquely. Because any point on an iso-h
plane completely specifies that plane, we can therefore use (−1/2, 1/2, Q̄) to
specify the plane instead of (X, X′, Q); the plane specified by (X, X′, Q)
is the same as the locus of points (−a′/2 + b′, a′/2 + b′, a′Q̄ + b′) for all
a′ and b′. As a result, the iso-h planes form a one-parameter set, with Q̄
being the parameter. Now all points (x, x′, q) on any given iso-h plane will
agree on the value of w = 1/2 − (q − x)/(x′ − x), and this function specifies
a unique Q̄ value for the intersection of the plane containing the points
with the line x = −1/2, x′ = 1/2. Therefore w uniquely specifies the iso-h
plane. Under interchange of x and x′, w becomes −w. Since from before
h(x′, x, q) = 1 − h(x, x′, q), for all x, x′ ≠ x, and q, h(−w) = 1 − h(w). In particular,
since h(q, x′, q) = 1, we can write h(w = 1/2) = 1, and h(−1/2) = 0.
Also h(0) = 1/2. This of course is exactly the behavior exhibited by fermi
functions (for appropriate rescaling of the w so that w = 1/2 becomes w = ∞
and w = −1/2 becomes w = −∞), step functions, and most of the other
thresholding functions used to model neurons in neural nets. Note that since
there exist an infinite number of functions h(w) obeying h(w) = 1 − h(−w)
and h(1/2) = 1, the restriction of output linearity is not enough to force
unique generalization of HERBIEs. Also note that most of this analysis applies
equally well to the g{i} of HERBIEs seen as functions only over the
x_1, x_2, …, q.
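These constraints are easy to check against a concrete fermi (logistic) function. In the sketch below (my own), the rescaling of w is the hypothetical odd map w ↦ w/(1/4 − w²), which sends w = ±1/2 to ±∞ as required.

```python
import math

def h(w):
    """A fermi (logistic) h obeying h(-w) = 1 - h(w), built by rescaling
    w in (-1/2, 1/2) so that w = 1/2 maps to +inf and w = -1/2 to -inf."""
    w_rescaled = w / (0.25 - w * w)    # an odd map (-1/2, 1/2) -> (-inf, inf); my choice
    return 1.0 / (1.0 + math.exp(-w_rescaled))

# The constraints derived in the text:
assert abs(h(0.3) + h(-0.3) - 1.0) < 1e-12   # h(-w) = 1 - h(w)
assert abs(h(0.0) - 0.5) < 1e-12             # h(0) = 1/2
assert h(0.4999) > 1.0 - 1e-9                # h(w) -> 1 as w -> 1/2
print("the logistic h satisfies the output-linearity constraints")
```

A step function at w = 0 satisfies the same constraints, in line with the observation that output linearity alone does not single out a unique h.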
To place the iso-h concept in a larger context, view the direct product
of the input components of the datum spaces for a given order g{i},
(x, x′, x″, …, q), as defining a vector space V. Then the invariances of restriction
(2.6), in addition to being viewed as referring to each component of
a vector in V undergoing coordinate transformations within its own private
R^m space, can also be viewed as referring to coordinate transformations in
the whole space V. The iso-h planes are planes of invariance in V. If we now
construct a space V′ with elements (y, y′, y″, …, guess) in exactly the same
way we constructed V, then datum space interchange symmetry is an invariance
of g{i} under certain simultaneous reflections (those which flip around
the axes) in both V and V′. Although it has not yet been done (mainly
because it's hard to see what its physical significance would be), it might be
mathematically interesting to explore the consequences of requiring that the
g{i} be invariant under more general sets of simultaneous reflections in V
and V′. For output linearity, where the g{i} can be viewed as a sort of dual
vector field, such an invariance amounts to a symmetry under simultaneous
reflections of the underlying manifold across which the field is defined and
the accompanying transformation of the individual vector spaces living on
that manifold.
Appendix K.
Proof that for n = 2, the solution to (2.12) is a straight line or an
exponential
Clearly both a linear f(x) and an exponential one are solutions to this new
restriction. To prove the uniqueness of these solutions, we want to solve for
the set of all f(x) which obey (2.12) and give an h obeying the coordinate
transformation invariances required by the generalizer's being a HERBIE.
To do this, first exploit the symmetries allowed all HERBIEs to translate
and scale the input space and then translate the output space so that any
original triple of points (x, f(x)), (x′, f(x′)), (q, f(q)) becomes (0, 0), (a, b),
(c, d), respectively, where the original x, x′, and q are assumed chosen such
that c > a > 0. Due to the fact that the value of h is unchanged if its three
arguments are all translated by the same amount, if q now refers to a point
a little to the right of c,

[f(q) − f(q − (c − a))] / [f(q − (c − a)) − f(q − c)] = (d − b)/b = f(c)/f(a) − 1.   (K.1)
We can ignore the case where both d and b = 0, since by (2.12) this means
f(q) is the straight line f(q) = 0. By continuity, f(x) is nonzero everywhere.
This must be true for all δ and (small enough) ε. Let δ and ε both be
small and expand both sides of (K.2) to second order in δ and ε. We recover
f(0) = 0 (which we knew already) and for the δε term get f(a)f″(a) =
f′(a)[f′(a) − f′(0)]. This must be true for a whole range of a, so setting f′(0)
equal to a constant, α, and letting a vary, we get a second-order nonlinear
differential equation:

f(x) f″(x) = f′(x) [f′(x) − α].
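Both candidate solutions can be checked against this o.d.e. numerically. The sketch below (my own) verifies that a straight line through the origin and an exponential A(e^{kx} − 1) both satisfy f(x) f″(x) = f′(x)[f′(x) − f′(0)], with f(0) = 0 as arranged above.

```python
import math

# Residual of the o.d.e. f*f'' = f'*(f' - f'(0)), with alpha = f'(0).
def residual(f, df, d2f, x):
    return f(x) * d2f(x) - df(x) * (df(x) - df(0.0))

# Straight line through the origin: f(x) = 3x
line = (lambda x: 3.0 * x, lambda x: 3.0, lambda x: 0.0)

# Exponential through the origin: f(x) = A*(e^{k*x} - 1)
A, k = 2.0, 0.7
expo = (lambda x: A * (math.exp(k * x) - 1.0),
        lambda x: A * k * math.exp(k * x),
        lambda x: A * k * k * math.exp(k * x))

for f, df, d2f in (line, expo):
    assert all(abs(residual(f, df, d2f, x)) < 1e-9 for x in (0.2, 1.0, 2.5))
print("both the linear and the exponential f satisfy the o.d.e.")
```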