Uniqueness of the Weights for Minimal Feedforward Nets with a Given Input-Output Map
HECTOR J. SUSSMANN
Rutgers University
Abstract. We show that, for feedforward nets with a single hidden layer, a single output node, and a transfer function $\tanh s$, the net is uniquely determined by its input-output map, up to an obvious finite group of symmetries (permutations of the hidden nodes, and changing the sign of all the weights associated with a particular hidden node), provided that the net is irreducible, i.e. that there does not exist a hidden node that makes a zero contribution to the output, and there is no pair of hidden nodes that could be collapsed to a single node without altering the input-output map.
Rutgers Center for Systems and Control, May 1991. Revised October 1991.
Research supported in part by the Air Force Office of Scientific Research (AFOSR-91-0343). The author thanks Eduardo Sontag for suggesting the problem and for his helpful comments and ideas, and an anonymous referee for suggesting how to improve the exposition at several points. Requests for reprints should be sent to Hector J. Sussmann, Department of Mathematics, Rutgers University, New Brunswick, NJ 08903.
§1. Introduction.
Feedforward neural nets with a single hidden layer have been shown to exhibit a number of remarkable properties that make them good candidates for nonlinear curve-fitting. They satisfy universal approximation properties (Cybenko, 1989; Hornik et al., 1989). More recently, it has been proved that they are actually better approximators than many other classes (e.g. polynomials), in the sense that one needs a smaller number of parameters in order to approximate an arbitrary function in a certain class within a given error (Barron, 1991a, 1991b). Questions of interpolation using one-hidden-layer nets, and in particular issues dealing with counting how many units are needed in order to achieve a given objective, are discussed by Sontag (1991b). (It is worth remarking that single-hidden-layer nets are severely restricted in their approximation properties with respect to discontinuous target functions, as explained in Sontag, 1991a.)

We study the following question: to what extent is a feedforward net uniquely determined by its input-output map? Hecht-Nielsen (1989) pointed out that there are some obvious sources of non-uniqueness, arising from internal symmetries such as the possibility of relabeling the hidden nodes. He asked whether the small, discrete symmetry group $\mathcal{G}$ that he exhibited was actually the full symmetry group, i.e. whether the input-output map of the net determines the net modulo the symmetries in $\mathcal{G}$. We give an affirmative answer to this question, by proving a uniqueness theorem: two "irreducible" nets with the same input-output map are related by a transformation in $\mathcal{G}$. The irreducibility condition is needed because there is another source of non-uniqueness, coming from the fact that a net may contain nodes that make no contribution whatsoever to the output, e.g. nodes whose outgoing connection weights are equal to zero. This means that there are cases when a net can be reduced (i.e. some of its nodes can be removed) without changing the input-output map. Clearly, a uniqueness theorem can only hold for irreducible nets.

In this note we consider feedforward nets with a single hidden layer and a transfer function $\sigma(s) = \tanh s$. For such nets, we give a precise definition of reducibility and of the group $\mathcal{G}$ of internal symmetries, and show rigorously that an irreducible net is uniquely determined, up to symmetries in $\mathcal{G}$, by its input-output map.
As a byproduct of our results, it will follow that an irreducible net is minimal. Here we call a net "minimal" if its input-output map cannot be obtained from another net with fewer hidden nodes. So minimality is not, in principle, an easily checkable property. On the other hand, "irreducibility" is defined by listing three very simple situations under which a net could be made smaller, and then calling the net irreducible if none of these situations occurs. So irreducibility is easily checked. The fact that irreducibility is equivalent to minimality means that there is no mechanism, other than the three listed in the definition of reducibility, that can be used to reduce the number of hidden nodes without changing the input-output map.

The paper is organized as follows: in §2 we define the main concepts and state the uniqueness theorem. The proof of the theorem is given in §3.
§2. Definitions and statement of the theorem.

Fix positive integers $m$, $n$. A net with $m$ inputs and $n$ hidden nodes is specified by weights $w_{ij}$, $i = 0, 1, \ldots, m$, $j = 1, \ldots, n$, together with output weights $c_j$, $j = 0, 1, \ldots, n$. To the $j$-th hidden node we associate the affine function $\theta_j(x) = w_{0j} + \sum_{i=1}^{m} w_{ij} x_i$ of the input $x = (x_1, \ldots, x_m) \in \mathbb{R}^m$, and we let the net input of the output node be $\Theta(x) = c_0 + \sum_{j=1}^{n} c_j \sigma(\theta_j(x))$, where $\sigma(s) = \tanh s$ is the transfer function. We write $\mathcal{N}_{m,n}$ for the set of all such nets.

The output $z(x)$ of the net is the number $z(x) = \sigma(\Theta(x))$. The function $x \to z(x)$ is the input-output map of the net. For a fixed $m$, two nets are I-O equivalent if their corresponding input-output maps are the same. Our goal is to show that two I-O equivalent nets must necessarily be the same net, up to some simple internal symmetries.

Fix $m$. To a net with $n$ hidden nodes we thus associate the $n$ affine functions $\mathbb{R}^m \ni x \to \theta_j(x) \in \mathbb{R}$, $j = 1, \ldots, n$. We call two affine functions $\theta$, $\eta$ on $\mathbb{R}^m$ sign-equivalent if $|\theta(x)| = |\eta(x)|$ for all $x \in \mathbb{R}^m$. (This is easily seen to be equivalent to the condition that either $\theta(x) = \eta(x)$ for all $x$ or $\theta(x) = -\eta(x)$ for all $x$.) We call a net reducible if one of the following conditions holds:

(I) one of the $c_j$, for $j = 1, \ldots, n$, vanishes;
(II) there exist two different indices $j_1, j_2 \in \{1, \ldots, n\}$ such that the functionals $\theta_{j_1}$, $\theta_{j_2}$ are sign-equivalent;
(III) one of the functionals $\theta_j$ is constant.

If a net $N$ is reducible, then it is I-O equivalent to another net with fewer hidden nodes. Indeed, if $N$ is reducible because (I) holds, and $j \in \{1, \ldots, n\}$ is such that $c_j = 0$, then the $j$-th node makes no contribution to the net input $\Theta(x)$, so we can simply remove it without changing the output. Now suppose that $N$ is reducible because (II) holds. Let $j_1$, $j_2$ be such that $\theta_{j_1}$ and $\theta_{j_2}$ are sign-equivalent, say $\theta_{j_1}(x) \equiv \varepsilon\,\theta_{j_2}(x)$, where $\varepsilon = 1$ or $\varepsilon = -1$. Then the combined contribution of the nodes $j_1$ and $j_2$ to $\Theta(x)$ is $c_{j_1} \sigma(\theta_{j_1}(x)) + c_{j_2} \sigma(\theta_{j_2}(x))$, i.e. $c_{j_1} \sigma(\varepsilon\,\theta_{j_2}(x)) + c_{j_2} \sigma(\theta_{j_2}(x))$, which is equal to $\varepsilon\, c_{j_1} \sigma(\theta_{j_2}(x)) + c_{j_2} \sigma(\theta_{j_2}(x))$, i.e. to $(\varepsilon\, c_{j_1} + c_{j_2})\, \sigma(\theta_{j_2}(x))$. (Here we used the fact that $\sigma$ is an odd function, so $\sigma(\varepsilon s) = \varepsilon\,\sigma(s)$.) So we could obtain the same input-output map by just removing Node $j_1$ and changing the weight of the connection from Node $j_2$ to the output from $c_{j_2}$ to $c_{j_2}' = \varepsilon\, c_{j_1} + c_{j_2}$. Finally, if $N$ is reducible because (III) holds, and $\theta_j$ is constant with value $\lambda$, then we can remove the $j$-th hidden node and change $c_0$ to $c_0 + c_j \sigma(\lambda)$.
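To make these definitions concrete, here is a small computational sketch (ours, not part of the original text; the names `Net` and `reduce_once` are illustrative, and NumPy is assumed). It implements the input-output map $z(x) = \sigma(\Theta(x))$ and the three reductions (I), (II), (III):

```python
import numpy as np

class Net:
    """A net in N_{m,n}: rows of W are the linear parts of the theta_j,
    w0 the biases w_{0j}, c the output weights c_1..c_n, c0 the output bias."""
    def __init__(self, W, w0, c, c0):
        self.W, self.w0 = np.asarray(W, float), np.asarray(w0, float)
        self.c, self.c0 = np.asarray(c, float), float(c0)

    def z(self, x):
        # z(x) = tanh(Theta(x)), Theta(x) = c0 + sum_j c_j tanh(theta_j(x))
        return np.tanh(self.c0 + self.c @ np.tanh(self.W @ x + self.w0))

def reduce_once(net, tol=1e-12):
    """Return an I-O equivalent net with one hidden node fewer if one of
    the reductions (I), (II), (III) applies; otherwise return None."""
    n = len(net.c)
    for j in range(n):
        keep = [i for i in range(n) if i != j]
        if abs(net.c[j]) < tol:                      # (I): c_j = 0
            return Net(net.W[keep], net.w0[keep], net.c[keep], net.c0)
        if np.all(np.abs(net.W[j]) < tol):           # (III): theta_j constant
            return Net(net.W[keep], net.w0[keep], net.c[keep],
                       net.c0 + net.c[j] * np.tanh(net.w0[j]))
    for j1 in range(n):
        for j2 in range(j1 + 1, n):
            for eps in (1.0, -1.0):                  # (II): theta_j1 = eps*theta_j2
                if (np.allclose(net.W[j1], eps * net.W[j2], atol=tol)
                        and abs(net.w0[j1] - eps * net.w0[j2]) < tol):
                    keep = [i for i in range(n) if i != j1]
                    c = net.c.copy()
                    c[j2] += eps * net.c[j1]         # c'_j2 = eps*c_j1 + c_j2
                    return Net(net.W[keep], net.w0[keep], c[keep], net.c0)
    return None
```

Repeatedly applying `reduce_once` until it returns `None` produces, without changing the input-output map, a net to which none of the three reductions applies.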
A net which is not reducible will be called irreducible. A net with $n$ hidden nodes will be called minimal if it is not I-O equivalent to a net with fewer hidden nodes. We have shown above that a minimal net is necessarily irreducible. It will follow from our main theorem that, conversely, an irreducible net is minimal.

There are some obvious transformations that can be applied to a net $N$ without changing its input-output map. For instance, suppose we pick a hidden node $j$ and change the sign of all the weights $w_{ij}$ for $i = 0, 1, \ldots, m$, and also change the sign of $c_j$. Since $\sigma$ is odd, this will not alter the contribution of this node to the total net input $\Theta(x)$. Let us use $\tau_j(N)$ to denote the net resulting from this transformation. Another possibility is to interchange two hidden nodes, that is, to take two hidden nodes $j_1$ and $j_2$ and relabel $j_1$ as $j_2$ and $j_2$ as $j_1$, taking care to also relabel the corresponding weights. (Precisely: consider a new net with weights $w_{ij}^{\mathrm{new}}$, $c_j^{\mathrm{new}}$, such that $w_{ij_1}^{\mathrm{new}} = w_{ij_2}$, $w_{ij_2}^{\mathrm{new}} = w_{ij_1}$, $c_{j_1}^{\mathrm{new}} = c_{j_2}$, $c_{j_2}^{\mathrm{new}} = c_{j_1}$.) Call the resulting net $\tau_{j_1 j_2}(N)$. The maps $\tau_{j_1 j_2}$, $\tau_j$ generate a finite group $\mathcal{G}_{m,n}$ of transformations of the set $\mathcal{N}_{m,n}$. (The precise number of elements of this group is $2^n n!$.) Call two nets in $\mathcal{N}_{m,n}$ equivalent if they are related by a transformation in the group $\mathcal{G}_{m,n}$. It is then clear that two equivalent nets are I-O equivalent. Our main result says that the converse is true for irreducible nets:

Theorem 2.1. Let $N_1$, $N_2$ be irreducible I-O equivalent nets in $\mathcal{N}_{m,n_1}$, $\mathcal{N}_{m,n_2}$. Then (i) $n_1 = n_2$ and (ii) $N_1$ and $N_2$ are equivalent.

We will prove this theorem in the next section. First, we remark that the theorem implies:

Corollary 2.1. An irreducible net is minimal.

PROOF. Let $N$ be irreducible. If $N$ were not minimal, there would be another net $N'$ with fewer hidden nodes that computes the same input-output function. If $N'$ is not irreducible, then we could reduce the number of nodes, as explained above, without changing the input-output map. If the resulting net is not irreducible, we can reduce again. Continue this process until no further reduction is possible. We then get an irreducible net $\hat N$ which is I-O equivalent to $N$ but has fewer nodes than $N$. By the theorem, $N$ and $\hat N$ have the same number of hidden nodes. This contradiction completes the proof. ∎
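For concreteness, the following sketch (ours; NumPy assumed, function names illustrative) checks numerically that the two generators of $\mathcal{G}_{m,n}$, the sign flip $\tau_j$ and the transposition $\tau_{j_1 j_2}$, leave the input-output map unchanged:

```python
import numpy as np

def io_map(W, w0, c, c0, x):
    # z(x) = tanh(c0 + sum_j c_j tanh(theta_j(x))), theta_j(x) = w0_j + W_j . x
    return np.tanh(c0 + c @ np.tanh(W @ x + w0))

def flip_sign(W, w0, c, j):
    # tau_j: change the sign of all weights into hidden node j, and of c_j
    W, w0, c = W.copy(), w0.copy(), c.copy()
    W[j], w0[j], c[j] = -W[j], -w0[j], -c[j]
    return W, w0, c

def swap_nodes(W, w0, c, j1, j2):
    # tau_{j1 j2}: relabel hidden nodes j1 and j2, relabeling their weights
    W, w0, c = W.copy(), w0.copy(), c.copy()
    W[[j1, j2]] = W[[j2, j1]]
    w0[[j1, j2]] = w0[[j2, j1]]
    c[[j1, j2]] = c[[j2, j1]]
    return W, w0, c

rng = np.random.default_rng(0)
m, n = 3, 4
W, w0, c, c0 = rng.normal(size=(n, m)), rng.normal(size=n), rng.normal(size=n), 0.7
x = rng.normal(size=m)
z = io_map(W, w0, c, c0, x)
assert np.isclose(z, io_map(*flip_sign(W, w0, c, 1), c0, x))      # sign flip
assert np.isclose(z, io_map(*swap_nodes(W, w0, c, 0, 2), c0, x))  # transposition
```

The sign flip relies on $\sigma$ being odd, exactly as in the discussion of reduction (II) above; the transposition is a pure relabeling.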
§3. Proof of the theorem.

Let $N_1 \in \mathcal{N}_{m,n_1}$ and $N_2 \in \mathcal{N}_{m,n_2}$ be irreducible I-O equivalent nets, with output weights $c_j^1$, $c_j^2$ and associated affine functions $\theta_j^1$, $\theta_j^2$. Since $\sigma$ is injective, the equality of the two input-output maps is equivalent to the identity

$$c_0^1 + \sum_{j=1}^{n_1} c_j^1\, \sigma(\theta_j^1(x)) = c_0^2 + \sum_{j=1}^{n_2} c_j^2\, \sigma(\theta_j^2(x)) \quad \text{for all } x \in \mathbb{R}^m. \qquad (3.1)$$
Let $J = \{1, 2, \ldots, n_1 + n_2\}$, and define $\varphi_j = \theta_j^1$ for $1 \le j \le n_1$, $\varphi_j = \theta_{j-n_1}^2$ for $n_1 + 1 \le j \le n_1 + n_2$, $a_0 = c_0^1 - c_0^2$, $a_j = c_j^1$ for $1 \le j \le n_1$, and $a_j = -c_{j-n_1}^2$ for $n_1 + 1 \le j \le n_1 + n_2$. Then (3.1) becomes

$$a_0 + \sum_{j \in J} a_j\, \sigma(\varphi_j(x)) = 0 . \qquad (3.2)$$

Lemma 3.1. Let $J$ be a finite set and let $\{\varphi_j\}_{j \in J}$ be a family of nonconstant affine functions on $\mathbb{R}^m$, no two of which are sign-equivalent. Then the functions $\sigma \circ \varphi_j$, $j \in J$, and the constant function $1$ are linearly independent.
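Lemma 3.1 can be probed numerically (a sanity check only, not a proof; the sketch and its names are ours, NumPy assumed): sampling the functions $\sigma(\varphi_j(x))$ together with the constant $1$ at random points should give a matrix of full column rank, since random affine functions are almost surely nonconstant and pairwise non-sign-equivalent.

```python
import numpy as np

rng = np.random.default_rng(1)
m, k, npts = 2, 5, 200
B = rng.normal(size=(k, m))        # linear parts beta_j (nonzero a.s.)
b0 = rng.normal(size=k)            # constant terms beta_j^0
X = rng.normal(size=(npts, m))     # random sample points in R^m
# columns: the constant 1 and tanh(phi_j(x)) evaluated at the sample points
cols = np.column_stack([np.ones(npts), np.tanh(X @ B.T + b0)])
print(np.linalg.matrix_rank(cols))  # expected: k + 1 = 6 (full column rank)
```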
PROOF. Assume that $a + \sum_{j \in J} a_j\, \sigma \circ \varphi_j \equiv 0$. We have to prove that $a$ and all the $a_j$ vanish.

Write $\varphi_j(x) = \beta_j(x) + \beta_j^0$, where $\beta_j$ is a nonzero linear functional and $\beta_j^0$ is a constant. (The fact that $\beta_j \ne 0$ follows because $\varphi_j$ is not constant.) Our hypothesis guarantees that, if $j_1 \ne j_2$, then either (i) the functionals $\beta_{j_1}$, $\beta_{j_2}$ are not sign-equivalent, or (ii) if $\beta_{j_1} \equiv \varepsilon\, \beta_{j_2}$ with $\varepsilon = \pm 1$, then the corresponding constant terms do not satisfy $\beta_{j_1}^0 = \varepsilon\, \beta_{j_2}^0$ with the same choice of sign.

Define an equivalence relation on $J$ by calling two elements $j_1$, $j_2$ of $J$ equivalent if the corresponding linear functionals $\beta_{j_1}$, $\beta_{j_2}$ are sign-equivalent. Let $\mathcal{E}$ be the set of equivalence classes. Pick a $j_E$ for each $E \in \mathcal{E}$. As $E$ varies over the classes in $\mathcal{E}$, no two of the functionals $\beta_{j_E}$ are sign-equivalent. So, for each pair $E_1$, $E_2$ of distinct members of $\mathcal{E}$, the set of points $x \in \mathbb{R}^m$ where $|\beta_{j_{E_1}}(x)| \ne |\beta_{j_{E_2}}(x)|$ is open and dense in $\mathbb{R}^m$. Also, for each $E$, the set of $x \in \mathbb{R}^m$ such that $\beta_{j_E}(x) \ne 0$ is open and dense in $\mathbb{R}^m$, because $\beta_{j_E}$ is not $0$. So we may pick an $x$ such that $\beta_{j_E}(x) \ne 0$ for all $E \in \mathcal{E}$, and $|\beta_{j_{E_1}}(x)| \ne |\beta_{j_{E_2}}(x)|$
for all pairs $E_1$, $E_2$ of distinct members of $\mathcal{E}$.

Let $\tilde\beta_j = \varepsilon_j \beta_j$, where $\varepsilon_j = \pm 1$, the sign being chosen so that $\tilde\beta_j(x) > 0$. Let $\tilde\beta_j^0 = \varepsilon_j \beta_j^0$, $\tilde\varphi_j = \varepsilon_j \varphi_j$, $\tilde a_j = \varepsilon_j a_j$. Then

$$a + \sum_{j \in J} \tilde a_j\, \sigma \circ \tilde\varphi_j \equiv 0 \qquad (3.3)$$

as well, and our proof will be complete if we show that $a$ and all the $\tilde a_j$ vanish. Moreover, if the $\tilde a_j$ vanish, then (3.3) shows that $a = 0$ as well. So it will be enough to show that the $\tilde a_j$ vanish.

If $E \in \mathcal{E}$, then the functionals $\tilde\beta_j$ for $j \in E$ are all sign-equivalent, and satisfy $\tilde\beta_j(x) > 0$. So in fact all the $\tilde\beta_j$ for $j \in E$ are equal to one and the same functional, which we will call $\tilde\beta_E$. If $j, j' \in E$ and $j \ne j'$, then we know that $\beta_j = \varepsilon\, \beta_{j'}$, where $\varepsilon = \pm 1$, and $\beta_j^0 \ne \varepsilon\, \beta_{j'}^0$. On the other hand, $\tilde\beta_j(x) = \tilde\beta_{j'}(x)$ and $\tilde\beta_j(x) = \varepsilon_j \beta_j(x)$, $\tilde\beta_{j'}(x) = \varepsilon_{j'} \beta_{j'}(x)$, so $\varepsilon_j = \varepsilon\, \varepsilon_{j'}$. Since $\beta_j^0 \ne \varepsilon\, \beta_{j'}^0$, we conclude that $\tilde\beta_j^0 \ne \tilde\beta_{j'}^0$. So we have shown that the numbers $\tilde\beta_j^0$, as $j$ varies over an $E \in \mathcal{E}$, are all different. As $E$ varies over $\mathcal{E}$, the numbers $\tilde\beta_E(x)$ are positive and different. If $j \in E$, we have

$$\sigma(\tilde\varphi_j(q)) = 1 - \frac{2}{1 + \lambda_j\, e^{2\tilde\beta_E(q)}} , \qquad (3.4)$$

where $\lambda_j = e^{2\tilde\beta_j^0}$. (In particular, the $\lambda_j$ for $j$ in a fixed $E \in \mathcal{E}$ are all different.) Since $\tilde\beta_E(x) > 0$ and $\lambda_j > 0$, we can conclude that

$$\lim_{t \to +\infty} \sigma(\tilde\varphi_j(tx)) = 1 . \qquad (3.5)$$

Evaluating (3.3) at $tx$ and letting $t \to +\infty$, we get

$$a + \sum_{j \in J} \tilde a_j = 0 . \qquad (3.6)$$

If we subtract (3.6) from (3.3) we get

$$\sum_{j \in J} \tilde a_j\, \rho_j \equiv 0 , \qquad (3.7)$$
where $\rho_j$ is the function

$$\rho_j(q) = \frac{1}{1 + \lambda_j\, e^{2\tilde\beta_j(q)}} . \qquad (3.8)$$

So we have $\sum_{j \in J} \tilde a_j \rho_j \equiv 0$. Now order the classes $E \in \mathcal{E}$ in a finite sequence $(E_1, E_2, \ldots, E_r)$, chosen so that $\tilde\beta_{E_1}(x) < \tilde\beta_{E_2}(x) < \cdots < \tilde\beta_{E_r}(x)$. Let $v_k = \tilde\beta_{E_k}(x)$. We then have

$$\rho_j(-tx) = \left(1 + \lambda_j\, e^{-2tv_k}\right)^{-1} \qquad (3.9)$$

if $j \in E_k$. For each $j$, let $k(j)$ be the $k$ such that $j \in E_k$. For $t > 0$ and sufficiently large, we have $0 < \lambda_j\, e^{-2tv_{k(j)}} < 1$, and so we can expand $\rho_j(-tx)$ in a convergent power series:

$$\rho_j(-tx) = \sum_{s=0}^{\infty} (-1)^s \lambda_j^s\, e^{-2tsv_{k(j)}} . \qquad (3.10)$$

Substituting (3.10) in $\sum_{j \in J} \tilde a_j\, \rho_j(-tx) = 0$, we get

$$0 = \sum_{s=0}^{\infty} \sum_{k=1}^{r} e^{-2tsv_k} \sum_{j \in E_k} (-1)^s \tilde a_j \lambda_j^s ,$$

i.e., collecting the terms that involve the same exponential,

$$0 = \sum_{v \in \mathbb{R},\, v \ge 0} e^{-2tv} \sum_{\substack{s \in \{0,1,\ldots\},\ k \in \{1,\ldots,r\} \\ s v_k = v}} \; \sum_{j \in E_k} (-1)^s \tilde a_j \lambda_j^s . \qquad (3.11)$$
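The two computational facts behind (3.4) through (3.11), namely the identity $\tanh u = 1 - 2/(1 + e^{2u})$ and the geometric expansion of $(1+u)^{-1}$, can be sanity-checked numerically (our sketch, NumPy assumed; the values of $\lambda$, $v$, $t$ are arbitrary):

```python
import numpy as np

u = np.linspace(-3.0, 3.0, 13)
# identity behind (3.4): tanh(u) = 1 - 2/(1 + exp(2u))
assert np.allclose(np.tanh(u), 1.0 - 2.0 / (1.0 + np.exp(2.0 * u)))

# expansion (3.10): 1/(1 + lam*exp(-2tv)) = sum_s (-1)^s lam^s exp(-2tsv),
# valid once t is large enough that lam*exp(-2tv) < 1
lam, v, t = 2.5, 0.8, 3.0
s = np.arange(60)                               # 60 terms suffice here
series = np.sum((-1.0) ** s * lam ** s * np.exp(-2.0 * t * s * v))
assert np.isclose(series, 1.0 / (1.0 + lam * np.exp(-2.0 * t * v)))
```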
The index of summation $v$ is a nonnegative real number, but in fact the only $v$'s occurring in the summation are those that can be expressed as integral multiples of some $v_k$. So these $v$'s form a discrete subset $V$ of the half-line $[0, \infty)$. If we order the positive elements of $V$ as a sequence $\bar v_1, \bar v_2, \ldots$, such that $0 < \bar v_1 < \bar v_2 < \cdots$ (the coefficient corresponding to $v = 0$ must vanish, as one sees by letting $t \to +\infty$ in (3.11)), then it follows easily from (3.11) that all the coefficients

$$\gamma_\ell \stackrel{\mathrm{def}}{=} \sum_{\substack{s,\, k \\ s v_k = \bar v_\ell}} \; \sum_{j \in E_k} (-1)^s \tilde a_j \lambda_j^s \qquad (3.12)$$
vanish. (This is proved by induction on $\ell$: if all the $\gamma_{\ell'}$ for $\ell' < \ell$ are equal to zero, then (3.11) implies that $0 = \gamma_\ell e^{-2t\bar v_\ell} + o(e^{-2t\bar v_\ell})$ as $t \to +\infty$, so $\gamma_\ell = 0$.)

We now fix a $k \in \{1, \ldots, r\}$, and assume that we have already proved that the $\tilde a_j$ vanish for $j \in E_{k'}$, $k' < k$. Let $\nu$ be the number of indices $j$ belonging to $E_k$. Choose integers $s > 0$, $h > 0$, such that the numbers $sv_k$, $(s+h)v_k$, $(s+2h)v_k$, $\ldots$, $(s+(\nu-1)h)v_k$ are not integral multiples of $v_{k'}$ for any $k' \in \{k+1, \ldots, r\}$. (To see that such a choice is possible, let $B$ be the subset of $\{k+1, \ldots, r\}$ consisting of those $k'$ such that the quotient $v_{k'}/v_k$ is rational. Write this quotient as $p_{k'}/q_{k'}$, where $p_{k'}$, $q_{k'}$ are relatively prime positive integers, and $p_{k'} > q_{k'} \ge 1$, because $v_k < v_{k'}$ if $k' > k$. Then $\bar s\, v_k$ cannot be an integer multiple of $v_{k'}$ unless $\bar s$ is divisible by $p_{k'}$. So, if we pick $h = \prod_{k' \in B} p_{k'}$ and $s = 1 + h$, we see that, if $\mu$ is an integer, then $s + \mu h$ can never be divisible by $p_{k'}$, because $p_{k'} > 1$ and $s + \mu h \equiv 1 \pmod{p_{k'}}$. So this $s$ and $h$ have the desired property for all the indices $k' \in B$. Moreover, it is clear that $(s + \mu h)v_k$ cannot be an integer multiple of $v_{k'}$ if $v_{k'}/v_k$ is irrational. So in fact $s$ and $h$ work for all the indices $k' \in \{k+1, \ldots, r\}$.)

Now let $\mu$ be an integer such that $0 \le \mu < \nu$. Then $(s + \mu h)v_k$ is equal to $\bar v_\ell$ for some $\ell$. The corresponding coefficient $\gamma_\ell$ is equal to the sum

$$\gamma_\ell = \sum_{\substack{s',\, k' \\ s' v_{k'} = \bar v_\ell}} \; \sum_{j \in E_{k'}} (-1)^{s'} \tilde a_j \lambda_j^{s'} .$$

This sum contains no contribution coming from values of $k'$ such that $k' < k$, because we are assuming that all the corresponding $\tilde a_j$ vanish. A $k'$ such that $k' > k$ can only contribute to the sum if $s' v_{k'} = \bar v_\ell$ for some integer $s'$, and this can only happen if $(s + \mu h)v_k$ is an integral multiple of $v_{k'}$. Since this is impossible by our choice of $s$ and $h$, we conclude that the sum only contains contributions coming from $k' = k$. In this case, the only possible value of $s'$ is $s + \mu h$. So

$$0 = \gamma_\ell = \sum_{j \in E_k} (-1)^{s + \mu h}\, \tilde a_j \lambda_j^{s + \mu h} ,$$

and therefore

$$\sum_{j \in E_k} \tilde a_j \lambda_j^{s + \mu h} = 0 \quad \text{for } \mu = 0, 1, \ldots, \nu - 1 .$$

Let $b_j = \tilde a_j \lambda_j^s$ and $g_j = \lambda_j^h$. Then

$$\sum_{j \in E_k} b_j\, g_j^{\mu} = 0 \quad \text{for } \mu = 0, 1, \ldots, \nu - 1 . \qquad (3.13)$$
Now, we know that the numbers $\lambda_j$, for $j \in E_k$, are all different. So the $g_j$ are all different. So (3.13) is a system of $\nu$ linear homogeneous equations for the $\nu$ unknowns $b_j$, whose coefficients form a Vandermonde matrix. Since the determinant of this matrix is $\ne 0$, because the $g_j$ are all different, we conclude that all the $b_j$ vanish. But this implies that the $\tilde a_j$, for $j \in E_k$, vanish as well. It follows by induction from the above that all the $\tilde a_j$, for all $j \in E_k$, for all $k$, are equal to zero. As explained above, this proves our lemma. ∎
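The Vandermonde step can be illustrated numerically (our sketch; the values of the $g_j$ below are arbitrary, and `np.vander(g, increasing=True).T` builds the matrix with entries $g_j^\mu$):

```python
import numpy as np

g = np.array([0.5, 1.3, 2.0, 4.1])           # distinct g_j = lambda_j^h
V = np.vander(g, increasing=True).T          # V[mu, j] = g_j^mu, mu = 0..nu-1
print(np.linalg.det(V))                      # nonzero, since the g_j are distinct
# hence the homogeneous system (3.13), V b = 0, forces b = 0:
print(np.linalg.solve(V, np.zeros(len(g))))  # -> [0. 0. 0. 0.]
```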
We now return to the proof of the theorem. Assume that not all the $a_j$ are zero. Then the lemma and (3.2) imply that either (i) one of the $\varphi_j$ is constant, or (ii) two of the $\varphi_j$ are sign-equivalent. The first possibility is excluded because both nets under consideration are irreducible (cf. Condition (III) of the definition of reducibility). So (ii) must hold. But, since no two $\varphi_j$ coming from the same net can be sign-equivalent (by Condition (II) of the definition of reducibility), there must exist $j_1$, $j_2$ with $1 \le j_1 \le n_1$ and $n_1 + 1 \le j_2 \le n_1 + n_2$ such that $\varphi_{j_1}$ and $\varphi_{j_2}$ are sign-equivalent. Moreover, for no other index $j \in J$ can $\varphi_j$ be sign-equivalent to $\varphi_{j_1}$. So we can split the left-hand side of (3.2) into two parts, namely (a) the sum $a_{j_1} \sigma(\varphi_{j_1}(x)) + a_{j_2} \sigma(\varphi_{j_2}(x))$, and (b) the other terms. The other terms involve the constant function $1$ and functions $\varphi_j$ that are not constant and not sign-equivalent to $\varphi_{j_1}$. So, by the lemma, the sum $a_{j_1} \sigma(\varphi_{j_1}(x)) + a_{j_2} \sigma(\varphi_{j_2}(x))$ must vanish. This means that either

$$\theta^1_{j^1} \equiv \theta^2_{\hat\jmath^1} \quad \text{and} \quad c^1_{j^1} = c^2_{\hat\jmath^1} ,$$

or

$$\theta^1_{j^1} \equiv -\theta^2_{\hat\jmath^1} \quad \text{and} \quad c^1_{j^1} = -c^2_{\hat\jmath^1} ,$$

where we have written $j^1 = j_1$ and $\hat\jmath^1 = j_2 - n_1$.

Using this, we can rewrite (3.2) removing the contribution from $j_1$ and $j_2$, and apply the lemma again to the resulting identity. This will give rise to a second pair of hidden nodes $j^2$, $\hat\jmath^2$ such that either

$$\theta^1_{j^2} \equiv \theta^2_{\hat\jmath^2} \quad \text{and} \quad c^1_{j^2} = c^2_{\hat\jmath^2} ,$$

or

$$\theta^1_{j^2} \equiv -\theta^2_{\hat\jmath^2} \quad \text{and} \quad c^1_{j^2} = -c^2_{\hat\jmath^2} .$$

This procedure can be continued until we end up with an identity where all the remaining coefficients $a_j$ vanish. By the irreducibility of the nets, at this point there cannot be any terms left other than $a_0$, because all the $c_j$ for $j \ne 0$ are supposed to be nonzero (cf. Condition (I)). So $a_0 = 0$, i.e. $c_0^1 = c_0^2$. Moreover, we have constructed sequences $j^1, j^2, \ldots, j^k$ and $\hat\jmath^1, \hat\jmath^2, \ldots, \hat\jmath^k$ of nodes of $N_1$, $N_2$, such that each $\theta^1_{j^\ell}$ is sign-equivalent to $\theta^2_{\hat\jmath^\ell}$ and to no other $\theta^2_j$ coming from $N_2$, and such that the $j^\ell$, $\hat\jmath^\ell$ exhaust all the nodes of $N_1$, $N_2$. This means that $n_1 = n_2 = k$. If we make a permutation of the nodes of $N_2$ (which transforms $N_2$ into an equivalent net) we may assume that $\hat\jmath^\ell = j^\ell$ for each $\ell$, so that in fact $\theta^1_j \equiv \pm\theta^2_j$ for each $j$, and $c^1_j = \pm c^2_j$, with the same choice of sign in both. So $N_1$ and $N_2$ are equivalent, as stated. ∎
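Theorem 2.1 suggests a simple computational consequence, not discussed in the paper itself: an irreducible net can be put into a canonical form by choosing a distinguished representative of its $\mathcal{G}_{m,n}$-orbit, so two irreducible nets are I-O equivalent exactly when their canonical forms coincide. Here is a sketch of one such normalization (ours, with arbitrary tie-breaking conventions; NumPy assumed):

```python
import numpy as np

def canonical_form(W, w0, c):
    """Pick a representative of the orbit of (W, w0, c) under G_{m,n}:
    flip signs so the first nonzero coefficient of each theta_j is positive,
    then sort the hidden nodes lexicographically by their weights."""
    W, w0, c = np.array(W, float), np.array(w0, float), np.array(c, float)
    for j in range(len(c)):
        row = np.concatenate((W[j], [w0[j]]))
        nz = np.nonzero(row)[0]
        if len(nz) and row[nz[0]] < 0:       # apply the sign flip tau_j
            W[j], w0[j], c[j] = -W[j], -w0[j], -c[j]
    order = sorted(range(len(c)), key=lambda j: (tuple(W[j]), w0[j]))
    return W[order], w0[order], c[order]
```

For an irreducible net, Theorem 2.1 guarantees that this form depends only on the input-output map.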
REFERENCES

Barron, A. R. (1991a). Approximation bounds for superpositions of a sigmoidal function. Proceedings of the IEEE International Symposium on Information Theory. IEEE Press.

Barron, A. R. (1991b). Universal approximation bounds for superpositions of a sigmoidal function. Technical Report #58, Statistics Department, University of Illinois at Urbana-Champaign.

Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2, 303-314.

Hecht-Nielsen, R. (1989). Theory of the backpropagation neural network. Proceedings of the International Joint Conference on Neural Networks, Washington, 1989. IEEE Publications, NY, 593-605.

Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2, 359-366.

Sontag, E. D. (1991a). Capabilities of four- vs three-layer nets, and control applications. Proceedings of the Conference on Information Sciences and Systems, Johns Hopkins University, 558-563.

Sontag, E. D. (1991b). Remarks on interpolation and recognition using neural nets. In R. P. Lippmann, J. Moody, & D. S. Touretzky (Eds.), Advances in Neural Information Processing Systems 3 (pp. 939-945). San Mateo, CA: Morgan Kaufmann.