A framework to deal with interference in connectionist systems

Vicente Ruiz de Angulo and Carme Torras
Institut de Robotica i Informatica Industrial (CSIC-UPC)
Edifici NEXUS, Gran Capita 2-4, 08034 Barcelona, Spain
E-mail: [email protected], [email protected]

2000

Abstract

We analyze the conditions under which a memory system is prone to interference between new and old items. Essentially, these are the distributedness of the representation and the lack of retraining. Both are, however, desirable features providing compactness and speed. Thus, a two-stage framework to palliate interference in this type of systems is proposed, based on exploiting the information available at each moment. The two stages are separated by the instant at which a new item becomes known: a) interference prevention, prior to that instant, consists in preparing the system to minimize the impact of learning new items, and b) retroactive interference minimization, posterior to that instant, seeks to learn the new item while minimizing the damage inflicted on the old items. The subproblems addressed at the two stages are stated rigorously, and possible methods to solve each of them are presented.

1 The interference problem

Suppose one would like to learn a set of input-output pairs $\{(X_i, D_i)\}_{i=1}^{N}$, where $D_i$ is the desired output of the system to input $X_i$. In some applications, not all the items to be learned are known at the same time. Instead, there is a sequential order of arrival, and the system must be operative before the last item is known, with, desirably, the best possible performance. Therefore, new items must be quickly learned and integrated. However, when new memories are introduced in an associative system, the performance on the already stored items can be seriously degraded. This performance degradation is usually called interference or catastrophic forgetting. The effect of adding a new item depends basically on two factors: the type of memory used and the training scheme applied.

Two types of representations can be distinguished:

- Local representations, where the items are stored in such a way that they do not interfere with one another, as in a look-up table. As a consequence, in normal conditions, there is perfect recall of both the old and the new items. However, when the system is full, there is no possibility of storing new items without completely destroying some old items.

- Distributed representations, in which each element or parameter of the associative memory intervenes in the representation of several items. Reciprocally, each item is represented by several parameters, each of which supports only partially the storage of the item. When a new item is introduced, the recall of several stored items is degraded. The importance of forgetting rises very gradually with the number of new items learned. Distributed representations exploit memory resources much better than local ones, and are also reputed to interpolate more adequately (see Section 3.2 for details).

There are two basic training schemes:

- To introduce the new item in isolation. In this case, the last item is quickly learned but, in many cases, the structure of the system is such that this learning causes a very important degradation of the previously learned items. That is, catastrophic interference appears.

- To train the new item jointly with the old ones.
The problem is that, with some kinds of systems, especially those having distributed representations, this can be a complex optimization process that requires a great deal of computation.

In the above account of types of memories and training schemes, we have only described the extreme possibilities. There are others in between, representing a compromise between their properties.

2 Previous work

The phenomenon of interference has been noted and studied by the connectionist community. Ratcliff [14] and McCloskey and Cohen [9] first unveiled the problem. Their purpose was to model, by means of artificial neural networks, some aspects of human learning and memory, such as forgetting and retroactive interference. They arrived at the conclusion that there is too much and too sudden interference in neural networks for them to be a valid model. Almost all the following papers on interference inherit this marked psychological character. We briefly comment on some of them, to next focus on the engineering aspects of the problem.

The different works will be placed in the framework outlined in the preceding section, namely an imaginary plane whose axes refer to the locality of the representation and the extent of retraining with old items (Figure 1). Thus, French's approach [4] is situated at the extreme of localized solutions. He uses sigmoidal units (see the next section) with a [0, 1] range, saturating them and favoring the zero states. Hetherington and Seidenberg [5] observed that, although the old items could have large errors after the introduction of a new item, they could be quickly relearned. Thus, they suggest retraining new and old items altogether after the introduction of the new pattern. This can be considered an approach far from the origin on the retraining axis, in which information about the old items is used in a delayed mode. In [15], Robins trains the new item together with a (fixed or, preferably, randomly changing) part of the previously trained items. Thus, this approach can be situated in the middle of the retraining axis. In a variation of this strategy, he attempts to produce the same result without requiring the availability of the old items: the associations that are jointly trained with the new items are not old items, but (input, network output) pairs, which we denote $(X, F(X))$, randomly extracted from the network before introducing the new item. Although similar in appearance to the former, this is a different approach (which Robins calls pseudorehearsal). Training with the old items constrains the shape of $F(X)$ much less than pseudorehearsal, because the latter implicitly aims to reproduce the previous $F(X)$ function, except at the discontinuous point where the new item is located. The repeated introduction of items through this process leads to very local representations. It must be said that pseudorehearsal has some intriguing points in common with brain processes during dreaming [16], which suggests that the brain could use some related strategy, as commented below.

McClelland et al. [8] claim that the brain avoids catastrophic interference by using a rapid, local memory to store new items, which afterwards should be integrated slowly into the long-term, distributed memory (note that this requires knowing whether an association has been introduced recently in the quick memory). This approach can be considered a combination of a temporal local solution and delayed retraining. To be successful, it needs a method to discriminate when the output for an input must be generated using the local or the distributed memory.
Other requirements are the availability of the old items, and free time to train the distributed memory with them and with those stored in the local memory. Following McClelland et al., the cortex is the long-term memory in the human brain, and the hippocampus is the local memory, able also to recognize when a pattern is new. Dreaming could be, as suggested by Robins [16], the time for integration, and the information about the old patterns would be extracted by exciting the cortex randomly, in a way similar to the pseudorehearsal technique commented on above.

3 A closer look at the interference problem

3.1 Associative systems and neural networks

At this point, it is convenient to go into details about the kind of systems that will be dealt with in this paper. Although we will try to keep the discussion at a rather general level, our results will be centered on neural networks. First of all, we require the associative memory system to be able to generalize. This means that for any input $X$ the system must yield a response or output $F(X, W)$, parametrized by a vector of parameters $W$. To store a set of items $\{(X_i, D_i)\}_{i=1}^{N}$, the system must minimize some function $E(W)$ enforcing the similarity of $F(X_i, W)$ and $D_i$, as for example

$$E(W) = \sum_{i=1}^{N} E_i(W) = \sum_{i=1}^{N} \left( F(X_i, W) - D_i \right)^2 .$$

The exact shape of this function influences the way in which the network responds to inputs outside the training set and, for this reason, $E(W)$ could include terms aimed at determining the way the system should generalize. Another condition we impose is the differentiability of $F(X, W)$ and $E(W)$, which allows the minimization of $E(W)$ with standard methods such as gradient descent using its derivatives.

The usual models of multilayer neural networks satisfy the above specifications. For example, a typical regression two-layer neural network (by a regression network we mean one whose output is a vector of real numbers) has the form

$$F_i(X, W) = \sum_j w_{ij} \, y_j(X) , \qquad (1)$$

where $F_i$ is the $i$-th component of $F$, the $w_{ij}$ are modifiable parameters, and $y_j$ is a nonlinear function of the input that, in the terminology of neural networks, is known as the output of hidden unit $j$. Depending on the shape of the $y_j$'s, feedforward neural networks can be classified into two types. The most frequent one uses

$$y_j(X) = \mathrm{sig}\!\left( \sum_k v_{jk} X_k \right) , \qquad (2)$$

where $X_k$ is the $k$-th component of $X$, the $v_{jk}$'s are modifiable parameters (also components of $W$), and $\mathrm{sig}$ is a function of sigmoidal shape, such as the hyperbolic tangent. The other type is the Radial Basis Function (RBF) network, whose $y_j$'s are RBF units, typically Gaussians:

$$y_j(X) = e^{- \sum_k (x_k - v_{jk})^2 / 2 \sigma_j^2} ,$$

where the $v_{jk}$ and the $\sigma_j^2$ are modifiable parameters (included in $W$).

3.2 Distributed representations

Up to now, we have associated catastrophic forgetting with the existence of crossed dependencies of different outputs of the system on many parameters. This corresponds to systems with distributed representations, which seem to be at the heart of the problem. It is convenient to take a closer look at this relation and at the arguments against renouncing distributed representations. The term "distributed representations" is often used in the connectionist literature ambiguously or with different meanings. Here, to avoid misunderstandings, we put forth a precise definition: we say that the response of the system to input $X$ has a distributed representation if the majority of the derivatives $\frac{\partial F}{\partial w_i}(X, W)$ have a significant magnitude. In the opposite case, we say that it has a local representation.
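To make the definition concrete, here is a small sketch (in Python with NumPy; the function names and the 0.1 relative threshold are our own illustrative choices, not from the paper) that builds a two-layer network of the form (1)-(2) with a single output and inspects the magnitudes of the derivatives $\partial F / \partial w_j$ over the output weights:

```python
import numpy as np

def forward(X, V, w):
    """Two-layer regression network with one output:
    F(X, W) = sum_j w_j * sig(sum_k v_jk * X_k), with sig = tanh (eqs. (1)-(2))."""
    y = np.tanh(V @ X)                 # hidden-unit outputs y_j(X)
    return w @ y, y

def is_distributed(X, V, w, threshold=0.1):
    """Call the representation of X 'distributed' when the majority of the
    derivatives dF/dw_j = y_j(X) have significant magnitude; the relative
    threshold defining 'significant' is an arbitrary choice of ours."""
    _, y = forward(X, V, w)
    s = np.abs(y)                      # |dF/dw_j| for the output weights
    return np.mean(s > threshold * s.max()) > 0.5

rng = np.random.default_rng(0)
V = rng.normal(size=(7, 2))            # 7 hidden units, 2 inputs
w = rng.normal(size=7)
X = rng.normal(size=2)
print(is_distributed(X, V, w))         # sigmoidal nets typically answer True
```

For an RBF network with narrow, well-separated units, the same test would typically report a local representation, since only a few $y_j(X)$ are significantly non-zero.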
Note that the different items stored in a system may have representations with very different degrees of distributedness. Moreover, an arbitrary input which has not yet been associated to a desired output, but which anyway produces an output response, can also be said to have a local or a distributed representation. The above definition is adequate for analyzing the interference problem, while at the same time bearing resemblance to the common conception of distributed representations.

The easiest way to introduce a new item $X_{new}$ is to change those parameters to which the current (erroneous) system response is most sensitive. If there is a great overlap between this set of parameters and those for which $\frac{\partial F}{\partial w_i}(X_p, W)$ has a large magnitude, the response to item $X_p$ will be seriously perturbed (in a way approximately proportional to $\frac{\partial F}{\partial w_i}(X_p, W) \, \frac{\partial F}{\partial w_i}(X_{new}, W)$). Thus, catastrophic forgetting appears when a) the stored items have distributed representations, and b) the response of the system to the new item prior to learning also has a distributed representation. Sigmoidal networks, due to the euclidean products in (1) and (2), tend to produce very distributed representations for all their possible inputs, unless the sigmoidal functions work in their saturated zones. Instead, RBF networks can have distributed or local representations, depending on the relation between the distance among the RBF centers and their widths.

Then, why not simply use a very local representation of the memories to avoid interference? There are two main reasons for not doing so:

- Compactness. It is clear that local representations using an exclusive set of parameters for each item will require many resources to deal with large quantities of data. Indeed, purely local representations must always have at least as many parameters as data stored. Instead, when data are introduced by minimizing an error function without any locality constraint, the result is naturally distributed. With high data-to-parameter ratios, there are no local representations producing equivalent results.

- Generalization. To determine the output for an arbitrary input $X$, any good generalizer should take into account multiple stored items, especially in stochastic or noisy domains. When data are scarce and sparse, items far from $X$ should influence the system response. This influence can only be realized through commonly dependent parameters. The sharing of parameters by several items implies, at least, a certain degree of distributedness in the representations. Note that, in the sense in which we are using the term distributed, RBF networks composed of units with a limited "local" activation function can exhibit distributed representations, the determinant factor being the degree of overlap between the units.

3.3 Our goal

In this paper we intend to deal with the engineering aspects of the core problem. This core is for us at the origin of the retraining and representation axes (refer to Figure 1), i.e., catastrophic interference avoidance without requiring repeated presentations of the old items, and without explicitly imposing local representations. We coincide with [13] in pursuing this goal and in taking a purely engineering perspective. However, theoretical reasons forbid a perfect solution to this problem.
When a feed-forward network with a fixed number of units has enough capacity to encode a fixed set of items, there is a bound on how fast learning can take place, since this problem has been proven to be NP-complete [3], [6]. Therefore, we cannot aim at finding a procedure that in approximately constant time learns a new item while leaving the old ones intact, because by iterating this procedure learning could be carried out in time linear in the number of items. As a consequence, our goal must be limited to palliating interference as much as possible.

4 A two-stage approach to palliate interference

We consider that the interference problem must be handled in two stages, separated by the arrival of the new item:

- A priori stage: encoding the old, known items in the best way to prepare the network to minimize the impact of learning new items. We call this stage interference prevention.

- A posteriori stage: learning the new items while minimizing the damage inflicted on the old items stored with a given weight configuration. We call this stage retroactive interference minimization.

We will make the scenario concrete with a set of items $1 \ldots N-1$ which are stored prior to the arrival of the new item $N$. Let $E_{1 \ldots p}$ denote the error in items $1, 2, \ldots, p$. With this notation, we formulate our problem as how to find efficiently a minimum of $E_{1 \ldots N}(W)$ under the constraint that item $N$ becomes available only at a moment posterior to the arrival of items $1 \ldots N-1$. The optimization to be performed prior to this moment constitutes interference prevention, whereas the optimizations posterior to that moment constitute retroactive interference minimization. We will enforce this division by reflecting the calculations of interference prevention in the variable $W$ itself, fixing its value afterwards. The changes to be made to the weights during retroactive interference minimization will be reflected in the variable vector $\Delta W$, so that the value of the weight vector after the introduction of the new item will be $W + \Delta W$.

For clarity of explanation, we make all the derivations assuming a perfect encoding of the new item ($E_N(W + \Delta W) = 0$). However, the $N$-th item can also be introduced partially within this framework: instead of using $D_N$, we would take $F(X_N, W) + \alpha \, (D_N - F(X_N, W))$, $0 < \alpha < 1$, as the new item. Reducing only a fraction of the error can be useful when dealing with noisy data [21].

The minimization of $E_{1 \ldots N}(W)$ is approximately equivalent to:

$$\min_{W + \Delta W} E_{1 \ldots N-1}(W + \Delta W) \quad \text{subject to} \quad E_N(W + \Delta W) = 0 .$$

And making the above distinction explicit:

$$\min_{W, \, \Delta W} \left[ E_{1 \ldots N-1}(W) + \Delta E_{1 \ldots N-1}(W, \Delta W) \right] \quad \text{subject to} \quad E_N(W + \Delta W) = 0 .$$

Let us abbreviate $E_{1 \ldots N-1}$ as $E$. We need to express $\Delta E(W, \Delta W)$ in a manageable way. Evaluating $E(W + \Delta W)$ accurately for a given $\Delta W$ can have a high computational cost. For example, in a supervised feedforward neural network, it would entail presenting the whole set of items to the network. If the optimum is to be found through some search process, this evaluation has to be performed repeatedly, leading to a high computational cost. An alternative to accurate evaluation is to approximate the error function over the old items through a truncated Taylor series expansion, for example, with a second-order polynomial in the weights. A linear model is too crude and not feasible, because it may turn the problem into an unsolvable one [21].
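Spelled out, this truncated expansion, keeping only the first derivatives and the diagonal of the Hessian, reads

$$E(W + \Delta W) \;\approx\; E(W) + \sum_i \frac{\partial E}{\partial W_i} \, \Delta W_i + \frac{1}{2} \sum_i \frac{\partial^2 E}{\partial W_i^2} \, \Delta W_i^2 .$$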
The coefficients of the polynomial have a direct interpretation, namely the first and second derivatives of $E$. Cross-terms are not included because the calculation of the full Hessian requires computations that are very costly in memory and time. The Hessian diagonal is, therefore, the only information about the old items that we require in principle. Thus, the most faithful problem formulation we can reasonably aspire to deal with is:

$$\min_{W, \, \Delta W} \left[ E(W) + \frac{1}{2} \sum_i \frac{\partial^2 E}{\partial W_i^2} \, \Delta W_i^2 + \sum_i \frac{\partial E}{\partial W_i} \, \Delta W_i \right] \quad \text{subject to} \quad E_N(W + \Delta W) = 0 . \qquad (3)$$

The problem is how to carry out this constrained minimization on $W$ (interference prevention) and on $\Delta W$ (retroactive interference minimization). In the latter, $W$ is by definition a constant already known, and the minimization takes place over $\Delta W$. Interference prevention is more complex because the minimization must be performed over $W$, but the cost function is parametrized by $\Delta W$, which is unknown at that moment. This leads to the minimization of the expected cost for the a priori distribution of $\Delta W$. We begin with the easier one, namely retroactive interference minimization.

5 Minimizing retroactive interference

5.1 Problem formulation

Now we consider $W$ as a constant, and therefore (3) becomes

$$\min_{\Delta W} \left[ \frac{1}{2} \sum_i c_i \, \Delta W_i^2 + \sum_i b_i \, \Delta W_i \right] \quad \text{subject to} \quad E_N(W + \Delta W) = 0 ,$$

with $c_i$ assigned to $\frac{\partial^2 E}{\partial W_i^2}$ and $b_i$ to $\frac{\partial E}{\partial W_i}$. Note that usually $b_i$ will be close to zero because, by introducing the first $N-1$ items, we should have minimized $E$ and, therefore, also the absolute value of its first derivatives. In any case, this is part of interference prevention. Thus, we assume $b_i = 0$, and our definitive formulation of the retroactive interference subproblem is:

$$\min_{\Delta W} \sum_i c_i \, \Delta W_i^2 \quad \text{subject to} \quad E_N(W + \Delta W) = 0 , \qquad (4)$$

with $c_i = \frac{\partial^2 E}{\partial W_i^2}$.

5.2 Selection of coefficients

The above formulation is similar to that adopted by Park et al. [13], and also in previous works by the authors [18], [20], [21], the difference lying in the value assigned to $c_i$. In [13], a modified second derivative of $E$, weighted by the error of the outputs of the network, is used. In [20], a sensitivity measure is used that depends on the history of the first derivatives $\frac{\partial E(t)}{\partial W_i}$ and the changes $\Delta W_i(t)$ made by the back-propagation algorithm in each iteration $t$:

$$c_i = \frac{\sum_t \left| \frac{\partial E(t)}{\partial W_i} \, \Delta W_i(t) \right|}{\left| W_i^{fin} - W_i^{in} \right|} ,$$

where $W_i^{in}$ and $W_i^{fin}$ are the initial weight and the final weight after training, respectively. In [13], $c_i = \frac{\partial^2 E}{\partial W_i^2}$ and $c_i = 1$ are tested and compared. The latter option is interesting for its cheap computational cost, and because it is intuitively appealing, since in this case (4) yields the solution for the new item that is nearest in parameter space.

We next show that the option $c_i = \frac{\partial^2 E}{\partial W_i^2}$ is the best on average to estimate $\Delta E(\Delta W)$ with a cost model of the form $\sum_i c_i \, \Delta W_i^2$, even outside a minimum for the old items. First, note that $\sum_i c_i \, \Delta W_i^2$ cannot distinguish between the effects produced by positive and negative increments of the same magnitude, i.e., the sign information is lost. The quantities $\Delta W_i^2$ could have been generated by $2^n$ vectors, denoted $\Delta W^{(l)}$, $n$ being the number of components of $\Delta W$, by considering two opposite values for each component. The best one can do in this situation is to use as cost function the average of the error increments produced by these vectors:

$$\left\langle E(W) - E(W + \Delta W) \right\rangle = \frac{1}{2^n} \sum_{l=1}^{2^n} \left( E(W) - E(W + \Delta W^{(l)}) \right) ,$$

where $\langle \cdot \rangle$ denotes expectation.
Now we need to use the following result [22], [17]:

$$\int g(U + R) \, P(R) \, dR = g(U) + \frac{1}{2} \sum_i \sigma_i^2 \, \frac{\partial^2 g}{\partial U_i^2} + O(\mu_i^4) , \qquad (5)$$

where $g$ is an arbitrary differentiable function, $P$ is a symmetric, zero-mean probability density function, and $\sigma_i^2$ and $\mu_i^4$ are the variance and the fourth central moment of the marginal distribution of $R_i$, respectively. This formula can be particularized to discrete distributions so that, considering that $\frac{1}{2^n}$ is the probability of $\Delta W^{(l)}$ given the information about the squared components, $P(\Delta W^{(l)}) = \frac{1}{2^n}$ and, therefore,

$$\left\langle E(W) - E(W + \Delta W) \right\rangle \approx E(W) - \left( E(W) + \frac{1}{2} \sum_i \frac{\partial^2 E}{\partial W_i^2} \, \frac{\Delta W_i^2 + (-\Delta W_i)^2}{2} \right) ,$$

and finally

$$\left\langle E(W) - E(W + \Delta W) \right\rangle \approx -\frac{1}{2} \sum_i \frac{\partial^2 E}{\partial W_i^2} \, \Delta W_i^2 . \qquad (6)$$

Thus, we see that the absence of sign information leads naturally to a cost function of the form $\sum_i c_i \, \Delta W_i^2$ with the assignment $c_i = \frac{\partial^2 E}{\partial W_i^2}$, even at points far from the minimum. This result can easily be generalized to the estimation of an arbitrary function under the same conditions. This conclusion is in agreement with the results of an experimental comparison of all the above-mentioned options for assigning $c_i$ [20]. Another conclusion emerged from this comparison: the difference between using $c_i = 1$ and $c_i = \frac{\partial^2 E}{\partial W_i^2}$ decreases with the number of items stored in the memory system. This can be interpreted in the following way: the weights that are changed most by the solution of (4) are those with lower second derivatives. Thus, second derivatives can be considered a measure of how much the encoding of the previously stored items is supported by the corresponding weights. Those weights with currently lower second derivatives change to support the new items and, then, their second derivatives grow. As the number of stored items increases, all weights tend to have similar, high-magnitude second derivatives and, thus, using $c_i = 1$ or $c_i = \frac{\partial^2 E}{\partial W_i^2}$ tends to yield the same results.

5.3 Minimization method

The question remains of how to solve (4) efficiently. A usual way to tackle a constrained optimization problem is to linearly combine the cost function with the deviation of the constraint from zero, and then minimize this new error function:

$$\min_{\Delta W} \left[ \lambda \sum_i c_i \, \Delta W_i^2 + \mu \, E_N(W + \Delta W) \right] . \qquad (7)$$

In the minimization of this function, there is a trade-off between $E_N$ and $\sum_i c_i \, \Delta W_i^2$ that depends on $\lambda$ and $\mu$, and the error in the new item will not be 0 unless the ratio $\mu / \lambda$ tends to infinity with time. In practice this is impossible, and it is approximated through an appropriate schedule for changing $\lambda$ and $\mu$. More sophisticated algorithms from the theory of constrained optimization can also be applied, as in [13], where the reduced gradient algorithm for nonlinear constrained optimization is used. The advantage of these methods is their generality, in the sense that in principle they can deal with general functional forms (but not always: for example, the reduced gradient method only works with linear constraints and, because of this, $E_N(W)$ is approximated linearly in [13]). On the other hand, they can be complex and computationally expensive.

In [20] and [17], an algorithm to solve (4) that exploits the structure of multilayer neural networks is developed. The drawbacks of solving a constrained minimization problem are avoided there through the transformation of retroactive interference into an unconstrained minimization problem. The finding underlying this transformation is the existence of a one-to-one mapping between the hidden-unit configurations and the best solutions in the subspaces of weights that produce those hidden-unit configurations. Besides allowing an unconstrained optimization, this transformation has other advantages, such as a number of variables much lower than in the original problem, and an always perfect encoding of the new item. The algorithm derived from this transformation, called LMD (Learning with Minimal Degradation), is completely parallel, even among layers.
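As an illustration, the following sketch (ours; not the LMD algorithm, whose hidden-unit transformation is developed in [20] and [17]) solves the penalty formulation (7) by gradient descent with a growing penalty coefficient, on a toy linear "new item":

```python
import numpy as np

def retroactive_update(W, c, grad_E_N, steps=2000, lr=1e-2):
    """Approximately solve (4) through the penalty scheme (7):
        min_dW  sum_i c_i dW_i^2 + mu * E_N(W + dW),
    with mu growing over time so that the constraint E_N = 0 dominates."""
    dW = np.zeros_like(W)
    for t in range(steps):
        mu = 1.0 + t                       # increasing penalty schedule
        g = 2 * c * dW + mu * grad_E_N(W + dW)
        dW -= lr * g / mu                  # rescale the step as mu grows
    return W + dW

# Toy new item: E_N(W) = (a . W - d)^2 for a random linear association.
rng = np.random.default_rng(1)
a, d = rng.normal(size=10), 0.7
E_N = lambda W: (a @ W - d) ** 2
grad_E_N = lambda W: 2 * (a @ W - d) * a

W = rng.normal(size=10)
c = rng.uniform(0.5, 2.0, size=10)         # stand-in for the diagonal Hessian of E
W_new = retroactive_update(W, c, grad_E_N)
print(E_N(W), "->", E_N(W_new))            # new-item error driven toward zero
```

As the iterations proceed, the update slides along the constraint surface toward the weighted least-change solution, so the weights with small $c_i$ absorb most of the change, which is exactly the behavior the second-derivative coefficients are meant to induce.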
5.4 Relation between retroactive interference minimization and pruning

Pruning is a technique commonly used in connectionist systems that consists in trimming the network by eliminating the most superfluous weights. In palliating retroactive interference, one of the most important issues is the determination of the least significant parameters for the encoding of a number of stored memories: these are the parameters that should support most of the changes needed to introduce new information. Pruning, instead, detects the least profited parameters in order to eliminate them. Indeed, we have used as relevance measure the second derivatives, which are also used in Optimal Brain Damage [7], the most popular pruning technique. The relation with pruning suggests that advances in pruning techniques can be incorporated into retroactive interference minimization algorithms.

6 Interference prevention

6.1 Problem formulation

The interference prevention subproblem is formulated as the selection of a set of system parameters able to codify items $1 \ldots N-1$ in such a way that they are disrupted minimally when item $N$ is introduced. Now we have to solve (3) in $W$, taking $\Delta W$ as an unknown constant:

$$\min_{W} \left[ E(W) + \frac{1}{2} \sum_i \frac{\partial^2 E}{\partial W_i^2} \, \Delta W_i^2 + \sum_i \frac{\partial E}{\partial W_i} \, \Delta W_i \right] \quad \text{subject to} \quad E_N(W + \Delta W) = 0 .$$

Before knowing the new item, we must solve this problem without assuming any particular value for $\Delta W$. The ideal solution is the one that, on average, best solves the problem, taking into consideration the distribution of $\Delta W$:

$$\min_{W} \left[ E(W) + \left\langle \frac{1}{2} \sum_i \frac{\partial^2 E}{\partial W_i^2} \, \Delta W_i^2 + \sum_i \frac{\partial E}{\partial W_i} \, \Delta W_i \right\rangle \right] .$$

Developing the second term:

$$\min_{W} \left[ E(W) + \int \left( \frac{1}{2} \sum_i \frac{\partial^2 E}{\partial W_i^2} \, \Delta W_i^2 + \sum_i \frac{\partial E}{\partial W_i} \, \Delta W_i \right) P_{\Delta W}(\Delta W) \, d\Delta W \right] , \qquad (8)$$

where $P_{\Delta W}$ is the density function of $\Delta W$. Let us assume that $P_{\Delta W}$ is equal for each of the components $\Delta W_i$, or at least that they have equal variances (although this assumption can easily be relaxed). It is also very natural to assume a zero mean for this distribution: otherwise, item $N$ would not really be new, and the information in the mean of $P_{\Delta W}$ could be used to train the system. We can apply (5) to (8) by making $g(U) = \frac{1}{2} \sum_i \frac{\partial^2 E}{\partial W_i^2} U_i^2 + \sum_i \frac{\partial E}{\partial W_i} U_i$, so that the integral in (8) becomes the expectation of $g(0 + \Delta W)$. Since $g(0)$ is null and $\frac{\partial^2 g}{\partial U_i^2} = \frac{\partial^2 E}{\partial W_i^2}$, we get

$$\min_{W} \left[ E(W) + \frac{\sigma^2}{2} \sum_i \frac{\partial^2 E}{\partial W_i^2} \right] \qquad (9)$$

without any error (the $O(\mu_i^4)$ term of (5) disregarded here involves fourth or higher-order derivatives, and these are null for this $g$).

This result can be derived in another way, without assuming a particular shape for $\Delta E_{1 \ldots N-1}(W, \Delta W)$. The problem of interference prevention can be understood as the search for a $W$ such that, after being modified by the introduction of the new item, it is still able to reproduce the old ones. As the new items are by definition unknown, they produce unknown modifications in the parameters when they are learned. Thus, the problem consists in finding a point of low $E(W)$ that is stable with respect to random perturbations of $W$. This resistance to perturbations can be expressed as

$$\min_{W} \int E(W + \Delta W) \, P_{\Delta W}(\Delta W) \, d\Delta W . \qquad (10)$$

This expression is made exactly equivalent to (9) by applying (5). Thus, reassuringly, we have obtained the same result through two different reasonings.
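To fix ideas, objective (9) can be evaluated directly, estimating the Hessian diagonal by central finite differences (a sketch of ours, for illustration only; finite differences would be far too slow for real networks, and the cited works compute this term analytically):

```python
import numpy as np

def prevention_objective(E, W, sigma2, h=1e-4):
    """Objective (9): E(W) + (sigma2 / 2) * sum_i d2E/dW_i^2, with the
    diagonal second derivatives estimated by central finite differences."""
    base = E(W)
    trace = 0.0
    for i in range(len(W)):
        Wp, Wm = W.copy(), W.copy()
        Wp[i] += h
        Wm[i] -= h
        trace += (E(Wp) - 2 * base + E(Wm)) / h**2
    return base + 0.5 * sigma2 * trace

# Toy old-item error: a quadratic bowl with mixed curvatures.
E = lambda W: W[0]**2 + 10 * W[1]**2
print(prevention_objective(E, np.array([0.3, 0.1]), sigma2=0.05))
```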
6.2 Selection of coefficients

Unfortunately, the above discussion does not suggest what the density function $P_{\Delta W}$ is. This amounts to deciding the variance $\sigma^2$, which is the main parameter of the distribution influencing the minimization of (9) or (10) (the terms disregarded when approximating (10) by (9) are significant in general, but are close to zero at the minimum of (9)). The variance of the weight changes produced by future items could be approximately deduced from their expected error, which in turn may be estimated from the error that the most recently introduced items had at their arrival. However, this is a context-dependent hypothesis, which can lead to significant errors. Moreover, there is another important issue in the selection of $\sigma^2$: it influences not only the error increments produced by new items, but also the way in which the non-presented items are interpolated. In fact, the way in which the items $1 \ldots N-1$ are encoded determines the answer of the system to all possible inputs. The quality of these answers is often more important than the exact storage of the presented items. In this case, $\sigma^2$ should be tuned in accordance with the former desideratum.

In conclusion, either because of ignorance of the appropriate value or because it is tuned for other purposes, the parameter $\sigma^2$ used in (9) could be significantly different from the real weight variances. What are the consequences of this imprecision? Could it have effects opposite to those desired, i.e., in these conditions, could the introduction of new items be worse after minimizing (9) than after minimizing $E(W)$? A detailed mathematical analysis of this question is carried out in [17]. The conclusions can be summarized in a few words. If $\sigma^2$ is smaller than the real variance, the minimization of (9) is always beneficial. The opposite case is also safe if the remaining error $E(W)$ at the minimum of (9) is not much higher than at the minimum of $E(W)$ itself. Considered as a function of $\sigma^2$, $E(W)$ at the minimum of (9) has a sigmoidal shape and, therefore, $\sigma^2$ can be increased until the error begins to grow quickly.

6.3 Minimization method

There is a direct way of minimizing (9) or, equivalently, (10). It consists in adding noise to the weights while minimizing $E(W)$, so that a sample of the gradient distribution of $\frac{\partial E}{\partial W_i}(W + \Delta W)$ is computed in each iteration. Many samples of $\Delta W$ must be extracted from its distribution to get a good estimate of the average derivatives (or, alternatively, very small minimization steps must be taken in the direction of $\frac{\partial E}{\partial W_i}(W + \Delta W)$). Noise addition during training has already been used for other purposes, such as ameliorating fault tolerance in neural networks [11] or improving their generalization properties [12, 1]. Again, this method is very general, being valid for any type of $E(W)$ and $P_{\Delta W}$. However, it is extremely inefficient for systems such as neural networks, which have a high-dimensional parameter space that must be sampled to obtain the averages in the optimization steps. In [19] and [22], a method based on the gradient of (9), especially adapted to feed-forward networks, is developed. It has the advantage of being deterministic and much more stable. In addition, it is easily computable with an algorithm of the same order as the backpropagation of the gradient of $E(W)$.
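A toy version of the sampling procedure just described (ours; the names and the quadratic toy error are illustrative) looks as follows: each step evaluates the gradient at noise-perturbed weights, giving a one-sample estimate of the averaged gradient in (10):

```python
import numpy as np

def train_with_weight_noise(W, grad_E, sigma=0.1, steps=5000, lr=1e-2, seed=0):
    """Minimize (10), the expectation of E(W + dW) over zero-mean noise dW,
    which up to higher-order terms coincides with (9):
        E(W) + (sigma**2 / 2) * sum_i d2E/dW_i^2."""
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        noise = rng.normal(scale=sigma, size=W.shape)
        W = W - lr * grad_E(W + noise)   # one-sample estimate of the averaged gradient
    return W

# Toy quadratic error with a flat direction (W_0) and a sharp one (W_1):
# E(W) = W_0**2 + 25 * W_1**2. The minimizer of this single bowl is unchanged,
# but the averaged objective now also pays (sigma**2 / 2) * (2 + 50) for curvature,
# so the sharp direction is the one the regularizer penalizes hardest.
grad_E = lambda W: np.array([2 * W[0], 50 * W[1]])
print(train_with_weight_noise(np.array([1.0, 1.0]), grad_E))  # both components near 0
```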
6.4 Relation between interference prevention and generalization

There was an implicit assumption in the derivation of our interference prevention method: that the basic features of $P_{\Delta W}$, and especially its variance, are independent of the $W$ used to encode the old items. We have supposed that this variance, which is directly related to the expected error for the new item, does not change while performing the minimization (10). In other words, we minimize future interference assuming an expected error for the new items that is independent of $W$. But the error in the new items (or, equivalently, the variance of $P_{\Delta W}$) is another factor determining the interference that these new items will produce and, of course, it is also controlled by the selection of $W$. Thus, there exists an alternative way to prevent interference, namely reducing the error in the new items. This is nothing other than improving generalization. The point we want to make next is that the minimization (10) is also useful for controlling generalization. This can be understood in two ways:

- First, by reducing the second derivatives with respect to the parameters, their information content with respect to the encoding of the items is also reduced (the information content of a parameter can be approximated by $\log \frac{\partial^2 E}{\partial W_i^2}$, assuming a uniform a priori distribution for it [17]). Controlling the information content of a model is a general way of controlling its generalization.

- The other way is to consider the term $\frac{\sigma^2}{2} \sum_i \frac{\partial^2 E}{\partial W_i^2}$ as a regularizer that constrains the system to be simple or smooth. The use of regularizer terms is another classical technique, which controls the smoothness of $F(X, W)$ by means of a regularization coefficient that regulates the importance of the regularizer. In our case, this regularization coefficient is $\sigma^2$.

7 Experimental results

We now show results obtained by combining the two complementary algorithms for catastrophic interference avoidance. First, robustness against changes is enhanced by the minimization of (9), and then the new patterns are introduced while minimizing retroactive interference by means of the LMD algorithm. Plain back-propagation and retroactive interference minimization with LMD have been extensively compared in [18, 20, 21]. We concentrate here on studying the benefits of complementing LMD with the interference prevention algorithm.

7.1 A first example

The following experiment used a neural network architecture with seven hidden units. Fourteen random samples of the function $\sin(x_1 + x_2)$ were chosen as the initial training set for the network. The network was trained using our interference prevention method, i.e., by minimizing (9) following its gradient. This process was repeated eleven times using different $\sigma^2$, producing eleven different networks. For each of these networks, we tested the effect of introducing twelve more random samples of the same function, using the standard ($c_i = \frac{\partial^2 E}{\partial W_i^2}$) and the coarse ($c_i = 1$) versions of the LMD algorithm for retroactive interference minimization. Note that, adhering to our simplified formulation, the LMD algorithm always encodes the new patterns perfectly. Thus, the state of the network after their introduction is entirely characterized by the error increment in the old patterns.
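For concreteness, a schematic reconstruction of this experimental protocol might look as follows (our sketch, not the authors' code: the prevention stage uses the noise-sampling version of (9)/(10) from Section 6.3, and the introduction stage uses the coarse $c_i = 1$ penalty scheme (7) in place of LMD itself):

```python
import numpy as np

rng = np.random.default_rng(2)

# Old items: fourteen random samples of sin(x1 + x2), as in the experiment.
X_old = rng.uniform(-1, 1, size=(14, 2))
D_old = np.sin(X_old.sum(axis=1))

def net(X, V, w):                          # seven hidden tanh units
    return np.tanh(X @ V.T) @ w

def grad_mse(V, w, X, D):                  # gradients of the mean squared error
    H = np.tanh(X @ V.T)
    err = H @ w - D
    gw = 2 * H.T @ err / len(D)
    gV = 2 * ((err[:, None] * (1 - H**2) * w).T @ X) / len(D)
    return gV, gw

# Stage 1 (prevention): minimize (9)/(10) by sampling weight noise.
V, w, sigma, lr = rng.normal(size=(7, 2)), rng.normal(size=7), 0.1, 0.05
for _ in range(5000):
    gV, gw = grad_mse(V + rng.normal(scale=sigma, size=V.shape),
                      w + rng.normal(scale=sigma, size=w.shape), X_old, D_old)
    V, w = V - lr * gV, w - lr * gw

# Stage 2 (introduction): encode one new sample with the coarse (c_i = 1)
# version of (4), solved by the growing-penalty scheme (7).
x_new = rng.uniform(-1, 1, size=2)
d_new = np.array([np.sin(x_new.sum())])
V0, w0 = V.copy(), w.copy()
for t in range(4000):
    mu = 1.0 + 0.1 * t
    gV, gw = grad_mse(V, w, x_new[None, :], d_new)
    V -= lr * (2 * (V - V0) / mu + gV)
    w -= lr * (2 * (w - w0) / mu + gw)

print("old-item error after introduction:",
      np.mean((net(X_old, V, w) - D_old) ** 2))
```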
Figure 2 shows the average error increments produced by the introduction of the new items. An important drop in catastrophic interference can be observed, especially in the second half of the graph. This reduction is due in part to the improvement in generalization, which is reflected in Figure 3, where the coarse-version distances suffer a small drop located more or less in the same place, due to the lower error of the new items at arrival time, which requires smaller weight modifications. Observe that the origin of abscissas corresponds to coding the old patterns with plain back-propagation (i.e., no interference prevention is applied). Applying plain back-propagation also to code the new pattern (i.e., no retroactive interference avoidance) produces error increments that go beyond the scale of the graph.

The distances are in general greater for the standard LMD than for the coarse LMD, because the latter explicitly minimizes $\|\Delta W\|$, while the former looks for privileged directions suggested by the second derivatives. When the network is trained with the classical backpropagation method, i.e., following the gradient of $E_N(W)$ in discrete steps, the results depend on the length of these steps. The total distance covered in weight space tends to decrease as the steps shorten (at the cost of longer training times). In the infinitesimal limit, the solution usually tends to approximate the coarse version of LMD [20].

Note that the distances covered by the standard version grow with $\sigma^2$. The reason is that not all second derivatives decrease in the same way when $\sigma^2$ increases. The minimum of (9) does not exert a pressure proportional to the second derivative's value, but an equal pressure on large and small ones. Therefore, some of them become almost null, while others remain large. This has consequences for the retroactive interference problem formulation: the cost coefficients are the second derivatives and, thus, weights with null second derivatives can be changed arbitrarily. This problem is similar to the excessively large steps that optimization methods based on the Hessian (like Newton or pseudo-Newton) take when the second derivatives are small. We solve it in the usual way, namely by adding a constant to the coefficients, so that $c_i = \frac{\partial^2 E}{\partial W_i^2} + k$. In [17] we argue that a sensible value for this constant in the case of feedforward neural networks is the square of the maximum possible activation of the hidden units.

7.2 Experiments using the Pumadyn datasets

The next series of experiments was performed using data from the Pumadyn family of datasets [24] (loaded from the Delve database [25]), which come from a realistic simulation of the dynamics of a Puma 560 robot arm. The inputs in the chosen datasets consist of angular positions, velocities, torques and other dynamic parameters of the robot arm. The output is the angular acceleration of one of the links of the robot arm. We have used the two datasets in the family labelled with the attributes: 32-dimensional input, fairly linear, and medium noise in one case and high noise in the other. We performed the same type of experiments as shown in Figure 2, but only with the standard version of LMD, since it is the one that works best. Moreover, we have added a very important piece of information to the graphs: the generalization error obtained for each of the values of $\sigma^2$, evaluated over 2000 untrained patterns.
Networks with forty hidden units were first trained with 200 or 400 patterns, and then the average error increment produced by introducing 200 new patterns separately was evaluated. The results for all combinations of number of previously trained patterns and degrees of noise are displayed in Figures 4 through 7. These figures show that interference can be alleviated while at the same time improving generalization. This is in contrast with other strategies for interference avoidance based on a special coding of patterns (e.g., saturating the hidden units to get more local representations), which produce input-output functions $F(X, W)$ (e.g., piece-wise step functions) of a different nature from the function being approximated, thus resulting in high generalization errors.

However, we are forced to use the same regularization coefficient for controlling generalization and for preventing interference, and the best values for each of these purposes are usually different. Therefore, there is a trade-off that should be considered depending on the application. If generalization takes priority, the potential benefit of the interference prevention procedure depends on several factors. One such factor is the number of patterns already stored in the network. The more information a network has stored, the better directed its approximation power is and, therefore, the less it needs to be regularized. This can be seen by comparing the curves in Figures 4 and 5: with double the number of trained patterns, the generalization curve is more squashed against the left vertical axis. It can also be seen in how the generalization curve of Figure 7 increases more slowly when compared with that of Figure 6. Another factor is the amount of noise in the examples. The noisier the training patterns are, the more convenient it is to smooth $F(X, W)$ by increasing the regularization coefficient. This is very evident when comparing Figures 6 and 7 (medium noise) with Figures 4 and 5 (high noise), the latter exhibiting generalization error minima at higher values of $\sigma^2$. Moreover, the error rises very gently with $\sigma^2$ in these figures, allowing for large reductions in interference without paying a high cost in the generalization account.

Therefore, when priority is given to generalization, the narrowest margins for benefits in interference prevention occur for networks trained intensively with a large number of noiseless patterns. But this is precisely the case in which interference is least serious, since the new patterns are better predicted and their introduction produces less damage. This can be checked by observing that the generalization minimum for the network trained with 400 medium-noise patterns (Figure 6) has an associated damage that is an order of magnitude lower than that of the opposite case (200 high-noise patterns) displayed in Figure 5.

Finally, a few words about an aspect of Figures 6 and 7 that could seem strange: the damage curve quickly drops to zero and apparently continues with negative values. In fact, interference takes negative values; that is, the encoding of old patterns is improved (rather than disrupted) by the introduction of the new patterns. Note that this happens when the network is highly over-regularized, so that the smoothing constraint pushes the interpolating curve $F(X, W)$ far from the trained patterns. Then the introduction of new patterns (which in Figures 6 and 7 do not have much noise) without such a constraint brings the interpolation curve nearer to the old patterns with high probability.
7.3 Limitations of the proposed algorithm

Together with the benefits above, we must also point out the limitations in the application of our method for interference prevention. We said that the drop in the distances for the coarse LMD in Figure 3 was due to generalization. This is true, but the fall corresponding to the improvement in generalization should be greater. This means that, although the errors are lower, the weights have had to be modified almost as much. The reason is that the algorithm minimizing (9) makes the network output insensitive to changes in the weights for the stored items, but this insensitivity is transmitted, or generalized, to the rest of the input space. Because of this, the weights must also be modified more in order to introduce the new items, and the potential benefits of the strategy are limited. As in generalization, the greater the number of items, the greater and more likely the insensitivity for items outside the learning set will be. When the items are few, the results are irregular, since the network will have become insensitive only for new items located near a group of old items.

8 Conclusions

Two conditions are required for catastrophic interference to occur:

- the isolated training with new items, without reminders of the old ones, and

- the use of distributed representations.

We have typified the approaches to solving the interference problem by their degree of retraining with old items and by the locality of their representations. We have proposed a two-stage framework to deal with the interference problem based on the information available at each moment.

Retroactive interference minimization deals with interference when the new item is already known. It can be formulated as the search for the weights minimizing the error increments of the already stored items, subject to the encoding of the new information. In practice, the best model affordable for a highly dimensional system is a weighted sum of squares of the changes in the parameters. We have shown that the best coefficients are, on average, the second derivatives with respect to the parameters, even outside the minimum. For feedforward neural networks, a very efficient algorithm can be used to solve this constrained minimization.

Instead, at the earlier stage of interference prevention, when the new item has not yet arrived, the corresponding weight changes are also unknown. Thus, the only reasonable way of minimizing the cost function in anticipation is to minimize the coefficients, i.e., the second derivatives, jointly with the error. The effect of this is to make the stored items insensitive to future changes. When tested experimentally, we found a limited success of this strategy, due to an unexpected reason: the insensitivity to which the old items were trained gets "generalized", especially to nearby zones. If a new item has to be introduced in one of these insensitive zones, larger weight changes are required, and most of the expected benefits are lost. When the old items cover the input space densely, there is no possible gain. There is a solution for this situation: to accept and assume that the average sensitivity is the same for old and new items. Thus, the weight increments for the new items will depend on the sensitivity (second derivatives) of the old items. This is reflected by expressing $\sigma^2$ as a function of the second derivatives for the old items, and the cost function (9) becomes [17]:

$$E(W) + 2 (N-1) \, \langle E_N \rangle \, \frac{\sum_i \frac{\partial^2 E}{\partial W_i^2}}{\left( \sum_i \frac{\partial^2 E}{\partial W_i^2} \right)^2} , \qquad (11)$$

where $\langle E_N \rangle$ is the expected error for the new item.
Unfortunately, this function is more complex than (9), and its gradient is harder to calculate.

This failure to avoid interference completely with a simple procedure was to be expected, as explained in Section 3.3. Moreover, there are many scenarios in which the blind application of the hypothetically best possible algorithm could be inappropriate. In fact, as mentioned in Section 4, it would usually be better to introduce the new item only partially, leaving a certain error that is exchanged for a smaller error increment in the old items. The appropriate balance point in this trade-off depends critically on several factors:

- The capacity of the memory system, i.e., the degree to which it is able to assimilate all the items.

- The amount of noise in the data.

- The number of already stored items. As it grows, the comparative importance of the new item's error decreases.

- The variability in time of the function to be approximated. If the function changes quickly, the comparative importance of the errors in the new items grows, and more interference should be allowed.

A rule of thumb that is generally correct when the objective function is static or changes steadily is that the error in the new item should not be made lower than the average error in the old items.

We have argued, with others, that distributed representations are indispensable for good generalization. But is this completely true? Consider this extremely localist representation: the items themselves as a list of input-output pairs, with no other structure or parameters. Each time an answer to an arbitrary input is required, one can carry out some very complicated process, for example, building a sigmoidal feedforward network, training it with the stored items, and producing as answer the output of the network. When a new item is introduced, there is no catastrophic interference, because it is just added to the list. Thus, the key point is shifting the processing time from training to the generation of answers by the system. More practical methods than the one above can be imagined, and some work in the literature [2] can be considered as other, more practical examples of moving computational cost to the recall phase. So, from this point of view, the question is where to put the burden of processing. Putting it in the learning phase is advantageous if there is enough time for it and one continuously has to generate outputs and react very quickly to the inputs. This is the scenario for animals in their environments. Thus, based on engineering principles, we think that there are two ultimate reasons why the cortex uses distributed representations that constrain it to slow learning. The first is that, being the residence of long-term memory, it must be able to store great quantities of information, which implies a high degree of compactness that can only be reached using distributed representations, as explained in Section 3.2. The second reason is the requirement of very quick responses to stimuli (on which life or death can depend) that must nevertheless be "optimal" in the sense of being well generalized from past experiences. This can be obtained only if the influences of past memories required to respond to new stimuli are already computed during a previous learning phase and made explicit as distributed representations, as explained before.

9 Bibliography

[1] G. An, The effects of adding noise during backpropagation training on generalization performance, Neural Computation 8 (1996), 643-674.

[2] C.G. Atkeson, A.W. Moore, and S. Schaal, Locally Weighted Learning, Artificial Intelligence Review (in press).
[3] A. Blum and R.L. Rivest, Training a 3-node neural network is NP-complete, in: Proc. of the Workshop on Computational Learning Theory, Morgan Kaufmann Publishers, San Mateo, CA, 1988, pp. 9-18.

[4] R.M. French, Dynamically constraining connectionist networks to produce distributed, orthogonal representations to reduce catastrophic interference, in: Proc. of the 16th Annual Conf. of the Cognitive Science Society, Erlbaum, Hillsdale, NJ, 1994, pp. 335-340.

[5] P.A. Hetherington and M.S. Seidenberg, Is there catastrophic interference in connectionist networks?, in: Proc. of the Eleventh Annual Conf. of the Cognitive Science Society, Erlbaum, Hillsdale, NJ, 1989, pp. 26-33.

[6] S. Judd, Neural Network Design and the Complexity of Learning, MIT Press, Cambridge, MA, 1990.

[7] Y. Le Cun, J.S. Denker, and S.A. Solla, Optimal Brain Damage, in: Advances in Neural Information Processing Systems 2, Morgan Kaufmann Publishers, San Mateo, CA, 1990.

[8] J. McClelland, B. McNaughton, and R. O'Reilly, Why there are complementary learning systems in the hippocampus and the neocortex: Insights from the successes and failures of connectionist models of learning and memory, Psychological Review 102 (1995), 419-457.

[9] M. McCloskey and N.J. Cohen, Catastrophic interference in connectionist networks: The sequential learning problem, in: The Psychology of Learning and Motivation, G.H. Bower, ed., Academic Press, New York, 1989.

[10] K. McRae and P.A. Hetherington, Catastrophic interference is eliminated in pretrained networks, in: Proc. of the Fifteenth Annual Meeting of the Cognitive Science Society, Erlbaum, Hillsdale, NJ, 1993, pp. 723-728.

[11] A.F. Murray and P.J. Edwards, Synaptic weight noise during multilayer perceptron training: Fault tolerance and training improvements, IEEE Transactions on Neural Networks 4(4) (1993), 722-725.

[12] A.F. Murray and P.J. Edwards, Enhanced multilayer perceptron performance and fault tolerance resulting from synaptic weight noise during training, IEEE Transactions on Neural Networks 5 (1994), 792-802.

[13] D.C. Park, M.A. El-Sharkawi, and R.J. Marks II, An adaptively trained neural network, IEEE Transactions on Neural Networks 2(3) (1991), 334-345.

[14] R. Ratcliff, Connectionist models of recognition memory: constraints imposed by learning and forgetting functions, Psychological Review 97(2) (1990), 235-308.

[15] A. Robins, Catastrophic forgetting, rehearsal and pseudorehearsal, Connection Science 7(2) (1995), 123-146.

[16] A. Robins, Consolidation in neural networks and in the sleeping brain, Connection Science 8(2) (1996), 259-275.

[17] V. Ruiz de Angulo, Interferencia catastrofica en redes neuronales: soluciones y relacion con otros problemas del conexionismo [Catastrophic interference in neural networks: solutions and relation to other problems of connectionism], Ph.D. Thesis, Universidad del Pais Vasco, 1996.

[18] V. Ruiz de Angulo and C. Torras, The MDL algorithm, in: Proc. of the Int. Workshop on Neural Networks (IWANN'91), A. Prieto, ed., Lecture Notes in Computer Science, Vol. 540, Springer-Verlag, 1991, pp. 162-172.

[19] V. Ruiz de Angulo and C. Torras, Random weights and regularization, in: Proc. of the Int. Conf. on Artificial Neural Networks (ICANN'94), M. Marinaro and P. Morasso, eds., Springer-Verlag, 1994, pp. 1456-1459.

[20] V. Ruiz de Angulo and C. Torras, On-line learning with minimal degradation in feedforward networks, IEEE Transactions on Neural Networks 6(3) (1995), 657-668.

[21] V. Ruiz de Angulo and C. Torras, Learning of nonstationary processes, in: Optimization Techniques, C.T. Leondes, ed., Neural Network Systems Techniques and Applications series, Vol. 2, Academic Press, 1998, pp. 175-207.

[22] V. Ruiz de Angulo and C. Torras, Averaging over networks: properties, evaluation and minimization, Tech. Report IRI-DT-9811, Institut de Robotica i Informatica Industrial, Barcelona, Spain, 1998.
Torras, Learning of nonstationary processes, in: Optimization Techniques, C.T. Leondes ed., Neural Network Systems Techniques and Applications series, Vol. 2, Academic Press, 1998, pp. 175-207. 22] V. Ruiz de Angulo and C. Torras, Averaging over networks: properties, evaluation and minimization. Tech. Report IRI-DT-9811, Institut de Robotica i Informatica Industrial, Barcelona, Spain, 1998. 23] A.S. Weigend, D.E. Rumelhart and B.A, Huberman, Generalisation by weight-elimination with application to forecasting. Neural Information Processing Systems 3, Morgan Kauman Publishers, San Mateo, CA, 1991, pp. 885-882. 24] http://www.cs.utoronto.ca/delve/data/Pumadyn/desc.html, Detailed documentation le. 25] http://www.cs.utoronto.ca/delve/ 26 Figure captions Figure 1. Imaginary plane on which the dierent approaches to deal with catastrophic interference can be placed. Figure 2. Average error increments produced by the application of LMD to eleven networks trained with our interference prevention method. The networks have resulted from minimizing (9) for 2 between 0 and 0.05, as represented in the axis of abscissas. Figure 3. Average jj jj produced by the application of LMD to the networks in Figure 1. Figure 4. Same experiments as in Figure 2 but using networks with 40 hidden units and, as training set, the 400 patterns of the high-noise Pumadyn dataset. Damage is measured as the average error increments for the old patterns, whilst the generalization error is evaluated over 2000 untrained patterns. Figure 5. As in Figure 4, but using the 200 training patterns of the high-noise Pumadyn dataset. Figure 6. As in Figure 4, but using the 400 training patterns of the medium-noise Pumadyn dataset. Figure 7. As in Figure 4, but using the 200 training patterns of the medium-noise Pumadyn dataset.  W 27 2 Joint training RETRAINING AXIS Combined training or retraining (Hetherington and Seidenberg) Temporal local learning (McClelland et al.) Rehearsal (Robins) Pseudorehearsal (Robins) Hard core of the problem Special coding of the patterns (French) Isolated training Distributed representation REPRESENTATION AXIS Figure 1. 28 Local representation 0.012 0.010 0.008 coarse version of LMD standard version of LMD 0.006 0.004 0.002 0.000 0.00 0.01 0.02 0.03 σ2/2 Figure 2. 29 0.04 0.05 0.06 0.03 coarse version of LMD standard version of LMD 0.02 0.01 0.00 0.00 0.01 0.02 0.03 2 σ /2 Figure 3. 30 0.04 0.05 0.06 0.18 0.0040 Damage Generalization error 0.16 0.12 0.0020 0.10 0.0010 0.08 0.0000 0.000 0.005 0.010 σ2/2 Figure 4. 31 0.015 0.06 0.020 Generalization error 0.14 Damage 0.0030 0.18 0.0040 Damage Generalization error 0.16 0.14 0.12 0.0020 0.10 0.0010 0.08 0.0000 0.000 0.005 0.010 σ2/2 Figure 5. 32 0.015 0.06 0.020 Generalization error Damage 0.0030 0.020 0.0005 Damage Generalization error 0.018 0.0004 Damage 0.0003 0.014 0.0002 0.012 0.0001 0.010 0.0000 0.000 0.005 0.010 σ2/2 Figure 6. 33 0.015 0.008 0.020 Generalization error 0.016 0.020 0.0005 Damage Generalization error 0.0004 0.018 Damage 0.0003 0.014 0.0002 0.012 0.0001 0.010 0.0000 0.000 0.005 0.010 σ2/2 Figure 7. 34 View publication stats 0.015 0.008 0.020 Generalization error 0.016