A framework to deal with interference
in connectionist systems
Vicente Ruiz de Angulo and Carme Torras
Institut de Robòtica i Informàtica Industrial (CSIC-UPC)
Edifici NEXUS
Gran Capità 2-4
08034 Barcelona, Spain.
E-mail:
[email protected],
[email protected]
Abstract
We analyze the conditions under which a memory system is prone to interference between new and old items. Essentially, these are the distributedness of the representation and the lack of retraining. Both are, however, desirable features, providing compactness and speed. Thus, a two-stage framework to palliate interference in this type of system is proposed, based on exploiting the information available at each moment. The two stages are separated by the instant at which a new item becomes known: a) interference prevention, prior to that instant, consists in preparing the system to minimize the impact of learning new items, and b) retroactive interference minimization, posterior to that instant, seeks to learn the new item while minimizing the damage inflicted on the old items. The subproblems addressed at the two stages are stated rigorously, and possible methods to solve each of them are presented.
1 The interference problem
Suppose one would like to learn a set of input-output pairs $\{(X_i, D_i)\}_{i=1,\dots,N}$, where $D_i$ is the desired output of the system to input $X_i$. In some applications not all the items to be learned are known at the same time. Instead, there is a sequential order of arrival, and the system must be operative before the last item is known with, desirably, the best possible performance. Therefore, new items must be quickly learned and integrated. However, when new memories are introduced in an associative system, the performance of the already stored items can be seriously degraded. This performance degradation is usually called interference or catastrophic forgetting. The effect of adding a new item depends basically on two factors: the type of memory used and the training scheme applied.
Two types of representations can be distinguished:
– Local representations, where the items are stored in such a way that they do not interfere with one another, as in a look-up table. As a consequence, in normal conditions, there is perfect recall of both the old and the new items. However, when the system is full, there is no possibility of storing new items without completely destroying some old items.
– Distributed representations, in which each element or parameter of the associative memory intervenes in the representation of several items. And reciprocally, each item is represented by several parameters, each of which supports the storage of the item only partially. When a new item is introduced, the recall of several stored items is degraded. The importance of forgetting rises very gradually with the number of new items learned. Distributed representations exploit memory resources much better than local ones, and are also reputed to interpolate more adequately (see Section 3.2 for details).
There are two basic training schemes:
– To introduce the new item in isolation. In this case, the last item is quickly learned but, in many cases, the structure of the system is such that this learning causes a very important degradation of the previously learned items. That is, catastrophic interference appears.
– To train the new item jointly with the old ones. The problem is that for some kinds of systems, especially those having distributed representations, this can be a complex optimization process requiring a great deal of computation.
In the above account of types of memories and training schemes, we have only described the extreme possibilities. There are intermediate options, representing a compromise between their properties.
2 Previous work
The phenomenon of interference has been noted and studied by the connectionist community: Ratcliff [14] and McCloskey and Cohen [9] first unveiled the problem. Their purpose was to model by means of artificial neural networks some aspects of human learning and memory, such as forgetting and retroactive interference. They arrived at the conclusion that there is too much and too sudden interference in neural networks for them to be a valid model. Almost all the following papers on interference inherit this marked psychological character. We briefly comment on some of them, and then focus on the engineering aspects of the problem.
The different works will be placed in the framework outlined in the preceding section, namely an imaginary plane whose axes refer to the locality of the representation and the extent of retraining with old items (Figure 1). Thus, French's approach [4] is situated at the extreme of localized solutions. He uses sigmoidal units (see the next section) with a [0, 1] range, saturating them and favoring the zero states.
Hetherington and Seidenberg [5] observed that, although the old items could have large errors after the introduction of a new item, they could be quickly relearned. Thus, they suggest retraining new and old items altogether after the introduction of the new pattern. This can be considered an approach far from the origin in the retraining axis, in which information about old items is used in a delayed mode.
In [15] the new item is trained with a (fixed or, preferably, randomly changing) part of the previously trained items. Thus, it can be situated in the middle of the retraining axis. In a variation of this strategy, Robins attempts to produce the same result without requiring the availability of the old items: the associations that are jointly trained with the new items are not old items, but (input, network output) pairs, which we denote $(X, F(X))$, randomly extracted from the network before introducing the new item. Although similar in appearance to the former, this is a different approach (which Robins calls pseudorehearsal). Training with the old items constrains the shape of $F(X)$ much less than pseudorehearsal, because the latter implicitly aims to reproduce the previous $F(X)$ function, except at the discontinuous point where the new item is located. The repeated introduction of items through this process leads to very local representations. It must be said that pseudorehearsal has some intriguing points in common with brain processes during dreaming [16], which suggests that the brain could use some related strategy, as commented on below.
McClelland et al. [8] claim that the brain avoids catastrophic interference by using a rapid, local memory to store new items, which afterwards should be integrated slowly in the long-term, distributed memory (note that this requires knowing whether an association has been introduced recently in the quick memory). This approach can be considered a combination of a temporal local solution and delayed retraining. To be successful, it needs a method to discriminate when the output for an input must be generated using the local or the distributed memory. Other requirements are the availability of the old items, and free
time to train the distributed memory with them and those stored in the local
memory. Following McClelland et al., the cortex is the long-term memory in the human brain, and the hippocampus is the local memory, able also to recognize when a pattern is new. Dreaming could be, as suggested by Robins [16], the time for integration, and the information about the old patterns would be extracted by exciting the cortex randomly, in a way similar to the pseudorehearsal technique commented on above.
3 A closer look at the interference problem
3.1 Associative systems and neural networks
At this point it is convenient to enter into details about the kind of systems
that will be dealt with in this paper. Although we will try to keep the
discussion at a rather general level, our results will be centered on neural
networks.
First of all, we require the associative memory system to be able to generalize. This means that for any input $X$ the system must yield a response or output $F(X, W)$ parametrized by a vector of parameters $W$. To store a set of items $\{(X_i, D_i)\}_{i=1,\dots,N}$, the system must minimize some function $E(W)$ enforcing the similarity of $F(X_i, W)$ and $D_i$, as for example $E(W) = \sum_{i=1}^{N} E_i(W) = \sum_{i=1}^{N} (F(X_i, W) - D_i)^2$. The exact shape of this function influences the way in which the network responds to inputs outside the training set and, for this reason, $E(W)$ could include terms aimed at determining the way the system should generalize.

Another condition we impose is the differentiability of $E(W)$ and $F(X, W)$, allowing the minimization of $E(W)$ with standard methods, such as gradient descent, using its derivatives.

The usual models of multilayer neural networks satisfy the above specifications. For example, a typical regression two-layer neural network (by regression network we mean one whose output is a vector of real numbers) has the form

$$F_i(X, W) = \sum_j w_{ij} \, y_j(X), \qquad (1)$$

where $F_i$ is the $i$-th component of $F$, the $w_{ij}$ are modifiable parameters and $y_j$ is a nonlinear function of the input that, in the terminology of neural networks, is known as the output of hidden unit $j$. Depending on the shape of the $y_j$'s, feedforward neural networks can be classified into two types. The most frequent one uses

$$y_j(X) = \mathrm{sig}\Big(\sum_k v_{jk} X_k\Big), \qquad (2)$$

where $X_k$ is the $k$-th component of $X$, the $v_{jk}$'s are modifiable parameters (also components of $W$) and $\mathrm{sig}$ is a function of sigmoidal shape, such as the hyperbolic tangent. The other type are the Radial Basis Function (RBF) networks, whose $y_j$'s are RBF units, typically Gaussians:

$$y_j(X) = e^{-\sum_k (x_k - v_{jk})^2 / \sigma_j^2},$$

where $v_{jk}$ and $\sigma_j^2$ are modifiable parameters (included in $W$).
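To make the two unit types concrete, here is a minimal NumPy sketch (our own illustration with hypothetical variable names, not code from the paper):

```python
import numpy as np

def sigmoidal_net(X, V, W):
    # Eqs. (1)-(2): hidden units are hyperbolic tangents of weighted
    # sums of the input components; the output layer is linear.
    y = np.tanh(V @ X)   # y_j(X) = sig(sum_k v_jk X_k)
    return W @ y         # F_i(X, W) = sum_j w_ij y_j(X)

def rbf_net(X, centers, widths, W):
    # Same linear output layer, but with Gaussian RBF hidden units.
    sq_dist = np.sum((centers - X) ** 2, axis=1)  # sum_k (x_k - v_jk)^2
    y = np.exp(-sq_dist / widths)                 # widths play the role of sigma_j^2
    return W @ y

# Example: 2 inputs, 3 hidden units, 1 output.
X = np.array([0.5, -0.2])
V = np.random.randn(3, 2)
W = np.random.randn(1, 3)
print(sigmoidal_net(X, V, W))
print(rbf_net(X, np.random.randn(3, 2), np.ones(3), W))
```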
3.2 Distributed representations
Up to now we have associated catastrophic forgetting with the existence of crossed dependencies of different outputs of the system on many parameters. This corresponds to systems with distributed representations, which seem to be at the heart of the problem. It is convenient to take a closer look at this relation and at the arguments against renouncing distributed representations.
The term "distributed representations" is often used in the connectionist literature ambiguously or with different meanings. Here, to avoid misunderstandings, we put forth a precise definition. We say that the response of the system to input $X$ has a distributed representation if the majority of the derivatives $\frac{\partial F}{\partial w_i}(X, W)$ have a significant magnitude. In the opposite case, we say that it has a local representation. Note that the different items stored in a system may have representations with very different degrees of distributedness. Moreover, an arbitrary input which has not yet been associated to a desired output, but which anyway produces an output response, can also be said to have a local or a distributed representation. The above definition is adequate to analyze the interference problem, while at the same time bearing resemblance to the common conception of distributed representations.

The easiest way to introduce a new item $X_{new}$ is to change those parameters to which the current (erroneous) system response is more sensitive. If there is a great overlap between this set of parameters and those having a large $\frac{\partial F}{\partial w_i}(X_p, W)$ magnitude, the response to item $X_p$ will be seriously perturbed (in a way approximately proportional to $\frac{\partial F}{\partial w_i}(X_p, W) \cdot \frac{\partial F}{\partial w_i}(X_{new}, W)$). Thus, catastrophic forgetting appears when a) the stored items have distributed representations and b) the response of the system to the new item prior to learning also has a distributed representation.
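The overlap argument above is easy to turn into a diagnostic. The following sketch (our own illustration; it assumes the gradient vectors $\partial F / \partial w$ are available, e.g., from automatic differentiation) measures the distributedness of a representation and the first-order perturbation that learning a new item induces on an old one:

```python
import numpy as np

def distributedness(grad_F, threshold=1e-3):
    # Fraction of parameters on which the response depends significantly;
    # near 1 means a distributed representation, near 0 a local one.
    return float(np.mean(np.abs(grad_F) > threshold))

def perturbation_estimate(grad_F_old, grad_F_new):
    # The damage to an old item is approximately proportional to the
    # overlap (dot product) of dF/dw for the old and new items.
    return float(np.dot(grad_F_old, grad_F_new))
```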
Sigmoidal networks, due to the Euclidean scalar products in (1) and (2), tend to produce very distributed representations for all their possible inputs, unless the sigmoidal functions work in their saturated zones. RBF networks, instead, can have distributed or local representations, depending on the relation between the distance among the RBF centers and their widths.
Then, why not simply use a very local representation of the memories to avoid interference? There are two main reasons for not doing this:

– Compactness. It is clear that local representations using an exclusive set of parameters for each item will require many resources to deal with large quantities of data. Indeed, purely local representations must always have at least as many parameters as data stored. Instead, when data are introduced by minimizing an error function without any locality constraint, the result is naturally distributed. With high data-to-parameter ratios, there are no local representations producing equivalent results.
– Generalization. To determine the output for an arbitrary input $X$, any good generalizer should take into account multiple stored items, especially in stochastic or noisy domains. When data are scarce and sparse, items far from $X$ should influence the system response. This influence can only be realized through commonly dependent parameters. The sharing of parameters by several items implies, at least, a certain degree of distributedness in the representations. Note that, in the sense we are using the term distributed, RBF networks composed of units with a limited "local" activation function can exhibit distributed representations, the determinant factor being the degree of overlap between the units.
3.3 Our goal
In this paper we intend to deal with the engineering aspects of the core problem. This core lies for us at the origin of the retraining and representation axes (refer to Figure 1), i.e., catastrophic interference avoidance without requiring repeated presentations of the old items, and without explicitly imposing local representations. We coincide with [13] in pursuing this goal and in taking a pure engineering perspective.
However, theoretical reasons forbid a perfect solution to this problem. When a feed-forward network with a fixed number of units has enough capacity to encode a fixed set of items, there is a bound on how fast learning can take place, since this problem has been proven to be NP-complete [3], [6]. Therefore, we cannot aim at finding a procedure that in approximately constant time learns a new item while leaving the old ones intact, because by iterating this procedure learning could be carried out in time linear in the number of items. As a consequence, our goal must be limited to trying to palliate interference as much as possible.
4 A two-stage approach to palliate interference
We consider that the interference problem must be handled in two stages separated by the arrival of the new item:

– A priori stage: encoding the old, known items in the best way to prepare the network to minimize the impact of learning new items. We call this stage interference prevention.

– A posteriori stage: learning the new items while minimizing the damage inflicted on the old items stored with a given weight configuration. We call this stage retroactive interference minimization.
We will make the scenario concrete: a set of items $1 \dots N-1$ is stored prior to the arrival of the new item $N$. Let $E_{1 \dots p}$ denote the error in items $1, 2, \dots, p$. With this notation, we formulate our problem as how to find efficiently a minimum of $E_{1 \dots N}(W)$ with the constraint of having item $N$ available only at a moment posterior to the arrival of items $1 \dots N-1$. The optimization to be performed prior to this moment constitutes the interference prevention, whereas the optimizations posterior to that moment conform the retroactive interference minimization. We will enforce this division by reflecting the calculations of interference prevention in the variable $W$ itself, fixing its value afterwards. The changes to be made to the weights during retroactive interference minimization will be reflected in the variable vector $\Delta W$, so that the value of the weight vector after the introduction of the new item will be $W + \Delta W$. For clarity of explanation, we make all the derivations assuming a perfect encoding of the new item ($E_N(W + \Delta W) = 0$). However, the $N$-th item can also be introduced partially within this framework. Instead of using $(X_N, D_N)$, we would take $(X_N, F(X_N, W) + \alpha (D_N - F(X_N, W)))$, $0 < \alpha < 1$, as the new item. Reducing only a fraction of the error can be useful when dealing with noisy data [21].

The minimization of $E_{1 \dots N}(W)$ is approximately equivalent to:

$$\min_{W, \Delta W} E_{1 \dots N-1}(W + \Delta W) \quad \text{subject to} \quad E_N(W + \Delta W) = 0.$$

And making the above distinction explicit:

$$\min_{W, \Delta W} \left[ E_{1 \dots N-1}(W) + \Delta E_{1 \dots N-1}(W, \Delta W) \right] \quad \text{subject to} \quad E_N(W + \Delta W) = 0.$$

Let us abbreviate $E_{1 \dots N-1}$ with $E$. We need to express $\Delta E(W, \Delta W)$ in a manageable way. Evaluating $\Delta E(W, \Delta W)$ accurately for a given $\Delta W$ can have a high computational cost. For example, in a supervised feedforward neural network, it would entail presenting the whole set of items to the network. If the optimum is to be found through some search process, this evaluation has to be performed repeatedly, multiplying that cost.

An alternative to accurate evaluation is to approximate the error function over the old items through a truncated Taylor series expansion, for example, with a second-order polynomial in the weights. A linear model is too crude and not feasible, because it may turn the problem into an unsolvable one [21]. The coefficients of the polynomial have a direct interpretation, namely the first and second derivatives of $E$. Cross-terms are not included because the calculation of the full Hessian requires computations very costly in memory and time. The Hessian diagonal is, therefore, the only information about the old items that we require in principle.

Thus, the most faithful problem formulation we can reasonably aspire to deal with is:

$$\min_{W, \Delta W} \left[ E(W) + \frac{1}{2} \sum_i \frac{\partial^2 E}{\partial W_i^2} \Delta W_i^2 + \sum_i \frac{\partial E}{\partial W_i} \Delta W_i \right] \quad \text{subject to} \quad E_N(W + \Delta W) = 0. \qquad (3)$$

The problem is how to carry out this constrained minimization on $W$ (interference prevention) and on $\Delta W$ (retroactive interference minimization). In the latter, $W$ is by definition a constant already known, and the minimization takes place over $\Delta W$. Interference prevention is more complex because the minimization must be performed over $W$, but the cost function is parametrized by $\Delta W$, which is unknown at that moment. This leads to the minimization of the expected cost for the a priori distribution of $\Delta W$. We begin with the easier one, namely retroactive interference minimization.
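Under this diagonal model, the increment $\Delta E(W, \Delta W)$ becomes cheap to evaluate once the first derivatives and the Hessian diagonal have been computed. A minimal sketch of the model in (3) (our own illustration, hypothetical names):

```python
import numpy as np

def delta_E_model(dW, grad_E, hess_diag):
    # Truncated Taylor model of the old-items error increment in (3):
    # sum_i (dE/dW_i) dW_i + (1/2) sum_i (d^2E/dW_i^2) dW_i^2.
    return float(grad_E @ dW + 0.5 * hess_diag @ (dW ** 2))
```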
5 Minimizing retroactive interference
5.1 Problem formulation
Now we consider $W$ as a constant, and therefore (3) becomes

$$\min_{\Delta W} \left[ \frac{1}{2} \sum_i c_i \Delta W_i^2 + \sum_i b_i \Delta W_i \right] \quad \text{subject to} \quad E_N(W + \Delta W) = 0,$$

with $c_i = \frac{\partial^2 E}{\partial W_i^2}$ and $b_i = \frac{\partial E}{\partial W_i}$. Note that usually $b_i$ will be close to zero because, by introducing the first $N-1$ items, we should have minimized $E$ and, therefore, also the absolute value of its first derivatives. In any case, this is part of interference prevention. Thus, we assume $b_i = 0$, and our definitive formulation for the retroactive interference subproblem is:

$$\min_{\Delta W} \sum_i c_i \Delta W_i^2 \quad \text{subject to} \quad E_N(W + \Delta W) = 0, \qquad (4)$$

with $c_i = \frac{\partial^2 E}{\partial W_i^2}$.
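As an aside, if the constraint is linearized around $W$, $E_N(W + \Delta W) \approx E_N(W) + \sum_i g_i \Delta W_i$ with $g_i = \partial E_N / \partial W_i$ (the linear approximation also used in [13]), problem (4) admits a closed-form solution via a Lagrange multiplier. This is a sketch under that linearity assumption, not the LMD algorithm of Section 5.3:

$$\frac{\partial}{\partial \Delta W_i}\Big[\sum_j c_j \Delta W_j^2 + \lambda\big(E_N(W) + \sum_j g_j \Delta W_j\big)\Big] = 0 \;\Rightarrow\; \Delta W_i = -\frac{\lambda g_i}{2 c_i},$$

and imposing the constraint yields $\lambda = 2 E_N(W) / \sum_j (g_j^2 / c_j)$, so that

$$\Delta W_i = -\frac{E_N(W)}{\sum_j g_j^2 / c_j}\,\frac{g_i}{c_i}.$$

As expected, weights with large $c_i$ (strong support of the old items) move little, while weights with small $c_i$ absorb most of the change.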
5.2 Selection of coefficients
The above formulation is similar to that adopted by Park et al. [13], and also in previous works by the authors [18], [20] and [21], the difference being in the value assigned to $c_i$. In [13] a modified second derivative of $E$, weighted by the error of the outputs of the network, is used. In [20] a sensitivity measure dependent on the history of the first derivatives $\frac{\partial E}{\partial W_i}(t)$ and the changes $\Delta W_i(t)$ made by the back-propagation algorithm in each iteration is used,

$$c_i = \frac{\sum_t \frac{\partial E}{\partial W_i}(t) \, \Delta W_i(t)}{W_i^{fin} - W_i^{in}},$$

where $W_i^{in}$ and $W_i^{fin}$ are the initial weight and the final weight after the training, respectively. In [13], $c_i = \frac{\partial^2 E}{\partial W_i^2}$ and $c_i = 1$ are tested and compared. The latter option is interesting for its cheap computational cost, and because it is intuitively appealing, since (4) calculates in this case the nearest solution for the new item in parameter space.

We next show that the option $c_i = \frac{\partial^2 E}{\partial W_i^2}$ is the best on average to estimate $\Delta E(W, \Delta W)$ with a cost model of the form $\sum_i c_i \Delta W_i^2$, even outside a minimum for the old items. First, note that $\sum_i c_i \Delta W_i^2$ cannot distinguish between the effects produced by positive and negative increments of the same magnitude, i.e., the sign information is lost. The quantity $\sum_i c_i \Delta W_i^2$ could have been generated by $2^n$ vectors, denoted $\Delta W^{l)}$, $n$ being the number of components of $\Delta W$, by considering two opposite values for each component. The best one can do in this situation is to use as cost function the average of the error increments produced by these vectors:

$$\langle E(W) - E(W + \Delta W) \rangle = \frac{1}{2^n} \sum_{l=1}^{2^n} \left( E(W) - E(W + \Delta W^{l)}) \right) = E(W) - \frac{1}{2^n} \sum_{l=1}^{2^n} E(W + \Delta W^{l)}),$$

where $\langle \cdot \rangle$ denotes expectation. Now we need to use the following result [22], [17]:

$$\int g(U + R) \, P(R) \, dR = g(U) + \frac{1}{2} \sum_i \frac{\partial^2 g}{\partial U_i^2} \sigma_i^2 + O(\mu_i), \qquad (5)$$

where $g$ is an arbitrary differentiable function, $P$ is a symmetric, zero-mean probability density function, and $\sigma_i^2$ and $\mu_i$ are the variance and the fourth central moment of the marginal distribution of $R_i$, respectively. This formula can be particularized to discrete distributions so that, considering that $\frac{1}{2^n}$ is the probability of each $\Delta W^{l)}$ given the information about the square components, $P(\Delta W^{l)}) = \frac{1}{2^n}$ and, therefore,

$$\langle E(W) - E(W + \Delta W) \rangle = E(W) - \left( E(W) + \frac{1}{2} \sum_i \frac{\partial^2 E}{\partial W_i^2} \, \frac{\Delta W_i^2 + (-\Delta W_i)^2}{2} \right),$$

and finally

$$\langle E(W) - E(W + \Delta W) \rangle \approx -\frac{1}{2} \sum_i \frac{\partial^2 E}{\partial W_i^2} \Delta W_i^2. \qquad (6)$$

Thus, we see that the absence of sign information leads naturally to a cost function of the form $\sum_i c_i \Delta W_i^2$ with the assignment $c_i = \frac{\partial^2 E}{\partial W_i^2}$, even in points far from the minimum. This result can be easily generalized to the estimation of an arbitrary function in the same conditions.

This conclusion is in agreement with the results of an experimental comparison of all the above-mentioned options for assigning $c_i$ [20]. Another conclusion emerged from this comparison: the difference between using $c_i = 1$ and $c_i = \frac{\partial^2 E}{\partial W_i^2}$ decreases with the number of items stored in the memory system. This can be interpreted in the following way: the weights that are changed most by the solution of (4) are those with lower second derivatives. Thus, second derivatives can be considered as a measure of how much the encoding of the previously stored items is supported by the corresponding weights. Those weights with currently lower derivatives change to support the new items and, then, their second derivatives grow. As the number of stored items increases, all weights tend to have similar, high-magnitude second derivatives and, thus, using $c_i = 1$ or $c_i = \frac{\partial^2 E}{\partial W_i^2}$ tends to yield the same results.
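The coefficients $c_i$ are just the diagonal of the Hessian of the old-items error. [20] computes them with a backpropagation-like pass; as a model-agnostic illustration of what is being computed (our own sketch, not the paper's procedure), they can also be estimated by central finite differences:

```python
import numpy as np

def hessian_diagonal(E, W, eps=1e-4):
    # c_i = d^2 E / dW_i^2, estimated as (E(W + eps*e_i) - 2 E(W) + E(W - eps*e_i)) / eps^2.
    c = np.empty_like(W)
    E0 = E(W)
    for i in range(W.size):
        Wp = W.copy(); Wp[i] += eps
        Wm = W.copy(); Wm[i] -= eps
        c[i] = (E(Wp) - 2.0 * E0 + E(Wm)) / eps ** 2
    return c
```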
5.3 Minimization method
Now the question remains of how to solve (4) efficiently. A usual way to tackle a constrained optimization problem is to linearly combine the cost function with the deviation of the constraint from zero, and then minimize this new error function:

$$\min_{\Delta W} \left[ \lambda \sum_i c_i \Delta W_i^2 + \mu \, E_N(W + \Delta W) \right]. \qquad (7)$$

In the minimization of this function, there is a tradeoff between $E_N$ and $\sum_i c_i \Delta W_i^2$ that depends on $\lambda$ and $\mu$, and the error in the new item will not be 0 unless the ratio $\mu / \lambda$ tends to infinity with time. In practice this is impossible, and it is approximated through an appropriate schedule for changing $\lambda$ and $\mu$.
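For concreteness, one descent iteration on the penalized objective (7) could look as follows (our own sketch; it assumes gradient access to $E_N$, and the coefficients follow a schedule that raises $\mu / \lambda$ over time to approximate the hard constraint):

```python
import numpy as np

def penalized_step(dW, grad_EN, c, lam, mu, lr=0.01):
    # Gradient step on lam * sum_i c_i dW_i^2 + mu * E_N(W + dW);
    # grad_EN is the gradient of E_N with respect to the weights at W + dW.
    return dW - lr * (2.0 * lam * c * dW + mu * grad_EN)
```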
More sophisticated algorithms from the theory of constrained optimization can also be applied, as in [13], where the reduced gradient algorithm for nonlinear constrained optimization is used. The advantage of these methods is their generality, in the sense that in principle they can deal with general functional forms (but not always: for example, the reduced gradient method only works with linear constraints and, because of this, $E_N(W)$ is approximated linearly in [13]), but they can be complex and computationally expensive.

In [20] and [17], an algorithm to solve (4) that exploits the structure of multilayer neural networks is developed. The drawbacks of solving a constrained minimization problem are here avoided through the transformation of retroactive interference minimization into an unconstrained minimization problem. The finding underlying this transformation is the existence of a one-to-one mapping between the hidden unit configurations and the best solution in the subspaces of weights that produce those hidden unit configurations. Besides allowing an unconstrained optimization, this transformation has other advantages, like a number of variables much lower than in the original problem, and an always perfect encoding of the new item. The algorithm derived from this transformation, called LMD (Learning with Minimal Degradation), is completely parallel, even among layers.
5.4 Relation between retroactive interference minimization and pruning
Pruning is a technique commonly used in connectionist systems, consisting in trimming the network by eliminating the most superfluous weights. To palliate retrograde interference, one of the most important issues is the determination of the least significant parameters for the encoding of a number of stored memories. These are the parameters that should support most of the changes necessary to introduce new information. Pruning, instead, detects the least exploited parameters in order to eliminate them. Indeed, we have used as relevance measure the second derivatives, also used in Optimal Brain Damage [7], the most popular pruning technique.
The relation with pruning suggests that advances in pruning techniques
can be incorporated into retroactive interference minimization algorithms.
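For reference, Optimal Brain Damage ranks weights by the saliency $s_i = \frac{1}{2} \frac{\partial^2 E}{\partial W_i^2} W_i^2$; a one-line sketch (our own illustration) makes the connection with the coefficients $c_i$ explicit:

```python
def obd_saliency(W, hess_diag):
    # Optimal Brain Damage [7]: weights with the smallest saliency
    # (1/2) * c_i * W_i^2 are the best candidates for pruning.
    return 0.5 * hess_diag * W ** 2
```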
6 Interference prevention
6.1 Problem formulation
The interference prevention subproblem is formulated as the selection of a set of system parameters able to codify items $1 \dots N-1$ in such a way that they are disrupted minimally when item $N$ is introduced. Now, we have to solve (3) in $W$, taking $\Delta W$ as an unknown constant:

$$\min_W \left[ E(W) + \frac{1}{2} \sum_i \frac{\partial^2 E}{\partial W_i^2} \Delta W_i^2 + \sum_i \frac{\partial E}{\partial W_i} \Delta W_i \right] \quad \text{subject to} \quad E_N(W + \Delta W) = 0.$$

Before knowing the new item, we must solve this problem without assuming any particular value for $\Delta W$. The ideal solution is that which, on average, best solves the problem, taking into consideration the distribution of $\Delta W$:

$$\min_W \left[ E(W) + \left\langle \frac{1}{2} \sum_i \frac{\partial^2 E}{\partial W_i^2} \Delta W_i^2 + \sum_i \frac{\partial E}{\partial W_i} \Delta W_i \right\rangle \right].$$

And developing the second term:

$$\min_W \left[ E(W) + \int \left( \frac{1}{2} \sum_i \frac{\partial^2 E}{\partial W_i^2} \Delta W_i^2 + \sum_i \frac{\partial E}{\partial W_i} \Delta W_i \right) P_{\Delta W}(\Delta W) \, d\Delta W \right], \qquad (8)$$

where $P_{\Delta W}$ is the density function of $\Delta W$. Let us assume that $P_{\Delta W}$ is equal for each of the components $\Delta W_i$, or at least that they have equal variances (although this assumption can be easily relaxed). It is also very natural to assume a zero mean for this distribution: otherwise, item $N$ would not be really new, and the information of the mean of $P_{\Delta W}$ could be used to train the system. We can apply (5) to (8) by making $g(\Delta W) = \frac{1}{2} \sum_i \Delta W_i^2 \frac{\partial^2 E}{\partial W_i^2} + \sum_i \Delta W_i \frac{\partial E}{\partial W_i}$, so that the integral in (8) becomes the expectation of $g(0 + \Delta W)$. Since $g(0)$ is null and $\frac{\partial^2 g}{\partial \Delta W_i^2} = \frac{\partial^2 E}{\partial W_i^2}$, we get

$$\min_W \left[ E(W) + \frac{\sigma^2}{2} \sum_i \frac{\partial^2 E}{\partial W_i^2} \right] \qquad (9)$$

without any error (the term $O(\mu_i)$ of (5) disregarded here involves fourth- or higher-order derivatives, and these are null for the quadratic $g$ above).

This result can be derived in another way, without supposing a particular shape for $E_{1 \dots N-1}(W)$. The problem of interference prevention can be understood as the search for a $W$ such that, after being modified by the introduction of the new item, it would still be able to reproduce the old ones. As the new items are by definition unknown, they produce unknown modifications in the parameters when they are learned. Thus, the problem consists in getting a point of low $E(W)$ stable with respect to random perturbations of $W$. This resistance to perturbations can be expressed as

$$\min_W \int E(W + \Delta W) \, P_{\Delta W}(\Delta W) \, d\Delta W. \qquad (10)$$

This expression is made exactly equivalent to (9) by applying (5). Thus, reassuringly, we have obtained the same result with two different reasonings.
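In code, the prevention objective (9) just adds a scaled Hessian-trace term to the usual error (our own sketch; `hess_diag` could come from the finite-difference helper of Section 5.2 or from the backpropagation-like algorithm of [19], [22]):

```python
def prevention_objective(E_W, hess_diag, sigma2):
    # Eq. (9): E(W) + (sigma^2 / 2) * sum_i d^2 E / dW_i^2.
    return E_W + 0.5 * sigma2 * float(hess_diag.sum())
```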
6.2 Selection of coefficients
Unfortunately, the above discussion did not suggest which the density function $P_{\Delta W}$ is. This amounts to deciding the variance $\sigma^2$, which is the main parameter of the distribution influencing the minimization (9) or (10) (the terms disregarded when approximating (10) with (9) are significant in general, but are close to zero in the minimum of (9)). The variance of the weight changes produced by future items could be approximately deduced from their expected error, which in turn may be estimated from the error that the most recently introduced items had at their arrival. However, this is a context-dependent hypothesis, which can lead to significant errors. Moreover, there is another important issue about the selection of $\sigma^2$ to be dealt with: it influences not only the error increments produced by new items, but also the way in which the non-presented items are interpolated. In fact, the way in which the items $1 \dots N-1$ are encoded determines the answer of the system to all possible inputs. The quality of these answers is often more important than the exact storage of the presented items. In this case, $\sigma^2$ should be tuned in accordance with the former desideratum.

In conclusion, either because of ignorance of the appropriate value or because it is tuned for other purposes, the parameter $\sigma^2$ used in (9) could be significantly different from the real weight variances. What are the consequences of this parameter imprecision? Could it have effects opposed to those desired, i.e., in these conditions, can the introduction of new items be worse after minimizing (9) than after minimizing $E(W)$? A detailed mathematical analysis of this question is carried out in [17]. The conclusions can be summarized in a few words. If $\sigma^2$ is smaller than the real variance, the minimization of (9) is always beneficial. The opposite case is also safe if the remaining error $E(W)$ in the minimum of (9) is not much higher than in the minimum of $E(W)$. $E(W)$ in the minimum of (9) has a sigmoidal shape when considered as a function of $\sigma^2$ and, therefore, $\sigma^2$ can be increased until the error begins to grow quickly.
6.3 Minimization method
There is a direct way of minimizing (9) or, equivalently, (10). It consists in adding noise to the weights while minimizing $E(W)$, so that a sample of the gradient distribution of $\frac{\partial E}{\partial W_i}(W + \Delta W)$ is calculated in each iteration. Lots of samples of $\Delta W$ must be extracted from its distribution to get a good estimate of the average derivatives (or, alternatively, very small minimization steps must be taken in the direction of $\frac{\partial E}{\partial W_i}(W + \Delta W)$). This noise addition during training has already been used for other purposes, such as ameliorating fault tolerance in neural networks [11] or improving their generalization properties [12, 1].

Again, this method is very general, being valid for any type of $E(W)$ and $P_{\Delta W}$. However, it is extremely inefficient for systems such as neural networks, which have a high-dimensional parameter space to be sampled in order to obtain the averages in the optimization steps. In [19] and [22], a method based on the gradient of (9), especially adapted for feed-forward networks, is developed. It has the advantage of being deterministic and much more stable. In addition, it is easily computable with an algorithm of the same order as the backpropagation of the gradient of $E(W)$.
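The direct procedure just described amounts to descending the gradient of $E$ evaluated at randomly perturbed weights. A minimal sketch (our own illustration; many samples, or very small steps, are needed for the average in (10) to emerge):

```python
import numpy as np

def noisy_descent_step(W, grad_E, sigma, lr=0.01, rng=None):
    # Sample Delta W from the assumed zero-mean distribution and descend
    # along the gradient of E taken at the perturbed point W + Delta W.
    rng = rng or np.random.default_rng()
    dW = rng.normal(0.0, sigma, size=W.shape)
    return W - lr * grad_E(W + dW)
```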
6.4 Relation between interference prevention and generalization
There was an implicit assumption in the derivation of our interference prevention method: the basic features of $P_{\Delta W}$, and especially its variance, are independent of the $W$ used to encode the old items. We have supposed that this variance, which is directly related to the expected error for the new item, does not change while performing the minimization (10). In other words, we minimize future interference assuming an expected error for the new items that is independent of $W$.

But the error in the new items (or, equivalently, the variance of $P_{\Delta W}$) is another factor determining the interference that these new items will produce and, of course, it is also controlled by the selection of $W$. Thus, there exists an alternative way to prevent interference, namely reducing the error in the new items. This is nothing more than improving generalization.

The point we want to make next is that the minimization (10) is also useful to control generalization. This can be understood in two ways:

- First, by reducing the second derivatives of the parameters, their information content (the information content of a parameter can be approximated by $\log \frac{\partial^2 E}{\partial W_i^2}$, assuming a uniform a priori distribution for it [17]) with respect to the encoding of the items is also reduced. Controlling the information content of a model is a general way of controlling its generalization.

- The other way is considering the term $\frac{\sigma^2}{2} \sum_i \frac{\partial^2 E}{\partial W_i^2}$ as a regularizer that constrains the system to be simple or smooth. The use of regularizer terms is another classical technique that controls the smoothing of $F(X)$ by means of a regularization coefficient that regulates the importance of the regularizer. In our case this regularization coefficient is $\sigma^2$.
7 Experimental results
We now show results obtained by combining the two complementary algorithms for catastrophic interference avoidance. First, the robustness against changes is enhanced by the minimization of (9), and then the new patterns are introduced while minimizing retrograde interference by means of the LMD algorithm. Plain back-propagation and retrograde interference minimization with LMD have been extensively compared in [18, 20, 21]. We concentrate here on studying the benefits of complementing LMD with the interference prevention algorithm.
7.1 A first example
The following experiment used a neural network architecture with seven hidden units. Fourteen random samples of the function $\sin(x_1 + x_2)$ were chosen as an initial training set for the network. The network was trained using our interference prevention method, i.e., by minimizing (9) following its gradient. This process was repeated eleven times using different $\sigma^2$, producing eleven different networks. For each of these networks, we tested the effect of introducing twelve more random samples of the same function, using the standard ($c_i = \frac{\partial^2 E}{\partial W_i^2}$) and the coarse ($c_i = 1$) versions of the LMD algorithm for retrograde interference minimization. Note that, adhering to our simplified formulation, the LMD algorithm always encodes the new patterns perfectly. Thus, the state of the network after their introduction is entirely characterized by the error increment in the old patterns.

Figure 2 shows the average error increments produced by the introduction of the new items. An important fall of catastrophic interference can be observed, especially in the second half of the graph. This reduction is due in part to the improvement in generalization, which is reflected in Figure 3, where the coarse version distances suffer a small drop located more or less in the same place, due to the lower error of the new items at arrival time, which requires smaller weight modifications. Observe that the origin of abscissas corresponds to coding the old patterns with plain back-propagation (i.e., no interference prevention is applied). Applying also plain back-propagation
instead of LMD to code the new pattern (i.e., no retrograde interference
avoidance) produces error increments that go beyond the scale of the graph.
The distances are in general greater for the standard LMD than for the coarse LMD, because the latter explicitly minimizes $\|\Delta W\|$, while the former looks for privileged directions suggested by the second derivatives. When the network is trained with the classical backpropagation method, i.e., following the gradient of $E_N(W)$ in discrete steps, the results depend on the length of these steps. The total distance covered in weight space tends to decrease with the shortening of the steps (at the cost of longer training times). In the infinitesimal limit, the solution usually tends to approximate that of the coarse version of LMD [20].

Note that the distances covered by the standard version grow with $\sigma^2$. The reason is that not all second derivatives decrease in the same way when $\sigma^2$ increases. The minimum of (9) does not exert a pressure proportional to the second derivative's value, but an equal pressure for large and small ones. Therefore, some of them become almost null, while others remain large. This has consequences for the retroactive interference problem formulation: the cost coefficients are the second derivatives and, thus, weights with null second derivatives can be changed arbitrarily. This problem is similar to the excessively large steps that optimization methods based on the Hessian (like Newton or pseudo-Newton) perform when the second derivatives are small. We solve it in the usual way, namely by adding a constant to the coefficients, so that $c_i = \frac{\partial^2 E}{\partial W_i^2} + k$. In [17] we argue that a sensible value for this constant in the case of feedforward neural networks is the square of the maximum possible activation of the hidden units.
7.2 Experiments using the Pumadyn datasets
The next series of experiments has been performed using data from the Pumadyn family of datasets [24] (loaded from the Delve database [25]), which come from a realistic simulation of the dynamics of a Puma 560 robot arm. The inputs in the chosen datasets consist of angular positions, velocities, torques and other dynamic parameters of the robot arm. The output is the angular acceleration of one of the links of the robot arm. We have used the two datasets in the family labelled with the attributes: 32-dimensional input, fairly linear, and medium noise in one case and high noise in the other. We made the same type of experiments shown in Figure 2, but only with the standard version of LMD, since it is the one that works best. Moreover, we have added a very important piece of information to the graphs: the generalization error obtained for each of the values of $\sigma^2$, evaluated over 2000 untrained patterns. Networks with forty hidden units were first trained with 200 or 400 patterns, and then the average error increment produced by introducing 200 new patterns separately was evaluated. The results of all combinations of number of previously trained patterns and degrees of noise are displayed in Figures 4 through 7.
These figures show that interference can be alleviated while at the same time improving generalization. This is in contrast with other strategies for interference avoidance based on a special coding of patterns (e.g., saturating the hidden units to get more local representations), which produce input-output functions $F(X, W)$ (e.g., piece-wise step functions) of a different nature from the function being approximated, thus resulting in high generalization errors. However, we are forced to use the same regularization coefficient for controlling generalization and prevention of interference, the best values for each of these purposes being usually different. Therefore, there is a trade-off that should be considered depending on the application.

If generalization takes priority, the potential benefit of the interference prevention procedure depends on several factors. One such factor is the number of patterns already stored in the network. The more information a network has stored, the better directed its approximation power is and, therefore, the less it requires to be regularized. This can be seen by comparing the curves in Figures 4 and 5: with double the number of trained patterns, the generalization curve is more squashed against the left vertical axis. It can also be seen in how the generalization curve of Figure 7 increases more slowly when compared to Figure 6. Another factor is the amount of noise in the examples. The noisier the training patterns are, the more convenient it is to smooth $F(X, W)$ by increasing the regularizer coefficient. This is very evident when comparing Figures 6 and 7 (medium noise) with Figures 4 and 5 (high noise), which exhibit generalization error minima at higher values of $\sigma^2$. Moreover, the error rises very gently with $\sigma^2$ in these figures, allowing for large reductions in interference without paying a high cost in the generalization account.

Therefore, when priority is given to generalization, the narrowest margins for benefits in interference prevention occur for networks trained intensively with a large number of noiseless patterns. But this is precisely the case in which interference is less serious, since the new patterns are better predicted and their introduction produces less damage. This can be checked by observing that the generalization minimum for the network trained with 400 medium-noise patterns (Figure 6) has an associated damage that is an order of magnitude lower than that of the opposite case (200 high-noise patterns) displayed in Figure 5.

Finally, a few words about an aspect of Figures 6 and 7 that could seem strange: the damage curve quickly drops to zero and apparently continues with negative values. In fact, interference takes negative values; that is, the encoding of old patterns is improved (rather than disrupted) by the introduction of the new patterns. Note that this happens when the network is highly over-regularized, so that the smoothing constraint pushes the interpolating curve $F(X, W)$ far from the trained patterns. Then the introduction of new patterns (which in Figures 6 and 7 have little noise) without such a constraint brings the interpolation curve nearer to the old patterns with high probability.
7.3 Limitations of the proposed algorithm
Together with the benefits above, we must also point out the limitations in the application of our method for interference prevention. We said that the drop in the distances for the coarse LMD in Figure 3 was due to generalization. This is true, but the fall is smaller than the improvement in generalization would warrant. This means that, although the errors are lower, the weights have had to be modified almost as much. The reason is that the algorithm minimizing (9) makes the network output insensitive to changes in the weights for the stored items, but this insensitiveness is transmitted, or generalized, to the rest of the input space. Because of this, it is also necessary to modify the weights more to introduce the new items, and the potential benefits of the strategy get limited. As in generalization, the greater the number of items, the greater and more likely the insensitiveness for items outside of the learning set will be. When the items are few, the results are irregular, as the network has become insensitive only for new items located near a group of old items.
8 Conclusions
Two conditions are required for catastrophic interference to occur:
– The isolated training with new items, without rehearsal of the old ones, and
– The use of distributed representations.
We have typified the approaches to solve the interference problem by their degree of retraining with old items, and by the locality of their representations.

We have proposed a two-stage framework to deal with the interference problem based on the information available at each moment. Retroactive interference minimization deals with interference when the new item is already known. It can be formulated as the search for the weights minimizing the error increments of the already stored items, subject to the encoding of the new information. In practice, the best model affordable for a highly dimensional system is a weighted sum of squares of the changes in the parameters. We have shown that the best coefficients are, on average, the second derivatives with respect to the parameters, even outside of the minimum. For feedforward neural networks, a very efficient algorithm can be used to solve this constrained minimization.
Instead, at the earlier stage of interference prevention, when the new item has not yet arrived, the corresponding weight changes are also unknown. Thus, the only reasonable way of minimizing the cost function in anticipation is to minimize the coefficients, i.e., the second derivatives, jointly with the error. The effect of this is to make the stored items insensitive to future changes. When tested experimentally, we have found a limited success of this strategy due to an unexpected reason: the insensitiveness to which old items were trained gets "generalized", especially to nearby zones. If a new item has to be introduced in one of these insensitive zones, larger weight changes are required, and most of the expected benefits are lost. When the old items cover the input space densely, there is no possible gain. There is a solution for this situation: to accept and assume that the average sensitivity is the same for old and new items. Thus, the weight increments for the new items will depend on the sensitivity (second derivatives) of the old items. This is reflected by expressing $\sigma^2$ as a function of the second derivatives for the old items, and the cost function (9) becomes [17]:

$$E(W) + 2(N-1) \, \langle E_N \rangle \, \frac{\sum_i \left( \frac{\partial^2 E}{\partial W_i^2} \right)^2}{\left( \sum_i \frac{\partial^2 E}{\partial W_i^2} \right)^2}, \qquad (11)$$

where $\langle E_N \rangle$ is the expected error for the new item. Unfortunately, this function is more complex than (9) and its gradient is harder to calculate. This failure to avoid interference completely with a simple procedure was expected, as explained in Section 3.3.
Moreover, there are many scenarios in which the blind application of the hypothetically best possible algorithm could be inappropriate. In fact, as mentioned in Section 4, usually it would be better to introduce the new item only partially, leaving a certain error that is exchanged for a smaller error increment in the old items. The appropriate balance point in this trade-off depends critically on several factors:
– The capacity of the memory system, i.e., the extent to which it is able to assimilate all the items.
– The amount of noise in the data.
– The number of already stored items. As it grows, the comparative importance of the new item error decreases.
– The variability in time of the function to be approximated. If the function changes quickly, the errors in new items become comparatively more important, and more interference should be allowed.

A rule of thumb that is generally correct when the objective function is static or changes steadily is that the error in the new item should not be made lower than the average error in the old items.
We have argued, with others, that distributed representations are indispensable for good generalization. But is this completely true? Consider this extremely localist representation: the items themselves as a list of input-output pairs, with no other structure or parameter. Each time an answer to an arbitrary input is required, one can carry out some very complicated process, for example, building a sigmoidal feedforward network, training it with the stored items, and producing as answer the output of the network. When a new item is introduced there is no catastrophic interference, because it is just added to the list. Thus, the key point is shifting the processing time from training to the generation of answers by the system. More practical methods than the one above could be imagined, and some work in the literature [2] can be considered as other, more practical examples of moving computational cost to the recall phase.

So, under this point of view, the question is where to put the burden of processing. Putting it in the learning phase is advantageous if there is enough time for it and one continuously has to generate outputs and react very quickly to the inputs. This is the scenario for animals in their environments. Thus, based on engineering principles, we think that there are two ultimate reasons for which the cortex uses distributed representations that constrain it to slow learning. The first is that, being the residence of long-term memory, it must be able to store great quantities of information, which implies a high degree of compactness that can only be reached using distributed representations, as explained in Section 3.2. The second reason is the requirement of very quick responses to the stimuli (on which life or death can depend) that must nevertheless be "optimal" in the sense of being well generalized from past experiences. This can be obtained only if the influences of past memories required to respond to new stimuli are already calculated during a previous learning phase and made explicit as distributed representations, as explained before.
9 Bibliography
[1] G. An, The effects of adding noise during backpropagation training on generalization performance, Neural Computation 8 (1996), 643-674.

[2] C.G. Atkeson, A.W. Moore, and S. Schaal, Locally weighted learning, Artificial Intelligence Review (in press).

[3] A. Blum and R.L. Rivest, Training a 3-node neural network is NP-complete, in: Proc. of the Workshop on Computational Learning Theory, Morgan Kaufmann Publishers, San Mateo, CA, 1988, pp. 9-18.

[4] R.M. French, Dynamically constraining connectionist networks to produce distributed, orthogonal representations to reduce catastrophic interference, in: Proc. of the 16th Annual Conf. of the Cognitive Science Society, Erlbaum, Hillsdale, NJ, 1994, pp. 335-340.

[5] P.A. Hetherington and M.S. Seidenberg, Is there catastrophic interference in connectionist networks?, in: Proc. of the 11th Annual Conf. of the Cognitive Science Society, Erlbaum, Hillsdale, NJ, 1993, pp. 26-33.

[6] S. Judd, Neural Network Design and the Complexity of Learning, MIT Press, Cambridge, 1990.

[7] Y. Le Cun, J.S. Denker, and S.A. Solla, Optimal Brain Damage, in: Advances in Neural Information Processing Systems 2, Morgan Kaufmann Publishers, San Mateo, CA, 1990.

[8] J. McClelland, B. McNaughton, and R. O'Reilly, Why there are complementary learning systems in the hippocampus and the neocortex: Insights from the successes and failures of connectionist models of learning and memory, Psychological Review 102 (1995), 419-457.

[9] M. McCloskey and N.J. Cohen, Catastrophic interference in connectionist networks: The sequential learning problem, in: The Psychology of Learning and Motivation, G.H. Bower, ed., Academic Press, New York, 1989.

[10] K. McRae and P.A. Hetherington, Catastrophic interference is eliminated in pretrained networks, in: Proc. of the 15th Annual Meeting of the Cognitive Science Society, Erlbaum, Hillsdale, NJ, 1993, pp. 723-728.

[11] A.F. Murray and P.J. Edwards, Synaptic weight noise during multilayer perceptron training: Fault tolerance and training improvements, IEEE Transactions on Neural Networks 4(4) (1993), 722-725.

[12] A.F. Murray and P.J. Edwards, Enhanced multilayer perceptron performance and fault tolerance resulting from synaptic weight noise during training, IEEE Transactions on Neural Networks 5 (1994), 792-802.

[13] D.C. Park, M.A. El-Sharkawi, and R.J. Marks II, An adaptively trained neural network, IEEE Transactions on Neural Networks 2(3) (1991), 334-345.

[14] R. Ratcliff, Connectionist models of recognition memory: constraints imposed by learning and forgetting functions, Psychological Review 97(2) (1990), 235-308.

[15] A. Robins, Catastrophic forgetting, rehearsal and pseudorehearsal, Connection Science 7(2) (1995), 123-146.

[16] A. Robins, Consolidation in neural networks and in the sleeping brain, Connection Science 8(2) (1996), 259-275.

[17] V. Ruiz de Angulo, Interferencia catastrófica en redes neuronales: soluciones y relación con otros problemas del conexionismo, Ph.D. Thesis, Universidad del País Vasco, 1996.

[18] V. Ruiz de Angulo and C. Torras, The MDL algorithm, in: Proc. of the Int. Workshop on Artificial Neural Networks (IWANN'91), A. Prieto, ed., Lecture Notes in Computer Science, Vol. 540, Springer-Verlag, 1991, pp. 162-172.

[19] V. Ruiz de Angulo and C. Torras, Random weights and regularization, in: Proc. of the Int. Conf. on Artificial Neural Networks (ICANN'94), M. Marinaro and P. Morasso, eds., Springer-Verlag, 1994, pp. 1456-1459.

[20] V. Ruiz de Angulo and C. Torras, On-line learning with minimal degradation in feedforward networks, IEEE Transactions on Neural Networks 6(3) (1995), 657-668.

[21] V. Ruiz de Angulo and C. Torras, Learning of nonstationary processes, in: Optimization Techniques, C.T. Leondes, ed., Neural Network Systems Techniques and Applications series, Vol. 2, Academic Press, 1998, pp. 175-207.

[22] V. Ruiz de Angulo and C. Torras, Averaging over networks: properties, evaluation and minimization, Tech. Report IRI-DT-9811, Institut de Robòtica i Informàtica Industrial, Barcelona, Spain, 1998.

[23] A.S. Weigend, D.E. Rumelhart, and B.A. Huberman, Generalization by weight-elimination with application to forecasting, in: Advances in Neural Information Processing Systems 3, Morgan Kaufmann Publishers, San Mateo, CA, 1991, pp. 875-882.

[24] http://www.cs.utoronto.ca/delve/data/Pumadyn/desc.html, detailed documentation file.

[25] http://www.cs.utoronto.ca/delve/
Figure captions
Figure 1. Imaginary plane on which the different approaches to deal with catastrophic interference can be placed.
Figure 2. Average error increments produced by the application of LMD to eleven networks trained with our interference prevention method. The networks resulted from minimizing (9) for $\sigma^2/2$ between 0 and 0.05, as represented in the axis of abscissas.
Figure 3. Average $\|\Delta W\|$ produced by the application of LMD to the networks in Figure 2.
Figure 4. Same experiments as in Figure 2, but using networks with 40 hidden units and, as training set, the 400 patterns of the high-noise Pumadyn dataset. Damage is measured as the average error increment for the old patterns, whilst the generalization error is evaluated over 2000 untrained patterns.
Figure 5. As in Figure 4, but using the 200 training patterns of the
high-noise Pumadyn dataset.
Figure 6. As in Figure 4, but using the 400 training patterns of the
medium-noise Pumadyn dataset.
Figure 7. As in Figure 4, but using the 200 training patterns of the
medium-noise Pumadyn dataset.
[Figure 1: imaginary plane spanned by the representation axis (distributed representation to local representation) and the retraining axis (isolated training to joint training). Placed on it: combined training or retraining (Hetherington and Seidenberg), temporal local learning (McClelland et al.), rehearsal (Robins), pseudorehearsal (Robins) and special coding of the patterns (French); the hard core of the problem lies at the origin.]

[Figure 2: average error increment (0 to 0.012) vs. σ²/2 (0 to 0.06), for the coarse and standard versions of LMD.]

[Figure 3: average ‖ΔW‖ (0 to 0.03) vs. σ²/2 (0 to 0.06), for the coarse and standard versions of LMD.]

[Figure 4: damage (left axis, 0 to 0.0040) and generalization error (right axis, 0.06 to 0.18) vs. σ²/2 (0 to 0.020).]

[Figure 5: damage (left axis, 0 to 0.0040) and generalization error (right axis, 0.06 to 0.18) vs. σ²/2 (0 to 0.020).]

[Figure 6: damage (left axis, 0 to 0.0005) and generalization error (right axis, 0.008 to 0.020) vs. σ²/2 (0 to 0.020).]

[Figure 7: damage (left axis, 0 to 0.0005) and generalization error (right axis, 0.008 to 0.020) vs. σ²/2 (0 to 0.020).]