Neural Networks
journal homepage: www.elsevier.com/locate/neunet
Learning in fully recurrent neural networks by approaching tangent planes to constraint surfaces
P. May a, E. Zhou b,*, C.W. Lee b
a K College, Brook Street, Tonbridge, Kent, TN9 2PW, UK
b Faculty of Advanced Engineering and Sciences, University of Bolton, Bolton, BL3 5AB, UK
Abstract
In this paper we present a new variant of the online real time recurrent learning algorithm proposed by Williams and Zipser (1989). Whilst the original algorithm utilises gradient information to guide the search towards the minimum training error, it is very slow in most applications and often gets stuck in local minima of the search space. It is also sensitive to the choice of learning rate and requires careful tuning. The new variant adjusts weights by moving to the tangent planes to constraint surfaces. It is simple to implement and requires no parameters to be set manually. Experimental results show that this new algorithm gives significantly faster convergence whilst avoiding problems like local minima.
© 2012 Elsevier Ltd. All rights reserved.
Article history: Received 11 February 2011; Received in revised form 28 June 2012; Accepted 29 June 2012.
Keywords: Real time recurrent learning; Accelerated; Local minimum; Speed; Temporal pattern recognition; Henon map; Non-linear process plant
1. Introduction

Fully recurrent neural networks (FRNN) are powerful computational models that can learn temporal sequences, either online or in batch mode. A diagram of an FRNN is shown in Fig. 1. The FRNN consists of two layers, an input layer of linear units and an output layer of non-linear units. The input layer is fully connected to the output layer by adjustable weights. Furthermore, the FRNN features unit delay feedback connections that feed back the activations of the output units to the input layer units. The output units thus have some knowledge of their prior activations, which enables them to perform learning that extends over time. FRNNs accomplish their learning task by mapping input sequences, and delayed activations, to another set of output sequences. Due to the feedback around the output units, these units may continue to cycle information through the network over multiple time steps, and thereby discover abstract representations over time.

One of the most popular algorithms for training FRNNs is the real time recurrent learning (RTRL) algorithm (Williams & Zipser, 1989). RTRL is a gradient descent based algorithm for adjusting the output layer weights in a fully recurrent neural network. In the seminal paper by Williams and Zipser (1989), two variations are presented, one for online and one for off-line (batch) learning.
* Corresponding author. Tel.: +44 0 1204903465. E-mail address: [email protected] (E. Zhou).
In both forms, RTRL has been used to train applications in a variety of areas such as speech enhancement (Ganchev, Tasoulis, Vrahatis, & Fakotakis, 2007; Parveen & Green, 2004) and real-time process control (Juang, Huang, & Duh, 2006; Peng & Lin, 2007), where the output of the system is the response to current and previous input. RTRL has also been used to train FRNNs for next symbol prediction in an English text processing application (Perez-Ortez, Calera-Rubio, & Forcada, 2001). Delgado, Pegalajar, and Cuellar (2006) have used RTRL to train FRNNs for applications in financial time series prediction. Finally, RTRL has been used to train FRNNs capable of removing artefacts in EEG signals (Suresh & Puttamadappa, 2008).

There are many variants of the RTRL algorithm suggested in the literature aimed at improving different aspects of the algorithm, such as its computational complexity, convergence properties, and its sensitivity to the choice of initial weights. In Vartak, Georgiopoulos, and Anagnostopoulos (2005), a Jacobian based variant of RTRL is presented for online training of FRNNs, giving faster convergence to a better quality of solution than the original algorithm. In Leclerq, Druaux, and Lefebre (2003), a new autonomous learning algorithm is developed for fully connected RTRL neural networks independent of any initial conditions. Starting from almost zero initial conditions, the method adapts the learning rate and time parameter in order to track dynamical behaviours. The block update method (Schmidhuber, 1992; Williams & Peng, 1992) updates the gradient only once every M steps. This algorithm combines characteristics of RTRL and BPTT and is suitable for very long sequences. The Green's function
Fig. 1. An example of a fully recurrent neural network with one output unit, one hidden unit, and two input units. The function $z^{-1}$ is the unit delay operator, whose output is delayed with respect to the input by one time step.
method (Atiya & Parlos, 2000) is a variation of the RTRL algorithm which is designed to reduce the computational complexity by efficient ordering of operations. Some approaches try to avoid the problem of vanishing gradients, e.g. neuron saturation, by designing the architecture appropriately, e.g. restricting recurrence to linear neurons (Tino & Mills, 2006). However, there is still no widely accepted training model in the gradient descent based approach.

In this paper we present a new RTRL variant based upon the tangent plane algorithm (Lee, 1993, 1997). Whilst the original algorithm utilises gradient information to guide the search towards the minimum training error, it is very slow in most applications and often gets stuck in local minima of the search space. The original algorithm is also sensitive to the choice of the learning rate and requires careful tuning. The new algorithm adapts weights by approaching tangent planes to constraint surfaces. It does not require any parameters to be set manually, it is fast and simple to implement, and it avoids problems such as local minima.

The rest of the paper is organised as follows. In Section 2, a detailed derivation of the algorithm is presented; Section 3 considers different neural network architectures; and the experimental results are given in Section 4.

2. Description of the algorithm

Lee (1993, 1997) proposed an algorithm for supervised learning in multilayered feedforward neural networks which gives significantly faster convergence than the gradient descent backpropagation algorithm. This tangent plane algorithm treats each teaching value as a constraint which defines a surface in the weight space of the network. The weights are then adjusted by moving from the present position in weight space to the foot of the perpendicular to the tangent plane to this surface, taken at a convenient point. The principal strength of the tangent plane algorithm is that it does not require a learning rate parameter to be set; instead it calculates the correct step size each time the weights are changed.

Consider a FRNN of units $\{u_j\}$ (see Fig. 1). For unit $u_j$, $w_j^T = [w_{j1}, w_{j2}, \ldots, w_{j,(n+m+1)}]$ denotes a $(n+m+1) \times 1$ vector of weights, where $n$ is the number of feedback connections and $m$ the number of external inputs, the one remaining input being the bias input. Let $\sigma_j$ and $\phi_j$ denote the net input and output of $u_j$,
Fig. 2. Movement from the present position $a^{(t)}$ to the foot of the perpendicular to the tangent plane of the surface $\sigma_1^{(t)} = f^{-1}(y_1^{(t)})$ at position $b^{(t)}$. In the figure, $i_{1,n+1}$ denotes a unit vector in the direction defined by the $w_{1,n+1}$ axis.
and $f$ the unit's activation function, typically $\tanh(x)$. The following equations describe the FRNN at time instant $t$:

$$\sigma_j^{(t)} = \sum_{l=1}^{n+m+1} w_{jl}^{(t)} z_l^{(t)}, \qquad \phi_j^{(t)} = f(\sigma_j^{(t)}), \qquad j = 1, 2, \ldots, n \qquad (1)$$

$$[z^{(t)}]^T = [\phi_1^{(t-1)}, \ldots, \phi_n^{(t-1)}, +1, x_1^{(t)}, \ldots, x_m^{(t)}]. \qquad (2)$$
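To make the notation concrete, the following is a minimal NumPy sketch of one forward step of Eqs. (1) and (2). It is our illustration rather than the authors' code; the names frnn_step, W, phi_prev and x are ours.

```python
import numpy as np

def frnn_step(W, phi_prev, x, f=np.tanh):
    """One forward step of the FRNN of Eqs. (1)-(2).

    W        : (n, n+m+1) weight matrix, one row per unit u_j
    phi_prev : (n,) activations phi^(t-1) fed back through the unit delays
    x        : (m,) external inputs x^(t)
    Returns (sigma, phi, z): net inputs, new activations, and the vector z^(t).
    """
    # z^(t) = [phi_1^(t-1), ..., phi_n^(t-1), +1, x_1^(t), ..., x_m^(t)]
    z = np.concatenate([phi_prev, [1.0], x])
    sigma = W @ z       # sigma_j^(t) = sum_l w_jl^(t) z_l^(t)
    phi = f(sigma)      # phi_j^(t) = f(sigma_j^(t))
    return sigma, phi, z
```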
The method assumes a FRNN with a single output unit. Let this output unit be denoted by $u_1$, with $\phi_1^{(t)}$ at time step $t$ being trained to mimic the teaching value $y_1^{(t)}$. For a given set of inputs $x_i$, $\phi_1 : \mathbb{R}^{n(n+m+1)} \to \mathbb{R}$. Thus the equation

$$\sigma_1^{(t)} = f^{-1}(y_1^{(t)}) \qquad (3)$$

defines an $n(n+m+1) - 1$ dimensional surface in $\mathbb{R}^{n(n+m+1)}$. The aim of the training procedure is to move from the current position $a^{(t)} \in \mathbb{R}^{n(n+m+1)}$ in weight space to the nearest point on a tangent plane of this surface (see Fig. 2).

Let $a^{(t)} = \sum_{j,i} w_{ji}^{(t)} i_{ji}$ be the current vector of the weights, where $i_{ji}$ is a unit vector in the direction of the $w_{ji}$ axis.
Use

$$w_{1,n+1}^{\prime(t)} = f^{-1}(y_1^{(t)}) - \sum_{i \neq n+1} w_{1i}^{(t)} z_i^{(t)}$$

to find a point $b^{(t)}$ on the surface that differs from $a^{(t)}$ only in the bias weight $w_{1,n+1}$.

Let $\hat{n}^{(t)}$ be the unit normal to the surface $\sigma_1^{(t)} = f^{-1}(y_1^{(t)})$ at $b^{(t)}$. The length of the perpendicular from $a^{(t)}$ to the tangent plane is $[b^{(t)} - a^{(t)}] \cdot \hat{n}^{(t)}$. Now, using $f^{-1}(y_1^{(t)}) = w_{1,n+1}^{\prime(t)} + \sum_{i \neq n+1} w_{1i}^{(t)} z_i^{(t)}$ and $f^{-1}(\phi_1^{(t)}) = w_{1,n+1}^{(t)} + \sum_{i \neq n+1} w_{1i}^{(t)} z_i^{(t)}$, and from the definition of $b^{(t)}$,

$$b^{(t)} - a^{(t)} = [w_{1,n+1}^{\prime(t)} - w_{1,n+1}^{(t)}]\, i_{1,n+1} = [f^{-1}(y_1^{(t)}) - f^{-1}(\phi_1^{(t)})]\, i_{1,n+1}. \qquad (4)$$
As $\hat{n}^{(t)}$ is the unit normal to the surface $\sigma_1^{(t)} = f^{-1}(y_1^{(t)})$, it equals $\nabla \sigma_1^{(t)} / \|\nabla \sigma_1^{(t)}\|$. Therefore, using Eq. (4),

$$[b^{(t)} - a^{(t)}] \cdot \hat{n}^{(t)} = [f^{-1}(y_1^{(t)}) - f^{-1}(\phi_1^{(t)})]\, \frac{\partial \sigma_1^{(t)} / \partial w_{1,n+1}}{\|\nabla \sigma_1^{(t)}\|} \qquad (5)$$

and the weight adjustment that moves $a^{(t)}$ to the foot of the perpendicular to the tangent plane is

$$\Delta w_{pq}^{(t)} = [f^{-1}(y_1^{(t)}) - f^{-1}(\phi_1^{(t)})]\, \frac{\partial \sigma_1^{(t)} / \partial w_{1,n+1}}{\|\nabla \sigma_1^{(t)}\|^2}\, \frac{\partial \sigma_1^{(t)}}{\partial w_{pq}}. \qquad (6)$$

An alternative method of representing the derivative in Eq. (6) is to note that if $\varepsilon_1^{(t)} = \frac{1}{2}[e_1^{(t)}]^2$, where $e_1^{(t)} = f^{-1}(y_1^{(t)}) - f^{-1}(\phi_1^{(t)})$, then the term $e_1^{(t)}\, \partial \sigma_1^{(t)} / \partial w_{pq}$ is simply $-\partial \varepsilon_1^{(t)} / \partial w_{pq}$. Thus the TPA-RTRL algorithm is similar to the error minimisation procedure $\min(\varepsilon_1) = \min(\frac{1}{2}[f^{-1}(y_1) - f^{-1}(\phi_1)]^2)$, with the weights adjusted along the negative of $\nabla \varepsilon_1^{(t)}$. The TPA-RTRL algorithm differs from the original GD-RTRL algorithm in two respects. First, the objective function is expressed as $\frac{1}{2}(f^{-1}(y_1) - f^{-1}(\phi_1))^2$. Output errors calculated based on $f^{-1}(x) = \tanh^{-1}(x)$ could potentially lead to some very big weight updates that might represent a source of instability in the network. Second, $\lambda_1^{(t)} = (\partial \sigma_1^{(t)} / \partial w_{1,n+1}) / \|\nabla \sigma_1^{(t)}\|^2$ is a global learning rate for the output unit $u_1$ after target value $y_1^{(t)}$ has been applied at time $t$.
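Read this way, Eq. (6) is a step of length $\lambda_1^{(t)} e_1^{(t)}$ along the gradient of the net input. The sketch below is ours, not the authors'; it assumes the gradient of $\sigma_1$ with respect to every weight has already been computed (its recursive computation is derived next), and the names are illustrative.

```python
import numpy as np

def tpa_update(W, grad_sigma1, y1, phi1, bias_idx, f_inv=np.arctanh):
    """Tangent plane update of Eq. (6) for the single output unit u_1.

    W           : (n, n+m+1) weight matrix, updated in place
    grad_sigma1 : d(sigma_1)/d(w_pq) for every weight, same shape as W
    y1, phi1    : teaching value and current output of u_1
    bias_idx    : (row, col) index of the bias weight w_{1,n+1}
    """
    # e_1 = f^-1(y_1) - f^-1(phi_1); arctanh blows up as |y1| -> 1,
    # the source of instability noted in the text
    e1 = f_inv(y1) - f_inv(phi1)
    # global learning rate lambda_1 = (d sigma_1 / d w_{1,n+1}) / ||grad sigma_1||^2
    lam = grad_sigma1[bias_idx] / np.sum(grad_sigma1 ** 2)
    W += e1 * lam * grad_sigma1     # Delta w_pq of Eq. (6)
    return W
```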
Using the chain rule of differentiation, the derivative $\partial \sigma_1^{(t)} / \partial w_{pq}$ in Eq. (6) can be re-written as

$$\frac{\partial \sigma_1^{(t)}}{\partial w_{pq}} = \frac{\partial}{\partial w_{pq}} \sum_{j=1}^{n+m+1} w_{1j}^{(t)} z_j^{(t)} = \sum_{j=1}^{n} f'(\sigma_j^{(t-1)})\, \frac{\partial \sigma_j^{(t-1)}}{\partial w_{pq}}\, w_{1j}^{(t)} + \delta_{p1} z_q^{(t)} \qquad (7)$$

where

$$\delta_{p1} = \begin{cases} 1, & \text{if } p = 1 \\ 0, & \text{otherwise.} \end{cases} \qquad (8)$$

Under the assumption, also used in the RTRL algorithm (Williams & Zipser, 1989), that the step size is sufficiently small, we have

$$\frac{\partial \phi_j^{(t-1)}}{\partial w_{pq}^{(t)}} \approx \frac{\partial \phi_j^{(t-1)}}{\partial w_{pq}^{(t-1)}}. \qquad (9)$$

A triple index set of variables $\pi_{pq}^{j}$ can be used to denote the partial derivatives $\partial \sigma_j / \partial w_{pq}$, $1 \le j, p \le n$, $1 \le q \le n+m+1$. We compute the values $\pi_{pq}^{j}$ for every time step $t$ and appropriate $j$, $p$, $q$ as follows:

$$[\pi_{pq}^{j}]^{(t)} = \sum_{m=1}^{n} f'(\sigma_m^{(t-1)})\, [\pi_{pq}^{m}]^{(t-1)}\, w_{jm}^{(t)} + \delta_{pj} z_q^{(t)}. \qquad (10)$$

Because we assume that the initial state of the FRNN has no functional dependence on the weights, we also have $[\pi_{pq}^{j}]^{(0)} = 0$. Before starting the training, all the $[\pi_{pq}^{j}]^{(t)}$ are initialised to zero. At each cycle, $\Delta w_{ji}^{(t)}$ is calculated and applied to the weights $w_{ji}^{(t)}$, and the $[\pi_{pq}^{j}]^{(t)}$ are updated according to the recursive relationship (10).

The TPA-RTRL algorithm is non-local: it requires information about the whole of the network to be gathered in calculating each weight adjustment. This involves calculating the squared norm of the $n(n+m+1) \times 1$ vector of partial derivatives $[\pi_{pq}^{1}]^{(t)}$ and represents a small computational overhead. The approach of the TPA-RTRL algorithm is reminiscent of the Newton-Raphson method of first degree (Stoer, 1976) used to find the zero points of functions that depend on one value. An important difference is that it provides a whole $n(n+m+1) - 1$ dimensional plane of suitable points to move towards. Any method that does a zero-point search of a function cannot get trapped in a local minimum unless it hits one by accident. The TPA-RTRL algorithm uses the vector $[\pi_{pq}^{1}]^{(t)}$ to do a linear extrapolation of the surface $\sigma_1 = f^{-1}(y_1)$ in order to gain a new weight vector that is hoped to be on, or at least close to, this surface. However, $\sigma_1 = f^{-1}(y_1)$ is a non-trivial function of the weights, the recursive feedback activations $\phi_i^{(t-1)}$, $i = 1, \ldots, n$, themselves being a function of the weights. Thus the basic approximation, that the surface can be locally approximated by a tangential hyper-plane, may only hold in regions of the weight space close to the surface. This observation suggests it might be helpful to restrict the dimension of the weight space by fixing some of the $w_{pq}$ when the error starts to increase. This will have a smoothing effect on the constraint surface $\sigma_1 = f^{-1}(y_1)$, making a tangent plane approximation to this surface viable. In order to fix a specified $w_{pq}$, we need to set the constraint $\pi_{pq}^{j} = 0$ for some selected $j$. Further, varying the selection of the $w_{pq}$ will help avoid the problem of constraining the weight change to a limited number of directions, which would prevent the network from attaining a low MSE.

The online version of the GD-RTRL algorithm introduces additional complexities, most notably that the online version can move very far from the desired trajectory and never return. The gradient computed at each step is a local one that is dependent on the current values of the weights. The problem with the RTRL algorithm is that, while learning from a large amount of data, the weight change between the start of the learning and the end of the learning is not likely to be small. This means that the algorithm is not a correct gradient following technique, not even along the gradient of the local error. The situation with the TPA-RTRL algorithm is likely to be worse, as the method of approaching tangent planes could potentially lead to some very big updates. In view of the large step sizes which the TPA-RTRL algorithm can involve, it seems unwise to accumulate the steps if it is found to be unnecessary. Catfolis (1993) has suggested an improvement to the basic GD-RTRL algorithm that involves reinitialising the gradient matrix so that the weight changes depend only on events that have happened a few time steps ago. This technique could easily be incorporated into the TPA-RTRL algorithm with little diminution of the convergence speed.
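As an illustration of the recursion in Eq. (10), a possible NumPy implementation is sketched below. The tensor name pi and the function name are ours, and tanh is assumed as the activation, so that $f'(\sigma) = 1 - \tanh^2(\sigma)$.

```python
import numpy as np

def update_sensitivities(pi, W, sigma_prev, z):
    """One step of the recursion of Eq. (10): pi^(t) from pi^(t-1).

    pi         : (n, n, n+m+1) tensor, pi[j, p, q] = d(sigma_j)/d(w_pq),
                 initialised to zero as stated in the text
    W          : (n, n+m+1) current weights
    sigma_prev : (n,) net inputs sigma^(t-1)
    z          : (n+m+1,) input vector z^(t)
    """
    n = W.shape[0]
    fp = 1.0 - np.tanh(sigma_prev) ** 2          # f'(sigma_m^(t-1))
    # sum over m of f'(sigma_m^(t-1)) pi[m, p, q] w_jm^(t), for all j, p, q
    recurrent = np.einsum('jm,mpq->jpq', W[:, :n] * fp, pi)
    # Kronecker delta term: delta_pj z_q^(t)
    delta = np.zeros_like(pi)
    for p in range(n):
        delta[p, p, :] = z
    return recurrent + delta
```

The slice pi[0] (unit $u_1$) supplies the gradient grad_sigma1 needed by the tangent plane update sketched earlier.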
3. Different neural network architectures

A FRNN can be unfolded in time to obtain a feedforward network that grows by one layer at each time step. The backpropagation algorithm can then be used to train the unfolded network. Unfolding the network so that only h layers, corresponding to the most recent time steps, are retained leads to the online back-propagation-through-time (BPTT) algorithm (Williams & Peng, 1990). The TPA-RTRL algorithm applied to the BPTT paradigm leads to $h$ constraint surfaces, $\sigma_j^{(t)} = f^{-1}(y_j^{(t)})$, $n - h < t \le n$, one for each unit $u_j$ for which a desired response is required. The individual weight changes $\Delta w_{ji}^{(t)}$ at each step in the interval $[n-h+1, n]$ are accumulated and applied to the weights $w_{ji}^{(n)}$ at step $n$. The only difficulty is the possibility that $\Delta w_{ji}^{(t)}$ is large enough to result in significant deviations from the gradient descent trajectory. One way to avoid this problem would be to keep the trajectories $[n_0, n_1]$ relatively short.

Simple recurrent networks (SRNs) are a category of recurrent neural networks that have far simpler recurrent connections compared with the FRNN. This category consists of networks that have only one-to-one recurrent connections. This means that a particular neuron output is fed back to the input of only one neuron, possibly itself. Only the recurrent connections have a time delay; the forward connections are instantaneous. The state space model of the FRNN can be straightforwardly modified for SRNs by decomposing the function realised by the model into layers. The TPA-RTRL algorithm applied to SRNs may need to be further modified for backpropagation learning. For the special case of the Elman network (Elman, 1990) and the Jordan network (Hertz, Krogh, & Palmer, 1991), which both have fixed recurrent connections, the TPA-RTRL algorithm is simply the tangent plane algorithm.

Partially recurrent neural networks (PRNs) have separate output and state units. To obtain such a network, the feedback connections from the output units are removed. Robinson (1994) has used a partially recurrent neural network for speech recognition. The reason for not using a FRNN was the reduction in the number of weights and the argument that the output and state do not have to coincide (as is the case with the FRNN). To obtain the state space model for a PRN, a first approach would be to take the state space model of the FRNN and simply set the weights corresponding to the removed connections to zero. The tangent plane method applied to a PRN is then simply a special case of the TPA-RTRL algorithm. Fixing recurrent connections will improve the computational complexity by saving $(n+1)$ gradient computations.

4. Computer simulations

Comparative tests were performed with the TPA-RTRL algorithm and the original GD-RTRL algorithm using different network sizes and initial conditions. In order to assess the performance of the TPA-RTRL algorithm, which uses an adaptive learning rate, an approximate parabolic interpolation technique was used to accelerate learning in the GD-RTRL algorithm.

4.1. Simulation problems

Four different datasets were used: pipelined Xor, simple sequence recognition (Williams & Zipser, 1989), non-linear dynamic plant (Narendra & Parthasarathy, 1990), and the Henon map time series (Mak, Ku, & Lu, 1999). The pipelined Xor problem was used to analyse the convergence behaviour of each algorithm on a simple classification task.
The simple sequence recognition problem, the non-linear dynamic plant and the Henon map were used to determine the ability of the network to configure itself so that it stores information from an earlier time step in a temporal sequence in order to determine a value at a later time.

The Exclusive OR (Xor) problem is an example of a pattern classification task that cannot be solved using a single neuron. The input patterns are (1, 1), (1, -1), (-1, -1) and (-1, 1). The first and third patterns are in class 1, and the second and fourth patterns in class -1. The training examples were presented to the network in a random order, one each time step. Thus the epoch length is four cycles. The network was operated in a continuous mode, meaning that all the epochs were presented to the network one after another without telling the network that something had happened. Tests were performed by varying the time delay between the presentation of an input pattern and training the network to match the corresponding teaching value. For each test, 50 trials were made, with the mean number of steps to converge, the standard deviation, and the number of successful trials recorded. Network training was terminated when the error was reduced to below $10^{-3}$ or 4000 time steps had elapsed.

The simple sequence recognition problem has four inputs and one output. Two of the input lines, labelled a and b, serve a special purpose. The other two input lines, c and d, serve as distractors. At each time step one input line carries a 1, and all other input lines a -1. The network is trained to output a 1 when a 1 on the b line is immediately followed by a 1 on the a line; otherwise the output is -1. Once such a b occurs, its corresponding a is considered to be used up. An additional constraint was imposed that two 1s should be output every 16 time steps. Thus the epoch length is 16 cycles. Tests were carried out using different sized networks, and also using data that had been partially corrupted, to determine the robustness of the TPA-RTRL algorithm. For each test, 50 trials were made, with the mean number of steps to converge, standard deviation and number of successful trials recorded. Network training was terminated when the error was reduced to below $10^{-3}$ or 16,000 time steps had elapsed.

The non-linear dynamic plant problem is a high order non-linear system introduced in Narendra and Parthasarathy (1990). It is modelled by the following discrete time equation:

$$y^{(t)} = \frac{y^{(t-1)}\, y^{(t-2)}\, y^{(t-3)}\, u^{(t-1)}\, [y^{(t-3)} - 1] + u^{(t)}}{1 + [y^{(t-2)}]^2 + [y^{(t-3)}]^2} \qquad (11)$$
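For concreteness, a possible generator for training data from the plant of Eq. (11), using the uniformly distributed random input signal described in the following paragraph, might look as follows (function and parameter names are ours):

```python
import numpy as np

def generate_plant_data(T, seed=0):
    """Simulate the non-linear dynamic plant of Eq. (11) for T steps.

    Returns the input signal u, uniform on [-1, 1], and the plant output y.
    """
    rng = np.random.default_rng(seed)
    u = rng.uniform(-1.0, 1.0, size=T)
    y = np.zeros(T)
    for t in range(3, T):
        num = y[t-1] * y[t-2] * y[t-3] * u[t-1] * (y[t-3] - 1.0) + u[t]
        den = 1.0 + y[t-2] ** 2 + y[t-3] ** 2
        y[t] = num / den
    return u, y
```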
where $y^{(t)}$ is the model output at time $t$. A single layer FRNN with only one output unit and two input units was trained. The training data was generated using a random input signal uniformly distributed over the interval $[-1, 1]$. Tests were carried out using different sized networks. For each test, 5 trials were made, with the mean square error (MSE) and CPU time of the learning algorithms recorded. The mean square error was measured by averaging the output square errors over 100 time steps. Thus the epoch length is 100 time steps. Network training was terminated after 500 epochs were presented to the network.

The Henon map problem is a chaotic time-series prediction problem. The time series is computed by

$$x^{(t+1)} = 1 - c\,(x^{(t)})^2 + b\, x^{(t-1)} \qquad (12)$$

where $b = 0.3$, $c = 1.4$, and $x^{(1)} = x^{(0)} = 0.6313354$. The objective of the simulation is to train a network with one input and one output and various processing units to model the chaotic series generated by Eq. (12). Since $x_{max} = 1.272967$ and $x_{min} = -1.284657$, the input values were scaled to the range $[-1, 1]$. Tests were carried out using different sized networks. For each test, 5 trials were made, with the mean square error and CPU time of the learning algorithms recorded. The mean square error (MSE) was measured by averaging the output square errors between 4,999,000 and 5,000,000 time steps.
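A sketch of a generator for the series of Eq. (12), using the constants and scaling extrema quoted above, is given below (names are ours):

```python
import numpy as np

def generate_henon(T, b=0.3, c=1.4, x0=0.6313354):
    """Generate the Henon map time series of Eq. (12), scaled to [-1, 1]."""
    x = np.empty(T)
    x[0] = x[1] = x0                  # x^(1) = x^(0) = 0.6313354
    for t in range(1, T - 1):
        x[t + 1] = 1.0 - c * x[t] ** 2 + b * x[t - 1]
    # scale to [-1, 1] using the extrema quoted in the text
    x_max, x_min = 1.272967, -1.284657
    return 2.0 * (x - x_min) / (x_max - x_min) - 1.0
```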
Table 1
Convergence performance in the Exclusive Or (Xor) problem. Training terminated after 4000 cycles or MSE < 10^-3.

                 TPA-RTRL algorithm         GD-RTRL algorithm
Units            Mean    Std    Success     Mean    Std    Success

(a) One cycle delay
3                118     100    43          630     117    33
4                123     63     50          611     91     46
5                121     49     50          581     92     48

(b) Two cycle delay
3                -       -      0           -       -      0
4                465     266    21          833     84     15
5                313     189    44          763     121    18

Fig. 3. Typical convergence behaviour of the new TPA-RTRL algorithm on the Xor problem with one unit trained to match the teaching signal of the inputs one cycle ago.
Table 2
Weight matrix for Xor with one-cycle delay.

U    B       I              R       T
1    1.88    0.00   0.00    1.98    +
2    1.49    1.98   1.98    0.15    -
3    1.97    2.12   2.13    0.15    -

Note: The columns labelled U, B, I, R, T indicate respectively: unit number, the bias weight, input weights, recurrent weights, and the teaching status, where + indicates the presence of a teaching value and - no teaching value present.
4.2. Network initialization

The approximate parabolic line minimization technique used with the GD-RTRL algorithm can result in some very big weight updates. For the Xor and simple sequence problems, it was found necessary to place an upper bound of 10.0 on the learning rate. For the non-linear plant problem, an upper bound of 0.1 was used. Preliminary tests showed that the best results with the Henon map problem were obtained with the learning rate set to 0.01. Both algorithms require the weights of the network to be set. In all the tests carried out, the input weights were initially set to random values in the range $[-1.0, 1.0]$.

4.3. Discussion of results

Exclusive Or (Xor) with time delay. The first test is a simple non-linearly separable problem requiring at least two processing cycles to complete. The test was carried out using a single layer FRNN network with three, four and five processing units. One of the processing units was trained to match, at time $t$, the teaching signal corresponding to the inputs presented to the network earlier, where the computational time delay was chosen to be one or two cycles (time steps). The results are tabulated in Table 1. It was found that the new TPA-RTRL algorithm gave significantly faster convergence than the original algorithm. It was also found that increasing the time delay between the inputs and corresponding target values had a deleterious effect on convergence, and that additional units had to be added to the network for convergence to occur. Generally speaking, for delays longer than one cycle the network has to configure itself to have more than two hidden layers in order to match the required delay. Finally, the results present a compelling case for the TPA-RTRL algorithm as a global minimiser. Support for believing this was found empirically: a network was said to have found a local minimum if one training cycle did not cause a change in the weights of more than 0.01.

The final weight values of a typical FRNN network trained by the TPA-RTRL algorithm are given in Table 2. Notice that the weights providing a feedforward solution have become large, whilst the recurrent weights have become small. This suggests that the FRNN network has organised itself into a single hidden layered
Fig. 4. Typical convergence behaviour of the original GD-RTRL algorithm on the Xor problem with one unit trained to match the teaching signal of the inputs one cycle ago.
feedforward neural network with outputs delayed by one cycle relative to the inputs. The function of the recurrent weights has become one of providing additional pathways during the learning stage.

Figs. 3 and 4 show some typical training curves for both algorithms on the Xor problem. The training examples were split into groups of four, with the network operated in continuous mode. Thus the epoch length is four cycles. The MSE was calculated over a strip length of 5 epochs. The training curves for the new TPA-RTRL algorithm show that convergence occurs rapidly, typically within 80 epochs. One of the curves (test 2) contains a slight dip, which is probably due to turbulence caused by large weight updates, averaged over by the coarse sampling rate in Fig. 3. The training curves for the original GD-RTRL algorithm show the same general trend, but this time convergence typically occurs within 250 epochs.

Simple sequence recognition. The second test is a simple sequence recognition task. The results are tabulated in Table 3. From the table we can see that the new TPA-RTRL algorithm converges to a solution faster than the GD-RTRL algorithm, and that the convergence speeds of both algorithms improve with the size of the network. The improvement in the performance of the TPA-RTRL algorithm in larger networks can be explained as follows. If there are N patterns to be learned and $n(n+m+1)$ free parameters in the network, the probability of a pair of normals being nearly parallel, or a set of normals in weight space being nearly linearly dependent, will be reduced if $n(n+m+1) \gg N$. Therefore a set of patterns should be learned more quickly by a network with
Fig. 5. Typical learning curves of the new TPA-RTRL algorithm on the simple sequence problem for a network with four processing units. Note that 1 epoch = 16 cycles.

Table 3
Convergence performance in the simple sequence problem for different sized networks. Training terminated after 16,000 cycles or MSE < 10^-3.

         TPA-RTRL algorithm         GD-RTRL algorithm
Units    Mean    Std    Success     Mean    Std    Success
2        47      53     48          126     21     50
4        21      11     50          80      12     50
6        18      8      50          63      7      50
8        18      13     50          55      7      50
Fig. 6. Typical learning curves of the original GD-RTRL algorithm on the simple sequence problem for a network with four processing units. Note that 1 epoch = 16 cycles.
Table 4
Convergence performance in the simple sequence problem in a FRNN with 4 units in the output layer and n (%) of teaching values randomised. Training terminated after 16,000 cycles or MSE < 10^-3.

         TPA-RTRL algorithm         GD-RTRL algorithm
n (%)    Mean    Std    Success     Mean    Std    Success
0        26      42     50          75      10     50
1        26      15     50          89      15     50
2        23      11     50          96      19     50
5        44      28     50          179     70     50

Fig. 7. Learning curves of the new TPA-RTRL algorithm for a network with four processing units. One teaching value has been corrupted at epochs 12, 24 and 36.
a greater number of connections. Figs. 5 and 6 show typical learning curves for both algorithms. The training examples were split into groups of 16; thus the epoch length is 16 cycles. The learning curves for the TPA-RTRL algorithm show that convergence occurs rapidly, typically within 15 epochs. One curve (test 3) gets trapped in a local minimum. Convergence was restored by increasing the time delay between the presentation of an input pattern and the response of the network. The learning curves for the GD-RTRL algorithm all get stuck on a flattish plateau before converging within 30 epochs.

A further test was carried out with the training data generated as normal, but at each presentation of an item of data the teaching value was given a 1%, 2% or 5% probability of being set to +1. The results are tabulated in Table 4. It was found that the TPA-RTRL algorithm was far more tolerant of low levels of noise than the GD-RTRL algorithm. However, the TPA-RTRL algorithm fared less well with the highest level of noise, which resulted in local instability in some trials. Figs. 7 and 8 show some typical learning curves for a different type of inaccurate data. Once again, the training examples were split into groups of 16, so the epoch length is 16 cycles. The training data was generated as normal, but this time a pattern selected at random had its teaching value set to +1
Fig. 8. Learning curves of the original GD-RTRL algorithm for a network with four processing units. One teaching value has been corrupted at epoch 12, 24 and 36.
during epochs 12, 24 and 36. The learning curves for the TPA-RTRL algorithm show a pronounced peak at 12 epochs that corresponds to the first item of corrupted data. Subsequent responses to noisy data at epochs 24 and 36 diminish in amplitude. In each case, the network recovers quickly, within four epochs of being presented with an item of corrupted data. The learning curves for the GD-RTRL algorithm show small peaks occurring at 12 and 36 epochs and one large peak at 24 epochs. The recovery time was similar to that of the TPA-RTRL algorithm, typically within four epochs.

Non-linear dynamic plant. The third test is a high order non-linear discrete time system. Table 5 shows the mean square error
Table 5
Convergence performance and final error in the non-linear dynamic plant problem. Note that 1 epoch = 100 cycles; MSE x 10^-2; CPU time in seconds.

            Units = 6         Units = 9         Units = 12
            MSE     CPU       MSE     CPU       MSE     CPU
TPA-RTRL    7.865   3         3.210   11        5.433   27
GD-RTRL     3.892   12        4.142   37        5.450   93
Fig. 9. Typical convergence behaviour of the new TPA-RTRL algorithm on the nonlinear dynamic plant problem.
Fig. 10. Typical convergence behaviour of the GD-RTRL algorithm on the non-linear dynamic plant problem.
(MSE) and CPU time in seconds of the learning algorithms for various numbers of processing units. The mean square error was calculated by averaging the error over 100 time steps. It was found that the new TPA-RTRL algorithm obtained an MSE comparable with the original GD-RTRL algorithm in the largest network. In the smallest network, the solution obtained by the original algorithm with weight backtracking was better. However, this improvement was paid for at the expense of the CPU time used, which was almost four times longer. The only difficulty with the TPA-RTRL algorithm was that the error curves were very turbulent, with intermittent increases in the error happening frequently. This was particularly evident in the smallest network. The convergence of the TPA-RTRL algorithm was improved by introducing a simple one stage reduction of the learning rate by 50% (6 units: 0.043; 9 units: 0.022; 12 units: 0.046). The best test error obtained by the TPA-RTRL algorithm is very similar to that given by Al-Ajlounti, Schilling, and Harris (2004). In Al-Ajlounti et al. (2004), a zeroth-order RBF network was used with no training. The RBF network achieved a mean square error of 0.015, using 1024 weights. Figs. 9 and 10 show the fit for both algorithms using a FRNN with 12 units after 500 epochs. Clearly the fit is not exact, but this is quite reasonable considering the type of input and the size of the network.

Henon map time series. The final test is a classical one-step-ahead prediction problem. Table 6 shows the mean square error and CPU time in seconds of the learning algorithms for various numbers of processing units. The mean square error was obtained by averaging the error over 1000 time steps. It was found that the TPA-RTRL algorithm gave faster convergence relative to the original algorithm. The slower convergence of the original algorithm can be explained by very slow asymptotic behaviour, and the tendency of the network to get trapped in local minima. In contrast, the much larger steps taken by the TPA-RTRL algorithm made this method of learning an effective global minimiser. However, the faster convergence of the new algorithm was paid for at the expense of the CPU time used. The training error is very similar to those given by Mak et al. (1999). The mean square error was, for RTRL (6 units: 0.008; 9 units: 0.0008; 12 units: 0.0006), and for MERTLR (6 units: 0.001; 9 units: 0.003; 12 units: 0.001). The poor performance of the MERTLR algorithm was because some of the weights were seldom
Fig. 11. Typical convergence behaviour of the new TPA-RTRL algorithm on the Henon map time series prediction problem.
Fig. 12. Typical convergence behaviour of the GD-RTRL algorithm on the Henon map time series prediction problem.
adapted, which prevented the algorithm from attaining a low mean square error. Figs. 11 and 12 show some typical convergence curves for both algorithms. The learning curves for the TPA-RTRL algorithm show good asymptotic behaviour, whilst the curves for the original algorithm appear to get trapped in a stuck state after a few epochs.
Table 6
Convergence performance and final error in the Henon map time series problem. Note that 1 epoch = 100 cycles; MSE x 10^-2; CPU time in seconds.

            Units = 6         Units = 9         Units = 12
            MSE     CPU       MSE     CPU       MSE     CPU
TPA-RTRL    0.021   440       0.007   862       0.004   2147
GD-RTRL     0.018   208       0.013   694       0.022   1804
5. Conclusions

A new variant of the RTRL algorithm is derived for online training of fully recurrent neural networks. This new TPA-RTRL algorithm speeds up the learning by approaching the tangent planes to constraint surfaces. The results show that the TPA-RTRL algorithm is very fast and avoids problems like local minima. However, this improvement is paid for by much longer run times, which can be twice as long as those of the original algorithm without weight backtracking. Clearly this may become significant when the network is scaled up and for larger problems. The results also show that the TPA-RTRL algorithm can handle inaccurate data and that recovery times are rapid after the network is presented with an item of erroneous data.

References
Al-Ajlounti, A. F., Schilling, R. J., & Harris, S. L. (2004). Identification of nonlinear discrete systems using raised cosine radial basis function networks. International Journal of Systems Science, 35(4), 211-221.
Atiya, A. F., & Parlos, A. G. (2000). New results on recurrent network training: unifying the algorithms and accelerating convergence. IEEE Transactions on Neural Networks, 11(3), 697-709.
Catfolis, T. (1993). A method for improving the real-time recurrent learning algorithm. Neural Networks, 6(6), 807-821.
Delgado, M., Pegalajar, M. C., & Cuellar, M. P. (2006). Evolutionary training for dynamical recurrent neural networks: an application in financial time series prediction. Mathware and Soft Computing, 13, 89-110.
Elman, J. (1990). Finding structure in time. Cognitive Science, 14, 179-211.
Ganchev, T. D., Tasoulis, D. K., Vrahatis, M. N., & Fakotakis, N. D. (2007). Generalised locally recurrent probabilistic neural networks with application to text independent speaker verification. Neurocomputing, 70(7-9), 1424-1438.
Hertz, J., Krogh, A., & Palmer, R. G. (1991). Introduction to the theory of neural computation. Addison-Wesley Publishing Company.
Juang, C., Huang, S., & Duh, F. (2006). Mold temperature control of a rubber injection molding machine by TSK-type recurrent neural fuzzy network. Neurocomputing, 70(1-3), 559-567.
Leclerq, E., Druaux, F., & Lefebre, D. (2003). European Symposium on Artificial Neural Networks (pp. 379-384).
Lee, C. W. (1993). Learning in neural networks by using tangent planes to constraint surfaces. Neural Networks, 6, 385-392.
Lee, C. W. (1997). Training feedforward neural networks: an algorithm giving improved generalization. Neural Networks, 10, 61-68.
Mak, M. W., Ku, K. W., & Lu, Y. L. (1999). On the improvement of the real time recurrent learning algorithm for recurrent neural networks. Neurocomputing, 24(1-3), 13-36.
Narendra, K. S., & Parthasarathy, K. (1990). Identification and control of dynamical systems using neural networks. IEEE Transactions on Neural Networks, 1(1), 4-27.
Parveen, S., & Green, P. (2004). Speech enhancement with missing data techniques using recurrent neural networks. IEEE Transactions on Fuzzy Systems, 1, 733-736.
Peng, Y., & Lin, K. (2007). Adaptive recurrent network for cerebellar model articulation controller for linear ultrasonic motor with optimal learning rates. Neurocomputing, 70(16-18), 2626-2637.
Perez-Ortez, J. A., Calera-Rubio, M. L., & Forcada, M. L. (2001). Online symbolic sequence prediction with discrete time recurrent neural networks. In Proceedings of the international conference on artificial intelligence. Vol. 2130 (pp. 719-724).
Robinson, A. J. (1994). An application of recurrent nets to phone probability estimation. Neural Networks, 5(2), 295-301.
Schmidhuber, J. (1992). A fixed size storage O(N^3) time complexity learning algorithm for fully recurrent continually running networks. Neural Computation, 4(2), 243-248.
Stoer, J. (1976). Einführung in die numerische Mathematik. Vol. 1. Springer.
Suresh, H. N., & Puttamadappa, C. (2008). Removal of EMG and ECG artifacts from EEG based on real time recurrent learning algorithm. International Journal of Physical Sciences, 3(5), 120-125.
Tino, P., & Mills, A. J. S. (2006). Learning beyond finite memory in recurrent networks of spiking neurons. Neural Computation, 18(3), 591-613.
Vartak, A. A., Georgiopoulos, M., & Anagnostopoulos, G. C. (2005). On-line Gauss-Newton based learning for fully recurrent neural networks. Nonlinear Analysis, 63, e867-e876.
Williams, R. J., & Peng, J. (1990). An efficient gradient based algorithm for online training of recurrent network trajectories. Neural Computation, 2, 490-501.
Williams, R. J., & Peng, J. (1992). Gradient based learning for recurrent neural networks and their computational complexity. In Backpropagation: theory, architecture and applications (pp. 433-486). Hillsdale, NJ: Lawrence Erlbaum.
Williams, R. J., & Zipser, D. (1989). A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1(2), 270-280.