
On sequential construction of binary neural networks

1995, IEEE Transactions on Neural Networks

Abstract—A new technique, called Sequential Window Learning (SWL), for the construction of two-layer perceptrons with binary inputs is presented. It generates the number of hidden neurons together with the correct values for the weights, starting from any binary training set. The introduction of a new type of neuron, having a window-shaped activation function, considerably increases the convergence speed and the compactness of the resulting networks. Furthermore, a preprocessing technique, called Hamming Clustering (HC), is proposed for improving the generalization ability of constructive algorithms for binary feedforward neural networks. Its insertion in Sequential Window Learning is straightforward. Tests on classical benchmarks show the good performance of the proposed techniques, both in terms of network complexity and recognition accuracy.

Marco Muselli

I. Introduction

The back-propagation algorithm [1] has successfully been applied both to classification and approximation problems, showing a remarkable flexibility and simplicity of use. Nevertheless, some important drawbacks restrict its application range, particularly when dealing with real world data:

• The network architecture must be fixed a priori, i.e. the number of layers and the number of units for each layer must be determined by the user before the training process.
• The learning time is in general very high and consequently the maximum number of weights one can consider is limited.
• Classification problems are not tackled in a natural way, because the cost function does not directly depend on the number of wrongly classified patterns.

A variety of solutions has been proposed in order to overcome such problems; among these, an important contribution comes from constructive methods [2]. Such techniques successively add units in the hidden layer until all the input-output relations of a given training set are satisfied. In general, the convergence time is very low, since at any iteration the learning process involves the weights of only one neuron. On the contrary, in the back-propagation procedure all the weights in the network are modified at the same time in order to minimize the cost function value.

In this paper we focus on the construction of binary feedforward neural networks with a single hidden layer. Every input and output in the net can only assume two possible states, coded by the integer values +1 and −1. Such an architecture is general enough to implement any boolean function (with one or more output values) if a sufficient number of hidden units is provided [2]. Moreover, this kind of neural network greatly simplifies the extraction of symbolic rules from the connection weights [3].

Some constructive methods are specifically devoted to the synthesis of binary feedforward neural networks [4]–[7]; they take advantage of this particular situation and lead in a short time to interesting solutions. A natural approach is provided by the procedure of sequential learning [6], but its implementation presents some practical problems. First of all, the weights in the output layer grow exponentially, leading to intractable numbers even for a few hidden units. Then, difficulties arise when the output dimension is greater than one, since no extension of the standard procedure is given. Furthermore, the proposed algorithm for the training of hidden neurons is not efficient and considerably increases the time needed for the synthesis of the whole network. On the other hand, faster methods like the perceptron algorithm [8] or optimal methods like the pocket algorithm [9] cannot be used, because of the particular construction made by the procedure of sequential learning.

The present work describes a method of solving these problems and proposes a well-suited algorithm, called Sequential Window Learning (SWL), for the training of two-layer perceptrons with binary inputs and outputs. In particular, we introduce a new type of neuron, having a window-shaped activation function; this unit allows the definition of a fast training algorithm based on the solution of linear algebraic equations.

Moreover, a procedure for increasing the generalization ability of constructive methods is presented. Such a procedure, called Hamming Clustering (HC), explores the local properties of the training set, leaving any global examination to the constructive method. In this way we can obtain a good balance between locality and capacity, which is an important tradeoff for the treatment of real world problems [10]. HC is also able to recognize irrelevant inputs in the current training set, in order to remove useless connections. The complexity of the resulting networks is then reduced, leading to simpler architectures. This fact is strictly related to the Vapnik-Chervonenkis dimension of the system, which depends on the number of weights in the neural network [11].

The structure of this paper is as follows. Section II introduces the formalism and describes the modifications and extensions to the procedure of sequential learning. In section III the window neuron is defined with its properties and the related training algorithm is examined in detail. Section IV presents the Hamming Clustering and its insertion in the Sequential Window Learning, while section V shows the simulation results and some comparisons with other training algorithms. Conclusions and final observations are the subject of section VI.

The author is with the Istituto per i Circuiti Elettronici, Consiglio Nazionale delle Ricerche, 16149 Genova, Italy.

II. The Procedure of Sequential Learning

Throughout this paper we consider two-layer feedforward perceptrons with binary inputs and outputs; let n be the number of inputs, h the number of units in the hidden layer (initially unknown) and m the number of outputs. The procedure of sequential learning starts from a training set containing p input-output relations (ξ^µ, ζ^µ), µ = 1, ..., p. All the components ξ_i^µ, i = 1, ..., n, and ζ_k^µ, k = 1, ..., m, are binary, coded by the values −1 and +1. For the sake of simplicity, a new component ξ_0^µ = +1 is always added to each input pattern, so that the bias of the hidden neurons becomes an additional weight. Then, let us introduce the following notation:

• X_j, j = 1, ..., h, is the j-th hidden neuron, having activation function σx.
• w_ji, j = 1, ..., h, i = 0, 1, ..., n, is the weight of the connection between the i-th input and the hidden neuron X_j; w_j0 is the bias of the unit X_j.
• S_j^µ, j = 0, 1, ..., h, is the output of X_j caused by the application of the pattern ξ^µ to the network inputs; we set S_0^µ = +1 by definition. All the binary values S_j^µ form a vector S^µ = (S_0^µ, S_1^µ, ..., S_h^µ) called the internal representation of the pattern ξ^µ.
• Y_k, k = 1, ..., m, is the k-th output neuron, having activation function σy.
• v_kj, k = 1, ..., m, j = 0, 1, ..., h, is the weight of the connection between the hidden unit X_j and the output neuron Y_k; v_k0 is the bias of Y_k.
• O_k^µ, k = 1, ..., m, is the output of Y_k caused by the application of the pattern ξ^µ to the network inputs.

The activation functions σx and σy, respectively for hidden and output neurons, can be different, but they must provide binary values in the set {−1, +1}. Consequently, the internal representations S^µ also have binary components.

The procedure of sequential learning adds hidden units, following a suitable rule, until all the relations contained in the training set are satisfied. The standard version, proposed by Marchand et al. [6], applies only to neural networks with a single output (m = 1); obviously, we can always construct a distinct network for each output by simply iterating the basic algorithm, but in general the resulting configuration has too many neurons and weights. However, let us begin our examination from the case m = 1; we shall give later some solutions for approaching generic output dimensions. Let P+ and P− be the following sets:

   P+ = {ξ^µ : ζ^µ = +1, µ = 1, ..., p}
   P− = {ξ^µ : ζ^µ = −1, µ = 1, ..., p}

where ζ^µ is the output pattern (a single binary value) corresponding to the input pattern ξ^µ of the training set. While we leave the activation function σx of the hidden units free, let σy be the well-known sign function:

   σy(x) = ψ(x) = +1 if x ≥ 0 ,  −1 if x < 0        (1)

Since we are dealing with the case m = 1, let us denote by v_j, j = 1, ..., h, the weights of the output neuron Y and by v_0 the corresponding bias.

The kernel of the procedure of sequential learning is the addition of a new unit in the hidden layer; for this aim a suitable training algorithm is applied. It provides the weights of a new unit X_j starting from a particular training set, in most cases different from the original one. Let Q_j^+ be the set of patterns ξ^µ for which the desired output is S_j^µ = +1, whereas Q_j^− contains the patterns ξ^µ for which we want S_j^µ = −1. When the training algorithm stops we obtain the following four sets (some of which may be empty):

• R_j^+ contains the patterns ξ^µ ∈ Q_j^+ correctly classified by X_j (S_j^µ = +1).
• R_j^− contains the patterns ξ^µ ∈ Q_j^− correctly classified by X_j (S_j^µ = −1).
• W_j^− contains the patterns ξ^µ ∈ Q_j^+ wrongly classified by X_j (S_j^µ = −1).
• W_j^+ contains the patterns ξ^µ ∈ Q_j^− wrongly classified by X_j (S_j^µ = +1).

In the procedure of sequential learning each unit X_j is assigned an arbitrary variable s_j taking values in the set {−1, +1}. By setting the sign of this variable we determine the class of the patterns which will be eliminated from the current training set after the creation of the hidden neuron X_j. In particular, if s_j > 0 the learning algorithm for the unit X_j must provide W_j^+ = ∅; in this case the patterns contained in R_j^+ will be removed from the training set. In the same way, if s_j < 0 the condition W_j^− = ∅ is required and R_j^− will be considered for the elimination.

We can join these two cases by introducing an equivalent formulation of the procedure of sequential learning. It always requires W_j^+ = ∅ after each insertion in the hidden layer; the class of the patterns removed is now determined by a proper definition of the sets Q_j^+ and Q_j^− for the training of the neuron X_j. In this formulation the main steps of the algorithm are the following:

1. An arbitrary variable s_j is chosen in the set {−1, +1}.
2. A new hidden unit X_j is generated, starting from the training set

      Q_j^+ = P+ if s_j = +1 ,  P− if s_j = −1
      Q_j^− = P− if s_j = +1 ,  P+ if s_j = −1

   The constraint W_j^+ = ∅ must be satisfied in the generation.
3. The resulting set R_j^+ is subtracted from the current training set {P+, P−}.

These three steps are iterated until the current training set contains only patterns from one class (P+ = ∅ or P− = ∅). In practice, each neuron X_j must be active (S_j^µ = +1) for some patterns ξ^µ having ζ^µ = s_j and must remain inactive (S_j^µ = −1) for every pattern ξ^µ for which ζ^µ = −s_j. Neurons that satisfy this condition can always be found; for example, the grandmother cell of a pattern ξ^µ ∈ P+ verifies this property [6]. However, a neural network containing only grandmother cells in its hidden layer has no practical interest, so a suitable training algorithm for the hidden neurons is required. This is the object of section III.
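As a concrete illustration of the loop just described, the following Python sketch implements the equivalent formulation for the case m = 1. The names sequential_learning, train_hidden_unit and activation are illustrative and not taken from the paper; train_hidden_unit stands for any routine (such as the window-neuron algorithm of section III) that returns the weights of a unit satisfying W_j^+ = ∅ and active on at least one pattern of Q_j^+, and the rule used here for choosing s_j (take the larger class) is only an assumption of the sketch.

```python
# Minimal sketch of the sequential learning outer loop (single output, m = 1).
# Patterns are tuples of +/-1 values with the bias component xi_0 = +1 prepended.

def sequential_learning(P_plus, P_minus, train_hidden_unit, activation):
    """Returns the hidden-unit weights and the sequence s_1, ..., s_h."""
    hidden_weights, signs = [], []
    P_plus, P_minus = list(P_plus), list(P_minus)
    while P_plus and P_minus:                  # stop when one class is exhausted
        s_j = +1 if len(P_plus) >= len(P_minus) else -1   # heuristic choice of s_j
        Qp, Qm = (P_plus, P_minus) if s_j > 0 else (P_minus, P_plus)
        w_j = train_hidden_unit(Qp, Qm)        # must guarantee W_j^+ = empty set
        R_plus = [x for x in Qp if activation(w_j, x) == +1]
        assert R_plus, "the training routine must classify some pattern of Q_j^+ correctly"
        # remove the correctly classified patterns of the chosen class
        if s_j > 0:
            P_plus = [x for x in P_plus if x not in R_plus]
        else:
            P_minus = [x for x in P_minus if x not in R_plus]
        hidden_weights.append(w_j)
        signs.append(s_j)
    return hidden_weights, signs
```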
Now we are interested in the choice of output weights v_j, j = 0, 1, ..., h, that correctly satisfy all the input-output relations contained in the training set. After the iterated execution of the three main steps above, a possible assignment for the weights v_j is the following [6]:

   v_j = s_j 2^(h−j)  for j = 1, ..., h ;   v_0 = Σ_{j=1}^{h} v_j − s_h        (2)

Unfortunately, these values grow exponentially with the number h of hidden neurons; thus, even for small resulting networks, the needed range of values makes both the simulation on a conventional computer and the implementation on a physical support extremely difficult or impossible.

To overcome this problem, let us subdivide the hidden neurons into g groups, each of which contains adjacent units X_j with the same value of s_j. More formally, if h_l, l = 1, ..., g, is the index of the last neuron belonging to the l-th group, then we have

   s_{j+1} = s_j  for j = h_{l−1} + 1, ..., h_l − 1 ,  l = 1, ..., g        (4)

having set h_0 = 0 by definition. The following result is then valid:

Theorem 1: A correct choice for the output weights v_j in the procedure of sequential learning is the following:

   v_j = s_j ( Σ_{i=h_l+1}^{h} |v_i| + 1 )  for j = h_{l−1} + 1, ..., h_l ,  l = 1, ..., g
   v_0 = Σ_{j=1}^{h} v_j − s_h        (3)

Proof. We show that (3) leads to a correct output O^µ for a generic ξ^µ ∈ P+; the complementary case (ξ^µ ∈ P−) can be treated in a similar way. If ξ^µ ∈ P+, then, from the iterated execution of the main steps, we obtain two possible situations:

1. There exists j* (1 ≤ j* ≤ h) such that s_{j*} = +1 and S_{j*}^µ = +1.
2. ξ^µ belongs to the residual training set when the execution stops; thus, we have s_h = −1.

In the first case let j* be the smallest index with this property and let l* denote the group of hidden neurons containing X_{j*} (1 ≤ l* ≤ g). Then we have

   s_j = s_{j*} = +1  for j = h_{l*−1} + 1, ..., h_{l*}

whereas

   S_j^µ = −1  for j = 1, ..., j* − 1.

So, the input to the output neuron Y is given by

   v_0 + Σ_{j=1}^{h} v_j S_j^µ = Σ_{j=1}^{h} v_j − s_h + Σ_{j=1}^{h} v_j S_j^µ = Σ_{j=1}^{h} (1 + S_j^µ) v_j − s_h
     = (1 + S_{j*}^µ) v_{j*} + Σ_{j=j*+1}^{h_{l*}} (1 + S_j^µ) v_j + Σ_{j=h_{l*}+1}^{h} (1 + S_j^µ) v_j − s_h
     = 2 s_{j*} ( Σ_{j=h_{l*}+1}^{h} |v_j| + 1 ) + Σ_{j=j*+1}^{h_{l*}} (1 + S_j^µ) v_j + Σ_{j=h_{l*}+1}^{h} (1 + S_j^µ) v_j − s_h
     ≥ 2 Σ_{j=h_{l*}+1}^{h} |v_j| + 2 − 2 Σ_{j=h_{l*}+1}^{h} |v_j| − 1 = 1 > 0

where the second equality uses S_j^µ = −1 for j < j*, the third uses S_{j*}^µ = +1 together with (3), and the inequality follows because the terms of the first sum are nonnegative (s_j = +1 for j* < j ≤ h_{l*}, by virtue of (4)), (1 + S_j^µ) v_j ≥ −2|v_j| for j > h_{l*}, and −s_h ≥ −1. Thus, by applying (1), we obtain O^µ = ζ^µ = +1 as desired.

In the second case we have instead

   S_j^µ = −1  for j = 1, ..., h

from which

   v_0 + Σ_{j=1}^{h} v_j S_j^µ = Σ_{j=1}^{h} v_j − s_h − Σ_{j=1}^{h} v_j = −s_h = 1 > 0

and again O^µ = ζ^µ = +1.
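A small Python sketch of the assignment (3) may make the bookkeeping clearer. The helper names output_weights and network_output are mine; the groups of (4) are recovered from the sequence s_1, ..., s_h, and the weights are computed backwards so that the magnitudes |v_i| of the later groups are already available.

```python
# Sketch of the output-weight assignment of Theorem 1 (equation (3)).
# `signs` is the list s_1, ..., s_h; adjacent equal values form the groups of (4).

def output_weights(signs):
    h = len(signs)
    # index of the last unit of the group each neuron belongs to
    last_of_group = [0] * h
    end = h - 1
    for j in range(h - 1, -1, -1):
        if j < h - 1 and signs[j] != signs[j + 1]:
            end = j
        last_of_group[j] = end
    # work backwards so that |v_i| for i beyond the current group is already known
    v = [0] * h
    for j in range(h - 1, -1, -1):
        hl = last_of_group[j]
        v[j] = signs[j] * (sum(abs(v[i]) for i in range(hl + 1, h)) + 1)
    v0 = sum(v) - signs[-1]
    return v0, v

def network_output(v0, v, S):
    """S is the internal representation (S_1, ..., S_h) of a pattern."""
    return +1 if v0 + sum(vj * Sj for vj, Sj in zip(v, S)) >= 0 else -1
```

For alternating signs the sketch reproduces the exponential assignment (2) (e.g. signs (+1, −1, +1) give v = (4, −2, 1)), while for long runs of equal signs the weights remain small.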
From (3) we obtain two extreme cases:

• If all the s_j are equal, then the output weights v_j have constant (binary) values, whereas the bias v_0 grows linearly with h.
• If the s_j vary alternately, that is s_{j+1} = −s_j for j = 1, ..., h − 1, then we return to the standard assignment (2).

Since the values of the variables s_j can be chosen in an arbitrary way, we have a method for controlling the growth of the output weights.

A. Generalization of the Procedure of Sequential Learning

The procedure of sequential learning can be extended in two ways in order to construct binary feedforward neural networks with a generic number of outputs. The difference between the two extensions lies in the generation of the output weights v_kj. In the first method the assignment (3) is naturally generalized, whereas in the second the weights v_kj are found by a proper training algorithm. In this last case the computing time needed for the construction of the whole network is higher, but this drawback is often balanced by a greater compactness of the resulting architectures and consequently by a better generalization ability. On the other hand, the availability of a fast method for the construction of binary neural networks, starting from a generic training set, is of great practical interest.

In a natural extension of the procedure of sequential learning to the case m > 1, the variables s_j must be replaced by a matrix [s_kj], k = 1, ..., m, j = 1, ..., h, filled with values in the set {−1, 0, +1}. In fact, a hidden neuron X_j could push one output neuron towards positive values (s_kj = +1) and another unit towards negative values (s_kj = −1). The choice s_kj = 0 means that no interaction exists between the units X_j and Y_k (consequently v_kj = 0). For the same reason we must consider a number g_k of groups for each output, and the last neuron of each group explicitly depends on the output it refers to. Thus we have a matrix of indexes [h_kl], k = 1, ..., m, l = 1, ..., g_k, in which the length of each row depends on the output k. Furthermore, the procedure starts from m pairs of pattern sets P_k^+ and P_k^−, k = 1, ..., m, obtained from the input-output relations of the training set in the following way:

   P_k^+ = {ξ^µ : ζ_k^µ = +1, µ = 1, ..., p}
   P_k^− = {ξ^µ : ζ_k^µ = −1, µ = 1, ..., p}

With these notations a natural extension of the procedure of sequential learning is shown in fig. 1.

[Fig. 1: Natural extension of the procedure of sequential learning.]

At step 4 the set of input patterns Q_h^+ for the training of a new hidden unit X_h is determined in such a way that

   Q_h^+ ⊂ P_k^+ if s_kh = +1 ,  Q_h^+ ⊂ P_k^− if s_kh = −1

for any k ∈ K, where K is a subset of output indexes. Among the possible ways of determining Q_h^+ and K, a simple method giving good results is the following:

1. For every k = 1, ..., m take the set U_k given by U_k = P_k^+ if s_kh = +1, U_k = P_k^− if s_kh = −1.
2. Put in K (initially empty) the output index k corresponding to the set U_k with the greatest number of elements and set Q_h^+ = U_k.
3. Modify the sets U_k, k = 1, ..., m, in the following way: U_k = U_k ∩ Q_h^+.
4. Let k′ ∉ K be the output associated with the set U_{k′} of greatest size. If the number of patterns in U_{k′} exceeds a given threshold τ, then put k′ in the set K and repeat steps 2–4; otherwise stop.

Such a simple method leads to the construction of hidden layers having a limited number of neurons, in general much smaller than that of the networks obtained by repeated applications of the standard procedure for m = 1. In the simulations we have always used this method with the choice τ = n. Theorem 1 can easily be extended to this case and shows the correctness of this approach: it is able to construct a two-layer neural network satisfying all the input-output relations contained in a given training set.
More compact nets can be obtained in many cases by applying a suitable training algorithm for the output layer. This second technique also guarantees the feasibility of the neural network but, as shown later, the convergence is only asymptotic in theory. However, in most practical cases the training time turns out to be acceptable. A possible implementation of this second extension is shown in fig. 2.

[Fig. 2: Extension with output training of the procedure of sequential learning.]

The choice of the variables s_k is made only at the first step; the user cannot modify their values during the construction of the hidden layer. The dynamic adaptation of these quantities is provided by the auxiliary variables t_k, whose sign is indirectly affected by the training algorithm for the output neurons (through the sets V_k^+ and V_k^−). Step 4 chooses the output index k′ that determines the addition of a new hidden unit X_j. This neuron must be active (S_j^µ = +1) for some patterns ξ^µ ∈ Q_{k′}^+ and provide S_j^µ = −1 for all the patterns ξ^µ ∈ Q_{k′}^−, as in the standard version. The disjointness of the sets R_j^+ obtained by subsequent choices of the same output index k is warranted by the updating of U_k^+ and U_k^− at step 6. The sets T_k^+ and T_k^− contain the internal representations of the input patterns of P_k^+ and P_k^−. They are obtained at step 7 through the following relations:

   T_k^+ = {S^µ : ξ^µ ∈ P_k^+} ,   T_k^− = {S^µ : ξ^µ ∈ P_k^−}

where the components of S^µ are given by

   S_j^µ = σx( Σ_{i=0}^{n} w_ji ξ_i^µ ) ,  for j = 1, ..., h.

After the application of the training algorithm for the output neurons, we obtain for every unit Y_k two sets V_k^+ and V_k^−:

   V_k^+ = {ξ^µ ∈ P_k^+ : O_k^µ = ζ_k^µ = +1}
   V_k^− = {ξ^µ ∈ P_k^− : O_k^µ = ζ_k^µ = −1}

Such a training algorithm plays a fundamental role in this second extension of the procedure of sequential learning. If the output layer is trained by an algorithm that finds, at least in an asymptotic way, the optimal configuration (i.e. the weight matrix [v_kj] which makes the minimum number of errors on the current training set), then correct binary neural networks are always constructed. Algorithms of this kind are available in the literature; the most popular is probably the pocket algorithm [9]. It can be shown that the probability of reaching the optimal configuration approaches unity as the training time increases. Unfortunately, no bound is known for the training time actually required, and other training algorithms, less good from a theoretical point of view but more efficient, are often used [12]. However, the properties of the pocket algorithm allow us to formulate the following result:

Theorem 2: The extension with output training of the procedure of sequential learning is (asymptotically) able to construct a two-layer perceptron satisfying all the input-output relations contained in a given binary training set.

Proof. Let us refer to the implementation in fig. 2; the repeated execution of steps 3–9 causes the addition of some hidden neurons for every output. Let I_k denote the set of indexes of the hidden units generated when step 4 chooses the k-th output (I_k ⊂ {1, ..., h}). Moreover, let R_j^+, for j ∈ I_k, be the set of input patterns correctly classified by the neuron X_j. By construction we have

   R_i^+ ∩ R_j^+ = ∅ ,  for all i ≠ j ,  i, j ∈ I_k.

Thus, in the worst case, all the input patterns belonging to P_k^+ or P_k^− will be contained in the union of the sets R_j^+, for j ∈ I_k. Let us suppose, without loss of generality, that there exists a subset J ⊂ I_k such that

   P_k^+ = ∪_{j ∈ J} R_j^+.

In this case, as derived from theorem 1, the neuron Y_k correctly classifies all the input patterns in the training set, with regard to the k-th output, if the following choice for the weights v_kj is made:

   v_kj = +1 for j ∈ J ,  0 otherwise ;   v_k0 = |J| − 1        (5)

where |J| is the number of elements in the set J. The properties of the pocket algorithm assure that solution (5) can be found asymptotically.
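The pocket algorithm itself is not restated here; the following is a simplified, ratchet-style Python sketch of the underlying idea (keep in a "pocket" the perceptron weights that classify the largest number of training samples correctly), intended only to illustrate how an output neuron Y_k could be trained on the internal representations. The function name pocket_algorithm, the fixed iteration budget and the accuracy-based swap rule are assumptions of this sketch, not the exact bookkeeping of [9].

```python
import random

def pocket_algorithm(samples, targets, n_iter=10000, seed=0):
    """samples: internal representations (tuples of +/-1, S_0 = +1 included);
    targets: desired outputs in {-1, +1}."""
    rng = random.Random(seed)
    w = [0.0] * len(samples[0])               # perceptron weights (bias included)
    best_w, best_correct = list(w), -1

    def n_correct(weights):
        return sum(1 for s, t in zip(samples, targets)
                   if (sum(wi * si for wi, si in zip(weights, s)) >= 0) == (t > 0))

    for _ in range(n_iter):
        mu = rng.randrange(len(samples))
        s, t = samples[mu], targets[mu]
        if (sum(wi * si for wi, si in zip(w, s)) >= 0) != (t > 0):
            w = [wi + t * si for wi, si in zip(w, s)]     # perceptron update
            c = n_correct(w)
            if c > best_correct:                          # "pocket" the best weights so far
                best_w, best_correct = list(w), c
    return best_w
```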
The two extended versions of the procedure of sequential learning are practically useful if a fast training algorithm for the hidden neurons is available. This algorithm must correctly classify all the input patterns belonging to a given class and some input patterns belonging to the opposite class. No method in the current literature, except that proposed in [6], pursues this objective; so, a suitable technique will be described in the following section.

III. The Window Neuron

Currently available constructive methods for binary training sets build neural networks that exclusively contain threshold units. In these neurons the output is given by (1), here reported:

   ζ = ψ( Σ_{i=0}^{n} w_i ξ_i ) = +1 if Σ_{i=0}^{n} w_i ξ_i ≥ 0 ,  −1 otherwise.

As mentioned above, a new component ξ_0 = +1 is added to the input vector ξ, in order to treat the neuron bias as a normal weight. In the following we use the term threshold network to indicate a two-layer perceptron containing only threshold units. A well-known result is the following [2]: given a training set made of p binary input-output relations (ξ^µ, ζ^µ), µ = 1, ..., p, it is always possible to find a threshold network that correctly satisfies these relations.

Now, let us introduce a new kind of neuron having a window-shaped activation function; its output is given by

   ζ = ϕ( Σ_{i=0}^{n} w_i ξ_i ) = +1 if |Σ_{i=0}^{n} w_i ξ_i| ≤ δ ,  −1 otherwise.        (6)

The real value δ is called the amplitude and is meaningful only from an implementation point of view. For the sake of simplicity, in the whole description we could set δ = 0; but when the computations are made by a machine (a computer or a dedicated support), the summation in (6) can move away from its theoretical value because of precision errors. Thus, the introduction of a small amplitude δ allows the practical use of the window neuron.

A window neuron can always be substituted by three threshold neurons; in fact, the output of a generic window neuron is given by

   ϕ( Σ_{i=0}^{n} w_i ξ_i ) = ψ( Σ_{i=0}^{n} w_i ξ_i + δ ) − ψ( Σ_{i=0}^{n} w_i ξ_i − δ ) − 1.

On the contrary, it seems that a direct correspondence in the opposite sense does not exist.
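A few lines of Python make the definition (6) and its decomposition into threshold units concrete; the function names are illustrative and the small default amplitude is an assumption of the sketch.

```python
# Sketch of the window activation (6) and of its decomposition into threshold
# units; psi is the sign function (1). A small amplitude delta absorbs the
# rounding errors mentioned above (delta = 0 would be the ideal case).

def psi(x):
    return +1 if x >= 0 else -1

def window(weights, pattern, delta=1e-6):
    """weights = (w_0, w_1, ..., w_n); the bias component xi_0 = +1 is added here."""
    s = sum(w * x for w, x in zip(weights, (1,) + tuple(pattern)))
    return +1 if abs(s) <= delta else -1

def window_via_thresholds(weights, pattern, delta=1e-6):
    s = sum(w * x for w, x in zip(weights, (1,) + tuple(pattern)))
    return psi(s + delta) - psi(s - delta) - 1   # agrees with window() except at s = +delta
```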
Nevertheless, the following result shows the generality of the window neuron:

Theorem 3: It is always possible to find a two-layer perceptron, containing only window neurons in the hidden layer, that correctly satisfies all the input-output relations of a given binary training set.

Proof. The construction follows the same steps as for threshold networks. Let (ξ^µ, ζ^µ), µ = 1, ..., p, be the p binary input-output relations contained in a given training set. For the sake of simplicity, let us consider the case m = 1 (output pattern ζ^µ with a single binary value); the neural network for generic m can be obtained by iterating the following procedure. Let s be the output value (−1 or +1) associated with the smaller number of input patterns in the training set. For every ξ^µ having ζ^µ = s, a window neuron X_j is added to the hidden layer with weights

   w_j0 = −n ;  w_ji = ξ_i^µ  for i = 1, ..., n.

Such a unit is a grandmother cell [2] for the pattern ξ^µ (it is active only for ξ^µ), as can be shown by simple inspection. Now, a threshold output neuron performing the logic operation or (nor) if s = +1 (s = −1) completes the construction.

A two-layer perceptron with window neurons in the hidden layer will be called a window network in the following. Two results are readily determined:

• The parity operation [13] can be realized by a window network containing ⌊(n+1)/2⌋ hidden units. A possible choice of weights is the following:

   w_j0 = n − 4j + 2 ;  w_ji = +1  for i = 1, ..., n
   v_0 = ⌊(n−1)/2⌋ ;  v_j = +1

for j = 1, ..., ⌊(n+1)/2⌋. In this configuration the j-th hidden unit is active if the input pattern contains 2j − 1 components with value +1. Then the output neuron executes a logic or and produces the correct result.

• A single window neuron performs the symmetry operation [13]. A possible choice of weights is the following:

   w_0 = 0
   w_i = 2^(⌊n/2⌋ − i)  for i = 1, ..., ⌊n/2⌋
   w_i = 0  for i = (n+1)/2, if n is odd
   w_i = −w_{n−i+1}  for i = n − ⌊n/2⌋ + 1, ..., n

In these cases window networks are considerably more compact than the corresponding threshold networks. Unfortunately, this is not a general result, since there exist linearly separable training sets that lead to more complex window networks.
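The parity construction above can be checked mechanically; the following Python sketch enumerates all binary patterns for small n and verifies that the window network with ⌊(n+1)/2⌋ hidden units produces the parity of the number of +1 components. The output convention assumed here (+1 for an odd count) and the function names are assumptions of the sketch.

```python
from itertools import product

# Verification sketch for the parity window network given above:
# floor((n+1)/2) window neurons plus a threshold output unit acting as a logic or.

def parity_window_network(n):
    H = (n + 1) // 2
    hidden = [[n - 4 * j + 2] + [1] * n for j in range(1, H + 1)]   # rows: (w_j0, w_j1, ..., w_jn)
    return hidden, (n - 1) // 2, [1] * H                            # output bias v_0 and weights v_j

def predict(n, pattern, delta=0):
    hidden, v0, v = parity_window_network(n)
    S = [+1 if abs(w[0] + sum(wi * xi for wi, xi in zip(w[1:], pattern))) <= delta else -1
         for w in hidden]                                           # window activations
    return +1 if v0 + sum(vj * Sj for vj, Sj in zip(v, S)) >= 0 else -1

for n in range(2, 9):
    assert all(predict(n, p) == (+1 if sum(x > 0 for x in p) % 2 == 1 else -1)
               for p in product((-1, +1), repeat=n))
print("parity construction verified for n = 2, ..., 8")
```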
A. The Training Algorithm for Window Neurons

Given a generic training set (ξ^µ, ζ^µ), µ = 1, ..., p, we wish to find a learning algorithm that provides the weights w_i, i = 0, 1, ..., n, of a window neuron that correctly classifies all the patterns ξ^µ for which ζ^µ = −1 and the maximum number of patterns ξ^µ for which ζ^µ = +1. The following result plays a fundamental role:

Theorem 4: It is always possible to construct a window neuron that provides the desired outputs for a given set of linearly independent input patterns.

Proof. Consider the matrix A, of size p × (n+1), whose rows are formed by the input patterns ξ^µ of the training set; let r be the rank of this matrix. If q ≤ r, let B denote a nonsingular minor containing q linearly independent input patterns of the training set. Suppose, without loss of generality, that B is formed by the first q rows (j = 1, ..., q) and the first q columns (i = 0, 1, ..., q − 1) of A. Then consider the following system of linear algebraic equations:

   B w = z        (7)

where the j-th component of the vector z is given by

   z_j = 1 − ζ^{µ_j} − Σ_{i=q}^{n} w_i ξ_i^{µ_j}        (8)

and (ξ^{µ_j}, ζ^{µ_j}) is the input-output relation of the training set associated with the j-th row of A. The weights w_i, for i = q, ..., n, are arbitrarily chosen. By solving (7) we obtain the weights w_i of a window neuron that correctly satisfies the q relations (ξ^{µ_j}, ζ^{µ_j}), j = 1, ..., q. In fact, we have from (8)

   Σ_{i=0}^{n} w_i ξ_i^{µ_j} = Σ_{i=0}^{q−1} w_i ξ_i^{µ_j} + Σ_{i=q}^{n} w_i ξ_i^{µ_j} = 1 − ζ^{µ_j}

hence

   ϕ( Σ_{i=0}^{n} w_i ξ_i^{µ_j} ) = ϕ( 1 − ζ^{µ_j} ) = ζ^{µ_j}  for j = 1, ..., q

if 0 ≤ δ < 2.

Then the main objective of correctly classifying all the input patterns ξ^µ having ζ^µ = −1 can be reached by following two steps:

1. Put in B only patterns ξ^{µ_j} with ζ^{µ_j} = +1.
2. Search for a window neuron that provides output −1 for the greatest number of patterns not contained in the minor B.

The following theorem offers an operative approach:

Theorem 5: If the minor B has dimension q ≤ n and contains only patterns ξ^{µ_j}, j = 1, ..., q, for which ζ^{µ_j} = +1, then the window neuron obtained by solving (7) gives output +1 for all the input patterns linearly dependent on ξ^{µ_1}, ..., ξ^{µ_q}. Moreover, if the arbitrary weights w_i, i = q, ..., n, are linearly independent in R viewed as a Q-vector space, then every input pattern which is linearly independent of ξ^{µ_1}, ..., ξ^{µ_q} in Q^n yields output −1.

Proof. Since ζ^{µ_j} = +1 for j = 1, ..., q, from (7) we obtain

   Σ_{i=0}^{n} w_i ξ_i^{µ_j} = 0  for j = 1, ..., q.

Now consider an input pattern ξ^ν not contained in the minor B; the q + 1 vectors (ξ_0^ν, ξ_1^ν, ..., ξ_{q−1}^ν)^t, (ξ_0^{µ_1}, ξ_1^{µ_1}, ..., ξ_{q−1}^{µ_1})^t, ..., (ξ_0^{µ_q}, ξ_1^{µ_q}, ..., ξ_{q−1}^{µ_q})^t are linearly dependent in Q^q, Q being the rational field (t denotes transposition). Thus there exist constants λ_1, ..., λ_q ∈ Q, some of which are different from zero, such that

   ξ_i^ν = Σ_{j=1}^{q} λ_j ξ_i^{µ_j}  for i = 0, 1, ..., q − 1        (9)

and consequently

   Σ_{i=0}^{n} w_i ξ_i^ν = Σ_{i=0}^{n} w_i ξ_i^ν − Σ_{j=1}^{q} λ_j ( Σ_{i=0}^{n} w_i ξ_i^{µ_j} ) = Σ_{i=q}^{n} w_i ( ξ_i^ν − Σ_{j=1}^{q} λ_j ξ_i^{µ_j} ).        (10)

If the patterns ξ^ν, ξ^{µ_1}, ..., ξ^{µ_q} are linearly dependent in Q^n, then the right hand side of (10) is null for some rational constants λ_j, j = 1, ..., q, that satisfy (9); hence the corresponding output of the window neuron is +1. On the contrary, if the vectors ξ^ν, ξ^{µ_1}, ..., ξ^{µ_q} are linearly independent in Q^n, then there exists at least one index i (q ≤ i ≤ n) such that

   ξ_i^ν ≠ Σ_{j=1}^{q} λ_j ξ_i^{µ_j}.

But the terms ξ_i^ν − Σ_{j=1}^{q} λ_j ξ_i^{µ_j} are rational numbers; so, if the w_i, i = q, ..., n, are linearly independent in R as a Q-vector space, we obtain

   Σ_{i=0}^{n} w_i ξ_i^ν = Σ_{i=q}^{n} w_i ( ξ_i^ν − Σ_{j=1}^{q} λ_j ξ_i^{µ_j} ) ≠ 0.

Hence, the output of the window neuron will be −1 if the amplitude δ is small enough.

Several choices for the real numbers w_i are available in the literature, but in most cases they require a representation range that is too large for a practical implementation. In the simulations we have used the following choice [14]:

   w_i = √γ_i  for i = 0, 1, 2, ...

where the γ_i are positive squarefree integers (not divisible by any perfect square), sorted in increasing order (by definition γ_0 = 1).

Theorem 5 gives a method of reaching our desired result: to correctly satisfy all the input-output relations (ξ^µ, ζ^µ) for which ζ^µ = −1 and some pairs (ξ^µ, ζ^µ) having ζ^µ = +1. The final step is to maximize the total number of correct outputs. With high probability, the optimal configuration could be reached only through an exhaustive inspection of all the possible solutions; unfortunately, this leads to a prohibitive computing time. However, good results can be obtained by employing a simple greedy procedure: the minor B is built by subsequently adding the row and the column of A that maximize the number of patterns with positive output correctly classified by the corresponding window neuron. Obviously, the number of correctly classified patterns with negative output must be kept maximum.

A detailed description of the resulting training algorithm is shown in fig. 3. It employs two sets I and J for the progressive construction of the minor B; these sets are carefully increased in order to maximize the number of patterns with correct output. The algorithm stops when it cannot successfully add rows and columns to B. This learning method can easily be modified if other, different goals are to be pursued. Its insertion in the procedure of sequential learning is straightforward and leads to a general technique called Sequential Window Learning (SWL), which exhibits interesting features, as we will see in the tests.

[Fig. 3: Training algorithm for window neurons.]
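The constructive step behind theorems 4 and 5 can be sketched in a few lines of numpy, assuming the q rows chosen for the minor B are already known and give a nonsingular leading q × q block; the greedy row/column selection of fig. 3 is not reproduced, and the hard-coded list of squarefree integers and the function names are assumptions of this sketch.

```python
import numpy as np

# Sketch of theorems 4 and 5: take q linearly independent patterns with desired
# output +1 as the rows of the minor B (their first q components), fix the free
# weights w_q, ..., w_n to square roots of distinct squarefree integers, and
# solve B w = z for w_0, ..., w_{q-1}. All patterns are assumed to be already
# extended with the bias component xi_0 = +1.

SQUAREFREE = [1, 2, 3, 5, 6, 7, 10, 11, 13, 14, 15, 17, 19, 21, 22, 23, 26, 29, 30, 31]

def window_neuron_from_minor(positive_patterns):
    A = np.array(positive_patterns, dtype=float)       # q x (n + 1), desired output +1
    q, n1 = A.shape
    B = A[:, :q]                                        # leading minor, assumed nonsingular
    w_free = np.sqrt(np.array(SQUAREFREE[:n1 - q], dtype=float))   # w_q, ..., w_n
    # z_j = 1 - zeta^{mu_j} - sum_{i >= q} w_i xi^{mu_j}_i, with zeta^{mu_j} = +1
    z = -A[:, q:] @ w_free
    return np.concatenate([np.linalg.solve(B, z), w_free])

def window_output(w, pattern, delta=1e-6):
    return +1 if abs(float(np.dot(w, pattern))) <= delta else -1
```

With these weights, patterns that are rational linear combinations of the chosen rows fall inside the window (output +1), while the remaining ones fall outside, provided the amplitude δ is kept small.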
IV. The Hamming Clustering

Although SWL allows the efficient construction of a two-layer perceptron starting from a given binary training set, its generalization ability depends on a variety of factors that are not directly controllable. In this regard, a new technique, called Hamming Clustering (HC), approaches the solution of two important problems:

• To increase the locality of the algorithm, so as to improve the generalization ability of the produced neural network in real world applications [10].
• To find redundant inputs which do not affect one or more outputs. By removing the corresponding connections, the number of weights in the resulting network, and consequently its complexity, is reduced. This also improves the generalization ability of the system [11], [15].

A natural way of proceeding is to group the patterns belonging to the same class that are close to each other according to the Hamming distance. This produces some clusters in the input space which determine the class extension. The clustering must be made in such a way that it can directly be inserted in the SWL algorithm.

The concept of template plays a fundamental role in HC; it is very similar to the concept of schema widely used in the theory of genetic algorithms [16]. Let us denote by the symbols '+' and '−' the two binary states (corresponding to the integer values +1 and −1); with this notation the pattern ξ = (+1, −1, −1, +1, −1)^t is equivalent to the string +−−+−. In HC a template is a string of binary components that can also contain don't care symbols, denoted by '0'. A template represents the set of patterns which are obtained by expanding its don't care symbols. For example, in the space of binary patterns with five components, the template +0−0− is equivalent to the set {+−−−−, +−−+−, ++−−−, ++−+−}.

A template has two important properties:

1. It has a direct correspondence with a logic and among the pattern components. For example, the template +0−0− above corresponds to the operation ξ1 and not ξ3 and not ξ5, where ξ1, ξ3 and ξ5 are respectively the first, the third and the fifth component of a generic pattern ξ. Only the patterns in the equivalent set generate the value +1 as a result.
2. The construction of a window neuron that performs the and associated with a given template is straightforward. Every weight must be set to the corresponding value in the template, whereas the bias must be equal to the number of binary variables in the template, changed in sign. For example, the template +0−0− corresponds to the window neuron having weights

   w_0 = −3 , w_1 = +1 , w_2 = 0 , w_3 = −1 , w_4 = 0 , w_5 = −1.

This unit is active only for the patterns in the equivalent set of the template above and realizes the desired logic operation and.
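Property 2 translates directly into code; the following Python sketch (with illustrative function names) maps a template to window-neuron weights and reproduces the +0−0− example above.

```python
# Sketch of property 2: turning a template into the weights of a window neuron
# that is active exactly on the template's equivalent set.

def template_to_window_weights(template):
    """template: string over '+', '-', '0' (don't care); returns [w_0, w_1, ..., w_n]."""
    values = {'+': 1, '-': -1, '0': 0}
    w = [values[c] for c in template]
    bias = -sum(1 for c in template if c != '0')   # minus the number of binary symbols
    return [bias] + w

def window_active(weights, pattern, delta=0):
    """pattern: tuple of +/-1 values, without the bias component."""
    s = weights[0] + sum(wi * xi for wi, xi in zip(weights[1:], pattern))
    return abs(s) <= delta

w = template_to_window_weights('+0-0-')
print(w)                                          # [-3, 1, 0, -1, 0, -1]
print(window_active(w, (+1, -1, -1, +1, -1)))     # True: + - - + - matches the template
print(window_active(w, (-1, -1, -1, +1, -1)))     # False
```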
Every weight must be set to the correspond- which does not generate conflicts with the patterns of P − ; ing value in the template, whereas the bias must be thus, at the end of the first subdivision, we obtain a clusequal to the number of binary variables in the tem- ter with five templates, each of which contains a don’t care plate, changed in sign. For example, the template symbol in the second location. Another execution of ex+0 − 0− corresponds to the window neuron having tension and subdivision leads to the following situation: weights:  +0 + −−  Comp. Conflicts    ξ1 2  +0 − −+    +0 + 0−  w0 = −3 , w1 = +1 , w2 = 0 , Subd. Ext. +0 − 0+ ξ3 1 −0 + ++ =⇒ =⇒ −0 + 0+     ξ4 0  +0 + +−  −0 − 0+ w3 = −1 , w4 = 0 , w5 = −1 −0 − ++ ξ5 This unit is active only for the patterns in the equivalent set of the template above and realizes the desired logic operation and. HC proceeds by extending and subdividing clusters of templates; every cluster contains templates having don’t care symbols in the same locations. Let P + and P − be the sets of patterns forming the given training set; the method initially considers only one cluster containing all the patterns of P + . Then, this cluster undergoes the following two actions: Ext. +0 − 0+ =⇒  −0 + 0+  −0 − 0+ n 00 − 0+ 00 + 0+ o {+0 + 0−} Subd. =⇒ 1 1 3 00 + 0+ {+0 + 0−} Ext. =⇒ Ext. =⇒ Comp. Conflicts ξ3 ξ5 0 2 Comp. Conflicts ξ1 ξ3 ξ5 1 1 0 Subd. =⇒ {0000+} Subd. =⇒ {+0 + 00} Resulting clusters directly correspond to the logic operation (12) from which the training set was derived. In fact, the template 0000+ (first cluster) corresponds to the component ξ5 that forms the second operand for the boolean or, whereas the template +0 + 00 corresponds to the first operand ξ1 and ξ3 . The binary neural network providing the correct output ζ is then easily constructed by placing a hidden window neuron for each template and adding an output threshold neuron that performs the required operation or. Such a network is shown in fig. 4. This example shows how HC reaches the proposed goals; first of all, the boolean function generating the training set has been found. Then, the number of connections in the resulting network has been minimized; finally, redundant inputs have been determined. These three results is of great practical importance. Furthermore, the equivalent threshold network can di- These two actions are then applied to the resulting clusters and the procedure is iterated until no more extensions could be done. The method is better understood if we consider in detail a clarifying example. Let P + and P − be the following sets of patterns:  +−+−−   −−−−−     +−−−+    −+++−   −++++ ++−+− P+ = ; P− = (11)    ++++−    −++−−   +−−+− generated by the logic operation: ζ = (ξ1 and ξ3 ) or ξ5 ξ1 ξ3 ξ5 Finally, a last iteration gives the desired result: Extension: Each binary component in the cluster is replaced one at a time by the don’t care symbol and the corresponding number of conflicts with the patterns of P − is computed. Subdivision: The binary component with the minimum number of conflicts is considered. If this number is greater than zero the cluster is subdivided in two subsets: the first contains the templates that do not lead to conflicts (with the selected binary component replaced by the don’t care symbol) and the latter is formed by the remaining templates (unchanged). −+−++ 2 The component ξ2 has not yet been considered, since it contains only don’t care symbols. 
Furthermore, the equivalent threshold network can directly be built, so HC has general validity. It generates and-or networks, in which the hidden layer performs and operations among the inputs and the final layer computes or operations among the outputs of the hidden neurons. Now, a fundamental result of the theory of logic networks says that every boolean function (hence every training set) can be written in the and-or form [17]. Consequently, the network above is general. However, more compact neural networks can exist. For example, the well-known parity function needs 2^(n−1) hidden neurons in the and-or configuration, whereas, as shown in section III, a window network with ⌊(n+1)/2⌋ hidden units can provide the correct output. This remark shows the importance of a deep integration between SWL and HC.

The following definition introduces a possible approach: if a pattern of P+ belongs to the equivalent set of a given template, then we say that this template covers that pattern. For example, the template 0000+ covers the pattern +−−−+ in the training set (11). We then use the term covering to denote the number of patterns of P+ covered by the templates of a given cluster. With these definitions, a practical method for integrating HC and the training algorithm for window neurons follows these steps:

1. Starting from the pattern sets P+ and P− (the training set for the window neuron), perform the Hamming Clustering and reach the final clusters of templates.
2. Choose the cluster having maximum covering and consider the binary components of this cluster (leaving out the don't care symbols).
3. Construct a new training set from P+ and P−, in which only the selected components are kept.
4. Generate the corresponding window neuron and remove the connections corresponding to the neglected components (by zeroing their weights).

Such a method performs the training of the window neuron on a reduced number of components and recovers the computing time spent in the execution of HC. Moreover, it can successfully be inserted in the procedure of sequential learning.

V. Tests and Results

The proposed techniques were tested on several benchmarks in order to point out their performance, also in comparison with other constructive methods. These groups of tests regard different situations, each of which emphasizes a particular characteristic of the learning algorithm. The first group concerns the so-called exhaustive simulations: in these cases the whole truth table of a boolean function is given as a training set and the algorithm is required to minimize the number of weights in the resulting network. The second set of trials refers to binary generalizations: the training set is binary and incomplete, and SWL must return a configuration that minimizes the number of errors on a given test set. Finally, the third group of simulations concerns problems of real generalization: in these cases the training set contains real patterns and is again incomplete. An A/D conversion is performed and the resulting set of binary patterns is used by SWL for generating a neural network that tries to minimize the number of errors on a given test set.

The reported computing times refer to a C code running on a DECstation 5200; they provide only a first indication of the convergence speed of the proposed techniques. In fact, a correct evaluation requires the comparison with other algorithms and consequently the definition of a proper environment in which the tests have to be made. For the sake of brevity, we defer such an evaluation to a subsequent paper.

A. Exhaustive simulations

Three groups of trials analyzed the quality of the configurations obtained by SWL. In these simulations HC was not used, because it improves only the generalization ability of a learning algorithm.

Parity Function: Training runs were made for n = 2, 3, ..., 10 and SWL was always able to find the optimal configuration with ⌊(n+1)/2⌋ hidden neurons. The time required for the construction of the network went from 0.02 seconds (n = 2) to 62 seconds (n = 10).

Symmetry Function: Also in this case the simulations with n = 2, 3, ..., 10 always yielded the minimum network, containing only one window neuron. Less than 2 seconds was sufficient for every trial.

Random Function: A group of tests involves the random function, whose output is given by a uniform probability distribution. It allows an interesting comparison with other constructive methods, like the tiling [4] and upstart [5] algorithms. Fig. 5 shows the number of units in the neural networks realized by the three techniques for n = 4, 5, ..., 10. Every point is the average of 50 trials with different training sets. The networks generated by SWL are more compact, and the difference increases with the number n of inputs. This shows the efficiency of SWL in terms of configuration complexity. However, the computational task is very heavy: the CPU time required grows exponentially with n, from 0.03 seconds (n = 4) to 524 seconds (n = 10).

[Fig. 5: Simulation results for the random function.]
B. Binary Generalizations

Parity and symmetry functions were also used for the performance analysis on binary generalizations. A third group of trials concerns the monk problems [18], an interesting benchmark both for connectionist and symbolic methods.

Parity Function: With a number n = 10 of inputs, we considered a training set of p randomly chosen patterns. The remaining 1024 − p patterns formed the test set. Fifty different trials were executed for each value of p = 100, 200, ..., 900, obtaining the correctness percentages shown in fig. 6. Since the parity function does not present any locality property (the value at a point is uncorrelated with the values in its neighborhood), the application of HC always leads to poorer performance.

[Fig. 6: Generalization results for the parity function.]

Symmetry Function: Tests on the symmetry function present the same environment as for parity. Thus the number of inputs was again n = 10 and fifty runs were made for each value of p = 100, 200, ..., 900. The results are shown in fig. 7. Also in this case the lack of locality makes HC useless; the generalization ability obtained is however very interesting.

[Fig. 7: Generalization results for the symmetry function.]

Monk Problems: Three classification problems have been proposed [18] as benchmarks for learning algorithms. Since their inputs are not binary, a conversion was needed before the application of SWL. We used a bit for each possible value, so that the resulting binary neural network had n = 17 inputs and a single output. In this way the results in tab. 1 were obtained. Here HC improves the performance on all the tests and reduces the number of weights in the networks, allowing an easier rule extraction. The computational effort is rather low: about 1 second for problems #1 and #2, about 5 seconds for problem #3.

Tab. 1: Results for the application of SWL to the monk problems.

   Problem | Use HC? | Neurons | Weights ≠ 0 | % Test Errors
   #1      | YES     | 3       | 11          | 0
   #1      | NO      | 4       | 58          | 2.61
   #2      | YES     | 1       | 7           | 0
   #2      | NO      | 1       | 18          | 0
   #3      | YES     | 10      | 46          | 7.87
   #3      | NO      | 6       | 96          | 29.86

C. Real Generalizations

Finally, two tests were devoted to classification problems with real inputs. In these cases the application of SWL (and of any other binary constructive method) is influenced by the A/D conversion used in the preprocessing phase. Among the variety of proposed methods, we chose the Gray code [17], which maps close integer numbers to similar binary strings (with respect to the Hamming distance).

Iris Problem: The Anderson-Fisher dataset [19] for the classification of three species of Iris, starting from petal and sepal dimensions, is a classical test on real generalization. If we use an 8-bit Gray code for each real input, we obtain a training set containing 100 strings of length n = 32. The neural network constructed by SWL with HC correctly classifies every pattern in the training set and makes three errors (the theoretical minimum) on the test set. Such a network contains 6 hidden neurons and only 37 connections in the two layers. The CPU time needed for the construction is 10 seconds.
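The A/D preprocessing used for these real-input tests can be sketched as follows in Python: each real feature is quantized and encoded with a binary-reflected Gray code, so that close values differ in few bits. The quantization ranges shown for the iris example, as well as the function names, are assumptions of this sketch and not taken from the paper.

```python
# Sketch of the Gray-code A/D conversion used in the preprocessing phase.

def gray_encode(value, lo, hi, bits=8):
    levels = (1 << bits) - 1
    q = round((value - lo) / (hi - lo) * levels)       # quantization step
    q = min(max(q, 0), levels)
    g = q ^ (q >> 1)                                   # binary-reflected Gray code
    return [+1 if (g >> (bits - 1 - b)) & 1 else -1 for b in range(bits)]

def encode_pattern(features, ranges, bits=8):
    out = []
    for x, (lo, hi) in zip(features, ranges):
        out.extend(gray_encode(x, lo, hi, bits))
    return out

# e.g. four iris features with 8 bits each give binary strings of length n = 32
print(len(encode_pattern([5.1, 3.5, 1.4, 0.2],
                         [(4.0, 8.0), (2.0, 4.5), (1.0, 7.0), (0.1, 2.6)], bits=8)))
```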
Circle Recognition: The last trial concerns the recognition of a circle from randomly placed points. Let us consider the pattern in fig. 8a, in which white and black zones correspond to positive and negative output respectively. 1000 random points formed the training set, while another 1000 points were contained in the test set. Since the areas of the two regions are equal, about 50% of the points in either set fall inside the circle. By using a Gray code with 6 bits for each real input, we obtained from SWL with HC a binary neural network containing 21 hidden units and 119 connections in the two layers. No error was made on the training set, whereas 4.1% of the test set was misclassified. This result shows that the quantization effects of the A/D conversion were overcome by HC, leading to an interesting generalization ability also with real inputs. The reconstructed pattern is presented in fig. 8b. Five seconds of computation was sufficient for obtaining this result.

[Fig. 8: Original (a) and reconstructed (b) pattern for the test on circle recognition.]

VI. Conclusions

A new constructive method for the training of two-layer perceptrons with binary inputs and outputs has been presented. This algorithm is based on two main blocks:

• A modification of the procedure of sequential learning [6], with practical extensions to the construction of neural networks with any number of outputs.
• The introduction of a new kind of neuron, having a window-shaped activation function, which allows a fast and efficient training algorithm.

Tests on this procedure, called Sequential Window Learning (SWL), show interesting results both in terms of computational speed and compactness of the constructed networks. Moreover, a new preprocessing algorithm, called Hamming Clustering (HC), has been introduced for improving the generalization ability of constructive methods. Its application is particularly useful in the case of training sets with high locality; again, some simulations have shown the characteristics of HC. Currently, SWL and HC are being applied to real world problems, such as handwritten character recognition and genomic sequence analysis, giving some first promising results.

References
[1] D.E. Rumelhart, G.E. Hinton, and R.J. Williams, Learning Representations by Back-Propagating Errors. Nature, 323 (1986) 533–536.
[2] J. Hertz, A. Krogh, and R. Palmer, Introduction to the Theory of Neural Computation (1991) Redwood City: Addison-Wesley.
[3] S.I. Gallant, Neural Network Learning and Expert Systems (1993) Cambridge, MA: MIT Press.
[4] M. Mézard and J.-P. Nadal, Learning in Feedforward Layered Networks: the Tiling Algorithm. Journal of Physics A, 22 (1989) 2191–2204.
[5] M. Frean, The Upstart Algorithm: a Method for Constructing and Training Feedforward Neural Networks. Neural Computation, 2 (1990) 198–209.
[6] M. Marchand, M. Golea, and P. Ruján, A Convergence Theorem for Sequential Learning in Two-Layer Perceptrons. Europhysics Letters, 11 (1990) 487–492.
[7] D.L. Gray and A.N. Michel, A Training Algorithm for Binary Feedforward Neural Networks. IEEE Transactions on Neural Networks, 3 (1992) 176–194.
[8] F. Rosenblatt, Principles of Neurodynamics (1962) New York: Spartan.
[9] S.I. Gallant, Perceptron-Based Learning Algorithms. IEEE Transactions on Neural Networks, 1 (1990) 179–191.
[10] L. Bottou and V. Vapnik, Local Learning Algorithms. Neural Computation, 4 (1992) 888–900.
[11] E.B. Baum and D. Haussler, What Size Net Gives Valid Generalization? Neural Computation, 1 (1989) 151–160.
[12] M. Frean, A "Thermal" Perceptron Learning Rule. Neural Computation, 4 (1992) 946–957.
[13] D.E. Rumelhart, G.E. Hinton, and R.J. Williams, Learning Internal Representations by Error Propagation. In Parallel Distributed Processing, Eds.: D.E. Rumelhart and J.L. McClelland (1986) Cambridge, MA: MIT Press, 318–362.
[14] D.A. Marcus, Number Fields (1977) New York: Springer-Verlag.
[15] A. Blumer, A. Ehrenfeucht, D. Haussler, and M.K. Warmuth, Learnability and the Vapnik-Chervonenkis Dimension. Journal of the ACM, 36 (1989) 929–965.
[16] D.E. Goldberg, Genetic Algorithms in Search, Optimization, and Machine Learning (1989) Reading, MA: Addison-Wesley.
[17] H.W. Gschwind and E.J. McCluskey, Design of Digital Computers (1975) New York: Springer-Verlag.
[18] S.B. Thrun, J. Bala, E. Bloedorn, I. Bratko, B. Cestnik, K. De Jong, S. Džeroski, S.E. Fahlman, D. Fisher, R. Hamann, K. Kaufman, S. Keller, I. Kononenko, J. Kreuziger, R.S. Michalski, T. Mitchell, P. Pachowicz, Y. Reich, H. Vafaie, W. Van de Welde, W. Wenzel, J. Wnek, and J. Zhang, The MONK's Problems: A Performance Comparison of Different Learning Algorithms. (1991) Carnegie Mellon University report CMU-CS-91-197.
[19] R.A. Fisher, The Use of Multiple Measurements in Taxonomic Problems. Annals of Eugenics, 8 (1936) 376–386.

Marco Muselli was born in 1962. He received the degree in electronic engineering from the University of Genoa, Italy, in 1985. He is currently a Researcher at the Istituto per i Circuiti Elettronici of CNR in Genoa. His research interests include neural networks, global optimization, genetic algorithms and characterization of nonlinear systems.