On Sequential Construction
of Binary Neural Networks
Marco Muselli
architecture is general enough to implement any Boolean function (with one or more output values), provided that a sufficient number of hidden units is available [2]. Moreover, this kind of neural network greatly simplifies the extraction of symbolic rules from the connection weights [3].
Some constructive methods are specifically devoted to
the synthesis of binary feedforward neural networks [4]–[7];
they take advantage of this particular situation and lead in
a short time to interesting solutions. A natural approach
is determined by the procedure of sequential learning [6],
but its implementation presents some practical problems.
First of all, the weights in the output layer grow exponentially, leading to intractable numbers even for a few hidden units. Then, difficulties arise when the output dimension is greater than one, since no extension of the standard procedure is given.
Furthermore, the proposed algorithm for the training of
hidden neurons is not efficient and considerably increases
the time needed for the synthesis of the whole network. On
the other hand, faster methods, such as the perceptron algorithm [8], or optimal methods, such as the pocket algorithm [9], cannot be used, because of the particular construction performed by the procedure of sequential learning.
The present work describes a method of solving these
problems and proposes a well-suited algorithm, called Sequential Window Learning (SWL), for the training of two-layer perceptrons with binary inputs and outputs. In particular, we introduce a new type of neuron, having a window-shaped activation function; this unit allows the definition
of a fast training algorithm based on the solution of linear
algebraic equations.
Moreover, a procedure for increasing the generalization
ability of constructive methods is presented. Such a procedure, called Hamming Clustering (HC), explores the local
properties of the training set, leaving any global examination to the constructive method. In this way we can obtain
a good balance between locality and capacity, which is an
important tradeoff for the treatment of real world problems
[10].
HC is also able to recognize irrelevant inputs inside the
current training set, in order to remove useless connections. The complexity of resulting networks are then reduced leading to simpler architectures. This fact is strictly
related to the Vapnik-Chervonenkis dimension of the system, which depends on the number of weights in the neural
network [11].
The structure of this paper is as follows.
Abstract—A new technique, called Sequential Window Learning (SWL), for the construction of two-layer perceptrons
with binary inputs is presented.
It generates the number of hidden neurons together with
the correct values for the weights, starting from any binary
training set. The introduction of a new type of neuron,
having a window-shaped activation function, considerably
increases the convergence speed and the compactness of resulting networks.
Furthermore, a preprocessing technique, called Hamming
Clustering (HC), is proposed for improving the generalization ability of constructive algorithms for binary feedforward neural networks. Its insertion into Sequential Window Learning is straightforward.
Tests on classical benchmarks show the good performance of the proposed techniques, both in terms of network complexity and recognition accuracy.
I. Introduction
The back-propagation algorithm [1] has successfully been
applied both to classification and approximation problems,
showing remarkable flexibility and simplicity of use. Nevertheless, some important drawbacks restrict its application range, particularly when dealing with real-world data:
• The network architecture must be fixed a priori, i.e. the
number of layers and the number of units for each layer
must be determined by the user before the training
process.
• The learning time is in general very high and consequently the maximum number of weights one can
consider is reduced.
• Classification problems are not tackled in a natural
way, because the cost function does not directly depend on the number of wrongly classified patterns.
A variety of solutions has been proposed in order to solve
such problems; among these, an important contribution
comes from constructive methods [2]. Such techniques successively add units to the hidden layer until all the input-output relations of a given training set are satisfied. In general, the convergence time is very low, since at each iteration the learning process involves the weights of only one neuron. In the back-propagation procedure, on the contrary, all the weights in the network are modified at the same time in order to minimize the value of the cost function.
In this paper we focus on the construction of binary feedforward neural networks with a single hidden layer. Every
input and output in the net can only assume two possible
states, coded by the integer values +1 and −1. Such an
The author is with the Istituto per i Circuiti Elettronici, Consiglio
Nazionale delle Ricerche, 16149 Genova, Italy.
Section II introduces the formalism and describes the modifications and extensions to the procedure of sequential learning. In Section III the window neuron is defined with its properties, and the related training algorithm is examined in detail. Section IV presents Hamming Clustering and its insertion into Sequential Window Learning, while Section V shows the simulation results and some comparisons with other training algorithms. Conclusions and final observations are the matter of Section VI.

$$P^- = \{\xi^\mu : \zeta^\mu = -1,\ \mu = 1,\dots,p\}$$

where $\zeta^\mu$ is the output pattern (a single binary value) corresponding to the input pattern $\xi^\mu$ of the training set.
While we leave the activation function $\sigma_x$ of the hidden units free, let $\sigma_y$ be the well-known sign function:
$$\sigma_y(x) = \psi(x) = \begin{cases} +1 & \text{if } x \ge 0 \\ -1 & \text{if } x < 0 \end{cases} \qquad (1)$$
Since we are dealing with the case m = 1, let us denote
with vj , j = 1, . . . , h, the weights for the output neuron Y
and with v0 the corresponding bias.
The kernel of the procedure of sequential learning is the
addition of a new unit in the hidden layer; for this aim
a suitable training algorithm is applied. It provides the
weights of a new unit Xj starting from a particular training
set, in most cases different from the original one. Let $Q_j^+$ be the set of the patterns $\xi^\mu$ for which the desired output is $S_j^\mu = +1$, whereas $Q_j^-$ contains the patterns $\xi^\mu$ for which we want $S_j^\mu = -1$.
When the training algorithm stops we obtain the following four sets (some of which may be empty):
II. The Procedure of Sequential Learning
Throughout this paper, we consider two-layer feedforward perceptrons with binary inputs and outputs; let n be
the number of inputs, h the number of units in the hidden
layer (initially unknown) and m the number of outputs.
The procedure of sequential learning starts from a training set containing $p$ input-output relations $(\xi^\mu, \zeta^\mu)$, $\mu = 1,\dots,p$. All the components $\xi_i^\mu$, $i = 1,\dots,n$, and $\zeta_k^\mu$, $k = 1,\dots,m$, are binary, coded by the values $-1$ and $+1$. For the sake of simplicity, a new component $\xi_0^\mu = +1$ is always added to each input pattern, so that the bias of the hidden neurons becomes an additional weight.
Then, let us introduce the following notations:
• $R_j^+$ contains the patterns $\xi^\mu \in Q_j^+$ correctly classified by $X_j$ ($S_j^\mu = +1$).
• $R_j^-$ contains the patterns $\xi^\mu \in Q_j^-$ correctly classified by $X_j$ ($S_j^\mu = -1$).
• $W_j^-$ contains the patterns $\xi^\mu \in Q_j^+$ wrongly classified by $X_j$ ($S_j^\mu = -1$).
• $W_j^+$ contains the patterns $\xi^\mu \in Q_j^-$ wrongly classified by $X_j$ ($S_j^\mu = +1$).
• $X_j$, $j = 1,\dots,h$, is the $j$-th hidden neuron, having activation function $\sigma_x$.
• $w_{ji}$, $j = 1,\dots,h$, $i = 0,1,\dots,n$, is the weight of the connection between the $i$-th input and the hidden neuron $X_j$; $w_{j0}$ is the bias of the unit $X_j$.
• $S_j^\mu$, $j = 0,1,\dots,h$, is the output of $X_j$ caused by the application of the pattern $\xi^\mu$ to the network inputs; we set $S_0^\mu = +1$ by definition. All the binary values $S_j^\mu$ form a vector $S^\mu = (S_0^\mu, S_1^\mu, \dots, S_h^\mu)$, called the internal representation of the pattern $\xi^\mu$.
• $Y_k$, $k = 1,\dots,m$, is the $k$-th output neuron, having activation function $\sigma_y$.
• $v_{kj}$, $k = 1,\dots,m$, $j = 0,1,\dots,h$, is the weight of the connection between the hidden unit $X_j$ and the output neuron $Y_k$; $v_{k0}$ is the bias of $Y_k$.
• $O_k^\mu$, $k = 1,\dots,m$, is the output of $Y_k$ caused by the application of the pattern $\xi^\mu$ to the network inputs.
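For concreteness, the notation above can be turned into a forward pass in a few lines. The following sketch is ours (the names `sign` and `forward` are illustrative, not from the paper); it computes the internal representation $S^\mu$ and the outputs $O_k^\mu$ of a two-layer $\pm 1$ perceptron:

```python
def sign(x):
    # Sign function psi of eq. (1): +1 if x >= 0, -1 otherwise.
    return 1 if x >= 0 else -1

def forward(xi, w, v, sigma_x=sign, sigma_y=sign):
    """Forward pass of a two-layer +/-1 perceptron.

    xi : input pattern with xi[0] = +1 (the bias component)
    w  : h rows of n+1 hidden weights; w[j][i] links input i to unit X_j
    v  : m rows of h+1 output weights; v[k][0] is the bias of Y_k
    Returns (S, O): internal representation (S[0] = +1) and network outputs.
    """
    S = [1] + [sigma_x(sum(wji * x for wji, x in zip(wj, xi))) for wj in w]
    O = [sigma_y(sum(vkj * Sj for vkj, Sj in zip(vk, S))) for vk in v]
    return S, O
```

For instance, a single hidden unit with weights $(-1, +1, +1)$ and an output unit that simply copies it realizes the logic AND of two binary inputs.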
In the procedure of sequential learning each unit $X_j$ is assigned an arbitrary variable $s_j$ taking values in the set $\{-1,+1\}$. By setting the sign of this variable we determine the class of the patterns that will be eliminated from the current training set after the creation of the hidden neuron $X_j$.
In particular, if $s_j > 0$ the learning algorithm for the unit $X_j$ must provide $W_j^+ = \emptyset$; in this case the patterns contained in $R_j^+$ will be removed from the training set. In the same way, if $s_j < 0$ the condition $W_j^- = \emptyset$ is required and $R_j^-$ will be considered for the elimination.
We can unify these two cases by introducing an equivalent formulation of the procedure of sequential learning. It always requires $W_j^+ = \emptyset$ after each insertion in the hidden layer; the class of the removed patterns is now determined by a proper definition of the sets $Q_j^+$ and $Q_j^-$ for the training of the neuron $X_j$. In this formulation the main steps of the algorithm are the following:
The activation functions σx and σy , respectively for hidden and output neurons, can be different, but they must
provide binary values in the set $\{-1,+1\}$. Consequently, the internal representations $S^\mu$ also have binary components.
The procedure of sequential learning adds hidden units,
following a suitable rule, until all the relations contained
in the training set are satisfied. The standard version, proposed by Marchand et al. [6] applies only to neural networks with single output (m = 1); obviously, we can always
construct a distinct network for each output, by simply iterating the basic algorithm, but in general the resulting
configuration has too many neurons and weights.
However, let us begin our examination from the case m =
1; we shall give later some solutions for approaching generic
output dimensions. Let P + and P − be the following sets:
1. An arbitrary variable sj is chosen in the set {−1, +1}.
2. A new hidden unit Xj is generated, starting from the
training set:
$$Q_j^+ = \begin{cases} P^+ & \text{if } s_j = +1 \\ P^- & \text{if } s_j = -1 \end{cases} \qquad\qquad Q_j^- = \begin{cases} P^- & \text{if } s_j = +1 \\ P^+ & \text{if } s_j = -1 \end{cases}$$
$$P^+ = \{\xi^\mu : \zeta^\mu = +1,\ \mu = 1,\dots,p\}$$
The constraint $W_j^+ = \emptyset$ must be satisfied in the generation.
3. The resulting set $R_j^+$ is subtracted from the current training set $\{P^+, P^-\}$.
These three steps are iterated until the current training set contains only patterns from one class ($P^+ = \emptyset$ or $P^- = \emptyset$). In practice, each neuron $X_j$ must be active ($S_j^\mu = +1$) for some patterns $\xi^\mu$ having $\zeta^\mu = s_j$ and must remain inactive ($S_j^\mu = -1$) for every pattern $\xi^\mu$ with $\zeta^\mu = -s_j$. Neurons satisfying this condition can always be found; for example, the grandmother cell of a pattern $\xi^\mu \in P^+$ verifies this property [6]. However, a neural network containing only grandmother cells in its hidden layer has no practical interest; so, a suitable training algorithm for the hidden neurons is required. This is the subject of Section III.

Now we are interested in the choice of output weights $v_j$, $j = 0,1,\dots,h$, that correctly satisfy all the input-output relations contained in the training set. After the iterated execution of the three main steps above, a possible assignment for the weights $v_j$ is the following [6]:
$$v_0 = \sum_{j=1}^{h} v_j - s_h\ ; \qquad v_j = s_j\, 2^{h-j} \qquad (2)$$
for $j = 1,\dots,h$. Unfortunately, these values grow exponentially with the number $h$ of hidden neurons; thus, even for small resulting networks, the required range of values makes both the simulation on a conventional computer and the implementation on a physical device extremely difficult or impossible.

To overcome this problem, let us subdivide the hidden neurons into $g$ groups, each containing adjacent units $X_j$ with the same value of $s_j$. More formally, if $h_l$, $l = 1,\dots,g$, is the index of the last neuron belonging to the $l$-th group, then we have:
$$s_{j+1} = s_j \qquad \text{for } j = h_{l-1}+1,\dots,h_l-1,\ \ l = 1,\dots,g$$
having set $h_0 = 0$ by definition. The following result is then valid:

Theorem 1: A correct choice for the output weights $v_j$ in the procedure of sequential learning is the following:
$$v_0 = \sum_{j=1}^{h} v_j - s_h\ ; \qquad v_j = s_j \left( \sum_{i=h_l+1}^{h} |v_i| + 1 \right) \qquad (3)$$
for $j = h_{l-1}+1,\dots,h_l$, $l = 1,\dots,g$.

Proof. We show that (3) leads to a correct output $O^\mu$ for a generic $\xi^\mu \in P^+$; the complementary case ($\xi^\mu \in P^-$) can be treated in a similar way. If $\xi^\mu \in P^+$, then, from the iterated execution of the main steps, we obtain two possible situations:
1. There exists $j^*$ ($1 \le j^* \le h$) such that $s_{j^*} = +1$ and $S_{j^*}^\mu = +1$.
2. $\xi^\mu$ belongs to the residual training set when the execution stops; thus we have $s_h = -1$.

In the first case let $l^*$ denote the group of hidden neurons containing $X_{j^*}$ ($1 \le l^* \le g$). Then we have:
$$s_j = s_{j^*} = +1 \qquad \text{for } j = h_{l^*-1}+1,\dots,h_{l^*}$$
whereas:
$$S_j^\mu = -1 \qquad \text{for } j = 1,\dots,j^*-1 \qquad (4)$$
So, the input to the output neuron $Y$ is given by:
$$v_0 + \sum_{j=1}^{h} v_j S_j^\mu = \sum_{j=1}^{h} v_j - s_h + \sum_{j=1}^{h} v_j S_j^\mu = \sum_{j=1}^{h} \left(1 + S_j^\mu\right) v_j - s_h$$
By virtue of (4) this becomes:
$$= \left(1 + S_{j^*}^\mu\right) v_{j^*} + \sum_{j=j^*+1}^{h_{l^*}} \left(1 + S_j^\mu\right) v_j + \sum_{j=h_{l^*}+1}^{h} \left(1 + S_j^\mu\right) v_j - s_h$$
and, since $S_{j^*}^\mu = +1$,
$$= 2 s_{j^*} \sum_{i=h_{l^*}+1}^{h} |v_i| + 2 s_{j^*} + \sum_{j=j^*+1}^{h_{l^*}} \left(1 + S_j^\mu\right) v_j + \sum_{j=h_{l^*}+1}^{h} \left(1 + S_j^\mu\right) v_j - s_h$$
$$\ge 2 \sum_{i=h_{l^*}+1}^{h} |v_i| + 2 - 2 \sum_{j=h_{l^*}+1}^{h} |v_j| - 1 \ge 1 > 0$$
since $s_{j^*} = +1$, the terms of the sum over $j = j^*+1,\dots,h_{l^*}$ are nonnegative ($s_j = +1$ implies $v_j > 0$ there), $(1 + S_j^\mu) v_j \ge -2|v_j|$ for the remaining units, and $-s_h \ge -1$. Thus, by applying (1) we obtain $O^\mu = \zeta^\mu = +1$, as desired.

In the second case we have instead:
$$S_j^\mu = -1 \qquad \text{for } j = 1,\dots,h$$
from which:
$$v_0 + \sum_{j=1}^{h} v_j S_j^\mu = \sum_{j=1}^{h} v_j - s_h - \sum_{j=1}^{h} v_j = -s_h = 1 > 0$$
and again $O^\mu = \zeta^\mu = +1$.

From (3) we obtain two extreme cases:
• If all the $s_j$ are equal, then the output weights $v_j$ have constant (binary) values, whereas the bias $v_0$ grows linearly with $h$.
Fig. 1. Natural extension of the procedure of sequential learning.

Fig. 2. Extension with output training of the procedure of sequential learning.
1. For every $k = 1,\dots,m$ take the set $U_k$ given by:
$$U_k = \begin{cases} P_k^+ & \text{if } s_{kj} = +1 \\ P_k^- & \text{if } s_{kj} = -1 \end{cases}$$
• If the $s_j$ vary alternately, that is, $s_{j+1} = -s_j$ for $j = 1,\dots,h-1$, then we return to the standard assignment (2).
2. Put in $K$ (initially empty) the output index $k$ corresponding to the set $U_k$ with the greatest number of elements and set $Q_h^+ = U_k$.
3. Modify the sets $U_k$, $k = 1,\dots,m$, in the following way:
$$U_k = U_k \cap Q_h^+$$
Since the values of the variables sj can be chosen in an
arbitrary way, we have a method for controlling the growth
of output weights.
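As an end-to-end illustration of the single-output procedure, the following sketch is ours (not code from the paper): it uses window-type grandmother cells as the hidden-unit trainer, which the text offers only as an existence argument, and assigns the output weights by the exponential rule (2).

```python
from itertools import product

def phi(x, delta=0.5):
    # Window activation, eq. (6): +1 iff |x| <= delta.
    return 1 if abs(x) <= delta else -1

def psi(x):
    # Sign activation, eq. (1).
    return 1 if x >= 0 else -1

def sequential_learning(train):
    """train: list of (xi, zeta), xi a +/-1 tuple, zeta in {-1,+1}.
    Hidden units are grandmother cells (w0 = -n, wi = xi_i), so every
    iteration removes exactly one pattern; output weights follow rule (2)."""
    n = len(train[0][0])
    P = {+1: [x for x, z in train if z == +1],
         -1: [x for x, z in train if z == -1]}
    units, s = [], []
    while P[+1] and P[-1]:
        sj = +1 if len(P[+1]) >= len(P[-1]) else -1   # arbitrary choice of s_j
        target = P[sj][0]
        units.append([-n] + list(target))             # grandmother cell of target
        s.append(sj)
        P[sj].remove(target)                          # R_j^+ = {target}
    h = len(units)
    v = [s[j] * 2 ** (h - 1 - j) for j in range(h)]   # v_j = s_j 2^(h-j)
    v0 = sum(v) - s[-1]                               # v_0 = sum_j v_j - s_h
    def predict(xi):
        S = [phi(u[0] + sum(wi * x for wi, x in zip(u[1:], xi))) for u in units]
        return psi(v0 + sum(vj * Sj for vj, Sj in zip(v, S)))
    return predict

# The 4-input parity function: every pattern ends up in its own hidden unit.
train = [(xi, 1 if xi.count(1) % 2 else -1) for xi in product((-1, 1), repeat=4)]
net = sequential_learning(train)
```

The resulting network is exactly the impractical grandmother-cell net the text warns about, but it exercises the three main steps and the correctness of assignment (2) end to end.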
A. Generalization of the Procedure of Sequential Learning
The procedure of sequential learning can be extended in
two ways in order to construct binary feedforward neural
networks with a generic number of outputs. The difference
between the two extensions lies in the generation of the
output weights vkj . In the first method the assignment (3)
is naturally generalized whereas in the latter the weights
vkj are found by a proper training algorithm.
In this last case the computing time needed for the construction of the whole network is higher, but such drawback is often balanced by a greater compactness of resulting
architectures and consequently by a better generalization
ability. On the other hand, the availability of a fast method
for the construction of binary neural networks, starting
from a generic training set, is of great practical interest.
In a natural extension of the procedure of sequential
learning to the case m > 1, the variables sj must be replaced by a matrix [skj ], k = 1, . . . , m, j = 1, . . . , h, filled
by values in the set {−1, 0, +1}. In fact, a hidden neuron Xj could push towards positive values an output neuron (skj = +1) and towards negative values another unit
(skj = −1). The choice skj = 0 means that no interaction
exists between the units Xj and Yk (consequently vkj = 0).
For the same reason we must consider a number gk of
groups for each output and the last neuron of each group
explicitly depends on the output it refers to. Thus we have
a matrix of indexes [hkl ], k = 1, . . . , m, l = 1, . . . , gk , in
which the length of each row depends on the output k.
Furthermore, the procedure starts from $m$ pairs of pattern sets $P_k^+$ and $P_k^-$, $k = 1,\dots,m$, obtained from the input-output relations of the training set in the following way:
4. Let $k' \notin K$ be the output associated with the set $U_{k'}$ of greatest size. If the number of patterns in $U_{k'}$ exceeds a given threshold $\tau$, then put $k'$ in the set $K$ and repeat steps 2-4; otherwise stop.
This simple method leads to the construction of hidden layers with a limited number of neurons, in general much smaller than in the networks obtained by repeated applications of the standard procedure for $m = 1$. In the simulations we have always used this method with the choice $\tau = n$.
Theorem 1 can easily be extended to this case and shows
the correctness of this approach: it is able to construct
a two-layer neural network satisfying all the input-output
relations contained in a given training set.
More compact nets can be obtained in many cases by
applying a suitable training algorithm for the output layer.
This second technique also ensures the feasibility of the neural network but, as shown later, its convergence is only asymptotic from a theoretical point of view. However, in most practical cases the training time turns out to be acceptable.
A possible implementation of this second extension is shown in fig. 2. The choice of the variables $s_k$ is made only at the first step; the user cannot modify their values during the construction of the hidden layer. The dynamic adaptation of these quantities is provided by the auxiliary variables $t_k$, whose sign is indirectly affected by the training algorithm for the output neurons (through the sets $V_k^+$ and $V_k^-$).
Step 4 chooses the output index $k'$ that determines the addition of a new hidden unit $X_j$. This neuron must be active ($S_j^\mu = +1$) for some patterns $\xi^\mu \in Q_{k'}^+$ and provide $S_j^\mu = -1$ for all the patterns $\xi^\mu \in Q_{k'}^-$, as in the standard version. The disjointness of the sets $R_j^+$ obtained by subsequent choices of the same output index $k$ is guaranteed by the updating of $U_k^+$ and $U_k^-$ at step 6.
The sets Tk+ and Tk− contain internal representations for
the input patterns of Pk+ and Pk− . They are obtained at
step 7 through the following relations:
$$T_k^+ = \{S^\mu : \xi^\mu \in P_k^+\} \qquad\qquad T_k^- = \{S^\mu : \xi^\mu \in P_k^-\}$$
$$P_k^+ = \{\xi^\mu : \zeta_k^\mu = +1,\ \mu = 1,\dots,p\} \qquad\qquad P_k^- = \{\xi^\mu : \zeta_k^\mu = -1,\ \mu = 1,\dots,p\}$$
With these notations a natural extension of the procedure of sequential learning is shown in fig. 1.
At step 4 the set of input patterns $Q_h^+$ for the training of a new hidden unit $X_h$ is determined in such a way that:
$$Q_h^+ \subset \begin{cases} P_k^+ & \text{if } s_{kj} = +1 \\ P_k^- & \text{if } s_{kj} = -1 \end{cases}$$
where the components of $S^\mu$ are given by:
$$S_j^\mu = \sigma_x\left(\sum_{i=0}^{n} w_{ji}\,\xi_i^\mu\right) \qquad \text{for } j = 1,\dots,h$$
The inclusion above must hold for any $k \in K$, where $K$ is a subset of output indexes. Among the possible ways of determining $Q_h^+$ and $K$, a simple method giving good results is the following:
After the application of the training algorithm for output
neurons, we obtain for every unit Yk two sets Vk+ and Vk− :
$$V_k^+ = \{\xi^\mu \in P_k^+ : O_k^\mu = \zeta_k^\mu = +1\} \qquad\qquad V_k^- = \{\xi^\mu \in P_k^- : O_k^\mu = \zeta_k^\mu = -1\}$$
The two extended versions of the procedure of sequential learning are practically useful only if a fast training algorithm for the hidden neurons is available. This algorithm must correctly classify all the input patterns belonging to a given class and some input patterns belonging to the opposite class. No method in the current literature, except that proposed in [6], pursues this objective; so, a suitable technique will be described in the following section.
Such a training algorithm plays a fundamental role in this
second extension of the procedure of sequential learning.
If the output layer is trained by an algorithm that finds,
at least in an asymptotic way, the optimal configuration
(i.e. the weight matrix [vkj ] which makes the minimum
number of errors in the current training set), then correct
binary neural networks are always constructed.
Algorithms of this kind are available in the literature; the
most popular is probably the pocket algorithm [9]. It can
be shown that the probability of reaching the optimal configuration approaches unity as the training time increases.
Unfortunately, there is no bound known for the training
time actually required and other training algorithms, less
good from a theoretical point of view, but more efficient,
are often used [12].
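As a concrete reference, here is a minimal pocket-style trainer in the spirit of Gallant's algorithm [9]. This sketch is ours, not code from the paper, and for reproducibility it cycles through the patterns deterministically instead of drawing them at random:

```python
def pocket_train(patterns, targets, epochs=50):
    """Perceptron updates with a 'pocket': the weight vector with the fewest
    training-set errors seen so far is kept aside and returned.
    patterns: +/-1 tuples with the bias component in position 0."""
    psi = lambda x: 1 if x >= 0 else -1
    act = lambda w, x: psi(sum(wi * xi for wi, xi in zip(w, x)))
    errors = lambda w: sum(act(w, x) != t for x, t in zip(patterns, targets))
    w = [0] * len(patterns[0])
    pocket, pocket_err = list(w), errors(w)
    for _ in range(epochs):
        for x, t in zip(patterns, targets):
            if act(w, x) != t:
                w = [wi + t * xi for wi, xi in zip(w, x)]   # perceptron step
                e = errors(w)
                if e < pocket_err:                          # ratchet on improvement
                    pocket, pocket_err = list(w), e
    return pocket, pocket_err

# Logical AND of two inputs: linearly separable, so zero errors are reachable.
X = [(1, 1, 1), (1, 1, -1), (1, -1, 1), (1, -1, -1)]
T = [1, -1, -1, -1]
w, err = pocket_train(X, T)
```

On non-separable data the same loop simply returns the best weight vector encountered, which is exactly the behavior the output-training extension relies on.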
However, the properties of the pocket algorithm allow us to formulate the following result:
III. The Window Neuron
Currently available constructive methods for binary training sets build neural networks that exclusively contain threshold units. In these neurons the output is given by (1), reported here:
$$\zeta = \psi\left(\sum_{i=0}^{n} w_i \xi_i\right) = \begin{cases} +1 & \text{if } \sum_{i=0}^{n} w_i \xi_i \ge 0 \\ -1 & \text{otherwise} \end{cases}$$
As mentioned above, a new component $\xi_0 = +1$ is added to the input vector $\xi$, in order to treat the neuron bias as a normal weight. In the following we use the term threshold network to indicate a two-layer perceptron containing only threshold units. A well-known result is the following
[2]: given a training set made by p binary input-output
relations (ξ µ , ζ µ ), µ = 1, . . . , p, it is always possible to find
a threshold network that correctly satisfies these relations.
Now, let us introduce a new kind of neuron having a
window-shaped activation function; its output is given by:
$$\zeta = \varphi\left(\sum_{i=0}^{n} w_i \xi_i\right) = \begin{cases} +1 & \text{if } \left|\sum_{i=0}^{n} w_i \xi_i\right| \le \delta \\ -1 & \text{otherwise} \end{cases} \qquad (6)$$
Theorem 2: The extension with output training of the
procedure of sequential learning is (asymptotically) able to
construct a two-layer perceptron satisfying all the input-output relations contained in a given binary training set.
Proof. Let us refer to the implementation in fig. 2; the
repeated execution of steps 3–9 causes the addition of some
hidden neurons for every output. Let Ik denote the set of
indexes of the hidden units generated when step 4 chooses
the k-th output (Ik ⊂ {1, . . . , h}). Moreover, let Rj+ , for
j ∈ Ik , be the set of input patterns correctly classified by
the neurons Xj .
By construction we have:
$$R_i^+ \cap R_j^+ = \emptyset \qquad \forall i \ne j,\ \ i,j \in I_k$$
The real value $\delta$ is called the amplitude and is meaningful only from an implementation point of view. For the sake of simplicity, throughout the description we could set $\delta = 0$; but when the computations are made by a machine (a computer or dedicated hardware), the summation in (6) can deviate from its theoretical value because of precision errors. Thus, the introduction of a small amplitude $\delta$ allows the practical use of the window neuron.
A window neuron can always be substituted by three
threshold neurons; in fact, the output of a generic window
neuron is given by:
$$\zeta = \varphi\left(\sum_{i=0}^{n} w_i \xi_i\right) =$$
Thus, in the worst case, all the input patterns belonging to $P_k^+$ or $P_k^-$ will be contained in the union of the sets $R_j^+$, for $j \in I_k$. Let us suppose, without loss of generality, that there exists a subset $J \subset I_k$ such that:
$$P_k^+ = \bigcup_{j \in J} R_j^+$$
In this case, as derived from theorem 1, the neuron Yk
correctly classifies all the input patterns in the training
set, with regard to the k-th output, if the following choice
for the weights vkj is made:
$$v_{kj} = \begin{cases} +1 & \text{for } j \in J \\ 0 & \text{otherwise} \end{cases}\ ; \qquad v_{k0} = |J| - 1 \qquad (5)$$
$$= \psi\left(\sum_{i=0}^{n} w_i \xi_i + \delta\right) - \psi\left(\sum_{i=0}^{n} w_i \xi_i - \delta\right) - 1$$
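The identity above can be checked numerically. In this sketch (ours, not from the paper) the weighted sum is integer-valued, so an amplitude $\delta = 0.5$ keeps it away from the two thresholds and the window output coincides with its three-threshold decomposition everywhere:

```python
def psi(x):
    # Threshold (sign) activation, eq. (1).
    return 1 if x >= 0 else -1

def phi(x, delta=0.5):
    # Window activation, eq. (6).
    return 1 if abs(x) <= delta else -1

def window_via_thresholds(x, delta=0.5):
    # Window output rebuilt from two threshold units and a constant.
    return psi(x + delta) - psi(x - delta) - 1
```

With binary inputs and integer weights the sum $\sum_i w_i \xi_i$ is always an integer, so $\delta = 0.5$ never lands exactly on a threshold boundary and the two expressions agree on every achievable sum.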
On the contrary, it seems that a direct correspondence in the opposite sense does not exist. Nevertheless, the following result shows the generality of the window neuron:
where $|J|$ is the number of elements in the set $J$. The properties of the pocket algorithm ensure that solution (5) can be found asymptotically.
Theorem 3: It is always possible to find a two-layer perceptron, containing only window neurons in the hidden layer,
that correctly satisfies all the input-output relations of a
given binary training set.
Proof. The construction follows the same steps as for
threshold networks. Let (ξµ , ζ µ ), µ = 1, . . . , p, be the p
binary input-output relations contained in a given training
set. For the sake of simplicity, let us consider the case m = 1
(output pattern ζ µ with single binary value); the neural
network for generic m can be obtained by iterating the
following procedure.
Let $s$ be the output value ($-1$ or $+1$) associated with the smaller number of input patterns in the training set. For every $\xi^\mu$ having $\zeta^\mu = s$ a window neuron $X_j$ is added to the hidden layer with weights:
$$w_{j0} = -n\ ; \qquad w_{ji} = \xi_i^\mu \qquad \text{for } i = 1,\dots,n$$
Theorem 4: It is always possible to construct a window
neuron that provides the desired outputs for a given set of
linearly independent input patterns.
Proof. Consider the matrix $A$, of size $p \times (n+1)$, whose rows are formed by the input patterns $\xi^\mu$ of the training set, and let $r$ be its rank. If $q \le r$, let $B$ denote a nonsingular minor containing $q$ linearly independent input patterns of the training set. Suppose, without loss of generality, that $B$ is formed by the first $q$ rows ($j = 1,\dots,q$) and the first $q$ columns ($i = 0,1,\dots,q-1$) of $A$.
Then, consider the following system of linear algebraic
equations:
$$B\,w = z \qquad (7)$$
Such a unit is a grandmother cell [2] for the pattern $\xi^\mu$ (it is active only when $\xi^\mu$ itself is presented), as can be shown by simple inspection. Now, a threshold output neuron performing the logic operation OR (NOR) if $s = +1$ ($s = -1$) completes the construction.
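The grandmother-cell construction is easy to verify exhaustively; the sketch below (ours) checks that the unit with $w_{j0} = -n$ and $w_{ji} = \xi_i^\mu$ is active on $\xi^\mu$ and on no other binary pattern:

```python
from itertools import product

def phi(x, delta=0.5):
    # Window activation, eq. (6).
    return 1 if abs(x) <= delta else -1

def grandmother(mu):
    """Window unit with w0 = -n, wi = mu_i: for an input at Hamming
    distance d from mu the weighted sum equals -2d, so only d = 0 fires."""
    n = len(mu)
    return lambda xi: phi(-n + sum(m * x for m, x in zip(mu, xi)))

mu = (1, -1, 1, 1, -1)
cell = grandmother(mu)
active = [xi for xi in product((-1, 1), repeat=5) if cell(xi) == 1]
```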
where the $j$-th component of the vector $z$ is given by:
$$z_j = 1 - \zeta^{\mu_j} - \sum_{i=q}^{n} w_i \xi_i^{\mu_j} \qquad (8)$$
Here $(\xi^{\mu_j}, \zeta^{\mu_j})$ is the input-output relation of the training set associated with the $j$-th row of $A$, and the weights $w_i$, for $i = q,\dots,n$, are arbitrarily chosen. By solving (7) we obtain the weights $w_i$ for a window neuron that correctly satisfies the $q$ relations $(\xi^{\mu_j}, \zeta^{\mu_j})$, $j = 1,\dots,q$. In fact, from (8) we have:
$$\sum_{i=0}^{n} w_i \xi_i^{\mu_j} = \sum_{i=0}^{q-1} w_i \xi_i^{\mu_j} + \sum_{i=q}^{n} w_i \xi_i^{\mu_j} = 1 - \zeta^{\mu_j}$$
hence:
$$\varphi\left(\sum_{i=0}^{n} w_i \xi_i^{\mu_j}\right) = \varphi\left(1 - \zeta^{\mu_j}\right) = \zeta^{\mu_j} \qquad \text{for } j = 1,\dots,q$$
if $0 \le \delta < 2$.

A two-layer perceptron with window neurons in the hidden layer will be called a window network in the following. Two results are readily determined:
• The parity operation [13] can be realized by a window network containing $\lfloor (n+1)/2 \rfloor$ hidden units. A possible choice of weights is the following:
$$w_{j0} = n - 4j + 2\ ; \quad w_{ji} = +1\ ; \qquad v_0 = \left\lfloor \frac{n-1}{2} \right\rfloor\ ; \quad v_j = +1$$
for $i = 1,\dots,n$ and $j = 1,\dots,\lfloor (n+1)/2 \rfloor$. In this configuration the $j$-th hidden unit is active if the input pattern contains $2j-1$ components with value $+1$. Then, the output neuron executes a logic OR and produces the correct result.
• A single window neuron performs the symmetry operation [13]. A possible choice of weights is the following:
$$w_0 = 0\ ; \qquad w_i = \begin{cases} 2^{\lfloor n/2 \rfloor - i} & \text{for } i = 1,\dots,\lfloor n/2 \rfloor \\ 0 & \text{for } i = (n+1)/2, \text{ if } n \text{ odd} \\ -w_{n-i+1} & \text{for } i = n - \lfloor n/2 \rfloor + 1,\dots,n \end{cases}$$
In these cases window networks are considerably more compact than the corresponding threshold networks. Unfortunately, this is not a general result, since there exist linearly separable training sets that lead to more complex window networks.

Then, the main objective of correctly classifying all the input patterns $\xi^\mu$ having $\zeta^\mu = -1$ can be reached by following two steps:
1. Put in $B$ only patterns $\xi^{\mu_j}$ with $\zeta^{\mu_j} = +1$.
2. Search for a window neuron that provides output $-1$ for the greatest number of patterns not contained in the minor $B$.
The following theorem offers an operative approach:

Theorem 5: If the minor $B$ has dimension $q \le n$ and contains only patterns $\xi^{\mu_j}$, $j = 1,\dots,q$, for which $\zeta^{\mu_j} = +1$, then the window neuron obtained by solving (7) gives output $+1$ for all the input patterns linearly dependent on $\xi^{\mu_1},\dots,\xi^{\mu_q}$. Moreover, if the arbitrary weights $w_i$, $i = q,\dots,n$, are linearly independent in $\mathbb{R}$ as a $\mathbb{Q}$-vector space, then every input pattern which is linearly independent of $\xi^{\mu_1},\dots,\xi^{\mu_q}$ in $\mathbb{Q}^n$ yields output $-1$.
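Returning to the parity and symmetry networks above, both constructions can be verified exhaustively. The sketch below is ours; it instantiates them for $n = 5$ and $n = 4$ respectively, with $\delta = 0.5$ since all weighted sums are integers:

```python
from itertools import product

psi = lambda x: 1 if x >= 0 else -1
phi = lambda x, delta=0.5: 1 if abs(x) <= delta else -1

def parity_net(xi):
    # Hidden unit j (1-based) fires iff the input has exactly 2j-1 components
    # equal to +1; the output is a logic OR: v0 = floor((n-1)/2), v_j = +1.
    n = len(xi)
    h = (n + 1) // 2
    S = [phi(n - 4 * j + 2 + sum(xi)) for j in range(1, h + 1)]
    return psi((n - 1) // 2 + sum(S))

def symmetry_neuron(xi):
    # Single window unit: powers of two on the left half, mirrored with
    # opposite sign on the right half; the sum vanishes only on palindromes.
    n = len(xi)
    w = [2 ** (n // 2 - i) for i in range(1, n // 2 + 1)]
    w += [0] * (n % 2)
    w += [-wi for wi in reversed(w[:n // 2])]
    return phi(sum(wi * x for wi, x in zip(w, xi)))
```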
A. The Training Algorithm for Window Neurons
Given a generic training set (ξ µ , ζ µ ), µ = 1, . . . , p, we
wish to find a learning algorithm that provides the weights
wi , i = 0, 1, . . . , n, for a window neuron that correctly classifies all the patterns ξµ for which ζ µ = −1 and the maximum number of patterns ξµ for which ζ µ = +1.
The following result plays a fundamental role:
Proof. Since $\zeta^{\mu_j} = +1$ for $j = 1,\dots,q$, from (7) we obtain:
$$\sum_{i=0}^{n} w_i \xi_i^{\mu_j} = 0 \qquad \text{for } j = 1,\dots,q$$
Fig. 3. Training algorithm for window neurons.
Now, consider an input pattern $\xi^\nu$ not contained in the minor $B$; the $q+1$ vectors $(\xi_0^\nu, \xi_1^\nu, \dots, \xi_{q-1}^\nu)^t$, $(\xi_0^{\mu_1}, \xi_1^{\mu_1}, \dots, \xi_{q-1}^{\mu_1})^t$, ..., $(\xi_0^{\mu_q}, \xi_1^{\mu_q}, \dots, \xi_{q-1}^{\mu_q})^t$ are linearly dependent in $\mathbb{Q}^q$, $\mathbb{Q}$ being the rational field ($t$ denotes transposition). Thus, there exist constants $\lambda_1,\dots,\lambda_q \in \mathbb{Q}$, not all zero, such that:
$$\xi_i^\nu = \sum_{j=1}^{q} \lambda_j \xi_i^{\mu_j} \qquad \text{for } i = 0,1,\dots,q-1 \qquad (9)$$
computing time. However, good results can be reached by employing a simple greedy procedure: the minor $B$ is built by progressively adding the row and the column of $A$ that maximize the number of patterns with positive output correctly classified by the corresponding window neuron. Obviously, the number of correctly classified patterns with negative output must be kept at its maximum.
A detailed description of the resulting training algorithm is shown in fig. 3. It employs two sets $I$ and $J$ for the progressive construction of the minor $B$; these sets are carefully enlarged in order to maximize the number of patterns with correct output. The algorithm stops when it can no longer successfully add rows and columns to $B$.
This learning method can easily be modified if different goals are to be pursued. Its insertion into the procedure of sequential learning is straightforward and leads to a general technique, called Sequential Window Learning (SWL), which exhibits interesting features, as we will see in the tests.
and consequently:
$$\sum_{i=0}^{n} w_i \xi_i^\nu = \sum_{i=0}^{n} w_i \xi_i^\nu - \sum_{j=1}^{q} \lambda_j \sum_{i=0}^{n} w_i \xi_i^{\mu_j} = \sum_{i=q}^{n} w_i \left( \xi_i^\nu - \sum_{j=1}^{q} \lambda_j \xi_i^{\mu_j} \right) \qquad (10)$$
If the patterns $\xi^\nu, \xi^{\mu_1}, \dots, \xi^{\mu_q}$ are linearly dependent in $\mathbb{Q}^n$, then the right-hand term of (10) is null for some rational constants $\lambda_j$, $j = 1,\dots,q$, that satisfy (9). Hence, the corresponding output of the window neuron is $+1$.
On the contrary, if the vectors $\xi^\nu, \xi^{\mu_1}, \dots, \xi^{\mu_q}$ are linearly independent in $\mathbb{Q}^n$, then there exists at least one index $i$ ($q \le i \le n$) such that:
$$\xi_i^\nu \ne \sum_{j=1}^{q} \lambda_j \xi_i^{\mu_j}$$
But the terms $\xi_i^\nu - \sum_{j=1}^{q} \lambda_j \xi_i^{\mu_j}$ are rational numbers; so, if the $w_i$, $i = q,\dots,n$, are linearly independent in $\mathbb{R}$ as a $\mathbb{Q}$-vector space, we obtain:
$$\sum_{i=0}^{n} w_i \xi_i^\nu = \sum_{i=q}^{n} w_i \left( \xi_i^\nu - \sum_{j=1}^{q} \lambda_j \xi_i^{\mu_j} \right) \ne 0$$
• To increase the algorithm locality, improving the generalization ability of the produced neural network in real-world applications [10].
• To find redundant inputs which do not affect one or more outputs. By removing the corresponding connections, the number of weights in the resulting network, and consequently its complexity, is reduced. This also improves the generalization ability of the system [11], [15].
Hence, the output of the window neuron will be −1 if the
amplitude δ is small enough.
A natural way of proceeding is to group the patterns belonging to the same class that are close to each other according to the Hamming distance. This produces some clusters in the input space which determine the extension of each class. The clustering must be performed in such a way that it can be directly inserted in the SWL algorithm.
The concept of template plays a fundamental role in HC;
it is very similar to the concept of schema widely used in
the theory of genetic algorithms [16]. Let us denote with
the symbols ’+’ and ’−’ the two binary states (corresponding to the integer values +1 and −1); with this notation
the pattern ξ = (+1, −1, −1, +1, −1)t is equivalent to the
string + − − + −.
In HC a template is a string of binary components that
can also contain don’t care symbols, denoted by the symbol ’0’. A template represents a set of patterns which are
obtained by expanding the don’t care symbols. For example, in the space of binary patterns with five components,
the template +0 − 0− is equivalent to the set {+ − − − −,
Several choices for the real numbers wi are available in
the literature, but in most cases they require a representation range that is too large for a practical implementation.
In the simulations we have used the following choice [14]:
√
wi = γi for i = 0, 1, 2, . . .
where γi are positive integer squarefree numbers (not divisible for a perfect square), sorted in increasing order (by
definition γ0 = 1).
Theorem 5 gives a method of reaching our desired result:
to correctly satisfy all the input-output relations (ξ µ , ζ µ )
for which ζ µ = −1 and some pairs (ξ µ , ζ µ ) having ζ µ = +1.
The final step is to maximize the whole total number of
correct outputs.
With high probability, the optimal configuration could
be reached only through an exhaustive inspection of all the
possible solutions; unfortunately, this leads to a prohibitive
7
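The expansion of a template into its equivalent set of patterns can be sketched in a few lines (an illustrative sketch, not code from the paper; `expand` is a hypothetical helper and the ASCII character '-' stands for the symbol '−'):

```python
# Illustrative sketch: expand the don't care symbols ('0') of a template
# into its equivalent set of binary patterns ('+'/'-' strings).
from itertools import product

def expand(template):
    """Return the set of patterns represented by a template string."""
    free = [i for i, c in enumerate(template) if c == '0']
    patterns = set()
    for values in product('+-', repeat=len(free)):
        chars = list(template)
        for i, v in zip(free, values):
            chars[i] = v
        patterns.add(''.join(chars))
    return patterns

# The template +0-0- used in the text expands to four patterns.
print(sorted(expand('+0-0-')))
```

For the template +0−0− this yields exactly the four patterns of the equivalent set listed above; a template with k don't care symbols represents 2^k patterns.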
The template shows two important properties:
1. It has a direct correspondence with a logic and among the pattern components. For example, the template +0−0− above corresponds to the operation $\xi_1$ and $\bar\xi_3$ and $\bar\xi_5$, where ξ1, ξ3 and ξ5 are respectively the first, the third and the fifth components of a generic pattern ξ, while a bar indicates the logic operation not. Only the patterns in the equivalent set generate the value +1 as a result.
2. The construction of a window neuron that performs the and associated with a given template is straightforward. Every weight must be set to the corresponding value in the template, whereas the bias must be equal to the number of binary variables in the template, changed in sign. For example, the template +0−0− corresponds to the window neuron having weights:

w0 = −3 , w1 = +1 , w2 = 0 , w3 = −1 , w4 = 0 , w5 = −1

This unit is active only for the patterns in the equivalent set of the template above and realizes the desired logic operation and.

HC proceeds by extending and subdividing clusters of templates; every cluster contains templates having don't care symbols in the same locations. Let P+ and P− be the sets of patterns forming the given training set; the method initially considers only one cluster containing all the patterns of P+. Then, this cluster undergoes the following two actions:

Extension: each binary component in the cluster is replaced one at a time by the don't care symbol and the corresponding number of conflicts with the patterns of P− is computed.

Subdivision: the binary component with the minimum number of conflicts is considered. If this number is greater than zero, the cluster is subdivided into two subsets: the first contains the templates that do not lead to conflicts (with the selected binary component replaced by the don't care symbol) and the latter is formed by the remaining templates (unchanged).

These two actions are then applied to the resulting clusters and the procedure is iterated until no more extensions can be done.

The method is better understood by considering in detail a clarifying example. Let P+ and P− be the following sets of patterns:

P+ = { +−+−−, +−−−+, −++++, ++++−, −+−++ } ;  P− = { −−−−−, −+++−, ++−+−, −++−−, +−−+− }   (11)

generated by the logic operation:

ζ = (ξ1 and ξ3) or ξ5   (12)

Initially the method considers only one cluster; the following diagram shows the first execution of extension and subdivision:

  {+−+−−, +−−−+, −++++, ++++−, −+−++}
    Ext.  =⇒  conflicts:  ξ1: 1, ξ2: 0, ξ3: 1, ξ4: 0, ξ5: 1
    Subd. =⇒  {+0+−−, +0−−+, −0+++, +0++−, −0−++}

The selected component is ξ2, which does not generate conflicts with the patterns of P−; thus, at the end of the first subdivision, we obtain a cluster with five templates, each of which contains a don't care symbol in the second location. Another execution of extension and subdivision leads to the following situation:

  {+0+−−, +0−−+, −0+++, +0++−, −0−++}
    Ext.  =⇒  conflicts:  ξ1: 2, ξ3: 1, ξ4: 0, ξ5: 2
    Subd. =⇒  {+0+0−, +0−0+, −0+0+, −0−0+}

The component ξ2 has not been considered, since it now contains only don't care symbols. Again, the selected component ξ4 does not generate conflicts with the patterns of P−, and we obtain a single cluster with four templates. The next application of extension and subdivision yields the generation of two clusters as follows:

  {+0+0−, +0−0+, −0+0+, −0−0+}
    Ext.  =⇒  conflicts:  ξ1: 1, ξ3: 1, ξ5: 3
    Subd. =⇒  {00−0+, 00+0+}  and  {+0+0−}

Finally, a last iteration gives the desired result:

  {00−0+, 00+0+}
    Ext.  =⇒  conflicts:  ξ3: 0, ξ5: 2
    Subd. =⇒  {0000+}

  {+0+0−}
    Ext.  =⇒  conflicts:  ξ1: 1, ξ3: 1, ξ5: 0
    Subd. =⇒  {+0+00}

The resulting clusters directly correspond to the logic operation (12) from which the training set was derived. In fact, the template 0000+ (first cluster) corresponds to the component ξ5 that forms the second operand of the boolean or, whereas the template +0+00 corresponds to the first operand ξ1 and ξ3. The binary neural network providing the correct output ζ is then easily constructed by placing a hidden window neuron for each template and adding an output threshold neuron that performs the required operation or. Such a network is shown in fig. 4.

Fig. 4. Neural network corresponding to example (12).

This example shows how HC reaches the proposed goals: first of all, the boolean function generating the training set has been found; then, the number of connections in the resulting network has been minimized; finally, the redundant inputs have been determined. These three results are of great practical importance.
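Property 2 and the network built for example (12) can be checked with a small sketch (illustrative assumptions, not the paper's code: a window half-width of 0.5, and the output or realized by a simple any-test over the hidden units; `window_neuron`, `window_output` and `network` are hypothetical names):

```python
# Sketch: a window neuron is active (+1) when its weighted sum falls inside
# a small window around zero, and -1 otherwise. Following property 2, a
# template maps to a window neuron whose weights are the template values and
# whose bias is minus the number of binary (non-don't-care) components.
from itertools import product

def window_neuron(template):
    """Build (bias, weights) for the window neuron of a template string."""
    weights = [{'+': 1, '-': -1, '0': 0}[c] for c in template]
    bias = -sum(1 for c in template if c != '0')
    return bias, weights

def window_output(bias, weights, x, delta=0.5):
    s = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1 if abs(s) < delta else -1

# The two templates found by HC in the example: 0000+ and +0+00.
neurons = [window_neuron(t) for t in ('0000+', '+0+00')]

def network(x):
    # Output threshold neuron performing the logic or of the hidden units.
    return 1 if any(window_output(b, w, x) == 1 for b, w in neurons) else -1

# Check against the generating function (x1 and x3) or x5 on all 32 patterns.
for x in product((-1, 1), repeat=5):
    target = 1 if (x[0] == 1 and x[2] == 1) or x[4] == 1 else -1
    assert network(x) == target
```

Note how the template +0−0− indeed yields the weights w0 = −3, w1 = +1, w2 = 0, w3 = −1, w4 = 0, w5 = −1 given in the text.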
Furthermore, the equivalent threshold network can directly be built; so, HC has general validity. It generates and-or networks, in which the hidden layer performs and operations among the inputs and the final layer computes or operations among the outputs of the hidden neurons. Now, a fundamental result of the theory of logic networks states that every boolean function (hence every training set) can be written in the and-or form [17]. Consequently, the network above is general.

However, more compact neural networks can exist. For example, the well-known parity function needs 2^{n−1} hidden neurons in the and-or configuration, whereas, as shown in section III, a window network with ⌊(n+1)/2⌋ hidden units can provide the correct output.

This remark shows the importance of a deep integration between SWL and HC. The following definition introduces a possible approach: if a pattern of P+ belongs to the equivalent set of a given template, then we say that this template covers that pattern. For example, the template 0000+ covers the pattern +−−−+ in the training set (11). We then use the term covering to denote the number of patterns of P+ covered by the templates of a given cluster. With these definitions, a practical method for integrating HC and the training algorithm for window neurons follows these steps:
1. Starting from the pattern sets P+ and P− (training set for the window neuron), perform the Hamming Clustering and reach the final clusters of templates.
2. Choose the cluster having maximum covering and consider the binary components of this cluster (leave out don't care symbols).
3. Construct a new training set from P+ and P−, in which only the selected components are kept.
4. Generate the corresponding window neuron and remove the connections corresponding to the neglected components (by zeroing their weights).

Such a method performs the training of the window neuron on a reduced number of components and recovers the computing time spent in the execution of HC. Moreover, it can successfully be inserted in the procedure of sequential learning.

V. Tests and Results

The proposed techniques were tested on several benchmarks in order to point out their performance, also in comparison with other constructive methods. These groups of tests regard different situations, each of which emphasizes a particular characteristic of the learning algorithm.

The first group concerns the so-called exhaustive simulations: the whole truth table of a boolean function is given as a training set and the algorithm is required to minimize the number of weights in the resulting network.

The second set of trials refers to binary generalizations: the training set is binary and incomplete. SWL must return a configuration that minimizes the number of errors on a given test set.

Finally, the third group of simulations concerns problems of real generalization: in these cases the training set contains real patterns and is again incomplete. An A/D conversion is performed and the resulting set of binary patterns is used by SWL for generating a neural network that tries to minimize the number of errors on a given test set.

The reported computing times refer to a C code running on a DECStation 5200; they provide only a first indication of the convergence speed of the proposed techniques. In fact, a correct evaluation requires a comparison with other algorithms and consequently the definition of a proper environment in which the tests have to be made. For the sake of brevity, we defer such an evaluation to a subsequent paper.

A. Exhaustive simulations

Three groups of trials analyzed the quality of the configurations obtained by SWL. In these simulations HC was not used, because it improves only the generalization ability of a learning algorithm.

Parity Function: Trainings were made for n = 2, 3, ..., 10 and SWL was always able to find the optimal configuration with ⌊(n+1)/2⌋ hidden neurons. The time required for the construction of the network went from 0.02 seconds (n = 2) to 62 seconds (n = 10).

Symmetry Function: Also in this case the simulations with n = 2, 3, ..., 10 always yielded the minimum network, containing only one window neuron. Less than 2 seconds was sufficient for every trial.

Random Function: A group of tests involves the random function, whose output is given by a uniform probability distribution. It allows an interesting comparison with other constructive methods, like the tiling [4] and upstart [5] algorithms. Fig. 5 shows the number of units in the neural networks realized by the three techniques for n = 4, 5, ..., 10. Every point is the average of 50 trials with different training sets.

Fig. 5. Simulation results for the random function.

The networks generated by SWL are more compact, and the difference increases with the number n of inputs. This shows the efficiency of SWL in terms of configuration complexity. However, the computational task is very heavy: the required CPU time grows exponentially with n, from 0.03 seconds (n = 4) to 524 seconds (n = 10).

B. Binary Generalizations

Parity and symmetry functions were also used for the performance analysis on binary generalizations. A third group of trials concerns the monk problems [18], an interesting benchmark both for connectionist and symbolic methods.
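As a term of reference for the parity trials below, one way to realize the n-input parity with ⌊(n+1)/2⌋ window units can be sketched as follows (an illustrative construction, not the paper's own section III scheme: each hidden unit, with all weights +1, detects one odd count of +1 inputs, and parity is taken as +1 for an odd count):

```python
# Illustrative parity construction with floor((n+1)/2) window units:
# k inputs equal to +1 give the weighted sum 2k - n, so a window neuron
# with all weights +1 and bias -(2k - n) fires exactly for count k.
from itertools import product

def parity_network(n):
    """Return the hidden-unit biases, one per odd count of +1 inputs."""
    return [-(2 * k - n) for k in range(n + 1) if k % 2 == 1]

def parity_output(x, delta=0.5):
    s = sum(x)
    # or over the hidden window units (active when bias + s is in the window)
    return 1 if any(abs(b + s) < delta for b in parity_network(len(x))) else -1

# The number of hidden units matches the floor((n+1)/2) bound cited above.
for n in (2, 3, 4, 5):
    assert len(parity_network(n)) == (n + 1) // 2
    for x in product((-1, 1), repeat=n):
        expected = 1 if sum(v == 1 for v in x) % 2 == 1 else -1
        assert parity_output(x) == expected
```

The and-or form would instead need one hidden unit per odd-parity pattern, i.e. 2^{n−1} units, which is the gap discussed above.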
Parity Function: With a number n = 10 of inputs, we considered a training set of p randomly chosen patterns; the remaining 1024 − p patterns formed the test set. Fifty different trials were executed for each value of p = 100, 200, ..., 900, obtaining the correctness percentages shown in fig. 6.

Fig. 6. Generalization results for the parity function.

Since the parity function does not present any locality property (the value at a point is uncorrelated with the values in its neighborhood), the application of HC always leads to poorer performance.

Symmetry Function: Tests on the symmetry function present the same environment as for parity. Thus, the number of inputs was again n = 10 and fifty runs were made for each value of p = 100, 200, ..., 900. The results are shown in fig. 7.

Fig. 7. Generalization results for the symmetry function.

Also in this case the lack of locality makes HC useless; the obtained generalization ability is nevertheless very interesting.

Monk Problems: Three classification problems have been proposed [18] as benchmarks for learning algorithms. Since their inputs are not binary, a conversion was needed before the application of SWL. We used a bit for each possible value, so that the resulting binary neural network had n = 17 inputs and a single output. In this way the results in tab. 1 were obtained.

Here HC improves the performance on all the tests and reduces the number of weights in the networks, allowing an easier rule extraction. The computational effort is rather low: about 1 second for problems #1 and #2, about 5 seconds for problem #3.

C. Real Generalizations

Finally, two tests were devoted to classification problems with real inputs. In these cases the application of SWL (and of any other binary constructive method) is influenced by the A/D conversion used in the preprocessing phase. Among the variety of proposed methods, we chose the Gray code [17], which maps close integer numbers to similar binary strings (with respect to the Hamming distance).

Iris Problem: The dataset of Anderson-Fisher [19] for the classification of three species of Iris starting from petal and sepal dimensions is a classical test on real generalization. If we use an 8-bit Gray code for each real input, we obtain a training set containing 100 strings of length n = 32.

The neural network constructed by SWL with HC correctly classifies every pattern in the training set and makes three errors (the theoretical minimum) on the test set. Such a network contains 6 hidden neurons and only 37 connections in two layers. The CPU time needed for the construction is 10 seconds.

Circle Recognition: The last trial concerns the recognition of a circle from randomly placed points. Let us consider the pattern in fig. 8a, in which white and black zones correspond to positive and negative outputs respectively. 1000 random points formed the training set, while another 1000 points formed the test set. Since the areas of the two regions are equal, about 50% of the points in either set fall inside the circle.

Fig. 8. Original (a) and reconstructed (b) pattern for the test on circle recognition.

By using a Gray code with 6 bits for each real input, we obtained from SWL with HC a binary neural network containing 21 hidden units and 119 connections in two layers. No error was made on the training set, whereas 4.1% of the test set patterns were misclassified. This result shows that the quantization effects of the A/D conversion were overcome by HC, leading to an interesting generalization ability also with real inputs. The reconstructed pattern is presented in fig. 8b. Five seconds of computation was sufficient for obtaining this result.

VI. Conclusions

A new constructive method for the training of two-layer perceptrons with binary inputs and outputs has been presented. This algorithm is based on two main blocks:
• A modification of the procedure of sequential learning [6] with practical extensions to the construction of neural networks with any number of outputs.
• The introduction of a new kind of neuron, having a window-shaped activation function, which allows a fast and efficient training algorithm.

Tests on this procedure, called Sequential Window Learning (SWL), show interesting results both in terms of computational speed and compactness of the constructed networks.

Moreover, a new preprocessing algorithm, called Hamming Clustering (HC), has been introduced for improving the generalization ability of constructive methods. Its application is particularly useful in the case of training sets with high locality; again, some simulations have shown the characteristics of HC.

Currently, SWL and HC are being applied to real world problems, such as handwritten character recognition and genomic sequence analysis, giving some first promising results.

Problem   Use HC?   Neurons   Weights ≠ 0   % Test Errors
  #1        YES         3          11             0
  #1        NO          4          58             2.61
  #2        YES         1           7             0
  #2        NO          1          18             0
  #3        YES        10          46             7.87
  #3        NO          6          96            29.86

Tab. 1. Results for the application of SWL to the monk problems.
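As a note on the input encoding behind tab. 1, the one-bit-per-value conversion can be sketched as follows. The attribute cardinalities (3, 3, 2, 3, 4, 2) are taken from the standard MONK's problems description and are an assumption here; the text above only states the resulting total of n = 17 inputs.

```python
# Illustrative sketch of the one-bit-per-value (one-hot) conversion used
# for the monk problems. Assumption: the six attributes take 3, 3, 2, 3,
# 4 and 2 values respectively, giving the 17 binary inputs reported.

CARDINALITIES = (3, 3, 2, 3, 4, 2)

def encode(values):
    """Map a tuple of 1-based attribute values to a ±1 vector of length 17."""
    bits = []
    for v, card in zip(values, CARDINALITIES):
        bits.extend(1 if v == k else -1 for k in range(1, card + 1))
    return bits

x = encode((1, 2, 1, 3, 4, 2))
assert len(x) == 17
assert sum(b == 1 for b in x) == 6  # exactly one active bit per attribute
```

With this encoding, two value tuples differing in a single attribute have Hamming distance 2, which keeps the binary training set reasonably local for HC.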
References

[1] D.E. Rumelhart, G.E. Hinton, and R.J. Williams, Learning Representations by Back-Propagating Errors. Nature, 323 (1986) 533–536.
[2] J. Hertz, A. Krogh, and R. Palmer, Introduction to the Theory of Neural Computation (1991) Redwood City: Addison-Wesley.
[3] S.I. Gallant, Neural Network Learning and Expert Systems (1993) Cambridge, MA: MIT Press.
[4] M. Mézard, and J.-P. Nadal, Learning in Feedforward Layered Networks: the Tiling Algorithm. Journal of Physics A, 22 (1989) 2191–2204.
[5] M. Frean, The Upstart Algorithm: a Method for Constructing and Training Feedforward Neural Networks. Neural Computation, 2 (1990) 198–209.
[6] M. Marchand, M. Golea, and P. Ruján, A Convergence Theorem for Sequential Learning in Two-Layer Perceptrons. Europhysics Letters, 11 (1990) 487–492.
[7] D.L. Gray, and A.N. Michel, A Training Algorithm for Binary Feedforward Neural Networks. IEEE Transactions on Neural Networks, 3 (1992) 176–194.
[8] F. Rosenblatt, Principles of Neurodynamics (1962) New York: Spartan.
[9] S.I. Gallant, Perceptron-Based Learning Algorithms. IEEE Transactions on Neural Networks, 1 (1990) 179–191.
[10] L. Bottou, and V. Vapnik, Local Learning Algorithms. Neural Computation, 4 (1992) 888–900.
[11] E.B. Baum, and D. Haussler, What Size Net Gives Valid Generalization? Neural Computation, 1 (1989) 151–160.
[12] M. Frean, A "Thermal" Perceptron Learning Rule. Neural Computation, 4 (1992) 946–957.
[13] D.E. Rumelhart, G.E. Hinton, and R.J. Williams, Learning Internal Representations by Error Propagation. In Parallel Distributed Processing. Eds.: D.E. Rumelhart and J.L. McClelland (1986) Cambridge, MA: MIT Press, 318–362.
[14] D.A. Marcus, Number Fields (1977) New York: Springer-Verlag.
[15] A. Blumer, A. Ehrenfeucht, D. Haussler, and M.K. Warmuth, Learnability and the Vapnik-Chervonenkis Dimension. Journal of the ACM, 36 (1989) 929–965.
[16] D.E. Goldberg, Genetic Algorithms in Search, Optimization, and Machine Learning (1989) Reading, MA: Addison-Wesley.
[17] H.W. Gschwind, and E.J. McCluskey, Design of Digital Computers (1975) New York: Springer-Verlag.
[18] S.B. Thrun, J. Bala, E. Bloedorn, I. Bratko, B. Cestnik, K. De Jong, S. Džeroski, S.E. Fahlman, D. Fisher, R. Hamann, K. Kaufman, S. Keller, I. Kononenko, J. Kreuziger, R.S. Michalski, T. Mitchell, P. Pachowicz, Y. Reich, H. Vafaie, W. Van de Welde, W. Wenzel, J. Wnek, and J. Zhang, The MONK's Problems. A Performance Comparison of Different Learning Algorithms. (1991) Carnegie Mellon University report CMU-CS-91-197.
[19] R.A. Fisher, The Use of Multiple Measurements in Taxonomic Problems. Annals of Eugenics, 8 (1936) 376–386.
Marco Muselli was born in 1962. He received the degree in electronic engineering from the University of Genoa, Italy, in 1985. He is currently a Researcher at the Istituto per i Circuiti Elettronici of CNR in Genoa. His research interests include neural networks, global optimization, genetic algorithms and the characterization of nonlinear systems.