M5.26
A NEURAL NETWORK WITH BOOLEAN OUTPUT LAYER
PETER STROBACH
SIEMENS AG, Zentralabteilung Forschung und Entwicklung
ZFE IS INF 1, Otto-Hahn-Ring 6, D-8000 München 83, FRG
ABSTRACT - The design of feed-forward ADALINE neural networks can be split into two independent optimization problems: (1) the design of the first hidden layer, which uses linear hyperplanes to partition the continuous amplitude input space into a number of cells, and (2) the design of the second and succeeding hidden layers, which "group" the cells to larger decision regions. The weights of a linear combiner in the first hidden layer are best adjusted in the sense that the hyperplane determined by these weights is placed exactly in the middle of the "gap" between two training sets. This leads to a minimax optimization problem. The hyperplanes intersect in the input space and form a "lattice" of decision cells. The basic functioning of the first hidden layer is therefore a vector quantization of the input space. Each decision cell in the lattice is uniquely determined by its "codeword", namely, the binary output of the first hidden layer. The basic functioning of the second and succeeding hidden layers is then to perform a "grouping" of decision cells. The grouping of decision cells can be described alternatively by a Boolean function of the output "word" of the first hidden layer. In this way it is shown that the second and succeeding hidden layers in a feed-forward network may be replaced by a simple logical network. An algorithm for the design of this logical network is devised.
I. INTRODUCTION
There has been a flurry of interest recently in feed-forward neural networks for pattern recognition, classification and other purposes. This paper deals with networks of the ADALINE type [1, 2]. The basic function of a feed-forward multilayer network of ADALINE neurons is that the first hidden layer uses linear hyperplanes to partition the observation space into a number of decision cells. The sole function of the additional layers in the network is then to group these decision cells together in order to form larger decision regions. Many approaches have attempted to combine more than three layers in a multilayer structure in the belief that additional layers may continuously improve the capability of the network to approximate arbitrary decision regions. This paper shows that the task of forming arbitrary decision regions in an observation space can be split into two independent subtasks: (1) the optimal design of the weights of the first hidden layer, which can be formulated as a minimax optimization problem, and (2) the design of the additional layers as the realization of a Boolean function. These considerations lead to the insight that a feed-forward network with only two layers following the first hidden layer can completely determine any desired decision region constructable from a combination of decision cells. These two layers following the first hidden layer have an alternative realization as a logical network termed the Boolean output layer. This paper discusses (1) the design of the first hidden layer and (2) the design of the Boolean output layer. It is shown that the logical network of the Boolean output layer can be a computationally attractive alternative to the conventional ADALINE realization of the second and third hidden layers.
A. The ADALINE Neuron
The basic element in the networks studied in this paper is the adaptive linear neuron (ADALINE), which computes the inner product y of a pattern vector x = [x_1, x_2, ..., x_q]^T and a pre-determined weight vector w = [w_1, w_2, ..., w_q]^T plus a fixed threshold, which can be set to -1 without loss of generality. The output d of the ADALINE "neuron" is then simply the sign of y (binary decision):

y = -1 + x^T w = -1 + \sum_{i=1}^{q} x_i w_i ,   (1a)

d = SGN(y) .   (1b)
The elements of the weight vector may be interpreted geometrically as the parameters of the hyperplane y = 0 in the q-dimensional observation space of pattern vectors. This hyperplane divides the observation space into two open half-spaces. The thresholded inner product (1a) of a pattern vector and the weight vector can be interpreted as a "distance" between the pattern vector and the hyperplane. Moreover, the sign of y determines whether a given pattern is an element of the "left" (-) or the "right" (+) half-space.
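As a concrete illustration of (1a) and (1b), the following minimal Python sketch implements a single ADALINE decision; the weight values are arbitrary placeholders chosen for illustration only.

    import numpy as np

    def adaline(x, w):
        """ADALINE: y = -1 + x^T w (1a), d = SGN(y) (1b)."""
        y = -1.0 + np.dot(x, w)      # thresholded inner product, threshold fixed at -1
        d = 1 if y >= 0 else -1      # binary decision: "+" or "-" half-space
        return y, d

    w = np.array([0.5, 1.0])         # illustrative hyperplane -1 + 0.5*x1 + 1.0*x2 = 0 (q = 2)
    y, d = adaline(np.array([2.0, 1.0]), w)
    print(y, d)                      # 1.0, +1: the pattern lies in the "+" half-space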
Fig. 1. The ADALINE neuron and the interpretation of its coefficients as the parameters of a hyperplane in the q-dimensional observation space.
Given two sets of training patterns X^(k) and X^(l), the weights w of an ADALINE neuron may be adjusted (trained) so that the decision hyperplane is placed in the gap between the two training sets. Given an arbitrary pattern vector x, the output of the ADALINE is a binary decision whether x is an element of class 1, determined by the half-space of the training set X^(k), or an element of class 2, determined by the half-space of X^(l).
Fig. 2. Separation of two training (point) sets X^(k) and X^(l) in the observation space of continuous amplitude training patterns by a linear hyperplane.
B. ADALINE Neurons in the First Hidden Layer
A single ADALINE neuron can at most distinguish between two point sets. When the classification of n point sets is an issue, one requires m = n(n-1)/2 hyperplanes of ADALINE neurons to separate each point set from each member of the remaining point sets, where we assume that all point sets are pairwise linearly separable. These neurons constitute the first hidden layer in a neural network. The hyperplanes associated with the first hidden layer intersect in the q-dimensional space and form a segmentation "lattice" of decision cells. Each cell in the lattice can be labeled with a binary word with m digits, namely, the decisions of the m neurons in the first hidden layer, where each digit indicates the location of the cell relative to the corresponding hyperplane. See Fig. 3 for an example of a first hidden layer consisting of m = 4 neurons. The closed cell c_1 in Fig. 3b is completely determined by the 4-bit word d = [+ + + -]. One could now believe that there are 2^m = 2^4 = 16 cells in this example, but a quick inspection of Fig. 3b reveals that one finds only a total number of 3 "closed" and 8 "open" cells, i.e., 11 cells. Consequently, there must be a number of 5 remaining addresses which have no correspondence in the real space. Such cells have been termed "imaginary cells" [3].
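To make the cell "codeword" idea concrete, the following sketch maps a pattern to its m-bit address by evaluating all neurons of the first hidden layer; the random hyperplanes are purely illustrative and are not those of Fig. 3.

    import numpy as np

    def cell_codeword(x, W):
        """m-bit cell address of pattern x; row j of W holds the weights of
        neuron j, each with the fixed threshold -1 of (1a)."""
        y = -1.0 + W @ x                             # distances to the m hyperplanes
        return ''.join('+' if v >= 0 else '-' for v in y)

    n = 4                                            # number of classes
    m = n * (n - 1) // 2                             # = 6 pairwise separating neurons
    W = np.random.randn(m, 2)                        # illustrative hyperplanes in q = 2
    print(cell_codeword(np.array([0.3, -0.7]), W))   # e.g. "+--+-+"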
Explicit formulas for the calculation of the total number of open and closed cells in a q-dimensional problem with m neurons (hyperplanes) in the first hidden layer were reported in [3]. The considerations in [3] are based on early results by L. Schläfli [4] of the last century. The basic result of these investigations is that the number of open and closed cells grows tremendously with growing q and m. By an appropriate "grouping" of a number of these decision cells, one can approximate a desired decision region within the accuracy determined by the segmentation lattice of the first hidden layer. This grouping of decision cells is accomplished by the additional layers following the first hidden layer in a conventional multilayer network.
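The cell counts quoted above for Fig. 3b can be reproduced with classical Schläfli-type counting formulas (hyperplanes assumed in general position); the sketch below is my own summary of these counts, not code taken from [3].

    from math import comb

    def total_cells(m, q):
        """Open + closed cells cut out of R^q by m hyperplanes in general position."""
        return sum(comb(m, i) for i in range(min(m, q) + 1))

    def closed_cells(m, q):
        """Bounded ("closed") cells for m hyperplanes in general position."""
        return comb(m - 1, q) if m > q else 0

    m, q = 4, 2                                    # the example of Fig. 3
    print(total_cells(m, q), closed_cells(m, q))   # 11 cells, 3 of them closed
    print(2**m - total_cells(m, q))                # 5 "imaginary" cell addresses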
C. A Boolean Formulation of the Grouping Problem

But from the considerations that we have made so far, it becomes already apparent that we may formulate the grouping of decision cells alternatively by a Boolean function of the output decision "word" d = [d_1, d_2, ..., d_m] of the first hidden layer. Consider again the lattice of decision cells in Fig. 3b. We may be interested in forming a decision region R composed of the three closed cells c_1, c_2 and c_3. These cells are determined by the hidden layer decision words d_1 = [+ + + -], d_2 = [+ - + -] and d_3 = [+ + - -]. The decision region R is hence completely determined by the Boolean function R = d_1 ∨ d_2 ∨ d_3, where "∨" is the logical OR operator. We may rewrite this expression more explicitly in terms of the decisions of the four neurons in the first hidden layer of Fig. 3a:
"
zyxwvut
zyxwvutsrqpo
zyxwvutsrqpo
(a)
d3
+
/- 2
d,
Fig. 3. (a) First hidden layer consisting of 4 ADALINE neurons and (b) corresponding segmentation lattice in the observation space.
R = d_1 d_2 d_3 \bar{d}_4 ∨ d_1 \bar{d}_2 d_3 \bar{d}_4 ∨ d_1 d_2 \bar{d}_3 \bar{d}_4 .   (2)
This Boolean function can be minimized easily by an appropriate exploitation of don't cares. One simple method to achieve this is the Karnaugh diagram of Fig. 4.

Fig. 4. Evaluation of a Karnaugh diagram for minimization of the Boolean function R, which gives the minimized result: R = d_1 d_2 \bar{d}_4 ∨ d_1 d_3 \bar{d}_4.
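The equivalence of (2) and the minimized expression can be verified by brute force over all 2^4 = 16 possible hidden-layer words, as in the following short sketch.

    from itertools import product

    def R_full(d1, d2, d3, d4):
        """Decision region R as in (2): the minterms of the cells c1, c2, c3."""
        return (d1 and d2 and d3 and not d4) or \
               (d1 and not d2 and d3 and not d4) or \
               (d1 and d2 and not d3 and not d4)

    def R_min(d1, d2, d3, d4):
        """Karnaugh-minimized form of Fig. 4."""
        return (d1 and d2 and not d4) or (d1 and d3 and not d4)

    # Both expressions agree on every one of the 16 possible hidden-layer words.
    assert all(R_full(*d) == R_min(*d) for d in product([False, True], repeat=4))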
It is emphasized that for the minimization of Boolean functions with a large number of input variables (e.g. m = 190), one can take advantage of automatic methods developed in the area of logical network design [5]. Next, one finds that for practical problems, the total number of cells in the lattice is orders of magnitude larger than the number of training patterns available in a practical application. Since each training pattern corresponds to a distinct cell address, the training patterns "hit" only a vanishingly small number of cells compared to the total number of existing cells in the lattice. This gives rise to a second problem, namely, the construction of a Boolean function determining a closed decision region from these sparsely distributed cells determined by the hidden layer words of a training sequence. This is in fact a difficult task of combinatorial optimization, and a fast solution of this design problem could open a new dimension in neural network design. Obviously, a closed decision region can be constructed from the cells by some kind of "binary region growing", where one first minimizes the conjunctive terms describing the hidden layer words for a given training sequence, as described above. Next one notes that the incorporation of a don't care in a decision word adds a "neighbor" cell to the decision region. Consider again the lattice of Fig. 3b for an illustration of this statement. Suppose for example that we intend to "merge" the cell c_2 with a larger decision region which already contains the
cell c_1. This merging is achieved by a simple elimination of the hyperplane d_2, i.e., d_2 = 0 (don't care), so that d_1 ∨ d_2 = d_1 d_3 \bar{d}_4. Thus one continues introducing don't cares in the Boolean function of the output layer under the constraint that the cells thus included in the decision region are not yet members of another decision region.
The procedure can be repeated recursively, until no more "free" cells are available in the lattice and all cells belong to a distinct decision region. It is emphasized that in practice, a given set of training data still leaves much freedom in the design of the Boolean output layer. Even with training sets consisting of a relatively large number of training patterns compared to the total number of cells in the lattice, there is little chance that all cells in the lattice are addressed by patterns of the training sequence. Slightly different patterns may fall into the same decision cell and some free cells remain. From these considerations, one may gain the interesting insight that in most cases there is a large number of different final solutions of the Boolean function determining the decision region. All these different solutions yield the same network behavior inside the training sequence. But the solutions may differ concerning their robustness outside training. These important questions are the subject of current investigations.
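A minimal sketch of the "binary region growing" described above, using the cells of Fig. 3b: a product term is widened by turning one position into a don't care, and the newly covered cells are accepted only if they are still free (not claimed by another decision region). The representation of terms as sign strings with '*' for don't care is my own convention.

    # Cells are 4-character sign words over the hidden-layer decisions d1..d4.
    c1, c2, c3 = "+++-", "+-+-", "++--"
    region_R = {c1}                     # region under construction
    other_regions = set()               # cells already claimed by other regions

    def covers(term, cell):
        """A product term with don't cares ('*') covers a cell if all fixed signs match."""
        return all(t == '*' or t == c for t, c in zip(term, cell))

    # Merge c2 into the region containing c1 by eliminating hyperplane d2.
    term = c1[:1] + '*' + c1[2:]                              # "+++-"  ->  "+*+-"
    new_cells = {c for c in (c1, c2, c3) if covers(term, c)}
    assert c2 in new_cells and new_cells.isdisjoint(other_regions)
    region_R |= new_cells
    print(term, sorted(region_R))       # "+*+-" now covers both c1 and c2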
II. THE DESIGN OF THE FIRST HIDDEN LAYER
We have seen that the neurons of the first hidden layer perform a pairwise linear separation of two sets of training patterns X^(k) and X^(l), where

X^(k) = [x_1^(k), x_2^(k), ..., x_p^(k)]^T ,   1 ≤ k ≤ n ,   (3)

n is the number of classes and p is the number of training patterns in the training set of each class.
The "distance" of the training patterns with respect to a hyperplane w_{k,l} may then be expressed in terms of the ADALINE formulation (1a) as

y^(k) = -1 + X^(k) w_{k,l} ,   (4a)        y^(l) = -1 + X^(l) w_{k,l} ,   (4b)

where 1 is the vector of all ones and y is the distance vector. The hyperplane should be placed in the gap between the two sets X^(k) and X^(l). One can further be interested in adjusting the parameters w_{k,l} in the sense that the hyperplane is placed in the middle of the gap, with a maximum distance from the closest members of the two training sets located on opposite sides of the hyperplane. This demand is fulfilled when the errors satisfy the conditions

y^(k) ≤ [ε, ..., ε]^T ,   (5a)        y^(l) ≥ -[ε, ..., ε]^T ,   (5b)

ε = min .   (6)

The variable ε is the width of a "tolerance tube", or half of the "closest distance" between the two point sets. The constraints (5a,b) together with (6) already constitute the minimax design problem for the desired parameter set w_{k,l}.
Considering (4a,b), the minimax design problem can be brought into a linear programming (LP) formulation [6].
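The following sketch casts the constraints (5a,b) with the objective (6) as a linear program and solves it with scipy.optimize.linprog; the solver choice and the two small point sets are my own illustrative assumptions (the paper itself uses the Simplex method and interior-point variants discussed below), and the data are chosen so that the LP is bounded.

    import numpy as np
    from scipy.optimize import linprog

    # Two illustrative training sets in a q = 2 observation space (not from the paper).
    Xk = np.array([[0.5, 0.0], [-0.5, 0.0], [0.0, 0.5], [0.0, -0.5]])   # class k
    Xl = np.array([[2.0, 0.0], [2.0, 1.0], [2.0, -1.0]])                # class l

    q = Xk.shape[1]
    c = np.r_[np.zeros(q), 1.0]              # variables z = [w, eps]; objective (6): min eps
    A_k = np.c_[Xk, -np.ones(len(Xk))]       # (5a): -1 + Xk w <= eps   ->  Xk w - eps <= 1
    A_l = np.c_[-Xl, -np.ones(len(Xl))]      # (5b): -1 + Xl w >= -eps  -> -Xl w - eps <= -1
    res = linprog(c, A_ub=np.vstack([A_k, A_l]),
                  b_ub=np.r_[np.ones(len(Xk)), -np.ones(len(Xl))],
                  bounds=[(None, None)] * (q + 1))
    w, eps = res.x[:q], res.x[q]
    print("weights:", w, " gap variable eps:", eps)   # eps < 0 when the sets are separable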
Introducing an auxiliary vector g (7), one can express the constraints (5a,b) in terms of g and the training sets X^(k) and X^(l) (8). From (7) it follows that the "gap variable" ε can be expressed as a linear objective function of w_{k,l} and g (9). With these expressions (8) and (9), the minimax design of the weight vector w_{k,l} has been brought into the following standard linear programming (LP) formulation: "minimize the linear objective function f(w_{k,l}, g) with respect to the set of linear inequality constraints (8)".

One possible method that solves the given LP problem is the Simplex algorithm [7]. The characteristic operation of the Simplex method is that the solution travels along the boundary of the feasible region, seeking the vertex which minimizes the linear objective function. In large practical problems, the Simplex algorithm converges rather slowly, but with an approximately constant speed of convergence, to the exact minimax solution. In many cases, however, it is sufficient to work with approximate minimax designs of the weight vector. The goal is therefore to derive fast approximate algorithms for the given LP problem. As the Simplex algorithm exhibits an almost constant convergence speed over the entire sequence of iterations, it does not provide a good basis for an approximate algorithm. Algorithms with a much faster initial convergence within the first few iterations can be deduced from LP algorithms that operate from the interior of the feasible region rather than from its boundary as does the Simplex algorithm. One such method is the Karmarkar algorithm [8], which is basically a projected gradient method coupled with a recentering method. We employed the modified version of Karmarkar's algorithm by Vanderbei et al. [9]. An explicit summary of this algorithm can be found in [10]. The algorithm starts with a least-squares solution. Sufficient approximations of the minimax solution can be obtained by two iterations with this algorithm. An even simpler approach to an approximate minimax solution is error fluffing [11]. Here, one starts again with a least-squares solution, searches for the largest absolute error of the least-squares approximation, augments the "old" system by the row of the largest absolute error, and solves this "updated" system with a Givens recursive least-squares (RLS) algorithm [12].

III. THE DESIGN OF THE BOOLEAN OUTPUT LAYER

In a second step, we supply the training sets {X^(k), 1 ≤ k ≤ n} to the first hidden layer, which maps the continuous amplitude patterns into a set of decision vectors (binary patterns) of dimension m = n(n-1)/2:

X^(k) → (first hidden layer) → D^(k) = [d_1^(k), d_2^(k), ..., d_p^(k)]^T ,   1 ≤ k ≤ n .   (10)
Next define a Boolean distance vector δ_j^(k,l) between the decision vector d_i^(k) of class k and d_j^(l) of class l:

δ_j^(k,l) = d_i^(k) ⊕ d_j^(l) ,   (11)

where ⊕ is the EXOR operator. We can now determine the Boolean distance between a decision vector d_i^(k) and all the remaining decision vectors {d_j^(l), 1 ≤ j ≤ p} of classes 1 ≤ l ≤ n, l ≠ k. This set of distance vectors forms a Boolean distance matrix Δ_i^(k) of dimension (n-1)p x m.
We next describe an algorithm that determines the Boolean output layer from the decision vectors by constrained Boolean minimization based on the distance matrix Δ_i^(k):

FOR i = 1, 2, ..., p (for all training patterns)
FOR k = 1, 2, ..., n (of each class) COMPUTE:

(1) a: Determine the distance matrix Δ_i^(k).
    b: For each column vector of Δ_i^(k), count the number of on ("+") bits.

(2) a: Select the column with the smallest number of "+" bits, and for this column index insert a don't care in the decision vector d_i^(k).
    b: Set all elements of this column vector to "-" and reset the corresponding counter of "+" bits.
    c: For each row of Δ_i^(k), make a test whether all bits in the row are off ("-").

Repeat (2) until this test is satisfied for one or more rows of Δ_i^(k).
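A compact sketch of steps (1) and (2) above for a single decision vector d_i^(k); the tie-breaking rule and the exact placement of the stopping test (2c), which is checked here before a don't care is committed, are my own assumptions.

    import numpy as np

    def insert_dont_cares(d_i, D_other):
        """Widen one decision vector d_i (+1/-1 entries) by don't cares (marked 0).
        D_other holds the (n-1)p decision vectors of all other classes as rows."""
        d = np.array(d_i, dtype=int)
        delta = (np.asarray(D_other) != d)           # step (1a): "+" where bits differ (EXOR)
        while True:
            counts = delta.sum(axis=0)               # step (1b): "+" bits per column
            counts[d == 0] = delta.shape[0] + 1      # ignore columns already don't care
            j = int(np.argmin(counts))               # step (2a): column with fewest "+" bits
            trial = delta.copy()
            trial[:, j] = False                      # step (2b): switch that column off
            if (~trial).all(axis=1).any():           # step (2c): a row of all "-" would mean
                break                                # d_i no longer separates from another class
            d[j] = 0                                 # commit the don't care
            delta = trial
        return d

    d1 = [+1, +1, +1, -1]                            # cell c1 of Fig. 3b
    others = [[+1, +1, +1, +1], [-1, +1, +1, -1]]    # illustrative other-class vectors
    print(insert_dont_cares(d1, others))             # [1 0 0 -1], i.e. the term d1 ~d4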
Minimize the number of decision vectors in {D^(k), 1 ≤ k ≤ n} by exploitation of don't cares. The decision regions R_k are then constituted by the Boolean functions of the remaining decision vectors, which contain a large number of don't cares:

R_k = d_1^(k) ∨ d_2^(k) ∨ ... ,   1 ≤ k ≤ n .   (12)

From the VLSI designer's view this procedure must sound particularly nice. The basic operation in the generation of the distance matrix Δ (step 1) is a parallel EXOR of m channels followed by m counters. Step 2 is basically the OR of all elements of each row of Δ [p(n-1) OR functions with m inputs each]. The outputs of these rowwise ORs are finally ANDed to obtain decision 2(c). The total complexity of the procedure is n(n-1)p^2 m EXORs and the same number of OR operations. The above described algorithm is characterized by a high degree of parallelism and regularity. Moreover, it is a remarkable fact that the complexity of this "training" algorithm grows only linearly with the dimension m of the pattern space of decision vectors. The algorithm can therefore be a useful tool also in the design of neural networks for connected recognition of large binary patterns. Note finally that the Boolean functions (12) have an alternative representation as a two-layer network of ADALINE neurons with distinct power-of-two coefficients.

IV. CONCLUSIONS

It has been shown that the design of ADALINE neural networks for continuous amplitude patterns can be split into two decoupled design procedures. The first procedure is basically a vector quantization of the pattern space that linearly separates all training patterns. Each cell in the vector quantization "lattice" is completely determined by a code "address" which is generated by the first hidden layer when a pattern of the training set is supplied to the input of the neural network. The second problem is then to form a closed decision region as a Boolean function of cell addresses. We devised a procedure that minimizes this Boolean function by incorporation of "free" cells in the lattice, cells that are not addressed by the training patterns. Note that this is a highly parallel and regular algorithm based entirely on logical operations. The algorithm is of a type such that it minimizes the number of gates in the Boolean output layer. This property and the fact that the total complexity of the algorithm depends only linearly on the dimension of the decision vectors makes the method also an attractive candidate for the design of neural networks for recognition of large binary patterns. In this special case the data is already quantized in two steps and therefore the vector quantization appearing in the general (continuous amplitude) case can be omitted.
REFERENCES
[1] B. Widrow, "Generalization and information storage in networks of adaline 'neurons'", in Self-Organizing Systems 1962, M.C. Yovits, G.T. Jacobi, and G.D. Goldstein, Eds., Washington, DC: Spartan Books, 1962, pp. 435-461.
[2] B. Widrow, R.G. Winter, and R.A. Baxter, "Layered neural nets for pattern recognition", IEEE Trans. on ASSP, Vol. ASSP-36, No. 7, pp. 1109-1118, 1988.
[3] J. Makhoul, R. Schwartz, and A. El-Jaroudi, "Classification capabilities of two-layer neural networks", in Proc. ICASSP-89, pp. 635-638, Glasgow, Scotland, May 1989.
[4] L. Schläfli (1814-1895), Gesammelte Mathematische Abhandlungen, Vol. 1, Birkhäuser-Verlag, Basel, 1950.
[5] M.H. Lewin, Logic Design and Computer Organization, Addison-Wesley Series in Computer Sciences and Information Processing, Menlo Park, CA, 1983.
[6] M.R. Osborne, Finite Algorithms in Optimization and Data Analysis, Wiley, 1985.
[7] G.B. Dantzig, Linear Programming and Extensions, Princeton, NJ: Princeton University Press, 1963.
[8] N.K. Karmarkar, "A new polynomial-time algorithm for linear programming", Combinatorica, Vol. 4, No. 4, pp. 373-395, 1984.
[9] R.J. Vanderbei, M.S. Meketon, and B.A. Freedman, "A modification of Karmarkar's linear programming algorithm", Algorithmica, Vol. 1, pp. 395-407, 1986.
[10] S.A. Ruzinsky and E.T. Olsen, "L_1 and L_∞ minimization via a variant of Karmarkar's algorithm", IEEE Trans. on ASSP, Vol. ASSP-37, No. 2, pp. 245-253, 1989.
[11] S.A. Ruzinsky, "A simple minimax algorithm", Dr. Dobb's Journal, Vol. 93, pp. 84-101, 1984.
[12] P. Strobach, Linear Prediction Theory: A Mathematical Basis for Adaptive Systems, Springer Series in Information Sciences, Vol. 21, 1990.