
A neural network with Boolean output layer

1990, International Conference on Acoustics, Speech, and Signal Processing


Peter Strobach
Siemens AG, Zentralabteilung Forschung und Entwicklung, ZFE IS INF 1, Otto-Hahn-Ring 6, D-8000 München 83, FRG

ABSTRACT - The design of feed-forward ADALINE neural networks can be split into two independent optimization problems: (1) the design of the first hidden layer, which uses linear hyperplanes to partition the continuous amplitude input space into a number of cells, and (2) the design of the second and succeeding hidden layers, which "group" the cells into larger decision regions. The weights of a linear combiner in the first hidden layer are best adjusted in the sense that the hyperplane determined by these weights is placed exactly in the middle of the "gap" between two training sets. This leads to a minimax optimization problem. The hyperplanes intersect in the input space and form a "lattice" of decision cells. The basic function of the first hidden layer is therefore a vector quantization of the input space. Each decision cell in the lattice is uniquely determined by its "codeword", namely, the binary output of the first hidden layer. The basic function of the second and succeeding hidden layers is then to perform a "grouping" of decision cells. This grouping of decision cells can be described alternatively by a Boolean function of the output "word" of the first hidden layer. In this way it is shown that the second and succeeding hidden layers in a feed-forward network may be replaced by a simple logical network. An algorithm for the design of this logical network is devised.

I. INTRODUCTION

There has been a flurry of interest recently in feed-forward neural networks for pattern recognition, classification and other purposes. This paper deals with networks of the ADALINE type [1, 2]. The basic function of a feed-forward multilayer network of ADALINE neurons is that the first hidden layer uses linear hyperplanes to partition the observation space into a number of decision cells. The sole function of the additional layers in the network is then to group these decision cells together in order to form larger decision regions. Many approaches have attempted to combine more than three layers in a multilayer structure, based on the belief that additional layers may continuously improve the capability of the network to approximate arbitrary decision regions. This paper shows that the task of forming arbitrary decision regions in an observation space can be split into two independent subtasks: (1) the optimal design of the weights of the first hidden layer, which can be formulated as a minimax optimization problem, and (2) the design of the additional layers as the realization of a Boolean function. These considerations lead to the insight that a feed-forward network with only two layers following the first hidden layer can completely determine any desired decision region constructable from a combination of decision cells. These two layers following the first hidden layer have an alternative realization as a logical network termed the Boolean output layer. This paper discusses (1) the design of the first hidden layer and (2) the design of the Boolean output layer. It is shown that the logical network of the Boolean output layer can be a computationally attractive alternative to the conventional ADALINE realization of the second and third hidden layer.
A. The ADALINE Neuron

The basic element in the networks studied in this paper is the adaptive linear neuron (ADALINE), which computes the inner product y of a pattern vector x = [x_1, x_2, ..., x_q]^T and a pre-determined weight vector w = [w_1, w_2, ..., w_q]^T plus a fixed threshold, which can be set to -1 without loss of generality. The output d of the ADALINE "neuron" is then simply the sign of y (binary decision):

    y = -1 + x^T w = -1 + Σ_{i=1}^{q} x_i w_i ,    (1a)
    d = sgn(y) .    (1b)

The elements of the weight vector may be interpreted geometrically as the parameters of the hyperplane y = 0 in the q-dimensional observation space of pattern vectors. This hyperplane divides the observation space into two open half-spaces. The thresholded inner product (1a) of a pattern vector and the weight vector can be interpreted as a "distance" between the pattern vector and the hyperplane. Moreover, the sign of y determines whether a given pattern is an element of the "left" (-) or the "right" (+) half-space.

Fig. 1. The ADALINE neuron and the interpretation of its coefficients as the parameters of a hyperplane in the q-dimensional observation space.

Given two sets of training patterns X^(k) and X^(l), the weights w of an ADALINE neuron may be adjusted (trained) so that the decision hyperplane is placed in the gap between the two training sets. Given an arbitrary pattern vector x, the output of the ADALINE is a binary decision whether x is an element of class 1, determined by the half-space of the training set X^(k), or an element of class 2, determined by the half-space of X^(l).

Fig. 2. Separation of two training (point) sets X^(k) and X^(l) in the observation space of continuous amplitude training patterns by a linear hyperplane.

B. ADALINE Neurons in the First Hidden Layer

A single ADALINE neuron can at most distinguish between two point sets. When the classification of n point sets is an issue, one requires m = n(n-1)/2 hyperplanes of ADALINE neurons to separate each point set from each member of the remaining point sets, where we assume that all point sets are pairwise linearly separable. These neurons constitute the first hidden layer in a neural network. The hyperplanes associated with the first hidden layer intersect in the q-dimensional space and form a segmentation "lattice" of decision cells. Each cell in the lattice can be labeled with a binary word of m digits, namely, the decisions of the m neurons in the first hidden layer, where each digit indicates the location of the cell relative to the corresponding hyperplane. See Fig. 3 for an example of a first hidden layer consisting of m = 4 neurons. The closed cell c_1 in Fig. 3b is completely determined by the 4-bit word d = [+++-]. One could now believe that there are 2^m = 2^4 = 16 cells in this example, but a quick inspection of Fig. 3b reveals only a total of 3 "closed" and 8 "open" cells = 11 cells. Consequently, there must be 5 remaining addresses which have no correspondence in the real space. Such cells have been termed "imaginary cells" [3].
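As an illustration of equations (1a,b) and of the cell-addressing role of the first hidden layer, the following sketch (not part of the original paper; the weights, names and example values are our own arbitrary choices) computes the m-bit decision word of a pattern with NumPy:

```python
import numpy as np

def adaline_layer(x, W):
    """First hidden layer of m ADALINE neurons.

    x : pattern vector of length q
    W : (m, q) matrix; row j holds the weight vector of neuron j
    Returns the m-bit decision word (+1/-1 per neuron), i.e. the
    address of the decision cell that contains x (cf. eqs. (1a,b)).
    """
    y = -1.0 + W @ x                  # thresholded inner products, eq. (1a)
    return np.where(y >= 0, 1, -1)    # sign decisions, eq. (1b)

# Example with m = 4 hyperplanes in a q = 2 dimensional observation space.
W = np.array([[ 1.0,  0.2],
              [-0.3,  1.0],
              [ 0.5, -1.0],
              [ 1.0,  1.0]])
x = np.array([0.8, 0.6])
print(adaline_layer(x, W))   # -> [-1 -1 -1  1]: the 4-bit cell address of x
```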
Explicit formulas for the calculation of the total number of open and closed cells in a q-dimensional problem with m neurons (hyperplanes) in the first hidden layer were reported in [3]. The considerations in [3] are based on early results by L. Schläfli [4] from the last century. The basic result of these investigations is that the number of open and closed cells grows tremendously with growing q and m. By an appropriate "grouping" of a number of these decision cells, one can approximate a desired decision region within the accuracy determined by the segmentation lattice of the first hidden layer. This grouping of decision cells is accomplished by the additional layers following the first hidden layer in a conventional multilayer network.

Fig. 3. (a) First hidden layer consisting of 4 ADALINE neurons and (b) corresponding segmentation lattice in the observation space.

C. A Boolean Formulation of the Grouping Problem

From the considerations made so far, it becomes apparent that we may formulate the grouping of decision cells alternatively as a Boolean function of the output decision "word" d = [d_1, d_2, ..., d_m] of the first hidden layer. Consider again the lattice of decision cells in Fig. 3b. We may be interested in forming a decision region R composed of the three closed cells c_1, c_2 and c_3. These cells are determined by the hidden layer decision words d_1 = [+++-], d_2 = [+-+-] and d_3 = [++--]. The decision region R is hence completely determined by the Boolean function R = d_1 ∨ d_2 ∨ d_3, where "∨" is the logical OR operator. We may rewrite this expression more explicitly in terms of the decisions of the four neurons in the first hidden layer of Fig. 3a:

    R = d_1 d_2 d_3 ¬d_4  ∨  d_1 ¬d_2 d_3 ¬d_4  ∨  d_1 d_2 ¬d_3 ¬d_4 .    (2)

This Boolean function can be minimized easily by an appropriate exploitation of don't cares. One simple method to achieve this is the Karnaugh diagram of Fig. 4.

Fig. 4. Evaluation of a Karnaugh diagram for minimization of the Boolean function R, which gives the minimized result: R = d_1 d_3 ¬d_4 ∨ d_1 d_2 ¬d_4.

It is emphasized that for the minimization of Boolean functions with a large number of input variables (e.g. m = 190), one can take advantage of automatic methods developed in the area of logical network design [5]. Next, one finds that for practical problems, the total number of cells in the lattice is orders of magnitude larger than the number of training patterns available in a practical application. Since each training pattern corresponds to a distinct cell address, the training patterns "hit" only a vanishingly small number of cells compared to the total number of existing cells in the lattice. This gives rise to a second problem, namely, the construction of a Boolean function determining a closed decision region from these sparsely distributed cells determined by the hidden layer words of a training sequence. This is in fact a difficult task of combinatorial optimization, and a fast solution of this design problem could open a new dimension in neural network design. Obviously, a closed decision region can be constructed from the cells by some kind of "binary region growing", where one first minimizes the conjunctive terms describing the hidden layer words for a given training sequence, as described above.
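To make the grouping idea concrete, here is a minimal sketch (ours, not the paper's; function and variable names are illustrative) that evaluates the minimized region function R = d_1 d_3 ¬d_4 ∨ d_1 d_2 ¬d_4 from the example above on a given decision word, with bits encoded as +1/-1 as in the first hidden layer output:

```python
def in_region_R(d):
    """Membership test for the decision region R of Fig. 3b.

    d : 4-bit decision word (+1/-1) from the first hidden layer.
    Implements the minimized Boolean function
        R = d1 d3 ~d4  OR  d1 d2 ~d4,
    where di means "di == +1" and ~di means "di == -1".
    """
    d1, d2, d3, d4 = (bit > 0 for bit in d)
    return (d1 and d3 and not d4) or (d1 and d2 and not d4)

# The three closed cells of the example are accepted ...
for word in ([+1, +1, +1, -1], [+1, -1, +1, -1], [+1, +1, -1, -1]):
    assert in_region_R(word)
# ... while, e.g., the cell [-1, +1, +1, -1] is rejected.
assert not in_region_R([-1, +1, +1, -1])
```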
Next one notes that the incorporation of a don't care in a decision word adds a "neighbor" cell to the decision region. Consider again the lattice of Fig. 3b for an illustration of this statement. Suppose for example that we intend to "merge" the cell c_2 with a larger decision region which already contains the cell c_1. This merging is achieved by a simple elimination of the hyperplane d_2, i.e., by setting d_2 to a don't care, so that d_1 d_2 d_3 ¬d_4 ∨ d_1 ¬d_2 d_3 ¬d_4 = d_1 d_3 ¬d_4. Thus one continues introducing don't cares in the Boolean function of the output layer under the constraint that the cells so included in the decision region are not yet members of another decision region. The procedure can be repeated recursively until no more "free" cells are available in the lattice and all cells belong to a distinct decision region. It is emphasized that in practice, a given set of training data still leaves much freedom in the design of the Boolean output layer. Even with training sets consisting of a relatively large number of training patterns compared to the total number of cells in the lattice, there is little chance that all cells in the lattice are addressed by patterns of the training sequence. Slightly different patterns may fall into the same decision cell, and some free cells remain. From these considerations, one may gain the interesting insight that in most cases there is a large number of different final solutions of the Boolean function determining the decision region. All these different solutions yield the same network behavior on the training sequence, but the solutions may differ concerning their robustness outside training. These important questions are the subject of current investigations.

II. THE DESIGN OF THE FIRST HIDDEN LAYER

We have seen that the neurons of the first hidden layer perform a pairwise linear separation of two sets of training patterns X^(k) and X^(l), where

    X^(k) = [x_1^(k), x_2^(k), ..., x_p^(k)]^T ,  1 ≤ k ≤ n ,    (3)

n is the number of classes and p is the number of training patterns in the training set of each class. The "distance" of the training patterns with respect to a hyperplane w_{k,l} may then be expressed in terms of the ADALINE formulation (1a) as

    y^(k) = -1 + X^(k) w_{k,l} ,    (4a)
    y^(l) = -1 + X^(l) w_{k,l} ,    (4b)

where 1 is the vector of all ones and y is the distance vector. The hyperplane should be placed in the gap between the two sets X^(k) and X^(l). One can further be interested in adjusting the parameters w_{k,l} such that the hyperplane is placed in the middle of the gap, with a maximum distance from the closest members of the two training sets located on opposite sides of the hyperplane. This demand is fulfilled when the errors satisfy the conditions

    y^(k) ≤ [ε, ..., ε]^T ,    (5a)
    y^(l) ≥ -[ε, ..., ε]^T ,    (5b)
    ε = min .    (6)

The variable ε is the width of a "tolerance tube", or half of the "closest distance" between the two point sets. The constraints (5a,b) together with (6) already constitute the minimax design problem for the desired parameter set w_{k,l}. Considering (4a,b), the minimax design problem can be brought into a linear programming (LP) formulation [6]. Introducing an auxiliary vector g (7), one can express the constraints (5a,b) in terms of g and the training sets X^(k) and X^(l) (8). From (7) it follows that the "gap variable" ε can be expressed as a linear objective function f(w_{k,l}, g) (9). With these expressions (8) and (9), the minimax design of the weight vector w_{k,l} has been brought into the following standard linear programming (LP) formulation: "minimize the linear objective function f(w_{k,l}, g) with respect to the set of linear inequality constraints (8)". One possible method that solves the given LP problem is the Simplex algorithm [7]. The characteristic operation of the Simplex method is that the solution travels along the boundary of the feasible region, seeking the vertex which minimizes the linear objective function. In large practical problems, the Simplex algorithm converges rather slowly, but with an approximately constant speed of convergence, to the exact minimax solution. In many cases, however, it is sufficient to work with approximate minimax designs of the weight vector. The goal is therefore to derive fast approximate algorithms for the given LP problem. As the Simplex algorithm exhibits an almost constant convergence speed over the entire sequence of iterations, it does not provide a good basis for an approximate algorithm. Algorithms with a much faster initial convergence within the first few iterations can be deduced from LP algorithms that operate from the interior of the feasible region rather than from its boundary, as the Simplex algorithm does. One such method is the Karmarkar algorithm [8], which is basically a projected gradient method coupled with a recentering method. We employed the modified version of Karmarkar's algorithm of Vanderbei et al. [9]. An explicit summary of this algorithm can be found in [10]. The algorithm starts with a least-squares solution.
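The paper solves this LP with the Simplex method [7] or a Karmarkar-type interior-point variant [9]; the exact auxiliary-vector formulation (7)-(9) is not reproduced above. Purely as an illustration, the sketch below feeds the raw constraints (5a,b) and objective (6) directly to a generic LP solver (SciPy's linprog, our stand-in, not the authors' tool). It assumes the two sets cannot be separated by a hyperplane through the origin, since this direct formulation would otherwise be unbounded; all names and example values are ours.

```python
import numpy as np
from scipy.optimize import linprog

def minimax_hyperplane(Xk, Xl):
    """Place a hyperplane in the middle of the gap between Xk and Xl.

    Decision variables: z = [w_1, ..., w_q, eps].
    Constraints (cf. (5a,b)):  -1 + Xk w <= eps   and   -1 + Xl w >= -eps.
    Objective (cf. (6)):       minimize eps (eps becomes negative for
    linearly separable sets; |eps| is then half the gap width).
    """
    p, q = Xk.shape
    c = np.zeros(q + 1)
    c[-1] = 1.0                                   # minimize eps
    # Xk w - eps <= 1   (class k on the negative side of the hyperplane)
    A_k = np.hstack([Xk, -np.ones((p, 1))])
    b_k = np.ones(p)
    # -Xl w - eps <= -1 (class l on the positive side)
    A_l = np.hstack([-Xl, -np.ones((Xl.shape[0], 1))])
    b_l = -np.ones(Xl.shape[0])
    res = linprog(c,
                  A_ub=np.vstack([A_k, A_l]),
                  b_ub=np.concatenate([b_k, b_l]),
                  bounds=[(None, None)] * (q + 1),
                  method="highs")
    return res.x[:q], res.x[-1]                   # weight vector w, gap variable eps

# Two small training sets in q = 2 dimensions (illustrative values only).
Xk = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3]])
Xl = np.array([[1.0, 1.0], [0.9, 1.2], [1.2, 0.8]])
w, eps = minimax_hyperplane(Xk, Xl)
print(w, eps)
```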
Sufficient approximations of the minimax solution can be obtained with two iterations of this interior-point algorithm. An even simpler approach to an approximate minimax solution is error fluffing [11]. Here, one starts again with a least-squares solution, searches for the largest absolute error of the least-squares approximation, augments the "old" system by the row of the largest absolute error, and solves this "updated" system with a Givens recursive least-squares (RLS) algorithm [12].

III. THE DESIGN OF THE BOOLEAN OUTPUT LAYER

In a second step, we supply the training sets {X^(k), 1 ≤ k ≤ n} to the first hidden layer, which maps the continuous amplitude patterns into sets of decision vectors (binary patterns) of dimension m = n(n-1)/2:

    X^(k)  --(first hidden layer)-->  D^(k) = {d_i^(k), 1 ≤ i ≤ p} ,  1 ≤ k ≤ n .    (10)

Next define a Boolean distance vector δ_{ij}^(k,l) between the decision vector d_i^(k) of class k and the decision vector d_j^(l) of class l:

    δ_{ij}^(k,l) = d_i^(k) ⊕ d_j^(l) ,    (11)

where ⊕ is the EXOR operator. We can now determine the Boolean distances between a decision vector d_i^(k) and all the remaining decision vectors {d_j^(l), 1 ≤ j ≤ p} of classes 1 ≤ l ≤ n, l ≠ k. This set of distance vectors forms a Boolean distance matrix Δ_i^(k) of dimension (n-1)p x m. We next describe an algorithm that determines the Boolean output layer from the decision vectors by constrained Boolean minimization based on the distance matrix Δ_i^(k):

FOR i = 1, 2, ..., p (for all training patterns)
  FOR k = 1, 2, ..., n (of each class)
    COMPUTE:
    (1) a: Determine the distance matrix Δ_i^(k).
        b: For each column vector of Δ_i^(k), count the number of on ("+") bits.
    (2) a: Select the column with the smallest number of "+" bits, and for this column index insert a don't care in the decision vector d_i^(k).
        b: Set all elements of this column vector to "-" and reset the corresponding counter of "+" bits.
        c: For each row of Δ_i^(k), test whether all bits in the row are off ("-").
    Repeat (2) until this test is satisfied for one or more rows of Δ_i^(k).

Finally, minimize the number of decision vectors in {D^(k), 1 ≤ k ≤ n} by exploitation of don't cares. The decision regions R_k are then constituted by the Boolean functions of the remaining decision vectors, which contain a large number of don't cares:

    R_k = d_1^(k) ∨ d_2^(k) ∨ ... ,  1 ≤ k ≤ n .    (12)
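The following sketch is our reading of steps (1)-(2) for a single decision vector d_i^(k), not the authors' implementation; ternary entries are encoded as +1, -1 and 0 (don't care), and all function and variable names are ours.

```python
import numpy as np

def insert_dont_cares(d, others):
    """Constrained don't-care insertion for one decision vector.

    d      : length-m decision vector with entries +1/-1
    others : ((n-1)*p, m) array of decision vectors of all other classes
    Returns a copy of d in which some entries are set to 0 (don't care):
    as long as every row of the Boolean distance matrix still contains an
    'on' bit, the column with the fewest 'on' bits is turned into a don't care.
    """
    d = d.astype(int).copy()
    delta = (others != d).astype(int)          # step (1a): EXOR distance matrix
    while True:
        counts = delta.sum(axis=0)             # step (1b): 'on' bits per column
        counts[d == 0] = np.iinfo(int).max     # skip columns already don't care
        col = int(np.argmin(counts))           # step (2a): cheapest column
        trial = delta.copy()
        trial[:, col] = 0                      # step (2b): clear that column
        if (trial.sum(axis=1) == 0).any():     # step (2c): some row all 'off'?
            break                              # stop: d would collide with another class
        d[col] = 0                             # accept the don't care
        delta = trial
    return d

# Tiny illustration with m = 4 and three decision vectors of other classes.
d      = np.array([+1, +1, +1, -1])
others = np.array([[+1, -1, -1, +1],
                   [-1, +1, -1, +1],
                   [-1, -1, +1, +1]])
print(insert_dont_cares(d, others))   # -> [0 0 0 -1]: only d4 is still needed
```

In this toy case all other-class vectors differ from d in the fourth bit, so the first three neuron decisions can be replaced by don't cares, which is exactly the kind of region growing described above.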
From the VLSI designer's view, this procedure must sound particularly nice. The basic operation in the generation of the distance matrix Δ_i^(k) (step 1) is a parallel EXOR of m channels followed by m counters. Step 2 is basically the OR of all elements of each row of Δ_i^(k) [p(n-1) OR functions with m inputs each]. The outputs of these row-wise ORs are finally ANDed to obtain the decision of step (2c). The total complexity of the procedure is n(n-1)p^2 m EXOR operations and the same number of OR operations.
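As a quick sanity check of this operation count, the formula n(n-1)p^2 m can be evaluated for some illustrative problem sizes (the numbers below are ours, not from the paper):

```python
# Worked example of the complexity estimate n(n-1)p^2 m
# (illustrative values; m = n(n-1)/2 as in Section I.B).
n = 10                        # number of classes
p = 50                        # training patterns per class
m = n * (n - 1) // 2          # 45 hyperplanes in the first hidden layer
exors = n * (n - 1) * p**2 * m
print(m, exors)               # 45 EXOR channels, 10_125_000 EXOR operations (and as many ORs)
```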
The above described algorithm is characterized by a high degree of parallelism and regularity. Moreover, it is a remarkable fact that the complexity of this "training" algorithm grows only linearly with the dimension m of the pattern space of decision vectors. The algorithm can therefore also be a useful tool in the design of neural networks for connected recognition of large binary patterns. Note finally that the Boolean functions (12) have an alternative representation as a two-layer network of ADALINE neurons with distinct power-of-two coefficients.

IV. CONCLUSIONS

It has been shown that the design of ADALINE neural networks for continuous amplitude patterns can be split into two decoupled design procedures. The first procedure is basically a vector quantization of the pattern space that linearly separates all training patterns. Each cell in the vector quantization "lattice" is completely determined by a code "address", which is generated by the first hidden layer when a pattern of the training set is supplied to the input of the neural network. The second problem is then to form a closed decision region as a Boolean function of cell addresses. We devised a procedure that minimizes this Boolean function by incorporation of "free" cells in the lattice, cells that are not addressed by the training patterns. Note that this is a highly parallel and regular algorithm based entirely on logical operations. The algorithm is of a type that minimizes the number of gates in the Boolean output layer. This property, and the fact that the total complexity of the algorithm depends only linearly on the dimension of the decision vectors, makes the method also an attractive candidate for the design of neural networks for the recognition of large binary patterns. In this special case the data is already quantized in two steps, and therefore the vector quantization appearing in the general (continuous amplitude) case can be omitted.

REFERENCES

[1] B. Widrow, "Generalization and information storage in networks of adaline 'neurons'", in Self-Organizing Systems 1962, M.C. Yovitz, G.T. Jacobi, and G.D. Goldstein, Eds., Washington, DC: Spartan Books, 1962, pp. 435-461.
[2] B. Widrow, R.G. Winter, and R.A. Baxter, "Layered neural nets for pattern recognition", IEEE Trans. on ASSP, Vol. ASSP-36, No. 7, pp. 1109-1118, 1988.
[3] J. Makhoul, R. Schwartz and A. El-Jaroudi, "Classification capabilities of two-layer neural networks", in Proc. ICASSP-89, pp. 635-638, Glasgow, Scotland, May 1989.
[4] L. Schläfli (1814-1895), Gesammelte Mathematische Abhandlungen, Vol. 1, Birkhäuser-Verlag, Basel, 1950.
[5] M.H. Lewin, Logic Design and Computer Organization, Addison-Wesley, Series in Computer Sciences and Information Processing, Menlo Park, CA, 1983.
[6] M.R. Osborne, Finite Algorithms in Optimization and Data Analysis, Wiley, 1985.
[7] G.B. Dantzig, Linear Programming and Extensions, Princeton, NJ: Princeton University Press, 1963.
[8] N.K. Karmarkar, "A new polynomial-time algorithm for linear programming", Combinatorica, Vol. 4, No. 4, pp. 373-395, 1984.
[9] R.J. Vanderbei, M.S. Meketon and B.A. Freedman, "A modification of Karmarkar's linear programming algorithm", Algorithmica, Vol. 1, pp. 395-407, 1986.
[10] S.A. Ruzinsky and E.T. Olsen, "L1 and L-infinity minimization via a variant of Karmarkar's algorithm", IEEE Trans. on ASSP, Vol. ASSP-37, No. 2, pp. 245-253, 1989.
[11] S.A. Ruzinsky, "A simple minimax algorithm", Dr. Dobb's Journal, Vol. 93, pp. 84-101, 1984.
[12] P. Strobach, Linear Prediction Theory: A Mathematical Basis for Adaptive Systems, Springer Series in Information Sciences, Vol. 21, 1990.