


The ANN Book

R. M. Hristev

Edition 1 (supersedes the draft, edition 0, named "Artificial Neural Networks")

Copyright © 1998 by R. M. Hristev

This book is released under the GNU Public License, ver. 2 (the copyleft license). Basically, "...the GNU General Public License is intended to guarantee your freedom to share and change free software, to make sure the software is free for all its users...". It also means that, as it is free of charge, there is no warranty.

Preface

➧ About This Book

In recent years artificial neural networks (ANN) have emerged as a mature and viable framework with many applications in various areas. ANN are mostly applicable wherever some hard to define (exactly) patterns have to be dealt with. "Patterns" are taken here in the broadest sense: applications and models have been developed from speech recognition to (stock) market time series prediction, with almost anything in between, and new ones appear at a very fast pace.

However, to be able to (correctly) apply this technology it is not enough just to throw some data at it randomly and wait to see what happens next. At least some understanding of the underlying mechanism is required in order to make efficient use of it.

Please note that this book is released in electronic format (LaTeX) and under the GNU Public Licence (ver. 2), which allows for free copying and redistribution as long as you do not restrict the same rights for others. See the licence terms included in the file "LICENCE.txt". A freeware version of LaTeX is available for almost any type of computer/OS combination and is practically guaranteed to produce the same high quality typesetting output. On the Internet, the URL where you can find the latest edition/version of this book is "ftp://ftp.funet.fi/pub/sci/neural/books/". Note that you may find two files there: one being this book in PostScript format and the other containing the source files; the source files contain the LaTeX files as well as some additional programs, e.g. a program showing an animated learning process in a Kohonen network.

The programs used in this book were developed mostly under Scilab, available under a very generous licence (basically: free and with source code included) from "http://www-rocq.inria.fr/scilab/". Scilab is very similar to Octave and Matlab. Octave is also released under the GNU licence, so it's free.

This book makes an attempt to cover some of the basics of ANN development: some theories, principles and ANN architectures which have found a way into the mainstream.

The first part covers some of the most widely used ANN architectures. New ones or variants appear at a fast rate so it is not possible to cover them all, but these are among the few with wide applications. This part would be of use as an introduction and for those who have to implement them but do not have to worry about their applications (e.g. programmers required to implement a particular ANN engine for some application; note, however, that some important algorithmic improvements are explained in the second part).

The second part takes a deeper look at the fundamentals, as well as establishing the most important theoretical results. It also describes some algorithmic optimizations/variants for ANN simulators which require a more advanced mathematical apparatus. It is important for those who want to develop applications using ANN. As ANN have been revealed to be statistical in nature, this part requires some basic knowledge of statistical methods.
An appendix containing a small introduction to statistics (very bare, but essential for those who did not study statistics) has been developed.

The third part is reserved for topics which are very recent developments and usually open-ended.

For each section (chapter, sub-section, etc.) there is a special footnote designed in particular to give some bibliographic information. These footnotes are marked with the section number (e.g. 2.3.1 for the sub-section numbered 2.3.1) or with a number and ".*" for chapters (e.g. 3.* for the third chapter). To avoid an ugly appearance they are hidden from the section's title. Follow them for further references.

The appendixes also contain some information which has not been deemed appropriate for the main text (e.g. some useful mathematical results).

The next section describes the notational system used in this book.

➧ Mathematical Notations and Conventions

The following notational system will be used (hopefully in a consistent manner) through the whole book. There will be two kinds of notations: one which is described here and most of the time will not be explained in the text again; the other ones are local (to the chapter, section, etc.) and will also appear in marginal notes at the place where they are defined/used first, marked with the symbol ❖. So, when you encounter a symbol you don't know: first look in this section; if it is not here, follow the marginal notes upstream from the point where you encountered it till you find it, and there should be its definition (you should not need to go beyond the current chapter).

Proofs are typeset in a smaller (8 pt.) font size and refer to previous formulas when not explicitly specified. The reader may skip them; however, following them will enhance one's skills in the mathematical methods used in this field. Do not worry about (fully) understanding all notations defined here right now. Return here when the text sends you back.

***

ANN involve heavy manipulations of vectors and matrices. A vector will often be represented by a column matrix:

x = \begin{pmatrix} x_1 \\ \vdots \\ x_N \end{pmatrix}

and in text it will often be represented by its transpose x^T = (x_1 ... x_N) (for aesthetic reasons and readability). Vectors will be represented by lowercase bold letters. A scalar product between two vectors may be represented by a product between the corresponding matrices:

x · y = Σ_i x_i y_i = x^T y

Other matrices will be represented by uppercase letters. The inverse of a matrix will be marked by ( )^{−1}.

There is an important distinction between scalar and vectorial functions. When a scalar function is applied to a vector or matrix it means in fact that it is applied to each element in turn, i.e.

f(x) = \begin{pmatrix} f(x_1) \\ \vdots \\ f(x_N) \end{pmatrix},  f : R → R

is a scalar function and its application to a vector is just a convenient notation, while

g(x) = \begin{pmatrix} g_1(x) \\ \vdots \\ g_K(x) \end{pmatrix},  g : R^N → R^K

is a vectorial function, and generally K ≠ N. Note that bold letters are used for vectorial functions.

One operator which will be used is ":". The A(i, :) notation will represent row i of matrix A, while A(:, j) will stand for column j of the same matrix.

Another operation used will be the Hadamard product, i.e. the element-wise product between matrices (or vectors), which will be marked with ⊙. The terms and the result have to have the same dimensions (number of rows and columns), and the elements of the result are the products of the corresponding elements of the terms, i.e.

A = \begin{pmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & \ddots & \vdots \\ a_{k1} & \cdots & a_{kn} \end{pmatrix},  B = \begin{pmatrix} b_{11} & \cdots & b_{1n} \\ \vdots & \ddots & \vdots \\ b_{k1} & \cdots & b_{kn} \end{pmatrix}  ⇒  C = A ⊙ B = \begin{pmatrix} a_{11}b_{11} & \cdots & a_{1n}b_{1n} \\ \vdots & \ddots & \vdots \\ a_{k1}b_{k1} & \cdots & a_{kn}b_{kn} \end{pmatrix}

ANN : acronym for Artificial Neural Network(s).
⊙ : Hadamard product of matrices (see above for definition).
( )^n : a convenient notation for ( ) ⊙ ··· ⊙ ( ), n times.
: : matrix "scissors" operator: A(i:j, k:ℓ) selects a submatrix from matrix A, made from rows i to j and columns k to ℓ. A(i, :) represents row i, while A(:, k) represents column k.
( )^T : transpose of matrix ( ).
( )^C : complement of vector ( ); it involves swapping 0 ↔ 1 for binary vectors and −1 ↔ +1 for bipolar vectors.
| · | : module of ( ) (absolute value); when applied to a matrix it is an element-wise operation (by contrast to the norm operation).
‖ · ‖ : norm of a vector or matrix; the actual definition may differ depending on the metric used.
⟨ · ⟩ : mean value of a variable.
E{f|g} : expectation of event f (mean value), given event g. As f and g are usually functions, E{f|g} is a functional.
V{f} : variance of f. As f is usually a function, V{f} is a functional.
sign(x) : the sign function, defined as sign(x) = 1 if x > 0; 0 if x = 0; −1 if x < 0. In case of a matrix, it applies to each element individually.
1̃ : a matrix having all elements equal to 1; its dimensions will always be such that the mathematical operations in which it is involved are correct.
1̂ : a (column) vector having all elements equal to 1; its dimensions will always be such that the mathematical operations in which it is involved are correct.
0̃ : a matrix having all elements equal to 0; its dimensions will always be such that the mathematical operations in which it is involved are correct.
0̂ : a (column) vector having all elements equal to 0; its dimensions will always be such that the mathematical operations in which it is involved are correct.
I : the unit square matrix, assumed always to have the correct dimensions for the operations in which it is involved.
x_i : component i of the input vector.
x : the input vector: x^T = (x_1 ... x_N).
y_j : output of the output neuron j.
y : the output vector of the output layer: y^T = (y_1 ... y_K).
z_k : output of a hidden neuron k.
z : the output vector of a hidden layer: z^T = (z_1 ... z_M).
t_k : component k of the target pattern.
t : the target vector, the desired output corresponding to input x.
w_i, w_{ji} : w_i is the weight associated with the i-th input of a neuron; w_{ji} is the weight associated with the connection to neuron j, from neuron i.
W : the weight matrix,

W = \begin{pmatrix} w_{11} & \cdots & w_{1N} \\ \vdots & \ddots & \vdots \\ w_{K1} & \cdots & w_{KN} \end{pmatrix}

note that all weights associated with a particular neuron j (j = 1, K) are on the same row.
a_j : total input to a neuron j, the weighted sum of its inputs, e.g. a_j = Σ_i w_{ji} x_i for a neuron receiving input x, w_{ji} being the weights.
a : the vector containing the total inputs a_j for all neurons in a same layer; usually a = Wz, where z is the output of the previous layer.
f : activation function of the neuron; the neuron output is f(a_j) and the output of the current layer is z = f(a).
f' : the derivative of the activation function f.
E : the error function.
C_k : class k.
P(C_k) : prior probability of a pattern x to belong to class k.
P(X_ℓ) : distribution probability of a pattern x to be in pattern subspace X_ℓ.
P(C_k, X_ℓ) : joint probability of a pattern x to belong to class k and pattern subspace X_ℓ.
P(X_ℓ|C_k) : class-conditional probability of a pattern x belonging to class k to be in pattern subspace X_ℓ.
P(C_k|X_ℓ) : posterior probability of a pattern x to belong to class k when it is from subspace X_ℓ.
p : probability density.

Note also that wherever possible we will try to reserve the index i for input components, j for hidden neurons, k for output neurons and p for training patterns, with P as the total number of (learning) patterns.

Ryurick M. Hristev

Contents

Preface
  About This Book
  Mathematical Notations and Conventions

I ANN Architectures

1 Basic Neuronal Dynamics
  1.1 Simple Neurons and Networks
  1.2 Neurons as Functions
  1.3 Common Signal Functions

2 The Backpropagation Network
  2.1 Network Structure
  2.2 Network Dynamics
    2.2.1 Neuron Output Function
    2.2.2 Network Running Function
    2.2.3 Network Learning Function
    2.2.4 Initialization and Stop
  2.3 The Algorithm
  2.4 Bias
  2.5 Algorithm Enhancements
    2.5.1 Momentum
    2.5.2 Adaptive Backpropagation
    2.5.3 SuperSAB
  2.6 Applications
    2.6.1 Identity Mapping Network
    2.6.2 The Encoder

3 The SOM/Kohonen Network
  3.1 Network Structure
  3.2 Types of Neuronal Learning
    3.2.1 The Learning Process
    3.2.2 The Trivial Equation
    3.2.3 The Simple Equation
    3.2.4 The Riccati Equation
    3.2.5 More General Equations
  3.3 Network Dynamics
    3.3.1 Network Running Function
    3.3.2 Network learning function
    3.3.3 Initialization and Stop condition
    3.3.4 Remarks
  3.4 The algorithm
  3.5 Applications
    3.5.1 The Trivial Model with Forgetting Function
    3.5.2 Square mapping

4 The BAM/Hopfield Memory
  4.1 Associative Memory
  4.2 The BAM Architecture
  4.3 BAM Dynamics
    4.3.1 Network Running
    4.3.2 The BAM Energy Function
  4.4 The BAM Algorithm
  4.5 The Hopfield Memory
    4.5.1 The Discrete Hopfield Memory
    4.5.2 The Continuous Hopfield Memory
  4.6 Applications
    4.6.1 The Traveling Salesperson Problem

5 The Counterpropagation Network
  5.1 The CPN Architecture
    5.1.1 The Input Layer
    5.1.2 The Hidden Layer
    5.1.3 The Output Layer
  5.2 CPN Dynamics
    5.2.1 Network Running
    5.2.2 Network Learning
  5.3 The Algorithm
  5.4 Applications
    5.4.1 Letter classification

6 Adaptive Resonance Theory (ART)
  6.1 The ART1 Architecture
  6.2 ART1 Dynamics
    6.2.1 The F1 layer
    6.2.2 The F2 layer
    6.2.3 Learning on F1: The W weights
    6.2.4 Learning on F2: The W′ weights
    6.2.5 Subpatterns
    6.2.6 The Reset Unit
  6.3 The ART1 Algorithm
  6.4 The ART2 Architecture
  6.5 ART2 Dynamics
    6.5.1 The F1 layer
    6.5.2 The F2 Layer
    6.5.3 The Reset Layer
    6.5.4 Learning and Initialization
  6.6 The ART2 Algorithm

II Basic Principles

7 Pattern Recognition
  7.1 Patterns: The Statistical Approach
    7.1.1 Patterns and Classification
    7.1.2 Feature Extraction
    7.1.3 Model Complexity
    7.1.4 Classification: Making Decisions and Minimizing Risk
  7.2 Likelihood Function
    7.2.1 The Discriminant Functions
    7.2.2 Likelihood Function and Maximum Likelihood Procedure
  7.3 Statistical Models

8 Single Layer Neural Networks
  8.1 Linear Separability
    8.1.1 Discriminant Functions
    8.1.2 Neuronal Memory Capacity
    8.1.3 Logistic discrimination
    8.1.4 Binary pattern vectors
    8.1.5 Generalized linear discriminants
  8.2 The Least Squares Technique
    8.2.1 The Error Function
    8.2.2 The Pseudo-inverse solution
    8.2.3 The Gradient Descent Solution
  8.3 The Perceptron
    8.3.1 The Error Function
    8.3.2 The Learning Procedure
    8.3.3 Convergence of Learning
  8.4 Fisher Linear Discriminant
    8.4.1 Two Classes Case
    8.4.2 Connections With The Least Squares Technique
    8.4.3 Multiple Classes Case

9 Multi Layer Neural Networks
  9.1 Feed-Forward Networks
  9.2 Threshold Neurons
    9.2.1 Binary Vectors
    9.2.2 Continuous Vectors
  9.3 Sigmoidal Neurons
    9.3.1 Three Layer Networks
    9.3.2 Two Layer Networks
  9.4 Weight-Space Symmetry
  9.5 Higher-Order Neuronal Networks
  9.6 Backpropagation Algorithm
    9.6.1 Error Backpropagation
    9.6.2 Application: Sigmoidal Neurons and Sum-of-squares Error
  9.7 Jacobian Matrix
  9.8 Hessian Tensor
    9.8.1 Diagonal Approximation
    9.8.2 Outer Product Approximation
    9.8.3 Inverse Hessian
    9.8.4 Finite Differences
    9.8.5 Exact Hessian
    9.8.6 Multiplication with Hessian

10 Radial Basis Function Networks
  10.1 Exact Interpolation
  10.2 Radial Basis Function Networks
  10.3 Relation to Other Theories
    10.3.1 Relation to Regularization Theory
    10.3.2 Relation to Interpolation Theory
    10.3.3 Relation to Kernel Based Method
  10.4 Classification
  10.5 Network Learning
    10.5.1 Radial Basis Functions
    10.5.2 Output Layer Weights

11 Error Functions
  11.1 Generalities
  11.2 Sum-of-Squares Error
    11.2.1 Linear Output Units
    11.2.2 Linear Sum Rules
    11.2.3 Significance of Network Output
    11.2.4 Outer product approximation of Hessian
  11.3 Minkowski Error
  11.4 Input-dependent Variance
  11.5 Modeling Conditional Distributions
  11.6 Classification using Sum-of-Squares
    11.6.1 Hidden Neurons
    11.6.2 Weighted Sum-of-Squares
    11.6.3 Loss Matrix
  11.7 Cross Entropy
    11.7.1 Two Classes Case
    11.7.2 Sigmoidal Activation Functions
    11.7.3 Cross-Entropy Properties
    11.7.4 Multiple Independent Features
    11.7.5 Multiple Classes Case
  11.8 Entropy
  11.9 Outputs as Probabilities

12 Parameter Optimization
  12.1 Error Surfaces
  12.2 Local Quadratic Approximation
  12.3 Initialization and Termination of Learning
  12.4 Gradient Descent
    12.4.1 Learning Parameter and Convergence
    12.4.2 Momentum
    12.4.3 Other Gradient Descent Improvement Techniques
  12.5 Line Search
  12.6 Conjugate Gradients
    12.6.1 Conjugate Search Directions
    12.6.2 Quadratic Error Function
    12.6.3 The Algorithm
    12.6.4 Scaled Conjugated Gradients
  12.7 Newton's Method
  12.8 Levenberg-Marquardt Algorithm

13 Feature Extraction
  13.1 Pre/Post-processing
  13.2 Input Normalization
  13.3 Missing Data
  13.4 Time Series Prediction
  13.5 Feature Selection
  13.6 Dimensionality Reduction
    13.6.1 Principal Component Analysis
    13.6.2 Non-linear Dimensionality Reduction Through ANN
  13.7 Invariance
    13.7.1 The Tangent Prop Method
    13.7.2 Preprocessing
    13.7.3 Shared Weights
    13.7.4 Higher-order ANNs

14 Learning Optimization
  14.1 The Bias-Variance Tradeoff
  14.2 Regularization
    14.2.1 Weight Decay
    14.2.2 Linear Transformation And Weight Decay
    14.2.3 Early Stopping
    14.2.4 Curvature Smoothing
    14.2.5 Choosing weight decay hyperparameter
  14.3 Adding Noise
  14.4 Soft Weight Sharing
  14.5 Growing And Pruning Methods
    14.5.1 Cascade Correlation
    14.5.2 Pruning Techniques
    14.5.3 Neuron Pruning
  14.6 Committees of Networks
  14.7 Mixture Of Experts
  14.8 Other Training Techniques
    14.8.1 Cross-validation
    14.8.2 Stacked Generalization
    14.8.3 Complexity Criteria
    14.8.4 Model For Mixed Discrete And Continuous Data

15 Bayesian Techniques
  15.1 Bayesian Learning
    15.1.1 Weight Distribution
    15.1.2 Gaussian Prior Weight Distribution
    15.1.3 Application: Simple Classifier
    15.1.4 Gaussian Noise Model
    15.1.5 Gaussian Posterior Weight Distribution
    15.1.6 Consistent Prior Weight Distribution
    15.1.7 Approximation Of Weight Distribution
  15.2 Network Outputs Distribution
    15.2.1 Generalized Linear Networks
  15.3 Classification
  15.4 The Evidence Approximation For α And β
  15.5 Integration Over α And β
  15.6 Model Comparison
  15.7 Committee Of Networks
  15.8 Monte Carlo Integration
  15.9 Minimum Description Length
  15.10 Performance Of Models
    15.10.1 Risk Averaging

16 Tree Based Classifiers
  16.1 Tree Classifiers
  16.2 Splitting
    16.2.1 Impurity based method
    16.2.2 Deviance based method
  16.3 Pruning
  16.4 Missing Data

17 Belief Networks
  17.1 Graphs
    17.1.1 Markov Properties
    17.1.2 Markov Trees
    17.1.3 Decomposable Trees
  17.2 Causal Networks
  17.3 The Boltzmann Machine

III Advanced Topics

18 Matrix Operations on ANN
  18.1 New Matrix Operations
  18.2 Algorithms
    18.2.1 Backpropagation
    18.2.2 SOM/Kohonen Networks
    18.2.3 BAM/Hopfield Networks
  18.3 Conclusions

A Mathematical Sidelines
  A.1 Distances
    A.1.1 Euclidean Distance
    A.1.2 Hamming Distance
  A.2 Generalized Spherical Coordinates
  A.3 Properties of Symmetric Matrices
    A.3.1 Eigenvectors and Eigenvalues
    A.3.2 Rotation
    A.3.3 Quadratic Forms
  A.4 The Gaussian Integrals
    A.4.1 The Unidimensional Case
    A.4.2 The Multidimensional Case
    A.4.3 The multidimensional Gaussian integral with a linear term
  A.5 The Euler Functions
    A.5.1 The Euler function
    A.5.2 The sphere volume in the n-dimensional space
  A.6 The Lagrange Multipliers
  A.7 Useful Mathematical equations
    A.7.1 Combinatorics
    A.7.2 The Jensen's inequality
    A.7.3 The Stirling Formula
  A.8 Calculus of Variations
  A.9 Principal Components

B Statistical Sidelines
  B.1 Probabilities
    B.1.1 Probabilities and Bayes Theorem
    B.1.2 Probability Density, Expectation and Variance
  B.2 Modeling the Density of Probability
    B.2.1 The Parametric Method
    B.2.2 The non-parametric method
    B.2.3 The Semi-Parametric Method
  B.3 The Bayesian Inference
  B.4 The Robbins-Monro algorithm
  B.5 Learning vector quantization

Bibliography

Index

Part I
ANN Architectures

CHAPTER 1
Basic Neuronal Dynamics

➧ 1.1 Simple Neurons and Networks

[1.* For more information see also [BB95]; it provides some detailed theoretical neuronal models for true neurons.]

First attempts at building artificial neural networks (ANN) were motivated by the desire to create models for natural brains. Much later it was discovered that ANN are a very general statistical framework for modelling posterior probabilities given a set of samples (the input data). [1: In the second part of this book it is explained in greater detail how the ANN output has a statistical significance.]

The basic building block of an (artificial) neural network is the neuron. A neuron is a processing unit which has some (usually more than one) inputs and only one output; see figure 1.1. First each input x_i is weighted by a factor w_i and the weighted sum over all inputs, a = Σ_i w_i x_i, is calculated. Then an activation function f is applied to the result a. The neuronal output is taken to be f(a).

Generally, ANN are built by putting the neurons in layers and connecting the outputs of neurons from one layer to the inputs of the neurons from the next layer; see figure 1.2. The type of network depicted there is also named feedforward (a feedforward network does not have feedbacks, i.e. no "loops"). Note that there is no processing in layer 0; its role is just to distribute the inputs to the next layer (data processing really starts with layer 1); for this reason its representation will be omitted most of the time.
Variations are possible: the output of one neuron may go to the input of any neuron, including itself. If the outputs of neurons from one layer go to the inputs of neurons from previous layers then the network is called recurrent, this providing feedback. Lateral feedback is done when the output of one neuron goes to the other neurons on the same layer. [2: This is used in the SOM/Kohonen architecture.]

[Figure 1.1: The neuron: the inputs x_i go into the weighted sum unit, which computes a = Σ_i w_i x_i; the activation unit then produces the output f(a).]

[Figure 1.2: The general layout of a (feedforward) neural network, with layers 0, 1, ..., L−1, L. Layer 0 distributes the input to the input layer 1. The output of the network is (generally) the output of the output layer (the last layer, L).]

So, to compute the output, an "activation function" is applied to the weighted sum of inputs:

total input: a = Σ_{all inputs} w_i x_i
output = f( Σ_{all inputs} w_i x_i ) = f(a)

More general designs are possible, e.g. higher order ANNs, where the total input to a neuron also contains higher order combinations between inputs (e.g. 2nd order terms, of the form w_{ij} x_i x_j); however, these are seldom used in practice as they involve huge computational efforts without clear-cut benefits.

The tunable parameters of an ANN are the weights {w_i}. They are found by different mathematical procedures [3: Most usual is the gradient-descent method and its derivatives.] by using a given set of data. The procedure of finding the weights is named learning or training. The data set is called the learning or training set and contains pairs of input vectors associated with the desired output vectors: {(x_i, y_i)}. Some ANN architectures do not need a learning set in order to set their weights; in this case the learning is said to be unsupervised (otherwise the learning is supervised).

✍ Remarks:
➥ Usually the inputs are distributed to all neurons of the first layer, this one being called the input layer (layer 1 in figure 1.2). Some networks may use an additional layer whose neurons each receive one single component of the total input and distribute it to all neurons of the input layer. This layer may be seen as a sensor layer and (usually) doesn't do any (real) processing. In some other architectures the input layer is also the sensor layer. Unless otherwise specified it will be omitted.
➥ The last layer is called the output layer. The output set of the output neurons is (commonly) the desired output (of the network).
➥ The layers between input and output are called hidden layers.

➧ 1.2 Neurons as Functions

Neurons behave as functions. Neurons transduce an unbounded input activation x(t) at time t into a bounded output signal f(x(t)). Usually a sigmoidal or S-shaped curve, as in figure 1.3, describes the transduction. This function f is called the activation or signal function. The most used function is the logistic signal function:

f(a) = 1 / (1 + e^{−ca})

which is sigmoidal and strictly increasing for a positive scaling constant c > 0. Strict monotonicity implies that the activation derivative of f is positive:

f' ≡ df/da = c f (1 − f) > 0

The threshold signal function (dashed line in figure 1.3) illustrates a non-differentiable signal function.
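To make the neuron model above concrete, here is a minimal NumPy sketch (not from the book, whose own programs were written for Scilab) of a single neuron with the logistic signal function; the weight and input values are invented for illustration, and the finite-difference check at the end merely illustrates the derivative identity f' = cf(1 − f):

```python
import numpy as np

def logistic(a, c=1.0):
    """Logistic signal function f(a) = 1 / (1 + exp(-c*a))."""
    return 1.0 / (1.0 + np.exp(-c * a))

def logistic_derivative(a, c=1.0):
    """Activation derivative f'(a) = c * f(a) * (1 - f(a))."""
    f = logistic(a, c)
    return c * f * (1.0 - f)

def neuron_output(w, x, c=1.0):
    """Single neuron: weighted sum a = sum_i w_i x_i, then f(a)."""
    a = np.dot(w, x)
    return logistic(a, c)

# Example: a neuron with 3 inputs (values are illustrative only).
w = np.array([0.5, -0.3, 0.8])   # weights (normally found by learning)
x = np.array([1.0, 2.0, 0.5])    # one input vector
print(neuron_output(w, x))

# Check f' = c f (1 - f) against a finite-difference estimate.
a, c, eps = 0.7, 2.0, 1e-6
fd = (logistic(a + eps, c) - logistic(a - eps, c)) / (2 * eps)
print(logistic_derivative(a, c), fd)   # the two values should agree closely
```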
The family of logistic signal functions, indexed by c, approaches asymptotically the threshold function as c → +∞. Then f transduces positive activation signals a to unity signals and negative activations to zero signals. A discontinuity occurs at the zero activation value (which equals the signal function's "threshold"). Zero activation values seldom occur in large neural networks. [4: Threshold activation functions were used in early developments of ANN, e.g. perceptrons; however, because they were not differentiable they represented an obstacle in the development of ANNs till the sigmoidal functions were adopted and gradient descent techniques (for weight adaptation) were developed.]

[Figure 1.3: Signal f(a) as a bounded monotone-nondecreasing function of activation a. The dashed curve defines a threshold signal function.]

The signal velocity df/dt, denoted ḟ, measures the signal's instantaneous time change. ḟ is the product between the change in the signal due to the activation and the change in the activation with time:

ḟ = (df/da)(da/dt) = f' ȧ

➧ 1.3 Common Signal Functions

The following activation functions are most often encountered in practice:

1. Logistic:

f(a) = 1 / (1 + e^{−ca})

where c > 0, c = const., is a positive scaling constant. The activation derivative is f' = df/da = c f (1 − f) > 0, and so f is monotone increasing. This function is the most common one.

2. Hyperbolic tangent:

f(a) = tanh(ca) = (e^{ca} − e^{−ca}) / (e^{ca} + e^{−ca})

where c > 0, c = const., is a positive scaling constant. The activation derivative is f' = df/da = c(1 − f²) > 0, and so f is monotone increasing (f < 1).

3. Threshold:

f(a) = 1 if a > 1/c; 0 if a < 0; ca otherwise (i.e. for a ∈ [0, 1/c])

where c > 0, c = const., is a positive scaling constant. The activation derivative is:

f'(a) = df/da = 0 if a ∈ (−∞, 0) ∪ [1/c, ∞); c otherwise

Note that this is not a true threshold function, as it has a non-infinite slope between 0 and 1/c.

4. Exponential-distribution:

f(a) = max(0, 1 − e^{−ca})

where c > 0, c = const., is a positive scaling constant. The activation derivative is f'(a) = df/da = c e^{−ca}, and for a > 0 supra-threshold signals are monotone increasing (f' > 0). Note: since the second derivative f'' = −c² e^{−ca} is negative, the exponential-distribution function is strictly concave for a > 0.

5. Ratio-polynomial:

f(a) = max(0, aⁿ / (c + aⁿ)), for n > 1

where c > 0, c = const. The activation derivative is:

f'(a) = df/da = c n a^{n−1} / (c + aⁿ)²

and for positive activations supra-threshold signals are monotone increasing.

6. Pulse-coded: In biological neuronal systems the information seems to be carried by pulse trains rather than individual pulses. Train-pulse coded information can be decoded more reliably than shape-pulse coded information (arriving individual pulses can be somewhat corrupted in shape and still be accurately decoded as present or absent). The exponentially weighted time average of sampled binary pulses is:

f(t) = ∫_{−∞}^{t} g(s) e^{s−t} ds

t being time, where the function g is:

g(t) = 1 if a pulse occurs at t; 0 if no pulse at t

and equals one if a pulse arrives at time t or zero if no pulse arrives. The pulse-coded signal function takes values in [0, 1].

Proof. If g(t) = 0 ∀t, then f(t) = 0 (trivial). If g(t) = 1 ∀t, then

f(t) = ∫_{−∞}^{t} e^{s−t} ds = e^{t−t} − lim_{s→−∞} e^{s−t} = 1

When the number of arriving pulses increases, the "pulse count" can only increase, so f is monotone nondecreasing.
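The signal functions above translate directly into code. The following sketch (illustrative only; the function names and sample values are my own) evaluates each of the differentiable ones on a few activation values:

```python
import numpy as np

# Illustrative implementations of the signal functions of section 1.3;
# c is the positive scaling constant in each case.

def logistic(a, c=1.0):
    return 1.0 / (1.0 + np.exp(-c * a))

def hyperbolic_tangent(a, c=1.0):
    return np.tanh(c * a)

def threshold(a, c=1.0):
    # Piecewise-linear "threshold": 0 below 0, slope c on [0, 1/c], then 1.
    return np.clip(c * a, 0.0, 1.0)

def exponential_distribution(a, c=1.0):
    return np.maximum(0.0, 1.0 - np.exp(-c * a))

def ratio_polynomial(a, c=1.0, n=2):
    return np.maximum(0.0, a**n / (c + a**n))

a = np.linspace(-2.0, 2.0, 5)
for f in (logistic, hyperbolic_tangent, threshold,
          exponential_distribution, ratio_polynomial):
    print(f.__name__, np.round(f(a), 3))
```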
CHAPTER 2
The Backpropagation Network

The backpropagation network represents one of the most classical examples of an ANN, being also one of the most simple in terms of its overall design.

➧ 2.1 Network Structure

The network consists of several layers of neurons. The first one (let it be layer 0) distributes the inputs to the input layer 1. There is no processing in layer 0; it can be seen just as a sensory layer: each neuron receives just one component of the input (vector) x, which gets distributed, unchanged, to all neurons from the input layer. The last layer is the output layer, which outputs the processed data; each output of an individual output neuron is a component of the output vector y. The layers between the input one and the output one are hidden layers.

✍ Remarks:
➥ Layer 0 has to have the same number of neurons as the number of input components (the dimension of the input vector x).
➥ The output layer has to have the same number of neurons as the desired output has (i.e. the dimension of the output vector y dictates the number of neurons on the output layer).
➥ In general the input and hidden layers may have any number of neurons; however, their number may be chosen to achieve some special effects in some practical cases.

The network is a straight feedforward network: each neuron receives as input the outputs of all neurons from the previous layer (excepting the first, sensory, layer). See figure 2.1.

[Figure 2.1: The backpropagation network structure: layer 0 distributes the input x; each layer ℓ of N_ℓ neurons feeds the next, through the weights w_{ℓkj}, up to the output layer L.]

The following notations are used:
• z_{ℓj} is the output of neuron j from layer ℓ.
• w_{ℓkj} is the weight by which the output of neuron j from layer ℓ−1 contributes to the input of neuron k from layer ℓ.
• x_p is training input vector no. p.
• t_p ≡ t(x_p) is the target (desired output) vector no. p (at training time).
• z_{0i} is the i-th component of the input vector. By notation, at training time, z_{0i} ≡ x_i, where x_i is the component i of one of the input vectors, for some p.
• N_ℓ is the number of neurons in layer ℓ.
• L is the number of the last layer (the sensory layer is no. 0, the output layer is no. L).
• P is the number of training vectors, p = 1, P.

The learning set is (according to the above notation) {(x_p, t_p)}_{p=1,P}.

➧ 2.2 Network Dynamics

2.2.1 Neuron Output Function

The activation function used is usually the logistic:

f(a) = 1 / (1 + exp(−ca)),  f : R → (0, 1),  c > 0, c = const.   (2.1)

df/da = c exp(−ca) / [1 + exp(−ca)]² = c f(a) [1 − f(a)]   (2.2)
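As a purely illustrative reading of this notation, the following NumPy sketch lays out one weight matrix W_ℓ of shape N_ℓ × N_{ℓ−1} per layer, initialized with small random values as the learning procedure below will assume; the layer sizes are arbitrary choices, not from the book:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative layer sizes: N_0 inputs, one hidden layer, N_L outputs.
layer_sizes = [4, 3, 2]          # N_0, N_1, N_2 (so L = 2)

# One weight matrix W_ell per layer ell = 1..L; row k holds the weights
# w_{ell,k,j} of neuron k in layer ell, one per neuron j of layer ell-1.
weights = [rng.uniform(-1.0, 1.0, size=(n, m))
           for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]

for ell, W in enumerate(weights, start=1):
    print(f"W_{ell} shape = {W.shape}")   # (N_ell, N_(ell-1))
```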
A ` ` ` w ` N  w `1 `N zT `;1 ` `;1    z ;1 ;1 1 ; ` ;N `;1  the output of the actual layer ` may be calculated as: zT ` = ; f (aT ) = f (a 1)    f (a ` ` = 2.2.3 Network Learning Function ` `  ❖ a` ` The network learning process is supervised i.e. the network receives (at training phase) both the raw data as inputs and the targets as output. The learning involves adjusting weights so that errors will be minimized. The function used to measure errors is usually the sum-ofsquares de ned below but note that backpropagation algorithm is not particularly tied up to it. De nition 2.2.1. For an input pattern x and the associated target t, the sum-of-squares error function E (W ) (E is dependent on all weights W ) is de ned as: E (W )  2 1 XL N q =1 z [ Lq ( Lq Lq ❖ E ,W ❖ z p x) ; tq (x)]2 where z is the output of neuron q from the output layer i.e. the component q of the output vector. Note that all components of input vector will in uence any component of output vector, thus z = z (x). Lq ❖ z` `) `N W z ;1 . where a ` `N N ` ` W 1 (note that all weights associated with a particular neuron are on the same row) then, considering the output vector of the previous layer z ;1 ; = z ❖ Lq 12 ❖ CHAPTER 2. THE BACKPROPAGATION NETWORK E ✍ Remarks: ➥ tot. Considering all learning samples (the full training set) then the total sum-ofsquares error sum-of-squares error E (W ) (E is also dependent of all weights as E ) is de ned as: tot. E tot. ❖ NW tot. NL P P X X X (W )  12 E (W ) = [zLq (xp ) ; tq (xp )] 2 p=1 p=1 q=1 The network weights are found (weights are adapted/changed) step by step. Considering NW the total number of weights then the error function E : RNW !(R may be)represented as a surface in the RNW +1 space. The gradient vector rE ❖ t  = @E (W ) @w`ji shows the direction of (local) maximum of square mean error and fw`ji g are to be changed in the opposite direction (so \;" have to be used) | see also gure 2.3 on the next page. In the discrete time t approximation, at step t + 1, given the weights at step t, the weights are adjusted as: w`ji (t + 1) = w`ji (t) ;  ❖ (2.4) P X @Ep (W ) @E (W ) = w`ji (t) ;  @w`ji W (t) p=1 @w`ji W (t) where  = const.,  > 0 is named the learning constant and it is used for tuning the speed and quality of the learning process. In matrix notation the above equation may be written simply as: W (t + 1) = W (t) ; rE (2.5) because the error gradient may also be considered as a matrix or tensor, it have the same dimensions as W . ✍ Remarks: ➥ delta rule ➥ The above method does not provide for a starting point. In practice, weights are initialized with small random values (usually in the [;1; 1] interval). The (2.5) equation represents the basics of weight adjusting, i.e. learning, in many ANN architectures; it is known as the delta rule: W = W (t + 1) ; W (t) / ;rE ➥ ➥ ➥ If  is too small then the learning is slow and it may stop the learning process into a local minimum (of E ), being unable to overtake a local maximum. See gure 2.3 on the facing page. If  is too large then the learning is fast but it may jump over local minimum (of E ) which may be deeper than the next one. See gure 2.3 on the next page (Note that in general that is a surface). Another point to consider is the problem of oscillations. When approaching error minima, a learning step may overshot it, the next one may again overshot it 2.2. NETWORK DYNAMICS 13 E (w ) w(t + 3) w ( t) w(t + 2) w(t + 1) w Figure 2.2: Oscillations in learning process. 
The weights move around E minima without being able to reach it (arrows show the jumps made by the learning process). E (w ) rE local maximum local minimum w Figure 2.3: E (w) | Total square error as function of weights. rE points towards the (local) maximum. bringing back the weights to a similar point to previous one. The net result is that weights are changed to values around minima but never able to reach it. See gure 2.2. This problem is particularly likely to occur for deep and narrow minima because in this case rE is large (deep  steep) and subsequently W is large (narrow  easy to overshot). The problem is to nd the error gradient rE . Considering the \standard" approach (i.e. frE g`ji  E for some small w`ji ) this would require an computational time of the w`ji order O(NW ), because each calculation of E require O(NW ) and it have to be repeated 2 for each w`ji in turn. The importance of the backpropagation algorithm resides in the fact that it reduces the computational time of rE to O(NW ), thus greatly improving the speed of learning. Theorem 2.2.1. Backpropagation algorithm. For each layer (except 0, input), an error gradient matrix may be build as follows: backpropagation ❖ (rE )` 14 CHAPTER 2. THE BACKPROPAGATION NETWORK 0 @E    B @w..`11 . . (r )`  B . @ . E ❖ r z` E .. .  @E @w`N` 1 1 CC A @E @w`1N`;1 @E ; ` = 1; L @w`N` N`;1 For each layer, except L the error gradient, with respect to neuronal outputs, may be de ned as: rz`  E  @E  @z`1 @E @z`N`  ; ` = 1; L ;1 The error gradient with respect to network output zL is considered to be known and dependent only on network outputs fzL (xp )g and the set of targets ftp g. rzL = E known. Then considering the error function E and the activation function f and its total derivative 0 then the error gradient may be computed recursively according to the formulas: f T  rz E f 0 (a`+1 ) calculated recursively from L ; 1 to 1 (2.6a) +1 `+1 0 T (2.6b) (rE )` = [rz` E f (a` )]  z`;1 for layers ` = 1; L rz l ❖ z0 = E W` where z0  x. Proof. The error E (W ) is dependent on w`ji trough the output of neuron (j; i) i.e. z`j : @E @w `ji = @ z`j `j @ w`ji @E @z and each derivative is computed separately ❐ term `j `ji @z @w `j @w`ji @z @ = `ji @w 2 0N`;1 4 @X f m=1 0N`;1 `jm z w 13 `;1;m A5 = 1 0N`;1 1 0 X w`jm z`;1;m A  @ @ X w`jm z`;1;m A = =f @ @w`ji m=1 m=1 = f 0 (a`j )  z`;1;i because weights are mutually independent. ❐ term @E @z `j Neuron z`j a ect E trough all following layers that are intermediate between layer in uence being exercised through the interposed neurons). First a ected is next, successive layers): @E @z `j = ` , layer, trough term +1 NX `+1 m=1 @E `+1;m @z @z `+1;m `j @z = @z ` and output (the `+1;m (and then the dependency is carried on next `j @z 2.2. NETWORK DYNAMICS = = = NX `+1 15 @E m=1 @z`+1;m NX `+1 @E m=1 @z`+1;m NX `+1  13 2 0N 4f @X̀ w`+1;mq z`q A5 = @ @z`j 0N X̀  f0 @ q=1 q=1 w`+1;mq z`q 1 A @ @z`j 0N 1 @X̀ w`+1;mq z`q A = q=1 @E f 0 (a`+1;m ) w`+1;mj @z`+1;m m=1 which represents exactly the element j of column matrix rz` E as build from (2.6a). The above formula applies iteratively from layer L ; 1 to 1, for layer L, rzL E is assumed known. Finally, the desired derivative is: @E @w`ji = @E @z`j @z`j @w`ji 2N`+1 X =4 @E 3 m=1 @z`+1;m f 0 (a`+1;m ) w`+1;mj 5 f 0 (a`j ) z`;1;i representing the element found at row j , column i of matrix (rE )` as build from (2.6b). Proposition 2.2.1. 
If using the logistic activation function and sum-of-squares error function then the error gradient may be computed recursively according to the formulas: (2.7a) rzL E = zL (x) ; t rz ` = E h T ` cW +1  rz`+1 h (r )` = rz` E c E E z`+1 i (1b ; z` ) +1 (1b ; z`)  z`; z` T 1 i for ` = 1; L ; 1 (2.7b) for ` = 1; L (2.7c) where z0  x Proof. From de nition 2.2.1: @E @zLj = zLj ; tj ) rzL E = zL ; t By using (2.2) in the main results (2.6a) and (2.6b) of theorem 2.2.1, and considering that f (a` ) = z` the other two formulas are deducted immediately. 2.2.4 Initialization and Stop Weights are initialized (in practice) with small random values and the adjusting process continue by iteration. The stopping of the learning process can be done by one of the following methods: ➀ choosing a xed number of steps t = 1; T . ➁ the learning process continue until the adjusting quantity w`ji = w`ji(at time t+1) ; w`ji(at time t) is under some speci ed value, 8`, 8j , 8i. ➂ learning stops when the total error, e.g. the total sum-of-squares Etot. , attain a minima on a test set, not used for learning. ✍ Remarks: ➥ If the trained network performs well on the training set but have bad results on previously unseen patterns (i.e. it have pour generalization capabilities) then 16 CHAPTER 2. THE BACKPROPAGATION NETWORK this is usually a sign of overtraining (assuming, of course, that the network is reasonably build and there are a sucient number of training patterns). ➧ 2.3 The Algorithm The algorithm is based on discrete time approximation, i.e. time is t = 0; 1; 2; : : : . The activation and error functions and the stop condition are presumed to be chosen (known) and xed. Network running procedure: 1. The input layer is initialised, i.e. the output of input layer is made to be x: z0 x For all layers ` = 1; L | starting with rst hidden layer 1 | do: z` = f (W` z`;1 ) 2. The output of the network is taken to be the output of the output layer, i.e. y  zL . Network learning procedure: 1. Initialize all fw`ji g weights with (small) random values. 2. For all training sets (xp ; tp ) (as long as the stop condition is not met) do: (a) Run the network to nd the activations on all neurons a` and then the derivatives 0 f (a` ). The network output yp  zL (xp ) = f (aL ) will also be required on next step. NOTE: The algorithm require the derivatives of activation functions for all neurons. For most activation functions this may be expressed in terms of activation itself, i.e. f 0 (a` ) = g(z` ), as is the case for logistic, see (2.2). This approach may reduce the memory usage or increase speed (or both in case of logistic function). (b) Using (yp ; tp ), calculate rzL E , e.g. for sum-of-squares use (2.7a). (c) Compute the error gradient.  For output layer (rE )L is calculated directly from (2.6b) (or from (2.7c) for sum-of-squares and logistic).  For all other layers ` = 1; L ; 1, going backwards from L ; 1 to 1, calculate rst rz` E using (2.6a) (or (2.7b) for sum-of-squares and logistic). Then calculate (rE )` using (2.6b) (or respectively (2.7c)). (d) Update the W weights according to the delta rule (2.5). (e) Check the stop condition and exit if it have been met. ✍ 1 Remarks: ➥ In most cases1 a better performance is obtained when training repeatedly with the whole training set. A shuing of patterns is recommended, between repeats. But e.g. not when patterns form a time series. 2.3. THE ALGORITHM ➥ 17 Trying to stop error backpropagation when is below some threshold value  may also improve learning, e.g. 
in case of sum-of-squares the rzL E may be changed to: @E @zLq ➥ ( = ➥ 0 if jzLq ; tq j >  otherwise for q = 1; NL i.e. rounding towards 0 the elements of rzL E smaller than . The classical way of calculating the gradient, i.e. @E @w`ji ➥ zLq ; tq  E (w`ji + ") 2;" E (w`ji ; ") ; " & 0 while too slow for direct usage, is an excellent tool for checking the correctness of algorithm implementation. There are not (yet) good theoretical methods of choosing the learning parameters (constants)  and . The practical, hands-on, approach is still the best. Usual values for  are in the range [0:1; 1] (but some networks may learn even faster with  > 1) and [0; 0:1] for . In accordance with neuron output function (2.1) the output of the neuron have values within (0; 1) (in practice, due to rounding errors, the range is in fact [0; 1]). If the desired outputs have values within [0; 1) then the following transforming function may be used: y(x) = 1 ; exp(; x) ; >0; = const. which have the inverted: y;1(x) = ➥ 1 ln 1 1 ;x The same procedure described above may be used for inputs. This kind of transformation can be used each time the desired input/output falls beyond neuron activation function range. By no means reaching the absolute minima of E is guaranteed. First the training set is limited and the error minima with respect to the learning set may will generally not coincide with the minima considering all possible patterns, but in most cases should be close enough for practical applications. On the other hand, the error surface have a symmetry, e.g. swapping two neurons from the same layer (or in fact their weights) will not a ect the network performance, so the algorithm will not search trough the whole weight space but rather a small area of it. This is also the reason for which the starting point, given randomly, will not a ect substantially the learning process. ❖  18 CHAPTER 2. THE BACKPROPAGATION NETWORK z0 z1 ` z ` ` `N layer ` 1 w +1 1 0 ` Figure 2.4: layer ` + 1 ; ; Bias may be emulated with the help of an additional neuron z 0 those output is always 1 and is distributed to all neurons from next layer (exactly as for a \regular" neuron). ` ➧ 2.4 Bias Some problems, while having an obvious solution, cannot be solved with the architecture described above2. To solve these, the neuronal activation (2.3) is changed to: 0 z = f @w `k `; X `k 0+ j =1 w z ;1 `kj ` `k ` z`  z `  ` and `N 0w 10 w 11 B . .. f = @ .. W . ` ` ^ E)  ` w `0 w `1 `N (r (2.8) `k w 0 eT = ;1 z 1 ❖ ;j 1 A and the new parameter w 0 introduced is named bias. As it may be immediately be seen the change is equivalent to inserting a new neuron z 0 , whose activation (output) is always 1, on all layers except output. See gure 2.4. The required changes in neuronal outputs and weight matrices are: bias ❖ 1 N `N ...  w1 ` N w .. . `;1 1 CA ` `;1 `N N biases being added as a rst column in W , and then the neuronal output is calculated as f ez ;1). z = f (a ) = f (W ` ` ` ` ^ ^ ` ` f is: The error gradient matrix (rE ) associated with W 0 B E) = B @ (r ` ` @E `10 @w .. . ` @E @w `N` 0 @E `11 @w .. . @E @w `N` 1  @E @w   `1N`;1 .. . @E `N` N`;1 1 CC A @w Following the changes from above, the backpropagation theorem becomes: 2 E.g. the tight encoder described later. 2.4. BIAS 19 Theorem 2.4.1. Backpropagation with biases. If the error gradient with respect to neuronal outputs rzL E is known, and depends only on (actual) network outputs fzL (xp )g and targets ftp g: rzL = E known. 
then the error gradient (with respect to weights) may be calculated recursively according to the formulas:
$$\nabla_{z_\ell} E = W_{\ell+1}^T \left[ \nabla_{z_{\ell+1}} E \odot f'(a_{\ell+1}) \right] \quad \text{calculated recursively from } L-1 \text{ to } 1 \tag{2.9a}$$
$$(\widehat{\nabla E})_\ell = \left[ \nabla_{z_\ell} E \odot f'(a_\ell) \right] \tilde{z}_{\ell-1}^T \quad \text{for layers } \ell = \overline{1, L} \tag{2.9b}$$
where $z_0 \equiv \mathbf{x}$.

Proof. See theorem 2.2.1 and its proof. Equation (2.9a) results directly from (2.6a). Columns $2$ to $N_{\ell-1}+1$ of $(\widehat{\nabla E})_\ell$ represent $(\nabla E)_\ell$ as given by (2.6b). The only terms remaining to be calculated are those of the first column of $(\widehat{\nabla E})_\ell$, i.e. terms of the form $\frac{\partial E}{\partial w_{\ell j 0}}$, $j$ being the row index. But these terms may be written as (see the proof of theorem 2.2.1):
$$\frac{\partial E}{\partial w_{\ell j 0}} = \frac{\partial E}{\partial z_{\ell j}} \frac{\partial z_{\ell j}}{\partial w_{\ell j 0}}$$
where $\frac{\partial E}{\partial z_{\ell j}}$ is term $j$ of $\nabla_{z_\ell} E$, already calculated, and from (2.8):
$$\frac{\partial z_{\ell j}}{\partial w_{\ell j 0}} = f'(a_{\ell j}) \cdot 1 = f'(a_{\ell j})$$
As $\tilde{z}_{\ell-1,1} \equiv 1$ (by construction), formula (2.9b) proves correct.

Proposition 2.4.1. If using the logistic activation function and the sum-of-squares error function then the error gradient may be computed recursively using the formulas:
$$\nabla_{z_L} E = z_L(\mathbf{x}) - \mathbf{t} \tag{2.10a}$$
$$\nabla_{z_\ell} E = c\, W_{\ell+1}^T \left[ \nabla_{z_{\ell+1}} E \odot z_{\ell+1} \odot (\hat{1} - z_{\ell+1}) \right] \quad \text{for } \ell = \overline{1, L-1} \tag{2.10b}$$
$$(\widehat{\nabla E})_\ell = c \left[ \nabla_{z_\ell} E \odot z_\ell \odot (\hat{1} - z_\ell) \right] \tilde{z}_{\ell-1}^T \quad \text{for } \ell = \overline{1, L} \tag{2.10c}$$
where $z_0 \equiv \mathbf{x}$.

Proof. It is proved the same way as proposition 2.2.1 but using theorem 2.4.1 instead.

✍ Remarks:
➥ The algorithm for a backpropagation ANN with biases is (mutatis mutandis) identical to the one described in section 2.3.
➥ In practice, biases are usually initialized with $0$.

➧ 2.5 Algorithm Enhancements

2.5.1 Momentum [see [BTW95] p. 50]

The weight adaptation described in standard ("vanilla") backpropagation (see section 2.2.3) is very sensitive to small perturbations. If a small "bump" appears in the error surface the algorithm is unable to jump over it and it will change direction. This situation is avoided by taking into account the previous adaptations in the learning process (see also (2.5)):
$$\Delta W(t) = W(t+1) - W(t) = -\eta\, \nabla E\big|_{W(t)} + \mu\, \Delta W(t-1) \tag{2.11}$$
This procedure is named backpropagation with momentum and $\mu \in [0,1)$ is named the momentum (learning) parameter. The algorithm is very similar (mutatis mutandis) to the one described in section 2.3. As the main memory consumption is given by the requirement to store the weight matrix (especially true for large ANN), the momentum algorithm requires double the amount of standard backpropagation, to store $\Delta W$ for the next step.

✍ Remarks:
➥ When choosing the momentum parameter the following results have to be considered:
• if $\mu > 1$ then the contribution of each $\Delta w_{\ell ji}$ grows infinitely;
• if $\mu \searrow 0$ then the momentum contribution is insignificant;
so $\mu$ should be chosen somewhere in $[0.5, 1)$ (in practice, usually $\mu \approx 0.9$).
➥ The momentum method assumes that the error gradient slowly decreases when approaching the absolute minimum. If this is not the case then the algorithm may jump over it.
➥ Another improvement over momentum is the flat spot elimination. If the error surface is very flat then $\nabla E \approx \tilde{0}$ and subsequently $\Delta W \approx \tilde{0}$. This may lead to a very slow learning due to the increased number of training steps required. To avoid this problem, a change to the calculation of the error gradient (2.6b) may be performed as follows:
$$(2.6b) \;\to\; (\nabla E)_{\ell,\text{pseudo}} = \left\{ \nabla_{z_\ell} E \odot \left[ f'(a_\ell) + c_f\, \hat{1} \right] \right\} z_{\ell-1}^T$$
where $(\nabla E)_{\ell,\text{pseudo}}$ is no longer the real $(\nabla E)_\ell$. The constant $c_f$ is named the flat spot elimination constant. Several points to note here:
• The procedure of adding a term to $f'$ instead of multiplying it means that $(\nabla E)_{\ell,\text{pseudo}}$ is more affected when $f'$ is smaller, a desirable effect.
Figure 2.5: Learning with momentum: a contributing term $\mu\,\Delta W(t-1)$ from the previous step is added to $-\eta\,\nabla E|_{W(t)}$ to give $\Delta W(t)$.

• The error gradient terms corresponding to a weight close to the input layer are smaller than a similar term for a weight closer to the output layer, because the effect of changing the weight gets attenuated when propagated through the layers. So another effect of $c_f$ is a speed-up of the weight adaptation in layers close to the input, again a desirable effect.
• The formulas (2.7c), (2.9b) and (2.10c) change in the same way:
$$(\nabla E)_{\ell,\text{pseudo}} = \left\{ \nabla_{z_\ell} E \odot \left[ c\, z_\ell \odot (\hat{1} - z_\ell) + c_f\, \hat{1} \right] \right\} z_{\ell-1}^T \tag{2.12a}$$
$$(\widehat{\nabla E})_{\ell,\text{pseudo}} = \left\{ \nabla_{z_\ell} E \odot \left[ f'(a_\ell) + c_f\, \hat{1} \right] \right\} \tilde{z}_{\ell-1}^T \tag{2.12b}$$
$$(\widehat{\nabla E})_{\ell,\text{pseudo}} = \left\{ \nabla_{z_\ell} E \odot \left[ c\, z_\ell \odot (\hat{1} - z_\ell) + c_f\, \hat{1} \right] \right\} \tilde{z}_{\ell-1}^T \tag{2.12c}$$
➥ In physical terms: the set of weights $W$ may be thought of as a set of coordinates defining a point in the space $\mathbb{R}^{N_W}$. During learning, this point is moved towards reducing the error $E$. The momentum introduces an "inertia" proportional to $\mu$, such that when changing direction under the influence of the $\nabla E$ "force" it has a tendency to keep the old direction of movement and "overshoot" the point given by $-\eta\,\nabla E$. See figure 2.5. The momentum method assumes that if the weights have been moved in some direction then this direction is good for the next steps and is kept as a trend: unwinding the weight adaptation over 2 steps (applying (2.11) twice, for $t-1$ and $t$) gives:
$$\Delta W(t) = -\eta\, \nabla E\big|_{W(t)} - \eta\mu\, \nabla E\big|_{W(t-1)} + \mu^2\, \Delta W(t-2)$$
and it can be seen that the contributions of previous $\Delta W$ gradually disappear with the increase of the power of $\mu$ (as $\mu < 1$). A minimal sketch of the momentum update appears below.
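The following Python/NumPy fragment sketches the momentum update (2.11) together with the flat-spot pseudo-gradient (2.12a) for the logistic case; it is illustrative only, and the names and constants are mine.

    import numpy as np

    eta, mu, c, cf = 0.5, 0.9, 1.0, 0.1    # illustrative learning constants

    def momentum_step(W, dW_prev, grads):
        """Delta rule with momentum (2.11): dW(t) = -eta*grad + mu*dW(t-1)."""
        dW = [-eta * G + mu * D for G, D in zip(grads, dW_prev)]
        for Wl, Dl in zip(W, dW):
            Wl += Dl
        return dW                           # keep for the next step

    def pseudo_gradient(g, z_out, z_in):
        """Flat-spot variant (2.12a): add cf to the logistic derivative c*z*(1-z)."""
        delta = g * (c * z_out * (1 - z_out) + cf)
        return np.outer(delta, z_in)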
2.5.2 Adaptive Backpropagation [see [BTW95] p. 50]

The main idea of this algorithm comes from the following observations:
• If the slope of the error surface is gentle then a big learning parameter could be used to speed up learning over flat spot areas.
• If the slope of the error surface is steep then a small learning parameter should be used to avoid overshooting the error minimum.
• In general the slopes are gentle in some directions and steep in the other ones.

This algorithm is based on assigning individual learning rates to each weight $w_{\ell ji}$ based on the previous behavior. This means that the learning constant $\eta$ becomes a matrix of the same dimension as $W$. The learning rate is increased if the gradient kept its direction over the last two steps (i.e. it is likely to continue) and decreased otherwise:
$$\eta_{\ell ji}(t) = \begin{cases} I\, \eta_{\ell ji}(t-1) & \text{if } \Delta w_{\ell ji}(t)\, \Delta w_{\ell ji}(t-1) > 0 \\ D\, \eta_{\ell ji}(t-1) & \text{if } \Delta w_{\ell ji}(t)\, \Delta w_{\ell ji}(t-1) < 0 \end{cases} \tag{2.13}$$
where $I > 1$ and $D \in (0,1)$. The $I$ parameter is named the adaptive increasing factor and the $D$ parameter is named the adaptive decreasing factor. In matrix form, equation (2.13) may be written considering the matrix of $\Delta w_{\ell ji}$ sign changes, i.e. $\operatorname{sign}[\Delta W(t) \odot \Delta W(t-1)]$:
$$\eta(t) = \left\{ (I - D)\, \operatorname{sign}\!\left[ \operatorname{sign}(\Delta W(t) \odot \Delta W(t-1)) + \tilde{1} \right] + D\, \tilde{1} \right\} \odot \eta(t-1) \tag{2.14}$$

Proof. The problem is to build a matrix containing $1$-es corresponding to each $\Delta w_{\ell ji}(t)\, \Delta w_{\ell ji}(t-1) > 0$ and $0$-es in rest. This matrix, multiplied by $I$, will be used to increase the corresponding $\eta_{\ell ji}$ elements. The complementary matrix will be used to modify the matching $\eta_{\ell ji}$ which have to be decreased. The $\operatorname{sign}(\Delta W(t) \odot \Delta W(t-1))$ matrix has elements consisting only of $1$, $0$ and $-1$. By adding $\tilde{1}$ and taking the sign again, all $1$ and $0$ elements are transformed to $1$ while the $-1$ elements are transformed to zero. So the desired matrix is $\operatorname{sign}[\operatorname{sign}(\Delta W(t) \odot \Delta W(t-1)) + \tilde{1}]$, while its complementary is $\tilde{1} - \operatorname{sign}[\operatorname{sign}(\Delta W(t) \odot \Delta W(t-1)) + \tilde{1}]$. Then the updating formula for $\eta$ finally becomes:
$$\eta(t) = I\, \eta(t-1) \odot \operatorname{sign}\!\left[\operatorname{sign}(\Delta W(t)\odot\Delta W(t-1)) + \tilde{1}\right] + D\, \eta(t-1) \odot \left\{ \tilde{1} - \operatorname{sign}\!\left[\operatorname{sign}(\Delta W(t)\odot\Delta W(t-1)) + \tilde{1}\right] \right\}$$

✍ Remarks:
➥ $\{\eta_{\ell ji}\}$ is initialized with a constant $\eta_0$ and $\Delta W(0) = \tilde{0}$. The learning parameter matrix $\eta$ is updated after each training session (considering the current $\Delta W(t)$). For the rest, the same main algorithm as described in section 2.3 applies. Note that after initialization, when $\eta(0) = \eta_0$ and $\Delta W(0) = \tilde{0}$, the first step will lead automatically to the increase $\eta(1) = I\eta_0$, so $\eta_0$ should be chosen accordingly. Also, this algorithm requires three times as much memory compared to standard backpropagation, to store $\eta$ and $\Delta W$ for the next step, both being of the same size as $W$.
➥ If $I = 1$ and $D = 1$ then the effect of the algorithm is obviously void. In practice $I \in [1.1, 1.3]$ and $D \lesssim 1/I$ give the best results for a wide spectrum of applications.
➥ Note that $\operatorname{sign}(\Delta W(t) \odot \Delta W(t-1))$ could be replaced by $\operatorname{sign}(\Delta W(t)) \odot \operatorname{sign}(\Delta W(t-1))$. This is a tradeoff between one floating point multiplication followed by a sign versus two sign operations followed by an integer multiplication; whichever is faster may depend on the actual system used.
➥ Due to the fact that the next change is not exactly in the direction of the error gradient (because each component of $\nabla E$ is multiplied with a different constant) this technique may cause problems. This may be avoided by testing the output after an adaptation has taken place: if there is an increase in the output error then the adaptation should be rejected and the next step calculated with the classical method; the adaptation process can then be resumed at the next step.

2.5.3 SuperSAB [see [BTW95] p. 51]

SuperSAB (Super Self-Adapting Backpropagation) is a combination of the momentum and adaptive backpropagation algorithms. The algorithm uses adaptive backpropagation for the $w_{\ell ji}$ terms which continue to move in the same direction and momentum for the others, i.e.:
• If $\Delta w_{\ell ji}(t)\, \Delta w_{\ell ji}(t-1) > 0$ then:
$$\eta_{\ell ji}(t) = I\, \eta_{\ell ji}(t-1), \qquad \Delta w_{\ell ji}(t+1) = -\eta_{\ell ji}(t)\, \frac{\partial E}{\partial w_{\ell ji}}\bigg|_{W(t)}$$
the momentum being $0$ because it is not necessary; the learning rate grows in geometrical progression due to the adaptive algorithm.
• If $\Delta w_{\ell ji}(t)\, \Delta w_{\ell ji}(t-1) < 0$ then:
$$\eta_{\ell ji}(t) = D\, \eta_{\ell ji}(t-1), \qquad \Delta w_{\ell ji}(t+1) = -\eta_{\ell ji}(t)\, \frac{\partial E}{\partial w_{\ell ji}}\bigg|_{W(t)} - \mu\, \Delta w_{\ell ji}(t)$$
Note the "$-$" sign in front of $\mu$, which is used to cancel the previous "wrong" weight adaptation (not to boost $\Delta w_{\ell ji}$ as in the momentum method); the corresponding $\eta_{\ell ji}$ is decreased to get smaller steps.

In matrix notation the SuperSAB rules are written as:
$$\eta(t) = \left\{ (I-D)\, \operatorname{sign}\!\left[\operatorname{sign}(\Delta W(t)\odot\Delta W(t-1)) + \tilde{1}\right] + D\,\tilde{1} \right\} \odot \eta(t-1)$$
$$\Delta W(t+1) = -\eta(t) \odot \nabla E - \mu\, \Delta W(t) \odot \left\{ \tilde{1} - \operatorname{sign}\!\left[\operatorname{sign}(\Delta W(t)\odot\Delta W(t-1)) + \tilde{1}\right] \right\}$$

Proof. The first equation comes directly from (2.14). For the second equation, the matrix $\tilde{1} - \operatorname{sign}[\operatorname{sign}(\Delta W(t)\odot\Delta W(t-1)) + \tilde{1}]$ contains, as elements, $1$ if $\Delta w_{\ell ji}(t)\,\Delta w_{\ell ji}(t-1) < 0$ and zero in rest, so momentum terms are added exactly to the $\Delta w_{\ell ji}$ requiring them; see the proof of (2.14).

✍ Remarks:
➥ While this algorithm uses the same main algorithm as described in section 2.3 (of course with the required changes), note however that the memory requirement is four times higher than for standard backpropagation, to store the supplementary $\eta$ and two $\Delta W$.
➥ Arguably, the matrix notation for this algorithm may be less beneficial in terms of speed. However, there is a benefit in splitting the effort of implementation into two levels: a lower one dealing with matrix operations and a higher one dealing with the implementation of the algorithm itself. Beyond this, an efficient matrix operations implementation may already be developed for the targeted system (e.g. an efficient matrix multiplication algorithm may be several times faster than the classical one when written specifically for the system used: for a fast implementation of matrix multiplication on a RISC processor, with up to 8 times speed increase, see [Mos97]; for a multi-threaded matrix multiplication see [McC97]; there is also the possibility of taking advantage of the hardware, threaded matrix operations on multiprocessor systems, etc.).
➥ All the algorithms presented here may be seen as predictors (of error surface features), from the simple momentum to the more sophisticated SuperSAB. Based on the previous behaviour of the error gradient, they try to predict its future behaviour and change the learning path accordingly.
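The SuperSAB bookkeeping is compact in Python/NumPy; the sketch below (my own naming, one weight matrix, per-element rates) follows (2.13) and the two SuperSAB rules above, and is an illustration rather than a reference implementation.

    import numpy as np

    I, D, mu = 1.2, 0.8, 0.9     # illustrative factors: I > 1, D in (0,1)

    def supersab_step(W, eta, dW, dW_old, grad):
        """One SuperSAB update: eta grows by I where the last two steps
        dW(t), dW(t-1) agreed in sign and shrinks by D where they
        disagreed; disagreeing weights also get -mu*dW(t) added, to
        cancel the previous "wrong" step."""
        prod = dW * dW_old
        eta = np.where(prod > 0, I * eta, np.where(prod < 0, D * eta, eta))
        flipped = (prod < 0).astype(float)
        dW_new = -eta * grad - mu * flipped * dW
        W += dW_new
        return eta, dW_new, dW   # new state: eta(t), dW(t+1), dW(t)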
➧ 2.6 Applications

2.6.1 Identity Mapping Network [see [BTW95] pp. 48-49]

The network consists of 1 input, 1 hidden and 1 output neuron, with 2 weights: $w_1$ and $w_2$. See figure 2.6. This particular network, while of little practical usage, presents some interesting features:
• there are only 2 weights, so it is possible to visualize the error surface exactly, see figure 2.7;
• the error surface has a local maximum and a local minimum; note that if the weights are "trapped" in the local minimum then standard backpropagation can't move forward, as $\nabla E$ becomes zero there.

The problem is to configure $w_1$ and $w_2$ such that the identity mapping is realized for binary input.

Figure 2.6: The identity mapping network: input, hidden and output neurons chained through $w_1$ and $w_2$.

Figure 2.7: The error surface $E(w_1, w_2)$ for the identity mapping network, over $w_1, w_2 \in [-10, 10]$, showing a local maximum and a local minimum.

The output of the input neuron is $x_p = z_{01}$ (by notation). The output of the hidden neuron $z_{11}$ is (see (2.3)):
$$z_{11} = \frac{1}{1 + \exp(-c\, w_1 z_{01})}$$
The output of the output neuron is:
$$z_{21} = \frac{1}{1 + \exp(-c\, w_2 z_{11})} = \frac{1}{1 + \exp\left( \frac{-c\, w_2}{1 + \exp(-c\, w_1 z_{01})} \right)}$$
The identity mapping network tries to map its input to its output, i.e.:
for $x_1 = z_{01} = 0 \Rightarrow t_1(z_{01}) = 0$; for $x_2 = z_{01} = 1 \Rightarrow t_2(z_{01}) = 1$.
The square mean error is (2.4), where $P = 2$ and $N_L = 1$:
$$E_{\text{tot.}}(w_1, w_2) = \frac{1}{2} \sum_{p=1}^{2} \sum_{q=1}^{1} \left[ z_{2q}(x_p) - t_q(x_p) \right]^2 = \frac{1}{2} \left[ \frac{1}{1 + \exp\!\left(-\frac{c w_2}{2}\right)} \right]^2 + \frac{1}{2} \left[ \frac{1}{1 + \exp\!\left(-\frac{c w_2}{1 + \exp(-c w_1)}\right)} - 1 \right]^2$$
For $c = 1$ the error surface is shown in figure 2.7. The surface has a local minimum and a local maximum.
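The surface of figure 2.7 is easy to tabulate numerically; a small Python/NumPy sketch (an illustration of mine, with $c = 1$ as in the text):

    import numpy as np

    def E_tot(w1, w2, c=1.0):
        """Total sum-of-squares error of the 1-1-1 identity mapping network
        over the two patterns (x, t) = (0, 0) and (1, 1)."""
        z21_a = 1 / (1 + np.exp(-c * w2 / 2))                      # x = 0: z11 = 1/2
        z21_b = 1 / (1 + np.exp(-c * w2 / (1 + np.exp(-c * w1))))  # x = 1
        return 0.5 * (z21_a ** 2 + (z21_b - 1) ** 2)

    # Sample the surface over [-10, 10] x [-10, 10], as in figure 2.7:
    w1, w2 = np.meshgrid(np.linspace(-10, 10, 81), np.linspace(-10, 10, 81))
    E = E_tot(w1, w2)
    print(E.min(), E.max())   # extremes of the sampled surface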
2.6.2 The Encoder

This network is also an identity mapping ANN (targets are the same as the inputs), with a single hidden layer which is smaller in size than the input/output layers. See figure 2.8.

Figure 2.8: The 4-2-4 encoder: 4 inputs, 2 hidden, 4 outputs.

Beyond any possible practical applications, this network shows the following:
• the architecture of a backpropagation ANN may be important with respect to its purpose;
• the output of a hidden layer is not necessarily meaningless;
• the importance of biases.

The input vectors and targets are:
$$\mathbf{x}_1 = \begin{pmatrix} 1\\0\\0\\0 \end{pmatrix}, \quad \mathbf{x}_2 = \begin{pmatrix} 0\\1\\0\\0 \end{pmatrix}, \quad \mathbf{x}_3 = \begin{pmatrix} 0\\0\\1\\0 \end{pmatrix}, \quad \mathbf{x}_4 = \begin{pmatrix} 0\\0\\0\\1 \end{pmatrix}, \qquad \mathbf{t}_i \equiv \mathbf{x}_i, \; i = \overline{1,4}$$
The idea is that the inputs have to be "squeezed" through the bottleneck represented by the hidden layer before being reproduced at the output. The network has to find a way to encode the 4-component vectors on a 2-component vector, the output of the hidden layer. Obviously one such encoding is given by:
$$z^1 = \begin{pmatrix} 0\\0 \end{pmatrix}, \quad z^2 = \begin{pmatrix} 0\\1 \end{pmatrix}, \quad z^3 = \begin{pmatrix} 1\\0 \end{pmatrix}, \quad z^4 = \begin{pmatrix} 1\\1 \end{pmatrix}$$

Note that one of the $\mathbf{x}_i$ vectors will be encoded by the null vector. On a network without biases this means that the output layer will receive the total input $\mathbf{a} = \tilde{0}$ and, considering the logistic activation function, the corresponding output will always be $\mathbf{y}^T = (0.5\ 0.5\ 0.5\ 0.5)$, and this particular output will be weights-independent. One of the input vectors may never be learned by the encoder. In practice usually (but not always) the net will enter an oscillation, trying to learn two vectors on one encoding, so there will be two unlearned vectors. When using biases this does not happen, as the output layer will always receive something weight-dependent.

An ANN was trained with the following parameters: learning parameter $\eta = 2.5$, momentum $\mu = 0.9$, flat spot elimination $c_f = 0.25$, and after 200 epochs the outputs of the hidden layer became:
$$z^1 = \begin{pmatrix} 1\\0 \end{pmatrix}, \quad z^2 = \begin{pmatrix} 0\\0 \end{pmatrix}, \quad z^3 = \begin{pmatrix} 1\\1 \end{pmatrix}, \quad z^4 = \begin{pmatrix} 0\\1 \end{pmatrix}$$
and the corresponding outputs:
$$\mathbf{y}_1 = \begin{pmatrix} 0.9975229 \\ 0.0047488 \\ 0.0015689 \\ 1.876\cdot 10^{-8} \end{pmatrix}, \quad \mathbf{y}_2 = \begin{pmatrix} 0.0000140 \\ 0.9929489 \\ 8.735\cdot 10^{-9} \\ 0.0045821 \end{pmatrix}, \quad \mathbf{y}_3 = \begin{pmatrix} 0.0020816 \\ 7.241\cdot 10^{-11} \\ 0.997392 \\ 0.0010320 \end{pmatrix}, \quad \mathbf{y}_4 = \begin{pmatrix} 7.227\cdot 10^{-11} \\ 0.0000021 \\ 0.0021213 \\ 0.9960705 \end{pmatrix}$$

✍ Remarks:
➥ The encoders with $N_1 = \log_2 N_0$ are called tight encoders, those with $N_1 > \log_2 N_0$ are loose and those with $N_1 < \log_2 N_0$ are supertight.
➥ It is possible to train a loose encoder on an ANN without biases, as the null vector doesn't have to be among the outputs of the hidden neurons.

CHAPTER 3 The SOM/Kohonen Network

The Kohonen network represents an example of an ANN with unsupervised learning.

➧ 3.1 Network Structure [see [BTW95] pp. 83-89 and [Koh88] pp. 119-124]

A SOM (Self-Organizing Map, also known as Kohonen) network has one single layer; let us name this one the output layer. The additional input ("sensory") layer just distributes the inputs to the output layer; there is no data processing in it. Within the output layer a lateral (feedback) interaction is provided (see also section 3.3). The number of neurons in the input layer is $N$ (equal to the dimension of the input vector) and for the output layer it is $K$. See figure 3.1.

✍ Remarks:
➥ Here the output layer has been considered unidimensional. Taking into account the "mapping" feature of Kohonen networks, the output layer may be considered (more conveniently for some particular applications) multidimensional.
➥ A multidimensional output layer may be trivially mapped to a unidimensional one and the discussion below remains the same. E.g. a bidimensional layer $K \times K$ may be mapped to a unidimensional layer having $K^2$ neurons just by finding a function $f$ to do the relabeling/numbering of neurons. Such a function may be e.g. $f(j, \ell) = (\ell - 1) K + j$, which maps the first row of neurons $(1,1) \ldots (1,K)$ to the first unidimensional chunk of length $K$ and so on ($j, \ell = \overline{1,K}$); a small sketch follows.
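The relabeling above is trivial to implement; a tiny Python sketch with hypothetical helper names:

    K = 8

    def to_1d(j, l):
        """Relabel neuron (j, l) of a K x K layer as one index in 1..K*K,
        following f(j, l) = (l - 1)*K + j from the remark above."""
        return (l - 1) * K + j

    def to_2d(n):
        """Inverse relabeling: recover (j, l) from the unidimensional index."""
        return (n - 1) % K + 1, (n - 1) // K + 1

    assert to_2d(to_1d(3, 5)) == (3, 5)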
Figure 3.1: The Kohonen network structure: an input layer distributing to an output layer with lateral feedback.

Figure 3.2: The lateral feedback interaction function of the "mexican hat" type: feedback strength versus the distance between output neurons, positive at short range, negative further out.

The important thing is that all output neurons receive all components of the input vector, and a little bit of care has to be taken when establishing the neuronal neighborhood (see below).

In general, the lateral feedback interaction function is indirect (i.e. neurons do not receive the inputs of their neighbors) and of "mexican hat" type. See figure 3.2. The closest neurons receive a positive feedback, the more distant ones receive negative feedback and the far away ones are not affected.

✍ Remarks:
➥ The distance between neuron neighbors in the output layer is (obviously) a discrete one. It may be defined as being $0$ for the neuron itself (auto-feedback), $1$ for the closest neighbors, $2$ for the next ones, and so on. On multidimensional output layers there are several choices, the most obvious one being the Euclidean distance.
➥ The feedback function determines the quantity by which the weights of the neuron's neighbors are updated during the learning process (as well as which weights are updated).
➥ The area affected by the lateral feedback is named the neuronal neighborhood.
➥ For sufficiently large neighborhoods the distance may be considered continuous when carrying out some types of calculations.

➧ 3.2 Types of Neuronal Learning [see [Koh88] pp. 92-98]

3.2.1 The Learning Process

Let $W = \{w_{ji}\}_{j=\overline{1,K},\, i=\overline{1,N}}$ be the weight matrix and $\mathbf{x} = \{x_i\}_{i=\overline{1,N}}$ be the input vector, i.e. the total input to the (output) layer is $\mathbf{a} = W\mathbf{x}$. Note that each row of $W$ represents the weights associated with one neuron and may be seen as a vector $W(j,:)$ of the same size and from the same space $\mathbb{R}^N$ as the input vector $\mathbf{x}$. $\mathbb{R}^N$ is named the weight space.

When an input vector $\mathbf{x}$ is presented to the network, the neuron having its associated weight vector $W(k,:)$ closest to $\mathbf{x}$, i.e. the one for which:
$$\|W(k,:)^T - \mathbf{x}\| = \min_{j=\overline{1,K}} \|W(j,:)^T - \mathbf{x}\|$$
is declared "winner". All neurons included in its vicinity (neuronal neighborhood), including itself, will participate in the "learning" of $\mathbf{x}$. The other ones are not affected. The learning process consists of changing the weight vectors $W(j,:)$ towards the input vector (positive feedback). There is also a "forgetting" process which tries to slow down the progress (negative feedback). It can immediately be seen why the feedback is indirect: the neurons are affected by being in the neuronal neighborhood of the winner, not by receiving the winner's output directly.

Considering a linear learning, where changes are restricted to occur only in the direction of a linear combination of $\mathbf{x}$ and $W(j,:)$ for each neuron, then:
$$\frac{dW}{dt} = \phi(\mathbf{x}, W) - \gamma(\mathbf{x}, W)$$
where $\phi$ and $\gamma$ are scalar (possibly nonlinear) functions, $\phi$ representing the positive feedback and $\gamma$ the negative one. These two functions have to be built in such a way as to affect only the neuronal neighborhood of the winning neuron $k$, which varies in time as the input vector is a function of time, $\mathbf{x} = \mathbf{x}(t)$. Note that here the winning neuron appears implicitly, as it may be determined from $\mathbf{x}$ and $W$.

Various adaptation models (differential equations for $W$) can be built for the neurons (of the output layer) of the Kohonen network. Some of the simpler ones, which may be analyzed (at least to some extent) analytically, are discussed in the following sections. To simplify the discussion further, it will be considered (at this stage) that the neuronal neighborhood is sufficiently large to contain the whole network. Later it will be shown how to limit the weight change/adaptation to targeted neurons by using the lateral feedback function.
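In code, the winner search is one line per variant; a minimal Python/NumPy sketch (names are mine):

    import numpy as np

    def find_winner(W, x):
        """Winner neuron: the row of W (one weight vector per output neuron)
        closest to the input x in the Euclidean sense."""
        return int(np.argmin(np.linalg.norm(W - x, axis=1)))

    def find_winner_dot(W, x):
        """With normalized rows and inputs the same winner may be found by
        the dot product, as discussed later in section 3.3."""
        return int(np.argmax(W @ x))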
3.2.2 The Trivial Equation

One of the simplest equations is a linear differential one:
$$\frac{dW(j,:)}{dt} = \alpha\, \mathbf{x}^T - \gamma\, W(j,:), \qquad \alpha, \gamma > 0, \; \alpha, \gamma = \text{const.}, \; j = \overline{1,K} \tag{3.1}$$
which in matrix form becomes:
$$\frac{dW}{dt} = \alpha\, \hat{1}\mathbf{x}^T - \gamma\, W$$
and with the initial condition $W(0) \equiv W_0$ the solution is:
$$W(t) = \left[ \alpha\, \hat{1} \left( \int_0^t \mathbf{x}^T(t')\, e^{\gamma t'}\, dt' \right) + W_0 \right] e^{-\gamma t}$$
which shows that for $t \to \infty$, $W(j,:)$ is the exponentially weighted average of $\mathbf{x}(t)$ and does not produce any interesting effects.

Proof. The equation is solved by the method of variation of parameters. First, the homogeneous equation $\frac{dW}{dt} + \gamma W = 0$ has a solution of the form $W(t) = C e^{-\gamma t}$, $C$ = matrix of constants. The general solution of the nonhomogeneous equation (3.1) is found by considering $C = C(t)$. Then:
$$\frac{dW}{dt} = \frac{dC}{dt}\, e^{-\gamma t} - \gamma\, C(t)\, e^{-\gamma t}$$
and by replacing in (3.1) it gives:
$$\frac{dC}{dt} = \alpha\, \hat{1}\mathbf{x}^T(t)\, e^{\gamma t} \;\Rightarrow\; C(t) = \alpha \int_0^t \hat{1}\mathbf{x}^T(t')\, e^{\gamma t'}\, dt' + C_0, \quad C_0 = \text{matrix of constants}$$
and, at $t = 0$, $W(0) = C_0 \equiv W_0$.

3.2.3 The Simple Equation

The simple equation is defined as:
$$\frac{dW(j,:)}{dt} = \alpha\, a_j(t)\, \mathbf{x}^T - \gamma\, W(j,:), \qquad \alpha, \gamma > 0, \; \alpha, \gamma = \text{const.}, \; j = \overline{1,K}$$
and in matrix notation it becomes:
$$\frac{dW}{dt} = \alpha\, \mathbf{a}(t)\, \mathbf{x}^T - \gamma\, W$$
and consequently, as $\mathbf{a} = W\mathbf{x}$:
$$\frac{dW}{dt} = W (\alpha\, \mathbf{x}\mathbf{x}^T - \gamma I)$$
In the time-discrete approximation:
$$\frac{dW}{dt} \to \Delta W = \frac{W(t+1) - W(t)}{(t+1) - t} = W(t)\, [\alpha\, \mathbf{x}(t)\mathbf{x}^T(t) - \gamma I]$$
$$\Rightarrow\; W(t+1) = W(t)\, [\alpha\, \mathbf{x}(t)\mathbf{x}^T(t) - \gamma I + I], \quad t \in \mathbb{N}^+, \; W(0) \equiv W_0 \text{ (initial condition)}$$
so the general solution is:
$$W(t) = W_0 \prod_{t'=0}^{t-1} \left[ \alpha\, \mathbf{x}(t')\mathbf{x}^T(t') - \gamma I + I \right] \tag{3.2}$$

✍ Remarks:
➥ In most cases the solution (3.2) is either divergent or converges to zero, both cases unacceptable. However, for a relatively short time, the simple equation may approximate a more complicated, asymptotically stable, process. For $t$ or $\alpha$ relatively small, such that the superior order terms $O(\alpha^2)$ may be neglected, and considering $\gamma = 0$ (no "forgetting" effect), then from (3.2):
$$W(t) \simeq W_0 \left[ I + \alpha \sum_{t'=0}^{t-1} \mathbf{x}(t')\mathbf{x}^T(t') \right]$$

3.2.4 The Riccati Equation

The Riccati equation is defined as:
$$\frac{dW(j,:)}{dt} = \alpha\, \mathbf{x}^T - \gamma\, a_j\, W(j,:), \qquad \alpha, \gamma > 0, \; \alpha, \gamma = \text{const.}, \; j = \overline{1,K} \tag{3.3}$$
and after the replacement $a_j = W(j,:)\, \mathbf{x} = \mathbf{x}^T W(j,:)^T$ (note that $[W(j,:)]^T \equiv W(j,:)^T$, for brevity), it becomes:
$$\frac{dW(j,:)}{dt} = \mathbf{x}^T \left[ \alpha I - \gamma\, W(j,:)^T W(j,:) \right] \tag{3.4}$$
or in matrix notation:
$$\frac{dW}{dt} = \alpha\, \hat{1}\mathbf{x}^T - \gamma\, (W\mathbf{x}\, \hat{1}^T) \odot W \tag{3.5}$$
Equation (3.3) may also be written as $\frac{dW}{dt} = \alpha\, \hat{1}\mathbf{x}^T - \gamma\, (\mathbf{a}\hat{1}^T) \odot W$, with $\mathbf{a} = W\mathbf{x}$.

For general $\mathbf{x} = \mathbf{x}(t)$, the Riccati equation is not directly integrable (of course beyond the trivial $\mathbf{x}^T = W(j,:) = \tilde{0}$). However, a statistical approach may be performed.

Proposition 3.2.1. Considering a statistical approach to the Riccati equation (3.4): if there is a solution, i.e. $\lim_{t\to\infty} W$ exists, then it is of the form:
$$\lim_{t\to\infty} W = \sqrt{\frac{\alpha}{\gamma}}\, \hat{1}\, \frac{\langle \mathbf{x} \rangle^T}{\|\langle \mathbf{x} \rangle\|} \qquad \text{if } \langle \mathbf{x} \rangle \neq \tilde{0}$$
where $\langle \mathbf{x} \rangle = E\{\mathbf{x}|W\} = \text{const.}$, independent of $W$ and time; i.e. all $W(j,:)$ become parallel with $\langle \mathbf{x} \rangle$ in $\mathbb{R}^N$ and will have the norm $\|W(j,:)\| = \sqrt{\alpha/\gamma}$ (the Euclidean metric being used here).

Proof. As $\mathbf{x}$ and $W(j,:)$ may be seen as vectors in the $\mathbb{R}^N$ space, let $\theta$ be the angle between them. From the scalar product:
$$\cos\theta = \frac{W(j,:)\, \mathbf{x}}{\|W(j,:)\|\, \|\mathbf{x}\|}$$
where $\|\mathbf{x}\|^2 = \mathbf{x}^T\mathbf{x}$ and $\|W(j,:)\|^2 = W(j,:)\, W(j,:)^T$, the Euclidean metric being used here. When brought under the $E\{\cdot|W\}$ operator, as $\mathbf{x}$ is obviously independent of $W$ (it is the input vector), it goes to $\mathbf{x} \to \langle\mathbf{x}\rangle$.
Then the expected value of $d\cos\theta/dt$ is:
$$E\left\{ \frac{d\cos\theta}{dt} \bigg| W \right\} = E\left\{ \frac{\frac{dW(j,:)}{dt}\langle\mathbf{x}\rangle}{\|W(j,:)\|\, \|\langle\mathbf{x}\rangle\|} - \frac{[W(j,:)\langle\mathbf{x}\rangle]\, \frac{d\|W(j,:)\|}{dt}}{\|W(j,:)\|^2\, \|\langle\mathbf{x}\rangle\|} \bigg| W \right\} \tag{3.6}$$

❐ First term of (3.6): multiplying (3.4) to the right by $\mathbf{x}$ gives:
$$\frac{dW(j,:)}{dt}\, \mathbf{x} = \alpha\, \mathbf{x}^T\mathbf{x} - \gamma\, \mathbf{x}^T W(j,:)^T W(j,:)\, \mathbf{x} = \alpha \|\mathbf{x}\|^2 - \gamma\, [W(j,:)\, \mathbf{x}]^2$$
(as for two matrices $(AB)^T = B^T A^T$ holds), and then, with $\mathbf{x} \to \langle\mathbf{x}\rangle$, this term becomes:
$$\frac{\alpha \|\langle\mathbf{x}\rangle\|^2 - \gamma\, [W(j,:)\langle\mathbf{x}\rangle]^2}{\|W(j,:)\|\, \|\langle\mathbf{x}\rangle\|}$$

❐ Second term of (3.6): first the derivative $d\|W(j,:)\|/dt$ is found as follows:
$$\frac{d\|W(j,:)\|^2}{dt} = 2\,\|W(j,:)\|\, \frac{d\|W(j,:)\|}{dt} = 2\, \frac{dW(j,:)}{dt}\, W(j,:)^T$$
(because $\|W(j,:)\|^2 = W(j,:)\, W(j,:)^T$), so:
$$\frac{d\|W(j,:)\|}{dt} = \frac{\frac{dW(j,:)}{dt}\, W(j,:)^T}{\|W(j,:)\|}$$
and by using (3.4):
$$\frac{d\|W(j,:)\|}{dt} = \frac{1}{\|W(j,:)\|}\, \mathbf{x}^T W(j,:)^T \left[ \alpha - \gamma\, W(j,:)\, W(j,:)^T \right] = \frac{W(j,:)\, \mathbf{x}}{\|W(j,:)\|} \left( \alpha - \gamma\, \|W(j,:)\|^2 \right) \tag{3.7}$$
(as $\mathbf{x}^T W(j,:)^T = W(j,:)\, \mathbf{x}$ and $W(j,:)\, W(j,:)^T \equiv \|W(j,:)\|^2$). By replacing back into the wanted term, and as $\mathbf{x} \to \langle\mathbf{x}\rangle$, the second term becomes:
$$\frac{[W(j,:)\langle\mathbf{x}\rangle]^2\, (\alpha - \gamma\,\|W(j,:)\|^2)}{\|W(j,:)\|^3\, \|\langle\mathbf{x}\rangle\|}$$

Replacing back into (3.6) gives:
$$E\left\{ \frac{d\cos\theta}{dt}\bigg|W \right\} = \frac{\alpha\,\|\langle\mathbf{x}\rangle\|}{\|W(j,:)\|}\, E\left\{ 1 - \cos^2\theta \,\big|\, W \right\} = \frac{\alpha\,\|\langle\mathbf{x}\rangle\|}{\|W(j,:)\|}\, E\left\{ \sin^2\theta \,\big|\, W \right\} \tag{3.8}$$

The existence of $\lim_{t\to\infty} W$ means that $\theta$ stabilizes in time; then the limit of its derivative is zero and the expected value of the derivative is also zero (as it will remain zero after reaching the limit):
$$\lim_{t\to\infty} \frac{d\cos\theta}{dt} = 0 = E\left\{ \frac{d\cos\theta}{dt}\bigg|W \right\}$$
By using (3.8), it follows immediately that $E\{\sin^2\theta|W\} = 0$ and then $E\{\theta|W\} = 0$, i.e. all $E\{W(j,:)|W\}$ are parallel to $\langle\mathbf{x}\rangle$.

The norm of $W(j,:)$ is found from (3.7). If $\lim_{t\to\infty} W(j,:)$ exists then the expectation of $d\|W(j,:)\|/dt$ has to be zero:
$$E\left\{ \frac{d\|W(j,:)\|}{dt}\bigg|W \right\} = E\left\{ \frac{W(j,:)\,\mathbf{x}}{\|W(j,:)\|} \left( \alpha - \gamma\,\|W(j,:)\|^2 \right) \bigg| W \right\} = 0$$
but as $W(j,:) \neq \tilde{0}$, this may happen only if $E\{\alpha - \gamma\,\|W(j,:)\|^2\, |\, W\} = 0 \Rightarrow (E\{\|W(j,:)\|\})^2 = \alpha/\gamma$. Finally, combining all previously obtained results:
$$\lim_{t\to\infty} \cos(W(j,:), \langle\mathbf{x}\rangle) = 1 \;\text{ and }\; \lim_{t\to\infty} \|W(j,:)\| = \sqrt{\frac{\alpha}{\gamma}} \;\Rightarrow\; \lim_{t\to\infty} W(j,:) = \sqrt{\frac{\alpha}{\gamma}}\, \frac{\langle\mathbf{x}\rangle^T}{\|\langle\mathbf{x}\rangle\|}$$

3.2.5 More General Equations [see [Koh88] pp. 98-101]

Theorem 3.2.1. Let $\alpha > 0$, $\mathbf{a} = W\mathbf{x}$ and $\gamma(\mathbf{a})$ an arbitrary function such that $E\{\gamma(\mathbf{a})|W\}$ exists. Let $\mathbf{x} = \mathbf{x}(t)$ be a vector with stationary statistical properties (and independent of $W$). Then, if a learning model (process) of the type:
$$\frac{dW(j,:)}{dt} = \alpha\, \mathbf{x}^T - \gamma(a_j)\, W(j,:), \quad j = \overline{1,K} \tag{3.9}$$
or, in matrix notation:
$$\frac{dW}{dt} = \alpha\, \hat{1}\mathbf{x}^T - \left[ \gamma(\mathbf{a})\, \hat{1}^T \right] \odot W$$
has nonzero bounded $W$ solutions for $t \to \infty$, they must be of the form:
$$\lim_{t\to\infty} W \propto \hat{1}\, \langle\mathbf{x}\rangle^T$$
where $\langle\mathbf{x}\rangle$ is the mean of $\mathbf{x}(t)$; i.e. all $W(j,:)$ become parallel to $\langle\mathbf{x}\rangle$ in $\mathbb{R}^N$.

Proof. Let $\theta$ be the angle between $\langle\mathbf{x}\rangle$ and $W(j,:)$ in $\mathbb{R}^N$; from the scalar product: $\cos\theta = \frac{W(j,:)\langle\mathbf{x}\rangle}{\|W(j,:)\|\,\|\langle\mathbf{x}\rangle\|}$. The $E\{d\cos\theta/dt|W\}$ is calculated in a similar way as in the proof of proposition 3.2.1:
k W (j;:) W (j; E   d cos  W =E dt ( hxi ; [W (j; :)hxi] k ( :)k W kW (j; :)k khxik kW (j; :)k2 khxik dW (j;:) d W j; dt ) dt (3.10) Multiplying (3.9) by x, to the right, gives: dW (j; :) x = xT x ; (aj ) W (j; :) x = dt kxk2 ; (a ) [W (j; :) x] j (3.11) The dkW (j; :)k=dt derivative is calculated in similar way as in proof of proposition 3.2.1 to give: dkW (j; :)k dW (j; :) W (j; :)T = dt kW (j; :)k dt and then, by using (3.9): dkW (j; :)k xT W (j; :)T = dt kW (j; :)k T ; (a ) W (kj;W:)(Wj; :)(j;k :) = [kWW((j;j;:):)xk] ; (a ) kW (j; :)k j j (3.12) The (3.11) and (3.12) results are used in (3.10) (and also x ! hxi) to give:    khxik2 ; (a ) [W (j; :)hxi]  E d cos W =E dt kW (j; :)k khxik    [W (j; :)hxi] [ k ( (:) :)hxki] ; (a ) kW (j; :)k ; W kW (j; :)k2 khxik j W j; j W j; and, after simpli cation, it becomes:    khxik2 kW (j; :)k2 ; [W (j; :)hxi]2 W   W = E E d cos dt kW (j; :)k3 khxik Existence of lim !1 W means that the  stabilizes in time and then lim !1(d cos =dt) = 0, and then the expected value is zero as well: Efd cos =dtjW g = 0. But this may happen only if: t t     E d cos W = 0 , E khxik2 kW (j; :)k2 ; [W (j; :)hxi]2 W = 0 , dt   E kWW(j;(j;:):)k hkhxxi ik W = 1 , E f cos j W g = 1 T i.e. lim !1  = 0 and then W (j; :) and hxi become parallel for t ! 1, i.e. lim !1 W (j; :) / hxi . t ❖ hxxT i, max , umax t Theorem 3.2.2. Let > 0, a = W x and (a) an arbitrary function such that Ef (a)jW g exists. Let hxxT i = EfxxT jW g (in fact xxT does not depend on W as is the covariance matrix of input vector). Let max = max ` the maximum eigenvalue of hxxT i and umax ` the associated eigenvector. Then, if a learning model (process) of type: dW (j; :) = aj xT ; (aj )W (j; :) (3.13) dt or, in matrix notation: i h dW W = axT ; (a) 1bT dt 3.2.5 See [Koh88] pp. 98{101. ❖  3.2. TYPES OF NEURONAL LEARNING 37 have nonzero bounded W solutions for t ! 1, they have to be of the form: / 1buTmax 6= 0b, where (0)  0 ; i.e. all W provided that in RN . W umax W W (j; W :) become parallel to umax Let  be an eigenvalue and u the corresponding eigenvector of hxxT i such that hxxT iu =  u . Let  be the angle between W (j; :) and u such that cos  = ( ( :):) uu` ` . Proof. ` ` ` ` ` ` kW j; kk E  d cos ` W =E dt  ( ` k Efd cos  =dtjW g is calculated the same way as in proof of theorem 3.2.1: ` ` W j; u` [W (j; :) u` ] dkWdt(j;:)k ; kW (j; :)k ku` k kW (j; :)k2 ku` k W dW (j;:) ) (3.14) dt Note that xxT ! hxxT i when passing under the EfjW g operator. From (3.13), knowing that a = W (j; :) x then: j dW (j; :) = W (j; :) xxT ; (aj ) W (j; :) dt then, multiplying by u` to the right and knowing that hxxT iu` = ` u` , it follows that dW (j; :) u` = W (j; :) xxT u` ; (aj ) W (j; :) u` dt under ;;;;;! W (j; :)hxxT iu` ; (aj ) W (j; :) u` = ( ` ; (aj )) [W (j; :) u` ] E fjW g (3.15) The other required term is dkW (j; :)k=dt which again is calculated in similar way as in proof of theorem 3.2.1: dkW (j; :)k dW (j; :) W (j; :)T = dt kW (j; :)k dt and, by using (3.13), aj = W (j; :) x and W (j; :) W (j; :)T = kW (j; :)k2 , then: W (j; :) xxT W (j; :)T W (j; :) W (j; :)T dkW (j; :)k = ; ( aj ) dt kW (j; :)k kW (j; :)k T iW (j; :)T W ( j; :) h xx under ;;;;;! 
Replacing the results (3.15) and (3.16) back into (3.14) gives:
$$E\left\{\frac{d\cos\theta_\ell}{dt}\bigg|W\right\} = E\left\{ \frac{(\alpha\,\lambda_\ell - \gamma(a_j))\,[W(j,:)\,\mathbf{u}_\ell]}{\|W(j,:)\|\,\|\mathbf{u}_\ell\|} - \frac{[W(j,:)\,\mathbf{u}_\ell]\left( \alpha\,\frac{W(j,:)\langle\mathbf{x}\mathbf{x}^T\rangle W(j,:)^T}{\|W(j,:)\|} - \gamma(a_j)\,\|W(j,:)\| \right)}{\|W(j,:)\|^2\,\|\mathbf{u}_\ell\|} \bigg| W \right\}$$
and, after simplification, it may be written as:
$$E\left\{\frac{d\cos\theta_\ell}{dt}\bigg|W\right\} = \alpha\, E\left\{ \left( \lambda_\ell - \frac{W(j,:)\langle\mathbf{x}\mathbf{x}^T\rangle W(j,:)^T}{\|W(j,:)\|^2} \right) \frac{[W(j,:)\,\mathbf{u}_\ell]}{\|W(j,:)\|\,\|\mathbf{u}_\ell\|} \bigg| W \right\}$$
Let us take $\mathbf{u}_\ell = \mathbf{u}_{\max}$ and the corresponding $\lambda_\ell = \lambda_{\max}$. The above formula becomes:
$$E\left\{\frac{d\cos\theta_{\max}}{dt}\bigg|W\right\} = \alpha\, E\left\{ \left( \lambda_{\max} - \frac{W(j,:)\langle\mathbf{x}\mathbf{x}^T\rangle W(j,:)^T}{\|W(j,:)\|^2} \right) \frac{[W(j,:)\,\mathbf{u}_{\max}]}{\|W(j,:)\|\,\|\mathbf{u}_{\max}\|} \bigg| W \right\}$$
The existence of $\lim_{t\to\infty} W$ means that $\theta_{\max}$ stabilizes in time and thus $\lim_{t\to\infty} d\cos\theta_{\max}/dt = 0$, and so is its expected value. As $W(j,:)\,\mathbf{u}_{\max} \neq 0$ then:
$$E\left\{ \lambda_{\max} - \frac{W(j,:)\langle\mathbf{x}\mathbf{x}^T\rangle W(j,:)^T}{\|W(j,:)\|^2} \bigg| W \right\} = 0 \;\Leftrightarrow\; E\left\{ \frac{W(j,:)\langle\mathbf{x}\mathbf{x}^T\rangle W(j,:)^T}{\|W(j,:)\|^2} \bigg| W \right\} = \lambda_{\max} \tag{3.17}$$
the equality being possible only for $\lambda_{\max}$, in accordance with the Rayleigh quotient (see the mathematical appendix).

As the matrix $\mathbf{x}\mathbf{x}^T$ is symmetrical (and so is $\langle\mathbf{x}\mathbf{x}^T\rangle$, i.e. $\langle\mathbf{x}\mathbf{x}^T\rangle = \langle\mathbf{x}\mathbf{x}^T\rangle^T$), an orthogonal set of eigenvectors may be built (see the mathematical appendix). A transformation of coordinates to the system $\{\mathbf{u}_\ell\}_{\ell=\overline{1,N}}$ may be performed by using the matrix $U$ built using the set of eigenvectors as columns (and then $U^T U = I$, as $\mathbf{u}_\ell^T\mathbf{u}_k = \delta_{\ell k}$, $\delta_{\ell k}$ being the Kronecker symbol). Then $W(j,:)^T \to W'(j,:)^T = U\, W(j,:)^T$, $W(j,:) \to W'(j,:) = W(j,:)\, U^T$; also:
$$\|W'(j,:)\|^2 = W(j,:)\, U^T U\, W(j,:)^T = W(j,:)\, I\, W(j,:)^T = \|W(j,:)\|^2$$
and $W'(j,:)$ may be represented as a linear combination of $\{\mathbf{u}_\ell\}$:
$$W'(j,:) = \sum_\ell \omega_\ell\, \mathbf{u}_\ell^T$$
$\omega_\ell$ being the coefficients of the linear combination ($\mathbf{u}_\ell$ appears transposed because $W'(j,:)$ is a row matrix). Knowing that $U^T\langle\mathbf{x}\mathbf{x}^T\rangle U$ is a diagonal matrix with the eigenvalues on the main diagonal (all other elements being zero, see again the mathematical appendix) and using again the orthogonality of $\{\mathbf{u}_\ell\}$ (i.e. $\mathbf{u}_\ell^T\mathbf{u}_k = \delta_{\ell k}$), then:
$$W'(j,:)\langle\mathbf{x}\mathbf{x}^T\rangle W'(j,:)^T = \sum_\ell \omega_\ell^2\, \lambda_\ell \quad\text{and}\quad \|W'(j,:)\|^2 = \sum_\ell \omega_\ell^2$$
Replacing back into (3.17) (with $W'(j,:)$ replacing $W(j,:)$) gives:
$$E\left\{ \frac{\sum_\ell \omega_\ell^2\, \lambda_\ell}{\sum_\ell \omega_\ell^2} \bigg| W \right\} = \lambda_{\max}$$
which may happen only if all $\omega_\ell \to 0$ except the $\omega_{\max}$ corresponding to $\mathbf{u}_{\max}$, i.e. $\lim_{t\to\infty} W(j,:)^T = \omega_{\max}\,\mathbf{u}_{\max} \propto \mathbf{u}_{\max}$.

At first glance the condition $W\mathbf{u}_{\max} \neq \tilde{0}$ at all $t$, met in theorem 3.2.2, seems to be very hard. In fact it is, but in practice it has a smaller importance, and this deserves a discussion.

First, the initial value of $W$, let $W(0) \equiv W_0$ be that one, should be chosen such that $W_0\mathbf{u}_{\max} \neq \tilde{0}$. But $W(j,:)\,\mathbf{u}_{\max} \neq 0$ means that $W(j,:) \not\perp \mathbf{u}_{\max}$, i.e. $W(j,:)$ is not contained in a hyperplane $P_{\perp\mathbf{u}_{\max}}$ perpendicular to $\mathbf{u}_{\max}$ (in $\mathbb{R}^N$). Even a random selection of $W_0$ would have good chances to meet this condition, even more so as $\mathbf{u}_{\max}$ is seldom known exactly in practice, being dependent on the stochastic input variable $\mathbf{x}$ (an exact knowledge of $\mathbf{u}_{\max}$ would mean a knowledge of an infinite series of $\mathbf{x}(t)$).

Second, theorem 3.2.2 says that, statistically, the $W(j,:)$ vectors will move towards either $\mathbf{u}_{\max}$ or $-\mathbf{u}_{\max}$, depending upon which side of $P_{\perp\mathbf{u}_{\max}}$ $W_0(j,:)$ lies, i.e. away from $P_{\perp\mathbf{u}_{\max}}$. However, the proof is statistical in nature and there is a small but finite probability that, at some $t$, $W(j,:)(t)$ falls into $P_{\perp\mathbf{u}_{\max}}$. What happens then, (3.15) tells us: as $W(j,:)\,\mathbf{u}_{\max} = 0$ then $\frac{dW(j,:)}{dt}\,\mathbf{u}_{\max} = 0$, and this means that all further changes in $W(j,:)$ are contained in $P_{\perp\mathbf{u}_{\max}}$, i.e. $W(j,:)$ becomes trapped in $P_{\perp\mathbf{u}_{\max}}$. The conclusion is then that the condition $W\mathbf{u}_{\max} \neq \tilde{0}$, $\forall t$, may be neglected, with the remark that there is a small probability that some $W(j,:)$ weight vectors may become trapped, and then learning will be incomplete.
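Theorem 3.2.2 can be checked numerically. The small Python/NumPy simulation below is my own illustration, not from the book: taking $\gamma(a) = a^2$ in (3.13) gives an Oja-type rule, and a discrete-time run shows one weight vector aligning with the principal eigenvector $\mathbf{u}_{\max}$ of $\langle\mathbf{x}\mathbf{x}^T\rangle$.

    import numpy as np

    rng = np.random.default_rng(1)
    alpha, steps = 0.01, 20000

    # Stationary inputs with an anisotropic covariance:
    C = np.array([[3.0, 1.0], [1.0, 1.0]])
    L = np.linalg.cholesky(C)

    w = rng.normal(size=2)                  # one weight vector W(j,:)
    for _ in range(steps):
        x = L @ rng.normal(size=2)
        a = w @ x
        w += alpha * (a * x - a**2 * w)     # (3.13) with gamma(a) = a^2 (Oja-type)

    u = np.linalg.eigh(C)[1][:, -1]         # eigenvector of the largest eigenvalue
    print(abs(w @ u) / np.linalg.norm(w))   # close to 1: w is (anti)parallel to u_max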
➧ 3.3 Network Dynamics

3.3.1 Network Running Function

As previously discussed, $\mathbf{x}, W(j,:) \in \mathbb{R}^N$, the weight space. For each input vector $\mathbf{x}$, the neuron $k$ for which:
$$\|\mathbf{x} - W(k,:)^T\| = \min_{j=\overline{1,K}} \|\mathbf{x} - W(j,:)^T\|$$
is declared winner, i.e. the one for which the associated weight vector $W(k,:)$ is closest to $\mathbf{x}$ (in the weight space). The winner is used to decide which weights get changed using the current input vector $\mathbf{x}$. All and only the neurons found in the winner's neighborhood participate in learning, i.e. will have their weights changed/adapted. All other weights remain unchanged at this stage; later a new input vector may change the winner and thus the area of change.

✍ Remarks:
➥ $\|\mathbf{x} - W(j,:)^T\|$ is the (mathematical) distance between the vectors $\mathbf{x}$ and $W(j,:)$. This distance is user definable, but the most used is the Euclidean one: $\sqrt{\sum_{i=1}^N (x_i - w_{ji})^2}$.
➥ The learning of the network is unsupervised, i.e. there are no targets.
➥ If the input vectors $\mathbf{x}$ and the weight vectors $\{W(j,:)\}_{j=\overline{1,K}}$ are normalized, $\|\mathbf{x}\| = \|W(j,:)\|$ (to the same value, not necessarily $1$), e.g. in a Euclidean space:
$$\sqrt{\sum_{i=1}^N x_i^2} = \sqrt{\sum_{i=1}^N w_{ji}^2}, \quad j = \overline{1,K}$$
i.e. $\mathbf{x}(t)$ and $W(j,:)$ are points on a hyper-sphere in $\mathbb{R}^N$, then the dot vector product can be used to find the matching. The winner neuron is that $k$ for which:
$$W(k,:)\,\mathbf{x} = \max_{j=\overline{1,K}} W(j,:)\,\mathbf{x}$$
i.e. the winner is the neuron whose weight vector $W(k,:)$ points in the direction closest to that of the input vector $\mathbf{x}$. This operation is a little faster, as it skips a subtraction of the type $\mathbf{x} - W(j,:)^T$; however, it requires the normalization of $\mathbf{x}$ and $W(j,:)$, which is not always desirable in practice.

3.3.2 Network Learning Function

The learning process is an unsupervised one. Time is considered to be discrete, $t = 1, 2, \ldots$. The weights are time dependent, $W = W(t)$. The learning network is fed with data $\mathbf{x}(t)$. At time $t = 0$ the weights are initialized with (small) random values. The weights at time $t$ are updated as follows:
➀ For $\mathbf{x}(t)$ find the winning (output) neuron $k$. See section 3.3.1.
➁ Update the weights according to the model chosen (see section 3.2 for a selection of learning models):
$$\frac{dW}{dt} = \left(\frac{dW}{dt}\right)_{\text{model}}$$
which in the discrete time approximation ($dt \to t - (t-1) = 1$) becomes:
$$\Delta W = W(t) - W(t-1) = \frac{dW}{dt} \tag{3.18}$$

3.3.3 Initialization and Stop Condition

Weights are initialized (in practice) with small random values (normalized or not) and the adjusting process continues by iteration. The stopping of the learning process may be done by one of the following methods:
➀ choosing a fixed number of steps $t = \overline{1,T}$;
➁ continuing the learning process until the adjusting quantity $\Delta w_{ji} = w_{ji}(t+1) - w_{ji}(t)$ falls under some specified value, i.e. $\Delta w_{ji} \leqslant \varepsilon$, where $\varepsilon$ is the threshold.

3.3.4 Remarks

The mapping feature of Kohonen networks. Due to the fact that the Kohonen algorithm "moves" the weight vectors towards the input vectors, the Kohonen network tries to map the input vectors, i.e. the weight vectors will try to copy the topology of the input vectors in the weight space. The mapping occurs in the weight space. See section 3.5.2 for an example. For this reason Kohonen networks are also called self-ordering maps, or SOM.
Activation function. Note that the activation function, as well as the neuronal output, is irrelevant to the learning process.

Incomplete learning. Even if the learning is unsupervised, in fact a poor choice of learning parameters may lead to an incomplete learning (so, in fact, a fully successful learning is "supervised" at a "higher" level). See section 3.5.2 and figure 3.6 for an example of an incomplete learning.

➧ 3.4 The Algorithm

1. For all neurons in the output layer: initialize the weights with random values.
2. If working with normalized vectors then normalize the weights.
3. Choose a model, i.e. the type of neuronal learning. See section 3.2 for some examples.
4. Choose a model for the neuronal neighborhood, i.e. the lateral feedback function. See also section 3.5 for some examples.
5. Choose a stop condition. See section 3.3.3.
6. Knowing the learning model, the neuronal neighborhood function and the stop condition, build the final equation giving the weight adaptation formula. See section 3.5.1 for an example of how to do this.
7. In the discrete time approximation, repeat the following steps till the stop condition is met:
(a) Get the input vector $\mathbf{x}(t)$.
(b) Among all neurons $j$ in the output layer, find the "winner": the neuron $k$ for which:
$$\|\mathbf{x}(t) - W(k,:)^T\| = \min_{j=\overline{1,K}} \|\mathbf{x}(t) - W(j,:)^T\|$$
or, if working with normalized vectors: $W(k,:)\,\mathbf{x} = \max_{j=\overline{1,K}} W(j,:)\,\mathbf{x}$.
(c) Knowing the winner, change the weights by using the adaptation formula built at step 6.

➧ 3.5 Applications

3.5.1 The Trivial Model with Forgetting Function

This application is without practical value but it shows how to build a weight adaptation formula. It also gives some examples of neuronal neighborhood models. The topics discussed here apply to many types of Kohonen networks. (A minimal sketch of the resulting training step appears at the end of this section.)

Let us choose the trivial equation (3.1) as the learning model:
$$\frac{dW}{dt} = \alpha\, \hat{1}\mathbf{x}^T - \gamma\, W \tag{3.19}$$
Next, let us consider $h(k,j)$, the function modelling the neuronal neighborhood, i.e. the lateral feedback. This function should be of "mexican hat" type, i.e.:
$$h(k,j) \begin{cases} > 0 & \text{for } j \text{ relatively close to } k \\ \leqslant 0 & \text{for } j \text{ far, but not too much, from } k \\ = 0 & \text{for } j \text{ far away from } k \end{cases}$$
This function will generally be a function of the "distance" between $k$ and $j$, the distance being user definable. Considering $x_k$ and $x_j$ the "coordinates", then $h = h(|x_k - x_j|)$. Let $\mathbf{x}_{(K)}^T = \begin{pmatrix} x_1 & \cdots & x_K \end{pmatrix}$ be the vector containing the neuron coordinates; then $h(|\mathbf{x}_{(K)} - x_{(K)k}\hat{1}|)$ will give the vector containing the adaptation height around the winner $k$, for the whole network.

Figure 3.3: The simple lateral feedback function: height $h_+$ on the interval $[k - n_+, k + n_+]$ around the winner $k$, $-h_-$ on the next $n_-$ neurons on each side, $0$ in rest.
To account for the neuronal neighborhood model chosen, equation (3.19) has to be changed to:
$$\frac{dW}{dt} = \left[ h(|\mathbf{x}_{(K)} - x_{(K)k}\hat{1}|)\, \hat{1}^T \right] \odot \left[ \alpha\,\hat{1}\mathbf{x}^T - \gamma\, W \right] \tag{3.20}$$
Note that the elements of $h(|\mathbf{x}_{(K)} - x_{(K)k}\hat{1}|)$ corresponding to neurons outside the neuronal neighborhood of the winner are zero and thus, for these neurons, $\frac{dW(j,:)}{dt} = 0$, i.e. their weights remain unchanged for the current $\mathbf{x}$. Various neuronal neighborhoods may be of the form:

• Simple lateral feedback function:
$$h(j,k) = \begin{cases} h_+ & \text{for } j \in \{k - n_+, \ldots, k, \ldots, k + n_+\} \text{ (positive feedback)} \\ -h_- & \text{for } j \in \{k - n_+ - n_-, \ldots, k - n_+ - 1\} \cup \{k + n_+ + 1, \ldots, k + n_+ + n_-\} \text{ (negative feedback)} \\ 0 & \text{in rest} \end{cases}$$
where $h_\pm \in [0,1]$, $h_\pm = \text{const.}$ define the heights of the positive/negative feedback and $n_\pm \geqslant 1$, $n_\pm \in \mathbb{N}$ define the neuronal neighborhood. See figure 3.3.

• Exponential lateral feedback function:
$$h(j,k) = h_+\, e^{-(j-k)^2}, \quad \text{for } |j - k| \leqslant n$$
where $h_+ > 0$, $h_+ = \text{const.}$ defines the positive feedback (there is no negative one) and $n > 0$, $n = \text{const.}$ defines the neuronal neighborhood. See figure 3.4.

Figure 3.4: The exponential lateral feedback function $h(x) = e^{-x^2}$.

Finally, the stop condition may be implemented by multiplying the right side of equation (3.20) by a function $\zeta(t)$ with the property $\lim_{t\to\infty} \zeta(t) = 0$. This way (3.20) becomes:
$$\frac{dW}{dt} = \zeta(t) \left[ h(|\mathbf{x}_{(K)} - x_{(K)k}\hat{1}|)\, \hat{1}^T \right] \odot \left[ \alpha\,\hat{1}\mathbf{x}^T - \gamma\, W \right] \tag{3.21}$$
The convergence of $\zeta(t)$ to zero ensures that $\lim_{t\to\infty} \frac{dW}{dt} = 0$ and thus the weight adaptation, i.e. the learning process, eventually stops. Various stop functions may be of the form:

• Geometrical progression function:
$$\zeta(t) = \zeta_{\text{init}}\, \zeta_{\text{ratio}}^{\,t}, \quad \zeta_{\text{ratio}} \in (0,1), \; \zeta_{\text{init}}, \zeta_{\text{ratio}} = \text{const.}$$
where $\zeta_{\text{init}}$ and $\zeta_{\text{ratio}}$ are the initial/ratio values.

• Exponential function:
$$\zeta(t) = \zeta_{\text{init}}\, e^{-f(t)}$$
where $f(t) : \mathbb{N} \to [0,\infty)$ is a monotone increasing function. Note that for $f(t) = t$ and $t \in \mathbb{N}^+$ this function is a geometrical progression.

In the discrete time approximation, using (3.21) in (3.18) gives:
$$W(t+1) = W(t) + \zeta(t) \left[ h(|\mathbf{x}_{(K)} - x_{(K)k}\hat{1}|)\, \hat{1}^T \right] \odot \left[ \alpha\,\hat{1}\mathbf{x}^T - \gamma\, W \right] \tag{3.22}$$
(note also that the winner $k$ is time dependent, i.e. $x_{(K)k} = x_{(K)k}(\mathbf{x}(t))$).

✍ Remarks:
➥ The above considerations are easily extensible to multidimensional Kohonen networks. E.g. for a bidimensional $K \times K$ layer, considering the winner $k$, the squared inter-neuronal Euclidean distance will be:
$$(\mathbf{x}_{(K)} - x_{(K)k}\hat{1})^{\odot 2} + (\mathbf{y}_{(K)} - y_{(K)k}\hat{1})^{\odot 2}$$
$\mathbf{y}_{(K)}$ holding the second coordinate. Of course $h$ also changes to:
$$h = h\big( (\mathbf{x}_{(K)} - x_{(K)k}\hat{1})^{\odot 2} + (\mathbf{y}_{(K)} - y_{(K)k}\hat{1})^{\odot 2} \big)$$
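Putting the pieces together, here is a minimal Python/NumPy sketch of the discrete-time adaptation (3.22) with the exponential lateral feedback and the geometric stop function. It is an illustration under my own naming and parameter choices, not the book's Scilab code.

    import numpy as np

    K, N = 64, 2
    rng = np.random.default_rng(2)
    W = rng.uniform(0, 0.1, (K, N))       # one weight vector per output neuron
    coords = np.arange(K, dtype=float)    # unidimensional neuron coordinates

    h_plus, n, z_init, z_ratio = 0.6, 3.0, 1.0, 0.995
    alpha = gamma = 1.0

    def som_step(W, x, t):
        """One discrete-time step of (3.22): find the winner, then move the
        weights of its neighborhood towards x, scaled by the exponential
        lateral feedback and the stop function zeta(t)."""
        k = np.argmin(np.linalg.norm(W - x, axis=1))         # winner
        d = np.abs(coords - coords[k])
        h = np.where(d <= n, h_plus * np.exp(-d**2), 0.0)    # lateral feedback
        zeta = z_init * z_ratio**t
        W += zeta * h[:, None] * (alpha * x - gamma * W)
        return W

    for t in range(2000):
        W = som_step(W, rng.uniform(0, 1, N), t)             # inputs from [0,1]^2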
3.5.2 Square Mapping

This example shows the mapping feature of a Kohonen network, a way (among many other possibilities) to build multidimensional networks, and possible defects in the learning process. Let there be a Kohonen network with 2 inputs and $8 \times 8$ neurons (bidimensional output layer with $K = 8$). The trivial equation model, in the discrete time approximation, will be used here; see section 3.5.1. As the network is bidimensional, the following changes are made:

• The "x" coordinates of the neurons, in terms of inter-neuronal distances, are kept in an $8\times 8$ matrix $X_{(K)}$ of the form:
$$X_{(K)} = \hat{1}\begin{pmatrix}1 & 2 & \cdots & 8\end{pmatrix} = \begin{pmatrix} 1 & 2 & \cdots & 8 \\ \vdots & \vdots & & \vdots \\ 1 & 2 & \cdots & 8 \end{pmatrix}$$
while the "y" coordinates are:
$$Y_{(K)} = \begin{pmatrix}1\\ \vdots \\8\end{pmatrix}\hat{1}^T = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ \vdots & \vdots & & \vdots \\ 8 & 8 & \cdots & 8 \end{pmatrix}$$
and it follows immediately that the layout of the network runs (coordinates $(x,y)$ in parentheses) from the first row $(1,1) \cdots (8,1)$ to the last row $(1,8) \cdots (8,8)$. A particular neuron will then be identified by two numbers $(j_x, j_y)$.

• Considering $(x_{(K)k_x k_y}, y_{(K)k_x k_y})$ the coordinates of the winner, the inter-neuronal squared distances may be kept in a matrix $D(k_x, k_y)$ as:
$$D(k_x, k_y) = (X_{(K)} - x_{(K)k_x k_y}\tilde{1})^{\odot 2} + (Y_{(K)} - y_{(K)k_x k_y}\tilde{1})^{\odot 2}$$

• The lateral feedback function is:
$$h((k_x, k_y), t) = h_+ \exp\left( -\frac{D(k_x,k_y)}{(d_{\text{init}}\, d_{\text{rate}}^{\,t})^2} \right)$$
where $h_+ = 0.6$, $d_{\text{init}} = 5$ and $d_{\text{rate}} = 0.993$.

• The stop function is $\zeta(t) = \zeta_{\text{init}}\, \zeta_{\text{rate}}^{\,t}$, where $\zeta_{\text{init}} = 1$ and $\zeta_{\text{rate}} = 0.993$.

• The weights are kept in two matrices $W_1$ and $W_2$ corresponding to the two components of the input vector. The weight vector associated with a particular neuron $(j_x, j_y)$ is $(w_{1 j_x j_y}, w_{2 j_x j_y})$.

• The constants of the trivial equation are taken to be $\alpha = 1$ and $\gamma = 1$.

Then the general weight updating formulas are:
$$W_1(t+1) = W_1(t) + \zeta(t)\, h((k_x,k_y),t) \odot \left[ x_1(t)\,\tilde{1} - W_1(t) \right]$$
$$W_2(t+1) = W_2(t) + \zeta(t)\, h((k_x,k_y),t) \odot \left[ x_2(t)\,\tilde{1} - W_2(t) \right]$$

The pair of weights associated with each neuron, $(w_{1 j_x j_y}, w_{2 j_x j_y})$, may also be represented as a point in the $[0,1]\times[0,1]$ square. A successful learning will try to cover as much as possible of the area. See figure 3.5, where the evolution of the training is shown from $t = 0$ (upper-left) to the final stage (lower-right). Lines are drawn between closest neighbors (network-topology wise). The weights are the points at the intersections.

Figure 3.5: Mapping of the $[0,1]\times[0,1]$ square by an $8\times 8$ network. Snapshots taken at $t=0$ (upper-left), $t=7$ (upper-right), $t=15$ (lower-left) and $t=120$ (lower-right).

✍ Remarks:
➥ Even if the learning is unsupervised, in fact a poor choice of learning parameters ($\zeta$, $h_+$, etc.) may lead to an incomplete learning. See figure 3.6: small values of the feedback function at the beginning of learning make the network unable to "deploy" itself fast enough, leading to the appearance of a "twist" in the mapping. The network was the same as the one used to generate figure 3.5, including the same inputs and the same initial weight values. The only parameters changed were: $h_+ = 0.35$, $d_{\text{init}} = 3.5$, $d_{\text{rate}} = 0.99$ and $\zeta_{\text{rate}} = 0.9999$.

Figure 3.6: Incomplete learning of the $[0,1]\times[0,1]$ square by an $8\times 8$ network.

CHAPTER 4 The BAM/Hopfield Memory

This network illustrates an associative memory. Unlike the classical von Neumann systems, where there is no link between a memory address and its contents, in ANN part of the information is used to retrieve the rest associated with it. This kind of memory is named associative.

➧ 4.1 Associative Memory [see [FS92] pp. 130-131]

Definition 4.1.1. Let there be $P$ pairs of vectors $\{(\mathbf{x}_1, \mathbf{y}_1), \ldots, (\mathbf{x}_P, \mathbf{y}_P)\}$ with $\mathbf{x}_p \in \mathbb{R}^N$ and $\mathbf{y}_p \in \mathbb{R}^K$, $N, K, P \in \mathbb{N}^+$, called exemplars. Then the mapping $M : \mathbb{R}^N \to \mathbb{R}^K$ is said to implement a heteroassociative memory if:
$$M(\mathbf{x}_p) = \mathbf{y}_p, \quad \forall p = \overline{1,P}$$
$$M(\mathbf{x}) = \mathbf{y}_p, \quad \forall \mathbf{x} \text{ such that } \|\mathbf{x} - \mathbf{x}_p\| < \|\mathbf{x} - \mathbf{x}_\ell\|, \; \forall \ell = \overline{1,P}, \; \ell \neq p$$

Definition 4.1.2. Let there be $P$ pairs of vectors $\{(\mathbf{x}_1, \mathbf{y}_1), \ldots, (\mathbf{x}_P, \mathbf{y}_P)\}$ with $\mathbf{x}_p \in \mathbb{R}^N$ and $\mathbf{y}_p \in \mathbb{R}^K$, $N, K, P \in \mathbb{N}^+$, called exemplars. Then the mapping $M : \mathbb{R}^N \to \mathbb{R}^K$ is said to implement an interpolative associative memory if:
$$M(\mathbf{x}_p) = \mathbf{y}_p, \quad \forall p = \overline{1,P}$$
$$\forall \mathbf{d}, \; \exists \mathbf{e} \text{ such that } M(\mathbf{x}_p + \mathbf{d}) = \mathbf{y}_p + \mathbf{e}, \quad \mathbf{d} \in \mathbb{R}^N, \; \mathbf{e} \in \mathbb{R}^K, \; \mathbf{d}, \mathbf{e} \neq \tilde{0}$$
i.e. if $\mathbf{x} \neq \mathbf{x}_p$ then $\mathbf{y} = M(\mathbf{x}) \neq \mathbf{y}_p$, $\forall p = \overline{1,P}$.
( )= (obviously M x` ( + d) = ( )+ ( + y` + e where e = P X p=1 yp xTp d ( ) De nition 4.1.3. Let be a set of autoassociative memory )= P vectors fx1 ; : : : xP g with xp 2 RN and N; P 2 N + called exemplars. Then, the mapping M : RN ! RN is said to implement an autoassociative memory if: M(xp ) = xp 8 = 1 M(x) = xp 8x such that kx ; xp k kx ; x` k 8 p ;P < ` = 1; P ; ` 6= p In general x will be used to denote the input vector and y the output vector into an associative memory. ➧ 4.2 The BAM Architecture The BAM (Bidirectional Associative Memory) implements a interpolative associative memory and consists of 2 layers of neurons fully interconnected. The gure 4.1 on the facing page shows the net as M(x) = y but the input and output may swap places, i.e. the direction of connection arrows may be reversed and y play the role of input, using the same weight matrix (but transposed, see below). Considering the weight matrix W then the network output is y = W x, i.e. the activation function is identity f (x) = x. According to (4.1), the weight matrix may be build using a set of orthogonal fxp g and the associated fyp g, as: P X (4.2) yp xTp W = p=1 4.2 See [FS92] pp. 131{132. 4.3. BAM DYNAMICS Figure 4.1: 49 x layer y layer The BAM network structure. If the fy g are also orthogonal then the network is reversible. Considering y layer as input then: p x = W Proof. T y As for two matrices it is true that (AB )T = B T AT then from (4.2): WT = By using the orthogonality property ypT y` x $ y). ✍ = Xx y P p=1 p pT p` , the proof is very similar to the one for (4.1) (replacing Remarks: ➥ ➥ ➥ According to (4.2) the weights can be computed exactly (within the limitations of rounding errors). The activation function of neurons was assumed to be the identity: f (a) = a. Because the output function of a neuron should be bounded so should be the data network is working with (i.e. x and y vectors). The network can be used as autoassociative memory considering x  y. The weight matrix becomes: X P W = p ➧ 4.3 T xp xp =1 BAM Dynamics 4.3.1 Network Running The BAM functionality di er from others by the fact that weights are not adjusted during a training period but calculated from the start from the set of vectors to be stored fx ; y g =1 . p 4.3.1 p p ;P See [FS92] pp. 132{136. 50 CHAPTER 4. THE BAM/HOPFIELD MEMORY The procedure is developed for vectors belonging to Hamming space1 H. Due to the fact that most information can be encoded in binary form this is not a signi cant limitation and it does improve the reliability and speed of the net. Bipolar vectors are used (with components having values either +1 or ;1). A transition to and from binary vectors (having component values either 0 or 1) can be easily done using the following relation: x = 2xe ; b1 where x is a bipolar (Hamming) vector and xe is a binary vector. From a set of vectors fxp ; yp g the weight matrix is calculated by using (4.2). Both fxp g and fyp g have to be orthogonal because the network works in both directions. The procedure works in discrete time approximation. An initial x(0) is applied to the input. The goal is to retrieve a vector y` corresponding to the closest x` to x(0), where fx` ; y` g are from the exemplars set (stored into the net, at the initialization time, by the mean of the calculated weight matrix). The information is propagated forward and back between layers x and y till a stable state is reached and subsequently a pair fx` ; y` g belonging to the set of exemplars is found (at the output of x respectively y layers). 
The procedure is as follows:  At t = 0 the x(0) is applied to the net and the corresponding y(0) = W x(0) is calculated.  The outputs of x and y layers are propagated back and forward, till a stable state is reached, according to the formulas (for convenience [W (:; i)]T  W (:; i)T ): xi (t yj (t + 1) = + 1) = 8 >><+1 if T ) y( )) = >> ( ) if :;1 if 8 >>+1 < :) x( + 1)) = >> ( ) :;1 f (W (:; i f (W (j; t t xi t yj t T y(t) > 0 W (:; i) T y(t) = 0 W (:; i) ; i = 1; N T y(t) < 0 W (:; i) if W (j; :) x(t + 1) > 0 if W (j; :) x(t + 1) = 0 if W (j; :) x(t + 1) < 0 ; j = 1; K Note that the activation function f is not the identity. In matrix notation the formulas may be written as: x(t + 1) = sign(W T y(t)) + j sign(W T y(t))jC x(t) y(t + 1) = sign(W x(t + 1)) + j sign(W x(t + 1))jC y(t) (4.3) 0b. Proof. sign(W T y(t)) gives the correct (1) values of x(t + 1) for the changing components and make xi (t + 1) = 0 if W (:; i)T y = 0. The vector j sign(W y(t))jC have its elements equal to 1 only for those xi components which have to remain unchanged and thus restores the values of x to the previous ones (only for those elements requiring it). 1 See math appendix. and the stable condition means: sign(W T y(t)) = sign(W x(t + 1)) = 4.3. BAM DYNAMICS 51 The proof for second formula is similar. Convergence of the process is ensured by theorem 4.3.1. When working in reverse y(0) is applied to the net, x(0) = W T y(0) is calculated and the formulas change to: y(t + 1) = sign(W x(t)) + j sign(W x(t))jC y(t) x(t + 1) = sign(W T y(t + 1)) + j sign(W T y(t + 1))jC (4.4) x(t) 4.3.2 The BAM Energy Function De nition 4.3.1. The following function: E ❖ ( x y ) = ;y T ; W (4.5) x is called BAM energy function2. Theorem 4.3.1. The BAM energy function have the following properties: 1. Any change in x or y during BAM running results in a decrease in E i.e.: +1 (x(t + 1); y(t + 1)) 6 E (x(t); y(t)) Et 2. E is bounded below by Emin = ; Pj t wji j;i j. 3. When E changes it must change by an nite amount, i.e. E = E +1 ; E is nite. Let consider that just one component of vector y changes from to +1, i.e. k ( +1) = 6 k ( ). t Proof. 1. k Then from equation (4.5): t t t y t  = t+1 ; t = 1 0 0 N K X N N K N X X X X C ; BB; X B@; k ( + 1) ki i ; =B j ji j ji i C k ( ) ki i ; A @ E i=1 y t w x j=1 i=1 j =k E y w E x i=1 y t w y t y t N X i=1 x 8 > <+1 ( + 1) = k > :;k1( ) t y w 1 C iC A x ki i = [ k ( ) ; k ( + 1)] ( :) x w According to the updating procedure, see section 4.3.1: y j=1 i=1 j =k t 6 6 = [ k ( ) ; k ( + 1)] x y y t y t y t W k; if W (k; :) x > 0 if W (k; :) x = 0 if W (k; :) x < 0 As it was assumed that yk changes then there are two cases  yk (t) = +1 and it changes to yk (t + 1) = ;1. Then yk (t) ; yk (t + 1) > 0 and W (k; :) x (according to the algorithm) so E < 0.  yk (t) = ;1 and it changes to yk (t + 1) = +1. Analogous the preceding case: E < 0. < 0 See [FS92] pp. 136{141. This is the Liapunov function for BAM. All state change, with respect to time (x = x(t) and y = y(t)) involves a decrease in the value of the function. 4.3.2 2 E BAM energy function 52 CHAPTER 4. THE BAM/HOPFIELD MEMORY If more than one term is changing then E is of the form: E = E +1 ; E = t t X K k =1 y W (k; :) x < 0 k which represents a sum of negative terms. A similar discussion may be performed for an x change. 2. The fy g =1 and fx g =1 have all values either +1 or ;1. j j i ;K i ;N The lowest possible value for E is obtained when all sum terms y Emin =; X jy w j ji xi j=; j;i X j wji xi (see (4.5)) are positive. 
The proof for the second formula is similar. The stable condition means that the vectors no longer change, i.e. $\mathbf{x}(t+1) = \mathbf{x}(t)$ and $\mathbf{y}(t+1) = \mathbf{y}(t)$. Convergence of the process is ensured by theorem 4.3.1.

When working in reverse, $\mathbf{y}(0)$ is applied to the net, $\mathbf{x}(0) = W^T\mathbf{y}(0)$ is calculated and the formulas change to:
$$\mathbf{y}(t+1) = \operatorname{sign}(W\mathbf{x}(t)) + |\operatorname{sign}(W\mathbf{x}(t))|_C \odot \mathbf{y}(t)$$
$$\mathbf{x}(t+1) = \operatorname{sign}(W^T\mathbf{y}(t+1)) + |\operatorname{sign}(W^T\mathbf{y}(t+1))|_C \odot \mathbf{x}(t) \tag{4.4}$$

4.3.2 The BAM Energy Function [see [FS92] pp. 136-141]

Definition 4.3.1. The following function:
$$E(\mathbf{x}, \mathbf{y}) = -\mathbf{y}^T W \mathbf{x} \tag{4.5}$$
is called the BAM energy function. [This is the Liapunov function for the BAM: any state change with respect to time ($\mathbf{x} = \mathbf{x}(t)$ and $\mathbf{y} = \mathbf{y}(t)$) involves a decrease in the value of the function.]

Theorem 4.3.1. The BAM energy function has the following properties:
1. Any change in $\mathbf{x}$ or $\mathbf{y}$ during BAM running results in a decrease of $E$, i.e.:
$$E_{t+1}(\mathbf{x}(t+1), \mathbf{y}(t+1)) \leqslant E_t(\mathbf{x}(t), \mathbf{y}(t))$$
2. $E$ is bounded below by $E_{\min} = -\sum_{j,i} |w_{ji}|$.
3. When $E$ changes it must change by a finite amount, i.e. $\Delta E = E_{t+1} - E_t$ is finite.

Proof. 1. Let us consider that just one component of the vector $\mathbf{y}$ changes from $t$ to $t+1$, i.e. $y_k(t+1) \neq y_k(t)$. Then, from equation (4.5), all terms not involving $y_k$ cancel in the difference and:
$$\Delta E = E_{t+1} - E_t = [y_k(t) - y_k(t+1)]\, \sum_{i=1}^N w_{ki}x_i = [y_k(t) - y_k(t+1)]\; W(k,:)\,\mathbf{x}$$
According to the updating procedure (see section 4.3.1), as it was assumed that $y_k$ changes, there are two cases:
• $y_k(t) = +1$ changing to $y_k(t+1) = -1$. Then $y_k(t) - y_k(t+1) > 0$ and $W(k,:)\,\mathbf{x} < 0$ (according to the algorithm), so $\Delta E < 0$.
• $y_k(t) = -1$ changing to $y_k(t+1) = +1$. Analogous to the preceding case: $\Delta E < 0$.
If more than one term is changing then $\Delta E$ is of the form:
$$\Delta E = E_{t+1} - E_t = -\sum_{k=1}^K \Delta y_k\, W(k,:)\,\mathbf{x} < 0$$
which represents a sum of negative terms. A similar discussion may be performed for an $\mathbf{x}$ change.

2. The $\{y_j\}_{j=\overline{1,K}}$ and $\{x_i\}_{i=\overline{1,N}}$ all have values either $+1$ or $-1$. The lowest possible value of $E$ is obtained when all the sum terms $y_j w_{ji} x_i$ (see (4.5)) are positive:
$$E_{\min} = -\sum_{j,i} |y_j w_{ji} x_i| = -\sum_{j,i} |y_j|\,|w_{ji}|\,|x_i| = -\sum_{j,i} |w_{ji}|$$

3. The energy function decreases, it doesn't increase, so $\Delta E \neq +\infty$. On the other hand, the energy function is limited at the low end (according to the second part of the theorem), so it cannot decrease by an infinite amount: $\Delta E \neq -\infty$. Also, the value of $\Delta E$ can't be infinitesimally small, which would result in an infinite amount of time before $E$ reaches its minimum: the minimum amount by which $E$ may change occurs when a single $y_k$ changes, that one for which $|W(k,:)\,\mathbf{x}|$ is minimum, the amount being:
$$\Delta E = -2\,|W(k,:)\,\mathbf{x}|$$
because $|y_k(t) - y_k(t+1)| = 2$.

Proposition 4.3.1. If the input pattern $\mathbf{x}_\ell$ is exactly one of the stored $\{\mathbf{x}_p\}$ then the corresponding $\mathbf{y}_\ell$ is retrieved.

Proof. Theorem 4.3.1 ensures the convergence of the process. According to the procedure, eventually a vector $\mathbf{y} = \operatorname{sign}(W\mathbf{x}_\ell)$ is obtained; the zeros generated by sign disappear by procedure definition (previous values of $\mathbf{y}$ are kept instead). But as $\mathbf{x}_p^T\mathbf{x}_\ell = \delta_{p\ell}$ (orthogonality), then:
$$\mathbf{y} = \operatorname{sign}(W\mathbf{x}_\ell) = \operatorname{sign}\left( \sum_{p=1}^P \mathbf{y}_p\mathbf{x}_p^T\mathbf{x}_\ell \right) = \operatorname{sign}\left( \sum_{p=1}^P \mathbf{y}_p\,\delta_{p\ell} \right) = \operatorname{sign}(\mathbf{y}_\ell) = \mathbf{y}_\ell$$

✍ Remarks:
➥ The existence of the BAM energy function with the outlined properties ensures that the running process is convergent and a solution is reached in finite time, for any input vector.
➥ If an input vector is slightly different from one of the stored ones, i.e. there is noise in the data, then the corresponding associated vector is eventually retrieved. However, the process is not guaranteed: results may vary depending upon the amount of noise and the saturation of the memory (see below).
➥ The theoretical upper limit (number of vectors to be stored) of a BAM is $2^{N-1}$ (i.e. $2^N/2$, because $\mathbf{x}$ and $-\mathbf{x}$ carry the same amount of information, due to symmetry). But if the possibility of working with noisy data is sought then the real capacity is much lower. A crosstalk may appear (a different vector from the desired one is retrieved).
➥ The Hamming vectors are symmetric with respect to the $\pm 1$ notation. For this reason a Hamming vector $\mathbf{x}$ carries the same information as its complement $\mathbf{x}^C$, and a BAM network automatically stores both, because $\mathbf{x}^C = -\mathbf{x}$ and $\mathbf{y}_p = W\mathbf{x}_p$, $\forall p = \overline{1,P}$, so:
$$\mathbf{y}_p^C = -\mathbf{y}_p = -W\mathbf{x}_p = W(-\mathbf{x}_p) = W\mathbf{x}_p^C$$
such that the same $W$ matrix is used. When trying to retrieve a vector, if the initial $\mathbf{x}(0)$ is closer to the complement $\mathbf{x}_p^C$ of a stored one, then the complement pair $\{\mathbf{x}_p^C, \mathbf{y}_p^C\}$ will be retrieved (because both the exemplars and their complements are stored with equal precedence). The conclusion is that the BAM stores the direction of the exemplar vectors and not their values.

➧ 4.4 The BAM Algorithm

Network initialization: the weight matrix is calculated directly from the desired set to be stored:
$$W = \sum_{p=1}^P \mathbf{y}_p\mathbf{x}_p^T$$
Note that there is no learning process. The weights are directly initialized with their final values.

Network running forward: the network runs in the discrete time approximation.
1. Given $\mathbf{x}(0)$, calculate $\mathbf{y}(0) = W\mathbf{x}(0)$.
2. Propagate: repeat the steps
$$\mathbf{x}(t+1) = \operatorname{sign}(W^T\mathbf{y}(t)) + |\operatorname{sign}(W^T\mathbf{y}(t))|_C \odot \mathbf{x}(t)$$
$$\mathbf{y}(t+1) = \operatorname{sign}(W\mathbf{x}(t+1)) + |\operatorname{sign}(W\mathbf{x}(t+1))|_C \odot \mathbf{y}(t)$$
till the network stabilizes, i.e. $\mathbf{x}(t+1) = \mathbf{x}(t)$ and $\mathbf{y}(t+1) = \mathbf{y}(t)$.

Network running backwards: in the same discrete time approximation, given $\mathbf{y}(0)$, calculate $\mathbf{x}(0) = W^T\mathbf{y}(0)$. Then propagate using the formulas:
$$\mathbf{y}(t+1) = \operatorname{sign}(W\mathbf{x}(t)) + |\operatorname{sign}(W\mathbf{x}(t))|_C \odot \mathbf{y}(t)$$
$$\mathbf{x}(t+1) = \operatorname{sign}(W^T\mathbf{y}(t+1)) + |\operatorname{sign}(W^T\mathbf{y}(t+1))|_C \odot \mathbf{x}(t)$$
till the network stabilizes, i.e. $\mathbf{y}(t+1) = \mathbf{y}(t)$ and $\mathbf{x}(t+1) = \mathbf{x}(t)$.

Note that in both the forward and backward running cases the intermediate vectors $W\mathbf{x}$ or $W^T\mathbf{y}$ may not be of Hamming type.
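A minimal Python/NumPy sketch of the forward run above (my own naming; thresholding the real-valued $\mathbf{y}(0)$ to a bipolar vector is my choice of starting point, not prescribed by the book):

    import numpy as np

    def bam_run_forward(W, x0, max_iter=100):
        """Forward BAM run of section 4.4: compute y(0) = W x(0), then apply
        the sign updates of (4.3); a component keeps its previous value when
        its net input is exactly zero; stop when x and y stop changing."""
        x = x0.astype(int).copy()
        y0 = W @ x                              # y(0); may not be of Hamming type
        y = np.where(y0 >= 0, 1, -1)            # threshold it (a choice of mine)
        for _ in range(max_iter):
            a = W.T @ y
            x_new = np.where(a != 0, np.sign(a).astype(int), x)
            b = W @ x_new
            y_new = np.where(b != 0, np.sign(b).astype(int), y)
            if np.array_equal(x_new, x) and np.array_equal(y_new, y):
                break
            x, y = x_new, y_new
        return x, y

Running backwards is symmetric: start from y(0), swap the roles of W and its transpose.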
Network running backwards. In the same discrete time approximation, given $y(0)$, calculate $x(0) = W^T y(0)$. Then propagate using the formulas:

$$y(t+1) = \operatorname{sign}(W x(t)) + |\operatorname{sign}(W x(t))|^C \odot y(t)$$
$$x(t+1) = \operatorname{sign}(W^T y(t+1)) + |\operatorname{sign}(W^T y(t+1))|^C \odot x(t)$$

till the network stabilizes, i.e. neither $x$ nor $y$ changes any longer.

Figure 4.2: The autoassociative memory structure.
Figure 4.3: The Hopfield network structure.

➧ 4.5 The Hopfield Memory

4.5.1 The Discrete Hopfield Memory

Consider an autoassociative memory. (See [FS92] pp. 141–144.) The weight matrix, built from a set $\{y_p\}$, is:

$$W = \sum_{p=1}^P y_p y_p^T$$

and it is square ($K \times K$) and symmetric ($W = W^T$). An autoassociative memory is similar to a BAM, with the remark that the 2 layers ($x$ and $y$) are identical. In this case the 2 layers may be replaced with one fully interconnected layer, including a feedback for each neuron (see figure 4.2): the output of each neuron is connected to the inputs of all neurons, including itself.

The discrete Hopfield memory is built from the autoassociative memory described above by replacing the autofeedback (feedback from a neuron to itself) with an external input signal $x$ (see figure 4.3). The differences from the autoassociative memory or BAM are as follows:

➀ The discrete Hopfield memory works with binary vectors rather than bipolar ones (see section 4.3.1), so here and below the $y$ vectors are considered binary, and so are the input $x$ vectors.

➁ The weight matrix is obtained from the matrix

$$\sum_{p=1}^P (2 y_p - \hat{1})(2 y_p - \hat{1})^T$$

by replacing the diagonal values with 0 (zeroing the diagonal elements of $W$ is important for an efficient matrix notation, and helps towards an efficient simulation implementation as well).

➂ The algorithm is similar to the BAM one, but the updating formula is:

$$y_j(t+1) = \begin{cases} +1 & \text{if } \sum_{i \neq j} w_{ji} y_i + x_j > t_j \\ y_j(t) & \text{if } \sum_{i \neq j} w_{ji} y_i + x_j = t_j \\ 0 & \text{if } \sum_{i \neq j} w_{ji} y_i + x_j < t_j \end{cases} \tag{4.6}$$

where $t = \{t_j\}_{j=\overline{1,K}}$ is named the threshold vector.

In matrix notation the equation becomes:

$$A(t) = \operatorname{sign}(W y(t) + x - t), \qquad y(t+1) = \tfrac{1}{2}\bigl[A(t) + \hat{1} - |A(t)|^C\bigr] + |A(t)|^C \odot y(t)$$

Proof. First, as the diagonal elements of $W$ are zero ($w_{ii} = 0$), then $\sum_{i \neq j} w_{ji} y_i = W(j,:)\, y$. Also, the elements of $A(t) + \hat{1}$ are:

$$\{A(t) + \hat{1}\}_j = \begin{cases} 2 & \text{if } W(j,:)\, y + x_j > t_j \\ 1 & \text{if } W(j,:)\, y + x_j = t_j \\ 0 & \text{if } W(j,:)\, y + x_j < t_j \end{cases}$$

and the elements of $|A(t)|^C$ are:

$$\{|A(t)|^C\}_j = \begin{cases} 1 & \text{if } W(j,:)\, y + x_j = t_j \\ 0 & \text{otherwise} \end{cases}$$

Combining the two gives exactly the values required by (4.6). ∎

Definition 4.5.1. The following function:

$$E = -\tfrac{1}{2}\, y^T W y - y^T (x - t) \tag{4.7}$$

is named the discrete Hopfield memory energy function.

✍ Remarks:
➥ Compared to the BAM energy function, the discrete Hopfield energy function has a factor of $1/2$ because there is just one layer of neurons (in a BAM, both the forward and the backward passes contribute to the energy function).

Theorem 4.5.1. The discrete Hopfield energy function has the following properties:

1. Any change in $y$ (during running) results in a decrease of $E$: $E_{t+1}(y(t+1)) \leqslant E_t(y(t))$.
2. $E$ is bounded below by: $E_{\min} = -\tfrac{1}{2} \sum_{j \neq i} |w_{ji}| - K$.
3. When $E$ changes, it must change by a finite amount, i.e. $\Delta E = E_{t+1} - E_t$ is finite.

Proof. 1. Consider that, from $t$ to $t+1$, just one component of vector $y$ changes: $y_k$. Then from (4.7):
$$\Delta E = E_{t+1} - E_t = -[y_k(t+1) - y_k(t)] \Bigl(\sum_{i \neq k} w_{ki} y_i + x_k - t_k\Bigr)$$

because in the sum $\sum_{i \neq j} y_j w_{ji} y_i$ the component $y_k$ appears twice, once at the left and once at the right, and $w_{ij} = w_{ji}$, which cancels the factor $\tfrac{1}{2}$.

According to the updating procedure (4.6):

$$y_k(t+1) = \begin{cases} +1 & \text{if } \sum_{i \neq k} w_{ki} y_i + x_k - t_k > 0 \\ y_k(t) & \text{if } \sum_{i \neq k} w_{ki} y_i + x_k - t_k = 0 \\ 0 & \text{if } \sum_{i \neq k} w_{ki} y_i + x_k - t_k < 0 \end{cases}$$

As it was assumed that $y_k(t+1) \neq y_k(t)$, there are 2 cases:

• $y_k(t) = +1$ and it changes to $y_k(t+1) = 0$. Then $y_k(t+1) - y_k(t) < 0$ and $\sum_{i \neq k} w_{ki} y_i + x_k - t_k < 0$ (according to the algorithm), so $\Delta E < 0$.
• $y_k(t) = 0$ and it changes to $y_k(t+1) = +1$. Analogous to the preceding case: $\Delta E < 0$.

If more than one component changes then $\Delta E$ is of the form:

$$\Delta E = E_{t+1} - E_t = -\sum_{j=1}^K [y_j(t+1) - y_j(t)] \Bigl(\sum_{i \neq j} w_{ji} y_i + x_j - t_j\Bigr) < 0$$

which represents a sum of negative terms.

2. The $\{y_i\}_{i=\overline{1,K}}$ all have values either $+1$ or $0$. The lowest possible value of $E$ is obtained when $\{y_i\}_{i=\overline{1,K}} = 1$, the input vector is $x = \hat{1}$ and the threshold vector is $t = \hat{0}$, such that the negative terms are maximal and the positive term is minimal (see (4.7)), assuming that all $w_{ji} > 0$, $i, j = \overline{1,K}$:

$$E_{\min} = -\frac{1}{2} \sum_{\substack{j,i=1 \\ j \neq i}}^K |w_{ji}| - K$$

3. The energy function decreases, it does not increase, so $\Delta E \neq +\infty$. On the other hand, the energy function is bounded below (according to the second part of the theorem), so it cannot decrease by an infinite amount: $\Delta E \neq -\infty$. Also, $|\Delta E|$ cannot be infinitesimally small, which would result in an infinite amount of time before $E$ reaches its minimum: the minimum amount by which $E$ may change corresponds to the change of a single component $y_k$, the one for which $|\sum_{i \neq k} w_{ki} y_i + x_k - t_k|$ is minimal, the amount being (as $|y_k(t+1) - y_k(t)| = 1$):

$$\Delta E = -\Bigl|\sum_{i \neq k} w_{ki} y_i + x_k - t_k\Bigr|$$

which is finite. ∎

✍ Remarks:
➥ The existence of the discrete Hopfield energy function with the outlined properties ensures that the running process is convergent and a solution is reached in finite time. (A minimal simulation sketch follows.)
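As a small illustration, the energy (4.7) and the updating rule (4.6) can be simulated directly. The Scilab sketch below (function names of our choosing) assumes binary $y$ and $x$ and a threshold vector $t$, and updates the neurons one at a time; repeated sweeps can only lower hop_energy, so the run terminates:

// Sketch: discrete Hopfield energy (4.7) and one update sweep, rule (4.6).
function E = hop_energy(W, y, x, t)
    E = -0.5 * y' * W * y - y' * (x - t);
endfunction

function y = hop_sweep(W, y, x, t)
    for j = 1:size(W, 1)
        h = W(j, :) * y + x(j) - t(j);   // w_jj = 0, so no self term appears
        if h > 0 then
            y(j) = 1;
        elseif h < 0 then
            y(j) = 0;
        end                              // h == 0: keep the previous value
    end
endfunction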
4.5.2 The Continuous Hopfield Memory

The continuous Hopfield memory model is similar to the discrete one, except for the activation function of the neurons, which is of the form (see [FS92] pp. 144–148):

$$f(a) = \frac{1 + \tanh(\lambda a)}{2}, \qquad \lambda = \text{const.}, \; \lambda \in \mathbb{R}_+$$

where $\lambda$ is called the gain parameter. See figure 4.4. The inverse of the activation is:

$$f^{(-1)}(y) = \frac{1}{2\lambda} \ln \frac{y}{1 - y} \tag{4.8}$$

See figure 4.5.

Figure 4.4: The neuron activation function for the continuous Hopfield memory, for different gain values ($\lambda = 1, 3, 50$).
Figure 4.5: The inverse of the neuron activation function for the continuous Hopfield memory, for different gain values.

The differential equation governing the evolution of the continuous Hopfield memory is defined as:

$$\frac{da_j}{dt} = \sum_{\substack{i=1 \\ i \neq j}}^K w_{ji} y_i + x_j - t_j - a_j \tag{4.9}$$

or, in matrix notation:

$$\frac{da}{dt} = W y + x - t - a$$

In discrete time approximation the updating procedure may be written as:

$$y(t+1) = y(t) + 2\lambda \, [y(t) \odot (\hat{1} - y(t))] \odot \Bigl[W y(t) + x - t - \frac{1}{2\lambda} \ln \frac{y(t)}{\hat{1} - y(t)}\Bigr] \tag{4.10}$$

(of course, the operations under the $\ln$ function are performed on each $y_j$ separately).

Proof. From (4.9), with $a_j = f^{(-1)}(y_j)$:

$$\frac{d f^{(-1)}(y_j)}{dt} = \sum_{i \neq j} w_{ji} y_i + x_j - t_j - f^{(-1)}(y_j)$$

and

$$d f^{(-1)}(y_j) = \frac{1}{2\lambda}\, d \ln \frac{y_j}{1 - y_j} = \frac{1}{2\lambda} \Bigl(\frac{1}{y_j} + \frac{1}{1 - y_j}\Bigr) dy_j = \frac{1}{2\lambda} \, \frac{dy_j}{y_j (1 - y_j)}$$

such that:

$$dy_j = 2\lambda\, y_j (1 - y_j) \Bigl[\sum_{i \neq j} w_{ji} y_i + x_j - t_j - \frac{1}{2\lambda} \ln \frac{y_j}{1 - y_j}\Bigr] dt$$

and in discrete time approximation $dt \to \Delta t = (t+1) - t = 1$. ∎

Definition 4.5.2. The following function:

$$E = -\frac{1}{2} \sum_{\substack{i,j=1 \\ i \neq j}}^K y_j w_{ji} y_i - \sum_{j=1}^K (x_j - t_j)\, y_j + \sum_{j=1}^K \int_0^{y_j} f^{(-1)}(y') \, dy' \tag{4.11}$$

is named the continuous Hopfield memory energy function.

Theorem 4.5.2. The continuous Hopfield energy function has the following properties:

1. Any change in $y$ as a result of running (evolution) results in a decrease of $E$, i.e. $\frac{dE}{dt} \leqslant 0$.
2. $E$ is bounded below by: $E_{\min} = -\tfrac{1}{2} \sum_{j \neq i} |w_{ji}| - K$.

Proof. 1. First:

$$\lim_{x \searrow 0} x \ln x = \lim_{x \searrow 0} \frac{\ln x}{1/x} \overset{\text{(L'Hospital)}}{=} \lim_{x \searrow 0} (-x) = 0 \qquad \text{and} \qquad \int \ln x \, dx = x \ln x - x$$

such that:

$$\int_0^{y_i} \ln \frac{y'}{1 - y'}\, dy' = \bigl[y' \ln y' + (1 - y') \ln(1 - y')\bigr]_0^{y_i} = \ln \bigl(y_i^{y_i} (1 - y_i)^{1 - y_i}\bigr)$$

and

$$\frac{d}{dt} \ln \bigl(y_i^{y_i} (1 - y_i)^{1 - y_i}\bigr) = \bigl(\ln y_i - \ln(1 - y_i)\bigr) \frac{dy_i}{dt} = 2\lambda f^{(-1)}(y_i)\, \frac{dy_i}{dt}$$

Then, from (4.11) and using (4.9):

$$\frac{dE}{dt} = -\sum_{j=1}^K \Bigl[\sum_{i \neq j} w_{ji} y_i + x_j - t_j - f^{(-1)}(y_j)\Bigr] \frac{dy_j}{dt} = -\sum_{j=1}^K \frac{da_j}{dt} \frac{dy_j}{dt} = -\sum_{j=1}^K \frac{d f^{(-1)}(y_j)}{dy_j} \Bigl(\frac{dy_j}{dt}\Bigr)^2$$

and because

$$\frac{d f^{(-1)}(y_j)}{dy_j} = \frac{1}{2\lambda\, y_j (1 - y_j)} > 0 \qquad (y_j \in (0,1))$$

$\frac{dE}{dt}$ is always negative and $E$ decreases in time.

2. The lowest possible value of $E$ is obtained when $\{y_j\}_{j=\overline{1,K}} = 1$, the input vector is $x = \hat{1}$ and the threshold vector is $t = \hat{0}$, such that the negative terms are maximal and the positive term is minimal (see (4.11)), assuming that all $w_{ji} > 0$, $i,j = \overline{1,K}$:

$$E_{\min} = -\frac{1}{2} \sum_{\substack{j,i=1 \\ j \neq i}}^K |w_{ji}| - K \qquad \blacksquare$$

✍ Remarks:
➥ The existence of the continuous Hopfield energy function with the outlined properties ensures that the running process is convergent.
➥ While the process is convergent, there is no guarantee that it will converge to the lowest energy value.
➥ For $\lambda \to +\infty$ the continuous Hopfield memory becomes identical to the discrete one. Otherwise:
  • For $\lambda \to 0$ there is only one stable state of the network: $y = \tfrac{1}{2}\hat{1}$.
  • For $\lambda \in (0, +\infty)$ the stable states are somewhere between the corners of the Hamming hypercube (having its center at $\tfrac{1}{2}\hat{1}$) and its center, such that, as the gain decreases from $+\infty$ to $0$, the stable points move from the corners towards the center, and at some point they may merge.
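In a simulation, one step of (4.10) is essentially a one-liner. A minimal Scilab sketch (chop_step is our name; lam stands for the gain $\lambda$, and $y$ must be kept strictly inside $(0,1)$ for the logarithm to exist):

// Sketch: one discrete-time step (4.10) of the continuous Hopfield memory.
function y = chop_step(W, y, x, t, lam)
    a = log(y ./ (1 - y)) / (2 * lam);               // f^(-1)(y), elementwise
    y = y + 2 * lam * (y .* (1 - y)) .* (W * y + x - t - a);
endfunction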
➧ 4.6 Applications

4.6.1 The Traveling Salesperson Problem

This example shows a practical problem of scheduling, e.g. as it arises in communication networks. A bidimensional continuous Hopfield memory is used. It is also a classical example of an NP problem, here solved with the help of an ANN. (See [FS92] pp. 148–156.)

The problem: a traveling salesperson must visit a number of cities, each only once. Moving from one city to another has a cost, e.g. the intercity distance. The total cost/distance traveled must be minimized. The salesperson has to return to the starting point.

The problem is of NP (non-polynomial) type.

Proof. Assuming that there are $K$ cities, there are $K!$ paths. For a given tour it does not matter which city is first (one division by $K$), nor does the direction matter (one division by 2). So the number of different paths is $(K-1)!/2$ (with $K \geqslant 3$, otherwise a circuit is not possible). Adding a new city to the previous set means that there are now $K!/2$ routes, i.e. an increase in the number of paths by a factor of:

$$\frac{K!/2}{(K-1)!/2} = K$$

so, for an arithmetic-progression growth of the problem size, the space of possible solutions grows exponentially. ∎
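The growth is easy to feel numerically; a two-line Scilab check of the tour count $(K-1)!/2$:

// The number of distinct tours for a few K values.
for K = [5 10 15]
    mprintf("K = %2d   tours = %g\n", K, factorial(K - 1) / 2);
end
// -> 12, 181440 and about 4.36e10 tours respectively.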
Let $C_1, \ldots, C_K$ be the cities involved. To each of the $K$ cities is attached a vector which represents its order of visiting in the current tour: the first one to be visited has $(1\ 0\ \ldots\ 0)$, the second one has $(0\ 1\ 0\ \ldots\ 0)$, and so on, the last one to be visited having attached the vector $(0\ \ldots\ 0\ 1)$; i.e. each vector has one digit "1", all other elements being "0" (this format is different from the binary representation of the city number $j$, as cities are not visited in their numbering order).

Having the cities $C_1, \ldots, C_K$, a square matrix can be built using their associated vectors as rows. Because the cities aren't necessarily visited in their listed order, the matrix is not necessarily diagonal:

$$\begin{array}{c|ccccccc}
 & 1 & \cdots & j_1 & \cdots & j_2 & \cdots & K \\ \hline
C_1 & 1 & \cdots & 0 & \cdots & 0 & \cdots & 0 \\
\vdots \\
C_{k_1} & 0 & \cdots & 1 & \cdots & 0 & \cdots & 0 \\
\vdots \\
C_{k_2} & 0 & \cdots & 0 & \cdots & 1 & \cdots & 0 \\
\vdots \\
C_K & 0 & \cdots & 0 & \cdots & 0 & \cdots & 1
\end{array} \tag{4.12}$$

This matrix defines the tour: for each column $j = \overline{1,K}$, pick as the $j$-th city the one having the corresponding row element equal to 1. The idea is to build a bidimensional Hopfield memory such that its output is a matrix $Y$ (not a vector) having the layout (4.12), and this will give the solution (as each row will represent the visiting order number, in this one-hot format, of the respective city). In order to be an acceptable solution, the $Y$ matrix has to have the following properties:

➀ Each city must not be visited more than once ⇔ each row of the matrix (4.12) should have no more than one "1", all other elements being "0".
➁ Two cities cannot be visited at the same time (cannot have the same order number) ⇔ each column of the matrix (4.12) should have no more than one "1", all other elements being "0".
➂ All cities should be visited ⇔ each row and each column of the matrix (4.12) should have at least one "1".
➃ The total distance/cost of the tour should be minimized.

Let $d_{k_1 k_2}$ be the distance/cost between cities $C_{k_1}$ and $C_{k_2}$. As the network is bidimensional, each weight has 4 subscripts: $w_{k_2 j_2, k_1 j_1}$ is the weight from neuron {row $k_1$, column $j_1$} to neuron {row $k_2$, column $j_2$}. See figure 4.6.

Figure 4.6: The bidimensional Hopfield memory and its weight representation.

✍ Remarks:
➥ When using bidimensional Hopfield networks, all the established results are kept, but each subscript splits in 2, giving the row and the column position (instead of a single one giving the column position).

The weights cannot be built from a set of some $\{Y_\ell\}$, as these are not known (the idea is to find them), but they may be built considering the following reasoning:

➀ A city must appear only once in a tour: this means that one neuron on a row must inhibit all others on the same row, such that in the end only one will have active output 1, all others 0. Then the weight should have a term of the form:

$$w^{(1)}_{k_2 j_2, k_1 j_1} = -A\, \delta_{k_1 k_2} (1 - \delta_{j_1 j_2}), \qquad A \in \mathbb{R}_+, \; A = \text{const.}$$

where $\delta$ is the Kronecker symbol. This means $w^{(1)}_{k_2 j_2, k_1 j_1} = 0$ for neurons on different rows, $w^{(1)}_{k_1 j_2, k_1 j_1} < 0$ within a given row $k_1$ if $j_1 \neq j_2$, and $w^{(1)}_{k_1 j_1, k_1 j_1} = 0$ for self-feedback.

➁ There must be no cities with the same order number in a tour: this means that one neuron on a column must inhibit all others on the same column, such that in the end only one will have active output 1, all others 0. Then the weight should have a term of the form:

$$w^{(2)}_{k_2 j_2, k_1 j_1} = -B\, \delta_{j_1 j_2} (1 - \delta_{k_1 k_2}), \qquad B \in \mathbb{R}_+, \; B = \text{const.}$$

This means $w^{(2)}_{k_2 j_2, k_1 j_1} = 0$ for neurons on different columns, $w^{(2)}_{k_2 j_1, k_1 j_1} < 0$ within a given column $j_1$ if $k_1 \neq k_2$, and $w^{(2)}_{k_1 j_1, k_1 j_1} = 0$ for self-feedback.

➂ Most of the neurons should have output 0, so a global inhibition may be used. Then the weight should have a term of the form:

$$w^{(3)}_{k_2 j_2, k_1 j_1} = -C, \qquad C \in \mathbb{R}_+, \; C = \text{const.}$$

i.e. all neurons receive the same global inhibition $\propto C$.

➃ The total distance/cost has to be minimized, so neurons receive an inhibitory input proportional to the distance between the cities they represent. Only neurons on adjacent columns, representing cities which may come before or after the current city in the tour order (only one will actually be selected), should receive this inhibition:

$$w^{(4)}_{k_2 j_2, k_1 j_1} = -D\, d_{k_1 k_2} (\delta_{j_2, j_1 - 1} + \delta_{j_2, j_1 + 1}), \qquad D \in \mathbb{R}_+, \; D = \text{const.}$$

the column indices being taken cyclically (position $0 \equiv K$ and position $K + 1 \equiv 1$), which takes care of the special cases $j_1 = 1$ and $j_1 = K$: the term $\delta_{j_2, j_1 - 1}$ covers the case when column $j_2$ comes before $j_1$ (column $K$ coming "before" column 1), while $\delta_{j_2, j_1 + 1}$ operates similarly for the case when $j_2$ comes after $j_1$.

Finally, the weights are:

$$w_{k_2 j_2, k_1 j_1} = w^{(1)}_{k_2 j_2, k_1 j_1} + w^{(2)}_{k_2 j_2, k_1 j_1} + w^{(3)}_{k_2 j_2, k_1 j_1} + w^{(4)}_{k_2 j_2, k_1 j_1} \tag{4.13}$$
$$= -A\, \delta_{k_1 k_2} (1 - \delta_{j_1 j_2}) - B\, \delta_{j_1 j_2} (1 - \delta_{k_1 k_2}) - C - D\, d_{k_1 k_2} (\delta_{j_2, j_1 - 1} + \delta_{j_2, j_1 + 1})$$
Taking $Y = \{y_{kj}\}$ to be the network output (a matrix) and considering $X = CK\,\tilde{1}$ as input (again a matrix, $\tilde{1}$ denoting the matrix of ones) and $T = \tilde{0}$ as the threshold (a matrix again), then from the definition of the continuous Hopfield energy function (4.11), neglecting the integral term, and using the weights (4.13):

$$E = \frac{1}{2} A \sum_{k=1}^K \sum_{\substack{j_1, j_2 = 1 \\ j_1 \neq j_2}}^K y_{k j_1} y_{k j_2} + \frac{1}{2} B \sum_{j=1}^K \sum_{\substack{k_1, k_2 = 1 \\ k_1 \neq k_2}}^K y_{k_1 j} y_{k_2 j} + \frac{C}{2} \Bigl(\sum_{k,j=1}^K y_{kj}\Bigr)^2 - CK \sum_{k,j=1}^K y_{kj} + \frac{1}{2} D \sum_{\substack{k_1, k_2 = 1 \\ k_1 \neq k_2}}^K \sum_{j=1}^K d_{k_1 k_2}\, y_{k_1 j} (y_{k_2, j'} + y_{k_2, j''})$$

$$= E_1 + E_2 + E_3 + E_4 - \frac{1}{2} C K^2 \tag{4.14}$$

where $E_3 = \frac{C}{2} \bigl(\sum_{k,j} y_{kj} - K\bigr)^2$ and $j'$, $j''$ denote the cyclic predecessor, respectively successor, of position $j$:

$$j' = \begin{cases} j - 1 & \text{if } j \neq 1 \\ K & \text{if } j = 1 \end{cases} \qquad \qquad j'' = \begin{cases} j + 1 & \text{if } j \neq K \\ 1 & \text{if } j = K \end{cases}$$

According to theorem 4.5.2, during network running the energy (4.14) decreases and reaches a minimum. This may be interpreted as follows:

➀ The energy minimum will favor states that have each city only once in the tour:

$$E_1 = \frac{1}{2} A \sum_{k=1}^K \sum_{\substack{j_1, j_2 = 1 \\ j_1 \neq j_2}}^K y_{k j_1} y_{k j_2}$$

which reaches its minimum $E_1 = 0$ if and only if each city appears only once in the tour, such that the products $y_{k j_1} y_{k j_2}$ are either of type $1 \cdot 0$ or $0 \cdot 0$, i.e. there is only one 1 in each row of the matrix (4.12). The $1/2$ factor means that the terms $y_{k j_1} y_{k j_2} = y_{k j_2} y_{k j_1}$ will be added only once, not twice.

➁ The energy minimum will favor states that have each position of the tour occupied by only one city, i.e. if city $C_{k_1}$ is the $j$-th to be visited then no other city can be in the same $j$-th position in the tour:

$$E_2 = \frac{1}{2} B \sum_{j=1}^K \sum_{\substack{k_1, k_2 = 1 \\ k_1 \neq k_2}}^K y_{k_1 j} y_{k_2 j}$$

which reaches its minimum $E_2 = 0$ if and only if all cities have different order numbers in the tour, such that the products $y_{k_1 j} y_{k_2 j}$ are either of type $1 \cdot 0$ or $0 \cdot 0$, i.e. there is only one 1 in each column of the matrix (4.12). The $1/2$ factor again means that the terms $y_{k_1 j} y_{k_2 j} = y_{k_2 j} y_{k_1 j}$ will be added only once, not twice.

➂ The energy minimum will favor states that have all cities in the tour:

$$E_3 = \frac{C}{2} \Bigl(\sum_{k,j=1}^K y_{kj} - K\Bigr)^2$$

reaching its minimum $E_3 = 0$ if all cities are represented in the tour, i.e. $\sum_{k,j} y_{kj} = K$; the fact that a present city appears once and only once was taken care of by the previous terms (there are $K$ and only $K$ "ones" in the whole matrix (4.12)). The squaring shows that the modulus of the difference is what matters (otherwise the energy might decrease for an increasing number of missed cities; either $\sum_{k,j} y_{kj} > K$ or $\sum_{k,j} y_{kj} < K$ is bad).

➃ The energy minimum will favor states with minimum distance/cost of the tour:

$$E_4 = \frac{1}{2} D \sum_{\substack{k_1, k_2 = 1 \\ k_1 \neq k_2}}^K \sum_{j=1}^K d_{k_1 k_2}\, y_{k_1 j} (y_{k_2, j'} + y_{k_2, j''})$$

If $y_{k_1 j} = 0$ then no distance is added. If $y_{k_1 j} = 1$ then 2 cases arise: (a) $y_{k_2, j''} = 1$, which means that city $C_{k_2}$ is the next one in the tour, and the distance $d_{k_1 k_2}$ is added: $d_{k_1 k_2}\, y_{k_1 j}\, y_{k_2, j''} = d_{k_1 k_2}$; (b) $y_{k_2, j''} = 0$, which means that city $C_{k_2}$ isn't the next one in the tour, and the corresponding distance is not added: $d_{k_1 k_2}\, y_{k_1 j}\, y_{k_2, j''} = 0$. A similar discussion holds for $y_{k_2, j'}$. The $1/2$ means that the distances $d_{k_1 k_2} = d_{k_2 k_1}$ will be added only once, not twice; from the previous terms there should be only one digit "1" on each row, so a distance $d_{k_1 k_2}$ should appear only once.

➄ The term $-\frac{1}{2} C K^2$ is just an additive constant, used to complete the square in $E_3$.

To be able to use the running procedure (4.10), a way to convert $\{w_{k_2 j_2, k_1 j_1}\}$ to a matrix and $\{y_{kj}\}$ to a vector has to be found. As the indices $\{k, j\}$ work in pairs, this is easily done using the lexicographic convention:

$$w_{k_2 j_2, k_1 j_1} \to \widetilde{w}_{K(j_2 - 1) + k_2,\; K(j_1 - 1) + k_1}, \qquad y_{kj} \to \widetilde{y}_{K(j - 1) + k}$$

The graphical representation of the $Y \to \widetilde{y}$ transformation is very simple: take each column of $Y$ (i.e. each row of $Y^T$) and "glue" it at the bottom of the previous one. The inverse transformation, recovering $Y$ from $\widetilde{y}$, is equally simple:

$$\widetilde{y}_\ell \to y_{kj}, \qquad k = ((\ell - 1) \bmod K) + 1, \quad j = \lfloor (\ell - 1)/K \rfloor + 1$$

The updating formula (4.10) then becomes:

$$\widetilde{y}(t+1) = \widetilde{y}(t) + 2\lambda\, [\widetilde{y}(t) \odot (\hat{1} - \widetilde{y}(t))] \odot \Bigl[\widetilde{W} \widetilde{y}(t) + CK \hat{1} - \frac{1}{2\lambda} \ln \frac{\widetilde{y}(t)}{\hat{1} - \widetilde{y}(t)}\Bigr]$$

and the $A$, $B$, $C$ and $D$ constants are used to tune the process. (A code sketch of the weight construction is given below.)
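A Scilab sketch of the weight construction (4.13), under the lexicographic convention above (tsp_weights is our name; $d$ is assumed symmetric with zero diagonal, and the cyclic-adjacency test replaces the $j'$/$j''$ bookkeeping):

// Sketch: the flattened TSP weight matrix (4.13).
// Neuron (city k, position j) gets the index K*(j-1)+k.
function Wt = tsp_weights(d, A, B, C, D)
    K = size(d, 1);
    Wt = zeros(K*K, K*K);
    for k1 = 1:K
        for j1 = 1:K
            for k2 = 1:K
                for j2 = 1:K
                    w = -C;                                     // global inhibition
                    if k1 == k2 & j1 <> j2 then w = w - A; end  // row term
                    if j1 == j2 & k1 <> k2 then w = w - B; end  // column term
                    // distance term for cyclically adjacent tour positions
                    if modulo(j1 - j2 + K, K) == 1 | modulo(j2 - j1 + K, K) == 1 then
                        w = w - D * d(k1, k2);
                    end
                    Wt(K*(j2-1) + k2, K*(j1-1) + k1) = w;
                end
            end
        end
    end
endfunction

The network would then be iterated with a step of type (4.10), starting from a $\widetilde{y}(0)$ slightly perturbed around $\tfrac{1}{2}\hat{1}$, and the resulting $Y$ matrix read back through the inverse lexicographic mapping (this starting choice is a common practice in Hopfield-type TSP simulations, not something prescribed here).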
CHAPTER 5

The Counterpropagation Network

The counterpropagation network (CPN) is an example of ANN interconnectivity: from some subnetworks, a new one is created, forming a reversible, heteroassociative memory (see "The BAM/Hopfield Memory" chapter for the definition; see also [FS92] pp. 213–234).

➧ 5.1 The CPN Architecture

Let there be a set of vector pairs $(x_1, y_1), \ldots, (x_P, y_P)$ which may be classified into several classes $C_1, \ldots, C_H$. The CPN associates an input vector $x$ with that $\langle y \rangle_k$ for which the corresponding $\langle x \rangle_k$ is closest to the input $x$; $\langle x \rangle_k$ and $\langle y \rangle_k$ are the averages over those $x_p$, respectively $y_p$, belonging to the same class $C_k$. The CPN may also work in reverse, inputting a $y$ and retrieving an $\langle x \rangle_k$.

The CPN architecture consists of 5 layers on 3 levels. The input level contains the $x$ and $y$ layers; the middle level contains the hidden layer; and the output level contains the $x'$ and $y'$ layers. See figure 5.1. Note that each neuron on the $x$ layer of the input level receives the whole $x$ (and similarly for the $y$ layer), and that there are direct links between the input and output levels.

Figure 5.1: The CPN architecture.

Considering a trained network: an $x$ vector is applied, $y = \hat{0}$ at the input level, and the corresponding vector $\langle y \rangle_k$ is retrieved. When running in reverse, a $y$ vector is applied, $x = \hat{0}$ at the input level, and the corresponding $\langle x \rangle_k$ is retrieved. Both cases are identical; only the first will be discussed in detail.

This functionality is achieved as follows:

• The first level normalizes the input vector.
• The second level (hidden layer) does a classification of the input vector, outputting a one-of-k encoded classification, i.e. the outputs of all hidden neurons are zero with one exception, and the number/label of the corresponding neuron identifies the input vector as belonging to a class (as being closest to some particular, "representative", previously stored vector).
• Based on the classification performed on the hidden layer, the output layer actually retrieves a "representative" vector.

All three subnetworks are quasi-independent, and training at one level is performed only after the training at the previous level has been finished.

5.1.1 The Input Layer

Consider the $x$ input layer (an identical discussion goes for the $y$ layer, as previously specified). Let $N$ be the dimension of vector $x$ and $K$ the dimension of vector $y$. Let $z_x$ be the output vector of the input $x$ layer. See figure 5.2.

Figure 5.2: The input layer.

The input layer has to achieve a normalization of the input; this may be done if the neuronal activity on the input layer is defined as follows:

• each neuron receives a positive excitation proportional to its corresponding input, i.e. $+B x_i$, $B = \text{const.}$, $B > 0$;
• each neuron receives a negative excitation from all neurons on the same layer, including itself, equal to $-z_{xi} x_j$;
• the input vector $x$ is applied at time $t = 0$ and removed ($x$ becomes $\hat{0}$) at time $t = t_0$; and
• in the absence of input $x_i$, the neuronal output $z_{xi}$ decreases to zero following an exponential defined by $A = \text{const.}$, $A > 0$, i.e. $z_{xi} \propto e^{-At}$.

Then the neuronal behaviour may be summarized as:

$$\frac{dz_{xi}}{dt} = -A z_{xi} + B x_i - z_{xi} \sum_{j=1}^N x_j \qquad \text{for } t \in [0, t_0) \tag{5.1a}$$
$$\frac{dz_{xi}}{dt} = -A z_{xi} \qquad \text{for } t \in [t_0, \infty) \tag{5.1b}$$

or in matrix notation:

$$\frac{dz_x}{dt} = -A z_x + B x - (x^T \hat{1})\, z_x \qquad \text{for } t \in [0, t_0)$$
$$\frac{dz_x}{dt} = -A z_x \qquad \text{for } t \in [t_0, \infty)$$

The boundary conditions are $z_x(0) = \hat{0}$ (starts from $\hat{0}$, no previously applied signal) and $\lim_{t \to \infty} z_x(t) = \hat{0}$ (returns to $\hat{0}$ after the signal has been removed). For continuity purposes, the condition $\lim_{t \nearrow t_0} z_x(t) = \lim_{t \searrow t_0} z_x(t)$ is imposed. With these limit conditions, the solutions to (5.1a) and (5.1b) are:

$$z_x = \frac{B x}{A + x^T \hat{1}} \Bigl\{1 - \exp\bigl[-(A + x^T \hat{1})\, t\bigr]\Bigr\} \qquad \text{for } t \in [0, t_0) \tag{5.2a}$$
$$z_x = \frac{B x}{A + x^T \hat{1}} \Bigl\{1 - \exp\bigl[-(A + x^T \hat{1})\, t_0\bigr]\Bigr\}\, e^{-A(t - t_0)} \qquad \text{for } t \in [t_0, \infty) \tag{5.2b}$$
Proof. 1. From (5.1a), for $t \in [0, t_0)$:

$$\frac{dz_{xi}}{dt} + \Bigl(A + \sum_{j=1}^N x_j\Bigr) z_{xi} = B x_i$$

First a solution of the homogeneous equation is found:

$$\frac{dz_{xi}}{dt} = -\Bigl(A + \sum_{j=1}^N x_j\Bigr) z_{xi} \quad \Rightarrow \quad z_{xi} = z_{xi0} \exp\Bigl[-\Bigl(A + \sum_{j=1}^N x_j\Bigr) t\Bigr]$$

where $z_{xi0}$ is the integration constant. Since $x_i = \text{const.}$ on $[0, t_0)$, the non-homogeneous equation also admits a constant particular solution:

$$z_{xi,\text{part.}} = \frac{B x_i}{A + \sum_{j=1}^N x_j}$$

This particular solution alone is not convenient: it would mean an instantaneous jump of $z_{xi}$ from 0 to its maximal value when $x$ is applied, contradicting the boundary condition $z_x(0) = \hat{0}$. The general solution is their sum:

$$z_{xi} = \frac{B x_i}{A + \sum_j x_j} + z_{xi0} \exp\Bigl[-\Bigl(A + \sum_j x_j\Bigr) t\Bigr], \qquad z_{xi0} = \text{const.}$$

and using the first boundary condition to determine $z_{xi0}$ gives (5.2a).

2. From equation (5.1b), by separating variables and integrating, for $t \in [t_0, \infty)$: $z_{xi} = z_{xi0}\, e^{-A(t - t_0)}$, $z_{xi0} = \text{const.}$ Then, from the continuity condition and (5.2a), $z_{xi0}$ is:

$$z_{xi0} = \frac{B x_i}{A + \sum_{j=1}^N x_j} \Bigl\{1 - \exp\Bigl[-\Bigl(A + \sum_{j=1}^N x_j\Bigr) t_0\Bigr]\Bigr\} \qquad \blacksquare$$

The output of a neuron from the input layer as a function of time is shown in figure 5.3. The maximum value attainable on the output is $z_{x\max} = \frac{B x}{A + x^T \hat{1}}$, for $t = t_0 \to \infty$.

Figure 5.3: The output of a neuron from the input layer as a function of time.

✍ Remarks:
➥ In practice, due to the exponential nature of the output, values close to $z_{xi\max}$ are obtained for relatively small $t$; in figure 5.3 about 98% of the maximum value is attained at $t = t_0 = 4$.
➥ Even if the input vector $x$ is big ($x_i \to \infty$, $i = \overline{1,N}$), the output is limited but proportional to the input:

$$z_{x\max} = \frac{B x}{A + x^T \hat{1}} \propto \frac{x}{x^T \hat{1}} = \Theta, \qquad \Theta_i = \frac{x_i}{\sum_{j=1}^N x_j}$$

$\Theta$ being named the reflectance pattern; it is "normalized" in the sense that its components sum to unity. (The one-line asymptotic form of this layer, as used in simulations, is sketched below.)
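In simulations the transient is skipped entirely and only the asymptotic value is used. As a one-line Scilab sketch (cpn_input_layer is our name):

// Sketch: asymptotic input-layer output, proportional to the
// reflectance pattern Theta = x / sum(x).
function zx = cpn_input_layer(x, A, B)
    zx = B * x / (A + sum(x));
endfunction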
5.1.2 The Hidden Layer

The Instar

The neurons of the hidden layer are called instars. The input vector is $z = \{z_i\}_{i=\overline{1,N+K}}$; here $z$ contains the outputs of both the $x$ and $y$ layers. Let $H$ be the dimension (number of neurons) of the hidden layer and $z_H = \{z_{Hk}\}_{k=\overline{1,H}}$ its output vector. (The fact that $H$ equals the number of classes is not a coincidence; later it will be shown that there has to be at least one hidden neuron to represent each class.) Let $\{w_{ki}\}_{k=\overline{1,H},\, i=\overline{1,N+K}}$ be the weight matrix (by which $z$ enters the hidden layer), such that the input to neuron $k$ is $W(k,:)\, z$.

The equations governing the behavior of a hidden neuron $k$ are defined in a similar way to those of the input layer ($z$ is applied from $t = 0$ to $t = t_0$):

$$\frac{dz_{Hk}}{dt} = -a z_{Hk} + b\, W(k,:)\, z \qquad \text{for } t \in [0, t_0) \tag{5.3a}$$
$$\frac{dz_{Hk}}{dt} = -a z_{Hk} \qquad \text{for } t \in [t_0, \infty) \tag{5.3b}$$

where $a, b \in \mathbb{R}_+$, $a, b = \text{const.}$, and the boundary conditions are $z_{Hk}(0) = 0$, $\lim_{t \to \infty} z_{Hk}(t) = 0$ and, for continuity purposes, $\lim_{t \nearrow t_0} z_{Hk}(t) = \lim_{t \searrow t_0} z_{Hk}(t)$.

The weight matrix is defined to change as well (the network is learning) according to the following equation:

$$\frac{dW(k,:)}{dt} = \begin{cases} c\, [z^T - W(k,:)] & \text{if } z \neq \hat{0} \\ \hat{0}^T & \text{if } z = \hat{0} \end{cases} \tag{5.4}$$

where $c \in \mathbb{R}_+$, $c = \text{const.}$, with boundary condition $W(k,:)(0) = \hat{0}^T$; the second case is introduced to avoid the forgetting process.

✍ Remarks:
➥ If, in the absence of the input vector ($z = \hat{0}$), the learning process would continue, then:

$$\frac{dW(k,:)}{dt} = -c\, W(k,:) \quad \Rightarrow \quad W(k,:) = C e^{-ct} \xrightarrow{t \to \infty} \hat{0}^T$$

($C$ being a constant row matrix), i.e. the network would gradually forget what it has learned.

Assuming that the weight matrix changes much more slowly than the neuron output, then $W(k,:)\, z \simeq \text{const.}$ and the solutions to (5.3a) and (5.3b) are:

$$z_{Hk} = \frac{b}{a}\, W(k,:)\, z\, (1 - e^{-at}) \qquad \text{for } t \in [0, t_0) \tag{5.5a}$$
$$z_{Hk} = \frac{b}{a}\, W(k,:)\, z\, (1 - e^{-a t_0})\, e^{-a(t - t_0)} \qquad \text{for } t \in [t_0, \infty) \tag{5.5b}$$

Proof. It is proven in a similar way as for the input layer; see section 5.1.1, proof of equations (5.2a) and (5.2b). ∎

The output of a hidden neuron is similar to the output of the input layer (see also figure 5.3), with the remark that the maximal possible value is $z_{Hk\max} = \frac{b}{a}\, W(k,:)\, z$ for $t, t_0 \to \infty$.

Assuming that an input vector $z$ is applied and kept sufficiently long, the solution to (5.4) is of the form:

$$W(k,:) = z^T (1 - e^{-ct})$$

i.e. $W(k,:)$ moves towards $z$: if $z$ is kept applied sufficiently long, then $W(k,:) \xrightarrow{t \to \infty} z^T$.

Proof. The differential equation (5.4) is very similar to previously encountered equations. It may also be proven by direct replacement. ∎

(In a digital simulation this learning rule takes the discrete form sketched below.)
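A minimal Scilab sketch of the discrete-time counterpart of (5.4) (instar_update is our name), moving the winning row a fraction $c$ towards the normalized input, exactly as used later in section 5.2.2:

// Sketch: discrete-time instar learning, the digital counterpart of (5.4).
function W = instar_update(W, k, z, c)
    W(k, :) = W(k, :) + c * (z' - W(k, :));
endfunction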
Let a set of input vectors $\{z_p\}_{p=\overline{1,P}}$ be applied as follows: $z_1$ between $t = 0$ and $t = t_1$, …, $z_P$ between $t = t_{P-1}$ and $t = t_P$. Then the learning procedure is:

➀ Initialize the weight matrix: $W(k,:) = \hat{0}^T$.
➁ Considering $t_0 \equiv 0$, calculate the successive values of the weights:

$$W(k,:)(t_1) = z_1^T \bigl(1 - e^{-c t_1}\bigr)$$
$$W(k,:)(t_2) = z_2^T \bigl(1 - e^{-c(t_2 - t_1)}\bigr) + W(k,:)(t_1)\, e^{-c(t_2 - t_1)}$$
$$\vdots$$
$$W(k,:)(t_P) = z_P^T \bigl(1 - e^{-c(t_P - t_{P-1})}\bigr) + W(k,:)(t_{P-1})\, e^{-c(t_P - t_{P-1})}$$

The final weight vector $W(k,:)$ represents a linear combination of all input vectors $\{z_p\}_{p=\overline{1,P}}$. Because the coefficients $\bigl(1 - e^{-c(t_p - t_{p-1})}\bigr)$ are all positive, the final direction of $W(k,:)$ points towards an "average" of the directions of $\{z_p\}_{p=\overline{1,P}}$, and this is exactly the purpose for which (5.4) was defined. See figure 5.4.

Figure 5.4: The directions taken by the weight vectors, relative to the inputs, in the hidden layer.

✍ Remarks:
➥ It is also possible to give each $z_p$ a time slice $dt$ and, when finished with $z_P$, to start over with $z_1$, till some (external) stop conditions are met.

The trained instar is able to classify the direction of input vectors (see (5.5a)):

$$z_{Hk} \propto W(k,:)\, z = \|W(k,:)\|\, \|z\| \cos(\widehat{W(k,:), z}) \propto \cos(\widehat{W(k,:), z})$$

The Competitive Network

The hidden layer of the CPN is made out of instars interconnected such that each inhibits all others and eventually there is only one "winner" (the instars compete against each other). The purpose of the hidden layer is to classify the normalized input vector $z$ (which is proportional to the input). Its output is of the one-of-k encoding type, i.e. all neurons have output zero except one neuron $k$. Note that it is assumed that classes do not overlap, i.e. an input vector may belong to one class only.

Figure 5.5: The competitive (hidden) layer.

The property of instars that their associated weight vector moves towards an average of the inputs has to be preserved, but a feedback function is to be added, to ensure the required output. Let $f = f(z_{Hk})$ be the feedback function of the instars, i.e. the value added to the neuron input. See figure 5.5. Then the instar equations (5.3a) and (5.3b) are redefined as:

$$\frac{dz_{Hk}}{dt} = -a z_{Hk} + b\, [f(z_{Hk}) + W(k,:)\, z] - z_{Hk} \sum_{\ell=1}^H [f(z_{H\ell}) + W(\ell,:)\, z] \qquad \text{for } t \in [0, t_0) \tag{5.6a}$$
$$\frac{dz_{Hk}}{dt} = -a z_{Hk} \qquad \text{for } t \in [t_0, \infty) \tag{5.6b}$$

where $a, b \in \mathbb{R}_+$, $a, b = \text{const.}$; the expression $f(z_{Hk}) + W(k,:)\, z$ represents the total input of hidden neuron $k$. In matrix notation, writing $v \equiv f(z_H) + W z$:

$$\frac{dz_H}{dt} = -a z_H + b\, v - (\hat{1}^T v)\, z_H \quad \text{for } t \in [0, t_0), \qquad \frac{dz_H}{dt} = -a z_H \quad \text{for } t \in [t_0, \infty)$$

The feedback function $f$ has to be selected such that the hidden layer performs as a competitive layer, i.e. at equilibrium all neurons will have output zero except one, the "winner", which will have output 1. This type of behaviour is also known as "winner-takes-all".

For a feedback function of type $f(z) = z^r$, where $r > 1$, equations (5.6) define a competitive layer.

Proof. First the following change of variables is performed:

$$\widetilde{z}_{Hk} \equiv \frac{z_{Hk}}{z_{H,\text{tot}}}, \qquad z_{H,\text{tot}} \equiv \sum_{\ell=1}^H z_{H\ell} \tag{5.7}$$

so that $z_{Hk} = \widetilde{z}_{Hk}\, z_{H,\text{tot}}$ and $\frac{dz_{Hk}}{dt} = \frac{d\widetilde{z}_{Hk}}{dt}\, z_{H,\text{tot}} + \widetilde{z}_{Hk}\, \frac{dz_{H,\text{tot}}}{dt}$.

By summing (5.6a) over $k$:

$$\frac{dz_{H,\text{tot}}}{dt} = -a z_{H,\text{tot}} + (b - z_{H,\text{tot}}) \sum_{\ell=1}^H [f(\widetilde{z}_{H\ell}\, z_{H,\text{tot}}) + W(\ell,:)\, z] \tag{5.8}$$

and substituting $z_{Hk} = \widetilde{z}_{Hk}\, z_{H,\text{tot}}$ in (5.6a):

$$z_{H,\text{tot}} \frac{d\widetilde{z}_{Hk}}{dt} + \widetilde{z}_{Hk} \frac{dz_{H,\text{tot}}}{dt} = -a \widetilde{z}_{Hk} z_{H,\text{tot}} + b\,[f(\widetilde{z}_{Hk} z_{H,\text{tot}}) + W(k,:)\, z] - \widetilde{z}_{Hk} z_{H,\text{tot}} \sum_{\ell=1}^H [f(\widetilde{z}_{H\ell} z_{H,\text{tot}}) + W(\ell,:)\, z] \tag{5.9}$$

From (5.8) and (5.9):

$$z_{H,\text{tot}} \frac{d\widetilde{z}_{Hk}}{dt} = b\,[f(\widetilde{z}_{Hk} z_{H,\text{tot}}) + W(k,:)\, z] - b\, \widetilde{z}_{Hk} \sum_{\ell=1}^H [f(\widetilde{z}_{H\ell} z_{H,\text{tot}}) + W(\ell,:)\, z]$$

The following cases may be discussed with regard to the feedback function:

• The identity function $f(\widetilde{z}_{Hk} z_{H,\text{tot}}) = \widetilde{z}_{Hk} z_{H,\text{tot}}$. Then, by using $\sum_{\ell=1}^H \widetilde{z}_{H\ell} = 1$, the above equation becomes:

$$z_{H,\text{tot}} \frac{d\widetilde{z}_{Hk}}{dt} = b\, W(k,:)\, z - b\, \widetilde{z}_{Hk} \sum_{\ell=1}^H W(\ell,:)\, z \tag{5.10}$$

The stable value, obtained by setting $\frac{d\widetilde{z}_{Hk}}{dt} = 0$, is:

$$\widetilde{z}_{Hk} = \frac{W(k,:)\, z}{\sum_{\ell=1}^H W(\ell,:)\, z}$$

i.e. the layer merely normalizes its inputs; no competition occurs.

• The square function $f(\widetilde{z}_{Hk} z_{H,\text{tot}}) = (\widetilde{z}_{Hk} z_{H,\text{tot}})^2$. Again, as $\sum_{\ell=1}^H \widetilde{z}_{H\ell} = 1$, the general equation above can be rewritten in the form:

$$z_{H,\text{tot}} \frac{d\widetilde{z}_{Hk}}{dt} = b \sum_{\ell=1}^H \bigl[\widetilde{z}_{H\ell}\, f(\widetilde{z}_{Hk} z_{H,\text{tot}}) - \widetilde{z}_{Hk}\, f(\widetilde{z}_{H\ell} z_{H,\text{tot}})\bigr] + b\, W(k,:)\, z - b\, \widetilde{z}_{Hk} \sum_{\ell=1}^H W(\ell,:)\, z$$

Then, considering the expression of $f$, the term $\widetilde{z}_{H\ell} f(\widetilde{z}_{Hk} z_{H,\text{tot}}) - \widetilde{z}_{Hk} f(\widetilde{z}_{H\ell} z_{H,\text{tot}})$ reduces to $z_{H,\text{tot}}^2\, \widetilde{z}_{Hk} \widetilde{z}_{H\ell} (\widetilde{z}_{Hk} - \widetilde{z}_{H\ell})$, which for $\widetilde{z}_{Hk} > \widetilde{z}_{H\ell}$ represents an amplification, while for $\widetilde{z}_{Hk} < \widetilde{z}_{H\ell}$ it represents an inhibition. The term $b\, W(k,:)\, z$ is a constant with respect to $\widetilde{z}_{Hk}$ and the term $-b\, \widetilde{z}_{Hk} \sum_\ell W(\ell,:)\, z$ is an inhibitory one. So the differential equations describe the behaviour of a network where all neurons inhibit all others with smaller output than theirs and are inhibited by the neurons which have greater output: the gap between neurons with high output and those with low output gets amplified.
In this case the layer acts like a "winner-takes-all" network: eventually only one neuron, the one with the largest $\widetilde{z}_{Hk}$, will have a non-zero output. The same discussion holds for $f(\widetilde{z}_{Hk} z_{H,\text{tot}}) = (\widetilde{z}_{Hk} z_{H,\text{tot}})^r$ where $r > 1$. ∎

And finally, only the winning neuron, let $k$ be that one, has to be affected by the learning process: this neuron will represent the class $C_k$ to which the input vector belongs, and its associated weight vector $W(k,:)$ has to be moved towards the average "representative" $\{\langle x \rangle_k, \langle y \rangle_k\}$ (combined so as to form one vector); all other neurons should remain untouched (weights unchanged). Two points to note here:

• It becomes obvious that there should be at least one hidden neuron for each class.
• Several neurons may represent the same class (but there will be only one winner at a time). This is particularly necessary if classes are represented by disconnected domains in $\mathbb{R}^{N+K}$ (because, for a single representative neuron, the moving weight vector, heading towards the average input of the class, may become stuck somewhere in the middle and represent another class), or for cases with deep intricacy between the various classes. See also figure 5.6.
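The amplification effect of the square feedback is easy to check numerically: repeatedly squaring and renormalizing a vector of relative activities drives the largest component to 1 and all others to 0. A quick Scilab illustration:

// Winner-takes-all via square feedback: square, then renormalize.
z = [0.30; 0.32; 0.38];        // initial relative activities (sum to 1)
for it = 1:20
    z = z .^ 2;
    z = z / sum(z);
end
disp(z')                        // -> approximately (0 0 1): neuron 3 wins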
5.1.3 The Output Layer

The neurons of the output level are called outstars. The output level contains 2 layers: $x'$ of dimension $N$ and $y'$ of dimension $K$, the same as for the input layers. For both, the discussion goes the same way. The weight matrix describing the connection strengths between the hidden and output layers is $W'$. The purpose of the output level is to retrieve a pair $\{\langle x \rangle_k, \langle y \rangle_k\}$, where $\langle x \rangle_k$ is closest to the input $x$, from an "average" over the previously learned training vectors of the same class. The $x'$ layer is discussed below; the $y'$ part is, mutatis mutandis, identical.

The differential equations governing the behavior of the outstars are defined as:

$$\frac{dx'}{dt} = -A' x' + B' x + C' W'(1\!:\!N, :)\, z_H \qquad \text{for } t \in [0, t_0) \tag{5.11a}$$
$$\frac{dx'}{dt} = -A' x' + C' W'(1\!:\!N, :)\, z_H \qquad \text{for } t \in [t_0, \infty) \tag{5.11b}$$

where $A', B', C' \in \mathbb{R}_+$, $A', B', C' = \text{const.}$, with boundary condition $x'(0) = \hat{0}$.

The weight matrix is changing (the network is learning) by construction, according to the following equation:

$$\frac{dW'(1\!:\!N, k)}{dt} = \begin{cases} -D' W'(1\!:\!N, k) + E' x & \text{if } z_{Hk} \neq 0 \\ \hat{0} & \text{if } z_{Hk} = 0 \end{cases} \tag{5.12}$$

where $D', E' \in \mathbb{R}_+$ are constants, with boundary condition $W'(1\!:\!N, k)(0) = \hat{0}$ (the part for $z_{Hk} = 0$ being defined in order to avoid weight "decaying"). Note that for a particular input vector there is just one winner on the hidden layer, and thus just one column of the matrix $W'$ gets changed; all others remain the same (i.e. just the connections coming from the hidden winner to the output layer get updated).

It is assumed that the weight matrix changes much more slowly than the neuron output. Considering that the hidden layer is also much faster than the output (only the asymptotic behaviour is of interest here), i.e. $W'(i,:)\, z_H \simeq \text{const.}$, the solution to (5.11a) is:

$$x' = \Bigl[\frac{B'}{A'}\, x + \frac{C'}{A'}\, W'(1\!:\!N, :)\, z_H\Bigr] (1 - e^{-A' t})$$

Proof. For each $x'_i$: $B' x_i + C' W'(i,:)\, z_H = \text{const.}$, and then the solution is built the same way as for the previous differential equations, by starting to search for the solution of the homogeneous equation; see also the proof of equations (5.2). The boundary condition is used to find the integration constant. ∎

The solution to the weight update formula (5.12) is:

$$W'(1\!:\!N, k) = \frac{E'}{D'}\, x\, (1 - e^{-D' t})$$

Proof. The same way as for $x'$ above, for each $w'_{ik}$ separately. ∎

The asymptotic behaviours of $x'$ and $W'(1\!:\!N, k)$ for $t \to \infty$ are (from the solutions found):

$$\lim_{t \to \infty} x' = \frac{B'}{A'}\, x + \frac{C'}{A'}\, W'(1\!:\!N, :)\, z_H \qquad \text{and} \qquad \lim_{t \to \infty} W'(1\!:\!N, k) = \frac{E'}{D'}\, x \tag{5.13}$$

After a training with a set of $\{x_p, y_p\}$ vectors from the same class $k$, the weights will be $W'(1\!:\!N, k) \propto \langle x \rangle_k$, respectively $W'(N\!+\!1\!:\!N\!+\!K, k) \propto \langle y \rangle_k$. At run time, $x \neq \hat{0}$ and $y = \hat{0}$ to retrieve a $y'$ (or vice-versa to retrieve an $x'$). But (similarly to $x'$):

$$\lim_{t \to \infty} y' = \frac{B'}{A'}\, y + \frac{C'}{A'}\, W'(N\!+\!1\!:\!N\!+\!K, :)\, z_H$$

and, as $z_H$ represents a one-of-k encoding, $W'(N\!+\!1\!:\!N\!+\!K, :)\, z_H$ selects the column $k$ of $W'$ (the one corresponding to the winner), and as $y = \hat{0}$ then:

$$y' \propto W'(N\!+\!1\!:\!N\!+\!K, k) \propto \langle y \rangle_k$$

➧ 5.2 CPN Dynamics

✍ Remarks:
➥ In simulations on digital systems, the normalization of vectors and the decision over the winner in the hidden layer may be done by separate processes, thus simplifying and speeding up the network running. (See [FS92] pp. 235–239.)
➥ The process uses the asymptotic (equilibrium) values, to avoid the actual solving of the differential equations.
➥ The vector norm may be written as $\|x\| = \sqrt{x^T x}$.

5.2.1 Network Running

To generate the corresponding $y$ vector for the input $x$:

➀ The input layer normalizes the input vector and distributes the result to the hidden layer. For the $y$ part a null vector $\hat{0}$ is applied:

$$z(1\!:\!N) = \frac{x}{\sqrt{x^T x}} \quad (\text{as } \|y\| = 0), \qquad z(N\!+\!1\!:\!N\!+\!K) = \hat{0}$$

i.e. $z$ is the combination of the vectors $\frac{x}{\|x\|}$ and $\hat{0}$ to form a single vector.

➁ The hidden layer is of "winner-takes-all" type. To avoid a differential equation calculation, the weight vectors $W(\ell,:)$ are normalized. This way, the closest one to the normalized input $z$ will be found by doing a simple scalar product: $W(\ell,:)\, z \propto \cos(\widehat{W(\ell,:), z})$. The "raw" outputs of the hidden neurons are calculated first: $z_{H\ell} \propto W(\ell,:)\, z$; then the neuron with the largest output is declared winner and gets output 1, all others get output 0. Let $k$ be the winning neuron, i.e. $W(k,:)\, z = \max_\ell W(\ell,:)\, z$. Then initialize $z_H$ to $\hat{0}$ and afterwards make $z_{Hk} = 1$, so that all outstars receive an input vector of the form:

$$z_H^T = (0\ \ldots\ 1\ \ldots\ 0) \tag{5.14}$$

where all $z_{H\ell}$ are zero, except $z_{Hk}$. Note that, as $y = \hat{0}$, everything happens in the subspace $\mathbb{R}^N$ of $\mathbb{R}^{N+K}$; the scalar product may be replaced with the scalar product between $\frac{x}{\|x\|}$ and the projection of $W(\ell,:)$, i.e. $W(\ell, 1\!:\!N)$.

➂ From (5.13), making $y = \hat{0}$, $C' = A'$ and $E' = D'$, the output of the $y'$ layer is $y' = W'(N\!+\!1\!:\!N\!+\!K, k)$, i.e. the winner of the hidden layer selects which column of $W'$ will be chosen as the output ($W'(1\!:\!N, k)$ represents $\langle x \rangle_k$ while $W'(:, k)$ represents the joining of $\langle x \rangle_k$ and $\langle y \rangle_k$), as the multiplication between $W'$ and a vector of type (5.14) selects column $k$ of $W'$.

To generate the corresponding $x$ from $y$, i.e. working in reverse, the procedure is the same (swapping $x \leftrightarrow y$).

5.2.2 Network Learning

➀ An input vector, randomly selected, is applied to the input layer.
➁ The input layer normalizes the input vector and distributes it to the hidden layer:

$$z(1\!:\!N) = \frac{x}{\sqrt{x^T x + y^T y}}, \qquad z(N\!+\!1\!:\!N\!+\!K) = \frac{y}{\sqrt{x^T x + y^T y}}$$

i.e. $z$ is the normalized combination of the vectors $x$ and $y$ to form a single vector.
➂ The training of the hidden layer is done first.
The weights are initialized with randomly selected normalized input vectors. This ensures both the normalization of the weight vectors $W(\ell,:)$ and a good spread of them.

The hidden layer is of "winner-takes-all" type. To avoid a differential equation calculation, and as the $W(\ell,:)$ are normalized, the closest $W(\ell,:)$ to $z$ is found by using the scalar product $W(\ell,:)\, z \propto \cos(\widehat{W(\ell,:), z})$. The "raw" outputs of the hidden neurons are calculated first as scalar products, $z_{H\ell} \propto W(\ell,:)\, z$; then the neuron with the largest output, i.e. the $k$ for which $W(k,:)\, z = \max_\ell W(\ell,:)\, z$, is declared winner and gets output 1, all others output 0:

$$z_H = \hat{0} \quad \text{and afterwards} \quad z_{Hk} = 1$$

so that all outstars receive an input vector of the form $z_H^T = (0\ \ldots\ 1\ \ldots\ 0)$, where all $z_{H\ell}$ are zero with one exception, $z_{Hk}$.

The weights of the winning neuron are updated according to equation (5.4). In discrete time approximation, $dt \to \Delta t = 1$ and $dW(k,:) \to \Delta W(k,:) = W(k,:)(t+1) - W(k,:)(t)$, such that:

$$W(k,:)(t+1) = W(k,:)(t) + c\, [z^T - W(k,:)(t)]$$

The above updating is repeated for all input vectors. The training is repeated until the input vectors are recognized correctly, e.g. till the angle between the input vector and the winner's weight vector falls below some specified maximum error: $1 - \cos(\widehat{W(k,:), z}) < \varepsilon$.

➃ The network may be tested with some input vectors not used before. If the classification is good (the error is under the maximal one) then the training of the hidden layer is done; else the training continues.

After the training of the hidden layer is finished, the training of the output layer begins. An input vector is applied; the input layer normalizes it and distributes it to the (already trained) hidden layer. On the hidden layer only one neuron wins and has non-zero output, let $k$ be that one, such that the output vector becomes $z_H^T = (0\ \ldots\ 1\ \ldots\ 0)$, with 1 on the $k$-th position.

The weights of the winning neuron are updated according to equation (5.12). In discrete time approximation, $dt \to \Delta t = 1$ and $dW'(:,k) \to \Delta W'(:,k) = W'(:,k)(t+1) - W'(:,k)(t)$, such that:

$$W'(1\!:\!N, k)(t+1) = W'(1\!:\!N, k)(t) + E'\, [x - W'(1\!:\!N, k)(t)] \tag{5.15}$$
$$W'(N\!+\!1\!:\!N\!+\!K, k)(t+1) = W'(N\!+\!1\!:\!N\!+\!K, k)(t) + E'\, [y - W'(N\!+\!1\!:\!N\!+\!K, k)(t)]$$

The above updating is repeated for all input vectors. The training is repeated until the input vectors are recognized correctly, e.g. till the error is less than some specified maximum: $|w'_{ik} - x_i| < \varepsilon$ for $i = \overline{1,N}$, and similarly for $y$.

✍ Remarks:
➥ From the learning procedure it becomes clear that the CPN is in fact a system composed of several semi-independent subnetworks: the input level, which normalizes the input; the hidden layer, of "winner-takes-all" type; and the output level, which generates the actual required output. Each level is independent, and the training of the next layer starts only after the learning in the preceding layer has been done.
➥ Usually the CPN is used to classify an input vector $x$ as belonging to a class represented by $\langle x \rangle_k$. A set of input vectors $\{x_p\}$ is used to train the network such that it will output $\{\langle x \rangle_k, \langle y \rangle_k\}$ if $x \in C_k$ (see also figure 5.6).
➥ An unfortunate choice of the weight vectors $W(\ell,:)$ for the hidden layer may lead to the "stuck-vector" situation, when one neuron from the hidden layer may never win.
See figure 5.6: the vector $W(2,:)$ will move towards $x_{1,2,3,4}$ during learning and will become representative for both classes 1 and 2; the corresponding hidden neuron 2 will be a winner for both classes.

Figure 5.6: The "stuck vector" situation.

The "stuck vector" situation can be avoided by two means. One is to initialize each weight with a vector belonging to the class for which the corresponding hidden neuron is supposed to win, the most representative one if possible, e.g. obtained by averaging over the training set. The other is to attach to the neuron an "overloading device": if the neuron wins too often, e.g. during training it wins more than the number of training vectors of the class it is supposed to represent, then it shuts down, allowing other neurons to win and the corresponding weights to be changed.

➥ The hidden layer should have at least as many neurons as the number of classes to be recognized: at least one neuron is needed to win the "competition" for the class it represents. If classes form disconnected domains in the input space $\mathbb{R}^{N+K}$ then at least one neuron is necessary for each connected domain; otherwise the "stuck vector" situation is likely to appear.
➥ For the output layer the critical point is to select an adequate learning constant $E'$: the learning constant can be chosen small, $0 \lesssim E' \ll 1$, at the beginning, and increased later, when $|x_i - w'_{ik}(t)|$ decreases; see equation (5.15).
➥ Obviously, the hidden layer may be replaced by any other system able to perform a one-of-k encoding.

➧ 5.3 The Algorithm

The running procedure

1. The $x$ vector is assumed to be known and the corresponding $\langle y \rangle_k$ is to be retrieved. (For the reverse situation, when $y$ is known and $\langle x \rangle_k$ is to be retrieved, the algorithm is similar, swapping $x \leftrightarrow y$.) Make the $y$ part null ($\hat{0}$) at input and compute the normalized input vector $z$:

$$z(1\!:\!N) = \frac{x}{\sqrt{x^T x}} \qquad \text{and} \qquad z(N\!+\!1\!:\!N\!+\!K) = \hat{0}$$

2. Find the winning neuron on the hidden layer, i.e. the $k$ for which $W(k,:)\, z = \max_\ell W(\ell,:)\, z$.
3. Find the $y'$ vector in the $W'$ matrix: $y' = W'(N\!+\!1\!:\!N\!+\!K, k)$.

The learning procedure

1. Let $\{x_p, y_p\}$ be the training set of input vector pairs. Let $N$ be the dimension of the "$x$" part and $K$ the dimension of the "$y$" part. Then $N + K$ is the number of neurons on the input layer.
2. For all $\{x_p, y_p\}$ training pairs, normalize the input vectors, i.e. compute $z_p$:

$$z_p(1\!:\!N) = \frac{x_p}{\sqrt{x_p^T x_p + y_p^T y_p}}, \qquad z_p(N\!+\!1\!:\!N\!+\!K) = \frac{y_p}{\sqrt{x_p^T x_p + y_p^T y_p}}$$

Note that the input layer does just a normalization of the input vectors. No further training is required.
3. Initialize the weights on the hidden layer. For each neuron $\ell$ ($\ell = \overline{1,H}$) in the hidden layer, select a representative input vector $z_\ell$ for class $\ell$ and then set $W(\ell,:) = z_\ell^T$ (this way the weight vectors become automatically normalized). Note that in the extreme case there may be just one vector available for training for each class; in this case that vector becomes the "representative".
4. Train the hidden layer. For all normalized training vectors $z_p$, find the winning neuron on the hidden layer, i.e. the $k$ for which $W(k,:)\, z_p = \max_\ell W(\ell,:)\, z_p$. Update the winner's weights:

$$W_{\text{new}}(k,:) = W_{\text{old}}(k,:) + c\, [z_p^T - W_{\text{old}}(k,:)]$$

The training of the hidden layer has to be finished before moving forward to the output layer.
5. Initialize the weights on the output layer. As for the hidden layer, select a representative input vector pair $\{x_\ell, y_\ell\}$ for each class $C_\ell$:
$$W'(1\!:\!N, \ell) = x_\ell, \qquad W'(N\!+\!1\!:\!N\!+\!K, \ell) = y_\ell$$

Another possibility would be to make an average over several $\{x_p, y_p\}$ belonging to the same class.
6. Train the output layer. For all training vectors $z_p$, find the winning neuron on the hidden layer, i.e. the $k$ for which $W(k,:)\, z_p = \max_\ell W(\ell,:)\, z_p$. Update the winner's output weights:

$$W'_{\text{new}}(1\!:\!N, k) = W'_{\text{old}}(1\!:\!N, k) + E'\, [x_p - W'_{\text{old}}(1\!:\!N, k)]$$
$$W'_{\text{new}}(N\!+\!1\!:\!N\!+\!K, k) = W'_{\text{old}}(N\!+\!1\!:\!N\!+\!K, k) + E'\, [y_p - W'_{\text{old}}(N\!+\!1\!:\!N\!+\!K, k)]$$

(A compact code sketch of these procedures is given below.)
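A compact Scilab sketch of the above, under a few simplifying assumptions: the function names are ours; the training pairs are stored columnwise in X ($N \times P$) and Y ($K \times P$); rep(h) gives the index of the representative pair of hidden neuron h; and, for brevity, the hidden and output updates are interleaved here, whereas the text trains the hidden layer to completion first:

// Sketch: CPN training and running (section 5.3).
function z = cpn_normalize(x, y)
    v = [x; y];
    z = v / norm(v);
endfunction

function [W, Wo] = cpn_train(X, Y, rep, c, Ep, epochs)
    [N, P] = size(X); K = size(Y, 1); H = length(rep);
    W = zeros(H, N + K);                   // hidden (instar) weights
    Wo = zeros(N + K, H);                  // output (outstar) weights
    for h = 1:H                            // steps 3 and 5: initialization
        W(h, :) = cpn_normalize(X(:, rep(h)), Y(:, rep(h)))';
        Wo(:, h) = [X(:, rep(h)); Y(:, rep(h))];
    end
    for e = 1:epochs                       // steps 4 and 6: training
        for p = 1:P
            z = cpn_normalize(X(:, p), Y(:, p));
            [m, k] = max(W * z);           // winner on the hidden layer
            W(k, :) = W(k, :) + c * (z' - W(k, :));
            Wo(:, k) = Wo(:, k) + Ep * ([X(:, p); Y(:, p)] - Wo(:, k));
        end
    end
endfunction

function yr = cpn_run(W, Wo, x, N, K)
    z = [x / norm(x); zeros(K, 1)];        // the y part is null at input
    [m, k] = max(W * z);
    yr = Wo(N+1:N+K, k);                   // the retrieved <y>_k
endfunction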
✍ 2/3 rule Remarks: The input is considered to be active when the input vector is non-zero i.e. it have at least one non-zero component. ➥ The F2 layer is considered active when its output vector is non-zero i.e. it have at least one non-zero component. The propagation of signals trough the network is done as follows: ➀ The input vector is distributed to F1 layer, gain control unit and reset unit. Each component of input vector is distributed to a di erent F1 neuron | F1 have the same dimension N as input x. 6.1 ART1 F 1 , F2 ❖ ➥ See [FS92] pp. 293{298. 85 ❖ x, N 86 CHAPTER 6. ADAPTIVE RESONANCE THEORY (ART) Input vector layer F1 Gain Control 111111 111111 000000 000000 111111 111111 000000 000000 111111 111111 000000 000000 to111111 all neurons 111111 000000 000000 111111 111111 000000 000000 111111 111111 000000 000000 111111 000000 000000 fully interconected111111 111111 000000 000000 Reset layers, both ways 111111 111111 111111 000000 000000 111111 111111 000000 000000 111111 111111 000000 000000 111111 000000 000000 to111111 all neurons to all neurons 111111 111111 000000 000000 000000 111111 000000 111111 000000 111111 000000 111111 layer F2 with lateral feedback Figure 6.1: ➁ ➂ ➃ ➄ ➅ ➆ The ART1 network architecture. Inhibitory input is marked with symbol. The output of F1 is sent as inhibitory signal to the reset unit. The design of the network is such that the inhibitory signal from F1 cancels the input vector and the reset unit remains inactive. The gain control unit send a nonspeci c excitatory signal to F1 layer (an identical signal to all neurons from F1 ). F2 receives the output of F1 (all neurons between F1 and F2 are fully interconnected). The F2 layer is of contrast enhancement type: only one neuron should trigger for a given pattern (or, in a more generalized case only few neurons should \ re" | have a nonzero output). The output of F2 is sent back as excitatory signal to F1 and as inhibitory signal to the gain control unit. The design of the network is such that if the gain control unit receives an inhibitory signal from F2 it ceases activity. Then F1 receives signals from F2 and input (the gain control unit have been deactivated). The output of the F1 layer changes such that it isn't anymore identical to the rst one, because the overall input had changed: the gain control unit ceases activity and | instead | the F2 sends its output to F1 . Also there is the 2/3 rule which have to be taken into account: only those F1 neurons who receive input from both input and F2 will trigger. Because the output of F1 had changed, the reset unit becomes active. The reset unit send a reset signal to the active neuron(s) from the F2 layer which forces 6.2. ART1 DYNAMICS ➇ 87 it (them) to become inactive for a long period of time, i.e. they do not participate into the next network pattern matching cycle(s). The inactive neurons are not a ected. The output of F2 disappears due to the reset action and the whole cycle is repeated until a match is found i.e. the output of F2 causes F1 to output a pattern which will not trigger the reset unit, because is identical to the rst one, or | no match was found, the output of F2 is zero | a learning process begins in F2 . The action of the reset unit (see previous step) ensures that a neuron already used in the \past" will not be used again for pattern matching. ✍ Remarks: ➥ ➧ 6.2 In complex systems an ART network may be just a link into a bigger chain. The F2 layer may receive signals from some other networks/layers. 
This will make $F_2$ send a signal to $F_1$, and consequently $F_1$ may receive a signal from $F_2$ before receiving the input signal. A premature signal from the $F_2$ layer usually means an expectation. If the gain control system, and consequently the 2/3 rule, were not in place, then the expectation from $F_2$ would trigger an action in the absence of the input signal. With the gain unit present, $F_2$ can't trigger a process by itself, but it can precondition the $F_1$ layer such that, when the input arrives, the process of pattern matching starts at a position closer to the final state and takes less time.

➧ 6.2 ART1 Dynamics

The equation describing the activation (total input) $a_j$ of a neuron $j$ from either of the 2 layers is of the form (see [FS92] pp. 298–310):

$$\frac{da_j}{dt} = -a_j + (1 - A a_j) \cdot (\text{excitatory input}) - (B + C a_j) \cdot (\text{inhibitory input}) \tag{6.1}$$

where $A$, $B$, $C$ are positive constants.

✍ Remarks:
➥ These equations do not describe the actual output of the neurons, which is obtained from the activation by applying a "digitizing" function that transforms the activation into a binary vector.

6.2.1 The $F_1$ Layer

A neuron on the $F_1$ layer receives the input $x$, the input from $F_2$ and the input from the gain control unit as excitatory inputs. The inhibitory input is set to 1. See (6.1).

Let $a'_k$ be the activation of neuron $k$ from the $F_2$ layer, $f_2(a'_k)$ its output ($f_2$ being the activation function on layer $F_2$, and $f_2(a') = y$), and $w_{jk}$ the weight entering neuron $j$ on layer $F_1$. Then the total input received by the $F_1$ neuron from $F_2$ is $W(j,:)\, f_2(a')$. The $F_2$ layer is of competitive ("winner-takes-all") type: there is only one winning neuron, which has a non-zero output, all others having null output. The output activation function for $F_2$ neurons is a binary function (see also the $F_2$ section, below):

$$f_2(a'_k) = \begin{cases} 1 & \text{if the winner is } k \\ 0 & \text{otherwise} \end{cases}$$

The gain control unit is set such that, if the input vector $x \neq \hat{0}$ and the vector from $F_2$ is $f_2(a') = \hat{0}$, then its output is 1, otherwise it is 0:

$$g = \begin{cases} 1 & \text{if } x \neq \hat{0} \text{ and } f_2(a') = \hat{0} \\ 0 & \text{otherwise} \end{cases}$$

Finally, the dynamic equation for a neuron $j$ from the $F_1$ layer becomes (from (6.1)):

$$\frac{da_j}{dt} = -a_j + (1 - A_1 a_j)[x_j + D_1 W(j,:)\, f_2(a') + B_1 g] - (B_1 + C_1 a_j) \tag{6.2}$$

where the constants $A$, $B$, $C$ and $D$ have been given the subscript 1 to denote that they are for the $F_1$ layer. Obviously, here $D_1$ controls the amplitude of the $W$ weights; it should be chosen such that all weights are $w_{j\ell} \in [0, 1]$.

The following cases may be considered:

➀ The input is inactive ($x = \hat{0}$) and $F_2$ is inactive ($f_2(a') = \hat{0}$). Then $g = 0$ and (6.2) becomes:

$$\frac{da_j}{dt} = -a_j - (B_1 + C_1 a_j)$$

At equilibrium $\frac{da_j}{dt} = 0$ and $a_j = -\frac{B_1}{1 + C_1}$, i.e. inactive $F_1$ neurons have negative activation.

➁ The input is active ($x \neq \hat{0}$) but $F_2$ is still inactive ($f_2(a') = \hat{0}$); there was no time for the signal to travel from $F_1$ to $F_2$ and back (and to deactivate the gain control unit on the way back). The gain control unit is activated, $g = 1$, and (6.2) becomes:

$$\frac{da_j}{dt} = -a_j + (1 - A_1 a_j)(x_j + B_1) - (B_1 + C_1 a_j)$$

At equilibrium $\frac{da_j}{dt} = 0$ and

$$a_j = \frac{x_j}{1 + A_1 (x_j + B_1) + C_1} \tag{6.3}$$

i.e. neurons which received non-zero input ($x_j \neq 0$) have a positive activation ($a_j > 0$), and neurons which received a zero input have their activation raised to zero.

➂ The input is active ($x \neq \hat{0}$) and $F_2$ is also active ($f_2(a') \neq \hat{0}$).
➂ Input active (x ≠ 0̂) and F2 also active (f_2(a') ≠ 0̂). Then the gain control unit is deactivated (g = 0) and (6.2) becomes:

  da_j/dt = -a_j + (1 - A_1 a_j)[x_j + D_1 W(j,:) f_2(a')] - (B_1 + C_1 a_j)

At equilibrium da_j/dt = 0 and

  a_j = [x_j + D_1 W(j,:) f_2(a') - B_1] / [1 + A_1 (x_j + D_1 W(j,:) f_2(a')) + C_1]    (6.4)

The following cases may be discussed here:

(a) Input maximum (x_j = 1) and input from F2 minimum (a' → 0̂). Because the gain control unit has been deactivated and the activity of the F2 layer is dropping to 0̂, then, according to the 2/3 rule, the neuron has to switch to the inactive state and consequently a_j has to switch to a negative value. From (6.4):

  lim_{a'→0̂} a_j = lim_{a'→0̂} [x_j + D_1 W(j,:) f_2(a') - B_1] / [1 + A_1 (x_j + D_1 W(j,:) f_2(a')) + C_1] < 0  ⇒  B_1 > 1    (6.5)

(as all constants A_1, B_1, C_1, D_1 are positive definite).

(b) Input maximum (x_j = 1) and input from F2 non-zero. The F2 layer is of contrast-enhancement type ("winner-takes-all") and it has at most one winner; let k be that one, i.e. W(j,:) f_2(a') = w_jk f_2(a'_k). Then, according to the 2/3 rule, the neuron is active, the activation value should be a_j > 0, and (6.4) becomes:

  1 + D_1 w_jk f_2(a'_k) - B_1 > 0  ⇒  w_jk f_2(a'_k) > (B_1 - 1)/D_1    (6.6)

There is a discontinuity between this condition and the preceding one: from (6.6), if w_jk f_2(a'_k) → 0 then B_1 - 1 < 0, which seems to contradict the previous condition (6.5). Consequently this condition will be imposed on the W weights and not on the constants B_1 and D_1.

(c) Input maximum (x_j = 1) and input from F2 maximum, i.e. w_jk f_2(a'_k) = 1 (see above, k is the F2 winning neuron). Then (6.4) gives:

  1 + D_1 - B_1 > 0  ⇒  B_1 < D_1 + 1    (6.7)

As f_2(a'_k) = 1 (maximum) and because of the choice of the D_1 constant (w_jk ∈ [0,1]), then w_jk = 1.

(d) Input minimum (x → 0̂) and input from F2 maximum. Similarly to the first case above (and (6.4)):

  lim_{x→0̂} a_j = lim_{x→0̂} [x_j + D_1 W(j,:) f_2(a') - B_1] / [1 + A_1 (x_j + D_1 W(j,:) f_2(a')) + C_1] < 0  ⇒  D_1 < B_1    (6.8)

(because of the 2/3 rule, at the limit the F1 neuron has to switch to the inactive state and subsequently has a negative activation).

(e) Input minimum (x → 0̂) and input from F2 also minimum (a' → 0̂). Similarly to the above cases the F1 neuron turns to the inactive state, so it will have a negative activation, and (6.4) (on similar premises as above) gives:

  lim_{x→0̂, a'→0̂} a_j < 0  ⇒  -B_1 < 0

which is of no use, because the constant B_1 is positive definite anyway.

Combining the requirements (6.5), (6.7) and (6.8) above into one gives:

  max(1, D_1) < B_1 < D_1 + 1

which represents one of the conditions to be put on the F1 constants such that the 2/3 rule will operate.

The output value of the j-th F1 neuron is obtained by applying the following activation function:

  f_1(a_j) = { 1 if activation a_j > 0
             { 0 if activation a_j ≤ 0    (6.9)
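The 2/3-rule behavior of the equilibrium equations above can be checked numerically. Below is a minimal Scilab sketch (Scilab being the tool used for the book's accompanying programs); the constant values are illustrative, not taken from the book, and are chosen to satisfy max(1, D_1) < B_1 < D_1 + 1:

  // F1 equilibrium activations (6.3), (6.4) and the binary output (6.9)
  A1 = 1; C1 = 5; D1 = 0.9; B1 = 1.5;        // max(1,D1) < B1 < D1+1 holds

  function y = f1_out(a)                     // output function (6.9)
      y = bool2s(a > 0);
  endfunction

  x  = [1; 1; 0; 0];                         // input components x_j
  td = [0; 1; 1; 0];                         // top-down terms W(j,:)*f2(a')

  a_no_f2   = x ./ (1 + A1*(x + B1) + C1);                     // (6.3), g = 1
  a_with_f2 = (x + D1*td - B1) ./ (1 + A1*(x + D1*td) + C1);   // (6.4), g = 0

  disp(f1_out(a_no_f2)');     // 1 1 0 0 : input alone drives F1 while F2 is silent
  disp(f1_out(a_with_f2)');   // 0 1 0 0 : only the neuron fed by both sources fires

The second result is exactly the 2/3 rule: once F2 is active, an F1 neuron stays active only if it receives both bottom-up and top-down input.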
6.2.2 The F2 layer

Initially the network starts in a "0" state: there are no signals traveling internally and there is no input (x = 0̂), so the output of the gain control unit is 0. Even if the F2 layer receives a direct input from the outside environment (another network, etc.), the 2/3 rule stops F2 from sending any output. Once an input has arrived at F1, the output of the gain control unit switches to 1, because the output of F2 is still 0. Now the output of F2 is allowed. There are two cases:

(a) There is already an input from outside; then F2 will output immediately, without waiting for the input from F1 (see the remarks about expectation in section 6.1).

(b) If there is no external input, then the F2 layer has to wait for the output of F1 before being able to send an output.

The conclusion is that at the F2 level the gain control unit is used just to turn on/off the right of F2 to send an output. Because the output of the gain control unit (i.e. 1) is sent uniformly to all neurons, it does not play any other active role and can be left out of the equations describing the behavior of the F2 units.

✍ Remarks:
➥ In fact the equation describing the activation of an F2 neuron is of the form (see (6.1)):

  da'_k/dt = -a'_k + (1 - A_2 a'_k)·(excitatory input) - (B_2 + C_2 a'_k)·(inhibitory input)   if g = 1
  a'_k = 0   otherwise

where a'_k is the neuron activation and A_2, B_2 and C_2 are positive constants. The first branch is analyzed below.

The neuron k on the F2 layer receives an excitatory input from F1, W'(k,:) f_1(a) (where W' is the weight matrix of the connections from F1 to F2), and from itself, h(a'_k), h being the feedback function:

  excitatory input = W'(k,:) f_1(a) + h(a'_k)

The same neuron k receives an inhibitory input from all the other F2 neurons:

  inhibitory input = Σ_{ℓ=1,ℓ≠k}^{K} [h(a'_ℓ) + W'(ℓ,:) f_1(a)]

where K is the number of neurons on F2. The first term is due to the direct inter-connections (lateral feedback) between neurons in the F2 layer, while the second represents the indirect inhibition (feedback) due to the fact that the other neurons will have a positive output (because of their input). Eventually, from (6.1):

  da'_k/dt = -a'_k + (1 - A_2 a'_k)[D_2 W'(k,:) f_1(a) + h(a'_k)] - (B_2 + C_2 a'_k) Σ_{ℓ=1,ℓ≠k}^{K} [h(a'_ℓ) + W'(ℓ,:) f_1(a)]

where D_2 is a multiplicative positive constant. Let B_2 = 0, C_2 = A_2 and D_2 = 1. Then:

  da'_k/dt = -a'_k + h(a'_k) + W'(k,:) f_1(a) - A_2 a'_k Σ_{ℓ=1}^{K} [h(a'_ℓ) + W'(ℓ,:) f_1(a)]    (6.10)

or, in matrix notation:

  da'/dt = -a' + h(a') + W' f_1(a) - A_2 {1̂ᵀ [h(a') + W' f_1(a)]} a'

For a feedback function of the form h(a'_k) = (a'_k)^m, where m > 1, the above equations define a competitive layer.

Proof. First the following change of variable is performed:

  ã'_k = a'_k / a'_tot ,  where a'_tot = Σ_{ℓ=1}^{K} a'_ℓ

so that a'_k = ã'_k a'_tot and da'_k/dt = (dã'_k/dt) a'_tot + ã'_k (da'_tot/dt).
By doing a summation over all k in (6.10) and using the change of variable just introduced:

  da'_tot/dt = -a'_tot + Σ_{ℓ=1}^{K} h(ã'_ℓ a'_tot) + Σ_{ℓ=1}^{K} W'(ℓ,:) f_1(a) - A_2 a'_tot Σ_{ℓ=1}^{K} [h(ã'_ℓ a'_tot) + W'(ℓ,:) f_1(a)]    (6.11)

Then, from the substitution introduced, (6.10) and (6.11):

  (dã'_k/dt) a'_tot = da'_k/dt - ã'_k (da'_tot/dt) = h(ã'_k a'_tot) + W'(k,:) f_1(a) - ã'_k Σ_{ℓ=1}^{K} [h(ã'_ℓ a'_tot) + W'(ℓ,:) f_1(a)]

As Σ_{ℓ=1}^{K} ã'_ℓ = 1, the above equation may be rewritten as:

  (dã'_k/dt) a'_tot = Σ_{ℓ=1}^{K} ã'_ℓ h(ã'_k a'_tot) - Σ_{ℓ=1}^{K} ã'_k h(ã'_ℓ a'_tot) + W'(k,:) f_1(a) - ã'_k Σ_{ℓ=1}^{K} W'(ℓ,:) f_1(a)
    = ã'_k a'_tot Σ_{ℓ=1}^{K} ã'_ℓ [ h(ã'_k a'_tot)/(ã'_k a'_tot) - h(ã'_ℓ a'_tot)/(ã'_ℓ a'_tot) ] + W'(k,:) f_1(a) - ã'_k Σ_{ℓ=1}^{K} W'(ℓ,:) f_1(a)

and on this formula the following cases may be considered:

• Identity function, h(ã'_k a'_tot) = ã'_k a'_tot. Then:

  (dã'_k/dt) a'_tot = W'(k,:) f_1(a) - ã'_k Σ_{ℓ=1}^{K} W'(ℓ,:) f_1(a)

and the stable value (obtained from dã'_k/dt = 0) is:

  ã'_k = W'(k,:) f_1(a) / Σ_{ℓ=1}^{K} W'(ℓ,:) f_1(a)  ⇒  a'_k ∝ W'(k,:) f_1(a)

i.e. the output is proportional to the weighted sum of the inputs.

• Square function, h(ã'_k a'_tot) = (ã'_k a'_tot)². Then h(ã'_k a'_tot)/(ã'_k a'_tot) - h(ã'_ℓ a'_tot)/(ã'_ℓ a'_tot) reduces to a'_tot (ã'_k - ã'_ℓ), which for ã'_k > ã'_ℓ represents an amplification while for ã'_k < ã'_ℓ represents an inhibition; the W'(k,:) f_1(a) term is constant. The F2 layer acts as a "winner-takes-all" network (there is a strong similarity with the functionality of the hidden layer of a counterpropagation network): the distance between neurons with large output and those with small output widens, and eventually only one neuron has a non-zero output. The same discussion holds for h(ã'_k a'_tot) = (ã'_k a'_tot)^m with m > 1. ∎

The winning F2 neuron sends a value of 1 to the F1 layer; all the others send 0. Let f_2(a') be the output (activation) function whose value is sent to F1:

  f_2(a'_k) = { 1 if a'_k = max_{ℓ=1,K} a'_ℓ
              { 0 otherwise    (6.12)

✍ Remarks:
➥ It seems, at first sight, that an F2 neuron has two outputs: one sent to the F2 layer (the h function) and one sent to the F1 layer (the f_2 function). This runs counter to the definition of a neuron, which should have just one output. However, this contradiction may be overcome if the neuron is replaced by an ensemble of three neurons: the main one, which calculates the activation a'_k (it has the identity function as activation function) and sends the result to two others, which receive its output (they have one input, with weight 1), apply the h, respectively f_2, functions and send their outputs wherever required. See figure 6.2.

Figure 6.2: The F2 neuron structure: a main unit computes the activation from the inputs; a feedback unit applies h and sends the result back to F2; an output unit applies f_2 and sends the result to F1.
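The winner-takes-all stage above amounts, computationally, to picking the F2 neuron with the largest total input. A minimal Scilab sketch (the weight values are illustrative, not from the book):

  // F2 "winner-takes-all" output (6.12): only the maximum input yields 1
  function y = f2_out(ap)
      [m, k] = max(ap);      // index of the winning F2 neuron
      y = zeros(ap);         // all outputs null ...
      y(k) = 1;              // ... except the winner, which sends 1 to F1
  endfunction

  Wp  = [1 0 1; 0 1 1] / 2;  // example W' (K = 2 classes, N = 3 inputs)
  f1a = [1; 0; 1];           // current F1 output
  disp(f2_out(Wp * f1a)');   // prints 1 0 : the first neuron wins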
6.2.3 Learning on F1: the W weights

The differential equations describing the F1 learning process (i.e. the adaptation of the W weights) are:

  dw_jk/dt = [-w_jk + f_1(a_j)] f_2(a'_k) ,  j = 1,N ; k = 1,K    (6.13)

There is at most one "winner" on F2, let k be that one, for which f_2(a'_k) ≠ 0; for all the others f_2(a'_ℓ) = 0. I.e. only the weights related to the winning F2 neuron are adapted on the F1 layer; all the others remain unchanged (for a given input).

✍ Remarks:
➥ During the learning process only one (at most) component of the weight vector W(j,:) changes for each F1 neuron, i.e. only column k of W changes, k being the F2 winner.

Because of the definition of the f_1 and f_2 functions (see (6.9) and (6.12)) the following cases may be considered:

➀ F2:k neuron winner (f_2(a'_k) = 1) and F1:j neuron active (f_1(a_j) = 1); then:

  dw_jk/dt = -w_jk + 1  ⇒  w_jk = 1 - e^{-t}

(solution found by first solving the homogeneous equation and then making the "constant" time-dependent to find the general solution). The weight asymptotically approaches 1 for t → ∞.

➁ F2:k neuron winner (f_2(a'_k) = 1) and F1:j neuron inactive (f_1(a_j) = 0); then:

  dw_jk/dt = -w_jk  ⇒  w_jk = w_jk(0) e^{-t}

where w_jk(0) is the initial value at t = 0. The weight asymptotically decreases to 0 for t → ∞.

➂ F2:k neuron non-winner (f_2(a'_k) = 0); then:

  dw_jk/dt = 0  ⇒  w_jk = const.

i.e. the weights do not change.

A supplementary condition on the W weights is required in order for the 2/3 rule to function, see (6.6):

  w_jk > (B_1 - 1)/D_1    (6.14)

i.e. all weights have to be initialized to a value greater than (B_1 - 1)/D_1. Otherwise the F1:j neuron is kept in an inactive state and the weights decrease to 0 (or do not change).

Fast learning: If the F2:k and F1:j neurons are both active then the weight w_jk → 1; otherwise it decays towards 0 or remains unchanged. A fast way to achieve the learning is to set the weights to their asymptotic values as soon as possible, i.e. knowing the neuronal activities:

  w_jk = { 1          if the j and k neurons are both active
         { no change  if the k neuron is inactive
         { 0          otherwise    (6.15)

or, in matrix notation:

  W(:,k)_new = f_1(a)  and  W(:,ℓ)_new = W(:,ℓ)_old for ℓ ≠ k

Only column k of W is to be changed (the weights related to the F2 winner); f_1(a) is 1 for active neurons and 0 otherwise, see (6.9). A sketch of this update is shown below.
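A hedged Scilab sketch of the fast-learning rule (6.15) in its matrix form; the function name art1_update_W and the initialization values are mine, for illustration only:

  // Fast learning on F1 (6.15): replace column k (the F2 winner) by the
  // current binary F1 output; all other columns stay untouched.
  function W = art1_update_W(W, f1a, k)
      W(:, k) = f1a;
  endfunction

  // usage: initialize above (B1-1)/D1 as required by (6.14), then update
  w0 = (B1-1)/D1;
  W  = w0 + (1 - w0)*rand(3, 2);
  W  = art1_update_W(W, [1; 0; 1], 1);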
6.2.4 Learning on F2: the W' weights

The differential equations describing the F2 learning process are:

  dw'_kj/dt = E [ F (1 - w'_kj) f_1(a_j) - w'_kj Σ_{ℓ=1,ℓ≠j}^{N} f_1(a_ℓ) ] f_2(a'_k) ,  j = 1,N ; k = 1,K

where E, F = const. and, because Σ_{ℓ≠j} f_1(a_ℓ) = Σ_{ℓ=1}^{N} f_1(a_ℓ) - f_1(a_j), the equations may be rewritten as:

  dw'_kj/dt = E [ F (1 - w'_kj) f_1(a_j) - w'_kj ( Σ_{ℓ=1}^{N} f_1(a_ℓ) - f_1(a_j) ) ] f_2(a'_k)

For all neurons ℓ on F2 except the winner, f_2(a'_ℓ) = 0, so only the winner's weights are adapted; all the others remain unchanged (for a given input). Analogously to the previous W weights (see also (6.9) and (6.12)), the following cases are discussed:

➀ F2:k neuron winner (f_2(a'_k) = 1) and F1:j neuron active (f_1(a_j) = 1); then:

  dw'_kj/dt = E [ F - w'_kj ( F - 1 + Σ_{ℓ=1}^{N} f_1(a_ℓ) ) ]

and the solution is of the form:

  w'_kj = F / (F - 1 + Σ_{ℓ=1}^{N} f_1(a_ℓ)) + const.·exp[ -Et (F - 1 + Σ_{ℓ=1}^{N} f_1(a_ℓ)) ]

(found analogously to the W weights). The weight asymptotically approaches F / (F - 1 + Σ_ℓ f_1(a_ℓ)) for t → ∞. In the extreme case it may happen that Σ_{ℓ=1}^{N} f_1(a_ℓ) → 0, such that the condition F > 1 has to be imposed to keep the weights positive.

➁ F2:k neuron winner (f_2(a'_k) = 1) and F1:j neuron inactive (f_1(a_j) = 0); then:

  dw'_kj/dt = -E w'_kj Σ_{ℓ=1}^{N} f_1(a_ℓ)  ⇒  w'_kj = w'_kj(0) exp[ -Et Σ_{ℓ=1}^{N} f_1(a_ℓ) ]

where w'_kj(0) is the initial value at t = 0. The weight asymptotically decreases to 0 for t → ∞.

➂ F2:k neuron non-winner (f_2(a'_k) = 0); then:

  dw'_kj/dt = 0  ⇒  w'_kj = const.

i.e. the weights do not change.

Fast learning: If the F2:k and F1:j neurons are both active then the weight w'_kj tends to its asymptotic value F / (F - 1 + Σ_ℓ f_1(a_ℓ)); otherwise it decays towards 0 or remains unchanged. A fast way to achieve the learning is to set the weights to their asymptotic values as soon as possible, i.e. knowing the neuronal activities:

  w'_kj = { F / (F - 1 + Σ_{ℓ=1}^{N} f_1(a_ℓ))  if the k and j neurons are both active
          { no change                           if the k neuron is inactive
          { 0                                   otherwise    (6.16)

or, in matrix notation:

  W'(k,:)_new = F f_1(aᵀ) / (F - 1 + 1̂ᵀ f_1(a))  and  W'(ℓ,:)_new = W'(ℓ,:)_old for ℓ ≠ k    (6.17)

Only row k of W' is to be changed (the weights related to the F2 winner); f_1(a) is 1 for active neurons and 0 otherwise, see (6.9).
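The matching Scilab sketch for the bottom-up fast-learning rule (6.16)/(6.17); the function name is mine and F > 1 is assumed, as required above:

  // Fast learning on F2 (6.17): only row k (the F2 winner) is changed
  function Wp = art1_update_Wp(Wp, f1a, k, F)
      Wp(k, :) = F * f1a' / (F - 1 + sum(f1a));
  endfunction

Note the design choice embedded in (6.16): the normalization by F - 1 + Σ f_1(a_ℓ) makes neurons that learned "small" patterns (few ones) develop larger individual weights, which is exactly what the subpattern analysis of the next section relies on.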
6.2.5 Subpatterns

The input vector is binary (x_i = 0 or 1). All patterns are subpatterns of the unit input vector 1̂. It is also possible for a pattern x' to be a subpattern of another input vector x'': x' ⊆ x'', i.e. either x'_i = x''_i or x'_i = 0. The ART1 network has to ensure that the proper F2 neuron wins in such a case, i.e. the winning neurons must be different for x' and x'' (let k' and k'' be those neurons: k' ≠ k'') if x' and x'' correspond to different classes.

When first presented with an input vector, the F1 layer outputs the same vector and distributes it to the F2 layer, see (6.3) and (6.9) (the later change in the F1 activation pattern is used just to reset the F2 layer). So at this stage f_1(a) = x. Assuming that the k' neuron has learned x', its weights should be (from (6.17)):

  W'(k',:) = F x'ᵀ / (F - 1 + 1̂ᵀx')    (6.18)

and for the k'' neuron, which has learned x'', its weights should be:

  W'(k'',:) = F x''ᵀ / (F - 1 + 1̂ᵀx'')

Proof. When x' is presented as input, the total input to the k' neuron is (the output of F1 being x'):

  a'_{k'} = W'(k',:) x' = F x'ᵀx' / (F - 1 + 1̂ᵀx')

while the total input to the k'' neuron is:

  a'_{k''} = W'(k'',:) x' = F x''ᵀx' / (F - 1 + 1̂ᵀx'')

Because x' ⊆ x'', then x''ᵀx' = x'ᵀx' but 1̂ᵀx'' > 1̂ᵀx', and then a'_{k''} < a'_{k'} and k' wins, as it should. Similarly, when x'' is presented as input, the total input to the k' neuron is:

  a'_{k'} = W'(k',:) x'' = F x'ᵀx'' / (F - 1 + 1̂ᵀx')

while the total input to the k'' neuron is:

  a'_{k''} = W'(k'',:) x'' = F x''ᵀx'' / (F - 1 + 1̂ᵀx'')

As x' ⊆ x'' and both are binary vectors, then x'ᵀx'' = x'ᵀx' = 1̂ᵀx', x''ᵀx'' = 1̂ᵀx'' and 1̂ᵀx' < 1̂ᵀx''; hence:

  a'_{k'} = F / [(F - 1)/1̂ᵀx' + 1] < F / [(F - 1)/1̂ᵀx'' + 1] = a'_{k''}

and the neuron k'' wins, as it should. ∎

✍ Remarks:
➥ The input patterns are assumed non-zero, otherwise x = 0̂ means no activation. All inputs are subpatterns of the unit vector 1̂, and the neuron which has learned the unit vector has the smallest weights:

  W'(k,:) = F 1̂ᵀ / (F - 1 + N)

and the smallest output (when the unit vector is presented as input):

  a'_k = F N / (F - 1 + N)

➥ The F2 neurons which are not yet used should not win over neurons already committed to an input vector. The weights of unused neurons have to be initialized such that they do not win in the worst case, i.e. when 1̂ has already been committed to a neuron. Uncommitted neuronal weights have to be initialized with values:

  w'_kj ∈ (0, F/(F - 1 + N))

(the value 0 has to be avoided because it would give a 0 output). Also, the initialization values should be random, such that when a new class of inputs is presented and none of the previously committed neurons wins, only one of the uncommitted neurons wins.

6.2.6 The Reset Unit

The reset neuron is set to detect mismatches between the input vector and the output of the F1 layer. At the start, when an input is present, the output of F1 is identical to the input and the reset unit should not activate. Also, the reset unit should not activate if the difference between the input and the F1 output is below some specified value. Differences between the input and the stored pattern appear due to noise, missing data or small differences between vectors belonging to the same class.

All inputs are of equal importance, so they receive the same weight; the same happens with the F1 outputs, but they arrive as inhibitory input to the reset unit (see figure 6.1 in section 6.1). Let Q be the weight for the inputs and -S the weight for the F1 connections (Q, S > 0, constants). The total input to the reset unit is:

  a_R = Q Σ_{i=1}^{N} x_i - S Σ_{i=1}^{N} f_1(a_i)

The reset unit activates if its net input is positive:

  Q Σ_{i=1}^{N} x_i - S Σ_{i=1}^{N} f_1(a_i) > 0  ⇔  Σ_{i=1}^{N} f_1(a_i) / Σ_{i=1}^{N} x_i < Q/S ≡ ρ

where ρ is called the vigilance parameter. For Σ_i f_1(a_i) / Σ_i x_i ≥ ρ the reset unit does not trigger. Because at the beginning (before F2 activates) f_1(a) = x, the vigilance parameter should obey ρ ≤ 1, i.e. Q ≤ S (otherwise the reset would always trigger).

Noise in data: The vigilance parameter self-scales the difference between the actual input pattern and the stored/learned one. Depending on the input pattern, the difference (noise, distance, etc.) between the two may or may not trigger the reset unit. For the same difference (same set of ones in the input vector), the ratio between noise and information varies with the number of ones in the input vector: assuming that the noise vector is the smallest one, x = (0 ... 1 ... 0)ᵀ (just one "1"), and the stored input vector is similar (it also has just one "1"), then for a noisy input the ratio between noise and data is at least 1:1; for an input having two ones the ratio drops to half, 0.5:1, and so on.

New pattern learning: If the reset unit is activated, it deactivates the winning F2 neuron for a time long enough that all committed F2 neurons have a chance to win (and see whether the input pattern is "theirs") or a new uncommitted neuron is set to learn a new class of inputs. If none of the already used neurons is able to establish a "resonance" between F1 and F2, then an unused (so far) neuron k (from F2) wins. The activity of the F1 neurons is:

  a_j = [x_j + D_1 W(j,:) f_2(a') - B_1] / [1 + A_1 (x_j + D_1 W(j,:) f_2(a')) + C_1] = [x_j + D_1 w_jk - B_1] / [1 + A_1 (x_j + D_1 w_jk) + C_1]

(see (6.4); because f_2(a') is 1 just for the winner k and zero for the rest, W(j,:) f_2(a') = w_jk). For newly committed F2 neurons the weights (from F2 to F1) are initialized to a value w_jk > (B_1 - 1)/D_1 (see (6.14)) and then:

  x_j + D_1 w_jk - B_1 > x_j - 1

which means that for x_j = 1 the activity a_j is positive and f_1(a_j) = 1, while for x_j = 0, because of the 2/3 rule, the activity is negative and f_1(a_j) = 0. Conclusion: when a new F2 neuron is committed to learn the input, the output of the F1 layer is identical to the input, the reset neuron does not activate, and the learning of the new pattern begins.

✍ Remarks:
➥ The F2 layer should have enough neurons for all the classes to be learned, otherwise an overloading of neurons, and consequently instabilities, may occur.
➥ Learning of new patterns may be stopped and resumed at any time by allowing or denying the weight adaptation.
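The reset test reduces to comparing a "degree of match" against the vigilance. A minimal Scilab sketch (the function name is mine):

  // Reset test: compare sum(f1_new)/sum(x) against the vigilance rho
  function r = art1_reset(x, f1a_new, rho)
      r = ( sum(f1a_new) / sum(x) ) < rho;   // %t means: reset the F2 winner
  endfunction

  disp(art1_reset([1;1;1;0], [1;1;0;0], 0.8))   // %t : match 2/3 is below 0.8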
Missing data: Assume that an input vector similar to a stored/learned one is presented to the network, and consider the case where some data is missing, i.e. some components of the input are 0 where the stored pattern has 1's. Assuming that the input is "far" enough from the other stored vectors, only the designated F2 neuron will win, i.e. the one which previously learned the complete pattern (there is no reset). Assuming that a reset indeed does not occur: the pattern stored in the winning F2 neuron's weights has more 1's than the input pattern, so (after one transmission cycle between the F1,2 layers, because of the 2/3 rule) some F1 neurons receive top-down input without the corresponding input component. The corresponding weights (see (6.15): j inactive because of the 2/3 rule, k active) are set to 0 and eventually the F2 winner learns the new, reduced input, assuming that learning was allowed, i.e. the weights are free to adapt. If the original, complete input vector is applied again, the original F2 neuron may learn again the same class of input vectors, or otherwise a new unassigned F2 neuron may be committed to learn it. This kind of behavior may lead to a continuous change in the class of vectors represented by the F2 neurons, if learning is always allowed.

➧ 6.3 The ART1 Algorithm

Initialization

1. The size N of the F1 layer is determined by the dimension of the input vectors. The dimension K of the F2 layer is based on the desired number of classes to be learned, now and later. Note also that in special cases some classes may have to be divided into "subclasses" with different assigned F2 winning neurons.
2. Select the constants: A_1 > 0, C_1 > 0, D_1 > 0, max(1, D_1) < B_1 < D_1 + 1, F > 1 and ρ ∈ (0, 1].
3. Initialize the W weights with random values such that:

  w_jk > (B_1 - 1)/D_1 ,  j = 1,N ; k = 1,K

4. Initialize the W' weights with random values such that:

  w'_kj ∈ (0, F/(F - 1 + N)) ,  k = 1,K ; j = 1,N

Network running and learning

The algorithm uses the fast-learning method (asymptotic values for the weights).

1. Apply an input vector x. The F1 output becomes f_1(a) = x (at the first run; x is a binary vector).
2. Calculate the activities of the F2 neurons and find the winner: the neuron with the biggest input from F1 wins (and all the others will have zero output). For k the F2 winner it holds that:

  W'(k,:) f_1(a) = max_ℓ W'(ℓ,:) f_1(a)

3. Calculate the new activities of the F1 neurons caused by the input from F2. The F2 output is a vector with all components 0 except for the winner k; multiplying W by such a vector means that column W(:,k) is selected. By the 2/3 rule, the new F1 output keeps only the components active in both the input and the stored template:

  f_1(a)_new = x ⊙ sign(W(:,k))

(⊙ denoting the elementwise product; for an uncommitted winner all weights are positive, so f_1(a)_new = x).
4. Calculate the "degree of match" between the input and the new output of the F1 layer:

  degree of match = Σ_{j=1}^{N} f_1(a_j)_new / Σ_{j=1}^{N} x_j = 1̂ᵀ f_1(a)_new / 1̂ᵀ x

5. Compare the "degree of match" computed previously with the vigilance parameter ρ. If the vigilance parameter is bigger than the "degree of match" then:
  (a) Mark the F2:k neuron as inactive for the rest of the cycle, while working with the same input x.
  (b) Restart the procedure with the same input vector.
Otherwise continue: the input vector was positively identified (assigned, if the winning neuron is a previously unused one) as being of class k.
6. Update the weights (if learning is enabled; it has to be enabled for new classes). See (6.15) and (6.16):

  W(:,k)_new = f_1(a)_new ,  W'(k,:)_new = F f_1(aᵀ)_new / (F - 1 + 1̂ᵀ f_1(a)_new)

only the weights related to the winning F2 neuron being updated.
7. The information returned by the network is the classification of the input vector, given by the winning F2 neuron (in the one-of-k encoding scheme).
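The whole fast-learning cycle fits in a few lines of Scilab. The sketch below is a hedged illustration of section 6.3 (function name, structure and constants are mine; it assumes a non-zero binary column vector x):

  // ART1 fast-learning cycle: search, vigilance test, commit
  function [class, W, Wp] = art1(x, W, Wp, rho, F)
      [N, K] = size(W);
      alive = ones(K, 1);                        // F2 neurons not yet reset
      while %t
          ap = (Wp * x) .* alive;                // F2 inputs, reset ones excluded
          [m, class] = max(ap);
          f1new = x .* bool2s(W(:, class) > 0);  // 2/3 rule (step 3)
          if sum(f1new)/sum(x) >= rho then       // resonance: no reset (step 5)
              W(:, class)  = f1new;                              // (6.15)
              Wp(class, :) = F*f1new' / (F - 1 + sum(f1new));    // (6.16)
              return;
          end
          alive(class) = 0;                      // reset the winner, try next
          if sum(alive) == 0 then
              class = -1;                        // capacity exhausted
              return;
          end
      end
  endfunction

  // usage with illustrative sizes and constants
  N = 4; K = 3; F = 2; rho = 0.9;
  W  = 0.9 * ones(N, K);                 // above (B1-1)/D1, cf. (6.14)
  Wp = rand(K, N) * F/(F - 1 + N);       // in (0, F/(F-1+N))
  [c, W, Wp] = art1([1;0;1;0], W, Wp, rho, F);
  disp(c);                               // class assigned to the pattern

Since uncommitted neurons still have all top-down weights positive, their match is always 1 and the loop is guaranteed to terminate as long as an uncommitted neuron remains.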
➧ 6.4 The ART2 Architecture

Unlike ART1, the ART2 network is designed to work with analog positive inputs (see [FS92] pp. 316–318). There is a broad similarity with the ART1 architecture: there is an F1 layer which sends its output to an F2 layer and to a reset layer; the F2 layer is of "winner-takes-all" type and the reset unit has the same role as in ART1. However, the F1 layer is made of 6 sublayers, labeled w, s, u, v, p and q. See figure 6.3. Each of the sublayers has the same number of neurons as the number of components of the input vector.

➀ The input vector is sent to the w sublayer.
➁ The output of the w sublayer is sent to the s sublayer.
➂ The output of the s sublayer is sent to the v sublayer.
➃ The output of the v sublayer is sent to the u sublayer.
➄ The output of the u sublayer is sent to the p sublayer, to the reset layer r and back to the w sublayer.
➅ The output of the p sublayer is sent to the q sublayer and to the reset layer. The output of the p sublayer also represents the output of the F1 layer and is sent to F2.

Figure 6.3: The ART2 network architecture. Thin arrows represent neuron-to-neuron connections between (sub)layers; thick arrows represent full inter-layer connections (from all neurons to all neurons). The G units are gain-control neurons, which send an inhibitory signal; the R unit is the reset neuron.

✍ Remarks:
➥ Between sublayers there is a one-to-one neuronal connection (neuron j from one layer to the corresponding neuron j of the other layer).
➥ All (sub)layers receive input from 2 sources; a supplementary gain-control neuron has been added where necessary, such that the layers comply with the 2/3 rule. The gain-control units have the role of normalizing the output of the corresponding layers (see also the (6.20) equations); note that all layers have 2 sources of input, either from 2 layers or from a layer and a gain-control unit.
➥ A gain-control neuron receives input from all neurons of the corresponding sublayer and sends the sum of its inputs as inhibitory input (see also (6.21) and table 6.1), while the other layers send excitatory input.

➧ 6.5 ART2 Dynamics

6.5.1 The F1 layer

The differential equations governing the behavior of the F1 sublayers are (see [FS92] pp. 318–324 and [CG87]):

  dw_j/dt = -w_j + x_j + a u_j              dv_j/dt = -v_j + f(s_j) + b f(q_j)
  ds_j/dt = -e s_j + w_j - s_j ||w||        dp_j/dt = -p_j + u_j + W(j,:) f_2(a')
  du_j/dt = -e u_j + v_j - u_j ||v||        dq_j/dt = -e q_j + p_j - q_j ||p||

where a, b, e = const. The function f determines the contrast enhancement which takes place inside the F1 layer; a possible definition would be:

  f(x) = { 0 for x < θ
         { x otherwise    (6.19)

where θ ∈ (0, 1), θ = const. The norm of a vector x is here defined as the Euclidean one. When applied to a vector, the contrast-enhancement function may be written in matrix form as:

  f(x) = x ⊙ sign(sign(x - θ1̂) + 1̂)

The equilibrium values (from d(·)/dt = 0) are:

  w_j = x_j + a u_j                 v_j = f(s_j) + b f(q_j)
  s_j = w_j / (e + ||w||)           p_j = u_j + W(j,:) f_2(a')
  u_j = v_j / (e + ||v||)           q_j = p_j / (e + ||p||)    (6.20)

These results may be described by means of one single equation, with different parameters for the different sublayers (see table 6.1):

  d(neuron output)/dt = -C_1·(neuron output) + (excitatory input) - C_2·(neuron output)·(inhibitory input)    (6.21)

✍ Remarks:
➥ The same equation (6.21) is applicable to the reset layer, with the parameters given in table 6.1 (c = const.).
➥ The purpose of the constant e is to limit the output of the s, q, u and r (sub)layers when their input is 0; consequently e should be chosen e ≳ 0, and it may be neglected when real data is presented to the network.

  Layer | Neuron output | C_1 | C_2 | Excitatory input       | Inhibitory input
  w     | w_j           | 1   | 1   | x_j + a u_j            | 0
  s     | s_j           | e   | 1   | w_j                    | ||w||
  u     | u_j           | e   | 1   | v_j                    | ||v||
  v     | v_j           | 1   | 1   | f(s_j) + b f(q_j)      | 0
  p     | p_j           | 1   | 1   | u_j + W(j,:) f_2(a')   | 0
  q     | q_j           | e   | 1   | p_j                    | ||p||
  r     | r_j           | e   | 1   | u_j + c p_j            | ||u|| + c||p||

Table 6.1: The parameters for the general ART2 differential equation (6.21).
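The equilibrium equations (6.20), with e neglected, can be evaluated directly in one pass per sublayer. A hedged Scilab sketch (function names are mine; it assumes a non-zero input so the norms do not vanish):

  // Contrast enhancement (6.19), elementwise with threshold th
  function y = f_ce(x, th)
      y = x .* bool2s(x >= th);
  endfunction

  // One pass through the F1 equilibrium values (6.20), e ~ 0;
  // topdown = d*W(:,k) for the current F2 winner, or zeros if F2 is inactive
  function [w, s, u, v, p, q] = f1_pass(x, u, q, topdown, a, b, th)
      w = x + a*u;          s = w / norm(w);
      v = f_ce(s, th) + b*f_ce(q, th);
      u = v / norm(v);
      p = u + topdown;      q = p / norm(p);
  endfunction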
6.5.2 The F2 Layer

The F2 layer of the ART2 network is identical to the F2 layer of the ART1 network. The total input into neuron k is a'_k = W'(k,:) f_1(a) and the output is:

  f_2(a'_k) = { d if a'_k = max_ℓ a'_ℓ
              { 0 otherwise    (6.22)

where d = const., d ∈ (0, 1). Then the output of the p sublayer becomes:

  p_j = { u_j             if F2 is inactive
        { u_j + d w_jk    for k the winner on F2    (6.23)

6.5.3 The Reset Layer

The differential equation defining the running of the reset layer is of the form given by (6.21), with the parameters defined in table 6.1:

  dr_j/dt = -e r_j + u_j + c p_j - r_j (||u|| + c||p||)

with the inhibitory input given by the 2 gain-control neurons. The equilibrium value is:

  r_j = (u_j + c p_j)/(e + ||u|| + c||p||) ≈ (u_j + c p_j)/(||u|| + c||p||) ,  r = (u + c p)/(||u|| + c||p||)    (6.24)

(see also the remarks regarding the value of e). By definition, considering ρ the vigilance parameter, the reset occurs when:

  ||r|| < ρ    (6.25)

The reset should not activate before an output from the F2 layer has arrived (this condition is used in the ART2 algorithm); and indeed, from the (6.20) equations, if f_2(a') = 0̂ then p = u and (6.24) gives ||r|| = 1.

6.5.4 Learning and Initialization

The differential equations governing the weight adaptation are defined as:

  dw_jk/dt = f_2(a'_k)(p_j - w_jk)  and  dw'_kj/dt = f_2(a'_k)(p_j - w'_kj)

and, considering (6.22) and (6.23):

  dw_jk/dt  = { d(u_j + d w_jk - w_jk)   for k winner on F2
              { 0                        otherwise
  dw'_kj/dt = { d(u_j + d w_jk - w'_kj)  for k winner on F2
              { 0                        otherwise

Fast Learning: The weights related to the winning F2 neuron are updated; all the others remain unchanged. The equilibrium values are obtained from the d(·)/dt = 0 condition. Assuming that k is the winner:

  u_j + d w_jk - w_jk = 0  ⇒  w_jk = u_j/(1 - d)  and  w'_kj = u_j + d w_jk = u_j/(1 - d)

  ⇒  W(:,k) = u/(1 - d)  and  W'(k,:) = uᵀ/(1 - d) = W(:,k)ᵀ    (6.26)

(this is why the condition 0 < d < 1 is necessary). Eventually the W' weight matrix becomes the transpose of the W matrix, once all the F2 neurons have been used to learn new data.

New pattern learning: The reset unit should not activate when a learning process takes place. Using the fact that u_j = v_j/||v|| (e ≈ 0), hence ||u|| = 1, and also ||r|| = √(rᵀr), from (6.24):

  ||r|| = √[1 + 2c||p|| cos∠(u,p) + c²||p||²] / (1 + c||p||)    (6.27)

where ∠(u,p) is the angle between the u and p vectors; if p ∥ u then a reset does not occur, because ρ < 1 and the reset condition (6.25) is not met (||r|| = 1).

If the W weights are initialized to 0̃ (the zero matrix), then at the beginning of the learning process the top-down contribution is zero and p = u (see (6.23)), such that a reset does not occur. During the learning process the weight vector W(:,k), associated with the connections from the F2 winner to F1 (the p sublayer), becomes parallel to u (see (6.26)) and then, from (6.23), p moves (during the learning process) towards becoming parallel with u, and again a reset does not occur.
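The fast-learning update (6.26) is a one-line rule per weight set. A minimal Scilab sketch (the function name is mine):

  // ART2 fast learning (6.26) for the F2 winner k: both weight sets
  // become u/(1-d), so W' tends towards the transpose of W.
  function [W, Wp] = art2_update(W, Wp, u, k, d)
      W(:, k)  = u / (1 - d);
      Wp(k, :) = u' / (1 - d);
  endfunction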
Conclusion: the W weights have to be initialized to 0̃.

Initialization of the W' weights

Assume that a neuron k from F2 has learned an input vector which, after some time, is presented again to the network. The same k neuron should win, and not one of the (yet) uncommitted F2 neurons. This means that the output of the k neuron, i.e. a'_k = W'(k,:) p, should be bigger than a'_ℓ = W'(ℓ,:) p for all unused F2 neurons ℓ:

  W'(k,:) p > W'(ℓ,:) p  ⇒  ||W'(k,:)|| ||p|| > ||W'(ℓ,:)|| ||p|| cos∠(W'(ℓ,:), p)

because p ∥ u ∥ W'(k,:), see (6.23) and (6.26) (the learning process, i.e. weight adaptation, has already been done previously for neuron k). The worst possible case is when W'(ℓ,:) ∥ p, such that cos∠(W'(ℓ,:), p) = 1. To ensure that no other neuron but k wins, the condition:

  ||W'(ℓ,:)|| < ||W'(k,:)|| = 1/(1 - d)    (6.28)

has to be imposed for the unused neurons ℓ (for a committed neuron W'(k,:) = uᵀ/(1 - d) and ||u|| = 1, as u_j = v_j/||v||, e ≈ 0).

To maximize the input a'_ℓ of the unused neurons, such that the network is more sensitive to new patterns, the weights of the (unused) neurons have to be uniformly initialized with the maximum values allowed by condition (6.28), i.e. (N being the number of components per row of W'):

  w'_ℓj ≲ 1 / [√N (1 - d)]

Conclusion: the W' weights have to be initialized with w'_kj ≲ 1/[√N(1 - d)].

Constants Restraints

As p = u + dW(:,k) and ||u|| = 1, then:

  uᵀp = ||p|| cos∠(u,p) = 1 + d||W(:,k)|| cos∠(u, W(:,k))
  ||p|| = √(pᵀp) = √[1 + 2d||W(:,k)|| cos∠(u, W(:,k)) + d²||W(:,k)||²]

(k being the F2 winner). Replacing ||p|| cos∠(u,p) and ||p|| into (6.27) gives:

  ||r|| = √[(1 + c)² + 2(1 + c) cd||W(:,k)|| cos∠(u, W(:,k)) + c²d²||W(:,k)||²] / [1 + c√(1 + 2d||W(:,k)|| cos∠(u, W(:,k)) + d²||W(:,k)||²)]

Figure 6.4 shows the dependency of ||r|| as a function of cd||W(:,k)||, cos∠(u, W(:,k)) and c; note that ||r|| decreases for cd||W(:,k)|| < 1.

Figure 6.4: ||r|| as a function of cd||W(:,k)||. Figure (a) shows the dependency for various angles ∠(u, W(:,k)), from π/2 to 0 in π/20 steps, with c = 0.1 constant. Figure (b) shows the dependency for various c values, from 0.1 to 1.9 in 0.2 steps, with ∠(u, W(:,k)) = π/2 - π/20 constant.
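The quantity plotted in figure 6.4 can be reproduced directly from the formula above. A hedged Scilab sketch (the function name and parameter choices are mine):

  // ||r|| as a function of nW = ||W(:,k)|| and cosphi = cos of the angle
  // between u and W(:,k), as in figure 6.4
  function r = art2_rnorm(nW, cosphi, c, d)
      np2 = 1 + 2*d*nW*cosphi + (d*nW)^2;                     // ||p||^2
      num = (1+c)^2 + 2*(1+c)*c*d*nW*cosphi + (c*d*nW)^2;
      r   = sqrt(num) / (1 + c*sqrt(np2));
  endfunction

  disp(art2_rnorm(0, cos(%pi/2 - %pi/20), 0.1, 0.9))   // 1 at ||W(:,k)|| = 0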
Discussion:

• The learning process should increase the mismatch sensitivity between the F1 pattern sent to F2 and the classification received from F2, i.e. at the end of the learning process the network should be able to discern better between different classes of input. This means that, while the network learns a new input class and W(:,k) increases from the initial value 0̂ to the final value with ||W(:,k)|| = 1/(1 - d), the ||r|| value (defining the network sensitivity) has to decrease (such that the reset condition (6.25) becomes easier to meet). In order to achieve this, the following condition has to be imposed:

  cd/(1 - d) ≤ 1

(at the end of the learning ||W(:,k)|| = 1/(1 - d)). In other words: when there is a perfect fit, ||r|| reaches its maximal value 1; when presented with a slightly different input vector, the same F2 neuron should win and adapt to the new value. During this process the value of ||r|| will first decrease before increasing back to 1. While decreasing, it may happen that ||r|| < ρ and a reset occurs; this means that the input vector does not belong to the class represented by the current F2 winner.

• For cd/(1 - d) ≲ 1 the network is more sensitive than for cd/(1 - d) ≪ 1: the ||r|| value will drop more for the same input vector (slightly different from the stored/learned one). See figure 6.4-a.

• For c ≪ 1 the network is more sensitive than for c ≲ 1 (same reasoning as above). See figure 6.4-b.

• What happens in fact inside the F1 layer may be explained now:
  - s, u and q just normalize the w, v and p outputs before sending them further.
  - There are connections p → v (via q) and w → v (via s) and also back, v → w, p (via u). See figure 6.5. Obviously the v layer acts as a mediator between the input x, received via w, and the output of p, activated by F2. During this negotiation u and v (u being the normalization of v) may move away from W(:,k) (||r|| drops). If it moves too far then a reset occurs (||r|| becomes smaller than ρ) and the process starts over with another F2 neuron and a new W(:,k). Note that u eventually represents a normalized combination (filtered through f) of x and p (see (6.20)).

Figure 6.5: The F1 dynamics: data communication between the sublayers.

➧ 6.6 The ART2 Algorithm

Initialization

The dimensions of the w, s, u, v, p, q and r layers equal N (the dimension of the input vector). The norm used is the Euclidean one: ||x|| = √(xᵀx).

1. Select the network learning constants such that:

  a, b > 0 ,  θ ∈ (0,1) ,  ρ ∈ (0,1) ,  d ∈ (0,1) ,  c ∈ (0,1) ,  cd/(1 - d) ≤ 1

and the size K of F2 (similar to the ART1 algorithm).
2. Choose a contrast-enhancement function, e.g. (6.19).
3. Initialize the weights:

  W = 0̃ ,  w'_kj ≲ 1/[√N(1 - d)]

the W' weights being initialized with random values such that the above condition is met.

Network running and learning

1. Pick up an input vector x.
2. First initialize u = 0̂ and q = 0̂, and then iterate the following steps until the output values of the F1 sublayers stabilize:

  w = x + a u  →  s = w/||w||  →  v = f(s) + b f(q)  →  u = v/||v||  →
  p = { u at the first iteration; u + dW(:,k) at the next iterations }  →  q = p/||p||

(F2 is inactive at the first iteration). Note that usually two iterations are enough.
3. Calculate the output of the r layer:

  r = (u + c p)/(||u|| + c||p||)

If there is a reset, i.e. ||r|| < ρ (there should be no reset at the first pass, as ||r|| = 1 > ρ), then the F2 winner is made inactive for the current input vector; go back to step 2. If there is no reset (step 4 is always executed at least once) and a winner was found on F2, then resonance was achieved: jump to step 6.
4. Find the winner on F2: first calculate the total inputs a' = W' p, then find the winner k for which a'_k = max_ℓ a'_ℓ, and finally set f_2(a') = 0̂ and afterwards f_2(a'_k) = d, as F2 is a contrast-enhancement layer (see (6.22)).
5. Go back to step 2 (find the new output values of the F1 sublayers).
6. Update the weights (if learning is allowed):

  W(:,k) = W'(k,:)ᵀ = u/(1 - d)

7. The information returned by the network is the classification of the input vector, given by the winning F2 neuron in one-of-k encoding.
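Steps 2 and 3 of the algorithm can be sketched in Scilab by reusing the f1_pass function introduced in section 6.5.1 (this fragment depends on that definition; all constant values here are illustrative):

  // Stabilize F1, then compute the reset signal (no F2 winner yet)
  x = [0.2; 0.7; 0.1; 0.5];  a = 10; b = 10; c = 0.1; d = 0.9; th = 0.2;
  u = zeros(x); q = zeros(x); topdown = zeros(x);
  for it = 1:2                                  // two passes usually suffice
      [w, s, u, v, p, q] = f1_pass(x, u, q, topdown, a, b, th);
  end
  r = (u + c*p) / (norm(u) + c*norm(p));
  disp(norm(r));                                // 1 before F2 activates: no reset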
Basic Principles

CHAPTER 7

Pattern Recognition

➧ 7.1 Patterns: The Statistical Approach

7.1.1 Patterns and Classification

In the most general way, intelligent behavior (of a living organism or of an artificial intelligence machine) is represented by the ability to recognize/classify patterns, taken in their broadest definition (see [Bis95] pp. 1–6 and [Rip96] pp. 5, 24). A pattern may represent a class of objects, a sequence of movements or even a mixture of feelings. The way the intelligent system reacts to the recognition of a pattern may also be considered a pattern.

A quantized pattern is represented by a vector x. Let C_k, k = 1,K be the classes with respect to which the pattern x has to be classified.

✍ Remarks:
➥ Many classes of patterns overlap and, in many cases, it may be quite difficult to assign a pattern to a certain class. For this reason a statistical approach is taken, i.e. a class-membership probability is attached to patterns.
➥ One of the things which may sometimes be hard is to quantize the pattern into a set of numbers such that processing can be done on them. An image may be quantized into a set of pixels; a sound may be quantized into a set of frequencies associated with pitch, volume and time duration, etc.

The pattern x has associated a certain probability of being of class C_k, different for each class. This probability is a function of the variables {x_i}_{i=1,N}. The main problem is to find/build these functions so as to be able to give reliable results on previously unseen patterns (generalization), i.e. to build a statistical model. The probabilities may overlap, see figure 7.1.

Figure 7.1: Overlapping probabilities: considering only the x_i component, it is more probable that the pattern x is of class C_k''; however, it is possible for x to be of class C_k'.

In the ANN field, in most cases the probabilities are given by a vector y representing the network output; it is a function of the input pattern x and of some parameters named weights, collected together in a matrix W (or several matrices):

  y = y(x, W)

Usually the mapping from the pattern space X to the output space Y is non-linear. ANN offer a very general framework for finding the mapping y : X → Y (and they are particularly efficient at building nonlinear models). The process of finding the adequate weights W is called learning.

The probabilities and the classes are usually determined from a learning data set already classified by a supervisor; the ANN learns to classify from the data set, and this kind of learning is called supervised learning. At the end of the learning process the network should be able to correctly classify a previously unseen pattern, i.e. to generalize. The data set is called a sample or a training set (because the ANN trains/learns using it) and the supervisor is usually a human; however, there are also neural networks with unsupervised learning capabilities. There is also another type, called reinforced learning, where no desired output is present in the data set but there is a positive or negative feedback depending on the output (desired/undesired).

If the output variables are part of a discrete, finite set then the process of neural computing is called classification; for continuous output variables it is called regression (regression refers to functions defined as an average over a random quantity).

✍ Remarks:
➥ It is also possible to have a (more or less) special class containing all patterns x which were classified (with some confidence) as not belonging to any other class. These patterns are called outliers. Outliers usually appear due to insufficient data.
➥ In some cases there is a "cost" associated with a (mis)classification. If the cost is too high and the probability of misclassification is also (relatively) high, then the classifier may decline to classify the input pattern. These patterns are called rejects or doubts. Two more "classes" with special meaning may thus be considered: O for outliers and D for rejected patterns (doubts).
➥ The performance of a network can be measured as the percentage of correct outputs (correct classifications, etc.) with regard to the total number of inputs, after the learning process has finished.

7.1.2 Feature Extraction

The process of extracting the useful information and translating the pattern into a vector is called feature extraction (see [Bis95] pp. 6–9). This process may be manual, automatic or even done by another neural network, and it takes place before entering the actual neural network.

Usually the straight usage of a pattern (like taking all the pixels of an image and transforming them into a vector) may end up in a very large pattern vector. This may pose problems to the neural network in the process of classification, because the training set is limited.

Assume that the pattern vector is unidimensional, x = (x_1), and there are only 2 classes, such that if x_1 is less than some threshold value x̃_1 then x is of class C_1, otherwise it is of class C_2. See figure 7.2-a. Now add a new feature/dimension to the pattern vector, x_2, such that it becomes bidimensional: x = (x_1, x_2). There are 2 cases: either x_2 is relevant to the classification or it is not.

If x_2 is not relevant to the classification (the classification does not depend on its value), then it shouldn't have been added; it just increases the useless/useful data ratio, i.e. it increases the noise (useless data is noise, and each x_i component may carry some noise embedded in its value). Adding more irrelevant components may increase the noise to a level where it exceeds the useful information.

If x_2 is relevant, then it must have a threshold value x̃_2 such that if x_2 is lower then the pattern is classified into one class, say C_1 (the numbering is irrelevant for the argument: classes may be renumbered in a convenient way), otherwise it is of class C_2. Now, instead of just 2 cases to be considered (x_1 less or greater than x̃_1), there are four cases (for x_1 and x_2). See figure 7.2-b. The number of patterns in the training set being constant, the number of training patterns per case has halved (assuming a large number of training vectors spread evenly). A further addition of a new feature, x_3, increases the number of cases to 8. See figure 7.2-c.

In general, the number of cases to be considered increases exponentially with the dimension of the pattern vector (2, 4, 8, ... in the example above). The training set spreads ever thinner in the pattern space. The performance of the neural network, as a function of the dimension of the pattern space, has a peak, and increasing the dimension further may decrease it. See figure 7.2-d. The phenomenon of performance decreasing as the dimensionality of the pattern space increases is known as the curse of dimensionality.

Figure 7.2: The curse of dimensionality: the increase of the pattern vector dimension may cause a worsening of the neural network performance. Patterns are represented as dots in the pattern space: (a) one dimension, 2 cases; (b) 2 dimensions, 4 cases; (c) 3 dimensions, 8 cases; (d) network performance versus dimension. The same number of training patterns has to be spread "thinner" when there are more dimensions.
7.1.3 Model Complexity

For supervised learning, the training set consists of pairs {x_p, t_p}_{p=1,P}, where t_p is the desired network (target) output vector for the input x_p (see [Bis95] pp. 9–15). The weights W are to be found by trying to minimize an error function E = E(x, t, W). The most widely used error function is the sum-of-squares, defined as:

  E = (1/2) Σ_{p=1}^{P} [y(x_p, W) - t_p]²

another one being the root-mean-square error:

  E_RMS = √[ (1/P) Σ_{p=1}^{P} [y(x_p, W) - t_p]² ]
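Both error measures are straightforward to compute over a data set. A minimal Scilab sketch (names are mine; outputs and targets are stored column-wise, Y(:,p) and T(:,p)):

  // Sum-of-squares and root-mean-square errors over a data set
  function [E, Erms] = errors(Y, T)
      d    = Y - T;
      E    = 0.5 * sum(d.^2);                // sum-of-squares
      Erms = sqrt(sum(d.^2) / size(Y, 2));   // RMS over the P columns
  endfunction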
In solving this problem an arbitrarily complex model may be built. However, there is a trade-off between exception handling and generalization.

✍ Remarks:
➥ Exceptions are patterns which are more likely to be members of one class but in fact belong to another. Misclassification usually happens due to overlapping probabilities (see also figure 7.1), noise in data or missing data.
➥ There may also be some fuzziness between different classes, i.e. they may not be well defined.

A reasonably complex model will be able to handle (recognize/classify) new patterns and consequently to generalize; see figure 7.3-a. A too simple model will have a low performance, with many patterns misclassified; see figure 7.3-b. A too complex model may well handle the exceptions present in the training set but may have poor generalization capabilities; see figure 7.3-c.

Figure 7.3: Model complexity: (a) medium complexity, (b) low complexity, (c) high complexity. The training-set patterns are marked with circles, new patterns are marked with squares, exceptions are marked specially.

One widely used way to control the complexity of the model is to add a regularization term Ω to the error function:

  Ẽ = E + νΩ

where Ω is high for complex models and low for simple models. The ν parameter controls the weight by which Ω influences E.

7.1.4 Classification: Making Decisions and Minimizing Risk

A neural network maps the pattern space X to the classes of patterns. The pattern space is divided into a number K of areas X_k (which may be of any possible form) such that if the pattern vector x ∈ X_k it is classified as being of class k. These areas are named decision regions and the boundaries between them are named decision boundaries. The problem consists in finding the decision boundaries such that the misclassification error is minimized, or the correctness of the classification is maximized.

Theorem 7.1.1. (Bayes rule). Given a pattern vector x to be classified into one of the classes {C_k}_{k=1,K}, the probability of misclassification is minimized if it is classified into the class C_k for which the posterior probability is maximal:

  P(C_k|x) = max_{ℓ=1,K} P(C_ℓ|x)

Proof. The decision boundaries are found by maximizing the correctness of the classification. Consider one finite decision region X_1 ⊂ X such that all pattern vectors belonging to it will be classified as C_1. The probability of making a correct classification if x ∈ X_1 is the joint probability associated with that class and decision region, P(C_1, X_1). Considering two decision regions and two classes: all patterns with their vectors in X_1 and belonging to class C_1 are classified correctly, and the same happens for X_2 and C_2, such that the probability of making a correct classification is the sum of the two joint probabilities, i.e. P(C_1, X_1) + P(C_2, X_2). In general, for K classes and (respectively) decision regions:

  P_correct = Σ_{k=1}^{K} P(C_k, X_k)

The joint probability may be written as the product between the class-conditional and the prior probability (see the statistical appendix): P(C_k, X_ℓ) = P(X_ℓ|C_k) P(C_k); also P(X_ℓ|C_k) = ∫_{X_ℓ} p(x|C_k) dx; then:

  P_correct = Σ_{k=1}^{K} P(X_k|C_k) P(C_k) = Σ_{k=1}^{K} ∫_{X_k} p(x|C_k) P(C_k) dx

Each X_k should be chosen such that, inside it, the integrand p(x|C_k) P(C_k) is greater than any other integrand p(x|C_ℓ) P(C_ℓ), for ℓ ≠ k:

  p(x|C_k) P(C_k) > p(x|C_ℓ) P(C_ℓ) , ∀ℓ ∈ {1,...,K | ℓ ≠ k}
  ⇔ p(x|C_k) P(C_k)/p(x) > p(x|C_ℓ) P(C_ℓ)/p(x)   (because p(x) > 0)
  ⇔ P(C_k|x) > P(C_ℓ|x)   (from the Bayes theorem) ∎

✍ Remarks:
➥ The decision boundaries are placed at the points where the highest posterior probability P(C_k|x) becomes smaller than another P(C_ℓ|x). See figure 7.4.

Figure 7.4: Decision boundary for a unidimensional pattern space and two classes: the boundary between X_1 and X_2 lies where P(C_1|x) and P(C_2|x) cross.

Definition 7.1.1. Given a mapping (classification procedure) M : X → {1,...,K,D}, the probability of misclassifying a vector x of class C_k is:

  P^mc(k) = P(M(x) ≠ C_k, M(x) ∈ {C_1,...,C_K} | x ∈ C_k) = Σ_{ℓ=1,ℓ≠k}^{K} P(X_ℓ|C_k)

The doubt probability P^d(k) is defined similarly:

  P^d(k) = P(M(x) = D | x ∈ C_k) = ∫_{X_D} p(x|C_k) dx

The total doubt probability, i.e. the probability for a pattern from any class to be unclassified (classified as doubt), is:

  P(D) = Σ_{k=1}^{K} P^d(k) P(C_k)

More generally, the decision boundaries may be defined with the help of a set of discriminant functions {y_k(x)}_{k=1,K}, such that a pattern vector x is assigned to class C_k if:

  y_k(x) = max_{ℓ=1,K} y_ℓ(x)

and in the particular case y_k(x) = P(C_k|x), as in theorem 7.1.1.
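Computationally, classifying by discriminant functions is just an argmax. A minimal Scilab sketch (function name and values are mine):

  // Classification by discriminants: the class with the largest y_k(x) wins;
  // with y = posteriors this is exactly the Bayes rule of theorem 7.1.1
  function k = classify(y)
      [m, k] = max(y);
  endfunction

  disp(classify([0.1; 0.7; 0.2]))   // 2 : the maximum posterior picks class 2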
In particular cases (in practice, e.g. in the medical field) it may be necessary to increase the penalty of misclassifying a pattern from one class into another. Then a loss matrix defining penalties for misclassification is used: let L_kℓ be the penalty associated with misclassifying a pattern belonging to class C_k as being of class C_ℓ.

✍ Remarks:
➥ Penalty may be associated with risk: a high penalty means high risk in case of misclassification. The diagonal elements of the loss matrix should be L_kk = 0 because in this case there is no misclassification, so no penalty.

The penalty associated with misclassifying a pattern x ∈ C_k into a particular class C_ℓ is L_kℓ multiplied by the probability of misclassification P(X_ℓ|C_k) (the probability that the pattern vector is in X_ℓ but is of class C_k):

  R_kℓ = L_kℓ P(X_ℓ|C_k) = L_kℓ ∫_{X_ℓ} p(x|C_k) dx

✍ Remarks:
➥ The loss term L_kℓ has the role of amplifying the effect of the misclassification probability P(X_ℓ|C_k).

The total penalty associated with misclassifying a pattern x ∈ C_k into any other class is the sum, over all classes, of the penalties for misclassifying x into another class:

  R_k = Σ_{ℓ=1}^{K} R_kℓ = Σ_{ℓ=1}^{K} L_kℓ P(X_ℓ|C_k) = Σ_{ℓ=1}^{K} L_kℓ ∫_{X_ℓ} p(x|C_k) dx    (7.1)

or, considering the matrix P_{X|C} = {P(X_ℓ|C_k)}_{ℓ,k} (ℓ row, k column index):

  R_k = L(k,:) P_{X|C}(:,k)

(the R_k are the diagonal terms of the L P_{X|C} product).

The total penalty for misclassification is the sum of the penalties R_k, each multiplied by the probability that such a penalty may occur, i.e. P(C_k). Defining the vector P_C = (P(C_1) ... P(C_K))ᵀ, then:

  R = Σ_{k=1}^{K} R_k P(C_k) = P_Cᵀ [(L ⊙ P_{X|C}ᵀ) 1̂] = Σ_{ℓ=1}^{K} ∫_{X_ℓ} [ Σ_{k=1}^{K} L_kℓ p(x|C_k) P(C_k) ] dx    (7.2)

Proof. R represents the scalar product between P_C and Rᵀ = (R_1 ... R_K). The elementwise product L ⊙ P_{X|C}ᵀ creates the elements of the sum appearing in (7.1), while the multiplication by 1̂ sums that matrix over rows. ∎

The penalty is minimized when the X_ℓ areas are chosen such that the integrand in (7.2) is minimal:

  x ∈ X_ℓ  ⇒  Σ_{k=1}^{K} L_kℓ p(x|C_k) P(C_k) = min_m Σ_{k=1}^{K} L_km p(x|C_k) P(C_k)    (7.3)

✍ Remarks:
➥ (7.3) is equivalent to theorem 7.1.1 if the penalty is 1 for any misclassification, i.e. L_kℓ = 1 - δ_kℓ (δ_kℓ being the Kronecker symbol).

Proof. Indeed, in this case (7.3) becomes:

  Σ_{k=1,k≠ℓ}^{K} p(x|C_k) P(C_k) < Σ_{k=1,k≠m}^{K} p(x|C_k) P(C_k)  for x ∈ X_ℓ , ∀m ≠ ℓ

and, by subtracting the identity Σ_{k=1}^{K} p(x|C_k) P(C_k) = Σ_{k=1}^{K} p(x|C_k) P(C_k) from the above, finally:

  p(x|C_ℓ) P(C_ℓ) > p(x|C_m) P(C_m) , ∀m ∈ {1,...,K | m ≠ ℓ}

which is equivalent to P(C_ℓ|x) > P(C_m|x) (see the proof of theorem 7.1.1). ∎

In general, most misclassifications occur in the vicinity of the decision boundaries, where the difference between the top-most posterior probability and the next one is relatively low. If the penalty/risk of misclassification is very high, then it is better to define a reject area around the decision boundary, such that all pattern vectors inside it are rejected from the classification process (to be analyzed by a higher instance, e.g. a human or a slower but more accurate ANN), i.e. they are classified into class D. See figure 7.5.

Figure 7.5: Reject area around a decision boundary between two classes, in a unidimensional pattern space.

Considering the doubt class D as well, with the associated loss L_kD, the risk terms change to:

  R_k = Σ_{ℓ=1}^{K} R_kℓ + L_kD P^d(k)

and the total penalty (7.2) also changes accordingly. To accommodate the reject area and the loss matrix, the Bayes rule 7.1.1 is changed as in the proposition below.

Proposition 7.1.1. (Bayes rule with reject area and loss matrix). Let a pattern vector x be classified into one of the classes {C_k}_{k=1,K} or D. Consider the loss matrix {L_kℓ} and also the loss L_kD = d = const., ∀k, for the doubt class. Then:

1. Neglecting loss, the best classification is obtained when x is classified as C_k if

  P(C_k|x) = max_{ℓ=1,K} P(C_ℓ|x) > P(D|x)

or x is classified as D otherwise, i.e. when P(C_ℓ|x) < P(D|x), ∀ℓ = 1,K.

2. Considering loss, the best classification is obtained when x is classified as C_k if

  R_k = min_{ℓ=1,K} R_ℓ < R_D

or x is classified as D otherwise, i.e. when R_ℓ > R_D, ∀ℓ = 1,K, where R_D is the risk associated with a classification into the doubt category.

Proof. 1. This is simply an extension of theorem 7.1.1, considering a supplementary class D with an associated decision area X_D. 2. This represents just the rule of classification according to minimum risk. ∎
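A hedged Scilab sketch of the minimum-risk decision with a doubt option (names and values are mine; L(k,l) is the loss for a true class k decided as l, post holds the posteriors P(C_k|x) and d0 the doubt cost):

  // Minimum-risk classification with reject: risk(l) = sum_k L(k,l) P(C_k|x)
  function c = min_risk_class(post, L, d0)
      risk = L' * post;
      [rmin, c] = min(risk);
      if rmin > d0 then c = -1; end     // -1 stands for the doubt class D
  endfunction

  L = [0 1; 3 0];                       // misclassifying class 2 is costly
  disp(min_risk_class([0.9; 0.1], L, 0.5))   // 1 : risk 0.3 is below the doubt cost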
Another useful tool for estimating the capabilities of a classifier is the confusion matrix, whose elements are defined as:

  e_kℓ = P(x classified as C_ℓ | x ∈ C_k)

As a simple geometrical representation, the element e_kℓ represents the integral of p(C_k|x) over the decision area of class C_ℓ:

  e_kℓ = ∫_{X_ℓ} p(C_k|x) dx

See figure 7.6. Note that e_kk represents the probability of a correct classification for class C_k.

Figure 7.6: The graphical representation of the elements of the confusion matrix: the hatched area (the part of p(C_1|x) lying beyond the decision boundary, in X_2) represents the element e_12.

➧ 7.2 Likelihood Function

7.2.1 The Discriminant Functions

Instead of the most obvious discriminant function y_k(x) = P(C_k|x), its logarithm may be chosen:

  y_k(x) = ln P(C_k|x) = ln [p(x|C_k) P(C_k) / p(x)] = ln p(x|C_k) + ln P(C_k) + const.

(see the Bayes theorem and theorem 7.1.1); p(x), being class-independent (a normalization factor), contributes just an additive constant.

✍ Remarks:
➥ Consider each class-conditional probability density p(x|C_k) as an independent multidimensional Gaussian distribution (see the statistical appendix):

  p(x|C_k) = 1/[(2π)^{N/2} √|Σ_k|] · exp[ -(x - μ_k)ᵀ Σ_k^{-1} (x - μ_k)/2 ]

(each having its own μ_k and Σ_k parameters); then:

  y_k(x) = -(x - μ_k)ᵀ Σ_k^{-1} (x - μ_k)/2 - (1/2) ln |Σ_k| + ln P(C_k) + const.

Let Σ_k = Σ, ∀k ∈ {1,...,K}. Then ln |Σ| is class-independent (constant) and xᵀ Σ^{-1} μ_k = μ_kᵀ Σ^{-1} x. Eventually:

  y_k(x) = (μ_kᵀ Σ^{-1}) x - μ_kᵀ Σ^{-1} μ_k / 2 + ln P(C_k) - xᵀ Σ^{-1} x / 2 + const.

Because xᵀ Σ^{-1} x is class-independent, it may be dropped from the discriminant function y_k(x) (being an additive term, equal for all discriminants). Eventually:

  y_k(x) = (μ_kᵀ Σ^{-1}) x - μ_kᵀ Σ^{-1} μ_k / 2 + ln P(C_k)

Let M be the matrix built using the μ_k vectors as columns, and let the matrix W be defined by a first column holding the bias terms

  w_{0k} = -μ_kᵀ Σ^{-1} μ_k / 2 + ln P(C_k)

and the remaining block W(1:K, 1:N) = Mᵀ Σ^{-1}; with x̃ᵀ = (1 x_1 ... x_N), the discriminant functions may then be written simply as y = W x̃ (i.e. the general sought form y = y(W, x)). The above equation represents a linear form, such that the decision boundaries are hyperplanes. The equation describing the hyperplane decision boundary between classes C_k and C_ℓ is found from the condition y_k(x) = y_ℓ(x). See figure 7.7.

Figure 7.7: The linear discriminant function for a bidimensional probability density and two classes: at the upper-left corner p_1 > p_2, at the lower-right corner p_1 < p_2; the line y_1(x) = y_2(x) separates C_1 from C_2.
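A hedged Scilab sketch of the Gaussian linear discriminant above (names are mine; mu(:,k) are the class means, S the shared covariance, prior(k) the priors):

  // Linear discriminant y_k(x), up to the class-independent constant
  function y = lin_discriminant(x, mu, S, prior)
      Si = inv(S);
      K  = size(mu, 2);
      y  = zeros(K, 1);
      for k = 1:K
          y(k) = mu(:,k)'*Si*x - mu(:,k)'*Si*mu(:,k)/2 + log(prior(k));
      end
  endfunction

  mu = [0 2; 0 2];  S = eye(2,2);  prior = [0.5 0.5];
  disp(lin_discriminant([1.5; 1.5], mu, S, prior))  // the larger value wins (class 2)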
➥ Assuming a unidimensional Gaussian distribution, then:

  \tilde{\mu} = \frac{1}{P} \sum_{p=1}^{P} x_p  and  \tilde{\sigma}^2 = \frac{1}{P} \sum_{p=1}^{P} (x_p - \tilde{\mu})^2

Considering \tilde{\mu} \simeq \mu and \sigma the true value of the standard deviation, the expectation of the assumed variance, compared to the true one, is:

  E{\tilde{\sigma}^2} = \frac{P-1}{P} \sigma^2

Proof (sketch). Substituting \tilde{\mu} into \tilde{\sigma}^2 and taking the expectation over the Gaussian density, with the change of variable y_p = (x_p - \tilde{\mu})/\sigma (note that \tilde{\mu} itself depends on all the x_q, which removes one degree of freedom), the remaining Gaussian integrals are evaluated by parts, similarly to the calculus of E[(x - \mu)^2] (see the unidimensional Gaussian distribution in the statistical appendix), and give E{\tilde{\sigma}^2} = \sigma^2 (P-1)/P.

➥ The probability distribution built this way has a bias which tends to 0 for P \to \infty. In this case the Gaussian distribution is also suited for sequential parameter estimation: assuming that not all patterns from the training set are known at once, new patterns may be added later, as they become available, i.e. the W parameters may be adapted to the enlarged training set. For the Gaussian distribution, adding a new pattern changes \tilde{\mu}_P = \frac{1}{P} \sum_{p=1}^{P} x_p to the new value:

  \tilde{\mu}_{P+1} = \frac{1}{P+1} \sum_{p=1}^{P+1} x_p = \tilde{\mu}_P + \frac{x_{P+1} - \tilde{\mu}_P}{P+1}

For multiple classes the likelihood function is defined as:

  L(W) = \prod_{p=1}^{P} p(x_p|W) = \prod_{k=1}^{K} \prod_{x_p \in C_k} p(x_p|C_k; W) P(C_k|W)

Maximizing the likelihood function is equivalent to minimizing its negative logarithm, which is the usual way in practice. Considering that in the training set there are P_k vector patterns for each class C_k, then:

  E = -\ln L = -\sum_{k=1}^{K} \sum_{p=1}^{P_k} \ln p(x_{(k)p}|C_k; W) - \sum_{k=1}^{K} P_k \ln P(C_k|W)    (7.5)

where x_{(k)p} \in C_k. For a given training set, the above expression may be reduced to:

  E = -\sum_{k=1}^{K} \sum_{p=1}^{P_k} \ln p(x_{(k)p}|C_k; W) + const.    (7.6)

Proof. The expression is reduced by using the Lagrange multipliers method (see the mathematical appendix), using as minimization condition the normalization of P(C_k|W):

  \sum_{k=1}^{K} P(C_k|W) = 1    (7.7)

which leads to the Lagrange function:

  L = E + \lambda ( \sum_{k=1}^{K} P(C_k|W) - 1 )

and then:

  \frac{\partial L}{\partial P(C_k|W)} = -\frac{P_k}{P(C_k|W)} + \lambda = 0,  k = 1,K  \Rightarrow  P(C_k|W) = \frac{P_k}{\lambda}

Replacing into (7.7) gives \lambda = \sum_{k=1}^{K} P_k = P, so P(C_k|W) = P_k/P. As the training set is fixed, the {P_k} are constant, and the likelihood (7.5) becomes (7.6).

As can be seen, formula (7.5) contains sums both over classes and inside a class. If gathering data is easy but classifying it (by some supervisor) is difficult ("expensive"), then the function (7.6) may be replaced by:

  E = -\sum_{k=1}^{K} \sum_{p=1}^{P_k} \ln p(x_{(k)p}|C_k; W) - \sum_{p=1}^{P'} \ln p(x'_p|W)

The unclassified training set {x'_p} may still be very useful in finding the appropriate set of W parameters.

✍ Remarks:
➥ The maximum likelihood method is based on finding the W parameters for which the likelihood function L(W) is maximal, i.e. where its derivative is 0: the estimates \tilde{W} are roots of the derivative, and the Robbins-Monro algorithm (see the statistical appendix) may be used to find them. The maximum of the likelihood function L(W) (see equation (7.4)) may be found from the condition:

  \nabla_W [ \prod_{p=1}^{P} p(x_p|W) ]_{W=\tilde{W}} = 0    (7.8)

where \nabla_W = (\partial/\partial w_1 ... \partial/\partial w_{N_W})^T, N_W being the dimension of the W parameter space. Because ln is a monotone function, \ln L may be used instead of L, and the above condition becomes:

  \frac{1}{P} \nabla_W [ \sum_{p=1}^{P} \ln p(x_p|W) ]_{\tilde{W}} = 0

(the factor 1/P is constant and does not affect the result). Considering the vectors from the training set randomly selected, then:

  \lim_{P \to \infty} \frac{1}{P} \nabla_W [ \sum_{p=1}^{P} \ln p(x_p|W) ] = E{\nabla_W \ln p(x|W)}

and the condition (7.8) becomes:

  E{\nabla_W \ln p(x|W)} = 0

The roots \tilde{W} of this function are found using the Robbins-Monro algorithm.

➥ It is not possible to choose the W parameters simply by maximizing the likelihood function L(W) = \prod_{p=1}^{P} p(x_p|W) (see (7.4)) over an unrestricted family of models, because L(W) may be increased indefinitely by overfitting the training set, reducing the estimated probability density to something similar to a Dirac \delta function: concentrated at the training set points and 0 elsewhere.

➧ 7.3 Statistical Models

It is important to note that the statistical model built as p(x|W) generally differs from the true probability density p_true(x), which is also independent of the W parameters. The estimated probability density gives the best fit of p_true(x) for some parameters W_0. These parameters could be found given an infinite training set; in practice, as the learning set (P) is finite, only an estimate \tilde{W} of W_0 may be found.

It is possible to build a function measuring the "distance" between the estimated and the real probability densities; the W parameters then have to be found such that this function is minimal. Consider the expected value of the minus logarithm of the likelihood function:

  E{-\ln L} = -\lim_{P \to \infty} \frac{1}{P} \sum_{p=1}^{P} \ln p(x_p|W) = -\int_X \ln[p(x|W)] \, p_true(x) dx

which, for p(x|W) = p_true(x), has the value -\int_X p_true(x) \ln p_true(x) dx. Then the following function, called the asymmetric divergence, also known as the Kullback-Leibler distance (see [Rip96] pp. 32-34), is defined:

  L = E{-\ln L} + \int_X p_true(x) \ln p_true(x) dx = -\int_X \ln \frac{p(x|W)}{p_true(x)} \, p_true(x) dx    (7.9)
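A minimal numeric sketch of (7.9): the divergence between an assumed "true" density and a model density, both chosen (arbitrarily) as unidimensional Gaussians and integrated on a grid.

    function p = gauss(x, mu, sigma)
        p = exp(-(x - mu).^2 / (2 * sigma^2)) / (sigma * sqrt(2 * %pi));
    endfunction

    x  = linspace(-10, 10, 2001);
    dx = x(2) - x(1);
    pt = gauss(x, 0, 1);              // the assumed "true" density
    pm = gauss(x, 0.5, 1.5);          // the model p(x|W)

    Ldiv = -sum(log(pm ./ pt) .* pt) * dx;
    mprintf("asymmetric divergence L = %g\n", Ldiv);   // >= 0; 0 only for pm = pt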
Due to the limitation of training set, in general W f 6= W0 of (\optimal") parameters W f ;;;;! W0 . but at the limit P ! 1 (P being the number of training patterns) W P !1 Considering the Nabla operator r and the negative logarithm of likelihood E : r = T ❖ J ,K ; @  @w1 @  @wW ; E = ; ln L( ) = ; ln P Y W (xp j ) p p=1 W then the following matrices are de ned: J ; = E rr T  E = ;E ( P X ; p=1  ) rr ln (xp j ) = ; T p W P X ; p=1  rr ln (xp j ) T p W0 7.3. STATISTICAL MODELS and K = =  E (r P X p=1 E )( r r ln E) T 127 8 ! P !T 9 P < X = X =E r ln p(xjW ) r ln p(xjW ) : p=1 ; p=1 ! p( P X xjW0 ) p=1 r ln p( xjW0 ) !T f ; W0 by the For P suciently large it is possible to approximate the distribution of W (Gaussian) normal distribution NN (0b; J ;1 K J ;1) (here and below W are seen as vectors, i.e. column matrices). E is minimized with respect to W , then: Proof. rE jWf = 0b ) P X p=1 f) = 0 b r ln p(xp jW f and W0 are suciently close and a Taylor series development may Considering P reasonably large then W be done around W0 for rE jW f: f ; W0 ) + O ((W f ; W0 )2 ) rE jWf = rE jW0 + H jW0 (W ;  f and W0 are to be seen as where H = rrT E and is named the Hessian. Note that here and below W b 0= vectors (column matrices). Finally: f ; W0 ' H ;1 jW rE jW W 0 0 ❖ H f ; W0 g = lim H ;1 jW rE jW = 0 b ) EfW 0 0 P !1 f ;;;;! W0 . as W P !1 Also: n o n o f ; W0 )(W f ; W0 )T = E H ;1 rE rTEH ;1T = J ;1 KJ ;1 E (W (7.11) (by using the matrix property (AB )T = B T AT ). De nition 7.3.1. The deviance D of a pattern vector is de ned as being twice the ex- pectancy of log-likelihood of the best model minus the log-likelihood of current model, the best model being the true model or an exact t, also named a saturated model: D Efln =2 f )g x) ; ln p(xjW ptrue ( The deviance may be approximated by: D Proof. '2 L + Tr(K J ;1 ) f ) around W0 : Considering the Taylor series development of ln p(xjW T c ; W0 ) + 1 (W f ) ' ln p(xjW0 ) + rT ln p(xjW )jW (W f f ln p(xjW 0 2 ; W0 ) H jW0 (W ; W0 ) It is assumed that the gradient of asymptotic divergence (7.10)(see also (7.9)) is zero at W0 : rL = Efr ln p(xjW0)g = 0b (as it hits an minima). The following matrix relation is also true: ; f ; W0 )T H jW (W f ; W0 ) = Tr H jW (W f ; W0 )(W f ; W0 )T (W 0 0  deviance 128 CHAPTER 7. PATTERN RECOGNITION (may be proven by making H jW0 diagonal, using its eigenvectors, is a symmetrical matrix, see mathematical appendix) and then, from the de nition, the deviance approximation is: and nally, using also (7.11): f ; W0)(Wf ; W0)T )g D ' 2L ; EfTr(H jW0 (W h i f ; W0 )(Wf ; W0)T g D ' 2L + Tr J Ef(W ❖ DN L + Tr(KJ ;1 ) Considering a sum over training examples, instead of integration over whole X , the deviance DN for a given training set may be approximated as P DN ' 2 ln p (xp ) + Tr(KJ ; p(xp jW ) p X =1 information criterion =2 true f ( 1) ) where the left term of above equation is named information criterion. CHAPTER 8 Single Layer Neural Networks ➧ 8.1 Linear Separability 8.1.1 Discriminant Functions Two Classes Case Let consider the problem of classi cation in two classes with linear decision boundary such that the classes are separated by a hyperplane in the pattern space. Then the discriminant function is the equation describing that hyperplane and so it is linear in x. See also gure 8.1 on the following page. y(x) = wT x + w0 (8.1) The w is the parameter vector and w0 is named bias . 
For y(x) > 0 the pattern x is assigned to one class, for y(x) < 0 it is assigned to the other class, the hyperplane being de ned by y(x) = 0. Considering two vectors x1 and x2 contained within hyperplane (decision boundary) then y(x) = y(x2 ) = 0. From (8.1) wT (x1 ; x2 ) = 0, i.e. w is normal to any vector in the hyperplane and then it is normal to the hyperplane y(w) = 0. See gure 8.1 on the next page. The distance between the origin and the hyperplane is given by the scalar product between a versor perpendicular on the plane kwwk and a vector pointing to a point in the plane x, 8.1 See [Bis95] pp. 77-89. 129 ❖ w , w0 130 CHAPTER 8. SINGLE LAYER NEURAL NETWORKS x2 ( )=0 y w w C1 ; kw0k w Figure 8.1: x1 C2 Linear discriminant in a two dimensional pattern space with two classes. de ned by y(x) = 0. Then: distance = ❖ N T 0 kwk = ; kwk w x w such that the bias de nes the shift of the hyperplane from origin. Considering N to be the dimension of the pattern space, the whole classi cation problem may be transferred to a N + 1 dimensional pattern space by considering the translations x ! x = (1; x) and w ! w = (w0 ; w) e e such that the discriminant equation becomes: T y (x) = w x ee de ning a hyperplane in the N + 1 dimension space, passing trough the origin (bias is 0 now). The whole process may be expressed in terms of one neuron which have N + 1 inputs and e one output being the weighted sum of its inputs y(x) = 1  w0 + on the facing page. Multiple Classes Case Let consider several classes fC g k k N i =1 wi xi . See gure 8.2 and one linear discriminant function for each class: =1;K ( ) = wT x + yk x P k wk0 ; k =1 (8.2) ;K such that a pattern x is assigned to class C if y (x) = max y (x), K being the dimension k k ` ❖ K =1;K ` of the output space. The decision boundary between classes C and C is given by the equation y (x) = y (x): ( w ; w ) T x + (w 0 ; w 0 ) = 0 k k ` ` k k ` ` 8.1. LINEAR SEPARABILITY 131 1 x1 w0 xN w1 wN e y (x) Figure 8.2: A single, simple, neuron performs the weighted sum of its inputs and may act as a linear classi er. e x0 =1 e 1 e y1 (x) Figure 8.3: e x1 xN 2 K e e y2 (x) yK (x) A single layer of neurons may act as a linear classi er for classes. The connection between input i and neuron k e . is weighted by we being the component i of vector w K ki k which is the equation of a hyperplane in the pattern space. The distance from the hyperplane (k; `) to the origin is: distance( k;`) = ; kw ;; w k wk0 w`0 k ` Similarly to the previous two classes case it is possible to move to the N + 1 space by the transformation: e = (1; x) and w ! w e = (w 0 ; w ) ; k = 1 ; K x ! x k k k k and then the whole process may be represented by a neural network having one layer of K neurons and N + 1 inputs. See gure 8.3. e T as rows then network output is simply written Note that if a matrix W is built using w as: k e e y (x ) = W x The training of the network consists in nding the adequate W . A new vector x is assigned to the class C for which the corresponding neuron k have the biggest output y (x). k k ❖ W 132 linear separability CHAPTER 8. SINGLE LAYER NEURAL NETWORKS De nition 8.1.1. Considering a set of pattern vectors, they are called linearly separable if they may be separated by a set of hyperplanes as decision boundaries in the pattern space. Proposition 8.1.1. The regions Xk de ned by linear discriminant functions yk (x) are simply connected and convex. Let consider two vector patterns: x x 2 . Proof. Then a; yk (x ) = max (x ) and a `=1;K y` a yk Xk (x ) = max (x ). 
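A minimal Scilab sketch of this classifier; the weight matrix below is invented, standing in for a trained W:

    W = [ 0.5  1 -1 ;                 // row k = (w_k0, w_k^T), invented values
         -0.5 -1  1 ];

    x  = [0.2; 0.7];                  // pattern vector, N = 2
    xt = [1; x];                      // x~ = (1, x): bias absorbed as weight
    y  = W * xt;                      // all K discriminants at once
    [ymax, k] = max(y);
    mprintf("x assigned to class C%d\n", k);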
Definition 8.1.1. A set of pattern vectors is called linearly separable if the classes may be separated by a set of hyperplanes acting as decision boundaries in the pattern space.

Proposition 8.1.1. The regions X_k defined by the linear discriminant functions y_k(x) are simply connected and convex.

Proof. Consider two vector patterns x_a, x_b \in X_k. Then y_k(x_a) = \max_{\ell=1,K} y_\ell(x_a) and y_k(x_b) = \max_{\ell=1,K} y_\ell(x_b). Any point on the line connecting x_a and x_b may be defined by the vector:

  x_c = t x_a + (1 - t) x_b,  t \in [0, 1]

(see also Jensen's inequality in the mathematical appendix). Because the y_\ell are linear, y_k(x_c) = t y_k(x_a) + (1 - t) y_k(x_b), and then y_k(x_c) = \max_{\ell=1,K} y_\ell(x_c), i.e. x_c \in X_k. Then any line connecting two points of the same X_k is contained in the X_k domain, hence X_k is convex and simply connected.

8.1.2 Neuronal Memory Capacity

(See [Rip96] pp. 119-120.)

Consider one neuron with N inputs and one output, which has to learn P pattern vectors. All input vectors belong to one of two classes, C_1 or C_2, and the output of the neuron indicates to which class the input belongs (e.g. y(x) = 1 for x \in C_1 and y(x) = -1 for x \in C_2).

Considering the input vectors as points in the R^N space, the neuron may learn only those cases where the inputs are linearly separable by a hyperplane. As the number of linearly separable cases is limited, so is the learning capacity/memory of a single neuron (and, of course, the learning capacity of the network is limited as well).

Consider P fixed points in R^N, in general position, i.e. for N > 2 no subset of N or fewer points is linearly dependent. Either of these points may belong to class C_1 or C_2 (each point independently of the others), so the total number of combinations is 2^P. Of these 2^P cases some are linearly separable and some are not; let F(P, N) be the number of linearly separable cases. Then:

  Probability of linear separability = \frac{F(P, N)}{2^P}

Proposition 8.1.2. The number of linearly separable cases is given by:

  F(P, N) = 2 \sum_{i=0}^{N} \binom{P-1}{i}    (8.3)

(where 0! = 1 by definition).

Proof. It is proven by induction. A hyperplane in R^N is defined by the equation a^T x + b = 0 (where x is a point contained in the hyperplane), i.e. it is defined by N+1 parameters.

Consider first the case P <= N+1. Then the set of equations:

  a^T x_i + b = y_i,  with y_i > 0 on one side of the hyperplane, y_i < 0 on the other side, y_i = 0 if contained in the hyperplane

defines the hyperplane parameters. As there are N+1 parameters and at most N+1 equations, and the points are in general position, the system of equations with unknowns a and b (the hyperplane parameters) always has a solution, i.e. there is always a way to separate the two classes with a hyperplane (there may be several solutions); then:

  F(P, N) = 2^P  for  P <= N+1

Now consider the case P > N+1. A recurrence formula is sought for F(P+1, N). Consider that P linearly separable points are already "in position", i.e. one case from F(P, N), and a new one is added. There are two cases to be considered:

• The new point lies on one side only of every separating hyperplane (of the P-point set). Then the new set is separable only if the new point is on the "correct" side, i.e. the same side as its class.

• If the above is not true, then it is possible to choose a separating hyperplane passing through the new point. Then, no matter to which class the new point is assigned, the new set is linearly separable. Considering again only the P-point set and the hyperplane chosen above: if all points are projected onto a (new) hyperplane perpendicular to the separating one, then the points in the new hyperplane are linearly separated (by the hyperline given by the intersection of the two hyperplanes). This means that the number of possibilities in this situation is F(P, N-1), and the number of combinations is 2F(P, N-1), as the (P+1)-th point may be assigned to either class.

Finally, the first case analyzed above gives F(P, N) minus the number of possibilities of the second case (i.e. F(P, N-1)), and the second case gives 2F(P, N-1). Thus the wanted recurrence formula is:

  F(P+1, N) = [F(P, N) - F(P, N-1)] + 2F(P, N-1) = F(P, N) + F(P, N-1)    (8.4)

Induction: for F(P, N-1), from (8.3), the expression is:

  F(P, N-1) = 2 \sum_{i=0}^{N-1} \binom{P-1}{i} = 2 \sum_{i=1}^{N} \binom{P-1}{i-1}

and then, using (8.4), (8.3) and the above equation, the expression for F(P+1, N) is:

  F(P+1, N) = F(P, N) + F(P, N-1) = 2 \sum_{i=0}^{N} \binom{P-1}{i} + 2 \sum_{i=1}^{N} \binom{P-1}{i-1} = 2 \sum_{i=0}^{N} \binom{P}{i}

i.e. of the same form as (8.3) (the property \binom{P-1}{i} + \binom{P-1}{i-1} = \binom{P}{i} was used here; see the mathematical appendix).

For P = 4 and N = 2 the total number of cases is 2^4 = 16, out of which 14 are linearly separable. One of the two cases which may not be linearly separated is depicted in figure 8.4; the other one is its mirror image. So formula (8.3) also checks for an initial case.

The probability of linear separability is then:

  P_linear separability = 1  for P <= N+1,  and  P_linear separability = \frac{1}{2^{P-1}} \sum_{i=0}^{N} \binom{P-1}{i}  for P > N+1

and then:

  P_linear separability > 0.5  for P < 2(N+1),  = 0.5  for P = 2(N+1),  < 0.5  for P > 2(N+1)

i.e. the memory capacity of a single neuron is around 2(N+1).

Figure 8.4: The XOR problem. The vectors marked with black circles, (0,0) and (1,1), are from one class; the vectors marked with white circles, (0,1) and (1,0), are from the other class. C_1 and C_2 are not linearly separable.

✍ Remarks:
➥ As the points (pattern vectors) from the same class are usually (to some extent) correlated, the memory capacity of a single neuron is much higher than the above reasoning suggests.
➥ A simple classification problem where the pattern vectors are not linearly separable is the exclusive-or (XOR), in the bidimensional space: the vectors (0,0) and (1,1) are from one class (0 xor 0 = 0, 1 xor 1 = 0), while the vectors (0,1) and (1,0) are from the other (1 xor 0 = 1, 0 xor 1 = 1). See figure 8.4.
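Formula (8.3) and the resulting probability are easily tabulated; the Scilab sketch below checks, among others, the F(4, 2) = 14 value quoted above and the 0.5 probability at P = 2(N+1):

    function F = sepcases(P, N)           // formula (8.3)
        F = 0;
        for i = 0:min(N, P - 1)
            F = F + factorial(P - 1) / (factorial(i) * factorial(P - 1 - i));
        end
        F = 2 * F;
    endfunction

    N = 2;
    for P = [3 4 6 8]
        mprintf("P = %d: F = %d, P(separable) = %.3f\n", P, sepcases(P, N), sepcases(P, N) / 2^P);
    end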
8.1.3 Logistic discrimination

The discriminant functions may be generalized by replacing the linear functions (8.2) with a monotone function applied to the linear combination of w_k and x:

  y_k(x) = f(w_k^T x + w_{k0}),  k = 1,K

where f is named the activation function. From the Bayes theorem:

  P(C_k|x) = \frac{p(x|C_k) P(C_k)}{p(x)} = \frac{p(x|C_k) P(C_k)}{\sum_{\ell=1}^{K} p(x|C_\ell) P(C_\ell)} = \frac{1}{1 + \sum_{\ell \neq k} \frac{p(x|C_\ell) P(C_\ell)}{p(x|C_k) P(C_k)}} = \frac{1}{1 + e^{-a}}    (8.5)

where:

  a = \ln \frac{p(x|C_k) P(C_k)}{\sum_{\ell \neq k} p(x|C_\ell) P(C_\ell)}    (8.6)

✍ Remarks:
➥ For the Gaussian model with the same covariance matrix \Sigma for all classes:

  p(x|C_k) = \frac{1}{(2\pi)^{N/2} \sqrt{|\Sigma|}} \exp[ -\frac{1}{2} (x - \mu_k)^T \Sigma^{-1} (x - \mu_k) ],  k = 1,K

and two classes C_1 and C_2, the expression of a becomes a = w^T x + w_0, where (\Sigma being symmetric):

  w = \Sigma^{-1} (\mu_1 - \mu_2)
  w_0 = -\frac{1}{2} \mu_1^T \Sigma^{-1} \mu_1 + \frac{1}{2} \mu_2^T \Sigma^{-1} \mu_2 + \ln \frac{P(C_1)}{P(C_2)}

Then, by choosing the logistic sigmoid activation function f(a) = \frac{1}{1 + e^{-a}} (see figure 8.5), the meaning of the neuron output becomes simply the posterior probability P(C_k|x).

Figure 8.5: The logistic sigmoid activation function, with values in (0, 1) and f(0) = 1/2. The particular case of the step function is drawn with a dashed line.

➥ The logistic sigmoid activation function also has the property of mapping the interval (-\infty, \infty) into [0, 1], thus limiting the neuron output.

Another choice for the activation function of a neuron is the threshold (step) function:

  f(a) = 1  for a >= 0,  f(a) = 0  for a < 0    (8.7)

and the neurons having it are called perceptrons or adalines.

✍ Remarks:
➥ The step function is the particular case of the logistic sigmoid activation function f(a) = \frac{1}{1 + e^{-ca}} when c \to \infty. See figure 8.5.

8.1.4 Binary pattern vectors

A binary pattern vector x has its components x_i \in {0, 1}. Let P_{ki} be the probability that the component i of a vector x \in C_k is x_i = 1; respectively, (1 - P_{ki}) is the probability of x_i = 0. Then:

  p(x_i|C_k) = P_{ki}^{x_i} (1 - P_{ki})^{1-x_i}

also named the Bernoulli distribution. Assuming that the components of the pattern vector x are statistically independent, then:

  p(x|C_k) = \prod_{i=1}^{N} P_{ki}^{x_i} (1 - P_{ki})^{1-x_i}

By taking the discriminant function in the form:

  y_k(x) = \ln p(x|C_k) + \ln P(C_k)

it may be written as y_k(x) = w_k^T x + w_{0k}, where:

  w_{ki} = \ln P_{ki} - \ln(1 - P_{ki}),  i = 1,N  and  w_{0k} = \sum_{i=1}^{N} \ln(1 - P_{ki}) + \ln P(C_k)

Similarly, from the Bayes theorem, see (8.5) and (8.6), for two classes C_1 and C_2, the posterior probability P(C_k|x) may be expressed as the output of a neuron having the logistic sigmoid activation function:

  P(C_1|x) = f(w^T x + w_0) = \frac{1}{1 + \exp[-(w^T x + w_0)]}

where:

  w_i = \ln \frac{P_{1i} (1 - P_{2i})}{P_{2i} (1 - P_{1i})},  i = 1,N  and  w_0 = \sum_{i=1}^{N} \ln \frac{1 - P_{1i}}{1 - P_{2i}} + \ln \frac{P(C_1)}{P(C_2)}

and P(C_2|x) has a similar form (obtainable by swapping 1 and 2).

8.1.5 Generalized linear discriminants

Generalized linear discriminant functions are obtained by replacing x with a vectorial function of it, \varphi: X \to X, having the same dimension:

  y_k(x) = w_k^T \varphi(x) + w_{0k}    (8.8)

and, by switching to the N+1 space:

  y_k(x) = \tilde{w}_k^T \tilde{\varphi}(x)

where \tilde{\varphi}_0(x) \equiv 1. By building the K x (N+1) matrix W, having the weights \tilde{w}_k associated with each output neuron as rows:

  W = ( \tilde{w}_{10} ... \tilde{w}_{1N} ; ... ; \tilde{w}_{K0} ... \tilde{w}_{KN} )

then:

  y(x) = W \tilde{\varphi}(x)

➧ 8.2 The Least Squares Technique

(See [Bis95] pp. 89-98.)

8.2.1 The Error Function

Let {x_p}_{p=1,P} be the training set and {t_p}_{p=1,P} the desired outputs of the network. Then the sum-of-squares error function is:

  E(W) = \frac{1}{2} \sum_{p=1}^{P} \sum_{k=1}^{K} [y_k(x_p; w_k) - t_{kp}]^2    (8.9)

  E(W) = \frac{1}{2} \sum_{p=1}^{P} \sum_{k=1}^{K} [\tilde{w}_k^T \tilde{\varphi}(x_p) - t_{kp}]^2 = \frac{1}{2} \sum_{p=1}^{P} [W\tilde{\varphi}(x_p) - t_p]^T [W\tilde{\varphi}(x_p) - t_p]    (8.10)

where t_{kp} is the component k of the desired vector t_p (here W contains the \tilde{w}_k^T as rows). Considering generalized linear discriminants of the form (8.8), E(W) is a quadratic function of the weights, its derivatives are linear functions of the weights, and then the minimum of the error function may be found exactly, in closed form.

The geometrical interpretation of the error function

Consider the P-dimensional vector \vec{y}_k whose components are the outputs of the same neuron k when the network is presented with each of the training vectors x_p:

  \vec{y}_k^T = ( \tilde{w}_k^T \tilde{\varphi}(x_1) ... \tilde{w}_k^T \tilde{\varphi}(x_P) )

Using the components of the \tilde{\varphi}(x) vectorial function, \tilde{\varphi}(x)^T = (\tilde{\varphi}_0(x) ... \tilde{\varphi}_N(x)), and its values for the training set vectors x_p, it is possible to build the vectors:

  \vec{\varphi}_i^T = ( \tilde{\varphi}_i(x_1) ... \tilde{\varphi}_i(x_P) )

As the component p of the vector \vec{y}_k is {\vec{y}_k}_p = \sum_{i=0}^{N} \tilde{w}_{ki} \tilde{\varphi}_i(x_p), the \vec{y}_k may be written as a linear combination of the \vec{\varphi}_i:

  \vec{y}_k = \sum_{i=0}^{N} \tilde{w}_{ki} \vec{\varphi}_i    (8.11)

(\tilde{w}_{ki} being the component i of vector \tilde{w}_k). The sum of the \vec{y}_k vectors is \vec{y}_total = \sum_{k=1}^{K} \vec{y}_k = \sum_{k=1}^{K} \sum_{i=0}^{N} \tilde{w}_{ki} \vec{\varphi}_i and, again, is a linear combination of the \vec{\varphi}_i. Similarly, the vector \vec{t}_k may be built using the target values t_{kp} for output neuron k, given the input vector x_p:

  \vec{t}_k^T = ( t_{k1} ... t_{kP} )

Finally, the sum-of-squares error function (8.9) may be written as:

  E = \frac{1}{2} \sum_{p=1}^{P} \sum_{k=1}^{K} [ \sum_{\ell=0}^{N} \tilde{w}_{k\ell} \tilde{\varphi}_\ell(x_p) - t_{kp} ]^2 = \frac{1}{2} \sum_{k=1}^{K} \sum_{p=1}^{P} ( {\vec{y}_k}_p - t_{kp} )^2 = \frac{1}{2} \sum_{k=1}^{K} ||\vec{y}_k - \vec{t}_k||^2    (8.12)

Let us make the reasonable assumption that the number of inputs is smaller than the number of sample vectors in the training set, i.e. N+1 <= P (what happens if this is not true is discussed later). Consider the space of dimension P: the set of N+1 vectors {\vec{\varphi}_i} defines a subspace in which all \vec{y}_k are contained, the vector \vec{t}_k being in general not included. Then the sum-of-squares error function (8.12) represents simply the sum of all (squared) distances between \vec{y}_k and \vec{t}_k. See figure 8.6.

\vec{t}_k may be decomposed into two components:

  \vec{t}_{k\parallel} \in S(\vec{\varphi}_0, ..., \vec{\varphi}_N)  and  \vec{t}_{k\perp} \perp S(\vec{\varphi}_0, ..., \vec{\varphi}_N)

where S(\vec{\varphi}_0, ..., \vec{\varphi}_N) is the subspace defined by the set of functions {\vec{\varphi}_i}. See figure 8.6.

Figure 8.6: The error vector \vec{y}_k - \vec{t}_k for output neuron k. S(\vec{\varphi}_0, \vec{\varphi}_1) represents the subspace defined by the set of functions {\vec{\varphi}_i} in a bidimensional case (one input neuron plus bias). \vec{y}_k and \vec{t}_{k\parallel} are included in the S(\vec{\varphi}_0, \vec{\varphi}_1) space.

The minimum of the error function E = \frac{1}{2} \sum_{k=1}^{K} (\vec{y}_k - \vec{t}_k)^T (\vec{y}_k - \vec{t}_k) (see (8.11) and (8.12)) with respect to \tilde{w}_{ki} is found from the condition that its derivatives are zero:

  \frac{\partial E}{\partial \tilde{w}_{ki}} = 0  \Leftrightarrow  \vec{\varphi}_i^T (\vec{y}_k - \vec{t}_k) = 0,  k = 1,K,  i = 0,N    (8.13)

and because \vec{t}_k = \vec{t}_{k\parallel} + \vec{t}_{k\perp} and \vec{\varphi}_i^T \vec{t}_{k\perp} = 0 (by the choice of \vec{t}_{k\parallel} and \vec{t}_{k\perp}), the above condition is equivalent to:

  \vec{\varphi}_i^T (\vec{y}_k - \vec{t}_{k\parallel}) = 0  \Leftrightarrow  \vec{y}_k = \vec{t}_{k\parallel},  k = 1,K    (8.14)

i.e. the {\tilde{w}_{ki}} should be chosen such that \vec{y}_k = \vec{t}_{k\parallel}; see also figure 8.6. Note that there is always a "residual" error due to \vec{t}_{k\perp}. Assuming the network optimized such that \vec{y}_k = \vec{t}_{k\parallel}, \forall k \in {1, ..., K}, then \vec{y}_k - \vec{t}_k = -\vec{t}_{k\perp} and the error function (8.12) becomes:

  E_min = \frac{1}{2} \sum_{k=1}^{K} ||\vec{y}_k - \vec{t}_k||^2 = \frac{1}{2} \sum_{k=1}^{K} ||\vec{t}_{k\perp}||^2

8.2.2 The Pseudo-inverse solution

Build the P x (N+1) matrix \Phi from the \vec{\varphi}_i used as columns:

  \Phi = ( \tilde{\varphi}_0(x_1) ... \tilde{\varphi}_N(x_1) ; ... ; \tilde{\varphi}_0(x_P) ... \tilde{\varphi}_N(x_P) )

and also the P x K matrix T using the \vec{t}_k vectors as columns:

  T = ( t_{11} ... t_{K1} ; ... ; t_{1P} ... t_{KP} )

From the above matrices and (8.11): \vec{y}_k = \Phi W(k,:)^T. Then, using the above notations, the set of minimum conditions (8.14) may be written in matrix form as:

  (\Phi^T \Phi) W^T - \Phi^T T = \tilde{0}

Assuming that the square (N+1) x (N+1) matrix \Phi^T \Phi is invertible, the solution for the weights is:

  W^T = (\Phi^T \Phi)^{-1} \Phi^T T = \Phi^{\dagger} T    (8.15)

where \Phi^{\dagger} is the pseudo-inverse matrix of \Phi (which generally is not square), defined as:

  \Phi^{\dagger} = (\Phi^T \Phi)^{-1} \Phi^T

If \Phi^T \Phi is not invertible then, taking an \varepsilon \in R, the pseudo-inverse matrix may be defined as:

  \Phi^{\dagger} = \lim_{\varepsilon \to 0} (\Phi^T \Phi + \varepsilon I)^{-1} \Phi^T

✍ Remarks:
➥ As the \Phi matrix is built from the {\vec{\varphi}_i} set of vectors, if two of them are parallel (or nearly parallel) then \Phi^T \Phi is singular (or nearly singular): the rank of the matrix will be lower. The case of near-singularity also leads to the large weights necessary to represent the solution \vec{y}_k = \vec{t}_{k\parallel}. See figure 8.7. In the case of two parallel vectors, \vec{\varphi}_i \parallel \vec{\varphi}_\ell, one of them is proportional to the other, \vec{\varphi}_i \propto \vec{\varphi}_\ell, and then they may be combined together in the error function, thus reducing the number of dimensions of the S space.

Figure 8.7: The solution \vec{y}_k = \vec{t}_{k\parallel} in a bidimensional space S(\vec{\varphi}_0, \vec{\varphi}_1). Figure a presents the case of a nearly orthogonal set, figure b the case of a nearly parallel set of {\vec{\varphi}_i} functions; \vec{t}_{k\parallel} and \vec{\varphi}_0 were kept the same in both cases.

➥ By writing explicitly the minimum conditions (8.14) for the biases w_{k0}:

  \frac{\partial E}{\partial w_{k0}} = \sum_{p=1}^{P} [ \sum_{\ell=1}^{N} \tilde{w}_{k\ell} \tilde{\varphi}_\ell(x_p) + w_{k0} - t_{kp} ] = 0

(\tilde{\varphi}_0(x_p) = 1, by the construction of \tilde{\varphi}_0), the bias may be written as:

  w_{k0} = \bar{t}_k - \sum_{\ell=1}^{N} w_{k\ell} \bar{\varphi}_\ell  where  \bar{t}_k = \frac{1}{P} \sum_{p=1}^{P} t_{kp}  and  \bar{\varphi}_\ell = \frac{1}{P} \sum_{p=1}^{P} \tilde{\varphi}_\ell(x_p)

i.e. the bias compensates the difference between the mean of the targeted output \bar{t}_k and the mean of the actual output, over the training set.

➥ If the number of vectors in the training set, P, is equal to the number of inputs, N+1, then the \Phi matrix is square and (assuming it is non-singular) it has an inverse. By multiplying with (\Phi^T)^{-1} to the left in (8.15): \Phi W^T = T, so W^T = \Phi^{-1} T. Geometrically speaking, \vec{t}_k \in S and \vec{t}_{k\perp} = 0 (see figure 8.6), i.e. the network is capable of learning the targets perfectly, and the error function is zero (after training).

➥ If P <= N+1 then the \vec{t}_k vectors are included in a subspace of S, and to reduce the error to zero it is enough to make the projection of \vec{y}_k (onto that subspace) equal to \vec{t}_k (a situation similar, mutatis mutandis, to that represented in figure 8.6, but with \vec{y}_k and \vec{t}_k swapping places). This means that just a part of the weights need to be adapted (found); the other ones do not count (there is an infinity of solutions). Having P <= N+1 is not normally a good idea, because the network acts more as a memory rather than as a generalizer (the network has a strong tendency to overadapt).

The solution developed in this section does not work if the neuron does not have a linear activation, e.g. if it has a sigmoid activation function.
8.2.3 The Gradient Descent Solution

The neuronal activation function is supposed to be differentiable, and the error function may be expressed as a function of the weights, E = E(W). Then an initial value for the {w_{ki}} parameters is chosen (usually the weights are initialized randomly) and the parameters are modified by small amounts in the direction of decrease of E, i.e. in the direction of -\nabla E (with respect to the weights), in small steps:

  w_{(t+1)ki} = w_{(t)ki} - \eta \frac{\partial E}{\partial w_{ki}}\Big|_{W_{(t)}},  \eta = const. \in R_+

t being the step of the iteration (discrete time). \eta is a positive constant called the learning rate, and it governs the speed by which the {w_{ki}} parameters are changed. Obviously -\nabla E may be represented as a matrix of the same dimensions as W, and then the above equation may be written simply as:

  \Delta W = W_{(t+1)} - W_{(t)} = -\eta \nabla E    (8.16)

known as the delta rule. Usually the error function is expressed as a sum of P error terms over the training set, E = \sum_{p=1}^{P} E_p(W); then the weight adjustment may be done in steps, for each training vector in turn:

  w_{(t+1)ki} = w_{(t)ki} - \eta \frac{\partial E_p}{\partial w_{ki}}\Big|_{W_{(t)}},  p = 1,P

✍ Remarks:
➥ The above procedure is especially useful if the training set is not available from the start, but rather the vectors arrive as a time series.
➥ The learning parameter \eta may be chosen to decrease in time, e.g. \eta = 1/t. This procedure is very similar to the Robbins-Monro algorithm for finding the root of the derivative of E, i.e. the minima of E.

Assuming a general linear discriminant (8.8) and considering the sum-of-squares error function (8.10):

  E(W) = \frac{1}{2} \sum_{p=1}^{P} \sum_{k=1}^{K} [ \sum_{i=0}^{N} \tilde{w}_{ki} \tilde{\varphi}_i(x_p) - t_{kp} ]^2,  E_p(W) = \frac{1}{2} \sum_{k=1}^{K} [ \sum_{i=0}^{N} \tilde{w}_{ki} \tilde{\varphi}_i(x_p) - t_{kp} ]^2

then:

  \frac{\partial E_p}{\partial w_{k\ell}} = [ \sum_{i=0}^{N} \tilde{w}_{ki} \tilde{\varphi}_i(x_p) - t_{kp} ] \tilde{\varphi}_\ell(x_p) = [y_k(x_p) - t_{kp}] \tilde{\varphi}_\ell(x_p)

or, in matrix notation:

  \nabla E_p = [y(x_p) - t_p] \tilde{\varphi}^T(x_p)

where y(x_p) = W\tilde{\varphi}(x_p). Then the delta rule (8.16) becomes:

  \Delta W = -\eta [W\tilde{\varphi}(x_p) - t_p] \tilde{\varphi}^T(x_p)

This equation is also known as the least-mean-squares (LMS) rule, the adaline rule, or the Widrow-Hoff rule.

So far only networks with the identity activation function were discussed. In general, neurons have a differentiable activation function f (the perceptron being an exception); then the total input to neuron k is a_{kp} = W(k,:)\tilde{\varphi}(x_p) and:

  a_p = W\tilde{\varphi}(x_p),  y(x_p) = f(a_p)

The sum-of-squares error (8.9), for each training pattern p, is:

  E_p(W) = \frac{1}{2} [f(a_p) - t_p]^T [f(a_p) - t_p]

and then:

  \nabla E_p = { [f(a_p) - t_p] \odot f'(a_p) } \tilde{\varphi}^T(x_p)

where f' is the total derivative of f and \odot denotes elementwise multiplication.

Proof. From the expression of E_p:

  E_p(W) = \frac{1}{2} \sum_{k=1}^{K} [f(a_{kp}) - t_{kp}]^2 = \frac{1}{2} \sum_{k=1}^{K} [ f( \sum_{i=0}^{N} \tilde{w}_{ki} \tilde{\varphi}_i(x_p) ) - t_{kp} ]^2

and then:

  \frac{\partial E_p}{\partial w_{k\ell}} = [f(a_{kp}) - t_{kp}] f'(a_{kp}) \tilde{\varphi}_\ell(x_p)

which leads directly to the matrix formula above.

✍ Remarks:
➥ In the case of the sigmoid function f(x) = \frac{1}{1 + e^{-x}}, the derivative is:

  f'(x) = \frac{df(x)}{dx} = f(x)[1 - f(x)]

Writing f' in terms of f speeds up the calculation and saves some memory in digital simulations. It is easily seen that the derivatives are "local", i.e. they depend only on parameters linked to the particular neuron in focus and do not depend on the values linked to other neurons.
➥ The total derivative over the whole training set may be found easily from:

  \frac{\partial E}{\partial w_{k\ell}} = \sum_{p=1}^{P} \frac{\partial E_p}{\partial w_{k\ell}}
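The same invented problem as in the pseudo-inverse sketch, now solved iteratively with the LMS/delta rule, one pattern at a time (learning rate chosen arbitrarily):

    P = 100; eta = 0.05;
    X = rand(P, 2);
    t = 2*X(:,1) - X(:,2) + 0.5;      // made-up linear targets, K = 1

    W = zeros(1, 3);                  // one output neuron: (w0, w1, w2)
    for epoch = 1:200
        for p = 1:P
            xt = [1; X(p,:)'];
            W = W - eta * (W*xt - t(p)) * xt';   // Delta W = -eta (y - t) phi~^T
        end
    end
    disp(W);                          // converges towards (0.5, 2, -1)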
➧ 8.3 The Perceptron

(See [Bis95] pp. 98-105.)

The perceptron (or adaline, from ADAptive LINear Element) represents a single-layer neural network with a threshold activation function (see (8.7)). Usually the activation function is chosen odd:

  f(a) = +1  for a >= 0,  f(a) = -1  for a < 0

8.3.1 The Error Function

Considering just one neuron, its output is y(x) = f(\tilde{w}^T \tilde{\varphi}(x)). Because the output of the single neuron is either +1 or -1, it may classify just a set of two classes: if \tilde{w}^T \tilde{\varphi} > 0 then the output is +1 and x \in C_1; else \tilde{w}^T \tilde{\varphi} < 0, the output is -1 and x \in C_2.

Then, for a correct classification, t \tilde{w}^T \tilde{\varphi} > 0, \forall x, where t is the target value given the input vector x. For a misclassified input vector, either \tilde{w}^T \tilde{\varphi} > 0 while t = -1, or vice versa, i.e. t \tilde{w}^T \tilde{\varphi} < 0. A good choice for the error function is then:

  E(w) = -\sum_{x_p \in M} t_p \tilde{w}^T \tilde{\varphi}(x_p)    (8.17)

where M is the set of misclassified vectors x_p.

✍ Remarks:
➥ From the discussion in section 8.1.1 it follows that \tilde{w}^T \tilde{\varphi}(x_p) is proportional to the distance from the misclassified vector \tilde{\varphi}(x_p) to the decision boundary. The process of minimizing the function (8.17) is thus equivalent to shifting the decision boundary such that misclassification becomes minimal. During the shifting process M changes, as some previously misclassified vectors become correctly classified and vice versa.

8.3.2 The Learning Procedure

The gradient descent solution (section 8.2.3) is used to find the weight vector:

  \frac{\partial E_p}{\partial w_i} = -t_p \tilde{\varphi}_i(x_p)  if x_p \in M,  0  if x_p is correctly classified

and then \nabla E_p = -t_p \tilde{\varphi}(x_p) if x_p \in M, or \nabla E_p = \tilde{0} otherwise. The delta rule (8.16) becomes:

  \tilde{w}_{(t+1)} = \tilde{w}_{(t)} + \Delta\tilde{w},  \Delta\tilde{w} = \eta t_p \tilde{\varphi}(x_p)  if x_p \in M,  \tilde{0}  if x_p is correctly classified    (8.18)

i.e. all training vectors are tested: if x_p is correctly classified then w is left unchanged, otherwise it is "adapted"; the process is repeated until all vectors from the training set are classified correctly. See figure 8.8.

Figure 8.8: The learning process for the perceptron. White circles are from one class, black ones from the other. Initially the parameter is \tilde{w}_{(0)} and the pattern shown by \tilde{\varphi} is misclassified; then \tilde{w}_{(1)} = \tilde{w}_{(0)} - \tilde{\varphi} (\eta was chosen 1 and t = -1). The decision boundary is always perpendicular to the w vector, see section 8.1.1. The other case, of a misclassified x_p \in C_1, is similar.
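A small Scilab sketch of the procedure (8.18) on an invented, linearly separable two-class set (\tilde{\varphi} = \tilde{x}, \eta = 1):

    P = 40;
    X = [rand(P/2, 2) + 1; rand(P/2, 2) - 2];   // two well separated clusters
    t = [ones(P/2, 1); -ones(P/2, 1)];          // targets +1 / -1

    w = zeros(3, 1);                            // w~ = (w0, w1, w2)
    changed = %t;
    while changed
        changed = %f;
        for p = 1:P
            xt = [1; X(p,:)'];
            if t(p) * (w' * xt) <= 0 then       // misclassified pattern
                w = w + t(p) * xt;              // rule (8.18)
                changed = %t;
            end
        end
    end
    disp(w');                                   // a separating hyperplane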
8.3.3 Convergence of Learning

The error function (8.17) decreases by using the learning rule (8.18).

Proof. The terms of E (8.17), after one learning step using (8.18), satisfy:

  -t \tilde{w}_{(t+1)}^T \tilde{\varphi} = -t \tilde{w}_{(t)}^T \tilde{\varphi} - \eta t^2 ||\tilde{\varphi}||^2 < -t \tilde{w}_{(t)}^T \tilde{\varphi}

where ||\tilde{\varphi}||^2 = \tilde{\varphi}^T \tilde{\varphi}; then E_{(t+1)} < E_{(t)}, i.e. E decreases.

Consider a linearly separable problem. Then there exists a solution \hat{w} such that:

  t_p \hat{w}^T \tilde{\varphi}(x_p) > 0,  p = 1,P

The process of updating w, using the delta rule (8.18), is convergent.

Proof. Consider that the initial vector, in the above learning procedure, is chosen as \tilde{w}_{(0)} = \tilde{0}, and let the learning parameter be \eta = 1 (without any loss of generality); then:

  \tilde{w}_{(t+1)} = \tilde{w}_{(t)} + t_p \tilde{\varphi}(x_p)

where x_p is a misclassified training vector (see (8.18)). Then the weight vector may be written as:

  \tilde{w}_{(t+1)} = \sum_{\ell} \tau_\ell t_\ell \tilde{\varphi}(x_\ell)

where \tau_\ell is the number of misclassifications of the x_\ell vector. Note that while w changes, the decision boundary changes, and a training pattern vector may move from being correctly classified to being misclassified and back. The sum is done over the training cycle (the training set may be used several times, in any order). By multiplying with \hat{w}^T to the left:

  \hat{w}^T \tilde{w} = \sum_{\ell} \tau_\ell t_\ell \hat{w}^T \tilde{\varphi}(x_\ell) >= ( \sum_{\ell} \tau_\ell ) \min_{\ell} t_\ell \hat{w}^T \tilde{\varphi}(x_\ell)

such that the product is bounded from below by a function linear in \tau = \sum_k \tau_k, and thus ||\tilde{w}_{(t+1)}|| is bounded from below as well (\hat{w} being constant). On the other hand:

  ||\tilde{w}_{(t+1)}||^2 = ||\tilde{w}_{(t)}||^2 + t_\ell^2 ||\tilde{\varphi}(x_\ell)||^2 + 2 t_\ell \tilde{w}_{(t)}^T \tilde{\varphi}(x_\ell) <= ||\tilde{w}_{(t)}||^2 + ||\tilde{\varphi}(x_\ell)||^2

(the last term being negative, as x_\ell was misclassified). Therefore:

  \Delta||\tilde{w}||^2 = ||\tilde{w}_{(t+1)}||^2 - ||\tilde{w}_{(t)}||^2 <= \max_{\ell} ||\tilde{\varphi}(x_\ell)||^2

and then:

  ||\tilde{w}_{(t+1)}||^2 <= \tau \max_{\ell} ||\tilde{\varphi}(x_\ell)||^2

i.e. ||\tilde{w}_{(t+1)}|| is bounded from above by a function linear in \sqrt{\tau}. Considering both limitations (from below, linear in \tau, and from above, linear in \sqrt{\tau}), it follows that no matter how many update steps are taken, \tau has to be bounded (because the lower bound grows faster, during training, than the upper bound); then at some stage \tau_\ell becomes stationary for all \ell \in {1, ..., P}, and thus (because \hat{w} was presumed to exist) all training vectors become correctly classified.

✍ Remarks:
➥ The learning algorithm is good at generalization as long as the training set is statistically significant.
➥ The perceptron may be successfully used only for linearly separable classes.

➧ 8.4 Fisher Linear Discriminant

(See [Bis95] pp. 105-112.)

8.4.1 Two Classes Case

A very simple way to reduce the dimensionality is to apply a linear projection into a unidimensional space, of the form:

  y = w^T x    (8.19)

where w is the vector of parameters, chosen such as to maximize separability. Consider two classes and a training set containing P_1 vectors of class C_1 and P_2 vectors of class C_2. The mean vectors of the class distributions are:

  m_1 = \frac{1}{P_1} \sum_{x_p \in C_1} x_p  and  m_2 = \frac{1}{P_2} \sum_{x_p \in C_2} x_p

Then a natural choice for w would be one maximizing the distance between the unidimensional projections of the means, i.e. w^T(m_1 - m_2); on the other hand, this distance may be arbitrarily increased by increasing w; to avoid this, a constraint on the size of w should be imposed, e.g. the normalization ||w||^2 = w^T w = 1. The Lagrange multiplier method is applied (see the mathematical appendix); the Lagrange function (using the normalization of w) is:

  L(w, \lambda) = w^T (m_1 - m_2) + \lambda (||w||^2 - 1)

and the required solution is found from the conditions:

  \frac{\partial L}{\partial w_i} = m_{1i} - m_{2i} + 2\lambda w_i = 0  and  \frac{\partial L}{\partial \lambda} = \sum_i w_i^2 - 1 = 0

which gives w \propto m_1 - m_2. However, this solution is not generally good, because it considers only the relative positions of the distributions, not their form; e.g. for a Gaussian distribution this means the matrix \Sigma. See figure 8.9.

Figure 8.9: The unidimensional reduction to the y space using the Lagrange multiplier: Gaussian distributions, two classes, two dimensions. While the separation is better than a projection on the x_1 axis, it is worse than that on the x_2 axis, because the \Sigma parameter was not considered.

One way to measure the within-class scatter of the unidimensional projection of the data is:

  s_k^2 = \sum_{x_p \in C_k} [y(x_p) - w^T m_k]^2 = \sum_{x_p \in C_k} (w^T x_p - w^T m_k)^2 = w^T [ \sum_{x_p \in C_k} (x_p - m_k)(x_p - m_k)^T ] w

and the total scatter, for two classes, would be s_total^2 = s_1^2 + s_2^2. Then one criterion for the search of w would be to minimize the scattering. The Fisher technique takes the approach of maximizing the separation of the projected means relative to the total scattering: the Fisher criterion is defined as:

  J(w) = \frac{(w^T m_1 - w^T m_2)^2}{s_1^2 + s_2^2} = \frac{w^T (m_1 - m_2)(m_1 - m_2)^T w}{s_1^2 + s_2^2}

and it may be expressed also as:

  J(w) = \frac{w^T S_b w}{w^T S_w w}

where S_b is named the between-class covariance matrix:

  S_b = (m_1 - m_2)(m_1 - m_2)^T

and S_w is named the within-class covariance matrix:

  S_w = \sum_{x_p \in C_1} (x_p - m_1)(x_p - m_1)^T + \sum_{x_p \in C_2} (x_p - m_2)(x_p - m_2)^T

The gradient of J with respect to w is zero at the desired maximum:

  \nabla J = \frac{(w^T S_w w) S_b w - (w^T S_b w) S_w w}{(w^T S_w w)^2} = 0    (8.20)

  \Rightarrow  (w^T S_w w) S_b w = (w^T S_b w) S_w w    (8.21)

From the definition of S_b it follows that:

  S_b w = (m_1 - m_2)(m_1 - m_2)^T w \propto (m_1 - m_2)    (8.22)

Because only the direction of w matters, any scalar factors may be dropped; replacing (8.22) into (8.21) gives:

  w \propto S_w^{-1} (m_1 - m_2)

known as the Fisher discriminant.

8.4.2 Connections With The Least Squares Technique

For the particular case of the transformation (8.19), the sum-of-squares error function (8.9) becomes:

  E = \frac{1}{2} \sum_{p=1}^{P} (w^T x_p + w_0 - t_p)^2

The target values are chosen as follows:

  t_p = P/P_1  if x_p \in C_1,  t_p = -P/P_2  if x_p \in C_2    (8.23)

where P_1 is the number of training patterns in C_1 and similarly for P_2; obviously P_1 + P_2 = P. The minimum of E with respect to w and w_0 is found by zeroing its derivatives:

  \nabla E = \sum_{p=1}^{P} (w^T x_p + w_0 - t_p) x_p = 0    (8.24a)

  \frac{\partial E}{\partial w_0} = \sum_{p=1}^{P} (w^T x_p + w_0 - t_p) = 0    (8.24b)

The sum in (8.24b) may be split into two sums following the membership of x_p: \sum_{x_p \in C_1} and \sum_{x_p \in C_2}; from the particular choice (8.23) for t_p, and because P_1 + P_2 = P:

  w_0 = -w^T m  where  m = \frac{1}{P} \sum_{p=1}^{P} x_p = \frac{P_1}{P} m_1 + \frac{P_2}{P} m_2    (8.25)

i.e. m represents the mean of x over the whole training set. The sum in (8.24a) may be split into four terms, by separate summations over each class, replacing w_0 from (8.25) and the t_p values from (8.23):

  \nabla E = \sum_{x_p \in C_1} ( w^T x_p - \frac{P_1}{P} w^T m_1 - \frac{P_2}{P} w^T m_2 - \frac{P}{P_1} ) x_p + \sum_{x_p \in C_2} ( w^T x_p - \frac{P_1}{P} w^T m_1 - \frac{P_2}{P} w^T m_2 + \frac{P}{P_2} ) x_p = 0    (8.26)

This relation reduces to:

  ( S_w + \frac{P_1 P_2}{P} S_b ) w = P (m_1 - m_2)

Proof. The terms of (8.26) independent of w sum to -P m_1 + P m_2 = -P(m_1 - m_2) (using the definitions of m_1 and m_2). The terms linear in w sum to:

  [ \sum_{p=1}^{P} x_p x_p^T - P m m^T ] w = [ S_w + P_1 m_1 m_1^T + P_2 m_2 m_2^T - P m m^T ] w

(using the definitions of m_1, m_2 and S_w), and, by expanding m = (P_1 m_1 + P_2 m_2)/P:

  P_1 m_1 m_1^T + P_2 m_2 m_2^T - P m m^T = \frac{P_1 P_2}{P} (m_1 - m_2)(m_1 - m_2)^T = \frac{P_1 P_2}{P} S_b

which, together, give the stated relation.

Because S_b w \propto (m_1 - m_2) (see (8.22)) and only the direction of w counts, then, by dropping the irrelevant constant factors, the Fisher discriminant is obtained again:

  w \propto S_w^{-1} (m_1 - m_2)
✍ Remarks: ➥ ➥ ➧ 9.2 While the network depicted in gure 9.1 appears to have 3 layers, there are in fact only 2 processing layers: hidden and output. So this network will be said to have 2 layers. The input layer x plays the role of distributing the inputs to all neurons in subsequent layer, i.e. it plays the role of a sensory layer. In general a network is of feedforward type if there is a possibility to label all neurons (input, hidden and output) such that any neuron will receive inputs only from those with lower number label. By the above de nition, more general neural networks may be build (than the one from gure 9.1). Threshold Neurons A threshold neuron have the activation function of the form: f (a) = ( 1 0 if a > 0 if a < 0 and the a = 0 value may be assigned to either case. 9.2 See [Bis95] pp. 121{126. (9.1) 9.2. THRESHOLD NEURONS 9.2.1 155 Binary Vectors Let consider the case of binary pattern vectors, i.e. xi 2 f0; 1g, 8i. On the other end, let the outputs be also binary. One output neuron will be considered, the discussion being easily generalisable to multiple output neurons. Then the problem is to model a Boolean function fB : f0; 1gN ! f0; 1g. The total number of possible inputs is 2N . The function fB is totally de ned once the output value is given for all input combinations. Then the network may be build as follows:  The number of hidden neurons is equal to the number of input patterns which should result in an output equal to 1.  The activation function for hidden neurons is de ned as: ( 1 if a > 0 f1 (a) = 0 if a 6 0 Each hidden neuron is set to ( be activated just by one pattern: For that patternN the P 1 if xi = 1 weights are set up: wji = ;1 if xi = 0 and wj0 = 1 ; nx ; where nx = i=1 xi , i.e. is equal with the number of \ones" into the x pattern vector. Then the total input to a hidden neuron is 1  nx + 1  (1 ; nx) = 1 and then the output of the hidden neuron is 1 when presented with the \learned" pattern vector. The total input is at most (nx ; 1) + (1 ; nx ) = 0 when presented with another pattern vector (one xi component changed from 1 to 0 or vice-versa); such that the output will be 0.  The activation function for the output neuron is: ( f2 (a) = 1 0 if a > 0 if a < 0 The weights to the output neuron are set to 1. The bias w(2)0 is set to ;1 such that when a pattern for which the output should be 1 is presented to the net, the total input in y is 0; otherwise is ;1 and thus the correct output is ensured at all times. A vectorial output function may be split into components and each component may be assigned to a output neuron. ✍ 9.2.2 Remarks: ➥ While not very useful by itself (it does not have a generalization capability) the above architecture illustrate the possible importance of singular neuron \ ring" | used extensively in CPN and ART neural networks. Continuous Vectors The two propositions below assume a 2 class problem (either x 2 C1 or x 2 C2 ) and thus one output neuron is enough for classi cation. The solution is easily extensible to multi-class ❖ f B 156 CHAPTER 9. MULTI LAYER NEURAL NETWORKS z1 z H z2 y Figure 9.2: = 1^ z ::: ^ zH In a two layer network the hidden layer may represent the hyperplane decision boundaries and then the output layer may do a logical AND to establish if the input vector is within the decision region. Thus any convex decision area may be represented by a two layer network. problems by assigning each output neuron to a class. Proposition 9.2.1. A two-layer neural network may have arbitrary convex decision boundary. Proof. 
A single layer neural network (with one output) have a decision boundary which is a hyperplane. Let the hidden layer represent the hyperplanes (see chapter \Single Layer Neural Networks"). Then the output layer may perform an logical AND between the outputs of the hidden layer to decide if the pattern vector is inside the decision region or not | each output neuron representing a class. See gure 9.2. Proposition 9.2.2. A 3-layer neural network may have arbitrary decision boundary (it may be non-convex and/or disjoint). The pattern space is divided into suciently small hypercubes such that the decision region may be approximated using them (i.e. each hypercube will be included either in the C1 decision region or in that of C2 's). The decision area may be approximated with arbitrary precision by making the hypercubes smaller. The neural network is built as follows: The rst hidden layer contains a group of 2N neurons for each hypercube of the same one class (2 hyperplanes for each dimension to de ne the hypercube). The second hidden layer contains N neurons who receive the input from the corresponding group of 2N neurons from the rst hidden layer. The output layer receive its inputs from all N neurons from the second hidden layer. The architecture of the network is depicted in gure 9.3 on the next page. By the same method as described in the previous proposition a neuron from the second layer may decide if the input pattern vector is inside the hypercube it represents. An logical OR on the output layer, between the outputs of the second hidden layer, decides if the pattern vector belongs to any of hypercubes represented by the rst layer and thus to the class selected; if not then the input vector belongs to the other class. Proof. ✍ Remarks: ➥ The network architecture described in proof of proposition 9.2.2 have the disadvantage that require large hidden layers. 9.3. SIGMOIDAL NEURONS 157 xN x1 rst hidden layer second hidden layer y Figure 9.3: The 3 layer neural network architecture for arbitrary de- cision areas. ➧ 9.3 Sigmoidal Neurons The possible sigmoidal activation functions are: y = f (a) = 1 ; 1 + e ca and ;ca ca + e;ca e ca y = f (a) = tanh(a) = e ; e where c = const.. 2)+1 1 = 1+e;ca then using the tanh function instead of the logistic one Because tanh(a= 2 is equivalent to apply a linear transformation ea = a=2 before and (again a linear transformation) y = 12 ye + 1 after the processing on neural layer (this is equivalent in a linear transformation of weights and biases). The tanh function have the advantage of being symmetrical with respect to the origin. See gure 9.4 on the following page. ✍ 9.3 Remarks: ➥ The output of a logistic neuron is limited to the interval [0; 1]. However the logistic function is easily inversable and then the inverse f ;1 (y) = 1c ln 1;y y . ➥ In the vicinity of the origin the logistic function is almost linear and thus it can approximate a linear neuron. See [Bis95] pp. 126{132. 158 CHAPTER 9. MULTI LAYER NEURAL NETWORKS tanh(x) 1:0 c = 10 c c =1 =3 ;1 0 ;2 0 : x 2:0 : Figure 9.4: The graph of tanh function for various c constants. 9.3.1 Three Layer Networks A three layer network may approximate with any accuracy a smooth function (mapping) X ! Y . Proposition 9.3.1. Proof. The logistic activation function is: y= 1 1 + exp( ;wT x ; w0 ) and is represents the output of a rst-layer neuron. See gure 9.5{a . By making linear combinations, i.e. when entering the second neuronal layer, it is possible to get a function with an absolute maximum. 
See gure 9.5{b and 9.5{c. By applying the sigmoidal function on the second layer, i.e. when exiting the second layer, it is possible to get just a localized output, i.e. just the maximum of the linear combination. See gure 9.5{d. The third layer may combine the localized outputs of the second layer to perform the approximation of any smooth function | it should have a linear activation function. 9.3.2 Two Layer Networks Proposition 9.3.2. A two layer neuronal network can approximate, arbitrary well, any function (mapping) X ! Y , provided that X and Y spaces are nite-dimensional and there are enough hidden neurons. Proof. X ci Any function may be decompose into a Fourier series: y(x1 ; : : : ; xN ) = = i1 x ; : : : ; xN ) cos(i1 x1 ) = : : : 1( 2 X    X Ci iN YN i1 iN 1  `=1 i` x` ) cos( (by developing in series all c parameters). Any product of 2 cosines may be transformed into a sum, using the formula cos cos = 12 cos( + ) + 1 2 cos( ; ). Then by applying this procedure N ; 1 times, the product from the equation above may be changed to a sum of cosines (the values of the angles and the constants don't have to be speci ed for the proof). 9.3. SIGMOIDAL NEURONS 159 f1 1:0 1:0 0:5 0:5 0:0 10 0 a1 1:0 f2 ;10 ;10 0:0 0 10 a2 a) 10 a1 f3 0:0 10 0 ;10 a1 Figure 9.5: 10 0 ;10 a2 b) 0:5 0:0 10 0 ;10 a1 c) d) Two dimensional space. Figure a) shows the sigmoidal function f1 = 1+ ;1 1 ; 2 . Figure b) shows the linear combination f2 = 1+ ; 11 ; 2 ;5 ; 1+ ; 11 ; 2 +5 | they are displaced relatively one each other. Figure c) shows a linear combination of 4 sigmoidal functions f3 = 1+ ;015; 2 ;5 ; 05 05 05 1+ ; 1 ; 2 +5 + 1+ ; 1 + 2 ;5 ; 1+ ; 1 + 2 +5 | the second pair is rotated by =2. Figure d) shows the output after applying the sigmoidal function again f4 = 1+ 141 3 +1 6 | only the central maximum remains. e a a e a a e a a e e a : a e a : a ;10 a2 f4 1:0 0:5 0 10 0 ;10 e a : a : a a e f : 10 0 ;10 a2 160 CHAPTER 9. MULTI LAYER NEURAL NETWORKS f f (xi+1 ) f (xi ) x0 xi xi+1 Figure 9.6: The approximation of a function by S steps. Using the Heaviside step function: H (x) = any function may be approximated as: f (x) ' f (x0 ) + ❖ S xS x S X i=1 ( 1 0 if x > 0 if x < 0 f xi ) ; f (xi;1 )] H (x ; xi ) [ ( where S is the number of steps by which the function was approximated. See gure 9.6. Then the network is build/performs as follows:  The rst layer have threshold activation functions and calculate the values of cosine functions.  The second layer performs the linear combination from the Fourier series development. ➧ 9.4 Weight-Space Symmetry Considering the tanh activation function then by changing the sign on all weights and the bias the output of neuron have reverted sign (tanh is symmetrical with respect to origin, see also gure 9.4 on page 158). If the network have two layers and the weights and biases on the second layer have also the signs reverted then the nal output of network remains unchanged. Then the two sets of weights and biases leads to same result, i.e. there is a symmetry of the weights towards origin. Assuming a 2-layer network with H hidden neurons then the total number of weights and biases sets which gives the same nal result is 2H Also, if all weights and the bias are interchanged between two neurons on one layer then the output of the next layer remains unchanged (assuming that the layers are fully interconnected). 9.4 See [Bis95] pg. 133. 9.5. HIGHER-ORDER NEURONAL NETWORKS 161 There are H ! such possible combinations on the hidden layer. 
Finally there are H !2H sets of weights witch gives the same output on a 2-layer network. On more complex networks there may be even more symmetries. The symmetry in weights leads directly to a symmetry in error function and then the error will have several equivalent minima. Eventually this means that the minima point of error may be much closer than it looks at rst sight, regardless of the starting point, usually randomly selected. ➧ 9.5 Higher-Order Neuronal Networks The neurons studied so far performed a linear combination of their inputs before applying the activation function: N ftotal inputgj = aj = wji xi then foutputgj = yj = f (aj ) i=0 (or in matrix notation: a = W x and y = f (a)). It is possible to design a neuron which performs a higher-degree combination of its inputs, e.g. a second-order: N N N wji` xi x` ftotal inputgj = wj0 + wji xi + i=1 `=1 i=1 X X ✍ ➧ 9.6 9.6.1 XX Remarks: ➥ The main diculty in dealing with such neurons consists in the tremendous increase in the number of W parameters. A rst-ordered neuron will have N + 1 parameters while a second order neuron will have N 2 + N + 1 parameters and for N  1 ) N 2  N . On the other hand higher-order neurons may be build to be invariant to some transformations of the pattern space, e.g. translations, rotation and scaling. This property may make them usable into the rst layer of network. Backpropagation Algorithm Error Backpropagation It is assumed that the error function may be written as a sum over all training vector patterns P E = Ep p=1 X 9.5 See [Bis95] pp. 133{135. 9.6 See [Bis95] pp. 140{148. 162 CHAPTER 9. MULTI LAYER NEURAL NETWORKS and also that the error function E = E (y) is di erentiable with respect to the output variables. Then just one vector pattern is considered at a time. The output of network is calculated by forward propagation from: p X p wji zi and zj = f (aj ) a = W (j; :) z and z = f (a) aj = (9.2) i ❖ aj where a is the weighted sum of all inputs entering neuron z , z being here any neuron, hidden or output. Then E also depends on w trough a and then the derivative of E with respect to w is j j p ji @Ep @wji = j p @Ep @aj = @aj @wji rE ❖ j ,  @Ep @aj raE = p p ) zi = j zi zT ji (9.3)  zT = where jp   is named error | it's a factor determinant in weight adjusting. (as shown below); ra E  , note that here rE represents just the part linked to W from the whole error gradient z for all layers is found trough a forward propagation trough the network.  is found by backpropagation (from output to input) as follows:  For the output layer: E = E (f (a)) and then: @E j @a p p p out;k p  @E @a p @Ep df (ak ) = @yk k  out ❖ f0 i = dak ry E = @Ep @yk (9.4) f 0 (ak ) f 0 (a) p where f is the total derivative of activation function f .  For other layers: neuron z a ects the error E trough all other neurons to which it sends its output: 0 j j = @Ep @aj = X ` p @Ep @a` @a` @aj = X next;` ` @a` (9.5) @aj where the sum is done over all neurons to which z send connections (next  `p ) and from the expression of a and de nition of  nally the back-propagation formula: @E ❖ next;` ,  next ` j = f 0 (aj ) X j ;` @a ` w`j next;` (9.6) ` by the means of which the derivatives may be calculated backwards | from the output layer to the input layer. In matrix notation:  = f (a) 0 (W T next ) By knowing the derivatives of E with respect to the weights, the W parameters may be adjusted in the direction of minimizing the error, using the delta rule. 9.6. 
BACKPROPAGATION ALGORITHM 163 9.6.2 Application: Sigmoidal Neurons and Sum-of-squares Error Let consider a network having logistic activation function for its neurons and the sum-ofsquares error function. Then the derivative of the activation function is: ( ) = 1 +1e;x ) df f x dx = f (x)(1 ; f (x)) and the sum-of-squares error function for pattern vector xp is: 1 K [yk (xp ) ; tpk ]2 = 1 [y(xp ) ; tp]T [y(xp ) ; tp ] Ep = 2 k=1 2 X (9.7) (9.8) (it is assumed that yk and tpk are the components of the respective vectors y and tp , when the network is presented with the pattern vector xp ). From (9.8), (9.4) and (9.7), for output layer:  out b [b ; ] = y(xp ) [1 ; y(xp )] [y(xp ) ; tp ] and similar, for the hidden layers:  =z 1 z (W T  ) next where the  errors are calculated backwards starting with the hidden layer closest to the output and ending with the rst hidden layer. The error gradient (the part linked to W ) rEp is: rEp =  zT To minimize the error, the weights have to be changed in direction contrary to that pointed by the gradient vector, i.e. the amount will be: W / ;rEp = ;rEp where  governs the overall speed of weight adaptation, i.e. the speed of learning, and thus is named learning constant. i is a determining factor in weight change and thus is named error . The above equation represents the delta rule. ✍ Remarks: ➥ The choice of order regarding training pattern vectors is optional:  Consider one at a time (including random selection for the next one).  Consider all together and then change the weights with the sum of all individual weight adjustments P W = ; rEp p=1 X This represents the batch backpropagation. learning constant 164 CHAPTER 9. MULTI LAYER NEURAL NETWORKS  Any pattern vector may be selected multiple times for training/weight ad- ➥ ➥ ❖ W, O justing.  Any of the above in any order.  have to be suciently large to avoid the trap of local minima of E but on the other hand it have to be suciently small such that the \absolute" minima will not be jumped over. In numerical simulation the most computational intensive terms are the matrix operations found in (9.6). Then the computational time is proportional with the number of weights W : O(W ) | where O is a linear function in W | the number of neurons being usually much lower (unless there is a very sparse network). Because the algorithm require in fact one forward and one backward propagation trough the net the dependency is O(2W ). On the other hand to explicitly calculate the derivatives of E would require one pass trough the net for each one and then the computational time will be proportional with O(W 2 ). The importance of backpropagation algorithm resides in the fact that reduces computational time from O(W 2 ) to O(W ). However the classical way of calculating rE by perturbing each w by a small amount " & 0 is still good for checking the correctness of a digital implementation on a particular system: p p @Ep @wji ' E (w p ji ji + ") ; E (w ; ") p ji 2" Another approach would be to calculate the derivatives: @Ep @aj ' E (a p j + ") ;E ; ") aj p( 2" This approach still needs two steps for each neuron and then the computational time is proportional with O(2M W ), where assuming M is the total number of neurons. Note that because the derivative is calculated with the aid of two values centered around w , respectively a , the terms O(") are canceled (the bigger non-zero terms neglected in the above approximations are O("2 )). 
ji ➧ 9.7 j Jacobian Matrix The following matrix: J Jacobian   @y  k @xi =1;K i=1;N k 0 B =@ @y1 @ x1 .. . @y K @ x1  @x  @x ... @ y1 .. . N K N @y 1 CA is named the Jacobian matrix and it provides a measure of the local sensitivity of the network 9.7 See [Bis95] pp. 148{150. 9.7. JACOBIAN MATRIX 165 output to a change in the inputs, i.e. for small perturbation in input vector x  x the perturbation in the output vector is: y ' J x Considering the rst layer to which inputs send connections (i.e. the rst hidden layer): Jki k = = @y @xi X @yk @a` X @yk @a` @xi ` = @a` ` w`i (9.9) (the sum is made over all neurons to which inputs send connections). To use a matrix notation, the following matrix is de ned: ra y   @y  k @a` where a refers to a particular layer. k The derivatives @y @a` are found by a procedure similar to the backpropagation algorithm. A perturbation/change in a` is propagated trough all neurons to which neuron ` send connections (its output), then: @yk @a` and, because aq = Pw s qs zs = X @yk @aq q @aq @a` = P wqs f (as) then: s @yk @a` = f (a` ) 0 X @yk q @aq (9.10) wq` k i.e. the derivatives @y @a` may be evaluated in terms of the same derivatives of the next layer.  For the output layer: @yk @a` = k` dfda(ak ) k where a` here are those received by the output neurons (k` is the Kronecker symbol), @yk and the matrix ra y have just one non-zero i.e. all partial derivatives are 0 except @a k element per each row/column.  For other layers: the derivatives are calculated backward using (9.10), then: ra y = [1b f (aT )] [ra 0 next yW] When the rst layer is reached then the Jacobian is found from (9.9). ✍ Remarks: ➥ The same remarks as for backpropagation algorithm regarding the computing time (see section 9.9) applies here. ❖ ra y 166 CHAPTER 9. MULTI LAYER NEURAL NETWORKS ➧ 9.8 Hessian Tensor The elements of the Hessian tensor are the second derivatives of the error function with respect to weights: H ✍  = @2E @wji @wk`  jikl Remarks: ➥ ➥ The Hessian is used in several non-linear learning algorithms by analyzing the surface E = E (W ); the inverse of the Hessian is used to determine the least signi cant weights in order to \prune" the network and it is also used to nd the output uncertainties. Considering that there are W weights then the Hessian have W 2 elements and then the computational time is at least of O(W 2 ) order. The error function is considered additive with respect to the set of training pattern vectors. 9.8.1 Diagonal Approximation Here the Hessian is calculated by considering it diagonal, then only the diagonal elements @2E @w2 have to be found (all others are zero). ji The error function is considered P a sum of elements over the training set as in the backpropagation algorithm: E = Ep . p From the expression of aj , in (9.2), the operator @ @wji = @ @aj @aj @wji @ @wji = may be written as: @ zi @aj and, because zi does not depend on wji then: @2 2 @wji = @2 2 z @a2j i and the diagonal elements of the Hessian for pattern vector xp becomes From (9.6) (and on similar grounds): @ 2 Ep @a2j 9.8 = @Ep d2 f (aj ) X w`j da2j @a` ` See [Bis95] pp. 150{160. +  df (a ) 2 X X j daj ` `0 w`j w`0 j @ 2 Ep 2 @ 2 Ep 2 = @a2 zi . @wji j @ 2 Ep @a` @a`0 9.8. HESSIAN TENSOR   P 167 @ Ep @a` p 0 (where @a@ j @E @a` = @a` @a` @aj ) the sum over ` and ` being done over all neurons to ` which neuron i is sending connections. 
In matrix notation it may be written directly as: 2 0 0 0 ra ra )Ep = f 00 (a) (W ( T  next ) + [f 0 (a) f 0 (a)] By neglecting the non-diagonal terms: @ 2 Ep @a2j  @Ep d2 f (aj ) X w`j 2 daj @a` `  + (W df (aj ) daj T 2 X ` W T ranext )Ep ra next @ 2 Ep @ 2 a` 2 w`j or in matrix notation: ra ra )Ep = f 00 (a) (W ( 0 T  next ) f 0 (a)] + [f (a) W 2T ra ( next ra Ep next ) and thus, the computational time is reduced from O(W 2 ) to O(W ) (by neglecting the o 2E p , ` 6= `0 ). Note however that in practice the Hessian is quite far from diagonal terms @a@` @a ` being diagonal. 0 9.8.2 Outer Product Approximation Considering the sum-of-squares error function for pattern vector xp : Ep = 1 2 K X s=1 [ys (xp ) ; tps ]2 = 21 [y(xp ) ; tp ]T [y(xp ) ; tp ] then the Hessian is calculated immediately as: @ 2 Ep @wji @wk` = K X @ys @ys s=1 @wji @wk` + K X s=1 (ys 2 ys ; ts ) @w@ @w ji k` Considering a well trained network and the amount of noise small then the terms ys ; ts have to be small and may be neglected; then: @ 2 Ep @wji @wk` and @ys @wji  K X @ys @ys s=1 @wji @wk` may be found by backpropagation procedure. 9.8.3 Inverse Hessian Let consider the Nabla vectorial operator with respect to the weights r = the Hessian may be written as a square W  W matrix HP = P X p=1 r  rT Ep n @ @wji o ji then 168 CHAPTER 9. MULTI LAYER NEURAL NETWORKS (P being the number of vectors in the training set). By adding a new training vector: HP +1 = HP + r  rT EP +1 A similar recurrent formula may be developed for the inverse Hessian: 1 = H ;1 HP;+1 P Proof. ;1 T EP +1 H ;1 P ; H1 P+ rrr T H ;1 rE P P +1 Considering 3 matrices A, B and C , A inversable, then it is true that: ;1 = A;1 ; A;1 B (I + CA;1 B );1 CA;1 (A + BC ) (the identity may be veri ed by multiplication by (A + BC ) to the left respectively to the right). By putting HP = A, r = B and rT EP = C then rT HP;1 rEP +1 have dimension 1 (as the product between a row and column matrix) and the required formula is obtained. Using the above formula and starting with H0 / I the Hessian may be computed by just one pass trough all training set. 9.8.4 Finite Di erences Another possibility to nd the Hessian is by applying small perturbations to weights and calculate the error; then: 1 @2E = [E (wji + "; wk` + ") @wji @wk` 4"2 ; E (wji ; "; wk` + ") (9.11) ; E (wji + "; wk` ; ") + E (wji ; "; wk` ; ")] + O("2 ) ' 41"2 [E (wji + "; wk` + ") ; E (wji ; "; wk` + ") ; E (wji + "; wk` ; ") + E (wji ; "; wk` ; ")] where 0 . "  1. By choosing an interval centered around cancels one each other. Another approach is to use the gradient rE : @2E 1 = @wji @wk` 2" ' 2" 1 ✍ Remarks: ➥ " " @E @wji @E @wji (wk` + ") wji " + + wk` " ; @E @wji ; @E @wji (wij ; wk` ) # (wk` + ") wji " ; + the terms O(") O("2 ) (9.12) # ; wk` " (9.13) The \brute force" attack used in (9.11) require O(4W ) computing time for each Hessian element (one pass for each of four E terms), i.e. O(4W 3 ) computing time for the whole tensor. However it represent a good way of checking other algorithm implementations. 9.8. HESSIAN TENSOR ➥ 9.8.5 169 The second approach, used in (9.11) require O(2W ) computing time to calculate the gradient and thus O(2W 2 ) computing time for the whole tensor. Exact Hessian From (9.3): @Ep @wk` = k z` @ 2 Ep @wji @wk` = and then by di erentiating again: @  @E  @aj @aj @wk` @wji p = @  @E  @aj @wk` p zi = zi @ (k z` ) @aj (9.14) (using also (9.2)). 
Because z` = f (a` ) then: @ 2 Ep = zi k f (a` )h`j + zi z` bkj 0 @wji @wk` where: ❖ h`j , bkj h`j @a`  @a and j bkj @k  @a j The h`j coecients a` depends over all as from neurons s which connects to neuron `, then: h`j = and (as a` = P s w`s f (as )) X @a` @as s @as @aj (9.15) then: h`j = X s w`s f 0 (as )hsj i.e. the h`j coecients may be calculated in terms of the previous ones till the rst layer which receive directly the input and for which the coecients do not have to be calculated. ✍ Remarks: ➥ ➥ For the same neuron obviously h`` = 1. For two di erent neurons, by continuing the development in (9.15): if a forward propagation path can't be established from neuron ` to neuron j , then a` and aj are independent and, consequently h`j = 0 The bkj coecients The bkj are calculated using a backpropagation method. The neuron k a ects the error trough all neurons s to which it sends connections: k = f 0 (ak ) X s wsk s 170 CHAPTER 9. MULTI LAYER NEURAL NETWORKS (see also the backpropagation formula (9.6)). Then: f 0 (ak ) @ bkj = @aj s X 00 = f (ak )hkj ❖ f 00 X s ! (9.16) wsk s wsk s + f 0 (ak ) X s wsk bsj where f 00 (a) = ddaf2 is the second derivative of the activation function. 2 ✍ Remarks: ➥ The derivative @a@ proceed from the derivative @w@ , see (9.14). Its application is correct only as long wji does not appear explicitly in the expression to be derived in (9.16). The weight wji represents the connection from neuron i to neuron j while the sum in (9.16) is done over all neurons to which neuron j send connections to. Thus | according to the feedforward network de nition | wji can't be among the set of weights wsk and the formula is correct. j ji For the s neuron on output layer: s = @Ep @as 0 = f (as ) @Ep @ys (as ys = f (as )) and: bss = @s @as = @ f 0 (as ) @as 00 @Ep 00  = f ( as )  @ys @ @as $ @ @ys  @ys + f (as ) = f ( as ) 1 + f (because 00 @Ep 00 = f (as ) das @Ep dys @ys ;1 0 (ys ) @Ep @ys @Ep @ys 0 + f (as ) 0 + f (as ) 0 + f (as ) @ @Ep @ys @as @ 2 Ep @ys2 @ 2 Ep @ys2 switch places, using the expression of s above and as = f ;1 (ys )). For all other layers the bkj coecients may be found from (9.16) by backpropagation and by considering the condition wjj = 0, i.e. into a feedforward network there is no connection from a neuron to itself. 9.8.6 Multiplication with Hessian Considering the Hessian as a matrix r  rT Ep (see section 9.8.3) the problem is to nd an easy way of multiplying it with a vector v having W components: vT H  vT  (rrT )Ep 9.8. HESSIAN TENSOR 171 Using the nite di erence method on a interval centered around the current set of weights W (W is seen here as a vector ): vT r(rT Ep ) = 1  2"  r Ep (W + "v) ; r Ep (W ; "v) T T + O(" ) 2 where 0 . "  1. Application. Let consider a two layer feedforward neural network with the sum-of-squares error function. Then: a = W(1) xp ; z = f (a) and y = W(2) z Let de ne the operator R() = vT r, then R(W ) = v (R is a scalar operator, when applied to the components of W the results are the components of v as rW = 1b). Note that W represents the total set of weights, i.e. W(1) [ W(2) ; also the v vector may be represented, here, in the form of two matrices v(1) and v(2) , the same way W is represented by W(1) and W(2) . 
By applying R to a, z and y: R(a) = v x R(z) = f (a)R(a) R(y) = W R(z) + v ❖ a, z, y ❖ R() ❖ v(1) , v(2) (1) 0 (2) (2) z From the sum-of-squares error function:  (2) = f (aout )  (1) = f 0 (a) [y 0 ❖ aout ; tp ] for output layer, aout = W (2) ) = f 00 (aout ) (1) ) = f 00 (a) R(aout ) R(a) z for hidden layer W(2)  (2) (see (9.4), (9.6) and (9.8)). By applying again the R operator: R( R( (2) (y ; tp ) + f (a 0 W(2)  (2) + f 0 (a) (2) ) R(y) v(2)  (2) + f (a) 0 Finally, from (9.3): rW rW (2) Ep =  (2) zT for output layer (1) Ep =  (1) xT for hidden layer and then the components of vT H vector are found trough: (2)  Ep = R( (2) ) zT +  (2) R(zT ) for output layer (2)  Ep = R( (1) ) xT ; R rW ; R rW R W(2) ( (2) ) for hidden layer 172 CHAPTER 9. MULTI LAYER NEURAL NETWORKS ✍ Remarks: ➥ The above result may also be used to nd the Hessian, by making v of the form ; T v = 0 : : : 0 1 0 : : : 0 ; and then the expression vT R(Ep ) gives a row of the Hessian \matrix". CHAPTER 10 Radial Basis Function Networks ➧ 10.1 Exact Interpolation For a unidimensional output network let consider a set of training vectors fx g =1 , the corresponding set of targets ft g =1 and a function h : X ! Y which tries to map exactly the training set to the targets such that: (10.1) h(x ) = t ; p = 1; P p p p p ;P ;P p p It is assumed that the h(x) function may be found by the means of a linear combination of a set of P basis functions of the form '(kx ; x k): p h(x) = X P w '(kx ; x k) p p =1 (10.2) p By building the symmetrical matrix: 0 '(kx ; x k) : : : '(kx 1 1 .. ... =B @ . '(kx1 ; x k) : : : '(kx P P P ❖ 1 ; x1 k) CA .. .  ; x k) P then from (10.1) and (10.2) it follows that: ; ~ T = w1 : : : w where w 10.1 ~T w P  and ~tT = ;t 1  = ~tT  ::: t . P See [Bis95] pp. 164{167. 173 (10.3) ~ , ~t ❖w 174 CHAPTER 10. RADIAL BASIS FUNCTION NETWORKS ~ set of parameters is found immediately if the square matrix  is inversable (which The w usually is); the solution being: ~ = ~tT ;1 w Examples of basis functions are:       Gaussian function: '(x) = exp ; 2x2 '(x) = (x2 + 2 ); , > 0 '(x) = x2 ln x 2  p Multi-quadratic function: '(x) = x2 + 2 Cubic and linear functions: '(x) = x3 , '(x) = x For the multidimensional network output the established relations are immediately extendible as follows: hk (xp ) = tkp ; p = 1; P ; k = 1; K hk (x) = ❖ h ; P X p=1 wkp '(kx ; xp k) ; k = 1; K  ~ T as rows and Let h  h1 : : : hK then h(xp ) = tp and also by building W using all w T using all ~tT again as rows then W  = T and thus W = T ;1 (assuming  inversable, of course). ➧ 10.2 Radial Basis Function Networks The radial basis function network is built by considering the basis function as an neuronal activation function and the wkp parameters as weights. ✍ Remarks: To perform a exact interpolation of the training set is not only unnecessary but a bad thing as well | see the course of dimensionality problem. For this reason some modi cations are made. When using the basis functions in neural networks, the following changes are performed: ➥  The number H of basis functions do not need to be equal to the number P of training vectors | usually is much less. So fwkp g ! fwkj g.  The basis functions do not have to be centered around the training vectors.  The basis functions may have themselves some tunable (during training) parameters.  There may be a bias parameter. 10.2 See [Bis95] pp. 167{169. 10.3. RELATION TO OTHER THEORIES x1 175 x2 xN '1 '0 y1 'H y2 yK Figure 10.1: The radial basis function network architecture. 
On the hidden layer each neuron have a radial basis activation function. The activation function of the output layer is the identity. For bias '0 = 1. Then the output of the network looks like: H X yk (x) = wkj 'j (x) + wk0 j=1 ; , f' e (x) ( )=W y x (10.4)  f holds both weights and bias. where 'e T  '0 : : : 'K and W ✍ f e, W ❖' Remarks: ➥ If the basis functions are of Gaussian type the they are of the form: " (x ; j )T ;j 1 (x ; j ) 'j (x) = exp ; 2 # where j is a covariant symmetrical matrix. The model may be represented in the form of a two layer network where the hidden neurons have the basis functions as activation function. Note that the weight matrix from input to hidden layer is e1 and the output layer have the identity activation function. See gure 10.1. The basis function associated with bias is the constant function '0 (x) = 1. ➧ 10.3 10.3.1 Relation to Other Theories Relation to Regularization Theory A way to control the smoothing of the model (in order to avoid large oscillations in the output after a small variation of input) | and thus to control the complexity of the model 10.3 See [Bis95] pp. 171{179. 176 CHAPTER 10. RADIAL BASIS FUNCTION NETWORKS | would be trough an additional term in the error function which penalizes un-smooth network mappings. For a unidimensional output, the sum-of-squares error function will be written as: 1 E= ❖ , P ❖ D Z P p ; t ]2 + 2 jP yj2dx [y (xp ) (10.5) p =1 X where  is a constant named regularization parameter and P is a di erential operator such that large curvatures of y(x) gives rise to large values of jP yj2 . By replacing y(x ) with y(x )  (x ; x ) | where  is the Dirac function | the error function (10.5) may be expressed as a functional of y(x) (see the mathematical appendix). The condition of stationarity for E is to set to 0 its derivative with respect to y(x): p p E y Euler-Lagrange equation Green's functions 2 X X D p P = [y (xp ) =1 D ;t ] D( p x b y(x) = 0 ; xp ) +  PP (10.6) p where Pb is the adjoint di erential operator of P (see also the mathematical appendix). (10.6) is named the Euler-Lagrange equation. The solutions to (10.6) are found in terms of the Green's functions G of the operator P which are de ned as being the solutions to the equation: b G(x; x ) =  PP 0 x D( ;x ) 0 and then the solutions y(x) are searched in the form: X P y (x) = (10.7) wp G(x; xp ) =1 p where fw g p p =1;P are parameters found by replacing solution (10.7) back into (10.6), giving: X P p =1 [y (xp ) ;t p ] D (x X P ;x p) +  p wp D (x =1 ;x p) and by integrating around a small enough vicinity of x (small enough such that it will not contain any other x ), for all p = 1; P : p p0 y (xp ) ;t p + wp = 0 p = 1; P (because of the way  is de ned). By replacing the solution (10.7) in the above equation and considering the G matrix and ~ and ~t vectors: w D ❖G (10.8) 0 G(x ; x )    B 1... 1 . . . G=@ G(x1 ; xP )  G(xP ; x1 ) .. . 1 CA G(xP ; xP ) 10.3. RELATION TO OTHER THEORIES 177 nally (10.8) becomes: ~ T (G + I ) = ~tT w By comparing (10.4) and (10.7) it becomes clear that the Green's functions plays the same role as the basis functions (also the above equation may be compared to (10.3)). ✍ Remarks: ➥ If the P operator is invariant to translation and rotation then G functions are dependent only on the norm of x: G(x; x ) = G(kx ; x k). 0 10.3.2 0 Relation to Interpolation Theory Let consider a mapping h : X ! Y . A noise is added to input x. Let p(x) be the probability density of input and p( ) the probability density of the noise. 
Considering the sum-of-squares error function: e E= 1 2 ZZ [y (x +  ) X ; h(x)]2 p(x) pe( ) dxd where h(x) is the desired output (target) for x +  . By making the change of variable x +  = z : E= 1 2 ZZ [y ( z ) X ❖z ; h(x)]2 p(x) pe(z ; x) dxdz and the y(x) is found by setting E functional derivative (see the mathematical appendix) with respect to y(z ) to 0: E y = Z X [y (z ) ; h(x)] p(x) pe(z ; x) dx = 0 ) R h x p x pe z ; x dx X R p x pe z ; x dx X ( ) ( ) ( y (z ) = ( ) ( ) ) Considering a suciently large set of training vectors then the integrals may be approximated by sums: y (z ) = XP h x pe x ; xp p P P pe x ; xq p ( =1 e ❖ p, p ) ( q=1 ) ( ) and by comparing to (10.4) it becomes obvious that the above expression of y(x) represents an expansion in basis functions form. 178 CHAPTER 10. RADIAL BASIS FUNCTION NETWORKS 10.3.3 Relation to Kernel Based Method Considering a set of training data fx ; t g =1 then the estimated probability density of a pair fx; tg, assuming an exponential smoothing kernel function, is1 : p pe(x; t) = 1 P P X p=1 p p ;P  1 (2L2 )(N +K )=2 exp ; kx ; x k 2L+ kt ; t k p 2 p 2 2  (10.9) where L is the length of the hypercube side (N being the dimensionality of the input X space and K the dimensionality of the Y output space). The optimal mapping y(x) is given by the average over the desired output conditioned by the input: ❖L Z y(x) = R j t p(t x) dt = R t p(x; t) dt Y Y p(x; t) dt Y R (by using also the Bayes theorem; p(x; t) dt = p(x)). Y By replacing the p(x; t) with the value from (10.9) and integrating (see also the mathematical appendix regarding Gaussian integrals): P P y(x) = p=1 tp exp P P h exp p=1 Nadaraya-Watson estimator h ; kx;x2pk 2i 2L ; kx;x2pk2 i 2L known also as Nadaraya-Watson estimator | a formula similar to (10.4). ➧ 10.4 Classification For classi cation into a set of C classes the Bayes theorem gives: k P (Ck jx) = p(xjCk )P (Ck ) p(x) = p(xjCk )P (Ck ) K P q =1 p(xjCq )P (Cq ) (10.10) (because p(x) plays the role of a normalization factor). Because the posterior probabilities P (C jx) is what the model should calculate, (10.10) may be compared with (10.4), the basis functions being: k 'k (x) = K P q =1 1 10.4 p(xjCk ) p(xjCq )P (Cq ) See the non-parametric/kernel based method in the statistical appendix. See [Bis95] pp.179{182. 10.4. CLASSIFICATION x2 179 x2 C1 C1 C2 C2 x1 a) Figure 10.2: x1 b) Bidimensional pattern vector space. The di erence between perceptron based classi ers and radial basis function classi cation: a) The perceptron represents the decision boundaries; b) the radial basis function networks represents the classes trough the basis functions. and the output layer having only one connection per neuron from the hidden layer, the weight being P (Ck ) (in this case the hidden layer/number of basis function is also K ). The interpretation of this method is that each basis function represents a class | while the perceptron network represents the hyperplanes acting as decision boundaries. See gure 10.2. It is possible to improve the model by considering a mixture of basis functions m = 1; M (instead of one single basis function per class): p(xjCk ) = M X p m=1 (xj m) P (mjCk ) (i.e. the mixture model). 
The total probability density for x also changes to: p(x) = K X p k=1 (xjCk ) P (Ck ) = M X p m=1 (xj m) P (m) where P (m) represents the prior probability of mixture component and it can be expressed as: P (m) = K X P mjC k=1 k ) P (Ck ) ( By replacing the above expressions into the Bayes theorem: P (Ck jx) = p(xjCpk()xP) (Ck ) = PM P mjCk p jm P Ck P m M Pm Xw ' m km m PM p jm P m m ( ) (x ) ( ( ) ) ( ) =1 = m=1 (x ) ( ) =1 (x) ❖ P (m) 180 ❖ wkm , 'm CHAPTER 10. RADIAL BASIS FUNCTION NETWORKS ) ( PP ((m m) =1 being added on purpose) where: p(xjm) P (m) = P (mjx) and 'm (x) = P M p(xjn) P (n) n=1 wkm = P (mPjC(km) P) (Ck ) = P (Ck jm) (by using the Bayes theorem), i.e. the basis functions represents the posterior probability of x being generated by the component m of the mixture model and the weights represent the posterior probability of class Ck membership given a pattern vector generated by component m of the mixture. ➧ 10.5 Network Learning The learning procedure consists of two steps:  The basis functions are established and their parameters are found, usually by an unsupervised learning procedure, i.e. only the inputs fxp g from the training set are considered (the targets ftp g are ignored). The basis functions are usually de ned to depend only over a distance between the pattern vector x and the training vectors fxp g, i.e. kx ; xp k, such that they have a radial symmetry.  Then, having the basis functions properly established, the weights on output layer are to be found. 10.5.1 Radial Basis Functions The relation between radial basis function network and other statistical methods described in section 10.3, suggest that the basis functions should represent the probability density of input vectors. Then an unsupervised method may be envisaged to nd the parameters of the basis functions, several are described below. Subsets of data points This procedure builds the basis functions as Gaussians: 'j (x) = exp " ; kx ; j k 2 #  2 j2 The fj g vectors are chosen as a subset of the training set fxp g. (10.11) The fj g parameters are chosen all equal to a multiple of average distances between the centers of radial functions (as de ned by vectors i ). They should be chosen such as to allow for a small overlap between radial functions. 10.5 See [Bis95] pp. 170{171 and pp. 183{191. 10.5. NETWORK LEARNING ✍ 181 Remarks: ➥ This ad hoc algorithm usually gives poor results but it may be useful as a starting point for further optimization. Clustering The basis functions are built as Gaussian, as above. The training set is divided into a number of subsets Sj , equal to the number of basis functions, such that an error function associated with the clustering is minimized: H E= kxp ; j k2 j=1 xp 2Sj XX At the beginning the learning set is divided into subsets at random. Then the mean j is calculated inside each subset and each training pattern vector xp is reassigned to the subset having the closest mean. The procedure is repeated until there are no more reassignments. ✍ Remarks: ➥ Alternatively the j vectors may be found by an on-line stochastic procedure. First they are chosen at random from the training set. Then they are updated according to: j = (xp ; j ) for all xp , where  represents a \learning constant". Note that this is similar to nding the root of rj E , i.e. the minima of E , trough the Robbins-Monro algorithm. Gaussian mixture model The radial basis functions are considered of the Gaussian form (10.11). 
The probability density of the pattern vector is considered a mixture model of the form: H P (j ) 'j (x) p(x) = j=1 X It is possible to use the EM (expectation-maximisation) algorithm to nd j , j and P (j ) at step s +1 from the values at step s, starting with some initial values 0j , 0j and P0 (j ): P P(s) (j jxp ) xp (s+1)j = p=1P P(s) (j jxp ) p=1 P P v u PP P s (jjxp) kxp ;  s u u u p j =u u t N PP P s (jjxp) (s+1) =1 ( ) ( +1) p=1 ( ) j k2 182 CHAPTER 10. RADIAL BASIS FUNCTION NETWORKS 1 X P (j jx ) +1) (j ) = ( ) P P P( s p s p=1 where j = 1; H and: P (j jx) = P (j ) '(x) p(x) Supervised learning It is possible to envisage also a supervised learning for the hidden layer. E.g. considering the basis function of Gaussian form (10.11), y = k E= 1 2 P [y ( x ) ; t ] P p=1 k p kp 2 Pw H j =1 kj ' and the sum-of-squares error j which have to be minimized then the conditions are: @E X X = w [y (x ) ; t ] exp @ =1 =1 ; kx ;  k XX @E = w [y (x ) ; t ] exp @ =1 =1 ; kx ;  k P K kj j p k K kj p kp k p kp 2 j 2 p ! 2 2 j 2 kx ;  k = 0  p 2 j k P sj p p 2 ! j ; =0 x sp 2 j k j 3 sj j However, to solve the above equations is very computational intensive and the solutions have to be checked against the possibility of a local minima of E . 10.5.2 Output Layer Weights f = fw g =1 Considering the matrix W ;' 0 :::' H  then (10.4) gives: kj i ;K j =0;H y (biases included) and the vectorial function 'e T = f'e (x) =W (10.12) and it may be considered as a single layer network. Then considering the sum-of-squares error function E = 21 ❖ , T P [y(x ) ; t ] [y(x ) ; t ] the least squares technique is applied to P p p=1 0 1 ' (x1 ) H y T p p nd the weights. Considering that y is obtained by the form of a generalized linear discriminant, see (10.12), then the following matrices are build: 0 ' (x )    =B @ ... . . . ❖ p  '0 ( x ) .. . P ' (x ) H 1 CA and 0t  T =B @ ... . . . 11 t K1 P  t1 P t .. . 1 CA KP f = T which have the solution: Then (according to the least squares technique) W ; f = T y where y = T T W ; 1 10.5. NETWORK LEARNING 183 (y being the pseudo-inverse of ). Proof. f = T W ) Wf ;T = T T ) Wf = T y . CHAPTER 11 Error Functions ➧ 11.1 Generalities Usually neural networks are used to generalize (not to memorize) from a set of training data. The most general way to describe the modeller (the neural network) is in terms of join probability p(x; t), where t is the desired output (target) given x as input. The joined probability may be decomposed in the conditional probability of t, given x, and the probability of x: p( ❖ p( ❖ L x; t) = p(tjx) p(x) where the unconditional probability of x may be written also as: p( x) = Z p( x; t) dt Y Most error functions may be expressed in terms of the maximum likelihood function L, given the fx ; t g =1 set of training vectors: p p p ;P L= Y P p =1 Y P xp ; tp ) = p( p =1 t jxp ) p(xp ) p( p which represents the probability of observing the training set fx p; 11.1 See [Bis95] pp. 194{195. 185 tp g. x; t) 186 CHAPTER 11. ERROR FUNCTIONS The ANN parameters are tuned such that L is maximized. It may be convenient instead to minimize a derived function: P P P X X X Ee = ; ln L = ; ln p(tp jxp ) ; ln p(xp ) ) E = ; ln p(tp jxp ) (11.1) p=1 p=1 p=1 error function where, in E , the terms p(xp ) were dropped because they don't depend on network parameters. The E function is called error function. ➧ 11.2 Sum-of-Squares Error It is assumed that the K components of the output vector are statistically independent, i.e. 
K Y p(tjx) = p(tk jx) (11.2) k=1 (tk being the k-th component of vector t). It is assumed that the distribution of target data is Gaussian, i.e. it is composed from a deterministic function h(x) and some Gaussian noise ":  2 "k 1 tk = hk (x) + "k where p("k ) = p exp ; 2 2 2 2  (11.3) Then "k = tk ; hk (x), hk (x) = yk (x; W ) because it's the model represented by the neural network (W being the network parameters), and p("k ) = p(tk jx) (as hk (x) are purely deterministic): p(tk jx) = p1 2 2  exp ) ; tk ] ; [yk (x; W 2 2 2  (11.4) By using the above expression in (11.2) and then in (11.1), the error function becomes: P K 1 XX PK 2 E= 2 ln 2 [yk (xp ; W ) ; tkp ] + P K ln  + 2 2 p=1 k=1 ❖ tkp RMS where tkp is the component k of vector tp . By dropping the last two terms which are weight (W ) independent, as well as the 1=2 from the rst term, the error function may be written as: P K 1 XX E= kyk (xp ; W ) ; tkp k2 (11.5) 2 p=1 k=1 ✍ Remarks: ➥ 11.2 It is sometimes convenient to use one error function for network training, e.g. sum- See [Bis95] pp. 195{208. 11.2. SUM-OF-SQUARES ERROR 187 of-squares, and another to test over the performance achieved after training, e.g. the root-mean-square (RMS) error: PP ky ; t k p p P 1 X p=1 ERMS = where h ti = tp PP kt ; htik P p =1 p p=1 which have the advantage that is not growing with the number of tests P . 11.2.1 Linear Output Units Generally, the output neuron k is applying the activation function f to the weighted sum of zj | inputs from (last) hidden layer | in order to compute its own output: H X f) yk (x; W ) = f (ak ) where ak = wkj zj (x; W j=0 f is the set of all weights except those involving the output layer, H is the number where W of neurons on hidden layer and wk0 is the bias corresponding to z0 = 1 (wkj being the weight for connection from neuron j to neuron k). Then, the derivative of error (11.5), with respect to the total input ak into the neuron k, is: P X df (ak ) @E = [yk (xp ; W ) ; tkp ] @ ak dak p=1 Assuming a linear activation function f (x) = x then: P X @E @E = [yk (xp ; W ) ; tkp ] = @ ak @ yk p=1 and the optimization of output layer weights becomes simple. Let: P P X X hti = P1 tp and hzi = P1 z(xp ) p=1 p=1 W w11 .. . wK 1 1 0e  e 1 0e  e 1 11 1P 11 1P C B B . . . .. C . . ... e e . . . . . = = @ . . ... CA @. . . A . A eK1    eKP    KH eH1    eHP  w1 H z ; w t z Z ; z z f, W H , ak , zj ❖ hti, hzi and the following Wout , Ze and Te matrices are de ned: 0 B out = @ ❖ ❖ W ❖ z out , Ze, Te t T t t where Wout holds the weights associated with links between (last) hidden and output layers, z ejp = zj (xp ) ; hzj i and etkp = tkp ; htk i. Then the solution for weights matrix is: eZey where Zey = ZeT(ZeZeT);1 Wout = T (11.6) ejp , ekp t 188 CHAPTER 11. ERROR FUNCTIONS e y , w0 ❖Z ey Z being the pseudo-inverse of Ze. The biases w0T = w0 k= y H X j =1  hzi K 0 are found trough: (11.7) kj j + k0 (11.8) = hti ; Proof. By writing explicitly the bias: ; w Wout z w10 ::: w w and putting the condition of minimum of E with respect to biases: @E @w ❖ h k i, h j i t z k0 k @y @w 3 P 2X H X 4 = wkj zj (xp ) + wk0 ; tkp 5 = 0 p=1 j=1 (zj (xp ) being the output of the hidden neuron j when the input of the network is xp ) and then: k0 = h k i ; w bias = k k0 @y @E t H X j =1 kj h j i where h k i = w z t 1 P P X p=1 kp t ; h ji = 1 z P P X p=1 j (xp ) z i.e. the bias compensate for the di erence between the average of target and the weighted average of (last) hidden layer output. 
By replacing the biases found, back into (11.5) (through (11.8)): E = 1 2 32 P X K 2X H X 4 wkj zejp ; etkp 5 (11.9) p=1 k=1 j =1 The minimum of E with respect to wkj is found from: @E kj @w = P "X̀ X p=1 s=1 # e wks z esp ; tkp zejp = 0 ; k = 1; K ; j = 1; H (11.10) and the set of equations (11.10) may be condensed into one matrix equation: e eT ; TeZeT = e0 (11.11) Wout Z Z which yields the desired result. ✍ Remarks: ➥ ➥ outliers ➥ It was assumed that ZeZeT is invertible. f weights xed. However if The solution (11.6) was found by maintaining the W they change, the optimal solution (11.6) changes as well. The sum-of-squares error function is sensitive to training vectors with high error | called outliers . It is also sensitive to misslabeled data because it leads to high error. 11.2.2 Linear Sum Rules ❖ u, u0 Let assume that for all targets (desired outputs) from the training set it is true that: uT tp + 0 = 0 where u u ; u0 = const. 8 ; p By summing over all training vectors it follows that u0 = ;uT hti and then: uT tp = ;uT hti (11.12) 11.2. SUM-OF-SQUARES ERROR 189 Then it is also true that: = uT hti uT y i.e. if the training set have the property (11.12) then any network output will have the same property. The network output is y = out z + w0 and also from (11.7): w0 = hti ; out hzi. Then: uT y = uT ( out z + w0 ) = uT e e y (z ; hzi) + uT hti Proof. W W W TZ But the elements of the row matrix (vector) uT Te are: fuT eT gp = uT e(: T T ; p) ; hti) = 0 = uT (tp by virtue of (11.12). 11.2.3 Signi cance of Network Output Let consider that the number of training data sets grows to in nity and the following error function (using Euclidean norm): 1 = lim !1 2 E P X = 21 = 12 ky(x p; W P P =1 [ (x ); [ (x ) ; ]2 ( jx) (x) yk =1 X;Y p; W yk =1 X;Y tkp ]2 ( p tk ; x) dtk dx k ZZ K X k (11.13) p p ZZ K X k ) ; t k2 p; W tk p tk p x dtk d k where Y is the unidimensional component of the output space Y related to t . The following conditional averages are de ned: k ❖ k h jxi  Z ( jx) tk p tk tk and ht2 jxi  dtk Z 2 ( jx) tk p tk k ❖ (11.14) dtk Yk Yk Then, by using the above de nitions, it's possible to write: [y ; t ]2 = [y ; ht jxi + ht jxi ; t ]2 k k k k k k = [ ; h jxi]2 + 2( ; h jxi)(h jxi ; ) + [h jxi ; ]2 and by replacing into the expression of , the middle term cancels after integration over (h jxi ; ! h jxi ; h jxi = 0) yk tk yk tk tk tk tk tk E tk tk tk XZ 1 [ (x =2 =1 K E yk k X tk tk ;W XZ 1 ) ; h jxi] (x) x + 2 [h 2 jxi ; h jxi2 ] (x) =1 tk 2 K p d tk k tk (upon integration over t : the rst term is independent of p(t jx) and k p d x (11.15) X R k Yk ( jx) p tk dtk =1 as p(t jx) is assumed normated, while for the second term [ht jxi ; t ]2 ! ht jxi2 ; 2ht jxi2 + ht2 jxi). k k k k k k Yk h jxi, h 2 jxi tk t k 190 CHAPTER 11. ERROR FUNCTIONS j p(t x) y (x) t x x Figure 11.1: The network output signi cance. Unidimensional input and output space. Note that htjxi doesn't necessary coincide with the maximum of p(tjx). In the expression (11.15) of the error function, the second term does not depend upon network parameters W and may be dropped; the rst term is always positive and then E > 0. The error becomes minimum (zero) when the integrand in the rst term is zero, i.e. yk (x; W ❖ W   )= h jxi (11.16) tk where W  denotes the nal weights, after the learning process have been nished (i.e. optimal W value). The ANN output with sum-of-squares error function represents the average of target conditioned on input. See also gure 11.1. 
To obtain the above result the following assumptions were made:  The training set is suciently large: P ! 1.  The number of weights (w parameters) is suciently large.  The absolute minimum of error function was found. ✍ Remarks: ➥ ➥ The above result does not make any assumption over the network architecture or even the existence of a neuronal model at all. It holds for any model who try to minimize the sum-of-squares error function. As p(t jx) is normated then each term in the sumRappearing in last expression of E in (11.13) may be multiplied conveniently by p(t jx) dk0 = 1 such that k k0 E Yk 0 may be written as: E = 1 ZZ 2 ky(x p; W ) ; tk2 (tjx) p p(x) dt dx X;Y (ft g were assumed statistically independent, p(tjx) = k Q K k =1 j ) and then the p(tk x) 11.2. SUM-OF-SQUARES ERROR 191 proof of (11.16): hj i y(x; W ) = t x may be expressed similarly as above but directly in vectorial terms. The same result may be obtained faster by considering the functional derivative of E with respect to y (x) which is set to zero to nd its minima: k ZZ E yk (x) [yk (x; W ) = ;t k] j p(tk x) p(x) dtk dx = 0 X;Yk The right term may be split in two integrals, normated and then: yk (x; W ) is independent of t , j h ji Z k j is p(tk x) tk p(tk x) dtk = tk x yk (x; W ) = Yk R (the integrand of have to be zero). In general the result (11.16) represents the optimal solution: assuming that the data is generated from a set of deterministic functions h (x) with superimposed zero-mean noise then: X k ) tk = hk (x) + "k h ji h ji yk (x) = tk x = hk (x) + "k x = hk (x) The variance of the target data, as function of x, is: 2 Z [tk k (x) = ; ht jxi]2 p(t jx) dt k k k h j i ; ht jxi2 2 = tk x k (11.17) Yk ([t ; ht jxi]2 ! ht2 jxi ; 2ht jxi2 + ht jxi2 ) i.e. is exactly the residual term of the error function (11.15) and it may be used to assign error bars to the network prediction. k k ✍ k k k Remarks: ➥ Using the sum-of-squares error function, the network outputs are x-dependent means of the distribution and the average variance is the residual value of the error function at its minimum. Thus the sum-of-squares error function cannot distinguish between the true distribution and a Gaussian distribution with the same x-dependent mean and average variance. 11.2.4 Outer product approximation of Hessian From the de nition (11.13) of error function, the Hessian terms are: @2E @ws @wq = K Z X @yk @yk k =1 X @ws @wq p(x) dx + K Z X k =1 X @ 2 yk @ws @wq (yk ; ht jxi)p(x) dx k 192 CHAPTER 11. ERROR FUNCTIONS where ws , wq are some two weights, also theRintegration over tk was performed (yk is independent of p(tk jx) which is normated, i.e. p(tk jx) dtk = 1). Yk The second term, after integration over x, cancels because of the result (11.16), such that the Hessian becomes | after reverting back from the integral to the discrete sum: P X K @2E 1X @yk (xp ; W  ) @yk (xp ; W  ) = @ws @wq P p=1 k=1 @ws @wq ➧ 11.3 Minkowski Error A more general Gaussian distribution for the noise ", see (11.3), is: p("k ) = R 1=R 2;E (1=R) exp ; ; j"k jR  = p(tk jx) ; R = const. (11.18) where ;E is the Euler function1 . By a similar procedure as used in section 11.2 (and using the likelihood function) the error function becomes: E= Minkowski error P X K X p=1 k=1 jyk (xp ; W ) ; tkp jR which is named Minkowski error. 
The derivative of the error function, with respect to the weights, is P K xp ; W ) @E X X = jy (x ; W ) ; tkp jR;1 sign(yk (xp ; W ) ; tkp ) @yk (@w @ws p=1 k=1 k p s which may be evaluated by the means of backpropagation algorithm (ws being here some weight). ✍ Remarks: ➥ The constant in front of exp function in (11.18) ensures the normalization of R probability density: p("k ) d"k = 1. Yk ➥ LR norm 11.3 1 Obviously, for R = 2 it reduces to the Gaussian distribution. For R = 1 the distribution is called Laplacian and the corresponding error function city-block metric. More generally the distance jy ; tjR is named LR norm. See [Bis95] pp. 208{210. See the mathematical appendix. 11.4. INPUT-DEPENDENT VARIANCE ➥ 193 The use of R < 2 reduces the sensitivity to outliers. E.g. by considering a network with one output and R = 1 then: X P E= jy(x p p =1 ;W) ;t j p and, at minimum, the derivative is zero: X P p =1 ;t )=0 sign(y (xp ; W ) p condition which is satis ed if there are an equal number of points t are t < y, irrespective of the value of di erence. p >y as there p ➧ 11.4 Input-dependent Variance Considering the variance as depending on the input vector distribution of the noise then, from (11.4): p(tk jx) = p  1 exp 2 k (x) ; [y k( k = k (x) x; W ) ; t 2 2 (x) k] 2 and a Gaussian  k By the same means as used in section 11.2 for the sum-of-squares function (by using the likelihood function), the error may be written as: X X  [y ( x P E= K k p =1 k=1 p ;W) ;t  2 kp ] 2k2 (xp ) + ln k (xp ) Dividing by 1=P and considering the limit P ! 1 (in nite training set) then: X ZZ  [y (x; W ) ; t E= K k k =1X;Y k k] 2k2 (x) 2  + ln k (x) p(tk jx) p(x) dt k dx and, the condition of minima with respect to y is: E yk = p(x) Z k yk (x; W ) ;t k k2 (x) Yk p(tk jx) dt k =0 which means that the output of network, after the training was done, is: h jxi yk (x; W  ) = tk Similarly, the condition of minima for E with respect to  is: E k (x) 11.4 = p(x) See [Bis95] pp. 211{212. Z  Yk k [y ; (x) 1 k x) ; t  3 (x) k( k] k 2  p(tk jx) dt k =0 194 CHAPTER 11. ERROR FUNCTIONS and, after the training (E minimum), the variance is (see also (11.17)): k2 (x) = h[tk ; htk jxi]2 jxi The above result may be used to nd the variance as follows:  First a network is trained to minimize the sum-of-squares error function, using the fxp ; tp gp=1;P training set.  The outputs of the above network are yk = htk jxi. These values are subtracted from the target values tk and the result is squared. Together with the input vectors xp they form a new training set fxp ; (tp ; htp jxi)2 gp=1;P .  The new set is used to train a new network with a sum-of-squares error function. The outputs of this network are k2 = h[tk ; htk jxi]2 jxi. ➧ 11.5 Modeling Conditional Distributions A very general framework of modeling conditional distributions is to build a model in two stages:  The rst stage uses the input vectors xp to model | trough an ANN | some parameters (x).  The  parameters are used into a parametric model (non ANN) to nd the conditional probability density p(tjx). ✍ Remarks: The above approach may deal well with complex distributions. By comparison, a neural network with sum-of-squares error function may model just Gaussian distributions with a global variance parameter and a x-dependent mean. As parametric model, a good choice is the mixture model. In this approach, the distribution ➥ ❖ M PM of p(x) is considered of the form: p(x) = p(xjm) P (m) where M is the number of m=1 mixture components m. 
On similar grounds, the probability distribution p(tjx) may be expressed as: p(tjx) = ❖ m M X m=1 m (x) 'm (tjx) (11.19) where m (x) are prior probabilities, conditioned on x, of the target vector t being generated by the m-th component of the mixture. Being probabilities they have to satisfy the constraint of normality: M X m=1 11.5 See [Bis95] pp. 212{222. m (x) = 1 and m (x) 2 [0; 1] 11.5. MODELING CONDITIONAL DISTRIBUTIONS 195 The kernel functions 'm (tjx) represents the conditional density of the target vector t for the m-th mixture component. One possibility is to choose them to be of Gaussian type:   2 'm (tjx) = (2)K=12 K (x) exp ; kt ;22m(x(x))k (11.20) m m and then the outputs of the neural network may be de ned as follows:  A set of outputs fy mg for the m parameters which will be calculated trough a softmax function: y m) exp(y n ) exp( m= P M n=1 ❖ 'm y m , ym , ykm ❖ softmax function (11.21)  A set of outputs fym g for the m parameters which will be calculated trough the exponential function: m = exp(ym ) (11.22)  A set of outputs fykm g for the m parameters which will be represented directly by the network outputs: km = ykm The error function is build from the likelihood: E = ; ln L = ; ln P Y p=1 p(tp jxp ) ! ; = P X p=1 ln M X m=1 m (xp ) 'm (tp jxp ) ! The problem is to nd the network weights in order to minimize the error function. This is done by nding an expression for the derivatives of E with respect to network outputs y( ) and then the weights are updated by using the backpropagation algorithm. The error function will be considered as a sum of P components Ep : ❖ Ep  Ep = ; ln M X m=1 The posterior probability that the pair mixture is: m (xp ) 'm (tp jxp ) n=1 M P m=1 (11.23) ; was generated by the component m of the (x t) m (x) 'm (tjx) m (x; t) = P M n (x) 'n (tjx) and, obviously, is normated ( ! m = 1). ❖ m 196 CHAPTER 11. ERROR FUNCTIONS @E @y m The derivatives. From (11.23): @@Enp = ; nn and from (11.21): @ n @y m nm n ; n m = (nm being the Kronecker symbol). Ep depends upon y m trough all n parameters, then: @Ep @y m M @E @ X p n = n=1 @ n @y m = m ; m (by using the property of normation for k as well). @E @ym The derivatives. From (11.23) and (11.20): @Ep @m m Also from (11.22) dydm = @Ep @ym @E @ykm The =   2 ;m kt ;3 m k ; K m m m and then: = @Ep dm @m dym =   2 ;m kt ;2 m k ; K m derivatives. Since km = ykm , then from (11.23) and (11.20) (Euclidean norm): km ; tk @Ep @Ep = = m 2 @ykm @km m The conditional average of the target data is: htjxi = ❖ s Z Y j tp(t x) dt = M X m=1 m (x) Z Y j t'm (t x) dt = (from (11.19) and (11.20); it was also assumed that Y mathematical appendix). The variance s(x) is: s 2 Proof. (x)  hkt ; htjxik jxi = 2 M X m=1 " m (x) m (x) + 2 = M X m (x) m (x) m=1 (11.24) RK , see Gaussian integrals in the M X n=1 n (x) n (x) ; m (x) 2 # From the de nition of average and using (11.19) and (11.20) (also: Euclidean norm and Yk = R): s2 (x) = Z k ; h j ik p j RK t t x 2 (t x) dt = XK Z tk ; htk j i k=1 R ( j x )2 p(tk x) dtk 11.6. CLASSIFICATION USING SUM-OF-SQUARES = = K X M X k=1 m=1 K X M X k=1 m=1 m (x) Z R 197 tk ; htk jxi)2 'm (tk jx) dtk ( m (x) K (x) (2 )K=2 m Z R  [t ;  (x)]2  dtk ; k 22km(x) tk ; htk jxi)2 exp ( m To calculate the Gaussian integral above rst make a change of variable etk = tk ;htk jxi and then a second one btk = etk + htk jxi ; km (x) forcing also a squared btk . This leads to: Z R ( tk ; htk jxi)2 exp etk , btk  [t ;  (x)]2   (t + ht jxi ;  (x))2  Z k km dtk = etk exp ; k detk ; k 22km(x) 2 2 (x) m R ! 
m Z 2 bt2 ! b btk exp ; 2btk dbtk ; (htk jxi ; km )2 exp ; 2k dtk 2m (x) 2m (x) R R Z bt2k ! b b ; 2(htk jxi ; km ) tk exp ; 22 (x) dtk Z = ❖ R m First term, after an integration by parts, leads to a Gaussian (see mathematical appendix) and equals p2 2 (x) while in the third term the integrand p 2  3 (x); the second one is directly a Gaussian giving m m is an odd function, integrated over a domain symmetrical around origin, so it is zero. The sought result is obtained after the (11.24) replacement. ➧ 11.6 Classification using Sum-of-Squares If the sum-of-squares is used as error function then the outputs of neurons represent the conditional average of the target data y = htjxi as proven in (11.16). In problems of classi cation the t vectors represent the class labels. Then the most simple way of labeling is to assign an network output to each class and and to consider that an input vector x is belonging to that class C represented by the (output) neuron k with the biggest output. This means that the target for a vector x 2 C is ft =  g =1 ( being the Kronecker symbol). This method is named one-of-k encoding. The p(t jx) probability may be expressed as: k k k kq q kq ;K k j X K p(tk x) = q D (tk =1 ; kq ) C jx) P( q ( being the Dirac function and P (C jx) is the probability of the given x). From (11.16) and the above expression of p(t jx): D q XZ q =1 Y 2 C event, for a q k K yk (x) = x kq D (tk ; kq ) P( C jx) dt q k C jx) = P( k k i.e. when using the one-of-k encoding the network outputs represent the posterior probability. 11.6 See [Bis95] pp. 225{230. one-of-k 198 CHAPTER 11. ERROR FUNCTIONS The outputs of network being probabilities must sum to unity and be in the range : [0; 1]  By the way the targets are chosen (1-of- ) and the linear sum rules (see section 11.2.2) k the outputs of the network will sum to unity.  The range of the outputs may be ensured by the way the activation function is chosen (e.g. logistic signal activation function). 11.6.1 Hidden Neurons The error function (11.9) may be written in a matrix form: E ❖ S B , ST = 1 2 f e Tr (Wout Z ; e)T ( e ; e)g T Wout Z B eTeT TeZ eT =Z (11.25) T and, by replacing the solution (11.6): E = 1 2 B ST;1 ) where ; eT Te Tr(T S S T ; S eZ eT =Z (11.26) As Zey = ZeT (ZeZeT ) and for two matrices (AB )T = (B T AT ) then:   o T   n 1 TeZeyZe ; Te TeZeyZe ; Te = 1 Tr ZeT ZeyTTeT ; TeT TeZeyZe ; Te Tr Proof. E= 2 = = 1 2 1 2 Tr Tr nh n 2 ih ZeT (ZeZeT );1T ZeTeT ; TeT TeZeT (ZeZeT );1 Ze ; Te io ZeT (ZeZeT );1T ZeTeT TeZeT (ZeZeT );1 Ze ; ZeT (ZeZeT );1T ZeTeT Te ; TeT TeZeT (ZeZeT );1 Ze + TeT Te o For two matrices A and B , such that B have same dimensions as AT , it is true that Tr(AB ) = Tr(BA). This property is used in rst and third terms, by moving Ze from left to right, thus they become identical and cancel out. As Tr(AT ) = Tr(A), the second term is transposed (again (AB )T = B T AT ) and then, Ze is moved from left to right: n o E = 1 Tr ;ZeTeT TeZeT (ZeZeT );1 + TeT Te 2 Minimizing the error becomes equivalent to maximize the expression: J = 1 2 Tr ; S  B ST;1 (11.27) The ST matrix may be written as (see the de nition of Ze): T S ❖ P k , hziCk P X eZ eT = =Z p=1 [z(xp ) ; hzi][z(xp ) ; hzi]T (11.28) i.e. it represents the total covariance matrix of (last) hidden layer with respect to the training set. P z(xp ) (i.e. the mean of hidden Let Pk be the number of xp 2 Ck and hziCk = P1k xp 2Ck neurons output over the training vectors belonging to a single class). Then: B S = K X k=1 k (hziCk ; hzi)(hziCk ; hzi) P 2 T (11.29) 11.6. 
CLASSIFICATION USING SUM-OF-SQUARES Proof. SB =  ZeTeT TeZeT = TeZeT 199  T  TeZeT (as (AB)T = BT AT ). Considering the one-of-k encoding P P P tkp = PPk , also hzj i = P1 P zj (xp ) and then: then htk i = P1 p=1 p=1 etkp = tkp ; Pk P and ZeT (p; j ) = zj (xp ) ; P P P X P p=1 zj (xp ) 1 tkp zj (xp ) = Pk hzj iCk due to one-of-k encoding: p=1   P  P   X X TeZeT (k; j ) = tkp ; PPk zj (xp ) ; P1 zj (xp ) = Pk (hzj iCk ; hzj i) p=1 p =1 Each element of TeZeT is calculated directly, note that 0 0  The nal result is obtained by doing the TeZeT  T  TeZeT multiplication. The processing on hidden neuron layer(s) may be seen as non-linear transformation such that (11.27) is maximized. The (11.27) expression have a strong resemblance with the Fisher criterion. The output of (last) hidden layer have the role to generate maximum discrimination between classes, optimum for a linear transformation (performed by the output layer). ✍ Remarks: ➥ Note that SB contains the multiplicative factor Pk2 under the sum, fact which strongly raises the importance of classes well represented in the training set, see section 11.6.2. 11.6.2 Weighted Sum-of-Squares For the sum-of-squares networks, the minimization of the error function results into a maximisation of the J term (see (11.27)) which for the one-of-k encoding is dependent upon the observed prior probabilities P (Ck ) = Pk =P of the training set. If these prior probabilities di er in the training set from the true P (Ck ) then the (trained) classi er (model) built will be suboptimal. To correct the above problem the error function may be modi ed by inserting a weighting factor p for each pattern; the error becoming: e E= ✍ 1 2 XP pk p=1  y(xp ) ; tp k2 where p = P (Ck ) for xp 2 Ck P (Ck ) e Remarks: (11.30) The prior probabilities P (Ck ) are usually not known, then they may be assumed to be the probabilities related to some test patterns (patterns which are run trough the net but not used to train it) or the adjusting coecients p may be even estimated by other means. Considering the identity activation function f (a) = a then: ➥ @E @ak = @E df @yk dak = @E @yk = XP p k p=1  [y (xp ) ; tkp ] e ❖ P (Ck ), P (Ck ) ❖ p 200 CHAPTER 11. ERROR FUNCTIONS where (bias was considered): yk = K X q=1 (11.31) wkj zj + wk0 Then the condition of minima of E with respect to wk0 is: 1 0H P X X @E @yk @E w z (x ) + wk0 ; tkp A = 0 = =  @ @wk0 @yk @wk0 p=1 p j =1 kj j p ❖ htk i, hzj i the solution being: wk0 = htk i ; H X j=1 wkj hzj i PP  t htk i = p=1P P ; p kp p=1 ; p PP  z (x ) pj p hzj i = p=1 P P p=1 p (11.32) From (11.30), using (11.31) and (11.32), by same reasoning as for (11.9) and with the same meaning for zejp and etkp : 12 0 0H 12 P X P K H K X X 1 XX X p p @ wkj zejp p ; etkp pA @ wkj zejp ; etkp A = p E= 2 2 1 p=1 ❖K p=1 k=1 j=1 k=1 j=1 The matrix of p coecients is build as: 0p BB 0 1 K =B B@ ... 0 0 ...  1 .. C . C CC A p0 P  0 BB 01 K T K = KK T = B B@ ... 0 ... 0 ) 0 and then, by the same reasoning as for (11.25): E e e ❖ Z 0, T 0 = 1 2 Tr  Wout Ze 0 ; Te 0 T  Wout Ze 0 ; Te 0 0 ...   ... 0  e and Te = TeK . where Ze = ZK On the other hand, the condition of minimum for E , relative to the weights, is: ! P H X X @E ze =  w ze ; et @wkj p=1 p q=1 kq qp kp jp ! P H X X p p p = p wkq zeqp p ; etkp p zejp p = 0 p=1 q=1 0 0 1 .. C .C CC 0 A 0 P (11.33) 11.6. 
or, in matrix notation, similar to (11.11):

(W_{out}\tilde{Z}' - \tilde{T}')\tilde{Z}'^T = 0   (11.34)

which leads to the solution (see also (11.6) and the related calculations; the results are similar, by making the substitutions \tilde{Z} \leftrightarrow \tilde{Z}' and \tilde{T} \leftrightarrow \tilde{T}'):

W_{out} = \tilde{T}'\tilde{Z}'^{\dagger}  where  \tilde{Z}'^{\dagger} = \tilde{Z}'^T(\tilde{Z}'\tilde{Z}'^T)^{-1}

and the error function may be written as (see (11.26)):

E = \frac{1}{2} \mathrm{Tr}\left\{ \tilde{T}'^T\tilde{T}' - S_B' S_T'^{-1} \right\}

where S_B' = \tilde{Z}'\tilde{T}'^T\tilde{T}'\tilde{Z}'^T and S_T' = \tilde{Z}'\tilde{Z}'^T.

Similar to (11.28) and (11.29), considering the definition of \kappa_p and the one-of-k encoding, the S_T' and S_B' matrices may be written as:

S_T' = \sum_{k=1}^{K} \frac{\tilde{P}(C_k)}{P(C_k)} \sum_{x_p \in C_k} [z(x_p) - \langle z\rangle][z(x_p) - \langle z\rangle]^T

S_B' = \sum_{k=1}^{K} P^2 \tilde{P}^2(C_k)\, (\langle z\rangle_{C_k} - \langle z\rangle)(\langle z\rangle_{C_k} - \langle z\rangle)^T

Proof. The proof for S_B' is very similar to the proof of (11.29). Note that in the one-of-k encoding:

\sum_{p=1}^{P} \kappa_p t_{kp} z_j(x_p) = \frac{\tilde{P}(C_k)}{P(C_k)} P_k \langle z_j\rangle_{C_k}

(of course, assuming that the training set is correctly classified) and so forth for the rest of the terms. Also P(C_k) = P_k/P.

11.6.3 Loss Matrix

Penalties for misclassification may be introduced in the one-of-k encoding by changing the targets to:

t_{kp} = 1 - L_{k\ell}  for  x_p \in C_\ell ,  where  L_{k\ell} = 0 if \ell = k, L_{k\ell} \in [0,1] otherwise

L being the loss matrix. Note that for L_{k\ell} = 1 - \delta_{k\ell} the situation is reduced to the one-of-k case.

Considering the loss matrix, S_B becomes:

S_B = \sum_{k=1}^{K} \left[ \sum_{\ell=1}^{K} P_\ell (1 - L_{k\ell})(\langle z\rangle_{C_\ell} - \langle z\rangle) \right] \left[ \sum_{\ell'=1}^{K} P_{\ell'} (1 - L_{k\ell'})(\langle z\rangle_{C_{\ell'}} - \langle z\rangle) \right]^T

Proof. Same as for the proof of (11.29):

\sum_{p=1}^{P} t_{kp} z_j(x_p) = \sum_{\ell=1}^{K} \sum_{x_p \in C_\ell} (1 - L_{k\ell}) z_j(x_p) = \sum_{\ell=1}^{K} P_\ell (1 - L_{k\ell}) \langle z_j\rangle_{C_\ell}

and so forth.

➧ 11.7 Cross Entropy

(See [Bis95] pp. 230-240.)

11.7.1 Two Classes Case

Considering a two-class problem, a one-output network is discussed, such that the target is either t = 1 for pattern vectors belonging to C_1, or t = 0 otherwise (x \in C_2), i.e. the network output represents the posterior probabilities P(C_1|x) = y(x) and P(C_2|x) = 1 - y(x). Then p(t|x) may be written as:

p(t|x) = y(x)^t [1 - y(x)]^{1-t}   (11.35)

and for the whole training set, giving the likelihood:

L = \prod_{p=1}^{P} p(t_p|x_p) = \prod_{p=1}^{P} y(x_p)^{t_p} [1 - y(x_p)]^{1-t_p}

The error function may be taken as the negative logarithm of the likelihood function (as previously discussed):

E = -\ln L = -\sum_{p=1}^{P} \left\{ t_p \ln y(x_p) + (1 - t_p) \ln[1 - y(x_p)] \right\}   (11.36)

also known as cross-entropy. The error minimum with respect to y(x_p) is found by zeroing its derivative:

\frac{\partial E}{\partial y(x_p)} = \frac{y(x_p) - t_p}{y(x_p)[1 - y(x_p)]}

the minimum occurring for y(x_p) = t_p, \forall p \in \{1, \ldots, P\}, the value being:

E_{min} = -\sum_{p=1}^{P} \left\{ t_p \ln t_p + (1 - t_p) \ln(1 - t_p) \right\}   (11.37)

✍ Remarks:
➥ Considering the logistic activation function y = f(a):

f(a) = \frac{1}{1 + e^{-a}} ,  \frac{df}{da} = f(a)[1 - f(a)]

then the derivative of the error with respect to a(x_p) (the total input to the output neuron when presented with x_p) is:

\frac{\partial E}{\partial a(x_p)} = \frac{\partial E}{\partial y(x_p)} \frac{df}{da(x_p)} = y(x_p) - t_p

For binary targets (t_p \in \{0,1\}, as in the one-of-k encoding) either t_p or 1 - t_p is 0 and then E_{min} = 0.
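A minimal numerical sketch (illustrative values only, not from the original text) of (11.36) and of the \partial E/\partial a = y - t property, in Scilab:

// Cross-entropy error (11.36) for a one-output, two-class problem.
function E = cross_entropy(y, t)
    E = -sum(t .* log(y) + (1 - t) .* log(1 - y));
endfunction

t = [1 0 0 1];                  // binary targets
a = [2.0 -1.0 0.5 1.5];         // total inputs of the output neuron
y = ones(a) ./ (1 + exp(-a));   // logistic activation
mprintf("E = %f\n", cross_entropy(y, t));
disp(y - t);                    // equals dE/da for the logistic output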
For the general case, the minimum value (11.37) may be subtracted from the general expression (11.36) such that E becomes:

E = -\sum_{p=1}^{P} \left\{ t_p \ln \frac{y(x_p)}{t_p} + (1 - t_p) \ln \frac{1 - y(x_p)}{1 - t_p} \right\}   (11.38)

11.7.2 Sigmoidal Activation Functions

Let assume that the probability densities of the outputs z of the hidden neurons, conditioned on the classes C_k (k \in \{1,2\}), are of the general exponential form:

p(z|C_k) = \exp\left[ A(\theta_k) + B(z, \varphi) + \theta_k^T z \right]   (11.39)

where \theta_k and \varphi are some parameters defining the form of the distribution and A and B are some (fixed) functions. From the Bayes theorem:

P(C_1|z) = \frac{p(z|C_1) P(C_1)}{p(z)} = \frac{p(z|C_1) P(C_1)}{p(z|C_1) P(C_1) + p(z|C_2) P(C_2)} = \frac{1}{1 + \exp(-a)}  where  a = \ln \frac{p(z|C_1) P(C_1)}{p(z|C_2) P(C_2)}

By replacing (11.39) in the expression of a (above):

P(C_1|z) = \frac{1}{1 + \exp[-(w^T z + w_0)]}  where  w = \theta_1 - \theta_2 ,  w_0 = A(\theta_1) - A(\theta_2) + \ln \frac{P(C_1)}{P(C_2)}

i.e. the network output is represented by a logistic sigmoid activation function applied to the weighted sum of the hidden neuron outputs (obviously only those connected to the network output).
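As a concrete instance of (11.39), not given in the text: spherical Gaussians with unit covariance belong to this exponential family, with \theta_k = \mu_k and A(\theta_k) = -\|\mu_k\|^2/2. The Scilab sketch below (all values made up) checks numerically that the Bayes posterior equals the logistic sigmoid of w^T z + w_0.

// Two Gaussian classes, unit covariance: p(z|Ck) ~ exp(-||z - mu_k||^2 / 2).
mu1 = [1; 0]; mu2 = [-1; 1];    // class means (theta_k = mu_k)
P1 = 0.3; P2 = 0.7;             // prior probabilities
z = [0.2; 0.5];                 // an arbitrary hidden-layer output

p1 = exp(-norm(z - mu1)^2 / 2) * P1;
p2 = exp(-norm(z - mu2)^2 / 2) * P2;
posterior = p1 / (p1 + p2);     // Bayes: P(C1|z)

w  = mu1 - mu2;                                    // w = theta_1 - theta_2
w0 = (norm(mu2)^2 - norm(mu1)^2)/2 + log(P1/P2);   // A(theta_1) - A(theta_2) + ln(P1/P2)
sigmoid = 1 / (1 + exp(-(w' * z + w0)));

mprintf("Bayes: %f  sigmoid: %f\n", posterior, sigmoid);  // identical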
11.7.3 Cross-Entropy Properties

Let \varepsilon_p = y(x_p) - t_p be the error for the input pattern x_p. Then the cross-entropy (11.38) becomes:

E = -\sum_{p=1}^{P} \left\{ t_p \ln\left(1 + \frac{\varepsilon_p}{t_p}\right) + (1 - t_p) \ln\left(1 - \frac{\varepsilon_p}{1 - t_p}\right) \right\}   (11.40)

and it can be seen that it depends on the relative errors (\varepsilon_p/t_p and \varepsilon_p/(1 - t_p)).

✍ Remarks:
➥ The cross-entropy error function will try to minimize the relative error, thus giving evenly good results for any kind of targets (large or small), while the sum-of-squares error function tries to minimize the absolute errors (thus giving better results for large target values).

Let consider binary outputs, i.e. t_p = 1 for x_p \in C_1 and t_p = 0 for x_p \in C_2. Then the (11.40) error becomes:

E = -\sum_{x_p \in C_1} \ln(1 + \varepsilon_p) - \sum_{x_p \in C_2} \ln(1 - \varepsilon_p)

(by splitting the sum in (11.40) in two, separately for each class).

✍ Remarks:
➥ Considering the absolute error \varepsilon_p small, then \ln(1 \pm \varepsilon_p) \simeq \pm\varepsilon_p; also for x_p \in C_1 \Rightarrow t_p = 1 and y_p \in [0,1], then obviously \varepsilon_p \leq 0; similarly, for x_p \in C_2, \varepsilon_p \geq 0; then E \simeq \sum_{p=1}^{P} |\varepsilon_p|.

For an infinitely large training set the sum in (11.36) transforms into an integral:

E = -\int_X \int_0^1 \left\{ t \ln y(x) + (1 - t) \ln[1 - y(x)] \right\} p(t|x)\, p(x)\, dt\, dx

y(x) is independent of t and then, by integration over t:

E = -\int_X \left[ \langle t|x\rangle \ln y(x) + (1 - \langle t|x\rangle) \ln(1 - y(x)) \right] p(x)\, dx  where  \langle t|x\rangle = \int_0^1 t\, p(t|x)\, dt   (11.41)

The value of y(x) for which E is minimum is found by zeroing its functional derivative:

\frac{\delta E}{\delta y} = -\int_X \left[ \frac{\langle t|x\rangle}{y(x)} - \frac{1 - \langle t|x\rangle}{1 - y(x)} \right] p(x)\, dx = 0

and then y(x) = \langle t|x\rangle (the functional derivative is 0 if and only if the integrand is 0), i.e. the output of the network represents the conditional average of the target data for the given input.

For the particular encoding scheme chosen for t, the posterior probability p(t|x) may be written as:

p(t|x) = \delta_D(1 - t)\, P(C_1|x) + \delta_D(t)\, P(C_2|x)   (11.42)

(\delta_D being the Dirac function). Substituting (11.42) in (11.41) and integrating gives y(x) = P(C_1|x), i.e. the output of the network is exactly what it was supposed to represent.

11.7.4 Multiple Independent Features

So far only one property of the input vectors x was present and studied in the network output, namely their membership of a particular class, a property which is mutually exclusive.

To watch for multiple, non-exclusive, features a network with multiple outputs is required, and then y_k(x) will represent the probability that the k-th feature is present in the input vector x.

Assuming independent features, then: p(t|x) = \prod_{k=1}^{K} p(t_k|x).

The presence or absence of the k-th feature may be used to classify x as belonging to one of 2 classes, e.g. C_1' if it does have it and C_2' if it doesn't. Then, by the way the meaning of y_k was chosen, the posterior probabilities p(t_k|x) may be expressed as in (11.35) and:

p(t_k|x) = y_k^{t_k}(1 - y_k)^{1 - t_k}  \Rightarrow  p(t|x) = \prod_{k=1}^{K} y_k^{t_k}(1 - y_k)^{1 - t_k}

L = \prod_{p=1}^{P} p(t_p|x_p) = \prod_{p=1}^{P}\prod_{k=1}^{K} y_k(x_p)^{t_{kp}} [1 - y_k(x_p)]^{1 - t_{kp}}

In the usual way, the error function is built as the negative logarithm of the likelihood function:

E = -\ln L = -\sum_{p=1}^{P}\sum_{k=1}^{K} \left\{ t_{kp} \ln y_k(x_p) + (1 - t_{kp}) \ln[1 - y_k(x_p)] \right\}

Because the network outputs are independent, for each y_k the analysis made for the one-output network applies. Considering this, the error function may be changed in the same way as (11.38):

E = -\sum_{p=1}^{P}\sum_{k=1}^{K} \left\{ t_{kp} \ln \frac{y_k(x_p)}{t_{kp}} + (1 - t_{kp}) \ln \frac{1 - y_k(x_p)}{1 - t_{kp}} \right\}

11.7.5 Multiple Classes Case

Let consider the problem of classification with a set of mutually exclusive classes \{C_k\}_{k=1,K}, and the one-of-k encoding scheme, i.e. t_{kp} = \delta_{k\ell} for the input vector x_p \in C_\ell. The output of the (output) neuron k represents the posterior probability of C_k: P(C_k|x) = y_k(x); thus the posterior probability density of t_p is p(t_p|x_p) = \prod_{k=1}^{K} y_k^{t_{kp}}(x_p) (similarly to the two-class case, see (11.35)). Then:

L = \prod_{p=1}^{P} p(t_p|x_p) = \prod_{p=1}^{P}\prod_{k=1}^{K} y_k^{t_{kp}}(x_p)  \Rightarrow  E = -\ln L = -\sum_{p=1}^{P}\sum_{k=1}^{K} t_{kp} \ln y_k(x_p)

The error function has a minimum when all the output values y_k(x_p) coincide with the targets t_{kp}:

E_{min} = -\sum_{p=1}^{P}\sum_{k=1}^{K} t_{kp} \ln t_{kp}

and, as in (11.38), the minimum may be subtracted from the general expression, the error becoming:

E = -\sum_{p=1}^{P}\sum_{k=1}^{K} t_{kp} \ln \frac{y_k(x_p)}{t_{kp}} = \sum_{p=1}^{P} E_p   (11.43)

where E_p = -\sum_{k=1}^{K} t_{kp} \ln \frac{y_k(x_p)}{t_{kp}} is the per-pattern error.

To represent the posterior probabilities P(C_k|x), the network outputs should be y_k \in [0,1] and sum to unity, \sum_{k=1}^{K} y_k = 1. To ensure this, the softmax function may be used:

y_k = \frac{\exp(a_k)}{\sum_{\ell=1}^{K} \exp(a_\ell)} = \frac{1}{1 + \exp(-A_k)}  where  A_k = a_k - \ln \sum_{\ell \neq k} \exp(a_\ell)   (11.44)

and it may be seen that the activation of the output neurons is the logistic function (applied to A_k).

Let assume that the probability densities of the outputs of the hidden neurons are of the same general exponential form as (11.39):

p(z|C_k) = \exp\left[ A(\theta_k) + B(z, \varphi) + \theta_k^T z \right]   (11.45)

where A, B, \theta_k and \varphi have the same significance as in section 11.7.2. From the Bayes theorem, the posterior probability P(C_k|z) is:

P(C_k|z) = \frac{p(z|C_k) P(C_k)}{p(z)} = \frac{p(z|C_k) P(C_k)}{\sum_{\ell=1}^{K} p(z|C_\ell) P(C_\ell)}   (11.46)

By substituting (11.45) in (11.46), the posterior probability becomes:

P(C_k|z) = \frac{\exp(a_k)}{\sum_{\ell=1}^{K} \exp(a_\ell)}  where  a_k = w_k^T z + w_{k0}  and  w_k = \theta_k ,  w_{k0} = A(\theta_k) + \ln P(C_k)

The derivative of the error function with respect to the weighted sum of inputs a_k of the output neurons is:

\frac{\partial E_p}{\partial a_k} = \sum_{\ell=1}^{K} \frac{\partial E_p}{\partial y_\ell} \frac{\partial y_\ell}{\partial a_k}

(because E_p depends on a_k through all the y_\ell, see (11.44)). From (11.43), respectively (11.44):

\frac{\partial E_p}{\partial y_\ell} = -\frac{t_{\ell p}}{y_\ell(x_p)}  respectively  \frac{\partial y_\ell}{\partial a_k} = y_\ell \delta_{\ell k} - y_\ell y_k

and finally \frac{\partial E_p}{\partial a_k} = y_k(x_p) - t_{kp}, the same result as for the sum-of-squares error with linear activation and for the two-class case with logistic activation.
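A small Scilab sketch (illustrative values, not from the original text) of the softmax output (11.44), with a finite-difference check of \partial E_p/\partial a_k = y_k - t_{kp}:

// Softmax outputs (11.44) and per-pattern cross-entropy E_p = -sum(t .* log(y)).
function y = softmax(a)
    e = exp(a - max(a));    // subtracting max(a) avoids overflow, leaves y unchanged
    y = e / sum(e);
endfunction

a = [1.0; 2.0; 0.5];        // total inputs of the output neurons
t = [0; 1; 0];              // one-of-k target
y = softmax(a);

// Finite-difference check of dE/da_k = y_k - t_k:
k = 1; h = 1e-6;
ah = a; ah(k) = ah(k) + h;
dE = (-sum(t .* log(softmax(ah))) + sum(t .* log(y))) / h;
mprintf("numeric: %f  analytic: %f\n", dE, y(k) - t(k));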
[Figure 11.2: A histogram (p(x) versus x, with bin widths \Delta_k and counts P_k). It may be seen as a set of "bins", each containing some "objects".]

➧ 11.8 Entropy

(See [Bis95] pp. 240-245.)

Let consider a random variable x and its probability density p(x). Also let consider a set of P values \{x_p\}_{p=1,P}.

By dividing the X axis (x \in X) into "bins" of width \Delta_k (for each "bin" k) and putting each x_p into the corresponding "bin", a histogram is obtained. See figure 11.2.

Let consider the number of ways in which the objects in the bins may be arranged. Let P_k be the number of "objects" in bin no. k. There are P ways to pick up the first object, P - 1 ways to pick up the second one and so on: there are P! ways to pick up the whole set of P objects. Also there are P_k! ways to arrange the objects inside each bin. Because the number of arrangements inside the bins doesn't count, the total number of arrangements which end up in the same histogram is:

W = \frac{P!}{\prod_k P_k!}

The particular way of arranging the objects in the bins is called a microstate, while the arrangement which leads to the same p(x) is called a macrostate. The parameter representing the number of microstates for one given macrostate is named multiplicity.

✍ Remarks:
➥ The notion of entropy came from physics, where it is a measure of the system "disorder" (higher "disorder" meaning higher entropy). In information theory it is a measure of information content.
➥ There is an analogy to be made between the above example and a physics-related one. Consider a gas formed of molecules with different speeds (just one component of the speed will be considered). From the macroscopic (macrostate) point of view it doesn't matter which molecule has a (particular) given speed, while from the microscopic point of view, if two molecules swap speeds there is a difference.
➥ If the number of microstates decreases for the same macrostate then the order in the system increases (until there is only one microstate corresponding to the given macrostate, in which case the system is totally ordered: there is only one way to arrange it). As entropy measures the degree of "disorder", it grows with the multiplicity. The results from statistical physics, corroborated with thermodynamics, lead to the definition of entropy as being equal, up to a multiplicative constant (in physics the Boltzmann constant k), with the logarithm of the multiplicity.

Based on the above remarks, the (per-object) entropy in the above situation is defined as:

S = \frac{1}{P} \ln W = \frac{1}{P} \ln \frac{P!}{\prod_k P_k!}

and, by using the Stirling formula \ln n! \simeq n \ln n - n (for n \gg 1, applied to P! and to the P_k!) and the relation \sum_k P_k = P, it becomes:

S = -\sum_k \frac{P_k}{P} \ln \frac{P_k}{P} = -\sum_k p_k \ln p_k   (11.47)

where p_k = P_k/P represents the probability observed in the histogram. The lowest entropy (S = 0) occurs when all "objects" are in one single "bin", i.e. all probabilities p_k = 0 with the exception of one p_\ell = 1. The highest entropy occurs when all the p_k are equal.

Proof. The proof of the above statement is obtained by using the Lagrange multiplier technique to find the extremum (see the mathematical appendix); the minimum of S will be missed due to the discontinuities at 0. The constraint is the normalization condition \sum_k p_k = 1; then the Lagrange function is:

L = -\sum_k p_k \ln p_k + \lambda \left( \sum_k p_k - 1 \right)

and the extremum is found by zeroing the derivatives:

\frac{\partial L}{\partial p_k} = -\ln p_k - 1 + \lambda = 0  and  \frac{\partial L}{\partial \lambda} = \sum_k p_k - 1 = 0

From the first equation p_k = e^{\lambda - 1}, \forall k, i.e. all the p_k are equal; considering K as the total number of "bins", introducing this value into the second equation gives \lambda = 1 - \ln K. Then for p_k = 1/K the entropy is maximum.
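A quick Scilab illustration of (11.47), with made-up distributions: the entropy of a histogram, showing that the uniform distribution gives the largest value.

// Entropy S = -sum(p .* log(p)) of a discrete distribution (11.47).
// Bins with p_k = 0 contribute 0 (the limit of p*ln p), so they are skipped.
function S = entropy(p)
    p = p(p > 0);
    S = -sum(p .* log(p));
endfunction

mprintf("one bin : %f\n", entropy([1 0 0 0]));             // 0, total order
mprintf("skewed  : %f\n", entropy([0.7 0.1 0.1 0.1]));
mprintf("uniform : %f\n", entropy([0.25 0.25 0.25 0.25])); // ln(4), the maximum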
Considering the limit K \to \infty (number of bins), the probability density becomes constant inside a "bin" and p_k = p(x_p \in bin_k)\Delta_k, where x_p is from inside "bin" k. Then the entropy is:

S = -\sum_k p(x_p \in bin_k)\Delta_k \ln[p(x_p)\Delta_k] \simeq -\int_X p(x) \ln p(x)\, dx - \lim_{K \to \infty} \ln \Delta_k

(because, for equal widths, \Delta_k = const. and \int_X p(x)\, dx = 1). For K \to \infty the width of the "bins" reduces to \Delta_k \to 0 and thus the term \lim_{K \to \infty} \ln \Delta_k diverges; to keep a meaning it is dropped from the expression of the entropy (anyway, what matters most is the change of entropy \Delta S, and the divergent term disappears when calculating it; this is also the reason for the name "differential entropy").

Definition 11.8.1. The general expression of the differential entropy is:

S = -\int_X p(x) \ln p(x)\, dx

The distribution with maximum entropy, subject to the following constraints:

• \int_{-\infty}^{\infty} p(x)\, dx = 1, i.e. normalization of the distribution,
• \int_{-\infty}^{\infty} x\, p(x)\, dx = \mu, i.e. existence of a mean,
• \int_{-\infty}^{\infty} (x - \mu)^2 p(x)\, dx = \sigma^2, i.e. existence of a variance,

is the Gaussian:

p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)

Proof. The distribution is found through the Lagrange multipliers method (see the mathematical appendix). The Lagrange function is:

L = \int_{-\infty}^{\infty} p(x)\left[ \ln p(x) + \lambda_1 + \lambda_2 x + \lambda_3 (x - \mu)^2 \right] dx - \lambda_1 - \lambda_2 \mu - \lambda_3 \sigma^2

Then the extremum of S is found by zeroing the functional derivative of L:

\frac{\delta L}{\delta p} = \ln p(x) + 1 + \lambda_1 + \lambda_2 x + \lambda_3 (x - \mu)^2 = 0   (11.48)

while the derivatives with respect to \lambda_1, \lambda_2, \lambda_3 just lead back to the imposed conditions. (11.48) gives the form of the searched probability density:

p(x) = \exp\left[ -1 - \lambda_1 - \lambda_2 x - \lambda_3 (x - \mu)^2 \right]   (11.49)

and the \lambda_1, \lambda_2, \lambda_3 constants are found from the imposed conditions. From the first condition, by completing the square:

1 = \int_{-\infty}^{\infty} \exp\left[ -1 - \lambda_1 - \lambda_2 x - \lambda_3 (x - \mu)^2 \right] dx = \exp\left( -1 - \lambda_1 - \lambda_2\mu + \frac{\lambda_2^2}{4\lambda_3} \right) \int_{-\infty}^{\infty} \exp\left[ -\lambda_3 \left( x - \mu + \frac{\lambda_2}{2\lambda_3} \right)^2 \right] dx = \exp\left( -1 - \lambda_1 - \lambda_2\mu + \frac{\lambda_2^2}{4\lambda_3} \right) \sqrt{\frac{\pi}{\lambda_3}}   (11.50)

(see also the Gaussian integrals, mathematical appendix). From the second condition, by similar means and using the change of variable \tilde{x} = x - \mu (one of the integrals cancels because the integrand is an odd function and the integration interval is symmetric about the origin), it follows that \mu = \mu - \frac{\lambda_2}{2\lambda_3}, thus \lambda_2 = 0. Replacing back into (11.50) gives \exp(1 + \lambda_1) = \sqrt{\pi/\lambda_3}.

From the third condition, as \lambda_2 = 0 (integration by parts):

\sigma^2 = \int_{-\infty}^{\infty} (x - \mu)^2 \exp\left[ -1 - \lambda_1 - \lambda_3 (x - \mu)^2 \right] dx = \frac{e^{-1-\lambda_1}}{2\lambda_3} \sqrt{\frac{\pi}{\lambda_3}} = \frac{1}{2\lambda_3}

and then finally \lambda_3 = \frac{1}{2\sigma^2} and e^{-1-\lambda_1} = \frac{1}{\sqrt{2\pi}\,\sigma}. The wanted distribution is found by replacing the \lambda_{1,2,3} values back into (11.49). (A function f is odd if f(-x) = -f(x).)

Another way of interpreting the entropy is to consider it as the amount of information. It is reasonable to consider that the amount of information and the probability are somehow interdependent, i.e. S = S(p). The entropy is a continuous, monotonically decreasing function of the probability: a certain event, having probability p = 1, bears no information, S(1) = 0. Let now consider two events A and B, statistically independent and having probabilities p_A and p_B. The probability of both events occurring is p_A p_B and the entropy associated is S(p_A p_B).
If event A has occurred then the information given by event B is S(p_A p_B) - S(p_A), and on the other hand it is S(p_B). Then:

S(p_A p_B) = S(p_A) + S(p_B)

From the above result it follows that S(p^2) = 2S(p) and, by induction, S(p^n) = nS(p). Then, by the same means, S(p) = S\{(p^{1/n})^n\} = nS(p^{1/n}), which may be immediately extended to S(p^{n/m}) = \frac{n}{m}S(p). Finally, from the continuity of S: S(p^x) = xS(p), and thus it may also be written as:

S(p) = S\left\{ (1/2)^{-\log_2 p} \right\} = -S(1/2) \log_2 p

where S(1/2) is a multiplicative constant; by choosing it equal to 1, the entropy/information is said to be expressed in bits. By choosing the natural logarithm, S(p) = -S(1/e) \ln p with S(1/e) = 1, the information will be expressed in nats.

✍ Remarks:
➥ The above result may be explained by the fact that, for independent events, the information is additive (there may be information about event A and event B) while the probability is multiplicative.

Let consider a discrete random variable x which may take one of the values \{x_p\} and about which information has to be transmitted. For each x_p the information is -\ln p(x_p). Then the expected/average information/entropy to be transmitted (so as to establish which of the possible x_p values x has taken) is:

S = -\sum_p p(x_p) \ln p(x_p)

the result being known as the noiseless coding theorem.

For a continuous vectorial variable x, usually the true distribution p(x) is not known but rather an estimated one \tilde{p}(x) (the associated information being -\ln \tilde{p}(x)). Then the average/expected information relative to x is:

S = -\int_X p(x) \ln \tilde{p}(x)\, dx   (11.51)

✍ Remarks:
➥ The entropy (11.51) may be written as:

S = -\int_X p(x) \ln \frac{\tilde{p}(x)}{p(x)}\, dx - \int_X p(x) \ln p(x)\, dx

where the first term represents the asymmetric divergence between p(x) and \tilde{p}(x), known also as the Kullback-Leibler distance. It is shown in the "Pattern Recognition" chapter that S is minimum for \tilde{p}(x) = p(x).

For a neural network which has the outputs y_k, the entropy (11.51), for one x_p, is:

S_p = -\sum_{k=1}^{K} t_k \ln y_k(x_p)

because t_k represents the true probability (desired output) while y_k(x) represents the estimated (calculated) probability (actual output). For the whole training set the cross-entropy is:

S = -\sum_{p=1}^{P}\sum_{k=1}^{K} t_{kp} \ln y_k(x_p)

the result being valid for all networks for which t_{kp} and y_k represent probabilities.
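The Scilab sketch below (made-up distributions) computes the discrete analogue of the decomposition above: the cross-entropy -\sum p \ln \tilde{p} splits into the Kullback-Leibler divergence plus the entropy of p, and the divergence vanishes only for \tilde{p} = p. Strictly positive probabilities are assumed.

// Cross-entropy = KL(p || ptilde) + entropy(p), discrete case.
function d = kl_div(p, pt)
    d = sum(p .* log(p ./ pt));   // asymmetric divergence, >= 0
endfunction

p  = [0.5 0.3 0.2];               // "true" distribution
pt = [0.4 0.4 0.2];               // estimated distribution
crossent = -sum(p .* log(pt));
S = -sum(p .* log(p));
mprintf("cross-entropy %f = KL %f + entropy %f\n", crossent, kl_div(p, pt), S);
mprintf("KL(p||p) = %f\n", kl_div(p, p));   // zero for a perfect estimate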
➧ 11.9 Outputs as Probabilities

(See [Bis95] pp. 245-247.)

Due to the fact that many error functions lead to the interpretation of the network outputs as probabilities, the problem is to find the general conditions which lead to this result.

It will be assumed that the error function is additive with respect to the number of training vectors, E = \sum_{p=1}^{P} E_p, and that each E_p is a sum over the components of the target and actual output vectors:

E_p = \sum_{k=1}^{K} f(t_{kp}, y_k(x_p))

where f is a function to be found, assumed to be of the form:

f(t_{kp}, y_k(x_p)) = f(|y_k(x_p) - t_{kp}|)

i.e. it depends only on the distance between the actual and the desired output.

Considering an infinite training set, the expected (average) per-pattern error is:

\langle E_p\rangle = \sum_{k=1}^{K} \int_X \int_Y f(|y_k(x) - t_k|)\, p(t|x)\, p(x)\, dt\, dx

For the one-of-k encoding scheme, and considering statistically independent (network) outputs, the conditional probability of the target may be written as:

p(t|x) = \prod_{\ell=1}^{K} \left[ \sum_{q=1}^{K} \delta_D(t_\ell - \delta_{\ell q})\, P(C_q|x) \right] = \prod_{\ell=1}^{K} p(t_\ell|x)   (11.52)

and then the expectation of the error function becomes:

\langle E_p\rangle = \sum_{k=1}^{K} \int_X \left[ \prod_{\ell \neq k} \int_{Y_\ell} p(t_\ell|x)\, dt_\ell \right] \left[ \int_{Y_k} f(|y_k(x) - t_k|)\, p(t_k|x)\, dt_k \right] p(x)\, dx

where the integrals over the Y_{\ell, \ell \neq k} are equal to 1 due to the normalization of the probability (or, otherwise, by substituting the value of p(t_\ell|x) and doing the integral), and the p(t_k|x) probability density may be written as (see (11.52); the cases k = q and k \neq q were considered):

p(t_k|x) = \delta_D(t_k - 1)\, P(C_k|x) + \sum_{q \neq k} \delta_D(t_k)\, P(C_q|x) = \delta_D(t_k - 1)\, P(C_k|x) + \delta_D(t_k)\, [1 - P(C_k|x)]

(because \sum_{q=1}^{K} P(C_q|x) = 1, normalization).

Finally the average error becomes:

\langle E_p\rangle = \sum_{k=1}^{K} \int_X \left\{ \int_{Y_k} f(|y_k(x) - t_k|)\, \delta_D(t_k - 1)\, P(C_k|x)\, dt_k + \int_{Y_k} f(|y_k(x) - t_k|)\, \delta_D(t_k)\, [1 - P(C_k|x)]\, dt_k \right\} p(x)\, dx
= \sum_{k=1}^{K} \int_X \left\{ f(1 - y_k)\, P(C_k|x) + f(y_k)\, [1 - P(C_k|x)] \right\} p(x)\, dx

because the first integral over Y_k corresponds to the case t_k = 1 while the second corresponds to the case t_k = 0; obviously y_k \in [0,1].

The conditions of minimum for \langle E_p\rangle are found by setting its functional derivative with respect to y_k(x) to zero (with f'(1 - y) denoting df(1 - y)/d(1 - y)):

\frac{\delta \langle E_p\rangle}{\delta y_k(x)} = -f'(1 - y_k)\, P(C_k|x) + f'(y_k)\, [1 - P(C_k|x)] = 0

where f' is the derivative of f. This leads to:

\frac{f'(1 - y_k)}{f'(y_k)} = \frac{1 - P(C_k|x)}{P(C_k|x)}

and, considering that the network outputs are probabilities (this was the hypothesis), y_k = P(C_k|x):

\frac{f'(1 - y_k)}{f'(y_k)} = \frac{1 - y_k}{y_k}   (11.53)

A general class of functions which satisfies (11.53) is:

f(y) = \int y^r (1 - y)^{r-1}\, dy ,  r = const.

and from this class:

• r = 1 \Rightarrow f(y) = y^2/2, i.e. the sum-of-squares error function.
• r = 0 \Rightarrow f(y) = -\ln(1 - y), i.e. the cross-entropy error function.

✍ Remarks:
➥ The Minkowski error function f(y) = y^R does not satisfy (11.53) unless R = 2, which leads to the already known sum-of-squares error. For R \neq 2 the outputs of the network are:

y_k = \frac{P(C_k|x)^{1/(R-1)}}{P(C_k|x)^{1/(R-1)} + [1 - P(C_k|x)]^{1/(R-1)}}

On the other hand the decision boundaries still give the minimum of misclassification because the y_k are monotonic functions of P(C_k|x).

CHAPTER 12
Parameter Optimization

➧ 12.1 Error Surfaces

(See [Bis95] pp. 253-256.)

Generally the error may be represented as a surface E = E(W) in the (N_W + 1)-dimensional space, where N_W is the total number of weights (such a surface was drawn in the "Backpropagation" chapter for 2 weights; see also figure 12.2). The goal is to find the minimum of the error function, where \nabla E = 0; however note that this condition is not enough to find the absolute minimum because it also holds for local minima, maxima and saddle points.

In general it is not possible to find the solution W in a closed form. Then a numerical approach is taken, searching the weights space in incremental steps (t = 1, \ldots) of the form W^{(t+1)} = W^{(t)} + \Delta W^{(t)}. However, usually, the algorithms do not guarantee the finding of the absolute minimum, and even a saddle point may stall them.

On the other hand the weight space has a high degree of symmetry (see the "Multi Layer Neural Networks" chapter) and thus many local and global minima which give the same value for the error function; then a relatively fast convergence may be achieved starting from a random point (i.e. a local/global minimum will be relatively close wherever the starting point is).

It was shown (in the "Error Functions" chapter) that the optimal value for the network is obtained when \langle y_k|x\rangle = \langle t_k|x\rangle, i.e. the actual average output y_k equals the desired output t_k, both conditioned on the input x. However this equality was obtained by considering the training set as infinite, while in practice the training set is always finite and covers just a tiny amount of all possible cases. Then even a local minimum may give a good generalization.
On the other hand the weight space have a high degree of symmetry2 and thus many local and global minimums which give the same value for the error function; then a relatively fast convergence may be achieved starting from a random point (i.e. the local/global minima will be relatively close wherever the starting point is). It was shown3 that the optimum value for the network is obtained when hyk jxi = htk jxi, i.e. the actual average output yk equals the desired output tk , both conditioned on input x. However this equality was obtained by considering the training set as in nite while in practice the training set is always nite and covers just a tiny amount from all possible cases. Then even a local minima may give a good generalization. See [Bis95] pp. 253{256. Such a surface was drawn in the \Backpropagation" chapter for 2 weights. 2 See the \Multi Layer Neural Networks" chapter. 3 See the \Error Functions" chapter. 12.1 1 215 ❖ N W 216 CHAPTER 12. PARAMETER OPTIMIZATION ➧ 12.2 Local Quadratic Approximation Let consider the Taylor series development of error E around a point W0 , note that here W is being seen as a vector : E (W ) = E (W0 ) + rE jW0 + ❖ H 1 T 2 (W ; W0 ) H jW0 (W ; W0 ) (12.1) where fH gij = @w@E@w is the Hessian matrix, note that here the Hessian will be considered as a matrix. Then the gradient of E with respect to W may be approximated as: i j rE = rE jW0 + H jW0 (W ; W0 ) ❖ W Considering a minima point W  then rE jW  = 0 and: E (W ) = E (W  ) + ❖ ui , i 1 (W ; W )T H jW  (W ; W ) 2 (12.2) Because the Hessian is a symmetric matrix then it is possible to nd an orthonormal set of eigenvectors4 fui g with the corresponding eigenvalues fi g such that: Hui = i ui ; uTi uj = ij (12.3) (i and ui being calculated at the minimum of E , given by the point W  ). By writing (W ; W  ) in terms of fui g W ; W = ❖ i (where i X i (12.4) i ui are the coecients of the fui g development) and replacing in (12.2) E (W ) = E (W  ) + 1 2 X i 2 i i i.e. the error hyper-surfaces projected into the weights space are hyper-ellipsoids oriented with the axes parallel with the orthonormated fui g system and having the lengths inversely proportional with the square roots of Hessian eigenvalues. See gure 12.1 on the facing page. ✍ Remarks: ➥ From (12.2): E = E (W ) ; E (W  ) = 12 (W ; W )T H jW  (W ; W ) then W  is a minimum point if only if H is positive de nite5 ( , E (W ) > E (W  )). 12.2 4 5 See [Bis95] pp. 257{259. See mathematical appendix. See previous footnote. E > 0 , 12.3. INITIALIZATION AND TERMINATION OF LEARNING w2 ;1=2 1 u2 2;1=2 u1 W E = const. w1 Figure 12.1: ➥ ➥ ➧ 12.3 A constant error surface projection into a bidimensional weights space is an ellipse having the axes parallel with the Hessian eigenvectors and lengths inversely proportional with the square root of Hessian eigen values. The quadratic approximation of error function (12.1) involves the knowledge of W (W + 3)=2 terms (W for rE and W (W + 1)=2 for the symmetrical Hessian matrix). So, to nd a minimum will require at least O(W 2 ) equations, each needing at least O(W ) steps, i.e. a total of O(W 3 ) steps. By using the gradient information and the backpropagation algorithm the required steps are reduced to O(W 2 ). It was shown6 that for linear output neurons it is possible to nd the related weights by one step. Initialization and Termination of Learning Usually the weights are initialized with random values to avoid problems due to weight space symmetry. 
However there are two restrictions:

• If the initial weights are too big then the activation functions f will take values in the saturation region (e.g. see the sigmoidal activation function) and their derivatives f' will be small, leading to a small error gradient as well, i.e. an approximately flat error surface and, consequently, slow learning.
• If the initial weights are too small then the activation functions f will be (almost) linear and their derivatives quasi-constant; the second derivatives will be small and then the Hessian will be small, meaning that around minima the error surface will be approximately flat and, consequently, learning slow (see section 12.2).

This suggests that the weighted sum of inputs, for a sigmoidal activation function, should be of order unity.

The weights are usually drawn from a symmetrical Gaussian distribution with 0 mean (there is no reason to choose any other mean, due to symmetry). To see the choice for the variance \sigma of the above distribution, let consider a set of inputs \{x_i\}_{i=1,N} with zero mean, \langle x_i\rangle = 0, and unit variance, \langle x_i^2\rangle = 1 (calculated over the training set). The output of the neuron is:

z = f(a)  where  a = \sum_{i=0}^{N} w_i x_i

(for i = 0 the weight w_0 represents the bias and x_0 = 1). The weights are chosen randomly, independent of the \{x_i\}, and then the mean of a is:

\langle a\rangle = \sum_{i=0}^{N} \langle w_i x_i\rangle = \sum_{i=0}^{N} \langle w_i\rangle \langle x_i\rangle = 0

(as \langle x_i\rangle = 0) and, considering the \{w_i\} statistically independent, \langle w_i w_j\rangle = \delta_{ij}\sigma^2, the variance is:

\langle a^2\rangle = \left\langle \left( \sum_{i=0}^{N} w_i x_i \right)\left( \sum_{j=0}^{N} w_j x_j \right) \right\rangle = \sum_{i=0}^{N} \langle w_i^2\rangle \langle x_i^2\rangle = (N+1)\sigma^2 \simeq N\sigma^2

(for non-bias weights the sum runs from i = 1 to N and then \langle a^2\rangle = N\sigma^2). Because \langle a^2\rangle \simeq 1 is wanted (as discussed above), the variance should be chosen \sigma \simeq N^{-1/2} (a short sketch follows at the end of this section).

Another way to improve the network performance is to train multiple instances of the same network, but with different sets of initial weights, and to choose among those which give the best results. This method is called a committee of networks.

The criteria for stopping the learning process may be one of the following:

• Stop after a fixed number of steps.
• Stop when the error function has become smaller than a specified amount.
• Stop when the change in the error function (\Delta E) has become smaller than a specified amount.
• Stop when the error on an (independent) validation set begins to increase.
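A minimal Scilab sketch of the initialization rule above (the sizes N and H are made up): weights drawn from a zero-mean Gaussian with \sigma = N^{-1/2}, so that the summed input a is of order unity.

// Initialize an H x (N+1) weight matrix (bias included) for N inputs,
// drawn from a Gaussian with zero mean and sigma = 1/sqrt(N).
N = 100; H = 5;
W = rand(H, N + 1, "normal") / sqrt(N);

x = rand(N, 1, "normal");   // a normalized input (zero mean, unit variance)
a = W * [1; x];             // total inputs; the first column of W acts as bias
disp(a');                   // entries are of order unity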
➧ 12.4 Gradient Descent

(See [Bis95] pp. 263-272.)

A simple, geometrical explanation of the gradient descent method is given by the representation of the error surface into the weights space. See figure 12.2.

The gradient of the error function relative to the weights, \nabla E, points in the (local) direction of maximum error increase. Then the weights have to be moved against the direction of \nabla E; note that this does not mean a movement straight towards the minimum point. See figure 12.2.

[Figure 12.2: The error surface for a bidimensional weights space; the steps t, t+1 and t+2 are also shown. The weights are moved against the direction of maximum error increase (and not towards the minimum).]

At step t = 0 (in the discrete time approximation) the weights are initialized with the value W_0 (usually randomly selected). Then at some step t + 1 they are adjusted following the formula:

\Delta W^{(t+1)} = W^{(t+1)} - W^{(t)} = -\eta\, \nabla E|_{W^{(t)}}   (12.5)

where \eta is a parameter governing the speed of learning, named the learning rate/constant, controlling the distance between W^{(t+1)} and W^{(t)} (see also figure 12.2). This type of learning is named a batch process (it uses all the training vectors at once every time the gradient is calculated). (12.5) is also known as the delta rule.

Alternatively the same method may be used, but with one pattern at a time:

\Delta W^{(t+1)} = W^{(t+1)} - W^{(t)} = -\eta\, \nabla E_p|_{W^{(t)}}

where p denotes a pattern from the training set; e.g. the training patterns may be numbered in some way and then considered in the order p = t (first pattern taken at the first step, and so on). This type of learning is named a sequential process (it uses just one training vector at a time).
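A batch gradient-descent sketch in Scilab on a toy quadratic error E(W) = \frac{1}{2}W^THW (a made-up example, not from the book), showing the delta rule (12.5) at work:

// Batch gradient descent (delta rule) on E(W) = 0.5*W'*H*W, gradient H*W.
H = [2 0; 0 10];            // toy Hessian, eigenvalues 2 and 10
W = [1; 1];                 // starting point
eta = 0.1;                  // learning rate

for t = 1:50
    W = W - eta * (H * W);  // delta rule (12.5)
end
mprintf("E = %e after 50 steps\n", 0.5 * W' * H * W);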
Considering the maximum possible value  = 2=max, and replacing into (12.8), the speed of learning will be decided by the time needed for the convergence of i corresponding to  2min the smallest eigenvalue, i.e. the size of factor 1 ; max . If the ratio min =max is very small then the convergence will be slow. See gure 12.3. So far, in equation (12.5), the time, given by t, was considered discrete. By considering continuous time the equation governing weights change during learning becomes: dW = ;rE dt which represents the equation of movement for a point in weights space, which position is given by W and subject to a potential eld E (W ) and viscosity 1=. 12.4.2 Momentum By adding a term to the equation (12.5) and changing it to: W t = ;rE jW + W t ( +1) (12.10) ( ) ( t) it is possible to increase the speed of convergence. The  parameter is named momentum. (12.10) may be rewritten into a continuous form as follows:  W(t+1) ; W(t) = ;rE jW(t) +  W(t) ; W(t;1)  ❖  222 CHAPTER 12. PARAMETER OPTIMIZATION E E E W 0 W 00 W 000 W W W W 0 W 00 W 000 W 0 W 00 W 000 a) W 00 b) Figure 12.4: W 000 W 0 c) Learning with and without momentum (W 0 , W 00 and W 000 are 3 points, consecutive in time). Figure a) shows learning without momentum: W decreases in proportion with rE , i.e. decreases around minima. Figure b) shows learning with momentum (without oscillations): the learning rate increases, compensating for the decrease of rE . Figure c) shows oscillations; most of the additional quantities, introduced by momentum, cancels from one oscillation to the next.  W(t+1) ; W(t) = ;rE jW +  W(t) ; W(t;1)  (t)   ( +1) ( +1) W ,  W ❖  2 ( +1) ( +1) ( ) ( ) ( +1) (t) ( +1) ❖ (t) 2 ( +1) 2 ( ) ( +1) ( +1) ( +1) ( ) By switching to the limit, the terms in (12.11) have to be of the same in nitesimal order, let  be equal to the unit of time (introduced for dimensionality correctness), then (1 ; )dW dt = ;rE dt ;  d W 2 m,   ( +1) ( ) 2 ❖  ; W t ;W t + W t ;W t W t = ;rE jW ;  W t ; W t  + W t (12.11) (1 ; )W t = ;rE jW ;  W t where W t = W t ; W t and  W t = W t ; W t . 2 which gives nally the di erential equation: 2  2 ;  = (1 ; ) = ;r E ; where m = m ddtW2 +  dW dt   pointing that the momentum shall be chosen  2 (0; 1). ✍ Remarks: ➥ (12.12) (12.12) represents the equation of movement of a particle into the weights space W , having \mass" m, subject to friction (viscosity) proportional with the speed, de ned by  and into a conservative force eld E . 12.4. GRADIENT DESCENT 223 u1 u2 E Figure 12.5: = const. The learning, alongside Hessian eigenvector corresponding to smallest eigenvalue, is accelerated comparing with plain gradient descent. See also gure 12.3 on page 221 for comparison. 2 d W represents the position, dW dt represents the speed, dt2 is the acceleration, nally E and ;rE are the potential, respectively the force of the conservative elds. W To understand the e ect given by the momentum | see also gure 12.4 on the preceding page | two cases may be analyzed:  The gradient is constant rE = const. Then, by applying iteratively (12.10): W = ;rE (1 +  + 2 + : : : ) ' ; 1 ;  rE n (because  2 (0; 1) and then nlim !1  = 0), i.e. the learning rate e ectively increases from  to (1; ) .  In a region with high curvature where the gradient change direction almost completely (opposite direction), generating weights oscillations, the e ects of momentum will tend to cancel from one oscillation to the next. The above discussion is also true on the components of vector W . 
The advancement in the direction of error minima alongside the direction of eigenvector corresponding to the smallest eigenvalue of Hessian is accelerated. See gure 12.5. 12.4.3 Other Gradient Descent Improvement Techniques Bold descent This algorithm improvement is based on the following idea:  If, after a step, the error increases, i.e. E > 0 this means that the minimum point was overshot. Then the change is rejected (reversed), because otherwise the weight value will be further from the minima point, and the learning rate is decreased.  If, after a step, the error decreases, i.e. E < 0 then the weight adjustment is accepted, but the learning rate is considered to be too small and consequently is increased. 224 CHAPTER 12. PARAMETER OPTIMIZATION The algorithm changes the learning rate, at each step, as follows: ( 1 1 and the weight change is rejected if E > 0. In practice  ' 1:1 and  ' 0:5. Quick backpropagation (t+1) ❖ a, b, c = if E < 0 ; if E > 0 ; (t) (t) > < The quick backpropagation algorithm makes the following assumptions:  The weights are independent. This is true only if the Hessian would be diagonal, i.e. its eigenvectors are parallel with the weights axes which in practice is seldom true (see also gure 12.3 on page 221 | the system fui g is usually not parallel with the W system).  The error function may be approximated with a quadratic polynomial, i.e. a parabola E (wi ) = a + bwi + cwi2 , a; b; c = const. and then the weights are changed to the minima of error function by using 2 estimates of the error gradient. Assuming the above hypotheses then: ( ) = a + bwi + cwi ) 2 E wi ) 8 > < @E @wi t @E > : @wi t;1 = b + 2cwi t = b + 2cwi t; = b + 2cwi ) 2c = ( ) ( @E @wi 1) @E @wi t wi(t) @E ; @w t; ; wi t; i ( 1 1) At t + 1 the minimum is attended (as assumed): minimum ) @E @wi t =0 ) wi(t+1) = ; 2bc From the error gradient at t: b = @wi t ; @wi t;1 @E ; wi(t) @wi t wi(t) ; wi(t;1) @E @E and then the weights update formula is: wi t ( +1) @E @wi t @E @wi t @wi t;1 = ; @E ; wi t ( ) (wi(t+1) = wi(t+1) ; wi(t) , wi(t) = wi(t) ; wi(t;1) ). ✍ Remarks: ➥ The algorithm assumes that the parabola have a minimum. If it have a maximum then the weights will be updated in the wrong direction. 12.5. LINE SEARCH ➥ ➧ 12.5 225 Into a region of nearly at error surface it is necessary to limit the size of the weights update, otherwise the algorithm may jump over minima, especially if it is narrow. Line Search The main idea of the algorithm is to search for the minimum of error function along the direction given by its negative gradient, instead of just move by a xed amount given by the learning rate. The weight vector at step t + 1, given the value at step t, is: W(t+1) = W(t) ; (t) rE jW(t) ;  where (t) is a parameter, such that E () = E W(t) ; rE jW t is minimum. By the above approach the problem is reduced to a unidimensional case. The minimum of E () may be found by searching for 3 points W 0 , W 00 and W 000 such that E (W 0 ) > E (W 00 ) and E (W 000 ) > E (W 00 ). Then the minimum is to be found somewhere between W 0 and W 000 and may be found by approximation E with a quadratic polynomial, i.e. a parabola. ( ) ✍ Remarks: ➥ ➧ 12.6 12.6.1 Another, less ecient but simpler, way would be to advance, along the direction given by ;rE , in small, xed steps, and stop when error begin to increase. Conjugate Gradients Conjugate Search Directions The method described in section 12.5 does not give the best direction for search of the minima in the weight space. 
Considering that the line search minimum have been reached at step t + 1 then: @E @ W(t+1) = @E =0 @ W(t) ;(t) rE jW (t) ) ; rE jW t ( +1) T ; rE jW t ( )  =0 i.e. the gradients from two consecutive steps are orthogonal. See gure 12.6 on the following page. This behavior may lead to more (searching) steps than necessary. It is possible to continue the search from step t + 1, beyond, such that the component of the gradient parallel with the previous direction | and made 0 by the previous minimization step | is preserved to the lowest order. This means that a series of directions fd(t) g have to be found such that: ;  ;  rE jW t T d(t) = 0 and rE jW t +d t T d(t) = 0 ( +1) 12.5 See [Bis95] pp. 263{272. 12.6 See [Bis95] pp. 274{285. ( +1) ( +1) ❖ (t) 226 CHAPTER 12. PARAMETER OPTIMIZATION w2 rE j rE j t W( +1) t W( ) rE j Figure 12.6: W( surfaces E = const. t;1) w1 Line search learning. The error gradients from two consecutive steps are perpendicular. and, by developing in series to the lowest order: h; rE j T t W( +1) T T d(t+1) conjugate directions i + d(t+1) H d(t) = 0 ) (12.13) H d(t) = 0 where H is the Hessian calculated at point W +1 . Directions which respects condition (12.13) are named conjugate (or interfering ). t 12.6.2 Quadratic Error Function A quadratic error function is of the form: E (W ) = E0 + bT W + 1 2 W T HW ; b ;H = const. and H (the Hessian) is symmetrical and positive de nite. The error gradient is: rE ❖ W (12.14) and the minimum of E is achieved at the point W  where: rE j W ❖ NW = b + HW  =0 ) b+ HW  = 0 (12.15) Let consider a set of fd g =1 W conjugate | with respect to H | directions (N being the total number of weights): i and, of course, d i 6 i , 8i. =0 Proposition 12.6.1. The W ;N T di f g di H dj W i=1;N =0 for i 6= j (12.16) set of conjugate directions is linearly independent. 12.6. CONJUGATE GRADIENTS 227 Let consider that there is a direction d which is a linear combination of the other ones: X d = d ; = const. = and then, from the way the set fd g was chosen: 1 0 X AT T̀ @ d Hd = = dT = Hd = = 0 d Hd = = = and as dT = Hd = 6= 0 then 8 = = 0, i.e. d = 0 which runs counter the way fd g set was chosen. Proof. ` i ` i;i 6 i i ` i q;q 6 i ` i;i q;q ✍ 6 ` q;q 6 ` q;q i q;q 6 ` q q;q 6 ` q;q 6 ` ` 6 6 ` i ` Remarks: ➥ The above proposition ensures the existence of a set of NW linear independent and H conjugate vectors. As fdi g contains NW vectors and is linear independent then it may be used as a reference system, i.e. any W vector from the weights space may be written as a linear combination of fdi g. Let assume that the starting point for the search is W1 and the minimum point is W  . Then it may be possible to write: NW  (12.17) W ; W1 = i di i=1 X where i are some parameters. Finding W  may be envisaged by a successive steps (of length i along directions di ) in the form: Wi+1 = Wi + i di (12.18) where i = 1; NW and WNW +1 = W  . By multiplying (12.17) with dT̀ H to the left and using (12.15) plus the property (12.16): NW dT` H (W  ; W1 ) = ;dT` b ; dT` HW1 = i dT` Hdi = ` dT` Hd` i=1 X and then the ` steps are: dT (b + HW1 ) ` = ; ` dT Hd ` ` The ` coecients may be put into another form. From (12.18): Wi+1 = W1 + multiplying with dTi+1 H to the left and using again (12.16): dTi+1 HWi+1 = dTi+1 HW1 ) dTi+1 rE jWi+1 = dTi+1 (b + HWi+1 ) = dTi+1 (b + HW1 ) (12.19) Pi j=1 j dj ; 228 CHAPTER 12. PARAMETER OPTIMIZATION and by using this result in (12.19): dT̀ rE j ` = ; dT̀ HdW` ` (12.20) Proposition 12.6.2. 
If the weight vector is updated according to the procedure (12.18) the the gradient of the error function at step i + 1 is orthogonal on all previous conjugate directions: (12.21) dTj rE jWi = 0 ; 8j; i such that j < i 6 NW Proof. From (12.14) and (12.18) rE jWi+1 ; rE jWi = H (Wi+1 ; Wi ) = i Hdi (12.22) and, by multiplying to the left with dTi and replacing i from (12.20), it follows that (fdi g are conjugate directions): dTi rE jWi+1 (12.23) =0 On the other hand, by multiplying (12.22) with dj to the left:   dTj rE jWi+1 ; rE jWi = i dTj Hdi = 0 ; 8j < i 6 NW (12.24) The equation (12.24) is written for all instances i; i ; 1; : : : ; j + 1 and then a summation is done over all equation obtained, which gives:   dTj rE jW ; rE jW =0 ; 8j < i 6 NW i and, by using (12.23), nally: +1 j dTj rE jWi+1 +1 =0 ; 8j < i 6 NW which combined with (12.23) (i.e. for j = i) proves the desired result. The set of conjugate directions fdi g may be built as follows: 1. The rst direction is chosen as: d1 = ;rE jW1 2. The following directions are build incrementally as: di+1 = ;rE jWi+1 + i di ❖ i (12.25) where i are coecients to be found such that the newly build di+1 is conjugate with the previous di , i.e. dTi+1 Hdi = 0. By multiplying (12.25) with Hdi to the right: ;rE jW +1 + i di T Hdi = 0 )  ; i i= rE jW +1 )T Hdi dTi Hdi ( i (12.26) Proposition 12.6.3. By using the above method for building the set of directions, the error gradient at step j is orthogonal on all previous ones: ; rE jW j T rE jW i =0 ; 8j; i such that j < i 6 NW (12.27) Obviously by the way of building, each direction vector represents a linear combination of all previously gradients, of the form: jX ;1 dj = ;rE jWj + (12.28) ` r E jW ` `=1 Proof. 12.6. CONJUGATE GRADIENTS 229 where ` are the coecients of the linear combination (their exact value is not relevant to the proof). By multiplying (12.28) with rw E jwi to the right and using the result established in (12.21): jX ;1 ; T   rE jWj rE jWi = ` rE jW` T rE jWi ; 8j; i such that j < i 6 NW `=1 For j = 1 the error gradient equals the direction d1 and, by using (12.21), the result (12.27) holds. For j = 2: jX ;1 ; ; ;   rE jW2 T rE jWi = ` rE jW` T rE jWi = 1 rE jW1 T rE jWi = 0 `=1 and so on , as long as j < i and thus the (12.27) is true. Proposition 12.6.4. The (set of) directions build by the above method are mutually conjugate. It will be shown by induction. By construction d2 Hd1 = 0, i.e. these directions are conjugate. It is assumed (induction) that: dTi Hdj = 0 ; 8i; j such that j < i 6 NW is true and it have to be shown that it holds for i + 1 as well (assuming i + 1 6 NW , of course). For di+1 , by using the above assumption and (12.25):  T  T dTi+1 Hdj = ; rE jWi+1 Hdj + i dTi Hdj = ; rE jWi+1 Hdj (12.29) Proof. 8j; i such that j < i 6 NW (the second term disappears due to the induction assumption supposed true).  T j Hdj . By multiplying this equation with rE jWi+1 to the left, T  and considering (12.27), i.e. rE jWi+1 rE jWj = 0 for 8j < i + 1 < NW , then: From (12.22), rE jWj+1 ;rE jWj =     rE jWi+1 T rE jWj+1 ; rE jWj  T  T = rE jWi+1 rE jWj+1 ; rE jWi+1 rE jWj = j rE jWi+1 Hdj = 0 and by inserting this result into (12.29) then: dTi+1 Hdj = 0 ; 8j; i such that j < i 6 NW and this result is extensible from j < i 6 NW to j < i + 1 6 NW because of the way di+1 is build, i.e. di+1 Hdi = 0 by design. ✍ Remarks: ➥ The method described in this section gives a very fast converging method for nding the error minima, i.e. 
the number of steps required equals the dimensionality of the weight space. See gure 12.7 on the next page. 12.6.3 The Algorithm The previous section give the general method for fast nding the minima of E . However there are 2 remarks to be made:  The error function was assumed to be quadratic. ❖ ` 230 CHAPTER 12. PARAMETER OPTIMIZATION u1 u2 E = const. Figure 12.7: Conjugate gradient learning. Into a bidimensional weight space it takes just 2 steps to reach the minimum.  For a non-quadratic error function the Hessian is variable and then it have to be calculated at each Wi point which results into a very computational intensive process. For the general algorithm it is possible to express the i and i coecients without explicit Hestenes-Stiefel calculation of Hessian. Also while in practice the error function is not quadratic the conjugate gradient algorithm still gives a good way of nding the error minimum point. There are several ways to express the i coecients:  The Hestenes-Stiefel formula. By replacing (12.22) into (12.26): i= Polak-Ribiere ;  ; rE jW +1; T rE jW +1 ; rE jW T di rE jW +1 ; rE jW i  (12.30) i i i i  The Polak-Ribiere formula. From (12.25) and (12.21) and by making a multiplication to the right: i;1 rE jW = 0 ) T T T di rE jW = ; (rE jW ) rE jW + i di;1 rE jW i ;rE jW d = i + i di;1 ; d T = i i i i i ; (rE jW )T rE jW i i and by using this result, together with (12.21) again, into (12.30), nally: i= Fletcher-Reeves ;  ; rE jW +1 T rE jW +1 ; rE jW T (rE jW ) rE jW i i i  i  The Fletcher-Reeves formula. From (12.31) and using (12.27): ;  r E jW +1 T rE jW +1 i = (rE j )T rE j W W i i i ✍ Remarks: ➥ (12.31) i (12.32) i In theory, i.e. for a quadratic error function, (12.30), (12.31) and (12.32) are equivalent. In practice, because the error function is seldom quadratic, they may gives di erent results. Usually the Polak-Ribiere formula gives best result. 12.6. CONJUGATE GRADIENTS 231 Let consider a quadratic error as function of i : 1 E (Wi + i di ) = E0 + bT (Wi + i di ) + (Wi + i di )T H (Wi + i di ) 2 The minimum if error along the direction given by di is found by imposing the cancellation of its derivative with respect to i : @E =0 ) bT di + (W + i di )T Hdi = 0 @ i and considering the property xT y = yT x and the fact that rE = b + HW , then: dT rE j i = ; di T HdWi i i (12.33) The fact that formula (12.33) coincide with expression (12.20) indicate that the procedure of nding these coecients may be replaced with any procedure for nding the error minima along di direction. The general algorithm is: 1. Select an starting (into the weight space) point given by W1 . 2. Calculate rE jW1 and make: d1 = ;rE jW1 3. For i = 1; : : : ; (max. value): (a) Find the minimum of E (Wi + i di ) with respect to i and move to the next point Wi+1 = Wi + i(min.) di . (b) Evaluate the stop condition. It may be error drop under some speci ed value, a xed number of steps, e.t.c. (c) Calculate rE jW +1 and then i , using one of the (12.30), (12.31) or (12.32). Finally calculate the new di+1 direction from (12.25). (Cycle is to be repeated till the error minima have been found, or some maximal number of steps have been executed). i 12.6.4 Scaled Conjugated Gradients The line search algorithm may have the following drawbacks:  it may involve several calculations of the error function.  it may be sensible on the accuracy of i determination. For a general network however it is possible to calculate the product between the Hessian and a vector, e.g. 
Hdi in the conjugate gradient algorithm, in just O(W ) steps8 . However there is a possibility that the Hessian is not positive de nite, meaning that the denominator dT` Hd` of (12.20) may be negative and thus driving an increase of error. The Hessian may be made positive de nite by performing a translation of the form H ! 8 See chapter \Multi layer neural networks". ❖  232 CHAPTER 12. PARAMETER OPTIMIZATION H + I where I is the unit matrix and  is a suciently large coecient. The formula (12.20) then becomes: dTi rE jW (12.34) i =; T di H jW di + i kdi k2 i i ❖ H j W i , i ❖ i ❖ i , i 0 0 H jW being the Hessian calculated at point Wi and i being the required value of  to make the Hessian positive de nite. The problem is to choose the i parameter. One way is to start from some value | which may be 0 as well | and to increase it till the denominator of (12.34) becomes positive. Let consider the denominator: i = dTi Hdi + i kdi k2 i If i > 0 then it can be used; otherwise (i < 0) the i parameter is increased to the new value i such that the new value of the denominator i > 0:  i = i + (i ; i )kdi k2 > 0 ) i > i ; i 2 kdi k 0 0 0 0 0  An interesting value to choose for i is i = 2 i ; 0 0 i  kdi k2 which gives: i = ;i + i kdi k2 = ;dTi Hdi 0 ✍ ❖ i ❖ EQ Remarks: ➥ ➥ If the error is quadratic then i = 0. In the regions of the weights space where error is badly approximated by a quadratic i have to be increased, in the regions where error is closer to the quadratic approximation i may be decreased. One way to measure the degree of quadratic approximation is: i = EE((WWii));;EEQ((WWii++ iiddii)) (12.35) where EQ is the local quadratic approximation: EQ (Wi + i di ) = E (Wi ) + i dTi rE jWi + 1 2 2 dT H jW di i i i and, by replacing into (12.35) and using (12.20): i = 2 [E (Wi ) d;TEr(EWji + i di)] W i i i ➥ In practice the usual values used are: 8 > <i+1 = i =2 i+1 = i > : = 4 i+1 i if i > 0:75 if 0:25 < i < 0:75 if i < 0:75 (12.36) with the supplemental rule to not update the weights if i < 0 but just increase the value of i and reevaluate i . 12.7. NEWTON'S METHOD W 233 ;H ;1rE W ;rE E = const. Figure 12.8: ➧ 12.7 Newton direction ;H ;1 rE points directly to the error minima point W  . Newton’s Method For a local quadratic approximation around minima (12.2) of the error function, the gradient at a some W point is rE = H jW  (W ; W  ) and then the minimum point is at: W  = W ; H ;1 jW  rE (12.37) The vector ;H ;1 rE , named Newton direction, points from the W point directly to the minimum W  . See gure 12.8. There are several points to be considered regarding the Newton's method:  Since it is just an approximation, this method may require several steps to be performed to reach the real minimum.  The exact evaluation of the Hessian is computationally intensive, of order O(PW 2 ), P being the number of training vectors. The computation of the inverse Hessian H ;1 is even more computationally intensive, i.e. O(W 3 ).  The Newton direction may also point to a saddle point or maximum so checks should be made. This occurs if the Hessian is not positive de nite.  The point given by (12.37) may be outside the range of quadratic approximation, leading to instabilities in the learning process. If the Hessian is positive de nite then the Newton direction points towards a decrease of error | considering the derivative of error along Newton direction: @E (W ; H ;1 rE ) = ; ;H ;1 rE T rE @ T ;1 = ; (rE ) H rE < 0 (the matrix property (AB )T = B T AT was used here, as well as the fact that Hessian is symmetric). 
If the Hessian is not positive de nite the a approach similar to that used in section 12.6.4 may be used, i.e. H ! H + I . This represents a compromise between gradient descent 12.7 See [Bis95] pp. 285{287. Newton direction 234 CHAPTER 12. PARAMETER OPTIMIZATION and Newton's direction:  For  & 0 ) H + I ' H , i.e. the new direction is close to Newton's direction.  For   0 ) H + I ' I , and then ;(I );1 = ; 1 I , i.e. the new direction is close to rE . ➧ 12.8 ❖ "p , " Levenberg-Marquardt Algorithm This algorithm is speci cally designed for \sum-of-squares" error function. Let "p be the  ; T error given by the p-th training pattern vector and " = "1 : : : "P . The error function is then: E= P 1X 1 ( "p ) = k "k 2p 2 (12.38) 2 2 =1 ❖  Let consider the following matrix: 0 @" B @W.. =B @ . 1 CA : : : @W@"N1 WC 1 1 .. . ... @"P @W1 @"P : : : @W NW then, considering a small variation in W weights from step t to t + 1, the error vector " may be developed in a Taylor series to the rst order: = " t + (W t ; W t ) and the error function at step t + 1 is: 1 E t = k" t + (W t ; W t )k 2 "(t+1) ( +1) ( ) ( +1) ( ) ( ) ( +1) ( ) (12.39) 2 Minimizing (12.39) with respect to W(t+1) means: rE j W t = ( +1) " + (W ; W )  = 0 ) t t t ( ) ( +1) ( ) "(t) + (W(t+1) ; W(t) ) = 0 T  is not square so rst a multiplication with  to the left is performed and then a multiplication by ( ); again to the left which nally gives: W t = W t ; ( );  " t (12.40) T T 1 ( +1) ( ) T 1 T ( ) which represents the core of Levenberg-Marquardt weights update formula. From (12.38) the Hessian matrix is: fH gij = 12.8 See [Bis95] pp. 290{292. @2E @wi @wj = P  X @"p @"p p=1 @ 2 "p + "p @wi @wj @wi @wj  12.8. LEVENBERG-MARQUARDT ALGORITHM 235 and by neglecting the second order derivatives the Hessian may be considered as: H '  T i.e. the equation (12.40) essentially involves the inverse Hessian. However this is done trough the computation of the error gradient with respect to weights which may be eciently done by the backpropagation algorithm. One problem should be taken care of: the formula (12.40) may give large values for W , i.e. so large that the (12.39) approximation no longer apply. To avoid this situation the following changed error function may be used instead: E(t+1) = 1 k" + (W 2 (t) (t+1) ; W )k + kW (t) 2 (t+1) ;W k (t) 2 where  is a parameter governing the W size. By the same means as for (12.40) the new update formula becomes: W(t+1) = W(t) ; (T  + I );1 T "(t) For  & 0, (12.41) approaches the Newton formula, for   gradient descent formula. ✍ Remarks: ➥ ➥ ➥ (12.41) 0 (12.41) approaches the For suciently large values of  the error function is \guaranteed" to decrease since the direction of change is opposite to the gradient and the step is proportional with 1=. The Levenberg-Marquardt algorithm is of model trust region type. The model | the linearized approximation error function | is \trusted" just around the current point W , the size of region being de ned by . Practical values for  are:  ' 0:1 at start then if error decrease multiply  by 10; if the error increases go back (restore the old value of W , i.e. undo changes), divide  by 10 and try again. ❖ CHAPTER 13 Feature Extraction ➧ 13.1 Pre/Post-processing Usually the raw data is not feed directly into the ANN but rather processed before. The preprocessing have the following advantages:  it allows for dimensionality reduction and thus avoid the course of dimensionality,  it may include some prior knowledge, i.e. information additional to the training set. 
Also the ANN outputs are also postprocessed to give the required output format. The pre and post processing may take any form, i.e. a statistical processing, some xed transformation and/or even a processing involving another ANN. The most important forms of preprocessing are:  dimensionality reduction | it allows for building smaller ANN's and thus with less parameters and better suited to learn/generalize1,  feature extraction | it involves making some combination of original training vector components called features; the process of calculating the features is named feature extraction, usually both processes going together, i.e. by dropping some vector components automatically those more \feature rich" will be kept and reciprocal the number of features extracted is usually much smaller than the dimension of the original pattern vector. The preprocessing techniques described above will always drive to some loss of information. However the gain in accuracy neural computation outweighs this loss (of course, assuming that some care have been taken in the preprocessing phase). The main diculty here is to nd the right balance between the informational loss and the neural processing gain. 13.1 1 See [Bis95] pp. 295{298. See the \Pattern Recognition" chapter, regarding the course of dimensionality 237 feature extraction 238 CHAPTER 13. FEATURE EXTRACTION ➧ 13.2 ❖ hxi i, i Input Normalization One useful transformation is to scale the inputs such that they will be into the same order of magnitude. For each component xi , of the input vector x, the mean hxi i and variance i are calculated: hxi i = P1 P X p=1 xip ; P X = P 1; 1 (xip ; hxi i)2 i2 p=1 and then the following transformation is applied: e xip = xip ; hxi i (13.1) i where the new pattern introduced xeip have zero mean and variance one: e ❖ xip hxei i = P1 P X p=1 e xip =0 ; e i2 P P X X = P 1; 1 (xeip ; hxei i)2 = P 1; 1 xe2ip = 1 p=1 p=1 A similar transformation may be applied to the target pattern vectors. ✍ whitening ❖ hxi,  Remarks: While the input normalization performed in (13.1) could be done into the rst layer of the neural network, this preprocessing makes easier the initial choice of weights and thus learning: if the inputs and outputs are of order unity the the weights should also be of the same order of magnitude. Another, more sophisticated, transformation is whitening . The input training vector pattern set have the mean hxi and covariance matrix : ➥ hxi = P1 ❖ U,  P X p=1 xp ; 1 = P ;1 P X p=1 (xp ; hxi)(xp ; hxi)T and considering the eigenvectors fui g and eigenvalues i of the covariance matrix: ui = i ui then the following transformation is performed: xep = ;1=2 U T (xp ; hxi) where 0u  11 B . ... U = @ .. uN 1  1 u1N .. C . A uNN ; 0 BB 01 =B B@ .. . 0  ... ... 0  0 1 .. C .C CC 0A 0 N being the component i of uj ; f;1=2gij = p1 ij . Because  is symmetric, it is possible to build an orthonormal fui g set2: ui uj = ij ) U T U = 1 uij 13.2 2 i See [Bis95] pp. 298{300. See mathematical appendix. 13.3. MISSING DATA 239 x2 u2 u1 fxp g distribution fxep g distribution x1 Figure 13.1: The whitening process. The new distribution fxep g have a spherical symmetry and is centered in the origin | in the eigenvectors system of coordinates. 
The mean of transformed pattern vectors is zero and the covariance matrix is unity: hxei = P X p=1 e =0 (13.2) xp P P X X e = P 1; 1 (xep ; hxei)(xep ; hxei)T = P 1; 1 xep xeTp = ; 12 U T U ; 12 T p=1  = ; 21 U T U ; 21 T = ; 21 U T  12 T  p=1   21 U ; 21 T = U T U = I ( may be written as  =  12  12 ;  21 U ; 12 T = U is true | may be checked by direct multiplication | and, because of diagonal nature of  matrices they equal their transposed). The (13.2) result shows that in the system of coordinates given by fui g the transformed distribution of xep is centered in origin and have a spherical symmetry. See gure 13.1. ✍ Remarks: ➥ ➧ 13.3 The discussion in this section was around input vectors with a continuous spectrum. For discrete input values the above methods may be inappropriate. In these cases one possibility would be to use a one-of-k encoding scheme similar to that used in classi cation. Missing Data The \missing data" problem appears when some components of some/all training vector are unknown. 13.3 See [Bis95] pp. 301{302. 240 ❖ x(k) , x(m) CHAPTER 13. FEATURE EXTRACTION Several approaches may be taken to deal with the problem:  Discarding data. If the process responsible of missing data is independent of data set and there is a (relatively) large quantity of data then the incomplete pattern vectors may be discarded from the training set. This is the most simple approach.  Filling in. The missing data may be lled in with values calculated using various approaches: → Use the means over corresponding data for which values are known. However this approach is too simplistic and usually gives poor results. → Calculate the missing values by a regression method over known data. The drawback is that the regression function generated is noise-free and then it underestimates the covariance in data. → Use the EM (expectation-maximisation) algorithm3 . where missing data may be treated as mixture components. → The general approach: integrate over the corresponding variables by weighting with the corresponding distribution. This involves the fact that the distribution itself is modeled. Let consider that, from the pattern vector x, one part x(k) is known and the rest x(m) is missing. Using the Bayes theorem, the posterior probability of x(k) 2 Ck , respectively x 2 Ck are: ;  P Ck jx(k) = ;  P Ck jx(k) ; x(m) = p(x(k) jCk ) P (Ck ) p ; x(k)  p(x(k) ; x(m) jCk ) P (Ck ) p ; x(k) ; x(m)  Using the above equations, the posterior probability of class Ck , while knowing only x(k) , may be expressed as: ;  P Ck jx(k) = = ➧ 13.4 p(x(k) jCk ) P (Ck ) p p ; 1 ; x(k) x(k)  Z  X(m) = P (Ck ) p ; ; x(k) Z  p X(m)  ; P Ck jx(k) ; x(m) p ; x(k) ; x(m) jCk x(k) ; x(m)   dx(m) dx(m) Time Series Prediction The time series prediction problem involves the prediction of a pattern vector x( ) based of the knowledge of previous behavior of the variable. 3 13.4 Described in \Pattern Recognition" chapter. See [Bis95] pp. 302{304. 13.5. FEATURE SELECTION 241 The following approaches are common:  The static method.It is assumed that the statistical properties of the data generator do not evolve in time. The pattern vector is sampled in time, at regular interval, resulting a series of values, i.e. time is converted to a discrete form: : : : , x ; , x ; , x , : : : The training set is build by using one value as output and some n previous values as input, e.g. x is used as output and x ;n , : : : , x ; as inputs; then x is used as output and x ;n , : : : , x as inputs, e.t.c. 
The rst predicted value may be used as output to make the second prediction, and so on.  Detrending. The time series may have a simple trend, e.g. increasing or decreasing in time. This may lead to a pour prediction over time. However, by tting a simple curve to the data and then removing the value predicted by this simple model the data are detrended. Only the more complex model (assumed constant in time) remains.  The dynamical approach. This involves a method which allows for retraining in time and adaptation to the data generator as it changes in time. 2 1 1 +1 +1 ➧ 13.5 Feature Selection Usually, only a subset of the full feature set is used in the training process of ANN. As there may be many combinations of features, a selection criteria have to be applied in order to nd the most tted subset of features (the individual input vector components may be seen as features as well). One procedure of selection is to train the network on di erent sets of features and then to test for generalization achieved on a set of pattern vectors not used at learning stage. As the training and testing may be time consuming on a complex model, an alternative is to use a simpli ed one for this purpose. It is to be expected | due to the curse of dimensionality | that there is some optimal minimum set of features which gives the best generalization. For less features there is too little information and for more features the dimensionality course intervene. On the other hand usually a criteria J used in class separability increases with the number of features X : J (X ) > J (X ) if X  X (13.3) + + e.g. the Mahalanobis distance4  . This means that this kind of criteria can't be used directly to compare the results given by two, di erent in size, feature sets. Assuming that there are N features, there are 2N possible subsets (2 because a feature may be present or absent from the subset, the whole set may be also considered as a subset). Considering that the subset is required to have exactly Ne features the number of possible combination is still N ;NNe Ne . In principle, an exhaustive search trough all possible 2 ! ( 13.5 4 )! ! See [Bis95] pp. 304{310. De ned in chapter \Pattern Recognition" ❖ J ,X 242 CHAPTER 13. FEATURE EXTRACTION 1 2 3 Figure 13.2: 2 3 4 5 4 4 5 5 3 4 3 4 5 5 4 5 The branch and bound decision tree. Build for a particular case of 5 features. If at one level a feature, e.g. 1, can not be eliminated then the whole subtree, marked with black dots, is removed from the search. combinations should be made in order to nd the best subset of feature. In practice even a small number of features will generate a huge number of subset combinations, too big to be fully checked. There are several methods to avoid evaluating the whole feature combination sets. The branch and bound method This method gives the optimal solution based on a criteria for which (13.3) is true. Let assume that there are N features and Ne features to be selected, i.e. there are N ; Ne features to be dropped. A top-down decision tree is build. It starts with one node and have N ; Ne levels (not counting the root itself). At each level one feature is dropped such that at the bottom there are only Ne left. See gure 13.2. Note that as the number of features to remain is Ne then the rst level have only N ; Ne . It doesn't matter what features are present on this level as the order of elimination is not important. For the same reason one feature eliminated at one level does not appear into the sub-trees of other eliminated features. 
The elimination works as follows:  A random combination of features is selected, i.e. one point from the bottom of the decision tree. The corresponding criteria used in class separability J (X0 ) is calculated. See (13.3).  Then one feature is eliminated at a time, going from top to the bottom of the tree. The criteria J (X ) is calculated at each level. If at some level J (X ) > J (X0 ) then continue the search. Otherwise (J (X ) < J (X0 )) there are are the following possibilities:  The node is at bottom of the tree. The new feature combination is better then the old one and becomes the new level of comparison, i.e. J (X ) ! J (X0 ). 13.6. DIMENSIONALITY REDUCTION 243  If the node is not at the bottom of the tree then the whole subtree is eliminated from search as being suboptimal. The condition (13.3) is no more met, i.e. a feature (or combination of features) which shouldn't have been eliminated just was and all combinations which contain the elimination of that feature shouldn't be considered further. The sequential search This method may give suboptimal solutions but is faster than the previous one. It is based on considering one feature at a time. There are two ways of selection:  At each step one feature | the one which gives biggest J (X ) criterion | is added to the (initial empty) set. The method is named sequential forward selection.  At each step the least important feature | the one witch gives the smaller decrease of J (X ) criterion | is eliminated from the (initial full) set. The method is named sequential backward elimination. ➧ 13.6 sequential forward selection sequential backward elimination Dimensionality Reduction This procedure tries to reduce the dimensionality of input data by combining them using an unsupervised learning process. 13.6.1 Principal Component Analysis The problem is to map a set of input vectors fxp g, which are N dimensional, into a set of corresponding vectors fzp g which have a lower dimensionality K < N . The x input vector may be represented by the means of an orthonormal set of vectors fui g: N x = zi ui ) zi = uTi x where uTi uj = ij i=1 X e e X A transformation x ! x is performed as follows: from the set fzi g only K are retained (e.g. the rst ones) and the others are replaced with constants bi N K bi ui ; bi = const. x = zi ui + i=1 i=K +1 which practically represents a dimensionality reduction from N to K . The problem is now to to select the best K set of components from x. This can be done by trying to minimize the error when switching from x to x, i.e. the di erence between the two vectors: N x;x= (zi ; bi )ui i=K +1 X e X 13.6 See [Bis95] pp. 310{316. e ❖ e x , bi 244 ❖ EK ❖ hxi ❖  CHAPTER 13. FEATURE EXTRACTION and for a set of P input vectors N P P X X X (zip ; bi )2 EK = 21 kxp ; xep k2 = 21 p=1 p=1 i=K +1 From the condition of minimum of EK with respect to bi P @EK = 0 ) b = 1 X zip = uTi hxi i P @bi p=1 P P xp is the mean input vector. where: hxi = P1 p=1 Then the error (13.4) may be written as (use (AB )T = B T AT matrix property): P X N N  X 2 1 X EK = 12 uTi ui uTi (xp ; hxi) = 2 p=1 i=K +1 i=K +1 where  is the covariance matrix of input vectors fxp g: P X  = (xp ; hxi)(xp ; hxi)T p=1 (13.4) (13.5) (13.6) The minima of EK with respect to fui g occurs when the this set is made from the eigenvectors of covariance matrix5 : ui = i ui (13.7) ❖ i i being the eigenvalues. 
By replacing (13.7) into (13.5) and using the orthogonality of fui g the error becomes: N X i EK = 12 (13.8) i=K +1 and is minimized by choosing the smallest N ; K eigenvalues. Let consider the translation of coordinates from whatever fxp g was represented to the one de ned by the eigenvectors fui g and in the same time a translation of the origin of the new system to hxi, i.e. in the new system hxi = b0; this means that each xp may be represented as linear combination of fui g: N X xp = ip ui i=1 P P N 2 P and by replacing in (13.5), considering also the orthogonality of fui g, EK = 12 , p=1 i=K +1 ip i.e. EK is a quadratic form. From (13.8) itpfollows that the surfaces EK = const. are hyperellipses with the axes proportional with 1= i and the dimensionality reduction is done by 5 See the mathematical appendix. 13.6. DIMENSIONALITY REDUCTION x2 245 u2 2;1=2 u1 ;1=2 1 hxi E = const. x1 Figure 13.3: The constant-error surfaces are hyper-ellipses. The dimensionality reduction is done by projecting the data points (black dots) on the axes corresponding to the smallest eigenvalue i , i.e. in this bidimensional case on u1 dropping those dimensions corresponding to the smallest axes, i.e. by making a projection on the axes corresponding to largest i representing the largest spread of data points. See gure 13.3. ✍ Remarks: ➥ The method described is also known as the Karhunen-Loeve transformation. The ui are named principal components. 13.6.2 Non-linear Dimensionality Reduction Trough ANN The principal component analysis performed in the previous section may reduce the dimensionality only trough a linear process. Sometimes the data have an intrinsic dimensionality which cannot be retrieved by linear methods. See gure 13.4 on the following page. Auto-associative ANN may be used to perform dimensionality reduction. The input patterns are used also as targets (hence the \auto-associative" name) but the network have a \bottleneck" build into hidden layers. See gure 13.5 on the next page. The hidden layers have less neurons than input and output layers thus forcing the network to \squeeze" the data trough, thus achieving a dimensionality reduction | the output of hidden layer representing the dimensionally reduced input data. ✍ Remarks: ➥ ➥ The error function used is usually the sum-of-squares. If the network contains only one hidden layer then the transformation performed is linear, i.e. equivalent to principal component analysis. Unless it is \hardware" implemented there is no reason to use single hidden layer ANNs to perform dimensionality reduction. 246 CHAPTER 13. FEATURE EXTRACTION x2  x1 Figure 13.4: x1 Sometimes the data have an intrinsic dimensionality which cannot be revealed by a linear transformation. In this case the data points are distributed on a circular shape and could be described by one dimension, e.g. the angle  measured from one from a reference point. xN input layer non-linear transformation Z layer dimensionality reduction non-linear transformation output layer x1 Figure 13.5: xN The use of auto-associative ANNs for dimensionality reduction. The network have a bottleneck, i.e. the hidden layers have less neurons than input and output layers. The dimensionality reduction is achieved in layer Z . 13.7. INVARIANCE ➥ ➧ 13.7 247 The same design may be used for data compression. Invariance In some applications of ANNs the output should be invariant to some transformation of the input, e.g. the classi cation of the shape of an object should be invariant to the position/rotation of it. 
The most straightforward approach | and also the most inecient | is to include into the training set as many examples of the transformed input as possible. This require a large amount of training data and gives pour result for transformations uncovered well by the learning set. Various more e ective alternatives have been developed. 13.7.1 The Tangent Prop Method This method is applicable for continuous transformations, e.g. translations and rotations but not mirroring. Let assume that the transformation of a vector x leads to vector s and is governed by one scalar parameter , e.g. rotation is de ned by the angle parameter. Then s = s(x; ) and let M be the reunion of all s vectors for a given x and all possible values of . Also the \natural" condition s(x; 0) = x is imposed. Let  be the vector de ned as @ s(x; ) = @ =0 ❖ s, ,M ❖ The change in network output, due to the transformation de ned by is: N N N X @yj @xi X @yj @yj X = @x @ = @x i = Jji i @ i=1 i=1 i i=1 i where J is the Jacobian6 J 0 @y  @yj  B@ @x... = @xi i=1;N j =1;K 1 1  @y1 @xN @yK @x1  @yK @xN ... .. . (13.9) 1 CA ❖ J If the network mapping function is invariant to the transformation in the vicinity of each pattern vector xp then the term (13.9) is heading towards zero and it may be used to build a regularization function7 of the form: = 12 P X K X p=1 j =1 (Jjip ip )2 where Jji p and ip are referring to the input vector xp from the training set fxp gp=1;P . 13.7 6 7 See [Bis95] pp. 319{329. The Jacobian is de ned in the \Multi layer neural networks" chapter. See chapter \Pattern recognition". ❖ Jjip , ip 248 CHAPTER 13. FEATURE EXTRACTION e = E+ The new error function becomes: E determining the in uence of . ✍ Remarks: ➥ In practice the derivative de ning  is found by the mean of nite di erence between x and s obtained from a (relative) small . 13.7.2 ❖ x, u i where  is a regularization parameter Preprocessing The invariant features may be extracted at preprocessing stage. These features are named moments. Let consider one particular component x of the input vector, described in some system of coordinates and let fu g be the coordinates of x, e.g. x may be a point on a bidimensional image and fu1; u2 g may be the Cartesian coordinates, the point being either \lit" (x = 1) or not (x = 0) depending upon coordinates (u1 ; u2 ). The moment M is de ned as: i ❖ M Z Z x(u ; : : : ; u ) K(u ; : : : ; u ) du : : : du =  M U1 kernel function 1 1 N 1 N N UN where K (u1; : : : ; u ) is named kernel function and determines the moments to be considered. For discrete spaces the integrals changes to sums: X    X x(u N M = i1 1(i1 ) ;::: ;u N (iN ) ) K (u 1(i1 ) ;::: ;u N (iN ) ) u 1(i1 ) : : : u N (iN ) iN (the values being taken at the points i1 ; : : : ; i ). One of the possible kernel functions is the power function which give rise to regular moments : M1 N regular moments i ;::: ;iN M1 i ;::: ;iN Z Z =    x(u ; : : : ; u ) u 1 U1 ❖ u i i c i ;::: ;iN M0;::: ;0 =  U1 central moments M0;::: ;1;::: ;0 1 i1 1 : : : u du1 : : : du iN N N UN (1 in i-th position) then: Z Z x(u ; : : : ; u ) (u ; u ) and by de ning u M1 = N N 1 1 i1 : : : (u N ; u ) du : : : du N iN 1 N (13.10) UN which is named central moment. The central moment, de ned in (13.10), is invariant to translations, i.e. to transformations of type x(u1 ; : : : ; u ) ! x(u1 + u1; : : : ; u + u ), provided that the edge e ects are neglected, or the U domains are in nite. N N N i Proof. 
Indeed, the moment calculated for the new pattern is: f ci1 ;::: ;iN M Z Z U1 UN =  ( + x u1 u1 ; : : : ; uN +  ) ( 1 ; 1) 1 ( ; ) uN u u i : : : uN uN iN du1 : : : duN (13.11) 13.7. INVARIANCE 249 and by making the change of variable u ! u replacing in (13.11): i f ci ;::: ;i M 1 N Z =   U1 Z U N i + u = ue then du = due , also ue = u + u , and by i i i i i x(ue1 ; : : : ; ueN ) (ue1 ; ue1 )i1 : : : (ueN ; ueN )iN due1 : : : dueN i i c =M 1 N i ;::: ;i A moment i1 ;::: ;iN invariable to uniform scaling, i.e. x(u1 ; : : : ; uN ) ! x( u1 ; : : : ; uN ) where is the scaling parameter, may be built as: i1 ;::: ;iN = c Mi1 ;::: ;iN c M (13.12) N )=N 1+(i1 ++i 0;::: ;0 c Because the i1 ;::: ;iN moment is build using only the central moments Mi1 ;::: ;iN then it is automatically invariant to translations. Proof. Let consider a scaling de ned by . Similarly as for translation: f ci ;::: ;i M 1 N Z =  U1 = = = Z N U N N x( u1 ; : : : ; uN ) (u1 ; u1 )i1 : : : (uN ; uN )iN du1 : : : duN 1 Z +i1 ++iN U1 1 Z +i1 ++iN U1   Z U   N Z U x( u1 ; : : : ; uN ) ( u1 ; u1 )i1 : : : ( uN ; uN )iN   d( u1 ) : : : d( u ) x(ue1 ; : : : ; ue ) (ue1 ; ue1 ) 1 : : : (ue ; ue ) N due1 : : : due N i n N (13.13) n N i N ci ;::: ;i M 1 N +i1 ++iN N c0 By the same means, for M ;::: ; c0;::: ;0 = 0 which is: M f c0;::: ;0 = M R  R N c M0;::: ;0 U1 U x(u1 ; : : : ; uN ) du1 : : : duN it gives that: (13.14) N Finally, by using (13.13) and (13.13) into (13.12) it gives that e 1 invariant to scaling as well. i ;::: ;i N =1 i ;::: ;i N , i.e. the  moment is It is possible also to build a moment which is independent to rotation. First the M moment is rewritten in generalized spherical coordinates8: Z Z Z 1 Mi1 ;::: ;iN = PN i=1 cos2 2  0 As 2 0 x(r; 1 ; : : : ; N ) (r cos 1 )i1 : : : (r cos N )iN r dr d1 : : : dN 0 i = 1 then the moment: MR = M2;0;::: ;0 + M0;2;0;::: ;0 + : : : + M0;::: ;0;2 is rotation independent and thus the moment: R = 2;0;::: ;0 + 0;2;0;::: ;0 + : : : + 0;::: ;0;2 8 See mathematical appendix. ❖ i1 ;::: ;iN , 250 CHAPTER 13. FEATURE EXTRACTION A 1 2 B layer 1 layer 2 layer 3 Figure 13.6: The shared weights method for bidimensional pattern vectors. Region A from rst layer activates neuron 1 from second layer, respectively region B activates neuron 2. The weights from A to 1, respectively from B to 2 are the same, i.e. the same input pattern in A respectively B will give same activation in 1 respectively 2. Layers 1, 2, 3, : : : are in decreasing size order. is independent to translation, scaling and rotation. ✍ Remarks: ➥ There are two potential problems when using moments: one is that it may be computationally intensive and the second is that some information may be lost during preprocessing. 13.7.3 Shared Weights This technique is using specially build ANNs to allow for a relative translation invariance. ✍ Remarks: This technique is useful in image processing and bears some resemblance with mammalian vision. Considering a bidimensional input pattern the network is build such that a layer will receive excitatory input only from a small area of the previous layer. By having (sharing) the same weights, such areas send the same excitatory input to the next layer, when presented with the same input pattern. See gure 13.6. ➥ 13.7.4 Higher-order ANNs A higher-order network have a neural activation of the form9 : ! N X N N X X wji1 i2 xi1 xi2 + : : : wji xi + zj = f wj 0 + i1 =1 i2 =1 i=1 9 See chapter \Multi Layer Neural Networks" 13.7. 
INVARIANCE 251 i1 t 0 i1 t 0 i2 Figure 13.7: i2 Translation in bidimensional space. t represents the translation vector. The pattern values xi1 is replaced by xi1 , respectively xi2 by xi2 . 0 0 where neuron zj receives input from xi , wj0 is the bias and f is the activation function. By using second-order neural networks, i.e. of the form: ! N X N X (13.15) wji1 i2 xi1 xi2 zi = f i1 =1 i2 =1 it is possible to built a network whose output is translation invariant for a bidimensional input pattern. Considering a translation of a pattern in a bidimensional space then in the place i1 occupied previously by xi1 now will be an input xi1 which have come from i1 , the same happens for the second point i2 . See gure 13.7. To keep the network output (13.15) the same it is necessary to impose the condition: (13.16) wji1 i2 = wji i 1 2 0 0 0 0 for each pair of points f(i1; i2 ); (i1 ; i2 )g which may be correlated by a translation. 0 ✍ Remarks: ➥ ➥ ➥ 0 The condition (13.16) greatly reduces the number of independent weights (such that a second order neural network becomes more manageable). One layer of second order neural network is sucient to extract the translation invariant pattern information. Highest order networks may be used for more complex invariant information extraction, e.g. a third order network may be used to extract information invariant to translation, uniform scaling and rotation by correlating 3 points. CHAPTER 14 Learning Optimization ➧ 14.1 The Bias-Variance Tradeoff Let p(tjx) be the target t probability density, conditioned on input x. The conditional R average of the target is htjxi = tp(tjx) dt and the conditional average of the square of R targets ht2 jxi = t2 p(tjx) dt. ❖ htjxi, ht2 jxi Y Y For an in nite training set the sun-of-square error function may be written1, considering only one output, as: 1 E=2 Z X y [ (x) Z ; htjxi] p(x) dx + 2 [ht2 jxi ; htjxi2 ] p(x) dx 1 2 (14.1) X The second term in (14.1) is independent of y(x) and thus independent of weights | it represents an intrinsic noise which sets the minimum of E . The rst term may be minimized to 0 in which case y(x) = htjxi. Finite training data sets S , considered containing P training vector patterns (each), cannot cover the whole possibilities for input/output and then where will always be some di erence between y(x) (after training) and htjxi (considering an in nite set). The integrand of the rst term in (14.1), i.e. [y(x) ;htjxi]2 , gives a measure of how close is the actual mapping y(x) to the desired target t. To avoid the dependency over a particular 14.1 1 See [Bis95] pp. 332{338. See chapter \Error Functions", \Signi cance of network output". 253 ❖ ES 254 CHAPTER 14. LEARNING OPTIMIZATION training set, the expectation (average) ES is taken as a measure of the mapping: measure of mapping  ES f[y(x) ; htjxi]2 g bias the average being done over the whole ensemble of training sets. De nition 14.1.1. The bias represents the di erence between the average of network mapping y(x) and the data generator, i.e. htjxi: bias  ES fy(x)g ; htjxi The average bias over the input x is de ned as: average bias)2  ( variance 1 2 Z ES fy(x)g ; htjxi]2 p(x) dx [ X For a perfect model, the bias would be 0 (as ES fy(x)g = htjxi). Non-zero bias arises from two causes:  di erence between the function created by the model (e.g. ANN) and the true function who generated the data | this is the \true" bias,  variance due to data sets | particular sensitivity to some training sets. De nition 14.1.2. 
The variance represents the average sensitivity of the network mapping y (x) to a particular training set: variance  ES f[y(x) ; ES fy(x)g]2 g The average variance over the input x is de ned as: average variance  1 2 Z ES f[y(x) ; ES fy(x)g]2 g p(x) dx X Let consider the measure of mapping [y(x) ; htjxi]2 : [y (x) ; htjxi]2 = [y(x) ; ES fy(x)g + ES fy(x)g ; htjxi]2 2 2 = [y (x) ; ES fy (x)g] + [ES fy (x)g ; htjxi] + 2[y (x) ; ES fy (x)g][ES fy (x)g ; htjxi] When doing an average over the above equation, the third term cancels (because y(x) ! ES fy(x)g) and then: ES f[y(x) ; htjxi]2 g = ES f[y(x) ; ES fy(x)g]2 g + [ES fy(x)g ; htjxi]2 = ❖ h(x), " variance + (bias)2 To see the meaning of bias and variance let consider the targets as being generated from a function h(x) to which some random noise ", with 0 mean, have been added: tp = h(xp ) + "p (14.2) 14.2. REGULARIZATION 255 and the optimal mapping is then htjxi = h(x). There are two extreme possibilities for the y(x) mapping choice:  The mapping is build to be some g(x) function, completely independent of data set. Then the variance is zero since ES fy(x)g = g(x) = y(x). However the bias may be very high (unless some other prior knowledge have been used to build g(x)).  The mapping is build to t the data perfectly. Then the bias is zero since: ES fy(x)g = Efh(x) + "g = h(x) = htjxi However the variance is: ES f[y(x) ; ES fy(x)g]2 g = ES f[y(x) ; h(x)]2 g = ES f"2 g and the variance of " may be high. The above discussion reveals that there is a tradeo between the bias and the variance and, in practice, a balance between the two, should be found. One way to reduce the bias and variance at the same time is to increase the complexity model by using larger training sets, i.e. the size of training set determines the size of the model such that the optimal balance between bias and variance is found (note however that this approach leads to the course of dimensionality and thus the model cannot be increased inde nitely). Another way to reduce bias and variance at the same time is to use the prior knowledge (if any) about the data generator (14.2) when building the model. ➧ 14.2 Regularization The error function E may be changed by adding a regularization term2 : e E =E + (14.3) where is a function which have a large value for smooth mapping functions y(x) and a small value otherwise, i.e. is large for simple models and small for complex models, thus encouraging less complex models. The  parameter controls the in uence of . Thus the regularization and the error E counterbalance one each other (as error generally increases in simple models) in the process of minimizing the changed error function Ee during learning process. 14.2.1 Weight Decay In the case of a over- tted model the mapping will have areas with large curvature which require large values for weights3. 14.2 2 3 See [Bis95] pp.338{346. See chapter \Radial Basis Function Networks" See chapter \Parameter optimization" ❖ , 256 CHAPTER 14. LEARNING OPTIMIZATION By encouraging weights to be small the error hyper-surface is kept relatively smooth. This may be achieved by using a regularization of the form: X = 12 i (14.4) wi2 Many ANN training algorithm make use of the error gradient. From (14.3) and (14.4): e rE = rE + W (14.5) (W being seen as vector). 
Considering just the part generated by the regularization term, the weight update formula in the gradient descent learning method4 in the continuous time  limit is: dW d e = ;rE = ;W (rE neglected) (where  is the learning parameter) which have a solution of the form: ( ) = W (0) e; W  ❖ b, H i.e. the regularization term (14.4) favors weights decay toward 0, following an exponential rule. Considering a quadratic error function5 of the form: ( ) = E0 + bT W + 21 W T HW E W ❖ W f ❖ W = const. (14.6) b where H is a symmetrical matrix (it is in fact the Hessian), the minima of E (from rE = 0) is at W  given by: b Similarly, the minima of e (from r e = b) occurs at f given by: r e = + f  + f = b rE = b + HW  = 0 E E 0 b E ❖ ui , i b; H ; (14.7) W HW 0 W (14.8) (from (14.5) and (14.7)). Let fui g be an orthogonal system of eigenvectors of H such that: H ui = i ui ; uTi uj = ij (14.9) where i are the eigenvalues of H (such construction is possible due to the fact that H is symmetrical, see mathematical appendix). Let consider that the system of coordinates in the weights space is rotated such that it becomes parallel with fui g. Then W  and W  may be written as: W 4 5 = X i wi ui See chapter \Single Layer Neural Networks". See chapter \Parameter Optimization" f ; f = X e W i wi u i 14.2. REGULARIZATION 257 w2 1=p1 u2 1=p2 W f W u1 E = const. w1 Figure 14.1: The hyper-surfaces E =pconst. are hyper-ellipses having axes proportional to 1= i . The distance between W  f is smaller along axes corresponding to larger and W eigenvalues i of H . and from (14.7) and (14.8) rE = 0b = rEe and considering the orthogonality of fui g it follows that: i w i +  i wei = ( ) we ' w . As the surfaces E = const. are hyper) jwe j  jw j p f is closer to W  ellipses which have axes proportional with6 1=  this means that W which means that i   i   i i i i i along ui directions with correspond to larger i . See gure 14.1. 14.2.2 Linear Transformation And Weight Decay Let consider a multi-layer perceptron network having one hidden  layer and one linear outP put layer. Then for the hidden layer zj = f wj0 + wji xi and for the output layer yk = wk0 + P j i wkj zj . Considering a linear transformation of the form: xi ! xei = axi + b where a; b = const. then is is possible to get the same outputs by changing the weights of hidden layer as: 8 > > <w ! we = 1 w a bP > > :w 0 ! we 0 = w 0 ; a w ji ji j j ji j i (may be checked directly, zj doesn't change). 6 See chapter \Parameter Optimization". ji 258 CHAPTER 14. LEARNING OPTIMIZATION Error validation training under- tting over- tting early stop Figure 14.2: time The error given by the validation set increases from some point outwards. At that point the network is optimally tted; beyond is over- tted, before is under- tted. Similarly, for a transformation of the output layer: the weight changes of the output layer are: ( wkj wk 0 yk ! ye k = cyk + d where c; d = const. ! we ! we = cwkj k0 = cwk0 + d kj By training two identical networks, one with original data and one with transformed inputs and/or outputs the the trained networks should be equivalent, with weights changed by the linear transformations as shown above. The regularization term (14.4) does not satisfy this requirement. However a weight decay regularization term of the form: = 21 X hidden layer w2 + 2 2 X w2 output layer satis es the invariance of linear transformation, provided that suitable rescaling is performed on 1 and 2 . 
14.2.3 validation Early Stopping From the whole set of available data for training, usually, some part is put aside and not used in training, for the purpose of validation, i.e. checking the generalization capability of network with some data unused during the learning process. While the error function always decreases for the training set during learning, the error for the validation set decreases at the beginning then, later, begin to increase as the network becomes over- tted. See gure 14.2. The network is optimally tted at the point where the validation set gives the lowest error; at this point training should be stopped as the generalization capabilities are best, even if further training will lower error with respect to the training set. A justi cation of the early stop method may be given by the means of the weight decay procedure. As the weights adapt during learning, the weight vector W follows a path which 14.2. REGULARIZATION 259 w2 u f W 2 W u 1 E = const. w1 Figure 14.3: The \early stop" method gives similar result as the \weight decay" procedure as it involves stopping on the learning path | marked with an arrow | before reaching W  . The particular form of the learning path may be explained by the faster advancement along directions with greater rE , i.e. smaller i . See also gure 14.1 on page 257. f  before reaching W  . See gure 14.3 and section 14.2.1. leads to W 14.2.4 Curvature Smoothing As over-complex neural models leads to network mappings with large curvatures (on error surface), a direct approach will be to build a regularization term which increases with curvature. As the second derivatives are a measure of the curvature then the regularization should be of the form: P X N X K @ 2 y (x ) !2 X 1 k p =2 2 @x ip p=1 i=1 k=1 N respectively K being the size of input respectively output vectors. 14.2.5 Choosing weight decay hyperparameter Considering the weight decay regularization (14.4) then the prior probability density of weights is chosen usually as a Gaussian of the form p(w) / exp(; ) which have the p variance  = 1=  . Considering the logistic activation function7 f (x) = 1+1e;x this one is saturated around x = 3, i.e. f (;3) ' 0:04 and f (3) ' 0:95 (very close to lower 0 and upper 1 asymptotes). As the reason (among others) for weight decay regularization is to prevent saturation of neuronal outputs then the variance of total input is to be around 2. For a small number of 14.2.5 7 See [Rip96] pg. 163. See also chapter \Pattern Recognition". ❖ N, K 260 CHAPTER 14. LEARNING OPTIMIZATION inputs in the range [0; 1] a reasonable variance should be between 1 and 10, e.g. the middle value 5 may be considered, this corresponds to   0:04 In practice is a good thing to have some neurons saturated (this corresponding to a sort of pruning8) then the base range for  is between 0:001 and 0:1. Experience indicate that a multiplication of  up to 5 times is not critical to the learning process. ➧ 14.3 ❖ ", pe(") Adding Noise Another way to achieve better generalization is to add random noise to the training set before inputting it to the network. Let consider a random vector " which is being added to the input x and have probability density pe("). 
In the absence of noise, the error function is9: K Z Z 1X E= [yk (x) ; tk ]2 p(tk jx) p(x) dx dtk 2 k=1 Y X Considering an in nite number of training vectors, each containing an added noise term, then K Z Z Z X 1 e E= [yk (x + ") ; tk ]2 p(tk jx) p(x) pe(") dx dtk d" (14.10) 2 k=1 Y X " A reasonable assumption is to consider the noise suciently small as to allow for the expansion of yi (x + ") in a Taylor series, neglecting the third order and above terms: N @y N X N X @ 2 yk 1X k "i "i "j + + O("3 ) (14.11) yk (x + ") = yk (x) + @xi "=0 2 @xi @xj "=0 i=1 i=1 j=1 It is also reasonable to assume that the noise have zero mean and uncorrelated components: Z () "i pe " d" = 0 and " ❖ ZZ () "i "j p e " d" = ij (14.12) " where  is the variance of noise. By using (14.11) in (14.10) and integrating over " with the aid of (14.12), the error function may be written as: e E See also section 14.5. See [Bis95] pp. 346{349. 9 See chapter \Error Functions". 8 14.3 =E+ 14.4. SOFT WEIGHT SHARING where 261 becomes a regularization parameter of the form: K X N Z X = 12 k=1 i=1 Z ( Y X @yk @xi 2 2 + 12 [yk (x) ; tk ] @@xy2k i ) p(tk jx) p(x) dx dtk (14.13) where the noise do not appear anymore (normalization of p(") was also used). The network mapping which minimize the sum-of-squares error function E is10 yk = htk jxi. Thus the network mapping which minimize the regularized error function Ee, see (14.3), should be of the form yk = htk jxi + O( ). Then the second term in (14.13): K X N Z 1X 4 k=1 i=1 [yk (x) ; htk jxi] X @yk p(x) dx @xi (the integration over Y have been performed) cancels to the lowest order of  , at the minima of error Ee . This property makes the computation of second order derivatives of yk (a task computationally intensive) avoidable, thus leaving the regularization function of the form: N X K Z X 1 =2 i=1 k=1  X @yk @xi 2 p(x) dx (14.14) or, for a discrete series of training vectors: P X N X K X = 21P p=1 i=1 k=1 ➧ 14.4 @yk @xi xp !2 Soft Weight Sharing This method encourages groups of weights to have similar values11 by using some regularization links. The soft weight sharing relates to weight decay method. Considering a Gaussian distribution for weights, of the form: p(w) =  2 p1 exp ; w2 2 then the likelihood12 of the whole set of weights is: ! N NW W Y X 1 1 L = p(wi ) = (2)NW =2 exp ; 2 wi2 i=1 i=1 NW 10 14.4 11 12 (14.15) being the total number of weights. The same way, the negative logarithm of a like- See chapter \Error functions". See [Bis95] pp. 349{353. A hard sharing method is discussed in the chapter \Feature Extraction". See chapter \Pattern Recognition". ❖ NW 262 CHAPTER 14. LEARNING OPTIMIZATION lihood gives the error function13, the negative logarithm of likelihood (14.15) gives the regularization function (14.4) (up to an additive constant which plays no role in a learning process). It is possible to encourage weights to group together by using a mixture14 of Gaussian distributions: p(w) = ❖ M , m , pm (w) M X m=1 m pm (w ) where M is the number of mixture components, m are the mixing coecients and pm (w) are the mixture components of the (assumed) Gaussian form:  m )2 pm (w) = p 1 2 exp ; (w ; 2m2 2m ❖ ,   (14.16)  being the mean and  being the standard deviation. ✍ Remarks: In the mixture model m = P (m) is the prior probability of the mixture component m. 
Then the regularization function is: NW X NY M W X (14.17) = ; ln L = ; ln p(wi ) = ; ln m pm (wi ) ➥ i=1 i=1 m=1 The regularized error function: Ee = E +  (14.18) have to be optimized with respect to weights wi and parameters m , m and m . ✍ Remarks: ➥ ❖ m An optimization with respect to  is not possible as it will lead to  = 0 and Ee ! E , i.e. the network will end up by being over- tted. See section 14.2.3. The posterior probability of mixture component m is: m pm (w ) m (w)  P M n pn (w) (14.19) n=1 From (14.18), (14.17), (14.19) and (14.16), the error derivatives with respect to wi are: M wi ; m @ Ee = @E +  X @wi @wi m=1 m (wi ) m2 13 14 See chapter \Error Functions". See also the chapter \Pattern recognition" regarding the mixture models. 14.4. SOFT WEIGHT SHARING 263 which shows that the weights wi are pulled towards the distribution centers m with with \forces" proportional to the posterior probabilities m . Similarly the error derivatives with respect to m are: @ Ee @m =  NW X  ;w m (wi ) m 2 i m i=1 which shows that m are pulled towards the sum of all weights, weighted by m . Finally, the error derivatives with respect to m are: @ Ee @m ✍ =  NW X i=1 m (wi )  1 m ; wi ; m )2 3 m (  Remarks: ➥ In practice the m parameters are taken of the form m = exp(m ) and optimization is performed with respect to m . This ensures that m remains strictly positive. As m plays the role of a prior probability it have to follow the probability constrains M P m 2 [0; 1] and m = 1. This is best done by using the softmax function method: m=1 exp( m ) m= P M exp( n ) n=1 ) @@ m n = mn n ; m n Then the derivative of Ee with respect to m is (by similar ways as for previous derivatives and using the normalization @ Ee @ m ✍ M X @ Ee @ n = n=1 @ n @ m Remarks: ➥ = M P n=1 M X n=1 " n = 1): ; NW  (w ) X n i i=1 n ! ( nm m ; n m ) # = NW X i=1 [ m ; m (wi )] In practice care should be taken when initializing weights and related parameters. One way of doing it is to initialize weights with values over a nite interval, then divide the interval into M subintervals and assigning one to each pm (wi ), i.e. equal m , m centered into the respective subinterval and m equal to the width of the subinterval. This method of initialization ensures a partial soft sharing and allows better learning from the training set. 264 CHAPTER 14. LEARNING OPTIMIZATION ➧ 14.5 growing method pruning method Growing And Pruning Methods The network architecture may play a signi cant role in the nal generalization performance. To allow for the nding of best architecture, e.g. the number of hidden layers and number of neurons per hidden layer, two general ways (which avoids an extensive search) may be used:  The growing method: The starting network is small and then neurons are added one at a time till some criteria for optimization have been satis ed.  The pruning method: The starting network is big and then neurons are removed one at a time till some criteria for optimization have been satis ed. 14.5.1 Cascade Correlation Cascade correlation is of growing method type. The network starts with a single fully connected layer, where all inputs are linked to outputs. At each stage | after the network have been trained for an empirical amount of time | a new hidden neuron is added, such that it receive input from all inputs and all previously added (hidden) neurons and send its output to all output neurons. See gure 14.4 on the next page. 
The weights from inputs and all previously added neurons to the actually being added hidden neuron are calculated in one step and after insertion they remain xed, e.g.| in gure 14.4 | the weights on link ➀ are calculated prior to z1 insertion, then the weights on links ➃ and ➂ are calculated prior to insertion of z2 and so on. The weights from inputs to outputs and from all hidden neurons to outputs remain variable, to be further adapted, e.g.| in gure 14.4 | only weights on main (input ! output) link and those on links ➁ and ➄ will continue to adapt during further learning. ✍ ❖ " k , h "k i Remarks: By the above architecture the network is similar to one having just one \active" layer (the output) which receive input from input neurons and all hidden (added) neurons. The \ xed" weights of the new (in the process of being inserted) neuron z is computed as follows:  The error "k of output neuron yk and the mean error h"k i over all training set are calculated rst: ➥ "k = y k ❖ "kp ❖ wi ❖ zp , hz i ; tk ; 1 h"k i = P X P p=1 "kp ("kp being the error "k for input vector xp from the training set).  The weights wi of all inputs to the new neuron z are initialized | weights for links from all inputs and all previous inserted hidden neurons. The output of the new neuron z and the mean output over all training set hz i will be: 14.5 See [Bis95] pp. 353{364. 14.5. GROWING AND PRUNING METHODS 265 input layer 11 00 11 00 11 00 11 00 11 00 11 00 11 00 11 00 11 00 11 00 11 00 11 00 11 00 11 00 11 00 11 00 11 00 11 00 ➀ z1 ➃ ➂ z2 ➁ ➄ output layer Figure 14.4: The cascade correlation method. The network architecture starts with input and output layers fully interconnected (the thick arrow shows that all output neurons receive all components of the input vector). The the rst hidden neuron z1 is added and the connections ➀ (z1 receives all inputs) and ➁ (z1 sends its output to all output neurons) are established. later, when z2 is added connections ➂ (z2 receive input from z1 ), ➃ and ➄ are also added. And so on. 0 X zp = f @ all inputs 1 wi xip A hz i = P1 ; P X p=1 zp where the sum in zp is done also over the previous inserted hidden neurons (f being the activation function and zp being the new neuron output for input vector xp ).  The weights are optimized by maximizing the correlation, i.e. covariance, S between the output of the new neuron to be inserted and the residual actual (before the insertion takes place) error of the network output | similar to the Fisher discriminant15: K X P X S= (zp ; hz i)("k ; h"k i) k=1 p=1  The maximisation of S is performed by using the derivative with respect to wi : @S = @wi  K X P X k=1 p=1   f 0 xip ("k ; h"k i) where the sign is given by the expression inside the module from S . 15 See chapter \Single Layer Neural Networks". ❖ f , zp ❖S 266 CHAPTER 14. LEARNING OPTIMIZATION The maximisation may be done using the above derivative in a way similar to the methods used in backpropagation. 14.5.2 Pruning Techniques Saliency of weights saliency The pruning technique involves starting with a (relatively) large network, training it, then removing some neuronal connections, then training again and repeat the training/pruning process. To decide about what network links are less important and thus susceptible to removal it is necessary to assess the relative importance of weights, this process being named saliency. Optimal Brain Damage ❖ E , wi , NW Let consider a small variation in error function E due to a small variation of weights wi . 
By developing in series and taking only the rst 2 terms (NW is the total number of weights): XW N E = i=1 ❖ Hij @E @wi wi + 1 2 XW XW N N i=1 j =1 Hij wi wj + O(w3 ) 2 E . where H is the Hessian whose elements are Hij = @w@i @w j At minimum of E its rst derivatives become zero. Assuming that the Hessian may be approximated by making all non-diagonal elements equals to zero | this representing the main assumption of this technique | then: E = 1 2 X W i=1 Hii (wi )2 To remove a neuronal connection is equivalent to make its corresponding weight equal zero: wi = 0 ; wi = ;wi (14.20) and then a measure of saliency for weight wi would be: saliency = Hii wi2 2 (14.21) The optimal brain damage technique involves removal of connections de ned by the weights of lowest saliency. ✍ Remarks: ➥ In practice the number of weights being deleted, the amount of training between weight removal and the overall stop conditions are empirically established. 14.5. GROWING AND PRUNING METHODS 267 Optimal Brain Surgeon The optimal brain surgeon technique works in the same manner as the optimal brain damage but does not assume that the Hessian is diagonal as this may lead to pour results in some cases. The variation around minima of E is then (W is the vector of wi ): E = 12 W T HW ❖ W (14.22) Assuming that weight wi is pruned then, equivalent to (14.20): wi = 0 ; wi = ;wi , eTi W + wi = 0 (14.23) where ei is a unit vector;parallel to wi axis, i.e. eTi W is the projection of w on wi axis;  ei is of the form eTi = 0    0 1 0    0 , with 1 in the i-th position and NW dimensional. To nd the new (pruned) weights, E have to be minimized subject to condition (14.23). The Lagrange multipliers are used16; the Lagrange function is: L = E + (eTi W + wi ) = 21 W T HW + (eTi W + wi ) and then by zeroing the derivative with respect to W : rL = HW + ei = 0b ) W = ;H ;1 ei and, by replacing in (14.23) and considering the form of eTi and thus eTi H ;1 ei = fH ;1 gii then: ;eTi H ei + wi = 0 )  = fHw;i1 g ii such that nally: W = ; fHw;i1g H ;1 ei ii (14.24) Replacing (14.24) into (14.22) the minimal error variation due to the removal of neural link corresponding to wi is (H is symmetric and then H ;1 is as well, also use matrix property (AB )T = B T AT ): 2 2 E = 12 fHw;i1g2 eTi (H ;1 )T HH ;1 ei = 12 fHw;i1 g2 eTi (H ;1 )T ei ii ii wi2 1 = 2 fH ;1 gii (14.25) Optimal brain surgery works in similar way to optimal brain damage: after some training the inverse Hessian is calculated and some weights involving the smaller error variation, as given by (14.22), are zeroed, i.e. the corresponding inter-neuronal links removed. Then the procedure is repeated. 16 See mathematical appendix. ❖ ei 268 CHAPTER 14. LEARNING OPTIMIZATION 14.5.3 ❖ sj Neuron Pruning Instead of pruning inter-neuronal links, it is possible to prune whole neurons. To be able to choose which neurons may be removed it is necessary to have a measure of neuron saliency. Such a measure may be the di erence in network output error made by neuron removal: sj = Ewith neuron j ; Ewithout neuron j ❖ j As the above measure is computationally intensive, an alternative approach is to modify the neuronal input by adding an overall multiplying factor j . The neuronal output is then written as: zj = f j X i ! wji zi where the activation function f is de ned such that f (0) = 0, e.g. f  tanh. Then for j = 1 the neuron j is present, for j = 0 the neuron is removed. 
The saliency of neuron j becomes: sj = E j =1 ; E j =0 which shows that the derivative of E with respect to j is also a good measure of neuronal j saliency: j sej = ; @@E j j =1 which may be evaluated in a backpropagation way. Note the \;" sign; usually the error increases after the removal of a neuron, while j decreases and sj is taken as a positive quantity (and the bigger it is, the more important the corresponding neuron is). ➧ 14.6 ❖ M Committees of Networks As, in practice it is common to train di erent networks (with di erent architectures) in order to choose the best it would be much better to combine several networks to form a committee (it's even not required to be a network, it may be any kind of model). Let assume that each network have only one output and there are M such networks. The mapping achieved by each network ym (x) may be seen as being the desired mapping h(x) plus some added error "m (x): ym (x) = h(x) + "m(x) The averaged sum-of-squares error for network m is: Z hEm i = Ef[ym (x) ; h(x)]2 g = Ef"2m g = "2m (x) p(x) dx X 14.6 See [Bis95] pp. 364{369. 14.6. COMMITTEES OF NETWORKS 269 The average error over the whole set of networks when acting individually (not as committee) is: Eav. = M1 M X hEm i = m=1 1 M X ❖ Eav. M m=1 Ef"m g 2 Simple committee The simplest way of building a committee of networks is to consider the output of the whole system ycom. as being the average of individual network outputs: ycom. = M1 M X m=1 ym (x) (14.26) ❖ hEcom. i The averaged sum-of-squares error for the committee is: 8" 8 #2 9 #2 9 M M < 1 X = <" 1 X = y (x) " (x) hEcom. i = E =E m i : M m=1 ; : M m=1 ; By using the Cauchy inequality in the form: hEcom. i 6 Eav. , networks. ✍ P M m=1 "m (x) ❖ ycom. 2 M 6 L P "2m (x) it follows that m=1 i.e. the error of committee is less than the average error over independent Remarks: ➥ Considering that the error "m (x) have zero mean and are uncorrelated: Ef"m g = 0 and Ef"m "n g = 0 for m 6= n M X 1 then the error hEcom. i becomes: hEcom. i = 1 M 2 m=1 Ef"mg = M Eav. 2 which shows a big improvement. However in practice the error "m(x) is strongly correlated, the same input vector x giving a similar error on di erent networks. Weighted committee Another way of building a committee is to make an weighted mean over individual network outputs: ycom. = M X m=1 m ym (x) (14.27) As ycom. have the same meaning as ym (x) then clearly the m coecients should have a meaning of probability, i.e. m is the probability of ym being the desired output from all ❖ m 270 CHAPTER 14. LEARNING OPTIMIZATION fym g set. The the following conditions should be imposed: M X m 2 [0; 1] ; 8m and (14.28) m=1 m=1 and they will be determined as to minimize the network committee error. ✍ ❖ Remarks: By comparing (14.27) with (14.26) it's clear that in the simple model all networks have equal weight 1=M . The error correlation matrix C is de ned as the square and symmetrical matrix whose elements are: Cmn = Ef"m (x) "n (x)g (as it's de ned as expectation, it does not depend on x). The averaged sum-of-squares error of the committee becomes: C ➥ 8 < hEcom. i = Ef[ycom.(x) ; h(x)]2 g = E : = E ( M X m "m m=1 ! M X n "n n=1 M X m "m m=1 !) = !2 9 = (14.29) ; M M X X m n Cmn m=1 n=1 To nd the minima of hEcom. i subject to constraint (14.28) the Lagrange multipliers method17 is used. The Lagrange function is: L=E+ M X m=1 ! m;1 = M X M X m=1 n=1 m n Cmn +  M X m=1 ! 
m;1 and by zeroing its derivatives with respect to m : @L @ m ❖ , =0 ) 2 M X n=1 n Cmn +  = 0 ; for m = 1; M ; (as C is symmetrical then Cij = Cji ). Considering the vectors T = 1     = 1b then the above set of equations may be written in matrix form as: C 2 +  = 0b which may be solved easily as: = ; 12 C ;1  , M X fC ;1 g m = ;2 n=1 mn By replacing the value of m from (14.30) into (14.28),  is found to be: 2 =; P M P M fC ;1 gmn m=1 n=1 17 See mathematical appendix.  M and (14.30) 14.7. MIXTURE OF EXPERTS 271 and then the replacement of  back into (14.30) gives the nal values for m : = L P C ;1 1b M P M P m=1 n=1 fC ; gmn ( 1) ) m= fC ; gmn 1 n=1 M P M P n=1 o=1 The error (14.29) may be written in matrix form as hEcom. i = the value of , and using the relations: C ;1 1b)T = 1bT (C ;1 )T = 1bT C ;1 and 1bT C ;1 1b = ( (C is symmetrical) then the minimum error is hEcom. i = T fC ; gno (14.31) 1 C and then, by replacing M X M X fC ; gmn 1 m=1 n=1 PM PM fC; gmn 1 1 m=1 n=1 As the weighted committee is similar to the simple committee, the same criteria apply to prove that hEcom. i 6 Eav. ✍ Remarks: ➥ The coecients found in (14.31) are not guaranteed to be positive so this have to be enforced by other means. However if all m are positive then from 8 m > 0 and M P m=1 m = 1 ) 8 m 2 [0; 1] (worst case when all coecients are zero except one which is 1). ➧ 14.7 Mixture Of Experts The mixture of experts model divides the input space in sub-areas, using a separate, specially trained, network for each sub-space | the expert | and a supplementary network to act as a gateway, deciding what expert will be allowed to generate the nal output. See gure 14.5 on the following page. The error function is build from the negative logarithm of the likelihood function, considering a mixture model18 of M Gaussian distributions. The number of training vectors is P . E=; P X p=1 ln " M X m=1 m (xp ) pm (tp jxp ) # (14.32) where the pm(tjx) components are Gaussians of the form: pm(tjx) = 1 exp (2 )N=2  ; kt ; 2m (x)k 2  (N being the dimensionality of x). The m (x) means are functions of input and the See [Bis95] pp. 369{371. See chapter \Pattern Recognition" and also chapter \Error Functions" regarding the modeling of conditional distributions. 14.7 18 ❖ pm (tjx) ❖ m (x) 272 CHAPTER 14. LEARNING OPTIMIZATION input network expert 1 network expert M gateway network output Figure 14.5: The mixture of experts model. Each \expert" network is specialized in one input sub-space. The gateway network decide what expert will be allowed to give the nal output (by blocking all others). There are M \experts" and the gateway have one output for each \expert". covariance is set to unity. Each \expert" will represent an Gaussian and will output the m (x). The m coecients are generated trough a softmax function from the outputs m of the gateway: exp( P m= M n=1 m) exp( n) The mixture of experts is trained simultaneously, including the gateway, by adjusting the weights such that the error (14.32) is minimized. ✍ Remarks: ➥ ➧ 14.8 14.8.1 The model presented here may be extended to a level where each expert becomes an embedded mixture of experts. Other Training Techniques Cross-validation Often, in practice, several models are build and then the eciency of each is estimated, e.g. generalization capability using a validation set, in order to select the best one. 14.8 See [Bis95] pp. 371{377 and [Rip96] pg. 41. 14.8. OTHER TRAINING TECHNIQUES 273 x N N M (0)1 y1 (0) M y N (1) y Figure 14.6: The stacked generalization method. 
➧ 14.7 Mixture Of Experts

(See [Bis95] pp. 369–371.)

The mixture of experts model divides the input space into sub-areas, using a separate, specially trained network for each sub-space (the expert) and a supplementary network acting as a gateway, deciding which expert will be allowed to generate the final output. See figure 14.5.

Figure 14.5: The mixture of experts model. Each "expert" network is specialized in one input sub-space. The gateway network decides which expert will be allowed to give the final output (by blocking all others). There are $M$ "experts" and the gateway has one output for each "expert".

The error function is built from the negative logarithm of the likelihood function, considering a mixture model of $M$ Gaussian distributions (see chapter "Pattern Recognition" and also chapter "Error Functions" regarding the modeling of conditional distributions). The number of training vectors being $P$:
\[ E = -\sum_{p=1}^P \ln\left[\sum_{m=1}^M \alpha_m(x_p)\,p_m(t_p|x_p)\right] \tag{14.32} \]
where the $p_m(t|x)$ components are Gaussians of the form:
\[ p_m(t|x) = \frac{1}{(2\pi)^{N/2}}\exp\left(-\frac{\|t-\mu_m(x)\|^2}{2}\right) \]
($N$ being the dimensionality of $t$). The means $\mu_m(x)$ are functions of the input and the covariance is set to unity. Each "expert" represents one Gaussian and outputs the corresponding $\mu_m(x)$. The $\alpha_m$ coefficients are generated through a softmax function from the outputs $\gamma_m$ of the gateway:
\[ \alpha_m = \frac{\exp(\gamma_m)}{\sum\limits_{n=1}^M \exp(\gamma_n)} \]
The mixture of experts is trained simultaneously, gateway included, by adjusting the weights such that the error (14.32) is minimized.

✍ Remarks:
➥ The model presented here may be extended to a level where each expert becomes itself an embedded mixture of experts.
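The combination rule may be sketched as below (Scilab; the names are illustrative assumptions: gam holds the $M$ gateway outputs $\gamma_m$ for a given input, and the columns of Mu the corresponding expert outputs $\mu_m(x)$); the returned y is the conditional average of the mixture appearing in (14.32):

function [y, alpha] = moe_output(gam, Mu)
  e = exp(gam - max(gam));   // softmax, shifted for numerical stability
  alpha = e / sum(e);        // gating coefficients alpha_m, sum to 1
  y = Mu * alpha;            // mixture mean for the given input
endfunction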
➧ 14.8 Other Training Techniques

(See [Bis95] pp. 371–377 and [Rip96] pg. 41.)

14.8.1 Cross-validation

Often, in practice, several models are built and then the efficiency of each is estimated, e.g. the generalization capability measured on a validation set, in order to select the best one. However, sometimes the available data is too scarce to afford putting aside a set for validation. In this case the cross-validation technique may be used: the training set is divided into $S$ subsets; the model under consideration is trained using $S-1$ subsets and the one left out is used as a validation set; there are $S$ such combinations. The efficiency of the model is then calculated by averaging over all $S$ training/validation results.
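A skeleton of the procedure in Scilab might be as follows (train_fn and err_fn are hypothetical placeholders for the training routine of the model under consideration and for its error measured on the validation subset):

function e = cross_validate(X, T, S, train_fn, err_fn)
  P = size(X, 1);  idx = 1:P;  e = 0;
  for s = 1:S
    val = idx(modulo(idx, S) == s-1);    // the s-th validation subset
    trn = idx(modulo(idx, S) <> s-1);    // the remaining S-1 subsets
    W = train_fn(X(trn, :), T(trn, :));
    e = e + err_fn(W, X(val, :), T(val, :));
  end
  e = e / S;   // efficiency averaged over the S training/validation runs
endfunction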
14.8.2 Stacked Generalization

This method is also applicable when the quantity of available data is small and it is desirable to keep all the "good parts" of various models. Considering that the number of available training patterns is $P$, a set of $M$ level-0 networks $\mathcal{N}_{(0)1},\dots,\mathcal{N}_{(0)M}$ is trained using only $P-1$ training patterns. See figure 14.6.

Figure 14.6: The stacked generalization method. The networks $\mathcal{N}_{(0)1},\dots,\mathcal{N}_{(0)M}$ are trained using $P-1$ vectors from the training set; then the network $\mathcal{N}_{(1)}$ is used to assess the generalization capability of the level-0 networks.

The left-out pattern vector is run through the set of networks $\mathcal{N}_{(0)1},\dots,\mathcal{N}_{(0)M}$; this gives rise to a new pattern (for the next network level) of the form $\{y_1,\dots,y_M\}$. The whole procedure is repeated in turn for each of the $P$ patterns, giving rise to a new set of $P$ vectors. This new set is used to train a second-level network $\mathcal{N}_{(1)}$, using as target the desired output $y$. Obviously, $\mathcal{N}_{(1)}$ assesses the generalization capability of the networks $\mathcal{N}_{(0)1},\dots,\mathcal{N}_{(0)M}$. Finally, the $\mathcal{N}_{(0)1},\dots,\mathcal{N}_{(0)M}$ are trained using all $P$ training patterns.

14.8.3 Complexity Criteria

Complexity criteria are measures of the generalization and complexity of models. The prediction error is defined as the sum between the training error and a measure of the complexity of the model:
\[ \text{prediction error} = \text{training error} + \text{complexity measure} \]
where the complexity measure may be, e.g., the number of weights. For small networks the training error will be large and the complexity measure low; for large networks the training error will be low and the complexity measure high. Thus, finding the minimum of the prediction error helps finding the optimal tradeoff between model complexity and generalization capability. The prediction error, for a sum-of-squares error function, is defined as:
\[ \text{prediction error} = \frac{2E}{P} + \frac{2N_W}{P}\,\sigma^2 \]
where $E$ is the error given by the training set after learning was completed, $P$ is the number of training patterns, $N_W$ is the total number of weights and $\sigma^2$ is the variance of the noise in the data, to be estimated. Another way of defining the prediction error, applicable for non-linear models and regularized error functions, is:
\[ \text{prediction error} = \frac{2E}{P} + \frac{2\gamma}{P}\,\sigma^2 \]
where $\gamma$ is named the effective number of parameters and is defined as:
\[ \gamma = \sum_{i=1}^{N_W} \frac{\lambda_i}{\lambda_i+\alpha} \]
$\lambda_i$ being the eigenvalues of the Hessian and $\alpha$ being the multiplicative parameter of the regularization term.
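The effective number of parameters is straightforward to compute once the eigenvalues of the Hessian are available; a minimal Scilab sketch (the function name is illustrative):

function g = effective_params(H, alpha)
  lam = spec(H);                   // eigenvalues lambda_i of the Hessian
  g = sum(lam ./ (lam + alpha));   // gamma = sum_i lambda_i/(lambda_i+alpha)
endfunction

The same quantity $\gamma$ reappears in the Bayesian evidence framework of chapter 15.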
14.8.4 Model For Mixed Discrete And Continuous Data

It may happen that the input pattern vector $x$ has a discrete component alongside continuous ones, i.e. it is of the form $x = (a\ x_1\ \dots\ x_N)^T$, where $a$ takes discrete values and the $\{x_i\}$ continuous ones. In this case, one way of modeling the distribution $p(x)$ is to try to find a conditional Gaussian distribution of the form:
\[ p(x) = \frac{p_\alpha(a)}{(2\pi)^{N/2}\,|\Sigma|^{1/2}}\exp\left[-\frac12\,(\widehat{x}-\mu_a)^T\Sigma^{-1}(\widehat{x}-\mu_a)\right] \]
where $p_\alpha(a)$ is the probability of $a$ taking the value $\alpha$, $\widehat{x}^T = (x_1\ \dots\ x_N)$ is the continuous part of the input vector, and $\mu_a$ and $\Sigma$ are the means and, respectively, the covariance matrix (which may be $a$-dependent).

CHAPTER 15

Bayesian Techniques

➧ 15.1 Bayesian Learning

(See [Bis95] pp. 385–398.)

15.1.1 Weight Distribution

Let us consider a given network, i.e. the number of layers, the number of neurons and the activation functions of the neurons are all known and fixed. Let $p(W)$ be the prior probability density of the weights, $N_W$ the total number of weights, $W^T = (w_1\ \dots\ w_{N_W})$ the weight vector, $P$ the number of training patterns and $T = \{t_1,\dots,t_P\}$ the training set of targets. The posterior probability density of the weights $p(W|T)$ is (using the Bayes theorem):
\[ p(W|T) = \frac{p(T|W)\,p(W)}{p(T)} \tag{15.1} \]
where $p(T)$ represents a normalization factor which ensures that $p(W|T)$ integrated over the whole weight space equals unity, thus $p(T) = \int_W p(T|W)\,p(W)\,dW$.

✍ Remarks:
➥ As the training set consists of inputs as well, the probability densities in (15.1) should also be conditioned on the inputs: $p(W|T,X) = \frac{p(T|W,X)\,p(W|X)}{p(T|X)}$. However, the networks do not model the probability density $p(x)$ of the inputs, so $X$ appears always and only as a conditioning factor, and it will be omitted for brevity.

Bayesian learning involves the following steps:
• Some prior probability density $p(W)$ is established for the weights. This will have a rather large breadth, as little is known at this stage.
• The probability of the targets $p(T|W)$ is established.
• Using the Bayes theorem (15.1), the posterior conditioned probability density $p(W|T)$ is found.

15.1.2 Gaussian Prior Weight Distribution

As explained in the previous section, the prior probability density $p(W)$ should be defined in a form which fixes some characteristics of the model but, on the other side, leaves enough freedom for the weights. Let us consider an exponential form:
\[ p(W) = \frac{1}{Z_W(\alpha)}\exp(-\alpha E_W) \tag{15.2} \]
where $E_W$ is a function of the weights and $Z_W$ is the normalization factor:
\[ Z_W(\alpha) = \int_W \exp(-\alpha E_W)\,dW \tag{15.3} \]
such that $\int_W p(W)\,dW = 1$. One possible choice for $E_W$ is:
\[ E_W = \frac12\,\|W\|^2 = \frac12\sum_{i=1}^{N_W} w_i^2 \tag{15.4} \]
which encourages small weights, since for large $\|W\|$, $E_W$ is large and consequently $p(W)$ is small, thus such a $W$ has an unlikely value. From (15.3) and (15.4) (see also the mathematical appendix, Gaussian integrals):
\[ Z_W = \int_{-\infty}^{\infty}\!\!\cdots\!\int_{-\infty}^{\infty} \exp\left(-\frac{\alpha}{2}\sum_{i=1}^{N_W} w_i^2\right) dw_1\dots dw_{N_W} = \left(\frac{2\pi}{\alpha}\right)^{N_W/2} \tag{15.5} \]
\[ p(W) = \left(\frac{\alpha}{2\pi}\right)^{N_W/2}\exp\left(-\frac{\alpha}{2}\,\|W\|^2\right) \tag{15.6} \]
15.1.3 Application: Simple Classifier

Let us consider a neuron with two inputs $x_1$ and $x_2$ and one output $y$. The neuron classifies the input vector $x^T = (x_1\ x_2)$ into two classes $\mathcal{C}_1$ and $\mathcal{C}_2$. The weights are $w_1$ and $w_2$, i.e. the vector $W^T = (w_1\ w_2)$, for the inputs $x_1$ respectively $x_2$. See figure 15.1-a. The output $y$ represents the probability of $x\in\mathcal{C}_1$ and $1-y$ the probability of $x\in\mathcal{C}_2$ (see chapter "Single Layer Neural Networks").

Figure 15.1: a) The simple classifier: one neuron with two inputs and one output. b) The training data for the classifier: 4 pattern vectors. The dashed line represents the class decision boundary; $x_3$ and $x_4$ are exceptions.

Let us consider that there are 4 training patterns:
\[ x_1 = \begin{pmatrix}5\\5\end{pmatrix}\in\mathcal{C}_2\ ,\quad x_2 = \begin{pmatrix}-5\\-5\end{pmatrix}\in\mathcal{C}_1\ ,\quad x_3 = \begin{pmatrix}0\\1\end{pmatrix}\in\mathcal{C}_1\ ,\quad x_4 = \begin{pmatrix}0\\-1\end{pmatrix}\in\mathcal{C}_2 \]
The reason for this choice is the following: $x_1$ and $x_2$ are good examples of their respective classes, while $x_3$ and $x_4$ are exceptions. It is not reasonable to expect a correct classification of them from one single neuron, as the required decision boundary would not be convex (the problem is not linearly separable; see chapter "Single Layer Neural Networks"). See figure 15.1-b. However, $x_3$ and $x_4$ do carry some information: together with $x_1$ and $x_2$ they suggest that the decision boundary is rather as depicted in figure 15.1; had they been absent, the decision of where to draw the "line" would have been more difficult (lower probability for the one chosen).

The neuronal activation function is chosen as the sigmoidal function:
\[ y = \frac{1}{1+\exp(-W^Tx)} = \frac{1}{1+\exp(-w_1x_1-w_2x_2)} \]
As the probability of $x\in\mathcal{C}_1$ is $y$ and the probability of $x\in\mathcal{C}_2$ is $1-y$, the probability of the targets, conditioned on the weights, is:
\[ p(T|W) = \prod_{x_p\in\mathcal{C}_1} y(x_p)\prod_{x_p\in\mathcal{C}_2}[1-y(x_p)] = [1-y(x_1)]\,y(x_2)\,y(x_3)\,[1-y(x_4)] \]
The prior probability density for the weights is chosen as a Gaussian of type (15.6) with $\alpha=1$:
\[ p(W) = p(w_1,w_2) = \frac{1}{2\pi}\exp\left(-\frac{w_1^2+w_2^2}{2}\right) \]
and the graphic of this function is drawn in figure 15.2, a) and b).

Figure 15.2: The probability density for the weights: figures a) and b) show the prior probability $p(W)$; figures c) and d) show the posterior probability $p(W|T)$.

The normalization factor $p(T) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} p(T|W)\,p(W)\,dw_1\,dw_2$ may be calculated numerically. Finally, the posterior probability of the weights is $p(W|T) = \frac{p(T|W)\,p(W)}{p(T)}$ and its graphic is depicted in figures 15.2-c and 15.2-d. The best weights correspond to the maximum of $p(W|T)$, which occurs at $w_1\simeq -0.3$ and $w_2\simeq -0.3$.

✍ Remarks:
➥ Before any data is available, the best weights correspond to the maximum of the prior probability, which occurs at $w_1 = 0$ and $w_2 = 0$. This gives $y(x) = 0.5$, $\forall x$, i.e. there is an equal probability of $x$ belonging to either class, the result reflecting the absence of any data on which to base the decision.
➥ After the data becomes available, the weights are shifted to $w_1 = -0.3$ and $w_2 = -0.3$, which gives $y(x_1)\simeq 0.0475$, $y(x_2)\simeq 0.9525$, $y(x_3)\simeq 0.4255$, $y(x_4)\simeq 0.5744$. The patterns $x_3$ and $x_4$ are misclassified (it should have been $y(x_3\in\mathcal{C}_1) > 0.5$ and $y(x_4\in\mathcal{C}_2) < 0.5$), but this was to be expected, and these patterns still carry some useful information (they are used to reinforce the established decision boundary).
➥ In general, the prior probability $p(W)$ is wide and has a low maximum, while the posterior probability is narrow and has a high peak (or peaks); this may also be seen in figure 15.2.
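Since the posterior has only two dimensions here, the numerical computation mentioned above is easy to reproduce. The following Scilab sketch (illustrative, assuming the data and prior of this section) evaluates the unnormalized posterior $p(T|W)\,p(W)$ on a grid and locates its maximum:

X = [5 5; -5 -5; 0 1; 0 -1];   // the 4 training patterns
t = [0; 1; 1; 0];              // targets: 1 <=> class C1
w = -3:0.05:3;  n = length(w);  post = zeros(n, n);
for i = 1:n
  for j = 1:n
    L = 1;
    for p = 1:4
      y = 1/(1 + exp(-w(i)*X(p,1) - w(j)*X(p,2)));  // sigmoidal output
      L = L * y^t(p) * (1-y)^(1-t(p));              // likelihood p(T|W)
    end
    post(i,j) = L * exp(-(w(i)^2 + w(j)^2)/2)/(2*%pi);  // times the prior
  end
end
[m, k] = max(post);          // k holds the indices of the most probable weights
disp([w(k(1)), w(k(2))]);    // approx. (-0.3, -0.3), as quoted in the text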
15.1.4 Gaussian Noise Model

In general, the likelihood function, i.e. $p(T|W)$, may be written in exponential form as:
\[ p(T|W) = \frac{1}{Z_T(\beta)}\exp(-\beta E_T) \tag{15.7} \]
where $E_T$ is the error function, $\beta$ is a parameter and $Z_T(\beta)$ is the normalization factor which ensures that $p(T|W)$ is normalized:
\[ Z_T(\beta) = \int_Y \exp(-\beta E_T)\,dt_1\dots dt_P \tag{15.8} \]
Assuming a sum-of-squares error function (see chapter "Error Functions"), and that the targets are generated from a smooth function to which zero-mean noise has been added, then:
\[ p(t|x,W) \propto \exp\left(-\frac{\beta}{2}\,[y(x,W)-t]^2\right) \tag{15.9} \]
and thus:
\[ p(T|W) = \prod_{p=1}^P p(t_p|x_p,W) = \frac{1}{Z_T(\beta)}\exp\left(-\frac{\beta}{2}\sum_{p=1}^P [y(x_p,W)-t_p]^2\right) \]
i.e.
\[ E_T = \frac12\sum_{p=1}^P [y(x_p,W)-t_p]^2 \tag{15.10} \]
and it becomes evident that $\beta$ controls the variance: $\sigma^2 = 1/\beta$. By replacing $E_T$ into (15.8) and integrating (see the mathematical appendix regarding Gaussian integrals):
\[ Z_T(\beta) = \left(\frac{2\pi}{\beta}\right)^{P/2} \tag{15.11} \]

15.1.5 Gaussian Posterior Weight Distribution

From (15.1), (15.2) and (15.7), and defining $Z_S$ as the normalization factor:
\[ p(W|T) = \frac{1}{Z_S}\exp(-\alpha E_W - \beta E_T) = \frac{1}{Z_S}\exp[-S(W)] \tag{15.12} \]
where $S(W) = \alpha E_W + \beta E_T$ and $Z_S(\alpha,\beta) = \int_W \exp[-S(W)]\,dW$. Considering the expression (15.10) of $E_T$ and (15.4) of $E_W$, then:
\[ S(W) = \frac{\beta}{2}\sum_{p=1}^P [y(x_p,W)-t_p]^2 + \frac{\alpha}{2}\sum_{i=1}^{N_W} w_i^2 \]
which represents a sum-of-squares error function with a weight-decay regularization term (see chapter "Learning Optimization"). Since an overall multiplicative term does not count, the multiplicative constant of the regularization term is $\alpha/\beta$. For the most probable weight vector $W^*$, for which $p(W|T)$ is maximum, $S$ is minimum and thus the regularized error is minimum.

15.1.6 Consistent Prior Weight Distribution

The plain weight decay, as the one from (15.4), is not consistent with linear transformations (see chapter "Learning Optimization", also for the form of the changed weight-decay expression). However, for two layers, the simple weight decay may be changed to the form:
\[ E_W = \frac{\alpha_1}{2}\sum_{\text{hidden layer}} w^2 + \frac{\alpha_2}{2}\sum_{\text{output layer}} w^2 \]

15.1.7 Approximation Of Weight Distribution

In order to simplify the computational process of finding the (maximum of the) posterior probability, the function $S(W)$ may be developed in series around $W^*$ and only the most significant terms retained:
\[ S(W) = S(W^*) + \frac12\,(W-W^*)^T H_S\,(W-W^*) + O[(W-W^*)^3] \simeq S(W^*) + \frac12\,(W-W^*)^T H_S\,(W-W^*) \]
(the term linear in $W-W^*$ is zero due to the fact that the series development is done around the minimum), where $H_S$ is the Hessian of the regularized error function. Considering the gradient as the vectorial operator $\nabla^T = \left(\frac{\partial}{\partial w_1}\ \dots\ \frac{\partial}{\partial w_{N_W}}\right)$ and a weight decay $E_W$, the Hessian is (the method of calculating the Hessian $H$ of the error function is discussed in chapter "Multi Layer Neural Networks"):
\[ H_S = (\nabla\nabla^T)\,S(W^*) = \beta\,(\nabla\nabla^T)E_T(W^*) + \alpha I = \beta H + \alpha I \]
The posterior distribution becomes:
\[ p(W|T) = \frac{1}{Z_S^*}\exp\left[-S(W^*) - \frac12\,(W-W^*)^T H_S\,(W-W^*)\right] \tag{15.13} \]
where $Z_S^*$ is the normalization coefficient for the distribution (15.13), and then (see also the mathematical appendix):
\[ Z_S^* = \exp[-S(W^*)]\int_W \exp\left[-\frac12\,(W-W^*)^T H_S\,(W-W^*)\right] dW = \exp[-S(W^*)]\,(2\pi)^{N_W/2}\,|H_S|^{-1/2} \tag{15.14} \]
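In this approximation all quantities of interest follow from $S(W^*)$ and $H_S$. For instance, the logarithm of the normalization factor (15.14), needed later for the evidence of section 15.4, may be computed as below (a Scilab sketch, assuming a well-conditioned determinant; the function name is illustrative):

function lz = log_ZS(S_star, HS)
  NW = size(HS, 1);
  lz = -S_star + NW/2*log(2*%pi) - 0.5*log(det(HS));   // ln Z_S*, eq. (15.14)
endfunction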
s2 = g T HS g Using (15.21) and (15.18):  ;S (W  ) Z Z 1 hai = e Z  a D (a ; a ; T  ) exp ;  T HS  2 S AW   ; S (W  ) Z  T (a +  ) exp ; 12  T HS  d = e Z S W g W g W W s2 = Z Z W (and d R ; T 2 d da 1 2 T H  d d ; S (W  ) g W) exp  W W T (a ; a )2 D (a ; a ; gT W) e Z  S AW ; S (W  ) Z = e Z ( S W W W g W exp ; W S W W = 0 because W W = (W)) and the integral is done over a origin centered Replacing (15.19), considering that a = const. and the integrand is odd function13 in  interval, then hai = a . 2. The variance is: W   exp ; 21 WT HS W dW ; 12 WT HS W   W d u u Let ui be the eigenvectors and i the eigenvalues of HS , i.e. HS i = i i . As HS is symmetrical, then it is possible14 to build an orthogonal system of eigenvectors. Let consider: W = X wi ui and g= X u gi i i i and by replacing into the above expression of variance: !2   Z X 1 i w2  d e;S (W ) 2 exp ; gi wi s = i  ZS 2 W i W 13 14 An function f is odd when f (;x) = ;f (;x). See mathematical appendix. 15.4. THE EVIDENCE APPROXIMATION FOR AND 287 Developing the square of the sum, the variance becomes a sum of two types of integrals. The rst one,8i 6= j , is of the form   Z1 Z1 ;1 ;1 gi wi gj wj exp ; 12 (i wi2 + j wj2 ) d(wi ) d(wj ) = 2 Z1 3 2 Z1 3  1  1   2 2 4 5 4 gi wi exp ; i wi d(wi ) = gj wj exp ; j wj d(wj )5 = 0 2 2 ;1 ;1 because the integrand are odd function and the integral is done over origin centered interval. The second one is for i = j : s  Z1 2 2 (gi wi )2 exp ; i 2wi ;1 g d(wi ) = i i As the HS matrix may be diagonalized using the eigenvalues then s2 = 2 i NP W 1 g2 i i=1 i = g T HS g . The posterior probability P (C1 jx; T ) may be written in terms of total neuronal input a as: P (C1 jx; T ) = Z A P (C1 ja) p(ajx; T ) da = Z A f (a) p(ajx; T ) da which does not have generally an analytic solution but may be approximated by: P (C1 jx; T ) ' f (a (s)) where (s) = ✍ 1+  1 s2 ; 2 8 Remarks: ➥ ➧ 15.4  For the simple, two class problem described above, the decision boundary is established for P (C1 jx; T ) = 0:5 which corresponds to a = 0. The same result is obtained using just the most probable weights W  and just the network output: y(x; W  ) = 0:5 ) a = 0. Thus the two methods give the same results unless there are some more complex rules involved, e.g. a loss matrix. The Evidence Approximation For And As the parameters and which do appear in the expression of posterior weight distribution | see equation (15.12) | are them-self not known, then the Bayesian framework require an integration over all possible values: p(W jT ) = ZZ p(W j ; ; T ) p( ; jT ) d d (15.23) One possible way of dealing with and parameters is known as the evidence approximation and is discussed below. The posterior distribution of and , i.e. p( ; jT ), it is assumed to have a \sharp peak" 15.4 See [Bis95] pp. 406{415. evidence approximation ❖ p ( ; jT ) ,  ,  288 CHAPTER 15. BAYESIAN TECHNIQUES   . Then p(  ;  jT ) . 1 and , using also around the most probable values RR  and  the normalization condition p( ; jT ) d d = 1, the distribution (15.23) may be approximated as: p(W jT ) ' p(W j  ;  ; T ) hyperprior evidence ZZ p ( ; jT ) d d = p(W j  ;  ; T ) i.e. the integral over all possible values of and is replaced with the use of the most probable values:  and  . To nd the most probable  and  values, it is necessary to estimate the posterior probability for and . This is achieved by using the Bayes theorem p( ; jT ) = p(T j p; (T) p) ( ; ) where p( ; ) is the prior probability density, also known as the hyperprior . 
➧ 15.3 Classification

(See [Bis95] pp. 403–406.)

Let us consider a classification problem with two classes $\mathcal{C}_1$ and $\mathcal{C}_2$ and a network having only one output $y$, representing the probability of the input vector $x$ being of class $\mathcal{C}_1$; then $1-y$ represents the probability of $x$ being of class $\mathcal{C}_2$. Obviously, the targets are $t\in\{0,1\}$ and the network output is $y\in[0,1]$. The likelihood function for observing the whole training set is (see chapter "Error Functions", section "Cross entropy"):
\[ p(T|W) = \prod_{p=1}^P [y(x_p)]^{t_p}\,[1-y(x_p)]^{1-t_p} = \exp[-G(T|W)] \tag{15.17} \]
where $G(T|W)$ is the cross-entropy:
\[ G(T|W) = -\sum_{p=1}^P \{t_p\ln y(x_p) + (1-t_p)\ln[1-y(x_p)]\} \]

✍ Remarks:
➥ The normalization condition for the distribution (15.17) is $\sum_{t_p\in\{0,1\}} p(T|W) = 1$. After the replacement of $p(T|W)$ from (15.17), the result is a product of terms of the form $y(x_p) + [1-y(x_p)] = 1$, i.e. the distribution $p(T|W)$ is normalized.

The neuronal activation function is taken as the sigmoidal $y(x) = f(a) = \frac{1}{1+\exp(-a)}$, where $a = \sum_j w_jz_j$ is the total input, $w_j$ being the weights of the connections from the hidden neurons $z_j$ coming into the output neuron $y$. Considering a prior distribution (15.2), the posterior distribution, from (15.1) and similar to (15.12), will be:
\[ p(W|T) = \frac{1}{Z_S}\exp(-G-\alpha E_W) = \frac{1}{Z_S}\exp[-S(W)] \]
where $Z_S$ is the normalization coefficient. Considering a quadratic approximation as in (15.13), the distribution may be approximated as:
\[ p(W|T) = \frac{1}{Z_S^*}\exp\left[-S(W^*) - \frac12\,\Delta W^TH_S\,\Delta W\right] \tag{15.18} \]
$Z_S^*$ being the normalization coefficient; $\int_W p(W|T)\,dW = 1$ gives (see also the mathematical appendix):
\[ Z_S^* = e^{-S(W^*)}\prod_{i=1}^{N_W}\sqrt{\frac{2\pi}{\lambda_i}} \tag{15.19} \]
($\lambda_i$ being the eigenvalues of $H_S$). As the network output is interpreted as a probability, i.e. $y(x,W) = P(\mathcal{C}_1|x,W)$, then for a new vector $x$ Bayesian learning involves an integration over all possible weights, in the form:
\[ P(\mathcal{C}_1|x,T) = \int_W P(\mathcal{C}_1|x,W)\,p(W|T)\,dW = \int_W y(x,W)\,p(W|T)\,dW \]
Assuming a linear approximation in the weights for the total input $a$:
\[ a(x,W) = a^*(x) + g^T\Delta W \tag{15.20} \]
where $a^*(x)\equiv a(x,W^*)$ and $g(x)\equiv\nabla a(x,W)|_{W^*}$, the posterior probability of $a$ may be written as:
\[ p(a|x,T) = \int_W p(a|x,W)\,p(W|T)\,dW = \int_W \delta(a-a^*-g^T\Delta W)\,p(W|T)\,dW \tag{15.21} \]
($\delta$ being the Dirac function). Since, from (15.20), $a$ and $\Delta W$ are linearly related, and $p(W|T)$ is a Gaussian, the distribution of $a$ is also a Gaussian, of the form:
\[ p(a|x,T) = \frac{1}{\sqrt{2\pi s^2}}\exp\left(-\frac{(a-\langle a\rangle)^2}{2s^2}\right) \tag{15.22} \]
with the mean $\langle a\rangle$ and variance $s^2$ being:
\[ \langle a\rangle = a^* \quad\text{and}\quad s^2 = g^TH_S^{-1}g \]
Proof. 1. Using (15.21) and (15.18):
\[ \langle a\rangle = \frac{e^{-S(W^*)}}{Z_S^*}\int_A a\int_W \delta(a-a^*-g^T\Delta W)\exp\left(-\frac12\,\Delta W^TH_S\,\Delta W\right) dW\,da = \frac{e^{-S(W^*)}}{Z_S^*}\int_W (a^*+g^T\Delta W)\exp\left(-\frac12\,\Delta W^TH_S\,\Delta W\right) dW \]
(and $dW = d(\Delta W)$ because $\Delta W = W - W^*$). Replacing (15.19), considering that $a^* = $ const. and that the remaining integrand is an odd function of $\Delta W$ (a function $f$ is odd when $f(-x) = -f(x)$) integrated over an origin-centered interval, then $\langle a\rangle = a^*$.
2. The variance is:
\[ s^2 = \frac{e^{-S(W^*)}}{Z_S^*}\int_W (g^T\Delta W)^2\exp\left(-\frac12\,\Delta W^TH_S\,\Delta W\right) dW \]
Let $u_i$ be the eigenvectors and $\lambda_i$ the eigenvalues of $H_S$, i.e. $H_Su_i = \lambda_iu_i$. As $H_S$ is symmetrical, it is possible to build an orthogonal system of eigenvectors (see the mathematical appendix). Let us write $\Delta W = \sum_i \Delta w_i\,u_i$ and $g = \sum_i g_i\,u_i$; replacing into the above expression of the variance:
\[ s^2 = \frac{e^{-S(W^*)}}{Z_S^*}\int_W \left(\sum_i g_i\,\Delta w_i\right)^2\exp\left(-\frac12\sum_i \lambda_i\,\Delta w_i^2\right) dW \]
Developing the square of the sum, the variance becomes a sum of two types of integrals. The first one, for $i\neq j$, is of the form:
\[ \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} g_i\Delta w_i\,g_j\Delta w_j\exp\left[-\frac12(\lambda_i\Delta w_i^2+\lambda_j\Delta w_j^2)\right] d(\Delta w_i)\,d(\Delta w_j) = \left[\int_{-\infty}^{\infty} g_i\Delta w_i\,e^{-\lambda_i\Delta w_i^2/2}\,d(\Delta w_i)\right]\left[\int_{-\infty}^{\infty} g_j\Delta w_j\,e^{-\lambda_j\Delta w_j^2/2}\,d(\Delta w_j)\right] = 0 \]
because the integrands are odd functions and the integrals are done over origin-centered intervals. The second type is for $i = j$:
\[ \int_{-\infty}^{\infty} (g_i\Delta w_i)^2\exp\left(-\frac{\lambda_i\Delta w_i^2}{2}\right) d(\Delta w_i) = \frac{g_i^2}{\lambda_i}\sqrt{\frac{2\pi}{\lambda_i}} \]
As the $H_S$ matrix may be diagonalized using its eigenvalues, then $s^2 = \sum_{i=1}^{N_W} \frac{g_i^2}{\lambda_i} = g^TH_S^{-1}g$. ∎

The posterior probability $P(\mathcal{C}_1|x,T)$ may be written in terms of the total neuronal input $a$ as:
\[ P(\mathcal{C}_1|x,T) = \int_A P(\mathcal{C}_1|a)\,p(a|x,T)\,da = \int_A f(a)\,p(a|x,T)\,da \]
which generally does not have an analytic solution, but may be approximated by:
\[ P(\mathcal{C}_1|x,T) \simeq f(\kappa(s)\,a^*) \quad\text{where}\quad \kappa(s) = \left(1+\frac{\pi s^2}{8}\right)^{-1/2} \]

✍ Remarks:
➥ For the simple two-class problem described above, the decision boundary is established by $P(\mathcal{C}_1|x,T) = 0.5$, which corresponds to $a^* = 0$. The same result is obtained using just the most probable weights $W^*$ and just the network output: $y(x,W^*) = 0.5 \Rightarrow a^* = 0$. Thus the two methods give the same results, unless some more complex rules are involved, e.g. a loss matrix.
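Returning to the approximation $P(\mathcal{C}_1|x,T)\simeq f(\kappa(s)\,a^*)$: it is easy to apply once $a^*$, $g$ and $H_S$ are known. A minimal Scilab sketch of this "moderated" output (names illustrative):

function P1 = moderated_output(a_star, g, HS)
  s2 = g'*inv(HS)*g;                 // variance s^2 of a, eq. (15.22)
  kappa = 1/sqrt(1 + %pi*s2/8);      // kappa(s)
  P1 = 1/(1 + exp(-kappa*a_star));   // P(C1|x,T) ~ f(kappa(s) a*)
endfunction

Note that, since $\kappa(s)\in(0,1]$, the moderated output is always pulled toward 0.5 relative to $y(x,W^*)$, without moving the $a^* = 0$ decision boundary.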
➧ 15.4 The Evidence Approximation For α And β

(See [Bis95] pp. 406–415.)

As the parameters $\alpha$ and $\beta$ which appear in the expression of the posterior weight distribution (see equation (15.12)) are themselves not known, the Bayesian framework requires an integration over all their possible values:
\[ p(W|T) = \iint p(W|\alpha,\beta,T)\,p(\alpha,\beta|T)\,d\alpha\,d\beta \tag{15.23} \]
One possible way of dealing with the $\alpha$ and $\beta$ parameters, known as the evidence approximation, is discussed below. The posterior distribution of $\alpha$ and $\beta$, i.e. $p(\alpha,\beta|T)$, is assumed to have a "sharp peak" around the most probable values $\alpha^*$ and $\beta^*$. Then $p(W|\alpha,\beta,T)\simeq p(W|\alpha^*,\beta^*,T)$ around the peak and, using also the normalization condition $\iint p(\alpha,\beta|T)\,d\alpha\,d\beta = 1$, the distribution (15.23) may be approximated as:
\[ p(W|T) \simeq p(W|\alpha^*,\beta^*,T)\iint p(\alpha,\beta|T)\,d\alpha\,d\beta = p(W|\alpha^*,\beta^*,T) \]
i.e. the integral over all possible values of $\alpha$ and $\beta$ is replaced by the use of the most probable values $\alpha^*$ and $\beta^*$.

To find the most probable values $\alpha^*$ and $\beta^*$, it is necessary to estimate the posterior probability of $\alpha$ and $\beta$. This is achieved by using the Bayes theorem:
\[ p(\alpha,\beta|T) = \frac{p(T|\alpha,\beta)\,p(\alpha,\beta)}{p(T)} \]
where $p(\alpha,\beta)$ is the prior probability density, also known as the hyperprior. If little is known about the model to be built, then the hyperprior should give a relatively equal value to all possible $\alpha$ and $\beta$ parameters, the same way the $p(W)$ prior operates. The term $p(T|\alpha,\beta)$ is named the evidence; $p(T)$ is the normalization factor. The evidence may be written in a form with explicit dependencies on $\alpha$ and $\beta$ in the posterior distributions of targets and weights:
\[ p(T|\alpha,\beta) = \int_W p(T|W,\alpha,\beta)\,p(W|\alpha,\beta)\,dW = \int_W p(T|W,\beta)\,p(W|\alpha)\,dW \]
as the prior weight distribution is independent of $\beta$ (which is data related), $p(W|\alpha,\beta) = p(W|\alpha)$, and the likelihood function is independent of $\alpha$, $p(T|W,\alpha,\beta) = p(T|W,\beta)$; see equations (15.2) and (15.7). From the same set of equations, (15.2) and (15.7):
\[ p(T|\alpha,\beta) = \frac{1}{Z_W(\alpha)\,Z_T(\beta)}\int_W \exp(-S(W))\,dW = \frac{Z_S(\alpha,\beta)}{Z_W(\alpha)\,Z_T(\beta)} \]
Considering the values of $Z_S$, $Z_W$ and $Z_T$ from (15.14), (15.5) and (15.11) respectively, and as $S(W^*) = \alpha E_W^* + \beta E_T^*$, then:
\[ \ln p(T|\alpha,\beta) = -\alpha E_W^* - \beta E_T^* - \frac12\ln|H_S| + \frac{N_W}{2}\ln\alpha + \frac{P}{2}\ln\beta - \frac{P}{2}\ln(2\pi) \tag{15.24} \]
Considering $E_W$ a quadratic form in the weights, then:
\[ H_S = (\nabla\nabla^T)(\alpha E_W + \beta E_T) = \alpha I + \beta(\nabla\nabla^T)E_T = \alpha I + \beta H \]
$H$ being the Hessian of the error function. As the Hessian is a symmetric matrix, it may be diagonalized using its eigenvalues $\lambda_i$ (see the mathematical appendix) and, obviously, $H_S$ has the eigenvalues $\lambda_i+\alpha$ and $|H_S| = \prod_i(\lambda_i+\alpha)$. Finally:
\[ \frac{d}{d\alpha}\ln|H_S| = \frac{d}{d\alpha}\ln\prod_{i=1}^{N_W}(\lambda_i+\alpha) = \sum_{i=1}^{N_W}\frac{1}{\lambda_i+\alpha} = \text{Tr}\,H_S^{-1} \tag{15.25} \]
where $1/(\lambda_i+\alpha)$ are the eigenvalues of $H_S^{-1}$ and the $\lambda_i$ eigenvalues were supposed to be $\alpha$-independent, i.e. $E_T$ is also quadratic in the weights.

The condition that (15.24) be maximum with respect to $\alpha$ is $\frac{\partial\ln p}{\partial\alpha} = 0$ which, using (15.25), gives:
\[ 2\alpha E_W^* = N_W - \sum_{i=1}^{N_W}\frac{\alpha}{\lambda_i+\alpha} = \gamma \quad\text{where}\quad \gamma = \sum_{i=1}^{N_W}\frac{\lambda_i}{\lambda_i+\alpha} \tag{15.26} \]
and may be interpreted as follows:
• The prior distribution of the weights is usually chosen centered in the origin and with spherical symmetry so, in the absence of data, the most probable weight vector is $W^* = \widehat{0}$ and consequently $E_W^* = 0$.
• When data is available, $E_W^*$ shifts to a position given by (15.26). Considering a system of coordinates rotated such that the axes are parallel to the eigenvectors of $H$, the quantities by which $W^*$ is shifted (from the origin) along each axis are governed by $\gamma_i = \frac{\lambda_i}{\lambda_i+\alpha}$. For $\lambda_i\ll\alpha$, $\gamma_i\searrow 0$, i.e. $w_i\searrow 0$: $w_i$ is not shifted much from the origin and the main contribution to this particular weight is given by the prior weight distribution. For $\lambda_i\gg\alpha$, $\gamma_i\nearrow 1$: the main contribution to this weight is given by the data (through $\lambda_i$); these kinds of $w_i$ weights are named well-determined parameters. Thus $\gamma$ measures the effective number of weights changed by the data present in the training set (all the others being given rather small values by the prior distribution).

The Hessian is $H = \beta(\nabla\nabla^T)E_T$; this means that $\lambda_i\propto\beta$ (as the Hessian may be diagonalized using its eigenvalues), so $d\lambda_i\propto d\beta$ and then:
\[ \frac{d\lambda_i}{d\beta} = \frac{\lambda_i}{\beta} \tag{15.27} \]
Using this result, and similarly to the previous derivative:
\[ \frac{d}{d\beta}\ln|H_S| = \frac{d}{d\beta}\ln\prod_{i=1}^{N_W}(\lambda_i+\alpha) = \sum_{i=1}^{N_W}\frac{1}{\lambda_i+\alpha}\,\frac{d\lambda_i}{d\beta} = \frac{1}{\beta}\sum_{i=1}^{N_W}\frac{\lambda_i}{\lambda_i+\alpha} = \frac{\gamma}{\beta} \tag{15.28} \]
The condition that (15.24) be maximum with respect to $\beta$ is $\frac{\partial\ln p}{\partial\beta} = 0$ which, from (15.28), gives:
\[ 2\beta E_T^* = P - \gamma \tag{15.29} \]
and the same comments as above apply. As $S = \alpha E_W + \beta E_T$ then, from (15.26) and (15.29), $S(W^*) = P/2$.

To find the $\alpha$ and $\beta$ parameters, an iterative algorithm may be used. Starting with some initial values $\alpha_{(0)}$ and $\beta_{(0)}$, the values at step $t+1$ may be computed from (15.26) and (15.29):
\[ \alpha_{(t+1)} = \frac{\gamma_{(t)}}{2E_{W(t)}} \quad,\quad \beta_{(t+1)} = \frac{P-\gamma_{(t)}}{2E_{T(t)}} \]
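The re-estimation formulas translate directly into an iterative loop. The following Scilab sketch assumes a hypothetical helper routine retrain(a, b) (not from the book) which retrains the network with the current $\alpha$, $\beta$ and returns $E_W$, $E_T$ and the eigenvalues $\lambda_i$ of $\beta H$:

function [a, b] = evidence_update(P, a, b, nsteps)
  for step = 1:nsteps
    [EW, ET, lam] = retrain(a, b);    // assumed given, see above
    gam = sum(lam ./ (lam + a));      // effective number of parameters
    a = gam / (2*EW);                 // (15.26): alpha <- gamma/(2 E_W)
    b = (P - gam) / (2*ET);           // (15.29): beta <- (P-gamma)/(2 E_T)
  end
endfunction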
For large training sets, i.e. $P\gg N_W$, all weights may be approximated as well-determined parameters; then $\gamma_i\simeq 1$ and $\gamma\simeq N_W$, which leads to the much faster updating formulas:
\[ \alpha_{(t+1)} = \frac{N_W}{2E_{W(t)}} \quad,\quad \beta_{(t+1)} = \frac{P}{2E_{T(t)}} \]
Considering a Gaussian approximation for $p(T|\ln\alpha,\ln\beta)$ around the maximum:
\[ p(T|\ln\alpha,\ln\beta) = p(T|\ln\alpha^*,\ln\beta^*)\exp\left(-\frac12\,\delta^T\Sigma^{-1}\delta\right) \quad\text{where}\quad \delta = \begin{pmatrix}\ln\alpha-\ln\alpha^*\\ \ln\beta-\ln\beta^*\end{pmatrix} \]
and $\Sigma^{-1}$ collects the second derivatives:
\[ \sigma_{\ln\alpha}^{-2} = -\frac{\partial^2\ln p}{\partial(\ln\alpha)^2}\ ,\quad \sigma_{\ln\beta}^{-2} = -\frac{\partial^2\ln p}{\partial(\ln\beta)^2}\ ,\quad \sigma_{\ln\alpha\ln\beta}^{-2} = -\frac{\partial^2\ln p}{\partial\ln\alpha\,\partial\ln\beta} \]
Note that the logarithms of $\alpha$ and $\beta$ have been considered (instead of $\alpha$, $\beta$) as they are scaling hyper-parameters. Because $\frac{\partial}{\partial\ln\alpha} = \alpha\frac{\partial}{\partial\alpha}$ (and the same for $\beta$), by using (15.24) when calculating these derivatives (together with (15.25), (15.27) and (15.28)):
\[ \sigma_{\ln\alpha}^{-2} = \alpha E_W^* + \frac12\sum_{i=1}^{N_W}\frac{\alpha\lambda_i}{(\lambda_i+\alpha)^2} \]
with expressions of the same type for $\sigma_{\ln\beta}^{-2}$ and the cross term. The correction $\sum_i\frac{\alpha\lambda_i}{(\lambda_i+\alpha)^2}$ is a sum of terms which, for $\lambda_i\gg\alpha$, reduce to $\alpha/\lambda_i\ll 1$ while, for $\lambda_i\ll\alpha$, reduce to $\lambda_i/\alpha\ll 1$; the only significant terms come from $\lambda_i\simeq\alpha$, which are usually few in number. Then, using also $\alpha E_W^* = \gamma/2$ and $\beta E_T^* = (P-\gamma)/2$ at the maximum, the following approximations may be performed:
\[ \sigma_{\ln\alpha}^{-2} \simeq \frac{\gamma}{2}\ ,\quad \sigma_{\ln\beta}^{-2} \simeq \frac{P-\gamma}{2}\ ,\quad \sigma_{\ln\alpha\ln\beta}^{-2} \simeq 0 \tag{15.30} \]
so the $\alpha$ and $\beta$ parameters may be considered statistically independent, their distributions being of the form:
\[ p(T|\ln\alpha) = p(T|\ln\alpha^*)\exp\left(-\frac{(\ln\alpha-\ln\alpha^*)^2}{2\sigma_{\ln\alpha}^2}\right)\ ,\quad p(T|\ln\beta) = p(T|\ln\beta^*)\exp\left(-\frac{(\ln\beta-\ln\beta^*)^2}{2\sigma_{\ln\beta}^2}\right) \tag{15.31} \]

➧ 15.5 Integration Over α And β

(See [Bis95] pp. 415–417.)

The Bayesian approach requires an integration over all possible values of the unknown parameters, i.e. the posterior probability density of the weights is:
\[ p(W|T) = \iint p(W,\alpha,\beta|T)\,d\alpha\,d\beta = \iint p(W|T,\alpha,\beta)\,p(\alpha,\beta|T)\,d\alpha\,d\beta \]
Using the Bayes theorem (15.1), and as $p(T|W,\alpha,\beta) = p(T|W,\beta)$ (independent of $\alpha$, see also section 15.1.4) and $p(W|\alpha,\beta) = p(W|\alpha)$ (independent of $\beta$, see also section 15.1.2), and considering also $\alpha$ and $\beta$ statistically independent, i.e. $p(\alpha,\beta) = p(\alpha)\,p(\beta)$ (see also section 15.4), then:
\[ p(W|T) = \frac{1}{p(T)}\iint p(T|W,\beta)\,p(W|\alpha)\,p(\alpha)\,p(\beta)\,d\alpha\,d\beta \]
Now a form for $p(\alpha)$ and $p(\beta)$ has to be chosen. The best option would be to choose them such that $p(\ln\alpha)$ and $p(\ln\beta)$ have a relatively large breadth. One possibility, leading to easy integration, would be:
\[ p(\alpha) = \frac{1}{\alpha} \quad\text{and}\quad p(\beta) = \frac{1}{\beta} \]
and the integrals over $\alpha$ and $\beta$ may now be separated. Using (15.2) and (15.5), the prior of the weights becomes (see also the mathematical appendix regarding Euler functions):
\[ p(W) = \int_0^\infty p(W|\alpha)\,p(\alpha)\,d\alpha = \int_0^\infty \frac{1}{Z_W(\alpha)}\exp(-\alpha E_W)\,\frac{1}{\alpha}\,d\alpha = \frac{1}{(2\pi)^{N_W/2}}\int_0^\infty \exp(-\alpha E_W)\,\alpha^{\frac{N_W}{2}-1}\,d\alpha = \frac{\Gamma_E(N_W/2)}{(2\pi E_W)^{N_W/2}} \]
where $\Gamma_E$ is the Euler function. The $p(T|W)$ distribution is calculated in the same way, using (15.7) and (15.11):
\[ p(T|W) = \frac{\Gamma_E(P/2)}{(2\pi E_T)^{P/2}} \]
From the above equations, the negative logarithm of the posterior distribution of the weights is:
\[ -\ln p(W|T) = \frac{P}{2}\ln E_T + \frac{N_W}{2}\ln E_W + \text{const.} \]
and then its gradient may be written as:
\[ -\nabla\ln p(W|T) = \beta_c\,\nabla E_T + \alpha_c\,\nabla E_W \tag{15.32} \]
where
\[ \alpha_c = \frac{N_W}{2E_W} \quad\text{and}\quad \beta_c = \frac{P}{2E_T} \tag{15.33} \]
are the current values of the parameters. The minimum of $-\ln p(W|T)$ may be found by iteratively using (15.32) and (15.33).

✍ Remarks:
➥ While the direct integration over $\alpha$ and $\beta$ seems better than the evidence approximation (see section 15.4), in practice the approximations required after the integration may give worse results.
➧ 15.6 Model Comparison

(See [Bis95] pp. 418–422.)

Let us consider that there are several models $\{\mathcal{M}_m\}$ for the same problem and set of data $T$. The posterior probability of a particular model $\mathcal{M}_m$ is given by the Bayes theorem:
\[ P(\mathcal{M}_m|T) = \frac{p(T|\mathcal{M}_m)\,P(\mathcal{M}_m)}{p(T)} \]
where $P(\mathcal{M}_m)$ is the prior probability of model $\mathcal{M}_m$ and $p(T|\mathcal{M}_m)$ is the evidence for $\mathcal{M}_m$. An interpretation of the model evidence may be given as follows. Let us consider first the weight dependency; in the Bayesian framework:
\[ p(T|\mathcal{M}_m) = \int_W p(T|W,\mathcal{M}_m)\,p(W|\mathcal{M}_m)\,dW \tag{15.34} \]
and let us consider one single weight: the prior distribution $p(W|\mathcal{M}_m)$ has a low maximum and a large width $\Delta w_{\text{prior}}$ (in fact it should be almost constant over a large range of weight values), while the posterior distribution $p(W|T,\mathcal{M}_m)$ has (usually) a high maximum and a small width $\Delta w_{\text{post}}$. See figure 15.4.

Figure 15.4: The prior and posterior distribution of weights. The prior distribution $p(W|\mathcal{M}_m)$ has a (relatively) low maximum and a wide width $\Delta w_{\text{prior}}$: all weights have approximately the same probability, denoting the absence of data on which to make a decision. The posterior probability density $p(W|T,\mathcal{M}_m)$ has a high maximum and a narrow width $\Delta w_{\text{post}}$.

Considering a sharp peak of the posterior weight distribution around some maximum $w^*$, the integral (15.34) may be approximated as:
\[ p(T|\mathcal{M}_m) \simeq p(T|w^*,\mathcal{M}_m)\,p(w^*|\mathcal{M}_m)\,\Delta w_{\text{post}} \]
and also the prior distribution (as it is normated) should have an inverse dependency on its width, $p(W|\mathcal{M}_m)\propto 1/\Delta w_{\text{prior}}$ (the wider the distribution, the smaller the maximum and subsequently all values), and then:
\[ p(T|\mathcal{M}_m) \propto p(T|w^*,\mathcal{M}_m)\,\frac{\Delta w_{\text{post}}}{\Delta w_{\text{prior}}} \]
which represents the product between the likelihood $p(T|w^*,\mathcal{M}_m)$, estimated at the most probable weight value, and a term named the Occam factor. A model with a good fit will have a high likelihood; however, such models are usually complex and consequently have a very high and narrow posterior distribution peak, i.e. a small Occam factor, and reciprocally. Also, for different models which make the same predictions, i.e. have the same $p(T|W)$, the Occam factor advantages the simpler model, i.e. the one with the larger factor.

Let us consider the $\alpha$ and $\beta$ hyper-parameter dependency. The Bayesian framework requires an integration over all possible values:
\[ p(T|\mathcal{M}_m) = \iint p(T|\alpha,\beta,\mathcal{M}_m)\,p(\alpha,\beta|\mathcal{M}_m)\,d\alpha\,d\beta \tag{15.35} \]
where $p(T|\alpha,\beta,\mathcal{M}_m)$ represents the evidence for $\alpha$ and $\beta$ (see section 15.4). The $\alpha$ and $\beta$ parameters are considered statistically independent (see again section 15.4). By using the Gaussian approximation given by (15.31) and considering a uniform prior distribution $p(\ln\alpha) = p(\ln\beta) = 1/\ln\Omega$, where $\Omega$ is a region containing $\alpha^*$ respectively $\beta^*$, the integral (15.35) may be split in the form:
\[ p(T|\mathcal{M}_m) = \frac{p(T|\alpha^*,\beta^*,\mathcal{M}_m)}{(\ln\Omega)^2}\int_{-\infty}^{\infty}\exp\left(-\frac{(\ln\alpha-\ln\alpha^*)^2}{2\sigma_{\ln\alpha}^2}\right)d(\ln\alpha)\int_{-\infty}^{\infty}\exp\left(-\frac{(\ln\beta-\ln\beta^*)^2}{2\sigma_{\ln\beta}^2}\right)d(\ln\beta) = p(T|\alpha^*,\beta^*,\mathcal{M}_m)\,\frac{2\pi\,\sigma_{\ln\alpha}\,\sigma_{\ln\beta}}{(\ln\Omega)^2} \]
The above result was obtained by integrating over a single Gaussian. However, in networks with hidden neurons there is a symmetry and many equivalent maxima (see chapter "Multi Layer Neural Networks", section "Weight-Space Symmetry"); thus the model evidence $p(T|\mathcal{M}_m)$ has to be multiplied by the corresponding redundancy factor $R$, e.g. for a 2-layer network with $H$ hidden neurons there are $R = 2^HH!$ equivalent maxima. On similar grounds as for (15.24), and using (15.30), the logarithm of the evidence becomes:
\[ \ln p(T|\mathcal{M}_m) = -\alpha^*E_W^* - \beta^*E_T^* - \frac12\ln|H_S| + \frac{N_W}{2}\ln\alpha^* + \frac{P}{2}\ln\beta^* + \ln R + \frac12\ln\frac{2}{\gamma} + \frac12\ln\frac{2}{P-\gamma} + \text{const.} \]
where the additive constant is model independent. By the above means it is possible to calculate the probabilities of various models. However, there are several comments to be made:
• The model evidence is not particularly easy to calculate, due to the Hessian determinant $|H_S|$.
• Choosing the model with the highest evidence is not necessarily the best option, as there may be several models with significant/comparable evidence.
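Given the quantities already computed for the evidence approximation, the log evidence may be sketched as below (Scilab; lam are assumed to be the eigenvalues of $\beta H$, so that $\ln|H_S| = \sum_i\ln(\lambda_i+\alpha)$, and R is the symmetry factor, e.g. R = 2^H*factorial(H) for a 2-layer network):

function le = log_evidence(EW, ET, a, b, lam, P, R)
  NW = size(lam, "*");
  gam = sum(lam ./ (lam + a));
  le = -a*EW - b*ET - 0.5*sum(log(lam + a)) + NW/2*log(a) ..
       + P/2*log(b) + log(R) + 0.5*log(2/gam) + 0.5*log(2/(P - gam));
endfunction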
➧ 15.7 Committee Of Networks

(See [Bis95] pp. 422–424.)

Usually the error has several local, non-equivalent minima, i.e. minima not due to the weight-space symmetry (see chapter "Multi Layer Neural Networks", section "Weight-Space Symmetry"). The posterior probability density may be written as a sum of the posterior distributions corresponding to the local minima $m_m$:
\[ p(W|T) = \sum_m p(W,m_m|T) = \sum_m p(W|m_m,T)\,P(m_m|T) \]
By using the above distribution decomposition, other parameters may be calculated by integration over the weight space $W$, e.g. the averaged output is:
\[ \langle y\rangle = \int_W y(x,W)\,p(W|T)\,dW = \sum_m P(m_m|T)\int_{W_m} y(x,W)\,p(W|m_m,T)\,dW = \sum_m P(m_m|T)\,\langle y_m\rangle \]
where $W_m$ is the portion of the weight space corresponding to the minimum $m_m$ and $\langle y_m\rangle$ is the output average corresponding to $m_m$. The above formula shows a weighted average of the outputs corresponding to different local minima.

➧ 15.8 Monte Carlo Integration

(See [Bis95] pp. 425–429.)

The Bayesian techniques often require integration over a large number of weights, i.e. the computation of integrals of the form:
\[ I = \int_W F(W)\,p(W|T)\,dW \tag{15.36} \]
where $F(W)$ is some integrand. As the number of weights is usually very big, the classical numerical methods of integration lead to a large computational task. One way to approximate the above type of integrals is to use the Monte Carlo method, i.e. to select a sample set of weight vectors $\{W_i\}_{i=\overline{1,L}}$ from the distribution $p(W|T)$ (i.e. the weights are randomly chosen such that their distribution equals $p(W|T)$) and then approximate the integral by a finite sum:
\[ I \simeq \frac{1}{L}\sum_{i=1}^L F(W_i) \]
While usually the posterior distribution $p(W|T)$ may be calculated relatively easily, the selection of the sample set $\{W_i\}$ may be difficult. An alternative is to draw the sample weight set from another distribution $q(W)$, in which case the integral (15.36) becomes:
\[ I = \int_W F(W)\,\frac{p(W|T)}{q(W)}\,q(W)\,dW \simeq \frac{1}{L}\sum_{i=1}^L F(W_i)\,\frac{p(W_i|T)}{q(W_i)} \]
As the normalization of $p(W|T)$ itself requires an integration of the type (15.36) with $F(W) = 1$ (e.g. see the calculation of $Z_W$, $Z_T$ in the previous sections), the integral may be approximated by using the non-normalized distribution $\widetilde{p}(W|T)$:
\[ I \simeq \frac{\sum\limits_{i=1}^L F(W_i)\,\frac{\widetilde{p}(W_i|T)}{q(W_i)}}{\sum\limits_{i=1}^L \frac{\widetilde{p}(W_i|T)}{q(W_i)}} \tag{15.37} \]
this procedure being called importance sampling.

The importance sampling method still has one problem requiring attention. In practice, the posterior distribution is usually almost zero over all the weight space except some narrow areas (see section 15.1.3 and figure 15.2). In order to compute the integral (15.37) with enough precision, it is necessary to choose an $L$ big enough such that the areas with significant posterior distribution $p(W|T)$ get adequate coverage.

To avoid the previous problem, the Metropolis algorithm was developed. The weight vectors in the sample set $\{W_i\}$ form a discrete time series of the form:
\[ W_{(t+1)} = W_{(t)} + \varepsilon \]
where $\varepsilon$ is a vector randomly chosen from a distribution (e.g. Gaussian) with spherical symmetry; this kind of series is named a random walk. The new $W_{(t+1)}$ are then accepted or rejected following the rules: accept if $p(W_{(t+1)}|T) > p(W_{(t)}|T)$; otherwise accept with probability $\frac{p(W_{(t+1)}|T)}{p(W_{(t)}|T)}$. Considering an error function of the form $E = -\ln p(W|T)$, the above rules may be rewritten in the form:
\[ \text{accept if } E_{(t+1)} < E_{(t)}\ ;\quad \text{accept with probability } \exp[-(E_{(t+1)}-E_{(t)})]\ \text{ if } E_{(t+1)} > E_{(t)} \tag{15.38} \]
The Metropolis algorithm still leaves a problem with respect to local minima. Assuming that the weights are strongly correlated, around a (local) minimum the hypersurfaces of constant distribution $p(W|T) = $ const. are highly elongated hyperellipses (see chapter "Parameter Optimization", section "Local quadratic approximation"). As the distribution of $\varepsilon$, i.e. $p(\varepsilon)$, has spherical symmetry, this leads, according to the rules (15.38), to many rejected steps, and the algorithm has a tendency to slow down around local minima. See figure 15.5.

Figure 15.5: The Metropolis algorithm. Only the $W_{(t+1)}$ which fall in the "area" represented (roughly) by the intersection of the dotted circle ($p(\varepsilon) = $ const.) and the ellipse $E = $ const. are certainly accepted; only some of the others, from the rest of the circled area, are accepted. As the (hyper)volumes of the two areas (of certain, respectively partial, acceptance) depend on the weight space dimensionality, the algorithm slows down more around local minima in highly dimensional spaces, i.e. the problem worsens as the number of weights increases.

To correct the problem, the rules (15.38) may be changed to:
\[ \text{accept if } E_{(t+1)} < E_{(t)}\ ;\quad \text{accept with probability } \exp\left[-\frac{E_{(t+1)}-E_{(t)}}{T_{(t+1)}}\right]\ \text{ if } E_{(t+1)} > E_{(t)} \]
leading to the algorithm named simulated annealing. The $T_{(t+1)}$, named "temperature", is chosen to have a large starting value $T_{(0)}\gg 1$ and to decrease in time; this way the algorithm jumps quickly over the local minima found near the starting point. For $T = 1$, simulated annealing is equivalent to the Metropolis algorithm.
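A compact Scilab sketch of the sampler, with the "temperature" schedule of simulated annealing included, is given below (neg_log_post is a hypothetical placeholder for $E = -\ln\widetilde{p}(W|T)$; the step size and the schedule are illustrative assumptions):

function Ws = metropolis(neg_log_post, W0, L, sigma, T0)
  NW = size(W0, "*");  Ws = zeros(NW, L);
  W = W0;  E = neg_log_post(W);
  for t = 1:L
    Wn = W + sigma*rand(NW, 1, "normal");   // spherically symmetric step
    En = neg_log_post(Wn);
    Temp = max(T0/t, 1);                    // decreasing temperature, >= 1
    if En < E | rand() < exp(-(En - E)/Temp) then
      W = Wn;  E = En;                      // accept, rules (15.38)
    end
    Ws(:, t) = W;                           // rejected steps repeat W
  end
endfunction

With T0 = 1 the schedule is constant and the plain Metropolis algorithm is recovered.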
➧ 15.9 Minimum Description Length

(See [Bis95] pp. 429–431.)

Let us consider that a "sender" wants to transmit some data $\mathcal{D}$ to a "receiver" such that the message has the shortest possible length. Beyond the simple method of sending the data itself, if the quantity of data is sufficiently large then there is a possibility to shorten the message by sending a model of the data plus some information regarding the difference between the actual data set and the data set generated by the model. In this situation the message length will be the sum between the length of the model description $L(\mathcal{M})$ and the length of the difference $L(\mathcal{D}|\mathcal{M})$, which is model dependent:
\[ \text{message length} = L(\mathcal{M}) + L(\mathcal{D}|\mathcal{M}) \]
The $L(\mathcal{M})$ quantity may also be seen as a measure of model complexity (the more complex a model is, the bigger its "description"), and $L(\mathcal{D}|\mathcal{M})$ may be seen as the error of the model (the difference between the model output and the actual data targets). Then the message length may be written as:
\[ \text{message length} = \text{model complexity} + \text{error} \tag{15.39} \]
The more complex the model is, i.e. the bigger $L(\mathcal{M})$, the more accurate its predictions are, and thus the error $L(\mathcal{D}|\mathcal{M})$ is small. Reciprocally, a simple model, i.e. small $L(\mathcal{M})$, will generate many errors, leading to a large $L(\mathcal{D}|\mathcal{M})$. This reasoning implies that there should be an optimal balance (tradeoff) between the two, resulting in a minimal message length.

For some variable $x$, the information needed to be transmitted is $-\ln p(x)$, where $p(x)$ is the probability density (see chapter "Error Functions", section "Entropy"). Then:
\[ \text{message length} = -\ln p(\mathcal{M}) - \ln p(\mathcal{D}|\mathcal{M}) \]
and by using the Bayes theorem $p(\mathcal{M}|\mathcal{D}) = \frac{p(\mathcal{D}|\mathcal{M})\,p(\mathcal{M})}{p(\mathcal{D})}$ (where $p(\mathcal{M}|\mathcal{D})$ is the probability density of the model $\mathcal{M}$, given the data $\mathcal{D}$) it becomes:
\[ \text{message length} = -\ln p(\mathcal{M}|\mathcal{D}) - \ln p(\mathcal{D}) \]
Let us consider that the model $\mathcal{M}$ represents a neural network. Then the message length becomes the length of the weight vector and data, given the specified model, $L(W,\mathcal{D}|\mathcal{M})$. The model complexity is measured by the probability of the weight vector given the model, $-\ln p(W|\mathcal{M})$, and the error is calculated given the weight vector and the model, $-\ln p(\mathcal{D}|W,\mathcal{M})$. Equation (15.39) becomes:
\[ L(W,\mathcal{D}|\mathcal{M}) = -\ln p(W|\mathcal{M}) - \ln p(\mathcal{D}|W,\mathcal{M}) \tag{15.40} \]
To transmit the distributions, both the sender and the receiver must agree upon the general form of the distributions. Let us consider the weight distribution as a Gaussian with zero mean and $1/\alpha$ variance:
\[ p(W|\mathcal{M}) = \left(\frac{\alpha}{2\pi}\right)^{N_W/2}\exp\left(-\frac{\alpha}{2}\,\|W\|^2\right) \]
and the error distribution as a Gaussian centered on the data to be transmitted. Assuming one network output $y$ and $P$ targets $\{t_p\}$ to be transmitted, then:
\[ p(\mathcal{D}|W,\mathcal{M}) = \left(\frac{\beta}{2\pi}\right)^{P/2}\exp\left(-\frac{\beta}{2}\sum_{p=1}^P [y(x_p)-t_p]^2\right) \]
The message length becomes the sum-of-squares error function with the weight-decay regularization factor:
\[ L(W,\mathcal{D}|\mathcal{M}) = \frac{\beta}{2}\sum_{p=1}^P [y(x_p)-t_p]^2 + \frac{\alpha}{2}\,\|W\|^2 + \text{const.} \]

➧ 15.10 Performance Of Models

(See [Rip96] pg. 68.)

15.10.1 Risk Averaging

Given an input vector $x$, a classifier $\mathcal{M}$ will categorize it into the class $\mathcal{C}_k$ for which the posterior probability is maximum (according to the Bayes rule, see chapter "Pattern Recognition"): $P(\mathcal{C}_k|x) = \max_\ell P(\mathcal{C}_\ell|x)$. Then the probability of the model making a correct classification equals the probability that the class chosen (according to the Bayes rule, for the given $x$) is the correct one, i.e.
its posterior probability:
\[ P_{\text{correct}}(x) = P(\mathcal{C}_k|x) = E\{\max_\ell P(\mathcal{C}_\ell|x)\} \]
(as the posterior probability is also $W$-dependent, and $W$ has a distribution associated, the expected value of the maximum was used). The probability of misclassification is the complement of the probability of correct classification:
\[ P_{\text{mc}}(x) = 1 - P_{\text{correct}}(x) = 1 - E\{\max_\ell P(\mathcal{C}_\ell|x)\} \]
The above formula for misclassification does not depend on the correct classification of $x$, so it may be used successfully in situations where gathering raw data is easy but the number of classified patterns is low. This procedure is known as risk averaging.

As the probabilities are normated, i.e. $\sum_{k=1}^K P(\mathcal{C}_k|x) = 1$, the worst value of $\max_\ell P(\mathcal{C}_\ell|x)$ is the one for which $P(\mathcal{C}_k|x) = 1/K$, $\forall k = \overline{1,K}$, and:
\[ E\{[1-\max_\ell P(\mathcal{C}_\ell|x)]^2\} \leqslant \left(1-\frac{1}{K}\right)E\{1-\max_\ell P(\mathcal{C}_\ell|x)\} \]
The variance of $\max_\ell P(\mathcal{C}_\ell|x)$ (the variance of a random variable $x$ being defined as $E\{(x-E\{x\})^2\}$) is then:
\[ V\{\max_\ell P(\mathcal{C}_\ell|x)\} = E\{(1-\max_\ell P(\mathcal{C}_\ell|x))^2\} - \left[E\{1-\max_\ell P(\mathcal{C}_\ell|x)\}\right]^2 \leqslant \left(1-\frac{1}{K}\right)P_{\text{mc}}(x) - P_{\text{mc}}(x)^2 \]

✍ Remarks:
➥ In the process of estimating the probability of misclassification it is better to use the posterior class probability $P(\mathcal{C}_k|x)$ given by the Bayesian inference, rather than the one given by the most probable set of parameters $W^*$, because it takes into account the variability of $W$ and gives superior results, especially for small probabilities.

CHAPTER 16

Tree Based Classifiers

➧ 16.1 Tree Classifiers

(See [Rip96] pp. 213–216.)

Decision trees are usually built from top to bottom. At each (nonterminal) node a decision is made, until a terminal node, named also a leaf, is reached. Each leaf should contain a class label; each nonterminal node should contain a decisional question. See figure 16.1.

Figure 16.1: The tree based classifier: each decision-making node contains a question, and the paths lead down to the leaves, which carry the class labels.

The main problem is to build the tree classifier using a training set (obviously of limited size). This problem may be complicated by overlapping between class areas, in which case noise is present. The tree is built by transforming a leaf into a decision-making node and growing the tree further down, a process named splitting. In the presence of noise the resulting tree may be overfitted (on the training set), so some pruning may be required.

✍ Remarks:
➥ As the pattern space is separated into decision areas (by the decision boundaries), tree based classifiers may be seen as a hierarchical way of describing the partition of the input space.
➥ Usually there are many possibilities to build a tree structured classifier for the same classification problem, so an exhaustive search for the best one is not possible.
➧ 16.2 Splitting

(See [Rip96] pp. 216–221.)

In general, the tree is built considering one feature (i.e. component of the input vector) at a time. For binary features the choice is obvious, but for continuous ones the problem is more difficult, especially if a small subset of features may greatly simplify the emerging tree.

16.2.1 Impurity based method

One method of deciding over the splitting (feature selection and decision boundaries among features) is to increase "purity", i.e. the pattern vectors passing through a newly built path should be, with greater probability, from some class(es) rather than others. Alternatively, the target is to decrease "impurity", which is easier to define in quantitative terms. The impurity $i(n)$ at the output of node $n$ should be defined such that it is zero if all $P(\mathcal{C}_k|x)$ are zero except one (which has the value 1, due to normalization) and maximum if all $P(\mathcal{C}_k|x)$ are equal. Two definitions of impurity are widely used; the probabilities refer to the current node $n$:
• Entropy: $i(n) = -\sum_{k=1}^K P(\mathcal{C}_k|x)\ln P(\mathcal{C}_k|x)$. Because $\lim_{P\to 0} P\ln P = 0$ (by l'Hospital's rule), $\ln 1 = 0$ and $P\leqslant 1 \Rightarrow \ln P\leqslant 0$, the defining conditions for an impurity are met.
• Gini index: $i(n) = \sum_{k\neq\ell} P(\mathcal{C}_k|x)\,P(\mathcal{C}_\ell|x) = 1 - \sum_{k=1}^K P^2(\mathcal{C}_k|x)$ (the last equality derived from the normalization condition, squared, i.e. $\left[\sum_{k=1}^K P(\mathcal{C}_k|x)\right]^2 = 1$).

The average decrease in impurity after splitting on feature $x$ is:
\[ \Delta i(n) = i(n) - \int_X p(x)\,i(n|x)\,dx \]
such that usually the algorithm building the tree classifier will try to choose the feature which maximizes the above expression. The average impurity of the whole tree may be defined as:
\[ i_{\text{tree}} = \sum_{k=1}^K q_k\,i(k) \tag{16.1} \]
where $q_k$ is the probability of a pattern vector reaching leaf $k$ (assuming that the final tree has a leaf for each class).
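Both impurity measures are one-liners; a Scilab sketch for a single node, given the vector p of class probabilities $P(\mathcal{C}_k|x)$ at that node (the function name is illustrative):

function [ie, ig] = impurity(p)
  q = p(p > 0);               // drop zero entries: lim P ln P = 0
  ie = -sum(q .* log(q));     // entropy impurity
  ig = 1 - sum(p .^ 2);       // Gini index
endfunction

For example, impurity([0 1 0]) returns (0, 0), while equal probabilities maximize both measures.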
16.2.2 Deviance based method

Another approach to the building of the decisional tree is to consider it as a probabilistic model. Each leaf $k$ may be associated with a distribution giving the probability that a pattern reaching that node is of some particular class, i.e. $P(\mathcal{C}_\ell,k)$. Considering $P(n)$ the probability of a pattern reaching node $n$, the conditional probability $P(\mathcal{C}_\ell|n)$ is:
\[ P(\mathcal{C}_\ell,n) = P(\mathcal{C}_\ell|n)\,P(n) \quad\Rightarrow\quad P(\mathcal{C}_\ell|n) = \frac{P(\mathcal{C}_\ell,n)}{P(n)} \tag{16.2} \]
Also, taking $P_{k\ell}$ as the number of patterns (from the training set) arriving at leaf $k$ and being of class $\mathcal{C}_\ell$, the likelihood of the training set is:
\[ L = \prod_{k=1}^K\prod_{\ell=1}^K [P(\mathcal{C}_\ell|k)]^{P_{k\ell}} \]
The deviance (see chapter "Pattern Recognition" for the definition) is:
\[ D_{\text{tree}} = 2(\ln L)_{\text{for perfect model}} - 2\ln L = \sum_{k=1}^K D_k \quad\text{where}\quad D_k = -2\sum_{\ell=1}^K P_{k\ell}\ln P(\mathcal{C}_\ell|k) \tag{16.3} \]
because for the perfect model $p(\mathcal{C}_\ell|k) = 1$ for $P_{k\ell} > 0$ and equals zero otherwise (and $\lim_{x\to 0} x\ln x = 0$), thus the deviance term associated with the perfect model cancels. If the total number of patterns arriving at leaf $k$ is $P_k$, then an estimate of $p(\mathcal{C}_\ell|k)$ would be $\widehat{p}(\mathcal{C}_\ell|k) = P_{k\ell}/P_k$ (also, from (16.2): $p(\mathcal{C}_\ell,k)\propto P_{k\ell}$, $p(k)\propto P_k$; note also that the training set is assumed to be an unbiased sample from the true distribution). From (16.3), and using $\sum_{\ell=1}^K P_{k\ell} = P_k$:
\[ D_{\text{tree}} = 2\left(\sum_{k=1}^K P_k\ln P_k - \sum_{k,\ell=1}^K P_{k\ell}\ln P_{k\ell}\right) \]
Considering the tree impurity (16.1), $q_k$ would be $q_k = P_k/P$ and, for the entropy impurity:
\[ i(k) = -\sum_{\ell=1}^K \frac{P_{k\ell}}{P_k}\ln\frac{P_{k\ell}}{P_k} \quad\Rightarrow\quad i_{\text{tree}} = -\sum_{k,\ell=1}^K \frac{P_k}{P}\,\frac{P_{k\ell}}{P_k}\ln\frac{P_{k\ell}}{P_k} = \frac{D_{\text{tree}}}{2P} \]
and so the entropy and deviance based splitting methods are equivalent.

✍ Remarks:
➥ In practice it may well happen that the training set is biased, e.g. it contains a larger number of examples from rare classes than would have occurred in a random sample. In this case the probabilities $P(\mathcal{C}_\ell|k)$ should be estimated separately. One way to do this is to attach "weights" to the training patterns and consider $P_k$, $P_{k\ell}$ and $P$ as sums of "weights" rather than actual counts of patterns.
➥ If there are "costs" associated with misclassification, then these may be inserted directly into the Gini index, in the form:
\[ i(n) = \sum_{\substack{k,\ell=1\\ k\neq\ell}}^K C_{k\ell}\,P(\mathcal{C}_k|x)\,P(\mathcal{C}_\ell|x) \]
where $C_{k\ell}$ is the cost of misclassification between classes $\mathcal{C}_k$ and $\mathcal{C}_\ell$. Note that this leads to completely symmetrical "costs", as the total coefficient of $P(\mathcal{C}_k|x)\,P(\mathcal{C}_\ell|x)$ (in the above sum) is $C_{k\ell}+C_{\ell k}$; thus this approach is ineffective for two-class problems.

➧ 16.3 Pruning

(See [Rip96] pp. 221–225.)

Let $R(T)$ be a measure of the tree $T$ such that the better $T$ is, the smaller $R(T)$ becomes, and such that it has contributions from (and only from) all its leaves. Possible choices are the total number of misclassifications over a training/test set, the entropy or the deviance. Let $S(T)$ be the size of the tree $T$, proportional to the number of leaves. A good criterion for characterizing the tree is:
\[ R_\alpha(T) = R(T) + \alpha S(T) \tag{16.4} \]
which is minimal for a good one. $\alpha$ is a positive parameter that penalizes the tree size; for $\alpha = 0$ the tree is chosen based only on the error rate. For a given $\alpha$ there are several possible trees; let $T(\alpha)$ be the optimal one.

Let us consider a tree $T$ and, for a non-leaf node $n$, the subtree $T_n$ having as root the node $n$. Let $g(n,T)$ be a measure of the reduction in $R$ obtained by adding $T_n$ to node $n$, relative to the increase in size:
\[ g(n,T) = \frac{R(n)-R(T_n)}{S(T_n)-S(n)} \tag{16.5} \]
$S(n)$ is the size corresponding to the single node $n$; it is assumed that $T_n$ is a subtree with at least two nodes (it doesn't make sense to add just one leaf to another one) and thus $S(T_n) > S(n)$. $R(n)$ is the measure $R$ of node $n$, considering it a leaf. From (16.4): $R_\alpha(n) = R(n) + \alpha S(n)$ and $R_\alpha(T_n) = R(T_n) + \alpha S(T_n)$ and, using (16.5), $g(n,T)$ may be written as:
\[ g(n,T) = \alpha + \frac{R_\alpha(n)-R_\alpha(T_n)}{S(T_n)-S(n)} \tag{16.6} \]
As the denominator is always positive:
\[ g(n,T) \geqslant \alpha \quad\Leftrightarrow\quad R_\alpha(n) \geqslant R_\alpha(T_n) \tag{16.7} \]
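The weakest-link quantity (16.5) driving the pruning sequence is equally simple to express (a sketch, with $S(n) = 1$ for a single node; R_node is $R(n)$ with $n$ taken as a leaf, R_sub and S_sub the measure and size of the subtree $T_n$; names are illustrative):

function g = weakest_link(R_node, R_sub, S_sub)
  g = (R_node - R_sub) / (S_sub - 1);   // eq. (16.5) with S(n) = 1
endfunction

Pruning all nodes attaining the minimum of g gives $T(\alpha_1)$, the next minimum gives $T(\alpha_2)$, and so on, producing the nested sequence established by the propositions below.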
Proposition 16.3.1. Let us consider a tree $T$ and number its nodes from bottom up, such that each child node has a number label smaller than its parent node. Let us visit each node in its number order (i.e. from bottom up) and prune at the current node $n$ if $R_\alpha(n) \leqslant R_\alpha(T'_n)$, where $T'$ is the current tree. After visiting all nodes, the result is $T(\alpha)$.

Proof. It is demonstrated by induction. For the unpruned tree $T$, all leaves are optimally pruned. Let us consider a current node $n$. This one is either pruned, with the value $R_\alpha(n)$, or it is not, in which case:
\[ R_\alpha(T'_n) = \sum_{\text{branches } B} R_\alpha(T'_B) < R_\alpha(n) \]
the sum being done over all branches $B$ of node $n$. But if it is not pruned, then it is not possible to have a tree $T''_n$ with a smaller $R_\alpha$, because in this case there would be (at least) one branch $B$ such that $R_\alpha(T''_B) < R_\alpha(T'_B)$ and thus $T'_B$ would not be optimally pruned; i.e. if the tree is not pruned at node $n$, then the whole subtree (from node $n$ downwards) is optimally pruned. After analyzing the last node which, according to the numbering scheme, is the root of the whole tree, the tree is optimally pruned. ∎

Proposition 16.3.2. Let $\alpha_1$ be the smallest value of $g(n,T)$ over all non-leaf nodes of $T$, i.e. $\alpha_1 = \min_n g(n,T) \leqslant g(n,T)$, $\forall n$ non-leaf node. The optimally pruned tree is either $T$, for $\alpha < \alpha_1$, or $T(\alpha_1)$, obtained by pruning all nodes with $g(n,T) = \alpha_1$.

Proof. Let us consider the first case, when $\alpha < \alpha_1$. Then $\alpha < g(n,T)$ for all non-leaf $n$ and thus, from (16.7), it follows that $R_\alpha(n) > R_\alpha(T_n)$ for all non-leaf nodes; according to the previous proposition, no pruning is performed and the tree is $T(\alpha)$.

Considering the second case: after pruning all nodes with $g(n,T) = \alpha_1$, for all non-terminal nodes $n$ left in the tree $g(n,T) > \alpha_1$. This means that, according to (16.7), $R_{\alpha_1}(n) > R_{\alpha_1}(T_n)$ and, by using the previous proposition, no further node pruning takes place and the tree is $T(\alpha_1)$. ∎

Proposition 16.3.3. For $\widetilde{\alpha} > \alpha$, $T(\widetilde{\alpha})$ is a subtree of $T(\alpha)$.

Proof. It will be shown by induction that $T_n(\widetilde{\alpha})$ is a subtree of $T_n(\alpha)$ for any node $n$, and thus for the root node as well. The proposition is true for all terminal nodes (leaves). It has to be shown that if $R_\alpha(n) \leqslant R_\alpha(T_n)$ is true then $R_{\widetilde{\alpha}}(n) \leqslant R_{\widetilde{\alpha}}(T_n)$ is also true, and thus, when pruning by the method described in the first proposition, $T(\alpha)$ will contain $T(\widetilde{\alpha})$. From (16.4):
\[ \begin{cases} R_\alpha(n) = R(n) + \alpha S(n) \\ R_{\widetilde{\alpha}}(n) = R(n) + \widetilde{\alpha}S(n) \end{cases} \quad\Rightarrow\quad R_{\widetilde{\alpha}}(n) = R_\alpha(n) + (\widetilde{\alpha}-\alpha)S(n) \]
and also (in the same way) $R_{\widetilde{\alpha}}(T_n) = R_\alpha(T_n) + (\widetilde{\alpha}-\alpha)S(T_n)$; by subtracting the above two equations:
\[ R_{\widetilde{\alpha}}(n) - R_{\widetilde{\alpha}}(T_n) = [R_\alpha(n)-R_\alpha(T_n)] + (\widetilde{\alpha}-\alpha)[S(n)-S(T_n)] \tag{16.8} \]
Considering $R_\alpha(n) \leqslant R_\alpha(T_n)$ and, as $\widetilde{\alpha} > \alpha$ and $S(n) < S(T_n)$, then $R_{\widetilde{\alpha}}(n) - R_{\widetilde{\alpha}}(T_n) \leqslant 0$. ∎

The last two propositions show the following:
• There is a series of parameters $\alpha_1 < \alpha_2 < \dots$ found by ordering all $g(n,T)$. For each $(\alpha_{i-1},\alpha_i]$ there is only one optimal tree $T(\alpha_i)$.
• For $j > i$, the tree $T(\alpha_j)$ is embedded in $T(\alpha_i)$ (as $\alpha_j > \alpha_i$, by applying the last proposition).

➧ 16.4 Missing Data

(See [Rip96] pp. 231–233.)

One of the advantages of tree classifiers is the ease with which missing data may be treated. The alternatives, when data is partially missing from a pattern vector, are:
• "Drop". Work with the available data, "pushing" the pattern down the tree as far as possible and then using the distribution at the reached node to make a prediction (if it is not a leaf, of course).
• Surrogate splits. Create a set of surrogate split rules at the non-terminal nodes, to be used if real data is missing.
• Missing feature. Consider "missing" as a possible/supplemental value of the feature and create a separate branch/split/sub-tree for it.
• Split examples. When a pattern vector (from the training set) reaches a node where a split should occur over one of its missing features, it is split in fractions over the branches. Theoretically this should be done using a posterior probability conditioned on the available data, but this is seldom available; however, it is possible to use the probabilities of going along the node's branches estimated from the complete (without missing data) pattern vectors.
A (sub)graph is connected if there is a path between every possible pair of vertices. A (sub)graph is complete if every possible edge is present. A maximal complete subgraph (of a graph) is named a clique.

Definition 17.1.3. A tree is a connected graph with no cycles. A directed tree is a connected directed acyclic graph, abbreviated DAG. A directed tree has the property that there is a vertex, named the root, such that there is a directed path from the root to any other vertex, and any vertex except the root has exactly one incoming edge (arrow), i.e. just one parent.

Figure 17.1: A graph. The one represented here is unconnected, has a cycle, and has one ordered edge (shown by the arrow).

Definition 17.1.4. An ancestral subgraph of a directed graph contains all the parents of its vertices, i.e. it also contains the root (which has no parent). A polytree is a singly connected graph, i.e. there is only one path between any two vertices.

17.1.1 Markov Properties

(See [Rip96] pp. 249-252.)

Definition 17.1.5. Consider 3 subsets of vertices A, B, C ⊆ V, where V is the whole set of vertices of graph G. Then C separates A and B in G if any path from any vertex in A to any vertex in B has to pass through a vertex from C. Let x_A be the set of (random) variables associated with A, and similarly x_B, x_C. The conditional independence of x_A and x_B, given x_C, is written as:

x_A ⊥ x_B | x_C   or (in short)   A ⊥ B | C

Definition 17.1.6. Let A^C be the complementary set of A, i.e. A^C = V \ A. The boundary of A, denoted ∂A, is the set of all vertices from A^C which are connected to a vertex in A through an edge.

Definition 17.1.7. The following Markov properties are defined:
1. Global: if, for any subsets A, B and C of vertices such that C separates A and B, it is true that x_A ⊥ x_B | x_C.
2. Local: if, for any subset A, the conditional distribution of x_A given x_{V\{A∪∂A}} depends only on x_{∂A}, i.e. x_A ⊥ x_{V\{A∪∂A}} | x_{∂A}. Otherwise said, the x_A variables and those not directly connected to them are conditionally independent.
3. Pairwise: if, for any subsets A and B with no edge from A to B, x_A and x_B are conditionally independent given all the other (stochastic) variables.
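As a small illustration of these definitions (a toy example of our own): in the chain graph a — b — c, the set C = {b} separates A = {a} from B = {c}, and the boundary of {a} is ∂{a} = {b}. The global Markov property then asserts x_a ⊥ x_c | x_b: once the middle variable is known, the two ends carry no further information about each other.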
Proposition 17.1.1.
1. Consider a set of (random) variables defined on the vertices of a graph G. If the graph has the pairwise Markov property then there exists a set of positive functions φ_C(x_C), defined over the cliques of G and symmetric in their arguments, such that:

P(x_V) ∝ ∏_{cliques C} φ_C(x_C)    (17.1)

i.e. the probability of the graph random variables taking the values x_V (all values collected together into a vector) is proportional to the product, over its cliques, of the functions φ_C (the components of x_C being the corresponding components of x_V, vertex-wise).
2. If P(x_V) may be represented in the form (17.1) then the graph has the global Markov property.

Proof. 1. It is assumed that, for any vertex s of G, the associated variable may take the value 0 (this can be arranged by some "re-labeling" procedure if it is not true initially). It will be shown by induction over the size of the subgraph A = {vertex s | x_s ≠ 0} ⊆ V_G. The desired probability is built in the form:

P(x_V) = P(x_V = 0̂) ∏_{C ⊆ A} φ_C(x_C)    (17.2)

where φ_C is defined recursively as:

φ_C(x_C) = P(x_C, x_{C^C} = 0̂) / [P(x_V = 0̂) ∏_{D ⊂ C} φ_D(x_D)]    (17.3)

(the product over D being taken over all strict subsets of C); for the empty set D = ∅, φ_D(0̂) ≡ 1, and φ_C ≡ 1 for C non-complete.

Now, for the cases when A = ∅ and when A contains just one vertex, equations (17.2) and (17.3) give the identities P(0̂) = P(0̂), respectively P(x_A) = P(x_A, x_{A^C} = 0̂), so (17.1) holds. The equations (17.2) and (17.3) may be condensed to:

P(x_A) = P(0̂) ∏_{C ⊆ A} P(x_C, x_{C^C} = 0̂) / [P(x_V = 0̂) ∏_{D ⊂ C} φ_D(x_D)]

If A is complete then any of its subgraphs is also part of one of its cliques, so the above equation may be written in the form (17.1). So the proposition also holds for A complete.

Assume now that (17.1) holds for a non-complete A having i vertices, and consider a new A with i + 1 vertices. As A is not complete, it is possible to write A = B ∪ {s} ∪ {t}, where B is a subgraph of A having i − 1 vertices, and s and t are not neighbors (there is no edge (s,t)), i.e. x_s ⊥ x_t | x_B, x_{A^C}. Then:

P(x_V) = P(x_B, x_s, x_t, x_{A^C} = 0̂) = P(x_B, x_s, x_t = 0, x_{A^C} = 0̂) · P(x_t | x_B, x_s, x_{A^C} = 0̂) / P(x_t = 0 | x_B, x_s, x_{A^C} = 0̂)

and, considering the conditional independence of s and t:

P(x_V) = P(x_B, x_s, x_t = 0, x_{A^C} = 0̂) · P(x_t | x_B, x_s = 0, x_{A^C} = 0̂) / P(x_t = 0 | x_B, x_s = 0, x_{A^C} = 0̂)
       = P(x_B, x_s, x_t = 0, x_{A^C} = 0̂) · P(x_t, x_B, x_s = 0, x_{A^C} = 0̂) / P(x_t = 0, x_B, x_s = 0, x_{A^C} = 0̂)

By using (17.2) (supposed true by induction):

P(x_B, x_s = 0, x_t = 0, x_{A^C} = 0̂) = P(x_V = 0̂) ∏_{C ⊆ B} φ_C(x_C)
P(x_B, x_s, x_t = 0, x_{A^C} = 0̂) = P(x_V = 0̂) ∏_{C ⊆ B∪{s}} φ_C(x_C)
P(x_B, x_s = 0, x_t, x_{A^C} = 0̂) = P(x_V = 0̂) ∏_{C ⊆ B∪{t}} φ_C(x_C)

and then:

P(x_V) = P(x_V = 0̂) · [∏_{C ⊆ B∪{s}} φ_C(x_C)] [∏_{C ⊆ B∪{t}} φ_C(x_C)] / ∏_{C ⊆ B} φ_C(x_C)

As a complete subgraph C of A cannot contain both vertices s and t (because there is no edge between them), P(x_V) can finally be written as:

P(x_V) = P(x_V = 0̂) ∏_{C ⊆ B∪{s}∪{t}} φ_C(x_C)

which shows that the subgraph A of size i + 1 may be written in the same form as the one of size i.

2. It is assumed that (17.1) is true. For a subgraph A ⊆ G:

P(x_V) = P(x_A | x_{A^C}) P(x_{A^C})  ⟹  P(x_A | x_{A^C}) = P(x_V) / P(x_{A^C})

For P(x_V) the formula (17.1) may be used directly; for P(x_{A^C}), as only x_{A^C} is fixed and x_A may take any value, a sum over all x_A values is to be performed:

P(x_A | x_{A^C}) = ∏_{C ⊆ G} φ_C(x_C) / ∑_{x_A} ∏_{C ⊆ G} φ_C(x_C) = ∏_{C∩A≠∅} φ_C(x_C) / ∑_{x_A} ∏_{C∩A≠∅} φ_C(x_C)

because, in the denominator, the terms corresponding to cliques C disjoint from A (i.e. C ∩ A = ∅) may be extracted as a common multiplicative factor of the sum and then canceled against the corresponding terms of the numerator.

Let C be a clique with some of its vertices in A. Then C ⊆ A ∪ ∂A (assume, by contradiction, that there is a vertex s ∈ A with s ∈ C and another one t ∉ A ∪ ∂A with t ∈ C; as s, t ∈ C and C is complete, the edge (s,t) exists, so t ∈ ∂A, a contradiction). This means that the right-hand side of the above equation is in fact just a function of x_{A∪∂A}, i.e. P(x_A | x_{A^C}) = P(x_A | x_{∂A}).

Let A, B, C be such that C separates A and B. Let B' be the set of vertices which may be reached from B using a path not going through C (thus B ⊆ B'), and let D = {B' ∪ C}^C. Then B', C and D are disjoint by construction (i.e. B' ∩ C = ∅, B' ∩ D = ∅ and C ∩ D = ∅). As A is separated from B by C, while B' is not, no vertex from A may be in either C or B'; thus A ⊆ D. By construction B' ∪ C ∪ D = V_G (and they are disjoint). Also, as B' is formed by all vertices which may be reached through a path not passing through C, B' and D are separated by C, i.e. x_{B'} ⊥ x_D | x_C; and, as A ⊆ D and B ⊆ B', then x_A ⊥ x_B | x_C, which is the global Markov property. ∎
17.1.2 Markov Trees

(See [Rip96] pp. 252-254.)

Considering a tree, there is a unique path between each pair of nodes; an undirected tree may be transformed into a directed one by choosing any vertex as root and then orienting the edges so as to have the same direction as the paths leading from the root to the other vertices. The simplest tree is the chain, where each vertex has just one parent and one child, and each vertex splits the graph in two conditionally independent subgraphs.

✍ Remarks:
➥ Markov chains may be used in time series; conditional independence in this case may be expressed as "past and future are independent, given the present".

Consider, for directed chains, each vertex labeled with a number such that for each i its parent is i − 1, the root having label 1. Then the probability of the whole tree is:

P(x_V) = P(x_1) ∏_{i≠1} P(x_i | x_{i−1})

(the root doesn't have a parent and thus its probability is unconditioned). For directed trees the above formula is slightly modified to:

P(x_V) = P(x_1) ∏_i P(x_i | x_j, j parent of i)

Consider a vertex t and its associated random variable x_t. Consider that the parent (ancestor) of t is s and the associated x_s may take the values x_s ∈ {x_{s1}, ...}. Then the distribution p_t of x_t is expressed by means of the distribution p_s at vertex s:

p_t(x_t) = ∑_i p_s(x_{si}) P(x_t | x_{si})

(the sum being done over the possible values of x_s). Thus, given the distribution of the root, the other distributions may be calculated recursively, from the root to the leaves.

Let E_s^− be the events on the descendants of s, E_s^+ all other events, and E_s = E_s^− ∪ E_s^+. The distribution p_s(x_s) = P(x_s | E_s) is to be found (i.e. given the values of some random variables at the vertices of the graph, the problem is to find the probability of a random variable taking a particular value at a given vertex). Then (using the Bayes theorem):

P(x_s | E_s) = P(x_s | E_s^−, E_s^+) ∝ P(E_s^− | x_s, E_s^+) P(x_s | E_s^+)

and, as E_s^− ⊥ E_s^+ | x_s, then P(E_s^− | x_s, E_s^+) = P(E_s^− | x_s) and:

P(x_s | E_s) ∝ P(E_s^− | x_s) P(x_s | E_s^+)

The first term is:

P(E_s^− | x_s) = 1 if s has no children; otherwise P(E_s^− | x_s) = ∏_{children t of s} P(E_t^− | x_s)

where P(E_t^− | x_s) = ∑_{x_t} P(E_t^− | x_t) P(x_t | x_s). For the other term, x_s is conditionally separated from E_s^+ by its parent r:

P(x_s | E_s^+) = ∑_{x_r} P(x_s | x_r) P(x_r | E_s^+)

(the sum being done over all possible values of x_r). The restrictions over the possible values of x_r are given by E_s^+ through E_r^+ and E_q^−, where q is any child of r except s:

P(x_r | E_s^+) = P(x_r | E_r^+) ∏_{q≠s} P(E_q^− | x_r)

and, finally, p_s(x_s) may be calculated by using the above formulas recursively.
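The propagation formula p_t(x_t) = ∑_i p_s(x_{si}) P(x_t | x_{si}) is just a matrix-vector product once the distributions are stored as vectors; a small Scilab illustration (all numbers invented):

    // One step of distribution propagation along a directed chain:
    // p_t(b) = sum_a p_s(a) * P(x_t = b | x_s = a).
    ps = [0.7; 0.3];              // distribution at the parent vertex s
    P  = [0.9 0.1; 0.2 0.8];      // P(i,j) = P(x_t = j | x_s = i)
    pt = P' * ps;                 // distribution at the child vertex t
    disp(pt');                    // 0.69  0.31, still summing to 1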
17.1.3 Decomposable Trees

(See [Rip96] pp. 258-261.)

Definition 17.1.8. A graph is named triangulated (or chordal) if every cycle of four or more vertices/edges has a chord, i.e. an edge connecting two non-consecutive vertices.

Definition 17.1.9. A join tree associated with a graph is a tree having the graph's cliques as its vertices, connected in such a way that, for any vertex of the graph, the cliques containing it form a connected subtree of the join tree. Consequently, considering two cliques containing the same vertex (of the graph), all cliques sitting on the path between the two, in the join tree, contain that vertex.

For a triangulated graph, the join tree may be built according to the following procedure:
1. Number the graph vertices by maximum cardinality search: starting at any point, number next the vertex with the maximum number of already-numbered neighbors.
2. Starting with the highest-numbered vertex, check whether all its lower-numbered neighbors are also neighbors among themselves. If they are, the original graph is triangulated; if they are not, the graph may be triangulated by adding the missing edges.
3. Identify all cliques and order them by the highest-numbered vertex in each clique.
4. Build the join tree by connecting each clique with a predecessor sharing the greatest number of common vertices.

The above building procedure also ensures that the (unique) path, in the join tree, from clique C_1 to some C_i passes through cliques having increasing order numbers.

For i ≥ 2, let C_{j(i)} be the parent of C_i in the join tree, and let S_i = C_i ∩ (C_1 ∪ ... ∪ C_{i−1}). Then S_i ⊆ C_{j(i)}.

Proof. As j(i) is one of 1, ..., i−1, C_{j(i)} contains all vertices of C_i ∩ C_{j(i)}. It is not possible to have a vertex s such that s ∈ C_i ∩ C_k with k ≠ j(i) and s ∉ C_{j(i)}, because of the way the join tree is built (steps 3 and 4). Alternatively: as s ∈ C_i ∩ C_k, s is contained by each clique on the path between C_i and C_k (a direct consequence of the definition, see above), and this path must pass through C_{j(i)}, the parent of C_i, since it is a path in a tree and, by the way the cliques have been numbered, C_k lies somewhere above C_i (on the way to the root) or on another branch. ∎

Let H_i = C_1 ∪ ... ∪ C_{i−1} and R_i = C_i \ S_i. Then ∂R_i ∩ H_i is a complete subgraph.

Proof. Consider a vertex s ∈ ∂R_i: either s ∈ S_i or s ∈ ∂C_i. Now consider s ∈ ∂R_i ∩ H_i: from the previous reasoning, either s ∈ S_i ∩ H_i = S_i or s ∈ ∂C_i ∩ H_i ⊆ S_i. In either case s ∈ S_i ⊆ C_{j(i)}, which is complete. ∎

Proposition 17.1.2. S_i separates R_i from H_i \ S_i, i.e. R_i ⊥ (H_i \ S_i) | S_i.

Proof. Consider a path P from R_i to H_i which contains a vertex s ∈ R_j for some j > i, with s ∉ R_k for k > j (it may happen that there is no such k > j). Let r and t be two vertices before and after s on P, such that r, t ∉ R_j, with r on the R_i side and t on the H_i side (P starts somewhere in R_i and ends in H_i). By the way r and t were selected, r, t ∈ ∂R_j. Being in the vicinity of R_j, r and t are either in S_j ⊆ H_j or in ∂C_j ⊆ H_j; thus r, t ∈ ∂R_j ∩ H_j. As ∂R_j ∩ H_j is complete, the edge (r, t) exists.

If the edge (r, t) exists, then r and t are members of the same clique C_k. As r ∈ R_i then k ≥ i and, as r, t ∈ H_j, then k < j. If k > i, then r, t ∈ C_k ∩ H_k = S_k ⊆ C_ℓ, where ℓ = j(k) < k. Repeating this procedure as necessary, with C_ℓ in place of C_k, one eventually reaches ℓ = i, and thus t ∈ C_i ∩ H_i = S_i; i.e. every such path meets S_i, so S_i separates R_i and H_i \ S_i. ∎
Proposition 17.1.3. A distribution which is decomposable with respect to a graph G can be written as the product of the distributions of the cliques C_i of G, divided by the product of the distributions of their intersections S_i:

P(x_V) = ∏_i P(x_{C_i}) / ∏_i P(x_{S_i})

known as the set-chain/marginal representation. If any denominator term is zero then the whole expression is considered zero.

Proof. For a given j: ∪_{i<j} R_i = ∪_{i<j} C_i = H_j, and then

P(x_V) = ∏_i P(x_{R_i} | x_{R_1}, ..., x_{R_{i−1}}) = ∏_i P(x_{R_i} | x_{H_i})

and, as x_{R_i} ⊥ x_{H_i \ S_i} | x_{S_i} (see proposition 17.1.2), then P(x_V) = ∏_i P(x_{R_i} | x_{S_i}). Now C_i = R_i ∪ S_i, so P(x_{C_i}) = P(x_{R_i}, x_{S_i}) = P(x_{R_i} | x_{S_i}) P(x_{S_i}), and the final result is obtained by substituting back into P(x_V). ∎

Proposition 17.1.4. Consider the sets of cliques 𝒞_A and 𝒞_B, in the join tree, separated by 𝒞_C. If 𝒞_A contains the set of vertices A, 𝒞_B contains B and 𝒞_C contains the set of vertices C, then x_A ⊥ x_B | x_C, i.e. A and B are separated by C.

Proof. It is first proven, by contradiction, that C separates A \ C and B \ C on G. Assume there is a path from v_0 ∈ A to v_n ∈ B passing through vertices v_1, ..., v_{n−1} ∉ C. Consider v_0 ∈ C_0 ∈ 𝒞_A (where C_0 is some clique containing v_0). Consider some vertex v_j such that v_{j−1}, v_j ∈ C_j (for some clique C_j; note that (v_{j−1}, v_j) is an edge, so there is some clique containing it). As v_{j−1}, v_j ∉ C, then C_{j−1}, C_j ∉ 𝒞_C. Then, in the join tree, the path from C_{j−1} to C_j does not pass through 𝒞_C, as all cliques on it contain the vertex v_{j−1} (by the way the join tree was built). In this way, by repeating the procedure (with v_{j−2}, v_{j−1}, etc.), it is possible to build a path from 𝒞_A to 𝒞_B not passing through 𝒞_C. Contradiction. Finally, by using the global Markov property, the proposition is demonstrated. ∎

✍ Remarks:
➥ Instead of working on the original graph G, it is possible to work on the triangulated G^M (obtained from G by the procedure previously explained). As G^M has all the edges of G plus (maybe) some more, the separation properties on G^M hold on the original G.

➧ 17.2 Causal Networks

(See [Rip96] pp. 261-265.)

Basically, causal networks are represented by DAGs. The vertices are numbered according to the topological sort order, i.e. each vertex has an associated number i greater than the order number of its parent j(i). Then, considering the graph, the probability of x_V is:

P(x_V) = P(x_1) ∏_{i>1} P(x_i | x_{j(i)})    (17.4)

(of course, the root having no parent, its distribution is unconditioned). The directions being from the root to the leaves, this also means that:

x_i ⊥ x_k | x_{j(i)} for k < i

a DAG having the above property being also named a recursive model.

Definition 17.2.1. The moral graph of a DAG is built following the procedure:
1. All directed edges are replaced by undirected ones.
2. All the (common) parents of a vertex are joined by edges (added if necessary).

Proposition 17.2.1. A recursive model on a DAG has the global Markov property and a potential representation, on its associated moral graph, of the form (17.1).

Proof. The potential representation is built as follows:
1. The potential of each clique is set to 1.
2. For each vertex i, select a clique (any one) which contains both i and its parent j(i), and multiply its potential by P(x_i | x_{j(i)}).
In this way the expression (17.4) is transformed into a potential representation of the form (17.1). By using proposition 17.1.1 it follows that the graph has the global Markov property. ∎
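For simulations, the moralization procedure may be performed directly on the adjacency matrix of the DAG. A small Scilab sketch (the matrix encoding is our own choice): since (AA^T)_{ik} counts the common children of vertices i and k, the term AA^T "marries" the common parents.

    // Moral graph of a small DAG, A(i,j) = 1 for an edge i -> j.
    A = [0 0 1; 0 0 1; 0 0 0];    // DAG: 1 -> 3 and 2 -> 3
    M = A + A' + A * A';          // drop directions, join common parents
    M = M - diag(diag(M));        // remove self-loops
    M = bool2s(M > 0);            // 0/1 adjacency of the moral graph
    disp(M);                      // the edge (1,2) has appeared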
➧ 17.3 The Boltzmann Machine

(See [Rip96] pp. 279-281.)

The Boltzmann machine has binary random variables associated with the vertices of a completely connected graph. The probability of a variable x_i associated to vertex i is obtained considering a regression over all the other vertices. The joint distribution is defined as:

P(x_V) = (1/Z) exp(∑_{i<j} w_{ij} x_i x_j),  where  Z = ∑_{x_V} exp(∑_{i<j} w_{ij} x_i x_j)

The parameters w_{ij} are the "connection weights", symmetric (w_{ij} = w_{ji}); Z is the normalization constant (obtained from ∑_{x_V} P(x_V) = 1).

The Boltzmann machine learns the joint distribution of some inputs x_I and outputs x_O; some vertices x_H are "hidden". The joint probability over the (given) inputs and outputs x_S = x_I ∪ x_O is obtained by summation over all possibilities for the hidden vertices:

P(x_S) = ∑_{x_H} P(x_V)

The problem is to find the weights w_{ij} given the training set. This is achieved by a gradient ascent method applied to the log-likelihood function:

L = ∑_{training set} ln P(x_V) = ∑_{training set} [ln ∑_{x_H} exp(∑_{i<j} w_{ij} x_i x_j) − ln Z]

The derivative of ln Z is:

∂ln Z/∂w_{ij} = (1/Z) ∑_{x_V} x_i x_j exp(∑_{i<j} w_{ij} x_i x_j) = P(x_i = 1, x_j = 1)

as all terms for which x_i = 0 or x_j = 0 cancel. Considering L_1, the contribution of just one training pattern to the log-likelihood function:

∂L_1/∂w_{ij} = ∑_{x_H} x_i x_j exp(∑ w_{ij} x_i x_j) / ∑_{x_H} exp(∑ w_{ij} x_i x_j) − ∂ln Z/∂w_{ij} = P(x_i = 1, x_j = 1 | x_S) − P(x_i = 1, x_j = 1)

and, for the whole log-likelihood:

∂L/∂w_{ij} = ∑_{training set} [P(x_i = 1, x_j = 1 | x_S) − P(x_i = 1, x_j = 1)]

✍ Remarks:
➥ To evaluate the above expression, two simulations are necessary: one for P(x_i = 1, x_j = 1) and one for P(x_i = 1, x_j = 1 | x_S) (with "clamped" inputs and outputs). The resulting algorithm is very slow.
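The source of the slowness is visible if the quantities involved are computed exactly: both Z and P(x_i = 1, x_j = 1) require a sum over all 2^n configurations, feasible only for toy networks (in practice these averages are estimated by stochastic simulation). A brute-force Scilab sketch (the function name is ours):

    // Exact Z and P(x_i=1, x_j=1) for a small Boltzmann machine.
    function [Z, Pij] = boltzmann(W, i, j)
      n = size(W, 1);
      Z = 0; Pij = 0;
      for k = 0:2^n - 1
        x = zeros(n, 1);
        for b = 1:n                        // k-th binary configuration
          x(b) = modulo(floor(k / 2^(b-1)), 2);
        end
        e = exp(x' * triu(W, 1) * x);      // sum of w_ij*x_i*x_j over i < j
        Z = Z + e;
        if x(i) == 1 & x(j) == 1 then
          Pij = Pij + e;
        end
      end
      Pij = Pij / Z;                       // P(x_i = 1, x_j = 1)
    endfunction

The 2^n loop is exactly the exponential cost mentioned above.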
Advanced Topics

CHAPTER 18 Matrix Operations on ANN

➧ 18.1 New Matrix Operations

As already seen, ANN involve heavy manipulation of large sets of numbers. It is most convenient to manipulate them as matrices or vectors (column matrices), and it is possible to write fully matrix formulas for many ANN algorithms. However, some operations are the same across several algorithms, and it makes sense to introduce new matrix operations for them, in order to avoid unnecessary operations and waste of storage space in digital simulations.

Definition 18.1.1. The addition/subtraction on rows/columns between a constant and a vector, or a vector and a matrix, is defined as follows:
a. Addition/subtraction between a constant and a vector:

a ⊕_C x ≡ a1̂ + x,  a ⊖_C x ≡ a1̂ − x,  a ⊕_R x^T ≡ a1̂^T + x^T,  a ⊖_R x^T ≡ a1̂^T − x^T

b. Addition/subtraction between a vector and a matrix:

x^T ⊕_R A ≡ 1̂x^T + A,  x^T ⊖_R A ≡ 1̂x^T − A,  x ⊕_C A ≡ x1̂^T + A,  x ⊖_C A ≡ x1̂^T − A

✍ Remarks:
➥ These operations avoid an unnecessary expansion of a constant/vector to a vector/matrix before doing an addition/subtraction.
➥ The addition operations defined above are commutative.
➥ When the operation involves a constant, it represents in fact an addition/subtraction applied to all elements of the vector/matrix. In this situation ⊕_R is practically equivalent to ⊕_C, and ⊖_R to ⊖_C (and they could be replaced with something simpler, e.g. ⊕ and ⊖). However, it seems that not introducing separate operations for this case keeps the formulas simpler and easier to follow.

Definition 18.1.2. The Hadamard product between a matrix and a vector (row/column matrix) is defined as follows:

x^T ⊙_R A ≡ (1̂x^T) ⊙ A  and  x ⊙_C A ≡ (x1̂^T) ⊙ A

✍ Remarks:
➥ These operations avoid the expansion of vectors to matrices before doing the Hadamard product.
➥ They seem to fill a gap between the multiplication of a matrix by a constant and the Hadamard product.

Definition 18.1.3. The (meta)operator H takes as arguments two matrices of the same size and three operators. Depending on the sign of the elements of the first matrix, it applies one of the three operators to the corresponding elements of the second matrix. It returns the second matrix, updated. Assuming that the two matrices are A and B, and the operators are f_+, f_0 and f_−, then H{A, B; f_+, f_0, f_−} = B', the elements of B' being:

b'_{ij} = f_+(b_{ij}) if a_{ij} > 0;  b'_{ij} = f_0(b_{ij}) if a_{ij} = 0;  b'_{ij} = f_−(b_{ij}) if a_{ij} < 0

where a_{ij}, b_{ij} are the elements of A, respectively B.

✍ Remarks:
➥ While H could be replaced by some operations with the sign function, that wouldn't be as efficient when used in simulations, and H may be used in several ANN algorithms.
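As an illustration of how directly these definitions translate into a simulator, a few of them are sketched below in Scilab (the function names are our own; a real implementation would be vectorized and optimized, as argued in section 18.3):

    // x (+_C) A : add the column vector x to every column of A (def. 18.1.1)
    function B = addcol(x, A)
      B = x * ones(1, size(A, 2)) + A;
    endfunction

    // x (._C) A : Hadamard product by columns (definition 18.1.2)
    function B = hadcol(x, A)
      B = (x * ones(1, size(A, 2))) .* A;
    endfunction

    // The H meta-operator of definition 18.1.3; fp, f0, fm are functions.
    function Bout = H(A, B, fp, f0, fm)
      Bout = B;
      for i = 1:size(A, 1)
        for j = 1:size(A, 2)
          if A(i, j) > 0 then
            Bout(i, j) = fp(B(i, j));
          elseif A(i, j) == 0 then
            Bout(i, j) = f0(B(i, j));
          else
            Bout(i, j) = fm(B(i, j));
          end
        end
      end
    endfunction

The two loops in H are only for clarity; in practice the three cases would be handled with vectorized boolean masks.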
➧ 18.2 Algorithms

18.2.1 Backpropagation

Plain backpropagation. Using definition 18.1.1, formulas (2.7b) and (2.7c) are written as:

∇_{z_ℓ}E = c W_{ℓ+1}^T [∇_{z_{ℓ+1}}E ⊙ z_{ℓ+1} ⊙ (1 ⊖_C z_{ℓ+1})]  for ℓ = 1, ..., L−1
(∇E)_ℓ = c [∇_{z_ℓ}E ⊙ z_ℓ ⊙ (1 ⊖_C z_ℓ)] z_{ℓ−1}^T  for ℓ = 1, ..., L

and equations (2.10b) and (2.10c) become the analogous expressions in the augmented (bias-including) quantities, marked with a tilde:

∇_{z_ℓ}E = c W̃_{ℓ+1}^T [∇_{z_{ℓ+1}}E ⊙ z_{ℓ+1} ⊙ (1 ⊖_C z_{ℓ+1})]  for ℓ = 1, ..., L−1
(∇E)_ℓ = c [∇_{z_ℓ}E ⊙ z_ℓ ⊙ (1 ⊖_C z_ℓ)] z̃_{ℓ−1}^T  for ℓ = 1, ..., L

Backpropagation with momentum. Formulas (2.12a), (2.12b) and (2.12c) change in the same way, the derivative terms being expressed through the new operations:

(∇E)_{ℓ,pseudo} = c [∇_{z_ℓ}E ⊙ f'(a_ℓ)] z̃_{ℓ−1}^T
(∇E)_{ℓ,pseudo} = c {∇_{z_ℓ}E ⊙ [z_ℓ ⊙ (1 ⊖_C z_ℓ) ⊕_C c_f]} z̃_{ℓ−1}^T

(c_f being the flat-spot elimination constant).

Adaptive backpropagation. From (2.13), and using definition 18.1.3, (2.14) is replaced by:

η(t) = H{ΔW(t) ⊙ ΔW(t−1), η(t−1); ·I, ·I, ·D}

i.e. the learning rates are multiplied by the increase factor I where the weight updates kept their sign (or one of them is zero) and by the decrease factor D where they changed sign.

SuperSAB. Using the H operator, the SuperSAB rules may be written as:

η(t) = H{ΔW(t) ⊙ ΔW(t−1), η(t−1); ·I, ·I, ·D}
ΔW(t+1) = −η(t) ⊙ ∇E − H{ΔW(t) ⊙ ΔW(t−1), ΔW(t); ·0, ·0, ·1}

(note that the product ΔW(t) ⊙ ΔW(t−1) is used twice, and it is probably better to calculate it just once, before applying these formulas).

18.2.2 SOM/Kohonen Networks

The algorithms depend heavily on the dW/dt equation chosen to model the learning process. The trivial equation (3.1) is changed to:

dW/dt = a (x^T ⊖_R W)

The Riccati equation (3.5) becomes:

dW/dt = x^T ⊖_R (Wx) ⊙_C W  (see the proof of (3.5))

The more general equations (3.2.1) and (3.2.2) become:

dW/dt = x^T ⊖_R (Wx) ⊙_C W  and  dW/dt = Wxx^T − (Wx) ⊙_C W

The trivial model with a neuronal neighborhood and a stop condition (3.22) will be written as:

W(t+1) = W(t) + η(t) h ⊙_C (x^T ⊖_R W)

where h is the column vector of neighborhood-function values, centered on the winning neuron K.

✍ Remarks:
➥ The above equations are just examples. There are many possible variations of SOM/Kohonen networks, and it is very easy to build one's own equations according to the model chosen.

18.2.3 BAM/Hopfield Networks

BAM networks. The (4.3) formulas change to:

x(t+1) = H{W^T y(t), x(t); set to +1, keep, set to −1}
y(t+1) = H{W x(t+1), y(t); set to +1, keep, set to −1}

and, for working in reverse, (4.4) become:

y(t+1) = H{W x(t), y(t); set to +1, keep, set to −1}
x(t+1) = H{W^T y(t+1), x(t); set to +1, keep, set to −1}

The final algorithm changes accordingly.

Discrete Hopfield memory. Formula (4.6) transforms to:

y(t+1) = H{W y(t) + x − θ, y(t); set to +1, keep, set to 0}

(θ being the threshold vector).

Continuous Hopfield memory. (4.10) may be written as:

y(t+1) = y(t) + Δt {W y(t) + x − (1/λ) ln[y(t) ⊘ (1 ⊖_C y(t))]} ⊙ y(t) ⊙ (1 ⊖_C y(t))

Here ⊘ signifies the element-wise division between y(t) and 1 ⊖_C y(t). The ln function follows the convention used in this book for scalar functions applied to vectors: it applies to each vector element in turn.

➧ 18.3 Conclusions

The new matrix operations seem to be justified by their usage across several very different network architectures; table 18.1 shows this usage.

Operation | ANN architectures
⊕_R      | (none)
⊖_R      | SOM
⊕_C      | momentum
⊖_C      | backpropagation, momentum, SOM, continuous Hopfield
⊙_R      | (none)
⊙_C      | SOM
H         | adaptive backpropagation, SuperSAB, BAM, discrete Hopfield

Table 18.1: The usage of the new matrix operations across ANN architectures.

It may be seen that two operations, ⊕_R and ⊙_R, were not used in the ANN algorithms studied here. However, they were defined because:
• there are many other, yet unchecked, algorithms in which they may find a usage;
• together with the rest of the operations they form a complete (symmetrical) system, allowing for a large variety of matrix/vector manipulations.

The fact that such different ANN architectures could be expressed in terms of fully matrix equations strongly suggests that many other algorithms (if not all) may be converted to full matrix formulas. One other operator, the element-wise "Hadamard division" ⊘, also seems to be useful; it represents the "opposite" of the Hadamard product, possibly filling a gap in matrix operations.

The usage of matrix formulas in numerical simulations has the following advantages:
• it splits the difficulty of implementation onto two levels: a lower one, involving the matrix operations, and a higher one, involving the ANNs themselves;
• it leads to code reuse with respect to the matrix operations;
• it makes the implementation of new ANNs easier, once the basic matrix operations have already been implemented;
• ANN algorithms expressed through the new matrix operations do not lead to unnecessary operations or waste of memory;
• it makes heavy optimization of the basic matrix operations more desirable, as the resulting code is reusable; see [Mos97] and [McC97] for some ideas regarding optimizations;
• it makes debugging of ANN implementations easier.

In order to take full advantage of the matrix formulas, some supplemental support may be necessary:
• scalar functions, when applied to vectors, should in fact apply to each element in turn;
• some algorithms use the summation over all elements of a vector, i.e. an operation of the type x^T 1̂.

APPENDIX A Mathematical Sidelines

➧ A.1 Distances

A.1.1 Euclidean Distance

Let x, y ∈ R^n be two real vectors of dimension n ∈ N*. The Euclidean distance d between the vectors x and y is defined as:

d = ‖x − y‖ = √(∑_{i=1}^n (x_i − y_i)²)    (A.1)

Also, considering the vectors as column matrices, d = √((x − y)^T (x − y)).

A.1.2 Hamming Distance

The Hamming space of dimension n is defined as:

H^n = {x^T = (x_1 ... x_n) ∈ R^n | x_i ∈ {−1, 1}}

so in a Euclidean space the Hamming space can be represented as a set of 2^n points at equal distance from the origin (the corners of a hyper-cube). The Hamming distance between 2 (Hamming) points x and y is defined as:

h = ‖x − y‖_H = ∑_{i=1}^n δ(x_i, y_i),  where δ(x_i, y_i) = 0 if x_i = y_i and 1 if x_i ≠ y_i

i.e. h represents the number of mismatched components of x and y.

✍ Remarks:
➥ For 2 Euclidean vectors x and y subject to the restriction x_i, y_i ∈ {−1, 1}, (x_i − y_i)² = 0 if x_i = y_i and 4 if x_i ≠ y_i; then (see (A.1)) h = d²/4.
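The remark above is easily checked numerically, e.g. in Scilab:

    // Check of h = d^2 / 4 for two {-1,+1} vectors.
    x = [1; -1; 1; 1; -1];
    y = [1; 1; -1; 1; -1];
    h  = sum(bool2s(x <> y));     // Hamming distance: mismatched components
    d2 = sum((x - y).^2);         // squared Euclidean distance
    disp([h, d2/4]);              // both are 2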
➧ A.2 Generalized Spherical Coordinates

Considering an n-dimensional space, it is possible to define the position of an arbitrary point by means of n angles and a distance: {θ_1, ..., θ_n, r}. r represents the distance from the (current) point to the origin of the coordinate system, while the angles θ_i are measured between the position vector and the axes of a Cartesian orthogonal system. See figure A.1.

Figure A.1: The generalized spherical coordinates. The angles θ_i are measured from the position vector of the point to the axes of a Cartesian system of coordinates.

✍ Remarks:
➥ Note that the system {r, θ_1, ..., θ_n} has n + 1 elements, and thus the coordinates are not independent. By using the Pythagorean theorem repeatedly:

|r|² = (r cos θ_1)² + ... + (r cos θ_n)²  ⟹  ∑_{i=1}^n cos² θ_i = 1

➧ A.3 Properties of Symmetric Matrices

(See [Bis95] pp. 440-443.)

A matrix A is called symmetric if it is square and equal to its transpose, A = A^T, i.e. a_{ij} = a_{ji}.

Proposition A.3.1. The inverse of a symmetric matrix, if it exists, is also symmetric.

Proof. It is assumed that the inverse A^{−1} does exist. Then, by definition, A^{−1}A = I, where I is the unit matrix. For any two matrices A and B it is true that (AB)^T = B^T A^T. Applying this result gives A^T (A^{−1})^T = I. Finally, multiplying by A^{−1} to the left, and knowing that A^T = A, it follows that (A^{−1})^T = A^{−1}. ∎

A.3.1 Eigenvectors and Eigenvalues

Assume that there is a set of eigenvectors {u_i}_{i=1,n} and a corresponding set of eigenvalues {λ_i}_{i=1,n} such that:

A u_i = λ_i u_i,  i = 1, ..., n    (A.2)

The eigenvalues are found from the general equation:

A x = λx  ⟺  (A − λI)x = 0̂

If the matrix A − λI had an inverse then, by multiplying the above equation by this inverse, it would follow that x = 0̂, i.e. all eigenvectors are zero. To avoid this situation it is necessary to impose the condition that A − λI is not invertible, i.e. that its determinant is null:

|A − λI| = 0

and this leads to the characteristic polynomial of A, of the n-th degree in λ, whose roots are the eigenvalues. The set {λ_i}_{i=1,n} is also named the spectrum of A.

Proposition A.3.2. If two eigenvectors are parallel then they represent the same eigenvalue, assuming that they are non-zero.

Proof. Assume, by contradiction, that the proposition is not true, i.e. there are two parallel eigenvectors u_1 ∥ u_2, u_1 = αu_2 with α a non-zero constant, such that Au_1 = λ_1 u_1, Au_2 = λ_2 u_2 and λ_1 ≠ λ_2. But then Au_1 = αAu_2 = αλ_2 u_2 = λ_2 u_1, so (λ_1 − λ_2)u_1 = 0̂ and, as u_1 ≠ 0̂, the two eigenvalues are equal, a contradiction. ∎

The eigenvectors are defined up to a multiplicative constant: if a vector u is an eigenvector, then αu is also an eigenvector for the same eigenvalue (α being some constant).

Consider two arbitrarily chosen eigenvectors u_i and u_j. Multiplying (A.2) by u_j^T, and the equation Au_j = λ_j u_j by u_i^T:

u_j^T A u_i = λ_i u_j^T u_i  and  u_i^T A u_j = λ_j u_i^T u_j

Considering that A is symmetric, u_j^T A u_i = u_i^T A u_j (using (AB)^T = B^T A^T), and u_j^T u_i = u_i^T u_j (whatever u_i and u_j). By subtracting the two above equations:

(λ_i − λ_j) u_i^T u_j = 0

Two situations arise:
• λ_i ≠ λ_j: then u_i^T u_j = 0, i.e. u_i ⊥ u_j; the vectors are orthogonal.
• λ_i = λ_j: by substituting λ_i, respectively λ_j, into (A.2) and adding the two equations obtained, any linear combination αu_i + βu_j of the two eigenvectors is also an eigenvector. Because the two vectors are not parallel, they define a plane, and a pair of orthogonal vectors may be chosen as linear combinations of the two. The same reasoning may be applied for more than 2 equal eigenvalues.
Considering the above discussion, the eigenvector set {u_i} may easily be normalized such that u_i^T u_j = δ_{ij}, ∀i, j. Also, the associated matrix U, built using the {u_i} as columns, is orthogonal: U^T U = U U^T = I, i.e. U^T = U^{−1}.

Proposition A.3.3. The inverse of the matrix A has the same eigenvectors and the 1/λ_i eigenvalues, i.e. A^{−1} u_i = (1/λ_i) u_i.

Proof. Multiplying (A.2) by A^{−1} to the left, and using A^{−1}A = I, gives u_i = λ_i A^{−1} u_i. ∎

✍ Remarks:
➥ The A matrix may be diagonalized. From Au_i = λ_i u_i and, by multiplying with u_j^T to the left, u_j^T A u_i = λ_i δ_{ij}; in matrix notation, U^T A U = Λ, where Λ = diag(λ_1, ..., λ_n). Then:

|U^T A U| = |U^T| |A| |U| = |U^T U| |A| = |I| |A| = |A| = |Λ| = ∏_{i=1}^n λ_i

Proposition A.3.4. Rayleigh quotient. For A a symmetric square matrix and a any vector, it is true that:

(a^T A a)/‖a‖² ≤ λ_max

where λ_max = max_i λ_i is the maximum eigenvalue, the Euclidean metric being used here.

Proof. The above relation should be unaffected by a coordinate transformation, so the coordinate transformation defined by the matrix U may be used. Let ã = U^T a be the new representation of the vector a; the (Euclidean) norm remains unchanged: ‖ã‖² = ã^T ã = a^T U U^T a = ‖a‖² (using U U^T = I and (AB)^T = B^T A^T). As U^T A U = Λ, the inequality is then equivalent to:

ã^T Λ ã = ∑_i λ_i ã_i² ≤ λ_max ∑_i ã_i² = λ_max ‖a‖²

which obviously is true. ∎

Definition A.3.1. A matrix A is positive definite if x^T A x > 0, ∀x ≠ 0̂. From (A.2), considering an orthonormal set of eigenvectors and multiplying to the left by u_i^T, λ_i = u_i^T A u_i; the eigenvalues of a positive definite matrix are therefore positive.

A.3.2 Rotation

Proposition A.3.5. If the (orthogonal) matrix U is used for a coordinate transformation, the result is a rotation.

Proof. Let x̃ = U^T x. Then ‖x̃‖² = x̃^T x̃ = x^T U U^T x = ‖x‖² (use (AB)^T = B^T A^T), i.e. the length of the vector is not changed. Let x̃_1 = U^T x_1 and x̃_2 = U^T x_2 be two transformed vectors. Then x̃_1^T x̃_2 = x_1^T U U^T x_2 = x_1^T x_2, i.e. the angle between two vectors is preserved. The only transformation preserving both lengths and angles is the rotation. ∎

A.3.3 Quadratic Forms

A quadratic form is of the type:

F(x) = x^T A x

where A is a square symmetric matrix. By using the eigenvectors of A, the function F(x) becomes:

F(x) = x^T A x = x^T U U^T A U U^T x = x̃^T (U^T A U) x̃ = x̃^T Λ x̃ = ∑_{i=1}^n λ_i x̃_i²

(because U U^T = I, x̃ = U^T x and U^T A U = Λ).
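These properties can be verified numerically; e.g. in Scilab, for a small symmetric matrix:

    // Check of U' * A * U = Lambda and of the Rayleigh bound.
    A = [2 1; 1 2];                // symmetric, eigenvalues 1 and 3
    [U, L] = spec(A);              // eigenvectors as columns of U
    disp(U' * A * U);              // diagonal matrix of the eigenvalues
    a = [1; 2];
    disp((a' * A * a) / (a' * a)); // 2.8, below lambda_max = 3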
➧ A.4 The Gaussian Integrals

(See [Bis95] pp. 444-447.)

A.4.1 The Unidimensional Case

Let I = ∫_{−∞}^∞ e^{−x²} dx. Then I² = ∫_{−∞}^∞ e^{−x²} dx ∫_{−∞}^∞ e^{−y²} dy = ∬_{R²} e^{−(x²+y²)} dS, where dS = dx dy is the elementary surface. By switching to polar coordinates (see figure A.2), x² + y² = r² and dS = r dr dφ; the integral becomes:

I² = ∫_0^∞ ∫_0^{2π} e^{−r²} r dr dφ = 2π ∫_0^∞ r e^{−r²} dr = 2π (−e^{−r²}/2) |_0^∞ = π

Figure A.2: Polar coordinates: the surface element dS.

Finally, ∫_{−∞}^∞ e^{−x²} dx = √π and ∫_0^∞ e^{−x²} dx = √π/2, because e^{−x²} is an even function (same value for x and −x).

A.4.2 The Multidimensional Case

Let I = ∫_{R^n} exp(−x^T A x/2) dx, where A is an n×n square and symmetric matrix and x ∈ R^n (dx = dx_1 dx_2 ... dx_n). Since A is symmetric, it is possible to build an orthonormal set of eigenvectors {u_i}_{i=1,n} such that u_i^T u_j = δ_{ij} (see section A.3), and then the x vector may be written as x = ∑_{i=1}^n α_i u_i. The change of variables from {x_i}_{i=1,n} to {α_i}_{i=1,n} is done. Then x^T A x = ∑_{i=1}^n λ_i α_i², where {λ_i}_{i=1,n} are the eigenvalues of A, and dx = |J| dα_1 ... dα_n.

∂x_i/∂α_j = u_{ij}, where u_{ij} is the i-th element of the vector u_j, and, because the set {u_j}_{j=1,n} is orthonormal, the associated matrix U satisfies U^T U = I (the matrix is orthogonal, see section A.3); the Jacobian determinant |J| = |{∂x_i/∂α_j}_{ij}| becomes:

|J|² = |U|² = |U^T| |U| = |U^T U| = |I| = 1  ⟹  |J| = 1

(the integrand exp(−x^T A x/2) is positive over the whole space R^n, so the integral is positive, and the solution |J| = −1 is not acceptable). Finally the integral becomes:

I = ∏_{i=1}^n ∫_{−∞}^∞ exp(−λ_i α_i²/2) dα_i = ∏_{i=1}^n √(2π/λ_i)

Because |A| = ∏_{i=1}^n λ_i, then I = (2π)^{n/2}/√|A|.

A.4.3 The Multidimensional Gaussian Integral with a Linear Term

Let I = ∫_{R^n} exp(−x^T A x/2 + c^T x) dx, where A is an n×n square and symmetric matrix, x ∈ R^n, and c ∈ R^n is a constant vector (dx = dx_1 dx_2 ... dx_n). Let {u_i}_{i=1,n} be the set of orthonormal eigenvectors of A. The c vector may be written by means of the eigenvector set as c = ∑_{i=1}^n c_i u_i, where c_i = c^T u_i (as u_i^T u_j = δ_{ij}) are called the projections of c on u_i.

Similarly to the multidimensional Gaussian integral (above), the integral may be transformed into a product of independent integrals:

I = ∏_{i=1}^n ∫_{−∞}^∞ exp(−λ_i α_i²/2 + c_i α_i) dα_i

A square is forced in the exponent, −λ_iα_i²/2 + c_iα_i = −(λ_i/2)(α_i − c_i/λ_i)² + c_i²/(2λ_i), such that the integral becomes:

I = ∏_{i=1}^n exp(c_i²/(2λ_i)) ∫_{−∞}^∞ exp[−(λ_i/2)(α_i − c_i/λ_i)²] dα_i

A new change of variables is done: α̃_i = α_i − c_i/λ_i; then dα̃_i = dα_i, the integration limits remain the same, and:

I = exp(∑_{i=1}^n c_i²/(2λ_i)) ∏_{i=1}^n ∫_{−∞}^∞ exp(−λ_i α̃_i²/2) dα̃_i

As for the multidimensional Gaussian integral, I = (2π)^{n/2}/√|A| · exp(∑_i c_i²/(2λ_i)). Because A^{−1}u_i = (1/λ_i)u_i (see section A.3), then c^T A^{−1} c = ∑_{i=1}^n c_i²/λ_i and, finally:

I = (2π)^{n/2}/√|A| · exp(c^T A^{−1} c/2)

➧ A.5 The Euler Functions

A.5.1 The Euler Function

The Euler function Γ_E(x) is defined as:

Γ_E(x) = ∫_0^∞ e^{−t} t^{x−1} dt    (A.3)

and is convergent for x > 0.

Proposition A.5.1. For the Euler function it is true that Γ_E(x + 1) = x Γ_E(x).

Proof. Integrating by parts:

Γ_E(x) = ∫_0^∞ e^{−t} t^{x−1} dt = e^{−t} t^x/x |_0^∞ + (1/x) ∫_0^∞ e^{−t} t^x dt = (1/x) Γ_E(x + 1)

∎

Proposition A.5.2. If n ∈ N then n! = Γ_E(n + 1), where 0! = 1 by definition.

Proof. It is demonstrated by mathematical induction. For n = 0: Γ_E(1) = ∫_0^∞ e^{−t} dt = −e^{−t}|_0^∞ = 1 = 0!, and for n = 1, by using proposition A.5.1, Γ_E(2) = 1 · Γ_E(1) = 1 = 1!. It is assumed that n! = Γ_E(n + 1) is true, and then:

(n + 1)! = (n + 1) · n! = (n + 1) Γ_E(n + 1) = Γ_E(n + 2)

i.e. the equation (n + 1)! = Γ_E(n + 2) is also true. ∎
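A frequently used particular value links the Euler function to the Gaussian integral of section A.4: substituting t = u² in (A.3) gives

Γ_E(1/2) = ∫_0^∞ e^{−t} t^{−1/2} dt = 2 ∫_0^∞ e^{−u²} du = √π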
A.5.2 The Sphere Volume in the n-Dimensional Space

It is assumed that the volume V_n of a sphere of radius r, in an n-dimensional space, is proportional to the n-th power of the radius:

V_n = C_n r^n,  C_n = const.

where C_n is to be found. The integral:

I^n = ∫_{−∞}^∞ ... ∫_{−∞}^∞ exp[−a(x_1² + ... + x_n²)] dx_1 ... dx_n,  a = const.

is calculated in two ways:
1. The integrals in I^n are decoupled, such that I^n = (∫_{−∞}^∞ e^{−ax²} dx)^n = (π/a)^{n/2}.
2. The change of variables from Cartesian coordinates to generalized spherical coordinates is performed: x_1² + ... + x_n² = r² and dx_1 ... dx_n = dV_n = nC_n r^{n−1} dr, where the elementary volume in spherical coordinates may be taken as an infinitesimal spherical layer, due to the symmetry of the integrand relative to the origin. Then I^n becomes: I^n = nC_n ∫_0^∞ r^{n−1} e^{−ar²} dr.

A new change of variable is performed: ar² = x, so that dr = x^{−1/2} dx/(2√a) and r^{n−1} = x^{(n−1)/2}/a^{(n−1)/2}; the integral becomes:

I^n = [nC_n/(2a^{n/2})] ∫_0^∞ x^{n/2−1} e^{−x} dx = [nC_n/(2a^{n/2})] Γ_E(n/2) = (C_n/a^{n/2}) Γ_E(n/2 + 1)

By comparing the two results, C_n = π^{n/2}/Γ_E(n/2 + 1), and finally:

V_n = π^{n/2} r^n / Γ_E(n/2 + 1)
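As a check of the formula against the familiar cases: for n = 2, V_2 = π r²/Γ_E(2) = πr² (the disc); for n = 3, Γ_E(5/2) = (3/2)(1/2)Γ_E(1/2) = (3/4)√π, so V_3 = π^{3/2} r³/[(3/4)√π] = (4/3)πr³ (the ball), the value Γ_E(1/2) = √π having been obtained above.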
➧ A.6 The Lagrange Multipliers

(See [Bis95] pp. 448-450.)

The problem is to find the stationary points of a function f(x) subject to a relation between the components of the vector x, given as g(x) = 0. Geometrically, g(x) = 0 represents a surface in the space X^n, where n is the dimension of that space. At each point, ∇g represents a vector perpendicular to that surface, and ∇f may be expressed as ∇f = ∇_∥f + ∇_⊥f, where ∇_∥f is the component parallel to the surface g(x) = 0 and ∇_⊥f is the component perpendicular to it. See figure A.3.

Figure A.3: The gradient vectors ∇f and ∇g in a bidimensional space.

✍ Remarks:
➥ Considering a point in the vicinity of x on the g(x) = 0 surface, defined by the vector x + ε, where ε lies within the surface, the Taylor development around x is:

g(x + ε) = g(x) + ε^T ∇g(x)

and, on the other hand, g(x + ε) = g(x) = 0 because of the choice of ε. Then ε^T ∇g(x) = 0, i.e. ∇g(x) is perpendicular to the surface g(x) = 0.

As ∇_⊥f ∥ ∇g (see above), it is possible to write ∇_⊥f = −λ∇g, where λ is called the Lagrange multiplier or undetermined multiplier, and ∇_∥f = ∇f + λ∇g. The following Lagrange function is defined:

L(x, λ) = f(x) + λ g(x)

such that ∇L = ∇_∥f, and the condition for the stationary points of f is ∇L = 0̂. For the n-dimensional space, the ∇L = 0̂ condition gives n + 1 equations:

∂L/∂x_i = 0, i = 1, ..., n  and  ∂L/∂λ = g(x) = 0

and the constraint g(x) = 0 is also met. More generally, for a set of constraints g_i(x) = 0, i = 1, ..., m, the Lagrange function has the form:

L(x, λ_1, ..., λ_m) = f(x) + ∑_{i=1}^m λ_i g_i(x)

➧ A.7 Useful Mathematical Equations

(See [Str81] pp. 200-201.)

A.7.1 Combinatorics

Consider N different objects. The number of ways it is possible to choose n objects out of the N set is:

C_N^n = N! / [n!(N − n)!]

Considering the above expression:

C_N^{n−1} + C_N^n = N!/[(n−1)!(N−n+1)!] + N!/[n!(N−n)!] = (N+1)!/[n!(N−n+1)!]

representing the recurrence formula C_{N+1}^n = C_N^n + C_N^{n−1}.

A.7.2 Jensen's Inequality

Consider a convex function, i.e., in the convention used here, a function for which all points of a chord between any two points of the graph are "below" the graph of the function; see figure A.4. (The logarithm is such a function; in the opposite sign convention this property is usually named concavity, and the inequality below is reversed.)

Figure A.4: A convex function f. A chord between arbitrary points a and b is under the graph of the function.

Proposition A.7.1. Consider a convex function f, a set of N ≥ 2 points {x_i}_{i=1,N}, and a set of N numbers {λ_i}_{i=1,N} such that λ_i ≥ 0, ∀i, and ∑_{i=1}^N λ_i = 1. Then it is true that:

f(∑_{i=1}^N λ_i x_i) ≥ ∑_{i=1}^N λ_i f(x_i)

which is called Jensen's inequality.

Proof. First consider two points a and b and two numbers t and 1 − t, 0 ≤ t ≤ 1, such that they respect the conditions of the theorem. The points (a, f(a)) and (b, f(b)) define a chord whose equation is (the equation of a straight line passing through 2 points):

d(x) = [b f(a) − a f(b)]/(b − a) + [f(b) − f(a)]/(b − a) · x

Then, for any x ∈ [a, b], it is true that d(x) ≤ f(x); see also figure A.4. By expressing x in the form x = a + t(b − a), t ∈ [0, 1], and replacing it in the expression of d(x), this gives:

f(a) + t[f(b) − f(a)] ≤ f[a + t(b − a)]  ⟺  f[(1 − t)a + tb] ≥ (1 − t)f(a) + t f(b)

i.e. Jensen's inequality holds for two numbers (t and 1 − t).

Let c be a point inside the interval [a, b], f'_−(c) the derivative of f to the left of c, and f'_+(c) the derivative of f to the right of c. For a continuous derivative in c they are equal: f'_−(c) = f'_+(c) = f'(c). The expression [f(x) − f(c)]/(x − c) represents the tangent of the angle between the chord passing through the points (x, f(x)) and (c, f(c)) and the Ox axis; similarly, f'(c) represents the tangent of the angle made by the tangent to the graph. Let m be a number between the two one-sided derivatives at c. Because f is convex (in the sense above), it is true that:

[f(x) − f(c)]/(x − c) ≥ m for x < c  and  [f(x) − f(c)]/(x − c) ≤ m for x > c

(see also figure A.5). From the above equations it follows that f(x) ≤ m(x − c) + f(c), ∀x ∈ [a, b].

Figure A.5: A convex function f and its one-sided derivatives at a point c: f'_−(c) to the left and f'_+(c) to the right. The chords for x < c, respectively x > c, are drawn with dashed lines; f'_−(c), f'_+(c) and [f(x) − f(c)]/(x − c) are the tangents of the angles made with the Ox axis by the tangents at c, respectively by the chords.

Consider now a set of numbers {x_i}_{i=1,N}, x_i ∈ [a, b], and a set of parameters {λ_i}_{i=1,N}, λ_i ∈ [0, 1], such that ∑_{i=1}^N λ_i = 1. Then:

a ≤ x_i ≤ b  ⟹  λ_i a ≤ λ_i x_i ≤ λ_i b,  and, after a summation over i:  a ≤ ∑_i λ_i x_i ≤ b

Let c = ∑_{i=1}^N λ_i x_i ∈ [a, b]. Then:

f(x_i) ≤ m(x_i − c) + f(c)  ⟹  λ_i f(x_i) ≤ m(λ_i x_i − λ_i c) + λ_i f(c)

and Jensen's inequality is obtained by summation over i = 1, ..., N (the m terms cancel, as ∑_i λ_i x_i = c). ∎
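For instance, with f = ln (which satisfies the chord-below-graph condition used here), N = 2 and λ_1 = λ_2 = 1/2, Jensen's inequality reads ln[(x_1 + x_2)/2] ≥ (ln x_1 + ln x_2)/2, i.e. (x_1 + x_2)/2 ≥ √(x_1 x_2): the arithmetic-geometric mean inequality.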
A.7.3 The Stirling Formula

Proposition A.7.2. For n ∈ N, n ≫ 1, it is true that:

ln n! ≃ n ln n − n,  i.e.  n! ≃ (n/e)^n

Proof. The Euler function Γ_E(x + 1) = ∫_0^∞ e^{−t} t^x dt (see (A.3)) is estimated for x → ∞ by the method of the saddle point: the integrand is developed in series around its maximum and the superior-order terms are neglected. The derivative of the integrand e^{−t}t^x = exp(−t + x ln t) is zero at the maximum:

d/dt [exp(−t + x ln t)] = 0  ⟺  (−1 + x/t) exp(−t + x ln t) = 0

i.e. the maximum is at the point t = x (because the exp is never 0). The exponent is developed in series around t = x:

−t + x ln t = −x + x ln x + [(t − x)²/2!] d²/dt²(−t + x ln t)|_{t=x} + ... ≃ −x + x ln x − (t − x)²/(2x)

because d/dt(−t + x ln t)|_{t=x} = 0, and just the first terms from the series development are kept. Then:

Γ_E(x + 1) ≃ exp(x ln x − x) ∫_0^∞ exp[−(t − x)²/(2x)] dt

and the change of variable t → t − x is performed:

Γ_E(x + 1) ≃ exp(x ln x − x) ∫_{−x}^∞ exp(−t²/(2x)) dt

and, using the limit x → ∞:

Γ_E(x + 1) ≃ exp(x ln x − x) ∫_{−∞}^∞ exp(−t²/(2x)) dt

and finally, with another change of variable, s = t/√(2x) (see also section A.4):

Γ_E(x + 1) ≃ √(2x) exp(x ln x − x) ∫_{−∞}^∞ e^{−s²} ds = √(2πx) exp(x ln x − x)

For large x = n ∈ N: ln Γ_E(n + 1) = ln n! ≃ n ln n − n + ln √(2πn) ≃ n ln n − n. ∎

➧ A.8 Calculus of Variations

The change in a function f(x), when the variable x changes by a small amount δx, is:

δf = (df/dx) δx + O(δx²)

where O(δx²) represents the superior-order terms (depending at most on δx²). For a function of several variables f(x_1, ..., x_n) the above expression becomes:

δf = ∑_{i=1}^n (∂f/∂x_i) δx_i + O(δx²)

A functional E[f] is a form which takes a function as its variable and returns a value. Considering an arbitrary function δf(x), which has small values everywhere, the variation of E is (by similarity with δf, ∑ → ∫):

δE = E[f + δf] − E[f] = ∫_X (δE/δf(x)) δf(x) dx + O(δf²)    (A.4)

X being the space of x.

Proposition A.8.1. The fundamental lemma of the calculus of variations. The condition of stationarity of E, to the lowest order in δf, involves the requirement δE/δf = 0, assuming the continuity of δE/δf.

Proof. Stationarity, to the lowest order, involves δE = 0 and O(δf²) ≃ 0; (A.4) then gives ∫_X (δE/δf) δf dx = 0. Assume that there is an x̃ for which δE/δf|_{x̃} ≠ 0. Then the continuity condition implies that there is a whole vicinity [x̃_1, x̃_2] of x̃ such that δE/δf ≠ 0 and keeps its sign on [x̃_1, x̃_2]. As δf is arbitrary, it may be chosen such that δf = 0 outside [x̃_1, x̃_2] while δf ≠ 0 and keeps its sign inside. Then ∫_X (δE/δf) δf dx ≠ 0 and the lemma's assumptions are contradicted. ∎

✍ Remarks:
➥ A functional form used in regularization theory is:

E[f] = ∫_X [f² + (df/dx)²] dx    (A.5)

Then, by replacing f with f + δf in the above equation:

E[f + δf] = E[f] + ∫_X [2f δf + 2 (df/dx)(d(δf)/dx)] dx + O(δf²)

The term ∫_X (df/dx)(d(δf)/dx) dx, integrated by parts, gives:

∫_X (df/dx)(d(δf)/dx) dx = (df/dx) δf |_{boundaries} − ∫_X (d²f/dx²) δf dx

Considering the boundary term equal to 0, the variation of (A.5) becomes:

δE = ∫_X [2f − 2 d²f/dx²] δf dx + O(δf²)

By comparing the above equation with (A.4), it follows that δE/δf = 2f − 2 d²f/dx². Defining the operator D ≡ d/dx, the functional and its derivative may be written as:

E = ∫_X [f² + (Df)²] dx  and  δE/δf = 2f + 2 D̂ D f

where D̂ = −d/dx is the adjoint operator of D.
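As a classical worked example of the lemma: take E[f] = ∫_X (df/dx)² dx with fixed values of f at the boundaries. The same integration by parts as above gives δE/δf = −2 d²f/dx², so stationarity requires f'' = 0, i.e. the stationary function is linear between the two boundary values.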
➧ A.9 Principal Components

Consider, in the space X of dimensionality n, a set of orthonormal vectors {u_i}_{i=1,n}:

u_i^T u_j = δ_{ij}    (A.6)

and a set of vectors {x_i}_{i=1,N}, having the mean ⟨x⟩ = (1/N) ∑_{i=1}^N x_i. The residual error E is defined as:

E = (1/2) ∑_{i=K+1}^n ∑_{j=1}^N [u_i^T (x_j − ⟨x⟩)]²,  K = const. ∈ {0, 1, ..., n−1}

and may be written as (use the matrix property (AB)^T = B^T A^T):

E = (1/2) ∑_{i=K+1}^n u_i^T Σ u_i,  where  Σ = ∑_{i=1}^N (x_i − ⟨x⟩)(x_i − ⟨x⟩)^T

Σ being the covariance matrix of the {x_i} set.

The problem is to find the minima of E with respect to {u_i}, subject to the constraints (A.6). This is done by using a set of Lagrange multipliers {μ_ij}. The Lagrange function to minimize is:

L = E − (1/2) ∑_{i=K+1}^n ∑_{j=K+1}^n μ_ij (u_i^T u_j − δ_ij)

(because of symmetry, u_i^T u_j = u_j^T u_i and μ_ij = μ_ji, each term under the sums appears twice, and thus the factor 1/2 is inserted to count each different term once).

Consider the matrices:

U = (u_{K+1} ... u_n)  and  M = {μ_ij}_{i,j=K+1,n}

U being formed using the u_i as columns, and M being a symmetric matrix. The Lagrange function then becomes:

L = (1/2) Tr(U^T Σ U) − (1/2) Tr[M (U^T U − I)]

Minimizing L with respect to the u_i means the set of conditions ∂L/∂u_ij = 0, i.e. in matrix form (use (AB)^T = B^T A^T):

(Σ + Σ^T) U − U (M + M^T) = 0̃  ⟹  Σ U = U M

and, by using the orthogonality property of {u_i}, i.e. U^T U = I, this becomes:

U^T Σ U = M    (A.7)

One particular solution of the above equation is to choose {u_i} to be eigenvectors of Σ (as Σ is symmetric, it is possible to build an orthogonal system of eigenvectors) and to choose M as the diagonal matrix of eigenvalues (i.e. μ_ij = δ_ij λ_i, where the λ_i are the eigenvalues of Σ).

An alternative is to consider the eigenvectors {ψ_i} of M and the matrix Ψ built by using them as columns. Let Λ be the diagonal matrix of eigenvalues of M, i.e. Λ_ij = δ_ij λ_i. As M is symmetric, it is possible to choose an orthogonal set {ψ_i}, i.e. Ψ^T Ψ = I. From the eigenvector equation MΨ = ΨΛ, Λ = Ψ^T M Ψ. By replacing M from (A.7):

Λ = Ψ^T U^T Σ U Ψ = (UΨ)^T Σ (UΨ) = Ũ^T Σ Ũ

where Ũ = UΨ. This means that if U is a particular solution to (A.7), then Ũ = UΨ is also a solution, Ũ^T Σ Ũ = Λ, and the residual error may be written as:

E = (1/2) Tr(U^T Σ U) = (1/2) Tr(M) = (1/2) Tr(Ũ^T Σ Ũ)

✍ Remarks:
➥ There is an invariance to the orthogonal transformation defined by Ψ.
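Numerically, the particular solution is obtained directly from the eigenvectors of Σ; in Scilab, on an invented data set:

    // Principal directions of a small data set: eigenvectors of the
    // covariance matrix Sigma of section A.9 (patterns stored as rows).
    X  = [1 2; 2 3; 3 5; 4 7];
    N  = size(X, 1);
    mu = sum(X, 'r') / N;                  // mean pattern (row vector)
    Xc = X - ones(N, 1) * mu;              // centered patterns
    Sigma = Xc' * Xc;                      // covariance matrix (unnormalized)
    [U, L] = spec(Sigma);                  // columns of U: principal directions
    disp(diag(L)');                        // the eigenvalues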
APPENDIX B Statistical Sidelines

➧ B.1 Probabilities

B.1.1 Probabilities and Bayes Theorem

(See [Bis95] pp. 17-28 and [Rip96] pp. 19-20, 75.)

Consider some pattern vectors {x_p} and some classes {C_k} into which these patterns have to be classified.

Definition B.1.1. The prior probability P(C_k) represents the probability of a pattern being of class k while belonging to a very large set of samples:

P(C_k) = (number of patterns of class C_k) / (total number of patterns) ∈ [0, 1]    (B.1)

when "total number of patterns" → ∞.

Definition B.1.2. The joint probability P(C_k, X_ℓ) represents the probability of a pattern being of class k and, at the same time, the pattern vector being in the pattern subspace X_ℓ ⊆ X; the pattern belongs to a very large set of samples:

P(C_k, X_ℓ) = (number of patterns of class C_k with x ∈ X_ℓ) / (total number of patterns) ∈ [0, 1]    (B.2)

when "total number of patterns" → ∞.

✍ Remarks:
➥ For discrete pattern spaces, x ∈ X_ℓ may be replaced with x ∈ {X_{ℓ1}, ...}, where X_{ℓ1}, ... are also pattern vectors.
➥ For continuous pattern spaces, either X_ℓ defines a volume in the pattern space in which the point x should be, or a point, in which case x ∈ X_ℓ is replaced with x = X_ℓ; but, in this case, P(C_k, X_ℓ) represents an infinitesimal quantity.

Definition B.1.3. The class-conditional probability P(X_ℓ|C_k) represents the probability for a pattern of class C_k to have its pattern vector in the pattern subspace defined by X_ℓ:

P(X_ℓ|C_k) = (number of patterns of class C_k with x ∈ X_ℓ) / (total number of patterns of class C_k) ∈ [0, 1]    (B.3)

when "total number of patterns of class C_k" → ∞.

Definition B.1.4. The distribution probability P(X_ℓ) represents the probability of a pattern having its associated vector x in the subspace X_ℓ:

P(X_ℓ) = (number of patterns with x ∈ X_ℓ) / (total number of patterns) ∈ [0, 1]    (B.4)

when "total number of patterns" → ∞.

Definition B.1.5. The posterior probability P(C_k|X_ℓ) represents the probability for a pattern which has its associated vector in the subspace X_ℓ to be of class C_k:

P(C_k|X_ℓ) = (number of patterns with x ∈ X_ℓ and of class C_k) / (total number of patterns with x ∈ X_ℓ) ∈ [0, 1]    (B.5)

when "total number of patterns with x ∈ X_ℓ" → ∞.

✍ Remarks:
➥ Regarding X_ℓ and the probabilities, the same previous remarks apply.
➥ The prior probability refers to the knowledge available before the pattern vector is known, while the posterior probability refers to the knowledge available after the pattern vector is known.
➥ By assigning a pattern to the class for which the posterior probability is maximum, the misclassification errors are minimized.

Theorem B.1.1. Bayes. The posterior probability is the normalized product between the prior and the class-conditional probabilities:

P(C_k|X_ℓ) = P(X_ℓ|C_k) P(C_k) / P(X_ℓ)    (B.6)

P(X_ℓ) being the normalization factor.

Proof. By multiplying (B.3) and (B.1), and comparing the result with (B.2), it follows that:

P(C_k, X_ℓ) = P(X_ℓ|C_k) P(C_k)    (B.7)

similarly, from (B.5) and (B.4):

P(C_k, X_ℓ) = P(C_k|X_ℓ) P(X_ℓ)    (B.8)

The final result is obtained by comparing (B.7) and (B.8). ∎

✍ Remarks:
➥ When working with classification, all patterns belong to a class: if a pattern cannot be classified into a "normal" class, there may be an outliers class containing all patterns not classifiable into any other class.
➥ P(X_ℓ) represents the normalization factor of P(X_ℓ|C_k) P(C_k).

Proof. Because each pattern should be classified into a class:

∑_{k=1}^K P(C_k|X_ℓ) = 1    (B.9)

By using the Bayes theorem (B.6) in (B.9), the distribution probability may be expressed as:

P(X_ℓ) = ∑_{k=1}^K P(X_ℓ|C_k) P(C_k)

and then P(C_k|X_ℓ) is normalized. ∎

B.1.2 Probability Density, Expectation and Variance

Definition B.1.6. The probability density function p(x) is the function for which:

P(X_ℓ) = ∫_{X_ℓ} p(x) dx    (B.10)

where X_ℓ is a pattern subspace. Similarly, the following probability densities may be defined:
• the joint probability density p(C_k, x): P(C_k, X_ℓ) = ∫_{X_ℓ} p(C_k, x) dx;
• the class-conditional probability density p(x|C_k): P(X_ℓ|C_k) = ∫_{X_ℓ} p(x|C_k) dx;
• the posterior probability density p(C_k|x): P(C_k|X_ℓ) = ∫_{X_ℓ} p(C_k|x) dx.

Definition B.1.7. The expectation (expected value) E{Q} of a function Q(x) is:

E{Q} = ∫_X Q(x) p(x) dx

The variance V{Q} of a function Q(x) is:

V{Q} = √( ∫_X [Q(x) − E{Q}]² p(x) dx )

Proposition B.1.1. Using probability densities, the Bayes theorem B.1.1 may be written as:

P(C_k|x) = p(x|C_k) P(C_k) / p(x)    (B.11)

Proof. For X_ℓ being a point in the pattern space, (B.10) may be rewritten as dP(x) = p(x) dx, and similarly for the other types of probabilities; the final formula is obtained by doing the replacements in (B.6). ∎

As in the Bayes theorem, p(x) is a normalization factor for P(C_k|x).

Proof. p(x) represents the probability density of the pattern vector being x, no matter what the class is; it is the sum of the joint probability densities, over all classes, for the pattern vector being x and the pattern being of class C_k:

p(x) = ∑_{k=1}^K p(C_k, x) = ∑_{k=1}^K p(x|C_k) P(C_k)

and, comparing with (B.11), this shows that P(C_k|x) is normalized. ∎

✍ Remarks:
➥ The p(x|C_k) probability density may be seen as the likelihood that a pattern of class C_k will have the pattern vector x. p(x) represents a normalization factor such that all the posterior probabilities sum to one. The Bayes theorem may thus be expressed as:

posterior probability = (likelihood × prior probability) / (normalization factor)
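A numerical illustration of (B.6), with invented values, in Scilab:

    // Posterior probabilities from priors and class-conditionals.
    prior = [0.6; 0.4];           // P(C_1), P(C_2)
    lik   = [0.2; 0.5];           // P(X_l | C_1), P(X_l | C_2)
    post  = prior .* lik;
    post  = post / sum(post);     // divide by P(X_l), the normalization factor
    disp(post');                  // 0.375  0.625, summing to 1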
The p(x) represents a normalization factor such that the sum of all posterior probabilities sum to one. Then the Bayes theorem may be expressed as: likelihood  prior probability posterior probability = normalization factor k k ➧ B.2 Modeling the Density of Probability Let be a training set of classi ed patterns fx g1 . The problem is to nd a good approximation for probability density starting from the training set. Knowing it, from the Bayes theorem and Bayes rule1 it is possible to built the device able to classify new input patterns. There are several approaches to this problem:  The parametric method: A speci c functional form is assumed. There are a small number of tunable parameters which are optimized such that the model ts the training set. Disadvantages: there are limits to the capability of generalization: the functional forms chosen may not generalize well.  The non-parametric method: The form of the probability density is determined from the data (no functional form is assumed). Disadvantages: The number of parameters grow with the size of the training set.  The semi-parametric method: Tries to combine the above 2 methods by using a very general class of functional forms and by allowing the number of parameters to vary independently from the size of training set. Feed-forward ANN are of this type. p 1 See \Pattern Recognition" chapter. ;P B.2. MODELING THE DENSITY OF PROBABILITY 345 B.2.1 The Parametric Method The parametric method uses functions with few tunable parameters to model the probability density. These functions are named distributions. The most widely used is the Gaussian due to its properties and good approximation of many real world processes. Gaussian Unidimensional For a unidimensional space the Gaussian distribution is de ned as: p(x) =  p1 2  exp 2 ; (x 2;2) ; ;  = const. (B.12) This function have the following properties: 1. p(x) is normalized, i.e. R1 ;1 p(x) dx = 1. 2. Expected value of x is , i.e. Efxg = R1 xp(x) dx = . ;1 3. The variance (standard deviation) of x is , i.e. v u Z1 u Vfxg = u t [x ; Efxg]2 p(x) dx =  ;1 Proof. 1. By making the change of variable: y= x; p2  , dy = pdx2  1 1 R x2 R p e dx =  (see the mathematical appendix) then p(x) dx = 1 i.e. the probability ;1 ;1 density is normalized (this is the role of the p21  factor) as it should be, because the probability of nding x in the whole space is 1 (certainty). 2. The mean value of x is the expectation (see de nition B.1.7) Z1 and because Efxg = x p(x) dx ;1 p and by making the same change of variable as above (x = 2 y + ) Z1 p p Efxg = p21  ( 2 y + )e;y2 2  dy ;1 r Z1 Z1 2 2 2  ye;y dy + p e;y dy =   =   ;1 ;1 1 ;y2 R ye dy is 0 | the integrand is an odd function (the value for ;x is minus because the rst integral ;1 p the value for x) and the integration interval is symmetric relatively to origin; and the second integral is  (see the mathematical appendixes). B.2.1 See [Bis95] pp. 34{49 and [Rip96] pp. 21, 30{31. 346 APPENDIX B. STATISTICAL SIDELINES 3. The variance is (see de nition B.1.7): Vfxg = 1 Z (x ; )2 p(x) dx ;1 (as Efxg = ) and same change of variable leads to an integral solvable by parts:  Z1 2 2 Z1 Vfxg = p21  (x ; )2 exp ; (x 2;2) dx = 2p y2 e;y2 dy ;1 ;1 1   2 Z 2 2 Z1 1 = ; p  y d e;y2 = ; p  y e;y2 ;1 + p  e;y2 dy = 2 ;1 ;1 ✍ Remarks: ➥ To apply to a stochastic process calculate the mean and variance of training set, then replace in (B.12) to nd an approximation of real probability density by an Gaussian. 
Gaussian, multidimensional

In general, in an $N$-dimensional space, the Gaussian distribution is defined as:
$$p(x) = \frac{1}{(2\pi)^{N/2}\sqrt{|\Sigma|}}\,\exp\left(-\frac{(x-\mu)^T\Sigma^{-1}(x-\mu)}{2}\right) \qquad (B.13)$$
where $\mu$ is an $N$-dimensional vector and $\Sigma$ is an $N \times N$ matrix, symmetric and invertible. This function has the following properties:
1. $p(x)$ is normalized, i.e. $\int_{\mathbb{R}^N} p(x)\,dx = 1$, where $dx = dx_1 \ldots dx_N$.
2. The expected value of $x$ is $\mu$, i.e. $\mathcal{E}\{x\} = \int_{\mathbb{R}^N} x\,p(x)\,dx = \mu$.
3. The expected value of $(x-\mu)(x-\mu)^T$ is $\Sigma$, i.e. $\mathcal{E}\{(x-\mu)(x-\mu)^T\} = \Sigma$.

Proof.
1. Because $\det(\Sigma^{-1})\det(\Sigma) = \det(I) = 1$, then $\det(\Sigma^{-1}) = \frac{1}{\det(\Sigma)}$ and, by making the change of variable $\tilde{x} = x - \mu$, the integral becomes
$$\int_{\mathbb{R}^N} p(x)\,dx = \frac{\sqrt{|\Sigma^{-1}|}}{(2\pi)^{N/2}}\int_{\mathbb{R}^N}\exp\left(-\frac{\tilde{x}^T\Sigma^{-1}\tilde{x}}{2}\right)d\tilde{x} = 1$$
(see also the mathematical appendix regarding Gaussian integrals).
2. With the same change of variable $\tilde{x} = x - \mu$:
$$\mathcal{E}\{x\} = \frac{\sqrt{|\Sigma^{-1}|}}{(2\pi)^{N/2}}\left[\int_{\mathbb{R}^N}\tilde{x}\,\exp\left(-\frac{\tilde{x}^T\Sigma^{-1}\tilde{x}}{2}\right)d\tilde{x} + \mu\int_{\mathbb{R}^N}\exp\left(-\frac{\tilde{x}^T\Sigma^{-1}\tilde{x}}{2}\right)d\tilde{x}\right]$$
The function $\exp\left(-\tilde{x}^T\Sigma^{-1}\tilde{x}/2\right)$ is even (same value for $\tilde{x}$ and $-\tilde{x}$), such that the integrand of the first integral is an odd function and, the integration interval being symmetric about the origin, that integral is zero. The second integral has the value $(2\pi)^{N/2}\sqrt{|\Sigma|}$ and, as $|\Sigma^{-1}| = \frac{1}{|\Sigma|}$, finally $\mathcal{E}\{x\} = \mu$.
3. The same change of variable $\tilde{x} = x - \mu$ gives:
$$\mathcal{E}\{(x-\mu)(x-\mu)^T\} = \frac{\sqrt{|\Sigma^{-1}|}}{(2\pi)^{N/2}}\int_{\mathbb{R}^N}\tilde{x}\tilde{x}^T\exp\left(-\frac{\tilde{x}^T\Sigma^{-1}\tilde{x}}{2}\right)d\tilde{x}$$
Let $\{u_i\}_{i=1,N}$ be the eigenvectors and $\{\lambda_i\}_{i=1,N}$ the eigenvalues of $\Sigma$, such that $\Sigma u_i = \lambda_i u_i$, the set of eigenvectors being chosen orthonormal (see the mathematical appendix regarding the properties of symmetric matrices). Let $U$ be the matrix built using the eigenvectors of $\Sigma$ as columns and $\Lambda$ the matrix of eigenvalues, $\Lambda_{ij} = \delta_{ij}\lambda_i$ ($\delta_{ij}$ being the Kronecker symbol), i.e. $\Lambda$ is a diagonal matrix with the eigenvalues on the main diagonal and 0 elsewhere. Because the set of eigenvectors is orthonormal, $U^T U = I \Rightarrow U^{-1} = U^T \Rightarrow U U^T = I$; by multiplying $\mathcal{E}\{(x-\mu)(x-\mu)^T\}$ with $U^T$ to the left and with $U$ to the right, it follows that
$$U^T\,\mathcal{E}\{(x-\mu)(x-\mu)^T\}\,U = \frac{\sqrt{|\Sigma^{-1}|}}{(2\pi)^{N/2}}\int_{\mathbb{R}^N} U^T\tilde{x}\tilde{x}^T U\,\exp\left(-\frac{\tilde{x}^T U U^T\Sigma^{-1} U U^T\tilde{x}}{2}\right)d\tilde{x}$$
A new change of variable is performed: $y = U^T\tilde{x}$, so $y^T = \tilde{x}^T U$ and $dy = d\tilde{x}$, because this transformation conserves distances and angles. Also $\Sigma^{-1}$ has the same eigenvectors as $\Sigma$, the eigenvalues being $\{1/\lambda_i\}_{i=1,N}$, i.e. $U^T\Sigma^{-1}U = \Lambda^{-1}$ with $\Lambda^{-1}_{ij} = \delta_{ij}/\lambda_i$. Then:
$$U^T\,\mathcal{E}\{(x-\mu)(x-\mu)^T\}\,U = \frac{\sqrt{|\Sigma^{-1}|}}{(2\pi)^{N/2}}\int_{\mathbb{R}^N} y y^T \exp\left(-\sum_{i=1}^N \frac{y_i^2}{2\lambda_i}\right)dy$$
i.e. the integrals may now be decoupled. Each element of the matrix is computed separately; there are two cases, the non-diagonal and the diagonal elements. For a non-diagonal element $(i \neq j)$ the factor $\int_{-\infty}^{\infty} y_i \exp\left(-\frac{y_i^2}{2\lambda_i}\right)dy_i$ appears and, the function $y_i\exp\left(-\frac{y_i^2}{2\lambda_i}\right)$ being odd, the corresponding elements are 0. The diagonal elements are
$$\left\{U^T\,\mathcal{E}\{(x-\mu)(x-\mu)^T\}\,U\right\}_{ii} = \frac{\sqrt{|\Sigma^{-1}|}}{(2\pi)^{N/2}}\left[\prod_{k \neq i}\int_{-\infty}^{\infty}\exp\left(-\frac{y_k^2}{2\lambda_k}\right)dy_k\right]\int_{-\infty}^{\infty} y_i^2\exp\left(-\frac{y_i^2}{2\lambda_i}\right)dy_i$$
and the individual integrals appearing above are
$$\int_{-\infty}^{\infty}\exp\left(-\frac{y_k^2}{2\lambda_k}\right)dy_k = \sqrt{2\pi\lambda_k} \qquad\text{and}\qquad \int_{-\infty}^{\infty} y_i^2\exp\left(-\frac{y_i^2}{2\lambda_i}\right)dy_i = \lambda_i\sqrt{2\pi\lambda_i}$$
(calculated the same way as in the unidimensional case), while $|\Sigma| = \prod_{i=1}^N \lambda_i$; so finally
$$U^T\,\mathcal{E}\{(x-\mu)(x-\mu)^T\}\,U = \Lambda \;\Rightarrow\; \mathcal{E}\{(x-\mu)(x-\mu)^T\} = U\Lambda U^T = \Sigma$$
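A small Scilab sketch evaluating (B.13) directly (illustrative only; the function name and the test values are made up):

    // Sketch: the multidimensional Gaussian density (B.13).
    // mu: column vector; Sigma: symmetric, positive definite matrix.
    function p = gaussN(x, mu, Sigma)
        N = size(mu, "r");
        d = x - mu;
        p = exp(-0.5 * d' * inv(Sigma) * d) / ((2 * %pi)^(N/2) * sqrt(det(Sigma)));
    endfunction

    mu = [0; 0];  Sigma = [1 0; 0 2];
    disp(gaussN([1; -1], mu, Sigma));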
✍ Remarks:
➥ By applying the transformation (equivalent to a rotation) $\tilde{x} = U^T(x-\mu)$, the probability distribution $p(x)$ becomes (by calculations similar to the above)
$$p(x) = \frac{\sqrt{|\Sigma^{-1}|}}{(2\pi)^{N/2}}\exp\left(-\sum_{i=1}^N\frac{\tilde{x}_i^2}{2\lambda_i}\right) = \prod_{i=1}^N p_i \qquad\text{where}\qquad p_i = \frac{1}{\sqrt{2\pi\lambda_i}}\exp\left(-\frac{\tilde{x}_i^2}{2\lambda_i}\right)$$
and then the probabilities are decoupled, i.e. the components of $x$ are statistically independent.
➥ The expression
$$\Delta = \sqrt{(x-\mu)^T\Sigma^{-1}(x-\mu)}$$
is called the Mahalanobis distance between the vectors $x$ and $\mu$. For $\Delta = \text{const.}$ the probability density is constant, so $\Delta$ defines surfaces of equal probability for $x$.
➥ By applying the transformation $\tilde{x} = U^T(x-\mu)$, the Mahalanobis distance becomes
$$\Delta^2 = \sum_{i=1}^N\frac{\tilde{x}_i^2}{\lambda_i}$$
i.e. the surfaces of equal probability are hyper-ellipsoids. The main axes of the ellipsoid are proportional to $\sqrt{\lambda_i}$. The $\mu$ vector points to the location of highest probability density. See figure B.1. The transformation $\tilde{x} = U^T(x-\mu)$ is a translation by $\mu$ followed by a rotation such that $\{u_1, u_2\}$ become the new set of versors.

Figure B.1: Equal probability density for a Gaussian in two dimensions. The ellipse represents all points where $p(x_1, x_2) = e^{-1/2}$; its half-axes along $u_1$ and $u_2$ are $\sqrt{\lambda_1}$ and $\sqrt{\lambda_2}$; the $\mu$ vector points to the center of the ellipse.

The probability density for a two-dimensional pattern space is shown in figure B.2.

Figure B.2: The Gaussian probability density function in a two-dimensional pattern space, for $\mu = \hat{0}$ and $\Sigma = \left(\begin{smallmatrix}1&0\\0&2\end{smallmatrix}\right)$: a) $p(x_1, x_2)$; b) level curves for $p$. The function was sampled in steps $\Delta x_{1,2} = 0.5$.

➥ The number of parameters defining the Gaussian distribution is $1 + \cdots + N = \frac{N(N+1)}{2}$ for $\Sigma$ (a symmetric matrix) plus $N$ parameters for $\mu$, so the Gaussian distribution is completely defined by $\frac{N(N+3)}{2}$ parameters.

Definition B.2.1. An $N$-dimensional vector $x$ is said to be normal, i.e. $x \sim \mathcal{N}_N\{\mu, \Sigma\}$, if it has a Gaussian distribution of the form (B.13) with mean $\mu$ and covariance matrix $\Sigma$.

✍ Remarks:
➥ Consider a set of several classes $C_1, \ldots, C_K$ such that each may be modelled using a multidimensional Gaussian with its own mean $\mu_k$. Considering the prior probabilities $P(C_k)$ equal and the classes sharing a common covariance matrix $\Sigma$ (otherwise the $|\Sigma_k|$ factors enter as well), the biggest posterior probability for a given vector $x$ is the one corresponding to the minimum of the Mahalanobis distance, $\Delta_k(x) = \min_\ell \Delta_\ell(x)$. This type of classification is named linear discriminant analysis.
➥ Considering the covariance matrix equal to the identity matrix, i.e. $\Sigma = I$, the Mahalanobis distance reduces to the simple Euclidean distance (see the mathematical appendix), and then the pattern vectors $x$ are simply classified to the class $C_k$ with the closest mean $\mu_k$.
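A minimal Scilab sketch of this classification rule (an illustration only; equal priors and a shared covariance matrix are assumed, as in the remark above):

    // Sketch: classification by minimum Mahalanobis distance
    // (linear discriminant analysis with equal priors).
    function k = classify(x, M, Sigma)
        // M: matrix whose columns are the class means mu_k
        K = size(M, "c");
        d2 = zeros(1, K);
        Si = inv(Sigma);
        for j = 1:K
            d = x - M(:, j);
            d2(j) = d' * Si * d;      // squared Mahalanobis distance
        end
        [dmin, k] = min(d2);          // class of the closest mean
    endfunction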
B.2.2 The Non-Parametric Method

(See [Bis95] pp. 49-59 and [Rip96] pp. 190, 201-206.)

The non-parametric method tries to solve the problem of finding the probability distribution of the data using a training set, without making any assumption about the form of the distribution function.

Histograms

In the histogram method the pattern space is divided into subspaces (areas) $X_k$ and the probability density is estimated from the number of patterns in each area. The size of $X_k$ determines the model complexity: if it is too large then the model fits the data poorly, if it is too small then the model overfits the exceptions/noise which may be present in the training set. See figure B.3.

Figure B.3: Histograms: probability density $p$ versus pattern space $X$, for a) a too simple model, b) a too complex model and c) a good model. The true probability density is shown with a dotted line, the estimated one with a thick line; the rectangles represent the number of training patterns falling in the designated regions $X_k$, the small circles the patterns.

Take one area $X_k$ and let $K$ be the number of patterns from the training set which fall in $X_k$. Assuming a sufficiently large number $P$ of patterns in the training set (such that the training set is statistically significant), the probability that a pattern falls in the area $X_k$ is approximately
$$P(X_k) \simeq \frac{K}{P}$$
On the other hand, assuming that the probability density is approximately constant, equal to $\tilde{p}(x)$, in all points of $X_k$, then
$$P(X_k) = \int_{X_k} p(x)\,dx \simeq \tilde{p}(x)\int_{X_k} dx = \tilde{p}(x)\,V_{X_k}$$
where $V_{X_k}$ is the volume of the pattern area $X_k$. Finally:
$$p(x) \simeq \tilde{p}(x) = \frac{K}{P\,V_{X_k}} \qquad (B.14)$$
where $\tilde{p}(x)$ is the estimated probability density.
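A minimal Scilab sketch of the histogram estimate (B.14) in one dimension (the uniform training data and the bin count are arbitrary choices, for illustration only):

    // Sketch: histogram estimate (B.14) on [0,1) with B equal bins.
    x = rand(1, 500);                  // synthetic training set
    B = 10;  V = 1 / B;                // bin volume V_Xk
    p = zeros(1, B);
    for b = 1:B
        K = sum((x >= (b-1)*V) & (x < b*V));   // patterns in bin b
        p(b) = K / (length(x) * V);            // estimated density (B.14)
    end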
To solve (B.14) further, two approaches may be taken:
• $X_k$, respectively $V_{X_k}$, is fixed and $K$ is counted: this represents the kernel-based method;
• $K$ is fixed and $V_{X_k}$ is calculated: this represents the K-nearest-neighbours method.

The kernel-based method

Let $X_k$ be a hypercube having sides of length $\ell$ and being centered at $x$. Then its volume is $V_{X_k} = \ell^N$ ($N$ being the dimension of the pattern space). The following kernel function (known also as the Parzen window) is defined:
$$H(x) = \begin{cases} 1 & \text{if } |x_i| < \frac{1}{2} \text{ for } i = \overline{1,N} \\ 0 & \text{otherwise} \end{cases}$$
such that $H(x)$ is 1 if the point $x$ is inside the unit hypercube centered at the origin and 0 otherwise. Then $H\!\left(\frac{x - x_p}{\ell}\right)$ indicates whether the point $x_p$ from the training set is inside the hypercube $X_k$ or not. The total number of patterns falling in $X_k$ is
$$K = \sum_{p=1}^P H\!\left(\frac{x - x_p}{\ell}\right)$$
and then the estimate of the probability density is
$$\tilde{p}(x) = \frac{1}{P}\sum_{p=1}^P \frac{1}{\ell^N}\,H\!\left(\frac{x - x_p}{\ell}\right)$$
This may be visualized as a hypercube sliding in the pattern space, centered at the current point $x$. While moving it, some of the $x_p$ points enter it while others leave it, such that, unless the total number remains constant, $\tilde{p}(x)$ has step jumps. See figure B.4.

Figure B.4: The kernel-based method. The dashed line is the real probability density $p(x)$; the continuous line is the estimated probability density $\tilde{p}(x)$, based on counting the patterns inside the hypercube surrounding the current point.

The function $\tilde{p}(x)$ may be "smoothened" by replacing the kernel function with a continuous one:
$$H(x) = \frac{1}{(2\pi)^{N/2}}\exp\left(-\frac{\|x\|^2}{2}\right)$$
and then the estimate of the probability density becomes
$$\tilde{p}(x) = \frac{1}{P}\sum_{p=1}^P\frac{1}{(2\pi\ell^2)^{N/2}}\exp\left(-\frac{\|x - x_p\|^2}{2\ell^2}\right)$$
In general, any function bounded by the conditions
$$H(x) \geqslant 0 \qquad\text{and}\qquad \int_X H(x)\,dx = 1$$
is suitable to be used as a kernel function.

Consider the expectation of the estimated probability density, for $P \to \infty$:
$$\mathcal{E}\{\tilde{p}(x)\} = \frac{1}{P}\sum_{p=1}^P\mathcal{E}\left\{\frac{1}{\ell^N}H\!\left(\frac{x - x_p}{\ell}\right)\right\} \longrightarrow \int_X \frac{1}{\ell^N}H\!\left(\frac{x - x'}{\ell}\right)p(x')\,dx'$$
(where $x'$ represents the integration variable). This formula shows that the expectation of the estimated probability density is a convolution of the (true) probability density with the kernel function. For $\ell \to 0$ and $P \to \infty$ the kernel function approaches the $\delta$-Dirac function and the estimated probability density approaches the true one.

The K-nearest-neighbours method

Let $K$ be fixed and $X_k$ a hyper-sphere centered at $x$ and with variable radius, chosen such that it always contains exactly $K$ vectors from the training set. See figure B.5. The estimate of the probability density is found from
$$\tilde{p}(x) = \frac{K}{P\,V_{X_k}}$$
The volume $V_{(N)}$ of a sphere of radius $r$ in the $N$-dimensional space is
$$V_{(N)} = \frac{\pi^{N/2}\,r^N}{\Gamma_E\!\left(\frac{N}{2}+1\right)} \qquad\text{where}\qquad \Gamma_E(x) = \int_0^\infty e^{-t}\,t^{x-1}\,dt$$
$\Gamma_E$ being the Euler function, see the mathematical appendix.

Figure B.5: The K-nearest-neighbours method. The dashed line is the real probability density $p(x)$; the continuous line is the estimated probability density $\tilde{p}(x)$, based on the volume of the hyper-sphere with variable radius (the hyper-sphere being defined by fixing the number $K$).

Consider a set of classes and a training set. Let $P_k$ be the number of patterns of class $C_k$ in the training set, such that $\sum_k P_k = P$, and let $K_k$ be the number of patterns of class $C_k$ inside the hyper-sphere of volume $V$. Then
$$p(x|C_k) = \frac{K_k}{P_k V}, \qquad p(x) = \frac{K}{P V} \qquad\text{and}\qquad P(C_k) = \frac{P_k}{P}$$
From the Bayes theorem
$$P(C_k|x) = \frac{p(x|C_k)\,P(C_k)}{p(x)} = \frac{K_k}{K}$$
which is known as the K-nearest-neighbours classification rule. This means that, once the volume of the hyper-sphere has been established (by fixing $K$), a new pattern $x$ is classified as being of the class which has the most representatives ($K_k$) inside the hyper-sphere, i.e. the class $C_k$ for which $P(C_k|x) = \max_\ell P(C_\ell|x)$ (according to the Bayes rule, see the "Pattern Recognition" chapter).

✍ Remarks:
➥ The parameters governing the smoothness of the estimate are $V$ for the kernel-based procedure and $K$ for the K-nearest-neighbours procedure. If this tunable parameter is too large then excessive smoothing occurs and the resulting model is too simple. If the parameter is chosen too small then the variance of the model is too large: the model approximates well the probability density of the training set but has poor generalization, i.e. the model is too complex.
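A minimal Scilab sketch of the Gaussian-kernel estimate above, for $N = 1$ (an illustration only; the window width $\ell$ is the smoothing parameter discussed in the remark):

    // Sketch: Gaussian kernel (Parzen window) estimate of p(x), N = 1.
    function p = parzen(x, xs, l)
        // xs: row vector of training samples; l: window width
        p = sum(exp(-(x - xs).^2 / (2 * l^2))) / (length(xs) * sqrt(2 * %pi) * l);
    endfunction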
B.2.3 The Semi-Parametric Method

(See [Bis95] pp. 59-73.)

The mixture model

The mixture model considers that the probability density is a superposition of probability densities, each of them contributing to the total with a different weight. The procedure below is repeated for each class $C_k$ in turn. Considering a superposition of $M$ probability densities:
$$p(x) = \sum_{m=1}^M p(x|m)\,P(m) \qquad (B.15)$$
where $p(x|m)$ represents the probability density of the pattern vector being $x$, among all patterns generated by component $m$ of the superposition, and the weight $P(m)$ is the prior probability of the pattern $x$ having been generated by component $m$ of the superposition. All these probabilities have to be normalized:
$$\sum_{m=1}^M P(m) = 1, \qquad P(m) \in [0,1] \qquad\text{and}\qquad \int_X p(x|m)\,dx = 1$$

✍ Remarks:
➥ $M$ becomes also a parameter of the model.
➥ The training set has the patterns classified in classes but does not have the patterns classified by the superposition components $m$, i.e. the training set is incomplete, and the complete model has to provide a means to determine this.

The posterior probability is given by the Bayes theorem and is normalized:
$$P(m|x) = \frac{p(x|m)\,P(m)}{p(x)} \qquad\text{and}\qquad \sum_{m=1}^M P(m|x) = 1 \qquad (B.16)$$
The problem is to determine the components of the superposition of probability densities.

✍ Remarks:
➥ One possibility is to model the conditional probability densities $p(x|m)$ as Gaussians, defined by the parameters $\mu_m$ and $\Sigma_m = \sigma_m^2 I$:
$$p(x|m) = \frac{1}{(2\pi\sigma_m^2)^{N/2}}\exp\left(-\frac{\|x - \mu_m\|^2}{2\sigma_m^2}\right) \qquad (B.17)$$
and then a search for the optimal parameters $\mu$ and $\sigma$ may be done. To avoid singularities, the conditions
$$\mu_m \neq x_p,\quad m = \overline{1,M},\; p = \overline{1,P} \qquad\text{and}\qquad \sigma_m \neq 0,\quad m = \overline{1,M}$$
have to be imposed ($x_p$ being the training vectors).

The maximum likelihood method

The parameters are searched for by maximizing the likelihood function, defined as (see the "Pattern Recognition" chapter):
$$L = \prod_{p=1}^P p(x_p|W) \qquad (B.18)$$
which is equivalent to minimizing the negative log-likelihood function $E = -\ln L$, which may act as an error function:
$$E = -\ln L = -\sum_{p=1}^P \ln p(x_p) = -\sum_{p=1}^P\ln\left[\sum_{m=1}^M p(x_p|m)\,P(m)\right] \qquad (B.19)$$

✍ Remarks:
➥ Considering the Gaussian model (B.17), the minimum of $E$ is found by searching for the roots of its derivatives, i.e. $E$ is minimal for those values of $\mu_m$ and $\sigma_m$ for which
$$\nabla_{\mu_m} E = 0 \;\Leftrightarrow\; \sum_{p=1}^P P(m|x_p)\,\frac{x_p - \mu_m}{\sigma_m^2} = 0$$
and
$$\frac{\partial E}{\partial \sigma_m} = 0 \;\Leftrightarrow\; \sum_{p=1}^P P(m|x_p)\left(\frac{N}{\sigma_m} - \frac{\|x_p - \mu_m\|^2}{\sigma_m^3}\right) = 0$$
(the denominators cannot be 0 anyway, as $\sigma_m \neq 0$, see the previous remarks). The above equations give the following estimates for the $\mu_m$ and $\sigma_m$ parameters:
$$\tilde{\mu}_m = \frac{\sum_{p=1}^P P(m|x_p)\,x_p}{\sum_{p=1}^P P(m|x_p)} \qquad\text{and}\qquad \tilde{\sigma}_m^2 = \frac{\sum_{p=1}^P P(m|x_p)\,\|x_p - \tilde{\mu}_m\|^2}{N\sum_{p=1}^P P(m|x_p)} \qquad (B.20)$$
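A minimal Scilab sketch tying together (B.15), (B.16) and (B.17) (illustrative only; the argument layout is an assumption, not the book's code):

    // Sketch: spherical Gaussian mixture (B.15)/(B.17) and the
    // component posteriors (B.16). Mu: N x M (means as columns);
    // sig, Pm: 1 x M (widths and priors).
    function [p, Pmx] = mixture(x, Mu, sig, Pm)
        [N, M] = size(Mu);
        Pmx = zeros(1, M);
        for m = 1:M
            d2 = sum((x - Mu(:, m)).^2);
            Pmx(m) = Pm(m) * exp(-d2 / (2 * sig(m)^2)) / (2 * %pi * sig(m)^2)^(N/2);
        end
        p = sum(Pmx);      // p(x), eq. (B.15)
        Pmx = Pmx / p;     // P(m|x), eq. (B.16)
    endfunction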
In order to automatically ensure normalization, the $P(m)$ parameters may be expressed by means of $M$ parameters $\gamma_m$ as follows:
$$P(m) = \frac{e^{\gamma_m}}{\sum_{q=1}^M e^{\gamma_q}}, \qquad \gamma_m \in \mathbb{R},\; m = \overline{1,M}$$
These expressions are called softmax functions. Then the $\gamma_m$ are also parameters of the model, and $E$, which depends upon them, has to be minimized with respect to them as well, i.e. its derivatives with respect to $\gamma_m$ should be 0 at the minimum.

From the softmax expression, $\frac{\partial P(q)}{\partial\gamma_m} = \delta_{mq}P(m) - P(m)P(q)$ ($\delta_{mq}$ being the Kronecker symbol), and then from (B.19):
$$\frac{\partial E}{\partial\gamma_m} = \sum_{q=1}^M\frac{\partial E}{\partial P(q)}\,\frac{\partial P(q)}{\partial\gamma_m} = -\sum_{p=1}^P\sum_{q=1}^M\frac{p(x_p|q)\left[\delta_{mq}P(m) - P(m)P(q)\right]}{\sum_{\ell=1}^M p(x_p|\ell)\,P(\ell)}$$
$$= -\sum_{p=1}^P\frac{p(x_p|m)\,P(m)}{\sum_{\ell=1}^M p(x_p|\ell)\,P(\ell)} + P(m)\sum_{p=1}^P\frac{\sum_{q=1}^M p(x_p|q)\,P(q)}{\sum_{\ell=1}^M p(x_p|\ell)\,P(\ell)}$$
By applying the Bayes theorem and considering the normalization of $P(m|x)$, see (B.16), the first sum becomes
$$\sum_{p=1}^P\frac{P(m|x_p)\,p(x_p)}{p(x_p)\sum_{\ell=1}^M P(\ell|x_p)} = \sum_{p=1}^P P(m|x_p)$$
while the second sum is
$$P(m)\sum_{p=1}^P\frac{p(x_p)\sum_{q=1}^M P(q|x_p)}{p(x_p)\sum_{\ell=1}^M P(\ell|x_p)} = P\,P(m)$$
so finally
$$\frac{\partial E}{\partial\gamma_m} = 0 \;\Leftrightarrow\; P(m) = \frac{1}{P}\sum_{p=1}^P P(m|x_p) \qquad (B.21)$$

The EM (expectation-maximisation) algorithm

The algorithm works with formulas (B.20) and (B.21) iteratively, in steps, starting with some initial values for the parameters at the first step $t = 1$ and then recalculating $\tilde{\mu}_{(t+1)m}$, $\tilde{\sigma}_{(t+1)m}$ and the estimated $\tilde{P}_{(t+1)}(m)$ by using the old values $\tilde{\mu}_{(t)m}$, $\tilde{\sigma}_{(t)m}$ and $\tilde{P}_{(t)}(m)$ from the previous step. It is expected that the $E$ function gets smaller at each step until it reaches the minimum.

The variation in the error function $E$ (given by (B.19)) from one step to the next is
$$\Delta E = E_{(t+1)} - E_{(t)} = -\sum_{p=1}^P\ln\frac{p_{(t+1)}(x_p)}{p_{(t)}(x_p)} \qquad (B.22)$$
and, using (B.15) for $p_{(t+1)}(x_p)$:
$$\Delta E = -\sum_{p=1}^P\ln\left[\sum_{m=1}^M\frac{p_{(t+1)}(x_p|m)\,P_{(t+1)}(m)}{p_{(t)}(x_p)}\cdot\frac{P_{(t)}(m|x_p)}{P_{(t)}(m|x_p)}\right] \qquad (B.23)$$

✍ Remarks:
➥ Jensen's inequality states that, given a function $f$ convex down on an interval $[a,b]$, a set of $P$ points in that interval $\{x_p\}_{p=1,P} \in [a,b]$ and a set of numbers $\{\lambda_p\}_{p=1,P} \in [0,1]$ such that $\sum_{p=1}^P\lambda_p = 1$, then
$$f\!\left(\sum_{p=1}^P\lambda_p x_p\right) \geqslant \sum_{p=1}^P\lambda_p f(x_p)$$
see the mathematical appendix.

By applying Jensen's inequality to (B.23), with $f \equiv \ln$ and $\lambda_m \equiv P_{(t)}(m|x_p)$ (the $P_{(t)}(m|x_p)$ are normalized):
$$\Delta E \leqslant -\sum_{p=1}^P\sum_{m=1}^M P_{(t)}(m|x_p)\ln\left[\frac{P_{(t+1)}(m)\,p_{(t+1)}(x_p|m)}{p_{(t)}(x_p)\,P_{(t)}(m|x_p)}\right] \equiv Q \qquad (B.24)$$
and $E_{(t+1)} \leqslant E_{(t)} + Q$, i.e. $E_{(t+1)}$ is bounded above and may be minimized by minimizing $Q$. Generally $Q = Q(W_{(t+1)})$, the old parameters $W_{(t)}$ having been established at the previous step $t$. Eventually, minimizing $Q$ is equivalent to minimizing
$$\tilde{Q} = -\sum_{p=1}^P\sum_{m=1}^M P_{(t)}(m|x_p)\ln\left[P_{(t+1)}(m)\,p_{(t+1)}(x_p|m)\right] \qquad (B.25)$$
For the Gaussian distribution, see (B.17), $\tilde{Q}$ becomes
$$\tilde{Q} = -\sum_{p=1}^P\sum_{m=1}^M P_{(t)}(m|x_p)\left[\ln P_{(t+1)}(m) - N\ln\sigma_{(t+1)m} - \frac{\|x_p - \mu_{(t+1)m}\|^2}{2\sigma_{(t+1)m}^2}\right] + \text{const.}$$
The problem is to minimize $\tilde{Q}$ with respect to the $(t+1)$ parameters, i.e. to find the parameters at step $t+1$, such that the normalization condition $\sum_{m=1}^M P_{(t+1)}(m) = 1$ is met. The Lagrange multiplier method (see the mathematical appendix) is used here. The Lagrange function is
$$L = \tilde{Q} + \lambda\left[\sum_{m=1}^M P_{(t+1)}(m) - 1\right]$$
The value of the parameter $\lambda$ is found by putting the set of conditions
$$\frac{\partial L}{\partial P_{(t+1)}(m)} = 0 \;\Leftrightarrow\; -\sum_{p=1}^P\frac{P_{(t)}(m|x_p)}{P_{(t+1)}(m)} + \lambda = 0, \qquad m = \overline{1,M}$$
Multiplying by $P_{(t+1)}(m)$ and summing over $m$, and because both $P_{(t)}(m|x_p)$ and $P_{(t+1)}(m)$ are normalized, $\sum_{m=1}^M P_{(t)}(m|x_p) = 1$ and $\sum_{m=1}^M P_{(t+1)}(m) = 1$, it follows that $\lambda = P$.
The required parameters are found by setting the conditions $\nabla_{\mu_{(t+1)m}} L = 0$, $\frac{\partial L}{\partial\sigma_{(t+1)m}} = 0$ and $\frac{\partial L}{\partial P_{(t+1)}(m)} = 0$, which give the solutions:
$$\mu_{(t+1)m} = \frac{\sum_{p=1}^P P_{(t)}(m|x_p)\,x_p}{\sum_{p=1}^P P_{(t)}(m|x_p)} \qquad (B.26a)$$
$$\sigma_{(t+1)m} = \sqrt{\frac{\sum_{p=1}^P P_{(t)}(m|x_p)\,\|x_p - \mu_{(t+1)m}\|^2}{N\sum_{p=1}^P P_{(t)}(m|x_p)}} \qquad (B.26b)$$
$$P_{(t+1)}(m) = \frac{1}{P}\sum_{p=1}^P P_{(t)}(m|x_p) \qquad (B.26c)$$

As discussed earlier (see the remarks regarding the incomplete training set), usually the available data is not classified in terms of the probability components $m$, $m = \overline{1,M}$. A variable $z_p \in \{1, \ldots, M\}$ may be associated with each training vector $x_p$ to hold the probability component. The error function then becomes
$$E = -\ln L = -\sum_{p=1}^P\ln\left[P_{(t+1)}(z_p)\,p_{(t+1)}(x_p|z_p)\right] = -\sum_{p=1}^P\sum_{m=1}^M\delta_{m z_p}\ln\left[P_{(t+1)}(z_p)\,p_{(t+1)}(x_p|z_p)\right]$$
$P_{(t)}(z_p|x_p)$ represents the probability of $z_p$ for a given $x_p$, at step $t$. The probability of $E$ for a given set $\{z_p\}_{p=1,P}$ is the product of all $P_{(t)}(z_p|x_p)$, i.e. $\prod_{p=1}^P P_{(t)}(z_p|x_p)$. The expectation $\mathcal{E}\{E\}$ is the sum of $E$ over all values of $\{z_p\}_{p=1,P}$, weighted by the probability of the $\{z_p\}_{p=1,P}$ set:
$$\mathcal{E}\{E\} = -\sum_{z_1=1}^M\cdots\sum_{z_P=1}^M\left[\prod_{q=1}^P P_{(t)}(z_q|x_q)\right]\sum_{p=1}^P\sum_{m=1}^M\delta_{m z_p}\ln\left[P_{(t+1)}(z_p)\,p_{(t+1)}(x_p|z_p)\right]$$
On similar grounds as for $\mathcal{E}\{E\}$, the expression in square brackets above leads to the expectation $\mathcal{E}\{\delta_{m z_p}\}$:
$$\mathcal{E}\{\delta_{m z_p}\} = \sum_{z_1=1}^M\cdots\sum_{z_P=1}^M\delta_{m z_p}\prod_{q=1}^P P_{(t)}(z_q|x_q) = P_{(t)}(m|x_p)$$
which represents exactly the probability $P_{(t)}(m|x_p)$. Finally:
$$\mathcal{E}\{E\} = -\sum_{p=1}^P\sum_{m=1}^M P_{(t)}(m|x_p)\ln\left[P_{(t+1)}(m)\,p_{(t+1)}(x_p|m)\right]$$
which is identical to the expression of $\tilde{Q}$, see (B.25); thus minimizing $\tilde{Q}$ is at the same time equivalent to minimizing $\mathcal{E}\{E\}$.

Stochastic estimation

Consider that the training vectors come one at a time. For a set of $P$ training vectors the $\mu_{(P)m}$ parameter of the Gaussian distribution is (see (B.26a)):
$$\mu_{(P)m} = \frac{\sum_{p=1}^P P(m|x_p)\,x_p}{\sum_{p=1}^P P(m|x_p)}$$
and, after the $(P+1)$-th training vector has arrived:
$$\mu_{(P+1)m} = \frac{\sum_{p=1}^{P+1} P(m|x_p)\,x_p}{\sum_{p=1}^{P+1} P(m|x_p)} = \mu_{(P)m} + \eta_{(P+1)m}\left(x_{P+1} - \mu_{(P)m}\right) \qquad\text{where}\qquad \eta_{(P+1)m} = \frac{P(m|x_{P+1})}{\sum_{p=1}^{P+1} P(m|x_p)}$$
To avoid keeping all the old $\{x_p\}_{p=1,P}$ (needed to calculate $\sum_{p=1}^{P+1} P(m|x_p)$), use either (B.21), such that
$$\eta_{(P+1)m} = \frac{P(m|x_{P+1})}{(P+1)\,P(m)}$$
or, directly from the expression of $\eta_{(P)m}$:
$$\eta_{(P+1)m} = \frac{1}{1 + \dfrac{P(m|x_P)}{\eta_{(P)m}\,P(m|x_{P+1})}}$$
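A minimal Scilab sketch of one EM iteration, implementing (B.26a-c) (an illustration only; the responsibilities $R$ are assumed to have been computed from the previous parameters, e.g. with the mixture sketch given after (B.20)):

    // Sketch: one EM step (B.26a-c) for a spherical Gaussian mixture.
    // X: N x P (training vectors as columns); R: M x P matrix of
    // responsibilities P_(t)(m|x_p) from the old parameters.
    function [Mu, sig, Pm] = em_step(X, R)
        [N, P] = size(X);
        M = size(R, "r");
        Mu = zeros(N, M); sig = zeros(1, M); Pm = zeros(1, M);
        for m = 1:M
            s = sum(R(m, :));
            Mu(:, m) = X * R(m, :)' / s;                     // (B.26a)
            d2 = sum((X - Mu(:, m) * ones(1, P)).^2, "r");   // squared distances
            sig(m) = sqrt(sum(R(m, :) .* d2) / (N * s));     // (B.26b)
            Pm(m) = s / P;                                   // (B.26c)
        end
    endfunction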
➧ B.3 The Bayesian Inference

Unlike other techniques, where the probabilities are built by finding the best set of parameters $W$, Bayesian inference assumes a probability density for $W$ itself. The following procedure is repeated for each class $C_k$ in turn. First a prior probability density $p(W)$ is chosen, with a large coverage of the unknown parameters; then, using the training set $\{x_p\}_P$, the posterior probability density $p(W|\{x_p\}_P)$ is found through the Bayes theorem. The process of finding $p(W|\{x_p\}_P)$ from $p(W)$ and $\{x_p\}_P$ is named Bayesian learning.

Let $p(x|\{x_p\}_P)$ be the probability density for a pattern from $\{x_p\}_P$ to have its pattern vector equal to $x$, and let $p(x, W|\{x_p\}_P)$ be the joint probability density that a pattern from $\{x_p\}_P$ has pattern vector $x$ and the parameters of the probability density are defined through $W$. Then:
$$p(x|\{x_p\}_P) = \int_W p(x, W|\{x_p\}_P)\,dW \qquad (B.27)$$
the integral being done over all possible values of $W$, i.e. in the $W$ space.

$p(x, W|\{x_p\}_P)$ represents the ratio of the number of pattern vectors being $x$, with their probability density characterized by $W$, inside the training set $\{x_p\}_P$, to the total number of training vectors:
$$p(x, W|\{x_p\}_P) = \frac{\text{no. patterns being } x,\text{ with } W,\text{ in } \{x_p\}_P}{\text{no. patterns in } \{x_p\}_P}$$
$p(x|W, \{x_p\}_P)$ represents the ratio of the number of pattern vectors being $x$, with their probability density characterized by $W$, inside the training set $\{x_p\}_P$, to the number of training vectors with their probability density characterized by $W$:
$$p(x|W, \{x_p\}_P) = \frac{\text{no. patterns being } x,\text{ with } W,\text{ in } \{x_p\}_P}{\text{no. patterns with } W,\text{ in } \{x_p\}_P}$$
$p(W|\{x_p\}_P)$ represents the ratio of the number of pattern vectors with their probability density characterized by $W$, inside the training set $\{x_p\}_P$, to the total number of training vectors:
$$p(W|\{x_p\}_P) = \frac{\text{no. patterns with } W,\text{ in } \{x_p\}_P}{\text{no. patterns in } \{x_p\}_P}$$
Then, from the above equations, it follows that
$$p(x, W|\{x_p\}_P) = p(x|W, \{x_p\}_P)\,p(W|\{x_p\}_P)$$
The probability density $p(x|W, \{x_p\}_P)$ has to be independent of the choice of the statistically valid training set (the same ratio should hold in any training set) and consequently it reduces to $p(x|W)$. Finally:
$$p(x|\{x_p\}_P) = \int_W p(x|W)\,p(W|\{x_p\}_P)\,dW \qquad (B.28)$$

✍ Remarks:
➥ By contrast with other statistical methods, which try to find the best set of parameters $W$, the Bayesian inference method performs a weighted average over the $W$ space, using all possible sets of $W$ parameters according to their probability of being the right choice.

The probability density of the whole set $\{x_p\}_P$ is the product of the densities for each $x_p$ (assuming that the set is statistically significant):
$$p(\{x_p\}_P|W) = \prod_{p=1}^P p(x_p|W)$$
From the Bayes theorem, using the above equation:
$$p(W|\{x_p\}_P) = \frac{p(\{x_p\}_P|W)\,p(W)}{p(\{x_p\}_P)} = \frac{p(W)}{p(\{x_p\}_P)}\prod_{p=1}^P p(x_p|W) \qquad (B.29)$$
$p(\{x_p\}_P)$ plays the role of a normalization factor and, from the normalization condition $\int_W p(W|\{x_p\}_P)\,dW = 1$, it follows that:
$$p(\{x_p\}_P) = \int_W p(W)\prod_{p=1}^P p(x_p|W)\,dW \qquad (B.30)$$

✍ Remarks:
➥ Consider a unidimensional Gaussian distribution with the standard deviation $\sigma$ known, and try to find the parameter $\mu$ from a training set $\{x_p\}_P$. The probability density of $\mu$ is modeled also as a Gaussian, characterized by the parameters $\mu_0$ and $\sigma_0$:
$$p(\mu) = \frac{1}{\sqrt{2\pi}\,\sigma_0}\exp\left(-\frac{(\mu - \mu_0)^2}{2\sigma_0^2}\right) \qquad (B.31)$$
where this form of $p(\mu)$ expresses the prior knowledge of the probability density of $\mu$; a large value for $\sigma_0$ should then be chosen (large variance). Using the training set, the posterior probability density of $\mu$ is calculated as
$$p(\mu|\{x_p\}_P) = \frac{p(\mu)}{p(\{x_p\}_P)}\prod_{p=1}^P p(x_p|\mu) \qquad (B.32)$$
Assuming a Gaussian distribution for $p(x_p|\mu)$:
$$p(x_p|\mu) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(x_p - \mu)^2}{2\sigma^2}\right) \qquad (B.33)$$
From (B.30) and (B.33):
$$p(\{x_p\}_P) = \int_{-\infty}^{\infty} p(\mu)\prod_{p=1}^P p(x_p|\mu)\,d\mu = \frac{1}{(2\pi)^{\frac{P+1}{2}}\sigma_0\,\sigma^P}\int_{-\infty}^{\infty}\exp\left[-\frac{(\mu-\mu_0)^2}{2\sigma_0^2} - \frac{1}{2\sigma^2}\sum_{p=1}^P(\mu - x_p)^2\right]d\mu \qquad (B.34)$$
Let $\langle x\rangle = \frac{1}{P}\sum_{p=1}^P x_p$ be the mean of the training set. Replacing (B.31), (B.33) and (B.34) back into (B.32) gives
$$p(\mu|\{x_p\}_P) \propto \exp\left[-\left(\frac{P}{2\sigma^2} + \frac{1}{2\sigma_0^2}\right)\mu^2 + \left(\frac{P\langle x\rangle}{\sigma^2} + \frac{\mu_0}{\sigma_0^2}\right)\mu\right]$$
and so, after completing the square (const. being the normalization factor),
$$p(\mu|\{x_p\}_P) = \text{const.}\times\exp\left[-\frac{(\mu - \tilde{\mu})^2}{2\tilde{\sigma}^2}\right]$$
This expression shows that $p(\mu|\{x_p\}_P)$ is also a Gaussian distribution, characterized by the parameters
$$\tilde{\mu} = \frac{P\sigma_0^2\langle x\rangle + \sigma^2\mu_0}{P\sigma_0^2 + \sigma^2} \qquad\text{and}\qquad \tilde{\sigma} = \sqrt{\frac{\sigma^2\sigma_0^2}{P\sigma_0^2 + \sigma^2}}$$
➥ For $P \to \infty$: $\tilde{\mu} \to \langle x\rangle$ and $\tilde{\sigma} \to 0$. $\lim_{P\to\infty}\tilde{\sigma} = 0$ shows that, for $P \to \infty$, $\mu$ itself will assume the limit value $\langle x\rangle$.
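A minimal Scilab sketch of the posterior parameters derived in the example above (the function name and argument order are choices made here for illustration):

    // Sketch: posterior parameters of the mean of a Gaussian with known
    // sigma, under the Gaussian prior (B.31) with parameters mu0, s0.
    function [mut, st] = bayes_mean(x, sigma, mu0, s0)
        P = length(x);  xbar = mean(x);
        mut = (P * s0^2 * xbar + sigma^2 * mu0) / (P * s0^2 + sigma^2);
        st  = sqrt(sigma^2 * s0^2 / (P * s0^2 + sigma^2));
    endfunction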
➥ There is a relationship between the Bayesian inference and maximum likelihood methods. The likelihood function is defined as
$$p(\{x_p\}_P|W) = \prod_{p=1}^P p(x_p|W) \equiv L(W)$$
and then, from (B.29):
$$p(W|\{x_p\}_P) = \frac{L(W)\,p(W)}{p(\{x_p\}_P)}$$
$p(W)$ represents the prior knowledge about $W$, which is low, so $p(W)$ should be relatively flat, i.e. all $W$ have (relatively) the same probability (chance). Also, by construction, $L(W)$ is maximal for the most probable value of $W$; let $\widetilde{W}$ be that one. Then $p(W|\{x_p\}_P)$ has a maximum around $\widetilde{W}$.

If $p(W|\{x_p\}_P)$, maximal at $\widetilde{W}$, is relatively sharp, then for $P \to \infty$ the integral (B.28) is dominated by the area around $\widetilde{W}$:
$$p(x|\{x_p\}_P) = \int_W p(x|W)\,p(W|\{x_p\}_P)\,dW \simeq p(x|\widetilde{W})\int_W p(W|\{x_p\}_P)\,dW = p(x|\widetilde{W})$$
(because $\int_W p(W|\{x_p\}_P)\,dW = 1$, $p(W|\{x_p\}_P)$ being normalized). The conclusion is that, for a large number of training patterns, $P \to \infty$, the Bayesian inference solution (B.28) approaches the maximum likelihood solution $p(x|\widetilde{W})$.
➧ B.4 The Robbins-Monro Algorithm

(See [Fuk90] pp. 378-380.)

This algorithm gives a way of finding the roots of a stochastically defined function. Consider two correlated variables $x$ and $w$, $x = x(w)$. The expectation of $x$ for a given $w$ defines a function of $w$:
$$f(w) = \mathcal{E}\{x|w\}$$
these types of functions being named regression functions. The regression function $f$ expresses the dependency between the mean of $x$ and $w$. See figure B.6.

Figure B.6: The regression function $f(w)$ approximates the $x(w)$ dependency. The root $w_0$ of $f$ is found by the Robbins-Monro algorithm.

Let $w_0$ be the wanted root. It is assumed that $x$ has a finite variance:
$$\mathcal{E}\{(x - f)^2|w\} = \text{finite} \qquad (B.35)$$
and, without any loss of generality, that $f(w) < 0$ for $w < w_0$ and $f(w) > 0$ for $w > w_0$, as in figure B.6.

Theorem B.4.1. The root $w_0$ of the regression function $f$ is found by successive iteration, starting from a value $w_1$ in the vicinity of $w_0$, as follows:
$$w_{i+1} = w_i - \alpha_i\,x(w_i)$$
where the $\alpha_i$ have to satisfy the following conditions:
$$\lim_{i\to\infty}\alpha_i = 0 \qquad (B.36a)$$
$$\sum_{i=1}^\infty\alpha_i = \infty \qquad (B.36b)$$
$$\sum_{i=1}^\infty\alpha_i^2 = \text{finite} \qquad (B.36c)$$

Proof. $x(w)$ may be expressed as a sum between the regression function $f(w)$ and some noise $\varepsilon$:
$$x(w) = f(w) + \varepsilon \qquad (B.37)$$
then, from the definition of the regression function, $\mathcal{E}\{\varepsilon|w\} = \mathcal{E}\{x|w\} - f(w) = 0$ ($f$ is well defined, so its expectation is $f$ itself). The difference between $w_{i+1}$ and $w_0$ is
$$w_{i+1} - w_0 = w_i - w_0 - \alpha_i f(w_i) - \alpha_i\varepsilon_i$$
where $\varepsilon_i$ is the noise in $x(w_i)$. Taking the expectation of the square of the above expression:
$$\mathcal{E}\{(w_{i+1}-w_0)^2\} - \mathcal{E}\{(w_i-w_0)^2\} = \alpha_i^2 f^2(w_i) + \alpha_i^2\,\mathcal{E}\{\varepsilon_i^2\} - 2\alpha_i\,\mathcal{E}\{(w_i-w_0)f(w_i)\}$$
($\mathcal{E}\{f(w_i)\varepsilon_i\} = \mathcal{E}\{f(w_i)\}\,\mathcal{E}\{\varepsilon_i\}$, because $f$ and $\varepsilon$ are statistically independent, so $2\alpha_i\,\mathcal{E}\{f(w_i)\varepsilon_i\} = 2\alpha_i f(w_i)\,\mathcal{E}\{\varepsilon_i\} = 0$, the expectation of $\varepsilon$ being zero). By repeating the above procedure over $N$ steps and summing:
$$\mathcal{E}\{(w_{N+1}-w_0)^2\} - \mathcal{E}\{(w_1-w_0)^2\} = \sum_{i=1}^N\alpha_i^2\left[f^2(w_i) + \mathcal{E}\{\varepsilon_i^2\}\right] - 2\sum_{i=1}^N\alpha_i\,\mathcal{E}\{(w_i-w_0)f(w_i)\}$$
It is reasonable to assume that $w_1$ is chosen such that $f^2$ is bounded in the chosen vicinity of the searched root; so let $f^2(w_i) \leqslant F$ for all $i \in \{1, \ldots, N+1\}$ and let $\mathcal{E}\{\varepsilon_i^2\} \leqslant \sigma^2$ (it is assumed that $\sigma^2$ is bounded, see (B.35) and (B.37)). Then:
$$\mathcal{E}\{(w_{N+1}-w_0)^2\} - \mathcal{E}\{(w_1-w_0)^2\} \leqslant (F + \sigma^2)\sum_{i=1}^N\alpha_i^2 - 2\sum_{i=1}^N\alpha_i\,\mathcal{E}\{(w_i-w_0)f(w_i)\} \qquad (B.38)$$
$\mathcal{E}\{(w_{N+1}-w_0)^2\} \geqslant 0$, being the expectation of a positive quantity. As already discussed, $w_1$ is chosen such that $\mathcal{E}\{(w_1-w_0)^2\}$ is finite; then the left side of (B.38) is bounded below. The $(F+\sigma^2)\sum_{i=1}^N\alpha_i^2$ term is also finite, because of condition (B.36c); then, from (B.38):
$$0 \leqslant 2\sum_{i=1}^N\alpha_i\,\mathcal{E}\{(w_i-w_0)f(w_i)\} \leqslant (F+\sigma^2)\sum_{i=1}^N\alpha_i^2 + \mathcal{E}\{(w_1-w_0)^2\} \;\Rightarrow\; \sum_{i=1}^\infty\alpha_i\,\mathcal{E}\{(w_i-w_0)f(w_i)\} = \text{finite}$$
Because of the conditions put on the signs of $f$, $(w_i-w_0)f(w_i) > 0$ for all $w_i$, and its expectation is also positive. Eventually, because of condition (B.36b) and the above equation:
$$\lim_{i\to\infty}\mathcal{E}\{(w_i-w_0)f(w_i)\} = 0 \qquad\text{i.e.}\qquad \lim_{i\to\infty}w_i = w_0$$
and, because $f$ changes sign around $w_0$, this is a root of the regression function.

✍ Remarks:
➥ Condition (B.36a) ensures that the process of finding the root is convergent.
➥ Condition (B.36b) ensures that each iterative correction to the solution $w_i$ is large enough.
➥ Condition (B.36c) ensures that the accumulated noise (the difference between $x(w)$ and $f(w)$) does not break the convergence process.

The Robbins-Monro algorithm allows finding the roots of $f(w)$ without knowing the exact form of the regression function. A possible candidate for the $\alpha_i$ coefficients is $\alpha_i = \frac{1}{i^n}$, $n \in (1/2, 1]$.
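A minimal Scilab sketch of the iteration, with $\alpha_i = 1/i$ (the regression function $f(w) = w - 2$ and the noise level are made up for the example):

    // Sketch: Robbins-Monro iteration w_(i+1) = w_i - a_i x(w_i).
    w = 0;                                       // starting value w_1
    for i = 1:10000
        x = (w - 2) + rand(1, 1, "normal") * 0.5;   // noisy measurement of f(w)
        w = w - x / i;                           // a_i = 1/i satisfies (B.36a-c)
    end
    disp(w);                                     // approaches the root w_0 = 2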
Now let consider that x t; = x t , then consistency requires that mk t = mk t and thus (t ; 1) = (t). x t; and x t being identical should have the same contribution to mk t , i.e. their coecients in (B.41) should be equal: [1 ; (t) (t)] (t ; 1) = (t) ( ) ( 1) ( ( +1) ( ( 1) 1) 1) ( ) ( +1) (0) ( ( 1) 1) ( ) ( ) ( +1) ( ) ( +1) which leads directly to (B.40). Another procedure to build the code vectors is LVQ2.1. For each x the two closest code vectors are found. Then these two code vectors are updated if:  one is of the same class as x, let m= be this one,  the other is of di erent class, let m6= be that one and  x is near the middle between the two code vectors:   f k x ; m = k kx ; m6= k min kx ; m k ; kx ; m k > 11 ; + f 6= = where f ' 0:25. ✍ Remarks: ➥ The last rule from above may be geometrically interpreted as follows: let consider that the x is suciently close to the line connecting m= and m6=. See gure B.7 on the next page. In this case it is possible to make the approximations: kx ; m=k ' d= and kx ; m6=k ' d6= LVQ2.1 ❖ m , m6 , f = = 368 APPENDIX B. STATISTICAL SIDELINES x kx ; m6 k = m6 = kx ; m k = d6= m d= = Figure B.7: The geometrical interpretation of the LVQ2.1 rule. The x point is projected onto the line connecting m and m6 . = and there are 2 cases, one of them being  d ;f = d6= min d ; d = dd= > 11 + f 6= = 6= ) d = d6 + d < d 1 ;2 f = = = = ) d6 < d 11 ;+ ff ) d > 1;f = = = d 2 the other case giving a similar result: dd= > 1;2 f , i.e. in either case the projection of x is at least at a fraction (1 ; f )=2 of d from m= and m6=. The updating formula for LVQ2.1 is: m=(t+1) = m=(t) + (t)[x ; m=(t)] 6 m6 =(t+1) = m6=(t) ; (t)[x ; m6=(t)] While the LVQ2.1 updates the codebook less frequently than the previous procedures it tends to over adapt the code vectors. The LVQ3 was developed to prevent the over adaptation of LVQ2.1 method. This algorithm is similar to LVQ2.1 but if the two closest code vectors, let m0= and m00= be the ones, are of the same class then they are also updated according to the formula: m0=(t+1) = m0=(t) + " (t)[x ; m0=(t)] and m00 t =( +1) = m00=(t) + " (t)[x ; m00=(t)] where " is a tunable parameter, usually chosen in the interval [0:1; 0:5]. ✍ Remarks: ➥ In practice usually OLVQ1 is run rst (on the rapid changing code vectors part) and then LVQ1 and/or LVQ3 is used to make the ne adjustments. Bibliography [BB95] [Bis95] [BTW95] [CG87] [FS92] [Fuk90] [Koh88] [McC97] [Mos97] [Rip96] [Str81] James M. Bower and David Beeman. The Book of Genesis. Springer-Verlag, New York, 1995. ISBN 0{387{94019{7. Cristopher M. Bishop. Neural Networks for pattern recognition. Oxford University Press, New York, 1995. ISBN 0{19{853864{2. P.J. Braspenning, F. Thuijsman, and A.J.M.M. Weijters, Eds. Arti cial Neural Networks. Springer-Verlag, Berlin, 1995. ISBN 3{540{59488{4. G. A. Carpenter and S. Grossberg. Art2: self-organization of stable category recognition codes for analog input patterns. Applied Optics, 26(23):4919{ 4930, December 1987. ISSN 0003{6935. J. A. Freeman and D. M. Skapura. Neural Networks, Algorithms, Applications and Programming Techniques. Addison-Wesley, New York, 2nd edition, 1992. ISBN 0{201{51376{5. Keinosuke Fukunaga. Introduction to Statistical Pattern Recognition. Academic Press, San Diego, 2nd edition, 1990. ISBN 0{12{269851{7. Teuvo Kohonen. Self-Organization and Associative Memory. Springer-Verlag, New York, 2nd edition, 1988. ISBN 0{201{51376{5. Martin McCarthy. What is multi-threading? Linux Journal, {(34):31{40, February 1997. David Mosberger. 
Linux and the alpha: How to make your application y, part 2. Linux Journal, {(43):68{75, November 1997. Brian D. Ripley. Pattern Recognition and Neural Networks. Cambridge University Press, New York, 1996. ISBN 0{521{46086{7. Karl R. Stromberg. An Introduction to Classical Real Analysis. Wadsworth International Group, Belmont, California, 1981. ISBN 0{534{98012{0. 369 Index 2/3 rule, 85, 86, 94, 98, 101 certain event, 210 city-block metric, see error, city-block metric class, 111 classi cation, 116, 197 clique, 307 codebook, 366 complexity criteria, 273 conditional independence, 308 confusion matrix, 120 conjugate direction, 226, 228 conjugate directions, 226 contrast enhancement, 86, 108 function, see function, contrast enhancement counterpropagation network, see network, CPN course of dimensionality, 113, 237, 255 cross-validation, 273 curse of dimensionality, 241 cycle, 307 activation function, see function, activation adaline, see percepton adaptive backpropagation, see algorithm, backpropagation, adaptive Adaptive Resonance Theory, see ART algorithm ART1, 99 ART2, 107 backpropagation, 16, 162, 320 adaptive, 21, 23, 321 batch, 163 momentum, 20, 23, 24, 321 quick, 224 standard, 13, 19 SuperSAB, 23, 321 vanilla, see standard BAM, 49, 53, 322 branch and bound, 242 CPN, 81 EM, 181, 240, 357 expectation{maximization, see EM gradient descent, 218 Levenberg-Marquardt, 234 line search, 225 Metropolis, 296 model trust region, 235 optimal brain damage, 266 optimal brain surgeon, 267 Robbins{Monro, 124, 181, 219 sequential search, 243 simulated annealing, 296 SOM, 40, 321 ANN, iii, 3 output, 112, 190 asymmetric divergence, 125, 211 DAG, 307, 313 decision boundary, 116, 117, 119 region, 116 tree, 301 splitting, 301 delta rule, 12, 142, 143, 145, 146, 162, 163, 219 deviance, 127 dimensionality reduction, 237, 245 discrete time approximation, 219 distance Euclidean, 325, 350 Hamming, 325 Kullback{Leiber, see asymetric divergence Mahalanobis, 241, 348 distribution, 191 Bernoulli, 136 conditional, 194 Gaussian, 191, 209, 261, 274, 345 multidimensional, 346 unidimensional, 345 Laplacian, 192 BAM, 48 Bayes rule, 116, 119, 354 theorem, 288, 342 Bayesian learning, 361 bias, 18, 123, 129, 160, 188, 254 average, 254 bias-variance tradeo , 255 Bidirectional Associative Memory, see BAM bit, 211 Boltzmann constant, 208 edge, 307 eigenvalue, 36, 327, 328 spectrum, 327 eigenvector, 36, 327, 347 encoding one-of-k, 68, 73, 81, 108, 197, 203, 205, 212, 239 371 372 INDEX entropy, 208, 210, 302 cross-entropy, 202, 203 di erential, 209 equation Riccati, 33 simple, 32 trivial, 32 error, 163, 176, 186, 212, 215, 218 absolute, 203 bar, 191 city-block metric, 192 cross-entropy, 213 function, 114, 185 gradient, 12{15, 17 Minkowski, 192, 213 quadratic, 232 relative, 203 RMS, 115, 187 root-mean-square, see error, RMS sum-of-squares, 11, 12, 15, 17, 19, 114, 137, 143, 149, 163, 167, 171, 176, 177, 182, 190, 203, 213, 253, 268, 274, 279{281 surface, 12, 20, 24, 26, 215, 217, 256 Euler-Lagrange equation, 176 evidence, 288 approximation, 287 exception, 115 exemplars, 47 expectation, 87, 90, 343 feature extraction, 114, 237 feedback, 31 auto, 54 indirect, 30, 91 lateral, 41, 91 feedforward network, see network, feedforward Fisher criterion, 148, 152, 199 Fisher discriminant, 149, 265 at spot elimination, 20 Fletcher-Reeves formula, 230 function  -Dirac, 125, 176, 197, 204, 353 activation, 3{6, 40, 57, 90, 92, 134, 198, 217 exponential-distribution, 7 hyperbolic-tangent, 6 identity, 48, 175, 199 logistic, 5, 6, 10, 15, 19, 135, 